GPT-4 AI Text to Video Demo

Large Language Models (LLMs) have rocked the world since the start of 2023, with OpenAI’s ChatGPT claiming the title for the world’s fastest-growing user base, clocking up 100 million active users in just two months.

Not wanting to miss out on this new technology, and to explore the possibilities, we have plugged into OpenAI’s GPT-4 API to create a text-to-video maker.

You can check out the demo here:

The demo was put together in a day or two by our co-founder @dazzatron, with the help of ChatGPT no less.

He can give you some insights into how it was built, but we’d love to know what you think - do you want to build a service like this? Or should we integrate this into our Studio video editor to help you quickly create templates?

Let us know in the thread where you see this going or what you’d like to see.

Looks neat, but having more dynamic templates would be even better!

This is really impressive. Would be nice if this functionality of generating a video based on a statement could be applied to an S3 bucket of user-provided footage rather than stock video.

Yeah, this is pretty cool. I built this with Shortcuts using stock or user-provided videos or images, overlaying the text provided by GPT-3.5 or 4. Looking for ways to clean it up though. I see GPT is familiar with the Shotstack API, and if we can get it to dynamically output the JSON for the API call it would work even better.

On my current setup I just ask it for the text I will overlay over the video, and then loop through creating each section as needed. However, once I figure out how to have GPT-4 output the JSON based on my inputs, the process will be a lot cleaner.

There should be a button on the playground that shows the code but it’s missing. I’ll get that back in asap.

We’ll look to share the code for this asap but as it’s an MVP I need to clean it up first.

For everyone’s benefit this is how we’re engineering the prompts:

const coreConcepts = await readAsset('src/text-to-video/assets/docs/core-concepts.txt');
const transitionConcepts = await readAsset('src/text-to-video/assets/docs/effects-transitions.txt');
const fontsHtmlConcepts = await readAsset('src/text-to-video/assets/docs/fonts-html.txt');
const mergeConcepts = await readAsset('src/text-to-video/assets/docs/merge.txt');
const hello = { name: 'hello world', data: JSON.parse(await readAsset('src/text-to-video/assets/json/hello.json')) };
const propertyTour = { name: 'property tour', data: JSON.parse(await readAsset('src/text-to-video/assets/json/property-tour.json')) };

const buildMasterPrompt = (concepts, examples, prompt) => {
  const basePrompt = [
    { role: 'system', content: `You are an AI specifically designed to provide Shotstack JSON output. Absolutely nothing else. No explanations or messages, just plain JSON. You make sure to follow the following instructions: ${concepts.join(' ')}` },
    { role: 'system', content: 'Make sure all of the videos you create are properly styled, with assets positioned using best practice design principles. Use examples from video layouts that are industry best practice. Never use a title asset.' },
    { role: 'system', content: 'Add the Shotstack JSON to the "json" property of an object. Also add a "style" object to that same object which includes a fonts and video array. The video array contains objects which each include a "keywords" property which has a string of keywords I can use to query the Pexels API for videos (every keyword should be unique but relevant to the broader context of the prompt), in addition to a "mergeField" property which references the "src" in the Shotstack JSON. The fonts array includes objects with the name of the font used which I can use to query the Google Fonts API.' },
    { role: 'system', content: 'If you require a logo in your video use "" This image is 1000px by 1000px.' },
  ];

  // Add each worked example as a user/assistant exchange (few-shot prompting)
  examples.forEach((example) => {
    basePrompt.push(
      { role: 'user', content: `Create a video for a ${example.name} video` },
      { role: 'assistant', content: JSON.stringify(example.data) },
    );
  });

  basePrompt.push({ role: 'user', content: prompt });
  return basePrompt;
};

This can likely be done a lot more efficiently, but for this demo it appears to be working remarkably well.

This should definitely be possible. We just used a stock API but there’s no reason you couldn’t replace that with your own store of data.

We hear you. Let us know what you’re looking for as we’re keen to get more built.

Good! With regard to how clips are selected, does it require that metadata be manually added to my media, or is there a method you are using where clips are selected using some AI technology like object detection? I would love to give this a try so any info you can provide (or even better - a new Shotstack service!) is appreciated. Thanks in advance.

We’re just relying on Pexels to do the tagging for us, so when we send the keywords we get back approximately what we’re after. If you want to use an S3 bucket you may need some tagging mechanism, or some object detection AI to generate those tags. You could also use a computer vision service like Clarifai.
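For reference, resolving a keyword query against Pexels could look roughly like this. The search URL follows Pexels’ documented video search endpoint; the file-selection logic is an assumption, and in a real pipeline the returned link would be swapped into the Shotstack merge field:

```javascript
// Build a Pexels video search URL for a keyword query. A real request
// needs an Authorization header carrying your Pexels API key.
const pexelsSearchUrl = (keywords, perPage = 1) =>
  `https://api.pexels.com/videos/search?query=${encodeURIComponent(keywords)}&per_page=${perPage}`;

// Pick a playable file from a Pexels search result, preferring HD.
// The response shape follows Pexels' documented format.
const pickVideoFile = (searchResult) => {
  const files = (searchResult.videos[0] || {}).video_files || [];
  const hd = files.find((f) => f.quality === 'hd');
  return (hd || files[0] || {}).link;
};
```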