They say a picture’s worth 1000 words. Does that mean an AI video is worth a whole novel?
Recently, the Future of Marketing Institute has been exploring the use of text-to-video generation for marketing. It's no secret the field is exploding, and anyone can now generate videos quite easily from a text prompt.
This was hard to imagine until recently, before we saw the capabilities of Sora in February of this year.
With the recent launches of Runway’s Gen-3, Luma’s Dream Machine, and Kuaishou’s Kling, we’re witnessing celluloid magic come into the hands of the average user.
FMI Puts Text Rendering in AI Video to the Test
FMI recently invited Tianyu Xu, a well-known AI consultant, speaker and author, to work on an experiment with us. We wanted to explore how well the most advanced models display text and how difficult it is to create practical content for social media.
Tianyu used Runway Gen-3 to run the test, and the results are encouraging.
He went into the experiment expecting that shorter text would be easier to render, and the results largely bore this out. In general, he found shorter text is easier to display, especially when it consists of common words that appear frequently in the AI model's training data.
The more common the word, the more likely it is to render with minimal spelling errors. For example, the words "AI" and "ChatGPT" were displayed easily, as seen in the two examples below.
On the other hand, Tianyu found that uncommon words like “FMI” have a lower success rate. Runway Gen-3 managed to get “FMI” correct less than 50% of the time.
If you look closely at video 3, you will also see that while the letters "FMI" are clear, there is distorted text underneath them.
The length of the text also matters. While it's relatively easy to render "future," "marketing," and "institute" individually, rendering "future of marketing institute" in a single video is extremely challenging for the model.
Let’s Look at the Math
Based on the results below, the probability of the full phrase rendering correctly works out to roughly P = 0.0135. Since the number of tries until the first success follows a geometric distribution with mean 1/P, you could expect about 1/0.0135 ≈ 74 attempts before every word comes out correct. That's a lot of effort to show a single phrase!
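If you want to play with those numbers yourself, here is a minimal Python sketch of the calculation. The per-word rates in it are hypothetical placeholders (the experiment reports only the combined figure); the point is simply that attempts-until-success follow a geometric distribution with mean 1/P.

```python
# Back-of-the-envelope math for phrase rendering, assuming each generation
# attempt is independent: attempts until the first fully correct render
# follow a geometric distribution, whose mean is 1 / P(success).

def expected_attempts(p_success: float) -> float:
    """Expected number of generation attempts until one renders correctly."""
    if not 0.0 < p_success <= 1.0:
        raise ValueError("p_success must be in (0, 1]")
    return 1.0 / p_success

# Phrase-level success rate reported in the experiment.
p_phrase = 0.0135
print(f"Expected attempts: {expected_attempts(p_phrase):.0f}")  # ≈ 74

# If word-level successes were independent, the phrase-level rate would be
# the product of the word-level rates. These four rates are hypothetical
# placeholders chosen so that their product matches 0.0135.
word_rates = [0.60, 0.50, 0.50, 0.09]
p_product = 1.0
for rate in word_rates:
    p_product *= rate
print(f"Product of word rates: {p_product:.4f}")  # 0.0135
```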
While it's still possible to create longer phrases, as seen in the last video, is it worth the effort? Tianyu reports that an easier route may be post-production editing, or stitching together a series of images that follow a storyboard and then using Image-to-Video or frame interpolation in Kling or Luma.
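As a rough illustration of the stitching step, here is a minimal sketch using the open-source moviepy library (1.x import style). The frame file names, durations, and output name are hypothetical; the image-to-video or interpolation pass would still happen in a tool like Kling or Luma.

```python
# Minimal storyboard-stitching sketch (pip install moviepy; assumes moviepy 1.x).
from moviepy.editor import ImageSequenceClip

# Storyboard frames in narrative order -- hypothetical file names.
# Each image could come from an image model where text renders reliably,
# and all frames should share the same dimensions.
frames = ["frame_01.png", "frame_02.png", "frame_03.png"]

# Hold each frame for two seconds so any on-screen text stays legible.
clip = ImageSequenceClip(frames, durations=[2.0, 2.0, 2.0])

# Encode a 24 fps draft; Kling or Luma can then interpolate between frames.
clip.write_videofile("storyboard_draft.mp4", fps=24)
```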
Types of Content That Current Software Renders Well
In addition to experiments with text inputs, Tianyu also investigated what types of scenes are easiest to create with text-to-video.
Here is his list of the easiest types of shots to create using current software.
- Close-up shots
- Common subjects in daily life
- Cinematic scenes and realistic styles
- Handheld camera movements for social media
- Regular camera movements in filmmaking
- Regular stock videos for presentations
- Human faces and pets (one at a time)
- Wild animals (one at a time)
- Aerial/drone shots
Final Thoughts
We've moved from impossible to possible in text-to-video generation. Now, we need to balance effort and quality, adopting a flexible approach. By focusing on areas with higher chances of success, marketers can more easily put text-to-video to practical use.
Our thanks to Tianyu for working with FMI on this text-to-video research.
This post was written by FMI Executive Director David Rice and Tianyu Xu.
Connect with FMI
Want to stay ahead of the marketing wave and prepare yourself for the future? Connect with us on our Website, LinkedIn, Instagram or Twitter/X.
If you enjoyed this newsletter, please share it and subscribe to receive it directly.