When Will AI Video Produce a 'Midjourney'?

    baoshi.rao wrote:

    AI video has become the 'rising star,' with major tech companies and startups fiercely competing.

    In December last year, the emergence of Pika seemed to ignite the AI video race, with nearly ten companies appearing within a month. Giants like Google, Alibaba, ByteDance, and Tencent have entered the fray, continuously pushing the competition to new heights.

    The 'Midjourney V5 moment' for AI video is coming, marking the critical juncture at which it becomes a productivity tool. From 2022 to 2023, text-to-image technology evolved at a visibly rapid pace. Midjourney, for instance, released a new version roughly every three months, progressing from V1 to V6 and achieving a monumental transformation from "barely recognizable" to "exquisitely lifelike." This month-by-month evolution of text-to-image technology acts like a relentless hammer, constantly reminding AI video companies that their window for growth is narrowing.

    (Image: A comparison of V1-V6 outputs created by netizens, sourced from X)

    Now, the trajectory of AI video development is gradually aligning with that of text-to-image generation. The "Midjourney V5" moment has become a critical tipping point: once this threshold is crossed, users will flock in droves, the data flywheel will start spinning, and improvements will accelerate daily, transforming text-to-video from a mere "toy" into a true "productivity tool." The progression from text to images and videos follows a natural continuum, with the evolution of text-to-image generation offering glimpses into the future of AI video.

    When AI video becomes a production tool, the industry chain can finally start to run. Only when it is practically usable can a target user base emerge; only by retaining users and generating sustained payments can a clear business model be established; and only when the business model is validated can companies in the ecosystem survive, driving supply through consumption and revitalizing the entire AI video industry.

    "The productivity of the AI video industry"—this is precisely the core value that current players are competing for. DreamWorks co-founder Jeffrey Katzenberg recently predicted that "generative AI will reduce the cost of animated films by 90% within the next three years, fundamentally disrupting the media and entertainment industry."

    "We may soon achieve real-time generation of high-resolution content at 30 frames per second, and by 2030, it might be possible to generate entire video games," said Midjourney CEO David Holz.

    The V5 inflection point is approaching, and with it a new phase of competition. When, then, will AI video's own Midjourney emerge? In reality, AI video technology entered public awareness almost simultaneously with text-to-image generation.

    In early 2023, while Midjourney popularized text-to-image generation, Runway sparked boundless imagination about 'everyone creating blockbuster movies'.

    At that time, seeing the remarkable achievements in the text-to-image field, Runway's founder stated: 'We hope Gen-1 can do for video what Stable Diffusion did for images. We've witnessed the explosion of image generation models, and I believe 2023 will be the year of video.' But clearly this conclusion was drawn a bit too early. In February, Runway released its video editor Gen-1, which functions like an AI version of Photoshop, allowing video style transformation and modification through text input. In March, it launched the text-to-video model Gen-2, supporting video generation from text and from text plus image inputs.

    The promotional videos were impressive, but the actual performance left much to be desired, with issues such as short duration, unstable generated visuals, misinterpreted instructions, lack of audio, incoherent and illogical movements, among others.

    After Runway fired the first shot in AI video, it continued to move forward but increasingly focused on video editing tools. Features like motion brushes, text-to-speech, and video compositing were merely 'icing on the cake.' The lack of fundamental breakthroughs in Gen-2 also left the AI video field quiet for some time. Just as people were beginning to lose patience with AI videos, December last year saw the arrival of Pika, Genmo, Moonvalley, NeverEnds, Google's VideoPoet, Alibaba's Animate Anyone, and ByteDance's Magic Animate, all bringing a ray of hope.

    In Pika's official promotional video, just a single sentence was enough to generate an animated version of Elon Musk. Not only did it capture his likeness perfectly, but the background and movements were also remarkably coherent, with astonishingly consistent facial details.

    (Image: GIF from Pika 1.0 promotional video, sourced from X)

    In its first official demo video, the generation quality almost reaches the polish of animation studios like Disney.

    According to feedback from users who have tried Pika 1.0, the product supports three video generation methods: text-to-video, image-to-video, and video-to-video. Both 3D and 2D effects have reached a new level, with realism, stability, and lighting effects significantly outperforming Gen-2.

    "Pika 1.0 and Gen-2 seem like products from different eras," many users commented after trying it. The popularity of Pika and similar technologies can be attributed to the maturity of underlying infrastructure technologies. The most important among these is AnimateDiff. This is an animation framework built on the Stable Diffusion text-to-image model, which allows generated images to move directly. Companies like ByteDance, Tencent, and Alibaba have launched their own AI video models based on this framework.

    Of course, beyond the widespread adoption of AnimateDiff, this progress is also closely tied to the development of multimodal large models.

    The emergence of Pika and similar technologies has opened a new chapter in AI video, which is about to usher in its "Midjourney V5" moment. There are two important layers of change here; the first is at the level of generation itself.

    In the V5 phase, generation improves on three fronts:

      • Better results: coherence in actions, expressions, and narrative logic across the seconds of footage generated.
      • More effective control: improved understanding of and compliance with input instructions, along with finer control over camera angles, transitions, and style transformations.
      • Lower resource consumption: higher-resolution, higher-quality videos generated in less time with less compute, where a few seconds of video can achieve the effect of dozens of seconds.

    The more significant impact lies in the breakthrough in productivity.

    Taking Midjourney as an example, at the V5 stage, it has become a design tool for UI designers, an assistant for game concept artists, and a material library for cross-border e-commerce product displays and advertising marketing. Similarly, at this stage, AI video could potentially generate advertisements, short videos, movies, and games, becoming a productivity tool that can replace directors, actors, and designers.

    AI video is like a blockbuster movie—whether it sells well or receives acclaim depends on two critical elements: the script and the special effects. The script corresponds to the 'logic' in the AI video generation process, while the special effects correspond to the 'visuals.' To achieve 'logic' and 'effects,' the AI video industry has diverged into two technical paths: diffusion models and large models.

    (Image: Guangzhui Intelligence)

    After AIGC gained popularity, diffusion models have long dominated the field of image generation. This is largely due to Stability AI's continuous open-source efforts, which have not only attracted more developers to refine the models but also elevated diffusion models to the 'throne' of text-to-image generation. Today, AI video generation is deeply marked by the influence of diffusion models. Major tech companies and startups alike have referenced diffusion model concepts in interviews and research papers. Emerging companies like Pika have capitalized on the strengths of diffusion models to create their own innovative solutions, while industry leaders such as NVIDIA, Alibaba, ByteDance, and Tencent have further enhanced model capabilities based on this foundation.

    The technical approach based on large models has undergone a transformation. In the early days, the primary method for AI video generation was to use the same training techniques as large language models: relying on massive parameters and datasets to build text-to-video models from scratch. CogVideo, launched in 2022, represents this early approach.

    However, as large models evolved from single-modality text processing to multimodal capabilities, video generation has become another function growing from the roots of large models, much as text and image generation did before it. From early on, companies like Google and Microsoft experimented with using Transformer-based methods from large models to train and enhance existing diffusion models. But it wasn't until Google released its multimodal model Gemini and the VideoPoet video model that the path of video generation through large models finally showed promising results.

    (Video: Google VideoPoet video generation demo)

    There's no absolute superiority between these two technical approaches, but they have different focuses. The core of diffusion models lies in 'restoration and presentation', emphasizing effects; while large models focus on 'receiving and understanding', prioritizing logic.

    It's precisely these characteristics that give AI video companies on the diffusion-model path stronger advantages in detail depiction and generation quality, while those following the multimodal large model path perform better in coherence and generation rationality. Pika's co-founder and CTO Chenlin Meng believes the advantages of the two approaches can be leveraged simultaneously to build video models. For instance, large language models like GPT can capture contextual information, which is essential for maintaining consistency across video frames through contextual control; meanwhile, each frame remains an individual image whose generation quality a diffusion model can enhance.
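    To make that division of labor concrete, here is a purely illustrative PyTorch toy (the class names, shapes, and update rule are hypothetical, not Pika's or anyone's actual architecture): a context module hands every frame a shared, per-frame conditioning vector (the "logic"), while a separate denoiser refines each frame individually (the "visuals").

    ```python
    # Hypothetical toy, for intuition only: shared context + per-frame denoising.
    import torch
    import torch.nn as nn

    class ContextEncoder(nn.Module):
        """Stand-in for the 'logic' side (e.g. a language model): turns one
        prompt embedding into a conditioning vector for every frame."""
        def __init__(self, num_frames=8, dim=64):
            super().__init__()
            self.frame_embed = nn.Parameter(torch.randn(num_frames, dim))

        def forward(self, prompt_embedding):
            # Every frame sees the same prompt plus its own positional embedding,
            # which is what ties consecutive frames together.
            return prompt_embedding.unsqueeze(0) + self.frame_embed

    class FrameDenoiser(nn.Module):
        """Stand-in for the 'visuals' side: a per-frame diffusion-style denoiser."""
        def __init__(self, dim=64):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(dim * 2, 128), nn.ReLU(), nn.Linear(128, dim))

        def forward(self, noisy_frame, cond):
            return self.net(torch.cat([noisy_frame, cond], dim=-1))

    num_frames, dim = 8, 64
    prompt_embedding = torch.randn(dim)                 # pretend text encoding of the prompt
    context = ContextEncoder(num_frames, dim)(prompt_embedding)

    denoiser = FrameDenoiser(dim)
    frames = torch.randn(num_frames, dim)               # start every frame from noise
    with torch.no_grad():
        for _ in range(10):                             # crude iterative refinement loop
            frames = frames - 0.1 * denoiser(frames, context)

    print(frames.shape)  # torch.Size([8, 64]): one latent per frame, all sharing context
    ```

    In a real system the refinement loop would be a proper diffusion sampler and the context would come from a trained language or multimodal model; the only point of the toy is that logic and visuals can be supplied by different components.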

    This perspective from Pika isn't isolated, as the industry increasingly reflects this trend. The reason lies in the current reality: although companies like Pika and Runway generate buzz with each upgrade showcasing impressive results, there remains a significant gap before these technologies can be practically applied in real-world scenarios such as advertising, film production, and marketing.

    Jim Fan, Senior Research Scientist at NVIDIA and Head of AI Agents, argues that current video generation merely produces "unconscious, localized pixel movements", lacking coherent temporal, spatial, and behavioral logic to govern the generation process. There's a great example that illustrates the current state of AI video development. On X, a user named Ben Nash conducted a test using the same English prompt "Will Smith eating spaghetti" to evaluate the video generation capabilities of Runway and Pika. The results showed that while both videos could roughly achieve the intended effect, they exhibited comical scenes such as "spaghetti flowing backwards" and "noodles being directly sucked into the mouth."

    (Image: Runway generation result)

    (Image: Pika generation result)

    Jim Fan stated: "By 2024, we will see video generation with high resolution and long-term coherence. But this will require more 'thinking,' specifically System 2 reasoning and long-term planning (as opposed to System 1, which is responsible for unconscious sensorimotor control)."

    Recently, Runway also announced a new long-term research project called "General World Models" on its official website, explaining: "We believe the next major advancement in AI will come from systems that understand the visual world and its dynamics."

    Logic, thinking, and reasoning may become the keywords for AI video in 2024, and the integration of these two technical approaches will become commonplace.

    Once AI video becomes a productivity tool, the commercialization dilemma it currently faces will be much easier to resolve.

    Productivity tools have two directions: the upward path of professionalization and the downward path of mass inclusivity. However, at this stage, most of the AI video industry still presents itself to users in the form of video editing tools.

    "Tools as products" is quite common in the text-to-image and AI video sectors. The approach most companies take is to initially open for small-scale testing on Discord, then proceed to official public release, and finally launch a website. "Tools" imply high professionalism, high barriers to entry, complex operations, and difficulty in getting started, which creates a gap with "products" that are easy to use, convenient to operate, and highly experiential.

    A typical example: achieving good results in Premiere Pro requires investing time and money to learn the function and usage of each tool. In contrast, posting a video on Douyin (TikTok) takes only three steps: tap the plus sign, shoot the video, and publish. This simplicity covers everyone from kindergarten children to people in their sixties, highlighting the most obvious difference between tools and products.

    On the eve of a breakthrough in productivity, the concept of tools as products may persist for some time. However, the next question facing AI video companies is clear: Should they continue on the path of professional tools, or lower the barriers to create the next AI-powered "Douyin" for video? On this issue, Pika has already made its choice. Its founder Guo Wenjing stated in an interview: "We are not developing film production tools, but products for everyday consumers—we are creative, but not professionals."

    In terms of commercialization, Guo Wenjing mentioned that Pika may eventually introduce a tiered subscription model, allowing ordinary paying users to access more features, an approach intended to differentiate Pika from its competitors.

    AI video tools that lack real productivity cannot retain users long-term or generate continuous payments, and thus cannot form a healthy business model. The current reality is that users subscribe out of curiosity, to try the tool for free, or with a trial mentality for a month; once the subscription ends, the tool is quickly forgotten. The blow to startups is significant. Without sustainable revenue and the ability to fund themselves, they must rely on financing, and the day the financing stops is the day the company can no longer sustain itself. Looking at the entire AI video industry, if individual players cannot survive, how can anyone talk about the industry's future prospects?

    If an industry has only a single tool, lacks diverse application scenarios, and cannot form a complete ecosystem, it will struggle to thrive. For example, users now briefly engage with AI video tools before diverting most of their traffic to social platforms.

    (Image: sourced from X)

    For instance, bizarre videos like Musk dancing and the Mona Lisa running flooded TikTok overnight. Videos generated by tools like Runway and Pika gained massive popularity through shares on X, TikTok, and YouTube, with some creators even monetizing the traffic. Yet the tool providers themselves are reduced to mere feeders for the social platforms.

    As for breaking the barrier between tools and usage scenarios, China's Douyin has already begun such integration and offers a reference case.

    Upon release, CapCut's AI features were immediately synchronized with Douyin, sparking an "AI Outpainting Showcase" trend. The challenge topic "AI Outpainting Beyond Your Imagination" garnered over 200 million views. From imperial concubines playing basketball to Disney characters turning into donkeys and fur-clad beauties turning into werewolves, whether AI brings surprise or shock, it has ignited widespread debate.

    Once AI video becomes a productivity tool, purchasing power will emerge at the consumer end of the industry chain, driving the evolution of the supply side through demand. Only then can AI video truly 'come alive.'
