From Text-to-Image to Text-to-Video: The AI Industry is Going Wild

    baoshi.rao
    #1

    Not long ago, a video of Musk in a spacesuit instantly transforming into an anime character went viral online. What is astonishing is that the animated version of Musk not only captures his likeness perfectly but also features coherent, logical backgrounds and movements. All of this was created with a video generation tool called 'Pika,' built by a Chinese Ph.D. student at Stanford. Simply entering keywords such as 'Musk in a spacesuit, 3D animation' into the tool's input box produces a cartoon version of Musk in a spacesuit on screen.

    In recent years, as AIGC (AI-generated content) accelerates its application across various industries, the industry's focus has gradually shifted from text-to-text and text-to-image to text-to-video. In fact, the transition from text to video is a major trend in AIGC, and many industry professionals have already sensed the market's direction. Major tech companies in China, such as ByteDance, Alibaba, and Baidu, have already entered the race.

    According to reports, ByteDance launched its text-to-video model PixelDance on November 18. Alibaba quickly followed suit by releasing its Animate Anyone model. Meanwhile, Baidu's ERNIE model is currently testing similar functionality, which will soon be available as a plugin. Clearly, the integration of AI technology with text-to-video generation has sparked a new wave of enthusiasm in the industry. The reasons behind Chinese tech companies' proactive moves in this field are self-evident.

    First, text-to-video applications are widely applicable and hold immense market potential. Although the short-video market is still growing rapidly, production capacity cannot keep up with the explosive demand across platforms. As text-to-video technology matures and spreads, it may bring new dynamics to the currently popular short-video market. Industries such as film, television, and gaming are key scenarios for text-to-video: the technology lets users edit and generate desired storylines through text alone, assisting creativity while cutting costs and improving efficiency. With its unique advantage of empowering content generation, the prospects for text-to-video are undoubtedly promising.

    Secondly, text-to-video is very convenient to operate and can effectively reduce a range of costs. Personalized video production is notoriously troublesome and expensive, so a simple video generation tool has long been something many industries and enterprises want. The breakthrough in AI text-to-video technology provides a brand-new solution to this problem. As the name suggests, text-to-video requires no video production skills: simple text is enough to generate the desired video material. Moreover, the output can be continuously updated based on the input scenarios and keywords, greatly lowering the threshold and cost of video production. It is a real boon for creators in the digital age.

    Finally, text-to-video product features are striking, which further enhances corporate competitiveness. In the current AI landscape, text-to-image applications are already widespread, whereas players who have fully 'conquered' the text-to-video space are few and far between. After all, text-to-video products are more powerful, and the technical difficulty is correspondingly higher. However, high difficulty often comes with high value. If a company can lead this field through strengths such as powerful computing capabilities, cross-domain collaboration, and technological autonomy, its differentiation within the industry will quickly become evident.

    Text-to-video, as an emerging media form, is influencing our daily lives in unprecedented ways. Currently, it is being applied in fields such as corporate promotion, digital humans, science communication, and online social interactions. To enhance the fluency and realism of video generation, domestic players like ByteDance, Alibaba, and Baidu have invested significant efforts in various aspects.

    On one hand, various players have amassed extensive datasets to enhance the diversity of video generation effects. Text-to-video models typically require large volumes of data to learn caption correlations, the realism of frame images, and dynamic temporal information. Without high-quality paired datasets, it becomes difficult to reasonably combine characters or structure scenes, significantly compromising the rationality and coherence of generated videos. To improve the diversity of generation effects, Alibaba tasked its researchers with collecting approximately 35 million text-video pairs and 6 billion text-image pairs to optimize the model, achieving the desired video generation outcomes.
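    To make the idea of a paired dataset concrete, here is a minimal, hypothetical PyTorch-style sketch of a text-video pair loader. The file layout, field names, and tensor shapes are assumptions for illustration only, not any company's actual data pipeline.

    ```python
    import json
    from pathlib import Path

    import torch
    from torch.utils.data import Dataset

    class TextVideoPairs(Dataset):
        """Minimal sketch of a text-video paired dataset.

        Assumes a metadata.json whose entries look like
        {"caption": "...", "frames_path": "clip_0001.pt"}, where each
        frames_path points to a pre-extracted tensor of shape
        (num_frames, 3, H, W). Purely illustrative.
        """

        def __init__(self, root: str, num_frames: int = 16):
            self.root = Path(root)
            self.num_frames = num_frames
            self.items = json.loads((self.root / "metadata.json").read_text())

        def __len__(self) -> int:
            return len(self.items)

        def __getitem__(self, idx: int):
            item = self.items[idx]
            frames = torch.load(self.root / item["frames_path"])  # (T, 3, H, W)
            return item["caption"], frames[: self.num_frames]     # fixed clip length
    ```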

    On the other hand, players have designed hierarchical encoders to enhance semantic consistency in text-to-video generation. Generating high-quality videos from simple text requires a text-to-video product to accurately predict the textual intent while preserving the input text's content and structure so that precise motion can be produced. To achieve this, Alibaba's researchers designed two hierarchical encoders: a fixed CLIP encoder and a learnable content encoder, which extract high-level semantics and low-level details respectively. Their outputs are then integrated into the video diffusion model to better ensure semantic coherence in low-resolution video generation.
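    As a rough picture of the two-encoder design, the sketch below pairs a frozen CLIP text encoder (via Hugging Face transformers) with a small learnable content encoder and concatenates their outputs into a single conditioning sequence for a video diffusion model. The fusion scheme, layer sizes, and class names are assumptions for illustration and may differ from the published architecture.

    ```python
    import torch
    import torch.nn as nn
    from transformers import CLIPTokenizer, CLIPTextModel

    class HierarchicalTextConditioner(nn.Module):
        """Sketch: a frozen CLIP encoder for high-level semantics plus a small
        learnable encoder for low-level detail, fused into one conditioning
        sequence that a video diffusion model could attend to."""

        def __init__(self, clip_name: str = "openai/clip-vit-base-patch32", d_model: int = 512):
            super().__init__()
            self.tokenizer = CLIPTokenizer.from_pretrained(clip_name)
            self.clip = CLIPTextModel.from_pretrained(clip_name)
            for p in self.clip.parameters():  # keep the CLIP branch fixed
                p.requires_grad = False
            # Learnable content branch: re-embeds the same tokens and refines them.
            self.content_embed = nn.Embedding(self.tokenizer.vocab_size, d_model)
            self.content_encoder = nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True),
                num_layers=2,
            )

        def forward(self, prompts):
            tokens = self.tokenizer(prompts, padding=True, return_tensors="pt")
            high = self.clip(**tokens).last_hidden_state                         # high-level semantics
            low = self.content_encoder(self.content_embed(tokens["input_ids"]))  # low-level details
            return torch.cat([high, low], dim=-1)  # conditioning passed to the diffusion model

    cond = HierarchicalTextConditioner()(["Musk in a spacesuit, 3D animation"])
    print(cond.shape)  # (batch, sequence length, fused feature width)
    ```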

    In addition, various players have increased video resolution to ensure high-quality generation results. The ideal outcome for text-to-video generation is that a user provides a prompt and the system automatically generates a video in the corresponding style, which places heavy demands on video resolution. Alibaba's text-to-video solution has raised the output resolution to 1280×720 and optimizes the initial 600 denoising steps to improve detail and reduce artifacts and noise in the generated videos. ByteDance's approach additionally introduces a generation method that combines text guidance with first- and last-frame image guidance, enhancing the dynamic quality of the generated videos.
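    The first- and last-frame guidance mentioned above can be pictured as extra conditioning channels concatenated to the noisy video latent at every denoising step. The sketch below shows one plausible way to wire this up; the tensor shapes and the generic denoiser interface are assumptions, not ByteDance's published implementation.

    ```python
    import torch
    import torch.nn as nn

    def guided_denoise_step(
        denoiser: nn.Module,           # any video denoiser: (x, t, text_cond) -> noise estimate
        noisy_latents: torch.Tensor,   # (B, T, C, H, W) noisy video latents at timestep t
        first_frame: torch.Tensor,     # (B, C, H, W) latent of the guiding first frame
        last_frame: torch.Tensor,      # (B, C, H, W) latent of the guiding last frame
        t: torch.Tensor,               # (B,) current diffusion timestep
        text_cond: torch.Tensor,       # text conditioning from the prompt encoder
    ) -> torch.Tensor:
        """One denoising step with text plus first/last-frame image guidance (sketch).

        The guide frames are broadcast over time and concatenated to the latent
        channels so the denoiser sees them at every frame position.
        """
        B, T, C, H, W = noisy_latents.shape
        first = first_frame.unsqueeze(1).expand(B, T, C, H, W)
        last = last_frame.unsqueeze(1).expand(B, T, C, H, W)
        x = torch.cat([noisy_latents, first, last], dim=2)  # (B, T, 3C, H, W)
        return denoiser(x, t, text_cond)                    # predicted noise for this step
    ```

    In a full sampler this step would be repeated across the scheduled denoising timesteps, with the guide frames held fixed while the video latent is progressively refined.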

    With the rapid development of artificial intelligence and video technology, the AIGC industry is shifting toward AI video. The explosive growth period for AI text-to-video generation may be imminent, and the number of players involved in AI video creation is expected to increase. Even so, whether it is ByteDance and Alibaba, which have already launched models, or Baidu, which is preparing to release a plugin, the players entering the text-to-video arena bring capabilities that should not be overlooked.

    Firstly, participating players have ample computing power reserves, enabling them to overcome the technical shortcomings of text-to-video generation. As an upgrade from text-to-text and text-to-image generation, text-to-video places higher demands on computing power and model engineering capabilities. AI models for text-to-video generation are reported to have between 1 billion and 10 billion parameters. Among the leading players in China's text-to-video field, ByteDance, Alibaba, and Baidu have all accumulated substantial experience with models of this scale. Cloud service providers with such computing power reserves therefore hold an inherent advantage in developing video generation applications.

    Secondly, experienced industry players can significantly accelerate the launch and iteration of text-to-video technology. Text-to-image and text-to-video artificial intelligence models share a high degree of similarity in their underlying technical frameworks. To some extent, text-to-video can be seen as an advanced version of text-to-image technology, which means that the techniques and experiences from text-to-image can be applied and referenced for text-to-video. As is well known, companies like ByteDance, Alibaba, and Baidu have already made significant strides in the text-to-image field, with some products even being commercially deployed. Leveraging their accumulated expertise in text-to-image technology, these players are poised to achieve substantial progress in the text-to-video domain.

    Thirdly, the strong resource integration capabilities of participating players can support the development of their text-to-video technology. Compared to text and images, video carries far more information, which means that producing more vivid, high-definition, and realistic videos requires higher investment from players in this field. It is worth noting, however, that as major internet companies built up over many years, Alibaba, Baidu, and ByteDance hold advantages in talent, funding, and computing power that should not be underestimated. Thanks to this, their text-to-video products will have stronger competitiveness and influence.

    Text-to-video technology is not only disrupting the traditional media industry but also bringing numerous new business opportunities and possibilities for content enhancement and industrial evolution. However, currently in China, text-to-video technology is still in its early developmental stages. While it may appear that text-to-video follows a logic extremely similar to text-to-image generation, in reality, text-to-video is considerably more challenging with many bottlenecks yet to be overcome.

    First, text-to-video generation demands high-quality data and presents significant computational challenges, leaving current participants far from producing satisfactory videos. Compared to text and images, video offers greater advantages in multidimensional information expression, visual richness, and dynamism, but this also means that text-to-video generation requires far more computational power. Each of the fields involved (natural language processing, visual processing, and scene synthesis) brings additional technical challenges that must be overcome. At present, domestic players still lack high-quality paired datasets, which poses severe challenges for semantic accuracy, clarity, and continuity.

    Secondly, text-to-video generation incurs high costs and has a relatively narrow business model, making successful commercialization challenging for participants. Compared to text-to-image generation, text-to-video involves higher computational complexity, which drives up costs. Business models for image generation are already fairly uniform, with similar pricing schemes and bases, and video generation is expected to follow a comparable pricing structure. Although image generation has achieved a higher level of commercialization within multimodal large models, offering some reference for the commercial prospects of video generation, text-to-video commercialization, as an emerging business, will still take time to mature.

    Thirdly, domestic and foreign companies are significantly increasing their investments and research in text-to-video technology, which will further intensify competition in this sector. The AI video generation field is already bustling with activity, featuring innovations like Pika Labs' "Pika 1.0" and Google's "W.A.L.T" model. Beyond the strong focus from international players, Chinese tech giants such as Baidu, Alibaba, ByteDance, Tencent, 360, Wanxing Technology, Kunlun Wanwei, Guomai Culture, and Meitu have also entered the arena, launching their own AI models. Clearly, competition in video generation is heating up rapidly.

    From text-to-image to text-to-video, the competition in the AIGC field has become extremely fierce. Although domestic progress in text-to-video is relatively slow, with no star products yet emerging, more companies with talent and technology in text-to-video are continuously appearing. However, in addition to the aforementioned challenges, text-to-video currently still faces data privacy and security issues that need to be resolved, and its true commercial operation and profitability remain to be verified. As for who will become the ultimate winner in this 'land grab,' we can only wait and see.
