
AI Video Titans Clash: Is the Next Super App About to Emerge?

Posted by baoshi.rao in AI Insights

    Just one week after releasing its latest AI model, Gemini, Google has unveiled another AI research breakthrough.

    On December 12, Google announced its collaboration with globally renowned computer vision expert, Chinese-American AI pioneer Fei-Fei Li and her student team, to launch the AI video generation model 'W.A.L.T' (Window Attention Latent Transformer).

    Like Pika 1.0 - which recently went viral online and was developed by the daughter of the chairman of A-share listed company Sunyard - W.A.L.T is also an AI video generation model.

    Previously on the evening of December 6, Google had released its latest multimodal AI model Gemini along with a demonstration video.

    However, shortly after Gemini's release, it was revealed that the demonstration video had been edited to artificially enhance the model's performance. This led to Google facing accusations of 'fabrication'.

    Just six days later, Google set its sights on AI video generation, one of the hottest areas for AI application deployment today, with the release of W.A.L.T.

    Similar to the previously popular Pika 1.0, W.A.L.T also supports text-to-video, image-to-video, and 3D video generation.

    In terms of video quality, according to demo videos and research papers, W.A.L.T can generate 3-second-long videos at 8 frames per second with a resolution of 512x896 through natural language prompts.
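The reported clip specs imply concrete per-clip numbers. A quick back-of-the-envelope sketch (the variable names are illustrative, not from the W.A.L.T paper):

```python
# Reported W.A.L.T output: 3-second clips, 8 frames per second, 512x896 resolution.
FPS = 8
DURATION_S = 3
WIDTH, HEIGHT = 512, 896

frames = FPS * DURATION_S            # total frames the model must generate per clip
pixels_per_frame = WIDTH * HEIGHT    # spatial resolution of each frame

print(frames)            # 24 frames per clip
print(pixels_per_frame)  # 458752 pixels per frame
```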

    Image source: W.A.L.T

    Industry expert "Guizang" publicly commented that W.A.L.T's effects are "far better than Pika 1.0, with excellent clarity and motion."

    Interestingly, Guo Wenjing, founder of Pika and daughter of Sunyard's chairman, actually shares a connection with Fei-Fei Li.

    Before dropping out to start her business, Guo Wenjing pursued a Ph.D. at Stanford University's AI Lab (focusing on NLP and graphics), while Fei-Fei Li was Stanford's first Sequoia Capital Professor and also worked at the Stanford AI Lab.

    Compared to the rising star Guo Wenjing, Fei-Fei Li is considered a foundational figure and technical authority in the global computer vision field, as well as a highly sought-after talent by major tech companies including Google.

    According to public information, Fei-Fei Li was born in Beijing in 1976 and grew up in Chengdu. In 1992, at the age of 16, she moved to the United States with her parents and enrolled at Princeton University three years later to study physics.

    During her academic journey, Li gradually developed an interest in AI research and shifted her focus to the then-niche field of computer vision. In 2007, despite funding shortages, she embarked on her first major project, ImageNet—a dataset designed to teach machines to recognize images.

    At that time, AI image recognition models could only identify four types of objects: cars, airplanes, leopards, and human faces, as researchers typically trained models exclusively on these categories. To teach AI to recognize an object, humans had to manually label the target in images and then feed large volumes of such labeled images to the AI for training.
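The labeled-data workflow described above can be sketched in miniature. This is not ImageNet code; it is a toy illustration, assuming tiny hand-labeled feature vectors in place of images and a nearest-centroid rule in place of a real vision model:

```python
# Toy sketch of supervised image classification from hand-labeled data.
# "Images" are 2-D feature vectors; labels are attached by a human annotator.
from statistics import mean

labeled_images = [
    ((0.9, 0.1), "car"), ((0.8, 0.2), "car"),
    ((0.1, 0.9), "face"), ((0.2, 0.8), "face"),
]

# "Training": average each label's examples into a centroid.
centroids = {}
for label in {lbl for _, lbl in labeled_images}:
    vecs = [v for v, lbl in labeled_images if lbl == label]
    centroids[label] = tuple(mean(dim) for dim in zip(*vecs))

def classify(vec):
    # Predict the label whose centroid is closest (squared distance).
    return min(
        centroids,
        key=lambda lbl: sum((a - b) ** 2 for a, b in zip(vec, centroids[lbl])),
    )

print(classify((0.85, 0.15)))  # -> car
print(classify((0.15, 0.85)))  # -> face
```

The point Li's ImageNet work made is that scaling up the labeled set, rather than the cleverness of the classifier, is what unlocks broad recognition.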

    Li envisioned that with a sufficiently large and well-labeled dataset, it would be possible to train a theoretically 'omniscient' computer vision model.

    In 2009, ImageNet was officially released and quickly became the training and testing dataset for almost all visual models. This achievement catapulted Fei-Fei Li to fame, earning her titles like 'Godmother of Chinese AI.' To this day, ImageNet remains one of the most renowned large-scale visual databases in the global AI industry and academia.

    Whether it's launching two major models in a week or collaborating with Fei-Fei Li's team, Google is clearly going all out in the development of multimodal AI models.

    Recently, the AI video generation sector has been bustling with activity. Beyond Pika 1.0 and W.A.L.T, numerous AI video generation tools have emerged or undergone functional updates.

    For example, in early November, the U.S.-based generative AI unicorn Runway updated its in-house video generation model, Gen-2, focusing on improving the fidelity and consistency of generated results.

    In mid-November, Meta, the tech giant originally known for social products, released its Emu Video model.

    At the end of November, U.S. text-to-image startup Stability AI launched its video generation model called Stable Video Diffusion, offering two models: SVD and SVD-XT.

    Image source: W.A.L.T

    Domestically, tech giants like ByteDance, Alibaba, and Baidu have already entered the race.

    On November 18th, ByteDance launched its text-to-video model PixelDance, introducing a novel approach that combines text instructions with first-and-last frame image guidance to enhance video dynamism.

    Following closely, Alibaba released its Animate Anyone model. Users simply need to provide the model with a static character image and some preset actions (or pose sequences) to generate animated videos of the character.

    According to previous public information, Baidu is internally testing similar functionality for its ERNIE model, which will soon be made available as a plugin.

    The active participation of domestic and international players indicates that the AI video generation sector is poised to become the next beneficiary in this wave of AI technological upgrades. Many industry professionals have already sensed the market trend. Jim Fan, Senior Research Scientist at NVIDIA and former OpenAI employee, wrote on social media: "2022 was the year of images, 2023 the year of sound waves, and 2024 (will be) the year of video!"

    CITIC Securities' research report points out: "Drawing parallels with text-to-image applications in advertising, text-to-video technology is poised to drive a productivity revolution—reducing production costs, lowering creative barriers, and accelerating AIGC industrialization. From a capability perspective, we believe text-to-video will likely first gain traction in short videos and animation."

    However, technological innovation comes with disruptive consequences for existing business models.

    Leo, an employee at a domestic video creation tool company, told CityScope: "Earlier this year, we primarily viewed AIGC as applicable to graphic content creation, estimating commercial-grade video generation would take another year or two to mature." He added that commercial video requirements include maintaining object consistency and continuity during storyboard script creation.

    Now, video generation tools are evolving far faster than expected. This acceleration is compelling market participants to adopt and integrate automated generation features proactively, or risk being left behind by the industry's transformation.
