
Under the Shadow of Sora: The Anxiety of China's AI Industry

AI Insights
ai-articles
    baoshi.rao
    #1

    "Those who can't keep up may be left behind." After watching Sora's demo video, Huang Bin, an animation producer with over a decade of experience, came to this conclusion.

    As concerns about unemployment in the film industry grow, the emergence of Sora has also brought significant anxiety to China's AI sector.

    Zhou Hongyi, founder of 360 Group, believes that the Sora model demonstrates performance and technological capabilities that surpass current Chinese counterparts. This is evident not only in Sora's potential timeline for achieving Artificial General Intelligence (AGI) but also in its practical application effects and innovation capabilities. More strikingly, there's a circulating notion online that 'Sora's birth is a Newton moment,' suggesting that Sora signifies the dawn of a new industrial revolution.

    In reality, after the domestic 'Hundred Models Battle,' Chinese text models have only recently reached or surpassed the level of GPT-3.5, with efforts now directed towards catching up to GPT-4.

    However, Sora showcases OpenAI's groundbreaking progress in multimodal models, not just in the text domain, making catching up with OpenAI, let alone surpassing it, a nearly impossible task for domestic AI manufacturers. Many netizens have raised pointed questions about China's AI capabilities:

      • Why wasn't Sora developed in China?
      • Is the gap between Chinese and American AI widening?
      • Is China a decade behind in this wave of Sora-like technology?
      • Why are we always playing catch-up instead of pioneering original innovations?

    Under this barrage of soul-searching questions, China's AI industry has collectively fallen silent.

    So, how big is the actual gap between China and the US in multimodal large models like Sora? Where exactly do the challenges lie in catching up? And given various constraints, does China have any unique advantages of its own?

    Although OpenAI acknowledges that Sora is still in the early stages of development and requires further refinement, the industry has reached a consensus — the launch of Sora marks an important milestone in the field of generative artificial intelligence.

    This is because Sora is not just a text-to-video tool but also a critical node on the path to AGI, validating a feasible technical route towards it.

    Similar to GPT-3 before it, Sora once again confirms that Scaling Law can continue to drive emergence in this technical direction. Behind this lies not only the result of astonishing capital and computing power support but also the outcome of countless engineering experiments and trial-and-error, backed by formidable technical prowess.

    Many speculate that OpenAI likely already possesses a largely complete multimodal GPT-5, capable of selectively releasing certain components to outmaneuver competitors or steer public opinion as needed.

    Zhou Hongyi has gone further, boldly asserting that Sora's emergence shortens the timeline for realizing AGI from a decade to perhaps a single year. In fact, next to Sora, existing top video models like Pika and Runway, as well as domestic manufacturers investing in multimodal AI, are all essentially 'outclassed'.

    This also reflects the gap between China and the US in the depth of AI technology research and development and resource investment.

    First, the threshold comes from computing power. Some scholars believe Sora is only a model with approximately 3 billion parameters, meaning its training costs are not as high as imagined. Even so, the processing and annotation of video data, combined with the inevitably massive token counts and compute consumption of long-video inference, pose a formidable barrier for would-be followers.

    Even if Sora truly has only 3 billion parameters, the computation required for video inference could far exceed that of a trillion-parameter language model. With domestic access to high-end GPUs restricted, computing power has become a significant challenge.

    Secondly, there is the issue of high-quality data. According to OpenAI's technical report, Sora's powerful capabilities are attributed to two key factors: first, the use of a Transformer-based Diffusion Model; second, the conversion of different types of visual data into a unified format—patches—enabling the utilization of vast, high-quality, and cost-effective data.

    Industry experts believe that the significant advantage in data quality and quantity is likely one of the most critical factors behind Sora's success.

    In terms of computing power, while the number of GPU cards used by OpenAI to train the Sora model is not unattainable, other companies still struggle to replicate OpenAI's success even with sufficient hardware resources. The main bottleneck lies in how to acquire and process large-scale, high-quality video data. In 2022, OpenAI announced an innovative method to train AI models, eliminating the need for extensive data annotation processes.

    According to reports, OpenAI's Video PreTraining (VPT) model enabled an AI agent to learn how to craft a stone pickaxe from scratch in Minecraft.

    Researchers first collected data from contracted game players: video recordings paired with keyboard and mouse operation logs. They then used this small labeled set to train an inverse dynamics model (IDM), which infers the keyboard and mouse actions taken at each step of a video. The IDM can then label vast amounts of raw gameplay footage, achieving the goal with far less hand-labeled data than would otherwise be required.
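    The VPT idea described above can be sketched in miniature. This is a toy illustration, not OpenAI's actual code: every name is invented, and the "IDM" here learns a trivially simple world where the action is exactly the change in state between consecutive frames.

```python
# Toy sketch of the VPT pipeline (all names illustrative). A small labeled
# set "trains" an inverse dynamics model (IDM) mapping (frame_t, frame_t+1)
# -> action; the IDM then pseudo-labels a much larger unlabeled video corpus.

def train_idm(labeled):
    """'Train' a trivial IDM: in this toy world the action is exactly
    the change in state between consecutive frames."""
    # A real IDM would fit a neural network to the labeled examples.
    return lambda s0, s1: s1 - s0

def pseudo_label(idm, frames):
    """Turn raw video (a state sequence) into (state, action) pairs."""
    return [(frames[i], idm(frames[i], frames[i + 1]))
            for i in range(len(frames) - 1)]

# Small labeled dataset from contracted players: (frame_t, frame_t+1, action).
labeled = [(0, 1, 1), (3, 2, -1), (5, 5, 0)]
idm = train_idm(labeled)

# Large unlabeled video becomes training data for behavior cloning.
video = [0, 1, 2, 2, 1]
print(pseudo_label(idm, video))  # [(0, 1), (1, 1), (2, 0), (2, -1)]
```

    The point of the design is leverage: the expensive labeled data is only needed to learn the IDM, after which arbitrarily large amounts of cheap unlabeled video can be converted into training pairs.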

    This research was published in June 2022, with the paper noting that the work had been ongoing for a year, meaning OpenAI had been conducting this research since at least 2021.

    Logenic AI co-founder Li Bojie believes that OpenAI's first-mover advantage creates an early data barrier, making it more difficult for later entrants to catch up. "Even for a company like Google with the largest global data volume, the training data for large models may not necessarily be better than OpenAI's," said Li Bojie.

    In comparison, domestic companies also face certain gaps in data accumulation and utilization: on one hand, due to policy changes and other restrictions, latecomers may not be able to access some previously available key data; on the other hand, as AI-generated content increasingly floods the internet, original real-world data becomes 'contaminated,' making it more difficult to obtain high-quality, unbiased training data.

    Finally, there are innovative training methods. Sora achieves innovation by combining Transformer and diffusion models. First, it converts different types of visual data into a unified visual representation (visual patches). Then, it compresses the original video into a low-dimensional latent space and decomposes the visual representation into spatiotemporal patches (equivalent to Transformer tokens), allowing Sora to train and generate videos within this latent space.
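    The patch decomposition described above can be sketched as a pure tensor reshape. This is an assumption-laden illustration (the latent shape and patch sizes are made up; Sora's real pipeline is not public), showing only how a latent video becomes a flat sequence of Transformer-token-like patches.

```python
import numpy as np

# Minimal sketch: slice a latent video tensor into non-overlapping
# spatiotemporal patches, the Transformer-token equivalent described above.
# Assumed latent shape: (T, H, W, C) frames already compressed by a VAE.

def patchify(latent, pt=2, ph=4, pw=4):
    """Split (T, H, W, C) into (pt, ph, pw) patches and flatten each
    into one token vector."""
    T, H, W, C = latent.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    x = latent.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)        # group the patch grid first
    return x.reshape(-1, pt * ph * pw * C)      # (num_tokens, token_dim)

latent = np.zeros((8, 16, 16, 4))               # 8 latent frames
tokens = patchify(latent)
print(tokens.shape)  # (64, 128): 4*4*4 patches, each 2*4*4*4 values long
```

    Flattening video into tokens this way is what lets the same Transformer machinery handle images and videos of varying resolution and duration.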

    Next comes noise addition and removal: given noisy patches as input, Sora is trained to predict the original "clean" patches, and it generates videos by iteratively denoising.
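    The noising step can be written out concretely using standard DDPM notation. This is an assumption about Sora's internals, which OpenAI has not published; the sketch only shows the forward process and how a perfect noise prediction would recover the clean patch.

```python
import numpy as np

# DDPM-style forward noising on patch tokens (a sketch, not Sora's code).
# Training pairs a noised patch x_t with the noise eps the model must predict.

rng = np.random.default_rng(0)

def add_noise(x0, alpha_bar):
    """Forward process: x_t = sqrt(a)*x0 + sqrt(1-a)*eps."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    return xt, eps

def recover_clean(xt, eps, alpha_bar):
    """Invert the forward process given a (here: perfect) noise estimate."""
    return (xt - np.sqrt(1.0 - alpha_bar) * eps) / np.sqrt(alpha_bar)

x0 = rng.standard_normal((4, 128))   # four "clean" patch tokens
xt, eps = add_noise(x0, alpha_bar=0.7)
x0_hat = recover_clean(xt, eps, alpha_bar=0.7)
print(np.allclose(x0, x0_hat))  # True: exact when the true noise is known
```

    In real training the model only estimates eps, so generation repeats many small denoising steps rather than inverting in one shot as here.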

    OpenAI discovered that the greater the computational resources used for training, the higher the sample quality. Particularly after large-scale training, Sora demonstrates "emergent" capabilities in simulating certain properties of the real world. Overall, Sora is the result of good architecture plus quality data, then scaling up the model to achieve a qualitative leap from quantitative changes.

    Although most of Sora's design builds on existing techniques, it is so far the only model producing such stunning results, which indicates how many training-technique challenges had to be overcome along the way.

    Sora's technological breakthrough has spread AI anxiety domestically, but China's AI sector is not entirely defenseless. Before Sora emerged and dominated the public spotlight, several domestic listed companies had already made arrangements in the field of multimodal AI.

    On December 18, 2023, Orient Securities mentioned in a research report that leading domestic video analytics companies such as Hikvision, Dahua Technology, and EZVIZ have actively engaged in the research and industry application of multimodal large models.

    At the same time, major tech giants like Baidu, Alibaba, Tencent, Huawei, and ByteDance have also laid out their plans for multimodal foundational large models. According to incomplete statistics, from December 2023 to the present, more than ten A-share companies, including Wanxing Technology, Bohui Technology, Yidian Tianxia, Digital Video, Hanwang Technology, Danghong Technology, and East China Information, have disclosed their business in the video generation model field on investor interaction platforms.

    Although the current 'text-to-video' results of domestic manufacturers are far inferior to Sora's, China already possesses domestic counterparts to the core infrastructure Sora relies on: foundational large language models, text-to-image models (the counterpart to DALL·E 3), large-scale video datasets, AI computing systems, and large-model development tool stacks.

    For example, original foundational large language models like Wenxin Yiyan, iFlytek Spark, and BAICHUAN, text-to-image models like Wenxin Yige and Tencent Hunyuan, and the rapid progress in large-model infrastructure over the past year together give China's AI the capability and conditions to replicate ChatGPT-style success in the video generation track.

    Wang Peng, a senior expert at Tencent Research Institute, believes that Sora's release further confirms DiT (VAE encoder + ViT + DDPM + VAE decoder) as a viable direction for multimodal AI, and that Chinese AI giants still have the potential to reach Sora's current level within about a year using existing resources.
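    The DiT decomposition named above (VAE encoder, ViT backbone, DDPM denoising, VAE decoder) can be sketched at the level of tensor shapes. Every function below is an illustrative stub with invented shapes; none of this is Sora's or any vendor's actual code.

```python
import numpy as np

# Shape-level sketch of a DiT-style pipeline:
#   pixels -> VAE encoder -> tokens -> DDPM loop over a ViT -> VAE decoder.

def vae_encode(video):                 # (T, H, W, 3) -> latent, 8x smaller
    T, H, W, _ = video.shape
    return np.zeros((T, H // 8, W // 8, 4))

def vit_denoise(tokens, steps=50):     # DDPM loop over Transformer tokens
    for _ in range(steps):             # each step refines the noise estimate
        tokens = tokens * 0.99         # stand-in for a real denoising update
    return tokens

def vae_decode(latent):                # latent -> (T, H, W, 3) pixels
    T, h, w, _ = latent.shape
    return np.zeros((T, h * 8, w * 8, 3))

video = np.zeros((16, 256, 256, 3))
latent = vae_encode(video)             # (16, 32, 32, 4)
tokens = vit_denoise(latent.reshape(16, -1))
out = vae_decode(tokens.reshape(latent.shape))
print(out.shape)  # (16, 256, 256, 3)
```

    The practical point is modularity: each of the four stages is a known, separately published technique, which is why observers argue that well-resourced followers can plausibly reassemble the recipe.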

    In fact, not only is the generational technology gap not as large as imagined, but the marathon of video generation models entering industries has just begun. The value of large models needs to be proven through commercialization, and Sora is no exception.

    First, compared to large language models that are "accessible to everyone," video generation models have higher application barriers and a smaller audience. Currently, OpenAI has made Sora available only to select creators, not to the general public as with ChatGPT. The journey from research to practical application is evidently much slower for video generation models, with their potential applications and commercial viability still awaiting exploration.

    Secondly, while Sora is powerful, cost remains a significant practical concern.

    Estimates suggest that generating a single video with Sora can cost anywhere from a few dollars to several dozen dollars. For widespread public use, costs would need to be reduced to about 1% of current levels to be acceptable.

    Reducing costs while simultaneously improving generation quality and logical coherence presents a critical challenge that urgently needs to be addressed. At the same time, considering the unresolved issue of 'hallucinations,' generating truly controllable and usable videos will remain costly in the short term.

    These limitations provide a relatively long window of opportunity for China's AI industry, academia, and other sectors to catch up.

    Currently, the extent of Sora's commercial value remains unclear, but leveraging large models to find application scenarios is an area where the Chinese market holds an advantage. China boasts abundant industries and scenarios. If Chinese AI vendors can solve specific problems for vertical industry users, refine their tools, and optimize prompt engineering for video generation models to make them accessible to non-professional users, there's significant potential to surpass GPT-4 or even GPT-5 in specialized fields.

    Moreover, Chinese AI companies can build upon foundational models like Sora to drive application innovations. For instance, they could develop more sophisticated video editing capabilities or revolutionize medical education and simulation training on top of Sora, thereby pioneering commercialization pathways.

    Sora represents a major breakthrough in AI-generated video technology, and it highlights the real technological gap between China and the US. For China's tech community, it is both a wake-up call and a source of motivation. While acknowledging the gap, China's AI sector should not underestimate itself: self-examination, strategic adjustment, and seizing the window of opportunity are essential for overtaking on the curve.
