Zhou Hongyi on Sora Model: The AI Gap Between China and the US May Still Be Widening

    baoshi.rao wrote:

    On February 17th, OpenAI released the Sora model, which can generate videos of up to 60 seconds from user-input text prompts. The demo videos released by OpenAI show strikingly realistic, almost surreal results.

    Yesterday, Zhou Hongyi, Chairman of 360, posted on Weibo about OpenAI's text-to-video model Sora. In his view, Sora means the timeline for realizing AGI could shrink from 10 years to as little as 1 year.

    Zhou Hongyi stated that AI may not disrupt every industry so quickly, but it can unleash more people's creativity. Today, Sora may significantly disrupt advertising, movie trailers, and short video, but it may not defeat TikTok so quickly; it is more likely to become a creative tool for TikTok. Zhou Hongyi also spoke about the AI gap between China and the US. He believes OpenAI still holds some secret weapons, whether GPT-5, machines that learn on their own to generate content automatically, or other AIGC capabilities, and that it has not yet revealed everything it can do. This suggests the AI gap between China and the US may still be widening.

    Below is the full text of Zhou Hongyi's statement:

    Sora means the realization of AGI could be shortened from 10 years to 1 year. At the beginning of the year, I shared ten predictions about large models in a speech at Fengmaniu. Unexpectedly, before the New Year period was even over, several of them had already been validated—from Gemini and NVIDIA's Chat with RTX to OpenAI's release of Sora, which has been truly groundbreaking. A friend asked for my thoughts on Sora, and here are a few points I'd like to share. Overall, I believe AGI will be realized very soon, likely within the next few years.

    First, the ultimate competition in technology comes down to talent density and deep accumulation. Many people say Sora's performance surpasses Pika and Runway. This is hardly surprising—compared to entrepreneurial teams, OpenAI, with its core technological strengths, remains exceptionally powerful. Some believe that with AI, startups can simply operate as individual businesses. Today's developments once again prove how laughable that notion is.

    Second, AI may not disrupt all industries as quickly as expected, but it can unleash more people's creativity. Many people today talk about Sora's impact on the film industry, but I don't see it that way. While machines can produce a good video, the theme, script, storyboard planning, and dialogue coordination still require human creativity—at least human input in the form of prompts. A video or movie is composed of countless 60-second clips. Today, Sora might bring significant disruption to advertising, movie trailers, and the short video industry, but it won't necessarily defeat TikTok quickly. Instead, it's more likely to become a creative tool for TikTok.

    Third, I've always said that, on the surface, the development level of domestic large models in China has nearly caught up with GPT-3.5, but in reality, there's still a year-and-a-half gap compared to GPT-4.0. Moreover, I believe OpenAI still has some secret weapons up its sleeve, whether it's GPT-5, machine self-learning to generate content automatically, or AIGC. Sam Altman is a marketing genius who knows how to control the pace—they haven't revealed all their cards yet. From this perspective, the AI gap between China and the US might still be widening.

    Fourth, the most remarkable aspect of large language models is that they are not just fill-in-the-blank machines but can comprehensively understand the knowledge of this world. Many people have analyzed Sora from technical and product experience perspectives, highlighting its ability to output 60-second videos, maintain multi-shot consistency, and simulate the natural world and physical laws. However, these are relatively superficial observations. The most important aspect is that Sora's technical approach is entirely different. Previously, we used Diffusion models for video and image generation, where videos were essentially combinations of real images without truly grasping the world's knowledge. All current text-to-image and text-to-video models operate on 2D planes, manipulating graphical elements without applying physical laws. In contrast, Sora's videos demonstrate an understanding akin to human cognition—for example, recognizing that a tank has immense impact force and can crush a car, not the other way around. Thus, I believe OpenAI leveraged its large language model (LLM) strengths, combining LLM with Diffusion training to enable Sora to achieve two capabilities: understanding the real world and simulating it. This results in videos that are realistic and transcend 2D limitations to simulate the physical world accurately. This is the merit of large models and represents the future direction. With robust large models as a foundation—built on understanding human language, human knowledge, and world models—and by integrating various other technologies, we can create super tools across multiple fields, such as biomedical protein and gene research, as well as disciplines like physics, chemistry, and mathematics.

    Sora's simulation of the physical world will significantly impact embodied AI in robotics and autonomous driving. Traditional autonomous driving technologies overemphasized perception while neglecting cognition. In reality, human driving decisions are based on an understanding of the world—e.g., assessing the speed of another vehicle, potential collisions, and their severity. Without this understanding, achieving true autonomous driving is challenging. So this time, Sora is just a small test. What it demonstrates is not merely the ability to create videos, but the new achievements and breakthroughs that large models can bring once they understand and simulate the real world.

    Fifth, OpenAI likely trained this model by reading vast amounts of video data. Large models combined with Diffusion technology require a deeper understanding of the world, and the learning samples will primarily consist of videos and images captured by cameras. Once AI is connected to cameras and watches all the movies, YouTube videos, and TikTok clips, its understanding of the world will far surpass what it learns from text. A picture is worth a thousand words, and a video conveys far more information than a single image. At that point, AGI (Artificial General Intelligence) won't be a matter of 10 or 20 years—it might be achieved in just a year or two.
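
    To make the fourth point above a little more concrete, here is a minimal, purely illustrative sketch of what "combining an LLM with Diffusion" can look like in code: text embeddings from a language-model encoder condition a diffusion-style denoiser that operates on video latents via cross-attention. Every module name, shape, and hyperparameter below is an assumption for illustration only, not OpenAI's actual implementation.

    ```python
    # Illustrative sketch (assumed names and shapes; not OpenAI's code):
    # text embeddings from a language-model encoder condition a diffusion-style
    # denoiser that operates on space-time patches of video latents.
    import torch
    import torch.nn as nn

    class TextConditionedVideoDenoiser(nn.Module):
        def __init__(self, latent_dim=64, text_dim=512, hidden=512, num_heads=8):
            super().__init__()
            # Project noisy video latents (one token per space-time patch).
            self.latent_proj = nn.Linear(latent_dim, hidden)
            # Cross-attention is where "language understanding" enters generation:
            # video tokens attend to the text representation.
            self.cross_attn = nn.MultiheadAttention(
                hidden, num_heads, kdim=text_dim, vdim=text_dim, batch_first=True)
            self.backbone = nn.TransformerEncoder(
                nn.TransformerEncoderLayer(hidden, num_heads, batch_first=True),
                num_layers=2)
            # Predict the noise to remove, as in standard diffusion training.
            self.noise_head = nn.Linear(hidden, latent_dim)

        def forward(self, noisy_latents, text_embeddings):
            # noisy_latents: (batch, num_patches, latent_dim)
            # text_embeddings: (batch, num_text_tokens, text_dim)
            h = self.latent_proj(noisy_latents)
            attended, _ = self.cross_attn(h, text_embeddings, text_embeddings)
            h = self.backbone(h + attended)      # residual conditioning
            return self.noise_head(h)            # same shape as noisy_latents

    # Toy usage with random tensors standing in for real data.
    model = TextConditionedVideoDenoiser()
    noisy = torch.randn(2, 256, 64)   # 2 clips, 256 space-time patches each
    text = torch.randn(2, 16, 512)    # 2 prompts, 16 token embeddings each
    print(model(noisy, text).shape)   # torch.Size([2, 256, 64])
    ```

    The only point of the sketch is the wiring: the denoiser never sees raw text, only the language model's representation of it, which is one plausible way the "understanding" Zhou describes could steer the "simulation."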
