Can Domestic Companies Build Their Own Sora? This Tsinghua-Affiliated AI Team Offers Hope
On the video-generation path that Sora represents, domestic companies already have a degree of technical groundwork.
At the end of 2023, many predicted that the coming year would see rapid progress in video generation. Unexpectedly, however, shortly after the Lunar New Year, OpenAI dropped a bombshell: Sora, capable of generating smooth, realistic one-minute videos. Its arrival has raised a concern among many researchers: has the gap between domestic and international AI technology widened again?
According to OpenAI's disclosed technical report, one of Sora's core technical points is transforming visual data into a unified representation of patches, combining Transformer and diffusion models to demonstrate exceptional scaling properties. Coincidentally, the recently released Stable Diffusion 3 also adopts the same architecture. In fact, both works are based on the paper "Scalable Diffusion Models with Transformers" co-authored by Sora's core researcher William Peebles and NYU Assistant Professor Xie Saining. This paper introduced DiT, a novel diffusion model based on Transformer architecture, which replaces the commonly used U-Net backbone with Transformers operating on latent patches, successfully replicating the scalability and emergent properties of large language models for visual tasks.
We noted that as early as September 2022, two months before DiT, a Tsinghua team had submitted a paper titled "All are Worth Words: A ViT Backbone for Diffusion Models," proposing U-ViT, a Transformer-based architecture to replace the CNN-based U-Net. The two works take the same architectural approach: both integrate Transformers into diffusion models. They also followed similar experimental paths, adopting the same patch embedding method and patch size (both concluding that 2×2 patches work best), and ran experiments with models ranging from 50M to 500M parameters, ultimately demonstrating the same scaling properties. However, DiT was evaluated only on ImageNet, while U-ViT was tested on small datasets (CIFAR10, CelebA), ImageNet, and the text-image dataset MSCOCO. In addition, compared to a plain Transformer, U-ViT introduces long (skip) connections, which significantly speed up training convergence. The paper was later accepted to CVPR 2023.
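To make the shared recipe concrete, below is a minimal, hedged PyTorch sketch of the idea: a ViT backbone acting as the noise-prediction network of a latent diffusion model, with 2×2 patch embedding and U-Net-style long skip connections. The class name, layer sizes, and conditioning scheme are illustrative assumptions rather than the actual U-ViT or DiT implementation.

```python
# Illustrative sketch (not the official U-ViT or DiT code) of a ViT-style noise
# predictor for latent diffusion: the noisy latent is split into 2x2 patches,
# the timestep (and optional condition) become extra tokens, and U-Net-like
# long skip connections link shallow and deep blocks.
import torch
import torch.nn as nn

class ViTNoisePredictor(nn.Module):
    def __init__(self, latent_channels=4, patch=2, dim=512, depth=12, heads=8):
        super().__init__()
        self.patch = patch
        self.embed = nn.Conv2d(latent_channels, dim, kernel_size=patch, stride=patch)
        # A real model would use sinusoidal timestep embeddings and positional
        # embeddings; they are simplified here to keep the sketch short.
        self.time_embed = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        make_block = lambda: nn.TransformerEncoderLayer(
            dim, heads, dim * 4, batch_first=True, norm_first=True)
        self.in_blocks = nn.ModuleList([make_block() for _ in range(depth // 2)])
        self.mid_block = make_block()
        self.out_blocks = nn.ModuleList([make_block() for _ in range(depth // 2)])
        # Long skip connections: concatenate shallow tokens with deep tokens, project back.
        self.skip_proj = nn.ModuleList([nn.Linear(2 * dim, dim) for _ in range(depth // 2)])
        self.to_patch = nn.Linear(dim, latent_channels * patch * patch)

    def forward(self, z_t, t, cond_tokens=None):
        B, C, H, W = z_t.shape
        img = self.embed(z_t).flatten(2).transpose(1, 2)            # (B, N, dim) patch tokens
        t_tok = self.time_embed(t.float().view(B, 1)).unsqueeze(1)  # timestep as one extra token
        parts = [t_tok, img] if cond_tokens is None else [t_tok, cond_tokens, img]
        x = torch.cat(parts, dim=1)                                 # "all are worth words": everything is a token
        skips = []
        for blk in self.in_blocks:
            x = blk(x)
            skips.append(x)
        x = self.mid_block(x)
        for blk, proj in zip(self.out_blocks, self.skip_proj):
            x = blk(proj(torch.cat([x, skips.pop()], dim=-1)))      # long skip connection
        Hp, Wp = H // self.patch, W // self.patch
        n_img = Hp * Wp
        out = self.to_patch(x[:, -n_img:])                          # keep only the image patch tokens
        out = out.reshape(B, Hp, Wp, C, self.patch, self.patch)
        return out.permute(0, 3, 1, 4, 2, 5).reshape(B, C, H, W)    # un-patchify to the predicted noise
```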
Based on the U-ViT architecture, in March 2023 the team released another work called UniDiffuser (see "Tsinghua Zhu Jun's Team Open-Sources the First Transformer-Based Multimodal Diffusion Model, Achieving Text-Image Mutual Generation and Editing"), training a 1-billion-parameter multimodal model on the open-source large-scale text-image dataset LAION-5B. Around the same time, Shengshu Technology, which focuses on general multimodal models, was officially founded (see "Interview with Shengshu Technology's Tang Jiayu: Tsinghua-Affiliated Team Secures Over 100 Million in Funding, Using Transformer for Multimodal Models"). The divergence occurred here: weighing computing resources and technical maturity, Shengshu Technology prioritized applying U-ViT to text-to-image tasks, while OpenAI leveraged its computing advantage to apply DiT directly to video tasks. Although the primary tasks differ, U-ViT also performs strongly on visual tasks: UniDiffuser's performance is essentially on par with Stable Diffusion 1.5 of the same period. More importantly, UniDiffuser is more extensible, able to perform any form of generation between text and images on top of a single underlying model. Simply put, beyond one-way text-to-image generation, it can also handle image-to-text generation, joint text-image generation, unconditional text and image generation, text/image rewriting, and various other functions.
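As a rough illustration of how a single network can cover all of these directions, here is a hedged sketch of the general mechanism behind UniDiffuser (not its actual code): the image and text latents are perturbed with independent noise levels, and each task corresponds to a particular choice of timesteps, e.g. keeping the text clean (timestep 0) while denoising the image gives text-to-image generation. The `denoise_step` schedule below is a toy stand-in for illustration.

```python
# Hedged sketch of the UniDiffuser-style idea: one joint noise predictor over
# image and text latents perturbed with independent timesteps; the generation
# mode is just a choice of timesteps. All names are illustrative placeholders.
import torch

def denoise_step(z, eps, t, steps):
    """Toy deterministic (DDIM-like) update with a linear stand-in schedule."""
    a_t = 1.0 - t / (steps + 1)
    a_prev = 1.0 - (t - 1) / (steps + 1)
    z0_hat = (z - (1 - a_t) ** 0.5 * eps) / a_t ** 0.5
    return a_prev ** 0.5 * z0_hat + (1 - a_prev) ** 0.5 * eps

def sample(joint_eps_model, steps, img_shape, txt_shape, img_cond=None, txt_cond=None):
    """Schematic reverse-diffusion sampler covering several generation modes."""
    z_img = torch.randn(img_shape) if img_cond is None else img_cond
    z_txt = torch.randn(txt_shape) if txt_cond is None else txt_cond
    for t in reversed(range(1, steps + 1)):
        # Conditioned (clean) modalities keep timestep 0; generated ones follow the schedule.
        t_img = 0 if img_cond is not None else t
        t_txt = 0 if txt_cond is not None else t
        eps_img, eps_txt = joint_eps_model(z_img, z_txt, t_img, t_txt)
        if img_cond is None:
            z_img = denoise_step(z_img, eps_img, t, steps)
        if txt_cond is None:
            z_txt = denoise_step(z_txt, eps_txt, t, steps)
    return z_img, z_txt

# text-to-image:    sample(model, 50, img_shape, txt_shape, txt_cond=text_latent)
# image-to-text:    sample(model, 50, img_shape, txt_shape, img_cond=image_latent)
# joint generation: sample(model, 50, img_shape, txt_shape)
```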
UniDiffuser open-source version performance
Sample images from the current version of UniDiffuser
With these early architectural explorations, Shengshu Technology actually shows considerable potential in video generation and could become the Chinese team closest to Sora. Moreover, it has already conducted some research in the field of video generation.
So, what lies ahead? What are the challenging issues that need to be addressed in video generation? What business opportunities will Sora bring? In a recent interview, Shengshu Technology CEO Tang Jiayu and Chief Scientist Zhu Jun shared their insights with Machine Heart.
Sora's emergence came half a year earlier than expected
Machine Heart: First, I'd like to ask both of you to recall your first impressions when you saw Sora. Were there any particularly impressive demos?
Tang Jiayu: What struck me most was its smoothness and duration. Previously, AI-generated short videos were jokingly called GIFs: minimal changes, very short, just a few seconds long. Sora's videos are much longer, and their smoothness and naturalness are clearly on another level. I think this is the most immediate visual impact. The demo that left the deepest impression on me was the paper airplane scene: a group of paper airplanes flying through a forest, colliding with each other and interacting with leaves. It is an imagined scenario, yet the generation is highly realistic and already demonstrates some ability to represent physical laws.
Zhu Jun: Looking back at earlier predictions about video generation length, Sora's emergence is actually ahead of schedule. While it was anticipated that video generation would develop rapidly this year, no major technical breakthroughs were foreseen at the time, so short videos of a few seconds were expected to remain the mainstream format. Sora, however, has achieved much longer durations, which is quite surprising. Originally this level was predicted to be reached by the middle or end of this year, so Sora has moved the timeline forward by about half a year.
Replacing U-Net with Transformer is a natural idea; the difference lies in who achieves results first
Machine Heart: There has been much discussion recently about Sora's core innovations, with particular focus on its architecture. Professor Zhu, could you explain in simple terms what Sora's Diffusion Transformer architecture is about, and why it's necessary to 'replace the commonly used U-Net backbone with Transformer'?
Zhu Jun: Taking video data as an example, the principle of diffusion models involves adding and removing noise from the data. The key issue here is whether we can accurately predict the noise by designing a noise prediction network. Traditionally, people used U-Net for this purpose, but Transformer has demonstrated significant advantages in scalability and other aspects. Therefore, replacing U-Net with Transformer is a natural progression—the difference lies in who can achieve effective results first. The DiT used in Sora was released at the end of 2022. In fact, as early as September 2022, we released a model called U-ViT. The main idea of this model was to use Vision Transformer to replace U-Net, sharing the same core concept as DiT—enhancing diffusion models with Transformer. This approach has since proven highly effective, especially in generating visual data. It retains the advantages of diffusion models while leveraging the scalability of Transformers and their compatibility with different modalities. Compared to traditional Transformers, our own design (U-ViT) includes long connections, which significantly improve computational efficiency, leading to notable performance enhancements.
U-ViT Architecture
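For reference, here is a minimal, hedged sketch of the training loop implied by that description: noise is added to a clean latent at a random timestep, and a Transformer-based network (for example, the `ViTNoisePredictor` sketched above) is trained to predict that noise. The DDPM-style schedule and hyperparameters are assumptions for illustration, not the settings used by U-ViT, DiT, or Sora.

```python
# Minimal sketch of one diffusion training step with a Transformer noise predictor.
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # assumed DDPM-style noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def training_step(eps_model, z0, cond_tokens, optimizer):
    """One denoising-diffusion training step on a batch of clean latents z0 (B, C, H, W)."""
    B = z0.shape[0]
    t = torch.randint(0, T, (B,))                 # random timestep per sample
    a = alphas_bar[t].view(B, 1, 1, 1)
    noise = torch.randn_like(z0)
    z_t = a.sqrt() * z0 + (1 - a).sqrt() * noise  # forward process: add noise
    pred = eps_model(z_t, t, cond_tokens)         # the Transformer predicts the noise
    loss = F.mse_loss(pred, noise)                # simple noise-prediction objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```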
Machine Heart: What metrics can we use to observe these effects?
Zhu Jun: Actually, back in 2022, it was already evident that architectures like Vision Transformer could improve generation quality, reach higher resolutions, and train larger-scale models more efficiently. Now we can see more examples, including Sora and Stable Diffusion 3. These examples repeatedly demonstrate the immense potential of this architecture.
Machine Heart: What effects has this work demonstrated in Shengshu's products?
Zhu Jun: From the very beginning, we insisted on a fusion architecture of diffusion and Transformer, which is a native multimodal architecture. Previously, many teams working on multimodality tried to align all modalities to language. We believe that approach is not optimal, because it has inherent limitations from both a theoretical and a computational-efficiency standpoint. That's why we have pursued the diffusion-plus-Transformer approach from the start.
In 2022, when we proposed the U-ViT architecture, we benchmarked it against Stable Diffusion, which had just been open-sourced at the time. Building on U-ViT, we open-sourced a large model called UniDiffuser in March 2023. It is also based on the fusion of diffusion and Transformer and enables arbitrary conversions between the text and image modalities. From training and optimizing the underlying architecture to supporting image, 3D, and video generation, Shengshu has consistently adhered to this fusion approach.
Machine Heart: So you mean this fusion approach yields better results than using either Diffusion or Transformer alone, is that correct?
Zhu Jun: Yes. Compared to using Diffusion alone, the main advantage of the fusion architecture is the scalability of Transformer. Compared to using Transformer alone, the fusion architecture offers significant advantages in the efficiency of generating visual data, including model representation efficiency and computational efficiency. For the Transformer architecture, the advantage of putting everything into it is simplicity and directness. However, in terms of processing and generating visual data, diffusion models still hold the upper hand. In our view, hybrid models are more aligned with the positioning of native multimodal systems. Different types of data have distinct characteristics, so the most suitable processing method should be chosen for each modality. From the perspective of actual visual generation effects, the mainstream approach currently employs diffusion models for generation, as the results from using the Transformer architecture directly for generation still lag behind.
Machine Heart: Your U-ViT and DiT were proposed around the same time, but you chose to prioritize its use for text-to-image tasks rather than video generation. What was the reasoning behind this decision?
Zhu Jun: Actually, we are also working on video generation, but at the time we prioritized based on computational resources. This also reflects our judgment of technical maturity. Last year we started with 2D images, moved to 3D generation by May, and later worked on video and 4D (see "One-Click Real Scene to Animation: Tsinghua Startup Launches the World's First 4D Skeletal Animation Framework"). Essentially, after establishing a foundational framework, we expanded into different dimensions: 3D and 4D represent the spatial and temporal extensions, respectively.
Video is essentially a stream of images extended along the time axis, so our architecture naturally supports short video generation. However, we initially focused on generating clips of a few seconds, unlike OpenAI's Sora, which reaches tens of seconds or even a minute. There are many reasons for this, but a key one is our relatively limited resources. That said, much of the knowledge from 2D image generation (such as large-scale training experience) transfers to video generation.
Replicating Sora: Many Challenges Remain to Be Solved
Machine Heart: The technical differences between generating a few seconds of video and a full minute are enormous. In your experience, apart from computing power, what is the key to achieving this?
Zhu Jun: A crucial aspect is how to effectively represent the spatiotemporal information of longer videos: how to compress video data efficiently, learn an embedded representation, and then perform diffusion and generation on top of it. Beyond the architecture, data is also crucial for effective training. Leveraging the advantages accumulated in earlier projects such as DALL·E 3, OpenAI can achieve relatively strong semantic understanding of video data, which is extremely critical for the training data. During the creative process, the input language is usually limited and simple, so a robust semantic understanding process is essential for generating rich video content.
Of course, there may be many other factors we are unaware of. Sora's success is not just about generation; it includes semantic understanding, data annotation, data cleaning, large-scale training, and engineering optimization, among others. These issues are unknown unless one has worked on them. Given OpenAI's track record of successful projects, their chances of succeeding in a new project are higher.
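As a hedged illustration of the "represent spatiotemporal information, then diffuse over it" point above, the sketch below cuts a video latent into small spacetime patches and flattens them into tokens that a Transformer-based noise predictor could consume. The patch sizes and tensor shapes are illustrative assumptions, not Sora's actual settings.

```python
# Hedged sketch of spacetime patchification: a video latent (frames x channels
# x height x width) is cut into small 3D patches and flattened into tokens.
import torch

def video_to_tokens(video_latent, pt=1, ph=2, pw=2):
    """video_latent: (B, T, C, H, W) -> tokens: (B, N, pt*ph*pw*C)"""
    B, T, C, H, W = video_latent.shape
    x = video_latent.reshape(B, T // pt, pt, C, H // ph, ph, W // pw, pw)
    x = x.permute(0, 1, 4, 6, 2, 5, 7, 3)  # group values by (t, h, w) patch grid
    return x.reshape(B, (T // pt) * (H // ph) * (W // pw), pt * ph * pw * C)

tokens = video_to_tokens(torch.randn(1, 16, 4, 32, 32))
print(tokens.shape)  # torch.Size([1, 4096, 16]): 16*16*16 patches, each holding 1*2*2*4 values
```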
Machine Heart: Will the same architecture differ when applied to image tasks versus video tasks? For the Shengshu team, what are the next steps to extend this architecture from image tasks to video tasks?
Zhu Jun: The main difference is that videos contain a lot of spatiotemporal information. The key challenge is capturing the essential motion and maintaining long-term consistency, something that single images don't involve. The underlying principles are similar, and we've been working on video-related projects since the second half of last year.
Our underlying architecture is self-trained, allowing us to naturally perform various types of generation. Image generation serves as the foundation, and its quality directly impacts video generation. Additionally, we continue to work on 3D generation. However, Sora appeared earlier than we expected, so we will focus more on video generation moving forward.
Building a Universal Multimodal System Requires a Universal Architecture
Machine Heart: The release of Sora has revealed OpenAI's ambition to go 'all in' on AGI. Their technical approach rests on two key points: multimodality and a universal architecture. Shengshu Technology also adheres to the 'universal multimodal approach.' In your view, what is the necessity of a universal architecture?
Zhu Jun: To achieve stronger generalization, a more universal model architecture is essential. Take Sora as an example: its architecture must integrate text and visual data. In other words, if you focus solely on vision or solely on text, you won't achieve optimal performance on multimodal tasks, or some modalities may not be processable at all. The two sides directly support each other.
Machine Heart: What are the key challenges in developing such a general architecture?
Zhu Jun: The main challenge lies in the fact that data from different modalities have distinct characteristics. Should we simply use a one-size-fits-all approach to represent all data? For now, that may not be optimal, so we need to analyze and account for the unique features of each data type. In addition, the volume of data varies across modalities and is often imbalanced, which has a practical impact on the optimization process during training. Another issue is alignment and understanding between different modalities.
Machine Heart: After the emergence of Sora, some voices suggest that the gap between domestic and international AI has further widened. What is your view on this?
Zhu Jun: Whether the gap has widened is debatable. I believe that after Sora's emergence, the gap between domestic and international AI has not become the kind of obvious generational difference we saw when ChatGPT first appeared; we may simply be somewhat behind in engineering and technical details. Video generation is also highly valued domestically, and China's foundation in image- and video-related work is relatively solid. Judging from current results, the actual situation may be more optimistic than imagined.
Inspiration from OpenAI: Technical confidence and resources are both important
Machine Heart: From commercial and product perspectives, how do you view the success of Sora?
Tang Jiayu: OpenAI's overall approach is geared towards the goal of AGI, continuously advancing through improvements in foundational model capabilities. The models themselves can be considered their core product. It's said that the Sora team didn't focus much on commercial or product considerations initially, primarily concentrating on achieving truly excellent video generation capabilities. They believed that with such strong capabilities, more commercial products could be built on top. Providing underlying API capabilities externally and fostering a thriving AI ecosystem at the higher level is a business model OpenAI has already proven successful.
From this perspective, I think a crucial factor in their success is embedded in their company's values—specifically, "Scale." The entire company believes in scaling up, as stated on their website: "If in doubt, scale it up further." This reflects their full confidence and persistence in their technical approach, which has led to their current success.
Machine Heart: What insights does this offer for Shengshu Technology?
Tang Jiayu: First, it's about mindset. I believe that having designed a good architecture like Diffusion fused with Transformer, and seeing its enormous potential, we should have more technical confidence. This is similar to OpenAI's belief in scaling up.
Second, while maintaining confidence, if you want to scale up, especially with video data, you need to integrate more resources. Compared to OpenAI, we or other startups in China still have far fewer resources. Therefore, we must be bold in thinking and acting to bring in more resources and collaborate with more resource providers. Only then can we turn technical confidence into technical implementation and eventually product realization.
Video Generation: The Past and Future of Shengshu
Machine Heart: Shengshu previously launched some text-to-video capabilities. Can you introduce the previous exploration work?
Tang Jiayu: Our technical exploration ultimately serves the product. From a product perspective, the capabilities we previously released were similar to those in the industry, mainly the generation and editing of short videos lasting a few seconds. At the time we were constrained primarily by computing power and did not scale up existing architectures on video data. From a product-usage standpoint, we observed that these short videos could already assist users in creative work; even longer videos could be produced by scripting and concatenating short clips. Sora has demonstrated that native long-video generation not only enables richer artistic expression, such as long takes, from a content-creation perspective, but also reveals a certain understanding of the physical world, which makes the generated videos more natural. This has significantly strengthened our confidence and determination to increase investment in video-generation R&D.
Moreover, our earlier explorations drove the development of internal engineering infrastructure, such as accumulating experience in video data collection, cleaning, and labeling, and in efficient model training. These foundations, together with an architecture that is the closest to Sora's, have raised our expectations for what we can ultimately achieve in long-video generation.
Machine Heart: As far as you know, how high are the development and application costs of Sora? If you were to develop a similar product, how would Shengshu address the resulting cost challenges?
Tang Jiayu: In terms of development costs, the industry estimates that a resource-sufficient setup requires on the order of 10,000 NVIDIA A-series GPUs. Since we have done a lot of acceleration work in large-scale training, our own assessed needs are somewhat lower.
As for Sora's application cost, generating a 60-second high-definition video currently costs roughly a few to several dozen RMB. This may be why OpenAI has not fully released it yet: concerns about computing power and cost. In addition, the model's success rate in generating videos is still unknown, which may also be a concern.
To reduce application costs, it is essential to work on model compression, including distributed approaches such as running inference on mobile phones and laptops, which will also be a derivative direction for development. Optimizations at the architectural level will certainly continue as well. We therefore remain relatively optimistic about application costs.
What is a 'Native Multimodal Model'?
Machine Heart: According to your company's description, you are in the 'native multimodal large model' track. Could you introduce the differences between this track and others, as well as the specific situation of domestic and international players in this track?
Tang Jiayu: Positioning ourselves in the native multimodal track means that from day one we have been committed to building a complete general multimodal large model, rather than training multiple models and combining their capabilities. Our approach starts from the underlying architecture and naturally considers supporting different data inputs and outputs with a single model. The characteristic of this approach is that the knowledge learned by the model is more comprehensive, and in use there is no need to call different models in combination, which improves inference efficiency.
To give a concrete example, GPT-4 supports text-to-text, DALL·E 3 supports text-to-image, and GPT-4V can take both text and images as input but outputs only text. Open visual tasks are then handled by calling the interfaces of DALL·E 3 or GPT-4V. In contrast, the native technical route is a unified underlying architecture that combines "GPT-4V + DALL·E 3" and can handle a wide range of open-domain text and visual interaction scenarios. The main international players in this field are Google (Gemini) and OpenAI (Sora). Domestically, we are the earliest, and possibly the only, company persistently dedicated to developing general-purpose multimodal large models.
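To make the contrast concrete, here is a purely hypothetical sketch: the combinatorial route dispatches each request to a separate specialized model, while the native route sends any mix of modalities through one model. None of the function or parameter names below are real APIs; they are placeholders for illustration only.

```python
# Hypothetical contrast between the combinatorial and native multimodal routes.
from typing import Any, Callable, Dict, Optional

def combinatorial_route(text_model: Callable, image_model: Callable, vision_model: Callable,
                        text: Optional[str] = None, image: Optional[bytes] = None,
                        want_image: bool = False) -> Any:
    """A dispatcher picks one specialized model per request."""
    if image is not None and not want_image:
        return vision_model(text, image)   # text + image in, text out (GPT-4V-like role)
    if want_image:
        return image_model(text)           # text in, image out (DALL·E-3-like role)
    return text_model(text)                # text in, text out (GPT-4-like role)

def native_route(unified_model: Callable, text: Optional[str] = None,
                 image: Optional[bytes] = None, want=("text", "image")) -> Dict[str, Any]:
    """One model accepts and emits any mix of modalities through a single call."""
    return unified_model({"text": text, "image": image}, outputs=want)
```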
Machine Heart: From a product perspective, how would you define 'native'?
Tang Jiayu: From a product perspective, we primarily consider whether the user experience brought by the product has seen exponential improvement with the support of native multimodal models. Concepts like 'what you think is what you get' and 'what you say is what you get' represent such exponential enhancements. Whether it's generating images, 3D models, or videos, our efforts are all directed toward this goal: enabling even those without any professional skills to create the visuals they desire or materialize their imagination in the digital or physical world. One of my personal benchmarks is whether my relatives and friends would ultimately enjoy using such a product.
The Business Opportunities Brought by Sora
Machine Heart: In the debate over whether Sora understands the physical world, François Chollet, the creator of Keras, mentioned that this question is crucial because it determines the scope of applications for generated images and videos: whether they are limited to media production or can serve as reliable simulations of the real world. Considering these two scenarios, what new commercial opportunities will Sora's release bring?
Tang Jiayu: I think the former mainly corresponds to content production in the digital world. In the digital world, the content we encounter daily spans industries such as television, movies, advertising, education, and social entertainment. Because video formats are so widely used in our daily lives, even if we only consider video-related scenarios, its application potential is already incredibly vast. If it can understand the physical world, its scope of application will not be limited to the digital world but can interact with the physical world. For example, it can be combined with robots to achieve embodied intelligence, used for autonomous driving, or applied to digital twins. The previous approach of building small models one by one might miss many corner cases. If the model can truly grasp the rules of the physical world, we could use a general model to handle all cognitive and simulation tasks related to the physical world, which might significantly advance the evolution of societal operations.