Sora Propels AI Video into a New Era, with Vast Prospects for AI Video Generation
Artificial intelligence has lifted video technology beyond advanced imaging to a new level. The integration of video imaging with AI has unlocked vast amounts of new data, useful not only for traditional physical-security applications but also for deeper analysis of past, present, and even future events across entire enterprises.
In the early hours of February 16, OpenAI unveiled its first text-to-video model, Sora. Sora can directly generate videos up to 60 seconds long, featuring highly detailed backgrounds, complex multi-angle shots, and emotionally expressive characters.
Sora is built on a Diffusion Transformer architecture, which lets it process videos and images of varying durations, resolutions, and aspect ratios, with the goal of serving as a "world simulator" that understands real-world motion and physics. Currently, 48 video demos have been posted on the official website. In these demos, Sora not only renders details accurately but also shows an understanding of how objects exist in the physical world, generating characters with rich emotions. The model can also produce videos from text prompts or still images, or fill in missing frames of existing videos.
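To make the idea concrete, the sketch below (a rough illustration only, not OpenAI's implementation) shows how a video tensor can be cut into "spacetime patches," the token-like units that OpenAI's technical report says the diffusion Transformer operates on; the patch sizes and toy video shape here are arbitrary assumptions.

```python
# Illustrative sketch of spacetime patchification (assumed patch sizes, toy data).
import numpy as np

def video_to_spacetime_patches(video, pt=4, ph=16, pw=16):
    """Split a (T, H, W, C) video array into flattened spacetime patch tokens."""
    T, H, W, C = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    patches = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    patches = patches.transpose(0, 2, 4, 1, 3, 5, 6)   # group the patch-grid axes first
    return patches.reshape(-1, pt * ph * pw * C)       # one row per patch token

toy_video = np.random.rand(16, 64, 64, 3)              # 16 frames of 64x64 RGB noise
tokens = video_to_spacetime_patches(toy_video)
print(tokens.shape)                                    # (64, 3072): 4*4*4 = 64 patch tokens
```

Treating video as such a token sequence is what allows one model to handle clips of different lengths and resolutions; a diffusion Transformer then learns to denoise these sequences.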
CNN noted that while "multimodal large models" are now common, Sora stands out for the length and accuracy of the videos it produces and could have a significant impact on the digital entertainment industry.
As for the technology and working principles, OpenAI explained that Sora, built on the Transformer architecture, offers exceptional scalability. Building on past research into DALL·E and GPT, it also uses DALL·E 3's re-captioning technique to generate highly descriptive captions for its visual training data. Before Sora, Google released a new video generation model called VideoPoet on December 21 last year, capable of tasks such as text-to-video, image-to-video, and video stylization, and the overnight popularity of the text-to-video tool Pika further fueled the AI video boom. Commenting on Sora's debut, Zhou Hongyi, founder and chairman of 360 Group, wrote on Weibo on February 16 that it could shorten the timeline for achieving AGI from 10 years to just 1 year.
AI video refers to video content generated or edited using artificial intelligence technology. This typically involves deep learning and computer vision technologies, enabling machines to understand and generate video content or automatically edit existing videos.
In the field of AI video generation, some software tools allow users to generate videos through text descriptions. These tools usually employ natural language processing and image generation technologies to transform text descriptions into visual content. Additionally, some tools can generate new video content from images or existing videos, often leveraging deep learning and computer vision technologies.
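As a concrete example of this kind of tool, the snippet below sketches a text-to-video call with the open-source ModelScope checkpoint "damo-vilab/text-to-video-ms-1.7b" through Hugging Face's diffusers library. This is unrelated to Sora itself, and the exact output handling can differ between diffusers versions; the prompt and output file name are placeholders.

```python
# Sketch of an open-source text-to-video pipeline (not Sora); needs a CUDA GPU
# plus the diffusers, transformers, and accelerate packages installed.
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

pipe = DiffusionPipeline.from_pretrained(
    "damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

prompt = "A panda riding a bicycle through a snowy forest"
result = pipe(prompt, num_inference_steps=25)
frames = result.frames[0]   # newer diffusers returns a batch; older versions expose result.frames directly
video_path = export_to_video(frames, output_video_path="panda.mp4")
print(video_path)
```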
AI Video Generation Holds Vast Potential
In terms of AI video editing, artificial intelligence can automate clipping and post-processing. For example, AI can analyze video content to automatically select the best shots, or adjust parameters such as color, brightness, and contrast to enhance video quality.
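As a toy illustration of this kind of automatic adjustment (real products rely on learned models rather than a fixed rule), the sketch below uses OpenCV to brighten and stretch contrast on frames whose average luminance falls below an assumed target; the file names, gain, and threshold are placeholders.

```python
# Minimal rule-based brightness/contrast correction over a video file with OpenCV.
import cv2

def auto_adjust(frame, target_mean=120.0):
    """Brighten and stretch contrast when a frame's mean luminance is low."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    mean = gray.mean()
    if mean < target_mean:
        alpha = 1.2                      # contrast gain (assumed heuristic)
        beta = target_mean - mean        # brightness offset
        frame = cv2.convertScaleAbs(frame, alpha=alpha, beta=beta)
    return frame

cap = cv2.VideoCapture("input.mp4")      # hypothetical input file
writer = None
while True:
    ok, frame = cap.read()
    if not ok:
        break
    frame = auto_adjust(frame)
    if writer is None:                   # create the writer once the frame size is known
        h, w = frame.shape[:2]
        fourcc = cv2.VideoWriter_fourcc(*"mp4v")
        writer = cv2.VideoWriter("output.mp4", fourcc, 30.0, (w, h))
    writer.write(frame)
cap.release()
if writer is not None:
    writer.release()
```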
AI video has broad applications across multiple fields, including film production, advertising, news reporting, and social media. Data show that by the end of 2023, the short video user base in China alone exceeded 1 billion people. Even setting aside the potential of incremental markets, simply providing AI video creation services to these more than 1 billion users represents a significant opportunity.

Facing the vast prospects of AI video generation, domestic manufacturers are increasing their investments to push the field into a new era. In November last year, ByteDance released the PixelDance model, which not only achieved breakthroughs in video duration but also generates videos with complex scenes and actions from a text description plus first-frame and last-frame image guidance; the last frame of one video segment then guides the first frame of the next segment, as illustrated in the sketch below.
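The chaining idea can be sketched as follows; `generate_segment` is a hypothetical stand-in for the model call (this is not ByteDance's actual API), and the sketch only shows how each generated segment's last frame becomes the first-frame guidance for the next segment.

```python
# Illustrative segment-chaining loop; generate_segment is a hypothetical callable
# that returns a list of frames given a prompt and optional frame guidance.
def generate_long_video(prompt, first_frame, last_frame_hint, num_segments, generate_segment):
    """Chain short generated segments; each segment's last frame guides the next one."""
    segments = []
    current_first = first_frame
    for i in range(num_segments):
        # Only the final segment is steered toward the user-provided last-frame guidance.
        target_last = last_frame_hint if i == num_segments - 1 else None
        segment = generate_segment(prompt, first_frame=current_first, last_frame=target_last)
        segments.append(segment)
        current_first = segment[-1]      # last generated frame guides the next segment
    return [frame for segment in segments for frame in segment]
```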
At the beginning of this year, Wondershare Technology launched Wondershare "SkyCanvas," China's first multimedia large model centered on audio and video. Positioned as a vertical large model for audio-visual multimedia creation, Wondershare "SkyCanvas" comprises video, audio, image, and language models; its capabilities span today's mainstream language, audio, and image models, with text-to-video as one sub-capability. Targeting niche vertical markets such as general knowledge, marketing, and entertainment, Wondershare "SkyCanvas" has already seen large-scale commercial application overseas.
Following the introduction of text-to-video models by overseas companies like OpenAI and Google, leading domestic players have entered the arena. Industry leaders in video analysis, such as Hikvision, Dahua Technology, and Ezviz, are actively engaged in multimodal large model research and industrial applications.
According to the "AIGC/AI-Generated Content Industry Outlook Report" released by Quantum Bit, video generation is expected to become a medium-to-high-potential scenario for cross-modal generation in the near term. The underlying logic is the shift in mainstream content forms driven by successive technologies. NVIDIA senior scientist Jim Fan commented that 2022 was the "year of images," 2023 the "year of sound waves," and 2024 will be the "year of video."
Computing power limitations may be a key reason Sora is not yet publicly available. As AIGC technology gradually permeates fields such as film and television, promotional videos, self-media, and gaming, video creation efficiency is expected to improve significantly. At the same time, the volume of interactive data for video will grow massively compared with text and images, potentially driving a rapid expansion in demand for computing power. In the view of industry insiders, Sora is an important milestone on the road to AGI (artificial general intelligence). Its emergence has focused global attention on video generation, validating the trend that "video is king" and further confirming an era in which "no video means no dissemination," while also expanding the application market for "video + large models."