OpenAI Releases First Video Generation Model, Delivering Smooth and High-Definition 1-Minute Videos with Stunning Effects!
Just now, Altman unveiled OpenAI's first video generation model, Sora.
It inherits the image quality and instruction-following capabilities of DALL·E 3 and can generate high-definition videos up to one minute long.
Altman unveils OpenAI's first video generation model

AI's imagination of the Year of the Dragon Spring Festival features red flags waving and bustling crowds.
Children curiously follow the dragon dance procession, while many others pull out their phones to capture the moment, with each character exhibiting unique behaviors.
The neon reflections on Tokyo's rain-soaked streets rival the visual effects of RTX ON.
Passing trains occasionally block the view outside the window, creating stunning momentary reflections of passengers inside.

You could also create a Hollywood-style movie trailer with blockbuster quality:
In an ultra-close-up vertical shot, this lizard shows every intricate detail:
Netizens exclaimed 'game over', fearing they might lose their jobs:
Some have even begun 'mourning' the entire industry:

OpenAI stated that it is teaching AI to understand and simulate the physical world in motion, with the goal of training models to help people solve problems requiring real-world interaction.
Generating videos from text prompts is merely one step in the broader plan.
Currently, Sora can already generate complex scenes with multiple characters and specific movements, not only understanding the requirements in the user's prompt but also grasping how those objects exist in the physical world. For example, when a large group of paper airplanes flies through a forest, Sora understands what happens when they collide and accurately depicts the resulting changes in light and shadow.
A flock of paper airplanes dances gracefully through dense jungle foliage, weaving between trees like migratory birds.
Sora can also create multiple shots within a single video, leveraging its deep understanding of language to precisely interpret prompts while preserving character consistency and visual style.
Beautiful, snow-covered Tokyo bustles with activity. The camera glides through crowded city streets, following several people enjoying the snowy weather and shopping at nearby stalls. Gorgeous cherry blossom petals flutter in the wind alongside snowflakes.

OpenAI does not shy away from discussing Sora's current weaknesses, pointing out that it may struggle to accurately simulate the physics of complex scenes and may fail to comprehend causality.
For example, in a scenario like "five gray wolf pups playing and chasing each other on a remote gravel road," the number of wolves may fluctuate, with some appearing or disappearing inexplicably.
The model may also confuse spatial details in prompts, such as mixing up left and right, and could have difficulty precisely describing events that unfold over time, such as following a specific camera trajectory.

In the prompt 'a basketball passing through the hoop and then exploding', the basketball wasn't properly blocked by the hoop.
Technically, OpenAI hasn't revealed much about Sora. Here's a brief introduction:
Sora is a diffusion model: it starts from noise and can generate an entire video at once or extend the length of an existing one. The key is that it predicts many frames at a time, which keeps the main subject consistent even when it temporarily leaves the view.
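OpenAI has not published the sampling code, but the idea of denoising an entire clip from random noise can be sketched. The following is a minimal, hypothetical Python/PyTorch illustration, not Sora's actual implementation: the `denoiser` network, the tensor shapes, and the simplified update rule are all assumptions.

```python
import torch

# Minimal sketch of video diffusion sampling (assumptions, not Sora's real code):
# `denoiser` is a hypothetical network that predicts the noise in a noisy video
# latent, conditioned on a text embedding and the diffusion timestep.
def sample_video(denoiser, text_emb, frames=64, channels=4, h=32, w=32, steps=50):
    # Start from pure Gaussian noise for the *entire* clip, so all frames are
    # denoised jointly -- this is what keeps the subject consistent across frames.
    x = torch.randn(1, frames, channels, h, w)
    for step in reversed(range(1, steps + 1)):
        t = torch.full((1,), step, dtype=torch.long)
        with torch.no_grad():
            predicted_noise = denoiser(x, t, text_emb)
        # Simplified update: remove a fraction of the predicted noise each step.
        # A real sampler (DDPM, DDIM, ...) would use a proper noise schedule here.
        x = x - predicted_noise / steps
    return x  # denoised video latent, ready to be decoded into RGB frames
```

Under the same assumptions, extending a video would amount to keeping the known frames fixed and denoising only the new ones.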
Similar to GPT models, Sora utilizes the Transformer architecture, which provides strong scalability.
In terms of data, OpenAI represents videos and images as patches, similar to tokens in GPT. Through this unified data representation, models can be trained on a broader range of visual data than before, covering different durations, resolutions, and aspect ratios.
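As a rough illustration of what "patches as tokens" can look like, here is a small sketch that slices a video tensor into fixed-size spacetime patches and flattens each one into a token vector. The patch sizes and tensor layout are illustrative assumptions; OpenAI has not published Sora's actual values.

```python
import torch

def video_to_patches(video, patch_t=4, patch_h=16, patch_w=16):
    """Split a video tensor (T, C, H, W) into flattened spacetime patches.

    Each patch covers patch_t frames and a patch_h x patch_w spatial region,
    and becomes one 'token' vector -- analogous to tokens in a language model.
    """
    t, c, h, w = video.shape
    assert t % patch_t == 0 and h % patch_h == 0 and w % patch_w == 0
    patches = (
        video.reshape(t // patch_t, patch_t, c, h // patch_h, patch_h, w // patch_w, patch_w)
             .permute(0, 3, 5, 1, 2, 4, 6)                   # group by (time, row, col) patch index
             .reshape(-1, patch_t * c * patch_h * patch_w)    # one flat vector per patch
    )
    return patches  # shape: (num_patches, patch_dim)

# Example: a 16-frame, 3-channel, 256x256 clip becomes 1024 tokens of dimension 3072.
tokens = video_to_patches(torch.randn(16, 3, 256, 256))
print(tokens.shape)  # torch.Size([1024, 3072])
```

Because any clip, whatever its duration, resolution, or aspect ratio, can be cut into the same kind of patches, a single Transformer can consume them all as one token sequence.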
Sora builds upon past research on DALL·E and GPT models. It utilizes DALL·E 3's recaptioning technique to generate highly descriptive annotations for visual training data, enabling it to more faithfully follow user text instructions. In addition to generating videos solely based on text instructions, the model can also take existing static images and generate videos from them, accurately animating the image content while paying attention to small details.
The model can also take existing videos and extend them or fill in missing frames. Please refer to the technical paper for more information (to be released later).
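The recaptioning step mentioned above is also simple to sketch in outline: replace the short, noisy captions that come with training videos with long, detailed descriptions produced by a captioning model, then train the generator on those richer pairs. The sketch below is purely illustrative; `caption_model.describe` and the dataset structure are hypothetical names, not OpenAI's pipeline.

```python
# Hypothetical recaptioning pass over a video training set (illustration only).
# `caption_model` is assumed to be a vision-language model that can write a
# long, detailed description of a clip (subjects, camera motion, lighting, ...).
def recaption_dataset(dataset, caption_model):
    recaptioned = []
    for clip, _original_short_caption in dataset:
        detailed_caption = caption_model.describe(clip)
        recaptioned.append((clip, detailed_caption))
    return recaptioned

# The video generator is then trained on (detailed caption, clip) pairs, which
# is what lets it follow long, specific user prompts more faithfully.
```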
Sora is a foundation for models that can understand and simulate the real world, and OpenAI believes this capability will be an important milestone in achieving AGI. Currently, some visual artists, designers, and filmmakers (including OpenAI employees) have gained access to Sora.
They have been posting new works nonstop, and Altman has even started taking requests online.
Send your prompt to @sama, and you might receive a generated video in response.
Reference: [1] https://openai.com/sora