MotionDirector: A Novel AI Approach for Customized Video Generation

baoshi.rao

Text-to-video diffusion models have made remarkable progress, allowing users to create realistic or imaginative videos simply by providing text descriptions. These foundation models have also been fine-tuned to generate content that matches specific appearances, styles, and themes.

However, the customization of motion in text-to-video generation remains an area requiring deeper exploration. Users may wish to create videos with specific motions, such as a car moving forward and then turning left. Therefore, adapting diffusion models to produce more tailored content that meets user needs becomes crucial.

Project page: https://showlab.github.io/MotionDirector/

To address this challenge, researchers have proposed MotionDirector, a dual-path architecture that trains a model to learn appearance and motion separately from one or more reference videos. This makes it possible to customize motion while keeping the generated appearance diverse.

The spatial path consists of a foundation model with trainable spatial LoRAs (Low-Rank Adaptations) integrated into its transformer layers for each video. These LoRAs are trained on a single, randomly selected frame at each training step to capture the visual attributes of the input videos. The temporal path, in contrast, replicates the foundation model and shares the spatial LoRAs with the spatial path so that it adapts to the appearance of the given input videos. In addition, the temporal transformers in this path are trained with temporal LoRAs on multiple frames selected from the input videos to capture their inherent motion patterns.
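The per-step frame selection described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names sample_spatial_batch and sample_temporal_batch, the clip length of 8, and the choice of a contiguous clip are assumptions made for the example, and plain integers stand in for image frames.

```python
import random


def sample_spatial_batch(video, rng):
    """Spatial path: one randomly chosen frame per training step."""
    index = rng.randrange(len(video))
    return [video[index]]


def sample_temporal_batch(video, rng, num_frames=8):
    """Temporal path: several frames, kept in order so motion is visible.

    Assumption: a contiguous clip of `num_frames` frames. The article only
    states that multiple frames are selected from the input video.
    """
    if num_frames > len(video):
        raise ValueError("video is shorter than the requested clip length")
    start = rng.randrange(len(video) - num_frames + 1)
    return video[start:start + num_frames]


# Stand-in "video": frame indices instead of image tensors.
video = list(range(16))
rng = random.Random(0)

spatial_frames = sample_spatial_batch(video, rng)
temporal_frames = sample_temporal_batch(video, rng)
print(len(spatial_frames), len(temporal_frames))  # prints "1 8"
```

A single frame carries no motion information, which is why it suits appearance learning, while an ordered run of frames exposes how the content changes over time.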

By deploying trained temporal LoRAs, the foundational model can synthesize videos with diverse appearances while maintaining learned motions. This dual-path architecture allows the model to separately learn the appearance and motion of objects in videos, enabling MotionDirector to isolate and then combine these elements from different source videos.
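To make the role of a LoRA concrete, here is a minimal sketch of a generic low-rank adapter on a single linear layer, written in plain Python. It is not MotionDirector's implementation: the class name LoRALinear, the scale parameter, and the enabled flag are hypothetical and only illustrate the idea that the base weight stays frozen while a small low-rank update can be attached or left out at inference time.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen weight W plus a low-rank update B @ A (generic LoRA sketch)."""

    def __init__(self, weight, lora_a, lora_b, scale=1.0):
        self.weight = weight    # frozen base weight, shape (out, in)
        self.lora_a = lora_a    # trainable down-projection, shape (rank, in)
        self.lora_b = lora_b    # trainable up-projection, shape (out, rank)
        self.scale = scale      # hypothetical scaling factor for the update
        self.enabled = True     # hypothetical switch: apply the adapter or not

    def forward(self, x):
        base = matvec(self.weight, x)
        if not self.enabled:
            return base
        update = matvec(self.lora_b, matvec(self.lora_a, x))
        return [b + self.scale * u for b, u in zip(base, update)]


# Toy example: 2x2 identity base weight with a rank-1 adapter.
layer = LoRALinear(
    weight=[[1.0, 0.0], [0.0, 1.0]],
    lora_a=[[1.0, 1.0]],
    lora_b=[[0.5], [0.0]],
)
with_adapter = layer.forward([1.0, 2.0])

layer.enabled = False
without_adapter = layer.forward([1.0, 2.0])

print(with_adapter, without_adapter)  # prints "[2.5, 2.0] [1.0, 2.0]"
```

Because the base weights are untouched, the same foundation model can be reused with different adapters, which is what allows a learned motion to be paired with a different appearance.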

Researchers evaluated MotionDirector on two benchmarks covering more than 80 different motions and 600 text prompts. On the UCF Sports Action benchmark, human evaluators preferred MotionDirector for motion fidelity approximately 75% of the time, compared with roughly 25% for the baseline models.

On the second benchmark, LOVEU-TGVE-2023, MotionDirector outperformed other controllable generation and adaptation methods. These results indicate that MotionDirector can customize multiple base models to generate videos with diverse appearances and the desired motion concepts.

In summary, MotionDirector is a promising new method for adapting text-to-video diffusion models to generate videos with specific motions. It excels in learning and adapting to specific motions of objects and cameras and can be used to produce videos with various visual styles.

There is still room for improvement in learning the motions of multiple subjects in reference videos. Even with this limitation, MotionDirector has the potential to make video generation more flexible, enabling users to create customized videos that meet their needs.