HKU and Kuaishou Kling Break Through the Long-Video Consistency Bottleneck with 'Memory Retrieval' Technology
AIbase reports that the University of Hong Kong and the Kuaishou Kling team have published a paper titled 'Context as Memory: Scene-Consistent Interactive Long Video Generation with Memory Retrieval', proposing a 'Context-as-Memory' method that addresses the core challenge of scene-consistency control in long video generation.
Innovative Concept: Historical Context as a 'Memory' Carrier
The core innovation of this research is to treat previously generated context as 'memory', using in-context learning to keep scenes highly consistent across a long video. The team also found that video generation models can implicitly learn 3D priors from video data alone, without explicit 3D modeling, an observation that aligns with Google's Genie 3.
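To make the idea concrete, here is a minimal sketch of what such a generation loop might look like. This is an illustrative assumption, not the paper's code: `model.generate` and the `retrieve` callback are hypothetical interfaces standing in for the video model and the memory-retrieval step described in the next section.

```python
# Conceptual sketch only: `model` and `retrieve` are hypothetical
# interfaces, not APIs from the paper or any released library.
def generate_long_video(model, first_frame, trajectory, retrieve):
    """Autoregressively extend a video, treating past frames as memory."""
    frames = [first_frame]
    for camera_pose in trajectory:
        # Select the past frames relevant to the upcoming view and feed
        # them back to the model as its conditioning context ("memory").
        memory_frames = retrieve(camera_pose, frames)
        frames.append(model.generate(condition=memory_frames,
                                     pose=camera_pose))
    return frames
```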
Technological Breakthrough: FOV Memory Retrieval Mechanism Significantly Improves Efficiency
Conditioning on the full history of frames is computationally unbounded as a video grows, so the research team proposed a memory retrieval mechanism based on the camera trajectory's field of view (FOV). The mechanism selects, from all historical frames, those most relevant to the frames currently being generated and uses only these as memory conditions, significantly improving computational efficiency and reducing training cost.
Through this dynamic retrieval strategy, the system judges the relevance between the frames to be predicted and historical frames by the FOV overlap of their camera poses, greatly shrinking the context the model must attend to and yielding substantial gains in training and inference efficiency.
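As a rough illustration of how FOV-based retrieval could work, the sketch below scores each historical camera by how much its view sector overlaps that of the frame about to be generated, then keeps the top-k. This is a simplified assumption for illustration: it reduces cameras to a 2D top-down approximation, and all names, thresholds, and sampling choices here are ours, not the paper's.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class CameraPose:
    position: np.ndarray     # (x, y) position on the ground plane
    heading: float           # viewing direction in radians
    fov: float               # horizontal field of view in radians
    max_depth: float = 10.0  # how far the camera "sees"

def _sector_points(cam: CameraPose, n: int = 256) -> np.ndarray:
    """Sample points uniformly inside a camera's 2D view sector."""
    rng = np.random.default_rng(0)
    angles = cam.heading + (rng.random(n) - 0.5) * cam.fov
    radii = cam.max_depth * np.sqrt(rng.random(n))  # area-uniform radius
    return cam.position + np.stack(
        [radii * np.cos(angles), radii * np.sin(angles)], axis=1
    )

def _contains(cam: CameraPose, pts: np.ndarray) -> np.ndarray:
    """Boolean mask: which points fall inside cam's view sector."""
    rel = pts - cam.position
    dist = np.linalg.norm(rel, axis=1)
    ang = np.arctan2(rel[:, 1], rel[:, 0])
    dang = np.abs((ang - cam.heading + np.pi) % (2 * np.pi) - np.pi)
    return (dist <= cam.max_depth) & (dang <= cam.fov / 2)

def fov_overlap(a: CameraPose, b: CameraPose) -> float:
    """Monte-Carlo estimate of the fraction of a's view visible to b."""
    return _contains(b, _sector_points(a)).mean()

def retrieve_memory_frames(target: CameraPose,
                           history: list[CameraPose],
                           k: int = 8) -> list[int]:
    """Indices of the k historical frames with the largest FOV overlap."""
    scores = [fov_overlap(target, cam) for cam in history]
    return sorted(np.argsort(scores)[::-1][:k].tolist())

if __name__ == "__main__":
    # Twenty cameras moving along the x-axis, all facing the same way;
    # the ones nearest the target view should be retrieved.
    history = [CameraPose(np.array([float(i), 0.0]), heading=0.0,
                          fov=np.deg2rad(90)) for i in range(20)]
    target = CameraPose(np.array([18.0, 0.0]), heading=0.0,
                        fov=np.deg2rad(90))
    print(retrieve_memory_frames(target, history, k=4))
```

A function like `retrieve_memory_frames` could play the role of the `retrieve` callback in the earlier generation-loop sketch, keeping the conditioning context small and view-relevant no matter how long the video grows.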
Data Construction and Application Scenarios
The research team built a diverse long-video dataset with precise camera-trajectory annotations using Unreal Engine 5, providing a solid foundation for validating the method. At inference time, a user needs only to provide an initial image to explore the generated virtual world along a chosen camera trajectory.
Performance Surpasses Existing Methods
Experimental results show that Context-as-Memory maintains static-scene memory over spans of several tens of seconds and generalizes well across different scenes. Compared with existing SOTA methods, it delivers significant improvements in scene memory for long video generation and preserves memory consistency even in unseen, open-domain scenes.
This breakthrough marks an important step for AI video generation toward longer durations and stronger consistency, opening new possibilities for applications such as virtual world construction and film production.