AI Real-Time Painting System StreamMultiDiffusion Supports Region Drawing + Prompt-Based Image Generation

baoshi.rao

Recently, a paper titled "StreamMultiDiffusion" proposed a novel real-time, interactive text-to-image generation system. This system can generate images based on user-provided hand-drawn areas and corresponding semantic text prompts, providing professional image creators with a powerful tool for rapid prototyping and creative exploration.

Project address: https://github.com/ironjr/StreamMultiDiffusion

Diffusion models have achieved great success in text-to-image synthesis and have become promising candidates for image generation and editing. However, putting these models to practical use still faces two major challenges: faster inference and finer user control over what the model generates. Both must be met at the same time for real-world use. To address this, the authors propose StreamMultiDiffusion, which they describe as the first real-time region-based text-to-image generation framework. By stabilizing fast inference techniques and restructuring the model into a newly proposed multi-prompt stream batch architecture, it generates panoramas faster than existing solutions and reaches 1.57 FPS for region-based text-to-image synthesis on a single RTX 2080 Ti GPU.
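To make "region-based" generation more concrete, the following is a minimal sketch of mask-weighted blending of per-prompt latents, in the spirit of MultiDiffusion-style aggregation. It is not the authors' code: the function name `blend_region_latents`, the array shapes, and the use of NumPy in place of a real diffusion pipeline are all assumptions made for illustration.

```python
import numpy as np


def blend_region_latents(latents, masks, eps=1e-8):
    """Blend per-prompt latents into one latent using region masks.

    Hypothetical helper for illustration only.
    latents: array of shape (P, C, H, W), one latent per prompt.
    masks:   array of shape (P, H, W), weights in [0, 1] per prompt.
    Returns an array of shape (C, H, W).
    """
    latents = np.asarray(latents, dtype=np.float64)
    # Add a channel axis so the masks broadcast over C: (P, 1, H, W).
    weights = np.asarray(masks, dtype=np.float64)[:, None, :, :]
    # Sum of weights at every pixel, shape (1, H, W).
    total = weights.sum(axis=0)
    # Weighted average; pixels covered by no mask come out as 0.
    return (weights * latents).sum(axis=0) / np.maximum(total, eps)
```

Where two masks overlap, the result is the weighted mean of the two latents, which is what lets neighbouring regions blend instead of forming a hard seam.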

The framework introduces several key techniques. The first is Latent Pre-Averaging, which averages the intermediate latent representations at each inference step so that the method stays compatible with fast inference algorithms. The second is Mask-Centering Bootstrapping, which aligns the center of each mask with the center of the image during the first few generation steps so that objects are not cut off at the mask edges. The third is Quantized Masks, which control how tightly each prompt mask is applied by quantizing it, so that generated regions blend smoothly across different noise levels.

In addition, StreamMultiDiffusion introduces a new concept called the Semantic Palette, an interactive image generation paradigm in which users create high-quality images in real time from hand-drawn regions and text prompts. It is like painting on a canvas with a brush, except that each "color" is a text prompt paired with a mask. For example, a user can assign a person prompt to a region painted in red and mark the ears and tail as a dog, and the system will generate a person with dog ears and a tail that follows the painted regions.
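The mask-centering and quantized-mask ideas described above can be sketched in a few lines of NumPy. This is one plausible reading of the descriptions here, not the project's implementation: the helper names `center_shift`, `center_mask`, and `quantize_mask` are hypothetical, the soft mask is assumed to be already blurred, and thresholding the soft mask at each noise level is an assumption about how the quantization works.

```python
import numpy as np


def center_shift(mask):
    """Return the (dy, dx) shift that moves the mask centroid to the image center."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return 0, 0  # empty mask: nothing to move
    h, w = mask.shape
    dy = int(round((h - 1) / 2 - ys.mean()))
    dx = int(round((w - 1) / 2 - xs.mean()))
    return dy, dx


def center_mask(mask):
    """Shift a 2-D mask so that its centroid sits at the image center.

    np.roll wraps around the borders, which is a simplification
    chosen here to keep the sketch short.
    """
    dy, dx = center_shift(mask)
    return np.roll(mask, shift=(dy, dx), axis=(0, 1))


def quantize_mask(soft_mask, noise_levels):
    """Turn one soft mask into a stack of binary masks, one per noise level.

    soft_mask:    (H, W) array with values in [0, 1], assumed pre-blurred.
    noise_levels: 1-D sequence of thresholds, one per denoising step.
    Returns an array of shape (T, H, W). With decreasing thresholds,
    the covered area can only grow from one step to the next.
    """
    soft = np.asarray(soft_mask, dtype=np.float32)
    levels = np.asarray(noise_levels, dtype=np.float32)
    return (soft[None, :, :] > levels[:, None, None]).astype(np.float32)
```

In this sketch, early steps (high noise) use a tight mask and later steps use a wider one, which is one way to obtain the smooth blending between regions that the article describes.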

The experimental results in the paper show that StreamMultiDiffusion is significantly faster than the existing MultiDiffusion method at both panoramic image generation and region-based text-to-image synthesis, while maintaining image quality. These results suggest that the system has considerable potential and value in practical applications.