New AI Model KOSMOS-G: Achieving Zero-Shot High-Fidelity Image Generation

baoshi.rao

Recent advancements in image generation technology have been remarkable, particularly in generating images from text descriptions and combining text and images to create new visuals. However, one underexplored area is generating images from generalized visual-language inputs, such as descriptions involving multiple objects and characters. Researchers from Microsoft Research, New York University, and the University of Waterloo have introduced KOSMOS-G, a model that leverages multimodal LLMs to address this challenge.

KOSMOS-G can create detailed images from complex combinations of text descriptions and multiple pictures, even if it has never encountered such examples before. It is the first model capable of generating images from descriptions that include various objects or elements. KOSMOS-G can serve as an alternative to CLIP, opening up new application possibilities for technologies like ControlNet and LoRA.

KOSMOS-G employs an innovative approach to generate images from both text and images. It first trains a multimodal LLM (capable of understanding both text and images) and then aligns it with a CLIP text encoder (specialized in text comprehension). When provided with captions containing text and segmented images, KOSMOS-G is trained to produce images that match the descriptions and follow instructions. It achieves this by using a pre-trained image decoder and leveraging knowledge learned from images to generate accurate visuals in different contexts.

KOSMOS-G can generate images based on instructions and input data. It undergoes three training phases. In the first phase, the model is pre-trained on a multimodal corpus. In the second phase, an AlignerNet is trained under CLIP supervision to align KOSMOS-G's output space with U-Net's input space. In the third phase, KOSMOS-G is fine-tuned by performing compositional generation tasks on carefully curated data. In Phase 1, only the MLLM is trained. In Phase 2, the AlignerNet is trained with the MLLM frozen. In Phase 3, both the AlignerNet and MLLM are jointly trained. The image decoder remains frozen throughout all phases.

KOSMOS-G excels in zero-shot image generation across diverse settings. It can produce meaningful, visually appealing images that are customizable as needed. The model can modify contexts, apply specific styles, make adjustments, and add additional details to images. KOSMOS-G is the first model capable of achieving multi-entity VL2I in zero-shot settings.

KOSMOS-G can seamlessly replace CLIP in image generation systems, opening exciting new possibilities for previously unattainable applications. By building upon CLIP, KOSMOS-G is poised to shift image generation from text-based to a combination of text and visual information, creating opportunities for numerous innovative applications.

KOSMOS-G is a model that generates detailed images from text and multiple images. It employs a unique training strategy called "alignment before guidance." The model excels at creating images of individual objects and is the first to achieve this capability with multiple objects. It can also replace CLIP and be used alongside other technologies like ControlNet and LoRA for novel applications. In essence, KOSMOS-G represents an initial step toward shaping image generation into a language.