Intensified Competition in AI Image Generation: An In-depth Analysis of the AI Art Field
Since the advent of generative AI, scenes reminiscent of the industrial revolution have been playing out daily.
In image generation alone, a flurry of groundbreaking models from companies and universities has delivered dazzling experiences. Where AI-generated art initially stoked fears that human artists would be replaced, the growing variety and volume of these tools have since turned the field into an arena of fierce competition: new 'deities' keep dethroning the old guard, embodying the saying 'kings everywhere, fleeting yet glorious.'
Fivefold Efficiency Boost in Text-to-Image Generation
Recently, Meta announced the development of an AI model named CM3Leon (pronounced like 'chameleon'), capable of generating ultra-high-resolution images from text, creating textual descriptions for images, and even editing images based on textual instructions.
CM3Leon is trained with a recipe adapted from text-only language models. The approach is straightforward yet powerful, and it demonstrates that tokenizer-based transformer models can be trained as effectively as existing diffusion-based generative models.
Even when trained on a dataset with only 3 billion text tokens, CM3Leon's zero-shot performance rivals that of larger models trained on more extensive datasets.
Meta states that CM3Leon requires only a fifth of the computing power of diffusion-based models such as Stable Diffusion and Midjourney, yet achieves state-of-the-art performance in text-to-image generation. It also excels at vision-language tasks such as visual question answering and long-form captioning. For example, CM3Leon can follow more complex prompts, change the color of the sky in an image under textual guidance, or add objects such as sinks and mirrors to specific locations in a room.
On the most widely used image generation benchmark, zero-shot MS-COCO, CM3Leon achieves an FID (Fréchet Inception Distance, a metric that measures the distance between the feature distributions of real and generated images; lower means closer) of 4.88, setting a new SOTA (state of the art) for text-to-image generation. It outperforms well-known models such as Google's Parti (FID 7.23), Stable Diffusion (FID 8.32), and OpenAI's DALL-E 2 (FID 10.39). The result highlights the potential of retrieval-augmented methods and underscores how much scaling strategy matters for autoregressive model performance.
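The FID computation itself is compact: it fits a Gaussian to the Inception features of the real images and of the generated images and measures the distance between the two distributions. Below is a minimal sketch of that calculation (not Meta's or the benchmark's official evaluation code; it assumes Inception feature vectors have already been extracted for both image sets):

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """real_feats, gen_feats: (N, D) arrays of Inception feature vectors."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):      # numerical error can add tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    # ||mu_r - mu_g||^2 + Tr(cov_r + cov_g - 2 * (cov_r cov_g)^(1/2))
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```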
CM3Leon combines the versatility and effectiveness of autoregressive models with low training cost and efficient inference. It is a causal masked multimodal (CM3) model: it can generate sequences of text and images conditioned on arbitrary interleaved sequences of other images and text, which greatly expands on earlier models limited to text-to-image or image-to-text tasks.
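A toy sketch of what such an interleaved token stream can look like (purely illustrative; the vocabulary size, boundary token, and helper below are assumptions, not CM3Leon's actual tokenization scheme):

```python
# Illustrative only: text becomes text-token ids, each image is quantized by an
# image tokenizer into discrete codes, and everything is flattened into one
# sequence over a combined vocabulary for next-token prediction.
TEXT_VOCAB = 56_000              # assumed text vocabulary size
IMAGE_BOUNDARY = TEXT_VOCAB      # assumed <image> marker token

def build_sequence(segments):
    """segments: list of ('text', [token ids]) or ('image', [codebook ids]) in any order."""
    seq = []
    for kind, toks in segments:
        if kind == "text":
            seq.extend(toks)
        else:
            seq.append(IMAGE_BOUNDARY)                        # mark the start of an image
            seq.extend(IMAGE_BOUNDARY + 1 + c for c in toks)  # shift image codes past text ids
    return seq

# Text-to-image and image captioning are just different orderings of the same stream.
t2i_prompt = build_sequence([("text", [17, 942, 3108]), ("image", [5, 77, 140])])
caption_prompt = build_sequence([("image", [5, 77, 140]), ("text", [17, 942, 3108])])
```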
Industry observers regard CM3Leon's capabilities as the current peak of the multimodal market. Meta describes CM3Leon as a major advance in image generation and understanding, but it also acknowledges potential data bias and calls for greater transparency and regulation in the field.
Computer Vision Ushers in Its GPT-4 Moment
Image segmentation is a cornerstone of image understanding and a critical research direction in computer vision (CV), playing a pivotal role in fields such as autonomous driving, drones, industrial quality inspection, and pathological image segmentation.
With the rise of deep learning, early image segmentation methods relying on low-level features like brightness, color, and texture have gradually been phased out. Neural network-based approaches have achieved significant breakthroughs—training deep neural networks enables the learning of higher-level, more abstract feature representations, leading to more accurate image segmentation.
In April of this year, Meta released SAM (Segment Anything Model), the first foundation model for image segmentation, together with the accompanying SA-1B dataset, instantly setting the AI community abuzz. SAM is a general-purpose segmentation model applicable to any scenario that requires recognizing and segmenting images. Driven by prompts, it can serve as a component in content creation, AR/VR, scientific research, or general AI systems, enabling multimodal processing.
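To give a concrete sense of what prompt-driven segmentation looks like, here is a minimal sketch using Meta's open-source segment-anything Python package (the checkpoint file name, placeholder image, and click coordinates are illustrative):

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load a pretrained SAM backbone (the checkpoint file must be downloaded separately).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# Placeholder RGB image; in practice this would be a real photo (H x W x 3, uint8).
image = np.zeros((768, 1024, 3), dtype=np.uint8)
predictor.set_image(image)

# A single foreground click is the prompt; SAM returns candidate masks with scores.
masks, scores, logits = predictor.predict(
    point_coords=np.array([[512, 384]]),   # (x, y) pixel location of the click
    point_labels=np.array([1]),            # 1 = foreground, 0 = background
    multimask_output=True,                 # return several plausible masks
)
```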
SAM markedly improves on the segmentation ability of conventional CV models and remains robust even in ambiguous or unfamiliar scenes, potentially lowering the barrier to entry for machine perception. NVIDIA AI scientist Jim Fan remarked, 'SAM is the GPT-3 moment for computer vision.'
However, just three months later, SAM's dominance was challenged.
Recently, a team from the Hong Kong University of Science and Technology developed a more versatile image segmentation model, Semantic-SAM. It can fully reproduce SAM's segmentation results while adding finer granularity and semantic awareness, and it supports a wide range of segmentation tasks and applications, including generic segmentation (panoptic, semantic, and instance segmentation), fine-grained segmentation, interactive segmentation with multi-granularity semantics, and multi-granularity image editing.
Moreover, Semantic-SAM outperforms Meta's SAM in granularity richness, semantic awareness, and versatility. A single click can yield up to six levels of granularity, aligning the output with user intent more controllably and without repeatedly moving the mouse to pin down the desired region.
Yet, mere image segmentation no longer satisfies the ambitions of AI researchers. Video segmentation is a foundational technology for applications like autonomous driving, robotics, and video editing, but SAM cannot handle such tasks.
Recently, researchers from ETH Zurich, the Hong Kong University of Science and Technology, and EPFL broke through these limitations with their SAM-PT model, extending SAM's zero-shot capabilities to dynamic video tracking and segmentation tasks. SAM-PT demonstrates stable and robust zero-shot performance across multiple video object segmentation benchmarks.
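Judging by the name (PT for point tracking), the approach pairs SAM with a point tracker that carries user clicks forward through the video. The sketch below illustrates that general idea only; it is not the authors' released code, and `track_points` is a hypothetical stand-in for a long-term point tracker:

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

def track_points(frames, init_points):
    """Hypothetical stand-in for a long-term point tracker.
    Should return an array of shape (num_frames, num_points, 2) of (x, y) positions."""
    raise NotImplementedError("plug in a real point tracker here")

def segment_video(frames, init_points, checkpoint="sam_vit_h_4b8939.pth"):
    """frames: list of H x W x 3 RGB arrays; init_points: (K, 2) clicks on frame 0."""
    predictor = SamPredictor(sam_model_registry["vit_h"](checkpoint=checkpoint))
    tracked = track_points(frames, init_points)           # propagate the clicks forward
    masks = []
    for frame, points in zip(frames, tracked):
        predictor.set_image(frame)                         # prompt SAM frame by frame
        m, _, _ = predictor.predict(
            point_coords=np.asarray(points, dtype=np.float32),
            point_labels=np.ones(len(points)),             # treat all tracked points as foreground
            multimask_output=False,
        )
        masks.append(m[0])                                 # one mask per frame
    return masks
```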
The Other Side of AI Drawing
Following a string of updates such as Zoom Out and Pan, Midjourney's V6 is slated for release this month. Meanwhile, Stability AI, the unicorn behind Stable Diffusion, has introduced Stable Doodle, a sketch-to-image service.
For content creators and other end-users, the evolution of AI drawing tools delivers high-precision, high-quality content, offering superior experiences and more diverse options. With AI-generated content and image segmentation/recognition models, the efficiency and user experience of AR/VR wearable devices will significantly improve, as will the precision and productivity in industries like autonomous driving and healthcare.
Yet hidden hazards lurk beneath this promising picture.
With continuous model updates and iterations, users' artwork or image data may be uploaded to cloud servers or used to train more advanced models, raising privacy and copyright concerns. Meanwhile, unrestricted AI-generated imagery is being exploited to create pornographic content featuring real individuals, testing legal boundaries.
While AIGC is revolutionizing content production, it is also reshaping consumption patterns. The thrill of novel, dazzling generated content fades quickly. Amid endless text-prompt choices and ever-tighter echo chambers, whether AI's brush will paint an unprecedentedly diverse content market or a monotonous aesthetic remains a troubling question.