Multimodal Large Models Propel AI into the Era of 'Synesthesia'
Just as the human 'five senses' are interconnected and inseparable, the boundaries between AI modalities such as vision, language, and audio are increasingly blurring. As AI's perception, interaction, and generation capabilities develop rapidly, multimodal large models are driving AI into the 'synesthesia' era.
Reporters learned yesterday from the Shanghai AI Laboratory that its newly released 'Scholar' multimodal large model has achieved leading performance on more than 80 global multimodal and visual evaluation tasks, surpassing comparable models developed by Google, Microsoft, OpenAI, and others.
The 'Scholar' multimodal large model contains 20 billion parameters, was trained on 800 million multimodal samples, and supports the recognition and understanding of 3.5 million semantic labels, covering common categories and concepts in the open world. It has three core capabilities: open-world understanding, cross-modal generation, and multimodal interaction.
When ChatGPT emerged, experts predicted it would change the 'interface' of human-computer interaction. Now, multimodal understanding, generation, and interaction are becoming the key directions for the next evolution of large models, heralding an era, perhaps just around the corner, in which everyone can 'command' AI by voice.
From Predefined Tasks to Open Tasks: Unlocking Real-World Understanding
As the demands of application scenarios grow rapidly, traditional computer vision can no longer cope with the countless specific tasks and scenes of the real world; an advanced visual system with general scene perception and complex problem-solving capabilities is urgently needed. The 'Scholar' multimodal large model integrates three major capabilities: vision, language, and multi-task modeling. Concretely, it combines a general visual large model, a large language pre-training model (LLM) for text understanding, and a large model for multi-task compatible decoding, bringing it closer to human perception and cognition.
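To make that three-part design concrete, the sketch below wires a visual encoder, a text encoder standing in for the language model, and a shared multi-task decoder into one pipeline. It is only an illustrative toy in PyTorch: the module names, sizes, and the simple late-fusion scheme are hypothetical placeholders, not the laboratory's actual architecture or code.

```python
# Illustrative sketch only: a vision encoder, a text encoder, and a shared
# multi-task decoder composed into one model. All names and sizes are
# hypothetical placeholders, not the 'Scholar' architecture.
import torch
import torch.nn as nn

class VisionEncoder(nn.Module):
    """Stands in for a general visual backbone."""
    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Sequential(nn.Flatten(), nn.LazyLinear(dim), nn.ReLU())

    def forward(self, images):                    # images: (B, 3, H, W)
        return self.proj(images)                  # (B, dim) pooled visual features

class TextEncoder(nn.Module):
    """Stands in for a large language pre-training model used for text understanding."""
    def __init__(self, vocab_size=30522, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)

    def forward(self, token_ids):                 # token_ids: (B, T)
        return self.embed(token_ids).mean(dim=1)  # (B, dim) pooled text features

class MultiTaskDecoder(nn.Module):
    """One shared decoder with a lightweight head per downstream task."""
    def __init__(self, dim=256, task_dims=None):
        super().__init__()
        task_dims = task_dims or {"classification": 1000, "caption_score": 1}
        self.heads = nn.ModuleDict({t: nn.Linear(2 * dim, d) for t, d in task_dims.items()})

    def forward(self, vis, txt, task):
        fused = torch.cat([vis, txt], dim=-1)     # naive late fusion of the two modalities
        return self.heads[task](fused)

vision, text, decoder = VisionEncoder(), TextEncoder(), MultiTaskDecoder()
images = torch.randn(2, 3, 224, 224)
tokens = torch.randint(0, 30522, (2, 16))
logits = decoder(vision(images), text(tokens), task="classification")
print(logits.shape)                               # torch.Size([2, 1000])
```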
In AI research, the 'open world' refers to the real world, which is not preset or confined to academic, closed label sets. Traditionally, AI could perform only predefined tasks, i.e., tasks defined on academic or closed sets, which fall far short of the real open world. For example, the ImageNet-1K academic set contains 1,000 object categories, including about two kinds of flowers, 48 kinds of birds, and 21 kinds of fish, whereas the real world contains roughly 450,000, 10,000, and 20,000, respectively.
In the open world, the 'Scholar' multimodal large model keeps learning, acquiring perception and cognitive abilities ever closer to those of humans. In terms of semantic openness, it can recognize and understand more than 3.5 million semantic concepts in the open world, covering common object categories, actions, and optical characters in daily life. This marks a shift from solving predefined tasks to executing open tasks, providing strong support for future research on multimodal artificial general intelligence (AGI) models.
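The gap between a closed label set and open-vocabulary recognition can be illustrated with a small thought experiment: a closed-set classifier must answer with one of its fixed categories, while an open-vocabulary system matches an image against arbitrary text labels in a shared embedding space, so new concepts need no retraining. The embedding functions below are deliberately fake stand-ins for real encoders and have nothing to do with the 'Scholar' model itself.

```python
# Toy contrast between closed-set classification (fixed label list, as in
# ImageNet-1K) and open-vocabulary recognition (free-form text labels matched
# in a shared embedding space). The embeddings are fake stand-ins.
import numpy as np

DIM = 64
rng = np.random.default_rng(0)

def embed_text(label: str) -> np.ndarray:
    """Toy text embedding: a deterministic pseudo-random unit vector per label."""
    seed = abs(hash(label)) % (2**32)
    v = np.random.default_rng(seed).normal(size=DIM)
    return v / np.linalg.norm(v)

def embed_image(_image_path: str) -> np.ndarray:
    """Toy image embedding: fakes a photo that 'looks like' a hummingbird."""
    return embed_text("hummingbird") + 0.1 * rng.normal(size=DIM)

# Closed set: the model can only ever answer with one of these fixed categories.
closed_set = ["goldfinch", "robin", "ostrich"]

# Open vocabulary: labels are free-form text, so new concepts cost nothing to add.
open_vocabulary = closed_set + ["hummingbird", "fire salamander", "bonsai tree"]

def classify(image_vec: np.ndarray, labels: list[str]) -> str:
    sims = [float(image_vec @ embed_text(label)) for label in labels]
    return labels[int(np.argmax(sims))]

img = embed_image("photo_001.jpg")
print("closed-set prediction:     ", classify(img, closed_set))       # forced to pick a wrong bird
print("open-vocabulary prediction:", classify(img, open_vocabulary))  # can actually say 'hummingbird'
```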
Writing Poetry from Images: Cross-Modal Generation with 'Creative Thinking'
Currently, AI technology faces numerous cross-modal challenges. For instance, in autonomous driving scenarios, it must accurately assist vehicles in judging traffic light states, road signs, and other information to provide effective input for decision-making.
Image-to-text generation is a classic cross-modal capability. After 'admiring' Zhang Daqian's 'Summer Landscape,' the 'Scholar' multimodal large model composed a seven-character quatrain upon request: 'Peaks tower into the clouds, / Mist curls into smoke. / Forgetting the self, the mind finds ease, / Listening to pine waves, I drift into painted sleep.' The Shanghai AI Laboratory noted that the model demonstrated cross-modal generation from image to text, with the last line echoing the Tang poet Wei Zhuang's famous verse 'The spring water is bluer than the sky; in a painted boat, listening to the rain, I sleep,' reflecting a certain cultural accumulation.
While generating the text, the 'Scholar' large model also described its 'creative process': first, identifying the elements depicted in the image; second, selecting elements from the scene that can express the poet's thoughts and emotions, such as towering peaks, swirling clouds, and pine forests; third, composing poetic lines based on these elements; and finally, refining the expression according to the rhythm and meter of the poem.
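As a rough illustration of that staged process, the sketch below strings the four steps together as plain functions. Every function body is a hypothetical placeholder standing in for a model call; it shows only the control flow described above, not how the 'Scholar' model actually writes poems.

```python
# Schematic pipeline for the four-step 'creative process' described above.
# All function bodies are hypothetical placeholders for model calls.
from dataclasses import dataclass

@dataclass
class PoemDraft:
    elements: list[str]   # everything recognized in the image
    selected: list[str]   # elements chosen to carry mood and emotion
    lines: list[str]      # the drafted, then metered, lines

def identify_elements(image_path: str) -> list[str]:
    # Step 1: open-world visual recognition on the painting (placeholder).
    return ["towering peaks", "swirling clouds", "pine forest", "mist"]

def select_evocative(elements: list[str]) -> list[str]:
    # Step 2: keep the elements that can express the poet's feelings (placeholder).
    return elements[:3]

def draft_lines(selected: list[str]) -> list[str]:
    # Step 3: compose one poetic line per selected element (placeholder).
    return [f"A line evoking {element}" for element in selected]

def refine_meter(lines: list[str]) -> list[str]:
    # Step 4: adjust each line to the quatrain's rhythm and rhyme (placeholder).
    return [line + " (metered)" for line in lines]

def image_to_poem(image_path: str) -> PoemDraft:
    elements = identify_elements(image_path)
    selected = select_evocative(elements)
    lines = refine_meter(draft_lines(selected))
    return PoemDraft(elements, selected, lines)

print(image_to_poem("summer_landscape.jpg").lines)
```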
Treating Images as a New Language: Multimodal Interaction Lowers the Barrier to Use
As artificial intelligence enters the era of 'synesthesia,' what is its most immediate impact on people? Experts from the Shanghai AI Laboratory stated that the 'Scholar' multimodal large model can treat images as a new language, allowing users to flexibly define and manage any visual task using natural language instructions.
For example, when you input a photo and verbally 'command' the AI to convert it into text and send it to your parents, it can immediately understand and execute the instruction. This multimodal interaction lowers the threshold for AI tasks, making AI a production tool accessible to the general public.
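A toy dispatcher hints at what such instruction-driven interaction could look like: a spoken or typed command is mapped to a visual task (here, reading the text out of a photo) and then to an action (sending it on). The recognition and messaging functions are hypothetical placeholders, and a real system would parse the instruction with the language model rather than with keyword matching.

```python
# Toy sketch of natural-language task dispatch over an image. The OCR and
# messaging calls are hypothetical placeholders, not any real API.
def transcribe_image(image_path: str) -> str:
    # Placeholder for optical character recognition on the photo.
    return "Dear Mom and Dad, here is the text from the photo..."

def send_message(recipient: str, text: str) -> None:
    # Placeholder for whatever messaging channel the assistant is connected to.
    print(f"[to {recipient}] {text}")

def handle_instruction(instruction: str, image_path: str) -> None:
    """Very naive intent routing; a real assistant would use the LLM to parse this."""
    lowered = instruction.lower()
    if "text" in lowered and ("convert" in lowered or "turn" in lowered):
        text = transcribe_image(image_path)
        if "parents" in lowered:
            send_message("parents", text)
        else:
            print(text)
    else:
        print("Sorry, I don't know how to do that yet.")

handle_instruction("Convert this photo into text and send it to my parents.", "photo.jpg")
```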
In other words, the 'interface' for human-computer interaction is about to change. In the past, we connected to the virtual world through different software in different scenarios; we were still in the era of graphical user interfaces. In the future, multimodal large models will usher us into the era of natural-language dialogue interfaces, akin to Iron Man and his AI assistant J.A.R.V.I.S.