Alibaba International Releases Multimodal Large Model Ovis2.5, Advancing Visual Perception and Deep Reasoning

baoshi.rao

Alibaba International has officially released and open-sourced Ovis2.5, its next-generation multimodal large model. The model focuses on native-resolution visual perception, deep reasoning, and cost-effective design for practical scenarios, with the aim of further strengthening AI application capabilities. On the mainstream multimodal evaluation suite OpenCompass, Ovis2.5 scores significantly higher overall than its predecessor Ovis2 and retains state-of-the-art (SOTA) standing among comparable open-source models.

Ovis2.5 is released in two parameter scales. Ovis2.5-9B scored 78.3 on the OpenCompass evaluation, surpassing many larger models and ranking first among open-source models with fewer than 40B parameters. Ovis2.5-2B scored 73.9, continuing the Ovis series' philosophy of 'small size, big power', and is particularly suited to edge devices and resource-constrained scenarios.

According to the official announcement, Ovis2.5 introduces systematic innovations in three areas: model architecture, training strategy, and data engineering. The architecture retains the series' structured embedding alignment design and consists of three core components: a visual feature extractor that handles dynamic resolutions, a visual vocabulary module that structurally aligns vision and text, and a language model based on Qwen3.
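To make the visual vocabulary idea more concrete, here is a minimal sketch in plain Python. It assumes, based on how the Ovis series has been publicly described rather than on anything stated in this announcement, that each image patch is turned into a probability distribution over a learned table of "visual words", and that the patch embedding is the probability-weighted mix of that table's rows. The function names, the toy table, and its sizes are hypothetical and chosen only for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def visual_embedding(patch_logits, visual_vocab):
    """Soft lookup: probability-weighted mix of visual vocabulary rows.

    patch_logits: one score per visual word (hypothetical visual head output).
    visual_vocab: list of embedding rows, one row per visual word.
    """
    probs = softmax(patch_logits)
    dim = len(visual_vocab[0])
    return [
        sum(p * row[d] for p, row in zip(probs, visual_vocab))
        for d in range(dim)
    ]

# Toy table: 3 visual words, embedding dimension 2 (illustrative values only).
VISUAL_VOCAB = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# A patch whose scores strongly favor visual word 0 lands close to row 0.
embedding = visual_embedding([10.0, 0.0, 0.0], VISUAL_VOCAB)
```

The point of the design, as sketched here, is that visual inputs are embedded through a table lookup in the same structural way that text tokens are, except that the lookup is soft rather than a single index.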

For training, Ovis2.5 adopts a refined five-stage approach that includes foundational visual pre-training, multimodal pre-training, and large-scale instruction fine-tuning. In addition, algorithms such as DPO and GRPO are used to strengthen preference alignment and reasoning capabilities, significantly improving model performance. The training pipeline also achieves a 3-4x end-to-end speedup.
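For readers unfamiliar with the two algorithms named above, the following sketch shows the core quantity each one computes, in their generally published forms. It is not Ovis2.5's training code: the function names, the default `beta`, and the use of a population standard deviation are assumptions made for illustration, and the inputs are plain floats rather than model outputs.

```python
import math
from statistics import mean, pstdev

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities.

    The margin measures how much more the policy prefers the chosen answer
    over the rejected one, relative to a frozen reference model.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(x)) is written as log(1 + exp(-x)).
    return math.log1p(math.exp(-beta * margin))

def grpo_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: rewards normalized within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# One preference pair where the policy already leans toward the chosen answer.
pair_loss = dpo_loss(-1.0, -3.0, -2.0, -2.0)

# Four sampled answers to the same prompt, two rewarded and two not.
advantages = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```

In this sketch the DPO loss falls below log(2) once the policy favors the chosen answer more than the reference does, and the GRPO advantages are positive for answers that beat their group's average reward and negative otherwise.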

In data engineering, Ovis2.5's training data is 50% larger than that of Ovis2, with a focus on key areas such as visual reasoning, charts, OCR (optical character recognition), and grounding. Notably, a large amount of 'thinking' data deeply aligned with Qwen3 has been synthesized to strengthen the model's reflection and reasoning potential.

The code and models for Ovis2.5 are now available on platforms like GitHub and Hugging Face, allowing users to explore their application potential.

Code: https://github.com/AIDC-AI/Ovis

Model: https://huggingface.co/AIDC-AI/

Key Highlights:

Ovis2.5-9B scored 78.3 on OpenCompass, ranking first among open-source models with fewer than 40B parameters.

Includes two versions: Ovis2.5-9B for large-scale applications and Ovis2.5-2B for resource-constrained scenarios.

Features innovative architecture and training strategies, with a 50% increase in data volume, focusing on key areas like visual reasoning.