Alibaba Cloud Releases Multimodal Large Model Qwen-VL-Max Version
Alibaba Cloud has announced its latest research progress in multimodal large models, releasing Qwen-VL-Max, the Max version that follows the earlier Plus edition.
The Qwen-VL-Max model exhibits exceptional visual reasoning capabilities, enabling it to comprehend and analyze complex image information for tasks such as person recognition, question answering, creative generation, and code writing. The model also supports visual grounding, allowing it to answer questions about specified regions within an image. In terms of fundamental capabilities, Qwen-VL-Max can accurately describe and recognize image content, as well as perform reasoning and extended creation based on images. These capabilities have enabled the model to perform exceptionally well in multiple authoritative evaluations, with overall performance comparable to GPT-4V and Gemini Ultra.
In tasks such as document analysis (DocVQA) and Chinese image-related tasks (MM-Bench-CN), Qwen-VL-Max has surpassed GPT-4V, achieving world-leading levels.
Additionally, Qwen-VL-Max has made significant progress in image-text processing, with notably improved Chinese and English text recognition. The model supports high-resolution images exceeding one million pixels as well as images with extreme aspect ratios; it can not only fully transcribe dense text but also extract information from tables and documents.

Currently, Qwen-VL-Plus and Qwen-VL-Max are available free of charge for a limited time. Users can experience the Max version directly on the Tongyi Qianwen official website or in the Tongyi Qianwen app, or call the model API through the Alibaba Cloud Lingji Platform (DashScope), as sketched below.
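As a rough illustration of API access, the following is a minimal sketch assuming the DashScope Python SDK and its MultiModalConversation interface; the image URL, prompt, and response handling are placeholders for illustration, not official sample code.

```python
import os

import dashscope
from dashscope import MultiModalConversation

# Assumes an API key is supplied via the DASHSCOPE_API_KEY environment variable.
dashscope.api_key = os.getenv("DASHSCOPE_API_KEY")

# A single-turn request: one image plus a text question about it.
# The image URL below is a placeholder.
messages = [
    {
        "role": "user",
        "content": [
            {"image": "https://example.com/sample-document.png"},
            {"text": "Describe this image and extract any table content."},
        ],
    }
]

# Call the Qwen-VL-Max model through the multimodal conversation endpoint.
response = MultiModalConversation.call(model="qwen-vl-max", messages=messages)

if response.status_code == 200:
    # The model's reply is returned in the output field; print it for inspection.
    print(response.output)
else:
    print(f"Request failed: {response.code} - {response.message}")
```

The same interface also accepts multi-turn message histories, so follow-up questions about the same image can be asked by appending to the messages list.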