Huazhong University of Science and Technology Open-Sources Multimodal Monkey Model

Posted in AI Insights by baoshi.rao
    Monkey is a high-performance multimodal large model jointly developed by Huazhong University of Science and Technology and Kingsoft. By increasing input resolution and introducing a multi-level description generation method, it addresses the difficulties existing models face with complex scenes and fine visual detail. Monkey can be built on existing vision encoders without pre-training from scratch, significantly improving R&D efficiency.

    Monkey's multi-level description generation method provides rich contextual information for the model, guiding it to learn the relationships between scenes and objects. Tested across 16 different datasets, Monkey has achieved outstanding results in multimodal tasks such as image captioning, visual question answering, and document classification. It demonstrates exceptional capabilities in perceiving fine visual details and understanding complex scenes, with broad application potential.
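To make the idea concrete, here is an illustrative sketch (not the authors' code) of what a "multi-level" description record for one image might look like, combining a global caption with region-level detail; all field names are hypothetical.

```python
def build_training_text(record):
    """Flatten a multi-level description record into one caption string.

    The global caption gives scene context; region entries add the
    object-level detail that guides the model to relate scenes and objects.
    """
    parts = [record["global_caption"]]
    for region in record.get("regions", []):
        parts.append(f"{region['label']}: {region['detail']}")
    return " ".join(parts)

# Hypothetical example record for a single training image.
sample = {
    "global_caption": "A street market at dusk.",
    "regions": [
        {"label": "stall", "detail": "a vendor selling fruit under a red awning"},
        {"label": "sign", "detail": "a chalkboard listing prices"},
    ],
}

print(build_training_text(sample))
```

The flattened text pairs scene-level and object-level information in a single caption, which is the kind of rich context the multi-level method is described as providing.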


    Open-source address: https://github.com/Yuliang-Liu/Monkey

    Paper address: https://arxiv.org/abs/2311.06607v1

    The quality of Monkey's training dataset is key to its capability gains. The researchers generated hundreds of thousands of high-quality image descriptions, using multiple models to produce candidate text descriptions automatically and fusing the different models' outputs to deepen the large model's understanding of image details.
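The fusion step can be sketched as follows. This is a hypothetical simplification, not the authors' pipeline: it merges captions from several models by splitting each into sentences, dropping duplicates, and joining what remains. A real pipeline would likely use an LLM or a learned ranker rather than string matching.

```python
def fuse_captions(candidates):
    """Merge candidate captions from multiple models into one description.

    Duplicate sentences (case-insensitive) are kept only once, so each
    model contributes whatever detail the others missed.
    """
    seen, fused = set(), []
    for caption in candidates:
        for sentence in caption.split(". "):
            cleaned = sentence.strip(". ")
            if cleaned and cleaned.lower() not in seen:
                seen.add(cleaned.lower())
                fused.append(cleaned)
    return ". ".join(fused) + "."

captions = [
    "A dog runs on the beach. The sky is clear.",
    "A dog runs on the beach. Waves break behind it.",
]
print(fuse_captions(captions))
# -> A dog runs on the beach. The sky is clear. Waves break behind it.
```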

    For model selection, Monkey adopts the open-source Qwen-VL as its language decoder and the roughly 2-billion-parameter ViT-BigG as its visual encoder, avoiding the resource waste of repeated pre-training. To improve Monkey's recognition ability and input resolution, and to generate richer image descriptions for better understanding of complex scenes, training proceeds in three stages: multi-level description generation, high-resolution encoding, and multi-task training.
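The high-resolution idea can be sketched in a few lines. The window size and image dimensions below are illustrative assumptions, not the paper's exact values: a large image is tiled into fixed-size windows so that each window can be processed by a shared, lower-resolution encoder (a resized global view of the whole image would typically be kept alongside the tiles).

```python
import numpy as np

def split_into_windows(image, window=448):
    """Tile an H x W x C image into non-overlapping window-sized patches.

    Each patch can then be fed independently through the same pretrained
    visual encoder, raising the effective input resolution without
    retraining the encoder at a larger size.
    """
    h, w = image.shape[:2]
    tiles = []
    for top in range(0, h, window):
        for left in range(0, w, window):
            tiles.append(image[top:top + window, left:left + window])
    return tiles

image = np.zeros((896, 1344, 3), dtype=np.uint8)  # dummy high-resolution input
tiles = split_into_windows(image)
print(len(tiles))  # 2 x 3 grid of 448-pixel windows -> 6 tiles
```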

    Monkey was validated comprehensively on 16 datasets covering tasks such as image captioning, general visual question answering, and document-oriented question answering. In general visual question answering, Monkey showed clear advantages across multiple datasets; in image captioning, it also performed strongly on the TextCaps dataset, demonstrating its multimodal understanding of text elements within images.

    In document-oriented question answering, Monkey achieved good results on multiple document image understanding datasets. The researchers note that Monkey has broad application potential in fields such as medical imaging and satellite imagery, and that they will continue to optimize the model's perception, association, reasoning, and generalization capabilities.

    In summary, Monkey is a high-performance multimodal large model that tackles complex scenes and fine visual detail by raising input resolution and introducing multi-level description generation. Because it builds on existing vision encoders rather than pre-training from scratch, it offers high efficiency and broad application potential. Across tests on multiple datasets, Monkey achieved outstanding results in multimodal tasks, demonstrating strong perception of visual information and understanding of complex scenes. Going forward, the team will continue to optimize its perception, association, reasoning, and generalization abilities to further increase its value across application domains.
