Huazhong University of Science and Technology Releases New Benchmark for Multimodal Large Models Covering Five Major Tasks
Recently, Huazhong University of Science and Technology and other institutions released a comprehensive new benchmark for evaluating multimodal large models (LMMs). Because these models produce open-ended responses, reliably measuring their performance across different capabilities has become a pressing issue, and the new benchmark aims to address that challenge. The study covers 14 mainstream multimodal large models, including Google Gemini and OpenAI GPT-4V, across five major tasks and 27 datasets.
This research places particular emphasis on the optical character recognition (OCR) capabilities of multimodal large models. The team conducted an in-depth study of the models' OCR performance and developed a specialized evaluation benchmark named OCRBench. Through extensive experiments on 27 public datasets and two generated datasets (one without semantics and the other with contrasting semantics), the study revealed the limitations of multimodal large models in the OCR field. The paper provides a detailed overview of the evaluated models, the metrics, and the datasets used.
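To illustrate the idea behind the two generated datasets, the sketch below renders paired test images: real dictionary words (with semantics) and random character shuffles of the same words (without semantics). If a model scores much better on the former, it is likely leaning on language priors rather than reading the characters. This is a hypothetical reconstruction for illustration only; the word list, font, and image sizes are assumptions, not the authors' actual generation pipeline.

```python
# Hypothetical sketch: render semantic vs. non-semantic strings as OCR probes.
# NOT the authors' pipeline; word list, font, and sizes are assumptions.
import random
from PIL import Image, ImageDraw, ImageFont

WORDS = ["science", "benchmark", "document", "language"]  # assumed sample words

def shuffle_chars(word: str) -> str:
    """Destroy semantics while keeping exactly the same characters."""
    chars = list(word)
    random.shuffle(chars)
    return "".join(chars)

def render(text: str, path: str) -> None:
    """Render a string onto a plain white image for OCR probing."""
    img = Image.new("RGB", (320, 64), "white")
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()  # a real setup would vary fonts and styles
    draw.text((10, 20), text, fill="black", font=font)
    img.save(path)

for i, word in enumerate(WORDS):
    render(word, f"semantic_{i}.png")                 # meaningful word
    render(shuffle_chars(word), f"nonsense_{i}.png")  # same chars, no meaning
```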
Project address: https://github.com/Yuliang-Liu/MultimodalOCR

The evaluation results indicate that multimodal large models perform well on certain tasks, such as text recognition and document Q&A, but face challenges with semantic dependency, handwritten text, and multilingual text. Performance is particularly poor on character combinations that lack semantic meaning, suggesting the models rely on semantic understanding rather than purely visual reading. Handwritten and multilingual text recognition also remain difficult, likely due to insufficient training data. In addition, higher-resolution input images yield better performance on tasks such as scene text Q&A, document Q&A, and key information extraction.
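Because the models' responses are open-ended, benchmarks of this kind typically score a prediction by checking whether the ground-truth string appears in the normalized response rather than requiring an exact match. The sketch below shows such a containment-based metric; the function names and normalization details are assumptions for illustration, not the project's exact evaluation code.

```python
# Minimal sketch of a containment-based metric for open-ended responses.
# Normalization details and names are assumptions, not OCRBench's exact code.
import re

def normalize(s: str) -> str:
    """Lowercase and collapse whitespace/punctuation for a lenient match."""
    return re.sub(r"[\s\.,!?;:]+", " ", s.lower()).strip()

def is_correct(prediction: str, answers: list[str]) -> bool:
    """Count a response as correct if any ground-truth answer appears in it."""
    pred = normalize(prediction)
    return any(normalize(a) in pred for a in answers)

def accuracy(preds: list[str], gts: list[list[str]]) -> float:
    """Fraction of predictions containing a ground-truth answer."""
    hits = sum(is_correct(p, a) for p, a in zip(preds, gts))
    return hits / len(preds) if preds else 0.0

# Example: a verbose open-ended answer still scores correct via substring match.
print(is_correct("The sign reads 'EXIT' in red letters.", ["exit"]))  # True
```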
To address these limitations, the research team developed OCRBench to evaluate the OCR capabilities of multimodal large models more accurately. The benchmark is expected to guide the future development of these models and to encourage further research into improving their performance and expanding their application areas.
In this new era of evaluating multimodal large models, OCRBench gives researchers and developers a more accurate and comprehensive tool for assessing and improving the OCR capabilities of these models. Beyond offering new insights into performance evaluation, the research lays a more solid foundation for related work and applications in the field.