Google AI Launches ScreenAI: A Vision-Language Model for UI and Infographic Interpretation

AI Insights · ai-articles

baoshi.rao wrote:
Google AI recently introduced ScreenAI, a vision-language model designed to comprehensively understand user interfaces (UIs) and infographics. Although UIs and infographics share a common visual language and many design principles, the complexity of each domain makes building a single unified model challenging. The Google AI team proposed ScreenAI to address this gap.

ScreenAI handles tasks such as question answering over graphical content, including charts, images, and maps. The model combines the flexible patching strategy from Pix2Struct with the PaLI architecture, allowing it to cast vision tasks as image-and-text-to-text problems.
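To make the "flexible patching" idea concrete, here is a minimal sketch of the aspect-ratio-preserving patch-budget computation that Pix2Struct-style models use: rather than resizing every screenshot to a fixed square, the image is scaled so its patch grid fits a fixed budget while keeping its original proportions. The function name, patch size, and budget below are illustrative assumptions, not ScreenAI's actual values.

```python
import math

def flexible_patch_grid(height, width, patch_size=16, max_patches=1024):
    """Illustrative sketch of Pix2Struct-style flexible patching:
    pick a patch grid that preserves the image's aspect ratio
    while keeping rows * cols within a fixed patch budget.
    (patch_size and max_patches are assumed values.)"""
    # Scale factor chosen so that rows * cols <= max_patches
    scale = math.sqrt(max_patches * (patch_size / height) * (patch_size / width))
    rows = max(min(math.floor(scale * height / patch_size), max_patches), 1)
    cols = max(min(math.floor(scale * width / patch_size), max_patches), 1)
    return rows, cols

# A wide screenshot keeps more columns than rows, so UI layout
# is not distorted the way fixed square resizing would distort it.
rows, cols = flexible_patch_grid(1080, 1920)
```

Because the grid follows the image's aspect ratio, wide desktop UIs and tall mobile screenshots both keep their layout intact, which is exactly why this scheme suits screen understanding.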

The team conducted multiple experiments to show how these design decisions affect the model's performance. With fewer than 5 billion parameters, ScreenAI achieves state-of-the-art results on tasks such as Multipage DocVQA, WebSRC, MoTIF, and Widget Captioning, and it outperforms models of similar scale on DocVQA, infographic QA, and chart QA. The team also released three new datasets: Screen Annotation, ScreenQA Short, and Complex ScreenQA. The first supports future research on screen-annotation tasks, while the other two target question answering, further expanding the resources available to drive progress in this field.

    ScreenAI represents a step towards comprehensively addressing the challenges of understanding infographics and user interfaces. By leveraging the common visual language and complex design of these components, ScreenAI provides a holistic approach to comprehending digital content.

    Paper link: https://arxiv.org/abs/2402.04615
