Apple Releases Another Large Model Paper: Siri May Soon 'Understand' Screens
Apple recently published a language model-related paper, hinting that future versions of Siri may gain the ability to understand on-screen content, which would enable smarter interactions with Apple's devices.
On April 2nd, Apple's research team released a paper titled "ReALM: Reference Resolution As Language Modeling," which focuses on reference resolution for non-conversational entities.
The paper's abstract notes that while large language models have demonstrated strong performance on many tasks, their use for resolving references to non-conversational entities, such as on-screen elements and background entities, remains underutilized. The paper distinguishes three types of entities: "conversational entities" are specific objects or concepts that come up during a conversation and can be anything mentioned and discussed, such as people, places, events, products, or opinions; "screen entities" are the elements visible on the user's device screen, such as text, icons, buttons, images, and videos; and "background entities" are processes and services running in the operating system or in applications that are not visible to the user.
The paper primarily demonstrates how to build an efficient system on top of large language models that can resolve various kinds of references, especially those to non-conversational entities. The team's approach turns this into a pure language-modeling problem: ReALM (the model's name) reconstructs the screen from already-parsed entities and their locations, producing a textual representation that preserves the visual layout. By tagging on-screen entities with their surrounding context and positions, the system can understand what the user currently sees on the screen.
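To make the idea concrete, here is a minimal Python sketch of that screen-to-text reconstruction. It is not Apple's implementation; the `ScreenEntity` class, the normalized coordinate convention, and the `row_tolerance` threshold are illustrative assumptions. The point is simply how already-parsed entities and their positions can be laid out as plain text, with indices the model can use when resolving a reference.

```python
# Minimal sketch (not Apple's code) of serializing already-parsed on-screen
# entities and their positions into a plain-text "screen" an LLM can read.
# ScreenEntity, the normalized coordinates, and row_tolerance are assumptions.
from dataclasses import dataclass
from typing import List


@dataclass
class ScreenEntity:
    text: str   # visible label of the UI element
    x: float    # left edge, normalized 0..1
    y: float    # top edge, normalized 0..1


def render_screen_as_text(entities: List[ScreenEntity],
                          row_tolerance: float = 0.02) -> str:
    """Group entities into rows top-to-bottom, emit each row left-to-right,
    and tag every entity with an index the model can refer back to."""
    ordered = sorted(entities, key=lambda e: (e.y, e.x))
    rows, current, last_y = [], [], None
    for ent in ordered:
        # Start a new row when the vertical gap exceeds the tolerance.
        if last_y is not None and abs(ent.y - last_y) > row_tolerance:
            rows.append(current)
            current = []
        current.append(ent)
        last_y = ent.y
    if current:
        rows.append(current)

    lines, idx = [], 0
    for row in rows:
        cells = []
        for ent in sorted(row, key=lambda e: e.x):
            cells.append(f"[{idx}] {ent.text}")
            idx += 1
        lines.append("\t".join(cells))
    return "\n".join(lines)


if __name__ == "__main__":
    screen = [
        ScreenEntity("Contact Us", 0.10, 0.05),
        ScreenEntity("(415) 555-0132", 0.10, 0.60),
        ScreenEntity("Directions", 0.55, 0.60),
    ]
    print(render_screen_as_text(screen))
    # [0] Contact Us
    # [1] (415) 555-0132    [2] Directions
```

Given such a text rendering, a request like "call the number on this page" can be answered by having the model output the index of the matching entity, which is the language-modeling framing the paper describes.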
In their evaluation, Apple's researchers report the accuracy of models of several sizes, ReALM-80M/250M/1B/3B, across multiple datasets and compare them with GPT-3.5 and GPT-4. The data shows that these language models, fine-tuned for reference resolution, outperform GPT-4 in most scenarios. The paper suggests that one of Apple's key efforts is to strengthen the ability of Siri and other products to perceive and parse entities and their contexts, which could give Apple a competitive edge in how intelligently its hardware handles interaction. The researchers also make clear, however, that automatic parsing based solely on the screen has its limits: more complex visual reference resolution, such as distinguishing between multiple images, will likely require integrating computer vision and multimodal techniques.
Although Apple entered the field of large models and generative AI relatively late, it has moved efficiently, its results have been notable, and its direction of AI investment is becoming increasingly clear. In March, Apple published a paper introducing MM1, its self-developed multimodal large language model (Multimodal LLM) with up to 30 billion parameters (not an exceptionally high number); the model has not yet entered public testing, and no release timeline has been announced.
The company also appears to be preparing to integrate large models into Siri. According to GeekPark, in January this year developers discovered code related to large models in the iOS 17.4 developer preview beta, suggesting that Apple is building a new version of Siri powered by large models. Samsung, Apple's largest global competitor, has already taken the lead in AI-powered smartphones: its latest flagship series is built around the Galaxy AI strategy, bringing AI capabilities to translation, photography, photo editing, and search. In the Chinese market, Samsung has quickly partnered with companies such as Baidu, WPS, and Meitu to localize these features.
Starting from the second half of last year, Chinese smartphone manufacturers have been increasingly vocal about their AI initiatives. In August 2023, Huawei integrated the PanGu large model into HarmonyOS 4. In October, Xiaomi incorporated its self-developed AI model "MiLM-6B" into HyperOS. In November, vivo launched its "Blue Heart Large Model," while OPPO introduced the "Andes Large Model" in ColorOS 14. In January this year, Honor also released its self-developed 7-billion-parameter on-device AI model, the "Magic Large Model."
In reality, today's smartphone AI features sit mostly at the application layer and aim to make specific functions more efficient. Scenarios such as real-time call translation do address genuine needs, but the results elsewhere have yet to be truly impressive, and no AI-powered phone compelling enough to drive users to upgrade has emerged on the market so far. The room for imagination that Apple's paper opens up is this: if Siri gains a sufficiently strong understanding of on-screen entities, the range of intelligent interactions a user can initiate expands significantly. For example, a user might one day tell Siri by voice to open a specific store on a food-delivery platform and place an order, a substantial simplification of today's interaction flow.
But is this the new iPhone experience users want? Perhaps even Apple doesn't have the answer. What the market can look forward to is how Apple opens this contest at this year's WWDC (Worldwide Developers Conference), and whether it can keep the audience from lamenting that it arrived too late.