Apple's New On-Device Model Surpasses GPT-4, Significantly Enhancing Siri's Intelligence

    In a recent paper, Apple researchers claimed to have developed an on-device model that outperforms GPT-4 in certain aspects.

Specifically, they focused on the problem of reference resolution in NLP: enabling a model to work out which entity (a name, place, organization, and so on) a given word or phrase in the text refers to. This process is crucial for understanding the meaning of sentences, because people often use pronouns or other referential terms (such as "he" or "there") to refer back to previously mentioned nouns or noun phrases rather than repeating them.

However, the "entities" in the paper relate mainly to devices such as smartphones and tablets, and fall into three categories:

On-screen Entities: Entities or information displayed on the screen while the user interacts with the device.

    Conversational Entities: Entities related to conversations. These entities may come from the user's previous statements (for example, when the user says "Call Mom," the contact information for "Mom" is the relevant entity) or from virtual assistants (for example, when the assistant provides the user with a list of locations or alarms to choose from).

Background Entities: Entities related to the context of the user's current interaction with the device that are not necessarily part of the conversational history generated by the user's exchanges with the virtual assistant; examples include an alarm that starts ringing or music playing in the background.

Apple's researchers note in the paper that although large language models (LLMs) have demonstrated strong capabilities across a variety of tasks, their potential has not been fully exploited for resolving references to non-conversational entities (such as on-screen entities and background entities).
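For concreteness, the three entity categories above could be modeled as a small data structure like the following. This is purely an illustrative sketch; the class and field names are assumptions, not structures from the paper:

```python
from dataclasses import dataclass
from enum import Enum, auto


class EntityCategory(Enum):
    """The three categories of relevant entities described above."""
    ON_SCREEN = auto()       # visible on the display during the interaction
    CONVERSATIONAL = auto()  # surfaced earlier in the dialogue
    BACKGROUND = auto()      # ambient context, e.g. a ringing alarm


@dataclass
class Entity:
    text: str                  # surface form, e.g. "Mom" or "(415) 555-0123"
    category: EntityCategory   # which of the three sources it came from
    semantic_type: str         # e.g. "contact", "phone_number", "alarm"
```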

In the paper, Apple's researchers propose a new method: using parsed entities and their locations to reconstruct the screen, generating a plain-text representation that preserves the visual layout of the screen content. They then label the parts of the screen that serve as entities, giving the model contextual information about where each entity appears and the text surrounding it (e.g., "Call business number"). To the authors' knowledge, this is the first work to use a large language model to encode screen context.

    Specifically, their proposed model, named ReALM, comes with parameter sizes of 80M, 250M, 1B, and 3B, all of which are very small and suitable for running on devices such as smartphones and tablets. Research results show that compared to existing systems with similar functionalities, this system has achieved significant improvements across different types of references. The smallest model obtained an absolute gain of over 5% when processing on-screen references.

Additionally, the paper compared ReALM's performance with GPT-3.5 and GPT-4. The results indicate that the smallest model performs comparably to GPT-4, while the larger models significantly surpass it. This suggests that by recasting reference resolution as a language modeling task, large language models can effectively handle various types of references, including references to non-conversational entities that are traditionally difficult to resolve from text alone.

This research holds promise for improving Apple's Siri assistant, enabling it to better understand and process contextual references in user queries. Particularly for complex references involving on-screen content or background applications, Siri could become more intelligent in online searches, app operations, notification reading, and smart-home device interactions. Apple will host its worldwide developer conference, WWDC 2024, online from June 10 to 14, 2024, Pacific Time, where it is expected to present a comprehensive artificial intelligence strategy. Some anticipate that these changes may appear in the upcoming iOS 18 and macOS 15, representing a major advancement in how users interact with Apple devices.

    Paper Introduction

Paper address: https://arxiv.org/pdf/2403.20329.pdf

Paper title: ReALM: Reference Resolution As Language Modeling

    This paper formulates the task as follows: given relevant entities and the task the user wants to perform, the researchers aim to extract the entity (or entities) relevant to the current user query. There are three different types of relevant entities: on-screen entities, conversational entities, and background entities (as detailed above).
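Cast as a language-modeling problem, a single data point might look like the following: the user's query plus a numbered list of candidate entities, with the model trained to emit the identifiers of the relevant ones. The template here is an illustrative assumption, since this summary does not reproduce the paper's exact prompt format:

```python
# Hypothetical data point: a user query plus candidate entities. The model's
# target output is the identifier(s) of the entities the query refers to.
candidates = [
    (1, "contact: Mom"),
    (2, "phone_number: (415) 555-0123"),  # assumed on-screen business number
    (3, "alarm: 7:00 AM"),
]
query = "Call the business number"

prompt = "Which entities does the query refer to?\n"
prompt += f"Query: {query}\nEntities:\n"
prompt += "\n".join(f"{i}. {desc}" for i, desc in candidates)

expected_output = "2"  # identifier of the relevant entity
print(prompt)
```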

Regarding the dataset, the paper uses data that is either synthetically created or created with the help of annotators; the dataset statistics are shown in Table 2 of the paper. Conversational data covers entities involved in the user's interactions with the assistant; synthetic data, as the name suggests, is generated from templates; and screen data is collected from various web pages and includes information such as phone numbers and email addresses.

    The research team compared the ReALM model with two baseline methods: MARRS (not based on LLM) and ChatGPT.

The study used the following pipeline to fine-tune the LLM (a FLAN-T5 model): the parsed input is provided to the model for fine-tuning. Note that unlike the baseline methods, ReALM does not perform an extensive hyperparameter search over FLAN-T5 and instead uses the default fine-tuning parameters. Each data point, consisting of a user query and the corresponding entities, is converted into sentence format and then fed to the LLM for training, as in the sketch below.
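A minimal sketch of that conversion plus a single training step, using the Hugging Face transformers API with FLAN-T5. The prompt template and optimizer settings are assumptions for illustration, not the paper's exact setup:

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# One (query, entities) pair converted into sentence format, mirroring the
# pipeline described above (the exact template is an assumption).
source = (
    "Resolve the reference. Query: Call the business number. "
    "Entities: 1. contact: Mom 2. phone_number: (415) 555-0123"
)
target = "2"  # identifier of the relevant entity

batch = tokenizer(source, text_target=target, return_tensors="pt")
loss = model(**batch).loss  # default seq2seq objective, no HP search
loss.backward()
optimizer.step()
```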
Conversational Reference
    In this study, the research team hypothesized two types of conversational reference:

• Type-based
• Descriptive

    Type-based reference heavily relies on combining user queries with entity types to identify which entity in a set is most relevant to the user's query: for example, when a user says "play this", we know "this" refers to entities like songs or movies, not phone numbers or addresses; "call him" refers to phone numbers or contacts, not alarms.

Descriptive reference instead uses an entity's attributes to uniquely identify it: for example, "the one at Times Square" may single out one entity in a group. Note that, in general, reference resolution may rely on both type and description to uniquely identify a single object. Apple's research team simply encodes both the type and the various attributes of each entity, as in the sketch below.
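As an illustration of type-based resolution, a first pass can simply filter candidates by the entity types a query verb is compatible with. The verb-to-type map below is a hypothetical example, not data from the paper:

```python
# Hypothetical map from query verbs to the entity types they can act on.
COMPATIBLE_TYPES = {
    "play": {"song", "movie"},
    "call": {"phone_number", "contact"},
}

def type_based_filter(verb: str, candidates: list[tuple[str, str]]) -> list[str]:
    """Keep only candidates whose semantic type fits the verb."""
    allowed = COMPATIBLE_TYPES.get(verb, set())
    return [text for text, sem_type in candidates if sem_type in allowed]

candidates = [("Shake It Off", "song"), ("(415) 555-0123", "phone_number")]
print(type_based_filter("play", candidates))  # ['Shake It Off']
print(type_based_filter("call", candidates))  # ['(415) 555-0123']
```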

    Screen Reference

For screen references, the research team assumes the existence of an upstream data detector capable of parsing screen text to extract entities. These entities, along with their types, bounding boxes, and a list of non-entity text elements surrounding them, are then available. To encode these entities (and the relevant portions of the screen) for the LM in a text-only manner, the study employs Algorithm 2. Intuitively, the position of each entity and its surrounding objects is represented by the center of its bounding box. These centers (and the associated objects) are then sorted from top to bottom (vertically, along the y-axis) and from left to right (horizontally, along the x-axis) using a stable sorting algorithm. All objects whose centers fall within a vertical margin of one another are treated as being on the same line and are separated by tabs; objects further down, outside the margin, are placed on the next line. Repeating this process effectively encodes the screen as plain text, left to right and top to bottom.
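Below is a sketch of the screen-encoding procedure just described: bounding-box centers, a stable top-to-bottom then left-to-right sort, and tab-joined lines for objects whose centers fall within a vertical margin. The margin value and data layout are assumptions, and the paper's additional step of tagging which spans are entities is omitted here:

```python
from dataclasses import dataclass

@dataclass
class ScreenObject:
    text: str
    box: tuple[float, float, float, float]  # (left, top, right, bottom)

    @property
    def center(self) -> tuple[float, float]:
        l, t, r, b = self.box
        return ((l + r) / 2, (t + b) / 2)

def encode_screen(objects: list[ScreenObject], margin: float = 10.0) -> str:
    """Render parsed screen objects as plain text, top-to-bottom and
    left-to-right, grouping objects whose vertical centers fall within
    `margin` of each other onto one tab-separated line."""
    # Two-pass stable sort: left-to-right first, then top-to-bottom.
    # Python's sort is stable, so horizontal order survives the second pass.
    objs = sorted(objects, key=lambda o: o.center[0])
    objs = sorted(objs, key=lambda o: o.center[1])

    lines: list[list[str]] = []
    line_y = None
    for obj in objs:
        _, y = obj.center
        if line_y is None or y - line_y > margin:
            lines.append([])  # start a new line below the margin
            line_y = y
        lines[-1].append(obj.text)
    return "\n".join("\t".join(line) for line in lines)

screen = [
    ScreenObject("Ray's Auto Repair", (10, 10, 200, 30)),
    ScreenObject("Call business number", (10, 50, 150, 70)),
    ScreenObject("(415) 555-0123", (160, 52, 280, 68)),
]
print(encode_screen(screen))
```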

    Table 3 shows the experimental results: the proposed method outperforms the MARRS model across all types of datasets. Additionally, the researchers found that this method performs better than GPT-3.5, even though the latter has several orders of magnitude more parameters than the ReALM model.

When compared to GPT-4, although ReALM is far smaller, its performance is roughly on par with the latest GPT-4. The paper also highlights the model's gains on the screen datasets, finding that the text-encoding models perform the task almost as well as GPT-4, even though the latter is provided with actual screenshots. Finally, the researchers experimented with models of different sizes.

GPT-4 ≈ ReALM ≫ MARRS for new use cases

As a case study, the paper explores the zero-shot performance of the models in an unseen domain: alarms (a sample data point is shown in Appendix Table 11).

The results in Table 3 indicate that all the LLM-based methods outperform the FT model. The paper also finds that ReALM and GPT-4 perform very similarly in this unseen domain.

ReALM > GPT-4 for domain-specific queries

Thanks to fine-tuning on user requests, ReALM understands more domain-specific questions. In the example in Table 4, GPT-4 incorrectly assumes the reference concerns only settings, while it actually also covers background home-automation devices; GPT-4 lacks the domain knowledge to recognize this. ReALM, trained on domain-specific data, does not exhibit this issue.
