The Next Battlefield for Large Models: AI Smart Glasses with Cameras?
The large-model battle swept through 2023, and now major tech giants appear to be setting their sights on AI wearable devices, especially smart glasses!
According to Zhidx on December 18th, The Information reported that Meta, Google, Microsoft, Amazon, and Apple—five tech giants—are all preparing to integrate AI large models into wearable devices with cameras, such as smart glasses. They believe that hardware like smart glasses will serve as suitable carriers for AI large models because multimodal AI models can process various types of information, including sound, images, and videos.
Insiders revealed that OpenAI, the star AI startup, is currently embedding its "GPT-4 with Vision" object recognition software into products from social company Snap. This may bring new features to Snap's smart glasses, Spectacles.
Meta showcased its integration of AI features into Ray-Ban smart glasses last Tuesday. These smart glasses can describe what users see through an AI voice assistant, suggest which shirt matches which pants, and even translate Spanish newspapers into English, among other new functionalities.
Amazon's Alexa AI assistant team also has a group working on developing a new AI device with visual capabilities. Additionally, like most smartphone manufacturers, Google has begun experimenting with integrating AI features into mobile phones.
Beyond that, Apple's Vision Pro headset officially debuted in June of this year and is scheduled to go on sale next year. However, The Information speculates that the device may initially lack multimodal AI functionality.
As a new wave of mobile device transformation begins, how will tech giants like Apple, Microsoft, Google, and Meta position themselves in this new battlefield? How will they highlight their AI advantages across various hardware? Which new AI hardware might become the best carrier for large AI models? Through the latest revelations, we can see an AI hardware innovation war beginning.
In a video demonstrating its recently launched Gemini model, Google showed how the AI can guess a movie's title by watching someone act it out. The demo also showcased capabilities such as recognizing maps and walking users through hands-on tasks.
Although the video content might have been edited, it reveals Google's core concept: developing an always-on AI that can provide direct feedback or assistance based on what users are seeing or hearing. A source familiar with Google's consumer hardware strategy stated that delivering this experience might take several more years, as implementing context-aware computing would consume significant amounts of power.
Google is now redesigning the operating system for its Pixel phones, aiming to embed smaller-scale Gemini models to power its mobile AI assistant, Pixie. For example, Pixie could tell users where to buy products they have just photographed.
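As an illustration of the concept only, the sketch below shows what a "photograph a product, then ask where to buy it" query against a multimodal Gemini model could look like via the google-generativeai Python SDK. This is not Pixie itself; the model name, prompt, and file name are assumptions.

```python
# Illustrative sketch only: querying a multimodal Gemini model about a photo the
# user just took. Not Pixie itself; model name, prompt, and file name are assumed.
import os
import PIL.Image
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

model = genai.GenerativeModel("gemini-pro-vision")  # multimodal Gemini variant
photo = PIL.Image.open("product_photo.jpg")         # the photo the user just took

response = model.generate_content(
    [photo, "What product is shown here, and where could I buy something like it?"]
)
print(response.text)
```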
Given Google's long-term strategy in search technology, The Information believes that AI devices capable of learning and predicting people's needs from environmental context seem perfectly suited to Google. Although Google Glass failed a decade ago, the company later encouraged Android phone manufacturers to use phone cameras to scan their surroundings and send images to Google for cloud-based analysis, which led to the Google Lens image search application.
Sources familiar with the strategy revealed that the company recently halted development of its glasses-style device but continues to build software for such products. These sources indicated that Google plans to use its large AI models to license image-search software to hardware manufacturers, much as it developed the Android mobile OS for phone makers such as Samsung.
Amid the surge in multimodal AI models, Microsoft researchers and product teams have also begun upgrading their voice assistants and experimenting with running AI features on smaller devices.
According to patent applications and informed sources, Microsoft's model could support affordable smart glasses or other hardware. The company plans to run AI software on its AR headset HoloLens. Users can point the headset's front camera at an object, take a photo, and send it to an OpenAI-powered chatbot for direct object identification. Additionally, users can obtain more information through conversational interactions with the chatbot.
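The capture-and-ask flow described above maps fairly directly onto OpenAI's vision-capable chat API. Below is a minimal sketch, assuming the headset frame has already been saved as a local JPEG and the openai Python package (v1.x) is installed; the model name, prompt, and file name are illustrative assumptions, not details of Microsoft's implementation.

```python
# Minimal sketch of a "point, snap, and ask" flow: a captured camera frame is sent
# to an OpenAI vision-capable chat model for object identification.
# Assumptions: the frame is already saved locally; model name and prompt are illustrative.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def identify_object(image_path: str) -> str:
    # Encode the captured frame so it can be embedded in the request.
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "What object is in this photo? Give a short answer."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
        max_tokens=100,
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(identify_object("headset_frame.jpg"))
```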
▲HoloLens
While Apple's Vision Pro boasts numerous multimodal features, its progress in large AI models slightly trails behind competitors. Currently, there's no indication that Vision Pro will launch with sophisticated object recognition or other multimodal AI capabilities.
Apple has spent years perfecting the computer vision capabilities of Vision Pro, enabling the device to quickly recognize its surroundings. This includes rapidly identifying furniture and understanding whether the wearer is in the living room, kitchen, or bedroom. Perhaps Apple is developing a multimodal large model capable of recognizing images and videos.
▲Vision Pro
However, compared to the glasses form factors being developed by other companies, Vision Pro is large, heavy, and not suitable for everyday outdoor use.
Separately, Apple reportedly paused development of its own AR glasses earlier this year to focus on selling its headset. It remains unclear when AR glasses development will resume.
Meta Chief Technology Officer Andrew Bosworth announced on Instagram this Tuesday that certain Ray-Ban smart glasses users will gain direct access to AI models through their eyewear.
▲Ray-Ban
Meta executives view Ray-Ban glasses as a "pioneer" for AR glasses, which can blend digital images with the real world. Originally, Meta planned to launch AR glasses in the coming years, but the project has encountered several challenges. Reports indicate that smart glasses have struggled to attract users, and the development of next-generation displays has faced significant difficulties.
However, the advent of multimodal AI models seems to have reinvigorated Bosworth and his team, helping them understand the range of new AI features these glasses can offer customers in the short term.
This summer, during Amazon's biannual product planning sessions, engineers from the Alexa team proposed launching a new device capable of running multimodal AI.
According to individuals directly familiar with the project, the team is particularly focused on reducing the computational and memory demands for AI processing of images, videos, and voice on the device. It remains unclear whether the project has received funding or what specific problems the device aims to solve for customers, but it is separate from the company's Echo voice assistant product line.
Previously, the Alexa team also developed smart audio glasses called Echo Frames, which have no display or camera. It is currently unknown whether Amazon will develop smart glasses with visual recognition capabilities.
This isn't the first time Silicon Valley giants have designed such camera-equipped wearable devices. Previously, Google, Microsoft, and other tech giants have developed AR headsets. They initially hoped to display digital screens on the semi-transparent lenses of headsets, gradually providing guidance to help users complete tasks. However, due to the complexity of optical design, most products ultimately received poor feedback.
OpenAI's multimodal large language model can use visual recognition capabilities to let AI understand what people are looking at and doing, and provide further information about these actions and objects. As large language models become more lightweight, some small devices can also carry these models, enabling instant responses to user requests. Considering people's emphasis on privacy and security, it may take some time before they accept smart glasses and other AI devices with built-in cameras.
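As one hypothetical illustration of how models are slimmed down for small devices (the report does not name a specific technique), post-training quantization stores weights at lower precision to cut memory use. The sketch below applies PyTorch's dynamic quantization to a toy network purely for demonstration.

```python
# Toy demonstration of post-training dynamic quantization in PyTorch: Linear-layer
# weights are stored as int8, shrinking the serialized footprint. The network here
# is a stand-in, not any company's actual model.
import os
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(512, 1024),
    nn.ReLU(),
    nn.Linear(1024, 256),
)

quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

def size_mb(m: nn.Module) -> float:
    """Serialize the model and report its size in megabytes."""
    torch.save(m.state_dict(), "tmp.pt")
    size = os.path.getsize("tmp.pt") / 1e6
    os.remove("tmp.pt")
    return size

print(f"fp32 model: {size_mb(model):.2f} MB")
print(f"int8 model: {size_mb(quantized):.2f} MB")
```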
The Information believes that smart glasses equipped with AI assistants could potentially become as transformative as smartphones. They could not only serve as tutors guiding students through math problems or thesis questions but also provide real-time environmental information to people nearby, such as translating billboards or telling users how to fix car issues.
Pablo Mendes, former engineering manager at Apple and CEO of AI search company Objective, said: "Large AI models are crucial for everything. They will play a role in the underlying architecture of computers, phones, and other devices."
In the third wave of the AI boom, ignited by ChatGPT, it is already clear that multimodal large models are the foundational infrastructure and ChatGPT-style chatbots are the direct application. But on which devices can such applications reach their full potential, and which devices are the best carriers for large language models? These are the questions that tech giants such as OpenAI, Microsoft, and Google are now beginning to explore.
According to the latest revelations from The Information, smart glasses with cameras have emerged as a key exploration path for many of these giants: some companies have started developing new wearable AI devices, while others are working to adapt various large AI models for smartphones.
This perspective isn't limited to tech giants. In China, many AR glasses manufacturers also see this as an opportunity. "Robots and AR glasses may become the biggest beneficiaries of this wave of large AI models," said an industry expert with over a decade of experience in AI.
But among companies pursuing the same design philosophy, who will ultimately develop the best lightweight AI model? And who will build the most practical smart glasses? We will keep following the major tech giants' progress to find the answers.