How to Evaluate the Intelligence of Voice Assistants (Part 2): Service Provision

baoshi.rao wrote:
    On the question of how to evaluate voice assistants, the author has analyzed four dimensions. This article focuses on the 'Service Provision' dimension, breaking its evaluation points down in terms of scenario understanding and the ability to integrate CP and SP capabilities.

    Many people think of AI as an industry, but AI is not an industry in itself. The reality is 'industry + AI': how existing industries can leverage AI to drive industrial upgrades, improve operational efficiency, and create more social value.

    In the previous article, a user raised a demand: 'I just want a smart and easy-to-use assistant that can meet all my daily needs.'

    Meeting these 'various needs' actually relies on existing solutions, with AI merely attempting to revolutionize the experience.

    We all know the famous formula: User Value = (New Experience - Old Experience) - Switching Cost.
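The formula can be made concrete with a small sketch. The numeric scores below are hypothetical; in practice each term would come from user research, not a single number.

```python
def user_value(new_experience: float, old_experience: float,
               switching_cost: float) -> float:
    """User Value = (New Experience - Old Experience) - Switching Cost."""
    return (new_experience - old_experience) - switching_cost

# A voice assistant that is only slightly better than the old app,
# but requires learning new habits, can net out negative:
print(user_value(new_experience=7.0, old_experience=6.0, switching_cost=2.0))  # -1.0
```

A marginal experience gain wiped out by switching cost is exactly the trap described next.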

    So the question arises: If users are already satisfied, why should they switch to you?

    Many AI startups fail to understand or address this issue properly, leading to their premature demise.

    The construction of any service behind a smart assistant relies on the capabilities of CPs (Content Providers) and SPs (Service Providers). How to cleverly integrate these capabilities with AI is a highly valuable area of research.

    In the previous article, the author focused on the 'Intent Understanding' dimension. This article delves into the evaluation points of the 'Service Provision' dimension.

    When discussing this module, the evaluation considers scenario understanding and the ability to integrate CP and SP capabilities.

    Doraemon has a magic pocket that can solve countless problems.

    Baymax from Big Hero 6 was initially positioned in the healthcare field, which is relatively narrow. Later, through iterations and upgrades, Baymax developed additional capabilities—this is capability expansion.

    'Look at my Baymax—big and round, capable of healing and fighting. It would be great if it could do even more.'

    So, is it better for a smart voice assistant to have as many skills and services as possible?

    Before understanding this dimension, we must clarify the comparison.

    In the previous article, the author mentioned: 'On the market, solutions like Tencent DingDang, XiaoAi, Tmall Genie, and Xiaodu Smart Speaker belong to the largest open domains.'

    Behind this product form is the integration of group resources into a smart hardware device, adding more value to the speaker.

    From the author's perspective, this is also an exploratory attempt by CPs and SPs to find ways for AI to provide user value through hardware carriers like speakers in the inevitable smart era of the future.

    From this angle, they are undoubtedly comprehensive.

    The giants' strategy is to build ecosystems, so evaluating a smart assistant the way one evaluates an app store, by counting the skills it lists, is fundamentally misguided.

    Thus, Siri's future positioning must be based on Apple's ecosystem, serving as a connector between users and SPs/CPs. It acts as an intermediary, helping users find CPs and SPs.

    In reality, the ones solving our daily problems are the CPs and SPs in various industry segments.

    From the CP perspective: video, music, audio content, text content, gaming, etc.

    From the SP perspective: transportation, education, healthcare, finance, e-commerce, travel, dining, customer service, offline lifestyle services, etc.

    Therefore, the true competition in service comprehensiveness lies in the ability to solve specific problems.

    For example, Didi's positioning is to solve users' transportation needs. How to address the last-mile problem? Acquire a 'bike-sharing' company.

    In practical evaluations, testing many AI assistants on the market reveals that some services are available but coverage is insufficient.

    For instance, many AI assistants offer flight booking, but few cover the entire service chain. For example:

    ...

    Positioning can be broad or narrow. Only by clarifying the product's positioning and then meeting user needs within that scope can we accurately evaluate 'resource/service comprehensiveness.'

    As AI practitioners, we should focus on how to use current AI capabilities to upgrade industries and provide better value to users, striving to excel in specific niche demands.

    This is the question we should repeatedly ponder.

    Alongside comprehensiveness, there is also the pursuit of quality.

    The best quality in the industry comes from giants like BAT (Baidu, Alibaba, Tencent), whose services are backed by mature interfaces; the real competition happens between their SPs and CPs. Essentially, it is about delivering the content and services already available on smartphones through another hardware carrier.

    From the user's perspective, whether they find content/services via touch or voice doesn't matter. What truly matters is whether the need is met and if the experience is upgraded.

    For example, as long as I can get a train ticket home, I don’t care which app or method completes the transaction.

    On this point, giants have well-equipped SPs and CPs, while smaller companies struggle. For instance, if I want to listen to Jay Chou's songs and the assistant understands the request but lacks QQ Music support, it must resort to alternatives, risking copyright issues.

    Mid-sized companies like Himalaya build speakers based on content, bundling content to drive sales.

    So the question arises: If you’re not a giant, lack content, and have limited funds to buy licenses, what can you do?

    In some niche areas, self-built content is an option. Here’s a speculative idea:

    First, define the scenario: How can a smart kitchen revolutionize the user experience?

    For example, adding a screen, microphone, speaker, and Wi-Fi module to a fridge—hardware costs are manageable—could create a kitchen AI robot.

    It could promote daily deals, integrate with grocery apps or local stores, and allow voice orders for kitchen needs. In the kitchen scenario, the screen could display cooking videos sourced from platforms like Zhihu, TikTok, or Xiachufang. Content maintenance in this niche is cost-effective.

    Another example: some popular games already have smart assistants, with varying quality.

    Possible services include providing game guides, customer support, collecting user feedback, and integrating operational and monetization services.

    For a single service point, when a player is stuck on a level, the assistant could offer guides, temporary power-ups, or even complete the battle for them—effectively boosting user retention.

    Thus, whether a smart assistant can deliver high-quality content based on scenarios and needs is a critical evaluation point.

    Simply put, it’s about the richness of the assistant’s response types.

    For example, in real life, if you ask someone, 'Can you tell me about this house?', the responses vary.

    If the same question is posed to a voice assistant, the reply could include:

    ...

    Whether asking questions or giving feedback, the assistant’s response always follows a format.

    When interacting with many smart customer-service systems, asking a basic question like 'Where is the check-in page?' often yields step-by-step text instructions.

    The problem is that the reply is plain text: why not include a button that jumps straight to the page? Clearly, the system lacks a 'jump' capability.

    Therefore, the richness of response formats should also be an evaluation metric.

    Here are the current response styles: text, images & text, video players, audio tracks, tabs, forms, functional buttons, multimodal interactions, etc.
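The response styles listed above can be modeled as a small tagged structure. The following is a minimal sketch; all class and field names are hypothetical, not any particular assistant's API.

```python
from dataclasses import dataclass, field
from enum import Enum

class ResponseStyle(Enum):
    TEXT = "text"
    IMAGE_TEXT = "image_text"
    VIDEO = "video"
    AUDIO = "audio"
    TABS = "tabs"
    FORM = "form"
    BUTTON = "button"

@dataclass
class AssistantResponse:
    style: ResponseStyle
    text: str
    media_url: str = ""                          # filled only for media styles
    actions: list = field(default_factory=list)  # e.g. a "jump" button target

# The check-in example from earlier: plain text plus a jump button,
# so the user does not have to follow step-by-step instructions by hand.
resp = AssistantResponse(
    style=ResponseStyle.BUTTON,
    text="Check-in is on the 'My Account' page.",
    actions=[{"label": "Open check-in page", "target": "app://checkin"}],
)
```

Separating style from payload lets the same answer be rendered differently per scenario, which is the point of the next paragraph.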

    It’s important to note that styles shouldn’t be overly flashy or complex—simplicity and appropriateness are key. Often, overly elaborate designs lead to slower loading times and poorer feedback, which can negatively impact evaluations (this will be discussed later in the evaluation criteria).

    The choice of style depends on the specific scenario. Therefore, the evaluation point is whether the design understands user needs and employs diverse feedback styles to enhance the user experience.

    This aspect truly tests one’s design skills—a great UI is always simple and elegant. It presents content only when the user needs it.

    Presentation is divided into two parts: GUI (Graphical User Interface) and VUI (Voice User Interface).

    For example, an airplane cockpit is often overwhelming and frustrating due to its excessive dashboard and buttons, which can paralyze decision-making. This type of design can be called "deterrent design." In contrast, a car cockpit is much better designed because it simplifies operations.

    Even among car cockpits, different manufacturers handle this differently—Tesla’s approach, for instance, is exceptionally elegant.

    Those who remember the era of feature phones know that physical keyboards once dominated the screen, whereas today’s mobile keyboards appear only when needed. There are countless similar examples.

    Thus, the rationality of content presentation should also be an evaluation criterion. Even complex content must be well-organized, presented in layers and stages based on user context.

    To illustrate, here are a few examples from voice interaction:

    1. Pure Voice Dialogue Scenario: Imagine friends coming over for a weekend gathering. Some may call for directions if they’re unfamiliar with the route. Typically, directions are given using terms like "north, south, east, west" or street addresses—effective for those with good spatial awareness (often men). However, this approach may not work as well for women, who might prefer landmarks like store brands, billboards, building shapes, or colors. Adjusting the phrasing accordingly improves comprehension.

      • This demonstrates tailoring communication based on user profiles.
    2. Smart Kitchen Scenario: When asking how to cook a dish, a video recommendation might seem ideal—allowing users to watch, listen, and cook simultaneously. However, without proper content management, users might either watch the video first or multitask, leading to the "easy to watch, hard to execute" problem due to information overload.

      • Properly layered and staged content presentation can significantly improve the experience.
    3. Voice-Controlled Game Design: Early versions of a game tutorial might list commands like "continue," "repeat," "next step," or "quit." However:

      • Problem 1: Terms like "command list" are too technical. A friendlier alternative is "You can say:"
      • Problem 2: Users often freeze or forget commands, resorting to phrases like "What did you say?"
      • Problem 3: Unnatural phrasing (e.g., "Say 'continue' to proceed") can confuse users. A better approach is: "To proceed, say 'continue.'"
      • This highlights the challenge of voice interaction without visual cues, requiring timely and clear prompts.

    In natural language processing, logical sequencing is critical. Hence, content presentation rationality is a key evaluation point.
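The three prompt fixes from the game example can be sketched as a toy helper: state the goal before the command phrase, and offer a friendly "You can say:" hint instead of a technical "command list". The function name and wording are illustrative assumptions, not a real framework.

```python
def build_prompt(goal: str, phrase: str, hints: list) -> str:
    # Goal first, then the phrase, per the "To proceed, say 'continue'" fix.
    lines = ["To {}, say '{}'.".format(goal, phrase)]
    # Friendlier than "command list"; also reminds users who freeze up.
    lines.append("You can say: " + ", ".join(hints))
    return "\n".join(lines)

print(build_prompt("proceed", "continue", ["continue", "repeat", "quit"]))
```

With no visual cues available, the prompt itself has to carry the affordances a GUI would normally provide.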

    Even with the best intentions, AI assistants sometimes face limitations:

    1. Understood but Out of Scope: How to respond when the request exceeds capabilities?
    2. Unclear but Guessable: How to reply when the AI is uncertain?

    Practical solutions vary by business unit. Common approaches include:

    • For ambiguous queries (e.g., "investment recommendations" vs. "I want to buy investments"), the AI may offer different responses based on interpretation.
    • Case 1: Casual replies are inadequate.
    • Cases 2 & 3: While not perfect, they strive to provide solutions (e.g., directing to an app or navigation), creating business value.

    Thus, fallback response quality measures how much value it brings to users and the company, making it another evaluation point.
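The two fallback cases above can be sketched as a simple dispatch. The confidence threshold, intent names, and reply wording are all hypothetical; the point is that each branch offers a useful next step rather than a canned reply.

```python
from typing import Optional

def fallback(intent: Optional[str], confidence: float) -> str:
    if intent is None:
        # Not understood at all: ask the user to rephrase.
        return "Sorry, could you rephrase that?"
    if confidence < 0.5:
        # Unclear but guessable: confirm the guess before acting.
        return "Did you mean '{}'? Say yes to continue.".format(intent)
    # Understood but out of scope: hand off to something that can help
    # (e.g. a companion app), instead of giving a casual reply.
    return "I can't do '{}' yet, but I can open the app for you.".format(intent)
```

Even the out-of-scope branch creates business value by routing the user somewhere the need can still be met.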

    This discussion goes beyond listing criteria—it incorporates deeper business insights. Originally, more evaluation points were planned under "Service Provision," but some were omitted for practicality. Here are a few deleted metrics:

    1. Travel Planning Example: "Mr. Ma is traveling to Beijing next week—please arrange his itinerary." How should the AI handle flight bookings, hotels, transportation, and reminders?
    2. Multi-Device Scenarios: Voice assistants can exist on watches, phones, cars, etc. How should feedback adapt across devices?
    3. Open-Domain Chat: Addressing loneliness (e.g., "Her’s" Samantha or Microsoft’s Xiaoice).

    These three points are indeed high-difficulty challenges. Although solutions have been considered, they are relatively low-frequency scenarios for most intelligent assistants, so they were set aside. Of course, they could also be included as bonus criteria: if implemented well, they could become highlights or even key selling points that win market competitiveness.

    When a user makes a request, the AI first understands it and then provides feedback. That feedback performance is the focus of the various considerations under the [Service Provision] dimension discussed in this article.

    We all know the famous formula: User Value = (New Experience - Old Experience) - Switching Cost. Reading Yu Jun's Product Methodology recently deepened my understanding of this.

    How can we unleash AI's capabilities to innovate experiences and maximize the value of (New Experience)?

    At the same time, what constitutes the user's (Switching Cost), and how can we reduce it? How can we push both ends to maximize user value?

    This is the question we must ponder repeatedly.

    With this, the discussion of the second major dimension, [Service Provision], concludes.

    Subsequent articles will supplement the remaining parts and provide additional explanations and refinements in the same format.

    Thank you for reading this far. If you have any questions, feel free to leave a comment in the discussion section for an in-depth exchange with the author.

    Previous article: How to Evaluate the Intelligence of Voice Assistants (Part 1): Intent Understanding
