How to Evaluate the Intelligence of Voice Assistants (1): Intent Understanding

This article defines and discusses the first major module, [Intent Understanding]: whether the AI can comprehend and recognize the user's expressed intent. The author believes this module is the core dimension for measuring AI intelligence, and lays out its evaluation dimensions and metrics.

In my year and a half working in the AI-NLP field, I have been immersed in learning: day to day I study the various voice assistants on the market and write different kinds of research reports to sharpen my business acumen, while also digging into framework-level knowledge to broaden my expertise.

    After carefully and repeatedly reviewing the three major interaction guidelines—Google's Conversational Interaction Design Guidelines, Alibaba's Voice Interaction Design Guidelines, and Amazon's Voice Interaction Design Specifications—and reflecting on past work experiences, I have attempted to distill a knowledge framework. My goal is to internalize these guidelines as passive skills, laying the groundwork for creating better products in the future.

In my view, the ideal artificial intelligence would be like the superhero AIs of film and fiction: they can always solve the various problems in our lives.

    Although our world is still far from achieving such super AI—and perhaps never will—it doesn't hurt to hold AI to an extremely high standard, pushing ourselves to create better products.

Before diving into the article, briefly set aside your identity as an AI practitioner and imagine yourself as a novice user. The demand boils down to one sentence: "I just want a smart and useful assistant that can meet all my daily needs."

    How do we define "useful"? How are "various needs" met? The challenge lies in the lack of boundaries. The only thing that truly meets the above requirements is a magic lamp that grants unlimited wishes.

    So, let's break it down into modules. For intelligent voice assistants, I propose the following four major evaluation dimensions: [Intent Understanding], [Service Provision], [Interaction Fluency], and [Personality Traits].

    In other words, if these metrics can all be satisfied, we wouldn't be far from the super AI we desire. Whoever can deliver this will win users' favor.

    Each evaluation dimension has corresponding sub-metrics, which we will dissect step by step.

We begin with the first major module, [Intent Understanding], which evaluates whether the AI can comprehend and recognize the user's expressed intent. I believe this module is the core dimension for measuring AI intelligence.

(1) Central-Control Intent Allocation

Current AI assistants on the market often come with a variety of capabilities. From a practitioner's perspective, these are essentially collections of skills, each skill serving and fulfilling needs in a specific domain: playing music, navigation, reminders, movie tickets, flights, train tickets, and so on.

    Many skills perform exceptionally well within their fixed domains, but when combined, their performance may not be as good.

    The core consideration: the ability to accurately identify user needs and allocate them to the appropriate skill service.

For every user request, the computer provides feedback: text, voice, images, functional cards, multimedia events, and so on. Before that feedback can happen, the system must first recognize and understand the request, then allocate it to the right skill, which finally delivers the feedback, i.e., the service behavior.

    Human language is incredibly diverse. We expect computers to naturally understand human intent through natural expressions and use corresponding responses to connect with services.


    This is the ability of the central controller to allocate intent. It is also the core capability of all AI assistants to integrate various skills. Without effective intent recognition by the central controller, intelligence is out of the question.
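To make the idea concrete, here is a minimal sketch of central-control allocation, assuming a toy setup in which each skill exposes a recognizer that scores its own confidence. The names (Intent, CentralController, route) are hypothetical, not any product's real API.

```python
from dataclasses import dataclass, field

@dataclass
class Intent:
    skill: str          # which skill should serve the request
    confidence: float   # how sure the recognizer is
    slots: dict = field(default_factory=dict)  # extracted parameters

def music_recognizer(utterance: str):
    # Toy recognizer: claims the request if it looks like a music command.
    if "music" in utterance.lower() or "song" in utterance.lower():
        return Intent(skill="music", confidence=0.9)
    return None

class CentralController:
    def __init__(self, recognizers):
        self.recognizers = recognizers  # one recognizer per skill

    def route(self, utterance: str) -> Intent:
        candidates = [r(utterance) for r in self.recognizers]
        candidates = [c for c in candidates if c is not None]
        if not candidates:
            # No skill claims the request: hand it to the fallback skill.
            return Intent(skill="fallback", confidence=0.0)
        # Allocate to the skill whose recognizer is most confident.
        return max(candidates, key=lambda c: c.confidence)

controller = CentralController([music_recognizer])
print(controller.route("let the music play").skill)  # -> music
print(controller.route("how is the weather").skill)  # -> fallback
```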

On the market, assistants like Tencent DingDang, Xiaomi's Xiao Ai, and Baidu's Xiaodu belong to the broadest, open-domain category. Yet many of their skills can only be activated by command phrases, which means waiting and lengthy dialogue flows. Given how uncertain voice input is, why wouldn't users just use a GUI (graphical user interface) to get the job done?

In niche domains like travel, dining, customer service, or gaming, having to reach a specific skill through keywords before the service can even be requested feels clunky.

    If full open-domain central control isn't achievable, at least the ability to recognize and allocate intent within fixed domains should be optimized. This leverages the convenience of voice input to reach goals directly, avoiding the impression of being a toy.

(2) Generalization Ability

In plain terms: can the AI correctly understand the same meaning when it is expressed in different ways?

    The industry term is "generalization of recognizable phrases/slot values." The solution is "increased semantic coverage."

    Generalization comes in two forms: sentence patterns and slot values.

    First, examples of sentence patterns:

    I often observe user dialogue logs and notice that users express music playback requests in various ways:

    "I want to listen to music >> Just play any song >> Let the music play >> Music, let's go >>"

    Some are correctly understood, with [Music] triggering a random playlist. Others, if not understood, are handled by the [fallback] mechanism, which counts as an assistant error.
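As a toy illustration of what "increased semantic coverage" means, the sketch below treats generalization as nothing more than widening the set of phrasings that resolve to the [Music] intent. Real systems generalize with trained NLU models rather than regex lists, so this is a deliberate simplification.

```python
import re

# Every phrasing below must resolve to the same [Music] intent; anything
# that falls through goes to [fallback] and counts as an assistant error.
MUSIC_PATTERNS = [
    r"i want to listen to music",
    r"just play any song",
    r"let the music play",
    r"music, let'?s go",
]

def is_music_intent(utterance: str) -> bool:
    text = utterance.strip().lower()
    return any(re.fullmatch(p, text) for p in MUSIC_PATTERNS)

print(is_music_intent("Let the music play"))  # True: covered phrasing
print(is_music_intent("put on some tunes"))   # False: a coverage gap to fix
```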

    Examples of slot values:

An assistant I developed has a [Movie Tickets] skill; its user dialogue logs are revealing.

    The AI first extracts the corresponding movie title, then passes it to an interface for querying. Only the "full name of the specified movie" can ensure a successful query, so special mapping is required here.

    In the movie ticket example, context and timeliness are critical. When users say they want to watch a certain series at different times, they rarely specify the installment number verbally.
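A minimal sketch of that mapping step follows, with invented aliases and catalog data (the titles and the "currently showing" table are hypothetical): user phrasings are normalized to the full canonical title the query interface expects, and bare series names resolve to the installment now in theaters.

```python
# Hypothetical alias table: phrases users actually say -> the full canonical
# title the ticketing interface requires for a successful query.
ALIAS_TO_CANONICAL = {
    "ne zha": "Ne Zha: Birth of the Demon Child",
    "the demon child": "Ne Zha: Birth of the Demon Child",
}

# For series, users rarely say the installment number; timeliness decides it,
# so resolve the series name to the installment currently in theaters.
SERIES_CURRENT = {
    "some detective series": "Some Detective Series 3",
}

def normalize_title(raw_slot: str) -> str:
    key = raw_slot.strip().lower()
    if key in ALIAS_TO_CANONICAL:
        return ALIAS_TO_CANONICAL[key]
    if key in SERIES_CURRENT:
        return SERIES_CURRENT[key]
    return raw_slot  # pass through; let the query interface report a miss

print(normalize_title("Ne Zha"))  # -> Ne Zha: Birth of the Demon Child
```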

These requirements stem from real-life scenarios. Since humans can understand them, AI should be able to as well, and ideally do better.

    As practitioners, we must regularly review our company's dialogue logs to observe how users interact with our product. This is crucial for iterative improvements, allowing us to refine the product based on actual usage.

    Based on past generalization experience, structural sentence variations are relatively minor, while word variations are abundant. Analyzing user dialogue logs like data reveals many insights.

    For example, Alibaba's Tmall Genie supports voice shopping. During the 2020 Spring Festival, many users didn't just say "I want to buy masks" but also "I want to buy 3M masks" or even "Do you have N95 masks?"—since, in that context, N95 was almost synonymous with masks.

    If such cases aren't covered, they can only be addressed through version updates. AI practitioners should reflect on the efficiency of their product iterations.

    Thus, "getting it right from the start" versus "identifying flaws through feedback and fixing them iteratively" represents two entirely different levels of product design fundamentals.

    The solution here is a dynamic hot-word database, though product design and operations are beyond this article's scope.

    In practice, new words and phrases constantly emerge. Prioritization and generalization of slot values and sentence patterns are complex topics not suitable for detailed discussion here.
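For what it's worth, a dynamic hot-word database can be as simple as a lookup table that operations can update at runtime, with no version release in between. The sketch below assumes an in-memory store; a production system would back this with a shared, persistent service.

```python
import threading

class HotWordStore:
    """Minimal dynamic hot-word table: operations push new aliases at
    runtime, no app release needed. In-memory storage is an assumption
    for illustration only."""

    def __init__(self):
        self._lock = threading.Lock()
        self._aliases = {}

    def push(self, alias: str, canonical: str) -> None:
        with self._lock:
            self._aliases[alias.lower()] = canonical

    def resolve(self, term: str) -> str:
        with self._lock:
            return self._aliases.get(term.lower(), term)

store = HotWordStore()
store.push("N95", "mask")    # pushed by operations as the phrase emerges
print(store.resolve("n95"))  # -> mask, no version update required
```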

    (3) Feedback Accuracy/Fault Tolerance

    Evaluates whether the AI's responses accurately match user needs and include explicit confirmation to improve fault tolerance. This is highlighted in various voice interaction design guidelines.

    Example:

    "I want to listen to Lin Zhixuan's 'Fireworks Easy to Cool'" >>> If the AI suggests Jay Chou's version, it's wrong.

    If the resource isn't available, the response should be: "Couldn't find XXX, let's listen to YYY instead," which is more reasonable.
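Before continuing, here is a minimal sketch of that reply logic, showing the explicit-confirmation wording; the function and fields are illustrative assumptions, not any assistant's actual implementation.

```python
# Explicit confirmation in reply wording: state what was understood, say
# plainly when the exact resource is missing, then offer the substitute.
def music_reply(song, artist, found):
    # found: a dict like {"title": ..., "artist": ...}, or None if no match
    if found and found["artist"] == artist:
        return f"Playing {found['title']} by {found['artist']}."
    if found:
        # The user now knows we understood, and why a different version plays.
        return (f"Couldn't find {artist}'s version of {song}; "
                f"playing {found['artist']}'s version instead.")
    return f"Couldn't find {song}. Want to try another song?"

print(music_reply("Fireworks Easy to Cool", "Lin Zhixuan",
                  {"title": "Fireworks Easy to Cool", "artist": "Jay Chou"}))
```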

When the content provider genuinely has no resource (because of copyright, say), stating that absence explicitly amounts to: "I understood you, but it truly isn't available, so here is an alternative." If the assistant doesn't make the absence explicit, I may ask once more; if it still won't say so, how am I to know whether I failed to express myself or it failed to understand?

Example: around midnight between January 1 and January 2, a user says, "Set an alarm for tomorrow morning at seven." The request may land just before midnight (still January 1) or just after it (already January 2). In the second case, taken by strict computer logic, "tomorrow morning" on January 2 becomes the morning of January 3, and an alarm set that way spells tragedy. By everyday logic, both cases should produce an alarm for 7:00 a.m. on January 2.

Once the logic is settled, what remains is the wording, and there are several ways the reply could be phrased. Without explicit confirmation there is no fault tolerance, and the user is left uneasy. Let the [Alarm] service burn a user even once and scathing reviews follow. The feature is low-frequency to begin with; once trust is lost, it is shelved for good and never gets another chance.

Experience and observe carefully, and you'll find that a good many AI voice assistants handle these details of feedback poorly; the fault tolerance is simply too low. Good fault-tolerant design ought to be part of every AI practitioner's genes: a passive skill, as instinctive as talent.

GUI interaction means the input is controllable; CUI/VUI interaction means it is not. A fair share of the trouble comes from how humans express themselves, but once an unsatisfactory reply results, the user must pay a heavy cost to try again. Ending up criticized, or written off as "artificial stupidity" or "just a toy that can talk," is dispiriting.

(4) Handling Vague or Ambiguous Expressions

The core consideration: how the AI handles vague or ambiguous user phrasing.

Example: "I'm going to Shanghai on a business trip at 4 p.m. tomorrow." Note that this contains at least two vague or ambiguous elements; for one, no mode of transport is specified.

Example (assume today is Monday): "帮我定下周三去上海的机票" (book me a flight to Shanghai). Note that ASR cannot transcribe pauses: is it "帮我订,下周三" (book it, for next Wednesday) or "帮我定下,周三" (book it, for this Wednesday)? In real conversation, people can infer the segmentation from pause rhythm and the concrete scenario.

Both examples above are real cases fed back from our business.

My own approach to this class of problem: deliver a result up front, then wait for the user's feedback.

For the first example, recommend according to the user's GPS location, how convenient each mode of travel is, and business considerations. Train, plane, or a taxi are all correct choices. For instance, the reply could be: "Given the weather, the train is recommended. Found you train tickets from XX to Shanghai, departing January 3, second-class high-speed rail seat, price..."

For the second example, choose the reading nearest to the time the request was made, return that result, and confirm it explicitly.

When facing vague or ambiguous intent, there must be a handling logic that manages the user's expectations and the service. Handling vague or ambiguous phrasing is a hard problem across the whole industry. A good solution can detect the ambiguity and guide the user to correct it. Whether that logic returns a result directly or asks a follow-up question to judge a second time is a choice for the specific business and scenario.

I won't multiply examples, but whether such a handling scheme exists at all should be counted among the evaluation points.

(5) The Cost of Reaching the Goal

The core consideration: the cost incurred along the way to helping the user reach their goal.

Almost every service skill on the market today works the same way: the AI extracts specifics from the user's utterance and fills designated slots to complete the service recommendation. When the user hasn't supplied a key slot, the assistant must guide them to fill it.

There are two approaches in the market. The first is a fixed path with unalterable slot filling. Take a [Train Tickets] skill: the standard dialogue asks for origin and destination first, then the departure date, then the train number; nothing may be changed or reordered in between, and only then can the query be completed. I call this fixed-order, irreversible slot filling, and it is dumb in the extreme. Reverse the order in which you fill the slots and the AI very likely falls apart.

In real life, an elderly person over 70 here can buy a train ticket at a station window yet (accents aside) cannot buy one through an AI assistant. Why? Because many of the dumber AIs, just like graphical interfaces, require the user to adapt to the machine's logic to complete the slots. That approach defeats the very purpose of natural language interaction.

A good assistant can fill slots in any order and change any slot's condition at will (a minimal sketch of this order-free filling appears at the end of this section).

Example: The user's first utterance: "I want a train ticket from Beijing to Shanghai tomorrow; I want to depart at 4 p.m., and I'd like a first-class seat." From the result, we can see the AI's ability to extract the slots and respond. The user's second utterance: "Also check the day after tomorrow, departing at 10 a.m.; second class would be fine too." If the AI can handle that, it has reached a certain level of intelligence.

The above covers responding to what users say. Within a dialogue-driven service there is also the reverse task: guiding the user to complete the slots.

A simple exercise: in the movie-ticket scenario, going from request to order requires at least four core slots: A, the movie title; B, the cinema; C, the showtime; D, the number of tickets. (Seat selection can fall back to a default rule.) To confirm the order, the assistant simply has to guide the user into filling all four. Good completion and guidance looks like this:

If the user fills A and B, the AI should ask about C and D. Example: "I want to see Ne Zha; find me the nearest cinema." The AI should present the available showtimes, then ask how many tickets.

If the user fills A, B, and C, the AI should ask about D. Example: "I want to see Ne Zha, at the nearest cinema nearby, a showing around eight o'clock." The AI only needs to ask how many tickets.

Whichever of the four main slots A through D the user fills first, in whatever order, the assistant just has to see the rest completed afterward. Human phrasing is endlessly varied, and however many slots there are, people can weave them into a single utterance. Out-of-order slot filling is the basic requirement for intelligence and natural expression.

When I first started working in the AI industry's NLP field, I dreamed of one day building a great product. What counts as great differs from person to person. Since entering the field I've found that, under the current state of technology, what we can actually do is limited; the super AI of science fiction still looks out of reach. So, back in the novice user's shoes, the plainest possible demand: "I just want a smart and useful assistant that can meet all my daily needs." Hence, within today's technical reality, I have written up concrete observations from evaluating products and handling problems in past work.

In truth, the [Intent Understanding] module has more evaluation points than are listed here; constrained by length and by my own ability, I cut some content, and had planned to list the deleted metrics in question form as well. Every problem raised above can be turned into an assessment metric; I can explain what each one is, and the solutions and further thinking will be shared in standalone articles.

Since these are evaluation metrics, they naturally differ in weight. Some are things we can simply work hard to get right, such as the five modules discussed above under [Intent Understanding]; every example cited comes from backend user dialogue logs and is very high-frequency in real business. Others should be scored as key bonus items, and still others as supplementary bonus items.

The deeper and more complete the [Intent Understanding], the closer all of this thinking pushes us, along this dimension, toward super artificial intelligence. My own principle: let users make whatever demands they like, and we do our best to find a way to deliver the rest; only that way can we approach a great product.

With that, the first major module of this article, [Intent Understanding], is complete.
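As promised above, here is a minimal sketch of order-free, revisable slot filling for the movie-ticket exercise: the four core slots can arrive in any order, later utterances may overwrite earlier values, and the assistant asks only for what is still missing. The NLU extraction is stubbed out as plain dicts, and all names here are illustrative.

```python
REQUIRED_SLOTS = ["movie", "cinema", "showtime", "tickets"]  # A, B, C, D

PROMPTS = {
    "movie": "Which movie would you like to see?",
    "cinema": "Which cinema should I book?",
    "showtime": "Which showtime works for you?",
    "tickets": "How many tickets?",
}

class MovieTicketDialog:
    def __init__(self):
        self.slots = {}

    def update(self, extracted: dict) -> str:
        # Later utterances may overwrite earlier slots ("the day after
        # tomorrow instead", "second class is fine too"): overwriting,
        # not just appending, is what makes the filling revisable.
        self.slots.update(extracted)
        missing = [s for s in REQUIRED_SLOTS if s not in self.slots]
        if missing:
            return PROMPTS[missing[0]]  # ask only for what is still missing
        return (f"Confirming: {self.slots['tickets']} ticket(s) for "
                f"{self.slots['movie']} at {self.slots['cinema']}, "
                f"{self.slots['showtime']}. Shall I place the order?")

dialog = MovieTicketDialog()
print(dialog.update({"movie": "Ne Zha", "cinema": "nearest cinema"}))  # asks C
print(dialog.update({"showtime": "around 8 pm"}))                      # asks D
print(dialog.update({"tickets": 2}))                                   # confirms
```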

    Follow-up articles will supplement the remaining parts and provide additional explanations and improvements in the same format.
