How to Evaluate the Intelligence of Voice Assistants (5): Designing Indicator Weights
-
This is a summary of the previous four articles introducing evaluation dimensions, as well as a user guide for the checklist.
Knowing yourself and your enemy ensures victory in every battle. Evaluating and researching competitors' products is a daily task for professionals. But when a product lands in our hands, what exactly should we look at? Which aspects should we focus on? Those lacking professional expertise might find it hard to know where to start.
On the other hand, professionals with a well-honed perspective navigate this effortlessly. They know which points to consider, structuring their analysis clearly and prioritizing effectively. Deconstructing a product in this way is a hallmark of professional thinking among AI practitioners.
Someone might ask: among these four major dimensions, which is the most important? The answer: it depends; customize the weights to fit your needs.
Defining weight priorities involves two considerations: industry demands and hardware platforms.
AI voice assistants are typically designed to address specific business needs within a particular industry. They often exist on one or more hardware platforms, interacting with humans. Just like buying a house or hiring an employee, there are many standards to consider. What matters to you determines which dimensions and indicators should carry higher weights.
For example: If a product is positioned for music playback, and its intent understanding modules perform exceptionally well, but it fails to play music due to licensing issues, this would be a major drawback for users because it fails to meet their core need for listening to music.
Another example: If an assistant is designed for offline lifestyle services like ordering food or movie tickets—areas without monopolistic licensing constraints—but involves complex workflows with multiple query conditions, intent understanding would naturally require a high weight.
Yet another example: If a toy or figurine features voice interaction, users might care deeply about whether the voice matches the character's personality. For such users, personality traits would be a high-priority dimension.
The same logic applies to customizing weights for major dimensions, as well as the indicators within each dimension.
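To make the weighting concrete, here is a minimal sketch of a weighted scorecard, assuming a simple two-level weighting (dimension weights and indicator weights within each dimension). The dimension names follow the four dimensions of this series; the indicator names, weights, and scores are hypothetical placeholders, not values recommended by the author.

```python
# Minimal weighted-scorecard sketch. All weights and scores are hypothetical;
# replace them with values that match your industry, hardware platform, and
# test results.

# Dimension weights (sum to 1.0). For a music-focused speaker, service
# provision might dominate; for a lifestyle-services assistant, intent
# understanding might carry the highest weight instead.
DIMENSION_WEIGHTS = {
    "intent_understanding": 0.40,
    "service_provision": 0.30,
    "interaction_fluency": 0.20,
    "personality_traits": 0.10,
}

# Indicator weights within each dimension (each group also sums to 1.0).
INDICATOR_WEIGHTS = {
    "intent_understanding": {"multi_condition_queries": 0.6, "context_tracking": 0.4},
    "service_provision": {"content_coverage": 0.7, "result_relevance": 0.3},
    "interaction_fluency": {"service_stability": 0.5, "response_latency": 0.5},
    "personality_traits": {"voice_persona_fit": 1.0},
}


def weighted_score(indicator_scores: dict[str, dict[str, float]]) -> float:
    """Aggregate 0-100 indicator scores into a single overall product score."""
    total = 0.0
    for dim, dim_weight in DIMENSION_WEIGHTS.items():
        dim_score = sum(
            INDICATOR_WEIGHTS[dim][ind] * score
            for ind, score in indicator_scores[dim].items()
        )
        total += dim_weight * dim_score
    return total


if __name__ == "__main__":
    # Scores for one product, e.g. measured on a shared test dataset.
    product_a = {
        "intent_understanding": {"multi_condition_queries": 85, "context_tracking": 70},
        "service_provision": {"content_coverage": 60, "result_relevance": 75},
        "interaction_fluency": {"service_stability": 90, "response_latency": 80},
        "personality_traits": {"voice_persona_fit": 65},
    }
    print(f"Product A overall score: {weighted_score(product_a):.1f}")
```

Changing the weight tables is all it takes to re-purpose the same checklist for a music speaker, a food-ordering assistant, or a voice-enabled toy.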
Creating a checklist is a demanding and mentally taxing task. This one took the author a significant amount of time, and many of its points are worth discussing.
In particular, deciding which indicators to retain and which to merge required numerous trade-offs.
Some might point out that an indicator was omitted: speech recognition performance. Rather than an oversight, this was a deliberate choice by the author.
To the author's knowledge, the best ASR (Automatic Speech Recognition) systems currently reach about 97% accuracy, a figure that marks the technology as mature. In the future, ASR and TTS (Text-to-Speech) will be as fundamental as utilities in the AI field, much like choosing between Tencent Cloud and Alibaba Cloud. Paying for the technology and services can close most of the gap, so it does not merit a standalone indicator in the evaluation framework.
Thus, basic speech recognition performance is categorized under the interaction fluency dimension as part of "service stability."
ASR technology will eventually reach parity across the board. However, if a system can convert dialect speech (the audio itself) to Mandarin before transcribing it to text, that is a different topic altogether. Dialect-to-Mandarin conversion and any-language-to-Mandarin conversion follow the same logic. In such cases, the author might classify the capability under the intent understanding dimension.
This explains the author’s logic for selecting and categorizing indicators. The considerations above represent the author’s best effort to achieve MECE (Mutually Exclusive, Collectively Exhaustive) completeness.
The author strives for comprehensiveness, but not every indicator is necessary. Readers are encouraged to add, remove, or modify categories based on their specific needs.
For example: If evaluating a smart headset or a semantic translation device, "feedback style richness" might not need to be included.
But selection itself is a challenge, testing one’s judgment. Consider this example: Early generations of iPhones had glass screens that were prone to shattering—a flaw that would have failed Nokia’s evaluation criteria. The rest, as they say, is history. Food for thought.
Quantifying indicators shouldn’t be a hurdle. Use test datasets to validate performance, compare metrics across competitors, and draw conclusions.
In business, it’s about relative positioning, not absolute metrics. You don’t need a perfect score—just a clear lead over competitors in a specific area to claim your product is the "best" in that regard.
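To illustrate the relative-positioning point, here is a small sketch that compares per-dimension scores against a single competitor and reports where a product leads or trails. The per-dimension scores could come from an aggregation like the scorecard sketch above; the numbers here are made up for illustration.

```python
def dimension_gaps(ours: dict[str, float], theirs: dict[str, float]) -> dict[str, float]:
    """Per-dimension score gap (positive: we lead, negative: we trail)."""
    return {dim: ours[dim] - theirs[dim] for dim in ours}


if __name__ == "__main__":
    # Hypothetical per-dimension scores (0-100) for our product and a competitor.
    ours = {"intent_understanding": 82, "service_provision": 64,
            "interaction_fluency": 88, "personality_traits": 70}
    competitor = {"intent_understanding": 75, "service_provision": 80,
                  "interaction_fluency": 85, "personality_traits": 60}

    for dim, gap in sorted(dimension_gaps(ours, competitor).items(),
                           key=lambda kv: kv[1], reverse=True):
        status = "leads" if gap > 0 else "trails"
        print(f"{dim}: {status} competitor by {abs(gap):.0f} points")
```

Read this way, the goal is not a perfect aggregate score but a dimension where the gap is clearly in your favor.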
Beyond evaluating other AI assistants, this checklist can also guide product positioning during the planning phase or serve as a performance evaluation tool.
Defining a product’s scope—what to include and exclude—is a critical exercise. Initially, smart speakers had no screens, with everyone copying Amazon’s Echo. Later, screen-equipped speakers emerged. This reflects strategic product positioning.
From a business perspective, a product with glaring flaws cannot survive in the market, and one without standout features is doomed to mediocrity. Companies cannot spread resources evenly across every dimension; a product built that way ends up average at best, and only excellence somewhere ensures survival.
Your product’s upper limit is its selling point, the key to standing out in competition.
Take budget smartphones, for instance. Most resources go into the CPU and display, with other components barely meeting standards. These two selling points alone can propel the product to the top of reviews, influencing consumer choices.
Even the worst smartphones must include a camera, and its performance cannot fall below a certain threshold; if it does, the product will not survive. The earlier iPhone example speaks to durability as well: the glass screen was fragile, but not so fragile that it fell below the baseline users would tolerate.
The lower limit defines the minimum standard for market survival.
Once baseline requirements are met, resources should be concentrated on excelling in select dimensions. The challenge lies in delivering a product that aligns with its positioning under resource constraints.
Excelling in every dimension is neither realistic nor advisable.
The first step in strategy is knowing what to abandon—a test of insight.
In summary:
This guide serves dual purposes: positioning products during planning and evaluating existing ones to quantify strengths and weaknesses for iterative improvements. Mastering this checklist unlocks its full value.
When faced with problems, experts and novices think differently.
Novices tackle issues as they come—reacting to challenges. Experts, however, rely on decision-making systems, consulting checklists for consistent, objective choices.
Human rationality is limited. Ad-hoc decisions are prone to environmental and emotional biases, leading to unpredictable outcomes. Principles-based checklists enhance control, reduce hesitation, and boost efficiency. Experts often resemble stable, objective programs—calm and rational, making better decisions.
In designing evaluation metrics, the author benchmarks against an ideal AI assistant, helping us inch closer to superintelligence and create awe-inspiring products.
Thank you for reading. May this checklist serve you well.
How to Evaluate the Intelligence of Voice Assistants (1): Intent Understanding
How to Evaluate the Intelligence of Voice Assistants (2): Service Provision
How to Evaluate the Intelligence of Voice Assistants (3): Interaction Fluency
How to Evaluate the Intelligence of Voice Assistants (4): Personality Traits