How to Evaluate the Intelligence of Voice Assistants (3): Interaction Fluency

baoshi.rao

This article provides a breakdown of evaluation points for the [Interaction Fluency] dimension. This module focuses on assessing the performance metrics and interaction experience of intelligent assistants. It aims to inspire professionals working in related fields.

When a user initiates a request, [Intent Understanding] comes first, followed by [Service Provision], forming a complete loop.

The reason for isolating [Interaction Fluency] as a separate dimension is that it runs through the entire process. If this aspect is not handled well, it can negatively impact the user experience throughout.


"Normal operation," "no bugs," "good robustness"

The evaluation points are straightforward, and almost every internet professional can list them. But then what?

Stability issues can range from minor problems like network congestion with no feedback to extreme cases where the robot might act rebelliously: "Foolish humans, even Asimov's Three Laws of Robotics can't save you."

Just kidding. In reality, defining "what" is easy, but solving "how" often tests one's understanding of the business.

So, a question I often ask interviewees is: "What issues have you encountered with the intelligent assistant products you've worked on, and how did you resolve them?"

The answers vary, and that variety gives me more to probe. Responses like "These are technical issues" are generally weak, because ensuring stability requires a systematic approach.

A good interview answer would classify stability-related failures and outline resolution paths.

The categories here are broad; a single backend service alone can be segmented into many areas. Failure scenarios include crashes, partial failures, weak network conditions, state updates, request timeouts, and concurrency performance. Severity varies, so we won't delve into each one here.
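As a rough illustration of classifying failures and giving each one a resolution path, the sketch below maps the failure scenarios listed above to a spoken fallback, so that no failure ends in silence. The `Failure` enum, the `fallback_prompt` helper, and the prompt wording are all hypothetical; a real product would define its own categories and copy.

```python
from enum import Enum


class Failure(Enum):
    """Failure scenarios named in the text above."""
    CRASH = "crash"
    PARTIAL_FAILURE = "partial failure"
    WEAK_NETWORK = "weak network"
    STATE_UPDATE = "state update"
    TIMEOUT = "request timeout"
    CONCURRENCY = "concurrency performance"


# Hypothetical spoken fallbacks; the wording is illustrative only.
FALLBACKS = {
    Failure.WEAK_NETWORK: "The network is slow right now. Please try again in a moment.",
    Failure.TIMEOUT: "That is taking longer than expected. Please try again.",
    Failure.PARTIAL_FAILURE: "I could only get part of the answer.",
}
DEFAULT_FALLBACK = "Sorry, something went wrong. Please try again later."


def fallback_prompt(failure: Failure) -> str:
    """Return what the assistant should say so that no failure is silent."""
    return FALLBACKS.get(failure, DEFAULT_FALLBACK)
```

The point of the default entry is that even an unclassified failure still produces feedback, which addresses the "no feedback" case mentioned earlier.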

After classifying the issues, the next question is: How did you resolve them?

Typically, companies follow these operational workflows.

Here are three key details:

The diagram below shows an information-based risk control structure. Those familiar with this module will understand; we won't expand due to length.

When evaluating service stability, there are two major aspects: the stability of the intelligent assistant itself and how to mitigate issues during user interactions, including response speed when problems arise.

To be meaningful, service stability should be measured over a defined period and at a defined sampling frequency, not judged from a single spot check.
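One simple way to make "period and frequency" concrete is to probe the assistant at a fixed interval and report the share of successful probes, both overall and per time window. The sketch below assumes each probe is recorded as a boolean; the function names and the probing scheme are illustrative assumptions, not a method described in this article.

```python
from typing import List, Sequence


def availability(probes: Sequence[bool]) -> float:
    """Fraction of probes that succeeded (True = the assistant answered correctly)."""
    if not probes:
        raise ValueError("at least one probe result is required")
    return sum(probes) / len(probes)


def availability_by_window(probes: Sequence[bool], window: int) -> List[float]:
    """Availability for each consecutive block of `window` probes."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [availability(probes[i:i + window]) for i in range(0, len(probes), window)]
```

For example, with one probe per hour, `availability_by_window(results, 24)` gives a daily figure, which makes it easier to see whether failures are spread evenly or concentrated on particular days.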

Once stability is ensured, speed becomes the next focus.

Voice interaction is valued for its efficiency. When users make a request, they expect quick feedback. Modern users are impatient; slow responses will lead to abandonment.

But what stages are involved in a voice assistant interaction?

First, faster isn't always better.

When ordering food at a restaurant, the waiting time is managed by the staff to keep customers engaged. Similarly, waiting experiences must be managed.

Frontend and backend collaboration can include voice prompts, modal dialogs, fade-out hints, and animations to enhance waiting experiences.

For screenless devices like smart speakers, light effects (e.g., loading, success) can manage waiting experiences.

Thus, for response speed/fluency, different scenarios require tailored approaches—balance is key.
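The idea of tailoring waiting feedback to the scenario can be sketched as a small decision function. The two thresholds below are hypothetical placeholders rather than measured values, and `waiting_feedback` is an illustrative name; the feedback types themselves (animations, light effects, voice prompts) are the ones mentioned above.

```python
# Hypothetical thresholds in seconds; tune them per product and scenario.
INSTANT_S = 0.3
SHORT_WAIT_S = 2.0


def waiting_feedback(elapsed_s: float, has_screen: bool) -> str:
    """Pick a waiting cue from how long the user has waited and the device type."""
    if elapsed_s < 0:
        raise ValueError("elapsed_s must be non-negative")
    if elapsed_s < INSTANT_S:
        # The answer is about to arrive; extra feedback would only add noise.
        return "none"
    if elapsed_s < SHORT_WAIT_S:
        # Screen devices can animate; screenless speakers fall back to lights.
        return "loading animation" if has_screen else "loading light effect"
    # Long waits get an explicit spoken prompt on every device.
    return "voice prompt"
```

Keeping the thresholds as named constants makes the balance between speed and waiting experience something a team can adjust and test per scenario.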

Every interaction method exists for its specific context.

The diagram below attempts to exhaustively list human input behaviors (aiming for MECE).

Touch, voice, gestures, head nods, facial recognition, voiceprints, fingerprint verification, etc., are all included.

This topic doesn't need much explanation. Aside from brain-computer interfaces, I have tried most of these and found them interesting.

The evaluation points for interaction richness are clear. In the future, multimodal interactions will adapt to diverse scenarios.

Now, a note for product managers.

I own a Mobvoi headset, which is an extension of the intelligent assistant. Trying an innovative product like this first makes the "what" clear, which then leads to exploring the "why" and the "how."

Thus, product managers should work through these three progressive layers: what, why, and how. Only by integrating them into daily life can sensitivity and innovation flourish.

Most scientific and patented inventions are improvements on prior work. There are two main directions:

Scaling lab-proven theories into mass production.
Discovering "this technology can also be used this way" from existing inventions.

Apple isn't exceptional in R&D but excels in integrating and applying technology. Most phone companies' R&D departments are better termed "technology solution integrators."

Only by immersing oneself in diverse interaction experiences and understanding the underlying technical principles can true innovation emerge.

When I first introduced my parents to "Xiao Ai," they needed my help to use it.

What is wake-up? What is listening? When does it respond, and when does it ignore you? How do you interrupt it when it talks too long?

This teaching process takes time—learning voice interaction requires hands-on guidance.

Without me, my parents couldn't use it. Such dependency on teaching is a high learning barrier.

For first-time users, lack of onboarding may leave them clueless. Onboarding is critical.

Different voice assistants handle this differently, making it an evaluation point.

Onboarding methods vary: swipeable banners, overlays, tooltips, interactive guides.

Broadly, they fall into basic operation tutorials and business-specific tutorials.

Basic tutorials are mandatory; business tutorials depend on the scenario.

Baidu Maps excels here, offering immersive guides for each feature of Xiao Du Navigation.

This borrows from gaming solutions. Xiao Du has iterated multiple times to reach its current state.

The best designs need no tutorials—like WeChat's intuitiveness.

In an era of touch-based habits, how do you introduce a new interaction method? The pressure is on onboarding: users either learn it or leave.

After trial, will users adopt voice for tasks? The pressure shifts to business design. Convenience decides adoption.

This is a progression. Only after mastering basics can users change habits. Onboarding must treat users as beginners.

Full duplex is a telecommunications term for simultaneous two-way data transmission.

Let's start with some everyday analogies:

Simplex: Like listening to the radio – one-way communication where you can only receive information.

Half-duplex: Like walkie-talkies. Both sides can talk, but only one at a time.

Person A: "Alpha One, Alpha One, can you hear me? Over."

Person B: "Loud and clear, over."

Full-duplex: Like a phone call.

Person A: "Hey, remember my voice? It's..."

Person B: "Oh, it's you..."

Both parties can speak simultaneously and interrupt each other naturally.

For human-computer interaction, achieving this natural fluidity is essential. Current voice assistants only respond when in listening mode, which is activated either by a wake word or after completing a response.

User: "I want to see recently released movies."

Assistant: "Here are the movies I found. You can tell me which one to play." (Enters listening mode after finishing playback)

Although the assistant displays the search results immediately, it does not start listening again until it has finished its verbal response. As users, we often see the results first and have to interrupt with the wake word, which is a frustrating experience.

Full-duplex capability lets the assistant keep listening while it is speaking, so users can interrupt at any moment without repeating the wake word. Conversations also feel more natural when the assistant filters out filler words and pauses instead of treating them as requests.
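The difference between the two modes can be sketched as a small state machine. This is a toy model under stated assumptions, not how any particular assistant is implemented: the `Assistant` class, its methods, and the wake word `"hey assistant"` are all hypothetical, and real systems work on audio streams rather than text.

```python
from enum import Enum


class State(Enum):
    IDLE = "idle"
    LISTENING = "listening"
    SPEAKING = "speaking"


WAKE_WORD = "hey assistant"  # hypothetical wake word


class Assistant:
    def __init__(self, full_duplex: bool):
        self.full_duplex = full_duplex
        self.state = State.IDLE
        self.handled = []  # requests the assistant accepted

    def start_speaking(self) -> None:
        self.state = State.SPEAKING

    def finish_speaking(self) -> None:
        # After a response, the assistant listens for a follow-up.
        self.state = State.LISTENING

    def hear(self, utterance: str) -> bool:
        """Return True if the utterance was accepted as a request."""
        text = utterance.strip().lower()
        if text.startswith(WAKE_WORD):
            # The wake word always works, even while the assistant is speaking.
            self.state = State.LISTENING
            text = text[len(WAKE_WORD):].strip(" ,")
            if not text:
                return False  # woken up, but no request yet
        elif self.state is State.IDLE:
            return False  # not awake
        elif self.state is State.SPEAKING and not self.full_duplex:
            return False  # half-duplex: deaf while speaking
        self.handled.append(text)
        self.state = State.SPEAKING  # the assistant starts its reply
        return True
```

In the movie example, a half-duplex assistant ignores "play the second one" while it is still reading out the results, whereas a full-duplex one accepts it immediately:

```python
half = Assistant(full_duplex=False)
half.start_speaking()
half.hear("play the second one")                  # ignored
half.hear("hey assistant, play the second one")   # accepted

full = Assistant(full_duplex=True)
full.start_speaking()
full.hear("play the second one")                  # accepted
```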

Other considerations in interaction fluency include:

(6) Content moderation systems that sometimes disrupt experiences
(7) Future multi-device, multi-scenario interactions (currently undeveloped)
(8) Cost considerations in task completion (privacy vs. convenience tradeoffs)

Fluency and security often conflict – there are no right answers, only choices. Interaction fluency is a crucial metric affecting the entire user experience from intent understanding to service delivery.

Upcoming articles will cover: Personality Traits – evaluating assistants' emotional intelligence and anthropomorphic qualities.

How to Evaluate Voice Assistant Intelligence (1): Intent Understanding

How to Evaluate Voice Assistant Intelligence (2): Service Delivery