Addressing the 'Xiaodu' Experience Issues: Discussing Voice Interaction Solutions

baoshi.rao

In recent years, smart speakers have become highly popular, with many people opting to purchase them for simple voice interactions. However, many users have encountered issues where the smart speaker 'fails to understand human speech'...

As a user of the Xiaodu Smart Speaker (basic version without a screen), I analyze one specific experience detail in this article and attempt to propose voice interaction solutions for it.

User: Xiaodu Xiaodu, play Jin Wenqi's 'Thirteen' / play SHE's 'Seventeen'.
Xiaodu: Okay, playing Jin Wenqi's 'Time Thief' / Okay, playing SHE's 'Super Star'.
User: Xiaodu Xiaodu, play Jin Wenqi's 'Thirteen' / play SHE's 'Seventeen'.
Xiaodu: Okay, playing Jin Wenqi's 'Time Thief' / Okay, playing Andy Lau's 'Seventeen'.
...
Infinite loop.

User: Xiaodu Xiaodu, play G.E.M. and Ai Re's 'Light Years Away' / play G.E.M.'s 'Light Years Away' (Passion Version).
Xiaodu: Okay, playing G.E.M.'s 'Light Years Away'.
User: Xiaodu Xiaodu, play Ai Re's 'Light Years Away' / play G.E.M.'s 'Light Years Away' (Passion Version).
Xiaodu: Okay, playing G.E.M.'s 'Light Years Away'.
...
Infinite loop.

In other words, from the system's perspective, the received command might look like "play SHE's 'period'", "play SHE's ten/seven", or simply "play SHE" (with the title 'Seventeen' dropped), so Xiaodu plays the wrong song.

After checking, I found that neither Baidu Music nor QQ Music, Xiaodu's music copyright providers, has the rights to Jin Wenqi's 'Thirteen,' though they do have the rights to the other songs mentioned above. Copyright is therefore not the main cause of this poor experience.

Compared with Problem 1, Problem 2 is the key cause of this poor experience: when Xiaodu fails to understand the user's intent, it does not enter a dialogue mode to offer alternatives but mechanically repeats the action with the highest confidence in the system, which undoubtedly frustrates users.

'Dialogue mode' typically involves multiple rounds of voice interaction, where the AI can understand the context of the conversation and respond more 'intelligently.' A classic example:

In dialogue mode, the AI understands that 'he' refers to 'Lincoln' from the previous context. In non-dialogue mode, the AI would be confused by the second 'he' from the user.

Currently, the default interaction between Xiaodu and users is single-command-based: the user wakes Xiaodu with 'Xiaodu Xiaodu' and gives a command, and Xiaodu responds once.

However, in the following two scenarios (recalled from memory; I may have missed some cases, since I had no access to the product during the pandemic), Xiaodu switches to dialogue mode:

Under the existing interaction strategy, Xiaodu generally does not enter dialogue mode when it is 'confident.' For example, it confidently ignores certain qualifiers (like 'Ai Re' or 'Passion Version') and assumes the user wants to listen to G.E.M.'s original version of 'Light Years Away.'

Even if the user repeatedly and angrily gives the same command ("play the Ai Re version of 'Light Years Away'"), Xiaodu will stubbornly play G.E.M.'s original version.

The decision-making basis for this strategy is unclear. It might be due to extreme cases being overlooked, technical limitations, or cost considerations. This article does not judge but offers suggestions from an experience optimization perspective.

Conclusion: Given potential cost considerations, Xiaodu should automatically enter dialogue mode for 'copyright issues.' Other errors caused by AI limitations should be handled by the solution for Problem 2.

Example dialogue:
User: Xiaodu Xiaodu, play Jin Wenqi's 'Thirteen.'
Xiaodu: Sorry, we currently don't have the rights to play this song. Would you like to listen to Jin Wenqi's 'Time Thief' instead?
User: Okay.
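
The example dialogue above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `LICENSED` catalog, the function name `respond_to_play_request`, and the rule "suggest the first licensed title by the same artist" are all hypothetical, and a song missing from the catalog is treated as a copyright issue for simplicity.

```python
# Hypothetical licensed catalog: artist -> titles available for playback.
LICENSED = {
    "Jin Wenqi": ["Time Thief"],
}


def respond_to_play_request(artist: str, title: str,
                            catalog: dict[str, list[str]]) -> str:
    """Return Xiaodu's reply, offering an alternative when the song is unlicensed."""
    titles = catalog.get(artist, [])
    if title in titles:
        return f"Okay, playing {artist}'s '{title}'."
    if titles:
        # Assumption: suggest the first licensed title by the same artist
        # and wait for the user's answer (dialogue mode).
        return ("Sorry, we currently don't have the rights to play this song. "
                f"Would you like to listen to {artist}'s '{titles[0]}' instead?")
    return "Sorry, we currently don't have the rights to play this song."


print(respond_to_play_request("Jin Wenqi", "Thirteen", LICENSED))
```

The point of the sketch is that the speaker states the reason for the substitution and asks before playing, instead of silently playing a different song.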

GUI prototype: None.
VUI interaction flow:

Key concepts to explain:

Confirmation Strategy: When responding to a user command, the AI holds a list of candidate answers ranked by confidence (the N-Best list). Answers are confirmed differently depending on their confidence: implicit confirmation for high-confidence answers and explicit confirmation for low-confidence answers (see example dialogue).
Disambiguation Strategy: When the user gives an ambiguous or redundant command, the AI asks follow-up questions that confirm, break down, or supplement the command until it forms a clear instruction.
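
To make the two strategies concrete, here is a minimal Python sketch of choosing a reply from an N-Best list. The thresholds (0.80 and 0.50), the `Candidate` type, and the wording of the replies are assumptions made for illustration; they are not Xiaodu's actual values or implementation.

```python
from dataclasses import dataclass

# Hypothetical thresholds; real values would have to be tuned on product data.
IMPLICIT_THRESHOLD = 0.80
EXPLICIT_THRESHOLD = 0.50


@dataclass
class Candidate:
    artist: str
    title: str
    confidence: float  # assumed to lie in [0, 1]


def choose_response(n_best: list[Candidate]) -> str:
    """Pick a reply according to the confidence of the top-ranked candidate."""
    if not n_best:
        return "Sorry, I didn't catch that. Which song would you like?"
    ranked = sorted(n_best, key=lambda c: c.confidence, reverse=True)
    top = ranked[0]
    if top.confidence >= IMPLICIT_THRESHOLD:
        # Implicit confirmation: act, and name the result so the user can object.
        return f"Okay, playing {top.artist}'s '{top.title}'."
    if top.confidence >= EXPLICIT_THRESHOLD:
        # Explicit confirmation: ask before acting.
        return f"Do you want {top.artist}'s '{top.title}'?"
    # Disambiguation: offer the two best alternatives and let the user choose.
    options = " or ".join(f"{c.artist}'s '{c.title}'" for c in ranked[:2])
    return f"Did you mean {options}?"


print(choose_response([
    Candidate("SHE", "Super Star", 0.40),
    Candidate("Andy Lau", "Seventeen", 0.30),
]))
```

In this sketch the strategy is driven only by the top candidate's confidence; a fuller design might also look at how close the runner-up is.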

Example dialogue:
// Enter dialogue mode //
// Apply confirmation strategy //
... (confidence decreases) // Apply disambiguation strategy //

GUI prototype: None.
VUI interaction flow: Omitted.

New issue:
In the short term, this solution requires significant code changes: the first step, determining 'whether the user has repeated the same command m times,' would mean adding a check across the entire code architecture, which is a substantial task.
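
The check itself is simple when viewed in isolation; the cost lies in wiring it into an existing system. A minimal Python sketch of the check follows. The class name `RepeatDetector`, the default `m = 2`, the 60-second window, and the crude text normalization are all assumptions for illustration; a real system would more likely compare recognized intents than raw text.

```python
import re
from typing import Optional


def normalize(command: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace (a crude stand-in)."""
    cleaned = re.sub(r"[^\w\s]", "", command.lower())
    return " ".join(cleaned.split())


class RepeatDetector:
    """Counts consecutive identical commands issued within a time window."""

    def __init__(self, m: int = 2, window_seconds: float = 60.0) -> None:
        self.m = m
        self.window_seconds = window_seconds
        self._last: Optional[str] = None
        self._last_time = 0.0
        self._count = 0

    def should_enter_dialogue(self, command: str, now: float) -> bool:
        """Return True once the same command has been given m times in a row."""
        key = normalize(command)
        if key == self._last and now - self._last_time <= self.window_seconds:
            self._count += 1
        else:
            self._count = 1  # a different or stale command restarts the count
        self._last = key
        self._last_time = now
        return self._count >= self.m


detector = RepeatDetector(m=2)
print(detector.should_enter_dialogue("Play SHE's 'Seventeen'.", now=0.0))
print(detector.should_enter_dialogue("play SHE's Seventeen", now=10.0))
```

When the detector fires, the speaker would stop replaying its top answer and switch to the confirmation or disambiguation strategy instead.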

In the long run, however, the improvement in user experience is worth it. From the user's perspective, better optimization is undoubtedly desired. Whether the development cost is justified (given temporary alternatives like manually searching for songs in the Xiaodu app or using a screen-equipped speaker) requires further data analysis on benefits and priorities. As an outsider, I lack the decision-making basis and refrain from judgment.

Looking back at this experience issue: because of the complexity of real scenarios and the limitations of AI, users inadvertently end up 'helping the system correct errors.'

My expertise is limited. These thoughts draw largely on voice projects at work and on academic papers, and I have no hands-on experience with complex voice assistant projects. Corrections and criticism are welcome.