Starting from Open-Domain Robot Construction: How to Chat with Robots

baoshi.rao

Robots can be categorized by their dialogue methods into 'Q&A robots,' 'task-oriented dialogue robots,' and 'open-domain chatbots.' However, in practical applications, robots often need to combine different functionalities.

Take a customer service robot for appliances as an example:

User: 'Is installation included?'

Robot: 'Yes, installation is included, dear.'

This is the most common Q&A scenario, where the robot retrieves the corresponding answer for the user's query.

Another example:

User: 'I want to check the shipping status.'

Robot: 'Which order would you like to check?' (providing options: Order A, Order B, Order C)

User: Selects A.

Robot: 'The item has been shipped via SF Express.'

In this scenario, the robot completes the task of checking shipping status through multi-turn dialogue.

Just like humans, robots need not only problem-solving abilities but also everyday communication skills—both are indispensable.

Having worked with various robots, I’ve observed that open-domain chatting, while seemingly less critical in robot scenarios, is often a fundamental capability. It’s a classic case of 'you always want what you don’t have, and take for granted what you do.'

Currently, most robot documentation focuses on task-oriented and retrieval-based functionalities, with limited content on casual chatting. Moreover, discussions about chatting tend to lean toward technical implementations. However, I believe that to excel in open-domain chatting, we must consider not just the technology but also the product itself.

Today, I’ll share insights from my experience building open-domain robots, using the children's robot scenario as an example, to explore how to construct a chatbot from different angles. Let’s dive in!

Casual chatting, colloquially known as 'shooting the breeze,' is about having fun. It’s a process where both parties seek emotional engagement—whether for amusement or comfort.

Thus, when users interact with a robot casually, they expect it to evoke emotional resonance and change. For example, here’s a dialogue with 'Xiao Ai' (Xiaomi’s smart speaker):

User: 'Xiao Ai, fart for me.'

Robot: 'Oh my, I’m a girl—how could I do something so embarrassing? But since you insist, I’ll reluctantly oblige. Don’t blame me if it’s not perfect.'

Pfft~~

In this exchange, the user is simply seeking amusement, and the robot’s playful response makes it feel like a fun companion.

A friend with high emotional intelligence—who comforts you when you’re sad, entertains you when you’re bored, and shares joy when you’re happy—is someone everyone cherishes. Similarly, a robot should be more than a 'straightforward problem-solver'; it should be an emotionally intelligent companion, sensing and reflecting the user’s emotions to create a warm and engaging experience.

Only then can users form a lasting bond with the robot, avoiding the sentiment of 'laughing at AI’s incompetence.'

If users engage in casual chat with a robot, they inherently expect it to 'feel human.' While this standard sounds simple, achieving it is incredibly challenging, given that NLP technology hasn’t yet reached full human-level understanding.

As Westworld suggests, robots with memory begin evolving into conscious beings. In the same spirit, for a robot to 'feel human,' we can abstract certain traits that, when embodied, make users willing to engage.

Below are traits I’ve abstracted from building chatbots, which I’ll explain one by one.

1. The Importance of Consistent Persona

Every individual has a unified persona—identity, gender, appearance, personality, hobbies—forming the basis of human interaction. Even strangers asking for directions infer persona from appearance and gender (e.g., 'Hey, handsome'). Inconsistency in persona comes across as erratic or disingenuous.

For example:

A: 'Who are you?'

B: 'I’m a product manager from Guangdong.'

A: 'What do you do at work?'

B: 'I draft architectural designs, write code, and fix air conditioners.'

A: (This can’t be a real product manager.)

Robots, too, need a consistent persona to feel real and reliable. Without it, conversations become disjointed and confusing. As Cathy Pearl notes in Designing Voice User Interfaces, 'Consistency in persona allows users to predict interactions.'

2. Design Approach

How do we define a robot’s persona? To understand a person, we start with background info (name, hometown, profession, hobbies) and observe their demeanor. Similarly, robots need a backstory. In Westworld, each robot has a scripted identity. But users won’t ask every possible background question—even humans forget details.

Instead, focus on high-frequency queries. In our children’s robot project, over 10% of interactions involved questions like:

'What’s your name?'
'Where are you from?'

Next, design the robot’s personality traits (e.g., humorous, confident, loyal, mischievous, warm) to shape its dialogue style.

Here’s an example persona:

'Our children’s robot, Xiaoqi, is a handsome Leo from Titan’s Eternal Clan. Sent to Earth as his planet faced extinction, he now lives happily among humans. Xiaoqi is humorous and helpful but occasionally cheeky when teaching kids.'

3. Product Examples

Most bot-framework platforms prioritize skill customization and model training, neglecting persona design. Two exceptions:

Turing Robot: Attributes include gender, age, zodiac, and 'parents.'
Haizhi Ruyi: Attributes include name, gender, birthday, hobbies, and 'parents.'

While minimal, these basics are essential.

(1) Significance and Design Approach

Language style must align with persona. A humorous robot should reply playfully; a serious one, formally. Imagine a court robot saying, 'I can flirt and pick up girls'—utter chaos!

For example, since the children's robot Xiaoqi likes to show off its book knowledge, it can remind a child to take a break after playing games for a long time and suggest reading or learning ancient poetry instead.

After determining the language style, it's essential to reflect these personality traits in the chatbot's dialogue. We can identify high-frequency scenarios from user queries and tailor settings accordingly, ensuring the chatbot's persona and speaking style remain consistent, making it feel more like chatting with a real person—our ultimate goal.

To make the chatbot appear more human-like, we've also experimented with adding features like catchphrases, such as starting with "uh" or using "then" as a connector.

(2) Product Example

In past cases, one high-frequency scenario we observed is users repeatedly asking the same question (perhaps they are awkwardly testing the bot's intelligence). The chatbot should not answer with the same reply every time, which would turn it into a mere repeater. Microsoft's Xiaoice handles such interactions by drawing on her persona and language style (e.g., acting coquettish or proud).

Let's explore Xiaoice's strategy:

Xiaoice provides varied responses to the same question. If the user repeats too often, Xiaoice shows gradual impatience, eventually blaming the user and stopping replies until the user changes the topic, responding with, "You finally stopped repeating yourself~" This demonstrates Xiaoice's lively, playful, and somewhat tsundere language style.
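The escalation described above can be sketched as a small state machine. This is only my own minimal illustration, not Xiaoice's actual implementation: the class name, the wording of the replies, and the number of escalation steps are all assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RepeatHandler:
    """Escalates replies when the user keeps sending the same message."""
    # Hypothetical escalation ladder; the wording is illustrative only.
    ladder: List[str] = field(default_factory=lambda: [
        "You already asked me that~",
        "Again? I heard you the first time!",
        "Hmph, I'm not answering until you say something new.",
    ])
    last_message: Optional[str] = None
    repeats: int = 0

    def reply(self, message: str, normal_reply: str) -> Optional[str]:
        """Return the bot's reply, or None when the bot stays silent."""
        text = message.strip().lower()
        if text == self.last_message:
            self.repeats += 1
            if self.repeats <= len(self.ladder):
                return self.ladder[self.repeats - 1]
            return None  # ladder exhausted: stay silent until the topic changes
        was_silent = self.repeats > len(self.ladder)
        self.last_message, self.repeats = text, 0
        if was_silent:
            return "You finally stopped repeating yourself~ " + normal_reply
        return normal_reply
```

Here `normal_reply` stands for whatever the regular chat pipeline would have answered. The handler only overrides it when the user repeats the same message.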

Only with such rich language expression can users believe they're chatting with a friend rather than a dull, unintelligent robot.

(1) Significance of Personalization

As a political textbook might say: no two leaves in the world are identical. Even twins have distinct personalities. Our past experiences shape our unique worldviews, values, and outlooks on life. These "three views" determine individual thoughts, and since language reflects thought, everyone's speech is distinctive.

Similarly, chatbots should have their own memories to form unique "three views." This way, each chatbot becomes personalized, just like friends with different personalities.

(2) Design Approach

Remembering everything is challenging and meaningless for chatbots—humans forget things too. So, what should chatbots remember? Beyond background information, the most crucial aspect is remembering their owner, the user.

In other words, the chatbot needs to remember user-related information, including name, gender, hobbies, and even past interactions. This ensures that when User A owns a chatbot with memory, it becomes unique to them, even if other robots exist on the market.

To remember users, the chatbot must first collect user profiles. This scenario resembles a guy trying to understand a girl—either by actively learning (flirting) or passively observing (the "backup plan" approach).

Active Learning

If the chatbot actively seeks to understand the user, timing is critical to avoid seeming abrupt or triggering defensiveness.

The best timing is when two friends naturally get to know each other. For example:

User: What's your name?
Chatbot: I'm Xiaoqi. What about you, master?
User: I'm Xiaoba. Nice to meet you.
Chatbot: Hello, Xiaoba. I'm so happy to know my master's name!
User: What can you do?
Chatbot: I can sing and dance—pretty much everything! Xiaoba, what do you like to do?
User: I like singing too.

Through such interactions, the chatbot learns the user's interests and remembers them, making the user feel the robot is a genuine companion.
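A rough sketch of this "ask back at the natural moment" idea follows. It assumes simple trigger phrases and regular expressions in place of a real NLU component, and every name in it (`RECIPROCAL`, `ANSWER_PATTERNS`, `ProfileCollector`) is hypothetical.

```python
import re

# When the user asks about the bot, the bot answers and asks the same thing back.
# Trigger phrases and reply wording are illustrative assumptions.
RECIPROCAL = {
    "name": ("what's your name", "I'm Xiaoqi. What about you?"),
    "hobby": ("what can you do", "I can sing and dance! What do you like to do?"),
}
# Naive answer patterns; a real system would use a slot-filling model.
ANSWER_PATTERNS = {
    "name": re.compile(r"(?:i'm|i am|my name is)\s+(\w+)", re.IGNORECASE),
    "hobby": re.compile(r"i (?:like|love)\s+(\w+)", re.IGNORECASE),
}


class ProfileCollector:
    def __init__(self):
        self.profile = {}    # remembered facts about the user
        self.pending = None  # slot the bot asked about on its previous turn

    def respond(self, utterance: str) -> str:
        # Only interpret an answer when the bot has just asked for it.
        if self.pending:
            slot, self.pending = self.pending, None
            match = ANSWER_PATTERNS[slot].search(utterance)
            if match:
                self.profile[slot] = match.group(1)
                return f"Got it, I'll remember that: {match.group(1)}!"
        # Otherwise look for a natural moment to ask a question back.
        for slot, (trigger, reply) in RECIPROCAL.items():
            if trigger in utterance.lower() and slot not in self.profile:
                self.pending = slot
                return reply
        return "Tell me more!"
```

Reading answers only right after the bot has asked keeps a sentence like "I'm going to the zoo" from being stored as a name. That is the timing concern from the text, expressed in code.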

Observational Learning

If the chatbot pays attention to the user's past statements, it can glean information for impressive interactions. For example:

(Morning)
User: Good morning.
Chatbot: Good morning, Xiaoba. It's a lovely weekend. Any plans?
User: I'm going to the zoo with my parents.
Chatbot: Sounds great! Have fun!

(Evening)
User: Good evening.
Chatbot: Xiaoba, how was the zoo? Tell me all about it!
User: I saw zebras, lions, peacocks...
Chatbot: That sounds amazing! Next time, take me with you, okay?

The chatbot's good memory makes the child feel genuinely cared for, reassuring parents it's a good companion against loneliness.

Such memory ensures each chatbot becomes unique, offering children a personalized playmate.

P.S. For households with two children, voiceprint recognition can help the chatbot distinguish users.

(1) Significance

Suppose a reasonably intelligent chatbot is now ready. Users still won't chat with it endlessly. Many robots claim to teach math or English, but what matters most is whether users find these features useful and actually trigger them.

Imagine two people chatting where one always leads the conversation—even a chatterbox runs out of topics. Users often don't know what to say, and awkwardness kills interest. Thus, proactive topic guidance is crucial.

(2) Design Approach

Designing proactive guidance requires strategy, focusing on content, timing, and phrasing.

Content

The content depends on the chatbot's capabilities. For educational robots, guiding children to number or poetry games works. New features should also be introduced promptly to maintain engagement.

Timing

Timing can be at the start, middle, or end of interactions.

A common approach is guiding users upon activation. For example, Xiaodu Speaker might say:

User: Xiaodu, good evening.
Xiaodu: Good evening! First, check tomorrow's weather, then enjoy a fun program~
Xiaodu: Tomorrow's weather is xxxxx
Xiaodu: Here are some popular shows. Say "change channel" if you dislike them.

However, always guiding at activation feels rigid. Like friends chatting, topics should flow naturally between user and chatbot.

We don't need to force topic guidance in all chat content. First, we should identify users' high-frequency chat scenarios. For example, children often ask robots to tell jokes, after which we can guide them to other educational games. Of course, real-world situations aren't that simple—trigger conditions require weighted calculations, including the frequency of various skill triggers, the occurrence rate of other guidance scenarios, and historical user feedback (e.g., when users say "I don't like this").
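To make the "weighted calculation" concrete, here is a minimal sketch. The three signals come from the paragraph above, but the weights, the threshold, and the way each signal is normalized are assumptions of mine and would need tuning on real logs.

```python
from dataclasses import dataclass


@dataclass
class GuidanceSignals:
    skill_trigger_rate: float    # share of recent turns that used the target skill (0..1)
    recent_guidance_rate: float  # share of recent turns that already carried guidance (0..1)
    negative_feedback: int       # times the user said things like "I don't like this"


# Hypothetical weights and threshold, not values from a real product.
WEIGHTS = {"novelty": 0.5, "restraint": 0.3, "feedback": 0.2}
THRESHOLD = 0.6


def guidance_score(s: GuidanceSignals) -> float:
    novelty = 1.0 - s.skill_trigger_rate          # rarely used skills are worth suggesting
    restraint = 1.0 - s.recent_guidance_rate      # avoid nagging the user
    feedback = 1.0 / (1.0 + s.negative_feedback)  # shrinks with every complaint
    return (WEIGHTS["novelty"] * novelty
            + WEIGHTS["restraint"] * restraint
            + WEIGHTS["feedback"] * feedback)


def should_guide(s: GuidanceSignals) -> bool:
    return guidance_score(s) >= THRESHOLD
```

With these assumed weights, a rarely used skill with no recent guidance and no complaints scores about 0.95 and triggers guidance. A frequently used skill after repeated complaints scores about 0.18 and does not.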

Finally, guidance should be timed appropriately, such as when both parties fall into an awkward silence. Take Xiaomi's smart speaker as an example: because it stays in a full-duplex listening state after wake-up, it proactively offers guidance once if the user doesn't speak for 15 seconds. If there's no response after three attempts, it exits the wake-up state. For instance: "Master, where did you go? Let me tell you, I've learned a new skill recently. Want to try it with me?" This can rekindle the user's interest, start a new topic, and help raise CPS (conversation-turns per session).
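The silence rule can be sketched in a few lines. The 15-second and three-attempt values come from the example above. The class, its method names, and the assumption that the caller restarts its silence timer after each prompt are mine.

```python
SILENCE_SECONDS = 15  # from the example: prompt after 15 s without speech
MAX_PROMPTS = 3       # from the example: give up after three unanswered prompts


class SilenceGuide:
    def __init__(self):
        self.prompts_sent = 0
        self.awake = True

    def on_user_speech(self):
        """Any user speech resets the count of unanswered prompts."""
        self.prompts_sent = 0

    def on_silence(self, seconds_silent: float):
        """Called by a timer with the seconds elapsed since the last user
        speech or bot prompt. Returns a prompt, or None if nothing is said."""
        if not self.awake or seconds_silent < SILENCE_SECONDS:
            return None
        if self.prompts_sent >= MAX_PROMPTS:
            self.awake = False  # leave the wake-up state
            return None
        self.prompts_sent += 1
        return "Master, where did you go? I've learned a new skill. Want to try it?"
```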

Guidance Scripts

As for the final guidance scripts, since they vary by scenario, they should align with the previously mentioned language style. At the very least, we shouldn't have a serious, scholarly robot suddenly cooing, "Let's talk about something else, okay~?"

As the saying goes, "Pretty faces are a dime a dozen, but an interesting soul is one in a million." Ultimately, if a chatbot isn't fun, no amount of effort will make it useful. On the other hand, chatbots are inherently consumer-facing products, so maintaining engagement and retention through entertaining content is crucial. This often surprises users and keeps them interested in continuing the conversation.

To make small talk fun, we must mention Xiaoice's tactics. On one hand, Xiaoice regularly updates its skills to keep users engaged; on the other, it occasionally plays around in conversations, making users believe it's a fun "friend." For example, as mentioned earlier, it handles users deliberately repeating the same sentence mischievously.

Another example is Xiaoice's "mind-reading" skill, where it guesses the character the user is thinking of within 15 questions. Using algorithms like ID3 decision trees, it first trains on samples of character traits, then asks the user questions that narrow down those traits one by one, and finally identifies the character the user has in mind.
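Here is a toy sketch of the idea behind such a guessing game. It uses ID3's attribute-selection criterion (information gain) to pick the next question greedily at chat time, instead of building the full tree in advance. The four characters and their traits are a made-up knowledge base, and none of this is Xiaoice's real implementation.

```python
import math

# Hypothetical toy knowledge base: character -> yes/no traits.
TRAITS = ["human", "female", "monkey", "poet"]
CHARACTERS = {
    "Sun Wukong": {"human": False, "female": False, "monkey": True, "poet": False},
    "Zhu Bajie": {"human": False, "female": False, "monkey": False, "poet": False},
    "Mulan": {"human": True, "female": True, "monkey": False, "poet": False},
    "Li Bai": {"human": True, "female": False, "monkey": False, "poet": True},
}


def entropy(n: int) -> float:
    """Entropy in bits of a uniform distribution over n candidates."""
    return math.log2(n) if n > 0 else 0.0


def information_gain(candidates, trait):
    """Expected reduction in uncertainty from asking about `trait`."""
    yes = [c for c in candidates if CHARACTERS[c][trait]]
    no = [c for c in candidates if not CHARACTERS[c][trait]]
    total = len(candidates)
    remainder = sum(len(part) / total * entropy(len(part)) for part in (yes, no))
    return entropy(total) - remainder


def guess(answer_fn, max_questions=15):
    """Ask the most informative remaining question until one candidate is left."""
    candidates, asked = list(CHARACTERS), set()
    while len(candidates) > 1 and len(asked) < min(max_questions, len(TRAITS)):
        remaining = [t for t in TRAITS if t not in asked]
        trait = max(remaining, key=lambda t: information_gain(candidates, t))
        asked.add(trait)
        reply = answer_fn(trait)  # True for "yes", False for "no"
        candidates = [c for c in candidates if CHARACTERS[c][trait] == reply]
    return candidates
```

In this toy data the first question is always "human", because it splits the four candidates evenly and therefore yields the largest gain (one bit).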

These mini-games create a sense of joy and anticipation for the next interaction. The same goes for friendships—shared topics and activities are what turn two people into close friends.

When designing the small-talk robot Xiaoqi, certain tactics were also implemented. For instance, around Valentine's Day, it would send cheesy pick-up lines as proactive messages when users started interactions. These daily pick-up lines significantly boosted retention and engagement during the holiday. Implementation was straightforward, relying on rule-based settings with a high ROI.

User: Open Chat Maid
Bot: Hello, Master! Oh, do you smell something?
User: No / What smell? / ...
Bot: The air turns sweet the moment you appear!

Similarly, children's robots need this kind of fun and novelty, as kids naturally crave variety. If a playmate repeats the same games and lines daily, it will eventually have "no friends." Thus, incorporating educational games, regularly updated jokes, and stories can capture children's attention and make them fond of their robotic companion.

According to Maslow's hierarchy of needs, the need for love and belonging is incredibly strong. Those lacking it often feel unvalued in the world due to a lack of care from others. For open-domain chatbots, the market often positions them as companions to fulfill some emotional needs. Hence, it's essential for robots to perceive users' emotions and provide emotional support.

This can be divided into two parts: (1) how to recognize user emotions, and (2) how the robot should respond emotionally.

(1) Emotion Recognition

We won't delve into the technical side of emotion recognition but focus on which emotions the robot should identify from a product perspective. Data-wise, this means categorizing which data points count as emotional classifications.

Emotions come in many forms—likes and dislikes in attitude, sadness and happiness in mood, distance and indifference in relationships, etc. Choosing which emotional scenarios to respond to depends on two main factors:

(2) Emotional Feedback Strategies

Once the robot knows whether the user is happy, disappointed, or angry, it must respond as a "friend." Different emotional categories warrant different strategies. Here are some response strategies for children's scenarios:

User is angry (using foul language): "Kids shouldn't swear! Otherwise, I won't want to play with you. I only befriend polite children!" (Educational strategy)
User is angry (no foul language): "What's wrong? Did someone upset you? Don't worry, Xiaoqi is here to cheer you up! How about we listen to a nice nursery rhyme together?" (Guiding to a child-friendly scenario)
User is sad: "Master, life has its ups and downs, but at least Xiaoqi is always here for you. Oh, I just heard a hilarious joke—let me share it to cheer you up!" (Guiding to a joke scenario)
User is scared: "Hold me tight, Master, and you'll feel brave! We'll face it together!"
User is happy: "Your happiness makes Xiaoqi even happier! But don't forget—you promised to read poems with me later!" (Guiding to a poetry scenario)
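On the implementation side, strategies like these can live in a plain lookup table, which keeps them easy for a product manager to maintain. The sketch below assumes that a separate emotion classifier (not shown) supplies the label. The strategy names and shortened replies are illustrative.

```python
# Each entry: (emotion, uses_foul_language) -> (strategy, reply).
# Labels are assumed outputs of an emotion classifier that is not shown here.
STRATEGIES = {
    ("angry", True): ("educate", "Kids shouldn't swear! I only befriend polite children!"),
    ("angry", False): ("guide_to_nursery_rhyme", "What's wrong? How about a nice nursery rhyme?"),
    ("sad", False): ("guide_to_joke", "Xiaoqi is always here for you. Let me tell you a joke!"),
    ("scared", False): ("comfort", "Hold me tight and you'll feel brave!"),
    ("happy", False): ("guide_to_poetry", "Your happiness makes Xiaoqi happier! Shall we read poems?"),
}
DEFAULT = ("chat", "Tell me more!")  # hypothetical fallback for unlisted emotions


def emotional_reply(emotion: str, foul_language: bool = False):
    """Look up the strategy; foul language only changes the branch for anger."""
    key = (emotion, foul_language and emotion == "angry")
    return STRATEGIES.get(key, DEFAULT)
```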

In short, the ultimate goal of a robot's emotional companionship should be: "Don't lie to me or scold me—care for me. When others bully me, stand up for me. When I'm happy, share my joy. When I'm sad, comfort me." Yes, the ultimate best friend!

According to China's 2017 Cybersecurity Law (Articles 47 and 68), companies must ensure their platforms' content safety. Violations involving sensitive words can lead to penalties or even forced shutdowns. As chatbot designers, we must ensure our robots don't say inappropriate things, as the consequences could be severe.

Thus, we typically design and maintain a sensitive-word database. With this in place, the robot's reply sources fall into three categories: (1) manually added, (2) web-scraped, and (3) auto-generated. For (1) and (2), we filter sensitive words during input. For auto-generated replies, we filter them during generation.
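The two filtering points can be sketched as follows. The lexicon entries are placeholders, the substring check is deliberately naive (larger lexicons are usually matched with a trie or an Aho-Corasick automaton), and `generate` stands for whatever generative model is in use.

```python
# Placeholder lexicon; a real sensitive-word database is far larger and maintained over time.
SENSITIVE_WORDS = {"badword", "forbidden phrase"}


def contains_sensitive(text: str, lexicon=SENSITIVE_WORDS) -> bool:
    lowered = text.lower()
    return any(word in lowered for word in lexicon)


def filter_corpus(replies):
    """Sources (1) and (2): drop offending replies when they enter the library."""
    return [reply for reply in replies if not contains_sensitive(reply)]


def safe_generate(generate, query, fallback="Let's talk about something else!", max_tries=3):
    """Source (3): check each generated reply and fall back if none is clean."""
    for _ in range(max_tries):
        reply = generate(query)
        if not contains_sensitive(reply):
            return reply
    return fallback
```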

In short, make a smart robot that knows what to say and what not to.

This chapter focuses on chatbot implementation. Unless you're an algorithm-focused product manager, most AI PMs should prioritize user scenarios, so we'll keep this brief (we're getting tired of this topic anyway).

Retrieval-based small talk relies on sentence similarity matching. For example, convert the user's message and the entries (question categories) stored in the dialogue library into sentence vectors, then calculate cosine similarity to measure semantic closeness. The answer attached to the highest-scoring entry is returned to the user.

To get sentence vectors, word vectors are processed via supervised or unsupervised methods. Popular word-vector models include Word2Vec and BERT. From there, models like CNN, Skip-Thought Vectors, or Quick-Thought Vectors generate sentence vectors.

The overall process:

Preprocess the query (e.g., tokenization), convert tokens to word vectors (Word2Vec, BERT), then generate sentence vectors (Skip-Thought, Quick-Thought).
Match the sentence vector with pre-processed categories, calculating cosine similarity scores.
Sort by score, select the best-matching question, and return the corresponding answer.
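The three steps above can be sketched end to end. To keep the example self-contained, bag-of-words counts stand in for real sentence vectors. This is a loud simplification: it is not Word2Vec, BERT, or Skip-Thought. The tiny library and the `min_score` cutoff are also assumptions.

```python
import math
import re
from collections import Counter

# Tiny hypothetical dialogue library: stored question -> answer.
LIBRARY = {
    "what is your name": "I'm Xiaoqi!",
    "where are you from": "I come from far, far away.",
    "tell me a joke": "Why did the robot go on holiday? To recharge!",
}


def embed(sentence: str) -> Counter:
    """Stand-in sentence vector: token counts after simple tokenization."""
    return Counter(re.findall(r"\w+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str, min_score: float = 0.3):
    """Score every stored question, sort, and return the best answer (or None)."""
    q = embed(query)
    scored = sorted(
        ((cosine(q, embed(question)), answer) for question, answer in LIBRARY.items()),
        reverse=True,
    )
    score, answer = scored[0]
    return answer if score >= min_score else None
```

Swapping `embed` for a real sentence encoder leaves the rest of the pipeline unchanged, since `cosine` only needs two vectors of the same kind.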

As our algorithm experts have pointed out, the corpus is enormous, so scoring every entry against each query would be highly inefficient. Instead, an efficient search engine can first do coarse-grained filtering to select candidate answers, which are then ranked with the vector-based methods above.
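A minimal sketch of this coarse-then-fine idea uses an in-memory inverted index as a stand-in for the search engine. Here `score_fn` is a placeholder for whatever vector-based scorer is used, and all function names and the `limit` value are assumptions.

```python
import re
from collections import Counter, defaultdict


def tokens(text: str):
    return re.findall(r"\w+", text.lower())


def build_index(questions):
    """Inverted index: token -> set of positions of questions containing it."""
    index = defaultdict(set)
    for position, question in enumerate(questions):
        for token in set(tokens(question)):
            index[token].add(position)
    return index


def coarse_candidates(index, query, limit=50):
    """Keep up to `limit` questions that share the most tokens with the query."""
    hits = Counter()
    for token in set(tokens(query)):
        for position in index.get(token, ()):
            hits[position] += 1
    return [position for position, _ in hits.most_common(limit)]


def search(questions, index, query, score_fn):
    """Run the expensive scorer only on the coarse candidates."""
    shortlist = coarse_candidates(index, query)
    if not shortlist:
        return None
    return max((questions[p] for p in shortlist), key=lambda q: score_fn(query, q))
```

The expensive scorer now runs on at most `limit` entries per query rather than on the whole corpus, which is the point of the coarse filter.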

Generative chatbots employ end-to-end deep learning models, such as seq2seq, which learn from vast amounts of dialogue data to automatically generate responses for each query. In other words, the responses are not pre-set but entirely generated by the bot itself.

Typically, an encoder such as an LSTM maps the input sequence to a fixed-length vector, and a second, often deep, LSTM then decodes the target output sequence from that vector.

From an industry perspective, current seq2seq generative models often suffer from generic "safe" responses, inconsistent bot personalities, and poor continuity in multi-turn conversations. Our algorithm experts have mentioned that these issues are not unsolvable. External knowledge (e.g., Xiaoice's topic model and sentiment classification model) can be integrated into the generative model to make responses more meaningful.

Of course, from Xiaoqi's perspective, generative models don't just face these three issues. The 'chatting strategy' we discussed earlier is the core value of a casual chatbot. While generative models ensure a response to every message, the essence of casual conversation among friends lies not in responding to every message but in heartfelt communication and connection.

At one point, I naively proposed a solution: use retrieval-based methods for strategic responses while employing generative replies for long-tail queries, supplemented by models like sentiment analysis to ensure meaningful responses. This 'big-picture but low-ROI' idea was promptly dismissed by our algorithm expert with a single word: 'Naive!' Hahaha~

Once a chatbot is built and launched, a product manager's job isn't over. We need to use data to assess whether the bot's chatting ability meets expectations and truly satisfies users.

In daily life, we might describe someone as a great communicator who can chat effortlessly with anyone, but this is often a subjective judgment. For a chatbot as a product, product managers must identify measurable metrics to prove it meets user needs.

As shown in the figure above, product managers need to focus on different metrics depending on the goals. For example, if we design a children's companion robot, there are various metrics from top to bottom.

For businesses, the primary concern is sales performance. Product managers must design scenarios and highlights to boost sales.

From the user perspective, product managers need to monitor usage. Most casual chat scenarios are consumer-facing (To C), so metrics like retention and activity are unavoidable. Only when these metrics improve can the bot's companionship value be demonstrated. Additionally, we should track CPS (Conversation-turns Per Session), the average number of dialogue turns in a session, to gauge user engagement.

Functionally, product managers must evaluate each skill's performance, including trigger rate, completion rate, and retention rate per skill/scenario. These metrics provide deeper insights into overall retention, activity, and CPS. For instance, high trigger rates in certain scenarios may boost CPS, while low completion rates for specific skills could reduce overall activity.
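As a rough illustration, these metrics can be computed from turn-level logs. The log format below is hypothetical, and so are the exact definitions: trigger rate as a share of all turns, and completion rate as a share of triggers. Teams define these differently.

```python
from collections import defaultdict

# Hypothetical log: one record per user turn.
LOG = [
    {"session": "s1", "skill": "joke", "completed": True},
    {"session": "s1", "skill": None, "completed": False},
    {"session": "s1", "skill": "poetry", "completed": False},
    {"session": "s2", "skill": "joke", "completed": True},
]


def cps(log) -> float:
    """Average number of turns per session."""
    turns = defaultdict(int)
    for record in log:
        turns[record["session"]] += 1
    return sum(turns.values()) / len(turns)


def skill_rates(log):
    """Per-skill trigger rate (share of all turns) and completion rate (share of triggers)."""
    triggered, completed = defaultdict(int), defaultdict(int)
    for record in log:
        if record["skill"]:
            triggered[record["skill"]] += 1
            completed[record["skill"]] += int(record["completed"])
    total = len(log)
    return {
        skill: {"trigger_rate": n / total, "completion_rate": completed[skill] / n}
        for skill, n in triggered.items()
    }
```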

Finally, as an intelligent system, a chatbot should have objective standards to measure its intelligence. Since we focus on retrieval-based casual chat systems, common evaluation metrics include recall, precision, and F-measure.
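For reference, here is one way to compute these three metrics for a retrieval-based bot. It assumes each test query has either one expected answer ID or `None` (meaning the bot should not answer), and that the bot returns an answer ID or `None`. The function name and data layout are my own.

```python
def precision_recall_f(predicted, expected, beta=1.0):
    """predicted / expected: parallel lists of answer IDs, with None for 'no answer'."""
    returned = sum(p is not None for p in predicted)   # queries the bot answered
    relevant = sum(e is not None for e in expected)    # queries that have an answer
    correct = sum(p is not None and p == e for p, e in zip(predicted, expected))
    precision = correct / returned if returned else 0.0
    recall = correct / relevant if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta * beta
    f_measure = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_measure
```

For example, with predictions `["a1", "a2", None, "a9"]` against expected answers `["a1", "a3", "a4", None]`, the bot answered three queries and got one right, and three queries had an answer, so precision, recall, and F1 all come out to 1/3.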