Mixed-Methods Study of Chatbot Guidance Design

Identified the optimal type and timing for guidance in Human-Agent Interaction

April - September 2021

July 2021 - August 2021

Project Type

Academic Research

My Role

  • Designed and executed the study with the team, covering study design, system design, and interviews (n=126)

  • Led analysis of qualitative and quantitative data using thematic analysis, qualitative coding, Python, and R

  • Synthesized findings and co-authored the final paper

Overview

Chatbot users often struggle to communicate their intentions to the agent effectively, leading to conversational breakdowns. Providing guidance on chatbot usage has been suggested as a way to address this problem. This study explores the impact of different guidance types and timings on user performance, learning, and subjective experience in Human-Agent Interaction.

I led part of the data analysis. For the qualitative analysis, I conducted a thematic analysis to identify patterns and translate them into findings. On the quantitative side, I coded the data and applied statistical models to uncover trends. By synthesizing both qualitative and quantitative findings, I contributed to the study's core insights and co-authored the final paper.

This study was published at CHI '22 and received a Best Paper Honorable Mention.

Read our paper: Link

01 Context

Task-oriented chatbots are increasingly used across various domains, yet users often face challenges in completing tasks due to communication breakdowns.

This study focuses on understanding how different types and timings of chatbot guidance influence user performance, learning, and experience. Through a mixed-methods approach, we aim to inform the design of more intuitive chatbot systems.

We had three research questions:

RQ1

Which combination of guidance type and timing enables users to:

  1. complete their tasks more efficiently

  2. make better conversational progress

  3. improve their performance during subsequent chatbot use?

RQ2

What are users’ subjective experiences of each of these combinations?

RQ3

What are users’ desired characteristics for the combination of a chatbot-conversation guidance type and its timing?

While this work was published before the popularization of LLM-powered conversational agents, the findings remain generalizable to Human-AI interaction design.

02 Methods

To address the research questions, we designed a two-phase mixed-methods approach, combining a controlled experiment with post-experiment interviews to collect user reflections.

Phase 1

We set up a between-subjects experiment where participants interacted with task-oriented chatbots designed to provide guidance in various formats.

Specifically, we tested two guidance types:

  • example-based

  • rule-based

and four delivery timings:

  • service-onboarding

  • task-introduction

  • after-failure

  • upon-request

Participants were tasked with completing six tasks spanning different user scenarios, such as movie booking and travel planning, while engaging with one specific combination of guidance type and timing.

These tasks mimicked real-world scenarios, offering practical and meaningful challenges.

Demonstration of the experiment setup in Phase 1
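
To make the 2 × 4 design concrete, here is a minimal Python sketch of how the eight conditions and a balanced participant assignment could be enumerated. The names and the assignment logic are hypothetical placeholders, not our actual experiment code.

```python
# A minimal sketch (hypothetical, not the study's actual assignment script) of
# the 2 x 4 between-subjects design: each participant sees exactly one
# combination of guidance type and delivery timing.
import itertools
import random

GUIDANCE_TYPES = ["example-based", "rule-based"]
TIMINGS = ["service-onboarding", "task-introduction", "after-failure", "upon-request"]

# All eight experimental conditions (2 types x 4 timings).
CONDITIONS = list(itertools.product(GUIDANCE_TYPES, TIMINGS))
random.Random(42).shuffle(CONDITIONS)  # fixed seed so the rotation is reproducible

def assign_condition(participant_id: int) -> tuple[str, str]:
    """Rotate through the shuffled conditions for balanced group sizes."""
    return CONDITIONS[participant_id % len(CONDITIONS)]
```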

Phase 2

After the experiment, participants reflected on their experiences in structured interviews. During this step, participants reviewed chatbot interaction logs and ranked all eight combinations of guidance type and timing based on personal preference.

This qualitative exploration was critical for uncovering subjective attitudes, challenges, and user-defined ideal characteristics for chatbot guidance.

By merging quantitative performance metrics with these qualitative insights, we aimed to provide a holistic view of user interaction with chatbot systems.

Demonstration of the experiment setup in Phase 2

03 Recruitment

Participants were recruited through online advertisements posted on social media platforms, university mailing lists, and community forums.

The recruitment process targeted individuals with diverse backgrounds, ensuring a balanced representation of age, gender, and prior chatbot experience. Potential participants completed a pre-screening survey to assess their familiarity with task-oriented chatbots, and only those within a predetermined range of familiarity levels were invited to participate.

This approach allowed us to include both novice and experienced users, ensuring the study captured a wide spectrum of user behaviors and preferences. We recruited 126 participants for this study, and each received monetary compensation for their time.

04 Data Analysis

After each study session, we coded the task performance metrics, including task success, completion times, and non-progress events. Data were then analyzed using regression models.

These metrics provided a measurable understanding of how effectively users interacted with different guidance combinations. Mixed regression models were applied to account for individual differences among participants and to examine the effects of guidance type and timing on performance trends. I used Python and R to process the data.
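
As a rough illustration, here is a minimal Python sketch of the kind of mixed-effects regression described above, using statsmodels. The column names and input file are hypothetical placeholders, not our actual analysis pipeline.

```python
# A minimal sketch of a mixed-effects regression on task performance.
# Column names (completion_time, guidance_type, guidance_timing, task_index,
# participant_id) and the CSV file are hypothetical placeholders.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("task_metrics.csv")  # one row per participant x task

# Fixed effects: guidance type, timing, their interaction, and task order;
# groups=participant_id adds a per-participant random intercept to account
# for individual differences.
model = smf.mixedlm(
    "completion_time ~ C(guidance_type) * C(guidance_timing) + task_index",
    data=df,
    groups=df["participant_id"],
)
result = model.fit()
print(result.summary())
```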

For qualitative data, we used thematic analysis to identify recurring themes in user feedback. Interviews were transcribed and iteratively coded to extract insights related to user preferences, emotional responses, and perceived utility of guidance. This process revealed high-level themes regarding users' behaviors and attitudes, such as why participants favored specific guidance types or timings and how they adapted their behaviors over time.

To combine the quantitative and qualitative findings, we employed a triangulation approach. Quantitative metrics were cross-referenced with qualitative themes to uncover relationships between performance numbers and users' subjective experiences. For instance, participants who performed better with task-introduction guidance also expressed higher satisfaction in their reflections, aligning performance data with user sentiment.
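
For a flavor of how this cross-referencing can work in practice, here is a small hypothetical sketch in pandas; the file names and columns are illustrative, not our actual data.

```python
# A hypothetical sketch of triangulating quantitative metrics with coded
# interview themes. File names and columns are illustrative placeholders.
import pandas as pd

metrics = pd.read_csv("task_metrics.csv")      # participant_id, condition, completion_time, ...
themes = pd.read_csv("interview_themes.csv")   # participant_id, theme (from thematic coding)

# Join performance data with qualitative theme tags per participant.
merged = metrics.merge(themes, on="participant_id")

# e.g., compare mean completion time across participants grouped by theme,
# to see whether subjective patterns track objective performance.
print(merged.groupby("theme")["completion_time"].mean())
```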

Synthesis of both types of data provided a comprehensive understanding of how guidance type and timing influenced both objective performance and subjective experience.

05 Findings

We observed several patterns for each chatbot guidance type and timing. Here are some high-level findings:

A Mismatch between Task Performance and User Experience

Designs that lead to high performance don't always result in positive user experiences.

For instance, participants favored receiving examples at the introduction of a task because it allowed them to quickly adapt the example and efficiently complete the task. Yet this combination was less efficient than presenting rules after users had failed a task, a timing that in turn created a sense of frustration.

Examples Guaranteed a Good Start, whereas Rules Promoted Understanding

Example-based guidance allowed participants to perform well initially, but it did not significantly improve their performance over time. This was because participants often simply copied and modified the examples without fully grasping the chatbot's functionality.

On the other hand, rule-based guidance, though initially slower, led to significant improvement in task performance due to the deeper processing required to understand and apply the rules.

The Timing of Providing Examples Matters

Providing examples at different times resulted in varied task efficiency and improvement.

For instance, examples provided during service onboarding were often ignored, while examples provided upon request allowed users to engage in more exploration and absorb the guidance when they needed it most.

06 Design Implications

Different combinations of chatbot guidance type and timing afford different task performance and user experience. The main design implication of our study is:

Chatbot guidance should be designed based on its purpose.

There was no clear “winner” in terms of either guidance type or timing. Instead, the choice of both type and timing should depend on the purpose of the guidance: facilitating task execution versus promoting learning.

While designers should leverage the strengths of both example-based and rule-based guidance at timings tailored to their goals and user needs, there are specific combinations of guidance type and timing we recommend against:

  • Showing examples at service onboarding led users to ignore the examples as part of the "irrelevant messages"

  • Showing examples after an interaction failure negated the main advantage of examples as templates for user messages. It also prompted negative emotional responses from users.

Thank you for reading this case study.
