Designers Characterize Naturalness in Voice User Interfaces: Their Goals, Practices, and Challenges

by

Yelim Kim

B.Sc., The University of Toronto, 2016

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

MASTER OF SCIENCE

in

The Faculty of Graduate and Postdoctoral Studies

(Computer Science)

THE UNIVERSITY OF BRITISH COLUMBIA

(Vancouver)

March 2020

© Yelim Kim, 2020

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the thesis entitled:

Designers Characterize Naturalness in Voice User Interfaces: Their Goals, Practices, and Challenges submitted by Yelim Kim in partial fulfillment of the requirements for the degree of Master of Science in Computer Science

Examining Committee:

Dongwook Yoon, Computer Science Supervisor

Joanna McGrenere, Computer Science Supervisor

Karon MacLean, Computer Science Examining committee member

Abstract

With substantial industrial interest, conversational voice user interfaces (VUIs) are becoming ubiquitous through devices that feature voice assistants such as Apple's Siri and Amazon's Alexa. Naturalness is often considered central to conversational VUI design, as it is associated with numerous benefits such as reducing cognitive load and increasing accessibility. The literature offers several definitions of naturalness, and existing conversational VUI design guidelines provide different suggestions for delivering a natural experience to users. However, these suggestions are hardly comprehensive and often fragmented. A precise characterization of naturalness is necessary for identifying VUI designers' needs and supporting their design practices. To this end, we interviewed 20 VUI designers, asking what naturalness means to them, how they incorporate the concept into their design practice, and what challenges they face in doing so. Through inductive and deductive thematic analysis, we identify 12 characteristics describing naturalness in VUIs and classify these characteristics into three groups, 'Fundamental', 'Transactional', and 'Social', depending on the purpose each characteristic serves. We then describe how designers pursue these characteristics under different categories in their practices depending on the contexts of their VUIs (e.g., target users, application purpose). We identify 10 challenges that designers are currently encountering in designing natural VUIs. Our designers reported experiencing the most challenges when creating natural-sounding dialogues, and they wanted better tools and guidelines. We conclude with

implications for developing better tools and guidelines for designing natural VUIs.

Lay Summary

Providing a natural conversation experience is often considered central to designing conversational Voice User Interfaces (VUIs), as it is expected to bring numerous benefits such as lower cognitive load, a lower learning curve, and higher accessibility. Despite its noted importance, naturalness is ill-defined. There are also no comprehensive standard resources for helping designers pursue naturalness. In order to provide support for VUI designers in the future, it is critical to understand how they currently perceive and pursue naturalness in their design practices. Hence, we interviewed 20 VUI designers to understand their notion of a natural conversational VUI and their practices and challenges in pursuing it. In this thesis, we present 12 characteristics of naturalness and classify these characteristics into 3 groups. We also identify 10 challenges that our designers are currently encountering and conclude with implications for developing better tools and guidelines for designing natural conversational VUIs.

Preface

This thesis is based on a study approved by the UBC Behavioural Research Ethics Board (certificate number H18-01732). It extends a conference paper that is currently under review for publication. As the first author of the submitted paper, I designed and conducted the semi-structured interviews and analyzed the data under the supervision of Dr. Dongwook Yoon and Dr. Joanna McGrenere. More specifically, my two supervisors helped me formulate the research questions, design the study, and analyze the collected data. The submitted paper was written with substantial help from the two supervisors as well as from Mohi Reza, another co-author of the paper. Mohi Reza, an MSc student, helped write the introduction and related work sections of the submitted paper, provided valuable insight for shaping the findings and contributions, and assisted with the English writing.

Table of Contents

Abstract ...... iii

Lay Summary ...... v

Preface ...... vi

Table of Contents ...... vii

List of Figures ...... x

Acknowledgements ...... xi

1 Introduction ...... 1
1.1 Problem Definition ...... 1
1.2 Contributions ...... 4
1.3 Overview ...... 4

2 Related Work ...... 5
2.1 VUI Literature in HCI ...... 5
2.2 Existing Definitions of Naturalness ...... 6
2.3 Characterizing Conversations ...... 7
2.4 Difference Between Spoken Language and Written Language ...... 8
2.5 Human-likeness in Embodied Agents ...... 8
2.6 Tools and Guidelines for VUI Design ...... 9

3 Methods ...... 10
3.1 Participants ...... 10
3.2 Interviews ...... 12
3.3 User Tasks ...... 13
3.4 Procedure ...... 14
3.5 Data Analysis ...... 14
3.6 Apparatus ...... 15

4 Results ...... 17
4.1 Designers Characterize Naturalness ...... 17
4.1.1 Fundamental Conversation Characteristics ...... 18
4.1.2 Social Conversation Characteristics ...... 23
4.1.3 Transactional Conversation Characteristics ...... 25
4.2 Designers Experience Challenges ...... 30

5 Discussion and Implications for Design ...... 47
5.1 Discussion ...... 47
5.1.1 Capturing Naturalness in Context ...... 47
5.1.2 Positioning Naturalness of VUI in the Literature ...... 48
5.1.3 Contrasting Naturalness of Transactional vs. Social Agents ...... 49
5.1.4 Limitations ...... 50
5.1.5 Implications for VUI Design Tools ...... 50

6 Conclusion and Future Work ...... 53

Bibliography ...... 55

Appendix ...... 69
A.1 The Recruitment Poster ...... 69
A.2 The Recruitment Message Posted Through SNS ...... 71
A.3 The Consent Form ...... 73

A.4 The Pre-interview Survey ...... 78
A.5 The Semi-structured Interview Script ...... 86
A.6 More Descriptions on the User Task ...... 93
A.7 The Post User-task Survey ...... 97
A.8 The Data Analysis Process ...... 104

List of Figures

3.1 The familiarity with SSML ...... 11

4.1 The twelve characteristics of naturalness that designers deem important ...... 19
4.2 The 10 challenges that designers are currently encountering in designing natural VUIs ...... 32

Acknowledgements

Firstly, I would like to express my sincere gratitude to my two supervisors, Dongwook Yoon and Joanna McGrenere. They were always generous with their time whenever I needed their help. Before I entered graduate school, I did not have much exposure to the Human-Computer Interaction (HCI) field or to a research environment. My supervisors were always patient and generous with my progress and encouraged me to explore and make my own decisions for my project. I really thank them for their extensive help throughout the whole project. They always amazed me with their great professionalism and their endless passion for HCI research, and they set an example for me to follow. Secondly, I would like to thank Mohi Reza, an MSc student who helped me write a paper we submitted to a top-tier computing conference. He dedicated two weeks to helping my project. I really enjoyed working with him and learned a lot from him. Thirdly, I would like to express my appreciation to Karon MacLean for agreeing to be the second reader of my thesis and being so generous with her time. I also want to thank her for her insightful guidance and help with the other project that I worked on with her student, Soheil Kianzad. Lastly, I would like to thank the students in the Multimodal User eXperience lab for providing insightful feedback on my study and for offering much valuable advice on graduate life.

Chapter 1

Introduction

In this chapter, we first provide an overview of the problem space. Then, we motivate and illustrate the contributions of our study and outline the overall structure of the thesis.

1.1 Problem Definition

With substantial industrial interest, conversational Voice User Interfaces (VUIs) are becoming ubiquitous, with a plethora of everyday gadgets, from smartphones to home control systems, featuring voice assistants (e.g., Apple's Siri, Amazon Alexa, and Google Home). Conversational VUI systems are one of the two general types of VUI systems [1]. In a conversational VUI system, users perceive the voice agents as conversation partners and accomplish their goals by having conversations with the agents [1]. In a command-based VUI system, the other general type, users are expected to learn and use the appropriate voice commands to accomplish their goals [2]. Hereafter, we use the term 'VUI' to refer to 'conversational VUI', and the term 'voice agent' to refer to 'conversational agent' [3]. At the heart of the desired properties of VUIs is naturalness. According to

prior work and industrial design guidelines, enabling users to accomplish their goals by having natural conversations with voice agents brings numerous benefits such as lowering cognitive load [4, 5], flattening the learning curve [5], and increasing accessibility [5, 6]. Hence, multiple VUI design textbooks and guidelines recommend that designers make VUIs that provide natural conversational experiences to users [7, 5, 8]. VUIs have only recently become popular. As such, designing a conversational VUI can be difficult due to the lack of standard and comprehensive design guidelines, as Robert J. Moore and Raphael Arar noted in their book: "conversational interfaces are at the stage that web interfaces were in 1996: the technologies are in the hands of masses, but mature design standards have not yet emerged around them." In fact, currently available design resources suggest design approaches toward naturalness, but their characterization of the term is fragmented and hardly comprehensive. Multiple resources recommend different practices to designers to make VUIs sound more natural [9, 10], feature natural dialogues [11, 12, 13], or offer natural interactions [14, 15]. However, the term naturalness is an ill-defined construct lacking precision and clarity [16]. The field therefore lacks a comprehensive and substantive characterization of naturalness in VUIs despite its advertised importance. Bridging this conceptual gap is a critical step towards providing comprehensive guidance to designers who strive to create natural VUI experiences. The literature in communications and social science informs the characterization of naturalness in human dialogues. It suggests that people have different expectations and notions of violations in interpersonal communication, depending on a class of situational factors, such as who is talking and the relationship between interlocutors [17, 18].
Given that modern voice assistants are often situated in complex and dynamic social settings [19], conversational characteristics of a VUI considered natural in one setting may not be perceived the same way in another (e.g., an extremely human-like voice agent can be considered deceptively anthropomorphic and

2 uncanny [20]). The extent to which specific characteristics of naturalness apply in different conversational settings remains an open question. Within the broader discourse on Natural User Interface (NUI) design, the preliminary conceptions of naturalness offered have remained abstract and generic. Some have described naturalness as a property that refers to how the users “interact with and feel about a product” [21]. Others have used it to describe devices that “adapt to our needs and preferences”, and enable people to use technology is “whatever way is most comfortable and natural” [22]. Such broad characterizations make sense at a conceptual level. However, the extent to which they can be applied in the domain of VUI design remains uncertain. While there are some existing VUI guidelines, they too are “too high-level and not easy to operationalise” [23]. To offer designers proper guidelines and tools, we need to seek answers to questions on how designers characterize nat- uralness, how they align such characteristics to their varying design goals, and what challenges they face in this pursuit of naturalness. Doing so will enable researchers to create conceptual and technical tools that support nat- uralness for VUI designers. As a first step towards characterizing naturalness, we conducted semi- structured interviews with 20 VUI designers to understand how they define naturalness, what design process they use to enhance naturalness, and the challenges they face. Through reflexive thematic analysis [24], our study revealed a comprehensive set of characteristics which we mapped into dif- ferent categories according to the aspects of naturalness each characteristic contributes to, whether it is for achieving a basic required skill for having a fluent verbal communication, for providing social interactions, or for helping users’ tasks. 
Some of the characteristics mirror those found in the human-to-human conversation literature, but interestingly, designers also identified characteristics that are "beyond-human", reflecting machine-specific capabilities that outperform people, such as superior memory capacity and

processing power. Our VUI designers also described significant challenges in achieving natural interaction, related both to a lack of adequate design tools and guidelines and to balancing the different characteristics based on the role of the voice agent (e.g., a social companion or a personal assistant).

1.2 Contributions

In this study, we recruited VUI designers, a group that, to our knowledge, had not been explored previously in the HCI literature, and ran an empirical study to uncover their perceptions of naturalness and their current practices of pursuing it. Our work contributes the following:

1. We identified a set of 12 characteristics of naturalness as perceived by designers and categorized them based on the aspect of naturalness each characteristic contributes to: Fundamental, Social, and Transactional.

2. We identified and characterized the 10 challenges that hinder designers from creating natural VUIs.

3. We proposed design implications for tools and design guidelines to support designers in creating natural VUIs, based on our findings from the interviews.

1.3 Overview

In Chapter 2, we present relevant prior work. In Chapter 3, we describe our study design and analysis methodologies. Chapter 4 then introduces our study findings, and Chapter 5 discusses our reflections on the findings and presents design implications derived from those reflections and insights. Finally, Chapter 6 concludes the thesis and suggests future work.

Chapter 2

Related Work

We set the stage by first reviewing the existing body of literature on VUIs. Then, we look at the broad manner in which naturalness is currently conceived and employed, and the ways in which people have characterized conversations. We then focus on the rich body of work on anthropomorphism in conversational agents and beyond, a topic of particular importance to our discussion on naturalness. Finally, we look at orthogonal yet important concerns and review existing tools and guidelines for VUIs.

2.1 VUI Literature in HCI

Researchers have been investigating ways to support speech interactions since as early as the 1950s. With rapid advancements in Natural Language Understanding (NLU), we transitioned from rudimentary speech-recognition-based systems such as Audrey [25] and Harpy [26] in the 1950s and 1970s, to task-oriented systems like SpeechActs [27] in the 1990s, to the sophisticated conversational agents that we now have.

More recently, several studies from the HCI community have been investigating how voice assistants impact users [19, 28, 29, 30]. These studies have explored various issues, including how VUIs fit into everyday settings [19], how users perceive social and functional roles in conversation [28], and the disparity between high user expectations and low system capability [31]. A common thread between many of these studies is that they take into account the perspective of the users. As Wigdor [21] put it, naturalness is a powerful word because it elicits a range of ideas in those who hear it. In this study, we take the path less trodden and see what designers think.

2.2 Existing Definitions of Naturalness

Given that naturalness is a construct, we see several angles from which existing studies define and use the term. As a descriptor for human-likeness: Naturalness is often seen as a "mimicry of the real world" [21]. In the context of speech, the human is the natural entity of concern, and hence behavioral realism, i.e., creating VUIs that act like real humans, has become a focus. We can trace the attribution of anthropomorphic traits onto computers to a seminal paper by Turing [32] on whether machines can think. In that paper, he assumes the "best strategy" to answer this question is to seek answers from machines that would be "naturally given by man". The pervasive influence of such thought can be seen in existing definitions of naturalness in the VUI literature: [33, 34, 35] all treat naturalness in this light, as a pursuit of human-likeness. As a distinguishing term between novel and traditional modes of input: The term is also used to contrast interfaces that leverage newer input modalities, such as speech and gestures, with more classical modes of input, namely graphical and command-line interfaces [36]. In this definition, the term is in essence an umbrella descriptor for countless systems involving multi-touch [37, 38], hand gestures [39, 40], speech [41, 42], and beyond [43, 44]. As interfaces that are unnoticeable to the user: Another usage draws from

Mark Weiser's notion of transparency introduced in his seminal article on ubiquitous computing [45]. In this formulation, naturalness is a descriptor for technologies that "vanish into the background" by leveraging natural human capabilities [46, 47]. As an external property: In this conception, the term does not refer to the device itself, but rather to the experience of using it, i.e., the focus is on what users do and how they feel when using the device [48]. The characteristics that we present in this thesis can also be viewed from such an angle: designers form and utilize characteristics not because they make the VUI itself more natural, but because they make the experience of using it more natural. The existing usage of the term has drawn heavy criticism from some. Hansen and Dalsgaard [16] describe naturalness as "unnuanced and marred by imprecision" and find the non-neutral nature of the term problematic. In their view, the term has been misused to conflate "novel and unfamiliar" products with "positive associations", akin to marketing propaganda. Norman [49] contests the distinction between natural and non-natural systems and notes that there is nothing inherently more natural about newer modalities than traditional input methods. With speech, for example, he notes that utterances still have to be learned.

2.3 Characterizing Conversations

Existing literature [50, 51, 52] has demarcated different forms of human conversation based on purpose. Clark et al. [28] take these forms and classify them into two broad categories: social and transactional. In the former category, the aim is to establish and maintain long-term relationships, whereas in the latter, the focus is on completing tasks. In our study, we ground some of the characteristics that designers mention in these two categories.

2.4 Difference Between Spoken Language and Written Language

Previous studies in linguistics have identified differences between spoken and written language [53, 54, 55]. Researchers found that people use more complex words when writing than when speaking [53, 54, 56, 57]. Bennett observed that passive sentences are more frequently used in written texts [58]. More complex syntactic structures are also used in written language than in spoken language [56]. In our study, we drew on these previous works when analyzing our interview data to find out which specific aspects of spoken dialogue VUI designers find challenging to mimic when writing VUI dialogues.

2.5 Human-likeness in Embodied Agents

A rich body of studies explores issues revolving around human-likeness in embodied agents and our relationships with them. They investigate a plethora of concerns, such as ways to transfer human qualities onto machines [59, 60], ways to maintain trust between users and computers [61, 62, 63], modeling human-computer relationships [64, 65], designing for different user groups such as older adults [66, 67] and children [68, 69], and stereotypes [70, 71]. A series of studies by Nass et al. examined how people respond to voice assistants. Results from these studies suggest that people apply existing social norms to their interactions with voice assistants [72]. The "similarity attraction hypothesis" posits that people prefer interacting with computers that exhibit a personality similar to their own [73], implying, for example, that cheerful voice agents can be undesirable to sad users. In our study, designers reflect on issues that echo this literature by considering factors such as personality, trust, bias, and demographics in their VUI design practice.

2.6 Tools and Guidelines for VUI Design

Many large vendors of commercial voice assistants provide their own separate guidelines for designers [74, 12, 75]. These guidelines offer design advice tailored to developing applications for a specific platform. With regard to platform-independent options, some preliminary effort has been undertaken in the form of principles [76], models [77], and design tools [78] for VUIs. More specifically, Ross et al. provided a set of design principles for VUI applications that take the role of a faithful servant [76], while Myers et al. analyzed and modelled users' behaviour patterns when interacting with unfamiliar VUIs [77]. Lastly, Klemmer et al. introduced a tool for an intuitive and fast VUI prototyping process [78]. Our study adds design implications for tools and design guidelines that help VUI designers create natural VUI experiences.

Chapter 3

Methods

To understand how VUI designers perceive naturalness in their design practices, we conducted semi-structured interviews with 20 VUI designers. We designed the interview questions with a constructivist epistemological stance, viewing the interview as a collaborative meaning-making process between the interviewer and the interviewee [79]. This chapter describes the design of our study and the methodologies we used for collecting and analyzing our data.

3.1 Participants

We recruited 20 VUI designers (7 female, 13 male) using purposeful sampling. To draw findings from a varied set of perspectives, we interviewed both amateur (N = 7) and professional (N = 13) VUI designers. We recruited the participants through flyers (Appendix A.1) and study invitation messages on social network services such as Facebook and LinkedIn (Appendix A.2). Participants' ages ranged from 17 to 73 (M = 34.3, Median = 30.5, SD = 14.7). The nationalities of our participants were as follows: 4 American, 1 Belgian, 1 Brazilian, 5 Canadian, 1 Dutch, 1 German, 5 Indian, 1 Italian, and 1 Mexican.

In this study, we define professional VUI designers as people working full-time on designing VUI applications, regardless of their actual job titles (e.g., VUI Designer, Voice UX Manager, UX Manager, CEO). Participants' length of professional VUI design experience ranged from 9 months to 20 years (M = 4 years and 2 months, Median = 2 years, SD = 6 years). Our participants worked in companies that ranged considerably in size, from three-employee startups to large corporations with over 5,000 employees. Most of the professional VUI designers we recruited (8 out of 13) were working for relatively small companies (2 to 49 employees), and 2 participants were working for medium-sized companies (50 to 999 employees). In addition to the information stated above, we collected the participants' highest level of education (2 with a high school diploma, 2 with a technical training certificate, 9 with a bachelor's degree, 5 with a master's degree, and 2 with a doctoral degree) and their familiarity with Speech Synthesis Markup Language (SSML) (Figure 3.1). We collected information about SSML familiarity because SSML is the only available standard method to modulate synthesized voices across different platforms.

Figure 3.1: The distribution of the participants' familiarity with SSML (N=20)

About half of the participants considered themselves unfamiliar with SSML, while the other half considered themselves familiar with it. All of our participants had previously designed at least one conversational voice user interface, including voice applications for Amazon's and Google's smart speakers, humanized Interactive Voice Response (IVR) systems, and VUIs for a virtual nurse, a smart home appliance, and a companion robot.
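For readers unfamiliar with SSML, it is an XML-based markup for annotating text with speech properties such as pauses, rate, and pitch. The following is a minimal illustrative sketch (our own example, not taken from the thesis or its study materials; tag support varies by platform), built and checked for well-formedness in Python:

```python
import xml.etree.ElementTree as ET

# A minimal SSML prompt (illustrative only).
# <break> inserts a pause; <prosody> modulates speaking rate and pitch.
ssml = (
    '<speak>'
    'Welcome back! '
    '<break time="300ms"/>'
    '<prosody rate="95%" pitch="+2st">Would you like to hear your schedule?</prosody>'
    '</speak>'
)

# SSML is XML, so a prompt must parse cleanly before being sent to a synthesizer.
root = ET.fromstring(ssml)
assert root.tag == "speak"
print(root.tag)  # prints "speak"
```

Because SSML is the one cross-platform standard, a designer who learns these few tags can carry prosody adjustments between, e.g., Alexa and Google Assistant prompts.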

3.2 Interviews

For each participant, we conducted one semi-structured interview session based on the interview script in Appendix A.5. The duration of the interviews varied from 30 minutes to about an hour, depending on the participants' time availability. All of the interviews were conducted by the thesis author. We arranged online interviews for the participants (17 out of 20) who could not come to the University of British Columbia for the interview. Prior to each interview session, the participants were asked to fill out a survey about their demographic data, familiarity with SSML, and previous VUI design experience (Appendix A.4). In this survey, we also asked about their most memorable VUI projects so that we could contextualize our research questions around their vivid memories. The data collected through the pre-interview surveys were analyzed using descriptive statistics to ensure the diversity of the participant demographics (i.e., gender and VUI design professional level). The interview questions can be broadly divided into four sections. The goal of the first section was to understand the participants' general VUI design practices and their previous VUI experiences in depth. In this part, we asked them to describe their design practices for the two most memorable VUI projects that they reported in the pre-interview survey. In the second section, we sought to understand the participants' conceptions of a natural VUI. To achieve this goal, we asked them to provide their own

definitions of a natural VUI. Then, we asked them how important it is for them to create a natural VUI and, if it is important, what benefits they expect to gain by doing so. The goal of the third section was to understand the participants' design practices for creating natural VUIs and the challenges they are currently facing in doing so. We asked the participants what particular design steps they take to create more natural VUIs and what the most challenging aspects of carrying out those steps are. The last section sought to understand how useful the current design guidelines and tools are for designing natural VUIs. In this part, we asked the participants what tools or design guidelines they are currently using and how helpful these are for designing natural VUIs.

3.3 User Tasks

Depending on their time availability, 15 of the 20 participants had time to do the user task after the interview. In this task, participants were asked to write relatively short VUI dialogues in two ways: by typing on a keyboard, and by using a voice typing tool that we created. One participant, aged 73, dropped out in the middle of the user task because he was tired of typing. Each user task lasted about 15 minutes. The user task was designed to accomplish two goals. The first was to understand the participants' VUI dialogue writing procedures and their perceptions of current synthesized voices. The second was to explore the possibility of using voice typing instead of keyboard input for creating natural VUIs. Note that the data collected from the user tasks did not directly inform our study results because the collected data lacked sufficient volume and richness. However, to provide a more transparent understanding of our study, we lay out the whole procedure of the user task in Appendix A.6.

13 3.4 Procedure

Prior to the interview, each participant was asked to fill out the pre-interview survey (Appendix A.4) mentioned above. Before each interview session, an email containing the consent form (Appendix A.3) was sent to each participant, so that participants could provide their consent by replying to the email. Before recording the interview sessions, the participants were informed that the recording would start. After the interview questions, the participants who were able to spend more time carried out the user task. After finishing the user task, they were asked to fill out the post-task survey (Appendix A.7) to provide feedback on the task and the current design guidelines; we provided each participant with a link to the survey. Lastly, we asked participants if they had any concerns or questions regarding the study. We addressed any questions, informed them that the interview was finished, and thanked them for their help. At the end of each session, each participant received $15/hour for their participation. The payment was made electronically through PayPal or Interac e-Transfer. Some participants declined payment and expressed a desire to support the study for free.

3.5 Data Analysis

All 20 interviews were transcribed before being analyzed. We used Braun and Clarke's approach to reflexive thematic analysis [24] to analyze the interview data. Their approach was particularly suited to our study because of its theoretical flexibility and rigour. The three members of the research team, including the thesis author and her two supervisors, held a one-hour weekly meeting in which they developed the themes over the course of several months. Instead of seeking an objective truth, we took an approach of crystallization [80] and developed a deeper understanding of the data by sharing each other's interpretations during each meeting. To facilitate productive discussion, we organized the themes in several ways concurrently, including post-its, 'Miro' (https://miro.com/), an online collaborative whiteboard platform, and 'Airtable' (https://airtable.com/), an online collaborative spreadsheet application (Appendix A.8). All members of the team coded the interview data using the open coding methodology [81], "where the text is read reflectively to identify relevant categories." Two members of the team each went through at least three interview transcripts, and the thesis author went through them all. We took both inductive and deductive approaches to coding the data and developed a set of coherent themes that form the basis of our findings. Our deductive approach to coding was derived from previous work on the classification of human conversation [52, 50, 51].

3.6 Apparatus

For the participants who were not able to visit the University of British Columbia for an in-person interview, we used 'Zoom', an online communication software (https://zoom.us/), to conduct the interviews and record their audio. For the in-person interviews, 'Easy Voice Recorder', an Android application for audio recording (https://shorturl.at/mwzHK), was used to record the audio of the interview. The interviewer also took interview notes on paper. For the user task, the participants were asked to write VUI dialogues using Google Docs. Then, the interviewer used a software tool named 'TTS reader', which we built for playing the sound of VUI dialogues (https://github.com/usetheimpe/ttsReader). TTS reader used Amazon Polly voices (https://aws.amazon.com/polly/) to read the VUI dialogues that

the participants wrote. An add-on for Google Docs, 'Kaizena' (http://tiny.cc/sqjxjz), was used to let the participants add voice comments on the dialogues that they wrote.

Chapter 4

Results

In this section, we first describe how our VUI designers characterize naturalness, followed by the challenges that they face in creating a natural VUI. We also describe how each challenge relates to the characteristics of naturalness defined by our designers.

4.1 Designers Characterize Naturalness

To contextualize our findings, we briefly summarize the most memorable projects described by our 20 participants. In total, we collected data about 38 VUI projects. There was only one chat-oriented dialogue system (e.g., [82, 83]), where a participant built a conversational agent for reducing elders' loneliness; the rest were grouped as task-oriented systems according to the definition provided by [84]. Among the applications mentioned during the interviews, there were 27 Intelligent Personal Assistant (IPA) systems for smart home speakers (23 Amazon Alexa, 4 Google Home), 8 Interactive Voice Response (IVR) phone systems, 1 voice agent system for a smart air-conditioner, 1 voice agent system for a mobile application, and 1 voice agent system for a humanoid robot. When asked to provide their definitions of a natural VUI, our participants

responded in terms of the characteristics of human conversation that they consider important for creating a natural VUI. Later, our thematic analysis revealed three categories for classifying these characteristics of human conversation, namely ones that are: (1) fundamental to any good conversation, (2) ones that promote good social interactions, and (3) those that help users accomplish their tasks. Perhaps not surprisingly, the latter two categories echo classifications for human-to-human conversations in the existing literature [52, 50, 51], labeled as "social conversation" and "transactional conversation". To be consistent with that literature, we adopted those labels. We found that instead of pursuing all the characteristics under the three categories, our designers selectively choose a particular set of characteristics depending on the context of their VUI applications and their design purposes. We also found that pursuing multiple characteristics at once can create conflicts. The three categories we use in this study can help readers conceptualize the 12 characteristics that we found. They can also serve as a lens to understand why designers pursue these characteristics and how these characteristics can often conflict with each other. The following section provides detailed descriptions of each characteristic.

4.1.1 Fundamental Conversation Characteristics

Among the conversation characteristics mentioned by our participants, there was a set of fundamental verbal communication characteristics that a natural VUI should have, regardless of whether the aim is to support social conversations or transactional conversations. The participants consider a VUI that does not achieve these elements unnatural. For example, synthesized voices that do not show appropriate prosody (e.g., having a monotonous intonation) were frequently referred to as "robotic" (P1, P4, P6) or as being a "machine" (P18).

Conversation Characteristics for Imbuing Naturalness in VUIs

Fundamental Characteristics
§ Sound like human speech.
§ Understand and use variations in human language.
§ Use appropriate prosody and intonation.
§ Collaboratively repair conversation breakdowns.

Social Characteristics
§ Express sympathy and empathy.
§ Express interest in the user.
§ Be interesting, charming and lovable.

Transactional Characteristics
§ Proactively help users.
§ Present a task-appropriate persona.
§ Be capable of handling a wide range of topics in the task domain.
§ Deliver information with machine-like speed and accuracy.*
§ Maintain user profiles to deliver personalized services.*

*Beyond-human aspects

Figure 4.1: The twelve characteristics of naturalness that designers deem important

Sound Like Human Speech

Six participants (P2, P5, P6, P7, P13, P15) mentioned that utterances of a natural VUI should have the characteristics of spoken language as opposed to written text. For example, people tend to use more abstract words and complex sentence structures when writing [85, 86]. Specifically, a natural VUI should use simple words: "When you're writing it down, and you say it out loud, sometimes you realize that whatever you've written down is way too long or has way too many big words." (P6) However, our participants said that simple words do not mean informal words such as slang or vulgar expressions, but rather the more typical words that the persona of the VUI would use. P4 mentioned that he avoided using words that are "too casual" for his virtual doctor application because "people would take it more seriously if they felt that it was a natural doctor." A natural VUI should also incorporate filler words [87], breathing, and pauses: "You need to introduce those pauses...conversation bits, like, 'Um', 'Like', 'You know'...it makes the conversation more natural." (P13) The participants mentioned that the patterns of breathing and pausing should give the impression of a 'mind' in the VUI, and make the user feel more like they are talking to a human: "Pauses before something, like a joke...we need to create an anticipation for the final jokes." (P15)

Understand and Use Variations in Human Language

Human language is immensely flexible, and we can express the same request in countless ways. Thirteen participants (P1, P2, P5, P6, P7, P9, P10, P11, P16, P17, P18, P19, P20) mentioned that a natural VUI should understand various synonymous expressions spoken by users, such as: "'Increase the volume.', 'Turn up the volume.'...at the end of the day, you just want [it] to increase the volume." (P16) The participants considered VUIs that heavily restrict what users can say to be unnatural and of diminished value: "...if you are instructing people to speak in a certain way, that's not how, I feel, voice

technology can be used." (P16) In addition to understanding varied expressions, 4 participants (P6, P7, P17, P20) mentioned that a natural VUI should be able to respond using a varied set of expressions to avoid sounding repetitive: "...we don't always say, 'Good choice!', we say, 'Great choice!', or we say, 'Awesome!'" (P6) When users repeat the same input, a natural VUI should respond using different expressions: "So if you [the user] go through the application more than once, the structure [of the dialogue] is similar, but you will always hear different sentences." (P7) The participants mentioned that variation of expressions in human language should be considered within the context of the target user. For example, factors such as different age groups and even individual differences need to be considered: "...we found out like hanging up the house phone, and dial the phone clockwise. These are all words that older people use, while we don't anymore." (P8) Age aside, the same expression can mean very different things when coming from different individuals: "If I say 'it's hot', then it's different from you say 'it's hot', right?" (P5) However, there are certain use cases where language variation is undesirable. This characteristic is less important when the primary purpose of a VUI application is having a transactional conversation to help users with their tasks, and the target users of the application are not the general public, but rather people from certain professions such as police officers or firefighters. This is because such target users are often trained to use special keywords and have a fixed workflow for faster and more efficient communication. Hence, in order to help them effectively, the application should stick to a fixed vocabulary: "So one of the main users of this type of application will be police or fire ambulance [drivers]...They're used to a very rigid command set. So they're always saying things in the same way." (P2)

Use Appropriate Prosody and Intonation

Prosody "refers to the intonation contour, stress pattern, and tempo of an utterance, the acoustic correlates of which are pitch, amplitude, and duration" [88]. Eleven participants (P2, P3, P4, P5, P6, P8, P9, P11, P12, P13) said that a natural VUI should present messages clearly with the appropriate prosody: "For me, it's important to put the right intonation in some parts of the text to make it clear." (P3) However, they highlighted that the appropriate prosody can differ across age groups. Many participants reported that Amazon Alexa, a popular commercial voice assistant, is "way too fast" (P11) by default and that prosody should be modified for seniors: "...for seniors, you may want to slow down the speed and potentially increase the volume or put an emphasis on the words." (P2) Since there is no voice customized for seniors, P11 had to insert a break after each sentence manually: "...it's like, 'Okay, here's your calendar!', another break. 'Today you have...', a slight pause, like point two, point three-second pause." (P11)
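The adjustments P2 and P11 describe correspond to standard SSML `<prosody>` and `<break>` tags. The fragment below is purely illustrative (the exact tag support and attribute values vary by platform):

```xml
<speak>
  <prosody rate="slow" volume="loud">
    Okay, here's your calendar!
  </prosody>
  <break time="300ms"/>
  <prosody rate="slow" volume="loud">
    Today you have two appointments.
  </prosody>
</speak>
```

As P11's account suggests, each break must be inserted by hand; there is no single "senior-friendly voice" setting that applies these adjustments globally.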

Collaboratively Repair Conversation Breakdowns

During verbal communication, we often encounter small conversation breakdowns when people do not respond in a timely way or do not understand what each other said. Four participants (P3, P6, P7, P8) mentioned that a natural VUI should resolve these kinds of conversation breakdowns in a way similar to how humans collaboratively resolve them by asking each other: "Like in this conversation, how often did we already say, 'I don't understand you' or 'What do you mean?' or 'Can you explain more?'. That's already for humans that way...and the robots and the user interfaces have to learn from humans." (P7) A VUI is considered very unnatural and machine-like if it repeats the same information when the conversation breaks down: "...if you don't understand something sometimes, [and if] the system just keeps repeating the same information [to get the response from you], just like a robot." (P3)
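The repair behaviour the participants contrast with robotic repetition — rephrasing and adding guidance on each failure — can be sketched as follows. The prompt texts and function names are hypothetical; a real system would also draw on recognition confidence and dialogue context:

```python
# Sketch of collaborative repair: escalate through varied reprompts
# instead of repeating the same message. Names are illustrative only.

REPAIR_PROMPTS = [
    "Sorry, I didn't catch that. What would you like to do?",
    "I'm still not sure I understood. Could you say it another way?",
    "I can help with appointments and reminders. Which one do you need?",
]

def repair_prompt(failure_count: int) -> str:
    """Pick a reprompt that varies, and adds guidance, with each failure."""
    index = min(failure_count, len(REPAIR_PROMPTS) - 1)
    return REPAIR_PROMPTS[index]

print(repair_prompt(0))
print(repair_prompt(1))
print(repair_prompt(5))  # stays at the most explicit prompt
```

The varied wording addresses P3's complaint directly: the system never "keeps repeating the same information" on consecutive breakdowns.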

4.1.2 Social Conversation Characteristics

While the main focus of our participants was on task-oriented applications, as mentioned in section 4.1, they also emphasized the importance of providing positive social interactions to users. Humans have social conversations to build positive relationships with each other [89]. In the VUI context, designers incorporate social conversation characteristics to provide harmonious and positive interactions, a more realistic feeling of conversation, and a feeling of being heard.

Express Sympathy and Empathy

Ten participants (P2, P3, P4, P5, P6, P8, P9, P11, P12, P13) mentioned the importance of providing empathetic responses to users' sentiments to maintain harmonious interactions: "...if I know your favorite team won, I'd have a happy voice. If I know your favorite team lost, I'd have a sad voice." (P2) Most of the participants' elaborations on this point focused on showing sympathy when users experience negative sentiments. The participants try to make the voice assistants console users and present empathetic voices when users feel negative: "If they respond negatively, the Alexa responds, 'Oh, I'm sorry to hear that.'" (P4) Beyond being sympathetic, the participants even actively try to soothe users' feelings in situations when they feel heightened emotions such as anger: "You have a calm reassuring voice when they're upset because there's a traffic." (P9) To find out if users feel negative, the participants use user responses, their personal information (e.g., their favorite sports teams) and the location of the conversations (e.g., hospital). No participant mentioned experience with a sentiment analysis approach, and one designer specifically said that such an approach required too much of a time commitment: "I don't have time to know the APIs that can do sentiment

detection." (P11) If there is no way to detect users' real-time sentiments, our designers choose to use a "flat voice" (P3) to prevent a voice agent's happy voice from upsetting a user who is currently feeling down, as suggested in [72]: "You have to control the tone of voice, because you can't sound very enthusiastic, things like that, because you never know the situation of the person on the other side." (P3)

Express Interest in Users

Four participants (P4, P6, P11, P12) said that they incorporate greetings, compliments, and words that express interest in the user. These words make the conversations appear "real-ish" (P11), and make users feel important: "I think the benefit of providing this type of responses, instead of just blank ones, is that it actually helps the person feel like their responses actually got heard." (P4) P11 and P6 mentioned that VUIs can even make users feel as if they have personal connections with the applications by providing daily greetings or feedback on users' actions: "I'll say, 'See you tomorrow!'. Little snippets of humanity." (P11) "We could just say the recipe steps and all of that, and not have to ask questions like, 'How's the spice?' and all of that, but if we do, then there's some kind of personal connection." (P6)

Be Interesting, Charming and Lovable

Social conversations include humour and gossip, which fulfill hedonic values [90, 91] that transactional conversations do not. In order to bring more user engagement to task-oriented applications, 4 participants (P1, P7, P13, P17) reported trying to write more interesting dialogues and to create a charming persona: "Interactive means using some good words. Something which sounds interesting to the user." (P17)

The importance of being entertaining was emphasized especially when the target users are children: "So, when it's a kid's application, you respond back in a very funny way. You use terms like 'Okie Dokie'." (P13) P7, who created an Alexa application for resolving conflicts between children, mentioned that the persona does not need to be loving and nice to be charming. It can be sarcastic and funny instead: "She's not loving and caring, but she's maybe a little sarcastic. She makes fun of what they say and I would say she's lovable, not loving." (P7) When the system fails to accomplish what the user asked for, designers mentioned that a VUI persona that presents socially preferable behaviours can abate users' negative emotions: "...if I had the sense that it understood its own limitation as opposed to telling me it can not do something...I'll be more flexible with it." (P1)

4.1.3 Transactional Conversation Characteristics

As mentioned in section 4.1, all participants except one grounded their answers in their design experiences with task-oriented applications. For transactional conversations, our designers want to achieve naturalness by leveraging the ways users are accustomed to using verbal interaction with others to get things done. P9's example is especially illustrative: "To me, the greatest benefit [of voice interaction] is that it's an interface that someone will naturally know how to use and won't have to learn how to use a new [command]. Man, I know people, especially kids, are great at using phones and all the things they learn, but I really like how we can do [many things] with voice. One example I like to give is that our internet was down, and we had to reboot the router or whatever, and I was wondering if it worked, so I just said 'Alexa, are you working?' and it says 'Yes, I'm working' before I even thought about it." (P9) Interestingly, not all of the conversation characteristics they identified in this category referred to human-like conversation characteristics. Our

designers considered machine-like speed and memory as characteristics that a natural VUI should have for transactional conversations, which we label as "beyond-human" aspects. This was surprising given that existing notions of naturalness are primarily based on being human-like. In other words, the designers' expectations of naturalness in VUIs extend beyond providing realistic conversation experiences, and include machine-specific benefits: "So I guess a natural agent would be close to having a real conversation with someone, but with all the added benefits of an actual application." (P12)

Proactively Help Users

Eleven participants (P2, P3, P8, P9, P10, P12, P13, P15, P16, P18, P20) mentioned that a natural VUI should be efficient, and proactively "detect or even ask for the things that [the user] needs." (P12) P18 said a VUI should not wait for a command. If someone says they have a problem, a natural response for humans is to ask if the person needs help, even when not explicitly asked: "From a linguistic perspective, 'Could you help me with my software?' is a yes-no question. 'I have a problem with my software.' is not even a question yet. So for 'I have a problem', bots need to be more proactive and ask a question, 'Could I help you with the software?'" (P18) Hence, a natural VUI should understand the meaning behind a statement and proactively take action to help the user. The efficiency of a VUI was not strictly measured by the total time taken for the task. Instead, participants considered the quality of the results that users would obtain relative to the number of conversation turns taken to finish: "...not necessarily as short of a time as possible. But, something that makes sense and it is value-driven for me as a user. So that means if I'm engaging in multiple [conversation] turns just to get more, I guess, more valuable information on my end, I'm okay with that." (P12) Relatedly, a natural VUI should avoid overloading users with excess information, and instead should minimize the number of conversation turns: "...you

do not become a robot who can keep going on and on and on and on about all this information...You should not overload the user with a lot of information. You should try to cut down as many decisions for the user as possible." (P13) The number of conversation turns can be minimized by proactively "asking them [users] less and less and assuming more..." (P13) To ask fewer questions, a natural VUI should actively make decisions based on contextual information: "But if the user tells me the zip code correctly, I don't ask him for city and state, I use some libraries to find the name of the city and state...We need to have a record of the entire conversation from top to bottom." (P13) Even though minimizing the number of questions is important, if the consequence of failing the task is considerable, a natural VUI should ask the user to confirm: "...so if I say things like 'You wanted your checking account. Is that correct?' and I say 'No, I want my savings account' then that to me, that confirm and correct [strategy] is a very important part in making it more conversational." (P10)
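P13's zip-code example amounts to slot filling in which derivable slots are never asked for. A minimal sketch of this idea follows; the lookup table and names are hypothetical stand-ins for the real zip-code libraries P13 mentions:

```python
# Sketch of proactive slot filling: ask only for slots that cannot be
# derived from what the user already said. All data here is illustrative.

ZIP_LOOKUP = {  # hypothetical stand-in for a real zip-code library
    "10001": ("New York", "NY"),
    "94103": ("San Francisco", "CA"),
}

def fill_slots(slots: dict) -> list[str]:
    """Derive city/state from zip if possible; return questions still needed."""
    if slots.get("zip") in ZIP_LOOKUP and not slots.get("city"):
        slots["city"], slots["state"] = ZIP_LOOKUP[slots["zip"]]
    questions = []
    for name in ("zip", "city", "state"):
        if not slots.get(name):
            questions.append(f"What is your {name}?")
    return questions

# With a known zip, city and state are derived: no follow-up questions.
slots = {"zip": "94103"}
print(fill_slots(slots))  # []
print(slots)
```

Each question the system can answer for itself is one fewer conversation turn, which is exactly the "assuming more" behaviour the participants describe.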

Present a Task-appropriate Persona

Four participants (P3, P4, P5, P13) said that a natural VUI application should present an appropriate persona for certain tasks. The tone of voice should match the application's purpose to increase user trust and elicit more useful responses. For example, P5 mentioned that for financial applications, the voice agent should sound serious to make the application feel more reliable: "So when you're creating these prompts, every company has a different tone of voice...Like do you want the machine to be quirky? Do you want a machine to be very serious? If you're talking about your wealth management, you don't want to have a fun guy. It has to be serious." (P5) As another example, P4 designed an application for collecting elders' health status. He tried to make the voice agent sound as much like a real doctor as possible. This was done to ensure that users take the task more seriously and report their status correctly: "...as if someone was visiting

their doctor and asking the questions...it was better than making it seems like you were having a conversation with a friend, because it was kind of a serious topic dealing with...people would take it more seriously if they felt that it was a natural doctor, something like that." (P4)

Be Capable of Handling a Wide Range of Topics in the Task Domain

Nine participants (P2, P5, P6, P7, P9, P10, P11, P16, P17) mentioned that a natural VUI should not only be able to respond to questions directly related to its task but also be able to handle a wide range of topics within the domain of its task: "I would think it [a natural VUI] would need to handle anything that is specific to that institution, right? If I call Bank of America and ask about my Bank of America go-card, you know you need to understand me." (P10) A natural VUI should also be able to handle changes in conversation topics as long as the topic belongs within the task domain: "Let's say I want to book a table for three people, and I said [to a waiter] 'I want to book a table.' and [a waiter asked] 'For how many people?' and what if I say, 'What do you have outdoor seating?' I didn't answer the question. I didn't say like six people or two people, because I need to know this other piece of information, but I also didn't say like 'How tall is Barack Obama?'" (P9) When a user brings up a topic that is beyond the task domain, a natural VUI should still continue the conversation and remind the user of the task domain it can help with: "...if a person says 'I want to order a pizza', and your skill [Amazon VUI application] has no idea what that is...give them a helpful prompt saying 'This is the senior housing voice assistant. I can help you with finding when the next bus is, or finding when the next garbage day is, or this or this.'" (P2) To make users aware of the boundaries of the serviceable topic domain, the designers recommended preemptively providing context to help users understand what they can do with the application: "A lot of people make a

mistake in the design by saying 'Welcome to Toyota. How can I help you?' And it's like you're going to fail right there because that's so open-ended. No one will have an idea of what they can or can't say. They will probably fail. So you have to be really clear...like 'Welcome to Toyota's repair center! Would you like to schedule an appointment?'" (P9) However, when the target users are children, the designers should expect them to ask a lot of questions outside of the task domain: "'What's your favorite color, Alexa?' and they [children] would like to shift the conversation or just like go to a totally different topic." (P7)

Beyond-human Aspect #1: Deliver Information With Machine-Like Speed and Accuracy

Our designers mentioned that, to accomplish its tasks in an efficient manner, a natural VUI should incorporate machine-specific attributes such as high processing power, and selectively mimic certain parts of human conversation instead of pursuing every aspect of a natural human conversation: "This machine will be able to talk to us as if it was a human of course...more efficient, of course, more...you know you have a couple of these rules, but that's how I personally define natural." (P5) P12 mentioned that a natural VUI should attain the human-level ability to maintain conversational context while being able to deliver accurate information blazingly fast: "So it's just super-fast processing times, being able to deliver information while maintaining conversational context." (P12) Our participants described natural human speech as often being indirect and inefficient, so these aspects of human conversation should be left out when designing a natural VUI: "There are so much more words like in the human version of asking, but it sounds human. You know, it's not as direct or not as efficient, but there is a kind of like a personality behind it, I guess?" (P1)

"Oh, no less conversational, because you don't want...something that you're using every day. You don't want to have that be chatty and friendly right? You want to get your work done. So you know concentrating on being efficient and giving them the information and exactly the way that they want it." (P10)

Beyond-human Aspect #2: Maintain User Profiles to Deliver Personalized Services

Humans' memories are volatile in contrast to machine memories. Designers mentioned that a natural VUI should store a vast amount of users' personal information, such as personal histories or family relationships, to combine this information and provide a customized experience: "We customize all the knowledge of the user." (P8) "We personalize things and make things fit each user. Suppose you have an allergy or specific dietary requirements, then we could filter out all of those recipes and only suggest you the recipes that fit your needs." (P6) However, designers are, of course, aware that storing a huge amount of information comes with privacy concerns, so the importance of transparency about what data is stored was highlighted: "You need to be transparent about the collected and stored data." (P8)
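P6's recipe example can be sketched as profile-based filtering over a stored user profile. The data, field names, and function are all hypothetical, intended only to make the mechanism concrete:

```python
# Sketch of profile-based personalization: filter suggestions against a
# stored user profile, as in P6's recipe example. Data is illustrative.

USER_PROFILE = {"name": "Sam", "allergies": {"peanut"}, "diet": "vegetarian"}

RECIPES = [
    {"name": "Pad Thai", "allergens": {"peanut"}, "vegetarian": True},
    {"name": "Veggie Stir-fry", "allergens": set(), "vegetarian": True},
    {"name": "Beef Stew", "allergens": set(), "vegetarian": False},
]

def suggest(recipes: list[dict], profile: dict) -> list[str]:
    """Return only recipes compatible with the user's stored profile."""
    out = []
    for recipe in recipes:
        if recipe["allergens"] & profile["allergies"]:
            continue  # contains an allergen the user listed
        if profile["diet"] == "vegetarian" and not recipe["vegetarian"]:
            continue
        out.append(recipe["name"])
    return out

print(suggest(RECIPES, USER_PROFILE))  # ['Veggie Stir-fry']
```

Note that every field in the profile is personal data; the transparency P8 calls for means the user should be able to see, and ideally edit, exactly this stored record.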

4.2 Designers Experience Challenges

In order to inform where and how we should invest our efforts for future technological and theoretical advancement, we asked our participants what makes designing a natural VUI the most challenging. We grouped the challenges they described and ordered them in a list by the number of participants who mentioned each issue. To contextualize their challenges, we also asked them to describe their design practices. We found that the designers largely follow a user-centered design process

that includes three phases, namely a user research phase, a high-level design phase, and a testing phase, each described next. This matches the VUI design process laid out by Google [92]. During the user research phase, they conduct user research to determine requirements for their applications, collect user utterances, and create the personas for their applications. This phase involves multiple user observations and interviews. In the high-level design phase, designers create high-level designs such as sample dialogues and flowcharts of dialogues for their applications. Designers iteratively develop these high-level designs through collaboration with other designers. During this phase, while writing the dialogues, designers create the audio outputs of the dialogues and modify the audio and dialogues more or less in parallel. In the testing phase, designers create prototypes and conduct user tests to validate their high-level designs, and iteratively develop real VUI applications through multiple rounds of user testing. Here we present the most prevalent challenges, as reported by our participants (Figure 4.2), in designing a natural VUI. For each challenge, we illustrate how it relates to the twelve characteristics of naturalness described in Figure 4.1.

1. Synthesized Voice Fails to Convey Nuances and Emotion

The same sentence can convey different meanings depending on the way one narrates the text. Paralanguage, such as intonation, pauses, volume, and prosody, is an essential component in expressing subtle nuances and emotions in speech. During the high-level design phase, to make a VUI sound natural, the designers want to have control over the way the speech synthesizer will narrate their dialogues to the users. However, 9 participants (P1, P4, P5, P7, P8, P10, P13, P18, P20) reported that current speech synthesis technology lacks the expressivity to interpret the intended meaning of the dialogue text and convey it to the user. They felt that even the best speech synthesizer still sounds like "just a robo-voice" (P4) or like "just putting the sounds

The 10 Primary Challenges in Designing a Natural VUI

1. Synthesized voice fails to convey nuances and emotion.
2. SSML is time-consuming to use while not producing the desired results.
3. Existing VUI guidelines lack concrete and useful recommendations on how to design for naturalness.
4. Writing for spoken language is difficult.
5. Reconciling between "social" and "transactional" is hard.
6. Conveying messages clearly is difficult due to the limitations of synthesized voice.
7. Handling various spoken inputs from the users is difficult.
8. Impossible to capture all the possible situations.
9. Difficult to capture the users' emotions.
10. Difficult to understand users' perceived naturalness.

Figure 4.2: The 10 challenges that designers are currently encountering in designing natural VUIs

together" (P18) rather than "really meaning it [the script]" (P18). They think that voice synthesis technology has a large gap to bridge, saying "there's a long way to go for it to become very expressive." (P7) Hiring voice actors who can narrate the script with the natural tone and flow of a "real voice" was reported by many participants (P3, P6, P7, P8, P9, P13) as a common solution to make a VUI sound natural. However, recorded audio is considered to be significantly limited in flexibility and scalability. When there is a need to change the narration, editing the audio of a recorded speech is more laborious than re-synthesizing audio from an edited text: "...if we discover during research there are more words, then we have to hire that actor again to speak those words again. So it was not practical at all." (P8) Moreover, using pre-baked recordings was not a scalable solution for generating the spoken utterances of modern VUIs, which are required to handle a wide variety of data and conversation contexts: "Of course, when you are using a real voice actor, it's impossible because I should record every single street name that we have in Brazil." (P3) This challenge is more severe for non-English languages: "Also the way she [Amazon Alexa] is speaking for us in German it sounds very ironic and sarcastic, her tone of the voice." (P18) "[The] Dutch language is not so developed yet for Google...English words, sometimes Dutch people use English words, but if Google is [speaking] in Dutch, then it's sometimes complicated [not correct]." (P8) Due to the limited expressivity of synthesized voices, for applications that need to convey human-like emotions, our designers reported that they often had no choice other than using their own voices to record the audio.
For example, P1 found that the currently available synthesized voices were not good enough to express the nuanced emotions he desired for his storytelling application: "...there are some subtleties that I couldn't get Alexa to feel nostalgic about, you know, there is no command like nostalgia about the house party that you first met this guy that you are still in love with

at, you know?" (P1) Interestingly, for applications that do not need to express emotions, the importance of an expressive voice was not highlighted, and was even de-emphasized. Being humorous is often considered to be a relatively human-specific behaviour. P1 said that using Alexa's "robotic" voice for telling a joke creates irony and makes the situation funnier. Hence, he used Alexa's voice for his VUI application in which Alexa asks about the users' feces every day: "Alexa asks a question that only a human would ask and it's like a very human written skill [Amazon VUI application] that sounds like very robotic and because you're talking about poop because it's like a joke...There's something I think funnier about it." (P1) Relation with the characteristics of naturalness: The difficulty in expressing nuanced emotions inhibits the designers from achieving the social characteristics of the VUIs listed in Figure 4.1. P8 found that the synthesized voice used by IBM was not able to produce natural laughing sounds and posed a risk of misrepresenting itself with unintended negative social expressions: "The robot can not laugh, because if the robot laughs and you just say, 'ha, ha, ha' it sounds sarcastic...older adults actually feels like Alice [the robot] is laughing at them. So that's bad."

2. SSML is Time-Consuming to Use While Not Producing the Desired Results

Designers can emphasize parts of dialogues or modify prosodic properties of the audio, such as pitch, volume, and speed, by using the Speech Synthesis Markup Language (SSML) during the high-level design phase. Tech giants such as Google, Amazon and IBM continuously develop and support their own sets of SSML tags. However, many of our participants (P2, P3, P7, P9, P10, P11, P13, P15, P16) pointed out that writing and editing SSML tags is “time digging” (P13), while it frequently fails to yield the desired result: “I haven’t been very successful at doing this with SSML. It makes very minor changes,

but it doesn’t come close to what it would be if you use a voice actor, for example.” (P7) It was difficult to make a whole sentence flow naturally, and it felt “still too mechanical” (P5), even after fine-tuning speech timings by meticulously entering numerical values (e.g., 0.5 seconds of whispering): “I think it’s not very natural, like another 0.5-second break here, another somewhat slower here, all those things.” (P15) The poor design of SSML authoring interfaces, which makes SSML time-consuming to use, was another point of consternation. Most of our designers were using a simple text editor or generic XML mark-up tools for writing SSML tags. Hence, they had to (re)write and (re)listen to a whole sentence or paragraph even when making only a small change to their dialogues: “I mean even just in the best-case scenario, let’s say you listen to a prompt, you decided that you wanted to change one thing by using SSML. You change that thing. You listen to it again. That’s your best-case scenario, and right there you just spent, I don’t know, couple minutes maybe, and if you have a hundred prompts to do, it’s just not worth it for the small benefit you’ll get.” (P9) It was also hard to evaluate when the SSML tags had reached the optimal level of expressiveness. Hence, our designers often spent a lot of time iteratively modifying SSML tags without knowing when to stop: “Hard to stop, like, I’m not satisfied with what I got there, so I just keep on changing something here and there.” (P15) Our designers had a hard time using SSML when trying to imbue a narration with emotional expressivity. Designing to express a subjective experience of emotion requires holistic control of all prosody features at the same time. However, SSML only offers control of one prosodic element at a time: “The technology should be mature a little bit for us to have an SSML tag that is empathetic. It has a very subjective nature.
The thing is that it’s not very objective. The objective things are being loud, silent...but this is completely different.” (P13)

Due to these limits of current SSML, most of the participants had abandoned using SSML except for making simple changes to the audio, such as inserting breaks, slowing the speed, and correcting the pronunciation of mispronounced words. “In the SSML, I spent some time a while ago, like there is a prosody tag like that. I just tried and tried and, usually it just didn’t do what I wanted it to do.” (P9) “The prosody tag can be really difficult to deal with right?...I haven’t played with SSML for a couple [of] years.” (P10) “It’s been a while since I played with SSML. Well...[the SSML tag for putting a] break I use it all the time still.” (P7) If an application targets multiple platforms (e.g., Google Home and Amazon Alexa), using SSML takes even more time: the resulting voices can differ across platforms even for the same tag, and the same set of SSML tags might not be available on every platform. This requires designers to test their SSML tags for each platform: “Different speech synthesizers are going to have different packages, so I want to be able to play with the SSML before I decide on how this is going to work.” (P10) Relation with the characteristics of naturalness: Being able to produce an empathetic voice is essential for having positive social interactions with users, as stated in Figure 4.1. Even though SSML is the only available way of modifying synthesized voices, our designers reported its limitations in truly creating an empathetic voice and the significant amount of time it requires as obstacles to achieving the social characteristics.
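To make the markup involved concrete, the few adjustments that designers did keep using (breaks, a slower speech rate, and pronunciation fixes) can be assembled as SSML strings. The snippet below is an illustrative sketch only: the break, prosody, and phoneme tags are standard SSML, but their rendering and availability vary by platform, and the helper function and the IPA transcription shown are hypothetical, not taken from any participant’s tool.

```python
# Illustrative sketch: assembling the simple SSML adjustments designers
# reported still using (breaks, slower rate, pronunciation fixes).
# The break/prosody/phoneme tags are standard SSML, but tag support and
# rendering vary by platform, so output must be re-tested per synthesizer.
# The helper below and the IPA string are hypothetical.

def build_prompt(text: str, pause_ms: int = 500, rate: str = "slow") -> str:
    """Wrap a prompt in SSML with a leading pause and a slower rate."""
    return ('<speak>'
            f'<break time="{pause_ms}ms"/>'
            f'<prosody rate="{rate}">{text}</prosody>'
            '</speak>')

# Correcting a mispronounced proper noun with an explicit phoneme
# (IPA transcription is illustrative only).
fixed_noun = ('<speak><phoneme alphabet="ipa" ph="ɪˈθɪəriəm">'
              'ethereum</phoneme></speak>')

print(build_prompt("Which one would you like."))
```

Note that even a one-word pronunciation fix requires hand-written markup that must then be listened to in full; this is the per-prompt effort that participants such as P9 found disproportionate to the benefit.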

3. Existing VUI Guidelines Lack Concrete and Useful Recommendations on How to Design for Naturalness

In terms of writing VUI dialogues during the high-level design phase, seven participants (P2, P5, P6, P10, P11, P19, and P20) mentioned three problems with the existing VUI guidelines. First, they found the existing VUI guidelines easy to dismiss as cliché.

More specifically, our designers mentioned that some of the existing guidelines are “somewhat common sense in terms of avoiding using technical language, try and make it casual and simple”, and easy to set aside: “I feel like it’s kind of obvious and you know that when you’re creating something like a voice skill...I probably read it once, and I just left it.” (P6) Secondly, our designers found that the existing design guidelines do not apply to certain VUIs, depending on the context of the project: “At the same time, I think every company will have its own set of these [Design guidelines]...I mean, some apps are made to comfort people and make them feel less alone, and those guidelines are completely irrelevant, so it does depend on the context.” (P5) Lastly, they pointed out that the guidelines lack useful linguistic insights: “I feel like linguistic principles are a little more difficult to come by for voice interaction designers.” (P2) Linguists have been discovering hidden patterns of natural human conversations that people use without realizing it. To create VUIs that enable natural conversations with users, VUI designers should know how to use these linguistic insights as well as how to incorporate them into their design strategies for better user experiences. However, our designers reported that there is a “disconnection” between the linguistic insights and their design strategies. “So there’s a lot you can do with language, with pragmatics [a sub-field of linguistics] and this is not the rocket science. I mean, we have a lot of research about pragmatics already, since the 70s, since the 80s.
Well, they’re explored in research well enough, but they’re not connected well enough with the IT department.” (P18) The designers who found the existing design guidelines useful reported that whenever they design VUI applications, they need to make an intentional effort to return to the guidelines to refresh their memories of them: “So for example, I need to work on confirmations. Let me go to refresh

my memory on how to do confirmation style...I don’t have to like constantly go back to them [the design guidelines], but I certainly do go back in and look [at them]” (P9). Others reported not applying the guidelines due to time constraints: “So there are principles that I work with, but that’s just going to have to come to me at the moment. I’m not going to go back.” (P11) Relation with the characteristics of naturalness: This challenge hinders designers from creating a VUI that provides a great user experience while achieving the fundamental characteristic of a natural VUI, ‘Sound like a human is speaking’. Our designers requested more specific design guidelines that provide actionable advice on how to connect these linguistic components with design intentions. “I have no idea on how to use this construct behind these phrases, and how to break it down to use in my favorite design process...The contraction, phonetics, and ‘Umm’...I think understanding how meaning can change depending on how things are structured and organized in the dialogue” [is important]. (P20)

4. Writing for Spoken Language Is Difficult

When the participants write dialogues during the high-level design phase, they often find it hard to write for spoken language. When people talk, they adopt spoken language characteristics such as filler words, personal pronouns and colloquial words [53, 54, 55]. However, they exhibit these characteristics subconsciously. Hence, when our designers are writing, they often forget to incorporate them. “What I want to say is when you just started [as a VUI designer], it’s difficult because you have to really understand that the text you’re writing is not something to be read. It’s something to be spoken. It’s for a spoken interface, so you have to really try to imagine yourself in that situation and, like I said, just write like you are writing a screenplay or something like that.” (P3)

Designers write dialogues by typing on a keyboard instead of speaking them, and our designers reported that it is often hard to detect the unnaturalness of a dialogue just by reading it: “Challenges? So the platform limitation is definitely a challenge. A lot of times, the conversation sounds good on paper, but you really have to just say it.” (P12) Since we subconsciously use the characteristics of spoken language when conversing with others, P10 mentioned that the unnaturalness caused by writing can be hard to detect, and many people treat this problem as insignificant and hence do not put effort into addressing it: “I think the hard thing about these kinds of interfaces is that people feel like because they can talk, because they speak English, so they can write one of these interfaces.” (P10) From our survey of the existing VUI design guidelines, we found that multiple guidelines already exist for supporting designers in writing spoken dialogues [12, 13]. These guidelines tell VUI designers which characteristics of spoken language they should incorporate in their VUI dialogues. Since people use these characteristics unconsciously, designers who do not return to the guidelines and verify their dialogues manually often forget to apply the rules while writing. For example, even though one design guideline asks designers to avoid putting too much information in one line [93], designers often make this mistake: “The frequent mistake is people are giving a big amount of the texts and yeah just read it.” (P18) This makes designing a natural VUI challenging.
However, P3, a professional VUI designer, mentioned that this can be overcome as designers gain more experience: “At first, it is difficult, because, like I said, normally when you’re writing, you’re writing for someone to read it, not to speak it...[It] took something like maybe three months or so to really get used to this way of writing, the practice of writing dialogue.” (P3) Relation with the characteristics of naturalness: This difficulty

hinders designers from creating VUI dialogues that sound like spoken dialogues, and it thus runs counter to the fundamental characteristic of a natural VUI, ‘Sound like a human is speaking’.

5. Reconciling Between “Social” and “Transactional” Is Hard

Our designers tried to embrace the characteristics of social conversations to provide more realistic and interactive conversation experiences to their users. However, since most of them were designing task-oriented applications, they found that the goals of task-oriented applications often conflict with the desire to embrace social conversation characteristics. Five designers (P5, P11, P15, P17, P20) mentioned their desire to add more components of social conversations to their VUI designs to make them more realistic. However, in attempting to do so, they found that the dialogue gets longer, which conflicts with the task-oriented goal of completing tasks efficiently: “So obviously I wanted to write the dialogues that felt [like a] human [is speaking and] didn’t feel robotic, but I soon realized that things are more complicated. The more you want to add personality to things, then the longer becomes your dialogue.” (P20) Our designers thus felt that the two categories of naturalness characteristics often do not get along with one another: “...efficient, but it has to come up as like friendly [and] conversational.” (P5) “Challenges, I told you earlier, challenges are keeping it simple, yet interactive. It should sound familiar. It should sound friendly. It should not go out of the voice, so like that.” (P17) Relation with the characteristics of naturalness: This tension often confused our designers when they were writing dialogues and often made them give up on incorporating social characteristics. “But again we’re still thinking, we’re still in the process of, ‘Should we actually put in those little sentences [for having social interactions] or not?’” (P6) “I would prefer, right now, to focus more on helping people achieve their goals and move on with their lives more than a kind of having these artificial entities talking to

me in slang.” (P20)

6. Conveying Messages Clearly Is Difficult Due to the Limitations of Synthesized Voices

Our designers (P4, P8, P9, P12, P14) face a challenge in conveying precise meanings through synthesized voices. This is because synthesized voices often mispronounce certain types of words and sentences, and are not able to produce proper intonations and tones. Our designers elaborated on three specific problematic cases of mispronunciation. Synthesized voices often fail to: (1) put proper breaks when pronouncing long sentences: “...some words just kind of got mashed together, like into one whole long word, and it [the long sentence] sounded weird.” (P4), (2) pronounce contractions clearly: “That’s what it’s supposed to say, ‘What’ll it be?’, and then when it [Amazon Alexa] actually says it, it will be like, ‘Whatill it be?’” (P12), and (3) pronounce proper nouns such as names of cryptocurrencies and people: “...she can’t pronounce ethereum [a cryptocurrency] and mmm that’s a popular one.” (P14), “Names of people, Google says it differently.” (P8) Our participants also elaborated on two problematic cases of inadequate tones and intonations: (1) sentences ending with a question mark, and (2) non-lexical words. P9 mentioned the difficulty of producing natural sounds for interrogative sentences ending with a question mark: “It’s so hard sometimes to get the text to speech to ask a question in a way that makes sound natural...so I changed the question mark to a period, and then it said ‘Which one would you like?’, which is more how a person would actually say it, because a lot of times when we ask a question, we’re not doing a rising intonation.” P8 reported that synthesized voices do not produce proper tones for non-lexical words (i.e., words that do not have a defined meaning) such as laughter. Her design intention was to make her robot laugh with a happy tone, but it had a sarcastic tone instead: “The robot can not laugh, because if the robot laughs,

and it just says, ‘Ha-ha-ha’, it sounds sarcastic.” (P8) Relation with the characteristics of naturalness: The limitations of speech synthesis systems (i.e., mispronouncing words and sentences and not being able to produce proper intonations) hinder designers not only from attaining the fundamental characteristics of a natural VUI (‘Sound like a human is speaking’ and ‘Use appropriate prosody and intonation’) but also from achieving a social characteristic of it (‘Express sympathy and empathy’). This is because non-lexical words are essential to the social side of human conversations: people feel mutual understanding and compassion towards each other when they laugh together. Hence, when a VUI cannot produce natural laughing sounds, it is more challenging to provide harmonious interactions with users.

7. Handling Various Spoken Inputs From the Users Is Difficult

Four participants (P2, P6, P9, P11) mentioned that current NLU engines are still not good enough to understand the varied expressions of human language, which requires VUI designers to provide possible synonyms and expressions when writing dialogues for training the NLU engine: “Yeah, I still need to interview them [the VUI users]. I will, just because I don’t think the natural language engine is good enough to be able to figure out all the different ways you can ask for a bus.” (P2) However, they often found that the collected synonyms and expressions do not cover all the possible ones: “I don’t necessarily know what the users are going to say back. So I’ll make up something that you might say but that’s certainly not the same as a user who’s never used it [the VUI application] before.” (P9) Designers pointed out that understanding human vocabulary can be especially challenging due to its personal aspect. The same word can be used to express different meanings based on the conversational context (e.g., age and individual differences). Hence, designers mentioned that user utterances should be understood in a personalized context: “You need to personalize

the VUI to a user...You can not talk with Alexa if you’re older because Alexa won’t understand you.” (P8) Relation with the characteristics of naturalness: Understanding the specialized vocabulary that older adults use appears to be critically needed. The lack of capacity to correctly understand varied wording from users risks additional interaction breakdowns and demands more effort from designers to collect additional user utterances. This prevents designers from achieving the fundamental characteristic, ‘Understand and use variations in human language’. However, as mentioned in that section, when the target audience of the application is a population using special terminologies, this challenge may have less impact.
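The coverage problem the participants describe can be sketched in a few lines. In the toy example below (all intent names and utterances are hypothetical, and the exact matcher is a deliberately naive stand-in for a real NLU engine), an intent is “trained” by enumerating sample phrasings, so any paraphrase outside the list falls through to a fallback, which is exactly the gap that designers must keep patching by collecting more user utterances:

```python
# Toy sketch (hypothetical intent and utterances) of the enumeration
# designers described: each intent is backed by hand-collected sample
# phrasings, so paraphrases outside the list fall through to a fallback.

SAMPLE_UTTERANCES = {
    "NextBusIntent": [
        "when is the next bus",
        "what time does the bus come",
        "when does the bus arrive",
    ],
}

def match_intent(utterance: str) -> str:
    """Naive exact matcher standing in for an NLU engine."""
    text = utterance.lower().strip()
    for intent, samples in SAMPLE_UTTERANCES.items():
        if text in samples:
            return intent
    return "Fallback"

print(match_intent("When is the next bus"))  # covered phrasing -> NextBusIntent
print(match_intent("how long till my bus"))  # unanticipated paraphrase -> Fallback
```

Real NLU engines generalize well beyond exact matches, but the designers’ point stands: the training set is still a finite, hand-curated list, and a user’s personal or age-specific wording can fall outside it.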

8. Impossible to Capture All the Possible Situations

Our participants reported that VUIs provide more flexibility in user inputs than other user interfaces such as graphical or text-based user interfaces: “...there’s so much more flexibility that you have in voice than with the searches [text-based search interfaces]” (P12) Our designers (P2, P8, P13, P16) reported that this great amount of flexibility makes it hard for VUI designers to anticipate and prepare for all possible scenarios that can occur during user interaction: “I also have to worry about how to handle the language, if they say something completely different from what I expected.” (P2) We found that our designers use two strategies to prevent conversation breakdowns caused by unexpected inputs. First, they strive to narrow down the set of available conversation pathways in advance when the task context is ‘predictable’ (P6) and ‘typical’ (P3): “Normally, when you’re designing an interface like that, you have three typical errors that we have to prevent.” (P3) Second, they preemptively guide users to ask questions within the service domain only: “We can’t handle all those things. So you really need to know how to guide the conversation to get the person to know what they can say,

and help them say in a way that your technology can actually handle.” (P9) P8, P13 and P16 mentioned that this is especially challenging when designing for children: “People can say anything. Children can say more than anything...they can go random into anything” (P16) “I think that’s very important and that’s most challenging for me to anticipate what each of kids will say at each of the points because they’re very unpredictable.” (P13) Even though most situations are predictable and can be handled with a common strategy, our designers still feel challenged when designing for children, because children do not follow the voice assistants’ guidance: “...they don’t follow instructions, usually. So they can go randomly into anything.” (P16) Relation with the characteristics of naturalness: This challenge may hinder designers from attaining the fundamental characteristic of a natural VUI, ‘Collaboratively repair conversation breakdowns’, especially when the target VUI users are children.

9. Difficult to Capture the Users’ Emotions

Our designers found it challenging to write VUI dialogues that are empathetic to VUI users’ emotions due to the lack of a way to capture users’ emotions and incorporate the detected emotions into their VUI dialogue designs. Hence, our participants mentioned that in their practices, they tried to avoid making their VUI assistants sound too happy in case their users were experiencing negative feelings: “You have to control the tone of voice because you can’t sound very enthusiastic, things like that because you never know the situation of the person on the other side. You don’t know if the person is really emotionally ill or something more serious is happening at the time that the person is calling and interacting with the system.” (P3) Our participants wished for a VUI design tool in which they could write VUI conversation flows depending on the detected emotions of VUI users: “I think it would be good to identify emotions...More useful [thing] would be to detect emotional content on an utterance [from a VUI user] to give you the context”

(P2) Relation with the characteristics of naturalness: For having harmonious social interactions with users, understanding users’ emotions and providing truly empathetic messages are essential. Misunderstanding users’ emotions and saying something that breaks the emotional context may hinder designers from achieving the social characteristic, ‘Express sympathy and empathy’.

10. Difficult to Understand Users’ Perceived Naturalness

Our participants (P6, P12) reported that they lack a practical way to find out how VUI users really perceive the naturalness of VUI applications: “It’s also hard to detect if people find that, at scale, if things are unnatural. Because there’s, at least at this point, no real way of asking users, ‘Does this conversation seem natural?’ at the end of their experiences.” (P12) Our participants mentioned that even with the best effort to create natural VUIs (e.g., following guidelines for creating natural VUIs), VUI users’ real experiences with the applications can differ from designers’ expectations: “I mean the designers and project strategists might have an idea of what the ideal conversation is, and like, yeah, everything makes sense. But when you actually deploy [the VUI applications], the users are going to come back to you and say, ‘No, no, no, no, we don’t care about any of this.’” (P12) There is built-in support from Google for letting VUI users rate VUI applications with a five-star rating system. However, our designers found the rating results not comprehensive enough for understanding their users’ experiences: “Google handles that by giving you a review...Option to review the application. If you’ve interacted with the Google action [a VUI application developed with Google’s platform]. Even that isn’t comprehensive.” (P12) Instead of using the built-in systems from Google or Amazon, our participants mentioned designing voice assistants to ask VUI users about their experiences as an alternative solution to this problem: “I think down

the line we would actually collect the responses [from the users]...For our feedback, we’d say, ‘Could you rate it one to five?’, and then we’d say, ‘If you want, you can say a little more about it.’, and then we’d start recording it.” (P6) However, P12 doubted that many people would answer such feedback questions: “Unless we explicitly asked them, like some QA [quality assurance] tools do for web applications. But, are people going to stick around to answer those questions? Maybe not.” (P12) Overall, our participants reported a lack of tool support for understanding the naturalness perceived by VUI users: “It’s a lot harder to, I guess, find out if the actual conversation is natural, per se, at scale, because there just aren’t a lot of tools to help with that.” (P12) Relation with the characteristics of naturalness: This challenge does not hinder designers from attaining one specific characteristic of a natural VUI. Rather, it makes it difficult for designers to validate their design practices for creating a natural VUI, and harder to choose design directions for enhancing the naturalness of their VUI applications.

Chapter 5

Discussion and Implications for Design

5.1 Discussion

5.1.1 Capturing Naturalness in Context

Despite the lack of a consistent definition of naturalness, our study demonstrates that characterizing the term is a feasible inquiry and that structuring the characteristics with respect to context is important: in our case, the primary context refers to the role of the conversation, social (i.e., for having a harmonious social interaction) or transactional (i.e., for helping VUI users complete tasks), in which a VUI is situated. Beyond the role of conversation, additional contextual factors that shape naturalness include the target user: for example, user age (such as children and older adults) and occupation (such as physicians vs. the general public). Using this range of factors as a lens to interpret the designers’ practices enabled us to characterize where different challenges in design practices stem from. Building on these findings, we ideate approaches to address the identified challenges at the end of this Discussion. These context-dependent characteristics of naturalness hinge primarily

on the role that a VUI is expected to play. In the present study, we identified two roles, transactional and social, as the prevalent types that designers could identify and discuss. However, given the pervasiveness of computing in people’s everyday lives [94] and the near-ubiquity of VUI-enabled devices, it is entirely possible that the roles of VUIs will be extended and diversified beyond these two. Designers will then need to adjust their conception of naturalness in VUIs to the new roles accordingly. For instance, if a VUI in an interactive learning system serves the role of an instructor [95], rather than that of a task-oriented assistant or a social companion, its naturalness would need to be characterized anew for that instructional role. We categorized the characteristics of naturalness according to the role of VUIs, drawing heavily on Clark et al.’s categories of conversation types [28]. Other researchers may leverage dimensions other than the role of VUIs to categorize the characteristics. For instance, a different but complementary approach could involve labeling each found characteristic with different qualities of natural VUI experiences, such as anthropomorphism, transparency, efficiency, and pleasantness.

5.1.2 Positioning Naturalness of VUI in the Literature

As noted above, roles were not the only determinant of naturalness for a characteristic. The target user demographic was an apparent, though somewhat buried, factor to consider when designing for naturalness. For example, children tend to anthropomorphize VUIs [96], and the designers in our study expect their VUIs to offer children a more entertaining experience. Our designers focused on tailoring characteristics to the specific needs of the user. This is in alignment with the conception of naturalness from Wigdor et al. [21], where the term focuses on the user experience and not the device. When asked what is natural in VUIs, designers tend to think of naturalness as a quality of the system that can provide a positive experience in the given application context. This concept differs from the existing notion of naturalness that equates it to behavioural realism (being like a human)

or considers naturalness to be the Holy Grail of all VUI applications. For example, a VUI can be designed to produce unnatural sounds for the purpose of entertainment, or to use technical jargon for users with specialized occupations (e.g., police and firefighters), as mentioned in P1’s example from section 4.2.1 and P2’s example from section 4.1.1. It is worth noting that our results provide a conceptual departure from the notion of naturalness as an imitation of the qualities of human-to-human interactions. In their seminal work on mediated communication [97], Hollan and Stornetta claim that the creation of electronic media should go “beyond being there”, so that new tools can offer beneficial interaction mechanisms that humans in the flesh cannot, rather than blindly mimicking reality. Given that most of our findings on ‘beyond-human’ aspects were in the transactional characteristics, it can be inspiring to contemplate which machine-specific characteristics can enhance natural social encounters.

5.1.3 Contrasting Naturalness of Transactional vs. Social Agents

Characteristics of naturalness may not transfer across the different roles that a VUI plays. We identified conceptual groupings of naturalness characteristics bounded by types of conversation (Social or Transactional). For instance, answering a given question with high accuracy and speed may not be conducive to the naturalness of a social conversation, but may be highly desirable and feel perfectly natural in a transactional conversation. As such, designers should not simply follow base naturalness characteristics; they need to start with the role of the conversation and select the appropriate related characteristics, sometimes even demoting a certain naturalness characteristic when it mismatches the target application. Our finding that naturalness characteristics differ according to the role of a VUI application (e.g., social or transactional) uncovered a significant design challenge that VUI designers are facing: namely, how to reconcile the

dual role of helpful assistant and pleasant social interlocutor. Such agents offer a fundamentally transactional interaction that is structured in a conversational format only on the surface [19]. Most of the designers in our study were experienced primarily in designing task-oriented VUIs. When they try to imbue their interfaces with social characteristics, such as empathetic greetings or small chit-chat, they face difficulty in finding an appropriate tradeoff between designing for an effective assistant vs. an affable companion; P20 adds that “the more you want to add personality to things that the longer become your dialogue”.

5.1.4 Limitations

We tried to recruit designers who had worked on various types of VUIs, such as Interactive Voice Response (IVR) systems, humanoid robots, and smart home speakers such as Google Home and Amazon Alexa. However, more than half of the recruited designers (N=14) had experience only with smart home speakers. Hence, a future study should recruit similar numbers of designers for different types of VUIs. Our study also did not verify whether designers’ characterizations of a natural VUI match VUI users’ perceptions of a natural VUI. Given the scope of this study, the impact on user experience of pursuing the characteristics of a natural VUI defined by our designers is unexplored as well. Finally, even though we recruited both professional and amateur VUI designers, we did not explore the patterns specific to each participant group due to time constraints, which could have enriched the findings of our study.

5.1.5 Implications for VUI Design Tools

According to our findings, all 10 of the primary challenges were related to the high-level design phase, in which VUI designers write VUI dialogues and create flowcharts for the dialogues. Delving into the designers’ practices for

naturalness in the high-level design phase revealed two critical gaps in the way existing VUI-related technologies support designers’ work.

Towards More Natural Sounding Narration

Although fine-tuning the way VUIs sound was given great importance in providing a natural user experience with VUIs, the VUI scene lacks design tools for producing narrations that sound rich and nuanced. Our designers regard SSML and voice talents as the only two approaches available to them, but the pros and cons of the two are complementary: working with voice talents gives designers a great deal of control over para-language (i.e., non-lexical attributes of speech, such as intonation, timing, and pitch), but hiring them is expensive and recorded audio clips are not scalable; SSML is scalable but lacks sufficient control. VUI designers need a solution that is both scalable and offers a high degree of control. The existing interfaces for SSML editing have deep ‘gulfs of execution and evaluation’, as introduced by Norman and Draper [98]. In other words, current SSML editing interfaces lack convenient ways of producing natural sounds (a gulf of execution) and of evaluating the naturalness of the produced sounds (a gulf of evaluation). The designers use a simple text editor to implement SSML tags, which is laborious and error-prone. The gulf of evaluation is also very deep because the existing tools do not support spot listening of the edited part of a script. Hence, repetitively playing through an entire sentence to check whether a minor change of tags yields the desired output is tedious. There is a lot to learn from the way VUI designers guide voice talents to narrate a given script with the intended prosody and timing. Their directions are primarily demonstration-based, such as ghost narration. Hence, demonstration-based prosody editing, leveraging our own voices [99], is a promising approach worth investigating. Also, multi-modal cues, such as hand gestures, are effective ways to convey and highlight intended changes

51 to the voice talents. Similarly, graphical or direct manipulation of pitch, accent, and timing will enable faster and easier creation of natural-sounding narration.
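To make the gulf of execution concrete, the following minimal sketch (not from this thesis) shows how a single prosody adjustment is expressed in raw SSML markup; the tag names and attributes follow the W3C SSML specification, while the helper function and example utterance are hypothetical. Editing such tags in a plain text editor for every word of a long script, then re-synthesizing and re-listening to the whole utterance, is exactly the laborious loop described above.

```python
# A minimal sketch of hand-authored SSML prosody markup.
# The <prosody> tag and its pitch/rate attributes come from the W3C
# SSML spec; `with_prosody` is a hypothetical convenience helper.

def with_prosody(text, pitch=None, rate=None):
    """Wrap `text` in an SSML <prosody> tag with the given attributes."""
    attrs = []
    if pitch is not None:
        attrs.append(f'pitch="{pitch}"')
    if rate is not None:
        attrs.append(f'rate="{rate}"')
    return f"<prosody {' '.join(attrs)}>{text}</prosody>"

# A designer tweaking one word must edit the markup by hand, then
# re-synthesize and re-listen to the whole utterance to judge the result.
utterance = (
    "<speak>Your package will arrive "
    + with_prosody("tomorrow", pitch="+15%", rate="slow")
    + ".</speak>"
)
print(utterance)
```

Even this tiny example suggests why spot listening matters: a change confined to one word still forces evaluation of the entire synthesized sentence.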

Writing for More Natural Spoken Dialogues

One of the primary challenges for VUI designers is writing dialogues with spoken-language characteristics, such as frequent use of filler words and fewer big words. Several dialogue design tools in the field offer beneficial features like dialogue mapping and instant speech-based testing, but they lack evaluation support and do not recommend linguistic properties for a given script. Given that designers tend to dismiss this kind of naturalness characteristic easily, a proper scripting interface, similar to proofreading tools, should warn the user when the dialogue has too many attributes of written language, or even suggest alternative phrasings in spoken language. Such tools could also offer editing suggestions tailored to the target users or the purpose of the application, as the required naturalness varies with these design contexts. Implementing such tools will require linguistic modelling of spoken versus written language, computational prediction for evaluating a given script, and search over alternatives for recommending different expressions. Natural language processing techniques are becoming increasingly sophisticated: for example, the F-score [100] is a linguistic measure of how formal a given text is, and the Stanford Politeness API can identify linguistic markers that are indicative of politeness.
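As a sketch of how such a scripting tool might flag overly "written" dialogue, the snippet below computes the formality F-score of Heylighen and Dewaele [100]: F = (noun% + adjective% + preposition% + article% − pronoun% − verb% − adverb% − interjection% + 100) / 2, where each term is that part-of-speech category's share of all words, in percent. The part-of-speech counts here are illustrative, not from the thesis; a real tool would obtain them from a POS tagger.

```python
# Formality F-score (Heylighen & Dewaele, 2002): higher values indicate
# more formal, "written" style; lower values indicate contextual,
# "spoken" style. The POS count dictionaries below are made-up examples.

def f_score(counts):
    total = sum(counts.values())
    pct = {k: 100.0 * v / total for k, v in counts.items()}
    formal = pct["noun"] + pct["adjective"] + pct["preposition"] + pct["article"]
    deictic = pct["pronoun"] + pct["verb"] + pct["adverb"] + pct["interjection"]
    return (formal - deictic + 100) / 2

# Spoken-style text leans on pronouns, verbs, adverbs, and interjections,
# so it scores lower than noun-heavy written prose.
spoken = {"noun": 10, "adjective": 4, "preposition": 8, "article": 8,
          "pronoun": 20, "verb": 28, "adverb": 14, "interjection": 8}
written = {"noun": 28, "adjective": 14, "preposition": 16, "article": 12,
           "pronoun": 8, "verb": 16, "adverb": 6, "interjection": 0}
print(f_score(spoken), f_score(written))
```

A proofreading-style dialogue tool could warn the designer whenever a script's F-score exceeds a threshold appropriate for the application's target users.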

Chapter 6

Conclusion and Future Work

Naturalness is a construct that has been loosely defined in HCI. In this study, we articulated its meaning from the designers' perspective. To this end, we conducted 20 interviews with VUI designers to see how they characterize the term, how they incorporate it into their design practice, and the challenges they face in doing so.

Through an inductive and deductive thematic analysis, we uncovered 12 characteristics of a natural VUI and introduced 3 categories for subdividing these characteristics: 'Fundamental', 'Social' and 'Transactional'. While many of the traits we found were human-like in essence, designers mentioned that they saw naturalness in VUIs as a quality that is beyond-human: in their conception, exhibiting machine-like speed, accuracy, and memory was natural.

In addition to these findings, we also elicited 10 challenges that VUI designers face in creating a natural VUI. We then used the 3 categories ('Fundamental', 'Social' and 'Transactional') as a lens to obtain a deeper understanding of these challenges. As a result, we found that most of these challenges hinder VUI designers from attaining the fundamental characteristic ('Sound like a human is speaking') and the social characteristic ('Express sympathy and empathy') of a natural VUI. We also uncovered an interesting conflict between attaining social characteristics and achieving transactional characteristics, which is expected to become more problematic as the role of VUIs expands. To resolve the primary challenges mentioned above, we developed design implications for future tool support for producing natural synthesized voices and for writing VUI dialogues that incorporate the characteristics of spoken language.

Bibliography

[1] G. Skantze, Error handling in spoken dialogue systems - managing uncertainty, grounding and miscommunication. Gabriel Skantze, 2007.

[2] T. K. Harris and R. Rosenfeld, “A universal speech interface for appliances,” in Eighth International Conference on Spoken Language Processing, 2004.

[3] D. Griol, J. Carbó, and J. M. Molina, “An automatic dialog simulation technique to develop and evaluate interactive conversational agents,” Applied Artificial Intelligence, vol. 27, no. 9, pp. 759–780, 2013.

[4] H. Hofmann, U. Ehrlich, S. Reichel, and A. Berton, “Development of a conversational speech interface using linguistic grammars,” in Adjunct Proceedings of the 5th International Conference on Automotive User Interfaces and Interactive Vehicular Applications, Eindhoven, The Netherlands, 2013.

[5] Amazon. (2020) Conversational AI. [Online]. Available: https://developer.amazon.com/en-US/alexa/alexa-skills-kit/conversational-ai

[6] Shopify. (2020) UI of the future: Conversational interfaces. [Online]. Available: https://www.shopify.ca/partners/blog/conversational-interfaces

[7] Google. (2020) Conversation design. [Online]. Available: https://developers.google.com/assistant/actions/design

[8] R. J. Moore and R. Arar, Conversational UX Design: A Practitioner’s Guide to the Natural Conversation Framework. Morgan & Claypool, 2019.

[9] M. H. Cohen, J. P. Giangola, and J. Balogh, Voice user interface design. Addison-Wesley Professional, 2004.

[10] Amazon. (2018) Things every alexa skill should do: Pass the one-breath test. [Online]. Available: https://developer.amazon.com/blogs/alexa/post/531ffdd7-acf3-43ca-9831-9c375b08afe0/things-every-alexa-skill-should-do-pass-the-one-breath-test

[11] INTUITY™ CONVERSANT® System Version 6.0. Lucent Technologies, Bell Labs Innovations.

[12] Google. (2020) Conversation design guideline. [Online]. Available: https://designguidelines.withgoogle.com/conversation/conversation-design/welcome.html

[13] Amazon. (2020) Write out a script with conversational turns. [Online]. Available: https://developer.amazon.com/en-US/docs/alexa/alexa-design/script.html

[14] SoundHound Inc. (2020) Natural dialogue and conversation with voice ui. [Online]. Available: https://www.soundhound.com/vui-guide/enabling-natural-conversations-in-vui

[15] Amazon. (2020) Voice design for alexa experiences: Be adaptable. [Online]. Available: https://developer.amazon.com/en-US/docs/alexa/alexa-design/adaptable.html

[16] L. K. Hansen and P. Dalsgaard, “Note to self: Stop calling interfaces “natural”,” in Proceedings of The Fifth Decennial Aarhus Conference on Critical Alternatives, ser. CA ’15. Aarhus N: Aarhus University Press, 2015, p. 65–68. [Online]. Available: https://doi.org/10.7146/aahcc.v1i1.21316

[17] J. K. Burgoon, “Interpersonal expectations, expectancy violations, and emotional communication,” Journal of Language and Social Psychology, vol. 12, no. 1-2, pp. 30–48, 1993.

[18] E. Goffman et al., The presentation of self in everyday life. Harmondsworth, London, 1978.

[19] M. Porcheron, J. E. Fischer, S. Reeves, and S. Sharples, “Voice interfaces in everyday life,” in Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, ser. CHI ’18. New York, NY, USA: Association for Computing Machinery, 2018. [Online]. Available: https://doi.org/10.1145/3173574.3174214

[20] V. M. LLC. (2020) Google duplex puts ai into a social uncanny valley. [Online]. Available: https://www.vice.com/en_us/article/d3kgkk/google-duplex-assistant-voice-call-dystopia

[21] D. Wigdor and D. Wixon, “Chapter 2 - the natural user interface,” in Brave NUI World, D. Wigdor and D. Wixon, Eds. Boston: Morgan Kaufmann, 2011, pp. 9–14. [Online]. Available: http://www.sciencedirect.com/science/article/pii/B9780123822314000022

[22] B. Gates, “The power of the natural user interface,” Oct 2011. [Online]. Available: https://www.gatesnotes.com/About-Bill-Gates/The-Power-of-the-Natural-User-Interface

[23] M. F. McTear, “The rise of the conversational interface: A new kid on the block?” in Future and Emerging Trends in Language Technology. Machine Learning and Big Data, J. F. Quesada, F.-J. Martín Mateos, and T. López Soto, Eds. Cham: Springer International Publishing, 2017, pp. 38–49.

[24] V. Braun and V. Clarke, “Using thematic analysis in psychology,” Qualitative Research in Psychology, vol. 3, no. 2, pp. 77–101, 2006. [Online]. Available: https://www.tandfonline.com/doi/abs/10.1191/1478088706qp063oa

[25] K. H. Davis, R. Biddulph, and S. Balashek, “Automatic recognition of spoken digits,” The Journal of the Acoustical Society of America, vol. 24, no. 6, pp. 637–642, 1952.

[26] B. T. Lowerre, “The harpy speech recognition system,” Carnegie Mellon University, Pittsburgh, PA, Dept. of Computer Science, Tech. Rep., 1976.

[27] N. Yankelovich, G.-A. Levow, and M. Marx, “Designing speechacts: Issues in speech user interfaces,” in Proceedings of the SIGCHI conference on Human factors in computing systems, 1995, pp. 369–376.

[28] L. Clark, N. Pantidi, O. Cooney, P. Doyle, D. Garaialde, J. Edwards, B. Spillane, E. Gilmartin, C. Murad, C. Munteanu et al., “What makes a good conversation? challenges in designing truly conversational agents,” in Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, 2019, pp. 1–12.

[29] T. Ammari, J. Kaye, J. Y. Tsai, and F. Bentley, “Music, search, and iot: How people (really) use voice assistants,” ACM Trans. Comput.-Hum. Interact., vol. 26, no. 3, Apr. 2019. [Online]. Available: https://doi.org/10.1145/3311956

[30] A. Pyae and T. N. Joelsson, “Investigating the usability and user experiences of voice user interface: A case of google home smart speaker,” in Proceedings of the 20th International Conference on Human-Computer Interaction with Mobile Devices and Services Adjunct, ser. MobileHCI ’18. New York, NY, USA: Association for Computing Machinery, 2018, p. 127–131. [Online]. Available: https://doi.org/10.1145/3236112.3236130

[31] E. Luger and A. Sellen, ““like having a really bad pa”: The gulf between user expectation and experience of conversational agents,” in Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, ser. CHI ’16. New York, NY, USA: Association for Computing Machinery, 2016, p. 5286–5297. [Online]. Available: https://doi.org/10.1145/2858036.2858288

[32] A. M. Turing, “Computing machinery and intelligence,” in Parsing the Turing Test. Springer, 2009, pp. 23–65.

[33] P. Garrido, F. Martinez, and C. Guetl, “Adding semantic web knowledge to intelligent personal assistant agents,” in Proceedings of the ISWC 2010 Workshops. Philippe Cudre-Mauroux, 2010, pp. 1–12.

[34] P. Braunger and W. Maier, “Natural language input for in-car spoken dialog systems: How natural is natural?” in Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, 2017, pp. 137–146.

[35] C. Nass, Y. Moon, B. J. Fogg, B. Reeves, and C. Dryer, “Can computer personalities be human personalities?” in Conference Companion on Human Factors in Computing Systems, ser. CHI ’95. New York, NY, USA: Association for Computing Machinery, 1995, p. 228–229. [Online]. Available: https://doi.org/10.1145/223355.223538

[36] A. Malizia and A. Bellucci, “The artificiality of natural user interfaces,” Commun. ACM, vol. 55, no. 3, p. 36–38, Mar. 2012. [Online]. Available: https://doi.org/10.1145/2093548.2093563

[37] A. Soro, S. A. Iacolina, R. Scateni, and S. Uras, “Evaluation of user gestures in multi-touch interaction: A case study in pair-programming,” in Proceedings of the 13th International Conference on Multimodal Interfaces, ser. ICMI ’11. New York, NY, USA: Association for Computing Machinery, 2011, p. 161–168. [Online]. Available: https://doi.org/10.1145/2070481.2070508

[38] B. Loureiro and R. Rodrigues, “Multi-touch as a natural user interface for elders: A survey,” in 6th Iberian Conference on Information Systems and Technologies (CISTI 2011), June 2011, pp. 1–6.

[39] U. Lee and J. Tanaka, “Finger identification and hand gesture recognition techniques for natural user interface,” in Proceedings of the 11th Asia Pacific Conference on Computer Human Interaction, ser. APCHI ’13. New York, NY, USA: Association for Computing Machinery, 2013, p. 274–279. [Online]. Available: https://doi.org/10.1145/2525194.2525296

[40] P. Głomb, M. Romaszewski, S. Opozda, and A. Sochan, “Choosing and modeling the hand gesture database for a natural user interface,” in Gesture and Sign Language in Human-Computer Interaction and Embodied Communication, E. Efthimiou, G. Kouroupetroglou, and S.-E. Fotinea, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, pp. 24–35.

[41] L. Bian and H. Holtzman, “Qooqle: Search with speech, gesture, and social media,” in Proceedings of the 13th International Conference on Ubiquitous Computing, ser. UbiComp ’11. New York, NY, USA: Association for Computing Machinery, 2011, p. 541–542. [Online]. Available: https://doi.org/10.1145/2030112.2030203

[42] A. Adler and R. Davis, “Speech and sketching for multimodal design,” in ACM SIGGRAPH 2007 Courses, ser. SIGGRAPH ’07. New York, NY, USA: Association for Computing Machinery, 2007, p. 14–es. [Online]. Available: https://doi.org/10.1145/1281500.1281525

[43] J. K. Zao, T. T. Gan, C. K. You, S. J. R. Méndez, C. E. Chung, Y. T. Wang, T. Mullen, and T. P. Jung, “Augmented brain computer interaction based on fog computing and linked data,” in 2014 International Conference on Intelligent Environments, June 2014, pp. 374–377.

[44] S. Siltanen and J. Hyväkkä, “Implementing a natural user interface for camera phones using visual tags,” in Proceedings of the 7th Australasian User Interface Conference - Volume 50, ser. AUIC ’06. AUS: Australian Computer Society, Inc., 2006, p. 113–116.

[45] M. Weiser, “The computer for the 21st century,” SIGMOBILE Mob. Comput. Commun. Rev., vol. 3, no. 3, p. 3–11, Jul. 1999. [Online]. Available: https://doi.org/10.1145/329124.329126

[46] T. K. Hui and R. S. Sherratt, “Towards disappearing user interfaces for ubiquitous computing: human enhancement from sixth sense to super senses,” Journal of Ambient Intelligence and Humanized Computing, vol. 8, no. 3, pp. 449–465, 2017.

[47] A. Van Dam, “User interfaces: disappearing, dissolving, and evolving,” Communications of the ACM, vol. 44, no. 3, pp. 50–52, 2001.

[48] D. Wigdor and D. Wixon, Brave NUI World: Designing Natural User Interfaces for Touch and Gesture, 1st ed. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2011.

[49] D. A. Norman, “Natural user interfaces are not natural,” Interactions, vol. 17, no. 3, p. 6–10, May 2010. [Online]. Available: https://doi.org/10.1145/1744161.1744163

[50] S. Eggins and D. Slade, Analysing casual conversation. Equinox Publishing Ltd., 2005.

[51] K. P. Schneider, Small talk: Analyzing phatic discourse. Hitzeroth, 1988, vol. 1.

[52] G. Brown, G. D. Brown, G. R. Brown, B. Gillian, and G. Yule, Discourse analysis. Cambridge University Press, 1983.

[53] C. H. Woolbert, “Speaking and writing—a study of differences,” Quarterly Journal of Speech, vol. 8, no. 3, pp. 271–285, 1922.

[54] G. Borchers, “An approach to the problem of oral style,” Quarterly Journal of Speech, vol. 22, no. 1, pp. 114–117, 1936.

[55] G. Redeker, “On differences between spoken and written language,” Discourse processes, vol. 7, no. 1, pp. 43–55, 1984.

[56] G. H. Drieman, “Differences between written and spoken language: An exploratory study,” Acta Psychologica, vol. 20, pp. 78–100, 1962.

[57] R. C. O’Donnell, “Syntactic differences between speech and writing,” American Speech, vol. 49, no. 1/2, pp. 102–110, 1974.

[58] T. Bennett, “Verb voice in unplanned and planned narratives,” Keenan, EO yT. Bennett (ed.): Discourse across time and space. SCOPIL, vol. 5, 1977.

[59] J. Cassell, T. Bickmore, M. Billinghurst, L. Campbell, K. Chang, H. Vilhjálmsson, and H. Yan, “Embodiment in conversational interfaces: Rea,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, ser. CHI ’99. New York, NY, USA: Association for Computing Machinery, 1999, p. 520–527. [Online]. Available: https://doi.org/10.1145/302979.303150

[60] R. S. Nickerson, “On conversational interaction with computers,” in Proceedings of the ACM/SIGGRAPH Workshop on User-Oriented Design of Interactive Graphics Systems, ser. UODIGS ’76. New York, NY, USA: Association for Computing Machinery, 1976, p. 101–113. [Online]. Available: https://doi.org/10.1145/1024273.1024286

[61] T. Bickmore and J. Cassell, “Relational agents: A model and implementation of building user trust,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, ser. CHI ’01. New York, NY, USA: Association for Computing Machinery, 2001, p. 396–403. [Online]. Available: https://doi.org/10.1145/365024.365304

[62] N. Wang, D. V. Pynadath, S. G. Hill, and A. P. Ground, “Building trust in a human-robot team with automatically generated explanations,” in Proceedings of the Interservice/Industry Training, Simulation and Education Conference (I/ITSEC), vol. 15315, 2015, pp. 1–12.

[63] T. Sanders, K. E. Oleson, D. R. Billings, J. Y. Chen, and P. A. Hancock, “A model of human-robot trust: Theoretical model development,” in Proceedings of the human factors and ergonomics society annual meeting, vol. 55, no. 1. SAGE Publications, Los Angeles, CA, 2011, pp. 1432–1436.

[64] T. W. Bickmore and R. W. Picard, “Establishing and maintaining long-term human-computer relationships,” ACM Trans. Comput.-Hum. Interact., vol. 12, no. 2, p. 293–327, Jun. 2005. [Online]. Available: https://doi.org/10.1145/1067860.1067867

[65] M. D. Abrams, G. E. Lindamood, and T. N. Pyke, “Measuring and modelling man-computer interaction,” in Proceedings of the 1973 ACM SIGME Symposium, ser. SIGME ’73. New York, NY, USA: Association for Computing Machinery, 1973, p. 136–142. [Online]. Available: https://doi.org/10.1145/800268.809345

[66] L. P. Vardoulakis, L. Ring, B. Barry, C. L. Sidner, and T. Bickmore, “Designing relational agents as long term social companions for older adults,” in Intelligent Virtual Agents, Y. Nakano, M. Neff, A. Paiva, and M. Walker, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, pp. 289–302.

[67] S. Sayago, B. B. Neves, and B. R. Cowan, “Voice assistants and older people: Some open issues,” in Proceedings of the 1st International Conference on Conversational User Interfaces, ser. CUI ’19. New York, NY, USA: Association for Computing Machinery, 2019. [Online]. Available: https://doi.org/10.1145/3342775.3342803

[68] R. Cole, D. W. Massaro, J. d. Villiers, B. Rundle, K. Shobaki, J. Wouters, M. Cohen, J. Baskow, P. Stone, P. Connors et al., “New tools for interactive speech and language training: using animated conversational agents in the classroom of profoundly deaf children,” in MATISSE-ESCA/SOCRATES Workshop on Method and Tool Innovations for Speech Science Education, 1999.

[69] D. Pérez-Marín and I. Pascual-Nieto, “An exploratory study on how children interact with pedagogic conversational agents,” Behaviour & Information Technology, vol. 32, no. 9, pp. 955–964, 2013.

[70] C. Nass, Y. Moon, and N. Green, “Are machines gender neutral? gender-stereotypic responses to computers with voices,” Journal of Applied Social Psychology, vol. 27, no. 10, pp. 864–876, 1997. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1559-1816.1997.tb00275.x

[71] R. C.-S. Chang, H.-P. Lu, and P. Yang, “Stereotypes or golden rules? exploring likable voice traits of social robots as active aging companions for tech-savvy baby boomers in taiwan,” Computers in Human Behavior, vol. 84, pp. 194 – 210, 2018. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0747563218300839

[72] C. I. Nass and S. Brave, Wired for speech: How voice activates and advances the human-computer relationship. MIT press Cambridge, MA, 2005.

[73] C. Nass, Y. Moon, B. J. Fogg, B. Reeves, and C. Dryer, “Can computer personalities be human personalities?” in Conference Companion on Human Factors in Computing Systems, ser. CHI ’95. New York, NY, USA: Association for Computing Machinery, 1995, p. 228–229. [Online]. Available: https://doi.org/10.1145/223355.223538

[74] Apple Inc., “Siri style guide - sirikit.” [Online]. Available: https://developer.apple.com/siri/style-guide/

[75] “Design process: The process of thinking through the design of a voice experience,” 2020, accessed: 2020-01-31. [Online]. Available: https://developer.amazon.com/fr/designing-for-voice/design-process/

[76] S. Ross, E. Brownholtz, and R. Armes, “Voice user interface principles for a conversational agent,” in Proceedings of the 9th International Conference on Intelligent User Interfaces, ser. IUI ’04. New York, NY, USA: Association for Computing Machinery, 2004, p. 364–365. [Online]. Available: https://doi.org/10.1145/964442.964536

[77] C. M. Myers, D. Grethlein, A. Furqan, S. Ontañón, and J. Zhu, “Modeling behavior patterns with an unfamiliar voice user interface,” in Proceedings of the 27th ACM Conference on User Modeling, Adaptation and Personalization, ser. UMAP ’19. New York, NY, USA: Association for Computing Machinery, 2019, p. 196–200. [Online]. Available: https://doi.org/10.1145/3320435.3320475

[78] S. R. Klemmer, A. K. Sinha, J. Chen, J. A. Landay, N. Aboobaker, and A. Wang, “Suede: A wizard of oz prototyping tool for speech user interfaces,” in Proceedings of the 13th Annual ACM Symposium on User Interface Software and Technology, ser. UIST ’00. New York, NY, USA: Association for Computing Machinery, 2000, p. 1–10. [Online]. Available: https://doi.org/10.1145/354401.354406

[79] H. H. Hiller and L. DiLuzio, “The interviewee and the research interview: Analysing a neglected dimension in research,” Canadian Review of Sociology/Revue canadienne de sociologie, vol. 41, no. 1, pp. 1–26, 2009. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1755-618X.2004.tb02167.x

[80] L. Richardson and E. St Pierre, “A method of inquiry,” Collecting and interpreting qualitative materials, vol. 3, no. 4, p. 473, 2008.

[81] A. Strauss and J. Corbin, Basics of qualitative research. Sage Publications, 1990.

[82] L. Zhou, J. Gao, D. Li, and H.-Y. Shum, “The design and implementation of xiaoice, an empathetic social chatbot,” Computational Linguistics, no. Just Accepted, pp. 1–62, 2018.

[83] R. E. Banchs and H. Li, “Iris: a chat-oriented dialogue system based on the vector space model,” in Proceedings of the ACL 2012 System Demonstrations. Association for Computational Linguistics, 2012, pp. 37–42.

[84] Z. Yu, L. Nicolich-Henkin, A. W. Black, and A. Rudnicky, “A wizard-of-oz study on a non-task-oriented dialog systems that reacts to user engagement,” in Proceedings of the 17th annual meeting of the Special Interest Group on Discourse and Dialogue, 2016, pp. 55–63.

[85] F. N. Akinnaso, “On the differences between spoken and written language,” Language and speech, vol. 25, no. 2, pp. 97–125, 1982.

[86] D. Tannen, “Oral and literate strategies in spoken and written narratives,” Language, pp. 1–21, 1982.

[87] C. M. Laserna, Y.-T. Seih, and J. W. Pennebaker, “Um... who like says you know: Filler word use as a function of age, gender, and personality,” Journal of Language and Social Psychology, vol. 33, no. 3, pp. 328–338, 2014.

[88] H. N. Nagel, L. P. Shapiro, and R. Nawy, “Prosody and the processing of filler-gap sentences,” Journal of psycholinguistic research, vol. 23, no. 6, pp. 473–485, 1994.

[89] J. C. Richards, J. C. Richards et al., The language teaching matrix. Cambridge University Press, 1990.

[90] S. Oishi, S. Kesebir, C. Eggleston, and F. F. Miao, “A hedonic story has a transmission advantage over a eudaimonic story.” Journal of Experimental Psychology: General, vol. 143, no. 6, p. 2153, 2014.

[91] G. S. Berns, “Something funny happened to reward,” Trends in cognitive sciences, vol. 8, no. 5, pp. 193–194, 2004.

[92] Google. (2020) Google conversation design process. [Online]. Available: https://designguidelines.withgoogle.com/conversation/conversation-design-process/how-do-i-get-started.html

[93] ——. (2020) Conversational components–informational statements. [Online]. Available: https://designguidelines.withgoogle.com/conversation/conversational-components/informational-statements.html

[94] S. Bødker, “When second wave hci meets third wave challenges,” in Proceedings of the 4th Nordic conference on Human-computer interaction: changing roles, 2006, pp. 1–8.

[95] H. Jung, H. J. Kim, S. So, J. Kim, and C. Oh, “Turtletalk: an educational programming game for children with voice user interface,” in Extended Abstracts of the 2019 CHI Conference on Human Factors in Computing Systems, 2019, pp. 1–6.

[96] C. Nass and Y. Moon, “Machines and mindlessness: Social responses to computers,” Journal of Social Issues, vol. 56, no. 1, pp. 81–103, 2000. [Online]. Available: https://spssi.onlinelibrary.wiley.com/doi/abs/10.1111/0022-4537.00153

[97] J. Hollan and S. Stornetta, “Beyond being there,” in Proceedings of the SIGCHI conference on Human factors in computing systems, 1992, pp. 119–125.

[98] D. A. Norman and S. W. Draper, User centered system design: New perspectives on human-computer interaction. CRC Press, 1986.

[99] Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio et al., “Tacotron: Towards end-to-end speech synthesis,” arXiv preprint arXiv:1703.10135, 2017.

[100] F. Heylighen and J.-M. Dewaele, “Variation in the contextuality of language: An empirical measure,” Foundations of science, vol. 7, no. 3, pp. 293–340, 2002.

[101] Google. (2020) Conversation design process - write sample dialogs. [Online]. Available: https://designguidelines.withgoogle.com/conversation/conversation-design-process/write-sample-dialogs.htm

[102] M. F. Scheier and C. S. Carver, “The self-consciousness scale: A revised version for use with general populations 1,” Journal of Applied Social Psychology, vol. 15, no. 8, pp. 687–699, 1985.

Appendix

This appendix includes every resource mentioned in Chapters 3 and 4 for conducting this study.

A.1 The Recruitment Poster

THE UNIVERSITY OF BRITISH COLUMBIA

Recruitment Form for ‘Voice User Interface Prototyping Study’
Department of Computer Science, 2366 Main Mall, Vancouver, B.C., V6T 1Z4

Interested in using Voice User Interfaces? Come help us! NO technical background is required.

Who we are: We are researchers from the University of British Columbia: Dr. Dongwook Yoon, Dr. Joanna McGrenere, and Yelim Kim. We are conducting a study to create design / technology solutions to create natural VUI dialogues.

We are looking for participants who are: ● 19 years of age or older ● Fluent English speakers

Receive a $23 honorarium for this 1.5-hour study ($15/hour)

Interested in participating? If you are interested in participating or would like more information, please contact Yelim Kim at [email] or [phone number].

Call for participation (Version 1.2 / May 21, 2019)

A.2 The Recruitment Message Posted Through SNS

Subject: Call for Study Participants - $15/hour for Voice User Interface Prototyping Study

Voice User Interface (VUI) is currently gaining high industrial and academic interest!

Are you interested in using the latest voice technologies, such as Google Home or Amazon Alexa, or even making an application for one? If so, you can help us! We are conducting a study to create design and technology solutions for creating natural dialogues for VUI applications.

Who are we? We are researchers from the University of British Columbia: Dr. Dongwook Yoon, Dr. Joanna McGrenere, and Yelim Kim.

Whom are we looking for? ● Native-level English speakers ● 19 years of age or older

You do not need to have any previous programming experience, nor any previous experience with a voice user interface (e.g., Google Home).

How long will it take? You will be asked to complete 1 session that takes one to one and a half hours.

What will you get? We will pay $15 for 1 hour and $23 for 1.5 hours, in cash.

What will you do? ● Two short online surveys (Pre-task and Post-task) ● Creating or interacting with Voice User Interface ● One post-task interview

With your agreement, an audio recording may be obtained, which will be transcribed by Yelim Kim for the study.

For further information, please contact Yelim Kim (E-mail: [email], Phone: [phone number]). Yelim Kim will provide all the details of the study and a consent form for the study. You can opt out of further contact by notifying Yelim Kim at any point in time.

Yelim Kim MSc Student, UBC Computer Science

Version 1.1 / July 29, 2018

A.3 The Consent Form


THE UNIVERSITY OF BRITISH COLUMBIA

Consent Form for ‘Voice User Interface Prototyping Study’
Department of Computer Science, 2366 Main Mall, Vancouver, B.C., V6T 1Z4

Principal Investigator: Dongwook Yoon, Computer Science Department, University of British Columbia. Ph: [phone number]; Email: [email].
Co-Investigators: Joanna McGrenere, Computer Science Department, University of British Columbia. Ph: [phone number]; Email: [email]. Yelim Kim, Computer Science Department, University of British Columbia. Ph: [phone number]; Email: [email].

Sponsor: This project is funded by the Natural Sciences and Engineering Research Council (NSERC).

Purpose: Voice User Interface (VUI) has been gaining high interest from the human-computer interaction research field in recent years. When designing VUIs, one of the essential tasks is to write natural dialogues between the voice assistant and users. The dialogue is a written script of an expected conversation between the voice assistant and users, which also includes prosodic information, such as the tones and pitches the voice assistant should use when it speaks. Since we have natural conversations with other people on a daily basis without knowing exactly how we do so, it is often challenging for VUI designers when they purposely try to make their dialogues more natural. Therefore, in this study, we aim to find methods to help designers create more natural VUI dialogues.

Study Procedure: After you have read this document, if you have any questions or concerns, please don’t hesitate to contact Yelim Kim through email. Yelim Kim will respond with details within one or two business days. Once you have signed this consent form, you will be asked to:

Version 1.1 / February 25, 2019 Page 1


- Fill out the online survey asking for basic demographic data (i.e., gender, age, occupation, academic background) and your previous experience of designing VUIs
- Answer interview questions about your VUI design experiences, practices, and your notion of natural VUI dialogues
- Complete a user task of creating VUI dialogue scripts with keyboard typing and voice typing
- Answer interview questions about your experience of creating dialogues with keyboard typing and voice typing

The online survey should take approximately 5–15 minutes to fill out, and the interview should take approximately one to one and a half hours, completed in 1 session.

During the interview, audio recording is required, and we may ask your permission to take pictures of the tools that you use to create VUI dialogues. For the user task, you will have the choice of using your own computer or a computer provided by the researcher. During the user task, to better understand your experience of creating VUI dialogues, we may ask your permission to take a screen recording to observe the process of creating VUI dialogues. We will share the pictures and screen recordings captured during the interview with you to make sure there is no element you do not want to share with us. Any identifiable element in the images, videos, or audio recordings will be blurred or obscured to prevent identification.

Check off the box if you agree with our conditions: [ ] For the interview, I agree that an audio recording can be made of what I have to say. [ ] For the interview, I agree that the researcher can take pictures of the tools I am using to create VUIs. [ ] For the user task, I agree that the researcher can take a screen recording to observe the process of creating VUI dialogues.

Participants are free to withdraw without giving a reason. If you withdraw from the study, all of the data obtained from you will be permanently discarded on the date of your withdrawal and will not be used for this study. All electronic files will be permanently deleted, and all physical copies of documents (e.g., handwritten notes, paper transcriptions, and the consent form) obtained from you will be shredded. Your name will also be removed from the code assignment file.

Version 1.1 / February 25, 2019 Page 2

THE UNIVERSITY OF BRITISH COLUMBIA

Project Outcomes: Although the project outcomes will be determined by the research findings, possible research products will include: journal articles, a report, and plain language summaries.

Potential Benefits: There are no explicit benefits to you from taking part in this study. However, you will be asked to describe your current practices of creating VUI dialogues, which may help you reflect on those practices and think about new approaches to improve them.

Potential Risks: Since you are asked either to write the script using voice typing or to interact with a VUI using your voice, you might feel thirsty during the interaction. We can provide water or find a water source for you. This study may take one and a half hours, and you may feel tired during the session. If you need a break at any point, you can always ask the researcher for a 5 to 15 minute break. You can also withdraw your participation in the project at any time.

Confidentiality: The survey will be conducted through a UBC survey tool provided by Qualtrics. All hard copies of documents will be identified only by code numbers that Yelim Kim will assign to each participant. No electronic file names will contain any identifiable data such as a participant's name. You will not be identified by name in the screen recordings, the VUI scripts you wrote, the survey data, or the interview transcripts. The link between the assigned codes and the actual names will be stored in the “code assignment file”. The only documents containing your real name will be the “code assignment file” and this consent form.

This file will be stored on an encrypted hard drive of a password-protected laptop owned by Yelim Kim, and no copy other than the one on Yelim's hard drive will be stored. Audio recordings will be transcribed by Yelim Kim, and any identifiable data such as a participant's name will be removed from the transcription. As mentioned above, any identifiable element in the images, videos, or audio will be blurred or obscured to prevent identification.

During the study, any handwritten notes and paper transcripts will be kept in a locked cabinet in the researchers' laboratory (Room X508 at ICICS) with controlled access at UBC, and all electronic files, including audio files and the “code assignment file”, will be stored on an encrypted hard drive of a password-protected laptop owned by Yelim Kim; no copy other than the one on Yelim's hard drive will be stored.

In any published content (i.e., reports, research papers, thesis documents, and presentations), participants will be referred to by the code Yelim Kim assigns instead of their real names. No identifiable data will be published. The audio files and screen recordings will not be used in any published content, and the transcriptions of the audio files will be modified to remove any identifying details of the participants. The pictures of the tools may be used for publication; however, any identifiable element will be blurred to prevent identification.

Remuneration/Compensation: In order to acknowledge the time you have taken to be involved in this project, each participant will receive $15 / hour.

Contact for information about the study: If you have any questions or desire further information with respect to this study, you may contact Yelim Kim (phone number; email: email).

Contact for concerns or complaints about the study: If you have any concerns or complaints about your rights as a research participant and/or your experiences while participating in this study, contact the Research Participant Complaint Line in the UBC Office of Research Ethics at 604-822-8598 or, if long distance, e-mail [email protected] or call toll-free 1-877-822-8598.

Consent: Your participation in this study is entirely voluntary and you may refuse to participate or withdraw from the study at any time. Your signature below indicates that you have received a copy of this consent form for your own records. Your signature indicates that you consent to participate in this study.

______Subject Signature Date

______Printed Name

A.4 The Pre-interview Survey


A.5 The Semi-structured Interview Script

Participant Code: ______
Interview Date: 2019 /___ /___
Interview Time: ____:____ ~ ____:____
Interview Place: ______

Recruitment checklist

1. Attach the consent form
2. Let them prepare the microphone
3. Send the pre-interview survey with the correct participant ID: send the survey 2 to 3 days earlier so that I can understand their applications better

Before starting the interview - Note to myself

1. Make sure the devices have enough battery
2. Prepare an extra laptop
3. Prepare the microphone and the voice recorder
4. Turn off the cell phones (before the interview)
5. Check that the recording devices are working
   a. Screen capturing
   b. Zoom audio
   c. Voice recorder
   d. Smartphone voice recorder

Opening Script

Introduce Study Phases:
1. Answer the interview questions regarding VUI design experience (at most 20 minutes)
2. Complete a user task of writing dialogues with voice typing and keyboard typing (at most 20 minutes)
3. Complete a post-user-task survey (at most 30 minutes)
4. Answer the interview questions regarding the user task (at most 20 minutes)

Pre-interview questionnaire

Make sure the participants finished the pre-interview questionnaire prior to the interview.

Interview questions prior to the user task

Q1. I would like to begin our interview by asking about some of your previous voice user interface projects. Please think of one or two of your most memorable projects (they don't need to be your latest projects; they should be the projects whose procedure you remember most vividly). Could you tell me what the goal of the project was, the role you took in the project, and what you designed and built over the course of the project? (This question will be skipped if the participant already answered it in the survey in detail.)

VUI project 1: (name: )

VUI project 2: (name: )

Were these applications conversational or task-oriented?
- If it was a task-oriented application, how much effort did you put into the conversational aspect?
- How many goals did it need to achieve?

“Non-task-oriented conversational systems do not have a stated goal to work towards.”

Q2. For this project, we are especially interested in the process of creating dialogue. Could you walk me through the steps you took to create the dialogues for one of the projects you mentioned earlier?

VUI project name: ______

Software or hardware tools used to create the dialogue (e.g., prototyping tool, audio recorder):

Were the same procedure and tools used for the other VUI project? (Yes / No)
- If not, what are the differences?

Was creating a natural dialogue one of your goals for this process? If so, how important was it?

Q3. How do you define “natural dialogue”? What is the expected value of creating a natural dialogue? (Dig deeper until I get concrete answers)

Subjective definition of naturalness:

Expected benefits of natural dialogues:

Q4. How do you create a dialogue to be more natural?

Strategies for creating natural dialogues (e.g., in which step does each apply, and how many people are involved?):

Criteria for determining naturalness:

Were any of the tools you mentioned earlier helpful in creating a more natural dialogue?

The biggest challenges in creating a more natural dialogue:

Q5. In any of the steps you took to create the dialogue, was there any step where you used your voice (e.g., saying things aloud during the process in order to create a natural dialogue)?

User Task for Creating Dialogues

1. Practice task for creating dialogue with the voice input
   a. For an online meeting, turn off the cameras
   b. For an offline meeting, stay away from the participant
   c. Prepare the microphone
   d. Put the microphone UI button on the left side of the document
2. Go through the instructions
3. Complete the user task with the text input
4. Complete the user task with the voice input
5. Have them listen to the two audio clips and ask them to compare the clips regarding naturalness
   a. Let them take notes if they want

SSML Task

1. Share my screen to show my Google Doc page
2. Increase the speaker volume to create a note from the speaker sound
3. Turn on the add-ons (note: this voice note takes time)
4. Ask them if they want to insert SSML to make the dialogue more expressive
5. For the same places where they used SSML, ask them if they want to use their own voice to enhance the expressivity of the dialogue

Interview questions after the user task

Q6: Could you tell me about your experience of using the two modalities to create the dialogues?

Post-user task survey

This survey will ask:
● How helpful the voice input was for each VUI design guideline
● About designers' current practice of following the design guidelines
● About the perceived helpfulness of using the voice input and text input to create a natural dialogue

Q7 (Follow-up of the survey): Have you ever heard of these guidelines before and do you think they are important?

Q8: How would you compare the two input modalities for creating dialogue? (Follow-up questions regarding the survey result)

Other differences, beyond the guidelines suggested by the survey:

Q9: Is there any situation you would use your voice to create VUI dialogues?

A helpful situation to use the voice:

Situation:

Why it is more helpful:

How should it be implemented to maximize the benefit:

In which step of the dialogue writing process does this situation occur:

Closing Script

Thank you so much for helping us! The data we obtained will be very helpful in guiding us to create a better VUI prototyping tool in the future. If you'd like to know the results of this study, I can send you a copy of the paper once we publish this work, so please let me know. Also, if you have any questions about our study after this interview, please don't hesitate to contact me through the e-mail address written in the consent form. Again, we really appreciate your participation in this study!

A.6 More Descriptions on the User Task

Each participant who carried out the user task was given the requirements for writing two VUI dialogues, shown below. The formats and contents of the requirements were created to be as close as possible to the requirements used in real practice [101]. The requirements outline the use context and key use cases of the application, the desired persona of the voice assistant, the prospective persona of target users, and a high-level schema for the VUI dialogue. While the participants were writing the VUI dialogues, the researcher observed them and took notes on their VUI writing procedures. One of the VUI dialogue scenarios was to help users lose weight by letting them reflect on their previous diet and providing suggestions for creating healthier eating plans. The other scenario was to help users develop healthier exercise routines by letting them reflect on their current exercise practices and providing suggestions for better exercise plans. Each participant had to write one of the VUI dialogues with keyboard input while writing the other using their own voice with Google Docs' voice typing functionality. The scenario for each input method was chosen uniformly at random. By asking the participants to use voice typing instead of keyboard typing, we sought to understand whether the participants would see the benefits of using voice input in writing natural VUI dialogues. Since using voice input in front of the interviewer may cause anxiety for some people, to account for the possibility that individual differences in self-consciousness play a role as a latent variable, we asked the participants to fill out the questions from the “Self-Consciousness Scale” [102] in the pre-survey. After each participant wrote the two VUI dialogues, the researcher generated the audio for the dialogues with ‘TTS reader’, a tool we created to use the Amazon Alexa voice to read written scripts with SSML tags.
We had the participants listen to the audio and asked them whether there was any part of the synthesized audio they wished to change; if so, we asked them to demonstrate the desired changes with their own voices. Finally, we asked

the participants to fill out the post-user task survey (Appendix A.7), where we asked about their experiences of using voice typing and keyboard input. In this survey, we also asked about the perceived effectiveness of existing VUI design guidelines for creating natural VUIs.
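The ‘TTS reader’ step described above can be illustrated with a small sketch. The wrapper function below is a hypothetical stand-in, not the tool's actual implementation; only the `<speak>`, `<break>`, and `<emphasis>` tags come from the SSML specification:

```python
# Minimal sketch of wrapping one dialogue line in SSML before synthesis.
# The tag names (<speak>, <break>, <emphasis>) follow the SSML spec;
# the function itself is illustrative, not the thesis tool's code.
from xml.sax.saxutils import escape

def to_ssml(line, emphasize=None, pause_ms=0):
    """Wrap a dialogue line in an SSML document.

    emphasize: optional substring to mark with <emphasis>.
    pause_ms:  optional leading pause in milliseconds.
    """
    text = escape(line)
    if emphasize:
        marked = '<emphasis level="strong">%s</emphasis>' % escape(emphasize)
        text = text.replace(escape(emphasize), marked)
    pause = '<break time="%dms"/>' % pause_ms if pause_ms else ""
    return "<speak>%s%s</speak>" % (pause, text)

# Example from the survey's emphasis guideline:
ssml = to_ssml("I already told you I really love it.",
               emphasize="really", pause_ms=300)
```

A document like this could then be sent to any SSML-capable synthesizer (e.g., the Alexa voice used by the TTS reader).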

Application Name: Dr. Nutrio (Nutrition counsellor)

User persona: Emily, 35, recently gained weight. Her goal is to create a healthier eating plan for the next week to help her lose 5 lbs. She regularly eats healthy meals, but often feels hungry late at night. She's easily tempted by hunger, and her late-night snacks of choice are pizza and chips. She doesn't understand people who think raw vegetables are snacks.

User Context: Every Friday night, Emily uses Dr. Nutrio to create her diet plan for the following week.

Key use cases:
1. Helps users reflect on their previous diet
2. Gives suggestions for creating a healthier eating plan and provides tips for following the plan

High-level schema 1:
1. Dr. Nutrio and Emily engage in small talk (asking about last weekend)
2. Dr. Nutrio and Emily discuss Emily's current eating habits
3. Emily reports her goal to Dr. Nutrio (e.g., losing 5 lbs)
4. Dr. Nutrio suggests possible solutions to her
5. Dr. Nutrio and Emily negotiate the plan for the next week
6. Farewell

Application Name: Dr. Fitna (Exercise counsellor)

User persona: Nick, 31, recently gained weight. His goal is to create an effective exercise plan for the next week to help him lose 5 lbs. Last week, he intended to wake up early to run in the morning, but as a “night owl”, Nick failed to wake up in time. As a runner, Nick is currently a beginner, running 2 km per day. He doesn't believe that he can increase his daily running distance in a short amount of time.

User Context: Every Friday night, Nick uses Dr. Fitna to create his exercise plan for the following week.

Key use cases:
1. Helps users reflect on their current exercise practices
2. Gives suggestions for creating a more effective exercise plan and provides tips for following the plan

System persona: Dr. Fitna, a cheerful and supportive exercise counsellor, believes that the best way to help users is by establishing a comfortable environment for sharing concerns and problem-solving. Therefore, she prefers to negotiate a plan with the users instead of telling them what to do.

High-level schema 1:
1. Dr. Fitna and Nick engage in small talk (asking about last weekend)
2. Dr. Fitna and Nick discuss Nick's current exercise practice
3. Nick reports his objective to Dr. Fitna (e.g., losing 5 lbs)
4. Dr. Fitna suggests possible solutions to Nick
5. Dr. Fitna and Nick negotiate the plan for the next week
6. Farewell

A.7 The Post User-task Survey


Reflection on User Task

Please indicate your level of agreement on the following statements.

During the user task, voice typing was helpful in making the dialogue sound like a natural spoken conversation.

Strongly agree / Agree / Neither agree nor disagree / Disagree / Strongly disagree

During the user task, keyboard typing was helpful in making the dialogue sound like a natural spoken conversation.

Strongly agree / Agree / Neither agree nor disagree / Disagree / Strongly disagree

During the user task, using my voice to create the dialogue made me feel self-conscious.

Strongly agree / Agree / Neither agree nor disagree / Disagree / Strongly disagree

Design Guidelines and Voice Typing/Note

Please indicate your level of agreement on the following statements.

During the user task, voice typing helped me use less formal words.

Example: Instead of using 'extract' or 'eliminate', use 'remove'.

Strongly agree / Agree / Neither agree nor disagree / Disagree / Strongly disagree

During the user task, voice typing helped me use common words instead of overly technical words or jargon.

Example: Instead of saying “Your database transaction has failed due to the storage limit”, say “We can’t store the information, because you ran out of storage.”

Strongly agree / Agree / Neither agree nor disagree / Disagree / Strongly disagree

During the user task, voice typing helped me use contractions (e.g., I’m, she’d, shouldn’t) to shorten words.

Example: Change ‘You would like to’ into ‘You’d like to’

Strongly agree / Agree / Neither agree nor disagree / Disagree / Strongly disagree

During the user task, voice typing helped me create simple sentence structures.

Example: Instead of saying “I like the house that you bought, which has a green roof”, say “I like your house with the green roof”

Strongly agree / Agree / Neither agree nor disagree / Disagree / Strongly disagree

During the user task, voice note helped me create an expressive voice that conveys nuances by adding vocal variety, breaks, and non-verbal sounds.

Example: Nuances can be conveyed in a voice by manipulating intonation, pitch, or volume. Breaks or non-verbal sounds (e.g., breathing) can also be added. This manipulation can be done through either appropriate SSML tags or better speech synthesis engines.

Using a SSML tag to emphasize: “I already told you I really love it.”

Strongly agree / Agree / Neither agree nor disagree / Disagree / Strongly disagree

During the user task, voice typing helped me provide only relevant information, from the users' point of view, at each conversation turn.

Example: Let’s say a user wants to buy a ticket for a sports event. When booking a ticket for a sports event, the seating area (e.g., VIP seats, section A) is very important. Hence, don’t burden the user’s short-term memory by listing all of the information about seats and prices. First, introduce the available areas. After the user chooses a section, introduce the prices of the seats.

Strongly agree / Agree / Neither agree nor disagree / Disagree / Strongly disagree

During the user task, voice typing helped me use various ways of presenting the same questions and various ways of giving the same responses.

Example: A variation of “What movie do you wanna see?” might be “What movie would you like to watch?”

Strongly agree / Agree / Neither agree nor disagree / Disagree / Strongly disagree

Your past design practice

Please indicate your level of agreement on the following statements.

When I was designing my VUI applications, I successfully followed the guideline of using less formal words.

Example: Instead of using ‘extract’ or ‘eliminate’, use ‘remove’.

Strongly agree / Agree / Neither agree nor disagree / Disagree / Strongly disagree

You just selected either "Disagree" or "Strongly disagree." Was this because it was hard to follow the guideline or was this because the guideline was not important to you?

Because it was hard to follow it Because it was not important to me

Neither of them, it was because (e.g., often forgot to follow it):

When I was designing my VUI applications, I successfully followed the guideline of using common words instead of overly technical words or jargon.

Example: Instead of saying “Your database transaction has failed due to the storage limit”, say “We can’t store the information, because you ran out of storage.”

Strongly agree / Agree / Neither agree nor disagree / Disagree / Strongly disagree

You just selected "Disagree" or "Strongly disagree." Was this because it was hard to follow the guideline or was this because the guideline was not important to you?

Because it was hard to follow it Because it was not important to me

Neither of them, it was because (e.g., often forgot to follow it):

When I was designing my VUI applications, I successfully followed the guideline of using contractions (e.g., I’m, she’d, shouldn’t) to shorten words.

Example: Change ‘You would like to’ into ‘You’d like to’

Strongly agree / Agree / Neither agree nor disagree / Disagree / Strongly disagree

You just selected "Disagree" or "Strongly disagree." Was this because it was hard to follow the guideline or was this because the guideline was not important to you?

Because it was hard to follow it Because it was not important to me

Neither of them, it was because (e.g., often forgot to follow it):

When I was designing my VUI applications, I successfully followed the guideline of using simple sentence structures.

Example: Instead of saying “I like the house which you bought that has a green roof”, say “I like your house with a green roof”

Strongly agree / Agree / Neither agree nor disagree / Disagree / Strongly disagree

You just selected "Disagree" or "Strongly disagree." Was this because it was hard to follow the guideline or was this because the guideline was not important to you?

Because it was hard to follow it Because it was not important to me

Neither of them, it was because (e.g., often forgot to follow it):

When I was designing my VUI applications, I successfully followed the guideline of creating an expressive voice that conveys nuances by manipulating the audio sound.

Example: A voice can be made expressive by manipulating its intonation, pitch, or volume, and you can also add breaks or non-verbal sounds (e.g., breathing). This manipulation can be done through either appropriate SSML tags or better speech synthesis engines.

Using a SSML tag to emphasize: “I already told you I really love it.”

Strongly agree / Agree / Neither agree nor disagree / Disagree / Strongly disagree

You just selected "Disagree" or "Strongly disagree." Was this because it was hard to follow the guideline or was this because the guideline was not important to you?

Because it was hard to follow it Because it was not important to me

Neither of them, it was because (e.g., often forgot to follow it):

When I was designing my VUI applications, I successfully followed the guideline of providing only relevant information, from the user’s point of view, at each conversation turn.

Example: Let’s say a user wants to buy a ticket for a sporting event. When searching for tickets, the date, time, and location of the event are the most optimally relevant information at the initial turn of the conversation. Information about seats and prices are not optimally relevant at this turn of the conversation. Introducing all this information at the initial turn of the conversation would likely serve as a burden to the user, rather than be helpful.

Strongly agree / Agree / Neither agree nor disagree / Disagree / Strongly disagree

You just selected "Disagree" or "Strongly disagree." Was this because it was hard to follow the guideline or was this because the guideline was not important to you?

Because it was hard to follow it Because it was not important to me

Neither of them, it was because (e.g., often forgot to follow it):

When I was designing my VUI applications, I successfully followed the guideline of using various ways of presenting the same questions and various ways of giving the same responses.

Example: A variation of “What movie do you wanna see?” might be “What movie would you like to watch?”

Strongly agree / Agree / Neither agree nor disagree / Disagree / Strongly disagree

You just selected "Disagree" or "Strongly disagree." Was this because it was hard to follow the guideline or was this because the guideline was not important to you?

Because it was hard to follow it Because it was not important to me

Neither of them, it was because (e.g., often forgot to follow it):

A.8 The Data Analysis Process

I used ‘Airtable’ (https://airtable.com/), an online collaborative spreadsheet application, to organize the codes.
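As a minimal illustration of this organization step, the per-factor participant columns in such a spreadsheet can be tallied to find how often each factor was mentioned and how many codes each participant contributed. The factor names and participant lists below are made-up stand-ins, not the real data:

```python
# Illustrative sketch (not the actual analysis script): tallying
# participant mentions per coded factor, as the Airtable participant
# columns do. The dictionary contents are fabricated examples.
from collections import Counter

codes = {
    "Synthesized voice is not expressive": ["p8", "p5", "p20", "p4"],
    "SSML is too low-level": ["p3", "p5", "p16", "p14"],
    "No rapid prototyping tools": ["p5"],
}

# How many participants mentioned each factor.
mentions_per_factor = {f: len(ps) for f, ps in codes.items()}

# Overall number of mention cells (the grid's SUM row).
total_mentions = sum(mentions_per_factor.values())

# How many factors each participant contributed to.
mentions_per_participant = Counter(p for ps in codes.values() for p in ps)
```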

104 Grid view

# Factor Categories Factors Factor elements

1 Technologies/Frameworks Limitations of the current technology Speech Synthesis

2 Design Process Design Constraints Design Goals

3 Technologies/Frameworks Limitations of the current technology Speech Synthesis

4 Technologies/Frameworks Limitations of the current technology Speech Synthesis

5 Technologies/Frameworks Limitations of the current technology Artificial Intelligence

6 Design Process Design mistakes to be avoided Dialogue Presentation

7 Technologies/Frameworks Lack of technical / framework supp… Artificial Intelligence

8 Design Process Design Constraints Design procedure

9 Technologies/Frameworks Limitations of the current technology Speech Synthesis

10 Design Process Design mistakes to be avoided Dialogue Style Design

11 Technologies/Frameworks Lack of technical / framework supp… Speech Synthesis

12 Technologies/Frameworks Limitations of the current technology Speech Synthesis

13 Technologies/Frameworks Lack of technical / framework supp… Design guideline

14 Design Process Design Constraints Design Environment

15 Technologies/Frameworks Limitations of the current technology Speech Synthesis

16 Technologies/Frameworks Limitations of the current technology Artificial Intelligence

17 Technologies/Frameworks Lack of technical / framework supp… Artificial Intelligence

18 Technologies/Frameworks Lack of technical / framework supp… Design guideline

19 Technologies/Frameworks Lack of technical / framework supp… Design guideline

20 Technologies/Frameworks Lack of technical / framework supp… Design Tools

21 Design Process Design mistakes to be avoided Interaction Design

22 Design Process Design mistakes to be avoided Interaction Design

23 Design Process Design Constraints Design Goals

24 Technologies/Frameworks Limitations of the current technology Speech Synthesis

25 Technologies/Frameworks Limitations of the current technology Speech Synthesis

26 Design Process Design Constraints Design Domain

27 Technologies/Frameworks Limitations of the current technology Artificial Intelligence

28 Technologies/Frameworks Limitations of the current technology Artificial Intelligence

29 Technologies/Frameworks Lack of technical / framework supp… Design Tools

30 Technologies/Frameworks Lack of technical / framework supp… Design Tools

31 Design Process Design mistakes to be avoided Dialogue Presentation

32 Technologies/Frameworks Limitations of the current technology Speech Synthesis

33 Technologies/Frameworks Limitations of the current technology Speech Synthesis

34 Technologies/Frameworks Limitations of the current technology Speech Synthesis

35 Technologies/Frameworks Limitations of the current technology Speech Synthesis

36 Technologies/Frameworks c… Limitations of the current technology Speech Synthesis

37 Design Porcess Grid view

# Sub-element Partic…

1 Synthesized Voice

2 Being efficient

3 Synthesized Voice

4 SSML

5 Natural Language Understanding Engine

6 Spoken Language

7 Natural Language Understanding Engine

8 Pre-scripted nature

9 Synthesized Voice

10 Spoken Language

11 Evaluation Method

12 SSML

13 Specific guideline

14 Access to end-users

15 Synthesized Voice

16 Natural Language Understanding Engine

17 Sentiment Analysis

18 Personalization

19 Linguistics

20 Analyzing User Experience

21 Rigidness

22 Rigidness

23 Lowering development cost

24 Synthesized Voice

25 Synthesized Voice

26 No invalid input

27 Processing time

28 Natural Language Understanding Engine

29 Dialogue Writing

30 Prototoyping

31 Rigidness

32 Synthesized Voice

33 SSML

34 SSML

35 SSML

36 SSML

37

SUM 110 Grid view

# Participant

1 p8 p5 p20 p4 p18 p7 p10 p1 p13

2 p15 p11 p10 p5 p20 p17 p1

3 p9 p5 p4 p3 p2 p12 p1

4 p13 p7 p9 p3 p16 p15 p1

5 p2 p9 p6 p11 p1

6 p3 p2 p17 p16 p12 p10

7 p5 p8 p7 p16 p13

8 p18 p8 p6 p2 p3

9 p12 p8 p7 p4 p14

10 p9 p7 p18 p13

11 p14 p11 p10 p6

12 p3 p5 p16 p14

13 p6 p2 p5

14 p5 p3 p2

15 p9 p5 p4

16 p16 p2

17 p11 p18

18 p16 p13

19 p20 p18

20 p6 p12

21 p12 p10

22 p6 p9 p1 p3

23 p20 p6

24 p18 p8

25 p7 p11

26 p16 p13

27 p4

28 p5

29 p6

30 p5

31 p9

32 p4

33 p10

34 p5

35 p10

36 p13

37 Grid view

# Factor Description

1 It is not expressive enough to display emotions

2 Designing a VUI involves other goals than creating a natural VUI that are often conflicting

3 It doesn't have a proper intonation

4 The time spent for implementing SSML is not justified (Too many iterations required)

5 NLU engine still can't understand various ways of expressing the same meaning

6 Designers are often creating dialogues to be sound like the written language

7 Methods to understand personalized meanings or interaction are missing (Different generation, Individual differences)

8 Designers script the dialogue before when itʼs spoken

9 It doesn't pronounce words correctly

10 Designers are often overloading the information

11 There is no evaluation method other than the instinct-based one

12 There is no clear way to utilize it (because it provides too low-level modifications)

13 More specific design guidelines are missing

14 Designers often canʼt have direct contacts with their end-users

15 It doesn't handle punctuation properly

16 NLU engine doesn't understand the meaning of the words based on the context

17 Methods to collect real-time user sentiments are missing

18 There is no guidelines considering individual differences (Culture, Generation differences)

19 Design guidelines tied to linguistic knowledge are missing

20 Tools for analyzing user experiences are missing

21 Designers are often making users to go through unnecessary conversation turns

22 VUI often repeat the same information or stop the process when it doesn't know how to handle the current situation

23 Designing a VUI involves other goals than creating a natural VUI that are often conflicting

24 Multi-langauge support is still very bad (e.g., German pronunciation sounds awkward)

25 It doesn't have the appropriate speed for certain groups (kids, elders)

26 There is no invalid input for VUIs and this is even sever problem for kids and elders

27 The processing time takes too long

28 It doesn't understand slangs

29 Designers should manually vary words not to be repetitive

30 Tools for rapid prototyping are missing

31 Designers are often forcing people to use special keywords

32 It doesn't put a break properly

33 Prosody tags are not working well

34 The transition between the words are not smooth

35 Different packages for SSML have different effects

36 Feeling is not a subjective matter

37 Grid view

# Need …

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31 yes

32

33

34

35

36

37 Grid view

# Project Main Purpose

1 Entertainment Education Programming learning Game To make more efficient work process Information Querying To help daily life For family bonding

2 Entertainment To help daily life To make more efficient work process Information Querying Meditation

3 Entertainment Healthcare To make more efficient work process Customer Service Health Check-ups To help daily life

4 Entertainment Meditation Information Querying Customer Service To help daily life For family bonding Education Programming learning Game

5 Entertainment To help daily life Information Querying Customer Service

6 To make more efficient work process Information Querying Healthcare Entertainment Customer Service

7 Entertainment Education Programming learning Game Information Querying To help daily life For family bonding Companion support

8 Customer Service To help daily life Companion support For family bonding Education

9 Entertainment Game Healthcare Health Check-ups To help daily life For family bonding Companion support To make more efficient work process

10 Entertainment Education Programming learning Game To help daily life For family bonding

11 To help daily life To make more efficient work process Information Querying Entertainment Game

12 Entertainment Game Information Querying To help daily life Customer Service

13 To help daily life Customer Service

14 Customer Service To help daily life

15 Healthcare Health Check-ups To help daily life

16 Customer Service Information Querying

17 Education To help daily life Information Querying

18 Entertainment Education Programming learning Game Information Querying

19 Education To help daily life

20 Healthcare To make more efficient work process To help daily life

21 To make more efficient work process Information Querying Healthcare

22 Customer Service Entertainment

23 To help daily life

24 Companion support For family bonding To help daily life Education

25 To help daily life Information Querying For family bonding

26 Entertainment Education Programming learning Game Information Querying

27 Healthcare Health Check-ups

28 To help daily life

29 To help daily life

30 To help daily life

31 To help daily life

32 Healthcare Health Check-ups

33 To make more efficient work process Information Querying

34 To help daily life

35 To make more efficient work process Information Querying

36 Entertainment Education Programming learning Game


# Affected Goals - Creating natural VUIs / Affected Goals - other than for creating natural VUIs

1 Assistant General People

2 General People General People

3 Conversationalist

4 Conversationalist

5 General People

6 General People

7 Conversationalist

8 Conversationalist

9 Conversationalist

10 Assistant

11 Effective development Time consuming No clear development direction

12 Conversationalist

13 Effective development No clear development direction

14 Conversationalist

15 Conversationalist

16 Conversationalist

17 Assistant

18 Conversationalist

19 Conversationalist

20 Effective development No clear development direction

21 Assistant

22 Conversationalist

23 Effective development Time consuming

24 Conversationalist

25 Conversationalist

26 Conversationalist

27 Conversationalist

28 General People

29 General People

30 Effective development Time consuming

31 General People

32 General People

33 Conversationalist

34 Conversationalist

35 Effective development Time consuming

36 General People


# Comments for now / Participant-SSML familiarity

1 p8 p5 p20 p4 p18 p7 p10 p1 p13

2 p15 p11 p10 p5 p20 p17 p1

3 p9 p5 p4 p3 p2 p12 p1

4 Other related goals p13 p7 p9 p3 p16 p15 p1

5 p20 p2 p9 p6 p11 p1

6 p3 p2 p17 p16 p12 p10

7 p5 p8 p7 p16 p13

8 Need to work again p18 p8 p6 p2 p3

9 p12 p8 p7 p4 p14

10 Need to work again p9 p7 p18 p13

11 p14 p11 p10 p6

12 p3 p5 p16 p14

13 p6 p2 p5

14 Need to work again p5 p3 p2

15 p9 p5 p4

16 p16 p2

17 p11 p18

18 p16 p13

19 p20 p18

20 p6 p12

21 Defined p12 p10

22 p1 p3

23 p20 p6

24 N/A p18 p8

25 p7 p11

26 p16 p13

27 p4

28 p5

29 p6

30 p5

31 Defined p9

32 p4

33 p10

34 p5

35 Other related goals p10

36 p13


# Participant-Project

1 p8 p5 p20 p4 p18 p7 p10 p1 p13

2 p15 p11 p10 p5 p20 p17 p1

3 p9 p5 p4 p3 p2 p12 p1

4 p13 p7 p9 p3 p16 p15 p1

5 p20 p2 p9 p6 p11 p1

6 p3 p2 p17 p16 p12 p10

7 p5 p8 p7 p16 p13

8 p18 p8 p6 p2 p3

9 p12 p8 p7 p4 p14

10 p9 p7 p18 p13

11 p14 p11 p10 p6

12 p3 p5 p16 p14

13 p6 p2 p5

14 p5 p3 p2

15 p9 p5 p4

16 p16 p2

17 p11 p18

18 p16 p13

19 p20 p18

20 p6 p12

21 p12 p10

22 p1 p3

23 p20 p6

24 p18 p8

25 p7 p11

26 p16 p13

27 p4

28 p5

29 p6

30 p5

31 p9

32 p4

33 p10

34 p5

35 p10

36 p13

37 For visualizing the themes, I used Miro (https://miro.com/), as shown below. A more detailed version of the diagram can be found at shorturl.at/afLMQ. Please note that the diagram below reflects the themes from an earlier phase of our study.
