MASTER THESIS

Adding Speech to Dialogues with a Council of Coaches

Laura Bosdriesz S1446673

MSC INTERACTION TECHNOLOGY
Faculty of Electrical Engineering, Mathematics and Computer Science

EXAMINATION COMMITTEE
Dr. Ir. D. Reidsma (University of Twente, the Netherlands)
D.P. Davison MSc. (University of Twente, the Netherlands)
Prof. dr. D.K.J. Heylen (University of Twente, the Netherlands)
Dr. Ir. W. Eggink (University of Twente, the Netherlands)
Dr. Ir. H. op den Akker (Roessingh Research and Development, the Netherlands)

13 December 2020

Abstract

With the ageing of the population, more diseases arise, putting pressure on the healthcare system. This requires a shift from treatment towards prevention of age-related diseases by stimulating the ageing generation to take care of their own health. The Council of Coaches is a team of virtual coaches that attempts to help older adults achieve health goals by offering insights and advice based on their expertise. Currently, the user interacts with the coaches by selecting one of several predefined multiple-choice options. Although this is a robust method to capture user input, it is not ideal for older adults. Spoken dialogue might offer a better user experience, but it also comes with many complexities. The goal of this study is to adapt the COUCH system to support spoken interactions in order to answer the main question: To what extent can spoken interaction offer a valuable addition to the multi-party virtual Council of Coaches application?

User experiments were performed with the original text-based application and the newly developed speech-based application to research two different fields of interest: (1) the difference in experience between the two systems, and (2) the robustness of the speech implementation. In a controlled setting, 28 participants used both versions (i.e. a within-subjects design) for a limited amount of time, during which the number of system errors was counted. Participants rated their experiences with both systems via questionnaires and open questions. These data were then analyzed to find differences between the two versions. During a one-week field study, the speech-based application was tested with 4 participants, who completed an interview at the end. These results were used to gain insights into the robustness of the application in a home setting.

Analysis of the collected data showed that the addition of speech led to a significant increase in some UEQ ratings (the novelty and stimulation scales). Additionally, the speech version received significantly higher scores on several other items when both systems were compared explicitly. The field study revealed large fluctuations in user experiences, depending on the robustness of the application. In situations where the application worked properly, it was perceived relatively well. However, in situations where it worked insufficiently, the application was perceived as cumbersome to use. Most but not all participants mentioned substantial problems with the speech recognition and the responsiveness of the system. Results from both experiments indicated the slow response speed of the application to be the main bottleneck of the experience, causing a feeling of miscommunication between human and machine.

Acknowledgement

This thesis marks the end of my time as a student at the University of Twente. I would like to take this opportunity to thank everyone who supported me during this master thesis and my master studies in general. First of all, I wish to thank Dennis Reidsma and Daniël Davison, my academic supervisors from the University of Twente, who supervised me from the start of my graduation process. Your expertise in diverse fields allowed me to greatly improve the quality of my work. In particular I would like to thank Daniël for his patience with the application development, with which I had a difficult start. I also want to express my gratitude towards Dirk Heylen and Wouter Eggink, who agreed to join my examination committee. Last but not least I want to thank Harm op den Akker, my company supervisor, who offered me the possibility to carry out this assignment at Roessingh Research and Development, even though it was a busy and difficult time due to Corona. Unfortunately I never had the chance to work on location and meet the entire RRD team, but I was very glad to have our weekly online meetings. Moreover, you provided me with extremely helpful feedback on my final report, especially on the structure and readability. During my project, I met (online) Dennis Hofs and Marian Hurmuz from Roessingh Research and Development, whom I would like to thank for sharing their knowledge about the Council of Coaches and answering my questions.

Furthermore, I wish to express my gratitude towards all participants of the experiments, who took the effort to participate completely voluntarily during a time when there was no need to be at the university.

Lastly, I would like to thank my family and friends for their support during this process. In particular, I want to thank Martijn and Birte, who were always willing to brainstorm and help with my project. Even though they are no experts on the subject, I could always approach them to hear their (outsider) view on the topic and discuss my ideas and developments during these isolated times. Their valuable opinions and feedback helped me to get to this final result.

Contents

1 Introduction
1.1 Council of Coaches
1.2 Why Speech?
1.3 Problem Statement
1.4 Approach
1.5 Document Structure

2 Theory
2.1 An Introduction in Conversational Interfaces
2.2 Conversation Mechanisms
2.3 The Technologies in Conversational Interfaces
2.3.1 Automatic Speech Recognition
2.3.2 Spoken Language Understanding
2.3.3 Dialogue Management
2.3.4 Response Generation
2.3.5 Text-to-Speech Synthesis
2.4 Limitations of Conversational Interfaces
2.4.1 Conversation Mechanisms
2.4.2 Naturalness of Speech
2.4.3 Speech Input Variations
2.4.4 Speech Synthesis for Older Adults
2.4.5 Expectations
2.4.6 Long-term Engagement
2.4.7 Privacy Issues

3 Related Work
3.1 In-home Social Support Agent
3.2 Kristina
3.3 Meditation Coach
3.4 Exercise Advisor
3.5 Implications for the Current Research

4 System design
4.1 The Council of Coaches Platform
4.1.1 The WOOL Dialogue Platform
4.1.2 The Coaches
4.1.3 The Council of Coaches Interface
4.1.4 The Dialogue Structure and Coaching Content
4.2 System Architecture
4.2.1 Automatic Speech Recognition
4.2.2 Dialogue Management
4.2.3 Spoken Language Understanding
4.2.4 Response Generation
4.2.5 Text-to-Speech Synthesis
4.3 Speech Synthesis Markup Language
4.4 Dialogue Design Strategies
4.5 Additional Features
4.6 Removed and Ignored Features

5 Methodology of Evaluation
5.1 Ethical permission
5.2 Controlled Experiment
5.2.1 Experimental Design
5.2.2 Hypothesis
5.2.3 Experimental setup
5.2.4 Measures
5.2.5 Participants
5.2.6 Procedure
5.2.7 Pilot Test
5.3 Field Experiment
5.3.1 Experimental Design
5.3.2 Experimental Setup
5.3.3 Interviews
5.3.4 Participants
5.3.5 Procedure

6 Results
6.1 Controlled Experiment
6.1.1 Observational Measures
6.1.2 User Experience
6.1.3 Explicit Comparison
6.1.4 Open Questions
6.2 Field Experiment
6.2.1 General Impressions
6.2.2 Practical Problems
6.2.3 Way of Interaction
6.2.4 Use and Recommendation of the Application
6.2.5 Suggestions for Improvements
6.2.6 Preference Version

7 Discussion
7.1 Discussion of the Sub-questions
7.2 Research Question 1
7.2.1 Answering Research Question 1
7.2.2 Discussion of the Results
7.2.3 Limitations of the Controlled Experiment
7.3 Research Question 2
7.3.1 Answering Research Question 2
7.3.2 Discussion of the Results
7.3.3 Limitations
7.4 Main Question
7.4.1 Answering the Main Question
7.4.2 Discussion of the Main Question
7.4.3 Comparison with Prior Research
7.5 Recommendations
7.5.1 Future Work
7.5.2 Application Improvements

8 Conclusion

References

A Questionnaires and interviews
A.1 Controlled Experiment
A.1.1 Intake Questionnaire
A.1.2 Observational Measurements
A.1.3 User experience
A.1.4 Explicit Comparison
A.2 Field Experiment
A.2.1 Intake Questionnaire
A.2.2 Interview questions

B Forms and information
B.1 Controlled experiment
B.1.1 Information brochure
B.1.2 Informed consent
B.1.3 Coaches Sheet
B.2 Field Experiment
B.2.1 Invitation letter
B.2.2 Information brochure
B.2.3 Informed consent
B.2.4 Journal

C Statistical Results
C.1 UEQ: Shapiro-Wilk test for normality check - text
C.2 UEQ: Shapiro-Wilk test for normality check - speech
C.3 UEQ: Paired samples t-test statistics for the comparison per scale
C.4 UEQ: Paired samples t-test results for the comparison per scale
C.5 Explicit Comparison: One-sample t-test statistics

List of Figures

1.1 The Council of Coaches living room
2.1 The components of a spoken language conversational interface
3.1 In-home social support agent (Wizard of Oz)
3.2 The KRISTINA prototypes
3.3 The Meditation Agent
3.4 The FitTrack interfaces
4.1 The WOOL editor
4.2 The Council of Coaches scripted basic reply
4.3 The Council of Coaches scripted autoforward reply
4.4 The Council of Coaches scripted input reply
4.5 Council of Coaches Main Menu screen
4.6 First screen of the account creation process
4.7 The hierarchy of coaching topics
4.8 Visualization of the speech-based COUCH architecture
4.9 The basic reply sequence diagram
4.10 The autoforward reply sequence diagram
4.11 An example of the addition of a generic action-Statement to a reply option
4.12 An example of keyword tags, added to different answer options
4.13 An example of keyword tags for positively and negatively phrased answers
4.14 The start and stop buttons
4.15 The record button
4.16 The megaphone button
5.1 The experimental setup of the controlled experiment
6.1 Mean values per item ranked for the text-version
6.2 Mean values per item ranked for the speech-version
6.3 Visualization of the mean and variance of the UEQ scales
6.4 Boxplot of the explicit comparison results
7.1 Automatically generating keywords example

List of Tables

4.1 The seven coaches from the Council
4.2 Descriptions of the coaching topics
5.1 Demographics of the participants
6.1 The number of different error types and relative percentage
6.2 The descriptive statistics of the observational measurements
6.3 UEQ scale mean and variance
6.4 Cronbach's alpha coefficient
6.5 Results of the t-test performed for the explicit comparison
6.6 Reasons that were mentioned in favor of the text-version
6.7 Reasons that were mentioned in favor of the speech-version
6.8 Reasons that were mentioned for ease of use text-version
6.9 Reasons that were mentioned for ease of use speech-version
6.10 Reasons that were mentioned for most fun to use text-version
6.11 Advantages experienced in the text-version
6.12 Advantages experienced in the speech-version
6.13 Disadvantages experienced in the text-version
6.14 Disadvantages experienced in the speech-version

Abbreviations

ASR    Automatic Speech Recognition
COUCH  Council of Coaches
DM     Dialogue Manager
NLU    Natural Language Understanding
RG     Response Generation
RRD    Roessingh Research and Development
SDS    Spoken Dialogue System
SSML   Speech Synthesis Markup Language
SLU    Spoken Language Understanding
TTS    Text-to-Speech Synthesis
UEQ    User Experience Questionnaire
VPA    Virtual Personal Assistant
VUI    Voice User Interface
WER    Word Error Rate

Chapter 1

Introduction

This thesis is a continuation of the preliminary literature research 'Opportunities and Challenges for Adding Speech to Dialogues with a Council of Coaches' [1], which investigated opportunities and challenges for adding speech to the Council of Coaches application. This preliminary research focused on the application's current limitations, solutions to overcome them, problems associated with speech implementation and pitfalls for speech recognizers. Since this thesis is a continuation of the work described in the literature research [1], it partly reuses Chapters 1 (introduction), 2 (theory) and 3 (related work). Information that was not considered relevant for this thesis has been removed from these chapters. Apart from the main question, the problem statement (including sub-questions and research questions) has changed, as well as the approach and document structure.

Population aging is a phenomenon that has been evident for several decades in Europe. The population of people older than 65 years is expected to increase from 101 million in 2018 to 149 million by 2050. This increase is even larger in the older population aged 75-84 years (60.5%) compared to the population aged 65-74 years (17.6%) [2]. Since aging increases the risk of age-related diseases and decline in function, many of these additional years will be lived with chronic diseases [3,4]. Additionally, older adults suffer from functional deterioration, revealed in decreased mobility and vision and hearing loss [5]. Population aging and its related health issues are likely to have a considerable impact on healthcare. This requires a shift from treatment towards prevention of age-related diseases by enabling the aging generation to stay independent longer and stimulating them to take care of their own health and condition. Research showed that innovative solutions in the area of electronic health (e-health) can be useful in personalizing the care provided [6–8].

With the advancements in digital healthcare, health coaching can be provided by virtual coaches. Virtual coaches can take the form of computer characters, running on web-based platforms or applications. Some examples of virtual health coaching systems are presented in Chapter 3. Personalized coaching uses strategies tailored to the user's personal characteristics, such as perceived barriers, personal goals and health status. Coaching interventions that are aimed at sending reminders, tracking goals, or providing feedback are designed for one individual [9]. Given that adequate coaching for older adults is important to reduce the pressure on the healthcare system, and that providing it solely through human coaches is neither feasible nor scalable to the required level, e-health technologies using virtual coaches provide a good infrastructure for personalizing and tailoring the intervention.

Examples from literature show that personalized and virtual coaching in healthcare can be done by providing more health-related information to older adults and using virtual personal coaching, counseling and lifestyle advice to persuade and motivate them to change their health behaviors [10]. This has been investigated for some time already, especially to support patients with chronic conditions [11, 12]. For example, Klaassen et al. [13] developed a serious gaming and digital coaching platform supporting diabetes patients and their healthcare professionals. Although such single-coach systems have already shown a positive effect, a better performance in health coaching is expected to be achieved through a multi-agent virtual coaching system [10,14]. For this reason, the "Council of Coaches" (COUCH) project revolved around the concept of multi-party virtual coaching.

1.1 Council of Coaches

Council of Coaches1 is a European Horizon 2020 project developed by Roessingh Research and Development (RRD) to provide multi-party virtual coaching for older adults. The project aimed to improve older adults' physical, cognitive, mental and social health and to encourage them to live healthily and independently with help from a council of coaches [15]. The council consists of a number of coaches, each specialized in their own specific domain. They interact with each other, and also listen to the user, ask questions, inform, jointly set personal goals and inspire the users to take control over their health and well-being.

One of the objectives of the project was to develop the coaches as interesting characters. This character design is reflected mainly in providing every coach with its own background story and related personality. Any combination of specialized council members collaboratively covers a wide spectrum of lifestyle interventions, with the main focus on age-related impairments, chronic pain, and Diabetes Type 2. The project includes seven coaches and a robot assistant, who leads the interaction between the user and the system. All coaches and the robot assistant have their own place within the Council of Coaches living room based on their expertise (see Figure 1.1). Users are provided with an interface using buttons with scripted responses to interact with the coaches. Although this is a reliable way to capture input from the user, it is not ideal for older adults, because they generally experience more difficulties reading and have less computer experience. This research attempts to discover whether spoken dialogues within the Council of Coaches can offer a better user experience.

Figure 1.1: The Council of Coaches living room user interface. From left to right: peer support, physical activity coach, social coach, diabetes coach, cognitive coach, chronic pain coach, robot assistant and nutrition coach. Figure reproduced from [16].

1https://council-of-coaches.eu/

1.2 Why Speech?

As described in the previous section, COUCH is a text-based application in which the user interacts with the coaches by choosing from multiple written options. Other input options for such an e-health application could be free text or speech. Free text differs from the restricted approach of COUCH in the sense that users have the opportunity to type anything they want during an interaction. The benefit of the approach COUCH takes is that the coaches are certain about what they are responding to, which is more difficult when dealing with free text or speech. There are some specific features that distinguish the language of conversation from the language of written text, causing some complex issues (discussed in more detail in Chapter 2). Nevertheless, speech also brings many additional advantages, especially for older adults. Potential benefits can be found in the level of engagement, the maintenance of long-term relationships, alleviating the loneliness problem, and overcoming physical barriers of the aging process.

Speech can contribute to one of the major challenges in e-coaching, which is to keep the user engaged for a longer period of time [17]. Turunen et al. [17] argue that without long-lasting engagement, the health coach cannot have any further impact on behavior change. They also found that building long-term relationships between the user and the system can benefit the level of engagement. Finally, Turunen et al. [17] obtained positive results for building a social and emotional human-computer relationship with a physical presence of an interface agent, using spoken, conversational dialogues. Other results from a literature review in the field of virtual health coaching showed that speech-based virtual characters can improve user satisfaction and engagement with a computer system [18]. Both studies suggest a potential benefit of implementing a spoken conversational interface. Speech might increase engagement because it can make the human-computer interaction more natural, thereby improving the effects of coaching. On the other hand, it might decrease trust because it can create expectations of the system that may not be fulfilled.

Another area where speech can contribute is in the field of social companionship and peer support. Loneliness is a common problem in today's older population, and since it is closely associated with depression [19], it is important for older adults not to feel lonely. COUCH contains a peer support character, who also takes advice from the coaches and is there to share his experiences with the user from the viewpoint of an equal friend. Additionally, there is a social coach who can help the user with tips and advice on having a socially active life. Implementing speech in these characters might improve the effectiveness of their role as social companions: they can act as virtual friends when a more natural conversation in a home setting takes place, possibly decreasing the feeling of loneliness among older adults. Besides participating in conversations, virtual friends can read books or other long-form documents to help the users [20].

One last big advantage of implementing speech in an application for older adults is overcoming barriers to accessing information. For many users, especially older adults, a limited ability to read and type decreases the usability and ease of use of an application. Conversational interfaces can bridge this gap by allowing them to talk to the system [20], thereby avoiding manual input methods like the keyboard and mouse. This makes speech a comfortable and efficient input method for older adults with physical disabilities and loss of function [21]. Additionally, younger and older adults differ in the way they interact with technology, whereby the latter group generally faces more difficulties interacting with computers. Literature showed that speech could be one of the most natural and effective modalities to overcome older adults' problems related to their attitudes towards and experience with technology in general [22].

1.3 Problem Statement

Considering these promising advantages, especially for older adults, this research aims to assess the possibilities of implementing speech in the COUCH application, which was not designed to function as a speech interface. The biggest challenge lies in handling the dialogues: adjustments to the dialogue structure are required to allow users to listen and speak to the coaches instead of reading and clicking to navigate through the dialogue. Thereby two problems need to be tackled. First, the design of an appropriate spoken dialogue needs to be researched to create an application that might improve the user experience. Second, the benefits and drawbacks of such a speech-based system need to be assessed by obtaining user feedback. This research is an addition to the work described in Section 4.1 and does not aim to develop a conversational interface that is perfectly able to imitate natural human-to-human conversations, but instead to investigate whether the state of the art is developed enough to create a reliable and usable conversational system in a daily life setting. Hereby, smart ways to handle the spoken dialogues in the context of COUCH are investigated that can contribute to the reliability and usability of the conversational system. The research attempts to find evidence that this is indeed an area of promise and that users experience additional advantages in a speech-based COUCH implementation. This goal is formalized in the following research question:

MQ: To what extent can spoken interaction offer a valuable addition to the multi-party virtual Council of Coaches application?

Spoken interfaces can be one-sided, meaning that there is only audio input spoken by the user or only audio output spoken by the virtual character, or two-sided, meaning that both the user and the system can talk. This research focuses on two-sided vocal interaction because we expect this type of interaction to be more interesting for the user. Therefore, a component to transcribe the spoken audio into text, as well as a component to transform written text into output audio, are required. The prior research [1] addressed the potential for such a system based on literature, but mainly focused on the recognition of spoken input via the automatic speech recognizer and much less on the spoken output via the text-to-speech synthesizer. Therefore, a couple of extra sub-questions regarding text-to-speech synthesis are part of this thesis, in order to develop a system that is capable of showing the value of speech. Apart from the speech recognition and speech synthesis, it is important for an application like this to investigate what additions to the graphical user interface are necessary and how the dialogues should be structured in order to function in the existing COUCH application. Therefore, we created the following sub-questions:

SQ:1 What are current limitations in text-to-speech synthesis software and how can this problem be addressed?
SQ:2 How robust is the current state of the art in text-to-speech synthesis to create the multiple humanlike voices necessary to ensure a natural interaction with all coaches?
SQ:3 What additions to the graphical user interface are necessary to assure a pleasant user experience with the voice-based application?
SQ:4 How should the dialogues be adjusted in order to function with the implementation of speech into the Council of Coaches?

These sub-questions are largely answered before the development phase and serve as the basis for the final system design. This final system is tested to answer the research questions, defined as:

RQ:1 Does the addition of speech to the Council of Coaches application lead to an increase in user experience?
RQ:2 How robust is the current state of the art in speech recognition systems to create a usable and enjoyable system that can be used in a real-life (home) setting?

A controlled experiment is designed to compare the user experiences between the text- and speech-based applications and to provide an answer to research question 1. A field experiment is done to assess the user experience when the application is used in a home setting, where it is tested for its robustness, as defined by research question 2. Together, these results help to find an answer to the general research question.

1.4 Approach

In order to find answers to the questions posed in the previous section, this project will integrate an automatic speech recognizer and a text-to-speech synthesizer into the existing multi-party Council of Coaches application, which focuses on supporting older adults to live healthily. Additionally, smart strategies to design the spoken dialogues and manage the user interactions will be assessed. A detailed description of the development of this system can be found in Chapter 4. This system will then be put to the test in two user experiments. During the first user test, users will be presented with two versions of the system: one version where they can interact via speech, and one where they interact via reading and clicking. During the second user study, four older adults will use the speech-based system at home for one week. Obtained user data and questionnaires filled out by all test subjects will then be analyzed in order to answer the research questions.

1.5 Document Structure

This section provides an overview of each of the following chapters of the thesis.

2. Theory
In this chapter, first an introduction to conversational interfaces is given, including all its related concepts. This is followed by an introduction to conversation mechanisms, and the chapter ends with a small review of the technologies involved in conversational interfaces and the challenges of implementing natural speech.

3. Related work
In this chapter, the practice of several virtual health coaching systems is explored. It includes a description of the advantages and limitations experienced in these related works. This information can help to anticipate pitfalls before the implementation of speech.

4. System Design
This chapter is divided in two parts. The first part gives an overview of the original COUCH application, discussing the coaches, interface, coaching structure and content, and the WOOL platform, which is a simple, powerful dialogue platform for creating virtual agent conversations. The second part provides the technical details and design of the system developed for this research. It starts by introducing the system's architecture, followed by an explanation of each conversational component. It also describes the strategies for managing the dialogues and the adjustments made to the dialogues, guided by WOOL.

5. Methodology of Evaluation
In this chapter the methodology for the two experiments is described. This methodology includes the experimental design, set-up, the questionnaires, the statistical analyses, an overview of the participants and the final procedure.

6. Results
In this chapter the results of the data analyses are presented for all data collected in both experiments. These results are then used to draw conclusions from the experiments and to answer the research questions.

7. Discussion
In this chapter, the outcomes of the experiments are discussed and positioned in existing literature. The results are interpreted to answer research questions 1 and 2 and the main question. Strengths as well as limitations of the study are discussed, followed by future research directions.

8. Overall conclusions
The last chapter provides a summary of the main findings of this thesis and summarizes the answers to the research questions.

Chapter 2

Theory

The content of Chapter 2 is also for a large part reused from the literature research [1]. However, this preliminary research mainly focused on the automatic speech recognition component and less on the speech synthesizer component. Both components are important for the design of a conversational interface, and therefore more research is done here in the field of text-to-speech synthesizers. The added material includes the technical details and limitations of speech synthesizers. Furthermore, the literature research extensively investigated the state of the art in automatic speech recognizers, while in this thesis one summarizing paragraph about the chosen speech recognizer is added.

In this chapter, an introduction to conversational interfaces and conversation mechanisms is provided. The remainder of the chapter gives an overview of the technologies involved in conversational interfaces and the difficulties and technical challenges of implementing natural speech. This chapter partly answers sub-questions 1 and 2.

2.1 An Introduction in Conversational Interfaces

Conversational interfaces enable people to interact with smart devices using spoken language in a natural way, just like engaging in a conversation with a person. A conversational interface is a user interface that uses different language components (see Section 2.3) to understand and produce human language, helping it to mimic human conversations [23].

The concept of conversational interfaces is not very new. According to McTear, Callejas and Griol [23], it already started around the 1960s with text-based dialogue systems. Somewhat later, around the 1980s, the concept of spoken dialogue systems became important within speech and language research. They mention that spoken dialogue systems (SDS) and voice user interfaces (VUI) are different terms for somewhat similar concepts (i.e. they use the same spoken language technologies for the development of interactive speech applications), although they differ in their purpose of deployment: SDSs have been developed for academic and industrial research, while VUIs have been developed in a commercial setting [23]. These systems are intelligent agents that use spoken interactions with a user to help them finish their tasks efficiently [24]. Academic systems often use embodied conversational agents (ECAs) [23], implemented in a more structured dialogue setting. ECAs are computer-generated characters with an embodiment, used to communicate with a user to provide a more humanlike and more engaging interaction. The main benefit of ECAs is that they allow human-computer interaction in the most natural possible setting, namely with gestures, body expressions and speech enabling face-to-face communication with users [18]. A few examples from literature are described in Chapter 3.

The commercial interfaces (VUIs) often do not have a dialogue structure, but instead use a spoken command-and-response interface. This means that an interaction with these systems starts with a request from the user, which is processed by the system; the system generates an answer and sends it back to the user. Examples of such commercial interfaces include Apple's Siri, Google's Assistant, Microsoft's Cortana, Amazon's Alexa, Samsung's Bixby, Facebook's M, Baidu's Duer, and Nuance Dragon. For ease of reading, we will use the term 'spoken dialogue systems' in the remainder of this report to comprise both SDSs and VUIs.

2.2 Conversation Mechanisms

The main objective of a conversational interface is to support conversations between humans and machines. Understanding how human conversations are constructed is an important aspect of the development of a conversational interface. In general, participants take turns according to general conventions (turn-taking), they collaborate to reach mutual understanding (grounding), and they take measures to resolve misunderstandings (conversational repair) [23]. Some design issues for conversational interfaces come from the complexity of implementing these conversation mechanisms in the dialogue manager, explained in Section 2.3.3.

Turn-taking: Informally, a turn can be described as "stretches of speech by one speaker bounded by that speaker's silence – that is, bounded either by a pause in the dialogue or speech by someone else" [25].

Grounding: The process of reaching mutual understanding between participants and keeping the conversation on track, for example by providing feedback or adding information [26]. In designing conversational interfaces it is important to know how understanding can be achieved, but also how misunderstanding may arise and how to recover from communication problems.

Conversational repair: The process of repairing failures in conversations through various types of repair strategies, initiated by either the speaker or the interlocutor [23].

2.3 The Technologies in Conversational Interfaces

The major components of dialogue systems are: Automatic Speech Recognition (ASR), Spoken Language Understanding (SLU), Dialogue Management (DM), Response Generation (RG) and Text-to-Speech Synthesis (TTS) [23,27]. The steps involved in a conversational interface are as follows:

1. The ASR component processes the words spoken by the user in order to recognize them.
2. The SLU component retrieves the user's intent from those words.
3. The DM component tries to formulate a response or, if the information in the utterance is ambiguous or unclear, the DM may query the user for clarification or confirmation.
4. The RG component constructs the formulated response, if desired.
5. The TTS component is utilized to produce the spoken response.

An overview of a complete spoken language conversational interface is shown in Figure 2.1.

Figure 2.1: The components of a spoken language conversational interface. Figure constructed based on the steps described in [23].
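To make the flow concrete, below is a minimal sketch of one turn through this pipeline. All component names and method signatures are hypothetical placeholders, not the actual interfaces of COUCH or of any particular toolkit.

```python
# Illustrative skeleton of one turn through the five components of a spoken
# conversational interface; every name here is a hypothetical placeholder.

def run_turn(audio_frames, asr, slu, dm, rg, tts):
    """Process one user turn: audio in, synthesized speech out."""
    text = asr.transcribe(audio_frames)    # 1. ASR: speech -> text
    intent = slu.interpret(text)           # 2. SLU: text -> semantic meaning
    action = dm.decide(intent)             # 3. DM: meaning -> system action
                                           #    (possibly a clarification request)
    response_text = rg.realize(action)     # 4. RG: action -> response text
    return tts.synthesize(response_text)   # 5. TTS: text -> spoken audio
```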

2.3.1 Automatic Speech Recognition

Automatic speech recognition (ASR) is the process of recognizing what the user has said by transforming the speech into text [23]. This is done by decoding the audio input and coming up with a best-guess transcription of the words being spoken. There are three types of ASR systems: (a) speaker dependent, (b) speaker independent, and (c) speaker adaptable. These systems differ in the amount of user training required prior to use. Speaker-dependent ASR requires training with the voices of its speakers prior to use, while speaker-independent ASR does not; the latter systems are pre-trained during system development with data from many speakers. In general, speaker-dependent ASR systems achieve better accuracy than speaker-independent systems, although the latter still reach relatively good accuracy when a user's speech characteristics fall within the range of the system's collected speech database. For example, a speaker-independent system trained with data from Western male voices in the age range of 20-40 should work for any 30-year-old Western man, but not for an 85-year-old Asian woman. Speaker-adaptable ASR is similar to speaker-independent ASR, with the difference that adaptable ASR improves accuracy by gradually adapting to the user's speech [28]. However, since the primary goal of ASR research has been to create systems that can recognize spoken input from any speaker with a high degree of accuracy, and speaker-dependent systems are very time- and effort-consuming, most commercial systems are speaker independent. Unfortunately, the voices of older adults often do not fall within the range of speech collected for the development of commercial systems. This may cause speaker-independent ASRs to not work optimally for the target group of our research. Group-dependent ASR might be a better approach, but requires a lot of data from older speakers to train the ASR [29]. Moreover, the focus of our research is not to develop the best performing ASR, but to test an implementation with a current state-of-the-art ASR.

We will not go into much detail on all possible ASR systems, since these were already researched extensively during the literature research preliminary to this thesis. During this previous study, the following conclusion was drawn: "Due to the main goal involved, the considered ASR was chosen based on the following criteria: their performance, ease of use, documentation, availability of the system and language support in Dutch. Based on these criteria, the choice has fallen on the NLSpraak toolkit. This toolkit wins on ease of use, availability of the system and language support. There is not much research in its performance, but since the toolkit is built upon the Kaldi toolkit [30] that has shown to perform quite well, we expect NLSpraak's performance to be solid for this project".

2.3.2 Spoken Language Understanding

Spoken language understanding (SLU) involves the interpretation of the semantic meaning conveyed by the spoken input that the ASR has transformed into text. Traditional SLU systems contain an ASR and a natural language understanding (NLU) component. The ASR component splits the audio input into small frames, and the NLU component in turn assigns a semantic label to each of the resulting segments. This means that fragments of the sentences are labeled as, for example, noun, verb, determiner, preposition, adverb or adjective. The semantic meanings of the different fragments are used to understand the complete input sentence, which can then trigger the subsequent behavior in a human-computer conversation. Since the SLU component builds on an ASR, its performance largely depends on the ASR's ability to correctly process speech, quantified by the word error rate (WER) [31]. The WER is the percentage of words that are missed or mistranscribed by the ASR. Fully parsing the grammar of a sentence only works when the ASR is close to perfect, with a very low WER. Nevertheless, the NLU performance can be made more robust against a poorly performing ASR by shaping the NLU appropriately. In general, most practical conversational interfaces try to achieve this robustness by using sets of attribute-value pair representations to capture the information relevant to the application from the speech input [23]. These attributes are pre-defined concepts related to the field of the application, while the values are the attribute specifications. For example, when looking for a health care institution with the specification "a dentist close to my house", the health institution 'type' and 'location' are the attributes, while 'dentist' and 'near' are the values. This approach is robust as long as the right keywords can be retrieved.
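To illustrate both notions from this section, the sketch below computes the WER as (substitutions + deletions + insertions) divided by the number of reference words, and extracts attribute-value pairs with a simple keyword lookup. The attribute vocabulary is invented for the dentist example above; it is not the vocabulary used in this project.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in reference."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard edit-distance dynamic programming, over words instead of characters.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Keyword-based attribute-value extraction: robust to recognition errors in the
# rest of the utterance, as long as the keywords survive. Vocabulary is made up.
ATTRIBUTE_KEYWORDS = {
    "type": ["dentist", "doctor", "physiotherapist"],
    "location": ["near", "close", "nearby"],
}

def extract_slots(transcript: str) -> dict:
    tokens = transcript.lower().split()
    return {attr: word for attr, words in ATTRIBUTE_KEYWORDS.items()
            for word in words if word in tokens}

print(word_error_rate("a dentist close to my house",
                      "the dentist close to house"))        # 2 errors / 6 words
print(extract_slots("I am looking for a dentist close to my house"))
```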

2.3.3 Dialogue Management

Dialogue management (DM) concerns the fundamental task of deciding what action or response a system should take, based on the semantic meaning of the user input. This user input and output can be either a textual or a vocal response [32], the former being the case for the original COUCH application. It is an important part of the conversational interface design, given that this component entails all of its dialogue structures and content. In addition, the dialogue manager bears the main responsibility for user satisfaction, because its actions directly affect the user. Each DM tool depends on the specific characteristics of the dialogue system type it has been created for. These characteristics include the task, dialogue structure, domain, turns, length, initiative and interface. Additionally, DM tools differ in the way they can be authored. With some tools, the dialogue strategy is in the hands of the dialogue author, while in others it is in the hands of the programmer, because adjusting some of the general built-in dialogue behaviors requires programming expertise. For the development of the original COUCH application, RRD developed its own dialogue framework "WOOL", with the goal of making it accessible for non-technical dialogue authors. This platform is described in more detail in Section 4.1.1.

As already mentioned in Section 2.2, two frequent design issues within dialogue managers for conversational agents are the interaction and confirmation strategies [23]:

Interaction strategies: Determine who takes the initiative in the dialogue: the system or the user? There are three types of interaction strategies: user-directed (the user leads the conversation), system-directed (the system leads the conversation by requesting user input) and mixed-initiative (both the user and the system can take the initiative), all having their own advantages and disadvantages [32]. The advantage of the user-directed strategy is that users can say whatever they want to the system, creating the feeling of a natural conversation. On the other hand, the system may be prone to errors because it cannot handle and understand all conversation topics. This problem can be overcome by using a system-directed approach: by constraining the user's input, fewer errors are made because the user has to behave according to the system's expectations. At the same time, this creates a less natural experience. A middle way between the user- and system-directed strategies is the mixed-initiative strategy, where the system can guide the user but the user can additionally start new topics and ask questions.

Confirmation strategies: Deal with uncertainties in spoken speech understanding. Two types of confirmation strategies exist: explicit confirmation (the system takes an additional conversation turn to explicitly ask for confirmation) and implicit confirmation (the system integrates part of the previous input in the next question to implicitly ask for confirmation). The former has the disadvantage that the dialogue tends to get lengthy and the interaction less efficient. The latter is more robust to this problem, but can cause more interpretation errors when the user did not catch the implicit confirmation request.
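As a minimal sketch of how these confirmation strategies could be selected at runtime, the snippet below picks a strategy from the speech recognizer's confidence score. The threshold values and prompt texts are invented for illustration and are not taken from the COUCH implementation.

```python
# Hypothetical sketch: choosing a confirmation strategy from ASR confidence.
EXPLICIT_THRESHOLD = 0.5   # below this, ask the user to confirm explicitly
IMPLICIT_THRESHOLD = 0.8   # below this, embed the interpretation in the next prompt

def next_prompt(understood_value: str, confidence: float, next_question: str) -> str:
    if confidence < EXPLICIT_THRESHOLD:
        # Explicit confirmation: costs an extra turn but avoids misunderstandings.
        return f"Did you say '{understood_value}'?"
    if confidence < IMPLICIT_THRESHOLD:
        # Implicit confirmation: fold the interpretation into the next question.
        return f"Okay, {understood_value}. {next_question}"
    return next_question  # confident enough to continue without confirming

print(next_prompt("walking", 0.4, "How often per week do you want to exercise?"))
print(next_prompt("walking", 0.7, "How often per week do you want to exercise?"))
```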

2.3.4 Response Generation

Response generation (RG) is the process that follows up on the dialogue manager's response decision. The conversational interface has to determine the content of the response and an appropriate method to express this content. The content can be in the form of words, or it can be accompanied by visual and other types of information. The simplest approach is to use predetermined responses to common questions. RG is commonly used in SDSs to present structured information (e.g. "Who is the king of the Netherlands?"), which involves translating the structured information retrieved from a database into a form suitable for spoken responses. RG is more complex for such information-retrieval systems and less so for a system like COUCH, which has relatively simple response generation. In the COUCH application, most responses are pre-scripted, using simple lookup tables and template filling. These templates can be dynamically filled with information about the interaction, from which the coaches' possible text-to-speech sentences are constructed.
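A lookup table with template filling can be as simple as the following sketch; the table entries and slot names are invented examples, not actual COUCH dialogue content.

```python
# Hypothetical example of pre-scripted responses with template filling.
RESPONSE_TEMPLATES = {
    "greet": "Good {time_of_day}, {user_name}! Nice to see you again.",
    "goal_reminder": "Yesterday you walked {steps} steps; your goal is {goal}.",
}

def generate_response(action: str, **slots) -> str:
    """Look up the pre-scripted template for an action and fill in its slots."""
    return RESPONSE_TEMPLATES[action].format(**slots)

print(generate_response("greet", time_of_day="morning", user_name="Anna"))
print(generate_response("goal_reminder", steps=4200, goal=6000))
```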

2.3.5 Text-to-Speech Synthesis

Text-to-speech synthesis (TTS) is the process of synthesizing the words generated by the RG component into spoken speech. TTS is closely related to ASR, since both systems need to work together accurately for a speech-based conversational interface to function effectively. A TTS system is composed of two components: the front-end and the back-end [33]. The front-end component is responsible for normalizing text such as numbers and abbreviations and for assigning a phonetic transcription to the parts of each word; the back-end component then turns these transcriptions into the spoken output sentence. There exist many different synthesizer technologies for this process, each attempting to achieve naturalness (i.e. the similarity of the output to human speech) and intelligibility (i.e. the ease with which the output is understood). TTS is used in applications where messages cannot be prerecorded but have to be synthesized in the moment [23].

Challenges of TTS can be divided into the text-normalization challenge and the text-to-phoneme challenge (a phoneme is a distinctive sound in a language) [33]. The text-normalization challenge lies in deciding how to convert numbers and abbreviations, both of which can be ambiguous depending on their context. This challenge is addressed in the current research by using the Speech Synthesis Markup Language (SSML) specification (https://www.w3.org/TR/speech-synthesis11/), which is designed to provide a rich markup language for assisting the generation of synthetic speech in Web and other applications. This language allows the programmer to manually instruct the TTS about the required text normalization in a specific context. Exceptional mistakes can also be corrected via this language, while the rest of the text to speak is left unchanged. One important requirement for the use of SSML is that it has to be supported by the TTS. The text-to-phoneme challenge comprises determining the correct pronunciation of a word based on its spelling, for which two basic approaches are used. The simpler dictionary-based approach is a matter of looking up each word in a dictionary and replacing the spelling with the pronunciation specified there. The rule-based approach instead applies pronunciation rules to words based on their spelling. This latter challenge is much more difficult to address externally, because it is part of the implementation of the TTS.
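As an illustration, the snippet below wraps a Dutch sentence in SSML and pins down two ambiguous normalizations with the standard say-as and sub elements. The surrounding send_to_tts call is a hypothetical placeholder for whichever synthesizer is used, and only works if that synthesizer supports SSML.

```python
# Hypothetical example: resolving text-normalization ambiguity with standard
# SSML elements. "13-12-2020" is read as a date and "dr." as "dokter" instead
# of being left to the synthesizer's guess.
ssml = """\
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="nl-NL">
  U heeft een afspraak op
  <say-as interpret-as="date" format="dmy">13-12-2020</say-as>
  bij <sub alias="dokter">dr.</sub> Jansen.
</speak>
"""
# send_to_tts(ssml)  # assumed entry point of an SSML-capable synthesizer
```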

The final requirements for the TTS necessary for this research were:
• The availability of six different voices
• Support for the Dutch language
• Availability free of charge
• Although not a hard requirement, the option of using SSML was appreciated

2.4 Limitations of Conversational Interfaces

Although ASR technology is useful in a wide range of applications, it is never likely to be 100% accurate. One big difference between written language and spoken language is that spoken language is much more spontaneous: written text is grammatically correct, while spoken speech often is not. Complexities regarding the user characteristics of older adults, conversational mechanisms (i.e. the processes of turn-taking, grounding and conversational repair), dialogue structure and speech input variations make the recognition of spoken language a complex process [25]. The limitations of conversational interfaces and their expected effects on the COUCH system are discussed in this section. Additionally, for every limitation a suggestion on how to deal with the problem is given.

2.4.1 Conversation Mechanisms

Spoken speech requires more conversation mechanisms than written text. Mechanisms such as turn-taking, grounding and conversational repair are much more complex to implement in a conversational interface. In the original COUCH application, users have to choose between several multiple-choice text input options. This eliminates the need for conversational repair and it simplifies grounding: the computer system always understands the user, and when the user does not understand the computer, they can ask for repetition or more clarification via one of the prewritten input options. When implementing speech in such an application, this process becomes more complex, as the ASR can misunderstand or fail to identify the spoken input. One way to improve the experience when such problems occur is to design good conversation mechanisms by including, for example, confirmation strategies (i.e. strategies to deal with uncertainties in spoken speech understanding). These strategies are implemented in the speech-based COUCH system and presented in Section 4.4.

2.4.2 Naturalness of Speech

The naturalness of spoken speech is difficult for conversational interfaces to deal with. COUCH uses a hierarchical dialogue structure (see Section 4.1.4 for a detailed description) in which the dialogue follows a structured path based on the user input. When implementing speech in this dialogue system, the system should be able to deal with many differently phrased spoken input sentences. For example, when a user asks about the positive effects of physical activity, the question can be phrased as:

• "Can you tell, eeh ..., tell me about the positive effects of physical activity, please?"
• "Why do I need to exercise more?"
• "What are the advantages of exercising?"
• "I am not into, .. I mean, don't like to be physically active, so why should I?"

One way to handle the speech is by retaining the multiple-choice structure. In this way the system can "guess" which option was chosen by the user, based on the speech input. In the example above, the system has to understand that the option "positive effects" is chosen for each of the example inputs and that it has to mention the advantages of being active. However, this approach is complex, since the system then has to understand all user input, including input that is not related to the coaching topics. An easier approach is to provide the user with a restricted list of input sentences, but this decreases the naturalness of the interaction.
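A rough sketch of this guessing approach, assuming each predefined reply option is annotated with a set of keywords: score every option by its keyword overlap with the transcript, and fall back to a clarification turn when nothing matches. The keyword sets below are invented for the physical-activity example, not taken from the COUCH dialogue scripts.

```python
# Hypothetical keyword matching of a free transcript onto predefined reply options.
OPTION_KEYWORDS = {
    "positive effects": {"positive", "effects", "advantages", "why", "benefits"},
    "exercise tips": {"tips", "how", "start", "suggestions"},
}

def match_option(transcript: str):
    """Return the best-matching reply option, or None to trigger clarification."""
    tokens = set(transcript.lower().replace("?", "").split())
    scores = {option: len(tokens & keywords)
              for option, keywords in OPTION_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    # No overlap at all -> better to ask a clarification question than to guess.
    return best if scores[best] > 0 else None

print(match_option("What are the advantages of exercising?"))  # -> "positive effects"
print(match_option("Tell me about the weather"))               # -> None
```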

Compared to the ASR and SLU components, which struggle with the naturalness of spoken input speech, the TTS has one of its fundamental limitations in the naturalness of the spoken speech output. Written text does not convey any emotions [6], constituting a complex domain for synthesized speech. There are no concrete parameters to classify the emotions expressed in synthesized speech, as compared to ASR systems, which can use the WER as such a parameter [34]. Other difficulties found in mimicking natural speech from text input are the correct pronunciation of names and foreign words, and generating correct prosody [34]. Prosody plays an important role in transferring a full communication experience between the speaker and the listener. These latter limitations can, to some extent, be addressed by using SSML, because it provides authors of synthesizable content a standard way to control aspects of speech such as pronunciation, volume, pitch and rate. How SSML is integrated in COUCH is explained in more detail in Section 4.3.

2.4.3 Speech Input Variations

Conversational interfaces have to deal with the problem of handling speech input variation. Variations may cause the speech recognizer to incorrectly interpret the speech input or not recognize the speech at all (i.e. increasing the WER). This variation may be due to several factors such as age, speaking style, accents, emotional state, tiredness, health state, environmental noise and microphone quality [35]. A few of these factors are considered important for the current project and are elaborated on below.

Environmental noise: This is one of the big challenges for ASR systems, since it interferes with the recognition of the user's voice. An experimental setup in a completely controlled laboratory environment can obtain very promising results for ASR systems, while using the same system in a home setting can substantially increase the WER. Older adults who suffer from hearing loss might, for example, listen to loud radio and television at home, which increases the risk of ASR performance deterioration. One simple method for conversational interfaces to deal with noise is to provide the user with feedback about the noisiness of the environment. Simple feedback messages like "Sorry, I cannot hear you, it is too loud" can dramatically improve the user's experience because they show understanding of the issue [36].

Speech characteristics: The speech of older adults differs from that of younger adults in multiple ways, causing the WER of ASRs to be significantly higher for older adults [37]. The first issue has to do with the natural ageing of the voice: the characteristics of an aged voice are less easily recognized by standard ASR systems, since these are often designed for the majority of the population and trained with speech of young adult speakers [22,35]. Second, literature suggests that a large segment of the older population has experienced a past or present voice disorder [38]. People suffering from dysarthric speech or any other voice-related health problem tend to achieve lower ASR performance with commercial applications [28].
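The noise-feedback idea sketched above can be implemented with a crude loudness check, as in the snippet below. The RMS measure and the threshold are uncalibrated placeholders; a real system would need a proper voice-activity or signal-to-noise estimate.

```python
import math
from typing import List, Optional

NOISE_RMS_THRESHOLD = 0.25  # placeholder value; would need calibration in practice

def noise_feedback(samples: List[float]) -> Optional[str]:
    """Return a feedback message when the input level suggests a too-noisy room."""
    rms = math.sqrt(sum(s * s for s in samples) / max(len(samples), 1))
    if rms > NOISE_RMS_THRESHOLD:
        return "Sorry, I cannot hear you, it is too loud here."
    return None  # quiet enough to attempt recognition
```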

2.4.4 Speech Synthesis for Older Adults

According to Kuligowska, Kisielewicz, and Włodarz [34], older adults have problems with understanding synthesized speech, particularly older adults suffering from hearing problems. When they miss the contextual clues, such as hand gestures and lip movements, that compensate for weakened acoustic stimuli, understanding the speech can be very difficult. Fortunately, this limitation can easily be addressed by offering users the opportunity to use both written text and spoken speech in the interface.

2.4.5 Expectations

Speech can raise the user's expectations of the system [39]. The coaches from the council are not very smart, and are for this reason designed as cartoon characters communicating via text balloons. When users can talk to an application, they might expect the system to understand everything they say, including topics that are not related to the coaching. Research showed that especially older adults often use everyday language and their own words to formulate commands, even when explicit instructions regarding the required input are given [40]. When the system does not understand this, the user might experience more negative feelings, in the worst case leading to avoidance of the system. In that case, the coaches are not able to maintain a long-term relationship with the users and cannot provide coaching anymore. When implementing speech, cartoon-like characters are therefore suggested, because cartoon images lower user expectations of the characters' skills and match the technical abilities of the system [41]. Thus, the problem of high expectations might be overcome by keeping the coaches as they are: simple cartoon characters.

2.4.6 Long-term Engagement

Keeping the user engaged for a longer period of time is challenging for conversational interfaces, as it is for other technologies. It is important to consider that users might get bored when the system outputs exactly the same sentences multiple times, because this can lead to interactions that become repetitive over a longer period of time. This was a problem mentioned in a study by Bickmore [42], for example. Results of the preliminary literature research [1] showed that there was too little content in the original COUCH application, which caused repetitiveness and boredom among users. Engagement might be improved by using speech because it can make the application more interesting, but keeping this engagement in the long term remains challenging.

2.4.7 Privacy Issues

Although all technologies need to consider privacy issues, this is particularly important for conversational interfaces. Ethical and legal questions arise about what data is collected, who has access to it, how long and where the data is stored, and what such data is used for [43]. The COUCH system is used in a safe home area where private conversations regarding physical as well as mental health take place. To participate in interactions, speech has to be recorded, and these recordings may contain sensitive information.

Chapter 3

Related Work

This related work chapter is the last chapter that reuses content from the preliminary literature research [1]. In this chapter, an extra section is added that describes the implications for the current research, based on findings from related work.

As described in Section 2.1, ECAs are computer-generated characters with an embodiment, used to achieve humanlike and more engaging interactions. Several ECAs have already been developed in virtual coaching systems to assist users in making appropriate health-related decisions. The purpose of this chapter is to give a few relevant examples of these state-of-the-art virtual coaching systems and to discuss the advantages and disadvantages that they provide.

3.1 In-home Social Support Agent

The in-home social support agent [44] is a remote-controlled companion agent used in the homes of older adults. The remote Wizard of Oz research (see Figure 3.1) showed high levels of acceptance of and satisfaction with the in-home social support agent, with many participants stating that it felt like a social companion. Older adults would like to tell stories and discuss the weather, their family and future plans with virtual companions. Participants spent the most time on storytelling, indicating that this feature would be valued and utilized by older adults.

Figure 3.1: Wizard of Oz agent setup for in-home social support agent for older adults. Figure reproduced from [44].

3.2 KRISTINA

KRISTINA [45] is a conversational agent that provides healthcare advice, assistance and social companionship to older adults. The agent is composed of different modules that ensure multimodal dynamic conversations with users. The system includes an ASR component for the speech recognition and a TTS component for the spoken surface output. For its non-verbal appearance, KRISTINA is realized as an ECA through a credible virtual character and offers different functionalities. First, the user can choose a scenario from a predefined list. Based on this scenario, users can converse with the virtual character by speech, although a back-up option for typed text is provided in case the conversational agent does not understand the spoken input. The project underwent two iterations, with features such as a larger array of topics to talk about and more depth within the topics added in the second iteration. Examples of the characters from the two iterations are shown in Figure 3.2.

(a) Prototype of the first iteration (b) Prototype of the second iteration

Figure 3.2: KRISTINA prototypes [45], captured from http://kristina-project.eu/en/

The final prototype provides a wide range of information and more advanced communication [46]. The system is able to generate proactive responses and dialogues to have more everyday-like conversations. The agent can ask for direct or indirect clarification when it detects more than one topic relevant to the user's input. KRISTINA scored high on points such as trustworthiness, friendliness and professionalism. The design and behavior were made to look natural and to match the content and scenarios [46]. However, the gestures and facial expressions were considered too rigid and for this reason did not evoke empathy. The voice did not evoke empathy either: it was considered monotonous. One last major issue is the system latency, which was perceived as too long for a natural dialogue. These problems need to be tackled in future systems.

3.3 Meditation Coach

The Meditation Coach [47, 48] is an ECA developed to guide users through a mindfulness meditation session and help them relax. The coach (shown in Figure 3.3) is made interactive by recording and processing data from a breathing sensor in the dialogue system. Participants appreciated that the system afforded tailored feedback, and they experienced it as more effective in reducing anxiety than a videotaped meditation instructor. This finding was supported by the significantly stronger respiration regulation as measured by their respiration rate during meditation. The virtual coach inhaled and exhaled in the rhythm of the participant's breath and provided feedback about the pace of the breathing (e.g. "continue breathing at a slower pace"). These personal breathing instructions were based on the participant's measured breath duration and breathing rate. The results indicated that a coach embodied as a conversational agent can achieve the coaching goal more effectively than a non-embodied conversational character. Nevertheless, participants were significantly more satisfied watching a video of a human meditation instructor. A major source of displeasure with the virtual coach was its lack of humanlike features, including its synthesized voice. A human voice is often preferred over a synthesized voice, and this is even more important for meditation applications.

Figure 3.3: The Meditation Agent. Figure reproduced from [47].

3.4 Exercise Advisor

The exercise advisor [42, 49] is part of the FitTrack system, which has been developed to investigate the ability of relational agents to establish and maintain long-term, social-emotional relationships with users, and to determine if these relationships could be used to increase the efficacy of health behavior change programs. The relational agent plays the role of exercise advisor that participants talk to about their physical activity. It is designed to be used on home computers on a daily basis. The agent uses synthesized speech and synchronized nonverbal behavior, but the user contributions to the dialogue rely on selecting items from multiple choice menus, dynamically updated based on the conversational context.

Two different versions of the exercise agent exist: a first system designed to work as a general exercise coach (see Figure 3.4a) and a second system that was adapted for older adults (see Figure 3.4b). This second system was designed to be easy to use, with a very consistent and intuitive user interface and an enlarged display area to accommodate visual impairments. It also contained an additional self-monitoring graph and an educational content page, which temporarily replaced the ECA.

Both evaluation studies of the exercise advisor system demonstrated the acceptance and usability of a relational agent by older adults. Participants found interacting with the agent to be relatively natural, and this had a positive impact on the users' perceived relationship with the agent [42, 49]. However, some contradictory results were found. The former research [49] suggested that deploying conversational interfaces does not imply that natural language understanding must be used. The dynamic menu-based approach used in the FitTrack system provided many of the benefits of a natural language interface, such as naturalness and ease of use. As an additional advantage, it is not necessary for the system to rely on error-prone understanding of unconstrained speech input. However, in the follow-up research [42], participants mentioned that they could not express themselves completely using the constrained, multiple-choice interaction. When asked, participants universally said they would have preferred speaking to the agent rather than using a touch screen, but Bickmore et al. [42] mentioned that for future work, available systems first need to be thoroughly evaluated to ensure that they can provide high enough reliability given the variability and differences in voice quality of older adults.

(a) Initial FitTrack interface with exercise advisor. Figure reproduced from [49]. (b) New FitTrack interface with exercise advisor. Figure reproduced from [42].

Figure 3.4: The FitTrack interfaces.

This is necessary so that users can engage in richer conversations and more freely express themselves while maintaining the ease of use of the multiple-choice selection input modality. Our research focuses on this future work proposal and investigates the possibilities of speech, while maintaining the ease of use of the multiple-choice approach.

3.5 Implications for the Current Research

This related work chapter provided insights into existing virtual characters created in the field of e-health, of which a few findings are considered important for the current research. The in-home social support agent showed high levels of acceptance and satisfaction, indicating the relevance of such systems in general. KRISTINA offered users the backup option to type text in case the agent did not understand the spoken input. Something similar can be considered for our project, in which users can click the textual option in case there is no response to their speech. Additionally, KRISTINA could ask for direct or indirect clarification when it detected more than one relevant topic. Such confirmation strategies are considered important for the current research because they can improve the naturalness of the conversation. A limitation found in both KRISTINA and the meditation coach was the synthesized voice: it was perceived as monotonous and lacking humanlike features. This shows the importance of somewhat humanlike agent voices for assuring a pleasant experience.

The exercise advisor by Bickmore et al. is quite similar to the COUCH application in context and dialogue structure, and they share the goal of establishing and maintaining long-term relationships. The dialogue structures are similar in the selection of items from multiple choice menus, dynamically updated based on the conversational context. The applications differ in their input and output modalities, which can be text, speech, or a combination of both. The original COUCH application only uses text input and output, the exercise advisor uses synthesized speech output but no speech input, and the speech-based COUCH application uses speech input and synthesized speech output. Participants found interacting with the exercise advisor to be relatively natural, and this is expected to improve when implementing a two-directional (i.e. speech input and output) vocal interaction.

Chapter 4

System design

In this chapter, the design and development process of the speech-based COUCH system is described. The original text-based COUCH application was not designed to support speech input and output, leaving the challenge of implementing them within the existing COUCH framework. To do this, the current Council of Coaches platform is researched first. The remainder of this chapter focuses on the implementation of the conversational interface components (as described in Section 2.3) into the application. First, an overview of the system architecture and communication is provided, followed by an explanation of each conversational component and its integration in the application. Furthermore, the adjustments in dialogue content and structure are described, as well as the features that were added or removed. This chapter partly answers sub-questions 2, 3 and 4 (see Sections 4.2.5, 4.5, and 4.2.2 respectively).

4.1 The Council of Coaches Platform

The Council of Coaches platform1 consists of a few main components that are discussed in this section. First, the WOOL dialogue platform is discussed, which is used for handling and adjusting the dialogues according to our needs. Second, an overview of the coaches is provided, since these are the main characters of the application, providing all relevant information to the user. Lastly, we explain the original interface, the dialogue structure and the coaching content. This information serves as background for the reader and is relevant to understand the design and development choices of the speech-based system.

4.1.1 The WOOL Dialogue Platform

As already mentioned in Section 2.3.3, different types of dialogue managers exist. The WOOL dialogue platform2 is one such example, developed in order to easily manage dialogues. Figure 4.1 shows the WOOL editor with its relevant elements. The following terms, defined by Beinema et al. [50], are important to understand.

Node: A dialogue step that contains one Statement and one or more Replies.
Agent: A virtual speaker within a dialogue.
Statement: Something an agent says.
Reply: A possible reply that a user of the system can give.

1 https://www.council-of-coaches.eu/
2 http://www.woolplatform.eu/

Figure 4.1: The WOOL editor, presenting a dialogue from agent Coda. Different nodes are shown, each containing a statement and multiple replies.

WOOL defines three types of replies: basic, autoforward and input replies. The basic reply is the standard reply option that links the user to a new node. An autoforward reply is a reply that does not contain a statement. In this case the user is not provided with multiple reply options, but instead is presented with e.g. a default 'continue' button that links to another node. An input reply is simply a reply in which the user is asked to enter input (such as names, numbers, etc.) in a text field. The WOOL dialogue platform is essentially a definition of a series of dialogue steps, represented by nodes and linked through user replies. These dialogue steps can be manipulated individually, without changing the structure of the complete hierarchy. More information about the details of the WOOL framework can be found in the work of Beinema et al. [50].
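To make these terms concrete, the following TypeScript sketch models a node and its three reply types as simple data structures. All type and field names are illustrative assumptions for this report; they do not correspond to the actual WOOL API.

```typescript
// Minimal sketch of how a WOOL node and its reply types could be modeled.
// Names are hypothetical, chosen to mirror the definitions above.

type ReplyType = "basic" | "autoforward" | "input";

interface Reply {
  type: ReplyType;
  statement?: string;   // absent for autoforward replies
  nextNodeId: string;   // the node this reply links to
}

interface WoolNode {
  id: string;
  agent: string;        // the virtual speaker of this node
  statement: string;    // what the agent says
  replies: Reply[];     // one or more possible user replies
}

// Example: a node in which a coach offers two basic replies.
const exampleNode: WoolNode = {
  id: "olivia-intro",
  agent: "Olivia",
  statement: "Would you like to do a coaching session?",
  replies: [
    { type: "basic", statement: "Yes, let's start.", nextNodeId: "olivia-session" },
    { type: "basic", statement: "No, I just want to talk.", nextNodeId: "olivia-chat" },
  ],
};
```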

The WOOL dialogue platform offers a few features that are particularly useful for the current research. First, it provides the possibility to manually add 'words to listen for' to a statement. Second, it provides the possibility to add or remove intermediate dialogue steps without changing anything in the rest of the dialogue. How these features help the development of the speech-based system is discussed in more detail in Section 4.2.2. The third advantage that WOOL offers is its ease of use for people who are not experts in dialogue management tools. This can be particularly helpful for future work, as explained in Chapter 7 (the discussion).

4.1.2 The Coaches

One of the objectives of COUCH was to develop the coaches as interesting characters, with each coach having its own expertise and background. The complete set of coaches is shown in Table 4.1, including their role, name, nationality, gender and age. Besides the coaches, there is the robot Coda, the equivalent of an in-app 'menu'. Coda is designed to help with more technical functions like creating an account, logging in or out and changing settings.

Because COUCH contains many different coaches, differing in gender and age, an extensive diversity of synthesized voices was required. For some roles that were obvious in their nationality (e.g. François, who uses a lot of French words and talks about French wine and cheese), the intention was to convey their nationality via their voices (e.g. French pronunciation of words like 'bonjour'). This design choice is elaborately explained in Sections 4.2.5 and 4.3. Additionally,

the ages of the coaches were taken into account in the distribution of voices, by providing older sounding voices to Helen and Carlos and younger sounding voices to Emma and François. The chronic pain and diabetes coaches were not considered for this research because of their very specific roles, and therefore did not need a voice. For the interested reader, more information about the coaches' backstories can be found in the public deliverable D3.4 of the COUCH project [50], where an elaborate description of each coach is given, including their height, weight, place of birth, likes and dislikes, backstory, role, pointers for dialogue writing and coach selection blurb.

Role | Name | Nationality | Gender | Age
Physical Activity Coach | Olivia Simons | Dutch | Female | 52
Nutrition Coach | François Dubois | French | Male | 45
Social Coach | Emma Li | American | Female | 28
Cognitive Coach | Helen Jones | British | Female | 64
Peer Support | Carlos Silva | Portuguese | Male | 67
Chronic Pain Coach | Rasmus Johansen | Danish | Male | 33
Diabetes Coach | Katarzyna Kowalska | Polish | Female | 45

Table 4.1: The seven coaches from the Council

4.1.3 The Council of Coaches Interface

Users can interact with the interface by clicking buttons with scripted content and response options. Users can determine themselves which coaches to interact with. When a coach is clicked, a text balloon with a statement appears. The first personal dialogue with each coach consists of a short introduction and a personal story or domain relation question (e.g. François asks the user whether he/she likes to cook). Starting from the second dialogue, users can choose to have a social conversation, do a coaching session or leave the conversation (see Figure 4.2 on the next page), which are examples of basic replies. An example of an autoforward reply in COUCH is shown in Figure 4.3 (on the next page), where the user is presented with a 'continue' button. An input reply example in COUCH is presented in Figure 4.4 (on the next page). Which route is taken through the dialogue depends on the user input. This setup limits user input, which has the strength of giving the coaches more clarity on what they respond to [51]. The coaches themselves can naturally keep an interaction going by supporting or contradicting each other to increase user engagement, active participation, reflection and critical thinking about the user's own health. Besides the interaction with coaches, users can set up the system and change settings by talking to the robot assistant.

31 Figure 4.2: The Council of Coaches scripted basic reply1.

Figure 4.3: The Council of Coaches scripted autoforward reply1.

Figure 4.4: The Council of Coaches scripted input reply1.

When users open the application, they are presented with the main menu Web interface, which has a welcoming and familiar visual of a building entrance (see Figure 4.5). It is an entrance to the "Council of Coaches" home. The functions available in this screen are: creating an account, logging in with an account that was created before, and selecting the preferred language. If not done before, users have to create an account, so their preferences and information can be stored and dialogues can be personalized. This action, as well as the login action, allows the user to enter the Council of Coaches house. After pressing the 'create account' button, Coda guides users through this process (see Figure 4.6) and also introduces them to the COUCH system by teaching them how to interact with it.

Figure 4.5: Council of Coaches Main Menu screen1.

Figure 4.6: First screen of the account creation process, guided by Coda1.

4.1.4 The Dialogue Structure and Coaching Content

The previous section showed dialogue examples of the original COUCH application and gave a feeling for such interactions. To give an idea of the high-level structure of the conversation, this section describes the different content types and the hierarchy of the dialogues. This information serves as background for the reader and explains the context of the current research.

To guide the user in their behavior change process, each coach has a number of dialogues available that can be used to discuss a specific topic. COUCH structures the topics that each coach can discuss in a hierarchy. The end-points of this hierarchy represent the topics for the dialogues. For the current research, we will describe one example structure (the physical activity coach) of such a coaching process (for more examples see the public deliverable D3.4 of the COUCH project [50]). The strategies for the other coaches are relatively similar. The topic structure starts with a 'Start' node that allows a choice between the 'Social' topics and the 'Coaching' topic (see Figure 4.7). The hierarchy is straightforward from the figure, with all grey nodes allowing a choice between topics and the blue nodes being final dialogue topics. A small description of each of the blue nodes is provided in Table 4.2.

Figure 4.7: The hierarchy of coaching topics. Adapted from [50]. The blue nodes are topics and the grey nodes have subtopics themselves.

Topic | Description
Introduction | The dialogues in which the coach and the user introduce themselves.
Background story | The dialogues about the coach's background story.
Gather information | The dialogues in which the coach asks the user about his preferences.
Discuss coaching approach | The dialogues in which the coach discusses with the user what coaching style he prefers.
Discuss sensors | Dialogues about the sensors (FitBit).
Set new goal | Dialogues about setting new long-term and daily goals.
Discuss existing goal | Dialogues about the user's current goal and his experienced level of difficulty of the goal.
Discuss actions to achieve goal | Dialogues about what activities the user will do and when.
Advise on social support | Dialogues discussing how users might ask their family and friends to perform activities with them.
Help recognize triggers | Dialogues in which the user reflects on triggers causing them to be less active.
Discuss progress | Dialogues about how the user feels about their progress.
Discuss experience | Dialogues in which the user reflects on how it is going.
Give an example | Dialogues providing examples of how the user can be physically active.
Explain | Dialogues explaining to the user how to be physically active.
Positive effects | Dialogues explaining users why they should be physically active.
Negative effects | Dialogues explaining users why being physically inactive can be bad for their health.

Table 4.2: Descriptions of the coaching topics (the blue nodes in Figure 4.7).

4.2 System Architecture

To implement speech in the original text-based application, it is important to understand its structure and functioning. This application's WebClient is connected to the R2D2 server, which has a submodule responsible for handling WOOL dialogues. The WebClient receives information about the dialogues from the R2D2 server and presents the statements with reply options to the user. The chosen reply option is subsequently sent back from the WebClient to the R2D2 server.

The speech-based system also uses a connection with the R2D2 server for handling and retrieving WOOL dialogues. Furthermore, our system broadly follows the conversational interface architecture, as described in Section 2.3, consisting of five main components: Automatic Speech Recognition (ASR), Spoken Language Understanding (SLU), Dialogue Management (DM), Response Generation (RG) and Text-to-Speech Synthesis (TTS). While the ASR, DM and TTS components connect to external servers and are clearly distinguishable in their function, the SLU and RG components are less explicit components, implemented in the WebClient. The WebClient is responsible for locally handling the keyword check (the SLU component) and constructing the content-to-speak for the TTS (the RG component). In this report, the term 'keyword' describes a word to listen for that is linked to one reply.

The complete interaction process can be briefly described as follows. When the user speaks to the interface (i.e. the COUCH WebClient), the ASR component receives audio from the microphone and transcribes this input to written text. The transcription is sent to the SLU component, while simultaneously information about the state of the current dialogue step is forwarded by the DM component. The SLU component compares the keywords received from the dialogue manager with the audio's transcription and returns the user reply for which the transcription matched a keyword. The dialogue manager consequently evokes the next dialogue step and sends the updated dialogue content to the RG component. The RG component uses the relevant content (i.e. the statement and voice of the coach) from this dialogue step and sends it in the correct request format to the Google Text-to-Speech API, which translates the statement into speech audio. The generated speech is sent to the interface, where the audio is played. The user receives the content via audio (from the TTS) and text balloons (from the DM component). Figure 4.8 visually presents this process and shows how the different components interact. All system components are discussed more elaborately in upcoming sections.


Figure 4.8: Visualization of the speech-based COUCH architecture. The system contains ASR, DM and TTS components that are connected with external servers, and SLU and RG components that are implemented in the WebClient and therefore colored differently.
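As an illustration of this pipeline, the sketch below walks through a single pass from transcription to playback. The component interfaces are simplified stand-ins for the NLSpraak, R2D2/WOOL and Google TTS connections; all names and signatures are assumptions for this report, not the project's actual code.

```typescript
// Hedged sketch of one pass through the speech pipeline described above.

interface DialogueStep {
  statement: string;              // what the coach says next
  voice: string;                  // the voice belonging to that coach
  keywords: Map<string, string>;  // keyword -> reply id
}

interface Components {
  asr: { transcribe(audio: ArrayBuffer): Promise<string> };                    // NLSpraak
  dm: { currentStep(): DialogueStep; advance(replyId: string): DialogueStep }; // R2D2/WOOL
  tts: { synthesize(text: string, voice: string): Promise<ArrayBuffer> };      // Google TTS
  play(audio: ArrayBuffer): void;
  showBalloon(text: string): void;
}

async function handleUtterance(audio: ArrayBuffer, c: Components): Promise<void> {
  // ASR: transcribe the microphone audio to written text.
  const transcript = await c.asr.transcribe(audio);

  // SLU (in the WebClient): match keywords of the current step against the transcript.
  const step = c.dm.currentStep();
  for (const [keyword, replyId] of step.keywords) {
    if (transcript.includes(keyword)) {
      // DM: advance to the next dialogue step for the matched reply.
      const next = c.dm.advance(replyId);
      // RG: take the statement and voice, request synthesis, then play the audio
      // and show the text balloon alongside it.
      const speech = await c.tts.synthesize(next.statement, next.voice);
      c.play(speech);
      c.showBalloon(next.statement);
      return;
    }
  }
  // No keyword matched: wait for the next transcription (repair strategies: Section 4.4).
}
```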

The system is additionally described at dialogue step level, because using the application essentially means going through a sequence of dialogue steps. As mentioned in Section 4.1.3, three types of replies exist: basic, autoforward and input replies. Since input replies are not adjusted in the speech-based system, the focus was on basic and autoforward replies. The processes of giving these replies are presented in sequence diagrams. These diagrams show the interactions between the WebClient, R2D2 server and NLSpraak server, arranged in a time sequence. They depict the three objects involved in the speech recognition during a dialogue step and the sequence of messages exchanged to carry out the functionality of that dialogue step. For simplicity, the TTS is not included in the diagrams.

Both types of replies are mixed together within the dialogues, causing the diagrams to overlap. Therefore, only the sequence diagram of a basic reply (Figure 4.9 on page 38) is presented from beginning to end. The sequence diagram of an autoforward reply (Figure 4.10 on page 39) is represented as a smaller block that fits in the red square from the basic reply sequence diagram, as an alternative dialogue step within an interaction. The sequence diagram in Figure 4.9 can be globally described as follows. A user starts the interface by turning on the ASR. The server returns a request for permission to use the microphone. When the user gives permission, keywords are added for the coach names and the ASR component starts listening. When the name of a coach is spotted, the coach starts an interaction and keywords are added for the related reply options. While the coach is speaking, the ASR is paused and not listening for user speech. When the TTS is finished synthesizing the coach's statement, the ASR is resumed and listens to the user again, searching for keywords. In case a keyword is spotted, the content of the next dialogue step is retrieved and the process starts all over again. An alternative possibility is that the user speaks one of the keywords to end a conversation, so that the interaction stops and the user returns to the living room. In case of an autoforward reply, the user does not speak any reply options; instead the R2D2 server automatically continues to the next dialogue step after a short moment.

4.2.1 Automatic Speech Recognition

The 'NLSpraak' ASR system is used for transcribing the incoming speech into written text. For this speech recognizer, the Corpus Spoken Dutch3 was used to train Dutch models on the Kaldi framework [30]. Based on this Dutch Kaldi implementation, a web demonstration for this ASR was created4, which has been used as a starting point for the current project. This demonstration was helpful because it already included a working JavaScript application. The server listens to the audio, which it receives via a connection with the client, and returns the decoded transcription.

When the ASR system recognized a word that it had never heard before, it printed this word as unknown: '<unk>'. These unknown words could occur for two reasons: (1) the word was not in the vocabulary of the ASR system, so it could not be transcribed (e.g. the Dutch word 'coachingssessie'), or (2) a word was mumbled instead of clearly pronounced, which caused the ASR system not to identify the spoken word and to transcribe it as an unknown word. The identification of unknown words is used for the implementation of dialogue management strategies (see Section 4.4).

3 http://lands.let.ru.nl/cgn/
4 https://github.com/laurensw75/SpeechAPIDemo

Figure 4.9: A sequence diagram presenting the process of answering with a basic reply in a dialogue step. The dashed lines represent return messages.

Figure 4.10: A sequence diagram presenting the process of an autoforward reply. Only the part of the process is presented that can replace the content of the red square in Figure 4.9.

4.2.2 Dialogue Management

The DM component is responsible for responding to user behavior perceived via the ASR input and for generating the system behavior that is realized via TTS output. The dialogue manager controls the SLU and RG components, which are discussed in the subsequent sections.

The WOOL editor is used to adjust the dialogues, and the WOOL subcomponent of the R2D2 server is responsible for handling these adjusted dialogues. As mentioned in Section 4.1.1, WOOL offers some useful features that are used in the current research. For our project, the structure and content have not really been changed, but we used WOOL's possibility to remove intermediate dialogue steps that were not relevant for the user studies (e.g. dialogues from the activity coach requiring a FitBit) without changing anything in the rest of the dialogues. Also, words that could not be pronounced correctly by the TTS were removed from the dialogues.

Furthermore, in WOOL it is possible to add an action-Statement to a reply. This means that each reply option can receive a specific action that is performed when that reply option is chosen. An example implemented in the original COUCH application is the response of the system to open the recipe book in case a user replies with a preference to check out recipes (see Figure 4.11).

Figure 4.11: An example of the addition of a generic action-Statement to a reply option

For the implementation of the speech-based system, this WOOL feature is used to provide keyword tags to the different reply options. Keyword tags are created for each reply, as shown in Figure 4.12. This keyword list could be retrieved by the dialogue manager by specifying the action type and value in the request. The dialogue manager passed the keywords on to the SLU component, which checked for the presence of one of these keywords in the speech transcription.

39 Figure 4.12: An example of keyword tags, added to different answer options.

Most replies were manually provided with keyword tags, chosen based on the main message and important words of the reply. These main words were added as keywords, together with a set of similar words, in order to increase the chance for the ASR to detect a keyword and continue in the dialogue. For example, for the sentence 'I just want to talk', the keyword talk is added, as well as its synonyms chat, speak and conversate (see Figure 4.12). Due to the hierarchical content structure and fixed coaching topics, there was a lot of repetition in the dialogues. This offered the possibility to copy keyword tags to other dialogues. One common strategy in the keyword tagging was to provide a negative set of keywords to negatively phrased answer options and a positive set of keywords to positively phrased answer options; Figure 4.13 shows an example. For replies that did not contain a keyword tag, the dialogue manager retrieved the last word from the normalized reply (i.e. the reply without stop words and punctuation marks) and added it as keyword (see the sketch after Figure 4.13). Initially, the first word of a sentence was used as keyword, but the ASR showed more difficulty recognizing the first word than the last word. The retrieval of other words in a sentence was also considered, but this required a more complex implementation and, because most keywords were tagged manually, it was not pursued for the current implementation.

Figure 4.13: An example of keyword tags for positively and negatively phrased answers
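The following sketch illustrates the 'last word' fallback for untagged replies described above. The stop-word list and function name are assumptions; the actual implementation and word lists may differ.

```typescript
// Sketch of the 'last word' keyword fallback, assuming a small stop-word list.

const STOP_WORDS = new Set(["i", "a", "an", "the", "to", "just", "want"]);

function lastWordKeyword(reply: string): string | undefined {
  const words = reply
    .toLowerCase()
    .replace(/[^\p{L}\s]/gu, "")  // strip punctuation marks
    .split(/\s+/)
    .filter((w) => w.length > 0 && !STOP_WORDS.has(w)); // drop stop words
  // The last remaining word is used as the keyword to listen for.
  return words[words.length - 1];
}

// lastWordKeyword("I just want to talk.") === "talk"
```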

A difficulty experienced while tagging keywords was in sentences that were very similar but did or did not contain the word 'not'. For example, 'I am actually a big football fan' and 'I am not really a fan of football' were difficult sentences to deal with. In those cases, the positive keywords (e.g. yes, big, always) were added for the first answer and the negative keywords (e.g. no, not, never) for the second answer. Conversely, it also happened that sentences were very similar but all answers resulted in the same next dialogue step. For example, the answers 'Oh, you play piano? Do tell!', 'I always love to listen to piano players', and 'How long have you been playing it for?' all directed to the same follow-up step without setting parameters about the user's reply choice. In this case the main keyword 'piano' was only added for the first reply.

Besides the dynamic 'last word' and 'manual' keyword tags, we hard-coded a set of keywords for two situations. The first hard-coded set of keywords included all coach names and a few alternative options, such as the function of the coach, terms related to the function, and the number of the position of the coach in the living room (from left to right). For example, the keywords for Olivia included: 'Olivia', 'activity', 'sport', 'exercise', and 'two'. These alternatives did not require users to remember all coach names and provided an option in case the ASR did not recognize a name. The name 'Helen' was very difficult for the ASR to transcribe, making it difficult to start a conversation with the cognition coach. When observing the transcriptions that the ASR created when 'Helen' was spoken, words like 'heelen', 'alan', 'hellen', and 'ellen' came forward. To improve the recognition, these words have been added manually as keywords for the cognition coach. The second hard-coded set of keywords included words to end

40 a conversation, such as: ’Stop’, ’finish’, ’end’, and ’wrap up’. ’Bye’ was a reply option available in many dialogue situations, but not in all, so it was added to this set of ’ending’ keywords to give users the opportunity to always end a conversation by saying goodbye to the coaches. The only situation in which users were not allowed to end a conversation was in the middle of an autoforward reply.

4.2.3 Spoken Language Understanding

The SLU component used the transcription from the ASR and the keyword lists received from the dialogue manager to check whether any keyword matched the content of the transcription. In case a keyword was spotted, the SLU component triggered an action that depended on that keyword. All actions were passed on to the DM component, so that it could retrieve the subsequent dialogue step and all its specifications. A few actions were possible:

1. The transcription included a coach name or function, so the SLU component started an interaction with the coach (only when the user was not in a dialogue step yet).
2. The keyword belonged to one of the reply options, so the SLU component programmatically 'clicked' that reply option to continue to the next dialogue step.
3. The transcription included a word to end the conversation, so the SLU programmatically 'clicked' the cancel button and the user returned to the living room.

In case no keyword was spotted, the SLU checked whether an unknown word was transcribed. If this was the case, a clarification question was used (which is one of the dialogue management strategies described in Section 4.4). If no keyword and no unknown word were spotted, nothing happened and the SLU component waited for a new transcription from the ASR.
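The decision logic of the SLU component might be sketched as follows. The keyword sets, the '<unk>' handling and all names mirror the description above, but they are illustrative assumptions rather than the actual WebClient code.

```typescript
// Illustrative sketch of the SLU decision logic described above.

const COACH_NAMES = new Set(["olivia", "emma", "carlos", "helen", "francois", "coda"]);
const ENDING_KEYWORDS = new Set(["stop", "finish", "end", "bye"]);
const UNKNOWN_TOKEN = "<unk>";  // Kaldi-style marker for out-of-vocabulary words

type SluAction =
  | { kind: "startCoach"; coach: string }
  | { kind: "selectReply"; replyId: string }
  | { kind: "endConversation" }
  | { kind: "askClarification" }
  | { kind: "wait" };

function interpret(transcript: string, inDialogue: boolean,
                   replyKeywords: Map<string, string>): SluAction {
  const words = transcript.toLowerCase().split(/\s+/);

  if (!inDialogue) {
    // 1. Outside a dialogue: a coach name (or function keyword) starts an interaction.
    const coach = words.find((w) => COACH_NAMES.has(w));
    if (coach) return { kind: "startCoach", coach };
  } else {
    // 2. A reply keyword programmatically 'clicks' that reply option.
    for (const [keyword, replyId] of replyKeywords) {
      if (words.includes(keyword)) return { kind: "selectReply", replyId };
    }
    // 3. An 'ending' keyword cancels the conversation.
    if (words.some((w) => ENDING_KEYWORDS.has(w))) return { kind: "endConversation" };
  }

  // No keyword: an unknown word triggers a clarification question (Section 4.4);
  // otherwise the SLU simply waits for the next transcription.
  return words.includes(UNKNOWN_TOKEN)
    ? { kind: "askClarification" }
    : { kind: "wait" };
}
```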

4.2.4 Response Generation

While the SLU component is responsible for the process between the incoming audio and the continuation to the next dialogue step, the RG component is responsible for the process between this next dialogue step and the outgoing audio. The RG component receives the updated specifications of the dialogue step from the dialogue manager. The job of the RG component is to retrieve the speaking coach and statement, and to find the corresponding voice and text-to-speak. This information is formatted into an audio request that the TTS can process (discussed in more detail in the next section). The resulting synthesized speech is sent back to the RG component for playback. The WebClient presents the dialogue content in audio (received from the TTS) and text (received from the dialogue manager) to the user.

4.2.5 Text-to-Speech Synthesis

Task of the TTS
To convert the string received from the RG component into audio data, the request has to be sent in the right format, specified in Google's protocol5. The input text, voice specifications (i.e. name and language code) and audio configuration are provided to the Google TTS, so that the speech audio can be retrieved and sent back to the RG component for playback.
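For illustration, the sketch below sends a synthesis request in the format of Google's Cloud Text-to-Speech REST API (v1). The fetch wrapper and API-key handling are simplified assumptions; Google's protocol documentation specifies the full contract.

```typescript
// Sketch of a synthesis request to the Google Cloud Text-to-Speech v1 REST API.
// Error handling and authentication are simplified for readability.

async function synthesize(text: string, voiceName: string, apiKey: string): Promise<string> {
  const response = await fetch(
    `https://texttospeech.googleapis.com/v1/text:synthesize?key=${apiKey}`,
    {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        input: { text },                                   // the coach's statement
        voice: { languageCode: "nl-NL", name: voiceName }, // e.g. "nl-NL-Wavenet-A"
        audioConfig: { audioEncoding: "MP3" },
      }),
    },
  );
  const data = await response.json();
  return data.audioContent; // base64-encoded audio, decoded and played by the WebClient
}
```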

Choice for TTS
For synthesizing the coach statements to speech, one option was to prerecord all content with different voices to create humanlike voices. Considering the limitations of KRISTINA [45] and the meditation ECA [47], this might improve the overall experience. However, for the current research, which includes a complete council of coaches, this approach was too time-consuming. Besides, TTS has improved in quality over the past years and is expected to sound sufficiently good.

5https://cloud.google.com/text-to-speech/docs/create-audio


Although a decision for an ASR system was already made during the preliminary literature research of this project, no TTS had been chosen yet. This choice was quite easy due to the limited number of options. Since six of the COUCH characters were used for the user studies, we had to find a TTS that offered this many different voices. Other important criteria included that the voices be Dutch, that the TTS be free and, when possible, that it support the Speech Synthesis Markup Language (SSML). All these criteria resulted in the choice for the Google Cloud Text-to-Speech6. Current commercially available voices from Google are powered by WaveNet7, software created by Google's subsidiary DeepMind. The Google TTS differs from other voice synthesizers (including Apple's Siri) by using machine learning to generate speech, while other synthesizers use concatenative synthesis. Concatenative synthesis is a technique for synthesizing sounds by concatenating individual syllables (i.e. sounds such as 'ba', 'sht', and 'oo') to form words and sentences. DeepMind claims that 'WaveNets are able to generate speech which mimics any human voice and which sounds more natural than the best existing Text-to-Speech systems, reducing the gap with human performance by over 50%' [52].

The TTS contains five basic and five of these improved WaveNet voices (3 female, 2 male). Coaches Carlos, Olivia, Emma, Helen, François and Coda were included in the user studies, so six different voices were necessary. The five 'human' coaches received a WaveNet voice and the robot received a basic voice, because there was no need for the robot to sound more humanlike. The division was as follows:

• Coda: Standard-B
• Carlos: Wavenet-C
• Olivia: Wavenet-A
• Emma: Wavenet-D
• Helen: Wavenet-E
• François: Wavenet-B

The Google TTS provided the possibility to adjust the voices by pitch and speed, and to send SSML in the TTS request to allow for more customization in the audio response.

4.3 Speech Synthesis Markup Language

This section describes the customization of the audio responses via SSML. The essential role of the markup language is to provide authors of synthesizable content a standard way to control aspects of speech such as pronunciation, volume, pitch and rate. For this research the voices have not been adjusted in pitch or speed, because the voices were calm and clearly understandable for older adults. However, the pronunciation could be improved, because some words turned out to be difficult to pronounce, such as names, foreign words and slang. Additionally, the accentuation of the synthesized speech was somewhat strange in certain dialogue parts. Therefore, details on pauses, audio formatting for slang and the pronunciation of names and foreign words have been implemented with SSML.

Some synthesizers (e.g. Amazon's Alexa) include the option to pronounce parts of speech in a different language. This could be helpful in cases where François was talking about 'moi',

6 https://cloud.google.com/text-to-speech
7 https://deepmind.com/blog/article/wavenet-generative-model-raw-audio

or Carlos was greeting with 'Olá'. Unfortunately this 'lang' attribute was not available for the Google TTS. Instead, an attempt was made to give François a French accent by using one of the French voices for all his dialogue content, but this voice was not able to pronounce Dutch sentences correctly, so this idea was discarded. The 'phoneme' attribute makes it possible to pronounce words per small unit of the word and could be useful for words like 'wow' and 'huh', but was not an option in the Google TTS either. Therefore, the more general 'sub' element has been used instead. By replacing the text in the alias attribute value with the text for pronunciation, it was possible to provide a simplified pronunciation of a difficult-to-read word. This specification of word pronunciation worked for some words, such as 'wow', 'Helen', 'Olá' and 'Emilie', but not for words like 'hmm' and 'huh', which were pronounced by spelling out all the letters separately. When the mispronunciation of a word could not be solved via SSML, it was removed from the dialogue.
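A minimal sketch of this 'sub'-based approach is shown below: difficult words are wrapped in an SSML sub element whose alias attribute carries the simplified pronunciation. The alias values and function name are illustrative, not the actual mapping used in the system.

```typescript
// Sketch of wrapping difficult words in SSML <sub alias="..."> elements.
// The pronunciation table below is hypothetical.

const PRONUNCIATIONS: Record<string, string> = {
  "Olá": "ola",       // Portuguese greeting, spelled out phonetically
  "Helen": "Hellun",  // a name the Dutch voices struggled with
};

function toSsml(statement: string): string {
  let body = statement;
  for (const [word, alias] of Object.entries(PRONUNCIATIONS)) {
    body = body.replaceAll(word, `<sub alias="${alias}">${word}</sub>`);
  }
  return `<speak>${body}</speak>`;
}

// toSsml("Olá! I am Carlos.")
// => '<speak><sub alias="ola">Olá</sub>! I am Carlos.</speak>'
```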

4.4 Dialogue Design Strategies

A general strategy to manage dialogues was implemented in the autoforward replies. Autoforward replies contain parts of dialogues that stretch over multiple dialogue steps without requiring a reply. For written text it is logical to include a continue button that a user can click to continue in the dialogue, because the moment to continue depends on the reading speed of the user. For spoken dialogues, in contrast, it is more logical to continue automatically, because this should happen after the coach's statement has been spoken. Therefore, the 'continue' keywords and buttons were removed and the autoforward replies were automatically continued after one second. Only for purposes regarding the experimental setup of one experiment was the continue button left in.

Furthermore, two design issues were considered for this research. As described in Section 2.3.3, two frequent design issues within DM for conversational agents are interaction and confirmation strategies [23]. Interaction strategies determine who takes the initiative in the dialogue (the system or the user), and confirmation strategies deal with uncertainties in spoken language understanding. Designing appropriate interaction strategies was not very relevant for the current project, due to the pre-defined structure. COUCH takes a mixed-initiative strategy, wherein the user takes the initiative first by choosing one of the coaches to talk to. The coach then starts the conversation, within which the coach can take initiative by proposing health-related information. The user is not free to respond with whatever he wants, but instead chooses one of the provided reply options to guide the dialogue. All possible dialogues implemented in the system were fixed in the sense that every dialogue node was connected to other dialogue nodes. Navigation happened by choosing reply options that led the user to a next statement (see Section 4.1.4). In this way, the turn-taking was already implemented, and this mixed-initiative strategy has not been changed when making the system voice-controllable.

In contrast, designing appropriate confirmation strategies was relevant for this research. No confirmation strategies were implemented in the original COUCH application, because uncertainties only arise in the understanding of spoken input, not of written input. Both explicit and implicit strategies have been investigated, but implicit strategies (i.e. in which the system integrates previous input in the next question) were not very suitable due to the restricted dialogue structure. Explicit strategies (i.e. in which the system takes an additional conversation turn), on the other hand, turned out to be a suitable method for repairing the conversation. Such a strategy was necessary when the system did not catch a keyword from the spoken sentence and therefore got stuck in the conversation. For the implementation of this strategy, the Google design guidelines for dealing with conversational errors are used8. These guidelines distinguish between a 'no match' occurrence and a 'no input' occurrence. A No

8https://designguidelines.withgoogle.com/conversation/conversational-components/errors.html

Match error occurred when the system did not understand or interpret the user's response in context, while a No Input error occurred when the system did not detect a response from the user (e.g. because the user had not said anything while the microphone was open, or had not spoken loudly enough). Two different methods for dealing with these errors were implemented.

For the No Match error, some explicit confirmation sentences and questions were hard-coded. These sentences were drawn from the following design suggestions:

• Reiterate the question quickly and succinctly.
• Combine apologies with questions.
• Talk to the user like you're having a human-to-human conversation.

This resulted in the following list of confirmation questions:

• 'Pardon me, I did not hear you properly, could you repeat the answer?'
• 'Excuse me, what did you say?'
• 'Sorry, I did not understand you well, can you say that again?'
• 'Could you repeat that?'
• 'Sorry, I was not wearing my hearing aid, can you repeat your answer?'

The ASR often transcribed spoken speech in small parts of sentences or even separate words, which would have caused an overload of extra conversation turns taken by the system if these questions were asked every time no keyword was spotted. Therefore, this strategy is implemented such that the ASR checked whether no keyword was spotted and an unknown word ('<unk>', as described in Section 2.3.1) was present in the transcription. Only then did the system take an extra turn and ask the user for explicit confirmation. These unknown words are, strictly speaking, words that are not in the vocabulary of the ASR, but in practice an unknown transcription also occurred when the user spoke unclearly and was therefore not understood.

The No Input error was dealt with by repeating a statement after 20 seconds in which no keyword was spotted. In this situation the system 'assumed' that the user was not present or paying attention, or did not hear what the coach just said. After 20 seconds it repeated itself in the hope that the user would respond the second time. Of course this implementation contains some flaws, because it only assumes that the user did not pay attention or did not hear the statement, while it may be that the user has spoken for 20 seconds without mentioning any of the keywords. A better approach is proposed in Chapter 7.
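Taken together, the two repair strategies could be sketched as follows. The timer handling, function names and the '<unk>' check are simplified assumptions based on the description above.

```typescript
// Sketch of the two repair strategies: a clarification question on a No Match
// containing an unknown word, and a repeat after 20 seconds of No Input.

const CLARIFICATIONS = [
  "Pardon me, I did not hear you properly, could you repeat the answer?",
  "Excuse me, what did you say?",
  "Could you repeat that?",
];

let noInputTimer: ReturnType<typeof setTimeout> | undefined;

function armNoInputTimer(repeatStatement: () => void): void {
  // If no keyword is spotted within 20 seconds, the coach repeats itself.
  clearTimeout(noInputTimer);
  noInputTimer = setTimeout(repeatStatement, 20_000);
}

function onNoMatch(transcript: string, speak: (text: string) => void): void {
  // Only ask for confirmation when an unknown word ('<unk>') was transcribed,
  // to avoid an extra turn for every fragmentary transcription.
  if (transcript.includes("<unk>")) {
    const question = CLARIFICATIONS[Math.floor(Math.random() * CLARIFICATIONS.length)];
    speak(question);
  }
}
```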

4.5 Additional Features

A few extra features were added to the speech-based COUCH system, in an attempt to achieve a good user experience. Guidelines for designing voice user interfaces were used, including the advice to use a command-and-control approach for systems that have no idea when the user might speak [53]. In a conversational interface the system cannot automatically distinguish between a random conversation and the start of a conversation with the system. With a command-and-control approach, users must do something explicit to inform the system that they are going to speak [53]. For example, Siri requires the user to press the home button before speaking. When this happens, the system typically responds with audio and/or visual feedback, so the user knows that the system is listening and the user can speak. In the speech-based COUCH system such an approach is implemented by creating buttons to start and stop the speech recognition (see Figure 4.14), allowing users to decide for themselves whether they want to use the speech option or not. This assures better privacy, because there is no need for the recognizer to keep listening for

speech when the application is open but not being used. The 'start speech' button was blocked once it was clicked, and the 'stop speech' button was blocked while the page was initializing or after the 'stop' button was clicked. Additional verbal feedback, speaking 'speech recognition is started' or 'speech recognition is stopped', was used to update the user about the state of the ASR. One big difference between our implementation and Siri is that the speech recognition only needs to be started once, while Siri requires the home button to be pressed every time the user wants to speak. Besides the verbal feedback, we implemented a visual cue that indicated whether the ASR was turned on. When the user clicked the 'start speech' button, a recording button (see Figure 4.15) appeared on the left side to make the user aware that the recognizer was turned on.
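The button behavior described above might look roughly as follows in the WebClient. The DOM element ids and the recognizer interface are assumptions; only the enable/disable pattern and the verbal and visual feedback follow the description.

```typescript
// Sketch of the command-and-control start/stop behavior with feedback.

interface Recognizer { start(): void; stop(): void; }

function wireSpeechButtons(recognizer: Recognizer, speak: (t: string) => void): void {
  const startBtn = document.getElementById("start-speech") as HTMLButtonElement;
  const stopBtn = document.getElementById("stop-speech") as HTMLButtonElement;
  const recordIcon = document.getElementById("record-icon") as HTMLElement;

  stopBtn.disabled = true; // blocked while the page is initializing

  startBtn.onclick = () => {
    recognizer.start();
    startBtn.disabled = true;           // block 'start' once clicked
    stopBtn.disabled = false;
    recordIcon.style.display = "block"; // visual cue: the ASR is on
    speak("speech recognition is started");
  };

  stopBtn.onclick = () => {
    recognizer.stop();
    stopBtn.disabled = true;            // block 'stop' after clicking it
    startBtn.disabled = false;
    recordIcon.style.display = "none";
    speak("speech recognition is stopped");
  };
}
```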

The last added feature was the megaphone (see Figure 4.16), which replaced the small text balloon icons implemented in the original COUCH application. Interactivity was added by making these megaphones responsive to clicking, causing the coaches to repeat the previously spoken statement. This feature was only possible for basic replies and not for autoforward replies, because in the latter scenario the coach automatically continues to the next dialogue step after a short moment.

Figure 4.14: The buttons that turn the ASR on and off. Figure 4.15: The record button that appears when the ASR is turned on.

Figure 4.16: The megaphone that can be clicked to make the coach repeat itself.

4.6 Removed and Ignored Features

Next to the addition of features in the speech-based system, it was necessary to remove or ignore some existing features. First, the radio was removed, because it caused the ASR to capture and transcribe the music from the radio. Consequently, the system could spot a keyword in the transcription of the radio audio instead of the user audio, causing the system to continue to a next dialogue step without any action from the user. Another problem with the radio was the effort of listening to the radio and the coaches at the same time. Therefore we decided to remove the radio and all dialogue parts related to the radio from the system. Two other features, which were not removed but instead ignored by the ASR and TTS because of their irrelevance for the purpose of this study, were the recipe book and the text field input options. The recipe book could be reached via a spoken interaction with François, but it could not be navigated, opened or closed via speech. The same applied to conversations that contained text input fields. These parts could be reached by speech, but names or numbers that had to be entered as input could not be entered via speech. A physical click on the 'enter' button is necessary once these text fields are filled.

Chapter 5

Methodology of Evaluation

This chapter presents the design of two separate experiments that were performed to evaluate the system described in Chapter 4 with participants. The first experiment is designed to explicitly address the differences in evaluation between the two systems, considering the number of errors made by the speech-based application. The external variables (i.e. the devices, the room, the environmental noise, etc.) are therefore kept constant. This controlled study mainly focuses on quantitative data, therefore requires a larger number of participants, and is designed to answer Research Question 1. The second experiment is designed to test the robustness of the speech-based system in a home setting and to investigate the user experiences in such a setting. A small number of participants is asked to use the speech-based application at home via a tablet and to participate in an interview afterwards. This 'in-the-wild' study is designed to obtain qualitative data and answer Research Question 2. In the remainder of this report, the first controlled experimental study is referred to as the controlled experiment and the second descriptive qualitative study as the field experiment.

The chapter starts with a short description of the ethical procedure. Then, for each experiment, it discusses the following subjects: a general description of the setup, a description of each of the measures, the participants, and the full procedure for a single participant. Additionally, a pilot test for the controlled experiment is discussed in Section 5.2.7.

5.1 Ethical Permission

Before the start of the user studies, all required documents were sent to the university's Ethics Committee for ethical approval. These documents are presented in Appendix B and include the information brochure, the consent form, and the ethics checklist. The consent form is a standard form, indicating that the participant is adequately informed about the experiment via the information brochure. Due to the COVID-19 crisis, the ethical approval was somewhat more complicated and included strict regulations. Extra precautionary measures were taken to comply with these regulations, such as guaranteeing 1.5 m distance at all times, cleaning the equipment between sessions and conducting the interviews by telephone. An additional GDPR registration was done for both experiments, because we collected personal data such as participants' demographics and voice recordings of the interviews. Both experiments were reviewed and approved by the Ethics Committee.

5.2 Controlled Experiment

The controlled experiment is designed to compare the speech-based version of COUCH with the original text-based application. During this experiment, participants are presented with both systems and asked to use them consecutively for a limited amount of time. The focus of this study is to collect quantitative data via scale questionnaires and by counting the relative number of errors. A few open questions are added to obtain better insights into the user preferences regarding the written or speech-controlled interface.

5.2.1 Experimental Design

The experiment uses a within-subjects design because of the low potential number of participants. This increases the chances of finding statistically significant results in the quantitative data. It also allows participants to make an explicit comparison between the two conditions in the final questionnaire, which may lead to additional insights. The independent variable is the version of the application (i.e. the original text version or our developed speech version) and the dependent variables are the observational measurements and user experience ratings, measured via the User Experience Questionnaire (UEQ) and an explicit comparison questionnaire. To prevent the results from being influenced by the order of the two conditions, we use a counterbalanced measures design. This means that the order in which the versions are presented to the participants is randomized, with half of the participants starting with the text-based application and half starting with the speech-based application.

5.2.2 Hypotheses

As introduced in Section 1.3, Research Question 1 asked: Does the addition of speech to the Council of Coaches application lead to an increase in evaluations of the users' experiences?

Research Question 1 needs to be broken down into more specific hypotheses that can be validated or falsified through statistical analysis. Four hypotheses have been created for this purpose, based on the findings described in Section 1.2 that different studies [17, 18] suggest potential benefits for spoken conversational interfaces. Speech might be able to improve engagement because it can make the human-computer interaction more interesting and natural. The following hypotheses were thus formulated:

H1: The interaction with the speech-based COUCH application leads to a significant increase in user evaluations on the UEQ scale Stimulation.
H2: The interaction with the speech-based COUCH application leads to a significant increase in user evaluations on the UEQ scale Novelty.
H3: The interaction with the speech-based COUCH application leads to a significant decrease in user evaluations on the UEQ scale Efficiency.
H4: The interaction with the speech-based COUCH application leads to a significant increase in user evaluations measured via an explicit comparison questionnaire.

5.2.3 Experimental Setup

To create a controlled experiment where the environmental conditions were the same for all participants, the experiment sessions were performed in the same lecture room in the Ravelijn building at the University of Twente. This lecture room proved to have good acoustics, without background noise or reverberation. In this lecture room two desks were placed side by side, with two chairs behind them. A computer was positioned at one of the desks and a separate laptop on the other desk. Participants used the COUCH systems on the laptop, which was connected to a Snowball microphone to assure good quality speech input, and filled in the questionnaires on the computer. In this way, a distance of 1.5m between the researcher and participant could be preserved (e.g. while the participant filled in a questionnaire, the researcher could already set up the next system). The laptop and computer were cleaned between every session. The setup

of the experiment is pictured in Figure 5.1, where the laptop with microphone is positioned on the left desk and the computer on the right desk.

Separate accounts were created by repeating the intake for every participant, making the dialogues more personalized. Names of participants could be used by the coaches to personally address the participant. Additionally, separate accounts kept the dialogue content constant for all participants, because no prior interaction had taken place via their account. The only interactions that were skipped for all participants were the introductory conversations wherein the coaches explained their role in the Council. Moreover, it was possible to retrieve log data from the interactions (e.g. date, time, coach, dialogue step, dialogue content), allowing us to check the dialogue data used for the observational measurements. One account per user made this process much more structured.

Two documents were provided to the participants: (1) a sheet with an overview of the coach names and descriptions, and (2) a user manual. The user manual contained a general introduction and an explanation of the operation of the text- and speech-based application. Furthermore, this manual contained the steps that participants were asked to perform during the interaction with the speech-based system. These steps were also used as a guideline during the interaction with the text-based system, although for this system the dialogues and interactions could not be manipulated, making it impossible to guarantee these exact options. Moreover, the steps were designed to assure an interesting vocal interaction wherein the user is required to reply (i.e. there are not many autoforward replies) and without too many yes/no replies, which is less relevant in case of the textual interaction. The specified steps included the following:

1. Start an interaction with Coda and let him explain the interface to you.
2. Start an interaction with a coach of choice and go through with this conversation until it is finished and you are redirected back to the living room. Suggestions for interactions based on your interest are the following:
   • Active lifestyle advice: Olivia
   • Music instruments: Emma – I just want to talk
   • Soccer: Carlos – I just want to talk
   • Quiz about the memory: Helen – do a coaching session
   • Nutrition advice: Francois – talk about food
3. Start an interaction with a coach of choice and end it at a moment of choice.

Furthermore, participants were told that they could test two speech-specific functions:

• The repeat function of the megaphone
• The repetition by the coach, triggered by remaining quiet for a while

Participants were only required to have the interaction with Coda to get the interface explained. The remainder of the steps were not meant as hard requirements to follow, but rather as suggestions for interesting interactions and for getting a broad idea of all system components. Participants were allowed to do interactions that were not on the list and to do the steps in a different order.

Figure 5.1: The experimental setup of the controlled experiment.

5.2.4 Measures

This section describes the measures taken during the experiment, consisting of an intake questionnaire, observational measures, a validated user experience questionnaire and a final questionnaire explicitly comparing the two systems. Questionnaires were presented via Google Forms, an online service. This allowed participants to fill out the surveys without requiring additional interaction with the researcher, and the data was directly available online. Complete versions of each questionnaire as presented to participants can be found in Appendix A.

Intake Questionnaire

The intake questionnaire is simply meant to collect demographics about each participant that could be relevant. Participants were asked to fill in their gender, age, level of education and working status. Additionally they were asked to indicate their level of experience with speech technologies. People with more experience with speech technologies (e.g. Siri, Google Home, Alexa) generally know how to interact with these kinds of systems, possibly leading to a better understanding between human and machine and differences in the error rate [53].

Observational Measures

The first source of data is an observational measurement, in which the number of times participants are not understood or did not elicit a response from the system is counted. These measures are collected in real time by keeping a tally of different error types via the observational measurements sheet (presented in Appendix A.1.2) and are used to give insight into the efficiency and reliability of the system. If the system makes many recognition errors by misrecognizing speech or not recognizing speech at all, it cannot be perceived as reliable or efficient in use, because the interaction often gets stuck.

The accuracy of the system is calculated from the number of times that there is no response from the system, which means the ASR fails in understanding and handling the spoken input. However, since not all errors have the same effect on the system, it is important to distinguish between different types of errors. For the objective measurements done in this study, we defined the following error types:

1. ASR-errors: the system does not recognize the spoken input and therefore does not respond to the user input. The system cannot proceed to the next dialogue step.

2. User-errors: the user said something that was out of the vocabulary of the system and therefore could not be recognized.

The system partly dealt with ASR-errors by including conversation repair mechanisms (i.e. repeating the statement after 20 seconds and asking one of the clarification questions). Although it is still an error by the system, it could be part of a humanlike conversation and is therefore called a 'recovery' error. Another type of error that occurred during the development of the system was in recognizing the names of the coaches. However, this was not the most essential part of the conversation and not extensively researched in the first place, so these 'name'-errors are treated separately and are not considered in the accuracy calculation. The accuracy calculation is based on the 'fatal' errors: errors that cause the dialogues to get stuck.

Just counting the number of errors to determine the accuracy was somewhat difficult due to the degree of freedom in the interactions. Because not all participants did the same number of dialogue steps, counting misunderstandings by the system was more complicated. For this reason, a relative error percentage is calculated in which the number of misunderstandings is weighted against the number of dialogue steps. This approach also accounts for autoforward replies, which are easier to get through because they do not expect a reply. The relative error percentage (p_rel) is calculated by dividing the absolute number of fatal errors (e_abs) by the total number of dialogue steps (n) and multiplying by 100. This calculation is defined in Formula 5.1.

p_rel = (e_abs / n) · 100%    (5.1)

Besides the ASR- and user-error types, one more type of error can be distinguished: dialogue-errors. In this situation the user says something which is correctly recognized by the system, but the dialogue then continues to the wrong dialogue step. This means that the dialogue step was not labeled with the correct keywords, or multiple replies contained the same keyword, and therefore a correct recognition did not result in the continuation to the intended dialogue step. Logging data can trace back this type of error, but it was not considered very important for the current research. Moreover, this type of error can be dealt with by carefully evaluating every reply option with its related keyword tag, especially for reply options that use the last word as keyword tag (although this might be a time-consuming process).
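To illustrate, a minimal sketch of this calculation in Python (the function and variable names are ours, not part of the COUCH code base):

import sys

def relative_error_percentage(fatal_errors: int, dialogue_steps: int) -> float:
    """Formula 5.1: weigh the absolute number of fatal errors (e_abs)
    against the total number of dialogue steps (n)."""
    if dialogue_steps <= 0:
        raise ValueError("at least one dialogue step is required")
    return 100.0 * fatal_errors / dialogue_steps

# A participant with 8 fatal errors in 18 dialogue steps:
sys.stdout.write(f"{relative_error_percentage(8, 18):.1f}%\n")  # 44.4%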

User Experience Questionnaire

The user experience is one of the most important concepts in designing any technological application. When users have an unpleasant experience with an application, they tend to never use it again. However, because this concept encompasses so many different aspects and can be very subjective and different for each user, it can be difficult to quantify user experience. One of the more commonly used tools for tackling this issue is the user experience questionnaire (UEQ) [54]. The original German version of the UEQ was created via a data-analytical approach in order to ensure the practical relevance of the constructed scales. After extensive research and many usability tests to test the reliability (i.e. the scales are consistent) and validity (i.e. the scales really measure what they intend to measure), the UEQ resulted in a questionnaire containing 6 scales with 26 items [55]:

• Attractiveness: Overall impression of the product. Do users like or dislike it?
  Items: annoying/enjoyable, good/bad, unlikable/pleasing, unpleasant/pleasant, attractive/unattractive, friendly/unfriendly
• Perspicuity: Is it easy to get familiar with the product?
  Items: not understandable/understandable, easy to learn/difficult to learn, complicated/easy, clear/confusing
• Efficiency: Can users solve their tasks with the product without unnecessary effort?
  Items: fast/slow, inefficient/efficient, impractical/practical, organized/cluttered
• Dependability: Does the user feel in control of the interaction?
  Items: unpredictable/predictable, obstructive/supportive, secure/not secure, meets expectations/does not meet expectations
• Stimulation: Is it exciting and motivating to use the product?
  Items: valuable/inferior, boring/exciting, not interesting/interesting, motivating/demotivating
• Novelty: Is the product innovative and creative?
  Items: creative/dull, inventive/conventional, usual/leading edge, conservative/innovative

These items are presented on a seven-stage scale to reduce the well-known central tendency bias for such types of items [54]. Each end of the scale contains one of the opposite terms of an item, for example:

not understandable O O O O O O O understandable

The full list of questions is presented in Appendix A.1.3. The order of the positive and negative term of an item is randomized in the questionnaire. Per dimension, half of the items start with the positive and half with the negative term [54]. Moreover, the order of the items is randomized, so the scales are not presented in sequence. Because this test has a strong psychological character, participants are asked not to spend a lot of time thinking about their answers, but rather to respond based on their initial instincts.
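For illustration, the sketch below (Python; our own illustrative helper, mirroring what the UEQ analysis tooling does internally) maps raw 1-7 answers onto the -3 to +3 range used in the analysis, reversing items whose positive term is shown on the left:

def ueq_item_score(raw_answer: int, positive_term_first: bool) -> int:
    """Map a raw 1..7 UEQ answer onto the -3..+3 range.
    Items that present the positive term on the left are reversed,
    so that +3 always corresponds to the positive pole of the item."""
    if not 1 <= raw_answer <= 7:
        raise ValueError("UEQ answers must lie between 1 and 7")
    score = raw_answer - 4  # shift 1..7 onto -3..+3
    return -score if positive_term_first else score

# Ticking the rightmost circle of 'not understandable ... understandable':
assert ueq_item_score(7, positive_term_first=False) == 3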

The UEQ considers aspects of pragmatic and hedonic quality [56]. Perspicuity, Efficiency and Dependability are pragmatic quality aspects referring to the perceived usefulness, efficiency, and ease of use (so-called utility and usability aspects). Stimulation and Novelty are hedonic quality aspects referring to the 'joy of use'. Attractiveness is a pure valence dimension, referring to the intrinsic attractiveness (i.e. positive qualities) or averseness (i.e. negative qualities). The items included in the attractiveness dimension are almost all related to the appearance of the graphical interface. Since we barely made adjustments to the look of the application and items like 'attractive/unattractive' and 'friendly/unfriendly' were not very applicable, the attractiveness scale was removed from the questionnaire. No single items can be removed from a validated questionnaire like the UEQ, but a complete dimension like attractiveness can [54]. By doing this, we also attempted to reduce the time demanded from the participants, because the within-subject design already caused the experiment to be lengthy.

To analyze the data resulting from the UEQ, two different types of software were used: (1) an Excel-Tool for data analysis that is available free of charge via the UEQ website1, and (2) the statistical software SPSS. The second tool was used for analyses that were not available or were incomplete in the provided Excel-Tool. Analyses performed via the Excel-Tool include:

• Data pre-processing and assumption checking (for the t-test)
• The scale means and the means per item
• Cronbach's alpha

1www.ueq-online.org

Analyses performed via SPSS include:

• Assumption checking (for the t-test)
• Paired samples t-test

The t-test available in the Excel-Tool was not used because this was an unpaired two-samples t-test, while we wanted to do a paired samples t-test. The paired samples t-test is used to determine whether the mean difference between two sets of observations is zero (i.e. there is no difference between the means). In a paired samples t-test, each subject is measured twice, resulting in two different sets of ratings for the same subject. As a parametric procedure (i.e. a procedure which estimates unknown parameters), the paired samples t-test makes several assumptions2: (1) the dependent variable must be continuous, (2) the observations are independent of one another, (3) the dependent variable should be approximately normally distributed, and (4) the dependent variable should not contain any outliers. Assumption (1) is met since the UEQ contains interval questions, and for assumption (2) it can be reasonably assumed that the data collection process was random and all participants were independent of one another. Assumption (3) is checked in SPSS with the Shapiro-Wilk test for normality. All values with a significance value of p > 0.05 cannot reject the normality null hypothesis and indicate a normal distribution. Assumption (4) is checked with the provided Excel-Tool, which provides a simple heuristic checking how much the best and worst evaluation of an item in a scale differ [54]. Although such situations can also result from strong differences in opinions or a misunderstanding of an item, a big difference (>3) in the evaluation of an item is seen as an indicator of a problematic data pattern. The occurrence of such a single case is not considered problematic, but when this is true for three or more scales, this might be an indication that the response is suspicious. The analysis tool suggests removing answers from the data set that show a critical value of 3 or higher. Assumptions (3) and (4) are checked during the data pre-processing (Section 6.1.2).

Besides comparing means and checking significance levels retrieved from the t-test, we were interested in the effect sizes (Cohen's d). A significant p-value tells us that there is a difference in user experience ratings, whereas an effect size tells us how big this difference is. SPSS does not support this statistical measure, but the effect size for a paired-samples t-test can easily be calculated by dividing the mean difference by the standard deviation of the differences, as shown in Formula 5.2. An effect size around d = 0.2 is considered a small effect, d = 0.5 a medium effect and d = 0.8 a large effect.

d = mean / SD    (5.2)
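As a minimal sketch of this analysis pipeline (in Python with SciPy rather than SPSS, so only an approximation of the procedure described above; here the normality check is applied to the paired differences):

import numpy as np
from scipy import stats

def compare_ueq_scale(text_scores, speech_scores):
    """Paired comparison of one UEQ scale between the two versions."""
    text = np.asarray(text_scores, dtype=float)
    speech = np.asarray(speech_scores, dtype=float)
    diff = speech - text

    # Assumption (3): Shapiro-Wilk normality test
    # (p > 0.05 cannot reject the normality null hypothesis).
    _, p_normal = stats.shapiro(diff)

    # Paired samples t-test: is the mean difference zero?
    t_stat, p_value = stats.ttest_rel(speech, text)

    # Formula 5.2: Cohen's d = mean difference / SD of the differences.
    cohens_d = diff.mean() / diff.std(ddof=1)

    return {"shapiro_p": p_normal, "t": t_stat, "p": p_value, "d": cohens_d}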

Explicit comparison

The final questionnaire, provided after the interaction with the second system, is designed to explicitly compare both COUCH versions. The questions posed here are not part of any existing questionnaire, but are instead formulated for the comparison of the text- and speech-based system. Inspiration for the questions came from a paper that described similar research, comparing a coaching system employing plain text messages to deliver feedback with an ECA delivering the feedback [57]. Our explicit comparison questionnaire presented participants with 15 statements for which they had to indicate which of the two COUCH versions they felt the statement applied more strongly to, rated on a five-point scale. Values closer to -2 indicated a stronger association with the text-version, and values closer to +2 represent stronger associations with the speech-version. Zero corresponds with a neutral 'no preference' value. The statements related to different areas, some overlapping with items from the UEQ.

2www.statisticssolutions.com/manova-analysis-paired-sample-t-test/

For the following statements participants were asked: 'Please indicate which version you ...'

1. ... thought was more pleasant to use
2. ... thought was more efficient
3. ... thought was more interesting
4. ... thought was more credible
5. ... thought was more fun
6. ... would use for a longer period
7. ... would recommend to someone older than 55
8. ... would follow advice from more often
9. ... would rather spend money on
10. ... thought was more practical
11. ... thought was more cumbersome to use
12. ... thought was more monotonous
13. ... thought was more annoying
14. ... thought was more inconvenient to use
15. ... thought was more repetitive

The data is analyzed using SPSS. First, a one-sample t-test is used to determine whether any of the mean values found is significantly above or below zero, the 'no preference' value. This indicates whether for any question there is a significantly stronger tendency towards the text-version (value close to -2) or the speech-version (value close to +2). Since we performed 15 different t-tests, the alpha-coefficient is adjusted down, otherwise the number of false positives would become too high. We therefore used the Bonferroni correction, which is calculated by dividing the 'standard' alpha (0.05) by the number of t-tests. This resulted in a significance level of p < 0.00333. Similar to the UEQ questionnaire, we were also interested in the effect sizes. Formula 5.2 is used to calculate Cohen's d for the explicit comparison statements.
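A sketch of the same procedure (again SciPy rather than SPSS; the names are illustrative):

import numpy as np
from scipy import stats

N_STATEMENTS = 15
BONFERRONI_ALPHA = 0.05 / N_STATEMENTS  # = 0.00333...

def test_statement(ratings):
    """One-sample t-test of one comparison statement against 0,
    the 'no preference' value on the -2..+2 scale."""
    ratings = np.asarray(ratings, dtype=float)
    t_stat, p_value = stats.ttest_1samp(ratings, popmean=0.0)
    cohens_d = ratings.mean() / ratings.std(ddof=1)  # Formula 5.2
    return t_stat, p_value, cohens_d, p_value < BONFERRONI_ALPHA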

Besides asking participants to indicate their preference towards a version in a quantitative manner, we asked participants about the reasons for their specific preference via open questions. These questions were in the form of written open questions. The choice for written questions was made because the total number of interviews to conduct (N=28) was quite large. Additionally, due to the focus on quantitative data that was also acquired online, not many in-depth interview questions were asked. Lastly, participants for this experiment were mostly younger adults, who in general are able to easily read and write text. The following questions were asked:

1. Which version did you prefer and why?
2. Which version was easiest for you to interact with and why?
3. Which version was most fun to use and why?
4. Which version would you recommend to someone older than 55?
5. What did you experience as an advantage of the text-version?
6. What did you experience as a disadvantage of the text-version?
7. What did you experience as an advantage of the speech-version?
8. What did you experience as a disadvantage of the speech-version?

To analyze the results of the open questions, two methods were used. First, the number of participants that preferred a certain version in multiple respects is investigated. Then the reasons why participants chose these versions are investigated. This is done via a thematic analysis in which the answers were bundled into common answers for every question. The results are presented

in separate tables, containing the reasons mentioned by participants and the number of times they mentioned that specific reason. The same was done with the advantages and disadvantages, which were bundled into common answers and presented in a table, together with the number of times they were mentioned.
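In essence this bundling reduces to a frequency count over hand-assigned categories; a minimal sketch (the example answers here are invented):

from collections import Counter

# Each free-text answer has been bundled by hand into a common category.
bundled_answers = ["more natural", "more fun", "more natural",
                   "easy to use", "more natural", "more fun"]

# The tables report every reason with the number of times it was mentioned.
for reason, count in Counter(bundled_answers).most_common():
    print(f"{reason}: {count}")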

5.2.5 Participants

The study population for the controlled experiment consisted of both adults and older adults. The term older adult is defined as ≥ 55 years of age, and adult is defined as ≥ 18 years of age. Due to the nature of this experiment, which focused on obtaining quantitative data, a total of at least 25 participants was estimated to be sufficient. More participants would be expected to give more reliable results, but due to the controlled experimental setup and the extra COVID-19 regulations, this was too time-consuming.

The people approached to participate in the study were mainly students and, to a lesser extent, people from the target group. A total of 28 participants was approached via telephone or in person at the university. The division of the group by gender, age, working status and experience with using speech technologies is presented in Table 5.1.

                                19-24    25-30   55+     Total
                                (N=17)   (N=7)   (N=4)   (N=28)
Gender             Female       11       5       2       18
                   Male         6        2       2       10
Work status        employed     1        4       2       7
                   retired      0        0       2       2
                   student      16       3       0       19
Education          MBO          0        0       1       1
                   HBO          0        1       1       2
                   WO           17       6       2       25
Experience with    yes          1        1       2       4
speech             no           14       6       2       22
technologies       moderate     2        0       0       2

Table 5.1: Demographics of the participants.

5.2.6 Procedure

This section discusses the procedure of the controlled experiment, which lasted a maximum of 35 minutes, with an average time of 25-30 minutes.

Recruiting phase

The recruiting phase for the controlled experiment was very simple and started by asking potential participants, via the phone or in person, whether they wanted to participate in the study. The potential participants recruited via phone were mainly friends and acquaintances of the researcher, while the people approached in person were random students and employees present at the university. Due to the COVID-19 crisis, approaching participants at the university was difficult. Especially the (generally older) employees worked from home, so they could not be recruited on location. All potential participants received written or verbal information about the research, but not yet the complete information sheet. In case of a positive response, a moment was scheduled for the session.

Introductory phase

When entering the room, the participant was welcomed and given a small introduction to the experiment. At this point, the participant received the information sheet that contained a general explanation of the experiment, the system being tested, the data collected and the expectations placed on the participant. Moreover, the participant received the user manual, including the steps to perform during the experiment, and an overview with the names and descriptions of all coaches. The participant was told that there is no right or wrong in performing the steps. Finally, the researcher asked the participant if there were any remaining questions about the experiment or the research in general. Once this procedure was done, a consent form was handed over and the signing of this form by both the participant and researcher concluded the introduction. All forms and information sheets can be found in Appendix B. The introductory phase lasted approximately 5-10 minutes, depending on the participant's reading speed.

Testing phase

The testing phase started with the participant completing the demographics questionnaire. After completion, the participant was asked to start using the first application and follow the steps as defined in the user manual. The researcher reported the number and nature of the errors made by the system on the observational measurement log sheet. The chosen dialogue paths and the number of dialogue steps were also added to this sheet. After the participant completed the steps for the first system, he received the first UEQ on the second device. In the meantime the researcher set up the second application. When the participant finished the first UEQ, he went back to the laptop and performed the steps for the second system. Then the participant was asked to fill in the UEQ for the second system, and to continue with the explicit comparison and open questions. The testing phase took about 20-30 minutes.

Debriefing phase

During the debriefing phase of this experiment, the participant was thanked for his participation and asked if he had any questions or remarks. In case the participant was interested in receiving the results of the research, an email address was noted down to send these at a later stage.

5.2.7 Pilot Test

In order to test both the software and the procedure, a pilot test was performed prior to the actual experiment. No technical issues occurred during the pilot test, but it revealed that the experiment was too long. The initial list of steps to perform included six steps, of which three asked for a different type of interaction (i.e. listen to a story, do a coaching session or listen to advice). Additionally, the pilot test indicated that the different dialogue types were not directly clear to the participant when using the application for the first time. The list of steps was reduced to four steps, of which one step included an interaction with a coach from beginning to end. Suggestions for interesting interactions were provided, instead of asking the user to have a specific interaction type. For example, a participant interested in football was suggested to interact with Carlos to talk about football.

The initial plan for the observational measures was to record the time participants voluntarily played around, but the pilot test revealed that this measurement did not work. Participants had already explored most options in the previous interactions, and steps were not clearly performed one by one, but rather in random order. For this reason, timing the sessions was left out of the observational measurements, but the step to voluntarily play around remained included, so participants could decide for themselves whether they had a clear impression or wanted to see more.

5.3 Field Experiment

5.3.1 Experimental Design

Contrary to the first, controlled experiment, the second experiment used an exploratory field approach in which the system is tested in a natural setting. The field experiment is meant as a 'long-term' study in which participants used the application in their own house for one week. It is used to test the robustness of the speech-based application in a natural environment, and thereby investigate the user experiences in such a setting. To assess these differences, people with experience using the original text-based COUCH application, through participation in a previous study [58], were approached for this experiment. Data was obtained via qualitative in-depth interviews, attempting to obtain rich data.

5.3.2 Experimental Setup

The initial plan for the field experiment was to ask participants to use their own tablet, but some participants did not own a tablet that was able to run the speech application. For this reason they were provided with a Samsung Galaxy Tab A 10.1, which was delivered at home and picked up one week later. A shortcut to the web application was added to the tablet's home screen to make it easily accessible. All participants received the link to the application and a set of documents, including a user manual and a journal. The manual elaborately explained all functionalities of the system and contained a general introduction and help-desk contact details in case the participant ran into trouble. The journal (see Appendix B.2.4) was a compact form containing multiple text fields, used to write down the time spent with the system, the technical problems experienced and general thoughts about the system. This journal was provided to help participants better remember their experiences and problems when discussing those during the interview. The participants without previous experience with the COUCH application additionally received the coach sheet that was used during the controlled experiment (Appendix B.1.3).

The interviews were recorded (audio only) using the recording function for telephone calls. After 24 hours these recordings were used to summarize the findings retrieved from the interviews. Complete transcripts were created, but only the relevant comments by the participants were summarized per subject, analyzed per topic and presented in this report.

5.3.3 Interviews

The interviews were conducted by means of verbal communication, the main reasons being the low number of participants, the nature of the experiment and the target group of older adults. In-depth interviews provide richer qualitative results, which are potentially valuable. Additionally, older adults in general experience more difficulty reading and typing longer texts on a computer, so conducting the interviews verbally is more appropriate.

The interview questions were relatively informal and loosely structured, and focused on understanding the target group's attitude towards the application, based on its robustness in a real-life setting. Additionally, one question regarding user opinions compared to the original text-based system was added. The interview questions are listed in Appendix A.2.2, but since the interviews were not strictly organized, the final set of questions could differ somewhat per participant. The general subjects that were discussed are the following:

• General impression of the system
• Practical problems experienced
• Way of interacting with the system
• Interest in using the application in the future
• Recommendation of the system
• Suggestions for improvements
• Preferred version
• Additional comments

5.3.4 Participants

The study population consisted of older adults (i.e. ≥ 55 years of age). As described in Section 5.3.1, the intent of the field study was to recruit five participants from the group of people who participated in the original COUCH study [58]. This small number was sufficient because of the qualitative approach of the experiment. Unfortunately, only two ex-participants could be recruited for the field experiment, which was too few even for qualitative data. For this reason, two other older adults without experience with the previous COUCH version were approached to participate. These differences between participants were carefully taken into account during the evaluation of the data, but because of the qualitative nature of this experiment, we did not expect this to cause problems.

The final group of participants in the field study consisted of four participants between the ages of 61 and 67 years. There were three female participants and one male participant, with an average age of 64 years. All of them had followed higher education. Two participants were retired, one participant had a paid job and one participant did volunteer work. One participant had experience with speech technologies, while the other three had not.

5.3.5 Procedure

This section discusses the procedure of the field experiment. Because of the COVID-19 regulations in place, physical contact was avoided as much as possible.

Recruiting phase

The recruitment of participants for the field experiment started with an advertisement in the newsletter of the original COUCH study. The potential participants who responded to this advertisement received general information about the research, but not yet the complete information sheet. When it became clear that no other participants would be recruited via the advertisement, two people aged over 55 were contacted via telephone to participate. At that moment, the information sheet, user manual and informed consent form were sent via email to all participants. The general information sheet contained the same type of information as the information sheet of the controlled experiment. The two participants who had never used the COUCH application additionally received the coach sheet. With this email, the participants were also asked to confirm their participation. The participants who were able to digitally sign the consent form sent the signed form via email, and the participants who experienced difficulties signing the form online did the consent procedure verbally during the introductory meeting. This consent form is similar to the one for the controlled experiment. The consent form, coach sheet and information sheet can be found in Appendix B. Once participation was confirmed and any remaining questions were answered, an appointment was made for the introductory meeting.

Introductory meeting

The introductory meeting lasted a maximum of 15 minutes and started with the verbal consent procedure (for the participants who had not given consent yet), by reading the description and items of the consent form. Participants were asked if they understood the content and if they agreed to participate. Afterwards, the procedure and the setup and usage of the system were explained, in case these were not clear from the manual, as well as the journal, which had to be filled in for each day of usage. Furthermore, an appointment for the final interview was planned about one week after the introductory meeting. Finally, participants were asked to fill in the demographics questionnaire after the meeting. The two participants without previous experience with the original COUCH application were additionally asked to have one session with the text-based COUCH application to gain insight into this version. The links to both the intake questionnaire and the text-based application were sent to the participants after the introductory meeting. When they finished, either with or without using the text-based system first, they could start the testing week. Completing these questionnaires and making an appointment for the debriefing concluded the introduction.

Debriefing phase

The debriefing is the final phase of the experiment, which took place after the testing period was completed. The researcher set up a meeting via telephone to conduct the interview (discussed in Section 5.3.3). At the end, participants had the opportunity to ask questions or leave comments about the experiment or the research as a whole. This debriefing phase lasted approximately 35 minutes.

Chapter 6

Results

This chapter presents the data analysis and results of the controlled study, designed to answer Research Question 1 (Section 6.1), and the field study, designed to answer Research Question 2 (Section 6.2). First, the results of the controlled experiment are discussed, starting with general observations made during the experiment. Then results of each source of collected data are presented: the observational measures, user experience questionnaires, explicit comparison questionnaire and answers provided to the open questions. In the second part of this chapter, the qualitative data collected from the field study is presented and discussed per question.

6.1 Controlled Experiment

A few observations were made during the experiment. First, participants who had used speech technologies before seemed to 'try out' the system. These participants tried to say things that were related to the content written on the reply options, although not exactly matching, and checked how far they could deviate from the reply options and still continue in the dialogue.

Second, the system made many errors on isolated and short words, for example: 'oké', 'true', 'no', 'stop' and 'sure'. Most fatal ASR-errors were caused by speaking such words. This finding relates to our earlier approach of adding the first word of a sentence as keyword to the list of keywords; because the first word of a sentence was often not recognized, the implementation was changed to adding the last word to the keyword list. Besides short words, the system made many errors in the recognition of coach names, although this depended on the name and the person who pronounced it. Names like 'Emma' and 'Francois' were better captured than names like 'Helen' and 'Coda'. Moreover, the name Helen was difficult to pronounce for the TTS, suggesting that it might be a difficult name for computer processing in general. Because coach names were often spoken as isolated words, two problems were experienced simultaneously. When, for example, a full sentence like 'Good to see you, Emma' was spoken, the response improved.

Lastly, it was noticeable that some participants directly stopped using the mouse while interacting with the speech-based system. Other participants kept their hands on the mouse to keep using it. As a consequence, some participants directly clicked on a coach or reply option when there was a slow response from the system. In general, when participants were told that they did not need the mouse, except for turning the speech recognizer on or off, the interaction improved. In this situation participants became more patient and waited for the system to respond, without directly clicking as an alternative response mode.

6.1.1 Observational Measures

As described in Section 5.2.4, observational measures were performed to gain insight into the accuracy of the speech recognizer. The error percentage is calculated from the number of dialogue steps, the number of fatal errors and Formula 5.1. These results are presented in Table 6.1, together with the number of errors of each type. Since user-errors are not caused by the ASR, these errors are not taken into account in the relative error percentage.

ID       Nr. of      ASR-errors (no response)      User-error    Error
         dialogue    Fatal    Names    Recovery    (no option)   percentage
         steps
1†       18          2        1        2           1             11.1%
2†       15          3        0        5           10            20.0%
3        21          2        0        0           1             9.5%
4        28          5        4        3           0             17.9%
5        21          4        0        2           1             19.0%
6        10          3        0        2           0             30.0%
7        20          4        0        0           0             20.0%
8        13          1        0        1           0             7.7%
9        12          3        0        4           0             25.0%
10       21          4        4        1           0             19.0%
11       35          8        1        0           0             22.9%
12       17          2        2        0           4             11.8%
13†      25          1        0        0           1             4.0%
14       16          4        0        1           1             25.0%
15       19          4        1        3           0             21.1%
16       24          3        1        0           0             12.5%
17       21          4        2        2           0             19.0%
18       13          3        0        2           1             23.1%
19       18          3        3        0           1             16.7%
20       15          3        4        1           0             20.0%
21       12          3        1        3           0             25.0%
22       25          7        3        0           0             28.0%
23       24          2        2        0           2             8.3%
24       18          8        0        1           0             44.4%
25       16          1        0        0           0             6.3%
26†      19          4        0        0           0             21.1%
27       25          2        0        1           1             8.0%
28       17          1        0        1           0             5.9%

Average  19.2        3.4      1.0      1.3         0.9           17.9%

Table 6.1: The number of errors of each type and the relative error percentage. Participants marked with the † symbol have experience with speech technologies.

Table 6.1 shows that a substantial number of errors is made in the (correct) recognition of speech. The error percentages are especially high when compared with the text-version, which (almost) never responded incorrectly or failed to respond to a user reply. The results of the observational measures show a wide spread in the error percentage and the number of dialogue steps between participants (see Table 6.2). One participant (who did 25 steps) only once did not get a response from the system, resulting in an error percentage of 4.0%, while another participant (who did 18 steps) ran into 8 fatal ASR-errors, resulting in an error percentage of 44.4%.

                        N    Min   Max    Mean    Standard deviation
Nr. of dialogue steps   28   10    35     19.21   5.520
ASR-errors              28   1     8      3.36    1.870
Names                   28   0     4      1.04    1.401
Recoveries              28   0     5      1.25    1.378
User-errors             28   0     10     0.86    1.995
Error percentage        28   4.0   44.4   17.94   8.91

Table 6.2: The descriptive statistics of the observational measurements.

Participants marked with the † symbol in Table 6.1 had experience using speech technologies (e.g. Google Home or Siri, and one participant specifically with the NLSpraak recognizer). It is noteworthy that the two participants with the most user-errors were participants with experience with speech technologies. This is in line with the observation that these participants tried to test the boundaries and find out how far the vocabulary of the system reached.

6.1.2 User Experience

Data Pre-processing

As described in Section 5.2.4, the UEQ questionnaire is used to assess the differences in user experience between the two system versions. However, to assess these differences with a t-test, two assumptions still had to be checked: the dependent variable should be approximately normally distributed, and the dependent variable should not contain any outliers.

First, we performed the Shapiro-Wilk test to check for a normal distribution of the dependent variable (i.e. the UEQ rating for each scale per system version). For the data from the text-version, a normal distribution was found for all scales. The data from the speech-version showed a normal distribution for all scales except the perspicuity scale, which resulted in a significance level of p = 0.004 (p < 0.05 indicates no normal distribution). The complete results are presented in Appendix C. Because this assumption is only violated for one variable, we decided to continue and take this into account when doing the t-test for the perspicuity scale.

The second assumption required that the dependent variable does not contain outliers. It can happen that not all participants answer all items seriously, although this problem is more often experienced when the UEQ is applied as an online questionnaire. Before performing the analysis, the data was checked for missing data. All participants responded to all questions in the questionnaires, leading to a complete dataset without missing values. Second, the data was checked for random or non-serious answers with the UEQ Excel-Tool. As mentioned in Section 5.2.4, big differences (>3) in the evaluation of an item are an indicator of a problematic data pattern. In our study, no such critical value of 3 was observed, indicating that no

random answers, response errors or misunderstandings were identified for any of the participants. This means that the overall UEQ had been understandable and the respondents had answered seriously. However, it is important to note that in the current study the attractiveness scale was left out of the questionnaire, meaning that the number of scales is lower than in the traditional UEQ version and the critical value benchmark might get lower. Nevertheless, only two participants per group (both text and speech) showed a critical value of 2, indicating that the majority of respondents understood the questions and answered seriously.

Overview of System Evaluations

The means and variances per UEQ scale (presented in Table 6.3a and Table 6.3b) are generated as a starting point for the analysis. All values between -0.8 and 0.8 represent a neutral evaluation of the corresponding scale, while values > 0.8 represent a positive evaluation (values < -0.8 represent a negative evaluation). The results show a positive evaluation for the perspicuity, efficiency and dependability scales in both versions. An additional positive evaluation can be observed for the stimulation and novelty scales in the speech-version, while there is a neutral evaluation for these scales in the text-version. These findings already indicate a (small) difference in the evaluation of both systems, but this is tested statistically in the next sections.

(a) text-version

UEQ scale         Mean    Variance
Perspicuity ↑     1.722   0.28
Efficiency ↑      1.120   0.61
Dependability ↑   1.157   0.43
Stimulation →     0.426   1.00
Novelty →         0.343   1.58

(b) speech-version

UEQ scale         Mean    Variance
Perspicuity ↑     1.593   0.46
Efficiency ↑      1.259   0.79
Dependability ↑   0.963   0.44
Stimulation ↑     1.259   0.51
Novelty ↑         1.593   0.93

Table 6.3: UEQ scale means and variances. A ↑ symbol indicates a positive evaluation (value > 0.8) and the → symbol indicates a neutral evaluation (-0.8 < value < 0.8).

Besides the means per scale, we looked into the system evaluations at item level. Figure 6.1 and Figure 6.2 show participants' responses per item. The range of the scales is between -3 (horribly bad) and +3 (extremely good), but in real applications generally only values in a restricted range are observed. Due to the calculation of means over a range of people with different opinions and answer tendencies (for example the avoidance of extreme answer categories), it is extremely unlikely to observe values above +2 or below -2. Thus, from a purely visual standpoint on a scale range of -3 to +3, even a quite good value of +1.5 for a scale does not look as positive as it really is. The same holds for negative values.

As can be observed from Figure 6.1, negative items for the text-based system include that it is very usual (i.e. not leading edge) and boring to use. Figure 6.2 shows that the speed is rated as a negative item for the speech-based system. Nevertheless, both systems mainly received a positive evaluation on most items.

Figure 6.1: Mean values per item ranked for the text-version.

Figure 6.2: Mean values per item ranked for the speech-version.

Measure of Internal Consistency

A reliability analysis is performed to check the internal consistency of the UEQ questionnaire per scale. Cronbach's alpha coefficient (α) is used as consistency measure and is calculated for each scale. The alpha-coefficients (presented in Table 6.4) normally range between 0 and 1. The closer the coefficient is to 1, the greater the internal consistency of the items in the scale. There is no generally accepted rule for how big the value of the coefficient should be, but a rule of thumb has been proposed by George and Mallery [59]: “α>0.9 – Excellent, α>0.8 – Good, α>0.7 – Acceptable, α>0.6 – Questionable, α>0.5 – Poor, and α<0.5 – Unacceptable”. However, the value of the alpha-coefficient should be interpreted very carefully, especially in our case where we only have a small sample. In such cases a low alpha-coefficient can result from sampling errors and may not be an indicator of a problem with the scale. Additionally, the UEQ handbook mentions that scales which are irrelevant for a certain product may cause low alpha-coefficients. In this case the responses might not be very consistent because participants have problems judging a UX quality aspect that is not important for the application [54].
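Cronbach's alpha can also be computed directly from the item scores; a sketch of the standard formula (our own helper, not the Excel-Tool):

import numpy as np

def cronbach_alpha(item_scores) -> float:
    """Cronbach's alpha for one scale.
    item_scores: a participants x items matrix (here 28 x 4 per scale).
    alpha = k/(k-1) * (1 - sum of item variances / variance of sums)."""
    scores = np.asarray(item_scores, dtype=float)
    k = scores.shape[1]
    sum_item_var = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1.0 - sum_item_var / total_var)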

Items     Perspicuity   Efficiency   Dependability   Stimulation   Novelty
2, 4      0.19          0.46         -0.11           0.49          0.65
2, 13     -0.10         0.18         0.11            0.60          0.56
2, 21     0.15          0.04         0.12            0.49          0.62
4, 13     0.24          0.44         0.13            0.80          0.55
4, 21     0.20          0.30         0.11            0.34          0.77
13, 21    0.25          0.38         0.38            0.40          0.61

Average   0.15          0.30         0.12            0.52          0.63
Alpha     0.42          0.63         0.36            0.81          0.87

Table 6.4: Cronbach's alpha coefficient per scale for the complete dataset (text and speech data), with the correlations per item pair within each scale.

As can be observed from Table 6.4, the alpha-coefficients indicate that the questionnaire reaches good reliability for the stimulation (α = 0.81) and novelty (α = 0.87) scales. This means that a high level of internal consistency can be observed for these scales in our data. The efficiency column presents a questionable value (α = 0.63), while very low alpha-coefficients were obtained for the perspicuity (α = 0.42) and dependability (α = 0.36) scales.

Since the UEQ is a generally validated questionnaire, it is not good practice to remove any items from the data to improve the alpha-coefficient. However, in cases where a scale shows a massive deviation from a reasonable target value (e.g. 0.6 or 0.7), the corresponding scale should be interpreted very carefully. In our data this is the case for the perspicuity and dependability scales. When analyzing the results of the paired samples t-test (described in the next section), extra care should be taken with these low-scoring constructs.

Comparison of System Evaluations

To test Hypotheses 1, 2 and 3, we checked with a paired samples t-test whether the scale means of the two measured systems differed significantly. Table 3 and Table 4 in Appendix C show the test results and statistics. P-values are used as indicator for significant results and Cohen's d values represent the effect sizes. A visualization of the mean scores per scale for both systems is presented in Figure 6.3. The results are explained per scale in more detail below.

Figure 6.3: Visualization of the mean and variance of the UEQ scales.

Perspicuity: There was no significant difference in the scores for the perspicuity scale between the text-version (M = 1.71, SD = 0.52) and speech-version (M = 1.83, SD = 0.66) conditions; t(27) = -0.92, p = 0.366. These results suggest that there is no difference in the ratings of the perspicuity items between the two systems. Additionally, the effect size (d = -0.174) is very small.

Efficiency: There was no significant difference in the scores for the efficiency scale between the text-version (M = 1.15, SD = 0.79) and speech-version (M = 0.93, SD = 0.87) conditions; t(27) = 0.96, p = 0.346. These results suggest that there is no difference in the ratings of the efficiency items between the two systems. Additionally, the effect size (d = 0.181) is very small.

Dependability: There was no significant difference in the scores for the dependability scale between the text-version (M = 1.16, SD = 0.64) and speech-version (M = 0.95, SD = 0.65) conditions; t(27) = 1.677, p = 0.105. These results suggest that there is no difference in the ratings of the dependability items between the two systems. Additionally, the effect size (d = 0.317) is rather small.

Stimulation: There was a significant difference in the scores for the stimulation scale between the text-version (M = 0.42, SD = 0.98) and speech-version (M = 1.24, SD = 0.71) conditions; t(27) = -4.025, p < 0.001. These results suggest that there is a difference in the ratings of the stimulation items between the two systems. Additionally, the effect size (d = -0.761) is slightly below a large effect.

Novelty: There was a significant difference in the scores for the novelty scale between the text-version (M = 0.29, SD = 1.27) and speech-version (M = 1.53, SD = 1.01) conditions; t(27) = -5.753, p < 0.001. These results suggest that there is a difference in the ratings of the novelty items between the two systems. Additionally, the effect size (d = -1.087) is beyond a large effect.

Very low alpha-coefficients were obtained for the perspicuity and dependability scales, and the results of the t-test did not show any significant results. Therefore we can conclude that there is no reliable indication that the two systems are evaluated differently on these scales. In contrast, high alpha-coefficients were obtained for the novelty and stimulation scales, and the results of the t-test did show significant results. The stimulation scale showed an effect size slightly below a large effect and the novelty scale an effect size beyond a large effect. This leads to the conclusion that participants found the speech-based system to be significantly and considerably more novel and stimulating. Therefore Hypotheses 1 and 2 can be confirmed: there is a significant increase in user evaluations measured with the UEQ for the novelty and stimulation scales. A decent alpha score was obtained for the efficiency scale, but no significant difference and only a very small effect size was observed in participants' ratings regarding the efficiency of the system. Therefore we cannot conclude the text-based system to be more efficient than the speech-based system. This means that Hypothesis 3 cannot be confirmed: there is no significant decrease for the efficiency scale. In line with our expectations, there is no significant difference in the user evaluations of the perspicuity and dependability scales.

6.1.3 Explicit Comparison

As explained in Section 5.2.4, participants were asked to indicate which system applied more strongly to 15 different statements. The results of this measurement help to test Hypothesis 4. Table 5 in Appendix C shows the one-sample t-test statistics, including an overview of the mean values found for each statement. A visualization of these numbers is presented in Figure 6.4. From the graph we can see that the text-version scored higher on the positive attribute 'efficiency', but also on the negative attributes 'monotony' and 'repetitiveness'. On the other hand, the speech-version scores higher on most of the positive statements (statements one and three through nine), but also seems to be more cumbersome and inconvenient to use.

The results of the t-test are presented in Table 6.5. A number of significant results can be observed. A significantly stronger association with the speech-version is found for statements 3 through 9 (positive statements) and statement 12 (negative statement). A significantly stronger association with the text-version is found for statement 2. Most significant results are observed for the positive statements, while only one negative statement showed a significant difference. Participants believed that the text-version was monotonous while the speech-version was more interesting, credible and fun to use, and that they would use it for a longer period, recommend it to someone older than 55, follow advice from it more often and rather spend money on it. On the other hand they experienced the speech-version as less efficient and preferred the text-version in terms of efficiency.

Figure 6.4: Boxplot of the explicit comparison results.

(Very) large effect sizes are observed for statements 3, 4, 5, 8, 9, and 12. This means that besides the statistically significant differences between the two versions for these statements, the differences in ratings were large, indicating that participants perceived the speech-version to be much more interesting, credible and fun, and considerably less monotonous. Especially the statements 'more interesting to use' and 'more fun to use' show very large effects, indicating a clear majority to be in favor of the speech-version regarding these statements. The statements 'would follow advice from more often' and 'would rather spend money on' also show large effects, indicating a strong tendency towards the speech-version for these statements. The effect sizes for statements 2, 6 and 7 can be classified between a medium and large effect, indicating a quite big (but somewhat smaller than for the previously discussed statements) difference between ratings.

The results suggest that Hypothesis 4 can be confirmed for a large part. The speech-based application leads to an increase in some user evaluations measured via the explicit comparison questionnaire, but not in all. Participants rated the speech-based system better on items 6 and 7, much better on items 3, 4, 5, 8, 9 and 12, and much lower in terms of efficiency (item 2). Items 1, 10, 11, 13, 14 and 15 showed no significant difference between the two versions, indicating no explicit preference for those items.

Test value = 0

Statement                              t        df   Sig.       Mean        95% CI of the difference   Cohen's d
                                                     (2-tailed) difference  Lower         Upper
1) ..thought was more pleasant         1.072    27   0.293      0.286       -0.26         0.83          0.203
   to use.
2) ..thought was more efficient.       -3.576   27   0.001      -0.857      -1.35         -0.37         -0.676
3) ..thought was more interesting.     12.813   27   0.000      1.643       1.38          1.91          2.421
4) ..thought was more credible.        4.674    27   0.000      0.857       0.48          1.23          0.883
5) ..thought was more fun.             10.003   27   0.000      1.500       1.19          1.81          1.890
6) ..would use for a longer period.    2.489    27   0.019      0.643       0.11          1.17          0.470
7) ..would recommend to someone        3.855    27   0.001      0.929       0.43          1.42          0.729
   older than 55.
8) ..would follow advice from          5.612    27   0.000      1.000       0.63          1.37          1.061
   more often.
9) ..would rather spend money on.      7.044    27   0.000      1.107       0.78          1.43          1.331
10) ..thought was more practical.      -1.216   27   0.234      -0.286      -0.77         0.20          -0.230
11) ..thought was more cumbersome      1.987    27   0.057      0.357       -0.01         0.73          0.375
    to use.
12) ..thought was more monotonous.     -4.804   27   0.000      -0.821      -1.17         -0.47         -0.908
13) ..thought was more annoying.       0.205    27   0.839      0.036       -0.32         0.39          0.039
14) ..thought was more inconvenient    1.769    27   0.088      0.286       -0.05         0.62          0.334
    to use.
15) ..thought was more repetitive.     -1.317   27   0.199      -0.250      -0.64         0.14          -0.249

Table 6.5: Results of the one-sample t-test performed on the explicit comparison statements. Values (and related statements) with p < 0.00333 are considered significant.

6.1.4 Open Questions

This section provides an overview of the topics from the set of open questions as discussed in Section 5.2.4. The results are discussed per question and the answers are bundled into a collection of the most common answers.

Preferred Version

A substantial majority of 85.7% of the participants (N = 24) preferred the speech-based system over the text-based system (N = 3). It is important to take into account here that most participants were students and did not fall within the user target group. However, students were

69 asked to position themselves in the role of older adults and the four participants aged > 55 also preferred the speech-version over the text-version. Some quotes of reasons that participants mentioned in favor of the speech-version include: ’The speech-version gives a better impression of personal contact’, and ’The speech-version was surprisingly fun. Hearing the answers out loud and having to say them back is a fun way of interacting with a computer. Other reasons men- tioned in favor of the text-based system and the speech-based system are presented in Table 6.6 and Table 6.7 respectively. Two participants (N = 2, 7.1%) did not clearly state a preference for speech or text, but instead wrote: ”If speech recognition systems would develop and become better, I would say a speech-based system. Otherwise I would say a text-based system because for older adults it may be difficult to understand that they are restricted to the reply options and cannot use free speech”, and “The text-version is easier and faster in use. Speech is more innovative and might be convenient for people who cannot use their hands”.

Reason                                     Number of times mentioned
No reason mentioned                        1
Faster in use                              2
Ability to determine your own speed        1
Easy to use                                1

Table 6.6: Reasons that were mentioned in favor of the text-version.

Reason                                                              Number of times mentioned
No reason mentioned                                                 3
Easy to use (mainly the combination of written text and speech)     2
Keeps your focus                                                    4
More innovative                                                     2
More interactive                                                    4
More fun to use                                                     5
Interaction feels more natural and personal                         6
Faster to use                                                       2
Takes less effort for me to use                                     1
Stays interesting for a longer period of time                       1
More convenient for older adults >55                                2
You have the choice to speak or click                               2

Table 6.7: Reasons that were mentioned in favor of the speech-version.

An important addition to the reason 'interaction feels more natural and personal' is that this natural interaction would improve further if the speech recognizer responded faster. Participants who mentioned 'easy to use' as a reason in favor of the speech-version mainly experienced the combination of written text and speech as very helpful. This feature provided the option to answer by speaking as well as by clicking. Moreover, it enabled participants to read and listen at the same time, making it easier to process the coaches' information and enhancing focus.

Easiest to use: Tables 6.6 and 6.7 showed the reason 'easier to use' in favor of both systems, which also resulted from the second question: 'What version do you think is easiest to use and why?' Although most participants experienced the text-based system to be easier in use (N = 18, 57.1%), a number of participants experienced the speech-based system to be easier (N = 8, 28.6%) and a few participants experienced both systems to be equally difficult (N = 4, 14.3%). A participant without preference mentioned: 'At this moment there is no difference, since only the text balloons can be spoken. When the speech-version is able to recognize an extensive vocabulary, this version will be easiest.' According to this participant, the speech-based system has potential in terms of ease of use. Reasons regarding the ease of use of the text-based system and the speech-based system are presented in Table 6.8 and Table 6.9 respectively.

Reason                                                                   Number of times mentioned
No reason mentioned                                                      1
Restricted speech option                                                 1
Speech was not always recognized                                         4
Speech took more time                                                    1
You did not need any information from the beginning (e.g. coach names)   2
Fewer errors                                                             3
Faster in responding                                                     2
You did not have to pay attention to whether you were understood         1
You could use it at your own speed                                       2

Table 6.8: Reasons that were mentioned by people who experienced the text-version to be easier.

Reason                                                                        Number of times mentioned
No reason mentioned                                                           2
The application could be operated by voice and clicking                       1
It is not necessary to click                                                  2
You could give answers that were somewhat different from the answer options   1
Direct feedback                                                               1
You remember more information when it is spoken                               2

Table 6.9: Reasons that were mentioned by people who experienced the speech-version to be easier.

One clear observation from these results is that the text-based system was generally perceived as easier to use because the speech-based system made more errors. The system did not always respond to speech, while in the text-version (almost) no errors occurred. Moreover, the text-based system was already not very fast in responding, and adding speech made the system even slower, leaving participants in doubt about whether they had been understood. Some participants found the text-version easier because they could operate it at their own speed and did not have to wait for the coach to finish speaking.

Most fun to use: Although the text-based system won over the speech-based system in terms of ease of use, all participants agreed that the speech-based system was more fun to use. The reasons that participants mentioned are presented in Table 6.10.

Reason                                                                        Number of times mentioned
No reason mentioned                                                           3
More interaction                                                              8
Suits a conversation better                                                   1
More innovative                                                               5
I felt like I could take more time in the interaction                         1
More like talking to a person instead of a computer                           5
More freedom in answering                                                     1
Because of the sound, I felt like being spoken to                             2
Interesting to test how much the coaches understood from what I was saying    1
More active                                                                   1
More interesting, because I never used speech technologies before             3
It is cool that it recognizes my voice                                        1

Table 6.10: Reasons that were mentioned why the speech-version was most fun to use.

The reason mentioned most by participants is that speech enhances interaction, leading to a conversation that feels much more like talking to a real person instead of a computer. Furthermore, participants mentioned that speech technologies are quite novel and that they had not used them much before, making it more interesting and fun to try out.

Recommendation for Older Adults
Because most participants were students aged <30, one question asked which version they would recommend to older adults (>55). A total of 23 participants (82.1%) would recommend the speech-version, while 2 participants (7.1%) would recommend the text-version and 3 participants (10.7%) recommended both versions. Although most participants did not indicate why they recommended a certain version, a few reasons were provided that are similar to reasons mentioned earlier, such as: 'the natural interaction', 'easier to use because no buttons need to be clicked', 'more personal', 'more effective in coaching', and 'more fun'. Noteworthy among the participants who recommended both systems was that their recommendation depended on the age of the user. Two participants recommended the text-version to adults >55, but the speech-version when adults become much older (e.g. >70). In contrast, another participant recommended the speech-version to adults >55 and the text-version to the oldest adults, because: 'the oldest adults may not understand that they are restricted in their answer options while using speech'. Three participants recommended the speech-version, provided that it be improved with better recognition, a more extensive vocabulary, and a better response time.

Advantages and Disadvantages
Tables 6.11 and 6.12 provide an overview of the advantages of both systems as mentioned by the participants, and Tables 6.13 and 6.14 provide an overview of the disadvantages. Some advantages and disadvantages overlap with reasons mentioned earlier.

Advantage                                                                 Number of times mentioned
It is simple to use                                                       4
It is fast in use / short reaction time                                   9
It is possible to use in noisy areas, and with music in the background    1
It is consistent and acts like you expect                                 2
You can read/answer at your own speed                                     8
It makes fewer errors                                                     5
Own interpretation of voices                                              1

Table 6.11: Advantages experienced in the text-version.

Advantage                                                                                     Number of times mentioned
It is more interactive                                                                        5
It is more like a real conversation / a person is talking to me                              10
It is much more fun                                                                           7
It costs less energy / is more relaxing                                                       5
I take the information more seriously                                                         3
It keeps your focus                                                                           2
You can communicate in your own way                                                           2
You do not have to use the mouse / need hands for navigation                                  3
You do not have to look at the screen and in theory could do something else in the meantime   1
You do not have to read small letters on the screen                                           1

Table 6.12: Advantages experienced in the speech-version.

The biggest advantage of the text-version relates to speed. The text-based system responds faster and participants can read and reply at their own speed, making it faster to navigate through. On the other hand, the text-version was experienced as boring, repetitive, and impersonal, and it is easy to get distracted from reading text.

Disadvantage                                                              Number of times mentioned
There are few answer options / what you have to say is very predefined    3
It is very repetitive                                                     3
It is very usual/boring                                                   9
You have to click a lot                                                   2
It is slow                                                                1
It feels impersonal                                                       4
I tend to click through the messages very fast                            1
Not suitable for people with visual impairments                           2
It contains a lot of (unnecessary) text                                   3
It is easy to get distracted                                              3
The advice is less valuable                                               2
You easily read over (important) information                              2

Table 6.13: Disadvantages experienced in the text-version.

Disadvantage                                                                          Number of times mentioned
Limited speech possibilities / small vocabulary                                       3
The response (sometimes) takes long, making it unclear whether you were understood   11
Speech is not always recognized / more errors are made                                8
Not suitable for people with hearing impairments                                      1
Wrong interpretation of the answer                                                    1
You need to have a bit of foreknowledge in order to have an interaction               2
When what I said was not understood, it feels awkward to repeat myself                1
Voices are robotic and impersonal                                                     1
Calling the coaches by name does not work optimally                                   1

Table 6.14: Disadvantages experienced in the speech-version.

The biggest advantage of the speech-version is that the conversation feels much more like talking to a real person, and for this reason participants tended to adopt the information more easily. This is an important finding for the purpose of COUCH, which attempts to teach older adults to live happily and healthily in an independent way. Another advantage is that using the speech-version costs less energy and easily keeps focus, which might lower the threshold to start an interaction (e.g. after a busy and exhausting day). One last advantage is that the speech-version does not require the user to stay close to the screen or to use the mouse/hands for navigation. On the other hand, the speech-version responded too slowly. Participants felt insecure about whether they were understood by the system, which sometimes caused them to repeat themselves even though they had been understood. One participant mentioned: 'Waiting for the response of the system sometimes took too long, which gave the feeling that you were not understood, while this was almost never the case. The recognition worked very well!' Although the recognition worked well according to this participant, a few others mentioned as a disadvantage that speech was not always recognized.

One common disadvantage of both systems was the limited number of reply possibilities. Users would like to see more answer options that are relevant to them, instead of a set of irrelevant, predefined options. For speech this means that users would like more possibilities to use free speech, so that they can define their own answers to a question or ask a coach their own questions.

6.2 Field Experiment

This section gives a summary of the interviews, which are analyzed per subject. Some questions were somewhat broad and related to each other, resulting in answers that were similar for multiple subjects; these answers were therefore moved to different topics for analysis purposes.

6.2.1 General Impressions
Participants were first asked about their general impression of the system. They did not much appreciate the application in general, which made it more difficult to focus on their opinions towards speech. Participants indicated that the system became boring because the content was limited and not very diverse. In addition, using it for one week is limiting because it is not possible to do multiple coaching sessions with one coach. Besides, participants did not appreciate not being able to provide their own answers (e.g. quote: 'I could not even give the option "that sounds boring" in a dialogue about the grandchildren of Carlos, and instead had to choose between two irrelevant answers'). Furthermore, it was mentioned: 'It is one protocol that you run, so that cannot really be called interactive'. Lastly, errors in the dialogue (e.g. circular conversations, strange responses to an answer, and errors in the statements) negatively influenced participants' attitudes towards the system in general.

Despite the negative impression of the system in general, the system was received relatively well when focusing on the speech implementation, although it had limitations (e.g. quotes: 'It works in all its limitations', and 'It works with speech, but it is a laborious process'). Participants mentioned that they appreciated the ability of the coaches to speak; especially the combination of written text and spoken sound was appreciated. This option allowed users to continue faster in cases where the dialogue was not relevant or progressed too slowly. In addition, most participants experienced the voices as very pleasant, adding value to the ability of the coaches to talk. Participants mentioned that an interaction with speech was fun, but somewhat difficult. Having a vocal interaction was sometimes possible, but not always, while in a textual interaction it was clear what was possible and what was not. Moreover, the response to speech was slow, which decreased the user experience and caused doubts about whether the user was understood. In cases where the system did not respond, the text buttons were used, which made the system similar to the original text-based system.

One participant had a negative experience with the system, perceived it as annoying, and thought the voices sounded mechanical because the accentuation was not always correct.

6.2.2 Practical Problems
This section is divided into two parts. First, general difficulties experienced while running the experiment are described. Second, problems that participants experienced while using the system are discussed. These issues are relevant for the analysis of the results, because the participant who experienced the most problems showed the most negative attitude towards the system.

General
Technical problems were considerably more present in the field experiment than in the controlled experiment, and the field experiment was executed with a few major technical issues. Two participants managed to complete the testing period without any additional assistance or support, but the other participants contacted us about problems when running the application on an iPad. These problems could not be resolved in the given time frame, so an Android tablet was provided instead. This solved the problem for one participant, but the second participant continued to have problems because the provided tablet was old. It had a hard time running either the text- or the speech-version of COUCH and caused a significant delay in both versions. Other problems with this tablet occurred with the microphone, which had to be manually reset every time the speech-version was started. A last issue came up when participants contacted us about the inability to log in, which was caused by an issue with the system certificates. Luckily, participants were able to use the system again one day later.

System
Besides the general problems of the experiment, most participants experienced practical problems while using the application, although this differed greatly per participant. For one participant the system worked very well, and it only happened a few times that there was no (correct) recognition. The system responded well and was particularly accurate in the recognition of coach names, even while it was used in a noisy room (i.e. with the washing machine in the room turned on). Other participants experienced more problems when using speech, but this depended on the moment they used it. This issue might be caused by the internet connection, which is not always stable in a home situation. Moreover, participants used their own equipment during the experiment, which sometimes led to technical issues.

It appeared that some words were more difficult to recognize than others. For example, saying 'stop' to end the conversation never worked, but saying 'end' did work. Furthermore, participants mentioned that the system response was slow. Although this was worst when using speech to interact, it was also experienced when using the scripted text buttons. Most problems originated with François, whose response speed was slower compared to the other coaches. This sometimes caused a feeling of misunderstanding between human and machine, especially when using speech, because there it resulted in errors. Moreover, one participant mentioned that the ASR was sensitive to the intonation of words. Simple words like 'oké' were understood well when pronounced in one way, but when spoken differently, the system did not respond at all.

Sometimes participants chose a reply option, but the resulting next dialogue step did not make sense for this reply. Occasionally this was an error by the speech-based system, which captured a wrong keyword and therefore continued to the wrong dialogue step. Other times it was a strange dialogue transition (which was similar in the original application when checked by the participant). Another problem related to the dialogue content was that in some situations multiple answers contained the same word (e.g. the word 'tell' in two answer options): 'Then you have to stick very carefully to the text, because it is then unclear what the important word to say is'.

6.2.3 Way of Interaction
One question asked in which way participants talked to the coaches and whether they would have wanted this to be different. Participants mentioned that they tried to test the boundaries of the system's vocabulary by deviating from the reply options. When this elicited no response, they adhered to the given reply options. Some participants tried to use free speech, but quickly found out this did not work.

Participants wanted to see more dialogue content and a more extensive set of reply options to make the interaction more like interacting with a real coach. Participants mentioned that the text-version of COUCH should already be implemented with free text, so users can type their own answers. If the system understands a larger vocabulary, the user can provide many different replies and the system knows how to respond to each of them. More coaching content is required to appropriately respond to each reply. Such a system would offer more possibilities to implement, to some extent, free speech, which is preferred to make the interaction more natural, interesting, and useful. Moreover, free speech could be helpful for visually impaired people who cannot read the text on the screen.

One last problem with the interaction was that participants did not appreciate the interactions in which coaches did not properly greet or say goodbye. When a coach started the conversation by saying 'Hello, good to see you, how are you?', this was appreciated more than when a coach directly started with 'Do you want to know more about how to be more active?'. Additionally, participants preferred the coaches to clearly say 'goodbye' to end the conversation.

6.2.4 Use and Recommendation of the Application
Participants were asked whether they were interested in using the application in the future and would recommend it to others. All participants indicated that they did not want to use the application in the future, even if the speech recognition and responsiveness were improved. The main issues mentioned were the limited dialogue content and the limited number of reply options. However, this result relates more to the COUCH system as a whole than to the speech implementation specifically. Participants mentioned that this disinterest might be caused by the limitations of the study, because they were not able to see all coaching content due to the time constraints of the application (e.g. the weekly coaching sessions) and were not able to connect an activity tracker.

On the other hand, all participants would recommend the system to others, especially to people in specific user groups. The application is expected to be more fun and informative for less-educated people and for people with less computer experience. Moreover, the application is suggested as a helping hand for older adults following therapy, about which was mentioned: 'These people can go to therapy and receive information there, but still have to apply it at home themselves. An application like this can help them in between the therapy sessions, especially when the therapy content is a lot to process and remember'. Lastly, participants mentioned that the application could be useful for older adults living alone, who can benefit from the social function of the application.

6.2.5 Suggestions for Improvements
E-health applications like COUCH are considered important by the participants, but improvements are necessary to make the application more effective (e.g. quote: 'These types of systems have to be developed, but with these you quickly get stuck'). Participants were asked for possible improvements to the system, resulting in a few concrete ideas.

Participants preferred a dynamic interface with animated coaches. A 3D interface with humanlike characters is preferred most, but a 2D interface with characters showing facial expressions, hand gestures, and lip movements to indicate that they are talking (e.g. as in animated movies) is a good second option. Besides, it was suggested to occasionally move the coaches around the living room to train the memory of the user and keep the application interesting.

As already explained in 'Way of Interaction' (Subsection 6.2.3), expanding the number of reply options and the dialogue content is suggested. This improvement would allow people to participate in a more natural conversation and allow visually impaired people to use the application. This is especially important since aging often comes with visual deterioration. One suggestion to address this problem while avoiding the complexity of free speech is to verbally provide all possible answers. For example, the coach can say 'Do you want to discuss something personal?', or 'Do you want to hear tips about your diet?'. This approach fits easily within the structured COUCH dialogue setup.

Another idea mentioned was to provide feedback about the speech that is recognized. In the current implementation it was clear that the ASR and microphone had been turned on, but it was not indicated whether the incoming speech was being transcribed. It was not perceived as a problem that no keywords were spotted, as long as it was clear that speech was transcribed. Although one participant indicated this preference, another participant expected this graphical feature to be unnecessary if the delay and slowness of the system were improved.

6.2.6 Preferred Version
The interview ended by asking participants about their preferred version. The two participants who participated in a previous COUCH study and had experience using the text-version differed in opinion. One participant preferred the speech-version, because it gave the feeling of a real conversation and coaching session. The second participant, who experienced a lot of general problems during the start of the study, was annoyed by using the speech-based system because it responded too slowly and repetition was almost always necessary. The two participants without previous experience with the text-version (i.e. who only used the text-version once) both preferred the speech-version, because speech improved the experience of a more humanlike interaction with the coaches. The combination of speech and written text was appreciated, but it was also mentioned that for some older adults (especially the oldest ones) this combination might be confusing.

Chapter 7

Discussion

In this chapter, we discuss the results derived during this project. We start by presenting a complete overview of the answers to the sub-questions. Then we discuss the original research questions and the main question, and compare these conclusions with results from prior research. Finally, recommendations regarding future work in this area are presented.

7.1 Discussion of the Sub-questions

A large part of the sub-questions was already addressed during earlier stages of the study (Chapters 2 and 4), but they are repeated here for completeness:

SQ:1 What are current limitations in text-to-speech synthesis software and how can this problem be addressed?
The fundamental limitation of speech synthesizers lies in creating natural-sounding synthetic speech. Despite the fact that modern TTS systems have reached a level of voice quality that no longer resembles robot-like voices, it remains very difficult for systems to vocally express emotion. Other limitations of TTS software are found in the correct pronunciation of names and foreign words, and in generating correct prosody. Although the lack of vocal emotion is a limitation that is very hard to address, the pronunciation of words and names can be corrected by integrating SSML specifications in the synthesis request. The Google TTS supports the SSML '' attribute, which was used for words like 'wow', 'Helen', 'Olá', and 'Emilie', but did not work for words like 'hmmm'.
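As an illustration, the following is a minimal sketch of such a synthesis request, assuming the google-cloud-texttospeech Python client; the specific SSML tags (sub, break) and the Dutch voice name are illustrative and not necessarily the ones used in this project:

from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()

# SSML with an illustrative pronunciation substitution and a short pause
ssml = """
<speak>
  Hello, my name is <sub alias="Emily">Emilie</sub>.
  <break time="300ms"/>
  Wow, good to see you!
</speak>"""

response = client.synthesize_speech(
    input=texttospeech.SynthesisInput(ssml=ssml),
    voice=texttospeech.VoiceSelectionParams(
        language_code="nl-NL", name="nl-NL-Wavenet-A"  # assumed voice name
    ),
    audio_config=texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3
    ),
)

# Write the synthesized utterance to disk; since all coaching content is
# static, such requests could be made once and the audio cached.
with open("coach_utterance.mp3", "wb") as f:
    f.write(response.audio_content)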

SQ:2 How robust is the current state of the art in text-to-speech synthesis to create multiple humanlike voices necessary to ensure a natural interaction with all coaches?
The current state of the art in text-to-speech synthesizers does not offer many options to create multiple humanlike voices. Most TTS software is paid, offers only one voice, or is not available for the Dutch language. Only the Google TTS met all requirements, but it also charges money when used on a larger scale. For this project we could use a trial package that enabled us to spend €300 on synthesis requests within a period of one year. In terms of robustness we can conclude that TTS is already quite advanced. The voices are not identical to human voices yet, and there were some problems in the pronunciation of certain words, but overall the voices were perceived as clear and pleasant to listen to. Especially for an application like COUCH, where all coaching content is static (i.e. the content is written by experts and does not involve dynamic generation of text), no new text has to be synthesized at runtime. Therefore TTS can be a robust solution.

SQ:3 What additions to the graphical user interface are necessary to assure a pleasant user experience with the voice-based application?
Some graphical user interface additions were devised in advance, following the design guidelines proposed by Pearl [53]. Other suggestions for adjustments were the result of the two user studies. Simple visual feedback in the form of start- and stop-speech buttons was implemented to let the user know that the system was listening. This implementation gave verbal and visual feedback about the listening state of the system, because users may get frustrated when it is unclear whether the system is listening. Some users would have liked more visual cues than only knowing that the microphone was turned on: an indication that the system was really capturing speech but, for example, did not spot the correct keywords.
The controlled study showed that it was not clear to all users whether they had to speak the 'continue' button in autoforward replies. The system automatically continued after a very short time, but because there was a button, users thought they had to say or click 'continue' as well. Therefore this button was removed from the interface.
Cartoon-like characters as used in COUCH can lower user expectations towards the technical abilities of the system [41]. However, participants of the field experiment indicated a preference for a less static application, in which coaches are animated and show lip movements. Moreover, a more dynamic interface was suggested by changing the position of the coaches in the living room now and then. This better enhances the user's focus and keeps the memory sharp.

SQ:4 How should the dialogues be adjusted in order to function with the implementation of speech into the Council of Coaches?
We cannot conclude that there is only one way in which the dialogues should be adjusted in order to function with the speech implementation in COUCH. However, there is one way (implemented during this research) that is most logical given the structured dialogue architecture of the original COUCH application. The complete dialogue structure is built upon the WOOL framework, which is essentially a definition of a series of dialogue steps, linked through user replies. These dialogue steps were manipulated individually, by adding keyword tags as general actions to statements, without changing the structure of the complete hierarchy. This content could be retrieved by the COUCH Web Client and compared with the transcription of speech. When a keyword was spotted in the transcription, the Web Client could mimic a click event so that the user continued in the dialogue and the content of the next dialogue step (i.e. the speaking coach, the coaching text, the reply options, etc.) could be retrieved. It was not necessary to adjust the structure and content of the dialogues, except for removing some irrelevant dialogues for the user studies. Also, words that could not be pronounced correctly by the TTS were removed from the dialogues. A minimal sketch of this keyword-spotting step is given at the end of this answer.
To allow for free speech in the COUCH application, adjustments to the WOOL-based dialogue structure are required. Users can define their own replies when using free speech, instead of speaking predefined replies. This requires the application to generate new coaching content based on user replies, otherwise the system does not know how to respond correctly. Changing the complete dialogue structure of COUCH is not the most convenient method to implement free speech into the application. Instead, another approach could be to link all user input to one of the reply options by using large keyword sets for every reply. Keyword sets can be automatically created based on the content of the reply, enabling free speech (to some extent) while at the same time keeping the dialogue structure. This suggestion is explained in more detail in Section 7.5.2. The main drawback of this approach is that it is not always appropriate for dialogue content that cannot be related to any of the reply options.
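As an illustration of the keyword-spotting step described above, the following is a minimal sketch in Python; all names are assumptions, and the actual COUCH Web Client implements this differently:

from dataclasses import dataclass
from typing import Optional

@dataclass
class ReplyOption:
    reply_id: str        # id of the reply in the current dialogue step
    text: str            # the reply text shown on screen
    keywords: list[str]  # keyword tags authored on this reply

def match_reply(transcription: str, replies: list[ReplyOption]) -> Optional[ReplyOption]:
    """Return the first reply whose keyword occurs in the ASR transcription."""
    spoken = transcription.lower()
    for reply in replies:
        if any(keyword.lower() in spoken for keyword in reply.keywords):
            return reply  # the Web Client would now mimic a click on this reply
    return None  # no keyword spotted: e.g. ask a clarification question

replies = [
    ReplyOption("r1", "Tell me more", ["tell", "more"]),
    ReplyOption("r2", "No thanks", ["no", "thanks"]),
]
print(match_reply("yes please tell me more about that", replies))  # matches r1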

7.2 Research Question 1

7.2.1 Answering Research Question 1
Research Question 1 stated:

Does the addition of speech to the Council of Coaches application lead to an increase in evaluations of the users’ experiences?

Most participants (24 out of 28) preferred the COUCH speech-version over the text-version. Moreover, significantly higher ratings on the novelty scale (with items: creative, inventive, leading edge, innovative) and the stimulation scale (with items: valuable, exciting, interesting, motivating) were observed. The Alpha-coefficients for these scales were high, indicating that the constructs were reliable and the items were measuring the same scale. Because the effect size was large for both constructs, the implementation of speech is expected to contribute to an increase in user evaluations on the stimulation and novelty scales, supporting Hypotheses 1 and 2 respectively. Hypothesis 3 cannot be supported, indicating that the addition of speech to the COUCH application does not lead to a significant decrease in user evaluations on the efficiency scale. No reliable Alpha-coefficient or significant test results could be observed for the efficiency scale.

Results of the explicit comparison showed that participants perceived the speech-version as more interesting, credible, fun to use, and less monotonous than the text-version, and that they would rather spend money on the speech-version, would sooner recommend it to older adults, would follow its advice more often, and would use it for a longer period of time. These results partly confirm Hypothesis 4, which expected a significant increase in user evaluations measured via the explicit comparison. An increase in evaluation was obtained not for all items, but for 8 of the 15 items. One item, efficiency, showed a significant decrease in evaluation for the speech-version, and the remaining 6 items did not yield any significant results.

The open questions supported the results of the UEQ and the explicit comparison. All participants experienced the speech-version to be more fun to use, and most participants would recommend the speech-version to older adults. Advantages of the speech-version included: more interactive, exciting, interesting, and innovative. Two other important findings from the interview questions were that many participants mentioned that they really had the feeling of participating in a real conversation with a human being, and that the interaction with the speech-version cost less energy and maintained focus.

7.2.2 Discussion of the Results
Most results derived from the UEQ, the explicit comparison, and the open questions were in line with each other. Multiple tests supported the finding that the text-version was very usual and boring to use, while the speech-version was perceived as interesting and entertaining. On the other hand, some contradicting results were obtained. The explicit comparison and UEQ results regarding efficiency were not in line. No significant difference in terms of efficiency could be observed on the UEQ, while this scale contains items like 'fast', 'efficient', 'practical', and 'organized'. This finding contradicts the results of the explicit comparison, which showed a significantly stronger association with the text-version for the efficiency statement, and the open questions, which mentioned the main advantages of the text-version to be the short reaction time, the speed of use, and the fact that it makes fewer errors. The original COUCH application already needed time to respond, but the implementation of speech significantly decreased the response speed of the application. This problem was already experienced during the development phase, but was difficult to solve. It takes time to send the spoken audio to the NLSpraak server, wait for the transcription, and convert this transcription into a dialogue choice. The slow speed made the speech-based system more susceptible to errors.

No significant difference was observed in the ratings of perspicuity (items: understandable, easy to learn, easy, clear) and dependability (items: predictable, supportive, secure, meets expectations). This result was expected because the two versions are, apart from the textual or vocal interaction, very similar on these UEQ items (e.g. both systems are perceived to be very understandable and predictable). Additionally, the internal consistency measure resulted in very low Alpha-coefficients for the perspicuity and dependability scales, which would have made it hard to draw conclusions even if the results had been significant.

The number of errors made by the system differed a lot between participants. The relative error percentage ranged between 4.0% and 44.4%, meaning that the highest error percentage was more than 10 times the lowest. It is difficult to determine what caused this wide spread in errors. It might have technical causes, such as the load on the NLSpraak server at a specific time, or causes related to participants' speech characteristics. One of the limitations mentioned in Section 2.4 stated that ASRs have difficulties with speech input variations, for example those caused by the characteristics of older voices. This might cause the ASR to deteriorate in the recognition of older voices and thereby decrease the user experience. During the experiment the ASR did not seem to perform worse for older voices, but there might be other characteristics that caused the difference in errors (e.g. the loudness or pitch of the voice). Furthermore, younger and older participants differed in their level of patience. In general, the older participants and the participants who had used speech technologies before showed more patience while using the application, and therefore fewer errors occurred.

In addition to the wide spread of system errors, a wide spread in the number of dialogue steps was observed. Participants were quite free in their interactions, except for the interaction with Coda. Suggestions for interesting interactions were provided, but participants were not required to stick to these options. These suggestions were only based on interesting dialogue content and not on the length of dialogues. Moreover, not all participants showed the same level of interest in trying the application. These differences resulted in a wide spread in the number of dialogue steps that users performed. However, since we calculated the relative error percentage, we did not expect this to be a big problem.
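As a hedged reconstruction of this measure (the exact formula is not repeated here), the relative error percentage presumably normalizes the error count by the number of dialogue steps a participant performed; the counts below are hypothetical:

def relative_error_percentage(errors: int, dialogue_steps: int) -> float:
    # Normalize the error count by the interaction length, so that
    # participants who performed more dialogue steps are comparable.
    return 100.0 * errors / dialogue_steps

print(relative_error_percentage(2, 50))   # 4.0, like the observed lower bound
print(relative_error_percentage(12, 27))  # 44.4, like the observed upper bound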

An important and surprising finding of the controlled experiment is that the speech-based application made it easier to maintain focus. Interacting via speech appeared to require less energy from participants. This finding may positively influence long-term engagement, because it may lower the threshold to use the application after a long and exhausting day.

Closing Summary
Summarizing, we can conclude that the addition of speech to the Council of Coaches application leads to an increase in user evaluations, even though the system is susceptible to errors and the efficiency is not optimal.

7.2.3 Limitations of the Controlled Experiment
For the interpretation of the results of Research Question 1, some limitations regarding the participants and the experimental setup need to be considered.

Participant Demographics
The first limitation concerns the participant demographics, which differ from the target user group, and thus the generalization of the results. The group of participants mainly consisted of students or young employees under 30; only four participants truly fell in the age category of older adults. This made it more difficult to conclude whether older adults would also perceive a better user experience. In general, younger people (and especially highly educated students) have more affinity and experience with technologies, potentially creating bias in the evaluation of the more complex speech-version of COUCH. On the other hand, most younger participants did not have previous experience with speech technologies, while two of the four older adults had. Besides, this small group of older adults preferred the speech-based system over the text-based system. Nevertheless, it would have been more valid to use a participant group homogeneous with the target user group.

Number of Participants
A total of 28 participants took part in the study, which is a relatively small number for statistical calculations and increases the risk of false-positive results [60]. A false positive is an error in which the test results incorrectly indicate the presence of a significant difference in user evaluation when in fact it is not present. However, since our study clearly showed large and significant effects on the scales that were in line with our expectations and observations, we expect this risk to be minimal.

Experimental Setup
Some participants seemed to feel a bit insecure and uncomfortable during the experiment, probably because they were alone in a quiet room with the researcher, or did not know exactly what to expect. This made them a bit hesitant during the conversation, especially in the beginning. As a consequence, their voice was somewhat quiet and hesitant, making it more difficult for the system to capture the speech correctly. This problem might be overcome by adjusting the experimental setting such that the participant is more at ease, for example by using a one-way mirror for observation or by filming instead of observing all interactions.

Observational Measurement
The setup for the observational measurement can be improved, because the researcher guided the entire experiment while also counting the number of errors and dialogue steps, which was a lot to do at the same time. The logged history made it possible to retrieve information about the chosen interactions and to reconstruct the dialogue path to count the number of dialogue steps, but this information was limited and not always clear. Logging the transcription created by the ASR and the keywords added in each step could be a valuable addition, because it was not always possible to see exactly what went wrong in the system (e.g. did the user say something that was not possible, or did the system incorrectly recognize it?). A sketch of such a log entry is given below.
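A sketch of the kind of log entry meant here, with assumed field names, could look as follows:

import json
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class SpeechLogEntry:
    timestamp: str                  # when the utterance was processed
    dialogue_step: str              # id of the current WOOL dialogue step
    transcription: str              # raw text returned by the ASR
    candidate_keywords: list[str]   # keywords attached to the visible replies
    matched_keyword: Optional[str]  # None means no keyword was spotted

def log_speech_event(entry: SpeechLogEntry) -> None:
    # In practice this would be appended to the existing interaction log;
    # it makes it possible to reconstruct afterwards whether the user deviated
    # from the reply options or the ASR misrecognized the speech.
    print(json.dumps(asdict(entry)))

log_speech_event(SpeechLogEntry(
    "2020-11-02T10:15:30", "diet-intro-3",
    "uh tell me more i guess", ["tell", "more", "no"], "tell",
))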

7.3 Research Question 2

7.3.1 Answering Research Question 2
Research Question 2 stated:

How robust is the current state of the art in speech recognition systems to create a usable and enjoyable system that can be used in a real-life (home) setting?

From the field study it is difficult to conclude whether the current state of the art in speech recognition systems is robust enough to create a usable and enjoyable system that can be used in a real-life setting. How many problems in speech recognition were experienced differed greatly between participants, and even within participants using the same device every time. One participant could use the application with the washing machine turned on in the same room, while another participant mentioned that the coaches barely responded to their names, even without any environmental noise. These results indicate that the robustness of a speech-based application does not only depend on the ASR, but also on the performance of the device (i.e. its speed), the quality of the microphone, and the stability of the internet connection.

The second reason why it was difficult to conclude whether the current state of the art in speech recognition systems is robust enough to create a usable and enjoyable system was that participants did not appreciate the COUCH application in general. However, most participants preferred the speech-version over the text-based application. Although none of the participants would use the application themselves, they would recommend it to others, especially to specific user groups. In particular, the ability of the coaches to talk was well received, especially because the voices were very clear and intelligible.

7.3.2 Discussion of the Results
A clear difference in attitude towards the speech-based system was observed between participants who experienced few problems and participants who experienced many problems during the experiment. The participant who experienced the fewest problems in speech recognition was much more positive about the speech implementation than the participant who had many problems with the startup and during the experiment. One participant mentioned not being a fan of speech technologies in general, but did appreciate the combination of speech and text in the speech-version.

The reason that participants did not appreciate the COUCH application in general was that it included too little coaching content and too few reply options. This was partly caused by the time constraints of the coaching sessions, which prevented users from continuing with the next coaching session; participants would rather see these constraints disappear (quote: 'In current society, it is not a practical reality to do this, users just want to decide for themselves and continue sooner'). The time settings could have been adjusted for the field experiment to assure a better experience.

One participant of the field study mentioned that the responsiveness of the system depended on the intonation of the spoken words. This participant also mentioned that her voice normally is quite melodic, ranging from high to low pitch. The system often did not capture her speech correctly; it only did so when the intonation was exactly right. The need for a specific intonation was not experienced by other participants, so it might be caused by the fact that the ASR had to listen to a female, older, and melodic voice, which is more complicated for an ASR than listening to a male and younger voice [35, 37]. Additionally, results from the observational measures in the controlled experiment showed a wide spread in the number of errors made by the system. This indicated that not all participants were equally well understood by the system. However, during this experiment no clear distinction could be made between voices that were understood better or worse (e.g. female vs. male voices, young vs. old voices, loud vs. soft voices).

Closing Summary
Participants preferred the speech-version over the text-version, even though practical problems with the application were experienced. How well the system worked fluctuated a lot, making the current state of the art in ASR systems and commercial devices (e.g. microphones in tablets and laptops) not robust enough for applications based solely on speech input and output. However, the combination of speech and text made the implementation of speech in the Council of Coaches application robust enough. Generally, participants preferred a vocal interaction, but in case the system did not respond, the buttons could be used. The mere ability of the coaches to speak (i.e. TTS) was already much appreciated.

7.3.3 Limitations
For the interpretation of the results of Research Question 2, again a few limitations need to be considered. The first limitation relates to the failure to find five participants who had participated in the original COUCH study, and the second limitation relates to the number of practical problems experienced in the field study.

Study Participants
The intention for the field study was to conduct it with five 'experienced' COUCH users who had participated in the original study done by RRD, but only two people responded to the advertisement. Instead, two family members of the researcher were approached to participate, but they did not have experience with the COUCH application. To overcome this problem and be able to ask them about their preferred version, they were asked to use the text-based COUCH application once before starting with the speech-based application. This may not be that big an issue, because the field study was designed to test the robustness of such a speech-based system in a home setting, but a more homogeneous group of participants would have been preferred.

Practical Problems of the Field Study
As described in Section 6.2.2, there were some issues during the field study. Due to the Covid crisis, we tried to avoid any contact with the older participants, and therefore participants were asked to use their own tablet for the experiment. The application was tested on a new Android tablet, but not on an iPad, on which the application turned out not to run. A replacement was provided to two participants who used an iPad, but for one participant this tablet did not work either, because it was too old to run either of the two versions smoothly. Moreover, these participants tried to use the application on the day the server and application were down. All these issues might have influenced the user experience of these two participants.

7.4 Main Question

7.4.1 Answering the Main Question
The Main Research Question stated:

To what extent can spoken interaction offer a valuable addition to the multi-party virtual Council of Coaches application?

To answer our main research question it is important to distinguish between the difference in experience between the text- and speech-based versions and the experience with the system as a whole. The controlled experiment addressed the differences between the versions, while the field study focused on the experience as a whole. In general we can say that people preferred the speech-based application over the text-based application, even though we learned from the observational measurements that a lot of errors were made by the ASR in correctly recognizing speech, or in recognizing speech at all. Additionally, using the application in a home setting sometimes resulted in bad performance caused by low-quality microphones, old devices, or an unstable internet connection, decreasing the user experience. Overall we expect spoken interactions to be a valuable addition to an e-health coaching application like the Council of Coaches.

7.4.2 Discussion of the Main Question
We have found evidence allowing us to give an affirmative answer to our original research questions. Our results indicate that adding speech to e-health applications can contribute to a pleasant experience, but that a robust implementation and proper devices are important to assure this. Although the results were promising for a system like COUCH, this does not necessarily mean that they hold for all systems. In this section we discuss possible explanations for the results.

The structured dialogue setup of COUCH made the speech synthesis easier, allowing us to automatically create humanlike voices that were perceived relatively well by participants. Since the TTS is not prone to errors in the current application, only adding TTS as part of a spoken interaction already offered a valuable addition for such an application. Solely implementing speech recognition, without speech synthesis for the coaches, would not fit an application like COUCH; moreover, it would probably not be much appreciated because of the large number of recognition errors in home situations. Besides facilitating the speech synthesis, the structured setup eased the speech recognition, but thereby limited users in their responses. Users were not able to freely express themselves, which is unlike a human-to-human interaction. On the other hand, compared to the situation where users were limited to textual responses, the perception of a natural interaction increased a lot. Moreover, spoken interactions turned out to be more fun, interesting, and interactive. These factors may increase the chance of long-term engagement with the coaches, allowing them to have further impact on the user's health behavior change, but for that the system as a whole needs to become more interesting for all older adults.

In general, participants of the controlled experiment were more positive about the speech-based application than participants of the field experiment, which is probably caused by the fact that the application performed much better in the controlled experiment (although this differed per participant). Moreover, participants only used the application for about 10 minutes in the controlled study, which might have caused the application to be perceived as novel and interesting. Preferences and attitudes in long-term interactions are likely to change, and novelty effects will wear out [61]. During the field study, where participants used the application for a longer period of time, the novelty effects might have worn out. None of these participants were interested in using the application in the future, although it is important to consider that this observation concerned the application in general (with the content as the main problem).

The robustness, efficiency, and reliability of the system are far from perfect yet, but they can improve a lot with the development of technology, reducing the number of ASR errors. Many technologies, such as better microphones with noise filtering, already exist, but they are not yet commonly implemented in devices for home use. Additionally, the application does not run on all devices, but this could be resolved by a better implementation. The biggest bottleneck of the current application was its speed and responsiveness. In many cases the ASR (almost) correctly transcribed the spoken speech, but because the response was very slow, participants started to repeat themselves. The text-version was already not very fast in responding, but sending spoken audio to the ASR, waiting for a transcription, sending text to the TTS, and waiting for this text to be transformed into audio slowed the application down further. This disadvantage was mentioned most often, because it resulted in the feeling that users were not understood or listened to.

Closing Summary
From both this thesis and the preliminary literature research [1], we can conclude that spoken interactions indeed offer a valuable addition to a multi-party virtual Council of Coaches application, but that better technologies, such as a better performing ASR, good quality microphones with noise filtering, and a better, and especially faster, implementation would create a more robust application. However, even though people liked the speech-based application more, none of the participants would like to use an application like this in the future. Therefore the application needs to become more interesting in general, including more coaching content with a broader range of reply options. On the other hand, almost all participants would recommend the application to others, especially to specific user groups, such as older adults who feel lonely, have difficulties reading, are lower educated, or have little computer experience.

7.4.3 Comparison with Prior Research
This thesis contributes to the existing literature with the insight that adding speech to an application like COUCH strongly increases the user experience, even though many more errors are encountered in a speech-based system compared to a text-based system. Taking into account the original purpose of COUCH to engage users in humanlike interactions for a longer period of time, the finding that users perceived a vocal interaction as much more entertaining, personal, interactive, and like a real conversation is very important. These findings relate to the results of the study conducted with the original application, in which the text-based application scored low on the domains 'intention to use', 'recommend it to others', and 'enjoyment' [62]. Participants did not have the intention to use the system or recommend it to others, and did not think the system was entertaining or exciting. Results obtained with the controlled experiment showed contrary results for the 'recommendation to others' and 'entertainment' fields, where most participants recommended the speech-version to older adults and all participants experienced the speech-version to be more fun. We did not focus on 'intention to use' in the controlled experiment, because the participants generally did not fall within the target group. On the other hand, the field experiment revealed similar results in that participants did not have the intention to use the system, although they would recommend it to others. However, Hurmuz et al. [62] mentioned that participants expected to get more in-depth and personal advice. 'This gap between participants' expectations and the reality made participants probably less positive about the overall working of the system' [62]. It is important to notice that two of the participants of the field experiment had already participated in the original study [62] and therefore had different expectations for this follow-up study. This might have minimized this gap between expectations and reality.

Furthermore, this thesis contributes to the existing literature with its findings regarding speech synthesis. For the tightly structured COUCH application, the TTS with additional adjustments via SSML resulted in a positive experience of the voices and of the ability of the coaches to talk. In contrast, from Section 3 we learned that the voice of the KRISTINA agent was considered to be monotonous and that a major source of displeasure with the virtual meditation coach was the lack of a humanlike synthesized voice [46, 48]. These issues contrast with the results of our study, in which the Google TTS voices were generally perceived as pleasant and clear to listen to. In the Related Work section we also discussed the exercise advisor from Bickmore et al. [42, 49]. The results of this research are in line with the results found in the two studies with the exercise advisor. Their initial research [49] stated that deploying conversational interfaces does not imply that natural language understanding must be used. This is in line with our finding that many participants appreciated the fact that the coaches could talk and that TTS alone already improved the naturalness of the interaction. However, in the exercise advisor's follow-up research [42], participants mentioned that they could not express themselves completely using the constrained, multiple-choice interaction. This finding also resulted from our study, where participants could use their voice, but were still limited to the multiple-choice interaction. They preferred the option to formulate their own answers and questions.

7.5 Recommendations

This section describes future research directions and recommendations for the work described in this thesis. It is divided into two parts: Section 7.5.1 provides research directions for future work, and Section 7.5.2 provides explicit suggestions for improving the application, based on these research directions.

7.5.1 Future Work

The current research showed that the addition of speech to dialogues in an application like COUCH is very valuable, but that users were very limited in their options to speak. Although users could deviate to some extent from the answer options, they experienced that the system only responded to the (almost) exact reply options. With the current implementation, this issue might be mitigated by manually adding large sets of keywords, but this is very time-consuming and inefficient. A recommendation for future work is to look into smart ways to automate the process of creating large keyword sets. In addition, research should be done into how dialogue authoring tools (e.g. the WOOL dialogue platform) can be improved to support this automation and help construct better speech-based dialogue systems. These improvements would allow users to use a larger vocabulary to navigate through the dialogues. Moreover, they could make it possible to remove the written reply options, creating an interface that is more user-friendly for older adults, who more often have visual impairments. Because this thesis focused on implementing speech in an existing application, smart ways to adjust the existing dialogues were investigated. Apart from looking into smart ways to automate the process of keyword tagging, it might be of interest to investigate ways in which new dialogues can be created and tagged with keywords when a dialogue-based application is built from scratch.

From Chapter 1 we learned that keeping the user engaged over a longer period of time is one of the major challenges in e-coaching [17]. When the application does not offer a large variety of content, interactions can become repetitive over time. Our results suggested that speech-based interactions can positively influence long-term engagement, because they proved to be much more innovative, entertaining and interesting to use. However, the novelty effect may wear off over time, negatively influencing long-term engagement. Future research should look into ways to overcome this problem in speech-based e-health applications.

The biggest bottleneck for a pleasant experience with our application was the slow responsiveness of the system. During the interaction it was often unclear to users whether or not they were understood by the system, simply because it took a long time for the system to respond. This problem is probably caused by the combination of a slow ASR and the speed of the COUCH application in general. Future work could test the speed of the NLSpraak ASR in a different application and investigate ways to improve the reaction speed of a speech-based interface; a minimal timing sketch is given below.
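To separate the ASR latency from the rest of the pipeline, the recognizer's round-trip time can be measured in isolation. The sketch below is a minimal illustration, assuming a hypothetical recognize wrapper around the NLSpraak call; neither the function name nor the file path reflects the actual implementation.

    import time

    def recognize(audio: bytes) -> str:
        # Stand-in for the NLSpraak ASR request; replace with the real client.
        return ""

    def timed_recognition(path: str) -> tuple[str, float]:
        """Return the transcript and the ASR round-trip time in seconds."""
        with open(path, "rb") as f:
            audio = f.read()
        start = time.perf_counter()
        transcript = recognize(audio)
        return transcript, time.perf_counter() - start

    # Hypothetical usage: transcript, latency = timed_recognition("sample.wav")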

Another suggestion for future work is to include more dialogue management strategies and more conversational feedback mechanisms. We already included the coaches' clarification questions and the automatic repetition after some time without speech, but this did not cover all conversational errors. Especially the situation in which the system does not respond despite spoken input can be improved. Currently, clarification questions are asked only when no keyword is spotted and an unknown word is detected, while the conversation would improve considerably if feedback were also provided when there is spoken audio that is not (correctly) captured by the ASR. A sketch of such a feedback policy is given below.
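One way such a policy could be organized is sketched below, assuming a hypothetical ASR wrapper that reports both the transcript and whether any voice activity was detected. The 20-second timeout mirrors the repetition delay already used in the application; all other names are illustrative and not part of the actual COUCH implementation.

    from dataclasses import dataclass

    @dataclass
    class AsrResult:
        transcript: str       # text returned by the ASR ("" if nothing)
        voice_detected: bool  # True if the microphone picked up speech

    TIMEOUT_S = 20.0  # repeat the coach's statement after 20 s of silence

    def feedback_action(result: AsrResult, keywords: set[str],
                        seconds_since_prompt: float) -> str:
        words = set(result.transcript.lower().split())
        if words & keywords:
            return "advance_dialogue"       # keyword spotted: continue
        if result.transcript:
            return "ask_clarification"      # speech recognized, but no keyword
        if result.voice_detected:
            return "report_not_understood"  # audio heard, ASR produced nothing
        if seconds_since_prompt > TIMEOUT_S:
            return "repeat_statement"       # prolonged silence: repeat prompt
        return "keep_listening"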

A final interesting direction is to include free speech, because this can make the interaction much more interesting and personal. For this, it is necessary that the coaches have a larger vocabulary (manually created or via automatic content generation) and can respond appropriately to different input. In this case the application can step away from the text balloons and instead provide better-tailored coaching sessions, in which users are presented with personal advice based on subjects they lack knowledge about. For this research direction, more advanced ASRs (especially advancements in response speed) as well as in-home devices are necessary, as suggested by Bickmore [42]. Available systems need to ensure sufficiently high reliability, given the variability and differences in voice quality among older adults. Free speech allows users to choose topics that are interesting or relevant for their personal situation and discuss those with the coaches. Additionally, free speech gives older adults the option to tell stories and have small talk with conversational interfaces, which can help build relationships with the coaches [44]. This might give the application an extra role as a social companion for older adults suffering from loneliness.

7.5.2 Application Improvements

This section provides a list of suggestions that can be considered to improve the system developed for this thesis. Suggestions include automating the keyword retrieval, improving the name recognition, dialogue management strategies and feedback mechanisms, as well as some graphical user interface additions. These ideas differ in level of complexity, ranging from quite simple to very complex.

One simple suggestion is to follow the guidelines for designing voice user interfaces and make use of visual confirmation [53]. Since the text on the buttons is similar to what the user has to say, the system can simply highlight the button that was spoken. The user then knows whether he or she was understood correctly (or incorrectly) and whether the dialogue continued as intended. Another option is to present the transcription of the speech as one of the replies under the text balloons. This will not improve the recognition itself, but at least users know if, and what, the system captured from the sentence they spoke. It can also improve the way they speak to the system (and therefore reduce the error rate), because it allows users to see which words were misrecognized. This idea is similar to the functionality of VUIs that show the text captured by the ASR on the screen. A minimal sketch of how a spoken reply could be matched to a button follows below. One last idea for the graphical user interface is to keep the interface 2D, but make the cartoonish characters animated, like characters in animated movies. In this way it is obvious to the user which coach is talking, and it might improve the entertainment level of the application.
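The visual confirmation could be driven by a simple word-overlap match between the transcript and the on-screen reply labels. The sketch below is illustrative only; the button labels in the example are hypothetical and this is not the COUCH implementation.

    def button_to_highlight(transcript: str,
                            button_labels: list[str]) -> int | None:
        """Return the index of the reply button best matching the transcript."""
        spoken = set(transcript.lower().split())
        best_index, best_overlap = None, 0
        for i, label in enumerate(button_labels):
            overlap = len(spoken & set(label.lower().split()))
            if overlap > best_overlap:
                best_index, best_overlap = i, overlap
        return best_index  # None if no label shares a word with the transcript

    # Hypothetical example: prints 0, so the first button would be highlighted.
    print(button_to_highlight("yes i would like that",
                              ["Yes, I would like that", "No, thank you"]))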

Another easy suggestion, which would make the application suitable for visually impaired older adults without changing the complete structure of the original application, is to make replying to the statements voice-directed as well. Instead of the user reading the reply options and speaking one of them, a coach can read the options aloud and ask the user how he or she wants to respond, after which the system clicks the corresponding reply.

A more complex suggestion for improving the application is to automatically retrieve synonyms of the keywords. The WOOL dialogue platform makes it possible for health experts to write dialogues without any technical programming skills. Since these experts are responsible for the content of the dialogue, it would be helpful if they could also decide on the keywords. In this way they can write statements with multiple reply options, all containing a different main word (verb or noun), and add one distinctive keyword for each of these options. Based on this keyword, a complete list of synonyms can be retrieved, for example via the Open Dutch Wordnet1, and shown as a suggestion to the writer. This corpus contains almost 118,000 Dutch synsets, which are conceptual representations of a word. For example, the word 'dog' belongs to a synset like [dog, animal, pet]. In this way, the synset for a word can be retrieved from the corpus and proposed as additional keywords, and dialogue writers can decide for themselves which keywords to add by clicking the relevant proposed synonyms. An example of such a suggestion is provided in Figure 7.1, and a minimal code sketch of such a synonym lookup follows below. When a reasonable number of keywords all lead to a specific reply, this allows users to engage in richer conversations and express themselves more freely, while maintaining the ease of use of the multiple-choice selection input modality. Considering the recommendations of Bickmore et al., even when taking such an approach it remains important that the ASR ensures high reliability, given the variability and differences in voice quality among older adults.

Figure 7.1: An example of how automatically generated keywords can be presented to the dialogue author.

1http://wordpress.let.vupr.nl/odwn/data/
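As a sketch of such a lookup, NLTK's Open Multilingual Wordnet (which includes Dutch) can stand in for the Open Dutch Wordnet; this is an assumption made for illustration, and it requires the 'wordnet' and 'omw-1.4' NLTK data packages to be installed. The keyword 'hond' (dog) is only an example.

    from nltk.corpus import wordnet as wn

    def synonym_suggestions(keyword: str, lang: str = "nld") -> set[str]:
        """Collect lemma names from every synset the keyword belongs to."""
        suggestions = set()
        for synset in wn.synsets(keyword, lang=lang):
            suggestions.update(lemma.replace("_", " ")
                               for lemma in synset.lemma_names(lang))
        suggestions.discard(keyword)
        return suggestions  # shown to the dialogue author as proposed keywords

    # Example: synonym_suggestions("hond") returns Dutch words that share a
    # synset with 'hond' and could be clicked as additional keywords.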

Because the recognition of the coach names performed very poorly during the experiments, improving this recognition will reduce the number of errors made by the system. One very simple option is to add, as keywords, the transcriptions that are frequently produced for a name, which we already did for Helen (i.e. 'helen', 'alan', 'hellen', 'ellen'), but not for the other coaches. Moreover, for Helen these four extra keywords were not enough. Words that are similar to the names can be added as keywords by checking which words the ASR frequently recognizes. A less manual implementation is to automatically compare a recognized word with the coach name. The edit distance can help to calculate how similar two strings are. "The minimum edit distance between two strings is defined as the minimum number of editing operations (operations like insertion, deletion, substitution) needed to transform one string into another" [63]. For example, the ASR transcribed the string 'gouda' while the user spoke 'coda'. The distance between these two words is 2 (remove a 'u', substitute 'g' for 'c'). This similarity measure can calculate how close a spoken word is to each of the coach keywords; by setting a maximum distance, a spoken word that requires fewer operations than this maximum can be interpreted as the corresponding coach keyword. A sketch of such a matcher is given below.
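The sketch below implements the minimum edit distance of [63] and uses it to map a recognized word to a coach name. The name list is a small illustrative subset (Helen and Coda, from the examples above), and the maximum distance of 2 is an untuned, hypothetical threshold.

    def edit_distance(a: str, b: str) -> int:
        """Minimum number of insertions, deletions and substitutions [63]."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            curr = [i]
            for j, cb in enumerate(b, start=1):
                curr.append(min(prev[j] + 1,                # deletion
                                curr[j - 1] + 1,            # insertion
                                prev[j - 1] + (ca != cb)))  # substitution
            prev = curr
        return prev[-1]

    COACH_NAMES = ["helen", "coda"]  # illustrative subset; extend with the full council

    def match_coach(recognized: str, max_distance: int = 2) -> str | None:
        """Return the closest coach name within the allowed edit distance."""
        best = min(COACH_NAMES, key=lambda name: edit_distance(recognized, name))
        return best if edit_distance(recognized, best) <= max_distance else None

    print(match_coach("gouda"))  # -> 'coda': distance 2, as in the example above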

In the current application the ready-made NLSpraak ASR was used. As a result, the ASR was a black box that recorded the spoken audio and returned a transcription as written text, but could not be improved for the purpose of the study. When the ASR is fully controlled, machine learning techniques can be used to train it, for example by training the ASR on older adults' voices, or by teaching it to retrieve new keywords from words that are often spoken in a specific context. Also, most ASRs work best for English, because most training data is available for this language. The performance of the ASR might be improved by training it on larger Dutch corpora.

The controlled experiment showed that participants really enjoyed the radio feature implemented in the text-version but removed from the speech-version. For future work it might be interesting to use speech filtering techniques to mitigate noise effects, so the radio can be turned on without triggering the ASR component. Additionally, these filtering techniques would allow participants to start the application without explicitly clicking the 'start speech' button, but instead trigger the application by speaking a specific sentence (e.g. the Google Home starts listening to the user when 'Hey Google' is spoken). This can again be beneficial for people experiencing difficulties with reading and clicking text balloons. A minimal sketch of such a wake-phrase gate is given below.
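A simple version of this idea can be sketched as a gate that discards ASR output until a wake phrase is heard. The phrase 'hallo coaches' is purely hypothetical, and a production system would run a dedicated keyword spotter on the raw audio rather than on transcripts.

    WAKE_PHRASE = "hallo coaches"  # hypothetical trigger sentence

    class WakePhraseGate:
        """Forward transcripts to the dialogue logic only after the wake phrase."""

        def __init__(self) -> None:
            self.listening = False  # ASR results are ignored until woken

        def on_transcript(self, transcript: str) -> str | None:
            if not self.listening:
                if WAKE_PHRASE in transcript.lower():
                    self.listening = True  # wake up and start the session
                return None                # drop everything else (e.g. the radio)
            return transcript              # pass speech on to the dialogue logic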

Chapter 8

Conclusion

This thesis aimed to explore the possibility of adding speech to the Council of Coaches application. The intention was to create a speech-based application that can maintain a spoken conversation with a user, creating an engaging and humanlike experience. The application required the following: it should be able to (1) recognize user input, (2) correctly transcribe the input to written text, (3) check the transcription for specified keywords, (4) correctly trigger the adequate reply to continue the dialogue, and (5) transform the new dialogue content into output audio. A minimal sketch of this five-step loop is given below.
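For illustration only, the five steps can be sketched as a single turn-handling function; every component here is a stub standing in for the real ASR, dialogue engine and TTS, so only the control flow reflects the description above.

    def transcribe(audio: bytes) -> str:                # steps 1-2: ASR
        return "ja graag"                               # canned demo transcript

    def next_statement_for(reply_id: str) -> str:       # step 4: dialogue engine
        return f"Coach statement for reply '{reply_id}'"

    def synthesize(text: str) -> bytes:                 # step 5: TTS
        return text.encode("utf-8")                     # stand-in for audio

    def speech_turn(audio: bytes, keyword_to_reply: dict[str, str]) -> bytes:
        transcript = transcribe(audio)
        for word in transcript.lower().split():         # step 3: keyword spotting
            if word in keyword_to_reply:
                return synthesize(next_statement_for(keyword_to_reply[word]))
        return synthesize("Sorry, kunt u dat herhalen?")  # clarification fallback

    print(speech_turn(b"...", {"ja": "accept", "nee": "decline"}))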

Even though the system made a significant number of recognition errors in some cases, many participants of the controlled experiment found the interaction fun, interesting and interactive, and preferred the speech-based application over the original text-based application. On the other hand, the field experiment resulted in a less enjoyable interaction, partly caused by the content of the application. Moreover, the field experiment showed that speech technologies are not yet robust enough to be used effectively in a home setting where participants used tablets with lower-quality microphones.

The literature research preliminary to this thesis answered five sub-questions, mostly related to the implementation of ASR systems. Since the focus of this thesis was to allow speech interactions in both directions (i.e. user to coach, and coach to user), more research into TTS was required. Therefore, the current research answered two additional sub-questions focusing on the state of the art in TTS. Because speech was implemented in an existing application that used a fixed and self-contained dialogue structure, sub-questions about smart ways to handle the dialogues were added, as well as sub-questions about additional graphical user interface features. The sub-questions were for the most part answered before the system implementation, although the experimental studies provided additional insights.

To answer the research's main question, "To what extent can spoken interaction offer a valuable addition to the multi-party virtual Council of Coaches application?", two experiments were conducted, each answering a different research question.

RQ:1 The addition of speech to the Council of Coaches application leads to an increase in user experience. Statistical analyses of the controlled experiment showed that participants rated the speech-version as significantly more novel (items: creative, inventive, leading edge, innovative) and stimulating (items: valuable, exciting, interesting, motivating). Additionally, results showed the speech-version to be more interesting, credible, fun to use and less monotonous than the text-version, and that participants would use the speech-version for a longer period of time, more often follow advice from it, sooner recommend it to older adults, and rather spend money on the speech-version. Although the implementation of speech in the COUCH application did lead to higher user evaluations for certain aspects, a significant number of errors was made by the system, indicating that the technologies and implementation are far from perfect yet. The main problem of the speech-based system was its lack of responsiveness.

RQ:2 The current state of the art in speech recognition systems and commercial technologies (i.e. devices with proper microphones) is not robust enough to create a usable and enjoyable system for an in-home setting. The field study showed large differences between participants in the number of problems experienced. For some participants the system worked well without many errors, while other participants were hardly understood. In some cases, how well the speech was recognized depended on the moment. Although the speech recognition turned out not to be very robust, it still created a system that was more enjoyable than the original text-version. In particular, the combination of written text and spoken speech was well received, because this improved the efficiency of the interaction when the ASR left much to be desired. None of the users would like to use such an application themselves in the future, but it can be useful for specific user groups.

Considering the finding that users appreciated the possibility of speech interactions much more than textual interactions, even though the application was far from robust, adding speech to such e-health applications offers good future prospects. With the rapid advancements in technology, it can be expected that in a few years commercial devices will be developed enough to work properly in home settings. On the other hand, concepts like free speech for communication with devices are much more complex, though still very relevant for such e-health applications.

The next step for the speech-based COUCH application would be to improve the keyword retrieval through automatic keyword generation, which can decrease the workload of dialogue authors. Automatic keyword generation can be done by, for example, synonym retrieval from a Dutch corpus. The recognition of names can be improved by implementing a similarity measure that can determine whether, and which, name is spoken. The naturalness and fluency of the conversation can be improved by including more dialogue management strategies. Overall, we believe that speech can positively contribute to such e-health applications, even though the technologies are far from perfect.

Bibliography

[1] Laura Bosdriesz, Dennis Reidsma, Daniel Patrick Davison, and Harm op den Akker. Opportunities and challenges for adding speech to dialogues with a council of coaches. 2020.

[2] European Union. Ageing Europe. 2019.

[3] Jorunn L. Helbostad, Beatrix Vereijken, Clemens Becker, Chris Todd, Kristin Taraldsen, Mirjam Pijnappels, Kamiar Aminian, and Sabato Mellone. Mobile health applications to promote active and healthy ageing. Sensors (Switzerland), 17(3):1–13, 2017.

[4] Christopher J.L. Murray, Theo Vos, Rafael Lozano, Mohsen Naghavi, Abraham D. Flaxman, Catherine Michaud, et al. Disability-adjusted life years (DALYs) for 291 diseases and injuries in 21 regions, 1990-2010: A systematic analysis for the Global Burden of Disease Study 2010. The Lancet, 380(9859):2197–2223, 2012.

[5] Efraim Jaul and Jeremy Barron. Age-Related Diseases and Clinical and Public Health Implications for the 85 Years Old and Over Population. Frontiers in Public Health, 5(December):1–7, 2017.

[6] Grigorios Karageorgos, Ioannis Andreadis, Konstantinos Psychas, George Mourkousis, Asimina Kiourti, Gianluca Lazzi, and Konstantina S. Nikita. The Promise of Mobile Technologies for the Health Care System in the Developing World: A Systematic Review. IEEE Reviews in Biomedical Engineering, 12(c):100–122, 2018.

[7] Predrag Klasnja and Wanda Pratt. Healthcare in the pocket: Mapping the space of mobile- phone health interventions. Journal of Biomedical Informatics, 45(1):184–198, 2012.

[8] Huan Li, Qi Zhang, and Kejie Lu. Integrating mobile sensing and social network for personalized health-care application. Proceedings of the ACM Symposium on Applied Computing, 13-17-April:527–534, 2015.

[9] Michel C.A. Klein, Adnan Manzoor, Anouk Middelweerd, Julia S. Mollee, and Saskia J. Te Velde. Encouraging Physical Activity via a Personalized Mobile System. IEEE Internet Computing, 19(4):20–27, 2015.

[10] Gerwin Huizing, Randy Klaassen, and Dirk Heylen. Designing and developing lifelike, engaging lifestyle coaching agents and scenarios for multiparty coaching interaction. CEUR Workshop Proceedings, 2338:25–29, 2018.

[11] H. Hermens, H. op den Akker, M. Tabak, J. Wijsman, and M. Vollenbroek. Personalized Coaching Systems to support healthy behavior in people with chronic conditions. Journal of Electromyography and Kinesiology, 24(6):815–826, 2014.

[12] Harm op den Akker, Valerie M. Jones, and Hermie J. Hermens. Tailoring real-time physical activity coaching systems: a literature survey and model. User Modeling and User-Adapted Interaction, 24(5):351–392, 2014.

[13] Randy Klaassen, Rieks op den Akker, Pierpaolo Di Bitonto, Gert Jan Burger, Kim Bul, and Pam Kato. PERGAMON: A serious gaming and digital coaching platform supporting patients and healthcare professionals. International Conference on ENTERprise Information Systems/International Conference on Project MANagement/International Conference on Health and Social Care Information Systems and Technologies, CENTERIS/ProjMAN/HCist 2016; Book of industry papers and abstracts, (0):1–6, 2016.

[14] Reshmashree B. Kantharaju, Alison Pease, Dominic De Franco, and Catherine Pelachaud. Is two better than one? Effects of multiple agents on user persuasion. Proceedings of the 18th International Conference on Intelligent Virtual Agents, IVA 2018, pages 255–262, 2018.

[15] Harm op den Akker, Rieks op den Akker, Tessa Beinema, Oresti Banos, Dirk Heylen, Björn Bedsted, Alison Pease, Catherine Pelachaud, Vicente Traver Salcedo, Sofoklis Kyriazakos, and Hermie Hermens. Council of Coaches: A novel holistic behavior change coaching approach. ICT4AWE 2018 - Proceedings of the 4th International Conference on Information and Communication Technologies for Ageing Well and e-Health, 2018-March(Ict4awe 2018):219–226, 2018.

[16] Dennis Reidsma, Gerwin Huizing, Randy Klaassen, Daniel Davison, Kostas Konsolakis, Marcel Weusthof, Oresti Baños, Jorien van Loon, Merijn Bruijnes, Tessa Beinema, Silke ter Stal, Dennis Hofs, and Donatella Simonetti. D7.5: Final Council of Coaches Technical Prototype. Technical report, 2019.

[17] Markku Turunen, Jaakko Hakulinen, Olov Ståhl, Björn Gambäck, Preben Hansen, Mari C. Rodríguez Gancedo, Raúl Santos De La Cámara, Cameron Smith, Daniel Charlton, and Marc Cavazza. Multimodal and mobile conversational Health and Fitness Companions. Computer Speech and Language, 25(2):192–209, 2011.

[18] Mary Ellen Foster. Enhancing human-computer interaction with embodied conversational agents. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 4555 LNCS(PART 2):828–837, 2007.

[19] Christina R. Victor and Keming Yang. The prevalence of loneliness among adults: A case study of the United Kingdom. Journal of Psychology: Interdisciplinary and Applied, 146(1-2):85–104, 2012.

[20] Matthew B. Hoy. Alexa, Siri, Cortana, and More: An Introduction to Voice Assistants. Medical Reference Services Quarterly, 37(1):81–88, 2018.

[21] Heidi Horstmann Koester. Usage, performance, and satisfaction outcomes for experienced users of automatic speech recognition. Journal of Rehabilitation Research and Development, 41(5):739–754, 2004.

[22] Thomas Pellegrini, Isabel Trancoso, Annika Hämäläinen, António Calado, Miguel Sales Dias, and Daniela Braga. Impact of age in ASR for the elderly: Preliminary experiments in European Portuguese. Communications in Computer and Information Science, 328 CCIS:139–147, 2012.

[23] Michael McTear, David Griol, and Zoraida Callejas. The Conversational Interface. Springer, 2016.

[24] Veton Kepuska and Gamal Bohouta. Next-generation of virtual personal assistants (Microsoft Cortana, Apple Siri, and Google Home). 2018 IEEE 8th Annual Computing and Communication Workshop and Conference, CCWC 2018, 2018-January:99–103, 2018.

[25] Raquel Fernández. Dialogue. Oxford Handbooks Online, August 2015.

[26] Herbert H. Clark and Edward F. Schaefer. Contributing to discourse. Cognitive Science, 13(2):259–294, 1989.

[27] Victor W. Zue and James R. Glass. Conversational interfaces: Advances and challenges. Proceedings of the IEEE, 88(8):1166–1180, 2000.

[28] Victoria Young and Alex Mihailidis. Difficulties in automatic speech recognition of dysarthric speakers and implications for speech-based applications used by the elderly: A literature review. Assistive Technology, 22(2):99–112, 2010.

[29] Laurens van der Werff and Roeland Ordelman. Kaldi-NL scoort uitstekend op NBest evaluatie, 2016.

[30] Daniel Povey, Arnab Ghoshal, Nagendra Goel, Mirko Hannemann, Yanmin Qian, Petr Schwarz, Jan Silovsky, and Petr Motlicek. The Kaldi Speech Recognition Toolkit. IEEE Signal Processing Society, 2011.

[31] Yao Qian, Rutuja Ubale, Patrick Lange, Keelan Evanini, Vikram Ramanarayanan, and Frank K. Soong. Spoken Language Understanding of Human-Machine Conversations for Language Learning Applications. Journal of Signal Processing Systems, 2019.

[32] Jan Gerrit Harms, Pavel Kucherbaev, Alessandro Bozzon, and Geert Jan Houben. Approaches for dialog management in conversational agents. IEEE Internet Computing, 23(2):13–22, 2019.

[33] Sadhana Gopal, Trishant Malik, and Seema Devi. A simple phoneme based speech recog- nition system. International Journal of Modern Communication Technologies & Research, 2(4), 2014.

[34] Karolina Kuligowska, Pawel Kisielewicz, and Aleksandra Włodarz. Speech synthesis systems: Disadvantages and limitations. International Journal of Engineering and Technology (UAE), 7(2):234–239, 2018.

[35] M. Benzeghiba, R. De Mori, O. Deroo, S. Dupont, T. Erbes, D. Jouvet, L. Fissore, P. Laface, A. Mertins, C. Ris, R. Rose, V. Tyagi, and C. Wellekens. Automatic speech recognition and speech variability: A review. Speech Communication, 49(10-11):763–786, 2007.

[36] Robert Jim Firby and Peter Graff. Environmental noise detection for dialog systems. 2017.

[37] Ravichander Vipperla, Steve Renals, and Joe Frankel. Longitudinal study of ASR performance on ageing voices. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, pages 2550–2553, 2008.

[38] Nelson Roy, Joseph Stemple, Ray M. Merrill, and Lisa Thomas. Epidemiology of voice disorders in the elderly: Preliminary findings. Laryngoscope, 117(4):628–633, 2007.

[39] Hee Rin Lee, Selma Šabanović, and Erik Stolterman. How Humanlike Should a Social Robot Be: A User-Centered Exploration. pages 135–141, 2016.

[40] Stephen Anderson, Natalie Liberman, Erica Bernstein, Stephen Foster, Erin Cate, Brenda Levin, and Randy Hudson. Recognition of elderly speech and voice-driven document retrieval. ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, 1:145–148, 1999.

[41] J T Luo, Peter McGoldrick, Susan Beatty, and Kathleen A Keeling. On-screen characters: Their design and influence on consumer trust. Journal of Services Marketing, 2:112–124, 2006.

[42] Timothy W. Bickmore, Lisa Caruso, Kerri Clough-Gorr, and Tim Heeren. 'It's just like you talk to a friend' relational agents for older adults. Interacting with Computers, 17(6):711–735, 2005.

[43] Elayne Ruane, Abeba Birhane, and Anthony Ventresque. Conversational AI: Social and ethical considerations. CEUR Workshop Proceedings, 2563(December 2019):104–115, 2019.

[44] Laura Pfeifer Vardoulakis, Lazlo Ring, Barbara Barry, Candace L. Sidner, and Timothy Bickmore. Designing relational agents as long term social companions for older adults. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 7502 LNAI:289–302, 2012.

[45] Leo Wanner, Elisabeth André, Josep Blat, Stamatia Dasiopoulou, Mireia Farrùs, Thiago Fraga, Eleni Kamateri, Florian Lingenfelser, Gerard Llorach, Oriol Martínez, Georgios Meditskos, Simon Mille, Wolfgang Minker, Louisa Pragst, Dominik Schiller, Andries Stam, Ludo Stellingwerff, Federico Sukno, Bianca Vieru, and Stefanos Vrochidis. Kristina: A knowledge-based virtual conversation agent. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 10349 LNCS:284–295, 2017.

[46] Leo Wanner, Anna Ehmann, Marlen Brachthäuser, Gerhard Eschweiler, Louisa Pragst, Oriol Martínez, Dominik Schiller, Florian Lingenfelser, Bianca Vieru, Mireia Farrús, Simon Mille, Mónica Domínguez, Josep Blat, Hermann Plass, Georgios Meditskos, and Benjamin Schäfer. Final System Report. Technical report, 2018.

[47] Ameneh Shamekhi and Timothy Bickmore. Breathe with me: A virtual meditation coach. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 9238:279–282, 2015.

[48] Ameneh Shamekhi and Timothy Bickmore. Breathe deep: A breath-sensitive interactive meditation coach. ACM International Conference Proceeding Series, (May):108–117, 2018.

[49] Timothy W Bickmore and Rosalind W Picard. Establishing and Maintaining Long-Term Human-Computer Relationships. ACM Transactions on Computer-Human Interaction, 12(2):617–638, 2005.

[50] Tessa Beinema, Harm op den Akker, Stephanie Jansen-Kosterink, Silke ter Stal, and Janet van den Boer. D3.4: Final coaching actions and content. Technical report, 2019.

[51] Gerwin Huizing, Randy Klaassen, Reshmashree Bangalore Kantharaju, Silke ter Stal, Tessa Beinema, Harm op den Akker, and Merijn Bruijnes. D6.2: Initial user interface design for Home UI and Mobile UI. Technical report, 2018.

[52] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A Generative Model for Raw Audio. pages 1–15, 2016.

[53] Cathy Pearl. Designing Voice User Interfaces. O’Reilly, 2016.

[54] Martin Schrepp. User Experience Questionnaire Handbook. pages 1–15, 2019.

[55] Martin Schrepp, Andreas Hinderks, and Jörg Thomaschewski. Applying the User Experience Questionnaire (UEQ) in Different Evaluation Scenarios. 8517(03), 2014.

[56] Bettina Laugwitz, Theo Held, and Martin Schrepp. Construction and evaluation of a user experience questionnaire. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 5298 LNCS:63–76, 2008.

[57] J K Hendrix. An Embodied Conversational Agent in a Mobile Health Coaching Application. PhD thesis, 2013.

[58] Marian Z.M. Hurmuz, Stephanie M. Jansen-Kosterink, Harm op den Akker, and Hermie J. Hermens. D7.6: Demonstration Protocol, Ethical. Technical report, 2020.

[59] Darren George and Paul Mallery. SPSS for Windows Step by Step: A Simple Guide and Reference, 11.0 Update. Allyn and Bacon, 4th edition, 2003.

[60] Wolfgang Forstmeier, Eric-Jan Wagenmakers, and Timothy H Parker. Detecting and avoiding likely false-positive findings – a practical guide. Biological Reviews, 92(4):1941–1968, 2017.

[61] Kerstin Dautenhahn. Methodology & themes of human-robot interaction: A growing research field. International Journal of Advanced Robotic Systems, 4(1 SPEC. ISS.):103–108, 2007.

[62] Marian Hurmuz, Tessa Beinema, Stephanie Jansen-Kosterink, Dominic De Franco, Silke ter Stal, and Harm op den Akker. D7.7: Final Demonstration Results. Technical report, 2020.

[63] Daniel Jurafsky and James H. Martin. Regular Expressions, Text Normalization, Edit Distance. In Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. 2019.

Appendix A

Questionnaires and interviews

A.1 Controlled Experiment

A.1.1 Intake Questionnaire

A.1.2 Observational Measurements

OBSERVATIONAL MEASUREMENTS

The observation form contained the following columns:
- Dialogue steps (normal)
- Errors, subdivided into: user has to repeat / no reaction; user said something that was not understandable; system is repeating itself; user clicks to the next dialogue step; repeating questions from the system
- Chosen path
- Count of steps

001 Text

002 Speech

003 Text

004 Speech

005 Text

A.1.3 User experience

A.1.4 Explicit Comparison

A.2 Field Experiment

A.2.1 Intake Questionnaire

Same questionnaire as Appendix A.1.1.

A.2.2 Interview questions

1. How did you find interacting with the system, not in terms of the content of the coaches, but just the general workings of the system?
   - Follow-up: ask for examples; how could it be improved, according to you?
   - Expected answers such as: easy/difficult, fun/stupid
2. Would you recommend this speech-based version of the Council of Coaches to others?
   - Yes: Why? Can you give an example?
   - No: Why not? Can you give an example?
   - Did you experience any other advantages or disadvantages?
3. Did you experience many problems using the system?
   - Expected answers such as: the system did not understand me and did not continue the conversation / the system did not listen to the coach name / it was unclear how to operate the application / the system worked too slowly.
4. In what way did you talk to the coaches?
   - Expected answers such as: I said literally what was on the text balloons / I roughly followed what was on the text balloons / I liked to say everything that came to mind.
5. Did you feel that the coaches understood you?
   - Yes: Why? Can you give an example?
   - No: Why not? Can you give an example?
6. What would you like to change about the application?
7. Would you want to use the application yourself in the future?
   - Why (not)?
   - Follow-up: How could the application have made it more appealing for you to keep using it?
8. Which version did you find more pleasant to use, the original text-based version or this speech-based version?
   - Text: Why? Reasons why speech works poorly (slow, does not listen well, complicated, system does not do what I expect)
   - Speech: Why? Reasons why text works poorly (a lot of reading, boring)
   - No preference: Why?
9. A number of "conversation mechanisms" were implemented in the application. For example: after 20 seconds in the same dialogue step, that spoken step is repeated. And when no keyword was spotted and an unintelligible word was spoken, the coach asked the user to repeat it. Did you ever experience such a situation?
   - What did you think of this?
   - Did you feel that this made the system understand you better/worse/equally well?

Appendix B

Forms and information

B.1 Controlled experiment

B.1.1 Information brochure

FACULTY OF ELECTRICAL ENGINEERING, MATHEMATICS AND COMPUTER SCIENCE

Information sheet

INFORMATION SHEET FOR RESEARCH

'Adding Speech to the Council of Coaches Application'

Purpose of the research: This research is led by Laura Bosdriesz. The research is part of a graduation project in which a speech-based version of the Council of Coaches application was created. The goal of this research is to compare the original text-based system with the new speech-based system. The robustness of the speech recognition in the system is also tested.

How do we proceed? The research takes place at the university, in a project room in Carré or the DesignLab. There you will be welcomed, and we start with an introductory conversation and an explanation of the experiment. During this conversation and the entire experiment, the RIVM guidelines are observed. During the introductory conversation, the two versions of the Council of Coaches are explained and you are asked to fill in two questionnaires: one concerning your gender, age, education and work, and one concerning your experience with technologies. These questionnaires can be filled in via a Google form. After you have filled in the questionnaires, you receive a document with the steps to carry out with the applications. When you have read it, you can start with the first application. There is no right or wrong in the experiment, and it does not matter if not all described steps succeed. When you are done with the first system, the corresponding questionnaire can be filled in. After this, the second system is started and you are asked to repeat the described steps. This is again concluded with the same questionnaire as for the first system. The research ends with a comparison questionnaire and a few open questions. All questionnaires can be filled in online via a Google form.

Potential risks and discomforts: There are no physical, legal or economic risks attached to your participation in this study. You do not have to answer questions that you do not want to answer. Your participation is voluntary and you can stop your participation at any moment.

Confidentiality of data: We do our utmost to protect your privacy as well as possible. No confidential information or personal data from or about you will be released in any way that would allow anyone to recognize you. Before our research data is released, your data will be anonymized as much as possible.

Solely for the benefit of the research, the collected research data will be shared with authorized persons from the University of Twente and Roessingh Research & Development. The research data will be kept for a period of 10 years. At the latest after this period has expired, the data will be deleted. If necessary (for example for a check on scientific integrity), the research data will be made available to persons outside the research group, and only in anonymous form.

Voluntariness: Participation in this research is entirely voluntary. As a participant, you can stop your cooperation with the research at any time, or refuse permission for your data to be used for the research, without giving reasons. Ending your participation has no adverse consequences for you or for any compensation already received. If you decide to end your cooperation during the research, the data you have provided up to the moment of withdrawing your consent will be used in the research. Do you want to stop the research, or do you have questions and/or complaints? Then contact the research leader. Up to 24 hours after the research, you can still refuse permission for the data to be used.

Contact information: Dennis Reidsma (graduation supervisor) or Laura Bosdriesz (student)

This research is carried out from the University of Twente, Faculty of Electrical Engineering, Mathematics and Computer Science. For objections regarding the design and/or execution of the research, you can also contact the Secretary of the Ethics Committee of the Faculty of Electrical Engineering, Mathematics and Computer Science at the University of Twente: Contact Information

Ethics Committee, Faculty of EEMCS, University of Twente, PO Box 217, 7500 AE Enschede (NL). Tel: +31(0)53.4892085. Email: [email protected]. Website: https://www.utwente.nl/en/eemcs/research/ethics/

If you have specific questions about the handling of personal data, you can also address these to the Data Protection Officer of the UT by sending an email to [email protected]. Finally, you have the right to submit a request to the research leader for inspection, modification, deletion or adjustment of your data.

B.1.2 Informed consent

FACULTY OF ELECTRICAL ENGINEERING, MATHEMATICS AND COMPUTER SCIENCE

Research Consent Form

RESEARCH CONSENT FORM

By signing this consent form, I acknowledge the following: 1. I have been sufficiently informed about the research by means of a separate information sheet. I have read the information sheet and afterwards had the opportunity to ask questions. These questions have been answered satisfactorily. 2. I participate in this research voluntarily. There is no explicit or implicit coercion for me to take part in this research. It is clear to me that I can end my participation in the research at any moment, without giving a reason. I do not have to answer a question if I do not want to. 3. I give permission for authorized persons from the University of Twente and Roessingh Research & Development, and authorized authorities, to access my research data. I give permission for my data to be used for the purposes mentioned in the information sheet.

Name of participant: Email address: Date: Signature:

I hereby declare that I have fully informed the above-mentioned participant about the research in question.

Name of researcher: Laura Bosdriesz. Date: Signature:

B.1.3 Coaches Sheet


B.2 Field Experiment

B.2.1 Invitation letter

Dear (former) participant of the Council of Coaches,

My name is Laura Bosdriesz and I study Interaction Technology at the University of Twente. I am currently working on my graduation project, in which I have turned the Council of Coaches into an application that works through speech. This means that conversations with the coaches can now be held by talking to them, upon which the coaches talk back in turn. To evaluate this system, I am looking for people who already have experience using the Council of Coaches, in order to get a good picture of the preferences of the user and the effectiveness of a speech-based system.

I hope you enjoyed taking part in the first study, and I would like to ask whether you would be willing to participate in this graduation research for another week.

What does the research involve? If you are interested in the research, you will receive an information brochure with all the details of the research around mid-September. You will be contacted via email to make an appointment for an online introductory conversation. During this introductory conversation, the speech-version of the Council of Coaches is explained and you are asked to fill in two short questionnaires: one concerning your gender, age, education and work, and one concerning your experience with technologies. After this conversation, you can use the system as often as you want, for as long as you want, and at moments that suit you. One week later, another meeting via Skype is planned, in which you will be asked what you thought of the system, and a comparison is made with the original text-based system.

How can you participate? If you are interested or still have questions, I would be happy to hear from you via the contact details below: - Name: Laura Bosdriesz

Kind regards,

Laura Bosdriesz

B.2.2 Information brochure

FACULTY OF ELECTRICAL ENGINEERING, MATHEMATICS AND COMPUTER SCIENCE

Information sheet

INFORMATION SHEET FOR RESEARCH

'Adding Speech to the Council of Coaches Application'

Purpose of the research: This research is led by Laura Bosdriesz. The research is part of a graduation project in which a speech-based version of the Council of Coaches application was created. The goal of this research is to map the robustness of this system and the associated advantages and disadvantages.

How do we proceed? The research starts with an online introductory conversation. During this introductory conversation, the speech-version of the Council of Coaches is explained and you are asked to fill in two questionnaires: one concerning your gender, age, education and work, and one concerning your experience with technologies. These questionnaires can be filled in via a Google form. After this conversation, you can use the system as often as you want, for as long as you want, and at moments that suit you. You are requested to note on the provided journal form when you used the application and what the main things were that stood out during the interaction. One week later, another meeting via Skype is planned. During this meeting, information is gathered by interviewing you and recording your answers via a Skype video and audio recording. A transcript of the interview will also be produced. During this interview, questions are asked about your experiences with the system, and a comparison is made with the original text-based system.

Potential risks and discomforts: There are no physical, legal or economic risks attached to your participation in this study. You do not have to answer questions that you do not want to answer. Your participation is voluntary and you can stop your participation at any moment.

Compensation: You will not receive any compensation for participating in this research.

Confidentiality of data: We do our utmost to protect your privacy as well as possible. No confidential information or personal data from or about you will be released in any way that would allow anyone to recognize you. Before our research data is released, your data will be anonymized as much as possible.

Solely for the benefit of the research, the collected research data will be shared with authorized persons from the University of Twente and Roessingh Research & Development.

The audio recording is used only for producing a transcript and is deleted after 24 hours. The transcript remains confidential and is used only for analysis. The research data will be kept for a period of 10 years. At the latest after this period has expired, the data will be deleted. If necessary (for example for a check on scientific integrity), the research data will be made available to persons outside the research group, and only in anonymous form.

Voluntariness: Participation in this research is entirely voluntary. As a participant, you can stop your cooperation with the research at any time, or refuse permission for your data to be used for the research, without giving reasons. Ending your participation has no adverse consequences for you. If you decide to end your cooperation during the research, the data you have provided up to the moment of withdrawing your consent will be used in the research. Do you want to stop the research, or do you have questions and/or complaints? Then contact the research leader. Up to 24 hours after the research, you can still refuse permission for the data to be used.

Contact information: Dennis Reidsma (graduation supervisor) or Laura Bosdriesz (student)

This research is carried out from the University of Twente, Faculty of Electrical Engineering, Mathematics and Computer Science. For objections regarding the design and/or execution of the research, you can also contact the Secretary of the Ethics Committee of the Faculty of Electrical Engineering, Mathematics and Computer Science at the University of Twente: Contact Information

Ethics Committee, Faculty of EEMCS, University of Twente, PO Box 217, 7500 AE Enschede (NL). Tel: +31(0)53.4892085. Email: [email protected]. Website: https://www.utwente.nl/en/eemcs/research/ethics/

If you have specific questions about the handling of personal data, you can also address these to the Data Protection Officer of the UT by sending an email to [email protected]. Finally, you have the right to submit a request to the research leader for inspection, modification, deletion or adjustment of your data.

B.2.3 Informed consent

FACULTY OF ELECTRICAL ENGINEERING, MATHEMATICS AND COMPUTER SCIENCE

Research Consent Form

RESEARCH CONSENT FORM

By signing this consent form, I acknowledge the following: 1. I have been sufficiently informed about the research by means of a separate information sheet. I have read the information sheet and afterwards had the opportunity to ask questions. These questions have been answered satisfactorily. 2. I participate in this research voluntarily. There is no explicit or implicit coercion for me to take part in this research. It is clear to me that I can end my participation in the research at any moment, without giving a reason. I do not have to answer a question if I do not want to. 3. I give permission for authorized persons from the University of Twente and Roessingh Research & Development, and authorized authorities, to access my research data. I give permission for my data to be used for the purposes mentioned in the information sheet.

Name of participant: Email address: Date: Signature:

I hereby declare that I have fully informed the above-mentioned participant about the research in question.

Name of researcher: Laura Bosdriesz. Date: Signature:

B.2.4 Journal

Journal

Date:

Time used:

Positive experiences:

Problems experienced:

Other remarks:

Date:

Time used:

Positive experiences:

Problems experienced:

Other remarks:

Date:

Time used:

Positive experiences:

Problems experienced:

Other remarks:

Appendix C

Statistical Results

C.1 UEQ: Shapiro-Wilk test for normality check - text

Scale           Statistic   df   Sig.
Perspicuity       0.938     28   0.100
Efficiency        0.953     28   0.242
Dependability     0.975     28   0.729
Stimulation       0.965     28   0.461
Novelty           0.965     28   0.461

C.2 UEQ: Shapiro-Wilk test for normality check - speech

Scale           Statistic   df   Sig.
Perspicuity       0.879     28   0.004
Efficiency        0.946     28   0.156
Dependability     0.974     28   0.691
Stimulation       0.949     28   0.189
Novelty           0.950     28   0.203

C.3 UEQ: Paired samples t-test statistics for the comparison per scale

Pair     Measure                  Mean     N   Std. Deviation   Std. Error Mean
Pair 1   Perspicuity_text        1.7143   28      .52137             .09853
         Perspicuity_speech      1.8304   28      .66337             .12537
Pair 2   Efficiency_text         1.1518   28      .78569             .14848
         Efficiency_speech        .9286   28      .87363             .16510
Pair 3   Dependability_text      1.1607   28      .64267             .12145
         Dependability_speech     .9464   28      .65390             .12357
Pair 4   Stimulation_text         .4196   28      .98143             .18547
         Stimulation_speech      1.2411   28      .70541             .13331
Pair 5   Novelty_text             .2857   28     1.26877             .23978
         Novelty_speech          1.5268   28     1.00770             .19044

C.4 UEQ: Paired samples t-test results for the comparison per scale

Paired differences, with the 95% confidence interval of the difference:

Pair     Scale            Mean   Std. Dev.   Std. Error   CI Lower   CI Upper       t   df   Sig. (2-tailed)   Cohen's D
Pair 1   Perspicuity     -.116      .668        .126        -.375       .143     -.919   27        .366          -0.147
Pair 2   Efficiency       .223     1.231        .233        -.254       .701      .959   27        .346           0.181
Pair 3   Dependability    .214      .676        .128        -.048       .476     1.677   27        .105           0.317
Pair 4   Stimulation     -.821     1.080        .204       -1.240      -.403    -4.025   27        .000          -0.761
Pair 5   Novelty        -1.241     1.142        .216       -1.684      -.798    -5.753   27        .000          -1.087

C.5 Explicit Comparison: One-sample t-test statistics

Item                                               N    Mean   Std. Deviation   Std. Error Mean
1) ...thought was more pleasant to use.            28    .29       1.410             .267
2) ...thought was more efficient.                  28   -.86       1.268             .240
3) ...thought was more interesting.                28   1.64        .678             .128
4) ...thought was more credible.                   28    .86        .970             .183
5) ...thought was more fun.                        28   1.50        .793             .150
6) ...would use for a longer period.               28    .64       1.367             .258
7) ...would recommend to someone older than 55.    28    .93       1.274             .241
8) ...would follow advice from more often.         28   1.00        .943             .178
9) ...would rather spend money on.                 28   1.11        .832             .157
10) ...thought was more practical.                 28   -.29       1.243             .235
11) ...thought was more cumbersome to use.         28    .36        .951             .180
12) ...thought was more monotonous.                28   -.82        .905             .171
13) ...thought was more annoying.                  28    .04        .922             .174
14) ...thought was more inconvenient to use.       28    .29        .854             .161
15) ...thought was more repetitive.                28   -.25       1.005             .190
