Masaryk University Faculty of Informatics

Automatic Derivation of Grammar Rules Used for Speech Recognition in Dialogue Systems

Master’s Thesis

Bc. Klára Kufová

Brno, Spring 2018

This is where a copy of the official signed thesis assignment and a copy of the Statement of an Author is located in the printed version of the document.

Declaration

Hereby I declare that this paper is my original authorial work, which I have worked out on my own. All sources, references, and literature used or excerpted during elaboration of this work are properly cited and listed in complete reference to the due source.

Bc. Klára Kufová

Advisor: Mgr. Luděk Bártek, Ph.D.


Acknowledgements

First and foremost, I would like to acknowledge the thesis advisor, Mgr. Luděk Bártek, Ph.D., of the Faculty of Informatics at Masaryk University, for his valuable ideas, constructive advice, and the time dedicated to our consultations. My sincere thanks also go to both my current and former colleagues at , especially to Mgr. Milan Navrátil and Ing. Radovan Synek, who helped me tremendously during both the implementation and writing of this thesis. Last but not least, I would like to express my sincere gratitude to my parents, grandparents, my brother, and to my fiancé, who have supported and encouraged me throughout the course of my studies.

Abstract

The thesis deals with the process of building an adaptive dialogue system, which is capable of learning new grammar rules used for automatic speech recognition based on past conversations with real users. The subsequent abil- ity to automatically reduce the unused rules within a grammar is proposed and implemented as well. Such a dialogue system is domain-independent, effi- ciently extensible, and overcomes the most significant drawbacks of grammar- based language models utilized within speech recognizers. The described theoretical principles are demonstrated on the created conversational agent, which is prepared to be deployed in production as a virtual shopping assis- tant in an online fashion boutique. Apart from the system’s implementation details, the thesis provides a comprehensive overview of the area of dialogue systems in the field of artificial intelligence and natural language processing.

Keywords

dialogue system, grammar rules, speech recognition, Sphinx4, speech synthesis, MaryTTS, language modelling, chat bot, virtual personal assistant, Coco, natural language processing, speech understanding, speech generation, corrective dialogue, grammar expansion, grammar reduction


Contents

Introduction

1 State of the Art
  1.1 Contemporary Dialogue Systems
      1.1.1 Virtual Personal Assistants
      1.1.2 Business and Commerce
      1.1.3 Education and Healthcare
  1.2 Current Fields of Research
      1.2.1 Natural Language Understanding
      1.2.2 Dialogue Management
      1.2.3 Natural Language Generation

2 Building a Dialogue System
  2.1 Introducing Coco
      2.1.1 Problem Domain
      2.1.2 Deployment
  2.2 Voice Dialogue Standards
      2.2.1 VoiceXML
      2.2.2 Aspect Prophecy
      2.2.3 VoxML
  2.3 Input and Output Speech
      2.3.1 Speech Recognition
      2.3.2 Speech Synthesis
  2.4 Speech Understanding and Generation
      2.4.1 Speech Understanding
      2.4.2 Speech Generation

3 Adaptive Dialogue Systems
  3.1 Automatic Derivation of Grammar Rules
      3.1.1 Detecting Out-Of-Grammar Utterances
      3.1.2 Corrective Dialogue
      3.1.3 Grammar Expansion
      3.1.4 Dialogue Continuation
  3.2 Automatic Reduction of Grammar Rules
      3.2.1 Removing Old Rules
      3.2.2 Removing Unused Rules

4 Future Work

Conclusion

Bibliography

A Running Coco
  A.1 Software Distribution
  A.2 Execution Instructions
      A.2.1 Linux-Based Operating Systems
      A.2.2 Microsoft Windows

B Contributing to Coco
  B.1 Developing Coco
  B.2 Building Coco

C Example Dialogue with Coco

List of Figures

1.1 An architecture of a dialogue system.
2.1 The architecture of a VoiceXML application.
2.2 The architecture of the Sphinx4 speech recognition system.
2.3 An example search graph generated by the Linguist module.
2.4 The architecture of the MaryTTS speech synthesizer.
3.1 The schema of a corrective dialogue.


Introduction

„Simplicity is the keynote of all true elegance.“

– Coco Chanel

The idea of a real conversation with a machine has been tempting researchers in the fields of artificial intelligence and natural language processing from the very beginning. Starting in 1950, when the British journal Mind published the influential article Computing Machinery and Intelligence [1] written by Alan Mathison Turing, the area of natural language processing instantaneously emerged, and soon after, machines were not only able to talk, but also to recognize and understand human speech. Simple and naively operating dialogue systems from the 1960s shifted into the complex, sophisticated conversational agents of the new millennium. With the assistance of a dialogue system, making a restaurant reservation, booking a plane ticket, or recognizing a new song is a matter of seconds, while learning a new language or improving one’s mental health may be a matter of days.

Classified based on their area of usage, both former and contemporary dialogue systems are introduced in chapter 1. The chapter does not only mention the existing conversational agents, but also describes the current prominent research areas. The second half of the first chapter is therefore divided into three sections, which correspond to the fundamental components of a dialogue system: a natural language understanding unit, a dialogue manager, and a natural language generation unit.

The most important part of this thesis is Coco, a personal shopping assistant introduced at the beginning of chapter 2. The Coco dialogue system is named after Gabrielle Bonheur Chanel, the founder of the world-famous Parisian haute couture fashion house, nicknamed Coco. Principles and techniques associated with the implementation of a conversational agent are demonstrated on the created dialogue system. Apart from the characterization of the system’s problem domain, deployment, and personas, the chapter also introduces the most influential voice dialogue standards and thoroughly describes the used speech recognition and speech synthesis libraries and related methods, which were utilized within the system. The natural language understanding and generation logic implemented in Coco is mentioned as well.

The Coco dialogue system was created to demonstrate the ability of an artificial computer system to learn new grammar rules for speech recognition based on past conversations with real users. The automatic derivation of grammar rules, together with the employed approaches to their automatic reduction, is explained in detail in chapter 3. The final chapter 4 then briefly lists the system’s weak points and desirable enhancements, which may be included in one of the future releases of Coco to provide a more progressive, state-of-the-art dialogue interface.

The aim of the thesis is not only to define and build an adaptive dialogue system, but also to provide a brief but comprehensive overview of the conversational agents-specific field of artificial intelligence and natural language processing. And as stated at the beginning, it is crucial to achieve the goal as simply as possible.

1 State of the Art

At the time of writing the very first sentence of this thesis, researching and implementing computer dialogue systems, which can vary in many assorted aspects, has been a significant and fairly appealing part of the natural language processing area of computer science for over fifty years. From ELIZA1 and PARRY2, simple and intuitive early chat bots that did not, as a matter of fact, apply many artificial intelligence approaches—although they are considered to be early artificial intelligence programs—the research moved noticeably forward to complex and sophisticated systems, which can solve and address many current issues.

1.1 Contemporary Dialogue Systems

In general, dialogue systems can be divided by many diverse aspects. One of the most common methods of classifying dialogue systems is by the used initiative—a dialogue interface can have either a system initiative, user initiative, or a mixed initiative—but with the current notable progress in the area of multimodal human-computer interaction interfaces, it is becoming more common to categorize dialogue systems based on modality (a dialogue interface can be multimodal or controlled by a written text, spoken word, or through a graphical user interface). For the purposes of this thesis, a classification based on areas of usage is discussed, although each of the mentioned dialogue systems may belong to more—occasionally overlapping—categories.

1.1.1 Virtual Personal Assistants

Virtual personal assistants (hereinafter also referred to as VPAs) are possibly the most widely known type of dialogue systems, mainly due to their automatic availability in personal electronic devices. Current virtual personal assistants are built to perform a large number of different tasks, which makes the interaction with the personal device easier and faster, while also allowing the users not to focus on accomplishing their goals manually.

1. ELIZA was created between 1964 and 1966 by Joseph Weizenbaum at the Massachusetts Institute of Technology and is believed to be one of the very first dialogue systems ever implemented. The program simulated a conversation with a psychotherapist and its fundamental logic was based on the detection of critical words in the user’s text input [2].
2. PARRY—as a reaction to ELIZA—was supposed to act as a paranoid patient suffering from schizophrenia. PARRY, created by Kenneth Mark Colby, comprised more advanced approaches than ELIZA and was even examined by the famous Turing test [3].


The very first virtual personal assistant, IBM Shoebox—the size of an actual shoe box—was created by William C. Dersch and presented in 1962. It was able to recognize up to sixteen different spoken words, including numbers, and perform simple calculations (adding and subtracting) [4]. The IBM Shoebox, revolutionary at its time, is however entirely overshadowed by the contemporary VPAs, which are able to learn and adapt to new situations.

Siri, which is referred to by its authors as an “ontology-driven application”, is a speech-based dialogue system that forms an integral part of the operating systems produced by Apple Inc. Siri is a powerful and progressive personal assistant that is able to perform an advanced information search and directly present relevant search results, create reminders and events that are synchronized amongst all connected devices, make restaurant reservations, change device settings, schedule routes, and even recognize a song that is being played from a different device. Over time, Siri automatically adapts to its user and starts to personalize the provided results. As expected, such a complex task-oriented system must utilize many diverse technologies and must be integrated with other services to operate correctly. The ontology implemented in Siri is divided into six parts: user, language, dialogue, task, services, and domains [5]. Consequently, every user request comprises the following tasks: natural language processing (the user’s input utterance must be correctly recognized), word-sense disambiguation, identification of the actual request, assembling the data for a corresponding service, invoking the service, and presenting results to the user. As some of these tasks are solvable by machine learning techniques, it was announced that Siri, amongst other approaches, also applies deep neural networks and convolutional3 neural networks [7].

The most widely known competitive products include Google Assistant, Amazon Alexa, and Microsoft’s Cortana. Of the four mentioned VPAs, Google Assistant is the youngest; it was initially released in May 2016. It offers similar functions to Siri (planning routes, providing restaurant recommendations, sending text messages, or playing music), but due to its integration with other services from the Google family, Google Assistant is also able to search for images, turn a TV on and off, translate languages, convert currencies, or identify surroundings using a camera.

3. The architecture of convolutional neural networks differs from the architecture of traditional neural networks in several aspects, one of them being a 3-dimensional arrangement of neurons in each layer of the network. Although convolutional neural networks are most often connected with visual imagery, they have also been successfully utilized for natural language processing tasks [6].


With regards to this thesis, an especially appealing functionality provided by the mentioned systems is their assistance in real-time meal ordering and online shopping. At the beginning of 2017, Google announced a collaboration with more than 50 American retailers [8], which indicates an irrepressible growth of this area and its current necessity as a provided service.

1.1.2 Business and Commerce

Although all of the above listed intelligent assistants are considered commercial, this section is dedicated to alternative examples of dialogue systems used in business and commerce that do not fall into the category of VPAs. These are, for example, in-car dialogue systems, guided selling systems, or technical support and help desk systems.

The idea of in-car dialogue systems came hand in hand with drivers’ need for a simple and minimal-distraction control of passenger cars. Although the dialogue interactions are still unilateral in most cases—such systems are commonly controlled by simple spoken commands—the current research is focused on creating progressive and adaptable in-car assistants with wide-coverage spoken language understanding and great responsiveness [9].

In the field of marketing, guided selling is a common technique of assisting customers to purchase the product that suits their needs the most, and it aims to maximize the conversion rate4. In comparison with human employees, automated guided-selling solutions are cost-effective even with an increasing number of demands, and since the problem domain is generally quite limited, such dialogue systems tend to perform unusually well.

Similarly to the other types of dialogue systems, commercial systems are facing a number of technical challenges that go beyond the expected natural language processing difficulties. Apart from the necessity to be able to engage in a conversation that is beneficial to both sides—the customer and the seller—the system also needs to satisfy the enterprise demands; it has to be effortless to integrate with other services, reliable, scalable, secure, and it must meet performance requirements [10].

1.1.3 Education and Healthcare

Dialogue systems used for educational purposes are no longer utilized only to provide prompt advice and support. The research moved forward to virtual reality; an environment that creates an immediate sense of fictitious presence.

4. In online marketing, the conversion rate expresses the proportion of page visitors who purchased a product and hence became customers, or performed any other action desirable for the business (for example, subscribed to a newsletter).

Initially mechanical reality simulators5 shifted into progressive computer-based systems that gave rise to the area of immersive training. Immersive training is learning with the use of virtual reality; the learners do not need to sit in classrooms and study printed materials, considering the fact that immersive training can offer experiences comparable to the ones from the real world and as such provides an ideal learning environment. Although immersive training raised public awareness in the nineties [12], it has recently been becoming more widespread [13].

Apart from virtual reality and immersive training, dialogue systems are especially prominent in the area of computer-assisted language learning. Presumably the most widely known platform—available as both a mobile application and a website—is Duolingo, created by Luis von Ahn and Severin Hacker at Carnegie Mellon University. Duolingo, which has about 200 million active users worldwide [14], was originally designed as a language learning and translation system, as the students were learning by translating simple phrases in foreign languages [15]. The latest releases of Duolingo contain so-called bots; dialogue systems that simulate various conversations with native speakers of the learned language.

Generally, language-learning dialogue systems are facing two challenges; firstly, the dialogue system needs to recognize utterances spoken by non-native speakers, and secondly, it needs to generate a valid response to the possibly grammatically incorrect utterances. Such a response should warn about the user’s mistakes and provide assistance with correcting them [16].

Another spoken dialogue system, ITSPOKE (Intelligent Tutoring Spoken Dialogue System) [17], serves as an assistant for learning qualitative physics. The system asks questions about simple physics phenomena and confirms or corrects the user’s answers (the required answers are generally short, uncomplicated utterances). ITSPOKE provides a spoken interface to the original text-based tutoring application called Why2-Atlas, which based its answer evaluation on syntactic analysis and abductive reasoning.

In the field of healthcare, counselling dialogue systems are surprisingly similar to the early ELIZA and PARRY chat bots mentioned at the beginning of this chapter. Contemporary mental health systems can act as counsellors or mentors that help to promote a healthy lifestyle or spread awareness and provide basic support for people suffering from different diseases.

5. Sensorama, created by Morton Heilig and patented in 1962, is one of the very first virtual reality simulators. Sensorama was a purely mechanical device that offered five different scenarios and was designed to engage four senses of the spectator: sight, touch, smell, and sound [11].


A theoretical model of a dialogue system used for promoting healthy living through a series of follow-up dialogues about relevant topics was described in [18]. The United States Army funded SimCoach, a dialogue system available as a web application and created for soldiers and their families, which acts as a virtual human being and addresses the user’s health concerns [19]. Domain experts were involved during the development of the system to ensure correctness and accuracy. Dialogue systems are, however, also deployed directly in hospitals and clinics, where they are often used in real-time environments. A significant area of utilizing speech-based dialogue systems in healthcare is, for example, recording clinical observations during examinations [20].

In this section, only a limited collection of contemporary dialogue systems was mentioned. As the ability to engage in a fluent conversation in a natural language is a highly important skill of machines that are striving for artificial general intelligence, the ongoing focus on dialogue systems is even greater than fifty years ago, when the research field emerged. In the following section 1.2, some of the most interesting modern research areas are described.

1.2 Current Fields of Research

A typical dialogue system consists of the following parts: input handling (when applicable, the input needs to be converted into plain text using automatic speech recognition or handwriting recognition), natural language understanding, a dialogue manager, natural language generation, and output handling (presenting the generated answer either as speech—using speech synthesis—or as text). The input and output handling are described in detail, together with practical examples, in section 2.3.

Figure 1.1: An architecture of a dialogue system (Automatic Speech Recognition → Natural Language Understanding → Dialogue Manager → Natural Language Generation → Text to Speech).


1.2.1 Natural Language Understanding

Unlike other tasks that can be expressed as a sequence of simple operations performable easily by machines, natural language understanding (hereinafter also referred to as NLU) is a complex discipline of artificial intelligence, even treated as AI-complete6. NLU is an integral part of all advanced dialogue systems that is necessary for understanding the passed information and obtaining the user’s requests, intents and goals, or sentiment. This knowledge is crucial for the computer to make further decisions about the continuation of the dialogue and the possible actions that need to be undertaken.

Understanding a language goes far beyond understanding single words or separate sentences. Many other aspects give a „meaning“ to an utterance—semantics, pragmatics, concepts—which makes describing a language and its meaning immensely difficult.

Some of the most commonly adopted approaches in the area of natural language understanding include semantic parsing, dependency parsing, relation extraction, and vector space modelling.

Semantic parsing transforms utterances from a natural language into a machine-readable format, such as database queries, concrete instructions for a device („turn off the microwave“), or system commands („show me what time the train leaves tomorrow“). Although semantic parsing tends to perform well on limited problem domains and with utterances that satisfy the prescribed form, the goal is to create a general meaning-representation language applicable to any domain and any utterance. One of the most recent efforts is the Alexa Meaning Representation Language, which parses the input using deep neural networks into actions, roles, properties, types, and operators [21]. Current efforts are also focusing on combinatory categorial grammars [22], dependency-based compositional semantics [23], or unsupervised semantic parsing [24].

Dependency parsing analyses the syntactic structure of an utterance and provides as an output a dependency tree with relations between words (which words modify or belong to other words). The gained syntactic structure can also be a valuable source of information for the semantic parsers [25] described above. Similarly to dependency parsing, relation extraction7 is the task of finding semantic relationships within an utterance.

6. It is believed that by solving an AI-complete problem, the overall general artificial intelligence will be achieved as well.
7. An interesting application of relation extraction is the Never-Ending Language Learner, a system created at Carnegie Mellon University, that has been running continuously since 2010. The system is attempting to learn various facts from the world wide web and improve its learning abilities.


The last mentioned approach is vector space modelling, which is primarily applied in information retrieval as a method for analysing the similarity between documents and for relevance ranking. Vector space models represent objects—for our purposes texts and utterances—as vectors. The most basic representation is a document-term matrix, which shows if and how many times a term8 occurs in a set of documents. Vector space models form the basis for latent semantic analysis (LSA), a technique that works on the assumption of the distributional hypothesis. An essential part of LSA is singular-value decomposition (SVD), which is used within latent semantic analysis for a size reduction. However, because simple vector space models often fail to take into account the majority of syntactic differences (for example, the statements “type ‘her’” and “her type” clearly have different meanings, but might be treated as one), improved models that try to capture also the context of an utterance are being created [26].
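As a concrete illustration of these notions, consider the following minimal sketch (it is not part of Coco; the vocabulary and the two utterances are invented for this example). It represents two utterances as term-count vectors over a fixed vocabulary (each vector being one row of a document-term matrix) and compares them with cosine similarity, the usual relevance measure in vector space models; the SVD step of latent semantic analysis is deliberately omitted.

import java.util.Arrays;
import java.util.List;

// A toy vector space model: utterances become term-count vectors over a
// fixed vocabulary and are compared with cosine similarity.
public class VectorSpaceSketch {

    // Builds a term-count vector (one row of a document-term matrix).
    static double[] termVector(List<String> vocabulary, String utterance) {
        double[] vector = new double[vocabulary.size()];
        for (String token : utterance.toLowerCase().split("\\s+")) {
            int index = vocabulary.indexOf(token);
            if (index >= 0) {
                vector[index]++;
            }
        }
        return vector;
    }

    // Cosine similarity of two vectors of equal length.
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        List<String> vocabulary = Arrays.asList("order", "shoes", "clothes", "size");
        double[] first = termVector(vocabulary, "I would like to order shoes in size four");
        double[] second = termVector(vocabulary, "order clothes and shoes");
        System.out.printf("similarity = %.3f%n", cosine(first, second));
    }
}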

1.2.2 Dialogue Management

A fundamental part of each dialogue system is a dialogue manager, which is completely hidden from the user and which, as the only component in the system, maintains its state throughout the conversation, or even throughout one life cycle. Based on the implementation, the state may contain details about the user (which improves decision-making abilities regarding the dialogue strategy and makes the communication more user-focused), the conversation history, and a list of information that needs to be obtained from the user to achieve a predefined goal. A dialogue manager can even be divided into more units, where each of the units is responsible for a different task.

The dialogue manager implemented in the 2017 Alexa Prize-winning9 social bot called Sounding Board [27] is divided into two units. The main unit is responsible for error handling, sentiment detection, dialogue strategy selection, and connection with the second unit, which is accountable for so-called miniskills, divided into three categories: general (such as greeting and exit), personality-oriented (used for user assessment), and content-oriented, which include opinions, jokes, question answering, trending topics, and more.

8. Note the difference between a word and a term; a term is a normalized word.
9. The Alexa Prize is a contest organized by Amazon, Inc. for university teams across the world. The aim is to create a social bot capable of fluent conversation in a natural language about current popular topics, which would become available as a part of the Amazon Alexa virtual personal assistant.


In Alquist, a social bot placed second, the dialogue manager is divided into top-level and topic-level units [28]. The top-level unit decides on the dialogue strategy (such as question answering or chit-chat) and the rule-based topic-level unit manages a discussion about a particular topic (such as books, music, or sports).

1.2.3 Natural Language Generation

Finally, natural language generation (hereinafter also referred to as NLG) can be perceived as reversed natural language understanding, where the task is to produce a correct and human-understandable spoken (or written) output in a natural language. Although these two fields are closely related, their basis differs; while natural language understanding is a process of choosing the correct interpretation of the input, natural language generation is a process of choosing the correct expression of the output.

Besides dialogue systems, NLG is widely adopted in fields where automatic text generation saves resources: learning systems, weather forecast systems, business and financial services, data analysis, and even computational humour.

Within a dialogue system, the NLG component receives the information provided by the dialogue manager and creates a response. The response may be an uncomplicated utterance either already available in a database (an apology, agreement, gratitude, or a general question) or a template with slots to be filled in with the information from the user’s input (echoing). The currently prevailing method of producing more complicated responses, mainly in domain-specific systems, is deriving them from grammars whose rules were obtained from human-evaluated corpora. However, similar rule-based systems [29] lack the characteristic fluctuation of a natural language and are usually not expandable and usable in different domains.

To prevent similar issues, a semantically controlled recurrent neural network with long short-term memory10 units was introduced [32]. The network is capable of learning from unaligned data to guarantee domain independence, while sampling the output responses ensures language variation.

In this chapter, some of the most widely known dialogue systems together with current research fields were introduced. The following chapter 2 describes in detail the process of building a new dialogue system.

10. Long short-term memory (LSTM) units were initially introduced by Sepp Hochreiter in his diploma thesis in 1991 [30], but became widely recognized six years later after publishing an improved proposal with Jürgen Schmidhuber [31]. An LSTM unit is characterized by its ability to remember information over extended periods of time.

2 Building a Dialogue System

Besides the general overview of contemporary dialogue systems, areas of usage, and research methods, the previous chapter 1 also attempts to put the process of building a dialogue system into a wider context. Although this chapter focuses mostly on the actual implementation of the dialogue system that was created as an essential part of this thesis, relevant sections of the chapter are devoted to presenting different approaches, technologies, and methods that were not used, but are viable alternatives.

The first section 2.1 is dedicated to introducing Coco, the created personal shopping assistant. In section 2.2, some of the most widely known voice dialogue standards are described, followed by the used automatic speech recognition and speech synthesis techniques in section 2.3, which were intentionally omitted in the first chapter. The last section 2.4 is devoted to the natural language understanding and generation implemented in Coco. This concludes the outlining of a complete dialogue system structure, and therefore, this thesis per se provides a brief but comprehensive overview of the conversational computer systems field of artificial intelligence.

2.1 Introducing Coco

Coco: Hello! Welcome to the clothes and shoes ordering system. My name is Coco and I will guide you through the whole experience.
Coco: Excuse me if I do not understand everything you say, I am still learning!

Coco is a spoken dialogue system that was created as an example application for demonstrating the principles described and the techniques designed within this thesis. It is available as a ready-to-use script in the attached program distribution and is executable from the command-line interface without installation. Coco is flexible, lightweight, and system-independent. It is written in the Java programming language, as are all the external libraries that are used within Coco for speech recognition and speech synthesis. The details about running Coco, viewing its source code, and contributing to the program are described thoroughly in the appendices.

Although the original version of Coco is domain-dependent and the system is designed to be single-task-oriented, the dialogue system is built to be easily extensible and modifiable.


Coco’s introductory speech, which serves as a dialogue opening, is depicted on the previous page. Coco is fully speech-operated and cannot be controlled by written text; however, all the utterances produced by Coco and the utterances produced by the user and recognized by the system are written as a program output during the dialogue. This makes the interaction with the system easier, minimizes the probability of misunderstandings, and provides further functionality, such as result validation or form filling.

Coco: Is there anything else I can do for you?

You: Yes, there is.

Coco: What is it?

2.1.1 Problem Domain

As mentioned previously, the system’s original purpose is to serve as a personal shopping assistant available in an online fashion store. Coco is named after Gabrielle Bonheur Chanel1, nicknamed Coco.

The system is currently designed to guide the customer through the process of ordering already chosen goods from an online boutique; from specifying the concrete selected items and putting them into a virtual shopping bag, to filling out the user’s shipping information and concluding the whole shopping experience. Coco is distributed in its original form as a plain dialogue system that is not connected to any concrete service or database. However, due to its extensibility, modularity, and adaptability, integration with other systems and services is straightforward.

Despite the fact that the problem domain is quite limited, Coco is now capable of producing more than 100 different utterances (phrases, questions, requests, and more) while trying to repeat the already said expressions as little as possible, which ensures a natural dialogue flow and language variability, and prevents the system from acting machinelike. Considering the system’s ability to learn new grammar rules for speech recognition and the fact that the spoken word recognition can easily be switched from grammar-based to language model-based, the dialogue system is able to recognize and react to nearly any utterance in English. The implemented speech recognition and speech synthesis are described in detail in section 2.3.

1. Gabrielle Bonheur Chanel, born in 1883, was one of the most influential French fashion designers and founder of the world-famous haute couture brand Chanel.


2.1.2 Deployment

Since the possible deployment environments, usage areas, and personas of any system should be identified even before the actual implementation has started, these aspects of the Coco dialogue system are discussed further in the following few paragraphs.

Deployment environments

As a system-independent virtual shopping assistant, Coco can be set up in numerous distinct environments. The most straightforward deployment would be placing the dialogue system directly into an online store as the final step of the shopping experience, replacing the non-automatic shopping carts, which are currently the most commonly seen option and which need to be managed manually (other disadvantages of similar solutions, mainly for users with disabilities, are discussed later in this section). Coco could then be connected to internal databases and as such provide information about item availability in stock, and serve as a link between the front and back ends.

Various chat bots, which can be added in the same way as real personal connections in instant messaging applications for smart devices, are gaining popularity due to their novelty and originality, but mainly thanks to the significant improvement of the customer experience. Miscellaneous services, from online food ordering companies to e-commerce corporations2, are starting to invest into the research and implementation of similar bots in order to gain a competitive advantage. As one of the common features provided by contemporary IM applications is audio or video chatting, Coco—with its capability to print the ongoing dialogue—is by its nature built to be deployed as a chat bot. Advantages of this type of deployment include portability, continuous synchronization between multiple connected devices (which allows viewing and revising any previous order regardless of the device), or even an offline interaction.

The last mentioned environment that would be suitable for the Coco dialogue system is within an IVR (Interactive Voice Response) or VRU (Voice Response Unit) platform. Although the system is speech-operated and such a deployment would indeed be possible, Coco was not originally designed as an on-call dialogue system, mainly because of the functionality loss.

2. The recently released beta version of the eBay ShopBot chat bot proves that multinational sales corporations that have millions of sellers and buyers worldwide are interested in this area of service as well as smaller, emerging companies. eBay ShopBot, which is available for Facebook Messenger, allows the user to browse through various collections of items, shop within a custom price range, specify personal measurements, or even shop for similar products based on an uploaded photo [33].


Areas of usage

As mentioned before, Coco was created as an example application of a virtual shopping assistant in an online fashion boutique. It allows the user to buy clothes, shoes, and accessories, and describe the items to a greater extent using their categories, types, colours, or fictional item numbers, as shown in the following dialogue snippet:

Coco: Now, I need you to tell me more about the second item.
Coco: Is the second item from the category clothes, shoes, or accessories?
You: The second item is from the category clothes.
Coco: Could you please specify the type of the item?
You: It’s a purple leather jacket.
Coco: What size would you like your jacket to be?
You: Size XS.
Coco: Could you now kindly dictate the four-digit item number that is available at the top of the page?

However, as the dialogue system is not originally pulling data from any database or linked to any concrete service-providing business, Coco is easily modifiable to start supporting an arbitrary type of merchandise (such as electronics, sporting goods, collectibles, or books). Moreover, Coco is able not only to learn new grammar rules for speech recognition by itself, but also to extend the already existing ones with new words, which allows the system to be self-sustainable even after it is deployed in production. Due to the system’s high extensibility, an unlimited number of different categories characteristic of the selected type of goods can be added, as well as other dialogue management decision techniques, additional dialogue parts, or further system responsibilities.

Personas3

The capability to control and interact with the device by using only voice is crucial to disabled users, who appear to be one of the most interesting persona groups. These users most frequently suffer from visual impairments or physical disabilities, and they often have to rely on built-in text-to-speech systems for achieving their everyday goals.

3. In the fields of user experience and human-computer interaction, a persona is a representative of one particular group of users interacting with a specific product. Although personas are mostly made-up characters, their skills, goals, and common behaviour are thoroughly described. Based on these characteristics, services and products are designed and implemented in order to maximize the usability and accessibility of the product and improve the overall user experience [34].


To ensure that reading the text output is not critical for the dialogue flow, Coco automatically reads each request aloud every time the system awaits any action from the user, as depicted in the dialogue snippet below:

Coco: All right, in that case, please fill in your first name.
» First name: Klara
Coco: Thank you!
Coco: Please fill in your last name.
» Last name: Kufova
Coco: Now, please fill in the street and the house number.

However, situations that require the user to review the printed information (e.g. the list of all the ordered items or the shipping information) occur occasionally during the dialogue in order to prevent any future complications caused by incorrectly entered or recognized data. Integration with other speech synthesis software would be in these cases impractical and complicated, yet necessary for the visually impaired users. To avoid similar issues, Coco provides the option to ease the process by asking the user whether she should read the text to be reviewed aloud.

Coco: Now, below is the list of all the ordered items. Please review the list of items before we proceed.

ITEM NO.   CATEGORY   TYPE    SIZE   NUMBER
1.         clothes    shirt   S      9001
2.         shoes      heels   4      5562

Coco: Do you want me to read the list aloud for you?
You: Yes, please.
Coco: All right, no problem!
Coco: The first item is from group clothes, of type shirt, in size S, and its number is 9001.
Coco: The second item is from group shoes, of type heels, in size 4, and its number is 5562.
Coco: Is everything correct?
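Read-aloud sentences of this kind are a natural fit for the template-with-slots approach to response generation mentioned in section 1.2.3. The following sketch only illustrates that idea and is not taken from Coco’s source code; the class, the method name, and the exact wording of the template are assumptions made for this example.

// Illustration of template filling: the slots of one reviewed order item are
// inserted into a fixed sentence template, as in the read-aloud summary above.
public class ReadAloudSketch {

    private static final String[] ORDINALS = {"first", "second", "third", "fourth", "fifth"};

    // Fills the template with one item's category, type, size, and item number.
    static String describe(int position, String category, String type, String size, String number) {
        return String.format(
                "The %s item is from group %s, of type %s, in size %s, and its number is %s.",
                ORDINALS[position], category, type, size, number);
    }

    public static void main(String[] args) {
        System.out.println(describe(0, "clothes", "shirt", "S", "9001"));
        System.out.println(describe(1, "shoes", "heels", "4", "5562"));
    }
}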

Although one of the features of the system is the support for handicapped users, this persona group does not appear to be predominant. Virtual shopping assistants, which represent faster and more comfortable solutions to everyday problems, are no longer used exclusively for their convenience.

2.2 Voice Dialogue Standards

Possibly the easiest and the most straightforward way of building a web-based dialogue system is using one of the available open-source voice dialogue standards or commercial platforms. In this section, three of them are introduced: VoiceXML, Aspect Prophecy, and Motorola VoxML. Although Coco is not built on any platform or using any XML dialogue design language, introducing similar options is a necessary part of a thesis with a focus on the design and implementation of a dialogue system.

Building Coco from the ground up

Coco was built from the ground up without using any of the contemporary voice dialogue standards. There have been numerous reasons for this decision, four of them being determinative; firstly, Coco was created only as an example application for demonstration purposes and it is not deployed to any concrete environment or used further in production. Leaving the system completely independent makes it available for several alternatives of future development. The mentioned voice dialogue standards are primarily used for web or IVR-based applications, and Coco in its first original release is neither of them.

Secondly, another important aspect of Coco is its employment of open-source—and therefore modifiable—speech recognition and speech synthesis Java libraries. Coco listens with the use of the Sphinx44 speech recognition system and talks with the help of the MaryTTS5 platform. Utilizing open-source libraries allows Coco to adjust the settings so that they fit the needs of the system as much as possible. This includes, for example, setting a custom voice and its unique properties for the speech synthesis or adapting the pronunciation dictionary of the speech recognizer. The details about employing Sphinx4 and MaryTTS are described thoroughly in section 2.3.

Implementing a purpose-built virtual shopping assistant also allows to carefully elaborate the dialogue management and all the custom functionality that requires a finer level of control. And finally, building a dialogue system from the ground up and describing the procedure is an essential part of this thesis. To broaden the assignment, numerous implementation aspects specific to speech-controlled systems are defined and explained. Furthermore, emerging dialogue systems (such as Sounding Board and Alquist described in section 1.2.2, which are closely related to Coco in many aspects) rarely resort to utilizing voice dialogue standards. This indicates the opposite direction of the modern development of dialogue systems.

4. https://cmusphinx.github.io
5. http://mary.dfki.de


2.2.1 VoiceXML

Voice Extensible Markup Language (hereinafter referred to as VoiceXML) is an XML-based standard for creating spoken dialogue systems available as web services. It is the only markup language of its kind that has received the W3C6 Recommendation7. The current version, 2.1, was released in 2007 [37].

VoiceXML is designed for creating voice dialogues with automatic speech recognition of the user’s spoken input (although one of the main features is also the recognition of the DTMF—dual-tone multi-frequency signalling—input) and synthesis of the system’s text output (alternatively, custom audio files can be used as an output as well). In addition to several telephony features, VoiceXML also supports recording the user’s speech [38].

VoiceXML applications follow the client-server architecture and their development, deployment, and interpretation are very similar to HTML (Hypertext Markup Language) applications, which are transmitted from web servers through the Hypertext Transfer Protocol (HTTP) and processed in web browsers. One of the related advantages of VoiceXML applications is their ability to separate the platform-specific aspects, resource management, and service logic from the actual code that describes the dialogue.

Figure 2.1: The architecture of a VoiceXML application (document server, VoiceXML interpreter context, VoiceXML interpreter, and implementation platform).

6. W3C, or the World Wide Web Consortium, founded by Tim Berners-Lee at the Massachusetts Institute of Technology in 1994, is an international association for creating and releasing standards for the World Wide Web. The main goal of the organization is to ensure a continual growth that would eventually cause the web to reach its full potential [35]. The most widely known W3C standards include CSS, HTML, SVG, and XML.
7. The W3C Recommendation is the final level that can be reached by a developed standard (from the working draft and candidate recommendation to the proposed recommendation) [36]. At this stage, the standard has been thoroughly tested, and is supported and recommended for a broad use.


As the name suggests, VoiceXML documents are essentially XML documents; however, apart from the top-level <vxml> element, VoiceXML contains many other format-specific elements (such as <form>, <block>, <field>, or <prompt>). A minimal VoiceXML document may look as follows:

<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <form>
    <block>
      I am not Coco!
      But I could be.
    </block>
  </form>
</vxml>

In the architecture depicted on the previous page, the document server is most often a web server that is connected to a database and generates dialogues, the VoiceXML interpreter (for example JVoiceXML8) is a client application, the implementation platform handles the user’s demands, and the VoiceXML interpreter context observes the dialogue and changes the context if necessary (for example redirects the dialogue to another dialogue segment).

Associated W3C Specifications

All of the specifications listed below are a part of the Voice Browser Working Group, which officially ceased its activity in 2015 [40], and are commonly applied together with VoiceXML.

SRGS » Speech Recognition Grammar Specification [41]. A standard for defining the structure of grammars used for speech recognition in a spoken dialogue application. SRGS is based on the JSGF (Java Speech Grammar Format), which is used for speech recognition within Coco and is described in detail later in this chapter.

SISR » Semantic Interpretation for Speech Recognition [42]. A specification which allows adding semantic tags to speech recognition grammars and their later interpretation. This semantic information can then be used by the application for deciding on the dialogue continuation.

8. JVoiceXML is an open-source Java VoiceXML interpreter with full support of the Java Speech API (JSAPI) and the Java Telephony API (JTAPI) [39].


SSML » Speech Synthesis Markup Language [43] is a markup language suitable for applications that are synthesising speech; one of the most significant usage areas is e.g. audio books. SSML offers a number of elements and their attributes for changing the properties of the synthesized speech, such as the language or prosody.

PLS » Pronunciation Lexicon Specification [44]. PLS specifies the exact pronunciation of particular words in order to improve both the speech recognition and speech synthesis. The standard allows to define multiple pronunciations for one word, a pronunciation for a word with multiple different spellings (such as “doughnut” and “donut”), and supports homophones9 and homographs10.

Another important related specification—although not belonging to the Voice Browser Working Group—is ECMAScript [45], a scripting language used within a VoiceXML application for client-side scripting code.

The next major version of VoiceXML, 3.0, has had the status of a working draft—which is the very first publicly available document—since December 2010 [46]. The most outstanding changes presented in this release include the introduction of the SIV module (speaker identification, verification, and enrolment) and multimodal output [47].

2.2.2 Aspect Prophecy

Aspect Prophecy [48] is a commercial VoiceXML-based application platform belonging to the Aspect Software product portfolio. As the company focuses mostly on developing software for call centres and improving the user experience of its customers’ solutions, Aspect Prophecy is mainly intended for IVR, self-service, and SIP (Session Initiation Protocol) applications. Example Aspect Prophecy solutions include mostly automated telephone systems (such as virtual receptionists and survey research) or call routing and conference systems.

Aspect Prophecy supports multiple W3C standards (such as the already mentioned specifications SRGS, SISR, or SSML, plus e.g. SOAP—Simple Object Access Protocol—and WSDL—Web Services Description Language) and a large number of other IETF11 standards.

9. Homophones are words that differ in meaning, but are pronounced the same way, such as “deer” and “dear”.
10. Homographs are words that have the same spelling, but differ in meaning, such as “bear” (for example “bear with me” and “the bear was very hairy”).
11. The Internet Engineering Task Force (IETF) is an international organization closely connected with W3C that creates and releases Internet specifications.


Aspect Prophecy is also suitable for creating various mobile applications, which are similar to the chat bots that were mentioned previously in this thesis. These mobile applications can be utilized in multiple different areas, e.g. in education, healthcare, banking, or retail. For example, the applications in the education area can be used for class registration or obtaining information (about class schedules, office hours, student parking, etc.), whereas the healthcare applications can be used for managing appointments or patient engagement.

The following figure shows an example of a typical interactive SMS application that can be created using Aspect Prophecy.

Hello, could you please tell me where is Professor Lovelace’s Math class?

Professor Ada Lovelace is teaching two Math classes. One class is today at 8:00 AM and another on Friday at 2:00 PM. Which class are you interested in?

The today’s class.

Math with Professor Ada Lovelace is being held today from 8 AM to 10 AM in Building KK, Room 135. Would you like a campus map?

With regards to this thesis, an especially interesting part of the Aspect Prophecy application platform is the adopted technology for automatic speech recognition and speech synthesis. In Aspect Prophecy, these are achieved by using the commercial speech servers LumenVox and Nuance [48], both of which are built on the Media Resource Control Protocol (MRCP). Automatic speech recognition in LumenVox is developed from the open-source CMU Sphinx2 speech recognizer (the newest version, Sphinx4, was used for ASR in the Coco dialogue system). Moreover, LumenVox provides support for natural language understanding based on complex statistical language models [49]. Nuance Communications is the biggest worldwide enterprise that focuses on speech and telephone services, and is among others also involved in developing Apple’s Siri [50], a virtual personal assistant that was mentioned in section 1.1.1.


2.2.3 VoxML

VoxML [51] (an abbreviation for “voice XML”) is an XML-based markup language created by Motorola in the late 1990s. VoxML is a forerunner of the W3C standard VoiceXML; however, unlike VoiceXML, VoxML was not widely adopted and its development was stopped after releasing version 1.1 in 1999.

VoxML is a markup language for creating voice applications. It was primarily intended for developing software that would enable the users to access web content using their phones. At the time of releasing VoxML, browsing the Internet using a telephone was a quite popular development area, and many online content providers, including e.g. weather forecast and stock market servers, travelling and route-planning systems, and even astrology web pages, were embracing similar technologies to extend their reach.

Motorola was trying to create a markup language similar to HTML; however, there are a few notable differences. Firstly, while HTML is used for designing web pages inside a web application, VoxML constructs dialogues, whose smallest parts are so-called steps (as depicted in the source code below). Secondly, a single step of a VoxML dialogue usually offers a very limited amount of options for choosing the next dialogue step, unlike an HTML page, which can potentially contain many different links to other pages and still be coherent and understandable. Finally, an HTML page is a two-dimensional visible object, in contrast with a dialogue, whose only dimension is time.

See the following example of a minimal VoxML document with a simple synthesized utterance inside the <PROMPT> element:

<?xml version="1.0"?>
<DIALOG>
  <STEP NAME="init">
    <PROMPT>
      I am not Coco!
      <BREAK/>
      But I could be.
    </PROMPT>
  </STEP>
</DIALOG>

Apart from the mentioned voice dialogue standards and platforms, multiple other languages and toolkits are available, two of them being e.g. Microsoft’s XML-based markup language SALT (Speech Application Language Tags) [52] and the open-source, Java-based toolkit OpenDial [53], which was created by Pierre Lison at the University of Oslo.

2.3 Input and Output Speech

This section is dedicated to the automatic speech recognition and speech synthesis used in Coco, with the emphasis on the utilized open-source frameworks Sphinx4 and MaryTTS. The architectures of both frameworks are described thoroughly, together with the actual implementation details of speech recognition and speech synthesis in the created dialogue system. An important part of this section is dedicated to the definition and examples of the created JSGF grammars, which form the basis of the system and are the primary subject matter of the following chapter 3.

As both automatic speech recognition and speech synthesis are extraordinarily complex fields of natural language processing and many other publications deal with these topics alone, this thesis does not explain any elementary principles or techniques and works on the assumption of the reader’s respective knowledge in this area.

Although Coco was originally designed to understand and speak English, its architecture is not language-specific and the transition to another language is straightforward; for speech recognition, it is sufficient to provide a path to another acoustic model, dictionary, and grammars or a statistical language model (or both, to fully utilize the system’s potential). For speech synthesis, it is required to specify a new set of utterances and a voice that was trained for the selected language.

2.3.1 Speech Recognition

The original automatic speech recognizers were speaker-dependent, complicated systems based on the concept of acoustic phonetics, which were able to understand only a few spoken words. In 1952, Bell Laboratories, led by S. Balashek, R. Biddulph, and K. H. Davis, introduced the first automatic digit recognizer [54], whose accuracy reached up to 99 %, and thus started an entirely new era of automatic speech recognition.

The early speech recognizers needed to overcome several obstacles; their dependency on a particular speaker, the ability to correctly recognize only isolated words instead of fluent speech, insufficient dictionaries, and limited grammars that did not cover an adequate part of a language. Mainly due to government support, the research gradually moved forward to more sophisticated recognizers. In 1971, DARPA started a five-year program whose purpose was to create several speech recognizers, which would be able to correctly recognize fluent speech, whose dictionaries would have at least 1,000 words, and which would be possible to run repeatedly in real time [55].


At the end of the program, the most successful turned out to be the Harpy speech understanding system, which was created at Carnegie Mellon University in Pennsylvania by a research group led by Raj Reddy. Harpy was able to recognize fluent speech containing up to 1,011 words spoken by five different speakers (three men and two women) [56]. The same research group is also behind the creation of the recognizers Hearsay [57] (the first recognition system that was able to recognize connected speech), Dragon [58] (the first recognition system that used the probabilistic function of a Markov process), and a group of speech recognizers: Sphinx-I, Sphinx-II, Sphinx-III, and Sphinx4.

Sphinx4

Sphinx4 [59] [60] is an open-source speech recognition framework written in the Java programming language, which was developed as a shared effort of research teams from Sun Microsystems Laboratories, the Massachusetts Institute of Technology, Mitsubishi Electric Research Laboratories, and Carnegie Mellon University. Currently, Sphinx4 is being maintained by a research group hereinafter referred to as CMU Sphinx.

[Figure 2.2: The architecture of the Sphinx4 speech recognition system — the application layer (input, control, result, tools), the Recognizer consisting of the Front End, the Decoder (search manager, active list, scorer, pruner), and the Linguist (language model, dictionary, acoustic model, search graph), and the Configuration Manager.]


Architecture of the Sphinx4 speech recognizer

Unlike the previous speech recognizers from the Sphinx group, each module of Sphinx4 is responsible for a different task (modularity), and the design of the recognizer makes it possible to replace or modify any component of the system individually, without the subsequent need to change any other part of the system (loose coupling). In the architecture depicted in figure 2.2, each labelled component can be altered. Coco takes full advantage of the loose coupling; for example, its implementation makes it possible to switch between statistical language models and JSGF grammars even within one run of the program, only by changing a single attribute in the source code that is responsible for speech recognition.

The Sphinx4 speech recognition system consists of three elementary components: the Front End, the Decoder, and the Linguist. The input utterance spoken by the user is processed by the Front End module into a sequence of features (for example using mel-frequency cepstral coefficients12 or perceptual linear prediction cepstral coefficients13). The Linguist module generates a search graph (called sgraph in figure 2.2) based on the information from the language-specific language model, dictionary, and acoustic model. The search graph together with the features are the input for the Decoder module, whose main purpose is to generate a result hypothesis. The Decoder contains a search manager, which can, for example, be composed of an active list (called active in figure 2.2) that actively communicates with the pruner and the scorer.

The application itself can at any time take control of the speech recognition process and, using the Configuration Manager, modify any module or parameter in the recognition system. The Sphinx4 framework contains a number of tools, which allow observing miscellaneous performance or accuracy statistics. Each component of the system's architecture is described in detail in [60].

With regards to Coco, the most interesting module of the framework is the Linguist, which is responsible for handling the language-specific language model, dictionary, and acoustic model. Within Coco, all three of these components were provided directly by CMU Sphinx and modified to maximize performance and accuracy. Although Sphinx4 itself is language-independent, the framework offers prebuilt models e.g. for French, Spanish, Italian, etc.
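To illustrate the single-attribute switch mentioned above, the following sketch shows how the high-level Sphinx4 Configuration can be pointed either at JSGF grammars or at a statistical language model. The resource paths and the grammar name are placeholders, not Coco's actual values, and the helper class is illustrative only.

import edu.cmu.sphinx.api.Configuration;

public class RecognizerConfig {

    static Configuration configure(boolean useGrammar) {
        Configuration configuration = new Configuration();
        configuration.setAcousticModelPath("resource:/models/en-us/en-us");
        configuration.setDictionaryPath("resource:/models/en-us/cmudict-en-us.dict");

        if (useGrammar) {
            // Grammar-based language modelling (JSGF).
            configuration.setUseGrammar(true);
            configuration.setGrammarPath("resource:/grammars");
            configuration.setGrammarName("opening-dialog");
        } else {
            // Statistical language model.
            configuration.setLanguageModelPath("resource:/models/en-us/en-us.lm.bin");
        }
        return configuration;
    }
}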

12. Mel-frequency cepstral coefficients (MFCCs) are obtained from the user's input speech signal by applying fast Fourier transform, mel scale filtering, and discrete cosine transform. The concept was first described by Paul Mermelstein in 1976 [61] and it is one of the most commonly used methods for feature extraction in speech recognition.

13. Perceptual linear prediction (PLP), described in 1990 by Hynek Hermansky, is a computationally efficient alternative to MFCCs, which works with the notion of hearing psychophysics and produces low-dimensional feature vectors [62].


Language model

A language model is a structure that defines the probability of occurrence of each—spoken or written—word sequence. The primary purpose of a language model is to reduce the set of candidate words after recognizing the first token in a sequence. The Sphinx4 speech recognition system supports two types of language modelling: grammars (described in detail later in this section) and stochastic n-gram14 models (also known as statistical language models). Unlike grammars, whose limited expressive power does not allow their use outside restricted problem domains, statistical language models are able to easily define any language and are frequently utilized for complex tasks, such as machine translation [63].

Coco uses the generic pruned US statistical language model created by CMU Sphinx, which contains more than 3.5 million entries divided into unigrams (single words), bigrams (pairs of words), and trigrams (three consecutive words). Below is a small segment of the used language model:

...
-1.359658 coco chanel
-2.656865 coco coco
-2.618173 coco cola
-2.339274 coco crisp
...

By default, Coco is set to use grammars in the JSGF format as opposed to a statistical language model. For small problem domains such as online clothes shopping, where the collection of possible meaningful and relevant utterances is quite limited and certainly finite, grammars tend to perform better (with nearly 100% accuracy, independently of the speaker). However, Coco is distributed with a statistical language model that can be used—or even combined with grammars—instantly, without the need for further setup.

Dictionary

A dictionary (or a pronunciation dictionary) is most often a text structure that describes the pronunciation of each word—in the form of a phonetic transcription—which can be recognized by the system. In the context of the Linguist module, all three components are closely linked; a dictionary consists of words, which compose the language model, and pronunciations, which are processed by the acoustic model. Based on the size of the dictionary, different strategies are employed during the recognition process; small dictionaries can be loaded during application initialization, while entries in extensive dictionaries can be accessed upon request.

14. In the field of computational linguistics, the term n-gram represents a sequence of n consecutive items; words, letters, phones, etc.


Coco uses the open-source CMU Pronouncing Dictionary [64] (hereinafter referred to as CMUdict), which was created by the Speech Group in the School of Computer Science at Carnegie Mellon University for the English language. Its latest version, 0.7b, was released in November 2014. The dictionary currently contains more than 130 thousand entries and is being continuously expanded. The phone set used for the phonetic transcription within the dictionary has 39 distinct phones, which are based on the ARPAbet15 symbol set; for example AA (vowel), B (stop), CH (affricate), DH (fricative), HH (aspirate), L (liquid), M (nasal), and W (semivowel).

...
coco       K OW K OW
cocoa      K OW K OW
cocoanuts  K OW K OW N AH T S
cocom      K OW K AA M
...

Moreover, the implementation of the dictionary also makes it possible to divide the words into distinct classes, specify multiple pronunciations for a single word, or even recognize words which are not yet defined in the dictionary. However, in such cases, the Sphinx4 speech recognition system issues a warning during its initialization16. For the purposes of the Coco dialogue system, CMUdict was expanded with multiple new words; e.g. fuchsia (F Y UW SH AH), jewellery (JH UW AH L ER IY), or pyjamas (P IH JH AH M AH S).
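In the dictionary's plain-text format, these additions correspond to entries of the following form (the phonetic transcriptions are the ones listed above; the column alignment is only illustrative):

fuchsia    F Y UW SH AH
jewellery  JH UW AH L ER IY
pyjamas    P IH JH AH M AH S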

Acoustic model

In Sphinx4, the acoustic model of a language or its subset consists of a collection of files that describe the acoustic characteristics and parameters of speech. The collection contains, for example, feature parameters (in the file feat.params), mappings of context-dependent triphones to Gaussian mixture models (mdef), definitions of sounds and noises (noisedict), etc. A small segment of the mdef (model definition) file is shown below, where the values are as follows: the base phone, its left context, its right context, the position of the triphone (single, beginning, internal, ending), its attribute, the phone ID, and the respective state IDs.

15. The ARPAbet phonetic transcription alphabet was originally created for the Speech Understanding Research project, which was already mentioned at the beginning of section 2.3.1. ARPAbet is named after the Advanced Research Projects Agency (ARPA), which was in March 1972 renamed to the Defense Advanced Research Projects Agency (DARPA).

16. The implementation of Coco allows displaying WARNING and ERROR log messages to the standard output, in contrast with the general INFO messages—e.g. about initializing the speech recognizer—to hide the underlying logic from the end user that is interacting with the dialogue system. This setting can be easily modified in the logging.properties file, which is available in the application's software distribution.


...
OW K K b n/a 26 3549 3603 3634 N
OW K K e n/a 26 3549 3604 3634 N
OW K K i n/a 26 3549 3604 3634 N
OW K K s n/a 26 3596 3648 3671 N
...

By default, the elementary unit of speech in Sphinx4 is a senone17. The acoustic model also specifies the mapping between each senone (or any other selected speech unit) and the respective hidden Markov model (hereinafter also referred to as an HMM). Hidden Markov models in Sphinx4 do not follow a fixed structure; even within one acoustic model, HMMs for separate senones can have different topologies, numbers of states and transitions, or transition directions. Together with the information from the language model and the dictionary, the Linguist module creates a search graph, which is a basic data structure necessary for the Decoder module.

[Figure 2.3: An example search graph generated by the Linguist module for the words "coco" (K OW K OW) and "chanel" (SH AH N EH L).]

In the above figure, three small consecutive states with loops represent HMMs for corresponding phones. States in a search graph are divided into emitting and non-emitting; while emitting states are used for comparing to the feature sequences that are provided by the Front End module, the non-emitting states represent larger language elements—such as words or phones—and are not compared to the feature vectors. Based on the implementation of the Linguist module, the characteristics of generated search graphs differ. Their construction and later processing significantly affect the overall performance and accuracy of the recognition system; simple in-memory search graphs usually have a smaller memory footprint than complicated, extensive graphs, which need to be accessed upon request.

17. A senone is a subset of a phone. A phone is a basic unit of speech with a distinguishing sound; for example, the name Coco consists of two distinct phones: [k] and [ō].


Java Speech Grammar Format (JSGF)

Coco was initially designed to use grammars in the Java Speech Grammar Format (hereinafter also referred to as JSGF) as the default form of language modelling. This section provides a basic overview of the JSG format and the actual grammars that are distributed directly with Coco. Their automatic expansion and reduction are discussed thoroughly in chapter 3.

In the theory of formal languages, a formal grammar G is a tuple

G = (N, Σ, P, S),

where N is a non-empty finite set of nonterminal symbols, Σ is a finite set of terminal symbols (also called the alphabet), P is a finite set of production rules (production rules are usually written in the format α → β), and S is the start symbol (also called the sentence symbol).

In speech recognition, grammars define the exact words and their order in all recognizable utterances. They are usually employed to describe fairly simple language models in narrow problem domains, where the set of expected utterances is quite limited, such as online shopping, travelling guidance, or simple dictating systems. Under similar conditions, the recognition accuracy is generally very high (reaching up to 100%), grammars considerably simplify the dialogue management, and they are easy to develop, modify, and extend.

However, a significant drawback of utilizing grammars is the need to create them—at least partially—by hand, and they are also extremely ineffective in recognizing fluent speech with a broad dictionary of thousands of entries. Although the current research is gradually retreating from their use, the ongoing progress e.g. in machine learning provides new ways of improving the process of employing grammars within speech recognition systems.

Java Speech Grammar Format [65] was created by Sun Microsystems, Inc. in 2000 as both a machine- and human-readable format of platform-independent grammars for speech recognition. JSGF is based on the Java programming language and as such adopts many Java conventions and style characteristics, such as naming practices (of packages, grammar rules, etc.), comment styles, or access modifiers.

S → GREETINGS INTRODUCTION NAME+
GREETINGS → hello | hi | good morning
INTRODUCTION → my name is | I am | I'm
NAME → Elizabeth | Katie | Nina | Rosie


Note the production rules P of the grammar G = ({S, GREETINGS, INTRODUCTION, NAME}, {hello, hi, good morning, my name is, I am, I'm, Elizabeth, Katie, Nina, Rosie}, P, S) listed above. Grammar G is a simplified formal notation of the following JSGF grammar that is actually used within Coco in the opening dialogue:

#JSGF V1.0;                                                   (1)

/**
 * A grammar that describes the opening dialog.                  (2)
 *
 * @author Klara Kufova                                          (3)
 * @version 1.0
 */

grammar opening-dialog;

import <category-grammars.category-users.users>;              (4)

// BODY:                                                       (5)

<greetings> = good <time> | hello | hey | hi;                 (6)
<time> = morning | afternoon | evening;                       (7)
<introduction> = i'm | i am | it's | it is | my name is;

public <opening> = [<greetings>] [<introduction>] <users>+;   (8)

Each JSGF grammar is divided into a header and a body. A header consists of a so-called self-identifying header (1), which declares that the file is a JSGF grammar, specifies the version of the format, and can also indicate the character encoding within the file; a grammar name (JSGF names follow the same structure as package and class names in Java and must be declared right after the self-identifying header; if the grammar is located inside a package, the grammar name must be fully qualified); and imports (4) (to import all the public rules of a grammar—using the .* notation—or a single public rule, the fully qualified name of the grammar or the rule must be specified; referencing imported rules is explained later).

Both the header and the body can contain comments. As in the Java programming language, two types of comments are supported: multi-line comments (inside the /* and */ characters) and single-line comments (5). Moreover, multi-line documentation comments (2) can contain documentation tags, such as @author (3) or @version, and can later be used for automatic documentation generation.


A grammar body consists of rule definitions. A rule definition has a rule name, an equals sign (=), a rule expansion, and a mandatory semicolon (;). A rule can be declared public; public rules (8) define the user's spoken utterances and can be referenced by rules from different grammars. Rules that are not public are private (6), which means that they are visible only inside their local grammar and cannot be matched against any user's utterance.

A rule can be expanded to simple tokens ("terminal symbols") (7), to other rules ("nonterminal symbols", which means that the expansion is nested), or to a combination of both (6). To reference a public rule from another grammar, it is sufficient to import the grammar (or the rule itself) and specify the plain rule name (8), or to use a fully qualified rule name. To define an alternative, the vertical bar character (|) is used (for example, in (7), the rule can be expanded to either "morning", "afternoon", or "evening", but not their combination). To specify an optional grouping, the square brackets ([, ]) are used (for example, in (8), the user can say greetings, but it is not required by the rule definition). The plus operator (+) (8) indicates that the symbol can be spoken one or more times, and similarly, the asterisk (Kleene star, *) indicates that the symbol can be spoken zero or more times.

Based on the user names available in the imported rule, valid utterances that would be successfully matched against the grammar are e.g. "Good morning, my name is Rosie Elizabeth.", "Hi, I'm Nina.", or "Katie". The JSG format provides many other convenient features, such as tags (which can be used to add system-specific labels to rule definitions), weights (which indicate the probability of each symbol being spoken), or recursion (although recursive rules can be rewritten, recursion enhances readability).

#JSGF V1.0;

/**
 * --- This is a CATEGORY grammar. ---
 *
 * A grammar that contains the names of the known users of the system.
 * This grammar expands with the use of the system.
 *
 * @author Klara Kufova
 * @version 1.0
 */

grammar category-grammars.category-users;

public <users> = elizabeth
               | katie
               | nina
               | rosie
               ;


Grammars developed for the Coco dialogue system are currently of two types: grammars specific to a particular dialogue part—such as the opening and closing dialogues (see the previous grammar snippet)—and so-called category grammars18. Each category grammar contains only one public rule with a list of items from the given category. Categories that are distributed by default with Coco include for example users (see the example category grammar above), merchandise groups (the included items are e.g. accessories, clothes, and shoes), delivery types (premium, standard), possible answers (yes, no, sure, etc.), colours, numbers, sizes of clothes and shoes, and others.

The main purpose of category grammars is to make the system easily modifiable, expandable, transparent, and modular. Category grammars can be effortlessly changed to support the application's requirements without any subsequent need to modify the system's logic, and they significantly ease the dialogue management; detecting category words in spoken—possibly out-of-grammar—utterances considerably improves the dialogue system's natural language understanding abilities.

2.3.2 Speech Synthesis

Applications that are able to talk to their users are becoming standard equipment provided by most of the contemporary personal devices. Although their original purpose was to assist users with visual impairments or learning disabilities in controlling the device (common assistive technology applications are for example screen readers), speaking systems are nowadays utilized for entertainment, education, and research purposes.

One of the first mechanical speech synthesizers is the Voice Operating Demonstrator called Voder [66], which was invented by Homer Dudley in the Bell Telephone Laboratories and officially introduced in 1939. The Voder was based on the Vocoder ("voice encoder") [67], a device used for analysing and synthesizing human speech, whose purpose was to reconstruct the input spoken utterance into a synthesized output. The Voder was introduced at the New York World's Fair, where a male announcer was asking pre-prepared questions and a female operator was constructing simple answers by connecting various sounds into specific words using the device.

Although speech synthesis systems are currently widely adopted, the quality of the synthesized speech varies; commercial systems are generally more advanced than open-source, small-scale synthesizers.

18. The term category grammar was defined strictly for the purposes of the Coco dialogue system and this thesis.


MaryTTS

MaryTTS19 [68] is an open-source platform for speech synthesis created by the German Research Center for Artificial Intelligence and Saarland University in Germany. MaryTTS is written in the Java programming language and its current version 5.2 provides support for ten languages, including e.g. both American and British English, French, and Luxembourgish.

[Figure 2.4: The architecture of the MaryTTS speech synthesizer. The processing pipeline leads from the input (plain text, SABLE text, or SSML text) through the SABLE/SSML parsers, the MaryXML markup skeleton, the tokeniser (tokens and sentences), preprocessing (expanded forms), the tagger and chunker (part-of-speech and syntactic phrases), inflection endings, the lexicon and letter-to-sound modules (phonemes, stress, pitch accents, syllable boundaries), prosodic phrases and phonological rules (pronunciation in context), and the acoustic parameters (duration and F0 values) to the synthesis of the resulting sound.]

19. Mary is an abbreviation for a modular architecture for research on speech synthesis.


Architecture of the MaryTTS speech synthesizer

MaryTTS supports three types of input data: plain text, text in the SABLE20 format, and text in the SSML format. The SABLE and SSML parsers transform the input data into the MaryXML format, which is a low-level markup language created for internal processing. The tokeniser module then partitions the created MaryXML skeleton into separate tokens (words and punctuation marks21); individual words are enclosed by the token tag, while whole sentences by the sentence tag. The preprocessing step converts unpronounceable tokens, such as numbers and abbreviations, into their normalized and pronounceable form. As the normalized form differs based on the meaning of the utterance (for example, the digit "1" can be converted either to the term "one" or to the term "first"), such conversion introduces a complex problem.

The tagger module assigns corresponding part-of-speech tags to each token, while the chunker module performs shallow parsing (also known as partial parsing), whose purpose is to detect higher-level syntactic structures, such as noun phrases, pronoun phrases, and others [70]. The component responsible for "inflection endings" receives the normalized tokens from the preprocessing module, together with the information from the part-of-speech tagging and shallow parsing, and assigns a correct inflection to each normalized token. Known words are matched against the pronunciation lexicon, which contains their phonetic transcription. Words whose pronunciation cannot be derived from the lexicon are processed using the letter-to-sound module. Concurrently, various prosody labels are assigned to each token; this includes e.g. pitch accents, prosodic boundaries, or intonation, based on the sentence type (declarative, interrogative, or exclamative). The output from these modules can then be transformed into a set of phonological rules.

As a result, a complete MaryXML structure is created, which is converted into a number of acoustic parameters—such as frequency values—which affect the quality and properties of the resulting voice. If applicable, the information available at this stage can be used for controlling talking heads22. The synthesis itself is then performed using the MBROLA [71] (Multi-Band Resynthesis OverLap Add) software, and the resulting sound can be generated in multiple different audio formats.

20. SABLE [69] is an XML-based speech synthesis markup language that was developed as a shared effort of Sun Microsystems, AT&T, Bell Laboratories, and Edinburgh University. SABLE is a successor of the SSML format described in section 2.2.1.

21. Punctuation is especially critical for the resulting synthesized speech; each punctuation mark affects the utterance's tone—and often also its meaning—differently.

22. A talking head is a visual representation of the synthesized speech in the form of a human or animal head, body, etc.


Speech Synthesis in Coco

The default MaryTTS settings were significantly altered for the purposes of the system. In this section, two example modifications are described; the code snippet below shows the corresponding part of Coco's actual setup:

MaryInterface maryTTS = new LocalMaryInterface();
maryTTS.setVoice("dfki-prudence");
maryTTS.setInputType("RAWMARYXML");

Document document = MaryXML.newDocument();

Element maryXML = document.getDocumentElement();
maryXML.setAttribute("xml:lang", "en-GB");

Element paragraph = MaryXML.appendChildElement(maryXML, MaryXML.PARAGRAPH);
Element sentence = MaryXML.appendChildElement(paragraph, MaryXML.SENTENCE);
Element boundary = MaryXML.appendChildElement(sentence, MaryXML.BOUNDARY);
boundary.setAttribute("duration", "1000");
Element token = MaryXML.appendChildElement(sentence, MaryXML.TOKEN);
token.setTextContent(text);

MaryData maryData = new MaryData(MaryDataType.PHONEMES,
        Locale.ENGLISH, false);
maryData.setDocument(document);

(1) Although the dialogue system is distributed with the default MaryTTS English voice as well, Coco speaks with the custom DFKI23 Prudence voice. Prudence is a female British English unit selection24 voice, which was chosen as the most suitable for its naturalness and intelligibility.

(2) As the original synthesized speech was very fast and—due to the minimal pauses between individual spoken sentences—also unintelligible, Coco modifies the generated MaryXML document during runtime by inserting custom values into the duration attribute of the boundary element. The settings shown above correspond to the following MaryXML document:

<?xml version="1.0" encoding="UTF-8"?>
<maryxml version="0.5"
         xmlns="http://mary.dfki.de/2002/MaryXML"
         xml:lang="en-GB">
  <p>
    <s>
      <boundary duration="1000"/>
      <t>Hello, I am Coco!</t>
    </s>
  </p>
</maryxml>
23. DFKI is an abbreviation for the Deutsches Forschungszentrum für Künstliche Intelligenz; the German Research Center for Artificial Intelligence.

24. Each voice built for the MaryTTS platform is either a unit selection voice or a hidden Markov model-based voice. Depending on the selected type, the building process differs, as well as the quality and specifics of the resulting voice.

2.4 Speech Understanding and Generation

This section briefly describes some of the basic approaches used within the Coco dialogue system for natural language understanding and generation.

2.4.1 Speech Understanding

Correct understanding of the user's input utterance is an essential prerequisite for a successful dialogue. However, much like in a human-to-human conversation, errors in understanding necessarily happen also during dialogues with computer systems. Such a dialogue system then needs to be able to quickly adapt and respond to each error that occurs, either on its own initiative—if it is able to identify the mistake immediately—or after a notification from the user.

One approach to addressing errors in understanding is the use of so-called corrective dialogues (note the difference between a corrective dialogue and corrective feedback25). A corrective dialogue can be initiated by a dialogue system in several different scenarios, for example: (1) if the system incorrectly recognized the user's spoken input and it is notified either by the user or by the dialogue manager, (2) if the system correctly recognized the user's spoken input, but it does not understand the meaning of the utterance, and (3) if a specific input is expected from the user in order to achieve a predefined goal, yet the required information was not provided.

In Coco, corrective dialogues can be initiated within each dialogue part26. One example is dictating the number of the item to be ordered, which is expected by the system in a specific format to be correctly processed:

Coco: Could you now kindly dictate the four-digit item number that is available at the top of the page?
You: Three thousand and five.
Coco: Could you please repeat the item number? For example: one two three four.
You: Three zero zero five.
Coco: Did you say that the item number is three zero zero five?
You: Yes, I did.

25. Corrective feedback is most commonly used in educational dialogue systems. The system identifies the user's mistake (a grammatical error, stylistic inconsistency, etc.), notifies the user, and provides a relevant correction [72].

26. The dialogue in Coco is divided into five parts: the opening, counting (i.e. obtaining the number of items to be ordered), ordering, shipping, and closing dialogue.


A precursor or successor of a corrective dialogue can be so-called dialogue feedback. Dialogue feedback is usually initiated to verify the obtained information before proceeding further, as faulty data may cause critical issues in the future (e.g. incorrectly entered or recognized shipping information will cause the delivery not to arrive). Dialogue feedback can be summarizing, echoing, implicit, or explicit. The dialogue snippet above shows an example of explicit dialogue feedback initiated right after the corrective dialogue to confirm the acquired information. In Coco, other feedback types are implemented as well; the snippet below shows an example of explicit dialogue feedback followed by an implicit one:

Coco: OK! How many items would you like to order in total?
You: I would like to order two items.
Coco: Did you say that the number of items you would like to order is two?
You: Yeah.
Coco: Great! Together we will now order your two items.

In Coco, natural language understanding is primarily based on category grammars, which were introduced in section 2.3.1. Category grammars can be perceived as "keyword dictionaries" when searching for expected words or their combinations; for example, if Coco asks a question about the desired jacket size, it is usually not necessary to process and understand the whole input utterance, but it is sufficient to detect the particular corresponding word. Coco employs elaborate strategies for rejecting or reacting to utterances which do not meet the required criteria (such as uncertain answers similar to "I'm thinking XS or S." or "I am not sure about the size yet.").
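The keyword-style detection described above can be sketched as follows; the class, the method, and the example size list are hypothetical and not Coco's actual implementation.

import java.util.Arrays;
import java.util.List;

public class CategoryDetector {

    /** Returns the first word of the utterance found in the category, or null. */
    static String detectCategoryWord(String utterance, List<String> categoryWords) {
        for (String word : utterance.toLowerCase().split("\\s+")) {
            if (categoryWords.contains(word)) {
                return word;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<String> sizes = Arrays.asList("xs", "s", "m", "l", "xl");
        // Prints "m" -- the detected clothing size.
        System.out.println(detectCategoryWord("I would like size M please", sizes));
    }
}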

2.4.2 Speech Generation

As the problem domain, goals, and dialogue strategies are established in advance regardless of the user, the flow of the dialogue is fairly uncomplicated and the system does not need to be able to react to an immense number of different situations which may occur during the conversation. As mentioned before, the dialogue is divided into five parts, and each part is responsible for accomplishing its objectives; if the required data is not obtained from the user—or there are unfulfilled conditions—the system does not allow the conversation to proceed further. Within each dialogue segment, Coco either uses one of the predefined segment-specific utterances, or creates its own by filling in the slots in the set of prepared incomplete utterances.
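A minimal sketch of the slot-filling step; the template text is taken from the dialogue examples above, while the class name and the way the slot value is supplied are only illustrative.

public class UtteranceTemplates {

    public static void main(String[] args) {
        // A prepared incomplete utterance with a single slot.
        String template =
                "Did you say that the number of items you would like to order is %s?";

        // The slot is filled with the value obtained earlier in the dialogue.
        System.out.println(String.format(template, "two"));
    }
}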

3 Adaptive Dialogue Systems

Although the Coco dialogue system employs a number of interesting approaches and even in its simplest form is sufficient as a simple chat bot in an appropriately selected problem domain, the most significant and appealing feature of the system, which has not yet been adequately mentioned, is its ability to learn new grammar rules for the automatic speech recognition based on previous conversations with real users.

Motivation

As already pointed out in section 2.3.1, utilizing grammars as the default form of language modelling for recognizing the input speech has many drawbacks, which cause contemporary research to divert completely from their use. The most critical hindrances are the necessity to create the grammars manually (1) and their inability to support wide dictionaries and extensive problem domains (2). However, a notable advantage of using grammars, particularly in open-source speech recognizers, is their recognition accuracy compared to the precision and correctness of speech recognition based on statistical language models.

The recognition accuracy is especially important in goal-oriented problem domains (such as online shopping, route planning, or e.g. library systems), where the users want to achieve their predefined aims promptly and without complications and misunderstandings. Systems that are not able to precisely recognize and understand basic commands and queries cannot be successfully deployed into production, as their insufficient competence may cause frustration and even a potential loss of customers.

The adaptive learning of grammar rules implemented in the Coco dialogue system solves both mentioned drawbacks while allowing the system to benefit from a high speech recognition accuracy at the same time. Moreover, Coco provides a number of different strategies that prevent the system from repeated faulty recognition. Various properties, such as the number of attempts to recognize the user's spoken utterance, can be easily modified to suit any application's needs. These properties are described in the relevant parts of the following text and are marked in the dialogue system's source code as well.

In the following section 3.1, the adaptive learning implemented in Coco is described in detail, together with concrete examples from the system's source code, suggested alternative approaches to the mentioned problems, and known software bugs and their proposed solutions. In section 3.2, two approaches to the automatic reduction of unused rules are presented.

3.1 Automatic Derivation of Grammar Rules

One of the dialogue system’s most notable features is its ability to automat- ically expand the used grammars for automatic speech recognition. Once a new rule is added to the grammar, it can then be matched against the user’s spoken utterance in any future dialogue without the obligation to further alter or recompile the system. This capability makes the dialogue system self-sustainable even after it has been deployed into production and more- over, it allows the system to learn the most commonly used utterances as opposed the need to rely on the default set of rules, which has been created and used during the development and testing phases and does not necessarily reflect the actual production situations. The process of creating a new grammar rule is depicted below:

1. Detecting an out-of-grammar utterance;
2. initiating a corrective dialogue to obtain the spoken utterance;
3. expanding the grammar by adding an entirely new rule or modifying an already existing rule;
4. continuing with the dialogue.

All of the steps above are described thoroughly in the following text.

3.1.1 Detecting Out-Of-Grammar Utterances

The most important prerequisite for a successful grammar extension is the detection of out-of-grammar utterances, which have been produced by the user during the dialogue. Commonly, speech recognition systems tend to force matching the rules, as they intuitively assume that the grammars already cover all the possible combinations of words in the input speech; and because grammars are typically used in limited problem domains, where simple spoken commands are usually sufficient, this assumption is in most cases actually justified. However, complex dialogue systems such as Coco require a fine-grained detection of the utterances which cannot be successfully matched against any previously specified grammar rule.

A basic example of a forced rule match is an introduction of a new user whose name differs from the names that are already known by the system. Considering the following rule

public <introduction> = hello my name is elizabeth;

the utterance "Hello, my name is Isabella." may be incorrectly matched, as the only dissimilarity is one word with a closely resembling pronunciation.


By default, the Sphinx4 speech recognizer does not offer precise support for detecting out-of-grammar utterances and the recognition accuracy is in similar cases extremely low. However, the system provides a group of properties, which can be set within the Linguist module1 and which are responsible for accurate out-of-grammar detection. The default values of these properties are specified in the default.config.xml configuration file in the Sphinx4 API and the properties are as follows:

» addOutOfGrammarBranch: to allow the system to consider the possibility of an out-of-grammar input, this property must be set to true. If set to true, the search graphs generated by the Linguist module contain a branch specifically intended for representing a potential out-of-grammar utterance.

» outOfGrammarProbability: this property specifies the probability of an out-of-grammar input speech occurrence. The default value of the property in Sphinx4 is 1e-30.

» phoneInsertionProbability: the greater the value, the more preferred are the words with a higher number of phones. For example, in case of a high phone-insertion probability, the recognizer would favour the homophone "whale" as opposed to the identically sounding word "wail" with a lower number of phones. The default value of the property in Sphinx4 is 1e-10.

» phoneLoopAcousticModel: defines the acoustic model that is used for detecting an out-of-grammar utterance and proposing the rejection hypothesis. Generally, the acoustic model employed within the Linguist module is used.

For the purposes of the Coco dialogue system, the properties above had to be significantly altered and thoroughly tested to achieve as accurate recognition results as possible. To avoid the need to recompile the entire Sphinx4 library every time the properties above have changed, Coco utilizes a custom class for speech recognition, which sets the required values of the relevant properties during runtime. Experimentally, the most suitable values turned out to be 0.5 for the out-of-grammar probability and 0.01 for the phone-insertion probability. The values of the addOutOfGrammarBranch and phoneLoopAcousticModel properties remain unchanged.

1. The architecture of the Sphinx4 speech recognition system is described in detail in section 2.3.1.


public class SpeechRecognizer extends LiveSpeechRecognizer {

    public SpeechRecognizer(Configuration configuration) throws IOException {
        super(configuration);

        context.setLocalProperty(
                "flatLinguist->outOfGrammarProbability", "0.5");
        context.setLocalProperty(
                "flatLinguist->phoneInsertionProbability", "0.01");
    }

}

If the Sphinx4 speech recognizer detects an out-of-grammar utterance, an empty result is returned as the output. This is an important signal for the dialogue manager that an unknown utterance has been produced and the corresponding grammar needs to be updated with a new rule. As a first step, a specific corrective dialogue is initiated.
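A minimal sketch of how a dialogue manager can react to this signal, assuming the standard Sphinx4 LiveSpeechRecognizer API; the helper class and method names are hypothetical and not part of Coco's source code.

import edu.cmu.sphinx.api.LiveSpeechRecognizer;

public class OutOfGrammarCheck {

    /** Returns the hypothesis, or null if an out-of-grammar utterance occurred. */
    static String recognizeOrNull(LiveSpeechRecognizer recognizer) {
        recognizer.startRecognition(true);
        String hypothesis = recognizer.getResult().getHypothesis();
        recognizer.stopRecognition();

        // An empty hypothesis is the signal to initiate a corrective dialogue.
        return (hypothesis == null || hypothesis.trim().isEmpty()) ? null : hypothesis;
    }
}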

3.1.2 Corrective Dialogue

The term corrective dialogue was already briefly introduced in section 2.4.1. Here, the term is used to describe the dialogue segment that is initiated after detecting an out-of-grammar utterance whose content and meaning have not been recognized and understood by the dialogue system. A corrective dialogue has two main goals:

1. to acquire the user's spoken utterance (a process evident to the user) in order to extend the corresponding grammar (a process hidden from the user), and
2. to obtain the required information that is specific for each dialogue part and necessary for a proper dialogue continuation (e.g. specifying the shipping option preferred by the user).

The most interesting aspect of a corrective dialogue implemented in the Coco dialogue system is its combination of two types of automatic speech recognizers: a grammar-based and a statistical language model-based recognizer. The latter is necessary for acquiring the spoken utterance which has not been matched against any rule in the corresponding grammar. The language model utilized in Coco (which was already briefly introduced in section 2.3.1) is intentionally generic, so that it is capable of correctly recognizing any spoken input in English, irrespective of its content or meaning. This allows the dialogue system to learn without any limitations.


DialogCorrective() {
    try {
        lmRecognizer = new SpeechRecognizer(
                configureRecognizer(false, null));
        grammarRecognizer = new SpeechRecognizer(
                configureRecognizer(true, CORRECTIVE_DIALOG_GRAMMAR));
    } catch (IOException exception) {
        System.out.println(
                "Initializing speech recognizers has failed.");
    }
}

The code snippet above shows the initialization of the speech recognizers within a corrective dialogue. The parameters of the configureRecognizer method define whether to use grammar-based speech recognition (the first parameter) and, if so, which grammar to employ (the second parameter).

A corrective dialogue itself can be illustrated by the universal schema below. Although the exact procedure entirely depends on the particular dialogue situation and the properties which were set prior to the conversation (such as the number of attempts to recognize the input speech using the statistical language model-based recognizer), each corrective dialogue is composed of the displayed segments.

[Figure 3.1: The schema of a corrective dialogue. The diagram covers the two variants (GET WORD and GET STRING) and the steps: recognize speech, check whether the utterance was recognized correctly within the allowed number of attempts, ask for the answer or ask for a repetition, get written input, and add the rule.]


The basis of a corrective dialogue differs depending on whether the information that needs to be obtained from the user is in the form of a multi-word string (such as the name of the user, e.g. Elizabeth Rose) or a single word (such as the item type, e.g. dress). In the following text, the latter case is considered, although the implementation of both is analogous.

As the initial step, the user is asked to repeat the originally unrecognisable utterance, which could not be matched against any already existing grammar rule. At this stage, a statistical language model-based recognizer that attempts to recognize the utterance is employed for the first time, and the natural language understanding unit of the dialogue system thereupon tries to detect the required information within the utterance. If the information is indeed identified, the corrective dialogue is promptly concluded to enhance the user experience and the corresponding grammar—based on the recognized utterance—is expanded.

Coco: Could you please specify the type of the item?
» An unknown utterance has been recognized! «
Coco: I am sorry, but I do not understand what you have said just yet. Could you please repeat your answer to help me learn?
You: I've chosen a dress with a floral print.
Coco: Thank you for helping me learn! Now, let's continue.

However, the required information may either not be included in the utterance produced by the user, or the utterance may be incorrectly recognized2 by the recognizer. To determine which of these situations has occurred and to choose a suitable dialogue strategy, the dialogue system asks the user whether the input utterance has been correctly understood.

The user's answer is recognized by a grammar-based recognizer that uses a grammar which was created specifically for the purposes of a corrective dialogue. If the user's answer is negative, it is a signal for the dialogue manager that the recognition was inaccurate and thus the manager asks the user to repeat the utterance again. If the answer is positive, it is apparent that the user did not provide the necessary information and is requested to answer the original question asked by the system before the corrective dialogue has been initiated. In both cases, the following speech recognition is performed by a language model-based recognizer again.

2. When utilizing open-source automatic speech recognition systems, it is necessary to consider the fact that the recognition accuracy of the language model-based recognizers is significantly lower compared to the accuracy of the grammar-based recognizers. The dialogue system’s dialogue management and NLU unit needs to be adjusted accordingly.


Coco: I am sorry, but I do not understand what you have said just yet. Could you please repeat your answer to help me learn?
You: I it does get out and dive in.
Coco: Did I correctly hear what you said?
You: No, you didn't.
Coco: I am sorry. Could you please say it again?
You: I've chosen a dress with a floral print.

Coco: I am sorry, but I do not understand what you have said just yet. Could you please repeat your answer to help me learn?
You: Yes, I can repeat my answer. But will I?
Coco: Did I correctly hear what you said?
You: Oh yes, you did.
Coco: I am afraid that this was not an answer to my question; I have asked you about the item type. Could you please respond?
You: I've chosen a dress with a floral print.

However, in case recognizing the input speech or detecting the required information has been unsuccessful again, the dialogue system offers the possibility to set the number of attempts to obtain the expected output. By default, the number of attempts is set to 1, which corresponds to a single iteration of the loop. Setting the value to 0 is suitable for dialogue systems with an extremely low recognition accuracy or in domains where a long interruption of the dialogue flow is considered undesirable. On the contrary, allowing a higher value may cause a disappointing user experience in case the recognition repeatedly fails.

// Set the number of attempts to suit the application's needs:
int numberOfAttempts = 1;
for (int i = 0; i < numberOfAttempts; i++) {
    checkIfCorrectlyRecognized();
    if (answer.isPositive()) {
        askUserForRequiredWord(category);
    } else {
        askForRepeatingWord(category);
    }
    ...

After all the attempts have been exhausted, yet the utterance has not been correctly recognized or its content does not contain the expected information, the dialogue system asks the user to manually type in the utterance.


This option can be completely restricted by setting a high number of attempts, as the manual control of the dialogue might be considered unsuitable for example in deployment environments where the predominant customer group consists of users with physical disabilities or where it is not possible to use a keyboard (e.g. in IVR-based applications). However, the Coco dialogue system was originally designed to be controlled by a written input as well, and such a restriction would cause a significant loss of functionality.

Coco: It looks like I cannot understand you at all. Could you please write it down for me?
» I've chosen a dress with a floral print.
Coco: Thank you for helping me learn! Now, let's continue.

In case the written statement does not contain the required information either, the user is repeatedly asked for a spoken utterance, alternated with a written input, until the dialogue can proceed further.

The only situation when the dialogue system is not instructed to ask for repeating the unrecognised utterance is the introduction of a user whose name is not in the list of known user names. To avoid the inevitable issues with incorrectly recognized complicated names, the user is asked to type in the name manually immediately after an unmatchable utterance occurs. The entered input is later used during the specification of the shipping information, yet with the possibility to change the name if it is not correct.

Coco: Firstly, please say your name.
» An unknown utterance has been recognized! «
Coco: I have never heard this name before. Could you please write it down for me? I promise I will remember next time!
» Your name: Klara
Coco: Hello Klara!

As the final step of a corrective dialogue, the dialogue system passes the obtained utterance to the control unit that handles the logic behind the grammar expansion.

3.1.3 Grammar Expansion

Unlike the corrective dialogue, which is partially user-controlled, the actual grammar expansion is completely hidden from the user and does not interfere with the dialogue flow.


Each rule learned during a conversation between the system and the user can be successfully matched in any following dialogue, without the consecutive need to adjust the modified grammar, recompile the system, or perform any further alterations. The rule-learning mechanism in the Coco dialogue system can be divided into two fundamental classes:

1. learning single category words (component: Word Learner), and
2. learning complete utterances composed of multiple words (component: Rule Learner).

Word Learner

Category grammars, which were already mentioned in sections 2.3.1 and 2.4.1, are custom-created keyword dictionaries, such that each of them consists of a single rule with a list of individual category-specific words (e.g. user names, clothing sizes, item types, etc.). Generally, category grammars and their rules are imported into the corresponding dialogue part-related grammars, which use references to the category rules as opposed to inefficiently listing the particular words one by one. These characteristics of the Coco dialogue system allow the category grammars to be easily expanded or modified regardless of the actual dialogue logic and implementation.

The most elementary mechanism of a grammar expansion in Coco is the automatic learning of new category words. A typical example of this type of learning is extending the list of user names based on the conversations with the real users of the dialogue system. The control unit responsible for word learning is called Word Learner. The Word Learner requires two parameters as its input: the word to be added to a category and the category itself.

In case of multi-word user names (or generally multi-word phrases whose components are designated to be added into a category grammar), the created system divides the entered name into individual words and includes them separately into the corresponding grammar. This ensures that any part of the processed name can be reused in the future. For example, introducing the user "Anna Grace" would expand the initial category grammar with user names in the following way:

public <users> = elizabeth
               | katie
               | nina
               | rosie
               | anna
               | grace
               ;
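A minimal sketch of the word-learning step under the assumptions above: the category grammar contains a single public rule, and the new word is inserted as another alternative right before the closing semicolon. The class and method names are hypothetical and not Coco's actual implementation.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class WordLearner {

    /** Appends 'word' as a new alternative of the single rule in the category grammar. */
    static void learnWord(String word, String categoryGrammarFile) throws IOException {
        Path grammar = Paths.get(categoryGrammarFile);
        String content = new String(Files.readAllBytes(grammar), StandardCharsets.UTF_8);

        // The category rule ends with the last semicolon in the file;
        // the new alternative is inserted right before it.
        int end = content.lastIndexOf(';');
        String expanded = content.substring(0, end)
                + System.lineSeparator() + "               | " + word.toLowerCase()
                + System.lineSeparator() + "               " + content.substring(end);

        Files.write(grammar, expanded.getBytes(StandardCharsets.UTF_8));
    }
}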


Rule Learner

Unlike single-word learning, learning new grammar rules is a complicated concept that requires the cooperation of multiple system components. The control unit responsible for rule learning in the Coco dialogue system is called Rule Learner. To successfully expand a grammar with a new rule, the Rule Learner expects three parameters from the dialogue system: the user's correctly recognized utterance, the word (or the list of words) contained in the utterance that belongs to a particular category grammar, and the category grammar itself. The last two arguments can be omitted; however, the newly generated rules may then unnecessarily chain the category-related words instead of using a simple reference to a category rule.

Within the Coco dialogue system, each grammar body that can be expanded by the automatic rule learning is divided into two logical segments: the default and the derived rules. While the default rules are distributed with the dialogue system directly and cannot be altered in any way, the derived rules segment is originally empty and changes based on the varying conversations with real users. The code snippet below shows an actual grammar used within Coco for the closing dialogue:

grammar closing-dialog;

// ---------------------------------
// DEFAULT RULES
// ---------------------------------

<wish> = i (want | would like) to;
<modification> = cancel | change | delete | modify;
<talk> = talk to | speak with;

public <modify-order> = <wish> <modification> the order;
public <new-order> = <wish> place [another | a different] order;
public <manager> = <wish> <talk> (the | your) manager;

// ---------------------------------
// DERIVED RULES
// ---------------------------------

Although Coco is distributed with a collection of default rules that may be fully sufficient for some conversations, it is possible to let the dialogue system learn each defined rule independently by altering the grammars and subsequently deploying the system without any predefined rules. Such an environment would, however, cause the initial conversations to be extremely slow, until the grammars contained adequate language coverage to be actually usable in production.


Comparing utterances and rules

As the very first step, the received utterance is compared with each rule from the group of derived rules. The comparison is based on a numerical parameter, which represents the number of allowed differences between the utterance and a rule. If the actual number of differences exceeds the specified parameter, a completely new rule is added to the grammar. Otherwise, the rule that meets the condition is modified accordingly, so that it could be successfully matched against the same utterance in the future.

The number of differences defines the number of words that differ between the utterance and a rule. The comparison is double-sided; the utterance cannot contain more distinct words than the allowed amount, and the same condition applies to the compared rule. By default, the parameter is set to 3 in the Coco dialogue system; however, the value can be adjusted to suit any application's needs. Setting the value to 0 would result in each utterance being added as a completely new rule, while a higher value would enforce frequent rule modification.

To perform a precise comparison, both the input utterance and the rule must be appropriately preprocessed. The preprocessing is performed by both the Rule Learner and a module called Rule Scanner.
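A minimal sketch of the double-sided comparison, assuming the utterance and the rule have already been preprocessed into plain word sequences (as described below); the class and method names are hypothetical and not Coco's actual implementation.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class RuleComparator {

    static final int ALLOWED_DIFFERENCES = 3;   // Coco's default value

    /** Returns true if neither side contains more distinct words than allowed. */
    static boolean isCloseEnough(String utterance, String ruleRepresentatives) {
        Set<String> utteranceWords =
                new HashSet<>(Arrays.asList(utterance.split("\\s+")));
        Set<String> ruleWords =
                new HashSet<>(Arrays.asList(ruleRepresentatives.split("\\s+")));

        // Words present in the utterance but not in the rule...
        Set<String> extraInUtterance = new HashSet<>(utteranceWords);
        extraInUtterance.removeAll(ruleWords);

        // ...and words present in the rule but not in the utterance.
        Set<String> extraInRule = new HashSet<>(ruleWords);
        extraInRule.removeAll(utteranceWords);

        return extraInUtterance.size() <= ALLOWED_DIFFERENCES
                && extraInRule.size() <= ALLOWED_DIFFERENCES;
    }
}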

Utterances

As already mentioned above, all the category words in the input utterance are replaced by the corresponding reference to a category rule. If the input contains punctuation marks, all of their occurrences are removed to obtain plain text. For example, the input utterance "I think I have chosen two items." is altered to:

i think i have chosen <numbers> items

Rules

Preprocessing rules is a complicated procedure, mainly due to the number of potential special symbols and assorted logical groups inside a rule. The Rule Scanner unit solves this problem by transforming a rule into a sequence of so-called representatives. A representative is a word that represents its logical group. A word which is not a part of any logical group represents itself. All words which are a part of one logical group are represented by a single word from the group, most typically the first one. For example, the rule body

i (think | guess) i want [pretty | cool] items;

is transformed into the following sequence of representatives:

i think i want pretty items


After the preprocessed utterance and rule are compared, a decision based on the number of differences is made. If the number of allowed differences is exceeded, a completely new rule is added to the grammar.

Adding a new rule

A newly added rule declaration can be divided into four parts: the access modifier (1), the rule name (2), the rule body (3), and a mandatory semicolon (4). In the Coco dialogue system, each newly added rule is declared public, as the rule was created based on an actual spoken utterance. Every rule name starts with the prefix generated-rule (to distinguish the derived rules from the default rules), followed by a time stamp from the moment the rule was created. The time stamp ensures the uniqueness of a rule and is necessary during the automatic grammar reduction, which is described more thoroughly in section 3.2. The format of time stamps in Coco is year-month-day-hour-minute-second.

The rule body—or the rule expansion—is the preprocessed utterance (as described above) followed by a semicolon. No further modifications are made to the input utterance to guarantee the correctness of the new rule. For example, based on the input utterance "I want to buy four items.", a rule of the following form is created:

public <generated-rule-2018-05-14-10-22-31> = i want to buy <numbers> items;
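A minimal sketch of composing such a rule declaration; the class and method names are hypothetical, and only the naming scheme (the generated-rule prefix and the time stamp format) follows the description above.

import java.text.SimpleDateFormat;
import java.util.Date;

public class GeneratedRules {

    /** Composes a public derived rule from an already preprocessed utterance. */
    static String composeRule(String preprocessedUtterance) {
        String timeStamp =
                new SimpleDateFormat("yyyy-MM-dd-HH-mm-ss").format(new Date());
        return "public <generated-rule-" + timeStamp + "> = "
                + preprocessedUtterance + ";";
    }

    public static void main(String[] args) {
        System.out.println(composeRule("i want to buy <numbers> items"));
    }
}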

Naturally, if the grammar does not contain any derived rules yet, the input utterance cannot go through the comparison process and a new rule must be added. Otherwise, the input utterance is compared with all the available rules one by one (the rules are always sorted from the oldest to the latest) until a rule that meets the condition of allowed differences is found. At this moment, the comparison process ends and the current rule is altered accordingly.

Altering a rule

The most important step of altering a rule is the computation of a so-called diff between the utterance and the rule. The diff is a linked list of differences and equalities between two pieces of plain text, where each difference or equality is described as either EQUAL, INSERT, or DELETE. To compute the diff, the Coco dialogue system uses the open-source tool diff-match-patch [73], which was created by Google in 2006 for the Google Docs [74] project. The algorithm used within the tool for the diff creation was proposed by Eugene W. Myers in 1986 [75]. Besides the diff computation, the tool also provides the ability to find a fuzzy match in plain text and apply custom patches to the given piece of text.


The diff-match-patch tool is distributed directly with Coco as a part of its source code. For the purposes of the Coco dialogue system, the tool was significantly modified; its source code was changed from Java 7 to Java 8, access modifiers defined within the tool were adjusted, it was formatted to conform to the code formatting of the dialogue system, a large amount of unused code and comments was completely removed, tool variables were renamed, and most importantly, new methods diff_linesToWords and diff_LinesToWordsMunge were added to support the computation of a word-based diff.

The input for the diff-match-patch tool consists of the two sequences of representatives used during the comparison. Given the two input texts items would be great (the rule body to be altered) and yes items would be perfect (the input utterance), their diff3 is computed as follows:

[Diff(INSERT,"yes "), Diff(EQUAL," items would be "), Diff(DELETE,"great"), Diff(INSERT,"perfect")]
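For illustration, a comparable diff can be obtained with the stock diff-match-patch Java API as sketched below (the package and method names follow the upstream library; Coco's modified copy additionally performs the word-based grouping described above):

import java.util.LinkedList;
import name.fraser.neil.plaintext.diff_match_patch;

public class DiffSketch {
    public static void main(String[] args) {
        // Sketch using the upstream diff-match-patch API, not Coco's modified copy.
        diff_match_patch dmp = new diff_match_patch();
        LinkedList<diff_match_patch.Diff> diffs =
                dmp.diff_main("items would be great", "yes items would be perfect");
        // Merge trivial single-character edits into more readable chunks.
        dmp.diff_cleanupSemantic(diffs);
        System.out.println(diffs);
    }
}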

The Rule Learner takes each element from the resulting linked list and gradually alters the selected rule. The alteration takes multiple aspects into account, e.g. the type of the element (EQUAL, INSERT, or DELETE), the element's context, or the element's position in the input text. Based on these characteristics, logical groups of words (enclosed in either round or square brackets) are created. The result of the alteration from the above example is shown below:

[yes] items would be (great) (perfect)

The next step is to correctly format the newly composed rule. The formatting is done by a component called Rule Formatter. The Rule Formatter performs bracket merging, adjusts unnecessarily added brackets (such as single words enclosed in round brackets), fixes potentially distorted references to category rules, removes duplicate words (whose presence may be caused by representatives), etc. The final formatted rule body is shown below:

[yes] items would be (great | perfect)
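The bracket merging step alone can be illustrated by the following hypothetical helper (the method name and regular expression are illustrative; the actual Rule Formatter performs several additional adjustments):

public class RuleFormatterSketch {

    // Illustrative sketch only: merges adjacent round-bracket groups into a single
    // alternative group, e.g. "(great) (perfect)" becomes "(great | perfect)".
    public static String mergeAdjacentGroups(String ruleBody) {
        return ruleBody.replaceAll("\\)\\s*\\(", " | ");
    }

    public static void main(String[] args) {
        // Prints: [yes] items would be (great | perfect)
        System.out.println(mergeAdjacentGroups("[yes] items would be (great) (perfect)"));
    }
}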

3. One of the weak points of the rule adaptation is the occasional inaccuracy of the diff computation. If the diff is created incorrectly, the resulting rule may be invalid and unusable for speech recognition. An example of a faulty diff computation is the misinterpretation of single characters as whole words and their subsequent use as a context for a different word. However, this issue can be mitigated by the automatic rule pruning described in the following section 3.2.


As the last step, the obsolete rule is completely removed from the grammar and its newly created alteration is added to the end of the list of derived rules with a new time stamp. Updating the time stamp is crucial for the proper operation of the automatic grammar rule pruning mechanism. Although adding a completely new rule each time an unknown utterance is recognized would be a valid and quite uncomplicated approach, the rule alteration in the Coco dialogue system is one of its most valuable features: it provides an otherwise simple dialogue system with an intelligent behaviour that significantly eases the usage of the system and improves the user experience. Once the initially unknown spoken utterance has been utilized to expand the corresponding grammar, the dialogue between the system and the user continues.

3.1.4 Dialogue Continuation

You: Yes, four items would be perfect.
Coco: Thank you for helping me learn! Now, let's continue.
Coco: Did you say that the number of items you would like to order is four?

After the Coco dialogue system correctly recognizes the user's spoken utterance and is able to detect the information required for a proper dialogue continuation, Coco immediately concludes the corrective dialogue and the conversation resumes from the moment at which the original dialogue was interrupted. The corrective dialogue is identical regardless of the dialogue segment in which it was initiated, and it does not influence the main dialogue flow.

3.2 Automatic Reduction of Grammar Rules

The automatic pruning of grammar rules is an essential procedure in systems whose grammars can grow uncontrollably. The reduction should prevent the grammar from accumulating obsolete and unused rules, and most importantly, it should prune the rules which were either falsely created or altered and whose deformation makes them no longer usable for speech recognition. In the Coco dialogue system, the automatic pruning is performed separately for each dialogue part (as each dialogue part utilizes a custom, dialogue part-specific grammar) during every real-time conversation with the dialogue system, right after the dialogue part is concluded.


Such an implementation ensures that the reduction is executed on a regular basis, as opposed to an irregular rule-pruning trigger, which may be insufficient and may even cause production issues in case of deformed rule definitions. If a deformed rule contains an invalid segment (according to the JSGF specification), the grammar cannot be parsed and the speech recognition fails. Two techniques of automatic rule reduction were designed for the Coco dialogue system and the purposes of this thesis:

1. removing old rules, which were created before a predefined time threshold (this method is implemented in Coco), and

2. removing rules which are not used frequently enough by the users of the dialogue system (this method is not yet implemented).

Both techniques are described in detail in the following two sections. The system's component responsible for the automatic rule reduction is called Rule Pruner. The methods used for the actual rule pruning are adjustable within the Rule Pruner, as shown in the following code snippet:

public void pruneGrammar(String grammarFile) {
    // Based on the application requirements, use one or both
    // of the following options, or add a new one:
    pruneOldRules(grammarFile);
    pruneUnusedRules(grammarFile);
}

3.2.1 Removing Old Rules

The first technique designed and implemented for the automatic rule reduction is removing the rules which were added to the grammar before a predefined time period. By default, the time period is set to one month in the Coco dialogue system; however, the threshold is adjustable and can be modified to suit any application's needs.

...
DateTimeFormatter formatter = DateTimeFormatter
        .ofPattern("yyyy-MM-dd-HH-mm-ss");
LocalDateTime ruleDateAndTime = LocalDateTime
        .parse(ruleTimestamp, formatter);

// Set the number of months to suit your application's needs:
if (ruleDateAndTime.isBefore(LocalDateTime.now().minusMonths(1))) {
...
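A hypothetical, self-contained version of this check could look as follows (the class, method, and parameter names are illustrative and do not correspond to Coco's actual identifiers):

import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class OldRulePrunerSketch {

    private static final DateTimeFormatter TIMESTAMP =
            DateTimeFormatter.ofPattern("yyyy-MM-dd-HH-mm-ss");

    // Illustrative sketch only: extracts the time stamp from a derived rule name
    // such as "generated-rule-2018-05-12-10-15-30" and decides whether the rule
    // is older than the configured threshold of one month.
    public static boolean isOlderThanOneMonth(String ruleName) {
        String ruleTimestamp = ruleName.substring("generated-rule-".length());
        LocalDateTime ruleDateAndTime = LocalDateTime.parse(ruleTimestamp, TIMESTAMP);
        return ruleDateAndTime.isBefore(LocalDateTime.now().minusMonths(1));
    }
}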


The time threshold can be adjusted to reflect the actual usage of the dialogue system in production; high-traffic applications may benefit from a shorter time period (such as one week), while for rarely used applications a longer threshold (such as two or more months) may be favourable. As the rule’s time stamp (contained in the rule name, as described in section 3.1.3) is updated every time a rule is altered, the method does not affect the recently added or altered rules.

Advantages This technique of rule reduction ensures that the list of derived rules within a grammar is regularly reduced and does not contain an excessively large number of rules, which could cause performance issues if a slow speech recognizer is employed. Recently added rules also precisely reflect the actual needs of the dialogue system's users and as such can be utilized for inspecting the currently prevailing customer demands in order to adapt the business's strategies and maximize profits.

Disadvantages The biggest disadvantage of this approach is the potential removal of rules that are used frequently "as they are", without any further alterations. However, this issue may be solved by combining this method with the following technique of unused rules removal.

3.2.2 Removing Unused Rules

The other presented approach is the technique of removing the grammar rules that are not matched against the incoming utterances often enough and are therefore considered unused.⁴

A necessary prerequisite for a successful removal of unused rules is a mechanism that correctly identifies the grammar rule that was matched against the incoming spoken utterance. A library that is suitable for Coco and provides this functionality is, for example, the open-source JSFG Grammar Rules Parser Library [76], which was created in 2017 for the Sphinx4 speech recognizer and is written in the Java programming language. Once the matched rule is identified, a label that expresses the number of times the rule has been matched in the past must be updated accordingly; its value must be increased by one. In grammars written in the JSGF format, such labels can be established, for example, by using the concept of tags. A tag, enclosed in curly brackets ({, }), is a unary operator that contains application-specific information and is included in the rule definition. A tag does not affect the recognition process and can be attached to any rule.

4. Note that this technique is not implemented in the initial release of Coco.


An example grammar with rule definitions containing tags that indicate the number of times a rule has been matched against an incoming spoken utterance in the past can look as follows (the derived rule names, which contain the generation time stamps, are abbreviated):

// ------------------------------
// DERIVED RULES
// ------------------------------

public <generated-rule-...> = i think i (have chosen | picked) items {13};
public <generated-rule-...> = items would be (great | amazing) [thank you] {5};

The actual reduction step then completely removes from the grammar the rules that were matched fewer times than the desired numerical threshold. However, this rule reduction logic must only be applied to rules older than a predefined time period, to avoid removing each newly created rule, which has necessarily been matched only once (during its creation). The value of the threshold must also be increased regularly to remain relevant for the rule reduction process.
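A hypothetical sketch of this reduction step is shown below (this functionality is not part of the initial Coco release; the class name, method names, and the regular expression are purely illustrative):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnusedRulePrunerSketch {

    // Matches a single-line derived rule declaration and captures its usage tag,
    // e.g. the "{13}" at the end of the rule definition.
    private static final Pattern DERIVED_RULE =
            Pattern.compile("public\\s+<generated-rule-[^>]+>\\s*=.*\\{(\\d+)\\};");

    // Returns true if the derived rule was matched fewer times than the threshold
    // and should therefore be removed from the grammar. The caller is expected to
    // apply this check only to rules older than the predefined time period.
    public static boolean isUnused(String ruleDeclaration, int usageThreshold) {
        Matcher matcher = DERIVED_RULE.matcher(ruleDeclaration.trim());
        if (!matcher.matches()) {
            return false; // not a derived rule with a usage tag - keep it
        }
        return Integer.parseInt(matcher.group(1)) < usageThreshold;
    }
}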

Advantages After choosing an appropriate numerical threshold and employing this technique, the dialogue system's grammars contain only the rules that are frequently matched against the users' spoken utterances during speech recognition. The grammars may even reach the stage where automatic rule learning is no longer needed, as the grammars offer complete domain-specific language coverage.

Disadvantages The biggest disadvantage is the potential improper selection of the numerical threshold. If the threshold is too low, the rule reduction may allow the grammar to accumulate a large number of rules, which may cause performance issues. If the threshold is too high, even frequently used rules may be removed.

Both of the described approaches can be combined and employed together to maximize the efficiency and correctness of the grammar rules reduction. Moreover, other supplementary techniques might be included in the process of the automatic grammar pruning as well.


4 Future Work

The first version of the Coco dialogue system was created as an example application for the purposes of this thesis. All of the provided features were implemented carefully, with meticulous attention to detail, to provide an initial release that is of high quality and as complete as possible. However, multiple aspects of the current version of the dialogue system can be improved, and there is still a considerable number of features that could be included in future releases of the system to produce a more advanced, production-ready dialogue interface. This chapter provides a list of features that may be improved in the future, together with their short descriptions. The list is divided into sections based on the area of improvement.

Dialogue Flow

» One of the proposed future enhancements is an improved searching algorithm used for comparing the content of an input utterance with a corresponding category grammar. Each utterance is searched by the system's NLU unit for the required information based on a particular dialogue segment. The searching implemented in the initial version of the Coco dialogue system is naive and slow and does not employ any advanced approaches. However, the category words contained in each category rule are already arranged in alphabetical order to ease a more efficient searching process that may be implemented in the future, as illustrated by the sketch below.
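A small illustration of how the alphabetical ordering could be exploited, assuming a binary search replaces the current linear scan (the category words below are made up for the example and are not taken from Coco's grammars):

import java.util.Arrays;

public class CategorySearchSketch {
    public static void main(String[] args) {
        // Category words are already stored in alphabetical order in the grammars,
        // which is exactly what Arrays.binarySearch requires.
        String[] categoryWords = {"blouse", "coat", "dress", "jacket", "skirt"};
        boolean found = Arrays.binarySearch(categoryWords, "jacket") >= 0;
        System.out.println(found); // prints: true
    }
}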

Automatic Derivation of Grammar Rules

As the automatic derivation of grammar rules is one of the most valuable abilities of the Coco dialogue system, the list of proposed improvements in this area is the most extensive.

» The accuracy of the statistical language model-based speech recognition employed within the corrective dialogue may be improved. One of the most important enhancements, however, would be an automatic correction of falsely recognized utterances; for example, the inaccurately recognized input "I think I longed seven items." would be automatically altered to "I think I want seven items."

» The rules produced by the Rule Learner may support more JSGF-specific elements, such as weights, the Kleene star and plus operators, references to other rules, and recursion.


» Machine learning approaches may be employed within the Rule Learner for deciding which rule should be altered based on the incoming spoken utterance.

» As already mentioned in section 3.1.3, the currently implemented diff computation occasionally treats single characters as whole words, which causes a faulty rule alteration. This drawback may be overcome in future releases of Coco to improve the overall process of rule modification.

» Automatic tests for verifying the syntactic correctness of each newly added or altered grammar rule may be created.

Automatic Reduction of Grammar Rules

» The technique of automatic removal of unused rules, as described in section 3.2.2, may be implemented.

» A method for the automatic removal of invalid rules, which cause the grammar parsing and the subsequent speech recognition to fail, may be implemented and added as a part of the Rule Pruner component.

Conclusion

The thesis attempts to aptly summarize the area of conversational agents within the complex field of artificial intelligence. Both the overall structure of a conventional dialogue system and the process of building a new conversational agent are thoroughly described. From automatic speech recognition, natural language understanding, and dialogue management, to natural language generation and speech synthesis, each area is gradually introduced together with its commonly employed or currently researched techniques. Where applicable, examples of contemporary dialogue systems and their predecessors, voice standards, available related software libraries, or well-known speech services are provided, which puts the process of building a new dialogue system into a wider context.

The main contribution of the thesis is the creation of an adaptive dialogue system called Coco, which is capable of learning new grammar rules for automatic speech recognition based on past conversations with real users. Making use of such a conversational agent overcomes the most significant drawbacks that grammar-based language models suffer from: the need to write the grammars manually, their insufficient language coverage, and the inability to effectively support large problem domains with wide dictionaries. The built dialogue system thus benefits from both the high speech recognition accuracy specific to grammar-based speech recognizers and the complete independence of the problem domain specific to statistical language model-based speech recognition systems.

Although the original implementation of the dialogue system strove to provide a high-quality initial release, a number of enhancements that may be included in one of the future versions was identified; for example, in addition to employing assorted machine learning approaches to enhance the rule adaptation, automatic tests may be created to continuously guarantee the correctness of the newly learned grammar rules.


Bibliography

1. TURING, Alan M. Computing Machinery and Intelligence. Mind. 1950, vol. 59, no. 236, pp. 433–460.
2. WEIZENBAUM, Joseph. ELIZA—a computer program for the study of natural language communication between man and machine. Communications of the ACM. 1966, vol. 9, no. 1, pp. 36–45.
3. COLBY, Kenneth M.; WEBER, Sylvia; HILF, Franklin D. Artificial Paranoia. Artificial Intelligence. 1971, vol. 2, no. 1, pp. 1–25.
4. DERSCH, William C. Shoebox—A voice responsive machine. Datamation. 1962, vol. 8, pp. 47–50.
5. CHEYER, Adam; GRUBER, Tom. Siri: A Virtual Personal Assistant: An Ontology-driven Application for the Masses [presentation available at Open, International, Virtual Community of Practice: on Ontology, Ontological Engineering and Semantic Technology]. 2010.
6. LECUN, Yann; BENGIO, Yoshua. Convolutional Networks for Images, Speech, and Time-Series. In: The Handbook of Brain Theory and Neural Networks. MIT Press, 2003, pp. 276–279.
7. LEVY, Steven. The Brain Is Here—And It's Already Inside Your Phone [online]. WIRED, 2014 [visited on 2018-03-20]. Available from: https://www.wired.com/2016/08/an-exclusive-look-at-how-ai-and-machine-learning-work-at-apple.
8. WANG, David; ADAMS, Larry. Start shopping with the Google Assistant on Google Home [online]. Google Blog, 2017 [visited on 2018-03-21]. Available from: https://blog.google/products/home/start-shopping-google-assistant-google-home.
9. CAVEDON, Lawrence; WENG, Fuliang; MISHRA, Rohit, et al. Developing a Conversational In-Car Dialog System. In: 12th World Congress on Intelligent Transport Systems. San Francisco, California, 2005.
10. LESTER, James; BRANTING, Karl; MOTT, Bradford. Conversational Agents. In: The Practical Handbook of Internet Computing. Chapman and Hall/CRC, 2004, pp. 220–240.
11. HEILIG, Morton Leonard. Sensorama Simulator. 1962. US Patent 3,050,870.
12. PSOTKA, Joseph. Immersive training systems: Virtual reality and education and training. Instructional Science. 1995, vol. 23, no. 5–6, pp. 405–431.


13. MANNONE, Sarah. A New Frontier For Agencies: Immersive Training [online]. Forbes, 2017 [visited on 2018-03-23]. Available from: https://www.forbes.com/sites/forbesagencycouncil/2017/08/29/a-new-frontier-for-agencies-immersive-training.
14. SMITH, Craig. 17 Amazing Duolingo Facts and Statistics [online]. DMR, 2017 [visited on 2018-03-26]. Available from: https://expandedramblings.com/index.php/duolingo-facts-statistics.
15. HACKER, Severin Benedict Hans. Duolingo: Learning a Language while Translating the Web. 2014. PhD thesis. School of Computer Science, Carnegie Mellon University.
16. RAUX, Antoine; ESKENAZI, Maxine. Using Task-Oriented Spoken Dialogue Systems for Language Learning: Potential, Practical Applications and Challenges. In: InSTIL/ICALL 2004 Symposium on Computer Assisted Learning. Venice, Italy, 2004, pp. 147–150.
17. LITMAN, Diane J.; SILLIMAN, Scott. ITSPOKE: An Intelligent Tutoring Spoken Dialogue System. In: Proceedings of the Human Language Technology Conference: Third Meeting of the North American Chapter of the Association of Computational Linguistics. Edmonton, Canada, 2004, pp. 52–54.
18. BICKMORE, Timothy W.; SCHULMAN, Daniel; SIDNER, Candace L. A reusable framework for health counseling dialogue systems based on a behavioral medicine ontology. Journal of Biomedical Informatics. 2011, vol. 44, no. 2, pp. 183–197.
19. MORBINI, Fabrizio; FORBELL, Eric; DEVAULT, David, et al. A Mixed-Initiative Conversational Dialogue System for Healthcare. In: Proceedings of the 13th Annual Meeting of the Special Interest Group on Discourse and Dialogue. 2012, pp. 137–139.
20. BARKER, DJ; SIMPSON, DS; CORBETT, WA, et al. Evaluating a spoken dialogue system for recording clinical observations during an endoscopic examination. Medical Informatics and the Internet in Medicine. 2003, vol. 28, no. 2, pp. 85–97.
21. PERERA, Vittorio; CHUNG, Tagyoung; KOLLAR, Thomas; STRUBELL, Emma. Multi-Task Learning for parsing the Alexa Meaning Representation Language. 2018.
22. ARTZI, Yoav; LEE, Kenton; ZETTLEMOYER, Luke. Broad-coverage CCG Semantic Parsing with AMR. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. 2015, pp. 1699–1710.


23. LIANG, Percy; JORDAN, Michael I.; KLEIN, Dan. Learning Dependency-Based Compositional Semantics. Computational Linguistics. 2013, vol. 39, no. 2, pp. 389–446.
24. POON, Hoifung; DOMINGOS, Pedro. Unsupervised Semantic Parsing. In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing. 2009, pp. 1–10.
25. MONROE, Will; WANG, Yushi. Dependency Parsing Features for Semantic Parsing. 2014.
26. ERK, Katrin; PADÓ, Sebastian. A Structured Vector Space Model for Word Meaning in Context. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing. 2008, pp. 897–906.
27. FANG, Hao; CHENG, Hao; CLARK, Elizabeth, et al. Sounding Board – University of Washington's Alexa Prize Submission. 2017.
28. PICHL, Jan; MAREK, Petr; KONRÁD, Jakub, et al. Alquist: The Alexa Prize Socialbot. 2017.
29. CHEYER, Adam; GUZZONI, Didier. Method and apparatus for building an intelligent automated assistant. 2014. US Patent 8,677,377.
30. HOCHREITER, Sepp. Untersuchungen zu dynamischen neuronalen Netzen. 1991. Master's thesis. Institut für Informatik, Technische Universität, München.
31. HOCHREITER, Sepp; SCHMIDHUBER, Jürgen. Long Short-Term Memory. Neural Computation. 1997, vol. 9, no. 8, pp. 1735–1780.
32. WEN, Tsung-Hsien; GAŠIĆ, Milica; MRKŠIĆ, Nikola, et al. Semantically Conditioned LSTM-based Natural Language Generation for Spoken Dialogue Systems. arXiv preprint arXiv:1508.01745. 2015.
33. eBay ShopBot [online] [visited on 2018-04-02]. Available from: https://shopbot.ebay.com.
34. IDOUGHI, Djilali; SEFFAH, Ahmed; KOLSKI, Christophe. Adding user experience into the interactive service design loop: a persona-based approach. Behaviour & Information Technology. 2012, vol. 31, no. 3, pp. 287–303.
35. World Wide Web Consortium (W3C) [online] [visited on 2018-04-06]. Available from: https://www.w3.org.
36. World Wide Web Consortium Process Document [online]. W3C, 2018 [visited on 2018-04-06]. Available from: https://www.w3.org/2018/Process-20180201/.


37. Voice Extensible Markup Language (VoiceXML) 2.1 [online]. W3C, 2007 [visited on 2018-04-06]. Available from: https://www.w3.org/TR/voicexml21.
38. Voice Extensible Markup Language (VoiceXML) Version 2.0 [online]. W3C, 2004 [visited on 2018-04-06]. Available from: https://www.w3.org/TR/voicexml20.
39. JVoiceXML - The Open Source VoiceXML Interpreter [online] [visited on 2018-04-06]. Available from: http://jvoicexml.sourceforge.net.
40. The Voice Browser Working Group [online]. W3C, 2015 [visited on 2018-04-06]. Available from: https://www.w3.org/Voice.
41. Speech Recognition Grammar Specification Version 1.0 [online]. W3C, 2004 [visited on 2018-04-06]. Available from: https://www.w3.org/TR/speech-grammar.
42. Semantic Interpretation for Speech Recognition (SISR) Version 1.0 [online]. W3C, 2007 [visited on 2018-04-06]. Available from: https://www.w3.org/TR/semantic-interpretation.
43. Speech Synthesis Markup Language (SSML) Version 1.1 [online]. W3C, 2010 [visited on 2018-04-06]. Available from: https://www.w3.org/TR/speech-synthesis11.
44. Pronunciation Lexicon Specification (PLS) Version 1.0 [online]. W3C, 2008 [visited on 2018-04-06]. Available from: https://www.w3.org/TR/pronunciation-lexicon.
45. Standard ECMA-262: ECMAScript 2017 Language Specification [online]. Ecma International, 2017 [visited on 2018-04-07]. Available from: https://www.ecma-international.org/publications/standards/Ecma-262.htm.
46. Voice Extensible Markup Language (VoiceXML) 3.0 [online]. W3C, 2010 [visited on 2018-04-07]. Available from: https://www.w3.org/TR/voicexml30.
47. Voice Extensible Markup Language (VoiceXML) 3.0 Requirements [online]. W3C, 2008 [visited on 2018-04-07]. Available from: https://www.w3.org/TR/vxml30reqs.
48. Aspect Prophecy 16 [online]. 2016 [visited on 2018-04-07]. Available from: https://www.aspect.com/globalassets/aspect-prophecy-wp.pdf. Technical report. Aspect Software.


49. LumenVox Automated Speech Recognizer (ASR) [online]. LumenVox, 2017 [visited on 2018-04-07]. Available from: https://www.lumenvox.com/products/speech_engine.
50. KAY, Roger. Behind Apple's Siri Lies Nuance's Speech Recognition [online]. Forbes, 2014 [visited on 2018-04-07]. Available from: https://www.forbes.com/sites/rogerkay/2014/03/24/behind-apples-siri-lies-nuances-speech-recognition.
51. VoxML 1.1 Language Reference [online]. Motorola, Inc., 1999 [visited on 2018-04-08]. Available from: https://www.w3.org/Voice/1999/VoxML.pdf.
52. WANG, Kuansan. SALT: a Spoken Language Interface for Web-Based Multimodal Dialog Systems. In: Seventh International Conference on Spoken Language Processing. 2002, pp. 2241–2244.
53. LISON, Pierre; KENNINGTON, Casey. OpenDial: A Toolkit for Developing Spoken Dialogue Systems with Probabilistic Rules. Proceedings of ACL-2016 System Demonstrations. 2016, pp. 67–72.
54. DAVIS, K. H.; BIDDULPH, R.; BALASHEK, S. Automatic Recognition of Spoken Digits. The Journal of the Acoustical Society of America. 1952, vol. 24, no. 6, pp. 637–642.
55. KLATT, Dennis. Review of the ARPA Speech Understanding Project. In: Readings in Speech Recognition. Elsevier, 1990, pp. 554–575.
56. LOWERRE, Bruce; REDDY, Raj. The HARPY Speech Understanding System. In: Readings in Speech Recognition. Elsevier, 1990, pp. 576–586.
57. REDDY, Raj; ERMAN, Lee D.; FENNELL, Richard D.; NEELY, Richard B. The Hearsay-I Speech Understanding System: An Example of the Recognition Process. IEEE Transactions on Computers. 1976, vol. 25, no. 4, pp. 422–431.
58. BAKER, James K. The DRAGON system–An Overview. IEEE Transactions on Acoustics, Speech, and Signal Processing. 1975, vol. 23, no. 1, pp. 24–29.
59. LAMERE, Paul; KWOK, Philip; GOUVÊA, Evandro, et al. The CMU Sphinx-4 Speech Recognition System. In: IEEE International Conference on Acoustics, Speech and Signal Processing. 2003, pp. 2–5.
60. WALKER, Willie; LAMERE, Paul; KWOK, Philip, et al. Sphinx-4: A Flexible Open Source Framework for Speech Recognition. 2004.


61. MERMELSTEIN, Paul. Distance measures for speech recognition, psychological and instrumental. Pattern Recognition and Artificial Intelligence. 1976, vol. 116, pp. 374–388.
62. HERMANSKY, Hynek. Perceptual linear predictive (PLP) analysis of speech. The Journal of the Acoustical Society of America. 1990, vol. 87, no. 4, pp. 1738–1752.
63. KOEHN, Philipp; HOANG, Hieu; BIRCH, Alexandra, et al. Moses: Open Source Toolkit for Statistical Machine Translation. In: Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions. 2007, pp. 177–180.
64. The CMU Pronouncing Dictionary [online]. Carnegie Mellon University [visited on 2018-04-14]. Available from: http://www.speech.cs.cmu.edu/cgi-bin/cmudict?in=C+M+U+Dictionary.
65. JSpeech Grammar Format [online]. W3C, 2005 [visited on 2018-04-15]. Available from: https://www.w3.org/TR/2000/NOTE-jsgf-20000605.
66. DUDLEY, Homer; RIESZ, R. R.; WATKINS, S. S. A. A Synthetic Speaker. Journal of the Franklin Institute. 1939, vol. 227, no. 6, pp. 739–764.
67. DUDLEY, Homer. The Vocoder. The Bell System Technical Journal. 1939, vol. 18, pp. 347–351.
68. SCHRÖDER, Marc; TROUVAIN, Jürgen. The German Text-to-Speech Synthesis System Mary: A Tool for Research, Development and Teaching. International Journal of Speech Technology. 2003, vol. 6, no. 4, pp. 365–377.
69. SPROAT, Richard; HUNT, Andrew; OSTENDORF, Mari, et al. SABLE: A Standard for TTS Markup. In: The Third ESCA/COCOSDA Workshop (ETRW) on Speech Synthesis. 1998, pp. 27–30.
70. SKUT, Wojciech; BRANTS, Thorsten. Chunk Tagger. In: ESSLLI-98 Workshop on Automated Acquisition of Syntax and Parsing. 1998.
71. DUTOIT, Thierry; PAGEL, Vincent; PIERRET, Nicolas, et al. The MBROLA project: towards a set of high quality speech synthesizers free of use for non commercial purposes. In: Fourth International Conference on Spoken Language. 1996, vol. 3, pp. 1393–1396.
72. BASIRON, Halizah Binti. Corrective Feedback in Dialogue-based Computer Assisted Language Learning. Proc. NZCSRSC. 2008, pp. 192–195.
73. The Diff, Match, and Patch libraries [online]. GitHub, 2018 [visited on 2018-05-11]. Available from: https://github.com/google/diff-match-patch.


74. Google Docs [online]. Google [visited on 2018-05-11]. Available from: https://www.google.com/intl/en-GB/docs/about.
75. MYERS, Eugene W. An O(ND) Difference Algorithm and Its Variations. Algorithmica. 1986, vol. 1, no. 1–4, pp. 251–266.
76. JSFG Grammar Rules Parser Library for Sphinx4 [online]. GitHub, 2017 [visited on 2018-05-12]. Available from: https://github.com/goxr3plus/JSFG-Grammar-Rules-Parser-Library-for-Sphinx4-5.


A Running Coco

This chapter provides information about the Coco software distribution and instructions for executing the system. For details about developing and building the dialogue system, see the following appendix B.

A.1 Software Distribution

The software distribution of the Coco dialogue system is an integral part of this thesis. Coco is distributed as the coco.zip archive, which includes the following files:

coco.bat A plain text batch file containing a command with a series of parameters, which are necessary for a proper execution of Coco in the command-line interface of the Microsoft Windows operating system.

coco.sh A plain text Bash script, which contains a command with a series of parameters that are necessary for a proper execution of Coco in the command-line interface of a Linux-based operating system.

coco-1.0.jar An executable Java archive with a collection of classes and libraries of the dialogue system. This archive is not intended for independent execution; one of the above scripts must be used instead.

coco-resources A directory that includes the resources necessary for a proper functioning of Coco. The provided resources are divided into three categories: logging, speech recognition, and speech synthesis. The logging directory contains a property file that filters out unnecessary logging messages during a conversation with the dialogue system, the speech recognition directory contains the acoustic model, dictionary, statistical language model, and all the Coco-specific grammars, and the speech synthesis directory contains the DFKI Prudence voice.

LICENSE A file containing the licence agreement.

README A text file containing basic information about the Coco dialogue system and how to properly execute the program.

A.2 Execution Instructions

Mandatory prerequisite: Before executing Coco, make sure that Java 8 is installed on the computer.


Coco is executable from the command-line interface and does not require an Internet connection. The software distribution of Coco includes both a batch file and a Bash script, which contain a series of commands for a simple and straightforward execution of the dialogue system. To work properly, Coco must be executed from the same directory where the distribution archive was unzipped and where all the available resources are located.

IMPORTANT: For a pleasant experience, make sure your microphone is turned on only when you are actually talking. This will prevent background noise from disrupting the speech recognition.

A.2.1 Linux-Based Operating Systems

To execute the dialogue system, follow the instructions below.

1. Download the provided coco.zip archive and unzip it into a directory of choice. Do not alter or move any files.

2. In the command-line interface, navigate to the directory that contains the unzipped distribution.

3. Run the following command: ./coco.sh

4. If the previous command fails, run: chmod +x ./coco.sh and try step 3 again.

To conclude the conversation at any time, press Ctrl + C.

A.2.2 Microsoft Windows

The Coco dialogue system was created and tested on Linux-based operating systems. Coco's implementation does not guarantee proper functioning on Microsoft Windows operating systems.

1. Download the provided coco.zip archive and unzip it into a directory of choice. Do not alter or move any files.

2. In the command-line interface, navigate to the directory that contains the unzipped distribution.

3. Run the following command: coco

B Contributing to Coco

This chapter provides the information and procedures necessary for setting up the appropriate environment for developing the Coco dialogue system, together with the system’s building details.

B.1 Developing Coco

All of Coco's source code is publicly available on GitHub. To be able to work with the system effectively, a basic understanding of Git is necessary; however, as many other suitable resources are available, the procedures below do not provide any Git-related information. To start developing Coco, the following steps must be taken:

1. Clone the public Coco repository available at the following address: https://github.com/kkufova/Coco.git

2. Import the cloned repository into a preferred IDE. In this procedure, IntelliJ IDEA is considered.

3. Download the DFKI Prudence voice.¹ The archive is also available in the coco-resources/speech-synthesis directory of the coco.zip distribution. As the voice is an integral part of the dialogue system, it is not recommended to skip this step.

4. Unzip the downloaded voice into the following directory of Coco: src/main/resources/speech-synthesis

5. In the preferred IDE, add the voice-dfki-prudence-5.2.jar file (available in the unzipped archive with the downloaded voice) as a new external library. In IntelliJ IDEA, click File » Project Structure » Modules » Dependencies » Add.

6. Set the mary.base property as a VM option in the IDE project configuration settings: -Dmary.base=PATH_TO_COCO/src/main/resources/speech-synthesis

7. If you want to run Coco directly in the IDE, uncomment the commented paths (and comment the current paths) in the Paths.java class.

1. https://github.com/marytts/voice-dfki-prudence/releases/download/v5.2/voice-dfki-prudence-5.2.zip

B.2 Building Coco

Follow the procedure below to build the Coco dialogue system locally.

1. Make sure the Paths.java class is not modified and contains the default originally uncommented paths.

2. Install the DFKI Prudence voice using the following command:

mvn install:install-file
    -Dfile=PATH_TO_COCO/src/main/resources/speech-synthesis/lib/voice-dfki-prudence-5.2.jar
    -DgroupId=de.dfki.mary -DartifactId=voice-dfki-prudence
    -Dversion=5.2 -Dpackaging=jar

3. In the directory containing the Coco dialogue system, execute the following command: mvn clean package

4. In the recently generated target folder, create a new directory called coco-resources.

5. Copy the content of the src/main/resources/ directory into the newly created folder coco-resources.

6. In the target folder, execute the following command:

java -Dmary.base=./coco-resources/speech-synthesis -jar coco-1.0-SNAPSHOT.jar

C Example Dialogue with Coco

Below is a complete example conversation between the Coco dialogue system and a made-up customer.

Coco: Hello! Welcome to the clothes and shoes ordering system. My name is Coco and I will guide you through the whole experience. Excuse me if I do not understand everything you say, I am still learning! Firstly, please say your name.
You: Good morning, my name is Rosie Elizabeth.
Coco: Hello Rosie Elizabeth!
Coco: Have you already picked the goods you would like to order?
You: Yes, I have.
Coco: That is great! Let me prepare a new order form for you.
Coco: OK! How many items would you like to order in total?
» An unknown utterance has been recognized! «
Coco: I am sorry, but I do not understand what you have said just yet. Could you please repeat your answer to help me learn?
You: I think I have chosen to the items.
Coco: Did I correctly hear what you said?
You: No, you didn't.
Coco: I am sorry. Could you please say it again?
You: I think I have chosen to the items.
Coco: It looks like I cannot understand you at all. Could you please write it down for me?
» i think i have chosen two items
Coco: Thank you for helping me learn! Now, let's continue.
Coco: Did you say that the number of items you would like to order is two?
You: Yes, I did.
Coco: Great! Together we will now order your two items.
Coco: Now, I need you to tell me more about the first item. Is the first item from the category clothes, shoes, or accessories?
You: The first item is from the category clothes.
Coco: Could you please specify the type of the item?
You: It's a purple leather jacket.
Coco: What size would you like your jacket to be?
You: In size XS.


Coco: Could you now kindly dictate the four-digit item number that is available at the top of the page?
You: Fifty eight sixty.
Coco: Could you please repeat the item number? For example: one two three four.
You: Five eight six oh.
Coco: Did you say that the item number is five eight six oh?
You: Yes.
Coco: Thank you! Now, I need you to tell me more about the second item. Is the second item from the category clothes, shoes, or accessories?
You: The second item is from the category shoes.
Coco: Could you please specify the type of the item?
» An unknown utterance has been recognized! «
Coco: I am sorry, but I do not understand what you have said just yet. Could you please repeat your answer to help me learn?
You: Blue sandals.
Coco: Thank you for helping me learn! Now, let's continue.
Coco: What size would you like your sandals to be?
You: In size four.
Coco: Could you now kindly dictate the four-digit item number that is available at the top of the page?
You: Six seven nine five.
Coco: Did you say that the item number is six seven nine five?
You: Yes, I did.
Coco: Thank you! Now, below is the list of all the ordered items. Please review the list of items before we proceed.

ITEM NO.   CATEGORY   TYPE      SIZE   NUMBER
1.         clothes    jacket    XS     5860
2.         shoes      sandals   4      6795

Coco: Do you want me to read the list aloud for you?
You: No.
Coco: Is everything correct?
You: Yes.
Coco: Great! We will now proceed to the checkout.
Coco: The total price is 165 pounds. Is it all right?


You: Yes, it is.
Coco: Great! Now I need to find out a few more details before we can finish the order. Would you like a free standard delivery or a premium next day delivery?
You: How much does the premium next day delivery cost?
Coco: The premium delivery costs 5 pounds. Is it all right?
You: No.
Coco: OK, I will send the package for free as a standard delivery.
Coco: Now I need to get all the shipping information. Would you mind filling out the following form?
Coco: I already know your first name. Is it correct?
» First name: Rosie Elizabeth
You: Yes, it is.
Coco: OK! Please fill in your last name.
» Last name: Bloom
Coco: Now, please fill in the street and the house number.
» Address: Upper Phillimore Gardens 17
Coco: Please fill in the city.
» City: London
Coco: What is the postcode?
» Postcode: W8 7HF
Coco: I am assuming that you live in the United Kingdom. Is it correct?
» Country: United Kingdom
You: Yes.
Coco: OK! Now, please fill in your email address. Please make sure it is correct.
» Email address: [email protected]
Coco: What is your phone number? Make sure you do not type any non-numeric values. You can leave this field empty. To do so, just press Enter.
» Phone number: + (020) 5377 9021
Coco: Thank you! Now, please review all the shipping information below. Do you want me to read the entered information aloud for you?
You: Nope.


First name: Rosie Elizabeth
Last name: Bloom
Address: Upper Phillimore Gardens 17
City: London
Postcode: W8 7HF
Country: United Kingdom
Email: [email protected]
Phone: +(020) 5377 9021

Coco: Is everything correct?
You: Yes.
Coco: Great! And this is all the information I need to place your order. The order details are being sent to your email address as we speak. We will keep you informed until the ordered items arrive. Is there anything else I can do for you?
You: Yes, there is.
Coco: What is it?
You: I would like to place another order.
Coco: OK, I will make it happen.
Coco: Goodbye Rosie Elizabeth! I hope to speak with you again soon!
