With One Voice: Composing a Travel Voice Assistant from Re-purposed Models

Shachaf Poran ([email protected]), Gil Amsalem ([email protected]), Amit Beka ([email protected]), Dmitri Goldenberg ([email protected])
Booking.com, Tel Aviv

ABSTRACT
Voice assistants provide users a new way of interacting with digital products, allowing them to retrieve information and complete tasks with an increased sense of control and flexibility. Such products are comprised of several machine learning models, like Speech-to-Text transcription, Named Entity Recognition and Resolution, and Text Classification. Building a voice assistant from scratch takes the prolonged efforts of several teams constructing numerous models and orchestrating between components. Alternatives such as using third-party vendors or re-purposing existing models may be considered to shorten time-to-market and development costs. However, each option has its benefits and drawbacks. We present key insights from building a voice search assistant for the Booking.com search and recommendation system. Our paper compares the achieved performance and development efforts of dedicated tailor-made solutions against existing re-purposed models. We share and discuss our data-driven decisions about implementation trade-offs and their estimated outcomes in hindsight, showing that a fully functional machine learning product can be built from existing models.

KEYWORDS
Voice, Search, Recommendation, Machine Learning Architecture

1 INTRODUCTION
Voice assistants have become a prevailing mode of communication between customers and companies [16, 23]. Today you can pick up your smart device and utter a request or a command, and the device complies, something we would not have dreamt of in the past. The most appealing aspect of this feature is the transfer of touch and typing interfaces into spoken commands, conveying your request in free language and making the action easy to perform and almost instantaneous. For example, you can simply ask a question rather than navigate a verbose FAQ page, or you can use the voice interface when you have limited hand dexterity [6]. Using voice assistants in search and recommendation tasks serves various customer expectations and needs [9].

Introducing a free-form speech input allows customers to generate unstructured queries, resulting in a complex input to the search and recommendation systems [17]. The unstructured form of natural language also allows users to explore different options in their apps that would otherwise be hidden for the sake of simplicity of the graphical user interface. The user would have to reach these options using buttons and menus that involve more attention and more steps to progress through [10].

A voice assistant relies on a function v : U → A that maps an utterance u ∈ U provided by the user to an action a ∈ A which can be performed by the app, aiming to fulfill the user's intent as presented in the utterance.¹ An example of such a mapping in the groceries domain might be:

v(we are out of milk) = place order for milk

and in the travel domain (which is more relevant in our case) it might be:

v(I need to book a hotel in Paris) = present a list of hotels in Paris

The actions taken by the Booking.com app may include searching for accommodation, seeking travel inspiration, asking for help, amending existing bookings, recommending a property, etc.

¹ Conversational assistants may have additional context, which is out of scope for this paper.
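To ground the notation in code, here is a toy rendering of v; the string matching and action strings are hypothetical illustrations, not the production mapping:

```python
# A toy, purely illustrative v : U -> A. A real assistant derives the
# action from learned intent and entity models, not string matching.
def v(utterance: str) -> str:
    text = utterance.lower()
    if "hotel in" in text:
        destination = text.split("hotel in")[-1].strip()
        return f"present a list of hotels in {destination.title()}"
    if "out of" in text:
        item = text.split("out of")[-1].strip()
        return f"place order for {item}"
    return "ask the user for clarification"

print(v("I need to book a hotel in Paris"))
# present a list of hotels in Paris
```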
The function v may be seen as a chain of auxiliary functions, starting with transforming the raw voice input to text. Only then is the natural language processed to extract the intent of the user and the entities mentioned in the text [7, 32]. Eventually, a decision is made about which action to perform. In practice, the former two steps are realized using machine learning models.

In creating these machine learning elements, there is a point of decision about how the research and development teams implement them [29]. Options include, but are not limited to, those shown in Figure 1.

Figure 1: Different options for implementing machine learning models and the trade-off between effort of implementation and specificity of the resulting solution.

Each of these options entails implicit costs, whether monetary, development time, or how well the results fit the business needs. The options on the left are more lengthy and costly in development time, but on the other hand they should also result in more specialized models [29]. These costs are difficult to estimate in advance and might vary widely depending on the kind of problem to be solved, existing expertise in the workforce, and demand for high accuracy metrics for the models.

Moreover, recent work in the online travel domain has shown that an improvement in an offline metric does not necessarily reflect business impact, and requires online validation via a randomized controlled experiment [4, 11]. At the same time, orchestrating a cascade of machine learning models requires a supporting software system designed to allow a combination of business logic with ML-driven decisions [30].

Another concern when choosing one of these options over another revolves around domain-specific data and knowledge [10]. The three leftmost options in Figure 1 require having data available for training and evaluation, while the other two do not. Having the same distribution of data when training a model and when using it for inference is considered good practice, and a significant mismatch between the two might lead to accuracy metrics being irrelevant. Knowledge of these distributions in advance might in some cases lead to using different modeling techniques and achieving better performance.

Constructing a voice assistant usually requires a complex architecture and a generous investment in research and development [12, 18]. At the same time, re-purposing existing ML models towards new applications [26] has become a popular solution for various product needs. We suggest adopting a well-known software reuse paradigm [8], which allows achieving high quality and reducing development time [22], by re-purposing existing machine learning components or by using external third-party off-the-shelf services [5, 27]. In this paper we share insights regarding these challenges and how decisions were made in the process of developing a mobile voice assistant (see Figure 2 for an overview of the product flow).²

Figure 2: Voice assistant flow, screens from left to right: Index entry point, Search prompt, Input Query, and Search Results.

Our key contributions are evidence-based comparisons of dedicated tailor-made solutions against re-purposing existing models for different machine learning tasks. The paper demonstrates how to overcome the lack of in-domain data and how to compose a machine learning product without training new models, all the while not compromising potential impact. We share and discuss our data-driven decisions about implementation trade-offs and their estimated outcomes in hindsight by examining the four main components of the system.

Voice assistant systems are composed of a voice-to-text element followed by a language understanding element [31]. The ML pipeline we developed is summarized in the flowchart shown in Figure 3. The voice-to-text (VTT) system is discussed in Section 2. For our use case we chose to construct the language understanding element with three steps: Machine Translation (MT), Named Entity Resolution (NER), and Text Classification (TC). These steps are discussed in Section 3, Section 4, and Section 5 respectively. The output of the last element is fed into a downstream recommender system [24]. Section 6 concludes our findings and discusses opportunities for future research.

² The app is available at https://www.booking.com/apps.en-gb.html

Figure 3: Overview of the voice assistant architecture. The Voice to Text component feeds a Language Understanding stage composed of Machine Translation, Named Entity Resolution, and Text Classification.
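Read as code, the flow of Figure 3 is a composition of four components. The sketch below is illustrative only; the component interfaces (`transcribe`, `translate`, `resolve_entities`, `classify_intent`) are hypothetical placeholders standing in for the actual models and services discussed in the following sections:

```python
from dataclasses import dataclass

# Hypothetical component interfaces; in practice each step wraps a
# separate model or service (TPV, MT service, NER API, classifier).
@dataclass
class Understanding:
    intent: str
    entities: dict

def transcribe(waveform: bytes) -> str: ...      # Voice to Text
def translate(text: str) -> str: ...             # any language -> English
def resolve_entities(text: str) -> dict: ...     # e.g. {"destination": "Amsterdam, NL"}
def classify_intent(text: str) -> str: ...       # e.g. "search_accommodation"

def understand(waveform: bytes) -> Understanding:
    """Chain the pipeline of Figure 3 over a single utterance."""
    text = translate(transcribe(waveform))
    return Understanding(
        intent=classify_intent(text),
        entities=resolve_entities(text),
    )
```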

2 VOICE-TO-TEXT
The first ML-driven element of the pipeline takes an utterance as a waveform as input and outputs transcribed text for the uttered speech. It is worthwhile discerning how the distribution of inputs may vary between domains, as it may determine the performance of pre-trained models versus models that were trained on in-domain data. For example:
(1) Sampling rate values may be either 8KHz or 16KHz.
(2) Background noises such as mechanical hum and vehicle noise for car assistants, colleagues chatter for call-centers, etc.
(3) Accents such as differences in American vs British English.
(4) Word distribution differs in different contexts.
Item 4 in the list is especially relevant, as VTT systems use statistics of words and n-grams to decide their likelihood. Different domains may exhibit differences in word frequencies that affect accuracy. Even when disregarding domains, different dialects may have a similar effect.

We evaluated two options for the VTT element: one is the open-source framework Kaldi [28], which comes out-of-the-box with ready-made models and tools to tweak them, and the other is a third-party vendor (TPV). Prior comparisons between Kaldi and an off-the-shelf third-party vendor tool [20] have shown higher accuracy for Kaldi when testing on in-domain data and when the signal-to-noise ratio is high.

Developing any model without data generated from the end product produces a classical "chicken or the egg" problem, since we cannot infer the data distribution. A common practice in this scenario is to use data from an adjacent domain or product to train models. We obtained recordings from customer-service conversations for bootstrapping. Using an annotation tool built in-house, we collected ground-truth transcriptions for these conversations and used them to compare the different models. The metric we used was Word Error Rate (WER) [25, 33], defined as the edit distance between ground truth and predicted sentences normalized by the length of the ground truth. This is a common metric used in voice-to-text systems.
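For concreteness, the WER defined above can be computed with a standard word-level edit distance; a minimal sketch, not the evaluation code used for the paper:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance (substitutions,
    insertions, deletions) divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution in a four-word reference gives WER = 0.25.
print(wer("i want to cancel", "i want to consolidate"))  # 0.25
```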
Both the TPV and Kaldi allow for tweaking of their models: the former receives a set of hint phrases that may appear in the utterance, and the latter allows fine-tuning modular steps in its VTT pipeline, including an acoustic model, a lexicon, and a language model. We tweaked both alternatives using our adjacent-domain data to achieve the lowest WER we could with either. Kaldi's out-of-the-box model achieved 45.01% WER, compared to 25.25% WER by the TPV. The effort to tweak the Kaldi model resulted in 28.99% WER, resembling similar comparisons with open-access datasets [20]. Tweaking the TPV resulted in a negligible boost in performance. At this point, a decision based on currently-available data was made to use the TPV and defer any additional Kaldi development indefinitely.

After releasing the voice assistant feature, real-world data was gathered and annotated, and the two models were reevaluated based on it. Table 1 reports both evaluations, showing that performance is better for the Kaldi model on utterances taken directly from the product. The same table presents error rates for specific words in the text, explaining some of the difference in performance between the two datasets by the higher abundance of these domain-specific words in the latter. Table 2 shows common errors by the TPV that were transcribed accurately by Kaldi.

Table 1: Comparison of WER for the TPV vs. the Kaldi model on the adjacent-domain and in-domain data-sets.

Data source     Error            TPV               Kaldi
Conversations   All              25.25%            28.99%
App commands    All words        45.24%            38.68%
App commands    "booking"        198/415 (47.7%)   31/415 (7.5%)
App commands    "cancellation"   46/108 (42.6%)    23/108 (21.3%)

Table 2: Examples of domain-specific errors from the TPV which the Kaldi model got correct.

TPV                                      Correct
Contact hotel for registration details   reservation
Can I have the information               confirmation
Consolation                              cancellation

3 MACHINE TRANSLATION
The work described in Section 2 focused on English. When expanding to new markets, the voice assistant is expected to support local languages. Every new language once again faces the same problems already discussed in the previous section, and the time and effort to create the relevant models do not scale well, as practically all stages must be repeated, including data collection and annotation. Using the TPV allowed us to transcribe numerous languages easily, but the downstream models were all trained on English inputs. The lack of multilingual training data presented a serious hold-back, which led us to translate the transcriptions before passing them forward [13].

An in-house team has been developing in-domain translation models, described in [21]. These models showed consistent results independent of sentence length, which hints that using them for our use case is acceptable. We easily interfaced with their service and served multiple languages with nearly zero effort.

The short time needed to enable new languages has proven essential for testing new markets. Aside from model performance, which may differ for each language, user habits around voice assistants vary with country and culture. Presenting the product to users was key to understanding product-market fit [2, 3].

4 NAMED ENTITY RESOLUTION
Named Entity Recognition (NER) is a Natural Language Processing (NLP) task that asks, about a word or a sequence of words, whether it is "a person, a place or a thing". Named Entity Resolution is a task that asks "what person, place, or thing is it". In our context, resolution matches a recognized entity to a closed set of destinations such as countries, cities, villages, and hotels. Any human hearing the utterance "I'm looking for a hotel in Amsterdam" will assume the speaker intends to visit the Dutch city. However, we are aware that there are different Amsterdams in the United States, South Africa, and elsewhere. Furthermore, we expect two entities to be resolved when we fulfill searches for flights: both origin and destination.

Entity Resolution is a highly specialized task involving annotation of a substantial amount of in-domain data for both the recognition and resolution sub-tasks [14]. This task is essential for a voice assistant in the travel domain. However, anything other than using a ready-made solution would be infeasible and would delay deployment of the product for a long time.

An in-house team has been developing an Entity Resolution API based on the FLAIR framework [1] for use in a chat-bot product. By the time we came to use it for the voice assistant, it was at near-SOTA performance, with more than 90% F1 score on the recognition task. We performed a qualitative inspection and interfaced with the API. This accelerated our time-to-market, allowing us to present the finalized product to our users quickly.
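The recognition half of this task can be reproduced with FLAIR's off-the-shelf tagger. In the sketch below, the resolution step, a lookup against a hypothetical closed set of destinations, stands in for the in-house API, which is not public:

```python
from flair.data import Sentence
from flair.models import SequenceTagger

# Off-the-shelf FLAIR NER tagger; the in-house API builds on this framework.
tagger = SequenceTagger.load("ner")

sentence = Sentence("I'm looking for a hotel in Amsterdam")
tagger.predict(sentence)

# Hypothetical closed set of resolvable destinations (illustration only).
DESTINATIONS = {"amsterdam": ["Amsterdam, NL", "Amsterdam, NY, US", "Amsterdam, ZA"]}

for span in sentence.get_spans("ner"):
    if span.tag == "LOC":  # recognition: "Amsterdam" -> LOC
        # Resolution: pick among candidate destinations (here, naively the first).
        candidates = DESTINATIONS.get(span.text.lower(), [])
        print(span.text, "->", candidates[0] if candidates else "unresolved")
```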

5 TEXT CLASSIFICATION
In this step of the pipeline, the text is fed into a multi-class classification model and converted into enumerated cases, to be used by the client to initiate the response appropriate for the user's utterance. Some of the cases were treated as templates and fulfilled with entities resolved by the NER model; for example, searching for an accommodation was fulfilled with the destination.

As a free-form input method, we expected utterances that address both searching for an accommodation to book ("pre-book" intents) and handling existing bookings ("post-book" intents). User surveys confirmed this, with a distribution of 50% pre-book intents, 30% post-book, and the rest other intents such as greetings and nonsensical queries. This revealed that we have two main sub-domains to address when building the text classification. Once again, training any model before collecting any data is not feasible. To allow product development and eventually lead to data collection, we used two different internal models that serve these features:
• Travel Assistant: a text-input interface used to guide users through the FAQ on the site and app. Their NLP model maps text to post-book intents [19].
• Chat bot: the tool described in Section 4. As support to the NER model, it used a different model to decide whether a user wants to book a flight or a hotel (or neither).
Interlacing these models using simple rules allowed us to efficiently serve both pre-book and post-book sub-domains with one client-facing interface. The logic we used for combining the two into a single cohesive product is shown in Figure 4. Simple if-else statements based on the two models result in either an action being taken, such as a flight search being conducted or an FAQ page being opened, or in giving the user feedback within the voice UI element asking for clarification or additional information.

Figure 4: The business logic used to align the different classifiers for the text classification task. Decision nodes: "Is Flight intent?", "Is Hotel Booking intent?", "Is Customer Service valid intent?", "Reached valid fulfillment?"; components: Chat-bot, Travel Assistant, Keywords; outcomes: Apply Action, Display Feedback.
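A toy rendering of that routing logic might look as follows; every helper here is a hypothetical stand-in for one of the re-purposed models, and the real flow in Figure 4 also includes fulfillment checks:

```python
from typing import Optional

# Hypothetical stand-in for the chat-bot's pre-book model.
def chatbot_intent(text: str) -> Optional[str]:
    if "flight" in text: return "flight_search"
    if "hotel" in text: return "hotel_search"
    return None

# Hypothetical stand-in for the Travel Assistant's post-book model.
def travel_assistant_intent(text: str) -> Optional[str]:
    return "cancel_booking" if "cancel" in text else None

KEYWORDS = {"credit": "payment", "coronavirus": "covid_information"}

def route(text: str) -> str:
    """Simple if-else statements combining the two classifiers."""
    intent = chatbot_intent(text)
    if intent:                                  # pre-book: conduct a search
        return f"apply action: {intent}"
    intent = travel_assistant_intent(text)
    if intent:                                  # post-book: open the FAQ page
        return f"apply action: {intent}"
    for word, keyword_intent in KEYWORDS.items():  # exact keyword match
        if word in text.split():
            return f"apply action: {keyword_intent}"
    return "display feedback: ask for clarification"

print(route("i need to book a hotel in paris"))  # apply action: hotel_search
print(route("do you take credit cards"))         # apply action: payment
```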

We complemented the process with an exact keyword search for words we found to be significantly correlated with customer-service intents, such as credit being mapped to the payment intent. coronavirus is yet another example of such a keyword, which forwarded users to an explanation about the COVID-19 situation in regard to their bookings. Keyword matching works exceptionally well for our use case, as the upstream steps filter out most of the other intents.

After the voice assistant feature was made available to customers, we collected data directly from their interactions and annotated it. The intent distribution, excluding 40% of unintelligible utterances, is given in Table 3.

Table 3: Distribution of intents in our annotated data.

Intents                   Prevalence
Pre-book intents          66.9%
Request human agent       8.7%
Check booking status      7.1%
Payments                  3.0%
Change booking            1.9%
Other post-book intents   10.0%
Greetings                 2.4%

We proceeded to build a model to map text directly to intent using the NLP framework spaCy [15].
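A minimal sketch of such an intent classifier with spaCy's `textcat` component is shown below; the labels and the two toy training utterances are illustrative, not our production data:

```python
import spacy
from spacy.training import Example

# Blank English pipeline with a text-classification component.
nlp = spacy.blank("en")
textcat = nlp.add_pipe("textcat")
for label in ("cancel_booking", "change_booking", "payments"):
    textcat.add_label(label)

# Toy training data; the real model was trained on annotated utterances.
train = [
    ("i want to cancel my reservation", "cancel_booking"),
    ("can i pay with a different card", "payments"),
]
examples = [
    Example.from_dict(
        nlp.make_doc(text),
        {"cats": {lab: float(lab == label) for lab in textcat.labels}},
    )
    for text, label in train
]

nlp.initialize(get_examples=lambda: examples)
for _ in range(20):          # a few passes over the toy data
    nlp.update(examples)

doc = nlp("please cancel my booking")
print(max(doc.cats, key=doc.cats.get))  # likely "cancel_booking"
```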
The classification metrics comparing the composite business model to the spaCy model are shown in Table 4. The two options were then tested in a randomized controlled experiment with two groups, each exposed to a different text classifier. Measuring the number of customer-service tasks handled by human representatives for each of the groups confirmed that the spaCy model yields a statistically significant reduction in such human-handled tasks.

Table 4: Per-class precision (p) and recall (r) of the topic classification models on the most common intents.

                  Composite      spaCy
Intent            p      r       p      r
Cancel booking    79%    70%     79%    64%
Change booking    79%    48%     87%    31%
Payments          46%    50%     52%    75%
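For illustration, significance in this kind of two-group comparison can be assessed with a standard two-proportion z-test; the counts below are invented for the example and are not the experiment's data:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(successes_a: int, n_a: int, successes_b: int, n_b: int):
    """Two-sided two-proportion z-test; returns (z, p-value)."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    return z, 2 * (1 - NormalDist().cdf(abs(z)))

# Illustrative counts only: sessions escalated to a human agent per group.
z, p = two_proportion_z(180, 2000, 140, 2000)
print(round(z, 2), round(p, 4))  # z ≈ 2.33, p ≈ 0.02
```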

6 CONCLUSION
A common perception of the Data Scientist's work is that their first order of business is training machine learning models to fit the specific tasks at hand, starting with data gathering and ending in a custom model. Conversely, in the first version of the voice assistant that was released, we did not use any machine learning models custom-made for this product, and none of the models we used were trained on relevant in-domain data. Instead, we composed the product from a chain of ready-made models and services.

Our decisions to do so were motivated by data wherever it was applicable, and by time-to-market considerations otherwise. Though one might argue that the VTT decision was wrong, as the discarded model's performance on in-domain data was better than that of the TPV tool we used, this is a non-issue, since the end product has proved beneficial despite the shortcoming of this element in the chain of models. Moreover, making the product available to our users, which would have been blocked without these ready-made models, is a crucial element in building more accurate models in the future.

Development of the entire end-to-end process took about four months. From the time already spent on developing models for the MT, NER, and TC tasks by other teams, and the time spent on the VTT task and improving on the TC model by our team, we estimate that the development of the same product from scratch would have taken approximately two years if taken up by a single team.

Deploying the voice assistant has benefited the company's business metrics two-fold: by increasing engagement and reservations, and by reducing the work for customer service representatives, as users found the solutions to their problems more easily when using the free-form voice interface.

To conclude our observations, we recommend breaking down complex machine learning architectures into atomic sub-tasks. Contrary to an initial urge to develop a novel tailor-made solution, we found that re-purposing existing solutions can often achieve effective results while saving development and scaling efforts. Moreover, reusable system components drive long-term system alignment and achieve service and organizational synergy.

We invite our peers to be aware of the option of building machine learning-centered products without ever having to train a single model, but rather to save valuable time and effort by using the work already done by peers and colleagues.

ACKNOWLEDGMENTS
We would like to thank Steven Baguley, Nam Pham, Niek Tax, Guy Nadav and David Konopnicki for their feedback during the writing of this paper. We also thank Chalom Asseraf, Noa Barbiro, Rubens Sonntag, Giora Shcherbakov, Dor Samet, Teri Hason and the rest of the contributors to the Voice assistant product at Booking.com.

REFERENCES
[1] Alan Akbik, Tanja Bergmann, Duncan Blythe, Kashif Rasul, Stefan Schweter, and Roland Vollgraf. 2019. FLAIR: An easy-to-use framework for state-of-the-art NLP. In NAACL 2019, 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations). 54–59.
[2] Etienne Barnard, Johan Schalkwyk, Charl Van Heerden, and Pedro J Moreno. 2010. Voice search for development. (2010).
[3] Frank Bentley, Chris Luvogt, Max Silverman, Rushani Wirasinghe, Brooke White, and Danielle Lottridge. 2018. Understanding the long-term use of smart speaker assistants. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 2, 3 (2018), 1–24.
[4] Lucas Bernardi, Themistoklis Mavridis, and Pablo Estevez. 2019. 150 Successful Machine Learning Models: 6 Lessons Learned at Booking.com. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1743–1751.
[5] Markus Borg, Panagiota Chatzipetrou, Krzysztof Wnuk, Emil Alégroth, Tony Gorschek, Efi Papatheocharous, Syed Muhammad Ali Shah, and Jakob Axelsson. 2019. Selecting component sourcing options: a survey of software engineering's broader make-or-buy decisions. Information and Software Technology 112 (2019), 18–34.
[6] Eric Corbett and Astrid Weber. 2016. What Can I Say? Addressing User Experience Challenges of a Mobile Voice User Interface for Accessibility (MobileHCI '16). Association for Computing Machinery, New York, NY, USA, 72–82.
[7] Thierry Desot, François Portet, and Michel Vacher. 2019. Towards End-to-End spoken intent recognition in smart home. In 2019 International Conference on Speech Technology and Human-Computer Dialogue (SpeD). 1–8. https://doi.org/10.1109/SPED.2019.8906584
[8] Yang Fuqing, Mei Hong, and Li Keqin. 1999. Software Reuse and Software Component Technology [J]. Acta Electronica Sinica 2 (1999).
[9] Dmitri Goldenberg, Kostia Kofman, Javier Albert, Sarai Mizrachi, Adam Horowitz, and Irene Teinemaa. 2021. Personalization in Practice: Methods and Applications. In Proceedings of the 14th International Conference on Web Search and Data Mining.
[10] Ido Guy. 2016. Searching by talking: Analysis of voice queries on mobile web search. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. 35–44.
[11] Malay Haldar, Mustafa Abdool, Prashant Ramanathan, Tao Xu, Shulin Yang, Huizhong Duan, Qing Zhang, Nick Barrow-Williams, Bradley C Turnbull, Brendan M Collins, et al. 2019. Applying deep learning to Airbnb search. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1927–1935.
[12] Yanzhang He, Tara N Sainath, Rohit Prabhavalkar, Ian McGraw, Raziel Alvarez, Ding Zhao, David Rybach, Anjuli Kannan, Yonghui Wu, Ruoming Pang, et al. 2019. Streaming end-to-end speech recognition for mobile devices. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 6381–6385.
[13] Georg Heigold, Vincent Vanhoucke, Alan Senior, Patrick Nguyen, Marc'Aurelio Ranzato, Matthieu Devin, and Jeffrey Dean. 2013. Multilingual acoustic models using distributed deep neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 8619–8623.
[14] Johannes Hoffart, Mohamed Amir Yosef, Ilaria Bordino, Hagen Fürstenau, Manfred Pinkal, Marc Spaniol, Bilyana Taneva, Stefan Thater, and Gerhard Weikum. 2011. Robust disambiguation of named entities in text. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing. 782–792.
[15] Matthew Honnibal, Ines Montani, Sofie Van Landeghem, and Adriane Boyd. 2020. spaCy: Industrial-strength Natural Language Processing in Python. https://doi.org/10.5281/zenodo.1212303
[16] Matthew B Hoy. 2018. Alexa, Siri, Cortana, and more: an introduction to voice assistants. Medical Reference Services Quarterly 37, 1 (2018), 81–88.
[17] Jie Kang, Kyle Condiff, Shuo Chang, Joseph A Konstan, Loren Terveen, and F Maxwell Harper. 2017. Understanding how people use natural language to ask for recommendations. In Proceedings of the Eleventh ACM Conference on Recommender Systems. 229–237.
[18] Veton Kepuska and Gamal Bohouta. 2018. Next-generation of virtual personal assistants (Microsoft Cortana, Apple Siri, Amazon Alexa and Google Home). In 2018 IEEE 8th Annual Computing and Communication Workshop and Conference (CCWC). IEEE, 99–103.
[19] Talaat Khalil, Kornel Kiełczewski, Georgios Christos Chouliaras, Amina Keldibek, and Maarten Versteegh. 2019. Cross-lingual intent classification in a low resource industrial setting. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 6420–6425.
[20] Takashi Kimura, Takashi Nose, Shinji Hirooka, Yuya Chiba, and Akinori Ito. 2018. Comparison of speech recognition performance between Kaldi and Google Cloud Speech API. In International Conference on Intelligent Information Hiding and Multimedia Signal Processing. Springer, 109–115.
[21] Pavel Levin, Nishikant Dhanuka, and Maxim Khalilov. 2017. Machine translation at booking.com: Journey and lessons learned. arXiv preprint arXiv:1707.07911 (2017).
[22] Jingyue Li, Anita Gupta, Jon Arvid Borretzen, and Reidar Conradi. 2007. The empirical studies on quality benefits of reusing software components. In 31st Annual International Computer Software and Applications Conference (COMPSAC 2007), Vol. 2. IEEE, 399–402.
[23] Alex Mari. 2019. Voice Commerce: Understanding shopping-related voice assistants and their effect on brands. (2019).
[24] Themis Mavridis, Soraya Hausl, Andrew Mende, and Roberto Pagano. 2020. Beyond algorithms: Ranking at scale at Booking.com. In ComplexRec 2020, Recommendation in Complex Scenarios and the Impact of Recommender Systems.
[25] Sonja Nießen, Franz Josef Och, Gregor Leusch, and Hermann Ney. 2000. An Evaluation Tool for Machine Translation: Fast Evaluation for MT Research. In Proceedings of the Second International Conference on Language Resources and Evaluation (LREC'00). European Language Resources Association (ELRA), Athens, Greece. http://www.lrec-conf.org/proceedings/lrec2000/pdf/278.pdf
[26] Sinno Jialin Pan and Qiang Yang. 2009. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22, 10 (2009), 1345–1359.
[27] Kai Petersen, Deepika Badampudi, Syed Muhammad Ali Shah, Krzysztof Wnuk, Tony Gorschek, Efi Papatheocharous, Jakob Axelsson, Severine Sentilles, Ivica Crnkovic, and Antonio Cicchetti. 2017. Choosing component origins for software intensive systems: In-house, COTS, OSS or outsourcing? A case survey. IEEE Transactions on Software Engineering 44, 3 (2017), 237–261.
[28] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, Jan Silovsky, Georg Stemmer, and Karel Vesely. 2011. The Kaldi Speech Recognition Toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society.
[29] Johan Schalkwyk, Doug Beeferman, Françoise Beaufays, Bill Byrne, Ciprian Chelba, Mike Cohen, Maryam Kamvar, and Brian Strope. 2010. "Your word is my command": Google search by voice: A case study. In Advances in Speech Recognition. Springer, 61–90.
[30] David Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, and Dan Dennison. 2015. Hidden technical debt in machine learning systems. Advances in Neural Information Processing Systems 28 (2015), 2503–2511.
[31] Jessica Van Brummelen, Kevin Weng, Phoebe Lin, and Catherine Yeo. 2020. CONVO: What does conversational programming need? In 2020 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC).
[32] Ye-Yi Wang, Li Deng, and Alex Acero. 2011. Semantic Frame Based Spoken Language Understanding. Wiley, 35–80. https://www.microsoft.com/en-us/research/publication/semantic-frame-based-spoken-language-understanding/
[33] Ye-Yi Wang, A. Acero, and C. Chelba. 2003. Is word error rate a good indicator for spoken language understanding accuracy. In 2003 IEEE Workshop on Automatic Speech Recognition and Understanding.