Voice enabling mobile applications with UIVoice

Ahmad Bisher Tarakji (Samsung Research America), Jian Xu* (Stony Brook University), Juan A. Colmenares† (Samsung Research America), Iqbal Mohomed (Samsung Research America)
*Work done during internship at Samsung Research America. †Work done at Samsung Research America.

EdgeSys'18, June 10–15, 2018, Munich, Germany. https://doi.org/10.1145/3213344.3213353

ABSTRACT

Improvements in cloud-based speech recognition have led to an explosion of voice assistants, available as bespoke devices in the home, in cars, on wearables, or on smartphones. In this paper, we present UIVoice, through which we enable voice assistants (which heavily utilize the cloud) to dynamically interact with mobile applications running at the edge. We present a framework that can be used by third-party developers to easily create Voice User Interfaces (VUIs) on top of existing applications. We demonstrate the feasibility of our approach through a prototype based on Android and Amazon Alexa, describe how we added voice to several popular applications, and provide an initial performance evaluation. We also highlight research challenges that are relevant to the edge computing community.

CCS CONCEPTS

• Human-centered computing → Ubiquitous and mobile computing systems and tools; Interactive systems and tools;

1 INTRODUCTION

Recently there has been great excitement about the possibilities offered by edge computing to support applications in saving bandwidth, providing low-latency experiences to users, preserving user privacy, and so on. In this paper, we describe a novel edge application: extending the capabilities of a voice assistant by performing actions on a user's personal mobile device. Modern voice assistants perform most automatic speech recognition (ASR) and natural language understanding (NLU) functions in the cloud. On the other hand, mobile devices contain significant personal state of users that spans across applications. By giving voice assistants access to the applications and state within mobile devices, we believe that an immense degree of personalization can be achieved. However, this leads to interesting architecture and latency considerations that will be of interest to the edge computing community.

The key mechanism we introduce to tie voice agents with mobile applications is User Interface Automation. That is, we enable developers (any developer or even a power user, not necessarily the one who created the specific applications being automated) to automate UI interactions that a human user would perform on a mobile app, and connect these to voice interactions that an end user would engage in with an agent. This combination lets us "voice enable" applications that were never designed to interface with voice agents. A novel facet of our approach is the ability to have a dialogue with the user in the face of ambiguities.

There are four contributions in this paper: (i) we present the UIVoice system for creating Voice User Interfaces (VUIs) on top of existing voice agents and mobile applications, (ii) we describe a framework that simplifies the task of creating VUIs, (iii) we describe our prototype system, discuss sample VUIs for several popular mobile apps, and present a detailed evaluation, and finally, (iv) we highlight interesting research problems in edge computing that emanate from our work.

In this paper, we start by giving background on voice assistants and how they currently interface with third-party applications (§2). In §3, we introduce our UIVoice system prototype, which works with Amazon's Alexa family of voice agents and the Android OS. We use our prototype to create VUIs for several popular mobile apps, which are described and evaluated in §4. In §5, we discuss various important issues that emerge from our research. We also review related work on automation techniques developed in industry and academia in §6, and the paper concludes in §7.

2 BACKGROUND ON VOICE ASSISTANTS

In recent years, voice-based interactions with computers have been embodied in the persona of voice assistants or agents, which can be bespoke devices in the home, integrated into cars and wearables, or on smartphones. At present, there is no standard software architecture for voice assistants. However, two conceptual operations occur in processing user utterances. First, the user's utterance goes through an automated speech recognition (ASR) module. The key output of this module is an attempted transcription of what the user said: text, albeit still possibly containing speech recognition errors. Next, the recognized text goes through a Natural Language Understanding (NLU) module that attempts to extract the desired user intent. Advanced NLU/NLP may add more complex processing of the user's voice utterance, such as emotion recognition and speaker recognition. ASR and NLU systems that allow programmability typically require the third-party developer to specify a set of user utterances and how to extract commands (subsequently referred to as Intents) and parameters (subsequently referred to as Slots) from them. For instance, "Call my Mom" would initiate a "call phone number" Intent with the Slot "Mom" of type "Contact", from which the assistant should extract the number to call. Of course, there may be multiple such applications active at any given point in time. Ultimately, the job of the speech processing system is to take the user utterance, extract the intent of the user request, and route the set of actions and parameters to the appropriate VUI application.
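To make the Intent and Slot terminology concrete, the sketch below shows how a programmable ASR/NLU backend might let a developer declare an Intent, its Slots, and sample utterances, and how a recognized utterance is then routed to the VUI application that handles it. The schema and names (CALL_INTENT, route_intent, and so on) are illustrative assumptions, not the exact interaction model of Alexa or of UIVoice.

```python
# Illustrative declaration of an Intent with one Slot, in the spirit of the
# programmable ASR/NLU systems described above. The schema is a sketch and
# does not reproduce any particular vendor's interaction model.
CALL_INTENT = {
    "intent": "CallPhoneNumberIntent",       # the command to extract
    "slots": {"contact": "Contact"},         # parameter name -> slot type
    "sample_utterances": [                   # utterances the developer registers
        "call my {contact}",
        "phone {contact}",
        "call {contact}",
    ],
}

def route_intent(nlu_result, handlers):
    """Route a recognized intent and its slot values to the VUI application
    that registered a handler for it (hypothetical dispatch API)."""
    handler = handlers.get(nlu_result["intent"])
    if handler is None:
        raise KeyError("no VUI application handles " + nlu_result["intent"])
    return handler(**nlu_result.get("slots", {}))

# "Call my Mom" -> CallPhoneNumberIntent with slot contact="Mom" of type Contact.
handlers = {"CallPhoneNumberIntent": lambda contact: "dialing the number stored for " + contact}
print(route_intent({"intent": "CallPhoneNumberIntent", "slots": {"contact": "Mom"}}, handlers))
```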
One important thing to note about existing voice assistants from a systems perspective is that the majority of ASR and Intent classification occurs in the cloud (a notable exception is wake-word recognition). Some systems do provide a limited recognition vocabulary on-device (e.g., functions to set alarms or make a call). This limitation is fundamental, because the size of the recognition vocabulary directly impacts the performance of the ASR system (and its resource requirements). The prevalence of cloud-based ASR and NLU leads to latency and other challenges that are relevant to the edge computing community.

A key challenge with existing SDKs from major commercial vendors is that they require mobile application developers to modify their code to support VUIs [2]. Given the large number of applications on major app stores, this is a significant barrier to the use of voice assistants on mobile phones. The approach proposed in this paper allows unmodified applications to be voice enabled.

Creators of VUI-based applications have always had to design good voice interactions, regardless of the mechanism used to implement them. Designers rely on good practices such as VUI patterns [12, 13] and Grice's maxims of conversation [10] as guidelines to more closely approximate human expectations. However, good dialog design is more an art than an exact science [3, 9]. This is an important matter, but it is not tackled by our work. In our system, the creator of the voice-enabled application uses the primitives of the voice agent to define an effective dialog with the end user.

3 UIVOICE

In this section, we present an overview of the UIVoice system and outline its two main steps.

3.1 System Overview

Figure 1: System design.

The in-home Alexa device and the mobile phone are connected to the home network. Since most of the ASR and NLU for Alexa occur in the cloud, this results in communication from the in-home device to the Alexa backend in the cloud. Logic in that tier connects with our UIVoice backend, which is hosted in the cloud. The UIVoice backend also maintains a communication channel to the mobile device running the UIVoice agent. This setup introduces various wide-area latencies that we discuss further during our evaluation.

The results of the first step are two-fold: 1) code and configuration that need to be uploaded to the voice agent system, and 2) instructions that are made ready to be sent to an agent on the mobile device. In the execution step, the UIVoice agent must be running on the user's phone to enable remote interaction with the applications. In our implementation, we also have a backend component that runs on an external server and is responsible for mediating between the user's phone and the voice agent. In this section, we describe all of these components.
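The sketch below illustrates the mediation role of the UIVoice backend described above: it accepts an intent payload produced after ASR/NLU runs in the cloud and forwards a device instruction over a long-lived channel to the UIVoice agent on the phone. The class and method names (UIVoiceBackend, DeviceChannel, handle_intent) and the payload format are hypothetical; the actual backend and its wire protocol are not specified in this excerpt.

```python
import json
import time

class DeviceChannel:
    """Stand-in for the persistent channel the backend keeps open to the
    UIVoice agent on the phone (e.g., a WebSocket or push connection)."""
    def send(self, message: str) -> None:
        # In a real deployment this would write to the connection; here we log it.
        print("-> to phone agent:", message)

class UIVoiceBackend:
    """Hypothetical cloud-side mediator between the voice agent and the phone."""
    def __init__(self, channel: DeviceChannel):
        self.channel = channel

    def handle_intent(self, intent_payload: dict) -> dict:
        """Called by the voice-agent tier once ASR/NLU has run in the cloud.
        Forwards a device instruction and returns text for the agent to speak."""
        sent_at = time.time()
        instruction = {
            "action": "run_interaction_script",    # assumed instruction type
            "script": intent_payload["intent"],     # which VUI script to execute
            "slots": intent_payload.get("slots", {}),
        }
        self.channel.send(json.dumps(instruction))
        # Each hop (home -> Alexa cloud -> UIVoice backend -> phone) adds
        # wide-area latency, which is what the evaluation measures.
        return {"speak": "Okay, working on it.", "sent_at": sent_at}

backend = UIVoiceBackend(DeviceChannel())
print(backend.handle_intent({"intent": "CallPhoneNumberIntent", "slots": {"contact": "Mom"}}))
```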
3.2 VUI Creation Step

In this step, which occurs at development time, a VUI developer (this could be an end user who knows how to program, or some third-party developer) specifies the voice interactions they wish to support and the mapping to an existing mobile application. Both are done by defining an Interaction Script. Conceptually, an interaction script has both user-facing actions (which execute in the context of the voice agent) and on-device actions. On the voice-agent side, the actions are to "listen" to the user (for example, to capture the command the user wishes to execute) and to "speak" to the user, either to "ask" them for input or to "tell" them results.
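As a rough illustration of the Interaction Script concept, the sketch below encodes a script as a sequence of voice-agent-side actions ("listen", "ask", "tell") and on-device actions. The step names, fields, and the example app identifier are assumptions for illustration only; the excerpt does not specify UIVoice's actual script format or how on-device actions map to UI interactions.

```python
# A minimal sketch of an Interaction Script, assuming a simple declarative
# encoding. Step names and fields are illustrative, not UIVoice's actual format.
EXAMPLE_SCRIPT = [
    {"where": "voice_agent", "action": "listen", "capture": "command"},
    {"where": "voice_agent", "action": "ask",
     "prompt": "Which account do you mean?", "capture": "account"},  # dialogue for ambiguity
    {"where": "device", "action": "launch_app", "app": "com.example.someapp"},  # hypothetical app
    {"where": "device", "action": "tap", "ui_element": "status_button"},
    {"where": "device", "action": "read_text", "ui_element": "status_label",
     "store_as": "result"},
    {"where": "voice_agent", "action": "tell", "say": "Here is what I found: {result}"},
]

def split_script(script):
    """Separate steps that run in the voice agent's context from the
    instructions that are sent to the UIVoice agent on the phone."""
    agent_side = [s for s in script if s["where"] == "voice_agent"]
    device_side = [s for s in script if s["where"] == "device"]
    return agent_side, device_side

agent_steps, device_steps = split_script(EXAMPLE_SCRIPT)
print(len(agent_steps), "voice-agent actions;", len(device_steps), "on-device actions")
```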