Parakeet: A Continuous Speech Recognition System for Mobile Touch-Screen Devices

Keith Vertanen and Per Ola Kristensson
Cavendish Laboratory, University of Cambridge
JJ Thomson Avenue, Cambridge, UK
{kv277, pok21}@cam.ac.uk

ABSTRACT
We present Parakeet, a system for continuous speech recognition on mobile touch-screen devices. The design of Parakeet was guided by computational experiments and validated by a user study. Participants had an average text entry rate of 18 words-per-minute (WPM) while seated indoors and 13 WPM while walking outdoors. In an expert pilot study, we found that speech recognition has the potential to be a highly competitive mobile text entry method, particularly in an actual mobile setting where users are walking around while entering text.

Author Keywords
Continuous speech recognition, mobile text entry, text input, touch-screen interface, error correction, speech input, confusion network, predictive keyboard

ACM Classification Keywords
H5.2. User Interfaces: Voice I/O

Figure 1. The Nokia N800 device and a Bluetooth headset. The Parakeet continuous speech recognition system is shown running on the device.

INTRODUCTION
The advantages of speech recognition as a mobile text entry method seem obvious and highly attractive. Speaking is a naturally acquired skill which does not demand much practice from users. Speech can also be very fast: users can speak at up to 200 WPM [22].

However, speech recognition has yet to prove itself as a competitive mobile text entry method. At least in the past, speech recognition performance suffered from poor accuracy [12]. The privacy implications of using speech as the sole modality for text entry are also obvious. Further, it has been argued that using speech as a text entry method carries cognitive costs that may limit speech recognition entry rates in practice [23]. From an engineering perspective, speech recognition also demands a device with more memory and processing power than required by traditional mobile text entry methods, such as predictive text or character recognition.

Despite these challenges, there has been recent interest in creating continuous speech recognition engines for embedded and mobile devices [4]. Two projects in this direction are PocketSphinx [11] and PocketSUMMIT [10]. These developments pave the way for the study of speech recognition as a mobile text entry method. Two important research questions arise. First, what constitutes a good design for a speech recognition system on a modern mobile device? Second, how efficient and practical is speech recognition as a mobile text entry method?

In this paper, we describe Parakeet: a system designed for efficient mobile text entry using speech recognition. We have built a complete mobile speech recognition system that recognizes American and British English. Users enter text into their mobile Linux device (such as the Nokia N800) by speaking into a Bluetooth headset microphone (Figure 1).

The design and development of Parakeet followed several design cycles. Our design was to a large extent guided by computational experiments on recorded speech audio. These experiments helped us from two perspectives. First, they helped us visualize the tradeoffs between different design choices, and thereby guided our creative design process. Second, they aided us in finding the optimal parameter settings for our graphical user interface.

This work provides insight into the design issues related to speech-based mobile text entry. We also present the first study of speech-based text entry on an actual mobile device in an actual mobile setting.

The rest of this paper is structured as follows. First, we discuss the principles which guided our design. Second, we describe our interface and detail the experiments which helped shape the design. Third, we describe the details of our mobile speech recognition system. Fourth, we present a user study validating our design. Fifth, we present results from an expert pilot study demonstrating the promising potential our system has as a mobile text entry solution. Finally, we discuss limitations and implications of our findings, point to future work, and conclude.

DESIGN PRINCIPLES

Avoid Cascading Errors
Speech recognition is imperfect and recognition errors are ultimately unavoidable. Thus, error correction is a crucial part of any speech recognition interface. As previous research has shown (e.g. [12]), users have difficulty correcting recognition errors using only speech. This is partly because errors may cascade – recognition in the correction phase may fail, requiring further corrections, and so on [12]. A possible solution is to use a different modality than speech for correction. For example, Suhm et al. [25] investigated a speech system which enabled corrections via pen gestures. For an in-depth review of multimodal speech interfaces, we refer the reader to Oviatt et al. [20].

To avoid cascading errors, we decided to create a multimodal speech interface. In our system, recognition is first performed on a user's utterance. The recognition result is then corrected in a manner which is simple, direct and transparent to the user. The correction process does not rely on any further error-prone recognition technologies.

Exploit the Speech Recognition Hypothesis Space
In a typical speech recognition interface, such as Dragon NaturallySpeaking, only the best recognition hypothesis is initially shown to the user. In order to explore alternatives to this best hypothesis, the user must take explicit action, such as highlighting some text and issuing a command. As observed by Kristensson and Zhai [16], this style of error correction interface introduces a degree of uncertainty: the user has to hope his or her action will expose the desired correction. If it does not, the user has wasted effort. Such interfaces may lead to user frustration, which is a common problem in intelligent user interfaces in general.

Therefore, we designed our correction interface so it avoids forcing users to blindly search the hypothesis space. We wanted the user to be able to see immediately, and without explicit action, whether the desired correction was available. If the desired correction was not easily available, the user could then undertake a more costly corrective measure (but a measure guaranteed to be successful).

Efficient and Practical Interaction by Touch
Mobile devices increasingly come with stylus- or touch-sensitive screens. These modalities enable us to implement a full direct manipulation user interface. Direct manipulation enables us to design a user interface where users can view and act on the rich set of recognition hypotheses we wanted to display.

We preferred touch interaction to a stylus for three reasons. First, touch does not require the user to wield a stylus while on the go. Second, one-handed usage is impossible with a stylus, and users tend to prefer mobile interfaces that can be operated with one hand [13]. Third, a well-designed touch-screen interface can also be used with a stylus, but the converse does not necessarily hold. By creating a good touch-screen design, we are also creating an interface suitable for stylus use.

Support Fragmented Interaction
When users are entering text while walking around, they need to divide their attention between interacting with the mobile device and dealing with their surroundings. Oulasvirta et al. [19] found that users interacting with a mobile web browser while walking in a demanding setting attended to the device in 4-8 second bursts. This finding has two implications. First, our system needs to enable users to quickly process and respond to the interface. Second, our system needs to be designed so that it is easy for users to pick up from where they left off after an interruption.

We therefore designed Parakeet to minimize attention demands. For example, after recognition is completed, we flash the entire display screen and beep. This simple feedback frees the user to attend almost entirely to their surroundings while waiting for recognition to complete.

We also wanted to minimize the physical actions required. In a mobile setting, a long sequence of precise touch actions is likely to go wrong. We designed towards an interface presenting more visual information (perhaps requiring several bursts of attention) but requiring fewer motor actions. For example, the user might enter a few letters of a word and then scan a list of predictions which allows completion of the entire word with a single touch.

INTERFACE DESCRIPTION
The main part of our correction interface displays the recognizer's best hypothesis along a single line at the top (Figure 2). If the best hypothesis cannot fit on the screen, the user can scroll using the buttons on the left and right sides. In addition to the best hypothesis, likely alternatives for each word in the best hypothesis are also displayed. For example, in Figure 2 the word "imports" has two other competing words ("imported" and "import").

This display is based on a word confusion network [9]. A word confusion network is a time-ordered set of clusters where each cluster contains competing word hypotheses along with their posterior probabilities. The word confusion network is built from the lattice generated during the speech recognizer's search.

Ogata and Goto [18] also used a confusion network as a basis for a speech correction interface. In relation to their work, our interface incorporates several novel aspects:

• Word candidate order - Ogata and Goto [18] ordered all word candidates (including delete buttons) strictly by probability. We changed this so all delete buttons were in the bottom row. This was done for consistency and also to allow contiguous errors to be deleted in a single swipe.

• Copying - We allow copying of words between clusters. This feature is described in detail later.

• Keyboard fallback - We provide a fallback correction mechanism based on a predictive software keyboard.

• Mobile design - Our interface is designed to work on a small mobile device using touch interaction, rather than on a desktop using a mouse. This requires careful attention to the size of user interface elements and the addition of a scrolling feature.

Figure 2. The word confusion network correction interface. The user is crossing several words and a delete button in one action.

Words in each cluster are displayed in order of their probability. While we could have adjusted button color or size based on these probabilities, our past work on confidence visualization suggests such a feature was unlikely to benefit users [28].

By displaying more than the 1-best result, our interface allows the user to quickly scan other words that the recognizer thought were likely. This is done without requiring the user to explicitly ask for alternatives as in status quo interfaces, such as Dragon NaturallySpeaking.

Available Actions in the Word Confusion Network

Substituting Words
To substitute a word, the user can use several methods. The most direct method is to simply touch an alternate word in the confusion network. This causes the selected word to change color and updates the word displayed in the top row. Sometimes several desired substitutions are in adjacent columns. In this case, the user can cross each desired word to perform multiple substitutions with one gesture.

Deleting Words
To delete words, the user touches the delete button (a box with a diagonal X, cf. Figure 2). If the user wants to delete several words at once, the user can slide across adjacent delete buttons. In speech recognition, the recognizer often gets off track and outputs several erroneous words in a row. In such instances, it is particularly useful to delete several words at once by crossing contiguous delete buttons. To easily allow such contiguous deletes, we aligned all delete buttons at the bottom.

Inserting Words
To insert a word, the user can touch the area between two columns. This brings up a keyboard interface which allows the user to choose from a set of word candidates or write a new word (Figure 7). A second option is to touch a preceding word (thereby opening the keyboard interface) and type a space and then the new word.

Sometimes the user's desired word appears in a different column from where it is required. A third way to insert a word is therefore to copy a word from another column. After touching a word for a moment, the background of the word changes and the word can be copied to another cluster (Figure 3). During the copy, the current target cluster is highlighted in yellow in the top row of the display.

Figure 3. The user is copying the word "to" by dragging it from its original column to the desired location.

Editing Words
The user's desired word may not be one of the displayed alternatives. By touching a word in the top row, or by double-tapping any word, the user is brought to a separate edit screen. Here they can use a software keyboard to either edit the selected word or enter an entirely new word.

Correcting by Crossing
Similar to Kurihara et al. [14], we allowed users to correct errors by making continuous crossing gestures. Figure 2 shows one fluid gesture changing "to" to "of", "imports" to "imported", and finally deleting "and". This crossing-based interaction is possible because in each column, only one item at a time can be selected. Therefore, the detection algorithm only needs to keep track of the last word or delete button crossed in each column. For example, in Figure 2 the user has crossed "but" and the delete button in column 4. In this case, the detection algorithm selects the delete button as it was the last item crossed. Users can start crossing anywhere in the display and move in any direction.
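To make the crossing behavior concrete, the following is a minimal sketch of how such a detector can work (our illustration, not the paper's actual code; the function name update_selection is hypothetical). Because each column admits only one selected item, the detector simply overwrites the remembered button for a column whenever the gesture crosses a new one.

```python
def update_selection(selection, column, button):
    """Remember only the most recently crossed button in each column."""
    selection[column] = button

# Example: the gesture in Figure 2 crosses "of", "imported", then "but",
# and finally the delete button in column 4; the delete button wins in
# column 4 because it was the last item crossed there.
selection = {}
for column, button in [(1, "of"), (2, "imported"), (4, "but"), (4, "<delete>")]:
    update_selection(selection, column, button)
print(selection)  # {1: 'of', 2: 'imported', 4: '<delete>'}
```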

The theoretical performance of crossing interfaces is of the same mathematical form as Fitts' law [1,7]. At the same index of difficulty [7], crossing is more efficient than, or on par with, pointing [1]. It is also likely that fluid crossing actions for corrections can be cognitively "chunked" into more meaningful actions [3]. Users may prefer to think about corrections of an utterance rather than corrections of a (possibly unrelated) set of words.

Finding Useful Actions
We used computational experiments to guide our decisions on what interface actions to include. We also used the experiments to decide how many word alternatives to use in each cluster. The experiments were done by generating word confusion networks for three standard acoustic test sets (WSJ1 si_dt_05, WSJ0 si_et_05, WSJ0 si_dt_05; 1253 utterances in total). We assumed an "oracle" user: the oracle made optimal use of a given confusion network and set of interface actions to correct as many errors as possible.
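As an illustration of the oracle metric, here is a minimal sketch (our simplification, not the paper's evaluation code; a real evaluation must first align the reference sentence to the network, which we gloss over here). An error is counted as unfixable only when the reference word, or the delete option for an insertion error, is absent from the corresponding cluster's top alternatives.

```python
def oracle_errors(clusters, reference, max_cluster_size=5):
    """Count errors even an optimal user could not fix by touch alone.
    clusters: per-position lists of hypotheses sorted by posterior,
    e.g. ["imports", "imported", "import", "<delete>"]."""
    errors = 0
    for ref_word, cluster in zip(reference, clusters):
        if ref_word not in cluster[:max_cluster_size]:
            errors += 1
    return errors

# A larger max_cluster_size exposes more alternatives, so the oracle
# word error rate can only stay the same or decrease (cf. Figure 4).
```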

Figure 4. The errors remaining (oracle WER) after correction using a confusion net with a given maximum cluster size. The top line is the performance of the original confusion net; the other lines show performance as correction features were added (Del; Copy + Del; Morph + Copy + Del).

As shown in Figure 4, increasing the cluster size allowed more errors to be corrected. The majority of gains were obtained by adding the first few alternatives. This guided our decision to use a small cluster size of five. By minimizing the cluster size, we were able to use larger buttons that are easier to touch. Adding a delete button to every cluster was shown to substantially reduce errors ("Del" line, Figure 4). This makes sense, as every insertion error made by the recognizer can then be corrected.

We tested allowing copying of words between clusters within two words of each other ("Copy+Del" line, Figure 4). This provided a small gain. Bigger gains were possible when we allowed copying across longer distances, but we anticipated users would be unlikely to copy words over long distances.

Finally, we tested allowing words to be easily replaced with one of their morphological variants (e.g. replacing "import" with "imported" or "imports"). This provided further error reductions ("Morph+Copy+Del" line, Figure 4).

Word Substitution Prediction
When a user double-taps a word in the confusion network, the keyboard interface opens. In order to minimize the need to type a word, we decided to predict likely alternative words based on the word the user touched. For example, if the user touched the word "constitutional", the interface might propose morphological variants of the word, such as "constitute", "constitutes", etc. (Figure 5).

Before we settled on displaying morphological variants, we considered several alternatives. One alternative is to predict words that are acoustically close. Another alternative is to propose word candidates based on the preceding word (using a forward language model), or based on the following word (using a backward language model). For details on how we generated these alternatives, see [26].

Figure 5. The user touched the word "constitutional" in the confusion net. The morphological variants for "constitutional" are shown in a prediction row above the software keyboard.

Figure 6. The errors remaining (oracle WER) after correction using an interface offering a given number of predictions. The top line is the baseline (no predictions); the other lines show the performance of different types of predictions (Morph, Oracle, Acoustic, Fwd LM, Back LM).

We again simulated a perfect "oracle" user on our set of test word confusion networks. As shown in Figure 6, all forms of predictions reduced oracle word error. Providing more predictions further reduced error, but the majority of the gains were obtained by five predictions. This guided our decision to use a single row of word predictions.

While acoustically based predictions performed the best, they are also highly unintuitive to users. As an example, "aback" is acoustically similar to "attack". It would be difficult for a user to know what word predictions would be offered when touching a word. Language model predictions also suffer from being unintuitive: they depend on the surrounding context rather than the actual word the user wants to change. For these reasons, we decided to use morphological variants. It is straightforward to explain to users that if they want a variant of a word differing in, say, ending, possessiveness, or grammatical number, they should touch the word and check the prediction row.

PREDICTIVE SOFTWARE KEYBOARD
Sometimes a user's desired word does not appear anywhere in the interface. To allow entry of arbitrary words, we added a predictive software keyboard.

Software Keyboard Design
Previous research suggests that pointing performance is severely degraded when users are walking [5]. We therefore tried to make the keys as big as possible (Figure 7).

Another explicit design decision was to make the keyboard literal – each key press is output immediately (as with an ordinary desktop keyboard). There are some proposed keyboards that can improve typing precision by inferring the intended letter keys rather than making a literal interpretation, e.g. [8,15]. However, these solutions are based on inference algorithms and could introduce further recognition errors should their inference be wrong. Since we want to avoid cascading errors, we opted for a traditional keyboard to provide fail-safe text entry.

Figure 7. The predictive software keyboard. The user has typed "parl" and the most likely ways to complete the word are displayed in a row below the keyboard.

Typing Prediction
We complemented the keyboard with typing prediction [6]. The keyboard suggests likely words based on what the user has typed so far (Figure 7). It finds predictions by searching a prefix-tree of 64K words. The prediction display is populated with the most likely words (given a unigram language model) which match the prefix. Matching predictions are displayed in alphabetical order. As previously described, when users first open the keyboard, the prediction row is populated with morphological word variants. These are replaced with prefix-based predictions as soon as the user begins typing.

Typing prediction results were displayed on the screen 350 ms after the last key press. This delay was introduced for two reasons. First, the prediction lookup and screen redraw introduce lag which could interfere with users' typing. Second, a dynamically changing graphical interface might distract users from their typing task.
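As a sketch of this lookup (our illustration only: a sorted word list with binary search stands in for the paper's 64K-word prefix tree, and the toy unigram probabilities are invented):

```python
import bisect

class PrefixPredictor:
    def __init__(self, unigram):          # unigram: {word: probability}
        self.words = sorted(unigram)
        self.unigram = unigram

    def predict(self, prefix, n=5):
        # All words sharing the prefix lie in one contiguous sorted range.
        lo = bisect.bisect_left(self.words, prefix)
        hi = bisect.bisect_right(self.words, prefix + "\uffff")
        # Keep the n most probable matches, then display alphabetically.
        top = sorted(self.words[lo:hi], key=self.unigram.get, reverse=True)[:n]
        return sorted(top)

predictor = PrefixPredictor({"parliament": 0.6, "parlor": 0.3, "parley": 0.1})
print(predictor.predict("parl"))  # ['parley', 'parliament', 'parlor']
```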

MOBILE SPEECH RECOGNIZER
Our speech recognizer was based on CMU Sphinx, using the PocketSphinx [11] decoder. In this section, we give details of the recognition-related components of our system.

Audio Capture
We captured audio on the N800 using a Blue Parrot B150 Bluetooth headset. We chose this headset as it has a close-talking boom microphone. Audio sampled at 8 kHz was obtained using the GStreamer framework. Audio was streamed to the recognizer as soon as the microphone was enabled. This allowed decoding to begin even before the user had finished speaking. As the user's entire audio was not available at the start of recognition, we used cepstral mean normalization based on a prior window of audio.
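A minimal sketch of cepstral mean normalization over a prior window follows (our illustration of the general technique; the paper does not give Parakeet's exact scheme, so the window length is a placeholder):

```python
import numpy as np
from collections import deque

def streaming_cmn(frames, window=100):
    """Subtract the mean of the preceding `window` cepstral frames, so
    normalization never has to wait for the whole utterance to arrive."""
    history = deque(maxlen=window)
    for frame in frames:                   # frame: 1-D cepstral vector
        mean = np.mean(history, axis=0) if history else np.zeros_like(frame)
        yield frame - mean                 # normalized frame
        history.append(frame)
```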

Acoustic Model
For fast performance, we opted for a semi-continuous acoustic model. Our acoustic model was trained following the recipe from [27]. We used an HMM topology of 5 states with skip transitions, 256 codebook Gaussians, cross-word triphones, and 39 CMU phones (plus silence). Audio was parameterized into 13 Mel frequency cepstral coefficients plus their deltas and delta-deltas.

Our US-English model was trained on 211 hours of WSJ training data, down-sampled to 8 kHz. We used 8000 tied states and the CMU pronunciation dictionary.

Our UK-English model was trained on 16 hours of WSJCAM0 training data, down-sampled to 8 kHz. We used 4000 tied states and the BEEP pronunciation dictionary. We mapped BEEP's phone set to CMU's and added missing words from the CMU dictionary. Gender-dependent UK-English models were created using MLLR adaptation of the means followed by MAP adaptation of the means, mixture weights and transition matrices.

For both the US and UK models, speaker-dependent adaptation used MLLR adaptation of the means followed by MAP adaptation of the means and mixture weights.

Language Model
We trained a language model using newswire text from the CSR-III corpus (222M words), Kneser-Ney smoothing, and the WSJ 5K word list (without verbalized punctuation). Since our test sentences were taken from the CSR set-aside directory, we purposely chose the WSJ 5K vocabulary in order to introduce a small amount of out-of-vocabulary (OOV) errors. We thought it was important to validate our design in the face of OOVs as they are typically unavoidable in real-world recognition tasks.

The language model was one of the dominating factors in the memory footprint of our system. Normally, to control language model size, n-grams with a count below a threshold are discarded. We instead used no count cutoffs and performed entropy pruning [24]. We found entropy pruning produced compact and well-performing models.

Lattice Processing
We obtained an initial recognition lattice from PocketSphinx [11]. We performed a forward/backward reduction on the lattice to reduce redundant nodes [29]. The lattice was then expanded and rescored using a compact trigram expansion algorithm [29]. The trigram lattice was pruned, removing lattice paths whose posterior probability fell outside a beam width of the best path's. A word confusion network was then created using a clustering algorithm [17]. The confusion network was pruned to limit the maximum size of each cluster and to eliminate clusters where the "no word" hypothesis dominated.
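The final pruning step can be sketched as follows (our illustration; the cluster-size cap matches the interface's five alternatives, but the dominance threshold is an invented placeholder, not Parakeet's actual value):

```python
def prune_confusion_network(clusters, max_size=5, null_word="<eps>",
                            dominance=0.9):
    """Cap each cluster's size and drop clusters where the 'no word'
    hypothesis dominates. clusters: lists of (word, posterior) pairs."""
    pruned = []
    for cluster in clusters:
        cluster = sorted(cluster, key=lambda wp: wp[1], reverse=True)
        if cluster[0][0] == null_word and cluster[0][1] >= dominance:
            continue                 # dominated by "no word": remove it
        pruned.append(cluster[:max_size])
    return pruned
```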
Speaker Adaptation Environment
Since we knew mobile recognition would be challenging, we wanted to improve accuracy by adapting the acoustic model to the user's voice. Adaptation is done by having users read a set of sentences with known transcriptions. With a desktop dictation package, users typically record adaptation data while in front of their desktop computer. When taking speech recognition mobile, a question arises whether it is worth recording adaptation data in a mobile rather than a desktop setting. Price and colleagues [21] investigated this by having half of their participants record adaptation data while walking on a noisy treadmill, while the other half recorded data while seated in an office (with simulated treadmill noise playing). While the result was not statistically significant, they found that users who had performed adaptation while on the treadmill had a lower word error rate (WER) both when tested on the treadmill and when seated. They suggest that adapting in a more demanding condition might improve overall recognition performance (though they provide no insight into why this might be).

We conducted a small within-subject experiment to test the effects of adaptation environment. We had four US-English speakers record two identical sets of adaptation data on the N800. One set was recorded while walking outside and the other while seated inside. For an adaptation set, we used 40 phonetically diverse sentences from the WSJ corpus. Speakers also recorded 125 test sentences from the CSR set-aside directory. Two speakers did the outdoor recording first, while the other two did the indoor recording first. Recognition experiments were conducted afterwards on a desktop computer.

Table 1. The effect of adaptation environment on speech recognition word error rate (WER). Means and 95% confidence intervals (from a bootstrap estimate [2]).

Adapt data   Test data   WER ± 95% CI
none         indoor      15.87 ± 2.16
indoor       indoor      12.61 ± 1.93
outdoor      indoor      12.72 ± 1.94
none         outdoor     21.56 ± 2.53
indoor       outdoor     18.39 ± 2.39
outdoor      outdoor     18.45 ± 2.37

As shown in Table 1, performing adaptation either indoors or outdoors improved accuracy, reducing word errors by about 20% relative compared to doing no adaptation. We found recognition was much harder outdoors than indoors, increasing the word error rate by 45% relative. The error rates of the indoor- and outdoor-adapted models were very similar regardless of the test set. Since we found no particular advantage to recording adaptation data outdoors, we decided to record all adaptation data indoors.

USER STUDY
To see how well our system worked in practice, we conducted an initial user study. Our primary aim was to validate our design. In the study, users spoke and corrected sentences while seated indoors and while walking outdoors.

Participants
We recruited four participants (3 male, 1 female) from the university campus. Three participants used the UK acoustic model; one used the US model. Their ages ranged from 22 to 39. Participants had no significant prior experience using speech or touch interfaces.

Method and Setup
Participants used a Nokia N800 Internet Tablet (Figure 1). The physical dimensions of the device (length × width × thickness) were 75 × 144 × 13 mm. The screen had a resolution of 800 × 480 pixels and a size of 90 × 55 mm.

For stimuli text, we used sentences from the set-aside directory of the CSR-III corpus. These sentences were excluded from language model training. We chose sentences with between 8 and 16 words. Using our 5K WSJ language model, 2.4% of words were out-of-vocabulary and the average per-word perplexity was 18.

Participants took part in a single two-hour session. Participants first trained the speech recognizer for 10 minutes. They then received a 5 minute introduction to the interface, followed by 10 minutes of practice. Participants then proceeded to either the seated indoor condition or the walking outdoor condition. In the outdoor condition, participants walked circles around our building on a safe pedestrian path and under constant supervision. Each condition lasted about 30 minutes.

In both conditions, participants were presented with stimuli sentences in a large font. When ready, participants pressed a MIC ON button, read the sentence, and then pressed a MIC OFF button. After a recognition delay, the device beeped and the participant hit a CORRECT button to enter the correction interface. After correcting a sentence, the participant pressed a DONE button to move to the next sentence. Participants were told to proceed "quickly and accurately".

Results
On average, participants completed 41 sentences indoors (sd = 7) and 27 sentences outdoors (sd = 4). Figure 8 shows the GPS track of one participant's outdoor trial.

We found the outdoor condition presented some challenges for our system. Our trials took place during a period of windy weather with an average wind speed of 13 knots, gusting to 28 knots. This made recognition more difficult on the windward sections of the course. In addition, one trial took place on a sunny day and that participant had difficulty seeing the N800's screen on parts of the course.

Figure 8. One participant's GPS track line around the Cavendish Laboratory. The image is from Google Earth.

Error Rate
Word error rate (WER) was calculated as the word edit distance between the stimuli sentence and the final sentence, divided by the number of words in the stimuli sentence (a code sketch of this computation appears below). Table 2 shows the mean error rates obtained indoors and outdoors. The before-correction error rate is the error rate of the speech recognizer's output. The after-correction error rate is the error rate after participants corrected the recognition result. As shown in Table 2, the recognizer's error rate was considerably higher outdoors than indoors. In both conditions, participants corrected almost all errors.

Table 2. Novice users' mean word error rates (WER) and 95% confidence intervals.

Condition   Text                WER ± 95% CI
indoor      before correction   16.17 ± 4.50
outdoor     before correction   25.63 ± 3.13
indoor      after correction     1.22 ± 1.04
outdoor     after correction     2.23 ± 1.68

Entry Rate
Entry rate was measured in words-per-minute (WPM). We used the standard convention defining a word as five consecutive characters. The time duration to enter a sentence was calculated as the time between pressing MIC ON and pressing DONE. Table 3 shows the mean entry rates. As expected, users were faster indoors than outdoors.

Table 3. Novice users' mean entry rates in words-per-minute (WPM) and 95% confidence intervals.

Condition   WPM ± 95% CI
indoor      18.36 ± 1.80
outdoor     12.83 ± 0.55

Correction Method Usage
Participants could correct errors either using the word confusion network or the software keyboard. If participants forgot the sentence, they could also invoke a help screen which displayed the stimuli sentence again.

Indoors, participants spent 62% of their time in the word confusion network, 32% in the predictive software keyboard, and 6% in help. Outdoors, participants spent 56% of their time in the word confusion network, 33% in the predictive software keyboard, and 11% in help.

Word Confusion Network Usage
Figure 9 shows the overall usage of the word confusion network. The most commonly selected row in the confusion network was delete. When substituting words, selections decreased in frequency as a function of how far away the words were from the 1-best result (Alt1-Alt4 in Figure 9). This validated our computational results, which showed the first few alternatives were the most useful for corrections.

Figure 9. How often words in each row in the confusion network were selected and the method used (touch or crossing). Top is the 1-best result, Alt1-Alt4 are alternative words (Alt1 is the top alternative), Delete is the delete button.

Users most often selected single buttons via touch. When they did select multiple buttons via a crossing gesture, they primarily selected delete buttons. This showed aligning the delete buttons in a single row was a useful feature.
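The sketch below shows both measures as defined above (our illustration: a standard dynamic-programming word edit distance for WER, and the five-characters-per-word convention for WPM):

```python
def wer(stimulus, final):
    """Word edit distance divided by the number of stimulus words."""
    r, h = stimulus.split(), final.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            d[i][j] = min(d[i - 1][j - 1] + (r[i - 1] != h[j - 1]),  # sub
                          d[i - 1][j] + 1,                           # del
                          d[i][j - 1] + 1)                           # ins
    return d[len(r)][len(h)] / len(r)

def wpm(text, seconds):
    """Entry rate, defining a word as five consecutive characters."""
    return (len(text) / 5.0) / (seconds / 60.0)
```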


Out of 273 tasks, 82 had a completely correct recognition result. Users completed 80 of these tasks without making any unnecessary actions (such as touching a word or using the keyboard). In 27 of the 273 tasks, the sentence could be completely corrected using only the confusion network. In these instances, users corrected 26 of the sentences (96%) using only the confusion network. This shows users took advantage of the confusion network whenever possible.

Of 416 selection actions in the confusion network, users touched a single word 374 times (90%) and crossed multiple words 42 times (10%). When crossing, users selected 2.6 buttons on average. The copy words feature (Figure 3) was only used three times.

Scrolling
Only 2% of recognition results fit on one screen. The average width of a result was 1083 pixels (sd = 266). The display width (minus scroll buttons) was 700 pixels. So on average, users needed to scroll once to check their sentence.

Software Keyboard Usage
Of the 270 times users invoked the keyboard, they chose a morphological prediction 18 times (7%). While indoors, 17% of key presses were the backspace key. While outdoors, 25% of key presses were the backspace key.

In total, participants wrote 265 words with the keyboard. When typing those words, participants used the typing prediction 54% of the time. On average, participants typed about 3 letters (mean = 3.3) before selecting a prediction. When participants did not use prediction, the desired word had been displayed by the system 70% of the time. In these cases, we found on average the user only needed to type a few additional letters (mean = 1.6) to complete their word. This is likely why they ignored the word prediction.

EXPERT PILOT STUDY
To illustrate the potential of speech as a viable mobile text entry method, we measured one expert user (one of the authors) over seven sessions. The goal of the expert pilot was to demonstrate how fast our technique could potentially go (as done in other text entry studies, such as [15]).

Each expert session had an indoor and an outdoor condition as previously described. However, instead of a fixed time limit, the expert completed a fixed number of sentences. The expert had several years of experience developing and using speech recognition systems. The expert used the US acoustic model.

Results
In total, the expert completed 313 sentences indoors and 309 outdoors. Table 4 shows the recognition word error rates and final corrected error rates obtained by our expert.

Table 4. Expert mean word error rates (WER). The 95% confidence intervals reflect the individual's variance.

Condition   Text                WER ± 95% CI
indoor      before correction    8.46 ± 1.60
outdoor     before correction   14.83 ± 2.17
indoor      after correction     0.94 ± 0.37
outdoor     after correction     1.52 ± 0.99

Despite relatively long recognition delays (mean = 18 s, sd = 7 s), our expert's text entry rates were surprisingly good (Table 5). If we removed the time the expert spent waiting for recognition, his writing speed more than doubled. While having no delay is not realistic, as devices and recognizers become faster, a good amount of this gain could be realized.

Table 5. Expert mean entry rates. The 95% confidence intervals reflect the individual's variance. The bottom two rows show performance assuming no recognition delay.

Condition   Rec delay   WPM ± 95% CI
indoor      actual      24.43 ± 0.70
outdoor     actual      19.60 ± 0.72
indoor      none        53.18 ± 1.94
outdoor     none        44.79 ± 2.05

Figure 10. Plot of the entry rate in words-per-minute (WPM) and word error rate (WER) of sentences written by the expert. Exponential trend lines: indoor y = 26.5e^(-0.0159x), R² = 0.38; outdoor y = 22.6e^(-0.0158x), R² = 0.50.

As expected, walking outdoors slowed entry, but it did so only by about 20% relative. About half of the expert's sentences (51%) were recognized with no errors. Unsurprisingly, these were completed the fastest (27 WPM indoors, 24 WPM outdoors). The trend lines in Figure 10 show how entry rate slowed as recognition errors increased. As errors increased, the decrease in entry rate was not as dramatic as one might expect. For example, in the 10% word error range, entry rates dropped by only about 15% in both conditions.

DISCUSSION

Limitations
We used a 5K vocabulary and low-perplexity sentences. While we would have liked to test users on more difficult text, we wanted a "useable" level of recognition errors. In our opinion, there is no point testing speech correction interfaces at very high word error rates: if there are too many errors, it would likely be better not to use speech in the first place. While commercial recognizers like Dragon achieve fast and accurate recognition of difficult text on a desktop, the same is not currently possible using a research recognizer on a mobile device. Recognizers and devices will undoubtedly improve, but in the meantime, we believe we can explore the design and usability of speech interfaces by giving users simpler recognition tasks.

Due to limited computational resources on our device, users experienced long recognition delays (mean = 22 s, sd = 14 s). Some (pathological) utterances took up to a minute to recognize. As devices become faster, delays will be reduced and entry rates should improve. Still, it is worth noting that our novices' (corrected) entry rate of 13 WPM walking outdoors (including these long recognition delays) was still about as fast as the entry rates users obtained while seated indoors using predictive text after several sessions [30].

Design Implications

Review Screen
Many sentences were recognized completely correctly. Despite this, users were forced to enter the word confusion network screen and scroll through the entire result to ensure it was correct. It may be advantageous to first allow a simple single-screen review of the entire recognition result.

Easy Fallback
We noticed that for some utterances, the recognition result had so many errors as to make correction an exercise in erasing everything and typing the entire sentence. Our interface should better support this circumstance, allowing users to fall back to keyboard-only entry when necessary.

High Contrast
While outdoors, users sometimes found it hard to read the screen because of glare. The user interface could benefit from a redesign that puts more emphasis on high contrast.

More Efficient Use of Screen Real Estate
We found users sometimes had trouble with target selection. Particularly in the word confusion network, buttons could benefit from being larger. Given that users made relatively few selections in the lowest rows of the word confusion network (cf. Figure 9), we may want to remove a row or two to provide space for larger buttons.

Improved Speech Recognition
Recognition delays after users hit the MIC OFF button accounted for 50% of entry times. As mobile devices get faster, these delays will be reduced significantly. This will have a very large impact on the practical entry rates achievable by continuous speech recognition on a mobile device. Improvements in recognition accuracy will also clearly help improve text entry throughput.

CONCLUSIONS
In this paper we presented Parakeet – a touch-screen system for continuous speech recognition on mobile devices. To our knowledge, we are the first to explore a practical mobile continuous speech recognition system for text entry. Our design of Parakeet took advantage of empirical and qualitative findings in the HCI literature. In addition, wherever possible, we adopted an engineering-driven design process where we optimized our user interface based on our system's predicted behavior on empirical data.

In Parakeet, we introduced several novel user interface enhancements. The final design of Parakeet was validated by a user study. We had participants use our system both seated indoors and while walking outdoors. To our knowledge, no speech recognition text entry system had previously been tested with users actually walking around. Among other things, the study confirmed that word confusion networks were a useful correction interface for users. When the intended sentence was in the confusion network, users were able to find and select it 96% of the time. We also found that participants used the crossing feature about 10% of the time, showing that crossing was a useful complementary feature. Last, we gave practical design recommendations based on lessons learned in our study.

Our expert pilot study demonstrated that speech may be a competitive mobile text entry method, particularly in an actual mobile setting where users are moving around. Our immediate future work is to run a full-scale experiment validating this hypothesis.

ACKNOWLEDGMENTS
We would like to express our gratitude to the study participants. We also thank David MacKay for his many constructive comments and David Huggins-Daines for support and advice on PocketSphinx. This research was in part funded by a donation from Nokia. The following applies to P.O.K. only: The research leading to these results has received funding from the European Community's Seventh Framework Programme FP7/2007-2013 under grant agreement number 220793.

REFERENCES
1. Accot, J. and Zhai, S. More than dotting the i's – foundations for crossing-based interfaces. Proc. CHI 2002, ACM Press (2002), 73-80.
2. Bisani, M. and Ney, H. Bootstrap estimates for confidence intervals in ASR performance evaluation. Proc. ICASSP 2004, IEEE Press (2004), 409-412.
3. Buxton, W. Chunking and phrasing and the design of human-computer dialogues. Proc. IFIP World Computer Congress 1986, IFIP (1986), 475-480.
4. Cohen, J. Embedded speech recognition applications in mobile phones: status, trends and challenges. Proc. ICASSP 2008, IEEE Press (2008), 5352-5355.
5. Crossan, A., Murray-Smith, R., Brewster, S., Kelly, J. and Musizza, B. Gait phase effects in mobile interaction. Ext. Abstracts CHI 2005, ACM Press (2005), 1312-1315.
6. Darragh, J.J., Witten, I.H. and James, M.L. The reactive keyboard: a predictive typing aid. IEEE Computer 23, 11 (1990), 41-49.
7. Fitts, P. The information capacity of the human motor system in controlling the amplitude of movement. J. Experimental Psychology 47 (1954), 381-391.
8. Goodman, J., Venolia, G., Steury, K. and Parker, C. Language modeling for soft keyboards. Proc. AAAI 2002, AAAI Press (2002), 419-424.
9. Hakkani-Tür, D., Béchet, F., Riccardi, G. and Tur, G. Beyond ASR 1-best: using word confusion networks in spoken language understanding. J. Computer Speech and Language 20, 4 (2006), 495-514.
10. Hetherington, I.L. PocketSUMMIT: small-footprint continuous speech recognition. Proc. ICSLP 2007, ISCA (2007), 1465-1468.
11. Huggins-Daines, D., Kumar, M., Chan, A., Black, A.W., Ravishankar, M. and Rudnicky, A.I. PocketSphinx: a free, real-time continuous speech recognition system for hand-held devices. Proc. ICASSP 2006, IEEE Press (2006), 185-188.
12. Karat, C.M., Halverson, C., Horn, D. and Karat, J. Patterns of entry and correction in large vocabulary continuous speech recognition systems. Proc. CHI 1999, ACM Press (1999), 568-575.
13. Karlson, A.K., Bederson, B.B. and Contreras-Vidal, J.L. Understanding one-handed use of mobile devices. In Lumsden, J. (Ed.) Handbook of Research on User Interface Design and Evaluation for Mobile Technology. Idea Group (2008), 86-100.
14. Kurihara, K., Goto, M., Ogata, J. and Igarashi, T. Speech Pen: predictive handwriting based on ambient multimodal recognition. Proc. CHI 2006, ACM Press (2006), 851-860.
15. Kristensson, P.O. and Zhai, S. Relaxing stylus typing precision by geometric pattern matching. Proc. IUI 2005, ACM Press (2005), 151-158.
16. Kristensson, P.O. and Zhai, S. Improving word-recognizers using an interactive lexicon with active and passive words. Proc. IUI 2008, ACM Press (2008), 353-356.
17. Mangu, L., Brill, E. and Stolcke, A. Finding consensus in speech recognition: word error minimization and other applications of confusion networks. J. Computer Speech and Language 14, 4 (2000), 373-400.
18. Ogata, J. and Goto, M. Speech repair: quick error correction just by using selection operation for speech input interfaces. Proc. ICSLP 2005, ISCA (2005), 133-136.
19. Oulasvirta, A., Tamminen, S., Roto, V. and Kuorelahti, J. Interaction in 4-second bursts: the fragmented nature of attentional resources in mobile HCI. Proc. CHI 2005, ACM Press (2005), 919-927.
20. Oviatt, S., Cohen, P., Wu, L., Vergo, J., Duncan, L., Suhm, B., Bers, J., Holzman, T., Winograd, T., Landay, J., Larson, J. and Ferro, D. Designing the user interface for multimodal speech and pen-based gesture applications: state-of-the-art systems and future research directions. Human-Computer Interaction 15 (2000), 263-322.
21. Price, K.J., Lin, M., Feng, J., Goldman, R., Sears, A. and Jacko, J. Motion does matter: an examination of speech-based text entry on the move. Universal Access in the Information Society 4 (2006), 246-257.
22. Rosenbaum, D.A. Human Motor Control. Academic Press (1991).
23. Shneiderman, B. The limits of speech recognition. Communications of the ACM 43, 9 (2000), 63-65.
24. Stolcke, A. Entropy-based pruning of backoff language models. Proc. DARPA Broadcast News Transcription and Understanding Workshop, DARPA (1998), 270-284.
25. Suhm, B., Myers, B. and Waibel, A. Multimodal error correction for speech user interfaces. ACM TOCHI 8, 1 (2001), 60-98.
26. Vertanen, K. Efficient computer interfaces using continuous gestures, language models, and speech. M.Phil. thesis, University of Cambridge, United Kingdom (2004).
27. Vertanen, K. Baseline WSJ acoustic models for HTK and Sphinx: training recipes and recognition experiments. Technical report, University of Cambridge, United Kingdom (2006).
28. Vertanen, K. and Kristensson, P.O. On the benefits of confidence visualization in speech recognition. Proc. CHI 2008, ACM Press (2008), 1497-1500.
29. Weng, F., Stolcke, A. and Sankar, A. Efficient lattice representation and generation. Proc. ICSLP 1998, ISCA (1998), 1251-1254.
30. Wobbrock, J.O., Chau, D.H. and Myers, B.A. An alternative to push, press, and tap-tap-tap: gesturing on an isometric joystick for mobile text entry. Proc. CHI 2007, ACM Press (2007), 667-676.
