TongueBoard: An Oral Interface for Subtle Input
Richard Li, Jason Wu, Thad Starner
Google Research & Machine Intelligence, Mountain View, CA, USA
{lichard,jsonwu,thadstarner}@google.com

ABSTRACT
We present TongueBoard, a retainer form-factor device for recognizing non-vocalized speech. TongueBoard enables absolute position tracking of the tongue by placing capacitive touch sensors on the roof of the mouth. We collect a dataset of 21 common words from four user study participants (two native American English speakers and two non-native speakers with severe hearing loss). We train a classifier that is able to recognize the words with 91.01% accuracy for the native speakers and 77.76% accuracy for the non-native speakers in a user-dependent, offline setting. The native English speakers then participate in a user study involving operating a calculator application with 15 non-vocalized words and two tongue gestures at a desktop and with a mobile phone while walking. TongueBoard consistently maintains an information transfer rate of 3.78 bits per decision (number of choices = 17, accuracy = 97.1%) and 2.18 bits per second across stationary and mobile contexts, which is comparable to our control conditions of mouse (desktop) and touchpad (mobile) input.

CCS CONCEPTS
• Human-centered computing → Interaction techniques; Ubiquitous and mobile devices;

KEYWORDS
wearable devices, input interaction, oral sensing, subtle gestures, silent speech interface

ACM Reference Format:
Richard Li, Jason Wu, Thad Starner. 2019. TongueBoard: An Oral Interface for Subtle Input. In Augmented Human International Conference 2019 (AH2019), March 11–12, 2019, Reims, France. ACM, New York, NY, USA, Article 4, 9 pages. https://doi.org/10.1145/3311823.3311831

1 INTRODUCTION
Conversational technologies, such as Google Assistant, Siri, and Alexa, facilitate natural and expressive interaction with online services, connected Internet-of-Things (IoT) devices, and mobile computers. While these conversational interfaces are popular due to their ease of use, they often rely on speech transcription technology that is ineffective in noisy environments and often socially unacceptable, and they present privacy concerns: the microphone may inadvertently capture nearby conversations, and nearby listeners may hear the user's commands.

We present TongueBoard, an oral interface for subtle gesture interaction and control using silent speech. TongueBoard is a retainer form-factor device that tracks the movement of the tongue, enabling silent speech recognition and accurate detection of tongue gestures. Our system's accurate characterization of the tongue's position and movement allows for subtle interaction through a combination of silent speech vocabulary sets and tongue gestures. We discuss how our system can be used to control common user interfaces by supporting the capabilities of existing input hardware.

To evaluate the accuracy of our system, we test a lexicon consisting of 21 words (numerical digits, mathematical operations, and time descriptors) and 9 tongue gestures. We first evaluate the offline accuracy of the silent speech recognition with data from two native American English speakers and two non-native deaf speakers. We then compare the interface's accuracy and interaction speed to traditional inputs (mouse and touchscreen) in a live study covering stationary and mobile contexts.
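The "bits per decision" figure quoted above is consistent with the Wolpaw information transfer rate formula commonly used in brain-computer interface and text-entry research; whether the authors computed it exactly this way is an assumption, but plugging in the reported numbers reproduces their value. A minimal sketch:

```python
# Wolpaw ITR: reproduces the paper's quoted figures (assumption: this is the
# formula behind them; it matches with N = 17 choices, P = 0.971 accuracy).
import math

def wolpaw_bits_per_decision(n_choices: int, accuracy: float) -> float:
    """Bits conveyed by one selection among n_choices at the given accuracy."""
    n, p = n_choices, accuracy
    if p >= 1.0:
        return math.log2(n)
    return (math.log2(n)
            + p * math.log2(p)
            + (1 - p) * math.log2((1 - p) / (n - 1)))

bits = wolpaw_bits_per_decision(17, 0.971)
print(f"{bits:.2f} bits per decision")      # -> 3.78, as reported
# At the reported rate of 2.18 bits per second, this implies roughly
# one decision every bits / 2.18 seconds.
print(f"{bits / 2.18:.2f} s per decision")  # -> ~1.73
```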
2 RELATED WORK

Subtle Interfaces
While some gesture interfaces are expressive and afford fast input, they may not be practical for everyday use. Gestures that are subtle or that utilize everyday movements are more likely to be socially acceptable and willingly performed by users [31]. Many gesture interfaces seek to minimize their noticeability by requiring little or no obvious movement, and others allow for the interaction to be performed out of view. Despite the challenges of capturing signal from low-motion gestures using traditional sensors, electromyography (EMG) has been successfully used to capture neuromuscular signals associated with different movements [8, 26]. Other input interfaces support subtle interaction by requiring movement of unseen body parts to decrease the visibility of the gesture. Eyes-free interfaces [2, 29] and inattentive gestures allow interactions to take place out of view, such as below the table or inside the user's pocket [17, 32, 34]. Recently, the ear has been identified as a promising place for supporting subtle interactions and for monitoring physiological signals such as heart rate, blood pressure, electroencephalography (EEG), and eating frequency [5, 15, 16, 30]. Motion of the temporalis muscle, responsible for moving the jaw, causes movement and deformation of the ear, which can be sensed through barometric and in-ear electric field sensing [1, 24, 25].

Hidden areas such as the inside of the mouth can also be sensed using a variety of methods, including neuromuscular signals (EEG, EMG), skin surface deformation (SKD) [28], bone conduction microphones [3], optical approaches (infrared proximity sensors) [4, 33], and wireless signals [14]. Retainer form-factor oral and tongue sensors have also been explored for enabling tetraplegics to control electronic devices and powered wheelchairs using a tongue-drive system [22]. However, many of these sensing approaches provide an incomplete or low-resolution characterization of the inner mouth and are not suitable for more complex gestures and silent speech applications.

Silent Speech
Silent speech interfaces allow the user to communicate with a computer system using speech or natural language commands without the need to produce a clearly audible sound. These interfaces allow users to communicate efficiently with computer systems without attracting attention or disrupting the environment the way traditional voice-based interfaces do. Denby et al. [10] and Freitas et al. [11] provide comprehensive surveys of silent speech interfaces used for speech signal amplification, recognition of non-audible speech, and "unspoken speech" (imagined speech without any explicit muscle movement). Broadly, these interfaces fall into the categories of physical techniques (which measure the vocal tract directly), inference from non-audible acoustics, and electrical sensing (which measures the activations of actuator muscles or command signals from the brain) [10].

More recently, improvements have been made in several areas of sensing described in Denby et al. by incorporating novel hardware [12] and more advanced machine learning techniques [18]. Several systems have incorporated deep learning techniques to lipread speech for the purposes of silent speech and augmenting automatic speech recognition (ASR) models [35, 37]. Most relevant to the work presented in this paper is research on subtle physical sensing of silent speech. Table 1 shows several recent silent speech interfaces and situates our work with regard to sensing modality, proxy, dictionary size, accuracy, whether the sensed phenomenon is visible to a bystander, and usability while walking.

Table 1: Wearable Silent Speech Interface Comparison

Silent Speech Interface    Modality            Proxy              Dictionary           Accuracy  Invisible  Walking
Bedri et al. 2015 [6]      Optical & Magnetic  Jaw and tongue     11 phrases           90.50%    No         No
Kapur et al. 2018 [21]     sEMG                Jaw and cheek      10 words             92.01%    Yes        No
Meltzner et al. 2018 [27]  sEMG                Face and neck      2200 words           91.10%    No         No
Sun et al. 2018 [36]       Camera              Lip movement       44 phrases           95.46%    No         No
Fukumoto 2018 [13]         Audio               Ingressive speech  85 phrases           98.20%    No         No
TongueBoard                Capacitive          Tongue             15 words/2 gestures  97.10%    Partially  Yes

We position our main contribution as an interaction that is robust to motion artifacts. To the best of our knowledge, the works noted in Table 1 still require significant work to be able to withstand noise from external movements. Furthermore, with practice, TongueBoard can be used with no movements noticeable by an observer, since all motions are contained within the mouth.

3 TONGUEBOARD

Interaction Design
We construct a silent speech interface with 21 words for application input and nine tongue gestures (Table 2). When used together, TongueBoard is able to support numerical input, navigation, and item selection, enabling a wide array of potential interactions.
• Digits & Text - While we designed our vocabulary primarily to support a calculator application, recognition of digits can be used for dialing phones, selecting between Smart Reply suggestions [20], or controlling a multitap-like telephone keypad (e.g., saying "2 2 2" to select a C), allowing slower but more expressive input.
• Navigation - TongueBoard emulates a d-pad (up/down/left/right) by allowing users to place their tongue on certain regions of the roof of the mouth; pressing the tongue fully against the roof of the mouth is selection (see the sketch after this list).
• Editing - Four swiping gestures (left-right, front-back, etc.) enable editing commands such as backspace.
• Control - Progressively disclosing interfaces on head-worn displays such as Google Glass follow a hierarchical structure where each utterance displays a limited list of potential options (‘OK Glass ... Send a message to ..
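The paper does not give implementation details for the d-pad emulation in the Navigation item, so the following is only a minimal sketch under stated assumptions: each palate electrode has a normalized (x, y) position (x in [-1, 1] left-to-right, y in [0, 1] front-to-back) and a boolean contact state, and the region boundaries, thresholds, and names are all illustrative.

```python
# Hypothetical d-pad mapping for a palate contact frame (not the paper's code).
from typing import List, Optional, Tuple

Electrode = Tuple[float, float, bool]  # (x, y, is_contacted)

SELECT_FRACTION = 0.8  # assumed: "tongue fully against the roof" = select

def dpad_event(frame: List[Electrode]) -> Optional[str]:
    """Map one sensor frame to 'up'/'down'/'left'/'right'/'select'/None."""
    contacted = [(x, y) for x, y, on in frame if on]
    if not contacted:
        return None
    # Pressing the tongue fully against the roof of the mouth is selection.
    if len(contacted) / len(frame) >= SELECT_FRACTION:
        return "select"
    # Otherwise use the centroid of the contact patch to pick a direction.
    cx = sum(x for x, _ in contacted) / len(contacted)
    cy = sum(y for _, y in contacted) / len(contacted)
    if abs(cx) > 0.4:                    # strongly lateral contact
        return "right" if cx > 0 else "left"
    return "up" if cy < 0.4 else "down"  # front of palate = up, back = down
```

A centroid-based mapping like this tolerates partial contact patches; a real system would likely add debouncing and per-user calibration of the region boundaries.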
Data Processing. The SmartPalate sensor array provides consistent and accurate tracking of the user's tongue as it moves against the roof of the mouth, allowing clear differentiation between words and gestures in our vocabulary.
We use a classification algorithm that leverages both tongue placement information and temporal movement for word-
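The description of the classification algorithm is cut off in this excerpt, so the sketch below is emphatically not the authors' method: it only illustrates one classical way to combine the two signal properties the sentence names, placement (which electrodes are contacted) and temporal movement (how the contact pattern evolves), via a dynamic-time-warping (DTW) nearest-neighbor classifier over sequences of binary contact frames. The binary-frame representation and all names are assumptions.

```python
# Illustrative DTW nearest-neighbor over palate contact sequences
# (an assumption, not the TongueBoard classifier described in the paper).
from typing import Dict, List
import numpy as np

Frame = np.ndarray      # 1-D binary vector, one entry per palate electrode
Sequence = List[Frame]  # one silently spoken word or gesture

def dtw_distance(a: Sequence, b: Sequence) -> float:
    """DTW cost with Hamming distance between per-timestep contact patterns."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = float(np.sum(a[i - 1] != b[j - 1]))  # placement mismatch
            cost[i, j] = d + min(cost[i - 1, j],     # temporal warping
                                 cost[i, j - 1],
                                 cost[i - 1, j - 1])
    return float(cost[n, m])

def classify(query: Sequence, templates: Dict[str, List[Sequence]]) -> str:
    """Return the label of the nearest recorded template (user-dependent)."""
    return min(templates,
               key=lambda lbl: min(dtw_distance(query, t)
                                   for t in templates[lbl]))
```

The warping step absorbs differences in speaking speed, while the Hamming term keys on which palate regions the tongue actually touches, matching the paper's stated use of both placement and temporal information.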