JustSpeak: Enabling Universal Voice Control on Android
Yu Zhong (1), T.V. Raman (2), Casey Burkhardt (2), Fadi Biadsy (2), and Jeffrey P. Bigham (1,3)

(1) Computer Science, ROCHCI, University of Rochester, Rochester, NY 14627
(2) Google Research, Mountain View, CA 94043
(3) Human-Computer Interaction Institute, Carnegie Mellon University, Pittsburgh, PA 15213

[email protected]; {raman, caseyburkhardt, [email protected]; [email protected]

ABSTRACT

In this paper we introduce JustSpeak, a universal voice control solution for non-visual access to the Android operating system. JustSpeak offers two contributions compared to existing systems. First, it enables system-wide voice control on Android that can accommodate any application. JustSpeak constructs the set of available voice commands based on application context; these commands are directly synthesized from on-screen labels and accessibility metadata, and require no further intervention from the application developer. Second, it provides more efficient and natural interaction by supporting multiple voice commands in the same utterance. We present the system design of JustSpeak and describe its utility in various use cases. We then discuss the system-level support required by a service like JustSpeak on other platforms. By eliminating the target locating and pointing tasks, JustSpeak can significantly improve the experience of graphical interface interaction for blind and motion-impaired users.

ACM Classification Keywords

H.5.2 Information Interfaces and Presentation: User Interfaces - Voice I/O; K.4.2 Computers and Society: Social Issues - Assistive technologies for persons with disabilities

General Terms

Human Factors, Design
Author Keywords

Universal voice control, accessibility, Android, mobile

INTRODUCTION

The mouse and multi-touch surfaces have been widely used as the main input methods on computers and mobile devices for their reliable accuracy. But under circumstances when visual access to the display is impossible or hindered, or for users with dexterity issues, it is difficult to point at a target, so these methods are often less effective. For blind and motion-impaired people this issue is most pronounced, but other people also face it, e.g., when driving or when using a smartphone in bright sunshine. Voice control is an effective and efficient alternative non-visual interaction mode which requires no target locating and pointing. Reliable and natural voice commands can significantly reduce the time and effort needed to directly manipulate a graphical user interface.

In graphical user interfaces (GUIs), objects such as buttons, menus, and documents are presented for users to manipulate in ways similar to how they are manipulated in the physical workspace, except that they are displayed on the screen as icons. The first and most essential step of interaction with a GUI is target locating: sighted people can intuitively find the target object with a quick visual scan, but without visual access to the screen this simple task is often very hard to accomplish, which is the main cause of difficulty in non-visual interaction. Before the invention of screen readers [1, 7, 6], it was essentially impossible to interact with a GUI non-visually. Now that audio feedback is used to support non-visual accessibility of computers and smartphones, many works [2, 4] have focused on utilizing voice control to improve the efficiency of eyes-free user input. Motion-impaired people, in turn, face challenges in the second step, target pointing, as most GUI interaction techniques like the mouse and multi-touch require accurate physical movements of the user's hands or fingers.

The advantage of voice commands over the mouse or multi-touch when interacting with a screen non-visually is that they do not require targets to be located, and thus avoid the pointing problems described above. Pointing often costs the majority of the time needed to complete an interaction with an input device (for example, moving a cursor onto a button with a mouse), and has been studied in depth since at least the introduction of Fitts's law [11] in the 1960s. By contrast, when using speech as input, users can expect a computing device to automatically execute commands without the hassle of finding and pointing at on-screen objects. For example, to switch off Bluetooth on a multi-touch smartphone with a speech-enabled interface, a user can simply tell the phone to "switch off Bluetooth" instead of going through the time-consuming process of finding and tapping the Settings icon and then the Bluetooth switch.

Unfortunately, existing voice control systems are constrained in many ways. We have not yet seen a system that supports universal voice commands on a device, able to control any application. The majority of them are either built at the application level (e.g., speech input methods) or limited to a pre-defined command set. For instance, even the most popular voice assistants, Siri and Google Now [2, 4], only support built-in functionality on the device such as messaging, calling, and web searching.

In this paper, we present JustSpeak, which is designed and built with a fundamentally different architecture at the system level of Android. JustSpeak aims to enable universal voice control on the Android operating system to help users quickly and naturally control the system non-visually and hands-free. Users can easily access JustSpeak on their existing Android smartphones and tablets. The application runs in the background as a system service which can be activated with simple gestures, a button, or a Near Field Communication (NFC) tag, no matter which application is running in the foreground. When activated, JustSpeak records the speech spoken by the user. The recorded audio is then transcribed into plain text and parsed into computer-understandable commands which are automatically executed by JustSpeak. Unlike most current systems, JustSpeak can accommodate commands for any application installed on the device, for example, clicking buttons, scrolling lists, and toggling switches. Moreover, JustSpeak is smarter and more natural than other works in that it supports multiple commands in one utterance. For instance, to open the Gmail application and then refresh the inbox, two commands can be combined into one sentence: "Open Gmail then refresh". These two main features differentiate JustSpeak from other voice control systems and offer more intuitive and efficient eyes-free and hands-free interaction.
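To make this architecture concrete, the following is a minimal sketch, in Java, of how a background service in the spirit of JustSpeak might derive its command set from on-screen labels and execute a chained utterance. The paper does not publish source code: AccessibilityService and AccessibilityNodeInfo are real public Android APIs, but the verb stripping, label matching, and chaining logic shown here are illustrative assumptions only, not the authors' implementation.

    import android.accessibilityservice.AccessibilityService;
    import android.view.accessibility.AccessibilityEvent;
    import android.view.accessibility.AccessibilityNodeInfo;
    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical sketch of a JustSpeak-like background service: commands
    // are synthesized from whatever labels the current window exposes,
    // not from a fixed grammar, so any accessible application is covered.
    public class VoiceControlService extends AccessibilityService {

        @Override
        public void onAccessibilityEvent(AccessibilityEvent event) {
            // Window content changed; a real system would rebuild its
            // recognition context (the set of visible labels) here.
        }

        @Override
        public void onInterrupt() {
        }

        // Collect every clickable, labeled node in the active window.
        private List<AccessibilityNodeInfo> labeledTargets() {
            List<AccessibilityNodeInfo> targets = new ArrayList<>();
            collect(getRootInActiveWindow(), targets);
            return targets;
        }

        private void collect(AccessibilityNodeInfo node,
                             List<AccessibilityNodeInfo> out) {
            if (node == null) return;
            if (node.isClickable() && node.getText() != null) out.add(node);
            for (int i = 0; i < node.getChildCount(); i++) {
                collect(node.getChild(i), out);
            }
        }

        // Execute a chained utterance such as "open Gmail then refresh":
        // split on "then", strip a leading action verb, and click the first
        // on-screen label containing the remaining words. A real system
        // would wait for each window transition before the next sub-command.
        public void execute(String utterance) {
            for (String part : utterance.toLowerCase().split("\\bthen\\b")) {
                String target =
                        part.replaceFirst("^\\s*(open|click|tap|press)\\s+", "").trim();
                for (AccessibilityNodeInfo node : labeledTargets()) {
                    if (node.getText().toString().toLowerCase().contains(target)) {
                        node.performAction(AccessibilityNodeInfo.ACTION_CLICK);
                        break;
                    }
                }
            }
        }
    }

The sketch only shows the core idea of deriving commands from accessibility metadata; the actual system also handles actions beyond node clicks, such as launching applications and toggling switches, and must disambiguate when several labels match an utterance.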
The contributions of this paper include:

• a mobile application, JustSpeak, that enables universal voice control on Android devices;

• a description of the system design that enables adaptive command parsing and chaining; and

• a discussion of leveraging speech input to enhance interaction for blind people and people with dexterity issues.

RELATED WORK

Non-visual Interaction

As described above, nearly all computing devices were inaccessible to non-visual users before the introduction of screen readers [12]. Over the last two decades, many screen readers have been successful in providing accessibility for blind users on multiple platforms: for example, JAWS [7] on Windows personal computers, WebAnywhere [3] for the web, VoiceOver [1] on Macintosh computers and iOS, and TalkBack [6] on Android devices. These programs identify and interpret what is being displayed on the screen and then re-present the interface to users with text-to-speech, sound icons, or a braille output device. With the non-visual feedback of screen readers, blind people can explore the user interface, and locate and manipulate on-screen objects.

Screen readers have largely improved the accessibility of GUIs for blind users. However, blind users still need to adapt to traditional input techniques that require target discovery. Those techniques are designed for visual interaction and hence require additional effort from blind people. For example, on non-touch devices, blind people often use keyboard shortcuts to iterate through the interface linearly (from top to bottom, left to right) in order to find an object, and they use gestures similarly on touch interfaces.

For sighted people who are not familiar with screen readers, few works allow them to interact with their devices non-visually; mostly they have to wait until visual access to the device is possible. Although the non-visual interaction problem is often merely a nuisance for them, it can sometimes cause great inconvenience, for instance, when someone wants to check a map while driving a vehicle.

Speech Recognition Services

Speech recognition, also known as speech-to-text or automatic speech recognition (ASR), has been studied for more than two decades [14] and has recently been used in various commercial products, ranging from speech input (e.g., customer service hotlines and voice IMEs) to automatic virtual agents such as Siri and Google Now, which are discussed in the following section.

Recognition of natural spoken speech is a very complex problem because vocalizations vary in terms of accent, pronunciation, pitch, volume, speed, etc. The performance of a speech recognition engine is usually evaluated in terms of accuracy and speed. There are a large number of ASR services, both in academia as research projects [9, 13] and in industry as commercial products [8, 10].

In the framework of JustSpeak, we employ the Google ASR service [10] to recognize users' spoken speech. Google has been working on speech processing for a long time, and its voice recognizer is widely used in multiple products across different platforms, including voice search on mobile phones and webpages, voice IMEs, and YouTube [5] video closed captioning.
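The paper does not detail the client-side integration with the recognizer. As one plausible route, the sketch below uses the public android.speech.SpeechRecognizer API, which on most Android devices is backed by Google's recognition service, to obtain a transcript that a JustSpeak-like parser could consume. The listenOnce method and Callback interface are hypothetical wiring, not the authors' code.

    import android.content.Context;
    import android.content.Intent;
    import android.os.Bundle;
    import android.speech.RecognitionListener;
    import android.speech.RecognizerIntent;
    import android.speech.SpeechRecognizer;
    import java.util.ArrayList;

    // Sketch: capture one utterance with the platform recognizer and hand
    // the best transcript to a command parser. The calling app must hold
    // the RECORD_AUDIO permission.
    public class UtteranceRecognizer {

        public interface Callback {
            void onTranscript(String text);
        }

        public static void listenOnce(Context context, final Callback callback) {
            final SpeechRecognizer recognizer =
                    SpeechRecognizer.createSpeechRecognizer(context);
            recognizer.setRecognitionListener(new RecognitionListener() {
                @Override
                public void onResults(Bundle results) {
                    ArrayList<String> hypotheses = results
                            .getStringArrayList(SpeechRecognizer.RESULTS_RECOGNITION);
                    if (hypotheses != null && !hypotheses.isEmpty()) {
                        callback.onTranscript(hypotheses.get(0)); // best hypothesis
                    }
                    recognizer.destroy();
                }

                @Override
                public void onError(int error) {
                    recognizer.destroy();
                }

                // Remaining callbacks are unused in this sketch.
                @Override public void onReadyForSpeech(Bundle params) {}
                @Override public void onBeginningOfSpeech() {}
                @Override public void onRmsChanged(float rmsdB) {}
                @Override public void onBufferReceived(byte[] buffer) {}
                @Override public void onEndOfSpeech() {}
                @Override public void onPartialResults(Bundle partialResults) {}
                @Override public void onEvent(int eventType, Bundle params) {}
            });

            // Free-form language model, since commands are synthesized from
            // arbitrary on-screen labels rather than a fixed phrase list.
            Intent intent = new Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH);
            intent.putExtra(RecognizerIntent.EXTRA_LANGUAGE_MODEL,
                    RecognizerIntent.LANGUAGE_MODEL_FREE_FORM);
            recognizer.startListening(intent);
        }
    }

A service like the earlier VoiceControlService sketch could call UtteranceRecognizer.listenOnce and pass the returned transcript to its execute method, closing the loop from spoken audio to on-screen action.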