Designing Speech and Multimodal Interactions for Mobile, Wearable, and Pervasive Applications
Cosmin Munteanu, University of Toronto Mississauga
Pourang Irani, University of Manitoba
Sharon Oviatt, Incaa Designs
Matthew Aylett, CereProc
Gerald Penn, University of Toronto
Shimei Pan, University of Maryland, Baltimore County
Nikhil Sharma, Google, Inc.
Frank Rudzicz, Toronto Rehabilitation Institute / University of Toronto
Randy Gomez, Honda Research Institute
Keisuke Nakamura, Honda Research Institute
Kazuhiro Nakadai, Honda Research Institute

Abstract
Traditional interfaces are continuously being replaced by mobile, wearable, or pervasive interfaces. Yet when it comes to the input and output modalities enabling our interactions, we have yet to fully embrace some of the most natural forms of communication and information processing that humans possess: speech, language, gestures, thoughts. Very little HCI attention has been dedicated to designing and developing spoken language and multimodal interaction techniques, especially for mobile and wearable devices. In addition to the enormous recent engineering progress in processing such modalities, there is now sufficient evidence that many real-life applications do not require 100% accuracy of processing multimodal input to be useful, particularly if such modalities complement each other. This multidisciplinary, two-day workshop will bring together interaction designers, usability researchers, and general HCI practitioners to analyze the opportunities and directions to take in designing more natural interactions with mobile and wearable devices, and to look at how we can leverage recent advances in speech and multimodal processing.

ACM Classification Keywords
H.5.2 [User interfaces]: Voice I/O, Natural language, User-centered design, and Evaluation/methodology.

CHI'16 Extended Abstracts, May 07-12, 2016, San Jose, CA, USA. ACM 978-1-4503-4082-3/16/05. http://dx.doi.org/10.1145/2851581.2856506

Introduction and Motivation
During the past decade we have witnessed dramatic changes in the way people access information and store knowledge, mainly due to the ubiquity of mobile and pervasive computing and affordable broadband Internet. Such recent developments have presented us with opportunities to reclaim naturalness as a central theme for interaction. We have seen this happen with touch for mobile computing; it is now time to see it happen for other interaction modalities as well.

Unfortunately, communication through speech and language, while undoubtedly natural, is also among the most difficult modalities for machines, as these are the highest-bandwidth two-way communication channels we have. While significant effort in the engineering, linguistic, and cognitive sciences has been spent on improving machines' ability to understand speech and natural language, these have often been neglected as interaction modalities [1].

The challenges in enabling such natural interactions have often led to these modalities being treated as error-prone alternatives to “traditional” input or output mechanisms. This should not be a reason to abandon speech interaction¹ – people now face many more situations requiring hands- and eyes-free interaction. In fact, achieving 100% accurate speech processing may not be necessary: proper interaction design can complement speech processing in ways that compensate for its lack of accuracy [4,8]; and, in many tasks where users interact with spoken information, verbatim transcription is often not relevant [9].

¹ We use the terms speech and speech interaction to denote both verbal and text-based interaction.
Recent commercial applications (e.g. personal digital assistants), made possible particularly by advances in speech processing and machine learning, have brought renewed attention to speech- and multimodal-based interaction. Yet, these are still perceived as “more of a gimmick” [2] or simply not “that good” [3], which can be partly attributed to a mismatch between the affordance of speech as an input modality and the reality of speech as a content creation modality. Unfortunately, there is very little HCI research, with rare exceptions such as our CHI'14 workshop upon which we are building [8], on how to leverage the recent engineering progress into developing more natural, effective, or accessible UIs.

At the same time, as wearable devices have gained prominence in the ecosystem of interconnected devices, novel approaches for interacting with digital content on such devices will be necessary. Such device form factors present new challenges (and opportunities) for multimodal input capabilities, through speech and audio processing, brain-computer interfaces, gestural input, and electromyography (EMG) interaction, among other modalities.

Concurrently, we are seeing how significant developments in miniaturization are transforming wearable devices, such as smartwatches, smart glasses, smart textiles, and body-worn sensors, into a viable ubiquitous computing platform. The varied usage contexts and applications in which such emerging devices are being deployed create new opportunities and challenges for digital content interaction.

While the wearable computing platform is at its early developmental stages, what is currently unclear or ambiguous is the range of interaction possibilities that are necessary and possible to engage device wearers with digital content on such devices. The diverse usage contexts, form factors, and applications call for a multimodal perspective on content interaction through wearables. Because of the explosion of new modalities and sensors on smartphones and wearables, the multimodal-multisensor fusion of information sources is now possible. It is thus important to consider the broad range of input modalities, including speech and audio input, touch and mid-air gestures, EMG input, and BCI, among other modalities.
Furthermore, as such a platform offers broad usage contexts, we believe that the erstwhile ‘one size fits all’ paradigm may be less applicable; instead, our communities need to pave the way for crossing multiple input modalities to identify which modalities may co-exist for varied usage contexts.

Figure 1: Popular media view of speech-enabled mobile personal assistants [3]

Goals
In light of such barriers and opportunities, this workshop aims to foster an interdisciplinary dialogue and create momentum for increased research and collaboration in:
• Formally framing the challenges to the widespread adoption of speech and natural language interaction,
• Taking concrete steps toward developing a framework of user-centric design guidelines for speech- and language-based interactive systems, grounded in good usability practices,
• Establishing directions to take and identifying further research opportunities in designing more natural interactions that make use of speech and natural language, and
• Identifying key challenges and opportunities for enabling and designing multi-input modalities for a wide range of wearable device classes.

Topics
We are proposing to build upon the discussions started during our lively-debated and highly-engaging panel on speech interaction held at CHI 2013 [6], which was followed by a successful workshop (20+ participants) on speech and language interaction held at CHI 2014 [7]. The interest attracted by the 2013 panel and the 2014 workshop has resulted in the organizers receiving many requests for further development of the workshop. Additionally, a course on speech interaction that has been offered at CHI for the past five years [5] by two of the co-authors of the present proposal has always been well-attended. As such, we are proposing here to expand the workshop to two days and to enlarge its scope to speech and multimodal interaction for mobile, wearable, and pervasive computing. We therefore propose several topics for discussion and activity:
• What are the important challenges in using speech as a “mainstream” modality? Speech is increasingly present in commercial applications – can we characterize which other applications speech is suitable for or has the highest potential to help with?
• What interaction opportunities are presented by the rapidly evolving mobile, wearable, and pervasive computing areas? How and how much does multimodal processing increase robustness over speech alone, and in what contexts?
• Can speech and multimodal interaction increase the usability and robustness of interfaces and improve user experience beyond input/output?
• What can the CHI community learn from the Automatic Speech Recognition (ASR), Text-to-Speech Synthesis (TTS), and Natural Language Processing (NLP) research, and in turn, how can it help these communities
• What are the new contexts for wearable use, including hands-busy, cognitively demanding situations and perhaps even unconscious and unintentional use (in the case of body-worn sensors)? Wearables may have form