
Designing Speech and Multimodal Interactions for Mobile, Wearable, and Pervasive Applications

Cosmin Munteanu, University of Toronto Mississauga
Nikhil Sharma, Google, Inc.
Frank Rudzicz, Toronto Rehabilitation Institute and University of Toronto
Pourang Irani, University of Manitoba
Sharon Oviatt, Incaa Designs
Randy Gomez, Honda Research Institute
Matthew Aylett, CereProc
Keisuke Nakamura, Honda Research Institute
Gerald Penn, University of Toronto
Kazuhiro Nakadai, Honda Research Institute
Shimei Pan, University of Maryland, Baltimore County

Abstract
Traditional interfaces are continuously being replaced by mobile, wearable, or pervasive interfaces. Yet when it comes to the input and output modalities enabling our interactions, we have yet to fully embrace some of the most natural forms of communication and information processing that humans possess: speech, language, gestures, thoughts. Very little HCI attention has been dedicated to designing and developing spoken language and multimodal interaction techniques, especially for mobile and wearable devices. In addition to the enormous, recent engineering progress in processing such modalities, there is now sufficient evidence that many real-life applications do not require 100% accuracy of processing multimodal input to be useful, particularly if such modalities complement each other. This multidisciplinary, two-day workshop will bring together interaction designers, usability researchers, and general HCI practitioners to analyze the opportunities and directions to take in designing more natural interactions with mobile and wearable devices, and to look at how we can leverage recent advances in speech and multimodal processing.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author. Copyright is held by the owner/author(s).
CHI'16 Extended Abstracts, May 07-12, 2016, San Jose, CA, USA
ACM 978-1-4503-4082-3/16/05.
http://dx.doi.org/10.1145/2851581.2856506

ACM Classification Keywords
H.5.2 [User interfaces]: Voice I/O, Natural language, User-centered design, and Evaluation/methodology.

Introduction and Motivation
During the past decade we have witnessed dramatic changes in the way people access information and store knowledge, mainly due to the ubiquity of mobile and pervasive computing and affordable broadband Internet. Such recent developments have presented us with opportunities to reclaim naturalness as a central theme for interaction. We have seen this happen with touch for mobile computing; it is now time to see this for other interaction modalities as well.

At the same time, as wearable devices gain prominence in the ecosystem of interconnected devices, novel approaches for interacting with digital content on such devices will be necessary. Such device form factors present new challenges (and opportunities) for multimodal input capabilities, through speech and audio processing, brain-computer interfaces, gestural input, and electromyography (EMG) interaction, among other modalities.

Unfortunately, communication through speech and language, while undoubtedly natural, is also among the most difficult modalities for machines to process, as these are the highest-bandwidth two-way communication channels we have. While significant effort in engineering, linguistic, and cognitive sciences has been spent on improving machines' ability to understand speech and natural language, these have often been neglected as interaction modalities [1].

The challenges in enabling such natural interactions have often led to these modalities being treated as error-prone alternatives to "traditional" input or output mechanisms. This should not be a reason to abandon speech interaction¹ – people now face many more situations requiring hands- and eyes-free interaction. In fact, achieving 100% accurate speech processing may not be necessary: proper interaction design can complement speech processing in ways that compensate for its lack of accuracy [4,8]; and, in many tasks where users interact with spoken information, verbatim transcription is often not relevant [9].

¹ We use the terms speech and speech interaction to denote both verbal and text-based interaction.

Recent commercial applications (e.g. personal digital assistants), made possible particularly by advances in speech processing and machine learning, have brought renewed attention to speech- and multimodal-based interaction. Yet, these are still perceived as "more of a gimmick" [2] or simply not "that good" [3], which can be partly attributed to a mismatch between the affordance of speech as an input modality and the reality of speech as a content creation modality. Unfortunately, there is very little HCI research, with rare exceptions such as our CHI'14 workshop upon which we are building [7], on how to leverage the recent engineering progress into developing more natural, effective, or accessible UIs.

Concurrently, we are seeing how significant developments in miniaturization are transforming wearable devices, such as smartwatches, smart glasses, smart textiles, and body-worn sensors, into a viable ubiquitous computing platform. The varied usage contexts and applications in which such emerging devices are being deployed create new opportunities and challenges for digital content interaction.

While the wearable computing platform is at its early developmental stages, what is currently unclear or ambiguous is the range of interaction possibilities that are necessary and possible to engage device wearers with digital content on such devices. The diverse usage contexts, form factors, and applications call for a multimodal perspective on content interaction through wearables. Because of the explosion of new modalities and sensors on smartphones and wearables, the multimodal-multisensor fusion of information sources is now possible. It is thus important to consider the broad range of input modalities, including speech and audio input, touch and mid-air gestures, EMG input, and BCI, among other modalities. Furthermore, as such a platform offers broad usage contexts, we believe that the erstwhile 'one size fits all' paradigm may be less applicable; instead, our communities need to pave the way for crossing multiple input modalities to identify which modalities may co-exist for varied usage contexts.

Goals
In light of such barriers and opportunities, this workshop aims to foster an interdisciplinary dialogue and create momentum for increased research and collaboration in:
• Formally framing the challenges to the widespread adoption of speech and natural language interaction,
• Taking concrete steps toward developing a framework of user-centric design guidelines for speech- and language-based interactive systems, grounded in good usability practices,
• Establishing directions to take and identifying further research opportunities in designing more natural interactions that make use of speech and natural language, and
• Identifying key challenges and opportunities for enabling and designing multi-input modalities for a wide range of wearable device classes.

Topics
We are proposing to build upon the discussions started during our lively-debated and highly-engaging panel on speech interaction held at CHI 2013 [6], which was followed by a successful workshop (20+ participants) on speech and language interaction held at CHI 2014 [7]. The interest attracted by the 2013 panel and the 2014 workshop has resulted in the organizers receiving many requests for further development of the workshop. Additionally, a course on speech interaction that has been offered at CHI for the past five years [5] by two of the co-authors of the present proposal has always been well-attended. As such, we are proposing here to expand such a workshop to two days and to enlarge its scope to speech and multimodal interaction for mobile, wearable, and pervasive computing. We therefore propose several topics for discussion and activity:
[Figure 1: Popular media view of speech-enabled mobile personal assistants [3]]

• What are the important challenges in using speech as a "mainstream" modality? Speech is increasingly present in commercial applications – can we characterize which other applications speech is suitable for or has the highest potential to help with?
• What interaction opportunities are presented by the rapidly evolving mobile, wearable, and pervasive computing areas? How and how much does multimodal processing increase robustness over speech alone, and in what contexts?
• Can speech and multimodal interaction increase the usability and robustness of interfaces and improve user experience beyond input/output?
• What can the CHI community learn from Automatic Speech Recognition (ASR), Text-to-Speech Synthesis (TTS), and Natural Language Processing (NLP) research, and in turn, how can it help these communities improve the user acceptance of such technologies? For example, what should we be asking them to extract from speech besides words/segments? How can work in context and discourse understanding or dialogue management shape research in speech and multimodal UI? And can we bridge the divide between the evaluation methods used in HCI and the AI-like batch evaluations used in speech processing?
• How can UI designers make better use of the acoustic-prosodic information in speech beyond simple word recognition, such as recognizing emotion or identifying users' cognitive states?
• What are the usability challenges of synthetic speech? How can expressiveness and naturalness be incorporated into interface design guidelines, particularly in mobile or wearable contexts where text-to-speech could potentially play a significant role in users' experiences? And how can this be generalized to designing usable UIs for mobile and pervasive (in-car, in-home) applications that rely on multimedia response generation?
• What are the opportunities and challenges for speech and multimodal interaction with regards to the spontaneous access to information afforded by wearable and mobile devices? And can such modalities facilitate access in a secure and personal manner, especially since mobile and wearable interfaces raise significant privacy concerns?
• What are the implications for the design of speech and multimodal interaction presented by the new contexts for wearable use, including hands-busy, cognitively demanding situations and perhaps even unconscious and unintentional use (in the case of body-worn sensors)? Wearables may have form factors that verge on being 'invisible' or inaccessible to direct touch. Such reliance on sensors requires clearer conceptual analyses of how to combine active input modes with passive sensors to deliver optimal functionality and ease of use. And what role can understanding users' context (hands or eyes busy) play in selecting the best modality for such interactions or in predicting user needs?

CHI Contributions and Benefits
The proposed topics address speech and multimodal interaction, as well as its application to areas such as wearable and mobile devices. This is currently receiving significant attention in the commercial space and in areas outside HCI, yet it is a marginal topic at CHI. Only a limited number of research papers, panels, workshops, and tutorials are presented by researchers other than the authors of this proposal (to our knowledge, the last CHI workshop on this topic not organized by the present authors was held in 1997). As such, we hope that the proposed workshop will increase the collaboration between researchers and practitioners belonging to the CHI community and those mostly dedicated to speech, language, and multimodal technologies. The workshop organizers are also finalizing industry sponsorships that will, at a minimum, allow us to invite guest speakers who will attract further interest and opportunities for collaboration between academia and industry. Given the location of CHI 2016, in close proximity to many of the "tech giants" who are heavily invested in natural user interfaces, we believe the timing and conditions are favourable for the organization of this workshop.

The organizing committee for this workshop (the authors list) is living proof that CHI is the most appropriate venue to initiate such inter-disciplinary collaborations. We believe that the rich diversity of this committee's professional networks, spanning several research communities, will ensure that this workshop will attract many new participants to CHI 2016.

Our workshop aims to develop speech and multimodal interaction into a well-established area of study within HCI, leveraging current engineering advances in ASR, NLP, TTS, multimodal/gesture recognition, and brain-computer interfaces. In return, advances in HCI can contribute to creating NLP and ASR algorithms that are informed by, and better address, the usability challenges of speech and multimodal interfaces.

We also aim to increase the cohesion between research currently dispersed across many areas, including HCI, wearable design, ASR, NLP, BCI complementing speech, EMG interaction, and eye-gaze input.
Our hope is that, by focusing and challenging the research community on multi-input modalities for wearables, we will energize the CHI and engineering communities to push the boundaries of what is possible with wearable, mobile, and pervasive computing, but also make advances in each of the respective communities. As an example, the recent significant breakthroughs in deep neural networks are largely confined to audio-only features, while there is a significant opportunity to incorporate other features and context (such as multimodal input for wearables) into this framework. We anticipate this can only be accomplished by closer collaboration between the speech and the HCI communities.

Our ultimate goal is to cross-pollinate ideas from the activities and priorities of different disciplines. With its unique format and reach, a CHI workshop offers the opportunity to strengthen future approaches and unify practices moving forward. The CHI community can be a host to researchers from other disciplines, with the goal of advancing multimodal interaction design.

Workshop plan
Before the workshop
[November] Configure the EasyChair conference system for paper submission and reviews, and launch the workshop website www.dgp.toronto.edu/dsli2016/.
[By Nov 20th] Publicize the workshop through e-mail, distribution lists, social media, and through posters and in person at conferences. The workshop organizing committee may invite specific submissions from a small group of researchers deemed valuable for the interactions and discussions during the workshop.
[Early December] Review the early round of position papers and notify authors of invited submissions of early acceptance.
[December] Review all position papers.
[By Jan 25th] Send acceptance notifications.
[February 8th] Camera-ready deadline.
[March] Upload papers on the workshop website and create preliminary break-out groups. Invite participants to read the position papers of those in the same group.
[April] Send the workshop agenda to participants.

During the workshop
Participants: 25

Activities for day one, with a focus on the theoretical and methodological challenges of designing speech and language interactions:
• Introductions from each participant, including a 2-minute overview of their research [1 hour]
• Guest speaker and brief presentations of papers related to speech and language interaction [2 hours]
• Group discussions [0.5 hour]
• "Birds of a feather" sessions, each addressing one of the workshop's goals (specific goals may be adjusted based on participants' input) [1.5 hours]
• Reporting from break-out sessions [1 hour]

Activities for day two, with a focus on specific application areas:
• Guest speaker and brief presentations of papers related to the application of multimodal interactions (speech, language, EEG, gestures, etc.) to wearable and mobile interfaces [2 hours]
• "Matchmaking" – facilitate future collaborations between HCI and speech processing researchers by looking at challenges encountered in one area that can be solved with help from the other area. This will be conducted following the template of Robert Dilts' "Disney Brainstorming Method", in which groups progress through stages of idea analysis, generation, evaluation, and planning [2 hours]
• Drafting of workshop conclusions, proposed framework, and findings, and determining an after-workshop action plan (e.g. a special-issue journal proposal) [2 hours]

After the workshop
[1 month] Finalize the action plan and send a workshop follow-up report to all participants.
[2-4 months] Follow-up: development of a proposal for a special-edition journal or a proposal for a grand-challenge-type workshop.
[5-8 months] Carry out the proposed action plan.

Workshop URL: http://www.dgp.toronto.edu/dsli2016/

Participants
The workshop aims to attract position papers and participation from a diverse audience within the CHI community, of varied backgrounds, with an interest in, or record of, research, design, or practice in areas related to speech, natural language interaction, multimodal interaction, brain-computer interfaces as used to complement speech, and natural user interfaces, especially as applied to mobile or wearable devices. Participants with a speech recognition, synthesis, or general language processing background, but with an interest in the HCI aspects of speech-based interaction, will be particularly welcome (as emphasized in the call for papers). Based on our past experiences (panel, previous workshop, and related CHI courses), we expect that the workshop will attract contributions from a truly multidisciplinary audience. The members of the organizing committee themselves have backgrounds that complement each other, ranging from significant work on core speech processing issues, to natural and multimodal interfaces, to designing intelligent wearable and mobile interfaces. We are confident that this diversity will ensure that the proposed workshop will be engaging and fruitful.

Sample Call for Papers
This workshop aims to bring together interaction designers, usability researchers, and general HCI and speech processing practitioners. Our goal is to create, through an interdisciplinary dialogue, momentum for increased research and collaboration in:
• Formally framing the challenges to the widespread adoption of speech and natural language interaction,
• Taking concrete steps toward developing a framework of user-centric design guidelines for speech- and language-based interactive systems, grounded in good usability practices,
• Establishing directions to take and identifying further research opportunities in designing more natural interactions that make use of speech and natural language, and
• Identifying key challenges and opportunities for enabling and designing multi-input modalities for a wide range of wearable device classes.

We invite the submission of position papers demonstrating research, design, practice, or interest in areas related to speech, language, and multimodal interaction that address one or more of the workshop goals, with an emphasis on, but not limited to, applications such as mobile, wearable, or pervasive computing.
Position papers should be 4-6 pages long, in the ACM SIGCHI extended abstract format, and include a brief statement justifying the fit with the workshop's topic. Summaries of previous research are welcome if they contribute to the workshop's multidisciplinary goals (e.g. speech processing research in clear need of HCI expertise). Submissions will be reviewed according to:
• Fit with the workshop topic,
• Potential to contribute to the workshop goals, and
• A demonstrated track record of research in the workshop area (HCI or speech/multimodal processing, with an interest in both areas).

Important Dates:
• January 13th, 2016: Submission of position papers
• January 20th, 2016: Notification of acceptance
• February 10th, 2016: Submission of camera-ready accepted papers

Authors' Biographies
Prof. Cosmin Munteanu is an Assistant Professor at the Institute for Communication, Culture, Information, and Technology at the University of Toronto Mississauga. His research includes speech and natural language interaction for mobile devices, mixed reality systems, learning technologies for marginalized users, usable privacy and cyber-safety, assistive technologies for older adults, and ethics in human-computer interaction research. http://cosmin.taglab.ca

Prof. Pourang Irani is a Professor in Human-Computer Interaction at the University of Manitoba and Canada Research Chair in Ubiquitous Analytics. His research sits at the crossroads of mobile/wearable computing and data visualization. http://www.cs.umanitoba.ca/~irani

Dr. Sharon Oviatt is internationally known for her multidisciplinary research on human-centered interfaces, multimodal and mobile interfaces, spoken language and digital pen interfaces, educational interfaces, technology design and evaluation, and the impact of interface design on human cognition. Her latest textbook, The Paradigm Shift to Multimodality in Contemporary Computer Interfaces (co-authored with Phil Cohen), was published by Morgan & Claypool in 2015.

Dr. Matthew Aylett, Chief Science Officer of CereProc, has over 15 years' experience in commercial speech synthesis and research. He is a founder of CereProc, which offers unique emotional and characterful synthesis solutions, and has been awarded a Royal Society Industrial Fellowship to explore the role of speech synthesis in the perception of character in artificial agents. http://www.cereproc.com

Prof. Gerald Penn is a Professor of Computer Science at the University of Toronto. He is one of the leading scholars in Computational Linguistics, with significant contributions to both the mathematical and the computational study of natural languages. Gerald's publications cover many areas, from Theoretical Linguistics, to Mathematics, to ASR, as well as HCI.

Dr. Shimei Pan is an Assistant Professor at the University of Maryland, Baltimore County (UMBC). Before joining UMBC, Dr. Pan was a research scientist at the IBM T. J. Watson Research Center in New York. Her primary research interests include Natural Language Processing, Multimedia/Multimodal Conversation Systems, User Modeling, and Human-Computer Interaction.

Dr. Nikhil Sharma is a Senior User Experience Researcher at Google, where he leads the user experience research for Voice Search. He has worked on both multimodal and voice-only experiences on a range of devices, including smartphones, smartwatches, and cars.
Dr. Frank Rudzicz is a Scientist at the Toronto Rehabilitation Institute and an Assistant Professor in Computer Science at the University of Toronto. His research focuses on machine learning and signal processing for atypical speech. He is the President of the ACL/ISCA SIG on Speech and Language Processing for Assistive Technologies and a Young Investigator of the Alzheimer's Society. http://www.cs.toronto.edu/~frank

Dr. Randy Gomez is a senior scientist at the Honda Research Institute Japan (HRI-JP). His interests include speech recognition, speech enhancement, multi-modal interaction, and intelligent systems. He currently oversees the research work on robust multimodal interaction with context awareness at HRI-JP.

Dr. Keisuke Nakamura is a senior scientist at the Honda Research Institute Japan. His research interests span robotics, control engineering, signal processing, computational auditory scene analysis, multi-modal integration, and robot audition.

Dr. Kazuhiro Nakadai is a principal researcher at the Honda Research Institute Japan. He also holds concurrent professorship positions at Japanese universities. His research interests include AI, robotics, signal processing, computational auditory scene analysis, multi-modal integration, and robot audition.

References
1. Aylett, M. et al. (2014). None of a CHInd: relationship counselling for HCI and speech technology. In Proc. Alt.CHI '14.
2. Business Insider (2012). Frankly, It's Concerning That Apple Is Still Advertising a Product as Flawed as Siri. http://www.businessinsider.com, 2012.
3. Gizmodo (2011). Siri Is Apple's Broken Promise. http://www.gizmodo.com, 2011.
4. Munteanu, C. et al. (2006). Automatic speech recognition for webcasts: how good is good enough and what to do when it isn't. In Proc. of ICMI.
5. Munteanu, C. and Penn, G. (2015). Speech-based interaction. Course, ACM SIGCHI 2011-2015.
6. Munteanu, C. et al. (2013). We need to talk: HCI and the delicate topic of speech-based interaction. Panel, ACM SIGCHI 2013.
7. Munteanu, C. et al. (2014). Designing Spoken Language Interaction. Workshop, ACM SIGCHI 2014.
8. Oviatt, S. (2003). Advances in Robust Multimodal Interface Design. IEEE Comput. Graph. Appl. 23(5).
9. Penn, G. and Zhu, X. (2008). A critical reassessment of evaluation baselines for speech summarization. In Proc. of ACL-HLT.