Enabling Non-Speech Experts to Develop Usable Speech-User Interfaces
Anuj Kumar

CMU-HCII-14-105
August 2014

Human-Computer Interaction Institute
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213 USA

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Copyright © Anuj Kumar 2014. All rights reserved.

Committee:
Florian Metze, Co-Chair, Carnegie Mellon University
Matthew Kam, Co-Chair, Carnegie Mellon University & American Institutes for Research
Dan Siewiorek, Carnegie Mellon University
Tim Paek, Microsoft Research

The research described in this dissertation was supported by the National Science Foundation under grants IIS-1247368 and CNS-1205589, Siebel Scholar Fellowship 2014, Carnegie Mellon’s Human-Computer Interaction Institute, and Nokia. Any findings, conclusions, or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the above organizations or corporations.

KEYWORDS

Human-Computer Interaction, Machine Learning, Non-Experts, Rapid Prototyping, Speech-User Interfaces, Speech Recognition, Toolkit Development.

Dedicated to Mom and Dad for their eternal support, motivation, and love

ABSTRACT

Speech user interfaces (SUIs) such as Apple’s Siri, Microsoft’s Cortana, and Google Now are becoming increasingly popular. However, despite years of research, such interfaces really only work for specific users, such as adult native speakers of English, when in fact, many other users such as non-native speakers or children stand to benefit at least as much, if not more. The problem in developing SUIs for such users or for other acoustic or language situations is the expertise, time, and cost in building an initial system that works reasonably well, and can be deployed to collect more data or also to establish a group of loyal users.
In particular, application developers or researchers who are not speech experts find it excruciatingly difficult to build a testable speech interface on their own, and instead routinely resort to Wizard-of-Oz experiments.

To address the above problem, we take the view that while it can take a prohibitive amount of time and cost to train non-experts in the nuances of speech recognition and user-interface development, well-trained speech experts and user-interface specialists who routinely build working recognizers have accumulated years of experiential knowledge that we can study and formalize for the benefit of non-experts. Moreover, the core speech recognition technology has reached a point where, given enough expertise and in-domain data, a working system can be developed for almost every user group and acoustic or language situation. To this end, we design, develop, and evaluate a speech toolkit called SToNE, which embeds expert knowledge and lowers the barrier to entry for non-experts into the design and development space of speech systems. Our goal is not to render the speech expert superfluous, but to make it easier for non-speech experts to figure out why a speech system is failing, and to guide their efforts in the right direction.

We investigate three research goals: (i) how can we elicit and formalize the tacit knowledge that speech experts employ in building an accurate recognizer, (ii) what are the different analysis supports – automatic or semi-automatic – that we can develop to enable speech recognizer development by non-experts, and (iii) to what extent do non-experts benefit from SToNE. Through both lab experiments with new datasets and summative evaluations with non-experts, we show that with the support of SToNE, non-experts are able to build recognizers with accuracy similar to that of experts, and achieve significant gains over what they achieve when SToNE support is unavailable to them.

This work aims to support the “black art” in SUI development.
It contributes to human-computer interaction by developing tools that support non-speech experts in building usable SUIs. It also contributes to speech technologies by formalizing experts’ knowledge and offering a set of tools to analyze speech data systematically.

ACKNOWLEDGEMENTS

A PhD is not just a destination; it is a journey. It is nearly impossible to enumerate and write about the people who have made this journey both intellectually stimulating and fun for me, as any attempt will be incomplete and any words will be far too few. Nonetheless, here’s my attempt. It is the very least I can do.

Research cannot happen in a vacuum – unless, of course, you are an experimental astrophysicist. A number of amazing people across academia and industry have played a huge role in shaping my experience as a PhD researcher, and I have been immensely privileged to be around them.

First and foremost, I have had the great luck of being advised by two remarkable researchers: Florian Metze and Matthew Kam. Under their wings, I have learnt how to be passionate about a research topic, how to think (or not to) when you are stuck, how to break down an insurmountable problem into smaller pieces, and how to be relevant and aim big. I would like to thank both of them for supporting me throughout my time at CMU, no matter what.

Florian Metze and I go back to my early days as a PhD student when I took his course on “Speech Recognition and Understanding.” It was this course that got me interested in Speech Recognition, so much so that I decided to pursue it more seriously in my PhD thesis with him. Florian has a rare gift among advisors: the ability to delicately balance a hands-on/hands-off approach and lead by example. This allows his students to thrive, to never fall back, to always have someone to look up to, and to be able to slowly nurture their own paths that they enjoy. This is what makes the PhD journey both exciting and fruitful.
It makes the students who work with him feel like they are working with a true collaborator, instead of a boss. He also has a great technical repertoire across multiple fields, which always gave me a sense of confidence to pop into his office and ask any question under the sun. I always knew that, whatever the question, Florian might have an answer or, if not something definite, at least a direction to think in. Florian has been a great advisor and a friend to brainstorm many research ideas with. He has the ability to take a topic, flesh it out, and make others in the room enthusiastic about it. As a result, I have not only been able to work on my own thesis, but also discuss and submit a number of federal and industry grant proposals with him, many of which were awarded. It is these qualities that I admire most. He has enthusiastically supported me throughout my time at CMU, and for that I will always be indebted.

Matthew Kam and I go way back to my sophomore days as an undergraduate student in India. He was first my undergraduate mentor and later, my PhD thesis advisor at CMU. It is because of Matt that I got excited about research, understood what high-quality research is all about, and got trained on how to properly conduct a research agenda. Matt inspired me to pursue graduate studies then, and had it not been for his guidance during my undergraduate days, I would not have been adequately prepared for graduate school. I still remember my early days as an undergraduate researcher when I had no understanding of what research meant, and considered writing somewhat functional code as research. From there, I grew confident in thinking about my own research agenda, leading small research groups including multi-disciplinary participants, writing my own top-tier publications, and presenting my work to a large audience. Matt has taught me to look at the big picture and to take risks.
I still remember his words about having a “helicopter vision,” which is often necessary to stay focused. Matt invests in people, rather than ideas. He invested in me, allowed me to explore ideas in my early days as a PhD student, and encouraged me all the way. He won’t let his students sell themselves short, and heavily protects their interests, even if it means fighting tooth and nail for them. This is a profoundly motivating and supportive method of advising, for which I am incredibly thankful.

I would also like to thank my other committee members: Dan Siewiorek and Tim Paek. Both of them have been a source of great support and imparted sage advice over the years. Dan and I met around the middle of my PhD career when I took his class on “Mobile and Pervasive Computing,” and I was instantly awestruck by his sheer technical brilliance coupled with decades of experience. He had a number of war stories to narrate, ranging from “why things will nearly always go wrong when you don’t want them to, and how to prepare for it,” to “how you should thrive in a multidisciplinary environment.” It is these short but extremely interesting lessons that made him a perfect committee member, and I have always felt very comfortable bouncing ideas off of him, and getting his opinion on research. For this support, I am very grateful.

Tim Paek and I met at a workshop on Speech Recognition and Human-Computer Interaction, which he was organizing, and immediately after that I went on to intern with him at Microsoft Research for the summer. Those were among the most fascinating three months of my research life. Tim proves that you can be both a highly successful researcher and the nicest person in the world. His focus on thinking about research from both a scientific standpoint and an industry standpoint still stands out.