Enabling Non-Speech Experts to Develop Usable Speech-User Interfaces

Thesis Proposal

Anuj Kumar
[email protected]
Human-Computer Interaction Institute
School of Computer Science
Carnegie Mellon University

Thesis Committee:
Florian Metze, CMU (chair)
Matthew Kam, CMU & American Institutes for Research (co-chair)
Dan Siewiorek, CMU
Tim Paek, Microsoft Research

Abstract

Speech user interfaces (SUIs) such as Apple's Siri, Samsung's S Voice, and Google's Voice Search are becoming increasingly popular. However, despite years of research, speech interfaces are actively developed only for specific users, such as adult native speakers of English, when in fact less mainstream users, such as non-native English speakers or children, stand to benefit at least as much, if not more. The problem in developing SUIs for such users or for challenging acoustic conditions is the expertise, time, and cost required to build a reasonably accurate speech recognition system. In particular, researchers who are not speech experts find it difficult to build a working speech interface on their own, and instead routinely resort to Wizard-of-Oz experiments for testing their application ideas. Unfortunately, such experiments cannot test real usage of the system. At the same time, speech recognition technology has reached a point where, given enough expertise and in-domain data, a working recognizer can, in principle, be developed for almost any user or acoustic condition.

To address this problem, in this thesis I propose to design, develop, and evaluate a toolkit that lowers the threshold for developing accurate speech systems. In particular, my research goals are three-fold: (i) to discover and formalize the tacit knowledge that speech experts employ while optimizing a recognizer for greater accuracy, (ii) to investigate approaches for supporting the semi-automatic analysis of errors occurring in speech recognition, and (iii) to investigate strategies for improving recognition accuracy. Researchers who are non-specialists in speech recognition can benefit from the proposed toolkit because it will assist them in developing speech recognizers with less cost, time, and expertise than they would otherwise need, recognizers they would be unlikely to develop and experiment with on their own. The goal is not to render the speech expert superfluous, but to make it easier for non-speech experts, i.e. the developers utilizing speech technology, to figure out why a speech system is failing, and to guide their efforts in the right direction.

I have accomplished several steps towards these goals. Using contextual interviews and cognitive task analysis, I have formalized the intuition and technical know-how that speech experts draw on when improving recognition accuracy, in the form of a rule-based knowledge base. I have built a toolkit that supports semi-automatic error analysis and embeds this rule-based knowledge base. Furthermore, by applying these rules to two datasets, I have demonstrated that the recommendations from the knowledge base can lead to recognizers with accuracy at least as high as that of recognizers developed using experts' recommendations. Once recognizers have been reasonably optimized for the target users, we make the case that subtle changes to the user interaction can lead to further gains in usability.
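As a concrete illustration of what such a rule-based knowledge base might look like, the sketch below encodes two expert heuristics as if-then rules in Python. It is a minimal, hypothetical sketch: the fact names, thresholds, and recommendations are invented for illustration and do not reflect the toolkit's actual rule set.

```python
# A minimal, hypothetical sketch of encoding expert heuristics as if-then rules.
# Fact names, thresholds, and recommendations are invented for illustration.
RULES = [
    {
        "name": "speaking_rate_mismatch",
        # Fires when test speech is much slower/faster than the training data.
        "condition": lambda f: abs(f["test_speaking_rate"] - f["train_speaking_rate"]) > 0.5,
        "recommendation": "Collect rate-matched speech or apply speaker adaptation.",
    },
    {
        "name": "low_snr",
        # Fires when the acoustic conditions are noisier than expected.
        "condition": lambda f: f["snr_db"] < 10,
        "recommendation": "Add noise-matched training data or use a noise-robust front end.",
    },
]

def recommend(facts):
    """Return the recommendation of every rule whose condition holds on the observed facts."""
    return [(r["name"], r["recommendation"]) for r in RULES if r["condition"](facts)]

# Illustrative facts describing one failing recognizer (words/sec and dB are made up).
print(recommend({"test_speaking_rate": 2.1, "train_speaking_rate": 3.0, "snr_db": 8}))
```

In the proposed toolkit, rules of this kind are fired by a forward-chaining inference engine over features extracted from the developer's data (Section 5.4.3).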
To demonstrate that subtle interaction changes can improve usability, I discuss the design and evaluation of a novel technique called "Voice Typing", which reduces the recognition error rate in modern-day dictation interfaces simply by modifying the user's interaction style. We see such techniques as part of the strategies that the toolkit will eventually recommend. For the remainder of my PhD, I propose two experiments: first, to evaluate the extent to which non-speech experts can benefit from the guidance provided by the toolkit, and second, to measure the usability gains that the accuracy improvements translate into for end-users. We will run both experiments in the context of a children's dataset that contains over 100 hours of spontaneous and scripted speech. This dataset will enable us to test the efficacy of the toolkit on a challenging speech problem, i.e. large-vocabulary continuous speech recognition for children.

The proposed work attempts to take the "black art" out of speech-user interface development. It contributes to human-computer interaction by developing rapid-development tools for non-speech experts to develop usable SUIs. It contributes to speech technologies by formalizing speech experts' knowledge and offering a set of tools to analyze speech data systematically.

Table of Contents

1. Introduction
   1.1. Current Approaches
   1.2. Proposed Approach
   1.3. Research Goals
   1.4. Contributions of this Thesis
   1.5. Thesis Proposal Organization
2. Case Study #1: Speech Games for Second Language Learners
   2.1. Need for Speech-enabled Second Language Learning Games
   2.2. English Language Competency: Word Reading
   2.3. Game Designs
   2.4. Initial Usability Testing
   2.5. Speech Recognizer Development and Improvements
   2.6. Subsequent User Study
   2.7. Discussion
3. Case Study #2: Voice Typing – A Novel Interaction Technique
   3.1. Voice Typing
   3.2. Related Work: Real-time Speech Decoding and Error Correction Techniques
   3.3. User Study
   3.4. Results
   3.5. Discussion
4. Speech Interfaces: Myths vs. Empirical Reality
   4.1. Myth #1: No data like more data
   4.2. Myth #2: SUIs are unusable by non-native speakers, children, or in noisy environments
   4.3. Myth #3: SUIs need to achieve very high accuracy before they are usable
   4.4. Myth #4: SUIs are complex and expensive to build
5. Capturing and Representing Experts' Understanding in a Knowledge Base
   5.1. Knowledge Elicitation Methodology: Contextual Interviews with Speech Experts
   5.2. Related Work: Expert Systems and Rule-based Knowledge Representation
   5.3. Primer to Speech Recognition
   5.4. Knowledge Formalization
        5.4.1. ChAOTIC Process Framework and Rules
        5.4.2. Characteristics of Knowledge Base
        5.4.3. Expert System and Forward Chaining Inference Engine
        5.4.4. Conflict Resolution
   5.5. Datasets
   5.6. Data Annotation
   5.7. Evaluation Experiments
   5.8. Discussion and Next Steps
6. Design and Development of SToNE: Speech Toolkit for Non-Experts
   6.1. Related Work: Error Analysis in Speech Recognition
        6.1.1. Types of Errors
        6.1.2. Lexical and Prosodic Reasons of Error
   6.2. SToNE Design and Development
        6.2.1. Feature Extractor
        6.2.2. Error Visualizer: Data Selector & Performance Comparer
        6.2.3. Optimization Advisor
7. Proposed Work
   7.1. Evaluation of Toolkit with Non-Speech Experts
   7.2. Evaluation of a Speech Application with End-Users
   7.3. Risks and Fallback Strategies
8. Thesis Overview and Timeline
9. References

1. Introduction

Speech-user interfaces (SUIs) are becoming increasingly popular.
For users in emerging regions, where illiteracy often impedes the use of graphical and textual interfaces, spoken dialog systems are being explored in many domains, such as agriculture [Plauche, '06], health [Sherwani, '09], and entertainment services [Raza, '13]. Similarly, for users in the developed world, several speech interfaces have gained traction. United Airlines' customer care system, which receives an average of two million calls per month, saw caller abandonment drop by 50% after a speech system replaced the touchtone system [Kotelly, '03]. The speech solution introduced shallower menu hierarchies, eyes-free and hands-free operation, and easier entry of non-numeric text [Hura, '08]. Recently introduced mobile assistants such as Apple's Siri, Samsung's S Voice, and Google's Voice Search have further boosted the popularity of speech applications. They assist with everything from simple commands such as "Call John", "Text my wife that I'll be late today", or "Set a wakeup alarm for tomorrow 6am!" to more complicated tasks such as dictation or transcribing a phone conversation.

These examples underscore why many researchers and practitioners are becoming increasingly interested in incorporating speech as an input modality in their applications. Yet it is telling that, beyond a few examples, we have not seen a large number of functional and sufficiently accurate speech services. The reason is the difficulty of building a usable speech user interface (SUI) for a particular user group or acoustic situation when no good, off-the-shelf speech recognizer is available [Laput, '13]: recognizers work reliably when it is known what people say, how they say it, what noises occur, and so on. This is hard to achieve, especially for SUI designers who are not experts in speech recognition, such as designers in human-computer interaction. As a result, many studies that HCI researchers have conducted are "Wizard-of-Oz" studies, i.e. studies using simulated ASR [e.g., Stifelman et al., '13; Tewari, '13]. Unfortunately, while Wizard-of-Oz studies are good for iterating on the design of an application, they cannot uncover real recognition errors or test real system usage [Thomason, '13]. Recognition errors therefore become the leading source of usability issues in speech applications once the product is released [Lai, Karat & Yankelovich, '08].

At the same time, research on automatic speech recognition (ASR) has made great strides in recent years. It has reached a point where, given enough speech expertise and in-domain data, a sufficiently accurate speech recognizer can be developed for any scenario, including non-native accents, background noise, children's voices, and other similar challenges for speech recognition. However, if a speech recognition system does not work as intended, it is generally impossible for a non-expert or an HCI researcher to tell, for example, whether the ASR fails because users speak too slowly ("speaking rate" is a frequent cause of mismatch), too "clearly" (i.e. they "hyper-articulate"), because the background noises are unexpected, or for some other reason, whereas an expert can usually analyze the error pattern quickly. "Adaptation" or other optimizations of an existing recognizer can generally mitigate the problem and often result in a functional system [Abras & Krimchar, 2004; Preece & Rogers, 2002], but knowing what to do and how to do it requires tremendous expertise and experience in developing speech recognition systems.
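To make the kind of analysis an expert performs more concrete, the following is a minimal sketch, assuming time-aligned transcripts are available, of how a speaking-rate mismatch between test utterances and the training data might be detected automatically. The function names, inputs, and z-score threshold are illustrative, not part of the proposed toolkit.

```python
# A minimal sketch of one automatic mismatch check: compare the speaking rate of
# test utterances against the training data and flag outliers. All inputs and the
# z-score threshold are illustrative; the proposed toolkit's interfaces may differ.
from statistics import mean, stdev

def speaking_rate(transcript: str, duration_sec: float) -> float:
    """Words per second for a single utterance."""
    return len(transcript.split()) / duration_sec

def flag_rate_mismatch(train_rates, test_rates, z_threshold=2.0):
    """Return (index, rate, z-score) for test utterances far from the training distribution."""
    mu, sigma = mean(train_rates), stdev(train_rates)
    return [(i, round(r, 2), round((r - mu) / sigma, 1))
            for i, r in enumerate(test_rates)
            if abs(r - mu) / sigma > z_threshold]

# Illustrative numbers only: training rates in words/sec, two test utterances.
train_rates = [2.9, 3.1, 3.0, 3.2, 2.8, 3.0]
test_rates = [
    speaking_rate("set an alarm for six", 3.4),  # slow, hyper-articulated speaker
    speaking_rate("call john now", 1.0),         # close to the training rate
]
print(flag_rate_mismatch(train_rates, test_rates))
```

Checks of this kind are one instance of the feature extraction and error analysis that Sections 5 and 6 describe in detail.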