Methods for Pronunciation Assessment in Computer Aided Language Learning by Mitchell A
Total Page:16
File Type:pdf, Size:1020Kb
Methods for Pronunciation Assessment in Computer Aided Language Learning by Mitchell A. Peabody M.S., Drexel University, Philadelphia, PA (2002) B.S., Drexel University, Philadelphia, PA (2002) Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Doctor of Philosophy at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY September 2011 © Massachusetts Institute of Technology 2011. All rights reserved. Author............................................................. Department of Electrical Engineering and Computer Science September 2011 Certified by . Stephanie Seneff Senior Research Scientist Thesis Supervisor Accepted by. Professor Leslie A. Kolodziejski Chair, Department Committee on Graduate Students 2 Methods for Pronunciation Assessment in Computer Aided Language Learning by Mitchell A. Peabody Submitted to the Department of Electrical Engineering and Computer Science on September 2011, in partial fulfillment of the requirements for the degree of Doctor of Philosophy Abstract Learning a foreign language is a challenging endeavor that entails acquiring a wide range of new knowledge including words, grammar, gestures, sounds, etc. Mastering these skills all require extensive practice by the learner and opportunities may not always be available. Computer Aided Language Learning (CALL) systems provide non-threatening environments where foreign language skills can be practiced where ever and whenever a student desires. These systems often have several technologies to identify the different types of errors made by a student. This thesis focuses on the problem of identifying mispronunciations made by a foreign language student using a CALL system. We make several assumptions about the nature of the learning activity: it takes place using a dialogue system, it is a task- or game-oriented activity, the student should not be interrupted by the pronunciation feedback system, and that the goal of the feedback system is to identify severe mispronunciations with high reliability. Detecting mispronunciations requires a corpus of speech with human judgements of pronunciation quality. Typical approaches to collecting such a corpus use an expert pho- netician to both phonetically transcribe and assign judgements of quality to each phone in a corpus. This is time consuming and expensive. It also places an extra burden on the tran- scriber. We describe a novel method for obtaining phone level judgements of pronunciation quality by utilizing non-expert, crowd-sourced, word level judgements of pronunciation. Foreign language learners typically exhibit high variation and pronunciation shapes dis- tinct from native speakers that make analysis for mispronunciation difficult. We detail a sim- ple, but effective method for transforming the vowel space of non-native speakers to make mispronunciation detection more robust and accurate. We show that this transformation not only enhances performance on a simple classification task, but also results in distributions that can be better exploited for mispronunciation detection. This transformation of the vowel is exploited to train a mispronunciation detector using a variety of features derived from acoustic model scores and vowel class distributions. We confirm that the transformation technique results in a more robust and accurate identifica- tion of mispronunciations than traditional acoustic models. 3 Thesis Supervisor: Stephanie Seneff Title: Senior Research Scientist 4 Acknowledgments This work would have not been possible without the support of many people: Stephanie Seneff for her guidance and patience during my long and circuitous program. Victor Zue and John Guttag, for sitting on my committee. Natalija Jovanovic and Cordelia Zorana for unwavering support and daddy hugs. Family Carol, Mat, Lisa, John, Zoran, Zorica, Natalie, Mike, Jason, Mandy, Karla, and Stephen, for keeping me humbled. All of SLS but especially: Scott Cyphers, Jim Glass, Alex Gruenstein, Lee Hetherington, Ian McGraw, and Chao Wang for technical assistance, guidance, and advice. Marcia Davidson, for years of witty banter about nothing in particular and keeping me in check, literally and figuratively. Friends Lisa Anthony, Joe Beatty, Mark Bellew, Nadya Belov, Michael Bernstein, Syl- vain Bruni, Chris Cera, Chih-yu Chao, Ghinwa Choueiter, Christopher Dahn, Ajit Dash, Leeland Ekstrom, Suzanne Flynn, Michael Anthony Fowler, Abigail Fran- cis, Tyrone Hill, Melva James, Amber Johnson, Fadi Kanaan, Alexandra Kern, Joe Kopena, Shawn Kuo, Rob Lass, Karen Lee, Vivian Lei, Hong Ma, Lisa Marshall, Gregory (grem) Marton, Amy McCreath, Ali Mohammad, Ali Motamedi, Song-hee Paik, Katrina Panovich, Anna Poukchanski, Bill Regli, Tom Robinson, Sarah Ro- driguez, Micah Romer, Joseph Rumpler, Katie Ryan, Chris Rycroft, Josh Schanker, Yuiko Shibamoto, Ali Shokoufandeh, Ross Snyder, Susan Song, Evan Sultanik, Ryan Tam, William Tsu, Bob Yang, Stan Zanorotti, and Vera Zaychik, for the stress relief and sanity checks. Mentors COL Joe Follansbee, MAJ Eric Schaertl, and CPT Vikas Nagardeolekar, US Army, for showing me that Warrior-Scholars are not imaginary. Ops Brothers MAJ Jeffrey Rector, US Army, LT Carlis Brown and LT Jimmy Wang, US Navy, MND-B G9 Ops: CA4Life. To all of the above and to anyone I missed who has touched my life and brought me to this point: Thank you. 5 This research was supported in part by the National Science Foundation (NSF) Graduate Research Fellowship Program in the United States, and by the Information Technology Research Institute (ITRI) in Taiwan. 6 Contents 1 Introduction 19 1.1 Motivations .................................. 20 1.2 Contributions ................................. 21 1.3 Assumptions ................................. 22 1.4 Terminology and Conventions ........................ 22 1.5 Thesis Structure ................................ 23 2 Background 25 2.1 Pronunciation ................................. 26 2.2 Computer Aided Language Learning ..................... 27 2.2.1 Dialogue-based Systems ....................... 28 2.3 Computer Aided Pronunciation Training ................... 30 2.3.1 Holistic Pronunciation Evaluation .................. 30 2.3.2 Pinpoint Error Detection ....................... 34 2.4 Summary ................................... 37 3 Crowd-sourced phonetic labeling 39 3.1 Motivation ................................... 39 3.2 Related Work ................................. 40 3.3 Approach ................................... 42 3.3.1 Data .................................. 43 3.3.2 Annotation Task ........................... 43 3.4 Annotation Results .............................. 45 7 3.4.1 Efficiency ............................... 46 3.4.2 Cost .................................. 46 3.4.3 Agreement among raters ....................... 46 3.4.4 Aggregated κ ............................. 49 3.4.5 Pronunciation Deviation and Mispronunciation ........... 54 3.5 Labeling Algorithm .............................. 55 3.6 Labeling Results ............................... 58 3.7 Summary ................................... 61 4 Anchoring Vowels for phonetic assessment 63 4.1 Motivation ................................... 63 4.2 Related Work ................................. 64 4.3 Approach ................................... 65 4.3.1 Data .................................. 66 4.3.2 Anchoring .............................. 67 4.4 Results ..................................... 68 4.5 Summary ................................... 75 5 Mispronunciation Detection 77 5.1 Motivation ................................... 77 5.2 Related Work ................................. 78 5.3 Approach ................................... 79 5.3.1 Corpora ................................ 80 5.3.2 Segmentation ............................. 80 5.3.3 Features ................................ 81 5.3.4 Decision Tree Classifier ....................... 89 5.4 Results ..................................... 91 5.4.1 Performance ............................. 91 5.4.2 Decision Tree Rules ......................... 96 5.5 Summary ................................... 99 8 6 Summary & Future Work 101 6.1 Contributions ................................. 101 6.1.1 Crowd-sourced phonetic labeling .................. 102 6.1.2 Anchoring for vowel normalization ................. 102 6.1.3 Mispronunciation detection ..................... 103 6.2 Directions for Future Research ........................ 104 6.2.1 Crowd-sourced phonetic labeling .................. 105 6.2.2 Anchoring for vowel normalization ................. 107 6.2.3 Mispronunciation Detection ..................... 109 6.2.4 Application to other domains ..................... 110 A A Comprehensive Overview of Computer Aided Language Learning 111 A.1 Foreign Language Learning .......................... 112 A.1.1 Teaching Methodology ........................ 112 A.1.2 Measuring Language Performance .................. 113 A.1.3 Pronunciation ............................. 115 A.2 Technology in Foreign Language Learning .................. 116 A.3 Computer Aided Language Learning ..................... 118 A.3.1 Early Systems ............................. 119 A.3.2 Modern Systems ........................... 120 A.3.3 Dialogue-based Systems ....................... 122 A.4 Computer Aided Pronunciation Training ................... 123 A.4.1 Holistic Pronunciation Evaluation .................. 124 A.4.2 Pinpoint Error Detection ....................... 128 A.4.3 Pronunciation Feedback ....................... 131 B Comprehensive