Pronunciation Modeling in Spelling Correction for Writers of English As a Foreign Language
PRONUNCIATION MODELING IN SPELLING CORRECTION FOR WRITERS OF ENGLISH AS A FOREIGN LANGUAGE

A Thesis

Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the Graduate School of The Ohio State University

By Adriane Boyd, B.A., M.A.

The Ohio State University
2008

Master’s Examination Committee:
Professor Eric Fosler-Lussier, Advisor
Professor Christopher Brew

Approved by: Advisor, Computer Science and Engineering Graduate Program

© Copyright by Adriane Boyd 2008

ABSTRACT

In this thesis I propose a method for modeling pronunciation variation in the context of spell checking for non-native writers of English. Spell checkers, which are nearly ubiquitous in text-processing software, have been developed with native speakers as the target audience and fail to address many of the types of spelling errors peculiar to non-native speakers, especially those errors influenced by their native language’s writing system and by differences in the phonology of the native and non-native languages. The model of pronunciation variation is used to extend a pronouncing dictionary for use in the spelling correction algorithm developed by Toutanova and Moore (2002), which includes statistical models of spelling errors related to both orthography and pronunciation. The pronunciation variation modeling is shown to improve performance for misspellings produced by Japanese writers of English as a foreign language.

to my parents

ACKNOWLEDGMENTS

I would like to thank my advisor, Eric Fosler-Lussier, and the computational linguistics faculty in the Linguistics and Computer Science and Engineering departments at Ohio State for their support. I would also like to thank the computational linguistics discussion group Clippers for their feedback in the early stages of this work.

VITA

2003: B.A., Linguistics and German, University of North Carolina at Chapel Hill
2007: M.A., Linguistics, The Ohio State University
2005-2008: Graduate Research and Teaching Associate, The Ohio State University

PUBLICATIONS

Research Publications

Adriane Boyd, Markus Dickinson, and Detmar Meurers (2007). Increasing the recall of corpus annotation error detection. In Proceedings of the Sixth International Workshop on Treebanks and Linguistic Theories (TLT 2007).

Adriane Boyd, Markus Dickinson, and Detmar Meurers (2007). On representing dependency relations – Insights from converting the German TiGerDB. In Proceedings of the Sixth International Workshop on Treebanks and Linguistic Theories (TLT 2007).

Adriane Boyd (2007). Discontinuity Revisited: An Improved Conversion to Context-Free Representations. In Proceedings of the Linguistic Annotation Workshop (LAW 2007).

Adriane Boyd, Whitney Gegg-Harrison, and Donna Byron (2006). Identifying non-referential it. A machine learning approach incorporating linguistically motivated patterns. Traitement Automatique des Langues. Volume 46, No. 1.

Adriane Boyd, Whitney Gegg-Harrison, and Donna Byron (2005). Identifying non-referential it: a machine learning approach incorporating linguistically motivated features. In Proceedings of the ACL Workshop on Feature Engineering for Machine Learning in Natural Language Processing.

FIELDS OF STUDY

Major Field: Computer Science and Engineering

TABLE OF CONTENTS

Abstract
Dedication
Acknowledgments
Vita
List of Tables
List of Figures

Chapters:

1. Introduction and Motivation
   1.1 Characteristics of Spelling Errors
       1.1.1 Native Writers of English
       1.1.2 Japanese Writers of English as a Foreign Language
   1.2 Developing a Spell Checker for Non-Native Writers of English
2. Background
   2.1 Spell Checking Tasks
       2.1.1 Non-Word Error Detection
       2.1.2 Isolated Word Error Correction
   2.2 Edit Operations
       2.2.1 Types of Edit Operations
       2.2.2 Costs of Edit Operations
       2.2.3 Extending Edits to Pronunciation
   2.3 Noisy Channel Spelling Correction
       2.3.1 Training the Error Model
       2.3.2 Extending the Model to Pronunciation Errors
       2.3.3 Letter-To-Phone Model
   2.4 Spell Checkers Adapted for JWEFL
   2.5 Summary
3. Resources and Data Preparation
   3.1 TIMIT
   3.2 English Read by Japanese Corpus
   3.3 CMU Pronouncing Dictionary
   3.4 Atsuo-Henry Corpus
   3.5 Spell-Checker Oriented Word Lists
4. Method
   4.1 Pronouncing Dictionary with Variation
       4.1.1 Initial Recognizer
       4.1.2 Adapting the Recognizer
       4.1.3 Generating Pronunciations
   4.2 Implementation of the Noisy Channel Spelling Correction Approach
       4.2.1 Letter-to-Phone Model
       4.2.2 Noisy Channel Spelling Correction
5. Results
   5.1 Experimental Setup
   5.2 Baseline
   5.3 Evaluation
       5.3.1 Tuning Model Parameters
       5.3.2 Evaluation of Pronunciation Variation
       5.3.3 Evaluation of the Spelling Correction Model
   5.4 Summary
6. Summary and Outlook
   6.1 Outlook

Bibliography

Appendices:

A. Annotation Schemes
   A.1 Phonetic Transcriptions
       A.1.1 TIMIT
       A.1.2 English Read by Japanese Corpus
   A.2 Mapping to CMUDICT Phoneme Set
B. Letter-to-Phone Alignments

LIST OF TABLES

1.1 Difficult Phoneme Pairs for Japanese Speakers of English
2.1 Percentage of Correct Suggestions in the 1 to 3-Best Candidates as a Function of the Maximum Substitution Length (N) on Native Speaker Misspellings from Brill and Moore (2000)
2.2 Percentage of Correct Suggestions in the 1 to 4-Best Candidates by the Letter (L), Pronunciation (PHL), and Combined (CMB) Models on Native Speaker Misspellings from Toutanova and Moore (2002)
2.3 Summary of Types and Costs of Edit Operations in Previous Spelling Correction Approaches
2.4 Percentage of Correct Suggestions in the 1- to 6-Best Candidates for Native and JWEFL Misspellings from the Atsuo-Henry Corpus (Mitton and Okada, 2007)
3.1 Word List Sizes
4.1 Number of Pronunciations with Five Generated Variations
4.2 Phone and Word Accuracy for Letter-to-Phone Model Trained and Tested on CMUDICT as a Function of the Number of Most-Specific Contexts (N)
4.3 Phone and Word Accuracy for Letter-to-Phone Models Trained on Word List 70 and CMUDICT, Tested on Word List 70 Test Set as a Function of the Number of Most-Specific Contexts N
5.1 Aspell Results: Percentage of Correct Suggestions on the Atsuo-Henry Corpus Test Set
5.2 Percentage of Correct Suggestions on the Atsuo-Henry Corpus Development Set as a Function of the Maximum Substitution Length (N) for PL
5.3 Percentage of Correct Suggestions on the Atsuo-Henry Corpus Development Set as a Function of the Maximum Substitution Length (N) for PPHL
5.4 Percentage of Correct Suggestions on the Atsuo-Henry Corpus Development Set as a Function of the Maximum Substitution Length (N) for Combined Model
5.5 Percentage of Correct Suggestions on the Atsuo-Henry Corpus Development Set as a Function of Dictionary Size for All Models
5.6 Percentage of Correct Suggestions on the Atsuo-Henry Corpus Development Set as a Function of Minimum Probability m for All Models
5.7 Candidate Corrections for the Misspelling *eney, Intended Word any
5.8 Percentage of Correct Suggestions on the Atsuo-Henry Corpus Test Set as a Function of Pronunciation Variation for PPHL
5.9 Percentage of Correct Suggestions on the Atsuo-Henry Corpus Test Set for All Models
5.10 Performance of Spell Checker on Test Data
A.1 TIMIT Phonemes
A.2 ERJ Phonemes
A.3 Mapping to CMUDICT Phonemes
B.1 Letter-Phone Edit Distances
B.2 Letter-Phone Edit Distances, cont.

LIST OF FIGURES

2.1 Sample Trie
2.2 Directed Graph for Calculating the Distance between *plog and peg (from Mitton, 1996)
2.3 Letter Alignment of Word and Misspelling
4.1 Example Phone Alignment
4.2 Original phone model for p
4.3 Adapted phone model for p accounting for variation between p, th, t, and dh
4.4 Finite state transducer for canonical phone r where the respective transition probabilities reflect the negative logarithm of the probability that the phone r, uh, d, or l was observed for r
4.5 Word List Trie

CHAPTER 1

INTRODUCTION AND MOTIVATION

Spell checkers are very frequently included in software where text is entered, such as word processors, email programs, and web browsers. The goal of a spell checker is to identify misspellings, select appropriate words as suggested corrections, and rank the suggested corrections so that the intended word is high in the suggestion list.

Since spell checkers have been developed with competent native speakers as the target users, they do not appropriately address many types of errors made by non-native writers and they often fail to suggest the appropriate corrections (cf. Okada, 2004; L’Haire, 2007). Non-native writers of English struggle with many of the same idiosyncrasies of English spelling that cause difficulty for native speakers, but differences between English phonology and the phonology of their native language lead to types of spelling errors not anticipated by traditional spell checkers (Okada, 2004; L’Haire, 2007; Mitton and Okada, 2007).

In order to address the spelling errors that result from these phonological differences, I propose a method for modeling pronunciation variation from a phonetically untranscribed corpus of read speech.
The model of pronunciation variation is evaluated in the context of the spelling correction algorithm developed by Toutanova and Moore (2002), which takes into account pronunciation similarity between misspellings and suggested corrections.