
PRONUNCIATION MODELING IN SPELLING CORRECTION FOR WRITERS OF ENGLISH AS A FOREIGN LANGUAGE

A Thesis

Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the

Graduate School of The Ohio State University

By

Adriane Boyd, B.A., M.A.

*****

The Ohio State University

2008

Master’s Examination Committee:

Professor Eric Fosler-Lussier, Advisor
Professor Christopher Brew

Approved by: Advisor, Computer Science and Engineering Graduate Program

© Copyright by

Adriane Boyd

2008

ABSTRACT

In this thesis I propose a method for modeling pronunciation variation in the context of spell checking for non-native writers of English. Spell checkers, which are nearly ubiquitous in text-processing software, have been developed with native speakers as the target audience and fail to address many of the types of spelling errors peculiar to non-native speakers, especially those errors influenced by their native language’s writing system and by differences in the phonology of the native and non-native languages. The model of pronunciation variation is used to extend a pronouncing dictionary for use in the spelling correction algorithm developed by Toutanova and Moore (2002), which includes statistical models of spelling errors related to both orthography and pronunciation. The pronunciation variation modeling is shown to improve performance for misspellings produced by Japanese writers of English as a foreign language.

to my parents

ACKNOWLEDGMENTS

I would like to thank my advisor, Eric Fosler-Lussier, and the computational linguistics faculty in the Linguistics and Computer Science and Engineering departments at Ohio State for their support. I would also like to thank the computational linguistics discussion group Clippers for their feedback in the early stages of this work.

VITA

2003 ...... B.A., Linguistics and German, University of North Carolina at Chapel Hill
2007 ...... M.A., Linguistics, The Ohio State University
2005-2008 ...... Graduate Research and Teaching Associate, The Ohio State University

PUBLICATIONS

Research Publications

Adriane Boyd, Markus Dickinson, and Detmar Meurers (2007). Increasing the recall of corpus annotation error detection. In Proceedings of the Sixth International Workshop on Treebanks and Linguistic Theories (TLT 2007).

Adriane Boyd, Markus Dickinson, and Detmar Meurers (2007). On representing dependency relations – Insights from converting the German TiGerDB. In Proceedings of the Sixth International Workshop on Treebanks and Linguistic Theories (TLT 2007).

Adriane Boyd (2007). Discontinuity Revisited: An Improved Conversion to Context-Free Representations. In Proceedings of the Linguistic Annotation Workshop (LAW 2007).

Adriane Boyd, Whitney Gegg-Harrison, and Donna Byron (2006). Identifying non-referential it: a machine learning approach incorporating linguistically motivated patterns. Traitement Automatique des Langues, Volume 46, No. 1.

Adriane Boyd, Whitney Gegg-Harrison, and Donna Byron (2005). Identifying non-referential it: a machine learning approach incorporating linguistically motivated features. In Proceedings of the ACL Workshop on Feature Engineering for Machine Learning in Natural Language Processing.

FIELDS OF STUDY

Major Field: Computer Science and Engineering

TABLE OF CONTENTS

Abstract
Dedication
Acknowledgments
Vita
List of Tables
List of Figures

Chapters:

1. Introduction and Motivation
   1.1 Characteristics of Spelling Errors
       1.1.1 Native Writers of English
       1.1.2 Japanese Writers of English as a Foreign Language
   1.2 Developing a Spell Checker for Non-Native Writers of English
2. Background
   2.1 Spell Checking Tasks
       2.1.1 Non-Word Error Detection
       2.1.2 Isolated Word Error Correction
   2.2 Edit Operations
       2.2.1 Types of Edit Operations
       2.2.2 Costs of Edit Operations
       2.2.3 Extending Edits to Pronunciation
   2.3 Noisy Channel Spelling Correction
       2.3.1 Training the Error Model
       2.3.2 Extending the Model to Pronunciation Errors
       2.3.3 Letter-To-Phone Model
   2.4 Spell Checkers Adapted for JWEFL
   2.5 Summary
3. Resources and Data Preparation
   3.1 TIMIT
   3.2 English Read by Japanese Corpus
   3.3 CMU Pronouncing Dictionary
   3.4 Atsuo-Henry Corpus
   3.5 Spell-Checker Oriented Word Lists
4. Method
   4.1 Pronouncing Dictionary with Variation
       4.1.1 Initial Recognizer
       4.1.2 Adapting the Recognizer
       4.1.3 Generating Pronunciations
   4.2 Implementation of the Noisy Channel Spelling Correction Approach
       4.2.1 Letter-to-Phone Model
       4.2.2 Noisy Channel Spelling Correction
5. Results
   5.1 Experimental Setup
   5.2 Baseline
   5.3 Evaluation
       5.3.1 Tuning Model Parameters
       5.3.2 Evaluation of Pronunciation Variation
       5.3.3 Evaluation of the Spelling Correction Model
   5.4 Summary
6. Summary and Outlook
   6.1 Outlook

Bibliography

Appendices:

A. Annotation Schemes
   A.1 Phonetic Transcriptions
       A.1.1 TIMIT
       A.1.2 English Read by Japanese Corpus
   A.2 Mapping to CMUDICT Phoneme Set
B. Letter-to-Phone Alignments

LIST OF TABLES


1.1 Difficult Phoneme Pairs for Japanese Speakers of English
2.1 Percentage of Correct Suggestions in the 1- to 3-Best Candidates as a Function of the Maximum Substitution Length (N) on Native Speaker Misspellings from Brill and Moore (2000)
2.2 Percentage of Correct Suggestions in the 1- to 4-Best Candidates by the Letter (L), Pronunciation (PHL), and Combined (CMB) Models on Native Speaker Misspellings from Toutanova and Moore (2002)
2.3 Summary of Types and Costs of Edit Operations in Previous Spelling Correction Approaches
2.4 Percentage of Correct Suggestions in the 1- to 6-Best Candidates for Native and JWEFL Misspellings from the Atsuo-Henry Corpus (Mitton and Okada, 2007)
3.1 Word List Sizes
4.1 Number of Pronunciations with Five Generated Variations
4.2 Phone and Word Accuracy for Letter-to-Phone Model Trained and Tested on CMUDICT as a Function of the Number of Most-Specific Contexts (N)
4.3 Phone and Word Accuracy for Letter-to-Phone Models Trained on Word List 70 and CMUDICT, Tested on Word List 70 Test Set as a Function of the Number of Most-Specific Contexts (N)
5.1 Aspell Results: Percentage of Correct Suggestions on the Atsuo-Henry Corpus Test Set
5.2 Percentage of Correct Suggestions on the Atsuo-Henry Corpus Development Set as a Function of the Maximum Substitution Length (N) for PL
5.3 Percentage of Correct Suggestions on the Atsuo-Henry Corpus Development Set as a Function of the Maximum Substitution Length (N) for PPHL
5.4 Percentage of Correct Suggestions on the Atsuo-Henry Corpus Development Set as a Function of the Maximum Substitution Length (N) for the Combined Model
5.5 Percentage of Correct Suggestions on the Atsuo-Henry Corpus Development Set as a Function of Dictionary Size for All Models
5.6 Percentage of Correct Suggestions on the Atsuo-Henry Corpus Development Set as a Function of Minimum Probability m for All Models
5.7 Candidate Corrections for the Misspelling *eney, Intended Word any
5.8 Percentage of Correct Suggestions on the Atsuo-Henry Corpus Test Set as a Function of Pronunciation Variation for PPHL
5.9 Percentage of Correct Suggestions on the Atsuo-Henry Corpus Test Set for All Models
5.10 Performance of Spell Checker on Test Data
A.1 TIMIT Phonemes
A.2 ERJ Phonemes
A.3 Mapping to CMUDICT Phonemes
B.1 Letter-Phone Edit Distances
B.2 Letter-Phone Edit Distances, cont.

LIST OF FIGURES


2.1 Sample Trie
2.2 Directed Graph for Calculating the Distance between *plog and peg (from Mitton, 1996)
2.3 Letter Alignment of Word and Misspelling
4.1 Example Phone Alignment
4.2 Original phone model for p
4.3 Adapted phone model for p accounting for variation between p, th, t, and dh
4.4 Finite state transducer for canonical phone r where the respective transition probabilities reflect the negative logarithm of the probability that the phone r, uh, d, or l was observed for r
4.5 Word List Trie

CHAPTER 1

INTRODUCTION AND MOTIVATION

Spell checkers are very frequently included in software where text is entered, such as word processors, email programs, and web browsers. The goal of a spell checker is to identify misspellings, select appropriate words as suggested corrections, and rank the suggested corrections so that the intended word is high in the suggestion list. Since spell checkers have been developed with competent native speakers as the target users, they do not appropriately address many types of errors made by non-native writers and they often fail to suggest the appropriate corrections (cf. Okada, 2004; L’Haire, 2007). Non-native writers of English struggle with many of the same idiosyncrasies of English spelling that cause difficulty for native speakers, but differences between English phonology and the phonology of their native language lead to types of spelling errors not anticipated by traditional spell checkers (Okada, 2004; L’Haire, 2007; Mitton and Okada, 2007).

In order to address the spelling errors that result from these phonological differences, I propose a method for modeling pronunciation variation from a phonetically untranscribed corpus of read speech. The model of pronunciation variation is evaluated in the context of the spelling correction algorithm developed by Toutanova and Moore (2002), which takes into account pronunciation similarity between misspellings and suggested corrections. I consider in particular Japanese writers of English as a foreign language (JWEFL), who provide an interesting test scenario as misspellings are influenced both by significant difficulties with the writing system and differences in the phonology of the two languages. Characteristics of spelling errors made by native speakers and JWEFL are discussed in section 1.1, and section 1.2 outlines the approach for modeling pronunciation variation and incorporating it in a spell checker adapted for non-native speakers.

1.1 Characteristics of Spelling Errors

Spell checkers developed for native speakers use observations about the types of misspellings frequently made by native speakers in order to select and rank suggested corrections. For English, many of the same types of misspellings are found in native and non-native productions (cf. Mitton and Okada, 2007). Common competence errors may be related to idiosyncrasies of the writing system,1 such as consonant doubling in *begining2 for beginning, and native and non-native speakers alike have problems in these areas. Okada (2004) finds that English and Japanese speakers show very similar patterns for many substitution errors, such as those involving the letters c and t. On the other hand, non-native speakers’ incomplete knowledge of the language’s phonology and morphology may result in errors unique to non-native speakers such as *lestrant for restaurant, *writed for wrote, or *womanes for women.

1 See Mitton (1996) for a brief history of English spelling.
2 The asterisk (*) is used to indicate a misspelling. All example spelling errors in this chapter come from the Atsuo-Henry Corpus (Okada, 2004).

Many native speaker errors for English are also homophones or near-homophones of the target word. Mitton (1987) finds that 64% of non-word errors in a corpus of misspellings by British secondary school students are homophones or near-homophones of the target word. Both native and non-native speakers are expected to produce many homophones or near-homophones. Although no empirical studies have been carried out to analyze homophones in misspellings by Japanese writers of English as a foreign language, many of the errors in the Atsuo-Henry Corpus (Okada, 2004) appear to be homophones, such as *prity for pretty and *polititian for politician.3

The following two sections compare typical characteristics of spelling errors made by native speakers of English and those made by Japanese writers of English as a foreign language (JWEFL).

1.1.1 Native Writers of English

In the spelling correction literature, misspellings are typically characterized by the number of single character edit operations (insertion, substitution, and deletion) required to convert the misspelling to the target word. For example, the misspelling *separet for separate requires two single character edits, one substitution and one insertion, to be corrected. The percentage of native speaker misspellings that are one edit away from the correct word (single-error misspellings) varies depending on the type of text, but various studies have found that anywhere from 69% to 94% of misspellings by native writers of English are single-error misspellings (Damerau, 1964; Pollock and Zamora, 1984; Mitton, 1987; Kukich, 1992). The scientific text analyzed in Pollock and Zamora (1984) shows the highest percentage of single-error misspellings, while the corpus of spelling errors by British secondary school students (Mitton, 1987) shows the lowest. Various studies have also found that only 1-7% of misspellings differ in the initial letter from the target word.

3 Given that there are a number of phonological distinctions in English that are missing in Japanese, such as the distinction between /l/ and /r/, the concept of homophone would need to be extended for Japanese writers of English as a foreign language in order to evaluate this in the corpus of misspellings.

1.1.2 Japanese Writers of English as a Foreign Language

Although many error types are produced by native speakers and non-native speakers alike, Japanese writers of English as a foreign language (JWEFL) produce spelling errors that differ in significant ways from the types of errors produced by native speakers. One of the main differences is that a much higher percentage of spelling errors differ by two or more edit operations from the target word. In the Atsuo-Henry Corpus of spelling errors made by JWEFL, only 37% of misspellings by JWEFL are single-error misspellings. Additionally, 11% of misspellings have a different first letter than the target word.

Okada (2004) and Mitton and Okada (2007) investigate spelling errors made by native speakers writing English (NSWE) and Japanese writers of English as a foreign language (JWEFL). Many types of spelling substitution errors are common to both NSWE and JWEFL: n for m, s/x/t for c, and t for d. The most frequent spelling errors unique to JWEFL are confusion of r/l, confusion of b/v, and insertion of extra syllables with the vowels o and u (Mitton and Okada, 2007). Additional substitution errors unique to JWEFL include b for d (due to confusion about similar-looking letters) and s for t.

Aside from idiosyncrasies of the English spelling system that cause difficulty for both native and non-native speakers, Okada (2004) identifies two main sources of errors for JWEFL: differences between English and Japanese phonology and differences between the English alphabet and the Japanese romazi writing system, which uses a subset of English letters.

Phonological Differences

Due to differences between English and Japanese phonology, there are a number of phonological distinctions in English that are not present in Japanese, including the pair /l/ and /r/ and the pair /b/ and /v/ (Okada, 2004). Using the English Read by Japanese (ERJ) Corpus (see section 3.2), Minematsu et al. (2002) highlight a number of differences between American English and Japanese English. As expected in a corpus by non-native speakers, inter-speaker variation leads to a large degree of variance in the pronunciation of particular phones. Certain pairs of phonemes are much more similar in the Japanese productions than in the American ones; see Table 1.1. In particular, the central or low vowels /ah/, /ae/, and /aa/ show a lot of variation, as Japanese has only one central and low vowel. Phonotactic constraints in Japanese that allow only one consonant at the beginning of a syllable also lead to frequent insertions of vowels into consonant clusters (Minematsu et al., 2002).

Writing System Influences

Japanese uses a combination of different types of writing systems: morphographic kanzi borrowed from Chinese, syllabic hiragana and katakana, and romazi, an alphabetic system that uses letters of the Latin alphabet to transcribe Japanese. The most common romazi systems use a subset of 19 or 21 letters. The letters c, l, q, v, and x are never used, and f and j are only used in some romazi systems (Okada, 2004).

/r/ - /l/      /dh/ - /jh/    /aa/ - /ah/    /s/ - /th/
/ih/ - /iy/    /er/ - /ah/    /th/ - /sh/    /ih/ - /y/
/er/ - /aa/    /z/ - /dh/     /uh/ - /uw/    /er/ - /ae/
/z/ - /jh/     /ae/ - /aa/    /zh/ - /dh/    /ae/ - /ah/

Table 1.1: Difficult Phoneme Pairs for Japanese Speakers of English

The romazi system causes difficulties for JWEFL because the Latin letters are used in romazi to represent Japanese sounds that are very different from the sounds they correspond to in English. Confusion between the romazi and English use of the letters causes many spelling errors. The lack of familiarity with the Latin letters not used in the romazi system also causes difficulties for writers. Many of these letters correspond to phonological distinctions lacking in Japanese, so JWEFL cannot necessarily rely on their pronunciations to guide their choice (Okada, 2004). Okada (2004) also explains that Japanese elementary school students learn romazi well before they start to learn English, so when they begin to learn English they are already very familiar with the romazi correspondences for each letter and have a tendency to carry over the use of romazi pronunciations into English writing. Okada (2004) and Mitton and Okada (2007) also note that JWEFL are more likely to substitute letters that they are familiar with from their use in romazi, in particular r and b, for letters that are not used in romazi, l and v respectively, than vice versa.

1.2 Developing a Spell Checker for Non-Native Writers of English

In this thesis, I propose a method for creating a model of pronunciation variation from a phonetically untranscribed corpus of read speech recorded by non-native speakers. The pronunciation variation model is used to generate multiple pronunciations for each canonical pronunciation in a pronouncing dictionary, and the pronunciation variations are incorporated into the spelling correction approach developed by Toutanova and Moore (2002), which uses statistical models of spelling errors that consider both orthography and pronunciation.

Chapter 2 provides an overview of research in spell checking and the spelling correction approach developed by Toutanova and Moore (2002). Chapter 3 describes the resources used to develop a spell checker that takes orthography, pronunciation, and pronunciation variation into account. Chapter 4 describes the approach used to model pronunciation variation and the implementation of the spelling correction approach from Toutanova and Moore (2002) for Japanese writers of English as a foreign language. Chapter 5 presents an evaluation of the JWEFL-adapted spell checker, and Chapter 6 concludes the thesis.

Conventions

The following conventions are used throughout this thesis:

A word is a sequence of characters from the given alphabet found in the current word list. A misspelling is a sequence of characters from the given alphabet not found in the current word list. Both words and misspellings are shown in a monospace font. Misspellings are marked with *. A candidate correction is a word from the current word list proposed as a potential correction for a misspelling. A word list is a list of words using the given alphabet. Dictionary and word list may be used interchangeably, but word list is preferred as the spell checker uses a list of inflected forms with no affix generation.

CHAPTER 2

BACKGROUND

Research in spell checking, which has been ongoing since the 1960s (see Kukich, 1992, for a survey of spell checking research), has focused on three main problems: non-word error detection, isolated-word error correction, and context-dependent word correction. A non-word is a sequence of letters that is not a possible word in the language in any context. Examples of non-words in English are *ckwre, *eated, and *seperately. Once a sequence of letters has been determined to be a non-word, isolated-word error correction is the process of determining the appropriate word to substitute for the non-word. These first two problems are the focus of traditional interactive spell checkers such as GNU Aspell,4 which consider each word in a text in isolation, decide whether it is a valid word in the language, and if not, flag it and propose a ranked list of potential corrections.

The third problem, context-dependent word correction, is a much more difficult task since it needs to consider all words in a text and determine whether typographic, grammatical, semantic, or other errors have resulted in non-words or the substitution of one word for another. Context-dependent word correction may be used to address non-word errors such as *seperately, typographic errors such as form for from, errors due to homophones such as it’s for its, and grammatical errors such as subject-verb agreement errors or pronoun-antecedent agreement errors. Successful context-dependent word correction requires full-scale natural language processing and understanding, including analysis of syntax, semantics, pragmatics, and discourse. In this thesis, the focus of the spelling correction task will be on the selection and ranking of candidate corrections for non-word spelling errors; context-dependent corrections are not considered.

4 http://aspell.net

2.1 Spell Checking Tasks

In the task of selecting and ranking candidate corrections for non-word errors there are two main subtasks (cf. Kukich, 1992). Given a sequence of letters, determine: 1) whether this sequence of letters is a non-word, 2) if so, select and rank candidate words as potential corrections to present to the writer. Each subtask is described separately in the following two sections.

2.1.1 Non-Word Error Detection

Two main techniques have been used to detect non-word spelling errors:

The first method, intended mainly for use with optical character recognition sys- tems, uses statistics over letter n-grams to determine whether a particular word is a likely word in the target language. Words with unusual sequences are flagged as potential non-words. Although this method can flag many misspellings, the let- ter n-grams are not always sufficient to differentiate between words and non-words

(Kukich, 1992).

The second method, which was costly in the early development of spell checkers due to limited memory and storage, is to maintain a list of words and identify non-words as any strings not present in the word list (Kukich, 1992). Now that memory and storage restrictions are not a problem for the sizes of typical spell checking word lists (typically no larger than 200,000 words for English, and around 100,000 words for general-purpose spell checking), most spell checkers use dictionary look-up, potentially with additional affix tables that extend the word list.5

The most common dictionary look-up method is to store the words in a hash table. With a well-designed hash function, the time to look up a word in the hash table is nearly constant (Fox et al., 1992), although in the worst case the look-up time may be O(n), where n is the number of words in the dictionary. An alternate dictionary look-up method is to store the dictionary in a trie, a tree data structure where values are stored along the path from the root to a node. A sample trie for a word list with the seven words (a, an, are, at, ate, be, bed) is shown in Figure 2.1. Each word corresponds to a node sequence from the root node to a shaded node. Looking up a letter sequence of length n in a trie takes O(n) time. A hash table allows constant-time look-up for a word in the dictionary, but it is only possible to determine whether an entire word is in the dictionary. With a trie, any substring starting at the beginning of the word can be accessed incrementally. In the spell checker developed in this thesis, a trie will be used to store the dictionary to allow for efficient calculation of edit distances, since the calculation considers edit operations letter-by-letter beginning at the first letter of the word (see sections 2.1.2 and 4.2.2).

5 In an agglutinating language or a language with rich morphology, listing all inflected forms in a dictionary may still be prohibitively expensive or impossible.

[Figure: a trie containing the words a, an, are, at, ate, be, bed; word-final nodes are shaded]

Figure 2.1: Sample Trie

Notes on Tokenization

In processing a text, spell checkers need to tokenize the sequence of characters in the text into smaller sequences of characters that correspond to words. Typically, whitespace characters and some punctuation characters are considered word boundaries. Some spelling errors may result from errors in the separation of words by the writer, such as leaving out the space between two words (*onthe), introducing a space into a word (*extr *aordinary), or putting the space in the wrong place in a two-word sequence (*ont *hese). Since tokenization errors may result in words as well as non-words (for got), the general task of correcting tokenization errors falls under context-dependent word correction. For non-word errors that result from a sequence of two (or a limited number of) run-on or mistokenized words, a spell checker could consider all sequences of two words from the original text along with pairs of dictionary words when proposing corrections. These problems are not addressed here.

2.1.2 Isolated Word Error Correction

After a non-word has been detected, one or more words need to be selected as candidate corrections and ranked. This general spelling correction problem can be stated as follows (Brill and Moore, 2000):

Given an alphabet Σ, a word list D of strings ∈ Σ*, and a string r ∉ D with r ∈ Σ*, find w ∈ D such that w is the most likely correction.

In order to have a finite dictionary (see the previous note on tokenization), the alphabet Σ cannot contain word-separating characters such as space and hyphen.6 This definition of a dictionary is sufficient for English spell checking, but may not be appropriate for languages with very productive morphological processes (e.g., languages with rich inflectional morphology or agglutinating languages).

Selecting Candidate Corrections

The most common method for selecting candidate corrections uses the idea of minimum edit distance. The misspelling is compared to words in the word list by determining the number of single character edit operations (insertion, deletion, substitution, and sometimes transposition) required to convert the misspelling into the word. Words requiring very small numbers of edit operations are selected as candidate corrections.

This method was initially proposed in Damerau (1964). His method considered only those non-words which were exactly one edit operation away from a word in the word list. His approach was limited by the memory and storage limitations of the time, but in the following decades his basic approach of considering each single character edit between two strings was extended to calculate the edit distance between any two arbitrary strings. Dynamic programming algorithms developed in the 1970s allow the edit distance between two strings of length m and n to be computed in O(m ∗ n) time. This task can be seen as the problem of finding the shortest-cost path through a directed graph where the arc weights relate to the costs of edit operations.

6 If the alphabet Σ does include word-separating characters such as space or hyphen, a dictionary D of strings becomes infinite as it may include any word sequences in the language. If tokenization spelling errors are restricted to those involving a fixed number of words, then they may still be handled by a finite dictionary.

An example graph is shown in Figure 2.2. The shortest-cost path from node A to node T is the edit distance between *plog and peg. Each arc represents a single character insertion, deletion, or substitution. One shortest-cost path in this example (A-I-J-O-T) has an edit cost of 2. In this particular example, all edit operations are equally weighted. Alternate weighting schemes may consider the phonetic similarity between letters, the proximity of letters on the keyboard, or similarity in letter appearance when assigning the weights to the arcs; types of edit operations and methods for determining their weights are discussed in detail in section 2.2.
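In practice, the shortest-cost path is usually computed with dynamic programming rather than an explicit graph search. The Python sketch below (illustrative, with all edit costs fixed at 1 to match the example) computes the minimum edit distance between two strings:

def edit_distance(source, target):
    """Minimum number of single character insertions, deletions,
    and substitutions needed to turn source into target."""
    m, n = len(source), len(target)
    # d[i][j]: distance between source[:i] and target[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i          # delete all of source[:i]
    for j in range(n + 1):
        d[0][j] = j          # insert all of target[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[m][n]

print(edit_distance("plog", "peg"))  # 2, as in Figure 2.2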

Ranking Candidate Corrections

Once a small number of candidate corrections have been chosen, it is necessary to rank them for presentation to the writer. The minimum edit distance selection method provides an obvious choice for ranking candidate corrections: those with smaller edit distances appear higher in the list of candidate corrections.

Alternate Approaches

[Figure: directed graph with nodes A through T for computing the edit distance between *plog and peg; each arc is a single character edit with cost 0 or 1]

Figure 2.2: Directed Graph for Calculating the Distance between *plog and peg (from Mitton, 1996)

One common alternate approach to minimum edit distance is the similarity key technique, where the dictionary is sorted into small sets of similarly-spelled words. Instead of comparing the misspelling directly to words from the word list to find candidate corrections as in the minimum edit distance approach, the similarity key is used to look up a set of candidate corrections. One similarity key approach called SPEEDCOP (Pollock and Zamora, 1984) uses similarity keys that consist of the first letter of the word followed by the set of consonants in order of first appearance followed by the set of vowels, also in order of first appearance. For example, the similarity key for spell would be SPLE. The word list is sorted by similarity key, and candidate corrections are the words near the location of the misspelling’s similarity key in the sorted word list.
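A minimal Python sketch of this key construction, following the description above (the original SPEEDCOP implementation may differ in its details):

def speedcop_key(word):
    """First letter, then the remaining consonants, then the vowels,
    each in order of first appearance, without repeats."""
    word = word.lower()
    vowels = set("aeiou")
    consonants = [c for c in word[1:] if c not in vowels]
    vowel_letters = [c for c in word[1:] if c in vowels]
    key = word[0]
    for c in consonants + vowel_letters:
        if c not in key:  # keep only the first appearance
            key += c
    return key.upper()

print(speedcop_key("spell"))  # SPLE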

GNU Aspell uses a similarity key algorithm called the metaphone algorithm (Aspell, 2008).7 A hand-crafted table of conversions applied to the word from left to right is used to reduce a misspelling to a similarity key. The similarity key of the misspelling is then compared to the similarity keys of words in the word list. The English conversions mainly reduce the set of consonants by ignoring certain voicing and place contrasts. Some conversions include: c in certain contexts, g, j, and k become K; d and t become T; and f and v become F. Grapheme sequences that correspond to single phonemes are also collapsed: ck becomes K and gn becomes N. Additionally, all word-initial vowels are collapsed to a single vowel and any other vowels are ignored. The edit distance between the similarity key produced for the misspelling and the similarity keys of words in the word list is used to select and rank candidate corrections.

7 http://aspell.net/metaphone/

2.2 Edit Operations

In recent spelling correction systems, edit operations have been extended beyond single character edits and the methods for calculating edit operation weights have become increasingly sophisticated. Section 2.2.1 discusses the types of edit operations and section 2.2.2 discusses methods for determining the costs of edit operations.

2.2.1 Types of Edit Operations

Many implementations of the minimum edit distance method focus on single character edit operations where a single character (a) is replaced (a → b), inserted (ε → a), or deleted (a → ε) (e.g., Damerau, 1964; Church and Gale, 1991). Mitton (1996) extends the set of basic edit operations to allow for substitutions of multi-character strings (a1a2..an → b1b2..bm) for hand-selected substitutions. This is equivalent to adding extra arcs to the edit distance graph. For instance, if lo were a common substitution for e, an additional arc would be added from node I to node O in Figure 2.2. For English, some common substitutions used by Mitton (1996) are ph for f, eau for o (as in beau), and cs for x (as in ecstasy). In the general case when edits of any length are permitted, the time complexity of the minimum edit distance algorithm becomes O(m² ∗ n²) when comparing strings of length m and n.

The spelling error model proposed by Brill and Moore (2000) goes further to allow generic string edit operations up to a certain length N: any string of letters up to length N may be replaced by any string of up to the same length (a1a2..an → b1b2..bm; n, m ≤ N). This effectively adds arcs to the edit distance graph from any node to all nodes within N nodes to the right or N nodes below. If N = 2, then from node A there would be additional arcs in Figure 2.2 to nodes F, J, C, M, and N. Increasing N adds a constant factor of N² to the running time of the minimum edit distance calculation. For example, the running time when N = 2 is 4 ∗ m ∗ n, where m and n are the lengths of the two strings. Their approach is described in detail in section 2.3.

An example of the improvement seen in the accuracy of the spell checker as N increases is shown in Table 2.1. These results come from an evaluation on English native speaker misspellings in Brill and Moore (2000). Spell checker performance is evaluated by the position of the target word in the ranked list of candidate corrections. The 1-Best score is the percentage of misspellings where the spell checker suggested the target word as the first candidate correction. Here, the first candidate was correct for 87.0% of misspellings when N = 1. Likewise, when N = 1 the target word could be found in the top 2 candidates for 93.9% of misspellings and in the top 3 for 95.9% of misspellings. Their results show that increasing N improves performance up to a point, after which the results level off.8

8 Brill and Moore (2000) and Toutanova and Moore (2002) use an unspecified corpus of 10,000 misspellings to train and evaluate their spell checkers. Approximately 2,000 misspellings are used for the evaluation.

N    1-Best   2-Best   3-Best
1    87.0     93.9     95.9
2    90.9     95.6     96.8
3    92.9     97.1     98.1
4    93.6     97.4     98.5
5    93.6     97.4     98.5

Table 2.1: Percentage of Correct Suggestions in the 1- to 3-Best Candidates as a Function of the Maximum Substitution Length (N) on Native Speaker Misspellings from Brill and Moore (2000)

2.2.2 Costs of Edit Operations

As mentioned above, alternate weighting schemes can improve the ranking of candidate corrections by modeling how likely particular edits are. For instance, the substitution of r for l may be more likely than t for l for JWEFL. Instead of assigning each insertion, substitution, and deletion a cost of 1 as in the simple example given in Figure 2.2, Mitton (1996) assigns hand-tuned costs of 0 to 5 on all arcs. The target word under consideration may also affect the weights of particular types of edits.

Instead of hand-tuning weights, Church and Gale (1991) use a noisy channel model of spelling errors, which is explained in detail in section 2.3, and estimate the probability of each single character edit operation based on misspellings found in a large corpus of AP newswire text. Substitutions and transpositions are considered independent of context, and insertions and deletions are conditioned on the preceding letter. Each edit is initially assumed to be equally probable, and probabilities are adjusted iteratively as the misspellings in the corpus are corrected. Brill and Moore (2000) extend the model from Church and Gale (1991) to generic string edit operations up to a certain length N and estimate the probability of each edit from a corpus of spelling errors.

Model   1-Best   2-Best   3-Best   4-Best
L       94.2     98.2     98.9     99.0
PHL     86.4     93.7     95.7     96.6
CMB     95.6     98.9     99.3     99.5

Table 2.2: Percentage of Correct Suggestions in the 1- to 4-Best Candidates by the Letter (L), Pronunciation (PHL), and Combined (CMB) Models on Native Speaker Misspellings from Toutanova and Moore (2002)

2.2.3 Extending Edits to Pronunciation

Toutanova and Moore (2002) extend Brill and Moore (2000) to consider generic edits over both letter sequences from the word and misspelling and sequences of phones in the pronunciations of the word and misspelling. They show that including pronunciation information in the spelling correction model can improve spell checker performance as compared to the model from Brill and Moore (2000), which considers only the orthography. Their approach is described in section 2.3.2. Their results for misspellings by native speakers of English, given in Table 2.2, show that the performance of the model that combines the orthographic and pronunciation models (CMB) is better than either the orthographic (L) or pronunciation (PHL) model on its own. The orthographic model used in Toutanova and Moore (2002) is an extension of the Brill and Moore model described in section 2.2.1 with N = 3.

The types and costs of edit operations in previous spelling correction approaches are summarized in Table 2.3. Next, we turn to the noisy channel spelling correction approach used by Church and Gale (1991), Brill and Moore (2000), and Toutanova and Moore (2002) to determine the types and costs of edit operations.

                              Types of Operations              Costs of Operations
Damerau (1964)                Single character                 Equally weighted
Church and Gale (1991)        Single character                 Learned from training corpus
Mitton (1996)                 Single and limited               Hand-tuned
                              multi-character
Brill and Moore (2000)        Generic string substitutions     Learned from training corpus
                              up to length N
Toutanova and Moore (2002)    Generic string and phone         Learned from training corpus
                              substitutions up to length N

Table 2.3: Summary of Types and Costs of Edit Operations in Previous Spelling Correction Approaches

2.3 Noisy Channel Spelling Correction

The spelling correction models from Church and Gale (1991), Brill and Moore (2000), and Toutanova and Moore (2002) use the noisy channel model (Shannon, 1948) approach to determine the types and weights of edit operations. In the noisy channel model approach, a writer is considered to know the correct spelling of the intended word w, but as it is being written down the word passes through a noisy communication channel, resulting in the observed non-word r. In order to determine how likely a candidate correction is, the spelling correction model determines the probability that the word w was the intended word given the misspelling r: P(w|r). To find the best correction, the word w is found for which P(w|r) is maximized:

argmax_w P(w|r)    (2.1)

Applying Bayes’ Rule, equation 2.1 can be rewritten:

argmax_w P(w|r) = argmax_w P(w) P(r|w) / P(r)    (2.2)

The normalizing constant P(r) can be discarded since w does not depend on P(r), giving the resulting correction model:

argmax_w P(w|r) = argmax_w P(w) P(r|w)    (2.3)

In the noisy channel framework, P(w) is the source model, or how probable the word w is overall. P(r|w) is the channel (or error) model, or how likely it is for a writer intending to write w to output r. P(w) and P(r|w) can be estimated from corpora containing misspellings. In the following experiments, P(w) is assumed to be equal for all words, so the focus is on the error model, estimating P(r|w) from a corpus of misspellings. Of course, P(w) is not equal for all words, but it is not possible to estimate it from the available training corpus, the Atsuo-Henry Corpus (Okada, 2004), because it contains only pairs of words and misspellings for approximately 1,000 target words. P(w) could be estimated from a large corpus of text. For use in a spell checker for JWEFL, P(w) would ideally be estimated from a corpus of text by JWEFL rather than text by native speakers, since non-native writers will typically use a smaller vocabulary than native speakers.
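In code, ranking by equation 2.3 might be sketched as below; p_source and p_error are placeholder names for whatever source and error models are available, and with the uniform source model assumed here the ranking depends only on the error model:

def best_corrections(r, word_list, p_source, p_error, k=6):
    """Rank candidate corrections by P(w) * P(r|w) (equation 2.3),
    returning the k best-scoring words."""
    scored = [(p_source(w) * p_error(r, w), w) for w in word_list]
    scored.sort(reverse=True)
    return [w for score, w in scored[:k]]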

Church and Gale (1991) first used the noisy channel model for spelling correction by estimating the probabilities of single character edit operations (insertion, deletion, substitution, and transposition) from a corpus of newswire text. Their spelling correction method handles only misspellings that are one edit operation away from a word in the word list. When proposing corrections for a misspelling, their system selects any words one edit away from the misspelling as candidate corrections and uses the error model to rank them.

Brill and Moore (2000) extend the error model by allowing all edit operations α → β, where Σ is the alphabet and α, β ∈ Σ*. These edit operations are a superset of the edit operations in Church and Gale (1991) and Mitton (1996). In order to consider all ways that a word w may generate r with the possibility that any, possibly empty, substring α of w becomes any, possibly empty, substring β of r, it is necessary to consider all ways that w and r may be partitioned into substrings.

To illustrate this with an example, consider the word albatross and the misspelling *allubatros. One possible pair of partitions of the strings is a-l-b-a-t-r-o-ss and a-llu-b-a-t-r-o-s, respectively. With this particular partition, the probability that *allubatros was generated from the source word albatross would be P(a → a)P(l → llu)P(b → b)P(a → a)P(t → t)P(r → r)P(o → o)P(ss → s).

When all possible pairs of partitions for both w and r are taken into consideration, the letter error model PL is expressed as follows, where Part(w) is the set of all possible partitions of w, |R| is the number of segments in a particular partition, and R_i is the i-th segment of the partition:

PL(r|w) = Σ_{R ∈ Part(r)} Σ_{T ∈ Part(w), |R| = |T|} P(R|w) Π_{i=1}^{|R|} P(R_i → T_i)    (2.4)

PL is approximated by considering only the pair of partitions of w and r with the maximum product of probabilities of individual substitutions. Brill and Moore (2000) found P(R|w) difficult to model well, so they drop it. The resulting error model is shown below:

PL(r|w) ≈ max_{R ∈ Part(r), T ∈ Part(w)} Π_{i=1}^{|R|} P(R_i → T_i)    (2.5)

[Figure: the letters of albatross aligned with the letters of *allubatros]

Figure 2.3: Letter Alignment of Word and Misspelling
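Equation 2.5 can be computed with dynamic programming over prefixes of w and r, analogous to edit distance but maximizing a product of segment probabilities. The sketch below is illustrative only; it assumes substitution probabilities are stored in a dict sub_prob keyed by (α, β) pairs and that segments are limited to length N:

def p_l(r, w, sub_prob, N):
    """Approximate P_L(r|w) per equation 2.5: the best partition of
    w and r into substrings of length 0..N, scored by the product
    of substitution probabilities P(alpha -> beta)."""
    m, n = len(w), len(r)
    # best[i][j]: max product over partitions of w[:i] and r[:j]
    best = [[0.0] * (n + 1) for _ in range(m + 1)]
    best[0][0] = 1.0
    for i in range(m + 1):
        for j in range(n + 1):
            if best[i][j] == 0.0:
                continue
            # extend with a segment alpha = w[i:a], beta = r[j:b];
            # empty alpha or beta covers insertions and deletions
            for a in range(i, min(i + N, m) + 1):
                for b in range(j, min(j + N, n) + 1):
                    if a == i and b == j:
                        continue  # skip the empty-empty segment
                    p = sub_prob.get((w[i:a], r[j:b]), 0.0)
                    if best[i][j] * p > best[a][b]:
                        best[a][b] = best[i][j] * p
    return best[m][n]

# e.g. p_l("allubatros", "albatross", sub_prob, N=3) scores the
# partition a / l->llu / b / a / t / r / o / ss->s from the example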

2.3.1 Training the Error Model

The parameters for PL(r|w) are estimated from a corpus of pairs of misspellings and target words. First, the letters in the misspelling and target word are aligned, minimizing the single character edit distance between the two strings. An example alignment between albatross and *allubatros is shown in Figure 2.3. This alignment corresponds to the following edits (ε is the empty string):

a → a, ε → l, l → l, ε → u, b → b, a → a, t → t, r → r, o → o, s → s, s → ε

To generalize this to longer strings, each edit is expanded with up to M neighboring alignments.9 If M = 2 and we consider the alignment ε → u, the following additional edits are generated:

l → lu, b → ub, l → llu, lb → lub, ba → uba

9 While I expand all edits, Brill and Moore (2000) expand only non-match edits. In effect, Brill and Moore (2000) only consider substitutions α → α where |α| = 1, while I also consider longer α → α substitutions up to length |α| ≤ M + 1.
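A sketch of this expansion step, with the alignment represented as a list of (source, target) pairs; the alignment below reproduces the albatross example, and boundary positions are handled naively:

def expand_edits(alignment, M):
    """Generate each aligned edit plus its expansions with up to M
    neighboring alignment positions (left + right <= M)."""
    edits = []
    for i in range(len(alignment)):
        for left in range(M + 1):
            for right in range(M + 1 - left):
                window = alignment[max(0, i - left):i + right + 1]
                alpha = "".join(src for src, tgt in window)
                beta = "".join(tgt for src, tgt in window)
                edits.append((alpha, beta))
    return edits

# alignment of albatross and *allubatros ("" is the empty string)
alignment = [("a", "a"), ("", "l"), ("l", "l"), ("", "u"), ("b", "b"),
             ("a", "a"), ("t", "t"), ("r", "r"), ("o", "o"),
             ("s", "s"), ("s", "")]
# with M = 2, the expansions of ("", "u") include ("l", "lu"),
# ("l", "llu"), ("lb", "lub"), ("b", "ub"), and ("ba", "uba")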

The probability of each edit α → β is estimated by counting the number of times α → β is seen and dividing by the number of times α is seen:

P(α → β) = count(α → β) / count(α)    (2.6)

Using a training corpus that consists solely of pairs of misspellings and words (such as the Atsuo-Henry Corpus, see section 3.4) leads to lower counts for α → α than would be found in a corpus where misspellings are observed in context with many correctly spelled words. The undercounting of α → α can lead to an error model where a substitution such as tion → sion may have a relatively high probability compared to the correct alignment tion → tion, because the misspellings of tion outnumber the correct spelling in the training corpus. Since the majority of letters in a misspelling are correct,10 it is necessary to approximate count(α → α). In my implementation I choose to assign a minimum probability m for P(α → α).11 Given count(α → β) and count(α) from the spelling error corpus, P(α → β) is calculated as follows:

P(α → β) = m + (1 − m) · count(α → β) / count(α)    if α = β
P(α → β) = (1 − m) · count(α → β) / count(α)        if α ≠ β    (2.7)

m is a parameter for the spelling correction model.

10 The average word in the Atsuo-Henry Corpus is 7.1 letters long with 2.5 incorrect letters.
11 Brill and Moore (2000) mention this difficulty, but do not specify how they approximate P(α → β). The substitution patterns noted for JWEFL in section 1.1.2 indicate that it would be more appropriate to vary m based on the particular α, so that familiar letters from romazi have a higher minimum probability than letters not used in romazi.
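Equation 2.7 translates directly into code; the short sketch below assumes raw counts from the error corpus are stored in a dict (names illustrative):

def edit_prob(alpha, beta, count, m):
    """P(alpha -> beta) with a minimum probability m reserved for
    identity substitutions (equation 2.7). count[(alpha, beta)] and
    count[alpha] hold raw counts from the spelling error corpus."""
    p = count.get((alpha, beta), 0) / count[alpha]
    if alpha == beta:
        return m + (1 - m) * p
    return (1 - m) * p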

24 2.3.2 Extending the Model to Pronunciation Errors

Toutanova and Moore (2002) describe an extension to Brill and Moore (2000) where the same noisy channel error model is used to model phone sequences instead of letter sequences. The alphabet Σ consists of phones rather than letters. Instead of the word w and the non-word r, the error model estimates the probability that the pronunciation of the non-word r, pron_r, is produced instead of the pronunciation of the word w, pron_w. The error model over phone sequences, called PPH, is just like PL except that r and w are replaced with their respective pronunciations. The model is trained just as in section 2.3.1 except that the alignments are between the phones in the pronunciation of w and the pronunciation of r:

PPH(pron_w|pron_r) = max_{R ∈ Part(pron_r), T ∈ Part(pron_w)} Π_{i=1}^{|R|} P(R_i → T_i)    (2.8)

Since a spelling correction model needs to rank candidate words rather than candidate pronunciations, Toutanova and Moore (2002) derive an error model that determines the probability that a word w was spelled as the non-word r based on the pronunciations of w and r. This model, called PPHL, is the sum over all possible pronunciations of w of the probability that r is a misspelling of w. Toutanova and Moore (2002) approximate PPHL as follows:

PPHL(r|w) = Σ_{pron_w} P(pron_w, r|w) = Σ_{pron_w} P(pron_w|w) P(r|pron_w, w)    (2.9)

First, the non-word r is assumed to be independent of the word w given pron_w:

PPHL(r|w) ≈ Σ_{pron_w} P(pron_w|w) P(r|pron_w)    (2.10)

Next, the probability of each pronunciation of w in the word list is assumed to be equally likely. |pron_w| is the number of pronunciations of w in the word list:

PPHL(r|w) ≈ Σ_{pron_w} (1 / |pron_w|) P(r|pron_w)    (2.11)

Then, Bayes’ rule is applied to P(r|pron_w):

PPHL(r|w) ≈ Σ_{pron_w} (1 / |pron_w|) P(pron_w|r) P(r) / P(pron_w)    (2.12)

The marginal probabilities in the final term, P(r) / P(pron_w), are not modeled,12 so they are dropped:

PPHL(r|w) ≈ Σ_{pron_w} (1 / |pron_w|) P(pron_w|r)    (2.13)

pron_r is introduced in the final term:

PPHL(r|w) ≈ Σ_{pron_w} (1 / |pron_w|) Σ_{pron_r} P(pron_r, pron_w|r)    (2.14)

The sum over pron_r is approximated by its maximum term:

PPHL(r|w) ≈ Σ_{pron_w} (1 / |pron_w|) max_{pron_r} P(pron_r, pron_w|r)    (2.15)

P(pron_r, pron_w|r) is factored:

PPHL(r|w) ≈ Σ_{pron_w} (1 / |pron_w|) max_{pron_r} P(pron_w|r, pron_r) P(pron_r|r)    (2.16)

Finally, P(pron_w|r, pron_r) is approximated as P(pron_w|pron_r), with the assumption that pron_w is independent of the non-word r given pron_r:

PPHL(r|w) ≈ Σ_{pron_w} (1 / |pron_w|) max_{pron_r} P(pron_w|pron_r) P(pron_r|r)    (2.17)

12 Dropping P(pron_w) causes the model to ignore the fact that multiple words (homophones) can share the same pronunciation.

P(pron_w|pron_r) is the pronunciation error model PPH described above, and P(pron_r|r) is provided by the letter-to-phone model described in the following section, section 2.3.3.

For a misspelling r and a candidate correction w, the letter model PL gives the probability that w was written as r due to the noisy channel, taking into account only the orthography. PPH does the same for the pronunciations of r and w, giving the probability that pron_w was output as pron_r. The pronunciation model PPHL relates the pronunciations modeled by PPH to the orthography in order to give the probability that w was written as r based on pronunciation. PL and PPHL are combined as follows to calculate a score for each candidate correction:

SCMB(r|w) = log PL(r|w) + λ log PPHL(r|w)    (2.18)

The word w with the highest score according to SCMB is chosen as the highest-ranked word in the list of candidate corrections. Combining the models in this manner multiplies the probabilities of the two models, allowing either PL or PPHL to veto any candidate word if its probability is low. In effect, the highest-ranked candidate corrections will be ranked highly by both PL and PPHL individually.
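The combination and ranking might be sketched as follows, where p_l and p_phl are placeholder names for functions implementing the two models; the log of a zero probability is -infinity, which realizes the veto described above:

import math

def s_cmb(r, w, p_l, p_phl, lam):
    """Combined score of equation 2.18; a zero probability in
    either model vetoes the candidate."""
    def safe_log(p):
        return math.log(p) if p > 0 else float("-inf")
    return safe_log(p_l(r, w)) + lam * safe_log(p_phl(r, w))

def rank(r, candidates, p_l, p_phl, lam):
    # best-scoring candidate corrections first
    return sorted(candidates,
                  key=lambda w: s_cmb(r, w, p_l, p_phl, lam),
                  reverse=True)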

2.3.3 Letter-To-Phone Model

A letter-to-phone model is needed to predict the pronunciation of misspellings for PPHL, since they are not found in a pronouncing dictionary. Toutanova and Moore (2002) use the n-gram letter-to-phone model from Fisher (1999) to predict these pronunciations. The n-gram letter-to-phone model predicts the pronunciation of each letter in a word individually, considering m letters of context to the left and n to the right.

27 Training the Model

First, the best alignment between the letter sequence in the word and the phone sequence in the pronunciation is found by finding the alignment with the minimum edit distance, given a hand-crafted table of letter-phone edit costs (see Appendix B). Each letter is allowed to align to 0, 1, or 2 phones. Deletions correspond to an alignment with 0 phones, substitutions to 1 phone, and insertions to 2 phones. Any pairs in the training corpus with alignments containing more than one deletion in a row are discarded. Most of these correspond to the pronunciation of acronyms (e.g., AAA = t r ih p ah l ey). Alignments between a letter and 2 phones come from insertions, which are associated with the alignment immediately to the left, so the alignments for the word ax and the pronunciation ae k s are a → ae and x → k s.

Next, all rules of the form Lm.T.Rn → phone_o are counted from the training corpus, where T is the letter whose pronunciation is being predicted, Lm is a sequence of m letters to the left of T, Rn is a sequence of n letters to the right of T, and phone_o is the sequence of 0, 1, or 2 phones aligned with T in the alignment step. The distribution over predicted pronunciations is learned for each left-hand side. Toutanova and Moore (2002) consider 0 to 4 letters of context on each side of the target letter. A word boundary character is inserted on each side of the word and treated as a letter by the algorithm.

Predicting Pronunciations

To predict the pronunciation of a word, the letters in the word are considered one-by-one. For each letter, the contexts of the form Lm.T.Rn for the target letter T are considered, and the most specific context found in the training data is used to predict the pronunciation. The longest context is considered the most specific, and the order in which contexts are ranked is intended to favor the longest context and the right-hand context.13

Since the most specific context does not always provide the best prediction due to sparse training data (cf. Toutanova and Moore, 2002), the prediction step is extended to consider the most probable phone for the top N most specific contexts. If there is a majority candidate in the top N, the majority candidate is chosen. If there is not a majority, the predictor uses the most specific context.

13 The context order used is L4R4, L3R4, L4R3, L3R3, L2R4, L4R2, L2R3, L3R2, L1R4, L4R1, L2R2, L1R3, L3R1, L0R4, L4R0, L2R1, L1R2, L0R3, L3R0, L1R1, L0R2, L2R0, L0R1, L1R0, L0R0, where LxRy corresponds to a context of x letters to the left of the letter under consideration and y letters to the right.
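The prediction step might be sketched as below. This is only an illustration: rules is assumed to map a context tuple (L_m, T, R_n) to a Counter over phone sequences, context_order is assumed to list the (m, n) context sizes from most to least specific as in the footnote, and the word is assumed to be padded with boundary characters:

from collections import Counter

def predict_phones(word, i, rules, context_order, n_top):
    """Predict the phone sequence for word[i]: collect the best
    phone sequence from each of the n_top most specific contexts
    seen in training, then take a majority vote."""
    predictions = []
    for left, right in context_order:        # most specific first
        context = (word[max(0, i - left):i],   # L_m
                   word[i],                    # T
                   word[i + 1:i + 1 + right])  # R_n
        if context in rules:
            # most probable phone sequence for this context
            predictions.append(rules[context].most_common(1)[0][0])
            if len(predictions) == n_top:
                break
    if not predictions:
        return ""
    phones, votes = Counter(predictions).most_common(1)[0]
    # majority among the top contexts; otherwise fall back to the
    # single most specific context
    return phones if votes > len(predictions) // 2 else predictions[0]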

2.4 Spell Checkers Adapted for JWEFL

Thus far, the noisy channel spelling correction approach has only been used for general-purpose spell checkers for native writers. One spelling correction approach, the spell checker described in Mitton (1996), has been specifically adapted for JWEFL.14

Mitton and Okada (2007) find three common errors in the Atsuo-Henry Corpus: 1) the substitution of l for r and vice versa, 2) the substitution of b for v and vice versa, and 3) the insertion of extra syllables, particularly with the vowels o and u. Mitton and Okada adapt Mitton’s original spell checker (Mitton, 1996) using these observations and test the spell checker on native-speaker and JWEFL errors from the Atsuo-Henry Corpus (see section 3.4). Mitton and Okada find that adapting for these three errors noticeably improves the performance of the spell checker. Their results are given in Table 2.4. They report that the improvement in performance is almost entirely the result of adaptations to handle confusions between r/l and b/v.15

Type of Writer   Spell Checker                  1-Best   3-Best   6-Best
Native           Native-speaker spell checker   54.2     67.9     73.4
JWEFL            Native-speaker spell checker   61.2     73.3     77.9
JWEFL            JWEFL-adapted spell checker    65.8     78.7     83.5

Table 2.4: Percentage of Correct Suggestions in the 1- to 6-Best Candidates for Native and JWEFL Misspellings from the Atsuo-Henry Corpus (Mitton and Okada, 2007)

14 Spell checkers have also been adapted for other non-native writers, such as the FipsOrtho spell checker developed for learners of French (L’Haire, 2007).

2.5 Summary

The previous research has shown how spell checkers developed for native speakers that extend edit operations beyond single character edits (Brill and Moore, 2000) and that factor in pronunciation (Toutanova and Moore, 2002) provide better rankings of the candidate corrections and thus better spell checker performance. Mitton and Okada (2007) show that adaptations for common error types for non-native writers with a common language background, in this case Japanese, can improve spell checker performance in comparison to the native-speaker spell checker.

The spelling correction approach developed in Toutanova and Moore (2002) appears especially appropriate for use with JWEFL because it models both common sources of spelling errors: difficulties with letters in the writing system and systematic differences in pronunciation due to differences between English and Japanese phonology. The approach from Toutanova and Moore (2002) will be used in this thesis to develop a spell checker for JWEFL that takes into account pronunciation variation by Japanese speakers of English. The method for modeling pronunciation variation will be presented in Chapter 4, but first the resources required to model pronunciation variation and train the spelling correction models are described in Chapter 3.

15 Note that their results for native-speaker misspellings cannot be compared to the results from Brill and Moore (2000) and Toutanova and Moore (2002) because of differences in the corpora used for evaluation.

CHAPTER 3

RESOURCES AND DATA PREPARATION

The spelling correction approach that includes error models for both orthography and pronunciation (Toutanova and Moore, 2002, see section 2.3) and that considers pronunciation variation for non-native writers requires a number of resources: 1) spoken corpora of American English (TIMIT) and Japanese English (English Read by Japanese) are used to model pronunciation variation in Japanese English, 2) a pronunciation dictionary (CMUDICT) provides American English pronunciations for the target words, 3) a corpus of spelling errors made by JWEFL (Atsuo-Henry Corpus) is used to train spelling error models and test the spell checker’s performance, and 4) the Spell Checker Oriented Word Lists (SCOWL) are adapted for use as the spell checker’s word lists. These resources and their preparation for use with the spelling correction approach are described below.

3.1 TIMIT

The TIMIT Corpus (TIMIT, 1991) is a corpus of read speech that contains 6,300 sentences recorded by 630 American speakers from all major dialect regions. The sentences were constructed to elicit dialect differences and to cover all possible phone sequences in English. The sentences were phonetically transcribed using the set of 52 phonemes given in Table A.1. The TIMIT phonemes are mapped onto the CMUDICT phoneme set as described in Table A.3.

3.2 English Read by Japanese Corpus

The English Read by Japanese (ERJ) Corpus (Minematsu et al., 2002) consists of 70,000 prompts recorded by 200 native Japanese speakers with varying English competence. In order to focus on systematic differences in pronunciation of English by Japanese speakers rather than differences due to temporary pronunciation errors or lack of knowledge of the pronunciation of particular lexical items, the speakers read text accompanied by written phonemic and prosodic cues. The target language is considered General American English and the phonemic cues use a reduced set of phonemes from the TIMIT Corpus with 41 phonemes (see Table A.2). The ERJ phonemes are also mapped onto the CMUDICT phoneme set as outlined in Table A.3.

The words include phonemically-balanced words, minimal pairs, and words with various accent patterns. The sentences include phonemically-balanced sentences, sentences intended to be challenging for Japanese speakers, and sentences with various intonation and stress patterns. The speakers are undergraduate and graduate students at twenty different Japanese universities. Each speaker read approximately 220 word prompts and 120 sentence prompts. Speakers were allowed to practice each prompt in advance and to rerecord the prompt until they thought that they had produced the correct pronunciation. See Minematsu et al. (2002) for details on the construction of the corpus.

Also included in the ERJ Corpus is a smaller, comparable corpus of recordings of the same prompts by 18 native American English speakers. Each speaker recorded 400 word prompts and 500 sentence prompts, half of the complete set of prompts created for the Japanese speakers.

3.3 CMU Pronouncing Dictionary

The CMU Pronouncing Dictionary (CMUDICT, 1998) contains the pronunciations for over 125,000 English words. The CMU phoneme set uses 39 phonemes, two phonemes fewer than the ERJ Corpus (ax and axr are not used), and lexical stress is indicated on the vowels. For the spell checker developed in this thesis, the lexical stress indicators are ignored.
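In code, ignoring the stress indicators amounts to stripping the stress digits that CMUDICT attaches to vowels (a minimal sketch; the function name is ours):

def strip_stress(pron):
    # CMUDICT marks lexical stress with a digit on each vowel,
    # e.g. "W ER1 D Z" -> ["W", "ER", "D", "Z"]
    return [phone.rstrip("012") for phone in pron.split()]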

3.4 Atsuo-Henry Corpus

The Atsuo-Henry Corpus (Okada, 2004) consists of two subcorpora of English spelling errors, one of spelling errors made by native speakers and one of spelling errors made by JWEFL. The subcorpus of errors made by native speakers of English is taken from the Birkbeck Corpus collected by Roger Mitton (described in Mitton, 1996) and contains approximately 31,000 type misspellings of 5,000 target words. The subcorpus of errors by JWEFL is described in Okada (2004) and consists of a collection of errors from seven corpora, including errors collected from short essays, from translation tasks, and from a spelling test task. There are 4,875 type misspellings of 1,144 target words. Both corpora are from hand-written data, which does not itself contain typographical errors; some typographical errors entered the source corpora when they were originally digitized, but these were mostly eliminated when the Atsuo-Henry Corpus was compiled (Okada, 2004). Both subcorpora contain errors by speakers with varying educational backgrounds and, for the non-native corpus, varying English proficiency levels. Since the corpora contain only type and not token errors and come from a range of sources, neither is intended to be a balanced or representative corpus of spelling errors, but they still provide valuable training and testing data.

For training and testing the spell checker presented in this thesis, the JWEFL corpus has been cleaned up and modified slightly to fit the task: a few target British spellings have been converted to American spellings (e.g., aeroplane to airplane) for compatibility with CMUDICT16, items containing spaces or hyphens have been removed, capitalization has been normalized for non-proper nouns, and items without pronunciations in CMUDICT (e.g., decorates) and pairs where the spelling test elicitation failed (e.g., *ce for certainly) have been removed.17 This results in 4,769 misspellings of 1,046 words. The data is divided into training (80%), development (10%), and test (10%) sets.

3.5 Spell-Checker Oriented Word Lists

When a spell checker uses the minimum edit distance approach to select and rank candidate corrections (see section 2.1.2), it needs a list of correct words from which to select candidates. The misspelling is compared to words in the word list. The word lists for the spell checker presented in this thesis are generated from the Spell Checker Oriented Word Lists (SCOWL).18 The words from CMUDICT could also be used as a word list; however, CMUDICT is intended to contain a fairly exhaustive list of English words rather than a list of words used in everyday spell checking and contains many infrequent words and proper names. Using the SCOWL word lists makes it possible to construct word lists based on word frequency.

16 The misspelling in this case (*airplan) appears to target the American spelling. The other conversions involve doubling l to ll; the JWEFL errors in these cases are elsewhere in the word, never in this doubling itself.
17 A subset of the Atsuo-Henry Corpus comes from the Samantha Error Corpus, where JWEFL completed a spelling test task. They were presented with a definition of the target word in Japanese and an approximation of the English pronunciation in katakana (Mitton and Okada, 2007).
18 http://wordlist.sourceforge.net

The SCOWL word lists are divided into British, American, Canadian, and general English lists. Within each there are sublists of words, abbreviations, contractions, upper-case words, and proper-names. Upper-case words and common proper names that are frequent enough to appear in a typical dictionary are separated from less frequent upper-case names (proper-names). Additionally, words with common variant spellings are divided by how widely accepted the variants are: variant 0 words are nearly equally acceptable, variant 1 words are generally considered acceptable, and variant 2 words are rare. Each SCOWL sublist is also split into smaller lists based on the frequency of the words.

In order to create general purpose word lists that cover all the target words from the Atsuo-Henry Corpus, the following SCOWL sublists were combined to create the word lists with American English as the target: English words, American words, English upper, English contractions, variant 0-words, and variant 0-upper.

Since the target pronunciation of each word list item is needed for the pronunciation model, the SCOWL-based word lists are filtered to remove words whose pronunciation is not in CMUDICT. CMUDICT does not include the pronunciation of many possessive forms in the SCOWL word lists, so the pronunciation of plural forms is substituted where possible.19 The sizes of the initial SCOWL word lists and the CMUDICT-filtered lists are shown in Table 3.1. The size of a spell checking word list is characterized by a 2-digit number where 10 is tiny, 30 is small, 50 is medium, 60 is medium-large, and 70 is large. For general purpose spell checking, GNU Aspell defaults to size 60.20 Size 50 is the smallest size that contains all the target words from the Atsuo-Henry Corpus, so only lists of size 50 and above will be used with the spell checker developed in this thesis.

19 In the size 70 word list, approximately 7,200 possessive form pronunciations are generated in this manner.
20 Aspell's general purpose English word list, which is not directly related to SCOWL, contains 132,000 words (http://aspell.net/man-html/Dictionary-Naming.html).
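A sketch of this filtering step, assuming CMUDICT has already been parsed into a dictionary from word to pronunciation; the possessive handling mirrors the plural substitution described above, and all names are ours:

def filter_word_list(words, cmudict):
    # Keep only words whose pronunciation is known; for possessives
    # missing from CMUDICT, fall back on the pronunciation of the plural.
    kept = {}
    for word in words:
        if word in cmudict:
            kept[word] = cmudict[word]
        elif word.endswith("'s") and word[:-2] + "s" in cmudict:
            kept[word] = cmudict[word[:-2] + "s"]  # e.g. cat's <- cats
    return kept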

SCOWL Dict. Size   # Words in SCOWL List   # Words in CMUDICT-Filtered List
10                   5,041                   4,617
20                  13,945                  12,319
35                  49,602                  37,399
40                  56,288                  41,266
50                  89,716                  54,001
60                 115,311                  58,426
70                 171,369                  62,474

Table 3.1: Word List Sizes

CHAPTER 4

METHOD

The method for modeling pronunciation variation from a phonetically untranscribed corpus of read speech is presented in this chapter along with the implementation of the noisy channel spelling approach, including the letter-to-phone model and the noisy channel error models from Toutanova and Moore (2002). First, section 4.1 describes how pronunciation variation in the ERJ Corpus is used to extend the pronouncing dictionary CMUDICT with multiple variations of each canonical pronunciation. The next section, section 4.2, describes the implementation of the letter-to-phone model, which is used to predict the pronunciations of misspellings for use with the pronunciation error model, and the implementation and training of Toutanova and Moore (2002)'s spelling correction model to develop the JWEFL-adapted spell checker.

4.1 Pronouncing Dictionary with Variation

A spell checker that relies solely on orthography and uses minimum edit distance to rank candidate corrections finds possible corrections by comparing a misspelling to words in a word list. In the same way, the pronunciation-based spelling correction approach developed in Toutanova and Moore (2002) requires a list of possible pronunciations so that the minimum edit distance between the pronunciation of the misspelling and the pronunciation of correct words can be calculated. Because of the many differences between English and Japanese phonology, the target pronunciation that influences a phonetic spelling error may be different for Japanese speakers than native speakers. In order to account for these differences, a model of pronunciation variation is required.

The pronunciation variation observed in the ERJ Corpus will be used to generate additional pronunciations for each word in the word list. If the ERJ Corpus were transcribed at the phone level, it would be trivial to observe the most frequent pronunciation variations; however, since it is not transcribed, it is necessary to adapt a recognizer trained on native English speech. First, the ERJ Corpus is recognized using a monophone recognizer trained on American English data from the TIMIT Corpus. Next, the most frequent variations observed between the canonical pronunciations of the ERJ utterances and the recognized pronunciations are used to adapt the American English monophone recognizer. The adapted recognizer is then used to recognize the ERJ Corpus in forced alignment with the canonical pronunciations of the utterances. Finally, the observed variations from the forced alignment step are used to create models of pronunciation variation for each phone. These are used to generate multiple pronunciations of each word in the word list.

Figure 4.1: Example Phone Alignment

4.1.1 Initial Recognizer

A monophone speech recognizer was trained on all TIMIT Corpus data using the Hidden Markov Model Toolkit, Version 3.4 (Young et al., 2006).21 This recognizer is used to generate a phone string for each utterance in the ERJ Corpus. Each recognized phone string is then compared to the canonical pronunciation provided to the speakers during the recording sessions. The recognized phone strings are aligned with the canonical pronunciations using the phone alignment algorithm from Fosler-Lussier (1999), which uses phone edit distances based on phonetic features to find the best alignment. An example alignment is shown in Figure 4.1 for the utterance peg with the canonical pronunciation p eh g and recognized pronunciation d ey g ih.

Correct alignments (such as g → g) and substitutions (such as p → d) are considered with no context and any insertions (such as the insertion of ih) are conditioned on the previous phone. Thus, from Figure 4.1, the observed alignments are: p → d, eh → ey, g → g, and g → g ih. Deletions are currently ignored in order to simplify the HMM adaptation in the next step.

21 http://htk.eng.cam.ac.uk; the training was done with an adapted version of Keith Vertanen's HTK Recipe (http://www.inference.phy.cam.ac.uk/kv227/htk/), which follows the tutorial for creating monophone HMMs presented in section 3.2 of the HTK Book (Young et al., 2006).
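The bookkeeping for these observations is straightforward. The following sketch assumes each utterance alignment is given as a list of (canonical, recognized) pairs with None marking a gap, and follows the conventions above: no context for matches and substitutions, insertions conditioned on the previous phone, deletions ignored. All names are ours.

from collections import Counter, defaultdict

def count_alignments(utterances):
    counts = defaultdict(Counter)
    for alignment in utterances:
        prev = None  # last (canonical, recognized) pair, for insertions
        for canon, rec in alignment:
            if canon is None:          # insertion, e.g. (None, "ih")
                if prev is not None:   # also count g -> g ih
                    counts[prev[0]][prev[1] + " " + rec] += 1
                continue
            if rec is None:            # deletion: currently ignored
                prev = None
                continue
            counts[canon][rec] += 1    # match or substitution, no context
            prev = (canon, rec)
    return counts

# The Figure 4.1 example yields p -> d, eh -> ey, g -> g, and g -> g ih:
# count_alignments([[("p", "d"), ("eh", "ey"), ("g", "g"), (None, "ih")]])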

The phone alignments for all utterances in the ERJ Corpus are collected. Since the phone accuracy of the monophone recognizer on native speech is only around 50% and the phone accuracy on non-native speech is undoubtedly much lower, alignments are observed between nearly all pairs of phones. In order to focus on the most frequent alignments common to multiple speakers and multiple utterances, a cutoff point is calculated based on the frequency of the most frequently observed phone for each canonical phone: any alignment observed less than 20% as often as the most frequent alignment for the given canonical phone is discarded. For example, if the most frequently observed phone for the canonical phone p is p and p → p is observed 43% of the time, then any alignment observed at least 8.6% of the time will be considered as a possible variation of the canonical phone p. The cutoff of 20% was chosen to allow a few variations for most phones. A small number of phones have no variants (iy, w, ey, and h) while a few have ten or more variants (e.g., ah, l).22 A probability distribution over these frequently observed phones is created for each canonical phone.
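Given the alignment counts for one canonical phone (for instance from the hypothetical count_alignments sketch above), the cutoff and renormalization reduce to a few lines:

def variation_distribution(counter, cutoff=0.2):
    # Discard any alignment observed less than 20% as often as the most
    # frequent alignment for this canonical phone, then renormalize.
    threshold = cutoff * max(counter.values())
    kept = {phone: c for phone, c in counter.items() if c >= threshold}
    total = sum(kept.values())
    return {phone: c / total for phone, c in kept.items()}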

4.1.2 Adapting the Recognizer

Now that probability distributions over observed phones have been created, the HMMs trained on TIMIT are modified as follows to allow the observed variation. An example of the original three-state left-to-right HMM for the phone p is shown in Figure 4.2. To allow, for instance, the observed variation between p, th, t, and dh, the states for th, t, and dh are inserted into the model for p as separate paths. The resulting phone model is shown in Figure 4.3. The transition probabilities into the first states of p and the variant phones come directly from the probability distribution observed in the initial recognition step. In this case, p has been seen 43.7% of the time for the canonical phone p, th has been seen 31.1% of the time instead, etc. The transition probabilities between the three states for each variant phone remain unchanged. All HMMs are adapted in this manner using the probability distributions from the initial recognition step.

22 It is not surprising that phones that are well-known to be difficult for Japanese speakers (cf. Minematsu et al., 2002) are the phones with the most observed variation.

Figure 4.2: Original phone model for p

The adapted HMMs are used to recognize the ERJ Corpus for a second time, this time in forced alignment with the canonical pronunciations. The state transitions indicate which variant of each phone was recognized and the correspondences between the canonical phones and recognized phones are used to generate a new probability distribution over observed phones for each canonical phone. These probability distributions are used to find the most probable pronunciation variations of pronunciations in the native-speaker pronouncing dictionary as described in the following section.

4.1.3 Generating Pronunciations

The observed phone variation is used to generate multiple pronunciations for each pronunciation in the word lists. The OpenFst Library23 is used to find the most probable pronunciations in each case.

23http://www.openfst.org/

Figure 4.3: Adapted phone model for p accounting for variation between p, th, t, and dh

Figure 4.4: Finite state transducer for the canonical phone r, with arcs r:r/1.12, r:uh/1.15, r:d/1.57, and r:l/1.90, where the respective transition weights reflect the negative logarithm of the probability that the phone r, uh, d, or l was observed for r

Word List   # of Words   # of Pronunciations
50           54,001       255,827
60           55,631       274,322
70           58,426       288,295

Table 4.1: Number of Pronunciations with Five Generated Variations

First, finite state transducers are created for each phone using the probability distributions from the previous section. An example for the phone r is shown in Figure 4.4. By concatenating the FSTs for each phone in the canonical pronunciation from CMUDICT, an FST is created for the entire word. The best n paths through the FST are found using the OpenFst tools. The pronunciations corresponding to the best n paths and the original canonical pronunciation become possible pronunciations of the word in the extended pronouncing dictionary.

Table 4.1 shows the total number of pronunciations generated when the top five variations of each pronunciation from the word list are generated.24 Since the number of pronunciations becomes very large and the task of comparing the pronunciation of a misspelling to all pronunciations in the list becomes very time-consuming for the spell checker, the number of pronunciation variations is currently limited to five; however, the method presented here can generate any number of pronunciation variations for each canonical pronunciation.

24 When CMUDICT contains multiple pronunciations of one word, variations are generated independently for both pronunciations. The totals here are slightly smaller than the number of pronunciations multiplied by five because there are occasionally fewer than five possible variations and the canonical pronunciation is also sometimes one of the generated pronunciations.
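In the implementation these FSTs are built and searched with the OpenFst tools. Purely as an illustration of the computation involved, the sketch below enumerates the n most probable variants directly, exploiting the fact that the phone positions vary independently so the n-best paths can be found by uniform-cost search over per-position choices; all names are ours.

import heapq
import math

def n_best_variants(canonical, dists, n=5):
    # dists maps each canonical phone to {observed phone(s): probability},
    # as in the distributions built in section 4.1.2.
    arcs = [sorted((-math.log(p), v) for v, p in dists[ph].items())
            for ph in canonical]
    start = (sum(a[0][0] for a in arcs), (0,) * len(arcs))
    heap, seen, results = [start], {start[1]}, []
    while heap and len(results) < n:
        cost, idx = heapq.heappop(heap)
        results.append(" ".join(arcs[i][j][1] for i, j in enumerate(idx)))
        # Successors differ from the popped path in exactly one position.
        for i, j in enumerate(idx):
            if j + 1 < len(arcs[i]):
                succ = idx[:i] + (j + 1,) + idx[i + 1:]
                if succ not in seen:
                    seen.add(succ)
                    heapq.heappush(
                        heap,
                        (cost - arcs[i][j][0] + arcs[i][j + 1][0], succ))
    return results

# e.g. n_best_variants(["r", "eh", "d"], dists) for the canonical
# pronunciation of "red".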

4.2 Implementation of the Noisy Channel Spelling Correction Approach

The letter-to-phone model and the noisy channel spelling correction approach described in section 2.3 are implemented in order to evaluate the effect of incorporating the pronunciation variation described in the previous section in Toutanova and Moore (2002)'s spelling correction approach. The implementation and evaluation of the letter-to-phone model is described in section 4.2.1 and the implementation of the noisy channel spelling correction models follows in section 4.2.2.

4.2.1 Letter-to-Phone Model

I implemented the letter-to-phone algorithm as described in section 2.3.3 and trained and evaluated it using pronunciations from CMUDICT. Initially, the letter-to-phone model was trained on 80% of CMUDICT and tested on the remaining 20%. The letter-to-phone model can return multiple pronunciations for each word, but in order to evaluate the performance, the most probable pronunciation for each word was used. The results are evaluated using the tool sclite from the NIST Scoring Toolkit (SCTK).25 The results are shown in Table 4.2. The phone accuracy is comparable to the performance reported in Toutanova and Moore (2002) on the NETtalk data set, although the word accuracy is approximately 5% lower.26 Considering up to the top 7 most specific contexts improves performance for CMUDICT.

25 http://www.nist.gov/speech/tools/
26 Toutanova and Moore (2002) report the best performance of their letter-to-phone model, also based on the model from Fisher (1999), on the NETtalk data set as 91.5% phone accuracy and 63.6% word accuracy. However, they implement several further extensions to the model beyond what is described and implemented in this thesis.

N   Phone Acc.   Word Acc.
1      91.3        55.7
3      91.6        57.2
5      91.8        58.0
7      91.8        58.1
9      91.7        57.4

Table 4.2: Phone and Word Accuracy for Letter-to-Phone Model Trained and Tested on CMUDICT as a Function of the Number of Most-Specific Contexts (N)

CMUDICT is intended to be a comprehensive pronunciation dictionary and contains pronunciations for a large number of low frequency words and for many names, whose pronunciations tend to be difficult to predict from orthography. When considering the task at hand, predicting pronunciations of misspellings of relatively frequent words used by JWEFL, the items in CMUDICT do not seem to be particularly well-suited for training a letter-to-phone model for use in a JWEFL-adapted spell checker.

Instead of using the training and test sets consisting of CMUDICT entries, a new training corpus was created by pairing the words from the size 70 CMUDICT-filtered word list (see section 3.5) with their pronunciations from CMUDICT. This list of approximately 62,000 words was likewise split into a training set containing 80% of entries and a test set of the remaining 20%. The performance of the resulting model (Word List 70) on the size 70 word list test set is shown along with the performance of the original CMUDICT model on the size 70 word list test set in Table 4.3.27

27There may be some overlap between the CMUDICT model training items and the Word List 70 model test items.

        Word List 70 Model        CMUDICT Model
N     Phone Acc.   Word Acc.   Phone Acc.   Word Acc.
1        95.4        74.2         90.0        55.9
3        95.5        74.9         90.4        57.4
5        95.4        73.9         90.5        56.9

Table 4.3: Phone and Word Accuracy for Letter-to-Phone Models Trained on Word List 70 and CMUDICT, Tested on Word List 70 Test Set as a Function of the Number of Most-Specific Contexts N

The CMUDICT model's performance drops slightly on the Word List 70 test set as compared to the CMUDICT test set, but the Word List 70 model outperforms it by a large margin. The best performance is seen when N = 3.

The Word List 70 letter-to-phone model with N = 3 will be used to predict the pronunciations of misspellings while training and testing the spelling correction approach, whose implementation is described in the following section.
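For reference, the back-off lookup at prediction time can be sketched as follows. The tables (collected from the letter-phone alignments during training) are assumed, and the sketch takes only the single most specific matching context, whereas the model evaluated above combines the top N contexts; the # padding matches the boundary character of Appendix B.

def predict_pronunciation(word, tables, order, pad="#"):
    # tables[(x, y)] maps (left, letter, right) contexts to the most
    # probable phone string ("" for silent letters, possibly two phones);
    # order is the back-off sequence of context sizes from footnote 13.
    padded = pad * 4 + word + pad * 4
    phones = []
    for i in range(4, 4 + len(word)):
        for x, y in order:
            key = (padded[i - x:i], padded[i], padded[i + 1:i + 1 + y])
            if key in tables[(x, y)]:
                phones.extend(tables[(x, y)][key].split())
                break
    return phones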

4.2.2 Noisy Channel Spelling Correction

I implemented the combined letter and pronunciation spelling correction models as described in section 2.3. The letter error model PL(r|w) is trained using pairs of words and misspellings from the Atsuo-Henry training corpus (section 3.4) and the phone error model PPH(pronr|pronw) is trained on the pronunciations of the same pairs of words and misspellings. The pronunciations of the words are provided by CMUDICT and the pronunciations of the misspellings are predicted using the Word List 70 letter-to-phone model described in section 4.2.1.

In order to rank the words as candidate corrections for a misspelling r, PL(r|w) and PPHL(r|w) are calculated for each word in the word list using the algorithm

Figure 4.5: Word List Trie

described in Brill and Moore (2000), which calculates the minimum edit distance between the misspelling and all words in the word list. In order to do this efficiently, the word list is stored in a trie (see section 2.1.1). At each node in the trie, a vector corresponding to a row of an edit distance matrix is stored. A sample word list trie is shown in Figure 4.5. If the current misspelling is *raod and we are calculating PL, then the values in the vector shown at the bottom right-hand node in the figure correspond to PL(bed → ), PL(bed → r), PL(bed → ra), PL(bed → rao), and PL(bed → raod). The edit distance vectors are filled in from the root downwards in the tree so that each calculation can reference the edit distances stored for shorter strings in the nodes above it. The possible edit operations are the substitutions α → β observed during training and the edit costs are the negative log probabilities for each substitution.
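A sketch of this traversal, restricted to single-character edit operations for brevity (Brill and Moore's model generalizes the cell update to substitutions α → β of up to N characters); cost(a, b) would return the trained negative log probability, with the empty string marking insertions and deletions, and all names are ours:

def build_trie(words):
    trie = {}
    for w in words:
        node = trie
        for ch in w:
            node = node.setdefault(ch, {})
        node[""] = w  # mark a complete word at this node
    return trie

def rank_candidates(trie, r, cost, n=6):
    results = []
    row0 = [0.0]
    for j, rch in enumerate(r):        # align the empty prefix with r
        row0.append(row0[j] + cost("", rch))
    def visit(node, row):
        if "" in node:
            results.append((row[-1], node[""]))
        for ch, child in node.items():
            if ch == "":
                continue
            new = [row[0] + cost(ch, "")]              # delete ch
            for j, rch in enumerate(r):
                new.append(min(row[j] + cost(ch, rch),       # substitute
                               row[j + 1] + cost(ch, ""),    # delete ch
                               new[j] + cost("", rch)))      # insert rch
            visit(child, new)
    visit(trie, row0)
    return [w for _, w in sorted(results)[:n]]  # lowest cost = most probable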

PPH is calculated in exactly the same manner except that orthographic strings are replaced by strings of phones. When the edit distances are calculated for pronunciations, the "word list" used by the spell checker is the list of all possible pronunciations from the pronouncing dictionary developed in section 4.1. The pronunciations of misspellings are predicted using the Word List 70 letter-to-phone model from section 4.2.1. Currently, the top 50 ranked pronunciations according to PPH are used in the calculation of PPHL. Finally, PL and PPHL are combined as in equation 2.18 to determine the final score for each word in the word list. The scores are used to find and rank the top n candidate corrections. The following chapter presents the evaluation of the spell checker.
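Equation 2.18 itself is given in Chapter 2; assuming the weighted log-linear combination suggested by the description of λ as the weight on the phone model (section 5.3.1), the final scoring step would be on the order of:

import math

def combined_score(p_l, p_phl, lam):
    # Assumed log-linear form of equation 2.18: the letter model score
    # plus the phone-in-letter model score weighted by lambda.
    return math.log(p_l) + lam * math.log(p_phl)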

CHAPTER 5

RESULTS

The performance of the spelling correction algorithm described in the previous chapter, with and without the extensions from modeling pronunciation variation, is compared to the performance of the native-speaker spell checker GNU Aspell.

5.1 Experimental Setup

For all experiments, the alphabet Σ is restricted to the upper and lower case letters plus apostrophe (') used in contractions. The word lists D are the CMUDICT-filtered SCOWL word lists described in section 3.5.

5.2 Baseline

The open source spell checker GNU Aspell (Aspell, 2008) is used to determine the baseline performance of a native-speaker spell checker. Dictionaries for Aspell were created with the same word lists described in section 3.5. Aspell was configured to use the similarity key table distributed with the English dictionary28 but without an affix dictionary. Aspell's similarity key approach is described in section 2.1.2.

28ftp://ftp..org/gnu/aspell/dict/en/aspell6-en-6.0-0.tar.bz2

Dict. Size   1-Best   2-Best   3-Best   4-Best   5-Best   6-Best
50            44.1     54.0     64.1     68.3     70.0     72.5
60            42.9     52.9     63.2     67.6     69.5     71.4
70            41.0     50.8     62.2     66.6     68.9     70.4

Table 5.1: Aspell Results: Percentage of Correct Suggestions on the Atsuo-Henry Corpus Test Set

The performance of Aspell on the test set from the Atsuo-Henry Corpus is shown in Table 5.1. Aspell makes suggestions related to word tokenization that include spaces and hyphens, so in order to only consider suggestions directly from the word list, all suggested corrections containing spaces or hyphens were removed from the list of proposed corrections and the remaining items were used in the evaluation. The 1-Best performance is the percentage of test items for which the first candidate correction was the target word, the 2-Best performance is the percentage of test items for which the target word was in the top two items, and so on. It is difficult to compare these results with other reported results, such as those from Mitton and Okada (2007) given in section 2.4, because the word lists may differ significantly, but the 1-Best accuracy of Aspell is approximately 20% lower than that of Mitton and Okada's spell checker.
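The n-Best figures used throughout this evaluation reduce to a simple computation over the ranked suggestion lists (a sketch; the data structure is ours):

def n_best_accuracy(items, n):
    # items: (ranked_suggestions, target) pairs, one per test misspelling.
    hits = sum(1 for suggestions, target in items if target in suggestions[:n])
    return 100.0 * hits / len(items)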

5.3 Evaluation

The spelling correction model described in the previous chapter is trained on the training data from the Atsuo-Henry Corpus (see section 3.4). Before the spelling correction algorithm can be evaluated on the test items, several model parameters need to be set. The development set from the Atsuo-Henry Corpus is used to determine these parameters. The parameters include: 1) the maximum substitution length N used in the error models (section 2.3.1), 2) the size of the word list (section 3.5), and 3) the minimum probability for α → α (section 2.3.1).

5.3.1 Tuning Model Parameters

First, the effect of the maximum substitution length, N, is determined for the spell checker using the size 50 CMUDICT-filtered word list and the minimum probability for α → α set at 80%. For the combined models, the λ parameter used to weight the phone model in the combined score is optimized individually for each N. The results for the letter model PL, the phone model PPHL, and the combined model are shown in Tables 5.2–5.4.

Unlike Brill and Moore (2000) and Toutanova and Moore (2002), I do not find that the performance levels off after a certain N. Both the letter and phone models show that the best performance is seen when N = 4 or N = 5. In the letter model, performance falls across the board when N is increased from 4 to 5. In the phone model, the best 1-Best performance is found for N = 3, but the performance for the 2-6 Best cases improves slightly as N gets higher. The final line in Table 5.4 shows the optimal combination of models: N = 4 for the letter model and N = 5 for the phone model with λ = 0.15.

Next, the effect of dictionary size on the combined model where N = 4 for the letter model, N = 5 for the phone model, and m = 80% is shown in Table 5.5. Because of the CMUDICT filtering, the size 60 and size 70 word lists are only respectively 4,000 and 8,000 words larger than the size 50 word list.29

29The original SCOWL size 60 and 70 word lists are respectively 26,000 and 82,000 words larger.

N   1-Best   2-Best   3-Best   4-Best   5-Best   6-Best
1    51.8     64.4     68.8     70.4     73.0     74.8
2    57.7     69.6     75.9     78.8     80.3     81.6
3    62.3     73.4     79.5     82.0     82.8     84.1
4    64.2     74.2     79.9     83.2     83.6     84.9
5    63.7     74.2     79.9     83.0     83.6     84.5

Table 5.2: Percentage of Correct Suggestions on the Atsuo-Henry Corpus Development Set as a Function of the Maximum Substitution Length (N) for PL

N   1-Best   2-Best   3-Best   4-Best   5-Best   6-Best
1    48.4     61.2     67.7     72.1     76.3     78.8
2    51.4     63.3     69.2     74.0     75.7     77.4
3    54.7     66.5     73.0     76.7     80.3     81.3
4    54.5     66.7     73.6     77.1     80.9     82.6
5    54.1     66.9     73.6     77.1     80.9     82.6

Table 5.3: Percentage of Correct Suggestions on the Atsuo-Henry Corpus Development Set as a Function of the Maximum Substitution Length (N) for PPHL

N      λ      1-Best   2-Best   3-Best   4-Best   5-Best   6-Best
1      0.45    53.7     62.9     68.3     70.6     72.3     74.0
2      0.20    61.2     70.0     75.5     78.4     79.9     81.3
3      0.30    63.5     74.4     79.0     82.0     83.9     84.9
4      0.15    65.2     75.1     79.7     82.4     84.9     85.7
5      0.15    65.0     74.4     79.7     82.4     84.7     85.3
4, 5   0.15    65.4     75.1     79.7     82.6     84.9     85.7

Table 5.4: Percentage of Correct Suggestions on the Atsuo-Henry Corpus Development Set as a Function of the Maximum Substitution Length (N) for Combined Model

Model   Dict. Size   1-Best   2-Best   3-Best   4-Best   5-Best   6-Best
L       50            64.2     74.2     79.9     83.2     83.6     84.9
PHL     50            54.1     66.9     73.6     77.1     80.9     82.6
CMB     50            65.4     75.1     79.7     82.6     84.9     85.7
L       60            63.5     73.2     79.7     83.0     83.4     84.3
PHL     60            54.1     66.7     73.4     76.9     80.7     82.4
CMB     60            63.7     72.7     78.2     80.1     81.6     82.8
L       70            62.9     72.7     79.5     82.8     83.2     84.1
PHL     70            48.8     61.2     67.1     71.5     74.6     76.1
CMB     70            63.5     73.2     77.8     79.5     81.3     82.0

Table 5.5: Percentage of Correct Suggestions on the Atsuo-Henry Corpus Development Set as a Function of Dictionary Size for All Models

Although the difference in dictionary size is relatively small, the additional word list items cause a slight decrease in performance for the letter and combined models, but have little effect on the phone model. As expected, a smaller dictionary leads to better performance, so the best choice for the current evaluation is the size 50 dictionary.

Finally, the effect of the minimum probability for P(α → α) is determined. As Brill and Moore (2000) found, the approximation of P(α → α) (section 2.3.1) does not have a large effect on results. The identical 6-Best accuracy scores for the PL models show that m affects only the rankings within the top few candidate corrections and has little effect on the selection of top candidates. For PPHL, the drop in accuracy across the board for m = 0.5 and m = 0.9 shows that m affects the selection and ranking of the candidates. When the probabilities of non-match phone substitutions (i.e., P(α → β) where α ≠ β) are too high, the model may overfit the training data, but when they are too low, the model may not account for the wide range of phone substitutions encountered in the JWEFL misspellings.

Model    m     1-Best   2-Best   3-Best   4-Best   5-Best   6-Best
L       0.5     63.5     75.1     80.3     82.6     83.4     84.9
PHL     0.5     52.4     63.7     67.5     72.3     75.5     77.8
CMB     0.5     65.2     73.8     78.6     80.7     82.0     82.4
L       0.8     64.2     74.2     79.9     83.2     83.6     84.9
PHL     0.8     54.1     66.9     73.6     77.1     80.9     82.6
CMB     0.8     65.4     75.1     79.7     82.6     84.9     85.7
L       0.9     62.9     74.4     79.0     82.4     83.6     84.9
PHL     0.9     50.1     61.8     67.1     71.7     74.6     77.4
CMB     0.9     64.2     73.4     78.0     80.3     81.8     82.8

Table 5.6: Percentage of Correct Suggestions on the Atsuo-Henry Corpus Development Set as a Function of Minimum Probability m for All Models

Comparing the Ranking of Candidate Corrections

The ranked lists in Table 5.7 show the candidate corrections for the misspelling of any as *eney as provided by each of the spell checkers discussed above. The native-speaker spell checker, Aspell, can be seen to preserve the initial letter of the misspelling and the misspelling's vowels in nearly all of its candidate corrections. PL's top candidates also overlap a great deal in orthography, but more initial letter and vowel variation is seen in the candidates. As we would expect, PPHL ranks any as the top correction, but some of the lower-ranked candidates for PPHL are seen to differ a great deal from the target word in length. The combined model appears to be a reranking of the candidate corrections selected by PL in terms of their similarity to the misspelling in pronunciation. This serves to move any from the 6th to the 3rd position in the list, behind enemy and envy, which are both also quite similar to *eney in orthography and pronunciation. PPHL's main influence in the combined model appears to be in the ranking of candidates already selected by PL.

Aspell    L        PHL        CMB
enemy     enemy    any        enemy
envy      envy     Emmy       envy
energy    money    Ne         any
eye       emery    gunny      deny
teeny     deny     ebony      money
Ne        any      anything   emery
deny      nay      senna      nay
any       ivy      journey    ivy

Table 5.7: Candidate Corrections for the Misspelling *eney, Intended Word any

Candidates proposed by PPHL that differ too much in length would not be ranked highly by PL, so they do not appear in the list for the combined model.

5.3.2 Evaluation of Pronunciation Variation

Using the parameters from the previous section, the effect of the pronunciation variation introduced into the pronouncing dictionary using the method described in section 4.1 can be evaluated by examining the performance on the test set from the Atsuo-Henry Corpus for the phone model PPHL with and without the additional variations. The results in Table 5.8 show that the addition of pronunciation variations does indeed improve the performance of PPHL. The 1-Best accuracy improves by 2.7% and the improvements for the 2-6 Best accuracies vary from 0.9–2.5%. The next section will show the effect of these improvements on the combined spelling correction model.

                              1-Best   2-Best   3-Best   4-Best   5-Best   6-Best
Standard Pron. Dict.           47.9     60.7     67.9     70.8     75.0     77.3
Pron. Dict. with Variations    50.6     62.2     70.4     73.1     76.7     78.2

Table 5.8: Percentage of Correct Suggestions on the Atsuo-Henry Corpus Test Set as a Function of Pronunciation Variation for PPHL

5.3.3 Evaluation of the Spelling Correction Model

The effect of the improvements in the phone model PPHL due to the pronunciation variations added to the pronouncing dictionary (see section 4.1) is evaluated on the test set from the Atsuo-Henry Corpus. Table 5.9 shows the performance of the letter model and the performance of the phone and combined models with and without pronunciation variation in the pronouncing dictionary. Without additional pronunciation variations in the pronouncing dictionary, the combined model shows very small improvements over the letter model PL for the 1-Best and 2-Best accuracies, but the 3-Best through 6-Best accuracies fall, with the 6-Best falling by 2.1%. When the pronunciation variations predicted in section 4.1 are included in the pronouncing dictionary, the 1-Best accuracy improves by 0.8% over the letter model and the 6-Best accuracy falls by only 1.3%.

5.4 Summary

The combined spelling correction model with pronunciation variation outperforms the native-speaker spell checker Aspell by a wide margin. The 1-Best accuracy is 21% higher and the 6-Best accuracy improves by 14%.

Model                    1-Best   2-Best   3-Best   4-Best   5-Best   6-Best
Aspell                    44.1     54.0     64.1     68.3     70.0     72.5
L                         64.7     74.6     79.6     83.2     84.0     85.3
PHL without Pron. Var.    47.9     60.7     67.9     70.8     75.0     77.3
CMB without Pron. Var.    64.9     75.2     78.6     81.1     82.6     83.2
PHL with Pron. Var.       50.6     62.2     70.4     73.1     76.7     78.2
CMB with Pron. Var.       65.5     75.0     78.4     80.7     82.6     84.0

Table 5.9: Percentage of Correct Suggestions on the Atsuo-Henry Corpus Test Set for All Models


The performance is comparable to the performance of the spell checker from Mitton and Okada (2007), but differences in experimental set-up make it difficult to know whether a direct comparison is appropriate. The noisy channel spelling correction approach developed by Brill and Moore (2000) and Toutanova and Moore (2002) appears well-suited to model the sources of spelling errors for non-native writers of English as a foreign language and the pronunciation variation model developed in this thesis leads to further small improvements in the performance of the spell checker.

CHAPTER 6

SUMMARY AND OUTLOOK

In this thesis, I have presented a method for modeling pronunciation variation from a phonetically untranscribed corpus of read non-native speech by adapting a monophone recognizer trained initially on native speech. The model of pronunciation variation allows a native pronouncing dictionary to be extended to include non-native pronunciation variations. This model of pronunciation variation is evaluated in the context of a spell checker adapted for non-native speakers of English. A model of pronunciation variation is created from the English Read by Japanese Corpus in order to create an extended pronouncing dictionary for Japanese speakers of English. The use of the extended pronouncing dictionary leads to small improvements in spell checker performance for the spelling correction approach proposed by Toutanova and Moore (2002), which uses error models for both orthography and pronunciation in order to select and rank candidate corrections.

6.1 Outlook

Extensions to several of the models used in the spelling correction approach may further improve spell checker performance. First, the letter-to-phone model (section 2.3.3) currently provides one best pronunciation of each misspelling with a word accuracy of only 75%, so extending the model to provide multiple pronunciations of each word may improve performance. Next, the error models from the spelling correction approach could be extended to take into account the position of the edit in the word. The edit operation probabilities P(α → β) would be conditioned on whether the edit occurs at the beginning, middle, or end of the word. Brill and Moore (2000) show that this adaptation leads to performance improvements for native-speaker spelling correction in English. Brill and Moore (2000) also show that including a language model improves performance. Instead of assuming that each word is equally likely, a language model, especially if it were developed for the particular non-native writers, could improve overall performance and lessen the drop in performance observed when the word list size is increased. Finally, the running time, especially with the large word lists used in the pronunciation model, is still an obstacle and perhaps the similarity key approach could be adapted to select a subset of the pronunciations for comparison rather than comparing the misspelling to all pronunciations in the dictionary.

The method for modeling pronunciation variation presented in this thesis is in no way dependent on the particular combination of native language background and target written language. Given a corpus of read non-native speech by speakers with a common language background, pronunciation variation can be modeled in order to extend a pronouncing dictionary for the target language. The noisy channel spelling correction approach is likewise independent of the target language and it would be interesting to investigate the effect of pronunciation variation in spell checking for other language pairs such as English writers of French as a foreign language or Chinese writers of German as a foreign language.

BIBLIOGRAPHY

Aspell (2008). GNU Aspell 0.60.6 User’s Manual. http://aspell.net/man-html/.

Brill, Eric and Robert C. Moore (2000). An Improved Error Model for Noisy Channel Spelling Correction. In Proceedings of ACL. Hong Kong, pp. 286–293.

Church, Kenneth W. and William A. Gale (1991). Probability Scoring for Spelling Correction. Statistics and Computing 1, 93–103.

CMUDICT (1998). CMU Pronouncing Dictionary version 0.6. http://www.speech. cs.cmu.edu/cgi-bin/cmudict.

Damerau, Fred J. (1964). A technique for computer detection and correction of spelling errors. Communications of the ACM 7(3), 171–176.

Fisher, William (1999). A statistical text-to-phone function using ngrams and rules.

Fosler-Lussier, J. Eric (1999). Dynamic Pronunciation Models for Automatic Speech Recognition. Tech. rep., International Computer Science Institute. Technical Re- port TR-99-015.

Fox, Edward A., Qi Fan Chen and Lenwood S. Heath (1992). A faster algorithm for constructing minimal perfect hash functions. In SIGIR ’92: Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval. Copenhagen, Denmark: ACM.

Kukich, Karen (1992). Technique for automatically correcting words in text. ACM Computing Surveys 24(4), 377–439.

L'Haire, Sébastien (2007). FipsOrtho: A spell checker for learners of French. ReCALL 19(2).

Minematsu, N., Y. Tomiyama, K. Yoshimoto, K. Shimizu, S. Nakagawa, M. Dantsuji, and S. Makino (2002). English Speech Database Read by Japanese Learners for CALL System Development. In Proceedings of LREC.

Mitton, Roger (1987). Spelling checkers, spelling correctors and the misspellings of poor spellers. Information Processing and Management 23(5).

Mitton, Roger (1996). English spelling and the computer. London: Birkbeck ePrints. http://eprints.bbk.ac.uk/archive/00000469.

Mitton, Roger and Takeshi Okada (2007). The adaptation of an English spellchecker for Japanese writers. In Symposium on Second Language Writing. Nagoya, Japan.

Okada, Takeshi (2004). A Corpus Analysis of Spelling Errors Made by Japanese EFL Writers. Yamagata English Studies 9, 17–36.

Pollock, Joseph J. and Antonio Zamora (1984). Automatic spelling correction in scientific and scholarly text. Communications of the ACM 27(4), 358–368.

Shannon, Claude (1948). A mathematical theory of communication. Bell System Technical Journal 27(3).

TIMIT (1991). TIMIT Acoustic-Phonetic Continuous . NIST Speech Disc CD1-1.1.

Toutanova, Kristina and Robert Moore (2002). Pronunciation Modeling for Improved Spelling Correction. In Proceedings of ACL.

Young, Steve, Gunnar Evermann et al. (2006). The HTK Book.

APPENDIX A

Annotation Schemes

A.1 Phonetic Transcriptions

A.1.1 TIMIT

aa, ae, ah, ao, aw, ax, ax-h, axr, ay, b, bcl, ch, d, dcl, dh, dx, eh, el, em, en, eng, er, ey, f, g, gcl, hh, hv, ih, ix, iy, jh, k, kcl, l, m, n, ng, nx, ow, oy, p, pcl, q, r, s, sh, t, tcl, th, uh, uw, ux, v, w, y, z, zh

Table A.1: TIMIT Phonemes

A.1.2 English Read by Japanese Corpus

aa, ae, ah, ao, aw, ax, axr, ay, b, ch, d, dh, eh, er, ey, f, g, hh, ih, iy, jh, k, l, m, n, ng, ow, oy, p, r, s, sh, t, th, uh, uw, v, w, y, z, zh

Table A.2: ERJ Phonemes

A.2 Mapping to CMUDICT Phoneme Set

The phonemes of the TIMIT and ERJ Corpora are mapped to the 39-phoneme CMUDICT phoneme set in the following way:

Original TIMIT Phoneme   CMUDICT Mapping
bcl                      ∅
dcl                      ∅
gcl                      ∅
kcl                      ∅
pcl                      ∅
tcl                      ∅
q                        ∅
nx                       n
dx                       d
hv                       hh
ux                       uw
ax-h                     ax
eng                      ng
ax                       ah
ix                       ih
el                       l
em                       m
en                       n
axr                      er

Table A.3: Mapping to CMUDICT Phonemes
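Expressed directly in code, the table becomes a small dictionary; note that ax-h maps to ax, which in turn maps to ah, so the mapping is applied until no entry matches (a sketch of the obvious implementation; names are ours):

TIMIT_TO_CMUDICT = {
    # closures and the glottal stop are deleted
    "bcl": None, "dcl": None, "gcl": None, "kcl": None,
    "pcl": None, "tcl": None, "q": None,
    # folded phones
    "nx": "n", "dx": "d", "hv": "hh", "ux": "uw", "ax-h": "ax",
    "eng": "ng", "ax": "ah", "ix": "ih", "el": "l", "em": "m",
    "en": "n", "axr": "er",
}

def to_cmudict(phones):
    mapped = []
    for p in phones:
        while p is not None and p in TIMIT_TO_CMUDICT:
            p = TIMIT_TO_CMUDICT[p]   # follow chains like ax-h -> ax -> ah
        if p is not None:
            mapped.append(p)
    return mapped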

APPENDIX B

Letter-to-Phone Alignments

The letter-phone distances used to determine the best alignment for the letter-to-phone model are presented in Tables B.1–B.2. # is the word boundary character added at the beginning and end of both the word and the phone sequence during the alignment step. INS is the cost of inserting a phone; DEL is the cost of deleting a letter.

      DEL  aa  ae  ah  ao  aw  ax  axr  ay  b  ch  d  dh  eh  er  ey  f  g  hh
#      0  100  50  50  50  50  50  50   50  50 50  50 50  50  50  50  50 50 50
'     10    5   5   5   5   5   5   5    5   5  5   5  5   5   5   5   5  5  5
INS   24    8   8   8   8   8   8   8    8   8  8   8  8   8   8   8   8  8  8
a     10    1   1   5   5   5   5   5    5   5  5   5  5   3   5   3   5  5  5
b     10    5   5   5   5   5   5   5    5   0  5   5  5   5   5   5   5  5  5
c     10    5   5   5   5   5   5   5    5   5  1   5  5   5   5   5   5  5  5
d     10    5   5   5   5   5   5   5    5   5  5   0  5   5   5   5   5  5  5
e     10    2   2   2   5   5   2   1    1   5  5   5  5   0   1   1   5  5  5
f     10    5   5   5   5   5   5   5    5   5  5   5  5   5   5   5   0  5  5
g     10    5   5   5   5   5   5   5    5   5  5   5  5   5   5   5   5  0  5
h     10    5   5   5   5   5   5   5    5   5  5   5  5   5   5   5   5  5  0
i     10    5   5   5   5   5   5   5    1   5  5   5  5   1   5   5   5  5  5
j     10    5   5   5   5   5   5   5    5   5  5   5  5   5   5   5   5  5  5
k     10    5   5   5   5   5   5   5    5   5  5   5  5   5   5   5   5  5  5
l     10    5   5   5   5   5   5   5    5   5  5   5  5   5   5   5   5  5  5
m     10    5   5   5   5   5   5   5    5   5  5   5  5   5   5   5   5  5  5
n     10    5   5   5   5   5   5   5    5   5  5   5  5   5   5   5   5  5  5
o     10    5   5   5   0   0   5   5    5   5  5   5  5   5   5   5   5  5  5
p     10    5   5   5   5   5   5   5    5   5  5   5  5   5   5   5   5  5  5
q     10    5   5   5   5   5   5   5    5   5  5   5  5   5   5   5   5  5  5
r     10    5   5   5   5   5   5   1    5   5  5   5  5   5   1   5   5  5  5
s     10    5   5   5   5   5   5   5    5   5  5   5  5   5   5   5   5  5  5
t     10    5   5   5   5   5   5   5    5   5  5   5  5   5   5   5   5  5  5
u     10    5   5   1   5   5   5   5    5   5  5   5  5   5   5   5   5  5  5
v     10    5   5   5   5   5   5   5    5   5  5   5  5   5   5   5   5  5  5
x     10    5   5   5   5   5   5   5    5   5  5   5  5   5   5   5   5  1  5
y     10    5   5   5   5   5   5   5    5   5  5   5  5   5   5   5   5  5  5
z     10    5   5   5   5   5   5   5    5   5  5   5  5   5   5   5   5  5  5

Table B.1: Letter-Phone Edit Distances

      ih  iy  jh  k  l  m  n  ng  ow  oy  p  r  s  sh  t  th  uh  uw  v  w  y  z  zh
#     50  50  50  50 50 50 50 50  50  50  50 50 50 50  50 50  50  50  50 50 50 50 50
'      5   5   5   5  5  5  5  5   5   5   5  5  5  5   5  5   5   5   5  5  5  5  5
INS    8   8   8   8  8  8  8  8   8   8   8  8  8  8   8  8   8   8   8  8  8  8  8
a      5   5   5   5  5  5  5  5   5   5   5  5  5  5   5  5   5   5   5  5  5  5  5
b      5   5   5   5  5  5  5  5   5   5   5  5  5  5   5  5   5   5   5  5  5  5  5
c      5   5   5   1  5  5  5  5   5   5   5  5  1  2   5  5   5   5   5  5  5  5  5
d      5   5   5   5  5  5  5  5   5   5   5  5  5  5   2  5   5   5   5  5  5  5  5
e      2   0   5   5  5  5  5  5   5   5   5  5  5  5   5  5   5   5   5  5  5  5  5
f      5   5   5   5  5  5  5  5   5   5   5  5  5  5   5  5   5   5   5  5  5  5  5
g      5   5   0   5  5  5  5  5   5   5   5  5  5  5   5  5   5   5   5  5  5  5  5
h      5   5   5   5  5  5  5  5   5   5   5  5  5  4   5  4   5   5   5  5  5  5  5
i      0   1   5   5  5  5  5  5   5   5   5  5  5  5   5  5   5   5   5  5  5  5  5
j      5   5   0   5  5  5  5  5   5   5   5  5  5  5   5  5   5   5   5  5  5  5  5
k      5   5   5   0  5  5  5  5   5   5   5  5  5  5   5  5   5   5   5  5  5  5  5
l      5   5   5   5  0  5  5  5   5   5   5  5  5  5   5  5   5   5   5  5  5  5  5
m      5   5   5   5  5  0  5  5   5   5   5  5  5  5   5  5   5   5   5  5  5  5  5
n      5   5   5   5  5  5  0  2   5   5   5  5  5  5   5  5   5   5   5  5  5  5  5
o      5   5   5   5  5  5  5  5   0   0   5  5  5  5   5  5   5   5   5  5  5  5  5
p      5   5   5   5  5  5  5  5   5   5   0  5  5  5   5  5   5   5   5  5  5  5  5
q      5   5   5   1  5  5  5  5   5   5   5  5  5  5   5  5   5   5   5  5  5  5  5
r      5   5   5   5  5  5  5  5   5   5   5  0  5  5   5  5   5   5   5  5  5  5  5
s      5   5   3   5  5  5  5  5   5   5   5  5  0  2   5  5   5   5   5  5  5  1  2
t      5   5   5   5  5  5  5  5   5   5   5  5  5  2   0  5   5   5   5  5  5  5  5
u      5   5   5   5  5  5  5  5   5   5   5  5  5  5   5  5   0   0   5  5  5  5  5
v      5   5   5   5  5  5  5  5   5   5   5  5  5  5   5  5   5   5   0  5  5  5  5
x      5   5   5   1  5  5  5  5   5   5   5  5  3  5   5  5   5   5   5  5  5  3  5
y      5   5   5   5  5  5  5  5   5   5   5  5  5  5   5  5   5   5   5  5  0  5  5
z      5   5   5   5  5  5  5  5   5   5   5  5  5  5   5  5   5   5   5  5  5  0  5

Table B.2: Letter-Phone Edit Distances, cont.