
Edit Distance, Spelling Correction, and the Noisy Channel

CS 585, Fall 2015 Introduction to Natural Language Processing http://people.cs.umass.edu/~brenocon/inlp2015/

Brendan O’Connor

Thursday, October 22, 15

• Projects
• OH after class
• Some major topics in second half of course
• Translation: spelling, machine translation
• Syntactic parsing: dependencies, hierarchical phrase structures
• Coreference
• Lexical semantics
• Unsupervised language learning
• Topic models, exploratory analysis?
• Neural networks?


Bayes Rule for doc classif.

[Diagram: Class → noisy channel → Observed text. The hypothesized transmission process is the channel/text model; recovering the class from the observed text is the inference problem.]

Previously: Naive Bayes



Noisy channel model

[Diagram: Original text → noisy channel → Observed text. The channel is the hypothesized transmission process; recovering the original text is the inference problem.]

Codebreaking: P(plaintext | encrypted text) ∝ P(encrypted text | plaintext) P(plaintext)

TRANSMISSION: Enigma machine

INFERENCE: Bletchley Park (WWII)


Noisy channel model

Codebreaking: P(plaintext | encrypted text) ∝ P(encrypted text | plaintext) P(plaintext)
Speech recognition: P(text | acoustic signal) ∝ P(acoustic signal | text) P(text)
Optical character recognition: P(text | image) ∝ P(image | text) P(text)
Machine translation: P(target text | source text) ∝ P(source text | target text) P(target text)
Spelling correction: P(target text | source text) ∝ P(source text | target text) P(target text)

CHAPTER 6: SPELLING CORRECTION AND THE NOISY CHANNEL (excerpt)

candidates: real words that have a similar letter sequence to the error. Candidate corrections from the spelling error graffe might include giraffe, graf, gaffe, grail, or craft. We then rank the candidates using a distance metric between the source and the surface error. We'd like a metric that shares our intuition that giraffe is a more likely source than grail for graffe because giraffe is closer in spelling to graffe than grail is to graffe. The minimum edit distance algorithm from Chapter 2 will play a role here. But we'd also like to prefer corrections that are more frequent words, or more likely to occur in the context of the error. The noisy channel model introduced in the next section offers a way to formalize this intuition.

Real-word spelling error detection is a much more difficult task, since any word in the input text could be an error. Still, it is possible to use the noisy channel to find candidates for each word w typed by the user, and rank the correction that is most likely to have been the user's original intention.

6.1 The Noisy Channel Model

In this section we introduce the noisy channel model and show how to apply it to the task of detecting and correcting spelling errors. The noisy channel model was applied to the spelling correction task at about the same time by researchers at AT&T Bell Laboratories (Kernighan et al. 1990, Church and Gale 1991) and IBM Watson Research (Mays et al., 1991).

[Figure 6.1 diagram: an original word passes through a noisy channel to produce a noisy word; the decoder considers word hypotheses (word hyp1, word hyp2, ..., word hypN), passes each through the channel model, and outputs the guessed word.]

Figure 6.1 In the noisy channel model, we imagine that the surface form we see is actually a "distorted" form of an original word passed through a noisy channel. The decoder passes each hypothesis through a model of this channel and picks the word that best matches the surface noisy word.

The intuition of the noisy channel model (see Fig. 6.1) is to treat the misspelled word as if a correctly spelled word had been "distorted" by being passed through a noisy communication channel. This channel introduces "noise" in the form of substitutions or other changes to the letters, making it hard to recognize the "true" word. Our goal, then, is to build a model of the channel. Given this model, we then find the true word by passing every word of the language through our model of the noisy channel and seeing which one comes closest to the misspelled word.

This noisy channel model is a kind of Bayesian inference. We see an observation x (a misspelled word), and our job is to find the word w that generated this misspelled word. Out of all possible words in the vocabulary V we want to find the word ŵ such that P(w|x) is highest. We use the hat notation ˆ to mean "our estimate of the correct word".

ŵ = argmax_{w∈V} P(w|x)    (6.1)

The function argmax_x f(x) means "the x such that f(x) is maximized". Equation 6.1 thus means that, out of all words in the vocabulary, we want the particular word that maximizes the right-hand side P(w|x).

The intuition of Bayesian classification is to use Bayes' rule to transform Eq. 6.1 into a set of other probabilities. Bayes' rule is presented in Eq. 6.2; it gives us a way to break down any conditional probability P(a|b) into three other probabilities:

P(a|b) = P(b|a) P(a) / P(b)    (6.2)
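As an aside, Bayes' rule is easy to sanity-check numerically. The joint distribution below is entirely made up for illustration; we compute P(a|b) directly from the joint and again via Eq. 6.2 and confirm they agree.

```python
# Sanity-check Bayes' rule (Eq. 6.2) on an invented joint distribution
# over two "words" (a) and two "observations" (b).
joint = {("w1", "x1"): 0.2, ("w1", "x2"): 0.1,
         ("w2", "x1"): 0.4, ("w2", "x2"): 0.3}

def p_a(a):  # marginal P(a)
    return sum(p for (aa, _), p in joint.items() if aa == a)

def p_b(b):  # marginal P(b)
    return sum(p for (_, bb), p in joint.items() if bb == b)

def p_a_given_b(a, b):  # conditional, straight from the joint
    return joint[(a, b)] / p_b(b)

def p_b_given_a(b, a):
    return joint[(a, b)] / p_a(a)

# Bayes' rule: P(a|b) = P(b|a) P(a) / P(b)
lhs = p_a_given_b("w1", "x1")
rhs = p_b_given_a("x1", "w1") * p_a("w1") / p_b("x1")
print(abs(lhs - rhs) < 1e-12)  # True
```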

We can then substitute Eq. 6.2 into Eq. 6.1 to get Eq. 6.3:

ŵ = argmax_{w∈V} P(x|w) P(w) / P(x)    (6.3)

We can conveniently simplify Eq. 6.3 by dropping the denominator P(x). Why is that? Since we are choosing a potential correction word out of all words, we will be computing P(x|w) P(w) / P(x) for each word. But P(x) doesn't change for each word; we are always asking about the most likely word for the same observed error x, which must have the same probability P(x). Thus, we can choose the word that maximizes this simpler formula:

ŵ = argmax_{w∈V} P(x|w) P(w)    (6.4)

To summarize, the noisy channel model says that we have some true underlying word w, and we have a noisy channel that modifies the word into some possible misspelled observed surface form. The likelihood or channel model of the noisy channel producing any particular observation sequence x is modeled by P(x|w). The prior probability of a hidden word is modeled by P(w). We can compute the most probable word ŵ given that we've seen some observed misspelling x by multiplying the prior P(w) and the likelihood P(x|w) and choosing the word for which this product is greatest.

We apply the noisy channel approach to correcting non-word spelling errors by taking any word not in our spell dictionary, generating a list of candidate words, ranking them according to Eq. 6.4, and picking the highest-ranked one. We can modify Eq. 6.4 to refer to this list of candidate words instead of the full vocabulary V as follows:

ŵ = argmax_{w∈C} P(x|w) · P(w)    (6.5)
                 (channel model) (prior)

The noisy channel algorithm is shown in Fig. 6.2. To see the details of the computation of the likelihood and language model, let's walk through an example, applying the algorithm to the example misspelling acress. The first stage of the algorithm proposes candidate corrections by finding words that ...

[Slide: Spelling correction as noisy channel. Hypothetical model: an intended sentence, e.g. "I was too tired to go", passes through the channel to produce the observed INPUT "I was ti tired to go"; inference ranks hypotheses such as "I was too tired to go", "I was zzz tired to go", ..., using an edit distance channel model and a language model prior. Two questions: 1. How to score possible translations? 2. How to efficiently search over them?]
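The ranking in Eq. 6.5 takes only a few lines to sketch. Everything below is illustrative: the candidate set and all channel/prior probabilities are invented round numbers, not the corpus estimates from Kernighan et al.; a real system would estimate P(x|w) from error counts and P(w) from a language model.

```python
import math

# Invented illustrative probabilities for candidate corrections of
# the observed misspelling x = "acress" (not real corpus estimates).
channel = {"actress": 1e-4, "across": 1e-5, "acres": 3e-5}  # P(x | w)
prior   = {"actress": 2e-5, "across": 3e-4, "acres": 3e-5}  # P(w)

def correct(candidates):
    """Pick argmax_w P(x|w) P(w), as in Eq. 6.5 (log space for stability)."""
    return max(candidates,
               key=lambda w: math.log(channel[w]) + math.log(prior[w]))

print(correct(channel))  # "across" wins under these made-up numbers
```

Note that the prior can overrule the channel model: "actress" is the closest fit to the error here, but the (invented) higher prior of "across" carries the argmax.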

Edit distance

• Tasks
  • Calculate numerical similarity between pairs
    • President Barack Obama
    • President Barak Obama
  • Enumerate edits with distance=1
• Model: Assume possible changes.
  • Deletions: actress => acress
  • Insertions: cress => acress
  • Substitutions: access => acress
  • [Transpositions: caress => acress]
• Probabilistic model: assume each has prob of occurrence
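The distance-1 enumeration above can be sketched as follows (a Norvig-style generator; the function name edits1 and the lowercase alphabet are our own choices, not from the lecture):

```python
def edits1(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """All strings one edit away from `word`: deletions, transpositions,
    substitutions, and insertions."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces   = [L + c + R[1:] for L, R in splits if R for c in alphabet]
    inserts    = [L + c + R for L, R in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

# Every source word on the slide is one edit away from the error "acress":
neighbors = edits1("acress")
for w in ["actress", "cress", "access", "caress", "acres"]:
    assert w in neighbors
```

Since these four edit types are symmetric, "actress can become acress by one deletion" is equivalent to "actress is in edits1('acress')", which is what the assertions check.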


CHAPTER 2: REGULAR EXPRESSIONS, TEXT NORMALIZATION, EDIT DISTANCE (excerpt)

an alternative version of his metric in which each insertion or deletion has a cost of 1 and substitutions are not allowed. (This is equivalent to allowing substitution, but giving each substitution a cost of 2, since any substitution can be represented by one insertion and one deletion.) Using this version, the Levenshtein distance between intention and execution is 8.

2.4.1 The Minimum Edit Distance Algorithm How do we find the minimum edit distance? We can think of this as a search task, in which we are searching for the shortest path—a sequence of edits—from one string to another.

[Figure 2.13 Finding the edit distance viewed as a search problem: from "intention", a del edit yields "ntention", an ins edit yields e.g. "intecntion", and a subst edit yields e.g. "inxention".]

Levenshtein Algorithm

• Goal: Infer minimum edit distance, and argmin edit path, for a pair of strings, e.g. (intention, execution)


The space of all possible edits is enormous, so we can't search naively. However, lots of distinct edit paths will end up in the same state (string), so rather than recomputing all those paths, we could just remember the shortest path to a state each time we saw it. We can do this by using dynamic programming. Dynamic programming is the name for a class of algorithms, first introduced by Bellman (1957), that apply a table-driven method to solve problems by combining solutions to sub-problems. Some of the most commonly used algorithms in natural language processing make use of dynamic programming, such as the Viterbi and forward algorithms (Chapter 7) and the CKY algorithm for parsing (Chapter 12). The intuition of a dynamic programming problem is that a large problem can be solved by properly combining the solutions to various sub-problems. Consider the shortest path of transformed words that represents the minimum edit distance between the strings intention and execution shown in Fig. 2.14.

i n t e n t i o n
delete i          →  n t e n t i o n
substitute n by e →  e t e n t i o n
substitute t by x →  e x e n t i o n
insert u          →  e x e n u t i o n
substitute n by c →  e x e c u t i o n

Figure 2.14 Path from intention to execution.

Imagine some string (perhaps it is exention) that is in this optimal path (whatever it is). The intuition of dynamic programming is that if exention is in the optimal operation list, then the optimal sequence must also include the optimal path from intention to exention. Why? If there were a shorter path from intention to exention, then we could use it instead, resulting in a shorter overall path, and the optimal sequence wouldn't be optimal, thus leading to a contradiction. The minimum edit distance algorithm was named by Wagner and Fischer (1974). [added after lecture]

• Want to calculate: Minimum edit distance between two strings X, Y, lengths n,m

Declaratively, edit distance has a recursive substructure; for example, one branch of the recursion gives d(barak obama, barack obama) ≤ d(barak obama, barack obam) + InsertionCost (the full distance is the minimum over the insertion, deletion, and substitution branches).

This allows for a dynamic programming algorithm to quickly compute the lowest-cost path, specifically the Levenshtein algorithm. (We'll just do the version with ins/del/subst, no transpositions.)
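The recursive substructure can be written down directly, with memoization standing in for the DP table. A minimal sketch, using the lecture's costs (ins/del = 1, substitution = 2, matches free):

```python
from functools import lru_cache

def levenshtein(x, y, sub_cost=2):
    """Minimum edit distance via the recursive substructure, memoized.
    Insertions/deletions cost 1; substitutions cost `sub_cost` (2 in the
    lecture's version); matching letters are free."""
    @lru_cache(maxsize=None)
    def d(i, j):
        if i == 0:
            return j                # j insertions build y[:j] from ""
        if j == 0:
            return i                # i deletions empty out x[:i]
        return min(
            d(i - 1, j) + 1,        # delete x[i]
            d(i, j - 1) + 1,        # insert y[j]
            d(i - 1, j - 1) + (0 if x[i - 1] == y[j - 1] else sub_cost),
        )
    return d(len(x), len(y))

print(levenshtein("barak obama", "barack obama"))  # 1: a single insertion
print(levenshtein("intention", "execution"))       # 8, as in the text
```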


2.4 Minimum Edit Distance (excerpt)

but independently discovered by many people (summarized later, in the Historical Notes section of Chapter 7).

Let's first define the minimum edit distance between two strings. Given two strings, the source string X of length n, and target string Y of length m, we'll define D(i, j) as the edit distance between X[1..i] and Y[1..j], i.e., the first i characters of X and the first j characters of Y. The edit distance between X and Y is thus D(n,m).

We'll use dynamic programming to compute D(n,m) bottom up, combining solutions to subproblems. In the base case, with a source substring of length i but an empty target string, going from i characters to 0 requires i deletes. With a target substring of length j but an empty source, going from 0 characters to j characters requires j inserts.
Having computed D(i, j) for small i, j, we then compute larger D(i, j) based on previously computed smaller values. The value of D(i, j) is computed by taking the minimum of the three possible paths through the matrix which arrive there:

D[i, j] = min( D[i-1, j] + del-cost(source[i]),
               D[i, j-1] + ins-cost(target[j]),
               D[i-1, j-1] + sub-cost(source[i], target[j]) )

Slide recap:
• Want to calculate: Minimum edit distance between two strings X, Y, lengths n, m
• D(i,j): edit dist between X[1..i] and Y[1..j]. D(n,m): edit dist between X and Y

If we assume the version of Levenshtein distance in which the insertions and deletions each have a cost of 1 (ins-cost(·) = del-cost(·) = 1), and substitutions have a cost of 2 (except that substitutions of identical letters have zero cost), the computation for D(i, j) becomes:

D[i, j] = min( D[i-1, j] + 1,
               D[i, j-1] + 1,
               D[i-1, j-1] + (2 if source[i] ≠ target[j]; 0 if source[i] = target[j]) )

• Levenshtein algorithm: dynamic programming algorithm to quickly calculate all D[i,j].

The algorithm itself is summarized in Fig. 2.15, and Fig. 2.16 shows the results of applying the algorithm to the distance between intention and execution with the version of Levenshtein in Eq. 2.5.

Knowing the minimum edit distance is useful for algorithms like finding potential spelling error corrections.
But the edit distance algorithm is important in another way; with a small change, it can also provide the minimum cost alignment between two strings. Aligning two strings is useful throughout speech and language processing. In speech recognition, minimum edit distance alignment is used to compute the word error rate (Chapter 25). Alignment plays a role in machine translation, in which sentences in a parallel corpus (a corpus with a text in two languages) need to be matched to each other.

To extend the edit distance algorithm to produce an alignment, we can start by visualizing an alignment as a path through the edit distance matrix. Figure 2.17 shows this path with the boldfaced cells. Each boldfaced cell represents an alignment of a pair of letters in the two strings. If two boldfaced cells occur in the same row, there will be an insertion in going from the source to the target; two boldfaced cells in the same column indicate a deletion.

Figure 2.17 also shows the intuition of how to compute this alignment path. The computation proceeds in two steps. In the first step, we augment the minimum edit distance algorithm to store backpointers in each cell. The backpointer from a cell points to the previous cell (or cells) that we came from in entering the current cell.

function MIN-EDIT-DISTANCE(source, target) returns min-distance

  n ← LENGTH(source)
  m ← LENGTH(target)
  Create a distance matrix D[n+1, m+1]

  # Initialization: the zeroth row and column is the distance from the empty string
  D[0,0] = 0
  for each row i from 1 to n do
      D[i,0] ← D[i-1,0] + del-cost(source[i])
  for each column j from 1 to m do
      D[0,j] ← D[0,j-1] + ins-cost(target[j])

  # Recurrence relation:
  for each row i from 1 to n do
      for each column j from 1 to m do
          D[i,j] ← MIN( D[i-1,j] + del-cost(source[i]),
                        D[i-1,j-1] + sub-cost(source[i], target[j]),
                        D[i,j-1] + ins-cost(target[j]) )

  # Termination
  return D[n,m]

Figure 2.15 The minimum edit distance algorithm, an example of the class of dynamic programming algorithms. The various costs can either be fixed (e.g., ∀x, ins-cost(x) = 1) or can be specific to the letter (to model the fact that some letters are more likely to be inserted than others). We assume that there is no cost for substituting a letter for itself (i.e., sub-cost(x,x) = 0).
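Fig. 2.15 translates almost line for line into Python. In the sketch below the three cost functions are left pluggable so they can be made letter-specific as the caption describes; the function name and the default (Levenshtein, sub = 2) costs are our own choices.

```python
def min_edit_distance(source, target,
                      del_cost=lambda c: 1,
                      ins_cost=lambda c: 1,
                      sub_cost=lambda c, d: 0 if c == d else 2):
    """Transcription of Fig. 2.15 with pluggable, possibly letter-specific
    costs (sub_cost(x, x) = 0, as the caption assumes)."""
    n, m = len(source), len(target)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    # Initialization: zeroth row/column = distance from the empty string
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + del_cost(source[i - 1])
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + ins_cost(target[j - 1])
    # Recurrence relation
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j] + del_cost(source[i - 1]),
                          D[i - 1][j - 1] + sub_cost(source[i - 1], target[j - 1]),
                          D[i][j - 1] + ins_cost(target[j - 1]))
    return D[n][m]

print(min_edit_distance("intention", "execution"))  # 8 with the default costs
```

Passing e.g. `sub_cost=lambda a, b: 0 if a == b else 1` recovers the unit-cost version of Levenshtein distance instead.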

Src\Tar   #   e   x   e   c   u   t   i   o   n
  #       0   1   2   3   4   5   6   7   8   9
  i       1   2   3   4   5   6   7   6   7   8
  n       2   3   4   5   6   7   8   7   8   7
  t       3   4   5   6   7   8   7   8   9   8
  e       4   3   4   5   6   7   8   9  10   9
  n       5   4   5   6   7   8   9  10  11  10
  t       6   5   6   7   8   9   8   9  10  11
  i       7   6   7   8   9  10   9   8   9  10
  o       8   7   8   9  10  11  10   9   8   9
  n       9   8   9  10  11  12  11  10   9   8

Figure 2.16 Computation of minimum edit distance between intention and execution with the algorithm of Fig. 2.15, using Levenshtein distance with cost of 1 for insertions or deletions, 2 for substitutions. In italics (in the original) are the initial values representing the distance from the empty string.

We've shown a schematic of these backpointers in Fig. 2.17, after a similar diagram in Gusfield (1997). Some cells have multiple backpointers because the minimum extension could have come from multiple previous cells. In the second step, we perform a backtrace. In a backtrace, we start from the last cell (at the final row and column) and follow the pointers back through the dynamic programming matrix. Each complete path between the final cell and the initial cell is a minimum distance alignment. Exercise 2.7 asks you to modify the minimum edit distance algorithm to store the pointers and compute the backtrace to output an alignment.

Backpointers

[Figure 2.17 diagram: the edit distance matrix for intention/execution, with up to three backpointer arrows (diagonal, up, left) marked in each cell.]

Figure 2.17 When entering a value in each cell, we mark which of the three neighboring cells we came from with up to three arrows. After the table is full we compute an alignment (minimum edit path) by using a backtrace, starting at the 8 in the lower-right corner and following the arrows back. The sequence of bold cells represents one possible minimum cost alignment between the two strings.
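The two-step procedure, filling the table with backpointers and then backtracing, might be sketched like this (our own illustrative implementation; it stores one backpointer per cell, so ties among equal-cost predecessors are broken arbitrarily and only one of the possible minimum-cost alignments is returned):

```python
def align(source, target, sub_cost=2):
    """Minimum edit distance plus one optimal alignment, via a DP table
    with backpointers and a backtrace (in the spirit of Fig. 2.17).
    Ins/del cost 1, substitutions cost `sub_cost`, matches are free."""
    n, m = len(source), len(target)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0], ptr[i][0] = i, "up"        # deletions along the first column
    for j in range(1, m + 1):
        D[0][j], ptr[0][j] = j, "left"      # insertions along the first row
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else sub_cost
            D[i][j], ptr[i][j] = min(
                (D[i - 1][j - 1] + sub, "diag"),
                (D[i - 1][j] + 1, "up"),
                (D[i][j - 1] + 1, "left"),
            )
    # Backtrace from the lower-right corner to recover an alignment.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if ptr[i][j] == "diag":
            ops.append(("keep" if source[i - 1] == target[j - 1] else "subst",
                        source[i - 1], target[j - 1]))
            i, j = i - 1, j - 1
        elif ptr[i][j] == "up":
            ops.append(("delete", source[i - 1], ""))
            i -= 1
        else:
            ops.append(("insert", "", target[j - 1]))
            j -= 1
    return D[n][m], list(reversed(ops))

dist, ops = align("intention", "execution")
print(dist)  # 8, matching the lower-right corner of Fig. 2.16
```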

While we worked our example with simple Levenshtein distance, the algorithm in Fig. 2.15 allows arbitrary weights on the operations. For spelling correction, for example, substitutions are more likely to happen between letters that are next to each other on the keyboard. We'll discuss how these weights can be estimated in Ch. 5. The Viterbi algorithm, for example, is an extension of minimum edit distance that uses probabilistic definitions of the operations. Instead of computing the "minimum edit distance" between two strings, Viterbi computes the "maximum probability alignment" of one string with another. We'll discuss this more in Chapter 7.

2.5 Summary

This chapter introduced a fundamental tool in language processing, the regular expression, and showed how to perform basic text normalization tasks including word segmentation and normalization, sentence segmentation, and stemming. We also introduced the important minimum edit distance algorithm for comparing strings. Here's a summary of the main points we covered about these ideas:

• The regular expression language is a powerful tool for pattern-matching.
• Basic operations in regular expressions include concatenation of symbols, disjunction of symbols ([], |, and .), counters (*, +, and {n,m}), anchors (ˆ, $) and precedence operators ((, )).
• Word tokenization and normalization are generally done by cascades of simple regular expression substitutions or finite automata.
• The Porter algorithm is a simple and efficient way to do stemming, stripping off affixes. It does not have high accuracy but may be useful for some tasks.
• The minimum edit distance between two strings is the minimum number of operations it takes to edit one into the other. Minimum edit distance can be computed by dynamic programming, which also results in an alignment of the two strings.

Bibliographical and Historical Notes

Kleene (1951) and (1956) first defined regular expressions and the finite automaton, based on the McCulloch-Pitts neuron. Ken Thompson was one of the first to build regular expression compilers into editors for text searching (Thompson, 1968). His editor ed included a command "g/regular expression/p", or Global Regular Expression Print, which later became the Unix grep utility.

Text normalization algorithms have been applied since the beginning of the field. One of the earliest widely-used stemmers was Lovins (1968). Stemming was also applied early to the digital humanities, by Packard (1973), who built an affix-stripping morphological parser for Ancient Greek. Currently a wide variety of code for tokenization and normalization is available, such as the Stanford Tokenizer (http://nlp.stanford.edu/software/tokenizer.shtml) or specialized tokenizers for Twitter (O'Connor et al., 2010), or for sentiment (http://sentiment.christopherpotts.net/tokenizing.html). See Palmer (2012) for a survey of text preprocessing. While the max-match algorithm we describe is commonly used as a segmentation baseline in languages like Chinese, higher-accuracy algorithms like the Stanford CRF segmenter are based on sequence models; see Tseng et al. (2005) and Chang et al. (2008). NLTK is an essential tool that offers both useful Python libraries (http://www.nltk.org) and textbook descriptions (Bird et al., 2009) of many algorithms including text normalization and corpus interfaces.

For more on Herdan's law and Heaps' Law, see Herdan (1960, p. 28), Heaps (1978), Egghe (2007) and Baayen (2001); Yasseri et al. (2012) discuss the relationship with other measures of linguistic complexity. For more on edit distance, see the excellent Gusfield (1997). Our example measuring the edit distance from 'intention' to 'execution' was adapted from Kruskal (1983). There are various publicly available packages to compute edit distance, including Unix diff and the NIST sclite program (NIST, 2005). In his autobiography Bellman (1984) explains how he originally came up with the term dynamic programming:

“...The 1950s were not good years for mathematical research. [the] Secretary of Defense ...had a pathological fear and hatred of the word, research... I decided therefore to use the word, “programming”. I wanted to get across the idea that this was dynamic, this was multi- stage... I thought, let’s ... take a word that has an absolutely precise meaning, namely dynamic... it’s impossible to use the word, dynamic, in a pejorative sense. Try thinking of some combination that will pos- sibly give it a pejorative meaning. It’s impossible. Thus, I thought dynamic programming was a good name. It was something not even a Congressman could object to.”

Exercises

2.1 Write regular expressions for the following languages. 1. the set of all alphabetic strings; 2. the set of all lower case alphabetic strings ending in a b;


• Levenshtein, Viterbi are both examples of dynamic programming

6.1 THE NOISY CHANNEL MODEL

How can we estimate the likelihood P(x|w), also called the channel model or error model? A perfect model of the probability that a word will be mistyped would condition on all sorts of factors: who the typist was, whether the typist was left-handed or right-handed, and so on. Luckily, we can get a pretty reasonable estimate of P(x|w) just by looking at local context: the identity of the correct letter itself, the misspelling, and the surrounding letters. For example, the letters m and n are often substituted for each other; this is partly a fact about their identity (these two letters are pronounced similarly and they are next to each other on the keyboard) and partly a fact about context (because they occur in similar contexts). A simple model might estimate, for example, p(acress|across) just using the number of times that the letter e was substituted for the letter o in some large corpus of errors. To compute the probability for each edit in this way we'll need a confusion matrix that contains counts of errors. In general a confusion matrix lists the number of times one thing was confused with another. Thus for example a substitution matrix will be a square matrix of size 26 × 26 (or more generally |A| × |A|, for an alphabet A) that represents the number of times one letter was incorrectly used instead of another. Following Kernighan et al. (1990) we'll use four confusion matrices.

del[x,y]: count(xy typed as x)
ins[x,y]: count(x typed as xy)
sub[x,y]: count(x typed as y)
trans[x,y]: count(xy typed as yx)

Channel model

• Norvig reading: Just make up edit parameters: deletions, edits, etc. all have logprob = -1
• Where do we get data? Need access to a corpus of mistakes

Note that we've conditioned the insertion and deletion probabilities on the previous character; we could instead have chosen to condition on the following character. To estimate these confusion matrices, one way is to extract them from lists of misspellings like the following:

additional: addional, additonal
environments: enviornments, enviorments, enviroments
preceded: preceeded
...
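One way such a misspelling list can be turned into the four confusion matrices is sketched below, assuming each (correct, error) pair differs by a single edit; pairs that need more than one edit (like addional, which drops two letters) are simply skipped. The key layout follows the count definitions above; '#' marks the start-of-word position.

```python
from collections import Counter

def classify_edit(correct, error):
    """Classify a (correct, error) pair as one single edit, or None.
    Keys follow the four matrices above: del[x,y] = count(xy typed as x), etc."""
    if len(correct) == len(error) + 1:                      # deletion: xy typed as x
        for i in range(len(error) + 1):
            if error[:i] + correct[i] + error[i:] == correct:
                prev = correct[i - 1] if i > 0 else '#'
                return 'del', (prev, correct[i])
    if len(error) == len(correct) + 1:                      # insertion: x typed as xy
        for i in range(len(correct) + 1):
            if correct[:i] + error[i] + correct[i:] == error:
                prev = correct[i - 1] if i > 0 else '#'
                return 'ins', (prev, error[i])
    if len(correct) == len(error):
        d = [i for i in range(len(correct)) if correct[i] != error[i]]
        if len(d) == 1:                                     # substitution: x typed as y
            return 'sub', (correct[d[0]], error[d[0]])
        if (len(d) == 2 and d[1] == d[0] + 1
                and correct[d[0]] == error[d[1]]
                and correct[d[1]] == error[d[0]]):          # transposition: xy typed as yx
            return 'trans', (correct[d[0]], correct[d[1]])
    return None

matrices = {'del': Counter(), 'ins': Counter(), 'sub': Counter(), 'trans': Counter()}
pairs = [('additional', 'additonal'), ('preceded', 'preceeded'),
         ('across', 'acress'), ('caress', 'acress')]
for correct, error in pairs:
    edit = classify_edit(correct, error)
    if edit is not None:
        kind, key = edit
        matrices[kind][key] += 1
```

For example, the acress/across pair increments sub[o, e], matching the e|o row in Fig. 6.4.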

There are lists available on Wikipedia and from Roger Mitton (http://www.dcs.bbk.ac.uk/~ROGER/corpora.html) and Peter Norvig (http://norvig.com/ngrams/). Norvig also gives the counts for each single-character edit that can be used to directly create the error model probabilities.

An alternative approach used by Kernighan et al. (1990) is to compute the matrices by iteratively using this very spelling error correction algorithm itself. The iterative algorithm first initializes the matrices with equal values; thus, any character is equally likely to be deleted, equally likely to be substituted for any other character, etc. Next, the spelling error correction algorithm is run on a set of spelling errors. Given the set of typos paired with their predicted corrections, the confusion matrices can now be recomputed, the spelling algorithm run again, and so on. This iterative algorithm is an instance of the important EM algorithm (Dempster et al., 1977), which we discuss in Chapter 7.

Once we have the confusion matrices, we can estimate P(x|w) as follows (where w_i is the ith character of the correct word w and x_i is the ith character of the typo x):

           del[x_{i-1}, w_i] / count[x_{i-1} w_i]       if deletion
P(x|w) =   ins[x_{i-1}, w_i] / count[w_{i-1}]           if insertion        (6.6)
           sub[x_i, w_i] / count[w_i]                   if substitution
           trans[w_i, w_{i+1}] / count[w_i w_{i+1}]     if transposition

Using the counts from Kernighan et al. (1990) results in the error model probabilities for acress shown in Fig. 6.4.

Candidate    Correct  Error
Correction   Letter   Letter   x|w     P(x|w)
actress      t        -        c|ct    .000117
cress        -        a        a|#     .00000144
caress       ca       ac       ac|ca   .00000164
access       c        r        r|c     .000000209
across       o        e        e|o     .0000093
acres        -        s        es|e    .0000321
acres        -        s        ss|s    .0000342
Figure 6.4 Channel model for acress; the probabilities are taken from the del[], ins[], sub[], and trans[] confusion matrices as shown in Kernighan et al. (1990).

Figure 6.5 shows the final probabilities for each of the potential corrections; the unigram prior is multiplied by the likelihood (computed with Eq. 6.6 and the confusion matrices). The final column shows the product, multiplied by 10^9 just for readability.

Candidate    Correct  Error
Correction   Letter   Letter   x|w     P(x|w)       P(w)        10^9 * P(x|w)P(w)
actress      t        -        c|ct    .000117      .0000231    2.7
cress        -        a        a|#     .00000144    .000000544  0.00078
caress       ca       ac       ac|ca   .00000164    .00000170   0.0028
access       c        r        r|c     .000000209   .0000916    0.019
across       o        e        e|o     .0000093     .000299     2.8
acres        -        s        es|e    .0000321     .0000318    1.0
acres        -        s        ss|s    .0000342     .0000318    1.0
Figure 6.5 Computation of the ranking for each candidate correction, using the language model shown earlier and the error model from Fig. 6.4. The final score is multiplied by 10^9 for readability.
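The ranking in Fig. 6.5 is just an element-wise product; it can be reproduced in a few lines (the numbers are copied from the table, with the two acres rows kept separate since they come from different edits):

```python
# Channel-model and unigram-prior values copied from Figs. 6.4/6.5.
channel = {'actress': .000117, 'cress': .00000144, 'caress': .00000164,
           'access': .000000209, 'across': .0000093,
           'acres (es|e)': .0000321, 'acres (ss|s)': .0000342}
prior = {'actress': .0000231, 'cress': .000000544, 'caress': .00000170,
         'access': .0000916, 'across': .000299,
         'acres (es|e)': .0000318, 'acres (ss|s)': .0000318}

# Final column of Fig. 6.5: 10^9 * P(x|w) * P(w)
score = {w: 1e9 * channel[w] * prior[w] for w in channel}
best = max(score, key=score.get)
print(best, round(score[best], 1))  # across comes out on top, as in the text
```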

The computations in Fig. 6.5 show that our implementation of the noisy channel model chooses across as the best correction, and actress as the second most likely word. Unfortunately, the algorithm was wrong here; the writer's intention becomes clear from the context: . . . was called a "stellar and versatile acress whose combination of sass and glamour has defined her. . . ". The surrounding words make it clear that actress and not across was the intended word.

We see an observation x (a misspelled word), and our job is to find the word w that generated this misspelled word. Out of all possible words in the vocabulary V we want to find the word ŵ such that P(w|x) is highest. We use the hat notation ˆ to mean "our estimate of the correct word".

ŵ = argmax_{w ∈ V} P(w|x)    (6.1)

The function argmax_x f(x) means "the x such that f(x) is maximized". Equation 6.1 thus means that out of all words in the vocabulary, we want the particular word that maximizes the right-hand side P(w|x).

The intuition of Bayesian classification is to use Bayes' rule to transform Eq. 6.1 into a set of other probabilities. Bayes' rule is presented in Eq. 6.2; it gives us a way to break down any conditional probability P(a|b) into three other probabilities:

P(a|b) = P(b|a) P(a) / P(b)    (6.2)
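Because the denominator P(b) in Eq. 6.2 is shared by every candidate, it never affects which candidate wins the argmax. A quick numerical check of this, using the acress numbers for across and actress from Figs. 6.4 and 6.5:

```python
# Numbers from Figs. 6.4/6.5: ranking by P(x|w)P(w) gives the same
# argmax as ranking by the full posterior P(w|x).
likelihood = {'across': .0000093, 'actress': .000117}   # P(x|w)
prior      = {'across': .000299,  'actress': .0000231}  # P(w)

numerator = {w: likelihood[w] * prior[w] for w in prior}      # P(x|w)P(w)
p_x = sum(numerator.values())     # P(x), restricted to these two candidates
posterior = {w: numerator[w] / p_x for w in prior}            # P(w|x)

assert max(posterior, key=posterior.get) == max(numerator, key=numerator.get)
```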

We can then substitute Eq. 6.2 into Eq. 6.1 to get Eq. 6.3:

ŵ = argmax_{w ∈ V} P(x|w) P(w) / P(x)    (6.3)

We can conveniently simplify Eq. 6.3 by dropping the denominator P(x). Why is that? Since we are choosing a potential correction word out of all words, we will be computing P(x|w)P(w)/P(x) for each word. But P(x) doesn't change for each word; we are always asking about the most likely word for the same observed error x, which must have the same probability P(x). Thus, we can choose the word that maximizes this simpler formula:

ŵ = argmax_{w ∈ V} P(x|w) P(w)    (6.4)

To summarize, the noisy channel model says that we have some true underlying word w, and we have a noisy channel that modifies the word into some possible misspelled observed surface form. The likelihood or channel model of the noisy channel producing any particular observation sequence x is modeled by P(x|w). The prior probability of a hidden word is modeled by P(w). We can compute the most probable word ŵ given that we've seen some observed misspelling x by multiplying the prior P(w) and the likelihood P(x|w) and choosing the word for which this product is greatest.

We apply the noisy channel approach to correcting non-word spelling errors by taking any word not in our spell dictionary, generating a list of candidate words,

ranking them according to Eq. 6.4, and picking the highest-ranked one. We can modify Eq. 6.4 to refer to this list of candidate words C instead of the full vocabulary V as follows:

ŵ = argmax_{w ∈ C} P(x|w) P(w)    (6.5)

where the first term P(x|w) is the channel model (computed from edit distance) and the second term P(w) is the prior (the language model).

The noisy channel algorithm is shown in Fig. 6.2.

To see the details of the computation of the likelihood and language model, let's walk through an example, applying the algorithm to the example misspelling acress. The first stage of the algorithm proposes candidate corrections by finding words that



function NOISY CHANNEL SPELLING(word x, dict D, lm, editprob) returns correction
   if x ∉ D
      candidates, edits ← All strings at edit distance 1 from x that are ∈ D, and their edit
      for each c, e in candidates, edits
         channel ← editprob(e)
         prior ← lm(c)
         score[c] = log channel + log prior
      return argmax_c score[c]

Figure 6.2 Noisy channel model for spelling correction for unknown words.

have a similar spelling to the input word. Analysis of spelling error data has shown that the majority of spelling errors consist of a single-letter change, and so we often make the simplifying assumption that these candidates have an edit distance of 1 from the error word. To find this list of candidates we'll use the minimum edit distance algorithm introduced in Chapter 2, but extended so that in addition to insertions, deletions, and substitutions, we'll add a fourth type of edit, transpositions, in which two letters are swapped. The version of edit distance with transpositions is called Damerau-Levenshtein edit distance. Applying all such single transformations to acress yields the list of candidate words in Fig. 6.3.

Error    Candidate    Correct  Error   Position
         Correction   Letter   Letter  (Letter #)  Type
acress   actress      t        --      2           deletion
acress   cress        --       a       0           insertion
acress   caress       ca       ac      0           transposition
acress   access       c        r       2           substitution
acress   across       o        e       3           substitution
acress   acres        --       s       5           insertion
acress   acres        --       s       4           insertion
Figure 6.3 Candidate corrections for the misspelling acress and the transformations that would have produced the error (after Kernighan et al. (1990)). "--" represents a null letter.

Once we have a set of candidates, to score each one using Eq. 6.5 requires that we compute the prior and the channel model.

The prior probability of each correction P(w) is the language model probability of the word w in context. We can use any language model from the previous chapter, from unigram to trigram or 4-gram. For this example let's start by assuming a unigram language model. We computed the language model from the 404,253,213 words in the Corpus of Contemporary American English (COCA).

Unigram LM:
w        count(w)   p(w)
actress  9,321      .0000231
cress    220        .000000544
caress   686        .00000170
access   37,038     .0000916
across   120,844    .000299
acres    12,874     .0000318



N-Gram Language Models

• More context helps! Assume a bigram Markov model
• P(acress | versatile _ whose) = ?

For this reason, it is important to use larger language models than unigrams. For example, if we use the Corpus of Contemporary American English to compute bigram probabilities for the words actress and across in their context using add-one smoothing, we get the following probabilities:

P(actress | versatile) = .000021
P(across | versatile)  = .000021
P(whose | actress)     = .0010
P(whose | across)      = .000006

Multiplying these out gives us the language model estimate for the two candidates in context:

P("versatile actress whose") = .000021 × .0010    = 210 × 10^-10
P("versatile across whose")  = .000021 × .000006  = 1 × 10^-10

Combining the language model with the error model in Fig. 6.5, the bigram noisy channel model now chooses the correct word actress. Evaluating spell correction algorithms is generally done by holding out a training, development and test set from lists of errors like those on the Norvig and Mitton sites mentioned above.
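The add-one-smoothed bigram comparison above can be sketched in a few lines of Python. The corpus here is a tiny illustrative stand-in (not COCA), so the numbers differ from those in the text, but the ranking logic is the same: score each candidate by the product of its left and right bigram probabilities.

```python
from collections import Counter

def bigram_addone(corpus_tokens):
    """Return an add-one-smoothed bigram probability function:
    P(w2 | w1) = (count(w1, w2) + 1) / (count(w1) + V),
    where V is the vocabulary size."""
    unigrams = Counter(corpus_tokens)
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    vocab_size = len(unigrams)

    def prob(w2, w1):
        return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab_size)
    return prob

# Toy corpus standing in for a real one (illustrative only)
corpus = "a versatile actress whose range is wide".split()
p = bigram_addone(corpus)

# Score each candidate in its bigram context, as in the text:
# P(candidate | versatile) * P(whose | candidate)
for cand in ("actress", "across"):
    print(cand, p(cand, "versatile") * p("whose", cand))
```

Even with add-one smoothing assigning nonzero mass to the unseen bigrams involving across, the seen bigrams push actress ahead.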

6.2 Real-word spelling errors

The noisy channel approach can also be applied to detect and correct real-word spelling errors: errors that result in an actual word of English. These can arise from typographical errors (insertion, deletion, transposition) that accidentally produce a real word (e.g., there for three), or because the writer substituted the wrong spelling of a homophone or near-homophone (e.g., dessert for desert, or piece for peace). A number of studies suggest that between 25% and 40% of spelling errors are valid English words, as in the following examples (Kukich, 1992):

They are leaving in about fifteen minuets to go to her house.
The design an construction of the system will take more than a year.
Can they lave him my messages?
The study was conducted mainly be John Black.

The noisy channel can deal with real-word errors as well. Many misspellings are legitimate English words, so the language model is key. Let's begin with a version of the noisy channel model first proposed by Mays et al. (1991) to deal with these real-word spelling errors. Their algorithm takes the input sentence X = x1, x2, ..., xk, ..., xn, generates a large set of candidate correction sentences C(X), then picks the sentence with the highest language model probability.

To generate the candidate correction sentences, we start by generating a set of candidate words for each input word xi. The candidate set C(xi) includes every English word with a small edit distance from xi; Mays et al. (1991) chose edit distance one. So C(graffe) = {giraffe, graff, gaffe}. We then make the simplifying assumption that every sentence has only one error. Thus the set of candidate sentences C(X) for a sentence X = Only two of thew apples would be:

only two of thew apples
oily two of thew apples
only too of thew apples
only to of thew apples
only tao of thew apples
only two on thew apples
only two off thew apples
only two of the apples
only two of threw apples
only two of thew applies
only two of thew dapples
...

Each sentence is scored by the noisy channel:
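Candidate-sentence generation under the one-error-per-sentence assumption can be sketched as follows. The `edits1` helper enumerates all strings one edit away (a standard Norvig-style construction); the vocabulary here is a tiny illustrative set, not a real dictionary.

```python
import string

def edits1(word):
    """All strings one edit (deletion, transposition, substitution,
    insertion) away from `word`."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    substitutions = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + substitutions + inserts)

def candidate_sentences(words, vocab):
    """Candidate corrections under the one-error-per-sentence assumption:
    the sentence as typed, plus every sentence formed by replacing exactly
    one word with a real word at edit distance 1."""
    yield list(words)
    for i, w in enumerate(words):
        for c in sorted(edits1(w) & vocab):
            if c != w:
                yield words[:i] + [c] + words[i + 1:]

# Tiny illustrative vocabulary (not a real dictionary)
vocab = {"only", "oily", "two", "too", "to", "of", "the", "thew",
         "threw", "apples"}
for cand in candidate_sentences("only two of thew apples".split(), vocab):
    print(" ".join(cand))
```

With a full dictionary, the candidate set grows quickly, which is why each candidate sentence is then ranked by the noisy channel rather than enumerated exhaustively at larger edit distances.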

Ŵ = argmax_{W ∈ C(X)} P(X|W) P(W)        (6.7)

For P(W), we can use the trigram probability of the sentence. What about the channel model? Since these are real words, we need to consider the possibility that the input word is not an error. Let's say that the channel probability of writing a word correctly, P(w|w), is α; we can make different assumptions about exactly what the value of α is in different tasks; perhaps α is .95, assuming people write 1 word wrong out of 20, for some tasks, or maybe .99 for others. Mays et al. (1991) proposed a simple model: given a typed word x, let the channel model P(x|w) be α when x = w, and then just distribute 1 − α evenly over all other candidate corrections C(x):

p(x|w) = α                  if x = w
         (1 − α)/|C(x)|     if w ∈ C(x)        (6.8)
         0                  otherwise

Now we can replace the equal distribution of 1 − α over all corrections in Eq. 6.8; we'll make the distribution proportional to the edit probability from the more sophisticated channel model from Eq. 6.6 that used the confusion matrices.

Let's see an example of this integrated noisy channel model applied to a real word. Suppose we see the string two of thew. The author might have intended to type the real word thew, a rare word meaning 'muscular strength'. But thew here could also be a typo for the or some other word. For the purposes of this example let's consider edit distance 1, and only the following five candidates: the, thaw, threw, them, and thwe (a rare name), plus the string as typed, thew. We took the edit probabilities from Norvig's (2009) analysis of this example. For the language model probabilities, we used a Stupid Backoff model trained on the Google N-grams:

P(the | two of)   = 0.476012
P(thew | two of)  = 9.95051 × 10^-8
P(thaw | two of)  = 2.09267 × 10^-7
P(threw | two of) = 8.9064 × 10^-7
P(them | two of)  = 0.00144488
P(thwe | two of)  = 5.18681 × 10^-9

Here we've just computed probabilities for the single phrase two of thew, but the model applies to entire sentences; so if the example in context was two of thew
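The scoring of Eqs. 6.7-6.8 can be sketched for the thew example. This is a minimal sketch using the simple equal-distribution channel of Eq. 6.8 (not the confusion-matrix edit probabilities from Norvig's analysis) and an assumed α = 0.95; the language-model values are the ones quoted above.

```python
def channel_prob(x, w, candidates, alpha=0.95):
    """Channel model of Eq. 6.8: probability alpha that the typed word
    is correct, with 1 - alpha spread evenly over the candidates."""
    if x == w:
        return alpha
    if w in candidates:
        return (1 - alpha) / len(candidates)
    return 0.0

# Language-model probabilities P(w | two of) quoted in the text
# (Stupid Backoff over the Google N-grams).
lm = {
    "the":   0.476012,
    "thew":  9.95051e-8,
    "thaw":  2.09267e-7,
    "threw": 8.9064e-7,
    "them":  0.00144488,
    "thwe":  5.18681e-9,
}

typed = "thew"
candidates = {w for w in lm if w != typed}

# Noisy channel score for each intended word w: P(x | w) * P(w | context)
scores = {w: channel_prob(typed, w, candidates) * lm[w] for w in lm}
best = max(scores, key=scores.get)
print(best)
```

Even with thew getting the large α mass from the channel, its tiny language model probability lets the frequent word the win; with a higher α (say .99) or a context that favors thew, the as-typed word could instead be kept.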