
Real-Word Spelling Correction using Google Web 1T n-gram with Backoff

Aminul ISLAM
Department of Computer Science, SITE
University of Ottawa
Ottawa, ON, Canada
[email protected]

Diana INKPEN
Department of Computer Science, SITE
University of Ottawa
Ottawa, ON, Canada
[email protected]

Abstract: We present a method for correcting real-word spelling errors using the Google Web 1T n-gram data set and a normalized and modified version of the Longest Common Subsequence (LCS) string matching algorithm. Our method is focused mainly on how to improve the correction recall (the fraction of errors corrected) while keeping the correction precision (the fraction of suggestions that are correct) as high as possible. Evaluation results on a standard data set show that our method performs very well.

Keywords: Real-word; spelling correction; Google Web 1T; n-gram

1. Introduction

Real-word spelling errors are words in a text that occur when a user mistakenly types a correctly spelled word when another was intended. Errors of this type may be caused by the writer's ignorance of the correct spelling of the intended word or by typing mistakes. Such errors generally go unnoticed by most spellcheckers, as they deal with words in isolation, accepting them as correct if they are found in the dictionary and flagging them as errors if they are not. This approach would be sufficient to correct the non-word error myss in "It doesn't know what the myss is all about." but not the real-word error muss in "It doesn't know what the muss is all about." To correct the latter, the spell-checker needs to make use of the surrounding context, in this case to recognise that fuss is more likely to occur than muss in the context of all about. Ironically, errors of this type may even be caused by spelling checkers in the correction of non-word spelling errors, when the auto-correct feature in some word-processing software silently changes a non-word to the wrong real word [1], and sometimes when correcting a flagged error, the user accidentally makes a wrong selection from the choices offered [2]. An extensive review of real-word spelling correction is given in [3, 1], and the problem of spelling correction more generally is reviewed in [4].

In this paper, we present a method for correcting real-word spelling errors using the Google Web 1T n-gram data set [5]1 and a normalized and modified version of the Longest Common Subsequence (LCS) string matching algorithm (details are in Section 3.1). Our intention is to focus on how to improve the correction recall while maintaining the correction precision as high as possible. The reason behind this intention is that if the recall for any method is around 0.5, the method fails to correct around 50 percent of the errors. As a result, we cannot completely rely on such methods; some kind of human intervention or suggestion is needed to correct the rest of the uncorrected errors. Thus, if we have a method that can correct almost 90 percent of the errors, even generating some extra candidates that are incorrect is more helpful to the human.

This paper is organized as follows: Section 2 presents a brief overview of the related work. Our proposed method is described in Section 3. Evaluation and experimental results are discussed in Section 4. We conclude in Section 5.

1 Details of the Google Web 1T data set can be found at www.ldc.upenn.edu/Catalog/docs/LDC2006T13/readme.txt.

2. Related Work

Work on real-word spelling correction can roughly be classified into two basic categories: methods based on semantic information or human-made lexical resources, and methods based on machine learning or probability information. Our proposed method falls into the latter category.

2.1 Methods Based on Semantic Information

The 'semantic information' approach, first proposed by [6] and later developed by [1], detected semantic anomalies but was not restricted to checking words from predefined confusion sets. This approach was based on the observation that the words that a writer intends are generally semantically related to their surrounding words, whereas some types of real-word spelling errors are not.

2.2 Methods Based on Machine Learning

Machine learning methods treat real-word spelling correction as a lexical disambiguation task, and confusion sets are used to model the ambiguity between words. Normally, the machine learning and statistical approaches rely on pre-defined confusion sets, which are sets (usually pairs) of commonly confounded words, such as {their, there, they're} and {principle, principal}. [7], an example of a machine-learning method, combined the Winnow algorithm with weighted-majority voting, using nearby and adjacent words as features. Another example of a machine-learning method is that of [8].

2.3 Methods Based on Probability Information

[9] proposed a statistical method using word-trigram probabilities for detecting and correcting real-word errors without requiring predefined confusion sets. In this method, if the trigram-derived probability of an observed sentence is lower than that of any sentence obtained by replacing one of the words with a spelling variation, then we hypothesize that the original is an error and the variation is what the user intended.

[2] analyze the advantages and limitations of [9]'s method and present a new evaluation of the algorithm, designed so that the results can be compared with those of other methods; they then construct and evaluate some variations of the algorithm that use fixed-length windows. They consider a variation of the method that optimizes over relatively short, fixed-length windows instead of over a whole sentence (except in the special case when the sentence is smaller than the window), while respecting sentence boundaries as natural breakpoints. To check the spelling of a span of d words requires a window of length d+4 to accommodate all the trigrams that overlap with the words in the span. The smallest possible window is therefore 5 words long, which uses 3 trigrams to optimize only its middle word. They assume that the sentence is bracketed by two BoS and two EoS markers (to accommodate trigrams involving the first two and last two words of the sentence). The window starts with its left-hand edge at the first BoS marker, and [9]'s method is run on the words covered by the trigrams that it contains; the window then moves d words to the right and the process repeats until all the words in the sentence have been checked. As [9]'s algorithm is run separately in each window, potentially changing a word in each, [2]'s method as a side effect also permits multiple corrections in a single sentence.

[10] proposed a trigram-based method for real-word errors without explicitly using probabilities or even localizing the possible error to a specific word. This method simply assumes that any word trigram in the text that is attested in the British National Corpus [11] is correct, and any unattested trigram is a likely error. When an unattested trigram is observed, the method then tries the spelling variations of all words in the trigram to find attested trigrams to present to the user as possible corrections. The evaluation of this method was carried out on only 7100 words of the Wall Street Journal corpus, with 31 errors introduced (i.e., approximately one error in every 200 words), obtaining a recall of 0.33 for correction, a precision of 0.05, and an F-measure of 0.086.

3. Proposed Method

The proposed method first tries to determine some probable candidates and then finds the best one among the candidates. We consider a string similarity function and a frequency value function in our method. The following sections present a detailed description of each of these functions, followed by the procedure to determine some probable candidates along with the procedure to find the best candidate.

3.1 String Similarity between Two Strings

We use the longest common subsequence (LCS) [12] measure with some normalization and small modifications for our string similarity measure. We use the same three different modified versions of LCS that [13] used, along with another modified version of LCS, and then take a weighted sum of these2. [14] showed that edit distance and the length of the longest common subsequence are special cases of n-gram distance and similarity, respectively. [15] normalized LCS by dividing the length of the longest common subsequence by the length of the longer string and called it the longest common subsequence ratio (LCSR). But LCSR does not take into account the length of the shorter string, which sometimes has a significant impact on the similarity score.

[13] normalized the longest common subsequence so that it takes into account the lengths of both the shorter and the longer string and called it the normalized longest common subsequence (NLCS). We normalize NLCS in the following way, as it gives a better similarity value and is more computationally efficient:

v1 = NLCS(si, sj) = 2 × len(LCS(si, sj)) / (len(si) + len(sj))    (1)

While in classical LCS the common subsequence need not be consecutive, in spelling correction a consecutive common subsequence is important for a high degree of matching. [13] used the maximal consecutive longest common subsequence starting at character 1, MCLCS1, and the maximal consecutive longest common subsequence starting at any character n, MCLCSn. MCLCS1 takes two strings as input and returns the shorter string or the maximal consecutive portions of the shorter string that consecutively match with the longer string, where matching must start from the first character (character 1) for both strings. MCLCSn takes two strings as input and returns the shorter string or the maximal consecutive portions of the shorter string that consecutively match with the longer string, where matching may start from any character (character n) for both strings. They normalized MCLCS1 and MCLCSn and called them normalized MCLCS1 (NMCLCS1) and normalized MCLCSn (NMCLCSn), respectively. Similarly, we normalize NMCLCS1 and NMCLCSn in the following way:

v2 = NMCLCS1(si, sj) = 2 × len(MCLCS1(si, sj)) / (len(si) + len(sj))    (2)

v3 = NMCLCSn(si, sj) = 2 × len(MCLCSn(si, sj)) / (len(si) + len(sj))    (3)

[13] did not consider the consecutive common subsequence ending at the last character, though MCLCSn sometimes covers this, but not always. We argue that the consecutive common subsequence ending at the last character is as significant as the consecutive common subsequence starting at the first character. So, we introduce the maximal consecutive longest common subsequence ending at the last character, MCLCSz (Algorithm 1). Algorithm 1 takes two strings as input and returns the shorter string or the maximal consecutive portions of the shorter string that consecutively match with the longer string, where matching must end at the last character for both strings. We normalize MCLCSz and call it normalized MCLCSz (NMCLCSz):

v4 = NMCLCSz(si, sj) = 2 × len(MCLCSz(si, sj)) / (len(si) + len(sj))    (4)

Algorithm 1: MCLCSz (Maximal Consecutive LCS ending at the last character)
input : si, sj            /* si and sj are input strings, where |si| <= |sj| */
output: str               /* str is the maximal consecutive LCS ending at the last character */
1   str <- NULL
2   c <- 1
3   while |si| >= c do
4       x <- SubStr(si, -c, 1)   /* returns the cth character of si from the end */
5       y <- SubStr(sj, -c, 1)   /* returns the cth character of sj from the end */
6       if x = y then
7           str <- SubStr(si, -c, c)
8       else
9           return str
10      end
11      increment c
12  end

We take the weighted sum of the individual values v1, v2, v3, and v4 from equations (1), (2), (3) and (4), respectively, to determine the string similarity score, where α1, α2, α3, α4 are weights and α1 + α2 + α3 + α4 = 1. Therefore, the similarity of the two strings, S ∈ [0, 1], is:

S(si, sj) = α1v1 + α2v2 + α3v3 + α4v4    (5)

We heuristically set equal weights for most of our experiments3. Theoretically, v3 ≥ v2 and v3 ≥ v4.

2 [13] use modified versions because in their experiments they obtained better results (precision and recall) for schema matching on a sample of data than when using the original LCS or other string similarity measures.

3 We use equal weights in several places in this paper in order to keep the system unsupervised. If development data were available, we could adjust the weights.
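A minimal Python sketch of this similarity measure is given below. The function names, the plain dynamic-programming LCS (rather than the bit-string algorithm of [12]), and the default equal weights are our own illustrative choices; the sketch only shows how equations (1)-(5) combine.

def lcs_len(s, t):
    """Length of the (non-consecutive) longest common subsequence."""
    m, n = len(s), len(t)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if s[i - 1] == t[j - 1] \
                       else max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def mclcs1_len(s, t):
    """MCLCS1: maximal consecutive match starting at character 1 (common prefix length)."""
    k = 0
    while k < min(len(s), len(t)) and s[k] == t[k]:
        k += 1
    return k

def mclcsn_len(s, t):
    """MCLCSn: maximal consecutive match starting at any character (longest common substring)."""
    best = 0
    dp = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            if s[i - 1] == t[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
                best = max(best, dp[i][j])
    return best

def mclcsz_len(s, t):
    """MCLCSz (Algorithm 1): maximal consecutive match ending at the last character
    (common suffix length)."""
    k = 0
    while k < min(len(s), len(t)) and s[-1 - k] == t[-1 - k]:
        k += 1
    return k

def string_similarity(si, sj, weights=(0.25, 0.25, 0.25, 0.25)):
    """Equation (5): weighted sum of the normalized values v1..v4, with equal weights."""
    norm = lambda length: 2.0 * length / (len(si) + len(sj))
    v = (norm(lcs_len(si, sj)), norm(mclcs1_len(si, sj)),
         norm(mclcsn_len(si, sj)), norm(mclcsz_len(si, sj)))
    return sum(a * x for a, x in zip(weights, v))

For example, string_similarity("retort", "report") returns about 0.54, above the 0.5 threshold used later when filtering candidate words.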
3.2 Normalized Frequency Value

We determine the normalized frequency value of each candidate word for a single position with respect to all other candidates for the same position. If we find n replacements of a word wi, which are {wi1, wi2, ···, wij, ···, win}, and their frequencies {fi1, fi2, ···, fij, ···, fin}, where fij is the frequency of an n-gram (where n ∈ {5, 4, 3, 2} and the candidate word wij is a member of the n-gram), then we determine the normalized frequency value of any candidate word wij as the frequency of the n-gram containing wij over the maximum frequency among all the candidate words for that position:

F(wij) = fij / max(fi1, fi2, ···, fij, ···, fin)    (6)

3.3 Determining Candidate Words (Phase 1)

Our task is to correct real-word spelling errors in an input text using the Google Web 1T data set. First, we use the Google 5-gram data set to find candidate words for the word having the spelling error. If the 5-gram data set fails to generate at least one candidate word, then we move forward to the 4-gram data set, the 3-gram data set, or the 2-gram data set, whenever the preceding data set fails to generate at least one candidate word. Let us consider an input text W which after tokenization4 has m words, i.e., W = {w1, w2, ..., wi, ..., wm}, where wi (i > 1) is the word having the spelling error5. First, we discuss how we find the candidates for the word marked as a spelling error, and then we discuss the procedure to find the most relevant single candidate from among several candidates.

4 We need to tokenize the input sentence so that the n-grams formed from the tokens returned after tokenization are consistent with the Google n-grams. The input sentence is tokenized in a manner similar to the tokenization of the Wall Street Journal portion of the Penn Treebank. Notable exceptions include the following:
- Hyphenated words are usually separated, and hyphenated numbers usually form one token.
- Sequences of numbers separated by slashes (e.g., in dates) form one token.
- Sequences that look like urls or email addresses form one token.

5 This means that we do not consider the first word as the word with the spelling error, though our method could have worked for the first word too. We did not do it here for efficiency reasons. Google n-grams are sorted based on the first word, then the second word, and so on. Based on this sorting, all Google 5-grams, 4-grams, 3-grams and 2-grams are stored in 117, 131, 97 and 31 different files, respectively. For example, all 117 Google 5-gram files could have been needed to access a single word, instead of accessing just one 5-gram file as we do for any other word. This is because when the first word needs to be corrected, it might be in any file among those 117 5-gram files.
3.3.1 Determining candidate words using the 5-gram data set

We use the following steps (a simplified code sketch of the core of this procedure is given after the footnotes below):

1. We define the term cut off frequency for word wi as the frequency of the 5-gram wi−4 wi−3 wi−2 wi−1 wi (where m > i > 5) in the Google Web 1T 5-grams, if the said 5-gram exists. Otherwise, we set the cut off frequency of wi as 0. The intuition behind using the cut off frequency is the fact that, if the word is misspelled, then the correct one should have a higher frequency than the misspelled one in the context. Thus, using the cut off frequency, we isolate a large number of candidates that we do not need to process.

2. We find all the 5-grams (where only wi is changed while wi−4, wi−3, wi−2 and wi−1 are unchanged), if any, having frequency greater than the cut off frequency of wi (determined in step 1). Let us consider that we find n replacements of wi, which are R1 = {wi1, wi2, ···, win}, and their frequencies F1 = {fi1, fi2, ···, fin}, where fij is the frequency of the 5-gram wi−4 wi−3 wi−2 wi−1 wij. If there is no such 5-gram (having frequency above the cut off frequency), our task is two-fold: we set matched←1 if there is at least one 5-gram, and we jump to step 5, 6 or 7, whichever is yet to be visited.

3. For each wij ∈ R1, we calculate the string similarity between wij and wi using equation (5) and then assign a weight using the following equation (7), only to the words that return a string similarity value greater than 0.5:

weight(wi, wij) = βS(wi, wij) + (1 − β)F(wij)    (7)

Equation (7) is used to ensure a balanced weight between the string similarity function and the probability function, where β refers to how much importance we give to the string similarity function with respect to the frequency function6.

4. We sort the words found in step 3 that were given weights, if any, in descending order by the assigned weights and keep only a fixed number7 of words as the candidate words8.

5. If this step has not been visited yet, then we follow step 1 to step 4 with the 5-gram wi−3 wi−2 wi−1 wi wi+1 if m − 1 > i > 4. Otherwise, we go to the next step.

6. If this step has not been visited yet, then we follow step 1 to step 4 with the 5-gram wi−2 wi−1 wi wi+1 wi+2 if m − 2 > i > 3. Otherwise, we go to the next step.

7. If this step has not been visited yet, then we follow step 1 to step 4 with the 5-gram wi−1 wi wi+1 wi+2 wi+3 if m − 3 > i > 2. Otherwise, we go to the next step.

8. If we find exactly one word in step 4, then we return that word as the suggestion word and exit.

9. If we find more than one word in step 4, we go to Section 3.5. Otherwise, if matched=1, we return no suggestion and exit.

6 We give more importance to the string similarity function than to the frequency value function throughout the section on determining candidate words, in order to have more candidate words, so that the chance of including the target word in the set of candidate words gets higher. For this reason, we heuristically set β=0.85 in equation (7) instead of setting β=0.5.

7 For our experiment, we heuristically set this number to 10. If we lower this number (say to 2 or 3), then there is a chance that sometimes the most appropriate candidate (the solution word) fails to be included in the candidate list.

8 Sometimes the top candidate word might be either a plural form or a past participle form of the original word, or it might even be a high-frequency function word (e.g., the). We omit these types of words from candidacy.
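The sketch below illustrates steps 1-4 for the single context window wi−4 ... wi−1. It is only a sketch: it assumes an in-memory dictionary counts5 mapping 5-gram tuples to frequencies (the actual system reads the Google Web 1T files), reuses string_similarity from the earlier sketch, and omits the matched bookkeeping and the alternative windows of steps 5-7.

from typing import Dict, List, Tuple

def candidates_5gram(words: List[str], i: int,
                     counts5: Dict[Tuple[str, ...], int],
                     beta: float = 0.85, keep: int = 10) -> List[Tuple[str, float]]:
    """Steps 1-4 for the context w[i-4..i-1] w[i]: cut off frequency, replacement
    lookup, similarity filter (> 0.5), weighting by equation (7), and truncation."""
    context = tuple(words[i - 4:i])
    cutoff = counts5.get(context + (words[i],), 0)                # step 1
    # step 2: 5-grams sharing the context, above the cut off frequency
    repl = {g[-1]: f for g, f in counts5.items()
            if g[:-1] == context and f > cutoff and g[-1] != words[i]}
    if not repl:
        return []
    max_f = max(repl.values())                                    # for equation (6)
    weighted = []
    for w, f in repl.items():                                     # step 3
        sim = string_similarity(words[i], w)                      # equation (5)
        if sim > 0.5:
            weighted.append((w, beta * sim + (1 - beta) * f / max_f))
    weighted.sort(key=lambda x: x[1], reverse=True)               # step 4
    return weighted[:keep]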

3.3.2 Determining candidate words using the 4-gram data set

We use the following steps:

1. We define the term cut off frequency for word wi as the frequency of the 4-gram wi−3 wi−2 wi−1 wi (where m > i > 4) in the Google Web 1T 4-grams, if the said 4-gram exists. Otherwise, we set the cut off frequency of wi as 0.

2. We find all the 4-grams (where only wi is changed while wi−3, wi−2 and wi−1 are unchanged), if any, having frequency greater than the cut off frequency of wi (determined in step 1). Let us consider that we find n replacements of wi, which are R1 = {wi1, wi2, ···, win}, and their frequencies F1 = {fi1, fi2, ···, fin}, where fij is the frequency of the 4-gram wi−3 wi−2 wi−1 wij. If there is no such 4-gram, our task is two-fold: we set matched←1 if there is at least one 4-gram having any frequency, and we jump to step 5 or 6, whichever is yet to be visited.

3. For each wij ∈ R1, we calculate the string similarity between wij and wi using equation (5) and then assign a weight using equation (7), only to the words that return a string similarity value greater than 0.5.

4. We sort the words found in step 3 that were given weights, if any, in descending order by the assigned weights and keep only a fixed number of words as the candidate words.

5. If this step has not been visited yet, then we follow step 1 to step 4 with the 4-gram wi−2 wi−1 wi wi+1 if m − 1 > i > 3. Otherwise, we go to the next step.

6. If this step has not been visited yet, then we follow step 1 to step 4 with the 4-gram wi−1 wi wi+1 wi+2 if m − 2 > i > 2. Otherwise, we go to the next step.

7. If we find exactly one word in step 4, then we return that word as the suggestion word and exit.

8. If we find more than one word in step 4, we go to Section 3.5. Otherwise, if matched=1, we return no suggestion and exit.
3.3.3 Determining candidate words using the 3-gram data set

We use the following steps:

1. We define the term cut off frequency for word wi as the frequency of the 3-gram wi−2 wi−1 wi (where m > i > 3) in the Google Web 1T 3-grams, if the said 3-gram exists. Otherwise, we set the cut off frequency of wi as 0.

2. We find all the 3-grams (where only wi is changed while wi−2 and wi−1 are unchanged), if any, having frequency greater than the cut off frequency of wi (determined in step 1). Let us consider that we find n replacements of wi, which are R1 = {wi1, wi2, ···, win}, and their frequencies F1 = {fi1, fi2, ···, fin}, where fij is the frequency of the 3-gram wi−2 wi−1 wij. If there is no such 3-gram, our task is two-fold: we set matched←1 if there is at least one 3-gram having any frequency, and we jump to step 5, if it is yet to be visited.

3. For each wij ∈ R1, we calculate the string similarity between wij and wi using equation (5) and then assign a weight using equation (7), only to the words that return a string similarity value greater than 0.5.

4. We sort the words found in step 3 that were given weights, if any, in descending order by the assigned weights and keep only a fixed number of words as the candidate words.

5. If this step has not been visited yet, then we follow step 1 to step 4 with the 3-gram wi−1 wi wi+1 if m − 1 > i > 2. Otherwise, we go to the next step.

6. If we find exactly one word in step 4, then we return that word as the suggestion word and exit.

7. If we find more than one word in step 4, we go to Section 3.5. Otherwise, if matched=1, we return no suggestion and exit.

3.3.4 Determining candidate words using the 2-gram data set

We use the following steps (the overall backoff order over the n-gram data sets is sketched in the code after this list):

1. We define the term cut off frequency for word wi as the frequency of the 2-gram wi−1 wi (where m > i > 2) in the Google Web 1T 2-grams, if the said 2-gram exists. Otherwise, we set the cut off frequency of wi as 0.

2. We find all the 2-grams (where only wi is changed while wi−1 is unchanged) having frequency greater than the cut off frequency of wi (determined in step 1). Let us consider that we find n replacements of wi, which are R1 = {wi1, wi2, ···, win}, and their frequencies F1 = {fi1, fi2, ···, fin}, where fij is the frequency of the 2-gram wi−1 wij. If there is no such 2-gram, we set matched←1 if there is at least one 2-gram having any frequency.

3. For each wij ∈ R1, we calculate the string similarity between wij and wi using equation (5) and then assign a weight using equation (7), only to the words that return a string similarity value greater than 0.5.

4. We sort the words found in step 3 that were given weights, if any, in descending order by the assigned weights and keep only a fixed number of words as the candidate words.

5. If we find exactly one word in step 4, then we return that word as the suggestion word and exit.

6. If we find more than one word in step 4, we go to Section 3.5. Otherwise, if matched=1, we return no suggestion and exit. Otherwise, we proceed to phase 2 (Section 3.4).
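The following sketch shows only the phase-1 backoff order, assuming candidate generators such as candidates_5gram above (and analogous, hypothetical 4-, 3- and 2-gram versions) are available; the matched flag that distinguishes "return no suggestion" from "back off to the next data set", and phase 2, are omitted for brevity.

def generate_candidates(words, i, generators):
    """Try the 5-gram, 4-gram, 3-gram and 2-gram generators in that order (phase 1)
    and stop at the first data set that yields at least one candidate."""
    for generate in generators:
        candidates = generate(words, i)
        if candidates:
            return candidates
    return []   # in the full method, fall through to phase 2 (Section 3.4)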

3.4 Determining Candidate Words (Phase 2)

The question of why we use phase 2 is best understood from the example "··· by releasing the WPPSS retort." (Table 1), where retort is the observed word that needs to be corrected. There is no 5-gram in which retort can be changed under the condition that the string similarity between retort and wij is at least 0.5 while "by releasing the WPPSS" is kept unchanged. Similarly, there is no such 4-gram, 3-gram or 2-gram in which retort can be changed under the same condition while keeping "releasing the WPPSS", "the WPPSS" and "WPPSS" unchanged, respectively. The reason for the unavailability of such n-grams is that "WPPSS" is not a very common word in the Google Web 1T data set.

Solving this issue is straightforward. We follow phase 1 with some small changes: instead of trying to find all the n-grams (n ∈ {5, 4, 3, 2}) where only wi is changed while keeping all of {···, wi−2, wi−1} unchanged, we try to find all the n-grams (n ∈ {5, 4, 3, 2}) where wi, as well as any but the first member of {···, wi−2, wi−1}, are changed, while keeping the rest of {···, wi−2, wi−1} unchanged.
3.5 Determining the Suggestion Word

We use this section only if we have more than one candidate word found in Section 3.3 or Section 3.4. Let us consider that we find n candidate words of wi in Section 3.3 or Section 3.4, which are {wi1, wi2, ···, wij, ···, win}. For each wij, we use the string similarity value between wij and wi (already calculated using equation (5)) and the normalized frequency value of wij (already calculated using equation (6)), and then calculate the weight value using equation (7), setting β = 0.5. We return the word having the maximum weight value as the target suggestion word:

Suggestion Word = argmax over wij of weight(wi, wij)    (8)
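A minimal sketch of this final step, under the assumption that each surviving candidate carries its similarity value and its n-gram frequency (the triple layout and function name are ours):

def suggestion_word(candidates, beta=0.5):
    """Equation (8): candidates is a list of (word, similarity, frequency) triples
    for one position; frequencies are normalized by their maximum (equation (6))."""
    max_f = max(f for _, _, f in candidates)
    return max(candidates,
               key=lambda c: beta * c[1] + (1 - beta) * (c[2] / max_f))[0]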

4. Evaluation and Experimental Results

We used as test data the same data that [2] used in their evaluation of [9]'s method, which in turn was a replication of the data used by [6] and [1] to evaluate their methods. The data consisted of 500 articles (approximately 300,000 words) from the 1987-89 Wall Street Journal corpus, with all headings, identifiers, and so on removed; that is, just a long stream of text. It is assumed that this data contains no errors; that is, the Wall Street Journal contains no malapropisms or other typos. In fact, a few typos (both non-word and real-word) were noticed during the evaluation, but they were small in number compared to the size of the text.

Malapropisms were randomly induced into this text at a frequency of approximately one word in 200. Specifically, any word whose base form was listed as a noun in WordNet (but regardless of whether it was used as a noun in the text; there was no syntactic analysis) was potentially replaced by any spelling variation found in the lexicon of the ispell spelling checker9. A spelling variation was defined as any word with an edit distance of 1 from the original word; that is, any single-character insertion, deletion, or substitution, or the transposition of two characters, that results in another real word. Thus, none of the induced malapropisms were derived from closed-class words, and none were formed by the insertion or deletion of an apostrophe or by splitting a word. Though [2] mentioned that the data contained 1402 inserted malapropisms, there were only 1391 malapropisms.

9 Ispell is a fast screen-oriented spelling checker that shows you your errors in the context of the original file and suggests possible corrections when it can figure them out.

Because it had earlier been used for evaluating [9]'s trigram method, which operates at the sentence level, the data set had been divided into three parts, without regard for article boundaries or text coherence: sentences into which no malapropism had been induced; the original versions of the sentences that received malapropisms; and the malapropized sentences. In addition, all instances of numbers of various kinds had been replaced by tags. Actual (random) numbers or values were restored for these tags. Some spacing anomalies around punctuation marks were corrected. A detailed description of this data can be found in [16, 2].

Some examples of successful and unsuccessful corrections, using Google 5-grams, are shown in Table 1. Some of the malapropisms created were "unfair" in the sense that no automatic procedure could reasonably be expected to see the error. The canonical case is the substitution of million for billion, or vice versa10; another is employee for employer, or vice versa, in many (but not all) contexts. In some cases, the substitution was merely a legitimate spelling variation of the same word (e.g., labour for labor). Table 2 shows some examples of successful and unsuccessful corrections using Google 4-grams where Google 5-grams fail to generate any suggestion. In Table 3 and Table 4, some examples of successful and unsuccessful corrections using Google 3-grams (where Google 5-grams and 4-grams fail to generate any suggestion) and using Google 2-grams (where Google 5-grams, 4-grams and 3-grams fail to generate any suggestion) are shown, respectively.

10 One such example is shown in the FALSE NEGATIVE section of Table 1. There are 40 malapropisms in the data set related to million/billion.

Table 1: Examples of successful and unsuccessful corrections using Google 5-grams. Italics indicate the observed word, the arrow indicates the correction, and square brackets indicate the intended word.

SUCCESSFUL CORRECTION:
··· chance to mend his fencers → fences [fences] with Mr. Jefferies ···
··· employees is the largest employee → employer [employer] in Europe and ···
SUCCESSFUL CORRECTION (in Second Phase):
··· by releasing the WPPSS retort → report [report].
··· tests comparing its potpourri covert → cover [cover] with the traditional ···
FALSE POSITIVE CORRECTION:
··· can almost see the firm → fire [farm] issue receding.
··· the Senate to support aim → him [aid] for the Contras ···
FALSE NEGATIVE:
··· I trust that the contract [contrast] between the American ···
··· as much as billion [million].

Table 2: Examples of successful and unsuccessful corrections using Google 4-grams. For these examples, Google 5-grams fail to generate any suggestion.

SUCCESSFUL CORRECTION:
··· one of a corporate raiser → raider [raider]'s five most ···
··· of rumors about insider tracing → trading [trading] in Pillsbury options ···
SUCCESSFUL CORRECTION (in Second Phase):
··· , for the Belzberg brothels → brothers [brothers]' First City ···
··· typical of Iowa's food → mood [mood] a month before ···
FALSE POSITIVE CORRECTION:
··· I'm uncomfortable tacking → talking [taking] a lot of ···
··· to approve a formal teat → test [text] of their proposed ···
FALSE NEGATIVE:
··· aimed at clearing out overstuffing [overstaffing] left by previous ···

Table 3: Examples of successful and unsuccessful corrections using Google 3-grams. For these examples, Google 5-grams and 4-grams fail to generate any suggestion.

SUCCESSFUL CORRECTION:
··· Disappointment turned to dire traits → straits [straits] when Vestron's ···
··· pay million in destitution → restitution [restitution] related to contract ···
SUCCESSFUL CORRECTION (in Second Phase):
··· its Ally & Gargano unity → unit [unit] settled their suit ···
··· 23 5/8 after rising 23 3/8 prints → points [points] Tuesday.
FALSE POSITIVE CORRECTION:
··· working on improving his lent → talent [left].
··· company feels the 23 mate → rate [date] is for the ···
FALSE NEGATIVE:
··· town saloon after the battle [cattle] roundup.
··· be this rock-firm [film] orgy, an ···

Table 4: Examples of successful and unsuccessful corrections using Google 2-grams. For these examples, Google 5-grams, 4-grams and 3-grams fail to generate any suggestion.

SUCCESSFUL CORRECTION:
··· raider or Zurn management making → taking [taking] shares out of ···
··· Silesia, said Solidarity advisor → adviser [adviser] Jacek Kuron.
SUCCESSFUL CORRECTION (in Second Phase):
Mr. Mutert votes → notes [notes] that investors have ···
FALSE POSITIVE CORRECTION:
··· relieved if Super Tuesday toughs → tours [coughs] up a front ···
··· be reserved about indiscriminate clays → ways [plays] in some groups ···
FALSE NEGATIVE:
A lot of the optimisms [optimists] were washed out ···
··· NMS rose (8.4 billion [million]) than fell ···
··· published its support of patent [patient]-funded research ···

For each error, our method returns either a suggestion (which is either correct11 or wrong12) or no suggestion13.

11 A returned suggestion which is correct is also known as a true positive.

12 A returned suggestion which is wrong is also known as a false positive.

13 Also known as a false negative. Generating no suggestion means that the method does not find any better suggestion than the observed word. In other words, the method thinks that the observed word is indeed a correct one.

Figure 1 shows the number of errors for which either a suggestion or no suggestion is generated, for different combinations of n-grams used. To give an example, using only 5-grams, each of 505 errors generates either a suggestion or no suggestion. It also means that in the next 5-4-gram combination, we only process 886 errors14 (i.e., 1391 − 505). Figure 1 validates the intuition behind using a combination of n-grams rather than only one order of n-grams (e.g., 5-grams) by showing that while a single 5-gram generates either a suggestion or no suggestion for only 505 errors, the 5-4-3-2-5'-4'-3'-2'-gram combination generates the same for all 1391 errors.

14 We use the result of the previous 5-grams in the 5-4-gram combination, thus only use 4-grams in this combination.


Figure 3: Precision, recall and F-measure for different combinations of n-grams used.

Figure 1: Number of errors where a suggestion or no suggestion is generated for different combinations of n-grams used. The apostrophe (') is used to denote the n-grams used in phase 2. x-y-···-z-gram means that we use x-grams, y-grams, ··· and z-grams.

Figure 2 breaks down the numbers shown in Figure 1 into true positives, false positives and false negatives. For example, using only 5-grams, we get 493 suggestions, 472 of which are correct and 21 incorrect, along with 12 cases of no suggestion. Figure 2 also shows that the 5-4-3-2-5'-4'-3'-2'-gram combination generates 1343 suggestions, 1222 of which are correct and 121 incorrect, along with 48 cases of no suggestion.

Figure 2: Number of true positives, false positives and false negatives for different combinations of n-grams used.

The performance is measured using Precision (P), Recall (R) and F-measure (F):

P = number of correct suggestions returned / number of suggestions returned

R = number of correct suggestions returned / total number of errors in the collection

F = 2PR / (P + R)
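As a worked example, for the full 5-4-3-2-5'-4'-3'-2'-gram combination reported above (1222 correct suggestions, 121 wrong suggestions, 48 cases of no suggestion, and 1391 errors in total), these definitions give:

P = 1222 / (1222 + 121) = 1222 / 1343 ≈ 0.91

R = 1222 / 1391 ≈ 0.88

F = (2 × 0.91 × 0.88) / (0.91 + 0.88) ≈ 0.89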

Figure 3 shows precision, recall and F-measure for different combinations of n-grams used. We get the highest precision (0.96) when using only 5-grams, which is expected because a 5-gram uses the maximum possible number of context words (four), and as a result the ratio between the number of correct suggestions returned and the number of suggestions returned is highest. But the recall at this level is very poor (only 0.34). Figure 3 demonstrates how recall gets better using different combinations of n-grams while keeping precision as high as possible. Going from 5-grams alone to the 5-4-3-2-gram combination, we get a significant improvement in recall, but after that (i.e., from the 5-4-3-2-gram combination to the 5-4-3-2-5'-4'-3'-2'-gram combination), we get only a 0.01 recall increase.

We cannot directly compare our results with the correction results from previous work, because in that work the correction was run on the results of the detection module, cumulating the errors, while our correction module ran on the correctly-flagged spelling errors. Still, we indirectly try to compare our results with the previous work. Table 5 shows our method's results on the described data set compared with the results for the trigram method of [2] and the lexical cohesion method of [1]. The data shown here for the trigram method are not from [2], but rather are later results following some corrections reported in [16]15.

15 The result (detection recall=0.544, detection precision=0.528, correction recall=0.491, correction precision=0.503) mentioned in [16] seems to have some inconsistency between correction recall and correction precision. Detection true positives (762) and detection false positives (681) can be calculated from detection precision and recall. Correction true positives and correction false positives are 688 and 755, respectively, given that the correction recall is 0.491. Thus, correction precision is 0.477.

Table 5: A comparison of recall, precision, and F-measure for three methods of malapropism detection and correction on the same data set.

                        Detection                 Correction
                        R      P      F           R      P      F
Lexical cohesion [1]    0.306  0.225  0.260       0.281  0.207  0.238
Trigrams [2]            0.544  0.528  0.536       0.491  0.477  0.484
Google n-grams          -      -      -           0.88   0.91   0.89

That the corrected result of [2] can detect 762 errors and thus correct 688 errors out of these 762 detected errors means that each of the correction precision, recall and F-measure on the detected errors is 0.9. It is obvious that the performance of correcting the rest of the undetected errors will not be the same as correcting the detected errors, because these errors are difficult to correct since they are difficult to detect in the first place. Still, the correction performance of our proposed method is comparable to the correction performance of the method that runs on the results of the detection module, cumulating the errors. Moreover, considering the fact16 that 85 of the 1391 malapropisms created were "unfair" in the sense that no automatic procedure could reasonably be expected to see the error [16], having a method that generates 1222 correct suggestions along with 121 wrong suggestions and 48 cases of no suggestion could be useful.

16 Though we did not consider this fact in the evaluation procedure.
5. Conclusions

Our purpose in this paper was the development of a high-quality correction module. The Google n-grams proved to be very useful in correcting real-word errors. When we tried with only 5-grams, the precision (0.96) was good, though the recall (0.34) was too low. Having sacrificed a bit of the precision score, our proposed combination-of-n-grams method achieves a very good recall (0.88) while maintaining the precision at 0.91. Our attempts to improve the correction recall while maintaining the precision as high as possible are helpful to the human correctors who post-edit the output of the real-word spell-checker. If there is no post-editing, at least more errors get corrected automatically. Our method could also correct misspelled words, not only malapropisms, without any modification. In future work, we plan to add a detection module and extend our method to allow for deleted or inserted words, and to find the corrected strings in the Google Web 1T n-grams. In this way we will be able to correct grammar errors too.

Acknowledgments

This work is funded by the Natural Sciences and Engineering Research Council of Canada. We want to thank Professor Graeme Hirst from the Department of Computer Science, University of Toronto, for providing the evaluation data set.

References

[1] G. Hirst and A. Budanitsky, "Correcting real-word spelling errors by restoring lexical cohesion," Natural Language Engineering, vol. 11, pp. 87-111, March 2005.

[2] L. A. Wilcox-O'Hearn, G. Hirst, and A. Budanitsky, "Real-word spelling correction with trigrams: A reconsideration of the Mays, Damerau, and Mercer model," in Proceedings, 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2008) (Lecture Notes in Computer Science 4919, Springer-Verlag) (A. Gelbukh, ed.), (Haifa), pp. 605-616, February 2008.

[3] J. Pedler, Computer Correction of Real-word Spelling Errors in Dyslexic Text. PhD thesis, Birkbeck, London University, 2007.

[4] K. Kukich, "Techniques for automatically correcting words in text," ACM Comput. Surv., vol. 24, no. 4, pp. 377-439, 1992.

[5] T. Brants and A. Franz, "Web 1T 5-gram corpus version 1.1," tech. rep., Google Research, 2006.

[6] G. Hirst and D. St-Onge, WordNet: An Electronic Lexical Database, ch. Lexical chains as representations of context for the detection and correction of malapropisms, pp. 305-332. Cambridge, MA: The MIT Press, 1998.

[7] A. R. Golding and D. Roth, "A winnow-based approach to context-sensitive spelling correction," Machine Learning, vol. 34, no. 1-3, pp. 107-130, 1999.

[8] A. J. Carlson, J. Rosen, and D. Roth, "Scaling up context-sensitive text correction," in Proceedings of the Thirteenth Conference on Innovative Applications of Artificial Intelligence, pp. 45-50, AAAI Press, 2001.

[9] E. Mays, F. J. Damerau, and R. L. Mercer, "Context based spelling correction," Information Processing and Management, vol. 27, no. 5, pp. 517-522, 1991.

[10] S. Verberne, "Context-sensitive spell checking based on word trigram probabilities," Master's thesis, University of Nijmegen, February-August 2002.

[11] L. Burnard, Reference Guide for the British National Corpus (World Edition), October 2000. www.natcorp.ox.ac.uk/docs/userManual/urg.pdf.

[12] L. Allison and T. Dix, "A bit-string longest-common-subsequence algorithm," Information Processing Letters, vol. 23, pp. 305-310, 1986.

[13] A. Islam and D. Inkpen, "Semantic text similarity using corpus-based word similarity and string similarity," ACM Transactions on Knowledge Discovery from Data, vol. 2, no. 2, pp. 1-25, 2008.

[14] G. Kondrak, "N-gram similarity and distance," in Proceedings of the 12th International Conference on String Processing and Information Retrieval, (Buenos Aires, Argentina), pp. 115-126, 2005.

[15] I. D. Melamed, "Bitext maps and alignment via pattern recognition," Computational Linguistics, vol. 25, no. 1, pp. 107-130, 1999.

[16] G. Hirst, "An evaluation of the contextual spelling checker of Microsoft Office Word 2007," January 2008. http://ftp.cs.toronto.edu/pub/gh/Hirst-2008-Word.pdf.