
Real-Word Spelling Correction using Google Web 1T n-gram with Backoff

Aminul ISLAM
Department of Computer Science, SITE
University of Ottawa
Ottawa, ON, Canada
[email protected]

Diana INKPEN
Department of Computer Science, SITE
University of Ottawa
Ottawa, ON, Canada
[email protected]

Abstract: We present a method for correcting real-word spelling errors using the Google Web 1T n-gram data set and a normalized and modified version of the Longest Common Subsequence (LCS) string matching algorithm. Our method is focused mainly on how to improve the correction recall (the fraction of errors corrected) while keeping the correction precision (the fraction of suggestions that are correct) as high as possible. Evaluation results on a standard data set show that our method performs very well.

Keywords: Real-word; spelling correction; Google Web 1T; n-gram

1. Introduction

Real-word spelling errors are words in a text that occur when a user mistakenly types a correctly spelled word when another was intended. Errors of this type may be caused by the writer's ignorance of the correct spelling of the intended word or by typing mistakes. Such errors generally go unnoticed by most spellcheckers, as they deal with words in isolation, accepting them as correct if they are found in the dictionary and flagging them as errors if they are not. This approach would be sufficient to correct the non-word error myss in "It doesn't know what the myss is all about." but not the real-word error muss in "It doesn't know what the muss is all about." To correct the latter, the spell-checker needs to make use of the surrounding context, in this case to recognise that fuss is more likely to occur than muss in the context of all about. Ironically, errors of this type may even be caused by spelling checkers in the correction of non-word spelling errors, when the auto-correct feature in some word-processing software silently changes a non-word to the wrong real word [1], and sometimes when correcting a flagged error, the user accidentally makes a wrong selection from the choices offered [2]. An extensive review of real-word spelling correction is given in [3, 1], and the problem of spelling correction more generally is reviewed in [4].

In this paper, we present a method for correcting real-word spelling errors using the Google Web 1T n-gram data set [5]1 and a normalized and modified version of the Longest Common Subsequence (LCS) string matching algorithm (details are in Section 3.1). Our intention is to focus on how to improve the correction recall while maintaining the correction precision as high as possible. The reason behind this intention is that if the recall for any method is around 0.5, the method fails to correct around 50 percent of the errors. As a result, we cannot completely rely on such methods; some kind of human intervention or suggestion is needed to correct the rest of the uncorrected errors. Thus, if we have a method that can correct almost 90 percent of the errors, even generating some extra candidates that are incorrect is more helpful to the human.

This paper is organized as follows: Section 2 presents a brief overview of the related work. Our proposed method is described in Section 3. Evaluation and experimental results are discussed in Section 4. We conclude in Section 5.

1 Details of the Google Web 1T data set can be found at www.ldc.upenn.edu/Catalog/docs/LDC2006T13/readme.txt.

2. Related Work

Work on real-word spelling correction can roughly be classified into two basic categories: methods based on semantic information or human-made lexical resources, and methods based on machine learning or probability information. Our proposed method falls into the latter category.

2.1 Methods Based on Semantic Information

The 'semantic information' approach, first proposed by [6] and later developed by [1], detected semantic anomalies but was not restricted to checking words from predefined confusion sets. This approach was based on the observation that the words that a writer intends are generally semantically related to their surrounding words, whereas some types of real-word spelling errors are not.

2.2 Methods Based on Machine Learning

Machine learning methods treat real-word spelling correction as a lexical disambiguation task, and confusion sets are used to model the ambiguity between words. Normally, the machine learning and statistical approaches rely on pre-defined confusion sets, which are sets (usually pairs) of commonly confounded words, such as {their, there, they're} and {principle, principal}. [7], an example of a machine-learning method, combined the Winnow algorithm with weighted-majority voting, using nearby and adjacent words as features. Another example of a machine-learning method is that of [8].

2.3 Methods Based on Probability Information

[9] proposed a statistical method using word-trigram probabilities for detecting and correcting real-word errors without requiring predefined confusion sets. In this method, if the trigram-derived probability of an observed sentence is lower than that of any sentence obtained by replacing one of the words with a spelling variation, then we hypothesize that the original is an error and the variation is what the user intended.

[2] analyze the advantages and limitations of [9]'s method and present a new evaluation of the algorithm, designed so that the results can be compared with those of other methods; they then construct and evaluate some variations of the algorithm that use fixed-length windows. They consider a variation of the method that optimizes over relatively short, fixed-length windows instead of over a whole sentence (except in the special case when the sentence is smaller than the window), while respecting sentence boundaries as natural breakpoints. To check the spelling of a span of d words requires a window of length d+4 to accommodate all the trigrams that overlap with the words in the span. The smallest possible window is therefore 5 words long, which uses 3 trigrams to optimize only its middle word. They assume that the sentence is bracketed by two BoS and two EoS markers (to accommodate trigrams involving the first two and last two words of the sentence). The window starts with its left-hand edge at the first BoS marker, and [9]'s method is run on the words covered by the trigrams that it contains; the window then moves d words to the right and the process repeats until all the words in the sentence have been checked. As [9]'s algorithm is run separately in each window, potentially changing a word in each, [2]'s method as a side effect also permits multiple corrections in a single sentence.

[10] proposed a trigram-based method for real-word errors without explicitly using probabilities or even localizing the possible error to a specific word. This method simply assumes that any word trigram in the text that is attested in the British National Corpus [11] is correct, and any unattested trigram is a likely error. When an unattested trigram is observed, the method then tries the spelling variations of all words in the trigram to find attested trigrams to present to the user as possible corrections. The evaluation of this method was carried out on only 7100 words of the Wall Street Journal corpus, with 31 errors introduced (i.e., approximately one error in every 200 words), obtaining a recall of 0.33 for correction, a precision of 0.05, and an F-measure of 0.086.

3. Proposed Method

The proposed method first tries to determine some probable candidates and then finds the best one among the candidates. We consider a string similarity function and a frequency value function in our method. The following sections present a detailed description of each of these functions, followed by the procedure to determine some probable candidates along with the procedure to find the best candidate.

3.1 String Similarity between Two Strings

We use the longest common subsequence (LCS) [12] measure with some normalization and small modifications for our string similarity measure. We use the same three different modified versions of LCS that [13] used, along with another modified version of LCS, and then take a weighted sum of these2. [14] showed that edit distance and the length of the longest common subsequence are special cases of n-gram distance and similarity, respectively. [15] normalized LCS by dividing the length of the longest common subsequence by the length of the longer string and called it the longest common subsequence ratio (LCSR). But LCSR does not take into account the length of the shorter string, which sometimes has a significant impact on the similarity score.

[13] normalized the longest common subsequence so that it takes into account the lengths of both the shorter and the longer string and called it the normalized longest common subsequence (NLCS). We normalize NLCS in the following way, as it gives a better similarity value and is more computationally efficient:

v1 = NLCS(si, sj) = 2 × len(LCS(si, sj)) / (len(si) + len(sj))    (1)

While in classical LCS the common subsequence need not be consecutive, in spelling correction a consecutive common subsequence is important for a high degree of matching. [13] used the maximal consecutive longest common subsequence starting at character 1, MCLCS1, and the maximal consecutive longest common subsequence starting at any character n, MCLCSn. MCLCS1 takes two strings as input and returns the shorter string or the maximal consecutive portions of the shorter string that consecutively match with the longer string, where matching must start from the first character (character 1) for both strings. MCLCSn takes two strings as input and returns the shorter string or the maximal consecutive portions of the shorter string that consecutively match with the longer string, where matching may start from any character (character n) for both strings. They normalized MCLCS1 and MCLCSn and called them normalized MCLCS1 (NMCLCS1) and normalized MCLCSn (NMCLCSn), respectively. Similarly, we normalize NMCLCS1 and NMCLCSn in the following way:

v2 = NMCLCS1(si, sj) = 2 × len(MCLCS1(si, sj)) / (len(si) + len(sj))    (2)

v3 = NMCLCSn(si, sj) = 2 × len(MCLCSn(si, sj)) / (len(si) + len(sj))    (3)

[13] did not consider the consecutive common subsequence ending at the last character, though MCLCSn sometimes covers this, but not always. We argue that the consecutive common subsequence ending at the last character is as significant as the consecutive common subsequence starting at the first character. So, we introduce the maximal consecutive longest common subsequence ending at the last character, MCLCSz (Algorithm 1). Algorithm 1 takes two strings as input and returns the shorter string or the maximal consecutive portions of the shorter string that consecutively match with the longer string, where matching must end at the last character for both strings. We normalize MCLCSz and call it normalized MCLCSz (NMCLCSz):

v4 = NMCLCSz(si, sj) = 2 × len(MCLCSz(si, sj)) / (len(si) + len(sj))    (4)

Algorithm 1: MCLCSz (Maximal Consecutive LCS ending at the last character)
input : si, sj            /* si and sj are input strings, where |si| <= |sj| */
output: str               /* str is the maximal consecutive LCS ending at the last character */
1   str <- NULL
2   c <- 1
3   while |si| >= c do
4       x <- SubStr(si, -c, 1)   /* returns the cth character of si from the end */
5       y <- SubStr(sj, -c, 1)   /* returns the cth character of sj from the end */
6       if x = y then
7           str <- SubStr(si, -c, c)
8       else
9           return str
10      end
11      increment c
12  end

We take the weighted sum of the individual values v1, v2, v3, and v4 from equations (1), (2), (3) and (4), respectively, to determine the string similarity score, where α1, α2, α3, α4 are weights and α1 + α2 + α3 + α4 = 1. Therefore, the similarity of the two strings, S ∈ [0, 1], is:

S(si, sj) = α1v1 + α2v2 + α3v3 + α4v4    (5)

We heuristically set equal weights for most of our experiments3. Theoretically, v3 ≥ v2 and v3 ≥ v4.

2 [13] use modified versions because in their experiments they obtained better results (precision and recall) for schema matching on a sample of data than when using the original LCS or other string similarity measures.

3 We use equal weights in several places in this paper in order to keep the system unsupervised. If development data were available, we could adjust the weights.
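A minimal Python sketch of this similarity measure is given below. The function names, the plain dynamic-programming LCS (rather than the bit-string algorithm of [12]), and the default equal weights are our own illustrative choices; the sketch only shows how equations (1)-(5) combine.

def lcs_len(s, t):
    """Length of the (non-consecutive) longest common subsequence."""
    m, n = len(s), len(t)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if s[i - 1] == t[j - 1] \
                       else max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def mclcs1_len(s, t):
    """MCLCS1: maximal consecutive match starting at character 1 (common prefix length)."""
    k = 0
    while k < min(len(s), len(t)) and s[k] == t[k]:
        k += 1
    return k

def mclcsn_len(s, t):
    """MCLCSn: maximal consecutive match starting at any character (longest common substring)."""
    best = 0
    dp = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            if s[i - 1] == t[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
                best = max(best, dp[i][j])
    return best

def mclcsz_len(s, t):
    """MCLCSz (Algorithm 1): maximal consecutive match ending at the last character
    (common suffix length)."""
    k = 0
    while k < min(len(s), len(t)) and s[-1 - k] == t[-1 - k]:
        k += 1
    return k

def string_similarity(si, sj, weights=(0.25, 0.25, 0.25, 0.25)):
    """Equation (5): weighted sum of the normalized values v1..v4, with equal weights."""
    norm = lambda length: 2.0 * length / (len(si) + len(sj))
    v = (norm(lcs_len(si, sj)), norm(mclcs1_len(si, sj)),
         norm(mclcsn_len(si, sj)), norm(mclcsz_len(si, sj)))
    return sum(a * x for a, x in zip(weights, v))

For example, string_similarity("retort", "report") returns about 0.54, above the 0.5 threshold used later when filtering candidate words.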
3.2 Normalized Frequency Value

We determine the normalized frequency value of each candidate word for a single position with respect to all other candidates for the same position. If we find n replacements of a word wi, which are {wi1, wi2, ···, wij, ···, win}, and their frequencies {fi1, fi2, ···, fij, ···, fin}, where fij is the frequency of an n-gram (where n ∈ {5, 4, 3, 2} and the candidate word wij is a member of the n-gram), then we determine the normalized frequency value of any candidate word wij as the frequency of the n-gram containing wij over the maximum frequency among all the candidate words for that position:

F(wij) = fij / max(fi1, fi2, ···, fij, ···, fin)    (6)

3.3 Determining Candidate Words (Phase 1)

Our task is to correct real-word spelling errors in an input text using the Google Web 1T data set. First, we use the Google 5-gram data set to find candidate words for the word having the spelling error. If the 5-gram data set fails to generate at least one candidate word, then we move forward to the 4-gram data set, the 3-gram data set, or the 2-gram data set, whenever the preceding data set fails to generate at least one candidate word. Let us consider an input text W which after tokenization4 has m words, i.e., W = {w1, w2, ..., wi, ..., wm}, where wi (i > 1) is the word having the spelling error5. First, we discuss how we find the candidates for the word marked as a spelling error, and then we discuss the procedure to find the most relevant single candidate from among several candidates.

4 We need to tokenize the input sentence so that the n-grams formed from the tokens returned after tokenization are consistent with the Google n-grams. The input sentence is tokenized in a manner similar to the tokenization of the Wall Street Journal portion of the Penn Treebank. Notable exceptions include the following:
- Hyphenated words are usually separated, and hyphenated numbers usually form one token.
- Sequences of numbers separated by slashes (e.g., in dates) form one token.
- Sequences that look like urls or email addresses form one token.

5 This means that we do not consider the first word as the word with the spelling error, though our method could have worked for the first word too. We did not do it here for efficiency reasons. Google n-grams are sorted based on the first word, then the second word, and so on. Based on this sorting, all Google 5-grams, 4-grams, 3-grams and 2-grams are stored in 117, 131, 97 and 31 different files, respectively. For example, all 117 Google 5-gram files could have been needed to access a single word, instead of accessing just one 5-gram file as we do for any other word. This is because when the first word needs to be corrected, it might be in any file among those 117 5-gram files.
3.3.1 Determining candidate words using the 5-gram data set

We use the following steps (a simplified code sketch of the core of this procedure is given after the footnotes below):

1. We define the term cut off frequency for word wi as the frequency of the 5-gram wi−4 wi−3 wi−2 wi−1 wi (where m > i > 5) in the Google Web 1T 5-grams, if the said 5-gram exists. Otherwise, we set the cut off frequency of wi as 0. The intuition behind using the cut off frequency is the fact that, if the word is misspelled, then the correct one should have a higher frequency than the misspelled one in the context. Thus, using the cut off frequency, we isolate a large number of candidates that we do not need to process.

2. We find all the 5-grams (where only wi is changed while wi−4, wi−3, wi−2 and wi−1 are unchanged), if any, having frequency greater than the cut off frequency of wi (determined in step 1). Let us consider that we find n replacements of wi, which are R1 = {wi1, wi2, ···, win}, and their frequencies F1 = {fi1, fi2, ···, fin}, where fij is the frequency of the 5-gram wi−4 wi−3 wi−2 wi−1 wij. If there is no such 5-gram (having frequency above the cut off frequency), our task is two-fold: we set matched←1 if there is at least one 5-gram, and we jump to step 5, 6 or 7, whichever is yet to be visited.

3. For each wij ∈ R1, we calculate the string similarity between wij and wi using equation (5) and then assign a weight using the following equation (7), only to the words that return a string similarity value greater than 0.5:

weight(wi, wij) = βS(wi, wij) + (1 − β)F(wij)    (7)

Equation (7) is used to ensure a balanced weight between the string similarity function and the probability function, where β refers to how much importance we give to the string similarity function with respect to the frequency function6.

4. We sort the words found in step 3 that were given weights, if any, in descending order by the assigned weights and keep only a fixed number7 of words as the candidate words8.

5. If this step has not been visited yet, then we follow step 1 to step 4 with the 5-gram wi−3 wi−2 wi−1 wi wi+1 if m − 1 > i > 4. Otherwise, we go to the next step.

6. If this step has not been visited yet, then we follow step 1 to step 4 with the 5-gram wi−2 wi−1 wi wi+1 wi+2 if m − 2 > i > 3. Otherwise, we go to the next step.

7. If this step has not been visited yet, then we follow step 1 to step 4 with the 5-gram wi−1 wi wi+1 wi+2 wi+3 if m − 3 > i > 2. Otherwise, we go to the next step.

8. If we find exactly one word in step 4, then we return that word as the suggestion word and exit.

9. If we find more than one word in step 4, we go to Section 3.5. Otherwise, if matched=1, we return no suggestion and exit.

6 We give more importance to the string similarity function than to the frequency value function throughout the section on determining candidate words, in order to have more candidate words, so that the chance of including the target word in the set of candidate words gets higher. For this reason, we heuristically set β=0.85 in equation (7) instead of setting β=0.5.

7 For our experiment, we heuristically set this number to 10. If we lower this number (say to 2 or 3), then there is a chance that sometimes the most appropriate candidate (the solution word) fails to be included in the candidate list.

8 Sometimes the top candidate word might be either a plural form or a past participle form of the original word, or it might even be a high-frequency function word (e.g., the). We omit these types of words from candidacy.
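The sketch below illustrates steps 1-4 for the single context window wi−4 ... wi−1. It is only a sketch: it assumes an in-memory dictionary counts5 mapping 5-gram tuples to frequencies (the actual system reads the Google Web 1T files), reuses string_similarity from the earlier sketch, and omits the matched bookkeeping and the alternative windows of steps 5-7.

from typing import Dict, List, Tuple

def candidates_5gram(words: List[str], i: int,
                     counts5: Dict[Tuple[str, ...], int],
                     beta: float = 0.85, keep: int = 10) -> List[Tuple[str, float]]:
    """Steps 1-4 for the context w[i-4..i-1] w[i]: cut off frequency, replacement
    lookup, similarity filter (> 0.5), weighting by equation (7), and truncation."""
    context = tuple(words[i - 4:i])
    cutoff = counts5.get(context + (words[i],), 0)                # step 1
    # step 2: 5-grams sharing the context, above the cut off frequency
    repl = {g[-1]: f for g, f in counts5.items()
            if g[:-1] == context and f > cutoff and g[-1] != words[i]}
    if not repl:
        return []
    max_f = max(repl.values())                                    # for equation (6)
    weighted = []
    for w, f in repl.items():                                     # step 3
        sim = string_similarity(words[i], w)                      # equation (5)
        if sim > 0.5:
            weighted.append((w, beta * sim + (1 - beta) * f / max_f))
    weighted.sort(key=lambda x: x[1], reverse=True)               # step 4
    return weighted[:keep]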

3.3.2 Determining candidate words using the 4-gram data set

We use the following steps:

1. We define the term cut off frequency for word wi as the frequency of the 4-gram wi−3 wi−2 wi−1 wi (where m > i > 4) in the Google Web 1T 4-grams, if the said 4-gram exists. Otherwise, we set the cut off frequency of wi as 0.

2. We find all the 4-grams (where only wi is changed while wi−3, wi−2 and wi−1 are unchanged), if any, having frequency greater than the cut off frequency of wi (determined in step 1). Let us consider that we find n replacements of wi, which are R1 = {wi1, wi2, ···, win}, and their frequencies F1 = {fi1, fi2, ···, fin}, where fij is the frequency of the 4-gram wi−3 wi−2 wi−1 wij. If there is no such 4-gram, our task is two-fold: we set matched←1 if there is at least one 4-gram having any frequency, and we jump to step 5 or 6, whichever is yet to be visited.

3. For each wij ∈ R1, we calculate the string similarity between wij and wi using equation (5) and then assign a weight using equation (7), only to the words that return a string similarity value greater than 0.5.

4. We sort the words found in step 3 that were given weights, if any, in descending order by the assigned weights and keep only a fixed number of words as the candidate words.

5. If this step has not been visited yet, then we follow step 1 to step 4 with the 4-gram wi−2 wi−1 wi wi+1 if m − 1 > i > 3. Otherwise, we go to the next step.

6. If this step has not been visited yet, then we follow step 1 to step 4 with the 4-gram wi−1 wi wi+1 wi+2 if m − 2 > i > 2. Otherwise, we go to the next step.

7. If we find exactly one word in step 4, then we return that word as the suggestion word and exit.

8. If we find more than one word in step 4, we go to Section 3.5. Otherwise, if matched=1, we return no suggestion and exit.
3.3.3 Determining candidate words using the 3-gram data set

We use the following steps:

1. We define the term cut off frequency for word wi as the frequency of the 3-gram wi−2 wi−1 wi (where m > i > 3) in the Google Web 1T 3-grams, if the said 3-gram exists. Otherwise, we set the cut off frequency of wi as 0.

2. We find all the 3-grams (where only wi is changed while wi−2 and wi−1 are unchanged), if any, having frequency greater than the cut off frequency of wi (determined in step 1). Let us consider that we find n replacements of wi, which are R1 = {wi1, wi2, ···, win}, and their frequencies F1 = {fi1, fi2, ···, fin}, where fij is the frequency of the 3-gram wi−2 wi−1 wij. If there is no such 3-gram, our task is two-fold: we set matched←1 if there is at least one 3-gram having any frequency, and we jump to step 5, if it is yet to be visited.

3. For each wij ∈ R1, we calculate the string similarity between wij and wi using equation (5) and then assign a weight using equation (7), only to the words that return a string similarity value greater than 0.5.

4. We sort the words found in step 3 that were given weights, if any, in descending order by the assigned weights and keep only a fixed number of words as the candidate words.

5. If this step has not been visited yet, then we follow step 1 to step 4 with the 3-gram wi−1 wi wi+1 if m − 1 > i > 2. Otherwise, we go to the next step.

6. If we find exactly one word in step 4, then we return that word as the suggestion word and exit.

7. If we find more than one word in step 4, we go to Section 3.5. Otherwise, if matched=1, we return no suggestion and exit.

3.3.4 Determining candidate words using the 2-gram data set

We use the following steps (the overall backoff order over the n-gram data sets is sketched in the code after this list):

1. We define the term cut off frequency for word wi as the frequency of the 2-gram wi−1 wi (where m > i > 2) in the Google Web 1T 2-grams, if the said 2-gram exists. Otherwise, we set the cut off frequency of wi as 0.

2. We find all the 2-grams (where only wi is changed while wi−1 is unchanged) having frequency greater than the cut off frequency of wi (determined in step 1). Let us consider that we find n replacements of wi, which are R1 = {wi1, wi2, ···, win}, and their frequencies F1 = {fi1, fi2, ···, fin}, where fij is the frequency of the 2-gram wi−1 wij. If there is no such 2-gram, we set matched←1 if there is at least one 2-gram having any frequency.

3. For each wij ∈ R1, we calculate the string similarity between wij and wi using equation (5) and then assign a weight using equation (7), only to the words that return a string similarity value greater than 0.5.

4. We sort the words found in step 3 that were given weights, if any, in descending order by the assigned weights and keep only a fixed number of words as the candidate words.

5. If we find exactly one word in step 4, then we return that word as the suggestion word and exit.

6. If we find more than one word in step 4, we go to Section 3.5. Otherwise, if matched=1, we return no suggestion and exit. Otherwise, we proceed to phase 2 (Section 3.4).
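The following sketch shows only the phase-1 backoff order, assuming candidate generators such as candidates_5gram above (and analogous, hypothetical 4-, 3- and 2-gram versions) are available; the matched flag that distinguishes "return no suggestion" from "back off to the next data set", and phase 2, are omitted for brevity.

def generate_candidates(words, i, generators):
    """Try the 5-gram, 4-gram, 3-gram and 2-gram generators in that order (phase 1)
    and stop at the first data set that yields at least one candidate."""
    for generate in generators:
        candidates = generate(words, i)
        if candidates:
            return candidates
    return []   # in the full method, fall through to phase 2 (Section 3.4)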

3.4 Determining Candidate Words (Phase 2)

The question of why we use phase 2 is best understood from the example "··· by releasing the WPPSS retort." (Table 1), where retort is the observed word that needs to be corrected. There is no 5-gram in which retort can be changed under the condition that the string similarity between retort and wij is at least 0.5 while "by releasing the WPPSS" is kept unchanged. Similarly, there is no such 4-gram, 3-gram or 2-gram in which retort can be changed under the same condition while keeping "releasing the WPPSS", "the WPPSS" and "WPPSS" unchanged, respectively. The reason for the unavailability of such n-grams is that "WPPSS" is not a very common word in the Google Web 1T data set.

Solving this issue is straightforward. We follow phase 1 with some small changes: instead of trying to find all the n-grams (n ∈ {5, 4, 3, 2}) where only wi is changed while keeping all of {···, wi−2, wi−1} unchanged, we try to find all the n-grams (n ∈ {5, 4, 3, 2}) where wi, as well as any but the first member of {···, wi−2, wi−1}, are changed, while keeping the rest of {···, wi−2, wi−1} unchanged.
3.5 Determining the Suggestion Word

We use this section only if we have more than one candidate word found in Section 3.3 or Section 3.4. Let us consider that we find n candidate words of wi in Section 3.3 or Section 3.4, which are {wi1, wi2, ···, wij, ···, win}. For each wij, we use the string similarity value between wij and wi (already calculated using equation (5)) and the normalized frequency value of wij (already calculated using equation (6)), and then calculate the weight value using equation (7), setting β = 0.5. We return the word having the maximum weight value as the target suggestion word:

Suggestion Word = argmax over wij of weight(wi, wij)    (8)
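A minimal sketch of this final step, under the assumption that each surviving candidate carries its similarity value and its n-gram frequency (the triple layout and function name are ours):

def suggestion_word(candidates, beta=0.5):
    """Equation (8): candidates is a list of (word, similarity, frequency) triples
    for one position; frequencies are normalized by their maximum (equation (6))."""
    max_f = max(f for _, _, f in candidates)
    return max(candidates,
               key=lambda c: beta * c[1] + (1 - beta) * (c[2] / max_f))[0]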

4. Evaluation and Experimental Results

We used as test data the same data that [2] used in their evaluation of [9]'s method, which in turn was a replication of the data used by [6] and [1] to evaluate their methods. The data consisted of 500 articles (approximately 300,000 words) from the 1987-89 Wall Street Journal corpus, with all headings, identifiers, and so on removed; that is, just a long stream of text. It is assumed that this data contains no errors; that is, the Wall Street Journal contains no malapropisms or other typos. In fact, a few typos (both non-word and real-word) were noticed during the evaluation, but they were small in number compared to the size of the text.

Malapropisms were randomly induced into this text at a frequency of approximately one word in 200. Specifically, any word whose base form was listed as a noun in WordNet (but regardless of whether it was used as a noun in the text; there was no syntactic analysis) was potentially replaced by any spelling variation found in the lexicon of the ispell spelling checker9. A spelling variation was defined as any word with an edit distance of 1 from the original word; that is, any single-character insertion, deletion, or substitution, or the transposition of two characters, that results in another real word. Thus, none of the induced malapropisms were derived from closed-class words, and none were formed by the insertion or deletion of an apostrophe or by splitting a word. Though [2] mentioned that the data contained 1402 inserted malapropisms, there were only 1391 malapropisms.

9 Ispell is a fast screen-oriented spelling checker that shows you your errors in the context of the original file and suggests possible corrections when it can figure them out.

Because it had earlier been used for evaluating [9]'s trigram method, which operates at the sentence level, the data set had been divided into three parts, without regard for article boundaries or text coherence: sentences into which no malapropism had been induced; the original versions of the sentences that received malapropisms; and the malapropized sentences. In addition, all instances of numbers of various kinds had been replaced by tags. Actual (random) numbers or values were restored for these tags. Some spacing anomalies around punctuation marks were corrected. A detailed description of this data can be found in [16, 2].

Some examples of successful and unsuccessful corrections, using Google 5-grams, are shown in Table 1. Some of the malapropisms created were "unfair" in the sense that no automatic procedure could reasonably be expected to see the error. The canonical case is the substitution of million for billion, or vice versa10; another is employee for employer, or vice versa, in many (but not all) contexts. In some cases, the substitution was merely a legitimate spelling variation of the same word (e.g., labour for labor). Table 2 shows some examples of successful and unsuccessful corrections using Google 4-grams where Google 5-grams fail to generate any suggestion. In Table 3 and Table 4, some examples of successful and unsuccessful corrections using Google 3-grams (where Google 5-grams and 4-grams fail to generate any suggestion) and using Google 2-grams (where Google 5-grams, 4-grams and 3-grams fail to generate any suggestion) are shown, respectively.

10 One such example is shown in the FALSE NEGATIVE section of Table 1. There are 40 malapropisms in the data set related to million/billion.

Table 1: Examples of successful and unsuccessful corrections using Google 5-grams. Italics indicate the observed word, the arrow indicates the correction, and square brackets indicate the intended word.

SUCCESSFUL CORRECTION:
··· chance to mend his fencers → fences [fences] with Mr. Jefferies ···
··· employees is the largest employee → employer [employer] in Europe and ···
SUCCESSFUL CORRECTION (in Second Phase):
··· by releasing the WPPSS retort → report [report].
··· tests comparing its potpourri covert → cover [cover] with the traditional ···
FALSE POSITIVE CORRECTION:
··· can almost see the firm → fire [farm] issue receding.
··· the Senate to support aim → him [aid] for the Contras ···
FALSE NEGATIVE:
··· I trust that the contract [contrast] between the American ···
··· as much as billion [million].

Table 2: Examples of successful and unsuccessful corrections using Google 4-grams. For these examples, Google 5-grams fail to generate any suggestion.

SUCCESSFUL CORRECTION:
··· one of a corporate raiser → raider [raider]'s five most ···
··· of rumors about insider tracing → trading [trading] in Pillsbury options ···
SUCCESSFUL CORRECTION (in Second Phase):
··· , for the Belzberg brothels → brothers [brothers]' First City ···
··· typical of Iowa's food → mood [mood] a month before ···
FALSE POSITIVE CORRECTION:
··· I'm uncomfortable tacking → talking [taking] a lot of ···
··· to approve a formal teat → test [text] of their proposed ···
FALSE NEGATIVE:
··· aimed at clearing out overstuffing [overstaffing] left by previous ···

Table 3: Examples of successful and unsuccessful corrections using Google 3-grams. For these examples, Google 5-grams and 4-grams fail to generate any suggestion.

SUCCESSFUL CORRECTION:
··· Disappointment turned to dire traits → straits [straits] when Vestron's ···
··· pay million in destitution → restitution [restitution] related to contract ···
SUCCESSFUL CORRECTION (in Second Phase):
··· its Ally & Gargano unity → unit [unit] settled their suit ···
··· 23 5/8 after rising 23 3/8 prints → points [points] Tuesday.
FALSE POSITIVE CORRECTION:
··· working on improving his lent → talent [left].
··· company feels the 23 mate → rate [date] is for the ···
FALSE NEGATIVE:
··· town saloon after the battle [cattle] roundup.
··· be this rock-firm [film] orgy, an ···

Table 4: Examples of successful and unsuccessful corrections using Google 2-grams. For these examples, Google 5-grams, 4-grams and 3-grams fail to generate any suggestion.

SUCCESSFUL CORRECTION:
··· raider or Zurn management making → taking [taking] shares out of ···
··· Silesia, said Solidarity advisor → adviser [adviser] Jacek Kuron.
SUCCESSFUL CORRECTION (in Second Phase):
Mr. Mutert votes → notes [notes] that investors have ···
FALSE POSITIVE CORRECTION:
··· relieved if Super Tuesday toughs → tours [coughs] up a front ···
··· be reserved about indiscriminate clays → ways [plays] in some groups ···
FALSE NEGATIVE:
A lot of the optimisms [optimists] were washed out ···
··· NMS rose (8.4 billion [million]) than fell ···
··· published its support of patent [patient]-funded research ···

For each error, our method returns either a suggestion (which is either correct11 or wrong12) or no suggestion13.

11 A returned suggestion which is correct is also known as a true positive.

12 A returned suggestion which is wrong is also known as a false positive.

13 Also known as a false negative. Generating no suggestion means that the method does not find any better suggestion than the observed word. In other words, the method thinks that the observed word is indeed a correct one.

Figure 1 shows the number of errors for which either a suggestion or no suggestion is generated, for different combinations of n-grams used. To give an example, using only 5-grams, each of 505 errors generates either a suggestion or no suggestion. It also means that in the next 5-4-gram combination, we only process 886 errors14 (i.e., 1391 − 505). Figure 1 validates the intuition behind using a combination of n-grams rather than only one order of n-grams (e.g., 5-grams) by showing that while a single 5-gram generates either a suggestion or no suggestion for only 505 errors, the 5-4-3-2-5'-4'-3'-2'-gram combination generates the same for all 1391 errors.

14 We use the result of the previous 5-grams in the 5-4-gram combination, thus only use 4-grams in this combination.


Figure 3: Precision, recall and F-measure for different combinations of n-grams used.

Figure 1: Number of errors where a suggestion or no suggestion is generated for different combinations of n-grams used. The apostrophe (') is used to denote the n-grams used in phase 2. x-y-···-z-gram means that we use x-grams, y-grams, ··· and z-grams.

Figure 2 breaks down the numbers shown in Figure 1 into true positives, false positives and false negatives. For example, using only 5-grams, we get 493 suggestions, 472 of which are correct and 21 incorrect, along with 12 cases of no suggestion. Figure 2 also shows that the 5-4-3-2-5'-4'-3'-2'-gram combination generates 1343 suggestions, 1222 of which are correct and 121 incorrect, along with 48 cases of no suggestion.

Figure 2: Number of true positives, false positives and false negatives for different combinations of n-grams used.

The performance is measured using Precision (P), Recall (R) and F-measure (F):

P = number of correct suggestions returned / number of suggestions returned

R = number of correct suggestions returned / total number of errors in the collection

F = 2PR / (P + R)
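As a worked example, for the full 5-4-3-2-5'-4'-3'-2'-gram combination reported above (1222 correct suggestions, 121 wrong suggestions, 48 cases of no suggestion, and 1391 errors in total), these definitions give:

P = 1222 / (1222 + 121) = 1222 / 1343 ≈ 0.91

R = 1222 / 1391 ≈ 0.88

F = (2 × 0.91 × 0.88) / (0.91 + 0.88) ≈ 0.89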

Figure 3 shows precision, recall and F-measure for different combinations of n-grams used. We get the highest precision (0.96) when using only 5-grams, which is expected because a 5-gram uses the maximum possible number of context words (four), and as a result the ratio between the number of correct suggestions returned and the number of suggestions returned is highest. But the recall at this level is very poor (only 0.34). Figure 3 demonstrates how recall gets better using different combinations of n-grams while keeping precision as high as possible. Going from 5-grams alone to the 5-4-3-2-gram combination, we get a significant improvement in recall, but after that (i.e., from the 5-4-3-2-gram combination to the 5-4-3-2-5'-4'-3'-2'-gram combination), we get only a 0.01 recall increase.

We cannot directly compare our results with the correction results from previous work, because in that work the correction was run on the results of the detection module, cumulating the errors, while our correction module ran on the correctly-flagged spelling errors. Still, we indirectly try to compare our results with the previous work. Table 5 shows our method's results on the described data set compared with the results for the trigram method of [2] and the lexical cohesion method of [1]. The data shown here for the trigram method are not from [2], but rather are later results following some corrections reported in [16]15.

15 The result (detection recall=0.544, detection precision=0.528, correction recall=0.491, correction precision=0.503) mentioned in [16] seems to have some inconsistency between correction recall and correction precision. Detection true positives (762) and detection false positives (681) can be calculated from detection precision and recall. Correction true positives and correction false positives are 688 and 755, respectively, given that the correction recall is 0.491. Thus, correction precision is 0.477.

Table 5: A comparison of recall, precision, and F-measure for three methods of malapropism detection and correction on the same data set.

                        Detection                 Correction
                        R      P      F           R      P      F
Lexical cohesion [1]    0.306  0.225  0.260       0.281  0.207  0.238
Trigrams [2]            0.544  0.528  0.536       0.491  0.477  0.484
Google n-grams          -      -      -           0.88   0.91   0.89

That the corrected result of [2] can detect 762 errors and thus correct 688 errors out of these 762 detected errors means that each of the correction precision, recall and F-measure on the detected errors is 0.9. It is obvious that the performance of correcting the rest of the undetected errors will not be the same as correcting the detected errors, because these errors are difficult to correct since they are difficult to detect in the first place. Still, the correction performance of our proposed method is comparable to the correction performance of the method that runs on the results of the detection module, cumulating the errors. Moreover, considering the fact16 that 85 of the 1391 malapropisms created were "unfair" in the sense that no automatic procedure could reasonably be expected to see the error [16], having a method that generates 1222 correct suggestions along with 121 wrong suggestions and 48 cases of no suggestion could be useful.

16 Though we did not consider this fact in the evaluation procedure.
5. Conclusions

Our purpose in this paper was the development of a high-quality correction module. The Google n-grams proved to be very useful in correcting real-word errors. When we tried with only 5-grams, the precision (0.96) was good, though the recall (0.34) was too low. Having sacrificed a bit of the precision score, our proposed combination-of-n-grams method achieves a very good recall (0.88) while maintaining the precision at 0.91. Our attempts to improve the correction recall while maintaining the precision as high as possible are helpful to the human correctors who post-edit the output of the real-word spell-checker. If there is no post-editing, at least more errors get corrected automatically. Our method could also correct misspelled words, not only malapropisms, without any modification. In future work, we plan to add a detection module and extend our method to allow for deleted or inserted words, and to find the corrected strings in the Google Web 1T n-grams. In this way we will be able to correct grammar errors too.

Acknowledgments

This work is funded by the Natural Sciences and Engineering Research Council of Canada. We want to thank Professor Graeme Hirst from the Department of Computer Science, University of Toronto, for providing the evaluation data set.

References

[1] G. Hirst and A. Budanitsky, "Correcting real-word spelling errors by restoring lexical cohesion," Natural Language Engineering, vol. 11, pp. 87-111, March 2005.

[2] L. A. Wilcox-O'Hearn, G. Hirst, and A. Budanitsky, "Real-word spelling correction with trigrams: A reconsideration of the Mays, Damerau, and Mercer model," in Proceedings, 9th International Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2008) (Lecture Notes in Computer Science 4919, Springer-Verlag) (A. Gelbukh, ed.), (Haifa), pp. 605-616, February 2008.

[3] J. Pedler, Computer Correction of Real-word Spelling Errors in Dyslexic Text. PhD thesis, Birkbeck, London University, 2007.

[4] K. Kukich, "Techniques for automatically correcting words in text," ACM Comput. Surv., vol. 24, no. 4, pp. 377-439, 1992.

[5] T. Brants and A. Franz, "Web 1T 5-gram corpus version 1.1," tech. rep., Google Research, 2006.

[6] G. Hirst and D. St-Onge, WordNet: An Electronic Lexical Database, ch. Lexical chains as representations of context for the detection and correction of malapropisms, pp. 305-332. Cambridge, MA: The MIT Press, 1998.

[7] A. R. Golding and D. Roth, "A winnow-based approach to context-sensitive spelling correction," Machine Learning, vol. 34, no. 1-3, pp. 107-130, 1999.

[8] A. J. Carlson, J. Rosen, and D. Roth, "Scaling up context-sensitive text correction," in Proceedings of the Thirteenth Conference on Innovative Applications of Artificial Intelligence, pp. 45-50, AAAI Press, 2001.

[9] E. Mays, F. J. Damerau, and R. L. Mercer, "Context based spelling correction," Information Processing and Management, vol. 27, no. 5, pp. 517-522, 1991.

[10] S. Verberne, "Context-sensitive spell checking based on word trigram probabilities," Master's thesis, University of Nijmegen, February-August 2002.

[11] L. Burnard, Reference Guide for the British National Corpus (World Edition), October 2000. www.natcorp.ox.ac.uk/docs/userManual/urg.pdf.

[12] L. Allison and T. Dix, "A bit-string longest-common-subsequence algorithm," Information Processing Letters, vol. 23, pp. 305-310, 1986.

[13] A. Islam and D. Inkpen, "Semantic text similarity using corpus-based word similarity and string similarity," ACM Transactions on Knowledge Discovery from Data, vol. 2, no. 2, pp. 1-25, 2008.

[14] G. Kondrak, "N-gram similarity and distance," in Proceedings of the 12th International Conference on String Processing and Information Retrieval, (Buenos Aires, Argentina), pp. 115-126, 2005.

[15] I. D. Melamed, "Bitext maps and alignment via pattern recognition," Computational Linguistics, vol. 25, no. 1, pp. 107-130, 1999.

[16] G. Hirst, "An evaluation of the contextual spelling checker of Microsoft Office Word 2007," January 2008. http://ftp.cs.toronto.edu/pub/gh/Hirst-2008-Word.pdf.