
WNSpell: a WordNet-Based Spell Corrector

Bill Huang
Princeton University
[email protected]

Abstract

This paper presents a standalone spell corrector, WNSpell, based on and written for WordNet. It is aimed at generating the best possible suggestion for a mistyped query but can also serve as an all-purpose spell corrector. The spell corrector consists of a standard initial correction system, which evaluates word entries using a multifaceted approach to achieve the best results, and a semantic recognition system, wherein given a related word input, the system adjusts the suggestions accordingly. Both feature significant performance improvements over current context-free spell correctors.

1 Introduction

WordNet is a lexical database of English and serves as the premier tool for word sense disambiguation. It stores around 160,000 word forms, or lemmas, and 120,000 word senses, or synsets, in a large graph of semantic relations. The goal of this paper is to introduce a spell corrector for the WordNet interface, directed at correcting queries and aiming to take advantage of WordNet's structure.

1.1 Previous Work

Work on spell checkers, suggesters, and correctors began in the late 1950s and has developed into a multifaceted field. First aimed at simply detecting spelling errors, the task of spelling correction has grown considerably in complexity.

The first attempts at spelling correction used edit distance measures, such as the Levenshtein distance, where the word with minimal distance from the entry would be chosen as the correct candidate. Soon, probabilistic techniques using noisy channel models and Bayesian properties were invented. These models were more sophisticated, as they also considered the statistical likelihood of particular errors and the frequency of the candidate word in literature.

Two other major techniques were also developed. One was similarity keys, which use properties such as a word's phonetic sound or first few letters to vastly decrease the size of the search space to be considered. The other was the rule-based approach, which applies a set of human-generated common misspelling rules to efficiently generate plausible corrections and then matches these candidates against a dictionary.

With the advent of the Internet and the subsequent increase in data availability, spell correction has been further improved. N-grams can be used to integrate grammatical and contextual validity into the spell correction process, which standalone spell correction is not able to achieve. Machine learning techniques such as neural nets, trained using massive online crowdsourcing or gigantic corpora, are being harnessed to refine spell correction beyond what could be done manually.

Nevertheless, spell correction still faces significant challenges, though most lie in understanding context. Spell correction in other languages is also incomplete: despite significant work in English, relatively little has been done elsewhere.

1.2 This Project

Spell correctors are used everywhere, from simple spell checking in a word processor to query completion and correction in search engines to context-based in-passage correction. Since this spell corrector is built for the WordNet interface, it focuses on correcting a single-word query, with the additional possibility of a user-supplied semantically related word on which to base corrections.
2 Correction System

The first part of the spell corrector is a standard context-free spell corrector. It takes in a query such as speling and returns an ordered list of three possible candidates; in this case, it returns the set {spelling, spoiling, sapling}.

The spell corrector operates similarly to the Aspell and Hunspell spell correctors (the latter of which serves as the spell checker for many applications, from Chrome and Firefox to OpenOffice and LibreOffice). The spell corrector we introduce here, though not as versatile in terms of support for different platforms, achieves far better performance.

To tune the spell corrector to WordNet queries, stress is placed on bad misspellings over small errors. We mainly use the Aspell data set (547 errors), kindly made public by the GNU Aspell project, to test the performance of the spell corrector. Though the mechanisms of the spell corrector are inspired by logic and research, they are included and adjusted mainly based on empirical tests on this data set.

2.1 Generating the Search Space

To improve performance, the spell corrector needs a fine-tuned scoring system for each candidate word. Clearly, scoring every word in WordNet's dictionary of 150,000 words is not practical in terms of runtime, so the first step toward an accurate spell corrector is always to reduce the search space of correction candidates.

The search space should contain all reasonable sources of the spelling error. Errors in spelling arise in three separate stages (Deorowicz and Ciura, 2005):

1. Idea → thought word, e.g. distrucally → destructfully

2. Thought word → spelled word, e.g. egsistance → existence

3. Spelled word → typed word, e.g. autocorrecy → autocorrect

The main challenges regarding search space generation are:

1. Containment of all, or nearly all, reasonable corrections

2. Reasonable size

3. Reasonable runtime

There have been several approaches to this search space problem, but all have significant drawbacks in one of these criteria:

• The simplest is the lexicographic approach, which generates a search space of words within a certain edit distance of the query (a minimal sketch of this kind of candidate generation is given after this list). Though simple, this minimum edit distance technique, introduced by Damerau in 1964 and Levenshtein in 1966, only accounts for type 3 (and possibly type 2) misspellings. The approach is reasonable for misspellings of up to edit distance 2, as Norvig's implementation of it runs in ∼0.1 seconds, but the time complexity increases exponentially, and for misspellings such as funetik → phonetic that are a significant edit distance away, this approach cannot contain the correction without sacrificing both the size of the search space and the runtime.

• Another approach uses phonetics, as misspelled words will most likely still have similar phonetic sounds. This accounts for type 2 misspellings, though not necessarily type 1 or type 3 misspellings. Implementations of this approach, such as the SOUNDEX code (Odell and Russell, 1918), can efficiently capture misspellings such as funetik → phonetic, but not misspellings like rypo → typo. Again, this approach is not sufficient to contain all plausible corrections.

• A similarity key can also be used. The similarity key approach stores each word under a key, along with other similar words. One implementation of this is the SPEED-COP spell corrector (Pollock and Zamora, 1984), which takes advantage of the usual alphabetic proximity of misspellings to the correct word. This approach accounts for many errors, but there are always a large number of exceptions, as misspellings do not always have similar keys (such as the misspelling zlphabet → alphabet).

• Finally, the rule-based approach uses a set of common misspelling patterns, such as im → in or y → t, to generate possible sources of the typing error. The most complicated approach, these spell correctors contain the plausible corrections for most spelling errors quite well, but miss many of the bad misspellings. The implementation by Deorowicz and Ciura using this approach is quite effective, though it can be improved.
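As a point of reference for the lexicographic approach above, a minimal Norvig-style candidate generator in Python might look as follows. This is only a sketch, not WNSpell code, and the lexicon set is a stand-in for WordNet's word list.

import string

def edits1(word):
    """All strings within one edit (deletion, transposition,
    substitution, insertion) of the given word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def candidates_within_distance1(word, lexicon):
    """Keep only the edit-distance-1 strings that are real words."""
    return edits1(word) & lexicon

# Example with a tiny, hypothetical lexicon:
lexicon = {"typo", "type", "tip", "spelling"}
print(candidates_within_distance1("rypo", lexicon))   # {'typo'}

As the bullet above notes, extending this brute-force generation to edit distance 2 or beyond quickly becomes impractical, which is why WNSpell combines it with the other techniques described next.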
Our approach is to use a combination of these techniques to achieve the best results. Each approach has its strengths and weaknesses, but none achieves good coverage of the plausible corrections without sacrificing size and runtime. Instead, we take the best of each approach to contain the plausible corrections of the query much more completely.

To do this, we partition the set of plausible corrections into groups (not necessarily disjoint, but with a very complete union) and consider each separately:

• Close mistypings/misspellings: This group includes typos of edit distance 1 (typo → rypo) and misspellings of edit distance 1 (consonent → consonant), as well as repetitions of letters (mispel → misspell). These are easy to generate and check, running in O(nα log n) time, where n is the length of the entry and α is the size of the alphabet (though increasing the maximum distance to 2 would result in a significantly slower O(n²α² log n)).

• Words with similar phonetic key: We precalculate a phonetic key for each word in WordNet, a numerical representation of its first five consonant sounds (a sketch of this computation is given after this list):

  0: (ignored) a, e, i, o, u, h, w, [gh](t)
  1: b, p
  2: k, c, g, j, q, x
  3: s, z, c(i/e/y), [ps], t(i o), (x)
  4: d, t
  5: m, n, [pn], [kn]
  6: l
  7: r
  8: f, v, (r/n/t o u)[gh], [ph]

  Each word in WordNet is then stored in an array with indices ranging from [00000] (no consonants) to [88888] and can be looked up quickly. This group includes words whose phonetic key differs by an edit distance of at most 1 from the phonetic key of the entry (funetik → phonetic), and it also does a very good job of including typos and misspellings of edit distance greater than 1. It actually contains the first group completely, but for pruning purposes the first group is considered separately. Generation takes very little time, O(Cn), where C ≈ 52 × 9.

• Exceptions: This group includes words that are not covered by either of the first two groups but are still plausible corrections, such as lignuitic → linguistic. We observe that most of these exceptions either have beginnings and endings similar to the original word and are close in edit distance, or are simply too far removed from the entry to be plausible. Searching through words with similar beginnings that also have similar endings (through an alphabetically sorted list) proves very effective at including the exceptions, while taking very little time.
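To make the key construction concrete, here is a simplified Python sketch of the digit mapping above. It handles only some of the context-sensitive rules (soft c and a few digraphs) and assumes trailing-zero padding, so it illustrates the idea rather than reproducing the exact key used by WNSpell.

# Simplified phonetic key: the first five consonant sounds mapped to digits.
# Context-sensitive rules from the table (silent gh, [gh] as f, etc.) are
# only partially handled here.
DIGIT = {
    'b': '1', 'p': '1',
    'k': '2', 'c': '2', 'g': '2', 'j': '2', 'q': '2', 'x': '2',
    's': '3', 'z': '3',
    'd': '4', 't': '4',
    'm': '5', 'n': '5',
    'l': '6',
    'r': '7',
    'f': '8', 'v': '8',
}
DIGRAPHS = {'ph': '8', 'kn': '5', 'pn': '5', 'ps': '3', 'gh': ''}  # '' = ignored

def phonetic_key(word, length=5):
    word = word.lower()
    digits = []
    i = 0
    while i < len(word) and len(digits) < length:
        two = word[i:i + 2]
        if two in DIGRAPHS:
            if DIGRAPHS[two]:
                digits.append(DIGRAPHS[two])
            i += 2
            continue
        ch = word[i]
        if ch == 'c' and word[i + 1:i + 2] in ('i', 'e', 'y'):
            digits.append('3')          # soft c
        elif ch in DIGIT:
            digits.append(DIGIT[ch])
        # vowels, h, w fall into class 0 and are skipped
        i += 1
    return ''.join(digits).ljust(length, '0')   # padding scheme assumed

print(phonetic_key("phonetic"))   # '85420'
print(phonetic_key("funetik"))    # '85420' (same key, as desired)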
As many generated words, especially from the later groups, are clearly not plausible corrections, candidate words of each type are then pruned with different constraints depending on which group they come from. Words in later groups are subject to tougher pruning, and the discovery of a close match results in tougher pruning overall. For instance, many words in the second group are quite far removed from the entry and completely implausible as corrections (e.g. zjpn → [00325] → [03235] → suggestion), while those that differ only by repetition of letters (e.g. lllooolllll → loll) are almost always plausible, so the former should be pruned more strictly.

Finally, since the search space after group pruning can still be quite large (up to 200 words), it may be pruned repeatedly until it reaches an acceptable size. Factors considered during pruning include:

• Length of word
• Letters contained in word
• Phonetic key of word
• First and last letter agreement
• Number of syllables
• Frequency of word in text (COCA corpus)
• Edit distance

This process generates a search space that rarely misses the desired correction, while keeping both a small size in number of words and a fast runtime.

2.2 Evaluating Possibilities

The next step is to assign a similarity score to all of the candidates in the search space. It must be accurate enough to tell that disurn should be corrected to discern rather than disown, and versatile enough to figure out that funetik → phonetic.

Our approach is a modified version of Church and Gale's probabilistic scoring of spelling errors. In this approach, each candidate correction c is scored following the Bayesian combination rule:

  P(c) = p(c) · max ∏_i p(t_i | c_i)

  C(c) = c(c) + min ∑_i c(t_i | c_i)

where p(c) is the frequency of the candidate correction and p(t_i | c_i) the probability of each edit operation in a sequence of edit operations that generates the correction, with the maximum and minimum taken over such sequences. The cost is scored logarithmically based on the probability, with c(t_i | c_i) ∝ −log p(t_i | c_i). The correction candidates are then sorted, lower cost meaning higher likelihood.

We use bigram error counts generated from large corpora (Jones and Mewhort, 2004) to determine the values of c(t | c). Two sets of counts were used:

• Error counts:
  – Deletion of letter β after letter α
  – Addition of letter β after letter α
  – Substitution of letter β for letter α
  – Adjacent transposition of the bigram αβ

• Bigram/monogram counts (log scale):
  – Monograms α
  – Bigrams αβ

First, we smooth all the counts using add-k smoothing (with k = 1/2), as there are numerous counts of 0. Since the bigram/monogram counts were retrieved in log format, for simplicity of data manipulation we only smooth the counts of 0, changing their values to −0.69 (originally undefined). We then calculate c(t_i | c_i) as:

  c(t_i | c_i) = k_1 log(1 / p(α → β)) + k_2

where p(α → β) is the probability of the edit operation and k_1, k_2 are factors that adjust the cost depending on the uncertainty of small counts and the increased likelihood of errors once errors are already present.

For the different edit operations, p(x → y) is:

  deletion:      del′(xy) / N′(xy)
  addition:      add′(xy) · N / (N′(x) · N′(y))
  substitution:  sub′(xy) · N / (N′(x) · N′(y))
  reversal:      rev′(xy) / N′(xy)

and for deletion and addition of letters at the beginning of a word:

  deletion:      del′(.y) / N′(.y)
  addition:      add′(.y) · N · w / N′(y)

To evaluate the minimum cost min ∑_i c(t_i | c_i) of a correction, we use a modified Wagner-Fischer algorithm, which finds the minimum in O(mn) time, where m and n are the lengths of the entry and the correction candidate, respectively. This is done for every candidate correction in the search space generated in Section 2.1.
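The minimum-cost computation can be illustrated with a standard Wagner-Fischer dynamic program over per-operation costs. The sketch below omits the transposition (reversal) operation and the k_1, k_2 adjustments, and the cost functions are placeholders for the confusion-count-based costs c(t_i | c_i) described above.

def weighted_edit_cost(entry, cand, del_cost, add_cost, sub_cost):
    """Minimum total edit cost of producing `entry` from `cand` in O(mn) time.
    del_cost(x), add_cost(x), sub_cost(x, y) give per-operation costs; in the
    system described above these would come from the confusion counts."""
    m, n = len(entry), len(cand)
    # dp[i][j] = minimum cost of producing entry[:i] from cand[:j]
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + add_cost(entry[i - 1])
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + del_cost(cand[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            same = 0.0 if entry[i - 1] == cand[j - 1] else sub_cost(cand[j - 1], entry[i - 1])
            dp[i][j] = min(dp[i - 1][j] + add_cost(entry[i - 1]),   # letter added in entry
                           dp[i][j - 1] + del_cost(cand[j - 1]),    # letter of cand deleted
                           dp[i - 1][j - 1] + same)                 # match / substitution
    return dp[m][n]

# Toy costs: every operation costs 1, i.e. plain Levenshtein distance.
unit = lambda *args: 1.0
print(weighted_edit_cost("disurn", "discern", unit, unit, unit))  # 2.0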
Now, the probabilistic scoring by itself is not always accurate, especially in cases such as funetik → phonetic. Thus, we modify the scoring of each candidate correction to significantly improve the accuracy of the suggestions:

• Instead of setting c(c) = −log p(c), we find that using c(c) as a multiplicative constant of the form f(c)^γ, where f(c) is the frequency of the word in the corpus and γ an empirically determined constant, yields significantly more accurate predictions.

• We add empirically determined multiplicative factors λ_i for the following properties of the entry and the candidate correction:
  – Same phonetic key (not restricted to the first 5 consonant sounds)
  – Same aside from repetition of letters
  – Same consonants (ordered)
  – Same vowels (ordered)
  – Same set of letters
  – Similar set of letters
  – Same number of syllables
  – Same after removal of es
  (Other factors were considered, but the factors pertaining to them proved insignificant.)

The candidate corrections are then ordered by their modified costs C′(c) = C(c) ∏_i λ_i, and the top three results, in order, are returned to the user.

3 Semantic Input

The second part of the spell corrector adds a semantic aspect to the correction of the search query. When users have trouble entering the query and cannot immediately choose a suggested correction, they are given the option to enter a semantically related word. WNSpell then takes this word into account when generating suggestions, harnessing WordNet's vast network of semantic relations to further optimize results.

This added dimension is very helpful for the more severe errors, which usually arise from the "idea → thought word" stage of spelling. These are much harder to deal with than conventional mistypings or misspellings, and are exactly the type of error WNSpell needs to be able to handle (as mistyped or even misspelled queries can be fixed without too much trouble by the user). The semantic anchor the related word provides helps WNSpell establish the "idea" behind the desired word and thus refine the suggestions for it.

To incorporate the related word into suggestion generation, we make some modifications to the original context-free spell corrector.

3.1 Adjusting the Search Space

One issue with the original search space generation is that a small fraction of plausible corrections are still missed, especially for more severe errors. To improve coverage, we modify the search space to also include a nucleus of plausible corrections generated semantically, not just lexicographically. Since the missed corrections are lexicographically difficult to generate, a semantic approach is more effective at increasing coverage.

The additional group of word forms is generated as follows (a sketch appears after the examples below):

1. For each synset of the related word, we consider all synsets related to it by some semantic pointer in WordNet.

2. All lemmas (word forms) of these synsets are evaluated.

3. Lemmas that share the same first letter or the same last letter and are not too far away in length are added to the group.

The inclusion of this additional group is very effective at capturing the missed corrections while remaining relatively small. Some examples of missed words captured by this group in the training set are (entry, correct, related):

• autoamlly, automatically, mechanically

• conibation, contribution, donation
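As an illustration of these three steps, the following sketch uses NLTK's WordNet interface (the paper does not name an implementation library) to collect the semantically generated group; the relation list and the length threshold are assumptions made for the example.

from nltk.corpus import wordnet as wn   # requires the NLTK 'wordnet' corpus

# Semantic pointer types followed here; a subset of WordNet's relations.
RELATIONS = ('hypernyms', 'hyponyms', 'part_meronyms', 'part_holonyms',
             'member_meronyms', 'member_holonyms', 'similar_tos',
             'also_sees', 'attributes', 'entailments', 'causes')

def semantic_group(related_word, entry, max_len_diff=3):
    """Candidate word forms gathered from synsets one semantic pointer away
    from any sense of `related_word`, filtered as in steps 1-3 above."""
    group = set()
    for synset in wn.synsets(related_word):
        neighbours = [synset]                    # also keep the sense's own synonyms
        for rel in RELATIONS:
            neighbours.extend(getattr(synset, rel)())
        for neighbour in neighbours:
            for lemma in neighbour.lemmas():
                form = lemma.name().replace('_', ' ').lower()
                if abs(len(form) - len(entry)) > max_len_diff:
                    continue                     # too far away in length
                if form[0] == entry[0] or form[-1] == entry[-1]:
                    group.add(form)              # shares first or last letter
    return group

# Example built from the paper's (entry, related) pair 'conibation' / 'donation':
print(sorted(semantic_group('donation', 'conibation'))[:10])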
3.2 Adjusting the Evaluation

We also modify the scoring of each candidate correction to take semantic distance into account. First, each candidate correction is assigned a semantic distance d (higher means more similar) based on Lesk distance:

  d = max_i max_j s(r_i, c_j)

which takes the maximum similarity over all pairs of definitions of the related word r and the candidate c, where the similarity s of two definitions is measured by:

  s(r_i, c_j) = ∑ (k − ln(n_w + 1)), summed over words w ∈ R_i ∩ C_j with w ∉ S

This considers words w in the intersection of the two definitions that are not stopwords (S being the stopword set) and weights them by the smoothed frequency n_w of w in the COCA corpus (as rarity is related to information content) and an appropriate constant k. Additionally, if r or c is found in the other word's definition, we add a(k − ln(n_{r/c} + 1)) to the similarity s, for some appropriate constant a > 1. This resolves many issues that come up with hypernyms/hyponyms (among others), where two similar words would otherwise be assigned a low score because the only words their definitions share may be the words themselves.

We also consider the number n of shared subsequences of length 3 between r and c, which is very helpful in ruling out semantically similar but lexicographically unlikely words.

We then adjust the cost function C′ by:

  C″ = C′ / ((d + 1)^α (n + 1)^β)

for some empirically determined constants α and β. The new costs are then sorted and the top three results returned to the user.
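A rough sketch of the gloss-overlap similarity is given below, again using NLTK's WordNet glosses. The stopword list, the frequency table standing in for the smoothed COCA counts n_w, and the constants k and a are placeholder assumptions, not values from the paper.

import math
from nltk.corpus import wordnet as wn

STOPWORDS = {'a', 'an', 'the', 'of', 'or', 'and', 'to', 'in', 'that', 'is'}
FREQ = {}          # stand-in for smoothed corpus frequencies n_w
K, A = 12.0, 2.0   # placeholder constants (k, and a > 1 in the paper)

def weight(word):
    # k - ln(n_w + 1): rare words carry more information
    return K - math.log(FREQ.get(word, 0) + 1)

def gloss_similarity(related, candidate):
    """d = maximum weighted gloss overlap over all sense pairs."""
    best = 0.0
    for rs in wn.synsets(related):
        r_tokens = set(rs.definition().lower().split())
        for cs in wn.synsets(candidate):
            c_tokens = set(cs.definition().lower().split())
            overlap = (r_tokens & c_tokens) - STOPWORDS
            score = sum(weight(w) for w in overlap)
            # bonus if either word appears in the other's definition
            if related in c_tokens:
                score += A * weight(related)
            if candidate in r_tokens:
                score += A * weight(candidate)
            best = max(best, score)
    return best

print(gloss_similarity('donation', 'contribution'))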

4 Results

We used the Aspell data set to train the system. The test set consists of 547 hard-to-correct words, which is ideal for our purposes, as we focus on correcting bad misspellings as well as easy ones. Most of the empirically derived constants from Sections 2.2 and 3.2 were determined based on results from this data set.

4.1 Without Semantic Input

We compare the results of WNSpell with a few popular spell checkers, Aspell, Hunspell, Ispell, and Word, as well as with the proposal of Deorowicz and Ciura, which appears to have the best results on the Aspell test set so far (other approaches are based on unavailable or incompatible data sets). Ideally, each spell checker would be run on the same data set and on the same computer for consistent results. Due to technical constraints, however, this is infeasible. Instead, we use the results posted by the authors of the spell checkers, which, despite some uncertainty, still yield consistent and comparable results.

First, we compare our generated search space with the lists returned by Aspell, Hunspell, Ispell, and Word (Atkinson). We use a subset of the Aspell test set containing all entries whose corrections are in all five dictionaries. The results are shown in Table 1.

Table 1: Search space results
  Method              % found   Size (0/50/100th percentile)
  WNSpell             97.4      1 / 10 / 66
  Aspell (0.60.6n)    90.1      2 / 12 / 100
  Hunspell (1.1.12)   83.2      1 / 4 / 15
  Ispell (3.1.20)     54.8      0 / 1 / 29
  Word 97             75.4      0 / 2 / 20

Compared to these spell correctors, WNSpell clearly does a significantly better job of containing the desired correction than Aspell, Hunspell, Ispell, or Word, within a set of words of acceptable size.

We now compare the top three words returned with those returned by Aspell, Hunspell, Ispell, and Word. We also include data from Deorowicz and Ciura, which likewise uses the Aspell test set. Since the dictionaries used were different, we additionally include Aspell results on their subset of the Aspell test set. The results are shown in Table 2 and compared graphically in Figure 1. Once again, WNSpell significantly outperforms the other five spell correctors.

Table 2: Aspell test set results (% identified)
  Method              Top 1   Top 2   Top 3   Top 10
  WNSpell             77.5    88.5    91.2    96.1
  Aspell (0.60.6n)    54.3    63.0    72.9    87.1
  Hunspell (1.1.12)   58.2    71.5    76.6    82.3
  Ispell (3.1.20)     40.1    47.9    50.4    54.1
  Word 97             62.6    69.4    72.7    75.4
  Aspell (n)          56.9    66.9    74.7    87.9
  DC                  66.3    75.5    79.6    85.5

[Figure 1: Aspell test set results (% identified) against top n, comparing WNSpell, Aspell (0.60.6n), Hunspell (1.1.12), Word 97, Aspell (n), and DC.]

We also test WNSpell on the Aspell common misspellings test set, a list of 4206 common misspellings and their corrections. Since the spell corrector was not trained on this set, it is a blind comparison. Once again, we use the subset of entries whose corrections are in all of the dictionaries. The results are shown in Tables 3 and 4 and compared graphically in Figure 2.

Table 3: Blind search space results
  Method              % found   Size (0/50/100th percentile)
  WNSpell             98.4      1 / 4 / 50
  Aspell (0.60.6n)    97.7      1 / 9 / 100
  Hunspell (1.1.12)   97.3      1 / 5 / 15
  Ispell (3.1.20)     85.2      0 / 1 / 26

Table 4: Blind test set results (% identified)
  Method              Top 1   Top 2   Top 3   Top 10
  WNSpell             91.4    96.3    97.6    98.3
  Aspell (0.60.6n)    73.6    81.2    92.0    97.0
  Hunspell (1.1.12)   80.8    92.0    95.0    97.3
  Ispell (3.1.20)     77.4    82.7    84.3    85.2

[Figure 2: Blind test set results (% identified) against top n for WNSpell, Aspell (0.60.6n), Hunspell (1.1.12), and Ispell (3.1.20).]

Additionally, WNSpell runs reasonably fast: it takes ∼13 ms per word, while Aspell takes ∼3 ms, Hunspell ∼50 ms, and Ispell ∼0.3 ms. Thus, WNSpell is an efficient standalone spell corrector, achieving superior performance within an acceptable runtime.
4.2 With Semantic Input

We test WNSpell with the semantic component on the original training set, this time with added related words. For each word in the training set, a human-generated related word is supplied as input.

With the addition of the semantic adjustments, WNSpell performs considerably better than without them. The results are shown in Table 5 and compared graphically in Figure 3.

Table 5: Semantic results (% identified)
  Method    Top 1   Top 2   Top 3   Top 10
  with      87.4    93.0    96.5    99.1
  without   77.5    88.5    91.2    96.1

[Figure 3: Semantic results (% identified) against top n, with and without semantic input.]

The runtime for WNSpell with semantic input, however, is rather slow, at an average of ∼200 ms.

5 Conclusions

The WNSpell algorithm introduced in this paper improves the accuracy of standalone spelling correction over other systems, including the most recent version of Aspell and other widely used spell correctors such as Hunspell and Word, by approximately 20%. WNSpell takes into account a variety of factors regarding different types of spelling errors and uses a carefully tuned algorithm to handle much of the diversity of spelling errors present in the test data sets. An efficient search space pruning system, strongly aided by a phonetic key, restricts the number of words to be considered, and an accurate scoring system then compares the remaining candidates. The accuracy of WNSpell on hard-to-correct words is quite close to that of most people's abilities and significantly stronger than other methods.

WNSpell also provides an alternative: using a related word to help the system find the desired correction even when the user is far off the mark in terms of spelling or phonetics. This added feature increases the accuracy of WNSpell by approximately a further 10% by directly connecting the idea the user has in mind to the word itself. This link makes it possible for users who only know the rough meaning of their desired word, or the context it appears in, to actually find the word.

5.1 Limitations

The standalone algorithm currently does not take vowel phonetics into consideration, which are rather complex in the English language. For instance, the query spoak is corrected to speak rather than spoke. While a person easily corrects spoak to spoke, WNSpell cannot use the fact that spoke sounds the same while speak does not: all three words share the consonant sounds s, p, k and the candidates differ from spoak only in their vowels, and the edit distance evaluation finds that speak (a single letter substitution) is closer than spoke, so the algorithm chooses speak.

WNSpell, a spell corrector targeting single-word queries, also does not have the benefit of the contextual clues that most modern spell correctors use.
5.2 Future Improvements

As mentioned earlier, introducing a vowel phonetic system into WNSpell would increase its accuracy. The semantic feature of WNSpell could be improved either by pruning down the algorithm to improve performance or by incorporating other measures of word closeness into the algorithm. One possible addition is the use of distributional methods, such as searching for similar words with pre-trained word vectors.

Additionally, WNSpell-like spell correctors can be implemented in other languages rather easily, as WNSpell does not rely heavily on the morphology of the language (though it requires some statistics of letter frequencies as well as simplified phonetics). This portability is quite useful, as WordNet has been implemented in over a hundred languages, so WNSpell can be ported to other, non-English wordnets.

References

K. Atkinson. "Spell Checker Test Kernel Results," http://aspell.net/test/.

K.W. Church and W.A. Gale. 1991. "Probability Scoring for Spelling Correction," AT&T Bell Laboratories.

Corpus of Contemporary American English. (n.d.). http://corpus.byu.edu/coca/.

S. Deorowicz and M.G. Ciura. 2005. "Correcting Spelling Errors by Modeling their Causes," Int. J. Appl. Math. Comp. Sci., Vol. 15, No. 2, 275-285.

M.N. Jones and J.K. Mewhort. 2004. "Case-Sensitive Letter and Bigram Frequency Counts from Large-Scale English Corpora," Behavior Research Methods, Instruments, & Computers, 36(3), 388-396.

D. Jurafsky and J.H. Martin. 1999. Speech and Language Processing, Prentice Hall.

R. Mishra and N. Kaur. 2013. "A Survey of Spelling Error Detection and Correction Techniques," International Journal on Computer Trends and Technology, Vol. 4, No. 3, 372-374.

P. Norvig. "How to Write a Spelling Corrector," http://norvig.com/spell-correct.html.