WNSpell: a WordNet-Based Spell Corrector
Bill Huang
Princeton University
[email protected]

Abstract

This paper presents a standalone spell corrector, WNSpell, based on and written for WordNet. It is aimed at generating the best possible suggestion for a mistyped query but can also serve as an all-purpose spell corrector. The spell corrector consists of a standard initial correction system, which evaluates word entries using a multifaceted approach to achieve the best results, and a semantic recognition system, wherein, given a related word input, the system adjusts the spelling suggestions accordingly. Both feature significant performance improvements over current context-free spell correctors.

1 Introduction

WordNet is a lexical database of English words and serves as the premier tool for word sense disambiguation. It stores around 160,000 word forms, or lemmas, and 120,000 word senses, or synsets, in a large graph of semantic relations. The goal of this paper is to introduce a spell corrector for the WordNet interface, directed at correcting queries and aiming to take advantage of WordNet's structure.

1.1 Previous Work

Work on spell checkers, suggesters, and correctors began in the late 1950s and has developed into a multifaceted field. First aimed at simply detecting spelling errors, the task of spelling correction has grown substantially in complexity.

The first attempts at spelling correction used edit distance, such as the Levenshtein distance, where the word with minimal distance from the misspelling would be chosen as the correct candidate.

Soon, probabilistic techniques using noisy channel models and Bayesian properties were invented. These models were more sophisticated, as they also considered the statistical likelihood of certain errors and the frequency of the candidate word in literature.

Two other major techniques were also being developed. One was similarity keys, which use properties such as a word's phonetic sound or first few letters to vastly decrease the size of the dictionary to be considered. The other was the rule-based approach, which applies a set of human-generated common misspelling rules to efficiently generate a set of plausible corrections and then matches these candidates against a dictionary.

With the advent of the Internet and the subsequent increase in data availability, spell correction has been further improved. N-grams can be used to integrate grammatical and contextual validity into the spell correction process, which standalone spell correction is not able to achieve. Machine learning techniques, such as neural nets trained on massive online crowdsourcing or gigantic corpora, are being harnessed to refine spell correction beyond what could be done manually.

Nevertheless, spell correction still faces significant challenges, though most lie in understanding context. Spell correction in other languages is also incomplete: despite significant work in English lexicography, relatively little has been done for other languages.

1.2 This Project

Spell correctors are used everywhere, from simple spell checking in a word document to query completion/correction in Google to context-based in-passage correction. This spell corrector, as it is built for the WordNet interface, focuses on spell correction for a single-word query, with the additional possibility of a user-inputted semantically related word on which to base corrections.

2 Correction System

The first part of the spell corrector is a standard context-free spell corrector. It takes in a query such as speling and returns an ordered list of three possible candidates; in this case, it returns the set {spelling, spoiling, sapling}.

The spell corrector operates similarly to the Aspell and Hunspell spell correctors (the latter of which serves as the spell checker for many applications, from Chrome and Firefox to OpenOffice and LibreOffice). The spell corrector we introduce here, though not as versatile in terms of support for different platforms, achieves far better performance.

To tune the spell corrector to WordNet queries, stress is placed on bad misspellings over small errors. We will mainly use the Aspell data set (547 errors), kindly made public by the GNU Aspell project, to test the performance of the spell corrector. Though the mechanisms of the spell corrector are inspired by logic and research, they are included and adjusted mainly based on empirical tests on the above data set.

2.1 Generating the Search Space

To improve performance, the spell corrector needs to implement a fine-tuned scoring system for each candidate word. Clearly, scoring each word in WordNet's dictionary of 150,000 words is not practical in terms of runtime, so the first step of an accurate spell corrector is always to reduce the search space of correction candidates.

The search space should contain all possible reasonable sources of the spelling error. These errors in spelling arise from three separate stages (Deorowicz and Ciura, 2005):

1. Idea → thought word
   i.e. distrucally → destructfully

2. Thought word → spelled word
   i.e. egsistance → existence

3. Spelled word → typed word
   i.e. autocorrecy → autocorrect

The main challenges regarding search space generation are:

1. Containment of all, or nearly all, possible reasonable corrections

2. Reasonable size

3. Reasonable runtime

There have been several approaches to this search space problem, but all have significant drawbacks in one of the criteria of search space generation:

• The simplest approach is the lexicographic approach, which simply generates a search space of words within a certain edit distance of the query. Though simple, this minimum edit distance technique, introduced by Damerau in 1964 and Levenshtein in 1966, only accounts for type 3 (and possibly type 2) misspellings. The approach is reasonable for misspellings of up to edit distance 2, as Norvig's implementation of it runs in ∼0.1 seconds, but time complexity increases exponentially, and for misspellings such as funetik → phonetic that are a significant edit distance away, this approach cannot contain the correction without sacrificing both the size of the search space and the runtime.

• Another approach is to use phonetics, as misspelled words will most likely still have similar phonetic sounds. This accounts for type 2 misspellings, though not necessarily type 1 or type 3 misspellings. Implementations of this approach, such as the SOUNDEX code (Odell and Russell, 1918), are able to efficiently capture misspellings such as funetik → phonetic, but not misspellings like rypo → typo. Again, this approach is not sufficient to contain all plausible corrections.

• A similarity key can also be used. The similarity key approach stores each word under a key, along with other similar words. One implementation of this is the SPEED-COP spell corrector (Pollock and Zamora, 1984), which takes advantage of the usual alphabetic proximity of misspellings to the correct word. This approach accounts for many errors, but there are always a large number of exceptions, as the misspellings do not always have similar keys (such as the misspelling zlphabet → alphabet).

• Finally, the rule-based approach uses a set of common misspelling patterns, such as im → in or y → t, to generate possible sources of the typing error. The most complicated approach, these spell correctors are able to contain the plausible corrections for most spelling errors quite well, but will miss many of the bad misspellings. The implementation by Deorowicz and Ciura using this approach is quite effective, though it can be improved.

Our approach with this spell corrector is to use a combination of these approaches to achieve the best results. Each approach has its strengths and weaknesses, but none can achieve good coverage of the plausible corrections without sacrificing size and runtime. Instead, we take the best of each approach to much better contain the plausible corrections of the query. To do this, we partition the set of plausible corrections into groups (not necessarily disjoint, but with a very complete union) and consider each separately:

• Close mistypings/misspellings: This group includes typos of edit distance 1 (typo → rypo) and misspellings of edit distance 1 (consonent → consonant), as well as repetition of letters (mispel → misspell). These are easy to generate, running in O(n log nα) time, where n is the length of the entry and α is the size of the alphabet, to generate and check each word (though increasing the maximum distance to 2 would result in a significantly slower time of O(n² log nα²)).

• This group includes words with a phonetic key that differs by an edit distance of at most 1 from the phonetic key of the entry (funetik → phonetic). It also does a very good job of including typos/misspellings of edit distance greater than 1 (it actually includes the first group completely, but for pruning purposes, the first group is considered separately) in very little time, O(Cn), where C ∼ 52 × 9.

• Exceptions: This group includes words that are not covered by either of the first two groups but are still plausible corrections, such as lignuitic → linguistic. We observe that most of these exceptions either still have similar beginnings and endings to the original word and are close edit-distance-wise, or are simply too far removed from the entry to be plausible. Searching through words with similar beginnings that also have similar endings (through an alphabetically sorted list) proves to be very effective in including the exceptions, while taking very little time.

As many generated words, especially from the later groups, are clearly not plausible corrections, candidate words of each type are then pruned with different constraints depending on which group they are from. Words in later groups are subject to
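The minimum edit distance underlying the lexicographic approach can be made concrete with a short sketch. This is a generic Damerau-Levenshtein implementation (insertions, deletions, substitutions, and adjacent transpositions), not code from WNSpell itself:

```python
def damerau_levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, substitutions,
    and transpositions of adjacent characters (Damerau, 1964)."""
    # dp[i][j] = distance between the prefixes a[:i] and b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i
    for j in range(len(b) + 1):
        dp[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                # adjacent transposition, e.g. "teh" -> "the"
                dp[i][j] = min(dp[i][j], dp[i - 2][j - 2] + 1)
    return dp[len(a)][len(b)]
```

Counting transpositions as a single edit matters for type 3 errors: under plain Levenshtein distance "teh" is two edits from "the", while here it is one.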
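The first candidate group (words within edit distance 1 of the entry) is typically produced by expanding the query rather than scanning the whole dictionary. A sketch in the style of Norvig's well-known corrector, with a toy dictionary standing in for WordNet's lemma list:

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def edits1(word: str) -> set:
    """All strings within one insertion, deletion, substitution,
    or adjacent transposition of `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {l + r[1:] for l, r in splits if r}
    transposes = {l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1}
    replaces = {l + c + r[1:] for l, r in splits if r for c in ALPHABET}
    inserts = {l + c + r for l, r in splits for c in ALPHABET}
    return deletes | transposes | replaces | inserts

def close_candidates(word: str, dictionary: set) -> set:
    """Keep only the generated variants that are real dictionary words."""
    return edits1(word) & dictionary
```

The generated set has O(nα) members for a word of length n over an alphabet of size α, which is why expansion-and-lookup is cheap at distance 1 but grows quickly at distance 2.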
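The SOUNDEX code cited above can be sketched as follows. This is the classic Odell-Russell scheme in a slightly simplified form (h and w are treated like vowels here), not WNSpell's own phonetic key:

```python
def soundex(word: str) -> str:
    """Four-character SOUNDEX code: the first letter plus up to three
    digits for the remaining consonant classes, zero-padded."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    first = word[0].upper()
    digits = [codes.get(ch, "") for ch in word]
    out = []
    prev = digits[0]
    for d in digits[1:]:
        if d and d != prev:  # collapse runs of the same digit
            out.append(d)
        prev = d
    return (first + "".join(out) + "000")[:4]
```

Note that SOUNDEX keeps the raw first letter, so funetik (F532) and phonetic (P532) only match once the leading letter is also reduced to its consonant class; phonetic-key schemes need this kind of care to capture the funetik → phonetic example.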
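The rule-based approach can be sketched as a rewrite pass over the typed word. The rule list below is hypothetical apart from the paper's two examples (im → in, y → t); a real system would use a much larger, empirically derived table:

```python
# Hypothetical rule table: each pair (wrong, right) reads "the typed
# sequence on the left may stand for the intended sequence on the right".
RULES = [("im", "in"), ("y", "t"), ("f", "ph")]

def rule_candidates(typed: str, dictionary: set) -> set:
    """Apply each rewrite rule at every position of the typed word and
    keep the variants that are real dictionary words."""
    out = set()
    for wrong, right in RULES:
        start = 0
        while (i := typed.find(wrong, start)) != -1:
            variant = typed[:i] + right + typed[i + len(wrong):]
            if variant in dictionary:
                out.add(variant)
            start = i + 1
    return out
```

As the section notes, coverage rises and falls with the rule table: any misspelling not anticipated by a rule, however plausible, is simply never generated.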
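The exceptions group is found by scanning words that share the entry's beginning and ending in an alphabetically sorted word list. A sketch, where prefix_len and suffix_len are hypothetical parameters (the paper does not specify the lengths actually used):

```python
from bisect import bisect_left

def exception_candidates(entry: str, sorted_words: list,
                         prefix_len: int = 2, suffix_len: int = 2) -> list:
    """Scan the contiguous block of sorted words sharing the entry's
    first `prefix_len` letters, keeping those that also share its last
    `suffix_len` letters."""
    prefix = entry[:prefix_len]
    lo = bisect_left(sorted_words, prefix)  # start of the prefix block
    out = []
    for w in sorted_words[lo:]:
        if not w.startswith(prefix):
            break  # left the alphabetical block; stop early
        if w.endswith(entry[-suffix_len:]):
            out.append(w)
    return out
```

Because words with a common prefix are contiguous in the sorted list, the scan touches only that block, which is why this catch-all group costs so little time.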