
This is a repository copy of "A comparison of standard spell checking algorithms and a novel binary neural approach." White Rose Research Online URL for this paper: https://eprints.whiterose.ac.uk/689/

Article: Hodge, V.J. (orcid.org/0000-0002-2469-0224) and Austin, J. (orcid.org/0000-0001-5762-8614) (2003) A comparison of standard spell checking algorithms and a novel binary neural approach. IEEE Transactions on Knowledge and Data Engineering, 15 (5), pp. 1073-1081. https://doi.org/10.1109/TKDE.2003.1232265

A Comparison of Standard Spell Checking Algorithms and a Novel Binary Neural Approach

Victoria J. Hodge and Jim Austin

Abstract—In this paper, we propose a simple, flexible, and efficient hybrid spell checking methodology based upon phonetic matching, supervised learning, and associative matching in the AURA neural system. We integrate Hamming Distance and n-gram algorithms that have high recall for typing errors with a phonetic spell-checking algorithm in a single novel architecture. Our approach is suitable for any spell checking application, though it is aimed toward isolated word error correction, particularly spell checking user queries in a search engine.
We use a novel scoring scheme to integrate the retrieved words from each spelling approach and calculate an overall score for each matched word. From the overall scores, we can rank the possible matches. In this paper, we evaluate our approach against several benchmark spell-checking algorithms for recall accuracy. Our proposed hybrid methodology has the highest recall rate of the techniques evaluated, coupled with a low computational cost.

Index Terms—Binary neural spell checker, integrated modular spell checker, associative matching.

1 INTRODUCTION

Errors, particularly spelling and typing errors, are abundant in human-generated electronic text. For example, Internet search engines are criticized for their inability to spell check the user's query, which would prevent many futile searches where the user has incorrectly spelled one or more query terms. An approximate word-matching algorithm is required to identify errors in queries where little or no contextual information is available and, using some measure of similarity, recommend words that are most similar to each misspelled word. This error checking would prevent wasted computational processing, prevent wasted user time, and make any system more robust, as spelling and typing errors can prevent the system identifying the required information.

We describe an interactive spell checker that performs a presence check on words against a stored lexicon, identifies spelling errors, and recommends alternative spellings [9]. The basis of the system is the AURA modular neural network [3] described later. The spell checker uses a hybrid approach to overcome phonetic spelling errors and the four main forms of typing errors: insertion, deletion, substitution, and transposition (double substitution). We use a Soundex-type coding approach (see Kukich [10]) coupled with transformation rules to overcome phonetic spelling errors. "Phonetic spelling errors are the most difficult to detect and correct" [10] as they distort the spelling more than the other error types, so a phonetic component is essential. We use an n-gram approach [13] to overcome the first two forms of typing errors and integrate a Hamming Distance approach to overcome substitution and transposition errors. N-gram approaches match small character subsets of the query term. They incorporate statistical correlations between adjacent letters and are able to accommodate letter omissions or insertions. Hamming Distance matches words by left-aligning them and matching letter for letter. Hamming Distance does not work well for insertion and deletion, where the error prevents the letters aligning with the correct spelling, but works well for transposition and substitution, where most characters are still aligned. We have developed a novel scoring system for our spell checker. We separate the Hamming Distance and n-gram scores so the hybrid system can utilize the best match from either and overcome all four typing-error types. We add the Soundex score to the two separate scores to produce two word scores. The overall word score is the maximum of these two values.

Our approximate matching approach is simple and flexible. We assume the query words and lexicon words comprise sequences of characters from a finite set of 30 characters (26 alphabetical and four punctuation characters). The approach maps characters onto binary vectors and two storage-efficient binary matrices (Correlation Matrix Memories, described in Section 2.3) that represent the lexicon. The system is not language-specific so may be used on other languages; the phonetic codes and transformation rules would just need to be adapted to the new language. Our spell checker aims at high recall¹ accuracy, possibly at the expense of precision.² However, the scoring allows us to rank the retrieved matches, so we can limit the number of possibilities suggested to the user to the top 10 matches, giving both high recall and precision.

Some alternative spelling approaches include the Levenshtein edit distance [10], agrep [15], [14], aspell [2], and the two benchmark approaches MS Word 97 and MS Word 2000. We evaluate our approach against these alternatives. The reader is referred to Kukich [10] for a thorough treatise on spell checking techniques. We compare our hybrid system with all of the aforementioned approaches for quality of retrieval—the percentage of correct words retrieved from 600 misspelled words, giving a figure for the recall accuracy with noisy inputs (misspellings).

The paper is organized into the following sections: In Section 2, we provide a brief description of the alternative techniques used for the empirical comparison against our technique. Section 2.3 is a detailed description of the AURA neural architecture and our hybrid spelling technique using integrated AURA modules. In Section 3, we provide a qualitative comparison of the recall accuracy of our technique and the alternative approaches described in Section 2. Section 4 provides our analysis of the recall results and Section 5 contains our conclusions and suggestions for possible further developments of our technique.

----------
The authors are with the Department of Computer Science, University of York, UK YO10 5DD. E-mail: {austin, vicky}@cs.york.ac.uk.
Manuscript received 30 Oct. 2000; revised 25 May 2001; accepted 27 July 2001. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number 113081.
1. The percentage of correct spellings retrieved.
2. The percentage of words other than the correct spelling retrieved.
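The scoring scheme described in the introduction can be made concrete with a short sketch. The code below is illustrative only: the paper computes the Hamming Distance and n-gram scores with binary Correlation Matrix Memories in AURA (Section 2.3) and uses a Soundex-type code with transformation rules, whereas here we substitute plain Python stand-ins (a bigram overlap count, a left-aligned character match count, and textbook Soundex). Only the combination rule, adding the phonetic score to each of the two typing scores and taking the maximum, is taken directly from the text; the function names are our own.

```python
from collections import Counter

def ngram_score(query, word, n=2):
    """Illustrative n-gram overlap: count of shared letter bigrams."""
    q = Counter(query[i:i + n] for i in range(len(query) - n + 1))
    w = Counter(word[i:i + n] for i in range(len(word) - n + 1))
    return sum((q & w).values())        # multiset intersection

def hamming_score(query, word):
    """Left-align the two words and count positions that match."""
    return sum(a == b for a, b in zip(query, word))

def soundex(word):
    """Textbook four-character Soundex code, an illustrative stand-in
    for the paper's Soundex-type phonetic coding with transformation
    rules."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    out, prev = word[0].upper(), codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:       # drop consecutive identical codes
            out += code
        if ch not in "hw":              # h/w do not break a run of codes
            prev = code
    return (out + "000")[:4]

def overall_score(query, word):
    # Assumed binary phonetic score: 1 if the Soundex codes agree.
    sx = 1 if soundex(query) == soundex(word) else 0
    # Add the phonetic score to each typing score; keep the maximum.
    return max(hamming_score(query, word) + sx, ngram_score(query, word) + sx)

def rank(query, lexicon, top=10):
    """Rank candidate corrections; the top-10 cutoff follows the text."""
    return sorted(lexicon, key=lambda w: -overall_score(query, w))[:top]
```

For the query "ham" and the lexicon {"harm", "him", "cat"}, this sketch ranks "him" above "harm" because "him" shares the Soundex code H500 with the query, illustrating how the phonetic term lifts phonetically plausible candidates.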
2 SOME ALTERNATIVE SPELL CHECKING APPROACHES

2.1 Levenshtein Edit Distance

Levenshtein edit distance produces a similarity score for the query term against each lexicon word in turn. The score is the number of single character insertions, deletions, or substitutions required to alter the query term to produce the lexicon word; for example, to go from "him" to "ham" is one substitution, and "ham" to "harm" is one insertion. The word with the lowest score is deemed the best match.

f(0, 0) = 0    (1)

f(i, j) = min[f(i-1, j) + 1, f(i, j-1) + 1, f(i-1, j-1) + d(q_i, l_j)],
where d(q_i, l_j) = 0 if q_i = l_j, else d(q_i, l_j) = 1.    (2)

For all words, a function f(0, 0) is set to 0 and then a function f(i, j) is calculated for all query letters and all lexicon-word letters, iteratively counting the string difference between the query q_1 q_2 ... q_i and the lexicon word l_1 l_2 ... l_j. Each insertion, deletion, or substitution is awarded a score of 1, see (2). Edit distance is O(mn) for retrieval as it performs a brute-force comparison with every character (all m characters) of every word (all n words) in the lexicon and, therefore, can be slow when using large dictionaries.

2.2 Agrep

Agrep [15], [14] uses arrays of bit vectors, where R_j^d represents the state in the matching process of each word and R_{j+1}^d the next step. RShift is a logical right shift, AND and OR denote logical AND and OR, respectively, and S_c is a bit vector representing the character being compared, c. The following two equations describe the matching process for up to k errors, 0 < d <= k:

R_0^d = 11...100...000 with d bits set    (3)

R_{j+1}^d = RShift[R_j^d AND S_c] OR RShift[R_j^{d-1}] OR RShift[R_{j+1}^{d-1}] OR R_j^{d-1}    (4)

For a search with up to k errors permitted, there are k + 1 arrays as stated previously. There are two shifts, one AND, and three OR operations for each character comparison (see [15]), so Wu and Manber quote the running time as O((k + 1)n) for an n-word lexicon.

[Fig. 1. Diagram of the hybrid spell checker as implemented in the AURA modular system.]

2.3 Our Integrated Hybrid Modular Approach

We implement all spelling modules in AURA. AURA [3] is
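To make the recurrences of (1)-(2) and (3)-(4) concrete, both can be sketched in Python. This is an illustrative sketch, not the implementation used in the paper or in agrep itself: following the usual shift-and convention, Python's left shift plays the role of RShift in (4), and the function names edit_distance and agrep_match are our own.

```python
def edit_distance(query, word):
    """Dynamic-programming form of (1) and (2): f[i][j] is the number of
    single-character insertions, deletions, and substitutions needed to
    turn the first i letters of the query into the first j letters of
    the lexicon word (with the usual border values f[i][0] = i and
    f[0][j] = j added to f(0, 0) = 0)."""
    m, n = len(query), len(word)
    f = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        f[i][0] = i                                   # i deletions
    for j in range(1, n + 1):
        f[0][j] = j                                   # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = 0 if query[i - 1] == word[j - 1] else 1   # d(q_i, l_j)
            f[i][j] = min(f[i - 1][j] + 1,            # deletion
                          f[i][j - 1] + 1,            # insertion
                          f[i - 1][j - 1] + d)        # match / substitution
    return f[m][n]

def agrep_match(pattern, text, k):
    """Bit-parallel approximate matching in the style of Wu and Manber's
    agrep.  Bit i of R[d] is set when pattern[0..i] matches a suffix of
    the scanned text with at most d errors; R[d] starts with d bits set,
    as in (3), and each character updates all k+1 vectors as in (4).
    Assumes 0 <= k < len(pattern)."""
    m = len(pattern)
    S = {}                                            # S_c masks
    for i, ch in enumerate(pattern):
        S[ch] = S.get(ch, 0) | (1 << i)
    R = [(1 << d) - 1 for d in range(k + 1)]
    match_bit = 1 << (m - 1)
    for ch in text:
        mask = S.get(ch, 0)
        old = R[:]
        R[0] = ((old[0] << 1) | 1) & mask             # exact-match step
        for d in range(1, k + 1):                     # the recurrence of (4)
            R[d] = ((((old[d] << 1) | 1) & mask)      # character match
                    | ((old[d - 1] << 1) | 1)         # substitution
                    | old[d - 1]                      # insertion in the text
                    | ((R[d - 1] << 1) | 1))          # deletion of a pattern char
        if R[k] & match_bit:
            return True
    return False
```

With this sketch, edit_distance("him", "ham") and edit_distance("ham", "harm") are both 1, matching the examples in Section 2.1, and agrep_match("ham", "harm", 1) succeeds because "harm" contains "ham" with one inserted character.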