SPELL CHECKERS AND CORRECTORS: A UNIFIED TREATMENT

By Hsuan Lorraine Liang

Submitted in partial fulfilment of the requirements for the degree Master of Science (Computer Science) in the Faculty of Engineering, Built Environment and Information Technology, University of Pretoria, Pretoria, South Africa, November 2008.

© University of Pretoria

To my late grandfather, Shu-Hsin.
To my parents, Chin-Nan and Hsieh-Chen.

Abstract

The aim of this dissertation is to provide a unified treatment of various spell checkers and correctors. Firstly, the spell checking and correcting problems are formally described in mathematical terms in order to provide a better understanding of these tasks. The approach adopted is similar to the way in which denotational semantics is used to describe programming languages. Secondly, the various attributes of existing spell checking and correcting techniques are discussed. Extensive studies of selected spell checking/correcting algorithms and packages are then performed. Lastly, an empirical investigation of various spell checking/correcting packages is presented. It provides a comparison and suggests a classification of these packages in terms of their functionalities, implementation strategies, and performance. The investigation was conducted on packages for spell checking and correcting in English as well as in Northern Sotho and Chinese. The classification provides a unified presentation of the strengths and weaknesses of the techniques studied in the research. The findings provide a better understanding of these techniques, which should assist in improving existing spell checking/correcting applications as well as future spell checking/correcting package designs and implementations.

Keywords: classification, dictionary lookup, edit distance, formal concept analysis, FSA, n-gram, performance, spell checking, spell correcting.

Degree: Magister Scientiae
Supervisors: Prof B. W. Watson and Prof D. G. Kourie
Department of Computer Science

Acknowledgements

First and foremost, I would like to thank my supervisors, Bruce Watson and Derrick Kourie, for directing my research, reviewing my ideas, and providing feedback on my work. I would also like to take this opportunity to express my gratitude for their continuous patience, support, and encouragement.

I would like to thank my parents, Chin-Nan and Hsieh-Chen Liang, my brother, Albert Liang, and my grandmother, Yu-Fang Fan, for their unconditional love and support.

A special thanks to Ernest Ketcha Ngassam for many technical discussions and much inspiration during this research. He devoted a great deal of his time to helping me kick-start this work. Sergei Obiedkov was helpful in pointing me to related research on formal concept analysis. TshwaneDJe HLT and its CEO, David Joffe, kindly provided the Northern Sotho corpora as well as assistance with the language.

Table of Contents

Abstract
Acknowledgements
Table of Contents
1 Introduction
1.1 Problem Statement
1.2 Structure of this Dissertation
1.3 Intended Audience
2 Mathematical Preliminaries
2.1 Introduction
2.2 Notation
2.3 Denotational Semantics
2.3.1 Spell checking and correcting problems
2.3.2 n-gram analysis
2.4 Conclusion
3 Spell Checking and Correcting Techniques
3.1 Introduction
3.2 Morphological Analysis
3.3 Spell Checking Techniques
3.3.1 Dictionary lookup techniques
3.3.2 n-gram analysis
3.4 Spell Correcting Techniques
3.4.1 Minimum edit distance
3.4.2 Similarity key techniques
3.4.3 Rule-based techniques
3.4.4 n-gram-based techniques
3.4.5 Probabilistic techniques
3.4.6 Neural networks
3.5 Other Related Issues
3.6 Conclusion
4 Spell Checking and Correcting Algorithms
4.1 Introduction
4.2 SPELL
4.3 CORRECT
4.4 ASPELL
4.5 SPEEDCOP
4.6 FSA
4.7 AGREP
4.8 AURA
4.8.1 Spell checking in AURA
4.8.2 Spell correcting in AURA
4.8.3 Phonetic checking and correcting
4.9 Spell Checkers/Correctors for the South African Languages
4.10 CInsunSpell: Chinese Spell Checking Algorithms
4.11 Conclusion
5 Empirical Investigation and Classification
5.1 Introduction
5.2 Spell Checking and Correcting for English
5.2.1 Data Sets
5.2.2 Performance
5.3 Spell Checking and Correcting for Northern Sotho
5.3.1 Data Set
5.3.2 Performance
5.4 Spell Checking and Correcting for Chinese
5.4.1 Data Set
5.4.2 Performance
5.5 Classification
5.5.1 Concept lattice
5.6 Conclusions
6 Conclusions and Future Work
6.1 Conclusions
6.2 Future Work
Bibliography

List of Tables

3.1 An example of a simple non-positional binary bigram array for English.
3.2 An example of a simple non-positional frequency bigram array for English.
3.3 An extract of the probabilities of bigrams generated from Table 3.2.
4.1 An example of the deletion table for the word after.
4.2 Example of a bit mask table.
4.3 Recalling from the CMM using binary Hamming distance.
4.4 Recalling from the CMM using shifting n-grams.
5.1 Spell checking results for English using the first data set (DS1).
5.2 Single-letter errors detected using DS1.
5.3 Spell checking results for English using the second data set (DS2).
5.4 Spell correction results for English using DS1.
5.5 Spell correction results for English using DS2.
5.6 Spell checking results for CInsunSpell.
5.7 Classification of spell checking/correcting packages, Part I.
5.8 Classification of spell checking/correcting packages, Part II.

List of Figures

5.1 Spell checking recall rates for spell checking/correcting packages for English.
5.2 Suggestion accuracy for various spell checking/correcting packages for English.
5.3 Mean time performance of the spell checking/correcting packages for English.
5.4 Concept lattice for various spell checking and correcting packages.

Chapter 1
Introduction

In this chapter, we state the problem addressed in the domain of spell checking and correcting, identify the intended audience of this research, and introduce the contents and structure of this dissertation. This study is restricted to context-independent spelling error detection and correction.

1.1 Problem Statement

Spell checkers and correctors are either stand-alone applications capable of processing a string of words or a text, or embedded tools that form part of a larger application, such as a word processor. Various search and replace algorithms are adapted to the domain of spell checking and correcting: spelling error detection and correction are closely related to exact and approximate pattern matching, respectively.

Work on spelling error detection and correction in text started in the 1960s. A number of commercial and non-commercial spell checkers and correctors, such as the ones embedded in Microsoft Word, Unix® spell, GNU's ispell, aspell and other variants, and agrep, have been extensively studied. However, spell correcting techniques in particular are still limited in their scope, speed, and accuracy.

Spell checking identifies the words that are valid in some language, as well as the misspelled words in that language. Spell correcting suggests one or more alternative words as the correct spelling when a misspelled word is identified. Spell checking involves non-word error detection, while spelling correction involves isolated-word error correction. Isolated-word error correction refers to spell correcting without taking into account any textual or linguistic context in which the misspelling occurs, whereas context-dependent word correction corrects errors by drawing on that textual or linguistic context.

Between the early 1970s and early 1980s, research focused on techniques for non-word error detection. A non-word can be defined as a continuous string of letters of an alphabet that does not appear in a given word list or dictionary, or that is an invalid word form. Non-word spelling error detection can generally be divided into two main techniques, namely dictionary lookup techniques and n-gram analysis. In the past, text recognition systems, such as OCR technology, tended to rely on n-gram analysis for error detection, whereas spell checkers mostly relied on dictionary lookup techniques. Each technique can be used individually as a basis, in combination, or together with other methods, such as probabilistic techniques. Both detection approaches are illustrated in the sketch below.
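As a concrete illustration of the two detection approaches just described, the following is a minimal sketch in Python. The toy word list, the helper names, and the acceptance criterion are illustrative assumptions for this discussion, not the implementation of any package studied in later chapters.

```python
from typing import Set

# Illustrative toy dictionary; a real checker would load a full word list.
DICTIONARY: Set[str] = {"the", "spell", "checker", "flags", "valid", "words"}

def is_valid_by_lookup(word: str) -> bool:
    """Dictionary lookup: accept a word iff it appears in the word list."""
    return word.lower() in DICTIONARY

def bigrams(word: str) -> Set[str]:
    """All adjacent letter pairs of a word, e.g. 'spell' -> {sp, pe, el, ll}."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

# Precompute the set of bigrams attested anywhere in the dictionary.
VALID_BIGRAMS: Set[str] = set().union(*(bigrams(w) for w in DICTIONARY))

def is_plausible_by_bigrams(word: str) -> bool:
    """n-gram analysis: reject a word containing an unattested bigram."""
    return bigrams(word.lower()) <= VALID_BIGRAMS

if __name__ == "__main__":
    for w in ["spell", "spel", "xqzj"]:
        print(w, is_valid_by_lookup(w), is_plausible_by_bigrams(w))
```

Note that dictionary lookup is exact (a word is either listed or not), while the bigram test only judges plausibility: a non-word built entirely from attested bigrams, such as spel above, slips through, which is one reason the two techniques are often combined with other methods.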
Isolated-word error correction techniques are often designed based on a study of spelling error patterns, where distinctions are generally made between typographic, cognitive, and phonetic errors. Techniques such as minimum edit distance, n-gram-based techniques, neural networks, probabilistic techniques, rule-based techniques, and similarity key techniques are discussed in this dissertation; a minimal sketch of the minimum edit distance approach appears at the end of this section.

In [Kuk92], Kukich gave a survey that provided an overview of spell checking and correcting techniques and a discussion of some existing algorithms. Chapters 5 and 6 of the book by Jurafsky and Martin [JM00] were based on Kukich's work. Hodge and Austin [HA03] compared standard spell checking algorithms to a spell checking system based on the AURA modular neural network; their research is heavily focused on this hybrid spell checking methodology.

Although descriptions of some of the investigated spell checking and correcting algorithms can be found in the references mentioned above, little has been done to compare and present the existing algorithms in a unified manner, nor has any formal classification been undertaken previously. This dissertation aims to fill these gaps and to provide a unified treatment of various spell checking and correcting algorithms.
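To close this section, here is the minimal sketch of minimum edit distance correction promised above: dictionary words are ranked by their Levenshtein distance to a misspelling, and the closest ones are suggested. The word list and the distance cut-off are illustrative assumptions rather than settings taken from any package examined in later chapters.

```python
from typing import List

# Illustrative toy dictionary; an assumption for this sketch only.
DICTIONARY: List[str] = ["the", "spell", "checker", "corrector", "suggests", "words"]

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-letter
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        cur = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,                # delete ca
                cur[j - 1] + 1,             # insert cb
                prev[j - 1] + (ca != cb),   # substitute ca by cb (free if equal)
            ))
        prev = cur
    return prev[len(b)]

def suggest(misspelling: str, max_distance: int = 2) -> List[str]:
    """Suggest dictionary words within max_distance of the misspelling, closest first."""
    scored = sorted((edit_distance(misspelling, w), w) for w in DICTIONARY)
    return [w for d, w in scored if d <= max_distance]

if __name__ == "__main__":
    print(suggest("spel"))    # ['spell'] with this toy dictionary
```

Keeping only one row of the dynamic programming table at a time makes the memory cost linear in the length of the shorter string; practical correctors also avoid scoring every dictionary word. Chapter 3 returns to minimum edit distance in more detail.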