An Unsupervised and Data-Driven Approach for Spell Checking in Vietnamese OCR-Scanned Texts
Cong Duy Vu HOANG & Ai Ti AW
Department of Human Language Technology (HLT)
Institute for Infocomm Research (I2R)
A*STAR, Singapore
{cdvhoang,aaiti}@i2r.a-star.edu.sg

Abstract

OCR (Optical Character Recognition) scanners do not always produce 100% accuracy in recognizing text documents, leading to spelling errors that make the texts hard to process further. This paper presents an investigation of the task of spell checking for OCR-scanned text documents. First, we conduct a detailed analysis of the characteristics of spelling errors produced by an OCR scanner. Then, we propose a fully automatic approach that combines both the error detection and correction phases within a unified scheme. The scheme is designed in an unsupervised & data-driven manner, suitable for resource-poor languages. Based on an evaluation on a real dataset in the Vietnamese language, our approach gives acceptable performance (detection accuracy 86%, correction accuracy 71%). In addition, we also give a result analysis to show how accurate our approach can be.

1 Introduction and Related Work

Documents that are only available in print require scanning by OCR devices for retrieval or e-archiving purposes (Tseng, 2002; Magdy and Darwish, 2008). However, OCR scanners do not always produce 100% accuracy in recognizing text documents, leading to spelling errors that make the texts hard to process further. Several factors may cause these errors. For instance, shape or visual similarity leads OCR scanners to misrecognize some characters; or the input text documents are of poor quality, causing noise in the resulting scanned texts. The task of spell checking for OCR-scanned text documents aims to address this situation.

Researchers in the literature have approached this task for various languages, such as English (Tong and Evans, 1996; Taghva and Stofsky, 2001; Kolak and Resnik, 2002), Chinese (Zhuang et al., 2004), Japanese (Nagata, 1996; Nagata, 1998), Arabic (Magdy and Darwish, 2006), and Thai (Meknavin et al., 1998).

The most common approach is to involve users for their intervention with computer support. Taghva and Stofsky (2001) designed an interactive system (called OCRSpell) that assists users with as many interactive features as possible during correction, such as choosing word boundaries, memorizing user-corrected words for future correction, and providing specific prior knowledge about typical errors. For certain applications requiring automation, the interactive scheme may not work.

Unlike Taghva and Stofsky (2001), non-interactive (or fully automatic) approaches have also been investigated. Such approaches need pre-specified lexicons & confusion resources (Tong and Evans, 1996), language-specific knowledge (Meknavin et al., 1998), or manually created phonetic transformation rules (Hodge and Austin, 2003) to assist the correction process.

Other approaches used supervised mechanisms for OCR error correction, such as statistical language models (Nagata, 1996; Zhuang et al., 2004; Magdy and Darwish, 2006) and the noisy channel model (Kolak and Resnik, 2002). These approaches performed well but are limited by requiring large annotated training data specific to OCR spell checking in each language, which are very hard to obtain.

Further, research in spell checking for the Vietnamese language has been understudied.

Proceedings of the Workshop on Innovative Hybrid Approaches to the Processing of Textual Data (Hybrid2012), EACL 2012, pages 36–44, Avignon, France, April 23 2012.
© 2012 Association for Computational Linguistics

Hunspell-spellcheck-vn[1] & Aspell[2] are interactive spell checking tools that work based on pre-defined dictionaries. To the best of our knowledge, no work in the literature has reported on the task of spell checking for Vietnamese OCR-scanned text documents. In this paper, we approach this task with: 1) a fully automatic scheme; 2) no annotated corpora; and 3) the capability of solving both non-word & real-word spelling errors simultaneously. Such an approach will be beneficial for a resource-poor language like Vietnamese.

2 Error Characteristics

First of all, we observe and analyse the characteristics of OCR-induced errors compared with typographical errors in a real dataset.

2.1 Data Overview

We used a total of 24 samples of Vietnamese OCR-scanned text documents for our analysis. Each sample contains real & OCR texts, referring to texts without & with spelling errors, respectively. Our manual sentence segmentation gives a total of 283 sentences for the above 24 samples, with 103 good (no errors) and 180 bad (errors existed) sentences. Also, the numbers of syllables[3] in the real & OCR sentences (over all samples) are 2392 & 2551, respectively.

2.2 Error Classification

We carried out an in-depth analysis of the spelling errors, identified existing errors, and then manually classified them into three pre-defined error classes. For each class, we also figured out how an error is formed. As a result, we classified OCR-induced spelling errors into three classes:

Typographic or Non-syllable Errors (Class 1): refer to incorrect syllables (not included in a standard dictionary). Normally, at least one character of a syllable is misspelled.

Real-syllable or Context-based Errors (Class 2): refer to syllables that are correct in terms of their existence in a standard dictionary but incorrect in terms of their meaning in the context of the given sentence.

Unexpected Errors (Class 3): are accidentally formed by unknown operators, such as inserting non-alphabet characters, incorrect upper-/lower-casing, splitting/merging/removing syllable(s), changing syllable orders, etc.

Note that errors in Class 1 & 2 can be formed by applying one of 4 operators[4] (Insertion, Deletion, Substitution, Transposition). Class 3 is exclusive, formed by unexpected operators. Table 1 gives some examples of the 3 error classes. An important note is that an erroneous syllable can contain errors across different classes. Class 3 can appear with Class 1 or Class 2, but Class 1 never appears with Class 2. For example:
− hoàn (correct) || Hòan (incorrect) (Class 3 & 1)
− bắt (correct) || bặt' (incorrect) (Class 3 & 2)

2.3 Error Distribution

Our analysis reveals a total of 551 recognized errors over all 283 sentences. Each error is classified into one of the three wide classes (Class 1, Class 2, Class 3). Specifically, we also identified the operators used in Class 1 & Class 2. As a result, we have a total of 9 more fine-grained error classes (1A..1D, 2A..2D, 3)[5].

We explored the distribution of the 3 error classes in our analysis. Class 1 is distributed the most, followed by Class 3 (slightly less) and Class 2.

Figure 1: Distribution of operators used in Class 1 (left) & Class 2 (right).

[1] http://code.google.com/p/hunspell-spellcheck-vi/
[2] http://en.wikipedia.org/wiki/GNU_Aspell/
[3] In the Vietnamese language, we use the word "syllable" instead of "token" to refer to a unit separated by spaces.
[4] Their definitions can be found in (Damerau, 1964).
[5] A, B, C, and D represent Insertion, Deletion, Substitution, and Transposition, respectively. For instance, 1A means Insertion in Class 1.

Table 1: Examples of error classes.
Class 1 — Insertion: áp (correct) || áip (incorrect) ("i" inserted); Deletion: không (correct) || kh (incorrect) ("ô", "n", and "g" deleted); Substitution: yếu (correct) || ỵếu (incorrect) ("y" substituted by "ỵ"); Transposition: N.A.(a)
Class 2 — Insertion: lên (correct) || liên (contextually incorrect) ("i" inserted); Deletion: trình (correct) || tình (contextually incorrect) ("r" deleted); Substitution: ngay (correct) || ngây (contextually incorrect) ("a" substituted by "â"); Transposition: N.A.(a)
Class 3 — xác nhận là (correct) || x||nha0a (incorrect): 3 syllables were misspelled & accidentally merged.
(a) Our analysis reveals no examples for the Transposition operator.

Generally, each class contributed a certain quantity of errors (38%, 37%, & 25%), making the correction process more challenging. In addition, there are a total of 613 counts for the 9 fine-grained classes (over the 551 errors in 283 sentences), yielding an average & standard deviation of 3.41 & 2.78, respectively. Also, the number of (fine-grained) error classes that a single erroneous syllable can contain is distributed as follows: 1 (492), 2 (56), 3 (3), 4 (0), where (N) is the count of cases.

We can also observe the distribution of operators used within each error class in Figure 1. The Substitution operator was used the most in both Class 1 & Class 2, holding 81% & 97%, respectively. Only a few other operators (Insertion, Deletion) were used. Notably, the Transposition operator was not used in either Class 1 or Class 2. This reflects the fact that OCR scanners normally have ambiguity in recognizing similar characters.

3 Proposed Approach

The architecture of our proposed approach (namely VOSE) is outlined in Figure 2. Our purpose is to develop VOSE as an unsupervised, data-driven spell checking scheme. Currently, VOSE implements two different correctors to produce the final output: a Contextual Corrector and a Weighting-based Corrector. The Contextual Corrector employs language modelling to rank a list of potential candidates in the scope of the whole sentence, whereas the Weighting-based Corrector chooses, for each syllable, the best candidate with the highest weight. The following gives detailed descriptions of all components developed in VOSE.

3.1 Pre-processor

The Pre-processor takes in the input text and performs the tokenization & normalization steps. Tokenization in Vietnamese is similar to that in English. The normalization step includes: normalizing Vietnamese tones & vowels (e.g. hòa –> hoà), standardizing upper-/lower-case, finding numbers/punctuation/abbreviations, removing noise characters, etc. This step also extracts unigrams. Each unigram is then checked for existence in a pre-built list of unigrams (from large raw text data). Unigrams that do not exist in the list are regarded as potential Class 1 & 3 errors and passed to the Non-syllable Detector.
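The pre-processing flow just described (tokenize, normalize tones, strip noise characters, then flag unigrams missing from a pre-built list) can be sketched as follows. This is a simplified illustration rather than VOSE's actual code: the normalization table (`TONE_NORM`), the unigram list, and the noise pattern are toy assumptions.

```python
import re

# Toy tone/vowel normalization table (assumption; the real rule set is much larger).
TONE_NORM = {"hòa": "hoà", "hòan": "hoàn"}

# Pre-built unigram list from large raw text data (assumption: a tiny toy set here).
KNOWN_UNIGRAMS = {"hoà", "bắt", "xác", "nhận", "là", "không", "yếu"}

def preprocess(text):
    """Tokenize and normalize the input, then flag unigrams that are not in
    the pre-built list as potential Class 1 & 3 errors."""
    text = re.sub(r"[\x00-\x1f]", " ", text)   # remove noise/control characters
    tokens = text.split()                      # whitespace tokenization
    # lower-casing stands in for the paper's case standardization step
    syllables = [TONE_NORM.get(t.lower(), t.lower()) for t in tokens]
    potential_errors = [s for s in syllables if s not in KNOWN_UNIGRAMS]
    return syllables, potential_errors

syllables, flagged = preprocess("hòa bắt x||nha0a")
print(flagged)  # -> ['x||nha0a']  (unknown unigram, a potential Class 1 & 3 error)
```

The flagged syllables would then be handed to the Non-syllable Detector, while in-list syllables proceed to context-based checking.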
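The Contextual Corrector described above ranks candidate corrections with a language model over the whole sentence. A minimal sketch of that idea using a toy bigram model follows; the probability table, back-off score, and candidate lists are all invented for illustration, since the real VOSE model is trained on large raw text.

```python
import itertools

# Toy bigram log-probabilities (assumption; trained on large raw text in practice).
BIGRAM_LOGP = {
    ("<s>", "lên"): -2.0, ("<s>", "liên"): -2.5,
    ("lên", "núi"): -1.0, ("liên", "núi"): -6.0,
}
UNSEEN = -10.0  # crude back-off score for unseen bigrams (assumption)

def sentence_score(syllables):
    """Sum of bigram log-probabilities over the whole sentence."""
    pairs = zip(["<s>"] + syllables, syllables)
    return sum(BIGRAM_LOGP.get(p, UNSEEN) for p in pairs)

def contextual_correct(candidate_lists):
    """Pick the candidate combination with the best whole-sentence LM score."""
    best = max(itertools.product(*candidate_lists),
               key=lambda s: sentence_score(list(s)))
    return list(best)

# OCR output "liên núi" vs. candidate "lên núi": the sentence-level
# model prefers the latter.
print(contextual_correct([["liên", "lên"], ["núi"]]))  # -> ['lên', 'núi']
```

By contrast, a weighting-based corrector would score each syllable's candidates independently instead of searching over whole-sentence combinations.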