
Discourse-aware Statistical Machine Translation as a Context-Sensitive Spell Checker

Behzad Mirzababaei, Heshaam Faili and Nava Ehsan
School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Tehran, Iran
{b.mirzababaei,hfaili,n.ehsan}@ut.ac.ir

Abstract

Real-word errors, or context-sensitive spelling errors, are misspelled words that have been wrongly converted into another word of the vocabulary. One way to detect and correct real-word errors is to use Statistical Machine Translation (SMT), which translates a text containing some real-word errors into a correct text of the same language. In this paper, we improve the results of such an SMT system by employing discourse-aware features in a log-linear reranking method. Our experiments on real-world test data in Persian show an improvement of about 9.5% and 8.5% in the recall of detection and correction respectively. Other experiments on standard English test sets also show considerable improvement in real-word checking results.

1 Introduction

Kukich (1992) categorized the errors of a text into five categories: 1. isolated errors, 2. syntactic errors, 3. real-word errors, 4. discourse structure errors and 5. pragmatic errors. In this paper, we focus on the third category, which is also referred to as context-sensitive spelling errors. This type of error consists of misspelled words that are converted to another word of the vocabulary (e.g., typing "arm" instead of "are" in the sentence "we arm good"). In order to detect and correct this kind of error, context analysis of the text is crucial.

Here, we propose a language-independent method, which is based on phrase-based Statistical Machine Translation (SMT). In this case, the input and output sentences are both in the same language and the input sentence contains some real-word errors. Phrase-based SMT is weak in handling long-distance dependencies between the words of a sentence. In order to capture this kind of dependency, which affects detecting the correct candidate word, the mentioned SMT system is augmented with a discourse-aware method for reranking its N-best results.

Our work can be regarded as an extension of the method introduced by Ehsan and Faili (2013), in which SMT is used to detect and correct the spelling errors of a document. Here, however, we use the N-best results of SMT as a candidate list for each erroneous word and rerank the list with a discourse-aware reranking system, which is simply a log-linear ranker. In short, the contributions of this paper can be summarized as follows: The N-best results of SMT are regarded as a candidate list for the suspicious word, which is reranked by a discourse-aware reranking system. Two discourse-aware features are employed in a log-linear ranker. The keywords of the whole document surrounding the erroneous sentence are considered as the context window. We achieve about 5% improvement over the SMT-based approach in detection and correction recall and 1% in precision in the English experiments. State-of-the-art results are achieved for Persian context-sensitive spell checking with respect to the F-measure and Mean Reciprocal Rank metrics.

This paper is organized as follows: Section 2 presents an overview of related work. In Section 3, we explain attributes of the Persian language. In Section 4, we describe how SMT is used for generating candidate words. In Section 5, we discuss the approach for reranking the N-best results of SMT. Finally, we illustrate the experimental results and compare them with the SMT-based approach.

2 Related Works

Most of the previous works on real-word error detection and correction fall into two categories: 1. approaches based on statistics (Bassil & Alwani, 2012) and 2. approaches based on a separate resource such as WordNet (Fellbaum, 2010), as in (Pedler, 2007). Statistical methods use several features, such as N-gram models (Bassil & Alwani, 2012; Islam & Inkpen, 2009), POS tagging (Golding & Schabes, 1996), Bayesian classifiers (Gale, Church, & Yarowsky, 1992), decision lists (Yarowsky, 1994), a Bayesian hybrid method (Golding, 1995) and latent semantic analysis (Jones & Martin, 1997). Golding and Schabes (1996) combined the N-gram and POS-based methods and achieved a better result. Pedler (2007) used WordNet as a separate resource to extract the semantic relations of words. These methods consider fixed-length windows instead of the whole sentence as the context window.

Most of these methods use a confusion set for detecting real-word errors. The confusion set is a set of words that are confusable with the headword of the set. The words of the set are not necessarily confusable with each other (Faili, 2010). When the error checker comes across one of the words in a confusion set, it should select an appropriate word for the sentence. A machine-learning method based on the Winnow algorithm is proposed in (Golding & Roth, 1999) to resolve word ambiguities based on the words surrounding the spelling error. This method uses several features of the surrounding words, such as POS tags. The +/-10 words around the corresponding confusable word of the confusion set are considered as the context window.

Wilcox-O'Hearn et al. (2008) report a reconsideration of the work of Mays et al. (1991). They use three different lengths for the context window, namely 6, 10 and 14 words, and accommodate all the trigrams that overlap with the words in the window.

Some statistical methods use the Google Web 1T N-gram data set to detect and select the best correction for a real-word error (Bassil & Alwani, 2012; Islam & Inkpen, 2009). Google Web 1T N-gram consists of N-gram word sequences extracted from the World Wide Web. 5-grams and 3-grams are used in these papers, so the context window in these methods is 9 and 5 words respectively.

There are few spell checkers for Persian, such as the works presented by Ehsan and Faili (2013) and Kashefi, Minaei-Bidgoli, and Sharifi (2010). In Kashefi et al. (2010), a new string-distance metric for Persian is presented to rank spelling suggestions. This ranking is based on the effect of the keyboard layout on typographical spelling errors.

A language-independent approach based on an SMT framework is presented by Ehsan and Faili (2013). This method achieved state-of-the-art results for grammar checking and context-sensitive spell checking in Persian. Here, we also use SMT as a candidate generator for checking real-word errors, but our approach differs from that work in the following ways: we consider the keywords of the whole document as context-aware features; SMT is used only as a candidate generator; and we train a log-linear reranking system as a post-processing step to rerank the candidate list. Our experiments on real-world test data in Persian show an improvement of about 9.5% and 8.5% in the recall of detection and correction respectively over the method of Ehsan and Faili (2013).

3 Persian Language

Persian, or Farsi, is an Indo-European language. It is mostly spoken in Iran, Afghanistan and Tajikistan, with the dialects Farsi, Dari and Tajik respectively. The Persian language has a rich morphology (Megerdoomian, 2000) in which words can be combined with a very large number of affixes. Combination, derivation, and inflection rules in Persian are uncertain (Lazard & Lyon, 1992; Mahootian, 2003).

The alphabet of Farsi is the same as that of Arabic with four additional letters. The alphabet contains 26 consonants and 6 vowels. There are also some homophone and homograph letters. For example, «ز», «ذ», «ظ» and «ض» are homophones which all sound as /z/, while «ب»/b, «پ»/p, «ت»/t and «ث»/s are homograph letters which differ only in the number and placement of their dots. These phonetic and graphical similarities cause many spelling errors. In the next section, we describe how SMT is used to detect context-sensitive spelling errors in a sentence and generate candidates.

4 SMT as a Candidate Generator

The SMT framework can be used to model a context-sensitive spell checker, which translates a word that does not fit in a sentence into some suggestions for the suspicious word.

SMT uses parallel corpora as training data. It learns phrases of the language and features such as phrase probabilities and reordering probabilities. In order to use the SMT framework, a confusion set for each word is defined. The confusion set of a headword wi is a set of words {wi1, wi2, …, win} in which each word wij could be converted to wi with one editing operation of insertion, deletion, substitution or transposition. The Damerau-Levenshtein metric (Damerau, 1964) is used for calculating the distance between two words. If their distance is lower than a pre-defined threshold of one editing operation, the two words are considered similar and wj is added to the confusion set of wi. For example, the confusable words in the confusion set of the word روز ruz 'day' are as follows: روزه ruze 'fast', روش ravesh 'method', رود rud 'river' and روح ruh 'spirit'.

If E = {w1, w2, …, wi, …, wn} is a sentence and wi is a real-word error in the sentence, wi could appear in several confusion sets; thus, there are several headwords as candidates for the suspicious word. In other words, each headword that has wi in its confusion set can be suggested as the correct word. To formulate this, consider C = {w1, w2, …, w'i, …, wn} to be the correct sentence; then w'i is defined as follows (Ehsan & Faili, 2013):

    $w_i' = w_i \ \text{or} \ \big(w_{j,0}\ \text{such that}\ \exists j,k:\ w_{j,k} = w_i\big)$    (1)

Equation (1) implies that the correct word w'i is either wi itself or one of the headwords whose confusion set contains wi. For each erroneous sentence E, which contains the real-word error wi, we can define the N-best candidate sentences $\bar{C}$ as follows:

    $\bar{C} = N\text{-}\underset{C}{\arg\max}\ \frac{P(E|C)\,P(C)}{P(E)}$    (2)

P(E) in Equation (2) is the probability of the erroneous sentence occurring, which is constant for every candidate sentence and can therefore be removed from Equation (2). P(E|C) can be defined as follows:

    $P(E|C) = P(w_1, \dots, w_i, \dots, w_n \mid w_1, \dots, w_i', \dots, w_n)$    (3)

In Equation (3), each w is a word. In order to estimate P(E|C) in Equation (3), we can convert E and C from the word level to the phrase level, $E = \bar{e}_1, \bar{e}_2, \dots, \bar{e}_I$ and $C = \bar{c}_1, \bar{c}_2, \dots, \bar{c}_I$. Using phrase-based SMT, we can capture some local dependencies among the words, resulting in better detection and correction of real-word errors. Assuming that wi is in the j-th phrase of E, we can estimate P(E|C) as follows:

    $P(E|C) = P(\bar{e}_j \mid \bar{c}_j) = \frac{\mathrm{count}(\bar{e}_j, \bar{c}_j)}{\sum_{\bar{e}_j} \mathrm{count}(\bar{e}_j, \bar{c}_j)}$    (4)

Equation (4) is the same as the phrasal translation model in phrase-based SMT systems. Therefore, we can use a phrase-based SMT system to correct context-sensitive spelling errors. In this paper, Moses (Koehn et al., 2007) is used as the phrase-based SMT system. When using SMT as a context-sensitive spell checker, the source and target sentences are in the same language: the source sentences contain real-word errors, while the target sentences contain the confusable words in their correct form. After generating candidate sentences by retrieving the N-best results of the mentioned SMT system, we rerank the candidate list by discourse-aware features, which are described in the next section.
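To make the confusion-set construction above concrete, the following minimal Python sketch builds confusion sets from a vocabulary using a one-edit Damerau-Levenshtein threshold. It is our own illustration, not the authors' implementation; the function names and the naive quadratic scan over the vocabulary are assumptions made for brevity.

```python
def damerau_levenshtein(a: str, b: str) -> int:
    """Damerau-Levenshtein distance (optimal string alignment variant):
    insertions, deletions, substitutions and adjacent transpositions."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]


def build_confusion_sets(vocabulary, threshold=1):
    """Map each headword to all vocabulary words within the threshold,
    i.e. reachable by one editing operation as defined in the paper."""
    return {head: [w for w in vocabulary
                   if w != head and damerau_levenshtein(head, w) <= threshold]
            for head in vocabulary}
```

With a vocabulary containing روز, روزه, روش, رود and روح, build_confusion_sets places the latter four words in the confusion set of روز, matching the example above.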
5 Discourse-aware Features

For any given sentence, the SMT-based approach retrieves a list of candidate sentences. The phrase-based SMT does not take the whole context of the sentence into account. Thus, in order to find the correct sentence in the candidate list and obtain a better ranking, we define additional features that indicate the affinity of each word in a candidate sentence with the whole context. Both the sentence and the whole document are considered as the context of the candidate sentences.

For example, in the sentence "This cat is black.", both "cat" and "car" could be meaningful. Considering just the sentence as the context window, we cannot identify whether "cat" or "car" is correct. Discourse analysis may help us detect the best candidate: if we know the document is about automobiles or about animals, then we can rerank the candidates better. In other words, considering the whole document as the context window is more helpful than considering just the sentence when reranking the candidates. Here, we benefit from discourse by capturing the relations among the words of a candidate sentence and between those words and the keywords of the whole document. In Subsection 5.1, we show that by using the Point-wise Mutual Information (PMI) measure, we can capture the long-distance dependencies between the words of a document.

Candidate        1st            2nd            3rd            4th            5th            6th         7th
Detected word    دندان→چندان    دندان→دندان    دندان→زندان    دندان→مندان    دندان→دزدان    متر→مصر     دندان→بندان
PMI_sentence     -10.8908       -10.8103       -10.8506       -10.9654       -9.94          -10.7639    -10.8488
PMI_discourse    -7.1539        -7.1549        -7.1548        -7.1552        -7.05          -7.1606     -7.1523

Table 1: One erroneous sentence with 7 candidate sentences and their PMIs.

5.1 Contextual Features

We select features that describe information about the context of the sentences. PMI is used to measure the relation between candidate sentences and the document, and also to measure the co-occurrence among the words of the sentence. Another feature that gives useful information about the fluency of candidate sentences is the language model (LM) score of the sentence. A monolingual corpus is required for calculating PMI and LM. The PMI of two words A and B is calculated as follows:

    $PMI(A,B) = \frac{Doc\_Count(A,B)}{Doc\_Count(A) \times Doc\_Count(B)}$    (5)

In Equation (5), Doc_Count(A) is the number of documents that contain word A, and Doc_Count(A,B) is the number of documents that contain both A and B. We formulate two criteria based on PMI for each candidate sentence: PMI_discourse and PMI_sentence. PMI_discourse is the PMI of the candidate sentence with its discourse, while PMI_sentence is the PMI among the words of the candidate sentence.

The PMI of all words of the candidate sentence with the keywords of the document is calculated as PMI_discourse. For extracting the keywords, the term frequency (TF) and inverse document frequency (IDF) measure is used, as in (Li & Zhang, 2007). For each sentence of the test data, 50 keywords are extracted from its discourse. To formulate this, consider W a sentence in the test data and S_j = {w_j1, w_j2, …, w_jn} the j-th candidate sentence produced by the SMT-based approach. Let C_W = {c_1, c_2, …, c_50} be the 50 keywords of the document containing W. PMI_discourse for S_j is calculated as follows:

    $PMI_{discourse}(S_j) = \frac{\sum_{k=1}^{n}\sum_{m=1}^{50} PMI(w_{jk}, c_m)}{n \times 50}$    (6)

In Equation (6), n is the number of sentence words, c_m is the m-th keyword of the discourse and w_jk is the k-th word of the j-th candidate for W. Since PMI measures the co-occurrence of two different words, two identical words have maximum PMI in the sentence. In this way, if a word in the candidate is a keyword of the context, the corresponding PMI_discourse is increased. Consider S_j = {This, cat, is, black} and S_k = {This, car, is, black} as candidates of an erroneous sentence W. If the discourse of W is about automobiles, then PMI_discourse(S_k) > PMI_discourse(S_j), because the co-occurrence of "car" with the keywords of an automobile-related document is greater than the co-occurrence of "cat" with those keywords.

The second criterion is PMI_sentence, which refers to the co-occurrence of the sentence words with each other. To calculate it, the PMI of all word pairs of the candidate sentence is computed. PMI_sentence of S_j is calculated as follows:

    $PMI_{sentence}(S_j) = \frac{\sum_{k=1}^{n}\sum_{m=k}^{n} PMI(w_{jk}, w_{jm})}{n \times \frac{n-1}{2}}$    (7)

In Equation (7), n is the number of words of the sentence and w_jk is the k-th word of the j-th candidate of W. Table 1 shows an example from our Persian artificial test data in which the PMI_discourse and PMI_sentence of the correct candidate are higher than those of the candidate that the SMT-based approach suggests. The input sentence is:

    دندان قوي هيكل دو متر از ريل راه آهن اوكراين را دزديدند
    dandaan-ghavi-hikal-do-metr-az-ril-raah-aahan-okraain-raa-dozdidand
    'Robust teeth stole two meters of Ukrainian railway.'

There are two confusable words in the sentence, دندان dandaan 'teeth' and متر metr 'meter'. SMT generates 7 candidate sentences, of which the 5th candidate is the correct one. As shown in Table 1, the first candidate, generated by SMT, has lower PMI_discourse and PMI_sentence scores than the correct sentence. By reranking the SMT results using PMI_discourse and PMI_sentence, we can put the correct sentence at a better rank or at the top of the list. The third contextual feature is the LM score, which is used to score the fluency of the candidate.

We consider the words surrounding the suspicious word, the whole sentence and the whole document as the context; accordingly, we use LM, PMI_sentence and PMI_discourse to extract this information. After calculating PMI_sentence, PMI_discourse and LM for all candidate sentences, a log-linear model is used to rerank the N-best results.
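As a sketch of how Equations (5)-(7) and the final feature combination might be computed, consider the following Python fragment. It is our own illustration under assumed names, not the authors' code; it follows the printed form of Equation (5) (a count ratio), while the negative scores in Table 1 suggest that in practice the logarithm of this ratio is taken, a choice we leave to the caller. The keyword list is assumed to come from a TF-IDF extractor as in (Li & Zhang, 2007), and the LM score from the language-model toolkit.

```python
from collections import Counter
from itertools import combinations


class PMIModel:
    """Document-level co-occurrence counts over a monolingual corpus;
    each document is an iterable of words."""

    def __init__(self, documents):
        self.doc_count = Counter()   # Doc_Count(A): documents containing A
        self.pair_count = Counter()  # Doc_Count(A,B): documents containing both
        for doc in documents:
            words = sorted(set(doc))
            self.doc_count.update(words)
            for a, b in combinations(words, 2):
                self.pair_count[(a, b)] += 1

    def pmi(self, a, b):
        # Equation (5) as printed: Doc_Count(A,B) / (Doc_Count(A) * Doc_Count(B)).
        if a > b:
            a, b = b, a
        denom = self.doc_count[a] * self.doc_count[b]
        return self.pair_count[(a, b)] / denom if denom else 0.0


def pmi_discourse(model, candidate, keywords):
    # Equation (6): average PMI of candidate words with the 50 keywords.
    total = sum(model.pmi(w, c) for w in candidate for c in keywords)
    return total / (len(candidate) * len(keywords))


def pmi_sentence(model, candidate):
    # Equation (7): average pairwise PMI among the candidate's own words
    # (unordered pairs, matching the n*(n-1)/2 denominator).
    n = len(candidate)
    if n < 2:
        return 0.0
    total = sum(model.pmi(a, b) for a, b in combinations(candidate, 2))
    return total / (n * (n - 1) / 2)


def loglinear_score(candidate, model, keywords, lm_score, weights):
    """Log-linear combination of the three features; the weights are the
    ones later learned by SVM-rank (Section 5.2)."""
    return (weights["pmi_sentence"] * pmi_sentence(model, candidate)
            + weights["pmi_discourse"] * pmi_discourse(model, candidate, keywords)
            + weights["lm"] * lm_score)
```

Reranking then amounts to sorting the 20-best candidates by loglinear_score; in the "This cat is black." example, automobile-related keywords raise pmi_discourse for the candidate containing "car".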

For reranking with a log-linear model we need the weight of each feature. A Support Vector Machine[1] (SVM) (Tsochantaridis, Joachims, Hofmann, Altun, & Singer, 2006) is used to weight each feature. SVM is a machine-learning algorithm based on statistical learning theory. It has been widely used in recent years, especially in function regression (Jeng, 2005) and pattern recognition (Tsai, 2005), for its good generalization performance (Burges, 1998).

5.2 Feature Weighting

A log-linear model is used to rerank the N-best results of SMT. Like (Hayashi, Watanabe, Tsukada, & Isozaki, 2009), we use SVM-rank to obtain the weight of each feature. A corpus containing erroneous and correct sentences is developed. For each sentence of the corpus, PMI_sentence, PMI_discourse and LM are calculated. We use this corpus as training data for SVM-rank to obtain the weights. In the next section, the details of all data sets are described more precisely.

6 Experiment Result

We evaluate the accuracy of the approach using false positive and false negative rates, as follows. False positive (FP) errors refer to real-word errors that were not identified by the SMT-based system. False negative (FN) errors refer to appropriately written words that the SMT-based approach detected as real-word errors. True positive (TP) results are correct words that are recognized as correct. True negative (TN) results refer to real-word errors that the SMT-based approach detected and changed, regardless of the correction. Finally, true negatives with correction (TNC) are real-word errors that the SMT-based approach was able to replace with the correct word. The evaluation metrics are computed as follows:

    $Precision = \frac{\#\ of\ TNC}{\#\ of\ TN}$    (8)

    $Correction\ Recall = \frac{\#\ of\ TNC}{\#\ of\ FP\ and\ TN}$    (9)

    $Detection\ Recall = \frac{\#\ of\ TN}{\#\ of\ FP\ and\ TN}$    (10)

Another metric for evaluating the N-best results retrieved by SMT is the Mean Reciprocal Rank (MRR), calculated as follows:

    $MRR = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{rank_i}$    (11)

In Equation (11), |Q| is the number of sentences of the test data and rank_i is the rank of the correct sentence in the 20-best results. We tested the SMT-based approach on two different languages, English and Persian. In the next subsections, we illustrate the results on each language.
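The following short sketch restates Equations (8)-(11) as code, using our own function names; the FP/TN/TNC counts are assumed to be tallied over a test set as defined above.

```python
def evaluation_metrics(fp, tn, tnc):
    """Equations (8)-(10). fp: real-word errors the system missed;
    tn: real-word errors detected (regardless of the suggested fix);
    tnc: detected errors replaced with the correct word."""
    precision = tnc / tn                 # Equation (8)
    correction_recall = tnc / (fp + tn)  # Equation (9)
    detection_recall = tn / (fp + tn)    # Equation (10)
    return precision, detection_recall, correction_recall


def mean_reciprocal_rank(ranks):
    """Equation (11); ranks[i] is the rank of the correct sentence
    in the 20-best list for the i-th test sentence."""
    return sum(1.0 / r for r in ranks) / len(ranks)
```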
6.1 Results on Persian Language

Our training data is generated from the Peykareh (Bijankhan, 2004), Hamshahri[2] and IRNA[3] data sets. Hamshahri and IRNA are collections of Persian news documents. These corpora contain 814, 166,774 and 179,574 documents of general text respectively. They have 56,241, 576,137 and 332,343 types and 2,530,772, 78,841,045 and 64,085,181 tokens respectively. All three corpora together contain 923,744 types.

Our confusion set is generated from all the mentioned data sets. It includes 5,000 headwords, and each headword has about 4 confusable words on average. For our experiments on Persian, we deployed two different test sets: an artificial and a real-world test set.

Our Persian real-world test data for context-sensitive spelling errors contains 1,100 sentences. The test set was selected manually from the Internet, mostly from Persian weblogs[4]. Each sentence contains 16.7 words on average and exactly one real-word error. The test set contains 27 insertion errors, 266 deletion errors, 527 substitution errors and 91 transposition errors. Only 89 errors, 8% of all errors, need more than one editing action.

We also made an artificial test data set for context-sensitive spelling errors. 1,500 sentences were selected randomly from the Peykareh corpus. The length of each sentence is between 4 and 20 words. For each sentence in the artificial test set, one real-word error was inserted artificially by replacing a random word with a word from its confusion set.

Our training corpus contains 381,007 sentence pairs selected from the mentioned corpora. After generating the training data, Moses is used as our SMT system, GIZA++ (Och & Ney, 2003) is used for word alignment and SRILM (Stolcke, 2002) is used as the LM toolkit. Our LM is created from Hamshahri and IRNA and contains 329,607 unigrams, 4,764,131 bigrams and 6,228,300 trigrams.

[1] http://svmlight.joachims.org/
[2] The Hamshahri2 test collection is available at: http://ece.ut.ac.ir/DBRG/Hamshahri/
[3] Islamic Republic News Agency: http://www.irna.ir
[4] The test set is available at: ece.ut.ac.ir/nlp/resources/

In order to develop training data for the SVM, a confusion set is generated. This confusion set contains 26,891 headwords, selected from Hamshahri and Peykareh, and each headword has 4.6 confusable words on average. 5,000 sentences from Hamshahri and Peykareh are selected randomly, all of which have at least one headword in the confusion set. For each sentence, one word is selected and replaced with one of its headwords. For each erroneous sentence, a maximum of 20 candidates is generated by SMT. 56,320 sentences are generated, 3,728 of which are correct sentences. For each sentence of the training data, PMI_sentence, PMI_discourse and LM are calculated and their values normalized. We used these 56,320 sentences as training data for SVM-rank to obtain the weights.

We generate a candidate list for each sentence of the test sets by using the SMT system and rerank the list in a post-processing step. In Table 2, the results of discourse-aware reranking on the real-world and artificial test data are shown. We selected the work of Ehsan and Faili (2013) as the baseline.

Experiments on Persian    Artificial test data    Real-world test data
Precision                 0.97 (-0.01%)           0.83 (-0.01%)
Detection recall          0.70 (+16%)             0.73 (+9.5%)
Correction recall         0.69 (+15%)             0.61 (+8.4%)
F-measure                 0.80 (+8.4%)            0.70 (+4.4%)
MRR                       0.71 (+8%)              0.67 (+4%)

Table 2: Summarized results on Persian test sets (the improvements are mentioned in parentheses).

As shown in Table 2, on both test sets the proposed ranker achieves a significantly superior result over the baseline with respect to the recall metrics, with comparable precision. Since the principle of discourse-aware SMT is language independent, we tested it on English as well.

6.2 Results on English Language

The test sets for English were drawn from two corpora: the Wall Street Journal (WSJ) and the Brown corpus. For the WSJ test set, a confusion set is generated with 73,437 headwords, each with 5.9 confusable words on average. We extract confusable words from WSJ based on one editing action. 1,500 sentences are selected from WSJ randomly, similar to the test sets developed in (Islam & Inkpen, 2009; Wilcox-O'Hearn et al., 2008). In each sentence, a real-word error is inserted randomly. The rest of WSJ is considered as training data for SMT.

Similar to the work of Golding and Roth (1999) and Jones and Martin (1997), we use 20% of the Brown corpus as test data, applied to 19 confusion sets. The test data contains 3,015 erroneous sentences (the test set is available at http://cogcomp.cs.illinois.edu/Data/Spell/). Training data for SMT is generated from WSJ and the rest, 80%, of the Brown corpus.

We tested the SMT-based approach on both artificial English test sets, generated candidates and reranked them with the discourse-aware features. Table 3 shows the results of discourse-aware reranking.

Experiments on English    WSJ test data    Brown test data
Precision                 0.97 (+0.001)    0.96 (+0.008%)
Detection recall          0.90 (+5.4%)     0.81 (+2.6%)
Correction recall         0.87 (+5.6%)     0.78 (+3.2%)
F-measure                 0.92 (+3%)       0.86 (+2.1%)
MRR                       0.88 (+3%)       0.83 (+1%)

Table 3: Summarized results on English test sets (the improvements are mentioned in parentheses).

As shown in Table 3, on both the WSJ and Brown test sets our proposed system outperforms the baseline with respect to all metrics, with a significant improvement in detection and correction recall.

7 Conclusion & Future work

We improved the SMT-based approach by extracting some contextual features, using a learning algorithm, SVM-rank, to obtain the weight of each feature, and reranking the N-best results with a log-linear model. The proposed ranker achieved a significantly superior result over the baseline with respect to the recall metrics, with comparable precision.

As future work, real-word errors requiring two editing actions can be injected into the training data. An ontology named FarsNet (Shamsfard, 2008) can be used as an external resource to identify semantic relationships between Persian words. We can also treat discourse-aware reranking as a Learning-To-Rank problem and apply it to any method that generates N-best results.

References

Bassil, Youssef, & Alwani, Mohammad. (2012). Context-sensitive spelling correction using Google Web 1T 5-gram information. arXiv preprint arXiv:1204.5852.

Bijankhan, Mahmood. (2004). The role of the corpus in writing a grammar: An introduction to a software. Iranian Journal of Linguistics, 19(2).

Burges, Christopher J. C. (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2), 121-167.

Damerau, Fred J. (1964). A technique for computer detection and correction of spelling errors. Communications of the ACM, 7(3), 171-176.

Ehsan, Nava, & Faili, Heshaam. (2013). Grammatical and context-sensitive error correction using a statistical machine translation framework. Software: Practice and Experience, 43(2), 187-206.

Faili, Heshaam. (2010). Detection and correction of real-word spelling errors in Persian language. In Proceedings of the 2010 International Conference on Natural Language Processing and Knowledge Engineering (NLP-KE).

Fellbaum, Christiane. (2010). WordNet: An electronic lexical database. Available from http://www.cogsci.princeton.edu/wn.

Gale, William A., Church, Kenneth W., & Yarowsky, David. (1992). A method for disambiguating word senses in a large corpus. Computers and the Humanities, 26(5-6), 415-439.

Golding, Andrew R. (1995). A Bayesian hybrid method for context-sensitive spelling correction. In Proceedings of the Third Workshop on Very Large Corpora.

Golding, Andrew R., & Roth, Dan. (1999). A Winnow-based approach to context-sensitive spelling correction. Machine Learning, 34(1-3), 107-130.

Golding, Andrew R., & Schabes, Yves. (1996). Combining trigram-based and feature-based methods for context-sensitive spelling correction. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, Santa Cruz, CA.

Hayashi, Katsuhiko, Watanabe, Taro, Tsukada, Hajime, & Isozaki, Hideki. (2009). Structural support vector machines for log-linear approach in statistical machine translation. In Proceedings of IWSLT, Tokyo, Japan.

Islam, Aminul, & Inkpen, Diana. (2009). Real-word spelling correction using Google Web 1T 3-grams. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing.

Jeng, Jin-Tsong. (2005). Hybrid approach of selecting hyperparameters of support vector machine for regression. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 36(3), 699-709.

Jones, Michael P., & Martin, James H. (1997). Contextual spelling correction using latent semantic analysis. In Proceedings of the Fifth Conference on Applied Natural Language Processing.

Kashefi, O., Minaei-Bidgoli, B., & Sharifi, M. (2010). A novel string distance metric for ranking Persian spelling error corrections. Language Resources and Evaluation.

Koehn, Philipp, Hoang, Hieu, Birch, Alexandra, Callison-Burch, Chris, Federico, Marcello, Bertoldi, Nicola, … Zens, Richard. (2007). Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL, Interactive Poster and Demonstration Sessions.

Kukich, Karen. (1992). Techniques for automatically correcting words in text. ACM Computing Surveys (CSUR), 24(4), 377-439.

Lazard, Gilbert, & Lyon, Shirley A. (1992). A Grammar of Contemporary Persian. Mazda Publishers.

Li, Juanzi, & Zhang, Kuo. (2007). Keyword extraction based on tf/idf for Chinese news document. Wuhan University Journal of Natural Sciences, 12(5), 917-921.

Mahootian, Shahrzad. (2003). Persian. Routledge.

Mays, Eric, Damerau, Fred J., & Mercer, Robert L. (1991). Context based spelling correction. Information Processing & Management, 27(5), 517-522.

Megerdoomian, Karine. (2000). Unification-based Persian morphology. In Proceedings of CICLing.

Och, Franz Josef, & Ney, Hermann. (2003). A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1), 19-51.

Pedler, Jennifer. (2007). Computer correction of real-word spelling errors in dyslexic text. PhD thesis, Birkbeck, University of London.

Shamsfard, Mehrnoush. (2008). Developing FarsNet: A lexical ontology for Persian. In Proceedings of the 4th Global WordNet Conference, Szeged, Hungary.

Stolcke, Andreas. (2002). SRILM - an extensible language modeling toolkit. In Proceedings of the International Conference on Spoken Language Processing.

Tsai, Chih-Fong. (2005). Training support vector machines based on stacked generalization for image classification. Neurocomputing, 64, 497-503.

Tsochantaridis, Ioannis, Joachims, Thorsten, Hofmann, Thomas, Altun, Yasemin, & Singer, Yoram. (2006). Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6(2), 1453.

Wilcox-O'Hearn, Amber, Hirst, Graeme, & Budanitsky, Alexander. (2008). Real-word spelling correction with trigrams: A reconsideration of the Mays, Damerau, and Mercer model. In Computational Linguistics and Intelligent Text Processing (pp. 605-616). Springer.

Yarowsky, David. (1994). Decision lists for lexical ambiguity resolution: Application to accent restoration in Spanish and French. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics.
