
ICCGI 2011 : The Sixth International Multi-Conference on Computing in the Global Information Technology

Statistical Machine Translation as a Grammar Checker for Persian Language

Nava Ehsan, Heshaam Faili
Department of Electrical and Computer Engineering, University of Tehran, Tehran, Iran
{[email protected], [email protected]}

Abstract—The existence of automatic writing assistance tools such as spell and grammar checkers/correctors can help to produce electronic texts of higher quality by removing noise and cleaning up the sentences. Different kinds of errors in a text can be categorized into spelling, grammatical and real-word errors. In this article, the concepts of an automatic grammar checker for the Persian (Farsi) language are explained. A statistical grammar checker based on the phrasal statistical machine translation (SMT) framework is proposed, and a hybrid model is suggested by merging it with an existing rule-based grammar checker. The results indicate that these two approaches are complementary in detecting and correcting syntactic errors, although the statistical approach is able to correct the more probable errors. State-of-the-art results on Persian grammar checking are achieved by using the hybrid model. The obtained recall is about 0.5 for correction and about 0.57 for detection, with precision about 0.63.

Index Terms—Natural Language Processing, Syntactic Error, Statistical Machine Translation, Grammar Checker, Persian Language

I. INTRODUCTION

Proofreading tools for the automatic detection and correction of erroneous sentences are among the most widely used tools within natural language applications such as text editing, optical character recognition (OCR), machine translation (MT) and other systems [1]. These editorial assistance tools are useful in helping second language learners not only in writing but also in learning a language by providing valuable feedback [2]. Kukich [3] has categorized the errors of a text into five groups: 1. isolated errors, 2. non-isolated or syntactic errors, 3. real-word errors, 4. discourse structure errors, and 5. pragmatic errors. The first category refers to spelling errors. Detecting errors of the second and third categories needs syntactic as well as semantic analysis. The last two categories cannot be considered as spelling or grammatical errors. In this article we focus on correcting syntactic errors and presuppose that the text has already been spell checked correctly. This paper describes a statistical grammar checker approach within the framework of phrasal statistical machine translation. SMT has the potential to solve some kinds of errors occurring in sentences [4]. We will show that training a statistical model is helpful in detecting and correcting grammatical errors which were not addressed in the rule-based grammar checker [5], especially those errors which need contextual cues for recognition. We will also introduce a hybrid of statistical and rule-based approaches for grammar checking, which achieves state-of-the-art results on Persian grammar checking. Grammar checkers cannot check the whole syntactic structure of the text [6]. In the proposed model, frequently occurring error types have been identified for evaluating both error detection and correction of the system in terms of precision and recall metrics.

The remainder of the paper is organized as follows: Section 2 outlines related work on grammar checking. In Section 3, the limitations of the previous Persian grammar checker are discussed. Section 4 describes the use of the SMT framework for grammar checking, followed by the preparation of the training and test data sets. Finally, the evaluation results for each approach individually and for the hybrid model are reported in Sections 5 and 6, respectively.

II. RELATED WORK

Grammar checkers deal with syntactic errors in the text, such as subject-verb disagreement and word order errors. Grammar checking entails several techniques from the NLP research area, such as tokenization, part-of-speech tagging, determining the dependencies between words or phrases, and defining and matching grammatical rules. Grammar checking techniques are categorized into three groups: syntax-based, statistical or corpus-based, and rule-based [2]. In the syntax-based approach the text is parsed and, if parsing does not succeed, the text is considered incorrect. It requires a complete grammar, mal-rules or relaxed constraints, which are obviously difficult to obtain due to the complex nature of natural languages. Mal-rules allow the parsing of specific errors in the input, and relaxing constraints redefines unification so that the parse does not fail when two elements do not unify [2].

The existing grammar checkers [7][8] fall into the rule-based category, in which a collection of rules describes the errors of the text, while [9][10][11] use statistical analysis for grammar checking. Although rule-based grammar checkers have been shown to be effective in detecting some classes of grammatical errors, the manual design and refinement of rules are difficult and time-consuming tasks. A deep understanding of linguistics is required to write non-conflicting rules which cover a suitable variety of grammatical errors. Although there have been some prior works on Persian spell checking [12][13], to the best of our knowledge the only work on Persian grammar checking is the rule-based system introduced in [5]. The limitations of this grammar checker are described in detail in the next section.

Copyright (c) IARIA, 2011. ISBN: 978-1-61208-139-7 20 ICCGI 2011 : The Sixth International Multi-Conference on Computing in the Global Information Technology developed by [9] uses an unsupervised method for detecting ast (the discussion © is differences) should be corrected English grammatical errors by using negative evidence from as Iƒ@ AîEðA®K Qå QK. Im'./ bahs bar sar tafaavotha ast edited textual corpora. It uses TOEFL essays as its resource. (the discussion is about differences). Integrating pattern discovery with supervised learning model (2) Omission of @P/ ra (definite object sign): Object is a is proposed by [11]. A generation-based approach for grammar mandatory argument of transitive verbs. The meaning correction is introduced by [10] which checks the fluency of of transitive verb is incomplete and unclear without the sentences produced by second language learners. The N-best object. The direct object should be addressed in the candidates are generated using n-gram language model which sentence by preposition @P/ ra. Finding the object of are reranked by parsing using stochastic context-free grammar. the sentence requires semantic analysis and it cannot be A pilot study of [4] presents the use of phrasal SMT for detected by regular expressions. Since this is an impor- identifying and correcting writing errors made by learners of tant preposition, the rule has been considered separately. English as a second language and the focus was on countability For example, YK XQ» ¨ðQå PA¿/ kar shooroo kardid (you  errors associated with mass nouns. The statistical phase of started work) should be corrected as YK XQ» ¨ðQå @P PA¿/ grammar checking procedure introduced in this paper also kar ra shooroo kardid (you started the work). relies on phrasal SMT framework for detecting and correcting (3) Omission of conjunctions: The omission of conjunc- syntactic errors. To overcome the negative impact of some tions is not always incorrect, but there are some types of errors on recall metric, the system is augmented with cases that the omission makes the sentence grammat- the rule-based procedure. ically wrong. This usually happens when a clause ap- III.LIMITATIONS OF PERSIAN RULE-BASED GRAMMAR pears in the middle of the sentence. Lexical infor- mation is also important in this case. For example, CHECKER I‚ k YKP@X AÖÞ Xñk© ù®K© QªK The proposed rule-based grammar checker [5] faces some  / tarifi khod shoma darid chist (what is the description you have) should be limitations. It is based on regular expression patterns and I‚ k YKP@X AÖÞ Xñk© é» ù®K© QªK detects errors which can be matched by regular expressions, converted to  / tarifi ke thus it cannot detect those patterns which are difficult or khod shoma darid chist (what is your description). impossible to be modeled by regular expressions. The other (4) Using indefinite noun when a demonstrative pronoun is problem is having pre-defined pattern and suggestion for each used: Demonstrative pronouns are independent words type of error. For example whenever it detects two repeated that precede© the noun. After© demonstrative pronouns words, it shows an error although not all two repeated words such as áK @/ in (this) and à@/ an (that) a definite noun should be used, unless a description is given are incorrect and one of them is deleted due to pre-defined ÐYK@ñ© k© é» úGAJ» à© @ suggestion. Our method is an SMT-based approach which within a phrase, like . / an ketaabi does not follow any specific pre-defined rule or suggestion. 
ke khaandam (the book that I have read). Since reg- For example by detecting repeated words, in some cases one ular expressions cannot identify to which word of the of the words may be eliminated or sometimes a preposition sentence the descriptive phrase belongs, defining this is added between duplicate words and in some cases it rule with regular expression© © may result©  in many false does not recognize any error. In addition regular expressions alarms. For example,ÐYK@ñk @P úG.AJ» à@/ an ketaabi ra cannot detect any recursive pattern. The errors which need khaandam© © (I read©  that a book) should be changed to context free grammar or statistical or semantic analysis or ÐYK@ñk @P H. AJ» à@/ an ketaab ra khaandam (I read that disambiguation are also undetectable by regular expressions. book). Existing techniques for Persian, based on hand-crafted rules (5) Connecting indefinite postfix to the first noun in pos- or statistical POS tag sequences [5] are not strong enough sessive nouns (ezafe construction [14]): Persian is a to tackle the common incorrect preposition or conjunction dependent-marking language [15] and tends to mark the omission errors due to lack of information about language relation on the non-head. In case of having possessive model. In our experiments not only all the syntactic errors nouns indefinite postfix ø/ i should be connected to described in the rule-based grammar checker are included but the last word. The postfix ø/ i, is not only used as also, the following errors are added. The errors are illustrated indefinite sign, but also can be used© as a copula for with an example. the second singular person like øX@P@/ azaadi (you are © (1) Omission of prepositions: Some words need special free) and it may also belong to its own word like øX@P@/ prepositions to complete their meanings. Prepositions azaadi (freedom). The morpheme ø/ i is used in forming depend on nouns and can complement the other words. various lexical elements in derivational morphology and Since the lexical information is important to correct will cause ambiguities [14]. Since regular expressions omission of prepositions, we need to define large num- deal with the surface of the word without semantic ber of rules which include all the phrases containing analysis these ambiguities are not distinguishable by prepositions and the words around it. Thus, it is not ©   regular expressions. For example, àAJƒ@X úG.AJ»/ ketaabi feasible to define the patterns©  by regular expressions. úGA© Jƒ@X HAJ» For example, Iƒ@ AîEðA®K Qå Im'./ bahs sar tafaavotha daastaan should be corrected as . / ketaab


(6) Using an adjective before a noun: In Persian, adjectives usually follow the nouns. This rule was omitted from the rule-based grammar checker [5] because of its low precision. Adjectives can sometimes act as the adverb of the sentence and can then precede the noun. The part-of-speech tagger used in the rule-based grammar checker could make mistakes in recognizing adjectives. Also, there is no chunking process before applying the regular expressions, so the rule-based system cannot tell whether the adjective belongs to the same phrase as the noun or not. For example, "jaaleb ketaab" (book interesting) should be converted to "ketaab jaaleb" (interesting book), but "daaneshmand mard ra did" (the scientist saw the man) is correct.

(7) Using the wrong plural morpheme: Morphologically, Persian falls into the polysynthetic languages, in which a single word may have many morphemes, and several morphemes exist for marking plurality in Persian, such as "an", "ha", "yun" and "at" [16]. Some words like "derakht" (tree) can become plural with both "an" and "ha": both "derakhtan" (trees) and "derakhtha" (trees) are correct. But some words like "ensaan" (human) or "miz" (table) can become plural with "ha" but not with "an"; unfortunately there is no general rule for this, and it cannot be captured by regular expressions. For example, "ensaanan" should be converted to "ensaanha" (humans).

In brief, the mentioned errors can be detected either by using probabilistic context-free grammars (PCFG) as a modeling formalism, or by using statistical or semantic analysis, or by defining a very large number of lexical rules. Due to the discussed limitations of regular expressions, we used a statistical approach based on the SMT framework for grammar checking to overcome the problems of the existing rule-based grammar checker.

IV. SMT FRAMEWORK FOR GRAMMAR CHECKING

Machine translation refers to the use of computers to automate some or all of the process of translating from one language into another [17]. Automatic grammar checking is modeled as a machine translation task in which the erroneous sentence is translated into the correct sentence. Machine translation is considered a hard task [15] in general, due to differences between the languages which are referred to as translation divergences [15]. Translation divergence can be structural, such as differences in morphology, argument structure, ordering, referential density and the linking of predicates with their arguments, or it can be lexical, such as homonymy, polysemy, many-to-many translation mappings and lexical gaps. The less divergence between the source and target languages, the better the translation. Unlike machine translation, where the input sentence belongs to a language other than that of the output sentence and can differ from it structurally and lexically, in the proposed model for grammar checking both input and output sentences belong to the same language, except that the input sentence has some syntactic errors. As will be described later, the syntactic errors considered in this paper can cause a lexical gap or divergences in morphology, argument structure or ordering between source and target sentences. Stylistic and cultural differences, which are another source of difficulty for translators, do not appear in this model.

The noisy channel model is the foundation of statistical machine translation [18]. In this article we explore its application to grammar checking. The noisy channel is used whenever the received signal does not identify the sent message. Grammar checking can be modeled as a noisy channel where the intended message is the correct sentence while the received signal is the erroneous sentence. We assume grammar checking maps an incorrect sentence E to a correct sentence C. The suggested correct sentence is the one whose probability is the highest:

\hat{C} = \arg\max_{C} P(C|E) = \arg\max_{C} \frac{P(E|C) P(C)}{P(E)}    (1)

The probability in the denominator of equation 1 is ignored, since we are choosing the correct sentence for a fixed erroneous sentence and it is therefore a constant. Equation 1 shows that we need to compute P(E|C) and the language model P(C). We assume that the noisy sentence is the result of applying syntactic errors to the correct sentence. The syntactic error rules considered in this paper are classified in Table I and we refer to them as R = {r_1, r_2, ..., r_n}. The relationship can be expressed as follows:

P(E|C) = \sum_{i=1}^{n} P(E, r_i | C)    (2)

The conditional probability P(E|C) is computed as follows:

P(E|C) = \sum_{i=1}^{n} P(E | C, r_i) \, P(r_i | C)    (3)

Two assumptions have been made, although we will later show in our experiments that these assumptions do not really affect the system's accuracy in terms of precision and recall:

(1) Each sentence has just one syntactic error. In other words, just one of the error rules is applied to the sentence.
(2) The conditional probability P(r_i|C) has a uniform distribution. This means that each rule is equally likely to be applied to a correct sentence; that is, the probability of each error mentioned in Table I appearing is the same.

Thus, equation 1 becomes:

\hat{C} = \arg\max_{C} P(C|E) = \arg\max_{C} P(E | C, r_i) P(C)    (4)

where r_i is the error rule which was applied to the correct sentence C.
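As a concrete illustration of equation 4, the following minimal Python sketch scores a set of candidate corrections with the product of a language model score and a channel (error model) score and returns the argmax. The functions language_model_prob and channel_prob are hypothetical wrappers (for example around an SRILM language model and the phrase-based channel model described below); they are not the authors' actual implementation.

    import math

    def best_correction(erroneous, candidates, language_model_prob, channel_prob):
        """Pick the candidate C maximizing P(E|C) * P(C), i.e. equation 4.

        language_model_prob(C) -> P(C): fluency score of the candidate
        channel_prob(E, C)     -> P(E|C): probability that C was corrupted into E
        Both callables are assumed to be provided by the surrounding system.
        """
        def log(p):
            return math.log(max(p, 1e-12))      # guard against zero probabilities

        best, best_score = None, float("-inf")
        for c in candidates:
            # Work in log space to avoid underflow on long sentences.
            score = log(language_model_prob(c)) + log(channel_prob(erroneous, c))
            if score > best_score:
                best, best_score = c, score
        return best

In practice the SMT decoder searches the candidate space itself rather than enumerating corrections explicitly; the sketch only makes the scoring of equation 4 explicit.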


TABLE I
PERSIAN SYNTACTIC ERROR RULES

ID  Error description
1   Omission of preposition
2   Omission of conjunction
3   Using plural noun after cardinal numbers
4   Using a verb or preposition after a genitive noun ending with the sign "i"
5   Using a verb before copulative verbs
6   Using a superlative adjective before the preposition "az" (than)
7   Using a preposition or conjunction at the end of the sentence
8   Omission of "ra" (definite object sign)
9   Using "ra" (definite object sign) after a verb or preposition, or at the beginning of the sentence
10  Using two consecutive adverbs of question or pronouns without "va" (and)
11  Double plural noun
12  Using adjective before noun
13  Disagreement between the subject and the verb
14  Repeating a word
15  Using wrong plural morpheme
16  Using indefinite noun when a demonstrative pronoun is used
17  Connecting indefinite postfix to the first noun in ezafe constructions

In order to use the phrase-based translation model, we need training data. The construction of the training data is described in detail later in this paper. Here, we just mention that we have parallel corpora of correct and erroneous sentences, in which correct sentences are infected by one of the error rules to produce the erroneous sentences. If more than one rule is applicable to a sentence, separate sentences are produced, each containing only one error. The phrasal translation model uses phrases as well as single words as the fundamental units. The phrase translation probability, \phi(e_i, c_i), is defined as the probability of generating phrase c_i from the incorrect phrase e_i. Distortion refers to a word having a different position in the input and output sentences; the more reordering, the more expensive the translation. The distortion is parameterized by d(a_i - b_{i-1}), where a_i is the start position of the erroneous phrase generated by the i-th correct phrase, and b_{i-1} is the end position generated by the (i-1)-th correct phrase. The phrase translation probability and the distortion probability are computed as follows [15]:

\phi(e, c) = \frac{count(e, c)}{\sum_{e} count(e, c)}    (5)

d(a_i - b_{i-1}) = \alpha^{|a_i - b_{i-1} - 1|}    (6)

where \alpha is a small constant. Similar to the translation model for statistical phrasal MT, the conditional probability in equation 1 is decomposed into:

P(E|C) = \prod_{i=1}^{I} \phi(e_i, c_i) \, d(a_i - b_{i-1})    (7)

Moses [17] is a statistical machine translation system for automatically training translation models and decoding for any language pair, in which GIZA++ [19] is used for word alignment and SRILM [20] is used as the language model toolkit. As in phrase-based models, the factored translation model used in Moses can be seen as the combination of several features h_i(C|E). These features are combined in a log-linear model. If there are N features, the log-linear translation model is:

P(C|E) = \frac{1}{Z} \prod_{i=1}^{N} \alpha_i^{h_i(C|E)}    (8)

Here, Z is a normalizing constant and \alpha_i is the weight assigned to feature h_i. In practice, the noisy channel model factors (the language model P(C) and the translation model P(E|C)) are still the most important feature functions in the log-linear model, but the architecture has the advantage of allowing arbitrary other features as well [15]. If two weights and features are used and set relative to the language model P(C) and the conditional probability P(E|C) as follows:

h_1(C|E) = \log_{\alpha_1} P(C)    (9)

h_2(C|E) = \log_{\alpha_2} P(E|C)    (10)

then, by substituting equations 9 and 10 into equation 8, we can see that the fundamental equation 1 is a special case of the log-linear model [21]. The language model is used to score the fluency of the output. The phrase translation table stores the extracted phrases; the construction of the phrase table and the learning of phrases are described in [22]. The distortion model allows reordering of the input sentence, but at a cost. The main features of Moses are the distortion model, language model, translation model and word penalty [22].
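To make equations 5 to 7 concrete, the sketch below estimates phrase translation probabilities by relative frequency from a list of aligned phrase pairs and scores one segmented sentence pair with the product of phrase and distortion scores. This is an illustrative toy re-implementation under the assumptions of this section, not the Moses phrase extraction pipeline; the phrase-pair list, the segmentation and the value of alpha are all assumptions made for the example.

    from collections import Counter

    def phrase_table(phrase_pairs):
        """phi(e, c) = count(e, c) / sum_e count(e, c), as in equation 5.

        phrase_pairs: iterable of (erroneous_phrase, correct_phrase) tuples assumed
        to come from the word-aligned training corpus (here taken as given).
        """
        pair_counts = Counter(phrase_pairs)
        per_correct = Counter(c for _, c in phrase_pairs)
        return {(e, c): n / per_correct[c] for (e, c), n in pair_counts.items()}

    def distortion(a_i, b_prev, alpha=0.1):
        """d(a_i - b_{i-1}) = alpha ** |a_i - b_{i-1} - 1|, as in equation 6.
        alpha is the small constant of equation 6; 0.1 is an arbitrary example value."""
        return alpha ** abs(a_i - b_prev - 1)

    def channel_score(segments, phi, alpha=0.1):
        """P(E|C) for one phrase segmentation, as in equation 7.

        segments: list of (e_phrase, c_phrase, start, end) with 1-based word positions
        of the erroneous phrases. A monotone segmentation (no reordering) gets a
        distortion penalty of alpha**0 = 1 at every step.
        """
        score, b_prev = 1.0, 0
        for e, c, start, end in segments:
            score *= phi.get((e, c), 1e-9) * distortion(start, b_prev, alpha)  # small floor for unseen pairs
            b_prev = end
        return score

Since the error types in Table I typically involve little reordering, most segmentations are monotone and the distortion term stays close to 1, which matches the intuition that grammar "translation" is far less divergent than ordinary MT.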
A. Preparing Training and Test Data

The training data is a collection of aligned sentences in two files, one for ungrammatical sentences and one for correct sentences. In order to prepare the erroneous corpus, the set of specific error types mentioned in Table I is used. These errors include all the patterns defined in [5] and those defined in Section 3. A Persian part-of-speech tagged corpus named Peykareh [23], containing a collection of formal news and other well-formed texts, is used. A set of example sentences for each of the various error types should be collected; thus, the error rules are injected into the sentences of this corpus automatically. The underlying assumption that each sentence has one syntactic error has to hold: if more than one type of error is possible in a sentence, several erroneous sentences are made, each containing only one type of error, and the corresponding correct sentence is placed for each of them in the other file. Sentences longer than 25 words are pruned to make the training phase more practical. The prepared training set contains about 340,000 erroneous sentences, each corresponding to a correct sentence in the other file. The language model is also created from the correct corpus with the SRILM toolkit [20]. We used parts of Peykareh other than those used in the training phase as the test set, to evaluate the results of the system.
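The following sketch illustrates the corpus construction step just described: each error rule is applied separately to a correct sentence, so that every erroneous sentence contains exactly one injected error, and sentences longer than 25 words are discarded. The error rules themselves (error_rules, a list of functions returning a corrupted sentence or None when the rule does not apply) are assumed to be implemented elsewhere; this is a sketch of the procedure, not the authors' actual preprocessing code.

    def build_parallel_corpus(correct_sentences, error_rules, max_len=25):
        """Build (erroneous, correct) sentence pairs with exactly one error per pair.

        correct_sentences: iterable of token lists from the well-formed corpus.
        error_rules:       list of callables rule(tokens) -> corrupted token list or None
                           (hypothetical implementations of the rules in Table I).
        """
        erroneous_file, correct_file = [], []
        for tokens in correct_sentences:
            if len(tokens) > max_len:          # prune long sentences
                continue
            for rule in error_rules:
                corrupted = rule(tokens)
                if corrupted is None:          # rule not applicable to this sentence
                    continue
                # one separate pair per applicable rule -> one error per erroneous sentence
                erroneous_file.append(" ".join(corrupted))
                correct_file.append(" ".join(tokens))
        return erroneous_file, correct_file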


For each type of error mentioned in Table I, a set of 20 samples is injected into the test set. These sentences are considered as the input of the system. The results are illustrated in Figure 1. In each case, if the created error leaves the sentence meaningful, that sentence is not evaluated; thus, the assessment set consists of those sentences that contain real grammatical errors, which amounts to 321 erroneous sentences. To deal with the null-subject and pro-drop features of the Persian language, error number 13, the subject-verb disagreement error, is examined only when the subject is a pronoun and it is not dropped.

V. EXPERIMENTAL RESULTS

The error rules are classified and the results are evaluated for each rule individually. Correction refers to those sentences for which the output is a correct sentence, and detection refers to those sentences for which the system made changes to the input sentence within the scope of the error, even though the suggested change may not be correct. The error rules are those shown in Table I. Nicholls [24] identifies four error types: an unnecessary word, a missing word, a word or phrase that needs replacing, and a word used in the wrong form. Table II shows to which categories our defined error rules belong; the error rules are referred to by their numbers defined in Table I. As shown in Table II, some error rules can belong to more than one category. The first and second categories cause a lexical gap between the erroneous and correct sentences, the third category causes ordering differences, and the fourth category causes morphological differences, while error number 17 causes a difference in argument structure, which emphasizes the dependent-marking feature of the language. A similar classification is also introduced by [25], which does not contain the third category and includes a separate category for agreement errors.

TABLE II
CATEGORIES OF DEFINED ERROR RULES

Category                                 Rule number
An unnecessary word                      7, 9, 14
A missing word                           1, 2, 4, 7, 8, 9, 10
A word or phrase that needs replacing    9, 12
A word used in the wrong form            3, 4, 5, 6, 11, 13, 15, 16, 17

The recall results for the 20 samples of each error rule are demonstrated in Figure 1. The horizontal numbers in Figure 1 indicate the error numbers described in Table I. The recall of the system was 0.44 for correction and 0.48 for detection, with precision 0.61. The results in Figure 1 indicate that this approach is successful in detecting some rules which were not detectable by regular expressions, as discussed before (rules 1, 2, 8, 12, 15, 16 and 17); on the other hand, it cannot detect rules 4, 7 and 9, which are detectable by regular expressions. These rules are those that can belong to different categories: since there are different ways of correcting these errors, it seems that the translation probabilities of the possible solutions become spread out in the training set.

Fig. 1. Results for SMT-based Grammar Checker

The rule-based grammar checker proposed in [5] was tested on this test set and the recall results are demonstrated in Figure 2. Here, detection means flagging the error with or without a suggestion. The recall was 0.23 for correction and 0.35 for detection on the defined rules, with precision 0.94.

Fig. 2. Results for Rule-based Grammar Checker
VI. COMBINING SMT AND RULE-BASED GRAMMAR CHECKER

The proposed SMT procedure performs as an error corrector: it does not contain an error detection stage, and all sentences are regarded as erroneous candidates for correction. The error detection defined in the previous section actually refers to incorrect correction. Some rules have a negative impact on the recall of the SMT-based technique but are detectable by the rule-based approach, so we would expect to see a greater improvement by combining the two techniques. In this case, errors are detected either by the SMT-based grammar checker, by the rule-based grammar checker, or by both. If the corrections differ between the two systems, both can be shown to the user as suggestions; if one of the suggestions is correct, we assume that the system corrected the sentence. The results are illustrated in Figures 3 and 4 for detection and correction recall, respectively. The recall improved to 0.53 for error correction and 0.66 for error detection, while the precision was 0.67.

Fig. 3. Combining SMT and rule-based results for grammar detection

Fig. 4. Combining SMT and rule-based results for grammar correction
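A minimal sketch of the combination scheme just described: both checkers are run on the same sentence, their suggestions are pooled, and a sentence counts as corrected if any pooled suggestion matches the reference. The two checker functions and the reference-based scoring are hypothetical stand-ins for the actual systems and evaluation script.

    def hybrid_check(sentence, smt_correct, rule_based_correct):
        """Pool the suggestions of the SMT-based and rule-based checkers.

        smt_correct(sentence)        -> suggested correction (hypothetical wrapper)
        rule_based_correct(sentence) -> suggested correction, or None if no rule fires
        """
        suggestions = []
        smt_out = smt_correct(sentence)
        if smt_out != sentence:                 # the SMT decoder changed something
            suggestions.append(smt_out)
        rule_out = rule_based_correct(sentence)
        if rule_out is not None and rule_out != sentence:
            suggestions.append(rule_out)
        return suggestions                      # shown to the user; may be empty

    def is_corrected(suggestions, reference):
        """A sentence counts as corrected if any suggestion equals the reference."""
        return any(s == reference for s in suggestions)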


VII. DISCUSSION

In the previous experiments, we relied on the assumption that each error is equally likely to be applied to a correct sentence, and in the test set the errors were uniformly distributed; the likelihood of occurrence of the errors was not considered.

In the real world, some types of errors are more likely to happen than others. The conditional probability P(r_i|C) is defined as the probability of making the error r_i when it is applicable to a correct sentence. Although there are learner corpora for some languages, such as the ICLE (International Corpus of Learner English) and the JLE (Japanese Learners of English) corpus, that contain annotated errors, no such annotated corpus is available to date for the Persian language from which to compute the occurrence probability of each type of error. A recent method that processes the freely available revision histories of Wikipedia to create error corpora has shown experimentally that even large revision histories contain rather scarce information about errors [26]. In this survey, we therefore asked 19 native speakers how likely they are to make each type of error in texts. The answers were classified into high, medium and low. By computing the weighted average of the answers, the likelihood of occurrence of each error in texts is estimated; the weights are 80, 50 and 20 percent for the high, medium and low classes, respectively. The results are given in Table III. The information in Table III indicates that those errors which could not be recognized by the SMT-based grammar checker are among the less probable errors of the language.

TABLE III
OCCURRENCE PROBABILITY FOR EACH TYPE OF ERROR

Rule number   Probability (%)
1             67.36
2             35.78
3             34.21
4             35.78
5             40.52
6             27.89
7             31.05
8             40.52
9             26.31
10            51.57
11            61.05
12            26.31
13            43.68
14            43.68
15            48.42
16            37.36
17            45.26
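The likelihood estimates in Table III can be reproduced from the survey answers with a simple weighted average, as sketched below. The answer counts used in the example call are made up for illustration; only the 80/50/20 weighting scheme comes from the text.

    WEIGHTS = {"high": 0.80, "medium": 0.50, "low": 0.20}

    def error_likelihood(answers):
        """Estimate the occurrence probability of one error type (in percent)
        from the high/medium/low answers of the surveyed native speakers."""
        return 100 * sum(WEIGHTS[a] for a in answers) / len(answers)

    # Hypothetical example: 19 answers for one error type.
    answers = ["high"] * 9 + ["medium"] * 7 + ["low"] * 3
    print(round(error_likelihood(answers), 2))   # 59.47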

Similar to the work of [27], which appended difficult-to-translate phrases together with their human translations to the training set to reduce their negative impact, this time we made a training set by considering the occurrence probability of the errors: if an error was applicable to a sentence, it is injected according to the corresponding probability, which results in about 220,000 sentence pairs. We refer to our previous training set as train set1 and to the newly produced training set as train set2, which result in statistical grammar checker1 (SGC1) and statistical grammar checker2 (SGC2), respectively. In order to test the results, another experiment is done on 500 erroneous sentences from the test set, in which we used the likelihood-of-occurrence information of the errors wherever an error is applicable to the sentence. This test set is evaluated with both SGC1 and SGC2; the language model is the same for both models. We refer to this test set as the probable test set. All the results are summarized in Table IV. In order to account for the importance of precision for a grammar checker [28], we have also evaluated the F0.5 measure, which weights precision twice as much as recall. In the test sets used so far, all sentences contained just one grammatical error. In order to determine whether the existence of more than one error would affect the grammar checking process, 20 sentences were tested, each containing more than one type of error. The results indicated that the grammar checker may not be able to recognize an error (even one it previously recognized) if two types of errors happen in the same phrase.

VIII. CONCLUSION

This paper proposed a hybrid model of statistical and rule-based approaches for identifying grammatical errors in the Persian language. The statistical part is based on phrasal SMT and its principles are language independent. The studies show that employing the SMT framework for grammar checking gives the ability to correct some classes of errors which are the most probable errors in the sentences. To overcome the negative impact of some types of errors on the recall metric, the system is augmented with the rule-based procedure. The obtained recall was 0.53 for error correction and 0.66 for error detection, while the resulting precision is 0.67, without considering the likelihood of occurrence of errors in the text.


TABLE IV
SUMMARIZED RESULTS OF GRAMMAR CHECKERS

                                        Correction recall   Detection recall   Precision   F-1    F-0.5
Uniform test set
  SGC1                                  0.44                0.48               0.61        0.53   0.566
  Rule-based grammar checker            0.23                0.35               0.94        0.51   0.581
  SGC1 + Rule-based grammar checker     0.53                0.66               0.67        0.66   0.636
Probable test set
  SGC1                                  0.46                0.5                0.61        0.54   0.572
  SGC2                                  0.48                0.5                0.62        0.55   0.585
  Rule-based grammar checker            0.25                0.31               0.91        0.46   0.595
  SGC2 + Rule-based grammar checker     0.5                 0.57               0.63        0.59   0.599
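For reference, the F-1 and F-0.5 columns of Table IV follow from precision and recall via the general F_beta measure; from the reported values, the F-1 column appears to pair precision with the detection recall and the F-0.5 column with the correction recall. A minimal sketch:

    def f_beta(precision, recall, beta=1.0):
        """General F-measure: F_beta = (1 + beta^2) * P * R / (beta^2 * P + R).
        beta < 1 weights precision more heavily than recall."""
        if precision == 0 and recall == 0:
            return 0.0
        b2 = beta * beta
        return (1 + b2) * precision * recall / (b2 * precision + recall)

    # Reproducing the SGC1 row of Table IV (uniform test set):
    print(round(f_beta(0.61, 0.48, beta=1.0), 2))   # 0.54, close to the reported F-1 of 0.53
    print(round(f_beta(0.61, 0.44, beta=0.5), 3))   # 0.566, the reported F-0.5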

The likelihood of occurrence of each error type was estimated in order to evaluate the grammar checker more accurately. In this case, the obtained recall is 0.5 for correction and 0.57 for detection with the augmentation of the rule-based approach. These are the state-of-the-art results on Persian grammar checking so far.

IX. FUTURE WORKS

There are still a number of tasks that could improve the grammar checking system. We would like to collect grammatical errors from non-native learners, which would allow us to expand the grammar checker to better distinguish correct and erroneous sentences for language learners; it could also help to find better training examples for the system. Some errors in a sentence are the result of real-word errors. The SMT-based framework seems able to detect a word in the sentence that does not fit, so real-word error detection is going to be tested with this approach. Since the statistical approach is language independent, it can be trained and tested on other languages such as English, considering the errors of that language.

REFERENCES

[1] M. Bhagat, "Spelling error pattern analysis of Punjabi typed text," Master's thesis, Computer Science and Engineering Department, Thapar Institute of Engineering and Technology Deemed University, Patiala, 2007.
[2] C. Leacock, M. Chodorow, M. Gamon, and J. Tetreault, "Automated grammatical error detection for language learners," Synthesis Lectures on Human Language Technologies, vol. 3, no. 1, pp. 1–134, 2010.
[3] K. Kukich, "Techniques for automatically correcting words in text," ACM Computing Surveys (CSUR), vol. 24, no. 4, pp. 377–439, 1992.
[4] C. Brockett, W. Dolan, and M. Gamon, "Correcting ESL errors using phrasal SMT techniques," in Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, 2006, pp. 249–256.
[5] N. Ehsan and H. Faili, "Towards grammar checker development for Persian language," in 6th IEEE International Conference on Natural Language Processing and Knowledge Engineering (IEEE NLP-KE'10), 2010, pp. 150–157.
[6] D. Kies, "Evaluating grammar checkers: A comparative ten-year study," in 6th International Conference on Education and Information Systems, Technologies and Applications (EISTA), 2008.
[7] D. Naber, "A rule-based style and grammar checker," Diplomarbeit, Technische Fakultät, Universität Bielefeld, 2003.
[8] F. Bustamante and F. León, "GramCheck: A grammar and style checker," in Proceedings of the 16th Conference on Computational Linguistics, Volume 1, 1996, pp. 175–181.
[9] M. Chodorow and C. Leacock, "An unsupervised method for detecting grammatical errors," in Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference, 2000, pp. 140–147.
[10] J. Lee and S. Seneff, "Automatic grammar correction for second-language learners," in Ninth International Conference on Spoken Language Processing, 2006, pp. 1978–1981.
[11] G. Sun, X. Liu, G. Cong, M. Zhou, Z. Xiong, J. Lee, and C. Lin, "Detecting erroneous sentences using automatically mined sequential patterns," in Annual Meeting of the Association for Computational Linguistics, vol. 45, no. 1, 2007, pp. 81–88.
[12] M. Shamsfard, H. Jafari, and M. Ilbeygi, "STeP-1: A set of fundamental tools for Persian text processing," in 8th Language Resources and Evaluation Conference, 2010.
[13] O. Kashefi, M. Nasri, and K. Kanani, Towards Automatic Persian Spell Checking. Tehran, Iran: SCICT, 2010.
[14] K. Megerdoomian, "Finite-state morphological analysis of Persian," in Proceedings of the Workshop on Computational Approaches to Arabic Script-based Languages, 2004, pp. 35–41.
[15] D. Jurafsky, J. Martin, A. Kehler, K. Vander Linden, and N. Ward, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. MIT Press, 2010, vol. 163.
[16] B. Sagot and G. Walther, "A morphological lexicon for the Persian language," in Proceedings of the 7th Language Resources and Evaluation Conference (LREC'10), 2010, pp. 300–303.
[17] P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens et al., "Moses: Open source toolkit for statistical machine translation," in Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, 2007, pp. 177–180.
[18] P. Brown, V. Pietra, S. Pietra, and R. Mercer, "The mathematics of statistical machine translation: Parameter estimation," Computational Linguistics, vol. 19, no. 2, pp. 263–311, 1993.
[19] F. Och and H. Ney, "A systematic comparison of various statistical alignment models," Computational Linguistics, vol. 29, no. 1, pp. 19–51, 2003.
[20] A. Stolcke, "SRILM - an extensible language modeling toolkit," in Seventh International Conference on Spoken Language Processing, vol. 3, 2002, pp. 901–904.
[21] A. Axelrod, "Factored language model for statistical machine translation," Master's thesis, Institute for Communicating and Collaborative Systems, Division of Informatics, University of Edinburgh, 2006.
[22] P. Koehn, "MOSES, Statistical Machine Translation System, User Manual and Code Guide," 2010.
[23] M. Bijankhan, "Naghshe Peykarehaye Zabani dar Neveshtane Dasture Zaban: Mo'arrefiye yek Narmafzare Rayane'i [The role of corpus in generating grammar: Presenting a computational software and corpus]," Iranian Linguistic Journal, vol. 19, pp. 48–67, 2006.
[24] D. Nicholls, "The Cambridge Learner Corpus: Error coding and analysis for lexicography and ELT," in Proceedings of the Corpus Linguistics 2003 Conference, 2003, pp. 572–581.
[25] J. Wagner, J. Foster, and J. van Genabith, "A comparative evaluation of deep and shallow approaches to the automatic detection of common grammatical errors," in Proceedings of EMNLP-CoNLL 2007, 2007.
[26] M. Miłkowski, "Automated building of error corpora of Polish," in Corpus Linguistics, Computer Tools, and Applications – State of the Art (PALC 2007), pp. 631–639, 2008.
[27] B. Mohit and R. Hwa, "Localization of difficult-to-translate phrases," in Proceedings of the Second Workshop on Statistical Machine Translation, 2007, pp. 248–255.
[28] A. Arppe, "Developing a grammar checker for Swedish," in The 12th Nordic Conference of Computational Linguistics, 2000, pp. 13–27.
