A Punjabi Grammar Checker

Mandeep Singh Gill
Department of Computer Science, Punjabi University, Patiala -147002, India
Tel.: +91-175-3046292

Gurpreet Singh Lehal
Department of Computer Science, Punjabi University, Patiala -147002, India
Tel.: +91-9888165971

Shiv Sharma Joshi
Department of Anthropological Linguistics & Punjabi Lexicography, Punjabi University, Patiala -147002, India
Tel.: +91-175-3046171

Abstract

This article describes the grammar checking software developed for detecting grammatical errors in Punjabi texts and providing suggestions, wherever appropriate, to rectify those errors. The system utilizes a full-form lexicon for morphological analysis and rule-based systems for part of speech tagging and phrase chunking. Supported by a set of carefully devised error detection rules, the system can detect and suggest rectifications for a number of grammatical errors resulting from lack of agreement, order of words in various phrases etc., in literary style Punjabi texts.

1 Introduction

Grammar checking is one of the most widely used tools within natural language engineering applications. Most of the word processing systems available in the market incorporate spelling, grammar, and style-checking systems for English and other foreign languages; one such rule-based grammar checking system for English is discussed in (Naber, 2003). However, when it comes to the smaller languages, specifically the Indian languages, most such advanced tools have been lacking. Spell checking has been addressed for most of the Indian languages, but grammar and style checking systems are still lacking. In this article a grammar checking system for Punjabi, a member of the Modern Indo-Aryan family of languages, is provided. The grammar checker uses a rule-based system to detect grammatical errors in the text and, if possible, generates suggestions to correct those errors.

To the best of our knowledge the grammar checker provided here will be the first such system for Indian languages. There is an n-gram based grammar checking system for Bangla (Alam et al, 2006). The authors admit its accuracy is very low, and there is no description of whether the system provides any suggestions to correct errors or not. It is mentioned that it was tested to identify correct sentences from the set of sentences provided as input, but nothing is mentioned as far as correcting those errors is concerned. However, the system that we discuss here for Punjabi detects errors and suggests corrections as well. In doing so, it provides enough information for the user to understand the reason for the error and supports the suggestions provided, if any.

2 System Overview

The input Punjabi text is given to the preprocessing system, which performs tokenization and detects any phrases etc. After that, morphological analysis is performed; this returns possible tags for all the words in the given text, based on the full-form lexicon that it is using. Then a rule-based part of speech tagger is engaged to disambiguate the tags based on the context information. After that, the text is grouped into various phrases according to the pre-defined phrase chunking rules. In the final phase, rules to check for various grammatical errors internal to phrases and agreement on the sentence level are applied. If any error is found in a sentence, then corrections are suggested (generated) for it based on the context information.

For the purpose of morphological analysis we have divided the Punjabi words into 22 word classes like noun, adjective (inflected and uninflected), pronoun (personal, demonstrative, reflexive, interrogative, relative, and indefinite), verb (main verb, operator verb, and auxiliary verb), cardinals, ordinals, adverb, postposition, conjunction, interjection etc., depending on the grammatical information required for the words of these word classes. The information that is in the database depends upon the word class; for noun and inflected adjective, it is gender, number, and case. For personal pronouns, person is also required. For main verbs gender, number, person, tense, phase, transitivity etc. are required. As mentioned earlier, the lexicon of this morphological analyzer is full-form based, i.e. all the word forms of all the commonly used Punjabi words are kept in the lexicon along with their root and other grammatical information.

For part of speech tagging, we have devised a tag set keeping in mind all the grammatical categories that can be helpful for agreement checking. At present, there are more than 600 tags in the tag set. In addition to this, there are also some word-specific tags. The tag set is very user friendly, and while choosing tag names existing tag sets for English and other such languages were taken into consideration, like NNMSD (masculine, singular, and direct case noun) and PNPMPOF (masculine, plural, oblique case, and first person personal pronoun). The approach followed for part of speech tagging is rule-based, as there is no tagged corpus for Punjabi available at present. As the text we are processing may have grammatical agreement errors, the part of speech tagging rules are devised considering this. The rules are applied in sequential order, with each rule having an attached priority to control its order in this sequence.

For phrase chunking, again a rule-based approach was selected, mainly for reasons similar to those for part of speech tagging. The tag set that is being used for phrase chunking includes tags like NPD (noun phrase in direct case), NPNE (noun phrase followed by ਨੇ ne) etc. The rules for phrase chunking also take into account the potential errors in the text, like lack of agreement in words of a potential phrase. However, as would be expected, there is no way to take the misplaced words of a phrase into account: if words of a phrase are separated (having some other phrase in between), then they cannot be taken as a single phrase, even though this may be a potential error.

In the last phase, i.e. grammar checking, there are again manually designed error detection rules to detect potential errors in the text and provide corrections to resolve those errors. For example, the rule to check modifier and noun agreement will go through all the noun phrases in a sentence to check whether the modifiers in those phrases agree with their respective head words (noun/pronoun) in terms of gender, number, and case. For this matching, the grammatical information from the tags of those words is used. In simple terms, it will compare the grammatical information (gender, number, and case) of the modifier with that of the headword (noun/pronoun) and display an error message if some grammatical information fails to match. To resolve this error, the grammar checking module will use the morphological generator to generate the correct form (based on the headword’s gender, number, and case) for that modifier from its root word.

For example, consider the grammatically incorrect sentence ਸੋਹਣੇ ਲੜਕਾ ਜਾਂਦਾ ਹੈ sohne larka janda hai ‘handsome boy goes’. In this sentence, in the noun phrase ਸੋਹਣੇ ਲੜਕਾ sohne larka ‘handsome boy’, the modifier ਸੋਹਣੇ sohne ‘handsome’ (root word ਸੋਹਣਾ sohna ‘handsome’), with masculine gender, plural number, and direct case, is not in accordance with the gender, number, and case of its head word. It should be in singular number instead of plural. The grammar checking module will detect this as an error, as ‘number’ for modifier and headword is not the same; it will then use the morphological generator to generate the ‘singular number form’ from its root word, which is the same as the root form, i.e. ਸੋਹਣਾ sohna ‘handsome’ (masculine gender, singular number, and direct case). So, the input sentence will be corrected as ਸੋਹਣਾ ਲੜਕਾ ਜਾਂਦਾ ਹੈ sohna larka janda hai ‘handsome boy goes’.

The error detection rules in the grammar checking module are again applied in sequential order, with a priority field to control the sequence. This is done to resolve phrase level errors before going on to the clause level errors, and then to sentence level agreement errors.

3 Grammar Errors

At present, this grammar checking system for Punjabi detects and provides corrections for the following grammatical errors, based on the study of Punjabi grammar related texts (Chander, 1964; Gill and Gleason, 1986; Puar, 1990):

Modifier and noun agreement
The modifier of a noun must agree with the noun in terms of gender, number, and case. Modifiers of a noun include adjectives, pronouns, cardinals, ordinals, some forms of verbs etc.

Subject and verb agreement
In Punjabi text, the verb must agree with the subject of the sentence in terms of gender, number, and person. There are some special forms of verbs, like transitive past tense verbs, which need some specific postpositions with their subject, like the use of ਨੇ ne with transitive verbs in perfect form etc.

Noun and adjective (in attributive form) agreement
This is different from ‘modifier and noun agreement’ as described above in the sense that the adjective is not preceding the noun but can be virtually anywhere in the sentence, usually preceding the verb phrase and acting as a complement for it.

[...] following noun phrase that it is connecting with the preceding noun phrase.

Some other options covered include: a noun phrase must be in oblique form before a postposition, all the noun phrases joined by connectives must have the same case, the main verb should be in root form if preceding ਕੇ ke, etc.

4 Sample Input and Output

This section provides some sample Punjabi sentences that were given as input to the Punjabi grammar checking system, along with the output generated by this system.

Sentence 1
Shows the grammatical errors related to ‘Modifier and noun agreement’ and ‘Order of the modifiers of a noun in noun phrase’. In this sentence the noun is ਲੜਕਾ larka ‘boy’ and its modifiers are ਸੋਹਣੀ ਇੱਕ ਭੱਜੀ ਜਾਂਦਾ sohni ek bhajji janda ‘handsome one running’.

Input: ਸੋਹਣੀ ਇੱਕ ਭੱਜੀ ਜਾਂਦਾ ਲੜਕਾ ਆਇਆ
Input1: sohni ek bhajji janda larka aaeya
Input2: Handsome one running boy came
Output: ਇੱਕ ਭੱਜਿਆ ਜਾਂਦਾ ਸੋਹਣਾ ਲੜਕਾ ਆਇਆ
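
The modifier and noun agreement check described in the system overview can be illustrated with a minimal Python sketch. It is not the implementation of the system discussed here: the names `Features`, `Word`, `FULL_FORMS`, `generate`, and `check_modifier_agreement` are hypothetical, the two-entry lexicon is a toy stand-in for the full-form lexicon and morphological generator, and transliterated forms are used in place of Gurmukhi script for readability. It covers only the sohne larka example given earlier.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Features:
    gender: str  # "M" or "F"
    number: str  # "S" (singular) or "P" (plural)
    case: str    # "D" (direct) or "O" (oblique)


@dataclass(frozen=True)
class Word:
    form: str        # surface form as it appears in the text
    root: str        # root word from the lexicon
    feats: Features  # features taken from the disambiguated tag


# Toy stand-in for the full-form lexicon (assumption): maps a root plus
# target features to a surface form. Only forms from the example are listed.
FULL_FORMS = {
    ("sohna", Features("M", "S", "D")): "sohna",
    ("sohna", Features("M", "P", "D")): "sohne",
}


def generate(root: str, feats: Features) -> Optional[str]:
    """Hypothetical morphological generator: look up the requested form."""
    return FULL_FORMS.get((root, feats))


def check_modifier_agreement(
    modifiers: List[Word], head: Word
) -> List[Tuple[str, List[str], Optional[str]]]:
    """Compare each modifier with the head word on gender, number, and case.

    Returns one (modifier form, mismatched features, suggestion) tuple per
    disagreeing modifier; the suggestion is None if no form can be generated.
    """
    errors = []
    for mod in modifiers:
        mismatched = [
            name
            for name in ("gender", "number", "case")
            if getattr(mod.feats, name) != getattr(head.feats, name)
        ]
        if mismatched:
            suggestion = generate(mod.root, head.feats)
            errors.append((mod.form, mismatched, suggestion))
    return errors


# Noun phrase "sohne larka": plural modifier with a singular head noun.
head = Word("larka", "larka", Features("M", "S", "D"))
modifier = Word("sohne", "sohna", Features("M", "P", "D"))
errors = check_modifier_agreement([modifier], head)
print(errors)
```

In this sketch the check reports a mismatch on ‘number’ for sohne and proposes sohna, the form generated from the root with the head word's features, which mirrors the correction described for that example.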
