A Punjabi Grammar Checker

Mandeep Singh Gill
Department of Computer Science
Punjabi University
Patiala -147002, India
[email protected]
Tel.: +91-175-3046292

Gurpreet Singh Lehal
Department of Computer Science
Punjabi University
Patiala -147002, India
[email protected]
Tel.: +91-9888165971

Shiv Sharma Joshi
Department of Anthropological Linguistics & Punjabi Lexicography
Punjabi University
Patiala -147002, India
Tel.: +91-175-3046171

Abstract

This article describes the grammar checking software developed to detect grammatical errors in Punjabi texts and to suggest corrections wherever appropriate. The system uses a full-form lexicon for morphological analysis and rule-based systems for part of speech tagging and phrase chunking. Supported by a set of carefully devised error detection rules, it can detect and suggest rectifications for a number of grammatical errors in literary style Punjabi texts, resulting from lack of agreement, order of words in various phrases etc.

1 Introduction

Grammar checking is one of the most widely used tools within natural language engineering applications. Most of the word processing systems available in the market incorporate spelling, grammar, and style-checking systems for English and other foreign languages; one such rule-based grammar checking system for English is discussed in (Naber, 2003). However, when it comes to the smaller languages, specifically the Indian languages, most such advanced tools have been lacking. Spell checking has been addressed for most of the Indian languages, but grammar and style checking systems are still lacking. In this article a grammar checking system for Punjabi, a member of the Modern Indo-Aryan family of languages, is presented. The grammar checker uses a rule-based system to detect grammatical errors in the text and, if possible, generates suggestions to correct those errors.

To the best of our knowledge, the grammar checking system presented here is the first such system for Indian languages. There is an n-gram based grammar checking system for Bangla (Alam et al, 2006). The authors admit its accuracy is very low, and there is no description of whether the system provides any suggestions to correct errors. It is mentioned that it was tested on identifying correct sentences from the set of sentences provided as input, but nothing is mentioned about correcting the errors. The system that we discuss here for Punjabi, however, detects errors and suggests corrections as well. In doing so, it provides enough information for the user to understand the reason for the error and supports the suggestions provided, if any.

2 System Overview

The input Punjabi text is given to the preprocessing system, which performs tokenization and detects any phrases etc. After that, morphological analysis is performed; this returns the possible tags for all the words in the given text, based on the full-form lexicon that it uses. Then a rule-based part of speech tagger is engaged to disambiguate the tags based on the context information. After that, the text is grouped into various phrases according to the pre-defined phrase chunking rules. In the final phase, rules that check for various grammatical errors internal to phrases and for agreement on the sentence level are applied. If any error is found in a sentence, corrections are generated and suggested for it based on the context information.
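The flow of the first four stages can be pictured with a small sketch. This is only an illustration under stated assumptions, not the system's actual code (which is written in C#): the lexicon entries, the tag strings such as ADJ.M.P.D, the "keep the first candidate" tagger, and the single chunking rule are all hypothetical stand-ins, and words are Romanized for readability. The grammar checking stage itself is left out here.

```python
# Toy full-form lexicon: surface form -> candidate (root, tag) pairs.
# Entries and tag names are hypothetical, chosen only for this sketch.
LEXICON = {
    "sohne": [("sohna", "ADJ.M.P.D"), ("sohna", "ADJ.M.S.O")],
    "sohna": [("sohna", "ADJ.M.S.D")],
    "larka": [("larka", "N.M.S.D")],
    "janda": [("ja", "V.M.S")],
    "hai": [("hai", "AUX.S")],
}


def tokenize(text):
    """Preprocessing: split the input text into word tokens."""
    return text.split()


def analyze(tokens):
    """Morphological analysis: look up every candidate (root, tag) per token."""
    return [(tok, LEXICON.get(tok, [(tok, "UNK")])) for tok in tokens]


def disambiguate(analyses):
    """Stand-in for the rule-based tagger: simply keep the first candidate."""
    return [(tok, candidates[0]) for tok, candidates in analyses]


def chunk(tagged):
    """Phrase chunking: adjectives followed by a noun form one noun phrase (NP)."""
    chunks, pending = [], []
    for item in tagged:
        tag = item[1][1]
        if tag.startswith("ADJ"):
            pending.append(item)
        elif tag.startswith("N."):
            chunks.append(("NP", pending + [item]))
            pending = []
        else:
            # Adjectives not followed by a noun are emitted as single-word chunks.
            chunks.extend(("OTHER", [p]) for p in pending)
            pending = []
            chunks.append(("OTHER", [item]))
    chunks.extend(("OTHER", [p]) for p in pending)
    return chunks


def run_pipeline(text):
    """Chain the stages in the order described in the text."""
    return chunk(disambiguate(analyze(tokenize(text))))


for label, words in run_pipeline("sohne larka janda hai"):
    print(label, [(form, tag) for form, (_root, tag) in words])
```

Note that the chunker groups sohne and larka into one noun phrase even though their number features disagree; tolerating such disagreement at this stage is what lets a later agreement rule inspect the phrase.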

For the purpose of morphological analysis we have divided the Punjabi words into 22 word classes like noun, adjective (inflected and uninflected), pronoun (personal, demonstrative, reflexive, interrogative, relative, and indefinite), verb (main verb, operator verb, and auxiliary verb), cardinals, ordinals, adverb, postposition, conjunction, interjection etc., depending on the grammatical information required for the words of these word classes. The information stored in the database depends upon the word class: for nouns and inflected adjectives, it is gender, number, and case. For personal pronouns, person is also required. For main verbs, gender, number, person, tense, phase, etc. are required. As mentioned earlier, the lexicon of this morphological analyzer is full-form based, i.e. all the word forms of the commonly used Punjabi words are kept in the lexicon along with their root and other grammatical information.

For part of speech tagging, we have devised a tag set keeping in mind all the grammatical categories that can be helpful for agreement checking. At present, there are more than 600 tags in the tag set. In addition to this, there are some word-specific tags. The tag set is user friendly, and existing tag sets for English and other such languages were taken into consideration while choosing tag names, like NNMSD (masculine, singular, direct case noun) and PNPMPOF (masculine, plural, oblique case, first person personal pronoun). The approach followed for part of speech tagging is rule-based, as there is no tagged corpus for Punjabi available at present. As the text we are processing may have grammatical agreement errors, the part of speech tagging rules are devised with this in mind. The rules are applied in sequential order, with each rule having an attached priority to control its position in this sequence.

For phrase chunking, again a rule-based approach was selected, mainly for the same reasons as for part of speech tagging. The tag set used for phrase chunking includes tags like NPD (noun phrase in direct case), NPNE (noun phrase followed by ne) etc. The rules for phrase chunking also take into account the potential errors in the text, like lack of agreement among the words of a potential phrase. However, as would be expected, there is no way to take misplaced words of a phrase into account: if the words of a phrase are separated (with some other phrase in between), they cannot be taken as a single phrase, even though this may be a potential error.

In the last phase, i.e. grammar checking, there are again manually designed error detection rules to detect potential errors in the text and provide corrections to resolve those errors. For example, the rule that checks modifier and noun agreement goes through all the noun phrases in a sentence to check whether the modifiers in those phrases agree with their respective head words (noun/pronoun) in terms of gender, number, and case. For this matching, the grammatical information from the tags of those words is used. In simple terms, it compares the grammatical information (gender, number, and case) of the modifier with that of the headword (noun/pronoun) and displays an error message if some grammatical information fails to match. To resolve this error, the grammar checking module uses a morphological generator to generate the correct form of that modifier from its root word, based on the headword's gender, number, and case.

For example, consider the grammatically incorrect sentence sohne larka janda hai 'handsome boy goes'. In the noun phrase of this sentence, sohne larka 'handsome boy', the modifier sohne 'handsome' (root word sohna 'handsome'), with masculine gender, plural number, and direct case, is not in accordance with the gender, number, and case of its head word. It should be in singular number instead of plural. The grammar checking module will detect this as an error because 'number' is not the same for the modifier and the headword. It will then use the morphological generator to generate the 'singular number form' from the root word, which is the same as the root form, i.e. sohna 'handsome' (masculine gender, singular number, and direct case). So, the input sentence will be corrected as sohna larka janda hai 'handsome boy goes'.

The error detection rules in the grammar checking module are again applied in sequential order, with a priority field to control the sequence. This is done to resolve phrase level errors before going on to the clause level errors, and then to sentence level agreement errors.
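The modifier and noun agreement check on the sohne larka example can be sketched roughly as follows. This is a minimal illustration, assuming a hypothetical `Word` record and a hypothetical lookup table standing in for the morphological generator; the real system reads these features from its tag set and XML lexicon, whose layout is not described in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    form: str    # surface form (Romanized here)
    root: str    # root word, as stored in the full-form lexicon
    gender: str  # "M" or "F"
    number: str  # "S" (singular) or "P" (plural)
    case: str    # "D" (direct) or "O" (oblique)


# Hypothetical stand-in for the morphological generator:
# (root, gender, number, case) -> inflected form.
GENERATOR = {
    ("sohna", "M", "S", "D"): "sohna",
    ("sohna", "M", "P", "D"): "sohne",
    ("sohna", "F", "S", "D"): "sohni",
}

FEATURES = ("gender", "number", "case")


def check_modifier_noun_agreement(modifiers, head):
    """Return (modifier, mismatched features, suggested form or None) per error."""
    errors = []
    for mod in modifiers:
        mismatched = [f for f in FEATURES if getattr(mod, f) != getattr(head, f)]
        if mismatched:
            # Regenerate the modifier from its root using the head word's features.
            key = (mod.root, head.gender, head.number, head.case)
            errors.append((mod, mismatched, GENERATOR.get(key)))
    return errors


head = Word("larka", "larka", "M", "S", "D")
modifier = Word("sohne", "sohna", "M", "P", "D")

for mod, mismatched, suggestion in check_modifier_noun_agreement([modifier], head):
    print(f"{mod.form}: {', '.join(mismatched)} mismatch with {head.form}; suggest {suggestion}")
# prints: sohne: number mismatch with larka; suggest sohna
```

Returning the list of mismatched features alongside the suggestion mirrors the behaviour described above, where the error message names the grammatical information that failed to match.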

3 Grammar Errors

At present, this grammar checking system for Punjabi detects and provides corrections for the following grammatical errors, based on the study of Punjabi grammar related texts (Chander, 1964; Gill and Gleason, 1986; Puar, 1990):

Modifier and noun agreement
The modifier of a noun must agree with the noun in terms of gender, number, and case. Modifiers of a noun include adjectives, pronouns, cardinals, ordinals, some forms of verbs etc.

Subject and verb agreement
In Punjabi text, the verb must agree with the subject of the sentence in terms of gender, number, and person. There are some special forms of verbs, like transitive verbs, which need specific postpositions with their subject, like the use of ne with transitive verbs in perfect form etc.

Noun and adjective (in attributive form) agreement
This is different from 'modifier and noun agreement' as described above in the sense that the adjective does not precede the noun but can be virtually anywhere in the sentence, usually preceding the verb phrase and acting as a complement for it. It must still agree with the noun for which it is used in that sentence.

Order of the modifiers of a noun in noun phrase
If a noun has more than one modifier, then those modifiers should be in a certain order such that phrase modifiers precede single word modifiers, but pronouns and numerals precede all others.

Order of the words in a verb phrase
There are certain future tense forms of Punjabi verbs that should occur towards the end of the verb phrase without any auxiliary. In addition, if negative and emphatic particles are used in a verb phrase then the latter must precede the former.

da postposition and following noun phrase agreement
All the forms of the da postposition must agree in terms of gender, number, and case with the following noun phrase that it connects with the preceding noun phrase.

Some other options covered include: a noun phrase must be in oblique form before a postposition, all the noun phrases joined by connectives must have the same case, the main verb should be in root form if preceding ke etc.

4 Sample Input and Output

This section provides some sample Punjabi sentences that were given as input to the Punjabi grammar checking system along with the output generated by this system.

Sentence 1
Shows the grammatical errors related to 'Modifier and noun agreement' and 'Order of the modifiers of a noun in noun phrase'. In this sentence the noun is larka 'boy' and its modifiers are sohni ek bhajji janda 'handsome one running'.
Input:
Input1: sohni ek bhajji janda larka aaeya
Input2: Handsome one running boy came
Output:
Output1: ek bhajjia janda sohna larka aaeya
Output2: One running handsome boy came

Sentence 2
Covers the grammatical error related to 'Subject and verb agreement'. The subject is barish 'rain' and the verb phrase is ho riha han 'is raining'.
Input:
Input1: bahr barish ho riha han
Input2: It is raining outside
Output:
Output1: bahr barish ho rahi hai
Output2: It is raining outside

Sentence 3
For grammatical errors related to 'da postposition and following noun phrase agreement' and 'Noun phrase in oblique form before a postposition'. The noun phrase preceding dee (a form of the da postposition) is chota baccha 'small boy' and the following one is naam 'name'.
Input:
Input1: chota baccha dee naam raam hai
Input2: Small boy's name is Ram
Output:
Output1: chote bacche da naam raam hai
Output2: Small boy's name is Ram

Sentence 4
Highlights the grammatical errors related to 'Subject and verb agreement' and 'Order of the words in a verb phrase'. The subject in this sentence is larki 'girl' and the verb phrase is nahi ja hee riha see 'was not going'.
Input:
Input1: larkee school nahi ja hee riha see
Input2: The girl was not going to school
Output:
Output1: larkee school ja he nahi rahi see
Output2: The girl was not going to school

Sentence 5
For the grammatical error related to 'Subject and verb agreement'. The subject here is raam 'Ram' and the verb phrase is khadha 'ate', which is transitive and in perfect phase.
Input:
Input1: raam phal khadha
Input2: Ram ate fruit
Output:
Output1: raam ne phal khadha
Output2: Ram ate fruit

Legend:
• Input and Output specify the input Punjabi sentence in Gurmukhi script and the output produced by this grammar checking system in Gurmukhi script, respectively.
• Input1/Output1 specifies the Romanized version of the input/output.
• Input2/Output2 specifies the English gloss for the input/output.

5 System Features

The system is designed in Microsoft Visual C# 2005 using Microsoft .NET Framework 2.0. The entire database of this tool is in XML files with the Punjabi text in Unicode format. Some of the significant features of this grammar checking system are:

Rules can be turned on and off individually
Being a rule-based system, all the rules provided in section 3 can be turned on and off individually without requiring any changes in the system. The rules are kept in a separate XML file, not hard coded into the system. To turn a rule on or off, changes can be made to that XML file directly or through the options provided within the system.

Error and Suggestions information
The system is able to provide enough reasons in support of every error that it detects. Along with a meaningful description of the rule, it provides the grammatical categories that failed to match and the desired correct values for those grammatical categories, with suggestions. The information about grammatical categories may not be very meaningful to an ordinary user, but if someone is learning Punjabi as a foreign or second language then information about the correct grammatical categories according to the context can be helpful. Wherever possible, the system also specifies both the words for which matching was performed, making it clearer what is wrong and with respect to what. As shown in Figure 1, it indicates which word was the head word and which word failed to match with it.

The suggestions produced by the Punjabi Grammar Checker to correct the first incorrect word rahian '-ing plural' in the following grammatically incorrect sentence are riha '-ing singular' and rahi '-ing singular':

main khed rahian han 'I are playing'
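How two suggestions can arise for a single incorrect word may be sketched as below. This is an illustration only: the feature table is a hypothetical stand-in for the lexicon, the gender values attached to each form are our assumption rather than something stated in this article, and the idea that main 'I' leaves gender unspecified is likewise an assumption made to explain why both singular forms are offered.

```python
# Hypothetical feature table for forms of the progressive marker (Romanized).
# The gender labels are an assumption added for this sketch.
FORMS = {
    "riha": {"gender": "M", "number": "S"},
    "rahi": {"gender": "F", "number": "S"},
    "rahe": {"gender": "M", "number": "P"},
    "rahian": {"gender": "F", "number": "P"},
}


def mismatches(word_features, required):
    """Map each failing category to (found value, desired value).

    A required value of None means the head word does not constrain that category.
    """
    return {
        category: (word_features[category], wanted)
        for category, wanted in required.items()
        if wanted is not None and word_features[category] != wanted
    }


def suggest(required):
    """List every known form that satisfies all constrained categories."""
    return [form for form, feats in FORMS.items() if not mismatches(feats, required)]


# Subject 'main' (I): singular, gender not marked on the pronoun (assumed).
subject = {"gender": None, "number": "S"}

print(mismatches(FORMS["rahian"], subject))  # {'number': ('P', 'S')}
print(suggest(subject))                      # ['riha', 'rahi']
```

The mismatch dictionary corresponds to the error reason (category, value found, value desired), while the suggestion list corresponds to the candidate corrections offered to the user.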


Figure 1. Punjabi Grammar Checker – Error Reason

Figure 1 shows the grammatical categories that failed to match for the subject main 'I' and part of the verb phrase rahian '-ing plural'. It provides the values of the grammatical categories that failed to match for the incorrect word, along with the desired values for correction.

6 System Scope

The system is designed to work on literary style Punjabi text with SOV (Subject Object Verb) sentence structure. At present, it works properly on simple or kernel sentences. It can also detect agreement errors in compound or complex sentences; however, there may be some false alarms in such sentences. Sentences in which word order is shuffled for emphasis have not been considered, along with sentences in which intonation alone is used for emphasis. Due to emphatic intonation, the meaning or word class of a word may change in a sentence, e.g. te 'and' is usually a connective but if emphasized it can be used as an emphatic particle. This is hard to detect from the written form of the text and thus has not been considered. However, if emphatic particles like he, ee, ve etc. are used directly in a sentence to show emphasis, then that is given due consideration.

7 Hardware and Software Requirements

The system needs hardware and software as would be expected for a typical word processing application. A Unicode compatible Windows XP based PC with 512 MB of RAM, 1 GB of hard disk space, and Microsoft .NET Framework 2.0 installed would be sufficient.

References

Duni Chander. 1964. Punjabi Bhasha da Viakaran (Punjabi). University Publication Bureau, Chandigarh, India.

Daniel Naber. 2003. A Rule-Based Style and Grammar Checker. Diplomarbeit Technische Fakultät, Universität Bielefeld, Germany. (Available at: http://www.danielnaber.de/languagetool/download/style_and_grammar_checker.pdf (1/10/2007))

Harjeet S. Gill and Henry A. Gleason, Jr. 1986. A Reference Grammar of Punjabi. Publication Bureau, Punjabi University, Patiala, India.

Joginder S. Puar. 1990. The Punjabi verb form and function. Publication Bureau, Punjabi University, Patiala, India.

Md. Jahangir Alam, Naushad UzZaman, and Mumit Khan. 2006. N-gram based Statistical Grammar Checker for Bangla and English. In Proc. of ninth International Conference on Computer and Information Technology (ICCIT 2006), Dhaka, Bangladesh.
