A Checker for Tagalog using LanguageTool

Nathaniel Oco Allan Borra Center for Language Technologies Center for Language Technologies College of Computer Studies College of Computer Studies De La Salle University De La Salle University 2401 Taft Avenue 2401 Taft Avenue Malate, Manila City Malate, Manila City 1004 Metro Manila 1004 Metro Manila Philippines Philippines [email protected] [email protected]

a part-of-speech tag based on the declarations in Abstract the Tagger Dictionary. The words and their part- of-speech are used to check for patterns that match those declared in the rule file. If there is a pattern match, an error message is shown to the This document outlines the use of Language user. Currently, LanguageTool supports Belaru- Tool for a Tagalog Grammar Checker. Lan- sian, Catalan, Danish, Dutch, English, Esperanto, guage Tool is an open-source rule-based en- French, Galician, Icelandic, Italian, Lithuanian, gine that offers grammar and style checking functionalities. The details of the various lin- Malayalam, Polish, Romanian, Russian, Slovak, guistic resource requirements of Language Slovenian, Spanish, Swedish, and Ukrainian to a Tool for the are outlined and certain degree. discussed. These are the tagger dictionary and Tagalog is the basis for the , the rule file that use the notation of Language the official language of the Philippines. Accord- Tool. The expressive power of Language ing to a data collected by Cheng et al. (2009), Tool’s notation is analyzed and checked if there are 22,000,000 native speakers of Tagalog. Tagalog linguistic phenomena are captured or This makes it the highest in the country, fol- not. The system was tested using a collection lowed by Cebuano with 20,000,000 native of sentences and these are the results: 91% speakers. Tagalog is very rich in morphology, precision rate, 51% recall rate, 83% accuracy rate. Ramos (1971) stated that Tagalog words are normally composed of root words and . 1 Credits Dimalen and Dimalen (2007) described Tagalog as a language with “high degree of ”. LanguageTool was developed by Naber (2003). Jasa et al. (2007) stated that the number of It can run as a stand-alone program and as an available Tagalog grammar checkers is limited. extension for OpenOffice.Org 1 and LibreOffice 2. Tagalog is a very rich language and Language- LanguageTool is distributed through Language- Tool is a flexible language. The development of Tool’s website: http://www.languagetool.org/. Tagalog support for LanguageTool provides a readily-available Tagalog grammar checker that 2 Introduction can be easily updated. LanguageTool is an open-source style and 3 Related Works grammar checker that follows a manual-based rule-creation approach. Ang et al. (2002) developed a semantic analyzer LanguageTool utilizes rules stored in an xml that has the capability to check semantic rela- file to analyze and check text input. The text in- tionships in a Tagalog sentence. Jasa et al. (2007) put is separated into sentences, each sentence is and Dimalen and Dimalen (2007) both developed separated into words, and each word is assigned syntax-based Filipino grammar checker exten- sions for OpenOffice.Org Writer. In syntax- 1 OpenOffice.Org is available at http://www.openoffice.org/ based grammar checkers, error-checking is based 2 LibreOffice is available at http://www.libreoffice.org/ on the parser. An input is considered correct if

2 Proceedings of the 9th Workshop on Asian Language Resources, pages 2–9, Chiang Mai, Thailand, November 12 and 13, 2011. parsing succeeds, erroneous if parsing fails. Na- word declarations for the Tagalog Tagger Dic- ber (2003) explained that syntax-based grammar tionary. checkers need a complete grammar to function.

Erroneous sentences that are not covered by the 4.2 Tagset for the Tagger Dictionary grammar can be flagged as error-free input. A tagset for the Tagalog tagger dictionary is pro- posed. The tagset is based on the tagset devel- 4 LanguageTool Resources oped by Rabo and Cheng (2006) and the modifi- Discussed here are the different language re- cations by Miguel and Roxas (2007). The discus- sources required by the tool. The notations, for- sions on Tagalog affixation (1971) and case sys- mats, and acquisition of resources are outlined tem of Tagalog (1973) by Ramos, and discussed. aspect and verb by Cena and Ramos (1990), different Tagalog part-of-speech by Cu- 4.1 Tagger Dictionary bar and Cubar (1994), and inventory of verbal affixes by Otanes and Schachter (1972) were Language Tool utilizes a dictionary file, called taken into account. the Tagger Dictionary. The tagger dictionary, Table 1 shows the proposed tags. which contains word declarations, is utilized in were classified into proper nouns, common pattern matching to identify and tag words with nouns, and abbreviations. Kroeger (1993) ex- their part-of-speech. plained that the determiners used for proper The tagger dictionary can be a txt file, a dict nouns and common nouns are different to a cer- file, or an FSA-encoded 3 dict file. The tagger tain degree. dictionary contains three columns, separated by a tag. The first column is the inflected form. The NOUN: [tag] [semantic class] second column is the base form. The third col- Tag umn is the part-of-speech tag. The format for the Tagalog tagger dictionary follows the three- NPRO Proper Noun column format. The first column is the inflected NCOM Common Noun form, which could contain ligatures. The second NABB Abbreviation column is similar to the first column, except that Table 1. Noun Tags ligatures were omitted. This serves as the base form. The third column is the proposed tag, Table 2 shows the proposed tags. which is composed of the part-of-speech or POS Grammatical person and plurality attribute were of the word and the corresponding attribute-value added to aid in distinguishing different types of pair, separated by a white space character. This . serves as the POS tag. Figure 1 shows a sample declaration from the Tagalog tagger dictionary. PRONOUN: [tag] [grammatical person] [plurality] Tag doktor doktor NCOM PANP “ang” Pronouns ako ako PANP ST S PNGP “ng” Pronouns kumakain kumakain VACF IN PSAP “sa” Pronouns nasa nasa PRLO mga mga DECP PAND “ang” Demonstratives hoy hoy INTR PNGD “ng” Demonstratives PSAD “sa” Demonstratives Figure 1. Tagalog Tagger Dictionary Example PFOP Found Pronouns Declarations PINP Interrogative Pronouns

Evaluation and test data from different re- PCOP Comparison Pronouns searches on Tagalog POS Tagging (Bonus, 2004; PIDP Indefinite Pronouns Cheng and Rabo, 2006; Miguel and Roxas, POTH Other 2007) were used to come up with almost 8,000 Grammatical Person ST 1st person 3 FSA stands for Finite State Automata. Morfologik was ND 2nd person used to build the binary automata. Morfologik is available at RD 3rd person http://sourceforge.net/projects/morfologik/files/morfologik- stemming/ NU Null

3 Plurality S Singular : [tag] [modifies] P Plural Tag B Both AVMA Manner Table 2. Pronoun Tags AVNU Numeral AVDE Definite Table 3 shows the proposed verb tags. Verb AVEO Comparison, group I focus and verb aspect were added. The verb fo- AVET Comparison, group II cus can indicate the thematic role the subject is AVCO Comparative, group I taking. This is useful for future works. AVCT Comparative, group II AVSO Superlative, group I VERB: [focus] [aspect] AVST Superlative, group II Focus AVSC Slight comparison VACF Actor Focus AVAY Agree (Panang-ayon) VOBF / Goal Focus AVGI Disagree (Pananggi) VBEF Benefactive Focus AVAG Possibility (Pang-agam) VLOF Locative Focus AVPA Frequency (Pamanahon) VINF Instrument Focus AVOT Other VOTF Other Modifies Aspect VE Verb NE Neutral AD CM Completed AV Adverb IN Incompleted AL Applicable to All CN Contemplated Table 5. Adverb Tags RC Recently Completed OT Other Conjunctions, prepositions, determiners, inter- Table 3. Verb Tags jections, ligatures, particles, enclitic, punctua- tion, and auxiliary words are also part of the pro- Table 4 shows the proposed adjective tags. posed tagset. These tags however, do not contain Plurality was added to handle number agreement. additional properties or corresponding attribute- Kroeger (1993) stated that if the plurality of the value pairs. Overall, the tagset has a total of 87 nominative argument does not match the plural- tags from 14 POS and lexical categories. ity of the adjective or the predicate, the sentence considered ungrammatical. 4.3 Rule File The rule file is an xml file used to check errors in ADJECTIVE: [tag] [plurality] a sentence. If a pattern declared in the rule Tag matches the input sentence, an error is shown to ADMO Modifier the user. ADCO Comparative The rule file, case insensitive by default, is ADSU Superlative composed of several rule categories which may ADNU Numeral cover but is not limited to spelling, grammar, ADUN Unaffixated style, and punctuation errors. Each rule category ADOT Other is composed of one or more rules or rule groups. Plurality Each rule is composed of different elements and S Singular attributes. The three basic elements a rule has are P Plural pattern, message, and example. The pattern ele- N Null ment is where the error to be matched is de- Table 4. Adjective Tags clared. The message element is where the feed- back and suggestion, if applicable, is declared. Table 5 shows the proposed adverb tags. An The example element is where incorrect and cor- additional attribute was added to distinguish the rect examples are declared. Figure 2 shows a POS of the word being modified. Ramos (1971) pseudocode that describes what happens in the stated that in Tagalog can modify verbs, event a pattern is matched and Figure 3 shows an , and other adverbs. example rule in the Tagalog rule file.

4 ding? = din or ding if(pattern in rule file = pattern in input) { ring? = rin or ring mark error; .*[aeiou] = any word that ends in a vowel show feedback; .*[bcdfghjklmnpqrstvwxyz] = any word that provide suggestions if applicable; ends in a consonant } Figure 4. Regular expression usage Figure 2. Pseudocode think think|say mark_from="0"> matches the regular expression think|say, i.e. mga the word “think” or “say” mga Do you mean house ang matches a base form verb followed by the \2? "mga" can word house. not be followed by another "mga". cause negate="yes">and|to Word Repetition matches the word “cause” followed by any Maganda mga mga ken>foobar Maganda matches the word “foobar” only at the begin- ang mga ning of a sentence tanawin. Figure 5. Different methods of pattern-matching described in LangaugeTool’s website Figure 3. Rule File Declaration for “ang ang” word repetition The following resources were used as basis in developing rules: Makabagong Balarila ng Pili- Pattern matching can utilize tokens, POS tags, pino (Ramos, 1971), Writing Filipino Gramamar: and a combination of both to properly capture 4 Traditions and Trends (Cubar and Cubar, 1994), errors. Regular expressions are also used to Modern Tagalog: Grammatical Explanations and simplify or merge several rules. Figure 4 shows Exercises for Non-native Speakers (Cena and different examples of using regular expression. Ramos, 1990), Tagalog Reference Grammar Different methods of pattern-matching explained (Otanes and Schachter, 1972) and Phrase Struc- in LanguageTool’s website are shown in Figure ture and Grammatical Relations in Tagalog 5. It should be noted that if a particular error is (Kroeger, 1993). not covered by the tagger dictionary and the rule file, the error will not be detected. 5 Tagalog Grammar Checking

Errors are classified into three types: wrong word, missing word, and transposition of words. This section discusses the different types of er- rors and the corresponding method for capturing these errors. Figure 6 shows a pseudocode ex- plaining how an error is classified.

4 Standard Regular Expression Engine of Java. Described at: http://download.oracle.com/javase/1,5.0/docs/api/java/util/r egex/Pattern.html

5 if(POS sequence != unoccurring) ken matching is performed. Regular expressions Wrong Word; were employed to make rule files shorter. else if(POS sequence = unoccurring) if(POS sequence before != Correct: unoccurring || POS after != unoccur- Magnanakaw din siya. ring) He is also thief. Missing Word; else Incorrect: Transposition; Magnanakaw rin siya. (For: He is also thief.) Figure 6. Pseudocode Figure 8. Sound and Letter Change

5.1 Wrong Words Wrong words are often caused by using the Other errors like proper adverb and ligature wrong determiner and affixation rule. Also, mor- usage also fall into this type of error. phophonemic change and verb focus are often 5.2 Missing Words not taken into consideration. There are cases where relying on part-of-speech alone will not Missing words are often due to missing deter- capture certain errors. To address this issue, miners, particles, markers, and other words com- grammatical person and plurality of pronouns, posed of several letters. Usually, missing words focus and aspect of verbs, plurality of adjectives, cause irregular and unoccurring POS sequence. and word modified by adverbs were considered Figure 9 illustrates an example. Unoccurring in developing the tagset. Consider the example in POS sequence are checked and matched against Figure 7. Both have the same POS but only one specific rules. The missing word is added to the is correct. Kroeger (1993) pointed out that plural- sentence as feedback. In the sentences in Figure ity in adjectives is demonstrated by the redupli- 9, it is unnatural for a pronoun to be immediately cation of the first syllable. An error caused by the followed by an adjective. Missing words are cap- disagreement of the plurality of the adjective and tured by looking for unoccurring POS sequence the plurality of the nominative argument can not often caused by a missing word. be handled by considering the part-of-speech only. Correct: Ikaw ay maganda. Correct: Pronoun Marker Adjective Magaganda kami. You beautiful Adjective 1st person Pronoun You are beautiful. Plural Plural Beautiful we. Incorrect: We are beautiful. Ikaw maganda. Pronoun Adjective Incorrect: You beautiful Magaganda ako. (For: You are beautiful) Adjective 1st person Pronoun ay Plural Singular Figure 9. Missing Lexical Marker “ ” Beaautiful me. 5.3 Transposition (For: I am beautiful) The process of detecting errors caused by trans- Figure 7. Number Agreement position is similar to missing words. The main difference is tokens and POS tags before and af- Consider the sentences in Figure 8. The en- ter the unoccurring POS sequence are considered “ din ” is used if the last letter of the preced- and checked for any irregularities. ing word is a consonant. Otherwise, “ rin ” is used. Cena and Ramos (1990) explained that sound and letter changes occur in affixation and even in word boundaries. “ din ” and “ rin ” is one of many examples. To address this, a simple to-

6 6 Performance of Language Tool: Re- The system flagged 4 error-free sentences as sults and Analysis erroneous. This is mainly because of wrong dec- larations in the tagger dictionary file. Figure 13 The system was initially tested using a collection shows one of the sentences. In the tagger dic- of sentences. The collection is composed of tionary, “ mag-aral ” was declared as a noun and evaluation data used in FiSSAn (Ang et al., “maingay ” was declared both as an adverb and as 2002), LEFT (Chan et al., 2006), and PanPam an adjective. In the Tagalog language, if a com- (Jasa et al., 2007). Test data used by Dimalen mon noun is preceded by an adjective, there (2003) examples from books (Kroeger, 1993; should be a ligature between them. Figure 14 Ramos, 1971), and additional test data are also demonstrates proper Tagalog ligature usage. part of the collection. A total of 272 sentences from the collection were used. Table 6 shows a Umalis ang mabait summary of figures. 186 out of 190 error-free Verb Det Adjective sentences were marked as error-free, 4 out of 190 Leave the good error-free sentences were marked as erroneous,

42 out of 82 erroneous sentences were marked as ngunit maingay mag-aral. erroneous, and 40 out of 82 erroneous sentences Conjunct Adverb Verb were marked as error-free. but noisy study

Sentences Correctly Incorrectly Total Figure 13. Flagged as erroneous Flagged Flagged

Error-free 186 4 190 Erroneous 42 40 82 Root word ends with a vowel, add “-ng” Total 228 44 272 Matalino + bata Table 6. Summary of Figures Adjective Common Noun Intelligent Child The test showed that the system has a 91% precision rate, 51% recall rate, and 83% accuracy =Matalinong bata rate. Figure 10, Figure 11, and Figure 12 show Intelligent Child the formulas used for precision, recall, and accu- racy, respectively. True Positives refer to errone- Root word ends with the letter “n”, add “-g” ous evaluation data properly flagged by the sys- Matulin + bata tem as erroneous. False Positives refer to error- Adjective Common Noun free evaluation data flagged by the system as er- Fast Child roneous. True Negatives refer to error-free evaluation data properly flagged by the system as =Matuling bata error-free. False Negatives refer to erroneous Fast Child evaluation data flagged by the system as error- free. Root word ends with a consonant, add “na” Matapang + bata TruePositi ves Adjective Common Noun Brave Child TruePositi ves + FalsePosit ives Figure 10. Precision Formula =Matapang na bata Brave Child TruePositi ves Figure 14. Ligature usage TruePositi ves + FalseNegat ives Figure 11. Recall Formula The presence of ellipsis in one of the sen- tences is another reason why error-free sentences TruePositi ves + TrueNegati ves were flagged as erroneous. Ellipsis was not de-

TotalNumbe rOfEvaluat ionData clared in the rule file. This resulted in two sen- tences being recognized as one. Figure 12. Accuracy Formula The system flagged 40 out of 42 erroneous sentences as error-free. A close analysis on there errors reveal that majority of the sentences con-

7 tains free-word order errors, transposition of For comparative evaluation, the same collec- more than 2 words, extra words. Some sentences tion was tested on PanPam (Jasa et al., 2007) and contain errors that focus on semantic checking. these are the results: 23% precision rate, 46% Figure 15 shows 9 of these sentences. These are recall rate, and 38% accuracy rate. Table 7 shows the type of errors that are not handled by the sys- a summary of figures. tem and are not declared in the rule file. Future research works can focus on these areas. Sentences Correctly Incorrectly Total Flagged Flagged Humihinga ang bangkay. Error-free 68 122 190 The corpse is breathing. Erroneous 38 44 82 Total 106 166 272 Nagluto ang sanggol. Table 7. PanPam Results The baby cooked. The comparative evaluation shows that the Naglakad ang ahas. system scored 68% higher than PanPam in terms The snake walked. of precision, 5% higher in terms of recall, and 37% higher in terms of accuracy. Kumain ang plato. Overall, these findings reaffirm earlier analy- The plate ate. sis by Konchady (2009) that rule-based grammar checkers that follow a manual-based rule- Nabasag ang basong mabilis. creation approach tend to produce low recall rate The fast glass shattered. but precision rate is above average. This is be- cause the total number of rules isn’t sufficient to Kumain ang plato sa baso. cover a variety of errors. Also, because of pat- The plate ate at the glass. tern-matching, majority of the errors detected are indeed errors. It is also important to note, espe- Kumain ang aso ng plato. cially in the case of LanguageTool, that the pat- The dog ate the plate. terns being captured are erroneous sentences and not error-free sentences. This makes rule-based Tumakbo ang sapatos. grammar checkers dependent on the rules de- The shoe ran. clared for error checking coverage. LanguageTool can support the Tagalog lan- Nagluto ang pusa ng pagkain. guage to a certain degree. Although developing a The cat cooked food. tagger dictionary and a rule file is a tedious task, it is necessary to create a tagger dictionary, a tag- Figure 15. Flagged as error-free set, and rules that can handle the different Taga- log linguistic Penomena. Among the 42 erroneous sentences it correctly flagged as erroneous, the system provided the Acknowledgements correct feedback for 41 sentences. The sentence with incorrect feedback is shown in Figure 16. The authors acknowledge developers and main- The sentence, used to test free-word order, con- tainers of LanguageTool especially Daniel Na- tains transposition of several words. The system ber, Dominique Pellé, and Marcin Milkowski for detected it as a missing last word error because being instrumental to the completion of this the determiner “ ang ” can not be the last word of study and to the completion of the Tagalog sup- a sentence. port for LanguageTool. The August 07, 2011 snapshot of LanguageTool would not include Tagalog support if not for their assistance. The Pinalo tatay ng makulit batang ang. authors also acknowledge the opinions and thoughts shared through email by Vlamir Rabo, Correct Form: Manu Konchady, and Mark Johnson. Pinalo ng tatay ang batang makulit. The father spanked the naughty child. References Figure 16. Sentence with incorrect feedback Charibeth K. Cheng, Nathalie Rose T. Lim, and Ra- chel Edita O. Roxas. 2009. Philippine Language

8 Resources: Trends and Directions. Proceedings of pino Sentence Syntax and Semantic Analyzer . Un- the 7 th Workshop on Asian Langauge Resource dergraduate Thesis. De La Salle University, Ma- (ALR7), Singapore. nila. Charibeth K. Cheng and Vlamir S. Rabo. 2006. Paul Kroeger. 1993. Phrase Structure and Gram- TPOST: A Template-based Part-of-Speech Tagger martical Relations in Tagalog . CSLI Publications, for Tagalog. Journal of Research in Science, Com- Stanford, CA. puting, and Engineering, Volume 3, Number 1. Resty M. Cena and Teresita V. Ramos. 1990. Modern Dalos D. Miguel and Rachel Edita O. Roxas. 2007. Tagalog: Grammatical Explanations and Exercises Comparative Analysis of Tagalog Part of Speech for Non-native Speakers . University of Hawaii (POS) Taggers. Proceedings of the 4 th National Press, Honolulu, HI. Natural Language Processing Research Sympo- Teresita V. Ramos. 1971. Makabagong Balarila ng sium (NNLPRS), CSB Hotel, Manila. ISSN 1908- Pilipino . Rex Book Store, Manila. 3092. Teresita V. Ramos. 1973. The Case System of Taga- Daniel Naber. 2003. A Rule-Based Style and Gram- log Verbs . Doctoral Dissertation. University of mar Checker. Diploma Thesis. Bielefeld Univer- Hawaii. Honolulu, HI. sity, Bielefeld. Davis Muhajereen D. Dimalen and Editha D. Di- malen. 2007. An OpenOffice Spelling and Gram- mar Checker Add-in Using an Open Source Exter- nal Engine as Resource Manager and Parser. Pro- ceedings of the 4 th National Natural Language Processing Research Symposium (NNLPRS), CSB Hotel, Manila. Don Erick J. Bonus. 2004. A Stemming Algorithm for Tagalog Words. Proceedings of the 4 th Philippine Computing Science Congress (PSCS 2004), Uni- versity of the Philippines – Los Baños, Laguna. Editha D. Dimalen. 2003. A Parsing Algorithm for Constituent Structures of Tagalog . Master’s The- sis. De La Salle University, Manila. Ernesto H. Cubar and Nelly I. Cubar. 1994. Writing Filipino Grammar: Traditions and Trends . New Day Publishers, Quezon City. Erwin Andrew O. Chan, Chris Ian R. Lim, Richard Bryan S. Tan, and Marlon Cromwell N. Tong. 2006. LEFT: Lexical Functional Grammar Based English-Filipino Translator . Undergraduate The- sis. De La Salle University, Manila. Fe T. Otanes and Paul Schachter. 1972. Tagalog Ref- erence Grammar . University of California Press, Berkeley, CA. LanguageTool. http://www.languagetool.org/ Manu Konchady. 2009. Detecting Grammatical Er- rors in Text using a Ngram-based Ruleset . Re- trieved from: http://emustru.sourceforge.net/detecting_grammati cal_errors.pdf Michael A. Jasa, Justin O. Palisoc, and Martee M. Villa. 2007. Panuring Pampanitikan (PanPam): A Sentence Syntax and Semantic Based Grammar Checker for Filipino . Undergraduate Thesis. De La Salle University, Manila. Morgan O. Ang, Sonny G. Cagalingan, Paulo Justin U. Tan, and Reagan C. Tan. 2002. FiSSAn: Fili-

9