Tagalog Support for LanguageTool

Nathaniel Oco
De La Salle University - Manila
2401 Taft Avenue, Malate, Manila City
1004 Metro Manila, Philippines
+639178477549
[email protected]

Allan Borra
De La Salle University - Manila
2401 Taft Avenue, Malate, Manila City
1004 Metro Manila, Philippines
+639174591073
[email protected]

ABSTRACT
This paper outlines the different processes and issues involved in adding Tagalog support for LanguageTool. LanguageTool is an open-source rule-based style and grammar checker that implements a manual-based rule-creation approach. Details of the different LanguageTool resources are discussed in this paper. The different linguistic considerations, technical considerations, and language properties of Tagalog that were captured and handled are also discussed and outlined. The system was tested using 50 correct and 50 incorrect sentences collected from different sources. LanguageTool processed the correct sentences in 53 milliseconds and the incorrect sentences in 80 milliseconds. The Tagalog support scored 95.83% for precision, 46% for recall, and 72% for accuracy.

Categories and Subject Descriptors
F.4.2 [Mathematical Logic and Formal Languages]: Grammars and Other Rewriting Systems – grammar types.
I.5.0 [Pattern Recognition]: General.
I.7.1 [Document and Text Processing]: Document and Text Editing – languages, spelling.

General Terms
Algorithms, Languages.

Keywords
Tagalog, Grammar Checking, LanguageTool, Rules, NLP Tools.

1. INTRODUCTION
LanguageTool, developed by [15], is an open-source rule-based style and grammar checker that implements a manual-based rule-creation approach. It is publicly available through LanguageTool’s website [13]. LanguageTool has a growing list of supported languages. The authors of this paper developed and submitted a new language support, Tagalog support, for LanguageTool to provide a readily-available Tagalog Grammar Checker [16]. LanguageTool version 1.5, released on September 25, 2011, already includes Tagalog support. Aside from Tagalog, LanguageTool also supports Asturian, Belarusian, Breton, Catalan, Chinese, Czech, Danish, Dutch, English, Esperanto, French, Galician, Icelandic, Italian, Khmer, Lithuanian, Malayalam, Polish, Russian, Slovak, Slovenian, Spanish, Swedish, Ukrainian, and Romanian.

This paper aims to explain the processes and issues involved in adding a new language support, Tagalog support, for LanguageTool.

2. GRAMMAR CHECKING
Grammar Checking is the process of detecting whether there is an error in an input. Mark Johnson [Johnson, personal communication] added that grammar checking entails locating where the error is and notifying the user about the error. [5] agrees, adding that grammar checking also entails providing feedback, which can include possible corrections with linguistic explanations. Grammar checkers can then be defined as programs that can detect whether there is an error in an input, locate the error, notify the user about the error, and provide relevant feedback.

[15] identified three approaches in grammar checking: the syntax-based approach, the statistics-based approach, and the rule-based approach.

The syntax-based approach relies on parsing and grammar formalisms (e.g. CFG, LFG). An error is detected if parsing fails, and the error is located using tree structures, graphs, and other methods. Examples of Filipino Grammar checkers that utilize this approach are PanPam [10] and [6].

The statistics-based approach relies on a properly annotated corpus (e.g. Penn Treebank, Brown Corpus) to train the language model. An error is detected and located using probability. [15] explained that sequences describing correct sentences will occur often in the corpus while sequences describing incorrect sentences will occur less often in the corpus or probably not at all.

The rule-based approach relies on rules, which are matched against the input to check and locate errors. LanguageTool is an example of this approach. [11] classifies grammar checkers under this approach into two types: manual-based and automatic-based. Manual-based grammar checkers use manual means to develop rules while automatic-based grammar checkers use automatic means to develop rules. In LanguageTool, rules are manually created and are added or modified incrementally.

Proceedings of the 8th National Natural Language Processing Research Symposium, pages 64-71, De La Salle University, Manila, 24-25 November 2011

3. POS TAGGING
Part-of-speech or POS is a lexical category that defines the function of words. POS Tagging or POST is the process of labeling words with POS. The list of POS used to label words is called a tagset. POS Tagging is heavily utilized in LanguageTool.

4. LANGUAGETOOL
LanguageTool can run as a stand-alone program or as an extension to word editors like OpenOffice¹ or LibreOffice². It uses two major linguistic resources to perform grammar checking: the tagger dictionary and the rule file. As a stand-alone program, LanguageTool splits an input into sentences. Each sentence is split into words and each word is assigned a tag based on the declarations in the tagger dictionary. The words and their tags are checked against the rules in the rule file. If there is a match, feedback is shown to the user. Figure 1 shows a screenshot of LanguageTool as a stand-alone program and Figure 2 shows a screenshot of LanguageTool as an OpenOffice extension. Both figures demonstrate the feedback mechanism of LanguageTool.

Figure 2. LanguageTool as an OpenOffice Extension

5. LANGUAGETOOL RESOURCES
This section discusses and outlines the LanguageTool resources, namely the tagger dictionary and the rule file, and the tagset used for the tagger dictionary.
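The processing steps described in Section 4 (split into sentences, split into words, tag each word, match against rules, report feedback) can be sketched in a few lines of Python. This is only an illustration and not LanguageTool's implementation: the miniature dictionary, the tags assigned to maganda and Maria, the rule representation, and all function names are assumptions made for this sketch.

```python
import re

# Hypothetical miniature tagger dictionary: token -> (base form, tag).
# The tags assigned here are assumptions chosen from the appendix tagset.
TAGGER_DICT = {
    "maganda": ("maganda", "ADMO"),
    "maria": ("maria", "NPRO"),
    "si": ("si", "DEPS"),
}

# One hypothetical rule: a sequence of per-token conditions plus a message.
RULES = [
    {
        "pattern": [("tag", "ADMO"), ("tag", "NPRO"), ("token", r"si|sina")],
        "message": "Irregular POS sequence due to transposition of words.",
    }
]


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tag_sentence(sentence):
    """Split a sentence into words and look each word up in the dictionary."""
    words = re.findall(r"[^\W\d_]+(?:-[^\W\d_]+)*", sentence)
    return [(w, TAGGER_DICT.get(w.lower(), (w, "UNKNOWN"))[1]) for w in words]


def matches(condition, word, tag):
    """Check one pattern condition against one tagged word."""
    kind, value = condition
    if kind == "tag":
        return tag.split()[0] == value
    return re.fullmatch(value, word.lower()) is not None


def check(text):
    """Return (sentence, message) pairs for every rule that matches."""
    feedback = []
    for sentence in split_sentences(text):
        tagged = tag_sentence(sentence)
        for rule in RULES:
            size = len(rule["pattern"])
            for start in range(len(tagged) - size + 1):
                window = tagged[start:start + size]
                if all(matches(c, w, t)
                       for c, (w, t) in zip(rule["pattern"], window)):
                    feedback.append((sentence, rule["message"]))
    return feedback


results = check("Maganda Maria si. Maganda si Maria.")
```

Because rules describe incorrect sentences, a sentence that triggers no rule is treated as correct, which is why missing rules lower recall rather than precision.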

5.1 Tagger Dictionary
The tagger dictionary is a text file used to tag words in an input. Each declaration in the text file follows a tab-separated three-column format. The first column is the token, the second column is the base form of the token, and the third column is the tag of the token. The interpretation of the base form is left to the discretion of the language maintainer; it could be the root word or any other interpretation. In total, the Tagalog tagger dictionary has approximately 8,000 word declarations. Because Tagalog is a newly added language, this number is small compared with the 350,000-word English tagger dictionary and the 530,000-word French tagger dictionary. Table 1 shows sample word declarations from the Tagalog tagger dictionary.

Figure 1. LanguageTool as a stand-alone program

Table 1. Tagalog Tagger Dictionary Declarations

Token          Base Form      Tag
mapanirang     mapanira       ADMO S
anim           anim           ADNU
sa             sa             DECN DAT
ng             ng             DECN GEN
ho             ho             MAHM
nila           nila           PNGP RD P
nilang         nila           PNGP RD P
pumupusta      pumupusta      VACF IN B
yakapin        yakapin        VOBF NE B
kakukumpisal   kakukumpisal   VOTF RC B
kalilibing     kalilibing     VOTF RC B
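A minimal sketch of reading the tab-separated three-column format described in Section 5.1 follows. The sample rows are taken from Table 1; the function name, the Entry type, and the decision to skip blank lines are assumptions made for illustration, not part of LanguageTool.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    token: str      # first column: the word as it appears in text
    base_form: str  # second column: base form chosen by the maintainer
    tag: str        # third column: tag, optionally followed by attributes


# Sample declarations copied from Table 1, one per line, tab-separated.
SAMPLE = (
    "mapanirang\tmapanira\tADMO S\n"
    "nila\tnila\tPNGP RD P\n"
    "nilang\tnila\tPNGP RD P\n"
    "ho\tho\tMAHM\n"
)


def parse_tagger_dictionary(text):
    """Parse dictionary text into a mapping from token to a list of entries."""
    entries = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # assumption: blank lines carry no declaration
        columns = line.split("\t")
        if len(columns) != 3:
            raise ValueError(
                f"line {line_no}: expected 3 tab-separated columns")
        token, base_form, tag = (c.strip() for c in columns)
        entries.setdefault(token, []).append(Entry(token, base_form, tag))
    return entries


dictionary = parse_tagger_dictionary(SAMPLE)
```

A token maps to a list of entries so that an ambiguous word can carry more than one declaration.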

1 http://www.openoffice.org/
2 http://www.libreoffice.org/

5.2 Tagset for the Tagger Dictionary
[17] developed a tagset for Tagalog using the Penn Treebank³ Tagset as a guide. [14] modified this tagset with the aid of linguists and experts in the field. The original tagset proposed by [17] and the modifications by [14] were used as the basis in developing the tagset for the tagger dictionary.

The tagset for the tagger dictionary defines the different tags used in word declarations. The tagset is composed of POS and lexical categories, which can be followed by one or more attributes with a white space separating them. The third column in Table 1 shows sample tag declarations. The table in the appendix shows the tagset developed for the tagger dictionary.

5.3 Rule File
The rules are stored in an XML file. The input is matched against patterns in the rule file, which describe incorrect sentences. A rule file is composed of elements and attributes. The three main elements of a rule file are pattern, message, and example. Pattern refers to the token or sequence of tokens to be matched. Message refers to the feedback, which can include possible suggestions. Example refers to incorrect and correct sentences demonstrating the rule’s usage. Figure 3 shows a rule from the Tagalog rule file.

[Rule shown: token pattern "si|sina"; message "Do you mean \1? Irregular POS sequence due to transposition of words."; short description "Exchange Word Positions"; incorrect example "Maganda Maria si."; correct example "Maganda si Maria."]

Figure 3. Rule File

The Tagalog rule file has approximately 500 lines covering incorrect patterns caused by wrong words, missing words, and transposition of words. The English rule file has approximately 10,000 lines and the French rule file has approximately 25,000 lines. With these numbers, the Tagalog support can still be considered to be in its initial stage.

In most rule files, the message element is normally in the language of the language support. Tagalog uses English in the message element so that other users of the system can also understand.

6. LINGUISTIC CONSIDERATIONS
This section discusses and outlines the different linguistic considerations in developing Tagalog support for LanguageTool.

6.1 POS and Lexical Categories
[19] and [18] identified several Tagalog POS. These are article or pantukoy, noun or pangngalan, pronoun or panghalip, verb or pandiwa, adjective or pang-uri, adverb or pang-abay, preposition or pang-ukol, conjunction or pangatnig, interjection or pandamdam, and ligatures or pang-angkop. These POS, except pantukoy, and other lexical categories (e.g. markers and particles) compose the tagset. The table in the appendix shows the POS and Lexical Categories used.

6.2 Ligatures
According to [18], ligatures or pang-angkop in Tagalog are linkers, either separate words or attached to the modifier, that link a modifier to the word being modified. Figure 4 demonstrates an example.

Maganda     Babae   =  Magandang babae
Beautiful   Woman   =  Beautiful woman
Adjective   Noun

Mabilis     Bata    =  Mabilis na bata
Fast        Child   =  Fast child
Adjective   Noun

Figure 4. Ligature Usage

To handle words with ligatures, a separate tagset would need to be allotted. However, this would result in a large tagset. To solve this, the second column of a word declaration is kept similar to the first column, except that ligatures are omitted. The words, with ligatures omitted, serve as the base form of the token (second column). In Table 1, for example, mapanirang has the base form mapanira and nilang has the base form nila.

6.3 Tagset Attributes
Additional attributes were considered. [9] and [8] both proposed tagsets with additional attributes. These are meant to appropriately model their language and to address language-specific ambiguities and issues.

The tagset for the Tagalog tagger dictionary contains attributes. General categorization and semantic classes were considered as noun attributes. Grammatical person and plurality were considered as pronoun attributes. Verb focus, verb aspect, and plurality were considered as verb attributes. Plurality was considered as an adjective attribute. The POS of the word being modified was considered as an adverb attribute. The morphological case was considered as a determiner attribute. These attributes aid in better classifying properties, linguistic phenomena, and Tagalog words.
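Since a tag is a category code optionally followed by white-space-separated attributes, a declaration such as PNGP RD P can be taken apart mechanically. The sketch below is an illustration only: the helper names are hypothetical, the category descriptions are a small subset copied from the appendix, and the attribute codes (for example RD, P, IN, B) are returned as-is because their meanings are not spelled out in this paper.

```python
# Category codes and descriptions copied from the appendix tagset (subset).
CATEGORY_NAMES = {
    "ADMO": "Adjective: Modifier",
    "DECN": "Determiner: Common Noun",
    "MAHM": "Independent Particle: Honorific Marker",
    "PNGP": 'Pronoun: "ng" Pronouns',
    "VACF": "Verb: Actor Focus",
}


def split_tag(tag):
    """Split a tag string into its category code and its attribute codes."""
    parts = tag.split()
    if not parts:
        raise ValueError("empty tag")
    return parts[0], parts[1:]


def describe(tag):
    """Return (category description, attribute codes) for a tag string."""
    category, attributes = split_tag(tag)
    return CATEGORY_NAMES.get(category, "unknown category"), attributes


pronoun = split_tag("PNGP RD P")
marker = split_tag("MAHM")
verb = describe("VACF IN B")
```

Keeping the attributes as a separate list lets a rule match on the category alone or on the category together with specific attributes.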

3 http://www.cis.upenn.edu/~treebank/
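To make the rule-file structure of Section 5.3 concrete, the sketch below parses a small rule with Python's standard xml.etree.ElementTree module and pulls out the three main elements (pattern, message, example). The XML is a hypothetical rule written for this illustration: the rule id, the postag values, the backreferences in the suggestion, and the attribute names (postag, regexp, type) are assumptions and are not copied from the Tagalog rule file.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule, loosely modelled on the transposition rule of Section 5.3.
RULE_XML = r"""
<rule id="HYPOTHETICAL_TRANSPOSITION" name="Exchange Word Positions">
  <pattern>
    <token postag="ADMO"/>
    <token postag="NPRO"/>
    <token regexp="yes">si|sina</token>
  </pattern>
  <message>Do you mean <suggestion>\1 \3 \2</suggestion>? Irregular POS sequence due to transposition of words.</message>
  <example type="incorrect">Maganda Maria si.</example>
  <example type="correct">Maganda si Maria.</example>
</rule>
"""


def summarise(rule):
    """Collect the pattern, message, and examples of one rule element."""
    pattern = [(tok.get("postag"), tok.text) for tok in rule.find("pattern")]
    message = "".join(rule.find("message").itertext())
    examples = {ex.get("type"): ex.text for ex in rule.findall("example")}
    return {"pattern": pattern, "message": message, "examples": examples}


summary = summarise(ET.fromstring(RULE_XML))
```

Each token in the pattern is matched either by its tag or by a regular expression over the word itself, which is how a single rule can cover both si and sina.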

7. TECHNICAL CONSIDERATIONS
This section discusses and outlines the different technical considerations in developing a Tagalog support for LanguageTool.

7.1 Tagger Dictionary File Size
In most NLP Tools, large dictionary and grammar file sizes affect performance and file storage. One issue with the tagger dictionary is the file size. For instance, the English tagger dictionary in .txt file format has a file size of 8 million bytes while the French tagger dictionary has a file size of 17 million bytes. To address this issue, Morfologik⁴ was utilized to encode the tagger dictionary into a Finite State Automata encoded (FSA-encoded) .dict file. Using Morfologik, the FSA-encoded .dict file of the English tagger dictionary was reduced to 1 million bytes while the French tagger dictionary was reduced to 500 thousand bytes. Since Tagalog support is still in its initial stage, there is little difference in terms of file size between the .txt file format and the FSA-encoded format.

7.2 Regular Expressions
Another way of reducing the file size is by using regular expressions to express patterns and other elements in the rule file. LanguageTool uses the standard regular expression engine of Java⁵. Figure 5 and Figure 6 illustrate the advantages of using regular expressions in a rule. The patterns in Figure 5 can be reduced to the pattern in Figure 6, reducing the number of lines and the space occupied.

ding?|dito|ditong|daw|diyang?|dyang?|doong?

Figure 6. Using Regular Expressions

7.3 Populating the Tagger Dictionary
The words from the literature domain in [14] were used for the tagger dictionary.

[2] proposed a stemming algorithm for Tagalog words. The algorithm they proposed with the use of a spreadsheet application was utilized to tag words.

7.4 File Compilation
LanguageTool community uses Apache Ant⁶ to compile the extension file, which is needed in OpenOffice and in LibreOffice. However, once the files have been compiled, changing the tagger dictionary or the rule file would require the files to be recompiled again.

LanguageTool uses subversion control⁷ and uploads daily snapshots⁸. This allows language maintainers to provide regular updates to tagger dictionaries and rule files. This also allows users to download the latest program.
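The reduction from Figure 5 to Figure 6 described in Section 7.2 can be checked mechanically. The sketch below uses Python's re module as a stand-in for Java's regular expression engine; for the operators used here (alternation and the optional quantifier) the two behave the same. The function name and the extra test word dati are assumptions added for illustration.

```python
import re

# The eleven separate tokens listed in Figure 5.
FIGURE_5_TOKENS = [
    "din", "ding", "dito", "ditong", "daw",
    "diyan", "diyang", "dyan", "dyang", "doon", "doong",
]

# The single regular expression shown in Figure 6.
FIGURE_6_PATTERN = r"ding?|dito|ditong|daw|diyang?|dyang?|doong?"


def covered(pattern, words):
    """Map each word to True if the whole word matches the pattern."""
    compiled = re.compile(pattern)
    return {word: compiled.fullmatch(word) is not None for word in words}


coverage = covered(FIGURE_6_PATTERN, FIGURE_5_TOKENS + ["dati"])
all_figure_5_covered = all(coverage[word] for word in FIGURE_5_TOKENS)
```

Here `ding?` stands for both din and ding, and the same optional g handles diyan/diyang, dyan/dyang, and doon/doong, so eleven token lines collapse into one.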

din
ding
dito
ditong
daw
diyan
diyang
dyan
dyang
doon
doong

Figure 5. Pattern Matching

7.5 Rule-Creation Standards
Apache Ant can also be used to check the rules for errors. For example, the number of tokens to be highlighted must match the number of tokens in the suggestion and in the examples. These standards ensure that the rule files are error-free in terms of syntax.

8. TESTING AND RESULTS
A total of 100 sentences (50 correct and 50 incorrect) from [3], [7], [12], FiSSAn [1], LEFT [4], PanPam [10] and previous test data were used to test the Tagalog support. LanguageTool processed the correct sentences in 53 milliseconds and the incorrect sentences in 80 milliseconds.

LanguageTool properly marked 49 out of 50 correct sentences as correct and 23 out of 50 incorrect sentences as incorrect. The Tagalog support scored 95.83% for precision (23 over 24), 46% for recall (23 over 50), and 72% for accuracy (72 over 100). Figure 7 shows the 27 incorrect sentences marked as correct. Sentences 1 to 9 contain free word order errors. Sentences 10 to 11 contain predicates in the plural form. Sentences 12 to 27 either contain missing words or transposition of words. The low recall rate can be attributed to three things: (1) lack of rules or grammar checking coverage; (2) incorrect or erroneous tagger dictionary declarations; and (3) insufficient word entries.
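The reported scores follow directly from the four counts in Section 8 if a flagged incorrect sentence is treated as a true positive: 23 incorrect sentences flagged, 27 incorrect sentences missed, 49 correct sentences passed, and 1 correct sentence flagged. The short sketch below recomputes them; the function and variable names are chosen for illustration.

```python
def evaluate(true_pos, false_pos, false_neg, true_neg):
    """Compute precision, recall, and accuracy from confusion counts."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    accuracy = (true_pos + true_neg) / (
        true_pos + false_pos + false_neg + true_neg)
    return precision, recall, accuracy


# Counts taken from Section 8: 50 correct and 50 incorrect test sentences.
precision, recall, accuracy = evaluate(
    true_pos=23,   # incorrect sentences marked as incorrect
    false_pos=1,   # correct sentences marked as incorrect
    false_neg=27,  # incorrect sentences marked as correct
    true_neg=49,   # correct sentences marked as correct
)

report = {
    "precision_pct": round(precision * 100, 2),
    "recall_pct": round(recall * 100, 2),
    "accuracy_pct": round(accuracy * 100, 2),
}
```

The gap between precision and recall reflects the rule-based design: a flagged sentence almost always contains a real error, but an error is only flagged when a rule for it exists.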

4 Morfologik is available at: http://sourceforge.net/projects/morfologik/files/morfologikstemming/
5 Standard regular expression engine of Java: http://download.oracle.com/javase/1,5.0/docs/api/java/util/regex/Pattern.html
6 http://ant.apache.org/
7 LanguageTool’s subversion repository: https://languagetool.svn.sourceforge.net/svnroot/languagetool/trunk/JLanguageTool
8 http://www.languagetool.org/download/snapshots/

1. Ang bumili lalaki ng isda sa tindahan.
2. Nagbigay sa ng libro babae ang lalaki.
3. Binigyan ng babae ng libro dalaga ang.
4. Binigyan mabait ng regalo ang batang.
5. Maganda sumasayaw si Rosa na.
6. Pinalo tatay ng makulit batang ang.
7. Bumili ng Manuel si medias para sa dalaga.
8. Bumili si Maria ng libro tungkol sa pag-ibig.
9. Ang pusang mataba pinakain isda ng.
10. Nagsisikain na si Maria ng hapunan.
11. Nagsipangisda na si Ben.
12. Ay kumain.
13. Si Janyll ay.
14. Ay si Janyll maganda.
15. Si Janyll kumain ay.
16. Kumain ay si Janyll.
17. Maganda ay si Janyll.
18. Mabait si Martee sa.
19. Kumain si Martee si Justin.
20. Aalis ako ikaw.
21. Umalis nagnakaw.
22. Sumama maganda.
23. Umalis ang nagnakaw ang ninakawan.
24. Sumama ang malakas ang maganda.
25. Kumain uminom si Martee.
26. Nawawala mabilis ang tumakbo.
27. Sumama totoo ang maganda.

Figure 7. Incorrect Sentences Detected as Correct

9. FINAL NOTES
Although the Tagalog support scored 95.83% for precision, the recall rate is only 46%. This highlights the fact that Tagalog support is still in its early stages. However, it is important to note that the linguistic resources of LanguageTool are sufficient to handle the different Tagalog language properties and linguistic phenomena.

Further development can focus on improving the grammar checking coverage by adding more rules to the rule file and improving the tagger dictionary and the tagset. Future work can also focus on other LanguageTool functionalities like automatic detection of code switching.

Overall, LanguageTool is a novel NLP tool that provides a readily-available grammar checker.

10. ACKNOWLEDGMENTS
The authors acknowledge LanguageTool’s developers and language maintainers for their assistance.

11. REFERENCES
[1] Ang, M., Cagalingan, S., Tan, P., and Tan, R. 2002. FiSSAn: Filipino Sentence Syntax and Semantic Analyzer. Undergraduate Thesis. De La Salle University, Manila, Philippines.
[2] Bonus, D. and Roxas, R. 2004. A Stemming Algorithm for Tagalog Words. In Proceedings of the 4th Philippine Computing Science Congress (Laguna, Philippines, February 14-15, 2004). PCSC '04. Computing Science of the Philippines. ISSN=1908-1146.
[3] Cena, R. and Ramos, T. 1990. Modern Tagalog: Grammatical Explanations and Exercises for Non-native Speakers. University of Hawaii Press, Honolulu, HI.
[4] Chan, E., Lim, I., Tan, R., and Tong, M. 2006. LEFT: Lexical Functional Grammar Based English-Filipino Translator. Undergraduate Thesis. De La Salle University, Manila, Philippines.
[5] Clément, L., Gerdes, K., and Marlet, R. 2009. A Grammar Correction Algorithm: Deep Parsing and Minimal Corrections for a Grammar Checker. In Proceedings of the 14th Conference on Formal Grammar (Bordeaux, France, July 25-26, 2009). FG '09. Springer-Verlag, Berlin, Germany. 47-63. DOI= http://doi.acm.org/10.1007/978-3-642-20169-1_4.
[6] Dimalen, D. and Dimalen, E. 2007. An OpenOffice Spelling and Grammar Checker Add-in Using an Open Source External Engine as Resource Manager and Parser. In Proceedings of the 4th National Natural Language Processing Research Symposium (Manila, Philippines, June 14-16, 2007). NNLPRS '07. 69-73.
[7] Dimalen, E. 2003. A Parsing Algorithm for Constituent Structures of Tagalog. Graduate Thesis. De La Salle University, Manila, Philippines.
[8] Divjak, D., Erjavec, T., Feldman, A., Kopotev, M., and Sharoff, S. 2008. Designing and Evaluating a Russian Tagset. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (Marrakech, Morocco, May 28-30, 2008). LREC '08. European Language Resources Association, Paris, France, 279-285. ISBN=2-9517408-4-0.
[9] Feldman, A. and Hana, J. 2010. A Positional Tagset for Russian. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (Valletta, Malta, May 19-21, 2010). LREC '10. European Language Resources Association, Paris, France, 1277-1284. ISBN=2-9517408-6-7.
[10] Jasa, M., Palisoc, J., and Villa, M. 2007. Panuring Pampanitikan (PanPam): A Sentence Syntax and Semantic Based Grammar Checker for Filipino. Undergraduate Thesis. De La Salle University, Manila, Philippines.
[11] Konchady, M. 2009. Detecting Grammatical Errors in Text using a Ngram-based Ruleset. Retrieved October 6, 2011 from Emustru: http://emustru.sourceforge.net/detecting_grammatical_errors.pdf.
[12] Kroeger, P. 1993. Phrase Structure and Grammatical Relations in Tagalog. CSLI Publications, Stanford, CA.
[13] LanguageTool. http://www.languagetool.org/.
[14] Miguel, D. 2007. Comparative Analysis of Tagalog Part-of-Speech (POS) Taggers. Graduate Thesis. De La Salle University, Manila, Philippines.
[15] Naber, D. 2003. A Rule-Based Style and Grammar Checker. Diploma Thesis. Bielefeld University, Bielefeld.
[16] Oco, N. and Borra, A. 2011. A Grammar Checker for Tagalog using LanguageTool. In Proceedings of the 9th Workshop on Asian Language Resources collocated with IJCNLP 2011 (Chiang Mai, Thailand, November 12-13, 2011). ALR '11. Asian Federation of Natural Language Processing. 2-9. ISBN 978-974-466-565-2.
[17] Rabo, V. 2004. TPOST: A Template-based N-gram Part-of-Speech Tagger for Tagalog. Graduate Thesis. De La Salle University, Manila, Philippines.
[18] Ramos, T. 1971. Makabagong Balarila ng Pilipino. Rex Bookstore, Manila, Philippines.
[19] Santos, L. 1939. Balarila ng Wikang Pambansa. Institute of Philippine Language, Manila, Philippines.

APPENDIX

Table 2. Tagset for the Tagger Dictionary

Noun: [tag] [general categorization] [semantic class]
NPRO   Proper Noun
NCOM   Common Noun
NABB   Abbreviation

Pronoun: [tag] [grammatical person] [plurality]
PANP   “ang” Pronouns
PNGP   “ng” Pronouns
PSAP   “sa” Pronouns
PAND   “ang” Demonstratives
PNGD   “ng” Demonstratives
PSAD   “sa” Demonstratives
PFOP   Found Pronouns
PINP   Interrogative Pronouns
PCOP   Comparison Pronouns
PIDP   Indefinite Pronouns
POTH   Other

Verb: [focus] [aspect] [plurality]
VACF   Actor Focus
VOBF   Object / Goal Focus
VBEF   Benefactive Focus
VLOF   Locative Focus
VINF   Instrument Focus
VOTF   Other

Adjective: [tag] [plurality]
ADMO   Modifier
ADCO   Comparative
ADSU   Superlative
ADNU   Numeral
ADUN   Unaffixed
ADOT   Other

Adverb: [tag] [modifies]
AVMA   Manner
AVNU   Numeral
AVDE   Definite
AVEO   Comparison, group I
AVET   Comparison, group II
AVCO   Comparative, group I
AVCT   Comparative, group II
AVSO   Superlative, group I
AVST   Superlative, group II
AVSC   Slight comparison
AVAY   Agree (Panang-ayon)
AVGI   Disagree (Pananggi)
AVAG   Possibility (Pang-agam)
AVPA   Frequency (Pamanahon)
AVOT   Other

Conjunction
CONM   Panimbang
COMU   Pamukod
CONU   Panubali
CONI   Paninsay
CONA   Pananhi
CONP   Panapos
CONG   Panghugnay
COOT   Other

Preposition
PRPL   Place
PRLO   Location
PRSO   Source
PRTA   Target
PRRE   Referential
PRAG   Agree
PRDI   Disagree
PRME   Means
PROT   Other

Determiner: [tag] [morphological case]
DECN   Common Noun
DEPS   Personal Name Singular
DEPP   Singular Person Marker
DEPL   Plural Marker

Interjection
INTR   Interjection
IRIA   Positive Informal Response
IRID   Negative Informal Response
IRFA   Positive Formal Response
IRUN   Uncertain Response

Ligature
LINA   Ligature “na”
LIPA   Ligature “pa”

Independent Particles
MALM   Lexical Marker
MANE   Negation Marker
MAVN   Verb Negation Marker
MAEM   Existential Marker
MANM   Non-existential Marker
MAHM   Honorific Marker

Auxiliary Word
AUXP   Auxiliary Positive
AUXN   Auxiliary Negative
AUPO   Auxiliary Possibility

Enclitic
ENCL   Enclitic

Punctuation, Symbol, Number
PSNP   Period
PSNE   Exclamation Point
PSNQ   Question Mark
PSNC   Comma
PSNS   Symbols
PSNN   Numerals
