A Grammar Checker for Tagalog Using Languagetool
Total Page:16
File Type:pdf, Size:1020Kb
A Grammar Checker for Tagalog using LanguageTool Nathaniel Oco Allan Borra Center for Language Technologies Center for Language Technologies College of Computer Studies College of Computer Studies De La Salle University De La Salle University 2401 Taft Avenue 2401 Taft Avenue Malate, Manila City Malate, Manila City 1004 Metro Manila 1004 Metro Manila Philippines Philippines [email protected] [email protected] a part-of-speech tag based on the declarations in Abstract the Tagger Dictionary. The words and their part- of-speech are used to check for patterns that match those declared in the rule file. If there is a pattern match, an error message is shown to the This document outlines the use of Language user. Currently, LanguageTool supports Belaru- Tool for a Tagalog Grammar Checker. Lan- sian, Catalan, Danish, Dutch, English, Esperanto, guage Tool is an open-source rule-based en- French, Galician, Icelandic, Italian, Lithuanian, gine that offers grammar and style checking functionalities. The details of the various lin- Malayalam, Polish, Romanian, Russian, Slovak, guistic resource requirements of Language Slovenian, Spanish, Swedish, and Ukrainian to a Tool for the Tagalog language are outlined and certain degree. discussed. These are the tagger dictionary and Tagalog is the basis for the Filipino language, the rule file that use the notation of Language the official language of the Philippines. Accord- Tool. The expressive power of Language ing to a data collected by Cheng et al. (2009), Tool’s notation is analyzed and checked if there are 22,000,000 native speakers of Tagalog. Tagalog linguistic phenomena are captured or This makes it the highest in the country, fol- not. The system was tested using a collection lowed by Cebuano with 20,000,000 native of sentences and these are the results: 91% speakers. Tagalog is very rich in morphology, precision rate, 51% recall rate, 83% accuracy rate. Ramos (1971) stated that Tagalog words are normally composed of root words and affixes. 1 Credits Dimalen and Dimalen (2007) described Tagalog as a language with “high degree of inflection”. LanguageTool was developed by Naber (2003). Jasa et al. (2007) stated that the number of It can run as a stand-alone program and as an available Tagalog grammar checkers is limited. extension for OpenOffice.Org 1 and LibreOffice 2. Tagalog is a very rich language and Language- LanguageTool is distributed through Language- Tool is a flexible language. The development of Tool’s website: http://www.languagetool.org/. Tagalog support for LanguageTool provides a readily-available Tagalog grammar checker that 2 Introduction can be easily updated. LanguageTool is an open-source style and 3 Related Works grammar checker that follows a manual-based rule-creation approach. Ang et al. (2002) developed a semantic analyzer LanguageTool utilizes rules stored in an xml that has the capability to check semantic rela- file to analyze and check text input. The text in- tionships in a Tagalog sentence. Jasa et al. (2007) put is separated into sentences, each sentence is and Dimalen and Dimalen (2007) both developed separated into words, and each word is assigned syntax-based Filipino grammar checker exten- sions for OpenOffice.Org Writer. In syntax- 1 OpenOffice.Org is available at http://www.openoffice.org/ based grammar checkers, error-checking is based 2 LibreOffice is available at http://www.libreoffice.org/ on the parser. An input is considered correct if 2 Proceedings of the 9th Workshop on Asian Language Resources, pages 2–9, Chiang Mai, Thailand, November 12 and 13, 2011. parsing succeeds, erroneous if parsing fails. Na- word declarations for the Tagalog Tagger Dic- ber (2003) explained that syntax-based grammar tionary. checkers need a complete grammar to function. Erroneous sentences that are not covered by the 4.2 Tagset for the Tagger Dictionary grammar can be flagged as error-free input. A tagset for the Tagalog tagger dictionary is pro- posed. The tagset is based on the tagset devel- 4 LanguageTool Resources oped by Rabo and Cheng (2006) and the modifi- Discussed here are the different language re- cations by Miguel and Roxas (2007). The discus- sources required by the tool. The notations, for- sions on Tagalog affixation (1971) and case sys- mats, and acquisition of resources are outlined tem of Tagalog verbs (1973) by Ramos, verb and discussed. aspect and verb focus by Cena and Ramos (1990), different Tagalog part-of-speech by Cu- 4.1 Tagger Dictionary bar and Cubar (1994), and inventory of verbal affixes by Otanes and Schachter (1972) were Language Tool utilizes a dictionary file, called taken into account. the Tagger Dictionary. The tagger dictionary, Table 1 shows the proposed noun tags. Nouns which contains word declarations, is utilized in were classified into proper nouns, common pattern matching to identify and tag words with nouns, and abbreviations. Kroeger (1993) ex- their part-of-speech. plained that the determiners used for proper The tagger dictionary can be a txt file, a dict nouns and common nouns are different to a cer- file, or an FSA-encoded 3 dict file. The tagger tain degree. dictionary contains three columns, separated by a tag. The first column is the inflected form. The NOUN: [tag] [semantic class] second column is the base form. The third col- Tag umn is the part-of-speech tag. The format for the Tagalog tagger dictionary follows the three- NPRO Proper Noun column format. The first column is the inflected NCOM Common Noun form, which could contain ligatures. The second NABB Abbreviation column is similar to the first column, except that Table 1. Noun Tags ligatures were omitted. This serves as the base form. The third column is the proposed tag, Table 2 shows the proposed pronoun tags. which is composed of the part-of-speech or POS Grammatical person and plurality attribute were of the word and the corresponding attribute-value added to aid in distinguishing different types of pair, separated by a white space character. This pronouns. serves as the POS tag. Figure 1 shows a sample declaration from the Tagalog tagger dictionary. PRONOUN: [tag] [grammatical person] [plurality] Tag doktor doktor NCOM PANP “ang” Pronouns ako ako PANP ST S PNGP “ng” Pronouns kumakain kumakain VACF IN PSAP “sa” Pronouns nasa nasa PRLO mga mga DECP PAND “ang” Demonstratives hoy hoy INTR PNGD “ng” Demonstratives PSAD “sa” Demonstratives Figure 1. Tagalog Tagger Dictionary Example PFOP Found Pronouns Declarations PINP Interrogative Pronouns Evaluation and test data from different re- PCOP Comparison Pronouns searches on Tagalog POS Tagging (Bonus, 2004; PIDP Indefinite Pronouns Cheng and Rabo, 2006; Miguel and Roxas, POTH Other 2007) were used to come up with almost 8,000 Grammatical Person st ST 1 person 3 FSA stands for Finite State Automata. Morfologik was ND 2nd person used to build the binary automata. Morfologik is available at RD 3rd person http://sourceforge.net/projects/morfologik/files/morfologik- stemming/ NU Null 3 Plurality S Singular ADVERB: [tag] [modifies] P Plural Tag B Both AVMA Manner Table 2. Pronoun Tags AVNU Numeral AVDE Definite Table 3 shows the proposed verb tags. Verb AVEO Comparison, group I focus and verb aspect were added. The verb fo- AVET Comparison, group II cus can indicate the thematic role the subject is AVCO Comparative, group I taking. This is useful for future works. AVCT Comparative, group II AVSO Superlative, group I VERB: [focus] [aspect] AVST Superlative, group II Focus AVSC Slight comparison VACF Actor Focus AVAY Agree (Panang-ayon) VOBF Object / Goal Focus AVGI Disagree (Pananggi) VBEF Benefactive Focus AVAG Possibility (Pang-agam) VLOF Locative Focus AVPA Frequency (Pamanahon) VINF Instrument Focus AVOT Other VOTF Other Modifies Aspect VE Verb NE Neutral AD Adjective CM Completed AV Adverb IN Incompleted AL Applicable to All CN Contemplated Table 5. Adverb Tags RC Recently Completed OT Other Conjunctions, prepositions, determiners, inter- Table 3. Verb Tags jections, ligatures, particles, enclitic, punctua- tion, and auxiliary words are also part of the pro- Table 4 shows the proposed adjective tags. posed tagset. These tags however, do not contain Plurality was added to handle number agreement. additional properties or corresponding attribute- Kroeger (1993) stated that if the plurality of the value pairs. Overall, the tagset has a total of 87 nominative argument does not match the plural- tags from 14 POS and lexical categories. ity of the adjective or the predicate, the sentence considered ungrammatical. 4.3 Rule File The rule file is an xml file used to check errors in ADJECTIVE: [tag] [plurality] a sentence. If a pattern declared in the rule Tag matches the input sentence, an error is shown to ADMO Modifier the user. ADCO Comparative The rule file, case insensitive by default, is ADSU Superlative composed of several rule categories which may ADNU Numeral cover but is not limited to spelling, grammar, ADUN Unaffixated style, and punctuation errors. Each rule category ADOT Other is composed of one or more rules or rule groups. Plurality Each rule is composed of different elements and S Singular attributes. The three basic elements a rule has are P Plural pattern, message, and example. The pattern ele- N Null ment is where the error to be matched is de- Table 4. Adjective Tags clared. The message element is where the feed- back and suggestion, if applicable, is declared. Table 5 shows the proposed adverb tags. An The example element is where incorrect and cor- additional attribute was added to distinguish the rect examples are declared. Figure 2 shows a POS of the word being modified. Ramos (1971) pseudocode that describes what happens in the stated that adverbs in Tagalog can modify verbs, event a pattern is matched and Figure 3 shows an adjectives, and other adverbs.