Tagalog Support for LanguageTool
Nathaniel Oco
De La Salle University - Manila
2401 Taft Avenue, Malate, Manila City
1004 Metro Manila, Philippines
+639178477549
[email protected]

Allan Borra
De La Salle University - Manila
2401 Taft Avenue, Malate, Manila City
1004 Metro Manila, Philippines
+639174591073
[email protected]

ABSTRACT
This paper outlines the processes and issues involved in adding Tagalog support to LanguageTool. LanguageTool is an open-source rule-based style and grammar checker that implements a manual-based rule-creation approach. Details of the different LanguageTool resources are discussed in this paper. The linguistic considerations, technical considerations, and language properties of Tagalog that were captured and handled are also discussed and outlined. The system was tested using 50 correct and 50 incorrect sentences collected from different sources. LanguageTool processed the correct sentences in 53 milliseconds and the incorrect sentences in 80 milliseconds. The Tagalog support scored 95.83% for precision, 46% for recall, and 72% for accuracy.

Categories and Subject Descriptors
F.4.2 [Mathematical Logic and Formal Languages]: Grammars and Other Rewriting Systems, grammar types.
I.5.0 [Pattern Recognition]: General.
I.7.1 [Document and Text Processing]: Document and Text Editing, languages, spelling.

General Terms
Algorithms, Languages.

Keywords
Tagalog, Grammar Checking, LanguageTool, Rules, NLP Tools.

1. INTRODUCTION
LanguageTool, developed by [15], is an open-source rule-based style and grammar checker that implements a manual-based rule-creation approach. It is publicly available through LanguageTool's website [13]. LanguageTool has a growing list of supported languages. The authors of this paper developed and submitted a new language support, Tagalog support, for LanguageTool to provide a readily available Tagalog grammar checker [16]. LanguageTool version 1.5, released on September 25, 2011, already includes Tagalog support.

Aside from Tagalog, LanguageTool also supports Asturian, Belarusian, Breton, Catalan, Chinese, Czech, Danish, Dutch, English, Esperanto, French, Galician, Icelandic, Italian, Khmer, Lithuanian, Malayalam, Polish, Russian, Slovak, Slovenian, Spanish, Swedish, Ukrainian, and Romanian.

This paper aims to explain the processes and issues involved in adding a new language support, Tagalog support, for LanguageTool.

2. GRAMMAR CHECKING
Grammar checking is the process of detecting whether there is an error in an input. Mark Johnson [Johnson, personal communication] added that grammar checking entails locating where the error is and notifying the user about the error. [5] agrees, adding that grammar checking also entails providing feedback, which can include possible corrections with linguistic explanations. Grammar checkers can then be defined as programs that can detect whether there is an error in an input, locate the error, notify the user about the error, and provide relevant feedback.

[15] identified three approaches to grammar checking: the syntax-based approach, the statistics-based approach, and the rule-based approach.

The syntax-based approach relies on parsing and grammar formalisms (e.g. CFG, LFG). An error is detected if parsing fails, and the error is located using tree structures, graphs, and other methods. Examples of Filipino grammar checkers that utilize this approach are PanPam [10] and [6].

The statistics-based approach relies on a properly annotated corpus (e.g. Penn Treebank, Brown Corpus) to train the language model. An error is detected and located using probability. [15] explained that sequences describing correct sentences will occur often in the corpus, while sequences describing incorrect sentences will occur less often in the corpus or probably not at all.

The rule-based approach relies on rules, which are matched against the input to check and locate errors. LanguageTool is an example of this approach. [11] classifies grammar checkers under this approach into two types: manual-based and automatic-based. Manual-based grammar checkers use manual means to develop rules, while automatic-based grammar checkers use automatic means to develop rules. In LanguageTool, rules are manually created and are added or modified incrementally.

Proceedings of the 8th National Natural Language Processing Research Symposium, pages 64-71, De La Salle University, Manila, 24-25 November 2011

3. POS TAGGING
Part-of-speech or POS is a lexical category that defines the function of words. POS Tagging or POST is the process of labeling words with POS. The list of POS used to label words is called a tagset. POS Tagging is heavily utilized in LanguageTool.

4. LANGUAGETOOL
LanguageTool can run as a stand-alone program or as an extension to word editors like OpenOffice (footnote 1) or LibreOffice (footnote 2). It uses two major linguistic resources to perform grammar checking: the tagger dictionary and the rule file.

LanguageTool, as a stand-alone program, splits an input into sentences. Each sentence is split into words, and each word is assigned a tag based on the declarations in the tagger dictionary. The words and their tags are checked against the rules in the rule file. If there is a match, feedback is shown to the user. Figure 1 shows a screenshot of LanguageTool as a stand-alone program and Figure 2 shows a screenshot of LanguageTool as an OpenOffice extension. Both figures demonstrate the feedback mechanism of LanguageTool.

Figure 2. LanguageTool as an OpenOffice Extension

5. LANGUAGETOOL RESOURCES
This section discusses and outlines the LanguageTool resources (the tagger dictionary and the rule file) and the tagset used for the tagger dictionary.

5.1 Tagger Dictionary
The tagger dictionary is a text file used to tag words in an input.
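As a rough illustration of the processing flow described in Section 4 (split the input into sentences, split each sentence into words, tag each word from the tagger dictionary, then match the tagged words against rules), a minimal Python sketch follows. It is not LanguageTool's implementation. The names TAGGER_DICT, RULES, split_sentences, tag_sentence, and check are hypothetical, the in-memory dictionary stands in for the tagger dictionary file, and DEMO_DOUBLE_MARKER is a made-up rule for demonstration, not a rule from the Tagalog rule file. Only the tags (ADNU, DECN DAT, DECN GEN, MAHM) come from the sample dictionary entries in Table 1.

```python
import re

# Hypothetical in-memory stand-in for the tagger dictionary (token -> tag).
# The tags are copied from Table 1; the data structure is an assumption.
TAGGER_DICT = {
    "anim": "ADNU",
    "sa": "DECN DAT",
    "ng": "DECN GEN",
    "ho": "MAHM",
}

# Hypothetical rule list. Each pattern is a sequence of regular expressions
# matched against the tags of consecutive words. This demo rule is NOT
# taken from the Tagalog rule file.
RULES = [
    {
        "id": "DEMO_DOUBLE_MARKER",
        "pattern": [r"DECN.*", r"DECN.*"],
        "message": "Two markers in a row (demo rule only).",
    },
]


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tag_sentence(sentence):
    """Lowercase each word and look up its tag; unknown words get 'UNKNOWN'."""
    words = re.findall(r"\w+", sentence.lower())
    return [(word, TAGGER_DICT.get(word, "UNKNOWN")) for word in words]


def check(text):
    """Return (rule id, matched words, message) for every rule match."""
    feedback = []
    for sentence in split_sentences(text):
        tagged = tag_sentence(sentence)
        for rule in RULES:
            size = len(rule["pattern"])
            # Slide a window of the pattern's length over the tagged words.
            for start in range(len(tagged) - size + 1):
                window = tagged[start:start + size]
                if all(re.fullmatch(regex, tag)
                       for regex, (_, tag) in zip(rule["pattern"], window)):
                    words = [word for word, _ in window]
                    feedback.append((rule["id"], words, rule["message"]))
    return feedback
```

Under these assumptions, check("Anim sa ng ho.") returns a single match covering the words sa and ng, because both carry a tag starting with DECN, while check("Anim ho.") returns an empty list.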
Each declaration in the text file follows a tab-separated three-column format. The first column is the token, the second column is the base form of the token, and the third column is the tag of the token. The interpretation of the base form is left to the discretion of the language maintainer; it could be the root word or any other interpretation. In total, the Tagalog tagger dictionary has approximately 8,000 word declarations. Because Tagalog support is new, this number pales in comparison with the 350,000-word English tagger dictionary and the 530,000-word French tagger dictionary. Table 1 shows sample entries from the Tagalog tagger dictionary.

Table 1. Tagalog Tagger Dictionary Declarations

Token          Base Form      Tag
mapanirang     mapanira       ADMO S
anim           anim           ADNU
sa             sa             DECN DAT
ng             ng             DECN GEN
ho             ho             MAHM
nila           nila           PNGP RD P
nilang         nila           PNGP RD P
pumupusta      pumupusta      VACF IN B
yakapin        yakapin        VOBF NE B
kakukumpisal   kakukumpisal   VOTF RC B
kalilibing     kalilibing     VOTF RC B

Figure 1. LanguageTool as a stand-alone program

Footnote 1: http://www.openoffice.org/
Footnote 2: http://www.libreoffice.org/

5.2 Tagset for the Tagger Dictionary
[17] developed a tagset for Tagalog using the Penn Treebank (footnote 3) Tagset as a guide. [14] modified this tagset with the aid of linguists and experts in the field. The original tagset proposed by [17] and the modifications by [14] were used as the basis for developing the tagset for the tagger dictionary.

The tagset for the tagger dictionary defines the different tags used in word declarations. The tagset is composed of POS and lexical categories, which can be followed by one or more attributes separated by white space. The third column in Table 1 shows sample tag declarations. The table in the appendix shows the tagset developed for the tagger dictionary.

5.3 Rule File
The rules are stored in an XML file. The input is matched against patterns in the rule file, which describe incorrect sentences. A rule file is composed of elements and attributes. The three main elements of a rule file are: pattern, message, and example. Pattern refers to the token or sequence of tokens to be matched. Message refers to the feedback, which can include possible suggestions. Example refers to incorrect and correct sentences demonstrating the rule's usage. Figure 3 shows a rule from the Tagalog rule file.

6. LINGUISTIC CONSIDERATIONS
This section discusses and outlines the different linguistic considerations in developing Tagalog support for LanguageTool.

6.1 POS and Lexical Categories
[19] and [18] identified several Tagalog POS. These are article or pantukoy, noun or pangngalan, pronoun or panghalip, verb or pandiwa, adjective or pang-uri, adverb or pang-abay, preposition or pang-ukol, conjunction or pangatnig, interjection or pandamdam, and ligatures or pang-angkop. These POS, except pantukoy, and other lexical categories (e.g. markers and particles) compose the tagset. The table in the appendix shows the POS and lexical categories used.

6.2 Ligatures
According to [18], ligatures or pang-angkop in Tagalog are words or morphemes that link a modifier to the word being modified. Figure 4 demonstrates an example.

Maganda     Babae   =   Magandang babae
Beautiful   Woman   =   Beautiful woman
Adjective   Noun

Mabilis     Bata    =   Mabilis na bata
Fast        Child   =   Fast Child
Adjective   Noun

Figure 4. Ligature Usage

To handle words with ligatures, a separate tagset needs to be allotted. However, this would result in a large tagset. To solve this

<rule id="NOUN_SI_PUNCT" name="noun si punct (si noun punct)">
  <pattern case_sensitive="no" mark_from="0" mark_to="-1">
    <token postag="(NPRO|NCOM).*" postag_regexp="yes"/>
    <token regexp="yes">si|sina</token>
    <token postag="(PSNP|PSNQ|PSNE|PSNC)" postag_regexp="yes"/>
  </pattern>
  <message>Do you mean