Language Resource Development at DLSU-NLP Lab

Shirley B. Chu
Software Technology Department, College of Computer Studies
De La Salle University - Manila
2401 Taft Avenue, Manila 1004 Philippines
Tel/Fax: (632) 536 0278
[email protected]

The School of Asian Applied Natural Language Processing for Linguistics Diversity and Language Resource Development. ADD-4: Language Resource Technology. Bangkok, Thailand · Feb 23-27, 2009

Abstract: In 2003, the Department of Science and Technology awarded a 5-million-peso grant to De La Salle University for the development of an English-Filipino Machine Translation System. Faced with limited resources for the Filipino language, the team had to build language resources and develop language tools in order to complete the system. This paper presents the different resources and tools created and built by the team and the successive improvements on these projects after 2006.

Keywords: Natural Language Processing, Language Resource, Lexicon, Corpora, Morphological Analyzer, Language Grammars

1 Introduction

Natural language¹ refers to the language used by humans in order to communicate with each other. It may be spoken, written or even signed. Some examples of natural languages are English, Chinese, and Filipino. Around the world, there are 6,912 known living languages, and in the Philippines alone, there are 175 languages, four of which are extinct [18]. Given that there are many different languages, communicating with someone who speaks a different language is difficult. However, with modern technology, bridging the gap between cultures is now possible [27]. This technology is called natural language processing (NLP).

¹ Natural language is also known as human language.

Natural language processing, or human language technology, is a multidisciplinary field requiring expertise in the areas of linguistics, psychology, engineering and computer science [10]. NLP is a term that refers to the use of computers to process language for some practical and useful purpose [3].

In 2003, the Philippine Council for Advanced Science and Technology Research and Development (PCASTRD) of the Department of Science and Technology (DOST) granted PhP 5 million to the De La Salle University - College of Computer Studies for the development of a Hybrid English-Filipino Machine Translation System, computer software that automatically translates English texts and documents to Filipino, and vice versa [27]. Completed in 2006, the project included the development of different language resources and the creation of translation engines.

Figure 1: Architecture of the Hybrid English-Filipino MT System.

The Hybrid English-Filipino Machine Translation System takes as input a statement in the source language (for instance, English) and produces the corresponding translation of the statement in the target language (Filipino) (see Figure 1). This MT System uses a combination of rule-based and corpus-based approaches [25]. Rule-based machine translation relies on a set of rules for language representation and translation built by linguists and other experts, while corpus-based MT learns this information automatically from sample text translations. In this project, two corpus-based approaches were considered: example-based and template-based. In order to implement these paradigms, language resources such as lexicons and corpora were developed, and language tools such as morphological analyzers and generators and part of speech taggers were built.

The succeeding sections present the different language resources developed in the process of completing the MT system. These sections also present further research conducted in these areas. Section 2 presents the different language resources developed. Section 3 enumerates the language tools built.

2 Language Resources

Language Resources² (LR) are machine-readable data sets and language descriptions [2, 10]. The availability of these resources is important in developing, improving and evaluating natural language processing algorithms and systems. Some examples of language resources are lexicons, text and speech corpora, grammar rules, and tagsets.

² Language resources are also referred to as linguistic resources.

Lexicon. A lexicon is the list of words that form a language. Examples are dictionaries and thesauri. A lexicon may also be a collection of words of a source language that are mapped to another language. Lexicons may be monolingual, bilingual or multilingual. Depending on the application, a lexicon may need to include not only the words, but also linguistic features such as frequency, co-occurring words and part of speech.

Corpus. The word corpus, from the Latin word meaning "body", may be used to refer to any written or spoken text [1]. A linguistic corpus is a large collection of selected texts brought together so that language can be studied on the computer [32].

There are many different types of corpora. These corpora may contain texts of various lengths taken from books, journals, newspapers or transcriptions [1]. A general corpus consists of general texts that do not belong to any particular text type or topic. A sublanguage corpus contains texts from a particular language, dialect or topic.

Corpora can also consist of texts in one language only or texts in more than one language. A parallel corpus contains texts that are translated into different languages. A comparable corpus contains texts in different languages that are similar in content, but are not translations of each other. A non-comparable corpus contains texts that differ in content, author and publication date [30].

Tagset. An important linguistic feature found in corpora is the part of speech (POS) tag associated with each word [27]. In NLP applications, POS tags are used in syntactic and semantic processes. A tagset is the collection of possible tags for a specific language.

Grammar Rules. Grammar shows how the combination of smaller linguistic units produces the meaning of larger textual units [10]. A grammar consists of a lexicon and rules that combine words and phrases into larger phrases and sentences. Thus, grammar determines the acceptability of a given sentence, the syntax and the morphology of a language. The two levels of syntactic representation significant to natural language sentences are constituent and functional structures [4]. Constituent structure, also referred to as phrase structure, serves as the basis of phonological information and represents the precedence of the elements. Functional structure, in contrast, focuses on information pertaining to grammatical relations and on attributes pertaining to the sentence semantics.

Language Resource Builder. Language resource builders are computer software used to build or grow language resources automatically.

2.1 Lexicon

Based on the dictionary of the Komisyon sa Wikang Filipino, the English-Filipino lexicon contains 23,520 English and 20,540 Filipino word senses with information such as part of speech and co-occurring words [7, 27]. Additional information, such as the synset ID from Princeton WordNet, was also included.

Table 1: The English-Filipino Lexicon.

Database                       Size
English-Filipino Entries       23,520
English-Filipino Attributes    7,762
Filipino-English Entries       20,540
Filipino-English Attributes    1,208

2.2 Corpus

Initial work on the manual collection of documents on Philippine languages has been done with funding from the National Commission for Culture and the Arts [26, 27]. The collection includes documents in four major Philippine languages (Tagalog, Cebuano, Ilocano and Hiligaynon) with 250,000 words each, and the Filipino sign language with 7,000 signs. Computational features are provided, such as word frequency counts and a tool for viewing co-occurring words in the corpus.

Table 2: The Philippine Corpus.

Size       POS Tagged    Verified by Linguist
10,000     yes           yes
263,681    yes           no
57,687     no

2.3 Filipino Tagset

Linguists examined existing tagsets and concluded that these are not capable of handling some Philippine language phenomena. Thus, a tagset for the Filipino language, consisting of 65 tags, was revised by Dr. Raquel Sison-Buban and Dr. Dolores Taylan (March 21, 2006) from the 59 tags used by TPOST [24].

2.4 Grammar Rules

Current Filipino grammar resources are limited. The existing Filipino grammar can only handle declarative sentences. Little work has been done on the computational aspect of Filipino grammar [4].

There has been research on learning the grammar of a language. The systems in [20], [23] and [9] learn grammar formalisms of the English language using statistical analysis methods applied to tagged corpora. These existing systems are designed for English: they are configured to handle the right-biased nature and Subject-Verb-Object structure of the English language. The Filipino language, however, has free word order patterns and Predicate-Topic³ phenomena, which these systems are not designed to handle.

³ The focus of the sentence is referred to as the Topic, rather than the Subject.

Sample Sentence                         Bracketed Form
Para kay Pedro ko binili ang laruan.    ((Para kay Pedro) ((binili) ((ko) (ang laruan))))
Huwag mo nang itanong sa akin.          ((Huwag nang) ((itanong) (mo) (sa akin)))

Figure 2: Sample Input to the Grammar Induction System.

Figure 3: Output of the Grammar Induction System.

Alcantara [4] designed an unsupervised grammar induction system focusing on learning the constituent structure of the Filipino language. The system accepts as input a bracketed corpus and produces the induced grammar rules (see Figures 2 and 3). Three models were presented to handle the distribution and substitutability of constituents. The models were evaluated using 1,264 sentences of length 1 to 10. The precision and recall of the different models are 70.7% and 72.6%, respectively. These models gave an overall performance measure of 69.5%. The current models cannot handle dependency words and phrases properly; addressing dependency is therefore recommended to further improve performance.

2.5 Language Resource Builder

Developing language resources for NLP is a tedious task. It may require many man-hours of data encoding and the knowledge of expert linguists. Thus, automatic methods for building language resources using machine learning techniques were developed. Acquisition of information from various sources, including the World Wide Web, was also considered.

2.5.1 Lexicon Extraction

Since languages are evolving, there should be some means of determining and capturing new words, and possibly new meanings of words [25]. New terms should be added to the base lexicon through automatic extraction from documents in English and Filipino, or new terms should be learned from sample documents. Two experiments were conducted, one using parallel corpora and one using non-parallel corpora.

One study [29] automatically extracts lexical translations from two comparable, non-parallel corpora, one English and one Tagalog. This approach achieved 50% accuracy in extracting the target-language equivalent of a source word, using corpora within the same domain.

Another system that automatically generates candidate translations (not found in the base lexicon) based on the input parallel or comparable corpora was developed [21]. This system, the Automatic English and Filipino Lexicon Builder (AEFLex), is a lexicon extraction system designed for the English and Filipino languages. The accuracy of the system can reach at most 57%.

2.5.2 Entity Identification

A system that automatically identifies named entities in documents, called NER-Fil (Named-Entity Recognizer for Filipino Texts), was developed. The system can also annotate corpora with named-entity information. Named entities such as person, place and organization are automatically classified by the system using machine learning techniques.
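The bracketed input of the grammar induction system (Figure 2) lends itself to a short illustration. The following is a minimal sketch, in Python, of how a bracketed sentence of that form could be read into a nested structure and its constituents enumerated. The function names (parse_bracketed, leaves, constituents) are hypothetical and are not taken from [4]. The input format is assumed to follow the two examples in Figure 2, and the sketch does not implement the induction models themselves.

```python
def parse_bracketed(text):
    """Parse one bracketed sentence into nested lists of words.

    Assumes the Figure 2 style: round brackets delimit constituents and
    words are separated by whitespace.
    """
    stack = [[]]   # stack[0] collects the top-level constituent(s)
    token = []     # characters of the word currently being read

    def flush():
        # Close the current word, if any, and attach it to the open constituent.
        if token:
            stack[-1].append("".join(token))
            token.clear()

    for ch in text:
        if ch == "(":
            flush()
            stack.append([])
        elif ch == ")":
            flush()
            if len(stack) == 1:
                raise ValueError("unbalanced ')'")
            node = stack.pop()
            stack[-1].append(node)
        elif ch.isspace():
            flush()
        else:
            token.append(ch)
    flush()

    if len(stack) != 1:
        raise ValueError("unbalanced '('")
    if len(stack[0]) != 1 or not isinstance(stack[0][0], list):
        raise ValueError("expected exactly one outer bracket pair")
    return stack[0][0]


def leaves(node):
    """Return the words under a node, left to right, in bracketed order."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node:
        words.extend(leaves(child))
    return words


def constituents(node):
    """Yield the word sequence of every bracketed constituent, outermost first."""
    if isinstance(node, list):
        yield leaves(node)
        for child in node:
            yield from constituents(child)


if __name__ == "__main__":
    tree = parse_bracketed("((Para kay Pedro) ((binili) ((ko) (ang laruan))))")
    for span in constituents(tree):
        print(" ".join(span))
```

For the first sample sentence in Figure 2, this sketch enumerates seven bracketed constituents, from the whole sentence down to (ko) and (ang laruan). Word sequences of this kind are the sort of unit over which distribution and substitutability could then be measured.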

2.5.3 Corpora Builder

AutoCor is an automatic retrieval system for documents written in closely-related languages [12]. Experiments have been conducted for the closely-related Philippine languages Tagalog, Cebuano and Bicolano. N-gram language models are used to match input documents against relevant (Tagalog, Cebuano and Bicolano) and irrelevant (English, Hungarian and Polish) documents. Experiment results show that common word pruning, used to differentiate between the closely-related Philippine languages, improves the precision of the system.

PALITO is an online repository of the Philippine corpus [26]. This system allows users to upload documents written in any Philippine language; these would eventually function as corpora for Philippine language documentation and research. The system also provides automatic data categorization and corpus annotation. Aside from text documents, videos of the Filipino sign language can also be uploaded into the system. Ongoing research on PALITO is extending it so that it can also accept speech recordings.

3 Language Tools

Language tools are applications that support linguistic research and the processing of various language computational layers [27]. These include lexical units, syntax and semantics. Projects have been conducted specifically on morphological processes, part of speech tagging and parsing.

Morphology. Morphology is an area in linguistics that deals with the internal structure and transformational processes of words [31]. It is also the study of morphemes, the smallest meaningful units in a language. There are a number of ways to form words: purely concatenative, prefixation, infixation, suffixation, circumfixation, templative, reduplication, zero morphology and subtractive morphology.

Morphological analyzers (MA) are systems responsible for identifying the word stems of a given word.

POS Taggers. POS taggers, or part of speech taggers, are tools that identify the POS of each word in a text. For some NLP applications, a POS tagger is an essential component.

Grammar Checkers. Grammar checkers are software tools that attempt to verify whether a given statement is grammatically correct. These applications require a syntactic specification of the language [27].

3.1 Morphology

The Philippine language Tagalog exhibits morphological phenomena such as affixation (prefixation, suffixation, or both), stress shifting, consonant alternation, and reduplication (partial or complete) [15, 28, 8]. It is much more morphosyntactically complex than a language like English, which relies less on markers and morphemes than on syntactic arrangement to determine parts of speech and focus. Due to its morphological complexity and the regularity of its morpheme combinations, Tagalog seems well suited for computational analysis.

Take, for instance, the word pinanglilibang-libang. The root word is libang, to which are applied the prefix pang, the infix in, partial reduplication of the syllable li, and full reduplication of libang.

Many systems have been developed to handle concatenative morphological phenomena; however, the Tagalog language exhibits both concatenative and non-concatenative phenomena. To address these phenomena, research using a constraint-based approach was conducted based on the idea of Optimality Theory [15, 16]. TagMA (Tagalog Morphological Analyzer) is a two-level morphological analyzer for Tagalog verbs built by reversing the generation process of Optimality Theory. The current implementation of TagMA employs constraints that are deemed important for infixation and reduplication phenomena. In an experiment, out of 1,600 Tagalog verbs, TagMA was able to correctly analyze 1,535 verbs (96%) with their proper morphological and syntactic properties.

In his dissertation, Richard Wicentowski developed the WordFrame model [31], a robust multi-lingual supervised learning algorithm that learns a language's morphological processes. In this algorithm, a word is modeled using a seven-way split, namely, the canonical prefix, the point-of-prefixation change, common substrings, vowel change, point-of-suffixation, and canonical suffix. However, with this division, the model cannot correctly model infixes and reduplicating stems in a word, both of which exist in the Filipino language [8]. These morphological phenomena are modeled as a point-of-prefixation change instead of as their appropriate morphological phenomena. Thus, the WordFrame model was modified for the Tagalog language. In an experiment using 40,276 words (containing some erroneous data), 90% of the data was assigned to the training set and 10% to the test set. A ten-fold cross validation was conducted to test the model. The revised WordFrame model significantly improves the generalization of the morphological analyzer, with an accuracy of 89% (lower than the 97% result of the Wicentowski research). The time complexity of the revised model is also higher. The revised WordFrame model for Tagalog performs better with words with complex morphology. Its performance is not significantly better when analyzing words that are not as complex.

3.2 Part of Speech Taggers

Automatic taggers for Philippine languages have been developed, namely MBPOST, PTPOST4.1, TPOST and Tag-Alog. MBPOST [19] is a memory-based POS tagger. PTPOST4.1, a probabilistic part of speech tagger, is an extension of earlier versions of PTPOST [11, 13, 17]. TPOST is a template-based n-gram POS tagger developed in [24]. Tag-Alog [14] is a rule-based tagger with a [...] process.

Experiments on these taggers were conducted. Results showed accuracies of 85, 73, 65 and 61%, respectively [22].

3.3 Grammar Checkers

For the Filipino language, spelling and grammar checkers were developed as OpenOffice plug-ins.

The spell checker for Filipino, called SpellChef, uses a hybrid approach for detecting and correcting misspelled words in a document [6]. It uses dictionary lookup, n-gram analysis, Soundex and character distance measurements. SpellChef uses the spelling rules and guidelines specified by the Komisyon sa Wikang Filipino 2001 Revision of the Alphabet and Guidelines in Spelling the Filipino Language, and the Gabay sa Editing sa Wikang Filipino rulebooks. SpellChef is composed of a lexicon builder, a detector and a corrector. All three components use both manually formulated and automatically learned rules to carry out their respective tasks.

FiSSAn is a semantics-based grammar checker, and PanPam is an extension of FiSSAn that also incorporates a dictionary-based spell checker [5]. Both systems use a rule-based approach.

4 Summary

In 2003, DOST-PCASTRD granted PhP 5 million to the College of Computer Studies of De La Salle University Manila for a software development project, creating an English-Filipino Machine Translation System. At that time, linguistic information on Philippine languages was mostly theoretical [25]. Language resources were very limited. Thus, in order to complete the project, different language resources had to be developed and various language tools had to be created.

In 2004, the Natural Language Processing Research Team of the De La Salle University College of Computer Studies was formed. Proponents of the MT System were the first members. Spearheading NLP research in the country, the group is devoted to conducting different NLP research projects and developing NLP applications.

In 2006, the Hybrid English-Filipino Machine Translation System was completed. The project initiated research on the computational aspect of the Filipino language. Language resources for the Philippine languages have been developed, and improvements on these resources are ongoing. With the growing amount of available resources, a number of NLP applications for the Filipino language were also developed.

Currently, the NLP Research Team has completed various projects not only for the Filipino language but also for the English language. Some of these projects are Picture Books, a children's story generator; MesCH, a tool that measures children's reading comprehension; VIGAN, a system that automatically generates NL descriptions of museum artifacts; and Legal Truths, a system that automatically extracts crime information from legal documents.

References

[1] http://www.essex.ac.uk/linguistics/clmt/w3c/corpus_ling/content/introduction.html. (accessed April 2009).
[2] Language resources. http://www.elra.info/Definition.html, 2009. (accessed April 3, 2009).
[3] Natural Language Processing Research Group at the University of Sheffield Department of Computer Science. http://nlp.shef.ac.uk/, 2009. (accessed April 3, 2009).
[4] D. L. Alcantara. Probabilistic approach to constituent structure induction for Filipino. Master’s thesis, De La Salle University, July 2008.
[5] A. Borra, M. Ang, P. J. Chan, S. Cagalingan, and R. Tan. FiSSAn: Filipino sentence syntax and semantic analyzer. Proceedings of the 7th Philippine Computing Science Congress, pages 74–78, February 2007.
[6] C. Cheng, C. P. Alberto, I. A. Chan, and V. J. Querol. SpellChef: spelling checker and corrector for Filipino. Journal of Research in Science, Computing and Engineering, 4(3):75–82, December 2007.
[7] C. Cheng and N. R. Lim. Natural language processing research in De La Salle University - Manila: Then, Now, and the Future. 11th PSITE National Convention, January 2009.
[8] C. Cheng and S. See. The revised wordframe model for Filipino language. Journal of Research in Science, Computing and Engineering, 3(2), August 2006.

[9] A. Clark. Unsupervised induction of stochastic context-free grammars using distributional clustering. Proceedings of the 2001 workshop on Computational Natural Language Learning, pages 1–8, 2001. Available online: http://wing.comp.nus.edu.sg/acl/W/W01/W01-0713.pdf.
[10] R. Cole, editor. Survey of the State of the Art in Human Language Technology. Cambridge University Press and Giardini, 1997.
[11] A. Cortez, D. J. Navarro, R. Tan, and A. Victor. PTPOST: probabilistic Tagalog part-of-speech tagger. Class Project Paper, 2005. De La Salle University.
[12] D. Dimalen and R. E. Roxas. AutoCor: a query-based automatic acquisition of corpora of closely-related languages. Proceedings of the 21st Pacific Asia Conference on Language, Information and Computation, pages 146–154, November 2007.
[13] J. Flordeliza, K. Go, and D. Miguel. PTPOST4.0: probabilistic Tagalog part of speech tagging. Class Project Paper, 2005. De La Salle University.
[14] G. K. Fontanilla and H. W. Wu. Tag-Alog: a rule-based part-of-speech tagger for Tagalog. Class Project Paper, 2006. De La Salle University.
[15] F. C. L. Fortes. A constraint based morphological analyzer for concatenative and non concatenative morphology. Master’s thesis, De La Salle University, November 2002.
[16] F. C. L. Fortes and R. E. O. Roxas. Optimality theory in morphological analysis. In 1st National Natural Language Processing Research Symposium, January 2004.
[17] K. Go. PTPOST4.1: probabilistic Tagalog part of speech tagger. Class Project, 2006. De La Salle University.
[18] R. G. G. Jr., editor. Ethnologue: Languages of the World. SIL International, Dallas, Texas, 15th edition, 2005. Online version: http://www.ethnologue.com/.
[19] R. R. Jr. and R. Trogo. Memory-based part-of-speech tagger. Class Project Paper, 2006. De La Salle University.
[20] D. Klein and C. D. Manning. Distributional phrase structure induction. Proceedings of the Fifth Conference on Natural Language Learning, pages 113–120, 2001. Available online: http://ucrel.lancs.ac.uk/acl/W/W01/W01-0714.pdf.
[21] N. R. T. Lim, J. O. Lat, K. Sze, S. T. Ng, and G. D. Yu. Lexicon for an English-Filipino machine translation system. Proceedings of the 4th National Natural Language Processing Research Symposium, 2007.
[22] D. Miguel and R. E. Roxas. Comparative evaluation of Tagalog part of speech taggers. Proceedings of the 4th National Natural Language Processing Research Symposium, 2007.
[23] M. Osborne and T. Briscoe. Learning stochastic categorial grammar. Proceedings of CoNLL97: Computational Natural Language Learning, pages 80–87, 1997. Available online: http://acl.ldc.upenn.edu/W/W97/W97-1010.pdf.
[24] V. Rabo. TPOST: a template-based, n-gram part-of-speech tagger for Tagalog. Master’s thesis, De La Salle University, 2004.
[25] R. E. Roxas, A. Borra, C. Cheng, N. R. Lim, E. Ong, and M. W. Tan. Building language resources for a multi-engine machine translation system. In Language Resources and Evaluation, volume 42, pages 183–195. Springer, Netherlands, 2008.
[26] R. E. Roxas, P. Inventado, G. Asenjo, M. Corpus, S. Dita, R. Sison-Buban, and D. Taylan. Online corpora of Philippine languages. 2nd DLSU Arts Congress: Arts and Environment, February 2009.
[27] R. E. Roxas, N. R. Lim, and C. Cheng. Natural language processing laboratory: the CCS-DLSU experience. 9th Philippine Computing Science Congress, March 2009. Keynote.
[28] S. See. A Tagalog morphological analyzer using example-based approach. Master’s thesis, De La Salle University, 2006.
[29] E. P. Tiu and R. E. Roxas. Lexicon extraction from comparable corpora. 1st National Natural Language Processing Research Symposium, January 2004.
[30] E. P. Tiu and R. E. Roxas. Automatic bilingual lexicon extraction for a minority target language. PACLIC, 2008.
[31] R. Wicentowski. Modeling and Learning Multilingual Inflectional Morphology in a Minimally Supervised Framework. PhD thesis, The Johns Hopkins University, October 2002. http://www.cs.swarthmore.edu/~richardw/pubs/thesis.pdf.
[32] M. Wynne, editor. Developing Linguistic Corpora: a Guide to Good Practice. Oxbow Books, Oxford, 2005. Available online: http://ahds.ac.uk/linguistic-corpora/ (Accessed April 2009).
