Language Resource Development at DLSU-NLP Lab
Shirley B. Chu

Software Technology Department
College of Computer Studies
De La Salle University - Manila
2401 Taft Avenue, Manila 1004 Philippines
Tel/Fax: (632) 536 0278
[email protected]

The School of Asian Applied Natural Language Processing for Linguistics Diversity and Language Resource Development
ADD-4: Language Resource Technology
Bangkok, Thailand · Feb 23-27, 2009

Abstract

In 2003, the Department of Science and Technology awarded a 5-million-peso grant to De La Salle University for the development of an English-Filipino Machine Translation System. Faced with limited resources for the Filipino language, the team had to build language resources and develop language tools in order to complete the system. This paper presents the different resources and tools created and built by the team and the successive improvements on these projects after 2006.

Keywords: Natural Language Processing, Language Resource, Lexicon, Corpora, Morphological Analyzer, Language Grammars

1 Introduction

Natural language (also known as human language) refers to the language used by humans in order to communicate with each other. It may be spoken, written or even signed. Some examples of natural languages are English, Chinese, and Filipino. Around the world, there are more than 6,912 living languages, and in the Philippines alone, there are 175 languages, four of which are extinct [18]. Given that there are many different languages, communicating with someone who speaks a different language is difficult. However, with modern technology, bridging the gap between cultures is now possible [27]. This technology is called natural language processing (NLP).

Natural language processing, or human language technology, is a multidisciplinary field requiring expertise in the areas of linguistics, psychology, engineering and computer science [10]. NLP is a term that refers to the use of computers to process language for some practical and useful purpose [3].

In 2003, the Philippine Council for Advanced Science and Technology Research and Development (PCASTRD) of the Department of Science and Technology (DOST) granted PhP 5 million to the De La Salle University - College of Computer Studies for the development of a Hybrid English-Filipino Machine Translation System, a computer program that automatically translates English texts and documents to Filipino, and vice versa [27]. Completed in 2006, the project included the development of different language resources and the creation of translation engines.

Figure 1: Architecture of the Hybrid English-Filipino MT System.

The Hybrid English-Filipino Machine Translation System takes as input a statement in the source language (for instance, English) and produces the corresponding target language (Filipino) translation of the statement (see Figure 1). This MT system uses a combination of rule-based and corpus-based approaches [25]. Rule-based machine translation relies on a set of rules for language representation and translation built by linguists and other experts, while corpus-based MT learns this information automatically from sample text translations. In this project, two corpus-based approaches were considered: example-based and template-based. In order to implement these paradigms, language resources such as a lexicon and corpora were developed, and language tools such as morphological analyzers and generators and part-of-speech taggers were built.

The succeeding sections present the different language resources developed in the process of completing the MT system. These sections also present further research conducted in these areas. Section 2 presents the different language resources developed. Section 3 enumerates the language tools built.

2 Language Resources

Language resources (LR), also referred to as linguistic resources, are machine-readable data sets and language descriptions [2, 10]. The availability of these resources is important in developing, improving and evaluating natural language processing algorithms and systems. Some examples of language resources are lexicons, text and speech corpora, grammar rules, and tagsets.

Lexicon. A lexicon is the list of words that form a language. Examples are dictionaries and thesauri. A lexicon may also be a collection of words of a source language that are mapped to another language. Lexicons may be monolingual, bilingual or multilingual. Depending on the application, a lexicon may need to include not only the words, but also linguistic features such as frequency, co-occurring words and part of speech.

Corpus. The word corpus, from the Latin word meaning "body", may be used to refer to any written or spoken text [1]. A linguistic corpus is a large collection of selected texts brought together so that language can be studied on the computer [32].

There are many different types of corpora. These corpora may contain texts of various lengths, taken from books, journals or newspapers, or transcribed from speech [1].

A general corpus consists of general texts that do not belong to any particular text type or topic. A sublanguage corpus contains texts from a particular language, dialect or topic.

Corpora can also consist of texts in one language only or texts in more than one language. A parallel corpus contains texts that are translated into different languages. A comparable corpus contains texts in different languages that are similar in content, but are not translations of each other. A non-comparable corpus contains texts that differ in content, author and publication date [30].

Tagset. An important linguistic feature found in corpora is the part-of-speech (POS) tag associated with each word [27]. In NLP applications, POS tags are used in syntactic and semantic processes. A tagset is the collection of possible tags for a specific language.

Grammar Rules. Grammar shows how the combination of smaller linguistic units produces the meaning of larger textual units [10]. A grammar consists of a lexicon and rules that combine words and phrases into larger phrases and sentences. Thus, grammar determines the acceptability of a given sentence, and the syntax and the morphology of a language. The two levels of syntactic representation significant to natural language sentences are constituent and functional structures [4]. Constituent structure is also referred to as phrase structure. It serves as the basis of phonological information and represents the precedence of the elements. Functional structure, on the other hand, focuses on information pertaining to grammatical relations and attributes pertaining to the sentence semantics.

Language Resource Builder. Language resource builders are computer programs that are used to build or grow language resources automatically.

2.1 Lexicon

Based on the dictionary of the Komisyon ng Wikang Filipino, the English-Filipino lexicon contains 23,520 English and 20,540 Filipino word senses with information such as part of speech and co-occurring words [7, 27]. Additional information such as the synsetID from Princeton WordNet was also included.

Table 1: The English-Filipino Lexicon.

Database                        Size
English-Filipino Entries        23,520
English-Filipino Attributes      7,762
Filipino-English Entries        20,540
Filipino-English Attributes      1,208

2.2 Corpus

Initial work on the manual collection of documents on Philippine languages was done through funding from the National Commission for Culture and the Arts [26, 27]. This includes documents in four major Philippine languages - Tagalog, Cebuano, Ilocano and Hiligaynon - with 250,000 words each, and the Filipino sign language with 7,000 signs. Computational features such as word frequency counts and a concordancer allow viewing co-occurring words in the corpus.

Table 2: The Philippine Corpus.

Size       POS Tagged    Verified by Linguist
10,000     yes           yes
263,681    yes           no
57,687     no

2.3 Filipino Tagset

Linguists examined existing tagsets and concluded that these are not capable of handling some of the Philippine language phenomena. Thus, a tagset for the Filipino language, consisting of 65 tags, was revised by Dr. Raquel Sison-Buban and Dr. Dolores Taylan (March 21, 2006) from the 59 tags used by TPOST [24].

2.4 Grammar Rules

Current Filipino grammar resources are limited. The existing Filipino grammar can only handle declarative sentences. Little work has been done on the computational aspect of the Filipino grammar [4].

There has been research on learning the grammar of a language. The works in [20], [23] and [9] learn grammar formalisms of the English language using statistical analysis methods applied to tagged corpora. Existing systems are also designed for the English language, wherein they are con-

[...] to 10. The precision and recall of the different models are 70.7% and 72.6%, respectively. These models gave an overall performance measure of 69.5%. The current models cannot handle dependency words and phrases properly; addressing dependency is thus recommended to further improve performance.

2.5 Language Resource Builder

Developing language resources for NLP is a tedious task. It may require many man-hours of data encoding and the knowledge of expert linguists. Thus, automatic methods for building language resources using machine learning techniques were developed. Acquisition of information
