Feedforward Approach to Sequential Morphological Analysis in the Tagalog Language
Total Page:16
File Type:pdf, Size:1020Kb
IALP 2020, Kuala Lumpur, Dec 4-6, 2020 Feedforward Approach to Sequential Morphological Analysis in the Tagalog Language Arian N. Yambao1,2 and Charibeth K. Cheng1,2 1College of Computer Studies, De La Salle University, Manila, Philippines 2Senti Techlabs Inc., Manila, Philippines arian_yambao, charibeth.cheng @dlsu.edu.ph, arian.yambao, chari.cheng @senti.com.ph { } { } Abstract—Morphological analysis is a preprocessing task used the affix class of a given verb. This can be shown with the to normalize and get the grammatical information of a word root words starting with a consonant like ’baba’ (’down’ in which may also be referred to as lemmatization. Several NLP English), a prefix ba- added to the word makes it ’bababa’ tasks use this preprocessing technique to improve their im- plementations in different languages, however languages like (”going down” in English). And for words that start with a Tagalog, are considered to be morphosyntactically rich for having vowel, like ’asa’ (’hope’ or ’wish’ in English), an infix -um- a diverse number of morphological phenomena. This paper is added and the initial letter of the word is reduplicated in presents a Tagalog morphological analyzer implemented using a between to form the word ’umaasa’ (’hoping in English) [3] Feed Forward Neural Networks model (FFNN). We transformed [4]. our data to its character-level representation and used the model to identify its capability of learning the language’s inflections. We There have been efforts in MA that are mainly focused compared our MA to previous rule-based Tagalog MA works and on Tagalog includes the works of De Guzman [5], Nelson saw an improved accuracy of 0.93. [4], Cheng [6], and an improved version of the work of Index Terms—morphological analysis; tagalog morphology; [4] by Cheng [3]. Their works have shown the earliest to feedforward neural networks; natural language processing most distinct morphological phenomena in the language, also their approaches are mainly built using rule-based techniques I. INTRODUCTION and systems. The mentioned works are discussed in the next Morphological analysis (MA) is a preprocessing technique section. in natural language processing (NLP) applied to given text In this paper, we represented the word data at the character input in order to normalize and extract the grammatical level and created a feedforward neural network model to information of each given word, usually deriving the words investigate whether the machine is capable of learning the to its root word also called lemmatization (e.g. Processes = existing morphological phenomena and further improve the ⇒ Process). This technique is used and seen to have improved performance of MA for the Tagalog language. The rest of the the performance of the models in some NLP works, in which paper proceeds as follows: Section 2 reviews existing works some works have mentioned having a good MA model, which related to our approach, Section 3 highlights the contribution means having a language wealth of syntactic and semantic of the work, Section 4 introduces the main processes of our information, especially on low-resourced languages [1] [2]. approach, and Section 5 and 6 describe our experiments and It was also mentioned in the work of [10], that this prepro- findings. In Section 7, we conclude our efforts and discuss cessing technique decreases the complexity of the analyzed some future works. texts, which can be considered as an important benefit when analyzing texts for different NLP tasks. II. RELATED STUDIES While the task of MA in the English language is easier Works for solving morphology in Tagalog was started a and considered mostly solved, there are a lot of languages long time. All works for it have considered the language that are being faced with a challenge for the task, many of to be morphosyntactically rich and multiple approaches have these languages are considered to be morphosyntactically rich been done to solve the problem. [5] mentioned Tagalog being such as the Tagalog language. Tagalog is the most commonly highly inflectional and highlighted the early morphological spoken Filipino language and is the primary basis of national phenomena being experienced in the language. It showed the Filipino, considered to be a morphosyntatically rich language different inflection effects in forming verbs into nouns such for having more than one morphological phenomena for many as the word ’sinigang’ (a stewed pork dish) which is made of its words which makes less use of markers and morphemes up of the words ’sini’(prefix) + ’sigang’ (verb), and nouns for determining parts of speech and focus as it does syntactic into; verbs, compound words, and gerundive nouns such as arrangements (e.g. ’ies’, ’ing’ in english) [3]. the word ’pagtakbo’ which means ’running’, it is comprised One of the most prevalent morphological phenomena ob- of the words ’pag’ prefix) + ’takbo’ (verb) or ’run’. served in Tagalog is agglutination which often triggers redu- [4] stated that the complexity of the language also evolves plication of words. An example can be seen where verbal as time passes by and gave a phenomenon of how the inflection is used to indicate aspects that differ according to Tagalog language is easily influenced by other languages such 978-1-7281-7689-5/20/$31.00 c 2020 IEEE 81 as ’Taglish’. The paper also explained why approaches in can be interpreted in the case of the words ’laglag’ (fall morphology for languages such as English, calling it ”cut-and- in English) and ’sabi’ (say in English), making the words paste” is not going to work in a morphologically rich language ’nalaglag’ (fell in English) and ’nasabi’ (said in English) as due to the morphophonemic variation across each morpheme proof of the morphological phenomena in their rule. Even boundaries like for example the suffixes ’ed’, ’ing’, ’ion’, and though, the lexicon approach by rule-based systems are proven ’ions’ are almost identical to all words across English, in effective, [6] has mentioned that improving the lexicons and Tagalog, although how words are formed is familiar, these are finding rules that would fit all of the words to improve the still words which are formed by inflections that sometimes are rule-based approach in general, will be time-consuming and unique to others. Hence the creation of the parallel working resource expensive. Two-Level Morphology Engine or PC-KIMMO Version 2. It is compiled of 22 Tagalog rules file and a Tagalog lexicon (for III. CONTRIBUTION its reference) which tries to recognize and generate all possible As there are currently no published Tagalog-focused mor- parses of a given word. The majority of the phenomena phological analysis using neural network models, our main composed in the rule file are insertion, deletion, reduplication, contribution is the creation of a new approach in processing and infixation. His work was able to correctly identify 73.4% morphological analysis in the Tagalog language and possibly of the 323 types of an 895 token list (words) input. making it as the baseline for comparison in similar future The work of [6] further introduced the morphological phe- works. Then, we referenced this approach with previous works nomena in the language. These are prefixation, infixation, for the task. Finally, we identified and discussed the perfor- suffixation, circumfixation, internal vowel changes, and partial mance word derivation of Tagalog words from the proposed to whole word reduplication (can be both viewed as point- approach. of-prefixation or point-of-suffixation). A revised version of the Wordframe Model, which is also composed of a lexicon IV. METHODOLOGY and rules, was introduced for the language in this work. The revised model improves the rules learned from the point-of- This section discusses the methods used for this research prefixation. In the analysis phase of the model, alignment of undertaking. Our methodology involves 4 steps: (1) Data Col- all the rules to the given word is done by getting all the highest lection, (2) Preprocessing, (3) Feature Extraction, (4) Output probability of the possible combinations of prefixes, suffixes, Preprocessing, and (5) Model Implementation. point-of-prefixation, point-of-suffixation, vowel-change, with or without the infix, and with or without the identified redu- A. Data Collection plication after the canonical prefix and suffix are removed. We retrieved and used the dataset that was also used in the Although with better performance on more complex Tagalog work of [3]. It is composed of 16,540 Tagalog inflected word words, the paper mentioned that there are still works to be Tagalog root word. We also added each unique root words → done in certain morphological phenomena such as the removal to both the Words and Root word columns, making the dataset of reduplication from the point-of-suffixation and handling a total of 16,814 words. Table I shows a few instances of the the occurrence of several partial or whole-word reduplications dataset. within a word. The performance of their work in analyzing 4,034 words from a training data of 36,242 achieved 89.985% TABLE I The work of [3] is one of the most recent works in Tagalog SAMPLE INSTANCE OF MAG-TAGALOG DATASET morphology and is considered to be an improved version of the work of [4], with the introduced morphological generator Words Root word in the system and making it a 3-level rule compiler system. madalaw dalaw makakain kain The paper highlighted the future works to be done in handling bayad bayad the most recent observed morphological phenomena which iyak iyak are assimilation, phoneme insertion, and vowel loss. Their approach was able to achieve 83.84% accuracy in the analysis of 16,540 Tagalog words. B. Preprocessing The approaches in setting the rules in the works of [3], [6], [11], [12] (all morphosyntactically rich languages), are The only preprocessing done for the dataset is the removal similar.