
Feedforward Approach to Sequential Morphological Analysis in the Tagalog Language

Arian N. Yambao(1,2) and Charibeth K. Cheng(1,2)
(1) College of Computer Studies, De La Salle University, Manila, Philippines
(2) Senti Techlabs Inc., Manila, Philippines
{arian_yambao, charibeth.cheng}@dlsu.edu.ph, {arian.yambao, chari.cheng}@senti.com.ph

Abstract—Morphological analysis is a preprocessing task used to normalize and obtain the grammatical information of a word, and may also be referred to as lemmatization. Several NLP tasks use this preprocessing technique to improve their implementations in different languages; however, languages like Tagalog are considered morphosyntactically rich for having a diverse number of morphological phenomena. This paper presents a Tagalog morphological analyzer implemented using a feedforward neural network (FFNN) model. We transformed our data into its character-level representation and used the model to identify its capability of learning the language's inflections. We compared our MA with previous rule-based Tagalog MA works and saw an improved accuracy of 0.93.

Index Terms—morphological analysis; Tagalog morphology; feedforward neural networks; natural language processing

I. INTRODUCTION

Morphological analysis (MA) is a preprocessing technique in natural language processing (NLP) applied to a given text input in order to normalize and extract the grammatical information of each word, usually by deriving each word's root word, which is also called lemmatization (e.g., Processes ⇒ Process). This technique has been seen to improve the performance of models in several NLP works; some of these works mention that having a good MA model means having a wealth of syntactic and semantic information about a language, which matters especially for low-resourced languages [1] [2]. It was also mentioned in the work of [10] that this preprocessing technique decreases the complexity of the analyzed texts, an important benefit when analyzing texts for different NLP tasks.

While MA in the English language is easier and considered mostly solved, many languages still pose a challenge for the task; a number of these languages are considered morphosyntactically rich, such as Tagalog. Tagalog is the most commonly spoken language in the Philippines and is the primary basis of the national language, Filipino. It is considered morphosyntactically rich because many of its words exhibit more than one morphological phenomenon, and it relies less on syntactic arrangement than on markers and morphemes (e.g., 'ies', 'ing' in English) for determining parts of speech and focus [3].

One of the most prevalent morphological phenomena observed in Tagalog is agglutination, which often triggers reduplication of words. An example can be seen where verbal inflection is used to indicate aspects that differ according to the affix class of a given verb. This can be shown with root words starting with a consonant, like 'baba' ('down' in English): adding the prefix ba- makes it 'bababa' ('going down' in English). For words that start with a vowel, like 'asa' ('hope' or 'wish' in English), the infix -um- is added and the initial letter of the word is reduplicated in between to form the word 'umaasa' ('hoping' in English) [3] [4].

Efforts in MA that are mainly focused on Tagalog include the works of De Guzman [5], Nelson [4], Cheng [6], and an improved version of the work of [4] by Cheng [3]. These works have shown the earliest to most distinct morphological phenomena in the language, and their approaches are mainly built using rule-based techniques and systems. The mentioned works are discussed in the next section.

In this paper, we represent the word data at the character level and create a feedforward neural network model to investigate whether the machine is capable of learning the existing morphological phenomena, and to further improve the performance of MA for the Tagalog language. The rest of the paper proceeds as follows: Section 2 reviews existing works related to our approach, Section 3 highlights the contribution of the work, Section 4 introduces the main processes of our approach, and Sections 5 and 6 describe our experiments and findings. In Section 7, we conclude our efforts and discuss some future works.

II. RELATED STUDIES

Work on solving morphology in Tagalog started a long time ago. All such works have considered the language to be morphosyntactically rich, and multiple approaches have been applied to the problem. [5] mentioned Tagalog being highly inflectional and highlighted the early morphological phenomena observed in the language. It showed the different inflection effects in forming verbs into nouns, such as the word 'sinigang' (a stewed pork dish), which is made up of 'sini' (prefix) + 'sigang' (verb), and nouns into verbs, compound words, and gerundive nouns, such as the word 'pagtakbo', meaning 'running', which is comprised of 'pag' (prefix) + 'takbo' (verb, 'run').

[4] stated that the complexity of the language also evolves as time passes, and gave as an example the phenomenon of how easily the Tagalog language is influenced by other languages, such as 'Taglish'.

The paper also explained why approaches to morphology for languages such as English, which it calls "cut-and-paste", will not work in a morphologically rich language, due to the morphophonemic variation across morpheme boundaries: for example, the suffixes 'ed', 'ing', 'ion', and 'ions' are almost identical across all words in English, whereas in Tagalog, although the way words are formed is familiar, words are still formed by inflections that are sometimes unique. Hence the creation of the parallel working Two-Level Morphology Engine, or PC-KIMMO Version 2. It is compiled from a file of 22 Tagalog rules and a Tagalog lexicon (for its reference), and it tries to recognize and generate all possible parses of a given word. The majority of the phenomena covered in the rule file are insertion, deletion, reduplication, and infixation. His work was able to correctly identify 73.4% of the 323 types in an 895-token word list input.

The work of [6] further introduced the morphological phenomena in the language. These are prefixation, infixation, suffixation, circumfixation, internal vowel changes, and partial to whole-word reduplication (which can be viewed as either point-of-prefixation or point-of-suffixation). A revised version of the Wordframe Model, which is also composed of a lexicon and rules, was introduced for the language in this work. The revised model improves the rules learned from the point-of-prefixation. In the analysis phase of the model, all rules are aligned to the given word by taking the highest probability among the possible combinations of prefixes, suffixes, point-of-prefixation, point-of-suffixation, and vowel change, with or without the infix, and with or without the identified reduplication, after the canonical prefix and suffix are removed. Although it performs better on more complex Tagalog words, the paper mentioned that work remains to be done on certain morphological phenomena, such as the removal of reduplication from the point-of-suffixation and the handling of several partial or whole-word reduplications occurring within a word. Their work achieved 89.985% in analyzing 4,034 words against training data of 36,242 words.

The work of [3] is one of the most recent works in Tagalog morphology and is considered an improved version of the work of [4], with a morphological generator introduced into the system, making it a 3-level rule compiler system. The paper highlighted future work to be done in handling the most recently observed morphological phenomena, which are assimilation, phoneme insertion, and vowel loss. Their approach was able to achieve 83.84% accuracy in the analysis of 16,540 Tagalog words.

The approaches to setting the rules in the works of [3], [6], [11], and [12] (all on morphosyntactically rich languages) are similar. All of these works use lexicons, which can be treated as linguistic knowledge, referenced each time an analysis occurs, making the lexicon one of the backbones of their approaches. The output of the combined rules applied to the lexicons is the primary output of the analysis. An example can be seen in one of the rules in [3]: the prefix 'na', followed by any word, is automatically treated as a verb in the past tense by the system. This can be interpreted in the case of the words 'laglag' ('fall' in English) and 'sabi' ('say' in English), making the words 'nalaglag' ('fell' in English) and 'nasabi' ('said' in English), as proof of the morphological phenomenon in their rule. Even though the lexicon approach of rule-based systems has proven effective, [6] mentioned that improving the lexicons and finding rules that would fit all of the words, so as to improve the rule-based approach in general, would be time-consuming and resource-expensive.

III. CONTRIBUTION

As there is currently no published Tagalog-focused morphological analyzer using neural network models, our main contribution is the creation of a new approach to morphological analysis in the Tagalog language, possibly making it the baseline for comparison in similar future works. We then compared this approach with previous works on the task. Finally, we identified and discussed the performance of the proposed approach in deriving the root words of Tagalog words.

IV. METHODOLOGY

This section discusses the methods used for this research undertaking. Our methodology involves five steps: (1) Data Collection, (2) Preprocessing, (3) Feature Extraction, (4) Output Preprocessing, and (5) Model Implementation.

A. Data Collection

We retrieved and used the dataset that was also used in the work of [3]. It is composed of 16,540 Tagalog inflected word → Tagalog root word pairs. We also added each unique root word to both the Words and Root word columns, bringing the dataset to a total of 16,814 words. Table I shows a few instances of the dataset.

TABLE I
SAMPLE INSTANCES OF THE MAG-TAGALOG DATASET

Words       Root word
madalaw     dalaw
makakain    kain
bayad       bayad
iyak        iyak

B. Preprocessing

The only preprocessing done on the dataset is the removal of the hyphen '-' from words, e.g., 'mag-aakyat' ⇒ 'magaakyat' and 'nag-aaksaya' ⇒ 'nagaaksaya', since almost all of the input in the dataset is already clean. The importance of this preprocessing step can be seen in some data sources such as Tagalog articles, as in the case of the terms 'mag-aaral' ('student' in English) and 'magaaral' ('to study' in English), both of which derive from the root word 'aral' ('lesson' or 'wisdom').
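Since this step amounts to deleting one character, a one-line sketch suffices (our own illustration; the paper publishes no code):

```python
def remove_hyphens(word: str) -> str:
    """Drop hyphens so 'mag-aakyat' and 'magaakyat' normalize to one form."""
    return word.replace('-', '')

assert remove_hyphens('mag-aakyat') == 'magaakyat'
```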

C. Feature Extraction

In the hope that the model would understand the sequences of the words, we further transformed the data into character-level representations similar to the work of [7]. This makes the input look like 'iyak' ⇒ 'i','y','a','k'. Afterward, we tokenized the characters (a = 1, b = 2, ..., z = 26) and padded the text to the maximum character lengths of 23 for the Words column and 10 for the Root word column. Table II shows the character-level equivalent of the dataset, and Table III shows the tokenized and padded dataset.
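The tokenize-and-pad step described above is simple enough to sketch directly (our own illustration; the paper publishes no code, and the helper name is ours):

```python
def tokenize(word, max_len):
    """Map a-z to 1-26, drop other characters, and left-pad with zeros."""
    ids = [ord(ch) - ord('a') + 1 for ch in word if 'a' <= ch <= 'z']
    return [0] * (max_len - len(ids)) + ids

MAX_WORD_LEN, MAX_ROOT_LEN = 23, 10   # maximum lengths used in the paper

print(tokenize('iyak', MAX_ROOT_LEN))  # -> [0, 0, 0, 0, 0, 0, 9, 25, 1, 11]
```

Left-padding with zeros matches Table III, where shorter words carry leading zeros.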

TABLE II
CHARACTERIZED DATASET

Words                               Root word
'm','a','d','a','l','a','w'         'd','a','l','a','w'
'm','a','k','a','k','a','i','n'     'k','a','i','n'
'b','a','y','a','d'                 'b','a','y','a','d'
'i','y','a','k'                     'i','y','a','k'

[Fig. 1. ReLU Graphical Representation]

TABLE III
PADDED DATASET

Words                                                  Root word
[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,13,1,4,1,12,1,23]     [0,0,0,0,0,4,1,12,1,23]
[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,13,1,11,1,11,1,9,14]    [0,0,0,0,0,0,11,1,9,14]
[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,1,25,1,4]       [0,0,0,0,0,2,1,25,1,4]
[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,9,25,1,11]      [0,0,0,0,0,0,9,25,1,11]

D. Output Preprocessing

After padding the words in the dataset, an additional preprocessing step was applied to the output (the Root word column). This involved one-hot encoding the Root word column, turning each root word into a 1x10x26 matrix, since there are 26 letters in the alphabet. A detailed example of the entire process: 'iyak' ⇒ 'i','y','a','k' ⇒ [0,0,0,0,0,0,9,25,1,11] ⇒ [[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0], ..., [0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]].

E. Model Implementation

The work of [8] theorized that feedforward neural network (FFNN) models can estimate any type of function or, equivalently, are able to find any type of mapping. This is the reason why, even though the Tagalog MA task is similar to text generation, we implemented this model to see whether it was able to understand the sequence of each word and identify the inflections. In our work, we used this model to take in the inputs x = (x_0, ..., x_{n-1}) and pass them on to the hidden layers with the rectified linear unit (ReLU) as the activation function, also shown graphically in Figure 1:

    ReLU(x) = max(0, x)    (1)

where its derivative is

    ReLU'(x) = { 1 if x > 0; 0 if x ≤ 0 }    (2)

Lastly, the weights from the hidden layers are passed on to the output layer, to which we simply applied a linear function (i.e., a(x) = x). For optimization, we used Adaptive Moment Estimation (Adam). Figure 2 shows the structure of the models we used, where the outputs of the final layer are the following (a code sketch follows the list):
• 1st: 1st character output
• 2nd: 2nd character output
• 3rd: 3rd character output
• 4th: 4th character output
• 5th: 5th character output
• 6th: 6th character output
• 7th: 7th character output
• 8th: 8th character output
• 9th: 9th character output
• 10th: 10th character output

[Fig. 2. 2-Layer FFNN Model Topology]
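To make this setup concrete, here is a minimal Keras sketch of the output one-hot encoding and a 2FFNN-style model. The paper does not publish code or name a framework, so the framework choice, the hidden-layer width of 256, and the mean-squared-error loss are our assumptions; only the ReLU hidden layers, the linear output, the 10x26 one-hot targets, the Adam optimizer, the 90/10 split, and the 100 epochs come from the paper.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

MAX_WORD_LEN, MAX_ROOT_LEN, ALPHABET = 23, 10, 26

def one_hot_root(ids):
    """One-hot encode a padded root word (10 token ids) as a 10x26 matrix."""
    m = np.zeros((MAX_ROOT_LEN, ALPHABET), dtype=np.float32)
    for pos, tok in enumerate(ids):
        if tok > 0:               # 0 is padding; letters are tokens 1..26
            m[pos, tok - 1] = 1.0
    return m

# 2FFNN-style model: ReLU hidden layers and a linear output reshaped so
# that each of the 10 root-word positions gets its own 26-way output,
# emulating the ten per-character outputs of Figure 2.
model = keras.Sequential([
    keras.Input(shape=(MAX_WORD_LEN,)),
    layers.Dense(256, activation="relu"),   # hidden width: our assumption
    layers.Dense(256, activation="relu"),
    layers.Dense(MAX_ROOT_LEN * ALPHABET, activation="linear"),
    layers.Reshape((MAX_ROOT_LEN, ALPHABET)),
])
model.compile(optimizer="adam", loss="mse")  # loss is not stated in the paper

# Training as described in Section V: 90% train / 10% held out, 100 epochs.
# X: (n, 23) tokenized words; Y: (n, 10, 26) one-hot roots.
# model.fit(X, Y, epochs=100, validation_split=0.1)
```

A predicted root word is read off by taking the argmax over the 26 letters at each of the 10 output positions and mapping the token ids back to characters.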

V. EXPERIMENTS

For the experiments, we used three separate feedforward neural network models: a 2-layered FFNN model (2FFNN), a 3-layered FFNN model (3FFNN), and a 4-layered FFNN model (4FFNN), to see whether the performance improved as the number of layers was increased. All training used 90% of the dataset, and the remaining 10% was held out as unseen test data. All of the experiments were run for 100 epochs. Table IV shows the performance of each model on the MA task; we also show the performance of previous works, namely MAG-Tagalog [3] and Wordframe [6], as a reference and not as a comparison.

TABLE IV
MA TASK MODEL PERFORMANCE

Model                    Number of Words Analyzed    Accuracy
2FFNN                    16,814                      79.84%
3FFNN                    16,814                      93.10%
4FFNN                    16,814                      93.57%
MAG-Tagalog [3]          16,540                      83.84%
Tagalog Wordframe [6]    4,034                       89.98%

VI. DISCUSSION

We collected and used the dataset from [3] (shown in Table I) for the task of morphological analysis in Tagalog. For our data features, we represented both input and output in their character-level representations, which confirms the claim of [7] that a model can learn the morphological phenomena of a given language without being taught any linguistic rules, achieving an accuracy of 93% using only an FFNN model. We also tested the trained model on words not present in the dataset to check whether it can handle out-of-vocabulary words such as 'pinagbibilhan', and it was able to derive the correct root word, 'bili'.

We then studied the common errors generated by the analyzer. First, the model failed to generalize phoneme changes; consider the case of the word 'liparin' ⇒ 'lipap', which should be 'lipad'. Second, the character sequence is sometimes ignored; for instance, 'dinaldalan' ⇒ 'halan', which should be 'daldal'. This shows that the FFNN architecture does not take previous character outputs (character n-1) into consideration. This problem may be addressed by other learning models: for example, recurrent neural network (RNN) models and their variants (e.g., long short-term memory, or LSTM) are capable of taking the sequence of each input into account, which may resolve the constraints we observed in the FFNN models used in this work [9]. A minimal sketch of such a variant follows.
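As a hedged illustration of that suggestion (our own sketch, not an implementation from this paper or from [9]), the dense stack shown earlier can be swapped for a recurrent encoder that reads the characters in order:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Sequence-aware variant: an embedding plus an LSTM replaces the dense
# hidden layers, so each character is read in order with running context.
model = keras.Sequential([
    keras.Input(shape=(23,)),
    layers.Embedding(input_dim=27, output_dim=32),  # token ids 0..26
    layers.LSTM(128),
    layers.Dense(10 * 26, activation="linear"),
    layers.Reshape((10, 26)),
])
model.compile(optimizer="adam", loss="mse")
```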

VII. CONCLUSION AND FUTURE WORKS

We created a Tagalog morphological analyzer using a modified version of the dataset from the previous work of [3]. Using that dataset, we trained a feedforward neural network model on its character-level representations. Even though the FFNN is considered one of the most basic neural network models and is not expected to perform well on sequence-based tasks, we found that the model was capable of learning MA, which is a text generation task. However, while acceptable character sequences and syllabication rules can be learned from the training set, the model still does not take previous character outputs into account, which a different learning model, such as an RNN, may be able to consider. Our results also show that this approach removes some of the tedious tasks of rule-based MA, such as collecting more data for the lexicon and adding more rules, since neural network models learn from examples. Therefore, we conclude that our work may serve as a baseline for Tagalog-focused morphological analysis that focuses on and generalizes the morphological phenomena of the language. Lastly, further research on our work and the production of a richer dataset would greatly benefit the performance of any model used for this task, especially when dealing with the diverse Tagalog morphological phenomena.

REFERENCES

[1] Gupta, M., Prasad, B. N. Review on Morphological Analysis of Hindi Text.
[2] Pauw, G. D., Wagacha, P. W. (2007). Bootstrapping morphological analysis of Gĩkũyũ using unsupervised maximum entropy learning. In Eighth Annual Conference of the International Speech Communication Association.
[3] Cheng, C., Adlaon, K. M., Aquino, M., Fernandez, E., Villanueva, K. (2018). MAG-Tagalog: A Rule-Based Tagalog Morphological Analyzer and Generator.
[4] Nelson, H. J. (2004). A two-level engine for Tagalog morphology and a structured XML output for PC-KIMMO.
[5] De Guzman, V. P. (1991). Inflectional morphology in the lexicon: Evidence from Tagalog. Oceanic Linguistics, 30(1), 33-48.
[6] Cheng, C., See, S. (2006). The revised wordframe model for the Filipino language. Journal of Research in Science, Computing and Engineering (JRSCE), 3(2).
[7] Premjith, B., Soman, K. P., Kumar, M. A. (2018). A deep learning approach for Malayalam morphological analysis at character level. Procedia Computer Science, 132, 47-54.
[8] Passban, P., Liu, Q., Way, A. (2016). Boosting neural POS tagger for Farsi using morphological information. ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), 16(1), 1-15.

[9] Lipton, Z. C., Berkowitz, J., Elkan, C. (2015). A critical review of recurrent neural networks for sequence learning. arXiv preprint arXiv:1506.00019.
[10] Liu, H., Christiansen, T., Baumgartner, W. A., Verspoor, K. (2012). BioLemmatizer: a lemmatization tool for morphological processing of biomedical text. Journal of Biomedical Semantics, 3(1), 3.
[11] Goyal, V., Lehal, G. S. (2008, July). Hindi morphological analyzer and generator. In 2008 First International Conference on Emerging Trends in Engineering and Technology (pp. 1156-1159). IEEE.
[12] Dabre, R., Amberkar, A., Bhattacharyya, P. (2012, December). Morphological analyzer for affix stacking languages: A case study of Marathi. In Proceedings of COLING 2012: Posters (pp. 225-234).
