
Human Language Technologies – The Baltic Perspective
K. Muischnek and K. Müürisep (Eds.)
© 2018 The authors and IOS Press.
This article is published online with Open Access by IOS Press and distributed under the terms of the Creative Commons Attribution Non-Commercial License 4.0 (CC BY-NC 4.0).
doi:10.3233/978-1-61499-912-6-18

Deeper Error Analysis of Lithuanian Morphological Analyzers

Loïc BOIZOU¹, Jurgita KAPOČIŪTĖ-DZIKIENĖ and Erika RIMKUTĖ
Vytautas Magnus University

Abstract. In this research we continue the intrinsic evaluation of the two most popular publicly available Lithuanian morphological analyzers-lemmatizers, Lemuoklis and Semantika.lt. In our previous paper [1] we reported comparative results of a shallow morphological analysis that mostly covered coarse-grained part-of-speech tags. The results were better for Semantika.lt on 3 domains (administrative texts, fiction and periodicals), but not on scientific texts. A deeper analysis of the fine-grained morphological categories (case, gender, number, degree, tense, mood, person, and voice) gives a more precise account of the strengths and weaknesses of both analyzers. Further investigation showed that the higher performance of the Lemuoklis analyzer on the scientific domain is probably related to a more successful analysis of long-distance agreement, in spite of an overall slight superiority of the Semantika.lt analyzer.

Keywords. Morphological analysis, Lemuoklis, Semantika.lt, intrinsic evaluation, Lithuanian language

1. Introduction

Our previous study [1] describes the strengths and weaknesses of the two most popular publicly available Lithuanian morphological analyzers, Lemuoklis² [2] and Semantika.lt³ [3]. Lemuoklis relies on a hard-coded lexical database, extended with a statistical Hidden Markov Model (HMM) approach for the disambiguation of morphological homoforms [4]. Semantika.lt is based on an open-source platform supplemented with a statistical HMM method for the disambiguation task. The Lemuoklis lexical database contains fewer headwords than that of Semantika.lt, but Lemuoklis uses a synthesis method to handle some frequent derivation patterns (e.g. some regular agentive and diminutive forms). Both analyzers were tested on 4 different domains (administrative texts, fiction, scientific texts and periodicals). On the whole corpus, Semantika.lt outperformed Lemuoklis by ~1.7%, ~2.5%, and ~8.1% on the lemmatisation, part-of-speech tagging and fine-grained annotation tasks, achieving accuracy scores of ~98.0%, ~95.3% and ~86.8%, respectively, although Lemuoklis performed better on the scientific domain. However, the initial evaluation and its rather shallow error analysis gave an incomplete picture. Moreover, during that evaluation the fine-grained morphological information attached to each word (its case, gender, number, etc.) was treated as a single indivisible unit, so it could not show which of these morphological categories has the biggest impact on the evaluation results. Consequently, in this work we go into more detail by evaluating each morphological tag separately, in order to provide comprehensive explanations and further recommendations for improving Lemuoklis and Semantika.lt.

¹ Corresponding Author, Loïc Boizou, Vytautas Magnus University, Centre of Computational Linguistics, V. Putvinskio str. 23-216, Kaunas, Lithuania; E-mail: [email protected].
² Available at http://tekstynas.vdu.lt/page.xhtml;jsessionid=C27B0743101187E540CD32D0498C9887?id=morphological-annotator.
³ Available at http://www.semantika.lt/TextAnnotation/Annotation/Annotate.

2. Related Work

In general, existing morphological analysers can be divided, according to how they are built, into knowledge-based (sometimes called rule-based and/or lexicon-based), supervised, and unsupervised ones. Knowledge-based approaches, which rely on rules and lexicons prepared by expert linguists, do not require additional resources (such as large quantities of annotated text). Probably for this reason, this approach is still the most widespread.

Analysers based on annotated corpora are the closest alternative to the knowledge-based approaches. Such systems are built automatically in a supervised manner, and the induced rules are based on the morphological annotations found in the training data. Moreover, these analysers can easily be retrained and improved after more annotated texts are added. However, the annotation process itself is very laborious and requires deep language expertise. Van den Bosch and Daelemans [5] used supervised memory-based learning for Dutch: their system maps letters in context directly to rich categories that encode morphological boundaries, syntactic class labels, and spelling changes. De Pauw and de Schryver [6] also used a supervised memory-based method, on either characters or syllables; the best results for Swahili were achieved with syllables as features. Comparative experiments with the unsupervised Morfessor [7] confirmed the significant superiority of their supervised method. Uchimoto and Isahara [8] proposed a framework for human-aided morphological annotation of Spontaneous Japanese. The authors assume that there are two types of word segments depending on their length, and that each longer word segment consists of one or more short ones. The shorter word segments and their part-of-speech information were detected using a morpheme model implemented within the Maximum Entropy (ME) framework.
The longer word segments were determined using a chunking model (estimating the likelihood of four labels that indicate whether the position of the word and its part-of-speech information agree with those of the short one) implemented within the Maximum Entropy and Support Vector Machine frameworks.

Malladi and Mannem [9] applied a statistical linear SVM method with lexical features (lexical category, next word, previous word, lemma), character features (last 3 and 4 characters, word-level character n-grams) and the word length to predict gender, number, person and case for Hindi. Lemmas, in contrast, were predicted by treating the unmodified input word form as the source and transforming it into the lemma as the target. Tense, aspect and modality were predicted using heuristics based on sequences of part-of-speech tags. Statistical methods based on statistical rules and on Hidden Markov Models (which check whether a sequence of tagged words maximizes the chain probability) were also applied to Arabic [10] and Thai [11], respectively.

Jędrzejowicz and Strychowski [12] proposed a hybrid solution for Polish that integrates a dictionary approach (containing all word forms) and an algorithmic approach (containing a set of inflection rules). For a given word, the learned Decision Tree (storing possible inflection patterns) generated a list of candidates and stimulated the Artificial Neural Network, which solved the disambiguation problem and produced the valid inflection pattern. Baxi et al. [13] presented a hybrid morphological analyser for Gujarati that combined paradigm-based, knowledge-based, and statistical approaches.
After the paradigm-based approach generated multiple category choices, the knowledge-based approach consulted WordNet to check these choices and kept only the valid ones; finally, the statistical approach solved the disambiguation problem. The Latvian morphological analyser also combined two approaches: a rule-based one for dictionary creation and a statistical one, based on the Conditional Markov Model, for disambiguation [14]. Indeed, supervised approaches are accurate and easy to retrain, but they are not suitable for resource-scarce languages.

Unsupervised approaches have become very attractive in recent decades for a couple of reasons. Firstly, the lack of supervised morphological labels is no longer a problem. Secondly, in theory an unlimited amount of raw text can be collected for any language. Below we give a few examples of such systems. Creutz et al. [7] present the Morfessor package, which is mathematically based on the Minimum Description Length (MDL) principle and uses Maximum a Posteriori estimation. This generative model framework takes raw text as input and segments every word into smaller units (so-called morphs, e.g., un+fail+ing+ly); the obtained segmentation often resembles the linguistic morpheme segmentation. The package was successfully applied to English and to agglutinative Finnish. The robustness of Morfessor (which was designed predominantly for languages with concatenative morphology) was demonstrated on the agglutinative Turkic languages, in particular Turkish, Azeri, Kazakh, Turkmen, Kyrgyz, and Uzbek [15]. The log-linear approach proposed by Poon et al. [16] uses overlapping features (such as morphemes and their content) and incorporates exponential priors inspired by the MDL principle. The authors present efficient algorithms for learning and inference that combine contrastive estimation with sampling.
Their method was successfully applied to the morphologically complex Arabic and Hebrew. Snyder and Barzilay [17] present a method that integrates both supervised and unsupervised learning into a single probabilistic framework. Their approach (combining cross-lingual alignment with the target predictions) can induce links between languages and jointly learn the linguistic structure of each one from multilingual corpora. Their non-parametric hierarchical Bayesian model with Dirichlet Process priors was successfully applied to the morphological segmentation task (separating words into morphemes) on a parallel corpus of Hebrew and Arabic.

To summarize this wide variety of methods: regardless of whether a method is developed using a rule-based or a corpus-based approach, what matters most is that it achieves as high an accuracy as possible and is reliable in different NLP applications. In this paper we focus on the morphologically complex and highly inflected Lithuanian language, for which high analyser accuracy is especially important.
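To make the HMM-based disambiguation mentioned in this section more concrete, the following is a minimal sketch of a first-order Viterbi search restricted to the analyses proposed by a lexicon. It is an illustration only and not the implementation of any system cited here; the function name, the tag labels and all probabilities are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def disambiguate(candidates: Sequence[Dict[str, float]],
                 start: Dict[str, float],
                 trans: Dict[Tuple[str, str], float],
                 floor: float = 1e-12) -> List[str]:
    """Viterbi search over the candidate tags of each word.

    candidates[i] maps each possible tag of word i to P(word | tag) and
    must not be empty; start[tag] is P(tag at sentence start);
    trans[(prev, tag)] is P(tag | prev). All values are assumed inputs.
    """
    if not candidates:
        return []

    def lp(p: float) -> float:
        # Log-probability with a floor so that unseen events do not give log(0).
        return math.log(max(p, floor))

    # best[tag] = (log-probability of the best path ending in tag, that path)
    best = {tag: (lp(start.get(tag, 0.0)) + lp(emit), [tag])
            for tag, emit in candidates[0].items()}
    for word_candidates in candidates[1:]:
        new_best = {}
        for tag, emit in word_candidates.items():
            score, path = max(
                ((prev_score + lp(trans.get((prev, tag), 0.0)), prev_path)
                 for prev, (prev_score, prev_path) in best.items()),
                key=lambda item: item[0])
            new_best[tag] = (score + lp(emit), path + [tag])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


# Toy input: a preposition followed by a form that is ambiguous between
# singular genitive and plural nominative (invented probabilities).
sentence = [{"prep": 1.0}, {"noun.sg.gen": 0.5, "noun.pl.nom": 0.5}]
start = {"prep": 0.1}
trans = {("prep", "noun.sg.gen"): 0.6, ("prep", "noun.pl.nom"): 0.05}
print(disambiguate(sentence, start, trans))  # ['prep', 'noun.sg.gen']
```

Because both readings of the second word have the same emission probability in this toy input, the choice is made entirely by the transition probabilities, which is the situation that arises with homonymic endings.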

3. Evaluation of annotation of grammatical categories

Features such as negation and reflexivity, which are tagged as single values (present as opposed to absent), are not included in the evaluation. Without them, there are 32 morphological features in total across eight categories (case, gender, number, degree, tense, mood, person, voice), but the dual number does not occur in the data. Furthermore, five features are very rare: the common gender (only 3 occurrences in the gold-standard corpus), the vocative case (5), the illative case (3), the imperative mood (7) and the necessity form (6, included in voice). The results for these rare features (in italics in Table 1) might not be significant; therefore they are not discussed further.

Table 1. Accuracy and f-score values obtained with Semantika.lt and Lemuoklis on all domains

Morph. tag  Value            Semantika.lt          Lemuoklis
                             accuracy   f-score    accuracy   f-score
CASE        Nominative       0.962      0.883      0.971      0.917
            Accusative       0.993      0.974      0.978      0.904
            Genitive         0.978      0.962      0.958      0.922
            Instrumental     0.995      0.953      0.987      0.857
            Dative           0.998      0.967      0.998      0.975
            Locative         0.999      0.989      0.999      0.988
            Vocative         0.991      0.014      0.999      0.000
            Illative         1.000      1.000      1.000      0.000
GENDER      Masculine        0.974      0.964      0.983      0.976
            Feminine         0.973      0.947      0.984      0.968
            Neuter           0.992      0.728      0.994      0.742
            Common           1.000      0.364      1.000      0.500
NUMBER      Singular         0.956      0.948      0.947      0.937
            Plural           0.964      0.938      0.958      0.927
            Dual             1.000      -          1.000      -
DEGREE      Positive         0.979      0.918      0.973      0.894
            Comparative      0.999      0.976      0.998      0.939
            Superlative      1.000      0.958      1.000      0.971
TENSE       Past             0.975      0.482      0.995      0.884
            Present          0.996      0.977      0.996      0.976
            Past definite    0.98       0.727      0.999      0.988
            Past iterative   1.000      1.000      1.000      1.000
            Future           0.999      0.972      1.000      0.984
MOOD        Indicative       0.998      0.987      0.998      0.987
            Conditional      0.998      0.883      0.999      0.927
            Imperative       1.000      1.000      1.000      0.727
PERSON      1st              1.000      0.997      0.999      0.973
            2nd              0.999      0.945      0.999      0.872
            3rd              0.995      0.972      0.997      0.983
VOICE       Active           0.999      0.970      0.999      0.985
            Passive          0.992      0.902      0.994      0.928
            Necessity        1.000      0.667      0.999      0.462

The overall performance level of both taggers for the remaining morphological features appears quite similar: Lemuoklis gets a better f-score on 12 morphological features, the Semantika.lt tagger on 11. The better performance of Semantika.lt on the tag combinations (~86.8% fine-grained accuracy, as mentioned above) comes from better annotation of frequent features (such as singular, plural, genitive and, to a lesser degree, accusative) (see Table 1).
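For readers who wish to reproduce scores of the kind shown in Table 1, the following minimal sketch computes accuracy and f-score for a single feature value. It assumes that each value is scored one-vs-rest over all tokens; this is an assumption, although it is consistent with a feature that never occurs (the dual) obtaining an accuracy of 1.000 and an undefined f-score. The function name and the toy tags are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def feature_scores(gold: Sequence[Optional[str]],
                   pred: Sequence[Optional[str]],
                   value: str) -> Tuple[float, Optional[float]]:
    """Return (accuracy, f-score) for one feature value, scored one-vs-rest."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must be aligned token by token")
    if not gold:
        raise ValueError("no tokens to evaluate")
    tp = fp = fn = tn = 0
    for g, p in zip(gold, pred):
        if g == value and p == value:
            tp += 1          # value correctly assigned
        elif p == value:
            fp += 1          # value assigned, but gold differs
        elif g == value:
            fn += 1          # value missed
        else:
            tn += 1          # value correctly not assigned
    accuracy = (tp + tn) / len(gold)
    denominator = 2 * tp + fp + fn
    # The f-score is undefined when the value occurs in neither column.
    f_score = 2 * tp / denominator if denominator else None
    return accuracy, f_score


# Toy case tags for five tokens (None = token without a case feature).
gold = ["Nom", "Nom", "Gen", None, "Nom"]
pred = ["Nom", "Voc", "Gen", None, "Nom"]
for value in ("Nom", "Voc", "Dual"):
    print(value, feature_scores(gold, pred, value))
```

In this toy input a single false vocative leaves the vocative accuracy at 0.8 while its f-score drops to 0.0, which illustrates why a rare feature can combine a near-perfect accuracy with a very low f-score.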

Nonetheless, Semantika.lt underperforms Lemuoklis in two respects. Firstly, it wrongly tags a significant proportion of nominative forms (about 5%) as vocative, and these false vocative tags outnumber the true vocatives. This mistake has a negative impact on the accuracy of nominative tags for Semantika.lt (0.830, vs. 0.914 for Lemuoklis). Secondly, the statistics show a low f-score for the Semantika.lt tagger on the past tense (0.482) and the past definite tense (0.727). According to the gold-standard annotations, past participles should be annotated as past, and finite forms as past definite. This mistake has been corrected in the recent version of Semantika.lt; therefore the results for these two tags should be considered partly irrelevant.

The neuter gender, which is relatively rare (55 occurrences in the gold standard) and formally hard to recognize, stands out for its lower f-score with both taggers: 0.728 and 0.742 for Semantika.lt and Lemuoklis, respectively. The f-score for the other categories is mostly around 0.900 or more, peaking at 0.997 for the first person (Semantika.lt), although Lemuoklis shows a slightly lower f-score (0.857) for the instrumental case.

A detailed linguistic error analysis on scientific texts revealed the following groups of errors:

• Homonymic endings inside a given paradigm: this is obviously the main cause of errors, because some ending syncretisms are systematic (for some paradigms at least): nominative and vocative (e.g. įtvirtinęs “established”); feminine singular genitive and feminine plural nominative (e.g. publikacijos “publication(s)”); nominative feminine singular and predicative neuter form of adjectives (e.g. aptariama “discussed”); third person singular and third person plural (e.g. atkartoja “repeat(s)”); infinitive and past participle nominative plural (e.g. gauti “to get/gotten”); third person conditional and past participle genitive plural (e.g. susietų “would relate/related”).
• Alternative POS: a significant problem comes from words that share a single morphological paradigm but are traditionally assigned to different parts of speech, for example noun/adjective (e.g. artimas “a relative/close”) or adjective/participle (e.g. aprašomas “described”). Ambiguous grammatical words can be assigned to two or even three lexical categories (in various combinations of particle, adverb and/or conjunction). This case is complicated even for linguists, so it is not surprising that it is also problematic for taggers.

This qualitative analysis was complemented by comparing the occurrences of the most frequent errors with those described by Pinnis and Goba [18]⁴. The results are presented in Table 2. Each analyzer shares three of its most frequent errors with those reported in [18]:

• for the Semantika.lt analyzer: pn → sg (pn – plural nominative, sg – singular genitive), m → f (m – masculine, f – feminine), f → m;
• for Lemuoklis: pn → sg, sg → pn and q → c (q – particle, c – conjunction).

Other errors made by both analyzers are related to number, p ↔ s (p – plural, s – singular), in contexts other than the already mentioned pn ↔ sg. Furthermore, for both analyzers q → r (r – adverb) is also a frequent error. Like q → c, it is an example of the ambiguous grammatical words mentioned above.

⁴ The authors would like to acknowledge one of the anonymous reviewers for mentioning this article.

Table 2. The most frequent error types made by Semantika.lt and Lemuoklis

Semantika.lt                     Lemuoklis
Right → Wrong    Occurrences     Right → Wrong    Occurrences
pn → sg          206             p → s            262
p → s            174             q → r            155
s → p            174             pn → sg          150
q → r            156             q → c            132
f → m            143             sg → pn          99
m → f            142             s → p            99
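A ranking such as the one in Table 2 can be obtained by counting (right, wrong) tag pairs over the aligned gold-standard and analyzer output. The sketch below is only an illustration: the function name is hypothetical, the toy pairs are invented, and the short codes follow the abbreviations used above.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_confusions(pairs: Iterable[Tuple[str, str]],
                   n: int = 6) -> List[Tuple[Tuple[str, str], int]]:
    """Count (right, wrong) pairs where the analyzer disagrees with the
    gold standard and return the n most frequent ones."""
    counts = Counter((right, wrong) for right, wrong in pairs if right != wrong)
    return counts.most_common(n)


# Toy aligned data: (gold tag, analyzer tag) for each token.
pairs = ([("pn", "sg")] * 3 + [("sg", "sg")] * 5
         + [("f", "m")] * 2 + [("q", "r")])
for (right, wrong), occurrences in top_confusions(pairs, n=2):
    print(f"{right} → {wrong}: {occurrences}")
```

Correctly tagged tokens are filtered out before counting, so the ranking reflects error types only.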

4. Discussion

The preliminary error analysis does not reveal a clear reason why Lemuoklis achieves a better recognition rate on the scientific texts in the evaluation corpus. Some of the identified mistakes are too rare to have a significant influence on the results, for example false vocatives (instead of nominatives; 124 occurrences in the corpus, the 8th most frequent error) or the failure to analyse compounds such as vokalinių-instrumentinių “vocal and instrumental” (Lemuoklis divides compounds and analyses each part separately, whereas Semantika.lt leaves them as untagged blocks).

During our analysis we formulated two hypotheses. The first hypothesis was related to the larger number of participles in the scientific texts. Counting the participles in the different corpus domains showed that they are most frequent in scientific texts (2.7% of all words, compared with 2.2%, 1.5%, and 2.1% in administrative texts, fiction and periodicals, respectively). This confirms the statement in [19] that participles appear more often in formal language (administrative or scientific texts) and less often in less formal language (periodicals or fiction). However, further investigation revealed that Lemuoklis does not outperform the Semantika.lt tagger on participles, as shown in Tables 3 and 4. Hence, the quantity of participles cannot account for the better performance of Lemuoklis on the scientific domain.
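The participle proportions used to test the first hypothesis amount to a simple ratio per domain. A minimal sketch follows, assuming that each token of the gold standard carries a verb-form label; the label "participle", the function name and the toy data are hypothetical.

```python
from typing import Iterable, Tuple


def participle_share(tagged_tokens: Iterable[Tuple[str, str]],
                     participle_label: str = "participle") -> float:
    """Return the proportion of tokens labelled as participles."""
    total = 0
    participles = 0
    for _form, verb_form in tagged_tokens:
        total += 1
        if verb_form == participle_label:
            participles += 1
    if total == 0:
        raise ValueError("the domain contains no tokens")
    return participles / total


# Toy domain of 1000 tokens, 27 of them participles.
toy_domain = [("w", "participle")] * 27 + [("w", "other")] * 973
print(f"{participle_share(toy_domain):.1%}")  # 2.7%
```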

Table 3. Accuracy and f-score values obtained with Semantika.lt and Lemuoklis for verbal forms on all domains

Morph. tag  Value               Semantika.lt          Lemuoklis
                                accuracy   f-score    accuracy   f-score
VERB FORMS  Infinitive          0.997      0.961      0.997      0.959
            Participle          0.992      0.941      0.991      0.925
            Adv. participle     0.999      0.965      0.999      0.950
            Half participle     1.000      1.000      1.000      0.990
            Adv. participle 2   1.000      -          1.000      -


Table 4. Accuracy and f-score values obtained with Semantika.lt and Lemuoklis for participles on all domains

Morph. tag  Value            Semantika.lt          Lemuoklis
                             accuracy   f-score    accuracy   f-score
CASE        Nominative       0.880      0.886      0.737      0.709
            Accusative       0.995      0.980      0.957      0.878
            Genitive         0.959      0.877      0.925      0.799
            Instrumental     0.991      0.892      0.982      0.811
            Dative           0.998      0.955      0.996      0.930
            Locative         0.999      0.979      0.996      0.939
            Vocative         0.999      0.000      0.858      0.016
            Illative         1.000      -          1.000      -
GENDER      Masculine        0.947      0.946      0.843      0.833
            Feminine         0.922      0.906      0.815      0.781
            Neuter           0.943      0.759      0.913      0.733
            Common           1.000      -          1.000      -
NUMBER      Singular         0.909      0.904      0.852      0.835
            Plural           0.929      0.918      0.876      0.860
            Dual             1.000      -          1.000      -

The second hypothesis was related to a larger number of distant syntactic dependencies, where cases of agreement might be better disambiguated by Lemuoklis. An attempt was made to evaluate the average distance between nouns and the adjectives or participles that agree with them. The available parsers for Lithuanian are still insufficiently reliable for the complex sentences that frequently appear in the scientific domain, which is why we decided to use a simple match in case, number and gender between nouns and adjectives/participles. On average, ~2.7 words are inserted between the agreeing words in the scientific domain, compared with ~0.5 to ~1.5 words in the other domains. Although this simple algorithm may produce some mistakes (e.g. in relatively rare cases the adjective/participle does not depend on the closest noun with the same case, number and gender), the difference is clearly significant. Lemuoklis copes more accurately with longer-distance agreement, and this is probably one of the main reasons why it achieves better results than Semantika.lt on the scientific domain.
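The case-number-gender matching heuristic described above can be sketched as follows. This is one possible reading and not the exact procedure used for the reported figures: it assumes that each adjective or participle is paired with the nearest noun (in either direction, within the same sentence) that carries identical case, number and gender, and it counts the words inserted between the two. The Token class, the POS labels and the toy sentence are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class Token:
    form: str
    pos: str                        # hypothetical labels: "noun", "adj", "part", ...
    case: Optional[str] = None
    number: Optional[str] = None
    gender: Optional[str] = None


def agreement_gaps(sentence: Sequence[Token]) -> List[int]:
    """For each adjective/participle, return the number of words inserted
    between it and the nearest noun with the same case, number and gender."""
    gaps = []
    for i, tok in enumerate(sentence):
        if tok.pos not in ("adj", "part") or tok.case is None:
            continue
        key = (tok.case, tok.number, tok.gender)
        candidates = [abs(i - j) - 1
                      for j, other in enumerate(sentence)
                      if other.pos == "noun"
                      and (other.case, other.number, other.gender) == key]
        if candidates:              # unmatched adjectives/participles are skipped
            gaps.append(min(candidates))
    return gaps


def average_gap(sentences: Sequence[Sequence[Token]]) -> Optional[float]:
    """Average gap over all matched pairs, or None if nothing matched."""
    gaps = [g for sentence in sentences for g in agreement_gaps(sentence)]
    return sum(gaps) / len(gaps) if gaps else None


# Toy sentence: two genitive singular feminine adjectives and one matching noun.
toy = [
    Token("naujos", "adj", "gen", "sg", "f"),
    Token("labai", "adv"),
    Token("svarbios", "adj", "gen", "sg", "f"),
    Token("publikacijos", "noun", "gen", "sg", "f"),
]
print(agreement_gaps(toy))   # [2, 0]
print(average_gap([toy]))    # 1.0
```

As noted above, pairing by nearest match can attach an adjective or participle to the wrong noun, so the resulting averages are only an approximation of true syntactic distance.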

5. Conclusions

This research details a deeper intrinsic evaluation of Lemuoklis and Semantika.lt. Theoretically, according to these findings, Semantika.lt should also be more suitable for downstream applications. However, in practice this is not necessarily the case: some weak spots of the lemmatizer (e.g., confusion of the nominative case with the vocative, confusion of particles with conjunctions, etc.) may not be very influential. For this reason it is essential to evaluate morphological analyzers in real applications. Hence, an extrinsic evaluation of both tools is among our future plans. It would also be useful to compare these results with the new neural-network-based tagger developed by P. Paikens (https://github.com/PeterisP/BalTag). The results show that the older hard-coded analyzer Lemuoklis, which was under active development for about twenty years, is still reliable. However, the more recent Semantika.lt analyzer, which was specially designed to be easy to upgrade, seems to take the lead, and these evaluation results provide some ideas for its further enhancement.

References

[1] J. Kapočiūtė-Dzikienė, E. Rimkutė, L. Boizou, Comparison of Lithuanian Morphological Analyzers. Ekštein K., Matoušek V. (eds) Text, Speech, and Dialogue. TSD 2017. Lecture Notes in Computer Science 10415 (2017). Springer, Cham.
[2] V. Zinkevičius, Lemuoklis – morfologinei analizei [Morphological analysis with Lemuoklis]. Darbai ir Dienos 24 (2000), 246–273.
[3] V. Dadurkevičius, Lietuvių kalbos morfologija atvirojo kodo Hunspell platformoje [Lithuanian Morphology in the Hunspell Framework]. Bendrinė kalba 90 (2017), 1–15. http://www.bendrinekalba.lt/Straipsniai/90/Dadurkevicius_BK_90_straipsnis.pdf
[4] E. Rimkutė, V. Daudaravičius, Morfologinis dabartinės lietuvių kalbos tekstyno anotavimas [Morphological Annotation of the Lithuanian Corpus]. Kalbų studijos 11 (2007), 30–35.
[5] A. van den Bosch, W. Daelemans, Memory-Based Morphological Analysis. Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics (ACL '99) (1999), 285–292.
[6] G. De Pauw, G.-M. de Schryver, Improving the Computational Morphological Analysis of a Swahili Corpus for Lexicographic Purposes. Lexikos 18 (2008), 303–318.
[7] M. Creutz, K. Lagus, K. Lindén, S. Virpioja, Morfessor and Hutmegs: Unsupervised Morpheme Segmentation for Highly-Inflecting and Compounding Languages. Proceedings of the Second Baltic Conference on Human Language Technologies (2005), 107–112.
[8] K. Uchimoto, H. Isahara, Morphological Annotation of a Large Spontaneous Speech Corpus in Japanese. Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL 2003) (2003), 479–488.
[9] D. K. Malladi, P. Mannem, Statistical Morphological Analyzer for Hindi. International Joint Conference on Natural Language Processing (IJCNLP 2013) (2013), 1007–1011.
[10] N. Khoufi, M. Boudokhane, Statistical-based System for Morphological Annotation of Arabic Texts. Recent Advances in Natural Language Processing (RANLP 2013) (2013), 100–106.
[11] A. Kawtrakul, C. Thumkanon, A Statistical Approach to Thai Morphological Analyzer. Proceedings of the 5th Workshop on Very Large Corpora (1997), 289–286.
[12] P. Jędrzejowicz, J. Strychowski, A Neural Network Based Morphological Analyser of the Natural Language. Advances in Soft Computing, Intelligent Information Processing and Web Mining 31 (2005), 199–208.
[13] J. Baxi, P. Patel, B. Bhatt, Morphological Analyzer for Gujarati using Paradigm based approach with Knowledge based and Statistical Methods. The 12th International Conference on Natural Language Processing (ICON-2015) (2015).
[14] P. Paikens, L. Rituma, L. Pretkalniņa, Morphological analysis with limited resources: Latvian example. Proceedings of the 19th Nordic Conference of Computational Linguistics (NODALIDA 2013) (2013), 267–474.
[15] V. Baisa, V. Suchomel, Large Corpora for Turkic Languages and Unsupervised Morphological Analysis. Proceedings of the Eighth conference on International Language Resources and Evaluation (LREC’12) (2012).
[16] H. Poon, C. Cherry, K. Toutanova, Unsupervised Morphological Segmentation with Log-linear Models. Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL '09) (2009), 209–217.
[17] B. Snyder, R. Barzilay, Cross-lingual Propagation for Morphological Analysis. Proceedings of the 23rd National Conference on Artificial Intelligence 2 (2008), 848–854.
[18] M. Pinnis, K. Goba, Maximum Entropy Model for Disambiguation of Rich Morphological Tags. Systems and Frameworks for Computational Morphology. SFCM 2011. Communications in Computer and Information Science 100 (2011), 14–22. Springer, Berlin, Heidelberg.
[19] R. Brinkutė, Gramatinių kategorijų pasiskirstymas morfologiškai anotuotame lietuvių kalbos tekstyne [Distribution of Grammatical Categories in the Morphologically Annotated Lithuanian Language Corpus], Master thesis, Vytautas Magnus University, Kaunas, 2018.