Deeper Error Analysis of Lithuanian Morphological Analyzers

Human Language Technologies – The Baltic Perspective
K. Muischnek and K. Müürisep (Eds.)
© 2018 The authors and IOS Press.
This article is published online with Open Access by IOS Press and distributed under the terms of the Creative Commons Attribution Non-Commercial License 4.0 (CC BY-NC 4.0).
doi:10.3233/978-1-61499-912-6-18

Loïc BOIZOU¹, Jurgita KAPOČIŪTĖ-DZIKIENĖ and Erika RIMKUTĖ
Vytautas Magnus University

Abstract. In this research we continue the intrinsic evaluation of the two most popular publicly available Lithuanian morphological analyzers-lemmatizers, Lemuoklis and Semantika.lt. In our previous paper [1] we reported the comparative results of a shallow morphological analysis mostly covering coarse-grained part-of-speech tags. The results were better for Semantika.lt on 3 domains (administrative, fiction and periodicals), but not on scientific texts. The deeper analysis of the fine-grained morphological categories (case, gender, number, degree, tense, mood, person, and voice) gave a more precise account of the strengths and weaknesses of both analyzers. Further investigation showed that the higher performance of the Lemuoklis analyzer on the scientific domain is probably related to a more successful analysis of long-distance agreement, in spite of an overall slight superiority of the Semantika.lt analyzer.

Keywords. Morphological analysis, Lemuoklis, Semantika.lt, intrinsic evaluation, Lithuanian language

1. Introduction

The research in [1] describes the strengths and weaknesses of the two most popular publicly available Lithuanian morphological analyzers, Lemuoklis² [2] and Semantika.lt³ [3]. Lemuoklis is a hard-coded lexical database, extended with a statistical Hidden Markov Model (HMM) approach for the disambiguation of morphological homoforms [4].
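As a minimal illustration of what HMM-based disambiguation of homoforms involves, the following Python sketch selects the most probable tag sequence with the Viterbi algorithm. It is not the implementation used by either analyzer: the function name `viterbi`, the tag labels, and all probabilities are hypothetical values chosen for the example. The Lithuanian form "namo" serves as the homoform, since it can be read either as the genitive singular of "namas" (house) or as an adverb (homeward).

```python
import math


def viterbi(candidates, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for a sentence.

    candidates: list of (word, [possible tags]) pairs, as a lexicon-based
                analyzer might return them before disambiguation.
    start_p, trans_p, emit_p: toy HMM parameters (hypothetical values).
    """
    # best[i][tag] = (log-probability of the best path ending in tag, previous tag)
    word, tags = candidates[0]
    best = [{t: (math.log(start_p[t]) + math.log(emit_p[t][word]), None) for t in tags}]

    for word, tags in candidates[1:]:
        column = {}
        for t in tags:
            # Choose the best predecessor among the previous word's candidate tags.
            score, prev = max(
                (best[-1][p][0] + math.log(trans_p[p][t]), p) for p in best[-1]
            )
            column[t] = (score + math.log(emit_p[t][word]), prev)
        best.append(column)

    # Backtrack from the best final tag.
    tag = max(best[-1], key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(best) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return list(reversed(path))


# Toy input: "eina namo" (goes home); "namo" is ambiguous.
candidates = [("eina", ["verb"]), ("namo", ["noun_gen", "adv"])]
start_p = {"verb": 0.5, "noun_gen": 0.25, "adv": 0.25}       # invented values
trans_p = {"verb": {"verb": 0.3, "noun_gen": 0.2, "adv": 0.5}}  # invented values
emit_p = {
    "verb": {"eina": 0.1},
    "noun_gen": {"namo": 0.01},
    "adv": {"namo": 0.05},
}

result = viterbi(candidates, start_p, trans_p, emit_p)
print(result)  # ['verb', 'adv']
```

With these invented parameters the adverbial reading wins because both the transition from a verb and the emission probability favour it. A real system would estimate such parameters from an annotated corpus and would operate on full fine-grained tags.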
Semantika.lt is based on the Hunspell open-source platform, supplemented with a statistical HMM method for the disambiguation task. The Lemuoklis lexical database contains fewer headwords than Semantika.lt, but Lemuoklis uses a synthesis method to handle some frequent derivation patterns (e.g. some regular agentive and diminutive forms). Both analyzers were tested on 4 different domains (administrative, fiction, scientific texts and periodicals). On the whole corpus Semantika.lt outperformed Lemuoklis by ~1.7%, ~2.5%, and ~8.1% on the lemmatisation, part-of-speech tagging and fine-grained annotation tasks, achieving accuracy scores of ~98.0%, ~95.3% and ~86.8% respectively, although Lemuoklis performed better on the scientific domain.

However, the initial evaluation and an overly shallow error analysis gave an incomplete picture. Moreover, during that evaluation the fine-grained morphological information attached to each word (its case, gender, number, etc.) was treated as a single indivisible unit, which gave no indication of which morphological categories have the biggest impact on the evaluation results. Consequently, in this work we go into more detail by evaluating each morphological tag separately, in order to provide comprehensive explanations and further recommendations for improving Lemuoklis and Semantika.lt.

¹ Corresponding Author, Loïc Boizou, Vytautas Magnus University, Centre of Computational Linguistics, V. Putvinskio str. 23-216, Kaunas, Lithuania; E-mail: [email protected].
² Available at http://tekstynas.vdu.lt/page.xhtml;jsessionid=C27B0743101187E540CD32D0498C9887?id=morphological-annotator.
³ Available at http://www.semantika.lt/TextAnnotation/Annotation/Annotate.

2. Related Work

In general, existing morphological analysers can be divided by their creation method into knowledge-based (sometimes called rule-based and/or lexicon-based), supervised, and unsupervised. The knowledge-based approaches rely on rules/lexicons prepared by expert linguists and do not require additional resources (such as large annotated text corpora). Probably for this reason, this approach is still the most widespread.

The annotated corpus-based morphological analysers are the closest alternative to the knowledge-based approaches. Although such systems are built automatically in a supervised manner, the induced rules are based on the morphological annotations found in the training data. Moreover, these analysers can easily be redeveloped and improved by adding more annotated texts. However, the annotation process itself is very laborious and requires deep language expertise. In the research by Bosch and Daelemans [5], supervised Memory-Based learning was used for Dutch: their system makes a direct mapping from letters in context to rich categories that encode morphological boundaries, syntactic class labels, and spelling changes. Pauw and Schryver [6] also used the supervised Memory-Based method on either characters or syllables; the best results on the Swahili language were achieved with syllables as the chosen features. In comparative experiments with the unsupervised Morfessor [7], the authors reported a significant superiority of their supervised method. Uchimoto and Isahara [8] proposed a framework for human-aided morphological annotation of spontaneous Japanese. The authors assume that there are two types of word segments depending on their length, and that each longer word consists of one or more short words. The shorter word segments and their part-of-speech information were detected using a morpheme model implemented within the Maximum Entropy (ME) framework.
The longer word segments were determined using a chunking model (estimating the likelihood of four labels that encode whether the position of the word and its part-of-speech information agree with those of the short one) implemented within the Maximum Entropy and Support Vector Machine frameworks. Malladi and Mannem [9] applied a statistical linear SVM method with lexical features (lexical category, next word, previous word, lemma), character features (last 3 and 4 characters, word-level character n-grams) and the word length to predict gender, number, person and case for the Hindi language. The prediction of lemmas, in contrast, was performed in a machine translation manner (the unmodified input word form was treated as the source and transformed into the lemma as the target). The tense, aspect and modality were predicted using heuristics based on sequences of part-of-speech tags. Statistical methods based on statistical rules and on Hidden Markov Models (considering whether the sequence of tagged words maximizes the chain probability) were also applied to Arabic [10] and Thai [11], respectively.

Jędrzejowicz and Strychowski [12] proposed a hybrid solution for the Polish language, integrating dictionary (containing all word forms) and algorithmic (containing a set of inflection rules) approaches. The learned Decision Tree (storing possible inflection patterns) generated a list of candidates for a given word and stimulated the Artificial Neural Network to solve the disambiguation problem and to produce the valid inflection pattern. Baxi et al. [13] presented a hybrid morphological analyser for the Gujarati language that combined paradigm-based, knowledge-based, and statistical approaches.
After the paradigm-based approach generated multiple category choices, the knowledge-based approach referred to WordNet to verify these choices and select only the valid ones; lastly, the statistical approach solved the disambiguation problem. The Latvian morphological analyser also combined two approaches: a rule-based one for dictionary creation and a statistical one, based on the Conditional Markov Model, for disambiguation [14]. Indeed, the supervised approaches are accurate and easy to retrain, but not suitable for resource-scarce languages.

The unsupervised approaches have become very attractive in recent decades for a couple of reasons. Firstly, the lack of supervised morphological labels is no longer a problem. Secondly, in theory an unlimited amount of text resources could be available for any language. Below we provide a few examples of such systems. Creutz et al. [7] present the Morfessor package, mathematically based on the Minimum Description Length (MDL) principle and using Maximum a Posteriori estimation. This generative model framework takes raw text as input and segments every word into smaller units (so-called morphs, e.g., un+fail+ing+ly); the obtained segmentation often resembles the linguistic morpheme segmentation. The package was successfully applied to English and agglutinative Finnish. The robustness of Morfessor (which was predominantly designed for languages with concatenative morphology) was demonstrated on the agglutinative Turkic languages, in particular Turkish, Azeri, Kazakh, Turkmen, Kyrgyz, and Uzbek [15]. The log-linear model-based approach offered by Poon et al. [16] uses overlapping features (such as morphemes and their contexts) and incorporates exponential priors inspired by the Minimum Description Length (MDL) principle. The authors present efficient algorithms for learning and inference by combining contrastive estimation with sampling.
Their method was successfully applied to morphologically complex Arabic and Hebrew. Snyder and Barzilay [17] present a method which effectively integrates both supervised
