Evaluation of Finite State Morphological Analyzers Based on Paradigm Extraction from Wiktionary

Evaluation of Finite State Morphological Analyzers Based on Paradigm Extraction from Wiktionary Ling Liu and Mans Hulden Department of Linguistics University of Colorado [email protected] Abstract network (RNN) encoder-decoder model to gener- ate an inflected form of a lemma for a target mor- Wiktionary provides lexical information phological tag combination. The SIGMORPHON for an increasing number of languages, in- 2016 shared task (Cotterell et al., 2016) of mor- cluding morphological inflection tables. It phological reinflection received 11 systems which is a good resource for automatically learn- used various approaches such as conditional ran- ing rule-based analysis of the inflectional dom fields (CRF), RNNs, and other linguistics- morphology of a language. This paper per- inspired heuristics. Among all the methods, one forms an extensive evaluation of a method standard technology is to use finite-state transduc- to extract generalized paradigms from ers, which are more interpretable and manually morphological inflection tables, which can modifiable, and thus more easily incorporated into be converted to weighted and unweighted and made to assist linguists’ work. Hulden (2014) finite transducers for morphological pars- presents a method to generalize inflection tables ing and generation. The inflection tables into paradigms with finite state implementations of 55 languages from the English edition and Forsberg and Hulden (2016) subsequently in- of Wiktionary are converted to such gen- troduce how to transform morphological inflection eral paradigms, and the performance of tables into both unweighted and weighted finite the probabilistic parsers based on these transducers and apply the transducers to parsing paradigms are tested. and generation, the result of which is very promis- ing, especially for facilitating and assisting lin- 1 Introduction guists’ work in addition to applications to morphological parsing and generation for downstream Morphological inflection is used in many lan- NLP tasks. However, the system was evaluated guages to convey syntactic and semantic infor- with only three languages (German, Spanish, and mation. It is a systematic source of sparsity for Finnish), all with Latin script. This paper intends NLP tasks, especially for languages with rich mor- to carry out a more extensive evaluation of this phological systems where one lexeme can be in- method. flected into as many as over a million distinct Wiktionary1 provides a source of morpholog- word forms (Kibrik, 1998). In this case, mor- ical paradigms for a wide and still increasing phological parsers which can convert the inflected range of languages, which is a useful resource of word forms back to the lemma forms, or the crosslinguistic research. The data in this work also other way around can largely benefit downstream originate with Wiktionary. tasks, like part-of-speech tagging, language mod- In this paper we evaluate the cross-linguistic eling, and machine translation (Tseng et al., 2005; performance of the paradigm generalization Hulden and Francom, 2012; Duh and Kirchhoff, method on 55 languages, of which inflection tables 2004; Avramidis and Koehn, 2008). Various ap- have been extracted from the Wiktionary data. All proaches have been adopted to tackle the mor- the languages are consistently annotated with uni- phological inflection and lemmatization problem. versal morphological tags (Sylak-Glassman et al., For example, Durrett and DeNero (2013) auto- 2015) and are in the native orthography. In par- matically extracts transformation rules from la- ticular, we evaluate the accuracy on the ability to beled data and learns how to apply these rules lemmatize previously unseen word forms and the with a discriminative sequence model. Kann and Schutze¨ (2016) proposes to use a recurrent neural 1http://www.wiktionary.org 69 Proceedings of the 13th International Conference on Finite State Methods and Natural Language Processing, pages 69–74, Umea,˚ Sweden, 4–6 September 2017. c 2017 Association for Computational Linguistics https://doi.org/10.18653/v1/W17-4009 unlabeled data (Goldsmith, 2001; Schone and Ju- 3.1 Paradigm representation rafsky, 2001; Chan, 2006; Creutz and Lagus, We represent the functions in what we call ab- 2007; Monson et al., 2008). Hammarstrom¨ and stract paradigm. In our representation, an ab- Borin (2011) provides a current overview of unsu- stract paradigm is an ordered collection of strings, pervised learning. where each string may additionally contain in- Previous work with similar semi-supervised terspersed variables denoted x1,x2,...,xn. The goals as the ones in this paper include Yarowsky strings represent fixed, obligatory parts of a and Wicentowski (2000), Neuvel and Fulop paradigm, while the variables represent mutable (2002), Clement´ et al. (2004). Recent machine parts. These variables, when instantiated, must learning oriented work includes Dreyer and Eis- contain at least one segment, but may otherwise ner (2011) and Durrett and DeNero (2013), which vary from word to word. A complete abstract documents a method to learn orthographic trans- paradigm captures some generalization where the formation rules to capture patterns across inflec- mutable parts represented by variables are instan- tion tables. Part of our evaluation uses the same tiated the same way for all forms in one particu- dataset as Durrett and DeNero (2013). Eskander lar inflection table. For example, the fairly simple et al. (2013) shares many of the goals in this paper, paradigm but is more supervised in that it focuses on learning inflectional classes from richer annotation. x1 x1+s x1+ed x1+ing A major departure from much previous work could represent a set of English verb forms, where is that we do not attempt toA encodemodel variation of Analogyx1 in this case would coincide with the infinitive as string-changing operations, say by string edits form of the verb—walk, climb, look, etc. (Dreyer and Eisner,Formal 2011) claim: or the transformation common parts rules (stem) areFor calculated more complex by extracting patterns, several variable (Linden,´ 2008;the DurrettLongest and Common DeNero, Subsequence 2013) that parts from may related be invoked,forms (Hulden some of them discontinu- perform mappings2014; Ahlberg, between Forsberg, forms. Rather, Hulden our 2014/2015)ous. For example, part of an inflection paradigm goal is to encode all variation within paradigms for German verbs of the type schreiben (to write) by presenting theminflection in a sufficiently table genericLCS = fash- schribverbs may“paradigm” be described as: ion so as to allow affixation processes, phonolog- leihen since the l can be assumed to match schrei ben x1 = schr x1+e+x2+x3+en INFINITIVE7! ical alternations as well as orthographic changes the variable x1, the i x2, and the h x3. Forsberg schrei bend x = i x1+e+x2+x3+end PRESENT PARTICIPLE to naturally fall out of the paradigm specification2 geschr ieben x = b ge+x1+x2+e+x3+en PASTand PARTICIPLE Hulden (2016) developl a model that creates itself. Also, we perform no explicit alignment3 of schrei be x1+e+x2+x3+e PRESENT 1P SG the various forms in an inflection table, as in e.g. such lemmatizing transducers from inflection ta- schrei b st x1+e+x2+x3+st PRESENT 2P SG Tchoukalov et al. (2010). Rather, we base our al- bles, which also return the inflectional information schrei b t x1+e+x2+x3+t PRESENT 3P SG gorithm on extracting the longest common subse- x1 x x of the source word form. quence (LCS) shared by2 all forms3 in an inflection If the variables are instantiated as x1=schr, x2=i, and x3=b, the paradigm corresponds to table, fromFigure whichGU alignment 1: Illustration of segments of the falls paradigm outDeep Learning extraction & Linguistics mechanism. 32 the forms (schreiben, schreibend,2.1 geschrieben, Analyzing word forms naturally. AlthoughAn inflection our paradigm table is representation given as input, and the longest com- schreibe, schreibst, schreibt). If, on the other is similar tomon and inspired subsequence by that (LCS) of Forsberg is extracted et al. and assigned to “vari- hand, x1=l, x2=i, and x3=h, theThis same model paradigm has re- the disadvantage of often return- (2006) andable Detrez´ parts” and Rantaof a more (2012), abstract our method paradigm, based on discontinu- flects the conjugation of leihen (to lend/borrow)— of generalizingities from in the inflection LCS. Several tables to inflection paradigms tables may yield the same ing a large number of plausible analyses due to (leihen, leihend, geliehen, leihe, leihst, leiht). is novel. “paradigm” in which case paradigms are collapsed, and infor- the fact that an unseen word form may fit many It is worth noting that in this representation, no mation about the shape of the variable strings xi is retained particular form is privileged indifferent the sense learned that all paradigms, and also fit them in 3 Paradigmfor statistical learning modeling. other forms can only be generatedmany from different some spe- slots. One can, however, induce In what follows, we adopt the view that words cial form, say the infinitive.a Rather, language in the model cur- of each variable part x in the and their inflection patterns can be organized rent representation, all forms can be derived from i into paradigmsability (Hockett, to assign 1954; correct Robins, morphosyntactic 1959; knowing the tags variable to instantiations. a paradigms Also, givenand create a probabilistic model which Matthews,word 1972; Stump, form. 2001). We essentially only a particular word formfavors and a hypothetical production of such analyses where variable treat a paradigm as an ordered set of functions paradigm to fit it in, the variableparts instantiations resemble can those that have been seen in the (f ,...,f ), where f : x ,...,x ⌃ , that is, often be logically deduced unambiguously. For 1 n i 1 n 7! ⇤ where each2 entry Paradigm in a paradigm is Extraction a function from example, let us say we havetraining a hypothetical data. form An n-gram model over the variables variables to strings, and each function in a partic- steigend and need to fit it in theseen above in paradigm, each paradigm can be formulated as fol- ular paradigmThe shares paradigm the same variables.

Evaluation of Finite State Morphological Analyzers Based on Paradigm Extraction from Wiktionary

Traditional Bengali Keyboar

Managing Order Relations in Mlbibtex∗

Peace Corps Romania Survival Romanian Language Lessons Pre-Departure On-Line Training

Technical Reference Manual for the Standardization of Geographical Names United Nations Group of Experts on Geographical Names

Recognition Features for Old Slavic Letters: Macedonian Versus Bosnian Alphabet

Germanic Standardizations: Past to Present (Impact: Studies in Language and Society)

On the Standardization of the Aromanian System of Writing

Comac Medical NLSP2 Thefo

Eight Fragments Serbian, Croatian, Bosnian

An Analysis of Contrasts Between English and Bengali Vowel Phonemes

LNCS 8104, Pp

Spelling Progress Bulletin Summer 1977 P2 in the Printed Version]