A Framework for Understanding the Role of Morphology in Universal Dependency Parsing

Mathieu Dehouck
Univ. Lille, CNRS, UMR 9189 - CRIStAL
Magnet Team, Inria Lille
59650 Villeneuve d'Ascq, France
[email protected]

Pascal Denis
Magnet Team, Inria Lille
59650 Villeneuve d'Ascq, France
[email protected]

Abstract

This paper presents a simple framework for characterizing morphological complexity and how it encodes syntactic information. In particular, we propose a new measure of morpho-syntactic complexity in terms of governor-dependent preferential attachment that explains parsing performance. Through experiments on dependency parsing with data from Universal Dependencies (UD), we show that representations derived from morphological attributes deliver important parsing performance improvements over standard word form embeddings when trained on the same datasets. We also show that the new morpho-syntactic complexity measure is predictive of the gains provided by using morphological attributes over plain forms on parsing scores, making it a tool to distinguish languages using morphology as a syntactic marker from others.

1 Introduction

While word embeddings have proven a good solution to reduce data sparsity in parsing (Koo et al., 2008), treating word forms as atomic units is at odds with the fact that words have a potentially complex internal structure. Furthermore, it makes parameter estimation difficult for morphologically rich languages (MRL), in which the number of possible forms a word can take can be very large (a typical English noun has 2 forms while a Finnish one may have more than 30; this shows in the data, as English lemmas have 1.39 forms on average while Finnish ones have 2.19, as measured on UD data (Nivre et al., 2016)).

Recently, researchers have started to work on morphologically informed word embeddings (Cao and Rei, 2016; Botha and Blunsom, 2014), aiming at better capturing lexical, syntactic and morphological information. But encoding lexicon and morphology in the same space makes it difficult to distinguish the role of each in syntactic tasks such as dependency parsing. Furthermore, the morphologically rich languages for which we hope to see a real impact from those morphologically aware representations might not all rely to the same extent on morphology for syntax encoding. Some might benefit mostly from reduced data sparsity, while others, for which paradigm richness correlates with freer word order (Comrie, 1981), will also benefit from the syntactic information that morphology encodes.

This paper aims at characterizing the role of morphology as a syntax encoding device for various languages. Using simple word representations, we measure the impact of morphological information on dependency parsing and relate it to two measures of language morphological complexity: the basic form per lemma ratio and a new measure (HPE) defined in terms of the head attachment preferences encoded by a word's morphological attributes. We show that this new measure is predictive of the parsing result differences observed when using different word representations, and that it allows one to distinguish, amongst morphologically rich languages, those that use morphology for syntactic purposes from those using morphology as a more semantic marker. To the best of our knowledge, this work is the first attempt at systematically measuring the syntactic content of morphology in a multi-lingual environment.

Section 2 presents the representation learning method and the dependency parsing model, and defines two measures of morphological complexity. Section 3 describes the experimental setting and analyses parsing results in terms of the previously defined morphological complexity measures. Section 4 gives some conclusions and future work perspectives.

2 Framework

This section details: (i) our method for learning lexical and morphological representations, (ii) how these can be used for graph-based dependency parsing, and (iii) how to measure morphological complexity. Our representation learning and parsing techniques are purposely very simple, in order to let us separate lexical from morphological information and weigh the role of morphology in dependency parsing of MRL.

2.1 Word Representation

We construct separate vectorial representations for lemmas, forms and morphological attributes, either learned via dimension reduction of their own cooccurrence count matrices or represented as raw one-hot vectors.

Let $V$ be a vocabulary (of lemmas, forms or morphological attributes, including values for POS, number, case, tense, mood...) for a given language. Correspondingly, let $C$ be the set of contexts defined over elements of $V$: lemmas appear in the context of other lemmas, forms in the context of forms, and attributes in the context of attributes. Then, given a corpus annotated with lemmas and morphological information, we gather the cooccurrence counts in a matrix $M \in \mathbb{N}^{|V| \times |C|}$, such that $M_{ij}$ is the frequency of lemma (form or morphological attribute) $V_i$ appearing in context $C_j$ in the corpus. Here, we consider plain sequential contexts (i.e. a surrounding bag of "words") of length 1, although we could extend them to more structured contexts (Bansal et al., 2014). The cooccurrence matrices are then reweighted by unshifted Positive Point-wise Mutual Information (PPMI) and reduced via Singular Value Decomposition (SVD). For more information on word embedding via matrix factorization, please refer to Levy et al. (2015).

Despite its apparent simplicity, this model is as expressive as more popular state-of-the-art embedding techniques. Indeed, Goldberg and Levy (2014) have shown that the SkipGram objective with negative sampling of Mikolov et al.'s (2013) Word2vec can be framed as the factorization of a shifted PMI weighted cooccurrence matrix.

This matrix reduction procedure gives us vectors for lemmas, forms and morphological attributes, noted $R$. Note that while a word has only one lemma and one form, it will often realize several morphological attributes. We tackle this issue by simply summing over all the attributes of a word (noted $\mathrm{Morph}(w)$). If we note $r_w$ the vectorial representation of word $w$, we have:

$$r_w = \sum_{a \in \mathrm{Morph}(w)} R_a.$$

Simple additive models have been shown to be very efficient for compositionally derived embeddings (Arora et al., 2017).
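To make the pipeline concrete, here is a minimal sketch in Python with numpy. It is our own illustration, not the authors' code: the symmetric window-1 count gathering, the folding of singular values into the left singular vectors (Levy et al. (2015) discuss alternative weightings), and all function names are assumptions.

```python
import numpy as np

def window1_counts(sequences, index):
    """Window-1 cooccurrence counts M, where each sequence is a list of
    vocabulary items of one kind (lemmas, forms or attributes)."""
    M = np.zeros((len(index), len(index)))
    for seq in sequences:
        for k in range(len(seq) - 1):
            M[index[seq[k]], index[seq[k + 1]]] += 1  # right neighbour
            M[index[seq[k + 1]], index[seq[k]]] += 1  # left neighbour
    return M

def ppmi(counts):
    """Unshifted Positive PMI reweighting of the count matrix."""
    total = counts.sum()
    p_i = counts.sum(axis=1, keepdims=True) / total  # target marginals
    p_j = counts.sum(axis=0, keepdims=True) / total  # context marginals
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((counts / total) / (p_i * p_j))
    return np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)

def svd_vectors(weighted, dim):
    """Rank-dim SVD reduction; rows of the result R are the vectors for
    individual lemmas, forms or morphological attributes."""
    U, S, _ = np.linalg.svd(weighted, full_matrices=False)
    return U[:, :dim] * S[:dim]

def word_vector(word_attributes, R, index):
    """Additive composition: r_w = sum of R_a over a in Morph(w)."""
    return sum(R[index[a]] for a in word_attributes)
```

For lemmas and forms, a word's representation is simply its row in $R$; `word_vector` implements the attribute sum $r_w$ defined above.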
2.2 Dependency Parsing

We work with graph-based dependency parsing, which offers very competitive parsing models, as recently re-emphasized by Dozat et al. (2017) in the CoNLL 2017 shared task on dependency parsing (Zeman et al., 2017).

Let $x = (w_1, w_2, \dots, w_n)$ be a sentence, $T_x$ the set of all possible trees over it, $\hat{y}$ the tree that we predict for $x$, and $\mathrm{Score}(\cdot, \cdot)$ a scoring function over sentence-tree pairs:

$$\hat{y} = \operatorname*{argmax}_{t \in T_x} \mathrm{Score}(x, t).$$

We use edge factorization to make the inference problem tractable: a tree's score is the sum of the scores of its edges. We use a simple linear model,

$$\mathrm{Score}(x, t) = \sum_{e \in t} \theta^\top \cdot \phi(x, e),$$

where $\phi(x, e)$ is a feature vector representing edge $e$ in sentence $x$, and $\theta \in \mathbb{R}^m$ is a parameter vector to be learned.

The vector representation of an edge $e_{ij}$, whose governor is the $i$-th word $w_i$ and whose dependent is the $j$-th word $w_j$, is defined by the outer product of their respective representations in context. Let $\oplus$ note vector concatenation, $\otimes$ the outer product, and $w_{k \pm 1}$ be the word just before/after $w_k$; then:

$$v_i = w_{i-1} \oplus w_i \oplus w_{i+1}, \quad v_j = w_{j-1} \oplus w_j \oplus w_{j+1},$$
$$\phi(x, e_{ij}) = \mathrm{vec}(v_i \otimes v_j) \in \mathbb{R}^{9d^2}.$$

Recall that $w_i$ is a vector of length $d$ from $R$.

We use the averaged Passive-Aggressive online algorithm for structured prediction (Crammer et al., 2006) to learn the model $\theta$. Given a score for each edge, we use Eisner's algorithm (Eisner, 1996) to retrieve the best projective spanning tree. Even though some languages display a fair amount of non-projective edges, on average Eisner's algorithm scores higher than the Chu-Liu-Edmonds algorithm (Chu and Liu, 1965) in our setting.
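The edge representation and scoring translate almost line for line into code. In the sketch below, `word_vecs` is assumed to stack one $d$-dimensional vector per token (including an artificial root at index 0), and zero-padding at the sentence boundaries is our own choice; the resulting score matrix is what Eisner's algorithm would decode:

```python
import numpy as np

def context_vector(word_vecs, k):
    """v_k = w_{k-1} (+) w_k (+) w_{k+1}, zero-padded at the boundaries."""
    pad = np.zeros(word_vecs.shape[1])
    left = word_vecs[k - 1] if k > 0 else pad
    right = word_vecs[k + 1] if k + 1 < len(word_vecs) else pad
    return np.concatenate([left, word_vecs[k], right])  # length 3d

def edge_features(word_vecs, i, j):
    """phi(x, e_ij) = vec(v_i (x) v_j), a vector of length 9d^2."""
    return np.outer(context_vector(word_vecs, i),
                    context_vector(word_vecs, j)).ravel()

def edge_scores(word_vecs, theta):
    """theta . phi(x, e_ij) for every candidate edge; row i is the
    governor, column j the dependent."""
    n = len(word_vecs)
    scores = np.full((n, n), -np.inf)
    for i in range(n):
        for j in range(n):
            if i != j:
                scores[i, j] = theta @ edge_features(word_vecs, i, j)
    return scores
```

A Passive-Aggressive update would then decode $\hat{y}$ from this matrix, compare it to the gold tree, and move $\theta$ along the difference of their summed edge features.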
Let ⊕ tual Information (PPMI) and reduced via Singular note vector concatenation, ⊗ the outer product and Value Decomposition (SVD). For more informa- w be the word just before/after w , then: v = tion on word embedding via matrix factorization, k±1 k i w ⊕ w ⊕ w , v = w ⊕ w ⊕ w and please refer to (Levy et al., 2015). i−1 i i+1 j j−1 j j+1 Despite its apparent simplicity, this model is as 9d2 φ(x; eij) = vec(vi ⊗ vj) 2 R : expressive as more popular state of the art em- bedding techniques. Indeed, Goldberg and Levy Recall that wi of length d V is a vector from R. (2014) have shown that the SkipGram objective We use the averaged Passive-Aggressive on- with negative sampling of Mikolov’s Word2vec line algorithm for structured prediction (Crammer (2013) can be framed as the factorization of a et al., 2006) for learning the model θ. Given a shifted PMI weighted cooccurrence matrix. score for each edge, we use Eisner algorithm (Eis- This matrix reduction procedure gives us vectors ner, 1996) to retrieve the best projective spanning for lemmas, forms and morphological attributes, tree. Even though some languages display a fair noted R. Note that while a word has only one amount of non-projective edges, on average Eisner lemma and one form, it will often realize several algorithm scores higher than Chu-Liu-Edmonds morphological attributes. We tackle this issue by algorithm (Chu and Liu, 1965) in our setting. 2865 2.3 Measuring Morpho-Syntactic Complexity This is a measure of a token preferencial attach- Some languages use morphological cues to encode ment to its head. A token with a low HPE tends to syntactic information while other encode more se- attach often to the same part-of-speech, while a to- mantic information with them. For example, the ken with a high HPE will attach to many different Case feature (especially core cases) is of prime parts-of-speech. Thus a language with a low HPE syntactic importance, for it encodes the type of re- will tend to encode a lot of syntactic information lation words have with each other. On the contrary, in the morphology, rather than in word order say. the Possessor feature (in Hungarian for example) is For example, a noun can attach to another noun more semantic in nature and need not impact sen- like a genitive, or to a verb as a subject or object, tence structure. This remark would support differ- or even to an adjective in the case of transitive ad- ent treatment for each language. However, those jective. French nouns do not inflect for case, thus languages tend to be treated equally in works deal- attachment to another noun or verb can only be in- ing with MRL. fered from words relative positions. On the con- trary, Gothic nouns do inflect for case, thus mak- Form to Lemma Ratio A basic measure of mor- ing verb or noun attachment clear directly from the phological complexity is the form per lemma ratio, morphological analysis.