<<

Operations for Unified Dependency Analysis

Kiril Simov and Petya Osenova Linguistic Modeling Department Institute of Information and Communication Technologies Bulgarian Academy of Sciences Bulgaria {kivs,petya}@bultreebank.org

Abstract to its syntactic analysis. On the other hand, a uni- fied strategy of dependency analysis is proposed The notion of catena was introduced origi- via extending the catena to handle also phenomena nally to represent the syntactic structure of as valency and other combinatorial dependencies. multiword expressions with idiosyncratic Thus, a two-fold analysis is achieved: handling the semantics and non-constituent structure. lexicon-grammar relation and arriving at a single Later on, several other phenomena (such means for analyzing related phenomena. as , verbal complexes, etc.) were The paper is structured as follows: the next formalized as catenae. This naturally led section outlines some previous work on catenae; to the suggestion that a catena can be con- section 3 focuses on the formal definition of the sidered a basic unit of . In this paper catena and of catena-based lexical entries; sec- we present a formalization of catenae and tion 4 presents different lexical entries that demon- the main operations over them for mod- strate the expressive power of the catena formal- elling the combinatorial potential of units ism; section 5 concludes the paper. in . 2 Previous Work on Catenae 1 Introduction The notion of catena (chain) was introduced in Catenae were introduced initially to handle lin- (O’Grady, 1998) as a mechanism for representing guistic expressions with non-constituent structure the syntactic structure of idioms. He shows that and idiosyncratic semantics. It was shown in a for this task there is need for a definition of syntac- number of publications that this unit is appropri- tic patterns that do not coincide with constituents. ate for both - the analysis of syntactic (for exam- He defines the catena in the following way: The ple, ellipsis, idioms) and morphological phenom- words A, B, and C (order irrelevant) form a chain ena (for example, compounds). One of the impor- if and only if A immediately dominates B and C, tant questions in NLP is how to establish a connec- or if and only if A immediately dominates B and B tion between the lexicon and the text dimension in immediately dominates C. an operable way. At the moment most investiga- In recent years the notion of catena revived tions focus on the representation and analysis of again and was applied also to dependency rep- the text dimension. resentations. Catenae have been used success- We first employed catenae when modeling mul- fully for the modelling of problematic language tiword expressions in Bulgarian within the relation phenomena. (Gross 2010) presents the problems lexicon - text (Simov and Osenova, 2014). En- in syntax and morphology that have led to the couraged by the promising results, we continued introduction of the subconstituent catena level. our research on how to exploit catenae as a uni- Constituency-based analysis faces non-constituent fied strategy for dependency analysis. In the paper structures in ellipsis, idioms, verb complexes. we use examples mostly from Bulgarian and to a Apart from the linguistic modelling of language lesser extend from English, but our approach is ap- phenomena, catenae have been used in a number plicable to other languages, as well. of NLP applications. (Maxwell et al., 2013), for In this piece of research we pursue both issues example, presents an approach to Information Re- mentioned above. On the one hand, we show in a trieval based on catenae. The authors consider formal way how the lexicon representation maps the catena as a mechanism for semantic encoding

320 Proceedings of the Third International Conference on Dependency Linguistics (Depling 2015), pages 320–329, Uppsala, Sweden, August 24–26 2015. root iobj con. conj In order to model the variety of phenomena subj det and characteristics encoded in a dependency gram- cc mar we extend the catena with partial arc and John bought and ate an apple node labels. We follow the approach taken in rootC rootC rootC CoNLL shared tasks on dependency parsing repre- conj senting for each node its word form, lemma, part det cc of speech, extended part of speech, grammatical John bought and ate an apple features (and later – semantics). This provides a flexible mechanism for expressing the combinato- Figure 1: A complete dependency tree and some rial potential of lexical items. In the following def- of its catenae. inition all grammatical features are represented as POS tags. Let us have the sets: LA — a set of POS tags1, which overcomes the problems of long-distance LE — a set of lemmas, WF — a set of word paths and elliptical sentences. The employment of forms, and a set of dependency tags D (ROOT catenae in NLP applications is additional motiva- ∈ D). Let us have a sentence x = w , ..., w .A tion for us to use it in the modelling of the interface 1 n is a directed tree T = between the treebank and the lexicon. tagged dependency tree (V, A, π, λ, ω, δ) where: Terminology note: an alternative term for catena is treelet. It has been used in the area of ma- 1. V = 0, 1, ..., n is an ordered set of nodes chine translation as a unit for translation transfer { } that corresponds to an enumeration of the (see (Quirk et al., 2005)). Their definition is equiv- words in the sentence (the root of the tree has alent to the definition of catena. Also (Kuhlmann, index 0); 2010) uses treelet for a node and its children (if any). In the paper we resort to the term catena 2. A V V is a set of arcs. For each node ⊆ × because it is closer to the spirit of the issues dis- i, 1 i n, there is exactly one arc in A: ≤ ≤ cussed here. i, j A, 0 j n, i = j. There is h i ∈ ≤ ≤ 6 exactly one arc i, 0 A; 3 Formal Definition of Catena h i ∈ In this section we define the formal presentation 3. π : V 0 LA is a total labelling func- − { } → 2 of the catena as it is used in syntax and in the tion from nodes to POS tags . π is not de- lexicon. Here we follow the definition of catena fined for the root; provided by (O’Grady, 1998) and (Gross, 2010): 4. λ : V 0 LE is a total labelling func- a catena is a word or a combination of words − { } → tion from nodes to lemmas. λ is not defined directly connected in the dominance dimension. for the root; In reality this definition of catena for dependency trees is equivalent to a subtree definition. Fig. 1 5. ω : V 0 WF is a total labelling func- depicts a complete dependency tree and some of − { } → tion from nodes to word forms. ω is not de- its catenae. Notice that the complete tree is also fined for the root; a catena itself. With “rootC ” we mark the root of the catena. It might be the same as the root of 6. δ : A D is a total labelling function for → the complete tree, but also might be different as arcs. Only the arc i, 0 is mapped to the label h i in the cases of “John” and “an apple”. Following ROOT ; (Osborne et al., 2012) we prefer to use the notion of catena to that of dependency subtree or treelet 7. 0 is the root of the tree. as mentioned above. We aim to utilize the notion 1In the formal definitions here we use tags as entities, but of catena for several purposes: representation of in practice they are sets of grammatical features. words and multiword expressions in the lexicon, 2In case when we are interested in part of the grammatical their realization in the actual trees expressing the features encoded in a POS tag we could consider p as a set of different mappings for the different grammatical features. It analysis of sentences as well as for representation is easy to extend the definition in this respect, but we do not of derivational structure of compounds in the lexi- do this here.

321 We will hereafter refer to this structure as a form like in: “they spilled the beans” and “he spills parse tree for the sentence x. The node 0 does the beans”. Thus, the catena in the lexicon will not correspond to a word form in the sentence, but be underspecified with respect to the grammatical plays the role of a root of the tree. features and word form for the verb.

Let T = (V, A, π, λ, ω, δ) be a tagged root dependency tree. A directed tree G = dobj (VG,AG, πG, λG, ωG, δG) is called dependency catena of T if and only if there exists a mapping Vpi Pp Nc ψ : V V 3 such that: G → – си очите 1. A A, the set of arcs of G; затварям си око G ⊆ shut one’s eyes 2. π π is a partial labelling function from G ⊆ nodes of G to POS tags; Realization 1:

3. λ λ is a partial labelling function from root G ⊆ C nodes to lemmas; dobj iobj pobj 4. ω ω is a partial labelling function from clitic G ⊆ nodes to word forms; Nc Pp Vpi R Nc Очите си затваряха пред фактите 5. δG δ is a partial labelling function for arcs. око си затварям пред факт ⊆ eyes one’s shut at facts A directed tree G = (VG,AG, πG, λG, ωG, δG) is a dependency catena if and only if there exists Realization 2: a dependency tree T such that G is a dependency root subj catena of T . dobj Having partial functions for assigning POS tags, clitic dependency labels, word forms and lemmas al- Np Pp Vpi Nc lows us to construct arbitrary abstractions over the Иван си затваряше очите structure of a catena. Thus, the catena could be Иван си затварям око underspecified for some of the node labels, like Ivan one’s shut eyes grammatical features, lemmas and also some de- pendency labels. The mapping ψ parameterizes Figure 2: Catena realization. the catena with respect to different dependency trees. Using the mapping, there is a possibility to This underspecified catena will be called a lex- realize different word orders of the catena nodes, icon catena (LC), because its kind will be stored for instance. The omission of node 0 from the in the lexical entries. The Fig. 2 depicts two real- range of the mapping ψ excludes the external root izations (with different word orders) of the catena of the tagged dependency tree from each catena. for the idiom затварям си очите (’zatvaryam si CatR is the root of the catena. The catena could be ochite’, lit. shut one’s eyes). The upper part of the a word or an arbitrary subtree. image represents the lexicon catena for the idiom. We call the mapping of a catena into a given It determines the fixed elements of the catena: dependency tree the realization of the catena in the arcs, their labels, nodes and their labels: ex- the tree. We consider the realization of the catena tended part of speech (first row), word forms (sec- as a fully specified subtree including all node and ond row), lemmas (third row), and gloss in En- arc labels. For example, the catena for “to spill the glish (fourth row)4. The dash (–) in the word form beans” will allow for any realization of the verb row means that the word form is not defined for 3This mapping allows for embedding of G in different the verbal node. In the two realizations the fixed tagged dependency trees and thus different word order real- elements of the catena are represented as in the izations of the catena nodes (corresponding to word forms in T .). The mapping y is specific for G and T . It allows also 4In the next examples we will present only the important the image of G in T not to be a subtree of T , but several sub- information, thus, some of these rows will be missing. In trees of T . A special case is discussed below — partition and other cases new rows will be used to represent additional in- extension operations. formation.

322 image of the catena. The word order in the two such as “the devil is in the details”; the realizations realizations is different. Thus, using catenae with of catenae from the lexicon into syntax often are different underspecified elements defines different accompanied by intervening material — see the levels of freedom of realization of the multiword discussion in (Osborne et al., 2012). For example, expressions. the idiom allows realizations such as: “the devil Let G1 and G2 be two catenae. A composition will be in the details”, “the devil seems to be in of G1 and G2 is a catena Gc, such that the cate- the details”, etc. nae G1 and G2 are realized in Ge in such a way Our insight, supported by the examples, is that that the root node of G2 is mapped to a node in the intervening material forms a catena of a certain Gc to which a node of G1 is mapped. Each node type. Such a type of catena will be called an auxil- 6 in Gc is an image of a node from G1 or G2. The iary catena in this paper, although it could be of realizations of both catenae G1 or G2 share ex- different kinds (auxiliary, modal, control, etc.), de- actly one node in Gc. This node has to represent pending on the verb forms. In order to implement all the information from the nodes that are mapped this idea we need some additional notions. to it. In this way we could realize the selectional Let G = (VG,AG, πG, λG, ωG, δG) be a catena restriction of a given lexical unit with respect to a and n V , then G ,G , ..., G is a partition of ∈ G 1 2 n catena in a sentence. For example, let us assume G on node n if and only if for each 1 i n: ≤ ≤ that the verb ‘to read’ requires a subject to be a human and an object to be an information object. 1. each Gi is a catena which is a subtree of G In Fig. 3 we present how the catena for ‘I read’ 2. at most one subcatena G has n as a leaf node is combined with the catena ‘a book’ in order to i form the catena ‘I read a book’. The figure repre- 3. one or more subcatenae Gi have n as a root sents only the level of word forms and a level of node semantics (specified only for the node, on which the composition is performed). The catena for ‘I 4. the only common node for all subcatenae Gi read ...’ specifies that the unknown direct object is n has the semantics of an Information Object (In- fObj). The catena for ‘a book’ represents the fact 5. the mappings πGi, λGi, ωGi, βGi are the that the book is an Information Object. Thus the same as for the whole catena G, except for two catenae could be composed on the two nodes the node n where the mappings πGi, λGi, ωGi marked as InfObj. The result is represented at the could be partial with respect to the original bottom of the Fig. 3.5 mappings.

rootC rootC An example of the operation partition of the devil is in the details is given in Fig. 4. dobj subj det After the partition of a catena for an idiom we I read - a book need a mechanism to connect the different catenae InfObj InfObj of the partition with the auxiliary catena. Let G be a catena and for n V ,G , G, ..., G root ∈ G 1 n be a partition of G and Ga be an auxiliary catena. dobj subj det An extension of G on partition G1,G2, ..., Gn with catena G is a catena G such that each I read a book a e catena G1,G2, ..., Gn and the auxiliary catena Ga are realized in Ge in such a way that the node Figure 3: Composition of catenae. ni in Gi (corresponding to the original node n) is mapped to a node in Ge to which a node of Ga is Some MWEs require more complex operations 6 over catenae in order to deal with them. Such a Under auxiliary catena we understand a catena that is part of the verbal complex and contains nodes for the aux- class of MWEs are idioms with an explicit subject, iliary verbs. In the grammars for the different languages dif- ferent kinds of catena could be defined on the basis of there 5In this representation many details like lemmas and role in the grammar. In this respect the definition of exten- grammatical features are not presented because they are not sion here is restricted to verbal complex, but easy could be important for the example. adapted for other cases when necessary.

323 root pobj det subj comp det

D N V R D N The devil is in the details the devil be in the detail root root pobj det subj comp det

D N V V R D N The devil - - in the details the devil - be in the detail

Figure 4: Partition.

root root root pobj det subj comp comp det

D N V Aux V V R D N The devil - will - - in the details the devil - will - be in the detail root pobj det subj comp comp det

D N Aux V R D N The devil will be in the details the devil will be in the detail

Figure 5: Extension.

mapped. Each node in Ge is an image of a node the operations of composition, partition and ex- from G1,G2, ..., Gn or Ga. tension we could define a procedure for analysis An example of the operation extention is pre- of actual sentences. sented in Fig. 57 For each node in a catena or dependency tree we Two catenae G1 and G2 could have the same set present the following information: POS, Gram- of realizations. In this case, we will say that G1 matical Features, Word Form, Lemma, Node iden- and G2 are equivalent. Representing the nodes tifier (position of word form in a catena or a sen- via paths in the dependency tree from root to the tence). Each of the information is depicted in the corresponding node and imposing a linear order node representation on a different row. over this representation of nodes facilitates the se- In order to model the behavior in a better way lection of a unique representative of each equiva- we need to add semantics to the dependency rep- lent class of catenae. Thus, in the rest of the pa- resentation. We will not be able to do this in full per we assume that each catena is representative in this paper. In order to represent the interaction of its class of equivalence. This representation of between lexical items and their valency frames a catena will be called canonical form. in the lexicon, we assume a semantic analysis Using the notion of catena introduced in this based on Minimal Recursion Semantics (MRS) section we define the structure of lexical items in (see (Copestake et al., 2005)). For dependency the lexicon of a dependency grammar. Through analyses, the MRS structures are constructed in a way similar to the one presented in (Simov and 7Notice that there are alternative analyses in which the auxiliary verb is not a head of the sentence, but a dependent Osenova, 2011). In this work, the root of a subtree of the . of a given dependency tree is associated with the

324 MRS structure corresponding to the whole sub- basis of syntactic analyses, includes information tree. This means that for the semantic interpre- about the main form (lemma) of the word, the tation of MWEs we will use the root of the cor- valency frame with all the elements, their forms, responding catena. In the dependency tree for the grammatical features and semantics (Osenova et corresponding sentence the catena root will - al., 2012). The lexical entry for each lexical item vide the interpretation of the MWE and its depen- also includes the semantics of the main form and dent elements, if any. In the lexicon we will pro- information on how this semantics incorporates vide the corresponding structure to model the id- the semantics of each frame element. iosyncratic semantic content of MWE. Here we first present the structure of the lex- Our goal is to use catenae to represent the syn- ical entry for the verb ‘бягам’ (’byagam’, run) tactic and morphological form of lexical units in in the sense "run away from facts". The verb the lexicon. The lexical units could be multiword takes an indirect object in the form of a prepo- expressions or single words. The lexical entry for sitional phrase starting with the preposition ‘от’ a lexical unit has the following fields: lexicon- (’ot’, from). Here are the examples: catena (LC) which contains a catena for the lex- ical item; semantics (SM) represents the seman- rootC tic content of the lexical item; valency frame (Frame) contains a catena of the frame element and its semantics. The field Frame can be re- Vpi peated as many times as necessary. Each valency – frame corresponds to a syntactic relation of the – dependent element. Alternative valencies for a бягам given syntactic relation are represented in differ- run CNo1 ent Frame fields. LC Here lexicon-catena determines the lexicon SM CNo1: run-away-from_rel(e,x0,x1), { form of the lexical unit. The underspecification fact(x1), [1](x1) } of the catena allows for the different realizations rootC of the catena in the actual sentences. The seman- iobj pobj tics field defines the basic semantics of the lexical unit. The valency frame field provides selectional Vpi R N restriction for the lexical unit. Because the lexical – – – unit could be a multiword expression, the seman- – от – tics and selectional restrictions could be assigned бягам от – run from – to different nodes of the corresponding catena. In CNo1 No1 No2 this way, different parts of the semantics could be Frame provided by different nodes in the catena or from semantics: the catena related to the selectional restrictions. No2: { fact(x), [1] (x) } The selectional restrictions of a lexical unit also Figure 6: Lexical entry for the verb could be connected to different nodes of the lexi- бягам "byagam", ‘run’). cal catena. In this way the lexical entry determines the possible variations of multiword expressions (MWEs). Below we present concrete lexical en- In this model we use catenae for the represen- tries for different types of lexical units, demon- tation of a single word and a MWE, because by strating selectional restrictions of verbs, nouns, definition single words are also catenae. Using multiword expressions. the formal definition of catena from above, we might specify all grammatical features of the lexi- 4 Lexical Entry Examples cal item. The semantics in the lexical entry could be attached to each node in the lexicon-catena. In In this section we present some types of lexical this example, there is just one node of the lexicon- entries using the structure of the lexical entry pre- catena. In the paper we present only the set of ele- sented above. The examples are taken from the mentary predicates instead of the full MRS struc- valency lexicon of Bulgarian, constructed on the tures with the aim to demonstrate the principles of

325 the representation. In the example, the verb in- descriptions are alternatives follows from the fact troduces three elementary predicates: run-away- that the verb has no more than one indirect object. from_rel(e, x0, x1), fact(x1), [1](x1). The pred- If there is also a direct object then the valency set icate run-away-from_rel(e, x0, x1) represents the will contain elements for it as well. event and its main participants: x0, x1. The pred- rootC icate fact(x1) is part of the meaning of the verb in dobj the sense that the agent represented by x0 will run away from some fact. There is also one underspec- clitic ified predicate [1](x1) which has to be compatible Vpi Pp Nc with the predicate fact(x1). This predicate is used – poss plur|def for incorporating the meaning of the indirect ob- – си очите затварям си око ject. The valency frame is given as a set of valency shut one’s eyes elements. They are defined as a catena and se- CNo1 CNo1 CNo2 mantic description. The catena describes the basic LC SM CNo1: run-away-from_rel(e,x ,x ), structure of the valency element including the nec- { 0 1 fact(x ), [1](x ) essary lexical information, grammatical features, 1 1 } the syntactic relation to the main lexical item. The rootC semantic description determines the main seman- iobj pobj tic contribution of the frame element and via struc- tural sharing it is incorporated in the semantics Vpi R N of the whole lexical item. In the example there – – – is only one frame element. It is introduced via – пред – затварям пред – the preposition ‘ot’ (from). The semantics comes shut at – from the dependent noun which has to be compat- CNo1 No1 No2 ible with fact(x) predicate and via the underspeci- Frame fied predicate [1](x1) which could specify a more semantics: concrete predicate. Via the structure sharing index No2: { fact(x), [1] (x) } [1] this specific predicate is copied to the seman- rootC tics of the main lexical item. iobj pobj The lexical entry of a MWE uses the same for- mat: a lexicon-catena, semantics and valency. Vpi R N The lexicon-catena for the MWEs is stored in its – – – – за – canonical form as described above. The seman- затварям за – tics part of a lexical entry specifies the list of el- shut for – ementary predicates for the MRS analysis. When CNo1 No1 No2 Frame the MWE allows for some modification (also ad- junction) of its elements, i.e. modifiers of a noun, semantics: No2: { fact(x), [1] (x) } the lexical entry in the lexicon needs to specify the role of these modifiers. For example, the MWE Figure 7: Lexical entry for затварям си очите from the above example ‘затварям си очите’ “zatvaryam si ochite”, ‘I close my eyes’. which is synonymic to the verb ‘byagam’ pre- sented above, is depicted in Fig. 7.8 The lexi- The semantics and the valency information are cal entry is similar to the one shown earlier. The attached to the corresponding nodes in the catena main differences are: the lexicon-catena is for the representation. In the example in Fig. 7 only the MWE instead of a single word. The semantics is information for the root node of the catena is given the same, because the verb and the MWE are syn- (identifier CNo1). onyms. The valency frame contains two alterna- tive elements for indirect object introduced by two In cases when other parts of the catena allow different prepositions. The situation that the two modification, the information for the correspond- ing nodes will be given. Here we provide ex- 8The grammatical features are: ‘poss’ for possessive pro- amples of such cases. For example, the Multi- noun, ‘plur’ for plural number and ‘def’ for definite noun. word Expression ‘среща на върха’ (’sreshta na

326 rootC rootC

mod pobj mod

Nc R Nc A Nc – – sing|semdef – indef – на – – – среща на връх снежен човек meeting at peak-the snowy man CNo1 CNo2 CNo3 CNo1 CNo2 LC LC SM CNo1: meeting_rel(e, x), SM CNo1: snowman_rel(x) { { } member(y,x), head-of-a-country(y,z), Frame ∅ country(z), [1](z)) } rootC Figure 9: Lexical Entry for снежен човек mod “snezhen chovek”, ‘snowman’. A Nc def – but also definite or indefinite forms depending on – – the usage of the phrase. The possible modifiers – връх of the MWE are determined by the represented – peak No1 CNo3 semantics. The relation snowman_rel(x) is taken Frame from an appropriate ontology where its conceptual semantics: definition is given. No1: { [1] (x) } rootC Figure 8: Lexical Entry for среща на върха “sresta na varha", ‘summit’. Nc – varha’, summit) allows for modification not only – of the whole catena, but also of the noun within баща the prepositional phrase. The lexical entry is given father in Fig. 89. This lexical entry allows modifications CNo1 LC like ‘европейски’ (European) — среща на ев- SM CNo1: father-of(x,y), human(y), ропейския връх (’sreshta na evropeyskiya vrah’, { [1](y) meeting of the European top). This catena allows } rootC also modification of the head word. The next example presented here is for the mod pobj multiword ‘снежен човек’ meaning “a man-like Nc R Nc sculpture from snow”. It does not allow any modi- – – – fication of the dependent node ‘снежен’ (snowy), – на – but it allows for modifications of the root like баща на – “large snow man” etc. The lexical entry is given father of – 10 CNo1 No1 No2 in Fig. 9 . The grammatical features for the head Frame noun (indef for indefinite) restricts its possible semantics: form. In this way, singular and plural forms are al- No2: { human(y), [1](y) } lowed. The empty valency ensures that the depen- dent adjective cannot be modified except for mor- Figure 10: Lexical Entry for баща на “bashta na”, phological variants like singular and plural forms, ’father of’.

9The grammatical features are: ‘sing’ for singular number Fig. 10 shows an example of non-verbal va- and ‘semdef’ for definite subtree. Features like ‘semdef’ are lency: the lexical entry of the relational noun ‘ба- specified for root node, but can be realized on a form inside the subtree. ща [на ...]’ (’bashta na...’, father of ...). 10The grammatical feature is: ‘indef’ for indefinite noun. In the examples so far, the selectional restric-

327 tions are potential and it is possible for them not entry is the representation of constraints over the to be realized in the actual text. But in some word order of the catena nodes. We envisage ad- cases they are obligatory. Here we present one dition of such constraints as future work. The in- such example for the verb ‘състоя се’(’sastoya formation from the lexical entries is combined by se ot’, consist of). It requires an obligatory indi- different operations on the elements of the lexical rect object introduced by the preposition ‘от’ (’ot’, entries structure. The main operation on catenae from) as in the sentence: Системата се състои is the realization in dependency trees. The two от два модула (’Sistemata se sastoi ot dva mod- other operations are extension and composition of ula’, The system consists of two modules.). In or- catenae. The extension is used when an MWE or der to ensure that the indirect object will be al- other catena needs to be realized together with an ways realized, we encode the preposition as an el- auxiliary catena as in the case of sentence MWEs ement of the lexicon catena. See the lexical entry where the subject catena is detached from the ver- in Fig. 1111. bal catena and realized as a subject of the auxiliary catena (see the example in Fig. 5). The composi- rootC iobj tion is used when the valency catena is realized with the main lexical catena (see the example in clitic Fig. 3). Vpi Pp R – refl – 5 Conclusion and Future Work – се от състоя се от The paper demonstrates using Bulgarian data that consist REFL of the modeling at the level of catena is appropriate CNo1 CNo2 CNo3 LC for encoding language units (including multiword SM consist-of(e, x, y), [1](y) expressions and valencies) at the lexicon-syntax { } rootC interface. The catena allows for additional mate- rial to be inserted, based on the information from pobj valence lexicons and contexts. Additionally, a se- R N mantics component is added for ensuring the cor- – – rect interpretation of the language units. от – The paper confirms the conclusions from previ- от – ous works that catena is an appropriate means for of – encoding idioms and idiosyncratic language mate- CNo3 No1 Frame rial. With respect to idioms it is very useful for semantics: cases where in addition to the figurative meaning No2: { [1](y) } the literal meaning also remains a possible inter- pretation. The paper also extends the catena mech- Figure 11: Lexical Entry for състоя се от “sas- anism to incorporate valency and semantic infor- toya se ot”, ‘I consist of’. mation. The formalization of the catena provides defini- These examples demonstrate the power of the tions of operations over catenae which allow com- combination of catenae (as subtree units), MRS bination of catenae in complete analyses of sen- structures (as semantic units) and valency rep- tences. In our work here we assume that cate- resentation (as subcategorization units) to model nae could have only one node in common — the MWEs and valencies in the lexicon. The catena is node on which they extend or combine. This as- appropriate for representation of syntactic struc- sumption is motivated by the examples of MWEs ture; the semantic part represents the idiosyncratic that are idioms. Idioms usually interact with other semantics of the MWE and the semantics of valen- catenae in a sentence via one of their nodes. But cies and determines the possible semantic modifi- this requirement might be relaxed for the other cation, and the valency part determines the syn- catenae in the lexicon. In this way, in valency tactic behavior of MWEs and other dependency one could specify more than one common node expressions. One missing element of the lexical between the lexical catena and the valency catena. 11The grammatical feature is: ‘ref’ for reflexive pronoun. We do not employ any specific dependency the-

328 ory in our approach, but we believe that the pro- (eds. Proceedings of the Eight International Con- posed modeling might be incorporated in most of ference on Language Resources and Evaluation them, if not all. (LREC’12), pages 2636–2640. Chris Quirk, Arul Menezes, and Colin Cherry. 2005. Acknowledgements Dependency treelet translation: Syntactically in- formed phrasal smt. In Proceedings of ACL. Asso- This research has received support by the EC’s ciation for Computational Linguistics, June. FP7 (FP7/2007-2013) under grant agreement number 610516: “QTLeap: Quality Translation Kiril Simov and Petya Osenova. 2011. Towards mini- mal recursion semantics over bulgarian dependency by Deep Language Engineering Approaches” and parsing. In Proceedings of the RANLP 2011. by European COST Action IC1207: “PARSEME: PARSing and Multi-word Expressions. Towards Kiril Simov and Petya Osenova. 2014. Formalizing linguistic precision and computational efficiency multiwords as catenae in a treebank and in a lexi- con. In Verena Henrich, Erhard Hinrichs, Daniël in natural language processing.” de Kok, Petya Osenova, Adam Przepiórkowski (eds.) We are grateful to the three anonymous review- Proceedings of the Thirteenth International Work- ers, whose remarks, comments, suggestions and shop on Treebanks and Linguistic Theories (TLT13), encouragement helped us to improve the initial pages 198–207. variant of the paper. All errors remain our own responsibility.

References Ann Copestake, Dan Flickinger, Carl Pollard, and Ivan Sag. 2005. Minimal recursion semantics: An in- troduction. Research on Language & Computation, 3(4):281–332. Thomas Gross. 2010. Chains in syntax and morphol- ogy. In Ryo Otoguro, Kiyoshi Ishikawa, Hiroshi Umemoto, Kei Yoshimoto, and Yasunari Harada, ed- itors, PACLIC, pages 143–152. Institute for Digital Enhancement of Cognitive Development, Waseda University. Marco Kuhlmann. 2010. Dependency Structures and Lexicalized Grammars: An Algebraic Approach, volume 6270 of Lecture Notes in Computer Science. Springer. K. Tamsin Maxwell, Jon Oberlander, and W. Bruce Croft. 2013. Feature-based selection of dependency paths in ad hoc information retrieval. In Proceed- ings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Pa- pers), pages 507–516, Sofia, Bulgaria, August. As- sociation for Computational Linguistics. William O’Grady. 1998. The syntax of idioms. Natu- ral Language and Linguistic Theory, 16:279–312. Timothy Osborne, Michael Putnam, and Thomas Groß. 2012. Catenae: Introducing a novel unit of syntactic analysis. Syntax, 15(4):354–396. Petya Osenova, Kiril Simov, Laska Laskova, and Stanislava Kancheva. 2012. A Treebank-driven Creation of an OntoValence Verb Lexicon for Bul- garian. In Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Mehmet U˘gur Do˘gan and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis

329