Catena Operations for Unified Dependency Analysis

Catena Operations for Unified Dependency Analysis Kiril Simov and Petya Osenova Linguistic Modeling Department Institute of Information and Communication Technologies Bulgarian Academy of Sciences Bulgaria {kivs,petya}@bultreebank.org Abstract to its syntactic analysis. On the other hand, a unified strategy of dependency analysis is proposed The notion of catena was introduced origi- via extending the catena to handle also phenomena nally to represent the syntactic structure of as valency and other combinatorial dependencies. multiword expressions with idiosyncratic Thus, a two-fold analysis is achieved: handling the semantics and non-constituent structure. lexicon-grammar relation and arriving at a single Later on, several other phenomena (such means for analyzing related phenomena. as ellipsis, verbal complexes, etc.) were The paper is structured as follows: the next formalized as catenae. This naturally led section outlines some previous work on catenae; to the suggestion that a catena can be con- section 3 focuses on the formal definition of the sidered a basic unit of syntax. In this paper catena and of catena-based lexical entries; sec- we present a formalization of catenae and tion 4 presents different lexical entries that demon- the main operations over them for mod- strate the expressive power of the catena formal- elling the combinatorial potential of units ism; section 5 concludes the paper. in dependency grammar. 2 Previous Work on Catenae 1 Introduction The notion of catena (chain) was introduced in Catenae were introduced initially to handle lin- (O’Grady, 1998) as a mechanism for representing guistic expressions with non-constituent structure the syntactic structure of idioms. He shows that and idiosyncratic semantics. It was shown in a for this task there is need for a definition of syntac- number of publications that this unit is appropri- tic patterns that do not coincide with constituents. ate for both - the analysis of syntactic (for exam- He defines the catena in the following way: The ple, ellipsis, idioms) and morphological phenom- words A, B, and C (order irrelevant) form a chain ena (for example, compounds). One of the impor- if and only if A immediately dominates B and C, tant questions in NLP is how to establish a connec- or if and only if A immediately dominates B and B tion between the lexicon and the text dimension in immediately dominates C. an operable way. At the moment most investiga- In recent years the notion of catena revived tions focus on the representation and analysis of again and was applied also to dependency rep- the text dimension. resentations. Catenae have been used success- We first employed catenae when modeling mul- fully for the modelling of problematic language tiword expressions in Bulgarian within the relation phenomena. (Gross 2010) presents the problems lexicon - text. (Simov and Osenova, 2014). En- in syntax and morphology that have led to the couraged by the promising results, we continued introduction of the subconstituent catena level. our research on how to exploit catenae as a uni- Constituency-based analysis faces non-constituent fied strategy for dependency analysis. In the paper structures in ellipsis, idioms, verb complexes. we use examples mostly from Bulgarian and to a Apart from the linguistic modelling of language lesser extend from English, but our approach is ap- phenomena, catenae have been used in a number plicable to other languages, as well. of NLP applications. (Maxwell et al., 2013), for In this piece of research we pursue both issues example, presents an approach to Information Re- mentioned above. On the one hand, we show in a trieval based on catenae. The authors consider formal way how the lexicon representation maps the catena as a mechanism for semantic encoding 320 Proceedings of the Third International Conference on Dependency Linguistics (Depling 2015), pages 320–329, Uppsala, Sweden, August 24–26 2015. root iobj con. conj In order to model the variety of phenomena subj det and characteristics encoded in a dependency gram- cc mar we extend the catena with partial arc and John bought and ate an apple node labels. We follow the approach taken in rootC rootC rootC CoNLL shared tasks on dependency parsing repre- conj senting for each node its word form, lemma, part det cc of speech, extended part of speech, grammatical John bought and ate an apple features (and later – semantics). This provides a flexible mechanism for expressing the combinato- Figure 1: A complete dependency tree and some rial potential of lexical items. In the following def- of its catenae. inition all grammatical features are represented as POS tags. Let us have the sets: LA — a set of POS tags1, which overcomes the problems of long-distance LE — a set of lemmas, WF — a set of word paths and elliptical sentences. The employment of forms, and a set of dependency tags D (ROOT catenae in NLP applications is additional motiva- ∈ D). Let us have a sentence x = w , ..., w .A tion for us to use it in the modelling of the interface 1 n is a directed tree T = between the treebank and the lexicon. tagged dependency tree (V, A, π, λ, ω, δ) where: Terminology note: an alternative term for catena is treelet. It has been used in the area of ma- 1. V = 0, 1, ..., n is an ordered set of nodes chine translation as a unit for translation transfer { } that corresponds to an enumeration of the (see (Quirk et al., 2005)). Their definition is equiv- words in the sentence (the root of the tree has alent to the definition of catena. Also (Kuhlmann, index 0); 2010) uses treelet for a node and its children (if any). In the paper we resort to the term catena 2. A V V is a set of arcs. For each node ⊆ × because it is closer to the spirit of the issues dis- i, 1 i n, there is exactly one arc in A: ≤ ≤ cussed here. i, j A, 0 j n, i = j. There is h i ∈ ≤ ≤ 6 exactly one arc i, 0 A; 3 Formal Definition of Catena h i ∈ In this section we define the formal presentation 3. π : V 0 LA is a total labelling func- − { } → 2 of the catena as it is used in syntax and in the tion from nodes to POS tags . π is not de- lexicon. Here we follow the definition of catena fined for the root; provided by (O’Grady, 1998) and (Gross, 2010): 4. λ : V 0 LE is a total labelling func- a catena is a word or a combination of words − { } → tion from nodes to lemmas. λ is not defined directly connected in the dominance dimension. for the root; In reality this definition of catena for dependency trees is equivalent to a subtree definition. Fig. 1 5. ω : V 0 WF is a total labelling func- depicts a complete dependency tree and some of − { } → tion from nodes to word forms. ω is not de- its catenae. Notice that the complete tree is also fined for the root; a catena itself. With “rootC ” we mark the root of the catena. It might be the same as the root of 6. δ : A D is a total labelling function for → the complete tree, but also might be different as arcs. Only the arc i, 0 is mapped to the label h i in the cases of “John” and “an apple”. Following ROOT ; (Osborne et al., 2012) we prefer to use the notion of catena to that of dependency subtree or treelet 7. 0 is the root of the tree. as mentioned above. We aim to utilize the notion 1In the formal definitions here we use tags as entities, but of catena for several purposes: representation of in practice they are sets of grammatical features words and multiword expressions in the lexicon, 2In case when we are interested in part of the grammatical their realization in the actual trees expressing the features encoded in a POS tag we could consider p as a set of different mappings for the different grammatical features. It analysis of sentences as well as for representation is easy to extend the definition in this respect, but we do not of derivational structure of compounds in the lexi- do this here. 321 We will hereafter refer to this structure as a arc labels. For example, the catena for “to spill the parse tree for the sentence x. The node 0 does beans” will allow for any realization of the verb not correspond to a word form in the sentence, but form like in: “they spilled the beans” and “he spills plays the role of a root of the tree. the beans”. Thus, the catena in the lexicon will Let T = (V, A, π, λ, ω, δ) be a tagged depen- be underspecified with respect to the grammatical dency tree. features and word form for the verb. Let T = (V, A, π, λ, ω, δ) be a tagged root dependency tree. A directed tree G = dobj (VG,AG, πG, λG, ωG, δG) is called dependency clitic catena of T if and only if there exists a mapping ψ : V V 3 such that: Vpi Pp Nc G → – си очите затварям си око 1. AG A, the set of arcs of G; ⊆ shut one’s eyes 2. π π is a partial labelling function from G ⊆ nodes of G to POS tags; Realization 1: 3. λ λ is a partial labelling function from G ⊆ rootC nodes to lemmas; dobj iobj pobj 4. ω ω is a partial labelling function from clitic G ⊆ nodes to word forms; Nc Pp Vpi R Nc Очите си затваряха пред фактите 5. δG δ is a partial labelling function for arcs.

Catena Operations for Unified Dependency Analysis

Interpreting and Defining Connections in Dependency Structures Sylvain Kahane

Some Observations on the Hebrew Desiderative Construction – a Dependency-Based Account in Terms of Catenae1

Clitics in Dependency Morphology

Interpreting and Defining Connections in Dependency Structures

Defining Dependencies (And Constituents)

Catena Operations for Unified Dependency Analysis

Transformational Grammarians and Other Paradoxes Abstract Keywords

A Lexical Theory of Phrasal Idioms

Catenae in Morphology

Why So Many Nodes? Dan Maxwell US Chess Center Washington DC, USA [email protected]

Tests for Constituents: What They Really Reveal About the Nature of Syntactic Structure

CATENAE Cambridge