Querying Dependency Treebanks in XML
Gosse Bouma, Geert Kloosterman

Computational Linguistics, Groningen University, [email protected]
Artificial Intelligence, Groningen University, [email protected]

Abstract

The need for manual editing during construction of a treebank may impose constraints on the representation of dependency trees which are not optimal for linguistic exploration. Using XML technology, it is possible to maintain the treebank both in a form suitable for editing and in a form suitable for linguistic exploration. By choosing a compact representation, we can use XPath directly as query language. We argue that, given an explicit encoding of string positions, this direct encoding of dependency trees as XML trees can represent discontinuous constituents in a way that supports queries involving both dependency and linear order.

1. Introduction

During development of a treebank, the emphasis is often on representing the treebank in a way which supports interaction with a parser environment (for automatically generating analyses) as well as with an editor (for correcting analyses manually). For users of a treebank, however, a representation which supports linguistic exploration is essential. It is not always clear that these two requirements can be satisfied by a single representation.

Below, we describe the Alpino treebank, a dependency treebank for Dutch which is being developed using a wide-coverage parser for Dutch and a graphical tool for displaying and editing linguistic data structures. The initial motivation for developing the treebank was the need for evaluation material for the syntactic parser for Dutch, which is being developed in parallel with the treebank. For (automatic) evaluation, the internal format of the treebank is not very important. However, as the treebank grows in size, it becomes increasingly interesting to explore it interactively as well. Queries to the treebank may be motivated by linguistic interest (e.g. which verbs take inherently reflexive objects?) but can also be a tool for quality control (e.g. find all PPs where the head is not a preposition).

The XPath standard implements a powerful query language for XML documents, which can be used to formulate queries over the treebank. However, we found that both the complexity and the size of the XML documents produced during the annotation process make these less suitable for querying. Implementation of a query expansion tool only partially solved this problem. Therefore, we transformed the annotated data into a more compact XML representation, ideally suited for linguistic exploration. As the new format encodes dependency trees directly, it can be explored using XPath only. Using XPath has the advantage that existing technology can be used to query the treebank.

One of the attractions of dependency trees is the fact that dependents may span discontinuous parts of the input string. For a language like Dutch, with its crossing dependency word orders, this is especially important. Several possibilities exist for encoding discontinuous constituency in XML. In our proposal, the structure of the dependency tree is mapped directly onto an XML tree with the same structure. The relation with word order in the input string is encoded by adding attributes for string positions. We show that this allows subtle queries concerning linear order and dependency to be stated.

In the next section, we briefly describe the grammar used to create dependency trees. Next, we introduce the Alpino treebank and the annotation process. In section 4, we present a transformation of the original treebank format into a format which supports linguistic exploration. In section 5, we discuss how we encode word order in dependency trees. We conclude with various example queries illustrating the kind of information that can be extracted.

2. The Alpino Grammar

The Alpino grammar (Bouma et al., 2001) is a lexicalized grammar for Dutch in the tradition of constructionalist Head-driven Phrase Structure Grammar (Pollard and Sag, 1994).[1] The grammar currently contains over 270 rules, defined in terms of general rule structures and principles. The grammar covers a substantial part of the syntactic constructions of Dutch (including main and subordinate clauses, (indirect) questions, imperatives, (free) relative clauses, a wide range of verbal and nominal complementation and modification patterns, verbal crossing-dependency constructions, extraposition, and coordination) as well as a wide variety of more idiosyncratic constructions (appositions, verb-particle constructions, PPs including a particle, NPs modified by an adverb, punctuation, etc.). The lexicon contains approximately 47,000 lemmas. Lemmas are associated with complicated attribute-value matrices, containing, for instance, subcategorization frames enriched with dependency relations. The lexicon was created to a large extent by extracting information from two existing resources (Bouma, 2001): a version of Celex (Baayen et al., 1993) extended with valency information, and Parole. Currently, the lexicon contains definitions for 70 different verbal subcategorization types, various nominal types (nouns with various complementation patterns, proper names, pronouns, temporal nouns, deverbalized nouns), various adjectival subcategorization types, and distinct complementizer, determiner, and adverb types.

[1] Alpino is being developed as part of the NWO PIONIER project Algorithms for Linguistic Processing, www.let.rug.nl/~vannoord/alp

Linguistic analysis of a given input string on the basis of the grammar described above proceeds in three steps. First, the lexical analysis phase assigns the most likely part-of-speech tags to the input. Lexical analysis recognizes multi-word lexical items, and uses several heuristics to decide on the most likely tags for unknown words. The POS-tagger consists of an HMM trained on material analyzed automatically by the grammar without POS-tagging (Prins and van Noord, 2001). Next, a parse forest is constructed, in which the various analyses (typically, several hundreds) of the input are represented in a compact and non-redundant manner. Parsing is robust, in that it will find the constituents spanning a maximal portion of the input in case no full parse is available (van Noord, 2001). The third step consists of the selection of the best parse from the parse forest. Here, we use a log-linear statistical model which computes the likelihood of an analysis as the weighted sum of predefined properties of a parse. Relevant properties are defined by hand, and weights are estimated using a combination of supervised and unsupervised Maximum Entropy learning (Bouma et al., 2001). Supervised learning uses the dependency treebank described below.

We use the dependency treebank described below for evaluation of the parser in a manner similar to that described in Carroll et al. (1998). The system currently identifies dependency relations with an accuracy of around 80%.

3. The Alpino Dependency Treebank

To evaluate the coverage and disambiguation component of the system, a testbench containing syntactically annotated material is absolutely crucial. Furthermore, (supervised) training of a statistical disambiguation module requires syntactically annotated material. Given the current lack of such material for Dutch, we have started to annotate a corpus of newspaper text with dependency trees in parallel with the grammar development effort.

Dependency structures make explicit the dependency relations between constituents in a sentence. Each non-terminal node in a dependency structure consists of a head daughter and a list of non-head daughters, whose dependency relation to the head is marked. An example is given in figure 1. Control relations are encoded by means of co-indexing (i.e. the subject of hebben is the dependent with index 1). Note that a dependency structure does not necessarily reflect (surface) syntactic constituency. The dependent haar nieuwe model gisteren aangekondigd, for instance, does not correspond to a (surface) syntactic constituent.

s
  su: noun (index 1) mercedes
  hd: verb zou
  vc: vp
    hd: verb hebben
    su: index 1
    vc: vp
      mod: adv gisteren
      hd: verb aangekondigd
      su: index 1
      obj1: np
        det: det haar
        mod: adj nieuwe
        hd: noun model

Figure 1: Dependency structure for Mercedes zou haar nieuwe model gisteren hebben aangekondigd, Mercedes should her new model yesterday have announced.

[...] level of annotation which is especially suited for annotating languages with strong word order variation and discontinuous constituency, such as Czech, German, or Dutch (Hajicova et al., 1998; Skut et al., 1997). Dependency relations also have been used successfully in statistical parsing (Collins, 1999). An important further reason for adopting dependency trees in our case is that this ensures consistency with the guidelines of the Spoken Dutch Corpus (Oostdijk, 2000; Moortgat et al., 2000). This project will eventually provide a substantial amount of syntactically annotated spoken language, which we hope to use for further training and testing of the Alpino grammar.

3.1. Construction of the Treebank

The annotation process typically starts by parsing a sentence with the Alpino grammar. This produces an (often large) number of possible analyses. The annotator picks the analysis which best matches the correct analysis. To facilitate selection of the best parse among a large number of possibilities, the grammar development environment HDRUG (van Noord and Bouma, 1997) has been extended with a number of auxiliary functions. First, a graphical tool allows the annotator to select or remove POS-tags suggested by lexical analysis. Second, the input string may be extended with brackets, which must be respected by grammatical analysis. Finally, a graphical tool based on the SRI TreeBanker (Carter, 1997) has been added which displays all remaining fragments of the input which are a source of ambiguity. By disambiguating these items (usually a much smaller number than the number of analyses), the annotator can quickly pick the most accurate parse.

If the parse selected by the annotator is fully correct, the

