
Dependency treebanks: methods, annotation schemes and tools

Tuomo Kakkonen
University of Joensuu
P.O. Box 111
FI-80101 Joensuu
[email protected]

Abstract

In this paper, current dependency-based treebanks are introduced and analyzed. The methods used for building the resources, the annotation schemes applied, and the tools used (such as POS taggers, parsers and annotation software) are discussed.

1 Introduction

Annotated data is a crucial resource for developments in computational linguistics and natural language processing. Syntactically annotated corpora, treebanks, are needed for developing and evaluating natural language processing applications, as well as for research in empirical linguistics. The choice of annotation type in a treebank usually boils down to two options: the linguistic resource is annotated either according to some constituent or functional structure scheme. As the name treebank suggests, these linguistic resources were first developed in the phrase-structure framework, usually represented as tree-shaped constructions. The first efforts to create such resources started around 30 years ago. The most well-known of such treebanks is the Penn Treebank for English (Marcus et al., 1993).

In recent years, there has been wide interest in the functional annotation of treebanks. In particular, many dependency-based treebanks have been constructed. In addition, grammatical function annotation has been added to some constituent-type treebanks. Dependency Grammar formalisms stem from the work of Tesnière (1959). In dependency grammars, only the lexical nodes are recognized, and the phrasal ones are omitted. The lexical nodes are linked with directed binary relations. The most commonly used argument for selecting the dependency format for building a treebank is that the treebank is being created for a language with a relatively free word order. Such treebanks exist e.g. for Basque, Czech, German and Turkish. On the other hand, dependency treebanks have been developed for languages such as English, which have usually been seen as languages that can be better represented with a constituent formalism. The motivations for using dependency annotation vary from the fact that this type of structure is the one needed by many, if not most, applications to the fact that it offers a proper interface between syntactic and semantic representation. Furthermore, dependency structures can be automatically converted into phrase structures if needed (Lin, 1995; Xia and Palmer, 2000), although not always with 100% accuracy. The TIGER Treebank of German, a free word order language, with 50,000 sentences, is an example of a treebank with both phrase structure and dependency annotations (Brants et al., 2002).

The aim of this paper is to answer the following questions about the current state-of-art in dependency treebanking:

• What kinds of texts do the treebanks consist of?

• What types of annotation schemes and formats are applied?

• What kinds of annotation methods and tools are used for creating the treebanks?

• What kinds of functions do the annotation tools for creating the treebanks have?

We start by introducing the existing dependency-based treebanks (Section 2). In Section 3, the status and state-of-art in dependency treebanking is summarized and analyzed. Finally, in Section 4, we conclude the findings.

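The dependency structures discussed above can be made concrete with a small sketch. The following Python fragment is an illustration only; the example sentence, the relation labels and the head-pointer encoding are invented for this paper, not taken from any of the treebanks discussed. It stores an analysis as (word, head index, relation) triples and checks that the directed binary relations actually form a tree:

```python
# A dependency analysis as a list of (word, head index, relation) triples.
# Indices are 1-based; head 0 marks the root. Labels are purely illustrative.
sentence = [
    ("The",    2, "det"),   # 1
    ("flight", 4, "subj"),  # 2
    ("was",    4, "aux"),   # 3
    ("booked", 0, "root"),  # 4
]

def is_tree(tokens):
    """Check the single-root and acyclicity conditions of a dependency tree."""
    roots = [i for i, (_, head, _) in enumerate(tokens, 1) if head == 0]
    if len(roots) != 1:
        return False
    for i in range(1, len(tokens) + 1):
        seen, node = set(), i
        while node != 0:          # follow heads up to the root
            if node in seen:      # a cycle: not a tree
                return False
            seen.add(node)
            node = tokens[node - 1][1]
    return True

print(is_tree(sentence))  # True
```

Checks of this kind (exactly one root, no cycles) are examples of the well-formedness conditions that an annotation tool can enforce automatically.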
S. Werner (ed.), Proceedings of the 15th NODALIDA conference, Joensuu 2005, Ling@JoY 1, 2006, pp. 94–104. ISBN 952-458-771-8, ISSN 1796-1114. http://ling.joensuu.fi/lingjoy/ © by the author

2 Existing dependency treebanks

2.1 Introduction

Several kinds of resources and tools are needed for constructing a treebank: annotation guidelines state the conventions that guide the annotators throughout their work, a software tool is needed to aid the annotation work, and in the case of semi-automated treebank construction, a part-of-speech (POS) tagger, morphological analyzer and/or a syntactic parser are also needed. Building trees manually is a very slow and error-prone process. The most commonly used method for developing a treebank is a combination of automatic and manual processing, but the practical method of implementation varies considerably. There are some treebanks that have been annotated completely manually, but with taggers and parsers available to automate some of the work, such a method is rarely employed in state-of-the-art treebanking.

2.2 The treebanks

2.2.1 Prague Dependency Treebank

The largest of the existing dependency treebanks (around 90,000 sentences), the Prague Dependency Treebank for Czech, is annotated in a layered structure consisting of three levels: morphological, analytical, and tectogrammatical (Böhmová et al., 2003). The data consist of newspaper articles on diverse topics (e.g. politics, sports, culture) and texts from popular science magazines, selected from the Czech National Corpus. There are 3,030 morphological tags in the morphological tagset (Hajič, 1998). The syntactic annotation comprises 23 dependency types.

The annotation for the levels was done separately, by different groups of annotators. The morphological tagging was performed by two human annotators selecting the appropriate tag from a list proposed by a tagging system. A third annotator then resolved any differences between the two annotations. The syntactic annotation was at first done completely manually, aided only by the ambiguous morphological tags and a graphical user interface. Later, some functions for automatically assigning part of the tags were implemented. After some 19,000 sentences were annotated, the Collins lexicalized stochastic parser (Nelleke et al., 1999) was trained with the data, and was capable of assigning 80% of the dependencies correctly. At that stage, the work of the annotator changed from building the trees from scratch to checking and correcting the parses assigned by the parser, except for the analytical functions, which still had to be assigned manually. The details related to the tectogrammatical level are omitted here. Figure 1 illustrates an example of the morphological and analytical levels of annotation.

There are other treebank projects using the framework developed for the Prague Dependency Treebank. The Prague Arabic Dependency Treebank (Hajič et al., 2004), consisting of around 49,000 tokens of newswire texts from the Arabic Gigaword and the Penn Arabic Treebank, is a treebank of Modern Standard Arabic. The Slovene Dependency Treebank consists of around 500 annotated sentences obtained from the MULTEXT-East Corpus (Erjavec, 2005b; Erjavec, 2005a).

2.2.2 TIGER Treebank

The TIGER Treebank of German (Brants et al., 2002) was developed based on the NEGRA Corpus (Skut et al., 1998) and consists of complete articles covering diverse topics collected from a German newspaper. The treebank has around 50,000 sentences. The syntactic annotation, combining both phrase-structure and dependency representations, is organized as follows: phrase categories are marked in the non-terminals, POS information in the terminals and syntactic functions in the edges. The syntactic annotation is rather simple and flat in order to reduce the amount of attachment ambiguities. An interesting feature of the treebank is that a MySQL database is used for storing the annotations, from which they can be exported into the NEGRA Export and TIGER-XML file formats, which makes the treebank usable and exchangeable with a range of tools.

The annotation tool Annotate, with two methods, interactive and Lexical-Functional Grammar (LFG) parsing, was employed in creating the treebank. LFG parsing is a typical semi-automated annotation method, comprising the processing of the input texts by a parser and a human annotator disambiguating and correcting the output.



Figure 1: A morphologically and analytically annotated sentence from the Prague Dependency Treebank (Böhmová et al., 2003).

In the case of the TIGER Treebank, a broad-coverage LFG parser is used, producing the constituent and functional structures for the sentences. As almost every sentence is left with unresolved ambiguities, a human annotator is needed to select the correct ones from the set of possible parses. As each sentence of the corpus has several thousands of possible LFG representations, a mechanism for automatically reducing the number of parses is applied, dropping the number of parses presented to the human annotator to 17 on average. Interactive annotation is also a type of semi-automated annotation, but in contrast to human post-editing, the method makes the parser and the annotator interact. First, the parser annotates a small part of the sentence and the annotator either accepts or rejects it based on visual inspection. The process is repeated until the sentence is annotated completely.

2.2.3 Arboretum, L'Arboratoire, Arborest and Floresta Sintá(c)tica

Arboretum of Danish (Bick, 2003), L'Arboratoire of French and Floresta Sintá(c)tica of Portuguese (Afonso et al., 2002), and Arborest of Estonian (Bick et al., 2005) are "sibling" treebanks, Arboretum being the "oldest sister". The treebanks are hybrids with both constituent and dependency annotation organized into two separate levels. The levels share the same morphological tagset. The dependency annotation is based on Constraint Grammar (CG) (Karlsson, 1990) and consists of 28 dependency types. For creating each of the four treebanks, a CG-based morphological analyzer and parser was applied. The annotation process consisted of CG parsing of the texts followed by conversion to constituent format, and manual checking of the structures.

Arboretum has around 21,600 sentences annotated with dependency tags, and of those, 12,000 sentences have also been marked with constituent structures (Bick, 2003; Bick, 2005). The annotation is in both TIGER-XML and PENN export formats. Floresta Sintá(c)tica consists of around 9,500 manually checked (version 6.8, October 15th, 2005) and around 41,000 fully automatically annotated sentences obtained from a corpus of newspaper Portuguese (Afonso et al., 2002). Arborest of Estonian consists of 149 sentences from newspaper articles (Bick et al., 2005). The morphosyntactic and CG-based surface syntactic annotations are obtained from an existing corpus, which is converted semi-automatically to the Arboretum-style format.

2.2.4 The Dependency Treebank for Russian

The Dependency Treebank for Russian is based on the Uppsala University Corpus (Lönngren, 1993). The texts are collected from contemporary Russian prose, newspapers, and magazines (Boguslavsky et al., 2000; Boguslavsky et al., 2002). The treebank has about 12,000 annotated sentences. The annotation scheme is XML-based and compatible with Text Encoding for Interchange (TEI), except for some added elements. It consists of 78 syntactic relations, divided into six subgroups, such as attributive, quantitative, and coordinative. The annotation is layered, in the sense that the levels of annotation are independent and can be extracted or processed independently.

The creation of the treebank started with processing the texts with a morphological analyzer and a syntactic parser, ETAP (Apresjan et al., 1992), and was followed by post-editing by human annotators. Two tools are available for the annotator: a sentence boundary markup tool and a post-editor. The post-editor offers the annotator functions for building, editing, and managing the annotations. The editor has a special split-and-run mode, used when the parser fails to produce a parse or creates a parse with a high number of errors. In this mode the user can pre-chunk the sentence into smaller pieces to be input to the parser. The parsed chunks can be linked by the annotator, thus producing a full parse for the sentence. The tool also provides the annotator with the possibility of marking the annotation of any word or sentence as doubtful, as a reminder of the need for a later revision.

2.2.5 Alpino

The Alpino Treebank of Dutch, consisting of 6,000 sentences, is targeted mainly at parser evaluation and comprises newspaper articles (van der Beek et al., 2002). The annotation scheme is taken from the CGN Corpus of spoken Dutch (Oostdijk, 2000) and the annotation guidelines are based on the TIGER Treebank's guidelines.

The annotation process in the Alpino Treebank starts with applying a parser based on Head-Driven Phrase Structure Grammar (HPSG) (Pollard and Sag, 1994) and is followed by a manual selection of the correct parse trees. An interactive lexical analyzer and a constituent marker tool are employed to restrict the number of possible parses. The interactive lexical analyzer tool lets the user mark each word in a sentence as belonging to the correct, good, or bad category. 'Correct' denotes that the parse includes the lexical entry in question, 'good' that the parse may include the entry, and 'bad' that the entry is incorrect. The parser uses this manually reduced set of entries, thus generating a smaller set of possible parses. With the constituent marker tool, the annotator can mark constituents and their types in sentences, thus aiding the parser.

The selection of the correct parse is done with the help of a parse selection tool, which calculates maximal discriminants to help the annotator. There are three types of discriminants: maximal discriminants are sets of shortest dependency paths encoding differences between parses, lexical discriminants represent ambiguities resulting from the lexical analysis, and constituent discriminants group words into constituents without specifying the type of the constituent. The annotator marks each of the maximal discriminants as good or bad, and the tool narrows down the number of possible parses based on this information. If the parse resulting from the selection is not correct, it can be edited with a parse editor tool.

2.2.6 The Danish Dependency Treebank

The annotation of the Danish Dependency Treebank is based on Discontinuous Grammar, which is a formalism closely related to Word Grammar (Kromann, 2003). The treebank consists of 5,540 sentences covering a wide range of topics. The morphosyntactic annotation is obtained from the PAROLE Corpus (Keson and Norling-Christensen, 2005); thus no morphological analyzer or POS tagger is applied. The dependency links are marked manually using a command-line interface with a graphical parse view. A parser for automatically assigning the dependency links is under development.

2.2.7 METU-Sabanci Turkish Treebank

The morphologically and syntactically annotated Turkish Treebank consists of 5,000 sentences obtained from the METU Turkish Corpus (Atalay et al., 2003), covering 16 main genres of present-day written Turkish (Oflazer et al., 2003). The annotation is presented in a format that is in conformance with the XML-based Corpus Encoding Standard (XCES) format (Anne and Romary, 2003). Due to the morphological complexity of Turkish, morphological information is not encoded with a fixed set of tags, but as sequences of inflectional groups (IGs). An IG is a sequence of inflectional morphemes, divided by derivation boundaries. The dependencies between IGs are annotated with the following 10 link types: subject, object, modifier, possessor, classifier, determiner, dative adjunct, locative adjunct, ablative adjunct, and instrumental adjunct. Figure 2 illustrates a sample annotated sentence from the treebank.

Figure 2: A sample sentence from the METU-Sabanci Treebank (Oflazer et al., 2003).

The annotation, directed by the guidelines, is done in a semi-automated fashion, although a relatively large amount of manual work remains. First, a morphological analyzer based on the two-level morphology model (Oflazer, 1994) is applied to the texts. The morphologically analyzed and preprocessed text is input to an annotation tool. The tagging process requires two steps: morphological disambiguation and dependency tagging. The annotator selects the correct tag from the list of tags proposed by the morphological analyzer. After the whole sentence has been disambiguated, the dependency links are specified manually. The annotators can also add notes and modify the list of dependency link types.

2.2.8 The Basque Dependency Treebank

The Basque Dependency Treebank (Aduriz et al., 2003) consists of 3,000 manually annotated sentences from newspaper articles. The syntactic tags are organized as a hierarchy. The annotation is done with the aid of an annotation tool with tree visualization and automatic tag syntax checking capabilities.

2.2.9 The Turin University Treebank

The Turin University Treebank for Italian, consisting of 1,500 sentences, is divided into four sub-corpora (Lesmo et al., 2002; Bosco, 2000; Bosco and Lombardo, 2003). The majority of the texts are from the civil law code and newspaper articles. The annotation format is based on the Augmented Relational Structure (ARS). The POS tagset consists of 16 categories and 51 subcategories. There are around 200 dependency types, organized as a taxonomy of five levels. The scheme provides the annotator with the possibility of marking a relation as underspecified if a correct relation type cannot be determined.

The annotation process consists of automatic tokenization, morphological analysis and POS disambiguation, followed by syntactic parsing (Lesmo et al., 2002). The annotator can interact with the parser through a graphical interface, in a similar way to the interactive method in the TIGER Treebank. The annotator can either accept or reject the suggested tags for each word in the sentence, after which the parser proceeds to the next word (Bosco, 2000).

2.2.10 The Dependency Treebank of English

The Dependency Treebank of English consists of dialogues between a travel agent and customers (Rambow et al., 2002), and is the only dependency treebank with spoken language annotation. The treebank has about 13,000 words. The annotation is a direct representation of lexical predicate-argument structure; thus arguments and adjuncts are dependents of their predicates and all function words are attached to their lexical heads. The annotation is done at a single, syntactic level, without a separate representation for surface syntax, the aim being to keep the annotation process as simple as possible. Figure 3 shows an example of an annotated sentence.

The trained annotators have access to an on-line manual and work off the transcribed speech without access to the speech files. The dialogs are parsed with a dependency parser, the Supertagger and Lightweight Dependency Analyzer (Bangalore and Joshi, 1999). The annotators correct the output of the parser using a graphical tool, the one developed by the Prague Dependency Treebank project. In addition to the standard tag editing options, the annotators can add comments. After the editing is done, the sentence is automatically checked for inconsistencies, such as differences between surface and deep roles or prepositions missing objects, etc.

2.2.11 DEPBANK

As the name suggests, the PARC 700 Dependency Bank (DEPBANK) (King et al., 2003) consists of 700 annotated sentences from the Penn Wall Street Journal Treebank (Marcus et al., 1994). There are 19 grammatical relation types (e.g. subject, object, modifier) and 37 feature types (e.g. number (pl/sg), passive (+/-), tense (future/past/present)) in the annotation scheme.


Figure 3: The sentence "The flight will have been booked" from the English treebank (Rambow et al., 2002). The words are marked with the word form (first line), the POS (second line), and the surface role (third line). In addition, the node 'flight' is marked with a deep role (DRole) and the root node as passive in the FRR feature, which is not set in any other nodes.

The annotation process is semi-automatic, consisting of parsing by a broad-coverage LFG parser, converting the parses to the DEPBANK format and manually checking and correcting the resulting structures. The annotations are checked by a tool that looks e.g. for the correctness of header information and the syntax of the annotation, and for inconsistencies in feature names. The checking tool helps in two different ways: first, when the annotator makes corrections to the parsed structure, it makes sure that no errors were added, and second, the tool can detect erroneous parses and point them out to the annotator.

3 Analysis

Table 1 summarizes some key properties of the existing dependency treebanks. The size of the treebanks is usually quite limited, ranging from a few hundred to 90,000 sentences. This is partly due to the fact that even the most long-lived of the dependency treebank projects, the Prague Dependency Treebank, was started less than 10 years ago. The treebank producers have in most cases aimed at creating a multi-purpose resource for evaluating and developing NLP systems and for studies in theoretical linguistics. Some are built for specific purposes; e.g. the Alpino Treebank of Dutch is mainly for parser evaluation. Most of the dependency treebanks consist of written text; to our knowledge there is only one that is based on a collection of spoken utterances. The written texts are most commonly obtained from newspaper articles, and in the cases of e.g. the Czech, German, Russian, Turkish, Danish, and Dutch treebanks from an existing corpus. Annotation usually consists of POS and morphological levels accompanied by dependency-based syntactic annotation. In the case of the Prague Dependency Treebank a higher, semantic layer of annotation is also included.

The definition of the annotation schema is always a trade-off between the accuracy of the representation, data coverage and the cost of treebank development (Bosco, 2000; Bosco and Lombardo, 2003). The selection of the tagsets for annotation is critical. Using a large variety of tags provides high accuracy and specialization in the description, but makes the annotators' work even more time-consuming. In addition, for some applications, such as the training of statistical parsers, highly specific annotation easily leads to a sparsity problem. On the other hand, if annotation is done at a highly general level, the annotation process is faster, but naturally a lot of information is lost. The TUT and Basque treebanks try to tackle the problem by organizing the set of grammatical relations into a hierarchical taxonomy. Also the choice of the type of application for the treebank may affect the annotation choices. A treebank for evaluation allows for some remaining ambiguities but no errors, while the opposite may be true for a treebank for training (Abeillé, 2003). In annotation consisting of multiple levels, a clear separation between the levels is a concern. The format of the annotation is also directed by the specific language that the treebank is being developed for. The format must be suited for representing the structures of the language.


Table 1: Comparison of dependency treebanks. (*Due to the limited number of pages not all the treebanks in the Arboretum "family" are included in the table. **Information on the number of utterances was not available. M=manual, SA=semi-automatic, TB=treebank)

Name | Language | Genre | Size (sent.) | Annotation methods | Autom. tools | Supported formats
Prague Dep. TB | Czech | Newsp., science mag. | 90,000 | M/SA | Lexicalized stochastic parser (Collins) | FS, CSTS SGML, Annotation Graphs XML
TIGER TB | German | Newsp. | 50,000 | Post-editing & interactive | Probabilistic/LFG parser | TIGER-XML & NEGRA export
Arboretum & co.* | 4 lang. | Mostly newsp. | 21,600 (Ar), 9,500 (Flo) | Dep. to const. conversion, M checking | CG-based parser for each language | TIGER-XML & PENN export (Ar.)
Dep. TB for Russian | Russian | Fiction, newsp., scientific | 12,000 | SA | Morph. analyzer & a parser | XML-based, TEI-compatible
Alpino | Dutch | Newsp. | 6,000 | M disambig. aided by parse selection tool | HPSG-based Alpino parser | Own XML-based
Danish Dep. TB | Danish | Range of topics & genres | 5,540 | Morphosyn. annotation obtained from a corpus, M dep. marking | - | PAROLE-DK with additions, TIGER-XML
METU-Sabanci TB | Turkish | 16 genres | 5,000 | M disambiguation & M dependency marking | Morph. analyzer based on XEROX FST | XML-based, XCES-compatible
Basque TB | Basque | Newsp. | 3,000 | M, automatic checking | - | XML-based, TEI-compatible
TUT | Italian | Mainly newsp. & civil law | 1,500 | M checking of parser & morph. analyzer output | Morph. analyzer, rule-based tagger and a parser | Own ASCII-based
Dep. TB of English | English | Spoken, travel agent dial. | 13,000 words** | M correction of parser output & autom. checking of inconsistencies | Supertagger & Lightweight Dep. Analyzer | FS
DEPBANK | English | Financial newsp. | 700 | M checking & correction, autom. consistency checking | LFG parser, checking tool | Own ASCII-based

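Many of the formats listed in Table 1 are XML-based. As a rough sketch of what such an encoding involves (the element and attribute names below are invented for this illustration; they are not those of TIGER-XML, XCES or any treebank discussed above):

```python
import xml.etree.ElementTree as ET

# One annotated sentence as (id, form, head id, relation) records;
# the tokens and labels are invented for the example.
tokens = [(1, "The", 2, "det"), (2, "flight", 3, "subj"), (3, "departs", 0, "root")]

# Build a minimal XML encoding of the dependency analysis.
sent = ET.Element("sentence", id="s1")
for tid, form, head, rel in tokens:
    ET.SubElement(sent, "token", id=str(tid), form=form,
                  head=str(head), rel=rel)

xml_text = ET.tostring(sent, encoding="unicode")
print(xml_text)
```

Because such an encoding is plain XML, any validating parser can reject a malformed file before treebank-specific consistency checks are run, which is one reason generic XML formats ease tool reuse.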
For example, in the METU-Sabanci Treebank a special type of morphological annotation scheme was introduced due to the complexity of Turkish morphology.

Semi-automated creation combining parsing and human checking is the state-of-art annotation method. None of the dependency treebanks has been created completely manually; at least an annotation tool capable of visualizing the structures is used by each of the projects. Obviously, the reason that there are no fully automatically created dependency treebanks is the fact that there are no parsers of free text capable of producing error-free parses.

The most common way of combining human and machine labor is to let the human work as a post-checker of the parser's output. Albeit the most straightforward to implement, the method has some pitfalls. First, starting annotation with parsing can lead to a high number of unresolved ambiguities, making the selection of the correct parse a time-consuming task. Thus, a parser applied for treebank building should perform at least some disambiguation to ease the burden of the annotators. Second, the work of a post-checker is mechanical and there is a risk that the checker just accepts the parser's suggestions without a rigorous inspection. A solution followed e.g. by both treebanks for English and the Basque treebank is to apply a post-checking tool to the created structures before accepting them. Some variants of semi-automated annotation exist: the TIGER, TUT, Alpino, and Russian treebanks apply a method where the parser and the annotator can interact. The advantage of the method is that when the errors made by the parser are corrected by the human at the lower levels, they do not multiply into the higher levels, thus making it more probable that the parser produces a correct parse. In some annotation tools, such as those of the Russian and English Dependency treebanks, the annotator is provided with the possibility of adding comments to the annotation, easing the further inspection of doubtful structures. In the annotation tool of the TUT Treebank, a special type of relation can be assigned to mark doubtful annotations.

Although more collaboration has emerged between treebank projects in recent years, the main problem with current treebanks in regard to their use and distribution is the fact that instead of reusing existing formats, new ones have been developed. Furthermore, the schemes have often been designed from theory- and even application-specific viewpoints, and consequently undermine the possibility of reuse. Considering the high costs of treebank development (for example, in the case of the Prague Dependency Treebank an estimated USD 600,000 (Böhmová et al., 2003)), the reusability of tools and formats should have a high priority. In addition to the difficulties of reuse, creating a treebank-specific representation format requires developing a new set of tools for creating, maintaining and searching the treebank. Yet, the existence of exchange formats such as XCES (Anne and Romary, 2003) and TIGER-XML (Mengel and Lezius, 2000) would allow multipurpose tools to be created and used.

4 Conclusion

We have introduced the state-of-art in dependency treebanking and discussed the main characteristics of current treebanks. The findings reported in the paper will be used in designing and constructing an annotation tool for dependency treebanks and in constructing a treebank for Finnish for syntactic parser evaluation purposes. The choice of the dependency format for a treebank for evaluating syntactic parsers of Finnish is self-evident, Finnish being a language with a relatively free word order and all parsers for the language working in the dependency framework. The annotation format will be one of the existing XML-based formats, allowing existing tools to be applied for searching and editing the treebank.

The findings reported in this paper indicate that the following key properties must be implemented in the annotation tool for creating the treebank for Finnish:

• An interface to a morphological analyzer and parser for constructing the initial trees. Several parsers can be applied in parallel to offer the annotator the possibility to compare the outputs.

• Support for an existing XML annotation format. Using an existing format will make the system more reusable. XML-based formats offer good syntax-checking capabilities.


sentences to be saved will be checked for errors in tags and annotation format. In addition to XML-based validation of the syntax of the annotation, the inconsistency checker will inform the annotator about several other types of mistakes. The POS and morphological tags will be checked to find any mismatching combinations. A missing main verb, a fragmented or incomplete parse etc. will be indicated to the user.

• A comment tool. The annotator will be able to add comments to the annotations to aid later revision.

• Menu-based tagging. In order to minimize errors, instead of typing the tags, the annotator will only be able to set tags by selecting them from predefined lists.

5 Acknowledgments

The research reported in this work has been supported by the European Union under a Marie Curie Host Fellowship for Early Stage Researchers Training at MULTILINGUA, Bergen, Norway and the MirrorWolf project funded by the National Technology Agency of Finland (TEKES).

References

Anne Abeillé. 2003. Introduction. In Anne Abeillé, editor, Treebanks: Building and Using Syntactically Annotated Corpora. Kluwer Academic Publishers, Dordrecht, The Netherlands.

Itziar Aduriz et al. 2003. Construction of a Basque dependency treebank. In Proceedings of the 2nd Workshop on Treebanks and Linguistic Theories, Växjö, Sweden.

Susana Afonso, Eckhard Bick, Renato Haber, and Diana Santos. 2002. "Floresta Sintá(c)tica": a treebank for Portuguese. In Proceedings of the 3rd International Conference on Language Resources and Evaluation, Las Palmas, Gran Canaria, Spain.

Nancy Ide and Laurent Romary. 2003. Encoding syntactic annotation. In Anne Abeillé, editor, Treebanks: Building and Using Syntactically Annotated Corpora. Kluwer Academic Publishers, Dordrecht, The Netherlands.

Jurij Apresjan, Igor Boguslavsky, Leonid Iomdin, Alexandre Lazursku, Vladimir Sannikov, and Leonid Tsinman. 1992. ETAP-2: the linguistics of a machine translation system. META, 37(1):97–112.

Nart B. Atalay, Kemal Oflazer, and Bilge Say. 2003. The annotation process in the Turkish treebank. In Proceedings of the EACL Workshop on Linguistically Interpreted Corpora, Budapest, Hungary.

Srinivas Bangalore and Aravind K. Joshi. 1999. Supertagging: An approach to almost parsing. Computational Linguistics, 25(2):237–265.

Eckhard Bick, Heli Uibo, and Kaili Müürisep. 2005. Arborest - a growing treebank of Estonian. In Henrik Holmboe, editor, Nordisk Sprogteknologi 2005. Nordic Language Technology 2005. Museum Tusculanums Forlag, Copenhagen, Denmark.

Eckhard Bick. 2003. Arboretum, a hybrid treebank for Danish. In Proceedings of the 2nd Workshop on Treebanks and Linguistic Theories, Växjö, Sweden.

Eckhard Bick. 2005. Current status of Arboretum treebank. Electronic mail communication, October 12th, 2005.

Igor Boguslavsky, Svetlana Grigorieva, Nikolai Grigoriev, Leonid Kreidlin, and Nadezhda Frid. 2000. Dependency treebank for Russian: Concept, tools, types of information. In Proceedings of the 18th International Conference on Computational Linguistics, Saarbrücken, Germany.

Igor Boguslavsky, Ivan Chardin, Svetlana Grigorieva, Nikolai Grigoriev, Leonid Iomdin, Leonid Kreidlin, and Nadezhda Frid. 2002. Development of a dependency treebank for Russian and its possible applications in NLP. In Proceedings of the 3rd International Conference on Language Resources and Evaluation, Las Palmas, Gran Canaria, Spain.

Alena Böhmová, Jan Hajič, Eva Hajičová, and Barbora Hladká. 2003. The Prague dependency treebank: Three-level annotation scenario. In Anne Abeillé, editor, Treebanks: Building and Using Syntactically Annotated Corpora. Kluwer Academic Publishers, Dordrecht, The Netherlands.

Cristina Bosco and Vincenzo Lombardo. 2003. A relation-based schema for treebank annotation. In Proceedings of the Advances in Artificial Intelligence, 8th Congress of the Italian Association for Artificial Intelligence, Pisa, Italy.


Cristina Bosco. 2000. An annotation schema for an Italian treebank. In Proceedings of the Student Session, 12th European Summer School in Logic, Language and Information, Birmingham, UK.

Sabine Brants, Stefanie Dipper, Silvia Hansen, Wolfgang Lezius, and George Smith. 2002. The TIGER Treebank. In Proceedings of the Workshop on Treebanks and Linguistic Theories, Sozopol, Bulgaria.

Tomaž Erjavec. 2005a. Current status of the Slovene dependency treebank. Electronic mail communication, September 29th, 2005.

Tomaž Erjavec. 2005b. Slovene dependency treebank. http://nl.ijs.si/sdt/. (Accessed November 9th, 2005).

Jan Hajič, Otakar Smrž, Petr Zemánek, Jan Šnaidauf, and Emanuel Beška. 2004. Prague Arabic dependency treebank: Development in data and tools. In Proceedings of the NEMLAR International Conference on Arabic Language Resources and Tools, Cairo, Egypt.

Jan Hajič. 1998. Building a syntactically annotated corpus: The Prague dependency treebank. In Eva Hajičová, editor, Issues of Valency and Meaning. Studies in Honor of Jarmila Panevová, pages 12–19. Charles University Press, Prague, Czech Republic.

Fred Karlsson. 1990. Constraint grammar as a framework for parsing running text. In Proceedings of the 13th Conference on Computational Linguistics - Volume 3, Helsinki, Finland.

Britt-Katrin Keson and Ole Norling-Christensen. 2005. PAROLE-DK. Det Danske Sprog- og Litteraturselskab. http://korpus.dsl.dk/e-resurser/parole-korpus.php. (Accessed October 7th, 2005).

Tracy H. King, Richard Crouch, Stefan Riezler, Mary Dalrymple, and Ronald M. Kaplan. 2003. The PARC 700 dependency bank. In Proceedings of the 4th International Workshop on Linguistically Interpreted Corpora, the 10th Conference of the EACL, Budapest, Hungary.

Matthias T. Kromann. 2003. The Danish dependency treebank and the DTAG treebank tool. In Proceedings of the 2nd Workshop on Treebanks and Linguistic Theories, Växjö, Sweden.

Leonardo Lesmo, Vincenzo Lombardo, and Cristina Bosco. 2002. Treebank development: the TUT approach. In Proceedings of the International Conference on Natural Language Processing, Mumbai, India.

Dekang Lin. 1995. A dependency-based method for evaluating broad-coverage parsers. In The 1995 International Joint Conference on AI, Montreal, Canada.

Lennart Lönngren. 1993. Chastotnyj slovar' sovremennogo russkogo jazyka. (A frequency dictionary of modern Russian.) In Acta Universitatis Upsaliensis, Studia Slavica Upsaliensia 32, Uppsala, Sweden.

Mitchell Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.

Mitchell Marcus, Grace Kim, Mary Ann Marcinkiewicz, Robert MacIntyre, Ann Bies, Mark Ferguson, Karen Katz, and Britta Schasberger. 1994. The Penn treebank: Annotating predicate argument structure. In Proceedings of the ARPA Human Language Technology Workshop, Princeton, New Jersey, USA.

Andreas Mengel and Wolfgang Lezius. 2000. An XML-based representation format for syntactically annotated corpora. In Proceedings of the International Conference on Language Resources and Evaluation, Athens, Greece.

Michael Collins, Jan Hajič, Eric Brill, Lance Ramshaw, and Christoph Tillmann. 1999. A statistical parser for Czech. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, College Park, Maryland, USA.

Kemal Oflazer, Bilge Say, Dilek Zeynep Hakkani-Tür, and Gökhan Tür. 2003. Building a Turkish treebank. In Anne Abeillé, editor, Treebanks: Building and Using Syntactically Annotated Corpora. Kluwer Academic Publishers, Dordrecht, The Netherlands.

Kemal Oflazer. 1994. Two-level description of Turkish morphology. Literary and Linguistic Computing, 9(2):472–472.

Nelleke Oostdijk. 2000. The Spoken Dutch Corpus: Overview and first evaluation. In Proceedings of the 2nd International Conference on Language Resources and Evaluation, Athens, Greece.

Carl Pollard and Ivan A. Sag. 1994. Head-Driven Phrase Structure Grammar. University of Chicago Press and CSLI Publications, Chicago, Illinois, USA.

Owen Rambow, Cassandre Creswell, Rachel Szekely, Harriet Taber, and Marilyn Walker. 2002. A dependency treebank for English. In Proceedings of the 3rd International Conference on Language Resources and Evaluation, Las Palmas, Gran Canaria, Spain.

Wojciech Skut, Thorsten Brants, Brigitte Krenn, and Hans Uszkoreit. 1998. A linguistically interpreted corpus of German newspaper text. In Proceedings of the 10th European Summer School in Logic, Language and Information, Saarbrücken, Germany.

Lucien Tesnière. 1959. Éléments de syntaxe structurale. Klincksieck, Paris, France.

Leonoor van der Beek, Gosse Bouma, Robert Malouf, and Gertjan van Noord. 2002. The Alpino dependency treebank. In Mariët Theune, Anton Nijholt, and Hendri Hondorp, editors, Computational Linguistics in the Netherlands 2001. Selected Papers from the Twelfth CLIN Meeting. Rodopi, Amsterdam, The Netherlands.

Fei Xia and Martha Palmer. 2000. Converting dependency structures to phrase structures. In Proceedings of the First International Conference on Human Language Technology Research, San Diego, California, USA.
