Teaching Treebanks

Teaching Treebanking by Martin Volk, Sofia Gustafson-Capková and David Hagstrand (Stockholm University) Heli Uibo (Tartu University) 1. Introduction Treebanks have become valuable resources in natural language processing (NLP) in recent years (Abeillé, 2003). A treebank is a collection of syntactically annotated sentences in which the annotation has been manually checked. The name derives from the fact that syntactic descriptions of sentences often come in the form of tree structures, in particular constituent trees. But treebank annotation has also been done in the framework of dependency grammar and recent annotation has also exceeded syntax towards semantic features such as predicate-argument structures or word senses. A treebank can serve as training corpus for natural language parsers, as repository for linguistic research, or as evaluation corpus for NLP systems. In particular it has been shown (Manning and Schütze, 1999) that natural language parsers which are trained on treebanks (rather than being based on hand-crafted rules and preferences) are more robust and thus more successful for practical applications. Treebanks have also become a necessary resource for many research activities in NLP. But while treebanking has proven to be of high importance for NLP, instruction in this field has been largely missing. Students have learned about treebanks within syntax courses or within general courses on natural language processing or on corpus linguistics. But treebanking is a time-consuming and demanding activity and therefore specific training is essential. We therefore organized an intensive course on “Treebanks: Formats, Tools and Usage” for PhD students in Language Technology from the Nordic countries. The course took place at Stockholm University in March 2004 and was attended by more than 20 participants from Denmark, Estonia, Finland, Iceland, Norway, and Sweden. Due to its success it was repeated in a reduced manner for undergraduate students as part of the summer school on “Empirical Methods in NLP” at Tartu University in August 2004. Again the course was attended by more than 20 participants, this time coming from the Baltic countries, Russia and Ukraine. Participants at the Stockholm University treebanking course The courses introduced the processes involved in creating and exploiting treebanks. They gave an overview of the annotation formats in different treebanks (e.g. the English Penn Treebank, the Swedish Treebank SynTag, the German TIGER Treebank, the Danish Dependency Treebank etc.). And they demonstrated the most important tools used for the creation of treebanks (tree editors), for consistency checking in treebanks and for treebank searches. The general goal was to offer the students hands-on experience in treebanking so that they will be able to participate in treebank projects for their own languages or even help initiating such projects. This report documents these two treebank courses. We summarize the topics covered in the lectures and comment on what we find most central. We describe how we set up the practical exercises and give recommendations for future practice sessions. We also share our observations about how to adapt the material from a treebank course on the PhD level to an undergraduate course. We conclude with some remarks on supporting activities that accompanied both courses and an outlook. 2. Course format The PhD course in Stockholm (2 credit points) consisted of reading seven research papers before the course, 15 hours of lectures and 12 (+ 3) hours of computer labs. Students who were willing to work on an individual research project could obtain 5 credit points. The Treebank course in Tartu (1 credit point) included 8 hours of lectures and 8 hours of hands-on sessions. The core part of the lectures was similar in Stockholm and Tartu. The course in Stockholm additionally contained some guest lectures on different treebank-related topics (training of a parser, spoken language treebanks and the Danish Dependency Treebank), while the course in Tartu provided new research results about parallel treebanks. For the Tartu course, three out of the four practical sessions designed for the Stockholm course were used in a reduced format to address the undergraduate level. 3. The Treebank Lectures Our ambition with the lectures in the treebank course was to cover the most important aspects of treebanking, from the sampling and collection, over annotation and search into future enrichments of treebanks. In addition to this general overview of treebank construction, we wanted to provide insights into previous and ongoing work with specific treebanks. Composition of the lecture series To fulfill the aims defined above, we let one set of lectures cover more general aspects, and to satisfy the requirements of more specific aspects, external speakers with experience from specific projects in the treebanking field were invited. The lectures dealing with general aspects were taught by staff from Stockholm University and covered the following topics: Corpus collection and preparation Treebank tools and treebank search Treebank maintenance Treebanks and discourse information The invited speakers' talks covered the topics of: Training a parser for Swedish on a treebank (Beata Megyesi, KTH Stockholm) Spoken language treebanks (Erhard Hinrichs, University of Tübingen). The Danish Dependency Treebank (Matthias Trautner Kromann, Copenhagen Business School) An introduction plus a closing talk with a glance into the future, completed the lecture series. The lectures were scheduled in the mornings while the afternoons were reserved for practical exercises (see section 3, Practical Treebank Exercises). The lectures covering general aspects of treebanking were distributed over four topics and are briefly described below. 1. Corpus collection and preparation In this lecture we covered questions concerning the sampling and balancing of a corpus as well as how requirements on the sampling can differ with regard to the intended purpose. Examples of the requirements of the SUC corpus (Ejerhed et al.) and its sampling dimensions were shown and discussed. In addition the choice of format for a corpus was touched upon. 2. Treebank tools and searching Two lectures were addressing this topic, covering different categories of tools useful in the construction (graph editors and disambiguators, such as e.g. Annotate [Anno]) as well as search tools, such as e.g. TIGERSearch [Tiger], developed for the German TIGER Treebank and TGrep2 developed for search in the Penn Treebank (PTB) [Penn]. 3. Treebank maintenance This lecture addressed the topic of how to maintain the corpus after finishing a first version. It was discussed how to handle bug reports, how to distribute, update and enrich a treebank, and how to keep the consistency of a treebank. The legal issues were also brought up. The close connection between the construction (including the selection of format) and future addition of new information was stressed. 4. Treebanks and discourse information In this lecture we showed how existing treebanks have been enriched with discourse information. Examples were drawn from the RST Discourse Treebank (Marcu et al., 1999), which is annotated with rhetorical relations and the Penn Discourse Treebank (PDTB) (Miltsakaki et al., 2004) annotated with coherence relations from a discourse parser based on DL-TAG. Examples of tools for discourse annotation, such as RST- tool and Wordfreak were given, and the lecture closed with a discussion of different approaches to discourse structure. In addition to the lectures covering treebanking topics in general, the three invited speakers each gave one talk in the area of their own specific experiences with treebanking. Beáta Megyesi from KTH, Stockholm, gave a talk on her experiments on training a shallow parser for Swedish (Megyesi, 2002), focusing on parsing methods and problems related to parsing. Erhard Hinrichs from University of Tübingen presented experiences from his work with the development of the Tübingen Treebank of German Spontaneous Speech (TüBa-D/S) (Stegmann, 2000), and discussed problems specific for spoken language and dialogue. And Matthias T. Kromann, CBS Copenhagen, gave a talk on his work with the Danish Dependency Treebank (Kromann, 2003), which consists of a part of the Danish PAROLE corpus analyzed with the dependency-based formalism Discontinuous Grammar. Most weight was put on the discussion of the analysis and formalism. The invited speakers, showing practical examples from their own work with treebank development constituted a natural link between the more theoretical underpinnings in the rest of the lectures and the students’ hands-on work during the practical sessions. The reactions of the course participants indicated that the selection of lecture topics for the course was rather well picked. Of course there are topics that could be considered as more “core treebanking”, such as e.g. the actual annotation of trees including the grammatical formalism and the parsing and annotation tools. However, in constructing a treebank, the quality and usability is dependent also on insights gained in the field of corpus construction in general. That is why the topics, such as e.g. sampling, formats and legal issues, have their given place also in a course directed specifically towards treebanking. 4. Practical Treebank Exercises This section describes the different exercises designed for the course, along with descriptions of what was required to set them up in terms

Load more