The Tibidabo Treebank
El treebank Tibidabo

Montserrat Marimon
Universitat de Barcelona
Gran Via de les Corts Catalanes 585, 08007-Barcelona
[email protected]

Abstract: This paper describes work in progress on the creation of a new open-source resource for Spanish: an HPSG-based treebank called Tibidabo. The annotation is performed semi-automatically. First, the corpus is automatically annotated by a symbolic HPSG-based grammar for Spanish implemented on the Linguistic Knowledge Builder system; then, the output is manually disambiguated. The existence of the Tibidabo treebank will facilitate research into the development and evaluation of a hybrid architecture combining symbolic and stochastic approaches to NLP, as well as investigations oriented to the hybridization of shallow-deep techniques for NLP.

Keywords: Treebank, Spanish, HPSG.

1 Introduction

Linguistically interpreted natural language texts constitute a crucial resource both for theoretical linguistic investigations (about language use, for instance) and for practical NLP purposes, such as the acquisition and evaluation of parsing systems. Thus, in recent years, there has been an increasing interest in the construction of treebanks and, nowadays, both theory-neutral and theory-grounded treebanks have been developed for a great variety of languages. Some of these treebanks are presented in (Hinrichs and Simov, 2004).

This paper describes ongoing work on the creation of a new resource for Spanish, an HPSG-based treebank called Tibidabo.

The annotation is carried out in two steps. First, the corpus is automatically annotated with the Spanish Resource Grammar, a multi-purpose large-coverage HPSG-based grammar for Spanish implemented on the Linguistic Knowledge Builder system. Then, ambiguous outputs are manually disambiguated.

The goal of the work we present is twofold: (i) to create the training data which will allow us to build up a parse selection model over the hand-built grammar following (Toutanova et al., 2005), and (ii) since language resource development and evaluation go hand in hand, to create the gold standard to evaluate the hybrid system. The Tibidabo treebank, in addition, will enable us to evaluate foreseen investigations oriented to the hybridization of shallow-deep techniques for NLP aiming at efficient, robust, and accurate rule-based parsing.

The primary objective of our research work is to produce open-source reusable resources both for theoretical linguistic investigations and for application-oriented NLP. The Tibidabo treebank is open-source and, together with the Spanish Resource Grammar, is part of the DELPH-IN open-source repository of linguistic resources, which also includes HPSG-based grammars for a wide variety of languages, including English (Flickinger, 2002), German (Crysmann, 2005), Japanese (Siegel and Bender, 2002), French, Korean (Kim and Yangs, 2003), modern Greek (Kordoni and Neu, 2005), Norwegian (Hellan and Haugereid, 2005), and Portuguese (Branco and Costa, 2008), as well as HPSG-based treebanks, such as the LinGO Redwoods for English (Oepen et al., 2004) and Hinoki for Japanese (Hashimoto, Bond, and Siegel, 2007).[1]

The paper is organized as follows. First, in the following three sections, we present the main features of the system we use to annotate the corpus automatically (the architecture, the development environment and theoretical background, and the linguistic components and coverage) and the disambiguation process, and we show the representation of derivations produced by the grammar. Then, in section 5, we present the corpus which is the basis of the treebank and show some figures about the treebank. Finally, in section 6, we conclude this paper with a summary and some directions for future work.

[1] See http://www.delph-in.net/.

2 Treebank annotation

We have developed a hybrid architecture for the automatic processing of Spanish, shown in Figure 1, which integrates a shallow processing tool (FreeLing) into an HPSG-based grammar implemented on the LKB system (the Spanish Resource Grammar).

The advantage of our hybrid architecture is that it allows us to release the parser from certain tasks (namely, morphological analysis and the recognition and classification of special text expressions that have been considered peripheral to the lexicon, e.g. numbers, dates, percentages, currencies, proper names, etc.) that may be robustly, efficiently, and reliably dealt with by shallow external components, thus making the whole system more adequate to deal with real world text.

Figure 1: System architecture.

2.1 The FreeLing Tool

Before parsing input sentences with the LKB system, raw text is pre-processed by FreeLing, an open-source language analysis tool suite performing shallow processing functionalities ranging from text tokenization to dependency parsing (Atserias et al., 2006).[2]

Our system integrates the morpho processing module of FreeLing, which receives a sentence and morphologically annotates each word. This module applies a cascade of specialized processors that includes:

• Punctuation symbol annotator.
• Multi-word recognizer.
• Numerical expression recognizer.
• Date/time expression recognizer.
• Ratio and percentage expression and monetary amount recognizer.
• Proper noun recognizer. A fast and simple pattern-matching module based on capitalization which yields an accuracy near 90%.[3]
• Dictionary look-up and affixes handler.
• Lexical probabilities annotator and unknown word handler.

Our system plugs the FreeLing tool into the system by means of the LKB Simple PreProcessor Protocol (SPPP), which assumes that a preprocessor runs as an external process to the LKB system and communicates with its caller through its standard input and output channels.[4]

The SPPP, therefore, integrates into the parsing process the PoS tags and lemmata of each word in the sentence.

FreeLing and the Spanish Resource Grammar are two independently developed components and show some discrepancies in the tagset and the lexical categories. We also use this model to handle them.

[2] The FreeLing toolkit may be downloaded from: http://www.lsi.upc.edu/~nlp/freeling.
[3] FreeLing also provides a NE recognizer based on the CoNLL-2002 shared task winning system (Carreras, Márquez, and Padró, 2002) with higher accuracy (over 92%) but which is rather slower.

2.2 The Spanish Resource Grammar

The Spanish Resource Grammar (SRG) is designed as multi-purpose (abstracted away from any particular application) and broad-coverage (aiming to cover not only all variations of the phenomena that have been implemented, but also the combinations of different phenomena).

2.2.1 Development environment and theoretical background

The grammar is implemented on the Linguistic Knowledge Builder (LKB) system, an interactive grammar development environment for typed feature structure grammars (Copestake, 2002), based on the basic components of the LinGO Grammar Matrix, an open-source starter-kit for rapid development of precision broad-coverage grammars compatible with the LKB system (Bender and Flickinger, 2005).

The SRG is grounded in the theoretical framework of Head-driven Phrase Structure Grammar (HPSG) (Pollard and Sag, 1994), a constraint-based lexicalist approach to grammatical theory where all linguistic objects (i.e. words and phrases) are represented as typed feature structures, and uses Minimal Recursion Semantics (MRS) (Copestake et al., 2006) for the semantic representation. MRS is not a semantic theory in itself, but a kind of meta-level which has been defined for describing semantic structures.

[...] rules, a lexicon, and syntactic rules.

The inflectional rules. The inflectional rules in the LKB system perform the morphological analysis of the words in the input sentences.

Since we use an external morphological analyzer, the SRG does not need a morphology component; instead, we use the inflectional rule component to propagate the morpho-syntactic information associated to full-forms, in the form of PoS tags, to the morpho-syntactic features of the lexical items. We have defined as many rules as tags we have in the FreeLing tagset.

The lexicon. The lexicon component in the LKB system contains the lexical entries of the grammar.

The SRG has a full coverage lexicon of closed word classes (pronouns, determiners, prepositions and conjunctions) and it contains about 50,000 lexical entries for open word classes (7,865 verbs, 28,025 nouns, 10,410 adjectives and 4,110 adverbs).[5] These lexical entries are organized into a multiple inheritance type hierarchy of about 500 leaf types which represent the type of words we have in our lexicon. The grammar also has 64 lexical rules to perform valence changing operations on lexical items (e.g. movement and removal of complements), which reduces the number of lexical entries to be manually encoded in the lexicon.

The syntactic rules. The syntactic rules in the LKB system are phrase structure rules that combine words and phrases into larger constituents and compositionally build up their semantic representation. The SRG has 191 phrase structure rules.

With these linguistic resources, the SRG handles the following range of linguistic phenomena: all types of subcategorization structures, surface word order variation and valence alternations, subordinate clauses, raising and control, determination, null-