<<

The Tibidabo

El treebank Tibidabo

Montserrat Marimon Universitat de Barcelona Gran Via de les Corts Catalanes 585, 08007-Barcelona [email protected]

Resumen: En este art´ıculopresentamos el desarrollo de un nuevo recurso de c´odigo abierto para el espa˜nol:el treebank Tibidabo. La anotaci´onse est´allevando a cabo de forma semi–autom´aticaen la que, en primer lugar, el corpus es analizado au- tom´aticamente con una gram´aticasimb´olicadel espa˜nolbasada en HPSG e im- plementada en el sistema Linguistic Knowledge Builder, y, en segundo lugar, los resultados del proceso de an´alisisse desambiguan manualmente. La existencia del treebank Tibidabo nos permitir´afuturos trabajos de investigaci´onpara el desar- rollo y evaluaci´onde una arquitectura h´ıbridaque combine metodos simb´olicosy estad´ısticospara el PLN, as´ıcomo investigaciones orientadas a la hibridizaci´onde t´ecnicasde bajo y alto nivel para el PLN. Palabras clave: treebank, espa˜nol,HPSG. Abstract: This paper describes work in progress for the creation of a new open– source resource for Spanish: an HPSG–based treebank so–called Tibidabo. The annotation is performed semi–automatically. First, the corpus is automatically an- notated by a symbolic HPSG–based grammar for Spanish implemented on the Lin- guistic Knowledge Builder system; then, the output is manually disambiguated. The existence of the Tibidabo treebank will facilitate research into the development and evaluation of a hybrid architecture combining symbolic and stochastic approaches to NLP, as well as investigations oriented to hybridization of shallow–deep techniques for NLP. Keywords: Treebank, Spanish, HPSG.

1 Introduction mar, a multi–purpose large–coverage HPSG– based grammar for Spanish implemented on Linguistically interpreted natural the Linguistic Knowledge Builder system. texts constitute a crucial resource both Then, ambiguous outputs are manually dis- for theoretical linguistic investigations about ambiguated. language use, for instance, and for practical The goal of the work we present is NLP purposes, such as the acquisition and twofold: (i) to create the training data which evaluation of parsing systems. Thus, in re- will allow us to build up a parse selection cent years, there has been an increasing in- model over the hand–built grammar follow- terest in the construction of and, ing (Toutanova et al., 2005), and (ii) since nowadays, both theory–neutral and theory– language resource development and evalua- grounded treebanks have been developed for tion go hand in hand, to create the gold stan- a great variety of . Some of these dard to evaluate the hybrid system. The treebanks are presented in (Hinrichs and Tibidabo treebank, in addition, will enable Simov, 2004). to evaluate foreseen investigations oriented to This paper describes ongoing work for the the hybridization of shallow–deep techniques creation of a new resource for Spanish, an for NLP aiming at efficient, robust, and ac- HPSG–based treebank so–called Tibidabo. curate rule–based parsing. The annotation is carried out in two The primary objective of our research steps. First, the corpus is automatically an- work is to produce open–source reusable re- notated with the Spanish Resource Gram- sources both for theoretical linguistic inves- tigations and for application–oriented NLP. The Tibidabo treebank is open–source and, together with the Spanish Resource Gram- mar, is part of the DELPH–IN open–source repository of linguistic resources, which also includes HPSG–based grammars for a wide variety of languages, including En- glish (Flickinger, 2002), German (Crysmann, 2005), Japanese (Siegel and Bender, 2002), French, Korean (Kim and Yangs, 2003), mod- ern Greek (Kordoni and Neu, 2005), Norwe- gian (Hellan and Haugereid, 2005), and Por- tuguese (Branco and Costa, 2008), as well as HPSG–based treebanks, such as the LinGO Figure 1: System architecture. Redwoods for English (Oepen et al., 2004) and Hinoki for Japanese (Hashimoto, Bond, 2.1 The FreeLing Tool and Siegel, 2007).1 The paper is organized as follows. First, Before parsing input sentences with the LKB in the following three sections, we present the system, raw text is pre–processed by FreeL- main features of the system we use to anno- ing, an open–source language analysis tool tate the corpus automatically (the architec- suite performing shallow processing function- alities ranging from text tokenization to de- ture, the development environment and theo- 2 retical background, and the linguistic compo- pendency parsing (Atserias et al., 2006). nents and coverage) and the disambiguation Our system integrates the morpho pro- process, and we show the representation of cessing module of FreeLing which receives a derivations produced by the grammar. Then, and morphologically annotates each in section 5, we present the corpus which is word. This module applies a cascade of spe- the basis of the treebank and shows some fig- cialized processors that includes: ures about the treebank. Finally, in section 6, we conclude this paper with a summary • Punctuation symbol annotator. and some directions for future work. • Multi–word recognizer. 2 Treebank annotation • Numerical expression recognizer. We have developed a hybrid architecture for • Date/time expression recognizer. the automatic processing of Spanish, shown in Figure 1, which integrates a shallow pro- • Ratio and percentage expression and cessing tool – FreeLing – into an HPSG– monetary amount recognizer. based grammar implemented on the LKB • Proper noun recognizer. A fast and sim- system – the Spanish Resource Grammar. ple pattern–matching module based on The advantage of our hybrid architecture capitalization which yields an accuracy is that it allows us to release the parser from near 90%.3 certain tasks –namely, morphological analy- sis and recognition and classification of spe- • Dictionary look–up and affixes handler. cial text expressions that have been consid- • Lexical probabilities annotator and un- ered peripheral to the lexicon, e.g. numbers, known word handler. dates, percentages, currencies, proper names, etc.– that may be robustly, efficiently, and Our system plugs the FreeLing tool into reliably dealt with by shallow external com- the system by means of the LKB Simple Pre- ponents, thus making the whole system more Processor Protocol (SPPP), which assumes adequate to deal with real world text. 2The FreeLing toolkit may be downloaded from: http://www.lsi.upc.edu/˜nlp/freeling. 3FreeLing also provides a NE recognizer based on the CoNLL–2002 shared task winning system (Car- reras, M´arquez,and Padr´o,2002) with higher accu- 1See http://www.delph-in.net/. racy –over 92%– but which is rather slower. that a preprocessor runs as an external pro- rules, a lexicon, and syntactic rules. cess to the LKB system and communicates The inflectional rules. The inflectional with its caller through its standard input and rules in the LKB system perform the mor- output channels.4 phological analysis of the words in the input The SPPP, therefore, integrates into the sentences. parsing process the PoS tags and lemmata of Since we use an external morphological each word in the sentence. analyzer, the SRG does not need a mor- FreeLing and the Spanish Resource Gram- phology component, instead we use the in- mar are two independently developed com- flectional rule component to propagate the ponents and show some discrepancies in the morpho–syntactic information associated to tagset and the lexical categories. We also use full–forms, in the form of PoS tags, to this model to handle them. the morpho–syntactic features of the lexical 2.2 The Spanish Resource items. We have defined as many rules as tags Grammar we have in the FreeLing tagset. The lexicon. The lexicon component in The Spanish Resource Grammar (SRG) is the LKB system contains the lexical entries designed as multi–purpose (abstracted away of the grammar. from any particular application), and broad– coverage (aiming to cover not only all varia- The SRG has a full coverage lexicon of tions of the phenomena that have been im- closed word classes (pronouns, determiners, plemented, but also the combinations of dif- prepositions and conjunctions) and it con- ferent phenomena). tains about 50,000 lexical entries for open word classes (7,865 , 28,025 nouns, 2.2.1 Development environment and 10,410 adjectives and 4,110 adverbs).5 These theoretical background lexical entries are organized into a multiple The grammar is implemented on the Lin- inheritance type hierarchy of about 500 leaf guistic Knowledge Builder (LKB) system – types which represent the type of words we an interactive grammar development envi- have in our lexicon. The grammar also has ronment for typed feature structure gram- 64 lexical rules to perform valence changing mars – (Copestake, 2002), based on the ba- operations on lexical items (e.g. movement sic components of the LinGO Grammar Ma- and removal of complements) which reduces trix, an open–source starter–kit for rapid de- the number of lexical entries to be manually velopment of precision broad–coverage gram- encoded in the lexicon. mars compatible with the LKB system (Ben- The syntactic rules. The syntactic der and Flickinger, 2005). rules in the LKB system are structure The SRG is grounded in the theoretical rules that combine words and into framework of Head–driven Phrase Structure larger constituents and compositionally build Grammar (HPSG) (Pollard and Sag, 1994), a up their semantic representation. The SRG constraint–based lexicalist approach to gram- has 191 phrase structure rules. matical theory where all linguistic objects (i.e. words and phrases) are represented as With these linguistic resources, the SRG typed feature structures, and uses Minimal handles the following range of linguistic Recursion (MRS) (Copestake et phenomena: all types of subcategorization al., 2006) for the semantic representation. structures, surface word order variation and MRS is not a semantic theory in itself, but a valence alternations, subordinate clauses, kind of meta–level which has been defined for raising and control, determination, null– describing semantic structures. Using unifi- subjects and impersonal constructions, cation of typed features structures, MRS as- compound tenses, modification, passive signs a syntactically flat semantic represen- constructions, comparatives and superla- tation to linguistic expressions. tives, cliticization, relative and interrogative 2.2.2 Linguistic components and clauses, sentential adjuncts, negation, noun coverage ellipsis, and coordination, among others. To parse a sentence, an LKB grammar re- quires three basic components: inflectional 5The grammar also includes a set of generic lexical entry templates for open classes to deal with unknown 4See http://wiki.delph-in.net/moin/LkbSppp. words for virtually unlimited lexical coverage. Figure 2: Screenshot of the annotation environment.

3 Disambiguation where t´ecnico is analysed as a noun and it Almost every sentence we parse is ambigu- is the head of the NP el t´ecnico; and (1), ous. Thus, in order to construct the Tibidabo where t´ecnico is analysed as an adjective and treebank, the outputs of the SRG have to be it is attached to the article building an NP disambiguated by manually selecting the pre- with an elliptical head. The window on the ferred analyses. This task is carried out with right shows the set of lexical and phrasal am- biguities that originates the multiple analy- the [incr tstb ()] profiling environment. ses for this sentence. That is, the two lexi- Details of the [incr tstb ()] profiling cal entries –aj i le and n c le– and the two environment can be found in (Oepen and phrase structure rules that build up the NP Carroll, 2000). Basically, it includes: (1) node –SP-HD C– and the elliptical NP node a database that records the parsing results –NP ELL C–. AQ0MS00 and NCMS000 are obtained from an LKB grammar, and (2) a the inflectional rules that propagate the PoS tree comparison tool for the annotators to se- tags to the morpho–syntactic features of the lect the preferred analysis for each sentence, lexical items. To disambiguate this sentence either by directly selecting it, as it is dis- by selecting the first phrase structure tree, we played as a labeled phrase structure tree, or, can either select it directly, with the ’Save’ when dozens (or even hundreds) of analyses option on the top of the environment, or we are displayed, to incrementally reduce the set can select the rule or the lexical entry with of analyses either by selecting the (lexical or which it is built up.7 phrasal) local ambiguity that originates the multiple analyses, or by rejecting it.6 All the decisions made by the annotators Figure 2 shows a screenshot of the annota- are recorded in the database of the [incr tion environment. The full set of analyses for tsdb()] profiling environment so that they the example “el t´ecnico justific´osus aspira- can be re–used to update the treebank semi– ciones” (the coach justified his goals)which automalically with a revised version of the appears on the top, are displayed on the left. grammar, since only new ambiguities pro- Since t´ecnico is both a noun and an adjec- duced by the revised grammar need to be tive, the SRG produces two analyses: (0), manually disambiguated.

6See (Oepen et al., 2004) for a detailed description 7Alternatively, we can reject either the second of the tree comparison tool and some examples show- phrase structure tree, with the ’Reject’ option on the ing how the authors use it to construct the LinGO top of the environment, or we can reject the lexical Redwoods treebank for English. entry or the rule with which it is built up. Figure 3: MRS semantic representation.

So far, analysis selection has been car- tation. Thus, for an input sentence such as ried out by a single annotator, who has also “el gato come pescado.” (the cat eats fish.), been responsible for the development of the the output of the SRG is as shown in Figure HPSG grammar. While disambiguating the 3 and Figure 4. outputs of the SRG, the annotator is writ- ing down an annotation manual, or guide- lines, which will be used by annotators that will join the project in the future. This man- ual include guidelines, for instance, (1) to se- lect one lexical entry among the several en- tries we have defined to cope with all possible subcategorization frames of a given word, (2) to select the PP attachment, (3) to disam- biguate the impersonal and passive construc- tions with “se”, etc. Also, it is crucial to set a clear criteria to select one analysis when the grammar produces spurious ambiguity, i.e. when we get different structures showing the same semantic representation. That is the case, for example, of subordinating clauses in post–verbal position that the grammar at- Figure 4: Phrase structure tree. taches them both to the VP node and to the S node (i.e. before and after cancelling the In the phrase structure tree each node is subject). Another example is the rule re- labeled with a set of atomic labels of the type moving optional complements which applies ’S’, ’VP’, ’V’, ’NP’, etc. both before and after attaching post-verbal The MRS semantic representation consists PP and/or adverbial modifiers. of: 1) a list of semantic relations (RELS), each with a ”handle” (LBL) (used to ex- 4 Representation of derivations press scope relations) and one or more roles The analyses that have been manually se- (ARG0, ARG1,...). Relations are classi- lected by the annotators are recorded in fied according to the number and type of the [incr tstb ()] profiling environment arguments; 2) a set of handle constraints database. (HCONS), reflecting syntactic limitations on The annotation of each parsed sentence possible scope relations among the semantic simultaneously represents two different de- relations, and 3) a group of distinguished se- scriptive levels: (i) a traditional phrase struc- mantic attributes of a linguistic sign. These ture tree, and (ii) an MRS semantic represen- attributes are: LTOP – the local top handle, and INDEX – the salient nominal instance or Table 2 shows the ratio of sentences up event variable introduced by the lexical se- to 40 words we have already processed, dis- mantic head. tributed along the sentence length. It is ob- In addition, the database also records vious that the longer the sentence are the a derivational tree composed of the identi- more analyses they get, so sentences with fiers of the lexical entries and grammar rules have more that 40 words receive thundreds which have been used to build up the tree (even thousands) of analyses, this makes the and which may be used to reconstruct the disambiguation task difficult. Our objective full analysis. is to disambiguate first the sentences which are up to 40 words and to use this treebank to 5 The corpus create a parse selection model. The idea is to The basis of the Tibidabo treebank are use this parse selection model to reduce the newspaper text we borrow from the corpus number of analyses produced by the gram- AnCora, a corpus of about 500,000 words mar when parsing long sentences, which will (17,364 sentences) (Taul´e, Mart´ı, and Re- then be disambiguated manually. casens, 2008). Table 1 shows the number of sentences and ratio distributed along the sen- sentence tence length. length processed 0-4 76% 5-9 85% sentence number of 10-14 70% length sentences ratio 15-19 60% 0-4 644 3.7% 20-24 50% 5-9 1290 11.1% 25-29 40% 10-14 1858 21.8% 30-34 30% 15-19 2001 33.4% 35-39 20% 20-24 2096 45.4% 25-29 1952 56.7% 30-34 1949 67.9% 35-39 1707 77.7% 40-44 1401 85.8% Table 2: The treebank Tibidabo. 45-49 1059 91.9% 50-54 615 95.4% 6 Conclusions and future work 55-59 357 97.5% 60-64 206 98.7% In this paper we have presented work in progress for the creation of a new open– 65-69 112 99.3% source resource for Spanish: the Tibidabo 70-74 52 99.6% HPSG–based treebank. 75-79 22 99.8% Besides improving the linguistic modules 80-84 11 99.8% by extending the coverage of the grammar 85-89 9 99.9% to deal with the phenomena which have not 90-94 4 99.9% been implemented so far (e.g. VP ellip- 95-99 5 99.9% sis) and debugging the grammar to avoid er- 100-104 2 99.9% rors and deficiencies to increase the treebank 105-109 4 100% cases, future work includes the development 110-114 4 100% of a parse selection model over the hand-built 120-124 2 100% grammar. It is also foreseen investigations 125-129 1 100% oriented to the hybridization of shallow–deep 130-134 1 100% techniques for NLP aiming at efficient, ro- 125-129 1 100% bust, and accurate rule–based parsing.

Acknowledgments Table 1: The corpus AnCora. This work was funded by the Ram´ony Cajal program of the Spanish Ministerio de Ciencia e Innovaci´on. References and Melanie Siegel, editors, A Workshop Atserias, J., B. Casas, E. Comelles, on Ideas and Strategies for Multilingual M. Gonz´alez, L. Padr´o, and M. Padr´o. Grammar Engineering, ESSLLI, Vienna, 2006. Freeling 1.3: Syntactic and seman- Austria. tic services in an open-source nlp library. Hinrichs, E.W. and K. Simov. 2004. Re- In Proceedings of LREC’06, Genoa, Italy. search on Language and Computation, 2 Bender, E.M. and D. Flickinger. 2005. (4). Rapid prototyping of scalable grammars: Kim, J-B. and J. Yangs. 2003. Korean Towards modularity in extensions to a and its imple- language-independent core. In Pro- mentations into the lkb system. In Pro- ceedings of IJCNLP-05 (Posters/Demos), ceedings of the 17th Pacific Asia Confer- Jeju Island, Korea. ence on Language, Information, and Com- Branco, A. and F. Costa. 2008. A Com- putation. putational Grammar for Deep Linguis- Kordoni, V. and J. Neu. 2005. Deep anal- tic Processing of Portuguese: LXGram, ysis of modern greek. In Jong-Hyeok Lee version A.4.1. Research Report TR-2008- et al. Keh-Yih Su, Jun’ichi Tsujii, editor, 17. Universidade de Lisboa, Faculdade de Lecture Notes in Computer Science, Vol a CiA˜ ncias, Departamento de Informatica. 3248. Springer-Verlag Berlin Heidelberg, Carreras, X, L. M´arquez,and L. Padr´o.2002. pages 674 – 683. Named entity extraction using adaboost. Oepen, S. and J. Carroll. 2000. Perfor- In Proceedings of CoNLL Shared Task, mance profiling for parser engineering. In Taipei, Taiwan. D. Flickinger, S. Oepen, J. Tsujii, and Copestake, A. 2002. Implementing Typed H. Uszkoreit, editors, Natural Language Feature Structure Grammars. CSLI Pub- Engineering (6)1 —Special Issue: Effi- lications, CSLI lecture notes, number 110, ciency Processing with HPSG: Methods, Chicago. Systems, Evaluation. Cambridge Univer- sity Press, pages 81–97. Copestake, A., D. Flickinger, C.J. Pollard, and I. A. Sag. 2006. Minimal recursion Oepen, S., D. Flickinger, K. Toutanova, and semantics: an introduction. Research on C.D. Manning. 2004. Lingo redwoods. Language and Computation, 3(4):281–332. In E.W. Hinrichs and K. Simov, editors, Research on Language and Computation, Crysmann, B. 2005. Syncretism in German: 2(4). a unified approach to underspecification, indeterminacy, and likeness of case. In Pollard, C.J. and I.A. Sag. 1994. Head- Proceedings of the 12th International Con- driven Phrase Structure Grammar. The ference on Head-driven Phrase Structure University of Chicago Press and CSLI Grammar, Lisbon, Portugal. Publications, Chicago. Flickinger, D. 2002. On building a more ef- Siegel, M. and E.M. Bender. 2002. Efficient ficient grammar by exploiting types. In deep processing of japanese. In 3rd Work- D. Flickinger, S. Oepen, J. Tsujii, and shop on Asian Language Resources and H. Uszkoreit, editors, Collaborative Lan- International Standardization, COLING- guage Engineering. Stanford: CSLI Publi- 02, Tapei, Taiwan. cations, pages 1–17. Taul´e, M., M.A. Mart´ı, and M. Recasens. Hashimoto, C., F. Bond, and M. Siegel. 2007. 2008. Ancora: Multilevel annotated cor- Semi-automatic documentation of an im- pora for catalan and spanish. In Proceed- plemented linguistic grammar augmented ings of the 6th International Conference with a treebank. In Language Resources on Language Resources and Evaluation, and Evaluation. (Special issue on Asian LREC-2008, Marrakech, Morocco. language technology). Toutanova, K., C.D. Manning, D. Flickinger, Hellan, L. and P. Haugereid. 2005. Nor- and S Oepen. 2005. Stochastic hpsg parse source - an excercise in the matrix gram- disambiguation using the redwoods cor- mar building design. In Emily M. Ben- pus. In Journal of Logic and Computa- der, Dan Flickinger, Frederik Fouvry, tion.