Investigating NP-Chunking with Universal Dependencies for English
Ophélie Lacroix
Siteimprove
Sankt Annæ Plads 28, DK-1250 Copenhagen, Denmark
[email protected]

Abstract

Chunking is a pre-processing task generally dedicated to improving constituency parsing. In this paper, we want to show that universal dependency (UD) parsing can also leverage the information provided by the task of chunking, even though annotated chunks are not provided with universal dependency trees. In particular, we introduce the possibility of deducing noun-phrase (NP) chunks from universal dependencies, focusing on English as a first example. We then demonstrate how the task of NP-chunking can benefit PoS-tagging in a multi-task learning setting – comparing two different strategies – and how it can be used as a feature for dependency parsing in order to learn enriched models.

1 Introduction

Syntactic chunking consists of identifying groups of (consecutive) words in a sentence that constitute phrases (e.g. noun-phrases, verb-phrases). It can be seen as a shallow parsing task between PoS-tagging and syntactic parsing. Chunking is known to be a relevant pre-processing step for syntactic parsing.

Chunking received a lot of attention when syntactic parsing was predominantly driven by constituency parsing, and was highlighted, in particular, through the CoNLL-2000 Shared Task (Tjong Kim Sang and Buchholz, 2000). Nowadays, studies (Søgaard and Goldberg, 2016; Hashimoto et al., 2017) still compare chunking – as well as constituency parsing – performance on these same data from the Penn Treebank. While dependency parsing is spreading to different languages and domains (Kong et al., 2014; Nivre et al., 2017), chunking remains restricted to old journalistic data. Nevertheless, chunking can benefit dependency parsing as well as constituency parsing, but gold annotated chunks are not available for universal dependencies.

We want to automatically deduce chunks from universal dependencies (UD) (Nivre et al., 2017) and investigate their benefit for other tasks such as Part-of-Speech (PoS) tagging and dependency parsing. We focus on English, which has properties that make it a good candidate for chunking (a low percentage of non-projective dependencies). As a first target, we also decide to restrict the task to the most common chunks: noun-phrases (NP).

We choose to see NP-chunking as a sequence labeling task where tags signal the beginning (B-NP), the inside (I-NP) or the outside (O) of chunks. We thus propose to use multi-task learning for training chunking along with PoS-tagging and feature-tagging, to show that the tasks can benefit from each other. We experiment with two different multi-task learning strategies (training parameters in parallel or sequentially). We also intend to make parsing benefit from NP-chunking as a pre-processing task. Accordingly, we propose to add NP-chunk tags as features for dependency parsing.

Contributions. We show how to (i) deduce NP-chunks from universal dependencies for English, in order to (ii) demonstrate the benefit of performing chunking along with PoS-tagging through multi-task learning, and (iii) evaluate the impact of using NP-chunks as features for dependency parsing.

2 NP-Chunks

While chunks are inherently deduced from constituent trees, we want to deduce chunks from dependency trees in order not to rely on specific constituent annotations which would not be available for other domains or languages. In this case, it means that only partial information is provided by the dependencies to automatically extract the chunks. We thus choose to only deduce noun-phrase (NP) chunks (Ramshaw and Marcus, 1995) from the dependency trees.
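The B-NP/I-NP/O encoding described above can be sketched with a small helper that turns chunk spans into a tag sequence. This is a minimal illustration, not code from the paper; the function name and the span representation are our own, applied to the example sentence of Figure 1.

```python
def spans_to_bio(n_tokens, chunk_spans):
    """Convert NP chunk spans (start, end inclusive, 0-based) to B-NP/I-NP/O tags."""
    tags = ["O"] * n_tokens
    for start, end in chunk_spans:
        tags[start] = "B-NP"                 # first token of the chunk
        for i in range(start + 1, end + 1):
            tags[i] = "I-NP"                 # remaining tokens of the chunk
    return tags

# "On the next two pictures he took screenshots of two beheading video's"
tokens = ["On", "the", "next", "two", "pictures", "he", "took",
          "screenshots", "of", "two", "beheading", "video's"]
chunks = [(1, 4), (5, 5), (7, 7), (9, 11)]   # minimal NPs: the PP-internal NP stays separate
print(spans_to_bio(len(tokens), chunks))
# ['O', 'B-NP', 'I-NP', 'I-NP', 'I-NP', 'B-NP', 'O', 'B-NP', 'O', 'B-NP', 'I-NP', 'I-NP']
```

Note that adjacent chunks remain distinguishable because each chunk restarts with B-NP ("pictures he" above is two NPs, not one).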
[Figure 1: NP-chunks deduced from a UD tree of the English Web Treebank (EWT). Example sentence: "On the next two pictures he took screenshots of two beheading video's", with PoS tags ADP DET ADJ NUM NOUN PRON VERB NOUN ADP NUM NOUN NOUN and chunk tags O B-NP I-NP I-NP I-NP B-NP O B-NP O B-NP I-NP I-NP.]

Automatic Deduction. We deduce minimal NP-chunks, which means that embedded prepositional (PP) chunks are not included in our NP-chunks; e.g., in Figure 1, "screenshots of two beheading video's" is split into two distinct NPs instead of one long NP with an embedded PP ("of two beheading video's").

We first identify the core tokens of NPs: the nouns (NOUN), proper nouns (PROPN) and some pronouns¹ (PRON). After identifying these core tokens, we form full NPs by joining the core tokens with their direct and indirect children which are not part of PPs. In practice, these are the children whose incoming dependency is labeled with one of the following relations (modulo some individual conditions):

• compound, compound:prt, flat, flat:name, goeswith, fixed, nummod;
• det if the child is located before its head;
• conj if the child and its head are adjectives. We want "excellent and strong performers" to be one NP and "these challenges and possible solutions" to be split into two NPs;
• amod if the child is not an adverb. We don't want to attach preceding adverbs such as "not" to an NP;
• appos if the child is directly before or after its head;
• advmod if the child is not a PART or a VERB and its head is an adjective;
• nmod:poss if the child is not a NOUN or a PROPN. We want to group "your world" but not "John's last day" (where "John" and "last day" would be two distinct NPs);
• obl:npmod and obl:tmod, whether following or preceding the head;
• obl if its head has an amod incoming dependency.

In addition, when grouping a core token with one of its det, compound, nummod or nmod:poss children, we automatically attach the tokens which are in between. If split chunks remain, we attach the non-attached tokens which are in between two parts of a chunk. This allows us to attach adverbs which modify adjectives, such as "very" in "my very best friend", or some specific punctuation, such as the slash in "The owner/baker".

Manual Annotation. To assess the correctness of the automatically deduced chunks, we manually annotate noun-phrases on a small portion of the test set of the EWT UD treebank. On 50 sentences (from which 233 NP chunks were manually annotated), the accuracy of the automatic deduction reaches 98.7%. Errors in deduction are mostly due to isolated inconsistencies in the UD annotations.

¹ All pronouns but the interrogative and relative pronouns.

3 Models

3.1 Sequence Labeling

We implement a deep recurrent neural network with an architecture based on bidirectional Long Short-Term Memory (bi-LSTM) (Graves and Schmidhuber, 2005) that can exploit contextual information for processing sequences.

The base network is composed of an embedding layer that feeds two hidden bi-LSTM layers (forward and backward). The outputs of the bi-LSTMs are then concatenated to feed the next layer. Multiple bi-LSTM layers can be stacked. In the end, these outputs are fed to a softmax output layer. The embedding layer is a concatenation of a word embedding layer and a character embedding layer. The network takes as input a sequence of n tokens and outputs a sequence of n tags.

We use this architecture for PoS-tagging, feature-tagging (i.e. morpho-syntactic tagging) and NP-chunking. In order to make the tasks benefit from each other, we adapt the network to multi-task learning. We propose to compare two strategies for multi-task learning: shared or stacked.

Shared multi-task learning. In this architecture, the different tasks are trained at the same level, in a similar way as in Søgaard and Goldberg (2016). They share parameters throughout the network and feed different outputs.

Stacked multi-task learning. In this architecture, the different tasks are trained at different levels, as proposed by Hashimoto et al. (2017). A bi-LSTM layer is dedicated to each task. The output of the layer for a given task feeds the next layer, dedicated to the next task.

3.2 Dependency Parsing

Our dependency parser is a reimplementation of the arc-hybrid non-projective transition-based parser of de Lhoneux et al. (2017b). In this version of the arc-hybrid system, the SWAP transition is added to the original transition set (Kuhlmann et al., 2011), made up of the standard transitions RIGHT, LEFT and SHIFT.

4 Experiments

As a baseline for PoS-tagging, feature-tagging and NP-chunking, we first train our sequence tagger on each task separately. We then train the tagger in a multi-task setting – with PoS-tagging as the main task – alternating the auxiliary tasks and the strategies (shared or stacked multi-task learning).

As a baseline for dependency parsing, we train the parser using only word and character embeddings as input to the bi-LSTM. We then add the PoS and NP-chunk embeddings, separately and simultaneously, to train enriched models. As an upper bound, we also propose to run the experiments with "gold" NP-chunks, i.e. we feed the parser (for training and testing) with NP-chunks that were automatically deduced from the dependencies.

Data. We evaluate all tasks on the three English treebanks included in version 2.1 of the Universal Dependencies project (Nivre et al., 2017): EWT (254k tokens), LinES (82k tokens) and ParTUT (49k tokens). On average, 3.8, 3.3 and 6.2 NP-chunks per sentence are deduced, respectively, for each treebank.² Note that the LinES treebank does not contain features (morpho-syntactic tags), so we exclude feature-tagging from the evaluation for this treebank.

Hyper-parameters. We use the development data to tune our hyper-parameters and to determine the number of epochs (via early-stopping)
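The chunk-deduction procedure of Section 2 can be sketched in a few lines. This is a simplified sketch covering only a subset of the listed rules (compound, nummod, flat, fixed, det-before-head, non-adverbial amod, over direct children only); the token representation and all names are our own assumptions, and the gap-filling and remaining rules are omitted.

```python
CORE = {"NOUN", "PROPN", "PRON"}                  # core tokens of NPs
ATTACH = {"nummod", "flat", "fixed", "goeswith"}  # always attach to the head's chunk
MERGE = {"compound"}                              # may also merge two core tokens

def deduce_np_chunks(tokens):
    """tokens: dicts with 'id' (0-based position), 'head' (-1 = root),
    'deprel', 'upos'. Returns a list of B-NP/I-NP/O tags."""
    chunk = [None] * len(tokens)       # id of the core token anchoring each chunk
    for t in tokens:
        if t["upos"] in CORE:
            chunk[t["id"]] = t["id"]
    for t in tokens:                   # one pass over direct children only
        h = t["head"]
        if h < 0 or chunk[h] is None:
            continue
        rel, i = t["deprel"], t["id"]
        if rel in MERGE:
            chunk[i] = chunk[h]        # e.g. noun-noun compounds share one chunk
        elif chunk[i] is None and (
                rel in ATTACH
                or (rel == "det" and i < h)
                or (rel == "amod" and t["upos"] != "ADV")):
            chunk[i] = chunk[h]
    tags = []                          # linearize chunk ids into BIO tags
    for i, c in enumerate(chunk):
        if c is None:
            tags.append("O")
        elif i > 0 and chunk[i - 1] == c:
            tags.append("I-NP")
        else:
            tags.append("B-NP")
    return tags

# Figure 1 sentence: (form, 0-based head, deprel, upos), head -1 = root
sent = [
    ("On", 4, "case", "ADP"), ("the", 4, "det", "DET"),
    ("next", 4, "amod", "ADJ"), ("two", 4, "nummod", "NUM"),
    ("pictures", 6, "obl", "NOUN"), ("he", 6, "nsubj", "PRON"),
    ("took", -1, "root", "VERB"), ("screenshots", 6, "obj", "NOUN"),
    ("of", 11, "case", "ADP"), ("two", 11, "nummod", "NUM"),
    ("beheading", 11, "compound", "NOUN"), ("video's", 7, "nmod", "NOUN"),
]
tokens = [{"id": i, "head": h, "deprel": r, "upos": u}
          for i, (_, h, r, u) in enumerate(sent)]
print(deduce_np_chunks(tokens))
# ['O', 'B-NP', 'I-NP', 'I-NP', 'I-NP', 'B-NP', 'O', 'B-NP', 'O', 'B-NP', 'I-NP', 'I-NP']
```

Plain nmod is deliberately absent from the attachment rules, which is what keeps "video's" in a separate minimal NP from "screenshots" rather than forming one long NP with an embedded PP.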