
VPCTagger: Detecting Verb-Particle Constructions With Syntax-Based Methods

István Nagy T.1 and Veronika Vincze1,2

1Department of Informatics, University of Szeged
Árpád tér 2., 6720 Szeged, Hungary
[email protected]

2Hungarian Academy of Sciences, Research Group on Artificial Intelligence
Tisza Lajos krt. 103., 6720 Szeged, Hungary
[email protected]

Abstract

Verb-particle combinations (VPCs) consist of a verbal and a preposition/particle component, which often have some additional meaning compared to the meaning of their parts. If a data-driven morphological parser or a syntactic parser is trained on a dataset annotated with extra information for VPCs, it will be able to identify VPCs in raw texts. In this paper, we examine how syntactic parsers perform on this task and we introduce VPCTagger, a machine learning-based tool that is able to identify English VPCs in context. Our method consists of two steps: it first selects VPC candidates on the basis of syntactic information and then selects genuine VPCs among them by exploiting new features like semantic and contextual ones. Based on our results, we see that VPCTagger outperforms state-of-the-art methods in the VPC detection task.

1 Introduction

Verb-particle constructions (VPCs) are a subclass of multiword expressions (MWEs) that contain more than one meaningful token, but the whole unit exhibits syntactic, semantic or pragmatic idiosyncrasies (Sag et al., 2002). VPCs consist of a verb and a preposition/particle (like hand in or go out) and they are very characteristic of the English language. The particle modifies the meaning of the verb: it may add aspectual information, may refer to motion or location, or may totally change the meaning of the expression. Thus, the meaning of a VPC can be compositional, i.e. it can be computed on the basis of the meaning of the verb and the particle (go out), or it can be idiomatic, i.e. the combination of the given verb and particle results in a(n unexpected) new meaning (do in "kill"). Moreover, as their syntactic surface structure is very similar to verb-prepositional phrase combinations, it is not straightforward to determine whether a given verb + preposition/particle combination functions as a VPC or not, and contextual information plays a very important role here. For instance, compare the following examples: The hitman did in the president and What he did in the garden was unbelievable. Both sentences contain the sequence did in, but it functions as a VPC only in the first sentence; in the second, it is a simple verb-prepositional phrase combination. For these reasons, VPCs are of great interest for natural language processing applications like machine translation or information extraction, where it is necessary to grasp the meaning of the text.

The special relation of the verb and particle within a VPC is often distinctively marked at several annotation layers in treebanks. For instance, in the Penn Treebank, the particle is assigned a specific part-of-speech tag (RP) and a specific syntactic label (PRT) (Marcus et al., 1993), see also Figure 1. This entails that if a data-driven morphological parser or a syntactic parser is trained on a dataset annotated with extra information for VPCs, it will be able to assign these kinds of tags as well. In other words, the morphological/syntactic parser itself will be able to identify VPCs in texts.

In this paper, we seek to identify VPCs on the basis of syntactic information. We first examine how syntactic parsers perform on Wiki50 (Vincze et al., 2011), a dataset manually annotated for different types of MWEs, including VPCs. We then present our syntax-based tool called VPCTagger to identify VPCs, which consists of two steps: first, we select VPC candidates (i.e. verb-preposition/particle pairs) from the text, and then we apply a machine learning-based technique to classify them as genuine VPCs or not. This method is based on a rich feature set with new features like semantic and contextual features. We compare the performance of the parsers with that of our approach and discuss the reasons for any possible differences.


Figure 1: A dependency parse of the sentence "The hitman did in the president" (the particle in attaches to the verb did via the prt label).

2 Related Work

Recently, some studies have attempted to identify VPCs. For instance, Baldwin and Villavicencio (2002) detected verb-particle constructions in raw texts with the help of information based on POS-tagging and chunking, and they also made use of frequency and lexical information in their classifier. Kim and Baldwin (2006) built their system on semantic information when deciding whether verb-preposition pairs were verb-particle constructions or not. Nagy T. and Vincze (2011) implemented a rule-based system based on morphological features to detect VPCs in raw texts.

The (non-)compositionality of verb-particle combinations has also raised interest among researchers. McCarthy et al. (2003) implemented a method to determine the compositionality of VPCs, and Baldwin (2005) presented a dataset in which non-compositional VPCs can be found. Villavicencio (2003) proposed some methods to extend the coverage of available VPC resources.

Tu and Roth (2012) distinguished genuine VPCs and verb-preposition combinations in context. They built a crowdsourced corpus of VPC candidates in context, where each candidate was manually classified as a VPC or not. However, during corpus building, they applied lexical restrictions and concentrated only on VPCs formed with six verbs. Their SVM-based algorithm used syntactic and lexical features to classify VPC candidates, and they concluded that their system achieved good results on idiomatic VPCs, but the classification of more compositional VPCs is more challenging.

Since in this paper we focus on syntax-based VPC identification (more precisely, we also identify VPCs with syntactic parsers), it seems necessary to mention studies that experimented with parsers for identifying different types of MWEs. For instance, constituency parsing models were employed in identifying contiguous MWEs in French and Arabic (Green et al., 2013). Their method relied on a syntactic treebank, an MWE list and a morphological analyzer. Vincze et al. (2013) employed a dependency parser for identifying light verb constructions in Hungarian texts as a "side effect" of parsing sentences and reported state-of-the-art results for this task.

Here, we make use of parsers trained on the Penn Treebank (which contains annotation for VPCs) and we evaluate their performance on the Wiki50 corpus, which was manually annotated for VPCs. Thus, we first examine how well these parsers identify VPCs (i.e. assign VPC-specific syntactic labels) and then we present how VPCTagger can carry out this task: first, we select VPC candidates from raw text and then we classify them as genuine VPCs or not.
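To make the parser-based view concrete, the following minimal sketch (not part of the original system, which used the Stanford and Bohnet parsers) uses spaCy as a stand-in dependency parser and looks for the particle label on the did in example:

```python
# Minimal sketch: spaCy stands in for the Stanford/Bohnet parsers used in
# the paper; a VPC surfaces as a token attached to its verb with dep_ "prt".
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this English model is installed

for sentence in ["The hitman did in the president.",
                 "What he did in the garden was unbelievable."]:
    doc = nlp(sentence)
    pairs = [(tok.head.text, tok.text) for tok in doc if tok.dep_ == "prt"]
    print(sentence, "->", pairs or "no verb-particle edge")
```

Whether a parser actually produces the verb-particle edge for each gold VPC is exactly what Section 4.2 measures for the Stanford and Bohnet parsers.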
3 Verb-particle Constructions in English

As mentioned earlier, verb-particle constructions consist of a verb and a particle. Similar constructions are present in several languages, although there might be different grammatical or orthographic norms for them in those languages. For instance, in German and in Hungarian, the particle usually precedes the verb and they are spelt as one word, e.g. aufmachen (up.make) "to open" in German or kinyitni (out.open) "to open" in Hungarian. On the other hand, languages like Swedish, Norwegian, Icelandic and Italian follow the same pattern as English; namely, the verb precedes the particle and they are spelt as two words (Masini, 2005). These two typological classes require different approaches if we would like to identify VPCs. For the first group, morphology-based solutions can be implemented that identify the internal structure of compound words. For the second group, syntax-based methods can also be successful, which take into account the syntactic relation between the verb and the particle.

Many VPCs are formed with a motion verb and a particle denoting direction (like go out, come in etc.) and their meaning reflects this: they denote a motion or location. The meaning of VPCs belonging to this group is usually transparent and thus they can be easily learnt by second language learners.

In other cases, the particle adds some aspectual information to the meaning of the verb: eat up means "to consume totally" and burn out means "to reach a state where someone becomes exhausted". These VPCs still have a compositional meaning, but the particle has a non-directional, rather aspectual function here (cf. Jackendoff (2002)). Yet other VPCs have completely idiomatic meanings like do up "repair" or do in "kill". In the latter cases, the meaning of the construction cannot be computed from the meaning of the parts, hence they are problematic for both language learners and NLP applications.

Tu and Roth (2012) distinguish between two sets of VPCs in their database: the more compositional and the more idiomatic ones. Differentiating between compositional and idiomatic VPCs has an apt linguistic background as well (see above) and it may be exploited in some NLP applications like machine translation (parts of compositional VPCs may be directly translated, while idiomatic VPCs should be treated as one unit). However, when grouping their data, Tu and Roth just consider frequency data and treat one VPC as one lexical entry. This approach is somewhat problematic as many VPCs in their dataset are highly ambiguous and thus may have several meanings (like get at, which can mean "criticise", "mean", "get access" or "threaten"), some of which may be compositional while others are not. Hence, clustering all these meanings and classifying them as either compositional or idiomatic may be misleading. Instead, VPC and non-VPC uses of one specific verb-particle combination could be distinguished on the basis of frequency data, or a word sense disambiguation approach may give an account of the compositional or idiomatic uses of the specific unit.

In our experiments, we use the Wiki50 corpus, in which VPCs are annotated in raw text, but no semantic classes are further distinguished. Hence, our goal here is not the automatic semantic classification of VPCs, because we believe that the identification of VPCs in context should be solved first; in a further step, genuine VPCs might then be classified as compositional or idiomatic, given a manually annotated dataset from which this kind of information may be learnt. This issue will be addressed in a future study.

Figure 2: System Architecture.

4 VPC Detection

Our goal is to identify each individual VPC in running texts, i.e. to take individual inputs like How did they get on yesterday? and mark each VPC in the sentence. Our tool called VPCTagger is based on a two-step approach. First, we syntactically parse each sentence and extract potential VPCs with a syntax-based candidate extraction method. Afterwards, a binary classifier is used to automatically classify the potential VPCs as genuine VPCs or not. For the automatic classification of candidate VPCs, we implemented a machine learning approach, which is based on a rich feature set with new features like semantic and contextual features. Figure 2 outlines the process used to identify each individual VPC in a running text.
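In skeleton form, the two-step architecture of Figure 2 amounts to something like the following (hypothetical helper names; the concrete components are described in Sections 4.2 and 4.3):

```python
# Skeleton of the two-step approach: extract candidates from a parsed
# sentence, then keep those the binary classifier accepts as genuine VPCs.
def vpc_tagger(sentence, extract_candidates, classify):
    return [cand for cand in extract_candidates(sentence) if classify(cand)]
```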
4.1 Corpora

To evaluate our methods, we made use of two corpora. Statistical data on the corpora can be seen in Table 1. First, we used Wiki50 (Vincze et al., 2011), in which several types of multiword expressions (including VPCs) and named entities were marked. This corpus consists of 50 Wikipedia pages and contains 466 occurrences of VPCs.

Table 1: Statistical data on the corpora.

Corpus    Sentences   Tokens    VPC occurrences   Unique VPCs
Wiki50    4,350       114,570   466               342
Tu&Roth   1,348       38,132    878               23

In order to compare the performance of our system with others, we also used the dataset of Tu and Roth (2012), which contains 1,348 sentences taken from different parts of the British National Corpus. This dataset only focuses on VPCs formed with six specific verbs; 65% of its sentences contain a phrasal verb and 35% contain a simplex verb-preposition combination. As Table 1 indicates, the Tu&Roth dataset covers only 23 different VPCs, whereas 342 unique VPCs were annotated in the Wiki50 corpus.

4.2 Candidate Extraction

In this section, we concentrate on the first step of our approach, namely how VPC candidates can be selected from texts. As mentioned in Section 1, our hypothesis is that the automatic detection of VPCs can basically be carried out by dependency parsers. Thus, we examined the performance of two parsers on VPC-specific syntactic labels.

As we had a full-coverage VPC-annotated corpus where each individual occurrence of a VPC was manually marked, we were able to examine the characteristics of VPCs in running text and evaluate the effectiveness of the parsers on this task. Therefore, we examined the dependency relations that the Stanford parser (Klein and Manning, 2003) and the Bohnet parser (Bohnet, 2010) assigned to the manually annotated gold standard VPCs in the Wiki50 corpus. In order to compare the efficiency of the parsers, both were applied using the same dependency representation. We found that only 52.57% and 58.16% of the annotated VPCs in Wiki50 had a verb-particle syntactic relation when we used the Stanford and Bohnet parsers, respectively. As Table 2 shows, there are several other syntactic constructions in which VPCs may occur.

Table 2: Edge types in the Wiki50 corpus. prt: particle. prep: preposition. advmod: adverbial modifier. other: other dependency labels. none: no direct syntactic connection between the verb and particle.

             Stanford          Bohnet
Edge type    #      %          #      %
prt          235    52.57      260    58.16
prep         23     5.15       107    23.94
advmod       56     12.52      64     14.32
sum          314    70.24      431    96.42
other        8      1.79       1      0.22
none         125    27.97      15     3.36
sum          447    100.00     447    100.00

Therefore, we extended our candidate extraction method: besides the verb-particle dependency relation, the preposition and adverbial modifier syntactic relations between verbs and particles were also investigated. With this modification, 70.24% and 96.42% of the VPCs in the Wiki50 corpus could be identified. In this phase, we found that the Bohnet parser was more successful on the Wiki50 corpus, i.e. it could cover more VPCs, hence we applied the Bohnet parser in our further experiments.

Some researchers filtered VPC candidates by selecting only certain verbs that may be part of the construction. One example is Tu and Roth (2012), where the authors examined a verb-particle combination only if the verbal component was one of six previously given verbs (i.e. make, take, have, give, do, get). Since Wiki50 was annotated for all VPC occurrences, we were able to check what percentage of VPCs could be covered if we applied this selection. As Table 3 shows, the six verbs used by Tu and Roth (2012) are responsible for only 50 VPCs in the Wiki50 corpus, i.e. this selection covers only 11.16% of all gold standard VPCs.

Table 3: The frequency of the verbs used by Tu and Roth (2012) in the Wiki50 corpus.

verb   #
take   27
get    10
give   5
make   3
have   0
do     0
sum    50

Table 4 lists the most frequent VPCs and verbal components in the Wiki50 corpus. As can be seen, the top 10 VPCs are responsible for only 17.41% of the VPC occurrences, while the top 10 verbal components are responsible for 41.07% of the VPC occurrences. Furthermore, 127 different verbal components occur in Wiki50, but the verbs have and do – which are used by Tu and Roth (2012) – do not appear in the corpus as verbal components of VPCs. All this indicates that applying lexical restrictions and focusing on a reduced set of verbs leads to the exclusion of a considerable number of VPCs occurring in free texts, and so real-world tasks would hardly profit from such restrictions.

Table 4: The most frequent VPCs and verbal components in the Wiki50 corpus.

VPC         #     verb    #
call for    11    set     28
point out   9     take    27
carry out   9     turn    26
set out     8     go      21
grow up     8     call    21
set up      7     come    15
catch up    7     carry   13
turn on     7     look    13
take up     6     break   10
pass on     6     move    10
sum         78    sum     184
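In sketch form, the extended extraction step can look like the following (spaCy stands in here for the Bohnet parser actually used; the label set is the one from Table 2):

```python
# Sketch of the syntax-based candidate extraction: any verb together with a
# dependent attached via prt, prep or advmod becomes a VPC candidate.
import spacy

CANDIDATE_DEPS = {"prt", "prep", "advmod"}
nlp = spacy.load("en_core_web_sm")

def extract_candidates(text):
    """Return (verb lemma, particle/preposition) pairs as VPC candidates."""
    doc = nlp(text)
    return [(tok.head.lemma_, tok.lower_)
            for tok in doc
            if tok.dep_ in CANDIDATE_DEPS and tok.head.pos_ == "VERB"]

print(extract_candidates("How did they get on yesterday?"))
```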

4.3 Machine Learning-Based Candidate Classification

In order to perform an automatic classification of the candidate VPCs, a machine learning-based approach was implemented, which will be elaborated upon below. This method is based on a rich feature set with the following categories: orthographic, lexical, syntactic and semantic. Moreover, as VPCs are highly ambiguous in raw texts, contextual features are also required. A sketch of some of these feature types is given after the list.

• Orthographic features: Here, we examined whether the candidate consists of two or more tokens. Moreover, if the particle component started with the prefix 'a', which in many cases etymologically denotes movement (like across and away), this was also noted and applied as a feature.

• Lexical features: We exploited the fact that the most common verbs occur most frequently in VPCs, so we selected fifteen verbs from the most frequent English verbs (http://en.wikipedia.org/wiki/Most_common_words_in_English). Here, we examined whether the lemmatised verbal component of the candidate was one of these fifteen verbs. We also examined whether the particle component of the potential VPC occurred among the common English particles; here, we apply a manually built particle list based on linguistic considerations. Moreover, we also checked whether the potential VPC is contained in the list of typical English VPCs collected by Baldwin (2008).

• Syntactic features: The dependency label between the verb and the particle can also be exploited in identifying VPCs. As we typically found when dependency parsing the corpus, the syntactic relation between the verb and the particle in a VPC is prt, prep or advmod (applying the Stanford parser dependency representation), hence these syntactic relations were defined as features. If the candidate's object was a personal pronoun, this was also encoded as another syntactic feature.

• Semantic features: These features were based on the fact that the meaning of VPCs may typically reflect a motion or location, like go on or take away. First, we examined whether the verbal component is a motion verb like go or turn, or whether the particle indicates a direction like out or away. Moreover, the semantic type of the prepositional object, object and subject in the sentence can also help to decide whether the candidate is a VPC or not. Consequently, the person, activity, animal, artifact and concept semantic senses were looked for among the upper-level hypernyms of the nominal heads of the prepositional object, object and subject in Princeton WordNet 3.1 (http://wordnetweb.princeton.edu/perl/webwn).
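The sketch below illustrates the orthographic, lexical and semantic checks; the verb/particle lists and the WordNet synset names are illustrative assumptions, not the authors' published resources, and NLTK's WordNet interface stands in for Princeton WordNet 3.1:

```python
# Illustrative feature checks; lists and synset names are assumptions.
# Requires: import nltk; nltk.download("wordnet") once beforehand.
from nltk.corpus import wordnet as wn

FREQUENT_VERBS = {"go", "come", "take", "get", "set", "turn"}  # illustrative
PARTICLES = {"up", "out", "on", "off", "in", "down", "away", "back"}

def orthographic_lexical_features(verb, particle):
    return {
        "particle_starts_with_a": particle.startswith("a"),
        "frequent_verb": verb in FREQUENT_VERBS,
        "common_particle": particle in PARTICLES,
    }

def has_hypernym(noun, synset_name):
    """True if any noun sense of `noun` has the given synset as a hypernym."""
    target = wn.synset(synset_name)
    return any(target in path
               for syn in wn.synsets(noun, pos=wn.NOUN)
               for path in syn.hypernym_paths())

# e.g. a "person"-type object, as in "did in the president":
print(has_hypernym("president", "person.n.01"))
```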

When several different machine learning algorithms were experimented with on this feature set, the preliminary results showed that decision trees performed best on this task. This is probably due to the fact that our feature set consists of a few compact (i.e. high-level) features. The J48 classifier of the WEKA package (Hall et al., 2009), which implements the C4.5 decision tree algorithm (Quinlan, 1993), was trained with its default settings on the above-mentioned feature set. Moreover, Support Vector Machine (SVM) (Cortes and Vapnik, 1995) results are also reported to compare the performance of our methods with that of Tu and Roth (2012).

As the investigated corpora were not sufficiently large for splitting them into training and test sets of appropriate size, we evaluated our models in a cross-validation manner on the Wiki50 corpus and the Tu&Roth dataset.
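The paper trains WEKA's J48 with default settings; the sketch below is only a rough Python analogue using scikit-learn's CART decision tree (not C4.5) over dict-valued feature sets, with 5-fold cross-validation on toy placeholder data:

```python
# Rough analogue of the J48 setup: a decision tree over feature dicts,
# evaluated with 5-fold cross-validation. CART stands in for C4.5.
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

X = [{"dep": "prt", "frequent_verb": True, "motion_verb": True},
     {"dep": "prep", "frequent_verb": False, "motion_verb": False}] * 20
y = [1, 0] * 20  # 1 = genuine VPC, 0 = not; toy labels

model = make_pipeline(DictVectorizer(sparse=False),
                      DecisionTreeClassifier(random_state=0))
print(cross_val_score(model, X, y, cv=5, scoring="f1").mean())
```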


As Tu and Roth (2012) presented only accuracy scores on the Tu&Roth dataset, we also employed accuracy as an evaluation metric on this dataset, where positive and negative examples were both marked. However, in the case of the Wiki50 corpus, where only the positive VPCs were manually annotated, the Fβ=1 score interpreted on the positive class was employed as the evaluation metric. All potential VPCs that were extracted by the candidate extraction method but were not marked as positive in the gold standard were treated as negative; thus, in the resulting dataset, negative examples are overrepresented. Moreover, as Table 2 shows, the candidate extraction method did not cover all manually annotated VPCs in the Wiki50 corpus, hence we treated the omitted VPCs as false negatives in our evaluation.

As a baseline, we applied a context-free dictionary lookup method. In this case, we used the same VPC list that was described among the lexical features, and we marked a candidate of the syntax-based method as a VPC if it was found in the list. We also compared our results with the rule-based results available for Wiki50 (Nagy T. and Vincze, 2011) and with the 5-fold cross-validation results of Tu and Roth (2012).
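The evaluation just described can be summarised in a short sketch: precision, recall and F-score are computed on the positive class, and gold VPCs that candidate extraction never produced simply remain unmatched and count as false negatives:

```python
# Positive-class P/R/F over candidate sets; gold VPCs missed by candidate
# extraction stay in `gold` and therefore count as false negatives.
def evaluate(predicted, gold):
    """`predicted` and `gold` are sets of (sentence_id, verb, particle)."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```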
5 Results

Table 5 lists the results obtained using the baseline dictionary lookup, the rule-based method, the dependency parsers and the machine learning approaches on the Wiki50 corpus.

Table 5: Results on the Wiki50 corpus in terms of precision, recall and F-score.

Method              Prec.   Rec.    F-score
Dictionary Lookup   49.77   27.50   35.43
Rule-based          91.26   58.52   71.31
Stanford Parser     91.09   52.57   66.67
Bohnet Parser       89.04   58.16   70.36
ML J48              85.70   76.79   81.00
ML SVM              89.07   65.62   75.57

The dictionary lookup method performed worst: it achieved an F-score of 35.43 and a precision score of only 49.77%. The rule-based method achieved the highest precision score with 91.26%, and the dependency parsers also achieved high precision scores of about 90% on Wiki50. It is also clear that the machine learning-based approach, VPCTagger, is the most successful method on Wiki50: it achieved an F-score about 10 points higher than those of the rule-based method and the dependency parsers, and more than 45 points higher than that of the dictionary lookup.

In order to compare the performance of our system with others, we evaluated it on the Tu&Roth dataset (Tu and Roth, 2012). Table 6 compares the results achieved by the dictionary lookup, the rule-based method, VPCTagger and the system of Tu and Roth (2012), all evaluated in a 5-fold cross-validation manner, as Tu and Roth (2012) applied this evaluation scheme. As positive and negative examples are both marked in the Tu&Roth dataset, we were able to use accuracy as an evaluation metric besides the Fβ=1 scores. The dictionary lookup and the rule-based method achieved an F-score of about 50, but our method proves the most successful on this dataset, as it yields an accuracy 3.32% higher than that of the Tu&Roth system.

Table 6: 5-fold cross-validation results on the Tu&Roth dataset in terms of accuracy and F-score.

Method              Accuracy   F-score
Dictionary Lookup   51.13      52.24
Rule-based          56.92      43.84
VPCTagger           81.92      85.69
Tu&Roth             78.60      –

6 Discussion

The applied machine learning-based method extensively outperformed our dictionary lookup and rule-based baseline methods, which underlines the fact that our approach can be suitably applied to VPC detection in raw texts. It is well demonstrated that VPCs are very ambiguous in raw text: the dictionary lookup method achieved a precision score of only 49.77% on the Wiki50 corpus, which shows that the automatic detection of VPCs is a challenging task and contextual features are essential. In the case of the dictionary lookup, recall was mainly limited by the size of the dictionary used.

As Table 5 shows, VPCTagger achieved an F-score about 10 points higher than those of the dependency parsers, which indicates that our machine learning-based approach performed well on this task. This method also proved to be the most balanced, as it achieved roughly similar recall, precision and F-score results on the Wiki50 corpus, whereas the dependency parsers achieved high precision with lower recall scores.

Moreover, the results obtained with our machine learning approach on the Tu&Roth dataset outperformed those reported in Tu and Roth (2012). This may be attributed to the rich feature set with new features like semantic and contextual features that were used in our system.

As Table 6 indicates, the dictionary lookup and rule-based methods were less effective when applied to the Tu&Roth dataset. Since this corpus was created by collecting sentences that contained phrasal verbs formed with specific verbs, it contains a lot of negative and ambiguous examples besides the annotated VPCs; hence, the distribution of VPCs in the Tu&Roth dataset is not comparable to that in Wiki50, where each occurrence of a VPC was manually annotated in running text. Moreover, in the Tu&Roth dataset, only one positive or negative example was annotated in each sentence, and only the verb-particle pairs formed with the six verbs were examined as potential VPCs. However, the corpus probably contains other VPCs which were not annotated. For example, in the sentence The agency takes on any kind of job – you just name the subject and give us some indication of the kind of thing you want to know, and then we go out and get it for you., only the phrase takes on was listed (as a positive example) in the Tu&Roth dataset, while two further examples (go out – positive and get it for – negative) were not marked. This is problematic if we would like to evaluate our candidate extractor on this dataset, as it would identify all these phrases, even if it is restricted to verb-particle pairs containing one of the six verbs mentioned above, thus yielding false positives already in the candidate extraction phase.

In addition, this dataset contains 878 positive VPC occurrences, but only 23 different VPCs; consequently, some positive examples are overrepresented. The Wiki50 corpus, in contrast, contains some rare examples and probably reflects a more realistic distribution, as it contains 342 unique VPCs.

A striking difference between the Tu&Roth dataset and Wiki50 is that while Tu and Roth (2012) included the verbs do and have in their data, these verbs do not occur at all among the VPCs collected from Wiki50, and they are responsible for only 25 positive VPC examples in the Tu&Roth dataset. Although these verbs are very frequent in language use, they do not seem to occur among the most frequent verbal components of VPCs. A possible reason for this might be that VPCs usually contain a verb referring to movement in its original sense, and neither have nor do belong to motion verbs.
An ablation analysis was carried out to examine the effectiveness of each individual feature type in the machine learning-based candidate classification. Besides the feature classification described in Section 4.3, we also examined the effectiveness of the contextual features: the feature which examined whether the candidate's object was a personal pronoun and the semantic types of the prepositional object, object and subject were treated as contextual features. Table 7 shows the usefulness of each individual feature type on the Wiki50 corpus. For each feature type, a J48 classifier was trained with all of the features except that one; we then compared the performance to that obtained with all the features. As the ablation analysis shows, each type of feature contributed to the overall performance: the lexical and orthographic features were the most powerful, the semantic and syntactic features were also useful, while the contextual features were less effective but were still exploited by the model.

Table 7: The usefulness of individual features in terms of precision, recall and F-score on the Wiki50 corpus.

Removed feature group   Prec.   Rec.    F-score   Diff.
None (all features)     85.70   76.79   81.00     –
Semantic                86.55   66.52   75.22     -5.78
Orthographic            83.26   65.85   73.54     -7.46
Syntax                  84.31   71.88   77.60     -3.40
Lexical                 89.68   60.71   72.41     -8.59
Contextual              86.68   74.55   80.16     -0.84
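The ablation loop itself is simple; in sketch form (hypothetical feature-group names, with train_and_score standing for the J48 training plus cross-validated F-score):

```python
# Leave-one-group-out ablation: drop each feature group in turn, retrain,
# and report the F-score difference against the full feature set.
FEATURE_GROUPS = {
    "orthographic": {"particle_starts_with_a", "multi_token"},
    "lexical": {"frequent_verb", "common_particle", "in_vpc_list"},
    "syntactic": {"dep"},
    "semantic": {"motion_verb", "directional_particle"},
}

def ablation(X, y, train_and_score):
    """X: list of feature dicts; train_and_score returns an F-score."""
    full = train_and_score(X, y)
    for group, names in FEATURE_GROUPS.items():
        reduced = [{k: v for k, v in x.items() if k not in names} for x in X]
        print(group, round(train_and_score(reduced, y) - full, 2))
```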

The most important features in our system are the lexical ones, namely the lists of the most frequent English verbs and particles. This is probably due to the fact that the set of verbs used in VPCs is rather limited; furthermore, particles form a closed word class, that is, they can be fully listed, hence the particle component of a VPC will necessarily come from a well-defined set of words.

Besides the ablation analysis, we also investigated the decision tree model produced by our experiments. The model profited most from the syntactic and lexical features, i.e. the dependency label provided by the parsers between the verb and the particle also played an important role in the classification process.

We carried out a manual error analysis in order to find the most typical errors our system made. Most errors could be traced back to POS-tagging or parsing errors, where the particle was classified as a preposition. VPCs that include an adverb (as labeled by the POS tagger and the parser) were also somewhat more difficult to identify, like come across or go back. Preposition stranding (e.g. in relative clauses) also resulted in false positives, as in planets he had an adventure on.

Other types of multiword expressions were also responsible for errors. For instance, the system classified come out as a VPC within the idiom come out of the closet, but the gold standard annotation in Wiki50 just labeled the whole phrase as an idiom and no internal structure was marked for it. A similar error could be found for light verb constructions: for example, run for office was marked as a light verb construction in the data, but run for was classified as a VPC, yielding a false positive case. Multiword prepositions like up to also led to problems: in he taught up to 1986, taught up was erroneously labeled as a VPC. Finally, in some cases, annotation errors in the gold standard data were the source of mislabeled candidates.

7 Conclusions

In this paper, we focused on the automatic detection of verb-particle combinations in raw texts. Our hypothesis was that parsers trained on texts annotated with extra information for VPCs can identify VPCs in texts. We introduced our machine learning-based tool called VPCTagger, which allowed us to automatically detect VPCs in context. We solved the problem in a two-step approach: in the first step, we extracted potential VPCs from running text with a syntax-based candidate extraction method, and in the second step we applied a machine learning-based approach that made use of a rich feature set to classify the extracted syntactic phrases. In order to achieve greater efficiency, we defined several new features like semantic and contextual ones, and according to our ablation analysis, each type of feature contributed to the overall performance.

Moreover, we also examined how syntactic parsers performed in the VPC detection task on the Wiki50 corpus, and we compared our method with others by evaluating our approach on the Tu&Roth dataset. Our method yielded better results than the dependency parsers on the Wiki50 corpus and than the method reported in Tu and Roth (2012) on the Tu&Roth dataset.

Our results indicate that although the dependency label provided by the parsers is an essential feature in determining whether a specific VPC candidate is a genuine VPC or not, the results can be further improved by extending the system with additional features like lexical and semantic ones. Thus, one possible application of VPCTagger may be to help dependency parsers: based on the output of VPCTagger, the syntactic labels provided by the parsers can be overwritten. With this kind of backtracking, the accuracy of syntactic parsers may increase, which can be useful for a number of higher-level NLP applications that exploit syntactic information.
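This parser-correction idea can be sketched as follows (hypothetical token records, not a specific parser's API):

```python
# If VPCTagger accepts a (verb lemma, particle) pair, overwrite the label the
# parser assigned to that dependency edge with "prt". Tokens are hypothetical
# dicts with "lemma", "head_lemma" and "deprel" keys.
def overwrite_labels(tokens, accepted_pairs):
    for tok in tokens:
        if (tok.get("head_lemma"), tok["lemma"]) in accepted_pairs:
            tok["deprel"] = "prt"
    return tokens
```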
In the future, we would like to improve our system by defining more complex contextual features. We also plan to examine how VPCTagger improves the performance of higher-level NLP applications like machine translation systems, and we would like to investigate the systematic differences between the performance of the parsers and that of VPCTagger, in order to improve the accuracy of parsing. In addition, we would like to compare different automatic detection methods for multiword expressions, as different types of MWEs are manually annotated in the Wiki50 corpus.

Acknowledgments

István Nagy T. was partially funded by the National Excellence Program TÁMOP-4.2.4.A/2-11/1-2012-0001 of the State of Hungary, co-financed by the European Social Fund. Veronika Vincze was funded in part by the European Union and the European Social Fund through the project FuturICT.hu (grant no.: TÁMOP-4.2.2.C-11/1/KONV-2012-0013).

References

Timothy Baldwin and Aline Villavicencio. 2002. Extracting the unextractable: A case study on verb-particles. In Proceedings of the 6th Conference on Natural Language Learning - Volume 20, COLING-02, pages 1–7, Stroudsburg, PA, USA. Association for Computational Linguistics.

Timothy Baldwin. 2005. Deep lexical acquisition of verb-particle constructions. Computer Speech and Language, 19(4):398–414, October.

Timothy Baldwin. 2008. A resource for evaluating the deep lexical acquisition of English verb-particle constructions. In Proceedings of the LREC Workshop Towards a Shared Task for Multiword Expressions (MWE 2008), pages 1–2.

Bernd Bohnet. 2010. Top accuracy and fast dependency parsing is not a contradiction. In Proceedings of Coling 2010, pages 89–97.

Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks. Machine Learning, 20(3):273–297. Kluwer Academic Publishers.

Spence Green, Marie-Catherine de Marneffe, and Christopher D. Manning. 2013. Parsing models for identifying multiword expressions. Computational Linguistics, 39(1):195–227.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. 2009. The WEKA data mining software: an update. SIGKDD Explorations, 11(1):10–18.

Ray Jackendoff. 2002. English particle constructions, the lexicon, and the autonomy of syntax. In Nicole Dehé, Ray Jackendoff, Andrew McIntyre, and Silke Urban, editors, Verb-Particle Explorations, pages 67–94, Berlin / New York. Mouton de Gruyter.

Su Nam Kim and Timothy Baldwin. 2006. Automatic identification of English verb particle constructions using linguistic features. In Proceedings of the Third ACL-SIGSEM Workshop on Prepositions, pages 65–72.

Dan Klein and Christopher D. Manning. 2003. Accurate unlexicalized parsing. In Annual Meeting of the ACL, volume 41, pages 423–430.

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–331.

Francesca Masini. 2005. Multi-word expressions between syntax and the lexicon: The case of Italian verb-particle constructions. SKY Journal of Linguistics, 18:145–173.

Diana McCarthy, Bill Keller, and John Carroll. 2003. Detecting a continuum of compositionality in phrasal verbs. In Proceedings of the ACL 2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment - Volume 18, MWE '03, pages 73–80, Stroudsburg, PA, USA. Association for Computational Linguistics.

István Nagy T. and Veronika Vincze. 2011. Identifying Verbal Collocations in Wikipedia Articles. In Proceedings of the 14th International Conference on Text, Speech and Dialogue, TSD'11, pages 179–186, Berlin, Heidelberg. Springer-Verlag.

Ross Quinlan. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.

Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann Copestake, and Dan Flickinger. 2002. Multiword Expressions: A Pain in the Neck for NLP. In Proceedings of CICLing 2002, pages 1–15, Mexico City, Mexico.

Yuancheng Tu and Dan Roth. 2012. Sorting out the Most Confusing English Phrasal Verbs. In Proceedings of the First Joint Conference on Lexical and Computational Semantics (*SEM), SemEval '12, pages 65–69, Stroudsburg, PA, USA. Association for Computational Linguistics.

Aline Villavicencio. 2003. Verb-particle constructions and lexical resources. In Proceedings of the ACL 2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment - Volume 18, MWE '03, pages 57–64, Stroudsburg, PA, USA. Association for Computational Linguistics.

Veronika Vincze, István Nagy T., and Gábor Berend. 2011. Multiword Expressions and Named Entities in the Wiki50 Corpus. In Proceedings of RANLP 2011, pages 289–295, Hissar, Bulgaria, September. RANLP 2011 Organising Committee.

Veronika Vincze, János Zsibrita, and István Nagy T. 2013. Dependency Parsing for Identifying Hungarian Light Verb Constructions. In Proceedings of the Sixth International Joint Conference on Natural Language Processing, pages 207–215, Nagoya, Japan, October. Asian Federation of Natural Language Processing.