Classification of Verb-Particle Constructions with the Web1T Corpus

Jonathan K. Kummerfeld and James R. Curran
School of Information Technologies
University of Sydney
NSW 2006, Australia
{jkum0593,james}@it.usyd.edu.au

Abstract

Manually maintaining comprehensive databases of multi-word expressions, for example Verb-Particle Constructions (VPCs), is infeasible. We describe a new type-level classifier for potential VPCs, which uses information in the Google Web1T corpus to perform a simple linguistic constituency test. Specifically, we consider the fronting test, comparing the frequencies of the two possible orderings of the given verb and particle. Using only a small set of queries for each verb-particle pair, the system was able to achieve an F-score of 75.7% in our evaluation while processing thousands of queries a second.

1 Introduction

Creating comprehensive linguistic resources manually is an expensive and slow process, making it infeasible for maintaining up-to-date resources of constantly evolving linguistic features, such as Multi-Word Expressions (MWEs). These resources are crucial for a range of Natural Language Processing (NLP) tasks, such as accurate parsing. Identification of MWEs is not essential for producing syntactically accurate parses, but the extra information can improve results. Since manually creating these resources is a challenge, systems are needed that can automatically classify expressions accurately. Also, because these systems will be running during parsing, they need to be fast to prevent the creation of an additional bottleneck.

Here we focus on Verb-Particle Constructions (VPCs), a type of MWE composed of a verb and a particle. Villavicencio (2003a) showed that VPCs are poorly covered by corpus data and constantly growing in number, making them a suitable candidate for an automatic classification system. Previous work on VPCs has mainly focused either on their compositionality (McCarthy et al., 2003), or on using sophisticated parsers to perform extraction from corpora (Baldwin and Villavicencio, 2002). While parser based methods have been very successful (Kim and Baldwin, 2006), they rely on contextual knowledge and complex processing. The Web has been used as a corpus previously by Villavicencio (2003b), who used search engines as a source of statistics for classification, but her aim was to create new resources, rather than a tool that can be used for identification. We have constructed a high-throughput query system for the Google Web1T data, which we use to collect information to perform the fronting linguistic constituency test. The results of this process are used to train a classifier, which can then quickly and accurately classify a given verb-particle pair.

The Google Web1T corpus is over twenty-four gigabytes of plain text in compressed form, making it infeasible to store the entire dataset in memory and perform queries directly. To handle the data we have aggregated the frequency counts for n-grams by using templates that contain a mixture of wildcards and words, where the words are all possibilities taken from three sets. The new data this process produces is stored in a hash table distributed across several computers, making fast queries possible. These queries provide the frequency of templates such as verb ? particle and particle ? verb, which can then be compared as a form of the fronting test. By comparing the frequency of these templates we are able to construct classifiers for verb-particle pairs with an accuracy of 74.7%, recall of 74.7%, and a precision of 76.8%. These results indicate that this method is a promising source of information for fast on-line classification of verb-particle pairs as VPCs.
2 Background

VPCs are MWEs composed of a verb and a particle, where particles are derived from two categories, prepositions and spatial adverbs (Quirk et al., 1985). However, most research on VPCs has focused on prepositions only, because they are more productive. Semantically, the compositionality of multi-word expressions varies greatly. Often they have the property of paraphrasability, being replaceable by a single verb of equivalent meaning, and exhibit prosody, having a distinctive stress pattern.

Previous work concerning VPCs can be broadly divided into two groups: compositionality analysis and classification. Our work is entirely concerned with classification, but we will briefly describe previous work on compositionality as the two areas are closely linked.

2.1 Compositionality

Determining how much the simplex meaning of individual words in a MWE contribute to the overall meaning is challenging, and important for producing semantically correct analysis of text. A range of methods have been considered to differentiate examples based on their degree of semantic idiosyncrasy.

Initial work by Lin (1999) considered the compositionality of MWEs by comparing the distributional characteristics for a given MWE and potentially similar expressions formed by synonym substitution. This was followed by Bannard et al. (2003), who used human non-experts to construct a gold standard dataset of VPCs and their compositionality, which was used to construct a classifier that could judge whether the two words used contributed their simplex meaning. McCarthy et al. (2003) used an automatically acquired thesaurus in the calculation of statistics regarding the compositionality of VPCs and compared their results with statistics commonly used for extracting multiwords, such as latent semantic analysis. Bannard (2005) compared the lexical contexts of VPCs and their component words across a corpus to determine which words are contributing an independent meaning.

Light Verb Constructions (LVCs) are another example of an MWE that has been considered in compositionality studies. The challenge of distinguishing LVCs from idioms was considered by Fazly et al. (2005), who proposed a set of statistical measures to quantify properties that relate to the compositionality of an MWE, ideas that were then extended in Fazly and Stevenson (2007).

Recently, Cook and Stevenson (2006) addressed the question of which sense of the component words is being used in a particular VPC. They focused on the contribution by particles and constructed a feature set based on the properties of VPCs and compared their effectiveness with standard co-occurrence measurements.

2.2 Classification

A range of methods for automatic classification of MWEs have been studied previously, in particular the use of parsers, applying heuristics to n-grams, and queries. Possibly the earliest attempt at classification was by Smadja (1993), who considered verb-particle pairs separated by up to four other words, but did not perform a rigorous evaluation, which prevents us from performing a comparison.

Recent work has focused on using parsers over raw text to gain extra information about the context of the verb-particle pair being considered, in particular the information needed to identify the head noun phrase (NP) for each potential VPC. Early work by Blaheta and Johnson (2001) used a parsed corpus and log-linear models to identify VPCs. This was followed by Baldwin and Villavicencio (2002), who used a range of parser outputs and other features to produce a better informed classifier. Other forms of linguistic and statistical unsupervised methods were considered by Baldwin (2005), such as Pointwise Mutual Information. This work was further extended by Kim and Baldwin (2006) to utilise the sentential context of verb-particle pairs and their associated NP to improve results.

The closest work to our own in terms of the corpus used is that of Villavicencio (2003b), in which the web is used to test the validity of candidate VPCs. Also, like our work, Villavicencio does not consider the context of the verb and particle, unlike the parser based methods described above. A collection of verb-particle pairs was generated by combining verbs from Levin's classes (Levin, 1993) with the particle up. Each potential VPC was passed to the search engine Google to obtain an approximate measure of their frequency, first on their own, then in the form Verb up for, to prevent the inclusion of prepositional verb forms. The measurement is only approximate because the value is a document count, not an actual frequency as in the Web1T data. This led to the identification of many new VPCs, but the results were not evaluated beyond measurements of the number of identified VPCs attested in current resources, making comparison with our own work difficult.

When a verb-particle pair is being classified, the possibilities other than a VPC are a prepositional verb, or a free verb-preposition combination. A variety of tests exist for determining which of these a candidate verb-particle pair is. For the transitive case, Baldwin and Villavicencio (2002) applied three tests. First, that VPCs are able to undergo particle alternation, providing the example that hand in the paper and hand the paper in are both valid, while refer to the book is, but *refer the book to is not. Second, that pronominal objects must be expressed in the split configuration, providing the example that hand it in is valid, whereas *hand in it is not. And finally, that manner adverbs cannot occur between the verb and particle. The first two of these tests are context based, and therefore not directly usable in our work, and the third is particularly specific. Instead, for now we have focused on the 'fronting' constituency test, which involves a simple rearrangement of the phrase, for example hand in the paper would be compared to *in the paper hand.
2.3 The Google Web1T corpus

The Google Web1T corpus is an extremely large collection of n-grams, extracted from slightly more than one trillion words of English from public web pages (Brants and Franz, 2006). This dataset poses new challenges for data processing systems, as it is more than twenty-four gigabytes in compressed form.

The corpus contains n-grams up to 5-grams, such as in Table 1. This example considers the pair of words 'ferret' and 'out', showing all entries in the corpus that start and end with the two words, excluding those that contain non-alphabetical characters.

    N-gram                          Frequency
    ferret out                          79728
    out ferret                             74
    ferret her out                         52
    ferret him out                        342
    ferret information out                 43
    ferret is out                          54
    ferret it out                        1562
    ferret me out                          58
    ferret that out                       180
    ferret them out                      1582
    ferret these out                      232
    ferret things out                      58
    ferret this out                       148
    ferret you out                        100
    out a ferret                           63
    out of ferret                          71
    out the ferret                        120
    ferret it all out                      47
    ferret these people out                60
    ferret these projects out              52
    out of the ferret                      54
    ferret lovers can ferret out           45
    out a needing shelter ferret           63

Table 1: N-grams in the Web1T corpus for the VPC ferret out.

One approach to this dataset is to simply treat it as a normal collection of n-gram frequencies, scanning the relevant sections for answers, such as in Bergsma et al. (2008) and Yuret (2007). Another approach is used by Talbot and Brants (2008), pruning the corpus to a third of its size and quantising the rest, reducing the space used by frequency counts to eight bits each. These methods have the disadvantages of slow execution and of only providing approximate frequencies, respectively.

Hawker et al. (2007) considered two methods for making practical queries possible: pre-processing of the data, and pre-processing of queries. The first approach is to reduce the size of the dataset by decreasing the resolution of measurements, and by accessing n-grams by implicit information based on their location in the compressed data, rather than their actual representation. While this avoids the cost of processing the entire dataset for each query, it does produce less accurate results. The second method described is to intelligently batch queries, then perform a single pass through the data to answer them all.

The Web1T corpus is not the only resource for n-gram counts on the web. Lapata and Keller (2005) use Web counts, the number of hits for a search term, as a measure of the frequency of n-grams. This approach has the disadvantage that the frequency returned is actually a document frequency, whereas the Web1T values are counts of the occurrences of the actual n-gram, including repetitions in a single page.

2.4 Weka

Weka is a collection of machine learning algorithms for data mining tasks (Witten and Frank, 2005). From the range of tools available we have used a selection of the classifiers: ZeroR, OneR, Id3, J48, Naive Bayes and Multilayer Perceptron.

As a baseline measure we used the ZeroR classifier, which always returns the mode of the set of nominal options. Next we considered the OneR classifier, which compares the information gain of each of the features and chooses the most effective. Two decision tree classifiers were considered, Id3 (Quinlan, 1993b) and J48 (Quinlan, 1993a). Both initially construct the tree in the same way, using information gain to choose the feature to split on next, until either no features remain, or all samples are of the same class. The difference is that J48 attempts to prune the tree. We also considered two non-decision based classifiers, Naive Bayes and Multilayer Perceptron. The Naive Bayes classifier assumes features are independent and applies Bayes' Theorem to calculate the probability of an example belonging to a particular class. The Multilayer Perceptron uses multiple layers of sigmoid nodes, whose links are adjusted during training by back-propagation of corrections (Rumelhart et al., 1986).

3 Classifying verb-particle pairs

Our aim was to construct a system capable of classifying a given verb-particle pair as a potential VPC or not. To do this we currently apply the fronting test to the verb-particle pair. We describe the classification given by the system as 'potential' because the system only uses the verb-particle pair and not the specific context being considered. The system also needed a simple interface for making fast queries.

The fronting test is a linguistic constituency test that gauges whether words function as a single unit in a phrase by breaking the phrase in half between the words and swapping the order of the two halves (Quirk et al., 1985). In the case of VPCs this means the following rearrangement (note that NP2 may not always be present):

• NP1 Verb Particle NP2
• Particle NP2 NP1 Verb

If the rearranged form of a phrase is also valid, it implies the verb and particle are not functioning as a single unit in the phrase, and so we do not classify them as a VPC.

In this structure there are two forms of verbs to consider. For the intransitive case a VPC always results (Baldwin and Villavicencio, 2002), for example:

• He gets around.
• *Around he gets.

However, in the transitive case there are three possibilities, for example:

• The mafia ran down Joe.
• *Down Joe the mafia ran.

• What did you refer to?
• To what did you refer?

• I walked to the house.
• To the house, I walked.

The last two examples are not VPCs, but rather examples of a prepositional verb and a free verb-preposition combination respectively. Note that both orders are valid for these, whereas for the VPC the second is not.

We use the frequency counts in the Web1T data to measure the validity of particular orderings. For this purpose the meaning of the absolute value returned is difficult to quantify, so instead we compare the frequencies of multiple arrangements of the same verb-particle pair. The more common arrangement is treated as being more valid. The exact form of these comparisons is described in the following sections.
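As a concrete illustration, this decision rule can be sketched in Python (the language our query interface is exposed in). The counts below are invented for the example; the real comparisons are defined in the following sections.

```python
# Illustrative sketch of the ordering comparison: whichever arrangement
# is more frequent in the corpus is treated as more valid. The counts
# here are invented; real values come from the Web1T queries described
# in Section 4.

def prefers_vpc_order(verb_particle_count, particle_verb_count):
    """True if the verb-particle ordering dominates the fronted one."""
    return verb_particle_count > particle_verb_count

# "He gets around" is common; "*Around he gets" is essentially unattested.
print(prefers_vpc_order(100000, 12))  # -> True, consistent with a VPC
```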
4 Implementation

4.1 Structure

At the core of our system is a hash table that allows random access to the parts of the Web1T data that are relevant to our queries. The table uses linear probing based on a 64-bit key that is a numerical representation of each word pattern, interpreted in base thirty-seven. The initial position considered is the key value modulo the size of our table (which may vary to allow support for different amounts of system RAM). While there is the possibility of collisions, the probability of their occurrence is extremely small, and none were observed during testing.

Word patterns are used to generate the key, and at the location specified by the key (or a position that is probed to) we store the complete key and four integers. The system is designed to process any string to generate the key, rather than specifically VPCs, and the four integers stored are returned directly, to be manipulated as desired. We use these four integers to hold aggregated counts, and the string used is based on word patterns containing wildcards.

The word patterns used to create the keys may contain two types of symbols: a wildcard, or a set reference. Wildcards represent any word not in the given sets, where we do not include punctuation, numbers or sentence boundaries as words. Set references indicate that the word at that position must come from one of the provided sets. Currently the system handles up to three sets, each of which can be referenced at most once. Also note that patterns with leading or trailing wildcards do not need to be considered, since they are already included in the count for the equivalent n-gram without the leading or trailing wildcards.

For two sets, the four patterns used are as follows, where * represents a wildcard, and A and B are set references:

1. A B
2. A * B
3. A * * B
4. A * * * B

For three sets we cannot store the frequency of each pattern separately, as there are six possible patterns but only four spaces in each entry of the table. We chose to aggregate the values further, accepting the following patterns:

1. A B C
2. A * B C or A B * C
3. A * * B C or A * B * C
4. A B * * C

All data in the Web1T corpus that does not conform to one of these patterns is ignored. This means the size of the hash table needed, and therefore the amount of memory needed, depends entirely on the sizes of the word sets provided.

The hash table, word sets, and their interface are written in C++, but are wrapped up inside a Python extension. This Python extension is used by a server-client interface, where each server holds a single hash table that has all the aggregated counts for words within a specific alphabetic range. To perform queries a client connects to a set of servers, receives their ranges, and then requests frequency counts directly from the appropriate server. This structure effectively increases the size of the system memory that is accessible, but also has the side effect that multiple clients can query the system at the same time at high speed.
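A minimal Python sketch of the hashing and probing scheme follows. It is an illustration rather than our C++ implementation; in particular, the character-to-digit mapping is an assumption, since only the base is fixed above.

```python
# Sketch of the Section 4.1 design: a word pattern is hashed to a 64-bit
# key by interpreting the string in base thirty-seven, and entries (key
# plus four aggregated counts) are stored with linear probing. The digit
# assignment below (letters 1-26, digits 27-36, everything else 0) is an
# assumption; only the base is fixed by the design.

MASK64 = (1 << 64) - 1

def pattern_key(pattern):
    key = 0
    for ch in pattern.lower():
        if 'a' <= ch <= 'z':
            digit = ord(ch) - ord('a') + 1
        elif '0' <= ch <= '9':
            digit = ord(ch) - ord('0') + 27
        else:
            digit = 0  # wildcards, spaces and other symbols share a digit
        key = (key * 37 + digit) & MASK64
    return key

class PatternTable:
    """Open-addressed table of (key, four counts) entries."""

    def __init__(self, size):
        self.slots = [None] * size  # size chosen to fit available RAM

    def _probe(self, key):
        i = key % len(self.slots)
        # Linear probing: step forward until the key or an empty slot is
        # found (the table is assumed never to fill completely).
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) % len(self.slots)
        return i

    def add(self, pattern, counts):
        key = pattern_key(pattern)
        self.slots[self._probe(key)] = (key, tuple(counts))

    def lookup(self, pattern):
        slot = self.slots[self._probe(pattern_key(pattern))]
        return slot[1] if slot else (0, 0, 0, 0)
```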
4.2 Statistics

We performed our experiments using three sets of words: verbs, particles and pronouns. The sizes and sources of the sets were:

• Verbs - 19,046 - all words with a VB based tag in the Wall Street Journal section of the Penn Treebank
• Possible Particles - 257 - all words with either an IN or RP tag in the Wall Street Journal section of the Penn Treebank
• Pronouns - 77 - from Wikipedia

This leads to 2,261,407,764 possible permutations of the words in the word sets. The storage cost for each permutation is 24 bytes: eight for the 64-bit key, and the other sixteen for the four integers, which occupy four bytes each. However, because most of these permutations are not found in the Web1T data, a total of only approximately 2.8 gigabytes of system memory was required.

Since one of our aims was to produce a quick method for classification, the speed of the system is also worth considering. Once running, the servers were able to handle between five and ten thousand queries per second. The time to start the servers, which includes reading all of the Web1T data, was between sixty and ninety minutes. However, this time could be drastically reduced for subsequent runs by writing the constructed hash table out to disk and reading it directly into memory the next time the system needs to be started.
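The server-client structure described in Section 4.1 can be sketched in the same style. The in-process server objects below stand in for network connections, whose protocol is not specified here.

```python
# Sketch of the client-side routing: each server holds the aggregated
# counts for words in one alphabetic range, and the client sends each
# query to the server covering the pattern's first word. Real servers
# are separate processes; tables here stand in for those connections.

class QueryClient:
    def __init__(self, servers):
        # servers: list of (low, high, table) with inclusive word ranges
        self.servers = servers

    def counts(self, pattern):
        first = pattern.split()[0]
        for low, high, table in self.servers:
            if low <= first <= high:
                return table.lookup(pattern)
        raise KeyError("no server covers %r" % first)
```

With two PatternTable instances covering, say, 'a'-'m' and 'n'-'z', this reproduces the distributed layout on a single machine.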
4.3 Hash table queries for VPCs

Once the hash table is constructed, the system is able to take two or three words and return the four values corresponding to their position in the hash table (where the words are considered in the order they are given). A simplified form of the fronting test considers the patterns in Table 2.

    Pattern                  Frequency
    Verb Particle            2 436 566
    Verb * Particle            747 492
    Verb * * Particle           78 569
    Verb * * * Particle         19 549
    Particle Verb            2 606 582
    Particle * Verb          2 326 720
    Particle * * Verb           68 540
    Particle * * * Verb         14 016

Table 2: Word patterns and example aggregated frequencies for the VPC hand in.

The word patterns in the top half of Table 2 correspond to NP1 Verb Particle NP2. We have considered the cases with wildcards between the verb and particle since most VPCs can occur in the split configuration. The patterns in the bottom half of Table 2 correspond to Particle NP2 NP1 Verb. We allow the case with a single wildcard to cover examples without a second NP. To get the values for these patterns the system is queried twice, once with Verb Particle and then with Particle Verb.

One problem with this comparison is that the contributions to the values could be coming from different uses of the pair, such as in Table 1, where ferret is out and ferret it out will both contribute to the frequency for the pattern Verb * Particle. To deal with this we added queries that used the three set patterns. As the third set we used pronouns, which can approximate single word NPs. This method does have the drawback that frequencies will be lower, leading to more unattested examples, but does provide extra information in most cases.

As mentioned previously, it is difficult to quantify the absolute values returned by these queries, but by comparing them we are able to measure the relative validity of the two orderings, as described in the following section.
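The two queries behind Table 2 can be combined as in the sketch below. Summing the four counts for each ordering is only one illustrative choice; the comparisons actually used as features are defined in the next section.

```python
# Illustrative use of the two queries behind Table 2: one lookup per
# ordering, each returning the counts for zero to three intervening
# wildcards. `client` follows the QueryClient sketch from Section 4.1.

def fronting_query(client, verb, particle):
    split = client.counts("%s %s" % (verb, particle))    # Verb ... Particle
    fronted = client.counts("%s %s" % (particle, verb))  # Particle ... Verb
    return split, fronted

def looks_like_vpc(client, verb, particle):
    # Summing the four counts is one simple way to compare the orderings.
    split, fronted = fronting_query(client, verb, particle)
    return sum(split) > sum(fronted)
```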

5 Results

For the purposes of classification we created a set of questions that have true or false answers and can be answered based on queries to our system. In many of these cases we consider the sum of a collection of values from our hash table. To keep the descriptions here concise we have used the symbol '?' to denote the optional presence of another wildcard. The value for patterns containing '?'s is the sum of all the possible patterns, with or without wildcards at those positions. Also, rather than using the entire words, we have used 'v' for Verbs, 'p' for Particles and 'n' for Pronouns.

    Name      Comparison
    vp-1      v ? ? ? p > p ? v
    vp-2      v p > p v
    nv p-1    n v ? ? p > p n v
    nv p-2    n v ? ? p > p ? ? n v
    n vp-1    n ? v ? p > p n ? v
    n vp-2    n ? v ? p > p ? n ? v

Table 3: Comparisons for judging whether a verb and particle form a VPC.

To produce a set of invalid VPCs we used our lists of verbs and particles to generate a random list of two thousand verb-particle pairs not in our set of VPCs. The set of two thousand true VPCs was constructed randomly from a large set containing 116 from McCarthy et al. (2003) and 7140 from Baldwin and Villavicencio (2002).

For every pair a feature vector containing the results of these comparisons was created, as demonstrated in Table 4, containing the answers to the questions described previously. Weka was then used to generate and evaluate a range of classification techniques. All methods were applied with the standard settings and using ten-fold cross-validation.

    Test      Comparison               Result
    vp-1      3 282 176 > 82 556       True
    vp-2      2 436 566 > 2 326 720    True
    nv p-1    435 822 > 1 819 931      False
    nv p-2    435 822 > 1 819 931      False
    n vp-1    0 > 0                    False
    n vp-2    0 > 0                    False

Table 4: Example comparisons for the VPC hand in.

    Classifier             Accuracy   Precision   Recall   F-score
    ZeroR                  50.0%      25.0%       50.0%    33.3%
    OneR                   74.4%      76.4%       74.4%    75.4%
    Id3                    74.7%      76.8%       74.7%    75.7%
    J48                    74.4%      76.4%       74.4%    75.4%
    Naive Bayes            74.7%      76.8%       74.7%    75.7%
    Multilayer Perceptron  73.4%      73.6%       73.4%    73.5%

Table 5: Results using Weka on simple query answers.
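The construction of these feature vectors can be sketched as follows. Only two of the six comparisons are shown, with simplified patterns standing in for those of Table 3.

```python
# Sketch of building a Table 4-style feature vector. The two comparisons
# shown are simplified stand-ins for the six in Table 3; `client.counts`
# returns the four aggregated counts for a pattern, as in the earlier
# sketches.

def feature_vector(client, verb, particle, pronouns):
    v_p = sum(client.counts("%s %s" % (verb, particle)))
    p_v = sum(client.counts("%s %s" % (particle, verb)))
    features = {"vp-1": v_p > p_v}

    # Three-word patterns use a pronoun to approximate the NPs of the
    # fronting test; summing over the pronoun set reduces sparseness.
    n_v_p = sum(sum(client.counts("%s %s %s" % (n, verb, particle)))
                for n in pronouns)
    p_n_v = sum(sum(client.counts("%s %s %s" % (particle, n, verb)))
                for n in pronouns)
    features["n vp-1"] = n_v_p > p_n_v
    return features
```

Each boolean vector is then passed to Weka, which treats the answers as nominal attributes.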
6 Discussion

This lightweight method of classification does not match up to results from parser based methods, such as Kim and Baldwin (2006), who achieved an F-score of 97.4%. However, this high performance relies on context awareness and only applied to the identification of frequent VPC examples. In contrast, by drawing on the Web1T dataset our system overcomes the data sparseness problem. The cost of attempting to classify without knowledge of context is that we apply a linguistically poor approach, achieving lower performance, but there is plenty of scope for extension.

The system as it stands is very flexible and can easily scale to support larger sets of words, making a broader range of queries possible. Also, the post-processing of query results could become more sophisticated, and the classifiers used could be tweaked to be more effective. The system itself could also be modified to store other information, making different types of queries possible.

One challenge that remains unaddressed is how to handle multiple uses of the same word pair. For example, the following are both valid:

• The deadline is coming up.
• Joe is an up and coming athlete.

In these two cases the words coming and up have different semantic meanings, but both forms will contribute to the frequency counts for the word pair. While this does skew the results of our queries, we should be able to limit the effect by using the third set to constrain how the words are being used.

Despite the support for three sets of words, we only really have flexibility in one of them, as we must use the other two for Verbs and Particles. Here we chose Pronouns because they could act as single word NPs, allowing us to perform the fronting test more effectively. However, there are other linguistic tests for VPCs. In particular, two of the tests performed by Baldwin and Villavicencio (2002) could be adapted to our system. One is that transitive VPCs must be expressed in the split configuration when the second NP is pronominal, such as hand it in, which is valid, and *hand in it, which is not. Our third set already contains pronouns; all that needs to be done is a comparison of the frequency of Verb Pronoun Particle and Verb Particle Pronoun, giving a measure of which is more valid (see the sketch below). The other test is that manner adverbs cannot occur between the verb and the particle. This could be tested by expanding our third set to include manner adverbs, and then comparing the frequency of Verb Manner-Adverb Particle and Verb Particle.

So far we have only considered a small number of ways to combine the results of the queries to produce features for classifiers. All of the features we considered here were true/false answers to simple questions, constructed intuitively, but without much experimentation. By breaking up our training set and using a part of it to explore the statistical properties of the results from queries, we may be able to identify unexpected, but effective, discriminating factors. We may also be able to improve performance by considering a broader range of classifiers, and by adjusting their input parameters to suit our task.

Finally, the system itself could be modified to store more information and to support other types of queries. The corpus does contain sentence boundary markers, and we could use capitalisation to extend this information approximately without wasting a word in the n-gram. Currently these are entirely ignored, except when a sentence boundary occurs in the middle of an n-gram, in which case the n-gram is excluded from the frequency count. Also, because the current system uses linear probing, the entire hash table is stored as a single block of memory. This means the fread and fwrite functions could easily be used to save it on disk, which would drastically reduce the system restart time (assuming the same word sets are being used).
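A minimal sketch of the adapted split-configuration test mentioned above, under the same assumptions as the earlier sketches:

```python
# Sketch of the adapted split-configuration test: for a transitive VPC,
# "verb pronoun particle" (hand it in) should dominate "verb particle
# pronoun" (*hand in it). As before, `client.counts` returns four
# aggregated counts and the pronoun set approximates pronominal NPs;
# the summation over pronouns is an assumption.

def split_configuration_test(client, verb, particle, pronouns):
    split = sum(sum(client.counts("%s %s %s" % (verb, n, particle)))
                for n in pronouns)
    joined = sum(sum(client.counts("%s %s %s" % (verb, particle, n)))
                for n in pronouns)
    return split > joined  # True is consistent with VPC behaviour
```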

7 Conclusion

We have constructed a system for quickly querying the Web1T corpus, providing the information needed to perform the fronting linguistic constituency test. By performing this test on potential VPCs, on their own, and in combination with a range of pronouns, we produce a feature vector that can be used to construct a classifier. The classifiers produced are accurate and fast, as the only information they need can be quickly accessed from the Web1T corpus by our system. The variation in performance between classifiers is small, but the best performance is by the Id3 and Naive Bayes classifiers, which have a recall of 74.7% and a precision of 76.8%.

A range of other linguistic tests exist, which our system could be extended to support. Adding this extra information to our feature vectors and exploring classification methods further could lead to improved performance. Another area for further investigation is the interpretation of the raw frequency counts and the way comparisons are performed.

Our system is capable of scaling to support a larger proportion of the Web1T data, while retaining the ability to respond quickly to many queries. Also, by adjusting the word sets provided and/or the templates that are used to aggregate frequency counts, our system could be altered to support different types of queries. This flexibility means it could be useful in other tasks that involve querying the Web1T corpus.

8 Acknowledgement

We would like to thank the anonymous reviewers for their helpful feedback and corrections. J. K. K. thanks The University of Sydney for the support of a Merit Scholarship and the Talented Student Program that enabled this work.

References

Timothy Baldwin and Aline Villavicencio. 2002. Extracting the unextractable: a case study on verb-particles. In Proceedings of the 2002 Conference on Natural Language Learning, pages 1–7, Taipei, Taiwan, August.

Timothy Baldwin. 2005. Looking for prepositional verbs in corpus data. In Proceedings of the 2005 Meeting of the Association for Computational Linguistics: Workshop on the Linguistic Dimensions of Prepositions and their Use in Computational Linguistics Formalisms and Applications, pages 115–126, Colchester, UK, April.

Colin Bannard, Timothy Baldwin, and Alex Lascarides. 2003. A statistical approach to the semantics of verb-particles. In Proceedings of the 2003 Meeting of the Association for Computational Linguistics: Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, pages 65–72, Sapporo, Japan, July.

Colin Bannard. 2005. Learning about the meaning of verb-particle constructions from corpora. Computer Speech and Language, 19(4):467–478.

Shane Bergsma, Dekang Lin, and Randy Goebel. 2008. Distributional identification of non-referential pronouns. In Proceedings of the 2008 Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 10–18, Columbus, Ohio, June.

Don Blaheta and Mark Johnson. 2001. Unsupervised learning of multi-word verbs. In Proceedings of the 2001 Meeting of the Association for Computational Linguistics: Workshop on Collocation: Computational Extraction, Analysis and Exploitation of Collocations, pages 54–60, Toulouse, France, July.

Thorsten Brants and Alex Franz. 2006. Web 1T 5-gram corpus version 1.1. Technical report, Google Research.

Paul Cook and Suzanne Stevenson. 2006. Classifying particle semantics in English verb-particle constructions. In Proceedings of the 2006 Meeting of the Association for Computational Linguistics and the International Committee on Computational Linguistics Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, pages 45–53, Sydney, Australia, July.

Afsaneh Fazly and Suzanne Stevenson. 2007. Distinguishing subtypes of multiword expressions using linguistically-motivated statistical measures. In Proceedings of the 2007 Meeting of the Association for Computational Linguistics: Workshop on A Broader Perspective on Multiword Expressions, pages 9–16, Prague, Czech Republic, June.

Afsaneh Fazly, Ryan North, and Suzanne Stevenson. 2005. Automatically distinguishing literal and figurative usages of highly polysemous verbs. In Proceedings of the 2005 Meeting of the Association for Computational Linguistics: Workshop on Deep Lexical Acquisition, Ann Arbor, USA, July.

Tobias Hawker, Mary Gardiner, and Andrew Bennetts. 2007. Practical queries of a massive n-gram database. In Proceedings of the Australasian Language Technology Workshop 2007, pages 40–48, Melbourne, Australia, December.

Su Nam Kim and Timothy Baldwin. 2006. Automatic identification of English verb particle constructions using linguistic features. In Proceedings of the 2006 Meeting of the Association for Computational Linguistics: Workshop on Prepositions, pages 65–72, Trento, Italy, April.

Mirella Lapata and Frank Keller. 2005. Web-based models for natural language processing. ACM Transactions on Speech and Language Processing, 2(1):1–33.

Beth Levin. 1993. English Verb Classes and Alternations: A Preliminary Investigation. University of Chicago Press.

Dekang Lin. 1999. Automatic identification of non-compositional phrases. In Proceedings of the 1999 Annual Meeting of the Association for Computational Linguistics, pages 317–324, College Park, Maryland, USA.

Diana McCarthy, Bill Keller, and John Carroll. 2003. Detecting a continuum of compositionality in phrasal verbs. In Proceedings of the 2003 Meeting of the Association for Computational Linguistics: Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, Sapporo, Japan, July.

John Ross Quinlan. 1993a. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.

John Ross Quinlan. 1993b. Induction of decision trees, pages 349–361. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.

Randolph Quirk, Sidney Greenbaum, Geoffrey Leech, and Jan Svartvik. 1985. A Comprehensive Grammar of the English Language. Longman, London.

David Rumelhart, Geoffrey Hinton, and Ronald Williams. 1986. Learning internal representations by error propagation, pages 318–362. MIT Press, Cambridge, MA, USA.

Frank Smadja. 1993. Retrieving collocations from text: Xtract. Computational Linguistics, 19(1):143–177.

David Talbot and Thorsten Brants. 2008. Randomized language models via perfect hash functions. In Proceedings of the 2008 Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 505–513, Columbus, Ohio, June.

Aline Villavicencio. 2003a. Verb-particle constructions and lexical resources. In Proceedings of the 2003 Meeting of the Association for Computational Linguistics: Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, pages 57–64, Sapporo, Japan, July.

Aline Villavicencio. 2003b. Verb-particle constructions in the World Wide Web. In Proceedings of the 2003 Meeting of the Association for Computational Linguistics: Workshop on the Linguistic Dimensions of Prepositions and their Use in Computational Linguistics Formalisms and Applications, Sapporo, Japan, July.

Ian H. Witten and Eibe Frank. 2005. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann, San Francisco, 2nd edition.

Deniz Yuret. 2007. KU: Word sense disambiguation by substitution. In Proceedings of the 2007 International Workshop on Semantic Evaluations, Prague, Czech Republic, June.