Identifying Phrasal Verbs Using Many Bilingual Corpora

Karl Pichotta∗
Department of Computer Science
University of Texas at Austin
[email protected]

John DeNero
Google, Inc.
[email protected]

Abstract

We address the problem of identifying multiword expressions in a language, focusing on English phrasal verbs. Our polyglot ranking approach integrates frequency statistics from translated corpora in 50 different languages. Our experimental evaluation demonstrates that combining statistical evidence from many parallel corpora using a novel ranking-oriented boosting algorithm produces a comprehensive set of English phrasal verbs, achieving performance comparable to a human-curated set.

1 Introduction

A multiword expression (MWE), or noncompositional compound, is a sequence of words whose meaning cannot be composed directly from the meanings of its constituent words. These idiosyncratic phrases are prevalent in the lexicon of a language; Jackendoff (1993) estimates that their number is on the same order of magnitude as that of single words, and Sag et al. (2002) suggest that they are much more common, though quantifying them is challenging (Church, 2011). The task of identifying MWEs is relevant not only to lexical semantics applications, but also machine translation (Koehn et al., 2003; Ren et al., 2009; Pal et al., 2010), information retrieval (Xu et al., 2010; Acosta et al., 2011), and syntactic parsing (Sag et al., 2002). Awareness of MWEs has empirically proven useful in a number of domains: Finlayson and Kulkarni (2011), for example, use MWEs to attain a significant performance improvement in word sense disambiguation; Venkatapathy and Joshi (2006) use features associated with MWEs to improve word alignment.

We focus on a particular subset of MWEs, phrasal verbs. A phrasal verb consists of a verb followed by one or more particles, such that the meaning of the phrase cannot be determined by combining the simplex meanings of its constituent words (Baldwin and Villavicencio, 2002; Dixon, 1982; Bannard et al., 2003).1 Examples of phrasal verbs include count on [rely], look after [tend], or take off [remove], the meanings of which do not involve counting, looking, or taking. In contrast, there are verbs followed by particles that are not phrasal verbs, because their meaning is compositional, such as walk towards, sit behind, or paint on.

We identify phrasal verbs by using frequency statistics calculated from parallel corpora, consisting of bilingual pairs of documents such that one is a translation of the other, with one document in English. We leverage the observation that a verb will translate in an atypical way when occurring as the head of a phrasal verb. For example, the word look in the context of look after will tend to translate differently from how look translates generally. In order to characterize this difference, we calculate a frequency distribution over translations of look, then compare it to the distribution of translations of look when followed by the word after. We expect that idiomatic phrasal verbs will tend to have unexpected translations of their head verbs, measured by the Kullback-Leibler divergence between those distributions.

Our polyglot ranking approach is motivated by the hypothesis that using many parallel corpora of different languages will help determine the degree of semantic idiomaticity of a phrase. In order to combine evidence from multiple languages, we develop a novel boosting algorithm tailored to the task of ranking multiword expressions by their degree of idiomaticity. We train and evaluate on disjoint subsets of the phrasal verbs in English Wiktionary2. In our experiments, the set of phrasal verbs identified automatically by our method achieves held-out recall that nears the performance of the phrasal verbs in WordNet 3.0, a human-curated set. Our approach strongly outperforms a monolingual system, and continues to improve when incrementally adding translation statistics for 50 different languages.

∗ Research conducted during an internship at Google.
1 Nomenclature varies: the term verb-particle construction is also used to denote what we call phrasal verbs; further, the term phrasal verb is sometimes used to denote a broader class of constructions.
2 http://en.wiktionary.org

Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 636–646, Seattle, Washington, USA, 18-21 October 2013. © 2013 Association for Computational Linguistics

Feature     Description
ϕL (×50)    KL divergence for each language L
µ1          frequency of phrase given verb
µ2          PMI of verb and particles
µ3          µ1 with an interposed pronoun

Table 1: Features used by the polyglot ranking system.

2 Identifying Phrasal Verbs

The task of identifying phrasal verbs using corpus information raises several issues of experimental design. We consider four central issues below in motivating our approach.

Types vs. Tokens. When a phrase is used in context, it takes a particular meaning among its possible senses. Many phrasal verbs admit compositional senses in addition to idiomatic ones—contrast idiomatic "look down on him for his politics" with compositional "look down on him from the balcony." In this paper, we focus on the task of determining whether a phrase type is a phrasal verb, meaning that it frequently expresses an idiomatic meaning across its many token usages in a corpus. We do not attempt to distinguish which individual phrase tokens in the corpus have idiomatic senses.

Ranking vs. Classification. Identifying phrasal verbs involves relative, rather than categorical, judgments: some phrasal verbs are more compositional than others, but retain a degree of noncompositionality (McCarthy et al., 2003). Moreover, a polysemous phrasal verb may express an idiosyncratic sense more or less often than a compositional sense in a particular corpus. Therefore, we should expect a corpus-driven system not to classify phrases as strictly idiomatic or compositional, but instead assign a ranking or relative scoring to a set of candidates.

Candidate Phrases. We distinguish between the task of identifying candidate multiword expressions and the task of ranking those candidates by their semantic idiosyncrasy. With English phrasal verbs, it is straightforward to enumerate all desired verbs followed by one or more particles, and rank the entire set.

Using Parallel Corpora. There have been a number of approaches proposed for the use of multilingual resources for MWE identification (Melamed, 1997; Villada Moirón and Tiedemann, 2006; Caseli et al., 2010; Tsvetkov and Wintner, 2012; Salehi and Cook, 2013). Our approach differs from previous work in that we identify MWEs using translation distributions of verbs, as opposed to 1–1, 1–m, or m–n word alignments, most-likely translations, bilingual dictionaries, or distributional entropy. To the best of our knowledge, ours is the first approach to use translational distributions to leverage the observation that a verb typically translates differently when it heads a phrasal verb.

3 The Polyglot Ranking Approach

Our approach uses bilingual and monolingual statistics as features, computed over unlabeled corpora. Each statistic characterizes the degree of idiosyncrasy of a candidate phrasal verb, using a single monolingual or bilingual corpus. We combine features for many language pairs using a boosting algorithm that optimizes a ranking objective using a supervised training set of English phrasal verbs. Each of these aspects of our approach is described in detail below; for reference, Table 1 provides a list of the features used.

3.1 Bilingual Statistics

One of the intuitive properties of an MWE is that its individual words likely do not translate literally when the whole expression is translated into another language (Melamed, 1997). We capture this effect
by measuring the divergence between how a verb translates generally and how it translates when heading a candidate phrasal verb.

A parallel corpus is a collection of document pairs ⟨DE, DF⟩, where DE is in English, DF is in another language, one document is a translation of the other, and all documents DF are in the same language. A phrase-aligned parallel corpus aligns those documents at a sentence, phrase, and word level. A phrase e aligns to another phrase f if some word in e aligns to some word in f and no word in e or f aligns outside of f or e, respectively. As a result of this definition, the words within an aligned phrase pair are themselves connected by word-level alignments.

Given an English phrase e, define F(e) to be the set of all foreign phrases observed aligned to e in a parallel corpus. For any f ∈ F(e), let P(f | e) be the conditional probability of the phrase e translating to the phrase f. This probability is estimated as the relative frequency of observing f and e as an aligned phrase pair, conditioned on observing e aligned to any phrase in the corpus:

    P(f | e) = N(e, f) / Σ_{f′} N(e, f′)

with N(e, f) the number of times e and f are observed occurring as an aligned phrase pair.

Next, we assign statistics to individual verbs within phrases. The first word of a candidate phrasal verb e is a verb. For a candidate phrasal verb e and a foreign phrase f, let π1(e, f) be the subphrase of f that is most commonly word-aligned to the first word of e. As an example, consider the phrase pair e = talk down to and f = hablar con menosprecio. Suppose that when e is aligned to f, the word talk is most frequently aligned to hablar. Then π1(e, f) = hablar.

For a phrase e and its set F(e) of aligned translations, we define the constituent translation probability of a foreign subphrase x as:

    Pe(x) = Σ_{f ∈ F(e)} P(f | e) · δ(π1(e, f), x)    (1)

where δ is the Kronecker delta function, taking value 1 if its arguments are equal and 0 otherwise. Intuitively, Pe assigns the probability mass for every f to its subphrase most commonly aligned to the verb in e. It expresses how this verb is translated in the context of a phrasal verb construction.3 Equation (1) defines a distribution over all phrases x of a foreign language.

We also assign statistics to verbs as they are translated outside of the context of a phrase. Let v(e) be the verb of a phrasal verb candidate e, which is always its first word. For a single-word verb phrase v(e), we can compute the constituent translation probability Pv(e)(x), again using Equation (1). The difference between Pe(x) and Pv(e)(x) is that the latter sums over all translations of the verb v(e), regardless of whether it appears in the context of e:

    Pv(e)(x) = Σ_{f ∈ F(v(e))} P(f | v(e)) · δ(π1(v(e), f), x)

For a one-word phrase such as v(e), π1(v(e), f) is the subphrase of f that most commonly directly word-aligns to the one word of v(e).

Finally, for a phrase e and its verb v(e), we calculate the Kullback-Leibler (KL) divergence between the translation distribution of v(e) and e:

    DKL(Pv(e) ‖ Pe) = Σ_x Pv(e)(x) ln( Pv(e)(x) / Pe(x) )    (2)

where the sum ranges over all x such that Pv(e)(x) > 0. This quantifies the difference between the translations of e's verb when it occurs in e, and when it occurs in general. Figure 1 illustrates this computation on a toy corpus.

Smoothing. Equation (2) is defined only if, for every x such that Pv(e)(x) > 0, it is also the case that Pe(x) > 0. In order to ensure that this condition holds, we smooth the translation distributions toward uniform. Let D be the set of phrases with non-zero probability under either distribution:

    D = {x : Pv(e)(x) > 0 or Pe(x) > 0}

Then, let UD be the uniform distribution over D:

    UD(x) = 1/|D| if x ∈ D, and 0 if x ∉ D

3 To extend this statistic to other types of multiword expressions, one could compute a similar distribution for other content words in the phrase.
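The constituent translation distribution of Equation (1) can be sketched in code. This is an illustrative implementation, not the authors' system: the tables `aligned` and `pi1` are simplified stand-ins for real phrase-alignment output, populated here with the toy counts used in the paper's Figure 1.

```python
from collections import Counter, defaultdict

def constituent_distribution(phrase, aligned, pi1):
    """Compute P_e(x) as in Equation (1): for each foreign phrase f
    aligned to `phrase`, put P(f|e) mass on pi1(e, f), the subphrase
    of f most often word-aligned to the first word (the verb) of e."""
    counts = aligned[phrase]
    total = sum(counts.values())
    dist = defaultdict(float)
    for f, n in counts.items():
        # P(f|e) * delta(pi1(e, f), x): all mass lands on the verb-aligned subphrase.
        dist[pi1[(phrase, f)]] += n / total
    return dict(dist)

# Toy alignment counts mirroring Figure 1:
aligned = {
    "looking forward to": Counter({"deseando": 1, "mirando adelante a": 3}),
    "looking": Counter({"mirando": 5, "buscando": 3}),
}
pi1 = {
    ("looking forward to", "deseando"): "deseando",
    ("looking forward to", "mirando adelante a"): "mirando",
    ("looking", "mirando"): "mirando",
    ("looking", "buscando"): "buscando",
}

P_e = constituent_distribution("looking forward to", aligned, pi1)  # translations of the phrase
P_v = constituent_distribution("looking", aligned, pi1)             # translations of its verb
```

With these counts, P_e puts 3/4 of its mass on mirando and 1/4 on deseando, while P_v splits 5/8 / 3/8 between mirando and buscando, matching the unsmoothed rows of Figure 1.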

When computing divergence in Equation (2), we use the smoothed distributions P′e and P′v(e):

    P′e(x) = α Pe(x) + (1 − α) UD(x)
    P′v(e)(x) = α Pv(e)(x) + (1 − α) UD(x)

We use α = 0.95, which distributes 5% of the total probability mass evenly among all events in D.

Inflections. We calculate statistics for morphological variants of an English phrase. For a candidate English phrasal verb e (for example, look up), let E denote the set of inflections of that phrasal verb (for look up, this will be [look|looks|looked|looking] up). We extract the variants in E from the verb entries in English Wiktionary. The final score computed from a phrase-aligned parallel corpus translating English sentences into a language L is the average KL divergence of smoothed constituent translation distributions for any inflected form ei ∈ E:

    ϕL(e) = (1/|E|) Σ_{ei ∈ E} DKL(P′v(ei) ‖ P′ei)

Aligned Phrase Pair                        N(e, f)   π1(e, f)
looking forward to / deseando                 1      deseando
looking forward to / mirando adelante a       3      mirando
looking / mirando                             5      mirando
looking / buscando                            3      buscando

             mirando       deseando     buscando
Pv(e)(x)     5/8 = 0.625   0            3/8 = 0.375
P′v(e)(x)    0.610         0.02         0.373
Pe(x)        3/4 = 0.75    1/4 = 0.25   0
P′e(x)       0.729         0.254        0.02

DKL(P′v(ei) ‖ P′ei) = −0.109 + −0.045 + 1.159 = 1.005

Figure 1: The computation of DKL(P′v(ei) ‖ P′ei) using a toy corpus, for looking forward to. Note that the second aligned phrase pair contains the third, so the second's count of 3 must be included in the third's count of 5.

3.2 Monolingual Statistics

We also collect a number of monolingual statistics for each phrasal verb candidate, motivated by the considerable body of previous work on the topic (Church and Hanks, 1990; Lin, 1999; McCarthy et al., 2003). The monolingual statistics are designed to identify frequent collocations in a language. This set of monolingual features is not comprehensive, as we focus our attention primarily on bilingual features in this paper.

As above, define E to be the set of morphologically inflected variants of a candidate e, and let V be the set of inflected variants of the head verb v(e) of e. We define three statistics calculated from the phrase counts of a monolingual English corpus. First, we define µ1(e) to be the relative frequency of the candidate e, given e's head verb, summed over morphological variants:

    µ1(e) = ln P(E | V) = ln( Σ_{ei ∈ E} N(ei) / Σ_{vi ∈ V} N(vi) )

where N(x) is the number of times phrase x was observed in the monolingual corpus.

Second, define µ2(e) to be the pointwise mutual information (PMI) between V (the event that one of the inflections of the verb in e is observed) and R, the event of observing the rest of the phrase:

    µ2(e) = PMI(V, R)
          = lg P(V, R) − lg (P(V) P(R))
          = lg P(E) − lg (P(V) P(R))
          = lg Σ_{ei ∈ E} N(ei) − lg Σ_{vi ∈ V} N(vi) − lg N(r) + lg N

where N is the total number of tokens in the corpus, and logarithms are base-2. This statistic characterizes the degree of association between a verb and its phrasal extension. We only calculate µ2 for two-word phrases, as it did not prove helpful for longer phrases.
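The smoothed divergence computation of Figure 1 in section 3.1 can be reproduced numerically. A minimal sketch, assuming α = 0.95 as in the paper; the KL divergence uses natural logarithms, matching Equation (2):

```python
import math

def smooth(p, support, alpha=0.95):
    """Mix a distribution with the uniform distribution over `support`,
    as in section 3.1 (alpha = 0.95 in the paper)."""
    u = 1.0 / len(support)
    return {x: alpha * p.get(x, 0.0) + (1 - alpha) * u for x in support}

def kl(p, q):
    """D_KL(p || q), summing over x with p(x) > 0."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

# Unsmoothed toy distributions from Figure 1:
P_v = {"mirando": 5 / 8, "buscando": 3 / 8}   # translations of "looking"
P_e = {"mirando": 3 / 4, "deseando": 1 / 4}   # translations within "looking forward to"
D = set(P_v) | set(P_e)                        # union of the two supports

score = kl(smooth(P_v, D), smooth(P_e, D))     # ~1.005, matching Figure 1
```

The three summands come out near −0.109, −0.045, and 1.159, so the divergence is about 1.005, reproducing the figure's arithmetic.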

Finally, define µ3(e) to be the relative frequency of the phrasal verb e augmented by an accusative pronoun, conditioned on the verb. Let A be the set of phrases in E with an accusative pronoun (it, them, him, her, me, you) optionally inserted either at the end of the phrase or directly after the verb. For e = look up, A = {look up, look X up, look up X, looks up, looks X up, looks up X, . . .}, with X an accusative pronoun. The µ3 statistic is similar to µ1, but allows for an intervening or following pronoun:

    µ3(e) = ln P(A | V) = ln( Σ_{ei ∈ A} N(ei) / Σ_{vi ∈ V} N(vi) )

This statistic is designed to exploit the intuition that phrasal verbs frequently have accusative pronouns either inserted into the middle (e.g. look it up) or at the end (e.g. look down on him).

3.3 Ranking Phrasal Verb Candidates

Our goal is to assign a single real-valued score to each candidate e, by which we can rank candidates according to semantic idiosyncrasy. For each language L for which we have a parallel corpus, we defined, in section 3.1, a function ϕL(e) assigning real values to candidate phrasal verbs e, which we hypothesize is higher on average for more idiomatic compounds. Further, in section 3.2, we defined real-valued monolingual functions µ1, µ2, and µ3 for which we hypothesize the same trend holds. Because each score individually ranks all candidates, it is natural to view each ϕL and µi as a weak ranking function that we can combine with a supervised boosting objective. We use a modified version of AdaBoost (Freund and Schapire, 1995) that optimizes for recall.

For each ϕL and µi, we compute a ranked list of candidate phrasal verbs, ordered from highest to lowest value. To simplify learning, we consider only the top 5000 candidate phrasal verbs according to µ1, µ2, and µ3. This pruning procedure excludes candidates that do not appear in our monolingual corpus.

We optimize the ranker using an unranked, incomplete training set of phrasal verbs. We can evaluate the quality of the ranker by outputting the top N ranked candidates and measuring recall relative to this gold-standard training set. We choose this recall-at-N metric so as to not directly penalize precision errors, as our training set is incomplete.

Define H to be the set of N-element sets containing the top proposals for each weak ranker (we use N = 2000). That is, each element of H is a set containing the 2000 highest values for some ϕL or µi. We define the baseline error B to be 1 − E[R], with R the recall-at-N of a ranker ordering the candidate phrases in the set ∪H at random. The value E[R] is estimated by averaging the recall-at-N of 1000 random orderings of ∪H.

Algorithm 1 gives the formulation of the AdaBoost training algorithm that we use to combine weak rankers. The algorithm maintains a weight vector w (summing to 1) which contains a positive real number for each gold standard phrasal verb in the training set X. Initially, w is uniformly set to 1/|X|. At each iteration of the algorithm, w is modified to take higher values for recently misclassified examples. We repeatedly choose weak rankers ht ∈ H (and corresponding real-valued coefficients αt) that correctly rank examples with high w values.

Algorithm 1 Recall-Oriented Ranking AdaBoost
 1: for i = 1 : |X| do
 2:   w[i] ← 1/|X|
 3: end for
 4: for t = 1 : T do
 5:   for all h ∈ H do
 6:     εh ← 0
 7:     for i = 1 : |X| do
 8:       if xi ∉ h then
 9:         εh ← εh + w[i]
10:       end if
11:     end for
12:   end for
13:   ht ← argmax_{h ∈ H} |B − εh|
14:   αt ← ln(B/εht)
15:   for i = 1 : |X| do
16:     if xi ∈ ht then
17:       w[i] ← (1/Z) w[i] exp(−αt)
18:     else
19:       w[i] ← (1/Z) w[i] exp(αt)
20:     end if
21:   end for
22: end for

Lines 5–12 of Algorithm 1 calculate the weighted error value εh for every weak ranker set h ∈ H. The error εh will be 1 if h contains none of X and 0 if h contains all of X, as w always sums to 1. Line 13 picks the ranker ht ∈ H whose weighted error is as far as possible from the random baseline error B. Line 14 calculates a coefficient αt for ht, which will be positive if εht < B and negative if εht > B. Intuitively, αt encodes the importance of ht—it will be high if ht performs well, and low if it performs poorly. The Z in lines 17 and 19 is the normalizing constant ensuring the vector w sums to 1.

After termination of Algorithm 1, we have weights α1, . . . , αT and lists h1, . . . , hT. Define ft as the function that generated the list ht (each ft will be some ϕL or µi). Now, we define a final combined function ϕ, taking a phrase e and returning a real number:

    ϕ(e) = Σ_{t=1}^{T} αt ft(e)

We standardize the scores of individual weak rankers to have mean 0 and variance 1, so that their scores are comparable.

The final learned ranker outputs a real value, instead of the class labels frequently found in AdaBoost. This follows previous work using boosting for learning to rank (Freund et al., 2003; Xu and Li, 2007). Our algorithm differs from previous methods because we are seeking to optimize for recall-at-N, rather than a ranking loss.

4 Experimental Evaluation

4.1 Training and Test Set

In order to train and evaluate our system, we construct a gold-standard list of phrasal verbs from the freely available English Wiktionary. We gather phrasal verbs from three sources within Wiktionary:

1. Entries labeled as English phrasal verbs4,
2. Entries labeled as English idioms5, and
3. The derived terms6 of English verb entries.

4 http://en.wiktionary.org/wiki/Category:English_phrasal_verbs
5 http://en.wiktionary.org/wiki/Category:English_idioms
6 For example, see http://en.wiktionary.org/wiki/take#Derived_terms

about across after against along
among around at before behind
between beyond by down for
from in into like off
on onto outside over past
round through to towards under
up upon with within without

Table 2: Particles and prepositions allowed in phrasal verbs gathered from Wiktionary.

Many of the idioms and derived terms are not phrasal verbs (e.g. kick the bucket, make-or-break). We filter out any phrases not of the form V P+, with V a verb and P+ denoting one or more occurrences of particles and prepositions from the list in Table 2. We omit prepositions that do not productively form English phrasal verbs, such as amid and as. This process also omits some compounds that are sometimes called phrasal verbs, such as light verb constructions, e.g. have a look (Butt, 2003), and noncompositional verb-adverb collocations, e.g. look forward.

There are a number of extant phrasal verb corpora. For example, McCarthy et al. (2003) present graded human compositionality judgments for 116 phrasal verbs, and Baldwin (2008) presents a large set of candidates produced by an automated system, with false positives manually removed. We use Wiktionary instead, in an attempt to construct a maximally comprehensive data set that is free from any possible biases introduced by automatic extraction processes.

4.2 Filtering and Data Partition

The merged list of phrasal verbs extracted from Wiktionary included some common collocations that have compositional semantics (e.g. know about), as well as some very rare constructions (e.g. cheese down). We removed these spurious results systematically by filtering out very frequent and very infrequent entries. First, we calculated the log probability of each phrase, according to a language model built from a large monolingual corpus of news documents and web documents, smoothed with stupid backoff (Brants et al., 2007). We sorted all Wiktionary phrasal verbs according to this value. Then, we selected the contiguous 75% of the sorted phrases that minimize the variance of this statistic.
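This contiguous minimum-variance selection can be sketched as follows. A simplified illustration: in the paper, the input would be language-model log-probabilities of the Wiktionary phrases rather than the toy numbers used here.

```python
from statistics import variance

def min_variance_window(scores, frac=0.75):
    """Return the contiguous `frac` of the sorted scores whose variance
    is minimal, mirroring the paper's frequency filter."""
    scores = sorted(scores)
    k = int(len(scores) * frac)
    start = min(range(len(scores) - k + 1),
                key=lambda i: variance(scores[i:i + k]))
    return scores[start:start + k]

# Outlier log-probabilities at both ends fall outside the chosen window:
kept = min_variance_window([-100.0, -9.0, -8.5, -8.0, -7.5, -7.0, -1.0])
```

The very frequent phrase (score −1.0) and the very rare one (−100.0) are both excluded, which is the behavior the filter is designed to produce.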

This method removed a few very frequent phrases and a large number of rare phrases. The remaining phrases were split randomly into a development set of 694 items and a held-out test set of 695 items.

4.3 Corpora

Our monolingual English corpus consists of news articles and documents collected from the web. Our parallel corpora from English to each of 50 languages also consist of documents collected from the web via distributed data mining of parallel documents based on the text content of web pages (Uszkoreit et al., 2010).

The parallel corpora were segmented into aligned sentence pairs and word-aligned using two iterations of IBM Model 1 (Brown et al., 1993) and two iterations of the HMM-based alignment model (Vogel et al., 1996) with posterior symmetrization (Liang et al., 2006). This training recipe is common in large-scale machine translation systems.

4.4 Generating Candidates

To generate the set of candidate phrasal verbs considered during evaluation, we exhaustively enumerated the Cartesian product of all verbs present in the previously described Wiktionary set (V), all particles in Table 2 (P), and a small set of second particles T = {with, to, on, ε}, with ε the empty string. The set of candidate phrasal verbs we consider during evaluation is the product V × P × T, which contains 96,880 items.

4.5 Results

We optimize a ranker using the boosting algorithm described in section 3.3, using the features from Table 1, optimizing performance on the Wiktionary development set described in section 4.2. Monolingual and bilingual statistics are calculated using the corpora described in section 4.3, with candidate phrasal verbs being drawn from the set described in section 4.4.

We evaluate our method of identifying phrasal verbs by computing recall-at-N. This statistic is the fraction of the Wiktionary test set that appears in the top N proposed phrasal verbs by the method, where N is an arbitrary number of top-ranked candidates held constant when comparing different approaches (we use N = 1220). We do not compute precision, because the test set to which we compare is not an exhaustive list of phrasal verbs, due to the development/test split, frequency filtering, and omissions in the original lexical resource. Proposing a phrasal verb not in the test set is not necessarily an error, but identifying many phrasal verbs from the test set is an indication of an effective method. Recall-at-N is a natural way to evaluate a ranking system where the gold-standard data is an incomplete, unranked set.

                                Recall-at-1220
                                 Dev    Test
Baseline
  Frequent Candidates            17.0   19.3
  WordNet 3.0 Frequent           41.6   43.7
  WordNet 3.0 Filtered           49.4   48.8
Boosted
  Monolingual Only               30.1   30.2
  Bilingual Only                 47.1   43.9
  Monolingual+Bilingual          50.8   47.9

Table 3: Our boosted ranker combining monolingual and bilingual features (bottom) compared to three baselines (top) gives comparable performance to the human-curated upper bound.

Table 3 compares our approach to three baselines using the recall-at-1220 metric evaluated on both the development and test sets. As a lower bound, we evaluated the 1220 most frequent candidates in our monolingual corpus (Frequent Candidates).

As a competitive baseline, we evaluated the set of phrasal verbs in WordNet 3.0 (Fellbaum, 1998). We selected the most frequent 1220 out of 1781 verb-particle constructions in WordNet (WordNet 3.0 Frequent). A stronger baseline resulted from applying the same filtering procedure to WordNet that we did to Wiktionary: sorting all verb-particle entries by their language model score and retaining the 1220 consecutive entries that minimized language model variance (WordNet 3.0 Filtered). WordNet is a human-curated resource, and yet its recall-at-N compared to our Wiktionary test set is only 48.8%, indicating substantial divergence between the two resources. Such divergence is typical: lexical resources often disagree about what multiword expressions to include (Lin, 1999).
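The evaluation metric can be stated compactly. A sketch of recall-at-N, assuming `ranked` is the system's candidate list in decreasing score order and `gold` is the (incomplete) reference set:

```python
def recall_at_n(ranked, gold, n=1220):
    """Fraction of the gold set found among the top-n ranked candidates.
    Precision is deliberately not computed: a proposal missing from the
    incomplete gold set is not necessarily an error."""
    top = set(ranked[:n])
    return sum(1 for g in gold if g in top) / len(gold)

# Toy example: 1 of 3 gold phrasal verbs appears in the top 2 proposals.
r = recall_at_n(["look up", "take off", "walk towards", "count on"],
                {"look up", "count on", "give up"}, n=2)
```

The default n = 1220 matches the paper's evaluation setting.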

The three final lines in Table 3 evaluate our boosted ranker. Automatically detecting phrasal verbs using monolingual features alone strongly outperformed the frequency-based lower bound, but underperformed the WordNet baseline. Bilingual features, using features from 50 languages, proved substantially more effective. The combination of both types of features yielded the best performance, outperforming the human-curated WordNet baseline on the development set (on which our ranker was optimized) and approaching its performance on the held-out test set.

4.6 Feature Analysis

The solid line in Figure 2 shows the recall-at-1220 for a boosted ranker using all monolingual statistics and k bilingual statistics, for increasing k. Bilingual statistics are added according to their individual recall, from best-performing to worst. That is, the point at k = 0 uses only µ1, µ2, and µ3, the point at k = 1 adds the best individually-performing bilingual statistic (Spanish) as a weak ranker, the next point adds the second-best bilingual statistic (German), etc. Boosting maximizes performance on the development set, and evaluation is performed on the test set. We use T = 53 (equal to the total number of weak rankers).

[Figure 2: test-set recall-at-1220 (y-axis, 0–50%) plotted against the number of languages k (x-axis, 0–50); solid line: combined with AdaBoost; dotted line: individual bilingual statistics.]

Figure 2: The solid line shows recall-at-1220 when combining the k best-performing bilingual statistics and three monolingual statistics. The dotted line shows the individual performance of the kth best-performing bilingual statistic, when applied in isolation to rank candidates.

The dotted line in Figure 2 shows that individual bilingual statistics have recall-at-1220 ranging from 34.4% to 5.0%. This difference reflects the different sizes of parallel corpora and usefulness of different languages in identifying English semantic idiosyncrasy. Combining together the signal of multiple languages is clearly beneficial, and including many low-performing languages still offers overall improvements.

                          Recall-at-1220
                           Dev    Test
Bilingual only             47.1   43.9
Bilingual+µ1               48.1   46.9
Bilingual+µ2               50.1   48.3
Bilingual+µ3               48.4   46.3
Bilingual+µ1+µ2            50.2   47.9
Bilingual+µ1+µ3            49.0   47.4
Bilingual+µ2+µ3            50.4   49.4
Bilingual+µ1+µ2+µ3         50.8   47.9

Table 4: An ablation of monolingual statistics shows that they are useful in addition to the 50 bilingual statistics combined, and no single statistic provides maximal performance.

Table 4 shows the effect of adding different subsets of the monolingual statistics to the set of all 50 bilingual statistics. Monolingual statistics give a performance improvement of up to 5.5% recall on the test set, but the comparative behavior of the various combinations of the µi is somewhat unpredictable when training on the development set and evaluating on the test set. The pointwise mutual information of a verb and its particles (µ2) appears to be the most useful feature. In fact, the test set performance of using µ2 alone outperforms the combination of all three. The best combination even outperforms the WordNet 3.0 baseline on the test set, though optimizing on the development set would not select this model.

4.7 Error Analysis

Table 5 shows the 100 highest ranked phrasal verb candidates by our system that do not appear in either the development or test sets. Most of these candidates are in fact English phrasal verbs that happened to be missing from Wiktionary; some are present in Wiktionary but were removed from the reference

pick up pat on tap into fit for charge with suit against catch up burst into muck up haul up give up get off get through get up get in tack on buzz about do like plump for haul in keep up with strap on catch up with suck into get round chop off slap on pitch into get into inquire into drop behind get on catch up on pass on cue from carry around get around get over shoot at pick over shoot by shoot in make up to get past cast down set up with rule off hand round piss on hit by break down move for lead off pluck off flip through edge over strike off plug into keep up go past set off pull round see about stay on put up sidle up to buzz around take off set up slap in head towards shoot past inquire for tuck up lie with well before go on with reel from drive along snap off barge into whip on put down instance through bar from cut down on let in tune in to move off suit in lean against well beyond get down to go across sail into lie over hit with chow down on look after catch at

Table 5: The highest ranked phrasal verb candidates from our full system that do not appear in either Wiktionary set. Candidates are presented in decreasing rank; "pat on" is the second highest ranked candidate.

sets during filtering, and the remainder are in fact not phrasal verbs (true precision errors).

These errors fall largely into two categories. Some candidates are compositional, but contain polysemous verbs, such as hit by, drive along, and head towards. In these cases, prepositions disambiguate the verb, which naturally affects translation distributions. Other candidates are not phrasal verbs, but instead phrases that tend to have a different syntactic role, such as suit against, instance through, fit for, and lie over (conjugated as lay over). A careful treatment of part-of-speech tags when computing corpus statistics might address this issue.

5 Related Work

The idea of using word-aligned parallel corpora to identify idiomatic expressions has been pursued in a number of different ways. Melamed (1997) tests candidate MWEs by collapsing them into single tokens, training a new translation model with these tokens, and using the performance of the new model to judge candidates' noncompositionality. Villada Moirón and Tiedemann (2006) use word-aligned parallel corpora to identify Dutch MWEs, testing the assumption that the distributions of alignments of MWEs will generally have higher entropies than those of fully compositional compounds. Caseli et al. (2010) generate candidate multiword expressions by picking out sufficiently common phrases that align to single target-side tokens. Tsvetkov and Wintner (2012) generate candidate MWEs by finding one-to-one alignments in parallel corpora which are not in a bilingual dictionary, and ranking them based on monolingual statistics. The system of Salehi and Cook (2013) is perhaps the closest to the current work, judging noncompositionality using string edit distance between a candidate phrase's automatic translation and its components' individual translations. Unlike the current work, their method does not use distributions over translations or combine individual bilingual values with boosting; however, they find, as we do, that incorporating many languages is beneficial to MWE identification.

A large body of work has investigated the identification of noncompositional compounds from monolingual sources (Lin, 1999; Schone and Jurafsky, 2001; Fazly and Stevenson, 2006; McCarthy et al., 2003; Baldwin et al., 2003; Villavicencio, 2003). Many of these monolingual statistics could be viewed as weak rankers and fruitfully incorporated into our framework.

There has also been a substantial amount of work addressing the problem of differentiating between literal and idiomatic instances of phrases in context (Katz and Giesbrecht, 2006; Li et al., 2010;
Melamed cation of noncompositional compounds from mono- (1997) tests candidate MWEs by collapsing them lingual sources (Lin, 1999; Schone and Jurafsky, into single tokens, training a new translation model 2001; Fazly and Stevenson, 2006; McCarthy et with these tokens, and using the performance of al., 2003; Baldwin et al., 2003; Villavicencio, the new model to judge candidates’ noncomposi- 2003). Many of these monolingual statistics could tionality. Villada Moiron´ and Tiedemann (2006) be viewed as weak rankers and fruitfully incorpo- use word-aligned parallel corpora to identify Dutch rated into our framework. MWEs, testing the assumption that the distributions There has also been a substantial amount of work of alignments of MWEs will generally have higher addressing the problem of differentiating between entropies than those of fully compositional com- literal and idiomatic instances of phrases in con- pounds. Caseli et al. (2010) generate candidate mul- text (Katz and Giesbrecht, 2006; Li et al., 2010;

Sporleder and Li, 2009; Birke and Sarkar, 2006; Diab and Bhutada, 2009). We do not attempt this task; however, techniques for token identification could be used to improve type identification (Baldwin, 2005).

6 Conclusion

We have presented the polyglot ranking approach to phrasal verb identification, using parallel corpora from many languages to identify phrasal verbs. We proposed an evaluation metric that acknowledges the inherent incompleteness of reference sets, but distinguishes among competing systems in a manner aligned to the goals of the task. We developed a recall-oriented learning method that integrates multiple weak ranking signals, and demonstrated experimentally that combining statistical evidence from a large number of bilingual corpora, as well as from monolingual corpora, produces the most effective system overall. We look forward to generalizing our approach to other types of noncompositional phrases.

Acknowledgments

Special thanks to Ivan Sag, who argued for the importance of handling multi-word expressions in natural language processing applications, and who taught the authors about natural language once upon a time. We would also like to thank the anonymous reviewers for their helpful suggestions.

References

Otavio Acosta, Aline Villavicencio, and Viviane Moreira. 2011. Identification and treatment of multiword expressions applied to information retrieval. In Proceedings of the ACL Workshop on Multiword Expressions.

Timothy Baldwin and Aline Villavicencio. 2002. Extracting the unextractable: A case study on verb-particles. In Proceedings of the Sixth Conference on Natural Language Learning.

Timothy Baldwin, Colin Bannard, Takaaki Tanaka, and Dominic Widdows. 2003. An empirical model of multiword expression decomposability. In Proceedings of the ACL Workshop on Multiword Expressions.

Timothy Baldwin. 2005. Deep lexical acquisition of verb-particle constructions. Computer Speech & Language, Special Issue on Multiword Expressions.

Timothy Baldwin. 2008. A resource for evaluating the deep lexical acquisition of English verb-particle constructions. In Proceedings of the LREC Workshop Towards a Shared Task for Multiword Expressions.

Colin Bannard, Timothy Baldwin, and Alex Lascarides. 2003. A statistical approach to the semantics of verb-particles. In Proceedings of the ACL Workshop on Multiword Expressions.

Julia Birke and Anoop Sarkar. 2006. A clustering approach for the nearly unsupervised recognition of nonliteral language. In Proceedings of the European Chapter of the Association for Computational Linguistics.

Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, and Jeffrey Dean. 2007. Large language models in machine translation. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning.

Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics.

Miriam Butt. 2003. The light verb jungle. In Proceedings of the Workshop on Multi-Verb Constructions.

Helena de Medeiros Caseli, Carlos Ramisch, Maria das Graças Volpe Nunes, and Aline Villavicencio. 2010. Alignment-based extraction of multiword expressions. Language Resources and Evaluation.

Kenneth Ward Church and Patrick Hanks. 1990. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1).

Kenneth Church. 2011. How many multiword expressions do people know? In Proceedings of the ACL Workshop on Multiword Expressions.

Mona T. Diab and Pravin Bhutada. 2009. Verb noun construction MWE token supervised classification. In Proceedings of the ACL Workshop on Multiword Expressions.

Robert Dixon. 1982. The grammar of English phrasal verbs. Australian Journal of Linguistics.

Afsaneh Fazly and Suzanne Stevenson. 2006. Automatically constructing a lexicon of verb phrase idiomatic combinations. In Proceedings of the European Chapter of the Association for Computational Linguistics.

Christiane Fellbaum. 1998. WordNet: An Electronic Lexical Database. The MIT Press.

Mark Finlayson and Nidhi Kulkarni. 2011. Detecting multiword expressions improves word sense disambiguation. In Proceedings of the ACL Workshop on Multiword Expressions.

Yoav Freund and Robert E. Schapire. 1995. A decision-theoretic generalization of on-line learning and an application to boosting. In Proceedings of the Conference on Computational Learning Theory.

Yoav Freund, Raj Iyer, Robert E. Schapire, and Yoram Singer. 2003. An efficient boosting algorithm for combining preferences. The Journal of Machine Learning Research.

Ray Jackendoff. 1993. The Architecture of the Language Faculty. MIT Press.

Graham Katz and Eugenie Giesbrecht. 2006. Automatic identification of non-compositional multi-word expressions using latent semantic analysis. In Proceedings of the ACL Workshop on Multiword Expressions.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the North American Chapter of the Association for Computational Linguistics.

Linlin Li, Benjamin Roth, and Caroline Sporleder. 2010. Topic models for word sense disambiguation and token-based idiom detection. In Proceedings of the Association for Computational Linguistics.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by agreement. In Proceedings of the North American Chapter of the Association for Computational Linguistics.

Dekang Lin. 1999. Automatic identification of non-compositional phrases. In Proceedings of the Association for Computational Linguistics.

Diana McCarthy, Bill Keller, and John Carroll. 2003. Detecting a continuum of compositionality in phrasal verbs. In Proceedings of the ACL Workshop on Multiword Expressions.

I. Dan Melamed. 1997. Automatic discovery of non-compositional compounds in parallel data. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Santanu Pal, Sudip Kumar Naskar, Pavel Pecina, Sivaji Bandyopadhyay, and Andy Way. 2010. Handling named entities and compound verbs in phrase-based statistical machine translation. In Proceedings of the COLING 2010 Workshop on Multiword Expressions.

Zhixiang Ren, Yajuan Lü, Jie Cao, Qun Liu, and Yun Huang. 2009. Improving statistical machine translation using domain bilingual multiword expressions. In Proceedings of the ACL Workshop on Multiword Expressions.

Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann A. Copestake, and Dan Flickinger. 2002. Multiword expressions: A pain in the neck for NLP. In Proceedings of the CICLING Conference on Intelligent Text Processing and Computational Linguistics.

Bahar Salehi and Paul Cook. 2013. Predicting the compositionality of multiword expressions using translations in multiple languages. In Second Joint Conference on Lexical and Computational Semantics.

Patrick Schone and Daniel Jurafsky. 2001. Is knowledge-free induction of multiword unit dictionary headwords a solved problem? In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Caroline Sporleder and Linlin Li. 2009. Unsupervised recognition of literal and non-literal use of idiomatic expressions. In Proceedings of the European Chapter of the Association for Computational Linguistics.

Yulia Tsvetkov and Shuly Wintner. 2012. Extraction of multi-word expressions from small parallel corpora. Natural Language Engineering.

Jakob Uszkoreit, Jay Ponte, Ashok Popat, and Moshe Dubiner. 2010. Large scale parallel document mining for machine translation. In Proceedings of the Conference on Computational Linguistics.

Sriram Venkatapathy and Aravind K. Joshi. 2006. Using information about multi-word expressions for the word-alignment task. In Proceedings of the ACL Workshop on Multiword Expressions.

Begoña Villada Moirón and Jörg Tiedemann. 2006. Identifying idiomatic expressions using automatic word-alignment. In Proceedings of the EACL Workshop on Multiword Expressions in a Multilingual Context.

Aline Villavicencio. 2003. Verb-particle constructions and lexical resources. In Proceedings of the ACL Workshop on Multiword Expressions.

Stephan Vogel, Hermann Ney, and Christoph Tillmann. 1996. HMM-based word alignment in statistical translation. In Proceedings of the Conference on Computational Linguistics.

Jun Xu and Hang Li. 2007. AdaRank: A boosting algorithm for information retrieval. In Proceedings of the SIGIR Conference on Research and Development in Information Retrieval.

Ying Xu, Randy Goebel, Christoph Ringlstetter, and Grzegorz Kondrak. 2010. Application of the tightness continuum measure to Chinese information retrieval. In Proceedings of the COLING Workshop on Multiword Expressions.
