Identifying Phrasal Verbs Using Many Bilingual Corpora

Karl Pichotta∗
Department of Computer Science
University of Texas at Austin
[email protected]

John DeNero
Google, Inc.
[email protected]

Abstract

We address the problem of identifying multiword expressions in a language, focusing on English phrasal verbs. Our polyglot ranking approach integrates frequency statistics from translated corpora in 50 different languages. Our experimental evaluation demonstrates that combining statistical evidence from many parallel corpora using a novel ranking-oriented boosting algorithm produces a comprehensive set of English phrasal verbs, achieving performance comparable to a human-curated set.

1 Introduction

A multiword expression (MWE), or noncompositional compound, is a sequence of words whose meaning cannot be composed directly from the meanings of its constituent words. These idiosyncratic phrases are prevalent in the lexicon of a language; Jackendoff (1993) estimates that their number is on the same order of magnitude as that of single words, and Sag et al. (2002) suggest that they are much more common, though quantifying them is challenging (Church, 2011). The task of identifying MWEs is relevant not only to lexical semantics applications, but also machine translation (Koehn et al., 2003; Ren et al., 2009; Pal et al., 2010), information retrieval (Xu et al., 2010; Acosta et al., 2011), and syntactic parsing (Sag et al., 2002). Awareness of MWEs has empirically proven useful in a number of domains: Finlayson and Kulkarni (2011), for example, use MWEs to attain a significant performance improvement in word sense disambiguation; Venkatapathy and Joshi (2006) use features associated with MWEs to improve word alignment.

We focus on a particular subset of MWEs, phrasal verbs. A phrasal verb consists of a verb followed by one or more particles, such that the meaning of the phrase cannot be determined by combining the simplex meanings of its constituent words (Baldwin and Villavicencio, 2002; Dixon, 1982; Bannard et al., 2003).1 Examples of phrasal verbs include count on [rely], look after [tend], or take off [remove], the meanings of which do not involve counting, looking, or taking. In contrast, there are verbs followed by particles that are not phrasal verbs, because their meaning is compositional, such as walk towards, sit behind, or paint on.

We identify phrasal verbs by using frequency statistics calculated from parallel corpora, consisting of bilingual pairs of documents such that one is a translation of the other, with one document in English. We leverage the observation that a verb will translate in an atypical way when occurring as the head of a phrasal verb. For example, the word look in the context of look after will tend to translate differently from how look translates generally. In order to characterize this difference, we calculate a frequency distribution over translations of look, then compare it to the distribution of translations of look when followed by the word after. We expect that idiomatic phrasal verbs will tend to have unexpected translations of their head verbs, measured by the Kullback-Leibler divergence between those distributions.

Our polyglot ranking approach is motivated by the hypothesis that using many parallel corpora of different languages will help determine the degree of semantic idiomaticity of a phrase. In order to combine evidence from multiple languages, we develop a novel boosting algorithm tailored to the task of ranking multiword expressions by their degree of idiomaticity. We train and evaluate on disjoint subsets of the phrasal verbs in English Wiktionary2. In our experiments, the set of phrasal verbs identified automatically by our method achieves held-out recall that nears the performance of the phrasal verbs in WordNet 3.0, a human-curated set. Our approach strongly outperforms a monolingual system, and continues to improve when incrementally adding translation statistics for 50 different languages.

∗ Research conducted during an internship at Google.
1 Nomenclature varies: the term verb-particle construction is also used to denote what we call phrasal verbs; further, the term phrasal verb is sometimes used to denote a broader class of constructions.
2 http://en.wiktionary.org

Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 636–646, Seattle, Washington, USA, 18-21 October 2013. © 2013 Association for Computational Linguistics

Feature     Description
ϕL (×50)    KL divergence for each language L
µ1          frequency of phrase given verb
µ2          PMI of verb and particles
µ3          µ1 with an interposed pronoun

Table 1: Features used by the polyglot ranking system.

2 Identifying Phrasal Verbs

The task of identifying phrasal verbs using corpus information raises several issues of experimental design. We consider four central issues below in motivating our approach.

Types vs. Tokens. When a phrase is used in context, it takes a particular meaning among its possible senses. Many phrasal verbs admit compositional senses in addition to idiomatic ones—contrast idiomatic "look down on him for his politics" with compositional "look down on him from the balcony." In this paper, we focus on the task of determining whether a phrase type is a phrasal verb, meaning that it frequently expresses an idiomatic meaning across its many token usages in a corpus. We do not attempt to distinguish which individual phrase tokens in the corpus have idiomatic senses.

Ranking vs. Classification. Identifying phrasal verbs involves relative, rather than categorical, judgments: some phrasal verbs are more compositional than others, but retain a degree of noncompositionality (McCarthy et al., 2003). Moreover, a polysemous phrasal verb may express an idiosyncratic sense more or less often than a compositional sense in a particular corpus. Therefore, we should expect a corpus-driven system not to classify phrases as strictly idiomatic or compositional, but instead assign a ranking or relative scoring to a set of candidates.

Candidate Phrases. We distinguish between the task of identifying candidate multiword expressions and the task of ranking those candidates by their semantic idiosyncrasy. With English phrasal verbs, it is straightforward to enumerate all desired verbs followed by one or more particles, and rank the entire set.

Using Parallel Corpora. There have been a number of approaches proposed for the use of multilingual resources for MWE identification (Melamed, 1997; Villada Moirón and Tiedemann, 2006; Caseli et al., 2010; Tsvetkov and Wintner, 2012; Salehi and Cook, 2013). Our approach differs from previous work in that we identify MWEs using translation distributions of verbs, as opposed to 1–1, 1–m, or m–n word alignments, most-likely translations, bilingual dictionaries, or distributional entropy. To the best of our knowledge, ours is the first approach to use translational distributions to leverage the observation that a verb typically translates differently when it heads a phrasal verb.

3 The Polyglot Ranking Approach

Our approach uses bilingual and monolingual statistics as features, computed over unlabeled corpora. Each statistic characterizes the degree of idiosyncrasy of a candidate phrasal verb, using a single monolingual or bilingual corpus. We combine features for many language pairs using a boosting algorithm that optimizes a ranking objective using a supervised training set of English phrasal verbs. Each of these aspects of our approach is described in detail below; for reference, Table 1 provides a list of the features used.

3.1 Bilingual Statistics

One of the intuitive properties of an MWE is that its individual words likely do not translate literally when the whole expression is translated into another language (Melamed, 1997). We capture this effect
by measuring the divergence between how a verb translates generally and how it translates when heading a candidate phrasal verb.

A parallel corpus is a collection of document pairs ⟨DE, DF⟩, where DE is in English, DF is in another language, one document is a translation of the other, and all documents DF are in the same language. A phrase-aligned parallel corpus aligns those documents at a sentence, phrase, and word level. A phrase e aligns to another phrase f if some word in e aligns to some word in f and no word in e or f aligns outside of f or e, respectively. As a result of this definition, the words within an aligned phrase pair are themselves connected by word-level alignments.

Given an English phrase e, define F(e) to be the set of all foreign phrases observed aligned to e in a parallel corpus. For any f ∈ F(e), let P(f | e) be the conditional probability of the phrase e translating to the phrase f. This probability is estimated as the relative frequency of observing f and e as an aligned phrase pair, conditioned on observing e aligned to any phrase in the corpus:

    P(f | e) = N(e, f) / Σ_{f′} N(e, f′)

with N(e, f) the number of times e and f are observed occurring as an aligned phrase pair.

Next, we assign statistics to individual verbs within phrases. The first word of a candidate phrasal verb e is a verb. For a candidate phrasal verb e and a foreign phrase f, let π1(e, f) be the subphrase of f that is most commonly word-aligned to the first word of e. As an example, consider the phrase pair e = talk down to and f = hablar con menosprecio. Suppose that when e is aligned to f, the word talk is most frequently aligned to hablar. Then π1(e, f) = hablar.

For a phrase e and its set F(e) of aligned translations, we define the constituent translation probability of a foreign subphrase x as:

    Pe(x) = Σ_{f ∈ F(e)} P(f | e) · δ(π1(e, f), x)    (1)

where δ is the Kronecker delta function, taking value 1 if its arguments are equal and 0 otherwise. Intuitively, Pe assigns the probability mass for every f to its subphrase most commonly aligned to the verb in e. It expresses how this verb is translated in the context of a phrasal verb construction.3 Equation (1) defines a distribution over all phrases x of a foreign language.

We also assign statistics to verbs as they are translated outside of the context of a phrase. Let v(e) be the verb of a phrasal verb candidate e, which is always its first word. For a single-word verb phrase v(e), we can compute the constituent translation probability Pv(e)(x), again using Equation (1). The difference between Pe(x) and Pv(e)(x) is that the latter sums over all translations of the verb v(e), regardless of whether it appears in the context of e:

    Pv(e)(x) = Σ_{f ∈ F(v(e))} P(f | v(e)) · δ(π1(v(e), f), x)

For a one-word phrase such as v(e), π1(v(e), f) is the subphrase of f that most commonly directly word-aligns to the one word of v(e).

Finally, for a phrase e and its verb v(e), we calculate the Kullback-Leibler (KL) divergence between the translation distribution of v(e) and e:

    DKL(Pv(e) ‖ Pe) = Σ_x Pv(e)(x) ln( Pv(e)(x) / Pe(x) )    (2)

where the sum ranges over all x such that Pv(e)(x) > 0. This quantifies the difference between the translations of e's verb when it occurs in e, and when it occurs in general. Figure 1 illustrates this computation on a toy corpus.

Smoothing. Equation (2) is defined only if, for every x such that Pv(e)(x) > 0, it is also the case that Pe(x) > 0. In order to ensure that this condition holds, we smooth the translation distributions toward uniform. Let D be the set of phrases with non-zero probability under either distribution:

    D = {x : Pv(e)(x) > 0 or Pe(x) > 0}

Then, let UD be the uniform distribution over D:

    UD(x) = 1/|D| if x ∈ D, and 0 if x ∉ D

3 To extend this statistic to other types of multiword expressions, one could compute a similar distribution for other content words in the phrase.
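The constituent translation distribution of Equation (1) can be sketched in code. This is an illustrative implementation, not the authors' system: the tables `aligned` and `pi1` are simplified stand-ins for real phrase-alignment output, populated here with the toy counts used in the paper's Figure 1.

```python
from collections import Counter, defaultdict

def constituent_distribution(phrase, aligned, pi1):
    """Compute P_e(x) as in Equation (1): for each foreign phrase f
    aligned to `phrase`, put P(f|e) mass on pi1(e, f), the subphrase
    of f most often word-aligned to the first word (the verb) of e."""
    counts = aligned[phrase]
    total = sum(counts.values())
    dist = defaultdict(float)
    for f, n in counts.items():
        # P(f|e) * delta(pi1(e, f), x): all mass lands on the verb-aligned subphrase.
        dist[pi1[(phrase, f)]] += n / total
    return dict(dist)

# Toy alignment counts mirroring Figure 1:
aligned = {
    "looking forward to": Counter({"deseando": 1, "mirando adelante a": 3}),
    "looking": Counter({"mirando": 5, "buscando": 3}),
}
pi1 = {
    ("looking forward to", "deseando"): "deseando",
    ("looking forward to", "mirando adelante a"): "mirando",
    ("looking", "mirando"): "mirando",
    ("looking", "buscando"): "buscando",
}

P_e = constituent_distribution("looking forward to", aligned, pi1)  # translations of the phrase
P_v = constituent_distribution("looking", aligned, pi1)             # translations of its verb
```

With these counts, P_e puts 3/4 of its mass on mirando and 1/4 on deseando, while P_v splits 5/8 / 3/8 between mirando and buscando, matching the unsmoothed rows of Figure 1.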

When computing divergence in Equation (2), we use the smoothed distributions P′e and P′v(e):

    P′e(x) = α Pe(x) + (1 − α) UD(x)
    P′v(e)(x) = α Pv(e)(x) + (1 − α) UD(x)

We use α = 0.95, which distributes 5% of the total probability mass evenly among all events in D.

Inflections. We calculate statistics for morphological variants of an English phrase. For a candidate English phrasal verb e (for example, look up), let E denote the set of inflections of that phrasal verb (for look up, this will be [look|looks|looked|looking] up). We extract the variants in E from the verb entries in English Wiktionary. The final score computed from a phrase-aligned parallel corpus translating English sentences into a language L is the average KL divergence of smoothed constituent translation distributions for any inflected form ei ∈ E:

    ϕL(e) = (1/|E|) Σ_{ei ∈ E} DKL(P′v(ei) ‖ P′ei)

Aligned Phrase Pair                        N(e, f)   π1(e, f)
looking forward to / deseando                 1      deseando
looking forward to / mirando adelante a       3      mirando
looking / mirando                             5      mirando
looking / buscando                            3      buscando

             mirando       deseando     buscando
Pv(e)(x)     5/8 = 0.625   0            3/8 = 0.375
P′v(e)(x)    0.610         0.02         0.373
Pe(x)        3/4 = 0.75    1/4 = 0.25   0
P′e(x)       0.729         0.254        0.02

DKL(P′v(ei) ‖ P′ei) = −0.109 + −0.045 + 1.159 = 1.005

Figure 1: The computation of DKL(P′v(ei) ‖ P′ei) using a toy corpus, for looking forward to. Note that the second aligned phrase pair contains the third, so the second's count of 3 must be included in the third's count of 5.

3.2 Monolingual Statistics

We also collect a number of monolingual statistics for each phrasal verb candidate, motivated by the considerable body of previous work on the topic (Church and Hanks, 1990; Lin, 1999; McCarthy et al., 2003). The monolingual statistics are designed to identify frequent collocations in a language. This set of monolingual features is not comprehensive, as we focus our attention primarily on bilingual features in this paper.

As above, define E to be the set of morphologically inflected variants of a candidate e, and let V be the set of inflected variants of the head verb v(e) of e. We define three statistics calculated from the phrase counts of a monolingual English corpus. First, we define µ1(e) to be the relative frequency of the candidate e, given e's head verb, summed over morphological variants:

    µ1(e) = ln P(E | V) = ln( Σ_{ei ∈ E} N(ei) / Σ_{vi ∈ V} N(vi) )

where N(x) is the number of times phrase x was observed in the monolingual corpus.

Second, define µ2(e) to be the pointwise mutual information (PMI) between V (the event that one of the inflections of the verb in e is observed) and R, the event of observing the rest of the phrase:

    µ2(e) = PMI(V, R)
          = lg P(V, R) − lg (P(V) P(R))
          = lg P(E) − lg (P(V) P(R))
          = lg Σ_{ei ∈ E} N(ei) − lg Σ_{vi ∈ V} N(vi) − lg N(r) + lg N

where N is the total number of tokens in the corpus, and logarithms are base-2. This statistic characterizes the degree of association between a verb and its phrasal extension. We only calculate µ2 for two-word phrases, as it did not prove helpful for longer phrases.
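The smoothed divergence computation of Figure 1 in section 3.1 can be reproduced numerically. A minimal sketch, assuming α = 0.95 as in the paper; the KL divergence uses natural logarithms, matching Equation (2):

```python
import math

def smooth(p, support, alpha=0.95):
    """Mix a distribution with the uniform distribution over `support`,
    as in section 3.1 (alpha = 0.95 in the paper)."""
    u = 1.0 / len(support)
    return {x: alpha * p.get(x, 0.0) + (1 - alpha) * u for x in support}

def kl(p, q):
    """D_KL(p || q), summing over x with p(x) > 0."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

# Unsmoothed toy distributions from Figure 1:
P_v = {"mirando": 5 / 8, "buscando": 3 / 8}   # translations of "looking"
P_e = {"mirando": 3 / 4, "deseando": 1 / 4}   # translations within "looking forward to"
D = set(P_v) | set(P_e)                        # union of the two supports

score = kl(smooth(P_v, D), smooth(P_e, D))     # ~1.005, matching Figure 1
```

The three summands come out near −0.109, −0.045, and 1.159, so the divergence is about 1.005, reproducing the figure's arithmetic.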

Finally, define µ3(e) to be the relative frequency of the phrasal verb e augmented by an accusative pronoun, conditioned on the verb. Let A be the set of phrases in E with an accusative pronoun (it, them, him, her, me, you) optionally inserted either at the end of the phrase or directly after the verb. For e = look up, A = {look up, look X up, look up X, looks up, looks X up, looks up X, . . .}, with X an accusative pronoun. The µ3 statistic is similar to µ1, but allows for an intervening or following pronoun:

    µ3(e) = ln P(A | V) = ln( Σ_{ei ∈ A} N(ei) / Σ_{vi ∈ V} N(vi) )

This statistic is designed to exploit the intuition that phrasal verbs frequently have accusative pronouns either inserted into the middle (e.g. look it up) or at the end (e.g. look down on him).

3.3 Ranking Phrasal Verb Candidates

Our goal is to assign a single real-valued score to each candidate e, by which we can rank candidates according to semantic idiosyncrasy. For each language L for which we have a parallel corpus, we defined, in section 3.1, a function ϕL(e) assigning real values to candidate phrasal verbs e, which we hypothesize is higher on average for more idiomatic compounds. Further, in section 3.2, we defined real-valued monolingual functions µ1, µ2, and µ3 for which we hypothesize the same trend holds. Because each score individually ranks all candidates, it is natural to view each ϕL and µi as a weak ranking function that we can combine with a supervised boosting objective. We use a modified version of AdaBoost (Freund and Schapire, 1995) that optimizes for recall.

For each ϕL and µi, we compute a ranked list of candidate phrasal verbs, ordered from highest to lowest value. To simplify learning, we consider only the top 5000 candidate phrasal verbs according to µ1, µ2, and µ3. This pruning procedure excludes candidates that do not appear in our monolingual corpus.

We optimize the ranker using an unranked, incomplete training set of phrasal verbs. We can evaluate the quality of the ranker by outputting the top N ranked candidates and measuring recall relative to this gold-standard training set. We choose this recall-at-N metric so as to not directly penalize precision errors, as our training set is incomplete.

Define H to be the set of N-element sets containing the top proposals for each weak ranker (we use N = 2000). That is, each element of H is a set containing the 2000 highest values for some ϕL or µi. We define the baseline error B to be 1 − E[R], with R the recall-at-N of a ranker ordering the candidate phrases in the set ∪H at random. The value E[R] is estimated by averaging the recall-at-N of 1000 random orderings of ∪H.

Algorithm 1 gives the formulation of the AdaBoost training algorithm that we use to combine weak rankers. The algorithm maintains a weight vector w (summing to 1) which contains a positive real number for each gold standard phrasal verb in the training set X. Initially, w is uniformly set to 1/|X|. At each iteration of the algorithm, w is modified to take higher values for recently misclassified examples. We repeatedly choose weak rankers ht ∈ H (and corresponding real-valued coefficients αt) that correctly rank examples with high w values.

Algorithm 1 Recall-Oriented Ranking AdaBoost
 1: for i = 1 : |X| do
 2:   w[i] ← 1/|X|
 3: end for
 4: for t = 1 : T do
 5:   for all h ∈ H do
 6:     εh ← 0
 7:     for i = 1 : |X| do
 8:       if xi ∉ h then
 9:         εh ← εh + w[i]
10:       end if
11:     end for
12:   end for
13:   ht ← argmax_{h ∈ H} |B − εh|
14:   αt ← ln(B/εht)
15:   for i = 1 : |X| do
16:     if xi ∈ ht then
17:       w[i] ← (1/Z) w[i] exp(−αt)
18:     else
19:       w[i] ← (1/Z) w[i] exp(αt)
20:     end if
21:   end for
22: end for

Lines 5–12 of Algorithm 1 calculate the weighted error value εh for every weak ranker set h ∈ H. The error εh will be 1 if h contains none of X and 0 if h contains all of X, as w always sums to 1. Line 13 picks the ranker ht ∈ H whose weighted error is as far as possible from the random baseline error B. Line 14 calculates a coefficient αt for ht, which will be positive if εht < B and negative if εht > B. Intuitively, αt encodes the importance of ht—it will be high if ht performs well, and low if it performs poorly. The Z in lines 17 and 19 is the normalizing constant ensuring the vector w sums to 1.

After termination of Algorithm 1, we have weights α1, . . . , αT and lists h1, . . . , hT. Define ft as the function that generated the list ht (each ft will be some ϕL or µi). Now, we define a final combined function ϕ, taking a phrase e and returning a real number:

    ϕ(e) = Σ_{t=1}^{T} αt ft(e)

We standardize the scores of individual weak rankers to have mean 0 and variance 1, so that their scores are comparable.

The final learned ranker outputs a real value, instead of the class labels frequently found in AdaBoost. This follows previous work using boosting for learning to rank (Freund et al., 2003; Xu and Li, 2007). Our algorithm differs from previous methods because we are seeking to optimize for recall-at-N, rather than a ranking loss.

4 Experimental Evaluation

4.1 Training and Test Set

In order to train and evaluate our system, we construct a gold-standard list of phrasal verbs from the freely available English Wiktionary. We gather phrasal verbs from three sources within Wiktionary:

1. Entries labeled as English phrasal verbs4,
2. Entries labeled as English idioms5, and
3. The derived terms6 of English verb entries.

4 http://en.wiktionary.org/wiki/Category:English_phrasal_verbs
5 http://en.wiktionary.org/wiki/Category:English_idioms
6 For example, see http://en.wiktionary.org/wiki/take#Derived_terms

about across after against along
among around at before behind
between beyond by down for
from in into like off
on onto outside over past
round through to towards under
up upon with within without

Table 2: Particles and prepositions allowed in phrasal verbs gathered from Wiktionary.

Many of the idioms and derived terms are not phrasal verbs (e.g. kick the bucket, make-or-break). We filter out any phrases not of the form V P+, with V a verb and P+ denoting one or more occurrences of particles and prepositions from the list in Table 2. We omit prepositions that do not productively form English phrasal verbs, such as amid and as. This process also omits some compounds that are sometimes called phrasal verbs, such as light verb constructions, e.g. have a look (Butt, 2003), and noncompositional verb-adverb collocations, e.g. look forward.

There are a number of extant phrasal verb corpora. For example, McCarthy et al. (2003) present graded human compositionality judgments for 116 phrasal verbs, and Baldwin (2008) presents a large set of candidates produced by an automated system, with false positives manually removed. We use Wiktionary instead, in an attempt to construct a maximally comprehensive data set that is free from any possible biases introduced by automatic extraction processes.

4.2 Filtering and Data Partition

The merged list of phrasal verbs extracted from Wiktionary included some common collocations that have compositional semantics (e.g. know about), as well as some very rare constructions (e.g. cheese down). We removed these spurious results systematically by filtering out very frequent and very infrequent entries. First, we calculated the log probability of each phrase, according to a language model built from a large monolingual corpus of news documents and web documents, smoothed with stupid backoff (Brants et al., 2007). We sorted all Wiktionary phrasal verbs according to this value. Then, we selected the contiguous 75% of the sorted phrases that minimize the variance of this statistic.
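This contiguous minimum-variance selection can be sketched as follows. A simplified illustration: in the paper, the input would be language-model log-probabilities of the Wiktionary phrases rather than the toy numbers used here.

```python
from statistics import variance

def min_variance_window(scores, frac=0.75):
    """Return the contiguous `frac` of the sorted scores whose variance
    is minimal, mirroring the paper's frequency filter."""
    scores = sorted(scores)
    k = int(len(scores) * frac)
    start = min(range(len(scores) - k + 1),
                key=lambda i: variance(scores[i:i + k]))
    return scores[start:start + k]

# Outlier log-probabilities at both ends fall outside the chosen window:
kept = min_variance_window([-100.0, -9.0, -8.5, -8.0, -7.5, -7.0, -1.0])
```

The very frequent phrase (score −1.0) and the very rare one (−100.0) are both excluded, which is the behavior the filter is designed to produce.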

This method removed a few very frequent phrases and a large number of rare phrases. The remaining phrases were split randomly into a development set of 694 items and a held-out test set of 695 items.

4.3 Corpora

Our monolingual English corpus consists of news articles and documents collected from the web. Our parallel corpora from English to each of 50 languages also consist of documents collected from the web via distributed data mining of parallel documents based on the text content of web pages (Uszkoreit et al., 2010).

The parallel corpora were segmented into aligned sentence pairs and word-aligned using two iterations of IBM Model 1 (Brown et al., 1993) and two iterations of the HMM-based alignment model (Vogel et al., 1996) with posterior symmetrization (Liang et al., 2006). This training recipe is common in large-scale machine translation systems.

4.4 Generating Candidates

To generate the set of candidate phrasal verbs considered during evaluation, we exhaustively enumerated the Cartesian product of all verbs present in the previously described Wiktionary set (V), all particles in Table 2 (P), and a small set of second particles T = {with, to, on, ε}, with ε the empty string. The set of candidate phrasal verbs we consider during evaluation is the product V × P × T, which contains 96,880 items.

4.5 Results

We optimize a ranker using the boosting algorithm described in section 3.3, using the features from Table 1, optimizing performance on the Wiktionary development set described in section 4.2. Monolingual and bilingual statistics are calculated using the corpora described in section 4.3, with candidate phrasal verbs being drawn from the set described in section 4.4.

We evaluate our method of identifying phrasal verbs by computing recall-at-N. This statistic is the fraction of the Wiktionary test set that appears in the top N proposed phrasal verbs by the method, where N is an arbitrary number of top-ranked candidates held constant when comparing different approaches (we use N = 1220). We do not compute precision, because the test set to which we compare is not an exhaustive list of phrasal verbs, due to the development/test split, frequency filtering, and omissions in the original lexical resource. Proposing a phrasal verb not in the test set is not necessarily an error, but identifying many phrasal verbs from the test set is an indication of an effective method. Recall-at-N is a natural way to evaluate a ranking system where the gold-standard data is an incomplete, unranked set.

                                Recall-at-1220
                                 Dev    Test
Baseline
  Frequent Candidates            17.0   19.3
  WordNet 3.0 Frequent           41.6   43.7
  WordNet 3.0 Filtered           49.4   48.8
Boosted
  Monolingual Only               30.1   30.2
  Bilingual Only                 47.1   43.9
  Monolingual+Bilingual          50.8   47.9

Table 3: Our boosted ranker combining monolingual and bilingual features (bottom) compared to three baselines (top) gives comparable performance to the human-curated upper bound.

Table 3 compares our approach to three baselines using the recall-at-1220 metric evaluated on both the development and test sets. As a lower bound, we evaluated the 1220 most frequent candidates in our monolingual corpus (Frequent Candidates).

As a competitive baseline, we evaluated the set of phrasal verbs in WordNet 3.0 (Fellbaum, 1998). We selected the most frequent 1220 out of 1781 verb-particle constructions in WordNet (WordNet 3.0 Frequent). A stronger baseline resulted from applying the same filtering procedure to WordNet that we did to Wiktionary: sorting all verb-particle entries by their language model score and retaining the 1220 consecutive entries that minimized language model variance (WordNet 3.0 Filtered). WordNet is a human-curated resource, and yet its recall-at-N compared to our Wiktionary test set is only 48.8%, indicating substantial divergence between the two resources. Such divergence is typical: lexical resources often disagree about what multiword expressions to include (Lin, 1999).
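The evaluation metric can be stated compactly. A sketch of recall-at-N, assuming `ranked` is the system's candidate list in decreasing score order and `gold` is the (incomplete) reference set:

```python
def recall_at_n(ranked, gold, n=1220):
    """Fraction of the gold set found among the top-n ranked candidates.
    Precision is deliberately not computed: a proposal missing from the
    incomplete gold set is not necessarily an error."""
    top = set(ranked[:n])
    return sum(1 for g in gold if g in top) / len(gold)

# Toy example: 1 of 3 gold phrasal verbs appears in the top 2 proposals.
r = recall_at_n(["look up", "take off", "walk towards", "count on"],
                {"look up", "count on", "give up"}, n=2)
```

The default n = 1220 matches the paper's evaluation setting.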

The three final lines in Table 3 evaluate our boosted ranker. Automatically detecting phrasal verbs using monolingual features alone strongly outperformed the frequency-based lower bound, but underperformed the WordNet baseline. Bilingual features, using features from 50 languages, proved substantially more effective. The combination of both types of features yielded the best performance, outperforming the human-curated WordNet baseline on the development set (on which our ranker was optimized) and approaching its performance on the held-out test set.

4.6 Feature Analysis

The solid line in Figure 2 shows the recall-at-1220 for a boosted ranker using all monolingual statistics and k bilingual statistics, for increasing k. Bilingual statistics are added according to their individual recall, from best-performing to worst. That is, the point at k = 0 uses only µ1, µ2, and µ3, the point at k = 1 adds the best individually-performing bilingual statistic (Spanish) as a weak ranker, the next point adds the second-best bilingual statistic (German), etc. Boosting maximizes performance on the development set, and evaluation is performed on the test set. We use T = 53 (equal to the total number of weak rankers).

[Figure 2: test-set recall-at-1220 (y-axis, 0–50%) plotted against the number of languages k (x-axis, 0–50); solid line: combined with AdaBoost; dotted line: individual bilingual statistics.]

Figure 2: The solid line shows recall-at-1220 when combining the k best-performing bilingual statistics and three monolingual statistics. The dotted line shows the individual performance of the kth best-performing bilingual statistic, when applied in isolation to rank candidates.

The dotted line in Figure 2 shows that individual bilingual statistics have recall-at-1220 ranging from 34.4% to 5.0%. This difference reflects the different sizes of parallel corpora and usefulness of different languages in identifying English semantic idiosyncrasy. Combining together the signal of multiple languages is clearly beneficial, and including many low-performing languages still offers overall improvements.

                          Recall-at-1220
                           Dev    Test
Bilingual only             47.1   43.9
Bilingual+µ1               48.1   46.9
Bilingual+µ2               50.1   48.3
Bilingual+µ3               48.4   46.3
Bilingual+µ1+µ2            50.2   47.9
Bilingual+µ1+µ3            49.0   47.4
Bilingual+µ2+µ3            50.4   49.4
Bilingual+µ1+µ2+µ3         50.8   47.9

Table 4: An ablation of monolingual statistics shows that they are useful in addition to the 50 bilingual statistics combined, and no single statistic provides maximal performance.

Table 4 shows the effect of adding different subsets of the monolingual statistics to the set of all 50 bilingual statistics. Monolingual statistics give a performance improvement of up to 5.5% recall on the test set, but the comparative behavior of the various combinations of the µi is somewhat unpredictable when training on the development set and evaluating on the test set. The pointwise mutual information of a verb and its particles (µ2) appears to be the most useful feature. In fact, the test set performance of using µ2 alone outperforms the combination of all three. The best combination even outperforms the WordNet 3.0 baseline on the test set, though optimizing on the development set would not select this model.

4.7 Error Analysis

Table 5 shows the 100 highest ranked phrasal verb candidates by our system that do not appear in either the development or test sets. Most of these candidates are in fact English phrasal verbs that happened to be missing from Wiktionary; some are present in Wiktionary but were removed from the reference

pick up pat on tap into fit for charge with suit against catch up burst into muck up haul up give up get off get through get up get in tack on buzz about do like plump for haul in keep up with strap on catch up with suck into get round chop off slap on pitch into get into inquire into drop behind get on catch up on pass on cue from carry around get around get over shoot at pick over shoot by shoot in make up to get past cast down set up with rule off hand round piss on hit by break down move for lead off pluck off flip through edge over strike off plug into keep up go past set off pull round see about stay on put up sidle up to buzz around take off set up slap in head towards shoot past inquire for tuck up lie with well before go on with reel from drive along snap off barge into whip on put down instance through bar from cut down on let in tune in to move off suit in lean against well beyond get down to go across sail into lie over hit with chow down on look after catch at

Table 5: The highest ranked phrasal verb candidates from our full system that do not appear in either Wiktionary set. Candidates are presented in decreasing rank; "pat on" is the second highest ranked candidate.

sets during filtering, and the remainder are in fact not phrasal verbs (true precision errors).

These errors fall largely into two categories. Some candidates are compositional, but contain polysemous verbs, such as hit by, drive along, and head towards. In these cases, prepositions disambiguate the verb, which naturally affects translation distributions. Other candidates are not phrasal verbs, but instead phrases that tend to have a different syntactic role, such as suit against, instance through, fit for, and lie over (conjugated as lay over). A careful treatment of part-of-speech tags when computing corpus statistics might address this issue.

5 Related Work

The idea of using word-aligned parallel corpora to identify idiomatic expressions has been pursued in a number of different ways. Melamed (1997) tests candidate MWEs by collapsing them into single tokens, training a new translation model with these tokens, and using the performance of the new model to judge candidates' noncompositionality. Villada Moirón and Tiedemann (2006) use word-aligned parallel corpora to identify Dutch MWEs, testing the assumption that the distributions of alignments of MWEs will generally have higher entropies than those of fully compositional compounds. Caseli et al. (2010) generate candidate multiword expressions by picking out sufficiently common phrases that align to single target-side tokens. Tsvetkov and Wintner (2012) generate candidate MWEs by finding one-to-one alignments in parallel corpora which are not in a bilingual dictionary, and ranking them based on monolingual statistics. The system of Salehi and Cook (2013) is perhaps the closest to the current work, judging noncompositionality using string edit distance between a candidate phrase's automatic translation and its components' individual translations. Unlike the current work, their method does not use distributions over translations or combine individual bilingual values with boosting; however, they find, as we do, that incorporating many languages is beneficial to MWE identification.

A large body of work has investigated the identification of noncompositional compounds from monolingual sources (Lin, 1999; Schone and Jurafsky, 2001; Fazly and Stevenson, 2006; McCarthy et al., 2003; Baldwin et al., 2003; Villavicencio, 2003). Many of these monolingual statistics could be viewed as weak rankers and fruitfully incorporated into our framework.

There has also been a substantial amount of work addressing the problem of differentiating between literal and idiomatic instances of phrases in context (Katz and Giesbrecht, 2006; Li et al., 2010;
Melamed cation of noncompositional compounds from mono- (1997) tests candidate MWEs by collapsing them lingual sources (Lin, 1999; Schone and Jurafsky, into single tokens, training a new translation model 2001; Fazly and Stevenson, 2006; McCarthy et with these tokens, and using the performance of al., 2003; Baldwin et al., 2003; Villavicencio, the new model to judge candidates’ noncomposi- 2003). Many of these monolingual statistics could tionality. Villada Moiron´ and Tiedemann (2006) be viewed as weak rankers and fruitfully incorpo- use word-aligned parallel corpora to identify Dutch rated into our framework. MWEs, testing the assumption that the distributions There has also been a substantial amount of work of alignments of MWEs will generally have higher addressing the problem of differentiating between entropies than those of fully compositional com- literal and idiomatic instances of phrases in con- pounds. Caseli et al. (2010) generate candidate mul- text (Katz and Giesbrecht, 2006; Li et al., 2010;

Sporleder and Li, 2009; Birke and Sarkar, 2006; Diab and Bhutada, 2009). We do not attempt this task; however, techniques for token identification could be used to improve type identification (Baldwin, 2005).

6 Conclusion

We have presented the polyglot ranking approach to phrasal verb identification, using parallel corpora from many languages to identify phrasal verbs. We proposed an evaluation metric that acknowledges the inherent incompleteness of reference sets, but distinguishes among competing systems in a manner aligned to the goals of the task. We developed a recall-oriented learning method that integrates multiple weak ranking signals, and demonstrated experimentally that combining statistical evidence from a large number of bilingual corpora, as well as from monolingual corpora, produces the most effective system overall. We look forward to generalizing our approach to other types of noncompositional phrases.

Acknowledgments

Special thanks to Ivan Sag, who argued for the importance of handling multi-word expressions in natural language processing applications, and who taught the authors about natural language once upon a time. We would also like to thank the anonymous reviewers for their helpful suggestions.

References

Otavio Acosta, Aline Villavicencio, and Viviane Moreira. 2011. Identification and treatment of multiword expressions applied to information retrieval. In Proceedings of the ACL Workshop on Multiword Expressions.

Timothy Baldwin and Aline Villavicencio. 2002. Extracting the unextractable: A case study on verb-particles. In Proceedings of the Sixth Conference on Natural Language Learning.

Timothy Baldwin, Colin Bannard, Takaaki Tanaka, and Dominic Widdows. 2003. An empirical model of multiword expression decomposability. In Proceedings of the ACL Workshop on Multiword Expressions.

Timothy Baldwin. 2005. Deep lexical acquisition of verb-particle constructions. Computer Speech & Language, Special Issue on Multiword Expressions.

Timothy Baldwin. 2008. A resource for evaluating the deep lexical acquisition of English verb-particle constructions. In Proceedings of the LREC Workshop Towards a Shared Task for Multiword Expressions.

Colin Bannard, Timothy Baldwin, and Alex Lascarides. 2003. A statistical approach to the semantics of verb-particles. In Proceedings of the ACL Workshop on Multiword Expressions.

Julia Birke and Anoop Sarkar. 2006. A clustering approach for the nearly unsupervised recognition of nonliteral language. In Proceedings of the European Chapter of the Association for Computational Linguistics.

Thorsten Brants, Ashok C. Popat, Peng Xu, Franz J. Och, and Jeffrey Dean. 2007. Large language models in machine translation. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning.

Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics.

Miriam Butt. 2003. The light verb jungle. In Proceedings of the Workshop on Multi-Verb Constructions.

Helena de Medeiros Caseli, Carlos Ramisch, Maria das Graças Volpe Nunes, and Aline Villavicencio. 2010. Alignment-based extraction of multiword expressions. Language Resources and Evaluation.

Kenneth Ward Church and Patrick Hanks. 1990. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1).

Kenneth Church. 2011. How many multiword expressions do people know? In Proceedings of the ACL Workshop on Multiword Expressions.

Mona T. Diab and Pravin Bhutada. 2009. Verb noun construction MWE token supervised classification. In Proceedings of the ACL Workshop on Multiword Expressions.

Robert Dixon. 1982. The grammar of English phrasal verbs. Australian Journal of Linguistics.

Afsaneh Fazly and Suzanne Stevenson. 2006. Automatically constructing a lexicon of verb phrase idiomatic combinations. In Proceedings of the European Chapter of the Association for Computational Linguistics.

Christiane Fellbaum. 1998. WordNet: An Electronic Lexical Database. The MIT Press.

Mark Finlayson and Nidhi Kulkarni. 2011. Detecting multiword expressions improves word sense disambiguation. In Proceedings of the ACL Workshop on Multiword Expressions.

Yoav Freund and Robert E. Schapire. 1995. A decision-theoretic generalization of on-line learning and an application to boosting. In Proceedings of the Conference on Computational Learning Theory.

Yoav Freund, Raj Iyer, Robert E. Schapire, and Yoram Singer. 2003. An efficient boosting algorithm for combining preferences. The Journal of Machine Learning Research.

Ray Jackendoff. 1993. The Architecture of the Language Faculty. MIT Press.

Graham Katz and Eugenie Giesbrecht. 2006. Automatic identification of non-compositional multi-word expressions using latent semantic analysis. In Proceedings of the ACL Workshop on Multiword Expressions.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the North American Chapter of the Association for Computational Linguistics.

Linlin Li, Benjamin Roth, and Caroline Sporleder. 2010. Topic models for word sense disambiguation and token-based idiom detection. In Proceedings of the Association for Computational Linguistics.

Percy Liang, Ben Taskar, and Dan Klein. 2006. Alignment by agreement. In Proceedings of the North American Chapter of the Association for Computational Linguistics.

Dekang Lin. 1999. Automatic identification of non-compositional phrases. In Proceedings of the Association for Computational Linguistics.

Diana McCarthy, Bill Keller, and John Carroll. 2003. Detecting a continuum of compositionality in phrasal verbs. In Proceedings of the ACL Workshop on Multiword Expressions.

I. Dan Melamed. 1997. Automatic discovery of non-compositional compounds in parallel data. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Santanu Pal, Sudip Kumar Naskar, Pavel Pecina, Sivaji Bandyopadhyay, and Andy Way. 2010. Handling named entities and compound verbs in phrase-based statistical machine translation. In Proceedings of the COLING 2010 Workshop on Multiword Expressions.

Zhixiang Ren, Yajuan Lü, Jie Cao, Qun Liu, and Yun Huang. 2009. Improving statistical machine translation using domain bilingual multiword expressions. In Proceedings of the ACL Workshop on Multiword Expressions.

Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann A. Copestake, and Dan Flickinger. 2002. Multiword expressions: A pain in the neck for NLP. In Proceedings of the CICLING Conference on Intelligent Text Processing and Computational Linguistics.

Bahar Salehi and Paul Cook. 2013. Predicting the compositionality of multiword expressions using translations in multiple languages. In Second Joint Conference on Lexical and Computational Semantics.

Patrick Schone and Daniel Jurafsky. 2001. Is knowledge-free induction of multiword unit dictionary headwords a solved problem? In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Caroline Sporleder and Linlin Li. 2009. Unsupervised recognition of literal and non-literal use of idiomatic expressions. In Proceedings of the European Chapter of the Association for Computational Linguistics.

Yulia Tsvetkov and Shuly Wintner. 2012. Extraction of multi-word expressions from small parallel corpora. Natural Language Engineering.

Jakob Uszkoreit, Jay Ponte, Ashok Popat, and Moshe Dubiner. 2010. Large scale parallel document mining for machine translation. In Proceedings of the Conference on Computational Linguistics.

Sriram Venkatapathy and Aravind K. Joshi. 2006. Using information about multi-word expressions for the word-alignment task. In Proceedings of the ACL Workshop on Multiword Expressions.

Begoña Villada Moirón and Jörg Tiedemann. 2006. Identifying idiomatic expressions using automatic word-alignment. In Proceedings of the EACL Workshop on Multiword Expressions in a Multilingual Context.

Aline Villavicencio. 2003. Verb-particle constructions and lexical resources. In Proceedings of the ACL Workshop on Multiword Expressions.

Stephan Vogel, Hermann Ney, and Christoph Tillmann. 1996. HMM-based word alignment in statistical translation. In Proceedings of the Conference on Computational Linguistics.

Jun Xu and Hang Li. 2007. AdaRank: A boosting algorithm for information retrieval. In Proceedings of the SIGIR Conference on Research and Development in Information Retrieval.

Ying Xu, Randy Goebel, Christoph Ringlstetter, and Grzegorz Kondrak. 2010. Application of the tightness continuum measure to Chinese information retrieval. In Proceedings of the COLING Workshop on Multiword Expressions.
