
Automatic Idiom Identification in Wiktionary

Grace Muzny and Luke Zettlemoyer
Computer Science & Engineering
University of Washington
Seattle, WA 98195
{muznyg,lsz}@cs.washington.edu

Abstract

Online resources, such as Wiktionary, provide an accurate but incomplete source of idiomatic phrases. In this paper, we study the problem of automatically identifying idiomatic dictionary entries with such resources. We train an idiom classifier on a newly gathered corpus of over 60,000 Wiktionary multi-word definitions, incorporating features that model whether phrase meanings are constructed compositionally. Experiments demonstrate that the learned classifier can provide high quality idiom labels, more than doubling the number of idiomatic entries from 7,764 to 18,155 at precision levels of over 65%. These gains also translate to idiom detection in sentences, by simply using known word sense disambiguation algorithms to match phrases to their definitions. In a set of Wiktionary definition example sentences, the more complete set of idioms boosts detection recall by over 28 percentage points.

1 Introduction

Idiomatic language is common and provides unique challenges for language understanding systems. For example, a diamond in the rough can be the literal unpolished object or a crude but lovable person. Understanding such distinctions is important for many applications, including information retrieval (Sag et al., 2002) and machine translation (Shutova et al., 2012).

We use Wiktionary as a large, but incomplete, reference for idiomatic entries; individual entries can be marked as idiomatic but, in practice, most are not. Using these incomplete annotations as supervision, we train a binary Perceptron classifier for identifying idiomatic entries. We introduce new lexical and graph-based features that use WordNet and Wiktionary to compute semantic relatedness. This allows us to learn, for example, that the words in the idiom diamond in the rough are more closely related to the words in its literal definition than the idiomatic one. Experiments demonstrate that the classifier achieves precision of over 65% at recall over 52% and that, when used to fill in missing Wiktionary idiom labels, it more than doubles the number of idioms from 7,764 to 18,155.

These gains also translate to idiom detection in sentences, by simply using the Lesk (1986) word sense disambiguation (WSD) algorithm to match phrases to their definitions. This approach allows for scalable detection with no restrictions on the syntactic structure or context of the target phrase. In a set of Wiktionary definition example sentences, the more complete set of idioms boosts detection recall by over 28 percentage points.

2 Related Work

To the best of our knowledge, this work represents the first attempt to identify dictionary entries as idiomatic and the first to reduce idiom detection to identification via a dictionary.

Previous idiom detection systems fall in one of two paradigms: phrase classification, where a phrase p is always idiomatic or literal, e.g. (Gedigian et al., 2006; Shutova et al., 2010), or token classification, where each occurrence of a phrase p can be idiomatic or literal, e.g. (Katz and Giesbrecht, 2006; Birke and Sarkar, 2006; Li and Sporleder, 2009). Most previous idiom detection systems have focused on specific syntactic constructions. For instance, Shutova et al. (2010) consider subject/verb (campaign surged) and verb/direct-object idioms (stir excitement), while Fazly and Stevenson (2006), Cook et al. (2007), and Diab and Bhutada (2009) detect verb/noun idioms (blow smoke). Fothergill and Baldwin (2012) are syntactically unconstrained, but only study Japanese idioms. Although we focus on identifying idiomatic dictionary entries, one advantage of our approach is that it enables syntactically unconstrained token-level detection for any phrase in the dictionary.

Data Set          Literal  Idiomatic  Total
All               56,037   7,764      63,801
Train             47,633   6,600      54,233
Unannotated Dev    2,801     388       3,189
Annotated Dev      2,212     958       3,170
Unannotated Test   5,603     776       6,379
Annotated Test     4,510   1,834       6,344

Figure 1: Number of dictionary entries with each class for the Wiktionary identification data.

Data Set  Literal  Idiomatic  Total
Dev           171        330    501
Test          360        695  1,055

Figure 2: Number of sentences of each class for the Wiktionary detection data.

3 Formal Problem Definitions

Identification  For identification, we assume data of the form {(⟨p_i, d_i⟩, y_i) : i = 1 ... n}, where p_i is the phrase associated with definition d_i and y_i ∈ {literal, idiomatic}. For example, this would include both the literal pair ⟨"leave for dead", "To abandon a person or other living creature that is injured or otherwise incapacitated, assuming that the death of the one abandoned will soon follow."⟩ and the idiomatic pair ⟨"leave for dead", "To disregard or bypass as unimportant."⟩. Given ⟨p_i, d_i⟩, we aim to predict y_i.

Detection  To evaluate identification in the context of detection, we assume data {(⟨p_i, e_i⟩, y_i) : i = 1 ... n}. Here, p_i is the phrase in example sentence e_i whose idiomatic status is labeled y_i ∈ {idiomatic, literal}. One such idiomatic pair is ⟨"heart to heart", "They sat down and had a long overdue heart to heart about the future of their relationship."⟩. Given ⟨p_i, e_i⟩, we again aim to predict y_i.

4 Data

We gathered phrases, definitions, and example sentences from the English-language Wiktionary dump from November 13th, 2012.[1]

[1] We used the Java Wiktionary Library (Zesch et al., 2008).

Identification  Phrase, definition pairs ⟨p, d⟩ were gathered with the following restrictions: the title of the Wiktionary entry must be English, p must be composed of two or more words w, and ⟨p, d⟩ must be in its base form (senses that are not defined as a different tense of a phrase). For example, the pair ⟨"weapons of mass destruction", "Plural form of weapon of mass destruction"⟩ was removed while the pair ⟨"weapon of mass destruction", "A chemical, biological, radiological, nuclear or other weapon that ..."⟩ was kept.

Each pair ⟨p, d⟩ was assigned label y according to the idiom labels in Wiktionary, producing the Train, Unannotated Dev, and Unannotated Test data sets. In practice, this produces a noisy assignment because a majority of the idiomatic senses are not marked. The development and test sets were annotated to correct these potential omissions. Annotators used the definition of an idiom as a "phrase with a non-compositional meaning" to produce the Annotated Dev and Annotated Test data sets. Figure 1 presents the data statistics.

We measured inter-annotator agreement on 1,000 examples. Two annotators marked each dictionary entry as literal, idiomatic, or indeterminable. Less than one half of one percent could not be determined;[2] the computed kappa was 81.85. Given this high level of agreement, the rest of the data were only labeled by a single annotator, following the methodology used with the VNC-Tokens Dataset (Cook et al., 2008).

[2] The indeterminable pairs were omitted from the data.

Detection  For detection, we gathered the example sentences provided, when available, for each definition used in our annotated identification data sets. These sentences provide a clean source of development and test data containing idiomatic and literal phrase usages. In all, there were over 1,300 unique phrases, half of which had more than one possible dictionary definition in Wiktionary. Figure 2 provides the complete statistics.
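For concreteness, the following is a minimal sketch of how the ⟨phrase, definition⟩ identification pairs and ⟨phrase, sentence⟩ detection pairs described above could be represented and filtered. The class names, field names, and the base-form heuristic are our own illustration under the restrictions stated in this section, not part of the paper or its released data.

```python
from dataclasses import dataclass

@dataclass
class IdentificationExample:
    """A <phrase, definition> pair with its (possibly noisy) idiom label."""
    phrase: str        # e.g. "leave for dead"
    definition: str    # one Wiktionary sense of the phrase
    label: str         # "idiomatic" or "literal"

@dataclass
class DetectionExample:
    """A <phrase, example sentence> pair for token-level detection."""
    phrase: str
    sentence: str
    label: str

def keep_for_identification(entry_language: str, phrase: str, definition: str) -> bool:
    """Apply the Section 4 restrictions: English entry, multi-word phrase,
    and a base-form sense. The string test below is only a crude stand-in
    for the paper's base-form check."""
    if entry_language != "English":
        return False
    if len(phrase.split()) < 2:
        return False
    if definition.lower().startswith(("plural form of", "past tense of")):
        return False
    return True
```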

5 Identification Model

For identification, we use a linear model that predicts a class y* ∈ {literal, idiomatic} for an input pair ⟨p, d⟩ with phrase p and definition d. We assign the class:

    y* = arg max_y θ · φ(p, d, y)

given features φ(p, d, y) ∈ R^n with associated parameters θ ∈ R^n.

Learning  In this work, we use the averaged Perceptron algorithm (Freund and Schapire, 1999) to perform learning. The number of iterations T, bounded by the range [1, 100], was optimized by maximizing F-measure on the development set. The models described below correspond to the features they use. All models are trained on the same, unannotated training data.

Features  The features that were developed fall into two categories: lexical and graph-based features. The lexical features were motivated by the intuition that literal phrases are more likely than idiomatic ones to have words in d that are closely related to those in p, because literal phrases do not break the principle of compositionality. All words compared are stemmed. Let count(w, t) be the number of times word w appears in text t.

• synonym overlap: Let S be the set of synonyms, as defined in Wiktionary, for all words in p. Then the synonym overlap is (1/|S|) Σ_{s ∈ S} count(s, d).

• antonym overlap: Let A be the set of antonyms, as defined in Wiktionary, for all words in p. Then the antonym overlap is (1/|A|) Σ_{a ∈ A} count(a, d).

• average number of capitals:[3] The number of capital letters in p divided by the number of words in p.

[3] In practice, this feature identifies most proper nouns.

Graph-based features use the graph structure of WordNet 3.0 to calculate path distances. Let distance(w, v, rel, n) be the minimum distance via links of type rel in WordNet from a word w to a word v, up to a maximum threshold value n, and 0 otherwise. The features compute:

• closest synonym: min_{w ∈ p, v ∈ d} distance(w, v, synonym, 5)

• closest antonym:[4] min_{w ∈ p, v ∈ d} distance(w, v, antonym, 5)

• average synonym distance: (1/|p|) Σ_{w ∈ p, v ∈ d} distance(w, v, synonym, 5)

• average hyponym: (1/|p|) Σ_{w ∈ p, v ∈ d} distance(w, v, hyponym, 5)

• synsets connected by an antonym: This feature indicates whether the set of synsets Syn_p, containing all synsets of all words in p, and the set of synsets Syn_d, containing all synsets of all words in d, are connected by a shared antonym. This feature follows an approach described by Budanitsky and Hirst (2006).

[4] The first relation expanded was the antonym relation. All subsequent expansions were via synonym relations.
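To make the graph-based features concrete, here is a minimal sketch of how the bounded WordNet distance could be computed with NLTK's WordNet interface. The helper names, the breadth-first search, and the treatment of unreachable word pairs are our own illustrative choices, not the authors' released code.

```python
from nltk.corpus import wordnet as wn

MAX_DIST = 5  # threshold n from the paper's distance(w, v, rel, n)

def neighbors(word, rel):
    """Words reachable from `word` by one WordNet link of type `rel`."""
    out = set()
    for synset in wn.synsets(word):
        if rel == "synonym":
            out.update(synset.lemma_names())
        elif rel == "antonym":
            for lemma in synset.lemmas():
                out.update(ant.name() for ant in lemma.antonyms())
        elif rel == "hyponym":
            for hypo in synset.hyponyms():
                out.update(hypo.lemma_names())
    out.discard(word)
    return out

def distance(w, v, rel, n=MAX_DIST):
    """Minimum number of `rel` links from w to v, up to n; 0 if none is found.
    Following footnote 4, for the antonym relation only the first hop is an
    antonym link and later hops follow synonym links."""
    frontier, seen = {w}, {w}
    for depth in range(1, n + 1):
        hop_rel = "synonym" if (rel == "antonym" and depth > 1) else rel
        reached = set()
        for word in frontier:
            reached |= neighbors(word, hop_rel)
        if v in reached:
            return depth
        frontier = reached - seen
        seen |= frontier
        if not frontier:
            return 0
    return 0

def closest_distance(phrase_words, definition_words, rel):
    """Feature value min over word pairs of distance(w, v, rel, 5); pairs with
    no connection (distance 0) are skipped, a convention we adopt for the sketch."""
    found = [distance(w, v, rel) for w in phrase_words for v in definition_words]
    found = [d for d in found if d > 0]
    return min(found) if found else 0
```

The averaging features can be computed analogously by summing these distances over all word pairs and dividing by the number of words in the phrase.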

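Putting the model together, the sketch below shows the linear scoring rule and the averaged Perceptron update described above. It assumes feature extraction has already produced a dictionary of feature values for each ⟨p, d⟩ pair; the function names are ours, and in the paper the number of passes T is tuned on the development set by maximizing F-measure.

```python
from collections import defaultdict

LABELS = ("literal", "idiomatic")

def score(weights, features, label):
    """theta . phi(p, d, y) for one candidate label y."""
    return sum(weights[(name, label)] * value for name, value in features.items())

def predict(weights, features):
    """y* = arg max_y theta . phi(p, d, y)."""
    return max(LABELS, key=lambda label: score(weights, features, label))

def train_averaged_perceptron(examples, T):
    """examples: list of (feature_dict, gold_label) pairs; T: number of passes.
    Returns the averaged weight vector (Freund and Schapire, 1999)."""
    weights = defaultdict(float)   # current weights
    totals = defaultdict(float)    # running sum of weights, for averaging
    steps = 0
    for _ in range(T):
        for features, gold in examples:
            guess = predict(weights, features)
            if guess != gold:
                for name, value in features.items():
                    weights[(name, gold)] += value
                    weights[(name, guess)] -= value
            steps += 1
            for key in weights:
                totals[key] += weights[key]
    averaged = defaultdict(float)
    for key, total in totals.items():
        averaged[key] = total / steps
    return averaged
```

At test time, a pair is classified with predict(averaged_weights, features).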
6 Experiments

We report identification and detection results, varying the data labeling and choice of feature sets.

6.1 Identification

Random Baseline  We use a proportionally random baseline for the identification task that classifies according to the proportion of literal definitions seen in the training data.

Results  Figure 3 provides the results for the baseline, the full approach, and variations with subsets of the features. Results are reported for the original, unannotated test set and for the same test examples with corrected idiom labels. All models improved over their corresponding baselines by more than 22 points, and both feature families contributed.[5]

Figure 4 shows the complete precision-recall curve. We selected our operating point to optimize F-measure, but we see that the graph features perform well across all recall levels and that adding the lexical features provides consistent improvement in precision. However, other operating points are possible, especially when aiming for high precision to extend the labels in Wiktionary. For example, the original 7,764 entries can be extended to 18,155 at 65% precision, 9,594 at 80%, or 27,779 at 52.9%.

Finally, Figures 5 and 6 present qualitative results, including newly discovered idioms and high scoring false identifications. Analysis reveals where our system has room to improve: errors most often occur with phrases that are specific to a certain field, such as sports or mathematics, and with phrases whose words also appear in their definitions.

[5] We also ran ablations demonstrating that removing each feature from the Lexical+Graph model hurt performance, but omit the detailed results for space.

Data Set     Model          Rec.  Prec.  F1
Unannotated  Lexical        85.8  21.9   34.9
             Graph          62.4  26.6   37.3
             Lexical+Graph  70.5  28.1   40.1
             Baseline       12.2  11.9   12.0
Annotated    Lexical        81.2  49.3   61.4
             Graph          64.3  51.3   57.1
             Lexical+Graph  75.0  52.9   62.0
             Baseline       29.5  12.5   17.6

Figure 3: Results for idiomatic definition identification.

Figure 4 (precision-recall plot): Precision and recall with varied features on the annotated test set.

Phrase                        Definition
feel free                     You have my permission.
live down                     To get used to something shameful.
nail down                     To make something (e.g. a decision or plan) firm or certain.
make after                    To chase.
get out                       To say something with difficulty.
good riddance to bad rubbish  A welcome departure.
as all hell                   To a great extent or degree; very.
roll around                   To happen, occur, take place.

Figure 5: Newly discovered idioms.

Phrase             Definition
put asunder        To sunder; disjoin; separate; disunite; divorce; annul; dissolve.
add up             To take a sum.
peel off           To remove (an outer layer or covering, such as clothing).
straighten up      To become straight, or straighter.
wild potato        The edible root of this plant.
shallow embedding  The act of representing one logic or language with another by providing a syntactic translation.

Figure 6: High scoring false identifications.

6.2 Detection

Approach  We use the Lesk (1986) algorithm to perform WSD, matching an input phrase p from sentence e to the definition d in Wiktionary that defines the sense in which p is being used. The final classification y is then assigned to ⟨p, d⟩ by the identification model.

Results  Figure 7 shows detection results. The baseline for this experiment is a model that assigns the default labels within Wiktionary to the disambiguated definition. The Annotated model is the Lexical+Graph model shown in Figure 3 evaluated on the annotated data. The +Default setting augments the identification model by labeling the pair ⟨p, e⟩ as idiomatic if either the model or the original label within Wiktionary identifies it as such.

Model              Rec.  Prec.  F1
Default            60.5  100.0  75.4
Annotated          78.3   76.7  77.5
Annotated+Default  89.2   79.0  83.8

Figure 7: Detection results.
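As an illustration of this detection pipeline, the sketch below pairs a simplified Lesk matcher with an identification model. The function names, the bag-of-words overlap, and the handling of the +Default setting are our own simplifications rather than the authors' exact implementation.

```python
def lesk_match(sentence, definitions, stopwords=frozenset()):
    """Simplified Lesk: pick the definition whose words overlap most with the
    words of the example sentence."""
    context = {w.lower() for w in sentence.split()} - stopwords
    best_definition, best_overlap = None, -1
    for definition in definitions:
        overlap = len(context & {w.lower() for w in definition.split()})
        if overlap > best_overlap:
            best_definition, best_overlap = definition, overlap
    return best_definition

def detect(phrase, sentence, definitions, identify, default_labels=None):
    """Disambiguate the phrase's sense in the sentence, then classify that
    sense with the identification model (e.g. the Lexical+Graph classifier).
    If `default_labels` maps (phrase, definition) to the original Wiktionary
    idiom labels, this mimics the +Default setting by also trusting an
    existing idiomatic label."""
    definition = lesk_match(sentence, definitions)
    label = identify(phrase, definition)  # "idiomatic" or "literal"
    if default_labels and default_labels.get((phrase, definition)) == "idiomatic":
        label = "idiomatic"
    return label
```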
7 Conclusions

We presented a supervised approach to classifying definitions as idiomatic or literal that more than doubles the number of marked idioms in Wiktionary, even when training on incomplete data. When combined with the Lesk word sense algorithm, this approach provides a complete idiom detector for any phrase in the dictionary.

We expect that semi-supervised learning techniques could better recover the missing labels and boost overall performance. We also think it should be possible to scale the detection approach, perhaps with automatic dictionary definition discovery, and evaluate it on more varied sentence types.

Acknowledgments

The research was supported in part by the National Science Foundation (IIS-1115966) and a Mary Gates Research Scholarship. The authors thank Nicholas FitzGerald, Sarah Vieweg, and Mark Yatskar for helpful discussions and feedback.

References

J. Birke and A. Sarkar. 2006. A clustering approach for nearly unsupervised recognition of nonliteral language. In Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics.

A. Budanitsky and G. Hirst. 2006. Evaluating WordNet-based measures of lexical semantic relatedness. Computational Linguistics, 32(1):13–47.

P. Cook, A. Fazly, and S. Stevenson. 2007. Pulling their weight: Exploiting syntactic forms for the automatic identification of idiomatic expressions in context. In Proceedings of the Workshop on a Broader Perspective on Multiword Expressions.

P. Cook, A. Fazly, and S. Stevenson. 2008. The VNC-Tokens dataset. In Proceedings of the Language Resources and Evaluation Conference Workshop Towards a Shared Task for Multiword Expressions.

M. Diab and P. Bhutada. 2009. Verb noun construction MWE token supervised classification. In Proceedings of the Workshop on Multiword Expressions: Identification, Interpretation, Disambiguation and Applications.

A. Fazly and S. Stevenson. 2006. Automatically constructing a lexicon of verb phrase idiomatic combinations. In Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics.

R. Fothergill and T. Baldwin. 2012. Combining resources for MWE-token classification. In Proceedings of the First Joint Conference on Lexical and Computational Semantics (*SEM).

Y. Freund and R. E. Schapire. 1999. Large margin classification using the perceptron algorithm. Machine Learning, 37(3):277–296.

M. Gedigian, J. Bryant, S. Narayanan, and B. Ciric. 2006. Catching metaphors. In Proceedings of the Third Workshop on Scalable Natural Language Understanding.

G. Katz and E. Giesbrecht. 2006. Automatic identification of non-compositional multi-word expressions using latent semantic analysis. In Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties.

M. Lesk. 1986. Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. In Proceedings of the Special Interest Group on the Design of Communication.

L. Li and C. Sporleder. 2009. Classifier combination for contextual idiom detection without labelled data. In Proceedings of the Conference on Empirical Methods in Natural Language Processing.

I. Sag, T. Baldwin, F. Bond, A. Copestake, and D. Flickinger. 2002. Multiword expressions: A pain in the neck for NLP. In Computational Linguistics and Intelligent Text Processing. Springer.

E. Shutova, L. Sun, and A. Korhonen. 2010. Metaphor identification using verb and noun clustering. In Proceedings of the International Conference on Computational Linguistics.

E. Shutova, S. Teufel, and A. Korhonen. 2012. Statistical metaphor processing. Computational Linguistics, 39(2):301–353.

T. Zesch, C. Müller, and I. Gurevych. 2008. Extracting lexical semantic knowledge from Wikipedia and Wiktionary. In Proceedings of the International Conference on Language Resources and Evaluation.