Automatic Idiom Identification in Wiktionary
Grace Muzny and Luke Zettlemoyer
Computer Science & Engineering
University of Washington
Seattle, WA 98195
{muzny,lsz}@cs.washington.edu

Abstract

Online resources, such as Wiktionary, provide an accurate but incomplete source of idiomatic phrases. In this paper, we study the problem of automatically identifying idiomatic dictionary entries with such resources. We train an idiom classifier on a newly gathered corpus of over 60,000 Wiktionary multi-word definitions, incorporating features that model whether phrase meanings are constructed compositionally. Experiments demonstrate that the learned classifier can provide high quality idiom labels, more than doubling the number of idiomatic entries from 7,764 to 18,155 at precision levels of over 65%. These gains also translate to idiom detection in sentences, by simply using known word sense disambiguation algorithms to match phrases to their definitions. In a set of Wiktionary definition example sentences, the more complete set of idioms boosts detection recall by over 28 percentage points.

1 Introduction

Idiomatic language is common and provides unique challenges for language understanding systems. For example, a diamond in the rough can be the literal unpolished object or a crude but lovable person. Understanding such distinctions is important for many applications, including parsing (Sag et al., 2002) and machine translation (Shutova et al., 2012).

We use Wiktionary as a large, but incomplete, reference for idiomatic entries; individual entries can be marked as idiomatic but, in practice, most are not. Using these incomplete annotations as supervision, we train a binary Perceptron classifier for identifying idiomatic dictionary entries. We introduce new lexical and graph-based features that use WordNet and Wiktionary to compute semantic relatedness. This allows us to learn, for example, that the words in the phrase diamond in the rough are more closely related to the words in its literal definition than the idiomatic one. Experiments demonstrate that the classifier achieves precision of over 65% at recall over 52% and that, when used to fill in missing Wiktionary idiom labels, it more than doubles the number of idioms from 7,764 to 18,155.

These gains also translate to idiom detection in sentences, by simply using the Lesk (1986) word sense disambiguation (WSD) algorithm to match phrases to their definitions. This approach allows for scalable detection with no restrictions on the syntactic structure or context of the target phrase. In a set of Wiktionary definition example sentences, the more complete set of idioms boosts detection recall by over 28 percentage points.
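To illustrate the detection step, the following is a minimal sketch of Lesk-style matching, assuming definitions are available as plain strings. The function name lesk_match, the whitespace tokenization, and the example sentence are illustrative simplifications that keep only the core overlap heuristic of the Lesk (1986) algorithm.

```python
def lesk_match(sentence, definitions):
    """Return the definition sharing the most words with the sentence."""
    context = set(sentence.lower().split())
    best, best_overlap = None, -1
    for defn in definitions:
        overlap = len(context & set(defn.lower().split()))
        if overlap > best_overlap:
            best, best_overlap = defn, overlap
    return best

# Choosing between the literal and idiomatic senses of "leave for dead":
senses = [
    "To abandon a person or other living creature that is injured "
    "or otherwise incapacitated, assuming that the death of the one "
    "abandoned will soon follow.",
    "To disregard or bypass as unimportant.",
]
print(lesk_match("Critics had left the film for dead, dismissing it "
                 "as unimportant.", senses))  # selects the idiomatic sense
```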
2 Related Work

To the best of our knowledge, this work represents the first attempt to identify dictionary entries as idiomatic and the first to reduce idiom detection to identification via a dictionary.

Previous idiom detection systems fall in one of two paradigms: phrase classification, where a phrase p is always idiomatic or literal, e.g. (Gedigian et al., 2006; Shutova et al., 2010), or token classification, where each occurrence of a phrase p can be idiomatic or literal, e.g. (Katz and Giesbrecht, 2006; Birke and Sarkar, 2006; Li and Sporleder, 2009). Most previous idiom detection systems have focused on specific syntactic constructions. For instance, Shutova et al. (2010) consider subject/verb (campaign surged) and verb/direct-object idioms (stir excitement), while Fazly and Stevenson (2006), Cook et al. (2007), and Diab and Bhutada (2009) detect verb/noun idioms (blow smoke). Fothergill and Baldwin (2012) are syntactically unconstrained, but only study Japanese idioms. Although we focus on identifying idiomatic dictionary entries, one advantage of our approach is that it enables syntactically unconstrained token-level detection for any phrase in the dictionary.

Data Set           Literal   Idiomatic   Total
All                56,037    7,764       63,801
Train              47,633    6,600       54,233
Unannotated Dev    2,801     388         3,189
Annotated Dev      2,212     958         3,170
Unannotated Test   5,603     776         6,379
Annotated Test     4,510     1,834       6,344

Figure 1: Number of dictionary entries with each class for the Wiktionary identification data.

Data Set   Literal   Idiomatic   Total
Dev        171       330         501
Test       360       695         1,055

Figure 2: Number of sentences of each class for the Wiktionary detection data.

3 Formal Problem Definitions

Identification For identification, we assume data of the form $\{(\langle p_i, d_i \rangle, y_i) : i = 1 \ldots n\}$, where $p_i$ is the phrase associated with definition $d_i$ and $y_i \in \{\text{literal}, \text{idiomatic}\}$. For example, this would include both the literal pair ⟨"leave for dead", "To abandon a person or other living creature that is injured or otherwise incapacitated, assuming that the death of the one abandoned will soon follow."⟩ and the idiomatic pair ⟨"leave for dead", "To disregard or bypass as unimportant."⟩. Given $\langle p_i, d_i \rangle$, we aim to predict $y_i$.

Detection To evaluate identification in the context of detection, we assume data $\{(\langle p_i, e_i \rangle, y_i) : i = 1 \ldots n\}$. Here, $p_i$ is the phrase in example sentence $e_i$ whose idiomatic status is labeled $y_i \in \{\text{idiomatic}, \text{literal}\}$. One such idiomatic pair is ⟨"heart to heart", "They sat down and had a long overdue heart to heart about the future of their relationship."⟩. Given $\langle p_i, e_i \rangle$, we again aim to predict $y_i$.

4 Data

We gathered phrases, definitions, and example sentences from the English-language Wiktionary dump from November 13th, 2012.[1]

Identification Phrase, definition pairs ⟨p, d⟩ were gathered with the following restrictions: the title of the Wiktionary entry must be English, p must be composed of two or more words w, and ⟨p, d⟩ must be in its base form, i.e. senses that are not defined as a different tense of a phrase. For example, the pair ⟨"weapons of mass destruction", "Plural form of weapon of mass destruction"⟩ was removed, while the pair ⟨"weapon of mass destruction", "A chemical, biological, radiological, nuclear or other weapon that ..."⟩ was kept.

Each pair ⟨p, d⟩ was assigned label y according to the idiom labels in Wiktionary, producing the Train, Unannotated Dev, and Unannotated Test data sets. In practice, this produces a noisy assignment because a majority of the idiomatic senses are not marked. The development and test sets were annotated to correct these potential omissions. Annotators used the definition of an idiom as a "phrase with a non-compositional meaning" to produce the Annotated Dev and Annotated Test data sets. Figure 1 presents the data statistics.

We measured inter-annotator agreement on 1,000 examples. Two annotators marked each dictionary entry as literal, idiomatic, or indeterminable. Less than one half of one percent could not be determined;[2] the computed kappa was 81.85. Given this high level of agreement, the rest of the data were only labeled by a single annotator, following the methodology used with the VNC-Tokens Dataset (Cook et al., 2008).

[1] We used the Java Wiktionary Library (Zesch et al., 2008).
[2] The indeterminable pairs were omitted from the data.
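For reference, here is a minimal sketch of a Cohen's kappa computation like the agreement check above. The two label lists are hypothetical stand-ins for the 1,000 doubly annotated entries, and the paper does not state which kappa variant was used, so unweighted Cohen's kappa is an assumption.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance agreement."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[l] * counts_b[l] for l in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical annotations for five dictionary entries:
ann1 = ["idiomatic", "literal", "literal", "idiomatic", "literal"]
ann2 = ["idiomatic", "literal", "idiomatic", "idiomatic", "literal"]
print(cohens_kappa(ann1, ann2) * 100)  # the paper reports 81.85 on this scale
```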
Detection For detection, we gathered the example sentences provided, when available, for each definition used in our annotated identification data sets. These sentences provide a clean source of development and test data containing idiomatic and literal phrase usages. In all, there were over 1,300 unique phrases, half of which had more than one possible dictionary definition in Wiktionary. Figure 2 provides the complete statistics.

5 Identification Model

For identification, we use a linear model that predicts class $y \in \{\text{literal}, \text{idiomatic}\}$ for an input pair ⟨p, d⟩ with phrase p and definition d. We assign the class:

$$y^* = \arg\max_y \; \theta \cdot \phi(p, d, y)$$

given features $\phi(p, d, y) \in \mathbb{R}^n$ with associated parameters $\theta \in \mathbb{R}^n$.

Learning In this work, we use the averaged Perceptron algorithm (Freund and Schapire, 1999) to perform learning. The number of training iterations T, bounded by the range [1, 100], was optimized by maximizing F-measure on the development set. The models described below correspond to the features they use. All models are trained on the same, unannotated training data.

Features The features that were developed fall into two categories: lexical and graph-based features. The lexical features were motivated by the intuition that literal phrases are more likely than idiomatic phrases to have words in d that are closely related to the words in p, because literal phrases do not break the principle of compositionality. All words compared are stemmed versions. Let count(w, t) be the number of times word w appears in text t.

Graph-based features use the graph structure of WordNet 3.0 to calculate path distances. Let distance(w, v, rel, n) be the minimum distance via links of type rel in WordNet from a word w to a word v, up to a maximum threshold n, and 0 otherwise. The features compute:

• closest synonym: $\min_{w \in p,\, v \in d} \mathrm{distance}(w, v, \mathrm{synonym}, 5)$

• closest antonym: $\min_{w \in p,\, v \in d} \mathrm{distance}(w, v, \mathrm{antonym}, 5)$

• average synonym distance: $\frac{1}{|p|} \sum_{w \in p,\, v \in d} \mathrm{distance}(w, v, \mathrm{synonym}, 5)$

• average hyponym: $\frac{1}{|p|} \sum_{w \in p,\, v \in d} \mathrm{distance}(w, v, \mathrm{hyponym}, 5)$

• synsets connected by an antonym: this feature indicates whether the set of synsets $\mathrm{Syn}_p$ (all synsets from all words in p) and the set of synsets $\mathrm{Syn}_d$ (all synsets from all words in d) are connected by a shared antonym. This feature follows an approach described by Budanitsky et al. (2006).

6 Experiments

We report identification and detection results, varying the data labeling and choice of feature sets.
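To make the model these experiments evaluate concrete, here is a minimal sketch of the graph-based distance computation and an averaged Perceptron trainer, assuming NLTK's WordNet interface (requires nltk.download('wordnet')). The paper does not give implementation details, so the relation handling in neighbors, the single closest-synonym feature in phi, and the per-epoch weight averaging are all illustrative simplifications.

```python
from collections import deque
from nltk.corpus import wordnet as wn

def neighbors(word, rel):
    """Words one `rel` link away from `word` in WordNet."""
    out = set()
    for syn in wn.synsets(word):
        if rel == 'synonym':            # words sharing a synset
            out.update(l.name() for l in syn.lemmas())
        elif rel == 'hyponym':
            for hypo in syn.hyponyms():
                out.update(l.name() for l in hypo.lemmas())
        elif rel == 'antonym':
            for lemma in syn.lemmas():
                out.update(a.name() for a in lemma.antonyms())
    out.discard(word)
    return out

def distance(w, v, rel, n=5):
    """Minimum number of `rel` links from w to v, up to threshold n;
    0 if v is unreachable within n links (the paper's convention)."""
    frontier, seen = deque([(w, 0)]), {w}
    while frontier:
        word, d = frontier.popleft()
        if word == v:
            return d
        if d < n:
            for nxt in neighbors(word, rel):
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, d + 1))
    return 0

def phi(phrase_words, defn_words):
    """Feature map with a single graph-based feature, for brevity."""
    ds = [distance(w, v, 'synonym')
          for w in phrase_words for v in defn_words]
    ds = [d for d in ds if d > 0]
    return {'closest_synonym': min(ds) if ds else 0, 'bias': 1.0}

def train(data, T=10):
    """Averaged binary perceptron: data is [(feature_dict, y)] with
    y in {-1, +1} (+1 for idiomatic); weights averaged once per epoch."""
    theta, total = {}, {}
    for _ in range(T):
        for feats, y in data:
            score = sum(theta.get(f, 0.0) * v for f, v in feats.items())
            if y * score <= 0:                    # mistake-driven update
                for f, v in feats.items():
                    theta[f] = theta.get(f, 0.0) + y * v
        for f, v in theta.items():
            total[f] = total.get(f, 0.0) + v
    return {f: v / T for f, v in total.items()}
```

In the full model, the feature map would include the remaining lexical and graph-based features described in Section 5, conjoined with the candidate class y.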