
Semantic Classifications for Detection of Verb Metaphors


Beata Beigman Klebanov¹, Chee Wee Leong¹, Elkin Dario Gutierrez², Ekaterina Shutova³ and Michael Flor¹
¹ Educational Testing Service, ² University of California, San Diego, ³ University of Cambridge
{bbeigmanklebanov,cleong,mflor}@ets.org, [email protected], [email protected]

Abstract

We investigate the effectiveness of semantic generalizations/classifications for capturing the regularities of the behavior of verbs in terms of their metaphoricity. Starting from orthographic word unigrams, we experiment with various ways of defining semantic classes for verbs (grammatical, resource-based, distributional) and measure the effectiveness of these classes for classifying all verbs in a running text as metaphor or non metaphor.

1 Introduction

According to the Conceptual Metaphor theory (Lakoff and Johnson, 1980), metaphoricity is a property of concepts in a particular context of use, not of specific words. The notion of a concept is a fluid one, however. While write and wrote would likely constitute instances of the same concept according to any definition, it is less clear whether eat and gobble would. Furthermore, the Conceptual Metaphor theory typically operates with whole semantic domains that certainly generalize beyond narrowly-conceived concepts; thus, save and waste share a very general semantic feature of applying to finite resources – it is this meaning element that accounts for the observation that they tend to be used metaphorically in similar contexts. In this paper, we investigate which kinds of generalizations are the most effective for capturing regularities of metaphor usage.

2 Related Work

Most previous supervised approaches to verb metaphor classification evaluated their systems on selected examples or in small-scale experiments (Tsvetkov et al., 2014; Heintz et al., 2013; Turney et al., 2011; Birke and Sarkar, 2007; Gedigan et al., 2006), rather than using naturally occurring continuous text, as done here. Beigman Klebanov et al. (2014) and Beigman Klebanov et al. (2015) are the exceptions, used as a baseline in the current paper.

Features that have been used so far in supervised metaphor classification address concreteness and abstractness, topic models, orthographic unigrams, sensorial features, and semantic classifications using WordNet, among others (Beigman Klebanov et al., 2015; Tekiroglu et al., 2015; Tsvetkov et al., 2014; Dunn, 2014; Heintz et al., 2013; Turney et al., 2011). Of the feature sets presented in this paper, all but the WordNet features are novel.

3 Semantic Classifications

In the following subsections, we describe the different types of semantic classifications; Table 1 summarizes the feature sets.

Name      Description                 #Features
U         orthographic unigram        varies
UL        lemma unigram               varies
VN-Raw    VN frames                   270
VN-Pred   VN predicate                145
VN-Role   VN thematic role            30
VN-RoRe   VN them. role filler        128
WordNet   WN lexicographer files      15
Corpus    distributional clustering   150

Table 1: Summary of feature sets. All features are binary features indicating class membership.

3.1 Grammar-based

The most minimal level of semantic generalization is that of putting together verbs that share the same lemma (lemma unigrams, UL). We use NLTK (Bird et al., 2009) for identifying verb lemmas.
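As an illustration of how the binary class-membership features of Table 1 could be derived from a verb token, the following sketch lemmatizes a verb with NLTK and emits one indicator per semantic class that contains the lemma. This is a minimal sketch under our own assumptions: the function name, the dictionary-of-sets representation of classes, and the example class are illustrative, not the authors' implementation.

```python
from nltk.stem import WordNetLemmatizer  # requires nltk.download('wordnet')

def verb_features(verb, semantic_classes):
    """Binary features for one verb token: a lemma-unigram feature (UL)
    plus one indicator per semantic class that contains the lemma.

    semantic_classes maps a class name to a set of member lemmas,
    e.g. a VerbNet class, a WordNet lexicographer file, or a cluster.
    """
    lemma = WordNetLemmatizer().lemmatize(verb.lower(), pos="v")
    features = {"UL=" + lemma: 1}
    for name, members in semantic_classes.items():
        if lemma in members:
            features[name] = 1
    return features

# verb_features("wrote", {"WN:verb.communication": {"write", "say"}})
# -> {"UL=write": 1, "WN:verb.communication": 1}
```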

3.2 Resource-based

VerbNet: The VerbNet database (Kipper et al., 2006) provides a classification of verbs according to their participation in frames – syntactic patterns with semantic components, based on Levin's classes (Levin, 1993). Each verb class is annotated with its member verb lemmas, the syntactic constructions in which these participate (such as transitive, intransitive, diathesis alternations), the semantic predicates expressed by the verbs in the class (such as motion or contact), thematic roles (such as agent, patient, instrument), and restrictions on the fillers of these semantic roles (such as pointed instrument).

VerbNet can thus be thought of as providing a number of different classifications over the same set of nearly 4,000 English verb lemmas. The main classification is based on syntactic frames, as enacted in VerbNet classes; we will refer to them as VN-Raw classes. An alternative classification is based on the predicative meaning of the verbs; for example, the verbs assemble and introduce are in different classes based on their syntactic behavior, but both have the meaning component of together, marked in VerbNet as a possible value of the Predicate variable. Similarly, shiver and faint belong to different VerbNet classes in terms of syntactic behavior, but both have the meaning element of describing an involuntary action. Using the different values of the Predicate variable, we created a set of VN-Pred classes. We note that the same verb lemma can occur in multiple classes, since different senses of the same lemma can have different meanings, and even a single sense can express more than one predicate. For example, the verb stew participates in the following classes of various degrees of granularity: cause (shared with 2,912 other verbs), use (with 700 other verbs), apply heat (with 49 other verbs), cooked (with 49 other verbs).

Each VerbNet class is marked with the thematic roles its members take, such as agent or beneficiary. Here again, verbs that differ in syntactic behavior and in the predicate they express could share thematic roles. For example, stew and prick belong to different VerbNet classes and share only the most general predicative meanings of cause and use, yet both share a thematic role of instrument. We create a class for each thematic role (VN-Role).

Finally, VerbNet provides annotations of the restrictions that apply to fillers of various thematic roles. For example, verbs that have a thematic role of instrument can have the filler restricted to being inanimate, body part, concrete, pointy, solid, and others. Across the various VerbNet classes, there are 128 restricted roles (such as instrument pointy). We used those to generate VN-RoRe classes.

WordNet: We use the lexicographer files to classify verbs into 15 classes based on their general meaning, such as verbs of communication, consumption, weather, and so on.

3.3 Corpus-based

We also experimented with automatically generated verb clusters as semantic classes. We clustered VerbNet verbs using a spectral clustering algorithm and lexico-syntactic features. We selected the verbs that occur more than 150 times in the British National Corpus, 1,610 in total, and clustered them into 150 clusters (Corpus).

We used verb subcategorization frames (SCF) and the verb's nominal arguments as features for clustering, as they have proved successful in previous verb classification experiments (Shutova et al., 2010). We extracted our features from the Gigaword corpus (Graff et al., 2003), using the SCF classification system of Preiss et al. (2007) to identify verb SCFs and the RASP parser (Briscoe et al., 2006) to extract the verb's nominal arguments.

Spectral clustering partitions the data relying on a similarity matrix that records similarities between all pairs of data points. We use the Jensen-Shannon divergence ($d_{JS}$) to measure similarity between the feature vectors of two verbs, $v_i$ and $v_j$, and construct a similarity matrix $S$:

    $S_{ij} = \exp(-d_{JS}(v_i, v_j))$    (1)

The matrix $S$ encodes a similarity graph $G$ over our verbs. The clustering problem can then be defined as identifying the optimal partition, or cut, of the graph into clusters. We use the multiway normalized cut (MNCut) algorithm of Meila and Shi (2001) for this purpose. The algorithm transforms $S$ into a stochastic matrix $P$ containing transition probabilities between the vertices in the graph as $P = D^{-1}S$, where the degree matrix $D$ is a diagonal matrix with $D_{ii} = \sum_{j=1}^{N} S_{ij}$. It then computes the $K$ leading eigenvectors of $P$, where $K$ is the desired number of clusters. The graph is partitioned by finding approximately equal elements in the eigenvectors using a simpler clustering algorithm, such as k-means. Meila and Shi (2001) have shown that the partition $I$ derived in this way minimizes the MNCut criterion:

    $\mathrm{MNCut}(I) = \sum_{k=1}^{K} \big[1 - P(I_k \rightarrow I_k \mid I_k)\big]$    (2)

which is the sum of transition probabilities across different clusters. Since k-means starts from a random cluster assignment, we ran the algorithm multiple times and used the partition that minimizes the cluster distortion, that is, the distances to the cluster centroids.

We tried expanding the coverage of VerbNet verbs and the number of clusters using grid search on the training data, with coverage grid = {2,500; 3,000; 4,000} and #clusters grid = {200; 250; 300; 350; 400}, but obtained no improvement in performance over our original setting.
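The pipeline of Eqs. (1)–(2) can be sketched compactly. The following is a minimal illustration, not the authors' implementation: it assumes each verb comes with an already-normalized feature distribution, builds the similarity matrix of Eq. (1) (scipy's jensenshannon returns the square root of the JS divergence, hence the squaring), row-normalizes it into the stochastic matrix P, and clusters the K leading eigenvectors with restarted k-means as the final partitioning step.

```python
import numpy as np
from scipy.linalg import eig
from scipy.spatial.distance import jensenshannon
from sklearn.cluster import KMeans

def mncut_cluster(vectors, n_clusters):
    """Spectral clustering in the spirit of Meila and Shi (2001).

    vectors: (n_verbs, n_features) array; each row is a verb's feature
    distribution (non-negative, summing to 1).
    """
    n = len(vectors)
    S = np.ones((n, n))  # d_JS(v, v) = 0, so the diagonal is exp(0) = 1
    for i in range(n):
        for j in range(i + 1, n):
            d_js = jensenshannon(vectors[i], vectors[j]) ** 2  # Eq. (1)
            S[i, j] = S[j, i] = np.exp(-d_js)
    P = S / S.sum(axis=1, keepdims=True)  # P = D^{-1} S, row-stochastic
    evals, evecs = eig(P)  # eigenvalues are real: P is similar to a symmetric matrix
    leading = evecs[:, np.argsort(-evals.real)[:n_clusters]].real
    # k-means with multiple restarts; scikit-learn keeps the partition
    # with the lowest distortion (inertia), as described in the text
    return KMeans(n_clusters=n_clusters, n_init=20).fit_predict(leading)
```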

4 Experiment setup

4.1 Data

We use the VU Amsterdam Metaphor Corpus (Steen et al., 2010).¹ The corpus contains annotations of all tokens in running text as metaphor or non metaphor, according to a protocol similar to MIP (Pragglejaz, 2007). The data come from the BNC, across 4 genres: news (N), academic (A), fiction (F), and conversation (C). We address each genre separately. We consider all verbs apart from have, be, and do. We use the same training and testing partitions as Beigman Klebanov et al. (2015). Table 2 summarizes the data.²

        Training            Testing
Data    #T   #I     %M      #T   #I
News    49   3,513  42%     14   1,230
Fict.   11   4,651  25%      3   1,386
Acad.   12   4,905  31%      6   1,260
Conv.   18   4,181  15%      4   2,002

Table 2: Summary of the data. #T = # of texts; #I = # of instances; %M = percentage of metaphors.

¹ Available at http://metaphorlab.org/metcor/search/
² Data and features will be made available at https://github.com/EducationalTestingService/metaphor.

4.2 Machine Learning Methods

Our setting is that of supervised machine learning for binary classification. We experimented with a number of classifiers using VU-News training data, including those used in relevant prior work: Logistic Regression (Beigman Klebanov et al., 2015), Random Forest (Tsvetkov et al., 2014), and a Linear Support Vector Classifier. We found that Logistic Regression was better for unigram features and Random Forest was better for features using WordNet and VerbNet classifications, whereas the corpus-based features yielded similar performance across classifiers. We therefore ran all evaluations with both the Logistic Regression and Random Forest classifiers. We use the skll and scikit-learn toolkits (Blanchard et al., 2013; Pedregosa et al., 2011). During training, each class is weighted in inverse proportion to its frequency. The optimization function is F1 (metaphor).
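A bare-bones version of this setup with plain scikit-learn might look as follows. Hyperparameters here are our own guesses, and class_weight='balanced' is scikit-learn's built-in way of weighting each class in inverse proportion to its frequency.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def evaluate_feature_set(X, y, cv=10):
    """Cross-validate both classifiers on one feature set, scoring F1
    on the positive (metaphor) class; X is a binary feature matrix,
    y holds 1 for metaphor and 0 for non metaphor."""
    classifiers = {
        "logistic_regression": LogisticRegression(class_weight="balanced",
                                                  max_iter=1000),
        "random_forest": RandomForestClassifier(class_weight="balanced",
                                                n_estimators=500),
    }
    return {name: cross_val_score(clf, X, y, cv=cv, scoring="f1").mean()
            for name, clf in classifiers.items()}
```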

5 Results

We first consider the performance of each type of semantic classification separately, as well as of various combinations, using cross-validation on the training set. Table 3 shows the results with the classifier that yields the best performance for the given feature set.

Name             N    F    A    C    Av.
U                .64  .51  .55  .39  .52
UL               .65  .51  .61  .39  .54
VN-Raw           .64  .49  .60  .38  .53
VN-Pred          .62  .47  .58  .39  .52
VN-Role          .61  .46  .55  .40  .50
VN-RoRe          .59  .47  .54  .36  .49
WN               .64  .50  .60  .38  .53
Corpus           .59  .49  .53  .36  .49
VN-RawToCorpus   .63  .49  .59  .38  .53
UL+WN            .67  .52  .63  .40  .56
UL+Corpus        .66  .53  .62  .39  .55

Table 3: Performance (F1) of each of the feature sets, xval on training data. U = unigram baseline.

Of all types of semantic classification, only the grammatical one (lemma unigrams, UL) shows an overall improvement over the unigram baseline with no detriment for any of the genres. VN-Raw and WordNet show improved performance for Academic but lower performance on Fiction than the unigram baseline. Other versions of the VerbNet-based semantic classifications are generally worse than VN-Raw, with some exceptions for the Conversation genre. Distributional clusters (Corpus) generally perform worse than the resource-based classifications, even when the resource is restricted to the exact same set of verbs as that covered by the Corpus clusters (compare Corpus to VN-RawToCorpus).

The distributional features are, however, about as effective as the WordNet features when combined with the lemma unigrams (UL); the combinations improve the performance over UL alone for every genre. We also note that the better performance for these combinations is generally attained by the Logistic Regression classifier. We experimented with additional combinations of feature sets, but observed no further improvements.

To assess the consistency of the metaphoricity behavior of semantic classes across genres, we calculated correlations between the weights assigned by the UL+WN model to the 15 WordNet features. All pairwise correlations between News, Academic, and Fiction were strong (r > 0.7), while Conversation had low to negative correlations with the other genres. The low correlations with Conversation were largely due to a highly discrepant behavior of verbs of weather³ – these are consistently used metaphorically in all genres apart from Conversation. This discrepancy, however, is not so much due to genre-specific differences in the behavior of the same verbs as to the difference in the identity of the weather verbs that occur in the data from the different genres. While burn, pour, reflect, fall are common in the other genres, the most common weather verb in Conversation is rain, and none of its occurrences is metaphoric; its single occurrence in the other genres is likewise not metaphoric. More than a difference across genres, this case underscores the complementarity of lemma-based and semantic class-based information – it is possible for weather verbs to tend towards metaphoricity as a class, yet some verbs might not share the tendency – verb-specific information can help correct the class-based pattern.

5.1 Blind Test Benchmark

To compare the results against the state of the art, we show the performance of the Beigman Klebanov et al. (2015) system (SOA'15) on the test data (see Table 2 for the sizes of the test sets per genre). Their system uses a Logistic Regression classifier and a set of features that includes orthographic unigrams, part-of-speech tags, concreteness, and the difference in concreteness between the verb and its direct object. Against this benchmark, we evaluate the performance of the best combination identified during the cross-validation runs, namely, the UL+WN feature set using the Logistic Regression classifier. We also show the performance of the resource-lean model, UL+Corpus. The top three rows of Table 4 show the results. The UL+WN model outperforms the state of the art for every genre; the improvement is statistically significant (p < 0.05).⁴ The improvement of UL+Corpus over SOA'15 is not significant.

Following the observation of the similarity between the weights of semantic class features across genres, we also trained the three systems on all the available training data across all genres (all data in the Train column in Table 2), and tested on the test data for the specific genre. This resulted in performance improvements for all systems in all genres, including Conversation (see the bottom three rows in Table 4). The significance of the improvement of UL+WN over SOA'15 was preserved; UL+Corpus now significantly outperformed SOA'15.

Feature Set                N    F    A    C    Av.
Train in    SOA'15         .64  .47  .71  .43  .56
genre       UL+WN          .68  .49  .72  .44  .58
            UL+Corpus      .65  .49  .71  .43  .57
Train on    SOA'15         .66  .48  .74  .44  .58
all genres  UL+WN          .69  .50  .77  .45  .60
            UL+Corpus      .67  .51  .76  .45  .60

Table 4: Benchmark performance, F1 score.

6 Conclusion

The goal of this paper was to investigate the effectiveness of semantic generalizations/classifications for metaphoricity classification of verbs. We found that generalization from orthographic unigrams to lemmas is effective. Further, lemma unigrams and semantic class features based on WordNet combine effectively, producing a significant improvement over the state of the art. We observed that semantic class features were weighted largely consistently across genres; adding training data from other genres is helpful. Finally, we found that a resource-lean model where lemma unigram features were combined with clusters generated automatically using a large corpus yielded competitive performance. This latter result is encouraging, as the knowledge-lean system is relatively easy to adapt to a new domain or language.

³ Removing verbs of weather propelled the correlations with Conversation to a moderate range, r = 0.25–0.45 across genres.
⁴ We used McNemar's test of the significance of the difference between correlated proportions (McNemar, 1947), 2-tailed. We combined data from all genres into one 2×2 matrix: both SOA'15 and UL+WN correct in (1,1), both wrong in (0,0), SOA'15 correct and UL+WN wrong in (0,1), UL+WN correct and SOA'15 wrong in (1,0).
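For concreteness, the test in footnote 4 can be run from per-instance correctness vectors as in the sketch below. The paper does not name an implementation, so the use of statsmodels' mcnemar and the helper's name are our own choices.

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def compare_systems(correct_a, correct_b):
    """McNemar's test on paired correctness of two systems.

    correct_a, correct_b: boolean arrays over the same test instances,
    True where the system classified the instance correctly. The 2x2
    table mirrors footnote 4: (1,1) both right, (0,1)/(1,0) exactly
    one right, (0,0) both wrong.
    """
    a, b = np.asarray(correct_a, bool), np.asarray(correct_b, bool)
    table = [[np.sum(a & b), np.sum(a & ~b)],
             [np.sum(~a & b), np.sum(~a & ~b)]]
    return mcnemar(table, exact=False)  # chi-square form of the test
```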

Acknowledgment

We are grateful to the ACL reviewers for their helpful feedback. Ekaterina Shutova's research is supported by the Leverhulme Trust Early Career Fellowship.

References

Beata Beigman Klebanov, Chee Wee Leong, Michael Heilman, and Michael Flor. 2014. Different texts, same metaphors: Unigrams and beyond. In Proceedings of the Second Workshop on Metaphor in NLP, pages 11–17, Baltimore, MD, June. Association for Computational Linguistics.

Beata Beigman Klebanov, Chee Wee Leong, and Michael Flor. 2015. Supervised word-level metaphor detection: Experiments with concreteness and reweighting of examples. In Proceedings of the Third Workshop on Metaphor in NLP, pages 11–20, Denver, Colorado, June. Association for Computational Linguistics.

Steven Bird, Edward Loper, and Ewan Klein. 2009. Natural Language Processing with Python. O'Reilly Media Inc.

Julia Birke and Anoop Sarkar. 2007. Active learning for the identification of nonliteral language. In Proceedings of the Workshop on Computational Approaches to Figurative Language, pages 21–28, Rochester, New York.

Daniel Blanchard, Michael Heilman, and Nitin Madnani. 2013. SciKit-Learn Laboratory. GitHub repository, https://github.com/EducationalTestingService/skll.

E. Briscoe, J. Carroll, and R. Watson. 2006. The second release of the RASP system. In Proceedings of the COLING/ACL Interactive Presentation Sessions, pages 77–80.

Jonathan Dunn. 2014. Multi-dimensional abstractness in cross-domain mappings. In Proceedings of the Second Workshop on Metaphor in NLP, pages 27–32, Baltimore, MD, June. Association for Computational Linguistics.

M. Gedigan, J. Bryant, S. Narayanan, and B. Ciric. 2006. Catching metaphors. In Proceedings of the 3rd Workshop on Scalable Natural Language Understanding, pages 41–48, New York.

D. Graff, J. Kong, K. Chen, and K. Maeda. 2003. English Gigaword. Linguistic Data Consortium, Philadelphia.

Ilana Heintz, Ryan Gabbard, Mahesh Srivastava, Dave Barner, Donald Black, Majorie Friedman, and Ralph Weischedel. 2013. Automatic extraction of linguistic metaphors with LDA topic modeling. In Proceedings of the First Workshop on Metaphor in NLP, pages 58–66, Atlanta, Georgia, June. Association for Computational Linguistics.

Karin Kipper, Anna Korhonen, Neville Ryant, and Martha Palmer. 2006. Extensive classifications of English verbs. In Proceedings of the 12th EURALEX International Congress, Turin, Italy, September.

George Lakoff and Mark Johnson. 1980. Metaphors We Live By. University of Chicago Press, Chicago.

Beth Levin. 1993. English Verb Classes and Alternations: A Preliminary Investigation. University of Chicago Press, Chicago, IL.

Quinn McNemar. 1947. Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika, 12(2).

M. Meila and J. Shi. 2001. A random walks view of spectral segmentation. In Proceedings of AISTATS.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Pragglejaz Group. 2007. MIP: A method for identifying metaphorically used words in discourse. Metaphor and Symbol, 22(1):1–39.

Judita Preiss, Ted Briscoe, and Anna Korhonen. 2007. A system for large-scale acquisition of verbal, nominal and adjectival subcategorization frames from corpora. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 912–919, Prague, Czech Republic, June. Association for Computational Linguistics.

Ekaterina Shutova, Lin Sun, and Anna Korhonen. 2010. Metaphor identification using verb and noun clustering. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING), pages 1002–1010.

Gerard Steen, Aletta Dorst, Berenike Herrmann, Anna Kaal, Tina Krennmayr, and Trijntje Pasma. 2010. A Method for Linguistic Metaphor Identification. John Benjamins, Amsterdam.

Serra Sinem Tekiroglu, Gözde Özbal, and Carlo Strapparava. 2015. Exploring sensorial features for metaphor identification. In Proceedings of the Third Workshop on Metaphor in NLP, pages 31–39, Denver, Colorado, June. Association for Computational Linguistics.

Yulia Tsvetkov, Leonid Boytsov, Anatole Gershman, Eric Nyberg, and Chris Dyer. 2014. Metaphor detection with cross-lingual model transfer. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 248–258, Baltimore, Maryland, June. Association for Computational Linguistics.

Peter Turney, Yair Neuman, Dan Assaf, and Yohai Cohen. 2011. Literal and metaphorical sense identification through concrete and abstract context. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 680–690, Stroudsburg, PA, USA. Association for Computational Linguistics.
