Word Embedding and WordNet Based Metaphor Identification and Interpretation

Rui Mao, Chenghua Lin and Frank Guerin
Department of Computing Science, University of Aberdeen, Aberdeen, United Kingdom
{r03rm16, chenghua.lin, f.guerin}@abdn.ac.uk

Abstract

Metaphoric expressions are widespread in natural language, posing a significant challenge for various natural language processing tasks such as Machine Translation. Current word embedding based metaphor identification models cannot identify the exact metaphorical word within a sentence. In this paper, we propose an unsupervised learning method that identifies and interprets metaphors at word-level without any preprocessing, outperforming strong baselines in the metaphor identification task. Our model extends to interpret the identified metaphors, paraphrasing them into their literal counterparts, so that they can be better translated by machines. We evaluated this with two popular translation systems for English to Chinese, showing that our model improved the systems significantly.

1 Introduction

Metaphor enriches language, playing a significant role in communication, cognition, and decision making. Relevant statistics illustrate that about one third of sentences in typical corpora contain metaphor expressions (Cameron, 2003; Martin, 2006; Steen et al., 2010; Shutova, 2016). Linguistically, metaphor is defined as a language expression that uses one or several words to represent another concept, rather than taking the literal meanings of the given words in the context (Lagerwerf and Meijers, 2008). Computational metaphor processing refers to modelling non-literal expressions (e.g., metaphor, metonymy, and personification) and is useful for improving many NLP tasks such as Machine Translation (MT) and Sentiment Analysis (Rentoumi et al., 2012). For instance, Google Translate failed to translate devour within a sentence, "She devoured his novels." (Mohammad et al., 2016), into Chinese. The term was translated into 吞噬, which takes the literal sense of swallow and is not understandable in Chinese. Interpreting metaphors allows us to paraphrase them into literal expressions which maintain the intended meaning and are easier to translate.

Metaphor identification approaches based on word embeddings have become popular (Tsvetkov et al., 2014; Shutova et al., 2016; Rei et al., 2017) as they do not rely on hand-crafted knowledge for training. These models follow a similar paradigm in which input sentences are first parsed into phrases and then the metaphoricity of the phrases is identified; they do not tackle word-level metaphor. E.g., given the former sentence "She devoured his novels.", the aforementioned methods will first parse the sentence into a verb-direct object phrase, devour novel, and then detect the clash between devour and novel, flagging this phrase as a likely metaphor. However, which component word is metaphorical cannot be identified, as important contextual words in the sentence were excluded while processing these phrases. Discarding contextual information also leads to a failure to identify a metaphor when both words in the phrase are metaphorical but, taken out of context, appear literal. E.g., "This young man knows how to climb the social ladder." (Mohammad et al., 2016) is a metaphorical expression. However, when the sentence is parsed into a verb-direct object phrase, climb ladder, it appears literal.

In this paper, we propose an unsupervised metaphor processing model which can identify and interpret linguistic metaphors at the word-level. Specifically, our model is built upon word embedding methods (Mikolov et al., 2013) and uses WordNet (Fellbaum, 1998) for lexical relation acquisition.

Our model is distinguished from existing methods in two aspects. First, our model is generic: it does not constrain the source domain of metaphor. Second, the developed model does not rely on any labelled data for model training, but rather captures metaphor in an unsupervised, data-driven manner. Linguistic metaphors are identified by modelling the distance (in the vector space) between the target word's literal and metaphorical senses. The metaphorical sense within a sentence is identified by its surrounding context within the sentence, using word embedding representations and WordNet. This novel approach allows our model to operate at the sentence level without any preprocessing, e.g., dependency parsing. Taking contexts into account also addresses the issue that a two-word phrase appears literal, but is metaphoric within a sentence (e.g., the climb ladder example).

We evaluate our model against three strong baselines (Melamud et al., 2016; Shutova et al., 2016; Rei et al., 2017) on the task of metaphor identification. Extensive experimentation conducted on a publicly available dataset (Mohammad et al., 2016) shows that our model significantly outperforms the baselines (Melamud et al., 2016; Shutova et al., 2016) on both phrase and sentence evaluation, and achieves equivalent performance to the state-of-the-art baseline (Rei et al., 2017) on phrase-level evaluation. In addition, while most existing works on metaphor processing solely evaluate model performance in terms of metaphor classification accuracy, we further conducted another set of experiments to evaluate how metaphor processing can be used to support the task of MT. Human evaluation shows that our model improves metaphoric translation significantly, tested on two prominent translation systems, namely Google Translate (https://translate.google.co.uk) and Bing Translator (https://www.bing.com/translator). To our best knowledge, this is the first metaphor processing model that is evaluated on MT.

To summarise, the contributions of this paper are two-fold: (1) we propose a novel framework for metaphor identification which does not require any preprocessing or annotated corpora for training; (2) we conduct, to our knowledge, the first metaphor interpretation study applying metaphor processing to support MT. We describe related work in §2, preliminaries in §3, our method in §4, experimental design in §5, results in §6 and conclusions in §7.

2 Related Work

A wide range of methods have been applied to computational metaphor processing. Turney et al. (2011); Neuman et al. (2013); Assaf et al. (2013) and Tsvetkov et al. (2014) identified metaphors by modelling the abstractness and concreteness of metaphors and non-metaphors, using a machine usable dictionary called the MRC Psycholinguistic Database (Coltheart, 1981). They believed that metaphorical words would be more abstract than literal ones. Some researchers used topic models to identify metaphors. For instance, Heintz et al. (2013) used Latent Dirichlet Allocation (LDA) (Blei et al., 2003) to model source and target domains, and assumed that sentences containing words from both domains are metaphorical. Strzalkowski et al. (2013) assumed that metaphorical terms occur outside the topic chain, where a topic chain is constructed from topical words that reveal the core discussion of the text. Shutova et al. (2017) performed metaphorical concept mappings between source and target domains in multiple languages using both unsupervised and semi-supervised learning approaches. The source and target domains are represented by semantic clusters, derived from the distribution of word co-occurrences. They also assumed that when contextual vocabularies are from different domains then there is likely to be a metaphor.

There is another line of approaches based on word embeddings.
Generally, these works are not limited by conceptual domains and hand-crafted knowledge. Shutova et al. (2016) proposed a model that identified metaphors by employing word and image embeddings. The model first parses sentences into phrases which contain target words. In their word embedding based approach, the metaphoricity of a phrase was identified by measuring the cosine similarity of the two component words in the phrase, based on their input vectors from Skip-gram word embeddings. If the cosine similarity is higher than a threshold, the phrase is identified as literal; otherwise metaphorical. Rei et al. (2017) identified metaphors by introducing a deep learning architecture. Instead of using word input vectors directly, they filtered out noisy information in the vector of one word in a phrase, projecting the word vector into another space via a sigmoid activation function. The metaphoricity of the phrases was learnt by training a supervised deep neural network.

The above word embedding based models, while demonstrating some success in metaphor identification, only explored using input vectors, which might hinder their performance. In addition, metaphor identification is highly dependent on its context. Therefore, phrase-level models (e.g., Tsvetkov et al. (2014); Shutova et al. (2016); Rei et al. (2017)) are likely to fail in the metaphor identification task if important contexts are excluded. In contrast, our model can operate at the sentence level, taking into account rich context, and hence can improve the performance of metaphor identification.

3 Preliminary: CBOW and Skip-gram

Our metaphor identification framework is built upon word embeddings, trained with Continuous Bag of Words (CBOW) and Skip-gram (Mikolov et al., 2013).

[Figure 1: CBOW and Skip-gram framework.]

In CBOW (see Figure 1), the input and output layers are context (C) and centre word (T) one-hot encodings, respectively. The model is trained by maximizing the probability of predicting a centre word, given its context (Rong, 2014):

  arg max p(t | c_1, ..., c_n, ..., c_m)    (1)

where t is a centre word and c_n is the nth of m context words of t within a sentence. CBOW's hidden layer is defined as:

  H_{CBOW} = (1/m) \sum_{n=1}^{m} W^{i\top} C_n = (1/m) \sum_{n=1}^{m} v_{c,n}^{i\top}    (2)

where C_n is the one-hot encoding of the nth context word, and v_{c,n}^{i} is the nth context word row vector (input vector) in W^{i}, a weight matrix between the input and hidden layers. Thus, the hidden layer is the transpose of the average of the input vectors of the context words. The probability of predicting a centre word in its context is given by a softmax function:

  u_t = W_t^{o\top} H_{CBOW} = v_t^{o\top} H_{CBOW}    (3)

  p(t | c_1, ..., c_n, ..., c_m) = exp(u_t) / \sum_{j=1}^{V} exp(u_j)    (4)

where W_t^{o} is equivalent to the output vector v_t^{o}, which is essentially a column vector in a weight matrix W^{o} between the hidden and output layers, aligning with the centre word t. V is the size of the vocabulary of the corpus.

The output is a one-hot encoding of the centre word. W^{i} and W^{o} are updated via back propagation of errors. Therefore, only the value of the position that represents the centre word's probability, i.e., p(t | c_1, ..., c_n, ..., c_m), will get close to 1; the probabilities of the rest of the words in the vocabulary will be close to 0 in every centre word training step. W^{i} embeds context words: vectors within W^{i} can be viewed as context word embeddings. W^{o} embeds centre words: vectors in W^{o} can be viewed as centre word embeddings.

Skip-gram is the reverse of CBOW (see Figure 1). The input and output layers are centre word and context word one-hot encodings, respectively. The target is to maximize the probability of predicting each context word, given a centre word:

  arg max p(c_1, ..., c_n, ..., c_m | t)    (5)

Skip-gram's hidden layer is defined as:

  H_{SG} = W^{i\top} T = v_t^{i\top}    (6)

where T is the one-hot encoding of the centre word t. Skip-gram's hidden layer is equal to the transpose of the centre word's input vector v_t, as only the tth row is kept by the operation. The probability of a context word is:

  u_{c,n} = W_{c,n}^{o\top} H_{SG} = v_{c,n}^{o\top} H_{SG}    (7)

  p(c_n | t) = exp(u_{c,n}) / \sum_{j=1}^{V} exp(u_j)    (8)

where c_n is the nth context word, given a centre word. In Skip-gram, W^{i} aligns to centre words, while W^{o} aligns to context words. Because the roles of centre word and context word embeddings are reversed in CBOW and Skip-gram, we will uniformly call vectors in W^{i} input vectors v^{i}, and vectors in W^{o} output vectors v^{o}, in the remaining sections. "Word embeddings" covers both input and output vectors.

[Figure 2: Metaphor identification framework. NB: w* = best fit word, w_t = target word.]

[Figure 3: Given CBOW trained input and output vectors, a target word devoured, and a context She [] his novels: cos(v^o_devoured, v^i_context) = -0.01, cos(v^o_enjoyed, v^i_context) = 0.02.]

4 Methodology

In this section, we present the technical details of our metaphor processing framework, built upon two hypotheses. Our first hypothesis (H1) is that a metaphorical word can be identified if the sense the word takes within its context and its literal sense come from different domains. This hypothesis is based on the theory of Selectional Preference Violation (Wilks, 1975, 1978), that a metaphorical item can be found in a violation of selectional restrictions, where a word does not satisfy its semantic constraints within a context. Our second hypothesis (H2) is that the literal senses of words occur more commonly in corpora than their metaphoric senses (Cameron, 2003; Martin, 2006; Steen et al., 2010; Shutova, 2016).

4.1 Metaphor identification

Figure 2 depicts an overview of our metaphor identification framework. The workflow is as follows. Step (1) involves training word embeddings on a Wikipedia dump (https://dumps.wikimedia.org/enwiki/20170920/) to obtain input and output vectors of words. In Step (2), given an input sentence, the target word (i.e., the word in the original text whose metaphoricity is to be determined) and its context words (i.e., all other words in the sentence excluding the target word) are separated. We construct a candidate word set W which represents all the possible senses of the target word. This is achieved by first extracting the synonyms and direct hypernyms of the target word from WordNet, and then augmenting the set with the inflections of the extracted synonyms and hypernyms, as well as the target word and its inflections; a construction of this set is sketched in code below. Auxiliary verbs are excluded from this set, as these words frequently appear in most sentences with little lexical meaning. In Step (3), we identify the best fit word, which is defined as the word that represents the literal sense that the target word is most likely taking given its context. Finally, in Step (4), we compute the cosine similarity between the target word and the best fit word. If the similarity is above a threshold, the target word is identified as literal, otherwise metaphoric (i.e., based on H1). We discuss Step (3) and Step (4) in detail below.
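The candidate set construction of Step (2) can be sketched as follows with NLTK's WordNet interface. The paper does not name its inflection or auxiliary-verb tooling, so lemminflect and the small AUX stop set below are our illustrative substitutions.

```python
# Sketch of Step (2): candidate word set W = synonyms + direct hypernyms
# of the target word (plus the target word itself), expanded with verb
# inflections; auxiliary verbs are dropped. AUX and lemminflect are
# illustrative choices, not the paper's stated tooling.
from nltk.corpus import wordnet as wn
from lemminflect import getAllInflections

AUX = {"be", "do", "have", "will", "shall", "can", "may", "must"}

def candidate_set(target_lemma):
    lemmas = {target_lemma}
    for synset in wn.synsets(target_lemma, pos=wn.VERB):
        lemmas.update(l.name() for l in synset.lemmas())   # synonyms
        for hyper in synset.hypernyms():                   # direct hypernyms
            lemmas.update(l.name() for l in hyper.lemmas())
    lemmas = {l for l in lemmas if "_" not in l} - AUX     # keep single words
    candidates = set(lemmas)
    for lemma in lemmas:                                   # add inflections
        for forms in getAllInflections(lemma, upos="VERB").values():
            candidates.update(forms)
    return candidates

# candidate_set("devour") contains e.g. "devoured", "enjoy", "enjoyed"
```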

Step (3): One of the key steps of our metaphor identification framework is to identify the best fit word for a target word given its surrounding context. The intuition is that the best fit word will represent the literal sense that the target word is most likely taking. E.g., for the sentence "She devoured his novels." and the corresponding target word devoured, the best fit word is enjoyed, as shown in Figure 3. Also note that the best fit word could be the target word itself, if the target word is used literally.

Given a sentence s, let w_t be the target word of the sentence, w* ∈ W the best fit word for w_t, and w_context the surrounding context of w_t, i.e., all the words in s excluding w_t. We compute the context embedding v^i_context by averaging the input vectors of the context words in w_context, based on Eq. 2. Next, we rank each candidate word k ∈ W by measuring its similarity to the context input vector v^i_context in the vector space. The candidate word with the highest similarity to the context is then selected as the best fit word.

  w* = arg max_k SIM(v_k, v_context)    (9)

where v_k is the vector of a candidate word k ∈ W. In contrast to existing word embedding based methods for metaphor identification, which only make use of input vectors (Shutova et al., 2016; Rei et al., 2017), we explore using both input and output vectors of CBOW and Skip-gram embeddings when measuring the similarity between a candidate word and the context. We expect that a combination of input and output vectors might work better. Specifically, we have experimented with four model variants:

  SIM-CBOW_I   = cos(v^i_{k,cbow}, v^i_{context,cbow})    (10)
  SIM-CBOW_I+O = cos(v^o_{k,cbow}, v^i_{context,cbow})    (11)
  SIM-SG_I     = cos(v^i_{k,sg}, v^i_{context,sg})    (12)
  SIM-SG_I+O   = cos(v^o_{k,sg}, v^i_{context,sg})    (13)

Here, cos(·) is cosine similarity, cbow denotes CBOW word embeddings, and sg denotes Skip-gram word embeddings. We also tried other model variants using output vectors for v_context; however, we found that models using output vectors for v_context (for both CBOW and Skip-gram embeddings) do not improve our framework's performance. Due to the page limit we omit the results of those models in this paper.

Step (4): Given the best fit word w* identified in Step (3), we compute the cosine similarity between the lemmatizations of w* and the target word w_t, using their input vectors:

  SIM(w*, w_t) = cos(v^i_{w*}, v^i_{w_t})    (14)

We give a detailed discussion in §4.2 of our rationale for using input vectors in Eq. 14. If the similarity is higher than a threshold (τ), the target word is considered literal; otherwise, metaphorical (based on H1). One benefit of our approach is that it allows one to paraphrase the identified metaphorical target word into the best fit word, representing its literal sense in the context. Such a feature is useful for supporting other NLP tasks such as Machine Translation, which we explore in §6. The value of the threshold (τ) is empirically determined based on a development set; please refer to §5 for details.

To better explain the workflow of our framework, we now go through an example, as illustrated in Figure 3. The target word of the input sentence "She devoured his novels." is devoured, and its lemmatised form devour has four verbal senses in WordNet, i.e., destroy completely; enjoy avidly; eat up completely with great appetite; and eat greedily. Each of these senses has a set of corresponding synonyms and hypernyms. E.g., Sense 3 (eat up completely with great appetite) has synonyms demolish, down, and consume, and hypernyms go through, eat up, finish, and polish off. We then construct a candidate word set W by including the synonyms and direct hypernyms of the target word from WordNet, augmenting the set with the inflections of the extracted synonyms and hypernyms, as well as the target word devour and its inflections. We then identify the best fit word given the context She [] his novels, based on Eq. 9.

Based on H2, literal expressions are more common than metaphoric ones in corpora. Therefore, the best fit word is expected to appear frequently within the given context, and thus represents the most likely sense of the target word. For example, the similarity between enjoy (i.e., the best fit word) and the context is higher than that of devour (i.e., the target word), as shown in Figure 3.
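Steps (3) and (4) condense into a short sketch under the SIM-CBOW_I+O variant (Eqs. 9, 11 and 14), again assuming a Gensim CBOW model whose wv.vectors and syn1neg matrices play the roles of W^i and W^o; lemma() stands in for an unspecified lemmatiser.

```python
# Sketch of Steps (3)-(4) with SIM-CBOW_{I+O}: rank candidates by the
# cosine between their OUTPUT vector and the averaged context INPUT
# vector (Eqs. 9 and 11), then compare the input vectors of the
# lemmatised best fit and target words against tau (Eq. 14).
import numpy as np

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify(model, target, context_words, candidates, lemma, tau=0.6):
    context = [w for w in context_words if w in model.wv]
    v_context = np.mean([model.wv[w] for w in context], axis=0)  # cf. Eq. 2
    best_fit = max((w for w in candidates if w in model.wv),
                   key=lambda w: cos(model.syn1neg[model.wv.key_to_index[w]],
                                     v_context))
    s = cos(model.wv[lemma(best_fit)], model.wv[lemma(target)])  # Eq. 14
    return best_fit, ("literal" if s > tau else "metaphoric")    # H1
```

On the running example, identify(model, "devoured", ["she", "his", "novels"], W, lemma) would be expected to return ("enjoyed", "metaphoric") if the embeddings behave as in Figure 3.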
The bluer, Baselines. We compare the performance of the more negative. The redder, the more positive. our framework for metaphor identification against three strong baselines, namely, an unsupervised optimal, as words with different POS normally word embedding based model by Shutova et al. have different semantics. They tend to be distant (2016), a supervised deep learning model by Rei from each other in the input vector space. Tak- et al.(2017), and the Context2Vec model 5 (Mela- ing Skip-gram for example, empirically, input vec- mud et al., 2016) which achieves the best perfor- tors of words with the same POS, occurring within mance on Microsoft Sentence Completion Chal- the same contexts tend to be close in the vector lenge (Zweig and Burges, 2011). Context2Vec space (Mikolov et al., 2013), as they are frequently was not designed for processing metaphors, in or- updated by back propagating the errors from the der to use it for this we plug it into a very similar same context words. In contrast, input vectors framework to that described in Figure2. We use of words with different POS, playing different se- Context2Vec to predict the best fit word from the mantic and syntactic roles tend to be distant from candidate set, as it similarly uses context to predict each other, as they seldom occur within the same the most likely centre word but with bidirectional contexts, resulting in their input vectors rarely be- LSTM based context embedding method. After ing updated equally. Our observation is also in line locating the best fit word with Context2Vec, we with Nalisnick et al.(2016), who examine IN-IN, identify the metaphoricity of a target word with OUT-OUT and IN-OUT vectors to measure simi- the same method (see Step (4) in §4), so that larity between two words. Nalisnick et al. discov- we can also apply it for metaphor interpretation. ered that two words which are similar by function Note that while Shutova et al. and Rei et al. de- or type have higher cosine similarity with IN-IN or tect metaphors at the phrase level by identifying OUT-OUT vectors, while using input and output metaphorical phrases, Melamud et al.’s model can vectors for two words (IN-OUT) that frequently perform metaphor identification and interpretation co-occur in the same context (e.g., a sentence) can on sentences. obtain a higher similarity score. Dataset. Evaluation was conducted based on a dataset developed by Mohammad et al.(2016). This dataset6, containing 1,230 literal and 409 For illustrative purpose, we visualize the metaphor sentences, has been widely used for CBOW and Skip-gram updates between 4- metaphor identification related research (Shutova dimensional input and output vectors by Wevi4 et al., 2016; Rei et al., 2017). There is a verbal tar- (Rong, 2014), using a two-sentence corpus, get word annotated by 10 annotators in each sen- “Drink apple juice.” and “Drink orange juice.”. tence. We use two subsets of the Mohammad et al. We feed these two sentences to CBOW and Skip- set, one for phrase evaluation and one for sentence gram with 500 iterations. As seen Figure4, the in- evaluation. The phrase evaluation dataset was put vectors of apple and orange are similar in both kindly provided by Shutova, which consists of 316 CBOW and Skip-gram, which are different from metaphorical and 331 literal phrases (subject-verb the input vectors of their context words (drink and and verb-direct object word pairs), parsed from juice). However, the output vectors of apple and Mohammad et al.’s dataset. 
Similar to Shutova orange are similar to the input vectors of drink and et al.(2016), we use 40 metaphoric and 40 literal juice. phrases as a development set and the rest as a test To summarise, using input vectors to compare similarity between the best fit word and the tar- 5http://u.cs.biu.ac.il/˜nlp/resources/ get word is more appropriate (cf. Eq.14), as they downloads/context2vec/ 6http://saifmohammad.com/WebPages/ 4https://ronxin.github.io/wevi/ metaphor.html

For sentence evaluation, we select 212 metaphorical sentences whose target words are annotated with at least 70% agreement. We also add 212 literal sentences with the highest agreement. Among the 424 sentences, we form our development set from 12 randomly selected metaphoric and 12 literal instances, used to identify the threshold for detecting metaphors. The remaining 400 sentences are our test set.

Word embedding training. We train CBOW and Skip-gram models on a Wikipedia dump with the same settings as Shutova et al. (2016) and Rei et al. (2017). That is, CBOW and Skip-gram models are trained iteratively 3 times on Wikipedia with a context window of 5 to learn 100-dimensional input and output vectors. We exclude words with total frequency less than 100. 10 negative samples are randomly selected for each centre word during training. The word down-sampling rate is 10^-5. We use Stanford CoreNLP (Manning et al., 2014) lemmatized Wikipedia to train word embeddings for phrase level evaluation, in line with Shutova et al. (2016). For sentence evaluation, we use the original Wikipedia for training word embeddings.
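These settings map directly onto Gensim's word2vec implementation; a configuration sketch under that assumption follows. The file name is a hypothetical placeholder for a pre-tokenized dump.

```python
# Sketch of the embedding training settings above (assuming Gensim):
# 3 epochs, window 5, 100 dimensions, minimum frequency 100, 10 negative
# samples, 1e-5 down-sampling, for both CBOW (sg=0) and Skip-gram (sg=1).
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Hypothetical path: one tokenized (or lemmatized) sentence per line.
wiki_sentences = LineSentence("enwiki-20170920-tokenized.txt")

common = dict(vector_size=100, window=5, min_count=100,
              negative=10, sample=1e-5, epochs=3)
cbow_model = Word2Vec(corpus_iterable=wiki_sentences, sg=0, **common)
sg_model = Word2Vec(corpus_iterable=wiki_sentences, sg=1, **common)
```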
6 Experimental Results

6.1 Metaphor identification

Table 1 shows the performance of our model and the baselines on the task of metaphor identification. All the results for our models are based on a threshold of 0.6, empirically determined on the development set.

        Method                  P     R     F1
Phrase  Shutova et al. (2016)   0.67  0.76  0.71
        Rei et al. (2017)       0.74  0.76  0.74
        SIM-CBOW_I+O            0.66  0.78  0.72
        SIM-SG_I+O              0.68  0.82  0.74*
Sent.   Melamud et al. (2016)   0.60  0.80  0.69
        SIM-SG_I                0.56  0.95  0.70
        SIM-SG_I+O              0.62  0.89  0.73
        SIM-CBOW_I              0.59  0.91  0.72
        SIM-CBOW_I+O            0.66  0.88  0.75*

Table 1: Metaphor identification results. NB: * denotes that our model outperforms the baseline significantly, based on a two-tailed paired t-test with p < 0.001.

For sentence level metaphor identification, it can be observed that all our models outperform the baseline (Melamud et al., 2016), with SIM-CBOW_I+O giving the highest F1 score of 75%, a 6% gain over the baseline. We also see that models based on both input and output vectors (i.e., SIM-CBOW_I+O and SIM-SG_I+O) yield better performance than the models based on input vectors only (i.e., SIM-CBOW_I and SIM-SG_I). This observation supports our assumption that using input and output vectors can better model similarity between words with different types of POS than simply using input vectors. Comparing CBOW and Skip-gram based models, we see that CBOW based models generally achieve better precision, whereas Skip-gram based models achieve better recall.

For phrase level metaphor identification, we compare our best performing models (i.e., SIM-CBOW_I+O and SIM-SG_I+O) against the approaches of Shutova et al. (2016) and Rei et al. (2017). In contrast to the sentence level evaluation, in which SIM-CBOW_I+O gives the best performance, SIM-SG_I+O performs best at the phrase level. This is likely due to the fact that Skip-gram is trained by using a centre word to maximise the probability of each context word, whereas CBOW uses the average of the context word input vectors to maximise the probability of the centre word. Thus, Skip-gram performs better at modelling a one-word context, while CBOW performs better at modelling multi-word contexts. Compared to the baselines, our model SIM-SG_I+O significantly outperforms the word embedding based approach of Shutova et al. (2016), and matches the performance of the supervised deep learning method (Rei et al., 2017), which requires a large amount of labelled data and training time.

SIM-CBOW_I+O and SIM-SG_I+O are also evaluated with different thresholds for both phrase and sentence level metaphor identification. As can be seen from Table 2, the results are fairly stable in terms of F1 when the threshold is set between 0.5 and 0.9.
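The threshold analysis behind Table 2 amounts to sweeping τ over development-set scores; a sketch, assuming a list of Eq. 14 similarity scores paired with gold labels:

```python
# Sketch of the development-set threshold sweep behind Table 2: a target
# word is predicted metaphoric when its Eq. 14 score is <= tau.
def sweep(scores, gold, taus=(0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0)):
    for tau in taus:
        pred = ["metaphoric" if s <= tau else "literal" for s in scores]
        tp = sum(p == g == "metaphoric" for p, g in zip(pred, gold))
        prec = tp / max(1, pred.count("metaphoric"))
        rec = tp / max(1, gold.count("metaphoric"))
        f1 = 2 * prec * rec / max(1e-9, prec + rec)
        print(f"tau={tau:.1f}  P={prec:.2f}  R={rec:.2f}  F1={f1:.2f}")
```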

      Sentence (SIM-CBOW_I+O)   Phrase F1
τ     P     R     F1            SIM-CBOW_I+O  SIM-SG_I+O
0.3   0.75  0.60  0.67          0.56          0.46
0.4   0.69  0.75  0.72          0.65          0.63
0.5   0.67  0.82  0.74          0.71          0.72
0.6   0.66  0.88  0.75          0.72          0.74
0.7   0.64  0.88  0.74          0.72          0.73
0.8   0.63  0.89  0.74          0.72          0.73
0.9   0.63  0.89  0.74          0.71          0.73
1.0   0.50  1.00  0.67          0.65          0.65

Table 2: Model performance vs. different threshold (τ) settings. NB: the sentence level results are based on SIM-CBOW_I+O.

6.2 Metaphor processing for MT

We believe that one of the key purposes of metaphor processing is to support other NLP tasks. Therefore, we conducted another set of experiments to evaluate how metaphor processing can be used to support English-Chinese machine translation.

The evaluation task was designed as follows. From the test set for sentence-level metaphor identification, which contains 200 metaphoric and 200 literal sentences, we randomly selected 50 metaphoric and 50 literal sentences to construct a set S_M for the Machine Translation (MT) evaluation task. For each sentence in S_M, if it is predicted as literal by our model, the sentence is kept unchanged; otherwise, the target word of the sentence is paraphrased with the best fit word (see §4.1). The metaphor identification step resulted in 42 True Positive (TP) instances, where the ground truth label is metaphoric, and 19 False Positive (FP) instances, where the ground truth label is literal, i.e., a total of 61 instances predicted as metaphorical by our model. We also ran one of our baseline models, Context2Vec, on the 61 sentences to predict best fit words for comparison. Our hypothesis is that by paraphrasing the metaphorically used target word with the best fit word, which expresses the target word's real meaning, the performance of translation engines can be improved.

We test our hypothesis on two popular English-Chinese MT systems, i.e., the Google and Bing Translators. We recruited from a UK university 5 Computing Science postgraduate students who are Chinese native speakers to participate in the English-Chinese MT evaluation task. During the evaluation, subjects were presented with a questionnaire containing English-Chinese translations of each of the 100 randomly selected sentences. For each sentence predicted as literal (39 out of 100 sentences), there are two corresponding translations, by Google and Bing respectively. For each sentence predicted as metaphoric (61 out of 100 sentences), there are 6 corresponding translations.

An example of the evaluation task is shown in Figure 6, in which "The ex-boxer's job is to bounce people who want to enter this private club." is the original sentence, followed by the WordNet explanation of the target word of the sentence (i.e., bounce: eject from the premises). There are 6 translations. Nos. 1-2 are translations of the original sentence by Google Translate (GT) and Bing Translator (BT); the target word, bounce, is translated taking the sense of (1) physically rebounding like a ball (反弹) and (2) jumping (弹跳). Nos. 3-4 are SIM-CBOW_I+O paraphrased sentences, translated by GT and BT respectively, taking the sense of refusing (拒绝). Nos. 5-6 are Context2Vec paraphrased sentences, translated by GT and BT respectively, taking the sense of hitting (5. 打; 6. 打击).

[Figure 6: MT-based metaphor interpretation questionnaire. A sample item shows the original English sentence, the WordNet gloss of the target word, and six candidate Chinese translations, each rated Good/Bad.]

Subjects were instructed to determine whether the translation of a target word correctly represents its sense within the translated sentence, matching its context (cohesion) in Chinese. Note that we evaluate the translation of the target word; therefore, errors in context word translations are ignored by the subjects. The final label for each translation is the one agreed by more than half of the annotators. Noticeably, based on our observation, there is always a Chinese word corresponding to an English target word in MT, as the annotated target word normally carries important information in the sentence in the applied dataset.
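Put together, the preprocessing applied before translation reduces to one conditional rewrite per sentence. A sketch, reusing the identify() function from §4.1's example; the resulting string would be handed to an MT system, which the paper accessed through the public Google/Bing web interfaces rather than any particular API.

```python
# Sketch of the MT evaluation preprocessing: metaphoric target words are
# replaced by their best fit word before the sentence is sent to MT.
# identify() is the Step (3)-(4) sketch above; candidates/lemma as there.
def prepare_for_mt(model, tokens, target_idx, candidates, lemma):
    target = tokens[target_idx]
    context = tokens[:target_idx] + tokens[target_idx + 1:]
    best_fit, label = identify(model, target, context, candidates, lemma)
    if label == "metaphoric":
        tokens = tokens[:target_idx] + [best_fit] + tokens[target_idx + 1:]
    return " ".join(tokens)  # hand this string to the MT system
```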

We use translation accuracy as the measure to evaluate the improvement of the MT systems after metaphor processing. The accuracy is calculated by dividing the number of correctly translated instances by the total number of instances.

        Method         Acc-met.  Acc-lit.  Acc-overall
Google  Orig. Sent.    0.34      0.68      0.51
        Context2Vec    0.50      0.66      0.58
        SIM-CBOW_I+O   0.60      0.64      0.62
Bing    Orig. Sent.    0.42      0.70      0.56
        Context2Vec    0.60      0.66      0.63
        SIM-CBOW_I+O   0.66      0.64      0.65

Table 3: Accuracy of metaphor interpretation, evaluated on Google and Bing Translation.

[Figure 5: Accuracy of metaphor interpretation, evaluated on Google and Bing Translation. Bars compare original sentences with sentences paraphrased by our model and by the baseline (Melamud et al., 2016) for the literal, metaphoric, and overall classes; our model gains +0.26/+0.24 on the metaphoric class and +0.11/+0.09 overall for Google/Bing.]

As can be seen in Figure 5 and Table 3, after paraphrasing the metaphorical sentences with the SIM-CBOW_I+O model, the translation improvement on the metaphorical class is dramatic for both MT systems, i.e., a 26% improvement for Google Translate and 24% for Bing Translator. For the literal class, there is a small drop (4-6%) in accuracy, because some literals were wrongly identified as metaphors and hence errors were introduced during paraphrasing. Nevertheless, with our model, the overall translation performance of Google and Bing Translate is significantly improved, by 11% and 9% respectively. Our baseline model Context2Vec also improves the translation accuracy, but is 2-4% lower than our model in overall accuracy. In summary, the experimental results show the effectiveness of applying metaphor processing to support Machine Translation.

7 Conclusion

We proposed a framework that identifies and interprets metaphors at word-level with an unsupervised learning approach. Our model outperforms the unsupervised baselines in both sentence and phrase evaluations. The interpretation of the identified metaphorical words given by our model also improves the Google and Bing translation systems, with 11% and 9% accuracy gains respectively. The experiments show that using words' hypernyms and synonyms from WordNet can paraphrase metaphors into their literal counterparts, so that the metaphors can be correctly identified and translated. To our knowledge, this is the first study that evaluates a metaphor processing method on Machine Translation. We believe that, compared with simply identifying metaphors, metaphor processing applied to practical tasks can be more valuable in the real world. Additionally, our experiments demonstrate that using a candidate word's output vector, instead of its input vector, to model the similarity between the candidate word and its context yields better results in identifying the best fit word (the literal counterpart of the metaphor). CBOW and Skip-gram do not consider the distance between a context word and a centre word in a sentence, i.e., each context word contributes equally to predicting the centre word. Future work will introduce weighted CBOW and Skip-gram to learn positional information within sentences.

Acknowledgments

This work is supported by an award from the UK Engineering and Physical Sciences Research Council (Grant number: EP/P005810/1).

References

Dan Assaf, Yair Neuman, Yohai Cohen, Shlomo Argamon, Newton Howard, Mark Last, Ophir Frieder, and Moshe Koppel. 2013. Why "dark thoughts" aren't really dark: A novel algorithm for metaphor identification. In 2013 IEEE Symposium on Computational Intelligence, Cognitive Algorithms, Mind, and Brain (CCMB), pages 60-65.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3(Jan):993-1022.

Lynne Cameron. 2003. Metaphor in Educational Discourse. A&C Black.

Max Coltheart. 1981. The MRC psycholinguistic database. The Quarterly Journal of Experimental Psychology 33(4):497-505.

Christiane Fellbaum. 1998. WordNet: An Electronic Lexical Database. Bradford Books.

Ilana Heintz, Ryan Gabbard, Mahesh Srinivasan, David Barner, Donald S. Black, Marjorie Freedman, and Ralph Weischedel. 2013. Automatic extraction of linguistic metaphor with LDA topic modeling. In Proceedings of the First Workshop on Metaphor in NLP (ACL 2013), pages 58-66.

Luuk Lagerwerf and Anoe Meijers. 2008. Openness in metaphorical and straightforward advertisements: Appreciation effects. Journal of Advertising 37(2):19-30.

Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55-60.

James H. Martin. 2006. A corpus-based analysis of context effects on metaphor comprehension. Technical Report CU-CS-738-94, University of Colorado, Computer Science Department.

Oren Melamud, Jacob Goldberger, and Ido Dagan. 2016. context2vec: Learning generic context embedding with bidirectional LSTM. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning (CoNLL 2016), pages 51-61.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In Proceedings of the International Conference on Learning Representations (ICLR 2013).

Saif M. Mohammad, Ekaterina Shutova, and Peter D. Turney. 2016. Metaphor as a medium for emotion: An empirical study. In Proceedings of the Joint Conference on Lexical and Computational Semantics (*SEM 2016), page 23.

Eric Nalisnick, Bhaskar Mitra, Nick Craswell, and Rich Caruana. 2016. Improving document ranking with dual word embeddings. In Proceedings of the 25th International Conference Companion on World Wide Web, pages 83-84.

Yair Neuman, Dan Assaf, Yohai Cohen, Mark Last, Shlomo Argamon, Newton Howard, and Ophir Frieder. 2013. Metaphor identification in large texts corpora. PLoS ONE 8(4):e62343.

Radim Řehůřek and Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45-50. http://is.muni.cz/publication/884893/en.

Marek Rei, Luana Bulat, Douwe Kiela, and Ekaterina Shutova. 2017. Grasping the finer point: A supervised similarity network for metaphor detection. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP 2017), pages 1537-1546.

Vassiliki Rentoumi, George A. Vouros, Vangelis Karkaletsis, and Amalia Moser. 2012. Investigating metaphorical language in sentiment analysis: A sense-to-sentiment perspective. ACM Transactions on Speech and Language Processing (TSLP) 9(3):6.

Xin Rong. 2014. word2vec parameter learning explained. arXiv preprint arXiv:1411.2738.

Ekaterina Shutova. 2016. Design and evaluation of metaphor processing systems. Computational Linguistics.

Ekaterina Shutova, Douwe Kiela, and Jean Maillard. 2016. Black holes and white rabbits: Metaphor identification with visual features. In Proceedings of the 15th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2016), pages 160-170.

Ekaterina Shutova, Lin Sun, Elkin Darío Gutiérrez, Patricia Lichtenstein, and Srini Narayanan. 2017. Multilingual metaphor processing: Experiments with semi-supervised and unsupervised learning. Computational Linguistics 43(1):71-123.

Gerard J. Steen, Aletta G. Dorst, J. Berenike Herrmann, Anna Kaal, Tina Krennmayr, and Trijntje Pasma. 2010. A Method for Linguistic Metaphor Identification: From MIP to MIPVU, volume 14. John Benjamins Publishing.

Tomek Strzalkowski, George Aaron Broadwell, Sarah Taylor, Laurie Feldman, Samira Shaikh, Ting Liu, Boris Yamrom, Kit Cho, Umit Boz, Ignacio Cases, et al. 2013. Robust extraction of metaphor from novel data. In Proceedings of the First Workshop on Metaphor in NLP (ACL 2013), pages 67-76.

Yulia Tsvetkov, Leonid Boytsov, Anatole Gershman, Eric Nyberg, and Chris Dyer. 2014. Metaphor detection with cross-lingual model transfer. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL 2014), pages 248-258.

Peter D. Turney, Yair Neuman, Dan Assaf, and Yohai Cohen. 2011. Literal and metaphorical sense identification through concrete and abstract context. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2011), pages 680-690.

Yorick Wilks. 1975. A preferential, pattern-seeking, semantics for natural language inference. Artificial Intelligence 6(1):53-74.

Yorick Wilks. 1978. Making preferences more active. Artificial Intelligence 11(3):197-223.

Geoffrey Zweig and Christopher J.C. Burges. 2011. The Microsoft Research sentence completion challenge. Technical Report MSR-TR-2011-129, Microsoft.
