
Syntactic and Semantic Improvements to Computational Metaphor Processing

by Kevin Stowe
B.A., Michigan State University, 2009
M.A., Indiana University, 2011

A thesis submitted to the Faculty of the Graduate School of the University of Colorado in partial fulfillment of the requirements for the degree of Doctor of Philosophy, Department of Linguistics, 2019

This thesis entitled: Syntactic and Semantic Improvements to Computational Metaphor Processing, written by Kevin Stowe, has been approved for the Department of Linguistics

Martha Palmer

James H. Martin

Date

The final copy of this thesis has been examined by the signatories, and we find that both the content and the form meet acceptable presentation standards of scholarly work in the above mentioned discipline.

Stowe, Kevin (Ph.D., Linguistics)
Syntactic and Semantic Improvements to Computational Metaphor Processing
Thesis directed by Professors Martha Palmer & Jim Martin

Identifying and interpreting figurative language is necessary for comprehensive natural language understanding. The main body of work on computational metaphor processing is based in lexical semantics. We've seen recent evidence that syntactic constructions play a part in our production and comprehension of metaphors; the goal of this work is to identify areas where these theories can improve metaphor processing. This is done by exploiting dependency parses, syntax-based lexical resources, and distant supervision using linguistic analysis. Through these methods we show improvements over state-of-the-art deep learning models in a variety of metaphor processing tasks.

Dedicated to my parents, Ron and Kristine, who made all of this possible.

Acknowledgements

I would first and foremost like to thank my advisors Martha and Jim for all of their help making this possible, and for providing me the opportunity to pursue this interesting path of research. I would also like to thank Susan Brown for her support on all things VerbNet, and her constant encouragement. I am grateful to Laura Michaelis for her invaluable insight into every facet of language, and for helping me pursue many of the semantic details we both find interesting. I am also grateful for the support of Oana David, and for her inspiring work on the interaction between metaphors and constructions.

I would also like to thank the many other collaborators I've had the opportunity to work with at the University of Colorado. First, the people at Project EPIC: Leysia Palen, Ken Anderson, Jennings Anderson, Marina Kogan, and Melissa Bica have all been tremendous supporters and colleagues, giving me the opportunity to apply what I've learned in practical settings and opening my mind to more concrete, practical applications of our research. Additionally, the support of the Institute of Cognitive Science has been tremendous, especially through the practicums offered by Sidney D'Mello and Tamara Sumner.

Finally, I would like to thank all the fellow students who have helped and inspired my work. Jenette Preciado was a constant source of help and encouragement. I would also like to thank Claire Bonial, Meredith Green, Rebecca Lee, James Gung, and Tim O'Gorman for their support for all of the various linguistic, computational, and personal components that make this kind of research possible.

Contents

1 Introduction
    1.1 The Problem
    1.2 Research Questions
    1.3 Approach

2 Linguistic Background
    2.1 Some Basics
    2.2 Conceptual Metaphor Theory
        2.2.1 Invariance Principle
        2.2.2 Hierarchical Structure
        2.2.3 Word Senses
        2.2.4 Metaphor as Purely Cognitive
        2.2.5 Analysis
    2.3 Selectional Preferences and Lexical Features
        2.3.1 Selectional Preferences
        2.3.2 Lexical Features
        2.3.3 Analysis
    2.4 Alternatives
        2.4.1 Blending Theory
        2.4.2 Class Inclusion
    2.5 Frames, Metaphors, and Constructions
        2.5.1 Adjective-noun constructions
        2.5.2 Argument Structure Constructions
        2.5.3 Metaphor Identification by Construction
        2.5.4 Analysis
    2.6 Differentiating Metaphors from Other Language
        2.6.1 Literal vs Figurative
        2.6.2 Similes
        2.6.3 Metonymy
        2.6.4 Idioms
        2.6.5 Analysis
    2.7 Summary

3 Computational Background
    3.1 What's the task?
    3.2 Knowledge-based Systems
        3.2.1 MIDAS
        3.2.2 Induction-based Reasoning
        3.2.3 MetaNet
    3.3 Machine Learning
        3.3.1 Features
        3.3.2 Syntax and Lexical Resources
    3.4 Word Embeddings
        3.4.1 Types of Embedding Models
        3.4.2 Embeddings for Metaphor
    3.5 Neural Networks
    3.6 Summary

4 Lexical Resources
    4.1 Metaphors in Lexical Resources
    4.2 VerbNet
        4.2.1 Metaphoric/Literal VerbNet Classes
        4.2.2 Thematic Roles
        4.2.3 Syntactic Frames
        4.2.4 Semantic Frames
        4.2.5 Previous Applications of VerbNet for Metaphor Processing
        4.2.6 Summary
    4.3 FrameNet
        4.3.1 Frames
        4.3.2 Metaphoric/Literal FrameNet Frames
        4.3.3 Frame Elements
        4.3.4 Previous Applications of FrameNet for Metaphor Processing
        4.3.5 Summary
    4.4 PropBank
        4.4.1 Previous Applications of PropBank for Metaphor Processing
        4.4.2 Summary
    4.5 WordNet
        4.5.1 OntoNotes Sense Groupings
        4.5.2 Previous Applications of WordNet for Metaphor Processing
        4.5.3 Summary
    4.6 Lexical Resources Summary

5 Corpora
    5.1 Introduction
    5.2 Difficulties in Annotation
        5.2.1 Conventionalized metaphors
        5.2.2 Unit of analysis
        5.2.3 Different kinds of Figuration
    5.3 VUAMC
    5.4 LCC
    5.5 TroFi
    5.6 The Mohammad et al. Dataset (MOH)
    5.7 Summary

6 Methods
    6.1 Tasks
        6.1.1 VUAMC
        6.1.2 MOH-X
        6.1.3 TroFi
        6.1.4 LCC
    6.2 Computational Methods
        6.2.1 Feature-based Machine Learning
            Support Vector Machines
        6.2.2 Deep Learning
            Long Short-Term Memory Networks
    6.3 Syntactic Features and Representations
    6.4 Baselines
        6.4.1 A Note on Significance
    6.5 Summary

7 Dependency Structures
    7.1 Introduction
    7.2 Implementation
    7.3 Results

8 VerbNet Classes and Embeddings
    8.1 VerbNet Structures
        8.1.1 Structural Components
            Frame Prediction
            Thematic Roles
        8.1.2 Results
    8.2 VerbNet-based Embeddings
        8.2.1 Implementation
        8.2.2 Results

9 Distant Supervision
    9.1 Introduction
    9.2 Data Extraction
        9.2.1 VerbNet Analysis
        9.2.2 Syntactic Pattern Analysis
    9.3 Implementation
    9.4 Results

10 Putting it all together
    10.1 Results

11 Analysis
    11.1 Feature Weights
    11.2 Error Analysis
        Verbs
        Nouns
        Adjectives
        Syntactic Analysis

12 Future Work
    12.1 Representing Constructions
    12.2 Quality Data
    12.3 Quality Tasks
    12.4 Understanding Linguistic Metaphor

13 Bibliography

Appendix A

Appendix B

Appendix C

List of Tables

4.1 "Grow" lexical units in FrameNet

5.1 VUAMC counts
5.2 LCC counts of pairs
5.3 TroFi counts
5.4 MOH counts

6.1 Tasks, methods, and algorithms
6.2 Dataset overview
6.3 Baseline results

7.1 Dependency feature F1 results (* denotes significant improvements over the baseline, p < .01)

8.1 Frame prediction results (macro F1)
8.2 VerbNet structure F1 results (* denotes significant improvements over the baseline, p < .01)
8.3 Sample training data using replacement
8.4 VerbNet embedding F1 results (* denotes significant improvements over the baseline, p < .01)

9.1 Challenging verbs: on the left, the verbs with the most even split between literal and metaphoric. On the right, verbs in the validation set that were most often misclassified.
9.2 Example analysis of syntactic patterns and VerbNet classes.
9.3 Total samples extracted from VerbNet classes and syntactic patterns.
9.4 Additional data F1 results (* denotes significant improvements over the baseline, p < .01)
9.5 Difference in F1 scores for analyzed verbs after data was added.

10.1 Combined results

11.1 Lemmas with strongest weights for the negative (literal) and positive (metaphoric) classes.
11.2 Most misclassified verbs (with at least 5 positive samples)
11.3 Most misclassified nouns (with at least 5 positive samples)
11.4 Most misclassified adjectives (with at least 5 positive samples)

A.1 Syntactic pattern analysis of the most ambiguous (top) and most misclassified (bottom) verbs from the VUAMC.
A.2 VerbNet analysis of the most ambiguous (top) and most misclassified (bottom) verbs from the VUAMC.

B.1 Possible syntactic frames after compression.

C.1 F1, precision, and recall scores for LCC classification

List of Figures

1.1 Dimensions of analysis

3.1 Metaphor mappings from 'killing' to 'terminate-process'
3.2 Metaphor processing pipeline
3.3 Outline of basic machine learning procedure
3.4 Outline of machine learning for metaphor procedure
3.5 Outline of word embedding procedure

4.1 Overview of grow-26.2.1 class
4.2 Overview of calibratable_cos-45.6.1-1 class
4.3 Thematic roles from the eat-39.1 class
4.4 Syntactic frames for the eat-39.1-1 class
4.5 Semantics for frame in cut-21.1
4.6 Semantics for frame in hit-18.1
4.7 Growing_Food frame
4.8 PropBank frame file for the word "glance"

5.1 Metaphor Identification Procedure (MIP)

6.1 Dimensions of analysis

7.1 Exploration: Dependency parse-based features
7.2 Basic dependency parse

8.1 Exploration: VerbNet structure-based features
8.2 Exploration: VerbNet embedding-based features

9.1 Exploration: Additional data

10.1 Exploration: Combining methods

11.1 Aggregated weights for baseline features
11.2 Aggregated weights for all features

Notes on Terminology and Formatting

NLP: Natural Language Processing
CMT: Conceptual Metaphor Theory
AMR: Abstract Meaning Representation
WSD: Word Sense Disambiguation
NP: Noun Phrase
PP: Prepositional Phrase
LDA: Latent Dirichlet Allocation
LSA: Latent Semantic Analysis
GloVe: Global Vectors for Word Representation
ELMo: Embeddings from Language Models
BERT: Bidirectional Encoder Representations from Transformers
MLP: Multi-Layer Perceptron
CNN: Convolutional Neural Network
(bi)-LSTM: (bidirectional) Long Short-Term Memory network
SVM: Support Vector Machine
VUAMC: Vrije Universiteit Amsterdam Metaphor Corpus
LCC: Language Computer Corporation corpus
TroFi: The Trope Finder dataset
MOH: The Mohammad et al. [2016] Dataset

SMALLCAPS: metaphoric domains and mappings
+/-SMALLCAPS: small caps with a + or - mark selectional restrictions, following VerbNet notation
bold: used in examples to highlight target words
(...): irrelevant portions of longer examples
(L): example is annotated as "literal"
(M): example is annotated as "metaphoric"
span: text between tags is annotated as evoking a "source" domain
span: text between tags is annotated as evoking a "target" domain

Chapter 1

Introduction

Humans understand a wide variety of creative language use, including many kinds of metaphors, but natural language processing (NLP) systems have traditionally struggled with this kind of non-standard language. While many approaches rely on lexical resources and knowledge bases, there is evidence that syntactic constructions also affect metaphoricity. The goal of this work is to explore syntactically motivated computational methods for metaphor processing. We aim to use linguistic analysis to inform computational systems, and eventually use improved computational models to shed light on the interaction between syntax and metaphor in natural language.

Figurative language has long posed problems for linguists, computational and otherwise. The variety of expressions and their nuances of meaning are relatively easy for humans to understand, but hard to formalize in ways that are computationally and cognitively coherent. Historically, figurative language, and metaphor in particular, has often been seen as anomalous, marked, and/or otherwise strange, perhaps comprising a part of natural language that lies outside the realm of rigorous analysis. Because metaphors are often perceived as poetic and non-standard, NLP has largely relegated the study of automatic metaphor processing to the future: there are so many problems with handling aspects of literal language that the analysis of the non-literal, if it needs to be done at all, must wait until the simpler, literal language is solved.

However, recent work has brought to light that metaphors are in fact common and productive. They have a cognitive basis that allows them to be used in creative linguistic utterances, but they are also employed in much of our everyday language use. This ubiquity of metaphor shows that natural language understanding requires representations not only for literal utterances but also for the metaphoric.
So much of our language is driven by conceptual metaphoric mappings that attempting to provide accurate semantics for any given sentence is likely to require a functional representation of metaphoric meaning. To this end, there has been a plethora of research in the past 10 years attempting to identify and interpret metaphors automatically, but the task remains especially difficult, even relative to the other significant roadblocks in natural language processing.

This work will attempt to alleviate some of the difficulty by exploring new techniques for automatic metaphor processing. We aim to show that the deficits in current research lie in its reliance on lexical- and sentential-level semantics, and that the addition of "mid-level" syntactic features will provide a link that improves metaphor understanding. This can be accomplished by exploring syntactic constructions and the possible ways these can be represented computationally. We will generally be limited to basic predicate-argument constructions, as these can be automatically extracted via dependency parses, but we will also attempt to show that our analysis and methods are broadly applicable to other constructions. Our goal is to explore how the syntactic features of these constructions can be represented computationally, and to develop models for incorporating lexical semantics into these syntactic representations. We intend to show that metaphor processing can benefit from improved computational representations of syntax and semantics.

1.1 The Problem

The primary motivation for improving automatic metaphor understanding is that metaphor is extremely common both in speech and text, and requires special attention that standard semantic processing doesn't supply. Consider some typical examples using the structure of war to describe the nature of argumentation:

1. In several books and articles, he defended his position against its detractors1

2. His speech attacked Maori as a whole

In the examples above, literal semantic interpretations of 'defend' and 'attack' yield nonsensical utterances: a physical position cannot reasonably be defended by a publication, nor can a speech physically attack any kind of entity. We require some knowledge of the metaphor in question: that arguments can be conceptualized as war, and that attributes of the WAR domain can be projected onto the ARGUMENT domain. This theory of metaphor is generally known as Conceptual Metaphor Theory (CMT) [Lakoff and Johnson, 1980b; Lakoff and Turner, 1989; Lakoff, 1993]. Metaphors from this perspective are conceptual: we have mappings between domains in our cognitive systems, which can then be communicated to improve understanding via language or other modalities.

Typical natural language understanding systems tend to be unaware, or only abstractly aware, of the metaphoric meaning of utterances. They are designed to provide correct semantic interpretations for given utterances, and these utterances are generally considered to be literal. However, given the ubiquity of metaphor in everyday thought and language, we would prefer to be able to generate correct semantics for a wider variety of language, including figurative language.

One option is to treat different metaphors as different word senses. In the examples above, we could consider "defend" to have multiple word senses, one meaning to defend a physical location and another meaning to make a case for a particular argument. In many situations this is preferable, as the metaphoric senses are extremely conventional and often fall into a finite set of senses for a particular word. However, this is not always a practical solution: it ignores the commonalities among metaphoric mappings, and many words can be used in such a wide variety of metaphors that it would be impractical to include them all as separate word senses.
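The notion of projecting attributes of the WAR domain onto the ARGUMENT domain can be made concrete with a toy data structure. The correspondences below are illustrative inventions for this sketch, not drawn from any existing metaphor resource:

```python
# A conceptual metaphor sketched as a mapping between domains: each entry
# pairs a concept in the source domain (WAR) with a counterpart in the
# target domain (ARGUMENT). All correspondences here are invented examples.
ARGUMENT_IS_WAR = {
    "attack": "criticize",   # attacking a position -> criticizing a claim
    "defend": "support",     # defending a position -> supporting a claim
    "position": "claim",
    "win": "persuade",
}

def project(source_concept, mapping):
    """Project a source-domain concept onto the target domain, if mapped."""
    return mapping.get(source_concept, source_concept)

print(project("attack", ARGUMENT_IS_WAR))  # -> criticize
print(project("shark", ARGUMENT_IS_WAR))   # unmapped concepts pass through
```

The point of the sketch is structural consistency: only the aspects of the source domain that have a correspondence are carried over, while everything else is left untouched.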

1 All examples in this dissertation are drawn from Sketch Engine (http://www.sketchengine.eu) [Kilgarriff et al., 2014], unless otherwise cited.

Instead, metaphor processing is typically viewed as comprising two tasks, both of which are required to develop the correct interpretation of an utterance:

• Identification: identifying which words, phrases, and/or sentences are being used metaphorically.

• Interpretation: generating a semantic interpretation of a given metaphorical word, phrase, and/or sentence.
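As a rough sketch of how the two tasks compose, consider a toy pipeline in which identification gates interpretation. The tiny lexicon of metaphoric verb-argument pairs and their paraphrases is invented for illustration; real systems learn this from annotated data:

```python
# Toy two-stage pipeline: identification flags metaphoric verb-argument
# pairs, and interpretation paraphrases only the flagged ones. The lexicon
# below is purely illustrative.
METAPHORIC_PARAPHRASES = {
    ("attack", "speech"): "criticize",
    ("defend", "publication"): "argue for",
}

def identify(verb, arg):
    """Stage 1: is this verb-argument pair being used metaphorically?"""
    return (verb, arg) in METAPHORIC_PARAPHRASES

def interpret(verb, arg):
    """Stage 2: literal paraphrase for identified metaphors; pass literal uses through."""
    return METAPHORIC_PARAPHRASES[(verb, arg)] if identify(verb, arg) else verb

print(interpret("attack", "speech"))  # -> criticize
print(interpret("attack", "castle"))  # -> attack (literal use, unchanged)
```

Note that `interpret` calls `identify` first; this mirrors the argument below that interpretation without identification is impractical, since the system must first know that the "typical" meaning is not in play.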

These tasks are typically treated separately, with some notable exceptions [Shutova, 2013]. While not the only way to undertake the task of understanding metaphor computationally, these two tasks give easily understandable goals, and can be directly implemented to improve downstream performance on other NLP tasks. Effective metaphor identification is a practical initial step: it allows a processing system to understand that the "typical" meaning of a word or phrase is not being employed, and that some other kind of meaning is intended. Interpretation is a natural follow-on: once we know that a word or phrase isn't intended literally, we can then produce a correct semantic interpretation that can be used in downstream tasks. It may be theoretically possible to do interpretation without identification, but it seems impractical: if a system is to generate the correct semantic representation for a novel metaphor, it seems likely that it needs to understand the metaphoric mapping being evoked and have some method for identifying the words which evoke the mapping, which is essentially the identification process.

In recent research in metaphor processing, identification has largely been approached as a supervised machine learning problem, typically using lexical semantic features and their interaction with context to learn the kinds of situations where lexical metaphors appear. Interpretation is a more complex problem. Generating semantic interpretations for literal utterances is difficult enough, as there are many possible representations of meaning, including logic-based systems [Ali and Shapiro, 1993], combinatory categorial grammar (CCG) based representations [Artzi et al., 2015], abstract meaning representations (AMR) [Banarescu et al., 2013], and more. Approaches to metaphoric interpretation tend to vary in scope and theoretical goals. They include implementations of full metaphoric interpretation systems [Martin, 1990; Ovchinnikova et al., 2014], identification of source and target domains [Dodge et al., 2015; Rosen, 2018], development of knowledge bases [Gordon et al., 2015], and provision of literal paraphrases for metaphoric phrases [Shutova, 2010; Shutova, 2013].

Results from these systems are difficult to directly compare due to the wide variety of corpora, evaluation metrics, syntactic structures analyzed, definitions of metaphoricity, and experimental setups. A shared task has been developed to allow for direct comparisons [Leong et al., 2018], but it has yet to be used for extensive, rigorous system evaluation, and there are numerous concerns with the corpus used for evaluation (see Chapter 5). Interpretation systems are even more difficult to assess, as they tend to have different endpoints (including literal paraphrases, knowledge bases, identified source and target domains, etc.).

Our problem is that of the interaction of syntax and metaphor. There are many ways of representing syntactic structures in computationally feasible ways, and we aim to explore this space with regard to metaphor, under the assumption that there are syntactic cues to metaphor understanding. To this end we will implement and analyze a variety of syntax-based methods for improving NLP metaphor tasks, and study their effectiveness and contribution to computational metaphor understanding.

1.2 Research Questions

A key feature missing from most metaphor processing systems, regardless of task, is syntactic structure. Systems tend to either include shallow syntactic information (contextual windows based on word order, or dependency parse structures) or none at all. Most rely heavily on word- and sentence-level semantics. Many systems rely only on the lexical semantics of target words, or use only minimal context or dependency relations to help disambiguate in context [Gargett and Barnden, 2015; Rai et al., 2016]. Others rely on topic modeling and other document- and sentence-level features to provide general semantics, and compare the lexical semantics to that, ignoring the more "middle"-level syntactic interactions [Heintz et al., 2013]. While these approaches have been effective in many areas, there is recent evidence that figurative language is significantly influenced by syntactic constructions [Sullivan, 2013], and if these constructions can be represented more effectively, metaphor processing capabilities can be improved.

To implement these models of syntactically motivated metaphor meaning, we propose to incorporate in-depth syntax-based representations into classification. In order to undertake this task, we will explore a variety of possible syntactic feature representations, including syntactic tree kernels, higher-order tensors for vector representations, and syntactic frame structures from lexical resources. We aim to develop utterance representations that interweave syntax and semantics in a computationally feasible manner, allowing for better metaphor understanding and processing. This area of research will be explored through a handful of research questions:

• What are the strengths and weaknesses of different approaches to combining syntactic and semantic information for metaphor processing?

• How can we best construct features that implement these syntactic structures computationally?

• How do our results clarify the relationship between syntactic structures and metaphor?

These questions will be explored through linguistic and computational lenses, beginning with an overview of historic and modern linguistic and computational approaches to figurative language in general and metaphor specifically.
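One concrete way to realize the "middle"-level syntactic information discussed above is to turn the dependency relations around a target word into sparse features. The sketch below operates on a hand-coded parse of example (2); in practice a dependency parser (e.g. spaCy or Stanza) would supply the (head, relation, dependent) triples:

```python
# Sketch: derive feature strings for a target verb from dependency triples.
# The parse of "his speech attacked Maori" is hand-coded for illustration;
# a real parser would produce it automatically.
parse = [
    ("attacked", "nsubj", "speech"),
    ("attacked", "dobj", "Maori"),
    ("speech", "poss", "his"),
]

def dependency_features(target, triples):
    """Collect relation-labeled features for the target's dependents and head."""
    feats = []
    for head, rel, dep in triples:
        if head == target:
            feats.append(f"{rel}_child={dep}")  # target governs dep via rel
        if dep == target:
            feats.append(f"{rel}_head={head}")  # target is governed by head
    return feats

print(dependency_features("attacked", parse))
# -> ['nsubj_child=speech', 'dobj_child=Maori']
```

Features of this shape (the verb's subject and object lemmas, labeled by relation) can then be fed to any of the classifiers discussed later; the naming scheme here is one illustrative choice among many.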

1.3 Approach

To achieve these goals, we will experiment with a variety of computational methods across a range of tasks. For each task, we will implement syntax-based methods based on linguistic analysis and previous work. We will experiment with traditional feature-based machine learning to allow for a better understanding of which features are helpful, and we will also implement these methods using modern deep learning methods, as they are the current state of the art and we aim to improve upon it. In the end, we will explore three different dimensions of the problem:

• Tasks (identification or interpretation, and which dataset)

• Method (dependency structures, VerbNet features, embeddings, additional training data)

• Algorithm (feature-based, deep learning)

This can be represented as a three-dimensional matrix, in which we intend to evaluate the performance of metaphor classification for each point in the matrix. Figure 1.1 demonstrates a simplification of this method. Each empty cube is a combination of a task, a method, and an algorithm. We intend to run experiments to fill the cube with our best knowledge of how well that combination works. In the end, we will have a fully complete representation, incorporating an understanding of which methods work best for which datasets. A full description of each of these tasks, methods, and algorithms will be provided in Chapter 6.
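The cube of experiments can be enumerated mechanically. In the sketch below, the dataset labels stand in for the task dimension and the method and algorithm labels paraphrase the bullets above; the exact labels are illustrative placeholders for the full definitions given in Chapter 6:

```python
# Enumerate the task x method x algorithm experiment grid.
from itertools import product

tasks = ["VUAMC", "MOH-X", "TroFi", "LCC"]
methods = ["dependency structures", "VerbNet features",
           "VerbNet embeddings", "additional training data"]
algorithms = ["feature-based (SVM)", "deep learning (bi-LSTM)"]

experiments = list(product(tasks, methods, algorithms))
print(len(experiments))  # 4 tasks x 4 methods x 2 algorithms = 32 cells
```

Each tuple in `experiments` corresponds to one cell of the cube, i.e. one experimental condition to be run and scored.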

FIGURE 1.1: Dimensions of analysis

The contribution of this work will primarily be in three areas. First, we explore the nature of the interaction between syntactic structure and metaphoric meaning. This will improve our understanding of how syntax can influence meaning, how metaphors function, and the distributional properties of metaphors with regard to the constructions they appear in. Second, we develop a state-of-the-art metaphor detection system that is functional across a variety of tasks, building on previous work in neural networks and expanding on it by incorporating syntactically derived methods. Finally, we perform a full analysis of our results, indicating what kinds of syntactic representations are appropriate for metaphor processing. This analysis also reveals the nature of metaphor in the corpora we examine, and we explore possible difficulties in both annotation and classification.

The rest of the document is organized as follows: Chapter 2 starts with an overview of the linguistic background pertaining to metaphor. Our focus will pertain to our research questions: what role have syntax and semantics played in linguistic theories of metaphor, and how can we best leverage these insights to improve computational performance? Chapter 3 will further explore computational methods: how have previous computational approaches handled metaphor processing, and what has been effective? We will attempt to identify how we can improve on previous work by using insights gained from our linguistic understanding. Chapter 4 will deal with the concrete computational resources required. This includes lexical resources that capture syntactic and semantic properties. Chapter 5 will examine corpora of metaphor annotations, which are necessary for supervised machine learning training and evaluation. Chapters 6-10 will explain our methods and results in detail, with separate chapters for each experimental method we are exploring.
This will begin with a brief overview of the algorithms, tasks, and methods we are exploring for our knowledge cube, and then go into detail by method. Chapter 7 will begin with dependency-based syntactic structures, Chapter 8 will explore VerbNet structures and vector representations based on VerbNet information, and Chapter 9 explores employing additional syntax-based data for distant supervision. Chapter 10 will finish by combining each of these methods. Each of these chapters will fully explore the computational intuitions and choices made to leverage these methods, and will quantitatively explore their results over lexical and neural baselines. Chapter 11 will then move to a qualitative analysis, bringing together what we've seen from implementing these computational approaches, what weaknesses these models still have, and what this might tell us about the linguistic nature of metaphors. Future work is addressed in Chapter 12.

Chapter 2

Linguistic Background

In order to understand the implications of our research questions, we first need to understand how metaphor is understood from a linguistic perspective. This will enable us to clarify the relationship between syntactic elements and metaphoric meaning. We can then use the linguistic analysis of metaphor as a stepping stone to identifying the kinds of computational structures that can encompass aspects of the theory that seem particularly promising. Then, in turn, we can use our computational models to shed light on the relationship between syntax and metaphor in natural language.

The first step of linguistic analysis is defining what exactly a metaphor is: this has proven a difficult problem for decades, and there have been many attempts in many fields, including linguistics, philosophy, cognitive science, psycholinguistics, and computer science. To what extent a definition of metaphor is a good one depends largely on the task one is hoping to accomplish. For instance, cognitive science researchers need a definition that is cognitively feasible and testable through the methods of the field. Philosophers may prefer definitions that only include novel metaphors over conventionalized language, and linguists likely need a definition that functions within the principles of formal semantics. These fields' definitions of metaphor are thus influenced not only by differences in background and perspective, but also by differences in goals.

This work will not attempt to solve the question of what exactly constitutes a metaphor, but will approach the problem from a practical perspective. In order to have a functioning natural language understanding system, we require knowledge of what words mean when used in any kind of context. This includes both their literal meanings and their metaphorical meanings.
From this perspective, we need to be able to generate correct semantic representations for lexical items and phrase structures for literal and figurative language. The end goal will be developing appropriate semantics based on context, and we will approach the theoretical backgrounds to metaphor with this goal in mind.

Interestingly, although much of the formal linguistic analysis of metaphor only came to prominence in the 20th century, the phenomenon has been studied as far back as Aristotle [Aristotle and Roberts, 1946]. Many philosophers and linguists have since relegated metaphor to the domain of the uncommon, the marked, and the difficult to process. John Middleton Murray [1931] writes,

"Discussions of metaphor - there are not many of them - often strike us at first as superficial."

Modern research seems to indicate opposition to both aspects of this claim: metaphors are both very frequent, and discussions of them particularly vital. In fact, metaphor research has branched into many domains, each contributing to our understanding. This work will focus on linguistic metaphor, that is, metaphor occurring in speech or text, as well as its interaction with computational systems, with less emphasis on psychological plausibility and cognitive models.

2.1 Some Basics

There are many key components to linguistic metaphor, and much debate on which components are actually primary, which are secondary or tertiary, and which play no significant role at all. A commonality of most metaphor theories is that one thing is seen in terms of another. Thus, we receive metaphorical readings for classic analogy statements such as:

3. My lawyer is a shark. [Glucksberg, 2001] (pg 10)

4. Sally is a block of ice. [Searle, 1993] (pg 83)

In (3), a person is seen as having attributes of a shark. In (4), Sally is seen as having attributes of a block of ice, with the physical coldness of the ice representing some emotional or social coldness in Sally.

The jargon used to talk about this kind of metaphor is diverse. We will adopt Lakoff and Johnson's [1980b] terminology due to its ubiquity in modern linguistics research, referring to the basic, more concrete domain as the "source" and the domain to which features are extended as the "target" (also possible are the pairs "tenor" and "vehicle" [Richards, 1936] or "frame" and "focus" [Black, 1962]). Thus, consider the following fairly conventional examples [Lakoff and Johnson, 1980b] (pg 23):

5. The crime rate is going up.

6. Inflation is rising.

The "source" of these examples is VERTICALITY, while the "target" is QUANTITY. The more concrete experiential domain of VERTICALITY supplies meaning to the more abstract QUANTITY, and thus concepts like increasing height are applied to increasing crime and inflation rates. Concepts in the source domain become templates for seeing concepts in the target domain in new ways.

While most researchers agree that metaphors are ways of framing one concept in the domain of another, there are many theories for how exactly we come to produce and understand metaphoric language. In the following sections, we will explore a variety of metaphor theories and assess their practicality from a computational perspective. First, we will examine Conceptual Metaphor Theory, which has been the driving force in metaphor research since the 1980s. We will then examine computationally appealing theories involving selectional preferences and lexical features, as well as theories of the interaction between metaphor and construction grammar. Finally, we will explore the differences between metaphoric and literal language, as well as difficulties in distinguishing metaphor from other types of figurative language.

2.2 Conceptual Metaphor Theory

The most influential theory of metaphor in recent history is the conceptual metaphor framework outlined first by Lakoff and Johnson in 1980, with influence from Reddy's analysis of the conduit metaphor [Lakoff and Johnson, 1980b; Reddy, 1979]. This theory claims that metaphors are not merely a linguistic anomaly, but a central component of our cognition. Source and target domains are mapped in our cognitive systems, allowing us to conceptualize the more abstract target domains by employing the concrete structure of the source domains. These mappings contain correspondences between aspects of the source and target domains. Relevant aspects of the source domain are mapped to the target, creating or expressing an understanding of the target domain that is structurally consistent with the source domain. To Lakoff and Johnson, this process is primarily cognitive, and while these cognitive mappings can be expressed linguistically, yielding linguistic metaphors, the language mappings are derived from the cognitive system and are not independently motivated.

"The metaphor is not just a matter of language, but of thought and reason. The language is secondary. The mapping is primary, in that it sanctions the use of source domain language and inference patterns for target domain concepts." [Lakoff, 1993] (pg 208)

Classic examples of mappings include ARGUMENT IS WAR1 from [Lakoff and Johnson, 1980b] (pg 4):

7. Your claims are indefensible.

8. He attacked every weak point in my argument.

9. I demolished his argument.

1We will follow Lakoff and Johnson in using small caps to denote metaphoric mappings. All examples of metaphoric mappings come from the Master Metaphor List [Lakoff, 1994] unless otherwise noted.

These metaphors use language from the source domain WAR to describe the target domain ARGUMENT. From these examples we can notice some difficulties with this theory.

First, Example 9, supposedly an instantiation of the conceptual metaphor ARGUMENT IS WAR, could also be interpreted through a different metaphor, BELIEFS ARE STRUCTURES, which involves the argument being a belief that is destroyed. In many cases this gives rise to metaphor "duals", in which a certain target domain is understood through separate but related source domains. This is easily observable in time metaphors, in which time can be viewed either as an object moving towards an observer or as an observer moving past a fixed location [Lakoff, 1993]:

10. We’re getting close to Christmas (TIME PASSING IS A FIXED LOCATION)

11. Thanksgiving is coming up on us (TIME PASSING IS MOTION OF AN OBJECT)

While the theory of duality in metaphors generally refers to cases where objects and locations alternately function as source domains for specific targets, we can see more generally that there are many possible interpretations for many metaphoric utterances, as they can evoke different conceptual metaphors.

This fact highlights a difficulty in building frameworks for metaphoric usage and understanding, as individual comprehension of metaphors can vary somewhat arbitrarily. Reading Example 9 above, one individual could view the argument through the structure of war, while another could view the argument as a building being destroyed. This may often lead to the same inferences: the argument is a structure, it is no longer functioning, and so on. However, there may also be subtle differences, and because the metaphoric structure is conceptual, the precise understanding of the mapping is left to each participant to determine.

This difficulty is interesting from linguistic and cognitive perspectives, but is somewhat less troubling computationally, as it is inherent in the computational task. We will need to determine possible conceptual structures for ambiguous sentences, but this process is necessary to determine any metaphoric meaning. In identifying and interpreting metaphors we need to build understanding of both literal and metaphoric meanings. So while our task is made somewhat more difficult by the possibility of multiple metaphoric meanings, we only require that the system make multiple predictions for conceptual metaphor mappings and then determine which one is most likely, a standard capability of most machine learning methods.
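This multiple-prediction strategy can be illustrated with a few lines of Python. The candidate mappings and their scores are invented for illustration; in a real system the scores would come from a learned model rather than being set by hand:

```python
import math

def rank_mappings(scores):
    """Convert raw scores for candidate conceptual mappings into
    probabilities via a softmax, then rank candidates best-first."""
    exps = {m: math.exp(s) for m, s in scores.items()}
    total = sum(exps.values())
    probs = {m: e / total for m, e in exps.items()}
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical scores a classifier might assign for Example 9
# ("I demolished his argument."); the numbers are illustrative only.
candidates = {
    "ARGUMENT IS WAR": 2.1,
    "BELIEFS ARE STRUCTURES": 1.8,
    "LITERAL": -1.5,
}
ranked = rank_mappings(candidates)  # best-first list of (mapping, prob)
```

The system retains the full ranked list, so a downstream interpretation component can consider the second-best mapping when the first proves incoherent.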

2.2.1 Invariance Principle

It is also important to note that not all source and target domains are possible for any given utterance; otherwise the explosion of possible mappings available would make understanding of metaphoric language likely intractable. A critical factor that narrows down possible readings is what is called the invariance principle. According to conceptual metaphor theory, source domains map to parts of the target domain, "insides to insides, outsides to outsides, etc" [Lakoff and Turner, 1989]. The "cognitive topology" of the source domain must cohere with that of the target. This means that the various components of each domain align: boundaries, insides and outsides, thematic roles, and so on must be coherent between source and target domains. This also includes causal structure, aspectual structure, and persistence of entities. These are all part of what is called "General Structure". If critical structural components of the domains don't match, then the metaphor will fail to capture any relevant features of the mapping. Black also remarks on this restriction that metaphors must have the same structure in source and target domains [Black, 1993], and in this way conceptual metaphor theory's mapping relations also appear similar to the structure-mapping theory of Gentner [1983].

Lakoff [1993] notes that it is necessary to understand conceptual metaphor mappings not as functions from source domain structures to target domain structures. Rather, structures hold between domains only in cases where the invariance principle holds, so we will not develop conceptual mappings for theoretical structures that violate the invariance principle. These structural constraints on conceptual metaphor are necessary for computational models.
We require knowledge of which elements from source and target domains can be mapped, and requiring parallels in conceptual structure should place practical bounds on the conceptual structures that need to be interpreted. Only domains with compatible structures (in terms of topology, event structure, and so forth) may be mapped to one another.
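As a rough illustration, the invariance principle can be cast as a structural compatibility check between domain schemas. The schemas below are hand-simplified, hypothetical stand-ins, not an actual formalization of cognitive topology:

```python
def invariance_compatible(source_schema, target_schema):
    """Check that every structural slot of the source domain
    (paths, participants, boundaries, ...) has a counterpart in
    the target domain, as the invariance principle requires."""
    return set(source_schema) <= set(target_schema)

# Invented, hand-simplified schemas for illustration only.
journey = {"path", "traveler", "destination", "obstacles"}
life = {"path", "traveler", "destination", "obstacles", "span"}
theft = {"agent", "victim", "stolen_object"}

invariance_compatible(journey, life)   # True: structures cohere
invariance_compatible(journey, theft)  # False: no coherent mapping
```

A mapping candidate that fails this check would be filtered out before any interpretation is attempted, keeping the search over possible mappings tractable.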

2.2.2 Hierarchical Structure

Another crucial aspect of conceptual metaphor theory is the hierarchical structuring of both domains and metaphoric mappings. For example, the concept MONEY inherits from the more abstract LIMITED RESOURCE. We can also observe that the metaphoric mapping TIME IS MONEY inherits from the more abstract TIME IS A LIMITED RESOURCE [Lakoff and Johnson, 1980a].2 The hierarchical structuring of domains and metaphor mappings correctly predicts the flexibility of novel linguistic metaphors. All levels of specificity of metaphors are observed, provided the structural consistencies noted above are adhered to. Consider the following examples:

12. She spent all her time waiting around

13. In extremely hot conditions this meant that they lost time.

The phrase in 12 evokes the domains of TIME via "time" and MONEY via "spent", and thus participates in the metaphor TIME IS MONEY. From this we understand that some properties of MONEY can be applied to the concept of TIME, in this case, significant value.

For the similar utterance in 13, there is no lexical trigger for the MONEY domain, but we may fall back on the more abstract conceptual mapping TIME IS A LIMITED RESOURCE. This yields a slightly less evocative metaphoric reading: we do not know that the speaker intends to convey a meaning of value, as we do with the source MONEY, but we do know that time is something that people can use (and lose), as with any limited resource.

2This domain is abstracted further to TIME IS A RESOURCE by the updated Master Metaphor List [Lakoff, 1994], although I find the examples using TIME AS A LIMITED RESOURCE to be clearer.

The hierarchical structure of conceptual metaphor mappings allows us to make many helpful generalizations. Lexical items can participate in higher-level mappings, and these structures are inherited by more specific instantiations. Consider the word "crossroads", which can be employed in multiple metaphoric mappings (from [Lakoff, 1993] (pg 206)):

14. I’m at a crossroads on this project. (activity)

15. I’m at a crossroads in life. (life)

16. We’re at a crossroads in our relationship. (love)

17. I’m at a crossroads in my career. (career)

Four different domains are evoked through these examples, but the lexical trigger "crossroads" functions equivalently in each instance. From these we can say that "crossroads" participates in a relatively generic metaphor (LONG TERM PERSONAL ACTIVITIES ARE JOURNEYS) which is instantiated by more specific metaphors (ACTIVITIES, LIFE, LOVE, CAREERS ARE JOURNEYS). Thus "crossroads" is not seen as having multiple different meanings or senses that account for each metaphor type; rather, the specific metaphors inherit from the generic metaphor, and understanding of the word "crossroads" comes from its participation in the generic metaphor. This relieves us of the need to define an unwieldy number of word senses for the possible metaphoric interpretations of lexical items. Instead, lexical items participate at different levels of the conceptual mapping hierarchy, and their mappings are inherited by more specific ones.
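One way to picture this inheritance is as a fallback chain over a mapping hierarchy. The following sketch uses a hypothetical, hand-built fragment of such a hierarchy; a real system would derive it from a resource like the Master Metaphor List:

```python
# Hypothetical fragment of a mapping hierarchy: each specific
# mapping points to the more generic mapping it inherits from.
PARENT = {
    "LIFE IS A JOURNEY": "LONG TERM PERSONAL ACTIVITIES ARE JOURNEYS",
    "LOVE IS A JOURNEY": "LONG TERM PERSONAL ACTIVITIES ARE JOURNEYS",
    "CAREERS ARE JOURNEYS": "LONG TERM PERSONAL ACTIVITIES ARE JOURNEYS",
    "TIME IS MONEY": "TIME IS A LIMITED RESOURCE",
}

def generalize(mapping):
    """Walk up the inheritance chain, yielding progressively more
    abstract mappings to fall back on when a specific reading fails."""
    chain = [mapping]
    while mapping in PARENT:
        mapping = PARENT[mapping]
        chain.append(mapping)
    return chain

generalize("LOVE IS A JOURNEY")
# ["LOVE IS A JOURNEY", "LONG TERM PERSONAL ACTIVITIES ARE JOURNEYS"]
```

A lexical trigger like "crossroads" can then be associated once with the generic mapping, rather than separately with every specific instantiation.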

2.2.3 Word Senses

The mappings "crossroads" evokes highlight a key component of understanding metaphoric language: we can view metaphors as different word senses, each instantiating a different conceptual metaphor in different contexts. This notion is appealing in that it limits metaphoric uses to a set of word senses that can be disambiguated through standard word-sense disambiguation, but it is ultimately problematic. To say that metaphoric use amounts to multiple word senses ignores the commonalities among various metaphors. Consider the following expressions, which evoke the metaphor EMOTIONS ARE HEAT3:

18. The artist’s intention is to deliver a sense of vitality behind these grieved and cold women.

19. I simply do not want somebody who has a hot temper to be in charge of our military.

20. Many were reported to predict a lukewarm response.

21. The milder manic states and their fiery energies would seem at first thought to be more obviously linked.

These examples display the wide range of ways we can employ temperature words ("cold", "hot", "lukewarm", and "fiery") to describe emotional states, all employing the EMOTIONS ARE HEAT metaphoric mapping. We would be missing something crucial if we claimed that these are simply separate word senses for each temperature word, eschewing the evidence that a common metaphor is employed across all of them.

A practical solution to this problem is to posit that similar words are affected by regular, systematic polysemy, with literal word senses mapping to metaphoric senses. Thus synonyms like "fall", "plummet", and "dive" will all have literal senses of motion and regular metaphoric senses that interpret the motion as movement on a non-literal scale, generated through a MORE IS UP/LESS IS DOWN mapping. This yields appropriate metaphoric meanings for verbs of spatial location employed to describe abstract quantities like price:

3A generalization of INTENSE EMOTIONS ARE HEAT from [Lakoff, 1994]

22. However, world commodity prices hit a slump in 1990 and cocoa prices plummeted.

23. When oil and gas prices dove in the late 80s and into the 90s ...

24. When that bubble broke and housing prices fell from their stratospheric levels, more and more homeowners found themselves "underwater."

With regard to temperature metaphors, words like "cool", "hot", and "tepid" will all have systematic polysemy, with temperature-based senses mapping to emotion-based senses such as 18-21. This process would involve deterministic mappings from literal to metaphoric senses, so that literal synonyms will have the same metaphoric sense. These are straightforward cases of single mappings allowing regular polysemy, but we can also employ the hierarchy of conceptual metaphors to account for more complex examples. We can use specific conceptual metaphors to represent specific instances while employing more generic metaphors to capture the commonalities. This accounts for the "crossroads" examples (14-17): each specific metaphor involved evokes a different sense of "crossroads", but they are all generated based on the generic LONG TERM PERSONAL ACTIVITIES ARE JOURNEYS mapping.

2.2.4 Metaphor as Purely Cognitive

One key component of conceptual metaphor theory that needs to be accounted for is that it considers metaphor to be primarily a matter of cognition. We have conceptual mappings between source and target domains as part of our cognitive processes, and these can be expressed through language as well as gesture and other non-linguistic forms.

If we consider metaphors to be primarily cognitive, the task of finding them automatically becomes much more difficult, as we would need access to cognitive mechanisms to accurately process how metaphors are functioning. Conceptual metaphor theory is somewhat weakened for our purposes due to its reliance on cognitive structure rather than observed linguistic phenomena. If metaphor mappings reside beyond the scope of language, we would expect them not to be affected or influenced by syntactic structures: the speaker should be able to produce whatever syntactic structure they like, as long as the conceptual structure of the metaphor remains consistent. We can still use linguistic utterances as evidence for particular conceptual mappings, but under conceptual metaphor theory it is unlikely that features "below" the level of semantics (such as syntax, phonology, and phonetics) will be predictive or informative of what kinds of metaphors are being employed.

However, we will see evidence in Section 2.5 that metaphoric mappings are often subject to rules based on syntactic structures. While these rules are not always purely deterministic, they show that even if metaphoric mappings are primarily a matter of cognition, they are helpfully constrained in their linguistic use, and we should be able to leverage linguistic features to identify metaphoric language and mappings. We will see that syntax can play a part in metaphor understanding, although we will not explore any levels below it: we assume that phonology and phonetics are safely divorced from meaning, and that the relationship between the signifier and the signified with regard to metaphor is arbitrary [Saussure, 1916].

2.2.5 Analysis

With regard to our research questions, we need to understand the impact of conceptual metaphor theory as it pertains to automatic metaphor identification and interpretation. For identification, there are certainly aspects of conceptual metaphor theory that may be useful. If we can identify lexical items that belong to domains known to participate in certain source-target mappings, we have some indication that the utterance may be metaphoric. However, this is only a general heuristic, and requires more knowledge of context: lexical items of all kinds, regardless of their domain, can participate in literal usage, so we will need a further step to determine in which capacity they are actually being used. The theory's focus on cognitive representations offers little insight into the structure of particular linguistic metaphors that might help distinguish figurative phrases from literal ones. Its use for identification is limited to lexical knowledge, and is thus better implemented via selectional preference and lexical feature-based theories (Section 2.3).

Conceptual metaphor theory's identification applications are focused on lexical semantics, avoiding the necessity of syntactic representations. While the theory doesn't afford us anything particularly useful in terms of novel methodology, it highlights a gap we are aiming to fill: CMT focuses on lexical semantics, but we believe the introduction of novel syntactic representations can alleviate some problems in identifying when lexical items are used metaphorically and when they are not.

CMT is much more promising for metaphor interpretation, and has been used in many modern metaphor interpretation systems. Given knowledge of source and target domains, we can map structures from one to the other that are salient and coherent with the structure of each. Elements of source and target domains must be structurally coherent, helpfully limiting the scope of the possible mappings between them. The hierarchical organization of domains and mappings allows for falling back on more abstract interpretations if specific readings are incoherent.

To use CMT, it is necessary to define what exactly constitutes a domain, and any application implementing conceptual metaphor will be forced to define domains in a practical way. One approach to formalizing domains is the hand-crafted resource FrameNet, which models domains via frame semantics. It also provides the basis for a metaphoric knowledge base, MetaNet [Dodge et al., 2015], and the applications of this kind of knowledge-based system will be examined in Section 3.2.3. Identifying which elements evoke source and target domains is also critical, and we will see in Section 2.5 that construction grammar has been proposed as a possible tool for this task.

All of these interpretation-based benefits require methods for identification: we may be able to do inference based on source and target domains, but we first need to understand which items are actually being used metaphorically, and which reflect the respective source and target domains. From CMT we can derive a new task that leads from identification to interpretation: rather than picking out which words in the sentence are used metaphorically in a binary fashion, we can instead find which words in the sentence evoke which domains, and whether these are source or target domains. This task will be fully fleshed out in Section 6.1.4.

A more direct theory for metaphor identification, one that has been extremely influential both linguistically and computationally, is the notion of selectional preference.

2.3 Selectional Preferences and Lexical Features

CMT and its derivatives have driven many other areas of metaphor research, focusing on linguistic, cognitive, and psychological aspects. While the notion of source and target domains is appealing for developing models of metaphor interpretation, there are other theories from linguistics and computer science that are particularly relevant to the identification of metaphoric items. Two key components, selectional preferences and lexical features, support the majority of supervised machine learning algorithms employed in the 21st century for metaphor identification.

2.3.1 Selectional Preferences

Perhaps the first computationally driven approach to metaphor comes from Wilks [Wilks, 1975; Wilks, 1978]. Wilks shows that metaphors, as well as other kinds of ambiguity such as anaphora, can be resolved by attending to violations of selectional preferences. He stresses that the cases in which preferences are violated are not necessarily incorrect, but rather are indicators of non-literal language:

"The point is to prefer the normal, but to accept the unusual." [Wilks, 1975] (pg 56)

Taking some of his examples:

25. John left the window and drank the wine on the table. It was good. [Wilks, 1975] (pg 53)

26. The car drank gasoline. [Fass and Wilks, 1983] (pg 183)

We can interpret the "It" in the first example as referring to the wine, as the selectional preferences of the verb "drink" require something that can be "good", thus resolving the anaphoric ambiguity between the window and the wine via selectional preference. In the second example, we can identify that a metaphor is being used, as the verb "drink" prefers an animate, agentive subject, and here receives an inanimate one. He clarifies how we disambiguate the verb "lie" in the example "Pieces of paper lie about the floor.":

"Thus ’Pieces of paper lie about the floor’, is understood as being about posi- tion rather than deception because from the preference information in the sys- tem about the concept ’lying’ we will know that deceptive lying is a concept that prefers an animate agent if it can get it (here it cannot), while a statement about passive position prefers a physical object as the apparent agent, which is available here." [Wilks, 1975] (pg 56)

Wilks’ theory treats metaphors as something akin to separate word senses, which can be disambiguated using selectional preferences. Thus, the word "grasp" has a physical sense (as in Example 27 below), which requires animate and physical subjects and objects, as well as a "think" sense (as in Example 28), which has an animate subject and a concept as an object.

27. His hand grasped the handle firmly and brought it to his side.

28. Stevenson just grasped the idea better than the director.

Each verb is represented by a template, which includes a predicate and the kinds of semantic arguments it typically takes. These representations look very similar to modern lexical semantic resources such as PropBank [Palmer et al., 2005], VerbNet [Kipper-Schuler, 2005], and FrameNet [Baker et al., 1998], with roles for each word being filled by possible arguments. While there are both theoretical advantages and disadvantages to treating metaphoric words as separate word senses, it should be noted that for the purposes of computational metaphor identification, this treatment is relatively powerful and easy to implement. The difficulty lies in extending the theory to interpretation: knowing which words are used in a metaphoric sense provides minimal information to help determine the correct meaning. Additionally, while in many cases violation of the selectional preferences of verbs or other lexical items can directly indicate metaphoricity, there are also many cases where metaphors are present despite the lack of any preference violation:

29. And when they played each other, Curry killed him.

30. Muslims have attacked him saying "he isn’t a member of the royal family"

31. Also thanks for standing up to the alphabet gang when you fought them in court on our behalf.

In each of these examples, the verbs of attack have arguments that align with their regular selectional preferences. In each case a metaphor of aggression is evoked, but the verbs' arguments are animate and human, which should not raise any violation of preferences. We would thus incorrectly prefer readings where physical attacks occurred, and we require some other information to detect the metaphor. In 29, it is only through world knowledge (namely that "Curry" is here a basketball player, and the killing event refers to a game being played) that we interpret the verb as evoking the COMPETITION IS 1-ON-1 PHYSICAL AGGRESSION metaphoric mapping. In 30, we have direct linguistic evidence: the immediately following quotative clause indicates that the sentence is metaphoric, likely employing the THEORIES ARE DEFENSIBLE POSITIONS mapping and referring to a person's theories metonymically through "him". In 31, the verb "fight" takes normal animate arguments, and there is even a possible literal interpretation where a physical fight happens in a courtroom, but world knowledge leads us to prefer a metaphoric mapping such as THEORETICAL DEBATE IS A BATTLE. These examples show that not all metaphors can be determined by selectional preferences; while many are signaled by local contextual clues, others require more complete world knowledge. This is one cause of the apparent difficulty of computational metaphor understanding.

Despite these difficulties, many computational systems successfully opt for this approach, with varying models being used to make the selectional preference information explicit. Note that to implement these notions of selectional preference, we effectively need to do two things. First, we need to know how the verb relates to its arguments: we need to identify the subject and object, but perhaps more importantly we need to understand the semantic roles available. For the classic "My car drinks gasoline" example, the relevant selectional preferences are on the agent and patient of the "drinking" event. Identifying these is critical, and we believe employing better syntactic representations can in turn improve our representations of these semantic roles. Second, we need to be able to define the selectional preferences of these arguments in an explicit, tractable way. This is typically done by combining this theory of selectional preference violation with the notion of lexical features: words as bundles of computationally manipulable features that can be employed for supervised machine learning.
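These two steps can be sketched with a toy implementation of Wilks-style preference violation. The preference strengths and semantic classes below are invented for illustration; in practice they would be estimated from parsed corpora or drawn from lexical resources:

```python
# Toy selectional preferences: how strongly each verb prefers
# semantic classes for its agent slot. The counts are invented;
# real systems estimate them from parsed corpora.
AGENT_PREFS = {
    "drink": {"animate": 0.95, "machine": 0.05},
}
SEMANTIC_CLASS = {"John": "animate", "car": "machine"}

def preference_strength(verb, agent):
    """How strongly the verb prefers this agent's semantic class."""
    cls = SEMANTIC_CLASS[agent]
    return AGENT_PREFS[verb].get(cls, 0.0)

def flags_metaphor(verb, agent, threshold=0.5):
    """Wilks-style check: a weakly preferred argument signals possible
    non-literal use ('prefer the normal, but accept the unusual')."""
    return preference_strength(verb, agent) < threshold

flags_metaphor("drink", "John")  # False: literal reading preferred
flags_metaphor("drink", "car")   # True: preference violated
```

Note that this sketch would, as discussed above, fail on examples 29-31, where no preference is violated; the violation check is a cue, not a complete detector.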

2.3.2 Lexical Features

In order to best make use of Wilks' theory of selectional preferences, we need computationally feasible representations of lexical items and the kinds of things they select for. We can achieve this by treating words as bundles of lexical features, a theory originally proposed for metaphoric understanding by Tversky et al. [1977] and improved by Ortony et al. [1979] (among others). In this model, analogies of the form a is like b are analyzed by determining the salient features of both a and b, assessing which features are relevant in the context, and comparing them. Ortony posits that the metaphoricity of comparison statements can be determined by the difference between high-salience and low-salience attributes. Consider the following from Ortony [1979] (pg 13):

32. Billboards are like placards.

33. Billboards are like warts.

34. Billboards are like spoons.

In 32, the comparison is between highly salient features, so the sentence is literal. In 33, the comparison involves a highly salient feature of "warts" (ugliness) which is not highly salient for "billboards", yielding a metaphoric interpretation. In 34, no highly salient features are shared, so the statement is hard to interpret.4

It is important here to account for the asymmetry of metaphors and analogies. If a bidirectional comparison is made between the features of source and target, one would expect symmetry. However, many metaphors do not function at all when source and target are reversed: "My lawyer is a shark." seems perfectly interpretable, while "My shark is a lawyer.", even in a context where the speaker owns a shark, seems difficult if not impossible to interpret metaphorically. Other metaphors are interpretable when reversed, but with entirely different meanings: "My surgeon is a butcher." and "My butcher is a surgeon." are both viable, but with very different meanings [Glucksberg, 2001] (pg 33). The solution proposed by Ortony is that the direction of comparison is critical, and thus comparing a to b yields different salient features than comparing b to a. This also functions computationally: we expect to learn the preferences for "butcher" separately from those of "surgeon", so we would not expect symmetry.

A critical problem with the feature-comparison theory is that it offers no account of metaphors in which there are no similar features between the source and target, or at least none that make the metaphor apt. Consider our earlier example:

4Ortony notes that contexts can almost always be generated to make these kinds of examples reasonable. This is a common issue in metaphor understanding: almost any well-formed sentence can be coerced into yielding a viable semantic interpretation. Ortony clarifies: "(The) point does not depend on its being impossible to conjure up a suitable context–it almost never is impossible. It depends merely on the fact that it is much more difficult to produce such a context for anomalous cases than it is for meaningful ones" (pg 14)

35. Sally is a block of ice.

Here we have a target, Sally, being compared to the source domain of ice. Sally (presumably a person) certainly shares many features with a block of ice: both are physical objects, and both have a temperature, a shape, and so on. However, the metaphor involves none of the features they share.5 We need the temperature (or coldness) of the block of ice to map to the unresponsive emotional state of Sally, and without some other inference to tell us that English speakers consider emotions to be correlated with temperatures, there are no valid features to make the comparison work. This is where conceptual metaphor theory shines: we have a cognitive mapping (EMOTIONS ARE TEMPERATURES) which explicitly supplies this inference. Conceptual metaphor theory also correctly predicts that further extensions to this metaphoric mapping are possible. Expressions where Sally is warmed or melted are coherent with the EMOTIONS ARE TEMPERATURES mapping, and thus similar novel expressions involving a change of temperature can be metaphorically understood.

5The cold-emotion reading of this sentence is a common example for showing how features are inadequate. Note there is a reading in which the feature of being 'cold' is the prominent one mapped to Sally from the block of ice: were Sally standing outside in the snow for an hour, we could reasonably utter (35) and expect the 'cold' feature to be salient.

Despite this downside, feature-based metaphor theories are very appealing computationally, and various iterations are used in many identification and interpretation systems. Lexical semantic resources including VerbNet and FrameNet [Baker et al., 1998] include a variety of lexical features which can be used to implement Wilks' theory of selectional preference violation. There are many examples of treating words as bundles of semantic features for computational metaphor tasks, from using dictionaries of semantic properties to learning latent semantic representations as vectors [Gargett and Barnden, 2015; Beigman Klebanov et al., 2015; Dunn, 2014; Shutova et al., 2012]. While these have proven effective to a degree, we believe they often ignore the fact that it is the connections between these lexical bundles that evoke the metaphoric mappings. We need a better understanding of the interactions between these lexical feature bundles to determine which metaphoric mappings are used, and this is best accomplished by understanding the syntactic and semantic relationships between them.

In addition to hand-crafted lexical resources like those used above, latent semantic spaces seem to be a promising area for implementing lexical feature models. Word embeddings have become increasingly popular, particularly since [Mikolov et al., 2013], as they can quickly and accurately represent a word's semantic features as vectors, which can be easily manipulated. Presumably various parts of a word's vector represent various semantic features, many of which may be engendered by the lexeme's use in various metaphors. We should be able to highlight sections of embeddings for similar words that show their use in various conceptual metaphors. Recent work in metaphor identification includes many approaches that use word embeddings; a full exploration follows in Section 3.4.
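As a minimal illustration of how vector representations support such feature comparisons, the sketch below computes cosine similarity over invented three-dimensional "embeddings"; real embeddings (e.g. word2vec vectors) are learned from corpora and have hundreds of dimensions:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Invented 3-dimensional vectors for illustration only; imagine the
# dimensions loosely tracking temperature, physical-object, and
# emotion-related usage.
vec = {
    "cold":     [0.9, 0.1, 0.8],
    "lukewarm": [0.7, 0.2, 0.6],
    "table":    [0.0, 0.9, 0.1],
}

cosine(vec["cold"], vec["lukewarm"]) > cosine(vec["cold"], vec["table"])  # True
```

Words participating in the same metaphoric mapping (here the temperature terms of EMOTIONS ARE HEAT) should cluster in such a space, which is what makes embeddings attractive as proxies for lexical feature bundles.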

2.3.3 Analysis

Selectional preference violations seem perfectly suited to metaphor identification. Their implementation is straightforward and requires only knowledge of a lexeme's selectional preferences and the semantic categories of its arguments. These can often be derived from corpora with little overhead (via word embeddings or simpler corpus patterns). There are also a variety of hand-crafted lexical resources that can indicate semantic features of lexemes, which can be used as a proxy for their selectional preferences. We can also understand selectional preferences as applying to particular constructions. We will see in Section 2.5 how certain constructions influence metaphoric properties; one way this can occur is when certain verbs select certain types of arguments when used in particular constructions. Knowledge of these verbs, the constructions they occur in, and the kinds of selectional preferences that are applicable for each will allow us to best identify metaphoric utterances. Although selectional preference models show great promise for metaphor identification, their application to interpretation is less promising. Metaphors can be identified based on violations of preferences, but violations are merely a marker for determining non-literal use: they give us no further insight into what is meant by the intended metaphor. While lexical features may not be strong enough to stand alone for either identification or interpretation, they show clear value when paired with selectional preferences (for identification) or conceptual metaphor theory (for interpretation). For identification, lexical features can function as a proxy for selectional preferences.
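The identification scheme just described can be sketched very directly. In the toy example below, both the preference inventory and the semantic-type lexicon are hand-listed stand-ins; a real system would derive them from corpora or from resources such as VerbNet and WordNet.

```python
# Toy inventory: verb -> semantic types its direct object literally selects for.
PREFERENCES = {
    "drink": {"liquid"},
    "devour": {"food"},
}

# Toy semantic-type lexicon for candidate arguments.
SEMANTIC_TYPE = {
    "water": "liquid", "coffee": "liquid",
    "sandwich": "food", "book": "abstract", "profits": "abstract",
}

def violates_preference(verb, obj):
    """True if the verb's object falls outside its literal selectional
    preferences -- a crude signal of potential metaphoric use."""
    prefs = PREFERENCES.get(verb)
    obj_type = SEMANTIC_TYPE.get(obj)
    if prefs is None or obj_type is None:
        return False  # no information: default to literal
    return obj_type not in prefs

# "devour a sandwich" passes; "devour a book" is flagged as non-literal.
```

As the analysis below notes, a flag from this detector marks non-literal use but says nothing about the intended meaning; interpretation requires further machinery.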
Computational research shows that a mismatch in certain lexical features (particularly imageability and concreteness) is highly predictive of metaphoricity, and many modern systems rely on lexical features coupled with machine learning or heuristic-based models. As we get better representations of lexical features, possibly by employing word embeddings, metaphor identification will subsequently improve. Lexical features can also be leveraged to identify source and target domains, which can augment interpretation systems based on conceptual metaphor theory. Correct domain identification is critical, and lexical features may also help indicate which aspects may be mapped between domains. Topic modeling and word embedding approaches can approximate the semantics and domain of usage for particular lexical items, and these can be implemented as features for automatic domain detection. This is accomplished by converting topics and word vectors to features for machine learning algorithms, employing either traditional statistical machine learning or deep learning. Lexical features are a valuable tool in this regard, especially considering the consistent improvements to word embeddings that may capture more subtle lexical features. Understanding how thematic roles interact with the verb is critical to generating the correct semantic representations; this requires both syntactic knowledge of argument structures as well as semantic knowledge of thematic roles. Selectional preference theory shows us that how verbs interact with their arguments often determines their metaphoricity, and combining knowledge of selectional preferences with lexical features that can represent these lexical items' semantics should prove valuable in many metaphor processing tasks.
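The concreteness-mismatch signal mentioned above can be made concrete with a small sketch. The scores here are invented for illustration; real systems draw them from human rating norms or estimate them from embeddings, and feed the resulting gap into a classifier rather than using it directly.

```python
# Hypothetical concreteness scores on a 1 (abstract) to 5 (concrete) scale.
CONCRETENESS = {
    "grasp": 4.0, "idea": 1.5, "rope": 4.8,
    "attack": 3.5, "argument": 1.8,
}

def mismatch(verb, obj):
    """Absolute concreteness gap between a verb and its argument. A large
    gap (concrete verb, abstract argument) is a common metaphor cue."""
    return abs(CONCRETENESS[verb] - CONCRETENESS[obj])

# "grasp the idea" (metaphoric) shows a much larger gap than
# "grasp the rope" (literal), which is the signal a classifier would use.
gap_idea = mismatch("grasp", "idea")
gap_rope = mismatch("grasp", "rope")
```

In a full system this single feature would be one column in a feature matrix alongside embedding-derived and syntactic features.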
We will fully explore how combining lexical features and selectional preferences can be implemented computationally for identification and interpretation tasks in Section 3.3, and we will use the fundamentals of both theories as our baseline for machine learning tasks: the semantics of a verb and its arguments is easy to encode through these formulations. We will then improve upon these baselines by improving our understanding of the relationship between each word and its possible arguments.

2.4 Alternatives

Conceptual metaphor theory has provided the framework for a massive amount of research into the linguistic and cognitive properties of metaphor understanding, and the selectional preference and lexical feature-based theories provide promising theoretical bases for computational interactions with metaphoric language. Naturally, there are many other theories and frameworks available for metaphor analysis. Their differences lie primarily in the type of research intended: some frameworks are ideal for linguistic analysis, while others focus on psychological plausibility. We will here briefly cover some alternative approaches to metaphor and attempt to highlight their use cases in a computational environment.

2.4.1 Blending Theory

Conceptual metaphor theory revolves around the combination and interaction of two domains, a source and a target. Since its inception in the 1980s, there have been many variants on this theme that involve comparing multiple domains. One prevalent framework is the notion of conceptual blends, which makes use of more domains than just the source and target to create meaning [Fauconnier and Turner, 1996; Fauconnier and Turner, 1998]. Blending theory functions like conceptual metaphor theory, but with up to four spaces instead of two: the source and target domains, a blended space that contains the similarities between the source and target, and the emergent space of new information that comes from the mapping. This theory is particularly interesting from a cognitive perspective, with the creation of domains accounting for the emergence of new understanding due to metaphoric usage. Creating and making use of these conceptual blends is an online process, allowing for the creation of novel metaphors as well as conventionalized meanings, while conceptual metaphor relies on known conceptual relationships. Blending theory offers a host of new options for language interpretation with its inclusion of up to four spaces, which appear more abstract and perhaps more flexible than the domains described by conceptual metaphor theory. It also accounts for both novel and conventional metaphors, and gives a practical account of creativity in metaphor production and comprehension. However, these theoretical advantages may induce extra difficulty computationally. Implementing blending theory requires knowledge of more spaces and more interactions between them. While a perfect implementation may provide advantages in metaphor understanding, from a practical standpoint it is perhaps more tractable to build initially off of conceptual metaphor theory.
We can begin with a two-domain implementation of CMT; if it is effective, we should be able to modify and/or expand the implementation to account for the extra theoretical structures (blended spaces, etc.) afforded by blending theory. In this way, blending theory may be relevant as an extension on top of computational implementations of conceptual metaphor theory.

2.4.2 Class Inclusion

A challenge to conceptual metaphor theory found to be psychologically and linguistically plausible is class inclusion theory [Glucksberg, 2001; Glucksberg and Keysar, 1990; Glucksberg and Keysar, 1993]. Consider the following equative metaphor statements:

36. The surgeon is a butcher. [Glucksberg, 2001] (pg 10)

37. My job is a jail. [Glucksberg, 2001] (pg 10)

38. Man is a wolf. [Black, 1981] (pg 73)

In class inclusion theory, the statements above are not seen as comparisons involving two domains, but rather as class inclusion statements. When a person says "my job is a jail", they are expanding the class of things that fall into the category "jail" to include "my job". This creates a more abstract meaning for the phrase "jail", as it now includes some components of a job. "Man is a wolf" includes men in the category of wolves, and so on. The primary question regarding class inclusion theory for our purposes is whether it provides any computational leverage that isn't covered by other theories. One initial weakness is that many of the examples provided are limited to the very simple metaphoric expression "X is a Y", which is anything but ubiquitous in natural language. There are innumerable other ways of expressing metaphors which can't be neatly mapped to class inclusion statements. So while this theory may provide some leverage for this syntactic class of metaphor, it may not be extendable to general metaphoric expressions. In a way, this supports our belief that metaphors are driven by syntax, as this particular syntactic construction can be explained through a certain metaphoric theory. However, it seems unlikely that it is practical to implement disparate metaphoric theories computationally to handle metaphors that appear relatively similar. While class inclusion theory provides a different viewpoint on these metaphoric expressions, supported by a wealth of psycholinguistic evidence (see Glucksberg [2001]), it seems the computational requirements are similar to those of conceptual metaphor theory. Knowledge of classes and their properties will be required to determine what is meant by including entities in other classes. A class hierarchy is also required to make inferences, and these requirements parallel the knowledge of source and target domains and their hierarchy.
It seems that in this regard the two frameworks are fairly similar, if not interoperable. We could choose either as a basis for knowledge representation, and the ubiquity of conceptual metaphor theory data makes it more appealing for machine learning and knowledge representation approaches.

2.5 Frames, Metaphors, and Constructions

While there has been a significant amount of research into the cognitive and psycholinguistic properties of metaphor spurred on by the work in conceptual metaphor theory, there has been relatively little research into the relations between the syntactic and semantic properties of metaphor. Construction grammar has been posited as a means of connecting metaphoric meaning and the relevant syntactic and semantic components available. While not proposed as a theory of metaphor, the construction grammar framework has been promisingly extended to support metaphoric language. Constructions, which consist of form-meaning pairs, can include lexical constructions, which pair lexical semantics with syntactic category, and syntactic constructions, which can attribute meaning to syntactic patterns [Fillmore, 1988; Goldberg, 1995]. This integrates well with some of our intuitions about metaphor and metonymy: the meaning isn't generated purely from the combinatoric properties of lexical items but requires some top-down interpretation as well. There is considerable evidence that some metonymy can be accounted for via grammatical constructions [Hilpert, 2006; Croft, 1993], and there is also further evidence that some metaphors can be accounted for using construction grammar approaches [Glucksberg, 2001; Martin, 2006]. We are specifically interested in the intersection of syntactic constructions, argument structure, and metaphoricity. Most critically, there is substantial evidence that source and target domains are constrained based on syntactic constructions [Sullivan, 2013], and that overt argument structure can be determined by metaphoric properties of certain verbs [David, 2016].

2.5.1 Adjective-noun constructions

Sullivan's analysis provides a model for using frame semantics along with construction grammar as a framework for metaphor. She argues that certain constructions determine which syntactic components are allowable as source and target domains. As an example, she points to adjective-noun pairings:6

39. spiritual wealth (pg 63)

40. blood-soaked wealth (pg 63)

In 39, the abstract target domain of spirituality is understood in terms of the more concrete source domain of wealth. In 40, "blood-soaked" is understood to be the source domain, as relating to uncleanness. "Wealth" in this phrase is then understood as being metaphorically unclean. The difference between these adjective-noun pairs is the construction: "spiritual wealth" is a domain construction, while "blood-soaked wealth" is a predicative construction. Domain constructions force the head noun (wealth) to be the source and the adjective to be the target (spirituality), while predicative constructions force the head noun to be the target (wealth) and the adjective to be the source (blood-soaked), allowing for a variety of possible interpretations based on the adjective. The adjective-noun constructions given by Sullivan that are clear cases of source and target mappings being determined by the construction appear to be what Sign-Based Construction Grammar calls "lexical-class constructions", where the semantics of a word is determined by the type of construction that licenses it [Boas and Sag, 2012]. For example, the predicative/domain adjective distinction can be formalized by having two different lexical classes of adjectives, with one class licensing predicative adjectives

6All examples in this section are taken from [Sullivan, 2013].

like "beautiful" and "blood-soaked" and another licensing domain adjectives like "spiritual" and "electrical". This would yield information at the NP level indicating the source and target, but we are required to have this information about the type of adjective at the word level.
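The lexical-class formalization just described lends itself to a very direct sketch: given a word-level classing of adjectives, the construction type fixes the source/target assignment at the NP level. The adjective lexicon below is a hand-listed toy stand-in for what would need to be a full lexical resource.

```python
# Toy word-level adjective classes (the information we "are required to
# have at the word level").
DOMAIN_ADJECTIVES = {"spiritual", "electrical", "intellectual"}
PREDICATIVE_ADJECTIVES = {"blood-soaked", "beautiful", "bitter"}

def assign_domains(adjective, noun):
    """Return (source, target) for an adjective-noun pair, per the
    construction its adjective class licenses; None if unclassified."""
    if adjective in DOMAIN_ADJECTIVES:
        # Domain construction: head noun is source, adjective is target.
        return (noun, adjective)
    if adjective in PREDICATIVE_ADJECTIVES:
        # Predicative construction: adjective is source, noun is target.
        return (adjective, noun)
    return None

# assign_domains("spiritual", "wealth") -> ("wealth", "spiritual")
# assign_domains("blood-soaked", "wealth") -> ("blood-soaked", "wealth")
```

The hard problem, of course, is acquiring the adjective classing itself; the mapping from class to domain roles is the easy, deterministic part.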

2.5.2 Argument Structure Constructions

More interesting cases involve how argument structure constructions can be predictive of source and target domains. Sullivan uses Langacker's notion of dependent and autonomous elements [Langacker, 1987], and shows that dependent elements tend to evoke source domains while autonomous elements evoke target domains. This process extends generally to any syntactic construction. Sullivan claims that these syntactic constructions determine which elements are autonomous and which are dependent, and thus can determine source and target domains. In the adjective-noun constructions above, predicative adjectives are dependent and thus evoke source domains, while domain adjectives are autonomous and evoke target domains. In argument structure constructions, verbs are shown to be dependent elements, and thus always evoke source domains. The target domain is then evoked by one or more of the verb's arguments (pg 88, emphasis in original):

41. the cinema beckoned (intransitive)

42. the criticism stung him (transitive)

43. Meredith flung him an eager glance (ditransitive)

In these instances, the verb is from the source domain and at least one of its arguments is from the target. However, arguments can also be neutral and don't necessarily evoke the target domain. Pronouns like "him" in 42 and 43 don't evoke any domain. This is especially prevalent in ditransitive constructions. According to the theory Sullivan develops, in which autonomous and dependent elements determine source and target domain items, any possible configuration of sources and targets should be possible for the arguments in a ditransitive construction. However, the data show that very few of these configurations appear. In fact, it appears that the direct object of ditransitive constructions always evokes the target domain, while the subject and indirect object are either target domain or neutral (pg 100, emphasis in original):

44. The helmet gave Athelstan an idea.

45. She tossed him a bold look.

46. Smoking gave him a headache.

In the examples above, the indirect object is a person, and thus domain-neutral. The direct objects, "idea", "look", and "headache", evoke various target domains. Interestingly, the observation that direct objects are always target domain items is explained by the structure of English ditransitives: ditransitive constructions are forbidden from having pronominal direct objects. As most domain-neutral items are pronominal, they rarely if ever appear as direct objects of ditransitive constructions. This forces the direct object to be a target domain item, and leaves flexibility for the other arguments. The optionality of domain evocation for nominals makes it harder to predict which elements of the construction participate in the metaphor. Despite this limitation, this analysis shows that syntactic structures beyond the lexical level can be indicative of source and target domains.
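The generalizations above can be encoded as a small rule sketch: the verb evokes the source domain, the direct object of a ditransitive evokes the target, and pronominal arguments are domain-neutral. The pronoun list and the pre-parsed clause arguments are simplifying assumptions; a real system would read these off a dependency parse.

```python
# Toy pronoun list for domain-neutrality (a parser's POS tags would
# normally supply this).
PRONOUNS = {"i", "you", "he", "she", "it", "we", "they",
            "me", "him", "her", "us", "them"}

def domain_roles(verb, subj, iobj, dobj):
    """Label the elements of a ditransitive clause with domain roles,
    following Sullivan's generalizations as summarized above."""
    roles = {verb: "source", dobj: "target"}
    for arg in (subj, iobj):
        if arg.lower() in PRONOUNS:
            roles[arg] = "neutral"
        else:
            roles[arg] = "target-or-neutral"
    return roles

# "She tossed him a bold look":
# tossed -> source, look -> target, She/him -> neutral
roles = domain_roles("tossed", "She", "him", "look")
```

This is only a labeling heuristic, not an interpreter: it says which elements participate in the mapping, not what the mapping means.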

2.5.3 Metaphor Identification by Construction

Sullivan's research focuses on source and target domain items, and on which elements of constructions evoke them. This can be extended more broadly to metaphoricity in general: there are cases where certain lexical items are used metaphorically only in certain constructions. One example is the verb "hemorrhage" [Martin, 2017]. The distribution of literal and metaphoric uses of "hemorrhage" shows strict syntactic patterning: when used intransitively, the verb is meant literally, and with a direct object it is invariably metaphoric:7

47. She hemorrhaged.

48. The company hemorrhaged money.

49. (?) She hemorrhaged blood.

50. (?) The company hemorrhaged.

47 is literal, as it doesn't contain a direct object. 48 is metaphoric, and it does have a direct object: "money". Note that examples 49 and 50 are marked as (?): they are not grammatically incorrect, in that they don't produce the typical ungrammaticality of pure syntactic errors. However, they are unattested: corpus studies show that these constructions are non-existent in actual data. In many cases, certain lexemes can only be used either literally or metaphorically in particular syntactic structures. This affords us a simple, direct implementation for metaphor detection: determining the distribution of metaphoricity for each lexeme and the syntactic structures it participates in. We aim to show in Chapter 5 that these distributions are significant for a large number of lexical items, both for binary categorization of metaphor versus non-metaphor as well as for identifying source-target domain mappings.
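The "simple, direct implementation" suggested above amounts to a per-lexeme, per-construction distribution table. In this sketch the counts are hypothetical placeholders, not corpus-derived; in practice they would be tallied from an annotated corpus, and the frames from a parser or VerbNet-style subcategorization.

```python
from collections import defaultdict

# (lemma, frame) -> [literal count, metaphoric count].
# The counts below are invented to mirror the "hemorrhage" pattern.
counts = defaultdict(lambda: [0, 0])
counts[("hemorrhage", "intransitive")] = [12, 0]
counts[("hemorrhage", "transitive")] = [0, 34]

def predict(lemma, frame):
    """Predict metaphoricity by the majority label observed for this
    lexeme in this construction; 'unknown' with no observations."""
    literal, metaphoric = counts[(lemma, frame)]
    if literal == 0 and metaphoric == 0:
        return "unknown"
    return "metaphoric" if metaphoric > literal else "literal"

# predict("hemorrhage", "intransitive") -> "literal"   (cf. 47)
# predict("hemorrhage", "transitive")   -> "metaphoric" (cf. 48)
```

A majority vote is the crudest possible decision rule; the same table could instead feed probabilities into a downstream classifier.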

2.5.4 Analysis

While the nature of source and target domains isn't always predictable from syntax, there is still evidence from construction grammar that syntax plays some part in metaphor understanding, with roles in various constructions providing knowledge of whether elements belong to source or target domains. This is not predicted by conceptual metaphor theory, in which metaphors are cognitive and the linguistic utterances that employ them are

7Examples are the author's.

merely artifacts of the underlying mappings. The important tenet of construction grammar that is relevant here is that meaning can be determined in a more top-down fashion: the semantics of an utterance requires both lexical knowledge and knowledge of the syntactic constructions those lexical items belong in. Sullivan shows that certain constructions require specific source and target elements. We believe that combining this theory with lexical knowledge should yield better predictions of source and target elements across more constructions. While there is much work to be done relating construction grammar to metaphor, it is a promising place to start when formalizing the interaction between lexical semantics and higher-level syntactic meaning. This idea can be applied either through using syntactic structures as features or by employing broader construction grammar-based methods for metaphor processing. There have been some attempts to model aspects of construction grammar computationally [Narayanan and Jurafsky, 1998; Martin, 2006], and these are especially interesting with regard to metaphor identification. We will directly model constructions by using syntactic frames identified from VerbNet, and our experimental methods are otherwise motivated by the evidence presented here: the relations between verbs and their arguments, as well as their thematic roles, are critical in identifying and understanding metaphoric expressions.

2.6 Differentiating Metaphors from Other Language

Figurative language is not limited to metaphors, in which one thing is seen in terms of another. Speakers employ all sorts of figurative language in everyday conversation, and have no problem producing or understanding it. Gibbs claims this is because our experience is shaped by figuration, which strongly parallels Lakoff and Johnson's conceptual framework, in which metaphors shape our understanding:

'Speakers can't help but employ tropes in everyday conversation because they conceptualize much of their experience through the figurative schemes of metaphor, metonymy, irony, and so on. Listeners find tropes easy to understand precisely because much of their thinking is constrained by figurative processes.' [Gibbs Jr., 1993] (pg 253)

There are many subtleties to differentiating between metaphor and other types of figurative language, and as many figurative language processing systems are specifically designed to handle only certain types, it is worth analyzing these subtleties in some depth for our purposes. It is helpful to highlight differences and similarities between metaphors and other types of figurative language, as this will help us understand the tools necessary to automatically understand both. To this end we will examine possible distinguishing factors between the literal and the metaphoric, as well as other kinds of figuration. It may be the case that the syntactic properties we believe are informative for understanding metaphors are also important for other types of figuration, and thus the improvements we make in metaphor systems are broadly applicable to other NLP tasks. However, it may also be the case that these other types of figurative language are distinct enough from metaphor that they are influenced by complementary syntactic patterns, or aren't influenced by syntax at all. To start this analysis, we must first begin with the broad difference between literal and figurative language.

2.6.1 Literal vs Figurative

We tend to have natural intuitions about what is literal and what is not. When presented with metaphors, particularly interesting or novel ones, we can easily distinguish them from literal counterparts. Upon closer examination, particularly regarding metaphor research in the linguistic community, there is rarely complete agreement when it comes to formalizing these intuitions. There are significant differences of opinion on when metaphors become conventionalized, how 'novel' metaphors need to be to step outside literality, the literality of idioms and similes, and many other issues. In order to resolve these in any practical fashion, we must keep the goal of this work (identifying computationally feasible frameworks) in mind when considering this problem. We can sidestep many of the psychological and cognitive concerns and consider a more computational viewpoint, which will necessarily raise more computational concerns. In defining what it means for a word or phrase to be literal, Lakoff provides a critical insight: what is called "literal" by various researchers actually falls into four more specific categories [Lakoff, 1986]:

• Literal 1: conventional literality, ’ordinary conventional language’, by which he means what is common, and not poetic or exaggerated.

• Literal 2: Subject-Matter Literality, or language used to talk about some particular subdomain.

• Literal 3: Nonmetaphorical Literality: this type of literality is reserved for ’Directly meaningful language – not language that is understood, even partly, in terms of something else.’ (pg 2)

• Literal 4: Truth-conditional Literality: language that fits the world, and has objectively observable true or false values.

For Lakoff, Literal types 1, 2, and 4 all allow for metaphorical interpretations. However, this detailed account of how we should treat the concept of literality does little to alleviate our computational concerns. Sentences that Lakoff considers to be literal 1, 2, or 4 can also make use of metaphor. If our machine is to fully understand both literal and metaphoric language, it will need to be able to distinguish which of these "literal" phrases are intended with their non-metaphoric meaning and which can possibly have a metaphoric meaning. For example, consider the following sentence from Martin [1990]:

51. You can enter Emacs by typing emacs into the shell. (pg 1)

According to Lakoff's definitions, this sentence would be literal 1 (conventional) and literal 2 (related to a particular domain). In order for a machine to get the intended reading (that is, "enter" meaning to start the program Emacs), we will require our lexical knowledge of "enter" to contain significantly more sophistication than just the spatial movement sense of "enter" that is typically considered literal. Either we will need multiple senses of "enter", one of which relates to the computer domain and contains knowledge of how "enter" is used in that space, or we will need an inference process to apply our spatial orientation knowledge of "enter" to a new domain. This exposes the core of the problem. We do not need to discern what a person would deem literal (in either the traditional sense or a more nuanced one), or metaphor, or another type of figurative language. Rather, we need to devise an appropriate semantic representation for a sentence, regardless of its "literal" nature. This will depend on the level of representation provided by the knowledge bases for lexical semantics as well as on inference methods.
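The first of the two options above (multiple domain-conditioned senses of "enter") can be sketched as a sense lookup keyed by the domain of the utterance. The sense inventory, glosses, and domain tags here are invented; a real system would use WordNet or VerbNet sense inventories plus a domain classifier.

```python
# Toy domain-conditioned sense inventory (invented glosses).
SENSES = {
    "enter": {
        "spatial": "move into a physical space",
        "computing": "start or switch into a program",
    },
}

def resolve_sense(verb, context_domain):
    """Pick a verb sense by the domain of the utterance, falling back to
    the spatial sense conventionally treated as 'literal'."""
    inventory = SENSES.get(verb, {})
    return inventory.get(context_domain, inventory.get("spatial"))

# In a shell-tutorial context, "enter Emacs" resolves to the computing
# sense; in an unfamiliar domain, we fall back to spatial movement.
```

The second option, an inference process projecting the spatial sense into a new domain, would replace the static inventory with a mapping procedure and is considerably harder.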

2.6.2 Similes

The comparison and interaction views of metaphor lean heavily on relating metaphors to similes, with many arguing that metaphors and similes are paraphrases or possible re-framings of each other [Black, 1962]. While re-framing metaphors as similes won't be powerful enough to develop a full account of metaphor, it does show that similes and metaphors may need similar treatment. Black writes of the similarity between simile and metaphor:

"In a given context of utterance, 'Poverty is like a crime' may still be figurative, and hardly more than a stylistic variant upon the original metaphorical statement. Burns might have said, 'My love is a red, red rose,' instead of 'My love is like a red, red rose,' if the meter had permitted, with little semantic difference." [Black, 1993] (pg 30)

So while there may be a difference in descriptive power between metaphor and simile, it may behoove a computational system to treat metaphor and simile as equivalent. Ortony echoes the sentiment, if not in the same phrasing, as he considers both metaphors and similes to be "nonliteral" [Ortony et al., 1979]. They both involve much of the same processing, in that one thing is viewed as another, so if our machine is to interpret the sentence "My lawyer is a shark", it should obtain the same interpretation for the simile "My lawyer is like a shark." Both utterances require knowledge of the domains of lawyers and sharks, and require knowledge of the attribute of sharks that can be applied to the lawyer. These theories point us to a computational approach that treats metaphors and similes similarly. Similes may be easier, as the knowledge of mapped attributes is often overt, but both will likely require a similar process to derive correct interpretations. Additionally, similes contain marker words, allowing us to automatically distinguish them from metaphors and other figurative language. The process for interpretation remains the same: we need to understand the mapping between concepts, and this process will be essentially the same for similes and metaphors.
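Because similes carry overt marker words, a shallow pattern match is enough to separate them from bare metaphors before handing both to a shared interpretation process. The regex below is deliberately minimal, covering only two common comparative frames; real simile detection handles many more patterns.

```python
import re

# Two simple simile frames: "X is like Y" and "as ADJ as Y".
SIMILE_PATTERN = re.compile(r"\bis like\b|\bas \w+ as\b", re.IGNORECASE)

def looks_like_simile(sentence):
    """Shallow marker-based simile check; metaphors lack these markers."""
    return bool(SIMILE_PATTERN.search(sentence))

# looks_like_simile("My lawyer is like a shark.") -> True
# looks_like_simile("My lawyer is a shark.")      -> False
```

Note this only routes the sentence; as argued above, the downstream mapping between the lawyer and shark domains would be computed the same way for both forms.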

2.6.3 Metonymy

Very closely related to metaphor is the notion of metonymy. While historically overlooked in comparison to metaphor, there has been a large amount of recent research into metonymy, from both linguistic and computational perspectives.8 Metonymy is the use of a word to refer to another concept in the same domain. For example:

52. The ham sandwich is waiting for his check. (food for person eating food) [Lakoff and Johnson, 1980b] (pg 35)

8For comprehensive overviews, see [Stefanowitsch and Gries, 2006] and [Dirven and Pörings, 2002], only some of which is covered here.

53. It won't happen while I breathe. (bodily action for living) [Warren, 2002] (pg 114)9

54. She married money. (asset for asset possessor) [Warren, 2002] (pg 115)

Metonyms can be shown to be similar to metaphors in many ways. One is that they frequently highlight or emphasize certain important or salient characteristics of their referent.10 In the ham sandwich example, the food item a person is consuming is highlighted, while in the case of marrying money, the wealth of the person being married is highlighted. Both of these are semantic features of the target (albeit "wealth" is probably a much more salient feature than "current food being consumed"), and they are highlighted by use of the metonymy. This is comparable to metaphors, which can highlight (and create) similarity, but metaphoric highlighting occurs through comparison to another domain. Some accounts of metonymy emphasize its opposition to metaphor. Warren [2002] claims that while metaphor and metonymy are alike in their use of source and target expressions, they are "two distinct constructions arising from two distinct cognitive operations" (pg 114). Thus it is important to note distinctions between these two types of figurative language, which should shed light on how to interpret both. We should also examine any subtleties of their differences that might be useful in distinguishing one from the other automatically. Perhaps the most important feature of metonymy is its application to only a single domain. Metonymy, it is frequently argued, applies to only one domain while metaphor maps between two. In 52, both "ham sandwich" and the person eating fall into a restaurant domain. In 53, "breathe" clearly falls within the same domain as living, which is the state referred to. In 54, we may be able to say that "money" and the money possessor

9This initially appeared very idiomatic to me, but it seems fairly productive. It's possible to replace 'breathe' with many bodily actions: 'It won't happen while my heart beats.', 'It won't happen while my lungs work.', etc.

10There is some discussion of highlighting as compared to cancellation or shifting of focus [Cohen, 1993]. For our purposes, these are equivalent. Any level of "highlighting" can be normalized. A cancellation, de-focusing, or shift just involves a negative highlight on certain features, which is equivalent to a highlighting of others.

are in the same domain, presuming we believe that a person who possesses money may fall into a domain related to wealth. In these cases the word triggering the metonymy falls into the same domain as the concept being referred to. However, this depends largely on how we define a domain. Barcelona points out that if pushed far enough, a large percentage of domain pairs could be considered related, which would over-generate the possible uses of metonymy [Barcelona, 2002]. He also claims that domains are determined by a kind of folk taxonomy, which could potentially be interesting for a linguist, but is patently disappointing for a machine. If we are to determine whether a statement is metonymic or metaphoric, we will need domains that are determinable from surface forms. Metonymy shares a key computational component with metaphor, which is that the word or words used do not directly convey their literal meaning, and some sort of additional inference or knowledge base is required to infer the correct semantics. Many computational researchers have built metaphor interpreting systems that also extend to metonymy [Fass, 1991; Fass, 1997; Shutova, 2011], so it is reasonable to believe they can be interpreted via equivalent or at least similar processes.
While we would like to believe that metonymy and metaphor are fundamentally related, and thus that techniques that allow for the identification and interpretation of one are easily applicable to the other, it appears that this holds only for identification. Metaphor identification can be treated as similar to metonymy identification, as both often contain mismatches in selectional preferences and share many properties common to figurative language. Lexical features used for metaphor can similarly be used to identify aspects that are highlighted in metonymy structures, albeit within a single domain. From this perspective, we expect similar syntactic patterns and representations to be effective for both metaphor and metonymy, and we should be able to apply similar processes to both.

Interpretation of metonymy requires different processes. It involves finding which part of the semantic domain of a word is standing for the linguistic expression being used. This could be implemented as searching a tree structure to identify the correct entity or event the metonym refers to in a particular context. Metaphor interpretation is much more complicated: it requires knowledge of two domains, and of which elements from one are mapped to the other. These processes seem fundamentally different. Conceptual metaphor theory will not help us with metonymy, as metonymies function within a single domain and contain no mappings between source and target domains. We would likely need an entirely different mechanism to find the correct interpretation for a metonymic phrase.
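The tree-search idea for metonymy interpretation can be sketched as follows. The domain tree and the context-scoring heuristic are invented stand-ins, not a proposal for an actual resource:

```python
# Minimal sketch of metonymy interpretation as search over a domain tree.
# The tree and the overlap heuristic are toy illustrations.

DOMAIN_TREE = {
    "restaurant": ["customer", "order"],
    "customer": [],
    "order": ["ham sandwich", "coffee"],
    "ham sandwich": [],
    "coffee": [],
}

def candidates(root):
    """Depth-first traversal: every node reachable from root is a
    candidate referent for a metonym triggered in that domain."""
    stack, seen = [root], []
    while stack:
        node = stack.pop()
        seen.append(node)
        stack.extend(DOMAIN_TREE.get(node, []))
    return seen

def interpret_metonym(trigger, context_words):
    """Pick the candidate referent (other than the trigger itself)
    that best matches the context, here by naive word overlap."""
    best, best_score = None, -1
    for cand in candidates("restaurant"):
        if cand == trigger:
            continue
        score = sum(w in context_words for w in cand.split())
        if score > best_score:
            best, best_score = cand, score
    return best

# "The ham sandwich wants the check": the trigger stands for the customer.
print(interpret_metonym("ham sandwich", {"wants", "check", "customer"}))  # -> customer
```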

2.6.4 Idioms

Idioms have been classically defined as adhering to two base principles: they are non-compositional (the meaning of the idiom doesn't come from its parts) and they are syntactically inflexible. There is significant debate about the validity of these two claims. Notably, it has been shown that many idioms have constituent parts that can be traced to parts of their meaning, even if they're not compositionally derived in the way non-idiomatic expressions are [Nunberg et al., 1994]. Idioms that have compositional meaning components and have mappings between the idiomatic component and the literal meaning also have varying degrees of syntactic flexibility.11 Importantly for us, idioms are shown to contain a degree of "figuration", whether it be metaphor or metonymy.

11 Sometimes referred to as idiomatically combining expressions.

It can be claimed that idioms are literal [Kiparsky, 1976], and this would be computationally trivial. We could assume that all idioms function literally as multi-word expressions. We would then require only a dictionary-style look-up for each idiomatic expression, which would be used to replace the expression with its correct semantic representation. However, this would require either complete syntactic inflexibility, with all words always appearing in the same syntactic construction, or a prohibitively large number of stored idiomatic phrases that includes all syntactic variants.

Nunberg et al. [1994], among others, disagree with the notion of idioms as literal expressions:

"... it seems to us that all of these idioms will be almost universally perceived as being figurative, even if speakers have no idea WHY these metaphors are used to express these meanings." (pg 493)

Kay et al. [2015] distinguish three types of idioms: fixed expressions, which exhibit no flexibility and can be listed as lexical entries (such as "by and large" and "first of all"); semi-fixed expressions, which are non-compositional but display some flexibility (such as "kick the bucket", which can undergo inflectional morphology); and syntactically flexible expressions, which are analyzable and syntactically flexible ("pull strings", "spill the beans"). Only fixed expressions could be interpreted via look-up, and that process would be trivial.

With regard to metaphoric interpretation, there is evidence that idioms participate in metaphoric constructions, thus making metaphoric interpretations for idioms possible. Consider the example "Many Californians jumped on the bandwagon that Perot had set in motion." [Nunberg et al., 1994] (pg 500). The idiom "jumped on the bandwagon" provides a metaphorical domain of MOTION which "set in motion" makes use of.

As much as we would like to employ the same processes for metaphors and idioms, we would be foolish to commit fully to this. Many idioms are completely opaque, with no recognizable metaphors (i.e. "kick the bucket"), and resist further interpretation. We may, however, be able to fall back to a weaker approach: first, attempt to look up the idiom in a dictionary structure, which would necessarily involve some knowledge of syntax to account for inflection and syntactic flexibility. This would account for fixed and semi-fixed expressions. If no definition is available or the syntactic pattern is unrecognizable, we then need additional interpretation components that rely either on lexical semantics or syntactic constructions. As a last resort, we can attempt general metaphoric interpretation.

Although idioms can be flexible with regard to syntax, they are typically more fixed than other literal or metaphoric expressions. From this we believe that our syntactic representations would also be effective for idiom identification.
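The tiered fallback just described can be sketched as follows. The idiom dictionary, the lemma table, and the final fallback are toy stand-ins, not an actual implementation:

```python
# Hedged sketch of tiered idiom interpretation: direct look-up,
# then lemmatized look-up, then general metaphoric interpretation.
# All tables here are invented for illustration.

IDIOM_DICTIONARY = {
    ("kick", "the", "bucket"): "die",
    ("by", "and", "large"): "generally",
}

LEMMAS = {"kicked": "kick", "kicks": "kick",
          "buckets": "bucket"}

def lemmatize(tokens):
    return tuple(LEMMAS.get(t, t) for t in tokens)

def interpret(tokens):
    # Tier 1: direct look-up handles fixed expressions.
    if tuple(tokens) in IDIOM_DICTIONARY:
        return IDIOM_DICTIONARY[tuple(tokens)]
    # Tier 2: lemmatized look-up handles inflection (semi-fixed expressions).
    if lemmatize(tokens) in IDIOM_DICTIONARY:
        return IDIOM_DICTIONARY[lemmatize(tokens)]
    # Tier 3: fall through to general metaphoric interpretation.
    return "needs-metaphoric-interpretation"

print(interpret(["kicked", "the", "bucket"]))  # -> die
print(interpret(["spilled", "the", "beans"]))  # -> needs-metaphoric-interpretation
```

A real system would need a third tier that handles syntactically flexible expressions (passivization, modification), which simple tuple matching cannot capture.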

2.6.5 Analysis

There are many kinds of figurative language, and although there are similarities between them, they all require some degree of separate treatment. We believe the process of metonym interpretation is fundamentally different from metaphor interpretation. Idioms also follow quite different patterns and their interpretation is often opaque, requiring separate lexical entries. We are then forced to believe that separate systems should be used for the identification and interpretation of each. They are not completely disparate, as all three categories appear to participate in regular constructions that may help predict their meaning, and all require extensive lexical knowledge. Thus while their treatment may need to be separate, each component will likely make use of the same tools. Better syntactic representations will likely improve our automatic identification procedures for any type of figurative language, although interpretation processes will necessarily vary across these different types.

We may even hypothesize that by representing lexical semantics and the processes through which lexical items combine at a high enough level of abstraction with a high enough level of accuracy, the same system may accurately identify all kinds of figurative language. To this end, there are numerous systems developed for figurative language processing that are applicable to metaphor as well as idioms, metonymy, and other types of figuration. However, a single system that can accurately and indiscriminately capture all figurative language is equivalent to one that accurately captures all semantics and is, at best, a distant goal.

2.7 Summary

We've seen from the linguistic literature that there are many factors that affect the production and comprehension of metaphors. From a practical standpoint, selectional preference theory is extremely appealing to those doing computational modelling. It provides an easy-to-use framework, as long as words' preferences can be identified, as well as the words' ability to fill certain preference slots. Also notable is the recent research in construction grammar: there appear to be certain syntactic patterns that determine source and target domain items, and it may also be the case that particular constructions select for metaphoricity in general.

From the linguistic theories we have seen, we believe improved syntactic representations will be ideally suited to improving metaphor processing. Selectional preferences require knowledge of verb arguments, and predicate-argument structure appears to be predictive of source and target domains. Combining our knowledge of selectional preferences, construction grammar, and conceptual metaphor theory, we aim to develop better methods for representing literal and metaphoric constructions, and to show that improvements based on these theories are broadly effective at improving metaphor processing.

From here, we will explore previous computational approaches to metaphor processing. There are many different goals for computational methods in this field, and we will evaluate what kinds of methods are utilized to good effect for each kind of task. We will see how selectional preferences have been exploited in a variety of ways, particularly within machine learning systems. Most importantly, we will attempt to fill the gaps in previous approaches by incorporating ideas from construction grammar: we will see how syntax has been used in the past, and develop new ways of representing these kinds of structures for better metaphor processing.

Chapter 3

Computational Background

3.1 What’s the task?

Computational approaches to metaphor have long suffered from an essential problem: what is the end result of such a system? The wide variety of applications for a system that processes metaphor has led to the lack of a clearly defined goal and many different systems with disparate results on different tasks. Below is a short list of attested tasks for a metaphor processing system:

• Generate semantic representations for literal/metaphoric words/phrases [Martin, 1992; Weiner, 1984; Veale and Hao, 2008]

• Understand linguistic data through corpus research [Gedigian et al., 2006; Reining and Lönneker-Rodman, 2007; Stickles et al., 2014]

• Develop methods for automatically generating understandable metaphors [Veale, 2016]

While these may all have their benefits, the ultimate goal of metaphor understanding should be the generation of correct semantic representations. This task has not yet been fully studied, and it is difficult to define and undertake. There are currently many kinds of semantic representations available, including Abstract Meaning Representations (AMRs) and others, but they don't tend to take strong stances towards non-literal language: typically resources have separate senses for literal and non-literal uses of some words, while ignoring the same mappings when they occur between other lexemes. Thus, while these resources are useful for a variety of tasks, they currently suffer from a lack of understanding of figurative language.

An ideal semantic parser will have knowledge of metaphoric mappings, will be able to identify when a mapping is being used and which syntactic elements of the sentence are influenced, either as source or target items, and will determine which elements of the semantics need to be adjusted to incorporate the metaphoric meaning.1 While this idealized parser in its entirety is outside the scope of this dissertation, we will proceed towards generating semantic representations as a goal. This task can be split into two parts: first, identify metaphoric words or phrases at the lexical or phrase level. From this, we then need to generate the correct semantic interpretation of the metaphor. There are a myriad of ways these interpretations can be generated, and this work will focus on a preliminary step: identifying source and target components in order to facilitate reasoning about the domains. The tasks of identification and interpretation are typically treated separately.

In order to frame these tasks, we will explore the history of computational metaphor processing. The field has evolved greatly since its inception, and due to the breadth of tasks there are a wide variety of systems developed under varying theoretical frameworks.
We will explore three main trends in the algorithms used to process metaphoric and other figurative language: knowledge-based systems, supervised machine learning, and recent advances in neural networks. Each of these methods has been implemented in different ways to accomplish different goals in the space of identification and interpretation, so there are benefits to taking a broad view of the field. We will also review word embeddings and their application as a particularly useful tool for handling metaphors. For recent and historical overviews of computational work on metaphor, there are a variety of comprehensive resources [Shutova, 2015; Shutova, 2011; Martin, 1996].

1 Note that this largely eschews the kind of creative meaning often attributed to metaphors. This kind of system will not be able to identify the subtleties intended by the producer of complex novel metaphors, but rather only represent the coarse-grained semantics.

3.2 Knowledge-based Systems

The earliest approaches to computational metaphor were those of Wilks [1978], employing knowledge of selectional preferences. These often included unsupervised models working with knowledge bases. Knowledge-based approaches tend to require large amounts of structured data to derive understanding. These data structures are typically produced manually, which is time-consuming and likely not practical to implement at scale. To alleviate this issue, most knowledge-based methods incorporate some method for automatic expansion of the knowledge base. The type of knowledge representation used, the method for drawing metaphorical inferences from the knowledge available, and the ways to automatically expand the knowledge base are all key questions that need to be addressed. In order to understand the strengths and weaknesses of this kind of approach, we will examine some implementations of metaphor systems that employ knowledge-based algorithms.

3.2.1 MIDAS

An early knowledge-based approach is the MIDAS system [Martin, 1990]. MIDAS is a metaphor interpretation system structured on the foundation of conceptual metaphor theory. It includes a rigorous process for interpreting metaphor based on the machine's current knowledge base, as well as allowing the knowledge base to be extended when new metaphors are encountered. The system relies on metaphor maps, which map certain concepts from source and target domains. An example of the mappings between the KILLING domain and the TERMINATION domain is shown in Figure 3.1 ([Martin, 1990] pg 69), which shows how the phrase "kill a process" can be correctly mapped to the termination domain. Through inference processes based on current maps, it is able to create new maps that can be used in the future, and thus is a very flexible implementation that builds a more complete knowledge base while it runs.

FIGURE 3.1: Metaphor mappings from ’killing’ to ’terminate-process’

Designed to be part of a UNIX operating system interface, it is thus restricted to that domain, but there are no theoretical reasons why it couldn’t be applied to others. It is also syntactically flexible, handling any kind of input, and functions on natural text.
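A metaphor map of this general sort can be pictured as a small mapping between source-domain and target-domain roles. The following is a toy sketch; the role names are invented for illustration and are not Martin's actual KODIAK representation:

```python
# Toy metaphor map in the spirit of MIDAS: roles in the KILLING source
# domain map onto roles in the TERMINATE-PROCESS target domain.
# Role names are illustrative, not Martin's actual representation.

KILL_TERMINATE_MAP = {
    "kill-victim": "terminated-process",
    "kill-agent": "terminator",
    "killing-event": "termination-event",
}

def apply_map(source_frame):
    """Translate a source-domain frame into the target domain by
    renaming every role the metaphor map covers."""
    return {KILL_TERMINATE_MAP.get(role, role): filler
            for role, filler in source_frame.items()}

# "The user killed the emacs process"
frame = {"kill-agent": "user", "kill-victim": "emacs"}
print(apply_map(frame))  # {'terminator': 'user', 'terminated-process': 'emacs'}
```

MIDAS's extension mechanism would correspond to inferring new entries for such a map when an unmapped source-domain role is encountered.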

3.2.2 Induction-based Reasoning

More recently, there have been numerous computational implementations that rely heavily on knowledge bases. Another approach to metaphor interpretation is that of Ovchinnikova [2014], who develops a full pipeline for metaphor interpretation based on conceptual metaphor theory and Hobbs et al. [1992]. They begin with a parser, create logical forms, and then use abductive reasoning to extract conceptual metaphors, which are converted back to natural language (Figure 3.2 [Ovchinnikova et al., 2014]). Their knowledge base consists of lexical axioms, which indicate which source or target domains go with lexemes, and mapping axioms, which link source and target domains much like Martin's metaphor maps. The system inputs natural language and produces valid interpretations based on the conceptual metaphors present, showing promising results for both English and Russian.

FIGURE 3.2: Metaphor processing pipeline

This system is also knowledge heavy: it requires a knowledge base consisting of both kinds of axioms for any possible source-target mapping encountered, and these axioms need to be both generated and curated, which also makes the system difficult to extend to novel metaphors. However, like many knowledge-based approaches it is completely unsupervised, requiring no annotation. This benefit will become clearer as we explore the difficulties inherent in annotation of metaphor data (Chapter 5).

It is important to note that most knowledge-based metaphor processing systems will look similar to the pipeline detailed by Ovchinnikova et al. in Figure 3.2. Metaphors, along with other latent semantic and pragmatic meanings, live near the top of the natural language processing pyramid, and thus require many preprocessing steps: part-of-speech tagging, lemmatization, and syntactic parsing at least, with many systems using additional semantic representations as well. This means complex pipelines are typically required, with many steps needed to go from the original text input to a viable semantic representation, which is then typically used for metaphor inference and then converted back to natural language. Everything between "Logical form" and "NL generator" will likely differ from system to system, but the other parts are all but ubiquitous, albeit in different formalizations.
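The shape of such a pipeline can be sketched as a composition of stages. Every stage below is a trivial stand-in for a real tagger, lemmatizer, or reasoner; the lemma and source-domain tables are invented:

```python
# Skeleton of a knowledge-based metaphor pipeline. Each stage is a toy
# stand-in; a real system would call a POS tagger, lemmatizer, and parser.

def tokenize(text):
    return text.lower().rstrip(".").split()

def lemmatize(tokens):
    toy_lemmas = {"crushes": "crush", "killed": "kill"}  # illustrative only
    return [toy_lemmas.get(t, t) for t in tokens]

def metaphor_inference(lemmas):
    # Stand-in for axiom-based reasoning: flag known source-domain
    # lemmas (again, purely illustrative).
    source_lemmas = {"crush", "kill", "devour"}
    return [(lem, lem in source_lemmas) for lem in lemmas]

def pipeline(text):
    return metaphor_inference(lemmatize(tokenize(text)))

print(pipeline("Poverty crushes people."))
# [('poverty', False), ('crush', True), ('people', False)]
```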

3.2.3 MetaNet

The MetaNet project is a formalization of conceptual metaphor theory that relies on the frame semantics of FrameNet as well as construction grammar [Stickles et al., 2016]. MetaNet is a network of frames, frame elements, and metaphors. These include structure-defining relations that are typical of ontologies ("is-a", "has-a", and others) as well as non-hierarchical relations that include causal and temporal relationships. These relations are applied to a structure of frames and frame elements (as in FrameNet). Metaphors are identified in MetaNet when "role type mismatch between individual lexical units in a given construct trigger a metaphoric interpretation".

55. poverty crushes people

In the above example, the lexical unit "crush" invokes the frame "Harm to Living Entity". This frame has roles for "cause_of_harm" and "victim". The role type of "victim" is Physical Entity, but in the sentence the lexical item filling the slot is "poverty", which evokes the Poverty frame. The corresponding role in the Poverty frame is Abstract State. This mismatch triggers a metaphoric interpretation. MetaNet also helpfully contains conceptual metaphors with possible role mappings.

In this case, the ECONOMIC HARDSHIP IS PHYSICAL HARM metaphor contains mappings from Abstract State to Physical Entity, which matches the requirements of the frame elements in "poverty crushes people". Through these mechanisms MetaNet allows for metaphoric inferencing using knowledge of frames, frame elements, and possible metaphoric mappings between them. The structure of these elements is also hierarchical, allowing for abstraction to find the level of mapping necessary.

MetaNet has been implemented in metaphor detection and interpretation [Jisup, 2016] with very good results for multiple languages. The system relies on a series of constructional patterns and on identifying relations between elements in them. These are then compared against relations in MetaNet. They report recall and precision for metaphor identification on English to be .86 and .85. They also identify appropriate target frames, which is a step further towards interpretation.

MetaNet has been productively applied to several different metaphor-related tasks. It is frequently used as a tool to help linguists and other analysts find and understand metaphors in corpora. David et al. [2018] show that this kind of knowledge-based method can be extended to find novel metaphors in corpora, including poverty metaphors for English and Spanish and cancer metaphors in English. They note that the resource isn't designed with machine learning in mind:

(MetaNet) is not intended to be a general NLP system (...), rather, a tool that helps metaphor analysts carry out large-scale studies, for instance, by nar- rowing the search according to target domain and providing cross-linguistic metaphor distribution. [David and Matlock, 2018] (pg 470)

This approach is valuable in that it shows the benefits of having clear data and research goals which drive the design and implementation of the system. There are numerous approaches to computationally working with metaphor, and MetaNet is a tool designed to handle one particular aspect of the space.
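The role-type mismatch trigger described in this section can be sketched with a toy frame inventory. The frames, role types, and metaphor table below are simplified stand-ins for MetaNet's actual entries, following the "poverty crushes people" example as the text presents it:

```python
# Toy sketch of MetaNet-style metaphor detection via role-type mismatch.
# Frames, role types, and the metaphor inventory are simplified stand-ins.

FRAME_OF = {"crush": "Harm to Living Entity", "poverty": "Poverty"}
ROLE_TYPE_EXPECTED = {("Harm to Living Entity", "victim"): "Physical Entity"}
ROLE_TYPE_EVOKED = {"Poverty": "Abstract State"}
METAPHORS = {("Abstract State", "Physical Entity"):
             "ECONOMIC HARDSHIP IS PHYSICAL HARM"}

def detect(predicate, role, argument):
    """If the argument's evoked role type mismatches the predicate's
    expected role type, look up a licensing conceptual metaphor."""
    expected = ROLE_TYPE_EXPECTED[(FRAME_OF[predicate], role)]
    evoked = ROLE_TYPE_EVOKED[FRAME_OF[argument]]
    if evoked != expected:
        return METAPHORS.get((evoked, expected), "unknown metaphor")
    return None  # role types match: no metaphoric trigger

print(detect("crush", "victim", "poverty"))
# -> ECONOMIC HARDSHIP IS PHYSICAL HARM
```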

3.3 Machine Learning

In the last few decades, statistical machine learning has been realized as a valuable tool for many language tasks, including metaphor processing. It is often broadly based on the principle of learning a boundary between classes of data. The boundary is learned from training data, through any number of algorithms. For example, consider the toy dataset given in Figure 3.3. We know some points belong to the red class and some to the blue class. We can then employ a machine learning algorithm to generate a boundary between these classes.

FIGURE 3.3: Outline of basic machine learning procedure

In natural language processing, these points are bundles of features for whatever needs to be classified. In our case, this is typically whether a particular lexical item in context is metaphoric or not. Figure 3.4 shows how the standard machine learning paradigm is employed for this task. The red points are sentences in which the verb is literal; the blue points are those that contain metaphoric verbs. By converting these instances into features, we can treat them as points, and employ machine learning to create a boundary between the literal and metaphoric classes. Then, when we see new instances (the two points in white), we assign them classes based on which side of the boundary they fall on.

This is an extreme oversimplification: in most cases, the dimensionality of the features is much higher, the classes are not linearly separable, and there is much more noise in the dataset. Fortunately, modern machine learning algorithms are ideally suited for handling this kind of complexity, and have proven effective for a wide variety of metaphor processing tasks.

FIGURE 3.4: Outline of machine learning for metaphor procedure

For most statistical machine learning, metaphor processing is done based on the theories of selectional preference violation and lexical features. This generally requires at minimum two components: a way of representing a word as a bundle of features, and a way of representing a word's context in order to assess its selectional preferences. These two components have been employed for many other tasks in NLP, and these types of systems have been extended to various kinds of figurative language processing. Supervised machine learning algorithms have been successful for many tasks employing lexical and contextual features, and thus the extension to figurative language detection and interpretation is natural. Despite this, results have varied greatly and evaluation of systems has proven difficult, largely because of the variety of tasks undertaken and the lack of adequate data to train and evaluate systems on. Despite these setbacks, supervised machine learning has remained the dominant approach in NLP for handling many metaphor-related tasks, including identification of metaphoric words and phrases, identifying source and target domains, and creating literal paraphrases of metaphoric phrases.

In order to effectively use machine learning for NLP tasks, we need to represent the words and contexts we encounter as feature vectors, so that the boundaries can be appropriately learned. The points in Figure 3.4 are placed arbitrarily: perhaps the most difficult component of machine learning problems is identifying the correct features that put these points into the correct spaces. We will here examine the typical ways words and contexts are featurized, the kinds of features that have proven successful, and what algorithms and datasets have been employed. We will be particularly concerned with what aspects of machine learning algorithms have been unsuccessful, where there is room for improvement, and what lessons from linguistic metaphor research can be applied to improve computational methods.

3.3.1 Features

A common approach to representing lexical items for machine learning is known as 1-hot encoding, where a word is represented as a vector the length of the vocabulary, with a 1 at the index for that particular word. This allows for single instances of words to be represented in a computationally feasible way, and often works reasonably well for machine learning. However, it ignores commonalities between similar words. Consider the following examples:

56. If you feel that your car drinks up too much fuel

57. (...) an SUV guzzles gas

58. we realize the RS6 drinks gasoline and isn’t a hybrid

These examples are all references to vehicles using fuel at an excessive rate. Encoding the lexical items as 1-hot vectors treats each lexeme separately, ignoring the obvious similarities between "car" and "SUV", as well as between the direct objects "fuel", "gas", and "gasoline". To recognize these similarities, we need to encode the lexical items as features, either from hand-crafted lexical resources or learned from context.

This aligns nicely with linguistic research in which we've seen that metaphor often arises when there are particular matches or mismatches between lexical features. This general idea applies neatly to machine learning algorithms in NLP, where we are required to featurize our data in order to recognize similarity between lexical items. At the lexical level, a number of features have been employed to delineate metaphoric words from others:

• basic linguistic features such as part of speech

• WordNet and other ontology-based features [Fellbaum, 2010]

• concreteness and imageability from the MRC psycholinguistic database [Wilson, 1988]

• valence, dominance, and arousal from the ANEW database [Bradley and Lang, 1999]

• features from the Linguistic Inquiry Word Count (LIWC) [Tausczik and Pennebaker, 2010]

• topic-modeling derived features [Jang et al., 2016]

These and similar features have been coupled in many different algorithms for many different figurative language tasks. Jang et al. [2016] use LIWC emotion features to model target words, and compare topic transitions between sentences using LDA topic modeling to predict their metaphoricity. Their evaluation is restricted to the domain of breast cancer forums, although the approach should be flexible enough to handle other domains. The model of Dunn et al. [2014] uses lexical features as well as corpus data to determine semantic overlap between source and target words. While this approach is fairly basic, they include a module for combining classifiers, which allows various different approaches to vote on the correct tag. This is a valuable insight: figurative language is expressed in many different ways, so different approaches are likely to capture different kinds of meaning effectively. Many others employ similar approaches for similar tasks [Shutova, 2013; Rai et al., 2016; Hovy et al., 2013; Beigman Klebanov et al., 2016; Beigman Klebanov et al., 2015; Stowe and Palmer, 2018].

Metaphor processing doesn't seem to have the high performance ceiling we've seen in other tasks such as part-of-speech tagging. While there is much to debate about evaluation and experimental design, it appears the best performing machine learning algorithms on the standard Vrije Universiteit Amsterdam Metaphor Corpus (VUAMC) dataset yield F1 scores of between .6 and .7. These results are low relative to many other NLP tasks, and this may be due to the difficulty of handling non-literal language. Classification on this problem is difficult, as is annotation, because even humans have a hard time agreeing on what precisely constitutes metaphor. This has led to limited datasets and poor performance both in inter-annotator agreement and automatic classification. In addition to dataset difficulty, we believe there are possible improvements to the featurizations presented in previous work.
We've seen from Sullivan's work that syntactic constructions can have strong influences on source and target domains, and possibly even on which words are metaphoric vs. non-metaphoric. However, the majority of machine learning systems largely ignore syntax. Some use basic dependency or context-related features, but many eschew these in favor of contextual windows which ignore syntax. Many focus only on metaphoricity in one specific construction (such as adjective-noun pairs), bypassing any possible comparison between syntactic structures. We believe that better representations of syntactic structures can yield gains in both metaphor identification and interpretation.

Additionally, there are countless lexical resources that have various syntactic and semantic information available for use in metaphor processing. Many machine learning systems employ features based on lexical resources, but they typically focus on features from semantic hierarchies such as WordNet. Our work will incorporate syntactic features as well as lexical resources that are based on syntactic information, allowing us to maximize the impact of our linguistic intuitions. We will next explore a handful of previous systems that lean heavily on either syntactic features or lexical resources.

3.3.2 Syntax and Lexical Resources

The CorMet system [Mason, 2004] relies on the selectional preference paradigm, and is similar to ours in its collection of key verbs and analysis of syntactic arguments and semantic roles. They automatically collect documents for particular domains based on key words, and identify selectional preferences based on the WordNet hierarchy for verbs in these particular domains. For example, they find that assault typically takes direct objects of the type fortification in the MILITARY domain. This allows them to make inferences about when selectional preferences are adhered to, and they can then identify mappings between different domains. While their task is fundamentally different, their usage of syntactic frames to identify relevant arguments is very similar to our work. However, rather than identifying preferences, we use syntactic frames to identify whether verbs are possibly being used metaphorically. Our methods require less adherence to semantic properties (which they retrieve from WordNet), and are inherently somewhat more noisy: while there is evidence that syntactic frames can be indicative of metaphoric properties, these properties are rarely observed deterministically.

Gedigian et al. [2006] use FrameNet and PropBank annotation to collect data, focusing on the FrameNet frames MOTION and CURE. They use PropBank argument annotations as features, resulting in an accuracy of over 95%, but this is only slightly above the most-frequent baseline of 92%. Their use of lexical resources is related to ours: they collect data from lexical resources and then annotate it for metaphoricity, which is similar to our approach of analyzing the resources and considering certain senses to be metaphoric.

Shutova et al. [2013] also employ syntactic information to generate selectional preferences, identifying verb-subject and verb-direct object pairs in corpora.
They begin with a seed set of metaphoric pairs, similar to our methods of collecting instances based on syntactic information. They use these seed pairs to identify new metaphors, similar to our usage of syntactic patterns to identify training data. Their methods are based on the selectional preferences of verbs, and thus are less concerned with the variety of syntactic patterns metaphors can participate in. We will identify much more complex syntactic patterns, and we will use the resulting data for training metaphor systems rather than for identifying selectional preferences. We will, however, follow their use of statistical machine learning to identify certain types of metaphoric words.

All of these approaches employ syntactic frames and/or argument structures to identify metaphoric words or sentences in some way. Our work expands on these approaches in three ways: we utilize similar features but at a broader scale, we generate new features based on syntactically motivated lexical resources, and we use syntactic patterns to extract additional training data. In order to best make use of these additional methods, we also need representations of lexical meaning. For this, the standard is to employ some type of vector representation, or word embeddings.
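The preference-violation cue these systems build on can be sketched with a toy preference table. The verb preferences and object classes below are invented for illustration, not drawn from WordNet or CorMet:

```python
# Toy selectional-preference check: a verb's expected direct-object class
# vs. the observed object's class. All tables are invented illustrations.

OBJECT_CLASS = {"water": "liquid", "gasoline": "liquid",
                "book": "artifact", "idea": "abstraction"}
VERB_PREFERS = {"drink": "liquid", "read": "artifact"}

def violates_preference(verb, direct_object):
    """True if the object's class differs from the verb's preferred class,
    a coarse cue that the verb may be used metaphorically."""
    return OBJECT_CLASS.get(direct_object) != VERB_PREFERS.get(verb)

print(violates_preference("drink", "water"))  # False: literal use
print(violates_preference("drink", "idea"))   # True: candidate metaphor
```

As the text notes, such cues are noisy rather than deterministic: "the car drinks gasoline" satisfies the liquid preference yet is still metaphoric, which is why syntactic frames and learned features are needed on top of the preference check.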

3.4 Word Embeddings

The majority of the features and algorithms above rely at least in part on hand-crafted lexical resources (WordNet, the MRC database, the ANEW database, LIWC, and others). While these resources are effective, they require curation, are sometimes limited in their coverage of word types, and are often expensive to build and maintain. Ideally, we would like a way of representing lexical semantics that can be learned automatically from unlabelled data, allowing coverage to be expanded automatically. Word embeddings, which are continuous vector representations of words based on their observed contexts over large quantities of data, have been extremely effective in a large number of NLP tasks, particularly those relating to semantics. Word embeddings all function on the same basic principle: words are represented as vectors that are typically low in dimensionality (between 100 and 1000 dimensions) but dense, with each element of the vector containing a real value. This is done by using the contexts the word appears in: words that appear in similar contexts should have similar word vectors. There are numerous ways these can be generated; a basic outline of the word embedding procedure is shown in Figure 3.5.

FIGURE 3.5: Outline of word embedding procedure
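The core property described above, that words appearing in similar contexts receive similar vectors, is usually measured with cosine similarity between the dense vectors. A minimal sketch of this comparison follows; the toy vectors and vocabulary here are invented purely for illustration:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 4-dimensional vectors (real embeddings have 100-1000 dimensions).
vectors = {
    "grow":   [0.9, 0.1, 0.3, 0.2],
    "expand": [0.8, 0.2, 0.4, 0.1],
    "purple": [0.1, 0.9, 0.0, 0.7],
}

# Words used in similar contexts should score higher than unrelated words.
assert cosine(vectors["grow"], vectors["expand"]) > cosine(vectors["grow"], vectors["purple"])
```

The same measure underlies most downstream uses of embeddings, from nearest-neighbor queries to the clustering approaches discussed below.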

3.4.1 Types of Embedding Models

There is a long history of encoding words as fixed-length vectors based on their cooccurrences with other words. Early approaches use a method called Latent Semantic Analysis (LSA), developing matrices from corpora with rows and columns representing words and the value of each cell corresponding to the cooccurrence between those words within certain contexts. These matrices are large and sparse, and are thus condensed using dimensionality reduction such as singular value decomposition (SVD) [Landauer et al., 1998]. Developments in the last five years have greatly increased the amount of data embeddings can be trained on, and the ways they can be generated. Perhaps most notable is the word2vec approach of Mikolov et al. [2013], who showed embeddings can be trained on massive amounts of data and yield extremely useful vectors for semantic representation. Rather than being developed through cooccurrence matrices, these embeddings are trained using neural networks. A basic network is trained, with the inputs being individual words in a sentence and the objective being to predict the words in the immediate context. Words are mapped to fixed-length vectors, initialized randomly, and these are fed to the network to predict context. These vectors are updated during the training process, resulting in words that share similar contexts also sharing similar vectors. Two variations on this theme, the skip-gram and continuous bag-of-words (CBOW) models, have shown tremendous performance on semantic similarity tasks. Since these algorithms' development in 2013, there have been numerous improvements based on similar concepts. The Global Vectors for Word Representation (GloVe) model is similar to the more historical LSA-based approaches, generating cooccurrence matrices and reducing dimensionality, and it provides results that improve over word2vec in many cases [Pennington et al., 2014].
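The LSA pipeline just described, counting cooccurrences and then compressing with SVD, can be illustrated in a few lines. This is a schematic reconstruction under simplifying assumptions (whole sentences as context windows, raw counts rather than weighted ones), not the exact setup of Landauer et al.:

```python
import numpy as np

# Tiny corpus; each sentence serves as one context window.
corpus = [
    "prices grow fast",
    "profits grow fast",
    "trees grow tall",
    "plants grow tall",
]
vocab = sorted({w for sent in corpus for w in sent.split()})
index = {w: i for i, w in enumerate(vocab)}

# Word-word cooccurrence matrix: count pairs within each sentence.
counts = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    words = sent.split()
    for w in words:
        for c in words:
            if w != c:
                counts[index[w], index[c]] += 1

# Dimensionality reduction via truncated SVD: keep the top k singular vectors.
k = 2
U, S, Vt = np.linalg.svd(counts)
embeddings = U[:, :k] * S[:k]   # each row is a dense k-dimensional word vector
print(embeddings.shape)          # (7, 2): one dense vector per vocabulary word
```

The sparse cooccurrence matrix is thus replaced by dense, low-dimensional rows, the same general shape of representation that the neural methods below learn directly.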
Most recently, embeddings are being developed that work for any kind of task, using bidirectional deep networks. The Embeddings from Language Models (ELMo) approach is one of these, and it has shown significant improvements. ELMo embeddings are trained similarly to word2vec, although based on characters rather than words, allowing morphological information to be used to create better representations [Peters et al., 2018]. They also employ both forward and backward passes through the neural network, allowing both contexts to be incorporated. Since ELMo, a system dubbed BERT has been developed that uses subword embeddings in a transformer architecture, and it has achieved extremely good performance on many tasks [Devlin et al., 2018]. BERT trains on both left and right context simultaneously, and improves over ELMo in many cases. Modern systems like ELMo and BERT tend to be effective end-to-end: they can take an input and predict a classification without intermediate steps, and over many kinds of datasets.

3.4.2 Embeddings for Metaphor

Word embeddings suit lexical feature-based theories of metaphor nicely. They are abstract, in that it is difficult if not impossible to determine exactly what "features" are present in the vectors produced, but they capture the semantic properties of lexical items in a way that is very easy to deal with computationally, and are thus ideally suited for machine learning implementations of metaphor processing. Shutova utilizes these embedding models, in which words are mapped to dense, low-dimensional representations via their use in context, for multiple tasks [Shutova et al., 2012; Shutova, 2013]. She gives the example phrase "reflect concern". The latent semantic space shows that the noun "concern", in similar spaces as "reflect", prefers other verbs like "address", "highlight", and "express". She then uses selectional preferences generated via a clustering algorithm to filter out other metaphorical paraphrases. Her system as of 2013 achieves a precision of .68, which is impressive given that it requires no annotated training data. It requires large amounts of data to generate the vector space models and selectional preferences, but as this data is unlabeled it is relatively inexpensive, which is a key advantage of employing word embeddings. Numerous other approaches similarly employ embeddings to augment performance on various metaphor tasks [Heintz et al., 2013; Mohler et al., 2014; Klebanov et al., 2014]. These vector space models also have applications for other kinds of figurative language. Zhang et al. [2015] use word embeddings along with traditional features to resolve metonymy, supporting the hypothesis that the identification of metonymy is a computationally similar problem to that of metaphor. Saltan [2016] uses sentence-level embeddings for the detection of idioms.
Mohler et al. [2014] show that clustering of vector representations can even help with identifying source and target domains, which is applicable to many interpretation tasks within the realm of figurative language. Their results show that word2vec embeddings are more effective than classic discrete vectors or LSA approaches. Word embeddings have been used to effectively represent lexical meaning; if we could represent the constructional composition in the same embedding space, the resulting combination should be a more accurate, computationally viable representation of phrase meaning, including whether the phrase is literal or metaphoric. The work of Gutierrez et al. [2016] employs higher-order tensors towards this end, capturing combinatoric semantics for adjective-noun constructions, and their algorithm points towards possible embedding solutions for more complex constructions. Embeddings have been effective as features for standard machine learning and for unsupervised clustering approaches. Perhaps the most effective way they are employed, however, is as inputs for neural network architectures. These have proven effective for many NLP tasks, and they are beginning to show their effectiveness for metaphor processing as well [Gao et al., 2018].

3.5 Neural Networks

Despite their effectiveness in nearly every area of computational linguistics, there has been only limited work using deep learning-based approaches to classify metaphoric expressions. This may be due to the difficulty of finding sufficient training examples, or merely due to the difficulty of the task. Recently, some research has appeared using deep learning to identify metaphors. Advancements in word embedding models and neural architectures have led to state-of-the-art performance on some metaphor tasks. Neural networks work similarly to other statistical machine learning systems, learning boundaries between literal and metaphoric classes based on training data. They typically use word embeddings as input. Each dimension of the input embedding is passed through a layer of nodes, which apply a function to their combined input and pass along the result, combined with a learned weight, to the next layer. These functions are non-linear: sigmoid, hyperbolic tangent, and rectified linear unit (ReLU) functions are typical. The final layer then contains two nodes: one for metaphoric and one for literal. Applying a softmax function to the result yields a probability distribution over these two classes, and we can then predict the correct class. The weights for these networks are learned via backpropagation: a loss function is defined and changes are made backwards through the network to minimize the loss. Basic neural networks employ an input layer (typically word embeddings), a number of "hidden" layers, and an output layer. This architecture is referred to as a multi-layer perceptron (MLP). There are a plethora of more complicated models. Common in linguistic tasks are convolutional neural networks (CNNs), which employ convolutions over the inputs to better capture local context, and recurrent neural networks (RNNs), which maintain state over input sequences, allowing distant dependencies to be captured.
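The forward pass just described, an embedding in, a non-linear hidden layer, and a two-way softmax out, can be sketched directly. The weights below are random placeholders; a real system learns them via backpropagation on labeled training data:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    """Rectified linear unit: one common non-linearity."""
    return np.maximum(0, x)

def softmax(x):
    """Turn raw scores into a probability distribution."""
    e = np.exp(x - x.max())   # subtract the max for numerical stability
    return e / e.sum()

# Dimensions: a 100-d word embedding, one hidden layer of 50 units, 2 classes.
W1, b1 = rng.normal(size=(50, 100)), np.zeros(50)
W2, b2 = rng.normal(size=(2, 50)), np.zeros(2)

def forward(embedding):
    hidden = relu(W1 @ embedding + b1)
    return softmax(W2 @ hidden + b2)   # [P(metaphoric), P(literal)]

probs = forward(rng.normal(size=100))
assert probs.shape == (2,) and abs(probs.sum() - 1.0) < 1e-9
```

The predicted class is simply the larger of the two probabilities; training adjusts W1, W2, b1, and b2 to push that prediction toward the annotated label.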
Early work in metaphor processing with neural networks used MLP architectures, experimenting with word embeddings, part-of-speech tags, and concreteness ratings as input [Do Dinh and Gurevych, 2016]. They predict metaphoricity on the VUAMC data, with each word being either metaphoric or not. They report an overall F1 score of .561, which, while not necessarily impressive, is competitive with other non-neural approaches of the time. Since this initial approach, a shared task was developed on the VUAMC data in order to help with inconsistencies in experimental setups and datasets [Leong et al., 2018]. This task released a section of the corpus as training data, and allowed systems to incorporate systematic evaluation on the held-out test set. The task was released in 2018, and received 100 submissions resulting in 8 submitted system papers. Nearly all of these, including the top three systems, employed neural networks, particularly bidirectional long short-term memory networks (LSTMs), which are all but ubiquitous in modern NLP tasks. The best performing system employed both convolutional neural network (CNN) and LSTM architectures, achieving an overall F1 score of .651. Since the shared task, the state of the art has been significantly improved upon by a similar LSTM structure employing multiple types of word embeddings as input, achieving an F1 score of .66 [Gao et al., 2018]. We will explore this work as the current state of the art, using it as a baseline for neural performance and a starting point for improvement. Both neural and traditional machine learning approaches have proven effective in the past decade, so we will implement both and provide a comparison of the performance, speed, and interpretability of the models. While neural models have been the gold standard in performance, they are opaque: it is difficult to understand what parts of the input or model are effective.
Traditional machine learning models allow for better transparency, as we can determine which weights on which features are most important. Employing both thus gives us an added benefit: we can analyze the importance of the features we are adding using traditional machine learning, and analyze improvements in state-of-the-art performance using deep learning models.

3.6 Summary

We’ve seen a considerable amount of breadth in the computational approaches to metaphor. Selectional preference theory is widespread. Utilizing preferences for predicates and their arguments, and the mismatches thereof, has proven particularly effective for many tasks. More recently, neural networks have bypassed these kinds of feature-based methods, achieving remarkable performance on a variety of tasks using word embedding inputs. Only a limited number of computational approaches include syntactic features in any significant way. Dependency-based relations are somewhat common, but metaphor is largely handled via lexical semantics in the computational domain. This gives us two different avenues towards improving our knowledge of metaphor both linguistically and computationally. First, we can exploit syntactic properties in combination with modern machine learning and deep learning architectures to improve metaphor processing. This will require novel syntactic representations and methods to maximize the signal present, as well as knowledge of the patterns in lexical resources and corpora that make use of this signal. Second, we can use feature-based machine learning to better understand how the syntactic and semantic components interact. While historically most metaphor processing systems lean heavily on lexical semantic features, we can now incorporate syntactic structures and fully assess how they interact to improve metaphor processing. This assessment should also improve our knowledge of how syntax and metaphor interact in language, and better our understanding of human metaphor production and comprehension. As a first step towards incorporating syntax, we need to understand the data and resources available for computational analysis.
This requires knowing how metaphor is handled in lexical resources, including whether resources make distinctions between metaphoric and literal senses and whether there is available annotated data that can be leveraged for better performance. We will also need to explore the corpora available, the annotation processes used to generate them, and their potential strengths and weaknesses.

Chapter 4

Lexical Resources

Metaphor research requires reliable data, whether the focus is on improving performance in NLP tasks or exploring metaphor usage via computational linguistic methods. However, metaphor resources are notoriously weak compared to those available for other language tasks. This is likely due to the complicated nature of metaphors: theorists often disagree about how metaphors are used, what counts as a novel versus a conventional metaphor, and how they occur in data, which has led to difficulty in generating consistently annotated metaphor data. Despite these difficulties, there are a number of resources that can be employed for research on computational approaches to metaphor. We will split these resources into two main types. First, we have lexical resources, which are structured ontologies of lexical items with various levels of information, including syntactic and semantic properties. These resources often include information that is applicable to literal and figurative sense distinctions, and their structure is often exploitable for metaphor research. These resources are likely best employed in linguistic analysis: the syntactic and semantic information combined with examples provides a solid reference for understanding the metaphors involved. However, we will see that they never consistently distinguish between literal and metaphoric usages, making them hard to leverage for automated metaphor processing via machine learning. Second, we have corpora: datasets of language that have been annotated in some way. A variety of corpora have been annotated for metaphor, although the methods used for annotation are disparate. As they contain a large number of instances with theoretically consistent annotation tags, these can all be employed for machine learning tasks, albeit at different scale and accuracy, as they also vary in size, domain, and practicality.

4.1 Metaphors in Lexical Resources

There are bountiful lexical resources available, and they include a wide variety of information. While different resources can be used for any number of NLP and computational linguistic tasks, we will focus on resources that have applications based on both syntax and semantics. Using resources that incorporate semantics allows us to look for metaphoric regularities. Using resources that also incorporate syntax will additionally allow us to compare the kinds of syntactic structures that give rise to different metaphoric expressions. While few resources explicitly evoke both, there are some that seem ideally suited for this task. In addition to exploring linguistic properties of metaphor through lexical resources, there are also possible computational applications. First, if lexical resources directly identify metaphoric senses (and perhaps even metaphoric semantics), we can leverage annotation that has been done using these resources as a new corpus for machine learning applications. This is appealing in that it offers a free dataset, and it would also be equivalent to treating metaphor processing as a word sense disambiguation (WSD) task, with the different senses incorporating different metaphors. This would allow us to adapt off-the-shelf and other modern WSD models for metaphor processing. Finally, if metaphoric senses are clear, we may be able to leverage the semantics provided by a resource to develop methods for automatic metaphor interpretation. If we are provided with consistent semantic representations, we may be able to learn transformations from literal to metaphoric representations and vice versa. With these linguistic and computational opportunities in mind, we will explore four different resources: VerbNet, FrameNet, WordNet, and PropBank. Each has been extensively used for semantics-based tasks, and each contains some information about possible metaphoric senses.

4.2 VerbNet

VerbNet is a lexical resource that currently categorizes 6,791 verb senses into 329 verb classes based on their syntactic and semantic behavior. These verb classes are based on the work of Levin [1993], who shows that for many verbs their semantics seem to correlate with the syntactic alternations they participate in:

"The behavior of a verb, particularly with respect to the expression and inter- pretation of its arguments, is to a large degree determined by its meaning." (Levin 1993, 1)

VerbNet distinguishes between different verb senses: each sense of a particular lemma is assigned a class which contains other verb senses that participate in the same syntactic alternations. These classes are defined by the kinds of syntactic and semantic structures they allow, the thematic roles involved, and the verbs that can participate in them. This gives us a rich source of information about the syntactic and semantic properties of specific verb senses, as well as details of the kinds of arguments these verbs can take.

4.2.1 Metaphoric/Literal VerbNet Classes

VerbNet organizes sense distinctions into a hierarchy of classes, and we need to assess whether VerbNet makes regular metaphoric distinctions in its ontological structure, reflected by this hierarchy. VerbNet classes function as groupings of verbs based on their syntactic and semantic patterns. If classes neatly divide literal and metaphoric senses, we can use VerbNet annotation as a source of metaphoric training data, as well as use established word-sense disambiguation methods to automatically classify these metaphoric and literal classes. When analyzing VerbNet, we find that the delineation of metaphoric senses is present but inconsistent. There are classes that contain only literal senses and classes that contain only metaphoric senses. Even more helpfully, there are classes for which the members all instantiate certain metaphoric mappings. There are also classes where only some of the verbs are used metaphorically, and classes where no distinction between literal and metaphoric usages is made. We will briefly cover examples of each of these types.

Literal Classes

Many VerbNet classes are intended to reflect only literal usages. These include classes that contain selectional restrictions that match literal uses of verbs: Agents and Themes that are +CONCRETE or +HUMAN indicate literal uses of many verbs, as they don’t violate the verbs’ selectional preferences. A prime example of this kind of class is grow-26.2.1. The grow-26.2.1 verbs involve the transformation of one entity into another accompanied by a change in size.1 The class has an animate Agent, a Patient restricted to +CONCRETE, and a Product restricted to +CONCRETE. These roles and restrictions strongly indicate that this class prefers literal readings, as the arguments should be concrete entities, which matches the selectional restrictions for the literal use of these verbs. The semantics also mark the literal usage: the Product did not exist at the start of the event, then through some change the Product does exist at the end state. This is also evident from sentences manually annotated with the grow-26.2.1 class:

1 We would like to distinguish these plant-based usages of "grow" from a possible metonymic variation, where "grow" literally means the change in size of an entity, with this change in size being metonymic for the advancement of some living entity in its life cycle. This kind of metonymy would account for the phrase "grow up", in which growth is seen as maturity rather than size. However, in these plant-based examples, both size and maturity are involved when using the verb "grow", so we believe these are likely facets of the same literal sense.

FIGURE 4.1: Overview of grow-26.2.1 class

59. A private farmer in Poland is free to buy and sell land, hire help, decide what to grow and how to grow it.

60. The five-member Atlantis crew will conduct several experiments, including growing plants and processing polymeric materials in space (...)

61. From conelets three millimeters in diameter come seeds no larger than sesame seeds, yet they grow into huge trees up to 60 meters tall.

There are many classes like grow-26.2.1 that indicate literal usage through their thematic roles or semantic frames. These can be leveraged to generate additional data: any data annotated with these VerbNet classes can be employed as training data for the literal category. This annotation can be combined with that of any other metaphoric dataset, assuming the distinctions made with regard to metaphoricity by VerbNet match those made by the annotation scheme of the dataset in question. We will see this technique in action in Chapter 9.

Metaphoric Classes

There are also classes in VerbNet that are exclusively metaphoric. Even better, as VerbNet classes contain semantic information, the classes that are metaphoric typically instantiate a particular metaphoric mapping. An example of this phenomenon is a class that captures change of state on a scale, calibratable_cos-45.6.1.

FIGURE 4.2: Overview of calibratable_cos-45.6.1-1 class

The calibratable_cos-45.6.1 class contains verbs that indicate change of position on a scale. It has Extent, Initial_State, and Result thematic roles, none of which contain any selectional restrictions, making them available for abstract realizations. The semantics involves a change of the position of the Extent from one value to another. Most verbs in the class have concrete senses ("explode", "jump", "drop", etc.). However, in this class they have a different sense, which is change of position on a scale. This class then instantiates a particular metaphor (UP IS MORE), in which something moving in an upward direction reflects the abstract scalar change.2 While these kinds of metaphoric classes are rarer than literal ones, they can be extremely useful in identifying additional metaphor data. Distributionally, verbs (and other parts of speech) are much more likely to be used literally, so metaphoric classes tend to be lacking, which makes the addition of extra metaphor data invaluable. In addition, if our goal is to identify source and target domains, and even metaphoric mappings, these kinds of examples are extremely valuable, as they implicitly employ known metaphoric mappings. The verb "grow" is present in both grow-26.2.1 and calibratable_cos-45.6.1. This allows us to treat metaphor detection for this verb as a word sense disambiguation problem. If "grow" is used in the grow-26.2.1 sense, we know it is literal, and if it is used in the calibratable_cos-45.6.1 sense, we know it is metaphoric, employing the UP IS MORE metaphoric mapping. This is evident from sentences from VerbNet annotation:

62. All you have to do is become a member and watch your earnings grow.

63. According to related data, last year saw property prices in 70 large to medium cities grow by 1.5% year over year, (...)

64. We are absolutely confident that Stuart’s ethical and focused strategies will see these trends continue to grow (...)
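Once a word sense disambiguation system has assigned a VerbNet class, metaphor detection for a verb like "grow" reduces to a table lookup. The sketch below is schematic: the class-to-label table is hand-built from the analysis above, and a real system would need such an analysis for every class before any entry could be added:

```python
# Hand-built from the analysis above: classes judged wholly literal or
# wholly metaphoric by inspection. Classes with no entry get no decision.
CLASS_LABELS = {
    "grow-26.2.1": "literal",                 # concrete growth of an entity
    "calibratable_cos-45.6.1": "metaphoric",  # the UP IS MORE scalar sense
}

def label_verb(verbnet_class):
    """Map a disambiguated VerbNet class to a metaphoricity label, if known."""
    return CLASS_LABELS.get(verbnet_class, "unknown")

# "watch your earnings grow" -> calibratable_cos-45.6.1 -> metaphoric
assert label_verb("calibratable_cos-45.6.1") == "metaphoric"
# "they grow into huge trees" -> grow-26.2.1 -> literal
assert label_verb("grow-26.2.1") == "literal"
```

The "unknown" default matters: as the next section shows, most classes license both literal and metaphoric uses and so admit no class-level decision at all.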

Indeterminate Classes

2 Note that the semantics are often more complex than this simple metaphor: "explode" evokes an explosion, from which we infer upward movement, which is then reflected as a change in position on a scale.

If every verb class made these distinctions, we could use VerbNet directly and our metaphor annotation would be plentiful. Unfortunately, most VerbNet classes don’t make any clear distinction between literal and metaphoric uses of the verbs in them. A prime example of this is the other_cos-45.4 class, which contains over 300 verbs and has relatively vague semantics. "Grow" is also in this class, but fits multiple frames in both literal and metaphoric capacities. For example, consider the resultative construction, which verbs in this class allow:

65. He grew tired. (metaphoric, CHANGE IN SIZE IS CHANGE OF STATE)

66. He grew bigger. (literal)

This class makes no distinction between literal and metaphoric uses, licensing them both. This is much more frequent than the other two cases, where a distinction is made. This generally makes it difficult to use VerbNet for automatic metaphor detection. First, linguistic analysis is necessary for each class to determine whether it can be used as a deterministic source of metaphoric or literal data. Second, for many classes there won’t be a clear distinction. This will be the case for all the lexical resources we explore: they offer some insight into metaphor in the various senses and structures they describe, but metaphoric meaning is never a core component of the resource, and they rarely make deterministic decisions about metaphoric meanings. Despite the difficulty, the cases where VerbNet does make distinctions can be effectively used, and we elaborate on this process in Chapter 9. In addition to the class-based sense distinctions made by VerbNet’s class hierarchy, there are three structural components of VerbNet classes that offer possible value for metaphor research. These are thematic roles, syntactic frames, and semantic frames.

4.2.2 Thematic Roles

Each class contains a list of possible thematic roles that the verbs in the class take as arguments. These thematic roles are supplied with selectional restrictions indicating the types of things that are allowed to fill that role. This information directly parallels the selectional preference theory of Wilks [1978], and should provide valuable insight into the metaphoricity of given senses. See Figure 4.3 for examples of thematic roles with selectional restrictions.

FIGURE 4.3: Thematic roles from the eat-39.1 class

In the eat-39.1 class, the top class has two thematic roles: an Agent that has the

+ANIMATE restriction, and a Location, allowing for phrases like "He ate off the table". The subclasses 39.1-1 and 39.1-2 both inherit these roles. 39.1-1 also has an additional Patient thematic role, which has +BIOTIC and +SOLID restrictions, and encompasses verbs of eating. The other subclass, 39.1-2, also adds a Patient with the +BIOTIC restriction, but replaces the +SOLID restriction with -SOLID. This allows for "drink", and this inheritance structure allows the similarities between "eat" and "drink" to be captured while still maintaining the salient difference: the argument of "eat" must be solid and the argument of "drink" must not be. If we look at the VerbNet representation of the classic example "My car drinks gasoline", we can see how these preferences would be used. "Drink" here is a member of the eat-39.1 class, which involves verbs of consumption. This class has an Agent thematic role with the +ANIMATE restriction, as well as a Patient thematic role with a +BIOTIC restriction. Specifically, the subclass that "drink" belongs to has an additional restriction: the Patient must also be -SOLID. In the example, the Agent of "drink" is "my car", which violates the +ANIMATE restriction. The Patient, "gasoline", is correctly -SOLID, but likely doesn’t match the +BIOTIC restriction. Unfortunately, the selectional restrictions VerbNet includes are often incomplete and poorly defined. Note that the +BIOTIC restriction excludes sentences in which an animate Agent drinks something they shouldn’t, which should still be taken literally. "Water" likely doesn’t count as +BIOTIC, nor would other manufactured liquids:

67. Chimps also use leaves as sponges or spoons to drink water.

68. He also loves the taste of paint and has a tendency to eat paintings and drink pure paint.

These sentences are still valid and literal: the Agent is involved in a concrete drinking event that consumes the Patient. This follows from Wilks’ observation about selectional preferences: we should prefer the normal but accept the unusual. While VerbNet can theoretically provide effective selectional preferences, there are cases where they are incomplete and/or inaccurate.
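The restriction check walked through for "My car drinks gasoline" can be sketched as a comparison of argument features against a class's restrictions. The feature dictionaries and the shape of the restriction table below are hypothetical simplifications for illustration, not VerbNet's actual representation:

```python
# Hypothetical restrictions for the "drink" subclass of eat-39.1:
# Agent must be +ANIMATE; Patient must be +BIOTIC and -SOLID.
RESTRICTIONS = {
    "Agent":   {"ANIMATE": True},
    "Patient": {"BIOTIC": True, "SOLID": False},
}

# Toy feature dictionaries for candidate argument fillers.
FEATURES = {
    "my car":   {"ANIMATE": False, "BIOTIC": False, "SOLID": True},
    "gasoline": {"ANIMATE": False, "BIOTIC": False, "SOLID": False},
    "chimps":   {"ANIMATE": True,  "BIOTIC": True,  "SOLID": True},
    "water":    {"ANIMATE": False, "BIOTIC": False, "SOLID": False},
}

def violations(args):
    """Return (role, feature) pairs where an argument breaks a restriction."""
    return [(role, feat)
            for role, filler in args.items()
            for feat, required in RESTRICTIONS[role].items()
            if FEATURES[filler].get(feat) != required]

# "My car drinks gasoline": Agent breaks +ANIMATE, Patient breaks +BIOTIC.
assert violations({"Agent": "my car", "Patient": "gasoline"}) == [
    ("Agent", "ANIMATE"), ("Patient", "BIOTIC")]

# "Chimps drink water": only the (arguably too strict) +BIOTIC check fires,
# echoing the point that these restrictions are incomplete.
assert violations({"Agent": "chimps", "Patient": "water"}) == [("Patient", "BIOTIC")]
```

As the "water" case shows, restriction violations cannot be read deterministically as metaphoricity; they are at best a noisy signal, in keeping with Wilks' preference-not-requirement view.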

4.2.3 Syntactic Frames

Another useful component of VerbNet structure is the syntactic frames. These are general syntactic structures that the verbs within a given class are most likely to participate in. They are loosely based on the kinds of syntactic alternations that Levin noted distinguish different verb semantics. Each syntactic frame contains a primary description which gives the general type of syntax involved, as well as the thematic roles, verb, and other fixed elements necessary for the frame to be evoked. See Figure 4.4 for some examples of syntactic frames.

FIGURE 4.4: Syntactic frames for the eat-39.1-1 class

These frames can be leveraged in a variety of ways. First, they allow for distinctions between verbs and verb classes to be made primarily based on syntax. This intuition from Levin is also related to that of Sullivan [2013], who shows that certain syntactic constructions influence metaphoric properties. From this we can infer that the kinds of syntactic frames a verb can participate in should influence the kinds of metaphors it can evoke. We will see how we can explore this idea using automatic VerbNet class tagging and annotated corpus data in Chapters 6 and 8.1.

4.2.4 Semantic Frames

Coupled with each syntactic frame is a semantic frame. These are first-order predicate logic representations of the semantics of that particular frame.3 These semantic representations have numerous uses, including event-based reasoning, as well as providing an overall view of the kinds of semantic activities that are happening within a class of verbs. In many cases, they attempt to distinguish between relatively similar classes. Consider two verbs analyzed by Levin: cut and hit. Each has unique semantic properties,

3 In most cases, the semantics for each syntactic frame within a given class are identical, but there are some exceptions. In most cases, this is due to certain arguments being optional or not expressed within certain syntactic frames. VerbNet is working towards unifying these representations, so that each class will have the same semantic representation.

FIGURE 4.5: Semantics for frame in cut-21.1

FIGURE 4.6: Semantics for frame in hit-18.1

and each is a member of different but related verb classes: cut in cut-21.1 and hit in hit-18.1. Levin provides an in-depth analysis of these verbs and their distinctions, which are semantic but influence their syntactic properties:

• cut : causing change of state by moving something into contact with the entity that changes state

• hit : contact by motion

These differences, along with the syntactic properties of these and similar verbs, cause the differentiation in VerbNet classes. They are also made concrete by the semantic frames for these classes. In 4.5, the Agent causes some event E. During this event, the Agent has a manner, defined as Motion. The Instrument (not overtly stated in many frames, including this one) is in contact with the Patient during the event. As a result of the event, there is some degradation of the material integrity of the Patient. This captures the change of state caused by motion combined with contact. In contrast, in 4.6, the Agent causes some event but has a directed-motion manner rather than a regular motion manner. The Patient isn't in contact with the Agent during the event, but at the end of the event the Patient and Agent are in contact. This indicates the directed contact through motion.

These verbs have a lot in common, but the semantic frames here attempt to pull apart the subtleties of meaning that make them distinct. In later sections we will explore how the semantic predicates for these frames can be used for a variety of VerbNet-based computational applications.

These three components make up the structure of VerbNet classes. While they offer descriptive power for verb classes, indicating the syntactic and semantic behavior that verbs participate in, they also form the foundation of a structured base of knowledge that can be employed computationally. We will see more in Chapter 8.1 about usages of VerbNet structure for metaphor processing and analysis.
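The cut/hit contrast can be sketched by rendering the two predicate inventories as simple Python tuples. The predicate names and argument lists below are paraphrased from the discussion above, not copied from the resource, and the comparison is only meant to show where the distinguishing information lives.

```python
# Hypothetical, simplified renderings of VerbNet-style semantic predicates
# for the cut-21.1 and hit-18.1 frames discussed above.
cut_semantics = [
    ("cause", ["Agent", "E"]),
    ("manner", ["during(E)", "Motion", "Agent"]),
    ("contact", ["during(E)", "Instrument", "Patient"]),
    ("degradation_material_integrity", ["result(E)", "Patient"]),
]

hit_semantics = [
    ("cause", ["Agent", "E"]),
    ("manner", ["during(E)", "DirectedMotion", "Agent"]),
    ("contact", ["end(E)", "Agent", "Patient"]),
]

# The predicate inventories overlap heavily; the distinguishing information
# lies in *when* contact holds, *who* is in contact, and the Agent's manner.
shared = {name for name, _ in cut_semantics} & {name for name, _ in hit_semantics}
print(sorted(shared))
```

Only cut carries the change-of-state predicate (degradation of material integrity), while the shared predicates differ in their temporal arguments, mirroring the prose analysis.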

4.2.5 Previous Applications of VerbNet for Metaphor Processing

A handful of systems have incorporated VerbNet as a feature for metaphor processing. Klebanov et al. [2016] use a variety of lexical resources with feature-based machine learning, including VerbNet classes. However, they use all the senses for each verb as features, ignoring the distinctions between word senses that VerbNet makes. They report minimal impact from using VerbNet, but we believe that without word sense disambiguation to determine which VerbNet class is being used for each particular instance, the effectiveness of this resource will be minimal.

More recently, Stowe et al. [2018] use VerbNet senses as features. They use word sense disambiguation to determine the VerbNet class each instance belongs to, and add this as a lexical feature. They report significant improvements over lexical baselines using VerbNet classes as a feature for binary classification of verbs, as well as for identifying target-domain verbs. However, their baselines are much lower than many other approaches, so while they show VerbNet classes improve over basic baselines, they do not show conclusively that they are more effective than other lexical-semantic features.

4.2.6 Summary

VerbNet will be our primary resource for metaphor processing, for a handful of reasons. First, VerbNet class structure is based on syntactic alternations. We have seen that syntactic properties and constructions can be influential in metaphor understanding, so using a resource that makes fundamentally syntactic distinctions is prudent. If we can show improvements using VerbNet-based features and methods, this lends credence to the notion that syntax is an effective predictor of metaphor. Second, each class is an extremely rich source of syntactic and semantic information. Each class contains many components that relate to metaphor, and the class ontology itself also frequently distinguishes between literal and metaphoric usages.

Another benefit of VerbNet is the availability of annotated data from numerous VerbNet annotation projects. It also contains links to a variety of other resources that could be used for additional data. We can use this annotated data to understand the interaction between VerbNet and metaphor, and given a better understanding of this interaction, we can put this data to use for metaphor processing. Additionally, VerbNet parsing is quick and effective. The word sense disambiguation system we use achieves 92% accuracy [Palmer et al., 2017], and can tag over 1,000 sentences per minute. This allows us to use it for any reasonably sized dataset, and also to generate a corpus of Wikipedia data tagged with VerbNet class and role information (Section 8.2).

Next we will examine a variety of other resources. Each of these has possible value, but none offer the structural components that make VerbNet ideal for our purposes.

4.3 FrameNet

FrameNet is another lexical resource that focuses on semantic components, based on Frame Semantics [Baker et al., 1998]. It contains an inventory of semantic frames and the lexical units which evoke these frames (called frame evoking elements (FEEs)). The frames also contain information about Frame Elements (FEs), which are constituents that are related to the frame semantically and often syntactically. FrameNet is similar to VerbNet in that it includes semantic components for verbs via their frames and for their arguments via their frame elements. However, FrameNet has no syntactic representations. It relies on semantic frames, the elements that evoke them, and their frame elements, without reference to the structures that might compose them. Regardless, we will explore the key structural components of FrameNet: frames and frame elements.

4.3.1 Frames

FrameNet has 1,224 total frames, evoked by 13,640 possible lexical units, which are particular senses of words. These frames contain descriptions of the frames, frame elements, lexical units that evoke the frames, and frame-to-frame relations. These relations include typical ontological relations such as "inherits from" and "uses". The inventory also includes semantic relationships such as "perspective on", "is preceded by", and "is inchoative of". This provides a comprehensive and informative ontology of relations between frames, which may be exploitable for metaphor processing purposes.

To continue with the verb "grow", FrameNet has specific frames for many literal and metaphoric meanings. As an example, Figure 4.7 shows the Growing_Food frame, which instantiates the same usage as the grow-26.6.2 class from VerbNet. This shows the key components of FrameNet: frames and frame elements.

4.3.2 Metaphoric/Literal FrameNet Frames

FrameNet's focus on frame semantics makes it much more appealing in terms of metaphoric semantics. Frames in FrameNet are akin to domains, and have been used extensively with conceptual metaphor theory. Frames are, according to Fillmore, "any system of concepts related in such a way that to understand any one of them you have to understand the whole structure in which it fits" [Fillmore, 1982] (pg 111). Croft and Cruse

FIGURE 4.7: Growing_Food frame

Frame                        Example                                             Metaphor Mapping
Change_position_on_a_scale   Attacks on civilians grew over the last 4 months.   UP IS MORE
Transition_to_a_state        Over the years she grew to hate him.                CHANGE IN SIZE IS CHANGE IN STATE
Expansion                    The cow grew a little.                              literal
Cause_expansion              I grow vegetables until they weigh 25 pounds.       literal
Growing_food                 I grow my own vegetables.                           literal
Ontogeny                     Someday this bud will grow into a rose.             literal

TABLE 4.1: "Grow" lexical units in FrameNet

[2004] point out that the term "domain" as used by Lakoff [1987] is nearly identical to Fillmore's use of "frame" (pg 15). Because of this direct similarity, we see that FrameNet frames typically evoke either the literal use of a verb or a particular metaphor.

This is especially apparent when we consider our example verb "grow". "Grow" is a lexical unit in six different FrameNet frames, each of which is either literal or evokes a particular metaphor. Table 4.1 shows an analysis of each frame. Each frame that "grow" appears in reflects a different facet of its meaning. While many of the frames are literal, they each represent a different part of the literal meaning. The two metaphoric frames, Change_position_on_a_scale and Transition_to_a_state, reflect unique metaphoric mappings.
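The analysis in Table 4.1 can be read as a simple lookup from frame to metaphoric mapping. Below is a sketch of that reading; the frame names come from FrameNet, but the dictionary structure and function are ours for illustration.

```python
# Table 4.1 as a lookup: for each FrameNet frame containing "grow",
# the metaphoric mapping it evokes (or "literal" if none).
grow_frames = {
    "Change_position_on_a_scale": "UP IS MORE",
    "Transition_to_a_state": "CHANGE IN SIZE IS CHANGE IN STATE",
    "Expansion": "literal",
    "Cause_expansion": "literal",
    "Growing_food": "literal",
    "Ontogeny": "literal",
}

def is_metaphoric(frame: str) -> bool:
    """A frame is metaphoric if it evokes a mapping rather than literal use."""
    return grow_frames.get(frame, "literal") != "literal"

metaphoric = sorted(f for f in grow_frames if is_metaphoric(f))
print(metaphoric)
```

If frames could be tagged automatically, such a table would turn frame disambiguation directly into metaphor detection for the covered verbs, which is the appeal described above.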

4.3.3 Frame Elements

Each frame also contains a list of frame elements. These come in three different vari- eties:

• core frame elements

• peripheral frame elements

• extra-thematic frame elements

Core frame elements are those that are essential for the frame. They are similar to VerbNet's thematic roles, in that they are key components of the frame's semantics.

"A core frame element is one that instantiates a conceptually necessary component of a frame, while making the frame unique and different from other frames." [Ruppenhofer et al., 2016] (pg 23)

For the Growing_Food frame, the core frame elements are the Food and the Grower: these are the concepts that are necessary to understand the semantics of this frame, and that differentiate it from other similar uses of "grow". Core frame elements also include elements that are still understood when they are not overtly marked.

In contrast, peripheral frame elements are common across frames, and are not necessary to understand the semantics of the frame. Peripheral frame elements include things such as "Time", "Place", and "Manner". These are universally applicable, and don't introduce anything distinct to a particular frame. For the Growing_Food frame, "Instrument", "Duration", and "Manner" are all peripheral frame elements: they are all secondary, and not necessary to understand the growing event.

Extra-thematic frame elements introduce outside events that function as a backdrop for the current event. They are not considered to be conceptually part of the frames they appear in, but rather function as arguments of more abstract frames. Growing_Food includes "Particular_Iteration" and "Circumstances" as extra-thematic frame elements.

Frame elements also come with markers for semantic type. This is similar to the verb-specific features in VerbNet: they indicate the kind of thing that typically fills the particular role. This may be usable via selectional preference theory: if these semantic types are violated, it may be an indication of metaphoricity. Consider the Disaster_Scenario frame. It contains a non-core frame element Responder, which has the semantic type Sentient. This indicates that the individual or group responding to the disaster is likely to be a sentient entity. This expectation can be violated to produce a metaphoric interpretation, such as the following:
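The selectional-preference idea can be sketched as a simple type check. The semantic-type inventory and the lookup from a filler's head word to its type below are illustrative stand-ins, not part of FrameNet's actual data or API; only the Disaster_Scenario/Responder/Sentient triple comes from the discussion above.

```python
# Expected semantic type for a (frame, frame element) pair, per FrameNet's
# semantic-type markers (toy inventory).
expected_types = {("Disaster_Scenario", "Responder"): "Sentient"}

# Toy lexicon mapping a filler's head word to its semantic type; a real
# system would need a much broader source of type information.
filler_types = {
    "firefighter": "Sentient",
    "demeanor": "Non-sentient",
}

def possible_metaphor(frame: str, element: str, filler_head: str) -> bool:
    """Flag a potential metaphor when the filler's type violates the expected type."""
    expected = expected_types.get((frame, element))
    observed = filler_types.get(filler_head)
    return expected is not None and observed is not None and observed != expected

print(possible_metaphor("Disaster_Scenario", "Responder", "demeanor"))     # True
print(possible_metaphor("Disaster_Scenario", "Responder", "firefighter"))  # False
```

Under this sketch, "his cool demeanor" responding to a blaze would trip the type check, matching the intuition in example 69 below.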

69. His cool demeanor put out the blaze in her heart.

4.3.4 Previous Applications of FrameNet for Metaphor Processing

FrameNet contains a wealth of information that is extremely valuable, but it lacks direct notation of metaphoricity. The frames don't indicate whether they are metaphoric or which metaphors they evoke, and hand curation would be necessary to use FrameNet to this end. Helpfully, a significant amount of curation has been done in the form of the MetaNet project (as seen in Section 3.2.3). While this project has been overviewed in the previous section, we should note that it has been effective as a tool for computational metaphor processing [David, 2017], and David et al. [2018] show that this tool can also be used very effectively for more exploratory research of poverty- and cancer-based metaphors in English and Spanish.

4.3.5 Summary

While FrameNet is a practical tool for this kind of linguistic analysis, our intentions are focused on direct NLP applications: automatic metaphor detection. FrameNet can certainly be used to this end: the specificity, breadth, and semantic detail of the frames provided are inherently valuable for metaphor processing. However, our preference remains with VerbNet, due to the richness of the classes provided. FrameNet contains frame elements and ontological relations, but provides relatively sparse information at the frame level about the kinds of syntactic structures the lexical units can be involved in. This is due to its basis in frame semantics: the concern is not the syntactic realizations of these frames, but rather the semantic frame structure. For this reason, to explore syntactic patterns we prefer VerbNet.

4.4 PropBank

PropBank is a resource designed to annotate sentences with semantic roles. It is composed of "frame" files, each containing one or more rolesets, which correspond to different coarse-grained word senses. Originally designed for verbs, these frames now incorporate nominalizations and other possible predicative usages. Each roleset is defined and includes the number and type of arguments that sense can take. In this way, it resembles VerbNet in that it identifies multiple word senses and their arguments. As an example, the PropBank frame file for "glance" is shown in Figure 4.8.4

These frames contain a multitude of information regarding the particular word sense. They contain a roleset ID number as well as a brief definition. They also contain roles: the thematic arguments that are possible for the predicate. "Glance" takes an Agent, here Arg0, and a Patient, Arg1. These arguments are deliberately somewhat vague, as this makes for easier annotation and agreement. The frames then also contain a set of examples.

4Some irrelevant information regarding mappings to other resources and the creation of the frame has been omitted.

FIGURE 4.8: PropBank frame file for the word "glance"
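The roleset structure illustrated above can be sketched as a plain dictionary. The field names mirror the frame-file components described in the text (roleset ID, definition, numbered arguments); the definition string is a paraphrase for illustration, and PropBank itself stores frame files as XML rather than exposing this structure directly.

```python
# Hypothetical sketch of a PropBank-style roleset record for "glance".
roleset = {
    "id": "glance.01",
    "definition": "look quickly",  # paraphrased for illustration
    "roles": {
        "Arg0": "looker (Agent)",
        "Arg1": "thing glanced at (Patient)",
    },
}

def role_label(rs: dict, arg: str) -> str:
    """Look up the mnemonic description for a numbered argument."""
    return rs["roles"].get(arg, "unknown")

print(role_label(roleset, "Arg0"))
```

Note how little is recorded beyond the numbered roles and their mnemonics: this sparseness is exactly the limitation for metaphor processing discussed in the summary below.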

4.4.1 Previous Applications of PropBank for Metaphor Processing

While PropBank is a valuable resource for semantic role labelling and other tasks, it hasn't been used extensively for metaphor processing. A notable exception is Gedigian et al. [2006], who use PropBank frames to identify metaphors in the MOTION and CURE domains. They employ PropBank as well as FrameNet, using a maximum entropy classifier, and report 95% accuracy in distinguishing metaphoric from literal usage. While this result appears very good, a most-frequent-class baseline on their dataset achieves 92% accuracy, as the majority of their data falls in the metaphoric class.
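To see why that baseline matters, here is a quick sketch with made-up labels in the same proportions reported for their data (92% metaphoric); the label list is fabricated solely to demonstrate the arithmetic.

```python
from collections import Counter

# Toy dataset with the reported class skew: 92% metaphoric, 8% literal.
labels = ["metaphoric"] * 92 + ["literal"] * 8

majority_label, majority_count = Counter(labels).most_common(1)[0]
baseline_accuracy = majority_count / len(labels)

# Always predicting the majority class already scores 0.92 here, so a
# classifier's 95% accuracy is a smaller gain than it first appears.
print(majority_label, baseline_accuracy)
```

This is why skewed metaphor datasets are usually better evaluated with precision, recall, and F1 on the minority class rather than raw accuracy.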

4.4.2 Summary

PropBank is a valuable resource for semantic role labelling, as it contains concise definitions of semantic roles for a large number of predicates. However, the frame files and information about the predicates are relatively sparse with regard to syntax and semantics. We have access to the kinds of roles involved, similar to VerbNet's thematic roles, but we have no structured information about the types of things that can fill the roles, the syntactic structures the predicate can employ, or the semantics of the predicate. PropBank also rarely makes distinctions between literal and metaphoric senses, and doesn't provide any further semantic information regarding metaphor: the rolesets that PropBank uses are slightly too coarse-grained, and thus this nuanced semantic distinction is lost.

4.5 WordNet

WordNet is a comprehensive lexical resource that includes nouns, verbs, and other parts of speech. It includes examples and definitions for a large number of lexical items, including multi-word expressions, as well as providing ontological relationships between them. WordNet has the best coverage of any of these resources, with over 10,000 verb entries and over 100,000 noun entries.

WordNet has been used for a variety of semantic tasks, including metaphor detection. This is typically done by exploiting the hierarchical relationships between lexemes, which can indicate semantic similarity. Despite this potential utility and WordNet's broad coverage, WordNet is deficient in complexity of representation. It provides only minimal information for each lexical item. It doesn't provide any structural information about syntax, and the individual senses are so fine-grained that they are difficult to annotate and difficult to detect [Palmer et al., 2007]. In order to alleviate this problem, the OntoNotes Sense Groupings project was developed to make WordNet senses more tractable [Duffield et al., 2007; Pradhan et al., 2007].

4.5.1 OntoNotes Sense Groupings

The OntoNotes Sense Groupings were developed to alleviate the difficulties of WordNet annotation. The senses in WordNet are very fine-grained, and often difficult to distinguish. Consider the verb "play", which has the following two distinct senses in WordNet:

70. play%2:36:12 (play on an instrument) "The band played all night"

71. play%2:36:01 (perform music on (a musical instrument)) "He plays the flute"; "Can you play on this old recorder"

While it may be possible to find a distinction between these two senses with care, they have proven exceptionally difficult for annotators, as well as for word-sense disambiguation systems. To alleviate this, researchers analyzed WordNet for senses that could be conflated, and re-annotated data until reasonable agreement rates were reached. These new, more coarse-grained senses make up the OntoNotes Sense Groupings, which can be more easily annotated and are more practical for WSD problems.

Despite this improvement, the groupings remain of little use for metaphor detection. No information is provided about the syntax and semantics of the groupings, other than treating them as separate word senses. WordNet does contain certain senses that are literal and certain senses that are metaphoric, and in fact there has been research in metaphor detection using WordNet example sentences as data (see 5.6), but this information isn't marked in any way; the annotations were provided independently. The example sentences need to be manually annotated, and the fine-grained sense divisions often make it impractical to disambiguate between them. OntoNotes offers some improvement, as literal and metaphoric senses tend to be grouped together, but there is still no way of automatically identifying literal or metaphoric senses in the resource. WordNet and the OntoNotes sense groupings based on it have some application as semantic types, but overall they offer less syntactic and semantic information than VerbNet and FrameNet.
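Sense grouping can be pictured as a many-to-one mapping from fine-grained sense keys to coarse groups. The sketch below uses the two "play" sense keys from the examples above; the group identifiers and the third sense key are invented for illustration.

```python
# Many-to-one mapping from fine-grained WordNet sense keys to hypothetical
# coarse OntoNotes-style groupings.
grouping = {
    "play%2:36:12": "play.g1",  # play on an instrument
    "play%2:36:01": "play.g1",  # perform music on an instrument
    "play%2:33:00": "play.g2",  # hypothetical key: engage in a game
}

def same_coarse_sense(s1: str, s2: str) -> bool:
    """Two fine-grained senses count as one sense at the coarse level."""
    return grouping.get(s1) is not None and grouping.get(s1) == grouping.get(s2)

print(same_coarse_sense("play%2:36:12", "play%2:36:01"))  # True: conflated
```

Collapsing the two near-indistinguishable "play" senses into one group is exactly what makes annotation and WSD tractable, at the cost of any distinctions the fine senses carried.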

4.5.2 Previous Applications of WordNet for Metaphor Processing

While WordNet senses don't offer much semantic information on their own, their hierarchy and relations to other concepts have been used in many feature-based machine learning systems for metaphor processing [Peters and Peters, 2000; Krishnakumaran and Zhu, 2007; Veale and Hao, 2008]. These typically involve determining semantic similarity or systematic mappings by exploiting the hierarchy: words that share a common parent at some level are likely related, and this hierarchy can be used to automatically identify relations between word senses.
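The hierarchy-based approach can be sketched with a toy hypernym chain and a lowest-common-ancestor lookup. The mini-hierarchy below is ours; in practice the real WordNet graph would be accessed through a library such as NLTK, but the logic of "shared parent implies relatedness" is the same.

```python
# Toy hypernym links: each word points to its immediate hypernym.
hypernym = {
    "dog": "canine", "canine": "carnivore",
    "cat": "feline", "feline": "carnivore",
    "carnivore": "mammal", "mammal": "animal",
}

def ancestors(word: str) -> list:
    """The word itself followed by its chain of hypernyms, bottom-up."""
    chain = [word]
    while chain[-1] in hypernym:
        chain.append(hypernym[chain[-1]])
    return chain

def common_ancestor(w1: str, w2: str):
    """First node on w2's chain that also appears on w1's chain (lowest shared parent)."""
    a1 = set(ancestors(w1))
    for node in ancestors(w2):
        if node in a1:
            return node
    return None

print(common_ancestor("dog", "cat"))  # carnivore
```

The depth of the shared ancestor is what similarity measures over WordNet typically quantify: "dog" and "cat" meet at a low, specific node, so they count as closely related.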

4.5.3 Summary

WordNet provides complex semantic relations via its hierarchy, but offers no syntactic information, and the semantic information can only be accessed abstractly. Additionally, word sense disambiguation is difficult due to the extremely fine-grained nature of the senses, and these senses are often not informative. The OntoNotes sense groupings have improved this issue, but they still offer no insight into the syntactic properties of the word senses. We will avoid using WordNet in favor of lexical resources with more explicit syntactic and semantic information.

4.6 Lexical Resources Summary

We believe VerbNet is best suited for syntactic approaches to metaphor processing, due to its inherent alignment with syntactic patterns, its rich syntactic and semantic information, and the speed and ease with which we can tag large amounts of data. FrameNet is also a possible option: it contains extremely valuable semantic information, and frame semantics interweaves nicely with domain mappings and metaphoric construals. PropBank provides helpful argument structure information, but we believe VerbNet provides similar benefits and richer information. WordNet provides little extra information in its structure, and while it is valuable for many other tasks we will not employ it for our tasks.5

5We will also be running models and evaluations based on a dataset built from WordNet senses (Section 5.6). Using WordNet as an analytical or computational tool would bias our results in favor of this dataset: in a way, we would be analyzing our test data before developing our models, so avoiding it is also a practical matter.

Chapter 5

Corpora

5.1 Introduction

In addition to lexical resources that can provide detailed information about the syntax and semantics of lexical items, there are also numerous corpora annotated for metaphor information of various types. These corpora are invaluable for supervised machine learning: we require quality training and evaluation data to determine whether we can accurately detect metaphoric expressions. Unfortunately, these resources tend to lack the size and consistency found in other areas of natural language processing. This is likely due to the difficulty of finding a consensus on what actually constitutes a metaphor.

5.2 Difficulties in Annotation

Metaphor is notoriously difficult to pin down: linguists, cognitive scientists, computer scientists, and philosophers all disagree on what counts as a metaphor. This has led to extremely disparate and often small-scale efforts towards metaphor annotation. We will explore some commonly used corpora shortly, but first we will consider some factors that cause difficulty in annotation. This will help us assess the value of various corpora, and highlight the problems in the field.

5.2.1 Conventionalized metaphors

First and foremost, annotation efforts need to be concerned with what exactly they intend to annotate as metaphoric. Expressions can vary greatly in their "conventionality", or the degree to which the metaphor has become common, accepted use. Certain metaphoric expressions have become extremely conventional, and don't seem metaphoric at all to many:1

72. It seems that Roland Franklin, the latest unbundler to appear in the UK, has made a fatal error in the preparation of his 697m break-up bid for stationery and packaging group DRG. (M)

73. I do not regard property profits as earnings. (M)

74. (he) was killed in an aircrash in 1958. (M)

In example 72, "make" is annotated as metaphoric (perhaps evoking the CREATING IS MAKING metaphor from the Master Metaphor List), but this is a very conventional use. We conceive of errors as something that is typically "made", and it is difficult to conceptualize this metaphorically. In example 73, the literal meaning of "regard" is seeing something, and the metaphor involved (UNDERSTANDING IS SEEING) is extremely common. Without knowledge of the original meaning of "regard", this also may be difficult to conceptualize as metaphoric. Finally, metaphors of containment can be evoked by prepositions, as in example 74, in which a year is conceptualized as a container; the frequency of such uses makes them extremely conventional.

While we will avoid a full analysis of the nature of metaphoric meaning over time, it is important to note that many expressions that were once considered metaphoric have now become conventionalized, and their status is perhaps unclear. Each metaphor annotation project has to make decisions with regard to the level of conventionality it considers to be metaphoric. Many projects rely on binary annotation, with a word or phrase being either fully metaphoric or literal, which is perhaps conceptually challenging, as the change in use of words and phrases over time more naturally suggests that metaphoric meaning exists on a scale.

1Examples and annotations from the VUAMC data.

5.2.2 Unit of analysis

Another difficulty is determining the unit of analysis in metaphor annotation. By far the most prominent option is word-level: either every word in a sentence is annotated for metaphor, or one particular word in a sentence is annotated. The latter approach makes the task faster and perhaps easier, but eschews the importance of context in understanding metaphors. Consider the following passage:

75. Here is the point. (M)

In this passage, we cannot know from the word alone whether "point" is used metaphorically; we need the sentential context. We might expect it to be a metaphoric point, given the conventionality of using this expression to refer to an abstract destination, but if this sentence is preceded by a reference to a knife, or to scissors, this could easily be interpreted literally. The context here provides background knowledge that we can frame the metaphoric word around. Annotation at the word level ignores this kind of context.

Annotation at the phrase or sentence level has another issue. While we prefer to know the context that a given phrase is in to understand its metaphoricity, annotating at a higher level requires more detail: to understand the metaphoricity of a sentence, we would also like to understand which lexical items trigger the source and target domains, or whether they are understood implicitly. We need to know which parts of the phrase or sentence evoke which metaphorical components in order to extract the correct meaning for the sentence. This process is exceedingly difficult, as metaphor is often difficult to parse and even humans often disagree about potential metaphors.

5.2.3 Different kinds of Figuration

Metaphor is not the only type of figurative language possible, and annotation projects often ignore the difference between metaphor, metonymy, hyperbole, and other more abstract kinds of figuration. In metaphor annotation, we would prefer to find cases where the source and target domain are known, and we can point to a conceptual mapping between them. In other kinds of figuration like metonymy, there may be only one domain evoked, or the language may not evoke a direct mapping. While all of these meaning schemas are interesting and perhaps necessary to understand, they are often conflated in metaphor annotation, and this can detract from system performance. Assuming each type requires a different kind of knowledge representation or algorithm, lumping them together would be expected to harm performance. Consider the following UNDERSTANDING IS SEEING metaphor, analyzed by Sweetser [1990]:

76. I see what you mean.

We can conceptualize this as metaphoric, with the domain of understanding being understood through visual seeing. However, in many cases the literal seeing event is a part of the understanding: if we view a presentation or read a paper, we might consider this metonymy, with the literal seeing event standing for the conceptual understanding event. Metonymy and metaphor are closely linked, sometimes context dependent, and difficult to annotate.

These factors all make metaphor annotation exceedingly difficult, but we still rely on corpus data for machine learning. We experiment with four different corpora that have been used before for metaphor processing tasks. These corpora are:

• The Vrije Universiteit Amsterdam Metaphor Corpus (VUAMC) [Group, 2007; Steen et al., 2010]

• The Language Computer Corporation Metaphor Corpus (LCC) [Mohler et al., 2016]

Dataset            Count     % Met
VUAMC (All POS)    116,622   12%
VUAMC (Verbs)      23,113    28%

TABLE 5.1: VUAMC counts

• The TroFi Dataset (TroFi) [Birke and Sarkar, 2006]

• The dataset from Mohammad et al. [2016] (MOH)

We will analyze each of these corpora for strengths and weaknesses. Some factors we are particularly interested in are the level of detail in the annotation (just a binary metaphoric/literal distinction, or more?), the size of the dataset, the domains included, and the consistency of the annotation. We believe each has different benefits to offer, so our experiments will encompass all four datasets.

5.3 VUAMC

The largest and most prominent metaphor corpus available is the VUAMC [Steen et al., 2010]. This corpus contains approximately 200,000 words from four different domains: academic, news, fiction, and conversation. Each word in the corpus is annotated for metaphoricity, along with a variety of other related information. The annotation scheme used is the Metaphor Identification Procedure (VU), or MIPVU, with substantial additions. This procedure involves checking a word in context, comparing it to dictionary definitions, and assessing whether the target word has a more basic or concrete meaning in the dictionary. A brief summary of the core of the process is shown in Figure 5.1. The annotators also include additional meta-information to refine this procedure. They indicate differences between direct and implicit metaphors, mark where cases are borderline, and indicate words that potentially signal metaphors.

FIGURE 5.1: Metaphor Identification Procedure (MIP)

The VUAMC data has been an invaluable tool for metaphor research, for two primary reasons. First, it is the first corpus of metaphor annotation of substantial size. There have been a variety of other projects, but at 200,000 words this corpus remains the primary resource for machine learning, as it contains sufficient samples to effectively employ deep learning approaches, which may struggle on smaller datasets.

Second, the VUAMC data has been the basis of the 2018 Metaphor Detection Shared Task [Leong et al., 2018], the first attempt at standardizing comparisons for computational metaphor algorithms. Historically, the disparate nature of goals, datasets, annotation choices, and evaluation procedures has engendered substantial difficulty in drawing comparisons between algorithm and model choices for metaphor processing. The 2018 Shared Task alleviates some of these problems by providing a standardized training and test split for the VUAMC data. This removes two major problems in using this corpus. The first is that variation in the data makes comparison between algorithms difficult if the training and test data are split randomly; the consistent split allows for more consistent evaluation. Second, and much more importantly, the training and test datasets provided by the shared task make explicit which tokens are evaluated on. As the VUAMC procedure annotates all tokens, including prepositions, pronouns, and other items of dubious metaphoricity, systems can achieve wildly different performance if they include or exclude stopwords or certain parts of speech, or have other minor variations in the training and evaluation data they use. The shared task provides explicit notation for which tokens to evaluate on, finally normalizing the procedure for evaluating metaphor processing.

Despite these obvious advantages, there remain significant problems with this dataset. First and foremost, the annotation procedure is based on dictionaries and historical uses, which is unintuitive, sometimes yielding very strange and/or indecipherable metaphoric and non-metaphoric annotations.
To explore these difficulties, we will look at three examples: the polysemous verb "get", the strangely annotated verb "conduct", and the pronoun "this". The verb "get" is extremely polysemous in English (WordNet lists 36 different senses for the verb form). It has a definite concrete and literal usage, involving acquiring something from a source. It also has many possible metaphoric usages that employ the source domain of TRANSFER:

77. I’ve got another bit of good news. (L)

It is present in many idiomatically combining constructions.

78. I’m too old to get used to women buying my drinks. (L)

79. Plenty of housing if those stupid farts at the council got round to repairing it. (L)

Additionally, in many cases it appears to have a somewhat "light" usage, where it contributes very little to the semantics but allows the use of a nominal predicate. It also has a modal usage, where it also contributes only minimally to the predication of the utterance:

80. People mustn't think that because there's a computer, they've got to think of a use for it. (M)

81. With an auction, you've got to be certain that at least two buyers are there who can commit themselves (...) (M)

82. I’m too busy to get involved in this sort of thing. (L)

83. What’s got you so tickled? (M) (GETTINGAPROPERTYISGETTINGAPOSSESSION) 100 84. If he got hurt, my guy used to suffer, too. (GETTINGAPROPERTYISGETTINGA

POSSESSION) (L)

The verb "get" has many more possible usages, and is nuanced and often difficult to interpret. While all the examples above (77-84) appear in the VUAMC data, it is difficult to tell how decisions were made with regard to their annotation. Specifically, the cases marked "(M)" are annotated as metaphoric, while the cases marked "(L)" are annotated as literal. We would expect these instances to be handled consistently: verbs that employ the transfer metaphor should be metaphoric. Similarly, the idiomatic cases in 78-79 share the same syntactic and semantic properties, and should be annotated the same way. The same applies to the "light" usages found in 80-81. However, the corpus data reveals inconsistencies across the annotation of this verb. These inconsistencies are likely due to the difficulty of identifying metaphors for especially abstract and/or polysemous words like "get". However, we also notice interesting irregularities when looking at the verb "conduct". Consider the following two instances from the VUAMC data:

85. GIS research into site selection for non-nuclear hazardous waste has been almost exclusively conducted in North America and has yet to be matched in the UK. (L)

86. The Imperial Cancer Research Fund and the Cancer Research Campaign are conducting a joint study of the Bristol Cancer Help Centre (...) (M)

These examples contain nearly identical syntax and semantics: the preferences for the verb "conduct" and its meaning are very similar in both. However, 85 is annotated as literal while 86 is annotated as metaphoric. This may be because 85 has no subject, so selectional preference violations are hard to ascertain, while 86 has a metonymic subject in which the "Fund" and "Campaign" represent the people they employ; but these factors shouldn't influence the metaphoricity of the verb "conduct". This may be an error in the annotation process, but such contradictory decisions appear relatively frequently. They also reveal another problem. In inspecting the sentences containing the literal example 85, we notice that a significant continuous chunk of the data lacks any metaphor annotation.2 In this chunk we see patterns that exactly mirror parts of the corpus that are annotated as metaphoric elsewhere, so it appears that there are gaps in the annotation. It is unclear whether the omission is accidental or deliberate, but the data doesn't appear to differ from any other chunk with regard to actual metaphoricity. This chunk of missing annotations may be indicative of other gaps in the data, and these gaps are likely the cause of some of the confusing annotation decisions such as 85 and 86.

This leaves us with data containing significant gaps, which is particularly problematic when linguistic analysis is required. It becomes increasingly difficult to identify syntactic patterns that differentiate literal and metaphoric usages when the annotation of these categories is inconsistent. It also increases the noise when training machine learning algorithms: we cannot expect to achieve reliable performance, regardless of algorithm or model, when the data contains systematic gaps or inconsistencies.

Another consequence of the VUAMC annotation is that it doesn't include any information about source and target domains.
This is a practical decision: identifying and agreeing upon source and target domains, and even metaphoric interpretations, is a difficult task, and for binary machine learning approaches the scale of the VUAMC data can outweigh the absence of finer-grained semantic information about the included metaphors. However, metaphor identification in itself is a task of limited value. Identifying which words are metaphoric is only a first step towards proper natural language understanding, and to perform further metaphor interpretation automatically, we will require richer datasets that offer some insight or annotation involving the source and target domains, proper literal translations for metaphoric utterances, or proper semantic representations.

2 Notably, sentences 766 to 1012 in the blg-fragment02 section, almost 8,000 words of data, lack any metaphor annotation. This appears to be the case in multiple other sections of the corpus, but determining exactly where metaphor annotation is missing is difficult.

Despite these weaknesses, the VUAMC corpus is an effective and practical tool for using machine learning to identify metaphoric items. It may be possible to use this resource as a jumping-off point, first identifying all metaphoric words as well as the constructions they belong to, then leveraging this information and other resources to provide better semantic representations. We have seen from Sullivan (pg 87) that verbs tend to evoke source domains, so we may be able to identify these using the VUAMC data. Helpfully, there are other metaphoric resources that may be practical for exploring computational methods for identifying source and target domains. The primary resource for this kind of task is the Language Computer Corporation (LCC) metaphor dataset.

5.4 LCC

Another resource that includes more fine-grained annotation for metaphor interpretation is the LCC metaphor dataset [Mohler et al., 2016]. This dataset includes sentences annotated with source/target metaphor pairs. These pairs are linguistic phrases that either have no syntactic relation, or are marked as metaphoric, non-metaphoric, or unclear. The metaphoric phrases are tagged with source and target domains: there are 33 possible target domains and 172 possible source domains. Each pair is composed of two linguistic phrases, one evoking the source domain and one evoking the target domain. The pairs also include various tags for affect, polarity, and intensity of the metaphor. Some sample LCC annotation pairs are shown below:3

87. Thank Lyndon Johnson, his Great Society, and the War on Poverty.

• Source Phrase: War

• Target Phrase: Poverty

• Source Domain: WAR

3 In further examples, we will mark source domain phrases with the span marker phrase, and the target domain phrases with the span marker phrase, noting their respective source and target domains where relevant.

Annotation Status                           Count
Non-syntactic (no relation between words)   7,541
Metaphors                                   3,036
Non-metaphors                               4,307
Unclear                                     1,381
Total                                      16,265

TABLE 5.2: LCC counts of pairs

• Target Domain: POVERTY

88. Guns to Surpass Car Accidents as Leading Cause of Deaths in America

• Source Phrase: Surpass

• Target Phrase: Guns

• Source Domain: FORWARD_MOVEMENT

• Target Domain: GUNS

89. (their) language accurately reflects the position of all three branches of our government.

• Source Phrase: branches

• Target Phrase: government

• Source Domain: PLANTS

• Target Domain: GOVERNMENT

While LCC has annotations for particular domains, it doesn't annotate everything. The creators selected certain domains to find, and annotated metaphors related to those domains. They ignore other metaphors within sentences, only annotating the particular metaphors they are focused on. This makes for much easier annotation, as complex metaphor chains and difficult interpretations can be effectively ignored if they don't involve the particular domains of study, but it severely restricts the types of tasks this data can be employed for.

This weakness is especially apparent when looking at a lexeme like "link". In many cases, "link" is annotated as metaphoric.

90. You’re looking at a population of gun owners and linking gun ownership to suicide fatality outcomes.

91. This means that Fair Isaac can link the Social Security numbers found on the health care claims to the Social Security numbers found on the credit card apps (...)

92. Interestingly, there are some reviews that concluded that there was insufficient evidence to support a cancer link.

They consider these to generally involve the source domain of MACHINES, indicating that the metaphoric "link" is represented by a physical, mechanical link. This is fairly generic: many kinds of things can be linked to others, and we would expect this metaphoric mapping to extend to many different usages of the verb "link". However, when we look at the data, we find many more instances in which "link" is used similarly but not annotated as metaphoric:

93. Californians have approved plans for a bullet train linking northern and southern California.

94. Michael Totten describes how climate change is linked to other problems such as poverty and species loss

95. By the constitution of 1961, the states of West Cameroon and East Cameroon were linked together into a federation.

In 93, "link" is used to connect two cities. This is more concrete, but still not a direct instantiation of a physical, mechanical link. Examples 94 and 95 are clear cases of abstract concepts being connected using the verb "link", but they lack any annotation in the LCC data. This inconsistency is due to the nature of the project: the annotators are only concerned with metaphors that incorporate certain target domains, and "climate change" and the various states of Cameroon don't fall into these domains.

This example shows the difficulty in using this corpus for syntactic analysis. We can understand the kinds of metaphors "link" is used in, particularly how it evokes the domain of machinery to highlight a connection between abstract concepts. However, we cannot fully understand its syntactic and distributional properties, as there are many cases where it evokes a metaphor but isn't annotated. We would need annotation over all lexemes to be able to effectively perform this analysis.

The same problem is apparent when looking at verbs of destruction. We can directly observe these difficulties in the annotation of the verb "destroy":

96. We merely take their children away, destroy their marriages, and destroy their quality of life.

97. My posts destroyed your phony narrative blaming only one party for gun control.

In Example 96, the first "destroy" token is part of a metaphor with PHYSICAL_HARM as the source domain and MARRIAGE as the target. However, the second "destroy" is unannotated, despite being used to evoke the same source domain metaphorically. This is because MARRIAGE is a domain the creators included in their searches and data analysis, while "quality of life" doesn't evoke a target domain of interest. In Example 97, the same source domain is evoked and the verb is used metaphorically, but it is not annotated as part of a metaphor, as the target domain of focus here is gun control. The limited focus on certain domains allows for large amounts of data to be collected, but severely hinders the usefulness of the annotation at a lexical level, particularly with regard to domains that aren't included.

These examples highlight the weakness of the LCC corpus. It does not annotate consistently at the lexical level, and only annotates the particular metaphors the creators were searching for. This may be due to their methods, which involve starting with the metaphors they are looking for and then using corpus data to find examples. This method of metaphor identification can yield positive examples for identifying certain phenomena, but makes it difficult to find novel instances and doesn't provide complete coverage of the data. Because of this inconsistent annotation, we cannot use the data to compare properties of metaphoric and non-metaphoric uses of the same lexical items. We cannot derive patterns from the metaphoric uses of a word such as "link", because it is not annotated as metaphoric in cases where it isn't part of the metaphors being sought. We can, however, use the data to analyze verbs specifically when they are used metaphorically. For instance, we can find all annotated instances of "link" that appear metaphorically and comment on their similarities, but we are unable to make meaningful comparisons to non-metaphoric uses of the same verb.
Despite these limitations, the LCC corpus affords a possibility the VUAMC annotation does not: training supervised systems that can classify source and target domain elements. This allows for a different kind of metaphor processing task. Instead of identifying whether individual words are used metaphorically or not, we can instead attempt to identify the source and target domains that are evoked by particular metaphoric utterances. The details of this task are fully explained in Section 6.1.

5.5 TroFi

The TroFi dataset, developed in 2006, was built using a clustering-based approach, classifying sentences containing 50 different English verbs into literal and non-literal clusters. These sentences are drawn from the Wall Street Journal. The clusters were developed using the TroFi system [Birke and Sarkar, 2006]. This system uses unsupervised word-sense disambiguation as well as clustering, relying on sentential context to predict either literal or non-literal sentences. They cluster 5,829 sentences for 50 different verbs into literal or non-literal clusters. In addition, they experiment with adding hand-annotated data to improve the seeding of the clusters. This means many of the instances have gold-standard hand annotation. A summary of the number of instances and their annotation status is shown in Table 5.3.

Annotation Status     Count
Literal               1,605
Non-Literal           2,130
No Hand Annotation    2,694
Total                 5,829

TABLE 5.3: TroFi counts

This dataset has numerous pros and cons. It provides 3,737 manually labelled sentences, with the main verbs annotated as either literal or non-literal, which is sizable and usable for statistical machine learning. The automatic clusters may be useful as well, but they contain numerous errors which introduce extra noise to an already difficult problem. Consider the following examples:

98. Mr. Dassler absorbed 40.5 million marks by waiving repayment on a personal loan (...) (L)

99. It got the government to absorb half the cost (...) (L)

These examples of the verb "absorb" we would like to consider metaphoric: they both evoke the mapping MONEY IS A LIQUID. However, they appear to be incorrectly clustered into the literal cluster. This is understandable, as the methods are unsupervised and there will necessarily be errors in the clustering. For this reason, previous work evaluating against the TroFi dataset is sometimes limited to the roughly 3,700 sentences with manual annotation.

The distinction made between literal and non-literal may also be somewhat concerning, as we are focused on metaphor processing, and there are many types of non-literal language that are not metaphoric (metonymy, hyperbole, etc.). Two factors alleviate this concern. First, when inspecting the data, it appears that the majority of non-literal instances are in fact metaphoric, rather than some other kind of non-literal language. Because of this, we will use the manually annotated part of the TroFi dataset in our experiments as an additional metaphor processing task, with the goal being to classify each sentence as either literal or metaphoric (Section 6.1). Second, if we maintain that metaphor processing techniques are likely applicable to other types of figurative language, we can posit that our improvements to metaphor processing can be applied to this dataset as well, and still hypothesize that they will improve performance. This difference will be most apparent when analyzing outside training data, as our approach to collecting this data relies on analysis of metaphor, rather than other types of figuration. See Chapter 9 for full analysis of this process and the results across datasets.

5.6 The Mohammad et al Dataset (MOH)

Another dataset annotates sentences from WordNet as literal or metaphoric.4 Mohammad et al. annotate 1,639 WordNet senses of 440 different verbs as either metaphoric or literal via crowdsourcing. They use these to research emotion and metaphor, but their annotation can also function as a standard NLP task: identifying which of these senses is metaphoric. Other work has used a subset of this dataset (dubbed MOH-X) [Shutova et al., 2016]. This subset contains 647 instances with the subject and object extracted, and the distribution of metaphoric instances is roughly even. Let us consider some examples of the verb "grab" from the MOH dataset:

4 This dataset lacks an official name, but is referred to as "MOH" by Gao et al., so we here follow that convention.

Dataset                   Count   % Met
Mohammad et al.           1,639   25%
MOH-X from Gao et al.       647   49%

TABLE 5.4: MOH counts

100. The passenger grabbed for the oxygen mask. (L)

101. She grabbed the child’s hand and ran out of the room. (L)

102. This story will grab you. (M)

The first two senses are concrete literal uses, while the third employs a personification metaphor. These are annotated through crowdsourcing: the annotators were presented with example sentences from WordNet and asked to determine whether the verb in question was used literally or metaphorically. They were given brief descriptions of each category as well as examples to learn from. These descriptions are as follows:

• Literal usages tend to be: more basic, straightforward meaning; more physical, closely tied to our senses: vision, hearing, touching, tasting

• Metaphorical usages tend to be: more complex; more distant from our senses; more abstract; more vague; often surprising; tend to bring in imagery from a different domain.

These definitions are much more lax than those of the VUAMC annotation. They are written to inform non-experts of the kind of annotation needed, and allow people unfamiliar with conceptual metaphors, source-target mappings, and other metaphor theory to make consistent decisions about what kinds of things are metaphoric. This is encouraging, and the process can be easily extended to other resources.

5.7 Summary

There are many metaphor datasets available, but these four are promising for a variety of reasons. They contain enough instances to train viable machine learning systems, they include full sentences and thus allow syntactic structures to be used as features, and they have state-of-the-art baselines that can be improved upon. While they all have problems, specifically differing definitions of metaphoricity and often incomplete annotation, their similarities lead us to believe that the methods we've devised to improve metaphor processing can be applied effectively across these datasets. We may even consider them to be different tasks: the VUAMC data has tags for all words, and thus can be viewed as a sequential tagging task, while the TroFi and MOH-X datasets contain only one annotation per sentence, and thus can be viewed as sentence classification tasks. The LCC dataset hasn't been extensively used for metaphor processing, but we can devise multiple similar tasks based on its source and target annotations. We will fully elucidate the details of these tasks in the following sections.

Chapter 6

Methods

In this chapter we will define the metaphor processing experiments we run. Our overall problem is identifying appropriate tasks, and determining how to implement what we've learned from our linguistic and computational analyses of metaphor to improve performance on these tasks. To do this we employ multiple methods based to varying degrees on syntactic influences. This in turn will allow for better metaphor processing, which can be employed in NLP pipelines to improve downstream performance on semantic tasks. Our methods will be applied across three dimensions: the task undertaken, the method employed, and the algorithm we use to leverage that representation computationally. Table 6.1 summarizes the possible components for each dimension. We will first give an overview of each dimension; our analysis is then organized by method, exploring each method in turn for all of our tasks and algorithms.

Our eventual goal is to fully explore each of these three dimensions. Figure 6.1 shows a conceptualization of the problem: each cube within the diagram represents a combination of task, syntactic feature, and algorithm.

Tasks                           Methods                  Algorithms
Metaphor Identification         Dependency Structures    Baselines
  VUAMC Tagging (all)           VerbNet Structures       Standard ML
  VUAMC Tagging (verbs)         VerbNet Embeddings         Support Vector Machines (SVM)
  MOH-X                         Distant Supervision      Deep Learning
  TroFi                                                    Gao et al.
  LCC Tagging (all)                                        BERT
  LCC Tagging (verbs)
Domain Identification
  LCC Domain Tagging (all)
  LCC Domain Tagging (verbs)

TABLE 6.1: Tasks, methods, and algorithms

FIGURE 6.1: Dimensions of analysis of task, syntactic feature, and algorithm. We aim to fill this three-dimensional space, assessing which areas yield the best performance.

6.1 Tasks

The field of metaphor processing remains hindered by the lack of a practical task. Even with the shared task, which has standardized evaluation for the VUAMC dataset, this problem persists. Identifying which words in an utterance are metaphoric is the simplest form of the problem, and can be easily expressed as a binary machine learning problem, but it is difficult to see how it can contribute to a greater understanding of metaphoric language, or be used practically in a natural language understanding system. Metaphor identification as a task fails to answer key questions that are necessary for metaphor understanding: What is the metaphoric mapping involved? What are the source and target domains? How is the metaphor interpreted? To answer these questions, we need to do more than metaphor identification, and we need more complex representations that address these issues, but the datasets and evaluation procedures remain inconsistent and non-standardized.

Another alternative task is that of generating literal paraphrases for metaphoric phrases [Shutova et al., 2012; Bollegala and Shutova, 2013], which provides a practical goal for metaphor systems. Generating a literal paraphrase perhaps sheds some light on how metaphors are understood, and certainly allows a downstream model to use the literal paraphrase rather than deal with the tricky metaphor. However, even this improvement doesn't fully capture what humans understand given a metaphoric expression. In an ideal world, we would prefer our metaphor system to be able to transform a metaphoric expression into a literal one, while also including the special properties that make the metaphor unique or evocative. When we say "a ship ploughs the mighty waves", a literal paraphrase may capture that the ship moved through the water, but surely the metaphoric expression intends to evoke more than that.
In this work, as we are focused on syntactic and computational representations, we restrict ourselves to relatively basic metaphor processing tasks. Our goal is to develop methods for two classes of tasks: binary metaphor identification, and source/target domain identification. These tasks can be performed over a variety of possible datasets, which were explained in depth in Chapter 5.

6.1.1 VUAMC

The largest and most used metaphor dataset is the VUAMC. It allows for sequential tagging: as every word is annotated, we can use sequence-based models like the LSTM deep learning architecture. While we've seen some weaknesses in its annotation, the dataset is the largest (at approximately 200,000 words) and most completely annotated available. We will mirror the setup of the metaphor shared task [Leong et al., 2018], using 76% of the data for training and 24% for testing. We report our results on the split provided by the shared task, but also use randomized splits to ensure statistical significance (see Section 6.4.1).
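The splitting procedure can be sketched as follows. The corpus items here are placeholders; only the 76/24 proportion and the seeded shuffling (which yields the randomized splits used for significance testing) reflect the setup described above:

```python
import random

# Sketch: a seeded 76/24 train/test split. Re-running with different seed
# values produces the randomized splits used for significance testing;
# the seeds themselves are arbitrary.
def split_corpus(items, test_frac=0.24, seed=0):
    items = list(items)
    random.Random(seed).shuffle(items)
    n_test = round(len(items) * test_frac)
    return items[n_test:], items[:n_test]  # (train, test)
```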

6.1.2 MOH-X

We also evaluate our methods on the MOH-X dataset, a version of the data first developed by Mohammad et al. [2016] and further refined by Shutova [2016] based on WordNet definitions, as well as on the TroFi dataset [Birke and Sarkar, 2006; Birke and Sarkar, 2007]. The MOH-X dataset has 647 sentences taken from WordNet, each annotated as metaphoric or not. We follow the intuitions of Gao et al., treating the target verb as the metaphoric word in each sentence while using the context as features. This corpus lacks full annotation, and thus cannot be treated as a sequence tagging problem: we must tag the sentence as a whole, and have no information about the metaphoricity of any word other than the target verb. While we would ideally mirror previous approaches that use 10-fold cross-validation on this dataset, this process is expensive in terms of time and yields extremely variable results. Instead, we opt to mirror the data split of the shared task (76% train, 24% test), allowing for standard comparison across datasets within reasonable time.

6.1.3 TroFi

Similar to the MOH-X dataset, the TroFi dataset is only annotated at the sentence level. We know the target verb, and whether it is used literally or metaphorically, but lack tags for any other words in the dataset. While the project initially involved using unsupervised clustering to produce semi-automatic metaphor annotations, we will only use the sentences for which manual distinctions between literal and metaphoric instances were made. This still provides a substantially larger dataset than the MOH-X, with 3,737 sentences. We will again mirror the shared task train/test split for this dataset.

These three binary classification tasks are all attested, with known baselines and experimental setups that can be replicated. However, they are also simple, and don't tell us much about the semantics of the metaphors involved. To go one step further in this direction, we will explore using the LCC data, which has source and target annotation.

6.1.4 LCC

The LCC dataset is not frequently used for automatic metaphor detection, so we have had to formulate novel tasks based on its annotation. Recall that the LCC contains 4,311 annotation pairs, each containing a source phrase and a target phrase that evoke certain metaphors. Each pair is annotated with rater confidence, polarity, intensity, and other features. Not all words in a sentence are annotated, but these phrases often contain multiple words. This provides a much richer dataset, as it includes domain mappings and allows for at least a first step towards better interpretation. We use this resource to develop two primary tasks: binary identification and source/target identification.

For binary identification, we treat each sentence as a sequence, with all the words that participate in a metaphoric mapping (either as source or target) as the positive class, and those that don't as the negative class. This mimics the VUAMC sequence tagging problem, only instead of finding words that are used metaphorically, we are finding words that participate in a metaphor, evoking either the source or the target domain. The difference from the VUAMC annotation is noticeable when comparing the data: we typically wouldn't consider items from "target" domains to be metaphoric, and the VUAMC doesn't typically annotate them as such, as the source domain items are the ones evoking the metaphor. The LCC annotates source-target pairs, so we have richer data, and although the target items are often not what we'd think of as "metaphoric" on a lexical level, they participate in evoking the source-target mapping and are important to capture.

Consider the following examples from the LCC data involving the verb "grow":

103. His faith and knowledge of the Father grows always greater (...)

104. The language of faith is learned in homes where this faith grows (...)

105. The musical idea may grow from a simple melody line (...)

106. Today there is a growing desire for sophisticated experiences.

In these examples, "grow" evokes the source domain PLANTS, while the noted elements evoke different target domains (FAITH for 103-104, MENTAL_CONCEPTS for 105-106, by the LCC's annotation). The VUAMC annotation, and most metaphor analysts, would consider the target items in bold to be literal: the source domain verb "grow" is evoking the metaphor, and it is used to understand the abstract target notions through a physical, concrete domain. However, in our work we believe it is important to identify all components of a metaphoric expression, and identifying both source and target elements for a given metaphor is a practical way to handle this problem. We can frame this problem as two different identification tasks: first, identifying whether a word is part of any metaphor (binary classification), and second, identifying which source and target domains are evoked by the target words (domain classification). The first is a standard, straightforward binary classification problem, while the second is a multi-class problem and much more difficult.

The LCC then gives us two primary tasks: binary identification of words participating in metaphoric mappings, and a multi-class task of identifying which domains each word evokes. We run both of these tasks over all parts of speech, as well as just for verbs, yielding a total of four LCC tasks. As with the other datasets, we use a 76%/24% train/test split.

TABLE 6.2: Dataset overview

In summary, we have a total of eight different tasks. Six are binary classification, where the goal is to predict whether a word is metaphoric or not. The two remaining LCC tasks go a step further: the goal is to predict the source and target domain of each word. An overview of these tasks is shown in Table 6.2. For each of these tasks, we will experiment with different syntactic features and learning models to determine which are most useful.
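To make the sequence framing concrete, here is a minimal sketch of converting span-level annotations into per-token labels. The sentence, span indices, and label strings are invented for illustration; they are not actual LCC annotation or its tag set:

```python
# Sketch: turning annotated source/target spans into token-level labels,
# as required for the sequence tagging tasks. "O" marks unannotated
# tokens; for the binary tasks, any non-"O" label counts as positive.
def tags_from_spans(tokens, spans):
    """spans: list of (start, end, label) with half-open token indices."""
    labels = ["O"] * len(tokens)
    for start, end, label in spans:
        for i in range(start, end):
            labels[i] = label
    return labels

tokens = "his faith grows always greater".split()
# Hypothetical spans: "faith" as target phrase, "grows" as source phrase.
spans = [(1, 2, "TARGET:FAITH"), (2, 3, "SOURCE:PLANTS")]
```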

6.2 Computational Methods

Next, we experiment with a variety of machine learning methods. While deep learning is state of the art for many tasks, feature-based machine learning methods are still applicable for some (for example, the state-of-the-art classifier on the TroFi dataset [Köper and Schulte im Walde, 2017]). Feature-based machine learning also allows for better inspection of feature performance, leading to better error analysis. To compare against state-of-the-art performance, we experiment with bidirectional LSTMs, which currently achieve the best performance on many metaphor-related tasks.

6.2.1 Feature-based Machine Learning

Our architecture for traditional feature-based machine learning algorithms is consistent: we first build a baseline based on previously used linguistic features, and then add our syntactic representations. This requires building corpus loaders and featurizers for each dataset, and then tuning each algorithm. We experiment with three different machine learning algorithms that have proven successful for similar tasks: Logistic Regression, Naive Bayes, and Support Vector Machines (SVMs). We ran baseline experiments for each task comparing these three algorithms, and found only minor performance differences between them. SVMs tended to perform best, and also showed minimal variance in results between experiments and feature sets, so we chose to proceed with this algorithm.

Support Vector Machines

As our feature-based algorithm, we use SVMs. SVMs are extremely common for language tasks. They function like logistic regression, in which a boundary is drawn between positive and negative classes. However, while linear and logistic regression admit countless possible boundaries that could split a dataset, support vector machines are specifically designed to maximize the distance between the two classes by identifying the "support", or vectors that define the boundary. This yields a single boundary, maximally distant from the vectors that define the classes. The algorithm allows for increasing dimensionality to better separate the data using different "kernels", and is both flexible and customizable for varied input. SVMs work well for many language tasks, and while they haven't shown particular prowess in metaphor processing, they often outperform logistic regression on many other tasks.

While feature-based algorithms tend to lag behind deep learning in performance, they are still competitive at many tasks. More importantly, they allow for a far greater level of model inspection and error analysis. Training involves learning weights over features, and these weights directly correspond to the effectiveness of certain features. We will therefore use the SVM results not as a competitive machine learning approach, but as a means of assessing feature importance and performing error analysis.

6.2.2 Deep Learning

Recent advances in deep learning have led to algorithms that give significant performance boosts for a wide variety of tasks. The 2018 Metaphor Shared Task was dominated by deep learning algorithms, particularly LSTMs and CNNs. Since then, further advances have been made, employing more complex architectures and better embedding representations as input. We will focus on LSTM architectures, as they yield the best performance in the literature.

Long-Short Term Memory Networks

LSTMs are nearly ubiquitous in modern natural language processing. They are neural networks, with a set of nodes activating based on weighted inputs, plus additional mechanisms to retain information from the weights of previously classified words. Adding a "bidirectional" layer, in which weights can also be retained from right to left, further increases performance in many cases. Additional mechanisms for "attention", which focus the learning process on the most relevant nodes, also tend to improve performance.

Bidirectional LSTMs with attention mechanisms yield state-of-the-art results on a wide variety of tasks, particularly in sequence tagging, where the tags of previous words can be predictive of the current word's classification. The metaphor shared task was dominated by these types of models, with four of the top five systems employing some kind of LSTM. The current state of the art on the task, as well as on other metaphor classification tasks, is a bidirectional LSTM with attention, using ELMo and GloVe word embeddings as input [Gao et al., 2018]. This system is publicly available, and we employ it as a baseline for our neural models.

There is some difficulty in combining these sequence-based models with syntactic features, as state-of-the-art LSTM models don't always interact nicely with syntax: they learn relationships between words, even distant ones, implicitly, so additional syntactic information should theoretically be less useful to an algorithm that already handles these relationships. However, we believe it is still practical: we can include embeddings for syntactic representations (including VerbNet structures), and adding additional syntactically driven data should be effective. The only major loss is that dependency features are impractical to include in this kind of model, as those relationships are already captured implicitly.
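To make the bidirectional-encoder-plus-attention idea concrete, here is a toy sketch in plain NumPy. It is a stand-in, not the Gao et al. system: a simple tanh RNN replaces the LSTM cell, and the weights are random and untrained, so only the data flow (two directional passes, concatenation, attention-weighted pooling) is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def rnn_pass(inputs, W, U, b):
    """Run a simple tanh RNN over a sequence; return all hidden states."""
    h = np.zeros(W.shape[0])
    states = []
    for x in inputs:
        h = np.tanh(W @ h + U @ x + b)   # state carries information forward
        states.append(h)
    return np.stack(states)

def bi_rnn_with_attention(inputs, params):
    """Bidirectional encoding with attention pooling over the token states."""
    Wf, Uf, bf, Wb, Ub, bb, v = params
    fwd = rnn_pass(inputs, Wf, Uf, bf)              # left-to-right states
    bwd = rnn_pass(inputs[::-1], Wb, Ub, bb)[::-1]  # right-to-left states, realigned
    H = np.concatenate([fwd, bwd], axis=1)          # (seq_len, 2 * hidden)
    scores = H @ v                                  # one attention score per token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                        # softmax over tokens
    return weights @ H                              # attention-weighted summary vector

# Toy dimensions: 5-token sentence, 8-dim embeddings, 6-dim hidden states.
seq, emb, hid = 5, 8, 6
inputs = rng.normal(size=(seq, emb))
params = (rng.normal(size=(hid, hid)), rng.normal(size=(hid, emb)), np.zeros(hid),
          rng.normal(size=(hid, hid)), rng.normal(size=(hid, emb)), np.zeros(hid),
          rng.normal(size=2 * hid))
summary = bi_rnn_with_attention(inputs, params)
print(summary.shape)  # (12,): one 2*hidden vector summarizing the sentence
```

The attention weights are exactly the "focus on the most relevant nodes" mechanism described above: tokens with higher scores contribute more to the pooled representation.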

6.3 Syntactic Features and Representations

So far, we've outlined two dimensions: tasks and computational methods. These are straightforward and easy to understand. The core of this work focuses on the next dimension: the introduction of novel methods for improving classification based on syntactic and lexical information. We will provide a brief overview of each of these methods, and then fully explore the implementation details and analyze the results of applying each method to both feature-based machine learning and deep learning models, across all datasets.

We've shown that syntactic information can be valuable for determining the metaphoricity of utterances, including which elements are allowable as source and target domains. In order to implement this theory computationally, we need a practical way to incorporate these syntactic structures into our machine learning paradigms. To do this, we experiment with multiple frameworks using dependency parses and structural information from VerbNet.

Each of these syntactic representations will need to be incorporated alongside baseline features or representations. For traditional machine learning, this baseline involves simple lexical features along with some context, mirroring other machine learning baselines [Beigman Klebanov et al., 2016]. For deep learning, we employ a bidirectional LSTM with attention as the baseline, using only basic embeddings for each word as input. This is the method that currently achieves state-of-the-art results on many metaphor tasks [Gao et al., 2018], and it is relatively easy to augment with better representations.

6.4 Baselines

We are experimenting with both feature-based and deep learning methodologies. Implementing syntactic representations will vary slightly between these two categories, and they both require separate baselines.

For the traditional machine learning algorithms, we use scikit-learn [Pedregosa et al., 2011], which has consistent interfaces that allow for easy testing and deployment of the support vector machine algorithm. As a first baseline, we simply use the unigram of the word as the only feature. This simple baseline is effective in some cases, as some words are used exclusively literally or exclusively metaphorically. As an improved baseline, we include the following set of features, chosen through feature selection:

• lemma

• part of speech

• concreteness rating

• imageability rating

• valence, dominance, and arousal from the Affective Norms for English Words database

• context window of two words

• word embeddings from Word2Vec
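The feature set above might be assembled as follows. Every resource lookup (concreteness, imageability, ANEW norms, embeddings) is a hypothetical dictionary standing in for the real resources, and the lemma is left unnormalized for brevity.

```python
def featurize(tokens, pos_tags, i, concreteness, imageability, anew, embeddings):
    """Sketch of the baseline featurizer for the target word at position i."""
    word = tokens[i].lower()
    feats = {
        "lemma": word,                    # stand-in: a real system would lemmatize
        "pos": pos_tags[i],
        "concreteness": concreteness.get(word, 0.0),
        "imageability": imageability.get(word, 0.0),
    }
    # Valence, dominance, and arousal from ANEW-style norms.
    for dim, val in zip(("valence", "dominance", "arousal"),
                        anew.get(word, (0.0, 0.0, 0.0))):
        feats[dim] = val
    # Context window of two words on either side, as indicator features.
    for off in (-2, -1, 1, 2):
        if 0 <= i + off < len(tokens):
            feats[f"ctx{off:+d}={tokens[i + off].lower()}"] = 1
    # Word2vec-style embedding dimensions (zero vector if out of vocabulary).
    for j, v in enumerate(embeddings.get(word, [0.0] * 4)):
        feats[f"emb{j}"] = v
    return feats

tokens = "The chapter draws attention".split()
feats = featurize(tokens, ["DT", "NN", "VBZ", "NN"], 2, {"draws": 2.2}, {}, {}, {})
print(sorted(feats))
```

A dictionary like this plugs directly into scikit-learn's `DictVectorizer`, which is what makes the mixed string/numeric feature set convenient in practice.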

For the deep learning approach, we use two methods. First, we employ a bidirectional LSTM based on the approach of Gao et al., replicating their baselines. This allows for easy extension of the state-of-the-art models. Note that for these baselines we re-run their experiments rather than simply reporting their published results, yielding slightly different numbers. For the LCC binary classification tasks, we alter their architecture slightly to accommodate the new dataset. For source and target domain classification, we change their methods to allow for multi-class classification.

Algorithm              Trofi  MOH-X  LCC    LCC      VUAMC  VUAMC    LCC Dom.  LCC Dom.
                                     (All)  (Verbs)  (All)  (Verbs)  (All)     (Verbs)
SVM (Unigrams)         .476   .506   .586   .249     .568   .348     .400      .085
SVM (Full baseline)    .518   .543   .633   .510     .656   .517     .596      .221
Bi-LSTM from Gao et al .731   .683   .697   .676     .773   .716     .557      .202

TABLE 6.3: Baseline results

We also experimented with BERT embeddings [Devlin et al., 2018]. These are similar to ELMo, but are based on the transformer architecture: BERT uses a stack of transformer encoders and considers all input tokens simultaneously, making it effectively bidirectional. This deep learning-based architecture has achieved state-of-the-art results on many natural language processing tasks. As BERT embeddings are similar to ELMo in many respects, we believed incorporating them into the Bi-LSTM system of Gao et al. might yield improvements over their ELMo-based input. We replaced the ELMo embeddings with BERT embeddings, using the same model architecture, and found the ELMo embeddings to perform slightly better.

A summary of how these baselines perform for each task is shown in Table 6.3. Results are typically between .5 and .8 F1 score, although the domain classification of verbs is exceedingly difficult: our simple baseline yields only .144 F1. The TroFi dataset is the only task on which the SVM outperforms the neural baseline. This matches Gao et al., who report that the state-of-the-art lexical feature-based model outperforms their LSTM on this dataset. The Gao et al. model significantly outperforms the SVM baseline on all other tasks; for this reason we will use the SVM primarily to analyze feature importance, as the deep learning models are relatively opaque, and we will use the Gao et al. model to analyze performance.

6.4.1 A Note on Significance
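As a concrete sketch of the paired comparison over repeated randomized splits discussed in this section: the F1 scores below are invented, and the critical value 3.250 is the two-tailed t threshold for p < .01 at 9 degrees of freedom (10 trials).

```python
import math

def paired_t(scores_a, scores_b):
    """Paired t statistic over per-trial F1 scores from matched randomized splits."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)   # sample variance of diffs
    return mean / math.sqrt(var / n)

def significant(scores_a, scores_b, t_crit=3.250):
    """Two-tailed p < .01 with 10 trials: df = 9 gives a critical t of ~3.250."""
    return abs(paired_t(scores_a, scores_b)) > t_crit

# Hypothetical F1 scores from 10 randomized 76%/24% splits of one dataset,
# with and without an added feature group.
baseline  = [.651, .660, .655, .648, .658, .653, .657, .650, .656, .654]
with_deps = [.672, .679, .676, .668, .680, .671, .678, .669, .677, .673]
print(significant(with_deps, baseline))  # True
```

Because the same random splits are used for both systems, the test operates on per-trial differences, which is what lets even small mean improvements reach significance when the variance across trials is low.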

Many of the performance differences we see in experimentation are fairly small: in many cases the results differ by less than one point of F1 score. There is considerable debate about determining significance for natural language processing tasks, especially when the experiments take a long time to run. We experimented with a variety of methods, including the bootstrap [Efron, 1979] and 10-fold and 5-by-2-fold cross-validation [Dietterich, 1998]. We found the bootstrap to be ineffective: results were consistently calculated to be significantly different even when the same experiment was run twice. Cross-validation was more effective, but difficult due to the time necessary to run experiments: a single pass of 10-fold or 5-by-2 cross-validation can take more than 24 hours for some tasks and experiments. Additionally, state-of-the-art models on many datasets are based on train-test splits, and we would ideally compare our models to those.

We decided to run repeated trials using randomized 76%/24% splits of each dataset. While time consuming, this mirrors the experimental setups of other methods, and allows us to conduct multiple trials to determine means and standard deviations for our methods. We report results as significant when we find p < .01 after running 10 randomized trials.

6.5 Summary

We are exploring four different methods for improving computational metaphor processing over eight different tasks. We will implement these methods in both standard feature-based machine learning models and deep learning models. Through this approach we will broaden our understanding of how these syntax-based methods can impact metaphor processing, how they integrate with linguistic features, and how to optimize these methods and algorithms for metaphor processing performance.

We will now fully outline the syntax-based methods we use to improve processing. We will compare each against both baselines for all eight tasks, describing in detail the methods used: the architectures, additional resources, and innovations we've added to best incorporate these features. The next four chapters deal with each method in turn: dependency structures, VerbNet classes, VerbNet embedding information, and distant supervision.

Chapter 7

Dependency Structures

7.1 Introduction

The first syntactic features we incorporate are basic dependency structures. These include head words for the target, as well as the target's dependents. We can also use features based on the type of relation between the target and its head and dependent words. The combination of these features provides a relatively straightforward representation of the kind of argument construction a word is participating in.

Figure 7.2 (from Stanford CoreNLP [Manning et al., 2014]) shows how to extract relevant features from a simple dependency tree for a given metaphoric verb. The figure is a full dependency parse of the sentence "The chapter as a whole draws attention to a number of key methodological issues." From a linguistic standpoint, we can understand this sentence to be metaphoric due to a selectional preference violation: the verb "draw" requires some kind of animate subject and concrete object. Computationally, if we were to classify the verb "draw" in this sentence, we can see from the dependency parse that it has a subject (marked "nsubj"), "chapter", and a direct object ("dobj"), "attention". This already provides us with additional information about the syntactic construction: we know there is a subject and an object, indicating the valency of the verb. It is also likely sufficient for a machine to detect this metaphor: if it has good enough knowledge of the lexical semantics of "draw", it knows the kinds of arguments the verb prefers, and if it has good enough knowledge of "chapter" and "attention", it knows that these preferences are being violated. In addition, we have an element marked "nmod", also headed by the verb, which indicates where the "attention" is drawn to. This gives us further knowledge of the verb:

FIGURE 7.1: Exploration: Dependency parse-based features

FIGURE 7.2: Basic dependency parse

We know that "draw" in the creation sense ("draw a picture") doesn't have a directional component. Other uses, both literal and metaphoric, do contain this physical movement of the direct object ("she drew him in with her charm", "they drew the boat in to shore"). These additional features could help distinguish more difficult cases.

From a practical perspective, we don't necessarily need to know which dependency relations, whether for heads or dependents, are most important: we can bundle them all into features and let the algorithm determine which ones make the most impact. Fortunately, using feature-based machine learning, we can use the algorithm to analyze our results, inspecting which features are strongly weighted and determining, for each dataset, which kinds of dependency relations impact metaphor classification.

These features are straightforward, and basic formulations are sometimes used to combine lexical features of the target word with those of its head and/or dependents. We will follow this tradition, exploring a set of dependency-based features and analyzing how they affect the results for a variety of tasks. Note that deep learning models likely won't be influenced by including dependency information: the bi-LSTM we employ handles long-distance relationships relatively well, and the information captured by syntactic dependencies is likely captured implicitly by the architecture.

For traditional machine learning, many of these features were previously implemented by Stowe et al. [2018]. We maintain their implementation, but adjust the datasets to mirror those outlined in the previous section.
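The argument extraction discussed in this chapter's running example can be sketched over parser-style output. The (head, relation, dependent) triple format is an assumption for illustration, not the actual CoreNLP API.

```python
def argument_features(edges, target):
    """Pull the grammatical arguments of a target verb from dependency edges.
    Edges are (head, relation, dependent) triples (an assumed parser format)."""
    feats = {}
    for head, rel, dep in edges:
        if head == target:
            feats[rel] = dep   # e.g. nsubj -> "chapter": the verb's subject
    return feats

# Parse fragment for "The chapter as a whole draws attention to a number of ...":
edges = [
    ("draws", "nsubj", "chapter"),
    ("draws", "dobj", "attention"),
    ("draws", "nmod", "number"),
]
print(argument_features(edges, "draws"))
# {'nsubj': 'chapter', 'dobj': 'attention', 'nmod': 'number'}
```

From this mapping a system can read off both the verb's valency (a subject and an object are present) and the lexical fillers whose semantics may violate the verb's selectional preferences.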

7.2 Implementation

For feature-based machine learning, we use two methods to implement additional dependency features. First, we add the set of features from the baseline (not including the context window) for the head word of the target, and for all of the target's dependent words. This expands our lexical knowledge to include that of syntactically relevant words. Second, we incorporate bag-of-words-style features based on dependency relations ("nmod", "dobj", and so on). These include a one-hot feature vector for the relation between the word and its head, as well as a bag-of-words-style vector for the relations between the word and its dependents. This directly encodes the types of dependency relations, and should help indicate the number and type of arguments a verb has, accounting for verbs like "hemorrhage" where metaphoricity is directly linked to the argument structure.
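A minimal sketch of these two feature groups, assuming (head, relation, dependent) triples from a parser and a per-word dictionary of baseline features; both input formats are stand-ins for the real pipeline.

```python
def dependency_features(edges, target, lexical):
    """Build the two dependency feature groups for one target word."""
    feats = {}
    for head, rel, dep in edges:
        if dep == target:
            feats[f"head_rel={rel}"] = 1            # one-hot relation to the head
            for k, v in lexical.get(head, {}).items():
                feats[f"head_{k}={v}"] = 1          # baseline features of the head
        if head == target:
            # Bag of relation types over the target's dependents.
            feats[f"dep_rel={rel}"] = feats.get(f"dep_rel={rel}", 0) + 1
            for k, v in lexical.get(dep, {}).items():
                feats[f"dep_{k}={v}"] = 1           # baseline features of dependents
    return feats

edges = [("draws", "nsubj", "chapter"), ("draws", "dobj", "attention")]
lexical = {"chapter": {"pos": "NN"}, "attention": {"pos": "NN"}}
print(dependency_features(edges, "draws", lexical))
# {'dep_rel=nsubj': 1, 'dep_pos=NN': 1, 'dep_rel=dobj': 1}
```

For a verb like "draws" the head-side features stay empty, which previews the result below that head features help little for verbs while dependent features carry most of the signal.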

Algorithm                  Trofi  MOH-X  LCC    LCC      VUAMC  VUAMC    LCC Dom.  LCC Dom.
                                         (All)  (Verbs)  (All)  (Verbs)  (All)     (Verbs)
SVM (Full baseline)        .518   .543   .633   .510     .656   .517     .596      .221
+Head word features        .508   .545   .637   .500     .650   .496     .614      .202
+Head word relation        .507   .543   .628   .497     .650   .496     .614      .202
+Dependent word features   .509   .546   .667   .541*    .655   .500     .621*     .179
+Dependent word relations  .518   .544   .636   .518     .655   .519     .596      .220
+All                       .507   .542   .669   .516     .653   .508     .539      .173

TABLE 7.1: Dependency feature F1 results (* denotes significant improvements over the baseline, p < .01)

Deep learning is likely not going to be affected by the addition of dependency structures as features. The architecture we employ, a long short-term memory network, is known for its ability to capture long-distance syntactic relations implicitly by passing weights between inputs. We ran preliminary studies, adding an extra input for the head and dependents at each time step, but found this did not improve performance in any task or framework. We will therefore omit these features from further study, assuming that the deep learning architecture sufficiently captures dependency relations.

7.3 Results

In most cases the head word features aren't particularly informative: adding them doesn't improve over the baseline. This is perhaps expected for verbs, which are typically the head of their clause, so their head word is likely non-existent or non-informative. However, adding features for a word's dependents yields significant improvements for several datasets (Trofi, and LCC binary classification for all parts of speech).1 It is primarily the features of dependent words that improved performance; the dependency relations themselves were not informative. In the MOH-X dataset, combining all the features together yields improvements over any of them individually. In the cases where we see improvement, the performance gains are between .02 and .04 F1. The LCC domain classification of verbs remains particularly difficult, and adding these features yielded no improvement.

1 Note that while the F1 score for the MOH-X dataset improved by over 3 points, the high variability in results made this improvement not statistically significant. Further testing is necessary to determine the effectiveness of these methods for this dataset.

We can consider these dependency structures proxies for syntactic constructions: they carry information about parts of speech, the number of arguments, and the relations between them. We can see that using them is effective, particularly with regard to dependent words, but the results are not overwhelming, and the connection to constructions is relatively abstract: what we are implementing doesn't directly reflect the aspects of construction grammar we've seen can be informative with regard to metaphor. In Chapter 12 we will explore better possible uses of dependency parses, as well as propose alternatives for better incorporating constructional information into metaphor processing.

Dependency features are relatively easy to implement and give us some improvements, and they will be explicitly linked to other methods as we continue. In particular, we will leverage dependency parses in combination with VerbNet structures (Chapter 8.1) and to identify supporting training data (Chapter 9). For the VerbNet section, we will explore directly implementing VerbNet elements as features, as well as adding them for head and dependent words.
In addition to the intuitive idea that extra information about the syntactic structure of a sentence will improve performance, this framework also allows non-verbal elements to benefit from verb-specific resources like VerbNet: knowing the verb class of a noun's head may improve classification for that noun. For the distantly supervised data, we will use the same features for the new data as for the original datasets, including dependency and VerbNet information. An in-depth look at the combination of methods is given in Chapter 10. Next, we continue with our experimental methods by examining how VerbNet can be used to improve metaphor processing.

Chapter 8

VerbNet Classes and Embeddings

8.1 VerbNet Structures

As we have seen, VerbNet has many structural elements that can be used for metaphor processing, and it sometimes makes direct distinctions between literal and metaphoric classes. We aim to employ the resource in three ways:

• Featurizing structural components of VerbNet classes

• Using neural nets to learn word and structure-based embedding representations which can be used as features or inputs to computational models

• Using hand-annotated VerbNet data to support existing training data.

This allows us to leverage the data from multiple viewpoints. Using features from VerbNet structure prioritizes lexical semantics: the hand-crafted lexical resource is in itself useful for computational semantic tasks. Transforming these structures, including classes and verb senses, into embedded representations relies on the standard deep learning paradigm, in which, given enough contextual data, good representations for machine learning can be learned with relatively little input from an analyst. Leveraging additional training data relies on good annotation and linguistic analysis to determine which classes are literal and which are metaphoric, and aims to use this linguistic analysis to improve deep learning models. Using these three components together merges linguistic and computational approaches to VerbNet, and can best show how the resource can be gainfully employed for metaphor processing.

FIGURE 8.1: Exploration: VerbNet structure-based features

These all follow a syntactic assumption: the classes in VerbNet, as well as the thematic roles and syntactic frames within those classes, are based on the different syntactic alternations the verbs can participate in. Each class, then, represents a bundle of syntactic information, and each verb sense within a class is coupled with that information. While this isn't purely syntax, as each class also reflects the underlying semantics, there is significant syntactic motivation for VerbNet classes. The class itself provides much of this syntactic (and semantic) information, but we can improve further by identifying which thematic roles and syntactic frames are present in a given instance.

8.1.1 Structural Components

Each VerbNet class contains a wealth of information. While some metaphor research has employed VerbNet classes as features [Beigman Klebanov et al., 2016; Stowe and Palmer, 2018], these approaches either don't disambiguate between verb senses, treating all the classes a verb can appear in as a single feature, or use simple VerbNet class tagging as the only feature. Neither fully develops features based on the structure of the resource. Here, we take advantage of all the pertinent information we can find in VerbNet classes. Below is a list of the VerbNet features we aim to use:

• VerbNet senses (classes)

• Syntactic Frames

• Thematic Roles
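As a sketch of how the first of these feature groups might surface in a classifier, the following builds indicator features from a word's own VerbNet class and from the classes of its syntactic neighbors. The `senses` mapping and edge triples are hypothetical stand-ins for the output of a VerbNet disambiguator and a dependency parser.

```python
def verbnet_features(word, senses, edges):
    """Indicator features: the target's own VerbNet class (verbs only),
    plus the classes of its head and its dependents."""
    feats = {}
    if word in senses:
        feats[f"vn={senses[word]}"] = 1             # one-hot class of the target
    for head, rel, dep in edges:
        if dep == word and head in senses:
            feats[f"head_vn={senses[head]}"] = 1    # class of the target's head
        if head == word and dep in senses:
            feats[f"dep_vn={senses[dep]}"] = 1      # classes of the dependents
    return feats

senses = {"draws": "draw-59.3"}                     # hypothetical tagger output
edges = [("draws", "nsubj", "chapter"), ("draws", "dobj", "attention")]
print(verbnet_features("chapter", senses, edges))   # {'head_vn=draw-59.3': 1}
```

Note that the noun "chapter" has no VerbNet class of its own, yet still receives a class-based feature through its head verb; this is exactly how non-verbal elements benefit from a verb-specific resource.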

VerbNet senses are relatively easy to extract given a VerbNet class disambiguator. We use the system of Palmer et al. [2017], as it shows state-of-the-art performance for VerbNet sense tagging. This lets us annotate unseen data with VerbNet class information. We can incorporate these classes directly as features or learn embedding representations, and we evaluate both methods.

In order to add VerbNet structures, we first parse each dataset with a word-sense disambiguation system to identify the VerbNet senses of each verb. For traditional machine learning, we first use these VerbNet classes as one-hot features. This has a direct impact on verb classification, but doesn't affect non-verbal components, which lack a VerbNet class. In order to leverage this class information fully, we also couple VerbNet class information with dependency parses: for each non-verb, we add the VerbNet class of its head word and dependent words as features.

While it would be possible to include new inputs for VerbNet structures in deep learning, we instead choose to develop more appropriate input representations. We will show how to learn pretrained VerbNet class and sense embeddings for deep learning input in Section 8.2. These embeddings are then concatenated with the GloVe and ELMo embeddings, and used as input to the LSTM model.

Frame Prediction

Each VerbNet class contains a set of syntactic frames, which contain information about a verb's syntactic status: its argument structure, thematic roles, prepositions, and other fixed elements. We can use these syntactic frames as proxies for the verb's argument structure constructions, as they contain significant syntactic information. This allows us to bundle a large amount of syntactic information into a single, recognizable component. Additionally, they are directly linked to the verb's semantics: each syntactic frame is paired with a semantic frame that includes the predicates the verb uses. While these semantic predicates aren't directly related to syntax, they are based on VerbNet classes, and as they are linked to syntactic frames, our ability to identify these frames allows us to also include semantic information.

While it is possible to naively represent these syntactic frames purely as a bundle of dependency features, if we would like to identify a specific frame in a specific class we need a more precise method for frame identification. For this task, we developed a machine learning method based on the VerbNet examples.

First, we compress the syntactic frames into a simpler set: many syntactic frames in VerbNet are extremely similar, the distinctions between them are unlikely to be useful, and we would like to restrict the possible classification options to improve performance. This compression is done semi-automatically. We remove extraneous tags on NP and PP elements: PP.location becomes simply PP, NP.theme becomes NP, and so on. Next, we compress wh-phrases into a single category (<+wh-inf>, <+wh-extract>, etc. are combined into a single element). This process trims the set of available frames to 81, down from the original 288. The full list can be found in Appendix B.1.

Next, using this set of 81 frames, we use the VerbNet example sentences as training data, and generate dependency parses for these examples.
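The frame compression step might be sketched as follows; the frame strings are illustrative of the format, not the exact VerbNet inventory.

```python
import re

def normalize_frame(frame):
    """Compress a VerbNet frame description: strip role tags from NP/PP
    elements and collapse wh-phrase variants into one category, mirroring
    the reduction from 288 frames to 81."""
    out = []
    for el in frame.split():
        el = re.sub(r"^(NP|PP)\.\S+", r"\1", el)   # NP.theme -> NP, PP.location -> PP
        if el.startswith("<+wh"):
            el = "WH"                              # <+wh-inf>, <+wh-extract>, ... -> WH
        out.append(el)
    return " ".join(out)

print(normalize_frame("NP.agent V NP.theme PP.location"))  # NP V NP PP
print(normalize_frame("NP V <+wh-inf>"))                   # NP V WH
```

Collapsing near-duplicate frames like this trades a little specificity for a much smaller label set, which directly eases the downstream classification problem.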
We match each parse to the syntactic frame the example came from. This gives us training data: we train a machine learning classifier on the dependency features of the example sentences to predict syntactic frames for unseen sentences. In order to evaluate this method, we experimented with a variety of statistical machine learning algorithms (Random Forests, Naive Bayes, Logistic Regression, and Support Vector Machines), running 10-fold cross-validation on the VerbNet example sentences. Due to the nature of the task, we also experimented with restricting the classifier's predictions to only the syntactic frames present in the verb's VerbNet class. For example, if we have identified a verb as belonging to the grow-26.2.1 class, the algorithm need not pick between all 81 frames, but only the four present in grow-26.2.1. The results of these experiments are shown in Table 8.1.

Algorithm            Unrestricted  Restricted
Random Forest        .725          .820
Naive Bayes          .469          .711
SVM                  .647          .794
Logistic Regression  .705          .805

TABLE 8.1: Frame prediction results (macro F1)

We selected the Random Forest algorithm, as it showed the best performance, achieving an average macro-F1 score of .82 over the 10-fold cross-validation. This classifier works well enough to be practical, and while the method is biased towards the example sentences in VerbNet, we believe it is effective on unseen data, at least to the degree that we can use the output of the frame classifier as a reasonable prediction of the syntactic structure of an unseen sentence. Once we have identified a particular verb instance's syntactic frame, we can use it as a discrete feature: we include the predicted syntactic frame for each verb as a one-hot feature vector.

Thematic Roles

VerbNet classes provide the list of thematic roles that apply to the class as a whole, along with their selectional preferences. These roles are instantiated in the syntactic frames, and vary depending on the verb's syntactic and semantic properties. They represent argument types, a vital component of argument structure, and are thus likely to be informative for metaphor processing.

To identify thematic roles, we use the latest version of the VerbNet parser, which is adapted for VerbNet version 3.3 and produces full semantic parses of each sentence. These parses include the predicate argument structure, as well as token-based annotation of thematic roles for each VerbNet sense. This allows us to extract role information for each argument, including the role type (Agent, Theme, etc.) as well as the VerbNet class attached to that role (Agent of 26.6.2, etc.). This rich information provides us with a better understanding of the argument structure of each verb.

We added VerbNet thematic roles as features for both verbs and their arguments. For the feature-based SVM, we included the lexical items that were present combined with their thematic role: for each word in the sentence, we include the thematic role it instantiates. For example, in "John ate an apple", the word "John" receives an "Agent" feature, and "apple" receives a "Patient" feature.
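A sketch of these role-based features, assuming the parser yields (word, role) pairs; the pair format and the role-presence feature are illustrative assumptions.

```python
def role_features(role_spans):
    """Thematic-role features from VerbNet-parser-style output: each argument
    word is paired with the role it instantiates."""
    feats = {}
    for word, role in role_spans:
        feats[f"role({word})={role}"] = 1   # lexical item combined with its role
        feats[f"has_role={role}"] = 1       # role presence, independent of filler
    return feats

# "John ate an apple" with eat-39.1-style roles:
print(role_features([("John", "Agent"), ("apple", "Patient")]))
```

The word-plus-role features capture selectional preference directly (which fillers occupy which roles), while the presence features generalize across lexical items.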

8.1.2 Results

On their own, VerbNet classes aren't terribly helpful, but they are when combined with dependency information. Knowing the VerbNet classes of a word's head and dependents is much more informative than knowing only the target word's class. We see the best system performance on the LCC data for all parts of speech, in both binary and multi-class classification, using the VerbNet class of the target word's dependents.

Algorithm                     Trofi  MOH-X  LCC    LCC      VUAMC  VUAMC    LCC Dom.  LCC Dom.
                                            (All)  (Verbs)  (All)  (Verbs)  (All)     (Verbs)
SVM (Full baseline)           .518   .543   .633   .510     .656   .517     .596      .221
+VerbNet Sense                .520   .563*  .654   .532     .656   .520     .597      .216
+VerbNet Sense (Dependents)   .515   .548   .667   .543*    .676*  .572*    .628*     .117
+Syntactic Frame              .518   .543   .636   .519     .653   .511     .597      .148
+Thematic Roles               .516   .557*  .635   .520     .652   .556     .566      .177
+Thematic Roles (Dependents)  .507   .557*  .683   .557     .671   .584*    .574      .184
+All                          .532*  .580*  .682*  .561*    .653   .569*    .597      .148

TABLE 8.2: VerbNet structure F1 results (* denotes significant improvements over the baseline, p < .01)

Syntactic frames yield no improvement over the baseline on any dataset. This could be for a variety of reasons. First, the frame classification isn't perfect, and was only evaluated on simple VerbNet example clauses. When applied to real-world data, classification of syntactic frames is likely less effective, introducing additional noise. Second, the syntactic frames may be too specific to VerbNet: VerbNet makes choices about what to include in these syntactic representations, and this information is designed specifically to distinguish VerbNet classes. It is possible that these representations are less applicable to general syntactic patterning, especially the patterns that may affect the metaphoricity of an utterance.

Adding the VerbNet thematic role of the target word sometimes improves performance, but inconsistently. Adding the roles of the target word's dependents gives a much more consistent advantage across datasets. This may be because of the verbal focus: verbs won't typically bear a thematic role themselves, but the thematic roles of their arguments are extremely important in determining metaphoricity. This follows from our understanding of selectional preference: certain verbs prefer certain arguments, and VerbNet thematic roles provide additional knowledge about those arguments.

Combining all the features together consistently improves over the baseline: five of the eight tasks show significant improvements with all VerbNet-based features added. Three of these achieve the best overall performance: the VUAMC data, both for all parts of speech and for verbs only, and the Trofi dataset. In the other cases, the best-performing system uses either the VerbNet class of dependents or the thematic roles of dependents. It is possible that these provide somewhat conflicting information: one indicates the kind of verb that is present, the other the kinds of arguments, and these may lead to different predictions for these datasets.
The LCC verb domain classification remains intractable: no features improve over the baseline.

8.2 VerbNet-based Embeddings

Continuous word representations have become ubiquitous in natural language processing, proving effective at a wide variety of tasks, particularly those that require semantic knowledge. Numerous models have been used to generate and improve these word embeddings [Mikolov et al., 2013; Pennington et al., 2014]. These representations are continuous vectors based on context, in which a word's meaning is determined by the words it co-occurs with. They have been extended to various other levels of representation, including sentences [Kiros et al., 2015] and documents [Dai et al., 2015].

Despite their effectiveness, these embeddings have limitations. Word-level embeddings tend to be sense-agnostic: different word senses are aggregated into a single embedding, which may dilute the distinctions between the senses. They often ignore some of the more subtle features of word similarity that have been developed in hand-crafted resources. One possible approach to alleviating these concerns is retrofitting embeddings, constraining them to the properties of lexical resources, which has yielded improved results [Faruqui et al., 2015]. There are also numerous methods for training sense-specific embeddings [Athiwaratkun et al., 2018; Rothe and Schütze, 2015; Chen et al., 2014].

FIGURE 8.2: Exploration: VerbNet embedding-based features

To incorporate better continuous representations from VerbNet, we use word embedding methods, adapted to learn representations for classes and other structural elements instead. For verbs, this is done in two ways: we learn representations for VerbNet classes, which allow better generalization, and we learn representations for specific VerbNet senses, which are more specific and perhaps more informative. VerbNet class embeddings are beneficial for generalization. Consider the following example:1

• The chapter as a whole inveigled attention to a number of key methodological issues.

Here, we use the extremely rare verb "inveigle", which means something like "to influence". If we use a generic embedding model, it might not contain this verb, or, because of

1Author's example.

its rarity, the training data will have been insufficient to develop a good representation. If we can learn a VerbNet class embedding for this sense (as in the class draw-59.3), we can replace or supplement the verb embedding with the generic class embedding. This allows knowledge of the class to provide better information than the verb embedding alone. From the other side, verb sense embeddings provide more specific information. Consider the following example:

• The chapter as a whole draws attention to a number of key methodological issues.

We can use a generic word embedding representation for the verb "draw". However, these embeddings are sense-agnostic: they are a combined representation for the different meanings of draw: to paint, to lure someone in, to remove something, and more. So while the generic embeddings are effective, they are underspecified. If we learn embeddings for specific verb senses, this problem is alleviated. Instead of using the generic word embedding, we can use one for this particular sense (draw-59.3). This embedding, trained for this specific class, will be more relevant to this particular instance.
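One way to operationalize this reasoning at lookup time is a back-off cascade: prefer the sense embedding, fall back to the class embedding, and only then to the plain word embedding. The following is a minimal sketch under our own assumptions (sense embeddings keyed as `verb-class`, class embeddings keyed by the class id alone); it is not the thesis's implementation:

```python
def embed(token, sense, embeddings, dim=100):
    """Back off from the most specific representation to the most general.

    token: surface lemma, e.g. 'inveigle'
    sense: VerbNet sense tag or None, e.g. 'inveigle-59.3' (hypothetical
           key format: sense embeddings stored as 'verb-class', class
           embeddings as the bare class id)."""
    if sense and sense in embeddings:            # verb-specific sense embedding
        return embeddings[sense]
    if sense and "-" in sense:
        vn_class = sense.split("-", 1)[1]        # e.g. '59.3'
        if vn_class in embeddings:               # class embedding covers rare verbs
            return embeddings[vn_class]
    if token in embeddings:                      # plain word embedding
        return embeddings[token]
    return [0.0] * dim                           # out-of-vocabulary fallback
```

For a rare verb like "inveigle" there is no sense embedding, so the lookup lands on the class embedding for 59.3, exactly the generalization argued for above.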

8.2.1 Implementation

To generate these embeddings, we first tag a corpus (in this case, 450 million sentences from a dump of Wikipedia) with VerbNet sense tags. We then run two different replacement options. For one dataset, we replace each verb with just its sense tag, which removes the specific verb information. For the second, we replace each verb with the verb and its sense tag, allowing for training of more specific sense embeddings. This process yields texts that are similar to normal, unmodified texts, but with certain elements replaced by VerbNet structural elements. These texts allow for learning of structural embeddings: the same methods that effectively generate embeddings for words can be directly applied to these texts with VerbNet replacements, generating embeddings instead for structural elements based on the contexts they occur in.

Raw Sentence:  Apollo turned both the mother and son into swans when they jumped in the lake.
VN Sense:      Apollo turn-26.6.2 both the mother and son into swans when they jump-51.3.2 into a lake.
VN Class:      Apollo 26.6.2 both the mother and son into swans when they 51.3.2 into a lake.

TABLE 8.3: Sample training data using replacement
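The replacement step that produces these training variants can be sketched as follows, assuming the tagger yields (token, sense) pairs with sense tags such as `turn-26.6.2` (the function name and input format are our illustration, not the thesis's code):

```python
def make_training_variants(tagged):
    """Build the two replaced texts of Table 8.3 from one tagged sentence.

    tagged: list of (token, vn_sense_or_None) pairs,
            e.g. ('turned', 'turn-26.6.2') or ('Apollo', None).
    Returns (sense_text, class_text):
      sense_text keeps the lemma plus class id ('turn-26.6.2'),
      class_text keeps only the class id ('26.6.2')."""
    sense_toks, class_toks = [], []
    for token, sense in tagged:
        if sense is None:
            sense_toks.append(token)                    # non-verbs pass through
            class_toks.append(token)
        else:
            sense_toks.append(sense)                    # e.g. 'turn-26.6.2'
            class_toks.append(sense.split("-", 1)[1])   # e.g. '26.6.2'
    return " ".join(sense_toks), " ".join(class_toks)
```

Any off-the-shelf embedding trainer can then be run unchanged over the resulting texts, treating the VerbNet elements as ordinary vocabulary items.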

Algorithm                Trofi  MOH-X  LCC    LCC      VUAMC  VUAMC    LCC Domain  LCC Domain
                                       (All)  (Verbs)  (All)  (Verbs)  (All)       (Verbs)
SVM (Full baseline)      .518   .543   .633   .510     .656   .517     .596        .221
+GloVe Sense Embedding   .518   .544   .631   .519     .651   .513     .605        .186
+GloVe Class Embedding   .518   .547   .650   .518     .654   .522     .599        .216
+Both                    .518   .547   .653   .517     .655   .511     .605        .216
Bi-LSTM from Gao et al.  .731   .683   .697   .676     .773   .716     .557        .202
+GloVe Sense Embedding   .737*  .693*  .677   .676     .775   .725     .538        .204
+GloVe Class Embedding   .735   .691   .644   .677     .770   .706     .546        .212
+Both                    .733   .691   .691   .682     .771   .702     .542        .211

TABLE 8.4: VerbNet embedding F1 results (* denotes significant improvements over the baseline, p < .01)

Examples of each sentence with replacements for each component are shown in Table 8.3. The verbs identified from VerbNet are underlined, while the components that have been replaced are in bold. This replacement method is effective and allows for flexibility: any word embedding method can be used to generate these representations. We evaluate this set of embeddings (generic VerbNet class, verb sense, generic VerbNet role, class-specific role) trained using GloVe and skip-gram models of 100 and 300 dimensions, using both feature-based and deep learning methods. We found that the 100-dimension GloVe model performed best.

8.2.2 Results

For feature-based machine learning, the effects of sense- and class-based embeddings are minimal. In most cases, we see no improvement over the baseline; in the cases where there are improvements, they are not statistically significant. When input to the deep learning model, both class and sense embeddings yield small but significant improvements on most tasks. However, the MOH-X task is hampered significantly by either type of embedding. This may be due to the small dataset, which includes only short, simple examples. We are increasing the size of the input feature space, and without enough training data this may cause problems in classification. Sense embeddings tend to be more informative, which makes sense: they capture specific details about particular word senses, while class-based embeddings are more generic and likely less useful. Combining them doesn't typically improve performance over using sense embeddings alone.

Chapter 9

Distant Supervision

9.1 Introduction

As a final method for using syntactic information to improve metaphoric classification, we look to improve the training data. Our analysis of corpora shows frequent errors and gaps that lead to inconsistency in training and performance. We have identified VerbNet classes that have deterministic metaphoric or literal properties, and we can use annotated data for these classes as freely available training data. In addition, we can use syntactic information in the same way. While syntactic patterns do not in and of themselves deterministically produce metaphors, analysis of corpora has shown that anomalous syntactic structures coupled with certain verbs frequently yield certain interpretations. A notable case is the verb "hemorrhage", as seen in Section 2.5.3. For this particular verb, its argument structure is directly predictive of metaphor: intransitives are literal, while transitives are almost always metaphoric. We analyze 20 verbs to find cases where syntactic structures are predictive of metaphor, and use syntactic patterns from dependency parses to find additional data.

To find verbs to analyze, we used the VUAMC data. For a first sample, we extracted the 10 verbs in the training data (per the shared task) that were most ambiguous: that is to say, the verbs with the most even distribution between literal and metaphoric classes. Our goal was to find verbs the classifier would struggle with, and those with a balanced distribution seemed a reasonable starting point. We also ran the state-of-the-art sequential tagging system of Gao et al., which we are using as our neural baseline, and took the 10 verbs in the validation set (using Gao et al.'s split) that were most often misclassified. This gives us a total of 19 verbs in two sets, as the verb "hold" occurred in both: verbs that are naturally ambiguous between literal and metaphoric classes, and those that state-of-the-art classifiers perform poorly on. For more statistics about these verbs, see Table 9.1.

FIGURE 9.1: Exploration: Additional data

For each of these 19 verbs, we performed a thorough analysis of the instances in the VUAMC data that contained them. We examined the syntactic structures they typically appeared in, and attempted to determine whether the annotation was valid and what kinds of argument structures and syntactic patterns differentiate the literal and metaphoric usages. We also analyzed these verbs in VerbNet. We went through each class a verb could occur in, and determined whether the usage of the verb in that class would be necessarily literal, necessarily metaphoric, or unclear. We marked classes that were necessarily literal or metaphoric, as these can be used to extract additional training data. We ignored cases where the class didn't deterministically pattern one way or the other.

Most Ambiguous Verbs                Most Misclassified Verbs
Verb       Met  Lit  % Met          Verb     FP  FN  Correct  % Correct
encourage    6    6  .5             spend     7   0        4  .363
blow         5    5  .5             include   8   1       10  .526
conduct      5    5  .5             play      3   3        8  .571
show        34   33  .49            hold      3   3        8  .571
find        60   62  .49            stop      7   1       12  .600
fall        18   19  .51            reduce    2   2        6  .600
hold        28   30  .52            get      15  21       74  .673
bring       36   33  .48            suggest   4   0        9  .692
put         57   52  .48            meet      2   1        7  .700
allow       19   21  .48            discuss   1   2        7  .700

TABLE 9.1: Challenging verbs: on the left, the verbs with the most even split between literal and metaphoric. On the right, the verbs in the validation set that were most often misclassified.

Appendix A shows a full overview of this analysis; we will now explore in depth some examples of patterns we found interesting.
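Selecting the most ambiguous verbs amounts to ranking verbs by how close their metaphoric/literal split is to 50/50; a minimal sketch under that reading (function name and input format are ours):

```python
def most_ambiguous(counts, n=10):
    """Rank verbs by closeness to an even metaphoric/literal split.

    counts: {verb: (metaphoric_count, literal_count)}
    Returns the n verbs whose metaphoric proportion is nearest 0.5."""
    def distance_from_even(item):
        met, lit = item[1]
        return abs(met / (met + lit) - 0.5)
    return [verb for verb, _ in sorted(counts.items(), key=distance_from_even)[:n]]
```

Run over the VUAMC training counts, this reproduces selections like "encourage" (6/6) ahead of heavily skewed verbs.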

9.2 Data Extraction

We used this analysis to extract additional training data from two sources. For syntactic patterns, we parsed one million sentences from Wikipedia and wrote dependency-based patterns to match our analysis. We matched these patterns against the corpora, collecting sentences in which the verb in a particular clause matched a syntactic pattern that we noted as either exclusively literal or exclusively metaphoric in the training data. This necessarily introduces noise, as errors in the NLP pipeline can compound: the dependency parser isn't perfect, and neither are the syntactic patterns. In addition, our analysis is based on limited VUAMC data, so the syntactic patterns we identified are not always accurate for the test data, and perhaps unlikely to extend to other corpora that make different annotation decisions. However, this method does allow us to collect large amounts of linguistically motivated training data.

9.2.1 VerbNet Analysis

We analyze VerbNet classes to determine when they might be indicative of metaphoric or literal usage for particular verbs. We can then use this analysis to pull instances from a database of approximately 150,000 verb instances annotated with their appropriate VerbNet class [Palmer et al., 2017], automatically extracting instances that match metaphoric or literal classes. Our analysis is specifically for certain verbs in these classes, but we believe it extends fairly consistently to other verbs within the same classes, so we extract all the verbs in the annotation data annotated with these classes. For each of our challenging verbs, we examined the VerbNet classes in which it appears. We looked at the VerbNet annotation, the example sentences, the selectional preferences on the class's thematic roles, and the semantic predicates. From this we assessed whether the sense of the verb in each class was typically metaphoric or literal. Consider the verb "grow". It is present in two particular VerbNet classes: grow-26.2 and calibratable_cos-45.6. The grow-26.2 class has an +ANIMATE Agent role, and produces a

+CONCRETE Product out of a +CONCRETE Material:

107. A private farmer in Poland is free to buy and sell land, hire help, and decide what to grow.

108. It’s the kind of fruit that grew freely and that you could help yourself to.

From the semantics and the annotated examples, we expect this sense of grow to typically be literal. However, the calibratable_cos-45.6 class contains a Value role that moves along a scale by a certain Extent. These examples all appear to be metaphoric, evoking the MORE IS UP mapping:1

109. Exports in the first eight months grew only 9%.

110. Non-interest expenses grew 16% to $496 million.

1Examples from the VerbNet annotation data of Palmer et al. [2017]

This allows us to extract new training data for these classes using our database of VerbNet annotations. We found all annotated instances of "grow" in the grow-26.2 class and considered them literal, while all instances of "grow" from calibratable_cos-45.6 were considered metaphoric. This process was completed for all of the verbs in Table 9.1. Note that we only consider the verbs in these instances: we have no knowledge of the arguments. For each verb, we extracted up to 100 annotations for each sense that we determined to be largely metaphoric or literal.
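The extraction over the VerbNet-annotated database can be sketched as a filter keyed on (verb, class) pairs. The `CLASS_LABELS` mapping below encodes only the "grow" analysis from this section; everything else (names, record format, the per-class cap) is our illustration:

```python
# Class-to-label analysis for "grow" (Section 9.2.1); in practice this table
# would hold one entry per analyzed (verb, class) pair from Appendix A.
CLASS_LABELS = {
    ("grow", "grow-26.2"): "literal",
    ("grow", "calibratable_cos-45.6"): "metaphoric",
}

def extract_distant_data(instances, class_labels, cap=100):
    """Collect up to `cap` annotated instances per (verb, class) pair.

    instances: iterable of (verb, vn_class, sentence) records from the
    VerbNet-annotated corpus. Classes not in the analysis are skipped, since
    they don't pattern deterministically."""
    out, seen = [], {}
    for verb, vn_class, sentence in instances:
        key = (verb, vn_class)
        label = class_labels.get(key)
        if label is None or seen.get(key, 0) >= cap:
            continue  # non-deterministic class, or bucket already full
        seen[key] = seen.get(key, 0) + 1
        out.append((sentence, label))
    return out
```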

9.2.2 Syntactic Pattern Analysis

Our belief that these resources can be used to generate training data is based on properties of linguistic metaphors. We will further explore three aspects of syntax that have been shown to be predictive of metaphors in the data: argument structure, active/passive voice, and prepositional complements.

Argument Structure

The number and type of arguments that verbs take are a core component of their semantics. There is a wealth of evidence that the number of arguments with which a verb can occur is directly influenced by its semantics, showing a close link between a verb's syntax and meaning. This idea is explored in depth by Levin [1993], who notes that "the behavior of a verb, particularly with respect to the expression and interpretation of its arguments, is to a large extent determined by its meaning" (p. 1). We see this in English with verbs like hemorrhage, which is almost always used metaphorically when it occurs transitively:

111. GM was supporting this event even as they were hemorrhaging cash.

112. For 30 straight years, American organized labor has been hemorrhaging members.

When used intransitively, hemorrhage is almost always literal:

113. Cerebral AVMs often have no symptoms until they rupture and hemorrhage.

114. Michael hemorrhaged and sustained a massive stroke to the left side of his brain.

This is likely due to the fact that the literal use of "hemorrhage" contains an understood argument, blood, which is the most natural object of the verb. If the use is intended in a less literal way, the object is required, as the null "blood" object needs to be overridden. While not all verbs have this direct relation between argument number and metaphoricity, we believe that the type and number of syntactic arguments of a verb can be indicative of unmarked usage, and may be utilized as a method for automatically extracting training data for metaphor classification.

We find evidence for this in the VUAMC data: the verb "encourage" has distinctive argument structure patterns in the metaphor data. The verb "encourage" is encountered 12 times in the data, with half of the instances tagged as metaphoric and half tagged as literal. The semantic distinction between the literal and metaphoric samples is often difficult to assess, but the literal samples seem to typically involve animate participants that are encouraged to do something, while the metaphoric samples involve more abstract entities either encouraging or being encouraged.

Due to the nature of the VUAMC annotation procedure, which is based on dictionary definitions and favors overgeneration with regard to metaphor, many of the annotated examples are extremely conventional and may not seem particularly metaphoric. However, like "hemorrhage", the number of arguments often determines the metaphoricity. The literal samples tend to be transitive, having subject and object NPs, with an additional prepositional phrase indicating what the direct object is encouraged to do (66%). In contrast, while the metaphoric samples also have subject and object arguments, they tend to lack the additional prepositional phrase (83%).

115. It was a time when children were encouraged to fantasize about machines and outer space. (L)

116. He then encouraged her to experiment with the stick, manoeuvring the plane in every direction. (L)

117. Post-war reconstruction and housing programmes seemed even to encourage a rise in crime and mental illness. (M)

118. Such re-use is much encouraged by a building being listed. (M)

These patterns parallel the distinctions made in VerbNet. The VUAMC samples of "encourage" occur in two classes, amuse-31.1 and advise-37.9. The literal samples belong to advise-37.9, which contains syntactic frames that require the additional prepositional phrase. The metaphoric samples belong to amuse-31.1, which does not contain these prepositional phrases.
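A rule like the "hemorrhage" transitivity pattern can be checked directly on a dependency parse. The sketch below assumes parses are available as (lemma, dependency label, head index) triples rather than any particular parser's output format; the function and label names are illustrative:

```python
def label_hemorrhage(parse):
    """Apply the corpus-derived rule for 'hemorrhage':
    transitive uses are metaphoric, intransitive uses are literal.

    parse: list of (lemma, dep_label, head_index) triples for one clause,
    using Universal-Dependencies-style labels ('dobj'/'obj' for objects).
    Returns 'metaphoric', 'literal', or None if the verb is absent."""
    for i, (lemma, dep, head) in enumerate(parse):
        if lemma == "hemorrhage":
            has_object = any(
                d in ("dobj", "obj") and h == i for _, d, h in parse
            )
            return "metaphoric" if has_object else "literal"
    return None
```

The same skeleton generalizes to the other analyzed verbs by swapping in their pattern (e.g. presence of the additional prepositional complement for "encourage", or passive voice for "conduct").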

Active/Passive Voice

English contains active and passive voice constructions, which are transitivity alternations: the active typically comes in the form NP V NP, whereas the passive loses the direct object and sometimes takes an additional prepositional phrase. These constructions have semantic functions as well, de-emphasizing the agentive role and highlighting the patient-like role. We find cases where the distinction between active and passive voice can be an indicator of metaphoric versus literal usage. The most notable case in the VUAMC training data is the verb "conduct", which has 5 literal and 5 metaphoric samples in the corpus.

119. Research into site selection has been almost exclusively conducted in North America. (L)

Research has been conducted on using GIS to monitor this. (L)

120. The Architects' Journal conducted a survey to find what were considered the best modern buildings. (M)

121. The Imperial Cancer Research Fund is conducting a joint study of the Bristol Cancer Help Centre. (M)

Similar to "encourage", the metaphoric annotations are somewhat difficult to assess, but it appears that literal uses are those where the entity "conducting" is either animate or unknown, as in the above examples. The metaphoric samples tend to have abstract or inanimate entities as the subject of the verb. However, like "encourage", there are promising syntactic patterns. The literal samples are usually passive constructions (75%). This highlights the interaction between syntax and semantics: these passive constructions do not contain overt agent-like entities, and thus are marked literal. Almost all the metaphoric samples have overt subjects and objects (88%), with only one using a passive construction. While it is possible this is an artifact of the dataset and annotation procedure, it remains helpful for showing that syntactic relations can be predictive.

Prepositions

Most verbs that we analyzed frequently take prepositional phrases as oblique arguments. The nature of the preposition changes with the verb and the intended meaning, and different verbs yield different literal and metaphoric distinctions. While many of these patterns are idiomatic and/or verb-particle constructions, knowing what kinds of prepositions are available is invaluable for determining the semantics of a verb instance. Like the argument constructions, prepositions show distributional properties with regard to metaphoricity: many prepositions are used frequently with certain verbs in literal or metaphoric contexts, even though there is no rule constraining them to a particular usage. For example, in the VUAMC data, we find frequent use of the phrase "blow over". While this phrase has an idiomatic meaning something like "to pass", the corpus data shows that "blow" followed by the preposition "over" is invariably used literally. In contrast, "blow up" also occurs frequently, and is always used metaphorically. This is likely due to the annotation procedure: "blow up" is a very conventionalized metaphor for destruction, but the VUAMC data considers it to be metaphoric.

Analysis of metaphoric preposition usage gives an easy path to more data: verbs with prepositional complements are easy to extract via syntactic pattern matching. We can use our linguistic analysis of prepositional phrases and their potential metaphoric nature to identify these samples automatically and provide additional training data.

We performed VerbNet and syntactic pattern analysis for each of our 19 difficult-to-classify verbs. A sample of this analysis is shown in Table 9.2; the full analysis can be found in Appendix A. Counts of the number of sentences extracted for each verb from each source are shown in Table 9.3. Note that for the syntactic patterns, we restricted the extracted instances to 100 per syntactic pattern, as some patterns are much more frequent, leading to extreme imbalance in the resulting data.

Verb       (L) Syn. Patterns       (M) Syn. Patterns             (L) VN Classes          (M) VN Classes
encourage  NP V NP {TO} VP         NP V NP, NP V PRO VP          advise-37.9             amuse-31.1
find       find out, WH NP V       V NP {TO BE} ADJ, find dead   get-13.5.1              declare-29.4, convert-26.6.2, long-32.2
fall       NP V ADV, NP V          fall in, fall to              escape-51.1             acquiesce-95.1, die-42.4
spend      NP V, NP V {ON} NP,     spend time, spend life        consume-66, pay-68      spend_time-104, meet-36.3
           NP V PP
play       -                       play with                     play-114.2              trifle-105.3, use-105.1

TABLE 9.2: Example analysis of syntactic patterns and VerbNet classes.

This is another abstract use of syntax for metaphor processing. While we do not directly implement any syntactic features, the dataset is collected based on syntactic patterns (as well as VerbNet, a syntactically motivated resource), and thus the data should improve classification if these syntactic properties do indeed influence metaphor production. This data should couple well with direct implementation of syntactic features, such as dependency relations and VerbNet structures. If a system can successfully identify and make use of syntactic relations, it should be better able to make use of this additional data.

Verb       VN samples  Syn samples
encourage          86          200
blow                -           99
conduct             -          200
show                -            -
find              407          255
fall              314          600
hold              913          487
bring               -          500
put                 -            -
allow               2          300
spend             439          341
play               52          343
stop              482            -
reduce              -            -
suggest           307           12
meet              455          399
Total            3985         3736

TABLE 9.3: Total samples extracted from VerbNet classes and syntactic patterns.
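The verb-plus-preposition patterns can likewise be matched with a small lexicon over lemmatized text. The two "blow" entries below come from the analysis above; the function, the lexicon's name, and the flat-lemma input format are our illustration:

```python
# Verb+particle pairs observed to pattern one way in the VUAMC data
# (Section 9.2.2); a full version would cover all analyzed verbs.
PARTICLE_LABELS = {
    ("blow", "over"): "literal",
    ("blow", "up"): "metaphoric",
}

def label_by_particle(lemmas, lexicon=PARTICLE_LABELS):
    """Label a clause by the first verb+particle bigram found in the lexicon.

    lemmas: lemmatized token sequence for one clause."""
    for verb, particle in zip(lemmas, lemmas[1:]):
        if (verb, particle) in lexicon:
            return lexicon[(verb, particle)]
    return None  # no analyzed pattern present
```

In a real pipeline the particle would be identified from the dependency parse (e.g. a particle or prepositional dependent of the verb) rather than by adjacency, but the lexicon-lookup core is the same.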

9.3 Implementation

This method of extracting distantly supervised data is extremely easy to implement from a practical perspective. The data can be added to any metaphor task, although we believe it will be most effective for tasks that focus on verb classification; we only need to format it appropriately as input. The data is best suited for single-word classification, so we anticipate sequence tagging will benefit less, as we have tags only for the verbs, not for all of the words in a sentence. Also, our syntactic analysis was based on verbs in the VUAMC data, so classification performance on the other datasets (TroFi, MOH-X, and the LCC dataset) likely won't be as good.

We added this data for each task, using it as additional training data while keeping the test data the same. For the feature-based models, we include all the same features from the baseline and use the same experimental parameters. For deep learning, the baseline algorithm was kept the same, although we increased training time in proportion to the additional data added. Note that this implementation of distantly supervised data is not applicable to the domain classification task for the LCC data. That task requires knowledge of which domain the target word is evoking, and we cannot infer these tags in any reasonable fashion: our analysis is limited to binary metaphoric-or-not classification. This is an area to which future work could be directed: distantly supervised extraction of domain tags would allow this method to be applied to the LCC dataset, and provide better metaphoric data in general.

Algorithm                Trofi  MOH-X  LCC    LCC      VUAMC  VUAMC    LCC Domain  LCC Domain
                                       (All)  (Verbs)  (All)  (Verbs)  (All)       (Verbs)
SVM Baseline             .516   .555   .581   .617     .629   .534     .516        .114
+VN Data                 .510   .543   .364   .390     .647   .530     -           -
+Syn. Pattern Data       .510   .537   .426   .374     .654   .519     -           -
+All                     .508   .538   .373   .339     .633   .506     -           -
Bi-LSTM from Gao et al.  .731   .683   .697   .676     .773   .716     .557        .202
+VN Data                 .738*  .693*  .688   .676     .761   .697     -           -
+Syn. Pattern Data       .739*  .692   .657   .664     .773   .710     -           -
+All                     .738*  .693*  .696   .683     .762   .700     -           -

TABLE 9.4: Additional data F1 results (* denotes significant improvements over the baseline, p < .01)

9.4 Results

We see small but significant gains from the additional data in the deep learning models for the VUAMC-based tasks. Adding both datasets doesn't improve performance over either individually, which indicates that they may be providing contradictory information. Only for the Trofi dataset was the improvement from adding both datasets higher than the improvement from adding the best individual dataset; for the sequence-based tagging of the VUAMC data, adding both yielded negligible improvements over either individually. We believe the difficulty in combining both datasets in the sequence models is due to excessive noise from the non-target words of the samples. We default to marking every word other than the target verb in the sentence as literal, so the additional data is understandably less informative for sequence tagging problems. It is likely that the combination of VerbNet data and syntactic pattern-based data introduced additional noise: the two datasets may in places be contradictory, particularly with regard to these non-target elements.

Note that the data we analyzed was based on the VUAMC data, so our results with regard to that corpus are the most relevant. For the other datasets, the additional data provided no significant benefits, and hurt performance in many cases. This is likely due to the different definitions of metaphor used by different datasets. Additionally, for the multi-label LCC domain classification task, the additional data was unusable, as it contained no information about source and target domains. Our analysis was focused on the metaphoric definitions from the VUAMC data, and thus likely yielded ineffective data for the other datasets.

Our analysis is thus focused on the VUAMC data. For this dataset, we examined each verb we added, comparing its performance on the test set before and after the additional data was added, using the Bi-LSTM model, which achieved the best performance. The F1 scores from this analysis are shown in Table 9.5.

Verb       Test Samples  F1 Before  F1 After
encourage             3        1.0       1.0
blow                  1        0.0       0.0
conduct               4       .857      .667
find                 40       .949      .873
fall                  6       .800      .800
hold                 18       .900      .800
bring                20       .545      .545
allow                15       .824      .875
spend                 6       .800      .909
play                 17        0.0      .600
stop                 10       .400      .250
reduce                6       .909      .800
suggest              22       .615      .615
meet                 12       .667      .571

TABLE 9.5: Difference in F1 scores for analyzed verbs after data was added.

There are a couple of key takeaways from this analysis. First, we don't see consistent improvement across the verbs we analyzed. While there were significant improvements for some verbs ("allow", "spend", and "play"), most showed either no change in performance or a slight decrease compared to the baseline. This indicates that our analysis's impact isn't restricted to the verbs we analyzed: our overall improvements lie outside of the specific verbs we looked at. This is likely due to the generalizability of embedding inputs: our analysis yielded some good data, and the embeddings generalize this good data to other similar examples. It is disconcerting that we don't see better improvement for the specific verbs we analyzed, but as overall performance improved, we believe this approach is fundamentally sound.

One possible reason that we didn't see consistent improvement is the nature of the test set. We picked verbs to analyze based on the validation and training data. When we observe the test set data, we see that these verbs occur in different distributions: some are exceedingly rare ("encourage", "blow", and "conduct" all have fewer than five samples). This variation between datasets encourages generalizability in models, and we believe we were successful in this regard given the overall performance gains.

Chapter 10

Putting it all together

We've seen that most of the methods we've employed have been successful to some degree on some tasks. Some work best for sentence-level tasks, and some work best for sequence tagging. Some provide more benefit when applied to feature-based machine learning, and some realize peak performance with deep learning. To finalize our analysis, we need to combine these features. Each method we've looked at has complementary properties. Dependency structures operate separately from the lexical semantics of VerbNet. VerbNet structures and embeddings are two ways of representing class information that is both syntactically and semantically motivated. These motivations extend to our distant supervision, but this method functions completely differently, covering gaps and increasing the size of our training data. Because they all function somewhat differently, we believe combining these features is likely to yield greater improvements in performance. To combine our methods, we simply incorporate them all into a single model. This works nicely for feature-based methods: the features generated from syntactic parses and VerbNet structures can all be combined into a single bundle, and the same classification methods can then be applied. For deep learning, we stick with the methods that have proven effective: VerbNet-based embeddings and additional data from distant supervision. VerbNet structures are better represented as embeddings, and we believe dependency parse-based information is captured implicitly by the LSTM architecture.
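For the feature-based combination, merging the per-method feature sets reduces to namespacing and unioning their feature dictionaries so a single classifier can consume them without key collisions; a minimal sketch (group names and format are illustrative):

```python
def combine_features(*feature_groups):
    """Merge per-method feature dicts into one namespaced bundle.

    feature_groups: (name, feature_dict) pairs, e.g.
    ('dep', {...dependency features...}), ('vn', {...VerbNet features...}).
    Prefixing each key with its group name keeps otherwise identical
    feature names (e.g. 'lemma=grow') from colliding across methods."""
    combined = {}
    for name, feats in feature_groups:
        for key, value in feats.items():
            combined[f"{name}:{key}"] = value
    return combined
```

The resulting dictionary can be vectorized exactly like any single method's features, so the classification setup is unchanged.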

FIGURE 10.1: Exploration: Combining methods

10.1 Results

To understand these results, we show the performance of each feature on each dataset using each algorithm in Table 10.1. For feature-based learning, all of the additional methods we attempted showed some promise, excepting distant supervision. VerbNet structures were especially effective: they yielded significant improvements over the baseline for five of the eight tasks. As we saw in Section 8.1, the strength of these features lies in combining them with dependency structures. However, adding both methods in combination doesn't yield improvements over the VerbNet-based features alone. This may be due to the size of the feature space: none of our datasets are very large, so increasing the size of the feature space can degrade model performance.

Algorithm                Trofi  MOH-X  LCC    LCC      VUAMC  VUAMC    LCC Domain  LCC Domain
                                       (All)  (Verbs)  (All)  (Verbs)  (All)       (Verbs)
SVM (Full baseline)      .518   .543   .633   .510     .656   .517     .596        .221
+Dependency Features     .507   .542   .669   .516     .653   .508     .539        .173
+VerbNet Structures      .532*  .580*  .682   .561*    .654   .569*    .574        .202
+VerbNet Embeddings      .518   .547   .653   .517     .655   .511     .605        .216
+Distant Supervision     .508   .538   .373   .339     .648   .526     -           -
+All Methods             .533*  .576*  .673   .556*    .648   .526     .584        .193
Bi-LSTM from Gao et al.  .731   .683   .697   .676     .773   .716     .557        .202
+Dependency Features     -      -      -      -        -      -        -           -
+VerbNet Structures      -      -      -      -        -      -        -           -
+VerbNet Embeddings      .733   .691*  .691   .682     .771   .702     .542        .211
+Distant Supervision     .738*  .692*  .696   .683     .762   .700     -           -
+All Methods             .739*  .700*  .682   .678     .767   .699     -           -

TABLE 10.1: Combined results

Distant supervision provided no benefit to any of the feature-based tasks. The model may be unable to generalize: the additional data we found is noisy and contains inconsistencies, and the feature-based model may be weighting too strongly the lexical items and other features that are inconsistent in the distantly supervised data. We will examine feature weights in detail in Chapter 11: it appears that the VerbNet structures and lemmas are the most heavily weighted features in the models, and it may be that these are incorrect or introduce additional noise when the distantly supervised data is added.

VerbNet structures improved performance for classification of LCC verbs, but no other methods proved particularly effective, which is understandable given the annotation scheme and our task. The fact that only a single metaphor is annotated for a given sentence means that each sentence contains additional noise, and our task includes both source and target lexical triggers as the positive class, leading to additional confusion for the classifier. Evaluating these methods against this dataset will require further refinement in annotation or task setup.

Our deep learning models are more promising with regard to combinations: adding VerbNet embeddings and distantly supervised data together yielded the best performance for both VUAMC tasks. The MOH-X dataset was best classified by the baseline, while the Trofi dataset performed best using distantly supervised data alone: adding both methods together hindered performance.1 Again, the LCC data proved difficult, with no method significantly outperforming the baselines.

In addition, we report precision and recall scores for the LCC experiments in Appendix C.1. These show an interesting trade-off: the deep learning methods have a wide gap between precision and recall, and adding VerbNet-based information and distantly supervised data lowers precision to improve recall.
To improve this model, we need to balance these results: we are making more metaphoric predictions, which is necessary, but we need to ensure that these predictions are correct. To further explore the effectiveness of these methods, we will now analyze our models more closely. In particular, we will identify which additional features are informative for the feature-based models, and perform error analysis on our best-performing deep learning system.
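The precision–recall trade-off discussed above can be made concrete with the standard definitions. The counts below are hypothetical, chosen only to illustrate how trading precision for recall can raise F1; they are not our experimental results.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw error counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# A high-precision, low-recall system (hypothetical counts):
p, r, f = precision_recall_f1(tp=30, fp=10, fn=70)

# Sacrificing some precision for recall (hypothetical counts):
p2, r2, f2 = precision_recall_f1(tp=60, fp=40, fn=40)
```

Because F1 is the harmonic mean, a large precision–recall gap is penalized heavily, so narrowing the gap can improve F1 even when precision drops.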

1 Due to the small size of these datasets, standard deviations between runs were extremely high (between .02 and .05). Better statistical analysis is required to determine whether these methods are effective, but as the datasets are so small we are concerned about overfitting models and experiments to the particular dataset.

Chapter 11

Analysis

Our analysis of our results will take two forms. First, for feature-based machine learning, we will inspect the weights of the model to determine which features are useful; ideally, the features we added based on syntactic information contribute positively to the model. For deep learning, the models tend to be opaque, and inspection of weights on individual nodes or layers is uninformative. As these models tend to be state of the art, it is more effective to focus instead on error analysis: what do they get wrong, and what is improved by adding our new methods? We will focus on these questions with a particular eye for syntactic patterns, and whether certain patterns are better detected using our improved models.

For both of these analyses, we will focus on the VUAMC dataset, for multiple reasons. First, it is the largest dataset, which makes our analysis of feature weights more likely to be significant. It also allows for broader error analysis: more examples are misclassified, making patterns in the errors easier to recognize. Second, this dataset is the most widely used corpus, and has a standardized training and test set, so our analysis will be more broadly applicable to other research.

11.1 Feature Weights

To analyze feature weights, we will compare the most important features from our baseline model to those of our best model, which includes all of our improved methods except distant supervision, which doesn't introduce any additional features. We will begin by analyzing individual lexical items, and see which are weighted most heavily for the negative (literal) and positive (metaphoric) classes.

Lit        Met
the        formula
a          habits
an         odd
some       fair
provided   admitted
problems   application
improved   lord
hurtling   regeneration
lift       market
sided      extensive

TABLE 11.1: Lemmas with strongest weights for the negative (literal) and positive (metaphoric) classes.

Table 11.1 shows the top 10 most predictive words for the literal and metaphoric categories. The heaviest weighted literal words are those which occur very frequently and never metaphorically: thus we get the articles, along with some other content words which never occur metaphorically in the data. The heaviest weighted metaphoric words are those that are frequent and always used metaphorically. "Formula" in particular is frequently used to reference a mathematical formula; while this is an extremely conventional metaphor, the dictionary definition-based annotation scheme likely has a more concrete entry for a mixture, and thus these instances are always metaphoric. Similar patterns occur for words like "odd" and "lord". "Market" is typically used to reference the stock market, which is likely tagged as metaphoric in light of the more concrete physical market.

We also explored the weights for each feature type. We aggregated the effects of each feature for the baseline and complete models. This was done by extracting the weight for each feature of a particular type: for example, every feature that was a "lemma", including all the words in the training set, was compiled into a single list. We then calculated the maximum and minimum weights for each feature type, as well as the average positive and average negative weights. This indicates how strong a feature type is: types with a large maximum and minimum weight contain elements that are strongly predictive, types with high average positive weights contain many elements that contribute strongly to the metaphoric class, and types with low average negative weights have elements that contribute strongly to the literal class. These results are displayed for the baseline set of features (Figure 11.1) and for the final model with all features added (Figure 11.2).

FIGURE 11.1: Aggregated weights for baseline features.

The weights for the basic linguistic units (the lemma, bigram, and part of speech) are the primary factors in the baseline classifier. Interestingly, the additional lexical features (valence, arousal, dominance, concreteness, and imageability), which have been shown to be effective for this task, have only very minimal weights. Embeddings provide additional information for both positive and negative classes.

FIGURE 11.2: Aggregated weights for all features.

The features from the baseline remain unchanged when all features are added. Features from dependencies are helpful, as are those from VerbNet class and role information. These structural components have proven effective in our models, and are also weighted heavily. The VerbNet-based embeddings behave like the original word embeddings: they contribute to both classes, but are relatively weak compared to the structural components. This parallels our intuitions about feature-based and deep learning models: while embeddings can be effective in feature-based models, they can be replaced by one-hot or bag-of-words features; their value is more obvious when used as input to deep learning models.
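The aggregation procedure described above (per-type maximum, minimum, and average positive/negative weights) can be sketched in a few lines. The "type=value" feature-naming convention and the toy weights here are illustrative assumptions, not the actual model internals.

```python
from collections import defaultdict

def aggregate_weights(weights):
    """Aggregate per-feature coefficients by feature type.

    `weights` maps feature names like "lemma=formula" to learned
    coefficients; the prefix before "=" is taken as the feature type.
    Returns per-type max, min, average positive, and average negative weight.
    """
    by_type = defaultdict(list)
    for name, w in weights.items():
        by_type[name.split("=", 1)[0]].append(w)
    stats = {}
    for ftype, ws in by_type.items():
        pos = [w for w in ws if w > 0]
        neg = [w for w in ws if w < 0]
        stats[ftype] = {
            "max": max(ws),
            "min": min(ws),
            "avg_pos": sum(pos) / len(pos) if pos else 0.0,
            "avg_neg": sum(neg) / len(neg) if neg else 0.0,
        }
    return stats

# Toy coefficients standing in for a trained linear model:
stats = aggregate_weights({
    "lemma=formula": 1.2, "lemma=the": -0.9,
    "pos=NN": 0.1, "pos=DT": -0.3,
})
```

With a linear classifier such as an SVM, the per-feature coefficients would be read off the trained model and fed into a function like this one.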

11.2 Error Analysis

For our deep learning models, we've achieved state-of-the-art performance using additional VerbNet-based embeddings and distantly supervised training data. To understand what we're improving on, and how we can improve our system further, we need to identify which examples we now classify correctly that the previous baseline did not, and which examples we still classify incorrectly. We will do this both quantitatively and qualitatively. We will assess which lexical items are most frequently misclassified, and attempt to determine patterns that could lead to improvement. We will also look closely at specific examples, and determine whether there are syntactic or semantic reasons these are difficult. We hope to find patterns in the errors that can be used to further improve our system and our understanding of computational metaphor processing.

Verb       FP   FN   Correct (M)   Correct (L)     F1
serve       0    5        0             1            0
pursue      1    4        0             0            0
decide      0    4        0             7            0
want        1    4        0           101            0
progress    0    4        0             0            0
know        0    4        0           177            0
call        0   10        1            15        16.67
say         0   15        2           204        21.03
ask         3    3        1            43        25.00
appear      3    3        1             8        25.00
do          1    8        2           488        30.77
act         1    3        3             1        33.33
mean        3   12        4            75        34.78
pass        5    2        2             1        36.36
inherit     0    6        2             0        40.00
talk        0    3        1            22        40.00
carry       3    2        2            17        44.44
get        21   48       28           179        44.80
offer      11    0        5            10        47.62
stand       4    2        3             9        50.00

TABLE 11.2: Most misclassified verbs (with at least 5 positive samples)
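Per-lemma error tallies of the kind reported in this chapter can be produced with a single pass over the system's predictions. The sketch below uses toy inputs, not our actual evaluation code; F1 here is computed for the metaphoric class.

```python
from collections import defaultdict

def per_lemma_errors(examples):
    """Tally FP, FN, and correct counts per lemma, plus metaphoric-class F1.

    `examples` is an iterable of (lemma, gold, predicted) triples, with
    labels "M" (metaphoric) or "L" (literal).
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0, "tn": 0})
    for lemma, gold, pred in examples:
        c = counts[lemma]
        if gold == "M" and pred == "M":
            c["tp"] += 1
        elif gold == "L" and pred == "M":
            c["fp"] += 1
        elif gold == "M" and pred == "L":
            c["fn"] += 1
        else:
            c["tn"] += 1
    for c in counts.values():
        denom = 2 * c["tp"] + c["fp"] + c["fn"]
        c["f1"] = 2 * c["tp"] / denom if denom else 0.0
    return counts

# Toy predictions for two lemmas:
counts = per_lemma_errors([
    ("get", "M", "L"), ("get", "M", "M"), ("get", "L", "L"),
    ("serve", "M", "L"),
])
```

Sorting the resulting dictionary by F1 (with a minimum positive-sample threshold) yields tables like the one above.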

Verbs

We begin with verbs, as they are central to our analysis of metaphor. Table 11.2 shows the 20 most misclassified verbs for which there were at least 3 positive samples in the test data. The first thing that is apparent is that false negatives dominate the errors for our most difficult verbs. These are cases where the system should indicate metaphoricity, but falsely predicts the negative class. This is likely due at least in part to imbalance in the training data: the VUAMC data has significantly more non-metaphoric words than metaphoric, which encourages underprediction of metaphoricity.

Another interesting takeaway is the type of verbs present in this list of misclassifications. Many of them are inherently abstract or vague in meaning: "do", "mean", and "get" all have many possible senses and uses, which likely confuses the classifier. In these cases, there are multiple literal and metaphoric senses for each of these verbs, and treating this as a simple WSD problem will not be sufficient to determine the correct meaning. This is to be expected: these verbs are inherently difficult, and classification mirrors this difficulty. It is also important to note the inherent annotation problems with "get": we attempted to analyze syntactic patterns for this verb in Chapter 9, but found the annotation to be inconsistent and the nature of metaphoric usages of this verb confusing. Verbs like "get" that are inherently abstract or have a large number of senses are the most difficult for automatic classification. Consider the following examples from the VUAMC data that were incorrectly classified by our system (correct tags on the right):

122. Now that they can’t get hold of cocaine they’ll just crash out. (M)

123. Can’t get more common than me! (L)

124. The people who get caught and imprisoned may not be a representative picture of all criminals. (L)

125. (...) they get puzzled and ask question after question. (L)

"Get" is used in many different ways in the corpus. Typically it literally involves physical acquisition of some object. Some metaphoric instances also have similar meanings: Example 122 is an instance of acquiring something, but it's tagged as metaphoric. We also see in Examples 123-125 that literal annotations can also involve change of state. Whether or not these should actually be metaphoric is extremely difficult to assess, and the machine understandably struggles to differentiate them.

Nouns

We see similar patterns when analyzing noun errors (Table 11.3). The vast majority are false negatives: we predict these nouns to be literal when they are annotated as metaphoric. The problem here appears to be the nature of these nouns, which are almost all extremely abstract.

Noun               FP   FN   Correct (M)   Correct (L)     F1
liberation          0    4        0             0            0
autonomy            0    6        0             0            0
ensemble            0    5        0             0            0
                    0    9        0             0            0
observation         0    5        0             1            0
foreclosure         0    8        0             0            0
self-examination    0    5        0             0            0
association         0    4        0             0            0
prospect            0    6        0             0            0
training            0    5        0             2            0
group               2    5        0             6            0
interest            0   15        0             7            0
detail              0    4        0             5            0
pupil               0    8        0             1            0
design              0    4        0             2            0
plant               0   17        1             1        10.53
crisis              0    8        1             0        20.00
rule                0    4        1             1        33.33
change              1    6        2             4        36.36
subject             0   10        3             0        37.50

TABLE 11.3: Most misclassified nouns (with at least 5 positive samples)

The annotation considers these abstract usages to be metaphoric, while we might more naturally consider them to be literal uses of abstract concepts. It appears the system has learned that these abstractions should be tagged as literal, and this yields a high number of false negatives. The question of why the system is learning this distinction persists. If the annotation consistently considers abstract nouns to be metaphoric, we would expect embeddings to capture this generality and use it to correctly predict these nouns as metaphoric. However, this doesn't appear to be the case. One possible explanation is that many of these abstract nouns are derived from verbs ("liberation" from "liberate", "negation" from "negate", "observation" from "observe", and so on), and these derivations don't always have matching metaphoric annotation. All instances of "observation" in the test data are annotated as metaphoric:

126. (...) Aristotle's observation that children are immature (...) (M)

127. The attacks are based on empirical observation (...) (M)

128. Now Holt and Harris both have many wise, enlightened and humane observations to make about some of the injustices which we inflict upon children (...) (M)

The verb "observe" occurs only once, and its tag is literal:

129. Today, United States citizens observe this holiday to celebrate the legacy of the Con- stitution and the birth of government. (L)

This tag may be an error: "observe" in this sense doesn't refer to ocular observation, but rather to making use of a day to remember a certain event. The embedding-based learner is likely to see similarities between this "observe" and the abstract nominal "observations", and as the verb is annotated as literal, the nominalizations are then classified as literal. This pattern also holds for the verb "train", for which the nominalization "training" is always metaphoric:

130. In training for youth, the child must be given reasons; (...) (M)

131. But the job of lifting the standards of people employed, giving them proper training, and paying a wage which breaks dependence on overtime (...) (M)

132. (...) the railways ran a service dependent on people who belonged to a narrow and inbred working culture, with outdated procedures of training and promotion (...) (M)

The only verbal instance of train is annotated as literal:

133. (...) she was enjoying the reflected glory that came from having personally trained a house model for one of the great London couture houses. (L)

This mismatch in annotations between verbs and their nominalizations likely confuses deep learning models, and it seems to be an artifact of annotation choices. In these cases, it appears that the nominals carry the same semantics as the verbs, and in many cases appear that they should likely carry the same annotation. 168 Verb FP FN Correct (M) Correct (L) F1 autonomous 0 6 0 1 0 subject 0 8 0 0 0 odd 0 4 0 0 0 attractive 0 4 0 1 0 present 0 6 0 5 0 slight 0 4 0 1 0 great 5 6 3 2 35.29 reasonable 0 3 1 8 40.00 fine 0 3 1 1 40.00 small 3 2 2 12 44.44 short 4 3 4 1 53.33 next 0 8 9 6 69.23 full 2 0 4 0 80.00 long 0 4 9 10 81.81 bloody 3 0 7 7 82.35 big 1 1 5 8 83.33 open 0 1 3 3 85.71 fair 0 2 6 0 85.71 hard 0 2 7 2 87.50 whole 1 1 9 1 90.00

TABLE 11.4: Most misclassified adjectives (with at least 5 positive samples)

Adjectives

Like nouns, the vast majority of adjective errors are false negatives, where the system over-predicted the literal class (Table 11.4). One common semantic thread is the notion of size: many of the adjectives the system performed worst on, such as "slight", "great", "fine", and "small", had F1 scores below .5. This indicates that metaphors evoking size as the source domain are difficult for classification: better understanding of how these metaphors are represented will lead to better metaphor detection. Here we are again hindered by the lack of quality source and target domain annotation. This problem is exacerbated by cases which don't violate selectional preferences, and are difficult to assess without further context. Consider the following examples from the VUAMC data:

134. (...) if he thinks you are a small man and will not listen to your suggestions, why not? (M)

135. Beside her, Monsieur Mattli, a small Greek-looking man, some ten years her senior, was also quivering with indignation. (L)

In Example 134, the usage of "small" is metaphoric, somewhat archaically evoking the source concept of size to refer to a lack in a person's intellect or character. This isn't made explicit anywhere in the text: "man" is a perfectly reasonable noun for "small" to literally modify, and nothing in the text refutes the literal reading. We only understand the phrase as metaphoric by inference from the tone and context of the passage. In Example 135, we see the same argument structure, with "small" modifying the noun "man". However, the concrete "Greek-looking" modifier makes this more likely to be interpreted literally. These are fine-grained distinctions that are understandable to humans but extremely difficult to represent computationally, and they highlight the difficulty of metaphor detection, even in cases where the annotation is straightforward and matches our intuitions.

Syntactic Analysis

We've seen at the word level what kinds of things the model tends to have difficulty with. To extend this analysis, we also examined sentences where our best system differs from the baseline, looking more closely at the kinds of syntactic structures that may be difficult. Overall, it is difficult to determine the kinds of sentences and structures for which our improved model does better than the baseline. In most cases, the results appear to be driven by lexical semantics: words appear in a variety of different constructions, but it is the meaning of the word itself that makes it difficult to classify. However, we were able to find some patterns in the errors that relate to syntactic structures.

Our system failed to classify several examples of conventional verb-preposition constructions as metaphoric. Consider the following examples of "turn up" and "shut down":

136. It has turned up in Canberra with Japan to develop Asia Pacific Economic Cooperation (...) (M)

137. (...) if people are able to demonstrate that we are causing an environment problem we will be shut down. (M)

138. (...) someone is standing up for them somewhere. (M)

These are cases where the verb and preposition both have metaphoric components. They can be seen as resultative constructions, with the resulting state being a location metaphorically indicated by the preposition. Our system may be weak in these cases due to our implementation of VerbNet information for these multi-word phrases. VerbNet contains an entry for "turn_up", but the tagging only applies directly to the verb "turn". VerbNet doesn't have an entry for "shut_down": it contains "shut" only in the other_cos-45.4 class. Similarly, VerbNet contains an entry for "stand up" in assuming_position-50, but this is a literal usage: it has no representation for "stand up for" in the sense used above. This may lead the classifier to further confuse cases where the predication is strongly dictated by both the verb and the preposition.
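The fallback behavior described above can be illustrated with a toy lookup. The dictionary below is a stand-in for VerbNet, not the real resource: only other_cos-45.4 and assuming_position-50 come from the discussion above; the class assignments for "turn" and "turn_up" are hypothetical placeholders.

```python
# Toy stand-in for VerbNet class membership (illustrative, not the resource).
VERBNET = {
    "turn_up": ["appear-48.1.1"],        # hypothetical class assignment
    "turn": ["turn-26.6.1"],             # hypothetical class assignment
    "shut": ["other_cos-45.4"],
    "stand_up": ["assuming_position-50"],
}

def classes_for(verb, particle=None):
    """Prefer a multiword entry (verb_particle); fall back to the bare verb.

    Falling back silently discards the particle's contribution to the
    predication, which is one plausible source of the classifier's errors
    on these verb-preposition constructions.
    """
    if particle:
        mwe = f"{verb}_{particle}"
        if mwe in VERBNET:
            return VERBNET[mwe]
    return VERBNET.get(verb, [])

# "shut down" has no multiword entry, so only the bare "shut" classes remain:
shut_down = classes_for("shut", "down")
```

When the fallback fires, the semantic contribution of "down" is lost, and the classifier sees the same VerbNet signal for literal and metaphoric uses.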

Chapter 12

Future Work

We've seen that syntactic evidence from linguistics can provide valuable information for improving computational metaphor processing. We've experimented with a variety of ways of incorporating this information, some of which was best utilized with feature-based machine learning and some of which was designed to be incorporated into deep learning systems. While each method is effective to a degree, there remain many problems with computational metaphor processing, and many areas that can be improved. We will explore a handful of key issues that restrict advancement in this field, and possibilities for opening new avenues of research that could improve metaphor processing and its application to other domains and tasks. This may shed light on how to make the research done here more broadly applicable.

12.1 Representing Constructions

Our linguistic analysis is inspired by evidence from construction grammar, which shows that various syntactic elements can influence our understanding of metaphors. We attempted to incorporate this evidence by employing various syntax-based methods. However, our incorporation of ideas from construction grammar is relatively limited and abstract. We used dependency structures, for both head and dependent words, which provide some notion of argument structure constructions, and we attempted to identify syntactic frames from VerbNet, which can also function as a proxy for argument structure constructions. We didn't, however, incorporate any direct construction grammar formalisms, which may be able to more elegantly capture the linguistic intuitions the work is based on.

A future possibility for better incorporation of this linguistic theory is employing a formal construction grammar (such as sign-based construction grammar [Michaelis, 2015; Boas and Sag, 2012] or fluid construction grammar [van Trijp, 2017; Steels, 2017]). If we can apply computational or heuristic methods to automatically identify the constructs present in sentences, we may be able to better incorporate our linguistic analysis into computational metaphor modelling. This would minimally involve developing a formal construction grammar that can handle metaphors. This could be done by adding source and target potential to constructions, so that they can directly indicate whether they can serve as the source or target element of a metaphor. Lexical constructions could then contain information about which source and target domains they belong to, and possibly even the metaphoric mappings they participate in. The details of adding metaphor representation to a formal construction grammar are beyond the scope of this work, but it should be feasible.
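As a purely hypothetical illustration of what a metaphor-aware construction entry might look like, the record below adds source and target potential to a construction. The field names, the syntax string, and the domain labels are our own illustrative inventions, not part of any existing construction grammar formalism.

```python
from dataclasses import dataclass, field

@dataclass
class Construction:
    """A toy metaphor-aware construction entry (illustrative only)."""
    name: str
    syntax: str                                            # schematic syntax
    source_potential: list = field(default_factory=list)   # usable source domains
    target_potential: list = field(default_factory=list)   # usable target domains
    mappings: list = field(default_factory=list)           # metaphors licensed

# A hypothetical entry for the caused-motion construction:
caused_motion = Construction(
    name="caused-motion",
    syntax="Subj V Obj Oblique(path)",
    source_potential=["MOTION", "LOCATION"],
    target_potential=["CHANGE_OF_STATE"],
    mappings=["STATES ARE LOCATIONS", "CHANGE IS MOTION"],
)
```

A construction-based parser could then check a candidate metaphoric reading against the source and target potential of the constructions it identifies.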
Once we have a functional construction grammar that handles metaphor, we require computational methods to parse sentences into the appropriate representations, including identifying the appropriate constructions and enumerating feature values. This is still a major hurdle: development of construction-based parsing, and even evaluation of these systems, remains difficult [Marques and Beuls, 2016]. Construction grammars, and specifically computational construction grammars, are complex, with a variety of moving parts, and parsing these rich representations is a difficult problem. However, if we can develop a feasible construction grammar, and a parser to go along with it, we would be in a much better place to evaluate the use of construction-based analysis with regard to metaphor detection.

12.2 Quality Data

The first persistent issue in computational metaphor research remains the lack of consistent, high-quality data. Metaphors are exceedingly difficult to annotate, as we have seen, and this creates a data bottleneck. While many other NLP tasks have millions of words or sentences, our largest dataset is only 200,000 words, and it is often annotated inconsistently. Other datasets are significantly smaller, and each has its own strengths and weaknesses. In order to do high-quality machine learning work, more data is necessary, and it needs to be annotated consistently.

We attempted to remedy this problem by identifying additional training data using VerbNet and syntactic patterns. While this data is effective for the VUAMC tasks, it doesn't provide benefits for the other tasks. We believe this may be due to inaccuracies in parsing, particularly with regard to the syntactic pattern data. Identifying these syntactic patterns in corpora requires dependency parses and hand-crafted rules, and it may be the case that these parses are inaccurate. As an improvement, we could instead use the gold-standard parses that are available for many datasets, including some which also contain gold-standard VerbNet annotation. With these gold-standard dependency parses, we can extract additional data based on syntactic patterns with significantly less noise, which may lead to better model performance.

Another option is to use these dependency parses to better understand the distributions of syntactic patterns. Using gold-standard parses, we can examine which patterns are typical and which are marked for each particular verb, and analyze whether these are influential with regard to metaphor.
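A minimal sketch of what a hand-crafted rule over dependency parses might look like follows. The triple format and the example rule are illustrative assumptions, not our actual extraction code, which operates over full parses of a corpus.

```python
def matches(triples, verb, rule):
    """Check whether `verb` heads every (relation, dependent-POS) pair in `rule`.

    `triples` represents a parse as (head, relation, dependent_pos) tuples;
    a rule is a list of (relation, dependent_pos) requirements.
    """
    found = {(rel, pos) for head, rel, pos in triples if head == verb}
    return all(pair in found for pair in rule)

# Hypothetical rule: a verb with a nominal subject and a nominal oblique.
rule = [("nsubj", "NOUN"), ("obl", "NOUN")]

# A toy parse that satisfies the rule:
parse = [("flow", "nsubj", "NOUN"), ("flow", "obl", "NOUN")]
hit = matches(parse, "flow", rule)
```

Sentences whose automatic parses satisfy such a rule would be harvested as distantly supervised training examples, which is why parse errors propagate directly into the harvested data.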
We can then more directly assess what kinds of syntactic patterns are used in what ways, and as our analysis is based on gold-standard dependency parses, it will allow for direct extraction of metaphoric and literal data based on these parses. We will be able to directly inspect the structure of the parses and their relation to metaphoricity, which allows for better extraction of data and a better understanding of how these syntactic components can influence how metaphors are produced.

Even with additional data, problems remain with these corpora. Metaphor annotation projects are typically driven by some other need: certain domains of data, high
inter-annotator agreement, or a particular task that needs to be accomplished. This naturally leads to inconsistencies in annotation schemas, and incompatibility between datasets. We examined four different resources: the VUAMC, the Mohammad et al. dataset, the TroFi dataset, and the LCC data, and each of these has a different fundamental goal. The VUAMC was built upon the metaphor identification procedure in order to build a large, consistently annotated corpus. The Mohammad et al. dataset was built as part of a research project measuring emotion-based language; it is really an artifact of this task and not directly inspired by any particular metaphor motivation. The TroFi dataset was built to develop a specific system that uses unsupervised clustering to identify senses, and thus is focused on the goals of that system. The LCC data was built with very specific domains in mind, and thus doesn't capture all metaphors. Each of these datasets has a valuable motivation for its creation, but we've seen that they lack consistency. While our methods are broadly applicable, their effectiveness varies across datasets, and the disparities in annotation schemas make these datasets generally incompatible.

This problem doesn't have a simple answer: an idealized metaphor annotation schema would capture both metaphoricity (as the VUAMC, MOH-X, and TroFi datasets do) as well as some more advanced annotation of the meaning of the metaphor, be it source and target domains (as the LCC has) or some more complete semantic representation. This is an extremely complex task, and the amount of time and effort that would be required to capture everything that goes into making a metaphor is excessive.

12.3 Quality Tasks

A related issue is that metaphor identification at the binary level isn't necessarily beneficial for downstream tasks. Knowing whether a lexical item or sentence is metaphoric doesn't tell us anything specific about the metaphoric meaning that is intended. There are some words that correspond to only one or two possible metaphors, and for these a binary classification may be sufficient, as it selects the only metaphor the word participates in, but most lexical items have many possible meanings. The productive nature of metaphoric mappings means that binary classification will be uninformative in most cases.

This problem parallels that of good data: the data is built for particular tasks or reasons, and these tasks haven't generally involved building correct semantic interpretations for metaphoric utterances (due to the inherent difficulty, among other reasons). More helpfully, there have been numerous efforts to process metaphors by converting them into literal paraphrases. While this is more practical from an NLP standpoint, as it allows metaphor processing systems to convert metaphoric utterances into literal ones, making downstream NLP easier, it is lacking from linguistic and cognitive perspectives. When people use complex, interesting, and/or novel metaphors, the entirety of the intended meaning cannot be captured by a literal paraphrase.

While generating paraphrases is a theoretically incomplete method of metaphor processing, it is perhaps the most promising bridge allowing metaphor processing systems to have practical implications for NLP. It may be the case that dense paraphrases can capture most of the meaning intended by novel and complex metaphors. The ideal processing system would go a step further, and develop a formal semantic representation for the intended meaning of a metaphor.
This would allow a paraphrase to be generated, and if the semantic representation is abstract and expressive enough, it could even paraphrase the expression in a language-agnostic manner. This task is understandably extremely hard. The subtlety and nuance intended in many metaphors isn't easily incorporated into formal semantic systems. Additionally, many metaphors have multiple possible interpretations, and different speakers may intend or understand them in different ways. Future research into computational representations of metaphor is required to make this goal a reality.

12.4 Understanding Linguistic Metaphor

We've approached metaphor from a syntactic perspective, attempting to show that while lexical semantics is important, there are other structural components in sentences that influence and perhaps license certain metaphors. We've taken this inspiration from construction grammar theories, which show that source and target domains are dependent on the constructions involved, on their autonomy or dependence based on aspects of cognitive grammar, and on other aspects of the interaction between lexical semantics and syntax. This area is ripe for further research. As we develop better datasets and better tasks to fully evaluate and understand how computational methods can be used to handle metaphors and other figuration, we can better explore from a linguistic perspective how these metaphors are generated and understood, and what kinds of patterns license or prohibit which metaphors.

Chapter 13

Bibliography

[Ali and Shapiro, 1993] Syed S. Ali and Stuart C. Shapiro. Natural language processing using a propositional semantic network with structured variables. Minds and Machines, 1993.

[Aristotle and Roberts, 1946] Aristotle. Rhetoric. Translated by R. Roberts. Oxford University Press, 1946.

[Artzi et al., 2015] Yoav Artzi, Nicholas FitzGerald, and Luke Zettlemoyer. Semantic Parsing with Combinatory Categorial Grammars. In Tutorials, Austin, Texas, 2015. Association for the Advancement of Artificial Intelligence (AAAI).

[Athiwaratkun et al., 2018] Ben Athiwaratkun, Andrew Wilson, and Anima Anandkumar. Probabilistic fasttext for multi-sense word embeddings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1–11, 2018.

[Baker et al., 1998] C. F. Baker, C. J. Fillmore, and J. B. Lowe. The Berkeley FrameNet project. In COLING-ACL '98, pages 86–90, Montreal, QC, 1998.

[Banarescu et al., 2013] Laura Banarescu, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. Abstract meaning representation for sembanking. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 178–186, Sofia, Bulgaria, August 2013.

[Barcelona, 2002] Antonio Barcelona. Clarifying and applying the notions of metaphor and metonymy within cognitive linguistics: An update. In René Dirven and Ralf Pörings, editors, Metaphor and Metonymy in Comparison and Contrast, pages 207–278. Mouton de Gruyter, 2002.

[Beigman Klebanov et al., 2015] Beata Beigman Klebanov, Chee Wee Leong, and Michael Flor. Supervised Word-Level Metaphor Detection: Experiments with Concreteness and Reweighting of Examples. In Proceedings of the Third Workshop on Metaphor in NLP, pages 11–20, Denver, Colorado, June 2015.

[Beigman Klebanov et al., 2016] Beata Beigman Klebanov, Chee Wee Leong, E. Dario Gutierrez, Ekaterina Shutova, and Michael Flor. Semantic classifications for detection of verb metaphors. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 101–106, Berlin, Germany, August 2016.

[Birke and Sarkar, 2006] Julia Birke and Anoop Sarkar. A clustering approach for nearly unsupervised recognition of nonliteral language. In Proceedings of EACL-06, pages 329–336, 2006.

[Birke and Sarkar, 2007] Julia Birke and Anoop Sarkar. Active learning for the identification of nonliteral language. In Proceedings of the Workshop on Computational Approaches to Figurative Language, FigLanguages '07, pages 21–28, Stroudsburg, PA, USA, 2007.

[Black, 1962] Max Black. Models and Metaphors, 1962.

[Black, 1981] Max Black. Metaphor. In Mark Johnson, editor, Philosophical Perspectives on Metaphor, pages 63–82. University of Minnesota Press, Minnesota, 1981.

[Black, 1993] Max Black. More about metaphor. In Andrew Ortony, editor, Metaphor and Thought, pages 19–42. Cambridge University Press, Cambridge, 1993.

[Boas and Sag, 2012] H. C. Boas and I. A. Sag, editors. Sign-based construction grammar. CSLI Publications/Center for the Study of Language and Information, 2012.

[Bollegala and Shutova, 2013] Danushka Bollegala and Ekaterina Shutova. Metaphor interpretation using paraphrases extracted from the web. PLOS ONE, 8(9):1–10, September 2013.

[Bradley and Lang, 1999] Margaret M. Bradley and Peter J. Lang. Affective Norms for English Words (ANEW): Instruction Manual and Affective Ratings. Technical report, The Center for Research in Psychophysiology, University of Florida, 1999.

[Chen et al., 2014] Xinxiong Chen, Zhiyuan Liu, and Maosong Sun. A unified model for word sense representation and disambiguation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1025–1035, 2014.

[Cohen, 1993] Jonathan L. Cohen. The semantics of metaphor. In Andrew Ortony, editor, Metaphor and Thought, pages 58–70. Cambridge University Press, 1993.

[Croft, 1993] William Croft. The role of domains in the interpretation of metaphors and metonymies. In René Dirven and Ralf Pörings, editors, Metaphor and Metonymy in Comparison and Contrast, pages 161–206. Mouton de Gruyter, 1993.

[Cruse and Croft, 2004] D. Alan Cruse and William Croft. Cognitive Linguistics. Cambridge Textbooks in Linguistics. Cambridge University Press, 2004.

[Dai et al., 2015] Andrew M. Dai, Christopher Olah, and Quoc V. Le. Document embed- ding with paragraph vectors. In NIPS Deep Learning Workshop, 2015.

[David and Matlock, 2018] Oana David and Teenie Matlock. Cross-linguistic automated detection of metaphors for poverty and cancer. Language and Cognition, 10(3):467–493, 2018.

[David, 2016] Oana David. Metaphor in the Grammar of Argument Realization. PhD thesis, University of California, Berkeley, 2016.

[David, 2017] Oana David. Computing Metaphor: The Case of MetaNet. Cambridge Handbook of Cognitive Linguistics, pages 574–589, 2017.

[Devlin et al., 2018] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, 2018.

[Dietterich, 1998] Thomas G. Dietterich. Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. Neural Computation, 10(7):1895–1923, 1998.

[Dirven and Pörings, 2002] René Dirven and Ralf Pörings, editors. Metaphor and Metonymy in Comparison and Contrast. Mouton de Gruyter, 2002.

[Do Dinh and Gurevych, 2016] Erik-Lân Do Dinh and Iryna Gurevych. Token-Level Metaphor Detection using Neural Networks. In Proceedings of the Fourth Workshop on Metaphor in NLP, pages 28–33, June 2016.

[Dodge et al., 2015] Ellen Dodge, Jisup Hong, and Elise Stickles. MetaNet: Deep semantic automatic metaphor analysis. In Proceedings of the Third Workshop on Metaphor in NLP, pages 40–49, Denver, Colorado, June 2015.

[Duffield et al., 2007] Cecily Jill Duffield, Jena D. Hwang, Susan Windisch Brown, Dmitriy Dligach, Sarah E. Vieweg, Jenny Davis, and Martha Palmer. Criteria for the Manual Grouping of Verb Senses. In Proceedings of the Linguistic Annotation Workshop, pages 49–52, Prague, Czech Republic, June 2007.

[Dunn, 2014] Jonathan Dunn. Measuring metaphoricity. In Association for Computational Linguistics, pages 745–751, Baltimore, MD, 2014.

[Efron, 1979] Bradley Efron. Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 1979.

[Faruqui et al., 2015] Manaal Faruqui, Jesse Dodge, Sujay Kumar Jauhar, Chris Dyer, Eduard Hovy, and Noah A. Smith. Retrofitting word vectors to semantic lexicons. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1606–1615, 2015.

[Fass and Wilks, 1983] Dan Fass and Yorick Wilks. Preference Semantics, Ill-Formedness, and Metaphor. American Journal of Computational Linguistics, 9(3-4):178–187, 1983.

[Fass, 1991] Dan Fass. met*: A Method for Discriminating Metonymy and Metaphor by Computer. Computational Linguistics, pages 49–90, 1991.

[Fass, 1997] Dan Fass. Processing Metonymy and Metaphor. Ablex Publishing Corporation, Greenwich, CT, 1997.

[Fauconnier and Turner, 1996] Gilles Fauconnier and Mark Turner. Blending as a central process of grammar. In Adele Goldberg, editor, Conceptual Structure, Discourse, and Language. Stanford, CA, 1996.

[Fauconnier and Turner, 1998] Gilles Fauconnier and Mark Turner. Conceptual integration networks. Cognitive Science, 22:133–187, 1998.

[Fellbaum, 2010] Christiane Fellbaum and George A. Miller. WordNet. http://wordnet.princeton.edu/, 2010. Accessed: 2017-09-19.

[Fillmore, 1982] Charles Fillmore. Frame Semantics. Linguistics in the Morning Calm, 1:111–138, 1982.

[Fillmore, 1988] Charles J. Fillmore. The Mechanisms of "Construction Grammar". In Fourteenth Annual Meeting of the Berkeley Linguistics Society, pages 35–55, 1988.

[Gao et al., 2018] Ge Gao, Eunsol Choi, Yejin Choi, and Luke Zettlemoyer. Neural Metaphor Detection in Context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 607–613, 2018.

[Gargett and Barnden, 2015] Andrew Gargett and John Barnden. Modeling the interaction between sensory and affective meanings for detecting metaphor. In Third Workshop on Metaphor in NLP, pages 21–30, Denver, CO, 2015.

[Gedigian et al., 2006] Matt Gedigian, John Bryant, Srini Narayanan, and Branimir Ciric. Catching Metaphors. In Proceedings of the Third Workshop on Scalable Natural Language Understanding, ScaNaLU ’06, pages 41–48, Stroudsburg, PA, USA, 2006.

[Gentner, 1983] Dedre Gentner. Structure-Mapping: A Theoretical Framework for Analogy. Cognitive Science, 7:1–5, 1983.

[Gibbs Jr., 1993] Raymond Gibbs Jr. Process and products in making sense of tropes. In Andrew Ortony, editor, Metaphor and Thought, pages 252–276. Cambridge University Press, Cambridge, 1993.

[Glucksberg and Keysar, 1990] Sam Glucksberg and Boaz Keysar. Understanding metaphorical comparisons: Beyond similarity. Psychological Review, 97:3–18, 1990.

[Glucksberg and Keysar, 1993] Sam Glucksberg and Boaz Keysar. How metaphors work. In Andrew Ortony, editor, Metaphor and Thought, pages 401–424. Cambridge University Press, Cambridge, 1993.

[Glucksberg, 2001] Sam Glucksberg. Understanding Figurative Language. Oxford University Press, London, 2001.

[Goldberg, 1995] Adele E. Goldberg. Constructions: A Construction Grammar Approach to Argument Structure. University of Chicago Press, Chicago, 1995.

[Gordon et al., 2015] Jonathan Gordon, Jerry Hobbs, Jonathan May, and Fabrizio Morbini. High-Precision Abductive Mapping of Multilingual Metaphors. In Proceedings of the Third Workshop on Metaphor in NLP, pages 50–55, Denver, Colorado, June 2015.

[Group, 2007] Pragglejaz Group. MIP: A method for identifying metaphorically used words in discourse. Metaphor and Symbol, 22(1):1–39, 2007.

[Gutierrez et al., 2016] E. Dario Gutierrez, Ekaterina Shutova, Tyler Marghetis, and Benjamin Bergen. Literal and Metaphorical Senses in Compositional Distributional Semantic Models. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 183–193, 2016.

[Heintz et al., 2013] Ilana Heintz, Ryan Gabbard, Mahesh Srivastava, Dave Barner, Donald Black, Marjorie Friedman, and Ralph Weischedel. Automatic Extraction of Linguistic Metaphors with LDA Topic Modeling. In Proceedings of the First Workshop on Metaphor in NLP, pages 58–66, Atlanta, Georgia, June 2013.

[Hilpert, 2006] Martin Hilpert. Keeping an eye on the data: metonymies and their patterns. In Anatol Stefanowitsch and Stefan Th. Gries, editors, Corpus-Based Approaches to Metaphor and Metonymy, pages 123–151. Mouton de Gruyter, Berlin, New York, 2006.

[Hobbs, 1992] Jerry R. Hobbs. Metaphor and abduction. In Andrew Ortony, J. Slack, and O. Stock, editors, Communication from an Artificial Intelligence Perspective: Theoretical and Applied Issues, pages 35–58. Springer, 1992.

[Hovy et al., 2013] Dirk Hovy, Shashank Shrivastava, Sujay Kumar Jauhar, Mrinmaya Sachan, Kartik Goyal, Huying Li, Whitney Sanders, and Eduard Hovy. Identifying Metaphorical Word Use with Tree Kernels. In Proceedings of the First Workshop on Metaphor in NLP, pages 52–57, Atlanta, Georgia, June 2013.

[Jang et al., 2016] Hyeju Jang, Yohan Jo, Qinlan Shen, Michael Miller, Seungwhan Moon, and Carolyn Rose. Metaphor Detection with Topic Transition, Emotion and Cognition in Context. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 216–225, 2016.

[Jisup, 2016] Jisup Hong. Automatic metaphor detection using constructions and frames. Constructions and Frames, 8(2):295–322, 2016.

[Kay et al., 2015] Paul Kay, Ivan A. Sag, and Dan Flickinger. A Lexical Theory of Phrasal Idioms. Unpublished manuscript, 2015.

[Kilgarriff et al., 2014] Adam Kilgarriff, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý, and Vít Suchomel. The Sketch Engine: ten years on. Lexicography, pages 7–36, 2014.

[Kiparsky, 1976] Paul Kiparsky. Oral poetry: Some linguistic and typological considerations. Oral Literature and the Formula, 111:73–106, 1976.

[Kipper-Schuler, 2005] Karen Kipper-Schuler. VerbNet: A broad-coverage, comprehensive verb lexicon. PhD thesis, University of Pennsylvania, 2005.
[Kiros et al., 2015] Ryan Kiros, Yukun Zhu, Ruslan R. Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Skip-Thought Vectors. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 3294–3302. Curran Associates, Inc., 2015.

[Klebanov et al., 2014] Beata Beigman Klebanov, Chee Wee Leong, Michael Heilman, and Michael Flor. Different Texts, Same Metaphors: Unigrams and Beyond. In Second Workshop on Metaphor in NLP, pages 11–17, Baltimore, MD, 2014.

[Köper and Schulte im Walde, 2017] Maximilian Köper and Sabine Schulte im Walde. Improving Verb Metaphor Detection by Propagating Abstractness to Words, Phrases and Individual Senses. In Proceedings of the 1st Workshop on Sense, Concept and Entity Representations and their Applications, pages 24–30, 2017.

[Krishnakumaran and Zhu, 2007] Saisuresh Krishnakumaran and Xiaojin Zhu. Hunting Elusive Metaphors Using Lexical Resources. In Proceedings of the Workshop on Computational Approaches to Figurative Language, pages 13–20, Rochester, New York, April 2007.

[Lakoff and Johnson, 1980a] George Lakoff and Mark Johnson. The metaphorical struc- ture of the human conceptual system. Cognitive Science, 4(2):195–208, 1980.

[Lakoff and Johnson, 1980b] George Lakoff and Mark Johnson. Metaphors We Live By. University of Chicago Press, Chicago and London, 1980.

[Lakoff and Turner, 1989] George Lakoff and Mark Turner. More than Cool Reason: A Field Guide to Poetic Metaphor. University of Chicago Press, 1989.

[Lakoff, 1986] George Lakoff. The Meanings of Literal, 1986.

[Lakoff, 1987] George Lakoff. Women, Fire, and Dangerous Things. University of Chicago Press, Chicago, 1987.

[Lakoff, 1993] George Lakoff. The Contemporary Theory of Metaphor. In Andrew Ortony, editor, Metaphor and Thought, pages 202–251. Cambridge University Press, 1993.

[Lakoff, 1994] George Lakoff. Master Metaphor List. University of California, 1994.

[Landauer et al., 1998] T. K. Landauer, P. W. Foltz, and D. Laham. An Introduction to Latent Semantic Analysis. Discourse Processes, 25:259–284, 1998.

[Langacker, 1987] Ronald W. Langacker. Foundations of Cognitive Grammar Vol. I: Theoretical Perspectives. Stanford University Press, 1987.

[Leong et al., 2018] Chee Wee (Ben) Leong, Beata Beigman Klebanov, and Ekaterina Shutova. A Report on the 2018 VUA Metaphor Detection Shared Task. In Proceedings of the Workshop on Figurative Language Processing, pages 56–66, 2018.

[Levin, 1993] Beth Levin. English Verb Classes and Alternations: A Preliminary Investigation. The University of Chicago Press, 1993.

[Manning et al., 2014] Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60, 2014.

[Marques and Beuls, 2016] Tania Marques and Katrien Beuls. Evaluation strategies for computational construction grammars. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1137–1146, 2016.

[Martin, 1990] James H. Martin. A Computational Model of Metaphor Interpretation. Academic Press, Inc., 1990.

[Martin, 1992] James Martin. Knowledge Representation and Metaphor. Computational Linguistics, 18(1), 1992.

[Martin, 1996] James H. Martin. Computational Approaches to Figurative Language. Metaphor and Symbolic Activity, 11(1):85–100, 1996.

[Martin, 2006] James H. Martin. A corpus-based analysis of context-effects on metaphor comprehension. In Anatol Stefanowitsch and Stefan Th. Gries, editors, Corpus-Based Approaches to Metaphor and Metonymy, pages 214–236. Mouton de Gruyter, 2006.

[Martin, 2017] James Martin. personal communication, 2017.

[Mason, 2004] Zachary J. Mason. CorMet: A Computational, Corpus-Based Conventional Metaphor Extraction System. Computational Linguistics, 30(1):23–44, 2004.

[Michaelis, 2015] Laura A. Michaelis. Sign-Based Construction Grammar. The Oxford Handbook of Linguistic Analysis, 2015.

[Mikolov et al., 2013] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space, 2013.

[Mohammad et al., 2016] Saif Mohammad, Ekaterina Shutova, and Peter Turney. Metaphor as a medium for emotion: An empirical study. In Proceedings of the Fifth Joint Conference on Lexical and Computational Semantics, pages 23–33, Berlin, Germany, August 2016.

[Mohler et al., 2014] Michael Mohler, Bryan Rink, David Bracewell, and Marc Tomlinson. A Novel Distributional Approach to Multilingual Conceptual Metaphor Recognition. In COLING Technical Papers, 2014.

[Mohler et al., 2016] Michael Mohler, Mary Brunson, Bryan Rink, and Marc Tomlinson. Introducing the LCC Metaphor Datasets. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Paris, France, May 2016. European Language Resources Association (ELRA).

[Murray, 1931] J. M. Murray. Metaphor. In Countries of the Mind. Oxford University Press, London, 1931.

[Narayanan and Jurafsky, 1998] Srini Narayanan and Daniel Jurafsky. Bayesian Models of Human Sentence Processing. In Proceedings of the Cognitive Science Society, pages 752–757, 1998.

[Nunberg et al., 1994] Geoffrey Nunberg, Ivan A. Sag, and Thomas Wasow. Idioms. Language, 70(3):491–538, 1994.

[Ortony et al., 1979] Andrew Ortony, Sam Glucksberg, Larry Jones, Doug Medin, Robert Sternberg, Tom Trabasso, and Amos Tversky. Beyond Literal Similarity. Psychological Review, 86(3), 1979.

[Ovchinnikova et al., 2014] Ekaterina Ovchinnikova, Ross Israel, Suzanne Wertheim, Vladimir Zaytsev, Niloofar Montazeri, and Jerry Hobbs. Abductive Inference for Interpretation of Metaphors. In Proceedings of the Second Workshop on Metaphor in NLP, pages 33–41, Baltimore, MD, June 2014.

[Palmer et al., 2005] Martha Palmer, Daniel Gildea, and Paul Kingsbury. The Proposition Bank: An Annotated Corpus of Semantic Roles. Computational Linguistics, 31(1):71–106, 2005.

[Palmer et al., 2007] Martha Palmer, Hoa Dang, and Christiane Fellbaum. Making fine-grained and coarse-grained sense distinctions, both manually and automatically. Journal of Natural Language Engineering, pages 137–163, 2007.

[Palmer et al., 2017] Martha Palmer, James Gung, Claire Bonial, Jinho Choi, Orin Hargraves, Derek Palmer, and Kevin Stowe. The pitfalls of shortcuts: Tales from the word sense tagging trenches. Essays in Lexical Semantics and Computational Lexicography - In honor of Adam Kilgarriff, 2017.

[Pedregosa et al., 2011] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[Pennington et al., 2014] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.

[Peters and Peters, 2000] Wim Peters and Ivonne Peters. Lexicalized Systematic Polysemy in WordNet. In Proceedings of the 2nd International Conference on Language Resources and Evaluation, 2000.

[Peters et al., 2018] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of NAACL, 2018.

[Pradhan et al., 2007] Sameer Pradhan, Eduard Hovy, Mitch Marcus, Martha Palmer, Lance Ramshaw, and Ralph Weischedel. OntoNotes: A Unified Relational Semantic Representation. International Journal of Semantic Computing, pages 405–419, 2007.

[Rai et al., 2016] Sunny Rai, Shampa Chakraverty, and Devendra K. Tayal. Supervised Metaphor Detection using Conditional Random Fields. In Proceedings of the Fourth Workshop on Metaphor in NLP, pages 18–27, San Diego, California, June 2016.

[Reddy, 1979] Michael Reddy. The Conduit Metaphor: A case of frame conflict in our language about language. In Andrew Ortony, editor, Metaphor and Thought, pages 284–324. Cambridge University Press, Cambridge, 1979.

[Reining and Lönneker-Rodman, 2007] Astrid Reining and Birte Lönneker-Rodman. Corpus-driven Metaphor Harvesting. In Proceedings of the Workshop on Computational Approaches to Figurative Language, pages 5–12, 2007.

[Richards, 1936] I. A. Richards. Metaphor. In The Philosophy of Rhetoric. Oxford University Press, London, 1936.

[Rosen, 2018] Zachary Rosen. Computationally constructed concepts: A machine learning approach to metaphor interpretation using usage-based construction grammatical cues. In Proceedings of the Workshop on Figurative Language Processing, pages 102–109, 2018.

[Rothe and Schütze, 2015] Sascha Rothe and Hinrich Schütze. AutoExtend: Extending word embeddings to embeddings for synsets and lexemes. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1793–1803, Beijing, China, July 2015.

[Ruppenhofer et al., 2016] Josef Ruppenhofer, Michael Ellsworth, Miriam R. L. Petruck, Christopher R. Johnson, Collin F. Baker, and Jan Scheffczyk. FrameNet II: Extended Theory and Practice. 2016.

[Salton et al., 2016] Giancarlo D. Salton, Robert J. Ross, and John D. Kelleher. Idiom Token Classification using Sentential Distributed Semantics. In Association for Computational Linguistics, pages 194–204, July 2016.

[Saussure, 1916] Ferdinand Saussure. Course in General Linguistics. Duckworth, London, 1916. (trans. Roy Harris).

[Searle, 1993] John R. Searle. Metaphor. In Andrew Ortony, editor, Metaphor and Thought, pages 83–111. Cambridge University Press, Cambridge, 1993.

[Shutova et al., 2012] Ekaterina Shutova, Tim Van De Cruys, and Anna Korhonen. Unsupervised Metaphor Paraphrasing using a Vector Space Model. In COLING, pages 1121–1130, 2012.

[Shutova et al., 2013] Ekaterina Shutova, Simone Teufel, and Anna Korhonen. Statistical Metaphor Processing. Computational Linguistics, 39(2):301–353, 2013.

[Shutova et al., 2016] Ekaterina Shutova, Douwe Kiela, and Jean Maillard. Black holes and white rabbits: Metaphor identification with visual features. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 160–170, San Diego, California, June 2016.

[Shutova, 2010] Ekaterina Shutova. Automatic Metaphor Interpretation as a Paraphrasing Task. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 1029–1037, Los Angeles, California, June 2010.

[Shutova, 2011] Ekaterina V. Shutova. Computational approaches to figurative language. Technical report, Cambridge University, 2011.

[Shutova, 2013] Ekaterina Shutova. Metaphor Identification as Interpretation. In Second Joint Conference on Lexical and Computational Semantics, pages 276–285, Atlanta, Georgia, 2013.

[Shutova, 2015] Ekaterina Shutova. Design and Evaluation of Metaphor Processing Systems. In Association for Computational Linguistics, 2015.

[Steels, 2017] Luc Steels. Basics of fluid construction grammar. Constructions and Frames, 9(2):178–255, 2017.

[Steen et al., 2010] Gerard J. Steen, Aletta G. Dorst, J. Berenike Herrmann, Anna A. Kaal, Tina Krennmayr, and Trijntje Pasma. A Method for Linguistic Metaphor Identification. John Benjamins Publishing Company, 2010.

[Stefanowitsch and Gries, 2006] Anatol Stefanowitsch and Stefan Th. Gries, editors. Corpus-Based Approaches to Metaphor and Metonymy. Mouton de Gruyter, 2006.

[Stickles et al., 2014] E. Stickles, E. Dodge, and J. Hong. A construction-driven, MetaNet-based approach to metaphor extraction and corpus analysis. In Proceedings of the Conceptual Structure, Discourse, and Language Conference (CSDL 2014), 2014.

[Stickles et al., 2016] Elise Stickles, Oana David, Ellen Dodge, and Jisup Hong. Formalizing contemporary conceptual metaphor theory. Constructions and Frames, 8(2):166–213, 2016.

[Stowe and Palmer, 2018] Kevin Stowe and Martha Palmer. Leveraging syntactic constructions for metaphor processing. In Workshop on Figurative Language Processing, New Orleans, Louisiana, June 2018.

[Sullivan, 2013] Karen Sullivan. Frames and Constructions in Metaphoric Language. John Benjamins, 2013.

[Sweetser, 1990] Eve Sweetser. From Etymology to Pragmatics: Metaphorical and Cultural Aspects of Semantic Structure. Cambridge University Press, 1990.

[Tausczik and Pennebaker, 2010] Yla R. Tausczik and James W. Pennebaker. The Psychological Meaning of Words: LIWC and Computerized Text Analysis Methods. Journal of Language and Social Psychology, 29(1):24–54, 2010.

[Tversky, 1977] Amos Tversky. Features of similarity. Psychological Review, 84(4):327–352, 1977.

[van Trijp, 2017] Remi van Trijp. A Computational Construction Grammar for English. In The AAAI 2017 Spring Symposium on Computational Construction Grammar and Natural Language Understanding Technical Report, pages 266–273, 2017.

[Veale and Hao, 2008] Tony Veale and Yanfen Hao. A Fluid Knowledge Representation for Understanding and Generating Creative Metaphors. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pages 945–952, 2008.

[Veale, 2016] Tony Veale. Round Up The Usual Suspects: Knowledge-Based Metaphor Generation. In Proceedings of the Fourth Workshop on Metaphor in NLP, pages 34–41, 2016.

[Warren, 2002] Beatrice Warren. An alternative account of the interpretation of referential metonymy and metaphor. In René Dirven and Ralf Pörings, editors, Metaphor and Metonymy in Comparison and Contrast, pages 113–132. Mouton de Gruyter, 2002.

[Weiner, 1984] E. Judith Weiner. A Knowledge Representation Approach to Understanding Metaphors. Computational Linguistics, 10(1), 1984.

[Wilks, 1975] Yorick Wilks. A preferential, pattern-seeking semantics for natural language inference. Artificial Intelligence, 6(1):53–74, 1975.

[Wilks, 1978] Yorick Wilks. Making Preferences More Active. In Words and Intelligence I, pages 141–166. Springer Netherlands, Dordrecht, 1978.

[Wilson, 1988] M. D. Wilson. The MRC Psycholinguistic Database: Machine Readable Dictionary, Version 2. Behavioural Research Methods, Instruments and Computers, 20(1):6–11, 1988.

[Zhang and Gelernter, 2015] Wei Zhang and Judith Gelernter. Exploring Metaphorical Senses and Word Representations for Identifying Metonyms. CoRR, 2015.

Appendix A

Verb Lit. Syn. Patterns Met. Syn. Patterns encourage NPVNP {TO} VP NPVNP blow blow over blow up, blow away conduct NPV-PASS NPVNP show - - NPVPROVP find NPVNP {TOBE} ADJ find out, find dead WHNPV fall NPVADV, NPV fall in, fall to NPVNPBYNP NPVNP {THAT} VP hold NPVNP {OUT} NPVNPADJ hold onto/on/out hold at, hold down NPVNP {TO} NP, bring bring up, bring about bring together, bring in put - - NPVNP {TOBE} NP allow which allowed negation Verb Lit. Syn. Patterns Met. Syn. Patterns NPV spend time spend NPV{ON}NP spend life NPVPP play - play with stop * * reduce - - suggest negation - NPV meet - meet for/at/to discuss * *

TABLE A.1: Syntactic pattern analysis of the most ambiguous (top) and most misclassified (bottom) verbs from the VUAMC.

Verb Lit. VerbNet Met. VerbNet encourage advise-37.9 amuse-31.1 build-26.1 blow crane-40.3.2 - weather-57 conduct - - show crane-40.3.2 - find get-13.5.1 declare-29.4 calibratable_cos-45.6 convert-26.6.2 fall escape-51.1 long-32.2 acquiesce-95.1 die-42.4 hold-15.1 body_motion-49.2 conjecture-29.5 hold contain-15.4 conduct-111.1 fit-54.3 support-15.3 earn-54.6 bring-11.3 bring establish-55.5 urge-58.1 engender-27.1 estimate-34.2 put - invest-13.5.4 allow - - Verb Lit. VerbNet Met. VerbNet consume-66 spend pay-68 spend_time-104 meet-36.3 trifle-105.3 play performance-36.7 use-105.1 play-114.2 lodge-46 stop stop-55.4 forbid-67 terminus-47.9 subjugate-42.3 reduce - caused_calibratable_cos-45.6.2 suggest say-37.7 reflexive_appearance-48.1.2 meet contiguous_location-47.8 satisfy-55.7 discuss chit_chat-37.6 -

TABLE A.2: VerbNet analysis of the most ambiguous (top) and most misclassified (bottom) verbs from the VUAMC.

Appendix B

It V It V NP It V PP It V PP that S It V that S NP NP V ADVP together NP NP V together NP V NP V ADJ NP V ADJ PP NP V ADJP NP NP V ADV NP V ADV NP NP V ADV PP NP V ADV together NP V ADVP NP V ADVP PP NP V NP NP V NP (PP) NP V NP ADJ NP V NP ADJ PP NP V NP ADJP NP V NP ADJP PP NP V NP ADV NP V NP ADVP NP V NP NP NP V NP NP PP NP V NP NP PP PP NP V NP P NP V NP PP NP V NP PP NP NP V NP PP PP NP V NP PP PP PP NP V NP PP S NP V NP PP WH S NP V NP NP V NP WH S NP V NP apart NP V NP down NP V NP how S NP V NP that S NP V NP to be ADJ NP V NP to be NP NP V NP together NP V NP up NP V PP NP V PP ADV NP V PP NP NP V PP NP PP PP NP V PP NP S NP V PP PP NP V PP PP PP NP V PP PP PP PP NP V PP PP S NP V PP PP WH S NP V PP S NP V PP WH NP V PP WH S NP V PP how S NP V PP how/whether S NP V PP that S NP V S NP V WH S NP V apart NP V down NP NP V for NP S NP V how S NP V out NP V that S NP V that S PP NP V together NP V together ADV NP V up NP PP V NP PP V PP PP there V NP Passive That S V There V NP There V NP PP There V PP NP

TABLE B.1: Possible syntactic frames after compression.

Appendix C

Algorithm                 | LCC (All)      | LCC (Verbs)    | LCC Domain (All) | LCC Domain (Verbs)
                          | P    R    F1   | P    R    F1   | P    R    F1     | P    R    F1
SVM (Full baseline)       | .62  .68  .65  | .52  .59  .56  | .58  .64  .59    | .18  .21  .19
Dependency Features       | .64  .68  .66  | .56  .57  .57  | .54  .59  .55    | .16  .19  .17
VerbNet Structures        | .65  .68  .67  | .58  .58  .58  | .52  .58  .53    | .12  .16  .13
VerbNet Embeddings        | .62  .68  .65  | .54  .58  .56  | .54  .60  .55    | .14  .17  .14
Distant Supervision       | .63  .70  .66  | .52  .60  .56  | -    -    -      | -    -    -
All Methods               | .66  .69  .67  | .60  .59  .59  | .52  .58  .53    | .16  .18  .16
Bi-LSTM from Gao et al.   | .84  .71  .77  | .80  .63  .70  | .68  .60  .64    | .63  .64  .63
Dependency Features       | -    -    -    | -    -    -    | -    -    -      | -    -    -
VerbNet Structures        | -    -    -    | -    -    -    | -    -    -      | -    -    -
VerbNet Embeddings        | .82  .72  .76  | .79  .62  .69  | .76  .57  .65    | .69  .61  .64
Distant Supervision       | .81  .72  .76  | .75  .65  .69  | -    -    -      | -    -    -
All Methods               | .77  .76  .76  | .72  .66  .69  | -    -    -      | -    -    -

TABLE C.1: F1, precision, and recall scores for LCC classification.
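The F1 scores in Table C.1 are the harmonic mean of the corresponding precision and recall columns. As a quick sanity check (a small hypothetical helper, not part of the thesis code), the reported F1 values can be recomputed from P and R:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (the F1 measure)."""
    return 2 * precision * recall / (precision + recall)

# Spot-check two rows of Table C.1, rounded to two decimals as reported:
print(round(f1(0.62, 0.68), 2))  # SVM baseline, LCC (All): 0.65
print(round(f1(0.84, 0.71), 2))  # Bi-LSTM from Gao et al., LCC (All): 0.77
```

Both values match the table, confirming the F1 columns are consistent with the P and R columns.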