Learning High-Precision Lexical Inferences

Vered Shwartz

Department of Computer Science Ph.D. Thesis

Submitted to the Senate of Bar-Ilan University

Ramat Gan, Israel

April 2019

This work was carried out under the supervision of Prof. Ido Dagan, Department of Computer Science, Bar-Ilan University.

To my dearest parents, Haim and Tikva, and to my beloved husband, Or, who knows so much about my research topic that he should be awarded an honorary degree.

Acknowledgements

The years at Bar-Ilan University, and especially the years during my PhD, were among the best of my life. It was thanks to my passion for my research topic, but also thanks to a supportive environment, that I was happy to wake up every morning and come to the lab.

First and foremost I would like to express my deepest gratitude to my advisor, Ido Dagan. Ido not only introduced me to NLP but also taught me how to conduct sound scientific research, convey the results clearly and distill the main argument of a research paper. He has endless patience and a very kind and generous attitude towards his students. I would like to believe I developed a somewhat more relaxed working mode thanks to him.

I also had the privilege of working with Yoav Goldberg, to whom I owe my knowledge of deep learning. It was a pleasure to collaborate with him and I look up to his productivity in work and his ability to conduct fast and high-quality research.

Many colleagues played an important role in the success of my PhD. I would like to thank my lab mates at the NLP lab in Bar-Ilan University (sorted alphabetically): Alon Jacovi, Amit Moryossef, Arie Cattan, Aryeh Tiktinsky, Asher Stern, Assaf Amrami, Assaf Toledo, Aviv Weinstein, Ayal Klein, Chaya Liebeskind, Daniel Juravski, Daniela Stepanov, Eli Kiperwasser, Eyal Orbach, Gabriel Stanovsky, Hila Gonen, Jessica Ficler, Karin Brisker, Lilly Kotlerman, Matan Ben Noach, Meni Adler, Micah Shlain, Natalie Shapira, Noa Lubin, Ofer Bronstein, Ofer Sabo, Ohad Rosen, Omer Levy, Oren Melamud, Ori Shapira, Paul Roit, Rachel Wities, Roee Aharoni, Shachar Rosenman, Shani Barhom, Shauli Ravfogel, Vasily Konovalov, Yanai Elazar, and Yehudit Meged.

I am very grateful to my collaborators outside the NLP lab, who contributed to the success of my PhD: Alon Eirew, Chris Callison-Burch, Claudio Delli Bovi, Dan Roth, Dominik Schlechtweg, Enrico Santus, Eugenio Martinez Camara, Gavriel Habib (who doubles as my co-author and my brother), Guy Tevet, Horacio Saggion, Iryna Gurevych, Jacob Goldberger, Jonathan Berant, Jose Camacho-Collados, Kathrin Eichler, Luis Espinosa-Anke, Marianna Apidianaki, Max Glockner, Michael Bugert, Nils Reimers, Roberto Navigli, Sebastian Padó, Sergio Oramas, Shyam Upadhyay, Sneha Rajana, Tae-Gil Noh, Tommaso Pasini, Tu Vu, and Vivi Nastase.

Special thanks to Eyal Shnarch, who mentored me during my internship at IBM Research, and to Chris Waterson, who mentored me during my internship at Google Research. Both internships taught me about working in large research groups and doing research that makes an impact, and I learned and improved my working methods thanks to both.

Last but not least, I am grateful to my family: my parents Haim and Tikva, who always encouraged me to study and supported me along the way, and my siblings, Shiri, Gali, and Gavriel. In no way would I have been able to complete a successful PhD without the endless help and support from my beloved husband, Or. He was there for me all along, making my success his priority, taking care of me and taking other tasks off my plate so that I could focus solely on my research.

Preface

Portions of this thesis are joint work and have appeared elsewhere.

Chapter 3, "Hypernyms under Siege: Linguistically-motivated Artillery for Hypernymy Detection", appeared in the proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2017) (Shwartz et al., 2017a).

Chapter 4, "Improving Hypernymy Detection with an Integrated Path-based and Distributional Method", appeared in the proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016) (Shwartz et al., 2016).

Chapter 5, "Path-based vs. Distributional Information in Recognizing Lexical Semantic Relations", appeared in the proceedings of the 5th Workshop on Cognitive Aspects of the Lexicon (CogALex 2016) (Shwartz and Dagan, 2016b).

Chapter 6, "Acquiring Predicate Paraphrases from News Tweets", appeared in the proceedings of the 6th Joint Conference on Lexical and Computational Semantics (*SEM 2017) (Shwartz et al., 2017b).

Chapter 7, "Olive Oil Is Made of Olives, Baby Oil Is Made for Babies: Interpreting Noun Compounds Using Paraphrases in a Neural Model", appeared in the proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL 2018) (Shwartz and Waterson, 2018).

Chapter 8, "Paraphrase to Explicate: Revealing Implicit Noun-Compound Relations", appeared in the proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018) (Shwartz and Dagan, 2018).

Chapter 9, "Adding Context to Semantic Data-Driven Paraphrasing", appeared in the proceedings of the Fifth Joint Conference on Lexical and Computational Semantics (*SEM 2016) (Shwartz and Dagan, 2016a).

This work was supported by an Intel ICRI-CI grant, the Israel Science Foundation grant 1951/17, the German Research Foundation through the German-Israeli Project Cooperation (DIP, grant DA 1600/1-1), the Clore Scholars Programme (2017), and the AI2 Key Scientific Challenges Program (2017).

Contents

Abstract

1 Introduction
  1.1 Semantic Relation Classification
  1.2 Predicate Paraphrases
  1.3 Interpreting Noun Compounds
  1.4 Context-sensitive Lexical Inference
  1.5 Contributions and Outline

2 Background
  2.1 Distributional Semantics
    2.1.1 Distributional Semantic Models
    2.1.2 Word Embeddings
    2.1.3 Contextualized Word Representations
  2.2 Taxonomies and Knowledge Resources
    2.2.1 WordNet
    2.2.2 Knowledge Bases
  2.3 Lexical Inference Benchmarks
    2.3.1 Word Similarity and Relatedness
    2.3.2 Lexical Substitution
    2.3.3 Relation Classification
    2.3.4 Paraphrasing

3 Hypernyms under Siege: Linguistically-motivated Artillery for Hypernymy Detection

4 Improving Hypernymy Detection with an Integrated Path-based and Distributional Method

5 Path-based vs. Distributional Information in Recognizing Lexical Semantic Relations

6 Acquiring Predicate Paraphrases from News Tweets

7 Olive Oil Is Made of Olives, Baby Oil Is Made for Babies: Interpreting Noun Compounds Using Paraphrases in a Neural Model

8 Paraphrase to Explicate: Revealing Implicit Noun-Compound Relations

9 Adding Context to Semantic Data-Driven Paraphrasing

10 Conclusion
  10.1 Going Beyond the Distributional Hypothesis
  10.2 Modeling Context-sensitivity
  10.3 Representations Beyond the Word-Level
  10.4 Going Forward: Improving Downstream Applications

Bibliography

Hebrew Abstract

Abstract

Applications of natural text understanding have to deal with two major issues in human language: lexical variability, i.e. the same meaning can be expressed in various ways, and ambiguity, i.e. the same word can have different meanings, depending on the context in which it is used.

Targeting such issues, one of the most basic components in natural language understanding is recognizing lexical inferences. Lexical inference corresponds to a semantic relation that holds between two lexical items (words or multi-word expressions), when the meaning of one can be inferred from the other. These semantic relations can be of various types such as synonymy (elevator/lift), hypernymy or Is-A (cat/animal, Google/company), and part-of (London/England).

Recognizing lexical inferences can benefit various natural language processing (NLP) applications. In text summarization, lexical inference can help identify redundancy, when two candidate sentences for the summary differ only in terms that hold a lexical inference relation. For example, when summarizing reviews of phone models, it helps to know that "the battery is long-lasting" is redundant with "the battery is enduring". In reading comprehension, answering the question "which phones have long-lasting batteries?" given the text "Galaxy has a long-lasting battery" requires knowing that Galaxy is a model of a phone.

Most NLP applications today rely on word embeddings as a means of information about the semantic similarity between words. A word embedding is a dense vector obtained by aggregating the contexts with which the word co-occurred in a large text corpus, based on the distributional hypothesis (Harris, 1954). Word embeddings capture topical similarity, or relatedness (elevator/floor), and also functional similarity (elevator/escalator). However, many tasks require knowing the exact relationship between two words, and this information is not straightforwardly available in word embeddings. Instead, the most similar words of a word like cat can have various different relations with it: tail (part of), dog (sibling), kitten (hyponym), and feline (hypernym).

In this thesis I present various algorithms developed to recognize lexical inferences. In one thread, I investigate ways to recognize the semantic relation that holds between a pair of words. As part of this exploration, I propose a novel method that integrates two complementary information sources: path-based and distributional. The first relies on the joint occurrences of the two words in the corpus (e.g. "cats and other animals"), while the second utilizes the word embeddings of each word, obtained from their disjoint occurrences in the corpus.

The proposed model overcomes issues observed in previous work. First, previous work showed that purely distributional methods can only determine properties of each single word, e.g. "how likely is animal to be a hypernym?" rather than "how likely is animal to be a hypernym of cat?". When this information is combined with path-based information, the model is also aware of the specific relationship between the two words. Second, the model represents paths using a neural model that allows for better generalization, capturing semantic relatedness between paths, e.g. "X is Y" and "X is defined as Y". Finally, the model became the new state-of-the-art for this task, with a significant gap from previous methods.

In another line of work, I constructed a resource of predicate paraphrases. An example of such a pair of predicates is 'X dies at Y' and 'X lives until Y', which would have a similar meaning under the assignment of a person to X and an age to Y. This resource was constructed automatically from news headlines on Twitter, relying on the redundancy of various headlines discussing the same event to extract lexically-divergent paraphrases.

Finally, I worked on interpreting the implicit semantic relation that holds between the constituents of a noun compound. For example, olive oil is oil made of olives, while baby oil is oil made for babies. The interpretation of noun compounds has been addressed in the literature either by classifying them to a fixed inventory of ontological relationships (e.g. olive oil: SOURCE vs. baby oil: PURPOSE) or by generating various free text paraphrases that describe the relation in a more expressive manner (e.g. "olive oil is oil made of olives"). I explore both variants of this task.

For the classification task, I applied the same classification method that I developed to recognize semantic relations between arbitrary words, gaining a performance improvement over previous work. Even so, the performance on this task remains mediocre for all methods, partially due to the strict task definition. One of the conclusions drawn from this work was that noun compounds often exhibit multiple semantic relations, and therefore the paraphrasing paradigm is more informative.

The method I developed for the paraphrasing task achieved state-of-the-art performance on this task, thanks to its ability to generalize for both semantically similar constituent nouns (e.g. predicting the paraphrases of the unobserved steel knife based on the similar observed plastic spoon) and semantically similar paraphrases (e.g. predicting X made of Y when X extracted from Y was observed).

My proposed models outperform highly competitive baselines and improve the state-of-the-art in several benchmarks for lexical inference. However, it is worth noting that applying lexical inferences within applications imposes additional difficulty in detecting whether the potential lexical inference is valid within a given context, as is shown in the last chapter of this thesis. The recent development of context-aware word embeddings has advanced the ability of NLP applications to recognize the correct sense of a polysemous word in a given context, but other semantic phenomena related to lexical inferences remain challenging even for them. In the conclusion of this thesis I share my thoughts on the remaining challenges and possible approaches to address them in the future.

Chapter 1

Introduction

Making informed decisions, such as which model of phone to buy, or which hotel to stay in while traveling, is often aided by reading online reviews and interacting on social networks. This requires reading and processing many texts, some of which are redundant, while others may be contradictory. Our capacity as humans to process the vast amount of data on the web is limited. To that end, the natural language processing community has been developing applications such as multi-document summarization (Barzilay et al., 1999) and machine reading comprehension (Hirschman et al., 1999), which aid people by processing large amounts of text into concise information.

Such applications have to deal with lexical variability, i.e. the same meaning can be expressed in various ways. Targeting this issue, one of the basic components in natural language understanding is recognizing lexical inferences. Lexical inference corresponds to a semantic relation that holds between two lexical items (words or multiword expressions), when the meaning of one can be inferred from the other. In text summarization, for example, lexical inference can help identify redundancy, when two candidate sentences for the summary differ only in terms that hold a lexical inference relation (e.g. "the battery is long-lasting" and "the battery is enduring"). In reading comprehension, answering the question "which phones have long-lasting batteries?" given the text "Galaxy has a long-lasting battery" requires knowing that Galaxy is a model of a phone.

In this chapter I provide background for the various aspects of lexical inference discussed in the thesis. First, I focus on three tasks pertaining to lexical inference: identifying the semantic relation that holds between a pair of words (Section 1.1), extracting pairs of texts that have roughly the same meaning (Section 1.2), and identifying the semantic relation that holds between the constituents of a noun compound (Section 1.3). Then I discuss the importance of considering the context in which the lexical inferences are applied (Section 1.4). Finally, in Section 1.5 I list the contributions of this work and provide an outline for the thesis.

1.1 Semantic Relation Classification

Many NLP applications today rely on word embeddings to inform them about the semantic relation between words. However, word embeddings capture a fuzzy distributional similarity between words, which conflates multiple semantic relations such as synonymy (elevator / lift), hypernymy/hyponymy (cat / animal), holonymy/meronymy (cat / tail), co-hyponymy (cat / dog), antonymy (hot / cold), and general relatedness or topical similarity (elevator / floor). Applications that require knowing the specific semantic relation that holds between a pair of words must get this information from other sources.

While semantic taxonomies, like WordNet (Fellbaum, 1998), define the semantic relations between word types (e.g. (cat, animal, hypernymy), (cat, tail, part-of)), they are limited in scope and domain. Therefore, automated methods have been developed to determine, for a given term-pair (x, y), the semantic relation that holds between them, based on their occurrences in a large corpus. Two main information sources are used to recognize these relations in corpus-based methods: path-based and distributional.

Path-based methods consider the joint occurrences of x and y in the corpus, where the lexico-syntactic paths that connect the terms are typically used as features (e.g. "cat and other animals" indicates that cat is a "type of" animal). Hearst (1992) identified a small set of frequent paths that indicate hypernymy, e.g. "Y such as X". Snow et al. (2004) represented each (x, y) term-pair as the multiset of dependency paths connecting their co-occurrences in a corpus, and trained a classifier to predict hypernymy based on these features.

Using individual paths as features results in a huge, sparse feature space. While some paths are rare, they often differ from other paths only in unimportant components. For instance, "Spelt is a species of wheat" and "Fantasy is a genre of fiction" yield two different paths, "X be species of Y" and "X be genre of Y", while both indicate that X is-a Y. A possible solution is to generalize paths by replacing words along the path with their part-of-speech tags or with wild cards, as done in the PATTY system (Nakashole et al., 2012).

In contrast to the path-based approach, distributional methods are based on the individual occurrences of each term, relying on the distributional hypothesis (Harris, 1954), according to which words that occur in similar contexts tend to have similar meanings. In recent years, this was the most common approach pursued, and it has been shown to perform well on several common benchmarks (Baroni et al., 2012; Roller et al., 2014; Weeds et al., 2014). However, it has later been pointed out that these methods suffer from lexical memorization, and can only determine properties of a single word, e.g. "how likely is animal to be a hypernym?" rather than "how likely is animal to be a hypernym of cat?" (Levy et al., 2015a).

Overall, the state-of-the-art path-based methods perform worse than the distributional ones. This stems from a major limitation of path-based methods: they require that the terms of the pair occur together in the corpus, limiting the recall of these methods. While distributional methods have no such requirement, they are usually less precise in detecting a specific semantic relation, and perform best on detecting broad semantic similarity between terms. Though these approaches seem complementary, there has been rather little previous work on integrating them.
Chapters 4 and 5 describe an integrated path-based and distributional model for semantic relation classification. First, I focused on improving path representation using a recurrent neural network, resulting in a path-based model that performs significantly better than prior path-based methods, and matches the performance of the previously superior distributional methods. In particular, I demonstrated that the increase in recall is a result of generalizing semantically-similar paths, in contrast to prior methods, which either make no generalizations or over-generalize paths. Then, I extended the approach to integrate both path-based and distributional signals, significantly improving upon the state-of-the-art on this task, and confirming that the approaches are indeed complementary.

I provide further analysis of specific cases in which the path-based signal especially contributes to the performance of the model: 1) in strict evaluation setups, in which a model cannot benefit from memorizing the typicality of a single word to a relation (e.g. animal is often a hypernym); 2) for rare words or senses; and 3) when the relationship is less prototypical, e.g. cherry and pick are classified to the event relation.
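To make the integrated representation concrete, the following is a minimal sketch (not the exact HypeNET/LexNET architecture; all names and dimensions are illustrative) of a classifier that concatenates the two words' embeddings with an LSTM encoding of the dependency paths connecting them:

```python
import torch
import torch.nn as nn

class PairClassifier(nn.Module):
    """Simplified sketch of an integrated path-based + distributional classifier."""

    def __init__(self, vocab_size, num_relations, word_dim=50, path_dim=60):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)              # distributional signal
        self.edge_emb = nn.Embedding(vocab_size, word_dim)              # embeddings for words along a path
        self.path_lstm = nn.LSTM(word_dim, path_dim, batch_first=True)  # path-based signal
        self.classifier = nn.Linear(2 * word_dim + path_dim, num_relations)

    def forward(self, x_id, y_id, paths):
        # paths: list of LongTensors, each a sequence of ids for one x...y path
        encoded = []
        for p in paths:
            _, (h, _) = self.path_lstm(self.edge_emb(p).unsqueeze(0))
            encoded.append(h.squeeze(0).squeeze(0))
        # average the encodings of all paths connecting x and y (uniform weights here)
        path_vec = (torch.stack(encoded).mean(dim=0) if encoded
                    else torch.zeros(self.path_lstm.hidden_size))
        features = torch.cat([self.word_emb(x_id), path_vec, self.word_emb(y_id)])
        return self.classifier(features)

# Toy usage: x and y are word ids; one path such as "X is a Y" is a sequence of ids.
model = PairClassifier(vocab_size=1000, num_relations=5)
scores = model(torch.tensor(3), torch.tensor(7), [torch.tensor([11, 12, 13])])
print(scores.shape)  # torch.Size([5])
```

In the actual models, path encodings are averaged with weights proportional to their corpus frequency, and each path element combines the lemma, part-of-speech tag, dependency label and edge direction, but the overall shape of the pair feature vector is as above.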

1.2 Predicate Paraphrases

While the above method works well for terms (words and noun phrases), the following is concerned with verbal phrases. Recognizing that various textual descriptions across multiple texts refer to the same event or action can benefit NLP applications such as recognizing textual entailment (Dagan et al., 2013) and question answering. For example, to answer "when did the US Supreme Court approve same-sex marriage?" given the text "In June 2015, the Supreme Court ruled for same-sex marriage", approve and ruled for should be identified as describing the same action.

Much effort has been devoted to identifying predicate paraphrases, e.g. "X buy Y" and "X acquire Y". Two main approaches were proposed for that goal; the first leverages the similarity in argument distribution across a large corpus between two predicates (Lin and Pantel, 2001; Berant et al., 2010), e.g. "X buy Y" and "X acquire Y" are similar because they are both instantiated with names of purchased companies as Y and names of companies that purchased others as X. The second approach exploits bilingual parallel corpora, extracting as paraphrases pairs of texts that were translated identically into foreign languages (Barzilay and McKeown, 2001; Ganitkevitch et al., 2013).

A third approach was proposed to harvest paraphrases from multiple mentions of the same event in news articles. This approach assumes that various redundant reports make different lexical choices to describe the same event (Shinyama et al., 2002; Shinyama and Sekine, 2006; Roth and Frank, 2012; Zhang and Weld, 2013). In chapter 6, I present an ever-growing resource, currently consisting of millions of predicate paraphrase template pairs such as "X dies at Y" and "X lives until Y", following this event coreference assumption. The resource was collected from news tweets. Pairs of predicate templates (e.g. "X dies at Y" and "X lives until Y") that share mutual arguments (e.g. Chuck Berry, 90) and were published on the same day were considered as paraphrases. Comparison to existing resources showed that the resource complements other, much larger paraphrase resources with non-consecutive predicates (e.g. "reveal X to Y" / "share X with Y") and paraphrases which are highly context specific (e.g. "X get Y" / "X sentence to Y" in the context of a person and the time they are about to serve in prison).
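To make the extraction heuristic concrete, here is a minimal sketch (the input format and names are my own assumptions, not the actual pipeline behind the resource): predicate templates that share both arguments and were published on the same day are grouped together and paired as candidate paraphrases.

```python
from collections import defaultdict
from itertools import combinations

def extract_paraphrase_pairs(tweets):
    """tweets: iterable of (date, template, (arg_x, arg_y)) tuples,
    e.g. ('2017-03-18', 'X dies at Y', ('Chuck Berry', '90')).
    Templates that share both arguments on the same day become candidate paraphrases."""
    by_key = defaultdict(set)
    for date, template, args in tweets:
        by_key[(date, args)].add(template)

    pairs = set()
    for templates in by_key.values():
        for t1, t2 in combinations(sorted(templates), 2):
            pairs.add((t1, t2))
    return pairs

tweets = [
    ('2017-03-18', 'X dies at Y', ('Chuck Berry', '90')),
    ('2017-03-18', 'X lives until Y', ('Chuck Berry', '90')),
]
print(extract_paraphrase_pairs(tweets))  # {('X dies at Y', 'X lives until Y')}
```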

1.3 Interpreting Noun Compounds

A special case of identifying the semantic relation that holds between a pair of nouns can be found in noun compounds. Noun compounds implicitly convey a certain semantic relation that holds between their constituents. For example, a 'birthday cake' is a cake eaten on a birthday, while 'apple cake' is a cake made of apples. Interpreting noun compounds by explicating their implicit relationship is beneficial for many natural language understanding tasks, especially given the prevalence of noun compounds in English (Nakov, 2013).

Noun compound interpretation has been addressed in the literature in two variants. The first variant is a "closed" classification task to a set of predefined ontological relations, for example, classifying apple cake to SOURCE vs. birthday cake to TIME. As an alternative to the strict classification, Nakov and Hearst (2006) suggested that the semantics of a noun compound could be expressed with multiple prepositional and verbal paraphrases. For example, apple cake is a cake from, made of, or which contains apples.

Recent approaches for classification start by learning vector representations for noun compounds, and then use them as features for the classifier. Most commonly, the noun compound vector is obtained by applying a function to its constituents' distributional representations, e.g. vec(apple cake) = f(vec(apple), vec(cake)). Various functions have been proposed, usually based on vector arithmetic, and they are often learned with the objective of minimizing the distance between the learned vector and the observed vector (computed from corpus occurrences) of each noun compound. Using compositional noun compound vectors for classification leads to promising results (Van de Cruys et al., 2013; Dima and Hinrichs, 2015); however, Dima (2016) argues that this stems from memorizing prototypical words for each relation, for example, classifying any noun compound with the head cake to the SOURCE relation, regardless of the modifier.

In chapter 7, I applied lessons learned from the parallel task of semantic relation classification and adapted LexNET to the noun compound classification task, combining the distributional embeddings of the constituent nouns with the lexico-syntactic paths connecting them in the corpus. The proposed method outperformed the baselines in strict evaluation setups in which the model could not benefit from lexical memorization. Having said that, the analysis also revealed issues related to the dataset and task definition, suggesting that the less strict paraphrasing paradigm is more suitable for the interpretation of noun compounds.

For the paraphrasing task, it is common to start with a pre-processing step of extracting the joint occurrences of the constituents from a corpus to generate a list of candidate paraphrases. Most methods lack the ability to generalize, and have a hard time interpreting infrequent or new noun compounds. For example, if the corpus does not contain paraphrases for plastic spoon, it is desirable for a model to predict its paraphrases based on a similar observed compound, such as steel knife. Similarly, if a certain noun compound only appears with few paraphrases in the corpus, we would like the model to output additional semantically-similar paraphrases, as in predicting "X made from Y" when "X extracted from Y" has been observed.

In chapter 8, I present a method for noun compound paraphrasing. I follow previous work and extract the joint occurrences of the constituents from a corpus to generate a list of candidate paraphrases. As opposed to previous approaches, which focus on predicting a paraphrase template for a given noun-compound (e.g. predicting "X made of Y" for apple cake), I reformulate the task as a multi-task learning problem, and train the model to also predict a missing constituent of the compound given the paraphrase template and the other constituent (e.g. predicting apple for "cake made of X"). This enables better generalization for both unseen noun compounds and rare paraphrases. Indeed, the model's ability to generalize leads to improved performance in challenging evaluation settings, becoming the new state-of-the-art on the task.
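As an illustration of the compositional representations mentioned above for the classification variant, the sketch below computes vec(w1 w2) = f(vec(w1), vec(w2)), with f a linear map over the concatenated constituent embeddings; the embeddings and the matrix here are random toy values, and in practice f is learned, e.g. with the objective of approximating corpus-observed compound vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 50
emb = {w: rng.normal(size=dim) for w in ['apple', 'olive', 'baby', 'cake', 'oil']}

# One possible composition function f: vec(w1 w2) = W [vec(w1); vec(w2)]
W = rng.normal(size=(dim, 2 * dim))  # a learned matrix in practice; random here

def compose(w1, w2):
    return W @ np.concatenate([emb[w1], emb[w2]])

nc_vec = compose('apple', 'cake')  # feature vector for the compound 'apple cake'
print(nc_vec.shape)                # (50,)
```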

1.4 Context-sensitive Lexical Inference

Applying lexical inferences within applications imposes additional difficulty in detecting whether the potential lexical inference is valid within a given context. First, the application needs to address the ambiguity issue, by considering the meaning of each term within its context. For instance, play is a hypernym of compete in certain contexts, but not in the context of playing the national anthem at a sports competition. Second, the soundness of inferences within context is conditioned on the specific semantic relation that holds between the terms. For instance, in some sentences, a part infers its whole ("I live in London" → "I live in England"), while in others it may not ("I'm leaving London" ↛ "I'm leaving England").

Most work on lexical inference today judges the relationship between a pair of terms out-of-context. The existing datasets for in-context lexical inference are annotated for either coarse-grained semantic relations, such as similarity or relatedness, which may not be sufficiently informative, or for lexical substitution, which is a much stricter definition requiring words to be near synonyms.

In chapter 9, I studied pairs of words which were annotated with the semantic relations that hold between them in a context-agnostic manner. By adding context and re-annotating the data, I found that almost half of the semantically-related term-pairs become unrelated when the context is specified. This happens, for example, due to polysemy, as in the relation between piece and strip when the latter appears in the sentence "...the labor leader was found to have gone to a strip club during a taxpayer funded trip". Furthermore, a generic out-of-context relation may change within a given context. For instance, race is more specific than competition in general, but in specific contexts it can be judged as equivalent, as in "...will pull out of the race for presidency".

1.5 Contributions and Outline

In chapters 3 and 4 I focus on the hypernymy relationship. Detecting hypernymy relations is a key task in NLP. In chapter 3, I investigate an extensive number of unsupervised distributional measures for hypernymy detection. These measures are based on high-dimensional distributional vectors, also called distributional semantic models (DSMs). I investigated the combination of proposed measures from the literature with various DSMs that differ by context type and feature weighting, and analyzed the performance of the different methods based on their linguistic motivation. The contributions of this work are 1) suggesting a "recipe" for the best setting to differentiate hypernyms from each of the other semantic relations; and 2) showing, through a comparison to the state-of-the-art, that although supervised methods generally outperform the unsupervised ones, the former are sensitive to the distribution of training instances, which hurts their reliability. Being based on general linguistic hypotheses and independent from training data, unsupervised measures are more robust.

In chapter 4, I present HypeNET: an integrated path-based and distributional model for hypernymy detection. The model achieved state-of-the-art performance on the task by combining these two complementary information sources and thanks to an improved path representation. The paper describing HypeNET was recognized as an "outstanding paper" at ACL 2016. In chapter 5, I extend HypeNET to LexNET, which supports classification to multiple semantic relations. The empirical results show that this method is effective in the multi-class setting as well. LexNET achieved the new state-of-the-art for semantic relation classification, and won the CogALex 2016 shared task.

In chapter 6, I present a resource consisting of predicate paraphrase template pairs (e.g. "X dies at Y" and "X lives until Y"). Knowing that the two can have the same meaning—under the same assignment of arguments and in certain contexts—can contribute to many NLP applications dealing with lexical variability. The resource I present was collected from news tweets. Pairs of predicate templates that share mutual arguments (e.g. Chuck Berry, 90) and which were published on the same day were considered as paraphrases. The resource is ever-growing, currently consisting of 4 million pairs (as of April 2019).

In chapters 7 and 8 I address the interpretation of the semantic relation that holds between the constituents of a noun compound. In chapter 7, I present a method for the classification task. I adapted LexNET to this task, combining the distributional embeddings of the constituent nouns with the lexico-syntactic paths connecting them in the corpus. The method outperformed the baselines in strict evaluation setups. In chapter 8, I present a state-of-the-art method for the paraphrasing task. The model enables better generalization for both unseen noun compounds and rare paraphrases by jointly predicting a paraphrase template for a given noun-compound (e.g. predicting "X made of Y" for olive oil) and a missing constituent of the compound given the paraphrase template and the other constituent (e.g. predicting olive for "oil made of X").

In chapter 9, I touch upon the importance of modelling the context in which the lexical inference is applied. I studied pairs of words which were annotated with the semantic relations that hold between them, both in a context-agnostic manner and with respect to a given context. I found that adding context often changes this semantic relation, concluding that modelling the context in which the inference is applied is crucial. The paper described in this chapter was recognized as a "best paper" at *SEM 2016.

For the reader's convenience, Table 1 presents a summary of the tasks presented in this thesis.

| Ch. | Task | Input | Output | Approach |
|-----|------|-------|--------|----------|
| 3 | Hypernymy Detection | pair of terms (x, y) | whether y is a hypernym of x | Unsupervised, distributional |
| 4 | Hypernymy Detection | pair of terms (x, y) | whether y is a hypernym of x | Supervised, integrated path-based and distributional |
| 5 | Semantic Relation Classification | pair of terms (x, y) | the semantic relation between x and y (from a pre-defined set) | Supervised, integrated path-based and distributional |
| 6 | Predicate Paraphrases | - | pairs of predicate templates (p1, p2) conveying roughly the same meaning in some contexts | Unsupervised heuristic based on redundant news headlines |
| 7 | Noun Compound Classification | noun compound w1 w2 | the semantic relation between w1 and w2 (pre-defined set) | Supervised, integrated path-based and distributional |
| 8 | Noun Compound Paraphrasing | noun compound w1 w2 | free text paraphrase p describing the relation between w1 and w2 | Supervised, multi-task; predict each of {w1, w2, p} |
| 9 | Semantic Relation Classification in Context | pair of terms (x, y) in given sentences | the semantic relation between x and y in the given contexts (from a pre-defined set) | Dataset |

Table 1: Summary of the tasks presented in this thesis, specifying inputs and outputs, the approach taken, and the chapter in which each work is presented.

Finally, in chapter 10, I summarize the contributions of this thesis, which include (1) going beyond the commonly used distributional representations and investigating additional information sources for recognizing lexical inferences; (2) an analysis of the effect of context on the applicability of lexical inferences; and (3) two approaches for making inferences related to the implicit relationship between the constituents of a noun compound. I share my view of the remaining challenges in each one of these aspects of lexical inference and discuss the potential approaches for future research.

Chapter 2

Background

In this chapter I provide complementary background to that in the following chapters. First, I give an overview of distributional semantics, which is the basis on which words are represented across NLP applications. Then I briefly review other sources of lexical information, which are given in a structured manner. Although the algorithms presented in this thesis are corpus-based and do not make use of such resources, knowledge resources are used in chapter 4 to create training data. Finally, I list a few other lexical tasks and benchmarks which were not used in this work.

2.1 Distributional Semantics

A word is often considered the most basic unit of meaning in language. In NLP, it is common practice to represent words as vectors. The various approaches to create such vectors rely on the distributional hypothesis, according to which words that occur in similar contexts tend to have similar meanings (Harris, 1954), or as described by Firth (1957): "You shall know a word by the company it keeps".


2.1.1 Distributional Semantic Models

Based on this hypothesis, distributional semantic models (DSMs) compute a matrix of word-word co-occurrences in a large text corpus, and then apply a weighting function to the matrix. Each row corresponds to the vector representing a specific word, and the vectors representing semantically-similar words (e.g. cat and kitten) are expected to be similar in terms of vector similarity (typically cosine similarity).

DSMs differ from each other in the choice of context type and feature weighting. The context type defines the neighbors of each word. Most commonly, window-based context is used, where the contexts of a target word $w_i$ are the words surrounding it in a k-sized window: $w_{i-k}, ..., w_{i-1}, w_{i+1}, ..., w_{i+k}$. Alternatively, dependency-based context considers the neighbors in a dependency parse tree (Padó and Lapata, 2007; Baroni and Lenci, 2010). The context type can affect the type of similarity that the DSM captures: larger windows tend to capture topical similarity (e.g. dancer / performance), whereas smaller windows and dependency-based contexts capture more functional similarity (e.g. dancer / singer) (Levy and Goldberg, 2014).
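A minimal sketch of the counting step described above, using a symmetric k-sized window over a toy corpus (the function name and data are illustrative):

```python
from collections import Counter

def window_cooccurrences(sentences, k=2):
    """Count (target word, context word) co-occurrences in a symmetric k-sized window."""
    counts = Counter()
    for sent in sentences:
        for i, target in enumerate(sent):
            window = sent[max(0, i - k):i] + sent[i + 1:i + 1 + k]
            for context in window:
                counts[(target, context)] += 1
    return counts

corpus = [['cute', 'cats', 'drink', 'milk'], ['cats', 'chase', 'mice']]
counts = window_cooccurrences(corpus, k=2)
print(counts[('cats', 'drink')])  # 1
```

Each row of the resulting word-by-context matrix, after applying a weighting function such as positive PMI (described in chapter 3), is the distributional vector of the corresponding word.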

2.1.2 Word Embeddings

The dimension of distributional vectors is the size of the vocabulary, $|V|$, which can easily reach a few hundred thousand words. The vectors are sparse, because most words do not occur next to each other. To overcome the computational obstacles, it was common to apply dimensionality reduction algorithms such as Singular Value Decomposition (SVD) to the co-occurrence matrix, obtaining d-dimensional vectors for some predefined dimension $d \ll |V|$ (Deerwester et al., 1990).

An alternative to dimensionality reduction was suggested by Bengio et al. (2003): instead of counting co-occurrences, the word vectors are initialized randomly, and during training they are updated with the training objective of predicting whether certain words occurred together or not. The result is a low-dimensional embedding space (typically 50-600 dimensions) in which semantically-similar words are embedded in proximity. Despite being first presented in 2003, word embeddings only gained popularity with the release of word2vec a decade later (Mikolov et al., 2013). Following word2vec, several other embedding algorithms were developed, the most popular of which are GloVe (Pennington et al., 2014) and fastText (Bojanowski et al., 2017), which extends word2vec by adding information about subwords (a bag of character n-grams).
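The SVD-based reduction mentioned above can be sketched as follows on a toy co-occurrence matrix; truncating to d dimensions is the standard recipe rather than any particular system's implementation:

```python
import numpy as np

# Toy word-by-context co-occurrence matrix M (rows: words, columns: contexts).
M = np.array([[2., 0., 1., 3.],
              [1., 1., 0., 2.],
              [0., 3., 2., 0.]])

d = 2  # target dimension, d << |V|
U, S, Vt = np.linalg.svd(M, full_matrices=False)
word_vectors = U[:, :d] * S[:d]  # d-dimensional dense word vectors

print(word_vectors.shape)  # (3, 2)
```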

Usage of Word Embeddings in Downstream Tasks

In recent years, word embeddings have been used successfully across tasks, starting from syntactic tasks such as part-of-speech tagging, noun phrase chunking, and syntactic parsing, and continuing with semantic tasks such as named entity recognition, semantic role labeling, sentiment analysis and other text classification variants, paraphrase detection, and recognizing textual entailment (Bakarov, 2018). Initializing the representation layer with pre-trained embeddings can especially benefit tasks with a small amount of training data, which does not allow learning a meaningful representation from scratch.

Usage of Word Vectors in Intrinsic Lexical Inference Tasks

Earlier methods developed unsupervised measures based on high-dimensional distributional vectors to detect lexical inferences. This started with symmetric similarity measures (Lin, 1998), followed by directional measures of hypernymy based on the distributional inclusion hypothesis (Weeds and Weir, 2003; Kotlerman et al., 2010). This hypothesis states that the contexts of a hyponym are expected to be largely included in those of its hypernym. More recent work (Santus et al., 2014; Rimell, 2014) introduced new measures, based on the assumption that the most typical linguistic contexts of a hypernym are less informative than those of its hyponyms.

More recently, the focus of the distributional approach for recognizing lexical inferences shifted to supervised methods. In these methods, a pair of words x and y is represented by a feature vector, and a classifier is trained on these vectors to predict the semantic relation that holds between them. Several methods are used to represent term-pairs as a combination of each term's embedding vectors: concatenation $\vec{x} \oplus \vec{y}$ (Baroni et al., 2012), difference $\vec{y} - \vec{x}$ (Roller et al., 2014; Weeds et al., 2014), and dot-product $\vec{x} \cdot \vec{y}$. Using word embeddings, these methods are easy to apply, and show good results (Baroni et al., 2012; Roller et al., 2014).

In several chapters of this thesis, I return to the disadvantage of relying solely on word embeddings to determine semantic relations between words. As shown by Levy et al. (2015a), these methods can only rely on the properties of each single word in a pair, rather than modelling the relationship between the words. To that end, complementary information sources can mitigate this issue, as I show in chapter 4.
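As a concrete illustration of these supervised methods, the sketch below builds a pair feature vector from two word embeddings (combining concatenation, difference and dot-product; the cited works each use a subset of these) and trains a toy classifier with scikit-learn. The embeddings and labels are random and purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pair_features(vec_x, vec_y):
    # concatenation, difference, and dot-product of the two word vectors
    return np.concatenate([vec_x, vec_y, vec_y - vec_x, [vec_x @ vec_y]])

rng = np.random.default_rng(0)
emb = {w: rng.normal(size=50) for w in ['cat', 'animal', 'tail', 'dog']}

X = np.array([pair_features(emb['cat'], emb['animal']),
              pair_features(emb['cat'], emb['tail'])])
y = ['hypernymy', 'meronymy']  # toy labels

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict([pair_features(emb['dog'], emb['animal'])]))
```

Note that nothing in such a feature vector forces the classifier to model the interaction between x and y, which is exactly the lexical memorization problem discussed above and in chapter 4.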

Limitations

Two major shortcomings of any kind of word vectors based on the distributional hypothesis are 1) they conflate all the different senses of a word into a single vector; and 2) they conflate the various semantic relations into one fuzzy similarity relation.

To address the polysemy issue, there have been attempts to learn separate vectors for each sense of a word. These methods typically follow one of two main approaches: they either rely on a predefined sense inventory, or induce the senses in an unsupervised manner (Camacho-Collados and Pilehvar, 2018). However, the utility of such embeddings highly depends on defining the required granularity level of the senses, and the produced vectors do not deal with minor meaning shifts within the same sense, which can happen in certain contexts.

With respect to the fuzzy similarity signal the vectors yield, consider a word like cat. Among the most similar words to cat, one may find hypernyms (animal, pet, feline), hyponyms (kitten), co-hyponyms / siblings (dog), parts (tail), and related words (purr). Perhaps the most concerning issue is the inability to discern, based on distributional similarity, synonyms from antonyms (hot / cold) and mutually exclusive terms (Sunday / Monday), since words related in any of these relations tend to appear in similar contexts.

Finally, while in high-dimensional distributional vectors each feature corresponds to a context, the features of word embeddings are not interpretable. In chapter 3 I describe various measures for hypernymy detection that depend on the feature interpretation as means of linguistic motivation, and therefore cannot work with word embeddings.

2.1.3 Contextualized Word Representations

Recently, a new paradigm has emerged: instead of using static word vectors, the representation of a word is computed dynamically given its sentential context (Peters et al., 2018; Radford et al., 2018; Devlin et al., 2018). In doing so, such contextualized word representations largely address polysemy, as they no longer conflate all the different senses of a word into a single vector. These models are pre-trained as general purpose language models using a large-scale unannotated corpus. They are meant to be used as a representation layer in downstream tasks, either fine-tuned to the task or kept fixed. Contextualized word representations have been shown to significantly improve performance over regular word embeddings across various NLP applications.
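For illustration, a contextualized vector for a word occurrence can be obtained roughly as follows. This sketch assumes the HuggingFace transformers library and a pre-trained BERT model, which are example choices rather than anything used in this thesis, and the word-piece matching is deliberately simplistic.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModel.from_pretrained('bert-base-uncased')

def word_vector(sentence, word):
    """Average the top-layer vectors of the word pieces matching `word` in `sentence`."""
    inputs = tokenizer(sentence, return_tensors='pt')
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (num_tokens, dim)
    tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
    idx = [i for i, t in enumerate(tokens) if word in t]
    return hidden[idx].mean(dim=0)

# The two occurrences of 'bank' receive different vectors, reflecting their different senses.
v1 = word_vector('She sat on the bank of the river.', 'bank')
v2 = word_vector('He deposited money at the bank.', 'bank')
print(torch.cosine_similarity(v1, v2, dim=0))
```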

2.2 Taxonomies and Knowledge Resources

While most work on recognizing lexical inferences is corpus-based, it is also possible to mine high-precision lexical inferences from structured resources, such as those detailed below.

2.2.1 WordNet

WordNet (Fellbaum, 1998) is an ontology of the English language in which nouns, verbs, adjectives and adverbs are grouped into sets of synonyms (synsets) and are interlinked by means of lexical semantic relations (hypernymy, meronymy, etc.).

WordNet is widely used for identifying lexical inferences, usually in an unsupervised setting where the relations relevant for each specific inference task are manually selected a priori. One approach looks for chains of these pre-defined relations, e.g. inference via a chain of hypernyms: dog → canine → carnivore → placental mammal → mammal (Harabagiu and Moldovan, 1998). Another approach is via WordNet Similarity, which takes two synsets and returns a numeric value that represents their similarity based on WordNet's hierarchical hypernymy structure (Pedersen et al., 2004). Popular similarity measures consider only hypernyms and synonyms as a manual design choice (Wu and Palmer, 1994; Resnik, 1995).
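For example, with NLTK's WordNet interface (assuming nltk and its WordNet data are installed), hypernym chains and Wu-Palmer similarity can be queried as follows:

```python
from nltk.corpus import wordnet as wn

dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')

# One hypernym path from the root of the noun hierarchy down to dog
print([s.name() for s in dog.hypernym_paths()[0]])

# Wu-Palmer similarity, based on the depth of the lowest common hypernym
print(dog.wup_similarity(cat))
```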

2.2.2 Knowledge Bases

While WordNet is quite extensive (containing 150,000 nouns annotated to 13 different semantic relations), it is hand-crafted by expert lexicographers, and thus cannot compete in terms of scale with community-built knowledge bases (KBs), which connect millions of entities through a rich variety of structured relations. Specifically, by definition, WordNet does not cover many proper names (Beyoncé → artist) and recent terminology (Facebook → social network). These can be found in the rich and up-to-date structured KBs.

Wikidata (Vrandečić, 2012) contains facts about 55 million entities with 2,000 different relations, which were extracted automatically and edited by humans. DBpedia (Auer et al., 2007) contains structured information from Wikipedia: info boxes, redirections, disambiguation links, etc. The English version of DBpedia currently describes ∼4.58 million entities and 1,300 relations. Yago (Suchanek et al., 2007) is a semantic knowledge base derived from Wikipedia, WordNet, and GeoNames, consisting of more than 10 million entities and 114 relations. Finally, ConceptNet (Speer and Havasi, 2012) contains 37 commonsense relationships between 3.9 million entities, and is collected from a variety of resources, including crowd-sourced resources, games with a purpose, and expert-created resources such as WordNet.

In recent years, numerous NLP applications have leveraged KBs as a means of obtaining a richer knowledge representation. Examples where such KBs proved beneficial include word sense disambiguation, semantic role labeling, semantic parsing, and text classification (Gurevych et al., 2016). Shwartz et al. (2015) leveraged KBs to extract lexical inference relations, concluding that they are complementary to corpus-based methods, especially in scenarios in which precision is more important than recall.

2.3 Lexical Inference Benchmarks

In this section I provide an overview of lexical inference benchmarks which are complementary to those mentioned in this thesis. In these benchmarks, each entry consists of a pair of terms x and y, annotated to certain semantic relation(s) that may hold between them. The tasks and benchmarks differ on two main axes: (1) the semantic relations; and (2) whether the terms are given within context or out-of-context.

2.3.1 Word Similarity and Relatedness

The goal in this task is to predict a number indicating the degree of similarity or association between x and y. The task is commonly used as intrinsic evaluation to estimate the quality of word embeddings, by computing the correlation between the human judgements and the cosine similarity between the vectors. Common datasets for the task include WordSim-353 (Finkelstein et al., 2002), SimLex-999 (Hill et al., 2014), and MEN (Bruni et al., 2014).

Although all these benchmarks group several semantic relations under a single coarse-grained relation, there is a fine distinction of similarity vs. relatedness. Similar words share numerous common properties (e.g. cup and mug) while related terms appear in the same broad contexts and are pertinent to the same domain or topic (e.g. cup and coffee). The SimLex-999 dataset captures strict similarity, and WordSim-353 was divided by Zesch et al. (2008) into distinct sets consisting of either similarity or relatedness.

In most datasets, word-pairs are annotated for the relatedness or similarity between them in some (unspecified) contexts. The exceptions are Stanford's Contextual Word Similarity dataset (SCWS) (Huang et al., 2012) and TR9856 (Levy et al., 2015b), in which the annotation is given with respect to a given context.
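The standard intrinsic evaluation described above amounts to correlating cosine similarities with the human ratings; a minimal sketch with toy vectors and toy benchmark entries:

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

rng = np.random.default_rng(0)
emb = {w: rng.normal(size=50) for w in ['cup', 'mug', 'coffee', 'dog']}

# (word1, word2, human rating), in the style of WordSim-353 / SimLex-999
benchmark = [('cup', 'mug', 9.0), ('cup', 'coffee', 6.5), ('cup', 'dog', 1.0)]

gold = [score for _, _, score in benchmark]
predicted = [cosine(emb[x], emb[y]) for x, y, _ in benchmark]
rho, pval = spearmanr(gold, predicted)
print(rho)
```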

2.3.2 Lexical Substitution

The lexical substitution task requires identifying meaning-preserving substitutes for a target word in a given sentential context. Common datasets for this task are the SemEval 2007 task (McCarthy and Navigli, 2007) and the "All-Words" corpus (Kremer et al., 2014). Lexical substitutability is stricter than similarity and is usually defined by synonymy (cold / chilly) or functional similarity (dollars / euros). The task is typically addressed by modelling the context sentence and predicting words that can fit in that context (Melamud et al., 2013, 2015, 2016).

2.3.3 Relation Classification

In relation classification the goal is to classify the relation that is expressed between two target terms in a given sentence to one of predefined relation classes. To illustrate, consider the following sentence, from the SemEval-2010 relation classification task dataset (Hendrickx et al., 2009): "The [apples]e1 are in the [basket]e2". Here, the relation expressed between the target entities is Content-Container(e1, e2).

While relation classification and semantic relation classification (section 1.1) are both concerned with identifying semantic relations that hold for pairs of terms, they differ in a major respect. In semantic relation classification, the goal is to recognize a generic lexical-semantic relation between terms that holds in many contexts, while in relation classification the relation should be expressed in the given text. A common best practice is to rely on the shortest dependency path between the target entities (Fundel et al., 2007; Xu et al., 2015; Liu et al., 2015).
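A common way to realize the shortest-dependency-path practice is to parse the sentence and run a graph shortest path between the two entity tokens. The sketch below assumes spaCy (with an English model) and networkx; these are illustrative choices, not the tools used in the cited works.

```python
import spacy
import networkx as nx

nlp = spacy.load('en_core_web_sm')  # assumes this English model is installed

def shortest_dependency_path(sentence, e1, e2):
    doc = nlp(sentence)
    # Build an undirected graph over token indices using the dependency arcs.
    edges = [(token.i, child.i) for token in doc for child in token.children]
    graph = nx.Graph(edges)
    i1 = [t.i for t in doc if t.text == e1][0]
    i2 = [t.i for t in doc if t.text == e2][0]
    return [doc[i].text for i in nx.shortest_path(graph, source=i1, target=i2)]

print(shortest_dependency_path('The apples are in the basket.', 'apples', 'basket'))
# e.g. ['apples', 'are', 'in', 'basket']; the exact path depends on the parser
```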

2.3.4 Paraphrasing

Paraphrasing refers to the extraction of pairs of differing textual realizations of the same meaning. These pairs of texts can be words, other lexical items, full sentences, or predicate templates as described in chapter 6. Two typical approaches to extract paraphrases are translation-based and event-based.

Barzilay and McKeown (2001) were the first to extract paraphrases from multiple translations of the same text, a technique referred to as "bilingual pivoting" or "back-translation". Subsequently, Ganitkevitch et al. (2013) applied this technique to multiple foreign languages, resulting in the paraphrase database (PPDB), consisting of 220 million pairs of English paraphrases. Mallinson et al. (2017) showed that this technique is still valid, and yields improved performance, when neural machine translation models are used.

The second approach harvests paraphrases from multiple mentions of the same event in news articles. This approach assumes that various redundant reports make different lexical choices to describe the same event (Shinyama et al., 2002; Shinyama and Sekine, 2006; Roth and Frank, 2012; Zhang and Weld, 2013). Lan et al. (2017) developed a supervised model to collect sentential paraphrases from Twitter. They used pairs of tweets linking to the same URL as positive training examples.

Chapter 3

Hypernyms under Siege: Linguistically-motivated Artillery for Hypernymy Detection

Hypernyms under Siege: Linguistically-motivated Artillery for Hypernymy Detection

Vered Shwartz¹, Enrico Santus²,³ and Dominik Schlechtweg⁴

¹Bar-Ilan University, Ramat-Gan, Israel
²Singapore University of Technology and Design, Singapore
³The Hong Kong Polytechnic University, Hong Kong
⁴University of Stuttgart, Stuttgart, Germany
{vered1986,esantus}@gmail.com, [email protected]

Abstract

The fundamental role of hypernymy in NLP has motivated the development of many methods for the automatic identification of this relation, most of which rely on word distribution. We investigate an extensive number of such unsupervised measures, using several distributional semantic models that differ by context type and feature weighting. We analyze the performance of the different methods based on their linguistic motivation. Comparison to the state-of-the-art supervised methods shows that while supervised methods generally outperform the unsupervised ones, the former are sensitive to the distribution of training instances, hurting their reliability. Being based on general linguistic hypotheses and independent from training data, unsupervised measures are more robust, and therefore are still useful artillery for hypernymy detection.

1 Introduction

In the last two decades, the NLP community has invested a consistent effort in developing automated methods to recognize hypernymy. Such effort is motivated by the role this semantic relation plays in a large number of tasks, such as taxonomy creation (Snow et al., 2006; Navigli et al., 2011) and recognizing textual entailment (Dagan et al., 2013). The task has appeared to be, however, a challenging one, and the numerous approaches proposed to tackle it have often shown limitations. Early corpus-based methods have exploited patterns that may indicate hypernymy (e.g. "animals such as dogs") (Hearst, 1992; Snow et al., 2005), but the recall limitation of this approach, requiring both words to co-occur in a sentence, motivated the development of methods that rely on adaptations of the distributional hypothesis (Harris, 1954).

The first distributional approaches were unsupervised, assigning a score for each (x, y) word-pair, which is expected to be higher for hypernym pairs than for negative instances. Evaluation is performed using ranking metrics inherited from information retrieval, such as Average Precision (AP) and Mean Average Precision (MAP). Each measure exploits a certain linguistic hypothesis such as the distributional inclusion hypothesis (Weeds and Weir, 2003; Kotlerman et al., 2010) and the distributional informativeness hypothesis (Santus et al., 2014; Rimell, 2014).

In the last couple of years, the focus of the research community shifted to supervised distributional methods, in which each (x, y) word-pair is represented by a combination of x and y's word vectors (e.g. concatenation or difference), and a classifier is trained on these resulting vectors to predict hypernymy (Baroni et al., 2012; Roller et al., 2014; Weeds et al., 2014). While the original methods were based on count-based vectors, in recent years they have been used with word embeddings (Mikolov et al., 2013; Pennington et al., 2014), and have gained popularity thanks to their ease of use and their high performance on several common datasets. However, there have been doubts on whether they can actually learn to recognize hypernymy (Levy et al., 2015b).

Additional recent hypernymy detection methods include a multimodal perspective (Kiela et al., 2015), a supervised method using unsupervised measure scores as features (Santus et al., 2016a), and a neural method integrating path-based and distributional information (Shwartz et al., 2016).

In this paper we perform an extensive evaluation of various unsupervised distributional measures for hypernymy detection, using several distributional semantic models that differ by context type and feature weighting. Some measure variants and context-types are tested for the first time.¹

¹Our code and data are available at: https://github.com/vered1986/UnsupervisedHypernymy

We demonstrate that since each of these measures captures a different aspect of the hypernymy relation, there is no single measure that consistently performs well in discriminating hypernymy from different semantic relations. We analyze the performance of the measures in different settings and suggest a principled way to select the suitable measure, context type and feature weighting according to the task setting, yielding consistent performance across datasets.

We also compare the unsupervised measures to the state-of-the-art supervised methods. We show that supervised methods outperform the unsupervised ones, while also being more efficient, computed on top of low-dimensional vectors. At the same time, however, our analysis reassesses previous findings suggesting that supervised methods do not actually learn the relation between the words, but only characteristics of a single word in the pair (Levy et al., 2015b). Moreover, since the features in embedding-based classifiers are latent, it is difficult to tell what the classifier has learned. We demonstrate that unsupervised methods, on the other hand, do account for the relation between words in a pair, and are easily interpretable, being based on general linguistic hypotheses.

2 Distributional Semantic Spaces

We created multiple distributional semantic spaces that differ in their context type and feature weighting. As an underlying corpus we used a concatenation of the following two corpora: ukWaC (Ferraresi, 2007), a 2-billion word corpus constructed by crawling the .uk domain, and WaCkypedia EN (Baroni et al., 2009), a 2009 dump of the English Wikipedia. Both corpora include POS, lemma and dependency parse annotations. Our vocabulary (of target and context words) includes only nouns, verbs and adjectives that occurred at least 100 times in the corpus.

Context Type We use several context types:

• Window-based contexts: the contexts of a target word $w_i$ are the words surrounding it in a k-sized window: $w_{i-k}, ..., w_{i-1}, w_{i+1}, ..., w_{i+k}$. If the context-type is directional, words occurring before and after $w_i$ are marked differently, i.e.: $w_{i-k}/l, ..., w_{i-1}/l, w_{i+1}/r, ..., w_{i+k}/r$. Out-of-vocabulary words are filtered out before applying the window. We experimented with window sizes 2 and 5, directional and indirectional (win2, win2d, win5, win5d).

• Dependency-based contexts: rather than adjacent words in a window, we consider neighbors in a dependency parse tree (Padó and Lapata, 2007; Baroni and Lenci, 2010). The contexts of a target word $w_i$ are its parent and daughter nodes in the dependency tree (dep). We also experimented with a joint dependency context inspired by Chersoni et al. (2016), in which the contexts of a target word are the parent-sister pairs in the dependency tree (joint). See Figure 1 for an illustration.

[Figure 1: An example dependency tree of the sentence "cute cats drink milk", with the target word cats. The dependency-based contexts are drink-v:nsubj and cute-a:amod⁻¹. The joint-dependency context is drink-v#milk-n. Differently from Chersoni et al. (2016), we exclude the dependency tags to mitigate the sparsity of contexts.]
Both corpora include POS, lemma • formation (PMI) (Church and Hanks, 1990) and dependency parse annotations. Our vocabu- is defined as the log ratio between the joint lary (of target and context words) includes only probability of w and c and the product of nouns, verbs and adjectives that occurred at least their marginal probabilities: PMI(w, c) = 100 times in the corpus. ˆ log P (w,c) , where Pˆ(w), Pˆ(c), and Pˆ(w, c) Pˆ(w)Pˆ(c) Context Type We use several context types: are estimated by the relative frequencies of a word w, a context c and a word-context Window-based contexts: the contexts of a tar- • pair (w, c), respectively. To handle unseen get word w are the words surrounding it in a k- i pairs (w, c), yielding PMI(w, c) = log(0) = sized window: wi k, ..., wi 1, wi+1, ..., wi+k. − − , PPMI (Bullinaria and Levy, 2007) assigns If the context-type is directional, words occur- −∞ zero to negative PMI scores: PPMI(w, c) = ring before and after w are marked differently, i max(PMI(w, c), 0). i.e.: wi k/l, ..., wi 1/l, wi+1/r, ..., wi+k/r. − − In addition, one of the measures we used (San- 1Our code and data are available at: https://github.com/vered1986/UnsupervisedHypernymy tus et al., 2014) required a third feature weighting:

• Positive LMI (PLMI) - positive local mutual information (PLMI) (Evert, 2005; Evert, 2008). PPMI was found to have a bias towards rare events. PLMI simply balances PPMI by multiplying it by the co-occurrence frequency of w and c: PLMI(w, c) = freq(w, c) · PPMI(w, c).

3 Unsupervised Hypernymy Detection Measures

We experiment with a large number of unsupervised measures proposed in the literature for distributional hypernymy detection, with some new variants. In the following section, ~v_x and ~v_y denote x and y's word vectors (rows in the matrix M). We consider the scores as measuring to what extent y is a hypernym of x (x → y).

3.1 Similarity Measures

Following the distributional hypothesis (Harris, 1954), similar words share many contexts, thus have a high similarity score. Although the hypernymy relation is asymmetric, similarity is one of its properties (Santus et al., 2014).

• Cosine Similarity (Salton and McGill, 1986) A symmetric similarity measure:

cos(x, y) = (~v_x · ~v_y) / (||~v_x|| · ||~v_y||)

• Lin Similarity (Lin, 1998) A symmetric similarity measure that quantifies the ratio of shared contexts to the contexts of each word:

Lin(x, y) = Σ_{c ∈ ~v_x ∩ ~v_y} [~v_x[c] + ~v_y[c]] / (Σ_{c ∈ ~v_x} ~v_x[c] + Σ_{c ∈ ~v_y} ~v_y[c])

• APSyn (Santus et al., 2016b) A symmetric measure that computes the extent of intersection among the N most related contexts of two words, weighted according to the rank of the shared contexts (with N as a hyper-parameter):

APSyn(x, y) = Σ_{c ∈ N(~v_x) ∩ N(~v_y)} 1 / ((rank_x(c) + rank_y(c)) / 2)

3.2 Inclusion Measures

According to the distributional inclusion hypothesis, the prominent contexts of a hyponym (x) are expected to be included in those of its hypernym (y).

• Weeds Precision (Weeds and Weir, 2003) A directional precision-based similarity measure. This measure quantifies the weighted inclusion of x's contexts by y's contexts:

WeedsPrec(x → y) = Σ_{c ∈ ~v_x ∩ ~v_y} ~v_x[c] / Σ_{c ∈ ~v_x} ~v_x[c]

• cosWeeds (Lenci and Benotto, 2012) Geometric mean of cosine similarity and Weeds precision:

cosWeeds(x → y) = sqrt(cos(x, y) · WeedsPrec(x → y))

• ClarkeDE (Clarke, 2009) Computes degree of inclusion, by quantifying weighted coverage of the hyponym's contexts by those of the hypernym:

CDE(x → y) = Σ_{c ∈ ~v_x ∩ ~v_y} min(~v_x[c], ~v_y[c]) / Σ_{c ∈ ~v_x} ~v_x[c]

• balAPinc (Kotlerman et al., 2010) Balanced average precision inclusion.

APinc(x → y) = Σ_{r=1}^{N_y} [P(r) · rel(c_r)] / N_y

is an adaptation of the average precision measure from information retrieval for the inclusion hypothesis. N_y is the number of non-zero contexts of y and P(r) is the precision at rank r, defined as the ratio of shared contexts with y among the top r contexts of x. rel(c) is the relevance of a context c, set to 0 if c is not a context of y, and to 1 − rank_y(c) / (N_y + 1) otherwise, where rank_y(c) is the rank of the context c in y's sorted vector. Finally,

balAPinc(x → y) = sqrt(Lin(x, y) · APinc(x → y))

is the geometric mean of APinc and Lin similarity.

• invCL (Lenci and Benotto, 2012) Measures both distributional inclusion of x in y and distributional non-inclusion of y in x:

invCL(x → y) = sqrt(CDE(x → y) · (1 − CDE(y → x)))

3.3 Informativeness Measures

According to the distributional informativeness hypothesis, hypernyms tend to be less informative than hyponyms, as they are likely to occur in more general contexts than their hyponyms.

• SLQS (Santus et al., 2014)

SLQS(x → y) = 1 − E_x / E_y

The informativeness of a word x is evaluated as the median entropy of its top N contexts: E_x = median_{i=1}^{N}(H(c_i)), where H(c) is the entropy of context c.
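To make these definitions concrete, here is a minimal sketch of a few of the measures over sparse context vectors, represented as Python dictionaries mapping a context to its weight (e.g., its PPMI score). This is an illustrative reimplementation rather than the released code from the repository cited in footnote 1; the function names and the toy vectors are our own.

import math

def cosine(vx, vy):
    """Symmetric cosine similarity between two sparse context vectors."""
    shared = set(vx) & set(vy)
    dot = sum(vx[c] * vy[c] for c in shared)
    norm_x = math.sqrt(sum(w * w for w in vx.values()))
    norm_y = math.sqrt(sum(w * w for w in vy.values()))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0

def weeds_prec(vx, vy):
    """Weeds precision: weighted inclusion of x's contexts in y's contexts."""
    shared = set(vx) & set(vy)
    den = sum(vx.values())
    return sum(vx[c] for c in shared) / den if den else 0.0

def clarke_de(vx, vy):
    """ClarkeDE: weighted coverage of x's contexts by those of y."""
    shared = set(vx) & set(vy)
    den = sum(vx.values())
    return sum(min(vx[c], vy[c]) for c in shared) / den if den else 0.0

def inv_cl(vx, vy):
    """invCL: inclusion of x in y combined with non-inclusion of y in x."""
    return math.sqrt(clarke_de(vx, vy) * (1.0 - clarke_de(vy, vx)))

# Toy example: the contexts of "cat" should be largely included in those of "animal".
cat = {"drink-v:nsubj": 3.1, "cute-a:amod": 2.5, "pet-v:dobj": 1.2}
animal = {"drink-v:nsubj": 1.8, "pet-v:dobj": 0.9, "live-v:nsubj": 2.2, "eat-v:nsubj": 2.0}
print(cosine(cat, animal), weeds_prec(cat, animal), inv_cl(cat, animal))

Directional measures such as weeds_prec and inv_cl are deliberately asymmetric: swapping the arguments reverses the direction of the hypothesized hypernymy relation.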

BLESS: hypernym 1,337; meronym 2,943; coordination 3,565; event 3,824; attribute 2,731; random-n 6,702; random-v 3,265; random-j 2,187 (26,554 instances)
EVALution: hypernym 3,637; meronym 1,819; attribute 2,965; synonym 1,888; antonym 3,156 (13,465 instances)3
Lenci/Benotto: hypernym 1,933; synonym 1,311; antonym 1,766 (5,010 instances)
Weeds: hypernym 1,469; coordination 1,459 (2,928 instances)

Table 1: The semantic relations, number of instances in each relation, and size of each dataset.

• SLQS Sub A new variant of SLQS based on the assumption that if y is judged to be a hypernym of x to a certain extent, then x should be judged to be a hyponym of y to the same extent (which is not the case for regular SLQS). This is achieved by subtraction:

SLQS_sub(x → y) = E_y − E_x

It is weakly symmetric in the sense that SLQS_sub(x → y) = −SLQS_sub(y → x).

SLQS and SLQS Sub have 3 hyper-parameters: i) the number of contexts N; ii) whether to use median or average entropy among the top N contexts; and iii) the feature weighting used to sort the contexts by relevance (i.e., PPMI or PLMI).
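In the same style as the sketch above, the SLQS variants can be outlined as follows. This is an illustrative reimplementation that assumes the entropies H(c) of the context columns have already been computed from the co-occurrence matrix; ranking a word's top N contexts by their weight in its vector, and the helper names, are our assumptions rather than the exact choices of the original code.

import statistics

def word_generality(vx, context_entropy, n=50, use_median=True):
    """E_x: median (or mean) entropy of the N most relevant contexts of a word.

    vx: sparse context vector (context -> weight, e.g. PPMI), used to rank contexts.
    context_entropy: precomputed entropy H(c) of each context column.
    """
    top = sorted(vx, key=vx.get, reverse=True)[:n]
    entropies = [context_entropy[c] for c in top if c in context_entropy]
    if not entropies:
        return 0.0
    return statistics.median(entropies) if use_median else statistics.mean(entropies)

def slqs(vx, vy, context_entropy, **kw):
    """SLQS(x -> y) = 1 - E_x / E_y; positive when y looks more general than x."""
    ex = word_generality(vx, context_entropy, **kw)
    ey = word_generality(vy, context_entropy, **kw)
    return 1.0 - ex / ey if ey else 0.0

def slqs_sub(vx, vy, context_entropy, **kw):
    """SLQS Sub(x -> y) = E_y - E_x; the weakly symmetric subtraction variant."""
    return (word_generality(vy, context_entropy, **kw)
            - word_generality(vx, context_entropy, **kw))

SLQS Row would replace word_generality with the entropy of the target's own row, rather than aggregating the entropies of its contexts.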

• SLQS Row Differently from SLQS, SLQS Row computes the entropy of the target rather than the average/median entropy of the contexts, as an alternative way to compute the generality of a word.2 In addition, parallel to SLQS we tested SLQS Row with subtraction, SLQS Row Sub.

• RCTC (Rimell, 2014) Ratio of change in topic coherence:

RCTC(x → y) = [TC(t_x) / TC(t_{x\y})] / [TC(t_y) / TC(t_{y\x})]

where t_x are the top N contexts of x, considered as x's topic, and t_{x\y} are the top N contexts of x which are not contexts of y. TC(A) is the topic coherence of a set of words A, defined as the median pairwise PMI scores between words in A. N is a hyper-parameter. The measure is based on the assumptions that excluding y's contexts from x's increases the coherence of the topic, while excluding x's contexts from y's decreases the coherence of the topic. We include this measure under the informativeness measures, as it is based on a similar hypothesis.

3.4 Reversed Inclusion Measures

These measures are motivated by the fact that, even though hypernyms, being more general, are expected to occur in a larger set of contexts, sentences like "the vertebrate barks" or "the mammal arrested the thieves" are not common, since hyponyms are more specialized and are hence more appropriate in such contexts. On the other hand, hyponyms are likely to occur in broad contexts (e.g. eat, live), where hypernyms are also appropriate. In this sense, we can define the reversed inclusion hypothesis: "hypernym's contexts are likely to be included in the hyponym's contexts". The following variants are tested for the first time.

• Reversed Weeds

RevWeeds(x → y) = Weeds(y → x)

• Reversed ClarkeDE

RevCDE(x → y) = CDE(y → x)

4 Datasets

We use four common semantic relation datasets: BLESS (Baroni and Lenci, 2011), EVALution (Santus et al., 2015), Lenci/Benotto (Benotto, 2015), and Weeds (Weeds et al., 2014). The datasets were constructed either using knowledge resources (e.g. WordNet, Wikipedia), crowdsourcing or both. The semantic relations and the size of each dataset are detailed in Table 1.

In our distributional semantic spaces, a target word is represented by the word and its POS tag. While BLESS and Lenci/Benotto contain this information, we needed to add POS tags to the other datasets. For each pair (x, y), we considered 3 pairs (x-p, y-p) for p ∈ {noun, adjective, verb}, and added the respective pair to the dataset only if the words were present in the corpus.4

2In our preliminary experiments, we noticed that the 3We removed the entailment relation, which had too few entropies of the targets and those of the contexts are not instances, and conflated relations to coarse-grained relations highly correlated, yielding a Spearman’s correlation of up (e.g. HasProperty and HasA into attribute). to 0.448 for window based spaces, and up to 0.097 for the 4Lenci/Benotto includes pairs to which more than one re- dependency-based ones (p < 0.01). lation is assigned, e.g. when x or y are polysemous, and re-

68 context feature dataset hyper vs. relation measure hyper-parameters AP@100 AP@All type weighting all other relations invCL joint freq - 0.661 0.353 meronym APSyn joint freq N=500 0.883 0.675 attribute APSyn joint freq N=500 0.88 0.651 joint freq 0.74 0.54 EVALution antonym SLQS row joint ppmi - 0.74 0.55 joint plmi 0.74 0.537 joint freq 0.83 0.647 synonym SLQS row joint ppmi - 0.83 0.657 joint plmi 0.83 0.645 all other relations invCL win5 freq - 0.54 0.051 SLQS win5d freq N=100, median, plmi 1.0 0.76 meronym sub SLQS win5d freq N=100, median, plmi 1.0 0.758 BLESS coord SLQSsub joint freq N=50, average, plmi 0.995 0.537 SLQS dep plmi N=70, average, plmi 1.0 0.74 attribute sub cosine joint freq - 1.0 0.622 event APSyn dep freq N=1000 1.0 0.779 all other relations APSyn joint freq N=1000 0.617 0.382 Lenci/ antonym APSyn dep freq N=1000 0.861 0.624 Benotto synonym SLQS rowsub joint ppmi - 0.948 0.725 all other relations clarkeDE win5d freq - 0.911 0.441 Weeds coord clarkeDE win5d freq - 0.911 0.441 Table 2: Best performing unsupervised measures on each dataset in terms of Average Precision (AP) at k = 100, for hypernym vs. all other relations and vs. each single relation. AP for k = all is also reported for completeness. We excluded the experiments of hypernym vs. random-(n, v, j) for brevity; most of the similarity and some of the inclusion measures achieve AP @100 = 1.0 in these experiments.

We split each dataset randomly to 90% test and Table 2 reports the best performing measure(s), 10% validation. The validation sets are used to with respect to AP @100, for each relation in each tune the hyper-parameters of several measures: dataset. The first observation is that there is no SLQS (Sub), APSyn and RCTC. single combination of measure, context type and feature weighting that performs best in discrimi- 5 Experiments nating hypernymy from all other relations. In or- der to better understand the results, we focus on 5.1 Comparing Unsupervised Measures the second type of evaluation, in which we dis- In order to evaluate the unsupervised measures criminate hypernyms from each other relation. described in Section 3, we compute the measure The results show preference to the syntactic scores for each (x, y) pair in each dataset. We first context-types (dep and joint), which might be ex- measure the method’s ability to discriminate hy- plained by the fact that these contexts are richer pernymy from all other relations in the dataset, i.e. (as they contain both proximity and syntactic in- by considering hypernyms as positive instances, formation) and therefore more discriminative. In and other word pairs as negative instances. In ad- feature weighting there is no consistency, but in- dition, we measure the method’s ability to discrim- terestingly, raw frequency appears to be success- inate hypernymy from every other relation in the ful in hypernymy detection, contrary to previously dataset by considering one relation at a time. For reported results for word similarity tasks, where a relation we consider only (x, y) pairs that are R PPMI was shown to outperform it (Bullinaria and annotated as either hypernyms (positive instances) Levy, 2007; Levy et al., 2015a). or (negative instances). We rank the pairs ac- R cording to the measure score and compute average The new SLQS variants are on top of the list precision (AP) at k = 100 and k = all.5 in many settings. In particular they perform well in discriminating hypernyms from symmetric re- lated differently in each sense. We consider y as a hypernym lations (antonymy, synonymy, coordination). of x if hypernymy holds in some of the words’ senses. There- fore, when a pair is assigned both hypernymy and another The measures based on the reversed inclu- relation, we only keep it as hypernymy. sion hypothesis performed inconsistently, achiev- 5We tried several cut-offs and chose the one that seemed to be more informative in distinguishing between the unsu- ing perfect score in the discrimination of hyper- pervised measures. nyms from unrelated words, and performing well

69 relation measure context type feature weighting cosWeeds dep ppmi meronym Weeds dep / joint ppmi ClarkeDE dep / joint ppmi / freq APSyn joint freq cosine joint freq attribute Lin dep ppmi cosine dep ppmi antonym SLQS - - SLQS row joint (freq/ppmi/plmi) synonym SLQS row/SLQS row sub dep ppmi invCL win2/5/5d freq coordination - Table 3: Intersection of datasets’ top-performing measures when discriminating between hypernymy and each other relation. in few other cases, always in combination with surprising result could be explained by the nature syntactic contexts. of meronymy in this dataset: most holonyms in Finally, the results show that there is no single BLESS are rather specific words. combination of measure and parameters that per- BLESS was built starting from 200 basic level forms consistently well for all datasets and classi- concepts (e.g. goldfish) used as the x words, to fication tasks. In the following section we analyze which y words in different relations were asso- the best combination of measure, context type and ciated (e.g. eye, for meronymy; animal, for hy- feature weighting to distinguish hypernymy from pernymy). x words represent hyponyms in the any other relation. hyponym-hypernym pairs, and should therefore not be too general. Indeed, SLQS assigns high 5.2 Best Measure Per Classification Task scores to hyponym-hypernym pairs. At the same We considered all relations that occurred in two time, in the meronymy relation in BLESS, x is datasets. For such relation, for each dataset, we the holonym and y is the meronym. For consis- ranked the measures by their AP@100 score, se- tency with EVALution, we switched those pairs in lecting those with score 0.8.6 Table 3 displays ≥ BLESS, placing the meronym in the x slot and the the intersection of the datasets’ best measures. holonym in the y slot. As a consequence, after the Hypernym vs. Meronym The inclusion hy- switching, holonyms in BLESS are usually rather pothesis seems to be most effective in discriminat- specific words (e.g., there are no holonyms like ing between hypernyms and meronyms under syn- animal and vehicle, as these words were originally tactic contexts. We conjecture that the window- in the y slot). In most cases, they are not more gen- based contexts are less effective since they capture eral than their meronyms ((eye, goldfish)), yielding topical context words, that might be shared also low SLQS scores which are easy to separate from among holonyms and their meronyms (e.g. car hypernyms. We note that this is a weakness of the will occur with many of the neighbors of wheel). BLESS dataset, rather than a strength of the mea- However, since meronyms and holonyms often sure. For instance, on EVALution, SLQS performs have different functions, their functional contexts, worse (ranked only as high as 13th), as this dataset which are expressed in the syntactic context-types, has no such restriction on the basic level concepts, are less shared. This is where they mostly differ and may contain pairs like (eye, animal). from hyponym-hypernym pairs, which are of the same function (e.g. cat is a type of animal). Hypernym vs. Attribute Symmetric similar- Table 2 shows that SLQS performs well in this ity measures computed on syntactic contexts suc- task on BLESS. This is contrary to previous find- ceed to discriminate between hypernyms and at- ings that suggested that SLQS is weak in dis- tributes. 
criminating between hypernyms and meronyms, as in many cases the holonym is more general than the meronym (Shwartz et al., 2016).7 The

Since attributes are syntactically different from hypernyms (in attributes, y is an adjective), it is unsurprising that they occur in different syntactic contexts, yielding low similarity scores.

6We considered at least 10 measures, allowing scores nearly 50% of the SLQS false positive pairs were meronym- slightly lower than 0.8 when others were unavailable. holonym pairs, in many of which the holonym is more general 7In the hypernymy dataset of Shwartz et al. (2016), than the meronym by definition, e.g. (mauritius, africa).

EVALution:
  meronym: supervised concat, dep-based, L2, 0.998; unsupervised APSyn, joint, freq, 0.886
  attribute: supervised concat, Glove-100, L2, 1.000; unsupervised invCL, dep, ppmi, 0.877
  antonym: supervised concat, dep-based, L2, 1.000; unsupervised invCL, joint, ppmi, 0.773
  synonym: supervised concat, dep-based, L1, 0.996; unsupervised SLQSsub, win2, plmi, 0.813

BLESS:
  meronym: supervised concat, Glove-50, L1, 1.000; unsupervised SLQSsub, win5, freq, 0.939
  coord: supervised concat, Glove-300, L1, 1.000; unsupervised SLQS rowsub, joint, plmi, 0.938
  attribute: supervised concat, Glove-100, L1, 1.000; unsupervised SLQSsub, dep, freq, 0.938
  event: supervised concat, Glove-100, L1, 1.000; unsupervised SLQSsub, dep, freq, 0.847
  random-n: supervised concat, word2vec, L1, 0.995; unsupervised cosWeeds, win2d, ppmi, 0.818
  random-j: supervised concat, Glove-200, L1, 1.000; unsupervised SLQSsub, dep, freq, 0.917
  random-v: supervised concat, word2vec, L1, 1.000; unsupervised SLQSsub, dep, freq, 0.895

Lenci/Benotto:
  antonym: supervised concat, dep-based, L2, 0.917; unsupervised invCL, joint, ppmi, 0.807
  synonym: supervised concat, Glove-300, L1, 0.946; unsupervised invCL, win5d, freq, 0.914

Weeds:
  coord: supervised concat, dep-based, L2, 0.873; unsupervised invCL (win2d, freq) and SLQS rowsub (joint, ppmi), both 0.824

Table 4: Best performance on the validation set (10%) of each dataset for the supervised and unsupervised measures, in terms of Average Precision (AP) at k = 100, for hypernym vs. each single relation. Each supervised entry lists method, vectors, penalty and AP@100; each unsupervised entry lists measure, context type, feature weighting and AP@100.
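The AP@100 figures reported in Tables 2 and 4 rank all candidate pairs by the measure (or classifier) score and compute average precision over the top k ranks. The sketch below shows one common formulation of AP at a cut-off; it is illustrative and not the exact evaluation script behind the reported numbers.

def average_precision_at_k(scored_pairs, k=100):
    """scored_pairs: list of (score, is_hypernym) tuples, one per (x, y) pair.

    Pairs are ranked by score; AP averages the precision at every rank where a
    positive (hypernym) pair is retrieved, restricted to the top-k ranks.
    """
    ranked = sorted(scored_pairs, key=lambda p: p[0], reverse=True)[:k]
    hits, precisions = 0, []
    for rank, (_, positive) in enumerate(ranked, start=1):
        if positive:
            hits += 1
            precisions.append(hits / rank)
    total_positives = sum(1 for _, positive in scored_pairs if positive)
    denom = min(total_positives, k)
    return sum(precisions) / denom if denom else 0.0

# Example: rank pairs by an unsupervised score and evaluate hypernym vs. meronym.
pairs = [(0.9, True), (0.8, False), (0.7, True), (0.4, False), (0.2, True)]
print(average_precision_at_k(pairs, k=5))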

Hypernym vs. Antonym In all our experi- inclusion-based measures (ClarkeDE, invCL and ments, antonyms were the hardest to distinguish Weeds) showed the best results. The best per- from hypernyms, yielding the lowest performance. forming measures on BLESS, however, were vari- We found that SLQS performed reasonably well in ants of SLQS, that showed to perform well in this setting. However, the measure variations, con- cases where the negative relation is symmetric text types and feature weightings were not consis- (antonym, synonym and coordination). The dif- tent across datasets. SLQS relies on the assump- ference could be explained by the nature of the tion that y is a more general word than x, which is datasets: the BLESS test set contains 1,185 hy- not true for antonyms, making it the most suitable pernymy pairs, with only 129 distinct ys, many of measure for this setting. which are general words like animal and object. The Weeds test set, on the other hand, was inten- Hypernym vs. Synonym SLQS performs well tionally constructed to contain an overall unique also in discriminating between hypernyms and y in each pair, and therefore contains much more synonyms, in which y is also not more general specific ys (e.g. (quirk, strangeness)). For this rea- than x. We observed that in the joint con- son, generality-based measures perform well on text type, the difference in SLQS scores between BLESS, and struggle with Weeds, which is han- synonyms and hypernyms was the largest. This dled better using inclusion-based measures. may stem from the restrictiveness of this context type. For instance, among the most salient con- texts we would expect to find informative contexts 5.3 Comparison to State-of-the-art like drinks milk for cat and less informative ones Supervised Methods like drinks water for animal, whereas the non- For comparison with the state-of-the-art, we eval- restrictive single dependency context drinks would uated several supervised hypernymy detection probably be present for both. methods, based on the word embeddings of x and Another measure that works well is invCL: in- y: concatenation ~v ~v (Baroni et al., 2012), dif- terestingly, other inclusion-based measures assign x ⊕ y ference ~v ~v (Weeds et al., 2014), and ASYM high scores to (x, y) when y includes many of x’s y − x (Roller et al., 2014). We downloaded several pre- contexts, which might be true also for synonyms trained embeddings (Mikolov et al., 2013; Pen- (e.g. elevator and lift share many contexts). in- nington et al., 2014; Levy and Goldberg, 2014), vCL, on the other hand, reduces with the ratio of and trained a logistic regression classifier to pre- y’s contexts included in x, yielding lower scores dict hypernymy. We used the 90% portion (orig- for synonyms. inally the test set) as the train set, and the other Hypernym vs. Coordination We found no con- 10% (originally the validation set) as a test set, sistency among BLESS and Weeds. On Weeds, reporting the best results among different vectors,

71 method AP@100 original AP@100 switched ∆ supervised concat, word2vec, L1 0.995 0.575 -0.42 unsupervised cosWeeds, win2d, ppmi 0.818 0.882 +0.064 Table 5: Average Precision (AP) at k = 100 of the best supervised and unsupervised methods for hypernym vs. random-n, on the original BLESS validation set and the validation set with the artificially added switched hypernym pairs. method and regularization factor.8 findings of Levy et al. (2015b). The large perfor- Table 4 displays the performance of the best mance gaps reported by Levy et al. (2015b) might classifier on each dataset, in a hypernym vs. a sin- be attributed to the size of their training sets. Their gle relation setting. We also re-evaluated the unsu- lexical split discarded around half of the pairs in pervised measures, this time reporting the results the dataset and split the rest of the pairs equally on the validation set (10%) for comparison. to train and test, resulting in a relatively small The overall performance of the embedding- train set. We performed the split such that only based classifiers is almost perfect, and in partic- around 30% of the pairs in each dataset were dis- ular the best performance is achieved using the carded, and split the train and test sets with a ratio concatenation method (Baroni et al., 2012) with of roughly 90/10%, obtaining large enough train either GloVe (Pennington et al., 2014) or the sets. dependency-based embeddings (Levy and Gold- Our experiment suggests that rather than mem- berg, 2014). As expected, the unsupervised mea- orizing the verbatim prototypical hypernyms, the sures perform worse than the embedding-based supervised models might learn that certain regions classifiers, though generally not bad on their own. in the vector space pertain to prototypical hyper- These results may suggest that unsupervised nyms. For example, device (from the BLESS train methods should be preferred only when no train- set) and appliance (from the BLESS test set) are ing data is available, leaving all the other cases to two similar words, which are both prototypical supervised methods. This is, however, not com- hypernyms. Another interesting observation was pletely true. As others previously noticed, super- recently made by Roller and Erk (2016): they vised methods do not actually learn the relation showed that when dependency-based embeddings between x and y, but rather separate properties are used, supervised distributional methods trace of either x or y. Levy et al. (2015b) named this x and y’s separate occurrences in different slots of the “lexical memorization” effect, i.e. memoriz- Hearst patterns (Hearst, 1992). ing that certain ys tend to appear in many positive Whether supervised methods only memorize or pairs (prototypical hypernyms). also learn, it is more consensual that they lack the On that account, the Weeds dataset has been ability to capture the relation between x and y, and designed to avoid such memorization, with every that they rather indicate how likely y (x) is to be word occurring once in each slot of the relation. a hypernym (hyponym) (Levy et al., 2015b; San- While the performance of the supervised methods tus et al., 2016a; Shwartz et al., 2016; Roller and on this dataset is substantially lower than their per- Erk, 2016). While this information is valuable, it formance on other datasets, it is yet well above the cannot be solely relied upon for classification. 
random baseline which we might expect from a method that can only memorize words it has seen To better understand the extent of this limita- during training.9 This is an indication that super- tion, we conducted an experiment in a similar vised methods can abstract away from the words. manner to the switched hypernym pairs in Santus et al. (2016a). We used BLESS, which is the only Indeed, when we repeated the experiment with a dataset with random pairs. For each hypernym lexical split of each dataset, i.e., such that the train pair (x , y ), we sampled a word y that partici- and test set consist of distinct vocabularies, we 1 1 2 pates in another hypernym pair (x , y ), such that found that the supervised methods’ performance 2 2 (x , y ) is not in the dataset, and added (x , y ) did not decrease dramatically, in contrast to the 1 2 1 2 as a random pair. We added 139 new pairs to the 8In our preliminary experiments we also trained other validation set, such as (rifle, animal) and (salmon, classifiers used in the distributional hypernymy detection lit- weapon). We then used the best supervised and erature (SVM and SVM+RBF kernel), that performed simi- unsupervised methods for hypernym vs. random- larly. We report the results for logistic regression, since we use the prediction probabilities to measure average precision. n on BLESS to re-classify the revised validation 9The dataset is balanced between its two classes. set. Table 5 displays the experiment results.
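For reference, the supervised baselines of Section 5.3 reduce to a simple pipeline: look up pre-trained embeddings for x and y, combine them (here by concatenation), and train a regularized logistic regression. The sketch below uses scikit-learn; the helper names and the zero-vector handling of out-of-vocabulary words are our assumptions rather than details of the original experiments.

import numpy as np
from sklearn.linear_model import LogisticRegression

def pair_features(x, y, embeddings, dim):
    """Concatenation method: [v_x ; v_y], with zeros for out-of-vocabulary words."""
    vx = embeddings.get(x, np.zeros(dim))
    vy = embeddings.get(y, np.zeros(dim))
    return np.concatenate([vx, vy])

def train_hypernymy_classifier(train_pairs, embeddings, dim, penalty="l2"):
    """train_pairs: list of ((x, y), label) with label 1 for hypernymy, 0 otherwise."""
    X = np.array([pair_features(x, y, embeddings, dim) for (x, y), _ in train_pairs])
    labels = np.array([label for _, label in train_pairs])
    clf = LogisticRegression(penalty=penalty, solver="liblinear", max_iter=1000)
    clf.fit(X, labels)
    return clf

# The prediction probabilities clf.predict_proba(...)[:, 1] can then be ranked to
# compute average precision, exactly as for the unsupervised measure scores.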

72 The switched hypernym experiment paints a datasets will be available for the task, which would much less optimistic picture of the embeddings’ be drawn from corpora and will reflect more real- actual performance, with a drop of 42 points in istic distributions of words and semantic relations. average precision. 121 out of the 139 switched hypernym pairs were falsely classified as hyper- 7 Conclusion nyms. Examining the y words of these pairs re- We performed an extensive evaluation of unsuper- veals general words that appear in many hypernym vised methods for discriminating hypernyms from pairs (e.g. animal, object, vehicle). The unsuper- other semantic relations. We found that there is vised measure was not similarly affected by the no single combination of measure and parameters switched pairs, and the performance even slightly which is always preferred; however, we suggested increased. This result is not surprising, since most a principled linguistic-based analysis of the most unsupervised measures aim to capture aspects of suitable measure for each task that yields consis- the relation between x and y, while not relying on tent performance across different datasets. 10 information about one of the words in the pair. We investigated several new variants of existing 6 Discussion methods, and found that some variants of SLQS turned out to be superior on certain tasks. In addi- The results in Section 5 suggest that a supervised tion, we have tested for the first time the joint method using the unsupervised measures as fea- context type (Chersoni et al., 2016), which was tures could possibly be the best of both worlds. We found to be very discriminative, and might hope- would expect it to be more robust than embedding- fully benefit other semantic tasks. based methods on the one hand, while being more For comparison, we evaluated the state-of- informative than any single unsupervised measure the-art supervised methods on the datasets, and on the other hand. they have shown to outperform the unsupervised Such a method was developed by Santus et al. ones, while also being efficient and easier to use. (2016a), however using mostly features that de- However, a deeper analysis of their performance scribe a single word, e.g. frequency and entropy. It demonstrated that, as previously suggested, these was shown to be competitive with the state-of-the- methods do not capture the relation between x and art supervised methods. With that said, it was also y, but rather indicate the “prior probability” of ei- shown to be sensitive to the distribution of training ther word to be a hyponym or a hypernym. As a examples in a specific dataset, like the embedding- consequence, supervised methods are sensitive to based methods. the distribution of examples in a particular dataset, We conducted a similar experiment, with a making them less reliable for real-world applica- much larger number of unsupervised features, tions. Being motivated by linguistic hypotheses, namely the various measure scores, and encoun- and independent from training data, unsupervised tered the same issue. While the performance was measures were shown to be more robust. In this good, it dropped dramatically when the model was sense, unsupervised methods can still play a rele- tested on a different test set. vant role, especially if combined with supervised We conjecture that the problem stems from the methods, in the decision whether the relation holds currently available datasets, which are all some- or not. 
what artificial and biased. Supervised methods which are strongly based on the relation between Acknowledgments the words, e.g. those that rely on path-based in- The authors would like to thank Ido Dagan, formation (Shwartz et al., 2016), manage to over- Alessandro Lenci, and Yuji Matsumoto for their come the bias. Distributional methods, on the help and advice. Vered Shwartz is partially sup- other hand, are based on a weaker notion of the ported by an Intel ICRI-CI grant, the Israel Sci- relation between words, hence are more prone to ence Foundation grant 880/12, and the German overfit the distribution of training instances in a Research Foundation through the German-Israeli specific dataset. In the future, we hope that new Project Cooperation (DIP, grant DA 1600/1-1). 10Turney and Mohammad (2015) have also shown that un- Enrico Santus is partially supported by HK PhD supervised methods are more robust than supervised ones in Fellowship Scheme under PF12-13656. a transfer-learning experiment, when the “training data” was used to tune their parameters.

73 References Zellig S. Harris. 1954. Distributional structure. Word, 10(2-3):146–162. Marco Baroni and Alessandro Lenci. 2010. Dis- tributional memory: A general framework for Marti A. Hearst. 1992. Automatic acquisition of hy- corpus-based semantics. Computational Linguis- ponyms from large text corpora. In COLING 1992 tics, 36(4):673–721. Volume 2: The 15th International Conference on Computational Linguistics. Marco Baroni and Alessandro Lenci. 2011. Proceed- ings of the gems 2011 workshop on geometrical Douwe Kiela, Laura Rimell, Ivan Vulic,´ and Stephen models of natural language semantics. In How we Clark. 2015. Exploiting image generality for lex- BLESSed distributional semantic evaluation, pages ical entailment detection. In Proceedings of the 1–10. Association for Computational Linguistics. 53rd Annual Meeting of the Association for Compu- tational Linguistics and the 7th International Joint Marco Baroni, Silvia Bernardini, Adriano Ferraresi, Conference on Natural Language Processing (Vol- and Eros Zanchetta. 2009. The wacky wide ume 2: Short Papers), pages 119–124. Association web: a collection of very large linguistically pro- for Computational Linguistics. cessed web-crawled corpora. Language resources and evaluation, 43(3):209–226. Lili Kotlerman, Ido Dagan, Idan Szpektor, and Maayan Zhitomirsky-Geffet. 2010. Directional distribu- Marco Baroni, Raffaella Bernardi, Ngoc-Quynh Do, tional similarity for lexical inference. Natural Lan- and Chung-chieh Shan. 2012. Entailment above the guage Engineering, 16(04):359–389. word level in distributional semantics. In Proceed- ings of the 13th Conference of the European Chap- Alessandro Lenci and Giulia Benotto. 2012. Identi- ter of the Association for Computational Linguistics, fying hypernyms in distributional semantic spaces. pages 23–32. Association for Computational Lin- In *SEM 2012: The First Joint Conference on Lexi- guistics. cal and Computational Semantics – Volume 1: Pro- ceedings of the main conference and the shared task, Giulia Benotto. 2015. Distributional models for and Volume 2: Proceedings of the Sixth Interna- semantic relations: A sudy on hyponymy and tional Workshop on Semantic Evaluation (SemEval antonymy. PhD Thesis, University of Pisa. 2012), pages 75–79. Association for Computational John A Bullinaria and Joseph P Levy. 2007. Extracting Linguistics. semantic representations from word co-occurrence Omer Levy and Yoav Goldberg. 2014. Dependency- statistics: A computational study. Behavior re- based word embeddings. In Proceedings of the 52nd search methods, 39(3):510–526. Annual Meeting of the Association for Computa- Emmanuele Chersoni, Enrico Santus, Alessandro tional Linguistics (Volume 2: Short Papers), pages Lenci, Philippe Blache, and Chu-Ren Huang. 2016. 302–308. Association for Computational Linguis- Representing verbs with rich contexts: an evaluation tics. on verb similarity. In Proceedings of the 2016 Con- Omer Levy, Yoav Goldberg, and Ido Dagan. 2015a. ference on Empirical Methods in Natural Language Improving distributional similarity with lessons Processing, pages 1967–1972. Association for Com- learned from word embeddings. Transactions of the putational Linguistics. Association for Computational Linguistics, 3:211– Kenneth Ward Church and Patrick Hanks. 1990. Word 225. association norms, mutual information, and lexicog- Omer Levy, Steffen Remus, Chris Biemann, and Ido raphy. Computational linguistics, 16(1):22–29. Dagan. 2015b. Do supervised distributional meth- Daoud Clarke. 2009. 
Context-theoretic semantics for ods really learn lexical inference relations? In Pro- natural language: an overview. In Proceedings of ceedings of the 2015 Conference of the North Amer- the Workshop on Geometrical Models of Natural ican Chapter of the Association for Computational Language Semantics, pages 112–119. Association Linguistics: Human Language Technologies, pages for Computational Linguistics. 970–976. Association for Computational Linguis- tics. Ido Dagan, Dan Roth, and Mark Sammons. 2013. Rec- ognizing textual entailment. Morgan & Claypool Dekang Lin. 1998. An information-theoretic defini- Publishers. tion of similarity. ICML, 98:296–304. Stefan Evert. 2005. The statistics of word cooccur- Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Cor- rences: word pairs and collocations. Dissertation. rado, and Jeff Dean. 2013. Distributed representa- tions of words and phrases and their compositional- Stefan Evert. 2008. Corpora and collocations. Corpus ity. In Advances in neural information processing linguistics. An international handbook, 2:223–233. systems, pages 3111–3119. Adriano Ferraresi. 2007. Building a very large corpus Roberto Navigli, Paola Velardi, and Stefano Faralli. of english obtained by web crawling: ukwac. Mas- 2011. A graph-based algorithm for inducing lex- ters thesis, University of Bologna, Italy. ical taxonomies from scratch. In Proceedings of

74 the Twenty-Second International Joint Conference Enrico Santus, Alessandro Lenci, Tin-Shing Chiu, Qin on Artificial Intelligence, volume 11, pages 1872– Lu, and Chu-Ren Huang. 2016b. Unsupervised 1877. measure of word similarity: How to outperform co- occurrence and vector cosine in vsms. In Thirtieth Sebastian Pado´ and Mirella Lapata. 2007. AAAI Conference on Artificial Intelligence, pages Dependency-based construction of semantic space 4260–4261. models. Computational Linguistics, 33(2):161–199. Vered Shwartz, Yoav Goldberg, and Ido Dagan. 2016. Jeffrey Pennington, Richard Socher, and Christopher Improving hypernymy detection with an integrated Manning. 2014. Glove: Global vectors for word path-based and distributional method. In Proceed- representation. In Proceedings of the 2014 Con- ings of the 54th Annual Meeting of the Association ference on Empirical Methods in Natural Language for Computational Linguistics (Volume 1: Long Pa- Processing (EMNLP), pages 1532–1543. Associa- pers), pages 2389–2398. Association for Computa- tion for Computational Linguistics. tional Linguistics.

Laura Rimell. 2014. Distributional lexical entailment Rion Snow, Daniel Jurafsky, and Andrew Y. Ng. 2005. by topic coherence. In Proceedings of the 14th Con- Learning syntactic patterns for automatic hypernym ference of the European Chapter of the Association discovery. In Advances in Neural Information Pro- for Computational Linguistics, pages 511–519. As- cessing Systems 17, pages 1297–1304. MIT Press. sociation for Computational Linguistics. Rion Snow, Daniel Jurafsky, and Andrew Y. Ng. 2006. Semantic taxonomy induction from heterogenous Stephen Roller and Katrin Erk. 2016. Relations such evidence. In Proceedings of the 21st International as hypernymy: Identifying and exploiting hearst pat- Conference on Computational Linguistics and 44th terns in distributional vectors for lexical entailment. Annual Meeting of the Association for Computa- In Proceedings of the 2016 Conference on Empiri- tional Linguistics, pages 801–808. Association for cal Methods in Natural Language Processing, pages Computational Linguistics. 2163–2172. Association for Computational Linguis- tics. Peter D. Turney and Saif M. Mohammad. 2015. Ex- periments with three approaches to recognizing lex- Stephen Roller, Katrin Erk, and Gemma Boleda. 2014. ical entailment. Natural Language Engineering, Inclusive yet selective: Supervised distributional hy- 21(03):437–476. pernymy detection. In Proceedings of COLING 2014, the 25th International Conference on Compu- Julie Weeds and David Weir. 2003. A general frame- tational Linguistics: Technical Papers, pages 1025– work for distributional similarity. In Proceedings of 1036. Dublin City University and Association for the 2003 Conference on Empirical Methods in Nat- Computational Linguistics. ural Language Processing, pages 81–88.

Gerard Salton and Michael J. McGill. 1986. Introduc- Julie Weeds, Daoud Clarke, Jeremy Reffin, David Weir, tion to modern information retrieval. McGraw-Hill, and Bill Keller. 2014. Learning to distinguish hy- Inc. pernyms and co-hyponyms. In Proceedings of COL- ING 2014, the 25th International Conference on Enrico Santus, Alessandro Lenci, Qin Lu, and Sabine Computational Linguistics: Technical Papers, pages Schulte im Walde. 2014. Chasing hypernyms in 2249–2259. Dublin City University and Association vector spaces with entropy. In Proceedings of the for Computational Linguistics. 14th Conference of the European Chapter of the As- sociation for Computational Linguistics, volume 2: Short Papers, pages 38–42. Association for Compu- tational Linguistics.

Enrico Santus, Frances Yung, Alessandro Lenci, and Chu-Ren Huang. 2015. Proceedings of the 4th workshop on linked data in linguistics: Resources and applications. In EVALution 1.0: an Evolving Se- mantic Dataset for Training and Evaluation of Dis- tributional Semantic Models, pages 64–69. Associa- tion for Computational Linguistics.

Enrico Santus, Alessandro Lenci, Tin-Shing Chiu, Qin Lu, and Chu-Ren Huang. 2016a. Nine features in a random forest to learn taxonomical semantic re- lations. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), pages 4557–4564. European Lan- guage Resources Association (ELRA).

Chapter 4

Improving Hypernymy Detection with an Integrated Path-based and Distributional Method

Improving Hypernymy Detection with an Integrated Path-based and Distributional Method

Vered Shwartz Yoav Goldberg Ido Dagan Computer Science Department Bar-Ilan University Ramat-Gan, Israel [email protected] [email protected] [email protected]

Abstract each (x, y) term-pair is represented using some combination of the terms’ embedding vectors. Detecting hypernymy relations is a key In contrast to distributional methods, in which task in NLP, which is addressed in the decision is based on the separate contexts of the literature using two complemen- x and y, path-based methods base the decision on tary approaches. Distributional methods, the lexico-syntactic paths connecting the joint oc- whose supervised variants are the cur- currences of x and y in a corpus. Hearst (1992) rent best performers, and path-based meth- identified a small set of frequent paths that indicate ods, which received less research atten- hypernymy, e.g. Y such as X. Snow et al. (2004) tion. We suggest an improved path-based represented each (x, y) term-pair as the multiset of algorithm, in which the dependency paths dependency paths connecting their co-occurrences are encoded using a recurrent neural net- in a corpus, and trained a classifier to predict hy- work, that achieves results comparable pernymy, based on these features. to distributional methods. We then ex- Using individual paths as features results in a tend the approach to integrate both path- huge, sparse feature space. While some paths based and distributional signals, signifi- are rare, they often consist of certain unimportant cantly improving upon the state-of-the-art components. For instance, “Spelt is a species of on this task. wheat” and “Fantasy is a genre of fiction” yield two different paths: X be species of Y and X be 1 Introduction genre of Y, while both indicating that X is-a Y. A Hypernymy is an important lexical-semantic rela- possible solution is to generalize paths by replac- tion for NLP tasks. For instance, knowing that ing words along the path with their part-of-speech Tom Cruise is an actor can help a question an- tags or with wild cards, as done in the PATTY sys- swering system answer the question “which ac- tem (Nakashole et al., 2012). tors are involved in Scientology?”. While seman- Overall, the state-of-the-art path-based methods tic taxonomies, like WordNet (Fellbaum, 1998), perform worse than the distributional ones. This define hypernymy relations between word types, stems from a major limitation of path-based meth- they are limited in scope and domain. Therefore, ods: they require that the terms of the pair oc- automated methods have been developed to deter- cur together in the corpus, limiting the recall of mine, for a given term-pair (x, y), whether y is an these methods. While distributional methods have hypernym of x, based on their occurrences in a no such requirement, they are usually less precise large corpus. in detecting a specific semantic relation like hy- For a couple of decades, this task has been ad- pernymy, and perform best on detecting broad se- dressed by two types of approaches: distributional, mantic similarity between terms. Though these and path-based. In distributional methods, the de- approaches seem complementary, there has been cision whether y is a hypernym of x is based on rather little work on integrating them (Mirkin et the distributional representations of these terms. al., 2006; Kaji and Kitsuregawa, 2008). Lately, with the popularity of word embeddings In this paper, we present HypeNET, an inte- (Mikolov et al., 2013), most focus has shifted to- grated path-based and distributional method for wards supervised distributional methods, in which hypernymy detection. Inspired by recent progress

2389 Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 2389–2398, Berlin, Germany, August 7-12, 2016. c 2016 Association for Computational Linguistics in relation classification, we use a long short- most typical linguistic contexts of a hypernym are term memory (LSTM) network (Hochreiter and less informative than those of its hyponyms. Schmidhuber, 1997) to encode dependency paths. More recently, the focus of the distributional ap- In order to create enough training data for our net- proach shifted to supervised methods. In these work, we followed previous methodology of con- methods, the (x, y) term-pair is represented by a structing a dataset based on knowledge resources. feature vector, and a classifier is trained on these We first show that our path-based approach, on vectors to predict hypernymy. Several methods its own, substantially improves performance over are used to represent term-pairs as a combination prior path-based methods, yielding performance of each term’s embeddings vector: concatenation comparable to state-of-the-art distributional meth- ~x ~y (Baroni et al., 2012), difference ~y ~x (Roller ⊕ − ods. Our analysis suggests that the neural path rep- et al., 2014; Weeds et al., 2014), and dot-product resentation enables better generalizations. While ~x ~y. Using neural word embeddings (Mikolov et · coarse-grained generalizations, such as replacing a al., 2013; Pennington et al., 2014), these methods word by its POS tag, capture mostly syntactic sim- are easy to apply, and show good results (Baroni ilarities between paths, HypeNET captures also et al., 2012; Roller et al., 2014). semantic similarities. We then show that we can easily integrate dis- 2.2 Path-based Methods tributional signals in the network. The integration A different approach to detecting hypernymy be- results confirm that the distributional and path- tween a pair of terms (x, y) considers the lexico- based signals indeed provide complementary in- syntactic paths that connect the joint occurrences formation, with the combined model yielding an of x and y in a large corpus. Automatic acquisi- improvement of up to 14 F points over each indi- 1 tion of hypernyms from free text, based on such vidual model.1 paths, was first proposed by Hearst (1992), who identified a small set of lexico-syntactic paths that 2 Background indicate hypernymy relations (e.g. Y such as X, X We introduce the two main approaches for hyper- and other Y). nymy detection: distributional (Section 2.1), and In a later work, Snow et al. (2004) learned to de- path-based (Section 2.2). We then discuss the re- tect hypernymy. Rather than searching for specific cent use of recurrent neural networks in the related paths that indicate hypernymy, they represent each task of relation classification (Section 2.3). (x, y) term-pair as the multiset of all dependency paths that connect x and y in the corpus, and train 2.1 Distributional Methods a logistic regression classifier to predict whether y is a hypernym of x, based on these paths. Hypernymy detection is commonly addressed us- Paths that indicate hypernymy are those that ing distributional methods. In these methods, the were assigned high weights by the classifier. The decision whether y is a hypernym of x is based on paths identified by this method were shown to the distributional representations of the two terms, subsume those found by Hearst (1992), yield- i.e., the contexts with which each term occurs sep- ing improved performance. 
Variations of Snow arately in the corpus. et al.’s (2004) method were later used in tasks Earlier methods developed unsupervised mea- such as taxonomy construction (Snow et al., 2006; sures for hypernymy, starting with symmetric sim- Kozareva and Hovy, 2010; Carlson et al., 2010; ilarity measures (Lin, 1998), and followed by di- Riedel et al., 2013), analogy identification (Tur- rectional measures based on the distributional in- ney, 2006), and definition extraction (Borg et al., clusion hypothesis (Weeds and Weir, 2003; Kotler- 2009; Navigli and Velardi, 2010). man et al., 2010). This hypothesis states that the A major limitation in relying on lexico- contexts of a hyponym are expected to be largely syntactic paths is the sparsity of the feature space. included in those of its hypernym. More recent Since similar paths may somewhat vary at the lex- work (Santus et al., 2014; Rimell, 2014) introduce ical level, generalizing such variations into more new measures, based on the assumption that the abstract paths can increase recall. The PATTY al- 1Our code and data are available in: gorithm (Nakashole et al., 2012) applied such gen- https://github.com/vered1986/HypeNET eralizations for the purpose of acquiring a taxon-

2390 ATTR by a single dependency path, while in hypernymy NSUBJ DET detection it is represented by the multiset of all de- parrot is a bird pendency paths in which they co-occur in the cor- NOUN VERB DET NOUN pus.

Figure 1: An example dependency tree of the sen- 3 LSTM-based Hypernymy Detection tence “parrot is a bird”, with x=parrot and y=bird, repre- sented in our notation as X/NOUN/nsubj/< be/VERB/ROOT/- Y/NOUN/attr/>. We present HypeNET, an LSTM-based method for hypernymy detection. We first focus on im- omy of term relations from free text. For each proving path representation (Section 3.1), and then path, they added generalized versions in which a integrate distributional signals into our network, subset of words along the path were replaced by resulting in a combined method (Section 3.2). either their POS tags, their ontological types or wild-cards. This generalization increased recall 3.1 Path-based Network while maintaining the same level of precision. Similarly to prior work, we represent each depen- dency path as a sequence of edges that leads from 2.3 RNNs for Relation Classification x to y in the dependency tree.2 Each edge contains Relation classification is a related task whose goal the lemma and part-of-speech tag of the source is to classify the relation that is expressed between node, the dependency label, and the edge direction two target terms in a given sentence to one of pre- between two subsequent nodes. We denote each lemma/P OS/dep/dir defined relation classes. To illustrate, consider edge as . See figure 1 for the following sentence, from the SemEval-2010 an illustration. relation classification task dataset (Hendrickx et Rather than treating an entire dependency path as a single feature, we encode the sequence of al., 2009): “The [apples]e1 are in the [basket]e2 ”. Here, the relation expressed between the target en- edges using a long short-term memory (LSTM) network. The vectors obtained for the different tities is Content Container(e1, e2). − paths of a given (x, y) pair are pooled, and the re- The shortest dependency paths between the tar- sulting vector is used for classification. Figure 2 get entities were shown to be informative for this depicts the overall network structure, which is de- task (Fundel et al., 2007). Recently, deep learning scribed below. techniques showed good performance in capturing the indicative information in such paths. Edge Representation We represent each edge In particular, several papers show improved per- by the concatenation of its components’ vectors: formance using recurrent neural networks (RNN) that process a dependency path edge-by-edge. Xu ~ve = [~vl,pos ~v ,dep ~v , ~vdir] et al. (2015; 2016) apply a separate long short- where ~vl,pos ~v ,dep ~v , ~vdir represent the embedding term memory (LSTM) network to each sequence vectors of the lemma, part-of-speech, dependency of words, POS tags, dependency labels and Word- label and dependency direction (along the path Net hypernyms along the path. A max-pooling from x to y), respectively. layer on the LSTM outputs is used as the in- put of a network that predicts the classification. Path Representation For a path p composed of e , ..., e ~v , ..., ~v Other papers suggest incorporating additional net- edges 1 k, the edge vectors e1 ek are work architectures to further improve performance fed in order to an LSTM encoder, resulting in (Nguyen and Grishman, 2015; Liu et al., 2015). a vector ~op representing the entire path p. The While relation classification and hypernymy de- LSTM architecture is effective at capturing tem- tection are both concerned with identifying se- poral patterns in sequences. 
We expect the train- mantic relations that hold for pairs of terms, they ing procedure to drive the LSTM encoder to focus differ in a major respect. In relation classification on parts of the path that are more informative for the relation should be expressed in the given text, the classification task while ignoring others. while in hypernymy detection, the goal is to rec- 2Like Snow et al. (2004), we added for each path, addi- ognize a generic lexical-semantic relation between tional paths containing single daughters of x or y not already contained in the path, referred by Snow et al. (2004) as “satel- terms that holds in many contexts. Accordingly, lite edges”. This enables including paths like Such Y as X, in in relation classification a term-pair is represented which the word “such” is not in the path between x and y.
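To illustrate the encoding just described, the sketch below embeds each lemma/POS/dependency-label/direction edge, concatenates the four component embeddings, and runs the edge sequence through an LSTM to obtain the path vector ~o_p. It is written in PyTorch as a stand-in for the original implementation, so the module layout, dimensions and index handling are illustrative assumptions.

import torch
import torch.nn as nn

class PathEncoder(nn.Module):
    """Encode a dependency path (a sequence of lemma/POS/dep/dir edges) with an LSTM."""

    def __init__(self, vocab_sizes, dims, hidden_dim):
        super().__init__()
        # One embedding table per edge component: lemma, POS, dependency label, direction.
        self.embeddings = nn.ModuleList(
            [nn.Embedding(v, d) for v, d in zip(vocab_sizes, dims)]
        )
        self.lstm = nn.LSTM(input_size=sum(dims), hidden_size=hidden_dim, batch_first=True)

    def forward(self, path):
        # path: list of edges, each edge a tuple of 4 component indices.
        edge_vecs = []
        for edge in path:
            parts = [emb(torch.tensor([idx])) for emb, idx in zip(self.embeddings, edge)]
            edge_vecs.append(torch.cat(parts, dim=-1))      # shape [1, sum(dims)]
        seq = torch.stack(edge_vecs, dim=1)                  # shape [1, path_len, sum(dims)]
        _, (h_n, _) = self.lstm(seq)
        return h_n[-1]                                       # o_p: shape [1, hidden_dim]

# Example: X/NOUN/nsubj/> be/VERB/ROOT/- Y/NOUN/attr/< as three already-indexed edges.
encoder = PathEncoder(vocab_sizes=(5000, 20, 50, 3), dims=(50, 4, 5, 1), hidden_dim=60)
path = [(0, 1, 2, 0), (1, 2, 0, 2), (2, 1, 3, 1)]
o_p = encoder(path)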

Figure 2: An illustration of term-pair classification. Each term-pair is represented by several paths. Each path is a sequence of edges, and each edge consists of four components: lemma, POS, dependency label and dependency direction. Each edge vector is fed in sequence into the LSTM, resulting in a path embedding vector ~o_p. The averaged path vector becomes the term-pair's feature vector, used for classification. The dashed ~v_wx, ~v_wy vectors refer to the integrated network described in Section 3.2. (The illustrated paths are X/NOUN/nsubj/> be/VERB/ROOT/- Y/NOUN/attr/< and X/NOUN/dobj/> define/VERB/ROOT/- as/ADP/prep/< Y/NOUN/pobj/<.)

Term-Pair Classification Each (x, y) term-pair is represented by the multiset of lexico-syntactic paths that connected x and y in the corpus, denoted as paths(x, y), while the supervision is given for the term pairs. We represent each (x, y) term-pair as the weighted-average of its path vectors, by applying average pooling on its path vectors, as follows:

~v_xy = ~v_paths(x,y) = Σ_{p ∈ paths(x,y)} f_p,(x,y) · ~o_p / Σ_{p ∈ paths(x,y)} f_p,(x,y)    (1)

where f_p,(x,y) is the frequency of p in paths(x, y). We then feed this path vector to a single-layer network that performs binary classification to decide whether y is a hypernym of x:

c = softmax(W · ~v_xy)    (2)

c is a 2-dimensional vector whose components sum to 1, and we classify a pair as positive if c[1] > 0.5.

Implementation Details To train the network, we used PyCNN.3 We minimize the cross entropy loss using gradient-based optimization, with mini-batches of size 10 and the Adam update rule (Kingma and Ba, 2014). Regularization is applied by a dropout on each of the components' embeddings. We tuned the hyper-parameters (learning rate and dropout rate) on the validation set (see the appendix for the hyper-parameters values).

We initialized the lemma embeddings with the pre-trained GloVe word embeddings (Pennington et al., 2014), trained on Wikipedia. We tried both the 50-dimensional and 100-dimensional embedding vectors and selected the ones that yield better performance on the validation set.4 The other embeddings, as well as out-of-vocabulary lemmas, are initialized randomly. We update all embedding vectors during training.

3.2 Integrated Network

The network presented in Section 3.1 classifies each (x, y) term-pair based on the paths that connect x and y in the corpus. Our goal was to improve upon previous path-based methods for hypernymy detection, and we show in Section 6 that our network indeed outperforms them. Yet, as path-based and distributional methods are considered complementary, we present a simple way to integrate distributional features in the network, yielding improved performance.

We extended the network to take into account distributional information on each term. Inspired by the supervised distributional concatenation method (Baroni et al., 2012), we simply concatenate x and y word embeddings to the (x, y) feature vector, redefining ~v_xy:

~v_xy = [~v_wx, ~v_paths(x,y), ~v_wy]    (3)

where ~v_wx and ~v_wy are x and y's word embeddings, respectively, and ~v_paths(x,y) is the averaged path vector defined in equation 1. This way, each (x, y) pair is represented using both the distributional features of x and y, and their path-based features.

3 https://github.com/clab/cnn
4 Higher-dimensional embeddings seem not to improve performance, while hurting the training runtime.
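Continuing the same illustrative PyTorch sketch (again, a stand-in for the original PyCNN implementation rather than a reproduction of it), Equations 1-3 amount to a frequency-weighted average of the path vectors ~o_p, an optional concatenation with the two word embeddings, and a single softmax layer:

import torch
import torch.nn as nn

class TermPairClassifier(nn.Module):
    """HypeNET-style classifier: averaged path vectors (+ word embeddings) -> softmax."""

    def __init__(self, path_dim, word_dim, integrated=True):
        super().__init__()
        self.integrated = integrated
        in_dim = path_dim + (2 * word_dim if integrated else 0)
        self.out = nn.Linear(in_dim, 2)  # Equation 2: single-layer softmax classifier.

    def forward(self, path_vectors, path_counts, v_wx=None, v_wy=None):
        # Equation 1: frequency-weighted average of the path vectors o_p.
        weights = path_counts / path_counts.sum()
        v_paths = (weights.unsqueeze(1) * path_vectors).sum(dim=0)
        # Equation 3: optionally concatenate the x and y word embeddings.
        if self.integrated:
            v_xy = torch.cat([v_wx, v_paths, v_wy], dim=-1)
        else:
            v_xy = v_paths
        return torch.softmax(self.out(v_xy), dim=-1)  # c; positive if c[1] > 0.5

# Example with 3 paths of dimension 60 and 50-dimensional word embeddings.
clf = TermPairClassifier(path_dim=60, word_dim=50)
paths = torch.randn(3, 60)
counts = torch.tensor([5.0, 2.0, 1.0])
c = clf(paths, counts, v_wx=torch.randn(50), v_wy=torch.randn(50))
print(bool(c[1] > 0.5))

In training one would minimize cross entropy over the two classes, as described above; the snippet only shows the forward pass.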

                train     test     val     all
random split    49,475    17,670   3,534   70,679
lexical split   20,335    6,610    1,350   28,295

Table 2: The number of instances in each dataset.

resource     relations
WordNet      instance hypernym, hypernym
DBPedia      type
Wikidata     subclass of, instance of
Yago         subclass of

4 Dataset

4.1 Creating Instances

Neural networks typically require a large amount of training data, whereas the existing hypernymy datasets, like BLESS (Baroni and Lenci, 2011), are relatively small. Therefore, we followed the common methodology of creating a dataset using distant supervision from knowledge resources (Snow et al., 2004; Riedel et al., 2013). Following Snow et al. (2004), who constructed their dataset based on WordNet hypernymy, and aiming to create a larger dataset, we extract hypernymy relations from several resources: WordNet (Fellbaum, 1998), DBPedia (Auer et al., 2007), Wikidata (Vrandečić, 2012) and Yago (Suchanek et al., 2007).

All instances in our dataset, both positive and negative, are pairs of terms that are directly related in at least one of the resources. These resources contain thousands of relations, some of which indicate hypernymy with varying degrees of certainty. To avoid including questionable relation types, we consider as denoting positive examples only indisputable hypernymy relations (Table 1), which we manually selected from the set of hypernymy indicating relations in Shwartz et al. (2015). Term-pairs related by other relations (including hyponymy) are considered as negative instances. Using related rather than random term-pairs as negative instances tests our method's ability to distinguish between hypernymy and other kinds of semantic relatedness. We maintain a ratio of 1:4 positive to negative pairs in the dataset.

Like Snow et al. (2004), we include only term-pairs that have joint occurrences in the corpus, requiring at least two different dependency paths for each pair.

4.2 Random and Lexical Dataset Splits

As our primary dataset, we perform standard random splitting, with 70% train, 25% test and 5% validation sets.

As pointed out by Levy et al. (2015), supervised distributional lexical inference methods tend to perform "lexical memorization", i.e., instead of learning a relation between the two terms, they mostly learn an independent property of a single term in the pair: whether it is a "prototypical hypernym" or not. For instance, if the training set contains term-pairs such as (dog, animal), (cat, animal), and (cow, animal), all annotated as positive examples, the algorithm may learn that animal is a prototypical hypernym, classifying any new (x, animal) pair as positive, regardless of the relation between x and animal. Levy et al. (2015) suggested to split the train and test sets such that each will contain a distinct vocabulary ("lexical split"), in order to prevent the model from overfitting by lexical memorization.

To investigate such behaviors, we present results also for a lexical split of our dataset. In this case, we split the train, test and validation sets such that each contains a distinct vocabulary. We note that this differs from Levy et al. (2015), who split only the train and the test sets, and dedicated a subset of the train for validation. We chose to deviate from Levy et al. (2015) because we noticed that when the validation set contains terms from the train set, the model is rewarded for lexical memorization when tuning the hyper-parameters, consequently yielding suboptimal performance on the lexically-distinct test set. When each set has a distinct vocabulary, the hyper-parameters are tuned to avoid lexical memorization and are likely to perform better on the test set. We tried to keep roughly the same 70/25/5 ratio in our lexical split.5 The sizes of the two datasets are shown in Table 2.

Indeed, training a model on a lexically split dataset may result in a more general model, that can better handle pairs consisting of two unseen terms during inference. However, we argue that in the common applied scenario, the inference involves an unseen pair (x, y), in which x and/or y have already been observed separately. Models trained on a random split may introduce the model with a term's "prior probability" of being a hypernym or a hyponym, and this information can be exploited beneficially at inference time.

5 The lexical split discards many pairs consisting of cross-set terms.
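A hedged sketch of the lexical-split procedure described above: the vocabulary is partitioned into disjoint train/test/validation parts, and only pairs whose two terms fall in the same part are kept (cross-set pairs are discarded, as noted in footnote 5). This illustrates the idea only and is not the exact dataset-creation script; names and the split-by-vocabulary-ratio choice are assumptions.

```python
import random

def lexical_split(pairs, ratios=(0.7, 0.25, 0.05), seed=0):
    # Partition the vocabulary, not the pairs, into disjoint sets.
    vocab = sorted({t for x, y in pairs for t in (x, y)})
    random.Random(seed).shuffle(vocab)
    n = len(vocab)
    cut1, cut2 = int(ratios[0] * n), int((ratios[0] + ratios[1]) * n)
    part = {t: ("train" if i < cut1 else "test" if i < cut2 else "val")
            for i, t in enumerate(vocab)}
    split = {"train": [], "test": [], "val": []}
    for x, y in pairs:
        if part[x] == part[y]:      # keep only pairs whose terms share a part
            split[part[x]].append((x, y))
    return split
```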

path
X/NOUN/dobj/> establish/VERB/ROOT/- as/ADP/prep/< Y/NOUN/pobj/<
X/NOUN/dobj/> VERB as/ADP/prep/< Y/NOUN/pobj/<
X/NOUN/dobj/> * as/ADP/prep/< Y/NOUN/pobj/<
X/NOUN/dobj/> establish/VERB/ROOT/- ADP Y/NOUN/pobj/<
X/NOUN/dobj/> establish/VERB/ROOT/- * Y/NOUN/pobj/<

Table 3: Example generalizations of X was established as Y.

5 Baselines

We compare HypeNET with several state-of-the-art methods for hypernymy detection, as described in Section 2: path-based methods (Section 5.1), and distributional methods (Section 5.2). Due to different works using different datasets and corpora, we replicated the baselines rather than comparing to the reported results.

We use the Wikipedia dump from May 2015 as the underlying corpus of all the methods, and parse it using spaCy.6 We perform model selection on the validation set to tune the hyper-parameters of each method.7 The best hyper-parameters are reported in the appendix.

5.1 Path-based Methods

Snow We follow the original paper, and extract all shortest paths of four edges or less between terms in a dependency tree. Like Snow et al. (2004), we add paths with "satellite edges", i.e., single words not already contained in the dependency path, which are connected to either X or Y, allowing paths like such Y as X. The number of distinct paths was 324,578. We apply χ2 feature selection to keep only the 100,000 most informative paths and train a logistic regression classifier.

Generalization We also compare our method to a baseline that uses generalized dependency paths. Following PATTY's approach to generalizing paths (Nakashole et al., 2012), we replace edges with their part-of-speech tags as well as with wild cards. We generate the powerset of all possible generalizations, including the original paths. See Table 3 for examples. The number of features after generalization went up to 2,093,220. Similarly to the first baseline, we apply feature selection, this time keeping the 1,000,000 most informative paths, and train a logistic regression classifier over the generalized paths.8

5.2 Distributional Methods

Unsupervised SLQS (Santus et al., 2014) is an entropy-based measure for hypernymy detection, reported to outperform previous state-of-the-art unsupervised methods (Weeds and Weir, 2003; Kotlerman et al., 2010). The original paper was evaluated on the BLESS dataset (Baroni and Lenci, 2011), which consists of mostly frequent words. Applying the vanilla settings of SLQS on our dataset, that contains also rare terms, resulted in low performance. Therefore, we received assistance from Enrico Santus, who kindly provided the results of SLQS on our dataset after tuning the system as follows. The validation set was used to tune the threshold for classifying a pair as positive, as well as the maximum number of each term's most associated contexts (N). In contrast to the original paper, in which the number of each term's contexts is fixed to N, in this adaptation it was set to the minimum between the number of contexts with LMI score above zero and N. In addition, the SLQS scores were not multiplied by the cosine similarity scores between terms, and terms were lemmatized prior to computing the SLQS scores, significantly improving recall.

As our results suggest, while this method is state-of-the-art for unsupervised hypernymy detection, it is basically designed for classifying specificity level of related terms, rather than hypernymy in particular.

Supervised To represent term-pairs with distributional features, we tried several state-of-the-art methods: concatenation ~x ⊕ ~y (Baroni et al., 2012), difference ~y − ~x (Roller et al., 2014; Weeds et al., 2014), and dot-product ~x · ~y. We downloaded several pre-trained embeddings (Mikolov et al., 2013; Pennington et al., 2014) of different sizes, and trained a number of classifiers: logistic regression, SVM, and SVM with RBF kernel, which was reported by Levy et al. (2015) to perform best in this setting. We perform model selection on the validation set to select the best vectors, method and regularization factor (see the appendix).

6 https://spacy.io/
7 We applied grid search for a range of values, and picked the ones that yield the highest F1 score on the validation set.
8 We also tried keeping the 100,000 most informative paths, but the performance was worse.
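The PATTY-style generalization used by the Snow + Gen baseline can be sketched as follows: every edge of a path may either stay verbatim, be replaced by its POS tag, or be replaced by a wildcard, and all combinations are generated. This is an illustrative sketch only, assuming the end X/Y edges are kept verbatim (as in the Table 3 examples); it is not the baseline's actual implementation.

```python
from itertools import product

def generalize_path(edges):
    """edges: (lemma, POS, dependency label, direction) tuples of one path."""
    def verbatim(edge):
        return "/".join(edge)
    # Each intermediate edge: keep it, replace by POS, or replace by a wildcard.
    middle_options = [(verbatim(e), e[1], "*") for e in edges[1:-1]]
    variants = set()
    for choice in product(*middle_options):
        variants.add(" ".join([verbatim(edges[0]), *choice, verbatim(edges[-1])]))
    return variants

# Example: the path of "X was established as Y" from Table 3.
path = [("X", "NOUN", "dobj", ">"),
        ("establish", "VERB", "ROOT", "-"),
        ("as", "ADP", "prep", "<"),
        ("Y", "NOUN", "pobj", "<")]
# generalize_path(path) yields 3^2 = 9 variants, including
# "X/NOUN/dobj/> VERB as/ADP/prep/< Y/NOUN/pobj/<" and
# "X/NOUN/dobj/> establish/VERB/ROOT/- * Y/NOUN/pobj/<".
```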

                                                   random split                lexical split
                method                             precision  recall  F1       precision  recall  F1
Path-based      Snow                               0.843      0.452   0.589    0.760      0.438   0.556
                Snow + Gen                         0.852      0.561   0.676    0.759      0.530   0.624
                HypeNET Path-based                 0.811      0.716   0.761    0.691      0.632   0.660
Distributional  SLQS (Santus et al., 2014)         0.491      0.737   0.589    0.375      0.610   0.464
                Best supervised (concatenation)    0.901      0.637   0.746    0.754      0.551   0.637
Combined        HypeNET Integrated                 0.913      0.890   0.901    0.809      0.617   0.700

Table 4: Performance scores of our method compared to the path-based baselines and the state-of-the-art distributional methods for hypernymy detection, on both variations of the dataset – with lexical and random split to train / test / validation.

6 Results

Table 4 displays performance scores of HypeNET and the baselines. HypeNET Path-based is our path-based recurrent neural network model (Section 3.1) and HypeNET Integrated is our combined method (Section 3.2).

Comparing the path-based methods shows that generalizing paths improves recall while maintaining similar levels of precision, reassessing the behavior found in Nakashole et al. (2012). HypeNET Path-based outperforms both path-based baselines by a significant improvement in recall and with slightly lower precision. The recall boost is due to better path generalization, as demonstrated in Section 7.1.

Regarding distributional methods, the unsupervised SLQS baseline performed slightly worse on our dataset. The low precision stems from its inability to distinguish between hypernyms and meronyms, which are common in our dataset, causing many false positive pairs such as (zabrze, poland) and (kibbutz, israel). We sampled 50 false positive pairs of each dataset split, and found that 38% of the false positive pairs in the random split and 48% of those in the lexical split were holonym-meronym pairs.

In accordance with previously reported results, the supervised embedding-based method is the best performing baseline on our dataset as well. HypeNET Path-based performs slightly better, achieving state-of-the-art results. Adding distributional features to our method shows that these two approaches are indeed complementary. On both dataset splits, the performance differences between HypeNET Integrated and HypeNET Path-based, as well as the supervised distributional method, are substantial, and statistically significant with p-value of 1% (paired t-test).

We also reassess that indeed supervised distributional methods perform worse on a lexical split (Levy et al., 2015). We further observe a similar reduction when using HypeNET, which is not a result of lexical memorization, but rather stems from over-generalization (Section 7.1).

7 Analysis

7.1 Qualitative Analysis of Learned Paths

We analyze HypeNET's ability to generalize over path structures, by comparing prominent indicative paths which were learned by each of the path-based methods. We do so by finding high-scoring paths that contributed to the classification of true-positive pairs in the dataset. In the path-based baselines, these are the highest-weighted features as learned by the logistic regression classifier. In the LSTM-based method, it is less straightforward to identify the most indicative paths. We assess the contribution of a certain path p to classification by regarding it as the only path that appeared for the term-pair, and compute its TRUE label score from the class distribution: softmax(W · ~vxy)[1], setting ~vxy = [~0, ~op, ~0].

A notable pattern is that Snow's method learns specific paths, like X is Y from (e.g. Megadeth is an American thrash metal band from Los Angeles). While Snow's method can only rely on verbatim paths, limiting its recall, the generalized version of Snow often makes coarse generalizations, such as X VERB Y from. Clearly, such a path is too general, and almost any verb assigned to it results in a non-indicative path (e.g. X take Y from). Efforts by the learning method to avoid such generalization, again, lower the recall. HypeNET provides a better midpoint, making fine-grained generalizations by learning additional semantically similar paths such as X become Y from and X remain Y from. See Table 5 for additional example paths which illustrate these behaviors.

We also noticed that while on the random split our model learns a range of specific paths such as X is Y published (learned for e.g. Y=magazine) and X is Y produced (Y=film), in the lexical split it only learns the general X is Y path for these relations.
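The path-scoring probe described above can be written as a small sketch: a single path is treated as the only evidence for a pair, so the term-pair vector becomes [~0, ~op, ~0], and the softmax score of the TRUE label is read off. The names below (path_true_score, word_dim) are illustrative; W stands for the trained classification matrix of the integrated model.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def path_true_score(o_p, W, word_dim):
    # Zero out the distributional slots and keep only the path embedding.
    v_xy = np.concatenate([np.zeros(word_dim), o_p, np.zeros(word_dim)])
    return softmax(W @ v_xy)[1]   # probability assigned to the TRUE (hypernymy) label
```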

method: Snow
  path: X/NOUN/nsubj/> be/VERB/ROOT/- Y/NOUN/attr/< direct/VERB/acl/>
  example text: Eyeball is a 1975 Italian-Spanish film directed by Umberto Lenzi
  path: X/NOUN/nsubj/> be/VERB/ROOT/- Y/NOUN/attr/< publish/VERB/acl/>
  example text: Allure is a U.S. women's beauty magazine published monthly

method: Snow + Gen
  path: X/NOUN/compound/> NOUN* be/VERB/ROOT/- Y/NOUN/attr/< base/VERB/acl/>
  example text: Calico Light Weapons Inc. (CLWS) is an American privately held manufacturing company based in Cornelius, Oregon
  path: X/NOUN/compound/> NOUN Y/NOUN/compound/<
  example text: Weston Town Council

method: HypeNET Integrated
  path: X/NOUN/nsubj/> be/VERB/ROOT/- Y/NOUN/attr/< (release|direct|produce|write)/VERB/acl/>
  example text: Blinky is a 1923 American comedy film directed by Edward Sedgwick
  path: X/NOUN/compound/> (association|co.|company|corporation|foundation|group|inc.|international|limited|ltd.)/NOUN/nsubj/> be/VERB/ROOT/- Y/NOUN/attr/< ((create|found|headquarter|own|specialize)/VERB/acl/>)?
  example text: Retalix Ltd. is a software company

Table 5: Examples of indicative paths learned by each method, with corresponding true positive term-pairs from the random split test set. Hypernyms are marked red and hyponyms are marked blue.

We note that X is Y is a rather "noisy" path, which may occur in ad-hoc contexts without indicating generic hypernymy relations (e.g. chocolate is a big problem in the context of children's health). While such a model may identify hypernymy relations between unseen terms, based on general paths, it is prone to over-generalization, hurting its performance, as seen in Table 4. As discussed in §4.2, we suspect that this scenario, in which both terms are unseen, is usually not common enough to justify this limiting training setup.

7.2 Error Analysis

False Positives We categorized the false positive pairs on the random split according to the relation holding between each pair of terms in the resources used to construct the dataset. We grouped several semantic relations from different resources to broad categories, e.g. synonym includes also alias and Wikipedia redirection. Table 6 displays the distribution of semantic relations among false positive pairs.

Relation                   %
synonymy                   21.37%
hyponymy                   29.45%
holonymy / meronymy        9.36%
hypernymy-like relations   21.03%
other relations            18.77%

Table 6: Distribution of relations holding between each pair of terms in the resources among false positive pairs.

More than 20% of the errors stem from confusing synonymy with hypernymy, which are known to be difficult to distinguish. An additional 30% of the term-pairs are reversed hypernym-hyponym pairs (y is a hyponym of x). Examining a sample of these pairs suggests that they are usually near-synonyms, i.e., it is not that clear whether one term is truly more general than the other or not. For instance, fiction is annotated in WordNet as a hypernym of story, while our method classified fiction as its hyponym.

A possible future research direction might be to quite simply extend our network to classify term-pairs simultaneously to multiple semantic relations, as in Pavlick et al. (2015). Such a multiclass model can hopefully better distinguish between these similar semantic relations.

Another notable category is hypernymy-like relations: these are other relations in the resources that could also be considered as hypernymy, but were annotated as negative due to our restrictive selection of only indisputable hypernymy relations from the resources (see Section 4.1). These include instances like (Goethe, occupation, novelist) and (Homo, subdivisionRanks, species).

Lastly, other errors made by the model often correspond to term-pairs that co-occur very few times in the corpus, e.g. xebec, a studio producing Anime, was falsely classified as a hyponym of anime.

False Negatives We sampled 50 term-pairs that were falsely annotated as negative, and analyzed the major (overlapping) types of errors (Table 7). Most of these pairs had only few co-occurrences in the corpus. This is often either due to infrequent terms (e.g. cbc.ca), or a rare sense of x in which the hypernymy relation holds (e.g. (night, play), holding for "Night", a dramatic sketch by Harold Pinter). Such a term-pair may have too few hypernymy-indicating paths, leading to classifying it as negative.

Error Type              %
1 low statistics        80%
2 infrequent term       36%
3 rare hyponym sense    16%
4 annotation error      8%

Table 7: (Overlapping) categories of false negative pairs: (1) x and y co-occurred less than 25 times (average co-occurrences for true positive pairs is 99.7). (2) Either x or y is infrequent. (3) The hypernymy relation holds for a rare sense of x. (4) (x, y) was incorrectly annotated as positive.

8 Conclusion

We presented HypeNET: a neural-networks-based method for hypernymy detection. First, we focused on improving path representation using LSTM, resulting in a path-based model that performs significantly better than prior path-based methods, and matches the performance of the previously superior distributional methods. In particular, we demonstrated that the increase in recall is a result of generalizing semantically-similar paths, in contrast to prior methods, which either make no generalizations or over-generalize paths.

We then extended our network by integrating distributional signals, yielding an improvement of additional 14 F1 points, and demonstrating that the path-based and the distributional approaches are indeed complementary.

Finally, our architecture seems straightforwardly applicable for multi-class classification, which, in future work, could be used to classify term-pairs to multiple semantic relations.

Acknowledgments

We would like to thank Omer Levy for his involvement and assistance in the early stage of this project and Enrico Santus for helping us by computing the results of SLQS (Santus et al., 2014) on our dataset.

This work was partially supported by an Intel ICRI-CI grant, the Israel Science Foundation grant 880/12, and the German Research Foundation through the German-Israeli Project Cooperation (DIP, grant DA 1600/1-1).

References

Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. 2007. DBpedia: A nucleus for a web of open data. Springer.
Marco Baroni and Alessandro Lenci. 2011. How we BLESSed distributional semantic evaluation. In Proceedings of the GEMS 2011 Workshop on GEometrical Models of Natural Language Semantics, pages 1–10.
Marco Baroni, Raffaella Bernardi, Ngoc-Quynh Do, and Chung-chieh Shan. 2012. Entailment above the word level in distributional semantics. In EACL, pages 23–32.
Claudia Borg, Mike Rosner, and Gordon Pace. 2009. Evolutionary algorithms for definition extraction. In Proceedings of the 1st Workshop on Definition Extraction, pages 26–32.
Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R Hruschka Jr, and Tom M Mitchell. 2010. Toward an architecture for never-ending language learning. In AAAI.
Christiane Fellbaum. 1998. WordNet. Wiley Online Library.
Katrin Fundel, Robert Küffner, and Ralf Zimmer. 2007. RelEx - relation extraction using dependency parse trees. Bioinformatics, pages 365–371.
Marti A Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In ACL, pages 539–545.
Iris Hendrickx, Su Nam Kim, Zornitsa Kozareva, Preslav Nakov, Diarmuid Ó Séaghdha, Sebastian Padó, Marco Pennacchiotti, Lorenza Romano, and Stan Szpakowicz. 2009. SemEval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals. In SemEval, pages 94–99.
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, pages 1735–1780.
Nobuhiro Kaji and Masaru Kitsuregawa. 2008. Using hidden markov random fields to combine distributional and pattern-based word clustering. In COLING, pages 401–408.
Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Lili Kotlerman, Ido Dagan, Idan Szpektor, and Maayan Zhitomirsky-Geffet. 2010. Directional distributional similarity for lexical inference. NLE, pages 359–389.
Zornitsa Kozareva and Eduard Hovy. 2010. A semi-supervised method to learn and construct taxonomies using the web. In EMNLP, pages 1110–1118.
Omer Levy, Steffen Remus, Chris Biemann, and Ido Dagan. 2015. Do supervised distributional methods really learn lexical inference relations? NAACL.
Dekang Lin. 1998. An information-theoretic definition of similarity. In ICML, pages 296–304.
Yang Liu, Furu Wei, Sujian Li, Heng Ji, Ming Zhou, and Houfeng Wang. 2015. A dependency-based neural network for relation classification. arXiv preprint arXiv:1507.04646.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In NIPS, pages 3111–3119.
Shachar Mirkin, Ido Dagan, and Maayan Geffet. 2006. Integrating pattern-based and distributional similarity methods for lexical entailment acquisition. In COLING and ACL, pages 579–586.
Ndapandula Nakashole, Gerhard Weikum, and Fabian Suchanek. 2012. PATTY: A taxonomy of relational patterns with semantic types. In EMNLP and CoNLL, pages 1135–1145.
Roberto Navigli and Paola Velardi. 2010. Learning word-class lattices for definition and hypernym extraction. In ACL, pages 1318–1327.
Thien Huu Nguyen and Ralph Grishman. 2015. Combining neural networks and log-linear models to improve relation extraction. arXiv preprint arXiv:1511.05926.
Ellie Pavlick, Johan Bos, Malvina Nissim, Charley Beller, Benjamin Van Durme, and Chris Callison-Burch. 2015. Adding semantics to data-driven paraphrasing. In ACL.
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP, pages 1532–1543.
Sebastian Riedel, Limin Yao, Andrew McCallum, and Benjamin M Marlin. 2013. Relation extraction with matrix factorization and universal schemas. In NAACL.
Laura Rimell. 2014. Distributional lexical entailment by topic coherence. In EACL, pages 511–519.
Stephen Roller, Katrin Erk, and Gemma Boleda. 2014. Inclusive yet selective: Supervised distributional hypernymy detection. In COLING, pages 1025–1036.
Enrico Santus, Alessandro Lenci, Qin Lu, and Sabine Schulte Im Walde. 2014. Chasing hypernyms in vector spaces with entropy. In EACL, pages 38–42.
Vered Shwartz, Omer Levy, Ido Dagan, and Jacob Goldberger. 2015. Learning to exploit structured resources for lexical inference. In CoNLL, page 175.
Rion Snow, Daniel Jurafsky, and Andrew Y Ng. 2004. Learning syntactic patterns for automatic hypernym discovery. In NIPS.
Rion Snow, Daniel Jurafsky, and Andrew Y Ng. 2006. Semantic taxonomy induction from heterogenous evidence. In ACL, pages 801–808.
Fabian M Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2007. Yago: A core of semantic knowledge. In WWW, pages 697–706.
Peter D Turney. 2006. Similarity of semantic relations. CL, pages 379–416.
Denny Vrandečić. 2012. Wikidata: A new platform for collaborative data collection. In WWW, pages 1063–1064.
Julie Weeds and David Weir. 2003. A general framework for distributional similarity. In EMNLP, pages 81–88.
Julie Weeds, Daoud Clarke, Jeremy Reffin, David Weir, and Bill Keller. 2014. Learning to distinguish hypernyms and co-hyponyms. In COLING, pages 2249–2259.
Yan Xu, Lili Mou, Ge Li, Yunchuan Chen, Hao Peng, and Zhi Jin. 2015. Classifying relations via long short term memory networks along shortest dependency paths. In EMNLP.
Yan Xu, Ran Jia, Lili Mou, Ge Li, Yunchuan Chen, Yangyang Lu, and Zhi Jin. 2016. Improved relation classification by deep recurrent neural networks with data augmentation. arXiv preprint arXiv:1601.03651.

Appendix A Best Hyper-parameters

Table 8 displays the chosen hyper-parameters of each method, yielding the highest F1 score on the validation set.

random split
method            values
Snow              regularization: L2
Snow + Gen        regularization: L1
LSTM              embeddings: GloVe-100-Wiki, learning rate: α = 0.001, dropout: d = 0.5
SLQS              N = 100, threshold = 0.000464
Best Supervised   method: concatenation, classifier: SVM, embeddings: GloVe-300-Wiki
LSTM-Integrated   embeddings: GloVe-50-Wiki, learning rate: α = 0.001, word dropout: d = 0.3

lexical split
method            values
Snow              regularization: L2
Snow + Gen        regularization: L2
LSTM              embeddings: GloVe-50-Wiki, learning rate: α = 0.001, dropout: d = 0.5
SLQS              N = 100, threshold = 0.007629
Best Supervised   method: concatenation, classifier: SVM, embeddings: GloVe-100-Wikipedia
LSTM-Integrated   embeddings: GloVe-50-Wiki, learning rate: α = 0.001, word dropout: d = 0.3

Table 8: The best hyper-parameters in every model.

Chapter 5

Path-based vs. Distributional Information in Recognizing Lexical Semantic Relations

Path-based vs. Distributional Information in Recognizing Lexical Semantic Relations

Vered Shwartz    Ido Dagan
Computer Science Department, Bar-Ilan University, Ramat-Gan, Israel
[email protected]    [email protected]

Abstract

Recognizing various semantic relations between terms is beneficial for many NLP applications. While path-based and distributional information sources are considered complementary for this task, the superior results of the latter suggested that the former's contribution might have become obsolete. We follow the recent success of an integrated neural method for hypernymy detection (Shwartz et al., 2016) and extend it to recognize multiple relations. The empirical results show that this method is effective in the multiclass setting as well. We further show that the path-based information source always contributes to the classification, and analyze the cases in which it mostly complements the distributional information.

1 Introduction

Automated methods to recognize the lexical semantic relations that hold between terms are valuable for NLP applications. Two main information sources are used to recognize such relations: path-based and distributional. Path-based methods consider the joint occurrences of the two terms in a given pair in the corpus, where the dependency paths that connect the terms are typically used as features (Hearst, 1992; Snow et al., 2004; Nakashole et al., 2012; Riedel et al., 2013). Distributional methods are based on the disjoint occurrences of each term and have recently become popular using word embeddings (Mikolov et al., 2013; Pennington et al., 2014), which provide a distributional representation for each term. These embedding-based methods were reported to perform well on several common datasets (Baroni et al., 2012; Roller et al., 2014), consistently outperforming other methods (Santus et al., 2016; Necsulescu et al., 2015).

While these two sources have been considered complementary, recent results suggested that path-based methods have no marginal contribution over the distributional ones. Recently, however, Shwartz et al. (2016) showed that a good path representation can provide substantial complementary information to the distributional signal in hypernymy detection, notably improving results on a new dataset.

In this paper, we apply Shwartz et al.'s (2016) method to recognize multiple semantic relations, and reexamine their findings over common datasets for the broader task. We show that this integrated method is indeed effective in the multiclass setting as well, often performing better than each individual method. We further assess the contribution of each information source to semantic relation classification. Even though the distributional information source is dominant across most datasets, path-based information always contributed to it. In particular, path-based information seems better at capturing the relationship between terms, rather than their individual properties, and can do so even for rare words or senses. Our code and data are available at https://github.com/vered1986/LexNET

2 Background: HypeNET

Recently, Shwartz et al. (2016) introduced HypeNET, a hypernymy detection method based on the integration of the best-performing distributional method with a novel neural path representation, improving upon state-of-the-art methods.

Figure 1: An illustration of the classification models. The top row illustrates the path-based component. Each path is a sequence of edges, and each edge consists of four components: lemma, POS, dependency label and direction. Edge vectors are fed in sequence into the LSTM, resulting in an embedding vector ~op for each path. (x,y)'s path embeddings are averaged to create ~vpaths(x,y).

In HypeNET, a term-pair (x, y) is represented as a feature vector, consisting of both distributional and path-based features: ~vxy = [~vwx, ~vpaths(x,y), ~vwy], where ~vwx and ~vwy are x and y's word embeddings, providing their distributional representation, and ~vpaths(x,y) is a vector representing the dependency paths connecting x and y in the corpus. A binary classifier is trained on these vectors, yielding c = softmax(W · ~vxy), predicting hypernymy if c[1] > 0.5.

Each dependency path is embedded using an LSTM (Hochreiter and Schmidhuber, 1997), as illustrated in the top row of Figure 1. This results in a path vector space in which semantically-similar paths (e.g. X is defined as Y and X is described as Y) have similar vectors. The vectors of all the paths that connect x and y are averaged to create ~vpaths(x,y).

Shwartz et al. (2016) showed that this new path representation outperforms prior path-based methods for hypernymy detection, and that the integrated model yields a substantial improvement over each individual model. While HypeNET is designed for detecting hypernymy relations, it seems straightforward to extend it to classify term-pairs simultaneously to multiple semantic relations, as we describe next.

dataset     relations                                                                   #instances
K&H+N       hypernym, meronym, co-hyponym, random                                       57,509
BLESS       hypernym, meronym, co-hyponym, event, attribute, random                     26,546
ROOT09      hypernym, co-hyponym, random                                                12,762
EVALution   hypernym, meronym, attribute, synonym, antonym, holonym, substance meronym  7,378

Table 1: The relation types and number of instances in each dataset, named by their WordNet equivalent where relevant.

3 Classification Methods We experiment with several classification models, as illustrated in Figure 1:

Path-based HypeNET's path-based model (PB) is a binary classifier trained on the path vectors alone: ~vpaths(x,y). We adapt the model to classify multiple relations by changing the network softmax output c to a distribution over k target relations, classifying a pair to the highest scoring relation: r = argmax_i c[i].

Distributional We train an SVM classifier on the concatenation of x and y's word embeddings [~vwx, ~vwy] (Baroni et al., 2012) (DS).1 Levy et al. (2015) claimed that such a linear classifier is incapable of capturing interactions between x and y's features, and that instead it learns separate properties of x or y, e.g. that y is a prototypical hypernym. To examine the effect of non-linear expressive power on the model, we experiment with a neural network with a single hidden layer trained on [~vwx, ~vwy] (DSh).2

Integrated We similarly adapt the HypeNET integrated model to classify multiple semantic relations

(LexNET). Based on the same motivation of DSh, we also experiment with a version of the network with a hidden layer (LexNETh), re-defining c = softmax(W2 · ~h + b2), where ~h = tanh(W1 · ~vxy + b1) is the hidden layer. The technical details of our network are identical to Shwartz et al. (2016).
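A minimal NumPy sketch of the multiclass adaptation described above, with the softmax output ranging over k relations and LexNETh adding a tanh hidden layer; parameter names (W1, b1, W2, b2) are illustrative and this is not the actual network code.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def lexnet_predict(v_xy, W):
    # LexNET: r = argmax_i softmax(W . v_xy)[i] over the k target relations.
    return int(np.argmax(softmax(W @ v_xy)))

def lexneth_predict(v_xy, W1, b1, W2, b2):
    # LexNETh: c = softmax(W2 . h + b2), with h = tanh(W1 . v_xy + b1).
    h = np.tanh(W1 @ v_xy + b1)
    return int(np.argmax(softmax(W2 @ h + b2)))
```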

4 Datasets

We use four common semantic relation datasets that were created using semantic resources: K&H+N (Necsulescu et al., 2015) (an extension to Kozareva and Hovy (2010)), BLESS (Baroni and Lenci, 2011), EVALution (Santus et al., 2015), and ROOT09 (Santus et al., 2016). Table 1 displays the relation types and number of instances in each dataset. Most dataset relations are parallel to WordNet relations, such as hypernymy (cat, animal) and meronymy (hand, body), with an additional random relation for negative instances. BLESS contains the event and attribute relations, connecting a concept with a typical activity/property (e.g. (alligator, swim) and (alligator, aquatic)). EVALution contains a richer schema of semantic relations, with some redundancy: it contains both meronymy and holonymy (e.g. for bicycle and wheel), and the fine-grained substance-holonymy relation. We removed two relations with too few instances: Entails and MemberOf. To prevent the lexical memorization effect (Levy et al., 2015), Santus et al. (2016) added negative switched hyponym-hypernym pairs (e.g. (apple, animal), (cat, fruit)) to ROOT09, which were reported to reduce this effect.

5 Results

Like Shwartz et al. (2016), we tuned the methods' hyper-parameters on the validation set of each dataset, and used Wikipedia as the corpus. Table 2 displays the performance of the different methods on all datasets, in terms of recall, precision and F1.

Our first empirical finding is that Shwartz et al.'s (2016) algorithm is effective in the multiclass setting as well (LexNET). The only dataset on which performance is mediocre is EVALution, which seems to be inherently harder for all methods, due to its large number of relations and small size. The performance differences between LexNET and DS are statistically significant on all datasets with p-value of 0.01 (paired t-test).

1 We experimented also with difference ~vwx − ~vwy and other classifiers, but concatenation yielded the best performance.
2 This was previously done by Bowman et al. (2014), with promising results, but on a small artificial vocabulary.

            K&H+N                   BLESS                   ROOT09                  EVALution
method      P      R      F1        P      R      F1        P      R      F1        P      R      F1
PB          0.713  0.604  0.55      0.759  0.756  0.755     0.788  0.789  0.788     0.53   0.537  0.503
DS          0.909  0.906  0.904     0.811  0.812  0.811     0.636  0.675  0.646     0.531  0.544  0.525
DSh         0.983  0.984  0.983     0.891  0.889  0.889     0.712  0.721  0.716     0.57   0.573  0.571
LexNET      0.985  0.986  0.985     0.894  0.893  0.893     0.813  0.814  0.813     0.601  0.607  0.6
LexNETh     0.984  0.985  0.984     0.895  0.892  0.893     0.812  0.816  0.814     0.589  0.587  0.583

Table 2: Performance scores (precision, recall and F1) of each individual approach and the integrated models. To compute the metrics we used scikit-learn (Pedregosa et al., 2011) with the “averaged” setup, which computes the metrics for each relation, and reports their average, weighted by support (the number of true instances for each relation). Note that it can result in an F1 score that is not the harmonic mean of precision and recall.

dataset     #pairs   x           y         gold label   DSh prediction   possible explanation
K&H+N       102      firefly     car       false        hypo             (x, car) frequent label is hypo
                     racehorse   horse     hypo         false            out of the embeddings vocabulary
                     larvacean   salp      sibl         false            rare terms larvacean and salp
BLESS       275      tanker      ship      hyper        event            (x, ship) frequent label is event
                     squirrel    lie       random       event            (x, lie) frequent label is event
                     herring     salt      event        random           non-prototypical relation
ROOT09      562      toaster     vehicle   RANDOM       HYPER            (x, vehicle) frequent label is HYPER
                     rice        grain     HYPER        RANDOM           (x, grain) frequent label is RANDOM
                     lung        organ     HYPER        COORD            polysemous term organ
EVALution   235      pick        metal     MadeOf       IsA              rare sense of pick
                     abstract    concrete  Antonym      MadeOf           polysemous term concrete
                     line        thread    Synonym      MadeOf           (x, thread) frequent label is MadeOf

Table 3: The number of term-pairs that were correctly classified by the integrated model while being incorrectly classified by DSh, in each test set, with corresponding examples of such term-pairs.

The performance differences between LexNET and DSh are statistically significant on BLESS and ROOT09 with p-value of 0.01, and on EVALution with p-value of 0.05.

DSh consistently improves upon DS. The hidden layer seems to enable interactions between x and y's features, which is especially noticed in ROOT09, where the hypernymy F1 score in particular rose from 0.25 to 0.45. Nevertheless, we did not observe a similar behavior in LexNETh, which worked similarly or slightly worse than LexNET. It is possible that the contributions of the hidden layer and the path-based source over the distributional signal are redundant.3 It may also be that the larger number of parameters in LexNETh prevents convergence to the optimal values given the modest amount of training data, stressing the need for large-scale datasets that will benefit training neural methods.

3 We also tried adding a hidden layer only over the distributional features of LexNET, but it did not improve performance.

6 Analysis

Table 2 demonstrates that the distributional source is dominant across most datasets, with DS performing better than PB. Although by design DS does not consider the relation between x and y, but rather learns properties of x or y, it performs well on BLESS and K&H+N. DSh further manages to capture relations at the distributional level, leaving the path-based source little room for improvement on these two datasets. On ROOT09, on the other hand, DS achieved the lowest performance. Our analysis reveals that this is due to the switched hypernym pairs, which drastically hurt the ability to memorize individual single words, hence reducing performance. The F1 scores of DS on this dataset were 0.91 for co-hyponyms but only 0.25 for hypernyms, while PB scored 0.87 and 0.66 respectively. Moreover, LexNET gains 10 points over DSh, suggesting the better capacity of path-based methods to capture relations between terms.

6.1 Analysis of Information Sources

To analyze the contribution of the path-based information source, we examined the term-pairs that were correctly classified by the best performing integrated model (LexNET/LexNETh) while being incorrectly classified by DSh. Table 3 displays the number of such pairs in each dataset, with corresponding term-pair examples. The common errors are detailed below:

Lexical Memorization DSh often classifies (x,y) term-pairs according to the most frequent relation of one of the terms (usually y) in the train set. The error is mostly prominent in ROOT09 (405/562 pairs, 72%), as a result of the switched hypernym pairs. However, it is not infrequent in other datasets (47% in BLESS, 43% in EVALution and 34% in K&H+N). As opposed to distributional information, path-based information pertains to both terms in the pair. With such information available, the integrated model succeeds to overcome the most frequent label bias for single words, classifying these pairs correctly.
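A small illustrative diagnostic for the memorization behavior described above (not from the paper's code): a baseline that tags each (x, y) pair with the most frequent training label of y; pairs on which this baseline agrees with DSh are candidates for memorization-driven predictions.

```python
from collections import Counter, defaultdict

def most_frequent_label_table(train_pairs):
    """train_pairs: iterable of (x, y, relation) triples."""
    by_y = defaultdict(Counter)
    for x, y, rel in train_pairs:
        by_y[y][rel] += 1
    # For each y, the relation it most often carries in the train set.
    return {y: counts.most_common(1)[0][0] for y, counts in by_y.items()}

def memorization_prediction(y, table, default="random"):
    return table.get(y, default)
```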

Non-prototypical Relations DSh might fail to recognize non-prototypical relations between terms, i.e. when y is a less-prototypical relatum of x, as in mero:(villa, guest), event:(cherry, pick), and attri:(piano, electric). While being overlooked by the distributional methods, these relations are often expressed in joint occurrences in the corpus, allowing the path-based component to capture them.

Rare Terms The integrated method often managed to classify term-pairs in which at least one of the terms is rare (e.g. hyper:(mastodon, proboscidean)), where the distributional method failed. It is a well known shortcoming of path-based methods that they require informative co-occurrences of x and y, which are not always available. With that said, thanks to the averaged path representation, PB can capture the relation between terms even if they only co-occur once within an informative path, while the distributional representation of rare terms is of lower quality. We note that the path-based information of (x,y) is encoded in the vector ~vpaths(x,y), which is the averaged vector representation of all paths that connected x and y in the corpus. Unlike other path-based methods in the literature, this representation is indifferent to the total number of paths, and as a result, even a single informative path can lead to successful classification.

Rare Senses Similarly, the path-based component succeeded to capture relations for rare senses of words where DSh failed, e.g. mero:(piano, key), event:(table, draw). Distributional representations suffer from insufficient representation of rare senses, while PB may capture the relation with a single meaningful occurrence of the rare sense with its related term. At the same time, it is less likely for a polysemous term to co-occur, in its non-related senses, with the candidate relatum. For instance, paths connecting piano to key are likely to correspond to the keyboard sense of key, indicating the relation that does hold for this pair with respect to this rare sense.

Finally, we note that LexNET, as well as the individual methods, perform poorly on synonyms and antonyms. The synonymy F1 score in EVALution was 0.35 in LexNET and in DSh and only 0.09 in PB, reassessing prior findings (Mirkin et al., 2006) that the path-based approach is weak in recognizing synonyms, which do not tend to co-occur. DSh performed poorly also on antonyms (F1 = 0.54), which were often mistaken for synonyms, since both tend to occur in the same contexts. It seems worthwhile to try improving the model with insights from prior work on these specific relations (Santus et al., 2014; Mohammad et al., 2013) or by using additional information sources, like multilingual data (Pavlick et al., 2015).

7 Conclusion

We presented an adaptation to HypeNET (Shwartz et al., 2016) that classifies term-pairs to one of multiple semantic relations. Evaluation on common datasets shows that HypeNET is extensible to the multiclass setting and performs better than each individual method.

Although the distributional information source is dominant across most datasets, it consistently benefits from path-based information, particularly when finer modeling of inter-term relationship is needed.

Finally, we note that all common datasets were created synthetically using semantic resources, leading to inconsistent behavior of the different methods, depending on the particular distribution of examples in each dataset.
This stresses the need to develop "naturally" distributed datasets that would be drawn from corpora, while reflecting realistic distributions encountered by semantic applications.

Acknowledgments

This work was partially supported by an Intel ICRI-CI grant, the Israel Science Foundation grant 880/12, and the German Research Foundation through the German-Israeli Project Cooperation (DIP, grant DA 1600/1-1).

References

Marco Baroni and Alessandro Lenci. 2011. How we BLESSed distributional semantic evaluation. In Proceedings of the GEMS 2011 Workshop on GEometrical Models of Natural Language Semantics, pages 1–10.
Marco Baroni, Raffaella Bernardi, Ngoc-Quynh Do, and Chung-chieh Shan. 2012. Entailment above the word level in distributional semantics. In Proceedings of EACL 2012, pages 23–32.
Samuel R Bowman, Christopher Potts, and Christopher D Manning. 2014. Learning distributed word representations for natural logic reasoning. AAAI.
Marti A. Hearst. 1992. Automatic acquisition of hyponyms from large text corpora. In COLING 1992 Volume 2: The 15th International Conference on Computational Linguistics.
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.
Zornitsa Kozareva and Eduard Hovy. 2010. A semi-supervised method to learn and construct taxonomies using the web. In Proceedings of EMNLP 2010, pages 1110–1118.
Omer Levy, Steffen Remus, Chris Biemann, and Ido Dagan. 2015. Do supervised distributional methods really learn lexical inference relations? In Proceedings of NAACL-HLT 2015, pages 970–976.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In NIPS, pages 3111–3119.
Shachar Mirkin, Ido Dagan, and Maayan Geffet. 2006. Integrating pattern-based and distributional similarity methods for lexical entailment acquisition. In Proceedings of the COLING/ACL 2006, pages 579–586.
Saif M Mohammad, Bonnie J Dorr, Graeme Hirst, and Peter D Turney. 2013. Computing lexical contrast. Computational Linguistics, 39(3):555–590.
Ndapandula Nakashole, Gerhard Weikum, and Fabian Suchanek. 2012. PATTY: A taxonomy of relational patterns with semantic types. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1135–1145.
Silvia Necsulescu, Sara Mendes, David Jurgens, Núria Bel, and Roberto Navigli. 2015. Reading between the lines: Overcoming data sparsity for accurate classification of lexical relationships. In Proceedings of *SEM 2015, pages 182–192.
Ellie Pavlick, Johan Bos, Malvina Nissim, Charley Beller, Benjamin Van Durme, and Chris Callison-Burch. 2015. Adding semantics to data-driven paraphrasing. In Proceedings of ACL 2015 (Volume 1: Long Papers), pages 1512–1522.
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.
Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of EMNLP 2014, pages 1532–1543.
Sebastian Riedel, Limin Yao, Andrew McCallum, and Benjamin M. Marlin. 2013. Relation extraction with matrix factorization and universal schemas. In Proceedings of NAACL-HLT 2013, pages 74–84.
Stephen Roller, Katrin Erk, and Gemma Boleda. 2014. Inclusive yet selective: Supervised distributional hypernymy detection. In Proceedings of COLING 2014, pages 1025–1036.
Enrico Santus, Qin Lu, Alessandro Lenci, and Churen Huang. 2014. Unsupervised antonym-synonym discrimination in vector space.
Enrico Santus, Frances Yung, Alessandro Lenci, and Chu-Ren Huang. 2015. EVALution 1.0: An evolving semantic dataset for training and evaluation of distributional semantic models. In Proceedings of the 4th Workshop on Linked Data in Linguistics: Resources and Applications, pages 64–69.
Enrico Santus, Alessandro Lenci, Tin-Shing Chiu, Qin Lu, and Chu-Ren Huang. 2016. Nine features in a random forest to learn taxonomical semantic relations. In LREC.
Vered Shwartz, Yoav Goldberg, and Ido Dagan. 2016. Improving hypernymy detection with an integrated path-based and distributional method. In Proceedings of ACL 2016 (Volume 1: Long Papers), pages 2389–2398.
Rion Snow, Daniel Jurafsky, and Andrew Y Ng. 2004. Learning syntactic patterns for automatic hypernym discovery. In NIPS.

Chapter 6

Acquiring Predicate Paraphrases from News Tweets

Acquiring Predicate Paraphrases from News Tweets

Vered Shwartz    Gabriel Stanovsky    Ido Dagan
Computer Science Department, Bar-Ilan University, Ramat-Gan, Israel
{vered1986, gabriel.satanovsky}@gmail.com, [email protected]

Abstract

We present a simple method for ever-growing extraction of predicate paraphrases from news headlines in Twitter. Analysis of the output of ten weeks of collection shows that the accuracy of paraphrases with different support levels is estimated between 60-86%. We also demonstrate that our resource is to a large extent complementary to existing resources, providing many novel paraphrases. Our resource is publicly available, continuously expanding based on daily news.

[a]0 introduce [a]1            [a]0 welcome [a]1
[a]0 appoint [a]1              [a]0 to become [a]1
[a]0 die at [a]1               [a]0 pass away at [a]1
[a]0 hit [a]1                  [a]0 sink to [a]1
[a]0 be investigate [a]1       [a]0 be probe [a]1
[a]0 eliminate [a]1            [a]0 slash [a]1
[a]0 announce [a]1             [a]0 unveil [a]1
[a]0 quit after [a]1           [a]0 resign after [a]1
[a]0 announce as [a]1          [a]0 to become [a]1
[a]0 threaten [a]1             [a]0 warn [a]1
[a]0 die at [a]1               [a]0 live until [a]1
[a]0 double down on [a]1       [a]0 stand by [a]1
[a]0 kill [a]1                 [a]0 shoot [a]1
[a]0 approve [a]1              [a]0 pass [a]1
[a]0 would be cut under [a]1   [a]1 slash [a]0
seize [a]0 at [a]1             to grab [a]0 at [a]1

Table 1: A sample from the top-ranked predicate paraphrases.

1 Introduction

Recognizing that various textual descriptions across multiple texts refer to the same event or action can benefit NLP applications such as recognizing textual entailment (Dagan et al., 2013) and question answering. For example, to answer "when did the US Supreme Court approve same-sex marriage?" given the text "In June 2015, the Supreme Court ruled for same-sex marriage", approve and ruled for should be identified as describing the same action.

To that end, much effort has been devoted to identifying predicate paraphrases, some of which resulted in releasing resources of predicate entailment or paraphrases. Two main approaches were proposed for that matter; the first leverages the similarity in argument distribution across a large corpus between two predicates (e.g. [a]0 buy [a]1 / [a]0 acquire [a]1) (Lin and Pantel, 2001; Berant et al., 2010). The second approach exploits bilingual parallel corpora, extracting as paraphrases pairs of texts that were translated identically to foreign languages (Ganitkevitch et al., 2013). While these methods have produced exhaustive resources which are broadly used by applications, their accuracy is limited. Specifically, the first approach may extract antonyms, that also have similar argument distribution (e.g. [a]0 raise to [a]1 / [a]0 fall to [a]1) while the second may conflate multiple senses of the foreign phrase.

A third approach was proposed to harvest paraphrases from multiple mentions of the same event in news articles.1 This approach assumes that various redundant reports make different lexical choices to describe the same event. Although there has been some work following this approach (e.g. Shinyama et al., 2002; Shinyama and Sekine, 2006; Roth and Frank, 2012; Zhang and Weld, 2013), it was less exhaustively investigated and did not result in creating paraphrase resources.

In this paper we present a novel unsupervised method for ever-growing extraction of lexically-divergent predicate paraphrase pairs from news tweets. We apply our methodology to create a resource of predicate paraphrases, exemplified in Table 1.

1 This corresponds to instances of event coreference (Bagga and Baldwin, 1999).

Analysis of the resource obtained after ten weeks of acquisition shows that the set of paraphrases reaches accuracy of 60-86% at different levels of support. Comparison to existing resources shows that, even as our resource is still smaller in orders of magnitude from existing resources, it complements them with non-consecutive predicates (e.g. take [a]0 from [a]1) and paraphrases which are highly context specific.

The resource and the source code are available at http://github.com/vered1986/Chirps.2 As of the end of May 2017, it contains 456,221 predicate pairs in 1,239,463 different contexts. Our resource is ever-growing and is expected to contain around 2 million predicate paraphrases within a year. Until it reaches a large enough size, we will release a daily update, and at a later stage, we plan to release a periodic update.

2 Background

2.1 Existing Paraphrase Resources

A prominent approach to acquire predicate paraphrases is to compare the distribution of their arguments across a corpus, as an extension to the distributional hypothesis (Harris, 1954). DIRT (Lin and Pantel, 2001) is a resource of 10 million paraphrases, in which the similarity between predicate pairs is estimated by the geometric mean of the similarities of their argument slots. Berant (2012) constructed an entailment graph of distributionally similar predicates by enforcing transitivity constraints and applying global optimization, releasing 52 million directional entailment rules (e.g. [a]0 shoot [a]1 → [a]0 kill [a]1).

A second notable source for extracting paraphrases is multiple translations of the same text (Barzilay and McKeown, 2001). The Paraphrase Database (PPDB) (Ganitkevitch et al., 2013; Pavlick et al., 2015) is a huge collection of paraphrases extracted from bilingual parallel corpora. Paraphrases are scored heuristically, and the database is available for download in six increasingly large sizes according to scores (the smallest size being the most accurate). In addition to lexical paraphrases, PPDB also consists of 140 million syntactic paraphrases, some of which include predicates with non-terminals as arguments.

2.2 Using Multiple Event Descriptions

Another line of work extracts paraphrases from redundant comparable news articles (e.g. Shinyama et al., 2002; Barzilay and Lee, 2003). The assumption is that multiple news articles describing the same event use various lexical choices, providing a good source for paraphrases. Heuristics are applied to recognize that two news articles discuss the same event, such as lexical overlap and same publish date (Shinyama and Sekine, 2006). Given such a pair of articles, it is likely that predicates connecting the same arguments will be paraphrases, as in the following example:

1. GOP lawmakers introduce new health care plan
2. GOP lawmakers unveil new health care plan

Zhang and Weld (2013) and Zhang et al. (2015) introduced methods that leverage parallel news streams to cluster predicates by meaning, using temporal constraints. Since this approach acquires paraphrases from descriptions of the same event, it is potentially more accurate than methods that acquire paraphrases from the entire corpus or translation phrase table. However, there is currently no paraphrase resource acquired in this approach.3

Finally, Xu et al. (2014) developed a supervised model to collect sentential paraphrases from Twitter. They used Twitter's trending topic service, and considered two tweets from the same topic as paraphrases if they shared a single anchor word.

3 Resource Construction

We present a methodology to automatically collect binary verbal predicate paraphrases from Twitter. We first obtain news related tweets (§3.1) from which we extract propositions (§3.2). For a candidate pair of propositions, we assume that if both arguments can be matched then the predicates are likely paraphrases (§3.3). Finally, we rank the predicate pairs according to the number of instances in which they were aligned (§3.4).

3.1 Obtaining News Headlines

We use Twitter as a source of readily available news headlines. The 140 characters limit makes tweets concise, informative and independent of each other, obviating the need to resolve document-level entity coreference. We query the Twitter Search API4 via Twitter Search.5 We use Twitter's news filter that retrieves tweets containing links to news websites, and limit the search to English tweets.

2 Chirp is a paraphrase of tweet.
3 Zhang and Weld (2013) released a small collection of 10k predicate paraphrase clusters (with average cluster size of 2.4) produced by the system.
4 https://apps.twitter.com/
5 https://github.com/ckoepp/TwitterSearch

Tweet 1: "Turkey intercepts the plane which took off from Moscow" → (1) [Turkey]0 intercepts [plane]1; (2) [plane]0 took off from [Moscow]1
Tweet 2: "Russia, furious about the plane, threatens to retaliate" → [Russia]0 threatens to [retaliate]1

Figure 1: PropS structures and the corresponding propositions extracted by our process. Left: multi-word predicates and multiple extractions per tweet. Right: argument reduction.

Tweet | Predicate type | Arguments
Manafort hid payments from party with Moscow ties | [a]0 hide [a]1 | Paul Manafort; payments
Manafort laundered the payments through Belize | [a]0 launder [a]1 | Manafort; payments
Send immigration judges to cities to speed up deportations | to send [a]0 to [a]1 | immigration judges; cities
Immigration judges headed to 12 cities to speed up deportations | [a]0 headed to [a]1 | immigration judges; 12 cities
Table 2: Examples of predicate paraphrase instances in our resource: each instance contains two tweets, the predicate types extracted from them, and the instantiations of the arguments.

Twitter's news filter that retrieves tweets containing links to news websites, and limit the search to English tweets. Looser matching means partial token matching or WordNet synonyms.7 Table 2 shows examples of predicate paraphrase instances in the resource.

3.4 Resource Release 3.2 Proposition Extraction The resource release consists of two files: We extract propositions from news tweets using PropS (Stanovsky et al., 2016), which simplifies 1. Instances: the specific contexts in which the dependency trees by conveniently marking a wide predicates are paraphrases (as in Table2). In range of predicates (e.g, verbal, adjectival, non- practice, to comply with Twitter policy, we lexical) and positioning them as direct heads of release predicate paraphrase pair types along their corresponding arguments. Specifically, we with their arguments and tweet IDs, and pro- run PropS over dependency trees predicted by vide a script for downloading the full texts. 6 spaCy and extract predicate types (as in Table1) 2. Types: predicate paraphrase pair types (as in composed of verbal predicates, datives, preposi- Table1). The types are ranked in a descend- tions, and auxiliaries. ing order according to a heuristic accuracy Finally, we employ a pre-trained argument re- score: duction model to remove non-restrictive argument d modifications (Stanovsky and Dagan, 2016). This s = count 1 + · N is essential for our subsequent alignment step, as   it is likely that short and concise phrases will tend where count is the number of instances in to match more frequently in comparison to longer, which the predicate types were aligned (Sec- more specific arguments. Figure1 exemplifies tion 3.3), d is the number of different days in some of the phenomena handled by this process, which they were aligned, and N is the num- along with the automatically predicted output. ber of days since the resource collection be- gun. 3.3 Generating Paraphrase Instances Taking into account the number of different days in which predicates were aligned re- Following the assumption that different descrip- duces the noise caused by two entities that tions of the same event are bound to be redun- undergo two different actions on the same dant (as discussed in Section 2.2), we consider day. For example, the following tweets from two predicates as paraphrases if: (1) They appear the day of Chuck Berry’s death: on the same day, and (2) Each of their arguments aligns with a unique argument in the other predi- 1. Last year when Chuck Berry turned 90 cate, either by strict matching (short edit distance, 2. Chuck Berry dies at 90 abbreviations, etc.) or by a looser matching (par- 7In practice, our publicly available code requires that at 6https://spacy.io least one pair of arguments will strictly match.
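The ranking heuristic above amounts to a one-line formula, s = count · (1 + d/N). The sketch below is only an illustration of that computation, assuming instances have already been grouped into paraphrase-pair types; the class and function names are ours, not those of the released Chirps code.

```python
from dataclasses import dataclass

@dataclass
class ParaphraseType:
    predicate_1: str    # e.g. "[a]0 introduce [a]1"
    predicate_2: str    # e.g. "[a]0 unveil [a]1"
    count: int          # number of instances in which the predicates were aligned
    distinct_days: int  # number of different days in which they were aligned

def heuristic_score(ptype: ParaphraseType, days_since_start: int) -> float:
    """s = count * (1 + d / N): frequent types that recur on many different
    days are preferred over pairs that co-occur only in one day's news cycle."""
    return ptype.count * (1 + ptype.distinct_days / days_since_start)

# A type aligned 12 times over 4 distinct days, 70 days into the collection:
example = ParaphraseType("[a]0 introduce [a]1", "[a]0 unveil [a]1", count=12, distinct_days=4)
print(heuristic_score(example, days_since_start=70))  # 12 * (1 + 4/70) ≈ 12.69
```

Weighting by the number of distinct days is what demotes pairs such as the Chuck Berry example, which align many times on a single day but never recur on other days.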

Figure 2: Resource statistics after ten weeks of collection. (a) Estimated accuracy (%) and number of types (×1K) of predicate pairs with at least 5 instances, in different score bins. (b) Estimated accuracy (%), number of instances (×10K) and number of types (×10K) in the first 10 weeks.

yield the incorrect type [a]0 turn [a]1 / [a]0 4.1 Quality of Extractions and Ranking die at [a]1. While there may be several oc- To evaluate the resource accuracy, and follow- currences of that type on the same day, it is ing the instance-based evaluation scheme, we only not expected to re-occur in other news events considered paraphrases that occurred in at least 5 (in different days), yielding a low accuracy instances (which currently constitute 10% of the score. paraphrase types). We partition the types into four 4 Analysis of Resource Quality increasingly large bins according to their scores (the smallest bin being the most accurate), simi- We estimate the quality of the resource obtained larly to PPDB (Ganitkevitch et al., 2013), and an- after ten weeks of collection by annotating a sam- notate a sample of 50 types from each bin. Fig- ple of the extracted paraphrases. ure 2(a) shows that the frequent types achieve up The annotation task was carried out in Ama- to 86% accuracy. 8 zon Mechanical Turk. To ensure the quality of The accuracy expectedly increases with the workers, we applied a qualification test and re- score, except for the lowest-score bin ((0, 10]) quired a 99% approval rate for at least 1,000 prior which is more accurate than the next one tasks. We assigned each annotation to 3 workers ((10, 20]). At the current stage of the resource and used the majority vote to determine the cor- there is a long tail of paraphrases that appeared rectness of paraphrases. few times. While many of them are incorrect, We followed a similar approach to instance- there are also true paraphrases that are infrequent based evaluation (Szpektor et al., 2007), and let and therefore have a low accuracy score. We ex- workers judge the correctness of a predicate pair pect that some of these paraphrases will occur (e.g. [a]0 purchase [a]1/[a]0 acquire [a]1) through again in the future and their accuracy score will 5 different instances (e.g. Intel purchased Mobil- be strengthened. eye/Intel acquired Mobileye). We considered the type as correct if at least one of its instance-pairs 4.2 Size and Accuracy Over Time were judged as correct. The idea that lies behind To estimate future usefulness, Figure 2(b) plots the this type of evaluation is that predicate pairs are resource size (in terms of types and instances) and difficult to judge out-of-context. estimated accuracy through each week in the first Differently from Szpektor et al.(2007), we used 10 weeks of collection. the instances in which the paraphrases appeared The accuracy at a specific time was estimated originally, as those are available in the resource. by annotating a sample of 50 predicate pair types 8https://www.mturk.com/mturk with accuracy score 20 in the resource obtained ≥

158 drag [a]0 from [a]1 [a]0 remove from [a]1 leave this evaluation to future work. leak [a]0 to [a]1 to share [a]0 with [a]1 oust [a]0 from [a]1 [a]0 be force out at [a]1 To demonstrate the added value of our resource, reveal [a]0 to [a]1 share [a]0 with [a]1 we show that even in its current size, it already [a]0 add [a]1 [a]0 beef up [a]1 [a]0 admit to [a]1 [a]0 will attend [a]1 contains accurate predicate pairs which are absent [a]0 announce as [a]1 [a]0 to become [a]1 from the existing resources. Rather than compar- [a]0 arrest in [a]1 [a]0 charge in [a]1 ing against labeled data, we use types with score [a]0 attack [a]1 [a]0 clash with [a]1 50 from our resource (1,778 pairs), which were [a]0 be force out at [a]1 [a]0 have be fire from [a]1 ≥ [a]0 eliminate [a]1 [a]0 slash [a]1 assessed as accurate (Section 4.2). [a]0 face [a]1 [a]0 hit with [a]1 [a]0 mock [a]1 [a]0 troll [a]1 We checked whether these predicate pairs are [a]0 open up about [a]1 [a]0 reveal [a]1 covered by Berant and PPDB. To eliminate di- [a]0 get [a]1 [a]0 sentence to [a]1 rectionality, we looked for types in both directions, Table 3: A sample of types from our resource that i.e. for a predicate pair (p1, p2) we searched for are not found in Berant or in PPDB. both (p1, p2) and (p2, p1). Overall, we found that 67% of these types do not exist in Berant, 62% in PPDB, and 49% in neither. at that time, which roughly correspond to the top Table3 exemplifies some of the predicate pairs ranked 1.5% types. that do not exist in both resources. Specifically, Figure 2(b) demonstrates that these types main- our resource contains many non-consecutive pred- tain a level of around 80% in accuracy. The re- icates (e.g. reveal [a]0 to [a]1 / share [a]0 with [a]1) source growth rate (i.e. the number of new types) that by definition do not exist in Berant. is expected to change with time. We predict that Some pairs, such as [a]0 get [a]1 / [a]0 sentence the resource will contain around 2 million types in to [a]1, are context-specific, occurring when [a]0 9 one year. is a person and [a]1 is the time they are about to serve in prison. Given that get has a broad distri- 5 Comparison to Existing Resources bution of argument instantiations, this paraphrase and similar paraphrases are less likely to exist in The resources which are most similar to ours are resources that rely on the distribution of arguments Berant (Berant, 2012), a resource of predicate in the entire corpus. entailments, and PPDB (Pavlick et al., 2015), a re- source of paraphrases, both described in Section2. 6 Conclusion We expect our resource to be more accurate than resources which are based on the distributional ap- We presented a new unsupervised method to ac- proach (Berant, 2012; Lin and Pantel, 2001). In quire fairly accurate predicate paraphrases from addition, in comparison to PPDB, we specialize on news tweets discussing the same event. We re- binary verbal predicates, and apply an additional lease a growing resource of predicate paraphrases. phase of proposition extraction, handling various Qualitative analysis shows that our resource adds phenomena such as non-consecutive particles and value over existing resources. In the future, when minimality of arguments. 
the resource is comparable in size to the existing Berant(2012) evaluated their resource against resources, we plan to evaluate its intrinsic accu- a dataset of predicate entailments (Zeichner et al., racy on annotated test sets, as well as its extrinsic 2012), using a recall-precision curve to show the benefits in downstream NLP applications. performance obtained with a range of thresholds on the resource score. This kind of evaluation is less suitable for our resource; first, predicate en- Acknowledgments tailment is directional, causing paraphrases with the wrong entailment direction to be labeled neg- This work was partially supported by an Intel ative in the dataset. Second, since our resource is ICRI-CI grant, the Israel Science Foundation grant still relatively small, it is unlikely to have sufficient 880/12, and the German Research Foundation coverage of the dataset at that point. We therefore through the German-Israeli Project Cooperation (DIP, grant DA 1600/1-1). 9For up-to-date resource statistics, see: https://github. com/vered1986/Chirps/tree/master/resource.
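The direction-agnostic coverage check can be expressed in a few lines; the sketch below uses toy stand-ins rather than the actual Berant or PPDB files, and the function name is ours.

```python
def coverage(our_types, other_resource):
    """Fraction of our predicate-pair types found in another resource,
    looking up both (p1, p2) and (p2, p1) to eliminate directionality."""
    covered = sum(
        1 for (p1, p2) in our_types
        if (p1, p2) in other_resource or (p2, p1) in other_resource
    )
    return covered / len(our_types)

# Toy example:
ours = [("[a]0 shoot [a]1", "[a]0 kill [a]1"), ("[a]0 mock [a]1", "[a]0 troll [a]1")]
other = {("[a]0 kill [a]1", "[a]0 shoot [a]1")}
print(coverage(ours, other))  # 0.5 -- the first pair is covered in the reverse direction
```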

159 References texts: A new corpus for a new task. In *SEM 2012: The First Joint Conference on Lexical and Compu- Amit Bagga and Breck Baldwin. 1999. Cross- tational Semantics – Volume 1: Proceedings of the document event coreference: Annotations, experi- main conference and the shared task, and Volume ments, and observations. In Workshop on Corefer- 2: Proceedings of the Sixth International Workshop ence and its Applications. on Semantic Evaluation (SemEval 2012). Associa- Regina Barzilay and Lillian Lee. 2003. Learn- tion for Computational Linguistics, pages 218–227. ing to paraphrase: An unsupervised approach us- http://aclweb.org/anthology/S12-1030. ing multiple-sequence alignment. In Proceed- Yusuke Shinyama and Satoshi Sekine. 2006. Preemp- ings of the 2003 Human Language Technol- tive information extraction using unrestricted rela- ogy Conference of the North American Chapter tion discovery. In Proceedings of the Human Lan- of the Association for Computational Linguistics. guage Technology Conference of the NAACL, Main http://aclweb.org/anthology/N03-1003. Conference. http://aclweb.org/anthology/N06-1039. Regina Barzilay and R. Kathleen McKeown. 2001. Yusuke Shinyama, Satoshi Sekine, and Kiyoshi Sudo. Extracting paraphrases from a parallel corpus. 2002. Automatic paraphrase acquisition from news In Proceedings of the 39th Annual Meeting of articles. In Proceedings of the second interna- the Association for Computational Linguistics. tional conference on Human Language Technology http://aclweb.org/anthology/P01-1008. Research. Morgan Kaufmann Publishers Inc., pages 313–318. Jonathan Berant. 2012. Global Learning of Textual En- tailment Graphs. Ph.D. thesis, Tel Aviv University. Gabriel Stanovsky and Ido Dagan. 2016. Annotating and predicting non-restrictive noun phrase modifica- Jonathan Berant, Ido Dagan, and Jacob Goldberger. tions. In Proceedings of the 54rd Annual Meeting of 2010. Global learning of focused entailment graphs. the Association for Computational Linguistics (ACL In Proceedings of the 48th Annual Meeting of the 2016). Association for Computational Linguistics. Associ- ation for Computational Linguistics, pages 1220– Gabriel Stanovsky, Jessica Ficler, Ido Dagan, and 1229. http://aclweb.org/anthology/P10-1124. Yoav Goldberg. 2016. Getting more out of syntax with props. CoRR abs/1603.01648. Ido Dagan, Dan Roth, and Mark Sammons. 2013. Rec- http://arxiv.org/abs/1603.01648. ognizing textual entailment . Idan Szpektor, Eyal Shnarch, and Ido Dagan. 2007. Juri Ganitkevitch, Benjamin Van Durme, and Chris Instance-based evaluation of entailment rule acqui- Callison-Burch. 2013. PPDB: The paraphrase sition. In Proceedings of the 45th Annual Meeting database. In Proceedings of the 2013 Con- of the Association of Computational Linguistics. As- ference of the North American Chapter of the sociation for Computational Linguistics, pages 456– Association for Computational Linguistics: Hu- 463. http://aclweb.org/anthology/P07-1058. man Language Technologies. Association for Computational Linguistics, pages 758–764. Wei Xu, Alan Ritter, Chris Callison-Burch, William B http://aclweb.org/anthology/N13-1092. Dolan, and Yangfeng Ji. 2014. Extracting lexi- cally divergent paraphrases from twitter. Transac- Zellig S Harris. 1954. Distributional structure. Word tions of the Association for Computational Linguis- 10(2-3):146–162. tics 2:435–448. Dekang Lin and Patrick Pantel. 2001. Dirt – Discovery Naomi Zeichner, Jonathan Berant, and Ido Da- of inference rules from text. 
In Proceedings of the gan. 2012. Crowdsourcing inference-rule evalua- seventh ACM SIGKDD international conference on tion. In Proceedings of the 50th Annual Meet- Knowledge discovery and data mining. ACM, pages ing of the Association for Computational Lin- 323–328. guistics (Volume 2: Short Papers). Association for Computational Linguistics, pages 156–160. Ellie Pavlick, Pushpendre Rastogi, Juri Ganitke- http://aclweb.org/anthology/P12-2031. vitch, Benjamin Van Durme, and Chris Callison- Burch. 2015. PPDB 2.0: Better paraphrase rank- Congle Zhang, Stephen Soderland, and Daniel S Weld. ing, fine-grained entailment relations, word em- 2015. Exploiting parallel news streams for unsuper- beddings, and style classification. In Proceed- vised event extraction. Transactions of the Associa- ings of the 53rd Annual Meeting of the Associa- tion for Computational Linguistics 3:117–129. tion for Computational Linguistics and the 7th In- ternational Joint Conference on Natural Language Congle Zhang and Daniel S Weld. 2013. Harvest- Processing (Volume 2: Short Papers). Associa- ing parallel news streams to generate paraphrases of tion for Computational Linguistics, pages 425–430. event relations. In Proceedings of the 2013 Con- https://doi.org/10.3115/v1/P15-2070. ference on Empirical Methods in Natural Language Processing. Seattle, Washington, USA, pages 1776– Michael Roth and Anette Frank. 2012. Aligning predi- 1786. cate argument structures in monolingual comparable

Chapter 7

Olive Oil Is Made of Olives, Baby Oil Is Made for Babies: Interpreting Noun Compounds Using Paraphrases in a Neural Model

Olive Oil is Made of Olives, Baby Oil is Made for Babies: Interpreting Noun Compounds using Paraphrases in a Neural Model

Vered Shwartz∗
Computer Science Department, Bar-Ilan University, Ramat-Gan, Israel

Chris Waterson
Google Inc., 1600 Amphitheatre Parkway, Mountain View, California 94043

Abstract yielded promising results, recently, Dima(2016) showed that similar performance is achieved by Automatic interpretation of the relation representing the NC as a concatenation of its con- between the constituents of a noun com- stituent embeddings, and attributed it to the lexical pound, e.g. olive oil (source) and baby memorization phenomenon (Levy et al., 2015). oil (purpose) is an important task for many In this paper we apply lessons learned from NLP applications. Recent approaches are the parallel task of semantic relation classifica- typically based on either noun-compound tion. We adapt HypeNET (Shwartz et al., 2016) representations or paraphrases. While the to the NC classification task, using their path em- former has initially shown promising re- beddings to represent paraphrases and combining sults, recent work suggests that the suc- with distributional information. We experiment cess stems from memorizing single pro- with various evaluation settings, including settings totypical words for each relation. We ex- that make lexical memorization impossible. In plore a neural paraphrasing approach that these settings, the integrated method performs bet- demonstrates superior performance when ter than the baselines. Even so, the performance is such memorization is not possible. mediocre for all methods, suggesting that the task 1 1 Introduction is difficult and warrants further investigation.

Automatic classification of a noun-compound 2 Background (NC) to the implicit semantic relation that holds between its constituent words is beneficial for ap- Various tasks have been suggested to address plications that require text understanding. For in- noun-compound interpretation. NC paraphrasing stance, a personal assistant asked “do I have a extracts texts explicitly describing the implicit re- morning meeting tomorrow?” should search the lation between the constituents, for example stu- calendar for meetings occurring in the morning, dent protest is a protest LEDBY, BESPONSORED while for group meeting it should look for meet- BY, or BEORGANIZEDBY students (e.g. Nakov ings with specific participants. The NC classifica- and Hearst, 2006; Kim and Nakov, 2011; Hen- tion task is a challenging one, as the meaning of an drickx et al., 2013; Nulty and Costello, 2013). NC is often not easily derivable from the meaning Compositionality prediction determines to what of its constituent words (Sparck¨ Jones, 1983). extent the meaning of the NC can be expressed Previous work on the task falls into two main in terms of the meaning of its constituents, e.g. approaches. The first maps NCs to paraphrases spelling bee is non-compositional, as it is not re- that express the relation between the constituent lated to bee (e.g. Reddy et al., 2011). In this paper words (e.g. Nakov and Hearst, 2006; Nulty and we focus on the NC classification task, which is Costello, 2013), such as mapping coffee cup and defined as follows: given a pre-defined set of re- garbage dump to the pattern [w1] CONTAINS [w2]. lations, classify nc = w1w2 to the relation that The second approach computes a representation holds between w1 and w2. We review the various for NCs from the distributional representation of their individual constituents. While this approach 1The code is available at https://github.com/ tensorflow/models/tree/master/research/ ∗Work done during an internship at Google. lexnet_nc.

218 Proceedings of NAACL-HLT 2018, pages 218–224 New Orleans, Louisiana, June 1 - 6, 2018. c 2018 Association for Computational Linguistics features used in the literature for classification.2 may share the feature MADEOF.S eaghdha´ and Copestake(2013) leveraged this “relational sim- 2.1 Compositional Representations ilarity” in a kernel-based classification approach. In this approach, classification is based on a vector They combined the relational information with representing the NC (w1 w2), which is obtained the complementary lexical features of each con- by applying a function to its constituents’ distri- stituent separately. Two NCs labeled to the same butional representations: ~v ,~v n. Various relation may consist of similar constituents (pa- w1 w2 ∈ R functions have been proposed in the literature. per-steel, cup-knife) and may also appear with Mitchell and Lapata(2010) proposed 3 simple similar paraphrases. Combining the two informa- tion sources has shown to be beneficial, but it was combinations of ~vw1 and ~vw2 (additive, multiplica- tive, dilation). Others suggested to represent com- also noted that the relational information suffered positions by applying linear functions, encoded as from data sparsity: many NCs had very few para- matrices, over word vectors. Baroni and Zam- phrases, and paraphrase similarity was based on parelli(2010) focused on adjective-noun composi- ngram overlap. tions (AN) and represented adjectives as matrices, Recently, Surtani and Paul(2015) suggested to nouns as vectors, and ANs as their multiplication. represent NCs in a vector space model (VSM) us- Matrices were learned with the objective of min- ing paraphrases as features. These vectors were imizing the distance between the learned vector used to classify new NCs based on the nearest and the observed vector (computed from corpus neighbor in the VSM. However, the model was occurrences) of each AN. The full-additive model only tested on a small dataset and performed sim- (Zanzotto et al., 2010; Dinu et al., 2013) is a sim- ilarly to previous methods. ilar approach that works on any two-word compo- sition, multiplying each word by a square matrix: 3 Model nc = A ~v + B ~v . · w1 · w2 Socher et al.(2012) suggested a non-linear We similarly investigate the use of paraphrasing composition model. A recursive neural network for NC relation classification. To generate a signal w w operates bottom-up on the output of a constituency for the joint occurrences of 1 and 2, we follow parser to represent variable-length phrases. Each the approach used by HypeNET (Shwartz et al., w w constituent is represented by a vector that captures 2016). For an 1 2 in the dataset, we collect all w w its meaning and a matrix that captures how it mod- the dependency paths that connect 1 and 2 in ifies the meaning of constituents that it combines the corpus, and learn path embeddings as detailed with. For a binary NC, nc = g(W [~v ;~v ]), in Section 3.2. Section 3.1 describes the classifi- · w1 w2 where W 2n n and g is a non-linear function. cation models with which we experimented. ∈ R × These representations were used as features in 3.1 Classification Models NC classification, often achieving promising re- sults (e.g. Van de Cruys et al., 2013; Dima and Figure1 provides an overview of the models: Hinrichs, 2015). 
However, Dima(2016) recently path-based, integrated, and integrated-NC, each showed that similar performance is achieved by which incrementally adds new features not present representing the NC as a concatenation of its con- in the previous model. In the following sections, ~x stituent embeddings, and argued that it stems from denotes the input vector representing the NC. The memorizing prototypical words for each relation. network classifies NC to the highest scoring rela- For example, classifying any NC with the head oil tion: r = argmaxi softmax(~o)i, where ~o is the to the SOURCE relation, regardless of the modifier. output layer. All networks contain a single hidden x layer whose dimension is |2| . k is the number of 2.2 Paraphrasing relations in the dataset. See AppendixA for addi- In this approach, the paraphrases of an NC, i.e. tional technical details. the patterns connecting the joint occurrences of Path-based. Classifies the NC based only on the the constituents in a corpus, are treated as fea- paths connecting the joint occurrences of w1 and tures. For example, both paper cup and steel knife w2 in the corpus, denoted P (w1, w2). We define 2Leaving out features derived from lexical resources (e.g. the feature vector as the average of its path embed- Nastase and Szpakowicz, 2003; Tratz and Hovy, 2010). dings, where the path embedding ~p of a path p is


Figure 1: An illustration of the classification models for the NC coffee cup. The model consists of two parts: (1) the distributional representations of the NC (left, orange) and of each word (middle, green); (2) the corpus occurrences of coffee and cup, in the form of dependency path embeddings (right, purple), combined by mean pooling. Example paths shown in the figure include "cup of coffee", "cup containing coffee", and "coffee in cup".
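The "mean pooling" step in the figure is a frequency-weighted average of the path embeddings of (w1, w2), as defined below. The toy sketch here illustrates only that averaging; the paths, counts and three-dimensional vectors are invented for the example, and this is not the actual HypeNET/LexNET code.

```python
import numpy as np

def path_based_vector(path_embeddings, path_counts):
    """x = (sum_p f_p * p) / (sum_p f_p): the frequency-weighted average of
    the embeddings of all paths connecting w1 and w2 in the corpus."""
    total = sum(path_counts.values())
    return sum(path_counts[p] * vec for p, vec in path_embeddings.items()) / total

# Two paths observed for (coffee, cup), with made-up embeddings and counts:
paths = {"[w2] of [w1]": np.array([1.0, 0.0, 0.0]),
         "[w2] containing [w1]": np.array([0.0, 1.0, 0.0])}
counts = {"[w2] of [w1]": 3, "[w2] containing [w1]": 1}
print(path_based_vector(paths, counts))  # [0.75 0.25 0.  ]
```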

weighted by its frequency fp,(w1,w2): We use the NC labels as distant supervision. While HypeNET predicts a word pair’s label from f ~p p P (w1,w2) p,(w1,w2) ~x = ~v = ∈ · the frequency-weighted average of the path vec- P (w1,w2) f p P (w1,w2) p,(w1,w2) tors, we differ from it slightly and compute the P ∈ label from the frequency-weighted average of the P Integrated. We concatenate w1 and w2’s word predictions obtained from each path separately: embeddings to the path vector, to add distribu- f softmax(~p) tional information: x = [~v ,~v ,~v ]. p P (w1,w2) p,(w1,w2) w1 w2 P (w1,w2) ~o = ∈ · Potentially, this allows the network to utilize f P p P (w1,w2) p,(w1,w2) the contextual properties of each individual con- ∈ r = argmaxPi ~oi stituent, e.g. assigning high probability to SUBSTANCE-MATERIAL-INGREDIENT for edible We conjecture that label distribution averaging al- w1s (e.g. vanilla pudding, apple cake). lows for more efficient training of path embed- Integrated-NC. We add the NC’s observed vec- dings when a single NC contains multiple paths. tor ~vnc as additional distributional input, providing 4 Evaluation the contexts in which w1w2 occur as an NC:

~vnc = [~vw1 ,~vw2 ,~vnc,~vP (w1,w2)]. Like Dima 4.1 Dataset (2016), we learn NC vectors using the GloVe algo- We follow Dima(2016) and evaluate on the Tratz rithm (Pennington et al., 2014), by replacing each (2011) dataset, with 19,158 instances and two lev- NC occurrence in the corpus with a single token. els of labels: fine-grained (Tratz-fine, 37 rela- This information can potentially help clustering tions) and coarse-grained (Tratz-coarse, 12 re- NCs that appear in similar contexts despite having lations). We report results on both versions. See low pairwise similarity scores between their con- Tratz(2011) for the list of relations. stituents. For example, gun violence and abortion rights belong to the TOPIC relation and may ap- Dataset Splits Dima(2016) showed that a clas- pear in similar news-related contexts, while (gun, sifier based only on vw1 and vw2 performs on par abortion) and (violence, rights) are dissimilar. with compound representations, and that the suc- cess comes from lexical memorization (Levy et al., 3.2 Path Embeddings 2015): memorizing the majority label of single Following HypeNET, for a path p composed of words in particular slots of the compound (e.g. edges e1, ..., ek, we represent each edge by the TOPIC for travel guide, fishing guide, etc.). This concatenation of its lemma, part-of-speech tag, memorization paints a skewed picture of the state- dependency label and direction vectors: ~ve = of-the-art performance on this difficult task.

[~vl,pos ~v ,dep ~v , ~vdir]. The edge vectors ~ve1 , ..., ~vek To better test this hypothesis, we evaluate on 4 are encoded using an LSTM (Hochreiter and different splits of the datasets to train, test, and val- Schmidhuber, 1997), and the last output vector ~p idation sets: (1) random, in a 75:20:5 ratio, (2) is used as the path embedding. lexical-full, in which the train, test, and validation

220 Dataset Split Best Freq Dist Dist-NC Best Comp Path Int Int-NC Rand 0.319 0.692 0.673 0.725 0.538 0.714 0.692 Lex 0.222 0.458 0.449 0.450 0.448 0.478 Tratz-fine head 0.510 Lexmod 0.292 0.574 0.559 0.607 0.472 0.613 0.600 Lexfull 0.066 0.363 0.360 0.334 0.423 0.421 0.429 Rand 0.256 0.734 0.718 0.775 0.586 0.736 0.712 Lex 0.225 0.501 0.497 0.538 0.518 0.548 Tratz-coarse head 0.558 Lexmod 0.282 0.630 0.600 0.645 0.548 0.646 0.632 Lexfull 0.136 0.406 0.409 0.372 0.472 0.475 0.478

Table 1: All methods' performance (F1) on the various splits. Best freq: the best-performing frequency baseline (head/modifier);3 best comp: the best model from Dima (2016).

Dataset Split Train Validation Test 2012) and single word embeddings in a fully con- 5 Lexfull 4,730 1,614 869 nected network. Lexhead 9,185 5,819 4,154 TRATZ-FINE Lexmod 9,783 5,400 3,975 Rand 14,369 958 3,831 4.3 Results Lexfull 4,746 1,619 779 Lexhead 9,214 5,613 3,964 Table1 shows the performance of various meth- TRATZ-COARSE Lexmod 9,732 5,402 3,657 Rand 14,093 940 3,758 ods on the datasets. Dima’s (2016) compositional models perform best among the baselines, and on Table 2: Number of instances in each dataset split. the random split, better than all the methods. On sets each consists of a distinct vocabulary. The the lexical splits, however, the baselines exhibit split was suggested by Levy et al.(2015), and it a dramatic drop in performance, and are outper- randomly assigns words to distinct sets, such that formed by our methods. The gap is larger in for example, including travel guide in the train set the lexical-full split. Finally, there is usually no promises that fishing guide would not be included gain from the added NC vector in Dist-NC and in the test set, and the models do not benefit from Integrated-NC. memorizing that the head guide is always anno- tated as TOPIC. Given that the split discards many 5 Analysis NCs, we experimented with two additional splits: Path Embeddings. To focus on the changes (3) lexical-mod split, in which the w1 words are from previous work, we analyze the performance unique in each set, and (4) lexical-head split, in of the path-based model on the Tratz-fine ran- which the w2 words are unique in each set. Ta- dom split. This dataset contains 37 relations ble2 displays the sizes of each split. and the model performance varies across them. Some relations, such as MEASURE and PER- 4.2 Baselines SONAL TITLE yield reasonable performance (F1 Frequency Baselines. mod freq classifies w1w2 score of 0.87 and 0.68). Table3 focuses on these to the most common relation in the train set for relations and illustrates the indicative paths that NCs with the same modifier (w1w20 ), while head the model has learned for each relation. We com- 4 freq considers NCs with the same head (w10 w2). pute these by performing the analysis in Shwartz Distributional Baselines. Ablation of the path- et al.(2016), where each path is fed into the path- based component from our models: Dist uses only based model, and is assigned to its best-scoring relation. For each relation, we consider paths with w1 and w2’s word embeddings: ~x = [~vw1 ,~vw2 ], a score 0.8. while Dist-NC includes also the NC embedding: ≥ Other relations achieve very low F1 scores, in- ~x = [~vw1 ,~vw2 ,~vnc]. The network architecture is defined similarly to our models (Section 3.1). dicating that the model is unable to learn them at all. Interestingly, the four relations with the Compositional Baselines. We re-train Dima’s lowest performance in our model6 are also those (2016) models, various combinations of NC rep- with the highest error rate in Dima(2016), very resentations (Zanzotto et al., 2010; Socher et al., 5We only include the compositional models, and omit the 3In practice, in lexical-full this is a random baseline, in “basic” setting which is similar to our Dist model. For the lexical-head it is the modifier frequency baseline, and in full details of the compositional models, see Dima(2016). lexical-mod it is the head frequency baseline. 6LEXICALIZED, TOPIC OF COGNITION&EMOTION, 4Unseen heads/modifiers are assigned a random relation. WHOLE+ATTRIBUTE&FEAT, PARTIAL ATTR TRANSFER
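A lexical-full split of the kind described above can be built by assigning words, rather than compounds, to the three sets and discarding every compound whose constituents land in different sets. The sketch below is schematic; the ratios and the discarding policy are our assumptions, not the paper's exact split script.

```python
import random

def lexical_full_split(ncs, ratios=(0.75, 0.05, 0.20), seed=0):
    """Assign each constituent word to train/validation/test, then keep only
    noun-compounds whose two constituents fall in the same set, so that no
    word is shared across sets (many compounds are discarded as a result)."""
    rng = random.Random(seed)
    vocab = sorted({w for nc in ncs for w in nc})
    assignment = {w: rng.choices(["train", "val", "test"], weights=ratios)[0] for w in vocab}
    splits = {"train": [], "val": [], "test": []}
    for w1, w2 in ncs:
        if assignment[w1] == assignment[w2]:
            splits[assignment[w1]].append((w1, w2))
    return splits

# With this split, "travel guide" in train guarantees "fishing guide" is not in test.
print(lexical_full_split([("travel", "guide"), ("fishing", "guide"), ("olive", "oil")]))
```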


Relation | Path | Examples
MEASURE 2,560 | [w2] varies by [w1] | state limit, age limit
 | [w1] portion of [w2] | acre estate
PERSONAL TITLE | [w2] Anderson [w1] | Mrs. Brown
 | [w2] Sheridan [w1] | Gen. Johnson
CREATE-PROVIDE-GENERATE-SELL | [w2] produce [w1] | food producer, drug group
 | [w2] selling [w1] | phone company, merchandise store
 | [w2] manufacture [w1] | engine plant, sugar company
TIME-OF1 | [w2] begin [w1] | morning program
 | [w2] held Saturday [w1] | afternoon meeting, morning session
SUBSTANCE-MATERIAL-INGREDIENT | [w2] made of wood and [w1] | marble table, vinyl siding
 | [w2] material includes type of [w1] | steel pipe
Table 3: Indicative paths for selected relations, along with NC examples.

Test NC Most Similar NC NC Label NC Label majority party EQUATIVE minority party WHOLE+PART OR MEMBER OF enforcement director OBJECTIVE enforcement chief PERFORM&ENGAGE IN fire investigator OBJECTIVE fire marshal ORGANIZE&SUPERVISE&AUTHORITY stabilization plan OBJECTIVE stabilization program PERFORM&ENGAGE IN investor sentiment EXPERIENCER-OF-EXPERIENCE market sentiment TOPIC OF COGNITION&EMOTION alliance member WHOLE+PART OR MEMBER OF alliance leader OBJECTIVE Table 4: Example of NCs from the Tratz-fine random split test set, along with the most similar NC in the embeddings, where the two NCs have different labels. likely since they express complex relations. For Tratz-fine test NCs; 3091/3831 (81%) of them example, the LEXICALIZED relation contains non- had embeddings. For each NC, we looked for compositional NCs (soap opera) or lexical items the 10 most similar NC vectors (in terms of co- whose meanings departed from the combination of sine similarity), and compared their labels. We the constituent meanings. It is expected that there have found that only 27.61% of the NCs were are no paths that indicate lexicalization. In PAR- mostly similar to NCs with the same label. The TIAL ATTRIBUTE TRANSFER (bullet train), w1 problem seems to be inconsistency of annotations transfers an attribute to w2 (e.g. bullet transfers rather than low embeddings quality. Table4 dis- speed to train). These relations are not expected plays some examples of NCs from the test set, to be expressed in text, unless the text aims to ex- along with their most similar NC in the embed- plain them (e.g. train as fast as a bullet). dings, where the two NCs have different labels. Looking closer at the model confusions shows 6 Conclusion that it often defaulted to general relations like OB- JECTIVE (recovery plan) or RELATIONAL-NOUN- We used an existing neural dependency path COMPLEMENT (eye shape). The latter is described representation to represent noun-compound para- as “indicating the complement of a relational noun phrases, and along with distributional information (e.g., son of, price of)”, and the indicative paths applied it to the NC classification task. Following for this relation indeed contain many variants of previous work, that suggested that distributional “[w2] of [w1]”, which potentially can occur with methods succeed due to lexical memorization, we NCs in other relations. The model also confused show that when lexical memorization is not pos- between relations with subtle differences, such as sible, the performance of all methods is much the different topic relations. Given that these re- worse. Adding the path-based component helps lations were conflated to a single relation in the mitigate this issue and increase performance. inter-annotator agreement computation in Tratz and Hovy(2010), we can conjecture that even hu- Acknowledgments mans find it difficult to distinguish between them. We would like to thank Marius Pasca, Susanne Riehemann, Colin Evans, Octavian Ganea, and NC Embeddings. To understand why the NC Xiang Li for the fruitful conversations, and Corina embeddings did not contribute to the classifi- Dima for her help in running the compositional cation, we looked into the embeddings of the baselines.
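The nearest-neighbour analysis of the NC embeddings above can be reproduced roughly as follows; the helper is our own sketch (the dictionary-based inputs and the majority criterion over the k neighbours are assumptions about the exact procedure, and the toy labels are invented).

```python
import numpy as np

def label_agreement(nc_vectors, labels, k=10):
    """For each NC, find its k most cosine-similar NCs and check whether the
    majority of those neighbours share its label; return the fraction of NCs
    for which they do."""
    ncs = list(nc_vectors)
    mat = np.stack([nc_vectors[nc] for nc in ncs])
    mat = mat / np.linalg.norm(mat, axis=1, keepdims=True)
    sims = mat @ mat.T
    np.fill_diagonal(sims, -np.inf)          # never count an NC as its own neighbour
    agree = 0
    for i, nc in enumerate(ncs):
        neighbours = np.argsort(-sims[i])[:k]
        same = sum(labels[ncs[j]] == labels[nc] for j in neighbours)
        agree += same > k / 2
    return agree / len(ncs)

# Toy example with invented 2-dimensional "embeddings":
vecs = {"coffee cup": np.array([1.0, 0.0]), "tea cup": np.array([0.9, 0.1]),
        "gun violence": np.array([0.0, 1.0])}
labs = {"coffee cup": "CONTAIN", "tea cup": "CONTAIN", "gun violence": "TOPIC"}
print(label_agreement(vecs, labs, k=1))  # 2 of the 3 NCs agree with their nearest neighbour
```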

222 References Omer Levy, Steffen Remus, Chris Biemann, and Ido Dagan. 2015. Do supervised distribu- Mart´ın Abadi, Ashish Agarwal, Paul Barham, Eugene tional methods really learn lexical inference re- Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, lations? In Proceedings of the 2015 Confer- Andy Davis, Jeffrey Dean, Matthieu Devin, et al. ence of the North American Chapter of the As- 2016. Tensorflow: Large-scale machine learning on sociation for Computational Linguistics: Human heterogeneous distributed systems. arXiv preprint Language Technologies. Association for Computa- arXiv:1603.04467 . tional Linguistics, Denver, Colorado, pages 970– 976. http://www.aclweb.org/anthology/N15-1098. Marco Baroni and Roberto Zamparelli. 2010. Nouns are vectors, adjectives are matrices: Representing Jeff Mitchell and Mirella Lapata. 2010. Composition adjective-noun constructions in semantic space. In in distributional models of semantics. Cognitive sci- Proceedings of the 2010 Conference on Empirical ence 34(8):1388–1429. Methods in Natural Language Processing. Associ- ation for Computational Linguistics, pages 1183– Preslav Nakov and Marti Hearst. 2006. Using verbs to 1193. characterize noun-noun relations. In International Conference on Artificial Intelligence: Methodology, Corina Dima. 2016. Proceedings of the 1st Work- Systems, and Applications. Springer, pages 233– shop on Representation Learning for NLP, As- 244. sociation for Computational Linguistics, chapter On the Compositionality and Semantic Interpreta- Vivi Nastase and Stan Szpakowicz. 2003. Explor- tion of English Noun Compounds, pages 27–39. ing noun-modifier semantic relations. In Fifth in- https://doi.org/10.18653/v1/W16-1604. ternational workshop on computational semantics (IWCS-5). pages 285–301. Corina Dima and Erhard Hinrichs. 2015. Automatic noun compound interpretation using deep neural Paul Nulty and Fintan Costello. 2013. General and networks and word embeddings. IWCS 2015 page specific paraphrases of semantic relations between 173. nouns. Natural Language Engineering 19(03):357– 384. Georgiana Dinu, Nghia The Pham, and Marco Baroni. 2013. General estimation and evaluation of compo- Jeffrey Pennington, Richard Socher, and Christopher sitional distributional semantic models. In Proceed- Manning. 2014. Glove: Global vectors for word ings of the Workshop on Continuous Vector Space representation. In Proceedings of the 2014 Con- Models and their Compositionality. Association for ference on Empirical Methods in Natural Language Computational Linguistics, Sofia, Bulgaria, pages Processing (EMNLP). Association for Computa- 50–58. http://www.aclweb.org/anthology/W13- tional Linguistics, Doha, Qatar, pages 1532–1543. 3206. http://www.aclweb.org/anthology/D14-1162. Siva Reddy, Diana McCarthy, and Suresh Manand- Iris Hendrickx, Zornitsa Kozareva, Preslav Nakov, Di- ´ har. 2011. An empirical study on compositional- armuid OSeaghdha,´ Stan Szpakowicz, and Tony ity in compound nouns. In Proceedings of 5th In- Veale. 2013. Semeval-2013 task 4: Free paraphrases ternational Joint Conference on Natural Language Second Joint Conference of noun compounds. In Processing. Asian Federation of Natural Language on Lexical and Computational Semantics (*SEM), Processing, Chiang Mai, Thailand, pages 210–218. Volume 2: Proceedings of the Seventh International http://www.aclweb.org/anthology/I11-1024. Workshop on Semantic Evaluation (SemEval 2013). Association for Computational Linguistics, pages Diarmuid O Seaghdha´ and Ann Copestake. 2013. 
In- 138–143. http://aclweb.org/anthology/S13-2025. terpreting compound nouns with kernel methods. Natural Language Engineering 19(3):331–356. Sepp Hochreiter and Jurgen¨ Schmidhuber. 1997. Long short-term memory. Neural computation Vered Shwartz, Yoav Goldberg, and Ido Dagan. 2016. 9(8):1735–1780. Improving hypernymy detection with an integrated path-based and distributional method. In Pro- Nam Su Kim and Preslav Nakov. 2011. Large- ceedings of the 54th Annual Meeting of the As- scale noun compound interpretation using boot- sociation for Computational Linguistics (Volume strapping and the web as a corpus. In Pro- 1: Long Papers). Association for Computational ceedings of the 2011 Conference on Empirical Linguistics, Berlin, Germany, pages 2389–2398. Methods in Natural Language Processing. Associa- http://www.aclweb.org/anthology/P16-1226. tion for Computational Linguistics, pages 648–658. http://aclweb.org/anthology/D11-1060. Richard Socher, Brody Huval, D. Christopher Man- ning, and Y. Andrew Ng. 2012. Semantic composi- Diederik Kingma and Jimmy Ba. 2014. Adam: A tionality through recursive matrix-vector spaces. In method for stochastic optimization. arXiv preprint Proceedings of the 2012 Joint Conference on Empir- arXiv:1412.6980 . ical Methods in Natural Language Processing and

223 Computational Natural Language Learning. Asso- A Technical Details ciation for Computational Linguistics, pages 1201– 1211. http://aclweb.org/anthology/D12-1110. To extract paths, we use a concatenation of En- glish Wikipedia and the Gigaword corpus.7 We Karen Sparck¨ Jones. 1983. Compound noun interpreta- tion problems. Technical report, University of Cam- consider sentences with up to 32 words and depen- bridge, Computer Laboratory. dency paths with up to 8 edges, including satel- lites, and keep only 1,000 paths for each noun- Nitesh Surtani and Soma Paul. 2015. A vsm-based sta- compound. We compute the path embeddings in tistical model for the semantic relation interpretation of noun-modifier pairs. In RANLP. pages 636–645. advance for all the paths connecting NCs in the dataset ( 3.2), and then treat them as fixed embed- § Stephen Tratz. 2011. Semantically-enriched parsing dings during classification ( 3.1). for natural language understanding. University of § Southern California. We use TensorFlow (Abadi et al., 2016) to train the models, fixing the values of the hyper- Stephen Tratz and Eduard Hovy. 2010. A taxon- parameters after performing preliminary experi- omy, dataset, and classifier for automatic noun ments on the validation set. We set the mini-batch compound interpretation. In Proceedings of the 48th Annual Meeting of the Association for Com- size to 10, use Adam optimizer (Kingma and Ba, putational Linguistics. Association for Computa- 2014) with the default learning rate, and apply tional Linguistics, Uppsala, Sweden, pages 678– word dropout with probability 0.1. We train up to 687. http://www.aclweb.org/anthology/P10-1070. 30 epochs with early stopping, stopping the train- Tim Van de Cruys, Stergos Afantenos, and Philippe ing when the F1 score on the validation set drops Muller. 2013. Melodi: A supervised distribu- 8 points below the best performing score. tional approach for free paraphrasing of noun com- We initialize the distributional embeddings with pounds. In Second Joint Conference on Lex- ical and Computational Semantics (*SEM), Vol- the 300-dimensional pre-trained GloVe embed- ume 2: Proceedings of the Seventh Interna- dings (Pennington et al., 2014) and the lemma em- tional Workshop on Semantic Evaluation (Se- beddings (for the path-based component) with the mEval 2013). Association for Computational Lin- 50-dimensional ones. Unlike HypeNET, we do guistics, Atlanta, Georgia, USA, pages 144–147. not update the embeddings during training. The http://www.aclweb.org/anthology/S13-2026. lemma, POS, and direction embeddings are initial- Fabio Massimo Zanzotto, Ioannis Korkontzelos, ized randomly and updated during training. NC Francesca Fallucchi, and Suresh Manandhar. 2010. embeddings are learned using a concatenation of Estimating linear models for compositional distribu- tional semantics. In Proceedings of the 23rd Inter- Wikipedia and Gigaword. Similarly to the origi- national Conference on Computational Linguistics. nal GloVe implementation, we only keep the most Association for Computational Linguistics, pages frequent 400,000 vocabulary terms, which means 1263–1271. that roughly 20% of the noun-compounds do not have vectors and are initialized randomly in the model.

7 https://catalog.ldc.upenn.edu/ldc2003t05

Chapter 8

Paraphrase to Explicate: Revealing Implicit Noun-Compound Relations

Paraphrase to Explicate: Revealing Implicit Noun-Compound Relations

Vered Shwartz and Ido Dagan
Computer Science Department, Bar-Ilan University, Ramat-Gan, Israel

Abstract Methods dedicated to paraphrasing noun- compounds usually rely on corpus co-occurrences of the compound’s constituents as a source of ex- Revealing the implicit semantic rela- plicit relation paraphrases (e.g. Wubben, 2010; tion between the constituents of a noun- Versley, 2013). Such methods are unable to gen- compound is important for many NLP ap- eralize for unseen noun-compounds. Yet, most plications. It has been addressed in the noun-compounds are very infrequent in text (Kim literature either as a classification task to and Baldwin, 2007), and humans easily interpret a set of pre-defined relations or by pro- the meaning of a new noun-compound by general- ducing free text paraphrases explicating izing existing knowledge. For example, consider the relations. Most existing paraphras- interpreting parsley cake as a cake made of pars- ing methods lack the ability to generalize, ley vs. resignation cake as a cake eaten to cele- and have a hard time interpreting infre- brate quitting an unpleasant job. quent or new noun-compounds. We pro- We follow the paraphrasing approach and pro- pose a neural model that generalizes better pose a semi-supervised model for paraphras- by representing paraphrases in a contin- ing noun-compounds. Differently from previ- uous space, generalizing for both unseen ous methods, we train the model to predict ei- noun-compounds and rare paraphrases. ther a paraphrase expressing the semantic rela- Our model helps improving performance tion of a noun-compound (predicting ‘[w2] made on both the noun-compound paraphrasing of [w1]’ given ‘apple cake’), or a missing con- and classification tasks. stituent given a combination of paraphrase and noun-compound (predicting ‘apple’ given ‘cake made of [w1]’). Constituents and paraphrase tem- 1 Introduction plates are represented as continuous vectors, and semantically-similar paraphrase templates are em- Noun-compounds hold an implicit semantic rela- bedded in proximity, enabling better generaliza- tion between their constituents. For example, a tion. Interpreting ‘parsley cake’ effectively re- ‘birthday cake’ is a cake eaten on a birthday, while duces to identifying paraphrase templates whose ‘apple cake’ is a cake made of apples. Interpreting “selectional preferences” (Pantel et al., 2007) on noun-compounds by explicating the relationship is each constituent fit ‘parsley’ and ‘cake’. beneficial for many natural language understand- A qualitative analysis of the model shows that ing tasks, especially given the prevalence of noun- the top ranked paraphrases retrieved for each compounds in English (Nakov, 2013). noun-compound are plausible even when the con- The interpretation of noun-compounds has been stituents never co-occur (Section4). We evalu- addressed in the literature either by classifying ate our model on both the paraphrasing and the them to a fixed inventory of ontological relation- classification tasks (Section5). On both tasks, ships (e.g. Nastase and Szpakowicz, 2003) or by the model’s ability to generalize leads to improved generating various free text paraphrases that de- performance in challenging evaluation settings.1 scribe the relation in a more expressive manner (e.g. Hendrickx et al., 2013). 1The code is available at github.com/vered1986/panic

1200 Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Long Papers), pages 1200–1211 Melbourne, Australia, July 15 - 20, 2018. c 2018 Association for Computational Linguistics 2 Background from a corpus to generate a list of candidate para- phrases. Unsupervised methods apply information 2.1 Noun-compound Classification extraction techniques to find and rank the most Noun-compound classification is the task con- meaningful paraphrases (Kim and Nakov, 2011; cerned with automatically determining the seman- Xavier and Lima, 2014; Pasca, 2015; Pavlick tic relation that holds between the constituents of and Pasca, 2017), while supervised approaches a noun-compound, taken from a set of pre-defined learn to rank paraphrases using various features relations. such as co-occurrence counts (Wubben, 2010; Li Early work on the task leveraged information et al., 2010; Surtani et al., 2013; Versley, 2013) derived from lexical resources and corpora (e.g. or the distributional representations of the noun- Girju, 2007; OS´ eaghdha´ and Copestake, 2009; compounds (Van de Cruys et al., 2013). Tratz and Hovy, 2010). More recent work broke One of the challenges of this approach is the the task into two steps: in the first step, a noun- ability to generalize. If one assumes that suffi- compound representation is learned from the dis- cient paraphrases for all noun-compounds appear tributional representation of the constituent words in the corpus, the problem reduces to ranking the (e.g. Mitchell and Lapata, 2010; Zanzotto et al., existing paraphrases. It is more likely, however, 2010; Socher et al., 2012). In the second step, the that some noun-compounds do not have any para- noun-compound representations are used as fea- phrases in the corpus or have just a few. The ap- ture vectors for classification (e.g. Dima and Hin- proach of Van de Cruys et al.(2013) somewhat richs, 2015; Dima, 2016). generalizes for unseen noun-compounds. They The datasets for this task differ in size, num- represented each noun-compound using a compo- ber of relations and granularity level (e.g. Nastase sitional distributional vector (Mitchell and Lap- and Szpakowicz, 2003; Kim and Baldwin, 2007; ata, 2010) and used it to predict paraphrases from Tratz and Hovy, 2010). The decision on the re- the corpus. Similar noun-compounds are expected lation inventory is somewhat arbitrary, and sub- to have similar distributional representations and sequently, the inter-annotator agreement is rela- therefore yield the same paraphrases. For exam- tively low (Kim and Baldwin, 2007). Specifi- ple, if the corpus does not contain paraphrases for cally, a noun-compound may fit into more than plastic spoon, the model may predict the para- one relation: for instance, in Tratz(2011), busi- phrases of a similar compound such as steel knife. ness zone is labeled as CONTAINED (zone con- In terms of sharing information between tains business), although it could also be labeled semantically-similar paraphrases, Nulty and as PURPOSE (zone whose purpose is business). Costello(2010) and Surtani et al.(2013) learned 2.2 Noun-compound Paraphrasing “is-a” relations between paraphrases from the co-occurrences of various paraphrases with each As an alternative to the strict classification to pre- other. For example, the specific ‘[w2] extracted defined relation classes, Nakov and Hearst(2006) from [w1]’ template (e.g. in the context of olive suggested that the semantics of a noun-compound oil) generalizes to ‘[w2] made from [w1]’. 
One of could be expressed with multiple prepositional the drawbacks of these systems is that they favor apple cake and verbal paraphrases. For example, more frequent paraphrases, which may co-occur from made of contains is a cake , , or which apples. with a wide variety of more specific paraphrases. The suggestion was embraced and resulted in two SemEval tasks. SemEval 2010 task 9 2.3 Noun-compounds in other Tasks (Butnariu et al., 2009) provided a list of plau- sible human-written paraphrases for each noun- Noun-compound paraphrasing may be considered compound, and systems had to rank them with the as a subtask of the general paraphrasing task, goal of high correlation with human judgments. whose goal is to generate, given a text fragment, In SemEval 2013 task 4 (Hendrickx et al., 2013), additional texts with the same meaning. How- systems were expected to provide a ranked list of ever, general paraphrasing methods do not guar- paraphrases extracted from free text. antee to explicate implicit information conveyed Various approaches were proposed for this task. in the original text. Moreover, the most notable Most approaches start with a pre-processing step source for extracting paraphrases is multiple trans- of extracting joint occurrences of the constituents lations of the same text (Barzilay and McKeown,



Figure 1: An illustration of the model predictions for w1 and p given the triplet (cake, made of, apple). The model predicts each component given the encoding of the other two components, successfully pre- dicting ‘apple’ given ‘cake made of [w1]’, while predicting ‘[w2] containing [w1]’ for ‘cake [p] apple’. 2001; Ganitkevitch et al., 2013; Mallinson et al., Section 3.2 details the creation of training data, 2017). If a certain concept can be described by and Section 3.3 describes the model. an English noun-compound, it is unlikely that a translator chose to translate its foreign language 3.1 Multi-task Reformulation equivalent to an explicit paraphrase instead. Each training example consists of two constituents Another related task is Open Information Ex- and a paraphrase (w2, p, w1), and we train the traction (Etzioni et al., 2008), whose goal is to ex- model on 3 subtasks: (1) predict p given w1 and tract relational tuples from text. Most system fo- w2, (2) predict w1 given p and w2, and (3) predict cus on extracting verb-mediated relations, and the w2 given p and w1. Figure1 demonstrates the pre- few exceptions that addressed noun-compounds dictions for subtasks (1) (right) and (2) (left) for provided partial solutions. Pal and Mausam the training example (cake, made of, apple). Ef- (2016) focused on segmenting multi-word noun- fectively, the model is trained to answer questions compounds and assumed an is-a relation between such as “what can cake be made of?”, “what can the parts, as extracting (Francis Collins, is, NIH be made of apple?”, and “what are the possible re- director) from “NIH director Francis Collins”. lationships between cake and apple?”. Xavier and Lima(2014) enriched the corpus with The multi-task reformulation helps learning bet- compound definitions from online dictionaries, for ter representations for paraphrase templates, by example, interpreting oil industry as (industry, embedding semantically-similar paraphrases in produces and delivers, oil) based on the Word- proximity. Similarity between paraphrases stems Net definition “industry that produces and delivers either from lexical similarity and overlap between oil”. This method is very limited as it can only the paraphrases (e.g. ‘is made of’ and ‘made of’), interpret noun-compounds with dictionary entries, or from shared constituents, e.g. ‘[w2] involved in while the majority of English noun-compounds [w1]’ and ‘[w2] in [w1] industry’ can share [w1] don’t have them (Nakov, 2013). = insurance and [w2] = company. This allows the model to predict a correct paraphrase for a given 3 Paraphrasing Model noun-compound, even when the constituents do As opposed to previous approaches, that focus on not occur with that paraphrase in the corpus. predicting a paraphrase template for a given noun- 3.2 Training Data compound, we reformulate the task as a multi- task learning problem (Section 3.1), and train the We collect a training set of (w2, p, w1, s) exam- model to also predict a missing constituent given ples, where w1 and w2 are constituents of a noun- the paraphrase template and the other constituent. compound w1w2, p is a templated paraphrase, and Our model is semi-supervised, and it expects as s is the score assigned to the training instance.2 input a set of noun-compounds and a set of con- 2We refer to “paraphrases” and “paraphrase templates” in- strained part-of-speech tag-based templates that terchangeably. In the extracted templates, [w2] always pre- make valid prepositional and verbal paraphrases. 
cedes [w1], probably because w2 is normally the head noun.

1202 We use the 19,491 noun-compounds found in 3.3 Model the SemEval tasks datasets (Butnariu et al., 2009; For a training instance (w2, p, w1, s), we predict Hendrickx et al., 2013) and in Tratz(2011). To ex- each item given the encoding of the other two. tract patterns of part-of-speech tags that can form noun-compound paraphrases, such as ‘[w2] VERB Encoding. We use the 100-dimensional pre- PREP [w1]’, we use the SemEval task training data, trained GloVe embeddings (Pennington et al., but we do not use the lexical information in the 2014), which are fixed during training. In addi- gold paraphrases. tion, we learn embeddings for the special words [w1], [w2], and [p], which are used to represent Corpus. Similarly to previous noun-compound a missing component, as in “cake made of [w1]”, paraphrasing approaches, we use the Google N- “[w2] made of apple”, and “cake [p] apple”. gram corpus (Brants and Franz, 2006) as a source For a missing component x [p], [w1], [w2] ∈ { } of paraphrases (Wubben, 2010; Li et al., 2010; surrounded by the sequences of words v1:i 1 and − Surtani et al., 2013; Versley, 2013). The cor- vi+1:n, we encode the sequence using a bidirec- pus consists of sequences of n terms (for n tional long-short term memory (bi-LSTM) net- ∈ 3, 4, 5 ) that occur more than 40 times on the work (Graves and Schmidhuber, 2005), and take { } web. We search for n-grams following the ex- the ith output vector as representing the missing tracted patterns and containing w1 and w2’s lem- component: bLS(v1:i, x, vi+1:n)i. mas for some noun-compound in the set. We re- In bi-LSTMs, each output vector is a concate- move punctuation, adjectives, adverbs and some nation of the outputs of the forward and backward determiners to unite similar paraphrases. For ex- LSTMs, so the output vector is expected to con- ample, from the 5-gram ‘cake made of sweet ap- tain information on valid substitutions both with ples’ we extract the training example (cake, made respect to the previous words v1:i 1 and the sub- − of, apple). We keep only paraphrases that occurred sequent words vi+1:n. at least 5 times, resulting in 136,609 instances. Prediction. We predict a distribution of the vo- cabulary of the missing component, i.e. to predict Weighting. Each n-gram in the corpus is accom- w1 correctly we need to predict its index in the panied with its frequency, which we use to assign word vocabulary Vw, while the prediction of p is scores to the different paraphrases. For instance, from the vocabulary of paraphrases in the training ‘cake of apples’ may also appear in the corpus, al- set, Vp. We predict the following distributions: though with lower frequency than ‘cake from ap- pˆ = softmax(W bLS( ~w , [p], ~w ) ) ples’. As also noted by Surtani et al.(2013), the p · 2 1 2 shortcoming of such a weighting mechanism is wˆ = softmax(W bLS( ~w , ~p , [w ]) ) (1) 1 w · 2 1:n 1 n+1 that it prefers shorter paraphrases, which are much wˆ2 = softmax(Ww bLS([w2], ~p1:n, ~w1)1) more common in the corpus (e.g. count(‘cake · where W Vw 2d, W Vp 2d, and d is made of apples’) count(‘cake of apples’)). We w ∈ R| |× p ∈ R| |×  the embeddings dimension. overcome this by normalizing the frequencies for During training, we compute cross-entropy loss each paraphrase length, creating a distribution of for each subtask using the gold item and the pre- paraphrases in a given length. diction, sum up the losses, and weight them by the instance score. During inference, we predict the Negative Samples. 
We add 1% of negative sam- missing components by picking the best scoring ples by selecting random corpus words w1 and index in each distribution:3 w2 that do not co-occur, and adding an exam- pˆi = argmax(ˆp) ple (w2, [w2] is unrelated to [w1], w1, sn), for wˆ = argmax(w ˆ ) (2) some predefined negative samples score sn. Sim- 1i 1 ilarly, for a word wi that did not occur in a para- wˆ2i = argmax(w ˆ2) phrase p we add (wi, p, UNK, sn) or (UNK, p, The subtasks share the pre-trained word embed- wi, sn), where UNK is the unknown word. This dings, the special embeddings, and the biLSTM may help the model deal with non-compositional parameters. Subtasks (2) and (3) also share Ww, noun-compounds, where w1 and w2 are unrelated, the MLP that predicts the index of a word. rather than forcibly predicting some relation be- 3In practice, we pick the k best scoring indices in each tween them. distribution for some predefined k, as we discuss in Section5.

1203 [w1] [w2] Predicted Paraphrases [w2] Paraphrase Predicted [w1] Paraphrase [w1] Predicted [w2] [w2] of [w1] heart surgery [w2] on [w1] brain drug cataract surgery surgery [w2] to treat [w1] [w2] to treat [w1] cataract [w2] to remove [w1] back patient [w2] in patients with [w1] knee transplant [w2] of [w1] management company [w2] to develop [w1] production firm software company company [w2] engaged in [w1] [w2] engaged in [w1] software [w2] in [w1] industry computer engineer [w2] involved in [w1] business industry [w2] is of [w1] spring party [w2] of [w1] afternoon meeting stone wall meeting [w2] held in [w1] [w2] held in [w1] morning [w2] is made of [w1] hour rally [w2] made of [w1] day session Table 1: Examples of top ranked predicted components using the model: predicting the paraphrase given w1 and w2 (left), w1 given w2 and the paraphrase (middle), and w2 given w1 and the paraphrase (right).

[w2] belongs to [w1] [w2] belonging to [w1]

[w2] given toby [w1][w1] [w2] of providing [w1] [w2] offered by [w1]

[w2] issued by [w1] [w2] source of [w1] [w2] conducted[w2][w2] involvedengaged by [w1] in in [w1] [w1]

[w2] held by [w1] [w2][w2][w2] pertaining relatingrelated to to[w1] [w1] [w2] caused[w2] by [w1]occur in [w1] [w2] in terms of [w1] [w2] with [w1] [w2] aimed at [w1] [w2] from [w1] [w2] with respect to [w1] [w2] out of [w1] [w2] part of [w1][w2] consisting of [w1] [w2][w2] composedconsists of of[w1] [w1] [w2][w2] for for use [w1] in [w1] [w2] devoteddedicated to to [w1] [w1] [w2] included in [w1] [w2] created by [w1] [w2] employed in [w1] [w2] owned by [w1] [w2] is made of [w1] [w2] done by [w1] [w2] found in [w1] [w2] is for [w1] [w2] generated[w2] by [w1]waymeans of [w1]of [w1] [w2] produced by [w1] [w2] because of [w1] [w2] supplied by [w1]

[w2][w2] mademade byof[w2] [w1] to make [w1]

[w2] provided by [w1] [w2] to produce [w1]

Figure 2: A t-SNE map of a sample of paraphrases, using the paraphrase vectors encoded by the biLSTM, for example bLS([w2] made of [w1]).

Implementation Details. The model is imple- data, but are rather generalized from similar exam- mented in DyNet (Neubig et al., 2017). We dedi- ples. For example, there is no training instance for cate a small number of noun-compounds from the “company in the software industry” but there is a corpus for validation. We train for up to 10 epochs, “firm in the software industry” and a company in stopping early if the validation loss has not im- many other industries. proved in 3 epochs. We use Momentum SGD While the frequent prepositional paraphrases (Nesterov, 1983), and set the batch size to 10 and are often ranked at the top of the list, the model the other hyper-parameters to their default values. also retrieves more specified verbal paraphrases. The list often contains multiple semantically- 4 Qualitative Analysis similar paraphrases, such as ‘[w2] involved in To estimate the quality of the proposed model, we [w1]’ and ‘[w2] in [w1] industry’. This is a result first provide a qualitative analysis of the model of the model training objective (Section3) which outputs. Table1 displays examples of the model positions the vectors of semantically-similar para- outputs for each possible usage: predicting the phrases close to each other in the embedding paraphrase given the constituent words, and pre- space, based on similar constituents. dicting each constituent word given the paraphrase To illustrate paraphrase similarity we compute and the other word. a t-SNE projection (Van Der Maaten, 2014) of The examples in the table are from among the the embeddings of all the paraphrases, and draw a top 10 ranked predictions for each component- sample of 50 paraphrases in Figure2. The projec- pair. We note that most of the (w2, paraphrase, tion positions semantically-similar but lexically- w1) triplets in the table do not occur in the training divergent paraphrases in proximity, likely due to 1204 many shared constituents. For instance, ‘with’, is its confidence score. The last feature incorpo- ‘from’, and ‘out of’ can all describe the relation rates the original model score into the decision, as between food words and their ingredients. to not let other considerations such as preposition frequency in the training set take over. 5 Evaluation: Noun-Compound During inference, the model sorts the list of Interpretation Tasks paraphrases retrieved for each noun-compound ac- For quantitative evaluation we employ our model cording to the pairwise ranking. It then scores for two noun-compound interpretation tasks. The each paraphrase by multiplying its rank with its main evaluation is on retrieving and ranking para- original model score, and prunes paraphrases with phrases ( 5.1). For the sake of completeness, we final score < 0.025. The values for k and the § also evaluate the model on classification to a fixed threshold were tuned on the training set. inventory of relations ( 5.2), although it wasn’t de- § signed for this task. Evaluation Settings. The SemEval 2013 task provided a scorer that compares words and n- 5.1 Paraphrasing grams from the gold paraphrases against those in Task Definition. The general goal of this task the predicted paraphrases, where agreement on is to interpret each noun-compound to multiple a prefix of a word (e.g. in derivations) yields prepositional and verbal paraphrases. In SemEval a partial scoring. The overall score assigned to 4 2013 Task 4, the participating systems were each system is calculated in two different ways. 
asked to retrieve a ranked list of paraphrases for The ‘isomorphic’ setting rewards both precision each noun-compound, which was automatically and recall, and performing well on it requires ac- evaluated against a similarly ranked list of para- curately reproducing as many of the gold para- phrases proposed by human annotators. phrases as possible, and in much the same order.

Model. For a given noun-compound w1w2, we The ‘non-isomorphic’ setting rewards only preci- first predict the k = 250 most likely paraphrases: sion, and performing well on it requires accurately reproducing the top-ranked gold paraphrases, with pˆ1, ..., pˆk = argmaxk pˆ, where pˆis the distribution of paraphrases defined in Equation1. no importance to order. While the model also provides a score for each paraphrase (Equation1), the scores have not been Baselines. We compare our method with the optimized to correlate with human judgments. We published results from the SemEval task. The therefore developed a re-ranking model that re- SemEval 2013 baseline generates for each noun- ceives a list of paraphrases and re-ranks the list to compound a list of prepositional paraphrases in better fit the human judgments. an arbitrary fixed order. It achieves a moder- We follow Herbrich(2000) and learn a pair- ately good score in the non-isomorphic setting by wise ranking model. The model determines which generating a fixed set of paraphrases which are of two paraphrases of the same noun-compound both common and generic. The MELODI sys- should be ranked higher, and it is implemented tem performs similarly: it represents each noun- as an SVM classifier using scikit-learn (Pedregosa compound using a compositional distributional et al., 2011). For training, we use the available vector (Mitchell and Lapata, 2010) which is then training data with gold paraphrases and ranks pro- used to predict paraphrases from the corpus. The vided by the SemEval task organizers. We extract performance of MELODI indicates that the sys- the following features for a paraphrase p: tem was rather conservative, yielding a few com- 1. The part-of-speech tags contained in p mon paraphrases rather than many specific ones. 2. The prepositions contained in p SFS and IIITH, on the other hand, show a more 3. The number of words in p balanced trade-off between recall and precision. 4. Whether p ends with the special [w1] symbol As a sanity check, we also report the results of a pˆi pˆ 5. cosine(bLS([w2], p, [w1])2, V~p ) pˆ i baseline that retrieves ranked paraphrases from the · training data collected in Section 3.2. This base- ~ pˆi where Vp is the biLSTM encoding of the pre- line has no generalization abilities, therefore it is pˆi dicted paraphrase computed in Equation1 and pˆ expected to score poorly on the recall-aware iso- 4https://www.cs.york.ac.uk/semeval-2013/task4 morphic setting.

1205 Method isomorphic non-isomorphic SFS (Versley, 2013) 23.1 17.9 IIITH (Surtani et al., 2013) 23.1 25.8 Baselines MELODI (Van de Cruys et al., 2013) 13.0 54.8 SemEval 2013 Baseline (Hendrickx et al., 2013) 13.8 40.6 Baseline 3.8 16.1 This paper Our method 28.2 28.4 Table 2: Results of the proposed method and the baselines on the SemEval 2013 task.

Category % ally annotated categories for each error type. False Positive Many false positive errors are actually valid (1) Valid paraphrase missing from gold 44% (2) Valid paraphrase, slightly too specific 15% paraphrases that were not suggested by the hu- (3) Incorrect, common prepositional paraphrase 14% man annotators (error 1, “discussion by group”). (4) Incorrect, other errors 14% Some are borderline valid with minor grammati- (5) Syntactic error in paraphrase 8% (6) Valid paraphrase, but borderline grammatical 5% cal changes (error 6, “force of coalition forces”) or too specific (error 2, “life of women in commu- False Negative (1) Long paraphrase (more than 5 words) 30% nity” instead of “life in community”). Common (2) Prepositional paraphrase with determiners 25% prepositional paraphrases were often retrieved al- (3) Inflected constituents in gold 10% (4) Other errors 35% though they are incorrect (error 3). We conjec- ture that this error often stem from an n-gram that Table 3: Categories of false positive and false neg- does not respect the syntactic structure of the sen- ative predictions along with their percentage. tence, e.g. a sentence such as “rinse away the oil from baby ’s head” produces the n-gram “oil from Results. Table2 displays the performance of the baby”. proposed method and the baselines in the two eval- With respect to false negative examples, they uation settings. Our method outperforms all the consisted of many long paraphrases, while our methods in the isomorphic setting. In the non- model was restricted to 5 words due to the source isomorphic setting, it outperforms the other two of the training data (error 1, “holding done in the systems that score reasonably on the isomorphic case of a share”). Many prepositional paraphrases setting (SFS and IIITH) but cannot compete with consisted of determiners, which we conflated with the systems that focus on achieving high precision. the same paraphrases without determiners (error The main advantage of our proposed model 2, “mutation of a gene”). Finally, in some para- is in its ability to generalize, and that is also phrases, the constituents in the gold paraphrase demonstrated in comparison to our baseline per- appear in inflectional forms (error 3, “holding of formance. The baseline retrieved paraphrases only shares” instead of “holding of share”). for a third of the noun-compounds (61/181), ex- pectedly yielding poor performance on the isomor- phic setting. Our model, which was trained on the 5.2 Classification very same data, retrieved paraphrases for all noun- compounds. For example, welfare system was not Noun-compound classification is defined as a mul- present in the training data, yet the model pre- ticlass classification problem: given a pre-defined dicted the correct paraphrases “system of welfare set of relations, classify w1w2 to the relation that benefits”, “system to provide welfare” and others. holds between w1 and w2. Potentially, the cor- pus co-occurrences of w1 and w2 may contribute Error Analysis. We analyze the causes of the to the classification, e.g. ‘[w2] held at [w1]’ in- false positive and false negative errors made by the dicates a TIME relation. Tratz and Hovy(2010) in- model. For each error type we sample 10 noun- cluded such features in their classifier, but ablation compounds. 
For each noun-compound, false pos- tests showed that these features had a relatively itive errors are the top 10 predicted paraphrases small contribution, probably due to the sparseness which are not included in the gold paraphrases, of the paraphrases. Recently, Shwartz and Wa- while false negative errors are the top 10 gold terson(2018) showed that paraphrases may con- paraphrases not found in the top k predictions tribute to the classification when represented in a made by the model. Table3 displays the manu- continuous space.

1206 Model. We generate a paraphrase vector repre- Dataset & Split Method F1 sentation ~par(w1w2) for a given noun-compound Tratz and Hovy(2010) 0.739 Dima(2016) 0.725 Tratz w1w2 as follows. We predict the indices of the k Shwartz and Waterson(2018) 0.714 fine most likely paraphrases: pˆ , ..., pˆ = argmax pˆ, distributional 1 k k Random 0.677 where pˆ is the distribution on the paraphrase vo- paraphrase 0.505 integrated 0.673 cabulary Vp, as defined in Equation1. We then Tratz and Hovy(2010) 0.340 encode each paraphrase using the biLSTM, and Dima(2016) 0.334 Tratz Shwartz and Waterson(2018) average the paraphrase vectors, weighted by their fine 0.429 distributional 0.356 Lexical confidence scores in pˆ: paraphrase 0.333 pˆi integrated k pˆi ~ 0.370 i=1 pˆ Vp Tratz and Hovy(2010) 0.760 ~par(w1w2) = · (3) k pˆ Dima(2016) pˆ i Tratz 0.775 P i=1 Shwartz and Waterson(2018) 0.736 coarse distributional 0.689 We train a linear classifier,P and represent w w Random 1 2 paraphrase 0.557 in a feature vector f(w1w2) in two variants: para- integrated 0.700 phrase: f(w1w2) = ~par(w1w2), or integrated: Tratz and Hovy(2010) 0.391 Dima(2016) 0.372 concatenated to the constituent word embeddings Tratz Shwartz and Waterson(2018) coarse 0.478 f(w1w2) = [ ~par(w1w2), ~w1, ~w2]. The classifier distributional 0.370 Lexical type (logistic regression/SVM), k, and the penalty paraphrase 0.345 are tuned on the validation set. We also pro- integrated 0.393 vide a baseline in which we ablate the paraphrase Table 4: Classification results. For each dataset component from our model, representing a noun- split, the top part consists of baseline methods and compound by the concatenation of its constituent the bottom part of methods from this paper. The embeddings f(w1w2) = [ ~w1, ~w2] (distributional). best performance in each part appears in bold.

Datasets. We evaluate on the Tratz(2011) the Full-Additive (Zanzotto et al., 2010) and Ma- dataset, which consists of 19,158 instances, la- trix (Socher et al., 2012) models. We report the beled in 37 fine-grained relations (Tratz-fine) or results from Shwartz and Waterson(2018). 12 coarse-grained relations (Tratz-coarse). 3) Paraphrase-based (Shwartz and Waterson, We report the performance on two different 2018): a neural classification model that learns dataset splits to train, test, and validation: a ran- an LSTM-based representation of the joint occur- dom split in a 75:20:5 ratio, and, following con- rences of w1 and w2 in a corpus (i.e. observed cerns raised by Dima(2016) about lexical mem- paraphrases), and integrates distributional infor- orization (Levy et al., 2015), on a lexical split in mation using the constituent embeddings. which the sets consist of distinct vocabularies. The lexical split better demonstrates the scenario in Table4 displays the methods’ perfor- which a noun-compound whose constituents have Results. mance on the two versions of the Tratz(2011) not been observed needs to be interpreted based on dataset and the two dataset splits. The paraphrase similar observed noun-compounds, e.g. inferring model on its own is inferior to the distributional the relation in pear tart based on apple cake and model, however, the integrated version improves other similar compounds. We follow the random upon the distributional model in 3 out of 4 settings, and full-lexical splits from Shwartz and Waterson demonstrating the complementary nature of the (2018). distributional and paraphrase-based methods. The Baselines. We report the results of 3 baselines contribution of the paraphrase component is espe- representative of different approaches: cially noticeable in the lexical splits. 1) Feature-based (Tratz and Hovy, 2010): we re- As expected, the integrated method in Shwartz implement a version of the classifier with features and Waterson(2018), in which the paraphrase from WordNet and Roget’s Thesaurus. representation was trained with the objective of 2) Compositional (Dima, 2016): a neural archi- classification, performs better than our integrated tecture that operates on the distributional represen- model. The superiority of both integrated models tations of the noun-compound and its constituents. in the lexical splits confirms that paraphrases are Noun-compound representations are learned with beneficial for classification.

1207 Example Noun-compounds Gold Distributional Example Paraphrases

printing plant PURPOSEOBJECTIVE [w2] engaged in [w1]

marketing expert [w2] in [w1] TOPICALOBJECTIVE development expert [w2] knowledge of [w1]

weight/job loss OBJECTIVECAUSAL [w2] of [w1]

rubber band [w2] made of [w1] CONTAINMENT PURPOSE rice cake [w2] is made of [w1]

laboratory animal LOCATION/PART-WHOLE ATTRIBUTE [w2] in [w1], [w2] used in [w1] Table 5: Examples of noun-compounds that were correctly classified by the integrated model while being incorrectly classified by distributional, along with top ranked indicative paraphrases.

Analysis. To analyze the contribution of the a plausible concrete interpretation or which origi- paraphrase component to the classification, we fo- nated from one. For example, it predicted that sil- cused on the differences between the distributional ver spoon is simply a spoon made of silver and that and integrated models on the Tratz-Coarse lexical monkey business is a business that buys or raises split. Examination of the per-relation F1 scores monkeys. In other cases, it seems that the strong revealed that the relations for which performance prior on one constituent leads to ignoring the other, improved the most in the integrated model were unrelated constituent, as in predicting “wedding TOPICAL (+11.1 F1 points), OBJECTIVE (+5.5), AT- made of diamond”. Finally, the “unrelated” para- TRIBUTE (+3.8) and LOCATION/PART WHOLE (+3.5). phrase was predicted for a few compounds, but Table5 provides examples of noun-compounds those are not necessarily non-compositional (ap- that were correctly classified by the integrated plication form, head teacher). We conclude that model while being incorrectly classified by the dis- the model does not address compositionality and tributional model. For each noun-compound, we suggest to apply it only to compositional com- provide examples of top ranked paraphrases which pounds, which may be recognized using compo- are indicative of the gold label relation. sitionality prediction methods as in Reddy et al. (2011). 6 Compositionality Analysis 7 Conclusion Our paraphrasing approach at its core assumes compositionality: only a noun-compound whose We presented a new semi-supervised model for meaning is derived from the meanings of its con- noun-compound paraphrasing. The model differs stituent words can be rephrased using them. In from previous models by being trained to predict 3.2 we added negative samples to the train- both a paraphrase given a noun-compound, and a § ing data to simulate non-compositional noun- missing constituent given the paraphrase and the compounds, which are included in the classifi- other constituent. This results in better general- cation dataset ( 5.2). We assumed that these ization abilities, leading to improved performance § compounds, more often than compositional ones in two noun-compound interpretation tasks. In the would consist of unrelated constituents (spelling future, we plan to take generalization one step fur- bee, sacred cow), and added instances of random ther, and explore the possibility to use the biL- STM for generating completely new paraphrase unrelated nouns with ‘[w2] is unrelated to [w1]’. Here, we assess whether our model succeeds to templates unseen during training. recognize non-compositional noun-compounds. Acknowledgments We used the compositionality dataset of Reddy et al.(2011) which consists of 90 noun- This work was supported in part by an Intel compounds along with human judgments about ICRI-CI grant, the Israel Science Foundation their compositionality in a scale of 0-5, 0 be- grant 1951/17, the German Research Foundation ing non-compositional and 5 being compositional. through the German-Israeli Project Cooperation For each noun-compound in the dataset, we pre- (DIP, grant DA 1600/1-1), and Theo Hoffenberg. dicted the 15 best paraphrases and analyzed the er- Vered is also supported by the Clore Scholars Pro- rors. The most common error was predicting para- gramme (2017), and the AI2 Key Scientific Chal- phrases for idiomatic compounds which may have lenges Program (2017).

1208 References Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013). Regina Barzilay and R. Kathleen McKeown. 2001. Association for Computational Linguistics, pages Extracting paraphrases from a parallel corpus. 138–143. http://aclweb.org/anthology/S13-2025. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics. Ralf Herbrich. 2000. Large margin rank boundaries for http://aclweb.org/anthology/P01-1008. ordinal regression. Advances in large margin classi- fiers pages 115–132. Thorsten Brants and Alex Franz. 2006. Web 1t 5-gram version 1 . Nam Su Kim and Preslav Nakov. 2011. Large- scale noun compound interpretation using boot- Cristina Butnariu, Su Nam Kim, Preslav Nakov, ´ strapping and the web as a corpus. In Pro- Diarmuid OSeaghdha,´ Stan Szpakowicz, and ceedings of the 2011 Conference on Empirical Tony Veale. 2009. Semeval-2010 task 9: The Methods in Natural Language Processing. Associa- interpretation of noun compounds using para- tion for Computational Linguistics, pages 648–658. phrasing verbs and prepositions. In Proceed- http://aclweb.org/anthology/D11-1060. ings of the Workshop on Semantic Evalua- tions: Recent Achievements and Future Direc- Su Nam Kim and Timothy Baldwin. 2007. Interpret- tions (SEW-2009). Association for Computational ing noun compounds using bootstrapping and sense Linguistics, Boulder, Colorado, pages 100–105. collocation. In Proceedings of Conference of the http://www.aclweb.org/anthology/W09-2416. Pacific Association for Computational Linguistics. pages 129–136. Corina Dima. 2016. Proceedings of the 1st Work- shop on Representation Learning for NLP, As- Omer Levy, Steffen Remus, Chris Biemann, and sociation for Computational Linguistics, chapter Ido Dagan. 2015. Do supervised distribu- On the Compositionality and Semantic Interpreta- tional methods really learn lexical inference re- tion of English Noun Compounds, pages 27–39. lations? In Proceedings of the 2015 Confer- https://doi.org/10.18653/v1/W16-1604. ence of the North American Chapter of the As- sociation for Computational Linguistics: Human Corina Dima and Erhard Hinrichs. 2015. Automatic Language Technologies. Association for Computa- noun compound interpretation using deep neural tional Linguistics, Denver, Colorado, pages 970– networks and word embeddings. IWCS 2015 page 976. http://www.aclweb.org/anthology/N15-1098. 173. Guofu Li, Alejandra Lopez-Fernandez, and Tony Oren Etzioni, Michele Banko, Stephen Soderland, and Veale. 2010. Ucd-goggle: A hybrid system for Daniel S Weld. 2008. Open information extrac- noun compound paraphrasing. In Proceedings of tion from the web. Communications of the ACM the 5th International Workshop on Semantic Eval- 51(12):68–74. uation. Association for Computational Linguistics, pages 230–233. Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2013. PPDB: The paraphrase Jonathan Mallinson, Rico Sennrich, and Mirella Lap- database. In Proceedings of the 2013 Con- ata. 2017. Paraphrasing revisited with neural ma- ference of the North American Chapter of the chine translation. In Proceedings of the 15th Confer- Association for Computational Linguistics: Hu- ence of the European Chapter of the Association for man Language Technologies. Association for Computational Linguistics: Volume 1, Long Papers. Computational Linguistics, pages 758–764. Association for Computational Linguistics, Valen- http://aclweb.org/anthology/N13-1092. cia, Spain, pages 881–893. Roxana Girju. 2007. 
Improving the interpreta- Jeff Mitchell and Mirella Lapata. 2010. Composition tion of noun phrases with cross-linguistic infor- in distributional models of semantics. Cognitive sci- mation. In Proceedings of the 45th Annual ence 34(8):1388–1429. Meeting of the Association of Computational Lin- guistics. Association for Computational Linguis- Preslav Nakov. 2013. On the interpretation of noun tics, Prague, Czech Republic, pages 568–575. compounds: Syntax, semantics, and entailment. http://www.aclweb.org/anthology/P07-1072. Natural Language Engineering 19(03):291–330. Alex Graves and Jurgen¨ Schmidhuber. 2005. Frame- Preslav Nakov and Marti Hearst. 2006. Using verbs to wise phoneme classification with bidirectional lstm characterize noun-noun relations. In International and other neural network architectures. Neural Net- Conference on Artificial Intelligence: Methodology, works 18(5-6):602–610. Systems, and Applications. Springer, pages 233– 244. Iris Hendrickx, Zornitsa Kozareva, Preslav Nakov, Di- armuid OS´ eaghdha,´ Stan Szpakowicz, and Tony Vivi Nastase and Stan Szpakowicz. 2003. Explor- Veale. 2013. Semeval-2013 task 4: Free paraphrases ing noun-modifier semantic relations. In Fifth in- of noun compounds. In Second Joint Conference ternational workshop on computational semantics on Lexical and Computational Semantics (*SEM), (IWCS-5). pages 285–301.

1209 Yurii Nesterov. 1983. A method of solving a con- E. Duchesnay. 2011. Scikit-learn: Machine learning vex programming problem with convergence rate o in Python. Journal of Machine Learning Research (1/k2). In Soviet Mathematics Doklady. volume 27, 12:2825–2830. pages 372–376. Jeffrey Pennington, Richard Socher, and Christopher Graham Neubig, Chris Dyer, Yoav Goldberg, Austin Manning. 2014. Glove: Global vectors for word Matthews, Waleed Ammar, Antonios Anastasopou- representation. In Proceedings of the 2014 Con- los, Miguel Ballesteros, David Chiang, Daniel ference on Empirical Methods in Natural Language Clothiaux, Trevor Cohn, et al. 2017. Dynet: The Processing (EMNLP). Association for Computa- dynamic neural network toolkit. arXiv preprint tional Linguistics, Doha, Qatar, pages 1532–1543. arXiv:1701.03980 . http://www.aclweb.org/anthology/D14-1162.

Paul Nulty and Fintan Costello. 2010. Ucd-pn: Select- Siva Reddy, Diana McCarthy, and Suresh Manand- ing general paraphrases using conditional probabil- har. 2011. An empirical study on compositional- ity. In Proceedings of the 5th International Work- ity in compound nouns. In Proceedings of 5th In- shop on Semantic Evaluation. Association for Com- ternational Joint Conference on Natural Language putational Linguistics, pages 234–237. Processing. Asian Federation of Natural Language Processing, Chiang Mai, Thailand, pages 210–218. Diarmuid OS´ eaghdha´ and Ann Copestake. 2009. http://www.aclweb.org/anthology/I11-1024. Using lexical and relational similarity to clas- sify semantic relations. In Proceedings of the Vered Shwartz and Chris Waterson. 2018. Olive oil 12th Conference of the European Chapter of is made of olives, baby oil is made for babies: In- the ACL (EACL 2009). Association for Computa- terpreting noun compounds using paraphrases in a tional Linguistics, Athens, Greece, pages 621–629. neural model. In The 16th Annual Conference of http://www.aclweb.org/anthology/E09-1071. the North American Chapter of the Association for Computational Linguistics: Human Language Tech- Harinder Pal and Mausam. 2016. Demonyms and com- nologies (NAACL-HLT). New Orleans, Louisiana. pound relational nouns in nominal open ie. In Pro- ceedings of the 5th Workshop on Automated Knowl- Richard Socher, Brody Huval, D. Christopher Man- edge Base Construction. Association for Compu- ning, and Y. Andrew Ng. 2012. Semantic composi- tational Linguistics, San Diego, CA, pages 35–39. tionality through recursive matrix-vector spaces. In http://www.aclweb.org/anthology/W16-1307. Proceedings of the 2012 Joint Conference on Empir- ical Methods in Natural Language Processing and Patrick Pantel, Rahul Bhagat, Bonaventura Coppola, Computational Natural Language Learning. Asso- Timothy Chklovski, and Eduard Hovy. 2007. ISP: ciation for Computational Linguistics, pages 1201– Learning inferential selectional preferences. In Hu- 1211. http://aclweb.org/anthology/D12-1110. man Language Technologies 2007: The Confer- ence of the North American Chapter of the Asso- Nitesh Surtani, Arpita Batra, Urmi Ghosh, and Soma ciation for Computational Linguistics; Proceedings Paul. 2013. Iiit-h: A corpus-driven co-occurrence of the Main Conference. Association for Computa- based probabilistic model for noun compound para- tional Linguistics, Rochester, New York, pages 564– phrasing. In Second Joint Conference on Lexical 571. http://www.aclweb.org/anthology/N/N07/N07- and Computational Semantics (* SEM), Volume 2: 1071. Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013). volume 2, Marius Pasca. 2015. Interpreting compound noun pages 153–157. phrases using web search queries. In Proceed- ings of the 2015 Conference of the North American Stephen Tratz. 2011. Semantically-enriched parsing Chapter of the Association for Computational Lin- for natural language understanding. University of guistics: Human Language Technologies. Associa- Southern California. tion for Computational Linguistics, pages 335–344. https://doi.org/10.3115/v1/N15-1037. Stephen Tratz and Eduard Hovy. 2010. A taxon- omy, dataset, and classifier for automatic noun Ellie Pavlick and Marius Pasca. 2017. Identify- compound interpretation. In Proceedings of the ing 1950s american jazz musicians: Fine-grained 48th Annual Meeting of the Association for Com- isa extraction via modifier composition. In Pro- putational Linguistics. 
Association for Computa- ceedings of the 55th Annual Meeting of the As- tional Linguistics, Uppsala, Sweden, pages 678– sociation for Computational Linguistics (Volume 687. http://www.aclweb.org/anthology/P10-1070. 1: Long Papers). Association for Computational Linguistics, Vancouver, Canada, pages 2099–2109. Tim Van de Cruys, Stergos Afantenos, and Philippe http://aclweb.org/anthology/P17-1192. Muller. 2013. Melodi: A supervised distribu- tional approach for free paraphrasing of noun com- F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, pounds. In Second Joint Conference on Lex- B. Thirion, O. Grisel, M. Blondel, P. Pretten- ical and Computational Semantics (*SEM), Vol- hofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Pas- ume 2: Proceedings of the Seventh Interna- sos, D. Cournapeau, M. Brucher, M. Perrot, and tional Workshop on Semantic Evaluation (Se-

1210 mEval 2013). Association for Computational Lin- guistics, Atlanta, Georgia, USA, pages 144–147. http://www.aclweb.org/anthology/S13-2026. Laurens Van Der Maaten. 2014. Accelerating t-sne using tree-based algorithms. Journal of machine learning research 15(1):3221–3245. Yannick Versley. 2013. Sfs-tue: Compound para- phrasing with a language model and discriminative reranking. In Second Joint Conference on Lexical and Computational Semantics (* SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013). volume 2, pages 148–152. Sander Wubben. 2010. Uvt: Memory-based pairwise ranking of paraphrasing verbs. In Proceedings of the 5th International Workshop on Semantic Eval- uation. Association for Computational Linguistics, pages 260–263. Clarissa Xavier and Vera Lima. 2014. Boosting open information extraction with noun-based relations. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Ninth International Conference on Language Re- sources and Evaluation (LREC’14). European Lan- guage Resources Association (ELRA), Reykjavik, Iceland. Fabio Massimo Zanzotto, Ioannis Korkontzelos, Francesca Fallucchi, and Suresh Manandhar. 2010. Estimating linear models for compositional distribu- tional semantics. In Proceedings of the 23rd Inter- national Conference on Computational Linguistics. Association for Computational Linguistics, pages 1263–1271.

1211 Chapter 9

Adding Context to Semantic Data-Driven Paraphrasing

78 Adding Context to Semantic Data-Driven Paraphrasing

Vered Shwartz Ido Dagan Computer Science Department Bar-Ilan University Ramat-Gan, Israel [email protected] [email protected]

Abstract et al., 2007), e.g., play entails compete in certain contexts, but not in the context of playing the na- Recognizing lexical inferences between tional anthem at a sports competition. Second, the pairs of terms is a common task in NLP soundness of inferences within context is condi- applications, which should typically be tioned on the fine-grained semantic relation that performed within a given context. Such holds between the terms, as studied within natural context-sensitive inferences have to con- logic (MacCartney and Manning, 2007). For in- sider both term meaning in context as stance, in upward-monotone sentences a term en- well as the fine-grained relation hold- tails its hypernym (“my iPhone’s battery is low” ing between the terms. Hence, to de- “my phone’s battery is low”), while in down- ⇒ velop suitable lexical inference methods, ward monotone ones it entails its hyponym (“talk- we need datasets that are annotated with ing on the phone is prohibited” “talking on the ⇒ fine-grained semantic relations in-context. iPhone is prohibited”). Since existing datasets either provide out- Accordingly, developing algorithms that prop- of-context annotations or refer to coarse- erly apply lexical inferences in context requires grained relations, we propose a method- datasets in which inferences are annotated in- ology for adding context-sensitive anno- context by fine-grained semantic relations. Yet, tations. We demonstrate our methodol- such a dataset is not available (see 2.1). Most ex- ogy by applying it to phrase pairs from isting datasets provide out-of-context annotations, PPDB 2.0, creating a novel dataset of fine- while the few available in-context annotations re- grained lexical inferences in-context and fer to coarse-grained relations, such as relatedness showing its utility in developing context- or similarity. sensitive methods. In recent years, the PPDB paraphrase database (Ganitkevitch et al., 2013) became a popular re- 1 Introduction source among semantic tasks, such as monolin- Recognizing lexical inference is an essential com- gual alignment (Sultan et al., 2014) and recog- ponent in semantic tasks. In question answering, nizing textual entailment (Noh et al., 2015). Re- for instance, identifying that broadcast and air cently, Pavlick et al. (2015) classified each para- are synonymous enables answering the question phrase pair to the fine-grained semantic relation “When was ‘Friends’ first aired?” given the text that holds between the phrases, following natu- “‘Friends’ was first broadcast in 1994”. Semantic ral logic (MacCartney and Manning, 2007). To relations such as synonymy (tall, high) and hyper- that end, a subset of PPDB paraphrase-pairs were nymy (cat, pet) are used to infer the meaning of manually annotated, forming a fine-grained lexi- one term from another, in order to overcome lexi- cal inference dataset. Yet, annotations are given cal variability. out-of-context, limiting its utility. In semantic tasks, such terms appear within cor- In this paper, we aim to fill the current gap responding contexts, thus making two aspects nec- in the inventory of lexical inference datasets, and essary in order to correctly apply inferences: First, present a methodology for adding context to out- the meaning of each term should be considered of-context datasets. We apply our methodology on within its context (Szpektor et al., 2007; Pantel a subset of phrase pairs from Pavlick et al. (2015),

108 Proceedings of the Fifth Joint Conference on Lexical and Computational Semantics (*SEM 2016), pages 108–113, Berlin, Germany, August 11-12, 2016. out-of-context in-context x y contexts relation relation Roughly 1,500 gold and silver pieces were found and the hoard contains roughly 5kgs of gold and 2.5kgs of silver. 1 piece strip Equivalence Independent A huge political storm has erupted around Australia after labor leader Kevin Rudd was found to have gone to a strip club during a taxpayer funded trip. Three countries withdrew from the competition: Germany, Spain and Switzerland. Reverse 2 competition race Morgan Tsvangirai, the leader of the Movement for Equivalence Entailment Democratic Change (MDC), Zimbabwe’s main opposition party, has said that he will pull out of the race to become the president of Zimbabwe. The birth of the boy, whose birth name is disputed among different sources, is considered very important in the entertainment world. Forward Other- 3 boy family Entailment related Bill will likely disrupt the Obama family’s vacation to Martha’s Vineyard. Amid wild scenes of joy on the pitch he jumped onto the podium and lifted the trophy, the fourth of Italy’s history. Other- 4 jump walk Alternation related In a game about rescuing hostages a hero might walk past Coca-Cola machine’s one week and Pepsi the next.

Table 1: Illustration of annotation shifts when context is given. [1] the sense of strip in the given context is different from the one which is equivalent to piece. [2] the term race is judged out-of-context as more specific than competition, but is considered equivalent to it in a particular context. [3] a meronymy relation is (often) considered out-of-context as entailment, while in a given context this judgment doesn’t hold. [4] general relations may become more concrete when the context is given. creating a novel dataset for fine-grained lexical in- single coarse-grained relation (e.g. lexical substi- ference in-context. For each term-pair, we add a tution, similarity). pair of context sentences, and re-annotate these 1 term-pairs with respect to their contexts. We In some datasets, term-pairs are annotated to show that almost half of the semantically-related whether the relation holds between them in some term-pairs become unrelated when the context is (unspecified) contexts (out-of-context), while in specified. Furthermore, a generic out-of-context others, the annotation is given with respect to a relation may change within a given context (see given context (in-context). In these datasets, each table 1). We further report baseline results that entry consists of a term-pair, x and y, and con- demonstrate the utility of our dataset in develop- text, where some of the datasets provide a single ing fine-grained context-sensitive lexical inference context in which x occurs while others provide a methods. separate context for each of x and y (correspond- ing to the 1 context and 2 contexts columns in 2 Background Figure 1). The latter simulates a frequent need 2.1 Lexical Inference Datasets in NLP applications, for example, a question an- swering system recognizes that broadcast entails Figure 1 lists prominent human-annotated datasets air given the context of the question (“When was used for developing lexical inference methods. In ‘Friends’ first aired?”) and that of the candidate these datasets, each entry consists of an (x, y) passage (“‘Friends’ was first broadcast in 1994”). term-pair, annotated to whether a certain semantic relation holds between x and y. Each dataset ei- ther specifies fine-grained semantic relations (see We observe that most lexical inference datasets 2.2), or groups several semantic relations under a provide out-of-context annotations. The exist- ing in-context datasets are annotated for coarse- 1 The dataset and annotation guidelines are available at: grained semantic relations, such as similarity or http://u.cs.biu.ac.il/ ˜nlp/resources/downloads/ relatedness, which may not be sufficiently infor- context-sensitive-fine-grained-dataset. mative.

109 eia neecsbsdo uhfiegandse- fine-grained relations. such applying mantic on to based lead inferences likely applica- lexical may in extension PPDB this of tions, usage widespread Consid- the PPDB. semantic in ering pairs the paraphrase predict all and for classifier relation a later train were to annotations used para- human These each for pair. appro- holds phrase the that relation select semantic to priate instructed were of 2014). al., Annotators et subset (Marelli entailment a textual of annotating SICK dataset the in first appeared that by pharaphase-pairs PPDB 2). done table was (see This 2007) Manning, and (MacCartney logic natural following relations, fine- semantic with grained PPDB re-annotated (2015) al. et Pavlick monolingual and 2014). al., 2015), et (Sultan al., alignment textual et recognizing (Noh 2015), semantic entailment al., as et such (Wang tasks, parsing semantic many used quite been has for de- it automatically years, recent of In resource paraphrases. rived huge a et is (Ganitkevitch 2013) al., database paraphrase PPDB The Relations Semantic with PPDB 2.2 conflate we 2: Table (2012), al. et Zeichner [14] (2015), al. et Shwartz [13] [16]. (2014), 2.2), al. (see et (2015) Levy al. [12] et (2014), Pavlick Mohammad [15] and Turney [11] (2010), al. (2015). al. et Wieting (2015). [4] (2012), al. et Huang 1: Figure ∼ ≡ A @ neednei o eae to related not is Independence # eaini h xc poieof opposite exact the is Negation ˆ | relation eety spr ftePD . release, 2.0 PPDB the of part as Recently, coarse-grained fine-grained owr naleti oeseicthan specific more is Entailment Forward ees naleti oegnrlthan general more is Entailment Reverse eia substitution Lexical eatcrltosi PB20 iePvike al., et Pavlick Like 2.0. PPDB in relations Semantic te-eae srltdi oeohrwyto way other some in related is Other-Related a fpoietlxclifrnedatasets. inference lexical prominent of map A qiaec stesm as same the is Equivalence lento smtal xlsv with exclusive mutually is Alternation negation 5 odi-5 relatedness WordSim-353 [5] 1 odi-5 similarity WordSim-353 [1] 1]PPDB-fine-human [15] 4 Annotated-PPDB [4] 1]Kotlerman2010 [10] 1]Shwartz2015 [13] 1]Turney2014 [11] 2 SimLex-999 [2] 1]Levy2014 [12] and 6 MEN [6] oe1cnet2contexts 2 context 1 none alternation 8 catyadNvgi(07,[]Kee ta.(2014), al. et Kremer [9] (2007), Navigli and McCarthy [8] : nooerelation. one into emrelatedness Term 1]Zeichner2012 [14] 8 eEa 2007 SemEval [8] 9 All-Words [9] odsimilarity Word 7 TR9856 [7] context 110 5 ec ta.(08,[]Buie l 21) 7 eye al. et Levy [7] (2014), al. et Bruni [6] (2008), al. et Zesch [5] : sflfrorproe ssmni applications semantic as purpose, our for useful Filtering e.g. ical, in phrases Grammaticality-based pairs. phrase such we omit too annotations, to was human chose of this cost the As justify to sparse relation. semantic we another annotated that with were pairs 8% such only 100 context, within of annotated sample a context; of every out almost indeed, in independent remain will as out-of-context annotated were that Types Relation pairs: phrase the the on applied editing therefore and We filtering following less purpose. are our that for phrase-pairs useful some with albeit pairs), PPDB-fine-human Phrase-Pairs Selecting 3.1 of that methodology datasets, on a apply inference we lexical present to we context adding section, this In Methodology Construction Dataset 3 context. 
of spec- out is ified pair paraphrase each for the relation since semantic inference, lexical in feature key still a are missing resource, created as automatically 2.0 dataset PPDB this men- to refer paraphrases PPDB-fine-human above we human-annotated the relevant; of particularly find subset therefore tioned and datasets, nti ae,w ou nhuman-annotated on focus we paper, this In 1 ec ta.(08,[]Hl ta.(04,[3] (2014), al. et Hill [2] (2008), al. et Zesch [1] : o is boy PPDB-fine-human PPDB-fine-human ecnie uhprssless phrases such consider We . eepce htprs pairs phrase that expected We eia inference Lexical hsdtst swl sthe as well as dataset, This . saqielredtst(14k dataset large quite a is 1]ti paper this [16] 3 SCWS [3] r ungrammat- are 1]Ktemnet Kotlerman [10] : . independent Many usually apply lexical inferences on syntactically (x, y, sx, sy) for each phrase pair and 3750 tuples coherent constituents. We therefore parse the in total.5 original SICK (Marelli et al., 2014) sentences We split the dataset to 70% train, 25% test, and containing these phrases, and omit pairs in which 5% validation sets. Each of the sets contains dif- one of the phrases is not a constituent. ferent term-pairs, to avoid overfitting for the most common relation of a term-pair in the training set. Filtering Trivial Pairs In order to avoid trivial paraphrase pairs, we filter out inflections (Iraq, 3.3 Annotation Task Iraqi) and alternate spellings (center, centre), by Our annotation task, carried out on Amazon Me- omitting pairs that share the same lemma, or those chanical Turk, followed that of Pavlick et al. that have Levenshtein distance 3. In addi- ≤ (2015). We used their guidelines, and altered them tion, we omit pairs that have lexical overlaps only to consider the contexts. We instructed an- (a young lady, lady) and filter out pairs in which notators to select the relation that holds between one of the two phrases is just a stop word. the terms (x and y) while interpreting each term’s meaning within its given context (s and s ). To Removing Determiners The annotation seems x y to be indifferent to the presence of a determiner, ensure the quality of workers, we applied a quali- e.g., the labelers annotated all of (kid, the boy), fication test and required a US location, and a 99% (the boy, the kid), and (a kid, the boy) as reverse approval rate for at least 1,000 prior HITS. We as- entailment. To avoid repetitive pairs, and to get a signed each annotation to 5 workers, and, follow- single “normalized” phrase, we remove preceding ing Pavlick et al. (2015), selected the gold label determiners, e.g., yielding (kid, boy). using the majority rule, breaking ties at random. We note that for 91% of the examples, at least 3 of Finally, it is interesting to note that the annotators agreed.6 PPDB-fine-human includes term-pairs in The annotations yielded moderate levels of which terms are of different grammatical cat- agreement, with Fleiss’ Kappa κ = 0.51 (Lan- egories. Our view is that such cross-category dis and Koch, 1977). For a fair comparison, we term-pairs are often relevant for semantic infer- replicated the original out-of-context annotation ence (e.g. (bicycle, riding)) and therefore we on a sample of 100 pairs from our dataset, yield- decided to stick to the PPDB setting, and kept ing agreement of κ = 0.46, while the in-context such pairs. agreement for these pairs was κ = 0.51. 
As ex- At the end of this filtering process we remained pected, adding context improves the agreement, with 1385 phrase pairs from which we sampled by directing workers toward the same term senses 375 phrase pairs for our dataset, preserving the rel- while revealing rare senses that some workers may 7 ative frequency across relation types in PPDB. miss without context.

3.2 Adding Context Sentences 4 Analysis We used Wikinews2 to extract context sentences. Figure 2 displays the confusion matrix of rela- We used the Wikinews dump from November tion annotations in context compared to the out- 2015, converted the Wiki Markup to clean text us- of-context annotations. Most prominently, while ing WikiExtractor3, and parsed the corpus using the original relation holds in many of the contexts, spaCy.4 it is also common for term-pairs to become inde- For each (x, y) phrase-pair, we randomly sam- pendent. In some cases, the semantic relation is pled 10 sentence-pairs of the form (sx, sy), such changed (as in table 1). that sx contains x and sy contains y. In the 5Our dataset is comparable in size to most of the datasets sampling process we require, for each of the two in Figure 1. In particular, the SCWS dataset (Huang et al., terms, that its 10 sentences are taken from differ- 2012), which is the most similar to ours, contains 2003 term- ent Wikinews articles, to obtain a broader range of pairs with context sentences. 6We also released an additional version of the dataset, in- the term’s senses. This yields 10 tuples of the form cluding only the agreeable 91%. 7The gap between the reported agreement in Pavlick et 2https://en.wikinews.org/ al. (2015) (κ = 0.56) and our agreement for out-of-context 3https://github.com/attardi/wikiextractor annotations (κ = 0.46) may be explained by our filtering 4http://spacy.io/ process, removing obvious and hence easily consensual pairs.

111 tn uhdtstoe PB20fine-grained 2.0 cre- PPDB by over it dataset demonstrated such and ating in- for datasets, lexical methodology context-insensitive ference a to presented context we adding paper, this In Conclusion 5 methodol- our by created ogy. ones similar our or using methods, dataset such better develop to need emphasizing the mediocre, ab- still the is Yet, performance solute methods. context- inference fine-grained lexical developing sensitive of for utility dataset potential our the illustrating insensitive thus context baselines, outperforms notably dataset, similar- contexts. word the across maximal con- ity term’s the measures other (3) the term and in text, term a similar between most similarities its and measure (2) and (1) features: the following added and 2014), al., et on (Pennington trained Wikipedia dimensions, 300 pretrained of used embeddings GloVe we vectors, features. as fea- words 2.0 context-sensitive represent To PPDB additional available plus the tures, using set, train our on trained classifier regression also logistic We context-sensitive a predictions. from classifier labels its 2.0 PPDB all manual in assigns term-pair first PPDB-fine-human a the to as- label context-insensitive, contexts; (ta- are same set two the first test signing The our 3). on ble performances baseline sev- eral report we utility, dataset’s our demonstrate To Results Baseline 4.1 independent out-of-context annotate didn’t pairs. we to changed. that are due relations Recall column semantic usually cells, last other independent, all In the become sense-shifts. and that in-context, term-pairs hold out-of- shows shows that diagonal relations The context out-of-context. annotations for 2: Figure

hscnetsniiemto,tando our on trained method, context-sensitive This out-of-context ∼ A @ ≡ | ecnae fec eainantto in-context, annotation relation each of percentages 05 . .803 .822.22 9.58 0.38 0.38 6.9 60.54 .20720 4.56 2.03 0.7 1.67 1.52 5.97 2.96 .518 5.42 1.88 1.25 ≡ max 11 .101.942.82 11.69 0 1.41 41.13 A @ max w max x ∈ s x n h eodassigns second the and , w w in-context 79 .81.939.17 13.19 2.08 37.92 ,w ∈ ∈ y s s y x ∈ ~x ~y s y · · 14 .247.08 2.92 41.46 ~w ~w ~w x ∼ | · ~w y 14 59.75 31.46 # (2) (3) (1) 112 ehr rmr arnEk eata Pad Sebastian Erk, Katrin Kremer, Gerhard Maayan and Szpektor, Idan Dagan, Ido Kotlerman, Lili Man- D Christopher Socher, Richard Huang, H Eric 2014. Korhonen. Anna and Reichart, Roi Hill, Felix Chris and Durme, Van Benjamin Ganitkevitch, Juri Baroni. Marco and Tran, Nam-Khanh Bruni, Elia References 1600/1-1). DA grant Coopera- (DIP, tion Project German-Israeli Founda- the Research through German tion the Foundation and Science 880/12, Israel grant the grant, ICRI-CI tel Chris insightful and and comments. Pavlick assistance Ellie their for thank Callison-Burch to like would We Acknowledgments corre- baselines. the context-insensitive outperform for sponding which used methods, be lexical inference indeed context-sensitive can fine-grained demon- dataset developing then our We that strated annotations. paraphrase-pair the conflate we al., and et classi- Pavlick regression Like logistic fier. context-sensitive our (out-of- (3) predictions classifier context). 2.0 PPDB (2) (out-of-context). (1) classes). all 3: Table fa l-od eia usiuincru.In us-analysis corpus. 2014 substitution tell lexical substitutes all-words What an of 2014. Thater. fan distribu- Engineering inference. guage Directional lexical for similarity tional 2010. Zhitomirsky-Geffet. Linguistics. Computational for word tion Improving In word 2012. prototypes. multiple and context global Ng. via Y representations Andrew and ning, with models estimation. arXiv:1408.3456 semantic similarity Evaluating (genuine) Simlex-999: paraphrase The PPDB: In 2013. database. Callison-Burch. semantics. 49:1–47. distributional Multimodal 2014. In- an by supported partially was work This ees entailment reverse ncnetcasfir0.677 classifier in-context PPDB-fine-human PB lsie .1 .6 0.556 0.565 0.611 classifier PPDB2 . aeiepromneo h etst(enover (mean set test the on performance Baseline HLT-NAACL PPDB-fine-human C 2012 ACL . eain nalbaselines. all in relations 16(04):359–389. , rcso recall precision 0.722 ae 7–8.Associa- 873–882. pages , ae 758–764. pages , .8 0.288 0.380 .8 0.670 0.685 aulannotations manual owr entailment forward ri preprint arXiv aua Lan- Natural ,adSte- and o, ´ F 1 EACL JAIR , J Richard Landis and Gary G Koch. 1977. The mea- Idan Szpektor, Eyal Shnarch, and Ido Dagan. 2007. surement of observer agreement for categorical data. Instance-based evaluation of entailment rule acqui- biometrics, pages 159–174. sition. In ACL 2007, page 456.

Omer Levy, Ido Dagan, and Jacob Goldberger. 2014. Focused entailment graphs for open ie propositions. In CoNLL 2014, pages 87–97, Ann Arbor, Michigan, June. Association for Computational Linguistics.

Ran Levy, Liat Ein-Dor, Shay Hummel, Ruty Rinott, and Noam Slonim. 2015. Tr9856: A multi-word term relatedness benchmark. In ACL 2015, page 419.

Bill MacCartney and Christopher D Manning. 2007. Natural logic for textual inference. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, pages 193–200. Association for Computational Linguistics.

Marco Marelli, Luisa Bentivogli, Marco Baroni, Raffaella Bernardi, Stefano Menini, and Roberto Zamparelli. 2014. Semeval-2014 task 1: Evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment. SemEval-2014.

Diana McCarthy and Roberto Navigli. 2007. Semeval-2007 task 10: English lexical substitution task. In Proceedings of the 4th International Workshop on Semantic Evaluations, pages 48–53. Association for Computational Linguistics.

Tae-Gil Noh, Sebastian Padó, Vered Shwartz, Ido Dagan, Vivi Nastase, Kathrin Eichler, Lili Kotlerman, and Meni Adler. 2015. Multi-level alignments as an extensible representation basis for textual entailment algorithms. *SEM 2015, page 193.

Patrick Pantel, Rahul Bhagat, Bonaventura Coppola, Timothy Chklovski, and Eduard H Hovy. 2007. Isp: Learning inferential selectional preferences. In HLT-NAACL, pages 564–571.

Ellie Pavlick, Johan Bos, Malvina Nissim, Charley Beller, Benjamin Van Durme, and Chris Callison-Burch. 2015. Adding semantics to data-driven paraphrasing. In ACL 2015, Beijing, China, July. Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global vectors for word representation. In EMNLP 2014, pages 1532–1543.

Vered Shwartz, Omer Levy, Ido Dagan, and Jacob Goldberger. 2015. Learning to exploit structured resources for lexical inference. CoNLL 2015, page 175.

Md Arafat Sultan, Steven Bethard, and Tamara Sumner. 2014. Back to basics for monolingual alignment: Exploiting word similarity and contextual evidence. TACL 2014, 2:219–230.

Idan Szpektor, Eyal Shnarch, and Ido Dagan. 2007. Instance-based evaluation of entailment rule acquisition. In ACL 2007, page 456.

Peter D Turney and Saif M Mohammad. 2014. Experiments with three approaches to recognizing lexical entailment. Natural Language Engineering, 21(03):437–476.

Yushi Wang, Jonathan Berant, and Percy Liang. 2015. Building a semantic parser overnight. In ACL 2015.

John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2015. From paraphrase database to compositional paraphrase model and back. TACL 2015, 3:345–358.

Naomi Zeichner, Jonathan Berant, and Ido Dagan. 2012. Crowdsourcing inference-rule evaluation. In ACL 2012, pages 156–160. Association for Computational Linguistics.

Torsten Zesch, Christof Müller, and Iryna Gurevych. 2008. Using wiktionary for computing semantic relatedness. In AAAI, volume 8, pages 861–866.

Chapter 10

Conclusion

In this chapter I summarize the contributions described in this thesis, and present open issues and the way forward for addressing them. The first contribution is going beyond relying on distributional representations and incorporating additional information sources when addressing various tasks pertaining to lexical inference (Section 10.1). The second contribution is an analysis of the effect of context on the applicability of lexical inferences (Section 10.2). The third contribution is two approaches for uncovering the implicit relationship between the constituents of a noun compound, which are part of a broader problem of learning meaningful representations for multi-word expressions (Section 10.3). Finally, I briefly discuss the imperative next step in lexical inferences: applying these inferences within downstream applications (Section 10.4).

10.1 Going Beyond the Distributional Hypothesis

Many NLP applications today rely on word embeddings to inform them about the semantic relations between words. However, word embeddings capture a fuzzy distributional similarity between words, which conflates multiple semantic relations. In many cases it would be beneficial to know the specific semantic relation that holds for a pair of words, e.g. identifying that “it is cold today” and “it is warm today” contradict each other.
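This conflation is easy to observe directly. The snippet below is a minimal illustration, assuming a local copy of pretrained word2vec-format embeddings loaded with gensim (the file name is a placeholder); it shows that antonyms and co-hyponyms rank among a word's nearest distributional neighbours:

    from gensim.models import KeyedVectors

    # "embeddings.bin" is a placeholder for any pretrained word2vec-format file.
    vectors = KeyedVectors.load_word2vec_format("embeddings.bin", binary=True)

    # Distributional similarity conflates many semantic relations:
    print(vectors.similarity("cold", "warm"))    # antonyms, yet highly similar
    print(vectors.similarity("cat", "dog"))      # co-hyponyms
    print(vectors.most_similar("cat", topn=5))   # typically mixes kitten, dog, pet, tail, ...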


Supervised models that use word embeddings to predict the specific semantic relations between words suffer from the “lexical memorization” effect (Levy et al., 2015a): they provide information on each of the words in isolation, by memorizing their distribution in the training data (e.g. predicting with certainty that entity is a hypernym of any other word). It is therefore clear that additional information sources are required for providing more accurate information about the relationship between a pair of words.

In this work I demonstrated that additional information sources are beneficial for various lexical inference tasks. In chapters 4 and 5 I used the lexico-syntactic patterns connecting the joint occurrences of the words in the corpus to identify the specific semantic relation that holds between them. I showed that this information is complementary to information originating in their distributional representations. The model consistently benefited from this information, especially in challenging evaluation settings, for rare words or senses, and for non-typical relations. In chapter 7 I applied a similar model to recognize the implicit semantic relation between the constituents of a noun compound. In all cases the models succeeded better than purely distributional models, especially in settings in which lexical memorization was disabled.

Finally, in chapter 6 I extracted lexically-divergent paraphrases. Many of these paraphrase pairs were not generally similar to each other, and in particular are not expected to have similar word embeddings, but given a certain assignment to their arguments and in specific contexts they had roughly the same meaning. What made this extraction possible was the event coreference assumption: both texts referred to the same entities within descriptions of the same news event, suggesting that they have roughly the same meaning, despite possibly having very different distributional representations.
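As a rough illustration of this kind of integration (a schematic sketch, not the exact architectures of chapters 4, 5 and 7), the following PyTorch snippet encodes each lexico-syntactic path connecting the joint occurrences of a term pair with an LSTM, averages the path encodings, and concatenates the result with the terms' word embeddings before classifying the semantic relation; all dimensions and the input encoding are illustrative choices:

    import torch
    import torch.nn as nn

    class PathDistributionalClassifier(nn.Module):
        def __init__(self, vocab_size, edge_vocab_size, dim, num_relations):
            super().__init__()
            self.word_emb = nn.Embedding(vocab_size, dim)    # distributional component
            self.edge_emb = nn.Embedding(edge_vocab_size, dim)
            self.path_lstm = nn.LSTM(dim, dim, batch_first=True)
            self.classifier = nn.Linear(3 * dim, num_relations)

        def forward(self, x_id, y_id, paths):
            # x_id, y_id: 0-dim LongTensors with the ids of the two terms.
            # paths: list of 1-d LongTensors, each a sequence of edge ids for one
            # lexico-syntactic path connecting a joint occurrence of (x, y).
            encoded = []
            for path in paths:
                _, (h, _) = self.path_lstm(self.edge_emb(path).unsqueeze(0))
                encoded.append(h.squeeze(0).squeeze(0))
            if encoded:
                path_vec = torch.stack(encoded).mean(dim=0)
            else:
                path_vec = torch.zeros(self.word_emb.embedding_dim)
            features = torch.cat([self.word_emb(x_id), path_vec, self.word_emb(y_id)])
            return self.classifier(features)    # scores over the semantic relations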

10.2 Modeling Context-sensitivity

Applying lexical inferences within applications imposes additional difficulty in detecting whether the potential lexical inference is valid within a given context. At the basic level, such an application needs to address polysemy: words may have different meanings in different contexts. In chapter 9 I showed that the context may cause nuanced changes in a word's meaning, without choosing a completely different sense. This may change the semantic relation that holds between terms in different contexts. For instance, while a race is considered more specific than a competition, in some contexts “race” is used as a metaphor for “competition”, and they are considered equivalent in meaning.

Moreover, the validity of the inference depends on the monotonicity of the sentence. In an upward monotone sentence, a word entails its hypernym, as in “I like cats” → “I like animals”, while in downward monotone sentences it is the other way around: “I do not own animals” → “I do not own cats”. In previous-generation textual inference approaches, this phenomenon was handled with “natural logic” (MacCartney and Manning, 2007), a system of logical inference which operates over natural language.

Modern neural models represent a word in a given context by encoding the context, typically using recurrent neural networks. Recently, this has become possible even for applications with a scarce amount of training data. Instead of learning context representations from scratch with the downstream training objective, they rely on pre-trained contextualized word embeddings which may be plugged into any neural model (Peters et al., 2018; Radford et al., 2018; Devlin et al., 2018), either as a fixed representation layer or fine-tuned to the task. Such models compute a dynamic vector representation of a word given its context. In doing so, they largely address polysemy, as they no longer conflate all the different senses of a word into a single vector.

Contextualized word representations have been shown to significantly improve performance over regular word embeddings across various NLP applications. Recent work investigated what these representations capture with respect to the syntax and semantics of a sentence, by testing them on a broad range of tasks. Tenney et al. (2019) found that all the models produced strong representations for syntactic phenomena, but gained smaller performance improvements over the baselines in the more semantic tasks. Liu et al. (2019) found that some tasks (e.g., identifying the tokens that comprise the conjuncts in a coordination construction) required fine-grained linguistic knowledge which was not available in the representations unless they were fine-tuned for the task.

The contribution of this thesis is orthogonal to that of contextualized word embeddings with respect to improving upon standard word embeddings. Most of the work presented in the thesis largely addresses the out-of-context lexical inference setting, as a step towards learning and applying such inferences within context. The focus was mainly on going beyond learning fuzzy distributional similarity between words or phrases, and modelling the specific semantic relations between them. Contextualized word embeddings, despite their other advantages, suffer from similar issues as standard word embeddings with respect to their distributional nature.
While standard word embeddings often confuse synonyms with antonyms or with similar but mutually exclusive terms, contextualized word embeddings, now being used as a source of knowledge, may produce wrongly-generalized facts like “Barack's wife is Hillary” (Logan et al., 2019), and even a strong structural indication of negation does not stop them from completing “facts” like “birds cannot” with “fly” (Kassner and Schütze, 2019).
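Such failures are easy to reproduce with a masked language model. The sketch below uses the Hugging Face transformers fill-mask pipeline; the specific model name is an illustrative choice, not necessarily one of the models analyzed in the cited work:

    from transformers import pipeline

    fill = pipeline("fill-mask", model="bert-base-uncased")

    for prompt in ["Birds can [MASK].", "Birds cannot [MASK]."]:
        top = fill(prompt)[0]
        print(prompt, "->", top["token_str"], round(top["score"], 3))
    # As reported by Kassner and Schütze (2019), such models tend to complete
    # both prompts with "fly", ignoring the negation.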

10.3 Representations Beyond the Word-Level

In recent years, with the help of recurrent neural networks which facilitate processing texts of arbitrary lengths, phrase, paragraph, and sentence embeddings became popular (e.g. Socher et al., 2011; Kiros et al., 2015). Conneau et al. (2018) showed that sentence embeddings capture various syntactic and semantic properties of a sentence. However, this approach works under the assumption of full compositionality: the sentence's meaning is derived fully from the sum of its word meanings.

Multiword expressions (MWE) have long been a “pain in the neck” for NLP (Sag et al., 2002), specifically because they often hold meanings which are not derived fully from their constituent words. Idioms (“look what the cat dragged in”), fixed expressions (“ad hoc”), and proper names (“New York”) easily break the compositionality assumption.

In adjective-noun composition, the adjective usually adds information to the noun (e.g. red car), but often it is already implied (e.g. little baby, because babies are always little), or completely contradicts the meaning of the noun (e.g. fake gun) (Pavlick and Callison-Burch, 2016). Finally, noun compounds are often non-compositional (monkey business) or only related to one of the constituents (spelling bee is related to spelling). Even when they are fully-compositional, there is no “one-size fits all” composition function. As I showed in chapters 7 and 8, various different semantic relations may hold between the constituents of a noun compound (e.g. “olive oil” is oil made of olives, while “baby oil” is oil made for babies).

As part of the broader goal of addressing this representation gap, I focused on noun compounds and developed two approaches for uncovering the semantic relation that holds between the constituents of a noun compound. In chapter 7 I developed a method for the classification task, which maps a noun compound to one of a set of pre-defined relationships, while in chapter 8 I worked on the paraphrasing task, which extracts a list of prepositional and verbal paraphrases describing each noun compound. Common to both models was the use of the joint occurrences of the noun compound's constituents in the corpus as a means of explaining the semantic relation, as in “cake made of apples”, and the encoding of the patterns connecting the joint occurrences with neural networks, which helps generalize over semantically similar patterns.

Given the prevalence of MWEs in English, the current representation gap likely hurts the performance of downstream applications. In Shwartz and Dagan (2019), I tested the ability of the new contextualized word representations to handle phenomena related to lexical composition, and found that they were impressively good at recognizing meaning shift, for example, that spelling bee is unrelated to bees or that carry on has a different meaning from carry. This is not a surprising result, given that the context-sensitive nature of such representations makes them extremely good at word sense disambiguation. With that said, an attempt to query such representations for lexical substitutes suggests that the representation of the shifted senses is of lower quality, likely due to their infrequency in the training corpus.

Moreover, none of the representations tested was particularly successful in recovering implicit meaning, e.g. recognizing the relationships implicitly conveyed in olive oil vs. in baby oil. There is no reason to expect such representations to be better than standard word embeddings in recovering something that wasn't said. I expect that future research in representation learning will attempt to address “reading between the lines” and incorporating knowledge which is conveyed implicitly.
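The joint-occurrence idea mentioned above (e.g. “cake made of apples”) can be illustrated with a toy corpus search. The following sketch only collects the words connecting the two constituents as candidate paraphrases; it is not the neural paraphrasing model of chapter 8, which additionally generalizes over similar compounds and similar paraphrases, and the corpus, surface forms, and gap size are illustrative assumptions:

    import re
    from collections import Counter

    def collect_paraphrases(corpus_sentences, head, modifier, max_gap=4):
        # Collect "head ... modifier" joint occurrences, keeping the connecting
        # words as candidate paraphrases. A real system would also lemmatize
        # and handle inflected forms; here surface forms are matched directly.
        pattern = re.compile(
            rf"\b{head}\b((?:\s+\w+){{1,{max_gap}}})\s+{modifier}\b", re.IGNORECASE)
        paraphrases = Counter()
        for sentence in corpus_sentences:
            for match in pattern.finditer(sentence):
                connector = match.group(1).strip()
                paraphrases[f"{head} {connector} {modifier}"] += 1
        return paraphrases.most_common()

    corpus = ["This oil is made of olives grown locally.",
              "They sell oil extracted from olives."]
    print(collect_paraphrases(corpus, "oil", "olives"))
    # [('oil is made of olives', 1), ('oil extracted from olives', 1)]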

10.4 Going Forward: Improving Downstream Applications

Recognizing semantic relations between lexical items is crucial for NLP applications and infrastructure tasks, in order to overcome lexical variability. A representative example is recognizing textual entailment (RTE) (Dagan et al., 2013). In this task, two sentences, a premise (p) and a hypothesis (h), are given, and the goal is to recognize whether p entails, contradicts, or is neutral with respect to h. For instance, consider the following entailment example: p: “Trump spoke in Huntsville, criticizing the NFL players who protested during the anthem.”, and h: “In his Alabama speech, the president attacked the NFL players who protested during the hymn.”.

Previous generation models for this task leveraged an abundance of lexical resources to identify lexical inferences (Androutsopoulos and Malakasiotis, 2010), e.g. identifying that criticize and attack refer to the same action and that Huntsville is in Alabama. In recent years, with the availability of large-scale datasets, the task shifted from explicitly recognizing lexical inferences to training end-to-end neural models that rely solely on word embeddings for modeling lexical inferences. While these neural models achieve impressive performance on the new datasets, various recent works showed that this is achieved thanks to artifacts in the data rather than by solving the actual task (Gururangan et al., 2018; Poliak et al., 2018). Specifically, Glockner et al. (2018), Naik et al. (2018), and Sanchez et al. (2018) all showed that the models are limited in their ability to recognize lexical inferences.

It is imperative, therefore, to incorporate external lexical knowledge into these models; however, it is not trivial to do so. One promising approach is to incorporate this knowledge into the word embeddings themselves, and then use the embeddings in downstream tasks as one would use standard distributional embeddings. There is a growing research interest in embedding semantic relations from WordNet (Fellbaum, 1998) into vectors (e.g. Faruqui et al., 2015; Vendrov et al., 2016; Nguyen et al., 2017; Nickel and Kiela, 2017; Vulic and Mrksic, 2018), but these vectors have only been evaluated intrinsically on their ability to complete missing edges in the WordNet taxonomy. Using them in downstream applications is not trivial since they encode the information differently from distributional embeddings.

A promising research direction is to learn word pair embeddings that capture the semantic relation between the words, and use them instead of or in addition to the embeddings of each single word. Joshi et al. (2019) learned such vectors in an unsupervised manner from the lexico-syntactic paths connecting two words in the corpus. I expect that in the future, such knowledge will be incorporated into general word representations by adding auxiliary objectives to their training.
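As an illustration of the kind of end-to-end neural RTE models discussed above, the sketch below runs this section's premise-hypothesis example through an off-the-shelf pretrained NLI model from the transformers library; the model choice is illustrative and not one of the systems analyzed in the works cited above. Getting the entailment label right requires exactly the lexical inferences discussed here: criticize/attack, Huntsville/Alabama, anthem/hymn.

    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    name = "roberta-large-mnli"   # an illustrative choice of pretrained NLI model
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(name)

    premise = ("Trump spoke in Huntsville, criticizing the NFL players "
               "who protested during the anthem.")
    hypothesis = ("In his Alabama speech, the president attacked the NFL players "
                  "who protested during the hymn.")

    inputs = tokenizer(premise, hypothesis, return_tensors="pt")
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
    for label, p in zip(model.config.id2label.values(), probs.tolist()):
        print(label, round(p, 3))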
Bibliography

Ion Androutsopoulos and Prodromos Malakasiotis. A survey of paraphrasing and textual entailment methods. Journal of Artificial Intelligence Research, 38: 135–187, 2010.

Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. Dbpedia: A nucleus for a web of open data. In The semantic web, pages 722–735. Springer, 2007.

Amir Bakarov. A survey of word embeddings evaluation methods. arXiv preprint arXiv:1801.09536, 2018.

Marco Baroni and Alessandro Lenci. Distributional memory: A general frame- work for corpus-based semantics. Computational Linguistics, 36(4):673–721, 2010.

Marco Baroni, Raffaella Bernardi, Ngoc-Quynh Do, and Chung-chieh Shan. En- tailment above the word level in distributional semantics. In Proceedings of EACL 2012, pages 23–32, 2012.

Regina Barzilay and R. Kathleen McKeown. Extracting paraphrases from a par- allel corpus. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, 2001. URL http://aclweb.org/anthology/ P01-1008.

Regina Barzilay, Kathleen R McKeown, and Michael Elhadad. Information fu- sion in the context of multi-document summarization. In Proceedings of the 37th annual meeting of the Association for Computational Linguistics on Computa- tional Linguistics, pages 550–557. Association for Computational Linguistics, 1999.

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural probabilistic language model. Journal of machine learning research, 3(Feb): 1137–1155, 2003.


Jonathan Berant, Ido Dagan, and Jacob Goldberger. Global learning of focused entailment graphs. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1220–1229. Association for Computational Linguistics, 2010. URL http://aclweb.org/anthology/P10-1124.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enrich- ing word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146, 2017. ISSN 2307-387X.

Elia Bruni, Nam-Khanh Tran, and Marco Baroni. Multimodal distributional semantics. JAIR, 49:1–47, 2014.

Jose Camacho-Collados and Mohammad Taher Pilehvar. From word to sense embeddings: A survey on vector representations of meaning. Journal of Artifi- cial Intelligence Research, 63:743–788, 2018.

Alexis Conneau, Germán Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2126–2136, Melbourne, Australia, July 2018. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P18-1198.

Ido Dagan, Dan Roth, Mark Sammons, and Fabio Massimo Zanzotto. Recogniz- ing textual entailment: Models and applications. Synthesis Lectures on Human Language Technologies, 6(4):1–220, 2013.

Scott Deerwester, Susan T Dumais, George W Furnas, Thomas K Landauer, and Richard Harshman. Indexing by latent semantic analysis. Journal of the Amer- ican society for information science, 41(6):391–407, 1990.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre- training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805v1, 2018.

Corina Dima. On the compositionality and semantic interpretation of English noun compounds. In Proceedings of the 1st Workshop on Representation Learning for NLP, pages 27–39. Association for Computational Linguistics, 2016. doi: 10.18653/v1/W16-1604. URL http://aclweb.org/anthology/W16-1604.

Corina Dima and Erhard Hinrichs. Automatic noun compound interpretation using deep neural networks and word embeddings. IWCS 2015, page 173, 2015.

Manaal Faruqui, Jesse Dodge, Sujay Kumar Jauhar, Chris Dyer, Eduard Hovy, and Noah A. Smith. Retrofitting word vectors to semantic lexicons. In Pro- ceedings of the 2015 Conference of the North American Chapter of the Associa- tion for Computational Linguistics: Human Language Technologies, pages 1606– 1615. Association for Computational Linguistics, 2015. doi: 10.3115/v1/ N15-1184. URL http://aclanthology.coli.uni-saarland.de/pdf/ N/N15/N15-1184.pdf.

Christiane Fellbaum. WordNet. Wiley Online Library, 1998.

Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. Placing search in context: The concept revisited. ACM Transactions on information systems, 20(1):116–131, 2002.

John R Firth. A synopsis of linguistic theory, 1930-1955. Studies in linguistic analysis, 1957.

Katrin Fundel, Robert Küffner, and Ralf Zimmer. Relex—relation extraction using dependency parse trees. Bioinformatics, 23(3):365–371, 2007.

Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. PPDB: The paraphrase database. In Proceedings of the 2013 Conference of the North Amer- ican Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 758–764. Association for Computational Linguistics, 2013. URL http://aclweb.org/anthology/N13-1092.

Max Glockner, Vered Shwartz, and Yoav Goldberg. Breaking nli systems with sentences that require simple lexical inferences. In The 56th Annual Meeting of the Association for Computational Linguistics (ACL), Melbourne, Australia, July 2018.

Iryna Gurevych, Judith Eckle-Kohler, and Michael Matuschek. Linked lexical knowledge bases: Foundations and applications. Synthesis Lectures on Human Language Technologies, 9(3):1–146, 2016.

Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel Bowman, and Noah A. Smith. Annotation artifacts in natural language infer- ence data. In The 16th Annual Conference of the North American Chapter of the As- sociation for Computational Linguistics: Human Language Technologies (NAACL- HLT), New Orleans, Louisiana, June 2018.

Sanda Harabagiu and Dan Moldovan. Knowledge processing on an extended wordNet. WordNet: An electronic lexical database, 305:381–405, 1998.

Zellig S. Harris. Distributional structure. Word, 1954.

Marti Hearst. Automatic acquisition of hyponyms from large text corpora. In COLING 1992 Volume 2: The 15th International Conference on Computational Lin- guistics, 1992.

Iris Hendrickx, Su Nam Kim, Zornitsa Kozareva, Preslav Nakov, Diarmuid Ó Séaghdha, Sebastian Padó, Marco Pennacchiotti, Lorenza Romano, and Stan Szpakowicz. Semeval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals. In SemEval, pages 94–99, 2009.

Felix Hill, Roi Reichart, and Anna Korhonen. Simlex-999: Evaluating semantic models with (genuine) similarity estimation. arXiv preprint arXiv:1408.3456, 2014.

Lynette Hirschman, Marc Light, Eric Breck, and John D. Burger. Deep read: A reading comprehension system. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 325–332, College Park, Maryland, USA, June 1999. Association for Computational Linguistics. doi: 10.3115/1034678.1034731. URL http://www.aclweb.org/anthology/ P99-1042.

Eric H Huang, Richard Socher, Christopher D Manning, and Andrew Y Ng. Improving word representations via global context and multiple word proto- types. In ACL 2012, pages 873–882. Association for Computational Linguis- tics, 2012.

Mandar Joshi, Eunsol Choi, Omer Levy, Daniel S. Weld, and Luke Zettlemoyer. pair2vec: Compositional word-pair embeddings for cross-sentence inference. June 2019.

Nora Kassner and Hinrich Schütze. Negated lama: Birds cannot fly, 2019.

Jamie Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Skip-thought vectors. In Ad- vances in neural information processing systems, pages 3294–3302, 2015.

Lili Kotlerman, Ido Dagan, Idan Szpektor, and Maayan Zhitomirsky-Geffet. Di- rectional distributional similarity for lexical inference. NLE, 16(04):359–389, 2010.

Gerhard Kremer, Katrin Erk, Sebastian Padó, and Stefan Thater. What substitutes tell us - analysis of an "all-words" lexical substitution corpus. In EACL 2014, 2014.

Wuwei Lan, Siyu Qiu, Hua He, and Wei Xu. A continuously growing dataset of sentential paraphrases. In Proceedings of The 2017 Conference on Empirical Methods on Natural Language Processing (EMNLP), 2017.

Omer Levy and Yoav Goldberg. Dependency-based word embeddings. In Pro- ceedings of the 52nd Annual Meeting of the Association for Computational Linguis- tics (Volume 2: Short Papers), pages 302–308, Baltimore, Maryland, June 2014. Association for Computational Linguistics. URL http://www.aclweb. org/anthology/P14-2050.

Omer Levy, Steffen Remus, Chris Biemann, and Ido Dagan. Do supervised dis- tributional methods really learn lexical inference relations? In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computa- tional Linguistics: Human Language Technologies, pages 970–976, 2015a.

Ran Levy, Liat Ein-Dor, Shay Hummel, Ruty Rinott, and Noam Slonim. Tr9856: A multi-word term relatedness benchmark. In ACL 2015, page 419, 2015b.

Dekang Lin. An information-theoretic definition of similarity. In ICML, vol- ume 98, pages 296–304, 1998.

Dekang Lin and Patrick Pantel. Dirt – Discovery of inference rules from text. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, pages 323–328. ACM, 2001.

Nelson F Liu, Matt Gardner, Yonatan Belinkov, Matthew Peters, and Noah A Smith. Linguistic knowledge and transferability of contextual representa- tions. arXiv preprint arXiv:1903.08855, 2019.

Yang Liu, Furu Wei, Sujian Li, Heng Ji, Ming Zhou, and Houfeng Wang. A dependency-based neural network for relation classification. arXiv preprint arXiv:1507.04646, 2015.

Robert Logan, Nelson F. Liu, Matthew E. Peters, Matt Gardner, and Sameer Singh. Barack’s wife hillary: Using knowledge graphs for fact-aware lan- guage modeling. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5962–5971, Florence, Italy, July 2019. As- sociation for Computational Linguistics. doi: 10.18653/v1/P19-1598. URL https://www.aclweb.org/anthology/P19-1598.

Bill MacCartney and Christopher D Manning. Natural logic for textual inference. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, pages 193–200. Association for Computational Linguistics, 2007.

Jonathan Mallinson, Rico Sennrich, and Mirella Lapata. Paraphrasing revisited with neural machine translation. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 881–893, Valencia, Spain, April 2017. Association for Computa- tional Linguistics.

Diana McCarthy and Roberto Navigli. Semeval-2007 task 10: English lexical substitution task. In Proceedings of the 4th International Workshop on Semantic Evaluations, pages 48–53. Association for Computational Linguistics, 2007.

Oren Melamud, Jonathan Berant, Ido Dagan, Jacob Goldberger, and Idan Szpek- tor. A two level model for context sensitive inference rules. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1331–1340, Sofia, Bulgaria, August 2013. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/ P13-1131.

Oren Melamud, Ido Dagan, and Jacob Goldberger. Modeling word meaning in context with substitute vectors. In HLT-NAACL, pages 472–482, 2015.

Oren Melamud, Jacob Goldberger, and Ido Dagan. context2vec: Learning generic context embedding with bidirectional lstm. In Proceedings of CONLL, 2016.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Dis- tributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.

Aakanksha Naik, Abhilasha Ravichander, Norman Sadeh, Carolyn Rose, and Graham Neubig. Stress test evaluation for natural language inference. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2340–2353, Santa Fe, New Mexico, USA, August 2018. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/ C18-1198.

Ndapandula Nakashole, Gerhard Weikum, and Fabian Suchanek. Patty: a tax- onomy of relational patterns with semantic types. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Com- putational Natural Language Learning, pages 1135–1145. Association for Com- putational Linguistics, 2012.

Preslav Nakov. On the interpretation of noun compounds: Syntax, semantics, and entailment. Natural Language Engineering, 19(03):291–330, 2013.

Preslav Nakov and Marti Hearst. Using verbs to characterize noun-noun relations. In International Conference on Artificial Intelligence: Methodology, Systems, and Applications, pages 233–244. Springer, 2006.

Kim Anh Nguyen, Maximilian Köper, Sabine Schulte im Walde, and Ngoc Thang Vu. Hierarchical embeddings for hypernymy detection and directionality. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 233–243. Association for Computational Linguistics, 2017. URL http://aclweb.org/anthology/D17-1022.

Maximilian Nickel and Douwe Kiela. Poincaré embeddings for learning hierarchical representations. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 6341–6350. Curran Associates, Inc., 2017.

Sebastian Padó and Mirella Lapata. Dependency-based construction of semantic space models. Computational Linguistics, 33(2):161–199, 2007.

Ellie Pavlick and Chris Callison-Burch. Most "babies" are "little" and most "problems" are "huge": Compositional entailment in adjective-nouns. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2164–2173, Berlin, Germany, August 2016. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P16-1204.

Ted Pedersen, Siddharth Patwardhan, and Jason Michelizzi. Wordnet::similarity - measuring the relatedness of concepts. In Daniel Marcu, Susan Dumais, and Salim Roukos, editors, HLT-NAACL 2004: Demonstration Papers, pages 38–41, Boston, Massachusetts, USA, May 2 - May 7 2004. Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543. Association for Computational Linguistics, 2014. doi: 10.3115/v1/D14-1162. URL http://aclweb.org/anthology/D14-1162.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/N18-1202.

Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, and Ben- jamin Van Durme. Hypothesis Only Baselines in Natural Language Inference. In Joint Conference on Lexical and Computational Semantics (StarSem), 2018.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf, 2018.

Philip Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proceedings of the 14th international joint conference on Artificial intelligence - Volume 1, IJCAI’95, pages 448–453. Morgan Kaufmann Publish- ers Inc., 1995. URL http://dl.acm.org/citation.cfm?id=1625855. 1625914.

Laura Rimell. Distributional lexical entailment by topic coherence. In EACL, pages 511–519, 2014.

Stephen Roller, Katrin Erk, and Gemma Boleda. Inclusive yet selective: Su- pervised distributional hypernymy detection. In Proceedings of COLING 2014, pages 1025–1036, 2014.

Michael Roth and Anette Frank. Aligning predicate argument structures in monolingual comparable texts: A new corpus for a new task. In *SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Vol- ume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), pages 218–227. Association for Computational Linguistics, 2012. URL http://aclweb.org/anthology/S12-1030.

Ivan A Sag, Timothy Baldwin, Francis Bond, Ann Copestake, and Dan Flickinger. Multiword expressions: A pain in the neck for nlp. In International Conference on Intelligent Text Processing and Computational Linguistics, pages 1– 15. Springer, 2002.

Ivan Sanchez, Jeff Mitchell, and Sebastian Riedel. Behavior analysis of nli models: Uncovering the influence of three factors on robustness. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1975–1985, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/N18-1179.

Enrico Santus, Alessandro Lenci, Qin Lu, and Sabine Schulte Im Walde. Chasing hypernyms in vector spaces with entropy. In EACL, pages 38–42, 2014.

Yusuke Shinyama and Satoshi Sekine. Preemptive information extraction using unrestricted relation discovery. In Proceedings of the Human Language Technol- ogy Conference of the NAACL, Main Conference, 2006. URL http://aclweb. org/anthology/N06-1039.

Yusuke Shinyama, Satoshi Sekine, and Kiyoshi Sudo. Automatic paraphrase ac- quisition from news articles. In Proceedings of the second international conference on Human Language Technology Research, pages 313–318. Morgan Kaufmann Publishers Inc., 2002.

Vered Shwartz and Ido Dagan. Adding context to semantic data-driven para- phrasing. In *SEM, Berlin, Germany, August 2016a.

Vered Shwartz and Ido Dagan. Path-based vs. distributional information in rec- ognizing lexical semantic relations. In Proceedings of the 5th Workshop on Cog- nitive Aspects of the Lexicon (CogALex-V), in COLING, Osaka, Japan, December 2016b.

Vered Shwartz and Ido Dagan. Paraphrase to explicate: Revealing implicit noun-compound relations. In Proceedings of the 56th Annual Meeting of the As- sociation for Computational Linguistics (Volume 1: Long Papers), pages 1200–1211, Melbourne, Australia, July 2018. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P18-1111.

Vered Shwartz and Ido Dagan. Still a pain in the neck: Evaluating text repre- sentations on lexical composition. In Transactions of the Association for Compu- tational Linguistics (TACL), 2019.

Vered Shwartz and Chris Waterson. Olive oil is made of olives, baby oil is made for babies: Interpreting noun compounds using paraphrases in a neural model. In The 16th Annual Conference of the North American Chapter of the As- sociation for Computational Linguistics: Human Language Technologies (NAACL- HLT), New Orleans, Louisiana, June 2018.

Vered Shwartz, Omer Levy, Ido Dagan, and Jacob Goldberger. Learning to ex- ploit structured resources for lexical inference. In CoNLL, Beijing, China, July 2015.

Vered Shwartz, Yoav Goldberg, and Ido Dagan. Improving hypernymy detection with an integrated path-based and distributional method. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics

(Volume 1: Long Papers), pages 2389–2398, Berlin, Germany, August 2016. As- sociation for Computational Linguistics. URL http://www.aclweb.org/ anthology/P16-1226.

Vered Shwartz, Enrico Santus, and Dominik Schlechtweg. Hypernyms un- der siege: Linguistically-motivated artillery for hypernymy detection. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 65–75, Valencia, Spain, April 2017a. Association for Computational Linguistics. URL http: //www.aclweb.org/anthology/E17-1007.

Vered Shwartz, Gabriel Stanovsky, and Ido Dagan. Acquiring predicate para- phrases from news tweets. In Proceedings of the 6th Joint Conference on Lex- ical and Computational Semantics (*SEM 2017), pages 155–160, Vancouver, Canada, August 2017b. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/S17-1019.

Rion Snow, Daniel Jurafsky, and Andrew Y Ng. Learning syntactic patterns for automatic hypernym discovery. In NIPS, 2004.

Richard Socher, Jeffrey Pennington, Eric H. Huang, Andrew Y. Ng, and Christo- pher D. Manning. Semi-supervised recursive autoencoders for predicting sen- timent distributions. In Proceedings of the 2011 Conference on Empirical Meth- ods in Natural Language Processing, pages 151–161, Edinburgh, Scotland, UK., July 2011. Association for Computational Linguistics. URL http://www. aclweb.org/anthology/D11-1014.

Robert Speer and Catherine Havasi. Representing general relational knowledge in conceptnet 5. In LREC, pages 3679–3686, 2012.

Fabian M Suchanek, Gjergji Kasneci, and Gerhard Weikum. Yago: a core of semantic knowledge. In Proceedings of the 16th international conference on World Wide Web, pages 697–706. ACM, 2007.

Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R Thomas Mc- Coy, Najoung Kim, Benjamin Van Durme, Sam Bowman, Dipanjan Das, and Ellie Pavlick. What do you Learn from Context? Probing for Sentence Struc- ture in Contextualized Word Representations. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum? id=SJzSgnRcKX.

Tim Van de Cruys, Stergos Afantenos, and Philippe Muller. Melodi: A super- vised distributional approach for free paraphrasing of noun compounds. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (Se- mEval 2013), pages 144–147, Atlanta, Georgia, USA, June 2013. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/ S13-2026.

Ivan Vendrov, Ryan Kiros, Sanja Fidler, and Raquel Urtasun. Order-embeddings of images and language. In ICLR, 2016.

Denny Vrandečić. Wikidata: A new platform for collaborative data collection. In Proceedings of the 21st international conference companion on World Wide Web, pages 1063–1064. ACM, 2012.

Ivan Vulic and Nikola Mrksic. Specialising word vectors for lexical entailment. In The 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), 2018.

Julie Weeds and David Weir. A general framework for distributional similarity. In EMNLP, pages 81–88, 2003.

Julie Weeds, Daoud Clarke, Jeremy Reffin, David Weir, and Bill Keller. Learning to distinguish hypernyms and co-hyponyms. In Proceedings of COLING 2014, pages 2249–2259, 2014.

Zhibiao Wu and Martha Palmer. Verb semantics and lexical selection. In Pro- ceedings of the 32nd Annual Meeting of the Association for Computational Lin- guistics, pages 133–138, Las Cruces, New Mexico, USA, June 1994. Asso- ciation for Computational Linguistics. doi: 10.3115/981732.981751. URL http://www.aclweb.org/anthology/P94-1019.

Yan Xu, Lili Mou, Ge Li, Yunchuan Chen, Hao Peng, and Zhi Jin. Classifying relations via long short term memory networks along shortest dependency paths. In EMNLP, 2015.

Torsten Zesch, Christof Müller, and Iryna Gurevych. Using wiktionary for computing semantic relatedness. In AAAI, volume 8, pages 861–866, 2008.

Congle Zhang and Daniel S Weld. Harvesting parallel news streams to generate paraphrases of event relations. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1776–1786, Seattle, Washington, USA, 2013.

Abstract

Many applications in the field of natural language understanding must cope with two major difficulties of human languages. The first is lexical variability: the same meaning can be expressed in several different ways. The second is polysemy: a word may have several different meanings, depending on the context in which it is used. To cope with these difficulties, one of the building blocks of natural language understanding is the recognition of lexical inferences. A lexical inference refers to a semantic relation that holds between two words or expressions, in which the meaning of one can be inferred from the other. Many semantic relations can be considered inference relations: synonyms, hypernymy (cat/animal), "part of" (London/England), and so on. Recognizing lexical relations can contribute to the performance of many applications in natural language processing (NLP). In automatic summarization, for example, inference relations can help detect duplicate or redundant sentences, when these differ from each other only in expressions that hold an inference relation between them.

Most NLP applications today rely on "word embeddings" to obtain information about the semantic similarity between words. Word embeddings are dense vectors obtained by aggregating the contexts in which a word appears in a large text corpus, following the distributional hypothesis (Harris, 1954). Word embeddings contain information about topical or associative similarity (elevator/floor) as well as functional similarity (elevator/escalator). However, many tasks require knowing the precise relation between a pair of words, and this information is not readily available in word embeddings. For a word such as "cat", the words with the closest vectors in the vector space will stand in various relations to it, for example "tail" (part of), "dog" (same category), "kitten" (more specific), and "pet" (more general).

In this work I present several algorithms that I developed for recognizing lexical inferences. In one line of work, I examined methods for identifying the semantic relation that holds between a pair of words, and proposed a model that integrates two complementary information sources: path-based and distributional. The former relies on the joint occurrences of the two words in the corpus (e.g. "cat and other animals", "animals such as cats"), while the latter uses word embeddings, which are learned from the separate occurrences of each of the words in the corpus.


The model overcomes problems that arose in previous work. First, previous work showed that models based solely on the distributional hypothesis can provide information about each of the words separately, but not about the relation between them. For example, they can answer a question such as "how likely is animal to be a category of some word?", but not "how likely is animal to be a category of cat?". When this information is combined with information from the joint occurrences of "cat" and "animal", the model is also aware of the relation between the words, as it is expressed in texts. Second, the model represents the syntactic paths of the joint occurrences with neural-network-based methods that enable better generalization across similar paths. The method achieved the best results on this task.

In another work I built a resource of paraphrases between pairs of verbal predicates, such as "X died at the age of Y" and "X lived until the age of Y". The resource was built automatically from news tweets on Twitter, relying on the assumption that different news headlines describe the same event in different words.

An additional line of work focuses on noun compounds, and in particular on the semantic relation that holds between the constituent nouns. For example, "olive oil" is oil made of olives, whereas "baby oil" is oil for babies. In the literature, noun compound interpretation follows two approaches: one classifies each compound into a relation from a pre-defined set of relations (e.g. source / purpose), while the other generates free-text paraphrases that describe the compound ("oil extracted from olives", "oil from olives", etc.). In this work I present models for both tasks. For the classification task, I used the classification model mentioned above and showed an improvement over previous work. However, the performance of all models on this dataset was mediocre, partly because of the rigid task definition. One of the conclusions from this work was that it is preferable to generate free-text paraphrases, in order to express a wider range of semantic relations. The method I developed for the paraphrasing task achieved better results than previous methods, thanks to its generalization ability. The model generalized over similar compounds (for example, predicting paraphrases for "plastic spoon" based on a similar compound on which the model was trained, such as "steel knife") as well as over similar paraphrases (for example, predicting "X made of Y" when "X extracted from Y" was observed).

The models I developed achieve better performance than competing models and even improve the best known results on several tasks. Nevertheless, it is important to note that applying lexical inferences within NLP systems requires attention to the context in which words appear, as I show in the last chapter of this work. Recently, context-dependent word embeddings have been developed, which provide a suitable representation for a word in the context in which it appears and ease the handling of polysemy, but other phenomena related to applying lexical inferences in context remain unsolved. In the conclusion of this work I share my thoughts on the remaining challenges and possible approaches for solving them in the future.

Table of Contents

Abstract ...... i

1 Introduction ...... 1

1.1 Classification into Semantic Relations ...... 2
1.2 Predicate Paraphrases ...... 4
1.3 Interpretation of Noun Compounds ...... 5
1.4 Context-sensitive Lexical Inferences ...... 7
1.5 Contributions and Structure of the Thesis ...... 8

2 Background ...... 11

2.1 Distributional Semantics ...... 11
2.2 Taxonomies and Structured Knowledge ...... 15
2.3 Benchmarks for Lexical Inference ...... 17

3 Unsupervised Measures for Hypernymy Detection ...... 20

4 Hypernymy Detection with an Integrated Path-based and Distributional Approach ...... 32

5 Path-based vs. Distributional Approaches for Classification into Semantic Relations ...... 43

6 Acquiring Predicate Paraphrases from News Tweets ...... 50

7 Classifying Noun Compounds into Semantic Relations ...... 57

8 Interpreting Noun Compounds with Paraphrases ...... 65

9 On the Importance of Context in Lexical Inference ...... 78

10 Conclusion ...... 85

10.1 Semantics Beyond the Distributional Hypothesis ...... 85

10.2 Modeling Context-sensitive Lexical Inferences ...... 86

10.3 Representing Multi-word Expressions ...... 88

10.4 Going Forward: Improving Performance on Downstream Tasks ...... 90

Bibliography ...... 92

תקציר בעברית ...... א עבודה זו בוצעה בהנחייתו של פרופ׳ עידו דגן מהמחלקה למדעי המחשב, אוניברסיטת בר אילן. למידת היסקים לקסיקליים בדיוק גבוה

חיבור לשם קבלת התואר ׳׳דוקטור לפילוסופיה׳׳

מאת: ורד שוורץ

המחלקה למדעי המחשב הפקולטה למדעים מדויקים

הוגש לסנט של אוניברסיטת בר אילן

רמת גן ניסן תשע׳׳ט