Still a Pain in the Neck: Evaluating Text Representations on Lexical Composition

Vered Shwartz Ido Dagan Computer Science Department, Bar-Ilan University, Ramat-Gan, Israel [email protected] [email protected]

Abstract

Building meaningful phrase representations is challenging because phrase meanings are not simply the sum of their constituent meanings. Lexical composition can shift the meanings of the constituent words and introduce implicit information. We tested a broad range of textual representations for their capacity to address these issues. We found that, as expected, contextualized representations perform better than static word embeddings, more so on detecting meaning shift than in recovering implicit information, in which their performance is still far from that of humans. Our evaluation suite, consisting of six tasks related to lexical composition effects, can serve future research aiming to improve representations.

1 Introduction

Modeling the meaning of phrases involves addressing semantic phenomena that pose non-trivial challenges for common text representations, which derive a phrase representation from those of its constituent words. One such phenomenon is meaning shift, which happens when the meaning of the phrase departs from the meanings of its constituent words. This is especially common among verb-particle constructions (carry on), idiomatic noun compounds (guilt trip) and other multi-word expressions (MWEs, lexical units that form a distinct concept), making them "a pain in the neck" for NLP applications (Sag et al., 2002).

A second phenomenon is common for both MWEs and free word combinations such as noun compounds and adjective-noun compositions. It happens when the composition introduces an implicit meaning that often requires world knowledge to uncover: for example, that hot refers to the temperature of tea (hot tea) but to the manner of debate (hot debate) (Hartung, 2015), or that olive oil is made of olives while baby oil is made for babies (Shwartz and Waterson, 2018).

There has been a line of attempts to learn compositional phrase representations (e.g. Mitchell and Lapata, 2010; Baroni and Zamparelli, 2010; Wieting et al., 2017; Poliak et al., 2017), but many of these are tailored to a specific type of phrase or to a fixed number of constituent words, and they all disregard the surrounding context. Recently, contextualized word representations boasted dramatic performance improvements on a range of NLP tasks (Peters et al., 2018; Radford et al., 2018; Devlin et al., 2019). Such models serve as a function for computing word representations in a given context, making them potentially more capable of addressing meaning shift. These models were also shown to capture some world knowledge (e.g. Zellers et al., 2018), which may potentially help with uncovering implicit information.

In this paper we test how well various text representations address these composition-related phenomena. Methodologically, we follow recent work that applied "black-box" testing to assess various capacities of distributed representations (e.g. Adi et al., 2017; Conneau et al., 2018). We construct an evaluation suite with six tasks related to the above two phenomena, as shown in Figure 1, and develop generic models that rely on pre-trained representations. We test six representations, including static word embeddings (Mikolov et al., 2013; Pennington et al., 2014; Bojanowski et al., 2017) and contextualized word embeddings (Peters et al., 2018; Radford et al., 2018; Devlin et al., 2019). Our contributions are as follows:

1. We created a unified framework that tests the capacity of representations to address lexical composition via classification tasks, focusing on detecting meaning shift and recovering implicit meaning.
We test six representations and provide an in-depth analysis of the results.

2. We relied on existing annotated datasets used in various tasks, and re-cast them to fit our classification framework. We additionally annotated a sample from each test set to confirm the data validity and to estimate the human performance on each task.

3. We provide the classification framework, including data and code, available at https://github.com/vered1986/lexcomp, which would allow testing future models for their capacity to address lexical composition.

Our results confirm that the contextualized word embeddings perform better than the static ones. In particular, we show that modeling context indeed contributes to recognizing meaning shift: on such tasks, contextualized models performed on par with humans. Conversely, despite hopes of filling missing information with the world knowledge provided by contextualized representations, the signal they yield for recovering implicit information is much weaker, and the gap between the best performing model and human performance remains substantial. We expect that improving on the ability of such representations to reveal implicit meaning would require more than a model training objective. In particular, one future direction is a richer training objective that simultaneously models multiple co-occurrences of constituent words across different texts, as is commonly done in noun compound interpretation (e.g. Ó Séaghdha and Copestake, 2009; Shwartz and Waterson, 2018; Shwartz and Dagan, 2018).

[Figure 1: A map of our tasks according to the type of phrase they test (MWE/free word combination) and the phenomenon they address (meaning shift/implicit meaning).]

2 Composition Tasks

We experimented with six tasks that address the meaning shift and implicit meaning phenomena, summarized in Table 1 and detailed below.

We rely on existing tasks and datasets, but make substantial changes and augmentations in order to create a uniform framework. First, we cast all tasks as classification tasks. Second, where the original datasets annotate the phrases out-of-context, we add sentential contexts by extracting average-length sentences (15-20 words) from the English Wikipedia (January 2018 dump) in which the target phrase appears. We assume that the annotation does not depend on the context, an assumption that holds in most cases, judging by the human performance scores (Section 6). Finally, we split each dataset to roughly 80% train, 10% validation, and 10% test sets, each under the lexical constraints detailed for each task.

2.1 Recognizing Verb-Particle Constructions

A verb-particle construction (VPC) consists of a head verb and a particle, typically in the form of an intransitive preposition, which changes the verb's meaning (e.g. carry on vs. carry).

Task Definition. Given a sentence s that includes a verb V followed by a preposition P, the goal is to determine whether it is a VPC or not.

Data. We use the dataset of Tu and Roth (2012), which consists of 1,348 sentences from the British National Corpus (BNC), each containing a verb V and a preposition P, annotated to whether it is a VPC or not. The dataset is derived from the 23 most frequently used phrasal verbs, focusing on six different verbs (do, get, give, have, make, take) and their combination with common prepositions or particles. To reduce lexical bias, we split the dataset by verb, i.e. the train, validation, and test sets contain distinct V and P combinations.

2.2 Recognizing Light Verb Constructions

The meaning of a light verb construction (LVC, e.g. make a decision) is mainly derived from its noun object (decision), while the meaning of its main verb (make) is "light" (Jespersen, 1965). As a rule of thumb, an LVC usage of a verb can be replaced by a verb related to the noun ("make a decision" → "decide") without changing the meaning of the sentence.

Task Definition. Given a sentence s that includes a potential light verb construction ("make a decision"), the goal is to determine whether it is an LVC or not.

Task | Data Source | Train/Val/Test Size | Input | Output | Context
VPC Classification | Tu and Roth (2012) | 919/209/220 | sentence s, VP = w1 w2 | is VP a VPC? | O
LVC Classification | Tu and Roth (2011) | 1521/258/383 | sentence s, span = w1 ... wk | is the span an LVC? | O
NC Literality | Reddy et al. (2011), Tratz (2011) | 2529/323/138 | sentence s, NC = w1 w2, target w ∈ {w1, w2} | is w literal in NC? | A
NC Relations | SemEval 2013 Task 4 (Hendrickx et al., 2013) | 1274/162/130 | sentence s, NC = w1 w2, paraphrase p | does p explicate NC? | A
AN Attributes | HeiPLAS (Hartung, 2015) | 837/108/106 | sentence s, AN = w1 w2, paraphrase p | does p describe the attribute in AN? | A
Phrase Type | STREUSLE (Schneider and Smith, 2015) | 3017/372/376 | sentence s | label per token | O

Table 1: A summary of the composition tasks included in our evaluation suite. In the context column, O means the context is part of the original dataset, while A is used for datasets in which the context was added in this work.
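The lexical splits described above (splitting by head, by adjective, or by verb so that the train, validation, and test sets share no constituent in that slot) can be implemented straightforwardly. The following is a minimal illustrative sketch, not the authors' released code; the field names and helper are hypothetical.

```python
import random
from collections import defaultdict

def lexical_split(examples, constituent, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split examples so that all items sharing the same constituent value
    (e.g. the head noun, or the verb) end up in the same fold."""
    groups = defaultdict(list)
    for ex in examples:
        groups[constituent(ex)].append(ex)
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)
    n_train = round(ratios[0] * len(keys))
    n_val = round(ratios[1] * len(keys))
    folds = (keys[:n_train], keys[n_train:n_train + n_val], keys[n_train + n_val:])
    return [[ex for key in fold for ex in groups[key]] for fold in folds]

# e.g.: train, val, test = lexical_split(nc_examples, constituent=lambda ex: ex["head"])
```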

Data. We use the dataset of Tu and Roth (2011), which contains 2,162 sentences from the BNC in which a potential light verb construction was found (with the same 6 common verbs as in Section 2.1), annotated to whether it is an LVC in the given context or not. We split the dataset lexically by the verb.

2.3 Noun Compound Literality

Task Definition. Given a noun compound NC = w1 w2 in a sentence s, and a target word w ∈ {w1, w2}, the goal is to determine whether the meaning of w in NC is literal. For instance, market has a literal meaning in flea market but flea does not.

Data. We use the dataset of Reddy et al. (2011), which consists of 90 noun compounds along with human judgments about the literality of each constituent word. Scores are given on a scale of 0-5, 0 being non-literal and 5 being literal. For each noun compound and each of its constituents we consider examples with a score ≥ 4 as literal and ≤ 2 as non-literal, ignoring the middle range. We obtain 72 literal and 72 non-literal examples.

To increase the dataset size we augment it with literal examples from the Tratz (2011) dataset of noun compound classification. Compounds in this dataset are annotated to the semantic relation that holds between w1 and w2. Most relations (except for lexicalized, which we ignore) define the meaning of NC as some trivial combination of w1 and w2, allowing us to regard both words as literal. This method produces 3,061 additional literal examples.

Task Adaptation. We add sentential contexts from Wikipedia, keeping up to 10 sentences per example. We downsample from the literal examples to balance the dataset, allowing for a ratio of up to 4 literal to non-literal examples. We split the dataset lexically by head, i.e. if w1 w2 is in one set, there are no w'1 w2 NCs in the other sets.¹

2.4 Noun Compound Relations

Task Definition. Given a noun compound NC = w1 w2 in a sentential context s, and a paraphrase p, the goal is to determine whether p describes the semantic relation between w1 and w2 or not. For example, "part that makes up body" is a valid paraphrase for body part, but "replacement part bought for body" is not.

Data. We use the data from SemEval 2013 Task 4 (Hendrickx et al., 2013). The dataset consists of 356 noun compounds annotated with 12,446 human-proposed free-text paraphrases.

Task Adaptation. The goal of the SemEval task was to generate a list of free-form paraphrases for a given noun compound, which explicate the implicit semantic relation between its constituents. To match our other tasks, we cast the task as a binary classification problem where the input is a noun compound NC and a paraphrase p, and the goal is to predict whether p is a correct description of NC.

The positive examples for an NC w1 w2 are trivially derived from the original data by sampling up to 5 of its paraphrases and creating a (NC, p, TRUE) example for each paraphrase p. The same number of negative examples is then created using negative sampling from the paraphrase templates of other noun compounds w'1 w2 and w1 w'2 in the dataset that share a constituent word with NC. For example, "replacement part bought for body" is a negative example constructed from the paraphrase template "replacement [w2] bought for [w1]", which appeared for car part. We require one shared constituent in order to form more fluent paraphrases (which would otherwise be easily classifiable as negative). To reduce the chances of creating negative examples which are in fact valid paraphrases, we only consider negative paraphrases whose verbs never occurred in the positive paraphrase set for the given NC.

We add sentential contexts from Wikipedia, randomly selecting one sentence per example, and split the dataset lexically by head.

2.5 Adjective-Noun Attributes

Task Definition. Given an adjective-noun composition AN in a sentence s, and an attribute AT, the goal is to determine whether AT is implicitly conveyed in AN. For example, the attribute temperature is conveyed in hot water, but not in hot argument (emotionality).

Data. We use the HeiPLAS dataset (Hartung, 2015), which contains 1,598 adjective-noun compositions annotated with their implicit attribute meaning. The data was extracted from WordNet and manually filtered. The label set consists of attribute synsets in WordNet which are linked to adjective synsets.

Task Adaptation. Since the dataset is small and the number of labels is large (254), we recast the task as a binary classification task. The input to the task is an AN and a paraphrase created from the template "[A] refers to the [AT] of [N]" (e.g. "loud refers to the volume of thunder"). The goal is to predict whether the paraphrase is correct or not with respect to the given AN.

We create up to 3 negative instances for each positive instance by replacing AT with another attribute that appeared with either A or N, for example (hot argument, temperature, False). To reduce the chance that the negative attribute is in fact a valid attribute for AN, we compute the Wu-Palmer similarity (Wu and Palmer, 1994) between the original and the negative attribute, taking only attributes whose similarity to the original attribute is below a threshold (0.4); a sketch of this filtering step is given below. Similarly to the previous task, we attach a context sentence from Wikipedia to each example. Finally, we split the dataset lexically by adjective, i.e. if A N is in one set, there are no A N' examples in the other sets.

2.6 Identifying Phrase Type

The last task consists of multiple phrase types and addresses detecting both meaning shift and implicit meaning.

Task Definition. The task is defined as sequence labeling with BIO tags. Given a sentence, each word is classified to whether it is part of a phrase, and to the specific type of the phrase.

Data. We use the STREUSLE corpus (Supersense-Tagged Repository of English with a Unified Semantics for Lexical Expressions, Schneider and Smith, 2015). The corpus contains texts from the web reviews portion of the English Web Treebank, along with various semantic annotations, from which we use the BIO annotations. Each token is labeled with a tag: a B-X tag marks the beginning of a span of type X, I occurs inside a span, and O outside of it. B labels mark specific types of phrase.²

Task Adaptation. We are interested in a simpler version of the annotations. Specifically, we exclude the discontinuous spans (e.g. a span like "turn the [TV] off" would not be considered as part of a phrase). The corpus distinguishes between "strong" MWEs (fixed or idiomatic phrases) and "weak" MWEs (ad hoc compositional phrases). The weak MWEs are untyped, hence we label them as COMP (compositional).

3 Representations

We experimented with 6 common word representations from two different families, detailed below.

¹ We chose to split by head rather than by modifier based on the majority baseline that achieved better performance.

² Sorted by frequency: …, weak (compositional) MWE, verb-particle construction, verbal idioms, prepositional phrase, auxiliary, adposition, discourse / pragmatic expression, inherently adpositional verb, adjective, …, …, light verb construction, non-…, full verb or …, ….
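As referenced in Section 2.5, a minimal sketch of the attribute-filtering step using NLTK's WordNet interface follows. The synset-selection heuristic (first noun synset) and the fallback behaviour are assumptions; the paper only specifies the Wu-Palmer threshold of 0.4.

```python
from nltk.corpus import wordnet as wn

def is_valid_negative_attribute(original_attr, candidate_attr, threshold=0.4):
    """Keep a negatively-sampled attribute only if its Wu-Palmer similarity
    to the original attribute is below the threshold."""
    orig_synsets = wn.synsets(original_attr, pos=wn.NOUN)
    cand_synsets = wn.synsets(candidate_attr, pos=wn.NOUN)
    if not orig_synsets or not cand_synsets:
        return True  # assumption: keep candidates with no WordNet entry
    similarity = orig_synsets[0].wup_similarity(cand_synsets[0])
    return similarity is None or similarity < threshold

# e.g.: is_valid_negative_attribute("temperature", "emotionality")
```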

Model | Training objective | Corpus (#words) | Dimension | Basic unit

Word embeddings:
WORD2VEC | Predicting surrounding words | Google News (100B) | 300 | word
GLOVE | Predicting co-occurrence probability | Wikipedia + Gigaword 5 (6B) | 300 | word
FASTTEXT | Predicting surrounding words | Wikipedia + UMBC + statmt.org (16B) | 300 | subword

Contextualized word embeddings:
ELMO | Language model | 1B Word Benchmark (1B) | 1024 | character
OPENAI GPT | Language model | BooksCorpus (800M) | 768 | subword
BERT | Masked language model (Cloze) | BooksCorpus + Wikipedia (3.3B) | 768 | subword

Table 2: Architectural differences of the specific pre-trained representations used in this paper.

Table 2 summarizes the differences between the pretrained models used in this paper.

Word Embeddings. Word embedding models provide a fixed d-dimensional vector for each word in the vocabulary. Their training is based on the distributional hypothesis, according to which semantically-similar words tend to appear in the same contexts (Harris, 1954). word2vec (Mikolov et al., 2013) can be trained with one of two objectives. We use the embeddings trained with the Skip-Gram objective, which predicts the context words given the target word. GloVe (Pennington et al., 2014) learns word vectors with the objective of estimating the log-probability of a word pair co-occurrence. fastText (Bojanowski et al., 2017) extends word2vec by adding information about subwords (bag of character n-grams). This is especially helpful in morphologically-rich languages, but can also help handling rare or misspelled words in English.

Contextualized Word Embeddings are functions computing dynamic word embeddings for words given their context sentence, largely addressing polysemy. They are pre-trained as general-purpose language models using a large-scale unannotated corpus, and can later be used as a representation layer in downstream tasks (either fine-tuned to the task with the other model parameters, or kept fixed). The representations used in this paper have multiple output layers. We either use only the last layer, which was shown to capture semantic information (Peters et al., 2018), or learn a task-specific scalar mix of the layers (see Section 6).

ELMo (Embeddings from Language Models, Peters et al., 2018) are obtained by learning a character-based language model using a deep biLSTM (Graves and Schmidhuber, 2005). Working at the character level allows using morphological clues to form robust representations for out-of-vocabulary words unseen in training. The OpenAI GPT (Generative Pre-Training, Radford et al., 2018) has a similar training objective, but the underlying encoder is a transformer (Vaswani et al., 2017). It uses subwords as the basic unit, employing byte-pair encoding. Finally, BERT (Bidirectional Encoder Representations from Transformers, Devlin et al., 2019) is also based on the transformer, but it is bidirectional as opposed to left-to-right as in the OpenAI GPT, and the directions are dependent as opposed to ELMo's independently trained left-to-right and right-to-left LSTMs. It also introduces a somewhat different objective called "masked language model": during training, some tokens are randomly masked, and the objective is to restore them from the context.

4 Classification Models

We implemented minimal "Embed-Encode-Predict" models that use the representations from Section 3 as inputs, keeping them fixed during training. The rationale behind the model design was to keep the models uniform for easy comparison between the representations, and to make them simple, so that the model's success can be attributed directly to the input representations.

Embed. We use the embedding model to embed each word in the sentence s = w_1 ... w_n, obtaining:

$\vec{v}_1, \ldots, \vec{v}_n = \mathrm{Embed}(s)$   (1)

where $\vec{v}_i$ stands for the word embedding of word $w_i$, which may be computed as a function of the entire sentence (in the case of contextualized word embeddings).

Depending on the specific task, we may have another input $w'_1, \ldots, w'_l$ to embed separately from the sentence: the paraphrases in the NC Relations and AN Attributes tasks, and the target word in the NC Literality task (to obtain an out-of-context representation of the target word). We embed this extra input as follows:

$\vec{v}'_1, \ldots, \vec{v}'_l = \mathrm{Embed}(w'_1, \ldots, w'_l)$   (2)

Encode. We encode the embedded sequences $\vec{v}_1, \ldots, \vec{v}_n$ and $\vec{v}'_1, \ldots, \vec{v}'_l$ using one of the following 3 encoder variants. As opposed to the pre-trained embeddings, the encoder parameters are updated during training to fit the specific task.

- biLM: Encoding the embedded sequence using a biLSTM with a hidden dimension d, where d is the input embedding dimension:

$\vec{u}_1, \ldots, \vec{u}_n = \mathrm{biLSTM}(\vec{v}_1, \ldots, \vec{v}_n)$   (3)

- Att: Encoding the embedded sequence using self-attention. Each word is represented as the concatenation of its embedding and a weighted average over the other words in the sentence:

$\vec{u}_i = [\vec{v}_i; \sum_{j=1}^{n} a_{i,j} \cdot \vec{v}_j]$   (4)

The weights $a_i$ are computed by applying a dot-product between $\vec{v}_i$ and every other word, and normalizing the scores using softmax:

$\vec{a}_i = \mathrm{softmax}(\vec{v}_i^{\,T} \cdot \vec{v})$   (5)

- None: in which we don't encode the embedded text, but simply define:

$\vec{u}_1, \ldots, \vec{u}_n = \vec{v}_1, \ldots, \vec{v}_n$   (6)

For all encoder variants, $\vec{u}_i$ stands for the vector representing $w_i$, which is used as input to the classifier.

Predict. We represent a span by concatenating its end-point vectors, e.g. $\vec{u}_{i...i+k} = [\vec{u}_i; \vec{u}_{i+k}]$ is the target span vector of $w_i, \ldots, w_{i+k}$. In tasks which require a second span, we similarly compute $\vec{u}'_{1...l}$, representing the encoded span $w'_1, \ldots, w'_l$ (e.g. the paraphrase in NC Relations). The input to the classifier is the concatenation of $\vec{u}_{i...i+k}$ and, when applicable, the additional span vector $\vec{u}'_{1...l}$. In the general case, the input to the classifier is:

$\vec{x} = [\vec{u}_i; \vec{u}_{i+k}; \vec{u}'_1; \vec{u}'_l]$   (7)

where each of $\vec{u}_{i+k}$, $\vec{u}'_1$, and $\vec{u}'_l$ can be empty vectors in the cases of single-word spans or no additional inputs.

The classifier output is defined as:

$\vec{o} = \mathrm{softmax}(W \cdot \mathrm{ReLU}(\mathrm{Dropout}(h(\vec{x}))))$   (8)

where h is a 300-dimensional hidden layer, the dropout probability is 0.2, $W \in \mathbb{R}^{k \times 300}$, and k is the number of class labels for the specific task.
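The following is a minimal PyTorch sketch of the Att encoder (Eqs. 4-5) and the Predict step (Eqs. 7-8) for a single target span. It assumes pre-computed, fixed embeddings as input; the class and variable names are illustrative and not taken from the authors' released AllenNLP code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttEncoder(nn.Module):
    """'Att' variant: concatenate each embedding with a dot-product-attention
    weighted average of the sentence (Eqs. 4-5)."""
    def forward(self, v):                           # v: (seq_len, emb_dim), kept fixed
        weights = torch.softmax(v @ v.t(), dim=-1)  # a_i = softmax(v_i . v_j)
        context = weights @ v                       # weighted average over the sentence
        return torch.cat([v, context], dim=-1)      # u_i = [v_i ; sum_j a_ij v_j]

class SpanClassifier(nn.Module):
    """'Predict' step for a single span: end-point concatenation, a 300-d hidden
    layer, dropout 0.2, and a softmax over the task's labels (Eqs. 7-8)."""
    def __init__(self, enc_dim, num_labels):
        super().__init__()
        self.hidden = nn.Linear(2 * enc_dim, 300)
        self.dropout = nn.Dropout(0.2)
        self.out = nn.Linear(300, num_labels)

    def forward(self, u, i, j):                     # u: encoded sentence, span w_i..w_j
        x = torch.cat([u[i], u[j]], dim=-1)         # x = [u_i ; u_{i+k}]
        return torch.softmax(self.out(F.relu(self.dropout(self.hidden(x)))), dim=-1)

# Usage with fixed, pre-computed embeddings (e.g. the top BERT or ELMo layer):
# u = AttEncoder()(embeddings)                      # (seq_len, 2 * emb_dim)
# probs = SpanClassifier(2 * embeddings.size(-1), num_labels=2)(u, i, j)
```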

Implementation Details. We implemented the models using the AllenNLP library (Gardner et al., 2018), which is based on the PyTorch framework (Paszke et al., 2017). We train them for up to 500 epochs, stopping early if the validation performance doesn't improve in 20 epochs.

The phrase type model is a sequence tagging model that predicts a label for each embedded (potentially encoded) word $w_i$. During decoding, we enforce a single constraint that requires that a B-X tag must precede I tag(s).

5 Baselines

5.1 Human Performance

The human performance on each task can be used as a performance upper bound, which shows both the inherent ambiguity in the task and the limitations of the particular dataset. We estimated the human performance on each task by sampling and re-annotating 100 examples from each test set. The annotation was carried out in Amazon Mechanical Turk. We asked 3 workers to annotate each example, taking the majority label as the final prediction. To control the quality of the annotations, we required that workers have an acceptance rate of at least 98% on at least 500 prior HITs, and had them pass a qualification test.

In each annotation task, we showed the workers the context sentence with the target span highlighted, and asked them questions regarding the target span, as exemplified in Table 3. In addition to the possible answers given in the table, annotators were always given the choice of "I can't tell" or "the sentence does not make sense".

Task (agreement) | Example | Question
VPC Classification (84.17%) | I feel there are others far more suited to take on the responsibility. | What is the verb in the highlighted span? (take / take on)
LVC Classification (83.78%) | Jamie made a decision to drop out of college. | Mark all that apply to the highlighted span in the given context: 1. It describes an action of "making something", in the common meaning of "make". 2. The essence of the action is described by "decision". 3. The span could be rephrased without "make" but with a verb like "decide", without changing the meaning of the sentence.
NC Literality (80.81%) | He is driving down memory lane and reminiscing about his first love. | Is "lane" used literally or non-literally? (literal / non-literal)
NC Relations (86.21%) | Strawberry shortcakes were held as celebrations of the summer fruit harvest. | Can "summer fruit" be described by "fruit that is ripe in the summer"? (yes/no)
AN Attributes (86.42%) | Send my warm regards to your parents. | Does "warm" refer to temperature? (yes/no)

Table 3: The worker agreement (%) and the question(s) displayed to the workers in each annotation task.

Model Family | VPC Classification (Acc) | LVC Classification (Acc) | NC Literality (Acc) | NC Relations (Acc) | AN Attributes (Acc) | Phrase Type (F1)
Majority Baselines | 23.6 | 43.7 | 72.5 | 50.0 | 50.0 | 26.6
Word Embeddings | 60.5 | 74.6 | 80.4 | 51.2 | 53.8 | 44.0
Contextualized | 90.0 | 82.5 | 91.3 | 54.3 | 65.1 | 64.8
Human | 93.8 | 83.8 | 91.0 | 77.8 | 86.4 | -

Table 4: Summary of the best performance of each family of representations on the various tasks. The evaluation metric is accuracy except for the phrase type task in which we report span-based F1 score, excluding O tags.
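For the Phrase Type task, span-based F1 excluding O tags can be computed directly over the BIO sequences, for instance with the seqeval package. The paper does not name its evaluation code, so this is only an assumed illustration of the metric.

```python
from seqeval.metrics import f1_score

# Gold and predicted BIO tag sequences, one list per sentence.
gold = [["B-VPC", "I-VPC", "O", "B-LVC", "I-LVC", "O"]]
pred = [["B-VPC", "I-VPC", "O", "O", "O", "O"]]

# Spans are credited only on exact match of boundaries and type; O tags do not count.
print(f1_score(gold, pred))  # 2/3 here: one of two gold spans recovered
```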

Of all the annotation tasks, the LVC classification task was more challenging and required careful examination of the different criteria for LVCs. In the example given in Table 3 with the candidate LVC "make a decision", we considered a worker's answer as positive (LVC) either if: 1) the worker indicated that "make a decision" does not describe an action of "making something" AND that the essence of "make a decision" is in the word "decision"; or 2) the worker indicated that "make a decision" can be replaced in the given sentence with "decide" without changing the meaning of the sentence. The second criterion was given in the original guidelines of Tu and Roth (2011). The replacement verb "decide" was selected as it is linked to "decision" in WordNet in the derivationally-related relation.

We didn't compute the estimated human performance on the phrase type task, which is more complicated and requires expert annotation.

5.2 Majority Baselines

We implemented three majority baselines:

- MajorityALL is computed by assigning the most common label in the training set to all the test items. Note that the label distribution may be different between the train and test sets, resulting in accuracy < 50% even for binary classification tasks.

- Majority1 assigns for each test item the most common label in the training set for items with the same first constituent. For example, in the VPC classification task, it classifies get through as positive in all its contexts because the verb get appears in more positive than negative examples.

- Majority2 symmetrically assigns the label according to the last (typically second) constituent.

6 Results

Table 4 displays the best performance of each model family on the various tasks.

Representations. The general trend across tasks is that the performance improves from the majority baselines through word embeddings and to the contextualized word representations, with a large gap in some of the tasks. Among the contextualized word embeddings, BERT performed best on 4 out of 6 tasks, with no consistent preference between ELMo and the OpenAI GPT. The best word embedding representations were GloVe (4/6) followed by fastText (2/6).
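A minimal sketch of the majority baselines of Section 5.2 follows; it is illustrative only, and the fallback for unseen constituents is an assumption not spelled out in the paper.

```python
from collections import Counter, defaultdict

def majority_all(train_labels, n_test):
    """MajorityALL: the most common training label is assigned to every test item."""
    return [Counter(train_labels).most_common(1)[0][0]] * n_test

def majority_by_constituent(train, test, position):
    """Majority1 (position=0) / Majority2 (position=-1): assign the most common
    training label among items sharing the first / last constituent."""
    per_constituent = defaultdict(Counter)
    for words, label in train:
        per_constituent[words[position]][label] += 1
    fallback = Counter(label for _, label in train).most_common(1)[0][0]
    return [per_constituent[words[position]].most_common(1)[0][0]
            if words[position] in per_constituent else fallback
            for words, _ in test]

# e.g.: majority_by_constituent(train, test, position=0) classifies "get through"
# by the majority label of all training items whose first constituent is "get".
```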

Model | VPC Classification (Layer / Encoding) | LVC Classification | NC Literality | NC Relations | AN Attributes | Phrase Type
ELMo | All / Att | All / biLM | All / Att/None | Top / biLM | All / None | All / biLM
OpenAI GPT | All / None | Top / Att/None | Top / None | All / biLM | Top / None | All / biLM
BERT | All / Att | All / biLM | All / Att | All / None | All / None | All / biLM

Table 5: The best setting (layer and encoding) for each contextualized word embedding model on the various tasks. Bold entries are the best performers on each task.
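The "All" layer setting corresponds to a task-learned scalar mix of the representation's output layers (as in ELMo). A minimal PyTorch sketch of such a mix follows; the parametrization is assumed to be the standard one (AllenNLP ships an equivalent ScalarMix module), not necessarily the authors' exact configuration.

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Weighted sum of layer outputs: gamma * sum_l softmax(w)_l * h_l."""
    def __init__(self, num_layers):
        super().__init__()
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))
        self.gamma = nn.Parameter(torch.ones(1))

    def forward(self, layers):                # layers: (num_layers, seq_len, dim)
        coefficients = torch.softmax(self.layer_weights, dim=0)
        return self.gamma * torch.einsum("l,lsd->sd", coefficients, layers)

# e.g.: mixed = ScalarMix(num_layers=3)(elmo_layers)   # one vector per token
```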

Model | VPC Classification | LVC Classification | NC Literality | NC Relations | AN Attributes | Phrase Type
word2vec | biLM | Att | biLM/Att | None | - | None/biLM
GloVe | biLM | Att | Att | biLM | - | biLM
fastText | Att | biLM | biLM | biLM | Att | biLM

Table 6: The best encoding for each word embedding model on the various tasks. Bold entries are the best performers on each task. Dash marks no preference.

Phenomena. The gap between the best model performance (achieved by the contextualized representations) and the estimated human performance varies considerably across tasks. The best performance in NC Literality is on par with human performance, and only a few points short of that in VPC Classification and LVC Classification. This is evidence for the utility of contextualized word embeddings in detecting meaning shift, which has positive implications for the yet unsolved problem of detecting MWEs. Conversely, the gap between the best model and the human performance is as high as 23.5 and 21.3 points in NC Relations and AN Attributes, respectively, suggesting that tasks requiring revealing implicit meaning are more challenging for the existing representations.

Layer. Table 5 elaborates on the best setting for each representation on the various tasks. In most cases, there was a preference to learning a scalar mix of the layers rather than using only the top layer. We extracted the learned layer weights for each of the All models, and found that the model usually learned a balanced mix of the top and bottom layers.

Encoding. We did not find one encoding setting that performed best across tasks and contextualized word embedding models. Instead, it seems that tasks related to meaning shift typically prefer Att or no encoding, while tasks related to implicit meaning performed better with either biLM or no encoding.

When it comes to word embedding models, Table 6 shows that biLM was preferred more often. This is not surprising. A contextualized word embedding of a certain word is, by definition, already aware of the surrounding words, obviating the need for a second layer of order-aware encoding. A word embedding based model, on the other hand, must rely on a biLSTM to learn the same.

Finally, the best settings on the Phrase Type task use biLM across representations. It may suggest that predicting a label for each word can benefit from a more structured modelling of word order. Looking into the errors made by the best model (BERT+All+biLM) reveals that most of the errors were predicting O, i.e. missing the occurrence of a phrase. With respect to specific phrase types, near perfect performance was achieved among the more syntactic categories: specifically, auxiliary ("Did they think we were [going to] feel lucky to get any reservation at all?"), ("any longer"), and ("a bit"). In accordance with the VPC Classification task, the VPC label achieved 85% accuracy; 10% were missed (classified as O) and 5% were confused with a "weak" MWE. Two of the more difficult types were "weak" MWEs (which are judged as more compositional and less idiomatic) and idiomatic verbs. The former achieved accuracy of 22% (68% were classified as O) and the latter only 8% (62% were classified as O). Overall it seems that the model relied mostly on syntactic cues, failing to recognize semantic subtleties such as idiomatic meaning and level of compositionality.

7 Analysis

We focus on the contextualized word embeddings, and look into the representations they provide.

7.1 Meaning Shift

Does the representation capture VPCs? The best performer on the VPC Classification task was BERT+All+Att. To get a better idea of the signal that BERT contains for VPCs, we chose several ambiguous verb-preposition pairs in the dataset. We define a verb-preposition pair as ambiguous if it appeared in at least 8 examples as a VPC and at least 8 examples as a non-VPC. For a given pair we computed the BERT representations of the sentences in which it appears in the dataset, and, similarly to the model, we represented the pair as the concatenation of its word vectors. In each vector we averaged the layers using the weights learned by the model. Finally, we projected the computed vectors into 2D space using t-SNE (Maaten and Hinton, 2008). Figure 2 demonstrates 4 example pairs. The other pairs we plotted had similar t-SNE plots, confirming that the signal for separating different verb usages comes directly from BERT.

[Figure 2: t-SNE projection of BERT representations of verb-preposition candidates for VPC. Blue (dark) points are positive examples and orange (light) points are negative.]

Non-literality as a rare sense. Nunberg et al. (1994) considered some non-literal compounds as "idiosyncratically decomposable", i.e. ones which can be decomposed to possibly rare senses of their constituents, as in considering bee to have a sense of "competition" in spelling bee and crocodile to stand for "manipulative" in crocodile tears. Using this definition, we could possibly use the NC literality data for word sense induction, in which recent work has shown that contextualized word representations are successful (Stanovsky and Hopkins, 2018; Amrami and Goldberg, 2018). We are interested in testing not only whether the contextualized models are capable of detecting rare senses induced by non-literal usage, which we have confirmed in Section 6, but whether they can also model these senses. To that end, we sample target words that appear in both literal and non-literal examples, and use each contextualized word embedding model as a language model to predict the best substitutes of the target word in each context.

Table 7 exemplifies some of these predictions. Bold words are words judged reasonable in the given context, even if they don't have the exact same meaning as the target word. It is apparent that there are more reasonable substitutes for the literal examples, across models (left part of the table), but BERT performs better than the others. The OpenAI GPT shows a clear disadvantage of being uni-directional, often choosing a substitute that doesn't go well with the next word ("a train to from").

The success is only partial among the non-literal examples. While some correct substitutes are predicted for (guilt) trip, the predictions are much worse for the other examples. The meaning of diamond in diamond wedding is "60th", and ELMo makes the closest prediction, 10th (which would make it "tin wedding"). 400th is a borderline prediction, because it is also an ordinal number, but an unreasonable one in the context of years of marriage. Finally, the last example, snake oil, is unsurprisingly a difficult one, possibly "non-decomposable" (Nunberg et al., 1994), as both constituents are non-literal. Some predicted substitutes, rogue and charmer, are valid replacements for the entire noun compound (e.g. "you are a rogue salesman"). Others go well with the literal meaning of snake, creating phrases denoting concepts which can indeed be sold (snake egg, snake skin). Overall, while contextualized representations excel at detecting shifts from the common meanings of words, their ability to obtain meaningful representations for such rare senses is much more limited.
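For BERT, the substitute-prediction procedure described above corresponds to a masked language model query. The following sketch uses the HuggingFace transformers library, which is an assumption about tooling (it is not necessarily what the authors used, and ELMo and the OpenAI GPT require different code).

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

def top_substitutes(sentence, target, k=5):
    """Mask the target word and return BERT's top-k fillers with probabilities."""
    masked = sentence.replace(target, tokenizer.mask_token, 1)
    inputs = tokenizer(masked, return_tensors="pt")
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0, 0]
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits[0, mask_pos], dim=-1)
    values, indices = torch.topk(probs, k)
    return [(tokenizer.convert_ids_to_tokens(int(i)), float(p))
            for p, i in zip(values, indices)]

# e.g.: top_substitutes("The Queen and her husband were on a train trip "
#                       "from Sydney to Orange.", "trip")
```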
which can excel at detecting shifts from the common mean- ELMo OpenAI GPT BERT ELMo OpenAI GPT BERT Creating a guilt [trip] in another person may be considered to The Queen and her husband were on a train [trip] from N L be psychological manipulation in the form of punishment for a Sydney to Orange. perceived transgression. ride 1.24% to 0.02% travelling 19.51% tolerance 0.44% that 0.03% reaction 8.32% carriage 1.02% headed 0.01% running 8.20% fest 0.23% so 0.02% feeling 8.17% journey 0.73% heading 0.01% journey 7.57% avoidance 0.16% trip 0.01% attachment 8.12% heading 0.72% that 0.009%going 6.56% onus 0.15% he 0.01% sensation 4.73% carrying 0.39% and 0.005%headed 5.75% association 0.14% she 0.01% note 3.31%

Richard Cromwell so impressed the king with his valour, that She became the first British monarch to celebrate a [diamond]N he was given a [diamond]L ring from the king’s own finger. wedding anniversary in November 2007. diamond 0.23% and 0.01% silver 15.99% customary 0.20% new 0.11% royal 1.58% wedding 0.19% of 0.01% gold 14.93% royal 0.17% british 0.02% 1912 1.23% pearl 0.18% to 0.01% diamond 13.18% sacrifice 0.15% victory 0.01% recent 1.10% knighthood 0.16% ring 0.01% golden 12.79% 400th 0.13% french 0.01% 1937 1.08% hollow 0.15% in 0.01% new 4.61% 10th 0.13% royal 0.01% 1902 1.08%

China is attempting to secure its future [oil]L share and establish Put bluntly, I believe you are a snake [oil]N salesman, a deals with other countries. narcissist that would say anything to draw attention to himself. beyond 0.44% in 0.01% market 98.60% auto 0.52% in 0.05% oil 32.5% engagement 0.44% and 0.01% export 0.45% egg 0.42% and 0.01% pit 2.94% market 0.34% for 0.01% trade 0.14% hunter 0.42% that 0.01% bite 2.65% nuclear 0.33% government 0.01% trading 0.09% consummate 0.39% of 0.01% skin 2.36% majority 0.29% supply 0.01% production 0.04% rogue 0.37% charmer 0.008%jar 2.23%

Table 7: Top substitutes for a target word in literal (left) and non-literal (right) contexts, along with model scores. Bold words are words judged reasonable (not necessarily meaning preserving) in the given context, and underlined words are suitable substitutes for the entire noun compound, but not for a single constituent.

7.2 Implicit Meaning

The performance of the various models on the tasks that involve revealing implicit meaning is substantially worse than on the other tasks. In NC Relations, ELMo performs best with the biLM-encoded model using only the top layer of the representation, surpassing the majority baseline by only 4.3 points in accuracy. The best performer in AN Attributes is BERT, with no encoding and using all the layers, achieving accuracy of 65.1%, well above the majority baseline (50%).

We are interested in finding out where the knowledge of the implicit meaning originates. Is it encoded in the phrase representation itself, or does it appear explicitly in the context sentences? Finally, could it be that the performance gap from the majority baseline is due to the models learning to recognize which paraphrases are more probable than others, regardless of the phrase itself?³

To try to answer this question, we performed ablation tests for each of the tasks, using the best performing setting for each (ELMo+Top+biLM for NC Relations and BERT+All+None for AN Attributes). We trained the following models (-X signifies the ablation of X):

1. -Phrase: where we mask the phrase in its context sentence, e.g. replacing "Today, the house has become a wine bar or bistro called Barokk" with "Today, the house has become a something or bistro called Barokk". Success in this setting may indicate that the implicit information is given explicitly in some of the context sentences.

2. -Context: the out-of-context version of the original task, in which we replace the context sentence by the phrase itself, as in setting it to "wine bar". Success in this setting may indicate that the phrase representation contains this implicit information.

3. -(Context+Phrase): in which we omit the context sentence altogether, and provide only the paraphrase, as in "bar where people buy and drink wine". Success in this setting may indicate that negative sampled paraphrases form sentences which are less probable in English.

³ A somewhat similar phenomenon was recently reported by Senaldi et al. (2019). Their model managed to distinguish idioms from non-idioms, but their ablation study showed the model was in fact learning to distinguish between abstract contexts (in which idioms tend to appear) and concrete ones.

Model | NC Relations | AN Attributes
Majority | 50.0 | 50.0
-Phrase | 50.0 | 55.66
-Context | 45.06 | 63.21
-(Context+Phrase) | 45.06 | 59.43
Full Model | 54.3 | 65.1

Table 8: Accuracy scores of ablations of the phrase, context sentence, and both features from the best models in the NC Relations and AN Attributes tasks (ELMo+Top+biLM and BERT+All+None, respectively).

Table 8 shows the results of this experiment. A first observation is that the full model performs best on both tasks, suggesting that the model captures implicit meaning from various sources. In the NC Relations, all variants perform on par or worse than the majority baseline, achieving a few points less than the full model. In the AN Attributes task it is easier to see that the phrase (AN) is important for the classification, while the context is secondary.

8 Related Work

Probing Tasks. One way to test whether dense representations capture a certain linguistic property is to design a probing task for this property, and build a model that takes the tested representation as an input. This kind of "black box" testing has become popular recently. Adi et al. (2017) studied whether sentence embeddings capture properties such as sentence length and word order. Conneau et al. (2018) extended their work with a large number of sentence embeddings, and tested various properties at the surface, syntactic, and semantic levels. Others focused on intermediate representations in neural machine translation systems (e.g. Shi et al., 2016; Belinkov et al., 2017; Dalvi et al., 2017; Sennrich, 2017), or on specific linguistic properties such as agreement (Giulianelli et al., 2018) and tense (Bacon and Regier, 2018).

More recently, Tenney et al. (2019) and Liu et al. (2019) each designed a suite of tasks to test contextualized word embeddings on a broad range of sub-sentence tasks, including part-of-speech tagging, syntactic constituent labeling, dependency parsing, named entity recognition, semantic role labeling, coreference resolution, semantic proto-roles, and relation classification. Tenney et al. (2019) found that all the models produced strong representations for syntactic phenomena, but gained smaller performance improvements upon the baselines in the more semantic tasks. Liu et al. (2019) found that some tasks (e.g., identifying the tokens that comprise the conjuncts in a coordination construction) required fine-grained linguistic knowledge which was not available in the representations unless they were fine-tuned for the task. To the best of our knowledge, we are the first to provide an evaluation suite consisting of tasks related to lexical composition.

Lexical Composition. There is a vast literature on multi-word expressions in general (e.g. Sag et al., 2002; Vincze et al., 2011), and research focusing on noun compounds (e.g. Nakov, 2013; Nastase et al., 2013), adjective-noun compositions (e.g. Baroni and Zamparelli, 2010; Boleda et al., 2013), verb-particle constructions (e.g. Baldwin, 2005; Pichotta and DeNero, 2013), and light verb constructions (e.g. Tu and Roth, 2011; Chen et al., 2015).

In recent years, word embeddings have been used to predict the compositionality of phrases (Salehi et al., 2015; Cordeiro et al., 2016), and to identify the implicit relation in adjective-noun compositions (Hartung et al., 2017) and in noun compounds (Surtani and Paul, 2015; Dima, 2016; Shwartz and Waterson, 2018; Shwartz and Dagan, 2018).

Pavlick and Callison-Burch (2016) created a simpler variant of the recognizing textual entailment task (RTE, Dagan et al., 2013) that tests whether an adjective-noun composition entails the noun alone and vice versa in a given context. They tested various standard models for RTE and found that the models performed poorly with respect to this phenomenon. To the best of our knowledge, contextualized word embeddings haven't been employed for tasks related to lexical composition yet.

Phrase Representations. With respect to obtaining meaningful phrase representations, there is a prominent line of work in learning a composition function over pairs of words. Mitchell and Lapata (2010) suggested simple composition via vector arithmetics. Baroni and Zamparelli (2010) and later Maillard and Clark (2015) treated adjectival modifiers as functions that operate on nouns and change their meanings, and represented them as matrices. Zanzotto et al. (2010) and Dinu et al.
(2013) extended this approach and composed any two words by multiplying each word vector by a composition matrix. These models start by computing the phrase's distributional representation (i.e. treating it as a single token) and then learning a composition function that approximates it.

The main drawbacks of this approach are that it assumes compositionality and that it operates on phrases with a pre-defined number of words. Moreover, we can expect the resulting compositional vectors to capture properties inherited from the constituent words, but it is unclear whether they also capture new properties introduced by the phrase. For example, the compositional representation of olive oil may capture properties like green (from olive) and fat (from oil), but would it also capture properties like expensive (a result of the extraction process)?

Alternatively, other approaches were suggested for learning general phrase embeddings, either using direct supervision for paraphrase similarity (Wieting et al., 2016), indirectly from an extrinsic task (Socher et al., 2012), or in an unsupervised manner by extending the word2vec objective (Poliak et al., 2017). While they don't have constraints on the phrase length, these methods still suffer from the two other drawbacks above: they assume that the meaning of the phrase can always be composed from its constituent meanings, and it is unclear whether they can incorporate implicit information and new properties of the phrase. We expected that contextualized word embeddings, which assign a different vector for a word in each given context, would address at least the first issue by producing completely different vectors to literal vs. non-literal word occurrences.

A study on how L2 learners process idioms found that the most common and successful strategies were inferring from the context (57% success) and relying on the literal meanings of the constituent words (22% success) (Cooper, 1999). As opposed to distributional models that aim to learn from a large number of (possibly noisy and uninformative) contexts, the sentential contexts in this experiment were manually selected, and a follow-up study found that extended contexts (stories) help the interpretation further (Asl, 2013). The participants didn't simply rely on adjacent words or phrases, but also employed reasoning. For example, in the sentence "Robert knew that he was robbing the cradle by dating a sixteen-year-old girl", the participants inferred that 16 is too young to date, combined it with the knowledge that cradle is where a baby sleeps, and concluded that rob the cradle means dating a very young person. This level of context modeling seems to be beyond the scope of current text representations.

We expect that improving the ability of representations to reveal implicit meaning will require training them to handle this specific phenomenon. Our evaluation suite, the data and code will be made available. It is easily extensible, and may be used in the future to evaluate new representations for their ability to address lexical composition.

Acknowledgments

This work was supported in part by an Intel ICRI-CI grant, the Israel Science Foundation grant 1951/17, and the German Research Foundation through the German-Israeli Project Cooperation (DIP, grant DA 1600/1-1). Vered is also supported by the Clore Scholars Programme (2017).

9 Discussion and Conclusion References

We showed that contextualized word representa- Yossi Adi, Einat Kermany, Yonatan Belinkov, Ofer tions perform generally better than static word em- Lavi, and Yoav Goldberg. 2017. Fine-grained beddings on tasks related to lexical composition. analysis of sentence embeddings using auxiliary However, while they are on par with human per- prediction tasks. In International Conference formance in recognizing meaning shift, they are on Learning Representations (ICLR). still far from that in revealing implicit meaning. This gap may suggest a limit on the information Asaf Amrami and Yoav Goldberg. 2018. Word that distributional models currently provide about sense induction with neural biLM and symmet- the meanings of phrases. ric patterns. In Proceedings of the 2018 Con- Going beyond the distributional models, an ap- ference on Empirical Methods in Natural Lan- proach to build meaningful phrase representations guage Processing, pages 4860–4867, Brussels, can get some inspiration from the way that hu- Belgium. Association for Computational Lin- mans process phrases. A study on how L2 learn- guistics. Fatemeh Mohamadi Asl. 2013. The impact of Alexis Conneau, Germán Kruszewski, Guillaume context on learning idioms in EFL classes. Lample, Loïc Barrault, and Marco Baroni. TESOL Journal, 37(1):2. 2018. What you can cram into a single $&!#* vector: Probing sentence embeddings for lin- Geoff Bacon and Terry Regier. 2018. Probing guistic properties. In Proceedings of the 56th sentence embeddings for structure-dependent Annual Meeting of the Association for Compu- tense. In Proceedings of the 2018 EMNLP tational Linguistics (Volume 1: Long Papers), Workshop BlackboxNLP: Analyzing and Inter- pages 2126–2136, Melbourne, Australia. Asso- preting Neural Networks for NLP, pages 334– ciation for Computational Linguistics. 336. Association for Computational Linguis- tics. Thomas C Cooper. 1999. Processing of idioms by L2 learners of English. TESOL quarterly, Timothy Baldwin. 2005. Deep lexical acquisi- 33(2):233–262. tion of verb–particle constructions. Computer Speech & Language, 19(4):398–414. Silvio Cordeiro, Carlos Ramisch, Marco Idiart, and Aline Villavicencio. 2016. Predicting the Marco Baroni and Roberto Zamparelli. 2010. compositionality of nominal compounds: Giv- Nouns are vectors, are matrices: Rep- ing word embeddings a hard time. In Proceed- resenting Adjective-noun constructions in se- ings of the 54th Annual Meeting of the Associ- mantic space. In Proceedings of the 2010 ation for Computational Linguistics (Volume 1: Conference on Empirical Methods in Natural Long Papers), pages 1986–1997, Berlin, Ger- Language Processing, pages 1183–1193, Cam- many. Association for Computational Linguis- bridge, MA. Association for Computational tics. Linguistics. Ido Dagan, Dan Roth, Mark Sammons, and Fabio Yonatan Belinkov, Nadir Durrani, Fahim Dalvi, Zanzotto. 2013. Recognizing Textual Entail- Hassan Sajjad, and James Glass. 2017. What do ment. Morgan & Claypool Publishers. neural machine translation models learn about Fahim Dalvi, Nadir Durrani, Hassan Sajjad, ? In Proceedings of the 55th An- Yonatan Belinkov, and Stephan Vogel. 2017. nual Meeting of the Association for Compu- Understanding and improving morphological tational Linguistics (Volume 1: Long Papers), learning in the neural machine translation de- pages 861–872, Vancouver, Canada. Associa- coder. In Proceedings of the Eighth Interna- tion for Computational Linguistics. 
tional Joint Conference on Natural Language Piotr Bojanowski, Edouard Grave, Armand Joulin, Processing (Volume 1: Long Papers), pages and Tomas Mikolov. 2017. Enriching word vec- 142–151, Taipei, Taiwan. Asian Federation of tors with subword information. Transactions of Natural Language Processing. the Association for Computational Linguistics, Jacob Devlin, Ming-Wei Chang, Kenton Lee, and 5:135–146. Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language Gemma Boleda, Marco Baroni, The Nghia Pham, understanding. In Proceedings of the 2019 and Louise McNally. 2013. Intensionality was Conference of the North American Chapter of only alleged: On Adjective-noun composition the Association for Computational Linguistics: in distributional semantics. In Proceedings of Human Language Technologies the 10th International Conference on Computa- , Minneapolis, Minnesota. Association for Computational Lin- tional Semantics (IWCS 2013) – Long Papers, guistics. pages 35–46, Potsdam, Germany. Association for Computational Linguistics. Corina Dima. 2016. On the compositionality and semantic interpretation of English noun com- Wei-Te Chen, Claire Bonial, and Martha Palmer. pounds. In Proceedings of the 1st Workshop on 2015. English Light Verb Construction Identi- Representation Learning for NLP, pages 27–39. fication using Lexical Knowledge. In Twenty- Ninth AAAI Conference on Artificial Intelli- Georgiana Dinu, Nghia The Pham, and Marco Ba- gence. roni. 2013. General estimation and evaluation of compositional distributional semantic mod- Semantics (*SEM), Volume 2: Proceedings of els. In Proceedings of the Workshop on Contin- the Seventh International Workshop on Seman- uous Vector Space Models and their Composi- tic Evaluation (SemEval 2013), pages 138–143. tionality, pages 50–58, Sofia, Bulgaria. Associ- Association for Computational Linguistics. ation for Computational Linguistics. Otto Jespersen. 1965. A Modern English Gram- Matt Gardner, Joel Grus, Mark Neumann, mar: On Historical Principles. George Allen Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, & Unwin Limited. Matthew Peters, Michael Schmitz, and Luke Zettlemoyer. 2018. AllenNLP: A deep seman- Nelson F Liu, Matt Gardner, Yonatan Belinkov, tic natural language processing platform. In Matthew Peters, and Noah A Smith. 2019. Lin- Proceedings of Workshop for NLP Open Source guistic knowledge and transferability of contex- Software (NLP-OSS), pages 1–6, Melbourne, tual representations. In Proceedings of the 2019 Australia. Association for Computational Lin- Conference of the North American Chapter of guistics. the Association for Computational Linguistics: Human Language Technologies, Minneapolis, Mario Giulianelli, Jack Harding, Florian Mohnert, Minnesota. Association for Computational Lin- Dieuwke Hupkes, and Willem Zuidema. 2018. guistics. Under the hood: Using diagnostic classifiers to investigate and improve how language models Laurens van der Maaten and Geoffrey Hinton. track agreement information. In Proceedings 2008. Visualizing data using t-SNE. Journal of of the 2018 EMNLP Workshop BlackboxNLP: machine learning research, 9(Nov):2579–2605. Analyzing and Interpreting Neural Networks for NLP, pages 240–248. Association for Computa- Jean Maillard and Stephen Clark. 2015. Learning tional Linguistics. adjective meanings with a tensor-based Skip- Gram model. In Proceedings of the Nine- Alex Graves and Jürgen Schmidhuber. 2005. 
teenth Conference on Computational Natural Framewise phoneme classification with bidirec- Language Learning, pages 327–331, Beijing, tional LSTM and other neural network architec- China. Association for Computational Linguis- tures. Neural Networks, 18(5-6):602–610. tics.

Z Harris. 1954. Distributional Hypothesis. Word, 10(23):146–162.

Matthias Hartung. 2015. Distributional Semantic Models of Attribute Meaning in Adjectives and Nouns. Ph.D. thesis, Heidelberg University.

Matthias Hartung, Fabian Kaupmann, Soufian Jebbara, and Philipp Cimiano. 2017. Learning compositionality functions on word embeddings for modelling attribute meaning in adjective-noun phrases. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 54–64, Valencia, Spain. Association for Computational Linguistics.

Iris Hendrickx, Zornitsa Kozareva, Preslav Nakov, Diarmuid Ó Séaghdha, Stan Szpakowicz, and Tony Veale. 2013. SemEval-2013 task 4: Free paraphrases of noun compounds. In Second Joint Conference on Lexical and Computational Semantics (*SEM 2013), Atlanta, Georgia, USA. Association for Computational Linguistics.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In International Conference on Learning Representations (ICLR).

Jeff Mitchell and Mirella Lapata. 2010. Composition in distributional models of semantics. Cognitive Science, 34(8):1388–1429.

Preslav Nakov. 2013. On the interpretation of noun compounds: Syntax, semantics, and entailment. Natural Language Engineering, 19(3):291–330.

Vivi Nastase, Preslav Nakov, Diarmuid Ó Séaghdha, and Stan Szpakowicz. 2013. Semantic Relations between Nominals. Synthesis Lectures on Human Language Technologies, 6(1):1–119.

Geoffrey Nunberg, Ivan A. Sag, and Thomas Wasow. 1994. Idioms. Language, 70(3):491–538.

Diarmuid Ó Séaghdha and Ann Copestake. 2009. Using lexical and relational similarity to classify semantic relations. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 621–629, Athens, Greece. Association for Computational Linguistics.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In Autodiff Workshop, NIPS 2017.

Ellie Pavlick and Chris Callison-Burch. 2016. Most "babies" are "little" and most "problems" are "huge": Compositional entailment in adjective-nouns. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2164–2173, Berlin, Germany. Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.

Karl Pichotta and John DeNero. 2013. Identifying phrasal verbs using many bilingual corpora. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 636–646, Seattle, Washington, USA. Association for Computational Linguistics.

Adam Poliak, Pushpendre Rastogi, M. Patrick Martin, and Benjamin Van Durme. 2017. Efficient, compositional, order-sensitive n-gram embeddings. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 503–508, Valencia, Spain. Association for Computational Linguistics.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf.

Siva Reddy, Diana McCarthy, and Suresh Manandhar. 2011. An empirical study on compositionality in compound nouns. In Proceedings of 5th International Joint Conference on Natural Language Processing, pages 210–218, Chiang Mai, Thailand. Asian Federation of Natural Language Processing.

Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann Copestake, and Dan Flickinger. 2002. Multiword expressions: A pain in the neck for NLP. In International Conference on Intelligent Text Processing and Computational Linguistics, pages 1–15. Springer.

Bahar Salehi, Paul Cook, and Timothy Baldwin. 2015. A word embedding approach to predicting the compositionality of multiword expressions. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 977–983, Denver, Colorado. Association for Computational Linguistics.

Nathan Schneider and Noah A. Smith. 2015. A corpus and model integrating multiword expressions and supersenses. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1537–1547, Denver, Colorado. Association for Computational Linguistics.

Marco Silvio Giuseppe Senaldi, Yuri Bizzoni, and Alessandro Lenci. 2019. What do neural networks actually learn, when they learn to identify idioms? Proceedings of the Society for Computation in Linguistics, 2(1):310–313.

Rico Sennrich. 2017. How grammatical is character-level neural machine translation? Assessing MT quality with contrastive translation pairs. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 376–382, Valencia, Spain. Association for Computational Linguistics.

Xing Shi, Inkit Padhi, and Kevin Knight. 2016. Does string-based neural MT learn source syntax? In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1526–1534. Association for Computational Linguistics.

Vered Shwartz and Ido Dagan. 2018. Paraphrase to explicate: Revealing implicit noun-compound relations. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1200–1211, Melbourne, Australia. Association for Computational Linguistics.

Vered Shwartz and Chris Waterson. 2018. Olive oil is made of olives, baby oil is made for babies: Interpreting noun compounds using paraphrases in a neural model. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 218–224, New Orleans, Louisiana. Association for Computational Linguistics.

Richard Socher, Brody Huval, Christopher D. Manning, and Andrew Y. Ng. 2012. Semantic compositionality through recursive matrix-vector spaces. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1201–1211. Association for Computational Linguistics.

Gabriel Stanovsky and Mark Hopkins. 2018. Spot the odd man out: Exploring the associative power of lexical resources. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1533–1542, Brussels, Belgium. Association for Computational Linguistics.

Nitesh Surtani and Soma Paul. 2015. A VSM-based statistical model for the semantic relation interpretation of noun-modifier pairs. In Proceedings of the International Conference Recent Advances in Natural Language Processing, pages 636–645.

Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R Thomas McCoy, Najoung Kim, Benjamin Van Durme, Sam Bowman, Dipanjan Das, and Ellie Pavlick. 2019. What do you learn from context? Probing for sentence structure in contextualized word representations. In International Conference on Learning Representations.

Stephen Tratz. 2011. Semantically-enriched Parsing for Natural Language Understanding. Ph.D. thesis, University of Southern California.

Yuancheng Tu and Dan Roth. 2011. Learning English light verb constructions: Contextual or statistical. In Proceedings of the Workshop on Multiword Expressions: from Parsing and Generation to the Real World, pages 31–39, Portland, Oregon, USA. Association for Computational Linguistics.

Yuancheng Tu and Dan Roth. 2012. Sorting out the most confusing English phrasal verbs. In *SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), pages 65–69, Montréal, Canada. Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Veronika Vincze, István Nagy T., and Gábor Berend. 2011. Multiword expressions and named entities in the Wiki50 corpus. In Proceedings of the International Conference Recent Advances in Natural Language Processing 2011, pages 289–295. Association for Computational Linguistics.

John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2016. Towards universal paraphrastic sentence embeddings. In International Conference on Learning Representations (ICLR).

John Wieting, Jonathan Mallinson, and Kevin Gimpel. 2017. Learning paraphrastic sentence embeddings from back-translated bitext. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 274–285, Copenhagen, Denmark. Association for Computational Linguistics.

Zhibiao Wu and Martha Palmer. 1994. Verbs semantics and lexical selection. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, pages 133–138. Association for Computational Linguistics.

Fabio Massimo Zanzotto, Ioannis Korkontzelos, Francesca Fallucchi, and Suresh Manandhar. 2010. Estimating linear models for compositional distributional semantics. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 1263–1271. Association for Computational Linguistics.

Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. 2018. SWAG: A large-scale adversarial dataset for grounded commonsense inference. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 93–104, Brussels, Belgium. Association for Computational Linguistics.