
HEBREW UNIVERSITY OF JERUSALEM

Semantic change at large: A computational approach for semantic change research

Thesis for the degree "Doctor of Philosophy" in Brain Sciences: Computation and Information Processing

by: Haim DUBOSSARSKY

Submitted to the Senate of the Hebrew University of Jerusalem

April 2018


This work was carried out under the supervision of Doctor Eitan GROSSMAN and Professor Daphna WEINSHALL


HEBREW UNIVERSITY OF JERUSALEM
Edmond and Lily Safra Center for Brain Sciences

Abstract

Doctor of Philosophy

Semantic change at large: A computational approach for semantic change research

by Haim DUBOSSARSKY

The search for hitherto undiscovered regularities and laws of semantic change is today made possible by two recent and independent developments: first, the upsurge in the availability of digitized historical corpora of texts; and second, novel computational methods that allow the automatic semantic processing of these corpora. Compared to traditional research methods in semantic change, this dynamic duo of massive corpora and innovative computational methods has the potential to lead to novel insights about semantic change, both in discovering regular patterns of change and in identifying their explanatory factors, based on large-scale data-driven analysis.

In a series of studies, I demonstrate that this approach is not only feasible, but also fruitful for semantic change research. My research papers show that this approach extends the scope of research beyond known phenomena of semantic change, uncovering several regularities of semantic change, and complements existing theories with more objective and reliable analyses.

However, these studies have also demonstrated the importance of methodological issues. As this field of research has just emerged, fundamental methodological concerns were raised which must be addressed in order to make an objective, reliable, and genuine contribution to the research field. Two papers were dedicated to tackling these methodological issues, and their results promise to put this computational approach on a solid methodological footing right from the beginning. The studies not only call into question earlier results, but also provide clear, feasible, and replicable validation routines to ensure objective and reliable testing. Importantly, these methodological accentuations may benefit the NLP community at large, as they have applications that go beyond semantic change research.

Letter of Contribution

Chapter 2 - Verbs change more than nouns: A bottom-up computational approach to semantic change. Haim Dubossarsky, Daphna Weinshall and Eitan Grossman. HD proposed the idea for the main analysis, processed the corpora, trained the model, and performed the critical analysis. DW supervised the analyses. HD and EG interpreted the results and proposed a linguistic theory for them. HD and EG wrote the manuscript: HD the technical parts to communicate them to the linguistic community, and EG the linguistic-theoretical parts.

Chapter 3 - A bottom up approach to category mapping and meaning change. Haim Dubossarsky, Yulia Tsvetkov, Chris Dyer and Eitan Grossman. HD proposed the theoretical research question. YT and CD positioned it in the right computation-theoretic background, and suggested the model that would allow this question to be investigated. HD processed the corpora, trained the model, made the diachronic comparisons, and devised the critical analysis. HD, YT and CD interpreted the results on the computational side, and EG contributed the pivotal linguistic interpretation. HD and EG wrote the manuscript, and YT wrote part of the methodology.

Chapter 4 - Outta Control: Laws of Semantic Change and Inherent Biases in Representation Models. Haim Dubossarsky, Eitan Grossman and Daphna Weinshall. HD speculated about the magnitude of the problem in the current research, and together with DW outlined the required experiments to test it as a research question. HD planned and carried out the experiments, analyzed their results, replicated the results of previous studies, and conducted the critical comparisons. HD interpreted the results together with DW. DW contributed to the theoretical part. HD, DW and EG wrote the manuscript, and emphasized its critical importance to the research field.

Chapter 5 - Coming to Your Senses: on Controls and Evaluation Sets in Research. Haim Dubossarsky, Matan Ben-Yosef, Eitan Grossman and Daphna Weinshall. HD identified the misconception in the current research, and suggested phrasing it as a research hypothesis. HD and DW devised the set of experiments to investigate the hypothesis. HD collected and preprocessed the corpora, trained the models, and conducted the critical comparisons. HD and MB replicated the results of previous models, and manipulated their code for the critical simulations. HD analyzed the results from all the experiments and simulations. DW contributed to the theoretical part. HD and DW interpreted the results, and drew conclusions. HD, DW and EG wrote the manuscript, and positioned the conclusions in the right research context.


“In the beginning was the Word, and the Word was with God, and the Word was God.”

John 1:1


Contents

Abstract

1 Introduction
  1.1 Semantic change
  1.2 Research paradigms in semantic change
    1.2.1 Overview
    1.2.2 The traditional paradigm
    1.2.3 The quantitative paradigm
    1.2.4 The computational paradigm
  1.3 Research objectives
  1.4 Current work
  References

2 The Diachronic Word-Class Effect

3 The Diachronic Prototypicality Effect

4 Avoiding methodological pitfalls in semantic change research

5 Sense-specific word representation, a misconception?

6 Discussion and Conclusions
  References


Chapter 1

Introduction

At the heart of this PhD dissertation is a new research paradigm for the study of semantic change, i.e., changes in the meaning of lexical items. This paradigm is characterized by a focus on large-scale, data-driven analyses based on modern computational tools, in order to address research questions stemming from linguistic research. This approach complements existing paradigms, and in so doing, enriches the toolbox of semantic change research. At the same time, this dissertation addresses fundamental methodological issues that are crucial not only for the emerging field of NLP research on semantic change, but for research in NLP more broadly. In particular, this dissertation constitutes a contribution to methodological issues inherent to research in the framework of DISTRIBUTIONAL SEMANTICS.

The structure of this chapter is as follows. Section 1.1 provides a definition of semantic change. Section 1.2 presents essential background about semantic change research. Section 1.3 states our research objectives in light of the needs that derive from the current state of research. Section 1.4 briefly describes our research papers, and how they contribute to the research field in accordance with the objectives outlined before.

1.1 Semantic change

The meanings of words – or, more precisely, lexical items – are liable to change over time. For example, girl originally used to denote a child of either sex, but since the 15th century only refers to a young female (Bybee 2015, p. 202). Broadcast used to describe a method of scattering seeds less than a hundred years ago, but nowadays refers to the transmission of information by radio or television. The phenomenon of changes in the meaning of words over time is part of what is called semantic change, and has been known at least since the work of Reisig (1839). Semantic change, to use a classic definition, may be described as “innovations which change the lexical meaning rather than the grammatical function of a form” (Bloomfield 1933, p. 425).[1]

A formal definition of semantic change would be as follows: given a historical corpus that contains texts from n different time periods in chronological order $[C_{t_1}, C_{t_2}, \ldots, C_{t_n}]$, semantic change is any difference between the meanings associated with a given word at $t_1, t_2, \ldots, t_n$. A model of semantic change aims first of all to discover which words changed their meaning between any two time periods. Given this information, additional tasks can be undertaken in order to answer further questions, such as the size or extent of the changes, their characteristics, and their dependence on other factors.

1.2 Research paradigms in semantic change

1.2.1 Overview

Until recently, research on semantic change was largely based on a traditional paradigm that used philological tools informed by linguistic analysis. For the most part, linguists interested in semantic change have studied changes in the denotation or connotation of individual words or constructions (or small groups thereof) as they occur in historical corpora. The introduction of modern quantitative and computational tools to the field did not change the basically TOP-DOWN nature of this research paradigm.

While a top-down research approach allows a fine-grained examination of semantic change, it also imposes limitations on what one can learn about semantic change. The fundamental limitation is the size and the representativeness of the dataset to be generalized over. Since previous studies were done on a SMALL SCALE and were based on ultimately anecdotal instances of semantic change, it is not at all certain to what extent the semantic changes collected in surveys are representative of semantic change in general. Furthermore, analyses based on small and anecdotal datasets cannot turn up large-scale regularities of semantic change.

As a consequence of their top-down and small-scale nature, these analyses tend to draw conclusions from a subjective and qualitative evaluation of a limited number of examples, and rarely deal with counterexamples. As such, statistical methodologies are rarely brought to bear in support of theoretical claims, and the importance of statistical significance in hypothesis testing is somewhat foreign to the field. On the other hand, large-scale, bottom-up analyses demand STATISTICALLY-RIGOROUS research methods in order to reach sound and reliable findings. However, although it is clear that statistically-rigorous research methods are crucial for large-scale, bottom-up research, the methods themselves – and their application to NLP analyses – are not self-evident, and they require dedicated research.

Over the years, several research paradigms in semantic change research have developed. These paradigms differ in their research methods with respect to the three distinctions that were made above, namely (i) whether they are top-down or bottom-up, (ii) small or large scale, and (iii) whether they employ statistically-rigorous methods. While it may seem that these paradigms show a clear developmental trajectory, as some have preceded the others and the later ones may improve on the older ones, the boundaries between them may be obscure. Importantly, they are all still used in contemporary research. In the following paragraphs, I briefly survey the main research paradigms for semantic change, comparing them according to the distinctions made above (see Table 1.1 for a summary).

[1] For several overviews of semantic change, see, e.g., Bybee (2015, pp. 188-208), Hock and Joseph (2009, pp. 205-240), Newman (2015, pp. 266-280), or Traugott and Dasher (2002).

Paradigm        Research type   Research design   Scale                        Statistical methods
TRADITIONAL     Qualitative     Top-down          Small                        None
QUANTITATIVE    Quantitative    Top-down          Small, but full corpora      Simple statistical tests
COMPUTATIONAL   Quantitative    Bottom-up         Initially small, then large  Initially none, then rigorous

TABLE 1.1: Comparative summary of the research paradigms

1.2.2 The traditional paradigm

Many if not most studies of semantic change have focused primarily on describing and classifying the major types of change in meaning into taxonomies. This could be called the TRADITIONAL paradigm. According to most taxonomies, the major types of change involve (a) changes in the extension of meaning, i.e., either widening or narrowing, and (b) changes in the connotation of meaning, which may become either more positive (amelioration) or more negative (pejoration). See the examples in Table 1.2, taken from English:

Widening      bird    ’young bird’ > ’any kind of bird’
Narrowing     meat    ’food’ > ’edible flesh’
Amelioration  knight  ’servant boy’ > ’nobleman’
Pejoration    awful   ’inspiring wonder’ > ’terribly bad’

TABLE 1.2: Major types of semantic change

This paradigm is characterized by a top-down approach, where examples of individual words that underwent a certain type of semantic change are searched for in historical corpora and are either classified according to an existing taxonomy, or used to define a new one. The traditional paradigm provides an invaluable source for understanding potential regularities and tendencies of semantic change, upon which principled theoretical work could be based, and it remains an active and valued research paradigm to this day (see, for example, the large database of semantic changes documented by Zalizniak et al. (2012)). Theoretical work in this vein is mostly limited to identifying the mechanisms that may explain how a semantic change came into being, e.g., the claim that a metaphorical process led broadcast to change its meaning from "casting seeds" to "casting audio and video signals".

However, knowing how the meaning of words changes diachronically is not equivalent to explaining why such changes occur. In this respect, this paradigm does not usually generate explanatory theories of semantic change. In fact, what it considers causes of semantic change are actually motivations for it, which only explain the need to adjust the meaning of words to keep up with the ever-changing objects and ideas around us in the technical, scientific, political or sociocultural domains of the environment (Blank 1999). As such, the true causes of semantic change – for example, the linguistic preconditions that may promote or inhibit words from undergoing semantic change – were mostly neglected in actual research.

Cognitive linguistics, and in particular theories of prototype semantics, did pay some attention to this question. These theoretical approaches emphasize the role that the relations between words may have in semantic change. Significantly, it was argued that differences in the prototypicality of words in their category (e.g., a robin or a dove is more prototypical in the category of birds than a peacock or an ostrich, which are more peripheral) can explain many cases of semantic change (Koch 2016). For example, Geeraerts (1985, 1992) maps related words into semantic categories, analyzes these categories over time, and concludes that prototypical semantic areas are more stable over time than peripheral ones.

However, as discussed above, these generalizations use a top-down approach, and are based on fine-grained studies of small and anecdotal datasets. Moreover, their claims about regularities of semantic change are basically untestable. Specifically, it is not clear whether these are claims about inviolable rules or tendencies. If they are inviolable rules, what would constitute counter-evidence? And if they are tendencies, how strong are they, and how might one investigate this question? It is important to point out that there is still no comprehensive database of documented semantic changes from a wide range of languages, and the databases that do exist, such as Zalizniak et al. (2012), are problematic from several points of view.

1.2.3 The quantitative paradigm

The paradigm described above generally bases its identification of semantic changes on the qualitative evaluation of words’ usage contexts in historical corpora. Such an approach is informative, as it provides examples that are easily and naturally appreciated as supporting claimed taxonomies or mechanisms of change. However, this type of analysis is, in the end, subjective, due to its dependence on the particular set of examples selected, as well as on those that were left out. This not only makes it difficult to assess the magnitude or importance of the reported semantic changes, but also makes them questionable in terms of reliability (i.e., representativeness).

A second approach, which we call the QUANTITATIVE PARADIGM, puts numbers to these changes by quantifying changes in the words’ distributional statistics over time. This approach has been operationalized in a number of ways. A relatively straightforward operationalization involves measuring change in words’ token frequencies (Bybee 2006). A more sophisticated method was introduced by Hilpert (2006), who adapted Stefanowitsch and Gries’s (2003) collostructional analysis for diachronic analysis. In this approach, the strength of association between two words is modeled by their statistical dependency (i.e., how often they appear together relative to what is expected from their general frequency). These association strengths are computed for each period and compared diachronically. For example, the association between apple and phone was not evident in the 1990s, and only emerged in the 2000s. The diachronic differences in association strength are evaluated statistically, which puts the question of whether or not a semantic change took place on a statistically objective footing.

The quantitative paradigm has been used in a number of studies, albeit generally with a top-down approach.[2] One use has been to identify cases where semantic change took place for subsequent philological analysis (Fonteyn and Hartmann 2016; Smirnova 2012; Tantucci et al. 2017). Another has been to identify empirically well-founded stages in diachronic corpora (Gries and Hilpert 2008). A third use has been to test hypotheses that derive from a theory (Geeraerts et al. 2011). Moreover, word frequency has even been proposed as a possible explanation for semantic change, as it has been argued that high frequency has a bleaching effect on a word’s meaning (Bybee 2006), an effect similar to broadening (see Table 1.2). While these works extract distributional statistics about words from corpora, their analyses tend to be on a small scale, diachronically examining only a few words or grammatical constructions at a time.
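The idea of statistical dependency between co-occurring words can be sketched with point-wise mutual information (PMI), a standard association measure: the log of how often two words actually co-occur versus how often they would co-occur if they were independent. Note that collostructional analysis proper uses related but distinct association statistics, so the measure, the toy one-sentence "corpora", and all names below are illustrative assumptions rather than the method described above:

```python
from collections import Counter
from math import log2

def cooccurrences(tokens, window=2):
    """Count unordered co-occurrence pairs within a +/- window of tokens."""
    pairs = Counter()
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(len(tokens), i + window + 1)):
            pairs[frozenset((w, tokens[j]))] += 1  # unordered pair
    return pairs

def pmi(pairs, w1, w2):
    """PMI: log2(observed joint probability / expected under independence)."""
    total = sum(pairs.values())
    joint = pairs[frozenset((w1, w2))] / total
    if joint == 0.0:
        return float("-inf")   # the words never co-occur in this period
    p1 = sum(c for p, c in pairs.items() if w1 in p) / total
    p2 = sum(c for p, c in pairs.items() if w2 in p) / total
    return log2(joint / (p1 * p2))

# Hypothetical one-sentence "corpora" for two periods:
corpus_1990s = "the apple fell from the tree in the orchard".split()
corpus_2000s = "the new apple phone and the apple tablet shipped".split()

print(pmi(cooccurrences(corpus_1990s), "apple", "phone"))  # -inf: no association yet
print(pmi(cooccurrences(corpus_2000s), "apple", "phone"))  # finite: association emerged
```

Comparing such scores across periods, and testing whether the difference is statistically significant, is the diachronic step described above.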

1.2.4 The computational paradigm

According to the distributional hypothesis (Harris 1954), similar words appear in similar contexts. Consequently, it has been suggested that the meaning of a word can be extracted from its usage contexts, or in the words of Firth, “you shall know a word by the company it keeps” (1957, p. 11). This is the basis of most, if not all, current corpus-based computational models that rely on contextual information to represent the meanings of words. We call this approach the COMPUTATIONAL PARADIGM.

It should be noted that all the paradigms discussed above comply, in one way or another, with the distributional hypothesis. Crucially though, none of the above paradigms directly address the issue of word representation. Instead, the meaning of words is implicitly assumed either subjectively (e.g., in the traditional paradigm), or according to the association strengths (e.g., in the quantitative paradigm). In contrast, the computational paradigm necessarily represents word meanings explicitly, and only then defines semantic change as measurable differences between these representations. For that reason, the following focuses on the computational paradigm’s approach to meaning representation and semantic change.

[2] Bochkarev et al. (2014) used a bottom-up approach to study global changes in languages, but completely ignored the level of individual words.

Vector space models (VSM)[3] are usage-based models specifically based on NLP tools that were developed under the distributional hypothesis to extract word meanings from their contextual information. These models use the local context window around a given word to capture information about the words that co-occur with it. They then represent each word as a continuous vector in a high-dimensional space. These vectors are used extensively – in fact, almost exclusively – in a variety of NLP domains, such as semantic word similarity, analogy solving, synonym detection, and information retrieval (Baroni et al. 2014; Mikolov et al. 2013; Pennington et al. 2014), as well as in chunking, named entity recognition and sentiment analysis (Guo et al. 2014; Socher et al. 2013; Turian et al. 2010), among others. Importantly, these vector representations are the primary tool with which computational approaches – including the one adopted in this dissertation – investigate, in a data-driven fashion, how words’ meanings change over time.

In line with the distributional hypothesis discussed above, these vector representations map semantically similar words, if these appear in similar contexts, to proximate points in the vector space (Turney and Pantel 2010). It is customary to use the cosine-distance between the vectors of two words as an estimate of the degree of their semantic similarity. For example, in the toy example shown in Figure 1.1 below, automobile and car are close to each other in the vector space, relative to horse. This will be reflected in a small cosine-distance between the former pair, and a larger one between either of them and horse.

FIGURE 1.1: Three word vectors in the semantic space created by the VSM
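The count-based flavor of such models can be sketched in a few lines: each word's vector is simply the histogram of the words observed in its context windows, and cosine similarity compares these histograms. The toy corpus and all names below are illustrative assumptions, not data or code from the studies in this dissertation:

```python
from collections import Counter, defaultdict
from math import sqrt

def context_vectors(tokens, window=2):
    """One sparse count vector per word: how often every other word
    appears in the +/- window context around it."""
    vecs = defaultdict(Counter)
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                vecs[w][tokens[j]] += 1
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[word] for word, count in u.items())
    norm = lambda x: sqrt(sum(c * c for c in x.values()))
    return dot / (norm(u) * norm(v))

# Tiny illustrative corpus: car and automobile share contexts, horse does not.
corpus = ("the car drove down the road . "
          "the automobile drove down the road . "
          "the horse ate hay in the barn .").split()
vecs = context_vectors(corpus)
print(cosine(vecs["car"], vecs["automobile"]))  # high: shared contexts
print(cosine(vecs["car"], vecs["horse"]))       # lower: different contexts
```

Real models work on millions of tokens and typically reweight or compress these counts (e.g., PPMI, word2vec), but the principle is the same.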

[3] We refer by this term both to explicit count-based models (e.g., positive point-wise mutual information, PPMI) and to implicit predictive models (e.g., word2vec).

Following the same logic, the cosine-distance is often viewed – and used – as an estimate of the degree or amount of semantic change. Specifically, the semantic change that a particular word has undergone is measured by the cosine-distance between its vectors at two time points. Therefore, the larger the cosine-distance the greater the semantic change, and vice versa. Or, more formally:

$$\Delta_w^{t_0 \to t_1} = 1 - \frac{v_w^{t_0} \cdot v_w^{t_1}}{\|v_w^{t_0}\| \cdot \|v_w^{t_1}\|} \tag{1.1}$$

where $v_w^{t_0}$ and $v_w^{t_1}$ are the vector representations of the word $w$ at two time points, $t_0$ and $t_1$, respectively.

This idea might seem intuitively compelling, as it looks like a natural extension of the synchronic comparison between two different words. However, its validity as an accurate metric for semantic change has never been tested. This poses an acute problem for claims about semantic change that are based on this measure (which comprise the vast majority of studies).
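The cosine-distance in Eq. (1.1) is straightforward to compute; the following minimal sketch (plain Python, with function and variable names assumed for illustration) applies it to a word's vectors at two time points:

```python
from math import sqrt

def semantic_change(v_t0, v_t1):
    """Eq. (1.1): one minus the cosine of the angle between a word's
    vector representations at time points t0 and t1."""
    dot = sum(a * b for a, b in zip(v_t0, v_t1))
    norm = lambda v: sqrt(sum(a * a for a in v))
    return 1.0 - dot / (norm(v_t0) * norm(v_t1))

# Same direction -> no measured change; orthogonal vectors -> maximal change.
print(semantic_change([1.0, 0.0], [2.0, 0.0]))  # 0.0
print(semantic_change([1.0, 0.0], [0.0, 1.0]))  # 1.0
```

Note that the measure is insensitive to vector length (the first call returns 0.0 despite the doubled magnitude): only the direction of the vector, i.e., the word's distributional profile, matters.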

Computational studies of semantic change

Among the first to take a computational approach to linguistically motivated questions of semantic change were Sagi et al. (2009) and Wijaya and Yeniterzi (2011). Despite their pioneering nature, these studies, and others that soon followed (Heylen et al. 2015; Hilpert and Perek 2015), analyzed only a small number of examples or a single construction, thus not exploiting the potential of combining massive corpora with modern computational tools in a large-scale bottom-up analysis of an entire lexicon.

Since these first pioneering works, the computational paradigm has become increasingly popular in the NLP research community, as additional studies brought this research field into the big-data era (Cook and Stevenson 2010; Frermann and Lapata 2016; Gulordava and Baroni 2011; Jatowt and Duh 2014; Kim et al. 2014; Kulkarni et al. 2015; Schlechtweg et al. 2017). These studies focused on developing techniques to detect words that underwent semantic change, or a specific type thereof, but almost completely neglected the potential insights for the linguistic analysis of meaning, primarily the how’s and why’s of semantic change.

Recently though, several studies have addressed this research lacuna and reported phenomena that can only be observed in a large-scale analysis. These studies have proposed law-like generalizations or regularities of semantic change:

• The Diachronic Prototypicality Effect (Dubossarsky et al. 2015; see Chapter 3)

• The Diachronic Word-Class Effect (Dubossarsky et al. 2016; see Chapter 2)

• The Law of Conformity (Hamilton et al. 2016)

• The Law of Innovation (Hamilton et al. 2016)

However, all of these studies (and virtually all studies in the field[4]) use a small number of hand-picked examples as a basis for arguing for the soundness of their models. In the absence of a gold standard evaluation set for words that have undergone semantic change, it is difficult to objectively assess the quality of any proposed model of semantic change, or to compare different models.

Crucially, these above-mentioned studies have relied on the cosine-distance between a word’s respective vector representations at two time points as their sole metric to measure semantic change. As noted above, despite its wide and apparently uncontroversial use, the validity of this metric as an accurate estimate of semantic change has never been tested. As a result, previously reported findings might not be accurate. As studies that are based on this unvalidated metric accumulate, without being evaluated against a gold standard test for semantic change, the problem becomes increasingly acute. I address these two methodological problems in my work (see Section 1.4).

A multi-sense semantic representation

A fundamental property of VSMs is that each word is represented by a single vector. This may seem surprising, given the prevalence of polysemy in natural language. Lexical items typically – and perhaps usually – have multiple senses or meanings. The word cell, for example, has distinct senses or meanings related to biology, incarceration, and telecommunications. Current models collapse these different senses into a single global vector and are thus unable to explicitly represent distinct senses of a word.

[4] Except, perhaps, Frermann and Lapata (2016), who used a task of novel-sense detection as a proxy.

This global approach to word representation is especially disadvantageous for semantic change research, as polysemy has a fundamental role in semantic change (Traugott and Dasher 2002). Comparing a word’s global vectors, and determining that it underwent semantic change, presumably tells us little about the nature of such a change. It does not tell us what aspects of its meaning have changed, whether new senses have emerged, whether some senses have dropped from usage, and so on. It seems that researchers have presumed that this information resides in the original usage contexts of the word, but is impaired by the global representation approach.

In light of this drawback, several studies advocate the use of sense-specific vector representations for general NLP purposes. These vectors arguably capture the various senses of a word by creating a distinct representation for each sense. Intuitively, sense-specific vectors should be more accurate than global vectors, and this improvement in accuracy should be reflected in performance gains in downstream tasks. Several studies reported such performance gains when using sense-specific vectors, which they attribute to the utility of the polysemic information that these vectors arguably capture (Huang et al. 2012; Li and Jurafsky 2015; Neelakantan et al. 2014). However, such a conclusion can only be drawn if two conditions are met: (1) sense-specific vectors truly represent polysemic information; and (2) sense-specific vectors improve performance. While the first condition was not addressed in these studies, the second condition was consistently met in all of them. As a result, it remains to be demonstrated that these performance gains can be attributed to the representation of polysemy. Only when the first condition is validated can these vectors be reliably used in a more ecological model of semantic change that takes polysemy into account. I address this validation step in my work (see Section 1.4).
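The contrast between a global vector and sense-specific vectors can be illustrated with a toy computation. The two-dimensional "context embeddings" below are invented for illustration only; real multi-sense models (e.g., Neelakantan et al. 2014) cluster actual context vectors from a corpus, but the averaging logic is the same:

```python
# Toy context embeddings for two senses of "cell" (invented, 2-D for clarity):
bio_contexts = [[1.0, 0.0], [0.9, 0.1]]    # biology-like usage contexts
jail_contexts = [[0.0, 1.0], [0.1, 0.9]]   # incarceration-like usage contexts

def mean(vectors):
    """Component-wise average of a list of vectors."""
    return [sum(xs) / len(vectors) for xs in zip(*vectors)]

# A global model averages over ALL contexts, collapsing both senses
# into one vector that represents neither sense well.
global_vec = mean(bio_contexts + jail_contexts)

# A sense-specific model keeps one vector per cluster of contexts.
sense_vecs = [mean(bio_contexts), mean(jail_contexts)]

print(global_vec)  # [0.5, 0.5] -- midway between the senses
print(sense_vecs)  # [[0.95, 0.05], [0.05, 0.95]] -- one vector per sense
```

The global vector sits between the two usage clusters, so a shift in the relative frequency of the senses or the birth of a new sense moves it in ways a single point cannot disentangle, which is exactly the diachronic problem described above.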

1.3 Research objectives

This PhD dissertation aims to advance semantic change research with state-of-the-art computational techniques in two ways. First, it aims to demonstrate that it is both possible and fruitful to employ a large-scale bottom-up computational approach to semantic change research. Second, it aims to make sure that this emerging research field stands on a methodologically solid footing from its beginning.

As the survey in Section 1.2 points out, research on semantic change, whether conducted in traditional or more contemporary paradigms, could greatly benefit from bottom-up, large-scale analyses that are carried out over an entire lexicon. A bottom-up approach basically seeks to articulate generalizations not on the basis of prior theoretical assumptions but on the basis of naturally-occurring language production, which is also not liable to the subjective judgments of individual researchers. Large-scale analyses address the fundamental limitation of the representativeness of the dataset to be generalized over. Since previous studies are based on small and anecdotal cases of semantic change, it is not at all certain to what extent the semantic changes collected in surveys are representative of semantic change in general. The investigation of entire lexicons not only provides more examples of semantic change as grist for the theoretical mill, but also allows one to identify possible regularities of change that may be evident only on a large scale.

These desiderata are today made possible by two recent independent developments. The first is an upsurge in the availability of digitized historical corpora of texts. The second is novel computational methods that allow the automatic semantic processing of these corpora.
This dynamic duo of massive corpora and innovative computational methods has the potential to lead to novel insights about semantic change, both in discovering regular patterns of change and in identifying their linguistic causes, based on large-scale data-driven analyses.

It should be noted that in comparison to the first two paradigms, which use small-scale analysis, the computational paradigm also has some disadvantages. While the former carefully choose their findings to fit their conclusions (which is also their main methodological weakness), the latter includes many unexpected findings due to its large-scale bottom-up approach. Naturally, these findings comprise many intuitive examples of words known to have undergone semantic change, in addition to counter-intuitive and surprising examples that may help challenge existing theories and advance research. However, it is also expected to turn up “noisy” and incorrect examples of words that were mistakenly found to have undergone semantic change. For example, a naïve computational semantic change model that is based only on the distributions of words in context may find that president undergoes a recurrent semantic change once every 4 or 8 years. Therefore, while the carefully chosen examples that are used in small-scale analyses are usually sufficient to support an argument, more rigorous research methods must be employed to handle the type of data involved in large-scale bottom-up analysis. Moreover, since each paradigm has its advantages and disadvantages, they should be seen as complementary rather than competing accounts.

The initial attempts to employ the computational paradigm on a large scale have demonstrated the importance of methodological issues.
Quantitative hypothesis testing occupies a more central place in this emerging field than in its parent fields of research: first, the quantitative turn in modern linguistics has been relatively recent, and second, NLP research has no tradition of empirical hypothesis testing, as it is instead based on modeling work. Thus, fundamental methodological issues must be addressed in order to make an objective, reliable, and genuine contribution to the research field. Primarily, the absence of a gold standard evaluation set for semantic change makes it problematic to assess the quality of any proposed model of semantic change objectively. Furthermore, the reliance on an unvalidated semantic change metric poses an additional hurdle, especially as this metric becomes increasingly popular. Lastly, validating that word polysemy is reliably represented by sense-specific vectors would facilitate their diachronic analysis in finer, more ecological models of semantic change. By contributing to the resolution of these methodological issues, this dissertation aims to provide general guidelines and frameworks for hypothesis testing in NLP research that go beyond the specific problem of semantic change.

1.4 Current work

The first paper (see Chapter 2) in this dissertation examines whether words corresponding to different parts of speech (POS), i.e., nouns, verbs and adjectives, differ in their rates of semantic change. This is the first time such a question could be experimentally tested, as it requires a large-scale bottom-up analysis over an entire lexicon, which was impossible until recently. In this article, we analyze the semantic change rates of a very large sample of nouns, verbs and adjectives, and compare their rates of semantic change throughout the decades of the 20th century. This study sets out to demonstrate the usefulness of our approach for semantic change research, as it focuses on regularities of semantic change that may only be uncovered on a large scale. The second paper (see Chapter 3) tackles a theoretical question: are there linguistic factors that make certain words more prone to semantic change? We address this question by mapping semantic change patterns over an entire lexicon in order to examine why certain words change more than others. Similarly to our first paper, we take a whole-lexicon approach and analyze the semantic change rates of the 7000 most frequent words in the lexicon throughout the decades of the 20th century. We compare the results to previous theoretical accounts that proposed causal factors in semantic change (which were based on a very small set of hand-picked examples). Thus, this study further demonstrates the potential of our large-scale bottom-up approach to contribute to the investigation of the linguistic causes that are involved in semantic change by rigorously testing an existing theoretical hypothesis on a much larger scale. The third paper (see Chapter 4) addresses the problem of evaluating models of semantic change in the absence of gold standard evaluation sets. As a result of this lacuna, it is necessary to find a way to validate the results of these models.
Specifically, we examine the standard metric in semantic change research, i.e., the cosine distance, which, despite being widely used, has never been directly validated. We developed a control condition crafted specifically for this task, under which we test this metric, in addition to an analytical investigation, and revisited previous studies that relied on this metric in their analyses. Overall, this paper provides a critical validation analysis for a pivotal methodological procedure in semantic change research. Importantly, it proposes a general framework to circumvent the problem of an absent gold standard evaluation set for semantic change, by using control conditions. Such a framework may also be used to verify the reliability of results reported in prior art, in addition to making future contributions stand on a more solid methodological footing. Consequently, the findings of our analyses are not limited to semantic change research, and may be useful for the NLP research community at large. Our fourth paper (see Chapter 5) aims to test the feasibility of using sense-specific vectors as a “next generation” model for semantic change, due to their presumably more ecological representation of meaning. Specifically, we set out to investigate the validity of the claim that sense-specific vectors truly represent polysemic information, a claim that was based on performance gains obtained using these vectors. This is a necessary step before such vectors can be used as a viable tool to study changes in sense representations over time. We rigorously tested this claim, first by dismantling it into its constituent elements, and then testing each element using a combination of an appropriate control condition, an analytical analysis and computational simulations.
Overall, this paper provides a critical reevaluation of prominent results in polysemy analysis and representation, a reevaluation that is not limited to semantic change research and further promotes the use of carefully designed control conditions and validation routines in NLP research at large.

References

Baroni, Marco, Georgiana Dinu, and Germán Kruszewski (2014). “Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors.” In: Proceedings of ACL, pp. 238–247.
Blank, Andreas (1999). “Why do new meanings occur? A cognitive typology of the motivations for lexical semantic change”. In: Historical Semantics and Cognition. Ed. by Andreas Blank and Peter Koch. Mouton de Gruyter, pp. 61–89. DOI: 10.1515/9783110895100.335.
Bloomfield, L (1933). Language. University of Chicago Press.
Bochkarev, Vladimir, Valery Solovyev, and Sören Wichmann (2014). “Universals versus historical contingencies in lexical evolution”. In: Journal of The Royal Society Interface 11, pp. 1–23.
Bybee, Joan (2006). Frequency of Use and the Organization of Language. Oxford: Oxford University Press, p. 375.
— (2015). Language Change. Cambridge, UK: Cambridge University Press, p. 292.
Cook, Paul and Suzanne Stevenson (2010). “Automatically Identifying Changes in the Semantic Orientation of Words.” In: Proceedings of LREC.
Dubossarsky, Haim, Yulia Tsvetkov, Chris Dyer, and Eitan Grossman (2015). “A bottom up approach to category mapping and meaning change.” In: NetWordS 2015 Word Knowledge and Word Usage, pp. 66–70.
Dubossarsky, Haim, Daphna Weinshall, and Eitan Grossman (2016). “Verbs change more than nouns: A bottom up computational approach to semantic change”. In: Lingue e Linguaggio 15.1, pp. 5–25.
Firth, John Rupert (1957). Papers in Linguistics 1934–1951. London: Oxford University Press.
Fonteyn, Lauren and Stefan Hartmann (2016). “Usage-based perspectives on diachronic morphology: a mixed-methods approach towards English ing-nominals”. In: Linguistics Vanguard 2.1, pp. 1–12. DOI: 10.1515/lingvan-2016-0057.
Frermann, Lea and Mirella Lapata (2016). “A Bayesian model of diachronic meaning change”. In: Transactions of the Association for Computational Linguistics 4, pp. 31–45.
Geeraerts, Dirk (1985). “Cognitive restrictions on the structure of semantic change”. In: Historical Semantics, pp. 127–153.
— (1992). “Prototypicality effects in diachronic semantics: A round-up”. In: Diachrony within Synchrony: Language, History and Cognition. Ed. by G. Kellermann and M. D. Morissey. Frankfurt am Main: Peter Lang, pp. 183–203.
Geeraerts, Dirk, Caroline Gevaerts, and Dirk Speelman (2011). “How anger rose: hypothesis testing in diachronic semantics”. In: Current Methods in Historical Semantics. Ed. by Justyna A. Robinson and Kathryn Allan. Topics in English Linguistics [TiEL]. Berlin/New York: Mouton de Gruyter, pp. 109–32.
Gries, Stefan Th and Martin Hilpert (2008). “The identification of stages in diachronic data: variability-based neighbour clustering”. In: Corpora 3.1, pp. 59–81. ISSN: 1749-5032. DOI: 10.3366/E1749503208000075.
Gulordava, Kristina and Marco Baroni (2011). “A distributional similarity approach to the detection of semantic change in the Google Books Ngram corpus”. In: Proceedings of the GEMS 2011 Workshop on GEometrical Models of Natural Language Semantics. Association for Computational Linguistics, pp. 67–71.
Guo, Jiang, Wanxiang Che, Haifeng Wang, and Ting Liu (2014). “Revisiting embedding features for simple semi-supervised learning”. In: Proceedings of EMNLP, pp. 110–120.
Hamilton, William L, Jure Leskovec, and Dan Jurafsky (2016). “Diachronic word embeddings reveal statistical laws of semantic change”. In: Proceedings of ACL.
Harris, Zellig S (1954). “Distributional structure”. In: Word 10.2-3, pp. 146–162.
Heylen, Kris, Thomas Wielfaert, Dirk Speelman, and Dirk Geeraerts (2015). “Monitoring polysemy: Word space models as a tool for large-scale lexical semantic analysis”. In: Lingua 157, pp. 153–172.
Hilpert, Martin (2006). “Distinctive collexeme analysis and diachrony”. In: Corpus Linguistics and Linguistic Theory 2.2, pp. 243–256.
Hilpert, Martin and Florent Perek (2015). “Meaning change in a petri dish: constructions, semantic vector spaces, and motion charts”. In: Linguistics Vanguard 1.1, pp. 339–350.
Hock, H H and B D Joseph (2009). Language History, Language Change, and Language Relationship: An Introduction to Historical and Comparative Linguistics. Trends in Linguistics. Studies and Monographs [TiLSM]. De Gruyter.
Huang, Eric H, Richard Socher, Christopher D Manning, and Andrew Y Ng (2012). “Improving word representations via global context and multiple word prototypes”. In: Proceedings of ACL. Association for Computational Linguistics, pp. 873–882.
Jatowt, Adam and Kevin Duh (2014). “A framework for analyzing semantic change of words across time”. In: Proceedings of the 14th ACM/IEEE-CS Joint Conference on Digital Libraries, pp. 229–238.
Kim, Yoon, Yi-I Chiu, Kentaro Hanaki, Darshan Hegde, and Slav Petrov (2014). “Temporal Analysis of Language through Neural Language Models”. In: Proceedings of ACL, pp. 61–65.
Koch, Peter (2016). “Meaning change and semantic shifts”. In: The Lexical Typology of Semantic Shifts. Ed. by Päivi Juvonen and Maria Koptjevskaja-Tamm. Chap. 2, pp. 21–66.
Kulkarni, Vivek, Rami Al-Rfou, Bryan Perozzi, and Steven Skiena (2015). “Statistically significant detection of linguistic change”. In: Proceedings of WWW, pp. 625–635.
Li, Jiwei and Dan Jurafsky (2015). “Do multi-sense embeddings improve natural language understanding?” In: arXiv preprint arXiv:1506.01070.
Mikolov, Tomas, Wen-tau Yih, and Geoffrey Zweig (2013). “Linguistic regularities in continuous space word representations”. In: Proceedings of NAACL, pp. 746–751.
Neelakantan, Arvind, Jeevan Shankar, Alexandre Passos, and Andrew McCallum (2014). “Efficient nonparametric estimation of multiple embeddings per word in vector space”. In: Proceedings of EMNLP.
Newman, John (2015). “Semantic shift”. In: The Routledge Handbook of Semantics. Ed. by Nick Riemer. New York: Routledge, pp. 266–280.
Pennington, Jeffrey, Richard Socher, and Christopher Manning (2014). “Glove: Global vectors for word representation”. In: Proceedings of EMNLP, pp. 1532–1543.
Reisig, Christian Karl (1839). Vorlesungen über lateinische Sprachwissenschaft. Leipzig: Lehnhold.
Sagi, Eyal, Stefan Kaufmann, and Brady Clark (2009). “Semantic density analysis: Comparing word meaning across time and phonetic space”. In: Proceedings of the Workshop on Geometrical Models of Natural Language Semantics. Association for Computational Linguistics, pp. 104–111.
Schlechtweg, Dominik, Stefanie Eckmann, Enrico Santus, Sabine Schulte im Walde, and Daniel Hole (2017). “German in Flux: Detecting Metaphoric Change via Word Entropy”. In: Proceedings of CoNLL, pp. 354–367.
Smirnova, Elena (2012). “On some Problematic Aspects of Subjectification”. In: Language Dynamics and Change 2.1, pp. 34–58. ISSN: 2210-5832.
Socher, Richard, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Ng, and Christopher Potts (2013). “Recursive deep models for semantic compositionality over a sentiment treebank”. In: Proceedings of EMNLP, pp. 1631–1642.
Stefanowitsch, Anatol and Stefan Th Gries (2003). “Collostructions: Investigating the interaction of words and constructions”. In: International Journal of Corpus Linguistics 8.2, pp. 209–243.
Tantucci, Vittorio, Jonathan Culpeper, and Matteo Di Cristofaro (2017). “Dynamic resonance and social reciprocity in language change: the case of Good morrow”. In: Language Sciences. ISSN: 0388-0001.
Traugott, Elizabeth Closs and Richard B Dasher (2002). Regularity in Semantic Change. Vol. 97. Cambridge University Press.
Turian, Joseph, Lev Ratinov, and Yoshua Bengio (2010). “Word representations: a simple and general method for semi-supervised learning”. In: Proceedings of ACL. Association for Computational Linguistics, pp. 384–394.
Turney, Peter D and Patrick Pantel (2010). “From frequency to meaning: Vector space models of semantics”. In: Journal of Artificial Intelligence Research 37, pp. 141–188.
Wijaya, Derry Tanti and Reyyan Yeniterzi (2011). “Understanding semantic change of words over centuries”. In: Proceedings of the 2011 international workshop on DETecting and Exploiting Cultural diversiTy on the social web. ACM, pp. 35–40.
Zalizniak, Anna S., Maria Bulakh, Dmitrij Ganenkov, Ilya Gruntov, Timur Maisak, and Maxim Russo (2012). “The catalogue of semantic shifts as a database for lexical semantic typology”. In: Linguistics 50.3, pp. 633–669. ISSN: 00243949. DOI: 10.1515/ling-2012-0020.

Chapter 2

The Diachronic Word-Class Effect

Verbs change more than nouns: A bottom up com- putational approach to semantic change

Published

Dubossarsky, H., Grossman, E., & Weinshall, D. (2016). Verbs change more than nouns: A bottom up computational approach to semantic change. Lingue e Linguaggio, XV.1, 5-25.

* Affiliations appear in the last page of the published paper

VERBS CHANGE MORE THAN NOUNS: A BOTTOM-UP COMPUTATIONAL APPROACH TO SEMANTIC CHANGE

HAIM DUBOSSARSKY DAPHNA WEINSHALL EITAN GROSSMAN

ABSTRACT: Linguists have identified a number of types of recurrent semantic change, and have proposed a number of explanations, usually based on specific lexical items. This paper takes a different approach, by using a distributional semantic model to identify and quantify semantic change across an entire lexicon in a completely bottom-up fashion, and by examining which distributional properties of words are causal factors in semantic change. Several independent contributing factors are identified. First, the degree of prototypicality of a word within its semantic cluster correlated inversely with its likelihood of change (the “Diachronic Prototypicality Effect”). Second, the word class assignment of a word correlates with its rate of change: verbs change more than nouns, and nouns change more than adjectives (the “Diachronic Word Class Effect”), which we propose may be the diachronic result of an independently established synchronic psycholinguistic effect (the “Verb Mutability Effect”). Third, we found that mere token frequency does not play a significant role in the likelihood of a word’s meaning to change. A regression analysis shows that these effects complement each other, and together, cover a significant amount of the variance in the data.

KEYWORDS: semantic change, distributional semantics.

1. THE PROBLEM OF SEMANTIC CHANGE

Lexical semantic change - change in the meanings of words - is a basic fact of language change that can be observed over long periods of time. For example, the English word girl originally indicated a child of either sex, but in contemporary English, it refers only to a female child. Bybee shows the turning point was the fifteenth century, after the conventionalization of the word boy to refer to a male child, which “cut into the range of reference for girl” (Bybee 2015: 202). But semantic change is also “an undeniable and ubiquitous facet of our experience of language” (Newman 2015: 267), with words acquiring new


senses, developing new polysemies, and entirely new meanings, in timeframes that can be observed even by casual observation by speakers. For example, recent changes in technology have led to novel meanings of words like navigate, surf, and desktop (Newman 2015: 266). Speakers and listeners may even be aware of “mini” semantic change in real time, when they experience an innovative use of an existing word. Linguists have identified some recurring types of semantic change. Some of the major types include the textbook examples of change in scope, e.g., widening (Latin caballus ‘nag, workhorse’ > Spanish caballo ‘horse’) or narrowing (hound ‘canine’ > ‘hunting dog’), or in connotation (amelioration or pejoration). However, the systematic search for an explanatory theory of semantic change was largely neglected until Geeraerts (1985, 1992) and Traugott & Dasher (2002), who both claimed that semantic change is overwhelmingly regular. Moreover, both Geeraerts and Traugott have claimed that semantic change – like language change in general – is rooted in and constrained by properties of human cognition and of language usage. Contemporary research identifies different kinds of regularity in semantic change as tendencies of change, which are asymmetries with respect to the directions in which change is more likely to occur. For example, Traugott & Dasher (2002) propose that semantic change regularly follows the pathway: objective meaning > subjective meaning > intersubjective meaning. It has also been suggested that concrete meanings tend to develop into more abstract ones (Bloomfield 1933; Haspelmath 2004; Sweetser 1990). See the following examples:

(1) see ‘visual perception’ > ‘understanding’

(2) touch ‘tactile perception’ > ‘feel’

(3) head ‘body part’ > ‘chief’

Another often-observed regularity is that semantic change overwhelmingly tends to entail polysemy, in which a word or expression acquires new senses that co-exist with the older conventionalized senses (e.g., a new sense for surf has emerged since the 1990s). These new senses can continue to co-exist stably with the older ones or to supplant earlier senses, thereby “taking over” the meaning of the word. The existence of such regularities and asymmetries, or “unidirectional pathways of change”, has been taken as evidence that language change is not random. Moreover, these asymmetries call for explanations that are plausible in terms of what we know about human cognition and communication. Numerous such explanations have been offered, from Traugott & Dasher's (2002) influential Neo-Gricean account to other pragmatically-based accounts (for an



overview, see Grossman & Noveck 2015). However, while such accounts may offer potentially convincing explanations for observed changes, there is to date no empirically-grounded theory that can explain – or predict – which words are likely to undergo semantic change, and why this is so, across an entire lexicon. This last point is the focus of the present article. While historical linguists have painstakingly accumulated much data about – and proposed explanations for – cross-linguistically recurrent pathways of semantic change (e.g., body-part term > spatial term), the data and explanations are usually specific to a particular group of words. For example, the explanations proposed for the development of body-part terms into spatial terms cannot necessarily be generalized to words of other semantic classes. In fact, the question posed in this article – what are the specific properties of words that make them more or less prone to semantic change? – has been almost entirely neglected in historical linguistic research. Furthermore, most studies of attested pathways of change tend to focus on their descriptive semantics, and have tended to ignore their distributional properties. Nonetheless, some work in this direction can be found in earlier structuralist and cognitivist theories of semantic change, which emphasized the role of the structure of the lexicon in explaining semantic change. For example, it has often been assumed that changes in words’ meanings are due to a tendency for languages to avoid ambiguous form-meaning pairings, such as homonymy, synonymy, and polysemy (Anttila 1989; Menner 1945). On the other hand, when related words are examined together, it has been observed that one word’s change of meaning often “drags along” other words in the same semantic field, leading to parallel change (Lehrer 1985).
These seemingly contradictory patterns of change lead to the conclusion that if ambiguity avoidance is indeed a cause of semantic change, its role is more complex than initially assumed. However, what is common to both ideas – the putative tendency to avoid ambiguous form-meaning pairings and the equally putative tendency for words in the same semantic domain to change in similar ways – is the observation that changes in a word’s meaning may result from – or cause – changes in the meaning of a semantically related word. The idea that words should be examined relative to each other, and that these relations play a causal role in semantic change, is elaborated by Geeraerts (1985, 1992), who maps related words into clusters and, based on Rosch’s prototype theory (1973), establishes which words are the prototypical or peripheral exemplars within each cluster. Geeraerts analyzes these clusters diachronically, finds characteristic patterns of change due to meaning overlap, and concludes that prototypical semantic areas are more stable diachronically than peripheral ones. While Geeraerts’s


ideas are promising for studies of semantic change, they are based on case-studies hand-picked by the linguist, and are not based on large-scale corpora (Geeraerts 2010). This is a lacuna in the research field of semantic change, which we have addressed in a previous article (Dubossarsky et al. 2015) by articulating a method for identifying and quantifying semantic change across an entire lexicon, represented by a massive historical corpus. Our aim in the present article is to evaluate whether other distributional properties of words are indeed implicated in semantic change. Specifically, we examine whether words of different parts-of-speech or word classes change at different rates. The null hypothesis is that there is no relationship between word class assignment and rate of change. However, we predict that there will indeed be differences, based on the fact that different word classes prototypically encode cognitively different things: nouns prototypically encode entities, verbs prototypically encode events, and adjectives prototypically encode properties. Moreover, different word classes can have significantly different collocational properties, i.e., they occur in different types and ranges of contexts. Finally, Sagi et al. (2009), one of the only studies to tackle this question, found that in 19th century English, a small selection of verbs showed a higher rate of change than nouns. It is important to stress that at no time do we, or any of the above works cited as far as we know, claim that semantic change is governed by a single factor. In fact, it is clear that previous work on semantic change is likely to be correct in supposing that social, historical, technological, cognitive, communicative, and other factors are implicated in semantic change. The question is how to tease them apart and understand their respective contributions.
This paper demonstrates that an observable property of words, i.e., their part-of-speech or word class assignment, is indeed implicated in semantic change. Moreover, we demonstrate that this effect is in addition to another effect which we have argued for earlier, namely, that the position of a word within its semantic cluster – interpreted as its degree of prototypicality – also affects its likelihood of change. The structure of the paper is as follows: in Section 2, we sketch the methodology used, and in Section 3, we describe the experiment conducted. In Section 4 we discuss the results, and in Section 5 we analyze possible interactions with other factors. Section 6 is devoted to a discussion of the results and their implications. Section 7 provides concluding remarks, focusing on directions for future research.

2. METHODOLOGY

2.1 The role of input frequency



There are numerous ways of representing lexical meaning. Computational models developed for representing meaning excel in what computational approaches do best and classical approaches do poorly, namely, the large-scale analysis of language usage and the precise quantitative representation of meaning. At the heart of these models lies the “distributional hypothesis” (Firth 1957; Harris 1954), according to which the meaning of words can be deduced from the contexts in which they appear. We employ a distributional semantic modeling (DSM) approach to represent word meanings. DSM collects distributional information on the co-occurrence profiles of words, essentially showing their collocates (Hilpert 2006; Stefanowitsch & Gries 2003), i.e., the other words with which they co-occur in specific contexts. Traditionally, this is done by representing each word in terms of its collocates across an entire lexicon. This type of model has the advantage of providing an explicit (or direct) quantitative measure of a word’s meaning, and is informative in that it tells us which words do or do not occur with a given word of interest. However, since most words occur with a limited range of collocates, most of the words in a lexicon will co-occur with most other words in the lexicon zero times. As such, these kinds of representations are sparse. This can be seen in the illustrative example below, where only ten words collocate with the word pan, while the rest of the vocabulary (i.e., surf, sky, dress, hat, call, etc.) does not.

    collocate    # collocations
    oil          87
    fry          69
    pot          61
    stir         55
    egg          51
    cake         49
    cook         23
    stove        19
    butter       17
    bacon        9
    sky          0
    hat          0
    surf         0
    dress        0

TABLE 1. WORD COLLOCATION STATISTICS FOR THE WORD PAN (ILLUSTRATIVE EXAMPLE)

This type of representation is usually further analyzed, e.g., by normalizing the word counts to frequencies, or with more sophisticated statistical methods, e.g., tf-idf or pointwise mutual information. However, for our purposes, such models are inadequate, because in the end they tell us only whether a word co-occurs with another word or not. In order to understand the relationship of a word with the rest of the words in an entire lexicon, other types of models are necessary. These models are the more recent ones that exploit machine-learning and neural network tools to learn the distributional properties of words automatically. Unlike traditional models, they do so by representing words in terms of the interaction of multiple properties. However, the specific contribution of


each property, when taken on its own, is opaque; as such, the quantitative representation of a word’s meaning is implicit. Of the available recent models of this type, we chose a recently developed skip-gram word2vec model (Mikolov et al. 2013c, 2013d). This word2vec model has been fruitfully applied to distributional semantic corpus research, and scores high in semantic evaluation tasks (Mikolov et al. 2013a). As we will show, proof-of-concept can also be found in our results. The word2vec model captures the meaning of words through dense vectors in an n-dimensional space. Every time a word appears in the corpus, its corresponding vector is updated according to the collocational environment in which it is embedded, up to a fixed distance from that word. The update is carried out such that the probability with which these words predict their context is maximized (Figure 1a). As a result, words that predict similar contexts are represented with similar vectors. In fact, this is much like linguistic items in a classical structuralist paradigm, whose interchangeability at a given point or “slot” in the syntagmatic chain implies that they share certain aspects of function or meaning, i.e., the Saussurean notion of “value” (Figure 1b). It is worth noting that if taken individually, the vectors’ dimensions are opaque; only when the full range of dimensions is taken together do they capture the meaning of a word in the semantic hyper-space they occupy.
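The sparse count-based representation of Table 1 can be sketched in a few lines of code. The following is a minimal illustration only: the toy corpus, the window size, and the function name are assumptions made for the example, not part of the study's actual pipeline (which uses word2vec over Google Ngrams).

```python
from collections import Counter, defaultdict

def count_vectors(sentences, window=2):
    """Build sparse co-occurrence count vectors (Table-1 style):
    each word is represented by the raw counts of its collocates
    within a symmetric window."""
    vectors = defaultdict(Counter)
    for sent in sentences:
        for i, word in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][sent[j]] += 1
    return vectors

# Toy corpus (invented for illustration).
toy = [["fry", "egg", "in", "the", "pan"],
       ["heat", "oil", "in", "the", "pan"],
       ["surf", "on", "the", "ocean"]]
v = count_vectors(toy)

# "pan" co-occurs with its cooking-related context, never with "surf":
assert v["pan"]["the"] > 0
assert v["pan"]["surf"] == 0
```

Most entries in such vectors are zero, which is exactly the sparsity problem that motivates the dense prediction-based vectors discussed above.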

FIGURE 1. (A) WORD2VEC SKIP-GRAM ARCHITECTURE. GIVEN A WORD, W(T), THE MODEL PREDICTS THE WORDS THAT PRECEDE AND FOLLOW IT IN A WINDOW OF 4 WORDS, W(T-2), W(T-1), W(T+1), W(T+2) (MIKOLOV ET AL. 2013B). (B) AN EXAMPLE OF THE CLASSICAL STRUCTURALIST PARADIGM.

While it may be surprising for linguists that one would choose to rely on a model whose individual dimensions are opaque, this is not a major concern, since it is well-established that words assigned similar vectors by the model are in fact semantically related in an intuitive way; for a recent demonstration,



see Hilpert & Perek (2015), which looks at the collocates of a single construction in English. The similarity between vectors is evaluated quantitatively, and defined as the cosine distance between the vectors in the semantic hyper-space. Short distances are considered to reflect similarity in meaning: related words are closer to each other in the semantic space (Turney 2006; Mikolov et al. 2013d; Levy & Goldberg 2014). In fact, this is reflected in a word’s nearest neighbors in the semantic space, which often capture synonymic, antonymic or level-of-category relations. Although the model uses the entire lexicon for training, the accuracy of the meaning representations captured in the corresponding vectors is expected to diminish for less frequent words. This is simply because these words do not appear frequently enough for the model to learn their corresponding contexts. Therefore, only the most frequent words in the corpus, excluding stop-words, are defined as words-of-interest and are further analyzed. These words represent the entire lexicon.
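The cosine distance used here is a standard quantity; the sketch below shows how it compares two hypothetical vectors for the same word at two time points. The vectors themselves are made-up three-dimensional examples (real word2vec vectors have hundreds of dimensions), chosen only to illustrate that an unchanged meaning yields a distance near zero while a shifted meaning yields a large distance.

```python
import math

def cosine_distance(u, v):
    """Cosine distance = 1 - cosine similarity. Near 0 means the vectors
    point in the same direction (similar meaning); larger values mean
    the word's distributional profile has drifted."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

# Hypothetical vectors for one word at two time points.
v_1900 = [0.9, 0.1, 0.0]
v_1990_same = [0.9, 0.1, 0.0]      # meaning unchanged
v_1990_shifted = [0.1, 0.9, 0.2]   # meaning shifted

assert abs(cosine_distance(v_1900, v_1990_same)) < 1e-9
assert cosine_distance(v_1900, v_1990_shifted) > 0.5
```

Because cosine distance depends only on the angle between vectors, it is insensitive to vector length, which is one reason it is the conventional choice for comparing embeddings.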

2.2 Corpus

A massive historical corpus is required to train distributional semantic models. This is because the words whose distributional properties we are interested in must appear frequently enough in each time period in order to collect enough statistical information about their properties. Clearly, the time resolution of any analysis on such models is limited by the nature of the historical corpus: the finer the tagging for time, the finer the analysis can be. Google Ngrams is the best available historical corpus for our purposes, as it provides an unprecedented time resolution – year by year – on a massive scale; the second largest historical corpus is about 1000 times smaller. Tens of millions of books were scanned as part of the Google Books project, and aggregated counts of Ngrams on a yearly resolution from those books are provided. We used a recently published syntactic-Ngram dataset (Goldberg & Orwant 2013), where the words1 are analyzed syntactically using a dependency

1 The present study deals with word forms rather than lexemes. While this is possibly a shortcoming, it is shared by most NLP studies of massive corpora. Furthermore, the issue is less likely to affect English, with its relatively poor morphology, than other languages. Nevertheless, one might speculate about the effects of this. For example, it might be that the meaning of specific verb forms in the corpus will be narrower than that of specific noun forms, overall, in an analysis based on word forms as compared to one based on lexemes. While it would be of considerable interest to conduct an experiment to determine the effect of using word forms versus lexemes, the issue has never been dealt with explicitly in computational linguistics, as far as we know, and it is beyond the scope of the present paper. We thank an anonymous reviewer for bringing this issue to our attention.


HAIM DUBOSSARSKY, DAPHNA WEINSHALL AND EITAN GROSSMAN

parser in their original sentences. The dataset provides aggregated counts of syntactic Ngrams on a yearly resolution that includes their part-of-speech (POS)² assignments as well. The dataset distinguishes content words, which are meaning-bearing elements, from functional markers³ that modify the content words. Therefore, a syntactic Ngram of order N includes exactly N content words and a few optional function markers. We used syntactic Ngrams of 4 content words from the English fiction books,⁴ and aggregated them over their dependency labels to provide POS Ngrams. The following is an example POS Ngram from the corpus.

(4) and_CC with_IN sanction_NN my_PR tears_NN gushed_VB out_RB

Verbs, nouns, and adjectives below a certain frequency threshold, and all the rest of the POS assignments, lose their tags. In this Ngram, only tears retains its tag. The historical corpus is sorted diachronically, with 10 million POS Ngrams (about 50 million words) per year for the years 1850-2000. When the number of POS Ngrams in the corpus for a given year was bigger than that size, due to the increasing number of published and scanned books over time, a random subsampling process was conducted to keep a fixed corpus size per year. This resulted in a corpus size of about 7.5 billion words. Only the words-of-interest, the most frequent words in the corpus, retain their POS assignment, while the rest of the words reverted to their original word forms. All words were lowercased.

2.3 Diachronic Analysis

After initialization, the model is trained incrementally, one year after the other, for the entire historical corpus (POS-tagged and untagged words alike). In this way, the model's vectors at the end of one year's training are the starting point of the following year's training, which makes them comparable diachronically. The model is saved after each year's training, so that the words' vectors can later be restored for synchronic and diachronic analyses. The words' vectors are compared diachronically in order to detect semantic change. Based on the affinity between similarity in meaning and similarity in vectors described in §2.1, semantic change is defined here as the difference between a word's two vectors at two time points. This allows us to quantify

2 We use the term "part-of-speech", abbreviated POS, in the context of Natural Language Processing tagging, and the term "word class" otherwise.
3 These include the following dependency labels: det, poss, beg, aux, auxpass, ps, mark, complm and prt.
4 From the 2nd version of Google Books.


VERBS CHANGE MORE THAN NOUNS: A BOTTOM-UP COMPUTATIONAL APPROACH TO SEMANTIC CHANGE

semantic change in a straightforward fashion: the bigger the distance between the two vectors of a given word, the bigger the semantic change that this word underwent over that period of time. Specifically, the comparison is defined as the cosine distance between the word’s two vectors according to equation 1, with 0 being identical vectors and 2 being maximally different. This is carried out for the entire lexicon.

$$\Delta_w^{t_0 \to t_1} = 1 - \frac{v_w^{t_0} \cdot v_w^{t_1}}{\lVert v_w^{t_0} \rVert \, \lVert v_w^{t_1} \rVert} \qquad (1)$$

where $v_w^{t_0}$ and $v_w^{t_1}$ are word w's vectors at the two time points $t_0$ and $t_1$, respectively. In the following section, we present an experiment that investigates the relationship between word class assignments and likelihood of change.
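Equation (1) translates directly into code; the following minimal sketch uses toy vectors (not the paper's data) and checks the two extreme values of the measure:

```python
import numpy as np

def semantic_change(v_t0, v_t1):
    """Cosine distance between a word's vectors at two time points (eq. 1):
    0 = identical direction, 2 = maximally different."""
    cos = np.dot(v_t0, v_t1) / (np.linalg.norm(v_t0) * np.linalg.norm(v_t1))
    return float(1.0 - cos)

v_1900 = np.array([1.0, 0.0])
# Same direction (even if rescaled) -> change score 0.
assert semantic_change(v_1900, np.array([2.0, 0.0])) < 1e-9
# Opposite direction -> maximal change score 2.
assert abs(semantic_change(v_1900, np.array([-1.0, 0.0])) - 2.0) < 1e-9
```

Note that the measure depends only on vector direction, not magnitude, which is why rescaled vectors score 0.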

3. EXPERIMENT

In this experiment, we evaluate the hypothesis that different parts of speech change at different rates. As noted above, we assume that the null hypothesis is that there is no difference between part of speech assignment and rate of change. However, we predict that there will indeed be differences, based on the fact that different parts of speech prototypically encode cognitively different things: nouns prototypically encode entities, verbs prototypically encode events, and adjectives prototypically encode properties. Moreover, different parts of speech can have significantly different collocational properties, i.e., they occur in different types and ranges of contexts. Finally, pilot studies of this question (Sagi et al. 2009) have indicated that some verbs show a higher rate of change than some nouns.

The word2vec model⁵ was initialized with the vector length set to 52, which means that the words' contexts are captured in a 52-dimension semantic hyperspace. The model was trained over the POS-tagged English fiction corpus (see §2.2), using the method described above (see §2.3). Words that appeared less than 10 times in the entire corpus were discarded from the lexicon and were ignored by the model.

The vectors of the 2000 most frequent verbs, nouns and adjectives (6000 in total) as they appear in the corpus were defined as the words-of-interest, and restored from the model at every decade from 1900 till 2000. For each word, the cosine distances between its vectors at every two consecutive decades were computed using equation (1). This resulted in 6000×10 semantic

5 We used the gensim Python library for its word2vec implementation (Řehůřek & Sojka 2010).


change scores that represent the degree of semantic change that each word underwent in every decade throughout the twentieth century (e.g., 1900-1910, 1910-1920, until 1990-2000). The average semantic change scores of the POS assignment groups were then compared between groups.
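The train–snapshot–restore protocol of §2.3 can be sketched schematically as follows. Here `train_one_year` is a hypothetical placeholder (in the actual study, one year of word2vec training); the sketch shows only the bookkeeping: each year's training starts from the previous year's vectors, and a per-year snapshot is saved so that vectors can be restored later for comparison.

```python
import copy
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the model's state: {word: vector}. Real vectors would be
# 52-dimensional word2vec embeddings; these are random 4-d placeholders.
model = {w: rng.normal(size=4) for w in ["tears", "gushed"]}

def train_one_year(model, corpus_for_year):
    """Hypothetical placeholder for one year of training: nudges the
    existing vectors in place, so state carries over between years."""
    for w in model:
        model[w] = model[w] + 0.01 * rng.normal(size=4)

snapshots = {}
for year in range(1850, 1855):        # the real corpus spans 1850-2000
    train_one_year(model, corpus_for_year=None)
    snapshots[year] = copy.deepcopy(model)   # saved for later restoration

# Because each year resumes from the previous end state, the saved vectors
# are diachronically comparable across snapshots.
assert len(snapshots) == 5
```

The essential design choice is that the model is never re-initialized between years; only the snapshots differ.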

4. RESULTS

Figure 2 shows the average semantic change for the different POS assignment groups at ten decades throughout the twentieth century. The results were submitted to a two-way ANOVA with POS assignment and decade as the independent variables. The first main effect, also clearly visible, is that the POS assignment groups differ in their rates of semantic change over all the decades (F(2,59970) = 6464, η² = .177, p-value < .001). The second main effect is that the semantic change rate appears to differ throughout different decades across all POS assignment groups (F(9,59970) = 576, η² = .08, p-value < .001). The interaction between the variables was found to be significant as well (F(18,59970) = 14.34, η² = .004, p-value < .001). This means that the rate of semantic change along the decades is not uniform across the POS assignment groups. However, the effect size of the first two variables reported above is robust, accounting for 17.7% and 8% of the overall variance in the words' semantic change, respectively, which renders these variables highly meaningful. In contrast, the effect size of the aforementioned interaction accounts for only 0.4% of the variance, which makes it unimportant, albeit statistically significant.

In order to evaluate the source of the first main effect – the difference in the rate of semantic change between the POS assignment groups – we conducted permutation tests as a post-hoc analysis on the pairs verbs-nouns and nouns-adjectives. The permutation tests created null hypotheses for each pair by assigning words to one of the two POS groups randomly, then computing the differences between the averages of the two groups, and repeating the process 10,000 times for each decade. These distributions were later compared to the real differences in the average semantic change in each decade, so that their statistical significance could be evaluated.
The permutation tests corroborate what is visibly clear from the descriptive pattern of the results (all p-values < .001): verbs change more than nouns, and nouns change more than adjectives.
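A permutation test of this kind can be sketched as follows. The change scores here are simulated (illustrative values only, with verbs given a higher mean than nouns to mimic the observed pattern); the test asks how often a random relabeling of words produces a group difference as large as the observed one.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated per-word change scores for one decade (made-up values).
verbs = rng.normal(loc=0.30, scale=0.05, size=200)
nouns = rng.normal(loc=0.25, scale=0.05, size=200)

observed = verbs.mean() - nouns.mean()

# Null distribution: shuffle the pooled scores and re-split at random,
# i.e., assign words to the two POS groups randomly.
pooled = np.concatenate([verbs, nouns])
null = np.empty(10_000)
for i in range(10_000):
    rng.shuffle(pooled)
    null[i] = pooled[:200].mean() - pooled[200:].mean()

# One-sided p-value: fraction of random splits beating the observed difference.
p_value = (null >= observed).mean()
print(p_value)
```

With a genuine group difference, virtually no random relabeling matches the observed gap, so the p-value is (near) zero.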


FIGURE 2. AVERAGE SEMANTIC CHANGE RATES THROUGHOUT THE DECADES IN THE TWENTIETH CENTURY FOR DIFFERENT POS ASSIGNMENT GROUPS. BARS REPRESENT STANDARD ERRORS.

5. INTERACTION WITH OTHER FACTORS: FREQUENCY AND PROTOTYPICALITY

In previous work, at least two observable properties of words have been argued to be causally implicated in semantic change: word frequency and prototypicality. We wanted to test their joint involvement in semantic change in light of the aforementioned findings.

5.1 Frequency

Frequency is often linked to language change, but its exact effects still remain to be worked out (Bybee 2006, 2010). While frequency clearly facilitates reductive formal change in grammaticalization, it also protects morphological structures and syntactic constructions from analogy (e.g., irregular verb forms are more frequent). Since no explicit hypothesis has been made regarding the role of frequency in semantic change per se, we set out to test the hypothesis that frequency plays some role in semantic


change. The null hypothesis was that there is no correlation between words' frequencies and their degree of semantic change. Token frequencies were extracted from the entire corpus (about 7.5 billion words) and served as the words' frequencies. The degrees of semantic change were taken from the results reported in §4 above.

In general, frequency was not found to correlate with the degree of words' semantic change over the ten decades in the twentieth century. Only four decades (1900-1910; 1910-1920; 1950-1960; 1960-1970) showed significant (p-value < .01) correlations. However, such correlations are so small, with a maximum correlation coefficient < .07, that in terms of their effect size they account for less than 0.5% of the variance in the semantic change scores. Similar results were obtained when the analysis was repeated for each POS assignment group separately. Most correlations were statistically insignificant, and the ones that were significant were very small. Overall, these results suggest that frequency plays little or no role in semantic change.

We think that this result is surprising, since frequency is often thought to correlate with the degree of entrenchment of linguistic items in the mental lexicon (Bybee 2010). As such, one might hypothesize that words with high token frequency might be "protected" from semantic change. However, this hypothesis is counter-indicated by the results of our experiment. It may be that token frequency is, in the end, mainly responsible for coding asymmetries (Haspelmath 2008) and does not contribute much to semantic change per se.
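The logic of this effect-size argument can be illustrated with a minimal correlation check. The data below are simulated under independence (they are not the paper's scores), mirroring the reported near-zero correlations: even a nominally computable r accounts for a negligible share of variance.

```python
import numpy as np

rng = np.random.default_rng(7)

# Simulated (log) token frequencies and change scores drawn independently,
# standing in for the 2000 words-of-interest per POS group.
log_freq = rng.normal(size=2000)
change   = rng.normal(size=2000)

# Pearson correlation and the variance it accounts for (r squared).
r = np.corrcoef(log_freq, change)[0, 1]
print(round(r, 3), "variance explained:", round(100 * r**2, 2), "%")
```

With independent variables, r hovers near zero and r² stays well under 1% of the variance, the same order of magnitude as the reported effect sizes.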

5.2 Prototypicality

One of the model's inherent properties is that similar words have similar vectors (see §2.1). This makes the vectors ideal for clustering, where each cluster captures the words' "semantic landscape," as Hilpert & Perek (2015) call it. Importantly, it turned out that these clusters exhibit an internal structure, with some words closer to the center and others further away. In Dubossarsky et al. (2015) we analyzed this structure, and interpreted the distance of a word from its cluster center as reflecting its degree of prototypicality, which is the degree to which a word resembles its category prototype. Crucially, this prototypicality was found to play an important role in semantic change: the further a word is from its category's prototype, the more likely it is to undergo change.

We applied the methodology described in Dubossarsky et al. (2015) to the current dataset. Specifically, for each decade we cluster the 6000 word vectors using 1500 clusters, and compute the words' distances from their cluster centroids. This resulted in ten "prototypicality scores" for each word. In Table 2, we present two clusters as examples. In each cluster, the words are sorted in prototypicality order (distance from their cluster's center). As a


result, said and chamber/room appear at the tops of their lists, and constitute the most prototypical exemplars in their clusters: verbs of utterance and enclosed habitats for humans, respectively (see Dubossarsky et al. 2015 for further examples).

said_VB, 0.06        chamber_NN, 0.04
exclaimed_VB, 0.08   room_NN, 0.04
answered_VB, 0.08    drawing_NN, 0.05
added_VB, 0.11       bedroom_NN, 0.06
whispered_VB, 0.13   kitchen_NN, 0.07
cried_VB, 0.14       apartment_NN, 0.1
murmured_VB, 0.15
growled_VB, 0.16
repeated_VB, 0.2
muttered_VB, 0.25

TABLE 2. TWO WORD CLUSTERS, WITH POS TAGS AND DISTANCES FROM THEIR CENTROID, SORTED IN ASCENDING ORDER OF THE LATTER.

We used this approach to extend our previous finding, which focused on semantic change in only one decade (1950-1960), to the entire twentieth century. Indeed, prototypicality at the beginning of each of the ten decades was related to the semantic change the words underwent by the end of that decade. Correlation coefficients ranged between r=.27 and r=.35, with an average coefficient of r=.32 (all p-values < .001). This means that the farther a word is from the prototypical center of its category, the more likely it is to undergo semantic change, and attests to the meaning-conserving nature of prototypicality in semantic change. This could be called the "Diachronic Prototypicality Effect".
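The clustering step can be sketched minimally as follows, assuming a simple Lloyd's k-means (the paper does not name the clustering algorithm, so this is an assumption) and toy vectors in place of the real embeddings:

```python
import numpy as np

rng = np.random.default_rng(1)

def kmeans(X, k, iters=50):
    """Minimal Lloyd's k-means with farthest-first initialization."""
    idx = [0]
    for _ in range(k - 1):
        # Pick the point farthest from the centroids chosen so far.
        d = np.min(np.linalg.norm(X[:, None] - X[idx][None], axis=2), axis=1)
        idx.append(int(d.argmax()))
    centroids = X[idx].copy()
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels

# Two well-separated toy "clusters" standing in for word vectors.
X = np.vstack([rng.normal(0.0, 0.1, (20, 4)), rng.normal(3.0, 0.1, (20, 4))])
centroids, labels = kmeans(X, k=2)

# Prototypicality score: distance of each word from its cluster centroid
# (smaller = more prototypical, like `said` and `chamber` in Table 2).
proto = np.linalg.norm(X - centroids[labels], axis=1)
print(proto.round(2))
```

Sorting each cluster's members by `proto` in ascending order reproduces the kind of listing shown in Table 2.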

5.3 Regression analysis

It is intuitively clear that semantic change is not induced solely by a single factor, and that different factors may also be involved. Therefore, we wanted to evaluate the interaction between the two factors that were proven to be involved in semantic change: word class assignment and prototypicality.

In order to discern the contribution of these two factors, whether they complement each other or are to a large extent redundant, they were submitted to a multiple linear regression analysis. Prototypicality, as distance from centroid, and POS assignment were the independent variables, and the semantic change scores were the dependent variable. Regression analyses were conducted for these variables at each of the ten decades, and also pooled over all the decades.

Table 3 shows the contribution of each of the two variables in accounting for the semantic change in each of the ten decades examined as well as over all


the decades (all the results reported were statistically significant, p-value < .01). The results show that the two variables account for a fair amount of the variance in semantic change, between 21%-29%. Although both variables account for a large part of semantic change when taken individually, POS plays a larger role. Prototypicality, despite playing a lesser role, accounts for a substantial amount of the variance in semantic change as well, which exactly reflects its correlation coefficients' values reported above.

Crucially, prototypicality's unique contribution to the variance in semantic change, over and above what is explained by POS, is smaller than its individual contribution. This indicates that the two variables overlap to a certain degree, and are not fully independent. However, the fact that prototypicality adds a substantial and unique explanatory power to the regression model suggests that different independent causal elements are involved in semantic change. Our variables are unable to capture these elements in a fully independent form, but a different choice of variables, at a different linguistic level, perhaps could. Nevertheless, the results support the hypothesis that the different factors involved in semantic change can ultimately be teased apart.
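This kind of variance partitioning can be illustrated with a small simulation. The data below merely mimic the described overlap between POS and prototypicality (all coefficients are invented, not the paper's); the unique contribution of prototypicality is then the R² difference between the full model and the POS-only model.

```python
import numpy as np

rng = np.random.default_rng(3)

n = 6000
# POS coded as two dummy variables (verb/noun, with adjective as baseline);
# prototypicality is made to correlate with POS, creating partial overlap.
is_verb = (rng.random(n) < 1/3).astype(float)
is_noun = ((rng.random(n) < 0.5) & (is_verb == 0)).astype(float)
proto   = 0.5 * is_verb + 0.3 * is_noun + rng.normal(0, 0.3, n)
change  = 0.4 * is_verb + 0.2 * is_noun + 0.3 * proto + rng.normal(0, 0.4, n)

def r_squared(y, *predictors):
    """R^2 of an OLS fit of y on an intercept plus the given predictors."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - resid.var() / y.var()

r2_full  = r_squared(change, is_verb, is_noun, proto)
r2_pos   = r_squared(change, is_verb, is_noun)
r2_proto = r_squared(change, proto)

# Unique contribution of prototypicality over and above POS
# (analogous to the "Δ Prototypicality" row of Table 3).
delta_proto = r2_full - r2_pos
print(round(r2_full, 2), round(r2_pos, 2), round(r2_proto, 2), round(delta_proto, 2))
```

Because the predictors overlap, the unique ΔR² of prototypicality comes out smaller than its individual R², the same qualitative pattern as in Table 3.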

Variables \ Decades      1   2   3   4   5   6   7   8   9   10  pooled
POS + Prototypicality   29  24  28  25  23  22  23  19  21  22    22
POS                     24  17  23  19  19  16  20  12  14  17    17
Prototypicality         10  12  10  11   7  11   8  12  12   9    10
Δ Prototypicality        5   7   5   6   4   6   3   7   7   5     5

TABLE 3. PERCENTAGES OF THE EXPLAINED VARIANCE IN SEMANTIC CHANGE WITH DIFFERENT COMBINATIONS OF VARIABLES THROUGHOUT THE 10 DECADES, AND POOLED OVER THE DECADES.

6. DISCUSSION

In the above section, we have argued that the word class assignment of a word is a distinct and significant factor contributing to the likelihood of its meaning changing over time. While, as we have noted above, the null hypothesis is that part of speech assignment does not play a role in semantic change, it is nonetheless reasonable that verbs change at a faster rate than nouns, and that both change at a faster rate than adjectives.

For an explanation, we turn to psycholinguistic research that indicates that in particular contexts, verb meanings are more likely to be reinterpreted than noun meanings. In this section, we restrict ourselves to the noun-verb asymmetry, leaving adjectives for future research. Early work on this topic (Gentner 1981) identified a processing effect known as "verb mutability"


which basically says that "the semantic structures conveyed by verbs and other predicate terms are more likely to be altered to fit the context than are the semantic structures conveyed by object-reference terms" (Gentner & France 1988: 343). Broadly speaking, this effect states that when language users are confronted with semantically implausible utterances, e.g., the lizard worshipped, they are more likely to reinterpret the verb's meaning than that of the collocate noun. While it would have been possible for lizard to be reinterpreted as meaning slimy man, in fact, experimental subjects preferentially reinterpreted the verb as meaning, e.g., look at the sun or some other action that lizards actually do.⁶ Similarly, given the utterance the flower kissed the rock, English speakers did not reinterpret the meaning of the nouns, e.g., a flower-like and rock-like person kissing, but rather of the verb, interpreting kissed as describing an act of gentle contact (Gentner & France 1988: 345).

The verb mutability effect requires explanation. Several types of explanations have been proffered, which mostly have to do with the inherent semantic and formal properties of nouns as opposed to verbs:
1. Nouns outnumber verbs in utterances (Gentner & France 1988).
2. Verbs are typically more polysemous than nouns (Gentner & France 1988).
3. Verbs are typically predicates, while nouns establish reference to objects (Gentner & France 1988).
4. Noun concepts are more internally cohesive than verb representations (Gentner & France 1988).
5. Nouns are learned earlier than verbs, and presumably for this reason are more stable (Gentner & Boroditsky 2001).
However, all of these explanations have problems (Gentner & France 1988; Fausey et al. 2006; Ahrens 1999). Our results do not allow us to take a position on the ultimate causal factors underlying the verb mutability effect, nor do we assume that it is universal.⁷

6 Another line of research that may contribute to an explanation of this phenomenon is generally known as coercion, in which the meaning of a construction is "type-shifted" in appropriate contexts. For example, while the verb know in English has a stative default interpretation, when combined with an adverb like suddenly, e.g., Suddenly, she knew it, it takes on an inchoative meaning. Michaelis (2004) has provided a detailed theory of coercion in the framework of Construction Grammar, focusing on aspectual coercion. What we observe from the literature on coercion, although the point is not made explicitly therein, is that it is the event whose semantics is adjusted to fit the context, rather than the referring expressions.
7 For example, Ahrens (1999) shows that the verb mutability effect observed in Mandarin is different from that observed in English, and Fausey et al. (2006) found that Japanese does not show a robust noun-verb asymmetry.


Rather, we opportunistically embrace the observation that in English, the language investigated here, this effect has been shown to be robust. Under the assumption that diachronic biases are ultimately rooted in synchronic "online" performance or usage, we expect that the tendency of verbs' meanings to be more frequently adapted to contexts of semantic strain than the meanings of their noun collocates should show up as a diachronic bias.

In fact, this is the leading hypothesis in most theories of semantic change: the interpretive strategies of language users, specifically listeners, are what lead to semantic reanalysis. For example, Bybee et al. (1994) propose that listeners' inferences cause some types of semantic change observed in grammaticalization. Traugott & Dasher (2002) make a similar argument, couching their theory in Neo-Gricean pragmatics. Detges & Waltereit (2002) propose a "Principle of Reference", according to which listeners interpret contextual meanings as coded meanings, and Heine (2002) talks about "context-induced reinterpretation". However, closest to the type of effect discussed here is Regina Eckardt's (2009) principle of "Avoid Pragmatic Overload", which says that when listeners are confronted with utterances with implausible presuppositions, they may be coerced into a form-meaning remapping.⁸

Essentially, all of these theories argue that the ways in which listeners interpret semantically implausible utterances lead to biases in semantic change, and, ultimately, the appearance of "pathways" of semantic change. The verb mutability effect identified by Gentner (1981) may be one kind of synchronic interpretative bias implicated in the diachronic asymmetry observed in the present article: in terms of synchronic processing, verbs are more semantically mutable than nouns; correspondingly, in terms of diachronic change over time, verbs undergo more semantic change than nouns.
However, the bridge between synchronic processing and diachronic change is not an obvious one. What does seem to be clear is that one would need an appropriate model of memory that would allow individual tokens of utterances, with their contextual meanings, to be stored as part of the representation of a word; for an example, see the exemplar-based model proposed in detail by Bybee (2010).

We would like to point out that we do not think that it is necessarily the word class as a structural label that is implicated in semantic change. Rather, we suspect, along with previous researchers, that this is but a proxy for another asymmetry: verbs, nouns, and adjectives prototypically encode different concepts, with verbs prototypically denoting events, nouns denoting entities, and adjectives denoting properties (Croft 1991, 2000, 2001). It is highly plausible

8 Grossman et al. (2014) and Grossman & Polis (2014) have applied the latter to long-term diachronic changes in Ancient Egyptian, which provides some necessary comparative data from a language other than the well-studied western European languages.


that the diachronic asymmetry observed in this article is the result of the semantics of the concepts prototypically encoded by a word class rather than the formal appurtenance to a word class per se.

7. CONCLUSIONS

In this paper, we have proposed that a computational approach to the problem of semantic change can complement the toolbox of traditional historical linguistics, by detecting and quantifying semantic change over an entire lexicon using a completely bottom-up method. Using a word2vec model on a massive corpus of English, we characterized word meanings distributionally, and represented them as vectors. Defining the degree of semantic change as the cosine distance between two vectors of a single word at two points in time allowed us to characterize semantic change. While in earlier work (Dubossarsky et al. 2015), we argued that the degree of semantic change undergone by a word correlates inversely with its degree of prototypicality, defined as its distance from its category's center, in the present article we argued that the degree of semantic change correlates with its word class assignment: robustly, verbs change more than nouns, and nouns change more than adjectives. A regression analysis showed that although these effects are not entirely independent from each other, they nevertheless complement each other to a large extent, and together account for about 25% of the variance found in the data. Interestingly, token frequency on its own did not play a role in semantic change.

These results are both reasonable and surprising. They are reasonable because part-of-speech assignment is probably a proxy for the prototypical meanings denoted by the different parts of speech. While verbs, nouns, and adjectives are formal categories of English ("descriptive categories," Haspelmath 2010), and as such, may encode non-prototypical meanings (e.g., the English word flight denotes an event rather than an entity), the majority of frequently encountered nouns are likely to denote entities, verbs to denote events, and adjectives to denote properties.
Our results indicate that the inherent prototypical semantics of parts of speech does indeed influence the likelihood of word meanings to change, individually and aggregately across a lexicon. We have addressed one part of the diachronic data observed by relating the diachronic noun-verb asymmetry to the findings of experimental psychology: verbs not only change more than nouns over time, their meanings are also more likely to be changed in online synchronic usage, especially under conditions of "semantic strain," i.e., when language users are confronted with semantically implausible collocations. Under the assumption that semantic


change over time is the result of "micro-changes" in synchronic usage, we think it is plausible that the "verb mutability effect" may be part of a real causal explanation for the diachronic noun-verb asymmetry. To the extent that this assumption is correct, it provides further evidence for the need for rich models of memory, possibly along the lines of Bybee's exemplar-based model.

Obviously, much remains for future research. The findings presented here are for a particular language over a particular time period. The most urgent desideratum, therefore, is cross-linguistic investigation. Since the computational tools used here require massive corpora, such cross-linguistic research would demand either larger corpora for more languages, or the development of computational tools that could deal adequately with smaller corpora. Another direction for future research is to continue to identify and tease apart the causal factors implicated in semantic change: while our findings account for a considerable amount of the variance found in the data, they hardly account for all of it. It is likely that further causal factors will be found in purely distributional factors, the semantics of individual lexical items (given a finer-grained semantic tagging), and extra-linguistic factors. For example, our results show a lack of uniformity in the total amount of change across decades in the twentieth century, a finding that may be related to that of Bochkarev et al. (2014), which showed that the total amount of change in the lexicons of European languages over the same time period correlated with actual historical events.
Despite the preliminary and language-specific nature of our results, we believe that this study makes a real contribution to the question of semantic change, by showing that a bottom-up analysis of an entire lexicon can identify and quantify semantic change, and that the interaction of the causal factors identified can be evaluated.

REFERENCES

Ahrens, K. (1999). The mutability of noun and verb meaning. Chinese Language and Linguistics 5. 335–371.
Anttila, R. (1989). Historical and comparative linguistics. Amsterdam: John Benjamins.
Bloomfield, L. (1933). Language. New York: Henry Holt.
Bochkarev, V., V. Solovyev & S. Wichmann (2014). Universals versus historical contingencies in lexical evolution. Journal of The Royal Society Interface. 1–23.
Bybee, J. (2006). Frequency of Use and the Organization of Language. Oxford: Oxford University Press.
Bybee, J. (2010). Language, Usage and Cognition. Cambridge, UK: Cambridge University Press.
Bybee, J. (2015). Language change. Cambridge, UK: Cambridge University Press.


Bybee, J., R. Perkins & W. Pagliuca (1994). The evolution of grammar: Tense, aspect, and modality in the languages of the world. Chicago: University of Chicago Press.
Croft, W. (1991). Syntactic categories and grammatical relations: The cognitive organization of information. Chicago: University of Chicago Press.
Croft, W. (2000). Parts of speech as language universals and as language-particular categories. In P. Vogel & B. Comrie (eds.), Approaches to the typology of word classes, 65–102. Berlin: Mouton de Gruyter.
Croft, W. (2001). Radical construction grammar: Syntactic theory in typological perspective. Oxford: Oxford University Press.
Detges, U. & R. Waltereit (2002). Grammaticalization vs. reanalysis: A semantic-pragmatic account of functional change in grammar. Zeitschrift für Sprachwissenschaft 21(2). 151–195.
Dubossarsky, H., Y. Tsvetkov, C. Dyer & E. Grossman (2015). A bottom up approach to category mapping and meaning change. In V. Pirrelli, C. Marzi & M. Ferro (eds.), Word Structure and Word Usage. Proceedings of the NetWordS Final Conference, 66–70. Pisa.
Eckardt, R. (2009). APO—avoid pragmatic overload. In M.-B. Mosegaard Hansen & J. Visconti (eds.), Current trends in diachronic semantics and pragmatics, 21–42. Bingley: Emerald.
Fausey, C. M., H. Yoshida, J. Asmuth & D. Gentner (2006). The verb mutability effect: Noun and verb semantics in English and Japanese. In Proceedings of the 28th annual meeting of the Cognitive Science Society, 214–219.
Firth, J. R. (1957). Papers in Linguistics 1934–1951. London: Oxford University Press.
Geeraerts, D. (1985). Cognitive restrictions on the structure of semantic change. In J. Fisiak (ed.), Historical Semantics, 127–153. Berlin: Mouton de Gruyter.
Geeraerts, D. (1992). Prototypicality effects in diachronic semantics: A round-up. In G. Kellermann & M. D. Morissey (eds.), Diachrony within Synchrony: language, history and cognition, 183–203. Frankfurt am Main: Peter Lang.
Geeraerts, D. (2010). Theories of Lexical Semantics. Oxford: Oxford University Press.
Gentner, D. (1981). Some interesting differences between verbs and nouns. Cognition and brain theory 4(2). 161–178.
Gentner, D. & L. Boroditsky (2001). Individuation, relativity, and early word learning. In M. Bowerman & S. C. Levinson (eds.), Language acquisition and conceptual development, 215–256. Cambridge, UK: Cambridge University Press.
Gentner, D. & I. M. France (1988). The verb mutability effect: Studies of the combinatorial semantics of nouns and verbs. In S. L. Small, G. W. Cottrell & M. K. Tanenhaus (eds.), Lexical ambiguity resolution: Perspectives from psycholinguistics, neuropsychology, and artificial intelligence, 343–382. San Mateo, CA: Kaufmann.
Goldberg, Y. & J. Orwant (2013). A Dataset of Syntactic-Ngrams over Time from a Very Large Corpus of English Books. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity, 241–247. Atlanta, Georgia, USA.


Grossman, E., G. Lescuyer & S. Polis (2014). Contexts and Inferences. The grammaticalization of the Later Egyptian Allative Future. In E. Grossman, S. Polis, A. Stauder & J. Winand (eds.), On Forms and Functions: Studies in Ancient Egyptian Grammar. Hamburg: Kai Widmaier Verlag.
Grossman, E. & I. Noveck (2015). What can historical linguistics and experimental pragmatics offer each other? Linguistics Vanguard 1(1). 145–153.
Grossman, E. & S. Polis (2014). On the pragmatics of subjectification: the emergence and modalization of an Allative Future in Ancient Egyptian. Acta Linguistica Hafniensia 46(1). 25–63.
Harris, Z. S. (1954). Transfer Grammar. International Journal of American Linguistics 20(4). 259–270.
Haspelmath, M. (2004). On directionality in language change with particular reference to grammaticalization. In O. Fischer, M. Norde & H. Perridon (eds.), Up and down the cline: The nature of grammaticalization, 17–44. Amsterdam: Benjamins.
Haspelmath, M. (2008). Creating economical morphosyntactic patterns in language change. In J. Good (ed.), Language universals and language change, 185–214. Oxford: Oxford University Press.
Haspelmath, M. (2010). Comparative concepts and descriptive categories in crosslinguistic studies. Language 86(3). 663–687.
Heine, B. (2002). On the role of context in grammaticalization. In I. Wischer & G. Diewald (eds.), New reflections on grammaticalization, 83–101. Amsterdam & Philadelphia, PA: John Benjamins.
Hilpert, M. (2006). Distinctive collexeme analysis and diachrony. Corpus Linguistics and Linguistic Theory 2(2). 243–256.
Hilpert, M. & F. Perek (2015). Meaning change in a petri dish: constructions, semantic vector spaces, and motion charts. Linguistics Vanguard 1(1). 339–350.
Lehrer, A. (1985). The influence of semantic fields on semantic change. In J. Fisiak (ed.), Historical Semantics, Historical Word Formation, 283–296. Berlin: Mouton.
Levy, O. & Y. Goldberg (2014). Dependency-based word embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, vol. 2, 302–308. Baltimore, Maryland.
Menner, R. J. (1945). Multiple meaning and change of meaning in English. Language 21. 59–76.
Michaelis, L. A. (2004). Type shifting in construction grammar: An integrated approach to aspectual coercion. Cognitive linguistics 15(1). 1–68.
Mikolov, T., K. Chen, G. Corrado & J. Dean (2013a). Efficient Estimation of Word Representations in Vector Space. In Proceedings of the International Conference on Learning Representations (ICLR). Scottsdale, Arizona, USA.
Mikolov, T., Q. V. Le & I. Sutskever (2013b). Exploiting Similarities among Languages for Machine Translation. ArXiv preprint arXiv:1309.4168.
Mikolov, T., I. Sutskever, K. Chen, G. Corrado & J. Dean (2013c). Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems, 3111–3119.
Mikolov, T., W. Yih & G. Zweig (2013d). Linguistic Regularities in Continuous


38 VERBS CHANGE MORE THAN NOUNS: A BOTTOM-UP COMPUTATIONAL APPROACH TO SEMANTIC CHANGE

Space Word Representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 746–751.
Newman, J. (2015). Semantic shift. In N. Riemer (ed.), The Routledge Handbook of Semantics, 266–280. New York: Routledge.
Řehůřek, R. & P. Sojka (2010). Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, 45–50. Valletta, Malta: ELRA.
Rosch, E. H. (1973). Natural categories. Cognitive Psychology 4(3). 328–350.
Sagi, E., S. Kaufmann & B. Clark (2009). Semantic Density Analysis: Comparing word meaning across time and phonetic space. In Proceedings of the EACL 2009 Workshop on GEMS: Geometrical Models of Natural Language Semantics, 104–111.
Stefanowitsch, A. & S. T. Gries (2003). Collostructions: Investigating the interaction of words and constructions. International Journal of Corpus Linguistics 8(2). 209–243.
Sweetser, E. (1990). From Etymology to Pragmatics: Metaphorical and cultural aspects of semantic structure. Cambridge, UK: Cambridge University Press.
Traugott, E. C. & R. B. Dasher (2002). Regularity in Semantic Change. Cambridge, UK: Cambridge University Press.
Turney, P. D. (2006). Similarity of semantic relations. Computational Linguistics 32(3). 379–416.

Haim Dubossarsky The Edmond and Lily Safra Center for Brain Sciences (ELSC) Hebrew University of Jerusalem Jerusalem 91905 Israel email: [email protected]

Daphna Weinshall School of Computer Science and Engineering Hebrew University of Jerusalem Jerusalem 91905 Israel email: [email protected]

Eitan Grossman Department of Linguistics Hebrew University of Jerusalem Jerusalem 91905 Israel email: [email protected]



Chapter 3

The Diachronic Prototypicality Effect

A bottom up approach to category mapping and meaning change.

Published Dubossarsky, H., Tsvetkov, Y., Dyer, C. & Grossman, E. (2015). A bottom up approach to category mapping and meaning change. In Word Structure and Word Usage. Proceedings of the NetWordS Final Conference (eds. Pirrelli, V., Marzi, C. & Ferro, M.) 66–70.

A bottom up approach to category mapping and meaning change

Haim Dubossarsky
The Edmond and Lily Safra Center for Brain Sciences
The Hebrew University of Jerusalem
Jerusalem 91904, Israel
haim.dub@gmail.com

Yulia Tsvetkov
Language Technologies Institute
Carnegie Mellon University
Pittsburgh, PA 15213 USA
ytsvetko@cs.cmu.edu

Chris Dyer
Language Technologies Institute
Carnegie Mellon University
Pittsburgh, PA 15213 USA
cdyer@cs.cmu.edu

Eitan Grossman
Linguistics Department and the Language, Logic and Cognition Center
The Hebrew University of Jerusalem
Jerusalem 91904, Israel
[email protected]

Abstract

In this article, we use an automated bottom-up approach to identify semantic categories in an entire corpus. We conduct an experiment using a word vector model to represent the meaning of words. The word vectors are then clustered, giving a bottom-up representation of semantic categories. Our main finding is that the likelihood of changes in a word's meaning correlates with its position within its cluster.

1 Introduction

Modern theories of semantic categories, especially those influenced by Cognitive Linguistics (Geeraerts and Cuyckens, 2007), generally consider semantic categories to have an internal structure that is organized around prototypical exemplars (Geeraerts, 1997; Rosch, 1973). Historical linguistics uses this conception of semantic categories extensively, both to describe changes in word meanings over the years and to explain them. Such approaches tend to describe changes in the meaning of lexical items as changes in the internal structure of semantic categories. For example, Geeraerts (1999) hypothesizes that changes in the meaning of a lexical item are likely to be changes with respect to the prototypical 'center' of the category. Furthermore, he proposes that more salient (i.e., more prototypical) meanings will probably be more resistant to change over time than less salient (i.e., less prototypical) meanings.

Despite the wealth of data and theories about changes in the meaning of words, the conclusions of most historical linguistic studies have been based on isolated case studies, ranging from a few single words to a few dozen words. Only recently, though, have usage-based approaches (Bybee, 2010) become prominent, in part due to their compatibility with quantitative research on large-scale corpora (Geeraerts et al., 2011; Hilpert, 2006; Sagi et al., 2011). Such approaches argue that meaning change, like other linguistic changes, is to a large extent governed by and reflected in the statistical properties of lexical items and grammatical constructions in corpora.

In this paper, we follow such usage-based approaches in adopting Firth's famous maxim "You shall know a word by the company it keeps," an axiom that is built into nearly all diachronic corpus linguistics (see Hilpert and Gries, 2014 for a state-of-the-art survey). However, it is unclear how such 'semantic fields' are to be identified. Usually, linguists' intuitions are the primary evidence. In contrast to an intuition-based approach, we set out from the idea that categories can be extracted from a corpus, using a 'bottom-up' methodology. We demonstrate this by automatically categorizing the entire lexicon of a corpus, using clustering on the output of a word embedding model.

We analyze the resulting categories in light of the predictions proposed in historical linguistics regarding changes in word meanings, thus providing a full-scale quantitative analysis of changes in the meaning of words over an entire corpus. This approach is distinguished from previous research by two main characteristics: first, it provides an exhaustive analysis of an entire corpus; second, it is fully bottom-up, i.e., the categories obtained emerge from the data, and are not in any way based on linguists' intuitions. As such, it provides an independent way of evaluating linguists' intuitions, and has the potential to turn up new, unintuitive or even counterintuitive

Copyright © by the paper’s authors. Copying permitted for private and academic purposes. In Vito Pirrelli, Claudia Marzi, Marcello Ferro (eds.): Word Structure and Word Usage. Proceedings of the NetWordS Final Conference, Pisa, March 30-April 1, 2015, published at http://ceur-ws.org


facts about language usage, and hence, by hypothesis, about knowledge of language.

2 Literature review

Some recent work has examined meaning change in large corpora using a similar bottom-up approach and word embedding method (Kim et al., 2014). These works analyzed trajectories of meaning change for an entire lexicon, which enabled them to detect if and when each word changed, and to measure the degree of such changes. Although these works are highly useful for our purposes, they do not attempt to explain why words differ in their trajectories of change by relating observed changes to linguistic parameters.

Wijaya and Yeniterzi (2011) used clustering to characterize the nature of meaning change. They were able to measure changes in meaning over time, and to identify which aspect of meaning had changed and how (e.g., the classical semantic changes known as 'broadening,' 'narrowing,' and 'bleaching'). Although innovative, only 20 clusters were used. Moreover, clustering was only used to describe patterns of change, rather than as a possible explanatory factor.

3 Method

A distributed word vector model was used to learn the context in which the words-of-interest are embedded. Each of these words is represented by a vector of fixed length. The model changes the vectors' values to maximize the probability with which, on average, these words predict their context. As a result, words that predict similar contexts are represented with similar vectors. This is much like linguistic items in a classical structuralist paradigm, whose interchangeability at a given point or 'slot' in the syntagmatic chain implies they share certain aspects of function or meaning.

The vectors' dimensions are opaque from a linguistic point of view, as it is still not clear how to interpret them individually. Only when the full range of the vectors' dimensions is taken together does meaning emerge in the semantic hyperspace they occupy. The similarity of words is computed using the cosine distance between two word vectors, with 0 being identical vectors, and 2 being maximally different:

$1 - \frac{\sum_{i=1}^{d} W_i \times W'_i}{\sqrt{\sum_{i=1}^{d} W_i^2}\ \sqrt{\sum_{i=1}^{d} {W'_i}^2}}$   (1)

where d is the vectors' dimension length, and $W_i$ and $W'_i$ represent two specific values at the same vector point for the first and second words, respectively.

Since words with similar meaning have similar vectors, related words are closer to each other in the semantic space. This makes them ideal for clustering, as word clusters represent semantic 'areas,' and the position of a word relative to a cluster centroid represents its saliency with respect to the semantic concept captured by the cluster. This saliency is higher for words that are closer to their cluster centroid. In other words, a word's closeness to its cluster centroid is a measure of its prototypicality. To test for the optimal size of the 'semantic areas,' different numbers of clusters were tested; for each, the clustering procedure was done independently.

To quantify diachronic word change, we train a word vector model on a historical corpus in an orderly incremental manner. The corpus was sorted by year, and set to create word vectors for each year such that the words' representations at the end of training of one year are used to initialize the model of the following year. This allows a yearly resolution of the word vector representations, which are in turn the basis for later analyses. To detect and quantify meaning change for each word-of-interest, the distance between a word's vector in two consecutive decades was computed, serving as the degree of meaning change a word underwent in that time period (with 2 being maximal change and 0 no change).

Having two representational perspectives – synchronic and diachronic – we test the hypothesis that words that exhibit stronger cluster saliency in the synchronic model – i.e., are closer to the cluster centroid – are less likely to change over time in the diachronic model. We thus measure the correlation between the distance of a word to its cluster centroid at a specific point in time and the degree of change the word underwent over the next decade.

4 Experiment

We used the 2nd version of the Google Ngram corpus of English fiction, from which 10 million 5-grams were sampled for each year from 1850-2009 to serve as our corpus. All words were lowercased. Word2vec (Mikolov et al., 2013) was used as the distributed word vector model. The model was initiated with 50 dimensions for the word vectors' representations, and the window size for context was set to 4, which is the maximum size given the constraints of the corpus. Words that appeared less than 10 times in the entire corpus were discarded from the model vocabulary. Training the model was done year by year, and versions of the model were saved in 10-year intervals from 1900 to 2000.

The 7000 most frequent words in the corpus were chosen as words-of-interest, representing the entire lexicon. For each of these words, the cosine distance between its two vectors, at a specific year and 10 years later, was computed using (1) above to represent the degree of meaning change. A standard K-means clustering procedure was conducted on the vector representations of the words for the beginning of each decade from 1900 to 2000, and for different numbers of clusters, from 500 to 5000 in increments of 500. The distances of words from their cluster centroids were computed for each cluster, using (1) above. These distances were correlated with the degree of change the words underwent in the following ten-year period. The correlation between the distance of words from random centroids of different clusters, on the one hand, and the degree of change, on the other hand, served as a control condition.

4.1 Results

Table 1 shows six examples of clusters of words. The clusters contain words that are semantically similar, as well as their distances from their cluster centroids. It is important to stress that a centroid is a mathematical entity, and is not necessarily identical to any particular exemplar. We suggest interpreting a word's distance from its cluster's centroid as the degree of its proximity to a category's prototype, or, more generally, as a measure of prototypicality. Defined in this way, sword is a more prototypical exemplar than spear or dagger, and windows, shutters or doors may be more prototypical exemplars of a cover of an entrance than blinds or gates. In addition, the clusters capture near-synonyms, like gallop and trot, and level-of-category relations, e.g., the modal predicates allowed, permitted, able. The very fact that the model captures clusters and distances of words which are intuitively felt to be semantically closer to or farther away from a category prototype is already an indication that the model is on the right track.

sword, 0.06      | allowed, 0.02
spear, 0.07      | permitted, 0.04
dagger, 0.09     | able, 0.06
shutters, 0.04   | hat, 0.03
windows, 0.05    | cap, 0.04
doors, 0.08      | napkin, 0.09
curtains, 0.1    | spectacles, 0.09
blinds, 0.11     | helmet, 0.13
gates, 0.13      | cloak, 0.14
gallop, 0.02     | handkerchief, 0.14
trot, 0.02       | cane, 0.15

Table 1: Example for clusters of words using 2000 clusters and their distance from their centroids.

Figure 1 shows the analysis of changes in word meanings for the years 1950-1960. We chose this decade at random, but the general trend observed here obtains over the entire period (1900-2000). There is a correlation between the words' distances from their centroids and the degree of meaning change they underwent in the following decade, and this correlation is observable for different numbers of clusters (e.g., for 500 clusters, 1000 clusters, and so on). The positive correlations (r > .3) mean that the more distal a word is from its cluster's centroid, the greater the change its word vectors exhibit the following decade, and vice versa.

Crucially, the correlations of the distances from the centroid outperform the correlations of the distances from the prototypical exemplar, which was defined as the exemplar that is the closest to the centroid. Both the correlations of the distance from the cluster centroid and of the distance from the prototypical exemplar were significantly better than the correlations of the control condition (all p's < .001 under permutation tests).

[Figure 1. Change in the meanings of words correlated with distance from centroid for different numbers of clusters, for the years 1950-1960.]

In other words, the likelihood of a word changing its meaning is better correlated with the distance from an abstract measure than with the distance from an actual word. For example, the likelihood of change in the sword-spear-dagger cluster is better predicted by a word's closeness to the centroid, which perhaps could be conceptualized as a non-lexicalized 'elongated weapon with a sharp point,' than its closeness to an actual word, e.g., sword. This is a curious finding, which seems counter-intuitive for nearly all theories of lexical meaning and meaning change.

The magnitude of correlations is not fixed or randomly fluctuating, but rather depends on the number of clusters used. It peaks at about 3500 clusters, after which it drops sharply. Since a larger number of clusters necessarily means smaller 'semantic areas' that are shared by fewer words, this suggests that there is an optimal range for the size of clusters, which should not be too small or too large.

4.2 Theoretical implications

One of our findings matches what might be expected, based on Geeraerts' hypothesis, mentioned in Section 1: a word's distance from its cluster's most prototypical exemplar is quite informative with respect to how well it fits the cluster (Fig. 1). This could be taken to corroborate Roschian prototype-based views. However, another finding is more surprising, namely, that a word's distance from its real centroid, an abstract average of the members of a category by definition, is an even better predictor than the word's distance from the cluster's most prototypical exemplar.

In fact, our findings are consonant with recent work in usage-based linguistics on attractors, 'the state(s) or patterns toward which a system is drawn' (Bybee and Beckner, 2015). Importantly, attractors are 'mathematical abstractions (potentially involving many variables in a multidimensional state space)'. We do not claim that the centroids of the categories identified in our work are attractors – although this may be the case – but rather make the more general point that an abstract mathematical entity might be relevant for knowledge of language and for language change.

In the domain of meaning change, the fact that words farther from their cluster's centroid are more prone to change is in itself an innovative result, for at least two reasons. First, it shows on unbiased quantitative grounds that the internal structure of semantic categories or clusters is a factor in the relative stability over time of a word's meaning. Second, it demonstrates this on the basis of an entire corpus, rather than an individual word. Ideas in this vein have been proposed in the linguistics literature (Geeraerts, 1997), but on the basis of isolated case studies which were then generalized.

5 Conclusion

We have shown an automated bottom-up approach for category formation, which was done on an entire corpus using the entire lexicon. We have used this approach to supply historical linguistics with a new quantitative tool to test hypotheses about change in word meanings. Our main findings are that the likelihood of a word's meaning changing over time correlates with its closeness to its semantic cluster's most prototypical exemplar, defined as the word closest to the cluster's centroid. Crucially, even better than the correlation between distance from the prototypical exemplar and the likelihood of change is the correlation between the likelihood of change and the closeness of a word to its cluster's actual centroid, which is a mathematical abstraction. This finding is surprising, but is comparable to the idea that attractors, which are also mathematical abstractions, may be relevant for language change.

Acknowledgements

We thank Daphna Weinshall (Hebrew University of Jerusalem) and Stéphane Polis (University of Liège) for their helpful and insightful comments. All errors are, of course, our own.

References

Joan Bybee. 2010. Language, usage and cognition. Cambridge: Cambridge University Press.

Joan Bybee and Clay Beckner. 2015. Emergence at the cross linguistic level. In B. MacWhinney and W. O'Grady (eds.), The handbook of language emergence, 181-200. Wiley Blackwell.

Dirk Geeraerts. 1997. Diachronic prototype semantics. A contribution to historical lexicology. Oxford: Clarendon Press.

Dirk Geeraerts. 1999. Diachronic Prototype Semantics. A Digest. In: A. Blank and P. Koch (eds.), Historical semantics and cognition. Berlin & New York: Mouton de Gruyter.

Dirk Geeraerts and Hubert Cuyckens (eds.). 2007. The Oxford handbook of cognitive linguistics. Oxford: Oxford University Press.

Dirk Geeraerts, Caroline Gevaerts, and Dirk Speelman. 2011. How Anger Rose: Hypothesis


Testing in Diachronic Semantics. In J. Robinson and K. Allan (eds.), Current methods in historical semantics, 109-132. Berlin & New York: Mouton de Gruyter.

Martin Hilpert. 2006. Distinctive Collexeme Analysis and Diachrony. Corpus Linguistics and Linguistic Theory, 2 (2): 243–256.

Martin Hilpert and Stefan Th. Gries. 2014. Quantitative Approaches to Diachronic Corpus Linguistics. In M. Kytö and P. Pahta (eds.), The Cambridge Handbook of English Historical Linguistics. Cambridge: Cambridge University Press, 2014.

Yoon Kim, Yi-I Chiu, Kentaro Hanaki, Darshan Hegde, and Slav Petrov. 2014. Temporal Analysis of Language through Neural Language Models. Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science, 61-65. Baltimore, USA.

Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013. Linguistic Regularities in Continuous Space Word Representations. Proceedings of NAACL-HLT 2013: 746–751. Atlanta, Georgia.

Eleanor H. Rosch. 1973. Natural Categories. Cognitive Psychology 4 (3): 328–350.

Eyal Sagi, Stefan Kaufmann, and Brady Clark. 2011. Tracing semantic change with latent semantic analysis. In K. Allan and J.A. Robinson (eds.), Current methods in historical semantics, 161- 183. Berlin & New York: Mouton de Gruyter.

Derry T. Wijaya and Reyyan Yeniterzi. 2011. Understanding semantic change of words over centuries. In Proceedings of the 2011 international workshop on DETecting and Exploiting Cultural diversiTy on the social web (DETECT ’11) 35-40. Glasgow, United Kingdom.
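The centroid-based prototypicality measure used in Section 3 above (a cluster centroid as the component-wise mean of its members, and cosine distance to it as inverse prototypicality) can be sketched in a few lines of Python. This is an illustrative reconstruction with toy three-dimensional vectors, not the 50-dimensional word2vec vectors actually trained in the experiment; the cluster and its values are hypothetical.

```python
import math

def cosine_distance(u, v):
    # Equation (1): 0 = identical direction, 2 = maximally different
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1 - dot / (norm_u * norm_v)

def centroid(vectors):
    # Component-wise mean of the cluster members (a mathematical
    # abstraction, not necessarily any actual exemplar)
    d = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(d)]

# Toy "word vectors" for one cluster (hypothetical values)
cluster = {
    "sword":  [0.9, 0.1, 0.0],
    "spear":  [0.8, 0.2, 0.1],
    "dagger": [0.7, 0.1, 0.3],
}

c = centroid(list(cluster.values()))
distances = {w: cosine_distance(vec, c) for w, vec in cluster.items()}
# The word closest to the centroid is the most prototypical exemplar
prototype = min(distances, key=distances.get)
```

On real data, cluster assignments would come from K-means over the full 7000-word lexicon, and the resulting centroid distances would then be correlated with each word's cosine change score over the following decade.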


Chapter 4

Avoiding methodological pitfalls in semantic change research

Outta Control: Laws of Semantic Change and Inherent Biases in Word Representation Models.

Published Dubossarsky, H., Grossman, E. & Weinshall, D. (2017). Outta Control: Laws of Semantic Change and Inherent Biases in Word Representation Models. In Empirical Methods in Natural Language Processing, 1147–1156.

Outta Control: Laws of Semantic Change and Inherent Biases in Word Representation Models

Haim Dubossarsky1, Eitan Grossman2 and Daphna Weinshall3
1 Edmond and Lily Safra Center for Brain Sciences
2 Department of Linguistics
3 School of Computer Science and Engineering
The Hebrew University of Jerusalem, 91904 Jerusalem, Israel
[email protected], {eitan.grossman,daphna}@mail.huji.ac.il

Abstract

This article evaluates three proposed laws of semantic change. Our claim is that in order to validate a putative law of semantic change, the effect should be observed in the genuine condition but absent or reduced in a suitably matched control condition, in which no change can possibly have taken place. Our analysis shows that the effects reported in recent literature must be substantially revised: (i) the proposed negative correlation between meaning change and word frequency is shown to be largely an artefact of the models of word representation used; (ii) the proposed negative correlation between meaning change and prototypicality is shown to be much weaker than what has been claimed in prior art; and (iii) the proposed positive correlation between meaning change and polysemy is largely an artefact of word frequency. These empirical observations are corroborated by analytical proofs that show that count representations introduce an inherent dependence on word frequency, and thus word frequency cannot be evaluated as an independent factor with these representations.

1 Introduction

The increasing availability of digitized historical corpora, together with newly developed tools of computational analysis, make the quantitative study of language change possible on a larger scale than ever before. Thus, many important questions may now be addressed using a variety of NLP tools that were originally developed to study synchronic similarities between words. This has catalyzed the evolution of an exciting new field of historical distributional semantics, which has yielded findings that inform our understanding of the dynamic structure of language (Sagi et al., 2009; Wijaya and Yeniterzi, 2011; Mitra et al., 2014; Hilpert and Perek, 2015; Frermann and Lapata, 2016; Dubossarsky et al., 2016). Recent research has even proposed laws of change that predict the conditions under which the meaning of words is likely to change (Dubossarsky et al., 2015; Xu and Kemp, 2015; Hamilton et al., 2016). This is an important development, as traditional historical linguistics has generally been unable to provide predictive models of semantic change.

However, these preliminary results should be addressed with caution. To date, analyses of changes in words' meanings have relied on the comparison of word representations at different points in time. Thus any proposed change in meaning is contingent on a particular model of word representation and the method used to measure change. Distributional semantic models typically count words and their co-occurrence statistics (explicit models) or predict the embedding contexts of words (implicit models). In this paper, we show that the choice of model may introduce biases into the analysis. We therefore suggest that empirical findings may be used to support laws of semantic change only after a proper control can be shown to eliminate artefactual factors as the underlying cause of the empirical observations.

Regardless of the specific representation used, a frequent method of measuring the semantic change a word has undergone (Gulordava and Baroni, 2011; Jatowt and Duh, 2014; Kim et al., 2014; Dubossarsky et al., 2015; Kulkarni et al., 2015; Hamilton et al., 2016) is to compare the word's vector representations between two points in time using the cosine distance:

$\mathrm{cosDist}(x, y) = 1 - \frac{x \cdot y}{\lVert x \rVert_2\, \lVert y \rVert_2}$   (1)

Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1147–1156, Copenhagen, Denmark, September 7–11, 2017. © 2017 Association for Computational Linguistics

This choice naturally assumes that greater distances correspond to greater semantic changes. However, this measure introduces biases that may affect our interpretation of meaning change.

We examine various representations of word meaning, in order to identify inherent confounds when meaning change is evaluated using the cosine distance. In addition to the empirical evaluation, in Section 5 we provide an analytical account of the influence of word frequency on cosine distance scores when using these representations.

In our empirical investigation, we highlight the critical role of control conditions in the validation of experimental findings. Specifically, we argue that every observation about a change of meaning over time should be subjected to a control test. The control condition described in Section 2.1 is based on the construction of an artificially generated corpus, which resembles the historical corpus in most respects but where no change of meaning over time exists. In order to establish the validity of an observation about meaning change - and even more importantly, the validity of a law-like generalization about meaning change - the result obtained in a genuine experimental condition should be demonstrated to be lacking (or at least significantly diminished) in the control condition.

As we show in Section 4, some recently reported laws of historical meaning change do not survive this proposed test. In other words, similar results are obtained in the genuine and control conditions. These include the correlation of meaning change with word frequency, polysemy (the number of different meanings a word has), and prototypicality (how representative a word is of its category). These factors lie at the basis of the following proposed laws of semantic change:

• The Law of Conformity, according to which frequency is negatively correlated with semantic change (Hamilton et al., 2016).
• The Law of Innovation, according to which polysemy is positively correlated with semantic change (Hamilton et al., 2016).
• The Law of Prototypicality, according to which prototypicality is negatively correlated with semantic change (Dubossarsky et al., 2015).

Our analysis shows that these laws have only residual effects, suggesting that frequency and prototypicality may play a smaller role in semantic change than previously claimed. The main artefact underlying the emergence of the first two laws in both the genuine and control conditions may be due to the SVD step used for the embedding of the PPMI word representation (see Section 2.5).

2 Methods

The historical corpus used here is Google Books 5-grams of English fiction. Equally sized samples of 10 million 5-grams per year were randomly sampled for the period of 1900-1999 (Kim et al., 2014) to prevent the more prolific publication years from biasing the results, and were grouped into ten-year bins. Uncommon words were removed, keeping the 100,000 most frequent words as the vocabulary for subsequent model learning. All words were lowercased and stripped of punctuation.

This corpus served as the genuine condition, and was used to replicate and evaluate findings from previous studies. In this corpus, words are expected to change their meaning between decadal bins, as they do in a truly random sample of texts. According to the distributional hypothesis (Firth, 1957), one can extract a word's meaning from the contexts in which it appears. Therefore, if words' meanings change over time, as has been argued at least since Reisig (1839), it follows that the words' contexts should change accordingly, and this change should be detected by our model.

2.1 Control condition setup

Complementary to the genuine condition, a control condition was created where no change of meaning is expected. Therefore, any observed change in a word's meaning in the control condition can only stem from random "noise", while changes in meaning in the genuine condition are attributed to "real" semantic change in addition to "noise". Two methods were used to construct the corpus in the control condition:

Chronologically shuffled corpus (shuffle): 5-grams were randomly shuffled between decadal bins, so that each bin contained 5-grams from all the decades evenly. This was chosen as a control condition for two reasons. First, this condition resembles the genuine condition in size of the vocabulary, size of the corpus, overall variance in words' usage, and size of the decadal bins. Second and

1148 48 crucially, words are not expected to show any ap- used for the analysis of semantic change. Change parent change in their meaning between decades scores and frequencies were log-transformed, and in the control condition, because their various us- all variables were subsequently standardized. age contexts are shuffled across decades. A linear mixed effects model was used to evalu- ate meaning change in both the genuine and shuf- One synchronous corpus (subsample): All 5- fled control conditions. Frequency was set as a grams of the year 1999, which amount to 250 mil- fixed effect while random intercepts were set per lion 5-grams, were selected from Google Books word. The model attempts to account for semantic English fiction. 10 million 5-grams were ran- change scores using frequency, while controlling domly subsampled from this selection, and this for the variability between words by assuming that process was repeated 30 times. This is suggested each word’s behavior is strongly correlated across as an additional control condition since the under- decades and independent across words as follows: lying assumption is always that words in the same year do not change their meaning. Again, unlike (t) ∆w = β + β freq(t) + z + ε(t) (2) in the genuine condition, any changes that are ob- i 0 f wi wi wi (t) served based on these 30 subsamples can be at- Here ∆wi is the semantic change score of the tributed only to ”noise” that stems from random i’th word measured between two specific consec- sampling, rather than real change in meaning. utive decades, β0 is the model’s intercept, βf is the fixed-effect predictor coefficient for frequency, 2.2 Measures of interest zwi N(0, σ) is a random intercept for the i’th Meaning change: Meaning change was evalu- ∼ (t) word, and εwi is an error term associated with the ated as the cosine distance between vector rep- i’th word. 
We report the predictor coefficient as resentations of the same word in consecutive well as the proportion of variance explained1 by decades. This was done separately for each pro- each model. Only statistically significant results cessing stage (see Section 2.5). For the subsample (p < .01) are reported. All statistical tests are per- condition, this was defined as the average cosine formed in R (lme4 and MuMln packages). distance between the vectors in all 30 samples. 2.5 Word meaning representation Frequency: Words’ frequencies were computed separately for each decadal bin as the number of We used a cascade of processing stages based times a word appeared divided by the total number on the explicit meaning representation of words of words in that decade. For the subsample control (i.e., word counts, PPMI, SVD, as explained be- condition, it was computed as the number of times low) as commonly practiced (Baroni et al., 2014; a word appeared among the 250 million 5-grams, Levy et al., 2015). For each of these stages, we divided by the total number of words. sought to evaluate the relationship between word frequency and meaning change, by computing the 2.3 Construct validity corresponding correlations between these two fac- To establish the adequacy of our control condition, tors in the subsample control condition. we compared the meaning change scores (before Counts: Co-occurrence counts were collected log-transformation and standardization) between for all the words in the vocabulary per decade. the genuine and the shuffled control conditions. Change scores were obtained by taking the aver- PPMI: Sparse square matrices of vocabulary age meaning change over all words in each decade size containing positive pointwise mutual infor- using the representation of the final processing mation (PPMI) scores were constructed for each stage (SVD). An adequate control condition will decade based on the co-occurrence counts. 
We exhibit a lower degree of change compared to the used the context distribution smoothing parameter genuine condition, and is expected to show a fixed α = 0.75, as recommended by (Levy et al., 2015), rate of change across decades (see3a). using the following procedure: Pˆ(w, c) 2.4 Statistical analysis PPMI (w, c) = max log , 0 α ˆ ˆ Following common practice (Hamilton et al., P (w)Pα(c)! ! 2016), the 10k most frequent words, as measured 1R2 for mixed linear models (Nakagawa and Schielzeth, by their average decadal bin frequencies, were 2013)



Figure 1: Correlations in the control condition between change scores in the year 1999 and word frequency for three word representation types, based on: (a) Counts, (b) PPMI, (c) SVD. Correlation coefficients are reported above each subplot. LS regression lines are shown in dashed green.

Here P̂(w, c) denotes the probability that word c appears as a context word of w, while P̂(w) and

    P̂_α(c) = #(c)^α / Σ_c #(c)^α

denote the marginal probabilities of the word and its context, respectively.
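The smoothed PPMI computation can be sketched in a few lines of pure Python over a sparse co-occurrence dictionary. This is a minimal illustrative version (names are ours), not the exact implementation used in the study:

```python
import math
from collections import defaultdict

def ppmi(cooc, alpha=0.75):
    """PPMI with context distribution smoothing (alpha = 0.75).
    `cooc` maps (word, context) pairs to counts; returns a sparse dict
    holding only the strictly positive PMI values."""
    total = sum(cooc.values())
    w_count = defaultdict(float)
    c_count = defaultdict(float)
    for (w, c), n in cooc.items():
        w_count[w] += n
        c_count[c] += n
    # smoothed context distribution: #(c)^alpha / sum_c #(c)^alpha
    z = sum(v ** alpha for v in c_count.values())
    out = {}
    for (w, c), n in cooc.items():
        p_wc = n / total
        p_w = w_count[w] / total
        p_c = (c_count[c] ** alpha) / z
        val = math.log(p_wc / (p_w * p_c))
        if val > 0:
            out[(w, c)] = val
    return out
```

Raising context counts to the power α < 1 inflates the probability of rare contexts, which dampens PMI's well-known bias towards rare events (Levy et al., 2015).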

SVD: Each PPMI matrix was approximated by a truncated singular value decomposition, as described in (Levy et al., 2015). This embedding was shown to improve results on downstream tasks (Baroni et al., 2014; Bullinaria and Levy, 2012; Turney and Pantel, 2010). Specifically, the top 300 elements of the diagonal matrix of singular values Σ, denoted Σ_d, were retained to represent a new, dense embedding of the word vectors, using the truncated left orthonormal matrix U_d:

    W^SVD_i = (U_d · Σ_d)_i    (3)

These representations were subsequently aligned with the orthogonal Procrustes method, following (Hamilton et al., 2016).

Relation to other models: Levy and Goldberg (2014) have shown that the Skip-Gram with Negative Sampling (SGNS) embedding model, e.g. word2vec (Mikolov et al., 2013) - perhaps the most popular model of word meaning representation - implicitly factorizes the values of the word-context PMI matrix. Hence, the optimization goal and the sources of information available to SGNS and our model are in fact very similar. We therefore hypothesize that conclusions similar to those reported below can be drawn for SGNS models.

Figure 2: Cosine distances between PPMI and approximated PPMI representations (y-axis), plotted against frequency (x-axis). Correlation coefficient is reported above the plot.

3 Results

3.1 Confound of frequency

There are many factors that may confound the measurement of meaning change. Here we focus on frequency, and investigate the existence of an artefactual relation between frequency and meaning change. This is done by evaluating this relation in the subsample control condition. Any changes observed in this condition must be the consequence of inherent noise, since this control condition contains random samples from the same year (and the baseline assumption is that no change can be observed within the same year).

We first plotted the change scores that use the representation based on word counts vs. word frequency. This resulted in a robust correlation (r = -0.915) between the two variables, as shown in Fig. 1a (see the analytical account in Section 5). We repeated the same procedure using the PPMI representation, which showed a much weaker correlation with frequency (r = -0.295), see Fig. 1b. Finally, we repeated the same procedure using the final explicit representation after SVD embedding², see Fig. 1c. Surprisingly, the negative correlation with frequency was reinstated (r = -0.793). To investigate how this came about,

² Similar results were obtained for the implicit embedding (word2vec-SGNS) described in Section 2.5.



Figure 3: (a) Average change score per decade for the genuine and control conditions. Bars represent standard deviations. (b-c) Change scores (y-axis), relative to their frequency (x-axis): (b) genuine historical corpus, (c) chronologically shuffled historical corpus. LS regression lines are shown in dashed green.

we computed the change in the PPMI vectors before and after the low-rank SVD embedding using the cosine distance. As apparent from Fig. 2, it turns out that the SVD procedure distorts the data in an uneven manner - frequent words are distorted less than infrequent words. Thus we demonstrate that this reinstatement of the correlation between frequency and change scores is merely an artefactual consequence of the truncated SVD factorization.

3.2 Construct validity

Potential confounding factors can be addressed by comparing any experimental finding to a validated control condition. Here we validate the use of the shuffled condition as a proper control. To this end, the average change scores of words per decade in both the genuine and shuffled conditions are compared within each processing stage. In the genuine condition, words appear in different usage contexts between decades, while in the shuffled condition they do not, because the random shuffling creates a homogeneous corpus. Therefore, the validity of the control condition is established if: (a) the change scores are diminished as compared to the genuine condition; (b) change scores are uniform across decades (since decades are shuffled); (c) the variance of change scores is smaller than in the genuine condition. As seen in Fig. 3a, all these requirements are met by the control condition. Note that the change scores in the shuffled condition are all significantly positive, namely, meaning change allegedly exists in this control condition. This supports the claim that any measurement is significantly affected by unrelated noise.

Thus, we have established that the shuffled condition is a suitable control for meaning change. While validity was established for each of the processing stages, the most robust effect was seen for the PPMI representation, followed by SVD and word counts.

3.3 Accounting for the frequency confound

In Section 3.1 we used the subsample control condition to establish the confounding effect of frequency on meaning change. We now examine the extent to which this frequency confound exists in a historical corpus. We do so by comparing the frequency confound between the genuine historical corpus and the shuffled historical corpus.

To visualize the frequency confound in a manner comparable to the analysis presented in Section 3.1, we again plot change scores vs. frequency, ignoring the time dimension of the data. Fig. 3b presents this plot for the genuine condition. The same analysis is repeated in the shuffled condition, see Fig. 3c.

Both plots reveal a highly significant correlation between change scores and frequency. Furthermore, the fact that the correlation coefficients are virtually identical in the genuine and shuffled conditions, with r = -0.748 and r = -0.747 respectively, suggests that they are due to artefactual factors in both conditions and not to true change of meaning over time. In fact, this pattern of results is reminiscent of the spurious pattern we see in Fig. 1c.

The relation between frequency and meaning change can also be represented by a linear mixed effects model, with the benefit that this model enables the addition of more explanatory variables to the data. The regression model found frequency to have a negative influence on change scores,

                                              PPMI + SVD            PPMI
                                           Genuine  Shuffled   Genuine  Shuffled
Frequency          β frequency              -0.91    -0.75      -0.29     0.06
(one-predictor)    explained variance (σ²)   67%      56%        8%       0%
Frequency +        β frequency              -1.22    -1.12      -0.69     0.53
Polysemy           β polysemy                0.43     0.40       0.49    -0.52
(two-predictor)    explained variance (σ²)   68%      60%        9%       4%
Frequency +        β frequency              -0.71    -0.70      -0.02     0.07
Prototypicality    β prototypicality         0.22     0.21       0.12     0.02
(two-predictor)    explained variance (σ²)   65%      60%        2%       0%

Table 1: Results of one-predictor and two-predictor regression analysis in all conditions.

with β_f = -0.91 and β_f = -0.75 for the genuine and shuffled conditions, respectively. Importantly, frequency accounted for 67% of the variance in the change scores in the genuine condition, and was only slightly diminished in the shuffled condition, accounting for 56% of the variance. Similar results were obtained for the PPMI representation (see Table 1).

4 Revisiting previous studies

We replicated three recent results that were affected by this frequency effect, since they all define change as the word's cosine distance relative to itself at two time points. These studies report laws of semantic change that measure the role of frequency in semantic change either directly (Law of Conformity), or indirectly through another linguistic variable that is dependent on frequency (Laws of Innovation and Prototypicality).

4.1 Laws of conformity and innovation

Continuing the work described in Section 3.1, we replicated the model and analysis procedure described in (Hamilton et al., 2016), where two predictors were used together to explain the change scores: frequency and polysemy. Polysemy, which describes the number of different senses a word has, naturally differs among words, where some words are more polysemous than others (compare bank and date to wine). Following (Hamilton et al., 2016), we defined polysemy through the words' secondary connection patterns - the connections between each word's co-occurring words (using the entries in the PPMI representation for that word). The more interconnected these secondary connections are, the less polysemic a word is, and vice versa. Polysemy scores were computed using the authors' provided code³. We then log-transformed and standardized the polysemy scores. Next, frequency and polysemy were set as two fixed-effect predictors in a linear mixed effects model, like the one described in Section 2.4.

Thus we were able to replicate the results in the genuine condition as reported in (Hamilton et al., 2016). Interestingly, the same pattern of results emerged, again, in the shuffled condition (see Table 1). Importantly, the difference in effect size between conditions, as evaluated by the explained variance of frequency and polysemy together, showed a modest effect of 8% over the shuffled condition, pointing to the conclusion that the putative effects may indeed be real, but to a far lesser extent than had been claimed. We conclude that adding polysemy to the analysis contributed very little to the model's predictive power.

Since the PPMI representation (the explicit representation without dimensionality reduction with SVD) seems much less affected by spurious effects correlated with frequency (see Fig. 1b), we repeated the analysis of frequency described here and in Section 3.1 while using this representation. The results are listed in Table 1, showing a similar pattern of a rather small frequency effect.

4.2 Prototypicality

Prototypicality is the degree to which a word is representative of the category of which it is a member (a robin is a more prototypical bird than a parrot). According to the proposed Law of Prototypicality, words with more prototypical meanings will show less semantic change, and vice versa. Following (Dubossarsky et al., 2015), we computed words' prototypicality scores for each decade as the cosine distance between a word's vector

³ https://github.com/williamleif/histwords

and its k-means cluster's centroid, and extended the analysis to encompass the entire 20th century. The previous regression model assumed independence between words, and therefore assigned words to a random effect variable. However, when modeling prototypicality, this assumption is invalid, as relations between words are what inherently define prototypicality. We therefore designed a model in which decades, rather than words, are the random effect variable.

With this analysis the prototypicality effect seems to be substantiated in two ways. First, the addition of prototypicality explains an additional 5% of the variance. Second, the effect of prototypicality meets the more stringent requirement of being diminished in the shuffle condition (see Table 1). Nevertheless, here too the effect originally reported was found to be drastically reduced after being compared with the proper control.

5 Theoretical analysis

We show in Section 5.1 that the average cosine distance between two vectors representing the same word is equivalent to the variance of the population of vectors representing the same word in independent samples, and is therefore always positive. This is true for any word vector representation.

In Sections 5.2-5.3 we prove that the average cosine distance between two count vectors representing the same word is negatively correlated with the frequency of the word, and positively correlated with the polysemy score of the word.

5.1 Sampling variability and the cos distance

Lemma 1. Assume two random variables x, y of length ‖x‖₂ = ‖y‖₂ = 1, distributed iid with expected value µ and covariance matrix Σ. The expected value of the cosine distance between them is equal to the sum of the diagonal elements of Σ.

Proof.

    E(x - y)² = E(x - µ)² + E(y - µ)² - 2E(x - µ)(y - µ)
              = 2 Σ_i E(x_i - µ_i)² = 2 Σ_i Var(x_i)

    E(x - y)² = E(x²) + E(y²) - 2E(x · y)
              = 2 - 2E( (x · y) / (‖x‖₂ ‖y‖₂) )
              = 2E(cosDist(x, y))

It follows that

    E(cosDist(x, y)) = Σ_i Var(x_i)    (4)

Implication: The average cosine distance between two samples of the same random variable is directly related to the variance of the variable, or the sampling noise. This variance should be measured empirically whenever cosine distance is used, since only distances that are larger than the empirical variance can be relied upon to support significant observations.

5.2 Cos distance of count vectors: frequency

Next, we analyze the cosine distance between two iid samples from a normalized multinomial random variable. This distribution models the distribution of the count vector representation. Let k_i, 1 ≤ i ≤ m, denote the number of times word i appeared in the context of word w, and let m denote the size of the dictionary not including w. Let n = Σ_i k_i denote the number of words in the count vector of w; n determines the word's frequency score. Assume that the counts are sampled from the distribution Multinomial(n, p⃗), namely

    Prob(k_1, ⋯, k_m) = (n choose k_1, ⋯, k_m) · p_1^{k_1} ⋯ p_m^{k_m}

Lemma 2. The expected value of the cosine distance between two count vectors x, y sampled iid from this distribution is monotonically decreasing with n.

Proof. By definition, 1 - E[cosDist(x, y)] equals

    E( (x · y) / (‖x‖₂ ‖y‖₂) ) = Σ_i ( E( x_i / ‖x‖₂ ) )² = Σ_i E_i²    (5)

We compute the expected value of E_i directly:

    E_i = Σ_{(k_1, ⋯, k_m)} (n choose k_1, ⋯, k_m) p_1^{k_1} ⋯ p_m^{k_m} · k_i / √(Σ_j k_j²)

Using Taylor expansion:

    k_i / √(Σ_j k_j²) = k_i / √(n² - Σ_{l≠j} k_j k_l)
                      = (k_i / n) · 1 / √(1 - Σ_{l≠j} k_j k_l / n²)
                      = (k_i / n)(1 + ε/2) + O(ε²)    (6)

where ε = Σ_{l≠j} k_j k_l / n². The expected value of the 0-order term with respect to ε in (6) equals p_i, which is independent of n. We conclude the proof by focusing on the first-order term with respect to ε in (6), to be denoted f_1, showing that its expected value is monotonically increasing with n. Specifically:

    f_1 = Σ_{k⃗} Σ_{l≠j} (k_i / n)(k_j / n)(k_l / n) (n choose k_1, ⋯, k_m) p_1^{k_1} ⋯ p_m^{k_m}

We switch the summation order and compute each expression in the external sum, considering two cases separately. When l ≠ j ≠ i:

    Σ_{(k_1, ⋯, k_m)} (k_i / n)(k_j / n)(k_l / n) (n choose k_1, ⋯, k_m) p_1^{k_1} ⋯ p_m^{k_m}
        = (n(n-1)(n-2) / n³) p_i p_j p_l

When l ≠ j = i (w.l.o.g.), we rewrite k_i k_j = k_i(k_i - 1) + k_i, and the sum above becomes (n(n-1)(n-2) / n³) p_i² p_l + (n(n-1) / n²) p_i p_l. Thus

    f_1 = ((n-1)/n) · p_i · ( ((n-2)/n) Σ_{l,j: l≠j} p_j p_l + (1 - p_i) )

and it readily follows that f_1 is monotonically increasing with n. Since n measures the frequency score of word w, it follows from (5) that the expected value of the cosine distance between two iid samples from the distribution of the count vector of w is monotonically decreasing with the word's frequency.

5.3 Cos distance of count vectors: polysemy

We start our investigation of polysemy by modeling the distribution of the parameters of the multinomial distribution from which count vectors are sampled. A common prior distribution on the vector p⃗_w in the m-simplex, which defines the multinomial distribution generating the context of word w, is the Dirichlet distribution f(p⃗_w; α⃗_w) = f(p_1, ⋯, p_m; α_1, ⋯, α_m).

α⃗_w is a sparse vector of prior counts on all the words in the dictionary, by which the co-occurrence context of word w is modeled. We divide the set of non-zero indices of α⃗_w into two subsets: i_1, ⋯, i_{m_0} correspond to the words which always appear in the context of w, while j_1, ⋯, j_{m_1} correspond to the words which appear in the context of w in one given meaning. If w is polysemous and has two meanings, then there is a third set of indices k_1, ⋯, k_{m_2} which correspond to the words appearing in the context of w in its second meaning. If w has more than two meanings, they can be modeled with additional sets of disjoint indices.

Lemma 3. Under certain conditions specified in the proof, given two count vectors x, y sampled iid from the above distribution of w, the expected value of the cosine distance between them increases with the number of sets of disjoint indices which represent different meanings of w.

Proof. We will prove that when w has two meanings, the expected value of the cosine distance is larger than in the case of a single meaning. The proof for the general case immediately follows.

Starting from (6) while keeping only the 0-order term in ε, it follows from the derivations in the proof of Lemma 2 that the expected cosine distance between two count vector samples of w is 1 - Σ_i p_i². In our current model p⃗ is a random variable, and we shall compute the expected value of the similarity term, denoted M, under the two conditions, when w has either one or two meanings.

We start by observing that, given the definition of the Dirichlet distribution, it follows that

    E(p_i²) = Var(p_i) + E(p_i)² = α_i(1 + α_i) / (α_0(1 + α_0)),    α_0 = Σ_i α_i

    ⟹ M = Σ_i E(p_i²) = (α_0 + Σ_i α_i²) / (α_0(1 + α_0))    (7)

Considering the different sets of indices in isolation, let ϕ_0 = Σ_{i=i_1}^{i_{m_0}} α_i, ϕ_1 = Σ_{i=j_1}^{j_{m_1}} α_i, and ϕ_2 = Σ_{i=k_1}^{k_{m_2}} α_i. Let ψ_0 = Σ_{i=i_1}^{i_{m_0}} α_i², ψ_1 = Σ_{i=j_1}^{j_{m_1}} α_i², and ψ_2 = Σ_{i=k_1}^{k_{m_2}} α_i².

We rewrite (7) for the two conditions:

1. w has one meaning:

    M^(1) = (ϕ_0 + ϕ_1 + ψ_0 + ψ_1) / ((ϕ_0 + ϕ_1)(1 + ϕ_0 + ϕ_1))

2. w has two meanings:

    M^(2) = (ϕ_0 + ϕ_1 + ϕ_2 + ψ_0 + ψ_1 + ψ_2) / ((ϕ_0 + ϕ_1 + ϕ_2)(1 + ϕ_0 + ϕ_1 + ϕ_2))

With some algebraic manipulations, it can be shown that M^(1) > M^(2) if the following holds:

    (ϕ_0 + ϕ_1)² ϕ_2 + (ψ_0 + ψ_1) ϕ_2²
        + 2(ψ_0 + ψ_1)(ϕ_0 + ϕ_1) ϕ_2 + (ψ_0 + ψ_1) ϕ_2
        + (ϕ_0 + ϕ_1)² (ϕ_2 - ψ_2) > ψ_2 (ϕ_0 + ϕ_1)    (8)

Thus when (8) holds, the average cosine distance between two samples of a certain word w gets larger as w acquires more meanings.

(8) readily holds under reasonable conditions, e.g., when the prior counts for each meaning are similar (as a set) and much bigger than the prior counts of the joint context words (i.e., ϕ_0 = ψ_0 = ε, ϕ_1 = ϕ_2, ψ_1 = ψ_2).

6 Conclusions and discussion

In this article we have shown that some reported laws of semantic change are largely spurious results of the word representation models on which they are based. While identifying such laws is probably within the reach of NLP analyses of massive digital corpora, we argued that a more stringent standard of proof is necessary in order to put them on a firm footing. Specifically, it is necessary to demonstrate that any proposed law of change is observable in the genuine condition, but diminished or absent in a control condition. We replicated previous studies claiming to establish such laws, which propose that semantic change is negatively correlated with frequency and prototypicality, and positively correlated with polysemy. None of these laws - at least in their strong versions - survived the more stringent standard of proof, since the observed correlations were found in the control conditions as well.

In our analysis, the Law of Conformity, which claims a negative correlation between word frequency and meaning change, was shown to have a much smaller effect size than previously claimed. This indicates that word frequency probably does play a role - but a small one - in semantic change. According to the Law of Innovation, polysemy was claimed to correlate positively with meaning change. However, our analysis showed that polysemy is highly collinear with frequency, and as such, did not demonstrate an independent contribution to semantic change. For similar reasons, the alleged role of prototypicality was diminished.

These results may be more consonant than previous ones with the findings of historical linguistics, as it is commonly assumed that the factors leading to semantic change are more diverse than purely distributional factors. For example, socio-cultural, political, and technological changes are known to impact semantic change (Bochkarev et al., 2014; Newman, 2015). Furthermore, some regularities of semantic change have been imputed to 'channel bias', inherent biases of utterance production and interpretation on the part of speakers and listeners, e.g., (Moreton, 2008). As such, it would be surprising if word frequency, polysemy, and prototypicality were to capture too high a degree of variance. In other words, since semantic change may result from the interaction of many factors, small effects may be a priori more credible than large ones.

The results of our empirical analysis showed that the spurious effects of frequency were much weaker for the explicit PPMI representation unaugmented by SVD dimensionality reduction. We therefore conclude that the artefactual frequency effects reported are inherent to the type of word representations upon which these analyses are based. As the analytical proof in Section 5 demonstrates, it is count vectors that introduce an artefactual dependence on word frequency.

Intuitively, one might expect that the average value of the cosine distance between a given word's vectors in any two samples would be 0. However, Lemma 1 above shows that this is not the case, and the average distance is the variance of the population of vectors representing the same word. This result is independent of the specific method used to represent words as vectors. Lemma 2 proves that the average cosine distance between two samples of the same word, when using count vector representations, is negatively correlated with the word's frequency. Thus, the role of frequency cannot be evaluated as an independent predictor in any model based on count vector representations. It remains for future research to establish whether other approaches to word representation, e.g. (Blei et al., 2003; Mikolov et al., 2013), have inherent biases.

While our findings may seem to be mainly negative, since they invalidate proposed laws of semantic change, we would like to point to the positive contribution made by articulating more stringent standards of proof and devising replicable control conditions for future research on language change based on distributional semantics representations.
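The two central analytical claims of Section 5 are easy to sanity-check numerically. The sketch below (our own toy setup, stdlib only) first verifies the Lemma 1 identity - with the convention cosDist(x, y) = 1 - x·y for unit vectors, the mean pairwise cosine distance within a sample equals the sum of per-coordinate unbiased sample variances exactly - and then illustrates Lemma 2 by showing that the mean self-distance of multinomial count vectors shrinks as the total count n (the word's frequency) grows:

```python
import math
import random

def cos_dist(u, v):
    """Cosine distance between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

# --- Lemma 1 / Eq. (4): mean pairwise cosine distance of unit vectors
# equals the sum of per-coordinate (unbiased) sample variances.
rng = random.Random(0)
N, d = 200, 5
X = []
for _ in range(N):
    v = [1.0 + rng.gauss(0.0, 0.3) for _ in range(d)]  # toy population
    s = math.sqrt(sum(x * x for x in v))
    X.append([x / s for x in v])                        # normalize to unit length

dots = sum(sum(a * b for a, b in zip(x, y)) for x in X for y in X)
mean_dist = 1.0 - (dots - N) / (N * (N - 1))            # average over distinct pairs
mean_vec = [sum(x[i] for x in X) / N for i in range(d)]
var_sum = sum(sum((x[i] - mean_vec[i]) ** 2 for x in X) / (N - 1) for i in range(d))
# mean_dist and var_sum agree to floating-point precision

# --- Lemma 2: self-distance of Multinomial(n, p) count vectors vs. n.
def avg_self_dist(n, p, trials):
    """Mean cosine distance between two iid Multinomial(n, p) count vectors."""
    m = len(p)
    total = 0.0
    for _ in range(trials):
        x = [0] * m
        y = [0] * m
        for i in rng.choices(range(m), weights=p, k=n):
            x[i] += 1
        for i in rng.choices(range(m), weights=p, k=n):
            y[i] += 1
        total += cos_dist(x, y)
    return total / trials

p = [1.0 / 20] * 20                    # assumed uniform context distribution
d_rare = avg_self_dist(50, p, 200)     # low-frequency word: large self-distance
d_freq = avg_self_dist(2000, p, 200)   # high-frequency word: small self-distance
```

Note that both self-distances stay strictly positive even though nothing "changed" between the two samples, which is exactly the artefact the shuffled and subsample controls are designed to expose.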

References

Marco Baroni, Georgiana Dinu, and Germán Kruszewski. 2014. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of ACL, pages 238-247.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3(4-5):993-1022.

Vladimir Bochkarev, Valery Solovyev, and Sören Wichmann. 2014. Universals versus historical contingencies in lexical evolution. Journal of The Royal Society Interface, 11:1-23.

John A. Bullinaria and Joseph P. Levy. 2012. Extracting semantic representations from word co-occurrence statistics: stop-lists, stemming, and SVD. Behavior Research Methods, 44(3):890-907.

Haim Dubossarsky, Yulia Tsvetkov, Chris Dyer, and Eitan Grossman. 2015. A bottom up approach to category mapping and meaning change. In NetWordS 2015 Word Knowledge and Word Usage, pages 66-70.

Haim Dubossarsky, Daphna Weinshall, and Eitan Grossman. 2016. Verbs change more than nouns: A bottom up computational approach to semantic change. Lingue e Linguaggio, 1:5-25.

John Rupert Firth. 1957. Papers in Linguistics 1934-1951. Oxford University Press, London.

Lea Frermann and Mirella Lapata. 2016. A Bayesian model of diachronic meaning change. TACL, 4:31-45.

Kristina Gulordava and Marco Baroni. 2011. A distributional similarity approach to the detection of semantic change in the Google Books Ngram corpus. In Proceedings of the GEMS 2011 Workshop on GEometrical Models of Natural Language Semantics, pages 67-71. Association for Computational Linguistics.

William L. Hamilton, Jure Leskovec, and Dan Jurafsky. 2016. Diachronic word embeddings reveal statistical laws of semantic change. In Proceedings of ACL.

Martin Hilpert and Florent Perek. 2015. Meaning change in a petri dish: constructions, semantic vector spaces, and motion charts. Linguistics Vanguard, 1(1):339-350.

Adam Jatowt and Kevin Duh. 2014. A framework for analyzing semantic change of words across time. In Proceedings of the 14th ACM/IEEE-CS Joint Conference on Digital Libraries, pages 229-238.

Yoon Kim, Yi-I Chiu, Kentaro Hanaki, Darshan Hegde, and Slav Petrov. 2014. Temporal analysis of language through neural language models. In Proceedings of ACL, pages 61-65.

Vivek Kulkarni, Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. 2015. Statistically significant detection of linguistic change. In Proceedings of the 24th International Conference on World Wide Web, pages 625-635.

Omer Levy and Yoav Goldberg. 2014. Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems (NIPS), pages 2177-2185.

Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. TACL, 3:211-225.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (NIPS), pages 3111-3119.

Sunny Mitra, Ritwik Mitra, Martin Riedl, Chris Biemann, Animesh Mukherjee, and Pawan Goyal. 2014. That's sick dude!: Automatic identification of word sense change across different timescales. In Proceedings of ACL, pages 1020-1029.

Elliott Moreton. 2008. Analytic bias and phonological typology. Phonology, 25(1):83-127.

Shinichi Nakagawa and Holger Schielzeth. 2013. A general and simple method for obtaining R² from generalized linear mixed-effects models. Methods in Ecology and Evolution, 4(2):133-142.

John Newman. 2015. Semantic shift. In Nick Riemer, editor, The Routledge Handbook of Semantics, pages 266-280. Routledge, New York.

Eyal Sagi, Stefan Kaufmann, and Brady Clark. 2009. Semantic density analysis: Comparing word meaning across time and phonetic space. In Proceedings of the Workshop on Geometrical Models of Natural Language Semantics, pages 104-111. Association for Computational Linguistics.

Peter D. Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37:141-188.

Derry Tanti Wijaya and Reyyan Yeniterzi. 2011. Understanding semantic change of words over centuries. In Proceedings of the 2011 International Workshop on DETecting and Exploiting Cultural diversiTy on the Social Web, pages 35-40. ACM.

Yang Xu and Charles Kemp. 2015. A computational evaluation of two laws of semantic change. In CogSci.



Chapter 5

Sense-specific word representation, a misconception?

Coming to Your Senses: on Controls and Evaluation Sets in Polysemy Research.

Under review: Dubossarsky, H., Ben-Yosef, M., Grossman, E. & Weinshall, D. Coming to Your Senses: on Controls and Evaluation Sets in Polysemy Research.

Coming to Your Senses: on Controls and Evaluation Sets in Polysemy Research

Haim Dubossarsky¹, Matan Ben-Yosef², Eitan Grossman³ and Daphna Weinshall²
¹ Edmond and Lily Safra Center for Brain Sciences
² School of Computer Science and Engineering
³ Department of Linguistics
The Hebrew University of Jerusalem, 91904 Jerusalem, Israel
{haim.dub, matan.ben.yosef}@gmail.com, {eitan.grossman, daphna}@mail.huji.ac.il

Abstract

The point of departure of this article is the claim that sense-specific vectors capture polysemy, which is based on performance gains in gold standard evaluation tests such as word similarity tasks. We demonstrate that this claim, at least as it is instantiated in prior art, is unfounded in two ways. Furthermore, we provide empirical data and an analytic discussion that may account for the previously reported improved performance. First, we show that ground-truth polysemy degrades performance in word similarity tasks. Therefore word similarity tasks are not suitable as an evaluation test for polysemy representation. Second, random assignment of words to senses is shown to improve performance in the same task. This and additional results point to the conclusion that performance gains as reported in previous work may be an artifact of random sense annotation, which is equivalent to sub-sampling and multiple estimation of word vector representations. Theoretical analysis shows that this may on its own be beneficial for the estimation of word similarity, by reducing the bias in the estimation of the cosine distance.

1 Introduction

Polysemy is a fundamental feature of natural languages, which typically have many polysemic words. Chair, for example, can refer to either a piece of furniture or to a person in charge of a meeting. Therefore both theoretical linguistics and computational linguistics seek to establish principled methods of identifying the senses that together constitute the meaning of words.

It is commonly assumed or claimed that standard word embeddings are unable to capture polysemy (Iacobacci et al., 2015), which results in suboptimal performance in gold standard evaluation tests such as word similarity tasks, and potentially hampers performance in downstream tasks. The corollary assumption is that sense-specific representations will show improved performance on these evaluation tests. This assumption is conceptually attractive, since it makes sense that sense-specific representations are more accurate than global representations. For example, in translating 'chair', it is reasonable that performance should improve if the two senses are represented separately. This view is supported by several studies (Huang et al., 2012; Neelakantan et al., 2014; Chen et al., 2014; Li and Jurafsky, 2015), which consistently argue that sense-specific representations lead to improved performance in word similarity tasks.

In contrast, recent studies (Arora et al., 2016; Sun et al., 2017) have argued against this claim, and show that global representations are able to capture polysemic information to a great extent. Additionally, they demonstrate that this information is in fact easily accessible for evaluation tests.

Ideally, claims about polysemy should be evaluated using a gold standard evaluation set that is tailored specifically for polysemic words. As such a set does not exist, tasks involving word similarity tests have been used as a proxy (see Section 2). The underlying hypothesis is that enriching word vector representations with polysemic information should express itself in performance gains in these tasks. Unfortunately, this hypothesis has never been tested directly, and the ability of word similarity tasks to directly benefit from polysemic information must first be validated if they are to serve as genuine evaluation sets in research on polysemy. Until this is done, the validity of any

reported positive effects of sense-specific representations on evaluation tests is to be treated with caution.

In this paper our first aim is to assess the validity of word similarity tasks as proper evaluation tests for polysemic word representations. We use two independent corpora in order to obtain polysemic vectors: (i) a sense-annotated corpus, and (ii) an artificially-induced annotated corpus, constructed by using an established method that we modify for our purposes. Surprisingly, our analyses show that even the most accurate sense-specific word vectors do not improve performance. In fact, peak performance is achieved when polysemic information is ignored, i.e., when all different annotated senses are collapsed to a single word, as they naturally appear in a normal text. These counter-intuitive results indicate that the word similarity tasks are not a suitable test to evaluate polysemic representations.

Although these negative results point to the inadequacy of the evaluation test, they may also point to a suboptimal representation of polysemy by the sense-specific vectors. We therefore ask whether the representation of polysemy in existing models indeed adequately captures the phenomenon it is meant to measure. We provide evidence that currently-used sense-specific representations might not adequately capture polysemy.

We thus identify two independent pitfalls in NLP research on polysemy representation: first, the inadequacy of currently-used evaluation tests, and second, current polysemic representations. Given these conclusions, a serious question arises: why would inaccurate polysemic representations show superior performance in inadequate evaluation tests?

An alternative explanation for the reported effects may lie in an inherent property of sense-specific representations. The procedure of assigning a word occurrence to a particular sense amounts to a sampling procedure. This sampling procedure itself, regardless of its validity, may be the true source of the reported performance gains. To test this hypothesis, we created a control condition in which word occurrences are randomly assigned to different senses. That an effect is attributable to genuine polysemy can only be established if a similar effect is lacking or significantly reduced in this control condition.

We demonstrate that performance gains are indeed obtained for a corpus with randomly assigned senses. In addition, we modify the polysemy representation model proposed by Li and Jurafsky (2015) to randomly assign words to senses, and observe that the effect size remains unchanged between the original and random conditions.

In support of our empirical findings, we discuss the difficulty of obtaining an unbiased estimator for the cosine distance between two normalized random variable vectors. This may provide a partial explanation for the empirical findings, under the assumption that words are better represented as a population of vectors. Specifically, the true source of the reported performance gains may be an artifact of a purely statistical benefit that derives from the assignment of words to particular senses, or separate sub-samples, which subsequently reduces the bias of the similarity estimator.

2 Background

Previous attempts to use polysemic information for enriching word representations report marked performance gains (Huang et al., 2012; Neelakantan et al., 2014; Chen et al., 2014; Li and Jurafsky, 2015). These studies use normal unannotated corpora, and therefore have to disambiguate the different senses of words before exploiting any sense-specific information. This approach produces (i) global vectors that represent a word's meaning as a single vector (with no subdivision into distinct senses), as well as (ii) sense-specific vectors representing individual senses of words, determined in the disambiguation step, as separate vectors. For example, such approaches would represent the meaning of chair as a single vector, as well as distinct vectors for each of its multiple senses, e.g., "chair (person)" and "chair (furniture)".

In order to evaluate performance, the vectors created by the models are evaluated using two standard word similarity tasks, WordSim-353 (Finkelstein et al., 2001) and Stanford's Contextual Word Similarities (SCWS) (Huang et al., 2012). These tasks comprise pairs of words and the similarity scores assigned to them by human annotators. For example, the similarity between table and chair might be rated as 0.8 (i.e., human annotators found these words to be very similar, but not perfectly so), while the similarity between table and tree might be rated as 0.3 (not very similar).
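The evaluation protocol described in this section is simple enough to sketch in plain Python: model scores are cosine similarities between word vectors, rank-correlated (Spearman) with the human ratings. The toy vectors and the three rated pairs below are invented for illustration; they are not the actual WordSim-353 data.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def rankdata(xs):
    """1-based average ranks; ties share their mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman correlation = Pearson correlation of the ranks."""
    rx, ry = rankdata(xs), rankdata(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((a - my) ** 2 for a in ry))
    return cov / (sx * sy)

# Toy embeddings and human ratings (illustrative only).
vectors = {
    "table": [0.9, 0.1, 0.2],
    "chair": [0.8, 0.2, 0.1],
    "tree":  [0.1, 0.9, 0.3],
    "car":   [0.2, 0.1, 0.9],
}
pairs = [("table", "chair", 0.8), ("table", "tree", 0.3), ("chair", "car", 0.2)]

model_scores = [cosine(vectors[w1], vectors[w2]) for w1, w2, _ in pairs]
human_scores = [h for _, _, h in pairs]
print(spearman(model_scores, human_scores))
```

The reported evaluation number is exactly this correlation, computed over the full task's word pairs instead of three toy ones.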

The models used by Huang et al. (2012) and others produce similarity scores for each word pair by computing the cosine distance between the word vectors of that pair. The model's performance is then evaluated as the rank-order agreement (Spearman correlation) between the human annotators' scores and the scores produced by the model. In line with the assumption discussed above, one would predict that the rank-order agreement produced by the sense-specific vectors should outperform the one produced by the global vectors. In particular, more accurate sense-specific vectors should produce better results in these tasks; conversely, better performance on these tasks is interpreted as indicating that polysemy has been captured more accurately.

Computing word similarity is straightforward when each word is represented as a single vector, but less so when the meaning of a polysemic word is represented by multiple sense-specific vectors. This problem of matching the senses relevant for a specific word pair, i.e., matching the "person" sense of chair with the correct sense of the word meeting, poses a major hurdle for meaningful comparison, and has been tackled in three different ways: (i) average over the similarity scores between all the different possible sense pairs; (ii) weighted average over these scores according to the probability assigned to each of them by the disambiguation model; or (iii) selection of the most suitable sense according to the disambiguation model, using only the corresponding similarity score.

Common sense suggests that the third approach should outperform the others, as it is based on the clearest distinction between the relevant and non-relevant senses. However, previous studies all report results that do not conform to this naïve prediction. Rather, across studies, the best results are obtained for average and weighted average, followed by global (ignoring polysemy), while selection falls far behind the others. This counter-intuitive observation suggests that the observed benefit may be less related to sense disambiguation than previously supposed.

3 Task validation

Generally, before any task can be used as an evaluation testbed for polysemy discovery algorithms or polysemous representations, we argue that the task itself should be validated as suitable (or not) for the intended purpose. We propose the following task validation methodology: (i) Start by identifying a corpus where polysemic information is known for a significant number of words. (ii) Compute two sets of word representations: A1, which assigns a single representation to every word in the corpus, and A2, which assigns multiple representations to each polysemic word in the corpus, based on the word's different known senses. (iii) Evaluate the task using the two representation sets A1 and A2. Only if significant performance gains can be shown when using A2 as compared to A1 can the task be used to evaluate polysemy representation.

3.1 Polysemy induction

A major drawback of the proposed methodology is that such annotated corpora are scarce, and even the largest among them is still small. We therefore articulate a methodology to generate a task validation test from any corpus, even without prior annotation of polysemy. To this end, we use a variant of the pseudo-words approach (Gale et al., 1992), which we call polysemy induction. This allows us to subject any task, including the word similarity tasks, to a task validation test which is based on a much larger corpus.

More specifically, we induce polysemy in a natural corpus by randomly selecting pairs of words, then collapsing every pair into a single word-form while keeping their "sense tags" as polysemy annotation. For example, ring and table may be collapsed to a single word with two senses, table1 and table2 respectively. The new corpus is polysemic with respect to the collapsed words, while all other words keep a single sense. This corpus has most of the features of a natural corpus, but unlike most natural corpora (and all large corpora), it contains polysemy annotation.

With this corpus, we follow the methodology for task validation described above: let A1 denote a set of word representations whereby every word (collapsed or natural) has a single representation, constructed without the use of polysemy annotation, and let A2 denote a set of word representations whereby each collapsed word has two representations, each corresponding to one natural word from the original pair that has been merged. For task evaluation, the task is performed using one of the two sets of representations. Only if A2 leads to a significant performance gain in the task as compared to A1 should the task under examination be considered adequate to evaluate polysemy computation, i.e., the accurate disambiguation and representation of word senses.
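The induction step itself can be sketched in a few lines. Every randomly drawn pair of words is collapsed onto one surviving word-form, and a "sense tag" records which original word each token came from. The function name, toy corpus, and tagging scheme below are our own illustrative choices, not the paper's code.

```python
import random

def induce_polysemy(tokens, n_pairs, seed=0):
    """Collapse random word pairs into a single word-form with sense tags.

    E.g. with the drawn pair (ring, table), every 'ring' token becomes
    ('table', 1) and every 'table' token becomes ('table', 2); all other
    words keep sense tag 0.
    """
    rng = random.Random(seed)
    vocab = sorted(set(tokens))
    rng.shuffle(vocab)
    mapping = {}
    for i in range(n_pairs):
        w1, w2 = vocab[2 * i], vocab[2 * i + 1]
        mapping[w1] = (w2, 1)   # w1 is absorbed into w2's word-form
        mapping[w2] = (w2, 2)
    return [mapping.get(w, (w, 0)) for w in tokens]

corpus = "the cat sat on the table with a ring near the tree".split()
annotated = induce_polysemy(corpus, n_pairs=2)

# Each token is now (word_form, sense_tag); collapsed pairs share a form
# but carry distinct tags, giving gold polysemy annotation for free.
forms = {w for w, _ in annotated}
assert len(forms) == len(set(corpus)) - 2  # two words lost their own form
```

Training on the word-forms alone yields the A1 (collapsed global) representations; training on the (form, tag) pairs yields the sense-specific A2 representations.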

3.2 Methods

Word embedding model The word2vec skip-gram model (Mikolov et al., 2013) is used to obtain vector representations for words. The model was separately trained over the corpus, the first time producing sense-specific vectors according to the annotated senses, and a second time producing global vectors by ignoring the annotated senses and collapsing all their occurrences to a single word. Throughout the analyses we used an embedding size of 300, a window size of 5 words on each side of the target word, negative sampling of 5 words, and an initial learning rate of 0.025.

Sense-annotated corpus We use OntoNotes (Weischedel et al., 2013), the largest available corpus annotated for word senses. This allows us to circumvent the problem of first disambiguating the words' senses, and thus to directly test the utility of using polysemic information in word vector representations. The corpus contains 1.5 million English tokens, comprising about 50k English word types, of which 8,675 word types are sense-annotated. Because the annotation is not uniform throughout the corpus (words are not annotated every time they appear), which can bias the analysis described below, we extract a subset of the corpus by removing sentences where polysemous words are not annotated, thus removing 40% of the corpus. Stopwords, as well as words occurring fewer than 10 times, are ignored by the word representation models. All words are lowercased.

Sense-induced corpus A Wikipedia dump (04/2017) is the original corpus from which a polysemic version is induced by randomly pairing words into a single word-form (see Section 3.1). Stopwords and infrequent words (<300 tokens) are ignored by the word representation models.

Evaluation The utility of polysemic word representations is evaluated on the two word similarity tasks described. Crucially, the problem of matching the relevant sense in these tests (described in Section 2) is tackled by taking the average of the sense-specific representations and comparing it to the global word representations. (Recall that average was reported to be one of the best performing matching methods in previous work.)

3.3 Results

Results clearly demonstrate that global representations are significantly superior to sense-specific representations in both evaluation tests and across corpora, as shown in Tables 1 and 2.

         GLOBAL       AVERAGE
WS-353   44.7 (0.7)   41.3 (0.3)
SCWS     64.0 (0.4)   62.6 (0.6)

Table 1: OntoNotes scores on two word similarity tasks. GLOBAL ignores sense information, while AVERAGE uses sense information by taking the average of the sense-specific vectors. Standard deviation over 10 independent runs of the vector model is shown in parentheses.

Figure 1: Summary of the results reported in Table 2, showing the difference between the performance of the vanilla global representation and the performance of various polysemy representation methods. Five polysemy methods are shown. Ideal gain: when using ground-truth polysemy and known matching. Actual gain: when using ground-truth polysemy but no given matching, the condition which best describes a natural polysemous corpus. Reported gain: 3 reported results from the literature on the representation of polysemy. Color (dark and white) marks the tasks.

Fig. 1 summarizes the results. We compare the performance of each polysemy representation method to the results of the vanilla global representation, which serves as the baseline that they aim to improve. Thus we show the difference between the performance of each method and this baseline. This difference is expected to be positive if the method indeed improves performance over baseline.
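The average-based matching used in this evaluation can be sketched as follows: a word's sense-specific vectors are collapsed into one vector by a component-wise mean before the cosine is taken. The sense inventories below are invented toy values, not the trained embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def average_vector(sense_vectors):
    """Component-wise mean of a word's sense-specific vectors."""
    n = len(sense_vectors)
    return [sum(vec[k] for vec in sense_vectors) / n
            for k in range(len(sense_vectors[0]))]

# Toy sense inventories (invented): chair has two senses, table one.
senses = {
    "chair": [[0.9, 0.1, 0.0],   # furniture sense
              [0.1, 0.9, 0.0]],  # person-in-charge sense
    "table": [[0.8, 0.2, 0.1]],
}

avg_chair = average_vector(senses["chair"])
score = cosine(avg_chair, average_vector(senses["table"]))
```

A word with a single sense is unchanged by the averaging, so global and sense-averaged pipelines differ only for words assigned multiple senses.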

              IDEAL    COLLAPSED
              GLOBAL   GLOBAL       AVERAGE
WS-353
  INDUCED P1   70       68.3 (1.5)   61.0 (0.7)
  INDUCED P2   70       68.1 (0.5)   63.9 (2.8)
  INDUCED P3   70       70.8 (0.7)   69.1 (1.1)
  --------------------------------------------
  HUANG        22.8                  71.3
  NEELAKANTAN  69.2                  70.9
  RAND.SENSES  70                    69.8
SCWS
  INDUCED P1   65.7     64.2 (0.5)   50.1 (1.1)
  INDUCED P2   65.7     64.8 (0.3)   56.7 (1.6)
  INDUCED P3   65.7     66.0 (0.1)   63.0 (0.3)
  --------------------------------------------
  HUANG        58.6                  62.8
  NEELAKANTAN  65.5                  67.2
  LI           64.6                  66.4
  RAND.SENSES  65.7                  67.3

Table 2: Polysemy-induced word similarity scores. IDEAL GLOBAL uses the original Wikipedia scores with complete disambiguation of senses in the training set and the evaluation test; COLLAPSED GLOBAL uses the collapsed Wikipedia scores when sense information is ignored; AVERAGE uses the average over sense-specific representations. Several random pairing parameters were used to induce polysemy: INDUCED P1: 10k words, INDUCED P2: 6k words, INDUCED P3: 2k words. Standard deviation over 10 independent runs in parentheses. Previous results are reported for comparison.

As clearly (and surprisingly) seen in Fig. 1, this is not the case for the condition that simulates the clearest polysemy representations, the condition termed Actual gain in the figure. Thus in the ideal control condition, based on induced polysemy, the word similarity tasks fail to demonstrate the value of polysemy representation in improving performance over the baseline. This stands in marked contrast to the results reported in the relevant literature, the 3 sets of bars marked in Fig. 1 as Reported gains. This seems to indicate that the reported gains do not reflect effective representation of true polysemy by these methods, but rather some other unknown factor which benefits performance.

Interestingly (and reassuringly), when we measure performance gain with the ideal method, which has access to the ground-truth polysemy and to the correct sense label for each word in a word pair in the word similarity tasks, we see performance gains (albeit small) over the vanilla method (Fig. 1, Ideal gain). This information is only available in the induced polysemy condition, and is generally not available in a natural corpus. It suggests that the failure to obtain performance gains in the word similarity tasks when using real ground-truth polysemy may be due to the need for additional information when computing similarity between words with multiple representations.

3.4 Discussion

The main result of the analysis described above is negative, demonstrating that word similarity tasks are not suitable to serve as gold standard tests for polysemy representation. However, the methodology we developed for polysemy induction constitutes a positive contribution, as it can be used to effectively test any task for its utility in the evaluation of polysemy representation while using state-of-the-art corpora. This may lead to the discovery of suitable tasks which can serve as gold standard evaluation tests for polysemy. Moreover, the use of polysemy induction adds yet another type of control to the NLP toolbox; such controls are still rarely implemented in NLP studies (but see Dubossarsky et al. (2017)).

4 The statistical signature of polysemous representations

The reported lack of a positive effect could also stem from the inaccurate representation of polysemy by the sense-specific vectors. To evaluate this alternative, we look at the statistics of the pairwise similarity between the different sense representations. We start from the observation that polysemy is inherently defined by word senses that are distinguishable from each other. We therefore expect sense-specific representations of the same word to show a smaller resemblance to each other when genuine polysemy is captured, as compared to arbitrary assignment to senses.

4.1 Sense-specific vector sets

Using the Wikipedia corpus, we investigated and compared 4 ways to obtain sense-specific word vector representations:

1. WIKI-INDUCED: sense-specific vectors obtained by way of polysemy induction (see Section 3.1).

2. CRP-ORIGINAL: the representations described in Li and Jurafsky (2015), reproduced using the provided code (https://github.com/jiweil/mutli-sense-embedding).

3. CRP-RANDOM: representations based on the random assignment of senses, obtained by modifying the code provided by Li and Jurafsky (2015). We changed only the words' sense assignment process, in order to achieve a random assignment of senses which is based on the same sense distribution as in the original model (see details in Section 6.1).

4. WIKI-RANDOM: random sense annotation (see details in Section 6.1).

4.2 Comparing the different sets

For each word in each of the four polysemic vector sets, the average cosine distance between its different sense-specific vectors is computed. The distribution of these average distances within a specific set is defined as its "polysemic signature", which is then compared across sets.

The results are shown in Fig. 2, revealing marked differences in the polysemic signatures of the four sense-specific vector sets. A high polysemic signature is seen for the induced polysemy set, which is the only set with guaranteed semantically different senses. The two random sets have a rather small polysemic signature, while the model originally designed to recover true polysemy (Li and Jurafsky, 2015) has an intermediate polysemic signature, which is much closer to the signature of the two random sense sets.

Figure 2: Density distribution of polysemic signatures for the four sets; see text for details.

4.3 Discussion

The broader implications of these results for our research hypothesis can be understood in the context of the findings reported in Section 3.3. The results described in Fig. 1 demonstrate a marked difference between previously reported effects of improved performance and the actual gain condition, which shows a worsening of performance in the same task when polysemic information is included. The results demonstrated in Fig. 2 can be described as a negative image of those presented in Fig. 1. Specifically, the actual gain condition of the induced polysemy has the largest polysemic signature as compared to the other conditions.

Together, these results indicate that the condition that demonstrates polysemy most clearly shows the poorest performance in the evaluation tests. The converse is also true: the conditions that demonstrate polysemy rather poorly show heightened performance in the evaluation tests. A gold standard for polysemy representation should entail that given optimal vector representations, performance on the evaluation tests would be optimal, and vice versa. Since our results demonstrate that the directions of optimal vector representation and optimal test performance are opposite, we are led to the following conclusions: First, methods which provide improvement in the word similarity tasks may not necessarily be suitable for the recovery of polysemous vector representations. Second, the word similarity task is not a suitable evaluation test for studying the recovery of polysemy.

5 Theoretical discussion

In this section we recall and analyze some properties of the cosine distance, and describe how they may partially account for the empirical observations discussed in this article. The crucial point is to model the contextual representation of words as a distribution over some vector space.

Let X_i denote the random variable which captures the contextual representation of word i. Let {X_i^l, X_j^l in R^d}, l = 1, ..., L, denote a sample of such representations for words i, j respectively, where L denotes the sample size; d corresponds to the dimension of the vector space when using word2vec representations, or to the number of words in the dictionary when using explicit representations (e.g., PPMI) (see Section 3.2). To simplify the analysis, we further assume that ||X^l|| = 1 for all l.

The similarity between two words i, j can be plausibly measured (as customarily done) by the cosine distance between their contextual representations, namely E[X_i X_j]:

    E[X_i X_j] = E[X_i] E[X_j] + cov(X_i, X_j)    (1)
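A small simulation makes the bias in (1) concrete. We draw unit-norm "occurrence" vectors for a word, pair it with a perfectly dependent word (X_j = X_i, an extreme chosen purely to make the covariance term visible), and compare the plug-in estimate of E[X_i] E[X_j] with the direct per-occurrence estimate of E[X_i X_j]. The generating process is our own toy construction, not the paper's model.

```python
import math
import random

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

rng = random.Random(0)
base = normalize([1.0, 1.0, 0.0])

# L unit-norm "occurrences" of word i; word j is perfectly dependent (Xj = Xi).
L = 2000
xi = [normalize([b + rng.gauss(0, 1.0) for b in base]) for _ in range(L)]
xj = xi

# Plug-in estimate: inner product of the averaged representations,
# i.e. an estimate of E[Xi] . E[Xj] -- the covariance term of (1) is lost.
mean_i = [sum(v[k] for v in xi) / L for k in range(3)]
mean_j = [sum(w[k] for w in xj) / L for k in range(3)]
plug_in = dot(mean_i, mean_j)

# Direct mini-batch estimate of E[Xi . Xj]: average the per-occurrence
# inner products (each "mini-batch" here is a single occurrence).
direct = sum(dot(v, w) for v, w in zip(xi, xj)) / L

# direct is 1 by construction (unit vectors), while plug_in = ||E[Xi]||^2 < 1;
# the gap between them estimates the covariance term of equation (1).
print(plug_in, direct)
```

Training one vector per word from the whole corpus corresponds to the plug-in estimate; splitting a word's occurrences into sub-samples and averaging the per-sample inner products moves the estimate toward the direct one.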

Thus the average distance is not equivalent to the distance between the average representations: there is an additional bias term, cov(X_i, X_j), which reflects the statistical dependence between the two vector representations X_i, X_j. This term is significant, because the contextual representations of two words are likely to exhibit strong dependence, especially when the words are more similar.

This is where the problem lies. In the process of generating words' representations, we start from a sample of sentences and generate a single representation. This representation is essentially our estimate of E[X_i] for word i. When multiplying two such representations in order to compute the cosine distance between them, we obtain an estimate for E[X_i] E[X_j], which is not a good estimate for E[X_i X_j] because of the bias term in (1).

Ideally, in order to provide an unbiased estimate of E[X_i X_j], we should divide the sample of sentences into mini-batches, compute the appropriate contextual representation of both words i, j from each mini-batch, and then directly estimate E[X_i X_j] by averaging the products of the corresponding representations across mini-batches. Interestingly, in the process of generating polysemous representations, whether relying on true polysemy or on arbitrary polysemy, we essentially accomplish the same goal: for each word, a mini-batch is replaced by the subset of sentences in which only one of the word's meanings is present. (For the purpose of this discussion we ignore sentences in which a word appears more than once.) If sense matching (see Section 2) is achieved by way of average or weighted average, this implies that our estimate of word similarity should improve with the number of senses used in the analysis, especially when the assignment is arbitrary. Of course, any improvement is hampered by the deterioration in the quality of the contextual representation computed from the smaller mini-batch sample, and therefore improvement is only expected for a small number of real or artificial "senses".

6 Performance gain revisited

The empirical findings presented so far converge on the conclusion that the performance gains reported in prior art may not stem from the utility of polysemic information, as previously claimed, but are the result of an alternative source. In the theoretical discussion we argued that random sense annotation is equivalent to sub-sampling and multiple estimation of contextual vector representations, and that this alone may be beneficial for the estimation of word similarity. A reasonable hypothesis, then, is that sub-sampling and multiple vector estimation produced the reported performance gains. In this section we test this hypothesis directly.

In order to do so, we propose a simple control condition in which senses are randomly assigned to words in a corpus, and sense-specific vectors are produced in the same way as before. That an effect is reliably attributable to genuine polysemy can only be established if a similar effect is lacking or significantly reduced in this control condition.

6.1 Random sense assignment

We investigated two ways to achieve random sense assignment:

Sim1: Sampling from a known distribution. For the entire corpus and vocabulary (100k words), we assigned senses at random from a known probability distribution (note that Neelakantan et al. (2014) and Li and Jurafsky (2015) also took this entire-vocabulary approach). After trying both the uniform and multinomial distributions, and with different numbers of possible senses for each distribution, we empirically found that the results differed only slightly between the conditions.

Sim2: Sampling from an unknown distribution. To test the hypothesis more directly against the sense distribution used in prior work, we first reproduced sense-specific vectors using the model and code described in Li and Jurafsky (2015). We kept their Chinese-Restaurant-Process probabilistic mechanism, where senses are assigned to words based on the similarity of their contexts; we only shuffled the elements of the final vector of sense assignments produced by the model. In addition, and for further comparison, we used the original code unchanged to reproduce another set of global and sense-specific vectors.

6.2 Performance boost due to word sampling

The results of Sim1 show a marked performance gain for the sense-specific vectors that were produced at random as compared to the global vectors (see Table 2, below the dashed line).
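A Sim1-style random assignment amounts to a simple relabeling of tokens before training, e.g. uniformly over a fixed number of senses. The corpus, sense count, and naming scheme below are illustrative assumptions; the actual pipeline feeds the relabeled tokens to word2vec.

```python
import random
from collections import Counter

def assign_random_senses(tokens, n_senses=3, seed=0):
    """Relabel every token 'w' as 'w_<s>' with s drawn uniformly at random.

    Training word2vec on the relabeled corpus yields n_senses vectors per
    word, each estimated from an independent random sub-sample of the word's
    occurrences -- the sub-sampling that Section 5 argues is beneficial.
    """
    rng = random.Random(seed)
    return [f"{w}_{rng.randrange(n_senses)}" for w in tokens]

corpus = ("the chair stood by the table " * 200).split()
relabeled = assign_random_senses(corpus)

counts = Counter(relabeled)
# 'chair' now appears under (up to) 3 pseudo-sense labels whose counts
# sum to its original frequency of 200.
chair_total = sum(c for tok, c in counts.items() if tok.startswith("chair_"))
assert chair_total == 200
```

By construction the pseudo-senses carry no semantic information, so any performance gain they produce must come from the sampling itself.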

In fact, the effect reported in Neelakantan et al. (2014) is replicated almost exactly, perhaps because they also used a fixed number of senses for each word, as we did in this simulation. Furthermore, the results of Sim2 in Table 3 are almost identical in the original and random conditions. This means that randomly assigning words to senses does not weaken the effect, even though it did change the "polysemic signature" (see Fig. 2).

         ORIGINAL            RANDOM
         GLOBAL   AVE.       GLOBAL   AVE.
WS-353   61.0     67.8       60.5     68.2
SCWS     58.9     66.2       57.4     66.2

Table 3: Word similarity test scores based on the original method of Li and Jurafsky (2015), and on its modification with shuffled sense assignment.

Taken together, these two independent control conditions clearly show that an effect of the same magnitude as previously reported in several studies emerges under random sense assignment. Therefore, our findings strongly undermine the assumption that the reported effects are in any way related to actual polysemy.

7 Summary and Discussion

Here we investigate the validity of polysemy representation methods. First, we question the validity of using word similarity tasks to evaluate the performance of such methods. To test the claim that resolving the polysemy of words improves performance in these tasks, we used real-world polysemy in two independent conditions: (i) a human-annotated corpus that tags word senses and (ii) a corpus in which polysemy was induced in a controlled artificial fashion. In both conditions, the performance in the word similarity tasks deteriorated. This implies that previously observed improvements in performance in these tasks stem from one or both of the following: (i) the tasks are not suitable for evaluating sense-specific vectors, or (ii) the commonly produced sense-specific vectors do not accurately capture polysemy.

With regard to the first hypothesis, if the word similarity task is inadequate to evaluate polysemy, why would it show high gains for polysemic representations? One possibility is that polysemic information per se is not required to drive such effects, and instead these effects are artificially caused by the procedures by which polysemous vectors are created. To prove this, we set out to demonstrate that even representations that bear no polysemic information could nonetheless yield improved performance due to a methodological artifact. We thus created a control condition, in which we randomly assigned word occurrences to senses, and found that randomly-produced sense-specific vectors indeed showed a marked improvement in performance. Since this effect cannot stem from polysemy (which is lacking in this condition), it may only be the result of a methodological artifact - the sampling procedure entailed by the assignment of words to senses. Note that this experiment provides evidence against the validity of the evaluation task, but does not address the validity of the representations, as both accurate and inaccurate representations would result in similar performance due to the same sampling artifact.

The existence of a sampling artifact is supported by our theoretical discussion, showing that multiple vector sampling can lower the bias of the estimator of the cosine distance between two vectorial random variables. The underlying assumption is that a better model of contextual word representation should employ a population of vectors. Thus, we demonstrate both experimentally and theoretically that the evaluation test does not provide a marker for the utility of polysemy, but is artifactually influenced by the procedure by which polysemic representations are created.

To address the hypothesis that polysemic representations may be inaccurate, we measured the similarity between word senses, as reflected by their sense-specific vectors. We found that the sense-specific vectors of previous studies were more similar to vectors obtained by assigning random polysemy than to vectors obtained by polysemy induction. Nevertheless, we note that our proposed polysemic signature should be taken as a coarse rule of thumb, and more systematic methods for evaluating the accuracy of sense-specific representations should be employed in order to fully address this hypothesis.

Essentially, the findings reported here mean that there is no solid empirical foundation to the claim that sense-specific vectors improve performance in evaluation tests. In fact, they corroborate the general impression that polysemic representations do not improve performance on most downstream tasks (Li and Jurafsky, 2015). It may be the case that sense-specific vectors can or will show heightened performance on evaluation tests, or improve downstream tasks, but this will have to be demonstrated on the basis of tasks whose suitability has been properly validated, and any effects reported will have to be supported by demonstrating that these effects are absent or strongly reduced in a properly articulated control condition.

References

Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. 2016. Linear algebraic structure of word senses, with applications to polysemy. arXiv preprint arXiv:1601.03764.

Xinxiong Chen, Zhiyuan Liu, and Maosong Sun. 2014. A unified model for word sense representation and disambiguation. In Proceedings of EMNLP, pages 1025–1035.

Haim Dubossarsky, Daphna Weinshall, and Eitan Grossman. 2017. Outta control: Laws of semantic change and inherent biases in word representation models. In Proceedings of EMNLP, pages 1136–1145.

Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2001. Placing search in context: The concept revisited. In Proceedings of the 10th international conference on World Wide Web, pages 406–414. ACM.

William A Gale, Kenneth W Church, and David Yarowsky. 1992. Work on statistical methods for word sense disambiguation. In AAAI Fall Symposium on Probabilistic Approaches to Natural Language, volume 54, page 60.

Eric H Huang, Richard Socher, Christopher D Manning, and Andrew Y Ng. 2012. Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1, pages 873–882. Association for Computational Linguistics.

Ignacio Iacobacci, Mohammad Taher Pilehvar, and Roberto Navigli. 2015. Sensembed: Learning sense embeddings for word and relational similarity. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), volume 1, pages 95–105.

Jiwei Li and Dan Jurafsky. 2015. Do multi-sense embeddings improve natural language understanding? arXiv preprint arXiv:1506.01070.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Arvind Neelakantan, Jeevan Shankar, Alexandre Passos, and Andrew McCallum. 2014. Efficient nonparametric estimation of multiple embeddings per word in vector space. In Proceedings of EMNLP.

Yifan Sun, Nikhil Rao, and Weicong Ding. 2017. A simple approach to learn polysemous word embeddings. arXiv preprint arXiv:1707.01793.

Ralph Weischedel, Martha Palmer, Mitchell Marcus, Eduard Hovy, Sameer Pradhan, Lance Ramshaw, Nianwen Xue, Ann Taylor, Jeff Kaufman, Michelle Franchini, et al. 2013. OntoNotes release 5.0 LDC2013T19. Linguistic Data Consortium, Philadelphia, PA.


Chapter 6

Discussion and Conclusions

This dissertation has presented a new research paradigm for the study of semantic change. It presents the first large-scale bottom-up approach that uses Vector Space Models (VSMs) and focuses on research questions originating in traditional linguistics. While computational approaches have some drawbacks (see Section 1.3), they also have important advantages. First, a bottom-up approach allows more objective research, as it prevents bias in the selection of examples. Second, large-scale studies make research more reliable, as findings are more credible when they are not based on a small set of examples (even if objectively chosen). These two aspects of computational approaches may contribute to a higher standard of research in semantic change. In addition, we address fundamental methodological issues which are critical both to this emerging field and to the NLP research community at large. The fact that these types of research questions were never tested on a large scale, combined with the lack of an empirical research tradition in NLP, has made the development and adaptation of rigorous research methodologies of paramount importance.

The benefit of our approach for semantic change research

In two experiments (Chapters 2 and 3), we were able to show the usefulness of a straightforward diachronic analysis of vector representations over an entire lexicon. In Chapter 2, we examined whether words with different part-of-speech (POS) tags, i.e., nouns, verbs, and adjectives, differ in their rates of semantic change. This is the first time such a question could be experimentally tested, as it required a large-scale, bottom-up analysis over an entire lexicon, which was impossible until recently. We report robust and systematic differences in the rates of semantic change that are associated with different POS types. Specifically, throughout the 20th century we find that verbs show greater changes than nouns and adjectives. This regularity in semantic change, which we call the DIACHRONIC WORD-CLASS EFFECT, is a completely novel finding, which uncovers a covert regularity of semantic change that may only be observed on a large scale. The novelty of such a finding is further attested by the fact that there was no linguistic theory that had predicted it (or was able to account for it). In the end, it was a theory from psycholinguistics that suggested a synchronic cognitive mechanism that we propose is the source of the word-class effect. According to the verb mutability effect (Gentner and France 1988), language speakers prefer to modify the meaning of a verb, rather than that of a noun, when they encounter an utterance with a semantic mismatch (e.g., the lizard worshiped). We hypothesize that this synchronic preference has an accumulated effect, which may explain why, over time, verbs change more than nouns. Ultimately, the fact that an "external" theoretical explanation was required further demonstrates the potential of this bottom-up approach in contributing to semantic change research.
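As a rough illustration of the kind of lexicon-wide comparison involved, the following sketch computes a per-word change score and averages it within each word class. The function names and the use of plain cosine distance are my own simplifications for illustration; the actual study works with aligned diachronic vector spaces and controls for word frequency:

```python
import numpy as np

def change_rate(vec_t1, vec_t2):
    # Semantic change approximated as the cosine distance between a
    # word's vectors at two time points (assumes the spaces are aligned).
    sim = np.dot(vec_t1, vec_t2) / (np.linalg.norm(vec_t1) * np.linalg.norm(vec_t2))
    return 1.0 - sim

def mean_change_by_pos(vectors_t1, vectors_t2, pos_tags):
    # Average change rate per word class over the whole lexicon.
    by_pos = {}
    for word, pos in pos_tags.items():
        score = change_rate(vectors_t1[word], vectors_t2[word])
        by_pos.setdefault(pos, []).append(score)
    return {pos: float(np.mean(scores)) for pos, scores in by_pos.items()}
```

Under the diachronic word-class effect, the mean score for verbs would exceed those for nouns and adjectives.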
This novel finding extends beyond existing or expected results and provides surprising ones, and is thus able to nourish theoretical breakthroughs. The second paper (see Chapter 3) tackles a theoretical question: are there linguistic factors that make certain words more prone to semantic change? The results obtained emphasize the importance of the semantic relations between words to their likelihood of change. According to the DIACHRONIC PROTOTYPICALITY EFFECT, the degree of proximity of words to their category's prototype (e.g., the proximity of a robin or a peacock to a prototypical 'bird') plays a decisive role in semantic change. Specifically, the more prototypical a word is (i.e., the more similar it is to its category prototype), the more "protected" it is from semantic change, and vice versa. This finding corroborates Geeraerts' (1985, 1992) results, which were based on the descriptive analysis of only two verbs. Thus, this study further demonstrates the utility of our large-scale, bottom-up approach in contributing to research into the linguistic causes of semantic change by rigorously testing and corroborating an existing theoretical hypothesis on a much larger scale. The notion that words do not change in isolation, but rather in an intricate manner that involves their relations with other words, is not a novel one. For example, it has often been assumed that changes in words' meanings are due to a tendency for languages to avoid ambiguous form-meaning pairings, such as homonymy, synonymy, and polysemy (Anttila 1989; Menner 1945). On the other hand, when related words are examined together, one word's change of meaning often "drags along" other words in that semantic field, leading to parallel change (Lehrer 1985).
The proposed diachronic prototypicality effect joins a recent study by Xu and Kemp (2015) in the renewal of interest in the idea that the inter-relations between words are central to predicting their trajectories of semantic change. We hope that our research papers will contribute to the study of semantic change by tightly coupling the latest developments in NLP with linguistic questions. Our first publications were credited in recent studies that extended our approach to other aspects of semantic change or to other languages (Ponti et al. 2017; Rodda et al. 2017). Our first two studies were unique in the landscape of this field of research. At the time when the research was carried out, we gave little attention to methodological issues. Only after the studies were completed, and several other works appeared, did we notice important methodological drawbacks that went unnoticed at first. These are, primarily, the lack of gold-standard evaluation sets for semantic change and unvalidated metrics for semantic change. Traditionally, research on semantic change was not based on rigorous statistical analysis to test its hypotheses, simply because its small scale and subjective nature did not lend themselves to such analyses. Likewise, statistical hypothesis testing has not been relevant for the vast majority of NLP works, most of which did not test theories. Only when these two streams of research are combined in our empirical framework do certain methodological issues surface. This is the background for the last two studies in this dissertation, which take a more methodological approach, as the necessity and importance of dealing with these problems became clearer.
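To make the prototypicality measure concrete, here is a minimal sketch under one assumption of my own, made for illustration only: that a category's prototype can be operationalized as the centroid of its members' vectors. In a study of this kind, such a score would then be correlated with per-word rates of change:

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def prototypicality(word_vec, category_vecs):
    # Proximity of a word to its category prototype, taken here as the
    # centroid of the vectors of the category's members.
    prototype = np.mean(category_vecs, axis=0)
    return cosine(word_vec, prototype)
```

Under the diachronic prototypicality effect, words scoring high on this measure (a robin among birds) should show smaller rates of semantic change than low-scoring ones (a peacock).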

Important methodological guidelines for this research field and a reminder to others

The studies in Chapters 4 and 5 critically examine fundamental methodological issues in current research. Our third paper addresses the problem of assessing models of semantic change in the absence of gold-standard evaluation sets, as well as validating their results. Specifically, we examine the standard metric that is used to evaluate semantic change – the cosine-distance between a word's vectors at two periods – which, despite being widely used, has never been directly validated. We report that this method adds a bias to the estimate of semantic change, and that the size of this bias is inversely proportional to the words' frequency (i.e., more frequent words produce a smaller bias, and vice versa). Ultimately, this bias comprises the lion's share of the semantic change scores that are commonly reported across many studies. With respect to the former laws of semantic change that were proposed, including our own (see bullet points in Section 1.2.4), we report that the Diachronic Prototypicality Effect was significantly diminished, and its true size is half what we previously reported in Chapter 3, explaining 5% of the total semantic changes of the whole lexicon as opposed to the 10% that was previously reported. Significantly, the laws proposed by Hamilton et al. (2016) were far more impaired by this bias: the Law of Conformity plunged to about 15% of its originally reported size, and the Law of Innovation vanished completely. Importantly, the Diachronic Word-Class Effect that we reported in Chapter 2 is exempt from this scrutiny. This is because the words' frequency, which is the driving source of this bias, is equated between the three groups of word classes, and therefore cannot serve as an alternative account for the results. Of course, there is nothing innately biased in the cosine-distance as a metric per se; it is only its interaction with certain types of word vector representation that turns it into a biased estimator for semantic change.
This is demonstrated by the lack of bias for word representations that are based on Positive Pointwise Mutual Information (PPMI), as opposed to the large bias observed for the more common representation types (PPMI + SVD and SGNS-word2vec). Clearly, for the purpose of evaluating semantic change using the cosine-distance, PPMI representations should be used. We conclude, however, that if other representations are chosen, their results should be critically compared to a proper control condition, as our paper demonstrates. Importantly, we propose a general framework to circumvent the problem of assessing models' quality in the absence of gold-standard evaluation sets, as well as to verify the reliability of results obtained using such models. This framework is based on a critical comparison to a control condition and is not limited to semantic change research, but may be used in various research domains in NLP. Interestingly, the very basis of our approach – the assumption that the distance between word vectors represents the semantic similarity between the words – has been called into question lately. Mimno and Thompson (2017), for example, have shown that vector representations are sensitive to the hyper-parameters used in the training phase of predictive word-embedding models (e.g., word2vec, GloVe), which adds a substantial stochastic artefact to the distances between these vectors. Other studies have analyzed an alternative approach to word meaning representation that is based on a word's closest neighbors in the embedding space, and that represents semantic changes according to changes in each word's closest neighbors. These studies conclude that this approach, perhaps the second most popular after the approach we use, leads to unstable word representations (Antoniak and Mimno 2017) with low reliability (i.e., repeated initializations of the model led to markedly different results), and is specifically problematic for diachronic use (Hellrich and Hahn 2016).
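As a sketch of the metric under discussion, the code below builds PPMI vectors from raw co-occurrence counts for two time periods and scores a word's change as the cosine distance between its two vectors. This is a toy reconstruction, not the paper's pipeline; in particular, the paper's control condition (comparing scores on a real corpus against a chronologically shuffled one, in which no genuine change exists) is only described here, not implemented:

```python
import numpy as np

def ppmi_matrix(counts):
    # Positive PMI from a word-by-context co-occurrence count matrix.
    total = counts.sum()
    p_w = counts.sum(axis=1, keepdims=True) / total   # marginal word probs
    p_c = counts.sum(axis=0, keepdims=True) / total   # marginal context probs
    p_wc = counts / total                             # joint probabilities
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_wc / (p_w * p_c))
    pmi[~np.isfinite(pmi)] = 0.0                      # zero out log(0) cells
    return np.maximum(pmi, 0.0)                       # keep only positive PMI

def semantic_change(counts_t1, counts_t2, row):
    # Cosine distance between a word's PPMI vectors at two periods
    # (rows and columns are assumed to index the same words/contexts).
    v1 = ppmi_matrix(counts_t1)[row]
    v2 = ppmi_matrix(counts_t2)[row]
    denom = np.linalg.norm(v1) * np.linalg.norm(v2)
    if denom == 0:
        return np.nan
    return 1.0 - float(np.dot(v1, v2) / denom)
```

A word whose context distribution is unchanged between the two periods scores near zero; a word whose contexts are disjoint scores near one. On the paper's account, a shuffled-corpus control run through the same code would expose how much of a reported score is frequency-driven bias rather than genuine change.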
Our results supplement these findings in two ways: first, by reporting a critical analysis of the other, more popular metric of semantic change, which warrants similar reservations; and second, by advocating the importance of principled methodological work that needs to be done in order to test the validity and reliability of any proposed variable before putting it to use in actual research. As this field of large-scale computational approaches to semantic change research is still nascent, such methodological critiques are to be expected, but are also paramount for the advancement of the research field at large. Our last research paper (see Chapter 5) aims to improve word vectors such that they include polysemic information in a way that is more accessible for semantic change research. In this paper, we aim to test the feasibility of using sense-specific vectors as a more informed model for semantic change, due to their presumably more ecological representation of meaning. Specifically, we set out to investigate the validity of the claim that sense-specific vectors truly represent polysemic information, a claim that was based on performance gains obtained in word-similarity tasks using these vectors. This is a necessary step before such vectors could be used as a viable tool to study changes in sense representations over time.

The crux of our paper is the validity of word-similarity tasks in evaluating polysemic word representations, because these are the main, and almost sole, tasks in which performance gains for sense-specific vectors were observed. In a series of critical evaluations, we found that performance gains similar in size to gains previously reported for sense-specific vectors (Huang et al. 2012; Li and Jurafsky 2015; Neelakantan et al. 2014) were obtained for vectors that were produced using random sense annotations. In contrast, we found that sense-specific vectors that were based on the most accurate sense distinctions in fact worsened performance compared to global word representations. This dissociation between sense-specific vectors and performance gains in word-similarity tasks undermines the claim that the latter are due to true polysemy distinctions that are captured by the sense-specific vectors, although it cannot completely rule out that sense-specific vectors do capture polysemy to some extent. Importantly, our analysis does show that multiple estimations of word vectors through sub-sampling lead to performance gains, presumably as sub-sampling averages out random bias that is captured by the word vectors. This positive finding may be the true source of the performance gains obtained for sense-specific vectors, as their production is equivalent in some sense to a sub-sampling procedure. Based on these two findings, we conclude that the ability of current models of sense-specific vectors to truly capture polysemy is questionable at best, and probably false. As a result, these vectors cannot be used in semantic change research to study changes in sense representations over time. Interestingly, recent studies have argued against the notion that polysemy must be represented using a distinct vector per sense, and have shown that global representations are able to capture polysemic information to a great extent (Arora et al. 2016; Sun et al. 2017).
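The sub-sampling intuition (that averaging several independently estimated vectors cancels out random training noise) can be illustrated with a toy simulation. The noise model below is my own stand-in for the variability of repeated embedding training runs, not the paper's actual procedure:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 50
true_vec = rng.normal(size=dim)  # the "ideal" representation of a word

def subsample_estimate():
    # Stand-in for a vector learned from one sub-sample of the corpus:
    # the true vector plus independent training noise.
    return true_vec + rng.normal(scale=1.0, size=dim)

single = subsample_estimate()
averaged = np.mean([subsample_estimate() for _ in range(50)], axis=0)

err_single = np.linalg.norm(single - true_vec)
err_averaged = np.linalg.norm(averaged - true_vec)
# Averaging over k sub-samples shrinks the random error by roughly sqrt(k),
# without any polysemic information being involved.
```

This mirrors the argument in the text: producing several sense-specific vectors per word resembles sub-sampling, so part of the reported gains may stem from noise reduction rather than from captured polysemy.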
Overall, this paper provides a critical reevaluation of prominent results and common assumptions in polysemy research that are not limited to semantic change research. Our analyses and findings further promote the use of carefully designed control conditions and validation routines in NLP research at large. This joins our previous paper in emphasizing the importance of a meticulous evaluation, and sometimes re-evaluation, of working hypotheses that are considered fundamental, almost axiomatic, in the research field.

References

Antoniak, Maria and David Mimno (2017). "Evaluating the Stability of Embedding-based Word Similarities". In: TACL 6, pp. 107–119.

Anttila, Raimo (1989). Historical and comparative linguistics. Vol. 6. John Benjamins Publishing.

Arora, Sanjeev, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski (2016). "Linear algebraic structure of word senses, with applications to polysemy". In: arXiv preprint arXiv:1601.03764.

Geeraerts, Dirk (1985). "Cognitive restrictions on the structure of semantic change". In: Historical Semantics, pp. 127–153.

— (1992). "Prototypicality effects in diachronic semantics: A round-up". In: Diachrony within Synchrony: language, history and cognition. Ed. by G. Kellermann and M. D. Morissey. Frankfurt am Main: Peter Lang, pp. 183–203.

Gentner, Dedre and Ilene M France (1988). "The verb mutability effect: Studies of the combinatorial semantics of nouns and verbs". In: Lexical ambiguity resolution. Elsevier, pp. 343–382.

Hamilton, William L, Jure Leskovec, and Dan Jurafsky (2016). "Diachronic word embeddings reveal statistical laws of semantic change". In: Proceedings of ACL.

Hellrich, Johannes and Udo Hahn (2016). "Bad company—neighborhoods in neural embedding spaces considered harmful". In: Proceedings of COLING: Technical Papers, pp. 2785–2796.

Huang, Eric H, Richard Socher, Christopher D Manning, and Andrew Y Ng (2012). "Improving word representations via global context and multiple word prototypes". In: Proceedings of ACL. Association for Computational Linguistics, pp. 873–882.

Lehrer, Adrienne (1985). "The influence of semantic fields on semantic change". In: Historical Semantics, Historical Word Formation 29, p. 283.

Li, Jiwei and Dan Jurafsky (2015). "Do multi-sense embeddings improve natural language understanding?" In: arXiv preprint arXiv:1506.01070.

Menner, Robert J (1945). "Multiple meaning and change of meaning in English". In: Language, pp. 59–76.

Mimno, David and Laure Thompson (2017). "The strange geometry of skip-gram with negative sampling". In: Proceedings of EMNLP, pp. 2873–2878.

Neelakantan, Arvind, Jeevan Shankar, Alexandre Passos, and Andrew McCallum (2014). "Efficient nonparametric estimation of multiple embeddings per word in vector space". In: Proceedings of EMNLP.

Ponti, Edoardo Maria, Elisabetta Jezek, and Bernardo Magnini (2017). "Distributed Representations of Lexical Sets and Prototypes in Causal Alternation Verbs". In: Italian Journal of Computational Linguistics.

Rodda, Martina A., Marco S.G. Senaldi, and Alessandro Lenci (2017). "Panta rei: Tracking semantic change with distributional semantics in ancient Greek". In: Italian Journal of Computational Linguistics 3.1, pp. 11–24. ISSN: 16130073.

Sun, Yifan, Nikhil Rao, and Weicong Ding (2017). "A simple approach to learn polysemous word embeddings". In: arXiv preprint arXiv:1707.01793.

Xu, Yang and Charles Kemp (2015). "A Computational Evaluation of Two Laws of Semantic Change." In: CogSci.

Semantic change at large: A computational approach for semantic change research

by: Haim Dubossarsky
Advisors: Dr. Eitan Grossman and Prof. Daphna Weinshall

Abstract

The search for patterns and laws describing why and how words change their meaning over time holds great potential in light of two recent developments in the field: the ever-growing availability of digitized historical corpora, and innovations in computational methods that enable automatic semantic processing. Compared to traditional research methods in the study of semantic change, this combination makes it possible to detect new patterns of change and to identify their causes – discoveries that rest on the advantage of analyses carried out at a large scale, over all occurrences of the words in the textual corpora.

In a series of studies, I demonstrated that this new approach to semantic change research is not only feasible, but also fruitful. My research papers showed that this approach can extend the scope of research beyond familiar phenomena and uncover previously unknown patterns of semantic change, as well as ground existing theories of semantic change in more precise and reliable analyses.

At the same time, this research also revealed the importance of methodological issues. Since this is a new field of research, methodological critiques of this kind are to be expected, and are even essential for establishing the field on reliable foundations.

My last two papers, which were dedicated to these methodological problems, promise that semantic change research conducted with computational approaches will rest on sound methodologies from its very beginning. In these studies I not only questioned the validity of earlier results, but also proposed systematic procedures designed to ensure an objective and reliable analysis of research results. These methodological emphases can benefit the NLP community at large, since their insights and the applications associated with them hold for domains that extend beyond semantic change.

This work was carried out under the supervision of:

Dr. Eitan Grossman and Prof. Daphna Weinshall

The Hebrew University of Jerusalem

Semantic change at large:

A computational approach for semantic change research

Thesis for the degree

"Doctor of Philosophy" in

Brain Sciences: Computation and Information Processing

by

Haim Dubossarsky

Submitted to the Senate of the Hebrew University of Jerusalem

April 2018 (Iyar 5778)