
Diachronic Embeddings Reveal Statistical Laws of Semantic Change

William L. Hamilton, Jure Leskovec, Dan Jurafsky
Department of Computer Science, Stanford University, Stanford, CA 94305
{wleif,jure,jurafsky}@stanford.edu

Abstract

Understanding how words change their meanings over time is key to models of language and cultural evolution, but historical data on meaning is scarce, making theories hard to develop and test. Word embeddings show promise as a diachronic tool, but have not been carefully evaluated. We develop a robust methodology for quantifying semantic change by evaluating word embeddings (PPMI, SVD, word2vec) against known historical changes. We then use this methodology to reveal statistical laws of semantic evolution. Using six historical corpora spanning four languages and two centuries, we propose two quantitative laws of semantic change: (i) the law of conformity—the rate of semantic change scales with an inverse power-law of word frequency; (ii) the law of innovation—independent of frequency, words that are more polysemous have higher rates of semantic change.

1 Introduction

Shifts in word meaning exhibit systematic regularities (Bréal, 1897; Ullmann, 1962). The rate of semantic change, for example, is higher in some words than others (Blank, 1999)—compare the stable semantic history of cat (from Proto-Germanic kattuz, "cat") to the varied meanings of English cast: "to mould", "a collection of actors", "a hardened bandage", etc. (all from Old Norse kasta, "to throw", Simpson et al., 1989).

Various hypotheses have been offered about such regularities in semantic change, such as an increasing subjectification of meaning, or the grammaticalization of inferences (e.g., Geeraerts, 1997; Blank, 1999; Traugott and Dasher, 2001).

But many core questions about semantic change remain unanswered. One is the role of frequency. Frequency plays a key role in other linguistic changes, associated sometimes with faster change—sound changes like lenition occur in more frequent words—and sometimes with slower change—high frequency words are more resistant to morphological regularization (Bybee, 2007; Pagel et al., 2007; Lieberman et al., 2007). What is the role of word frequency in meaning change?

Another unanswered question is the relationship between semantic change and polysemy. Words gain senses over time as they semantically drift (Bréal, 1897; Wilkins, 1993; Hopper and Traugott, 2003), and polysemous words[1] occur in more diverse contexts, affecting lexical access speed (Adelman et al., 2006) and rates of L2 learning (Crossley et al., 2010). But we don't know whether the diverse contextual use of polysemous words makes them more or less likely to undergo change (Geeraerts, 1997; Winter et al., 2014; Xu et al., 2015). Furthermore, polysemy is strongly correlated with frequency—high frequency words have more senses (Zipf, 1945; İlgen and Karaoğlan, 2007)—so understanding how polysemy relates to semantic change requires controlling for word frequency.

Answering these questions requires new methods that can go beyond the case-studies of a few words (often followed over widely different time-periods) that are our most common diachronic data (Bréal, 1897; Ullmann, 1962; Blank, 1999; Hopper and Traugott, 2003; Traugott and Dasher, 2001). One promising avenue is the use of distributional semantics, in which words are embedded in vector spaces according to their co-occurrence relationships (Bullinaria and Levy, 2007; Turney and Pantel, 2010), and the embeddings of words

[1] We use 'polysemy' here to refer to related senses as well as rarer cases of accidental homonymy.

Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 1489-1501, Berlin, Germany, August 7-12, 2016. © 2016 Association for Computational Linguistics

Figure 1: Two-dimensional visualization of semantic change in English using SGNS vectors.[2] a, The word gay shifted from meaning "cheerful" or "frolicsome" to referring to homosexuality. b, In the early 20th century broadcast referred to "casting out seeds"; with the rise of television and radio its meaning shifted to "transmitting signals". c, Awful underwent a process of pejoration, as it shifted from meaning "full of awe" to meaning "terrible or appalling" (Simpson et al., 1989).

are then compared across time-periods. This new direction has been effectively demonstrated in a number of case-studies (Sagi et al., 2011; Wijaya and Yeniterzi, 2011; Gulordava and Baroni, 2011; Jatowt and Duh, 2014) and used to perform large-scale linguistic change-point detection (Kulkarni et al., 2014) as well as to test a few specific hypotheses, such as whether English synonyms tend to change meaning in similar ways (Xu and Kemp, 2015). However, these works employ widely different embedding approaches and test their approaches only on English.

In this work, we develop a robust methodology for quantifying semantic change using embeddings by comparing state-of-the-art approaches (PPMI, SVD, word2vec) on novel benchmarks. We then apply this methodology in a large-scale cross-linguistic analysis using 6 corpora spanning 200 years and 4 languages (English, German, French, and Chinese). Based on this analysis, we propose two statistical laws relating frequency and polysemy to semantic change:

• The law of conformity: Rates of semantic change scale with a negative power of word frequency.

• The law of innovation: After controlling for frequency, polysemous words have significantly higher rates of semantic change.

2 Diachronic embedding methods

The following sections outline how we construct diachronic (historical) word embeddings, by first constructing embeddings in each time-period and then aligning them over time, and the metrics that we use to quantify semantic change. All of the learned embeddings and the code we used to analyze them are made publicly available.[3]

2.1 Embedding algorithms

We use three methods to construct word embeddings within each time-period: PPMI, SVD, and SGNS (i.e., word2vec).[4] These distributional methods represent each word w_i by a vector w_i that captures information about its co-occurrence statistics. These methods operationalize the 'distributional hypothesis' that word semantics are implicit in co-occurrence relationships (Harris, 1954; Firth, 1957). The semantic similarity/distance between two words is approximated by the cosine similarity/distance between their vectors (Turney and Pantel, 2010).

2.1.1 PPMI

In the PPMI representations, the vector embedding for word w_i ∈ V contains the positive point-wise mutual information (PPMI) values between w_i and a large set of pre-specified 'context' words. The word vectors correspond to the rows of the matrix M^PPMI ∈ R^{|V|×|V_C|} with entries given by

    M^PPMI_{i,j} = max( log[ p̂(w_i, c_j) / ( p̂(w_i) p̂(c_j) ) ] − log α, 0 ),   (1)

where c_j ∈ V_C is a context word and α > 0 is a negative prior, which provides a smoothing bias (Levy et al., 2015). The p̂ correspond to the smoothed empirical probabilities of word

[2] Appendix B details the visualization method.
[3] http://nlp.stanford.edu/projects/histwords
[4] Synchronic applications of these three methods are reviewed in detail in Levy et al. (2015).

Name     Language  Description                 Tokens       Years      POS source
ENGALL   English   Google books (all genres)   8.5 × 10^11  1800-1999  (Davies, 2010)
ENGFIC   English   Fiction from Google books   7.5 × 10^10  1800-1999  (Davies, 2010)
COHA     English   Genre-balanced sample       4.1 × 10^8   1810-2009  (Davies, 2010)
FREALL   French    Google books (all genres)   1.9 × 10^11  1800-1999  (Sagot et al., 2006)
GERALL   German    Google books (all genres)   4.3 × 10^10  1800-1999  (Schneider and Volk, 1998)
CHIALL   Chinese   Google books (all genres)   6.0 × 10^10  1950-1999  (Xue et al., 2005)

Table 1: Six large historical datasets from various languages and sources are used.

(co-)occurrences within fixed-size sliding windows of text. Clipping the PPMI values above zero ensures they remain finite and has been shown to dramatically improve results (Bullinaria and Levy, 2007; Levy et al., 2015); intuitively, this clipping ensures that the representations emphasize positive word-word correlations over negative ones.

2.1.2 SVD

SVD embeddings correspond to low-dimensional approximations of the PPMI embeddings learned via singular value decomposition (Levy et al., 2015). The vector embedding for word w_i is given by

    w_i^SVD = (U Σ^γ)_i ,   (2)

where M^PPMI = U Σ V^T is the truncated singular value decomposition of M^PPMI and γ ∈ [0, 1] is an eigenvalue weighting parameter. Setting γ < 1 has been shown to dramatically improve embedding qualities (Turney and Pantel, 2010; Bullinaria and Levy, 2012). This SVD approach can be viewed as a generalization of Latent Semantic Analysis (Landauer and Dumais, 1997), where the term-document matrix is replaced with M^PPMI. Compared to PPMI, SVD representations can be more robust, as the dimensionality reduction acts as a form of regularization.

2.1.3 Skip-gram with negative sampling

SGNS 'neural' embeddings are optimized to predict co-occurrence relationships using an approximate objective known as 'skip-gram with negative sampling' (Mikolov et al., 2013). In SGNS, each word w_i is represented by two dense, low-dimensional vectors: a word vector (w_i^SGNS) and context vector (c_i^SGNS). These embeddings are optimized via stochastic gradient descent so that

    p̂(c_i | w_i) ∝ exp(w_i^SGNS · c_i^SGNS),   (3)

where p̂(c_i | w_i) is the empirical probability of seeing context word c_i within a fixed-length window of text, given that this window contains w_i. The SGNS optimization avoids computing the normalizing constant in (3) by randomly drawing 'negative' context words, c_n, for each target word and ensuring that exp(w_i^SGNS · c_n^SGNS) is small for these examples.

SGNS has the benefit of allowing incremental initialization during learning, where the embeddings for time t are initialized with the embeddings from time t − ∆ (Kim et al., 2014). We employ this trick here, though we found that it had a negligible impact on our results.

2.2 Datasets, pre-processing, and hyperparameters

We trained models on the 6 datasets described in Table 1, taken from Google N-Grams (Lin et al., 2012) and the COHA corpus (Davies, 2010). The Google N-Gram datasets are extremely large (comprising ≈6% of all books ever published), but they also contain many corpus artifacts due, e.g., to shifting sampling biases over time (Pechenick et al., 2015). In contrast, the COHA corpus was carefully selected to be genre-balanced and representative of American English over the last 200 years, though as a result it is two orders of magnitude smaller. The COHA corpus also contains pre-extracted word lemmas, which we used to validate that our results hold at both the lemma and raw token levels. All the datasets were aggregated to the granularity of decades.[5]

We follow the recommendations of Levy et al. (2015) in setting the hyperparameters for the embedding methods, though preliminary experiments were used to tune key settings. For all methods, we used symmetric context windows of size 4 (on each side). For SGNS and SVD, we use embeddings of size 300. See Appendix A for further implementation and pre-processing details.

[5] The 2000s decade of the Google data was discarded due to shifts in the sampling methodology (Michel et al., 2011).
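The PPMI and SVD constructions of Eqs. (1)-(2) can be sketched in a few lines of NumPy. This is a minimal illustration on a dense matrix with raw (unsmoothed) relative frequencies standing in for the p̂; the paper's vocabulary sizes would require sparse matrices and a truncated solver such as scipy.sparse.linalg.svds, and the function names here are our own:

```python
import numpy as np

def ppmi_matrix(counts, alpha=1.0):
    """Positive PMI (Eq. 1) from a word-by-context co-occurrence count
    matrix; log(alpha) is the negative-prior shift subtracted before
    clipping at zero."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    p_w = counts.sum(axis=1, keepdims=True) / total   # p(w_i)
    p_c = counts.sum(axis=0, keepdims=True) / total   # p(c_j)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((counts / total) / (p_w * p_c))
    pmi[~np.isfinite(pmi)] = -np.inf                  # zero counts clip to 0 below
    return np.maximum(pmi - np.log(alpha), 0.0)

def svd_embeddings(m_ppmi, dim=300, gamma=0.0):
    """Truncated-SVD embeddings (Eq. 2): w_i = (U Sigma^gamma)_i."""
    U, s, _ = np.linalg.svd(m_ppmi, full_matrices=False)
    d = min(dim, len(s))
    return U[:, :d] * (s[:d] ** gamma)
```

For example, `svd_embeddings(ppmi_matrix(counts), dim=300)` chains the two steps with the paper's embedding size.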

2.3 Aligning historical embeddings

In order to compare word vectors from different time-periods we must ensure that the vectors are aligned to the same coordinate axes. Explicit PPMI vectors are naturally aligned, as each column simply corresponds to a context word. Low-dimensional embeddings will not be naturally aligned due to the non-unique nature of the SVD and the stochastic nature of SGNS. In particular, both these methods may result in arbitrary orthogonal transformations, which do not affect pairwise cosine-similarities within-years but will preclude comparison of the same word across time. Previous work circumvented this problem by either avoiding low-dimensional embeddings (e.g., Gulordava and Baroni, 2011; Jatowt and Duh, 2014) or by performing heuristic local alignments per word (Kulkarni et al., 2014).

We use orthogonal Procrustes to align the learned low-dimensional embeddings. Defining W^(t) ∈ R^{d×|V|} as the matrix of word embeddings learned at year t, we align across time-periods while preserving cosine similarities by optimizing:

    R^(t) = arg min_{Q : Q^T Q = I} || Q W^(t) − W^(t+1) ||_F ,   (4)

with R^(t) ∈ R^{d×d}. The solution corresponds to the best rotational alignment and can be obtained efficiently using an application of SVD (Schönemann, 1966).

2.4 Time-series from historical embeddings

Diachronic word embeddings can be used in two ways to quantify semantic change: (i) we can measure changes in pair-wise word similarities over time, or (ii) we can measure how an individual word's embedding shifts over time.

Pair-wise similarity time-series Measuring how the cosine-similarity between pairs of words changes over time allows us to test hypotheses about specific linguistic or cultural shifts in a controlled manner. We quantify shifts by computing the similarity time-series

    s^(t)(w_i, w_j) = cos-sim(w_i^(t), w_j^(t))   (5)

between two words w_i and w_j over a time-period (t, ..., t + ∆). We then measure the Spearman correlation (ρ) of this series against time, which allows us to assess the magnitude and significance of pairwise similarity shifts; since the Spearman correlation is non-parametric, this measure essentially detects whether the similarity series increased/decreased over time in a significant manner, regardless of the 'shape' of this curve.[6]

Measuring semantic displacement After aligning the embeddings for individual time-periods, we can use the aligned word vectors to compute the semantic displacement that a word has undergone during a certain time-period. In particular, we can directly compute the cosine-distance between a word's representation for different time-periods, i.e. cos-dist(w^t, w^{t+∆}), as a measure of semantic change. We can also use this measure to quantify 'rates' of semantic change for different words by looking at the displacement between consecutive time-points.

3 Comparison of different approaches

We compare the different distributional approaches on a set of benchmarks designed to test their scientific utility. We evaluate both their synchronic accuracy (i.e., ability to capture word similarity within individual time-periods) and their diachronic validity (i.e., ability to quantify semantic changes over time).

3.1 Synchronic Accuracy

We evaluated the synchronic (within-time-period) accuracy of the methods using a standard modern benchmark and the 1990s portion of the ENGALL data. On Bruni et al. (2012)'s MEN similarity task of matching human judgments of word similarities, SVD performed best (ρ = 0.739), followed by PPMI (ρ = 0.687) and SGNS (ρ = 0.649). These results echo the findings of Levy et al. (2015), who found SVD to perform best on similarity tasks while SGNS performed best on analogy tasks (which are not the focus of this work).

3.2 Diachronic Validity

We evaluate the diachronic validity of the methods on two historical semantic tasks: detecting known shifts and discovering shifts from data. For both these tasks, we performed detailed evaluations on a small set of examples (28 known shifts and the top-10 "discovered" shifts by each method). Using these reasonably-sized evaluation sets allowed the authors to evaluate each case rigorously using existing literature and historical corpora.

[6] Other metrics or change-point detection approaches, e.g. mean shifts (Kulkarni et al., 2014), could also be used.

Word       Moving towards        Moving away           Shift start  Source
gay        homosexual, lesbian   happy, showy          ca 1920      (Kulkarni et al., 2014)
fatal      illness, lethal       fate, inevitable      <1800        (Jatowt and Duh, 2014)
awful      disgusting, mess      impressive, majestic  <1800        (Simpson et al., 1989)
nice       pleasant, lovely      refined, dainty       ca 1900      (Wijaya and Yeniterzi, 2011)
broadcast  transmit, radio       scatter, seed         ca 1920      (Jeffers and Lehiste, 1979)
monitor    display, screen       —                     ca 1930      (Simpson et al., 1989)
record     tape, album           —                     ca 1920      (Kulkarni et al., 2014)
guy        fellow, man           —                     ca 1850      (Wijaya and Yeniterzi, 2011)
call       phone, message        —                     ca 1890      (Simpson et al., 1989)

Table 2: Set of attested historical shifts used to evaluate the methods. The examples are taken from previous works on semantic change and from the Oxford English Dictionary (OED), e.g. using 'obsolete' tags. The shift start points were estimated using attestation dates in the OED. The first five examples are words that shifted dramatically in meaning while the remaining four are words that acquired new meanings (while potentially also keeping their old ones).
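Testing whether a pair from Table 2 moved together or apart is the Spearman procedure of Section 2.4; a minimal sketch (function name ours), assuming one aligned vector per decade for each word:

```python
import numpy as np
from scipy.stats import spearmanr

def shift_test(vecs_a, vecs_b):
    """Correlate the cosine-similarity series (Eq. 5) against time.
    A significantly positive rho means the pair moved closer over time;
    a significantly negative rho means it drifted apart."""
    sims = [a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
            for a, b in zip(vecs_a, vecs_b)]
    return spearmanr(range(len(sims)), sims)   # (rho, p-value)
```

For the gay-homosexual pair, for instance, the detection criterion is a positive rho with p < 0.05.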

Method  Corpus   %Correct  %Sig
PPMI    ENGALL   96.9      84.4
PPMI    COHA     100.0     88.0
SVD     ENGALL   100.0     90.6
SVD     COHA     100.0     96.0
SGNS    ENGALL   100.0     93.8
SGNS    COHA     100.0     72.0

Table 3: Performance on the detection task, i.e. ability to capture the attested shifts from Table 2. SGNS and SVD capture the correct directionality of the shifts in all cases (%Correct), e.g., gay becomes more similar to homosexual, but there are differences in whether the methods deem the shifts to be statistically significant at the p < 0.05 level (%Sig).

Detecting known shifts. First, we tested whether the methods capture known historical shifts in meaning. The goal in this task is for the methods to correctly capture whether pairs of words moved closer or further apart in semantic space during a pre-determined time-period. We use a set of independently attested shifts as an evaluation set (Table 2). For comparison, we evaluated the methods on both the large (but messy) ENGALL data and the smaller (but clean) COHA data. On this task, all the methods performed almost perfectly in terms of capturing the correct directionality of the shifts (i.e., the pairwise similarity series have the correct sign on their Spearman correlation with time), but there were some differences in whether the methods deemed the shifts statistically significant at the p < 0.05 level.[7] Overall, SGNS performed the best on the full English data, but its performance dropped significantly on the smaller COHA dataset, where SVD performed best. PPMI was noticeably worse than the other two approaches (Table 3).

Discovering shifts from data. We tested whether the methods discover reasonable shifts by examining the top-10 words that changed the most from the 1900s to the 1990s according to the semantic displacement metric introduced in Section 2.4 (limiting our analysis to words with relative frequencies above 10^-5 in both decades). We used the ENGFIC data, as the most-changed list for ENGALL was dominated by scientific terms due to changes in the corpus sample.

Table 4 shows the top-10 words discovered by each method. These shifts were judged by the authors as being either clearly genuine, borderline, or clearly corpus artifacts. SGNS performed by far the best on this task, with 70% of its top-10 list corresponding to genuine semantic shifts, followed by 40% for SVD, and 10% for PPMI. However, a large portion of the discovered words for PPMI (and less so SVD) correspond to borderline cases, e.g. know, that have not necessarily shifted significantly in meaning but that occur in different contexts due to global genre/discourse shifts. The poor quality of the nearest neighbors generated by the PPMI algorithm—which are skewed by PPMI's sensitivity to rare events—also made it difficult to assess the quality of its discovered shifts. SVD was the most sensitive to corpus artifacts (e.g., co-occurrences due to cover pages and advertisements), but it still captured a number of genuine semantic shifts.

We opted for this small evaluation set and relied on detailed expert judgments to minimize ambiguity; each potential shift was analyzed in detail by consulting existing literature (especially the OED; Simpson et al., 1989) and all disagreements were discussed.

Table 5 details representative example shifts in English, French, and German. Chinese lacks sufficient historical data for this task, as only years 1950-1999 are usable; however, we do still see

[7] All subsequent significance tests are at p < 0.05.

Method  Top-10 words that changed from 1900s to 1990s
PPMI    know, got, would, decided, think, stop, remember, started, must, wanted
SVD     harry, headed, calls, gay, wherever, male, actually, special, cover, naturally
SGNS    wanting, gay, check, starting, major, actually, touching, harry, headed, romance

Table 4: Top-10 English words with the highest semantic displacement values between the 1900s and 1990s. Bolded entries correspond to real semantic shifts, as deemed by examining the literature and their nearest neighbors; for example, headed shifted from primarily referring to the “top of a body/entity” to referring to “a direction of travel.” Underlined entries are borderline cases that are largely due to global genre/discourse shifts; for example, male has not changed in meaning, but its usage in discussions of “gender equality” is relatively new. Finally, unmarked entries are clear corpus artifacts; for example, special, cover, and romance are artifacts from the covers of fiction books occasionally including advertisements etc.

Word        Language  Nearest-neighbors in 1900s                       Nearest-neighbors in 1990s
wanting     English   lacking, deficient, lacked, lack, needed         wanted, something, wishing, anything, anybody
asile       French    refuge, asiles, hospice, vieillards, infirmerie  demandeurs, refuge, hospice, visas, admission
widerstand  German    scheiterte, volt, stromstärke, leisten, brechen  opposition, verfolgung, nationalsozialistische, nationalsozialismus, kollaboration

Table 5: Example words that changed dramatically in meaning in three languages, discovered using SGNS embeddings. The examples were selected from the top-10 most-changed lists between 1900s and 1990s as in Table 4. In English, wanting underwent subjectification and shifted from meaning "lacking" to referring to subjective "desire", as in "the education system is wanting" (1900s) vs. "I've been wanting to tell you" (1990s). In French, asile ("asylum") shifted from primarily referring to "hospitals, or infirmaries" to also referring to "asylum seekers, or refugees". Finally, in German, Widerstand ("resistance") gained a formal meaning as referring to the local German resistance to Nazism during World War II.

some significant changes for Chinese in this short time-period, such as 病毒 ("virus") moving closer to 电脑 ("computer", ρ = 0.89).

3.3 Methodological recommendations

PPMI is clearly worse than the other two methods; it performs poorly on all the benchmark tasks, is extremely sensitive to rare events, and is prone to false discoveries from global genre shifts. Between SVD and SGNS the results are somewhat equivocal, as both perform best on two out of the four tasks (synchronic accuracy, ENGALL detection, COHA detection, discovery). Overall, SVD performs best on the synchronic accuracy task and has higher average accuracy on the 'detection' task, while SGNS performs best on the 'discovery' task. These results suggest that both these methods are reasonable choices for studies of semantic change but that they each have their own tradeoffs: SVD is more sensitive, as it performs well on detection tasks even when using a small dataset, but this sensitivity also results in false discoveries due to corpus artifacts. In contrast, SGNS is robust to corpus artifacts in the discovery task, but it is not sensitive enough to perform well on the detection task with a small dataset. Qualitatively, we found SGNS to be most useful for discovering new shifts and visualizing changes (e.g., Figure 1), while SVD was most effective for detecting subtle shifts in usage.

4 Statistical laws of semantic change

We now show how diachronic embeddings can be used in a large-scale cross-linguistic analysis to reveal statistical laws that relate frequency and polysemy to semantic change. In particular, we analyze how a word's rate of semantic change,

    ∆^(t)(w_i) = cos-dist(w_i^(t), w_i^(t+1)),   (6)

depends on its frequency, f^(t)(w_i), and a measure of its polysemy, d^(t)(w_i) (defined in Section 4.4).

4.1 Setup

We present results using SVD embeddings (though analogous results were found to hold with SGNS). Using all four languages and all four conditions for English (ENGALL, ENGFIC, and COHA with and without lemmatization), we performed regression analysis on rates of semantic change, ∆^(t)(w_i); thus, we examined one data-point per word for each pair of consecutive decades and analyzed how a word's frequency and polysemy at time t correlate with its degree of semantic displacement over the next decade. To ensure the robustness of our results, we analyzed only the top-10000 non-stop words by average historical frequency (lower-frequency words tend to lack sufficient co-occurrence data across years) and we discarded proper nouns (changes in proper noun usage are primarily driven by non-linguistic factors, e.g. historical events; Traugott and Dasher, 2001). We also log-transformed the semantic displacement scores and normalized the scores to have zero mean and unit variance; we denote these normalized scores by ∆̃^(t)(w_i).

We performed our analysis using a linear mixed model with random intercepts per word and fixed effects per decade; i.e., we fit β_f, β_d, and β_t s.t.

    ∆̃^(t)(w_i) = β_f log f^(t)(w_i) + β_d log d^(t)(w_i) + β_t + z_{w_i} + ε^(t)_{w_i}   ∀ w_i ∈ V, t ∈ {t_0, ..., t_n},   (7)

where z_{w_i} ~ N(0, σ_{w_i}) is the random intercept for word w_i and ε^(t)_{w_i} ~ N(0, σ) is an error term. β_f, β_d and β_t correspond to the fixed effects for frequency, polysemy and the decade t, respectively.[8] Intuitively, this model estimates the effects of frequency and polysemy on semantic change, while controlling for temporal trends and correcting for the fact that measurements on the same word will be correlated across time. We fit (7) using the standard restricted maximum likelihood algorithm (McCulloch and Neuhaus, 2001; Appendix C).

[8] Note that time is treated as a categorical variable, as each decade has its own fixed effect.

Top-10 most polysemous   yet, always, even, little, called, also, sometimes, great, still, quite
Top-10 least polysemous  photocopying, retrieval, thirties, mom, sweater, forties, seventeenth, fifteenth, holster, postage

Table 6: The top-10 most and least polysemous words in the ENGFIC data. Words like yet, even, and still are used in many diverse ways and are highly polysemous. In contrast, words like photocopying, postage, and holster tend to be used in very specific well-clustered contexts, corresponding to a single sense; for example, mail and letter are both very likely to occur in the context of postage and are also likely to co-occur with each other, independent of postage.

Figure 2: Higher frequency words have lower rates of change (a), while polysemous words have higher rates of change (b). The negative curvature for polysemy—which is significant only at high d(w_i)—varies across datasets and was not present with SGNS, so it is not as robust as the clear linear trend that was seen with all methods and across all datasets. The trendlines show 95% CIs from bootstrapped kernel regressions on the ENGALL data (Li and Racine, 2007).

4.2 Overview of results

We find that, across languages, rates of semantic change obey a scaling relation of the form

    ∆(w_i) ∝ f(w_i)^{β_f} × d(w_i)^{β_d},   (8)

with β_f < 0 and β_d > 0. This finding implies that frequent words change at slower rates while polysemous words change faster, and that both these relations scale as power laws.

4.3 Law of conformity: Frequently used words change at slower rates

Using the model in equation (7), we found that the logarithm of a word's frequency, log(f(w_i)), has a significant and substantial negative effect on rates of semantic change in all settings (Figures 2a and 3a). Given the use of log-transforms in pre-processing the data, this implies rates of semantic change are proportional to a negative power (β_f) of frequency, i.e.

    ∆(w_i) ∝ f(w_i)^{β_f},   (9)

with β_f ∈ [−1.26, −0.27] across languages/datasets. The relatively large range of values for β_f is due to the fact that the COHA datasets are outliers due to their substantially smaller sample sizes (Figure 3; the range is β_f ∈ [−0.66, −0.27] with COHA excluded).
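The regression of Eq. (7) is a standard linear mixed model; the following sketch fits it with statsmodels on synthetic data (the column names and simulated effect sizes are our own illustrations, not the paper's data):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the real data: one row per (word, decade) holding
# log-frequency, log-polysemy, and a normalized displacement score.
rng = np.random.default_rng(0)
rows = []
for w in range(200):
    log_f = rng.normal(-8.0, 1.0)     # hypothetical log relative frequency
    log_d = rng.normal(-1.0, 0.5)     # hypothetical log polysemy score
    z_w = rng.normal(0.0, 0.3)        # per-word random intercept
    for t in range(5):
        disp = -0.4 * log_f + 0.5 * log_d + 0.1 * t + z_w + rng.normal(0.0, 0.3)
        rows.append(dict(word=w, decade=t, log_freq=log_f,
                         log_poly=log_d, disp=disp))
df = pd.DataFrame(rows)

# Eq. (7): fixed effects for log-frequency, log-polysemy, and decade
# (categorical), a random intercept per word, fit by REML.
model = smf.mixedlm("disp ~ log_freq + log_poly + C(decade)", df,
                    groups=df["word"])
result = model.fit(reml=True)
print(result.params[["log_freq", "log_poly"]])
```

With this setup the fitted fixed effects recover the simulated negative frequency effect and positive polysemy effect, mirroring the signs of β_f and β_d reported above.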

Figure 3: a, The estimated linear effect of log-frequency (β̂_f) is significantly negative across all languages. The effect is significantly stronger in the COHA data, but this is likely due to its small sample size (~100× smaller than the other datasets); the small sample size introduces random variance that may artificially inflate the effect of frequency. From the COHA data, we also see that the result holds regardless of whether lemmatization is used. b, Analogous trends hold for the linear effect of the polysemy score (β̂_d), which is strong and significantly positive across all conditions. Again, we see that the smaller COHA datasets are mild outliers.[9] 95% CIs are shown.

4.4 Law of innovation: Polysemous words change at faster rates

There is a common hypothesis in the linguistic literature that "words become semantically extended by being used in diverse contexts" (Winter et al., 2014), an idea that dates back to the writings of Bréal (1897). We tested this notion by examining the relationship between polysemy and semantic change in our data.

Quantifying polysemy

Measuring word polysemy is a difficult and fraught task, as even "ground truth" dictionaries differ in the number of senses they assign to words (Simpson et al., 1989; Fellbaum, 1998). We circumvent this issue by measuring a word's contextual diversity as a proxy for its polysemousness. The intuition behind our measure is that words that occur in many distinct, unrelated contexts will tend to be highly polysemous. This view of polysemy also fits with previous work on semantic change, which emphasizes the role of contextual diversity (Bréal, 1897; Winter et al., 2014).

We measure a word's contextual diversity, and thus polysemy, by examining its neighborhood in an empirical co-occurrence network. We construct empirical co-occurrence networks using the PPMI measure defined in Section 2. In these networks words are connected to each other if they co-occur more than one would expect by chance (after smoothing). The polysemy of a word is then measured as its local clustering coefficient within this network (Watts and Strogatz, 1998):

    d(w_i) = − [ Σ_{c_i, c_j ∈ N_PPMI(w_i)} I{ PPMI(c_i, c_j) > 0 } ] / [ |N_PPMI(w_i)| ( |N_PPMI(w_i)| − 1 ) ],   (10)

where N_PPMI(w_i) = { w_j : PPMI(w_i, w_j) > 0 }. This measure counts the proportion of w_i's neighbors that are also neighbors of each other. According to this measure, a word will have a high clustering coefficient (and thus a low polysemy score) if the words that it co-occurs with also tend to co-occur with each other. Polysemous words that are contextually diverse will have low clustering coefficients, since they appear in disjointed or unrelated contexts.

Variants of this measure are often used in word-sense discrimination and correlate with, e.g., number of senses in WordNet (Dorow and Widdows, 2003; Ferret, 2004). However, we found that it was slightly biased towards rating contextually diverse discourse function words (e.g., also) as highly polysemous, which needs to be taken into account when interpreting our results. We opted to use this measure, despite this bias, because it has the strong benefit of being clearly interpretable: it simply measures the extent to which a word appears in diverse textual contexts. Table 6 gives examples of the least and most polysemous words in the ENGFIC data, according to this score.

As expected, this measure has significant intrinsic positive correlation with frequency. Across datasets, we found Pearson correlations in the range 0.45 < r < 0.8 (all p < 0.05), confirming that frequent words tend to be used in a greater diversity of contexts. As a consequence of this high correlation, we interpret the effect of this measure only after controlling for frequency (this control is naturally captured in equation (7)).

[9] The COHA data is ~100× smaller, which has a global effect on the construction of the co-occurrence network (e.g., lower average degree) used to compute polysemy scores.
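The polysemy score of Eq. (10) is straightforward to compute from a symmetric word-word PPMI matrix; a small sketch (function name ours):

```python
import numpy as np

def polysemy_score(ppmi, i):
    """Negative local clustering coefficient of word i in the PPMI
    co-occurrence network (Eq. 10). Scores near 0 indicate diverse,
    disconnected contexts (high polysemy); scores near -1 indicate one
    tightly clustered context (low polysemy)."""
    ppmi = np.asarray(ppmi)
    neighbors = np.flatnonzero(ppmi[i] > 0)
    neighbors = neighbors[neighbors != i]           # ignore any self-loop
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = ppmi[np.ix_(neighbors, neighbors)] > 0  # neighbor-neighbor edges
    np.fill_diagonal(links, False)
    return -links.sum() / (k * (k - 1))
```

On a complete neighbor clique the score is -1 (a single tight sense); on a star-shaped neighborhood, where the neighbors never co-occur with each other, it is 0 (maximally diverse contexts).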

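Because polysemy and frequency are so strongly correlated, the polysemy effect is interpreted only after controlling for frequency, either inside the mixed model of equation (7) or, for visualization, by regressing frequency out (the method of Graham, 2003). A minimal residualization sketch, our own illustration rather than the paper's analysis code:

```python
import numpy as np

def residualize(y, x):
    """Residuals of a simple linear regression of y on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

def partial_effect(change, polysemy, log_freq):
    """Correlation between change rates and polysemy after the linear
    effect of log-frequency is regressed out of both variables."""
    return np.corrcoef(residualize(change, log_freq),
                       residualize(polysemy, log_freq))[0, 1]
```

The synthetic test below mirrors the situation described in the text: when two predictors share a strong common component, the raw correlation can even have the opposite sign of the partial effect that remains once frequency is controlled for.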
Polysemy and semantic change

After fitting the model in equation (7), we found that the logarithm of the polysemy score exhibits a strong positive effect on rates of semantic change throughout all four languages (Figure 3b). As with frequency, the relation takes the form of a power law

$$\Delta(w_i) \propto d(w_i)^{\beta_d}, \qquad (11)$$

with a language/corpus dependent scaling constant $\beta_d \in [0.37, 0.77]$. Note that this relationship is a complete reversal from what one would expect according to $d(w_i)$'s positive correlation with frequency; i.e., since frequency and polysemy are highly positively correlated, one would expect them to have similar effects on semantic change, but we found that the effect of polysemy completely reversed after controlling for frequency. Figure 2b shows the relationship of polysemy with rates of semantic change in the ENGALL data after regressing out the effect of frequency (using the method of Graham, 2003).

5 Discussion

We show how distributional methods can reveal statistical laws of semantic change and offer a robust methodology for future work in this area. Our work builds upon a wealth of previous research on quantitative approaches to semantic change, including prior work with distributional methods (Sagi et al., 2011; Wijaya and Yeniterzi, 2011; Gulordava and Baroni, 2011; Jatowt and Duh, 2014; Kulkarni et al., 2014; Xu and Kemp, 2015), as well as recent work on detecting the emergence of novel word senses (Lau et al., 2012; Mitra et al., 2014; Cook et al., 2014; Mitra et al., 2015; Frermann and Lapata, 2016). We extend these lines of work by rigorously comparing different approaches to quantifying semantic change and by using these methods to propose new statistical laws of semantic change.

The two statistical laws we propose have strong implications for future work in historical semantics. The law of conformity—frequent words change more slowly—clarifies frequency's role in semantic change. Future studies of semantic change must account for frequency's conforming effect: when examining the interaction between some linguistic process and semantic change, the law of conformity should serve as a null model in which the interaction is driven primarily by underlying frequency effects.

The law of innovation—polysemous words change more quickly—quantifies the central role polysemy plays in semantic change, an issue that has concerned linguists for more than 100 years (Bréal, 1897). Previous works argued that semantic change leads to polysemy (Wilkins, 1993; Hopper and Traugott, 2003). However, our results show that polysemous words change faster, which suggests that polysemy may actually lead to semantic change.

Overall, these two factors—frequency and polysemy—explain between 48% and 88% of the variance^10 in rates of semantic change (across conditions). This remarkable degree of explanatory power indicates that frequency and polysemy are perhaps the two most crucial linguistic factors that explain rates of semantic change over time.

These empirical statistical laws also lend themselves to various causal mechanisms. The law of conformity might be a consequence of learning: perhaps people are more likely to use rare words mistakenly in novel ways, a mechanism formalizable by Bayesian models of word learning and corresponding to the biological notion of genetic drift (Reali and Griffiths, 2010). Or perhaps a sociocultural conformity bias makes people less likely to accept novel innovations of common words, a mechanism analogous to the biological process of purifying selection (Boyd and Richerson, 1988; Pagel et al., 2007). Moreover, such mechanisms may also be partially responsible for the law of innovation. Highly polysemous words tend to have more rare senses (Kilgarriff, 2004), and rare senses may be unstable by the law of conformity. While our results cannot confirm such causal links, they nonetheless highlight a new role for frequency and polysemy and the importance of distributional models in historical research.

Acknowledgments

The authors thank D. Friedman, R. Sosic, C. Manning, V. Prabhakaran, and S. Todd for their helpful comments and discussions. We are also indebted to our anonymous reviewers. W.H. was supported by an NSERC PGS-D grant and the SAP Stanford Graduate Fellowship. W.H., D.J., and J.L. were supported by the Stanford Data Science Initiative, and NSF Awards IIS-1514268, IIS-1149837, and IIS-1159679.

^10 Marginal R^2 (Nakagawa and Schielzeth, 2013).

References

James S. Adelman, Gordon D. A. Brown, and José F. Quesada. 2006. Contextual diversity, not word frequency, determines word-naming and lexical decision times. Psychol. Sci., 17(9):814–823.
Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python. O'Reilly Media, Inc.
Andreas Blank. 1999. Why do new meanings occur? A cognitive typology of the motivations for lexical semantic change. In Peter Koch and Andreas Blank, editors, Historical Semantics and Cognition. Walter de Gruyter, Berlin, Germany.
J.R. Firth. 1957. A Synopsis of Linguistic Theory, 1930–1955. In Studies in Linguistic Analysis. Special volume of the Philological Society. Basil Blackwell, Oxford, UK.
Lea Frermann and Mirella Lapata. 2016. A Bayesian Model of Diachronic Meaning Change. Trans. ACL, 4:31–45.
Dirk Geeraerts. 1997. Diachronic Prototype Semantics: A Contribution to Historical Lexicology. Clarendon Press, Oxford, UK.
Michael H. Graham. 2003. Confronting multicollinearity in ecological multiple regression. Ecology, 84(11):2809–2815.

Robert Boyd and Peter J. Richerson. 1988. Culture and the Evolutionary Process. University of Chicago Press, Chicago, IL.
Elia Bruni, Gemma Boleda, Marco Baroni, and Nam-Khanh Tran. 2012. Distributional semantics in technicolor. In Proc. ACL, pages 136–145.
Michel Bréal. 1897. Essai de Sémantique: Science des significations. Hachette, Paris, France.
John A. Bullinaria and Joseph P. Levy. 2007. Extracting semantic representations from word co-occurrence statistics: A computational study. Behav. Res. Methods, 39(3):510–526.
John A. Bullinaria and Joseph P. Levy. 2012. Extracting semantic representations from word co-occurrence statistics: stop-lists, stemming, and SVD. Behav. Res. Methods, 44(3):890–907.
J.L. Bybee. 2007. Frequency of Use and the Organization of Language. Oxford University Press, New York City, NY.
Paul Cook, Jey Han Lau, Diana McCarthy, and Timothy Baldwin. 2014. Novel Word-sense Identification. In Proc. COLING, pages 1624–1635.
Scott Crossley, Tom Salsbury, and Danielle McNamara. 2010. The development of polysemy and frequency use in English second language speakers. Language Learning, 60(3):573–605.
Mark Davies. 2010. The Corpus of Historical American English: 400 million words, 1810–2009. http://corpus.byu.edu/coha/.
Beate Dorow and Dominic Widdows. 2003. Discovering corpus-specific word senses. In Proc. EACL, pages 79–82.
Christiane Fellbaum. 1998. WordNet. Wiley Online Library.
Olivier Ferret. 2004. Discovering word senses from a network of lexical cooccurrences. In Proc. COLING, page 1326.
Kristina Gulordava and Marco Baroni. 2011. A distributional similarity approach to the detection of semantic change in the Google Books Ngram corpus. In Proc. GEMS 2011 Workshop on Geometrical Models of Natural Language Semantics, pages 67–71. Association for Computational Linguistics.
Zellig S. Harris. 1954. Distributional structure. Word, 10:146–162.
Paul J. Hopper and Elizabeth Closs Traugott. 2003. Grammaticalization. Cambridge University Press, Cambridge, UK.
Adam Jatowt and Kevin Duh. 2014. A framework for analyzing semantic change of words across time. In Proc. ACM/IEEE-CS Conf. on Digital Libraries, pages 229–238. IEEE Press.
R. Jeffers and Ilse Lehiste. 1979. Principles and Methods for Historical Linguistics. MIT Press, Cambridge, MA.
Adam Kilgarriff. 2004. How dominant is the commonest sense of a word? In Text, Speech and Dialogue, pages 103–111. Springer.
Yoon Kim, Yi-I. Chiu, Kentaro Hanaki, Darshan Hegde, and Slav Petrov. 2014. Temporal analysis of language through neural language models. arXiv preprint arXiv:1405.3515.
Vivek Kulkarni, Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. 2014. Statistically significant detection of linguistic change. In Proc. WWW, pages 625–635.
Thomas K. Landauer and Susan T. Dumais. 1997. A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychol. Rev., 104(2):211.
Jey Han Lau, Paul Cook, Diana McCarthy, David Newman, and Timothy Baldwin. 2012. Word sense induction for novel sense detection. In Proc. EACL, pages 591–601.
Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Trans. ACL, 3.

Qi Li and Jeffrey Scott Racine. 2007. Nonparametric Econometrics: Theory and Practice. Princeton University Press, Princeton, NJ.
Erez Lieberman, Jean-Baptiste Michel, Joe Jackson, Tina Tang, and Martin A. Nowak. 2007. Quantifying the evolutionary dynamics of language. Nature, 449(7163):713–716.
Yuri Lin, Jean-Baptiste Michel, Erez Lieberman Aiden, Jon Orwant, Will Brockman, and Slav Petrov. 2012. Syntactic annotations for the Google Books Ngram corpus. In Proc. ACL, System Demonstrations, pages 169–174.
Charles E. McCulloch and John M. Neuhaus. 2001. Generalized Linear Mixed Models. Wiley-Interscience, Hoboken, NJ.
Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, et al. 2011. Quantitative analysis of culture using millions of digitized books. Science, 331(6014):176–182.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.
Sunny Mitra, Ritwik Mitra, Martin Riedl, Chris Biemann, Animesh Mukherjee, and Pawan Goyal. 2014. That's sick dude!: Automatic identification of word sense change across different timescales. In Proc. ACL.
Sunny Mitra, Ritwik Mitra, Suman Kalyan Maity, Martin Riedl, Chris Biemann, Pawan Goyal, and Animesh Mukherjee. 2015. An automatic approach to identify word sense changes in text media across timescales. Natural Language Engineering, 21(05):773–798.
Eyal Sagi, Stefan Kaufmann, and Brady Clark. 2011. Tracing semantic change with latent semantic analysis. In Kathryn Allan and Justyna A. Robinson, editors, Current Methods in Historical Semantics, page 161. De Gruyter Mouton, Berlin, Germany.
Benoît Sagot, Lionel Clément, Éric de La Clergerie, and Pierre Boullier. 2006. The Lefff 2 syntactic lexicon for French: architecture, acquisition, use. In Proc. LREC, pages 1–4.
Gerold Schneider and Martin Volk. 1998. Adding manual constraints and lexical look-up to a Brill-tagger for German. In Proceedings of the ESSLLI-98 Workshop on Recent Advances in Corpus Annotation, Saarbrücken.
Peter H. Schönemann. 1966. A generalized solution of the orthogonal Procrustes problem. Psychometrika, 31(1):1–10.
J.S. Seabold and J. Perktold. 2010. Statsmodels: Econometric and statistical modeling with Python. In Proc. 9th Python in Science Conference.
John Andrew Simpson, Edmund S.C. Weiner, et al. 1989. The Oxford English Dictionary, volume 2. Clarendon Press, Oxford, UK.
Elizabeth Closs Traugott and Richard B. Dasher. 2001. Regularity in Semantic Change. Cambridge University Press, Cambridge, UK.
Peter D. Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. J. Artif. Intell. Res., 37(1):141–188.
S. Ullmann. 1962. Semantics: An Introduction to the Science of Meaning. Barnes & Noble, New York City, NY.
Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(2579–2605):85.

Shinichi Nakagawa and Holger Schielzeth. 2013. A general and simple method for obtaining R^2 from generalized linear mixed-effects models. Methods Ecol. Evol., 4(2):133–142.
Mark Pagel, Quentin D. Atkinson, and Andrew Meade. 2007. Frequency of word-use predicts rates of lexical evolution throughout Indo-European history. Nature, 449(7163):717–720.
Duncan J. Watts and Steven H. Strogatz. 1998. Collective dynamics of 'small-world' networks. Nature, 393(6684):440–442.
Derry Tanti Wijaya and Reyyan Yeniterzi. 2011. Understanding semantic change of words over centuries. In Proc. Workshop on Detecting and Exploiting Cultural Diversity on the Social Web, pages 35–40. ACM.

Eitan Adam Pechenick, Christopher M. Danforth, and Peter Sheridan Dodds. 2015. Characterizing the Google Books corpus: Strong limits to inferences of socio-cultural and linguistic evolution. PLoS ONE, 10(10).
F. Reali and T. L. Griffiths. 2010. Words as alleles: connecting language evolution with Bayesian learners to models of genetic drift. Proc. R. Soc. B, 277(1680):429–436.
David P. Wilkins. 1993. From part to person: Natural tendencies of semantic change and the search for cognates. Cognitive Anthropology Research Group at the Max Planck Institute for Psycholinguistics.
B. Winter, Graham Thompson, and Matthias Urban. 2014. Cognitive Factors Motivating the Evolution of Word Meanings: Evidence from Corpora, Behavioral Data and Encyclopedic Network Structure. In Proc. EVOLANG, pages 353–360.

Yang Xu and Charles Kemp. 2015. A computational evaluation of two laws of semantic change. In Proc. Annual Conf. of the Cognitive Science Society.
Yang Xu, Terry Regier, and Barbara C. Malt. 2015. Historical Semantic Chaining and Efficient Communication: The Case of Container Names. Cognitive Science.
Naiwen Xue, Fei Xia, Fu-Dong Chiou, and Marta Palmer. 2005. The Penn Chinese TreeBank: Phrase structure annotation of a large corpus. Natural Language Engineering, 11(02):207–238.
George Kingsley Zipf. 1945. The meaning-frequency relationship of words. J. Gen. Psychol., 33(2):251–256.
Bahar İlgen and Bahar Karaoğlan. 2007. Investigation of Zipf's 'law-of-meaning' on Turkish corpora. In International Symposium on Computer and Information Sciences, pages 1–6. IEEE.

A Hyperparameter and pre-processing details

For all datasets, words were lowercased and stripped of punctuation. For the Google datasets we built models using the top-100000 words by their average frequency over the entire historical time-periods, and we used the top-50000 for COHA. During model learning we also discarded all words within a year that occurred below a certain threshold (500 for the Google data, 100 for the COHA data).

For all methods, we used the hyperparameters recommended in Levy et al. (2015). For the context word distributions in all methods, we used context distribution smoothing with a smoothing parameter of 0.75. Note that for SGNS this corresponds to smoothing the unigram negative sampling distribution. For both SGNS and PPMI, we set the negative sample prior $\alpha = \log(5)$, while we set this value to $\alpha = 0$ for SVD, as this improved results. When using SGNS on the Google data, we also subsampled, with words being randomly removed with probability $p_r(w_i) = 1 - \sqrt{\frac{10^{-5}}{f(w_i)}}$, as recommended by Levy et al. (2015) and Mikolov et al. (2013). Furthermore, to improve the computational efficiency of SGNS (which works with text streams and not co-occurrence counts), we downsampled the larger years in the Google N-Gram data to have at most $10^9$ tokens. No such subsampling was performed on the COHA data.

For all methods, we defined the context set to simply be the same vocabulary as the target words, as is standard in most word vector applications (Levy et al., 2015). However, we found that the PPMI method benefited substantially from larger contexts (similar results were found in Bullinaria and Levy, 2007), so we did not remove any low-frequency words per year from the context for that method. The other embedding approaches did not appear to benefit from the inclusion of these low-frequency terms, so they were dropped for computational efficiency.

For SGNS, we used the implementation provided in Levy et al. (2015). The implementations for PPMI and SVD are released with the code package associated with this work.

B Visualization algorithm

To visualize semantic change for a word $w_i$ in two dimensions we employed the following procedure, which relies on the t-SNE embedding method (Van der Maaten and Hinton, 2008) as a subroutine:

1. Find the union of the word $w_i$'s k nearest neighbors over all necessary time-points.
2. Compute the t-SNE embedding of these words on the most recent (i.e., the modern) time-point.
3. For each of the previous time-points, hold all embeddings fixed, except for the target word's (i.e., the embedding for $w_i$), and optimize a new t-SNE embedding only for the target word. We found that initializing the embedding for the target word to be the centroid of its $k'$ nearest neighbors in a time-point was highly effective.

Thus, in this procedure the background words are always shown in their "modern" positions, which makes sense given that these are the current meanings of these words. This approximation is necessary, since in reality all words are moving.

C Regression analysis details

In addition to the pre-processing mentioned in the main text, we also normalized the contextual diversity scores $d(w_i)$ within years by subtracting the yearly median. This was necessary because there were substantial changes in the median contextual diversity scores over years due to changes in corpus sample sizes etc. Data points corresponding to words that occurred less than 500 times during a time-period were also discarded, as these points lack sufficient data to robustly estimate change rates (this threshold only came into effect on the COHA data, however). We removed stop words and proper nouns by (i) removing all stop-words from the available lists in Python's NLTK package (Bird et al., 2009) and (ii) restricting our analysis to words with part-of-speech (POS) tags corresponding to four main linguistic categories (common nouns, verbs, adverbs, and adjectives), using the POS sources in Table 1.

When analyzing the effects of frequency and contextual diversity, the model contained fixed effects for these features and for time, along with random effects for word identity. We opted not to control for POS tags in the presented results, as contextual diversity is co-linear with these tags (e.g., adverbs are more contextually diverse than nouns), and the goal was to demonstrate the main effect of contextual diversity across all word types. That said, the effect of contextual diversity remained strong and significantly positive in all datasets even after controlling for POS tags.

To fit the linear mixed models, we used the Python statsmodels package with restricted maximum likelihood estimation (REML) (Seabold and Perktold, 2010). All mentioned significance scores were computed according to Wald's z-tests, though these results agreed with Bonferroni-corrected likelihood ratio tests on the eng-all data. The visualizations in Figure 2 were computed on the eng-all data and correspond to bootstrapped locally-linear kernel regressions with bandwidths selected via the AIC Hurvich criterion (Li and Racine, 2007).
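The within-year median normalization described in Appendix C is simple to implement. A minimal sketch (the function name and parallel-array interface are our assumptions, not the released analysis code):

```python
import numpy as np

def median_normalize(d_scores, years):
    """Subtract the within-year median from each contextual-diversity
    score, so that year-to-year shifts in corpus sample size do not
    masquerade as differences in polysemy."""
    d_scores = np.asarray(d_scores, dtype=float)
    years = np.asarray(years)
    out = np.empty_like(d_scores)
    for y in np.unique(years):
        mask = years == y
        out[mask] = d_scores[mask] - np.median(d_scores[mask])
    return out
```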

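Step 1 of the visualization procedure in Appendix B, collecting the union of a word's k nearest neighbors over all time-points before running t-SNE, can be sketched as follows. The list-of-aligned-matrices interface and the function name are our assumptions, not the paper's released code:

```python
import numpy as np

def knn_union(embeddings_by_time, word_idx, k):
    """Union of a target word's k nearest neighbors (cosine similarity)
    over all time-points. `embeddings_by_time` is a list of (V x d)
    arrays sharing one vocabulary indexing."""
    neighbors = set()
    for emb in embeddings_by_time:
        norms = np.linalg.norm(emb, axis=1)
        sims = emb @ emb[word_idx] / (norms * norms[word_idx])
        sims[word_idx] = -np.inf          # exclude the word itself
        top = np.argsort(-sims)[:k]       # indices of the k most similar words
        neighbors.update(int(i) for i in top)
    return neighbors
```

The resulting index set is what gets passed to t-SNE in step 2; holding all of these background words fixed in later steps is what keeps the visualization anchored to modern word positions.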