Statistically Significant Detection of Linguistic Change

Vivek Kulkarni, Rami Al-Rfou, Bryan Perozzi, Steven Skiena
Stony Brook University, USA

ABSTRACT
We propose a new computational approach for tracking and detecting statistically significant linguistic shifts in the meaning and usage of words. Such linguistic shifts are especially prevalent on the Internet, where the rapid exchange of ideas can quickly change a word's meaning. Our meta-analysis approach constructs property time series of word usage, and then uses statistically sound change point detection algorithms to identify significant linguistic shifts. We consider and analyze three approaches of increasing complexity to generate such linguistic property time series, the culmination of which uses distributional characteristics inferred from word co-occurrences. Using recently proposed deep neural language models, we first train vector representations of words for each time period. Second, we warp the vector spaces into one unified coordinate system. Finally, we construct a distance-based distributional time series for each word to track its linguistic displacement over time.

We demonstrate that our approach is scalable by tracking linguistic change across years of micro-blogging using Twitter, a decade of product reviews using a corpus of movie reviews from Amazon, and a century of written books using the Google Books Ngram Corpus. Our analysis reveals interesting patterns of language usage change commensurate with each medium.

[Figure 1: A 2-dimensional projection of the latent semantic space captured by our algorithm. Notice the semantic trajectory of the word gay transitioning meaning in the space.]

In this paper, we study the problem of detecting such linguistic shifts on a variety of media including micro-blog posts, product reviews, and books. Specifically, we seek to detect the broadening and narrowing of semantic senses of words, as they continually change throughout the lifetime of a medium. We propose the first computational approach for tracking and detecting statistically significant linguistic shifts of words.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

Keywords
Web Mining; Computational Linguistics

1. INTRODUCTION
Natural languages are inherently dynamic, evolving over time to accommodate the needs of their speakers. This effect is especially prevalent on the Internet, where the rapid exchange of ideas can change a word's meaning overnight.

To model the temporal evolution of natural language, we construct a time series per word. We investigate three methods to build our word time series. First, we extract Frequency-based statistics to capture sudden changes in word usage. Second, we construct Syntactic time series by analyzing each word's part-of-speech (POS) tag distribution. Finally, we infer contextual cues from word co-occurrence statistics to construct Distributional time series. To detect and establish the statistical significance of word changes over time, we present a change point detection algorithm which is compatible with all three methods.

Figure 1 illustrates a 2-dimensional projection of the latent semantic space captured by our Distributional method. We clearly observe the sequence of semantic shifts that the word gay has undergone over the last century (1900-2005). Initially, gay was an adjective that meant cheerful or dapper. Observe that for the first 50 years it stayed in the same general region of the semantic space. However, by 1975 it had begun a transition to its current meaning, a shift which accelerated over the years to come.

Copyright is held by the International World Wide Web Conference Committee (IW3C2). IW3C2 reserves the right to provide a hyperlink to the author's site if the Material is used in electronic media. WWW 2015, May 18-22, 2015, Florence, Italy. ACM 978-1-4503-3469-3/15/05. http://dx.doi.org/10.1145/2736277.2741627

The choice of the time series construction method determines the type of information we capture regarding word usage.
The difference between frequency-based approaches and distributional methods is illustrated in Figure 2. Figure 2a shows the frequencies of two words, Sandy (red) and Hurricane (blue), as a percentage of search queries according to Google Trends.1 Observe the sharp spikes in both words' usage in October 2012, which corresponds to a storm called Hurricane Sandy striking the Atlantic Coast of the United States. However, only one of those words (Sandy) actually acquired a new meaning. Note that while the word Hurricane definitely experienced a surge in frequency of usage, it did not undergo any change in meaning. Indeed, using our distributional method (Figure 2b), we observe that only the word Sandy shifted in meaning, whereas Hurricane did not.

1 http://www.google.com/trends/

[Figure 2: Comparison between Google Trends and our method: (a) Frequency method (Google Trends); (b) Distributional method. Observe how Google Trends shows spikes in frequency for both Hurricane (blue) and Sandy (red). Our method, in contrast, models change in usage and detects that only Sandy changed its meaning and not Hurricane.]

Our computational approach is scalable, and we demonstrate this by running our method on three large datasets. Specifically, we investigate linguistic change detection across years of micro-blogging using Twitter, a decade of product reviews using a corpus of movie reviews from Amazon, and a century of written books using the Google Books Ngram Corpus.

Despite the fast pace of change of web content, our method is able to detect the introduction of new products, movies and books. This could help semantically aware web applications to better understand user intentions and requests. Detecting the semantic shift of a word would trigger such applications to apply focused sense disambiguation analysis.

In summary, our contributions are as follows:

• Word Evolution Modeling: We study three different methods for the statistical modeling of word evolution over time. We use measures of frequency, part-of-speech tag distribution, and word co-occurrence to construct time series for each word under investigation. (Section 3)

• Statistical Soundness: We propose (to our knowledge) the first statistically sound method for linguistic shift detection. Our approach uses change point detection in time series to assign significance-of-change scores to each word. (Section 4)

• Cross-Domain Analysis: We apply our method on three different domains: books, tweets and online reviews. Our corpora consist of billions of words and span several time scales. We show several interesting instances of semantic change identified by our method. (Section 6)

The rest of the paper is structured as follows. In Section 2 we define the problem of language shift detection over time. Then, we outline our proposals to construct time series modeling word evolution in Section 3. Next, in Section 4, we describe the method we developed for detecting significant changes in natural language. We describe the datasets we used in Section 5, and then evaluate our system both qualitatively and quantitatively in Section 6. We follow this with a treatment of related work in Section 7, and finally conclude with a discussion of the limitations and possible future work in Section 8.

2. PROBLEM DEFINITION
Our problem is to quantify the linguistic shift in word meaning (semantic or context change) and usage across time. Given a temporal corpus C created over a time span S, we divide the corpus into n snapshots Ct, each of period length P. We build a common vocabulary V by intersecting the word dictionaries that appear in all the snapshots (i.e., we track the same word set across time). This eliminates trivial examples of word usage shift from words which appear or vanish throughout the corpus.

To model word evolution, we construct a time series T(w) for each word w ∈ V. Each point Tt(w) corresponds to statistical information extracted from corpus snapshot Ct that reflects the usage of w at time t. In Section 3, we propose several methods to calculate Tt(w), each varying in the statistical information used to capture w's usage.

Once these time series are constructed, we can quantify the significance of the shift that occurred to the word in its meaning and usage. Sudden increases or decreases in the time series are indicative of shifts in the word usage. Specifically, we pose the following questions:

1. How statistically significant is the shift in usage of a word w across time (in T(w))?

2. Given that a word has shifted, at what point in time did the change happen?

3. TIME SERIES CONSTRUCTION
Constructing the time series is the first step in quantifying the significance of word change. Different approaches capture various aspects of a word's semantic, syntactic and usage patterns. In this section, we describe three approaches (Frequency, Syntactic, and Distributional) to building a time series that capture different aspects of word evolution across time. The choice of time series significantly influences the types of changes we can detect, a phenomenon which we discuss further in Section 6.

3.1 Frequency Method
The most immediate way to detect sequences of discrete events is through their change in frequency. Frequency-based methods are therefore quite popular, and include tools like Google Trends and the Google Books Ngram Corpus, both of

which are used in research to predict economic and public health changes [7, 9]. Such analysis depends on keyword search over indexed corpora.

Frequency-based methods can capture linguistic shift, as changes in frequency can correspond to words acquiring or losing senses. Although crude, this method is simple to implement. We track the change in probability of a word appearing over time. For each time snapshot corpus Ct, we calculate a unigram language model. Specifically, we construct the time series for a word w as follows:

    Tt(w) = log( #(w ∈ Ct) / |Ct| ),    (1)

where #(w ∈ Ct) is the number of occurrences of the word w in corpus snapshot Ct. An example of the information we capture by tracking word frequencies over time is shown in Figure 3. Observe the sudden jump in the frequency of the word gay in the late 1980s.

[Figure 3: Frequency usage of the word gay over time; observe the sudden change in frequency in the late 1980s.]

3.2 Syntactic Method
While word-frequency-based metrics are easy to calculate, they are prone to sampling error introduced by bias in the domain and genre distribution of the corpus. Temporal events and the popularity of specific entities could spike a word's usage frequency without a significant shift in its meaning; recall Hurricane in Figure 2a.

Another approach to detect and quantify significant change in word usage involves tracking the syntactic functionality a word serves. A word could evolve a new syntactic functionality by acquiring a new part-of-speech category. For example, apple used to be only a "Noun" describing a fruit, but over time it acquired the new part of speech "Proper Noun" to indicate the new sense describing a technology company (Figure 4). To leverage this syntactic knowledge, we annotate our corpus with part-of-speech (POS) tags. Then we calculate the probability distribution Qt of part-of-speech tags given the word w and time snapshot t as follows: Qt = Pr_{X ∼ POS Tags}(X | w, Ct). We consider the POS tag distribution at t = 0 to be the initial distribution Q0. To quantify the temporal change between two time snapshots, for a specific word w, we calculate the divergence between the POS distributions in both snapshots. We construct the time series as follows:

    Tt(w) = JSD(Q0, Qt),    (2)

where JSD is the Jensen-Shannon divergence [21].

Figure 4 shows that the JS divergence (dark blue line) reflects the change in the distribution of the part-of-speech tags given the word apple. In the 1980s, the "Proper Noun" tag (blue area) increased dramatically due to the rise of Apple Computer Inc., the popular consumer electronics company.

[Figure 4: Part-of-speech tag distribution of the word apple (stacked area chart). Observe that the "Proper Noun" tag increased dramatically in the 1980s. The same trend is clear from the time series constructed using the Jensen-Shannon divergence (dark blue line).]

3.3 Distributional Method
Semantic shifts are not restricted to changes in part of speech. For example, consider the word mouse. In the 1970s it acquired the new sense of "computer input device", but did not change its part-of-speech categorization (since both senses are nouns). To detect such subtle semantic changes, we need to infer deeper cues from the contexts a word is used in.

The distributional hypothesis states that words appearing in similar contexts are semantically similar [13]. Distributional methods learn a semantic space that maps words to a continuous vector space R^d, where d is the dimension of the vector space. Thus, vector representations of words appearing in similar contexts will be close to each other. Recent developments in representation learning (deep learning) [5] have enabled the scalable learning of such models. We use a variation of these models [28] to learn word vector representations (word embeddings) that we track across time.

Specifically, we seek to learn a temporal word embedding φt : V, Ct → R^d. Once we learn a representation of a specific word for each time snapshot corpus, we track the changes of the representation across the embedding space to quantify the meaning shift of the word (as shown in Figure 1).

In this section we present our distributional approach in detail. Specifically, we discuss the learning of word embeddings, the aligning of embedding spaces across different time snapshots to a joint embedding space, and the utilization of a word's displacement through this semantic space to construct a distributional time series.

3.3.1 Learning Embeddings
Given a time snapshot Ct of the corpus, our goal is to learn φt over V using neural language models. At the beginning of the training process, the word vector representations are randomly initialized. The training objective is to maximize the probability of the words appearing in the context of the word wi. Specifically, given the vector representation wi of a word (wi = φt(wi)), we seek to maximize the probability of wj through the following equation:

    Pr(wj | wi) = exp(wj^T wi) / Σ_{wk ∈ V} exp(wk^T wi),    (3)

In a single epoch, we iterate over each word occurrence in the time snapshot Ct to minimize the negative log-likelihood J of the context words. Context words are the words appearing to the left or right of wi within a window of size m. Thus J can be written as:

    J = − Σ_{wi ∈ Ct} Σ_{j = i−m, j ≠ i}^{i+m} log Pr(wj | wi)    (4)

Notice that the normalization factor that appears in Eq. (3) is not feasible to calculate if |V| is too large. To approximate this probability, we map the problem from a classification of 1-out-of-V words to a hierarchical classification problem [30, 31]. This reduces the cost of calculating the normalization factor from O(|V|) to O(log |V|). We optimize the model parameters using stochastic gradient descent [6], as follows:

    φt(wi) = φt(wi) − α × ∂J/∂φt(wi),    (5)

where α is the learning rate. We calculate the derivatives of the model using the back-propagation algorithm [34]. We use the following measure of training convergence:

    ρ = (1/|V|) Σ_{w ∈ V} φk(w)^T φk+1(w) / (‖φk(w)‖2 ‖φk+1(w)‖2),    (6)

where φk denotes the model parameters after epoch k. We calculate ρ after each epoch and stop the training once ρ has converged (i.e., 1 − ρ ≤ 10^−4). After training stops, we normalize the word embeddings by their L2 norm, which forces all words to be represented by unit vectors.

In our experiments, we use the gensim implementation of skipgram models.2 We set the context window size m to 10 unless otherwise stated. We choose the size of the word embedding space dimension d to be 200. To speed up the training, we subsample the frequent words by the ratio 10^−5 [27].

2 https://github.com/piskvorky/gensim

3.3.2 Aligning Embeddings
Having trained temporal word embeddings for each time snapshot Ct, we must now align the embeddings so that all of them are in one unified coordinate system. This enables us to characterize the change between them. This process is complicated by the stochastic nature of our training, which implies that models trained on exactly the same data could produce vector spaces where words have the same nearest neighbors but not the same coordinates. The alignment problem is exacerbated by actual changes in the distributional nature of words in each snapshot.

To aid the alignment process, we make two simplifying assumptions. First, we assume that the spaces are equivalent under a linear transformation. Second, we assume that the meaning of most words did not shift over time, and therefore their local structure is preserved. Based on these assumptions, observe that when the alignment model fails to align a word properly, it is possibly indicative of a linguistic shift.

Specifically, we define the set of k nearest words in the embedding space φt to a word w to be k-NN(φt(w)). We seek to learn a linear transformation W_{t0↦t}(w) ∈ R^{d×d} that maps a word from φt0 to φt by solving the following optimization:

    W_{t0↦t}(w) = argmin_W Σ_{wi ∈ k-NN(φt0(w))} ‖φt0(wi) W − φt(wi)‖2^2,    (7)

which is equivalent to a piecewise linear regression model.

3.3.3 Time Series Construction
To track the shift of word position across time, we align all embedding spaces to the embedding space of the final time snapshot φn using the linear mapping (Eq. 7). This unification of coordinate systems allows us to compare relative displacements that occurred to words across different time periods.

To capture linguistic shift, we construct our distributional time series by calculating the distance in the embedding space between φt(w)W_{t↦n}(w) and φ0(w)W_{0↦n}(w) as:

    Tt(w) = 1 − (φt(w)W_{t↦n}(w))^T (φ0(w)W_{0↦n}(w)) / (‖φt(w)W_{t↦n}(w)‖2 ‖φ0(w)W_{0↦n}(w)‖2)    (8)

Figure 5 shows the time series obtained using word embeddings for tape, which underwent a semantic change in the 1950s with the introduction of magnetic tape recorders. As such recorders grew in popularity, the change becomes more pronounced, until it is quite apparent by the 1970s.

[Figure 5: Distributional time series for the word tape over time using word embeddings. Observe the change of behavior starting in the 1950s, which is quite apparent by the 1970s.]

4. CHANGE POINT DETECTION
Given a time series of a word T(w), constructed using one of the methods discussed in Section 3, we seek to determine whether the word changed significantly, and if so, estimate the change point. We believe a formulation in terms of change point detection is appropriate because even if a word might change its meaning (usage) gradually over time, we expect a time period where the new usage suddenly dominates (tips over) the previous usage (akin to a phase transition), with the word gay serving as an excellent example.

There exists an extensive body of work on change point detection in time series [1, 3, 38]. Our approach models the time series based on the Mean Shift model described in [38]. First, our method recognizes that language exhibits a general stochastic drift. We account for this by first normalizing the time series for each word.
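To make the time series constructions of Section 3 concrete, here is a small numpy sketch written for this exposition (it is not the authors' released code, and all function names and toy inputs are ours): `frequency_series` implements Eq. (1), `js_divergence` the Jensen-Shannon divergence of Eq. (2), `align` fits the linear map of Eq. (7) by ordinary least squares over a word's (assumed stable) nearest neighbors, and `cosine_distance` computes the displacement of Eq. (8) between two aligned vectors.

```python
import numpy as np

def frequency_series(counts, totals):
    """Eq. (1): log relative frequency of a word, one value per snapshot."""
    return [np.log(c / n) for c, n in zip(counts, totals)]

def js_divergence(p, q):
    """Eq. (2): Jensen-Shannon divergence (base 2) between two POS tag distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def align(phi_src, phi_dst, neighbor_ids):
    """Eq. (7): least-squares linear map W taking vectors from the source
    embedding space to the target space, fit on a word's k nearest neighbors."""
    A = phi_src[neighbor_ids]  # (k, d) neighbor vectors in the source space
    B = phi_dst[neighbor_ids]  # (k, d) the same words in the target space
    W, *_ = np.linalg.lstsq(A, B, rcond=None)
    return W                   # (d, d) transformation

def cosine_distance(u, v):
    """Eq. (8): 1 - cosine similarity between two aligned word vectors."""
    return 1.0 - float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
```

Chaining the last two pieces, one point Tt(w) of the distributional series would be `cosine_distance(phi_t[w] @ W_t_to_n, phi_0[w] @ W_0_to_n)`, with both maps fit by `align` against the final snapshot.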

[Figure 6: Our change point detection algorithm. In Step 1, we normalize the given time series T(w) to produce Z(w). Next, we shuffle the time series points, producing the set π(Z(w)) (Step 2). Then, we apply the mean shift transformation K to both the original normalized time series Z(w) and the permuted set (Step 3). In Step 4, we calculate the probability distribution of the possible mean shifts at a specific time (t = 1985) over the bootstrapped samples. Finally, we compare the observed value in K(Z(w)) to this distribution of possible values to calculate the p-value, which determines the statistical significance of the observed time series shift (Step 5).]
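The procedure of Figure 6 can be sketched as follows. This is an illustrative re-implementation written for this exposition, not the authors' code: `mean_shift` computes the pivot-wise mean shift of Eq. (10), and `change_point` follows Algorithm 1, bootstrapping p-values by permuting a series that is assumed to be already normalized to Z-scores and restricting candidate change points to times where the Z-score exceeds the threshold. Function and variable names are ours; the defaults B = 1000 and γ = 1.75 are the paper's.

```python
import numpy as np

def mean_shift(z):
    """Eq. (10): for each pivot j, mean of the series after j minus the mean up to j."""
    z = np.asarray(z, float)
    n = len(z)
    return np.array([z[j + 1:].mean() - z[:j + 1].mean() for j in range(n - 1)])

def change_point(z, B=1000, gamma=1.75, seed=0):
    """Algorithm 1 sketch: bootstrap p-values for the observed mean shifts,
    then pick the candidate pivot (Z-score >= gamma) with the smallest p-value."""
    rng = np.random.default_rng(seed)
    z = np.asarray(z, float)
    k_obs = mean_shift(z)
    # Null distribution: mean shift series of permuted (exchangeable) copies.
    k_null = np.array([mean_shift(rng.permutation(z)) for _ in range(B)])
    # Fraction of permutations whose shift at each pivot beats the observed one.
    pvals = (k_null > k_obs).mean(axis=0)
    candidates = [j for j in range(len(k_obs)) if z[j] >= gamma]
    if not candidates:
        return None, 1.0
    ecp = min(candidates, key=lambda j: pvals[j])
    return ecp, pvals[ecp]
```

On a toy series with a single abrupt jump, the estimated change point lands near the jump with a very small p-value, while a flat series yields no candidate pivots at all.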

Algorithm 1 Change Point Detection (T(w), B, γ)
Input: T(w): time series for the word w; B: number of bootstrap samples; γ: Z-score threshold
Output: ECP: estimated change point; p-value: significance score
  // Preprocessing
  1: Z(w) ← Normalize T(w)
  2: Compute mean shift series K(Z(w))
  // Bootstrapping
  3: BS ← ∅ {Bootstrapped samples}
  4: repeat
  5:   Draw P from π(Z(w))
  6:   BS ← BS ∪ P
  7: until |BS| = B
  8: for i ← 1, n do
  9:   p-value(w, i) ← (1/B) Σ_{P ∈ BS} 1[Ki(P) > Ki(Z(w))]
 10: end for
  // Change Point Detection
 11: C ← {j | j ∈ [1, n] and Zj(w) ≥ γ}
 12: p-value ← min_{j ∈ C} p-value(w, j)
 13: ECP ← argmin_{j ∈ C} p-value(w, j)
 14: return p-value, ECP

Our method then attempts to detect a shift in the mean of the time series using a variant of mean shift algorithms for change point analysis. We outline our method in Algorithm 1 and describe it below. We also illustrate key aspects of the method in Figure 6.

Given a time series of a word T(w), we first normalize the time series. We calculate the mean µi = (1/|V|) Σ_{w ∈ V} Ti(w) and the variance Vari = (1/|V|) Σ_{w ∈ V} (Ti(w) − µi)^2 across all words. Then, we transform T(w) into a Z-score series using:

    Zi(w) = (Ti(w) − µi) / sqrt(Vari),    (9)

where Zi(w) is the Z-score of the time series for the word w at time snapshot i.

We model the time series Z(w) by a mean shift model [38]. Let S = Z1(w), Z2(w), ..., Zn(w) represent the time series. We model S as the output of a stochastic process in which each Si can be described as Si = µi + εi, where µi is the mean and εi is the random error at time i. We also assume that the errors εi are independent with mean 0. Generally µi = µi−1, except for a few points which are change points.

Based on the above model, we define the mean shift of a general time series S, pivoted at time point j, as follows:

    Kj(S) = (1/(n − j)) Σ_{k=j+1}^{n} Sk − (1/j) Σ_{k=1}^{j} Sk    (10)

This corresponds to calculating the shift in mean between the two parts of the time series pivoted at time point j. Change points can thus be identified by detecting significant shifts in the mean.3

Given a normalized time series Z(w), we then compute the mean shift series K(Z(w)) (Line 2). To estimate the statistical significance of observing a mean shift at time point j, we use bootstrapping [12] (see Figure 6 and Lines 3-10) under the null hypothesis that there is no change in the mean. In particular, we establish statistical significance by first obtaining B (typically B = 1000) bootstrap samples, each obtained by permuting Z(w) (Lines 3-7). Second, for each bootstrap sample P, we calculate K(P) to yield its corresponding bootstrap statistic, and we estimate the statistical significance (p-value) of observing the mean shift at time i compared to the null distribution (Lines 8-10). Finally, we estimate the change point by considering the time point j with the minimum p-value score (described in [38]).

While this method does detect significant changes in the mean of the time series, observe that it does not account for the magnitude of the change in terms of Z-scores. We extend this approach to obtain words that changed significantly compared to other words by considering only those time points where the Z-score exceeds a user-defined threshold γ (we typically set γ to 1.75). We then estimate the change point as the time point with the minimum p-value, exactly as outlined before (Lines 11-14).

3 This is similar to the CUSUM-based approach used for detecting change points, which is also based on a mean shift model.

5. DATASETS
Here we report the details of the three datasets that we consider: years of micro-blogging from Twitter, a decade of movie reviews from Amazon, and a century of written books from the Google Books Ngram Corpus. Table 1 shows a summary of the three datasets, spanning different modes of expression on the Internet: books, an online forum and a micro-blog.

Table 1: Summary of our datasets

                Google Ngrams   Amazon          Twitter
  Span (years)  105             12              2
  Period        5 years         1 year          1 month
  # words       ~10^9           ~9.9 x 10^8     ~10^9
  |V|           ~50K            ~50K            ~100K
  # documents   ~7.5 x 10^8     8 x 10^6        ~10^8
  Domain        Books           Movie Reviews   Micro-blogging

The Google Books Ngram Corpus. The Google Books Ngram Corpus project enables the analysis of cultural, social and linguistic trends. It contains the frequency of short phrases of text (ngrams) extracted from books written in eight languages over five centuries [25]. These ngrams vary in size from 1-grams to 5-grams. We use the 5-gram phrases, which restrict our context window size m to 5. The 5-grams include phrases like 'thousand pounds less then nothing' and 'to communicate to each other'. We focus on the time span 1900-2005, and set the time snapshot period to 5 years (21 points). We obtain the POS distribution of each word in the above time range using the Google Syntactic Ngrams dataset [14, 22, 23].

Amazon Movie Reviews. The Amazon Movie Reviews dataset consists of movie reviews from Amazon. The data spans August 1997 to October 2012 (13 time points), including all 8 million reviews. However, we consider the time period starting from 2000, as the number of reviews from earlier years is considerably small. Each review includes product and user information, ratings, and a plain-text review. The reviews describe a user's opinion of a movie, for example: 'This movie has it all. Drama, action, amazing battle scenes - the best I've ever seen. It's definitely a must see.'

Twitter Data. This dataset consists of a sample that spans 24 months, from September 2011 to October 2013. Each tweet includes the tweet ID, the tweet text, and the geo-location if available. A tweet is a status message with up to 140 characters: 'I hope sandy doesn't rip the roof off the pool while we're swimming ...'

6. EXPERIMENTS
In this section, we apply our methods to each dataset presented in Section 5 and identify words that have changed usage over time. We describe the results of our experiments below. The code used for running these experiments is available at the first author's website.4

4 http://vivekkulkarni.net

6.1 Time Series Analysis
As we shall see in Section 6.4.1, our proposed time series construction methods differ in performance. Here, we use the detected words to study the behavior of our construction methods.

Table 2 shows the time series constructed for a sample of words, with the corresponding p-value time series displayed in the last column. A dip in the p-value is indicative of a shift in the word usage. The first three words, transmitted, bitch, and sex, are detected by both the Frequency and Distributional methods. Table 3 shows the previous and current senses of these words, demonstrating the changes in usage they have gone through.

Observe that words like her and desk did not change significantly in meaning; however, the Frequency method detects a change. The sharp increase in the frequency of the word her around the 1960s could be attributed to the concurrent rise and popularity of the feminist movement. Sudden temporary popularity of specific social and political events could lead the Frequency method to produce many false positives. These results confirm the intuition we illustrated in Figure 2. While frequency analysis (like Google Trends) is an extremely useful tool to visualize trends, it is not well suited for the task of detecting linguistic shift.

The last two rows in Table 2 display two words (apple and diet) that the Syntactic method detected. The word apple was detected uniquely by the Syntactic method, as its most frequent part-of-speech tag changed significantly from "Noun" to "Proper Noun". While both the Syntactic and Distributional methods indicate the change in meaning of the word diet, only the Distributional method detects the right point of change (as shown in Table 3). The Syntactic method appears to have a low false positive rate but suffers from a high false negative rate, given that only two words in the table were detected. Furthermore, observe that the Syntactic method relies on good linguistic taggers; such taggers require annotated data sets and do not work well across domains.

We find that the Distributional method offers a good balance between false positives and false negatives, while requiring no linguistic resources of any sort. Having analyzed the words detected by the different time series, we turn our attention to the analysis of the estimated changepoints.

6.2 Historical Analysis
We have demonstrated that our methods are able to detect words that shifted in meaning. We now seek to identify the inflection points in time where the new senses were introduced. Moreover, we are interested in understanding how the newly acquired senses differ from the previous ones.

Table 3 shows sample words detected by the Syntactic and Distributional methods. The first set represents words which the Distributional method detected (Distributional better), while the second set shows sample words which the Syntactic method detected (Syntactic better).

Our Distributional method estimates that the word tape changed in the early 1970s to mean a "cassette tape" and not only an "adhesive tape". The change in the meaning of tape commences with the introduction of magnetic tapes in the 1950s

[Table 2: Comparison of our different methods of constructing linguistic shift time series on the Google Books Ngram Corpus, for the words transmitted, bitch, sex, her, desk, apple, and diet. The first three columns represent the Frequency, Syntactic, and Distributional time series for each word. The last column shows the p-value for each time step of each method, as generated by our change point detection algorithm.]

The meaning continues to shift with the mass production of cassettes in Europe and North America for the pre-recorded music industry in the mid 1960s, until the change is deemed statistically significant (Figure 5).

The word plastic is yet another example, where the introduction of new products effected a shift in the word’s meaning. The introduction of Polystyrene in 1950 popularized the term “plastic” as a synthetic polymer, where it was once used only to denote the physical property of “flexibility”. The popularity of books on dieting started with the best-selling book Dr. Atkins’ Diet Revolution by Robert C. Atkins in 1972 [16]. This changed the use of the word diet to mean a lifestyle of food consumption behavior and not only the food consumed by an individual or group.

The Syntactic section of Table 3 shows that words like hug and sink were previously used mainly as verbs. Over time, organizations and movements started using hug as a noun, which came to dominate its previous sense. On the other hand, the words click and handle, originally nouns, started being used as verbs.

Another clear trend is the use of common words as proper nouns. For example, with the rise of the computer industry, the word apple acquired the sense of the tech company Apple in the mid 1980s, and the word windows shifted its meaning to the operating system developed by Microsoft in the early 1990s. Additionally, we detect that the word bush became widely used as a proper noun in 1989, which coincides with George H. W. Bush’s presidency in the USA.

6.3 Cross Domain Analysis
Semantic shift can occur much faster on the web, where words can acquire new meanings within weeks, or even days. In this section we turn our attention to analyzing linguistic shift on Amazon Reviews and Twitter, content that spans a much shorter time scale than the Google Books Ngram Corpus.

Word | ECP | p-value | Past ngram | Present ngram
Distributional better:
recording | 1990 | 0.0263 | to be ashamed of recording that | recording, photocopying
gay | 1985 | 0.0001 | happy and gay | gay and lesbians
tape | 1970 | <0.0001 | red tape, tape from her mouth | a copy of the tape
checking | 1970 | 0.0002 | then checking himself | checking him out
diet | 1970 | 0.0104 | diet of bread and butter | go on a diet
sex | 1965 | 0.0002 | and of the fair sex | have sex with
bitch | 1955 | 0.0001 | nicest black bitch (female dog) | bitch (slang)
plastic | 1950 | 0.0005 | of plastic possibilities | put in a plastic
transmitted | 1950 | 0.0002 | had been transmitted to him, transmitted from age to age | transmitted in electronic form
peck | 1935 | 0.0004 | brewed a peck | a peck on the cheek
honey | 1930 | 0.01 | land of milk and honey | Oh honey!

Word | ECP | p-value | Past POS | Present POS
Syntactic better:
hug | 2002 | <0.001 | Verb (hug a child) | Noun (a free hug)
windows | 1992 | <0.001 | Noun (doors and windows of a house) | Proper Noun (Microsoft Windows)
bush | 1989 | <0.001 | Noun (bush and a shrub) | Proper Noun (George Bush)
apple | 1984 | <0.001 | Noun (apple, orange, grapes) | Proper Noun (Apple computer)
sink | 1972 | <0.001 | Verb (sink a ship) | Noun (a kitchen sink)
click | 1952 | <0.001 | Noun (click of a latch) | Verb (click a picture)
handle | 1951 | <0.001 | Noun (handle of a door) | Verb (he can handle it)

Table 3: Estimated change point (ECP) as detected by our approach for a sample of words on the Google Books Ngram Corpus. The Distributional method is better on some words, which the Syntactic method did not detect as statistically significant (e.g., sex, transmitted, bitch, tape, peck), while the Syntactic method is better on others, which the Distributional method failed to detect as statistically significant (e.g., apple, windows, bush).

Word | p-value | ECP | Past Usage | Present Usage
Amazon Reviews:
instant | 0.016 | 2010 | instant hit, instant dislike | instant download
twilight | 0.022 | 2009 | twilight as in dusk | Twilight (the movie)
rays | 0.001 | 2008 | x-rays | blu-rays
streaming | 0.002 | 2008 | sunlight streaming | streaming video
ray | 0.002 | 2006 | ray of sunshine | Blu-ray
delivery | 0.002 | 2006 | delivery of dialogue | timely delivery of products
combo | 0.002 | 2006 | combo of plots | combo DVD pack
Twitter:
candy | <0.001 | Apr 2013 | candy sweets | Candy Crush (the game)
rally | <0.001 | Mar 2013 | political rally | rally of soldiers (Immortalis game)
snap | <0.001 | Dec 2012 | snap a picture | snap chat
mystery | <0.001 | Dec 2012 | mystery books | Mystery Manor (the game)
stats | <0.001 | Nov 2012 | sport statistics | follower statistics
sandy | 0.03 | Sep 2012 | sandy beaches | Hurricane Sandy
shades | <0.001 | Jun 2012 | color shade, shaded glasses | 50 Shades of Grey (the book)

Table 4: Sample of words detected by our Distributional method on Amazon Reviews and Twitter.

Table 4 shows results from our Distributional method on the Amazon Reviews and Twitter datasets. New technologies and products introduced new meanings to words like streaming, ray, rays, and combo. The word twilight acquired a new sense in 2009, concurrent with the release of the Twilight movie in November 2008.

Similar trends can be observed on Twitter. The introduction of new games and cellphone applications changed the meaning of the words candy, mystery, and rally. The word sandy acquired a new sense in September 2012, weeks before Hurricane Sandy hit the east coast of the USA. Similarly, we see that the word shades shifted its meaning with the release of the bestselling book “Fifty Shades of Grey” in June 2012.

These examples illustrate the capability of our method to detect the introduction of new products, movies, and books. This could help semantically aware web applications to understand user intentions and requests better. Detecting the semantic shift of a word would trigger such applications to apply a focused disambiguation analysis on the sense intended by the user.

6.4 Quantitative Evaluation
The lack of any reference (gold standard) data poses a challenge to quantitatively evaluating our methods. Therefore, we assess the performance of our methods using multiple approaches. We begin with a synthetic evaluation, where we have knowledge of the ground-truth changes. Next, we create a reference data set based on prior work and evaluate all three methods using it. We follow this with a human evaluation, and conclude with an examination of the agreement between the methods.

[Figure 7 omitted: Performance of our proposed methods under different scenarios of perturbation. (a) Frequency Perturbation; (b) Syntactic Perturbation; MRR vs. p_replacement for each method.]

[Figure 8 omitted: Method performance and agreement on changed words in the Google Books Ngram Corpus. (a) Reference Dataset (Precision@k vs. k); (b) Methods Agreement (AG(k) vs. k).]

6.4.1 Synthetic Evaluation
To evaluate the quantitative merits of our approach, we use a synthetic setup which enables us to model linguistic shift in a controlled fashion by artificially introducing changes to a corpus.

Our synthetic corpus is created as follows: First, we duplicate a copy of a Wikipedia corpus5 20 times to model time snapshots. We tagged the Wikipedia corpora with part-of-speech tags using the TextBlob tagger6. Next, we introduce changes to a word’s usage to model linguistic shift. To do this, we perturb the last 10 snapshots. Finally, we use our approach to rank all words according to their p-values, and then we calculate the Mean Reciprocal Rank, MRR = (1/|Q|) Σ_{i=1}^{|Q|} 1/rank(w_i), for the words we perturbed. We rank words with lower p-values higher; therefore, we expect the MRR to be higher for the methods that are able to discover more of the words that have changed.

To introduce a single perturbation, we sample a pair of words out of the vocabulary, excluding functional words and stop words7. We designate one of them to be a donor and the other to be a receptor. The donor word’s occurrences are replaced with the receptor word with a success probability p_replacement. For example, given the word pair (location, equation), some of the occurrences of the word location (donor) were replaced with the word equation (receptor) in the second half of the snapshots of our temporal corpus.

Figure 7 illustrates the results on the two types of perturbations we synthesized. First, we picked our (donor, receptor) pairs such that both of them have the same most frequent part-of-speech tag. For example, we might use the pair (boat, car) but not (boat, running). We expect the frequency of the receptor and its context distribution to change, but no significant syntactic changes. Figure 7a shows the MRR of the receptor words for the Distributional and Frequency methods. We observe that both methods improve their rankings as the degree of induced change increases (measured, here, by p_replacement). Second, we observe that the Distributional approach outperforms the Frequency method consistently for different values of p_replacement.

Second, to compare the Distributional and Syntactic methods, we sample word pairs without the constraint of being from the same part-of-speech categories. Figure 7b shows that while the Syntactic method outperforms the Distributional method when the perturbation is minimal, its ranking continues to decline in quality as the perturbation increases. This could be explained by noting that the quality of the tagger annotations decreases as the corpus at inference time diverges from the training corpus.

It is quite clear from both experiments that the Distributional method outperforms the other methods when p_replacement > 0.4, without requiring any language-specific resources or annotators.

6.4.2 Evaluation on a Reference Dataset
In this section, we attempt to gauge the performance of the various methods on a reference data set. We created a reference data set D of 20 words that have been suggested by prior work [15, 17, 19, 39] as having undergone a linguistic change8. For each method, we create a list L of its changed words, ordered by the significance scores of the change, and evaluate the Precision@k with respect to the constructed reference data set. Specifically, the Precision@k between L and D is defined as:

Precision@k(L, D) = |L[1:k] ∩ D| / |D|    (11)

Figure 8a depicts the performance of the different methods on this reference data set. Observe that the Distributional method outperforms the other methods, with the Frequency method performing the poorest (due to its high false positive rate). The Syntactic method, which does not capture semantic changes well, also performs worse than the Distributional method.

6.4.3 Human Evaluation
We chose the top 20 words claimed to have changed by each method and asked 3 human evaluators to independently decide whether each word experienced a linguistic shift. For each method, we calculated the percentage of words each rater believes have changed, and report the mean percentage. We observed that, on average, the raters believe that only 13.33% of the words reported by the Frequency method and only 21.66% of the words reported by the Syntactic method changed. However, in the case of the Distributional method, we observed that, on average, the raters believe that 53.33% of the words changed. We thus conclude from this evaluation that the Distributional method outperforms the other methods.
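The evaluation quantities used in this subsection and the next — the synthetic donor/receptor perturbation, MRR, Precision@k (Eq. 11), and the Jaccard agreement AG (Eq. 12) — are simple to state in code. The following sketch uses hypothetical toy data; the word lists and reference set are illustrative, not the paper's actual data:

```python
import random

def perturb(tokens, donor, receptor, p_replacement, rng):
    """Synthetic perturbation of Section 6.4.1: replace each occurrence of
    the donor word with the receptor word with probability p_replacement."""
    return [receptor if w == donor and rng.random() < p_replacement else w
            for w in tokens]

def mrr(ranks):
    """Mean Reciprocal Rank over the ranks assigned to the perturbed words."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def precision_at_k(ranked_words, reference, k):
    """Precision@k (Eq. 11): overlap of the top-k list with the reference
    set D, normalized by |D|."""
    return len(set(ranked_words[:k]) & set(reference)) / len(reference)

def agreement(top1, top2):
    """Jaccard similarity between two top-k lists (Eq. 12)."""
    s1, s2 = set(top1), set(top2)
    return len(s1 & s2) / len(s1 | s2)

# With p_replacement = 1.0, every donor occurrence becomes the receptor.
tokens = ["the", "location", "of", "the", "location"]
out = perturb(tokens, "location", "equation", 1.0, random.Random(0))
```

Note that Precision@k here is normalized by |D| as in Eq. 11 (so it is bounded by |D| rather than k), matching the paper's definition rather than the more common top-k normalization.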

5 http://mattmahoney.net/dc/text8.zip
6 http://textblob.readthedocs.org/en/dev/
7 NLTK Stopword List: http://www.nltk.org/
8 The reference data set and the human evaluations are available at http://vivekkulkarni.net

6.4.4 Method Agreement
In order to investigate the agreement between the various methods, we again consider the top k words that each method is most confident have changed. For each pair of methods, we then compute the fraction of words both methods agree on in their top-k lists. Specifically, given methods M1 and M2, let M1(k) and M2(k) represent the top-k lists for M1 and M2 respectively. We define the agreement between these two lists as:

AG(M1(k), M2(k)) = |M1(k) ∩ M2(k)| / |M1(k) ∪ M2(k)|    (12)

which is the Jaccard similarity between M1(k) and M2(k).

Figure 8b shows the agreement scores between each pair of methods for different values of k. We first note that the agreement between all methods is low, suggesting that the methods differ in the aspects of word change they capture. Observe that the agreement between Distributional and Syntactic is higher than that between Syntactic and Frequency. This can be explained by noting that the Distributional method captures semantic changes along with elements of syntactic changes, and therefore agrees more with the Syntactic method. We leave it to future work to investigate whether a single improved method can capture all of these aspects of word usage effectively.

7. RELATED WORK
Here we discuss the four most relevant areas of related work: linguistic shift, word embeddings, change point detection, and Internet linguistics.

Linguistic Shift: There has been a surge of work on language evolution over time. Michel et al. [25] detected important political events by analyzing frequent patterns. Juola [18] compared language from different time periods and quantified the change. Lijffijt et al. [20] and Säily et al. [35] study variation in noun/pronoun frequencies and lexical stability in a historical corpus. Different from these studies, we quantify linguistic change by tracking individual shifts in words’ meaning. This fine-grained detection and tracking still allows us to quantify the change in natural language as a whole, while still being able to interpret these changes.

Gulordava and Baroni [15] propose a distributional similarity approach to detecting semantic change in the Google Books Ngram corpus between two time periods. Wijaya and Yeniterzi [39] study the evolution of words using a topic modeling approach but do not suggest an explicit change point detection algorithm. Our work differs from the above studies by tracking word evolution through multiple time periods and explicitly providing a change point detection algorithm to detect significant changes. Mitra et al. [29] use a graph-based approach relying on dependency parsing of sentences. Our proposed time series construction methods require minimal linguistic knowledge and resources, enabling the application of our approach to all languages and domains equally. Compared to the sequential training procedure proposed by Kim et al. [19], our technique warps the embedding spaces of the different time snapshots after the training, allowing for efficient training that can be parallelized for large corpora. Moreover, our work is unique in that our datasets span different time scales, cover larger user interactions, and represent a better sample of the web.

Word Embeddings: Bengio et al. [4] used word embeddings to develop a neural language model that outperforms traditional ngram models. These word embeddings have been shown to capture fine-grained structures and regularities in the data [26, 27, 32]. Moreover, they have proved to be useful for a wide range of natural language processing tasks [2, 8, 10]. The same technique of learning word embeddings has recently been applied to graph representations [33].

Change Point Detection: Change point detection and analysis is an important problem in the area of time series analysis and modeling. Taylor [38] describes control charts and CUSUM-based methods in detail. Adams and MacKay [1] present a Bayesian approach to online change point detection. The method of bootstrapping and establishing statistical significance is outlined in [12]. Basseville and Nikiforov [3] provide an excellent survey of several elementary change point detection techniques and time series models.

Internet Linguistics: Internet Linguistics is concerned with the study of language in media influenced by the Internet (online forums, blogs, online social media) and also other related forms of electronic media, like text messaging. Schiano et al. [36] and Tagliamonte and Denis [37] study how teenagers use messaging media, focusing on their usage patterns and the resulting implications for the design of e-mail and instant messaging. Merchant [24] studies the language use of teenagers in online chat forums. An excellent survey on Internet Linguistics is provided by Crystal [11], and includes linguistic analyses of social media like Twitter, Facebook, or Google+.

8. CONCLUSIONS AND FUTURE WORK
In this work, we have proposed three approaches to model word evolution through different time series construction methods. Our computational approach then uses statistically sound change point detection algorithms to detect significant linguistic shifts. Finally, we demonstrated our method’s effectiveness on three different data sets, each representing a different medium. By analyzing the Google Books Ngram Corpus, we were able to detect historical semantic shifts that happened to words like gay and bitch. Moreover, in faster-evolving media like Tweets and Amazon Reviews, we were able to detect recent events like storms, game and book releases. This capability of detecting meaning shift should help decipher the ambiguity of dynamical systems like natural languages. We believe our work has implications for the fields of Semantic Search and the recently burgeoning field of Internet Linguistics.

Our future work in the area will focus on the real-time analysis of linguistic shift, the creation of better resources for the quantitative evaluation of computational methods, and the effects of attributes like geographical location and content source on the underlying mechanisms of meaning change in language.

Acknowledgments
We thank Andrew Schwartz for providing us access to the Twitter data. This research was partially supported by NSF Grants DBI-1355990 and IIS-1017181, a Google Faculty Research Award, a Renaissance Technologies Fellowship, and the Institute for Computational Science at Stony Brook University.

References
[1] R. P. Adams and D. J. MacKay. Bayesian online changepoint detection. Cambridge, UK, 2007.
[2] R. Al-Rfou, B. Perozzi, and S. Skiena. Polyglot: Distributed word representations for multilingual NLP. In CoNLL, 2013.
[3] M. Basseville and I. V. Nikiforov. Detection of Abrupt Changes: Theory and Application. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1993.
[4] Y. Bengio, H. Schwenk, et al. Neural probabilistic language models. In Innovations in Machine Learning, pages 137–186. Springer, 2006.
[5] Y. Bengio et al. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.
[6] L. Bottou. Stochastic gradient learning in neural networks. In Proceedings of Neuro-Nîmes. EC2, Nimes, France, 1991.
[7] H. A. Carneiro and E. Mylonakis. Google Trends: A web-based tool for real-time surveillance of disease outbreaks. Clinical Infectious Diseases, 49(10):1557–1564, 2009.
[8] Y. Chen, B. Perozzi, R. Al-Rfou, and S. Skiena. The expressive power of word embeddings. CoRR, abs/1301.3226, 2013.
[9] H. Choi and H. Varian. Predicting the present with Google Trends. Economic Record, 88:2–9, 2012.
[10] R. Collobert, J. Weston, et al. Natural language processing (almost) from scratch. J. Mach. Learn. Res., 12:2493–2537, Nov. 2011.
[11] D. Crystal. Internet Linguistics: A Student Guide. Routledge, New York, NY, 1st edition, 2011.
[12] B. Efron and R. J. Tibshirani. An introduction to the bootstrap. 1971.
[13] J. R. Firth. Papers in Linguistics 1934–1951. Repr., Oxford University Press, 1961.
[14] Y. Goldberg and J. Orwant. A dataset of syntactic-ngrams over time from a very large corpus of English books. In *SEM, 2013.
[15] K. Gulordava and M. Baroni. A distributional similarity approach to the detection of semantic change in the Google Books Ngram corpus. In GEMS, July 2011.
[16] D. Immerwahr. The books of the century, 2014. URL http://www.ocf.berkeley.edu/~immer/books1970s.
[17] A. Jatowt and K. Duh. A framework for analyzing semantic change of words across time. In Proceedings of the Joint JCDL/TPDL Digital Libraries Conference, 2014.
[18] P. Juola. The time course of language change. Computers and the Humanities, 37(1):77–96, 2003.
[19] Y. Kim, Y.-I. Chiu, K. Hanaki, et al. Temporal analysis of language through neural language models. In ACL, 2014.
[20] J. Lijffijt, T. Säily, and T. Nevalainen. Ceecing the baseline: Lexical stability and significant change in a historical corpus. VARIENG, 2012.
[21] J. Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1):145–151, 1991.
[22] Y. Lin, J. B. Michel, E. L. Aiden, J. Orwant, W. Brockman, and S. Petrov. Syntactic annotations for the Google Books Ngram corpus. In ACL, 2012.
[23] J. Mann, D. Zhang, et al. Enhanced search with wildcards and morphological inflections in the Google Books Ngram Viewer. In Proceedings of ACL Demonstrations Track. Association for Computational Linguistics, June 2014.
[24] G. Merchant. Teenagers in cyberspace: an investigation of language use and language change in internet chatrooms. Journal of Research in Reading, 24:293–306, 2001.
[25] J. B. Michel, Y. K. Shen, et al. Quantitative analysis of culture using millions of digitized books. Science, 331(6014):176–182, 2011.
[26] T. Mikolov et al. Linguistic regularities in continuous space word representations. In Proceedings of NAACL-HLT, 2013.
[27] T. Mikolov et al. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.
[28] T. Mikolov et al. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781, 2013.
[29] S. Mitra, R. Mitra, et al. That’s sick dude!: Automatic identification of word sense change across different timescales. In ACL, 2014.
[30] A. Mnih and G. E. Hinton. A scalable hierarchical distributed language model. NIPS, 21:1081–1088, 2009.
[31] F. Morin and Y. Bengio. Hierarchical probabilistic neural network language model. In Proceedings of the International Workshop on Artificial Intelligence and Statistics, pages 246–252, 2005.
[32] B. Perozzi, R. Al-Rfou, V. Kulkarni, and S. Skiena. Inducing language networks from continuous space word representations. In Complex Networks V, volume 549 of Studies in Computational Intelligence, pages 261–273. 2014.
[33] B. Perozzi, R. Al-Rfou, and S. Skiena. DeepWalk: Online learning of social representations. In KDD, New York, NY, USA, August 2014. ACM.
[34] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Cognitive Modeling, 1:213, 2002.
[35] T. Säily, T. Nevalainen, and H. Siirtola. Variation in noun and pronoun frequencies in a sociohistorical corpus of English. Literary and Linguistic Computing, 26(2):167–188, 2011.
[36] D. J. Schiano, C. P. Chen, E. Isaacs, J. Ginsberg, U. Gretarsdottir, and M. Huddleston. Teen use of messaging media. In Computer Human Interaction, pages 594–595, 2002.
[37] S. A. Tagliamonte and D. Denis. Linguistic ruin? LOL! Instant messaging and teen language. American Speech, 83:3–34, 2008.
[38] W. A. Taylor. Change-point analysis: A powerful new tool for detecting changes, 2000.
[39] D. T. Wijaya and R. Yeniterzi. Understanding semantic change of words over centuries. In DETECT, 2011.