Studying the History of Ideas Using Topic Models
David Hall (Symbolic Systems), Daniel Jurafsky (Linguistics), and Christopher D. Manning (Computer Science)
Stanford University, Stanford, CA 94305, USA
[email protected], [email protected], [email protected]

Abstract

How can the development of ideas in a scientific field be studied over time? We apply unsupervised topic modeling to the ACL Anthology to analyze historical trends in the field of Computational Linguistics from 1978 to 2006. We induce topic clusters using Latent Dirichlet Allocation, and examine the strength of each topic over time. Our methods find trends in the field including the rise of probabilistic methods starting in 1988, a steady increase in applications, and a sharp decline of research in semantics and understanding between 1978 and 2001, possibly rising again after 2001. We also introduce a model of the diversity of ideas, topic entropy, using it to show that COLING is a more diverse conference than ACL, but that both conferences as well as EMNLP are becoming broader over time. Finally, we apply Jensen-Shannon divergence of topic distributions to show that all three conferences are converging in the topics they cover.

1 Introduction

How can we identify and study the exploration of ideas in a scientific field over time, noting periods of gradual development, major ruptures, and the waxing and waning of both topic areas and connections with applied topics and nearby fields? One important method is to make use of citation graphs (Garfield, 1955). This enables the use of graph-based algorithms like PageRank for determining researcher or paper centrality, and examining whether their influence grows or diminishes over time.

However, because we are particularly interested in the change of ideas in a field over time, we have chosen a different method, following Kuhn (1962). In Kuhn's model of scientific change, science proceeds by shifting from one paradigm to another. Because researchers' ideas and vocabulary are constrained by their paradigm, successive incommensurate paradigms will naturally have different vocabulary and framing.

Kuhn's model is intended to apply only to very large shifts in scientific thought, rather than to the micro level of trends in research foci. Nonetheless, we propose to apply Kuhn's insight that vocabulary and vocabulary shift are a crucial indicator of ideas and shifts in ideas. Our operationalization of this insight is based on the unsupervised topic model Latent Dirichlet Allocation (LDA; Blei et al. (2003)).

For many fields, doing this kind of historical study would be very difficult. Computational linguistics has an advantage, however: the ACL Anthology, a public repository of all papers in the Computational Linguistics journal and the conferences and workshops associated with the ACL, COLING, EMNLP, and so on. The ACL Anthology (Bird, 2008) comprises over 14,000 documents from conferences and the journal, beginning as early as 1965 and running through 2008, indexed by conference and year. This resource has already been the basis of citation analysis work, for example in the ACL Anthology Network of Joseph and Radev (2007). We apply LDA to the text of the papers in the ACL Anthology to induce topics, and use the trends in these topics over time and over conference venues to address questions about the development of the field.
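As background for the two measures named in the abstract, both are standard information-theoretic quantities. A sketch of the usual definitions follows; the conditioning on conference c and year y is notation assumed here for illustration, not quoted from the paper:

```latex
% Topic entropy for conference c in year y, over K topics
% (the conditioning notation is an assumption for illustration):
H(z \mid c, y) = -\sum_{i=1}^{K} p(z_i \mid c, y) \log p(z_i \mid c, y)

% Jensen-Shannon divergence between topic distributions P and Q,
% a symmetrized, smoothed variant of the KL divergence:
D_{JS}(P \parallel Q) = \tfrac{1}{2} D_{KL}(P \parallel M)
                      + \tfrac{1}{2} D_{KL}(Q \parallel M),
\qquad M = \tfrac{1}{2}(P + Q)
```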
Despite the relative youth of our field, computational linguistics has witnessed a number of research trends and shifts in focus. While some trends are obvious (such as the rise in machine learning methods), others may be more subtle. Has the field gotten more theoretical over the years, or has there been an increase in applications? What topics have declined over the years, and which ones have remained roughly constant? How have fields like Dialogue or Machine Translation changed over the years? Are there differences among the conferences, for example between COLING and ACL, in their interests and breadth of focus? As our field matures, it is important to go beyond anecdotal description to give grounded answers to these questions. Such answers could also help give formal metrics to model the differences between the many conferences and venues in our field, which could influence how we think about reviewing, about choosing conference topics, and about long-range planning in our field.

2 Methodology

2.1 Data

The analyses in this paper are based on a text-only version of the Anthology that comprises some 12,500 papers. The distribution of the Anthology data is shown in Table 1.

Venue         # Papers   Years          Frequency
Journal       1291       1974–Present   Quarterly
ACL           2037       1979–Present   Yearly
EACL          596        1983–Present   ∼2 Years
NAACL         293        2000–Present   ∼Yearly
Applied NLP   346        1983–2000      ∼3 Years
COLING        2092       1965–Present   2 Years
HLT           957        1986–Present   ∼2 Years
Workshops     2756       1990–Present   Yearly
TINLAP        128        1975–1987      Rarely
MUC           160        1991–1998      ∼2 Years
IJCNLP        143        2005           ——
Other         120        ——             ——

Table 1: Data in the ACL Anthology

2.2 Topic Modeling

Our experiments employ Latent Dirichlet Allocation (LDA; Blei et al. (2003)), a generative latent variable model that treats documents as bags of words generated by one or more topics. Each document is characterized by a multinomial distribution over topics, and each topic is in turn characterized by a multinomial distribution over words. We perform parameter estimation using collapsed Gibbs sampling (Griffiths and Steyvers, 2004).

Possible extensions to this model would be to integrate topic modeling with citations (e.g., Dietz et al. (2007), Mann et al. (2006), and Jo et al. (2007)). Another option is the use of more fine-grained or hierarchical models (e.g., Blei et al. (2004), and Li and McCallum (2006)).

All our studies measure change in various aspects of the ACL Anthology over time. LDA, however, does not explicitly model temporal relationships. One way to model temporal relationships is to employ an extension to LDA. The Dynamic Topic Model (Blei and Lafferty, 2006), for example, represents each year's documents as generated from a normal distribution centroid over topics, with the following year's centroid generated from the preceding year's. The Topics over Time model (Wang and McCallum, 2006) assumes that each document chooses its own time stamp based on a topic-specific beta distribution.

Both of these models, however, impose constraints on the time periods. The Dynamic Topic Model penalizes large changes from year to year, while the beta distributions in Topics over Time are relatively inflexible. We chose instead to perform post hoc calculations based on the observed probability of each topic given the current year. We define p̂(z|y) as the empirical probability that an arbitrary paper d written in year y was about topic z:

\hat{p}(z \mid y) = \sum_{d : t_d = y} \hat{p}(z \mid d)\,\hat{p}(d \mid y)
                  = \frac{1}{C} \sum_{d : t_d = y} \hat{p}(z \mid d)
                  = \frac{1}{C} \sum_{d : t_d = y} \sum_{z'_i \in d} I(z'_i = z)    (1)

where I is the indicator function, t_d is the date document d was written, and p̂(d|y) is set to a constant 1/C, C being the number of papers written in year y.
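Since p̂(d|y) is uniform over a year's papers, Equation (1) amounts to averaging the per-document topic mixtures within each year. A minimal Python sketch, assuming hypothetical doc_topics (paper id to topic distribution) and doc_year (paper id to year) mappings obtained from a fitted LDA model:

```python
# Sketch (not the authors' code): computes the empirical topic-year
# probability p_hat(z|y) of Equation (1). `doc_topics` and `doc_year`
# are assumed inputs derived from a fitted LDA model.
from collections import defaultdict

def topic_strength_by_year(doc_topics, doc_year, num_topics):
    """Return {year: [p_hat(z|y) for each topic z]}."""
    papers_by_year = defaultdict(list)
    for d, year in doc_year.items():
        papers_by_year[year].append(d)

    strength = {}
    for year, papers in papers_by_year.items():
        C = len(papers)  # p_hat(d|y) = 1/C, uniform over the year's papers
        totals = [0.0] * num_topics
        for d in papers:
            for z, p in enumerate(doc_topics[d]):
                totals[z] += p  # accumulate p_hat(z|d)
        strength[year] = [t / C for t in totals]
    return strength
```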
3 Summary of Topics

We first ran LDA with 100 topics, and took 36 that we found to be relevant. We then hand-selected seed words for 10 more topics to improve coverage of the field. These 46 topics were then used as priors to a new 100-topic run. The top ten most frequent words for 43 of the topics, along with hand-assigned labels, are listed in Table 2. Topics deriving from manual seeds are marked with an asterisk.

4 Historical Trends in Computational Linguistics

Given the space of possible topics defined in the previous section, we now examine the history of these topics in the entire ACL Anthology from 1978 until 2006. To visualize some trends, we show the probability mass associated with various topics over time.

[Figure 1: Topics in the ACL Anthology that show a strong recent increase in strength. The plot tracks topic probability (0–0.2) from 1980 to 2005 for the topics Classification, Probabilistic Models, Stat. Parsing, Stat. MT, and Lex. Sem.]

Some of the papers with the highest weights for the probabilistic models topic include:

N04-1039  Goodman, Joshua. Exponential Priors for Maximum Entropy Models (HLT-NAACL, 2004)
W97-0309  Saul, Lawrence, Pereira, Fernando C. N. Aggregate and Mixed-Order Markov Models for Statistical Language Processing (EMNLP, 1997)
P96-1041  Chen, Stanley F., Goodman, Joshua. An Empirical Study of Smoothing Techniques for Language Modeling (ACL, 1996)
H89-2013  Church, Kenneth Ward, Gale, William A. Enhanced Good-Turing and CatCal: Two New Methods for Estimating Probabilities of English Bigrams (Workshop on Speech and Natural Language, 1989)
P02-1023  Gao, Jianfeng, Zhang, Min. Improving Language Model Size Reduction Using Better Pruning Criteria (ACL, 2002)
P94-1038  Dagan, Ido, Pereira, Fernando C. N. Similarity-Based Estimation of Word Cooccurrence Probabilities (ACL, 1994)

Some of the papers with the highest weights for the classification/tagging class include:

W00-0713  Van Den Bosch, Antal. Using Induced Rules as Complex Features in Memory-Based Language Learning (CoNLL, 2000)
W01-0709  Estabrooks, Andrew, Japkowicz, Nathalie. A Mixture-of-Experts Framework for Text Classification (Workshop on Computational Natural Language Learning, CoNLL, 2001)
A00-2035  Mikheev, Andrei. Tagging Sentence Boundaries (ANLP-NAACL, 2000)
H92-1022  Brill, Eric. A Simple Rule-Based Part of Speech Tagger (Workshop on Speech and Natural Language, 1992)

As Figure 1 shows, probabilistic models seem to have arrived significantly before classifiers. The probabilistic model topic increases around 1988, which seems to have been an important year for probabilistic models, including high-impact papers like A88-1019 and C88-1016 below.
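The seeding step of Section 3 (hand-picked seed words used as priors for a second 100-topic run) can be approximated by boosting the topic-word prior. A hedged sketch using gensim's LdaModel, whose eta parameter accepts a (num_topics, num_terms) prior matrix; note that gensim fits LDA by variational inference rather than the collapsed Gibbs sampling used in the paper, and the boost value and function name below are illustrative assumptions:

```python
# Sketch (assumptions, not the paper's implementation): seeded LDA
# via an asymmetric topic-word prior, in the spirit of Section 3.
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel

def seeded_lda(texts, seed_words, num_topics=100, boost=1000.0):
    """texts: list of token lists; seed_words: {topic_index: [word, ...]}."""
    dictionary = Dictionary(texts)
    corpus = [dictionary.doc2bow(tokens) for tokens in texts]

    # Symmetric base prior over the vocabulary, then boost each
    # hand-picked seed word inside its designated topic.
    eta = np.full((num_topics, len(dictionary)), 0.01)
    for z, words in seed_words.items():
        for w in words:
            if w in dictionary.token2id:
                eta[z, dictionary.token2id[w]] += boost

    return LdaModel(corpus, num_topics=num_topics,
                    id2word=dictionary, eta=eta, passes=10)
```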