Towards a Computational History of the ACL: 1980–2008

Ashton Anderson, Dan McFarland, Dan Jurafsky
Stanford University
[email protected] [email protected] [email protected]

Abstract

We develop a people-centered computational history of science that tracks authors over topics and apply it to the history of computational linguistics. We present four findings in this paper. First, we identify the topical subfields authors work on by assigning automatically generated topics to each paper in the ACL Anthology from 1980 to 2008. Next, we identify four distinct research epochs in which the pattern of topical overlaps is stable and different from other eras: an early NLP period from 1980 to 1988, the period of US government-sponsored MUC and ATIS evaluations from 1989 to 1994, a transitory period until 2001, and a modern integration period from 2002 onwards. Third, we analyze the flow of authors across topics to discern how some subfields flow into the next, forming different stages of ACL research. We find that the government-sponsored bakeoffs brought new researchers to the field, and bridged early topics to modern probabilistic approaches. Last, we identify steep increases in author retention during the bakeoff era and the modern era, suggesting two points at which the field became more integrated.

1 Introduction

The rise of vast on-line collections of scholarly papers has made it possible to develop a computational history of science. Methods from natural language processing and other areas of computer science can be naturally applied to study the ways a field and its ideas develop and expand (Au Yeung and Jatowt, 2011; Gerrish and Blei, 2010; Tu et al., 2010; Aris et al., 2009). One particular direction in computational history has been the use of topic models (Blei et al., 2003) to analyze the rise and fall of research topics to study the progress of science, both in general (Griffiths and Steyvers, 2004) and more specifically in the ACL Anthology (Hall et al., 2008).

We extend this work with a more people-centered view of computational history. In this framework, we examine the trajectories of individual authors across research topics in the field of computational linguistics. By examining a single author's paper topics over time, we can trace the evolution of her academic efforts; by superimposing these individual traces over each other, we can learn how the entire field progressed over time. One goal is to investigate the use of these techniques for computational history in general. A second goal is to use the ACL Anthology Network Corpus (Radev et al., 2009) and the incorporated ACL Anthology Reference Corpus (Bird et al., 2008) to answer specific questions about the history of computational linguistics. What is the path that the ACL has taken throughout its 50-year history? What roles did various research topics play in the ACL's development? What have been the pivotal turning points?

Our method consists of four steps. We first run topic models over the corpus to classify papers into topics and identify the topics that people author in. We then use these topics to identify epochs by correlating over time the number of persons that topics share in common. From this, we identify epochs as sustained patterns of topical overlap.

Our third step is to look at the flow of authors between topics over time to detect patterns in how authors move between areas in the different epochs. We group topics into clusters based on when authors move in and out of them, and visualize the flow


Proceedings of the ACL-2012 Special Workshop on Rediscovering 50 Years of Discoveries, pages 13–21, Jeju, Republic of Korea, 10 July 2012. © 2012 Association for Computational Linguistics

of people across these clusters to identify how one topic leads to another.

Finally, in order to understand how the field grows and declines, we examine patterns of entry and exit within each epoch, studying how author retention (the extent to which authors keep publishing in the ACL) varies across epochs.

2 Identifying Topics

Our first task is to identify research topics within computational linguistics. We use the ACL Anthology Network Corpus and the incorporated ACL Anthology Reference Corpus, with around 13,000 papers by approximately 11,000 distinct authors from 1965 to 2008. Due to data sparsity in early years, we drop all papers published prior to 1980.

We ran LDA on the corpus to produce 100 generative topics (Blei et al., 2003). Two senior researchers in the field (the third author and Chris Manning) then collaboratively assigned a label to each of the 100 topics, marking as non-substantive those topics consisting of lists of function words or affixes. They produced a consensus labeling with 73 final topics, shown in Table 1 (27 non-substantive topics were eliminated, e.g. a pronoun topic, a suffix topic, etc.).

Each paper is associated with a probability distribution over the 100 original topics describing how much of the paper is generated from each topic. All of this information is represented by a matrix P, where the entry P_ij is simply the loading of topic j on paper i (since each row is a probability distribution, Σ_j P_ij = 1). For ease of interpretation, we sparsify the matrix P by assigning papers to topics, setting all entries to either 0 or 1. We do this by choosing a threshold T and setting entries to 1 if they exceed this threshold. If we call the new matrix Q, then Q_ij = 1 ⟺ P_ij ≥ T. Throughout all our analyses we use T = 0.1. This value is approximately two standard deviations above the mean of the entries in P. Most papers are assigned to 1 or 2 topics; some are assigned to none and some are assigned to more.

This assignment of papers to topics also induces an assignment of authors to topics: an author is assigned to a topic if she authored a paper assigned to that topic. Furthermore, this assignment is naturally dynamic: since every paper is published in a particular year, authors' topic memberships change over time. This fact is at the heart of our methodology: by assigning authors to topics in this principled way, we can track the topics that authors move through. Analyzing the flow of authors through topics enables us to learn which topics beget other topics, and which topics are related to others by the people that author across them.

3 Identifying Epochs

What are the major epochs of the ACL's history? In this section, we seek to partition the years spanned by the ACL's history into clear, distinct periods of topical cohesion, which we refer to as epochs. If the dominant research topics people are working on suddenly change from one set of topics to another, we view this as a transition between epochs.

To identify epochs that satisfy this definition, we generate a set of matrices (one for each year) describing the number of people that author in every pair of topics during that year. For year y, let N^y be a matrix such that N^y_ij is the number of people that author in both topics i and j in year y (where authoring in topic j means being an author on a paper p such that Q_pj = 1). We don't normalize by the total number of people in each topic, thus proportionally representing bigger topics since they account for more research effort than smaller topics. Each matrix is a signature of which topic pairs have overlapping author sets in that year.

From these matrices, we compute a final matrix C of year-year correlations. C_ij is the Pearson correlation coefficient between N^i and N^j. C captures the degree to which years have similar patterns of topic authorship overlap, or the extent to which a consistent pattern of topical research is formed. We visualize C as a thermal plot in Figure 1.

To identify epochs in the ACL's history, we ran hierarchical complete-link clustering on C. This resulted in a set of four distinct epochs: 1980–1988, 1989–1994, 1995–2001, and 2002–2008. For three of these periods (all except 1995–2001), years within each of these ranges are much more similar to each other than they are to other years. During the third period (1995–2001), none of the years are highly similar to any other years. This is indicative of a
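The sparsification and epoch-identification steps described above can be sketched in code. The following is a minimal illustration with toy data, not the authors' implementation; the variable names, the tiny corpus, and the use of NumPy/SciPy (including the choice of 1 - C as the clustering dissimilarity) are our assumptions:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Toy topic loadings: P[i, j] = loading of topic j on paper i (rows sum to 1).
P = np.array([
    [0.70, 0.20, 0.10],
    [0.05, 0.90, 0.05],
    [0.50, 0.05, 0.45],
    [0.05, 0.50, 0.45],
])

# Sparsify: Q[i, j] = 1 iff P[i, j] >= T. The paper uses T = 0.1, roughly
# two standard deviations above the mean entry of P.
T = 0.1
Q = (P >= T).astype(int)

# Hypothetical metadata: paper index -> (year, set of author names).
papers = {0: (1980, {"a"}), 1: (1980, {"b"}),
          2: (1981, {"c"}), 3: (1982, {"a", "c"})}

def yearly_overlap(papers, Q, n_topics):
    """N[y][i, j] = number of people authoring in both topics i and j in year y."""
    authors_in = {}  # (year, topic) -> set of authors
    for p, (year, authors) in papers.items():
        for t in range(n_topics):
            if Q[p, t]:
                authors_in.setdefault((year, t), set()).update(authors)
    N = {}
    for year in sorted({y for y, _ in authors_in}):
        M = np.zeros((n_topics, n_topics))
        for i in range(n_topics):
            for j in range(n_topics):
                M[i, j] = len(authors_in.get((year, i), set())
                              & authors_in.get((year, j), set()))
        N[year] = M
    return N

N = yearly_overlap(papers, Q, n_topics=3)
years = sorted(N)

# Year-year correlation: Pearson r between the flattened N matrices,
# then complete-link clustering of years into epochs.
C = np.corrcoef([N[y].ravel() for y in years])
Z = linkage(squareform(1 - C, checks=False), method="complete")
epochs = fcluster(Z, t=2, criterion="maxclust")  # epoch label per year
```

With real data the papers dictionary would come from the anthology metadata and Q from the 100-topic LDA run; the sketch only fixes the shape of the computation.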

Cluster 1: Big Data NLP
  Statistical Machine Translation (Phrase-Based): bleu, statistical, source, target, phrases, smt, reordering
  Dependency Parsing: dependency/ies, head, czech, depen, dependent, treebank
  MultiLingual Resources: languages, spanish, russian, multilingual, lan, hindi, swedish
  Relation Extraction: pattern/s, relation, extraction, instances, pairs, seed
  Collocations/Compounds: compound/s, collocation/s, adjectives, nouns, entailment, expressions, MWEs
  Graph Theory + BioNLP: graph/s, medical, edge/s, patient, clinical, vertex, text, report, disease
  Sentiment Analysis: question/s, answer/s, answering, opinion, sentiment, negative, positive, polarity

Cluster 2: Probabilistic Methods
  Discriminative Sequence Models: label/s, conditional, sequence, random, discriminative, inference
  Metrics + Human Evaluation: human, measure/s, metric/s, score/s, quality, reference, automatic, correlation, judges
  Statistical Parsing: parse/s, treebank, trees, Penn, Collins, parsers, Charniak, accuracy, WSJ
  ngram Language Models: n-gram/s, bigram/s, prediction, trigram/s, unigram/s, trigger, show, baseline
  Algorithmic Efficiency: search, length, size, space, cost, algorithms, large, complexity, pruning
  Bilingual Word Alignment: alignment/s, align/ed, pair/s, statistical, source, target, links, Brown
  ReRanking: score/s, candidate/s, list, best, correct, hypothesis, selection, rank/ranking, scoring, top, confidence
  Evaluation Metrics: precision, recall, extraction, threshold, methods, filtering, extract, high, phrases, filter, f-measure
  Methods (Experimental/Evaluation): experiments, accuracy, experiment, average, size, 100, baseline, better, per, sets
  Machine Learning Optimization: function, value/s, parameter/s, local, weight, optimal, solution, criterion, variables

Cluster 3: Linguistic Supervision
  Biomedical Named Entity Recognition: biomedical, gene, term, protein, abstracts, extraction, biological
  Word Segmentation: segment/ation, character/s, segment/s, boundary/ies, token/ization
  Document Retrieval: document/s, retrieval, query/ies, term, relevant/ance, collection, indexing, search
  SRL/Framenet: argument/s, role/s, predicate, frame, FrameNet, predicates, labeling, PropBank
  Wordnet/Multilingual Ontologies: ontology/ies, italian, domain/s, resource/s, i.e, ontological, concepts
  WebSearch + Wikipedia: web, search, page, xml, http, engine, document, wikipedia, content, html, query, Google
  Clustering + Distributional Similarity: similar/ity, cluster/s/ing, vector/s, distance, matrix, measure, pair, cosine, LSA
  Word Sense Disambiguation: WordNet, senses, disambiguation, WSD, nouns, target, synsets, Yarowsky
  Machine Learning Classification: classification, classifier/s, examples, kernel, class, SVM, accuracy, decision
  Linguistic Annotation: annotation/s/ed, agreement, scheme/s, annotators, corpora, tools, guidelines
  Tutoring Systems: student/s, reading, course, computer, tutoring, teaching, writing, essay
  Chunking/Memory Based Models: chunk/s/ing, pos, accuracy, best, memory-based, Daelemans
  Named Entity Recognition: entity/ies, name/s/d, person, proper, recognition, location, organization, mention
  Dialog: dialogue, utterance/s, spoken, dialog/ues, act, interaction, conversation, initiative, meeting, state, agent
  Summarization: topic/s, summarization, summary/ies, document/s, news, articles, content, automatic, stories

Cluster 4: Discourse
  Multimodal (Mainly Generation): object/s, multimodal, image, referring, visual, spatial, gesture, reference, description
  Text Categorization: category/ies, group/s, classification, texts, categorization, style, genre, author
  Morphology: morphological, arabic, morphology, forms, stem, morpheme/s, root, suffix, lexicon
  Coherence Relations: relation, rhetorical, unit/s, coherence, texts, chains
  Spell Correction: error/s, correct/ion, spelling, detection, rate
  Anaphora Resolution: resolution, pronoun, anaphora, antecedent, pronouns, coreference, anaphoric
  Question Answering Dialog System: response/s, you, expert, request, yes, users, query, question, call, database
  UI/Natural Language Interface: users, database, interface, a71, message/s, interactive, access, display
  Computational Phonology: phonological, vowel, syllable, stress, phonetic, phoneme, pronunciation
  Neural Networks/Human Cognition: network/s, memory, acquisition, neural, cognitive, units, activation, layer
  Temporal IE/Aspect: event/s, temporal, tense, aspect, past, reference, before, state
  Prosody: prosody/ic, pitch, boundary/ies, accent, cues, repairs, phrases, spoken, intonation, tone, duration

Cluster 5: Early Probability
  Lexical Acquisition Of Verb Subcategorization: class/es, verb/s, paraphrase/s, subcategorization, frames
  Probability Theory: probability/ies, distribution, probabilistic, estimate/tion, entropy, statistical, likelihood, parameters
  Collocations Measures: frequency/ies, corpora, statistical, distribution, association, statistics, mutual, co-occurrences
  POS Tagging: tag/ging, POS, tags, tagger/s, part-of-speech, tagged, accuracy, Brill, corpora, tagset
  Machine Translation (Non Statistical + Bitexts): target, source, bilingual, translations, transfer, parallel, corpora

Cluster 6: Automata
  Automata Theory: string/s, sequence/s, left, right, transformation, match
  Tree Adjoining Grammars: trees, derivation, grammars, TAG, elementary, auxiliary, adjoining
  Finite State Models (Automata): state/s, finite, finite-state, regular, transition, transducer
  Classic Parsing: grammars, parse, chart, context-free, edge/s, production, CFG, symbol, terminal
  Syntactic Trees: node/s, constraints, trees, path/s, root, constraint, label, arcs, graph, leaf, parent

Cluster 7: Classic Linguistics
  Planning/BDI: plan/s/ning, action/s, goal/s, agent/s, explanation, reasoning
  Dictionary Lexicons: dictionary/ies, lexicon, entry/ies, definition/s, LDOCE
  Linguistic Example Sentences: John, Mary, man, book, examples, Bill, who, dog, boy, coordination, clause
  Syntactic Theory: grammatical, theory, functional, constituent/s, constraints, LFG
  Formal Computational Semantics: semantics, logic/al, scope, interpretation, meaning, representation, predicate
  Speech Acts + BDI: speaker, utterance, act/s, hearer, belief, proposition, focus, utterance
  PP Attachment: ambiguity/ies/ous, disambiguation, attachment, preference, preposition
  Natural Language Generation: generation/ing, generator, choice, generated, realization, content
  Lexical Semantics: meaning/s, semantics, metaphor, interpretation, object, role
  Categorial Grammar/Logic: proof, logic, definition, let, formula, theorem, every, iff, calculus
  Syntax: clause/s, head, subject, phrases, object, verbs, relative, nouns, modifier
  Unification Based Grammars: unification, constraints, structures, value, HPSG, default, head
  Concept Ontologies / Knowledge Rep: concept/s, conceptual, attribute/s, relation, base

Cluster 8: Government
  MUC-Era Information Extraction: template/s, message, slot/s, extraction, key, event, MUC, fill/s
  (topic name missing in source): recognition, acoustic, error, speaker, rate, adaptation, recognizer, phone, ASR
  ATIS dialog: spoken, atis, flight, darpa, understanding, class, database, workshop, utterances

Cluster 9: Early NLU
  1970s-80s NLU Work: 1975-9, 1980-6, computer, understanding, syntax, semantics, ATN, Winograd, Schank, Wilks, lisp
  Code Examples: list/s, program/s, item/s, file/s, code/s, computer, line, output, index, field, data, format
  Speech Parsing And Understanding: frame/s, slot/s, fragment/s, parse, representation, meaning

Table 1: Results of topic clustering, showing some high-probability representative words for each cluster.

state of flux in which authors are constantly changing the topics they are in. As such, we refer to this period as a transitory epoch. Thus our analysis has identified four main epochs in the ACL corpus between 1980 and 2008: three focused periods of work, and one transitory phase.

Figure 1: Year-year correlation in topic authoring patterns. Hotter colors indicate high correlation, colder colors denote low correlation.

These epochs correspond to natural eras in the ACL's history. During the 1980's, there were coherent communities of research on natural language understanding and parsing, generation, dialog, unification and other grammar formalizations, and lexicons and ontologies.

The 1989–1994 era corresponds to a number of important US government initiatives: MUC, ATIS, and the DARPA workshops. The Message Understanding Conferences (MUC) were an early initiative in information extraction, set up by the United States Naval Oceans Systems Center with the support of DARPA, the Defense Advanced Research Projects Agency. A condition of attending the MUC workshops was participation in a required evaluation (bakeoff) task of filling slots in templates about events; the evaluations began (after an exploratory MUC-1 in 1987) with MUC-2 in 1989, followed by MUC-3 (1991), MUC-4 (1992), MUC-5 (1993) and MUC-6 (1995) (Grishman and Sundheim, 1996). The Air Travel Information System (ATIS) was a task for measuring progress in spoken language understanding, sponsored by DARPA (Hemphill et al., 1990; Price, 1990). Subjects talked with a system to answer questions about flight schedules and airline fares from a database; there were evaluations in 1990, 1991, 1992, 1993, and 1994 (Dahl et al., 1994). The ATIS systems were described in papers at the DARPA Speech and Natural Language Workshops, a series of DARPA-sponsored workshops held from 1989–1994 in which DARPA grantees were strongly encouraged to participate, with the goal of bringing together the speech and natural language processing communities.

After the MUC and ATIS bakeoffs and the DARPA workshops ended, the field largely stopped publishing in the bakeoff topics and transitioned to other topics; participation by researchers in speech recognition also dropped off significantly. From 2002 onward, the field settled into the modern era characterized by broad multilingual work and specific areas like dependency parsing, statistical machine translation, information extraction, and sentiment analysis.

In summary, our methods identify four major epochs in the ACL's history: an early NLP period, the "government" period, a transitory period, and a modern integration period. The first, second, and fourth epochs are periods of sustained topical coherence, whereas the third is a transitory phase during which the field moved from the bakeoff work to modern-day topics.

4 Identifying Participant Flows

In the previous section, we used topic co-membership to identify four coherent epochs in the ACL's history. Now we turn our attention to a finer-grained question: How do scientific areas or movements arise? How does one research area develop out of another as authors transition from a previous research topic to a new one? We address this question by tracing the paths of authors through topics over time, in aggregate.

4.1 Topic Clustering

We first group topics into clusters based on how authors move through them. To do this, we group years into 3-year time windows and consider adjacent time periods. We aggregate into 3-year windows because

the flow across adjacent single years is noisy and often does not accurately reflect shifts in topical focus. For each adjacent pair of time periods (for example, 1980–1982 and 1983–1985), we construct a matrix S capturing author flow between each topic pair, where the S_ij entry is the number of authors who authored in topic i during the first time period and authored in topic j during the second time period. These matrices capture people flow between topics over time.

Next we compute similarity between topics. We represent each topic by its flow profile, which is simply the concatenation of all its in- and out-flows in all of the S matrices. More formally, let F_i be the resulting vector after concatenating the i-th row (transposed into a column) and i-th column of every S matrix. We compute a topic-topic similarity matrix T where T_ij is the Pearson correlation coefficient between F_i and F_j. Two topics are then similar if they have similar flow profiles. Note that topics don't need to share authors to be similar: authors just need to move in and out of them at roughly the same times. Through this approach, we identify topics that play similar roles in the ACL's history.

To find a grouping of topics that play similar roles, we perform hierarchical complete-link clustering on the T matrix. The goal is to identify clusters of topics that are highly similar to each other but are dissimilar from those in other clusters. Hierarchical clustering begins with every topic forming a singleton cluster, then iteratively merges the two most similar clusters at every step until there is only one cluster of all topics remaining. Every step gives a different clustering solution, so we assess cluster fitness using Krackhardt and Stern's E-I index, which measures the sum of external ties minus the sum of internal ties, divided by the sum of all ties. Given T as an input, the E-I index optimizes identical profiles as clusters (i.e., topic stages), not discrete groups. The optimal solution we picked using the E-I index entails 9 clusters (shown in Table 1), numbered roughly backwards from the present to the past. We'll discuss the names of the clusters in the next section.

4.2 Flows Between Topic Clusters

Now that we have grouped topics into clusters by how authors flow in and out of them, we can compute the flow between topics or between topic clusters over time. First we define what a flow between topics is. We use the same flow matrix used in the above topic clustering: the flow between topic i in one time period and topic j in the following time period is simply the number of authors present in both at the respective times. Again we avoid normalizing because the volume of people moving between topics is relevant.

Now we can define flow between clusters. Let A be the set of topics in cluster C_1 and let B be the set of topics in cluster C_2. We define the flow between C_1 and C_2 to be the average flow between topics in A and B:

    f(C_1, C_2) = ( Σ_{a∈A, b∈B} f(a, b) ) / ( |A| · |B| )

(where f(a, b) is the topic-topic flow defined above). We also tried defining cluster-cluster flow as the maximum over all topic-topic flows between the clusters, and the results were qualitatively the same.

Figure 2 shows the resulting flows between clusters. Figure 2a shows the earliest period in our (post-1980) dataset, where we see reflections of earlier natural language understanding work by Schank, Woods, Winograd, and others, quickly leading into a predominance of what we've called "Classic Linguistic Topics". Research in this period is characterized by a more linguistically-oriented focus, including syntactic topics like unification and categorial grammars, formal syntactic theory, and prepositional phrase attachments, linguistic semantics (both lexical semantics and formal semantics), and BDI dialog models. Separately we see the beginnings of a movement of people into phonology and discourse and also into the cluster we've called "Automata", which at this stage includes (pre-statistical) Parsing and Tree Adjoining Grammars.

In Figure 2b we see the movement of people into the cluster of government-sponsored topics: the ATIS and MUC bakeoffs, and speech.

In Figure 2c bakeoff research is the dominant theme, but people are also beginning to move in and out of two new clusters. One is Early Probabilistic Models, in which people focused on tasks like Part of Speech tagging, Collocations, and Lexical Acqui-
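The flow and cluster-flow computations can be sketched as follows. This is a minimal illustration under our own naming and toy author sets, not the authors' code; the use of NumPy and the particular windows are assumptions:

```python
import numpy as np
from itertools import product

# Hypothetical topic memberships for two adjacent 3-year windows:
# topic index -> set of authors active in that topic during the window.
window1 = {0: {"a", "b"}, 1: {"b", "c"}, 2: {"d"}}
window2 = {0: {"a"}, 1: {"c", "d"}, 2: {"d", "e"}}
n_topics = 3

def flow_matrix(prev, curr, n_topics):
    """S[i, j] = number of authors in topic i in the first window and in
    topic j in the following window (unnormalized, as in the paper)."""
    S = np.zeros((n_topics, n_topics))
    for i, j in product(range(n_topics), repeat=2):
        S[i, j] = len(prev.get(i, set()) & curr.get(j, set()))
    return S

S = flow_matrix(window1, window2, n_topics)

# Flow profile of topic i: its out-flows (row i) and in-flows (column i)
# concatenated across all S matrices (here only one); the topic-topic
# similarity T is the Pearson correlation between these profiles.
profiles = [np.concatenate([S[i, :], S[:, i]]) for i in range(n_topics)]
T_sim = np.corrcoef(profiles)

def cluster_flow(S, A, B):
    """Average flow f(C1, C2) between topic clusters with topic sets A and B."""
    return sum(S[a, b] for a, b in product(A, B)) / (len(A) * len(B))

avg = cluster_flow(S, A={0, 1}, B={1})
```

With the full corpus there would be one S matrix per adjacent window pair, and the profiles would concatenate rows and columns across all of them before the complete-link clustering step.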

[Figure 2 appears here. Panels: (a) 1980–1983 to 1984–1988; (b) 1986–1988 to 1989–1991; (c) 1989–1991 to 1992–1994; (d) 1992–1994 to 1995–1998; (e) 2002–2004 to 2005–2007. Nodes are the topic clusters (Early NLU, Classic Ling., Automata, Discourse, Gov't, Early Prob., Prob. Methods, Ling. Supervision, Big Data NLP); edges are labeled with author-flow volumes.]

Figure 2: Author flow between topic clusters in five key time periods. Clusters are sized according to how many authors are in those topics in the first time period of each diagram. Edge thickness is proportional to volume of author flow between nodes, relative to the biggest flow in that diagram (i.e. edge thicknesses are not comparable across diagrams).

sition of Verb Subcategorization. People also begin to move specifically from the MUC Bakeoffs into a second cluster we call Probabilistic Methods, which in this very early stage focused on Evaluation Metrics and Experimental/Evaluation Methods. People working in the "Automata" cluster (Tree Adjoining Grammar, Parsing, and by this point Finite State Methods) continue working in these topics.

By Figure 2d, the Early Probability topics are very central, and probabilistic terminology and early tasks (tagging, collocations, and verb subcategorization) are quite popular. People are now moving into a new cluster we call "Linguistic Supervised", a set of tasks that apply supervised machine learning (usually classification) to tasks for which the gold labels are created by linguists. The first task to appear in this area was Named Entity Recognition, populated by authors who had worked on MUC, and the core methods topics of Machine Learning Classification and Linguistic Annotation. Other tasks like Word Sense Disambiguation soon followed.

By Figure 2e, people are leaving Early Probability topics like part of speech tagging, collocations, and non-statistical MT and moving into the Linguistic Supervised (e.g., ) and Probabilistic Methods topics, which are now very central. In Probabilistic Methods, there are large groups of people in Statistical Parsing and N-grams. By the end of this period, Prob Methods is sending authors to new topics in Big Data NLP, the biggest of which are Statistical Machine Translation and Sentiment Analysis.

In sum, the patterns of participant flows reveal how sets of topics assume similar roles in the history of the ACL. In the initial period, authors move mostly between early NLP and classic linguistics topics. This period of exchange is then transformed by the arrival of government bakeoffs that draw authors into supervised linguistics and probabilistic topics. Only in the 2000's did the field mature and begin a new period of cohesive exchange across a variety of topics with shared statistical methods.

Figure 3: Overlap of authors in successive 3-year time periods over time. The x-axis indicates the first year of the 6-year time window being considered. Vertical dotted lines indicate epoch boundaries, where a year is a boundary if the first time period is entirely in one epoch and the second is entirely in the next.

5 Member Retention and Field Integration

How does the ACL grow or decline? Do authors come and go, or do they stay for long periods? How much churn is there in the author set? How do these trends align with the epochs we identified? To address these questions, we examine author retention over time: how many authors stay in the field versus how many enter or exit.

In order to calculate membership churn, we calculate the Jaccard overlap in the sets of people that author in adjacent 3-year time periods. This metric reflects the author retention from the first period to the second, and is inherently normalized by the number of authors (so the growing number of authors over time doesn't bias the trend). We use 3-year time windows since it's not unusual for authors to not publish in some years while still remaining active. We also remove the bulk of one-time authors by restricting the authors under consideration to those who have published at least 10 papers, but the observed trend is similar for any threshold (including no threshold). The first computation is the Jaccard overlap between those who authored in 1980–1982 and those who authored in 1983–1985; the last is between the author sets of the 2003–2005 and 2006–2008 time windows. The trend is shown in Figure 3.

The author retention curve shows a clear alignment with the epochs we identified. In the first

epoch, the field is in its infancy: authors are working in a stable set of topics, but author retention is relatively low. Once the bakeoff epoch starts, author retention jumps significantly: people stay in the field as they continue to work on bakeoff papers. As soon as the bakeoffs end, the overlap in authors drops again. The fact that author retention rocketed upwards during the bakeoff epoch is presumably caused by the strong external funding incentive attracting external authors to enter and repeatedly publish in these conferences.

To understand whether this drop in overlap of authors was indeed indicative of authors who entered the field mainly for the bakeoffs, we examined authors who first published in the database in 1989. Of the 50 most prolific such authors (those with more than 8 publications in the database), 25 (exactly half) were speech recognition researchers. Of those 25 speech researchers, 16 exited (never published again in the ACL conferences) after the bakeoffs. But 9 (36%) of them remained, mainly by adapting their (formerly speech-focused) research areas toward natural language processing topics. Together, these facts suggest that the government-sponsored period led to a large influx of speech recognition researchers coming to ACL conferences, and that some fraction of them remained, continuing with natural language processing topics.

Despite the loss of the majority of the speech recognition researchers at the end of the bakeoff period, the author retention curve doesn't descend to pre-bakeoff levels: it stabilizes at a consistently higher value during the transitory epoch. This may partly be due to these new researchers colonizing and remaining in the field. Or it may be due to the increased number of topics and methods that were developed during the government-sponsored period. Whichever it is, the fact that retention didn't return to its previous levels suggests that the government sponsorship that dominated the second epoch had a lasting positive effect on the field.

In the final epoch, author retention monotonically increases to its highest-ever levels; every year the rate of authors publishing continuously rises, as does the total number of members, suggesting that the ACL community is coalescing as a field. It is plausible that this final uptick is due to funding (governmental, industrial, or otherwise), and it is an interesting direction for further research to investigate this possibility.

In sum, we observe two epochs where member retention increases: the era of government bakeoffs (1989–1994) and the more recent era where NLP has received significantly increased industry interest as well as government funding (2002–2008). These eras may thus both be ones where greater external demand increased retention and cohesion.

6 Conclusion

We offer a new people-centric methodology for computational history and apply it to the AAN to produce a number of insights about the field of computational linguistics.

Our major result is to elucidate the many ways in which the government-sponsored bakeoffs and workshops had a transformative effect on the field in the early 1990's. It has long been understood that the government played an important role in the field, from the early support of machine translation to the ALPAC report. Our work extends this understanding, showing that the government-supported bakeoffs and workshops from 1989 to 1994 caused an influx of speech scientists, a large percentage of whom remained after the bakeoffs ended. The bakeoffs and workshops acted as a major bridge from early linguistic topics to modern probabilistic topics, and catalyzed a sharp increase in author retention.

The significant recent increase in author overlap also suggests that computational linguistics is integrating into a mature field. This integration has drawn on modern shared methodologies of statistical methods and their application to large scale corpora, and may have been supported by industry demands as well as by government funding. Future work will be needed to see whether the current era is one much like the bakeoff era, with an outflux of persons once funding dries up, or if it has reached a level of maturity reflective of a well-established discipline.

Acknowledgments

This research was generously supported by the Office of the President at Stanford University and the National Science Foundation under award 0835614. Thanks to the anonymous reviewers, and to Steven Bethard for creating the topic models.

References

A. Aris, B. Shneiderman, V. Qazvinian, and D. Radev. 2009. Visual overviews for discovering key papers and influences across research fronts. Journal of the American Society for Information Science and Technology, 60(11):2219–2228.

C. Au Yeung and A. Jatowt. 2011. Studying how the past is remembered: Towards computational history through large scale text mining. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, pages 1231–1240. ACM.

S. Bird, R. Dale, B.J. Dorr, B. Gibson, M. Joseph, M.Y. Kan, D. Lee, B. Powley, D.R. Radev, and Y.F. Tan. 2008. The ACL Anthology Reference Corpus: A reference dataset for bibliographic research in computational linguistics. In Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC'08), pages 1755–1759.

D.M. Blei, A.Y. Ng, and M.I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(5):993–1022.

D.A. Dahl, M. Bates, M. Brown, W. Fisher, K. Hunicke-Smith, D. Pallett, C. Pao, A. Rudnicky, and E. Shriberg. 1994. Expanding the scope of the ATIS task: The ATIS-3 corpus. In Proceedings of the Workshop on Human Language Technology, pages 43–48. Association for Computational Linguistics.

S. Gerrish and D.M. Blei. 2010. A language-based approach to measuring scholarly impact. In Proceedings of the 26th International Conference on Machine Learning.

T.L. Griffiths and M. Steyvers. 2004. Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America, 101(Suppl 1):5228.

R. Grishman and B. Sundheim. 1996. Message Understanding Conference-6: A brief history. In Proceedings of COLING, volume 96, pages 466–471.

D. Hall, D. Jurafsky, and C.D. Manning. 2008. Studying the history of ideas using topic models. In Proceedings of EMNLP 2008.

C.T. Hemphill, J.J. Godfrey, and G.R. Doddington. 1990. The ATIS spoken language systems pilot corpus. In Proceedings of the DARPA Speech and Natural Language Workshop, pages 96–101.

P. Price. 1990. Evaluation of spoken language systems: The ATIS domain. In Proceedings of the Third DARPA Speech and Natural Language Workshop, pages 91–95. Morgan Kaufmann.

D.R. Radev, P. Muthukrishnan, and V. Qazvinian. 2009. The ACL Anthology Network Corpus. In Proceedings of the 2009 Workshop on Text and Citation Analysis for Scholarly Digital Libraries, pages 54–61. Association for Computational Linguistics.

Y. Tu, N. Johri, D. Roth, and J. Hockenmaier. 2010. Citation author topic model in expert search. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pages 1265–1273. Association for Computational Linguistics.
