<<

Measuring discursive influence across scholarship

Aaron Gerow[a,1], Yuening Hu[b], Jordan Boyd-Graber[c,d,e,f], David M. Blei[g,h,i], and James A. Evans[j,k,1]

[a] Department of Computing, Goldsmiths, University of London, New Cross, London SE14 6NW, United Kingdom; [b] Cloud AI, Google, Sunnyvale, CA 94809; [c] Department of Computer Science, University of Maryland, College Park, MD 20742; [d] University of Maryland Institute for Advanced Computer Studies, University of Maryland, College Park, MD 20742; [e] iSchool, University of Maryland, College Park, MD 20742; [f] Language Science Center, University of Maryland, College Park, MD 20742; [g] Department of Statistics, Columbia University, New York, NY 10027; [h] Department of Computer Science, Columbia University, New York, NY 10027; [i] Data Science Institute, Columbia University, New York, NY 10027; [j] Department of Sociology, University of Chicago, Chicago, IL 60637; and [k] Computation Institute, University of Chicago, Chicago, IL 60637

Edited by Peter S. Bearman, Columbia University, New York, NY, and approved February 12, 2018 (received for review November 18, 2017)

Assessing scholarly influence is critical for understanding the collective system of scholarship and the history of academic inquiry. Influence is multifaceted, and citations reveal only part of it. Citation counts exhibit preferential attachment and follow a rigid “news cycle” that can miss sustained and indirect forms of influence. Building on dynamic topic models that track distributional shifts in discourse over time, we introduce a variant that incorporates features, such as authorship, affiliation, and publication venue, to assess how these contexts interact with content to shape future scholarship. We perform in-depth analyses on collections of physics research (500,000 abstracts; 102 years) and scholarship generally (JSTOR repository: 2 million full-text articles; 130 years). Our measure of document influence helps predict citations and shows how outcomes, such as winning a Nobel Prize or affiliation with a highly ranked institution, boost influence. Analysis of citations alongside discursive influence reveals that citations tend to credit authors who persist in their fields over time and discount credit for works that are influential over many topics or are “ahead of their time.” In this way, our measures provide a way to acknowledge diverse contributions that take longer and travel farther to achieve scholarly appreciation, enabling us to correct citation biases and enhance sensitivity to the full spectrum of scholarly impact.

scholarly influence | science of science | probabilistic modeling

Significance

Scientific and scholarly influence is multifaceted, shifts over time, and varies across disciplines. We present a dynamic topic model to credit documents with influence that shapes future discourse based on their content and contextual features. We trace discursive innovation in scholarship and identify the influence of particular articles along with their authors, affiliations, and journals. In collections of science, social science, and humanities research spanning over a century, our measure helps predict citations and reveals signals that recognize authors who make diverse contributions and whose contributions take longer to be appreciated, allowing us to compensate for bias in citation behavior.

Scholarship is a complex and chaotic process by which authors create, share, and promote new concepts, theories, methods, data, and findings. Subsequent research adopts scholarly innovations through direct contact with the original work or through follow-on research, review articles, seminars, and personal conversation. Researchers may acknowledge such influence by citing the work on which they draw or passively adopting the vocabulary, ideas, assumptions, approaches, or insights without explicit references. In this way, citations constitute an important but incomplete signal of influence, the full spectrum of which includes direct and indirect (acknowledged and ignored) influences from across the scholarly landscape. We hold that a key dimension of influence involves the ability to change scholarly discourse, which itself interacts with contextual features, such as the status of publication venues, authors’ namesakes, and institutions that host the research. Here, we seek to detect discursive influence and the factors that shape it.

Research impact is commonly assessed by the number of explicit references to a document, author, or journal. A reference list represents a complex combination of considerations, including perceived influence of important arguments, methods, data, or findings but also, what authors, reviewers, and editors believe should be cited to appease readers or enhance their own status. provides measures of author, journal, and institutional impact (1), which tabulate functions of the citation distribution over articles and time. New “alt metrics” go beyond citations to assess online views, downloads, likes, and tweets or estimate readership, cocitation, and diversity (2, 3). Like citations, these metrics provide an informative but distorted portrait of influence: they exhibit rich-get-richer feedback (4–6) and manifest manipulation by authors and editors (7–9). Authors may cite their own work to drive traffic to prior articles (10, 11). Less nefariously, citations may certify membership in a community, reveal intellectual alliances, or reflect the status aspirations of a paper (12–14). Moreover, authors are often unable to cite all of their influences, selecting citations based on cognitive availability and space constraints (15, 16). Over time, citations of a paper tend to decay due to a preference for recentness (17), even when a paper continues to be influential. Further complicating the matter, citation cultures differ across disciplines and time periods (18, 19).

The primary artifact of scholarly work is not citations but text. Here, we offer a measure of influence based on text and context as they shape future discourse. We use scholarly discourse to reference the amalgam of language in academic publications, that traces a sequence of communication and scholarly argument. We adopt a broad notion of discourse as the overall state of scholarship expressed in published work. This approach avoids some challenges of discourse analysis while raising new ones. Analyzing specific aspects of scholarly conversation requires a selection process, the result of which may not generalize. Alternatively, analyzing a vast number of publications presents technical challenges. To model more general notions of discourse, we use a class of probabilistic techniques called topic models that extract patterns of term–document co-occurrence and yield semantically related word distributions called “topics.” Here, we develop a dynamic topic model designed to explain not only which documents are influential but how their influence is derived from content and its interaction with features, like authorship, affiliation, and publication venue. We build on models of lexical change (20, 21) and in particular, the document

Author contributions: A.G., Y.H., J.B.-G., D.M.B., and J.A.E. designed research; A.G., Y.H., J.B.-G., and J.A.E. performed research; A.G., Y.H., J.B.-G., and J.A.E. contributed new reagents/analytic tools; A.G. and J.A.E. analyzed data; and A.G., Y.H., J.B.-G., D.M.B., and J.A.E. wrote the paper.
The authors declare no conflict of interest.
This article is a PNAS Direct Submission.
This article is distributed under Creative Commons Attribution-NonCommercial-NoDerivatives License 4.0 (CC BY-NC-ND).
Data deposition: The source code of the model’s implementation has been deposited on GitHub and is available at https://github.com/gerowam/influence.
1 To whom correspondence may be addressed. Email: [email protected] or jevans@uchicago.edu.
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1719792115/-/DCSupplemental.
Published online March 12, 2018.
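
As an illustration of the term–document co-occurrence patterns that topic models take as input, the following minimal Python sketch builds a count matrix from a toy corpus. The function name term_document_matrix and the three example documents are hypothetical; they are not drawn from the collections analyzed here or from the deposited implementation, and fitting a topic model to such a matrix is a separate step that is not shown.

```python
from collections import Counter


def term_document_matrix(docs):
    """Return (vocabulary, matrix), where matrix[i][j] counts term i in document j."""
    tokenized = [doc.lower().split() for doc in docs]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    counts = [Counter(toks) for toks in tokenized]
    matrix = [[c[term] for c in counts] for term in vocabulary]
    return vocabulary, matrix


# Toy corpus (hypothetical; for illustration only).
docs = [
    "graphene band structure",
    "band theory of graphite",
    "citation counts and influence",
]
vocab, tdm = term_document_matrix(docs)

# The row for "band" shows which documents share that term.
band_row = tdm[vocab.index("band")]
```

Terms that repeatedly co-occur across documents produce similar rows in this matrix, which is the regularity a topic model summarizes as word distributions.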

3308–3313 | PNAS | March 27, 2018 | vol. 115 | no. 13 www.pnas.org/cgi/doi/10.1073/pnas.1719792115 Downloaded by guest on October 1, 2021 Downloaded by guest on October 1, 2021 eo tal. et Gerow citation with correlated per- and DIM bound in likelihood influence counts: the the of in with Despite Estimates convergence DIM. comparably plexity. of with performed terms variant in model DIM our our compare complexity, to additional ACL-ARC) used Corpus; was Compu- Reference (23) for Anthology (Association Linguistics research tational linguistics ( influence computational on of covariates of effects marginal mated (word topics θ (i) results: time, important over mixtures three produces model The Results schol- how into insights of range unfolds. a influence opens arly each this in see, influence will from we detract As topic. or add covariates how of denoted estimates regression, latent the document’s solve metadata. that article A coefficients in The traced origins. contexts associated with its interaction its and topic, particular a influence in Al- influence expla- discursive influence. robust of enables on this nations innovation, journal technical and modest marginal a institution, the though estimates authorship, model of regres- the latent effect covariates, a document incorporating on By sion arises. influence how establish us corpus. allows each and for 32) complexity (31, topic topics the of discover numbers to possible over (SI nonparamet- search used Bayesian ric data a iden- the approximates to topics approach topics many This many how Appendix). for with threshold fit reasonable data a same tify the fashion of data-driven models motivated, static theoretically using a of in number done the was Choosing of topics 30). number (27, the dimensions as discursive interpreted be retained can it topics, but implications, of empirical “background” number coherent the from specific, Choosing to shift topics. 
words) semantically collection-wide of topics (composed those topics as assigned are number topics words specified Statistically, to a throughout. present that the are assume about topics models assumptions of such make particular, they In although data. 29), 28, (21, erature composition. its docu- explain learns not DIM does docu- topic, but (22). topics a individual DIM in the how influence is measure ment topics explicitly future change to ments model only) document– edge, a with fit z is document (27). mixture, Each topic (20). topic time a from over topics typify change periods derive which into over models binned topic of distributions documents dynamic likely probability texts, to most time-ordered In refer the and words, documents observed of set a nlsso uhdsorei ag etcletos(42) Top- (24–26). collections it the text that enable large extent in ics models discourse the topic such to of Probabilistic influential analysis discourse. be to future document affects a consider We Influence of Model Robust and influence. citations scholarly of of scope understanding limited composite a the measure provide address dynamic helps This contribution it. without time of over discourse contribution future document’s simulating studied a by intro- widely assess also to a mechanism We of in a (23). value duce research citations the linguistics predict computational affirm help of We corpus they individually. as topic features, publica- for each such and for institutions, allows venues authors, modeling This tion of explicitly kinds content. by different influence. explanation alongside comparing model of an the covariates composition such but document-level the offers topics, explain variant future to Our change a mechanism they receive no documents how offers DIM, on the based In score (22). 
(DIM) model influence d k ,k aitoa siae fiflec ( influence of estimates variational (ii) ), u ain dsa motn xlntr ehns to mechanism explanatory important an adds variant Our lit- scientific analyzing for effective proven have models Topic {k ,n ∈N 0 , ρ · · · vrtevocabulary the over 0 = , k .22 K θ } d (K n ahwr sfi ihatpcassignment topic a with fit is word each and , r icvrdfo eia oocrec in co-occurrence lexical from discovered are 10 ), = β ˆ t ,k n oi itrswti documents, within mixtures topic and , ρ ` d 0 = ,k ` N ae nhwi hne future changes it how on based , d .21 ,k {t h rt(n oorknowl- our to (and first The . sapouto t otn and content its of product a is , 0 , (K · · · 20 ), = K , t T sipratadhas and important is , loigtpc to topics allowing }, ` ˆ d ρ ,k 0 = esti- (iii) and ), µ ˆ .21 k .Acorpus A ). (K µ k 50 ), = offer , hi rlfins.Ato essec orltswt hi most their by with (ρ papers correlates scaled cited persistence mixtures, Author document–topic prolificness. establish- of their sum author authors’ measure (Eq. of less We entropy are persistence citations. but through an high influence to ment with discursive receive contribute high authors to who post likely Authors of topics field. works diverse their more touchstone in cite presence established to pressure sional innovations. indirect and direct whereas critiques, both nondiscursive or captures data capture influence new influ- of uniquely document provision the citations high like have Third, contributions, to 1C). tend (Fig. and ence discovery as before experience delay influential which papers, a enduringly (SB) beauty” are “sleeping so-called that as well papers credits influ- Document however, of cycle. most news ence, receive scientific this that within articles references favor their patterns citation atten- such, of tail As diminishing a tion. by followed uptake rapid distribution with log-normal time skewed, over a follow to tend citations topics. 
Second, diverse greater span citations careers identifies whose scientists discourse itinerant First, for in whereas influence persist science, ways. of who three subfields authors narrow favoring least constitute attachment, here at preferential discourse exhibit in and signals Citations pattern. complementary this to cleanly much adding impact. discursive of neither and assessment with the citations another, to from information one cited signals for are the substitute contributions Here, influence discursive work. where future sci- credit, by the standard of the in cycle in Papers participate citations. entific rela- quadrants of low/low quadrants number and these and discur- high/high influence define relative median we their to Empirically, on tive characterized bins. based is high–low quadrant papers Each by count. of citation kinds and influence four sive out lays 2 on 1D). (Fig. effect Nobel a positive without those more than significantly influence document’s quan- a their lattice have in overall, Nobel byline others. Laureates, among Anderson’s superconductors and and chromodynamics physics tum detec- atomic in byline and Penzias’ tors does influ- as for topics effect cosmology-related positive in magnetic a ence of lends byline model Witten’s a transitions. as phase and spin-glasses arrays, cofounded detector cosmic Arno Anderson new Philip of theory, Wit- development the string Ed and the bang and understanding 1A). big to cosmology greatly (Fig. contributed in telescope topics radio advances Penzias’ related outsized driven few, authors’ has by a ten traced in substan- is influence and specialization marginal meaningful example, are For innovations tive. discursive We that the affirm Appendix). trace (SI to counts, physics of (I tion us influence in Most allow document concepts publications. 
that and found APS modern subfields other of typify by emergence topics cited documents resulting were of collection the 90% sample APS where the The environment, in model. citation rich influence a dynamic provides static 37-topic a then, and our estimate solution, 37-topic fit to optimal used an discover was model, 500-topic (APS) Society Physical American Physics. in Influence influence. discursive fea- of pre- notion additional our on to these effect explanation observable adds modeling tures an Furthermore, have citations. model dicting our by captured tures all transform; (K correlations: larger substantially and and ae hni hymk oesatrdcnrbtos aesin cited Papers highly contributions. a scattered have more to make likely they more if are than topics paper of less range productive narrow have remain a average who in Authors on 1B). but (Fig. citations, documents influential more receive (ρ authors paper sistent influential most author’s epooeta cetssadshlr xeineprofes- experience scholars and scientists that propose We less adhere however, quadrants, off-diagonal the in Papers Fig. impact. measure to way a provides influence Discursive 20), = ρ 0 = . 18 ρ 0 = C (K d p 0 = at , . .Orvratyeddsgicnl stronger significantly yielded variant Our 75). = 23 < .57, PNAS hscnrsta h xgnu fea- exogenous the that confirms This 0.01 ρ (K olcino eerhpbihdb the by published research of collection A 0 = p ,and 50), = | < . 32 ac 7 2018 27, March ,mc oeta ihtesame the with than more much 0.01), (p < ,cluae steinverse the as calculated 11), d .Sm ncoa results anecdotal Some 0.01). (Eq. ) 0 = ρ ρ 0 = .34, 0 = | .22 orltdwt cita- with correlated 8) .28 o.115 vol. p < (K (K .Mr per- More 0.01). Fisher 75; = 10), = | o 13 no. ρ 0 = | 3309 .27 z


Fig. 1. (A) Box plots of author coefficients for each APS topic. Medians are shown as a red line within each box, the first quartiles are within the box, and the second and third quartiles are within bands. CMP, condensed matter physics; HEP, high energy physics. Note the wider distributions for more general topics 1 and 36. Overlaid are coefficients for three physicists. Positive values mean that an author’s byline adds influence, whereas a negative value means it detracts. A positive coefficient does not necessarily mean that a document is highly influential itself, only that it was more influential than if it had been written by the average author. (B) Locally weighted scatter plot smoothing (LOWESS) curves comparing document influence (dashed red line; left y axis) and citations (solid blue line; right y axis) with author persistence (Eq. 11) (x axis). A consistent and statistically significant trend is established: more persistent authors tend to produce more highly cited but less influential documents, whereas less persistent authors have more influential but less cited documents. (C) LOWESS curve fit to the plot of documents’ influence vs. their SB score. Error bars are ±2 SE of the mean in both dimensions. (D) Distribution of authors’ marginal effect on influence for Nobel Laureates compared with all other authors.
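
The LOWESS curves in panels B and C rest on locally weighted regression. The sketch below shows one common single-pass form of that technique, assuming tricube weights and a local linear fit over the nearest fraction of points. The function name lowess, the frac parameter, and the synthetic data are illustrative assumptions; the sketch omits the robustness iterations of full LOWESS and is not the code used to draw the figure.

```python
def lowess(xs, ys, frac=0.5):
    """Locally weighted linear smoothing of ys on xs (single pass, tricube weights)."""
    n = len(xs)
    k = max(2, int(round(frac * n)))  # number of neighbours in each local fit
    fitted = []
    for x0 in xs:
        # Indices of the k points closest to x0.
        nearest = sorted(range(n), key=lambda i: abs(xs[i] - x0))[:k]
        d_max = max(abs(xs[i] - x0) for i in nearest) or 1.0
        # Tricube kernel: weight 1 at x0, falling to 0 at the farthest neighbour.
        weights = [(1 - (abs(xs[i] - x0) / d_max) ** 3) ** 3 for i in nearest]
        sw = sum(weights)
        mx = sum(w * xs[i] for w, i in zip(weights, nearest)) / sw
        my = sum(w * ys[i] for w, i in zip(weights, nearest)) / sw
        sxx = sum(w * (xs[i] - mx) ** 2 for w, i in zip(weights, nearest))
        sxy = sum(w * (xs[i] - mx) * (ys[i] - my) for w, i in zip(weights, nearest))
        # Weighted least-squares slope; fall back to the local mean if degenerate.
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


# Synthetic, exactly linear data (hypothetical): the smoother should reproduce the line.
xs = [float(i) for i in range(10)]
ys = [2.0 * x + 1.0 for x in xs]
smooth = lowess(xs, ys)
```

A smaller frac follows local structure more closely, while a larger frac gives a smoother curve.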

the high citation/low influence bin include those from authors whose papers one “must cite,” a self-reinforcing process where papers by persistent authors become required scholarly boundary markers. Authors of papers in this region are more persistent than those in any other quadrant (SI Appendix, Fig. S15). Conversely, papers in the low-citations/high-influence bin disproportionately credit itinerant authors who contribute to many fields.

We find that norms of citation encourage attention to recent work. We evaluate this by examining how articles make dynamic contributions to discourse over time. Although document influence is a static attribute representing a paper’s effect on overall discourse, sometimes this influence occurs immediately, while in other cases, it takes time to foment. To establish the dynamics of discursive impact, we measure a document’s topic contribution by estimating how different future topics would have been without it (Eq. 10). Fig. 3 illustrates this dynamic contribution for two different kinds of SB papers. Felix Bloch’s 1946 “Nuclear induction” (33) is typical in that it had a small impact on most topics but a sizeable one on a few. It is also typical in that it received a spike of citations shortly after publication, which then began to decay. Near the turn of the century, we see a significant spike in both the article’s citations and its contribution of the Hall effect to the condensed matter physics topic. This is when transistors in microchips became small enough to be affected by the quantum Hall effect and the nuclear magnetic moments described by Bloch. A second paper, the top cited in our sample from 1947, is a “pure” SB, which garnered relatively few citations until a sudden spike in 2004. Philip Wallace’s “The band theory of graphite” (34) described the structure of graphene, which at the time, was only observable on an iron film. Theoretically, graphene was exceptionally strong and “growable” in a block structure, but the technology to characterize and isolate it without a substrate was not discovered until 2004, which was awarded the 2010 Nobel Prize in Physics. Wallace’s paper (34) contributed modestly to most topics and significantly to a few, but in 2001, its contribution to the materials topic jumps significantly.

With a sample of 500 articles from each percentile of document influence, we assessed the variance and half-life of citations compared with topic contribution. Half-life is the number of time steps after which the score is one-half what it was initially. Documents in higher percentiles of influence have longer half-lives, and in every percentile, citations have lower variance and shorter half-lives than topic contribution (SI Appendix). This suggests a conventional “statute of limitations” through which scholars no longer need to cite older articles with persistent influence, possibly because those ideas have entered popular consciousness or because their original authors are no longer around to claim them.

Other articles defy the scientific news cycle not because they are persistently influential, but because they garner little attention for a long period before a spike of citation attention, not unlike Wallace’s paper (34) discussed above. The SB index (35) identifies such articles as a function of the convexity and time-to-maximum citations over time. An article that is dormant for a long time, after which it suddenly receives a burst of citations, has a high SB score, whereas the typical article, which receives a burst of citations on publication that decay thereafter, receives a low score. SB scores correlated four times more strongly with document influence (r = 0.20, p < 0.01) than with citations (r = 0.05, p < 0.01), suggesting that papers ahead of their time nevertheless receive fewer citations than their influence would predict. Such high-influence/low-citation articles violate the standard influence trajectory.

Finally, high citation/low influence papers contain important nontextual features, while high-influence/low-citations papers tend to have influence that may not be credited in citations. For example, the most cited paper in the ACL-ARC collection introduced the Penn Treebank (36), an important resource that accelerated research in parsing. Nevertheless, it was the resource and

3310 | www.pnas.org/cgi/doi/10.1073/pnas.1719792115 Gerow et al. Downloaded by guest on October 1, 2021 Downloaded by guest on October 1, 2021 ujc n ob ie rmohrsbet.Mroe,influential Moreover, subjects. their other beyond from work cited cite be away. to to farther and likely subject more from both cited are refer- are papers their Influential papers in influential disciplines Likewise, across pref- ences. farther have a 4 reach 5% show (Fig. papers and citations influential papers 1 incoming influential distance and for have outgoing erence 10% Both while 2. sub- 0), distance within are distance citations Most (i.e., two. ject or one, zero, grouped distance: were their distinct citations by by taxonomy, captured subject JSTOR’s variation Using subject topics. the of measure to our sensitive habits, is S6 citation influence different Table have subjects Appendix , while (SI that, positive document– more by nificantly scaled topics (θ were mixtures citations individual When topic within weak. correlation influence and varied The and was cultures. counts citation citation doc- different between includes sample vastly our from that corre- means uments modestly JSTOR of were diversity full < The citations (ρ to ics. influence that model document found dynamic with and 53-topic lated char- JSTOR the best in fit topics we texts 53 corpus, using static, the that a acterized discovering estimating and After traditions. model offers academic 500-topic JSTOR of in sample research wider habits, a shared with Repository. munity JSTOR in Reach Scholarly field. a within assumed become has innova- language conceptual whose in while from tors etc.), nontextual are these refutations, papers primarily Excluding research-killing high-influence/low-citation are methods, APS. 
that (data, for nature innovations authors 1942 from contributed are until papers who mean influence up global citation/low which high the period, during papers, of early burn-in year SDs channel the 2 a that within to as corpus was year influence the first sample. document the of our mean to defined years prior empirically first published the We documents the artificially from in borrowed also contributions ideas is articles discursive This for whose generations. case over innovators pervasive early became by authored region. they citations influence/low Only result, high emulation. a the discursive in As stimulate were not 6% terminology. new do or birth but with- citations concepts than refutations attract new rather direct producing or inquiry rebuttals, out of comments corrections, lines Critical offering existing influence. ones, low kill have replies but where the and cited quadrant, right to highly lower are the publications in they were APS 54% When 2, reply-type Fig. (22). of and framework impact comment- its up mapped made we that innovations conceptual not eo tal. et Gerow i.2. Fig.

ayhgl nunilbtudrie aestne obe to tended papers undercited but influential highly Many Greater Influence rmwr o coal mat iain s icrieinfluence. discursive vs. citations impact: scholarly for framework A Low citations Low citations Low discursiveinfluence High discursiveinfluence Delayedordistant impact Persistently impactful - - Self-reinforcing Ignoredbypeers - - Detracted bycovariates Small,narrowimpact - - Dispersed impactacross Innovative ideas - - topics andtime Low /High Low / d ,k C Influence vs.Citations d ,hwvr h orltoswr l sig- all were correlations the however, ), More Citations hl hsc om com- a forms physics While 0.17, B High citations Low discursiveinfluence High citations High discursiveinfluence and Preferential attachment Persistent,focused authors - - Focused contributionto - Self-reinforcing Acknowledged bypeers - - Boosted bycovariates Large, broadimpact - - - p established fields(topics) Non-textual influence < .Ti hw that shows This C). High / High /Low vraltop- all over 0.01) .Ti suggests This ). yispbiain hl h eann 4g nhne.Cagle’s unchanged. go 44 remaining changed the significantly in while were publication, little topics its offers 9 by contribution: identification” topic future of for way sys- the turtles “A in marking technique, cited a for highly Cagle explaining tem remains By R. title, research. its zoology Fred to and by up ecology lives identification” which paper, future system The the for (37). “A In turtles 1939’s S9). was marking Fig. document for Appendix, top-cited (SI the percentile, JSTOR) lowest the for per- after each (1930 published in were cutoff documents that burn-in distribution cited influence most the the of of centile top- some of sampled variety be We a to ics. may contribute documents who some but topic-specific, fields, formal contributions. 
specific and more discourse for “narra- of empirical credited likely flow in more from total con- the citations authors authors influenced have having to than that for (SI likely them more suggests nonexistent on much ferred are cases, This areas many driven S7). tively” in Table biology, and cell Appendix, correlation weaker the (e.g., considerably topics), fields statistics was various topics). science and education natural chemistry, physical and and theory, mathematical literary In with philosophy, correlated (e.g., on strongly ences impact more ( counts marginal was citation their Authors’ is influence citation authors. documents’ how for in their influence variation compara- topicwise to produced found related topic also each We to results. fully ble fit the with models above, effects Similar as similar Appendix). form estimated same neg- model the link specified of logistic regressions Using counts. binomial citation ative actual predict helps also [β influence document for n iens ihdcmn nuneadpbiaindate, predict- publication model and form logit influence the of a document models: with influ- citation Document citedness in 4A). ing predictive (Fig. papers also outright influential cited is be ence more papers to JSTOR), which likely in more at were (29% Looking citations cited. any never receive are papers many them. because claim work, to influenced present the not space, are from scientific authors distant originating also topically the where or but that old time work sufficiently influential cite only is to illustrated obligation not the limitations from in scholars of releasing interest. operate statute local citation higher may more the a above of that works have than suggests work citations This distant to influence from of citations ratio receive that articles i.3. Fig. ae loehbt aesiei iain ace yacicdn pk in spike coincident a (labeled). by topic Each matched specific time. 
citations a over in to slowly spike contribution late diminishes a which exhibits topics, also few paper a in change affect they (Lower (34) (Upper (33) induction” h eainhpbteniflec n iain nJTRis JSTOR in citations and influence between relationship The rdcigcttosi ooiul ifiut(5 7,i part in 17), (15, difficult notoriously is citations Predicting oi otiuin (Eq. contributions Topic .Bt aesfaue h yia otiuinpol where profile contribution typical the featured papers Both ). C d > 0 ∼ PNAS n hlpWlaes“h adter fgraphite” of theory band “The Wallace’s Philip and ) I µ ˆ d Auth × | (I t d ac 7 2018 27, March × d n iain o ei lc’ “Nuclear Bloch’s Felix for citations and 10) 1 + 0.42, = ) C Auth siae togpstv effect positive strong a estimated nhmnte n oilsci- social and humanities in ) P > | o.115 vol. |z | 0 = β .Influence .000]. (I | d o 13 no. 0.33 = ) | 3311 (SI t d ,

Fig. 4. (A) Violin plot of document influence (y axis; kernel density estimate bandwidth = 0.1) for cited and uncited documents in JSTOR grouped by decade; 2010 is omitted as incomplete. (B and C) Histograms and density estimates of (B) incoming and (C) outgoing citations among JSTOR documents grouped by distance in the subject tree.

paper (37) typifies low-influence/high-citation papers that make technical offerings adopted by future research, but that do not incite discursive shifts. In the 50th percentile, 1965’s “Population fluctuations and clutch-size in the great tit” by C. M. Perrins (38) was the most cited (504 times within sample). Perrins’ discursive contribution is spread over a range of topics and time. It immediately shifted discourse in biology- and ecology-related topics, fields where it continues to be cited today as a landmark paper. After publication, topics in statistics, group behavior, sexual health, medicine, and psychology adopted some of the conceptual terminology and remained changed to the end of the corpus. These contributions owe much to Perrins’ use of terms about population dynamics, movement, procreation, and individual variation. Finally, the highest cited paper in the 90th percentile of influence was Sam Peltzman’s 1976 “Toward a more general theory of regulation” (39), which had an influence 3.5 SD above the mean and 701 citations. Recall that influence is a latent variable fit within the model, while contribution is a post hoc estimate of how different topics would have been without a given document. Peltzman’s 1976 paper (39) has a typical contribution profile (significantly influencing a few topics but only slightly altering most), although it sits near the top of the influence distribution. This is because much of its influence was derived extrinsically: it was written by an eminent researcher (whom Wired magazine listed in 2013 as 1 of 28 top scientists without a Nobel Prize), it appeared in the high-impact Journal of Law and Economics, and it was published by the University of Chicago Press, a leading publisher of economics research. “Toward a more general theory of regulation” exemplifies a configuration to which many researchers can relate: good papers are often made more so with contextual boosts, like authorship, venue, and publisher.

Discussion

Our measure of influence tracks changes in future discourse and explicitly identifies the content and context of previous documents that affect these discursive shifts. Document influence provides a direct measure of impact that allows us to disentangle many dimensions of influence across domains of science and scholarship. The model also estimates lasting contributions to topics over time and helps discover influential but undercited work. Our measures not only help predict citations better than previous similar models, but they provide an explanation of what drives influence. Most importantly, the model reveals how discursive innovations are adopted and credited. Alongside citations, discursive influence brings us closer to a full-spectrum estimate of scholarly impact.

Assessing scholarly influence is a retrospective task, and it can be distorted by conflating citations with impact. Not only does our method enable previously impossible analyses of discursive influence, it could help authors decide what to cite and why. On the one hand, citations provide a simplified, censored, and synoptic trace of acknowledged influence, an important trace of impact. On the other hand, contributing change to scholarly discourse is another measurable kind of influence that can be explained over time, across individual subjects, and with respect to authors, institutions, and publication venues, all of which contribute to the complex evolution of scholarship.

Materials and Methods

The APS collection contains 509,007 abstracts from 1913 to 2015 coded with type (article, review, commentary, etc.), journal, authors, and their institutional affiliations. To avoid spurious metadata, only papers with an author found twice and an affiliation that occurred three times were retained. This resulted in 74,459 covariates, τ, over 251,382 documents dating from 1918 to 2015. Stop words, infrequent words, and statistically uninteresting words (by term frequency–inverse document frequency) were discarded. The final vocabulary contained 15,312 tokens. The model was fit with 37 topics, a value chosen by assessing topic use in a static, Bayesian nonparametric model with 500 topics (SI Appendix). Labels were given by three researchers with doctoral degrees in physics.

The JSTOR collection consists of over 2 million documents from 1894 to 2014; 28,861 covariates were coded in τ representing authorship, journal/venue, publisher, and discipline. Documents were excluded if they did not have at least one author with three or more documents or if they were classified in disciplines with a 20-y gap, representing subject instability or death (e.g., railroad science). The final sample contained 428,034 full-text articles. The vocabulary was processed similarly to APS and resulted in 20,155 tokens. A model with 53 topics was estimated, with the number of topics selected by fitting a 500-topic static model. Topics were labeled by the authors and assisted by . Discussion of the data, model specification, a closer inspection of the resulting topics, and the labeling process are in SI Appendix.

The Generative Model. We assume that influence is drawn from a Gaussian, the mean of which is given by a projection on $\tau_{t,d}$:

\[ \ell_{d,k} \sim \mathcal{N}(\mu_k \tau_d, \sigma_\ell^2 I), \tag{1} \]

where $\mu$ is a $K \times S$ matrix of topic-specific coefficients. We assume that $\mu$ is drawn from a centered Gaussian of specified variance:

\[ \mu_k \sim \mathcal{N}(0, \sigma_\mu^2 I). \tag{2} \]

The generative process for each time slice t is as follows.

1) Draw topics $\beta_{t+1} \mid \beta_t, (w, \ell, z)_{t,1:D_t} \sim \mathcal{N}\big(\beta_t + \exp(-\beta_t) \sum_d \ell_d \sum_n w_{d,n} z_{d,n},\ \sigma^2 I\big)$
2) Draw coefficients $\mu_k \sim \mathcal{N}(0, \sigma_\mu^2 I)$
3) For each document d at time t:
   a) Draw $\theta_{t,d} \sim \mathrm{Dir}(\alpha)$
   b) For each word $w_{t,d,n}$
      i) Draw $z_{t,d,n} \sim \mathrm{Mult}(\theta_{t,d})$
      ii) Draw $w_{t,d,n} \sim \mathrm{Mult}(\pi(\beta_{t,z_{t,d,n}}))$
   c) Draw $\ell_{t,d} \sim \mathcal{N}(\mu \tau_{t,d}, \sigma_\ell^2 I)$,

where $\pi(x)$ maps the multinomial parameters $x$ to their mean.

Approximate Inference. The model has latent variables for words, documents, time slices, topics, and covariate coefficients. Collapsed Gibbs sampling and direct expectation maximization are precluded by nonconjugacy in the topic parameters $\{\beta_1, \cdots, \beta_T\}$. Instead, the model estimates variational parameters that minimize the Kullback–Leibler (KL) divergence to the true posterior. Here, the topic parameters are a Gaussian chain governed by the variational parameters $\{\hat\beta_{k,1}, \cdots, \hat\beta_{k,T}\}$ that describe their means. The latent variable $\mu$ is also fit by variational estimation to the approximate $\hat\mu$. The variational distribution for a document’s influence is given by the Gaussian of the mean influence and specified variance. This variational distribution, $q(\beta, \ell, \theta, z, \mu \mid \hat\beta, \hat\ell, \gamma, \phi, \hat\mu)$, is

\[ \prod_{k=1}^{K} q(\beta_{k,1}, \cdots, \beta_{k,T} \mid \hat\beta_{k,1}, \cdots, \hat\beta_{k,T}) \times \prod_{k=1}^{K} q(\mu_k \mid \hat\mu_k) \times \prod_{t=1}^{T} \prod_{d=1}^{D_t} \Big( q(\theta_{t,d} \mid \gamma_{t,d})\, q(\ell_{t,d} \mid \hat\ell_{t,d}) \prod_{n=1}^{N_{t,d}} q(z_{t,d,n} \mid \phi_{t,d,n}) \Big). \tag{3} \]

The simplified objective is fit by expectation maximization. Two terms of the evidence lower bound are related to $\ell$, which requires expectation- and

maximization-step updates that incorporate the projection in Eq. 1. We define $X_{t,k} = \mathrm{Diag}(\exp(-\beta_{t,k}))(w_t \circ z_{t,k})$ and $\Delta\beta_{t,k} = \hat\beta_{t+1,k} - \beta_{t,k}$. With $S$ covariates across $D$ documents, $\tau_t$ is an $S \times D$ matrix coding observed covariates, $\hat\ell_{t,k}$ is a $D$-length vector, and $\hat\mu$ is a $K \times S$ matrix, the values of which will converge to an estimate of each covariate’s effect on $\ell$. The lower bound of $\hat\ell_{t,k}$ is then given by

\[ \mathcal{L}(\hat\ell_{t,k}) = -\frac{1}{2\sigma^2}\, \mathbb{E}_q\!\left[ \big(\Delta\beta_{t,k} - X_{t,k}\hat\ell_{t,k}\big)^2 \right] - \frac{1}{2\sigma_\ell^2} \big(\hat\ell_{t,k} - \hat\mu_k \tau_t\big)^2. \tag{4} \]

This provides the E-step update for influence:

\[ \hat\ell_{t,k} \leftarrow \left( \frac{\sigma^2}{\sigma_\ell^2} I + \mathbb{E}_q\!\left[ X_{t,k}^{\top} X_{t,k} \right] \right)^{-1} \left( \mathbb{E}_q\!\left[ X_{t,k} \right]^{\top} \Delta\beta_{t,k} + \frac{\sigma^2}{\sigma_\ell^2}\, \hat\mu_k \tau_t \right). \tag{5} \]

The lower bound of $\hat\mu_k$ is given by

\[ \mathcal{L}(\hat\mu_k) = -\frac{1}{2\sigma_\ell^2} \sum_{t=1}^{T} \sum_{d=1}^{D_t} \big(\hat\ell_{t,d,k} - \hat\mu_k \tau_{t,d}\big)^2 - \frac{1}{2\sigma_\mu^2}\, \hat\mu_k^2. \tag{6} \]

This yields the M-step update for the coefficients:

\[ \hat\mu_k \leftarrow \left( \sum_{t=1}^{T} \sum_{d=1}^{D_t} \hat\ell_{t,d,k}\, \tau_{t,d}^{\top} \right) \left( \sum_{t=1}^{T} \sum_{d=1}^{D_t} \tau_{t,d}\, \tau_{t,d}^{\top} + \frac{\sigma_\ell^2}{\sigma_\mu^2} I \right)^{-1}. \tag{7} \]

$\hat\mu$ is initialized by random draws from a centered, multivariate Gaussian, and its maximum likelihood estimation is done by variational expectation maximization.

Model-Derived Metrics. Document influence is the sum of topic-proportional influence:

\[ I_d = \sum_{k} \theta_{d,k}\, \hat\ell_{d,k}. \tag{8} \]

This offers a plausible interpretation: a document’s influence is proportional to how much it changes the topics from which it draws. Whereas $\hat\ell_{d,k}$ is a static feature, the topic contribution of a document over time is also accessible. An estimate of document $d$’s contribution to a topic $k$ at time $t'$ can be computed by simulating the topic without document $d$:

\[ \hat\beta^{(-d)}_{t',k} = \hat\beta_{t',k} - \exp\!\big(-\hat\beta_{t'-1,k}\big) \big( w_d\, z_{d,k}\, \hat\ell_{d,k} \big). \tag{9} \]

This provides a conservative estimate, because it overlooks topic drift. The contribution can then be measured as the divergence between the topic with and without $d$:

\[ \mathrm{Contribution}(d_{t',k}) = \mathrm{KL}\!\left( \hat\beta_{t',k} \,\middle|\, \hat\beta^{(-d)}_{t',k} \right). \tag{10} \]

We define author persistence using the inverse independent entropy over their documents’ topic mixtures and their prolificness in terms of documents and timespan. Given an author’s set of documents $A$, we define persistence as

\[ \mathrm{Persistence}(A) = \left( 1 - \frac{H\!\left( \frac{1}{|A|} \sum_{d \in A} \theta_d \right)}{H_K} \right) \cdot \big( |A| + \Delta t(A) \big), \tag{11} \]

where $H(p)$ computes the entropy of probability distribution $p$ as $-\sum_i p_i \log(p_i)$ and $H_K$ is the maximum possible entropy for a $K$ vector. The first term, inverse entropy over an author’s document–topic mixture, is scaled up by the total number of documents and the number of years over which they were published, $\Delta t(A)$. This measure is unbounded: authors can be ever more persistent given more time and more documents in the same topics.

ACKNOWLEDGMENTS. A.G. was supported by a fellowship from the Izaak Walton Killam Foundation. A.G. and J.A.E. were supported by a grant from the Templeton Foundation (to the Metaknowledge Research Network). J.B.-G. is supported by National Science Foundation (NSF) Grant NCSE-1422492. J.A.E. is also supported by Air Force Office of Scientific Research Grant FA9550-15-1-0162 and NSF Grant SBE1158803.

1. Hirsch JE (2007) Does the h index have predictive power? Proc Natl Acad Sci USA 104:19193–19198.
2. Bergstrom CT, West JD, Wiseman MA (2008) The eigenfactor metrics. J Neurosci 28:11433–11434.
3. Fersht A (2009) The most influential journals: Impact factor and eigenfactor. Proc Natl Acad Sci USA 106:6883–6884.
4. Price DdS (1976) A general theory of bibliometric and other cumulative advantage processes. J Am Soc Inf Sci 27:292–306.
5. Wang M, Yu G, Yu D (2008) Measuring the preferential attachment mechanism in citation networks. Phys A 387:4692–4698.
6. Newman MEJ (2004) Coauthorship networks and patterns of scientific collaboration. Proc Natl Acad Sci USA 101:5200–5205.
7. Alberts B (2013) Impact factor distortions. Science 340:787–787.
8. Wilhite AW, Fong EA (2012) Coercive citation in academic publishing. Science 335:542–543.
9. Wenneras C, Wold A (2001) Nepotism and sexism in peer-review. Women, Science, and Technology, eds Wyer M, Barbercheck M, Cookmeyer D, Ozturk H, Wayne M (Routledge, London), pp 46–52.
10. Bartneck C, Kokkelmans S (2010) Detecting h-index manipulation through self-citation analysis. Scientometrics 87:85–98.
11. Zhivotovsky L, Krutovsky K (2008) Self-citation can inflate h-index. Scientometrics 77:373–375.
12. Adam D (2002) Citation analysis: The counting house. Nature 415:726–729.
13. Uzzi B, Mukherjee S, Stringer M, Jones B (2013) Atypical combinations and scientific impact. Science 342:468–472.
14. Whalen R, et al. (2016) Citation distance: Measuring changes in scientific search strategies. Companion to the 25th International World Wide Web Conference (WWW 2016) (International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland), pp 419–423.
15. Stringer MJ, Sales-Pardo M, Amaral LAN (2008) Effectiveness of journal ranking schemes as a tool for locating information. PLoS One 3:e1683.
16. MacRoberts MH, MacRoberts BR (2010) Problems of citation analysis: A study of uncited and seldom-cited influences. J Am Soc Inf Sci Technol 61:1–12.
17. Wang D, Song C, Barabási AL (2013) Quantifying long-term scientific impact. Science 342:127–132.
18. Althouse BM, West JD, Bergstrom CT, Bergstrom T (2009) Differences in impact factor across fields and over time. J Am Soc Inf Sci Technol 60:27–34.
19. Harzing AW, Alakangas S, Adams D (2014) hIa: An individual annual h-index to accommodate disciplinary and career length differences. Scientometrics 99:811–821.
20. Blei DM, Lafferty J (2006) Dynamic topic models. Proceedings of the 23rd International Conference on Machine Learning (ICML 2006) (International Machine Learning Society, La Jolla, CA), pp 113–120.
21. Wang X, McCallum A (2006) Topics over time: A non-Markov continuous-time model of topical trends. Proceedings of the 12th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD 2006) (ACM Press, New York), pp 424–433.
22. Gerrish S, Blei DM (2010) A language-based approach to measuring scholarly impact. Proceedings of the 27th International Conference on Machine Learning (ICML 2010) (International Machine Learning Society, La Jolla, CA), pp 375–382.
23. Bird S, et al. (2008) The ACL Anthology Reference Corpus: A reference dataset for bibliographic research in computational linguistics. Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC 2008) (European Language Resources Association, Paris), pp 1755–1759.
24. Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022.
25. Blei DM (2012) Probabilistic topic models. Commun ACM 55:77–84.
26. Griffiths TL, Steyvers M (2004) Finding scientific topics. Proc Natl Acad Sci USA 101:5228–5235.
27. Chang J, Gerrish S, Wang C, Boyd-Graber JL, Blei DM (2009) Reading tea leaves: How humans interpret topic models. Advances in Neural Information Processing Systems, eds Bengio Y, Schuurmans D, Lafferty J, Williams C, Culotta A (Neural Information Processing Systems Foundation, La Jolla, CA), Vol 22, pp 288–296.
28. Wang C, Blei D, Heckerman D (2008) Continuous time dynamic topic models. Proceedings of the 24th Conference on Uncertainty in Artificial Intelligence (UAI 2008) (AUAI Press, Corvallis, OR), pp 579–586.
29. Shwed U, Bearman PS (2010) The temporal structure of scientific consensus formation. Am Sociol Rev 75:817–840.
30. Wallach HM, Murray I, Salakhutdinov R, Mimno D (2009) Evaluation methods for topic models. Proceedings of the 26th International Conference on Machine Learning (ICML 2009) (Sage Publications, Los Angeles), pp 1105–1112.
31. Orbanz P, Teh YW (2011) Modern Bayesian nonparametrics. Advances in Neural Information Processing 24, eds Shawe-Taylor J, Zemel RS, Bartlett PL, Pereira F, Weinberger KQ (Neural Information Processing Systems Foundation, La Jolla, CA).
32. Orbanz P, Teh YW (2011) Bayesian nonparametric models. Encyclopedia of Machine Learning, eds Sammut C, Webb GI (Springer, Berlin), pp 81–89.
33. Bloch F (1946) Nuclear induction. Phys Rev 70:460–474.
34. Wallace PR (1947) The band theory of graphite. Phys Rev 71:622–634.
35. Ke Q, Ferrara E, Radicchi F, Flammini A (2015) Defining and identifying sleeping beauties in science. Proc Natl Acad Sci USA 112:7426–7431.
36. Marcus MP, Marcinkiewicz MA, Santorini B (1993) Building a large annotated corpus of English: The Penn treebank. Comput Linguist 19:313–330.
37. Cagle FR (1939) A system of marking turtles for future identification. Copeia 1939:170–173.
38. Perrins CM (1965) Population fluctuations and clutch-size in the great tit, Parus major L. J Anim Ecol 34:601–647.
39. Peltzman S (1976) Toward a more general theory of regulation. J Law Econ 19:211–240.

3312 | www.pnas.org/cgi/doi/10.1073/pnas.1719792115 | Gerow et al.

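The model-derived metrics can be sketched in a few lines. The example below computes document influence as the topic-weighted sum in Eq. 8 and author persistence as in Eq. 11. It is a minimal sketch under stated assumptions: the function names are hypothetical, Δt(A) is taken to be the difference between the latest and earliest publication year, and the maximum entropy H_K is taken to be log K (the entropy of a uniform K vector).

```python
import numpy as np


def document_influence(theta_d, ell_hat_d):
    """Eq. 8: sum over topics of topic proportion times per-topic influence."""
    return float(np.dot(theta_d, ell_hat_d))


def entropy(p):
    """Shannon entropy -sum_i p_i log(p_i), treating 0 log 0 as 0."""
    p = np.asarray(p, dtype=float)
    nonzero = p[p > 0]
    return float(-np.sum(nonzero * np.log(nonzero)))


def persistence(thetas, years):
    """Eq. 11 for one author.

    thetas : (|A|, K) topic mixtures of the author's documents
    years  : publication years of the same documents
    """
    thetas = np.asarray(thetas, dtype=float)
    n_docs, n_topics = thetas.shape
    mean_mixture = thetas.mean(axis=0)   # (1/|A|) * sum of theta_d
    h_max = np.log(n_topics)             # assumed H_K: entropy of a uniform K vector
    span = max(years) - min(years)       # assumed definition of delta t(A)
    return (1.0 - entropy(mean_mixture) / h_max) * (n_docs + span)
```

An author whose documents all sit in one topic gets the full `|A| + Δt(A)` score, while an author spread evenly over every topic scores zero regardless of output, which matches the reading that persistence rewards sustained work in the same topics.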
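The topic contribution estimate (Eqs. 9 and 10) can be illustrated with a small counterfactual calculation. The sketch below removes one document's term from a topic's parameters and measures the KL divergence between the resulting word distributions. Several pieces are assumptions rather than details confirmed by the text: the topic parameters are mapped to word distributions with a softmax (standing in for π), the exponential weight uses the previous time slice's parameters, and `doc_word_weight` is a hypothetical stand-in for the document's per-word term w_d z_{d,k}.

```python
import numpy as np


def softmax(beta):
    """Map natural parameters to a word distribution (assumed form of pi)."""
    shifted = beta - np.max(beta)  # subtract the max for numerical stability
    weights = np.exp(shifted)
    return weights / weights.sum()


def counterfactual_topic(beta_hat, beta_prev, doc_word_weight, ell_hat):
    """Eq. 9 (as read here): topic parameters with one document's term removed."""
    return beta_hat - np.exp(-beta_prev) * doc_word_weight * ell_hat


def contribution(beta_hat, beta_without_d):
    """Eq. 10: KL divergence between the topic with and without the document."""
    p = softmax(beta_hat)
    q = softmax(beta_without_d)
    return float(np.sum(p * (np.log(p) - np.log(q))))
```

For example, with `beta_hat = np.array([0.2, -1.0, 0.5])` and a document that only uses the first word, `contribution(beta_hat, counterfactual_topic(beta_hat, beta_hat, np.array([1.0, 0.0, 0.0]), 0.5))` is positive, while a document with zero fitted influence leaves the topic unchanged and contributes nothing.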