The Impact of Preprint Servers in the Formation of Novel Ideas
Total Page:16
File Type:pdf, Size:1020Kb
bioRxiv preprint doi: https://doi.org/10.1101/2020.10.08.330696; this version posted October 8, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. The impact of preprint servers in the formation of novel ideas Swarup Satish∗ Zonghai Yao∗ University of Massachussetts, Amherst University of Massachussetts, Amherst [email protected] [email protected] Andrew Drozdov Boris Veytsman University of Massachussetts, Amherst Chan Zuckerverg Initiative [email protected] George Mason University [email protected] Abstract The benefits of preprints for accelerating science We study whether novel ideas in biomedical are often raised in the discussions between regu- literature appear first in preprints or traditional latory agencies, funders, scientists and publishers. journals. We develop a Bayesian method to Thus a method to objectively assess them is impor- estimate the time of appearance for a phrase tant. One way to do this assessment is to look at in the literature, and apply it to a number of a new important idea and to measure whether it phrases, both automatically extracted and sug- first appears in a traditional journal or on a preprint gested by experts. We see that presently most server. phrases appear first in the traditional journals, but there is a number of phrases with the first An implementation of this approach requires one appearance on preprint servers. A compari- to define what is an important idea, and how to find son of the general composition of texts from the time of appearance for it. This is the goal of bioRxiv and traditional journals shows a grow- our work. ing trend of bioRxiv being predictive of tradi- The definition of novelty in science and the meth- tional journals. We discuss the application of ods to determine and predict novelty have a long the method for related problems. history discussed in the next section. In this work 1 Introduction we use a very simple approach (Garfield, 1967; La- tour and Woolgar, 1986): new ideas correspond to A paper submitted to a journal goes through several new terms. Thus if we find new words and phrases stages: peer review, editorial work, copyediting, in scientific papers, we can surmise the appearance publication. This leads to a long waiting time be- of new ideas. tween submission and the final publication (Powell, The definition of the time of appearance for a 2016). The situation is especially bad in the life sci- new idea is not trivial. It is not enough to regis- ences, where the waiting time between submission ter the first mention of a term. First, some hits and publication approaches the duration of a tradi- might be erroneous, and give us false positives. On tional PhD study, creating serious difficulties for the other hand, we might miss some mentions of young scientists (Vale, 2015). While this is frustrat- a term due to the incompleteness of the corpus. ing for scientists whose recognition and promotion Therefore a more subtle method to determine the often depend on the publication record, it is also time of appearance is needed. In this work we offer bad for science itself, significantly slowing down a Bayesian approach to this problem. its progress (Qunaj et al., 2018). Based on our definitions of novelty and the time Preprint servers were offered as a means of ac- of appearance for novel ideas we compare the time celerating science (Berg et al., 2016; Desjardins- of appearance for several novel ideas in the pa- Proulx et al., 2013; Sarabipour et al., 2019; Schloss, pers published in bioRxiv/medRxiv https://www. 2017; Lauer et al., 2015; Peiperl, 2018), especially biorxiv.org/ and PubMed Central full text col- in the wake of COVID-19 epidemics (Krumholz lection https://www.ncbi.nlm.nih.gov/pmc/. et al., 2020). The discussion about the benefits (and dangers) of preprint is no longer confined to the scientific literature, coming to the pages of popular 2 Related Works newspapers (Eisen and Tibshirani, 2020). Novel ideas and breakthroughs are among the cen- *Equal contribution. tral concepts for the science of science. A number bioRxiv preprint doi: https://doi.org/10.1101/2020.10.08.330696; this version posted October 8, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. of studies propose different ideas to quantify orig- Another field that is relevant to our research is inality in science and technology (Cozzens et al., the detection of change points in a stream of events. 2010; Alexander et al., 2013; Rzhetsky et al., 2015; Change point detection (or CPD) detects abrupt Rotolo et al., 2015; Wang et al., 2016; Wang and shifts in time series trends that can be easily iden- Chai, 2018; Shibayama and Wang, 2020) or their tified via the human eye, but are harder to pin- impact on the other works (Shi et al., 2010; Shahaf point using traditional statistical approaches. The et al., 2012; Sinatra et al., 2016; Hutchins et al., research in CPD is applicable across an array of in- 2016; Wesley-Smith et al., 2016; Herrmannova dustries, including finance, manufacturing quality et al., 2018b,a; Zhao et al., 2019; Bornmann et al., control, energy, medical diagnostics, and human 2019; Small et al., 2019). The prediction of break- activity analysis. There are many representative throughs, scientific impact and citation counts is a methods of CPD. Binary segmentation (Bai, 1997) well developed area (Schubert and Schubert, 1997; is a sequential approach: first, one change point is Garfield et al., 2002; Dietz et al., 2007; Lokker detected in the complete input signal, then series et al., 2008; Shi et al., 2010; Uzzi et al., 2013; is split around this change point, then the opera- Alexander, 2013; Klimek et al., 2016; Tahamtan tion is repeated on the two resulting sub-signals. et al., 2016; McKeown et al., 2016; Clauset et al., As opposite to binary segmentation, which is a 2017; Peoples et al., 2017; Salatino et al., 2018; greedy procedure, bottom-up segmentation (Fry- Dong et al., 2018; Iacopini et al., 2018; Feldman zlewicz, 2007) is generous: it starts with many et al., 2018; van den Besselaar and Sandstrom¨ , change points and successively deletes the less sig- 2018; Klavans et al., 2020). However, the ques- nificant ones. First, the signal is divided in many tion asked in these works is different from the one sub-signals along a regular grid. Then contigu- we ask. Most of the researchers tried to determine ous segments are successively merged according what makes a work original or impactful, and how to a measure of how similar they are. Because the to predict originality or impact. Our question is the enumeration of all possible partitions is impossi- following: suppose we know a certain idea is novel ble, Pelt (Killick et al., 2012) relies on a pruning (or impactful). Can we pinpoint a moment in time rule. Many indexes are discarded, greatly reducing when this idea appeared, and where did it appear? the computational cost while retaining the ability to find the optimal segmentation. Window-based Sentence level novelty detection was a topic change point detection (Aminikhanghahi and Cook, of novelty tracks of Text Retrieval Conferences 2016) uses two windows which slide along the data (TREC) from 2002 to 2004 (Soboroff and Harman, stream. Dynamic programming was also used for 2003; Harman, 2002; Clarke et al., 2004; Sobo- this task (Truong et al., 2020). In this work we pro- roff and Harman, 2005). The goal of these tracks pose a simple Bayesian approach to the detection was to highlight the relevant sentences that con- of change points, which seems to give intuitively tain novel information, given a topic and an or- reasonable results for our purpose. dered list of relevant documents. At the document level, Karkali et al. (2013) computed novelty score 3 Bayesian Approach to Novelty based on the inverse document frequency scoring Detection function. Another work by Verheij et al. (2012) presents a comparison study of different novelty detection methods evaluated on news articles where −μt n language model based methods perform better than e μt P(X = t, Y = n) = the cosine similarity based ones. Dasgupta and Dey n! (2016) conducted experiments with information en- ] tropy measure to calculate novelty of a document. τ frequency [ Again, the work that we present here significantly n μt = α μt = α + β(t − τ) differs from the existing novelty detection meth- t [time interval] ods since we use the novelty detection as a starting point rather than a goal. We first get candidate Figure 1: Our Bayesian Approach to Novelty Detec- phrases from the documents with the most appro- tion (BAND). The goal is to find the inflection point τ priate key phrases extraction method, and then de- that indicates the earliest point attributed to the rapid termine the appearance timing of these phrases. research growth associated with a novel idea. bioRxiv preprint doi: https://doi.org/10.1101/2020.10.08.330696; this version posted October 8, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. Our proposed Bayesian Approach to Novelty 3. Remove 2.5% of the triples with the highest Detection (BAND) finds the time interval τ that value of τ, and 2.5% of triplets with the lowest maximizes the observed series of publication fre- value of τ.