Tracking the Popularity and Outcomes of All Biorxiv Preprints
Meta-Research: Tracking the popularity and outcomes of all bioRxiv preprints

Richard J. Abdill1 and Ran Blekhman1,2

1 – Department of Genetics, Cell Biology, and Development, University of Minnesota, Minneapolis, MN
2 – Department of Ecology, Evolution, and Behavior, University of Minnesota, St. Paul, MN

ORCID iDs
RJA: 0000-0001-9565-5832
RB: 0000-0003-3218-613X

Correspondence
Ran Blekhman, PhD
University of Minnesota
MCB 6-126
420 Washington Avenue SE
Minneapolis, MN 55455
Email: [email protected]

Abstract

The growth of preprints in the life sciences has been reported widely and is driving policy changes for journals and funders, but little quantitative information has been published about preprint usage. Here, we report how we collected and analyzed data on all 37,648 preprints uploaded to bioRxiv.org, the largest biology-focused preprint server, in its first five years. The rate of preprint uploads to bioRxiv continues to grow (exceeding 2,100 in October 2018), as does the number of downloads (1.1 million in October 2018). We also find that two-thirds of preprints posted before 2017 were later published in peer-reviewed journals, and find a relationship between the number of downloads a preprint has received and the impact factor of the journal in which it is published. We also describe Rxivist.org, a web application that provides multiple ways to interact with preprint metadata.

Introduction

In the 30 days of September 2018, four leading biology journals – The Journal of Biochemistry, PLOS Biology, Genetics and Cell – published 85 full-length research articles. The preprint server bioRxiv (pronounced ‘Bio Archive’) had posted this number of preprints by the end of September 3 (Figure 1—source data 4).

Preprints allow researchers to make their results available as quickly and widely as possible, short-circuiting the delays and requests for extra experiments often associated with peer review (Berg et al. 2016; Powell 2016; Raff et al. 2008; Snyder 2013; Hartgerink 2015; Vale 2015; Royle 2014).

Physicists have been sharing preprints using the service now called arXiv.org since 1991 (Verma 2017), but early efforts to facilitate preprints in the life sciences failed to gain traction (Cobb 2017; Desjardins-Proulx et al. 2013). An early proposal to host preprints on PubMed Central (Varmus 1999; Smaglik 1999) was scuttled by the National Academy of Sciences, which successfully negotiated to exclude work that had not been peer-reviewed (Marshall 1999; Kling et al. 2003). Further attempts to circulate biology preprints, such as NetPrints (Delamothe et al. 1999), Nature Precedings (Kaiser 2017), and The Lancet Electronic Research Archive (McConnell and Horton 1999), popped up (and then folded) over time (The Lancet Electronic Research Archive 2005). The preprint server that would catch on, bioRxiv, was not founded until 2013 (Callaway 2013). Now, biology publishers are actively trawling preprint servers for submissions (Barsh et al. 2016; Vence 2017), and more than 100 journals accept submissions directly from the bioRxiv website (bioRxiv 2018). The National Institutes of Health now allows researchers to cite preprints in grant proposals (National Institutes of Health 2017), and grants from the Chan Zuckerberg Initiative require researchers to post their manuscripts to preprint servers (Chan Zuckerberg Initiative 2019; Champieux 2018).

Preprints are influencing publishing conventions in the life sciences, but many details about the preprint ecosystem remain unclear.
We know bioRxiv is the largest of the biology preprint servers (Anaya 2018), and sporadic updates from bioRxiv leaders show steadily increasing submission numbers (Sever 2018). Analyses have examined metrics such as total downloads (Serghiou and Ioannidis 2018) and publication rate (Schloss 2017), but long-term questions remain open. Which fields have posted the most preprints, and which collections are growing most quickly? How many times have preprints been downloaded, and which categories are most popular with readers? How many preprints are eventually published elsewhere, and in what journals? Is there a relationship between a preprint’s popularity and the journal in which it later appears? Do these conclusions change over time?

Here, we aim to answer these questions by collecting metadata about all 37,648 preprints posted to bioRxiv from its launch through November 2018. As part of this effort we have developed Rxivist (pronounced ‘Archivist’): a website, API and database (available at https://rxivist.org and gopher://origin.rxivist.org) that provide a fully featured system for interacting programmatically with the periodically indexed metadata of all preprints posted to bioRxiv.

Results

We developed a Python-based web crawler to visit every content page on the bioRxiv website and download basic data about each preprint across the site’s 27 subject-specific categories: title, authors, download statistics, submission date, category, DOI, and abstract. The bioRxiv website also provides the email address and institutional affiliation of each author, plus, if the preprint has been published, its new DOI and the journal in which it appeared. For those preprints, we also used information from Crossref to determine the date of publication.
We have stored these data in a PostgreSQL database; snapshots of the database are available for download, and users can access data for individual preprints and authors on the Rxivist website and API. Additionally, a repository is available online at https://doi.org/10.5281/zenodo.2465689 that includes the database snapshot used for this manuscript, plus the data files used to create all figures. Code to regenerate all figures in this paper is included there and on GitHub (https://github.com/blekhmanlab/rxivist/blob/master/paper/figures.md). See Methods and Supplementary Information for a complete description.

Preprint submissions

The most apparent trend in the bioRxiv data is that the website is becoming an increasingly popular venue for authors to share their work, at a rate that increases almost monthly. There were 37,648 preprints available on bioRxiv at the end of November 2018, and more preprints were posted in the first 11 months of 2018 (18,825) than in all four previous years combined (Figure 1a). The number of bioRxiv preprints doubled in less than a year, and new submissions have been trending upward for five years (Figure 1b). The largest driver of site-wide growth has been the neuroscience collection, which has had more submissions than any other bioRxiv category in every month since September 2016 (Figure 1b). In October 2018, it became the first of bioRxiv’s collections to contain 6,000 preprints (Figure 1a). The second-largest category is bioinformatics (4,249 preprints), followed by evolutionary biology (2,934). October 2018 was also the first month in which bioRxiv posted more than 2,000 preprints, increasing its total preprint count by 6.3% (2,119) in 31 days.

[Figure 1 image. Panel (a): cumulative preprints (y-axis) by month (x-axis), 2014 to 2018, one line per collection, with labeled lines for Neuroscience, Bioinformatics, Evolutionary Bio., Genomics, Microbiology, Genetics and Cell Biology; inset: total preprints by month. Panel (b): preprints posted per month (y-axis) by month (x-axis), 2014 to 2018, by collection. Collection color key: Animal Behav. & Cognition, Biochemistry, Bioengineering, Bioinformatics, Biophysics, Cancer Bio., Cell Bio., Clinical Trials, Developmental Bio., Ecology, Epidemiology, Evolutionary Bio., Genetics, Genomics, Immunology, Microbiology, Molecular Bio., Neuroscience, Paleontology, Pathology, Pharmacology & Toxicology, Physiology, Plant Bio., Scientific Comm. & Edu., Synthetic Bio., Systems Bio., Zoology.]

Figure 1. Total preprints posted to bioRxiv over a 61-month period from November 2013 through November 2018. (a) The number of preprints (y-axis) at each month (x-axis), with each category depicted as a line in a different color. Inset: The overall number of preprints on bioRxiv in each month. (b) The number of preprints posted (y-axis) in each month (x-axis) by category. The category color key is provided below the figure.
Figure 1—source data 1: The number of submissions per month to each bioRxiv category, plus running totals. submissions_per_month.csv
Figure 1—source data 2: An Excel workbook demonstrating the formulas used to calculate the running totals in Figure 1—source data 1. submissions_per_month_cumulative.xlsx
Figure 1—source data 3: The number of submissions per month overall, plus running totals. submissions_per_month_overall.csv
Figure 1—source data 4: The number of full-length articles published by an arbitrary selection of well-known journals in September 2018.
Figure 1—source data 5: A table of the top 15 authors with the most preprints on bioRxiv.
Figure 1—source data 6: A list of every author, the number of preprints for which they are listed as an author, and the number of email addresses they are associated with. papers_per_author.csv
Figure 1—source data 7: A table of the top 25 institutions with the most authors listing them as their affiliation, and how many papers have been published by those authors.
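
The running totals mentioned for the per-month submission files, and the kind of month-over-month percentage quoted for October 2018, can be illustrated with a short sketch. This is not the authors' code: the function names are hypothetical, the input is assumed to be a plain list of monthly counts, and the numbers are toy values rather than bioRxiv data. Growth is taken here to mean a month's new submissions divided by the cumulative total at the end of the previous month, which is our reading of the percentage in the text.

```python
from itertools import accumulate


def running_totals(monthly_counts):
    """Return the cumulative total after each month."""
    return list(accumulate(monthly_counts))


def percent_growth(monthly_counts):
    """Percent increase in the cumulative total contributed by each month.

    The first month has no prior total, so its entry is None; the same
    applies to any month that follows a cumulative total of zero.
    """
    if not monthly_counts:
        return []
    totals = running_totals(monthly_counts)
    growth = [None]
    # Pair each month's new submissions with the total before that month.
    for previous_total, new in zip(totals, monthly_counts[1:]):
        if previous_total:
            growth.append(100.0 * new / previous_total)
        else:
            growth.append(None)
    return growth


# Toy monthly counts for illustration only (not real bioRxiv numbers).
toy_counts = [100, 50, 150]
print(running_totals(toy_counts))  # [100, 150, 300]
print(percent_growth(toy_counts))  # [None, 50.0, 100.0]
```

Applied to a real column of monthly submission counts, the same two functions would give the cumulative curve plotted in panel (a) of Figure 1 and a per-month growth percentage comparable to the one reported in the text.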