voom: precision weights unlock linear model analysis tools for RNA-seq read counts
Charity W Law1,2, Yunshun Chen1,2, Wei Shi1,3 and Gordon K Smyth1,4*

Law et al. Genome Biology 2014, 15:R29
http://genomebiology.com/2014/15/2/R29

METHOD Open Access

*Correspondence: [email protected]
1Bioinformatics Division, The Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Victoria 3052, Australia
4Department of Mathematics and Statistics, The University of Melbourne, Parkville, Victoria 3010, Australia
Full list of author information is available at the end of the article

© 2014 Law et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Abstract

New normal linear modeling strategies are presented for analyzing read counts from RNA-seq experiments. The voom method estimates the mean-variance relationship of the log-counts, generates a precision weight for each observation and enters these into the limma empirical Bayes analysis pipeline. This opens access for RNA-seq analysts to a large body of methodology developed for microarrays. Simulation studies show that voom performs as well as or better than count-based RNA-seq methods even when the data are generated according to the assumptions of the earlier methods. Two case studies illustrate the use of linear modeling and gene set testing methods.

Background

Gene expression profiling is one of the most commonly used genomic techniques in biological research. For most of the past 16 years or more, DNA microarrays have been the premier technology for genome-wide gene expression experiments, and a large body of mature statistical methods and tools has been developed to analyze intensity data from microarrays. This includes methods for differential expression analysis [1-3], random effects [4,5], gene set enrichment [6], gene set testing [7,8] and so on. One popular differential expression pipeline is that provided by the limma software package [9]. The limma pipeline includes linear modeling to analyze complex experiments with multiple treatment factors, quantitative weights to account for variations in precision between different observations, and empirical Bayes statistical methods to borrow strength between genes.

Borrowing information between genes is a crucial feature of the genome-wide statistical methods, as it allows for gene-specific variation while still providing reliable inference with small sample sizes. The normal-based empirical Bayes statistical procedures can adapt to different types of datasets and can provide exact type I error rate control even for experiments with a small number of replicate samples [3].

In the past few years, RNA-seq has emerged as a revolutionary new technology for expression profiling [10]. One common approach to summarize RNA-seq data is to count the number of sequence reads mapping to each gene or genomic feature of interest [11-14]. RNA-seq profiles consist therefore of integer counts, unlike microarrays, which yield intensities that are essentially continuous numerical measurements. A number of early RNA-seq publications applied statistical methods developed for microarrays to analyze RNA-seq read counts. For example, the limma package has been used to analyze log-counts after normalization by sequencing depth [11,15-17].

Later statistical publications argued that RNA-seq data should be analyzed by statistical methods designed specifically for counts. Much interest has focused on the negative binomial (NB) distribution as a model for read counts, and especially on the problem of estimating biological variability for experiments with small numbers of replicates. One approach is to fit a global value or global trend to the NB dispersions [13,18,19], although this has the limitation of not allowing for gene-specific variation. A number of empirical Bayes procedures have been proposed to estimate the gene-wise dispersions [20-22]. Alternatively, Lund et al. [23] proposed that the residual deviances from NB generalized linear models be entered into limma's empirical Bayes procedure to enable quasi-likelihood testing. Other methods based on over-dispersed Poisson models have also been proposed [24-26].

Unfortunately, the mathematical theory of count distributions is less tractable than that of the normal distribution, and this tends to limit both the performance and the usefulness of the RNA-seq analysis methods. One problem relates to error rate control with small sample sizes. Despite the use of probabilistic distributions, all the statistical methods developed for RNA-seq counts rely on approximations of various kinds. Many rely on statistical tests that are only asymptotically valid or are theoretically accurate only when the dispersion is small. All the differential expression methods currently available based on the NB distribution treat the estimated dispersions as if they were known parameters, without allowing for the uncertainty of estimation, and this leads to statistical tests that are overly liberal in some situations [27,28]. This is true even of the NB exact test [18], which gives exact type I error rate control when the dispersion is known but which becomes liberal when an imprecise dispersion estimator is inserted for the known value. Quasi-likelihood methods [23] account for uncertainty in the dispersion by using an F-test in place of the usual likelihood ratio test, but this relies on other approximations, in particular that the residual deviances are analogous to residual sums of squares from a normal analysis of variance.

A related issue is the ability to adapt to different types of data with high or low dispersion heterogeneity. None of the empirical Bayes methods based on the NB distribution achieve the same adaptability, robustness or small sample properties as the corresponding methods for microarrays, due to the mathematical intractability of count distributions compared to the normal distribution.

The most serious limitation though is the reduced range of statistical tools associated with count distributions compared to the normal distribution. This is more fundamental than the other problems because it limits the types of analyses that can be done. Much of the statistical methodology that has been developed for microarray data relies on use of the normal distribution. For example, we often find it useful in our own microarray gene expression studies to estimate empirical quality weights to downweight poor quality RNA samples [29], to use random effects to allow for repeated measures on the same experimental units [4,5] or to conduct gene set tests for expression signatures while allowing for inter-gene correlations [7,8]. These techniques broaden the [...]

[...] larger standard deviations than small counts. While a logarithmic transformation counteracts this, it overdoes the adjustment somewhat so that large log-counts now have smaller standard deviations than small log-counts. We explore the idea that it is more important to model the mean-variance relationship correctly than it is to specify the exact probabilistic distribution of the counts. There is a body of theory in the statistical literature showing that correct modeling of the mean-variance relationship inherent in a data generating process is the key to designing statistically powerful methods of analysis [30]. Such variance modeling may in fact take precedence over identifying the exact probability law that the data values follow [31-33]. We therefore take the view that it is crucial to understand the way in which the variability of RNA-seq read counts depends on the size of the counts. Our work is in the spirit of pseudo-likelihoods [32] whereby statistical methods based on the normal distribution are applied after estimating a mean-variance function for the data at hand.

Our approach is to estimate the mean-variance relationship robustly and non-parametrically from the data. We work with log-counts normalized for sequence depth, specifically with log-counts per million (log-cpm). The mean-variance trend is fitted to the gene-wise standard deviations of the log-cpm as a function of average log-count. We explore two ways to incorporate the mean-variance relationship into the differential expression analysis. The first is to modify limma's empirical Bayes procedure to incorporate a mean-variance trend. The second method incorporates the mean-variance trend into a precision weight for each individual normalized observation. The normalized log-counts and associated precision weights can then be entered into the limma analysis pipeline, or indeed into any statistical pipeline for microarray data that is precision weight aware. We call the first method limma-trend and the second method voom, an acronym for 'variance modeling at the observational level'. limma-trend applies the mean-variance relationship at the gene level whereas voom applies it at the level of individual observations.

This article compares the performance of the limma-based pipelines to edgeR [20,34], DESeq
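
The three steps named in the Background (log-cpm, a trend of gene-wise standard deviation against average log-count, and a precision weight for each observation) can be illustrated with a small sketch. This is a plain-Python illustration under stated assumptions, not the limma implementation. All function names are hypothetical. The offsets of 0.5 and 1 in the log-cpm are an assumption not stated in this excerpt. A running mean stands in for the robust non-parametric smoother. The design is assumed to be a single group, so each gene's fitted value is its mean log-cpm, and each weight is taken as the inverse of the squared predicted standard deviation.

```python
import math
from statistics import mean, stdev


def log_cpm(counts, prior=0.5):
    """Log2 counts per million; counts[g][j] is the read count of gene g in sample j.

    The offsets (prior on the count, +1 on the library size) are assumptions
    that keep the logarithm finite for zero counts.
    """
    n_samples = len(counts[0])
    lib_sizes = [sum(gene[j] for gene in counts) for j in range(n_samples)]
    return [
        [math.log2((gene[j] + prior) / (lib_sizes[j] + 1.0) * 1e6)
         for j in range(n_samples)]
        for gene in counts
    ]


def fit_trend(x, y, k=3):
    """Stand-in smoother: running mean of y over a window of k points sorted by x."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    xs = [x[i] for i in order]
    ys = [y[i] for i in order]
    half = k // 2
    smooth = [mean(ys[max(0, i - half): i + half + 1]) for i in range(len(ys))]
    return xs, smooth


def predict(xs, smooth, x0):
    """Piecewise-linear interpolation of the trend, held constant outside its range."""
    if x0 <= xs[0]:
        return smooth[0]
    if x0 >= xs[-1]:
        return smooth[-1]
    for i in range(1, len(xs)):
        if x0 <= xs[i]:
            if xs[i] == xs[i - 1]:
                return smooth[i]
            t = (x0 - xs[i - 1]) / (xs[i] - xs[i - 1])
            return smooth[i - 1] + t * (smooth[i] - smooth[i - 1])


def voom_like_weights(counts, k=3, floor=1e-8):
    """Return (log-cpm matrix, precision-weight matrix) for a single-group design."""
    n_samples = len(counts[0])
    lib_sizes = [sum(gene[j] for gene in counts) for j in range(n_samples)]
    y = log_cpm(counts)
    # Adding this offset converts a log-cpm value to the log-count scale of sample j.
    offset = [math.log2(size + 1.0) - math.log2(1e6) for size in lib_sizes]
    gene_mean = [mean(row) for row in y]
    gene_sd = [stdev(row) for row in y]
    avg_log_count = [m + mean(offset) for m in gene_mean]
    xs, smooth = fit_trend(avg_log_count, gene_sd, k)
    # One weight per observation: inverse squared trend value at its fitted log-count.
    weights = [
        [1.0 / max(predict(xs, smooth, m + offset[j]), floor) ** 2
         for j in range(n_samples)]
        for m in gene_mean
    ]
    return y, weights


# Toy data: 6 genes (rows) by 3 samples (columns).
counts = [
    [10, 12, 9],
    [100, 130, 90],
    [1000, 1100, 950],
    [5, 0, 7],
    [50, 65, 40],
    [500, 480, 610],
]
log_expr, weights = voom_like_weights(counts)
```

Because the weights are evaluated at each observation's own fitted log-count, two samples with different library sizes receive different weights for the same gene. This is the sense in which the trend is applied at the level of individual observations and not once per gene.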