Deconvolution of Expression for Nascent RNA Sequencing Data (DENR) Highlights Pre-RNA Isoform Diversity in Human Cells
bioRxiv preprint doi: https://doi.org/10.1101/2021.03.16.435537; this version posted March 17, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

Yixin Zhao1,*, Noah Dukler1,*, Gilad Barshad2, Shushan Toneyan1, Charles G. Danko2, and Adam Siepel1

1Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, USA.
2Baker Institute for Animal Health, College of Veterinary Medicine, Cornell University, Ithaca, NY 14853, USA.
*These authors contributed equally to this work.

Abstract

Quantification of mature-RNA isoform abundance from RNA-seq data has been extensively studied, but much less attention has been devoted to quantifying the abundance of distinct precursor RNAs based on nascent RNA sequencing data. Here we address this problem with a new computational method called Deconvolution of Expression for Nascent RNA sequencing data (DENR). DENR models the nascent RNA read counts at each locus as a mixture of user-provided isoforms. The performance of the baseline algorithm is enhanced by the use of machine-learning predictions of transcription start sites (TSSs) and an adjustment for the typical “shape profile” of read counts along a transcription unit. We show using simulated data that DENR clearly outperforms simple read-count-based methods for estimating the abundances of both whole genes and isoforms. By applying DENR to previously published PRO-seq data from K562 and CD4+ T cells, we find that transcription of multiple isoforms per gene is widespread, and the dominant isoform frequently makes use of an internal TSS. We also identify >200 genes whose dominant isoforms make use of different TSSs in these two cell types. Finally, we apply DENR and StringTie to newly generated PRO-seq and RNA-seq data, respectively, for human CD4+ T cells and CD14+ monocytes, and show that entropy at the pre-RNA level makes a disproportionate contribution to overall isoform diversity, especially across cell types. Altogether, DENR is the first computational tool to enable abundance quantification of pre-RNA isoforms based on nascent RNA sequencing data, and it reveals high levels of pre-RNA isoform diversity in human cells.

Nascent RNA Sequencing | Gene Expression | Expression Quantification | Alternative Promoters
Correspondence: [email protected]
Zhao et al. | bioRχiv | March 16, 2021 | 1–14

Introduction

For about the last 15 years, most large-scale transcriptomic analyses have relied on high-throughput short-read sequencing technologies as the readout for the relative abundances of RNA transcripts. In species with available genome assemblies, these sequence reads are generally mapped to assembled contigs, and then the “read depth,” or average density of aligned reads, is used as a proxy for the abundance of RNAs corresponding to each annotated transcription unit. The approach is relatively inexpensive and straightforward, and, with adequate sequencing depth, it generally leads to accurate estimates of abundance.

A fundamental challenge with this general paradigm, however, is that transcription units frequently overlap in genomic coordinates; that is, the same segment of DNA often serves as a template for multiple distinct RNA transcripts. As a result, it is unclear which transcription unit is the source of each sequence read. While this problem can occur at the level of whole genes that contain overlapping segments, it is most prevalent at the level of multiple isoforms for each gene, owing to alternative transcription start sites (TSSs), alternative transcription termination sites (TTSs) or polyadenylation and cleavage sites (PAS), and alternative splicing. These isoforms often overlap heavily with one another, and differ on a scale that is not well described by short-read sequencing. For example, a typical Illumina RNA sequencing run today generates reads of length 150 bp, roughly the size of an exon in the human genome. As a result, many reads fall within a single exon and therefore carry no direct information about the relative abundances of isoforms containing that exon. This problem is critical because the existence of multiple isoforms per gene is the rule rather than the exception in most eukaryotes. For example, more than 90% of multi-exon human genes undergo alternative splicing (1), with an average of more than 7 isoforms per protein-coding gene (2); in plants, up to 70% of multi-exon genes show evidence of alternative splicing (3).

In the case of RNA-seq data, the problem of isoform abundance estimation from short-read sequence data has been widely studied for more than a decade (4–6). Several software packages now address the problem efficiently and effectively, including ones that make use of fully mapped reads (7–10) and others that substantially boost speed by working only with “pseudoalignments” at remarkably little (if any) cost in accuracy (11–13). These computational methods differ in detail but they generally work by modeling the observed sequence reads as an unknown mixture of isoforms at each locus. They estimate the relative abundances (mixture coefficients) of the isoforms from the read counts, relying in particular on the subset of reads that reflect distinguishing features, such as exons or splice junctions present in some isoforms but not others. Because RNA-seq libraries are typically dominated by mature RNAs, intronic reads tend to be rare and splice junctions provide one of the strongest signals for differentiation of isoforms. Altogether, these isoform quantification methods work quite well, with the best methods exhibiting Pearson correlation coefficients of 0.95 or higher with true values in simulation experiments, and similarly high concordance across technical replicates for real data (2).

In recent years, another method for interrogating the transcriptome, known as “nascent RNA sequencing,” has become increasingly widely used. Instead of measuring the concentrations of mature RNAs, as RNA-seq effectively does, nascent RNA sequencing protocols isolate and sequence newly transcribed RNA segments, typically by tagging them with selectable ribonucleotide analogs or through isolation of polymerase-associated RNAs (14–22). In this way, they provide a measurement of primary transcription, independent of the RNA decay processes that influence cellular concentrations of mature RNAs. In addition, nascent RNA sequencing methods have a wide variety of other applications, including identification of active enhancers (through the presence of eRNAs) (20, 22–25), characterization of promoter-proximal pausing and divergent transcription (14, 15), estimation of elongation rates (26, 27), and estimation of relative RNA half-lives (28).

In nascent RNA sequencing, the isolated RNAs have generally not yet been spliced; therefore, they represent the entire transcribed portion of the genome, including introns. As a result, the problem of distinguishing alternative splice forms is largely irrelevant. On the other hand, the data typically still reflect a mixture of precursor RNA (pre-RNA)

In this article, we introduce a new computational method and implementation in R, called Deconvolution of Expression for Nascent RNA sequencing (DENR), that addresses the problem of isoform abundance quantification at the pre-RNA level. DENR also solves the closely related problems of estimating abundance at the gene level, summing over all isoforms, and identifying the “dominant isoform,” that is, the one exhibiting the greatest abundance. DENR makes use of a straightforward non-negative least-squares strategy for decomposing the mixture of isoforms present in the data, but then improves on this baseline approach by taking advantage of machine-learning predictions of TSSs and an adjustment for the typical shape profile in the read counts along a transcription unit. We show that the method performs well on simulated data, and then use it to reveal a high level of diversity in the pre-RNA isoforms inferred from PRO-seq data for several human cell types, including K562, CD4+ T cells, and CD14+ monocytes.

Results

Overview of DENR. DENR is implemented as a package in the R programming environment. It requires two main inputs: a set of isoform annotations and a set of corresponding strand-specific nascent RNA sequencing read counts. Mature RNA isoform annotations can be easily downloaded by making use of biomaRt (31) or extracted from files in commonly available formats, such as GTF or GFF; similarly read counts can be obtained from a file in bigWig format. Detailed examples are provided in an online vignette (see Methods).

Given the necessary inputs, DENR first builds a transcript_quantifier object, which summarizes the read counts corresponding to the available isoform annotations (Fig. 1). This phase consists of three steps (Supplementary Fig. S1). First, the mature RNA isoforms are grouped into nonoverlapping, strand-specific clusters, corresponding