The Drosophila Melanogaster Transcriptome by Paired-End RNA Sequencing
Total Page:16
File Type:pdf, Size:1020Kb
Downloaded from genome.cshlp.org on September 29, 2021 - Published by Cold Spring Harbor Laboratory Press Resource The Drosophila melanogaster transcriptome by paired-end RNA sequencing Bryce Daines,1,2,6 Hui Wang,2,6 Liguo Wang,3,6 Yumei Li,1,2 Yi Han,2 David Emmert,4 William Gelbart,4 Xia Wang,1,2 Wei Li,3,5 Richard Gibbs,1,2,7 and Rui Chen1,2,7 1Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas 77030, USA; 2Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA; 3Dan L. Duncan Cancer Center, Baylor College of Medicine, Houston, Texas 77030, USA; 4Department of Molecular and Cellular Biology, Harvard University, Cambridge, Massachusetts 02138, USA; 5Department of Molecular and Cellular Biology, Baylor College of Medicine, Houston, Texas 77030, USA RNA-seq was used to generate an extensive map of the Drosophila melanogaster transcriptome by broad sampling of 10 developmental stages. In total, 142.2 million uniquely mapped 64–100-bp paired-end reads were generated on the Illumina GA II yielding 3563 sequencing coverage. More than 95% of FlyBase genes and 90% of splicing junctions were observed. Modifications to 30% of FlyBase gene models were made by extension of untranslated regions, inclusion of novel exons, and identification of novel splicing events. A total of 319 novel transcripts were identified, representing a 2% increase over the current annotation. Alternate splicing was observed in 31% of D. melanogaster genes, a 38% increase over previous estimations, but significantly less than that observed in higher organisms. Much of this splicing is subtle such as tandem alternate splice sites. [Supplemental material is available for this article. The sequencing data from this study have been submitted to the NCBI Sequence Read Archive (http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi) under accession no. SRA012173 and to the NCBI Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo/) under accession no. GSE24324.] High-throughput sequencing of cDNA libraries (RNA-seq) is an ef- splicing isoforms is enabled without dependence on prior knowl- fective and accurate method for transcriptome profiling. RNA-seq edge. While splicing variants can also be detected with custom has been applied in many organisms including humans (Campbell microarrays, probes must be specifically designed to interrogate et al. 2008; Cloonan and Grimmond 2008; Cloonan et al. 2008; exon–exon junctions and are thus limited in scope to known or Marioni et al. 2008; Mortazavi et al. 2008; Mudge et al. 2008; predicted junctions. Finally, direct annotation of gene models and Nagalakshmi et al. 2008; Pan et al. 2008; Shendure 2008; Sultan novel transcript discovery are possible with RNA-seq (Yassour et al. et al. 2008; Wang et al. 2008; Filichkin et al. 2009; Levin et al. 2009; 2009). This can be accomplished by alignment of reads to a refer- Perkins et al. 2009; Wurtzel et al. 2009). RNA-seq is not dependent ence genome and inference of transcript models from coverage on prior knowledge and requires no design work, thus reducing and splicing data. While genome-tiling arrays can also be used to labor and enabling de novo discovery. Several additional features identify novel transcriptional units, RNA-seq is not subject to the of RNA-seq make it likely to supplant microarrays as the key tech- inherent limitations of hybridization platforms. Due to the several nology for transcriptome profiling, including accurate and sensi- advantages of the RNA-seq approach we have applied this tech- tive measuring of expression level, identification of alternate splic- nology to characterize the transcriptome of D. melanogaster. ing isoforms, and direct discovery of novel transcripts (Graveley D. melanogaster is an important model organism and has 2008; Shendure 2008). contributed significantly to our understanding of gene expression Highly accurate digital measurement of gene expression by and development. The transcriptome of D. melanogaster is well RNA-seq can be obtained by counting the number of sequencing annotated and functionally characterized. Analysis of transcript reads which map to annotated transcripts from appropriately expression using spotted arrays has yielded insight into the de- prepared libraries (Mortazavi et al. 2008; Wang et al. 2009). In this velopmental expression patterns of annotated D. melanogaster way, RNA-seq offers increased sensitivity and larger dynamic range transcripts; however, this platform is limited to a subset of anno- over microarray, effectively detecting rare transcripts with greater tated genes (Arbeitman et al. 2002). Additionally, tilling arrays sensitivity and abundant transcripts with superior discrimination have been used to demonstrate that >29% of unnanotated introns (Marioni et al. 2008). Furthermore, RNA-seq generates reads which and intergenic regions are transcribed within the first 24 h of de- straddle exon–exon junctions (junction reads). When junction velopment at specific time points in development (Manak et al. reads are properly aligned to a reference genome the associated 2006). Tilling arrays, however, are limited in their resolution by the splice event can be inferred. Thus, direct discovery of alternate size of probes used. Expressed sequence tags (ESTs) provide the primary evidence for annotation efforts. Although more than 800,000 D. melanogaster ESTs have been sequenced, not all pre- 6 These authors contributed equally to this work. dicted or annotated genes are supported by ESTs (Stapleton et al. 7Corresponding authors. 2002a,b). Furthermore, these EST efforts have focused primarily E-mail [email protected]. on embryo and adult libraries, specifically adult head and gonad E-mail [email protected]. Article published online before print. Article, supplemental material, and pub- tissues, leaving intermediate stages of development underrep- lication date are at http://www.genome.org/cgi/doi/10.1101/gr.107854.110. resented. Continuation of Sanger-based EST sequencing efforts is 21:315–324 Ó 2011 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/11; www.genome.org Genome Research 315 www.genome.org Downloaded from genome.cshlp.org on September 29, 2021 - Published by Cold Spring Harbor Laboratory Press Daines et al. not an ideal solution for improvement of gene models because of set of curated models. Other annotations based on experimental its higher cost in comparison to next-generation sequencing and computational analysis are also available; a prominent ex- platforms. More recently, high throughput sequencing has been ample is the gene models (MB7) produced by the modENCODE recognized as an ideal solution for transcriptome profiling with consortium (http://www.modENCODE.org/). For robustness we base pair resolution. In D. melanogaster the 59-end SAGE method used both of these annotations as references throughout our has been combined with Illumina sequencing to generate a de- analysis of RNA-seq data. tailed profile of transcription start sites throughout development Read alignment with BLAT enabled the direct detection of (Ahsan et al. 2009). RNA-seq will enable interrogation of all tran- junction reads. These reads are essential to the identification of scripts, including potentially unannotated transcripts without the exon–exon junctions and alternate splicing. To maximize sensi- need for further EST characterization (Torres et al. 2008). We tivity for detecting junction reads we used two independent therefore propose that characterization of the D. melanogaster computational approaches (see Supplemental materials). Approx- transcriptome by RNA-seq technology will greatly facilitate on- imately 6% of all uniquely aligned reads were junction reads, of going annotation efforts. which, 94% are consistent with annotated FlyBase events, sug- gesting high concordance between experimental evidence and existing annotations. In total, 90.6% of annotated junctions Results (46,820) were observed with 93.6% consistency between the two approaches (see Supplemental Table 1; Supplemental Fig. 1). Ad- Extensive transcriptome profiling of D. melanogaster by RNA-seq ditionally, junction reads supported ;120,000 candidate novel In order to gain a broad sampling of the D. melanogaster tran- junctions. Analytical methods were used to reduce these to 7776 scriptome, RNA-seq experiments were performed at all stages of well-supported candidates (see Supplemental materials), of which the D. melanogaster life cycle. Poly(A)+ transcripts from 10 distinct 2012 (25.9%) are reported in the modENCODE MB7 gene models. stages during the live cycle of D. melanogaster were isolated to To determine the technical robustness of RNA-seq we con- generated cDNA libraries which were sequenced on the Illumina sidered the positive predictive value (PPV) and sensitivity of the GA II instrument. As shown in Table 1, 12 independent cDNA li- platform. The PPV of RNA-seq for expressed sequences is defined as braries were generated, including embryonic, larval, pupal, and the percent of uniquely aligned reads which overlap with anno- adult. Some libraries were staged as specific windows: 2–4-h em- tated transcripts. Across all samples, 97.6% of uniquely aligned bryo, 14–16-h embryo, third instar larva, 3-d pupa, and 17-d adult. reads overlap with annotated sequences, suggesting that RNA-seq is Additional libraries were derived from broadly staged mixed sam- highly specific to expressed sequences (Table 1). The sensitivity of ples: embryo, larvae, and pupa. Three-day-old male and female