Defining a Personal, Allele-Specific, and Single-Molecule Long-Read Transcriptome
Hagen Tilgner(a,1), Fabian Grubert(a,1), Donald Sharon(a,b,1), and Michael P. Snyder(a,2)

(a) Department of Genetics, Stanford University, Stanford, CA 94305-5120; and (b) Department of Molecular, Cellular, and Developmental Biology, Yale University, New Haven, CT 06511

Edited by Sherman M. Weissman, Yale University School of Medicine, New Haven, CT, and approved June 3, 2014 (received for review January 8, 2014)

Personal transcriptomes in which all of an individual’s genetic variants (e.g., single nucleotide variants) and transcript isoforms (transcription start sites, splice sites, and polyA sites) are defined and quantified for full-length transcripts are expected to be important for understanding individual biology and disease, but have not been described previously. To obtain such transcriptomes, we sequenced the lymphoblastoid transcriptomes of three family members (GM12878 and the parents GM12891 and GM12892) by using a Pacific Biosciences long-read approach complemented with Illumina 101-bp sequencing and made the following observations. First, we found that reads representing all splice sites of a transcript are evident for most sufficiently expressed genes ≤3 kb and often for genes longer than that. Second, we added and quantified previously unidentified splicing isoforms to an existing annotation, thus creating the first personalized annotation to our knowledge. Third, we determined SNVs in a de novo manner and connected them to RNA haplotypes, including HLA haplotypes, thereby assigning single full-length RNA molecules to their transcribed allele, and demonstrated Mendelian inheritance of RNA molecules. Fourth, we show how RNA molecules can be linked to personal variants on a one-by-one basis, which allows us to assess differential allelic expression (DAE) and differential allelic isoforms (DAI) from the phased full-length isoform reads. The DAI method is largely independent of the distance between exon and SNV, in contrast to fragmentation-based methods. Overall, in addition to improving eukaryotic transcriptome annotation, these results describe, to our knowledge, the first large-scale and full-length personal transcriptome.

personalized medicine | isoform sequencing | platform comparison | alternative splicing | allele-specific expression

GENETICS

Significance

RNA molecules of higher eukaryotes can be thousands of nucleotides long and are expressed from two distinct alleles, which can differ by single nucleotide variations (SNVs) in the mature RNA molecule. The de facto standard in RNA biology is short (≤101 bp) read sequencing, which, although very useful, does not cover the entire molecule in a read. We show that using amplification-free long-read sequencing one can often (i) cover the entire molecule, (ii) determine the allele it originated from, and (iii) record its entire exon-intron structure within a single read, thus producing a full-length, allele-specific view of an individual’s transcriptome. By enhancing existing gene annotations using long reads and quantifying this enhanced annotation using >100 million 101-bp paired-end reads, we overcome the smaller number of long reads.

Author contributions: H.T., F.G., D.S., and M.P.S. designed research; H.T., F.G., and D.S. performed research; H.T. analyzed data; and H.T. and M.P.S. wrote the paper.
Conflict of interest statement: M.P.S. is on the scientific advisory board of Personalis and GenapSys.
This article is a PNAS Direct Submission.
Freely available online through the PNAS open access option.
Data deposition: The sequences reported in this paper have been deposited in the NCBI Sequence Read Archive (accession no. SRP036136). Further data are available at http://stanford.edu/~htilgner/2014_PNAS_paper/utahTrio.index.html.
(1) H.T., F.G., and D.S. contributed equally to this work.
(2) To whom correspondence should be addressed. E-mail: [email protected].
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1400447111/-/DCSupplemental.
www.pnas.org/cgi/doi/10.1073/pnas.1400447111
PNAS | July 8, 2014 | vol. 111 | no. 27 | 9869–9874

Short-read RNA sequencing (1–7) is a widely used tool in modern day biology. In mammalian transcriptomes, multi-intron genes are common and the detection and quantification of different transcript isoforms is of high importance. Recent work has shown that reconstruction and quantification of transcript isoforms from short-read sequencing is insufficiently accurate (8, 9). Simultaneously, a number of research groups have pursued long-read sequencing (8, 10–12), and such datasets generally excel at connecting different exons, up to entire transcripts, with a compromise of lower sequencing depth. However, these studies have not investigated allelic variants. Such information is crucial for understanding personal transcriptomes and their potential biological consequences.

Here, we use the Pacific Biosciences (13) platform (“PacBio”) to produce a single-molecule RNA-seq dataset in the GM12878 cell line that is more comprehensive in both length and depth than a recently described dataset from a human organ panel (HOP) (11). We additionally sequenced cDNA from the same cell line by using 101-bp paired-end (PE) sequencing on the Illumina platform, to show the properties of genes that can be sequenced by using long-read, single-molecule transcriptome sequencing. We use previously unidentified isoforms revealed by long-read sequencing to produce an enhanced and personalized genome annotation, which we quantify by using 101-bp PE Illumina reads. Finally, by producing single-molecule transcriptomes for both parents of GM12878 (GM12891 and GM12892), we show that despite the higher error rate of the PacBio platform, single molecules can be attributed to the alleles from which they were transcribed, thereby generating accurate personal transcriptomes. This technique allows the assessment of biased allelic expression and isoform expression.

Results

Increased Full-Length Representation of RNA Molecules by Circular Consensus Reads. We sequenced ∼711,000 circular consensus reads (CCS) molecules from unamplified, polyA-selected RNA from the GM12878 cell line (see Fig. S1 for mapping statistics). We have recently shown that CCS often describe all splice sites of typical RNA molecules, although the success rate declines as RNA length increases (11). The CCS we sequenced here were significantly longer (average 1,188 bp, maximum 6 kb) than those in the HOP sample (average 999.9 bp; Fig. 1A). Both datasets showed equal representation of RNA molecules between 0.8 and 1.3 kb, but beginning at 1.3 kb and even more pronounced after 1.7 kb, the GM12878 sample represented longer RNA molecules more faithfully; RNA molecules of 2.7–4 kb were present in the GM12878 sample, but are essentially absent in the HOP sample (Fig. 1B). The distance from the 5′-mapping end to the nearest annotated transcription start site (TSS) dropped significantly (one-sided Wilcoxon rank sum test; P < 2.2e-16) in GM12878

than those that had a CSMM (Fig. 2B), showing that deeper sequencing of shorter reads detects more lowly expressed RNAs. Because CCS generation requires read length to exceed cDNA length by at least a factor of two, we hypothesized that CSMMs would not represent longer genes. Surprisingly, when calculating the number of base pairs of the longest mature and annotated transcript for each gene, we found this hypothesis to be wrong. Genes with and without a CSMM behaved largely similarly in terms of length. However, genes with a CSMM only very rarely represented genes smaller than 1 kb (Fig. 2C), which is likely due to the use of magnetic beads in the loading procedure, which disfavor short fragments. To derive an approximate predictive statement from the above observations, we calculated the fraction of genes that had at least one CSMM, for bins of gene lengths and 101-bp PE Cufflinks-derived expression. At FPKMs >10 and gene lengths ≥1 kb, 98% of genes receive a PacBio-CSMM when sequencing ∼711,000 CCS (Fig. 2D), whereas with an FPKM of >1, this percentage drops to 89%. When requiring at least 10 CSMMs (at FPKMs >10), which may be useful for quantitative analyses, this fraction drops further to 68% (Fig. 2E). For genome-annotation purposes, CSMMs representing all introns of an RNA molecule are useful. Sixty-three percent of CSMMs appear complete in that their first splice site is an annotated first splice site and that their last splice site is an annotated last splice site. By relaxing this criterion (SI Methods), ultimately 71% of CSMMs were classified as “full length.” For

[Figure 1 panel titles: (A) Read length; (B) Long molecules: GM12878 vs Hop; (C) Distance to annotated TSS; (D) GM12878 CSMM for BCKDK, Human organ panel CSMM for BCKDK, Gencode version 15 annotation for BCKDK]

Fig. 1. Increased length of CCS for the GM12878 sample. (A) Length distribution of CCS reads in the human organ panel (Hop; blue) (11) and CCS sequenced here for the GM12878 cell line (red). (B) Relative representation of molecules in length bins in the two samples. y axis is calculated as log[(number of GM12878-CCS in bin + 1)/(number of Hop-CCS in bin + 1)]. The red horizontal line gives the expected ratio, which is above 0, because of the increased sequencing depth in GM12878. (C) Distribution of distances for CSMMs between the 5′ end of the mapping and the closest annotated TSS of the same gene for both the Hop (blue) and the GM12878 (red) sample. (D) All CSMMs mapped to the BCKDK gene in the GM12878 cell line (red) and in the Hop sample (blue) as well as all Gencode15-annotated transcripts for this gene (black).
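The per-bin ratio plotted in Fig. 1B, log[(number of GM12878-CCS in bin + 1)/(number of Hop-CCS in bin + 1)], can be illustrated with a minimal Python sketch. The function names, the 100-bp bin width, and the logarithm base are assumptions made for illustration only; the text does not state the bin width or the base. The sketch also assumes that the "expected ratio" is the log of the ratio of total CCS counts, which is one plausible reading of the caption rather than a documented definition.

```python
import math
from collections import Counter


def length_bin_log_ratios(gm_lengths, hop_lengths, bin_width=100, log_base=2.0):
    """Return {bin_start_bp: log[(n_GM + 1) / (n_Hop + 1)]} per length bin.

    bin_width and log_base are illustrative assumptions, not values
    taken from the paper.
    """
    gm_counts = Counter(length // bin_width for length in gm_lengths)
    hop_counts = Counter(length // bin_width for length in hop_lengths)
    ratios = {}
    for bin_index in sorted(set(gm_counts) | set(hop_counts)):
        # The +1 pseudocount keeps the ratio finite when one sample has
        # no reads in a bin.
        ratio = (gm_counts[bin_index] + 1) / (hop_counts[bin_index] + 1)
        ratios[bin_index * bin_width] = math.log(ratio, log_base)
    return ratios


def expected_log_ratio(n_gm_total, n_hop_total, log_base=2.0):
    """Assumed baseline: log ratio of total read counts (sequencing depth)."""
    return math.log(n_gm_total / n_hop_total, log_base)


if __name__ == "__main__":
    # Toy read lengths in bp, invented for illustration.
    gm = [150, 150, 150, 250]
    hop = [150]
    print(length_bin_log_ratios(gm, hop))
    print(expected_log_ratio(len(gm), len(hop)))
```

Bins whose value lies above the baseline are those in which GM12878 is over-represented relative to what sequencing depth alone would predict.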
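The detection rates quoted in the Results (the fraction of genes with at least one, or at least 10, CSMMs above a given FPKM and gene length) amount to a thresholded proportion. The following Python sketch shows that calculation under stated assumptions: the `Gene` record, its field names, and the toy values are hypothetical, and the paper's actual binning procedure is not described in this text.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    length_bp: int   # length of the longest mature annotated transcript
    fpkm: float      # short-read expression estimate (e.g., from Cufflinks)
    n_csmm: int      # number of CSMMs assigned to the gene


def detection_fraction(genes, min_fpkm, min_length_bp=1000, min_csmm=1):
    """Fraction of genes with FPKM > min_fpkm and length >= min_length_bp
    that have at least min_csmm CSMMs. Returns None if no gene qualifies."""
    selected = [
        g for g in genes
        if g.fpkm > min_fpkm and g.length_bp >= min_length_bp
    ]
    if not selected:
        return None
    detected = sum(1 for g in selected if g.n_csmm >= min_csmm)
    return detected / len(selected)


if __name__ == "__main__":
    # Toy genes, invented for illustration only.
    genes = [
        Gene(length_bp=2000, fpkm=15.0, n_csmm=3),
        Gene(length_bp=1500, fpkm=12.0, n_csmm=0),
        Gene(length_bp=800, fpkm=50.0, n_csmm=5),
        Gene(length_bp=3000, fpkm=2.0, n_csmm=1),
    ]
    print(detection_fraction(genes, min_fpkm=10))
    print(detection_fraction(genes, min_fpkm=10, min_csmm=10))
```

Raising `min_csmm` from 1 to 10 corresponds to the stricter requirement that the text suggests for quantitative analyses.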