Exome Versus Transcriptome Sequencing in Identifying Coding Region Variants
THEMED ARTICLE: Genetic & Genomics Applications
Review

Expert Rev. Mol. Diagn. 12(3), 241–251 (2012)
10.1586/ERM.12.10 © 2012 Expert Reviews Ltd ISSN 1473-7159

Chee-Seng Ku*1, Mengchu Wu1, David N Cooper2, Nasheen Naidoo3, Yudi Pawitan4, Brendan Pang5, Barry Iacopetta6 and Richie Soong1

1 Cancer Science Institute of Singapore (CSI Singapore), #12-01, MD6, Centre for Translational Medicine, NUS Yong Loo Lin School of Medicine, National University of Singapore, 14 Medical Drive, 117599, Singapore
2 Institute of Medical Genetics, School of Medicine, Cardiff University, Cardiff, UK
3 Saw Swee Hock School of Public Health, National University of Singapore, Singapore
4 Department of Medical Epidemiology & Biostatistics, Karolinska Institutet, Stockholm, Sweden
5 Department of Pathology, National University Health System, Singapore
6 School of Surgery, The University of Western Australia, WA, Australia
*Author for correspondence: Tel.: +65 81388095 Fax: +65 68739664 [email protected]

The advent of next-generation sequencing technologies has revolutionized the study of genetic variation in the human genome. Whole-genome sequencing currently represents the most comprehensive strategy for variant detection genome-wide but is costly for large sample sizes, and variants detected in noncoding regions remain largely uninterpretable. By contrast, whole-exome sequencing has been widely applied in the identification of germline mutations underlying Mendelian disorders, somatic mutations in various cancers and de novo mutations in neurodevelopmental disorders. Since whole-exome sequencing focuses upon the entire set of exons in the genome (the exome), it requires additional exome-enrichment steps compared with whole-genome sequencing. Although the availability of multiple commercial exome-enrichment kits has made whole-exome sequencing technically feasible, it has also added to the overall cost. This has led to the emergence of transcriptome (or RNA) sequencing as a potential alternative approach to variant detection within protein coding regions, since the transcriptome of a given tissue represents a quasi-complete set of transcribed genes (mRNAs) and other noncoding RNAs. A further advantage of this approach is that it bypasses the need for exome enrichment. Here we discuss the relative merits and limitations of these approaches as they are applied in the context of variant detection within gene coding regions.

KEYWORDS: exome • exome enrichment • next-generation sequencing • single nucleotide variants • transcriptome

The advent of next-generation sequencing (NGS) technologies has revolutionized our approach to performing structural and functional genomics studies [1,2]. The detection and characterization of genetic variation (ranging from single-nucleotide variants [SNVs] and small insertions and deletions [indels] to larger structural rearrangements) in the human genome have been greatly facilitated by NGS technologies such as whole-genome sequencing (WGS) [3–5]. This has also driven the 1000 Genomes Project, which, upon completion, aims to provide a comprehensive map of human genetic variants. Findings from the pilot phases of this project have already provided new insights into the nature and extent of human genetic variation [6]. However, this undertaking is well beyond the technical and financial capabilities of individual laboratories. The high cost of WGS (in relation to sequencing, data storage and analysis), together with the challenges inherent in analyzing and interpreting variants detected in noncoding regions [7–9], have now made whole-exome sequencing (WES) a more popular approach in the context of variant detection [10–13]. WES has been applied to the detection of both germline and somatic variants [14,15]. As most of the disease-causing mutations in Mendelian disorders reside within gene coding regions, this has promoted the use of WES in unraveling new causal variants for these disorders [16,17]. This approach has also been widely employed in attempts to identify the somatic driver mutations within the exomes of various cancers [18]. WES has higher sensitivity and specificity for detecting SNVs than small indels [19,20]. In addition, WES also allows the detection of larger copy-number variations using depth of coverage from mapped short-sequence reads through the development of appropriate bioinformatics tools [21].

Although NGS technologies have been available since 2005, the isolation and enrichment of the entire set of all exons in the human genome (the exome) was not technically feasible until the development of commercial high-throughput exome-enrichment kits [22,23]. However, the cost of the exome-enrichment step, which constitutes a substantial proportion of the total cost of WES, represents a ‘bottleneck’ that impedes the scale-up of WES to large sample sizes. More recently, the cost of sequencing has fallen rapidly owing to the increasing throughput of sequencing data (up to hundreds of gigabases) per instrument run by the latest sequencing platforms. As a result, multiple samples (up to tens of exomes) can now be multiplexed to avoid redundant sequencing while still achieving adequate sequencing depth. This is known as post-hybridization sample multiplexing or barcoding. By contrast, this barcoding protocol became available for the exome-enrichment steps only comparatively recently [24,25]; although it should further decrease the cost of exome enrichment and/or WES, its technical performance and effect on sequencing data from the sample barcoding in these prehybridization steps have not yet been tested experimentally by the end user.

To further optimize the cost-effectiveness of variant detection within coding regions, transcriptome or RNA sequencing (RNA-seq) has been proposed as a potential substitute for WES [26,27]. From a theoretical standpoint, this approach represents a promising alternative since, by definition, the transcriptome comprises all transcripts for both coding RNAs (i.e., mRNAs) and noncoding RNAs in a given tissue. Hence, RNA-seq would also be able to detect variants within the coding regions [28–30]. In addition, the use of RNA-seq bypasses the need for exome-enrichment steps, thereby rendering this approach more cost effective than WES. It would also obviate the need for target-probe hybridization steps and the technical limitations during exome enrichment. For example, owing to the uneven capture efficiency across exons experienced when using available exome-enrichment kits, capture of all exons is incomplete. Moreover, some sequence reads map outside the targeted regions (‘off-target hybridization’), leading to the production of unusable sequence reads for downstream analysis [22,23,31,32]. However, the application of RNA-seq in this context is not without its shortcomings and limitations. It is important to bear in mind that the transcriptome is tissue specific, so the set of genes transcribed varies between tissue types. As a result, sequencing a transcriptome from a specific tissue would be incapable of capturing all variants in the exome; hence, the transcriptome of a specific tissue is invariably and unavoidably

detection in coding regions, highlight their respective pros and cons, and make recommendations with regard to which approach to use in different circumstances.

High-throughput sequencing technologies

Currently available NGS technologies, such as the Illumina® HiSeq™ and Life Technologies™ SOLiD4™, are able to generate hundreds of millions of short sequence reads (50–125 bp) totaling several hundred gigabases of sequencing data per instrument run. By contrast, the Roche 454 GS FLX produces approximately 1 million longer sequence reads (500 bp). These sequencing technologies have been widely used in various studies ranging from large-scale targeted sequencing of candidate genes to WES and WGS. However, owing to the large number of sequence reads generated by HiSeq and SOLiD, these platforms are more suitable for RNA-seq, which requires millions of reads for applications such as profiling expression levels [33,34]. In terms of the accuracy of variant detection, all three NGS technologies have higher raw base error rates than Sanger sequencing. However, this can be improved through deeper sequencing to achieve a higher consensus accuracy rate. For WES (or WGS) of genomic DNA, an average sequencing depth of 30–50× is usually deemed to be sufficient to detect most germline SNVs accurately. However, greater sequencing depth would be needed to detect somatic point mutations in primary cancer tissue in order to allow for tissue contamination and genetic heterogeneity within the tissue [35,36]. By contrast, the sequencing depth or coverage of RNA-seq is difficult to estimate, because calculation of the coverage of the transcriptome is less straightforward given that the true number and level of different transcript isoforms is not usually known. Moreover, transcriptional activity varies greatly between genes (low- vs high-abundance transcripts) [27]. Although NGS technologies have greater specificity to detect both germline and somatic SNVs, further validation using Sanger sequencing is still common practice.

The arrival of third-generation sequencing (TGS) technologies, such as the true single-molecule sequencing (Helicos Biosciences) and single-molecule real-time