Long-Read Sequencing in Deciphering Human Genetics to a Greater Depth
Total Page:16
File Type:pdf, Size:1020Kb
Human Genetics (2019) 138:1201–1215 https://doi.org/10.1007/s00439-019-02064-y REVIEW Long‑read sequencing in deciphering human genetics to a greater depth Mohit K. Midha1,2 · Mengchu Wu3 · Kuo‑Ping Chiu1,2,4 Received: 22 May 2019 / Accepted: 13 September 2019 / Published online: 19 September 2019 © Springer-Verlag GmbH Germany, part of Springer Nature 2019 Abstract Through four decades’ development, DNA sequencing has inched into the era of single-molecule sequencing (SMS), or the third-generation sequencing (TGS), as represented by two distinct technical approaches developed independently by Pacifc Bioscience (PacBio) and Oxford Nanopore Technologies (ONT). Historically, each generation of sequencing technologies was marked by innovative technological achievements and novel applications. Long reads (LRs) are considered as the most advantageous feature of SMS shared by both PacBio and ONT to distinguish SMS from next-generation sequencing (NGS, or the second-generation sequencing) and Sanger sequencing (the frst-generation sequencing). Long reads overcome the limitations of NGS and drastically improves the quality of genome assembly. Besides, ONT also contributes several unique features including ultra-long reads (ULRs) with read length above 300 kb and some close to 1 million bp, direct RNA sequencing and superior portability as made possible by pocket-sized MinION sequencer. Here, we review the history of DNA sequencing technologies and associated applications, with a special focus on the advantages as well as the limitations of ULR sequencing in genome assembly. Background were frst reported (Maxam and Gilbert 1977; Sanger and Coulson 1975; Sanger et al. 1977). The maximum sequence DNA sequencing technologies have undergone massive length generated by Sanger sequencing is about 1 kb. changes and improvement during last four decades. It can be Sanger sequencing requires large amount of DNA sam- traced back to 1977 when Sanger partial chain termination ples generally produced by cloning target sequences into method and Maxam–Gilbert chemical degradation method vectors and amplifed by prokaryotic cells such as E. coli. Invention of polymerase chain reaction (PCR) created a tre- mendous opportunity for later biotechnologies. PCR using Electronic supplementary material The online version of this oligos immobilized on fow cells enables clonal amplif- article (https ://doi.org/10.1007/s0043 9-019-02064 -y) contains cation of templates which is essential for next-generation supplementary material, which is available to authorized users. sequencing (NGS) or the second-generation sequencing. * Kuo-Ping Chiu NGS comprises a number of robust technologies charac- [email protected] terized by parallel sequencing of massive DNA molecules. Mohit K. Midha There were four major NGS platforms made commercially [email protected] available in consecutive years: 454 system by Roche, GA/ Mengchu Wu Solexa system by Illumina, SOLiD system by ABI, and Ion [email protected] Torrent system by Life Technologies (Goodwin et al. 2016; Liu et al. 2012; Mardis 2013). Over the past decade, Illumina 1 Genomics Research Center, Academia Sinica, 128 Academia emerged as dominant provider of sequencers due to their Road, Sec. 2, Nankang District, Taipei 115, Taiwan lower cost, high speed and high yield. 2 Institute of Biochemistry and Molecular Biology, National Illumina has released a series of instruments to fulfll the Yang-Ming University, Taipei, Taiwan need of various output demands. Some of these sequencers 3 Health GeneTech, 22F No. 99, Xin Pu 6th St., Taoyuan, produce large amount of short reads (billions of reads could Taiwan be generated from a single run) of ≤ 300 bp for complex 4 Department of Life Sciences, College of Life Sciences, eukaryotic genomes or small microbial genomes in relatively National Taiwan University, Taipei, Taiwan Vol.:(0123456789)1 3 1202 Human Genetics (2019) 138:1201–1215 a short period of time (Liu et al. 2012). Due to this virtue, and NGS are either laborious or imprecise (Pang et al. 2010; Illumina sequencers have gained popularity in their appli- Quinlan and Hall 2012). Short paired-end reads, although cability and NGS has been widely deployed to explore vari- being able to ofer single base precision, are frequently una- ous domains of genomics including oncology, microbiology, ble to precisely map the repetitive regions (e.g., trinucleotide environmental genomics, metagenomics, etc., for biologi- repeats) (Tattini et al. 2015). On the other hand, SMS long- cal, medical, environmental, and agricultural studies (Ash- read length ofers a superior alternative for characterization ley 2016; Deurenberg et al. 2017; Gardy and Loman 2018; of CGRs (Chaisson et al. 2015; Huddleston et al. 2017; Giampaoli et al. 2018; Hoper et al. 2016; Scheben et al. Merker et al. 2018; Spies et al. 2017). 2017; Yuan et al. 2017). However, the read length remains Transcriptome data analysis also encounters similar con- a bottleneck for biological studies (Ulahannan et al. 2013). strains of short reads as that of in genome assembly. Short reads are often unable to infer the full-length RNA tran- scripts or to precisely determine specifc isoforms (Bayega Limitations of NGS et al. 2018; Martin and Wang 2011). Because of this prob- lem with short reads, studies were unable to fully address NGS is advantageous in many aspects such as low cost, high gene regulation, protein-coding potential of genome and speed and high yield and has been intensively employed in phenotypic diversity. various biological analyses during the past 15 years. Several NGS is also limited by its incapability of direct sequenc- studies have been conducted to solve same biological ques- ing of RNA and epigenetic/methylation markers. RNA tion with diferent NGS and TGS methods. These works, sequencing by NGS requires conversion of RNA molecules however, have led to the discovery of a number of intrinsic to corresponding cDNA molecules and then sequenced as limitations of NGS which may have signifcant impact on DNA. This procedure is seen in all aspects of biological the accuracy of biological studies. Among the bottlenecks, studies, especially transcription of protein-coding genes short-read length is the most noticeable drawback for NGS and non-coding genes (Costa-Silva et al. 2017). Epigenetic sequencing. This limits the precision of many biological modifcation plays a vital role in regulation of eukaryotic studies, especially genome assembly and transcriptome gene expression. Previous DNA methylation largely relies analysis. on 5-methylcytosine (5mC) bisulfte conversion. Although For genome assembly, to deduce the genome sequences genome-wide 5mC profling became feasible by NGS, it is from billions of short reads, one has to face computa- still limited by uneven coverage as well as sequencing and tional challenges resulted from genomic complexity, time mapping artifacts (Nair et al. 2018; Smith et al. 2009; War- and hardware limitations. These challenges have become necke et al. 2002). a critical issue for large genome assembly, because short Moreover, short-read sequencing normally involves the reads often result in highly fragmented assemblies resulted usage of large equipment and laborious experimental pro- from unsolvable repetitive regions or regions with high GC cedures and bioinformatics analysis and thus unable to meet content (Petersen et al. 2017; Salzberg and Yorke 2005; the need for fast feld testing. The process from transporta- Schmutz et al. 2004). Using short reads in analysis of seg- tion of biological samples from one place to another, prepa- mental duplication, structural variations (SVs) or paralogous ration of sequencing libraries, sequencing and data analysis regions may result in a signifcant number of false positives may take a long time (Quick et al. 2016). (Guan and Sung 2016; Treangen and Salzberg 2011). This issue was empirically addressed by a number of reports. Despite the advances in sequencing technologies and bio- Development of single‑molecule sequencing informatics, de novo assembly of large genomes remains challenging (Chin et al. 2014). Shortcomings of NGS fostered the development of sin- Complexities resulting from heterozygosity, transposable gle-molecule sequencing (SMS), or the third-generation elements, GC-rich regions, tandem repeats and interspersed sequencing (TGS). Evolvement of SMS resulted in an repetitive regions of 10 kb–10 Mb or more in genome remain increase of 1–2 orders of magnitude in read length over unresolved by NGS short reads (Alkan et al. 2011). Sequenc- Sanger sequencing as achieved by Pacific Biosciences ing of polymorphic tandem repeats in the genome by NGS (PacBio) and Oxford Nanopore Technologies (ONT), with a is severely impaired by read length (Mousavi et al. 2018), small portion of reads by the latter, may achieve an increase and the read length of 100–250 bases used for determining of 3 orders of magnitude (Giordano et al. 2017; Jain et al. the TR expansion might be inaccurate (Bahlo et al. 2018). 2018; Judge et al. 2015; Weirather et al. 2017). For sequencing complex chromosomal rearrangements and The development of SMS has been phenomenal, as a structural variants, previous analytical approaches such as number of crucial technologies being developed using fuorescent in situ hybridization (FISH), array CGH, PCR cross-disciplinary expertise. Firstly, pores (for Nanopore 1 3 Human Genetics (2019) 138:1201–1215 1203 sequencing) or wells (for PacBio sequencing) at micron variations and