UNIVERSITY OF COPENHAGEN
DEPARTMENT OF BIOLOGY

Ph.D. Thesis

Zonghui Peng

Comparative Studies of NGS Assays and Sequencing Technologies

Supervisor: Karsten Kristiansen
University of Copenhagen, Denmark

The PhD School of Science, Faculty of Science, University of Copenhagen
Submitted in November 2020

Name of department: Department of Biology
Author(s): Zonghui Peng
Title: Comparative Studies of NGS Assays and Sequencing Technologies
Comparison of NGS sequencing technologies, metagenomics assay and FFPE transcriptome assay
Supervisor: Karsten Kristiansen
Submitted on: 19 November 2020

ACKNOWLEDGMENTS

Firstly, I would like to thank my supervisor, Prof. Karsten Kristiansen, for giving me this valued opportunity to perform my Ph.D. study at the University of Copenhagen, and for all his guidance, help and motivation.

Secondly, many thanks to my colleagues at BGI and the University of Copenhagen for all their help. Thanks to Zhijiao Wang, Charles Bao, Awei Jiang, Xianting Yan, Guangbiao Wang and Meifang Tan for help with bench work. Thanks also to Xiaolong Zhu, Jintu Wang and Fei Teng for their bioinformatics support.

Thirdly, I would also like to thank all my co-authors for their comments and guidance on study design, sample retrieval, data analysis, and interpretation.

Finally, my sincere thanks to my family and friends for supporting me throughout this whole process and for making countless sacrifices to help me get to this point. Without their full support, I doubt I would have kept on until now.

TABLE OF CONTENTS

ABBREVIATIONS
ABSTRACT
1. INTRODUCTION
1.1 Next-generation sequencing technologies
1.1.1 Illumina Sequencing
1.1.2 DNBseq Sequencing
1.2 Metagenomics assay comparative analysis
1.2.1 Sample storage methods
1.2.2 Extraction methods
1.2.3 Library preparation protocols
1.2.4 Sample inputs
1.3 FFPE RNAseq assay comparative analysis
1.3.1 DSN method
1.3.2 Ribo-Zero method
1.3.3 RNA Access method
1.4 Objectives
2. LIST OF PAPERS
3. SUMMARY OF RESULTS
4. DISCUSSION
5. CONCLUSION
6. FUTURE PERSPECTIVES
7. REFERENCE
8. APPENDIX

ABBREVIATIONS

cPAL     Combinatorial Probe-Anchor Ligation
cPAS     Combinatorial Probe-Anchor Synthesis
CG       Complete Genomics
CNV      Copy number variant
dNTPs    Deoxynucleoside triphosphates
DNB      DNA nanoball
dsDNA    Double-stranded DNA
DSN      Duplex-Specific Nuclease
emQ      Empirical base quality
FFPE     Formalin-fixed paraffin-embedded
FISH     Fluorescence in situ hybridization
GA       Genome Analyzer
GEP      Gene expression profiling
GIAB     Genome in a Bottle
HMP      Human Microbiome Project
InDels   Insertions and deletions
KH       KAPA Hyper Prep Kit
LFR      Long Fragment Read
MetaHIT  Metagenomics of the Human Intestinal Tract
ncRNA    Non-coding RNA
NGS      Next-generation sequencing
NIH      National Institutes of Health
OM       Mag-Bind® Universal Metagenomics Kit
PCR      Polymerase chain reaction
PTP      Picotiter Plate
QP       DNeasy PowerSoil Kit
RCR      Rolling circle replication
RIN      RNA Integrity Number
rRNA     Ribosomal RNA
SBS      Sequencing-by-synthesis
SNP      Single-nucleotide polymorphism
SNV      Single-nucleotide variant
SOLiD    Sequencing by Oligonucleotide Ligation and Detection
ssDNA    Single-stranded DNA
SV       Structural variant
Gb       Gigabase
Tb       Terabase
TP       TruePrep DNA Library Prep Kit V2

ABSTRACT

As next-generation sequencing (NGS) has matured, several NGS technologies have been developed and launched in the last few years, and NGS-based applications such as metagenomics and RNAseq have received considerable attention. In this Ph.D. project, we first compared the performance of the Illumina and DNBseq platforms using the most commonly used reference sample, the Genome in a Bottle (GIAB) sample NA12878. Our findings indicate that DNBseq data yield single-nucleotide polymorphism (SNP) calling accuracy and copy number variant (CNV) detection comparable to Illumina data. However, for insertions and deletions (InDels), we found lower accuracy for Illumina than for DNBseq data.
In addition, our study showed that DNBseq can be a more cost-effective platform for whole-genome sequencing (WGS) than the Illumina platform.

We then conducted a comparative analysis of metagenomics assays, because many factors may affect studies of the human microbiome, including the specimen status after preservation, the quality of the extracted DNA, the library preparation protocol, and the sample DNA input. Based on our study, we recommend a combined protocol for metagenomics studies: the Mag-Bind® Universal Metagenomics Kit (OM) method plus the KAPA Hyper Prep Kit (KH) protocol, with a suitable DNA quantity, on either fresh or freeze-thaw samples. Our findings provide clues to the variation introduced by different DNA extraction methods, library protocols, and sample DNA inputs, which is critical for consistent and comprehensive profiling of the human gut microbiome.

Finally, to identify suitable methods and provide a benchmark for formalin-fixed paraffin-embedded (FFPE) RNAseq, we investigated three major library construction methods: Duplex-Specific Nuclease (DSN), TruSeq Ribo-Zero (Ribo-Zero), and TruSeq RNA Access (RNA Access). Based on our results, we recommend RNA Access for the analysis of mRNA expression, noting that non-coding RNA (ncRNA) can also be detected by this method. By contrast, our analyses indicated that the DSN protocol would be the preferred choice for the analysis of ncRNA, and our results also provided evidence that DSN would be preferable for SNV calling using FFPE samples.

1. INTRODUCTION

1.1 Next-generation sequencing (NGS) technologies

Back in 1977, Sanger and colleagues from the Medical Research Council Laboratory of Molecular Biology published the first-generation sequencing technology (Sanger et al., 1977), now known as Sanger sequencing.
Because Sanger sequencing offered higher throughput than the alternative method of Maxam and Gilbert (Maxam and Gilbert 1977), it was broadly applied for many years, and the Human Genome Project was completed using Sanger sequencing, thereby providing researchers with the first human reference genome. However, the capacity of Sanger sequencing was still limited, and the method was rather expensive for whole-genome analyses. With rapidly increasing demand for sequencing studies, new high-throughput sequencing technologies were developed and released commercially, such as the 454 sequencing technology (now Roche; Figure 1), whose first next-generation sequencing instruments were launched in 2005 (Margulies, Egholm et al. 2005).

Figure 1. Workflow of 454 sequencing. (A) Library preparation using 454-specific adapters. (B) Emulsion PCR using streptavidin-coated beads. (C) Loading onto the PTP (picotiter plate) device after the single-stranded template DNA library has been constructed. (D) Pyrosequencing based on sequencing-by-synthesis. Reprinted from Mardis, E. (2008). The impact of next-generation sequencing technology on genetics. Trends in Genetics 24(3), 133-141.

Later, Sequencing by Oligonucleotide Ligation and Detection (SOLiD) was developed and launched by Applied Biosystems (now Thermo Fisher) in 2006 (Figure 2), and the first Solexa (now Illumina) sequencer, the Genome Analyzer (GA), was launched in the same year.

Figure 2. The sequencing principle of SOLiD sequencing. The main steps are: (1) primer binds to template DNA; (2) probe hybridization and ligation; (3) fluorescence measurement; (4) dye-end nucleotides cleaved. Reprinted from Voelkerding, K., Dames, S., Durtschi, J. (2009). Next-generation sequencing: from basic research to diagnostics. Clin Chem 55(4):641-58.
1.1.1 Illumina Sequencing

In the mid-to-late 1990s, Balasubramanian and Klenerman at the University of Cambridge developed the sequencing-by-synthesis (SBS) technology, and in 1998 they founded Solexa to commercialize it. Solexa launched its first sequencer, the Genome Analyzer (GA), in 2006. It attracted considerable interest because its capacity reached one gigabase, a large data output at that time compared with other NGS sequencers. In 2007, Illumina purchased Solexa to acquire its sequencing technology, and Illumina has since built higher-throughput, larger-capacity sequencers based on the SBS technology, such as the GA II, HiSeq 2000 and others.

Put simply, the Solexa/Illumina sequencing technology is similar to Sanger sequencing because both methods apply the same concept of chain-terminating inhibitors. However, Solexa/Illumina uses modified deoxynucleoside triphosphates (dNTPs) that carry a reversible terminator, which blocks further extension in each round of polymerization, and all four reversible-terminator dNTPs (A, T, G, and C) are separate molecules, minimizing base incorporation bias, which means only