UNIVERSITY OF COPENHAGEN, DEPARTMENT OF BIOLOGY

Ph.D. Thesis

Zonghui Peng

Comparative Studies of NGS Assays and Sequencing Technologies

Supervisor:

Karsten Kristiansen

University of Copenhagen, Denmark

The PhD School of Science, Faculty of Science, University of Copenhagen Submitted on November 2020

Name of department: Department of Biology

Author(s): Zonghui Peng

Title: Comparative Studies of NGS Assays and Sequencing Technologies

Comparison of NGS sequencing technologies, metagenomics assay and FFPE transcriptome assay

Supervisor: Karsten Kristiansen

Submitted on: 19 November 2020

ACKNOWLEDGMENTS

Firstly, I would like to thank my supervisor, Prof. Karsten Kristiansen, for giving me the valued opportunity to carry out my Ph.D. study at the University of Copenhagen, and for all his guidance, help, and motivation. Secondly, many thanks to my colleagues at BGI and the University of Copenhagen for all their help. Thanks to Zhijiao Wang, Charles Bao, Awei Jiang, Xianting Yan, Guangbiao Wang, and Meifang Tan for help with bench work, and to Xiaolong Zhu, Jintu Wang, and Fei Teng for their bioinformatics support. Thirdly, I would like to thank all my co-authors for their comments and guidance on study design, sample retrieval, data analysis, and interpretation. Finally, my sincere thanks go to my family and friends for supporting me throughout this whole process and for making countless sacrifices to help me reach this point. Without their full support, I doubt I would have kept going until now.

TABLE OF CONTENTS

ABBREVIATIONS
ABSTRACT
1. INTRODUCTION
1.1 Next-generation sequencing technologies
1.1.1 Illumina Sequencing
1.1.2 DNBseq Sequencing
1.2 Metagenomics assay comparative analysis
1.2.1 Sample storage methods
1.2.2 Extraction methods
1.2.3 Library preparation protocols
1.2.4 Sample inputs
1.3 FFPE RNAseq assay comparative analysis
1.3.1 DSN method
1.3.2 Ribo-Zero method
1.3.3 RNA Access method
1.4 Objectives
2. LIST OF PAPERS
3. SUMMARY OF RESULTS
4. DISCUSSION
5. CONCLUSION
6. FUTURE PERSPECTIVES
7. REFERENCE
8. APPENDIX

ABBREVIATIONS

cPAL Combinatorial Probe-Anchor Ligation
cPAS Combinatorial probe anchored synthesis
CG Complete Genomics
CNV Copy number variant
dNTPs Deoxynucleoside triphosphates
DNB DNA nanoball
dsDNA Double-stranded genomic DNA
DSN Duplex-Specific Nuclease
emQ Empirical base quality
FFPE Formalin-fixed paraffin-embedded
FISH Fluorescence in situ hybridization
GA Genome Analyzer
GEP Gene expression profiling
GIAB Genome in a Bottle
HMP Human Microbiome Project
InDels Insertions and deletions
KH KAPA Hyper Prep Kit
LFR Long Fragment Read
MetaHIT Metagenomics of the Human Intestinal Tract
ncRNA Non-coding RNA
NGS Next-generation sequencing
NIH National Institutes of Health
OM Mag-Bind® Universal Metagenomics Kit
PCR Polymerase chain reaction
PTP Picotiter Plate
QP DNeasy PowerSoil Kit
RCR Rolling circle replication
RIN RNA Integrity Number
rRNA Ribosomal RNA
SBS Sequencing-by-synthesis
SNP Single-nucleotide polymorphism
SNV Single nucleotide variant
SOLiD Sequencing by Oligonucleotide Ligation and Detection
ssDNA Single-stranded DNA
SV Structural variant
Gb Gigabase
Tb Terabase
TP TruePrep DNA Library Prep Kit V2

ABSTRACT

With the development of next-generation sequencing (NGS), several NGS technologies have been developed and launched in the last few years, and NGS-based applications such as metagenomics and RNAseq have received considerable attention.

In this Ph.D. project, we first compared the performance of the Illumina and DNBseq platforms using the most commonly used reference sample, the Genome-In-A-Bottle (GIAB) sample (NA12878). Our findings indicate that single-nucleotide polymorphism (SNP) calling accuracy and copy number variant (CNV) detection are comparable between DNBseq and Illumina data. However, for insertions and deletions (InDels), we found lower accuracy for Illumina than for DNBseq data. In addition, our study showed that DNBseq can be a more cost-effective platform for WGS than the Illumina platform.

We then conducted a comparative analysis of metagenomics applications, as many factors may affect studies of the human microbiome, including the specimen status after preservation, the quality of the extracted DNA, the library preparation protocol, and the sample DNA input. Based on our study, we recommend a combined protocol for metagenomics studies: the Mag-Bind® Universal Metagenomics Kit (OM) extraction method plus the KAPA Hyper Prep Kit (KH) library protocol, with a suitable DNA quantity, on either fresh or freeze-thaw samples. Our findings provide clues to the variation that can arise from different DNA extraction methods, library protocols, and sample DNA inputs, which are critical for consistent and comprehensive profiling of the human gut microbiome.

Finally, to identify suitable methods and provide a benchmark for formalin-fixed paraffin-embedded (FFPE) RNAseq, we investigated three major library construction methods: Duplex-Specific Nuclease (DSN), Truseq Ribo-Zero (Ribo-Zero), and Truseq RNA Access (RNA Access). Based on our results, we recommend RNA Access for the analysis of mRNA expression, noting that non-coding RNA can also be detected by this method. By contrast, our analyses indicated that the DSN protocol is the preferred choice for the analysis of ncRNA, and our results also provided evidence that DSN is preferable for SNV calling using FFPE samples.

1. INTRODUCTION

1.1 Next-generation sequencing (NGS) technologies

In 1977, Sanger and colleagues at the Medical Research Council Laboratory of Molecular Biology published the first-generation sequencing technology (Sanger et al., 1977), now known as Sanger sequencing. Because Sanger sequencing offered higher throughput than the alternative method of Maxam and Gilbert (Maxam and Gilbert 1977), it was broadly applied for many years. The Human Genome Project was completed using Sanger sequencing and thereby provided researchers with the first human reference genome. However, the capacity of Sanger sequencing was still limited, and the method was rather expensive for whole-genome analyses. With rapidly increasing demand for sequencing studies, new high-throughput sequencing technologies were developed and released commercially. The first next-generation sequencing instrument, based on the 454 sequencing technology (now Roche, figure 1), was launched in 2005 (Margulies, Egholm et al. 2005).

Figure 1. Workflow of 454 sequencing. (A) Library preparation using 454-specific adapters. (B) Emulsion PCR using streptavidin-coated beads. (C) Loading onto the PTP (picotiter plate) device after the single-stranded template DNA library was constructed. (D) Pyrosequencing based on sequencing-by-synthesis. Reprinted from Mardis, E. (2008). The impact of next-generation sequencing technology on genetics. Trends in Genetics 24(3), 133-141.

Later, Sequencing by Oligonucleotide Ligation and Detection (SOLiD) was developed and launched by Applied Biosystems (now Thermo Fisher) in 2006 (figure 2), and the first Solexa (now Illumina) sequencer, the Genome Analyzer (GA), was launched in the same year.

Figure 2. The sequencing principle of SOLiD sequencing. The main steps are (1) primer binding to the template DNA, (2) probe hybridization and ligation, (3) fluorescence measurement, and (4) cleavage of the dye-end nucleotides. Reprinted from Voelkerding, K., Dames S., Durtschi, J (2009). Next-generation sequencing: from basic research to diagnostics. Clin Chem 55(4):641-58.

1.1.1 Illumina Sequencing

During the mid to late 1990s, Balasubramanian and Klenerman at the University of Cambridge developed the sequencing-by-synthesis (SBS) technology, and in 1998 they founded the company Solexa to commercialize it. Solexa brought the first version of its sequencer, the Genome Analyzer (GA), to the market in 2005. It attracted considerable interest because its capacity reached one gigabase, a large data output at that time compared with other NGS sequencers. In 2007, Illumina purchased Solexa to acquire its sequencing technology, and Illumina has subsequently built higher-throughput, larger-capacity sequencers based on the SBS technology, such as the GA II, the HiSeq 2000, and others. In simple terms, the Solexa/Illumina sequencing technology is quite similar to Sanger sequencing because both methods apply the concept of chain-terminating inhibitors. However, Solexa/Illumina uses modified deoxynucleoside triphosphates (dNTPs) that carry a reversible terminator, which blocks further polymerization after each incorporation. All four reversible-terminator dNTPs (A, T, G, and C) are present as separate molecules, which minimizes base incorporation bias. Only a single base can therefore be incorporated by the polymerase in each cycle, and each base is "read" in one sequencing cycle (Figure 3).

Figure 3. The workflow of Illumina sequencing. The main steps comprise (A) Library preparation. (B) Cluster generation. (C) Sequencing via SBS. (D) Base calling.

Because a large number of DNA templates are sequenced simultaneously on the solid-phase array, the Solexa/Illumina instrument generates a large output of sequencing data. Thanks to continuous improvement of the flow cell density design and upgrades of the base-calling detection system, Illumina has kept launching new sequencers (table 1). Notably, in 2019 Illumina launched the NovaSeq 6000 sequencer, which enables 6 terabases (Tb) of data to be generated in one sequencing run. Compared with the first SBS-based sequencer, the Genome Analyzer (~1 Gb per run, 2005), the output has increased 6,000-fold. Furthermore, the running time has been shortened continuously while maintaining high-quality read output. Within 15 years since 2004, data generation using the Solexa/Illumina sequencing technology has thus increased substantially.

Table 1. Instrument metrics of different Illumina sequencers

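Sequencing platform | Model | Max read length (bp) | Max output (Gb) | Running time | Accuracy
Illumina | MiSeq | 2*300 | 15 | 56 hrs | Mostly >Q30
Illumina | NextSeq 550 | 2*150 | 120 | 29 hrs | Mostly >Q30
Illumina | HiSeq 3000 | 2*150 | 750 | 3.5 day | Mostly >Q30
Illumina | HiSeq 4000 | 2*150 | 1,500 | 3.5 day | Mostly >Q30
Illumina | NovaSeq 6000 (S4) | 2*150 | 6,000 | 1.9 day | Mostly >Q30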

However, with increasing data output, some data quality may be compromised. For example, the Illumina NextSeq 500/550 changed the imaging system from four colors to two colors, which speeds up image capture. It also means that the instrument cannot easily distinguish "no signal" from a G base: we cannot tell whether a call is a real G base or a missing signal, because the Q score still indicates "high" quality. Andrews (https://sequencing.qcfail.com/articles/illumina-2-colour-chemistry-can-overcall-high-confidence-g-bases/) reported this issue based on preliminary checks with the FastQC program (figure 4), which clearly show that the base quality at the 3' end of reads is higher than at the 5' end. This makes read trimming difficult: such reads should be removed, but they are treated as "qualified" reads in downstream analysis, which can create problems for variant calling.
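The ambiguity between "no signal" and a G base can be illustrated with a minimal sketch. It assumes the two-channel encoding as it is commonly described for NextSeq-type instruments (A in both channels, C in one, T in the other, G in neither); the function name and the boolean signal representation are hypothetical and are not taken from any Illumina software.

```python
# Illustrative two-channel base decoding (assumed encoding, not vendor code).
# Each cycle yields two on/off signals: (red, green).
TWO_CHANNEL_CODE = {
    (True, True): "A",    # signal in both channels
    (True, False): "C",   # red only
    (False, True): "T",   # green only
    (False, False): "G",  # no signal at all
}


def decode_two_channel(signals):
    """Translate a list of (red, green) booleans into a base string."""
    return "".join(TWO_CHANNEL_CODE[(bool(r), bool(g))] for r, g in signals)


# A cluster that stops emitting light after three cycles is still "called":
healthy = [(True, True), (True, False), (False, True)]
faded = [(False, False)] * 4
read = decode_two_channel(healthy + faded)
print(read)  # ACTGGGG
```

In this toy model the faded cluster produces a run of G calls that cannot be told apart from a genuine poly-G stretch by the decoded sequence alone.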


Figure 4. The FastQC result for a NextSeq sequencer. Read quality deteriorates as expected during the mid-cycles of sequencing, yet bioinformatics tools cannot identify the affected reads as low quality. Reprinted from Andrews, S. (2016). Illumina 2 colour can overcall high confidence G bases. Available online: https://sequencing.qcfail.com/. (accessed on 2 Dec 2019).

Recently, Illumina launched the NovaSeq sequencer, which Illumina claims is the highest-capacity sequencer marketed so far. This system is also based on two-color imaging, although Illumina claims that the sequencing chemistry is "improved". Based on the empirical base quality (emQ) calculated by the GATK BQSR method (www.broadinstitute.org), Li (https://lh3.github.io/2017/07/24/on-nonvaseq-base-quality) reported that the base quality of NovaSeq data is over-estimated compared with Illumina's older HiSeq 2500 instrument. This indicates that the real data quality of NovaSeq is lower than that of the HiSeq 2500 for the same Genome in a Bottle (GIAB) sample (NA12878). The over-estimation was also found in read 2 of NovaSeq (Figure 5), possibly because of random noise detected by the sequencer, instrument- and run-specific systematic bias, or chemistry errors during base synthesis.
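The relationship between an observed error rate and a Phred-scaled quality can be sketched as follows. This is only the standard Phred formula, Q = -10 * log10(p), applied to mismatch counts; it is not the GATK BQSR procedure itself, and the function name, the pseudocount, and the example counts are assumptions made for illustration.

```python
import math


def empirical_quality(mismatches, total_bases):
    """Phred-scaled quality from observed mismatches: Q = -10 * log10(p)."""
    if total_bases <= 0:
        raise ValueError("total_bases must be positive")
    # Add a pseudocount so zero mismatches does not give infinite quality.
    error_rate = (mismatches + 1) / (total_bases + 2)
    return -10.0 * math.log10(error_rate)


# Bases reported as Q30 claim 1 error in 1,000. If 5 errors per 1,000 are
# actually observed, the empirical quality is lower than reported.
reported_q = 30
observed_q = empirical_quality(mismatches=5, total_bases=1000)
print(f"reported Q{reported_q}, empirical Q{observed_q:.1f}")
```

A reported quality that sits above the empirical value for the same bases is what is meant by over-estimation in the paragraph above.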


Figure 5. The emQ results for the HiSeq 2500, HiSeq X Ten, and NovaSeq sequencers. (A)-(C) emQ at each cycle for NovaSeq, HiSeq X Ten, and HiSeq 2500, respectively. In read 2, the emQ of NovaSeq drops more markedly than that of the HiSeq 2500. (D) The frequency of erroneous base changes from read 1 to read 2. Reprinted from Li, H. (2017). On NovaSeq Base Quality. Available online: https://lh3.github.io/. (accessed on 5 Dec 2019).

1.1.2 DNBseq sequencing

In 2006, Drmanac et al. founded Complete Genomics (CG) to commercialize their proprietary sequencing technology, DNA nanoball sequencing with Combinatorial Probe-Anchor Ligation (cPAL) chemistry (see figure 6). In 2009 they released demonstration data showing high accuracy (1 false call in 100,000 bases) at an average of 45- to 87-fold coverage per genome (Drmanac, Sparks et al. 2010). This indicated that the CG sequencing technology could already reach a relatively high data output in 2009. However, the read length of the cPAL chemistry was a bottleneck for CG in promoting its whole-genome sequencing solution: the standard read output of 2 x 28 bp imposed many limitations on the detection of large structural variants (SVs), because SVs can span hundreds of kb or even Mb. Peters et al. (Peters, Kermani et al. 2012) developed the Long Fragment Read (LFR) technology, which enables the detection of even very long structural events by using co-barcoded short reads in a 384-well plate format, but the sequencing throughput was still limited compared with regular SBS sequencing technology.

Figure 6. Principle of cPAL sequencing. The main mechanism of cPAL is sequencing-by-ligation; each sequencing cycle begins with hybridization and ligation of fluorescent probes to template. Before the next sequencing cycle is started, the anchor-probe complex will be removed. Reprinted from Complete Genomics. (2015). Revolocity™ Whole Genome Sequencing Technology Overview. Available online: http://www.completegenomics.com/documents/revolocity-tech-overview.pdf. (accessed on 5 Dec 2019).

Early in 2013, BGI acquired CG and started to accelerate the development and commercialization of the DNA nanoball sequencing technology. The general workflow is briefly described in Figure 7. The double-stranded genomic DNA (dsDNA) is denatured at a defined temperature into single-stranded DNA (ssDNA), followed by ligation of the DNBseq-compatible adapter and rolling circle replication (RCR), a process that does not involve polymerase chain reaction (PCR). Because RCR always copies the original template rather than a previously amplified product, it avoids the accumulation of errors that occurs during PCR amplification. By contrast, Illumina uses exponential PCR to generate the clusters for sequencing. After a certain number of RCR cycles, each ssDNA circle forms a DNA nanoball (DNB) containing 300-500 copies of the template. The DNBs are negatively charged and the nano-patterned array is positively charged, which secures the specific attachment of each DNB to a sequencing spot. Because each spot is sized to hold only one DNB, every spot contains either one DNB or none. Testing revealed a 95% efficiency for DNB attachment to sequencing spots (Drmanac, Sparks et al. 2010). Following the same concept as SBS, BGI transformed the cPAL chemistry into the cPAS (combinatorial probe anchored synthesis) chemistry, which also uses terminator dNTPs to identify the incorporated bases one by one. In short, once the sequencing primer has hybridized to the adapter of the DNB library, a single dNTP probe is incorporated and the unincorporated dNTPs are washed away. The fluorescence image is captured during the imaging step and converted into a digital signal. Each base is thus "read" after each imaging cycle before moving on to the next cycle. This allows reads to be extended to 150 bp or longer compared with the cPAL chemistry. Furthermore, the sequencing error rate is comparable to that of Solexa/Illumina SBS data; the first DNBseq dataset was validated in a side-by-side comparison study reported by Huang et al. (Huang, Liang et al. 2016).
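The argument about error accumulation in PCR versus rolling circle replication can be made concrete with a toy model. It assumes idealized PCR in which every molecule is copied once per cycle, and it counts how many copying events separate each final molecule from the original template. The function names are hypothetical, and the model ignores differences in polymerase fidelity.

```python
from math import comb


def pcr_lineage_depths(cycles):
    """Number of molecules per lineage depth after idealized PCR.

    Depth = number of copying events between the original template and a
    molecule. Each cycle, every molecule is kept and also yields one copy
    that is one step deeper, so depths follow a binomial pattern.
    """
    return {depth: comb(cycles, depth) for depth in range(cycles + 1)}


def mean_depth(depth_counts):
    """Average number of copying events per molecule."""
    total = sum(depth_counts.values())
    return sum(d * n for d, n in depth_counts.items()) / total


# Rolling circle replication copies the original circle every time,
# so every copy in a nanoball is exactly one copying event deep.
rcr_depths = {1: 400}          # e.g. 400 copies, within the 300-500 range
pcr_depths = pcr_lineage_depths(10)

print(mean_depth(rcr_depths))  # 1.0
print(mean_depth(pcr_depths))  # 5.0
```

Under this simplification, an error made in an early PCR cycle is inherited by all descendants, whereas an error in one RCR copy stays confined to that copy.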

Figure 7. The workflow of DNBseq sequencing. After size selection, the fragmented double-stranded DNA is denatured into single-stranded DNA for circularization. Using the single-stranded DNA circles as templates, rolling circle amplification is then performed to generate the DNA nanoballs. Finally, the DNA nanoballs are loaded into the DNBseq sequencer and sequenced using combinatorial probe anchored synthesis chemistry.

As described above, the nanoscale patterned array at the core of the DNBseq technology ensures high utilization of the chip, which allows large-scale data sets to be generated according to the chip size. BGI has launched several instruments with different data production capacities since 2015 (see table 2).


Table 2. Instrument metrics of different DNBseq sequencers

Sequencing platform | Model | Max read length (bp) | Max output (Gb) | Running time | Accuracy
DNBseq | BGISEQ-50 | 1*50 | 8 | <24 hrs | Mostly >Q30
DNBseq | BGISEQ-500 | 2*100 | 520 | 9-10 day | Mostly >Q30
DNBseq | MGISEQ-200 | 2*100 | 60 | 2 day | Mostly >Q30
DNBseq | MGISEQ-2000 | 2*150 | 1,080 | 3 day | Mostly >Q30
DNBseq | MGISEQ-T7 | 2*150 | 6,000 | <1 day | Mostly >Q30

Since the international human genome consortium was initiated in 1990 and completed its work in 2003, researchers have liked to compare current sequencing technologies with this initial effort in terms of sequencing time. Nowadays, sequencing a human genome takes less than one day using the MGISEQ-T7, and a 10,000-genome WGS study would take less time on DNBseq than on an Illumina-based platform (166 days vs. 286 days, figure 8).
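A back-of-the-envelope version of this comparison can be sketched as below. The per-genome data volume of 90 Gb (roughly 30x coverage of a 3 Gb genome) is an assumption of this sketch, not a figure from the text, and the function name is hypothetical. The per-run output and running time for the NovaSeq 6000 are taken from Table 1. Under these assumptions the estimate lands close to the 286 days quoted for NovaSeq. The inputs behind the 166-day MGISEQ-T7 figure are not given in the text, so that value is not reproduced here.

```python
import math


def cohort_run_days(n_genomes, gb_per_genome, gb_per_run, days_per_run):
    """Days of instrument time for a cohort on one sequencer, run back to back."""
    runs = math.ceil(n_genomes * gb_per_genome / gb_per_run)
    return runs * days_per_run


# Assumption: 90 Gb per genome (about 30x of a 3 Gb genome).
novaseq_days = cohort_run_days(
    n_genomes=10_000, gb_per_genome=90, gb_per_run=6_000, days_per_run=1.9
)
print(round(novaseq_days))  # 285
```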

[Figure 8: bar chart of instrument running time in days (axis 0 to 300). NovaSeq: 286; MGISEQ-T7: 166.]

Figure 8. Comparison of the capacity of DNBseq and Illumina. Instrument running time for completing 10,000 WGS on MGISEQ-T7 and NovaSeq.

1.2 Metagenomics assay comparative analysis

Microorganisms are found everywhere. In particular, billions of microorganisms live inside and on the human body, and together they are also called "the second genome of human". This means that, compared with the single human genome, billions of microbial genomes can be found in the human body, which makes the microbiome more complicated to study than the human genome. In addition, many microorganisms are difficult to cultivate, because it is difficult to replicate the precise conditions of the human environment that they need for survival and growth. Shotgun de novo metagenomics (see the workflow in figure 9) is therefore a useful tool that enables a deep dive into the "world" of the billions of microbes in and on our bodies. In 2008, the first human microbiome consortium, MetaHIT (Metagenomics of the Human Intestinal Tract), was established by the EU. In the same year, US-based researchers joined this scientific competition and launched the Human Microbiome Project (HMP), which was funded by the National Institutes of Health (NIH). The goal of both large-scale projects was to understand at least the pan-genome of the microbiomes in and on human beings. A large number of publications have described how different diseases correlate with signatures of the gut microbiome, and recently, microbiome information has been used to stratify patients for anti-PD-1 drug treatment (Gopalakrishnan, Spencer et al. 2018). Many studies have used 16S rRNA gene amplicon sequencing, but during the last couple of years emphasis has shifted towards the more informative method, metagenomics, to explore and identify more valuable biomarkers within different fields, including the immuno-oncology area. However, more knowledge of sample preparation, library construction, and sample inputs for metagenomics sequencing is still needed.


Figure 9. The workflow of metagenomics. The main steps involve DNA isolation, short-insert library construction, shotgun sequencing, raw data QC and pre-processing, de novo metagenome assembly, gene prediction and catalog construction, functional annotation, taxonomy and diversity analysis, and quantification and differential analysis of gene abundance.

1.2.1 Sample storage methods

One of the biggest advantages of metagenomics sequencing for either diagnostic or exploratory biomarker studies is its non-invasive mode, which enables researchers to acquire samples easily, most often fecal samples. Frozen storage has been a standard procedure for metagenomics sampling, even for oral samples such as saliva: Lazarevic et al. (Lazarevic, Whiteson et al. 2009) applied snap-freezing for sample storage in a study that was endorsed by the Human Microbiome Project. However, it is still challenging for non-professionals to snap-freeze fecal samples in liquid nitrogen or to store them at low temperatures. This is normally not possible when samples are collected at home, as is needed for most large-scale population studies. Therefore, several companies have developed methods and devices for easy sampling and storage at ambient temperature. The requirements for such sampling are distinct from those of ordinary blood sampling, where several standard solutions have been implemented, such as the commonly used PAXgene™ tube, a commercial standard container for blood-derived RNA sequencing (Rainen, Oelmueller et al. 2002) that can also be used for IVD (www.qiagen.com). For robust fecal sample collection and storage, the Genotek kit (Catalog # OMR-200, DNA Genotek, Ottawa, Canada) is now commonly used and has been reported to be comparable with snap-frozen samples (Abrahamson, Hooker et al. 2017, Song, Amir et al. 2016, Vandeputte, Tito et al. 2016).
1.2.2 Extraction methods

Unlike other common human sample types, such as tissue and blood, human fecal samples vary with the health condition of the donor, which is reflected in the consistency of the sample, for instance when comparing individuals with diarrhea and constipation. In addition, purifying DNA from this special sample type poses many challenges, for example contaminants that lead to PCR inhibition, and incomplete lysis during the extraction process. Complete isolation and purification of the microbial genomic DNA is thus critical to microbiome studies. As late as 2009, Wang et al. (Wang, Hoenig et al. 2009) applied a traditional phenol:chloroform:isoamyl alcohol-based extraction protocol for fecal DNA isolation in their microbiome study, which was sponsored by the NIH Human Microbiome Project. However, more and more researchers have adopted commercial kits for fecal samples, such as the MoBio PowerSoil kit (now the DNeasy PowerSoil kit, since MoBio was merged with Qiagen), which was applied in the Human Microbiome Project (Gonzalez, Schaffer et al. 2016). In addition, de Boer et al. reported that a bead beating-based protocol was more robust than a non-bead-based method for stool samples (de Boer, Peters et al. 2010), demonstrating that the traditional DNA isolation method was not suitable for fecal samples. The bead beating-based protocol (see the workflow of the Takara kit, figure 10) uses ceramic beads for sample disruption, which ensures that DNA from both gram-negative and gram-positive microorganisms can be isolated and purified.

Figure 10. The workflow of fecal DNA isolation. The bead-based method provides effective homogenization of gram-negative, gram-positive, and other hard-to-lyse microorganisms.

1.2.3 Library preparation protocols

Before shotgun metagenomics was applied in microbiome studies, researchers relied on 16S rRNA gene amplicon sequencing for qualitative and quantitative studies. Library preparation was quite straightforward: the targeted variable region was amplified using universal 16S rRNA primers, and NGS sequencing adapters were ligated to construct multiplexed libraries. These procedures had been commonly applied in molecular biology since the early 1990s using commercial enzymes and buffers (Weisburg, Barns et al. 1991). However, because of the limited robustness, information content, and reproducibility of 16S rRNA data, more and more researchers have chosen to migrate from this assay to metagenomics (Kim and Yu 2014, Schmidt, Matias Rodrigues et al. 2015). For metagenomics, new and more complex chemistry needed to be developed for NGS library preparation protocols, including the use of PCR, which may lead to artefacts depending on the number of cycles required. Hence, researchers have spent considerable effort on developing more robust protocols for metagenomics studies. Both the HMP and the MetaHIT consortium projects used an adjusted Illumina-compatible short-insert library preparation as previously reported by Manichanh et al. (Manichanh, Rigottier-Gois et al. 2006). During the past 10 years, however, kit suppliers such as Illumina, KAPA, and Vazyme have released kits for metagenomics, including transposase-based kits. These are now commonly used for metagenomics library preparation because the method significantly reduces the complexity of library construction by combining DNA fragmentation, end-repair, and adapter ligation into a one-stop reaction (https://www.protocols.io/view/nucleic-acid-extraction-amplification-and-library-m5vc866).
However, it is worthwhile to perform a head-to-head comparison, since there is no industry standard or benchmark for metagenomics library construction in terms of data robustness and cost.

1.2.4 Sample inputs

Compared with environmental samples, such as those from the atmosphere or water, a human stool sample can be defined as a high-biomass system according to the previous MetaHIT and HMP projects (Qin, Li et al. 2010; Aagaard, Petrosino et al. 2012), but the amount of DNA derived from human stool is still relatively low compared with human blood samples. For example, a yield of 50-200 ng can typically be expected from 100 mg of fecal sample, whereas 1-10 µg of genomic DNA can easily be isolated from a typical human blood sample. Usually, the fecal DNA is fragmented either chemically or by a physical method such as sonication, followed by several modification steps including end repair of the DNA fragments and adapter ligation. Finally, PCR amplification is needed to ensure that enough copies of the template can be loaded onto and sequenced by the sequencer. However, the PCR procedure varies with the quantity of template DNA (Figure 11), and hence it may introduce erroneous reads, GC bias, and duplication issues (Duhaime, Deng et al. 2012). Even though Chafee et al. reported that metagenomics libraries can be constructed from picogram levels of DNA (Chafee, Maignien et al. 2014), such low input may introduce PCR amplification biases, because Aird et al. concluded that PCR amplification bias can always be expected no matter how many cycles are performed (Aird, Ross et al. 2011). At the same time, because of sample limitations, it is difficult to generate PCR-free library preparations. To balance sample input against amplification bias, it is worthwhile to conduct a side-by-side comparison of different sample inputs.
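How the number of PCR cycles depends on the amount of template can be sketched with a simple growth model. The sketch assumes that the library amount grows by a constant factor per cycle, with perfect doubling as the idealized default; the function name and the 1,000 ng target are hypothetical and are not taken from any kit protocol.

```python
def cycles_needed(input_ng, target_ng, efficiency=1.0):
    """Smallest number of PCR cycles that reaches target_ng from input_ng.

    efficiency=1.0 means perfect doubling each cycle (an idealization).
    """
    if input_ng <= 0 or target_ng <= 0:
        raise ValueError("DNA amounts must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    cycles, amount = 0, float(input_ng)
    while amount < target_ng:
        amount *= 1 + efficiency
        cycles += 1
    return cycles


# Hypothetical target of 1,000 ng of library from decreasing inputs.
for input_ng in (200, 50, 1, 0.01):
    print(input_ng, "ng ->", cycles_needed(input_ng, 1000), "cycles")
```

In this model, going from 200 ng to 10 pg of input raises the cycle count from 3 to 17, which is why low-input libraries are more exposed to amplification bias and duplication.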


Figure 11. Experimental sources of sequence variation. Unexpected erroneous variation may occur during sample collection, PCR amplification, and incorrect base incorporation. Reprinted from Robasky, K., Lewis, N., Church, G. (2013). The role of replicates for error mitigation in next-generation sequencing. Nature Reviews Genetics 15(1), 56-62.

1.3 FFPE RNAseq assay comparative analysis

Compared with RT-PCR and microarrays, RNAseq is increasingly applied for gene expression profiling. It can be used to measure mRNA levels as an alternative to microarray platforms in in vitro and in vivo pre-clinical settings, in particular within the field of drug discovery and development, and fusion genes discovered by RNA sequencing may in some cases lead to novel cancer therapies [Ren, Peng et al. 2012, Singh, Chan et al. 2012]. In a clinical setting, formalin-fixed paraffin-embedded (FFPE) samples are the most common sample type. However, RNA from FFPE samples is highly degraded and chemically modified, which urges researchers to explore suitable protocols for FFPE RNAseq library construction rather than using the poly(A)-enrichment assay directly (Figure 12).


Figure 12. The workflow of poly(A) enrichment RNAseq. After poly(A) enrichment and cDNA synthesis, the double-stranded cDNA is denatured into single-stranded cDNA for circularization. Using the single-stranded cDNA circles as templates, rolling circle amplification is then performed to generate the DNA nanoballs.

Currently, the most commonly used protocols are ribosomal RNA (rRNA) depletion-based protocols and coding transcriptome/RNA exome probe capture-based protocols. The Duplex-Specific Nuclease (DSN) protocol (Yi, Cho et al. 2011) and the Ribo-Zero rRNA removal kit (Illumina) (Huang, Jaritz et al. 2011) are the major recently introduced rRNA-depletion protocols. The Truseq RNA Access kit (now the Truseq RNA Exome kit, Illumina) is also a popular method that focuses on the coding regions of the transcriptome (https://www.illumina.com), as described in figure 13. However, even though commercial protocols are available on the market, we still need to determine how robust each protocol is for FFPE samples with regard to sample input, Q20 and Q30, the percentage of clean data retained from the raw FASTQ data, genome and gene mapping rates, rRNA contamination percentage, and reproducibility. So far, no comprehensive assessment of the three FFPE-focused protocols has been established. Guo et al. reported a comparative study of the RNase H and Ribo-Zero protocols with four FFPE samples [Guo, et al. 2016], but both the RNase H and DSN protocols are rRNA-removal methods. It would thus be useful for researchers to also compare a method that is not based on rRNA depletion for FFPE samples, such as the RNA Access protocol.
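Two of the metrics listed above, Q20 and Q30, are simply the fractions of bases whose reported quality is at least 20 or 30. A minimal sketch follows, assuming Phred+33 encoded FASTQ quality strings (the usual encoding in current FASTQ files, but still an assumption here) and a hypothetical helper name.

```python
def q_fraction(quality_strings, threshold, offset=33):
    """Fraction of bases with Phred quality >= threshold."""
    total = passed = 0
    for qual in quality_strings:
        for char in qual:
            total += 1
            if ord(char) - offset >= threshold:
                passed += 1
    if total == 0:
        raise ValueError("no bases given")
    return passed / total


# Toy quality lines: 'I' = Q40, '5' = Q20, '#' = Q2 in Phred+33.
quals = ["IIII5", "II5##"]
print(q_fraction(quals, 20))  # 0.8
print(q_fraction(quals, 30))  # 0.6
```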

Figure 13. The workflow of rRNA-depletion and RNA exome capture protocols. DSN and Ribo-Zero are the two major rRNA depletion-based protocols, and Truseq RNA Access is the RNA exome capture-based protocol.

1.3.1 DSN method

In 2004, Zhulidov et al. [Zhulidov, Bogdanova et al. 2004] reported that duplex-specific nuclease (DSN) may be useful for normalizing high-abundance transcripts so that low-abundance transcripts become detectable. This matters because more than 80% of transcripts can be ribosomal RNA, which researchers want to remove, whereas protein-coding RNA (mRNA), the target of RNA profiling, usually makes up only 1-5% of total RNA (figure 13). In 2010, Illumina scientists reported a DSN-based method for FFPE RNAseq [Guo, Khrebtukova et al. 2010], and Yi et al. published a report evaluating the DSN and poly(A) enrichment methods for prokaryotic species [Yi, Cho et al. 2011]. Subsequently, in 2014, Zhao et al. released a comparative analysis of the DSN and poly(A) enrichment methods as well as microarrays [Zhao, He et al. 2014], and they concluded that both the Ribo-Zero and DSN methods enabled consistent transcript quantification using FFPE RNA. Since 2016, however, reports on DSN-based FFPE RNAseq have become scarce. In terms of library preparation cost, DSN is much cheaper than other commercial kits, such as the Ribo-Zero method, but more robust protocols are still needed.
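The effect of rRNA removal on library composition can be illustrated with simple arithmetic. The 80% rRNA and 5% mRNA starting fractions follow the ranges quoted above; the 90% depletion efficiency and the function name are hypothetical values chosen only for illustration.

```python
def composition_after_depletion(rrna, mrna, other, depletion_efficiency):
    """Renormalized RNA fractions after removing a share of the rRNA."""
    rrna_left = rrna * (1 - depletion_efficiency)
    total = rrna_left + mrna + other
    return {
        "rRNA": rrna_left / total,
        "mRNA": mrna / total,
        "other": other / total,
    }


before = {"rrna": 0.80, "mrna": 0.05, "other": 0.15}
after = composition_after_depletion(**before, depletion_efficiency=0.90)
for name, fraction in after.items():
    print(f"{name}: {fraction:.1%}")
```

With these illustrative numbers the mRNA share rises from 5% to roughly 18% of the remaining RNA, which shows why low-abundance transcripts become easier to detect after depletion.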

1.3.2 Ribo-Zero method


Since RNAseq was established as a major method for transcriptome profiling [Wang, Gerstein et al. 2009], researchers have strived to develop better RNAseq library preparation methods, both for low-quality RNA samples that have lost poly(A) mRNA sequences and for species whose transcripts lack poly(A) tails. Sooknanan et al. first released the Ribo-Zero method for RNAseq [Sooknanan, Pease et al. 2010], which was reported as a novel protocol with an average rRNA depletion efficiency above 97%. The Ribo-Zero method was initially designed for prokaryotic RNAseq. With the increasing need for clinical biomarker studies, in which FFPE samples constitute the major sample type, a major concern is the highly degraded and chemically modified RNA. Hence Epicentre (an Illumina company) expanded the applications of Ribo-Zero to target FFPE RNAseq users [Pease, Sooknanan et al. 2012]. In 2014, the first scientific comparative analysis was published [Zhao, He et al. 2014], which indicated no significant difference between the DSN and the Ribo-Zero methods. In addition, Adiconis et al. reported that their novel rRNA-depletion protocol (RNase H) was superior to DSN and Ribo-Zero [Adiconis, Borges-Rivera et al. 2013]. However, because RNase H is a manual assay that lacks a benchmark standard, it has not been developed into a commercial protocol. Moreover, a PubMed search with the keywords RNAseq using RNase H returned few publications using this protocol during the last 5 years, in particular for clinical FFPE RNAseq.

1.3.3 RNA Access method

After Illumina released the RNA Access method (now named Truseq RNA Exome) as a commercial product in 2014, researchers immediately tried this method for gene expression profiling and gene fusion detection in clinical FFPE samples and published their findings in 2015 [Huang, Goldfischer et al. 2015, Walther, Hofvander et al. 2015].
These preliminary tests indicated that even chromatin rearrangement events can be detected using the RNA Access method. Subsequently, Schuierer et al. compared the RNA Access method with the poly(A) enrichment and the Ribo-Zero methods using simulated degraded RNA samples [Schuierer, Carbone et al. 2017]. According to this report, RNA Access is suitable for highly degraded samples; however, because real FFPE samples were not included in the study, we cannot conclude how chemical modification affects gene expression. Regardless of which FFPE RNA extraction protocol is used, the RIN value of an FFPE sample is expected to be lower than 3 [Boeckx, Wouters et al. 2011], and RIN is not a sensitive measure of FFPE-derived RNA quality. In contrast, DV200, the percentage of RNA fragments above 200 nucleotides (Figure 14), showed a strong correlation with FFPE RNA library preparation yield [Illumina 2016], so DV200 is recommended for classifying the quality of FFPE-derived RNA. Comparing RNAseq library preparation protocols on real FFPE samples with different DV200 status is therefore important for determining the optimal protocol for FFPE coding and non-coding gene expression profiling studies.
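As a minimal sketch of the DV200 definition given above (the percentage of RNA fragments above 200 nucleotides), the following Python snippet computes DV200 from a binned fragment-size profile. The function name `dv200`, the input layout (one signal value per size bin), and the example values are assumptions made for illustration; they are not output from the Bioanalyzer software or data from this study.

```python
def dv200(sizes_nt, signal):
    """Return the percentage of total signal from fragments longer than 200 nt.

    sizes_nt: fragment size of each bin, in nucleotides (hypothetical layout)
    signal:   signal intensity (or fragment amount) measured in each bin
    """
    if len(sizes_nt) != len(signal):
        raise ValueError("sizes_nt and signal must have the same length")
    total = sum(signal)
    if total <= 0:
        raise ValueError("total signal must be positive")
    # Sum the signal of all bins whose fragment size exceeds 200 nt.
    above = sum(s for size, s in zip(sizes_nt, signal) if size > 200)
    return 100.0 * above / total


# Illustrative, made-up profile (not measurements from this thesis).
sizes = [100, 150, 250, 500, 1000]
signal = [30.0, 20.0, 20.0, 20.0, 10.0]
print(f"DV200 = {dv200(sizes, signal):.1f}%")
```

Any cutoff used to call a sample qualified or unqualified would be applied to the returned percentage; no cutoff is hard-coded here because the text above does not state one.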

Figure 14. FFPE- and non-FFPE-derived RNA sample QC results. (A) The FFPE-derived RNA sample. (B) The non-FFPE-derived RNA sample. (C) DV200 range. Unlike the RIN value, DV200 is a robust metric for differentiating “qualified” from “unqualified” FFPE samples. Figure modified from Graf, E. (2017). Simplified DV200 Evaluation with the Agilent 2100 Bioanalyzer System. Agilent company Technical Overview. Available online: https://www.agilent.com/cs/library/technicaloverviews/public/5991-8287EN.pdf. (accessed on 12 Dec 2019).

1.4 Objectives

The main goal of this PhD study was to compare different NGS platforms for whole-genome studies, metagenomics studies, and transcriptome studies. The study had the following specific aims:

1. To determine how DNBseq performs compared to the Illumina platform for whole-genome sequencing applications, in order to draw clear and comprehensive conclusions.


2. To examine how DNA extraction methods, sample preservation methods, sample inputs, and library construction methods affect the observed composition of microbial communities, in order to recommend robust experimental protocols for shotgun metagenomics applications.
3. To compare library preparation protocols and determine which is most robust and suitable for clinical FFPE RNA sequencing (mRNA + lncRNA or mRNA only).


2. LIST OF PAPERS

• Zonghui Peng, Jintu Wang, Honglan Gou, Yonggang Zhao, Meifang Tang, Fei Teng, Karsten Kristiansen, Zhijiao Wang. Comparative analysis of the DNBseq and the Illumina Sequencing platforms for human whole-genome sequencing. Manuscript (First author)
• Zonghui Peng, Xiaolong Zhu, Zhijiao Wang, Xianting Yan, Guangbiao Wang, Meifang Tang, Awei Jiang, and Karsten Kristiansen. Comparative analysis of sample extraction and library construction for shotgun metagenomics. Bioinformatics and Biology Insights. 2020;14:1-13.
• Zonghui Peng, Qiwei Sun, Zhijiao Wang, Fei Teng, Liang Zong, Yipting Kwong, Bimeng Tu, Karsten Kristiansen. Comparative analysis of library preparation methods for formalin-fixed paraffin-embedded RNAseq. Submitted to BMC Genomics, 2020.


3. SUMMARY OF RESULTS

PART 1. Comparative analysis of the DNBseq and the Illumina Sequencing platforms for human whole-genome sequencing

• With high-confidence variant calls, 94.06% of the unique SNPs and 86.76% of the unique InDels were concordant between the DNBseq and the Illumina platforms.
• The SNP calling accuracy of the DNBseq platform (sensitivity 96.21%, precision 99.94%) is highly consistent with that of the Illumina platform (sensitivity 96.34%, precision 99.89%).
• For InDel calling, the accuracy of the Illumina data (sensitivity 90.32%, precision 96.55%) is slightly lower than that of the DNBseq data (sensitivity 93.23%, precision 97.62%).
• For copy number variant (CNV) calling, the DNBseq and the Illumina platforms performed comparably when compared with Agilent CGH array-based validated CNV datasets.
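The sensitivity and precision values listed above follow the usual definitions for benchmarking a call set against a truth set: sensitivity = TP / (TP + FN) and precision = TP / (TP + FP). The sketch below illustrates these definitions on sets of variants. The function name `benchmark_calls`, the tuple layout (chromosome, position, reference allele, alternate allele), and the example variants are hypothetical; the snippet is not the pipeline used in Paper I, which is not described in this section.

```python
def benchmark_calls(calls, truth):
    """Compare a set of called variants with a truth set.

    Both arguments are sets of (chrom, pos, ref, alt) tuples.
    Returns (sensitivity, precision) as fractions between 0 and 1.
    """
    tp = len(calls & truth)   # called and present in the truth set
    fp = len(calls - truth)   # called but absent from the truth set
    fn = len(truth - calls)   # in the truth set but not called
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    return sensitivity, precision


# Made-up variants for illustration only (not GIAB data).
truth = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
         ("chr2", 50, "G", "A"), ("chr2", 75, "T", "C")}
calls = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"),
         ("chr2", 50, "G", "A"), ("chr3", 10, "A", "C")}

sens, prec = benchmark_calls(calls, truth)
print(f"sensitivity = {sens:.2%}, precision = {prec:.2%}")
```

Exact matching on position and alleles is a simplification; InDels in particular can be represented in more than one equivalent way, so real comparisons normally normalize variants first.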

SNP comparison between DNBseq and Illumina on GIAB high-confidence region


InDel comparison analysis between DNBseq and Illumina on GIAB high-confidence region

PART 2. Comparative analysis of sample extraction and library construction for shotgun metagenomics

• On average, the Mag-Bind® Universal Metagenomics Kit (OM) yielded a slightly higher quantity of extracted DNA than the DNeasy PowerSoil Kit (QP).
• On average, OM enabled detection of more genes than QP when the same library preparation protocol was used.
• On the same stool sample, the KAPA Hyper Prep Kit (KH) on average performed slightly better than the TruePrep DNA Library Prep Kit V2 (TP) in terms of the number of detected genes and the Shannon index.
• There was no significant difference in taxonomic composition between the two library preparations using fresh, freeze-thaw, and DNA mock community samples. We also did not observe any significant difference between fresh and freeze-thaw samples.
• Low input (50ng) performed comparably to regular sample input (250ng gDNA) in terms of gene detection and microbial community distribution.
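The Shannon index mentioned above summarizes the diversity of a sample as H = -Σ p_i ln(p_i), where p_i is the relative abundance of gene or taxon i. A minimal sketch follows; the function name `shannon_index`, the use of the natural logarithm, and the example abundances are assumptions, since this section does not state which logarithm base or abundance table was used.

```python
import math


def shannon_index(abundances):
    """Shannon diversity index H = -sum(p_i * ln(p_i)) over non-zero entries."""
    total = sum(abundances)
    if total <= 0:
        raise ValueError("total abundance must be positive")
    h = 0.0
    for a in abundances:
        if a > 0:  # zero-abundance entries contribute nothing to H
            p = a / total
            h -= p * math.log(p)
    return h


# Made-up abundance profiles for illustration only.
even_community = [25, 25, 25, 25]    # four equally abundant taxa
skewed_community = [97, 1, 1, 1]     # dominated by a single taxon
print(shannon_index(even_community), shannon_index(skewed_community))
```

An even community gives a higher index than a skewed one with the same number of taxa, which is why the index is used here alongside the number of detected genes.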


The experimental workflow of metagenomics comparative analysis

PART 3. Comparative analysis of library preparation methods for formalin-fixed paraffin-embedded RNAseq

• In terms of the raw-to-clean data ratio, RNA Access performed better than Ribo-Zero and DSN for both FF and FFPE samples.
• In terms of mapping quality, RNA Access provided the highest similarity between FF and FFPE samples in total and unique gene mapping rates compared with Ribo-Zero and DSN.
• RNA Access provided higher fidelity of transcript/gene expression evenness for the FFPE sample than Ribo-Zero and DSN.
• DSN detected more known and novel genes than RNA Access and Ribo-Zero, but RNA Access was the most cost-efficient when the coding regions of the FFPE transcriptome are the primary object of the study.
• For coding RNA detection, RNA Access produced a higher correlation between pairs of FFPE and FF samples than Ribo-Zero and DSN. For non-coding RNA (ncRNA)-based correlation, however, DSN showed higher reproducibility than Ribo-Zero and RNA Access, indicating that DSN could be an optimal option for ncRNA studies.
• Each protocol exhibited a high correlation in terms of technical reproducibility.

• For single-point alteration detection, DSN showed the most consistent performance across mutation types from A-G to G-T and in Ti/Tv ratio compared to Ribo-Zero and RNA Access.
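The Ti/Tv ratio referred to above is the number of transitions (A<->G and C<->T) divided by the number of transversions (all other single-base substitutions). The sketch below shows one way to compute it from a list of (reference, alternate) base pairs. The function name `titv_ratio` and the example substitutions are hypothetical and do not reproduce the variant calling workflow of Paper III.

```python
BASES = {"A", "C", "G", "T"}
# Purine<->purine and pyrimidine<->pyrimidine changes are transitions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def titv_ratio(substitutions):
    """Return transitions / transversions for (ref, alt) single-base changes.

    Entries that are not single-base substitutions (InDels, identical
    alleles, ambiguous bases) are skipped.
    """
    ti = tv = 0
    for ref, alt in substitutions:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            continue
        if (ref, alt) in TRANSITIONS:
            ti += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions found; Ti/Tv is undefined")
    return ti / tv


# Made-up substitutions for illustration only.
example = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("G", "T")]
print(titv_ratio(example))
```

Comparing this ratio between matched FFPE and FF samples is one simple check of whether a protocol introduces a systematic excess of particular substitution types.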

The experimental workflow of FFPE RNAseq comparative analysis


4. DISCUSSION

4.1 Comparative analysis of the DNBseq and the Illumina Sequencing platforms for human whole-genome sequencing

With ultrahigh throughput and an unprecedentedly low price per genome, the Illumina Hiseq X Ten has become one of the major sequencing systems. Our analysis filled the gap for a GIAB-sample-based comparison of sequencing performance between DNBseq and Hiseq X Ten. The findings indicated high concordance for point mutation detection, with an on average 2.49-fold higher cost per SNP for the Illumina Hiseq X Ten, suggesting that DNBseq is the more cost-effective option. InDels detected by DNBseq showed higher accuracy, which could directly reflect an advantage inherent in the DNBseq sequencing technology. Given the limitations of the reference GIAB sample, the comparison would be more valid if the validation were carried out using actual human DNA samples. In terms of copy number variation (CNV) detection, comparable performance was observed using a limited number of CNV datasets. A thorough validation is warranted using more comprehensively validated GIAB CNVs once they become publicly available from other technologies, such as Affymetrix SNP 6.0 array- or qPCR-based datasets. Another limitation is the type of sample. It would be worthwhile to test other human sample types, such as FFPE samples. Such comparisons are valuable for a comprehensive evaluation of the DNBseq sequencing technology for detecting the full spectrum of genomic variants, especially its performance on the same FFPE sample. Additionally, with the increasing need for PCR-free WGS, comparing how the sequencing platforms perform in detecting challenging regions and in reducing PCR bias in GC-poor/rich regions when PCR-free libraries are used could be worth pursuing in a future study.

4.2 Comparative analysis of sample extraction and library construction for shotgun metagenomics

Our comparative study indicated that, on average, the OM protocol yielded a higher DNA amount and more diverse microbial communities than the QP protocol when using either fresh feces or commercial mock samples. This emphasizes that the bead-beating method is superior to the non-bead-beating method for processing gram-positive/gram-negative microorganisms. With regard to library preparation, TP generated larger library insert sizes, a lower duplication rate, and a higher number of qualified reads than KH, but KH libraries performed better than TP in terms of the number of detected genes and the Shannon index. In addition, a principal component analysis (PCA) of microbial community abundance showed that the TP and the KH protocols did not cluster together, which implies that libraries with both long (~350nt) and short (~250nt) insert sizes should be included. This library preparation design may enable researchers to avoid potential bias due to the insert size preference and capture efficiency of the library protocol when the corresponding gene catalog database is not yet available or fully established. Furthermore, our findings indicated that TP and KH were highly consistent on the commercial mock sample, although optimization is needed once microorganisms with GC-poor (<30%) or GC-rich (>60%) content are included. Finally, because sample input is one of the key factors in metagenomics studies, we tested two input amounts (50ng and 250ng). Our results showed that low input (50ng) did not significantly affect the microbial community distribution compared with regular sample input (250ng). Even lower sample inputs need to be assessed for samples with limited biomass, such as skin or swab samples.

4.3 Comparative analysis of library preparation methods for formalin-fixed paraffin-embedded RNAseq

With the increasing application of gene expression profiling (GEP) in clinical studies, RNAseq, as a robust tool, needs a benchmark for library preparation, in particular for FFPE specimens. Previously, rRNA-depletion methods such as DSN and Ribo-Zero were reported as robust protocols for FFPE RNAseq (Zhao, He et al. 2014). However, our comparative results showed that the RNA Access method, an RNA exome capture-based protocol, was more robust than DSN and Ribo-Zero on matched FFPE and FF samples across different data metrics, including the raw-to-clean data efficiency, gene mapping, exon mapping, rRNA contamination ratio, and the correlation between matched FFPE and FF samples. For non-coding RNA profiling, an rRNA-depletion method is the only option for library preparation according to our comparative results, which is determined by the nature of the protocol design. The poly(A) enrichment method cannot capture FFPE sample-derived coding RNAs and a portion of the non-coding RNAs, which usually contain no poly(A) tail. For FFPE RNAseq, therefore, selecting an appropriate protocol is critical for meeting different study requirements. Regardless of the library protocol applied in our comparative study, high reproducibility was observed for all tested protocols on both FFPE samples and their matched FF samples. In contrast, the correlation among different library methods on the same sample was generally poor, implying that switching protocols or using different methods within a single batch of FFPE samples will affect the robustness and consistency of a study. Additionally, for SNV calling, our study indicated that DSN generated more, and more reliable, variants than the RNA exome capture and the Ribo-Zero protocols, based on a comparison of the Ti/Tv ratio and the variant correlation between the matched FFPE and FF samples.


5. CONCLUSION

The WGS-based comparative analysis showed that DNBseq is comparable to the Illumina platform across a broad spectrum of genomic variants. In addition, DNBseq is a more cost-effective platform for WGS than the Illumina platform.

The metagenomics-based comparative analysis demonstrated that adopting robust protocols for shotgun metagenomics studies generates the most reliable results. Compared with the other protocols tested for DNA extraction and library preparation, our results support that the Omega Mag-Bind extraction kit plus the KAPA library method with suitable DNA input generates more valuable data for both fresh and freeze-thaw human fecal samples. Our results also provide clues for future studies regarding the importance of evaluating the DNA isolation method, library preparation protocol, and sample input for more comprehensive profiling of the human gut microbiome.

The FFPE transcriptome comparative analysis indicated that the RNA Access method is suitable for clinical FFPE mRNA-based expression profiling. If both non-coding RNA and mRNA profiling are the study target, DSN would be a better option for FFPE samples than Ribo-Zero and RNA Access. Additionally, if SNV calling is required, our study revealed that DSN is more reliable than the other two protocols for FFPE RNAseq.


6. FUTURE PERSPECTIVES

In PAPER I, I first present the comparison between the DNBseq and the Hiseq X Ten systems using the GIAB sample, which provides a reference for laboratories that are currently equipped with the Illumina Hiseq X Ten system but are considering DNBseq as an alternative platform. Because this study used standard reference genomic DNA samples rather than a diversity of sample types, such as FFPE or single-cell samples, evaluation with different sample types may shed new light on platform selection. Furthermore, more attention needs to be paid to SV detection in the future, especially when more validated and robust SV detection databases and reference samples become available.

In PAPER II, our results provide evidence for an optimized protocol that combines a fecal sample-based DNA extraction method, library preparation protocol, sample input, and suitable sample preservation method to generate more reliable microbial community distribution results with a DNA-based approach. Given that more and more researchers prefer conducting integrative analyses, a similar comparative analysis may need to be performed for metatranscriptomics in the future. It is reasonable to expect that a benchmark for comprehensive microbiome analysis at the RNA level can be established. Our comparative analysis was limited by the small number of fecal and mock samples; a large-scale cohort study would be ideal, particularly for non-fecal sample types, such as oral swabs or skin samples.

In PAPER III, our comparative analysis illustrated that the RNA Access protocol exhibits robust performance for detecting coding gene expression in FFPE samples compared to two traditional RNAseq library protocols, but it is limited in non-coding gene profiling and in SNV calling, neither of which it captures effectively.
Other FFPE RNA-specific assays need to be considered in the future, including Nanostring, HTG EdgeSeq, and RNA-ISH. In addition, FFPE sample storage time also affects RNAseq data quality. Future work should therefore determine which library protocol generates the most stable and consistent data for FFPE samples with different storage times, in particular for long-term clinical studies, which collect FFPE samples stored from a few years to ten years or even longer. Although our study mainly focused on gene expression profiling, gene fusion and alternative splicing events are becoming popular research topics among oncology researchers. However, classical technologies such as qPCR are not designed for splicing or gene fusion detection. It would be valuable to explore the detection performance for both variant types in FFPE RNAseq studies, in the hope of identifying the most robust protocol in a comprehensive way rather than through expression-data-focused comparison alone.


7. REFERENCES

Aagaard, K., Petrosino, J., Keitel, W., Watson, M., Katancik, J., Garcia, N., Patel, S., Cutting, M., Madden, T., Hamilton, H., Harris, E., Gevers, D., Simone, G., McInnes, P., Versalovic, J. (2012). The Human Microbiome Project strategy for comprehensive sampling of the human microbiome and why it matters. FASEB Journal 27(3), 1012-22. Abrahamson, M., Hooker, E., Ajami, N., Petrosino, J., Orwoll, E. (2017). Successful collection of stool samples for microbiome analyses from a large community-based population of elderly men. Contemporary Clinical Trials Communications 7, 158-162. Adiconis, X., Borges-Rivera, D., Satija, R., DeLuca, D., Busby, M., Berlin, A., Sivachenko, A., Thompson, D., Wysoker, A., Fennell, T., Gnirke, A., Pochet, N., Regev, A., Levin, J. (2013). Comparative analysis of RNA sequencing methods for degraded or low-input samples. Nature Methods 10(7), 623-9. Aird, D., Ross, M., Chen, W., Danielsson, M., Fennell, T., Russ, C., Jaffe, D., Nusbaum, C., Gnirke, A. (2011). Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries. Genome Biology 12(2), R18. Andrews, S. (2016). Illumina 2 colour chemistry can overcall high confidence G bases. Available online: https://sequencing.qcfail.com/. (accessed on 2 Jan 2020). Boeckx, C., Wouters, A., Pauwels, B., Deschoolmeester, V., Specenier, P., Lukaszuk, K., Vermorken, J., Pauwels, P., Peeters, M., Lardon, F., Baay, M. (2011). Expression Analysis on Archival Material. Diagnostic Molecular Pathology 20(4), 203-211. Boer, R., Peters, R., Gierveld, S., Schuurman, T., Kooistra-Smid, M., Savelkoul, P. (2010). Improved detection of microbial DNA after bead-beating before DNA isolation. Journal of Microbiological Methods 80(2), 209-211. Chafee, M., Maignien, L., Simmons, S. (2015). The effects of variable sample biomass on comparative metagenomics. Environmental Microbiology 17(7), 2239-53. Complete Genomics. (2015). Revolocity™ Whole Genome Sequencing Technology Overview.
Available online: http://www.completegenomics.com/documents/revolocity-tech-overview.pdf. (accessed on 5 Dec 2019). Drmanac, R., Sparks, A., Callow, M., Halpern, A., Burns, N., Kermani, B., Carnevali, P., Nazarenko, I., Nilsen, G., Yeung, G., Dahl, F., Fernandez, A., Staker, B., Pant, K., Baccash, J., Borcherding, A., Brownley, A., Cedeno, R., Chen, L., Chernikoff, D., Cheung, A., Chirita, R., Curson, B., Ebert, J., Hacker, C., Hartlage, R., Hauser, B., Huang, S., Jiang, Y., Karpinchyk, V., Koenig, M., Kong, C., Landers, T., Le, C., Liu, J., McBride, C.,

Morenzoni, M., Morey, R., Mutch, K., Perazich, H., Perry, K., Peters, B., Peterson, J., Pethiyagoda, C., Pothuraju, K., Richter, C., Rosenbaum, A., Roy, S., Shafto, J., Sharanhovich, U., Shannon, K., Sheppy, C., Sun, M., Thakuria, J., Tran, A., Vu, D., Zaranek, A., Wu, X., Drmanac, S., Oliphant, A., Banyai, W., Martin, B., Ballinger, D., Church, G., Reid, C. (2009). Human Genome Sequencing Using Unchained Base Reads on Self-Assembling DNA Nanoarrays. Science 27(5961), 78-81. Duhaime, M., Deng, L., Poulos, B., Sullivan, M. (2012). Towards quantitative metagenomics of wild viruses and other ultra-low concentration DNA samples: a rigorous assessment and optimization of the linker amplification method. Environmental Microbiology 14(9), 2526-37. Gopalakrishnan, V., Spencer, C., Nezi, L., Reuben, A., Andrews, M., Karpinets, T., Prieto, P., Vicente, D., Hoffman, K., Wei, S., Cogdill, A., Zhao, L., Hudgens, C., Hutchinson, D., Manzo, T., Macedo, M., Cotechini, T., Kumar, T., Chen, W., Reddy, S., Sloane, R., Galloway-Pena, J., Jiang, H., Chen, P., Shpall, E., Rezvani, K., Alousi, A., Chemaly, R., Shelburne, S., Vence, L., Okhuysen, P., Jensen, V., Swennes, A., McAllister, F., Sanchez, E., Zhang, Y., Chatelier, E., Zitvogel, L., Pons, N., Austin-Breneman, J., Haydu, L., Burton, E., Gardner, J., Sirmans, E., Hu, J., Lazar, A., Tsujikawa, T., Diab, A., Tawbi, H., Glitza, I., Hwu, W., Patel, S., Woodman, S., Amaria, R., Davies, M., Gershenwald, J., Hwu, P., Lee, J., Zhang, J., Coussens, L., Cooper, Z., Futreal, P., Daniel, C., Ajami, N., Petrosino, J., Tetzlaff, M., Sharma, P., Allison, J., Jenq, R., Wargo, J. (2017). Gut microbiome modulates response to anti–PD-1 immunotherapy in melanoma patients. Science 359(6371), 97-103. Gonzalez, M., Schaffer, J., Orlow, S., Gao, Z., Li, H., Alekseyenko, A., Blaser, M. (2016). Cutaneous microbiome effects of fluticasone propionate cream and adjunctive bleach baths in childhood atopic dermatitis. 
Journal of the American Academy of Dermatology 75(3), 481-493. Graf, E. (2017). Simplified DV200 Evaluation with the Agilent 2100 Bioanalyzer System. Agilent company Technical Overview. Available online: https://www.agilent.com/cs/library/technicaloverviews/public/5991-8287EN.pdf. (accessed on 12 Dec 2019). Guo, Y., Wu, J., Zhao, S., Ye, F., Su, Y., Clark, T., Sheng, Q., Lehmann, B., Shu, X., Cai, Q. (2016). RNA Sequencing of Formalin-Fixed, Paraffin-Embedded Specimens for Gene Expression Quantification and Data Mining. International Journal of Genomics 2016, 1-10. Huang, J., Liang, X., Xuan, Y., Geng, C., Li, Y., Lu, H., Qu, S., Mei, X., Chen, H., Yu, T., Sun, N., Rao, J., Wang, J., Zhang, W., Chen, Y., Liao, S., Jiang, H., Liu, X., Yang, Z., Mu, F., Gao, S. (2017). A reference human genome dataset of the BGISEQ-500 sequencer. GigaScience 6(5), gix024. Huang, R., Jaritz, M., Guenzl, P., Vlatkovic, I., Sommer, A., Tamir, I., Marks, H., Klampfl, T., Kralovics, R., Stunnenberg, H., Barlow, D., Pauler, F. (2011). An RNA-Seq strategy to detect the complete coding and non-coding transcriptome including full-length imprinted macro ncRNAs. PloS One 6(11), e27288. Huang, W., Goldfischer, M., Babyeva, S., Mao, Y., Volyanskyy, K., Dimitrova, N., Fallon, J., Zhong, M. (2015). Identification of a novel PARP14-TFE3 gene fusion from 10-year-old FFPE tissue by RNA-seq. Genes, Chromosomes & Cancer 54(8), 500-505. Illumina. (2016). Evaluating RNA Quality from FFPE Samples. Illumina company Technical Note. Retrieved from https://www.illumina.com/content/dam/illumina-marketing/documents/products/technotes/evaluating-rna-quality-from-ffpe-samples-technical-note-470-2014-001.pdf. Kim, M., Yu, Z. (2014). Variations in 16S rRNA-based microbiome profiling between pyrosequencing runs and between pyrosequencing facilities. Journal of Microbiology 52(5), 355-65. Lazarevic, V., Whiteson, K., Huse, S., Hernandez, D., Farinelli, L., Osterås, M., Schrenzel, J., François, P. (2009). 
Metagenomic study of the oral microbiota by Illumina high-throughput sequencing. Journal of Microbiological Methods 79(3), 266-71. Li, H. (2017). On NovaSeq Base Quality. Available online: https://lh3.github.io/. (accessed on 5 Jan 2020). Manichanh, C., Rigottier-Gois, L., Bonnaud, E., Gloux, K., Pelletier, E., Frangeul, L., Nalin, R., Jarrin, C., Chardon, P., Marteau, P., Roca, J., Dore, J. (2006).
Reduced diversity of fecal microbiota in Crohn's disease revealed by a metagenomic approach. Gut 55, 205–211. Mardis, E. (2008). The impact of next-generation sequencing technology on genetics. Trends in Genetics 24(3), 133-141. Margulies, M., Egholm, M., Altman, W., Attiya, S., Bader, J., Bemben, L., Berka, J., Braverman, M., Chen, Y., Chen, Z., Dewell, S., Du, L., Fierro, J., Gomes, X., Godwin, B.,

He, W., Helgesen, S., Ho, C., Ho, C., Irzyk, G., Jando, S., Alenquer, M., Jarvie, T., Jirage, K., Kim, J., Knight, J., Lanza, J., Leamon, J., Lefkowitz, S., Lei, M., Li, J., Lohman, K., Lu, H., Makhijani, V., McDade, K., McKenna, M., Myers, E., Nickerson, E., Nobile, J., Plant, R., Puc, B., Ronan, M., Roth, G., Sarkis, G., Simons, J., Simpson, J., Srinivasan, M., Tartaro, K., Tomasz, A., Vogt, K., Volkmer, G., Wang, S., Wang, Y., Weiner, M., Yu, P., Begley, R., Rothberg, J. (2005). Genome sequencing in microfabricated high-density picolitre reactors. Nature 437(7057), 376-380. Maxam, A., Gilbert, W. (1977). A new method for sequencing DNA. Proceedings of the National Academy of Sciences 74(2), 560-564. Pease, J., Sooknanan, R. (2012). Rapid, directional RNA-seq library preparation kits for formalin-fixed paraffin-embedded RNA. Nature Methods 9(10), i-ii. Peters, B., Kermani, B., Sparks, A., Alferov, O., Hong, P., Alexeev, A., Jiang, Y., Dahl, F., Tang, Y., Haas, J., Robasky, K., Zaranek, A., Lee, J., Ball, M., Peterson, J., Perazich, H., Yeung, G., Liu, J., Chen, L., Kennemer, M., Pothuraju, K., Konvicka, K., Tsoupko-Sitnikov, M., Pant, K., Ebert, J., Nilsen, G., Baccash, J., Halpern, A., Church, G., Drmanac, R. (2012). Accurate whole-genome sequencing and haplotyping from 10 to 20 human cells. Nature 487(7406), 190-5. Qin, J., Li, R., Raes, J., Arumugam, M., Burgdorf, K., Manichanh, C., Nielsen, T., Pons, N., Levenez, F., Yamada, T., Mende, D., Li, J., Xu, J., Li, S., Li, D., Cao, J., Wang, B., Liang, H., Zheng, H., Xie, Y., Tap, J., Lepage, P., Bertalan, M., Batto, J., Hansen, T., Le Paslier, D., Linneberg, A., Nielsen, H., Pelletier, E., Renault, P., Sicheritz-Ponten, T., Turner, K., Zhu, H., Yu, C., Li, S., Jian, M., Zhou, Y., Li, Y., Zhang, X., Li, S., Qin, N., Yang, H., Wang, J., Brunak, S., Doré, J., Guarner, F., Kristiansen, K., Pedersen, O., Parkhill, J., Weissenbach, J., MetaHIT Consortium, Bork, P., Ehrlich, S., Wang, J. (2010).
A human gut microbial gene catalog established by metagenomic sequencing. Nature 464:59-70. Quince, C., Walker, A., Simpson, J., Loman, N., Segata, N. (2017). Shotgun metagenomics, from sampling to analysis. Nature Biotechnology 35(9), 833-844. Rainen, L., Oelmueller, U., Jurgensen, S., Wyrich, R., Ballas, C., Schram, J., Herdman, C., Bankaitis-Davis, D., Nicholls, N., Trollinger, D., Tryon, V. (2002). Stabilization of mRNA Expression in Whole Blood Samples. Clinical Chemistry 48(11), 1883-1890. Ren, S., Peng, Z., Mao, J., Yu, Y., Yin, C., Gao, X., Cui, Z., Zhang, J., Yi, K., Xu, W., Chen, C., Wang, F., Guo, X., Lu, J., Yang, J., Wei, M., Tian, Z., Guan, Y., Tang, L., Xu,

C., Wang, L., Gao, X., Tian, W., Wang, J., Yang, H., Wang, J., Sun, Y. (2012). RNA-seq analysis of prostate cancer in the Chinese population identifies recurrent gene fusions, cancer-associated long noncoding RNAs and aberrant alternative splicings. Cell Research 22(5), 806-21. Robasky, K., Lewis, N., Church, G. (2013). The role of replicates for error mitigation in next-generation sequencing. Nature Reviews Genetics 15(1), 56-62. Sanger, F., Nicklen, S., Coulson, A. (1977). DNA sequencing with chain-terminating inhibitors. Proceedings of the National Academy of Sciences 74(12), 5463-5467. Schuierer, S., Carbone, W., Knehr, J., Petitjean, V., Fernandez, A., Sultan, M., Roma, G. (2017). A comprehensive assessment of RNA-seq protocols for degraded and low-quantity samples. BMC Genomics 18, 442. Singh, D., Chan, J., Zoppoli, P., Niola, F., Sullivan, R., Castano, A., Liu, E., Reichel, J., Porrati, P., Pellegatta, S., Qiu, K., Gao, Z., Ceccarelli, M., Riccardi, R., Brat, D., Guha, A., Aldape, K., Golfinos, J., Zagzag, D., Mikkelsen, T., Finocchiaro, G., Lasorella, A., Rabadan, R., Iavarone, A. (2012). Transforming fusions of FGFR and TACC genes in human glioblastoma. Science 337(6099), 1231-5. Song, S., Amir, A., Metcalf, J., Amato, K., Xu, Z., Humphrey, G., Knight, R. (2016). Preservation Methods Differ in Fecal Microbiome Stability, Affecting Suitability for Field Studies. mSystems 1(3), e00021-16. Sooknanan, R., Pease, J., Doyle, K. (2010). Novel methods for rRNA removal and directional, ligation-free RNA-seq library preparation. Nature Methods 7(10), i-ii. Vandeputte, D., Tito, R., Vanleeuwen, R., Falony, G., Raes, J. (2017). Practical considerations for large-scale gut microbiome studies. FEMS Microbiology Reviews 41, S154-S167. Voelkerding, K., Dames, S., Durtschi, J. (2009). Next-generation sequencing: from basic research to diagnostics. Clinical Chemistry 55(4), 641-58.
Walther, C., Hofvander, J., Nilsson, J., Magnusson, L., Domanski, H., Gisselsson, D., Tayebwa, J., Doyle, L., Fletcher, C., Mertens, F. (2015). Gene fusion detection in formalin-fixed paraffin-embedded benign fibrous histiocytomas using fluorescence in situ hybridization and RNA sequencing. Laboratory Investigation 95(9), 1071-1076. Wang, Y., Hoenig, J., Malin, K., Qamar, S., Petrof, E., Sun, J., Antonopoulos, D., Chang, E., Claud, E. (2009). 16S rRNA gene-based analysis of fecal microbiota from preterm infants with and without necrotizing enterocolitis. The ISME Journal 3(8), 944-54.

Wang, Z., Gerstein, M., Snyder, M. (2009). RNA-Seq: a revolutionary tool for transcriptomics. Nature reviews. Genetics 10(1), 57-63. Weisburg, W., Barns, S., Pelletier, D., Lane, D. (1991). 16S ribosomal DNA amplification for phylogenetic study. Journal of Bacteriology 173(2), 697-703. Yi, H., Cho, Y., Won, S., Lee, J., Yu, H., Kim, S., Schroth, G., Luo, S., Chun, J. (2011). Duplex-specific nuclease efficiently removes rRNA for prokaryotic RNA-seq. Nucleic Acids Research 39(20), e140. Zhao, W., He, X., Hoadley, K., Parker, J., Hayes, D., Perou, C. (2014). Comparison of RNA-Seq by poly (A) capture, ribosomal RNA depletion, and DNA microarray for expression profiling. BMC genomics 15(1), 419. Zhulidov, P., Bogdanova, E., Shcheglov, A., Vagner, L., Khaspekov, G., Kozhemyako, V., Matz, M., Meleshkevitch, E., Moroz, L., Lukyanov, S., Shagin, D. (2004). Simple cDNA normalization using kamchatka crab duplex-specific nuclease. Nucleic Acids Research 32(3), 37e.


8. APPENDIX

PAPER I: Zonghui Peng, Jintu Wang, Honglan Gou, Yonggang Zhao, Meifang Tang, Fei Teng, Karsten Kristiansen, Zhijiao Wang. Comparative analysis of the DNBseq and the Illumina Sequencing platforms for human whole-genome sequencing. Manuscript (First author)

PAPER II: Zonghui Peng, Xiaolong Zhu, Zhijiao Wang, Xianting Yan, Guangbiao Wang, Meifang Tang, Awei Jiang, Karsten Kristiansen. Comparative analysis of sample extraction and library construction for shotgun metagenomics. Bioinformatics and Biology Insights. 2020;14:1-13.

PAPER III: Zonghui Peng, Qiwei Sun, Zhijiao Wang, Fei Teng, Liang Zong, Yipting Kwong, Bimeng Tu, Karsten Kristiansen. Comparative analysis of library preparation methods for formalin-fixed paraffin-embedded RNAseq. Submitted to BMC Genomics, 2020.


Paper I

Comparative analysis of the DNBseq and the Illumina Sequencing platforms for human whole-genome sequencing

Zonghui Peng1,2,*, Jintu Wang3, Honglan Gou3, Yonggang Zhao3, Meifang Tang3, Fei Teng3, Karsten Kristiansen2,4,5, Zhijiao Wang3

1BGI Americas, Cambridge, Massachusetts, USA.

2Laboratory of Genomics and Molecular Biomedicine, Department of Biology, University of Copenhagen, Copenhagen, Denmark.

3BGI Genomics, Shenzhen, China.

4BGI-Shenzhen, Shenzhen, China.

5China National GeneBank, BGI-Shenzhen, Shenzhen, China

Corresponding author:

Zonghui Peng, BGI Americas Corporation, 1 Broadway, 14th Fl, Cambridge, MA 02142. Telephone: +1 617-500-2754. Email: [email protected]


Abstract

Background Making cost-effective whole genome sequencing technologies that deliver high-quality data available to biologists is very important.

Objective To compare the two most commonly used “benchtop” sequencing platforms, Illumina and DNBseq, for human whole genome sequencing using the same reference sample, in order to better understand the performance of DNBseq.

Methods We sequenced the Genome-In-A-Bottle (GIAB) reference sample NA12878 on the DNBseq platform at ~30x coverage and compared the results with downloaded sequencing data for the same sample generated on the Illumina platform. The performance of both platforms in terms of single-nucleotide polymorphism (SNP) calling, insertion and deletion (InDel) detection, and copy number variation (CNV) screening was evaluated.

Results The results indicated that 94.06% and 86.76% of the unique SNPs and InDels, respectively, were concordant between the two platforms when applying high-confidence variant calls. Our findings indicated a comparable SNP calling accuracy for DNBseq data (sensitivity 96.21% and precision 99.94%) and Illumina data (sensitivity 96.34% and precision 99.89%). However, for InDels, we found lower accuracy for Illumina data (sensitivity 90.32% and precision 96.55%) than for DNBseq data (sensitivity 93.23% and precision 97.62%). In addition, our study showed that both platforms have highly comparable performance on CNV detection in comparison with the microarray-based validated CNV datasets.

Conclusions The comparative analysis of different types of variant detection showed that DNBseq can be a more cost-effective platform for WGS than the Illumina platform.

Keywords

Next generation sequencing (NGS) · DNBseq platform · Illumina platform · Whole genome sequencing · NA12878

Introduction

With the cost of sequencing decreasing significantly in recent years, whole genome sequencing (WGS) has been increasingly applied in research on human oncology and on complex and monogenic diseases (Cristescu, et al. 2015; Scarpa, et al. 2017; Cai, et al. 2015; Yuen, et al. 2017; Kim, et al. 2017), with clinical whole genome sequencing being developed for diagnostic use (Clark, et al. 2019; Gross, et al. 2019; Miller, et al. 2017). Many researchers have also switched from whole exome sequencing (WES) to WGS, because WGS is more powerful than WES for detecting exome variants, genome structural variation, and copy number variants (Belkadi, et al. 2015; Liang, et al. 2017). Furthermore, WGS has been progressively applied in genome-wide haplotype phasing (Zheng, et al. 2016; Greer, et al. 2017). Illumina and BGI, the two major sequencing instrument providers, have both contributed to improving sequencing technology. Illumina launched the HiSeq X Ten System for human whole genome sequencing in 2014. Based on the Complete Genomics™ DNA NanoBalls (DNBs), combinatorial probe anchor synthesis (cPAS), rolling-circle replication and patterned nanoarray technologies (Drmanac, et al. 2010), BGI launched the DNBseq™ technology with the BGISEQ-500 instrument in 2015. Each system has its own sequencing chemistry and technology. Huang et al (2017) reported a preliminary comparison between the BGISEQ-500 and the HiSeq 2500, but for reasons of cost-effectiveness, most whole genome sequencing cohort studies have been performed with the HiSeq X Ten system rather than the HiSeq 2500 (Agrawal, et al. 2014; Nik-Zainal, et al. 2016). Hence, a comprehensive comparison between the DNBseq (BGISEQ-500) and Illumina HiSeq X Ten systems is needed, so that researchers can understand the accuracy and robustness of WGS variant calling with each technology and select the most cost-effective platform.

Methods and Materials

Materials

The Genome-In-A-Bottle (GIAB) sample NA12878, a B-lymphoblastoid cell line, is used globally as a reference material for evaluating clinical assays in development and is well characterized by many clinical laboratories. Hence, it was selected for this head-to-head comparison study. Genomic DNA of the NA12878 cell line (Coriell Cat# GM12878, RRID: CVCL_7526) was purchased from the Coriell Institute.

Illumina X Ten whole-genome sequencing datasets

Raw FASTQ data of 150 bp paired-end reads from whole genome sequencing of the human NA12878 sample (NA12878_L1), prepared by the TruSeq Nano workflow on the Illumina HiSeq X Ten system, were downloaded from Illumina BaseSpace (https://basespace.illumina.com).

High-confidence variant calls for NA12878

The GIAB consortium generated high-confidence SNP, InDel, and reference genotyping calls for NA12878 (Zook, et al. 2014), and these benchmark calls have been widely applied for optimization and analytical validation of clinical genome sequencing (Lincoln, et al. 2015; Patwardhan, et al. 2015). Hence, we used these high-confidence variant calls, including SNP, small InDel and homozygous reference calls for NA12878, for the platform performance comparison. They were downloaded from the NCBI database (https://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/NISTv3.3.1/GRCh37/HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.1_highconf_phased.vcf.gz).

Illumina Infinium Omni2.5 array datasets

To compare the low-coverage WGS data with array-based genotyping data, the reference genotyping datasets for NA12878 were downloaded from the 1000 Genomes database (http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/hd_genotype_chip/ALL.chip.omni_broad_sanger_combined.20140818.snps.genotypes.vcf.gz).

Agilent SurePrint G3 Human CGH Microarray data

Considering the high cost of whole genome sequencing when using the Illumina-based platform, the array-based platform is still the standard assay for CNV detection (Haraksingh, et al. 2017), especially in clinical cytogenetics (Uddin, et al. 2015; Stobbe, et al. 2014; D'Arrigo, et al 2016). To compare the CNV calling output from both sequencing technologies, the aCGH array data sets for NA12878 were downloaded from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE96909, and CNVs were called using Agilent Cytogenomics 3.0 with the default CGH QC metric set.

Library preparation and sequencing using DNBseq

Qualified genomic DNA of sample NA12878 (1 µg) was used for library preparation, including fragmentation, size selection, end repair, dTTP-tailed adaptor ligation, PCR amplification and splint circularization, followed by DNA Nanoball (DNB) preparation, DNB loading and imaging on the DNBseq platform as described previously (Huang, et al. 2017).

DNBseq whole-genome sequencing

As previously reported by Huang et al (2017), the raw image files from sequencing were processed by the DNBseq base-calling software with default parameters, and the sequence data of NA12878 were generated as paired-end reads, defined as "raw data" and stored in FASTQ format. In this way, 100 bp paired-end reads were generated using the DNBseq platform (BGISEQ-500 system).

Bioinformatics analysis

Data filtering

Raw sequence reads from DNBseq and Illumina were quality-controlled using SOAPnuke (version 2.0) with the following filtering steps: i. removing reads with adapters; ii. removing reads in which unknown bases make up more than 10%; and iii. removing reads in which more than 50% of bases have a sequencing quality below 10. After filtering, the remaining reads were considered “clean reads” and used for downstream bioinformatics analysis.
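The three filtering rules above can be illustrated with a minimal Python sketch. This is not SOAPnuke's implementation: the function name `passes_filter` and its parameters are hypothetical, and adapter detection is assumed to have been done elsewhere and is passed in as a flag.

```python
from typing import Sequence


def passes_filter(seq: str, quals: Sequence[int], has_adapter: bool,
                  max_n_frac: float = 0.10, low_q: int = 10,
                  max_low_q_frac: float = 0.50) -> bool:
    """Return True if a read survives the three rules described in the text."""
    # Rule i: discard reads in which an adapter was found (flag supplied by caller).
    if has_adapter:
        return False
    n = len(seq)
    if n == 0:
        return False
    # Rule ii: discard reads in which unknown bases (N) exceed 10% of the read.
    if sum(base in "Nn" for base in seq) / n > max_n_frac:
        return False
    # Rule iii: discard reads in which more than 50% of bases have quality < 10.
    if sum(q < low_q for q in quals) / n > max_low_q_frac:
        return False
    return True
```

The thresholds are exposed as keyword arguments so the same predicate could be reused with other cut-offs; the defaults follow the values stated in the text.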

Mapping and variant calling

All clean reads of each sample were mapped to the human reference genome (GRCh37/Hg19) using the mem algorithm in Burrows-Wheeler Aligner (version 0.7.12-r1039) (Li, et al. 2009; Li, et al. 2010).

To ensure accurate variant calling, we followed the best practice recommendations formulated by the GATK development team (https://www.broadinstitute.org/gatk/guide/best-practices). Local realignment around InDels and base quality score recalibration (BQSR) were performed using GATK (version 3.3.0) (McKenna, et al. 2010), and Picard (v1.118) (http://broadinstitute.github.io/picard/) was used to filter duplicate reads. The sequencing depth and coverage for each sample were calculated based on the alignments.

The HaplotypeCaller of GATK was used to call raw SNPs and InDels simultaneously via local de novo assembly of haplotypes in regions showing signs of variation. To obtain high-quality call sets for downstream data analysis, we then applied GATK Variant Quality Score Recalibration (VQSR), which uses a machine learning algorithm to filter the raw variant call sets. GATK VQSR uses high-quality known variant sets as training and truth resources and builds a predictive model to filter spurious variants. The SNPs and InDels marked PASS in the output VCF file were taken as the high-confidence variant set.

The Copy Number Variants (CNVs) were called using the CNVnator (version 0.2.7) read-depth algorithm (Abyzov, et al. 2011). This algorithm divides the genome into non-overlapping bins of equal size and uses the count of mapped reads in each bin as the read-depth signal. We used the standard settings and a bin size of 100 bp. The workflow is shown in Fig. 1.
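The binning step described above can be sketched in a few lines of Python. This is only an illustration of how a read-depth signal is built from fixed-size bins; it is not CNVnator code and omits every later step of the actual tool. The function name `read_depth_signal` and the use of 0-based read start positions are assumptions made for the example.

```python
from collections import Counter
from typing import Iterable, List


def read_depth_signal(read_starts: Iterable[int], chrom_length: int,
                      bin_size: int = 100) -> List[int]:
    """Count mapped reads per non-overlapping bin of equal size.

    read_starts are assumed to be 0-based start coordinates on one chromosome.
    """
    # Number of bins needed to tile the chromosome (the last bin may be partial).
    n_bins = (chrom_length + bin_size - 1) // bin_size
    # Assign each read to the bin containing its start position.
    counts = Counter(pos // bin_size for pos in read_starts
                     if 0 <= pos < chrom_length)
    return [counts.get(i, 0) for i in range(n_bins)]
```

Regions where this signal is consistently above or below the genome-wide average are the candidates for copy number gains or losses.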

Comparison of variant calls using different platforms

To compare the variants called by the DNBseq and Illumina platforms against the validated data sets of NA12878, we counted the true positive calls (TP), the number of called SNPs/InDels that are found in the high-confidence reference dataset; the false positive calls (FP), the number of called SNPs/InDels that are not found in the reference dataset; and the false negative calls (FN), the number of SNPs/InDels that are found in the high-confidence reference dataset but are not called by the sequencing platform. Sensitivity and precision were calculated as TP/(TP + FN) and TP/(TP + FP), respectively. In this calculation, only SNPs and InDels covered by the GIAB high-confidence variant call data sets and the Illumina Infinium Omni2.5 array data sets, respectively, were included.
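The TP/FP/FN definitions and the two formulas above can be sketched as follows. The sketch assumes each variant is represented by a hashable key such as a (chromosome, position, ref, alt) tuple; the function names are hypothetical and the toy data are not taken from the study.

```python
from typing import Hashable, Set, Tuple


def classify_calls(called: Set[Hashable], truth: Set[Hashable]) -> Tuple[int, int, int]:
    """Return (TP, FP, FN) for a call set against a high-confidence truth set."""
    tp = len(called & truth)   # called and present in the reference set
    fp = len(called - truth)   # called but absent from the reference set
    fn = len(truth - called)   # present in the reference set but not called
    return tp, fp, fn


def sensitivity(tp: int, fn: int) -> float:
    """TP / (TP + FN)."""
    return tp / (tp + fn)


def precision(tp: int, fp: int) -> float:
    """TP / (TP + FP)."""
    return tp / (tp + fp)


# Toy example with made-up variants.
truth = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr2", 50, "G", "A")}
called = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr3", 10, "T", "C")}
tp, fp, fn = classify_calls(called, truth)
```

Both call set and truth set would first have to be restricted to the high-confidence regions, as the text specifies; that restriction is left out of the sketch.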

Comparison on whole genome sequencing with low coverage

High-density arrays have been considered the first option for performing large-scale genome-wide association studies (GWAS) (Pistis, et al. 2015; Sung, et al. 2018). Since the cost of sequencing has dropped precipitously, low-coverage WGS is now frequently used for detection of common, low-frequency or rare genetic variants (Cai, et al. 2015; Cai, et al. 2017; Gizer, et al 2018) and for large-scale population genomics characterization (Rustagi, et al. 2017). Hence, we also compared the performance of the DNBseq and Illumina platforms at low WGS sequencing depths of 4x, 6x, 8x and 10x.

Cost comparison analysis (CCA)

Taking the SNP calls located in the high-confidence regions as an example, the CCA was calculated using the following formula:

CCA = Ct/(Cv+Kv)

where

Ct = Cost of sequencing on the given platform at 30x depth

Cv = Total concordant SNPs called by both platforms

Kv = Validated platform-specific SNPs
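As a worked illustration of the formula, the following sketch evaluates it with the cost estimates and SNP counts reported in the Results section of this paper. The function name `cost_per_variant` is hypothetical.

```python
def cost_per_variant(ct: float, cv: int, kv: int) -> float:
    """CCA = Ct / (Cv + Kv): sequencing cost per validated SNP."""
    return ct / (cv + kv)


# Values as reported in the Results section (cost in US dollars at 30x depth).
CONCORDANT_SNPS = 2_946_356                                    # Cv
illumina = cost_per_variant(1500, CONCORDANT_SNPS, 93_250)     # Kv-Illumina
dnbseq = cost_per_variant(600, CONCORDANT_SNPS, 88_892)        # Kv-DNBseq
fold_change = illumina / dnbseq

print(f"{illumina:.2e} {dnbseq:.2e} {fold_change:.2f}")
```

Because Cv dominates the denominator for both platforms, the fold change is driven almost entirely by the ratio of the two sequencing costs; computed from unrounded values it comes out near 2.5, in line with the value of about 2.49 obtained from the rounded CCA figures.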

Results

Sequence data summary

After sequencing on the DNBseq platform, we obtained 1,002,366,106 raw sequence reads (~100.24 Gb) for sample NA12878. For comparison, we selected a similar amount of data (778,240,910 sequence reads, ~117.00 Gb) from the public Illumina dataset generated from this GIAB cell line (https://basespace.illumina.com). After removing low-quality reads, we obtained 100.16 Gb and 110.08 Gb of high-quality data for DNBseq and Illumina, respectively. The clean reads obtained using the DNBseq and Illumina platforms had comparably high Q20 and Q30 scores, i.e. the percentage of bases with a quality score (-10×lg(error rate)) higher than 20 and 30, respectively, which indicated high sequencing quality (Fig. 2). The GC content was similar for both platforms, being 41%. Further information on cumulative sequencing depth for both platforms can be found in Fig. 2-3. The data mapping metrics are summarized in Table 1. The average sequencing coverage depths for the DNBseq and the Illumina datasets were 33.02 X and 31.57 X, respectively (Table 1). We obtained a unique mapping rate of 94.33% for the DNBseq dataset, which is higher than that obtained using the Illumina platform (85.14%). In addition, the duplicate read rate of the DNBseq platform (1.77%) was markedly lower than that of the Illumina platform (11.76%). Both findings differ from the results reported by Huang et al (2017), which indicated that the DNBseq system is highly comparable with Illumina (HiSeq 2500 system) in terms of genome mapping and duplicate rate, possibly reflecting that the sequencing chemistry of the Illumina HiSeq X Ten differs from that of the HiSeq 2500 (Reuter, et al. 2015). Owing to its low duplicate rate, DNBseq delivered a greater sequencing depth from a smaller amount of data than the Illumina platform. Of note, similar sequencing coverage at different depths was observed for both the DNBseq and the Illumina platforms (Table 1). To examine the completeness of sequencing, the depth and breadth of genomic coverage were analysed for each platform. Both platforms covered most of the genome, and >97% of the genome was covered by 10 or more reads (Table 1). Finally, for the Illumina platform, coverage dropped to zero at a slightly lower read depth than observed for the DNBseq platform, because of the substantially fewer reads in the Illumina data set.
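The Q20 and Q30 metrics used above follow from the Phred definition Q = -10×lg(error rate). A minimal sketch of the conversion and of the fraction-of-bases calculation is given below; the function names are hypothetical, and the use of "at least" (>=) as the threshold is an assumption based on common convention rather than something stated in the text.

```python
from typing import Sequence


def phred_to_error(q: float) -> float:
    """Invert Q = -10 * log10(p): error probability for a Phred score."""
    return 10 ** (-q / 10)


def fraction_at_least(quals: Sequence[int], threshold: int) -> float:
    """Fraction of bases whose quality is at or above the threshold (assumed >=)."""
    return sum(q >= threshold for q in quals) / len(quals)


# Toy quality string of four bases (made-up values).
example_quals = [10, 20, 30, 40]
q20 = fraction_at_least(example_quals, 20)
q30 = fraction_at_least(example_quals, 30)
```

A Phred score of 20 thus corresponds to an error probability of 1 in 100 per base, and a score of 30 to 1 in 1,000.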

SNP calling and differences in two platform

In total, we identified 3,482,838 and 3,499,428 SNPs using the DNBseq and Illumina platforms, respectively (Table 2). The SNPs detected using the DNBseq dataset were comparable to those identified in the Illumina dataset based on different features, including the fraction of SNPs in dbSNP and the 1000 Genomes database, the proportion of SNPs in different regions related to genes, and the transition/transversion (Ti/Tv) ratio, which indirectly reflects SNP accuracy (Table 2).

To compare the sensitivity and precision of each platform for SNP calling, we selected the SNPs that are included in the GIAB high-confidence variant reference data set for further assessment. Overall, the data from both platforms exhibited similar precision and sensitivity (99.94% vs. 99.89% and 96.21% vs. 96.34% for DNBseq vs. Illumina), although the DNBseq data had slightly fewer false positive calls and more false negative calls than the Illumina data (Table 3). Furthermore, 94.06% (2,946,356 out of 3,132,297) of the unique SNPs were concordant; that is, either a homozygous or a heterozygous SNP was detected at the same locus by the two platforms (Fig. 4). We identified 185,941 SNPs called by only one of the two platforms, of which 90,172 were specific to DNBseq (2.88% of the combined SNPs) and 95,769 were specific to Illumina (3.06% of the combined SNPs).
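The concordance figures above amount to set arithmetic on the loci called by each platform. A minimal sketch follows, assuming each SNP is keyed by its locus as a (chromosome, position) tuple, which matches the definition of concordance given in the text; the function names are hypothetical. The final lines check the reported percentage from the reported counts.

```python
from typing import Set, Tuple

Locus = Tuple[str, int]


def compare_platforms(a: Set[Locus], b: Set[Locus]) -> Tuple[int, int, int]:
    """Return (shared, only_a, only_b) locus counts for two call sets."""
    return len(a & b), len(a - b), len(b - a)


def concordance_percent(shared: int, only_a: int, only_b: int) -> float:
    """Shared loci as a percentage of the union of both call sets."""
    return 100.0 * shared / (shared + only_a + only_b)


# Toy call sets (made-up loci).
dnbseq_calls = {("chr1", 100), ("chr1", 200), ("chr2", 50)}
illumina_calls = {("chr1", 100), ("chr1", 200), ("chr3", 10)}
toy_counts = compare_platforms(dnbseq_calls, illumina_calls)

# Counts reported in the text: concordant, DNBseq-specific, Illumina-specific.
reported = concordance_percent(2_946_356, 90_172, 95_769)
```

Keying on the locus alone treats a heterozygous call on one platform and a homozygous call on the other as concordant; a stricter comparison would include the genotype in the key.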

When inspecting the fraction of novel SNPs, we found that 2,945,177 (99.96%) of the concordant SNPs were present in the dbSNP138 database (Sherry, et al. 2001). Similarly, 88,892 (98.58%) of the SNPs in the DNBseq-specific set and 93,250 (97.37%) of the SNPs in the Illumina-specific set were present in dbSNP138. Thus, the platform-specific calls were only slightly enriched for novel SNPs (1.42% and 2.63% of SNPs are novel variants for the DNBseq and the Illumina platforms, respectively), which suggests that they likely contain only a small number of errors for both platforms.

One important criterion for assessing the quality of variant calling is Ti/Tv (Durbin, et al. 2010). In our study, the Ti/Tv of SNPs concordant between the two platforms was 2.15. For all SNPs detected by the DNBseq platform, Ti/Tv was 2.15, but for the DNBseq-specific SNPs, it was 1.97. Similarly, for SNPs detected by Illumina, Ti/Tv was 2.15, but for SNPs specific to the Illumina platform, it was 1.98 (Fig. 4). Hence, the Ti/Tv of concordant SNPs was very close to the expected value, whereas the Ti/Tv of platform-specific SNPs was slightly lower, which suggests that the accuracy of platform-specific calls was somewhat lower. However, the nearly identical platform-specific Ti/Tv values also indicate that both platforms perform very similarly.
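For readers unfamiliar with the metric, Ti/Tv is the number of transitions (A<->G, C<->T) divided by the number of transversions (all other single-base substitutions). A minimal sketch, assuming biallelic SNPs given as (ref, alt) pairs; the function name and the toy data are hypothetical.

```python
from typing import Iterable, Tuple

# Purine<->purine and pyrimidine<->pyrimidine substitutions are transitions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ti_tv_ratio(snps: Iterable[Tuple[str, str]]) -> float:
    """Ratio of transitions to transversions for biallelic SNPs."""
    ti = tv = 0
    for ref, alt in snps:
        if (ref.upper(), alt.upper()) in TRANSITIONS:
            ti += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions; Ti/Tv is undefined")
    return ti / tv


# Toy example: three transitions and one transversion.
toy_ratio = ti_tv_ratio([("A", "G"), ("C", "T"), ("A", "C"), ("G", "A")])
```

Random substitution errors would produce twice as many transversions as transitions, so a call set contaminated with false positives tends to show a lower Ti/Tv than a clean one, which is why the ratio serves as an indirect accuracy check.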

We preliminarily investigated the coverage of tandem repeat regions for both platforms. According to the mapping rate and the fraction of targets covered at different depths, the DNBseq and Illumina platforms are highly comparable (Table 4), but, as expected, the DNBseq data set has a higher mapping rate and a lower duplicate rate than the Illumina data set.

Low-coverage comparison analysis

To further assess the accuracy of the DNBseq and the Illumina platforms, we compared the performance of the platforms on low-coverage WGS datasets. We selected four sequencing depths to cover typical low-coverage settings: ~4x, ~6x, ~8x and ~10x for each sequencing platform. After alignment to the reference genome, we found that DNBseq still had a lower duplicate rate and a higher unique genome mapping rate regardless of sequencing depth. Both platforms were highly similar in terms of other mapping metrics, including total genome mapping rate and coverage at different depths (Table 5). To check the precision and sensitivity of SNP calling, we compared the calls with the high-confidence variant calls of NA12878 to estimate true positive, false positive, and false negative calls for the DNBseq and the Illumina platforms (Table 6). From 4x to 10x coverage, the DNBseq platform performed slightly better than the Illumina platform in terms of precision and sensitivity (Fig. 5), suggesting that the DNBseq platform generated higher-quality genotyping data from low-pass WGS datasets at these depths.

InDel calling using the two platforms

The InDel calls from the Illumina and the DNBseq platforms were also examined. We identified 823,627 InDels from the DNBseq data, compared to 656,186 InDels detected using the Illumina platform. In terms of novel InDel calls, the DNBseq platform identified more InDels than the Illumina platform, regardless of whether the variant was homozygous or heterozygous (Table 7). Furthermore, Durbin et al (2010) reported that InDels of 1-3 bp could account for more than 70% of all unique InDels in coding regions of the human genome. When we examined small InDels of 1-21 bp in coding sequences (CDS), we found that both platforms exhibited highly similar length distributions (Fig. 6) and that the predominant InDel lengths in coding regions were less than 10 bp.

Among the InDels located in the GIAB high-confidence regions, we found that 313,082 (86.76%) were detected by both the DNBseq and the Illumina platforms, whereas 27,346 (7.58%) and 20,425 (5.66%) were DNBseq- and Illumina-specific InDels, respectively. The DNBseq data exhibited higher sensitivity (93.23%) than the Illumina data (90.32%), with similar precision (97.62% for DNBseq vs. 96.55% for Illumina). Detection precision was also assessed separately for concordant and platform-specific InDels in the high-confidence regions by comparing them to the validated InDels characterized by the GIAB consortium. For the concordant InDels, the precision reached 98.4%. For the InDels detected by only one platform, the precision for DNBseq (80.73%) was considerably higher than that for the Illumina platform (62.73%), indicating that the InDel calling quality of DNBseq was higher than that of the Illumina platform. Since InDels are more complex than SNPs, this suggests that the DNBseq platform may be advantageous for detecting complex variants (Fig. 7).

Differences in CNV callings

We also examined the CNV calls from both platforms. In total, we detected 4,049 and 3,678 CNVs, with deletion lengths of 75,058,300 bp and 75,746,100 bp, for the DNBseq and the Illumina platforms, respectively (Table 8). The CNVs identified using the DNBseq dataset were comparable to those identified from the Illumina dataset in different gene element regions, which indirectly reflected the diversity of the CNVs (Table 8). Based on the distribution of CNVs across all chromosomes, we found that the number of copy number gain variants detected by DNBseq was slightly higher than the number detected by the Illumina platform, whereas for copy number loss variants the DNBseq platform yielded results similar to those of the Illumina platform (Fig. 8).

To further assess the accuracy of CNV calling, we compared our calls with validated CNV data sets established by standard CNV arrays, using the Agilent array comparative genomic hybridization (aCGH) results (SurePrint G3 Human CGH Microarray 4×180K, SurePrint G3 Human CGH Microarray 2×400K and SurePrint G3 Human CGH Microarray 1×1M) for NA12878 as validation data sets (Table 9).

Using the Agilent SurePrint G3 Human CGH Microarray data as the reference data set, we observed that both the DNBseq- and the Illumina-based data sets exhibited less than 100% overlap with the reference CNV data set, and that the false negative rate was high. Thus, based on these limited CNV validation datasets, our results suggest that CNV calling by both platforms was less robust than SNP and InDel calling. However, owing to the complexity of CNVs compared to SNPs or InDels, more validated datasets are needed to compare the performance of both platforms in CNV calling. Still, overall our results showed that the DNBseq and the Illumina platforms performed comparably in terms of CNV calling sensitivity, both genome-wide and in autosomal regions (Table 10-12).

Cost comparison analysis (CCA)

Illumina-specific SNPs (Kv-Illumina = 93,250) and DNBseq-specific SNPs (Kv-DNBseq = 88,892) are the true positive SNPs according to the GIAB validated datasets. The total number of concordant SNPs called by both platforms was 2,946,356 (Cv). The cost of sequencing was estimated at $1,500 and $600 for Illumina (HiSeq X Ten) and DNBseq (BGISEQ-500), respectively. The CCA was 4.93 × 10^-4 for Illumina and 1.98 × 10^-4 for DNBseq. The fold change CCA-Illumina/CCA-DNBseq is thus around 2.49, which indicates that the DNBseq platform is a more cost-effective platform for WGS applications than the Illumina platform.

Discussion

Huang et al. published the first comparison of DNBseq and Illumina HiSeq 2500 datasets (Huang, et al. 2017). Our study complements their findings by comparing the performance of the DNBseq sequencing technology with the Illumina HiSeq X Ten system. We conclude that both sequencing platforms are generally capable of detecting most SNPs. Based on validated reference sample genotyping data, our results indicated that DNBseq-specific SNPs were more robustly validated than Illumina-specific SNPs. However, the Illumina platform might be more sensitive, since more SNPs were detected using the Illumina platform. In part, this may reflect the longer reads of the Illumina platform, as difficult regions are more easily covered by longer reads.

We found that InDel detection is subject to a relatively high bias on both sequencing platforms, with each platform detecting a certain number of short InDels missed by the other. However, when we examined the precision of platform-specific InDels, our findings indicated that the precision of the DNBseq platform was higher than that of the Illumina platform, possibly reflecting the higher accuracy of the DNBseq technology (Drmanac, et al. 2010).

Regarding CNV detection, we did not observe any significant biases between the two sequencing platforms, but the quality of the reference CNV datasets we used may be a limiting factor for drawing precise conclusions when comparing CNV calling on the two platforms. Hence, better and more comprehensive validated CNV datasets are needed for further comparative analyses. Furthermore, in future work, an investigation of the structural variation detection performance of these two sequencing platforms would be warranted.

Conclusion

Our findings show that 94.06% and 86.76% of unique SNPs and InDels, respectively, were concordant between the two platforms when applying high-confidence variant calls. Furthermore, we observed comparable SNP calling accuracy for DNBseq data (sensitivity 96.21% and precision 99.94%) and Illumina data (sensitivity 96.34% and precision 99.89%). Of note, we found lower accuracy for InDel detection using the Illumina platform (sensitivity 90.32% and precision 96.55%) than using the DNBseq platform (sensitivity 93.23% and precision 97.62%). In addition, our study showed that both platforms exhibit highly comparable performance regarding CNV detection in comparison to the microarray-based validated CNV datasets. Overall, according to the comparative analysis of different types of variant detection performance, we conclude that DNBseq provides a more cost-effective platform for WGS than the Illumina platform.

Acknowledgements

We thank Grace Cai from BGI Americas for comments on the manuscript.

Funding

This work has been supported by BGI-Genomics, who covered the cost of library preparation and sequencing.

Availability of data and materials


The DNBseq WGS sequence datasets are available on European Nucleotide Archive (ENA) database under the accession no. PRJEB19427 (https://www.ebi.ac.uk/ena/data/view/PRJEB19427).

Authors’ contributions

ZP, HG, YZ and KK designed the study. MT performed the experiment. JW and FT analyzed the data. ZP drafted the paper. JW prepared the figures and Tables. ZW, ZP and KK revised the paper. All the authors approved the paper.

Ethics approval and consent to participate

The sample involved in this study is an established human cell line only; ethics approval is therefore not required.

Consent for publication

Not applicable

Competing interests

This study was funded by BGI-Genomics. All the authors are employees of BGI or its subsidiaries. In addition, BGI is the parent company of the manufacturer of DNBseq.

References

Abyzov A, Urban AE, Snyder M, Gerstein M (2011) CNVnator: an approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing. Genome Res 21(6): 974-984.

Agrawal N, Akbani R, Aksoy BA, et al (2014) Integrated genomic characterization of papillary thyroid carcinoma. Cell 159(3): 676-690.

Belkadi A, Bolze A, Itan Y, et al (2015) Whole-genome sequencing is more powerful than whole-exome sequencing for detecting exome variants. Proc. Natl. Acad. Sci 112(17): 5473-5478.

Cai N, Bigdeli TB, Kretzschmar W, et al (2015) Sparse whole-genome sequencing identifies two loci for major depressive disorder. Nature 523(7562): 588-591.

Cai N, Bigdeli TB, Kretzschmar W, et al (2017) 11,670 whole-genome sequences representative of the Han Chinese population from the CONVERGE project. Sci Data 4:170011.

Cristescu R, Lee J, Nebozhyn M, et al (2015) Molecular analysis of gastric cancer identifies subtypes associated with distinct clinical outcomes. Nat Med 21(5): 449-456.


Clark MM, Hildreth A, Batalov S, et al (2019) Diagnosis of genetic diseases in seriously ill children by rapid whole-genome sequencing and automated phenotyping and interpretation. Science Translational Medicine 11:489.

D'Arrigo S, Gavazzi F, Alfei E, et al (2016) The Diagnostic Yield of Array Comparative Genomic Hybridization Is High Regardless of Severity of Intellectual Disability/Developmental Delay in Children. J Child Neurol 31(6): 691-699.

Drmanac R, Sparks AB, Callow MJ, et al (2010) Human genome sequencing using unchained base reads on self-assembling DNA nanoarrays. Science 327(5961): 78-81.

Durbin RM, Abecasis GR, Altshuler DL, et al (2010) A map of human genome variation from population-scale sequencing. Nature 467: 1061–1073.

Scarpa A, Chang DK, Nones K, et al (2017) Whole-genome landscape of pancreatic neuroendocrine tumours. Nature 543(7643): 65-71

Gizer IR, Bizon C, Gilder DA, et al (2018) Whole genome sequence study of cannabis dependence in two independent cohorts. Addict Biol 23(1): 461-473.

Greer SU, Nadauld LD, Lau BT, et al (2017) Linked read sequencing resolves complex genomic rearrangements in gastric cancer metastases. Genome Med 9(1):57.

Gross AM, Ajay SS, Rajan V, et al (2019) Copy-number variants in clinical genome sequencing: deployment and interpretation for rare and undiagnosed disease. Genetics in Medicine 21: 1121-1130.

Haraksingh RR, Abyzov A, Urban AE (2017) Comprehensive performance comparison of high-resolution array platforms for genome-wide Copy Number Variation (CNV) analysis in humans. BMC Genomics 18(1): 321.

Huang J, Liang XM, Xuan YK, et al (2017) A reference human genome dataset of the BGISEQ-500 sequencer. Gigascience 6(5): 1–9.

Kim J, Shimizu C, Kingsmore SF, et al (2017) Whole genome sequencing of an African American family highlights toll like receptor 6 variants in Kawasaki disease susceptibility. PLoS One 12(2): e0170977.

Li H and Durbin R (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25: 1754-1760.

Li H and Durbin R (2010) Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26: 589-595.


Liang D, Wang Y, Ji X, et al (2017) Clinical application of whole-genome low-coverage next-generation sequencing to detect and characterize balanced chromosomal translocations. Clin Genet 91(4): 605-610.

Lincoln SE, Kobayashi Y, Anderson MJ, et al (2015) A Systematic Comparison of Traditional and Multigene Panel Testing for Hereditary Breast and Ovarian Cancer Genes in More Than 1000 Patients. J Mol Diagn 17(5): 533-544.

Miller KA, Twigg SR, McGowan SJ, et al (2017) Diagnostic value of exome and whole genome sequencing in craniosynostosis. J Med Genet 54(4): 260-268.

McKenna A, Hanna M, Banks E, et al (2010). The Genome Analysis Toolkit: a MapReduce framework for analyzing next generation DNA sequencing data. Genome Res 20(9): 1297–1303.

Pistis G, Porcu E, Vrieze SI, Sidore C, et al (2015) Rare variant genotype imputation with thousands of study-specific whole-genome sequences: implications for cost-effective study designs. Eur J Hum Genet 23(7): 975-983.

Rustagi N, Zhou A, Watkins WS, et al (2017) Extremely low-coverage whole genome sequencing in South Asians captures population genomics information. BMC Genomics 18(1): 396.

Reuter JA, Spacek DV, Snyder MP (2015) High-throughput sequencing technologies. Mol Cell 58(4): 586-597.

Sherry ST, Ward MH, Kholodov M, Baker J, et al (2001) dbSNP: the NCBI database of genetic variation. Nucleic Acids Res 29(1): 308-311.

Stobbe G, Liu Y, Wu R, et al (2014) Diagnostic yield of array comparative genomic hybridization in adults with autism spectrum disorders. Genet Med 16(1): 70-77.

Sung YJ, Winkler TW, de Las Fuentes L, et al (2018) A Large-Scale Multi-ancestry Genome-wide Study Accounting for Smoking Behavior Identifies Multiple Significant Loci for Blood Pressure. Am J Hum Genet 102(3): 375-400.

Uddin M, Thiruvahindrapuram B, Walker S, et al (2015) A high-resolution copy-number variation resource for clinical and population genetics. Genet Med 17(9): 747-752.

Yuen RKC, Merico D, Bookman M, et al (2017) Whole genome sequencing resource identifies 18 new candidate genes for autism spectrum disorder. Nat Neurosci 20(4): 602–611.

Zheng GXY, Lau BT, Schnall-Levin M, et al (2016) Haplotyping germline and cancer genomes with high-throughput linked-read sequencing. Nat Biotechnol 34(3): 303-311.


Nik-Zainal S, Davies H, Staaf J, et al (2016) Landscape of somatic mutations in 560 breast cancer whole-genome sequences. Nature 534(7605): 47-54.

Zook JM, Chapman B, Wang J, et al (2014) Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls. Nat Biotechnol 32(3): 246-251.

Patwardhan A, Harris J, Leng N, et al (2015) Achieving high-sensitivity for clinical applications using augmented exome sequencing. Genome Med 7: 71.


Table 1 Statistics of sequencing data

Sample                            DNBseq           Illumina
Raw reads                         1,002,366,106    778,240,910
Raw bases (Mb)                    100,236.61       116,995.58
Clean reads                       1,001,630,550    732,165,210
Clean bases (Mb)                  100,163.05       110,083.73
Clean data rate (%)               99.93            94.09
Clean reads Q20 (%)               95.10            95.91
Clean reads Q30 (%)               84.9             90.47
GC content (%)                    41.71            40.94
Mapping rate (%)                  99.47            96.52
Unique rate (%)                   94.33            85.14
Duplicate rate (%)                1.77             11.76
Mismatch rate (%)                 0.53             0.56
Average sequencing depth (X)      33.02            31.57
Coverage (%)                      99.1             98.95
Coverage at least 4X (%)          98.62            98.43
Coverage at least 10X (%)         97.68            97.24
Coverage at least 20X (%)         93.09            91.45


Table 2 SNP variation statistics of the dataset

Sample                                 DNBseq       Illumina
Total SNPs                             3,482,838    3,499,428
Fraction of SNPs in dbSNP (%)          99.29        99.09
Fraction of SNPs in 1000genomes (%)    97.13        96.6
Novel                                  17,097       23,450
Homozygous                             1,409,979    1,428,407
Heterozygous                           2,072,859    2,071,021
Intronic                               1,369,699    1,368,632
5' UTRs                                4,132        4,194
3' UTRs                                22,120       22,100
Upstream                               47,323       47,592
Downstream                             46,340       46,373
Intergenic                             1,963,944    1,980,758
Ti/Tv                                  2.06         2.07


Table 3 Precision and sensitivity of DNBseq- and Illumina-specific SNP calling at high confidence regions

Sample              True positive calls    False positive calls    False negative calls    Precision (%)    Sensitivity (%)
NA12878-DNBseq      3,034,804              1,778                   119,509                 99.94            96.21
NA12878-Illumina    3,038,981              3,198                   115,332                 99.89            96.34
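As an illustration, the precision and sensitivity values in Table 3 can be recomputed from the call counts. The sketch below assumes the standard definitions, precision = TP / (TP + FP) and sensitivity = TP / (TP + FN), which reproduce the tabulated percentages; the function names are illustrative and are not taken from the analysis pipeline.

```python
def precision(tp, fp):
    """Percentage of called SNPs that are true positives (assumed standard definition)."""
    return 100.0 * tp / (tp + fp)


def sensitivity(tp, fn):
    """Percentage of benchmark SNPs that were recovered (assumed standard definition)."""
    return 100.0 * tp / (tp + fn)


# (true positives, false positives, false negatives), copied from Table 3.
calls = {
    "NA12878-DNBseq": (3_034_804, 1_778, 119_509),
    "NA12878-Illumina": (3_038_981, 3_198, 115_332),
}

for sample, (tp, fp, fn) in calls.items():
    print(f"{sample}: precision={precision(tp, fp):.2f}% "
          f"sensitivity={sensitivity(tp, fn):.2f}%")
```

The same two functions can be applied to the low-coverage counts in Table 6 to obtain the values plotted in Fig. 5.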


Table 4 Comparison between DNBseq and Illumina on tandem repeats region

Sample                                    DNBseq         Illumina
Initial bases on target                   116,431,574    116,431,574
Total effective reads                     952,689,357    627,692,952
Total effective bases (Mb)                93,866.9       91,470.38
Effective sequences on target (Mb)        3,400.83       3,845.19
Mapping rate on target (%)                99.95          96.54
Duplicate rate on target (%)              2.92           11.69
Mismatch rate in target region (%)        3.51           5.48
Average sequencing depth on target        29.21          33.03
Fraction of target covered >= 1x (%)      59.3           59.4
Fraction of target covered >= 4x (%)      57.6           57.3
Fraction of target covered >= 10x (%)     53.9           53.1
Fraction of target covered >= 20x (%)     44.8           42.5


Table 5 Mapping metrics for DNBseq and Illumina with low-coverage WGS data

                                DNBseq-4X      Illumina-4X    DNBseq-6X      Illumina-6X    DNBseq-8X      Illumina-8X    DNBseq-10X     Illumina-10X
Clean reads                     129,007,242    92,413,756     193,548,638    131,719,628    253,130,008    170,914,636    306,814,610    216,147,580
Clean bases (Mb)                12,901         13,954         19,355         19,890         25,313         25,808         30,681         32,638
Mapping rate (%)                99.46          99.46          99.46          96.39          99.46          96.37          99.46          96.32
Unique rate (%)                 94.72          90.16          94.56          89.66          94.42          89.21          94.31          87.76
Duplicate rate (%)              1.33           6.42           1.51           6.95           1.67           7.41           1.79           8.91
Mismatch rate (%)               0.59           0.68           0.57           0.69           0.55           0.69           0.53           0.68
Average sequencing depth (X)    4.32           4.26           6.47           6.03           8.44           7.77           10.21          9.66
Coverage (%)                    96.32          95.89          98.16          97.75          98.57          98.30          98.72          98.53
Coverage at least 4X (%)        58.00          56.46          82.92          78.74          92.26          89.46          95.55          94.44
Coverage at least 10X (%)       1.99           1.94           12.71          9.63           32.25          24.99          51.88          45.70


Table 6 Comparison on SNP calling with low-coverage WGS

                        DNBseq                                            Illumina
Sequencing depth        4X           6X           8X           10X          4X           6X           8X           10X
True positive calls     1,861,173    2,275,139    2,564,299    2,731,124    1,657,113    2,066,135    2,394,756    2,638,124
False positive calls    464,671      270,346      253,307      248,537      461,501      283,055      264,026      264,963
False negative calls    1,981,994    1,568,031    1,278,870    1,112,042    2,186,061    1,777,039    1,448,414    1,205,043


Table 7 InDel variation statistics of the dataset

Sample                                   DNBseq     Illumina
Total InDels                             823,627    656,186
Fraction of InDels in dbSNP (%)          77.66      82.31
Fraction of InDels in 1000genomes (%)    55.05      62.91
Novel                                    167,967    103,860
Homozygous                               301,855    259,173
Heterozygous                             521,772    397,013
Intronic                                 349,670    270,944
5' UTRs                                  707        667
3' UTRs                                  6,243      5,166
Upstream                                 13,551     10,620
Downstream                               13,451     10,329
Intergenic                               437,994    356,570


Table 8 Statistics of CNV detection

Sample                       DNBseq        Illumina
Total CNVs                   4,049         3,678
Exonic                       446           431
Splicing                     3             4
NcRNA                        0             0
Intronic                     1,182         966
5' UTRs                      0             0
3' UTRs                      2             1
Upstream                     63            45
Downstream                   68            56
Intergenic                   2,285         2,175
Amplification Length (bp)    11,599,100    14,026,100
Deletion Length (bp)         75,058,300    75,746,100


Table 9 Known CNVs detected by the Agilent aCGH array

                       Number of CNV
aCGH array type        Autosomal    Total
Agilent aCGH 2x400K    58           170
Agilent aCGH 4x180K    25           78
Agilent aCGH 1x1M      159          352


Table 10 Comparing DNBseq/Illumina with aCGH array in whole genome wide analyses

                       DNBseq                                 Illumina
aCGH array             Overlapped CNV   aCGH uniq   Total     Overlapped CNV   aCGH uniq   Total
Agilent aCGH 2x400K    19               59          78        19               59          78
Agilent aCGH 4x180K    34               136         170       35               135         170
Agilent aCGH 1x1M      112              240         352       104              248         352

Note: "Overlapped CNV" denotes the CNVs detected by both the aCGH array and the DNBseq/Illumina platform; "aCGH uniq" denotes the CNVs detected by the aCGH array only.


Table 11 Comparing DNBseq/Illumina with aCGH array in autosomal regions

                       DNBseq                                 Illumina
aCGH array             Overlapped CNV   aCGH uniq   Total     Overlapped CNV   aCGH uniq   Total
Agilent aCGH 2x400K    15               10          25        15               10          25
Agilent aCGH 4x180K    31               27          58        32               26          58
Agilent aCGH 1x1M      99               60          159       94               65          159

Note: "Overlapped CNV" denotes the CNVs detected by both the aCGH array and the DNBseq/Illumina platform; "aCGH uniq" denotes the CNVs detected by the aCGH array only.


Table 12 Sensitivity of CNV calls using DNBseq and Illumina platforms

            Whole genome wide                                         Autosomal region
Platform    True positive calls   False negative calls   Sensitivity   True positive calls   False negative calls   Sensitivity
DNBseq      165                   435                    27.50%        145                   97                     59.92%
Illumina    158                   442                    26.33%        141                   101                    58.26%
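The whole-genome counts in Table 12 appear to be the sums of the per-array counts in Table 10. This is an inference from the numbers rather than something stated in the text. A minimal sketch of that pooling, with an illustrative helper name, is:

```python
# (overlapped CNV, aCGH-unique CNV) for the three Agilent arrays, copied from
# Table 10 (whole genome). Treating "overlapped" as true positives and
# "aCGH uniq" as false negatives is an assumption made for this sketch.
whole_genome = {
    "DNBseq": [(19, 59), (34, 136), (112, 240)],
    "Illumina": [(19, 59), (35, 135), (104, 248)],
}


def pooled_sensitivity(pairs):
    """Sum counts over arrays and return (TP, FN, sensitivity in percent)."""
    tp = sum(overlapped for overlapped, _ in pairs)
    fn = sum(unique for _, unique in pairs)
    return tp, fn, 100.0 * tp / (tp + fn)


for platform, pairs in whole_genome.items():
    tp, fn, sens = pooled_sensitivity(pairs)
    print(f"{platform}: TP={tp} FN={fn} sensitivity={sens:.2f}%")
```

Applied to the autosomal counts in Table 11, the same helper gives the autosomal columns of Table 12.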


Fig. 1 Whole genome sequencing analysis pipeline


Fig. 2 Distribution of base quality scores on clean reads. (A)-(B) and (C)-(D) indicate the data generated by the DNBseq and Illumina platforms, respectively. The X-axis is the position along reads. The Y-axis is the quality value. Each dot in the image represents the quality score of the corresponding position along reads.


Fig. 3 Genome coverage at different sequencing depths. (A) and (B) indicate the data generated by the DNBseq and Illumina platforms, respectively. The X-axis denotes sequencing depth, and the Y-axis indicates the fraction of the whole genome, excluding gap regions, that is covered at or above a given sequencing depth.


Fig. 4 SNP comparison between DNBseq and Illumina on GIAB high-confidence regions


Fig. 5 Precision and sensitivity of SNP calling with low-coverage WGS


Fig. 6 InDel length distribution in CDS. (A) and (B) indicate the data generated by the DNBseq and Illumina platforms, respectively.


Fig. 7 InDel comparison analysis between DNBseq and Illumina on GIAB High-confidence regions


Fig. 8 Distribution of CNV at different chromosomes


Paper II

Comparative analysis of sample extraction and library construction for shotgun metagenomics

Zonghui Peng, Xiaolong Zhu, Zhijiao Wang, Xianting Yan, Guangbiao Wang, Meifang Tang, Awei Jiang, Karsten Kristiansen. Comparative analysis of sample extraction and library construction for shotgun metagenomics. Bioinformatics and Biology Insights. 2020;14:1-13.


Paper III

Comparative analysis of library preparation methods for formalin-fixed paraffin-embedded RNAseq

Zonghui Peng1,2†*, Qiwei Sun3†, Zhijiao Wang3†, Fei Teng3, Liang Zong4, Yipting Kwong5, Bimeng Tu3, Karsten Kristiansen2,6,7*

1 BGI Americas, Cambridge, Massachusetts, USA.
2 Laboratory of Genomics and Molecular Biomedicine, Department of Biology, University of Copenhagen, Copenhagen, Denmark.
3 BGI Genomics, Shenzhen, China.
4 BGI-Wuhan, Wuhan, China.
5 BGI-Hongkong, Hongkong, China.
6 BGI-Shenzhen, Shenzhen, China.
7 China National GeneBank, BGI-Shenzhen, Shenzhen, China.

† These authors contributed equally.

* Corresponding authors:
Zonghui Peng, BGI Americas Corporation, 1 Broadway, 14th Fl, Cambridge, MA 02142. Telephone: +1 617-500-2754. Email: [email protected]
Karsten Kristiansen, Laboratory of Genomics and Molecular Biomedicine, Department of Biology, University of Copenhagen, Universitetsparken 13, 2100 København, Denmark. Telephone: +45 3532 4443. Email: [email protected].


Abstract

Background: Ribosomal RNA (rRNA)-depletion and protein-coding gene capture-based library preparation methods are widely applied in formalin-fixed paraffin-embedded (FFPE)-derived gene expression profiling. To identify suitable methods and provide a benchmark for FFPE RNAseq, we investigated three major library construction methods: Duplex-Specific Nuclease (DSN), TruSeq Ribo-Zero (Ribo-Zero) and TruSeq RNA Access (RNA Access). Pairs of commercial freshly frozen (FF) and FFPE samples were tested with all three library methods in duplicate. The methods were assessed in terms of data effectiveness, quantification bias, method reproducibility, FF-FFPE correlation, RNA detection, and the ability to detect single nucleotide variants (SNVs).

Results: With identical sample input, all three tested methods provided high correlation between replicates using either FF or FFPE specimens. For mRNA analysis, RNA Access was superior to the rRNA depletion methods, Ribo-Zero and DSN, for clinical FFPE specimens, using data obtained by RNAseq of paired FF specimens as the reference. RNA Access provided highly consistent mRNA profiling between matched FFPE and FF samples, with a Pearson correlation coefficient of ~0.8. By contrast, the Ribo-Zero and DSN methods had advantages for ncRNA and variant detection. The DSN method enabled superior detection of non-coding RNA, with a Pearson correlation coefficient of more than 0.9 for matched FF and FFPE samples. Finally, based on the average Ti/Tv ratio, our results indicate that DSN provided a more robust and reliable method for variant calling than RNA Access or Ribo-Zero, and furthermore yielded a consistently high correlation between FF- and FFPE-derived SNV calls. Therefore, we conclude that the DSN method performs robustly for SNV calling, especially for FFPE samples.

Conclusions: Based on our results, we recommend that RNA Access be used for analysis of mRNA expression, noting that non-coding RNA can also be detected by this method. By contrast, our analyses indicated that the DSN protocol would be the preferred choice for analysis of ncRNA, and our results also provided evidence that DSN would be preferable for SNV calling, especially when using FFPE samples.

Keywords: FFPE, RNAseq, library preparation, DSN, Ribo-Zero, RNA Access

Background

Formalin fixation and paraffin embedding (FFPE) is the most common method of sample preservation, particularly for clinical samples. Nucleic acids and proteins can be preserved for a longer time in FFPE specimens than in freshly frozen (FF) samples [1]. Furthermore, FFPE samples are routinely stored at room temperature, whereas FF samples usually need storage in liquid nitrogen, and thus the storage cost of FFPE samples is lower than that of FF samples [2]. However, total RNA isolated from FFPE samples is most often degraded and is furthermore subject to chemical modification during formalin fixation and paraffin embedding [3, 4].

Gene expression profiling based on FFPE samples initially used microarrays, but in 2009, Wang et al. predicted that RNAseq technology would be more powerful and would replace or complement the use of microarrays [5]. At present, RNAseq is the favoured protocol for gene expression profiling using FFPE samples. However, the use of FFPE samples is still challenging because RNA degradation generates small fragments that are unsuitable for poly-A-enrichment library preparation, in part because transcripts with degraded poly-A tails cannot be captured by oligo-dT primers, and gene quantification biases can be expected owing to poor randomness of the read distribution [6]. Since the commonly used poly-A-enrichment library preparation protocol cannot be used, removing or circumventing ribosomal RNA becomes an issue. Current protocols therefore include ribosomal RNA (rRNA)-depletion methods and coding transcriptome/RNA exome probe capture-based methods. The Duplex-Specific Nuclease (DSN) method [7] and the Ribo-Zero rRNA removal kit (Illumina) [8] are currently the major rRNA-depletion methods, and the RNA Access kit (now TruSeq RNA Exome kit, Illumina) is so far the most popular method focusing on exome regions.

RNAseq is currently a routine Next Generation Sequencing (NGS) application for mRNA profiling, but there has been no benchmarking of library construction methods for FFPE RNAseq, using established methods or commercial kits, that indicates which method would be most suitable for FFPE-derived gene expression quantification. In this study, we comprehensively evaluated RNAseq library preparation, especially in relation to data quality metrics such as rRNA depletion, genome/gene/exon mapping ratio, and the consistency and reproducibility of gene expression profiling. We compared three major methods, RNA Access, Ribo-Zero and DSN, using pairs of commercial clinical FFPE and FF samples processed with identical instruments and by the same operators.

Results

RNA Extraction

FF and FFPE sample extraction information is listed in Table 1. For each sample, we generated 6 aliquots so that libraries could be constructed with each of the three methods in two replicates. RNA extracted from FFPE samples had a reasonable quality, with DV200 > 50%.

Data quality and statistics summary

On average, Ribo-Zero, DSN and RNA Access generated 35.1 (range: 33.9-36.3) million, 29.2 (range: 26.6-31.2) million and 36.4 (range: 36.2-36.6) million raw reads for the FF sample, respectively. When using FFPE samples, 20.1 (range: 17.6-22.7) million, 37.2 (range: 36.2-38.1) million and 35.7 (range: 35.6-35.8) million raw reads were generated for Ribo-Zero, DSN and RNA Access, respectively (Table 3). Although the same amount of total RNA was used and the same quantity of cDNA library was pooled per lane, DSN produced less data than Ribo-Zero and RNA Access when using the FF sample. Using FFPE samples, however, Ribo-Zero yielded less data than the other two methods, indicating that DSN and RNA Access exhibit a higher RNA capture efficiency than Ribo-Zero for FFPE samples, whereas DSN may be less suitable for normal fresh frozen samples. After data filtration, we observed that the clean ratio (clean reads/raw reads) was higher for samples processed with the RNA Access protocol than for samples processed with the Ribo-Zero and DSN protocols, for both FF and FFPE samples, indicating that the quality of the RNA Access library is better than that of the other two protocols. Still, the clean reads produced by Ribo-Zero, DSN and RNA Access had similar Q20 and Q30 scores (Table 2).

Mapping quality assessment

Genome and gene mapping were conducted for FF and FFPE samples separately. For the FF samples, we found that RNA Access had 83.8% multiple gene alignments on average (range: 83.6%-84.1%), which was higher than the rates obtained using DSN (45.4% on average, range: 45.2-45.6%) and Ribo-Zero (37.4% on average, range: 37.3-37.5%). RNA Access (72.6% on average, range: 70.5-74.6%) also performed significantly better than the other two methods (Ribo-Zero, 15.1% on average, range: 14.8-15.3%; DSN, 6.6% on average, range: 6.5-6.7%) using FFPE samples (Fig. 1). We obtained similar results for unique genome and unique gene mapping, indicating that RNA Access exhibited the best performance for both FF and FFPE samples in terms of multiple and unique gene mapping rates (Fig. 1).

We further evaluated the exon, intron and intergenic region mapping ratios for each library, which affect the effectiveness of sequencing reads: the higher the coding mapping rate and the lower the non-coding alignment ratio, the more reads are effectively available for downstream data analysis, such as gene expression quantification. For the FFPE sample, RNA Access produced on average 86.4% (range: 85.8%-87.0%) of reads that could be mapped to exon regions, significantly higher than observed for Ribo-Zero (63.3% on average, range: 63.0%-63.5%) and DSN (3.1% on average, range: 2.9%-3.2%). Similarly, we found that RNA Access also had the best performance in terms of exon, intron and intergenic region mapping rates (Fig. 2), probably reflecting that RNA Access is based on exon capture. Finally, the FFPE data metrics were closer to the corresponding FF data metrics for RNA Access than for the other two methods. Unexpectedly, we found that DSN exhibited much lower exon mapping for FFPE samples, but still performed slightly better than the Ribo-Zero method for the FF sample. Herbert et al. [9] reported that Ribo-Zero produced an rRNA mapping rate of <20% for degraded samples. By contrast, we found that reads generated by Ribo-Zero contained >50% rRNA using FF samples, which significantly reduces the number of usable reads for exon/gene quantification and, furthermore, would lead to an overestimation of the expression levels of genes that overlap with intronic regions of other genes.
Because rRNA contamination is a key concern in RNAseq quantification studies, we also examined rRNA contamination for each library. Our results showed that, on average, RNA Access produced fewer reads that mapped to rRNA than Ribo-Zero and DSN for both FF and FFPE samples. Comparing FF-derived and FFPE-derived datasets generated by the different protocols revealed that RNA Access exhibited the smallest difference between the two sample types (Fig. 3).

Reads distribution and coverage analysis of transcripts

For a cDNA library of fairly good quality and with enough sequencing data, most transcripts will be completely covered, and reads will be evenly distributed throughout the transcript. To evaluate the performance of the library preparation protocols on the FFPE sample, we used FF-derived libraries as a reference to detect biases introduced by the different protocols. Thus, we calculated the read distribution of each detected transcript for each library. This analysis showed that each protocol generated highly similar distribution patterns for both FF duplicates. However, when comparing the FFPE-derived read distribution with the paired FF-derived result using the same library method, we observed that the RNA Access method showed the highest similarity, whereas samples processed using the Ribo-Zero and DSN protocols exhibited significant biases at the 5' and 3' ends of transcripts (Fig. 4). In addition, as shown in Fig. 5, we observed that some low abundance transcripts were not detected using the Ribo-Zero protocol for FFPE samples. These findings suggest that RNA Access performed better than Ribo-Zero and DSN regarding gene expression evenness for FFPE samples.

RNA detection

After Fragments Per Kilobase of transcript per Million mapped reads (FPKM) normalization of all clean sequencing reads, we compared RNA detection between the three library methods using the default threshold (FPKM ≥ 1). RNA Access focuses on protein-coding regions, and as expected, Ribo-Zero and DSN detected more known and novel genes than RNA Access on FF samples. When using FFPE samples, Ribo-Zero detected fewer genes and transcripts than the other two methods, which may be explained by data size, because FFPE-derived Ribo-Zero libraries generated the lowest number of sequencing reads and lowly expressed genes need deeper sequencing to be covered (see Table 3). However, when comparing FFPE-derived and FF-derived data, RNA Access showed on average no significant difference between FFPE and FF samples in the percentages of known genes and transcripts detected (p = 0.63 and 0.34 for the gene- and transcript-based paired t-tests, respectively), whereas the corresponding values were p = 0.06 and 0.01 for Ribo-Zero and p = 0.03 and 0.02 for DSN.
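To make the normalization and detection threshold concrete, the following minimal sketch computes FPKM with the standard formula and applies the FPKM ≥ 1 cut-off. The gene names, counts and function names are hypothetical. The sketch shows only the normalization arithmetic and does not reproduce the expectation-maximization estimation that RSEM performs.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


def detected(fpkm_by_gene, threshold=1.0):
    """Return the genes whose FPKM meets the detection threshold."""
    return {gene for gene, value in fpkm_by_gene.items() if value >= threshold}


# Hypothetical library with 30 million mapped fragments.
total = 30_000_000
counts = {
    "gene_a": (500, 2000),  # (mapped fragments, transcript length in bp)
    "gene_b": (10, 1500),
}
values = {gene: fpkm(n, length, total) for gene, (n, length) in counts.items()}
print(values)
print(sorted(detected(values)))
```

Because the threshold is applied after normalization, a gene with few fragments in a shallow library can fall below the cut-off even if it is expressed, which is consistent with the data-size explanation given above for the FFPE-derived Ribo-Zero libraries.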
Correlation analysis between different methods

Since two replicates were included for each sample type and each library protocol, reproducibility and the correlation of gene expression levels can be used to evaluate whether a library protocol is reliable and robust for a given type of sample. The correlation values between each pair of samples, based on FPKM results, are shown in Fig. 6 and Fig. 7. For coding RNA, RNA Access (range: 0.79-0.81) produced a higher correlation between pairs of FFPE and FF samples than Ribo-Zero (range: 0.49-0.50) and DSN (range: 0.12-0.13). Interestingly, in the non-coding RNA (ncRNA)-based correlation analysis, DSN showed a higher correlation (range: 0.93-0.97) than Ribo-Zero (range: 0.78-0.79) and RNA Access (range: 0.85-0.88), indicating that DSN could be an optimal option for ncRNA studies (Fig. 7).
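The correlation values discussed above are Pearson coefficients computed on per-gene FPKM vectors. A minimal sketch of that calculation follows; the expression values are hypothetical, and whether the original analysis transformed FPKM values (for example by taking logarithms) before correlation is not stated, so none is applied here.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length numeric vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("vectors must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical FPKM values for the same five genes in a matched FF/FFPE pair.
ff_fpkm = [10.0, 0.5, 3.2, 25.0, 7.1]
ffpe_fpkm = [8.4, 0.9, 2.5, 19.7, 9.3]
print(f"r = {pearson(ff_fpkm, ffpe_fpkm):.3f}")
```

Restricting the gene list to coding or non-coding genes before calling the function would give the two kinds of correlation shown in Fig. 6 and Fig. 7.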

Hierarchical clustering analysis

To identify how the library methods differ from each other and to further investigate the correlation between matched FFPE and FF samples, we applied hierarchical clustering analysis with a Euclidean distance matrix to construct coding RNA- and non-coding RNA-based dendrograms from the expression data. For coding RNA, we observed, as expected, that replicate samples clustered together regardless of library method or sample type (Fig. 8). FF-derived DSN and FF-derived Ribo-Zero libraries clustered together, while FFPE-derived DSN and FFPE-derived Ribo-Zero samples separated on different branches. By contrast, FF-derived RNA Access and FFPE-derived RNA Access libraries clustered into one single branch, providing additional evidence for the superior performance of the RNA Access protocol in generating FFPE-derived expression data that resemble FF-derived data. We also investigated ncRNA-based hierarchical clustering, which revealed that samples clustered by method, with a tighter clustering of RNA Access-based samples than of Ribo-Zero-processed samples. Of note, samples prepared using the DSN protocol almost clustered together into a single tree branch, providing further support for the notion that the DSN protocol performs robustly for ncRNA profiling regardless of whether FFPE or FF samples are the starting material (Fig. 9).

Variants detection

We performed an analysis of total SNV calling. The SNVs from the FF and FFPE RNAseq datasets were called using GATK 4.0. After filtration, we retained SNVs that met the following criteria: quality score of consensus genotype ≥ 20, covered depth ≥ 5 and repeats (estimated copy number of flanking sequences in the genome) ≤ 1. On average, FF-derived DSN, Ribo-Zero and RNA Access libraries yielded 264,222 (range: 250,648-277,796), 559,712 (range: 538,463-580,960) and 78,384 (range: 75,196-81,571) SNVs, respectively. FFPE-derived DSN, Ribo-Zero and RNA Access libraries yielded 267,441 (range: 239,647-295,235), 101,468 (range: 80,432-122,503) and 69,901 (range: 61,419-78,382) SNVs, respectively. Using FFPE samples, the DSN protocol clearly enabled detection of more SNVs than the other two methods. By contrast, when the FF sample was tested, the Ribo-Zero protocol enabled detection of significantly more SNVs than the DSN and RNA Access protocols. In terms of the type of change, regardless of sample type or library method, all SNVs were enriched in the A>G and C>T variant types, which are the two major RNA editing types [10], suggesting a possible contribution from RNA editing to the SNVs.

In addition, we assessed the quality of SNV calling by considering the Ti/Tv ratio, an important parameter for evaluating mutation calling quality. As shown in Fig. 10, the average Ti/Tv ratio for FF-derived DSN, Ribo-Zero and RNA Access samples was 2.71 (range: 2.69-2.72), 3.31 (range: 3.30-3.31) and 2.52 (range: 2.49-2.55), respectively. By contrast, we observed significant differences for the FFPE sample, with values for FFPE-derived DSN, Ribo-Zero and RNA Access samples of 2.58 (range: 2.57-2.58), 11.53 (range: 11.50-11.56) and 7.03 (range: 7.00-7.05), respectively. DSN thus exhibited highly comparable performance between FF and FFPE samples in relation to SNV detection, mutation types from A>G to G>T, and Ti/Tv ratio, suggesting that DSN would be a suitable choice for single nucleotide mutation detection.
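The filtering criteria and the Ti/Tv ratio described in this section can be sketched as follows. This is an illustrative outline rather than the pipeline used in the study: the `Snv` record, its field names and the example variants are hypothetical, and the thresholds are those quoted in the text. Transitions are the purine-purine and pyrimidine-pyrimidine substitutions (A<->G, C<->T); all other substitutions are transversions.

```python
from dataclasses import dataclass

TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


@dataclass(frozen=True)
class Snv:
    ref: str
    alt: str
    quality: float  # quality score of the consensus genotype
    depth: int      # number of reads covering the site
    repeats: float  # estimated copy number of flanking sequence


def passes_filter(snv, min_quality=20, min_depth=5, max_repeats=1):
    """Apply the three retention criteria quoted in the text."""
    return (snv.quality >= min_quality
            and snv.depth >= min_depth
            and snv.repeats <= max_repeats)


def ti_tv(snvs):
    """Ratio of transitions to transversions among the given SNVs."""
    snvs = list(snvs)
    ti = sum(1 for s in snvs if (s.ref, s.alt) in TRANSITIONS)
    tv = len(snvs) - ti
    if tv == 0:
        raise ValueError("no transversions; Ti/Tv is undefined")
    return ti / tv


# Hypothetical calls: three pass, three fail one criterion each.
calls = [
    Snv("A", "G", 30, 10, 1),
    Snv("C", "T", 25, 8, 1),
    Snv("A", "C", 40, 12, 1),
    Snv("G", "A", 15, 20, 1),  # fails the quality criterion
    Snv("T", "C", 50, 3, 1),   # fails the depth criterion
    Snv("G", "T", 35, 9, 2),   # fails the repeats criterion
]
kept = [s for s in calls if passes_filter(s)]
print(len(kept), ti_tv(kept))
```

A Ti/Tv ratio that departs strongly from the FF-derived value, as reported above for the FFPE-derived Ribo-Zero and RNA Access libraries, would show up here as a disproportionate number of records in the transition set.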

Discussion

FFPE specimens have found extensive use for gene expression profiling within clinical research in relation to disease treatment and stratification of patients [11-17]. Previous reports have demonstrated the use of the Ribo-Zero or DSN protocols for RNAseq analysis of FFPE specimens [18]. Subsequently, RNase H treatment was reported to be a more robust method than Ribo-Zero for FFPE samples [19]. However, these methods all rely on rRNA depletion, which is often incomplete, leading to a low exon-to-rRNA mapping ratio. Our comparative analyses indicated that FFPE-derived RNA Access data may be superior to data generated from FF samples using the Ribo-Zero or DSN protocols (Fig. 1, 2 and 3). Thus, the rRNA depletion strategy seems to be less useful for the degraded RNA often present in FFPE samples. The high cost of RNA Access may explain the widespread use of the DSN protocol or other non-commercialized methods as the option for samples with high rRNA levels. However, due to the low efficiency of data utilization, rRNA depletion libraries need much deeper sequencing to compensate for the low exon and gene mapping rates, which results in an overall cost that may even exceed the cost of the RNA Access kit.

With the same sample input, all three tested methods provided high correlation between replicates using either FF or FFPE specimens (Fig. 6 and 7). Overall, we observed that for mRNA analysis the RNA Access method performed best for clinical FFPE specimens compared with the rRNA depletion methods, Ribo-Zero and DSN, using data obtained by RNAseq of FF specimens as the reference. Our findings demonstrate that RNA Access provided highly consistent mRNA profiling between matched FFPE and FF samples, with a Pearson correlation coefficient of ~0.8. Based on our results, we therefore recommend that RNA Access be used for analysis of mRNA expression, noting that non-coding RNA can also be detected by this method. By contrast, we found that the Ribo-Zero and DSN methods had advantages for ncRNA and variant detection. In particular, the DSN method enabled superior detection of non-coding RNA and exhibited a high correlation between matched FF and FFPE samples in non-coding RNA profiling, with a Pearson correlation coefficient of more than 0.9. Thus, we recommend the DSN protocol for analysis of ncRNA. Finally, based on the average Ti/Tv ratio for the total number of detected SNVs, our analyses suggest that the DSN method would be more reliable for variant calling than the RNA Access or Ribo-Zero methods, and furthermore, we observed a consistently high correlation between FF- and FFPE-derived DSN-based SNV calls. Therefore, we conclude that the DSN method performs robustly for SNV calling, especially for FFPE samples.

Conclusions

Our comparative study indicated that the RNA Access method is suitable for FFPE mRNA-based expression profiling. If both non-coding RNA and mRNA profiling is the study goal, DSN would be a better option for FFPE samples than Ribo-Zero and RNA Access. Additionally, if SNV calling is required, our study revealed that DSN is more suitable than the other two protocols for FFPE RNAseq.

Methods

FFPE and FF samples

To assess the robustness of library preparation methods for FFPE samples, we designed the study using matched FFPE and FF tumor tissue preparations from the same patient. RNAseq of FF tissue-derived RNA was defined as the reference for gene expression and was used to compare the performance of each method on FFPE samples. Paired human diffuse large B-cell lymphoma (DLBCL) tumor tissue, composed of one FFPE tissue block and one matched FF block, was purchased from OriGene Technologies, Inc. (cat# CB808941 and CB632047, respectively) (Table 4). Characterization of FFPE and FF tissues was performed by a trained pathologist, and tumor regions were macrodissected for total RNA isolation.

Tissue RNA isolation and QC

The RNeasy Micro Kit (Qiagen) and the RNeasy FFPE Kit (Qiagen) were used to isolate and purify total RNA from FF and FFPE samples, respectively. The quantity and quality of isolated total RNA from FFPE and FF tissue samples were analyzed with an Agilent 2100 Bioanalyzer using the RNA 6000 Nano Kit (Agilent). The extracted total RNA from the FFPE and FF samples was split into 6 identical aliquots each.

Library preparation and QC

DSN library: 100 ng of FF- or FFPE-derived total RNA was used for library preparation following the method reported by Yi et al., including the DSN normalization process [20]. Ribo-Zero library: the Illumina TruSeq Ribo-Zero total RNA kit was used with 100 ng of FF- or FFPE-derived total RNA to prepare the libraries according to the manufacturer's instructions. RNA Access library: following the instructions for the TruSeq RNA Access kit (now TruSeq RNA Exome kit, Illumina), 100 ng of FF- or FFPE-derived total RNA was used to construct the library. Two technical replicates were made for each method and sample. All libraries were analyzed by qPCR to quantify the molar concentration, and an Agilent 2100 Bioanalyzer was used to determine the fragment size.

Sequencing and data QC

The Illumina HiSeq 4000 was used for sequencing. All cDNA libraries were evenly pooled based on the qPCR quantification results, aiming to produce 35 million 100-bp paired-end reads for each library. All raw sequence reads were quality controlled using SOAPnuke (version 2.0) with the following filtering steps: i. removing reads with adapters; ii. removing reads in which unknown bases constitute more than 10%; and iii. removing reads in which more than 50% of bases have a sequencing quality below 10. After filtering, the remaining reads were considered "clean reads" and used for downstream bioinformatics analysis (Fig. 11).
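The three filtering steps listed above can be expressed as a single predicate. The sketch below is a plain restatement of those criteria for one read and is not SOAPnuke's implementation. The function name and parameters are hypothetical, adapter detection is assumed to have been done elsewhere and is passed in as a flag, and paired-end handling is not modeled.

```python
def keep_read(seq, quals, has_adapter,
              max_n_frac=0.10, min_qual=10, max_lowq_frac=0.50):
    """Return True if a read survives the three filters described in the text.

    seq         -- base calls as a string
    quals       -- per-base Phred quality scores, same length as seq
    has_adapter -- result of an upstream adapter check (assumed given)
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")
    if has_adapter or not seq:
        return False
    # ii. discard reads in which unknown bases exceed 10%
    if seq.upper().count("N") / len(seq) > max_n_frac:
        return False
    # iii. discard reads in which more than 50% of bases have quality < 10
    low_quality = sum(1 for q in quals if q < min_qual)
    return low_quality / len(quals) <= max_lowq_frac


reads = [
    ("ACGTACGTAC", [30] * 10, False),           # kept
    ("ACGTACGTAC", [30] * 10, True),            # adapter
    ("NNACGTACGT", [30] * 10, False),           # 20% unknown bases
    ("ACGTACGTAC", [5] * 6 + [30] * 4, False),  # 60% low-quality bases
]
clean = [r for r in reads if keep_read(*r)]
print(f"clean ratio = {len(clean) / len(reads):.2f}")
```

The ratio printed at the end corresponds to the clean ratio (clean reads/raw reads) used as a quality metric in the Results.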
Data Analysis

The clean reads were mapped to the genome with HISAT (Hierarchical Indexing for Spliced Alignment of Transcripts) [21] and to genes with Bowtie2 [22], using hg19 and the RefSeq database as the respective references. BGI-FastQC, a BGI in-house script, was used to check rRNA contamination for each library. Gene and isoform expression levels were quantified using RSEM (RNA-Seq by Expectation Maximization) [23].

During RNAseq experiments, mRNAs are first fragmented into short segments by chemical methods and then sequenced. If fragmentation is not sufficiently random, read preference for specific gene regions will directly affect subsequent gene quantification. We therefore used the distribution of reads along the genes to evaluate randomness.

We detected differentially expressed genes (DEGs) using PoissonDis, which is based on the Poisson distribution, as described by Audic et al. [24]. We calculated the Pearson correlation between all samples using cor, with Pearson's correlation coefficients denoting the distance between any two samples. We performed hierarchical clustering of all samples using hclust and constructed diagrams with the ggplot2 package in R [25,26] to identify similarities among the same samples prepared with different library methods, and the correlation between FF and FFPE samples prepared with the same library method.

We assembled transcripts from reads using Cufflinks (v2.2.1) [27]. Comparing the assembled transcripts with the reference gene annotation allowed us to find assembled transcripts that extended the 5' or 3' end of annotated transcripts, thereby refining the gene structure annotation. In addition, to discover novel regions, we compared our assembled genes/transcripts with the reference sequences using the following criteria:
1. The transcript must be at least 200 bp away from any annotated gene.
2. The transcript must be longer than 180 bp.
3. The sequencing depth must be no less than 2.

We used GATK 4.0 [28] to call SNVs for each library. After filtering out unreliable sites, we compiled the final SNVs in VCF format.
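The three criteria for novel regions listed above can be sketched as a simple filter. This is an illustration only, not the pipeline used in the study. The `Interval` type, the function names and the half-open coordinate convention are hypothetical. Transcript length is approximated here by genomic span, whereas the actual analysis may have used spliced length.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    chrom: str
    start: int  # 0-based, inclusive (assumed convention)
    end: int    # 0-based, exclusive


def gap(a, b):
    """Distance in bp between two intervals on the same chromosome.

    Returns 0 if they overlap or touch, and None if they lie on
    different chromosomes.
    """
    if a.chrom != b.chrom:
        return None
    return max(0, a.start - b.end, b.start - a.end)


def is_novel(transcript, depth, annotated_genes,
             min_gap=200, min_len=180, min_depth=2.0):
    """Apply the three criteria to one assembled transcript."""
    # Criterion 2: length must be over 180 bp.
    if transcript.end - transcript.start <= min_len:
        return False
    # Criterion 3: sequencing depth must be no less than 2.
    if depth < min_depth:
        return False
    # Criterion 1: at least 200 bp away from every annotated gene.
    for gene in annotated_genes:
        d = gap(transcript, gene)
        if d is not None and d < min_gap:
            return False
    return True


if __name__ == "__main__":
    genes = [Interval("chr1", 1000, 2000)]  # made-up annotation
    print(is_novel(Interval("chr1", 2200, 2500), 3.0, genes))
    print(is_novel(Interval("chr1", 2100, 2500), 3.0, genes))
```

The thresholds are exposed as keyword arguments so that the strictness of each criterion can be varied without touching the logic.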

Data availability

All RNAseq datasets are available in the CNGB Nucleotide Sequence Archive under accession no. CNP0001026 (https://db.cngb.org/cnsa/project/CNP0001026/reviewlink/). All other data are available upon request from the corresponding authors (Z.P. or K.K.).

Acknowledgements

We thank Deqiong Ma from Yale School of Medicine for comments on the manuscript.

Funding

This work was supported by BGI-Genomics, which covered the cost of library preparation and sequencing for this study.

Authors' contributions

ZW, ZP and KK designed the study. LZ and YK performed the experiments. QS and FT analyzed the data. ZP drafted the paper. BT prepared the figures and tables. ZW and ZP revised the paper, with a final extensive revision by KK. All the authors approved the paper.

Ethics approval and consent to participate

The study protocol was approved by the BGI Institutional Review Board (IRB No: FT19074). All samples involved in this study were purchased from a commercial vendor.

Consent for publication

Not applicable

Competing interests

This study was funded by BGI-Genomics. All the authors are employees of BGI or its subsidiaries.

References

1. Hedegaard J, Thorsen K, Lund MK, Hein AM, Hamilton-Dutoit SJ, Vang S, et al. Next-generation sequencing of RNA and DNA isolated from paired fresh-frozen and formalin-fixed paraffin-embedded samples of human cancer and normal tissue. PLoS One. 2014;9:e98187.
2. Nechifor-Boilă AC, Loghin A, Vacariu V, Halaţiu VB, Borda A. The storage period of the formalin-fixed paraffin-embedded tumor blocks does not influence the concentration and purity of the isolated DNA in a series of 83 renal and thyroid carcinomas. Rom J Morphol Embryol. 2015;56(2 Suppl):759-63.
3. Watanabe M, Hashida S, Yamamoto H, Matsubara T, Ohtsuka T, Suzawa K, et al. Estimation of age-related DNA degradation from formalin-fixed and paraffin-embedded tissue according to the extraction methods. Experimental and Therapeutic Medicine. 2017;14(3):2683-8.
4. Howat WJ, Wilson BA. Tissue fixation and the effect of molecular fixatives on downstream staining procedures. Methods. 2014;70(1):12-9.
5. Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009;10(1):57-63.
6. Adiconis X, Borges-Rivera D, Satija R, DeLuca DS, Busby MA, Berlin AM, et al. Comparative analysis of RNA sequencing methods for degraded or low-input samples. Nat Methods. 2013;10(7):623-9.
7. Luo S, Khrebtukova I, Perou C, Schroth GP. Complete RNA-seq analysis of cancer transcriptomes from FFPE samples. Genome Biol. 2010;11:P35.
8. Huang R, Jaritz M, Guenzl P, Vlatkovic I, Sommer A, Tamir IM, et al. An RNA-Seq strategy to detect the complete coding and non-coding transcriptome including full-length imprinted macro ncRNAs. PLoS One. 2011;6(11):e27288.
9. Herbert ZT, Kershner JP, Butty VL, Thimmapuram J, Choudhari S, Alekseyev YO, et al. Cross-site comparison of ribosomal depletion kits for Illumina RNAseq library construction. BMC Genomics. 2018;19(1):199.
10. Peng Z, Cheng Y, Tan BC, Kang L, Tian ZJ, Zhu YK, et al. Comprehensive analysis of RNA-Seq data reveals extensive RNA editing in a human transcriptome. Nat Biotechnol. 2012;30(3):253-60.
11. Kojima K, April C, Canasto-Chibuque C, Chen XT, Deshmukh M, Venkatesh A, et al. Transcriptome profiling of archived sectioned formalin-fixed paraffin-embedded (AS-FFPE) tissue for disease classification. PLoS One. 2014;9(1):e86961.
12. Cieslik M, Chugh R, Wu YM, Wu M, Brennan C, Lonigro R, et al. The use of exome capture RNA-seq for highly degraded RNA with application to clinical cancer sequencing. Genome Res. 2015;25(9):1372-81.
13. Graw S, Meier R, Minn K, Bloomer C, Godwin AK, Fridley B, et al. Robust gene expression and mutation analyses of RNA-sequencing of formalin-fixed diagnostic tumor samples. Sci Rep. 2015;5:12335.
14. Priedigkeit N, Watters RJ, Lucas PC, Basudan A, Bhargava R, Horne W, et al. Exome-capture RNA sequencing of decade-old breast cancers and matched decalcified bone metastases. JCI Insight. 2017;2(17):e95703.
15. Ayers M, Lunceford J, Nebozhyn M, Murphy E, Loboda A, Kaufman DR, et al. IFN-γ-related mRNA profile predicts clinical response to PD-1 blockade. J Clin Invest. 2017;127(8):2930-40.
16. Ott PA, Bang YJ, Piha-Paul SA, Abdul Razak AR, Bennouna J, Soria JC, et al. T-Cell-Inflamed Gene-Expression Profile, Programmed Death Ligand 1 Expression, and Tumor Mutational Burden Predict Efficacy in Patients Treated With Pembrolizumab Across 20 Cancers: KEYNOTE-028. J Clin Oncol. 2019;37(4):318-27.
17. Cristescu R, Mogg R, Ayers M, Albright A, Murphy E, Yearley J, et al. Pan-tumor genomic biomarkers for PD-1 checkpoint blockade-based immunotherapy. Science. 2018;362(6411):eaar3593.
18. Zhao W, He X, Hoadley KA, Parker JS, Hayes DN, Perou CM. Comparison of RNA-Seq by poly (A) capture, ribosomal RNA depletion, and DNA microarray for expression profiling. BMC Genomics. 2014;15(1):419.
19. Guo Y, Wu J, Zhao S, Ye F, Su Y, Clark T, et al. RNA sequencing of formalin-fixed, paraffin-embedded specimens for gene expression quantification and data mining. Int J Genomics. 2016:9837310.
20. Yi H, Cho YJ, Won S, Lee JE, Yu HJ, Kim SJ, et al. Duplex-specific nuclease efficiently removes rRNA for prokaryotic RNA-seq. Nucleic Acids Res. 2011;39(20):e140.
21. Kim D, Langmead B, Salzberg SL. HISAT: a fast spliced aligner with low memory requirements. Nat Methods. 2015;12:357-60.
22. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9(4):357-9.
23. Li B, Dewey CN. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics. 2011;12:323.
24. Audic S, Claverie JM. The significance of digital gene expression profiles. Genome Res. 1997;7:986-95.
25. Eisen MB, Spellman PT, Brown PO, Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A. 1998;95(25):14863-8.
26. de Hoon MJ, Imoto S, Nolan J, Miyano S. Open source clustering software. Bioinformatics. 2004;20(9):1453-4.
27. Trapnell C, Hendrickson DG, Sauvageau M, Goff L, Rinn JL, Pachter L. Differential analysis of gene regulation at transcript resolution with RNA-seq. Nat Biotechnol. 2013;31(1):46-53.
28. McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010;20(9):1297-303.

Table 1 Sample extraction and QC metrics statistics

Sample type   Concentration (ng/μL)   Volume (μL)   Total Mass (μg)   RIN   28S/18S   DV200
FFPE          1,645                   30            49.35             2.3   1.2       51%
FF            606                     10            6.06              5.1   1.3       -

Note: The DV200 value denotes the percentage of fragments >200 nucleotides.

Table 2 Statistics of sequencing data

Sample             Total Raw Reads (M)   Total Clean Reads (M)   Total Clean Bases (Gb)   Clean Reads Q20 (%)   Clean Reads Q30 (%)   Clean Reads Ratio (%)
Ribo-Zero_FF1      33.9                  31.2                    6.2                      99.3                  97.7                  92.0
Ribo-Zero_FF2      36.3                  33.4                    6.7                      99.4                  97.8                  91.9
Ribo-Zero_FFPE1    17.6                  15.2                    3.0                      99.2                  97.0                  86.3
Ribo-Zero_FFPE2    22.7                  19.8                    4.0                      99.2                  97.0                  87.2
DSN_FF1            31.8                  28.1                    5.6                      99.4                  97.9                  88.5
DSN_FF2            26.6                  23.7                    4.7                      99.5                  98.1                  88.9
DSN_FFPE1          36.2                  32.5                    6.5                      99.2                  97.4                  89.7
DSN_FFPE2          38.1                  33.9                    6.8                      99.1                  97.3                  88.9
RNA Access_FF1     36.2                  34.7                    7.0                      99.2                  97.4                  96.0
RNA Access_FF2     36.6                  35.2                    7.0                      99.2                  97.4                  95.9
RNA Access_FFPE1   35.8                  33.3                    6.7                      99.3                  97.7                  92.9
RNA Access_FFPE2   35.6                  31.8                    6.4                      99.4                  97.7                  89.4

Table 3 RNA detection statistics

Sample             Total gene #   Known gene #   Novel gene #   Known gene %   Total transcript #   Known transcript #   Novel transcript #   Known transcript %
Ribo-Zero_FF1      23,944         19,754         4,190          82.5%          51,499               30,233               21,266               58.7%
Ribo-Zero_FF2      24,124         19,898         4,226          82.5%          52,582               31,003               21,579               59.0%
Ribo-Zero_FFPE1    14,163         12,485         1,678          88.2%          20,773               11,138               9,635                53.6%
Ribo-Zero_FFPE2    16,074         13,894         2,180          86.4%          24,503               13,096               11,407               53.4%
DSN_FF1            21,939         18,842         3,097          85.9%          47,741               28,880               18,861               60.5%
DSN_FF2            21,610         18,693         2,917          86.5%          46,671               28,275               18,396               60.6%
DSN_FFPE1          27,834         20,680         7,154          74.3%          38,772               17,673               21,099               45.6%
DSN_FFPE2          27,316         20,128         7,188          73.7%          37,444               16,695               20,749               44.6%
RNA Access_FF1     21,249         18,634         2,615          87.7%          48,118               29,323               18,795               60.9%
RNA Access_FF2     21,387         18,670         2,717          87.3%          48,424               29,407               19,017               60.7%
RNA Access_FFPE1   18,767         16,382         2,385          87.3%          30,405               17,759               12,646               58.4%
RNA Access_FFPE2   17,903         15,988         1,915          89.3%          29,044               17,446               11,598               60.1%

Table 4 Clinical details of commercial FFPE and FF tissue blocks

[Figure 1 plot. y-axis: Mapping Ratio (0% to 100%). x-axis: Ribo-Zero_FF, Ribo-Zero_FFPE, DSN_FF, DSN_FFPE, RNA Access_FF, RNA Access_FFPE. Series: Genome Mapping Ratio, Genome Uniquely Mapping Ratio, Gene Mapping Ratio, Gene Uniquely Mapping Ratio.]

Fig. 1 The genome and gene mapping summaries for sequence reads

[Figure 2 plot. y-axis: percentage (0% to 100%). Series: Exon_rate, intron_rate, intergenic_rate.]

Fig. 2 The percentage of sequence reads that mapped to exon, intron and intergenic regions. The y-axis represents percentage.

[Figure 3 plot. y-axis: rRNA ratio (0% to 70%). x-axis: Ribo-Zero_FF, Ribo-Zero_FFPE, DSN_FF, DSN_FFPE, RNA Access_FF, RNA Access_FFPE.]

Fig. 3 The percentage of sequence reads that mapped to rRNA

Fig. 4 Reads distribution on transcripts for each library method with FF and FFPE samples. The x axis represents the position along transcripts from 5’ UTR to 3’ UTR. The y axis represents the number of reads.

Fig. 5 Percentage of transcript covered at each expression level. The y axis represents the percentage of the transcript length covered and the x axis represents the transcripts at each expression level.

Fig. 6 Heatmap of Pearson correlation between different libraries. The correlations were measured using the Pearson coefficient from the FPKM value of protein-coding genes.

Fig. 7 Heatmap of Pearson correlation between different libraries. The correlations were measured using the Pearson coefficient from the FPKM value of non-coding genes.

Fig. 8 Hierarchical clustering analysis between libraries based on coding RNAs

Fig. 9 Hierarchical clustering analysis of libraries based on non-coding RNAs

[Figure 10 plot. Left y-axis: Number of SNVs (0 to 250,000). Right y-axis: Ti/Tv ratio (0 to 12). x-axis: Ribo-Zero_FFPE, Ribo-Zero_FF, DSN_FFPE, DSN_FF, RNA Access_FFPE, RNA Access_FF. Series: A-G, C-T, A-C, A-T, C-G, G-T, Ti/Tv.]

Fig. 10 SNV detection and typing statistics. "A-G" denotes the number of A>G variants; "C-T" denotes the number of C>T variants; "A-C" denotes the number of A>C variants; "A-T" denotes the number of A>T variants; "C-G" denotes the number of C>G variants; "G-T" denotes the number of G>T variants.
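As a worked illustration of how the Ti/Tv ratio relates to the six variant classes in Fig. 10, the sketch below treats A-G and C-T as transitions and A-C, A-T, C-G and G-T as transversions, which is the standard definition. The function name and the counts are made up for illustration and are not taken from the study's data.

```python
# Variant classes as labelled in the Fig. 10 legend.
TRANSITIONS = ("A-G", "C-T")                  # purine<->purine, pyrimidine<->pyrimidine
TRANSVERSIONS = ("A-C", "A-T", "C-G", "G-T")  # purine<->pyrimidine


def ti_tv_ratio(counts):
    """Compute Ti/Tv from a mapping of variant class -> number of SNVs."""
    ti = sum(counts.get(k, 0) for k in TRANSITIONS)
    tv = sum(counts.get(k, 0) for k in TRANSVERSIONS)
    if tv == 0:
        raise ValueError("no transversions; Ti/Tv is undefined")
    return ti / tv


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    example = {"A-G": 600, "C-T": 400, "A-C": 100,
               "A-T": 100, "C-G": 150, "G-T": 150}
    print(ti_tv_ratio(example))
```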

Fig. 11 Workflow of comparative analysis study
