Nanopore Sequencing the Advantages of Long Reads for Genome Assembly
Total Page:16
File Type:pdf, Size:1020Kb
WHITE PAPER Nanopore sequencing The advantages of long reads for genome assembly SEPTEMBER 2017 nanoporetech.com nanoporetech.com/publications OXFORD NANOPORE TECHNOLOGIES | THE ADVANTAGESAPPLICATION ANDOF LONG ADVANTAGES READS FOR OF GENOMELONG-READ ASSEMBLY NANOPORE SEQUENCING TO STRUCTURAL VARIATION ANALYSIS Contents 1 De novo assembly 2 Advantages of long-read sequencing for genome assembly 3 Genome assembly tools 4 Case studies 5 Summary 6 About Oxford Nanopore Technologies 7 References OXFORD NANOPORE TECHNOLOGIES | THE ADVANTAGES OF LONG READS FOR GENOME ASSEMBLY Introduction Over the last decade, improvements in Whole genome assembly – next generation DNA sequencing solving the puzzle technology have transformed the field Traditional technologies have required of genomics, making it an essential tool users to sequence short lengths of DNA, in modern genetic and clinical research which must then be reassembled back laboratories. The facility to sequence into their original order as accurately as whole genomes or specific genomic possible. Such short-read sequencing regions of interest is delivering new insights technologies, however, present a number into a variety of applications such as of challenges, particularly the difficulty of human health and disease, metagenomics, accurately analysing repetitive regions and antimicrobial resistance, evolutionary large structural variations.1 biology and crop breeding. This means that many reference genomes that were created using short-read For applications such as the analysis of sequencing are highly fragmented, which larger structural variation, or de novo in turn introduces bias into any alignments assembly, whole genome sequencing made against that reference2. This review shows how these challenges are now (WGS) is typically the technique of choice. being met, by the emergence of long-read nanopore sequencing3. One of the key steps of WGS is the accurate assembly of the vast amount of data generated into a contiguous stretch of DNA sequence. This review provides a background to the DNA assembly process and the associated advantages of long or ultra-long DNA reads, as provided by nanopore sequencing technology. 1 OXFORD NANOPORE TECHNOLOGIES | THE ADVANTAGES OF LONG READS FOR GENOME ASSEMBLY 1 De novo assembly De novo assembly aims to reconstruct Several approaches have been the original genome sequence from the used for this purpose, including Greedy set of reads, for example when there is no extension, de Bruijn graphs and Overlap reference genome available or a scientist Layout Consensus (OLC) (Figure 1). The prefers to avoid potential bias that could graph is then simplified and a consensus arise from using an imperfect reference. is extracted. A key step in this process is Generally, the first step is to find overlaps the assembly of contiguous, uninterrupted between the reads and build a graph to stretches of overlapping DNA, so-called describe their relationship. contigs. As detailed in the next section, long-read sequencing technology offers many advantages for both de novo and Long-read sequencing technology alignment-based genome assembly. offers simplified and less ambiguous genome assembly. Figure 1 Long reads A schematic showing how long- read sequencing can deliver simplified, less ambiguous genome assembly. Long reads Short reads (solid arrows) have greater overlap with other reads than is provided by short reads (dashed arrows), allowing more accurate assemblies, especially in repeat regions (R). Image adapted from Schatz (2014)4. 2 OXFORD NANOPORE TECHNOLOGIES | THE ADVANTAGES OF LONG READS FOR GENOME ASSEMBLY Advantages of long-read sequencing 2 for genome assembly Until recently, cost-effective generation Facility to span repetitive of large amounts of sequence data could genomic regions only be performed using short-read (<300 Most genomes contain significant amounts bp) sequencing technologies; however, of repetitive DNA (e.g. transposons, nanopore-based sequencing can process satellites, gene duplications). As the very long DNA fragments (currently up to short reads produced by traditional next 950 kb)5 to create long reads, which deliver generation sequencing (NGS) technology a number of significant benefits: may not span each given repetitive region, the resulting genome assemblies can be highly fragmented6. Long-read sequencing Long-read nanopore sequencing technologies have a significant advantage offers easier assembly and here as the reads generated are more likely to span the full repetitive region, the ability to span repetitive allowing the creation of accurate genome genomic regions. assemblies with minimal gaps (see Figure Ease of assembly 3 and Case Study 1, page 7). The longer a sequencing read, the more overlap it will have with other reads. As such, it is much easier to assemble the DNA fragments back into the correct order. This can be visualised much like a jigsaw; the larger the pieces, the easier the puzzle (Figure 2). Figure 2 Like a jigsaw puzzle with large pieces, long-read DNA is much easier to assemble than short- read DNA. The Escherichia coli genome comprises 4.6 million bases, which would equate to 92,000 fragments of 50 bp or just 9 fragments of 500 kb in length. ~ 50 base read ~ 500,000 base read ~ 92,000 “pieces” ~ 9 “pieces” Figure 3 A schematic highlighting the Short reads advantages of long reads in de novo assembly of repetitive regions. Long read lengths Consensus are more likely to incorporate the whole repetitive region Long reads (shown in red) allowing more accurate assembly with fewer gaps. Image adapted from Sam Consensus Demharter7. 3 OXFORD NANOPORE TECHNOLOGIES | THE ADVANTAGES OF LONG READS FOR GENOME ASSEMBLY Identification of large Most existing genome assemblies have structural variation been created using short-read sequencing technology, which is limited in its ability The term structural variation (SV) covers to capture large structural variation. Even a range of genetic alterations, including the well- studied human genome has large copy number variation (CNV), duplications, gaps in its assembly6.As CNV alone is translocations and inversions (Table 1). estimated to comprise 4.8–9.5% of the human genome10, it is clear that accurate Traditionally, structural variants were analysis of such regions is of significant defined as spanning over 1,000 base importance. pairs; however, with the advent of higher resolution genome-analysis techniques, Unlike short-read technology, longer some researchers have now revised nanopore-sequencing reads can cover this definition to spanning over 50 base the whole structure variant in one read. pairs8. Such structural variants have been This results in more accurate genome associated with a number of diseases assemblies facilitating better understanding (including autism, schizophrenia and of genome architecture in genomic cancer) and, for this reason, have become diseases (see Figure 4 and Case Studies 2 increasingly important areas for research9. and 3, pages 8, 9 and 10)11,12,13. A Short reads Consensus Long reads Consensus Structural variants Definition 1 3 Short reads 1 2 4 Copy number variation (includes A segment of DNA which has 1 3 4 insertions and deletions) greater (insertion) or fewer Consensus 1 3 2 4 (deletion) copies than expected Duplications A segment of DNA that occurs 1 2 3 4 Long reads 1 2 3 4 in two or more copies per 1 2 3 4 haploid genome Consensus 1 2 3 4 Translocation A change in position of a region of DNA which involves no Figure 4 change to the total DNA content A schematic highlighting the advantages of long reads in de novo Inversions A segment of DNA that is assembly of copy number variations (CNVs). Long read lengths are reversed end to end more likely to contain the whole CNV, providing less potential for error in the consensus sequence. Image adapted from Sam Demharter7. Table 1 Definition of common large structural variants. 4 OXFORD NANOPORE TECHNOLOGIES | THE ADVANTAGES OF LONG READS FOR GENOME ASSEMBLY Assembly quality The much higher N50 values generated using long-read nanopore sequencing, The quality of genome assemblies is underlines the superior quality and often assessed by the number of contigs completeness of genome assembly required to represent the genome, the when compared with short-read length of these contigs relative to the size sequencing technology (Table 2 and of the genome and the proportion of reads Case Study 1). that can be assigned to the contigs. One of the most commonly used assembly metrics is N50. The N50 value is calculated by sorting all contigs by length, calculating the cumulative assembly length (i.e. adding the length of all contigs together) and identifying the length of the contig which is situated in the middle of the cumulative length (Figure 5). The higher the N50 value, the more contiguous the assembly. Figure 5 The N50 metric is one measure of the quality of a genome assembly. Other metrics such as N75 and N25 are also sometimes used to further assess assembly quality. Image adapted from Jansen (2016)14. Table 2 Comparison between genome Isolate Technology Number of contigs N50 length assemblies of Rhizoctonia solani using long- and short- AG1-IA Short read 6,452 20 kb read sequencing platforms. The higher N50 value coupled with the fewer number of contigs, AG3 Short read 6,040 26 kb obtained using long-read length nanopore technology is AG8 Short read 7,6 0 6 7 kb indicative of superior genome assembly quality. Table adapted 15 AG1-IB Short read 3,793 35 kb from Datema et al (2016) . AGX Nanopore long read 606 199 kb 5 OXFORD NANOPORE TECHNOLOGIES | THE ADVANTAGES OF LONG READS FOR GENOME ASSEMBLY 3 Genome assembly tools Many software tools are now available A recent study by Cherukuri and Janga for genome assembly utilising long-read (2016) comparing a number of assembly sequencing data. Tools differ in what they algorithms found the OLC approach to be offer and the extent of their use and should most favourable for long-read nanopore be assessed according to the specific sequencing – delivering higher genome project requirements (Table 3).