Towards Population-Scale Long-Read Sequencing
REVIEWS

Wouter De Coster 1,2,5, Matthias H. Weissensteiner 3,5 and Fritz J. Sedlazeck 4 ✉

Abstract | Long-read sequencing technologies have now reached a level of accuracy and yield that allows their application to variant detection at a scale of tens to thousands of samples. Concomitant with the development of new computational tools, the first population-scale studies involving long-read sequencing have emerged over the past 2 years and, given the continuous advancement of the field, many more are likely to follow. In this Review, we survey recent developments in population-scale long-read sequencing, highlight potential challenges of a scaled-up approach and provide guidance regarding experimental design. We provide an overview of current long-read sequencing platforms, variant calling methodologies and approaches for de novo assemblies and reference-based mapping approaches. Furthermore, we summarize strategies for variant validation, genotyping and predicting functional impact and emphasize challenges remaining in achieving long-read sequencing at a population scale.

1 Applied and Translational Neurogenomics Group, VIB Center for Molecular Neurology, VIB, Antwerp, Belgium.
2 Applied and Translational Neurogenomics Group, Department of Biomedical Sciences, University of Antwerp, Antwerp, Belgium.
3 Department of Biology, Penn State University, Pennsylvania, PA, USA.
4 Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA.
5 These authors contributed equally: Wouter De Coster, Matthias H. Weissensteiner.
✉ e-mail: fritz.sedlazeck@bcm.edu
https://doi.org/10.1038/s41576-021-00367-3
572 | SEPTEMBER 2021 | VOLUME 22 | www.nature.com/nrg

Genome-wide association studies (GWAS). Studies involving a statistical approach in genetics to identify variants that correlate with a certain phenotype (for example, a disease).

Structural variants (SVs). Genomic alterations that are 50 bp or larger, including deletions, duplications, insertions, inversions and translocations.

Single-nucleotide variants (SNVs). Genomic alterations of 1–50 bp that are present at any frequency in the population. These variants include substitutions, insertions and deletions.

Short-read sequencing. Parallel sequencing of clonally amplified clusters of DNA molecules using optical or electrical methods, ranging from 25 bp to 400 bp per fragment.

Long-read sequencing. Continuous stretch of nucleotides derived from a sequencing machine, which usually exceed 1,000 bp and currently range up to 4 Mbp.

Phasing. In this Review, only per sample (physical) phasing is considered, which refers to the detection of co-occurrences of two or more variants on the same DNA molecule by their overlap on the same read. In contrast to statistical inference phasing (using linkage information), this approach can include phasing of private or de novo variants.

PacBio high fidelity (PacBio HiFi). A type of PacBio sequencing that yields reads that are accurate (average 99.9%) and long (15–25 kbp). These reads are produced as a consensus from multiple serial observations of the same DNA molecule in a row. Previous versions of this method are referred to as ‘circular consensus sequencing’ (CCS).

ONT PromethION. A sequencing platform that yields longer (up to 4 Mbp) but less accurate (average 3–8%) reads than the PacBio HiFi platform.

Mapping. The alignment of

Sequencing the DNA or mRNA of multiple individuals of one or more species (that is, population-scale sequencing) aims to identify genetic variation at a population level to address questions in the fields of evolutionary, agricultural and medical research. Previous population studies, including genome-wide association studies (GWAS), have not been able to exhaustively characterize the genetic factors underlying human traits and diseases1. There has been much speculation about the source of this ‘missing heritability’, often pointing to both structural variants (SVs) and rare variants2,3. SVs account for a greater total number of nucleotide changes in human genomes than the far more numerous single-nucleotide variants (SNVs)4. To date, such population studies have relied mostly on high-throughput short-read sequencing technologies, which produce reads ranging from 25 bp to 400 bp in length5. However, short reads have important limitations in characterizing repetitive regions6,7. DNA repeats act as the genomic substrate to facilitate SV formation8 while also hampering SV discovery owing to read alignment inaccuracies. Even in a non-repetitive genome, variations such as insertions (especially for alleles longer than the read length7) or other modifications (for example, methylation) would be missed by an approach relying solely on short reads.

Long-read sequencing has emerged as superior to short-read sequencing and other methods (for example, arrays) for the identification of structural variation, as shown by the Genome in a Bottle (GIAB) and Human Genome Structural Variation (HGSV) consortia, which combined multiple technologies to comprehensively characterize structural variation in human genomes9,10. These studies highlighted that a substantial proportion of hidden variation can be discovered with long-read sequencing. Indeed, recent long-read sequencing studies of Icelandic and Chinese populations have already identified previously undetected variants associated with height, cholesterol level and anaemia11,12. Analysis of 26 maize genomes13 revealed that more SVs are involved in causing diseases than in conferring agronomically important traits. In addition, long-read sequencing is beneficial for improving the continuity, accuracy and range of variant phasing14–16, assessing complex small variants17 and has been applied to find disease-associated alleles18–20. For de novo assemblies, multiple methods have been published over recent years to promote the use of long reads21–25.

Ongoing advances in sequencing technology and bioinformatics have paved the way to achieving long-read sequencing on a population scale26. The two main competitors driving innovation in the field are Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT). PacBio high fidelity (HiFi) reads are generated by their Sequel II system; HiFi reads are both long (15–20 kbp) and highly accurate27. The ONT PromethION platform can produce much longer reads (up to 4 Mbp28), has a higher throughput at lower cost, but produces less accurate reads than the Sequel II system. Recent comparisons show an equivalent performance for SV calling with the two platforms29,30 (in-depth technical review and further comparison of long-read sequencing platforms available elsewhere31).

[Figure 1. Axes: sample size (log10-scale; ticks at 10, 100 and 1,000) against date (Jan 2019 to Jan 2021). Legend: genome size (<500 Mbp, 500–2,000 Mbp, >2,000 Mbp) and method (assembly, assembly and read mapping, read mapping).]

Fig. 1 | Overview of population-scale studies using long-read sequencing. Studies published in 2019–2021 in which five or more samples were sequenced are included. Genome size of study organisms is viewed in three different categories (<500 Mbp, 500–2,000 Mbp and >2,000 Mbp), and the methodological approach taken to investigate genetic variation (comparison of assemblies, read mapping against a reference or both) is illustrated by the different colours. For further details, see TABLE 1.

Within the past 2 years, multiple studies have applied long-read sequencing to answer various questions in multiple different organisms32–35 (FIG. 1; TABLE 1). The largest human-focused long-read sequencing study to date investigated the genomic diversity of 3,622 Icelandic genomes11, with many other studies to follow, such as the NIH All of Us research programme and the NIH Center for Alzheimer’s and Related Dementias (CARD) in the USA and similar efforts in China, Abu Dhabi and Qatar. Long-read sequencing of a global diversity cohort is also being carried out as part of the Human Pangenome project36. Aside from human studies, long-read sequencing has been applied on a population scale to discover structural variation associated with phenotypes in crops32,33, fruitflies34 and songbirds35, and increasingly has a role in metagenomic studies (BOX 1). Here, we restrict our discussion to eukaryotic organisms, as long-read sequencing studies of bacteria and other prokaryotes require specific laboratory and bioinformatics approaches, and the challenges are inherently different.

In this Review, we discuss the approach of long-read, … as linked reads or optical mapping (for example, Bionano Genomics). However, both these technologies may be useful and applicable in a population setting37,38. When sequencing of the highest number of samples is required, targeted sequencing may be a cost-efficient alternative to whole-genome approaches (BOX 2). Similarly to most population-scale sequencing projects, we focus on germline variants, as somatic variants require higher genome coverage and access to the relevant tissues.

Project strategies

The total number of sequenced individuals (or rather chromosomes) should in general be as high as possible. However, the different underlying questions that motivate population-scale sequencing studies have vastly different sample size requirements. Although estimating the degree of genetic differentiation or ancestral population size is already possible with a sample size as low as ten chromosomes (five individuals of a diploid organism)39, the identification of rare variants (and