Novel Sequencing Technologies and Bioinformatic Tools for Deciphering the Non-Coding Genome

medizinische genetik 2021; 33(2): 133–145 Jana Marie Schwarz, Richard Lüpken, Dominik Seelow, and Birte Kehr* Novel sequencing technologies and bioinformatic tools for deciphering the non-coding genome https://doi.org/10.1515/medgen-2021-2072 technologies provide large-scale datasets from whole ex- Received February 2, 2021; accepted June 24, 2021 omes and whole genomes, resulting in a well-established Abstract: High-throughput sequencing techniques have and validated process for identifying small variants. The signifcantly increased the molecular diagnosis rate for pa- identifcation of larger, structural variants is improving tients with monogenic disorders. This is primarily due to a with recent developments of new sequencing technologies substantially increased identifcation rate of disease muta- and variant detection tools. The standard steps of a sequencing and data analy- tions in the coding sequence, primarily SNVs and indels. sis workfow are illustrated in Figure 1. After sample col- Further progress is hampered by difculties in the detec- lection and DNA extraction, a sequencing library that will tion of structural variants and the interpretation of vari- be loaded onto a sequencing instrument is prepared. Mod- ants outside the coding sequence. In this review, we pro- ern sequencers produce vast numbers of short sequence vide an overview about how novel sequencing techniques pieces, termed reads. Before variants can be detected in and state-of-the-art algorithms can be used to discover the data, the reads need to be assigned to positions in a ref- small and structural variants across the whole genome erence genome using a read alignment program. Variant and introduce bioinformatic tools for the prediction of detection and genotyping algorithms can then search for efects variants may have in the non-coding part of the diferences between the reads and the reference sequence. genome. Finally, the impact of the variants on a phenotype is in- Keywords: whole genome sequencing, variant detection, ferred through annotation and computational prediction structural variants, non-coding variants, bioinformatics of variant efects. Though the overall workfow is similar for whole exome and whole genome sequencing, the im- plementation of the individual steps can difer substan- Introduction tially depending on the chosen sequencing technology and variant type of interest. High-throughput sequencing techniques have radically in- Read alignment is the single most time consuming fuenced our ability to obtain genomic information. These computational analysis step. Unless special hardware, e. g., feld-programmable gate arrays (FPGA), specifcally *Corresponding author: Birte Kehr, BIH–Junior Research Group designed for the read alignment task is used [1], this step Genome Informatics, Berlin Institute of Health at Charité-Universitätsmedizin Berlin, Berlin, Germany; and takes hours for whole genome data. Each of the millions of Algorithmic Bioinformatics, Regensburg Center for Interventional reads resulting from each genome needs to be compared Immunology (RCI), Franz-Josef-Strauß-Allee 11, 93053 Regensburg, to the roughly 3 billion base pairs of the human reference Germany; and University Regensburg, Regensburg, Germany, e-mail: genome in order to fnd the sequence’s position of origin. [email protected] It is essential to allow for small diferences between read Jana Marie Schwarz, Department of Neuropediatrics, Charité–Universitätsmedizin Berlin, Corporate Member of Freie and reference in order to enable the alignment of reads Universität Berlin and Humboldt-Universität zu Berlin, Berlin, containing variants or sequencing errors. All widely used Germany; and NeuroCure Cluster of Excellence, alignment programs, e. g., BWA [2] for short read data and Charité–Universitätsmedizin Berlin, Corporate Member of Freie Minimap2 [3] for long read data, create an index of the ref- Universität Berlin and Humboldt-Universität zu Berlin, Berlin, erence genome that is comparable to the index in a text- Germany, e-mail: [email protected] Richard Lüpken, BIH–Junior Research Group Genome Informatics, book. The index allows a quick lookup of subsequences Berlin Institute of Health at Charité-Universitätsmedizin Berlin, from the reads to identify candidate positions in the refer- Berlin, Germany, e-mail: [email protected] ence genome, eliminating the need to go through the en- Dominik Seelow, BIH–Bioinformatics and Translational Genetics, tire reference genome for each read. These candidate posi- Berlin Institute of Health at Charité-Universitätsmedizin Berlin, tions are verifed by a detailed read-to-reference compar- Berlin, Germany; and Institute for Medical and Human Genetics, Charité–Universitätsmedizin Berlin, Corporate Member of Freie ison limited only to the respective part of the reference. Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Aligned reads are commonly stored in the binary align- Germany, e-mail: [email protected] ment map (BAM) fle format. Open Access. © 2021 Schwarz et al., published by De Gruyter. This work is licensed under the Creative Commons Attribution 4.0 International License. 134 | J. M. Schwarz et al., Novel sequencing technologies and bioinformatic tools Figure 1: Standard steps in a sequencing and data analysis workfow. Accuracy and completeness of the alignment directly ment entirely diferent algorithms depending on the vari- infuence the performance of the following variant call- ant type of interest and the sequencing technology used ing step. Variant calling includes variant detection and to generate the data. The calling of small variants, single- genotyping. Computational tools for variant calling imple- nucleotide variants (SNVs), and insertions and deletions J. M. Schwarz et al., Novel sequencing technologies and bioinformatic tools | 135 Table 1: Whole genome sequencing technologies. Company Platform/protocol/ Read Accuracy Throughput Cost per Investment SV Phasing fow cell type length rating per fow cell coverage costs detection Illumina NovaSeq 6000 S4 + +++ 54–68 Gb/h very low high + o Illumina HiSeq X + +++ 22–25 Gb/h low high + o Illumina HiSeq 4000 + +++ 15–18 Gb/h low medium + o MGI DNBSEQ-G400 + +++ 13.1 Gbp/h very low medium + o MGI DNBSEQ-T7 + +++ 250 Gbp/h very low high + o Oxford Nanopore PromethIon +++ + 4.2 Gb/h medium medium +++ + Technologies Oxford Nanopore GridIon +++ + 0.7 Gb/h high low +++ + Technologies Oxford Nanopore MinIon +++ + 0.7 Gb/h high very low +++ + Technologies Pacifc Sequel II HiFi ++ +++ 1.5 Gb/h very high high +++ + Biosciences Pacifc Sequel II Long read +++ + 6 Gb/h high high ++ ++ Biosciences 10X Genomics Chromium Linked +++ +++ N/A1 N/A2 medium ++ +++ Reads MGI stLFR +++ +++ N/A1 low very low ++ +++ 1Throughput of linked read protocols depends on the short read platform used. 210X Genomics discontinued the linked read protocol in 2020. of up to 50 bp (indels) is typically performed together. Vari- Detecting variation through whole ants larger than 50 bp, the SVs, require more elaborate al- genome sequencing gorithms. Most SV calling tools are specialized to recog- nize a single SV type, e. g., copy number variants (CNVs) Short read sequencing is the most established and widely or variants in a pre-defned size range. used technology for studying variation in whole genomes. Variant calling is followed by variant annotation and Additional technologies that can reveal variants in regions interpretation. In general, variants are categorized based of the genome that are inaccessible in short read data have on either their efect on the DNA sequence or their func- emerged. In this section we discuss variant detection ap- tional efect on the gene product. The latter categorization proaches for short read, long read, and linked read se- is often used for coding variants, since the impact on the quencing data and provide a brief overview of more spe- protein can often be directly deduced from the change of cialized protocols for studying whole genomes. the DNA sequence. Categorization of variants located outside of protein coding genes is not so trivial, because less is known about these variants’ potential efects. Although Short read whole genome sequencing examples of human disorders caused by the disturbance Short read sequencing generates sequence reads between of non-coding regulatory elements have been identifed, 100 and 300 bp in length at very low error rates (Figure 2B). these are still a minority in comparison to the entirety of In preparation for sequencing, the input DNA is sheared known disease mutations. This puts a challenge on cur- into fragments of about 300–600 bp in length. When us- rently available tools for variant interpretation. ing Illumina sequencing platforms, the DNA fragments This review provides an overview of state-of-the-art es- are bridge amplifed to form clonal clusters. When using tablished and new whole genome sequencing technolo- MGI sequencing platforms, the fragments are converted gies (Table 1), variant detection algorithms (Table 2), and into DNA nanoballs (DNBs). These processes allow the se- variant evaluation tools (Table 3). We will frst recapitulate quencing instruments to read a fxed number of base pairs, whole genome sequencing technologies along with a sum- e. g., 150 bp, from both ends of the DNA fragments. The re- mary of corresponding variant detection tools and then sulting sequences are output as read pairs, one read pair discuss computational approaches for variant interpreta-

Novel Sequencing Technologies and Bioinformatic Tools for Deciphering the Non-Coding Genome

Reference Genome Sequence of the Model Plant Setaria

Y Chromosome Dynamics in Drosophila

The Bacteria Genome Pipeline (BAGEP): an Automated, Scalable Workflow for Bacteria Genomes with Snakemake

Lawrence Berkeley National Laboratory Recent Work

Telomere-To-Telomere Assembly of a Complete Human X Chromosome W

Human Genome Reference Program (HGRP)

De Novo Genome Assembly Versus Mapping to a Reference Genome

Practical Guideline for Whole Genome Sequencing Disclosure

Y Chromosome

Reference Genomes and Common File Formats Overview

Informatics and Clinical Genome Sequencing: Opening the Black Box

Using DECIPHER V2.0 to Analyze Big Biological Sequence Data in R by Erik S