medizinische genetik 2021; 33(2): 133–145

Jana Marie Schwarz, Richard Lüpken, Dominik Seelow, and Birte Kehr*

Novel sequencing technologies and bioinformatic tools for deciphering the non-coding genome

https://doi.org/10.1515/medgen-2021-2072
Received February 2, 2021; accepted June 24, 2021

Abstract: High-throughput sequencing techniques have significantly increased the molecular diagnosis rate for patients with monogenic disorders. This is primarily due to a substantially increased identification rate of disease mutations in the coding sequence, primarily SNVs and indels. Further progress is hampered by difficulties in the detection of structural variants and the interpretation of variants outside the coding sequence. In this review, we provide an overview about how novel sequencing techniques and state-of-the-art algorithms can be used to discover small and structural variants across the whole genome and introduce bioinformatic tools for the prediction of effects variants may have in the non-coding part of the genome.

Keywords: whole genome sequencing, variant detection, structural variants, non-coding variants, bioinformatics

*Corresponding author: Birte Kehr, BIH–Junior Research Group Genome Informatics, Berlin Institute of Health at Charité-Universitätsmedizin Berlin, Berlin, Germany; and Algorithmic Bioinformatics, Regensburg Center for Interventional Immunology (RCI), Franz-Josef-Strauß-Allee 11, 93053 Regensburg, Germany; and University Regensburg, Regensburg, Germany
Jana Marie Schwarz, Department of Neuropediatrics, Charité–Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany; and NeuroCure Cluster of Excellence, Charité–Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
Richard Lüpken, BIH–Junior Research Group Genome Informatics, Berlin Institute of Health at Charité-Universitätsmedizin Berlin, Berlin, Germany
Dominik Seelow, BIH–Bioinformatics and Translational Genetics, Berlin Institute of Health at Charité-Universitätsmedizin Berlin, Berlin, Germany; and Institute for Medical and Human Genetics, Charité–Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany

Introduction

High-throughput sequencing techniques have radically influenced our ability to obtain genomic information. These technologies provide large-scale datasets from whole exomes and whole genomes, resulting in a well-established and validated process for identifying small variants. The identification of larger, structural variants is improving with recent developments of new sequencing technologies and variant detection tools.

The standard steps of a sequencing and data analysis workflow are illustrated in Figure 1. After sample collection and DNA extraction, a sequencing library that will be loaded onto a sequencing instrument is prepared. Modern sequencers produce vast numbers of short sequence pieces, termed reads. Before variants can be detected in the data, the reads need to be assigned to positions in a reference genome using a read alignment program. Variant detection and genotyping algorithms can then search for differences between the reads and the reference sequence. Finally, the impact of the variants on a phenotype is inferred through annotation and computational prediction of variant effects. Though the overall workflow is similar for whole exome and whole genome sequencing, the implementation of the individual steps can differ substantially depending on the chosen sequencing technology and variant type of interest.

Read alignment is the single most time consuming computational analysis step. Unless special hardware, e.g., field-programmable gate arrays (FPGA), specifically designed for the read alignment task is used [1], this step takes hours for whole genome data. Each of the millions of reads resulting from each genome needs to be compared to the roughly 3 billion base pairs of the human reference genome in order to find the sequence's position of origin. It is essential to allow for small differences between read and reference in order to enable the alignment of reads containing variants or sequencing errors. All widely used alignment programs, e.g., BWA [2] for short read data and Minimap2 [3] for long read data, create an index of the reference genome that is comparable to the index in a textbook. The index allows a quick lookup of subsequences from the reads to identify candidate positions in the reference genome, eliminating the need to go through the entire genome for each read. These candidate positions are verified by a detailed read-to-reference comparison limited only to the respective part of the reference. Aligned reads are commonly stored in the binary alignment map (BAM) file format.
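The index lookup and verification idea described for read alignment can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how BWA or Minimap2 are implemented: it uses a plain dictionary of fixed-length subsequences (k-mers), counts mismatches only (no insertions or deletions), and all function names and parameters (`build_kmer_index`, `candidate_positions`, `align`, `k`, `max_mismatches`) are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every length-k subsequence of the reference to its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def candidate_positions(read, index, k):
    """Look up each k-mer of the read and vote for the implied read start."""
    votes = defaultdict(int)
    for offset in range(len(read) - k + 1):
        for ref_pos in index.get(read[offset:offset + k], ()):
            votes[ref_pos - offset] += 1
    # Most supported start first; ties broken by position.
    return sorted(votes, key=lambda start: (-votes[start], start))


def count_mismatches(read, reference, start):
    """Detailed comparison restricted to the candidate window (None if out of range)."""
    if start < 0 or start + len(read) > len(reference):
        return None
    window = reference[start:start + len(read)]
    return sum(a != b for a, b in zip(read, window))


def align(read, reference, index, k, max_mismatches=2):
    """Return (start, mismatches) of the first candidate that verifies, else None."""
    for start in candidate_positions(read, index, k):
        mismatches = count_mismatches(read, reference, start)
        if mismatches is not None and mismatches <= max_mismatches:
            return start, mismatches
    return None


if __name__ == "__main__":
    reference = "ACGTTGCATGCCGATAGCTAGGCTTAACG"
    index = build_kmer_index(reference, k=4)
    read = "TGCCGCTAGCTA"  # reference[8:20] with one substituted base
    print(align(read, reference, index, k=4))
```

The index is built once per reference and reused for every read, which is the point of the textbook-index analogy: only the few candidate windows are compared base by base, and a small mismatch budget lets reads carrying variants or sequencing errors still align.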

Open Access. © 2021 Schwarz et al., published by De Gruyter. This work is licensed under the Creative Commons Attribution 4.0 International License. 134 | J. M. Schwarz et al., Novel sequencing technologies and bioinformatic tools

Figure 1: Standard steps in a sequencing and data analysis workfow.
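The variant detection and genotyping step of this workflow can be illustrated with a deliberately simplified sketch. It assumes that the bases of all reads aligned over one reference position have already been collected, and it assigns a genotype from the fraction of reads supporting the most common non-reference base. The thresholds (`min_depth`, `het_min`, `hom_min`) and the function name are illustrative assumptions; production variant callers rely on statistical models rather than fixed cut-offs.

```python
from collections import Counter


def genotype_site(ref_base, observed_bases, min_depth=8, het_min=0.2, hom_min=0.8):
    """Return (genotype, alt_base) for one position from the bases seen in aligned reads.

    Genotypes use VCF-style notation: 0/0 homozygous reference,
    0/1 heterozygous, 1/1 homozygous alternative.
    """
    depth = len(observed_bases)
    if depth < min_depth:
        return "no-call", None  # too few reads to decide

    alt_counts = Counter(base for base in observed_bases if base != ref_base)
    if not alt_counts:
        return "0/0", None

    alt_base, alt_count = alt_counts.most_common(1)[0]
    alt_fraction = alt_count / depth
    if alt_fraction >= hom_min:
        return "1/1", alt_base
    if alt_fraction >= het_min:
        return "0/1", alt_base
    return "0/0", None  # low-level signal treated as sequencing error


if __name__ == "__main__":
    print(genotype_site("A", list("AAAAAGGGGG")))
    print(genotype_site("A", list("GGGGGGGGGG")))
    print(genotype_site("A", list("AAAAAAAAAG")))
```

The depth check reflects why sufficient sequencing coverage matters: with only a handful of reads over a site, a heterozygous variant cannot be told apart from a sequencing error.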

Accuracy and completeness of the alignment directly influence the performance of the following variant calling step. Variant calling includes variant detection and genotyping. Computational tools for variant calling implement entirely different algorithms depending on the variant type of interest and the sequencing technology used to generate the data. The calling of small variants, single-nucleotide variants (SNVs), and insertions and deletions of up to 50 bp (indels) is typically performed together. Variants larger than 50 bp, the SVs, require more elaborate algorithms. Most SV calling tools are specialized to recognize a single SV type, e.g., copy number variants (CNVs) or variants in a pre-defined size range.

Variant calling is followed by variant annotation and interpretation. In general, variants are categorized based on either their effect on the DNA sequence or their functional effect on the gene product. The latter categorization is often used for coding variants, since the impact on the protein can often be directly deduced from the change of the DNA sequence. Categorization of variants located outside of protein coding genes is not so trivial, because less is known about these variants' potential effects. Although examples of human disorders caused by the disturbance of non-coding regulatory elements have been identified, these are still a minority in comparison to the entirety of known disease mutations. This puts a challenge on currently available tools for variant interpretation.

This review provides an overview of state-of-the-art established and new whole genome sequencing technologies (Table 1), variant detection algorithms (Table 2), and variant evaluation tools (Table 3). We will first recapitulate whole genome sequencing technologies along with a summary of corresponding variant detection tools and then discuss computational approaches for variant interpretation.

Table 1: Whole genome sequencing technologies.

Company | Platform/protocol/flow cell type | Read length | Accuracy rating | Throughput per flow cell | Cost per coverage | Investment costs | SV detection | Phasing
Illumina | NovaSeq 6000 S4 | + | +++ | 54–68 Gb/h | very low | high | + | o
Illumina | HiSeq X | + | +++ | 22–25 Gb/h | low | high | + | o
Illumina | HiSeq 4000 | + | +++ | 15–18 Gb/h | low | medium | + | o
MGI | DNBSEQ-G400 | + | +++ | 13.1 Gbp/h | very low | medium | + | o
MGI | DNBSEQ-T7 | + | +++ | 250 Gbp/h | very low | high | + | o
Oxford Nanopore Technologies | PromethION | +++ | + | 4.2 Gb/h | medium | medium | +++ | +
Oxford Nanopore Technologies | GridION | +++ | + | 0.7 Gb/h | high | low | +++ | +
Oxford Nanopore Technologies | MinION | +++ | + | 0.7 Gb/h | high | very low | +++ | +
Pacific Biosciences | Sequel II HiFi | ++ | +++ | 1.5 Gb/h | very high | high | +++ | +
Pacific Biosciences | Sequel II Long read | +++ | + | 6 Gb/h | high | high | ++ | ++
10X Genomics | Chromium Linked Reads | +++ | +++ | N/A¹ | N/A² | medium | ++ | +++
MGI | stLFR | +++ | +++ | N/A¹ | low | very low | ++ | +++

¹ Throughput of linked read protocols depends on the short read platform used.
² 10X Genomics discontinued the linked read protocol in 2020.

Detecting variation through whole genome sequencing

Short read sequencing is the most established and widely used technology for studying variation in whole genomes. Additional technologies that can reveal variants in regions of the genome that are inaccessible in short read data have emerged. In this section we discuss variant detection approaches for short read, long read, and linked read sequencing data and provide a brief overview of more specialized protocols for studying whole genomes.

Short read whole genome sequencing

Short read sequencing generates sequence reads between 100 and 300 bp in length at very low error rates (Figure 2B). In preparation for sequencing, the input DNA is sheared into fragments of about 300–600 bp in length. When using Illumina sequencing platforms, the DNA fragments are bridge amplified to form clonal clusters. When using MGI sequencing platforms, the fragments are converted into DNA nanoballs (DNBs). These processes allow the sequencing instruments to read a fixed number of base pairs, e.g., 150 bp, from both ends of the DNA fragments. The resulting sequences are output as read pairs, one read pair per fragment. The process is massively parallelized allowing the generation of up to 6 trillion base pairs (6 Tb) per sequencing run (Table 1). In order to reliably identify heterozygous variants, a typical aim is a minimum of 30-fold sequencing coverage of the genome. This translates to approximately 90 billion base pairs (90 Gb) or 600 million reads assuming a read length of 150 bp.

Figure 2: Whole exome and whole genome sequence data types.

Short read data are well suited for detecting and genotyping SNVs and indels across the whole genome. Approaches that compare the read sequences to the linear reference genome, e.g., GATK HaplotypeCaller [4], are widely used. In contrast to small variants, SVs are often too large to be spanned by a single short read. By looking for certain data signatures, SVs can be indirectly detected in short reads. CNVs result in changes in the number of read pairs covering the deleted or duplicated part of the genome, e.g., detected by CNVnator [5]. Breakpoints of any type of SV are visible in the alignment as changes in the alignment distance of the two reads in a pair, e.g., detected by Delly [6] and PopDel [7], and as split reads, where the beginning of a read aligns to a different region of the genome than the end of the same read. A recent benchmark study [8] assessed SV detection approaches for short read data and demonstrated differences in their strengths.

The high accuracy of short read data allows for very reliable genotyping of SNVs and indels. In addition, short reads can reveal thousands of SVs per genome. Short read data are, however, inherently limited in repetitive regions of the genome. As SVs often occur in repetitive sequences, e.g., as variable numbers of tandem repeats or flanked by repeat sequences, this limitation clearly affects our ability to use short read data to detect SVs. We can overcome this problem by using long read sequencing data. This approach detects several times more SVs [9, 10] and identifies small variants in genomic regions inaccessible with short read data.

Long read whole genome sequencing

Modern long read technologies produce reads of approximately 10–100 kb length from preferably unfragmented DNA (Figure 2C).
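To relate these read lengths to the coverage figures quoted for short reads (30-fold coverage, roughly 90 Gb, about 600 million reads of 150 bp), the same arithmetic can be applied to long reads. The following is a minimal sketch, assuming a genome size of 3 billion base pairs as stated in the Introduction and idealized reads of uniform length; the function names are illustrative.

```python
import math

HUMAN_GENOME_BP = 3_000_000_000  # roughly 3 billion base pairs


def bases_needed(coverage, genome_size_bp=HUMAN_GENOME_BP):
    """Total number of sequenced bases required for a given fold coverage."""
    return coverage * genome_size_bp


def reads_needed(coverage, read_length_bp, genome_size_bp=HUMAN_GENOME_BP):
    """Number of reads of a fixed length required for a given fold coverage."""
    return math.ceil(bases_needed(coverage, genome_size_bp) / read_length_bp)


if __name__ == "__main__":
    print(bases_needed(30))              # total bases for 30-fold coverage
    for length in (150, 10_000, 100_000):
        print(length, reads_needed(30, length))
```

Under these assumptions, 30-fold coverage requires 600 million reads of 150 bp but only 9 million reads of 10 kb or 900,000 reads of 100 kb. Real long read lengths vary widely within a run, so these numbers are order-of-magnitude estimates only.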
The Pacific Biosciences (PacBio) technology relies on the synthesis of a complementary DNA strand by a single polymerase enzyme emitting light when incorporating a nucleotide to the growing DNA strand [11]. The sequencing instruments from Oxford Nanopore Technologies pass a single-stranded DNA molecule through a protein pore and measure an ionic current corresponding to the nucleotides residing in the pore at each moment [12]. Though these processes are massively parallelized, they do not yet reach the throughput and low cost of short read sequencing (Table 1). Despite enormous improvements in accuracy, long read data still suffer from much higher error rates than short read data. To circumvent the lower accuracy, Pacific Biosciences has launched a high-fidelity (HiFi) sequencing protocol, which sequences the same DNA fragments several times to allow computing a consensus sequence, thereby averaging out many sequencing errors. A limitation of PacBio sequencing is the high amount of DNA required as input.

Though long reads are less suited to detect SNVs and indels because of their low per base accuracy, the increasing availability and dropping error rate of long read data has led to the development of the first dedicated long read SNV detection tools, e.g., Longshot. In contrast, the long read length simplifies the identification of SVs, e.g., using SVIM [13], as a single long read often spans an entire SV. Most notably in repetitive sequences, long read methods outperform short read methods when searching for SVs. Because long reads frequently span multiple variants, detected variants can be phased into the parental haplotypes, e.g., using WhatsHap [14]. Despite these advantages, the high cost and high error rate are still barriers to a widespread use of long read sequencing.

Table 2: Selection of variant calling tools. Tools referenced in the main text are marked with an asterisk. DEL, deletions. DUP, duplications. INV, inversions. TRL, translocations. INS, insertions. CNV, copy number variants.

Tool name | Variant types | Year of publication | Latest update | Source code | Reference

Small variant callers for short read data
GATK HaplotypeCaller* | SNVs, indels | 2017 | 2021 | https://github.com/broadinstitute/gatk | https://doi.org/10.1101/201178
FreeBayes | SNVs, indels | 2012 | 2021 | https://github.com/ekg/freebayes | https://arxiv.org/abs/1207.3907v2
GraphTyper | SNVs, indels | 2017 | 2021 | https://github.com/DecodeGenetics/graphtyper | PMID: 28945251
Samtools | SNVs, indels | 2011 | 2021 | https://github.com/samtools/samtools | PMID: 21903627
DeepVariant | SNVs, indels | 2018 | 2020 | https://github.com/google/deepvariant | PMID: 30247488

Small variant callers for long read data
Longshot* | SNVs | 2019 | 2021 | https://github.com/pjedge/longshot/ | PMID: 31604920
Medaka | SNVs, indels | 2017 | 2020 | https://github.com/nanoporetech/medaka | unpublished

SV callers for short read data
Delly* | DEL, DUP, INV, TRL | 2012 | 2021 | https://github.com/dellytools/delly | PMID: 22962449
Manta | DEL, DUP, INV, TRL | 2016 | 2019 | https://github.com/Illumina/manta | PMID: 26647377
Lumpy/Smoove | DEL, DUP, INV, TRL | 2014 | 2020 | https://github.com/arq5x/lumpy-sv, https://github.com/brentp/smoove | PMID: 24970577
GRIDSS | Breakpoints | 2017 | 2021 | https://github.com/PapenfussLab/gridss | PMID: 29097403
BreakDancer | DEL, INV, TRL | 2009 | 2015 | https://github.com/genome/breakdancer | PMID: 19668202
GenomeSTRiP | CNV | 2015 | | http://software.broadinstitute.org/software/genomestrip/ | PMID: 25621458
Pindel | DEL, INV, TRL | 2009 | 2017 | https://github.com/genome/pindel | PMID: 19561018
CNVnator* | CNV | 2011 | 2020 | https://github.com/abyzovlab/CNVnator | PMID: 21324876
PopDel* | DEL | 2021 | 2021 | https://github.com/kehrlab/PopDel | https://doi.org/10.1038/s41467-020-20850-5
GraphTyper2 | Only genotyping | 2019 | 2021 | https://github.com/DecodeGenetics/graphtyper | PMID: 31776332
MindTheGap | INS | 2014 | 2020 | https://github.com/GATB/MindTheGap | PMID: 25123898
PopIns | INS | 2016 | 2020 | http://github.com/bkehr/popins | PMID: 25926346
Pamir | INS | 2017 | 2020 | https://github.com/vpc-ccg/pamir/ | PMID: 28881988

SV callers for linked read data
LongRanger* | DEL, INV, DUP | 2019 | 2020 | https://github.com/10XGenomics/longranger | PMID: 30894395
NAIBR* | Breakpoints | 2018 | 2019 | https://github.com/raphael-group/NAIBR | PMID: 29112732
GROC-SVs | Breakpoints | 2017 | 2020 | https://github.com/grocsvs/grocsvs | PMID: 28714986
VALOR2 | DEL, DUP, INV, TRL | 2020 | 2020 | https://github.com/BilkentCompGen/valor | PMID: 32192518
LinkedSV | DEL, DUP, INV, TRL | 2019 | 2021 | https://github.com/WGLab/LinkedSV | PMID: 31811119

SV callers for long read data
Sniffles | DEL, DUP, INV, TRL, INS | 2018 | 2020 | https://github.com/fritzsedlazeck/Sniffles | PMID: 29713083
NanoSV | DEL, DUP, INV, TRL, INS | 2017 | 2019 | https://github.com/mroosmalen/nanosv | PMID: 29109544
SVIM* | DEL, DUP, INV, INS | 2019 | 2021 | https://github.com/eldariont/svim | PMID: 30668829
npInv | INV | 2018 | 2020 | https://github.com/haojingshao/npInv | PMID: 30001702
PBSV | DEL, DUP, INV, TRL, INS | | 2020 | https://github.com/PacificBiosciences/pbsv | unpublished
cuteSV | DEL, DUP, INV, TRL, INS | 2020 | 2021 | https://github.com/tjiangHIT/cuteSV | PMID: 32746918

Linked read whole genome sequencing

Linked reads combine some of the strengths of short and long read sequencing by combining information from longer DNA molecules with high accuracy and lower costs (Figure 2D). Technically, linked read sequencing is a specialized protocol for short read sequencing. Long DNA molecules are isolated from a sample. During sequencing library preparation, the long molecules are sheared into shorter fragments in a process that ligates barcodes to all fragments. All short fragments resulting from the same original long DNA molecule are labeled with the same barcode. The Chromium platform by 10X Genomics achieved this by separating long molecules within oil droplets from each other whereas the stLFR protocol marketed by MGI uses microbeads and transposons for fragmentation and barcoding. Finally, the barcoded fragments are sequenced with standard short read sequencing technologies. The barcode is sequenced as part of each read pair. Reads with the same barcode typically originate either from the same long molecule or from a small set of long molecules.

Linked reads are, at their core, short reads. This means that linked read sequencing achieves the same order of throughput and high accuracy as short read sequencing and the data are well suited for detecting small variants. In addition, linked reads contain long range sequence information included in the barcode sequences. Barcodes can resolve ambiguity in read alignment or can be used to phase variants to parental haplotypes, e.g., using LongRanger [15]. The long range information is useful to resolve SVs, e.g., using NAIBR [16], and may even be used for local assemblies. Some repeat elements inaccessible to standard short read data, such as mobile elements, can be resolved with linked reads. Still, many other repeats remain difficult to handle and may only be resolvable by actual long reads or further technological development. An additional disadvantage is that bioinformatics tools to analyze linked read data lag behind those for short and long read data. This is why linked reads have not been widely adopted despite their great potential.

Specialized whole genome protocols

A variety of additional technologies for analyzing whole genomes are available. Optical genome mapping using the Bionano Saphyr system, also called whole genome imaging, can examine ultra-long DNA molecules of hundreds of kilobases in length, though it does not read the full sequence. The long DNA molecules are fluorescently labeled at specific 6-bp sequence motifs. In thousands of nanochannels, the labeled molecules are linearized and imaged, and the distances between motif occurrences are detected. This generates a footprint of the sequence, allowing reliable detection of large SVs. Though the average length of assayed molecules surpasses that of long and linked read sequencing technologies, the resolution is low compared to that of sequencing, meaning that the technology does not reveal small variants [17].

Single-cell DNA sequencing protocols provide data from whole genomes of individual cells. This is widely used in research on somatic variation in cancer genomes and when studying cancer genome heterogeneity. Specialized single-cell protocols can reveal variation that is otherwise hidden. For example, the Strand-Seq protocol can detect inversions, including those that are flanked by very long repeats. Prior to sequencing, one of the two DNA strands is digested in each cell, resulting in sequences from only one strand. The inverted sequence becomes almost trivially visible in the alignment of the reads to the reference genome. Major drawbacks of Strand-Seq are that cell lines are needed as input material and that the library preparation is comparably sophisticated.

Other methods, such as Hi-C, ChIP-Seq, and ATAC-Seq, provide functional and sequence information across the whole genome, yet play a less important role for variant calling. These technologies are reviewed by Guo et al. (this edition).

Variant interpretation

Significant progress has been made in predicting the deleteriousness of non-synonymous variants in the coding genome, e.g., with PolyPhen [18] or SIFT [19], over the course of the last two decades. MutationTaster [20] was the first tool that allowed the analysis of non-coding variants, though restricted to those within protein coding genes. Current tools reach prediction accuracies of about 90 % in artificial benchmarking settings. Results improve substantially in regular usage when known polymorphisms are excluded from the analysis. The chance of identifying disease mutations is further improved by including phenotype information through software such as MutationDistiller [21] because this limits the analysis to variants located in promising candidate genes. Computer-based analysis of facial dysmorphologies, e.g., with Face2Gene,¹ may also help to highlight genomic regions of interest. Other tools, e.g., VarFish [22], allow for family trio analysis, excluding additional variants and detecting de novo mutations. In spite of recent progress, the diagnostic rates for whole exome sequencing (WES) usually remain below 50 % [23]. It is clear that the low hanging fruit of "easy-to-solve" monogenic diseases has already been picked. The remaining cases may be caused by SVs, which are notoriously hard to detect by WES, more subtle (deep) intronic variants affecting splicing [24] (Krude et al. this edition), or variants altering gene regulation. The latter are frequently located outside of the coding sequence, often hundreds of bases away from the gene they act on [25, 26]. Focusing the search for disease mutations on the extragenic space has enormous potential to reveal the molecular basis of currently undiagnosed genetic diseases. Yet less than 1 % of the disease mutations listed in the ClinVar database [27] are located outside of protein coding genes (Krude et al. this edition).

¹ https://www.face2gene.com/ (FDNA Inc., Boston, MA, USA).

While whole genome sequencing can reveal such variants, it is challenging to evaluate their effect. The two main obstacles are the low number of known extragenic disease mutations and the lack of knowledge about their pathomechanisms. Both are needed to develop automatic prediction tools, which can assess the potential effects of non-coding variants. Regardless of these limitations, several tools have been developed that take the current knowledge into account, e.g., the Genomiser [28], CADD [29], or RegulationSpotter [30] (Table 3). Different approaches can also be combined as reflected by meta-tools, which merge the results from other predictors, such as Ensembl VEP [31] or SNPnexus [32].

The evaluation of all extragenic variants often begins with annotation of their location. For each variant, information is gathered whether it is located in a known regulatory region and whether this region is conserved during evolution. Exploiting information about phylogenetic conservation is a common and well-established method used by many tools (Table 3). The rationale is that strongly conserved positions are of high functional importance and disruption through variation in the DNA sequence is considered potentially pathogenic.

The next step after annotation is evaluation of functional impact. Some tools use the chromatin state where the variant is located to determine if a variant could interfere with gene expression. Chromatin state data come from public datasets on histone marks, transcription factor binding sites (TFBSs), DNase I hypersensitive sites (DHS), and long range genomic interactions (topologically associating domain [TAD] boundaries) in different cell types or tissues. Many data are generated and curated by large consortia, such as BluePrint [33], ENCODE [34], FANTOM5 [35], or Roadmap Epigenomics [36]. The idea that a variant residing in a putative promoter region, as indicated, e.g., by DNase I hypersensitivity, characteristic histone modifications, and TFBSs, impacts gene expression is appealing. Though the amount of available data is increasing, our incomplete understanding of exactly how small variants, such as SNVs and indels, affect gene regulation hampers the identification of disease mutations at present (see also Guo et al. this edition).

It has been shown, for example, that enhancers often act redundantly with other enhancers serving as a backup and even the deletion of a complete enhancer does not automatically have an impact on the expression of the regulated gene [37]. This makes it extremely difficult to predict if and how a more subtle deletion or exchange of a single nucleotide within an enhancer might affect gene expression. The same applies to TFBSs, where it is often unclear whether or not a specific SNV has a profound effect on transcription factor binding. In other words, there is no simple correlation between binding score and binding affinity. To complicate matters further, a reduced binding affinity does not necessarily result in disease, as transcription levels may still be sufficient to ensure normal protein amounts.

Table 3: Selection of tools for variant interpretation with characteristics and limitations. X = applies; (X) = partly applies; – = does not apply. DEL, deletion. DUP, duplication. VCF, variant call format.

Name | Main purpose | Availability | Variant types | Prediction/annotation based on | Limitations | Requires extended knowledge in bioinformatics | URL | Reference
NCBoost | predict pathogenicity of variants | download: precomputed scores for selected variants | SNV | conservation features, gene features, sequence context features | precomputed scores only available for selected variants | X | https://github.com/RausellLab/NCBoost | PMID: 30744685
CADD | predict deleteriousness of variants | download: precomputed scores; web-interface: offers precomputed scores for selected variants or VCF upload | SNV, selected indels | chromatin/epigenetic features, conservation features, gene features, sequence context features | web-interface service restricted to VCF file with maximum 100,000 variants and 2 MB | (X) | https://cadd.gs.washington.edu/ | PMID: 30371827
DeepSEA | predict the chromatin effects of sequence alterations | download: complete software for local installation; web-interface: VCF | SNV, indel | chromatin/epigenetic features, conservation features | web-interface service recommended only for VCF files with < 50,000 variants | (X) | http://deepsea.princeton.edu/ | PMID: 26301843
Genomiser | identify regulatory variants in Mendelian diseases | download: precomputed scores or complete software for local installation | SNV, indel | chromatin/epigenetic features, conservation features, population/frequency features, sequence context features | only local installation | X | https://charite.github.io/software-remm-score.html | PMID: 27569544
RegulationSpotter | annotation and functional prioritization of variants | web-interface: analysis of single variants or VCF files upload | SNV, indel | chromatin/epigenetic features, conservation features, gene features, population/frequency features | no local installation | – | https://www.regulationspotter.org/ | PMID: 31106382
RegulomeDB | regulatory annotation of variants or genomic regions | web-interface: analysis of known variants with dbSNP IDs | SNV, indel | chromatin/epigenetic features, integrates DeepSEA | analysis restricted to known dbSNP IDs | – | https://regulomedb.org/regulome-search | PMID: 31228310
SNPnexus | assist in functional prioritization of variants | web-interface: analysis of single variants or VCF files | SNV, indel | conservation features, epigenetic features, gene features, phenotype/disease features, population/frequency features, integrates CADD, DeepSEA, ReMM, and others | web-interface service restricted to VCF file with maximum 100,000 variants | – | https://www.snp-nexus.org/v4/ | PMID: 32496546
VEP | determine effects of variants on genes, transcripts, protein sequences, and regulatory regions | download: complete software for local installation; web-interface: analysis of single variants or VCF files | SNV, indel, CNV, SV | chromatin/epigenetic features, conservation features, gene features, population features, sequence features, integrates CADD and others | upload limited to 50 Mb | – | https://www.ensembl.org/info/docs/tools/vep/index.html | PMID: 27268795
SVScore | assess deleteriousness of SVs | download: complete software for local installation | SV | gene features, integrates CADD | only local installation | X | https://github.com/lganel/SVScore | PMID: 28031184
TAD fusion score | quantify potential of large deletions to disrupt 3D genome structure | download: complete software for local installation | SV (CNV, esp. DEL) | chromatin/epigenetic features, TAD boundaries | only local installation | X | https://github.com/HormozdiariLab/TAD-fusion-score | PMID: 30898144
TADA | functional annotation and pathogenicity prediction of CNVs | download: complete software for local installation | SV (CNV, i.e., DEL and DUP) | chromatin/epigenetic features, conservation features, gene features, TAD boundaries | only local installation | X | https://github.com/jakob-he/TADA | https://doi.org/10.1101/2020.06.30.180711v1
SVFX | predict pathogenicity of SVs | download: complete software for local installation | SV | chromatin/epigenetic features, conservation features, TAD boundaries | focuses on somatic SVs in cancer and germline SVs in common diseases | X | https://github.com/gersteinlab/SVFX | PMID: 33168059
StrVCTVRE | predict pathogenicity of SVs | download: complete software for local installation | SV | conservation features, expression features, gene features | only SVs of DUP and DEL type, which are overlapping an exon are analyzed | X | https://github.com/andrewSharo/StrVCTVRE | https://doi.org/10.1101/2020.05.15.097048v3

The lack of knowledge about molecular pathomechanisms of small, putative regulatory variants is mirrored in a lack of training data for developing prediction algorithms. Different approaches have been developed to circumvent this shortcoming. Instead of using the small number of published "real" disease mutations in the non-coding genome, the authors of CADD simulated a set of "proxy-neutral" versus "proxy-deleterious" variants that were categorized based on whether or not they had been a target of purifying selection. While the "proxy-neutral" variants were real variants which persisted for millions of years without being selected against, the "proxy-deleterious" set consisted of artificial variants without selective pressure. This also reduced the bias towards conservation in the group of "proxy-deleterious" variants, which would have been introduced by using known disease mutations. Other authors, e.g., those of the Genomiser [28] or StrVCTVRE [38], manually curated small, high-quality datasets of known regulatory or structural variants for the training of their tools. Due to the low number of newly discovered extragenic disease mutations as prospective controls, it is currently hard to say which approach will lead to more robust results.

Though the size of SVs suggests that their effect on gene expression is more straightforward to evaluate, this is currently not the case. Predicting the effects of SVs differs from the evaluation of small variants in one central aspect: SVs may completely delete one or multiple genes or crucial parts of genes, or disturb TADs (Krude et al. this edition, Guo et al. this edition). Apart from these dramatic consequences, the following information, amongst others, is relevant for and has been implemented in available tools (Table 3): (i) locations of transcription start sites and alternative splice sites, (ii) differential gene expression, (iii) epigenetic information about chromatin state as conferred by DNA methylation and histone modifications, and (iv) the presence of TFBSs. Because only a low number of dis-

the Human Genome Project in 2003, parts of the genomic sequence still remain unknown [41]. The latest genome version GRCh38 lacks approximately 5 % of the sequence, mainly heterochromatic, highly repetitive regions, which are hard to sequence and map. The international "Telomere-to-Telomere" (T2T) consortium aims to close this gap using new sequencing technologies, such as ultra-long read nanopore whole genome sequencing. As proof of principle, the complete sequence from telomere to telomere of a human X chromosome was published last year [42]. A preliminary version of a complete female genome (46, XX), essentially lacking only the information for encoded rRNA, is already freely available to the scientific community.² Another initiative, the "Genome In A Bottle" (GIAB) consortium, strives to provide validated benchmark datasets and best-practice protocols for detection of small and large variants [43, 44]. These are still limited to certain regions of the genome.

Though these efforts shed light on the sequence itself, our understanding of how variation in the non-coding sequence impacts function is very incomplete. We are only just beginning to understand the role of non-coding DNA and the interaction of DNA with its environment, e.g., through histone modifications. High-throughput techniques, such as ChIP-seq and its variations, and genome architecture studies (Guo et al. this edition) will lead to novel insights into the number and the functional relevance of the non-coding parts. Other approaches using saturation or random mutagenesis of genomic elements, e.g., massively parallel reporter assays (MPRAs) [45], or mutagenesis within short PCR fragments, including TFBSs as performed in our research unit, will reveal the effects single-nucleotide variants or short indels have on the function of these elements.

Still, since so many different players – DNA variants, histone modifications, distant enhancers, and cell-specific proteomes, to name just a few – are involved in gene regulation, this field will remain challenging for the next decades. Non-coding is non-trivial.
ease causing SVs are known, a robust estimate of the pre- dictive performance of the available tools is difcult. Author contributions: All authors have jointly drafted the In summary, current software can fairly reliably iden- manuscript, accept responsibility for the entire content of tify genomic regions likely to play a role in gene regulation. this manuscript, and approved its submission. Improving the prediction of the functional impact of the Funding: Deutsche Forschungsgemeinschaft, FOR 2841 DNA variants is subject of future research. “Beyond the exome”. Competing interests: Authors state no confict of interest. Informed consent: Does not apply, review, no study sub- Current limitations and outlook jects involved.

Two decades after the release of the human reference 2 https://github.com/nanopore-wgs-consortium/CHM13#telomere- genome in 2001 [39, 40] and the ofcial completion of to-telomere-consortium 144 | J. M. Schwarz et al., Novel sequencing technologies and bioinformatic tools

Ethical approval: Does not apply, review, no study subjects involved.

References

[1] Miller NA, Farrow EG, Gibson M, Willig LK, et al. A 26-hour system of highly sensitive whole genome sequencing for emergency management of genetic diseases. Genome Med. 2015;7:100.
[2] Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv:1303.3997 (2013).
[3] Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018;34:3094–100.
[4] Garrison E, Marth G. Haplotype-based variant detection from short-read sequencing. arXiv:1207.3907 (2012).
[5] Abyzov A, Urban AE, Snyder M, Gerstein M. CNVnator: an approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing. Genome Res. 2011;21:974–84.
[6] Rausch T, Zichner T, Schlattl A, Stütz AM, et al. DELLY: structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics. 2012. i333–9.
[7] Niehus S, Jónsson H, Schönberger J, Björnsson E, et al. PopDel identifies medium-size deletions jointly in tens of thousands of genomes. Nat Commun. 2020;12:730. https://doi.org/10.1038/s41467-020-20850-5. bioRxiv 740225.
[8] Cameron DL, Di Stefano L, Papenfuss AT. Comprehensive evaluation and characterisation of short read general-purpose structural variant calling software. Nat Commun. 2019;10:3240.
[9] Audano PA, Sulovari A, Graves-Lindsay TA, Cantsilieris S, et al. Characterizing the Major Structural Variant Alleles of the Human Genome. Cell. 2019. 663–675.e19.
[10] Beyter D, Ingimundardottir H, Oddsson A, Eggertsson HP, et al. Long read sequencing of 3,622 Icelanders provides insight into the role of structural variants in human diseases and other traits. Nat Genet. 2021;53:779–86. https://doi.org/10.1038/s41588-021-00865-4. bioRxiv 848366.
[11] Rhoads A, Au KF. PacBio Sequencing and Its Applications. Genomics Proteomics Bioinform. 2015;13:278–89.
[12] Bowden R, Davies RW, Heger A, Pagnamenta AT, et al. Sequencing of human genomes with nanopore technology. Nat Commun. 2019;10:1869.
[13] Heller D, Vingron M. SVIM: structural variant identification using mapped long reads. Bioinformatics. 2019;35:2907–15. https://doi.org/10.1093/bioinformatics/btz041.
[14] Patterson M, Marschall T, Pisanti N, van Iersel L, et al. WhatsHap: Weighted Haplotype Assembly for Future-Generation Sequencing Reads. J Comput Biol. 2015;22:498–509.
[15] Marks P, Garcia S, Barrio AM, Belhocine K, et al. Resolving the full spectrum of human genome variation using Linked-Reads. Genome Res. 2019;29:635–45.
[16] Elyanow R, Wu H-T, Raphael BJ. Identifying structural variants using linked-read sequencing data. Bioinformatics. 2018;34:353–60.
[17] Lam ET, Hastie A, Lin C, Ehrlich D, et al. Genome mapping on nanochannel arrays for structural variation analysis and sequence assembly. Nat Biotechnol. 2012;30:771–6.
[18] Adzhubei I, Jordan DM, Sunyaev SR. Predicting Functional Effect of Human Missense Mutations Using PolyPhen-2. Curr Protoc Hum Genet. 2013;07:Unit7.20.
[19] Vaser R, Adusumalli S, Leng SN, Sikic M, Ng PC. SIFT missense predictions for genomes. Nat Protoc. 2016;11:1–9.
[20] Schwarz JM, Rödelsperger C, Schuelke M, Seelow D. MutationTaster evaluates disease-causing potential of sequence alterations. Nat Methods. 2010;7:575–6.
[21] Hombach D, Schuelke M, Knierim E, Ehmke N, et al. MutationDistiller: user-driven identification of pathogenic DNA variants. Nucleic Acids Res. 2019. W114–20.
[22] Holtgrewe M, Stolpe O, Nieminen M, Mundlos S, et al. VarFish: comprehensive DNA variant analysis for diagnostics and research. Nucleic Acids Res. 2020. W162–9.
[23] Wright CF, FitzPatrick DR, Firth HV. Paediatric genomics: diagnosing rare disease in children. Nat Rev Genet. 2018;19:253–68.
[24] Boschann F, Fischer-Zirnsak B, Wienker TF, Holtgrewe M, et al. An intronic splice site alteration in combination with a large deletion affecting VPS13B (COH1) causes Cohen syndrome. Eur J Med Genet. 2020;63:103973.
[25] Lettice LA, Heaney SJH, Purdie LA, Li L, et al. A long-range Shh enhancer regulates expression in the developing limb and fin and is associated with preaxial polydactyly. Hum Mol Genet. 2003;12:1725–35.
[26] Smith AJP, Ahmed F, Nair D, Whittall R, et al. A functional mutation in the LDLR promoter (-139C>G) in a patient with familial hypercholesterolemia. Eur J Hum Genet. 2007;15:1186–9.
[27] Landrum MJ, Chitipiralla S, Brown GR, Chen C, et al. ClinVar: improvements to accessing data. Nucleic Acids Res. 2020. D835–44.
[28] Smedley D, Schubach M, Jacobsen JOB, Köhler S, et al. A Whole-Genome Analysis Framework for Effective Identification of Pathogenic Regulatory Variants in Mendelian Disease. Am J Hum Genet. 2016;99:595–606.
[29] Rentzsch P, Witten D, Cooper GM, Shendure J, Kircher M. CADD: predicting the deleteriousness of variants throughout the human genome. Nucleic Acids Res. 2019. D886–94.
[30] Schwarz JM, Hombach D, Köhler S, Cooper DN, et al. RegulationSpotter: annotation and interpretation of extratranscriptic DNA variants. Nucleic Acids Res. 2019. W106–13.
[31] McLaren W, Gil L, Hunt SE, Riat HS, et al. The Ensembl Variant Effect Predictor. Genome Biol. 2016;17:122.
[32] Oscanoa J, Sivapalan L, Gadaleta E, Dayem Ullah AZ, et al. SNPnexus: a web server for functional annotation of human genome sequence variation (2020 update). Nucleic Acids Res. 2020. W185–92.
[33] Martens JHA, Stunnenberg HG. BLUEPRINT: mapping human blood cell epigenomes. Haematologica. 2013;98:1487–9.
[34] Davis CA, Hitz BC, Sloan CA, Chan ET, et al. The Encyclopedia of DNA elements (ENCODE): data portal update. Nucleic Acids Res. 2018. D794–801.
[35] Andersson R, Gebhard C, Miguel-Escalada I, Hoof I, et al. An atlas of active enhancers across human cell types and tissues. Nature. 2014;507:455–61.
[36] Bernstein BE, Stamatoyannopoulos JA, Costello JF, Ren B, et al. The NIH Roadmap Epigenomics Mapping Consortium. Nat Biotechnol. 2010;28:1045–8.
[37] Osterwalder M, Barozzi I, Tissières V, Fukuda-Yuzawa Y, et al. Enhancer redundancy provides phenotypic robustness in mammalian development. Nature. 2018;554:239–43.
[38] Sharo AG, Hu Z, Brenner SE. StrVCTVRE: A supervised learning method to predict the pathogenicity of human structural variants. bioRxiv 2020.05.15.097048 (2020).
[39] Lander ES, Linton LM, Birren B, Nusbaum C, et al. Initial sequencing and analysis of the human genome. Nature. 2001;409:860–921.
[40] Venter JC, Adams MD, Myers EW, Li PW, et al. The sequence of the human genome. Science. 2001;291:1304–51.
[41] Schneider VA, Graves-Lindsay T, Howe K, Bouk N, et al. Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly. Genome Res. 2017;27:849–64.
[42] Miga KH, Koren S, Rhie A, Vollger MR, et al. Telomere-to-telomere assembly of a complete human X chromosome. Nature. 2020;585:79–84.
[43] Zook JM, Hansen NF, Olson ND, Chapman L, et al. A robust benchmark for detection of germline large deletions and insertions. Nat Biotechnol. 2020;38:1347–55.
[44] Zook JM, McDaniel J, Olson ND, Wagner J, et al. An open resource for accurately benchmarking small variant and reference calls. Nat Biotechnol. 2019;37:561–6.
[45] Kinney JB, Murugan A, Callan CG, Cox EC. Using deep sequencing to characterize the biophysical mechanism of a transcriptional regulatory sequence. Proc Natl Acad Sci USA. 2010;107:9158–63.

Jana Marie Schwarz
Department of Neuropediatrics, Charité–Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
NeuroCure Cluster of Excellence, Charité–Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
[email protected]

Richard Lüpken
BIH–Junior Research Group Genome Informatics, Berlin Institute of Health at Charité-Universitätsmedizin Berlin, Berlin, Germany
[email protected]

Dominik Seelow
BIH–Bioinformatics and Translational Genetics, Berlin Institute of Health at Charité-Universitätsmedizin Berlin, Berlin, Germany
Institute for Medical and Human Genetics, Charité–Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
[email protected]

Prof. Dr. Birte Kehr
BIH–Junior Research Group Genome Informatics, Berlin Institute of Health at Charité-Universitätsmedizin Berlin, Berlin, Germany
Algorithmic Bioinformatics, Regensburg Center for Interventional Immunology (RCI), Franz-Josef-Strauß-Allee 11, 93053 Regensburg, Germany
University Regensburg, Regensburg, Germany
[email protected]