EXPANDING THE UTILITY OF WHOLE GENOME SEQUENCING IN THE

DIAGNOSIS OF RARE GENETIC DISORDERS

by

Phillip Andrew Richmond

B.A., The University of Colorado—Boulder, 2012

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF

THE REQUIREMENTS FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES

(Bioinformatics)

THE UNIVERSITY OF BRITISH COLUMBIA

(Vancouver)

October 2020

© Phillip Andrew Richmond, 2020

The following individuals certify that they have read, and recommend to the Faculty of Graduate and Postdoctoral Studies for acceptance, the dissertation entitled:

Expanding the utility of whole genome sequencing in the diagnosis of rare genetic disorders

submitted by Phillip Andrew Richmond in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Bioinformatics

Examining Committee:

Wyeth W. Wasserman, Professor, Medical Genetics, UBC Supervisor

Dr. Inanc Birol, Professor, Medical Genetics, UBC Supervisory Committee Member

Dr. Anna Lehman, Assistant Professor, Medical Genetics, UBC Supervisory Committee Member

Dr. William Gibson, Professor, Medical Genetics, UBC University Examiner

Dr. Paul Pavlidis, Professor, Psychiatry, UBC University Examiner

Additional Supervisory Committee Members:

Sara Mostafavi, Assistant Professor, Medical Genetics, UBC Supervisory Committee Member

Abstract

The emergence of whole genome sequencing (WGS) has revolutionized the diagnosis of rare genetic disorders, advancing the capacity to identify the “causal” gene responsible for disease phenotypes. In a single assay, many classes of genomic variants can be detected, from small single nucleotide changes to large insertions, deletions and duplications. While WGS has enabled a significant increase in the diagnostic rate compared to previous assays, at least 50% of cases remain unsolved. The lack of a diagnosis results from limitations in both variant calling and variant interpretation. As the field of genomic medicine continues to advance, the emergence of novel bioinformatic approaches to variant calling and interpretation heralds promise for the future of undiagnosed cases. In the applied setting, innovation is driven by anecdotes of complex diagnoses, which in turn lead to the development of novel tools and approaches. This is a key theme within this thesis work, where in-depth analysis of a single undiagnosed case leads to an appreciation for a challenging class of variants–short tandem repeats–which in turn leads to the development of novel software for detecting these variants in WGS data. Following the anecdote and novel tool development came an appreciation for the role of simulation, both in enabling the development and in the uptake of bioinformatic innovation for diagnostic analysis pipelines. This appreciation led to the development of a rare disease scenario simulator, which can simulate complex variants in multiple inheritance patterns to emulate challenging cases. Lastly, appreciating the limitations of the linear reference genome, I develop a framework for detecting the presence of user-specified sequences within unmapped read sets. This flexible framework can reproduce microarray-like coverage profiles, and genotype SNPs to identify ancestry and sex, which can inform the choice of personalized reference in emergent analysis pipelines.

Together, the novel short tandem repeat discovery, bioinformatic innovation, and increased capacity to simulate rare disease cases, expand the utility of whole genome sequencing in the diagnosis of rare genetic diseases.

Lay Summary

Mendelian rare genetic disorders occur when a gene is broken, often by mutations, in the genome of an individual. New DNA sequencing technology has brought on the genomic revolution, enabling clinicians and researchers to identify with precision the mutations which cause disease.

As this technology is adopted into practice, our understanding of the genome expands. In turn, this understanding brings novel insights into disease mechanisms, including the identification of many mutations which were previously unknown. The bioinformatics community (scientists who create software to analyze biological data) has been developing, adapting, and improving software and procedures which better utilize DNA sequencing data. Within the scope of this thesis, I focus on the development of software for the simulation and detection of rare disease mutations, I apply these approaches to undiagnosed cases and define a new rare genetic disease, and lastly I explore an alternative analysis approach for utilizing DNA sequencing data.

Preface

The work presented within this thesis was performed at the BC Children’s Hospital Research Institute, in the Centre for Molecular Medicine and Therapeutics, as part of a PhD program in Bioinformatics within the Faculty of Science at the University of British Columbia. Much of the work in this thesis was done in collaboration with interdisciplinary research teams of scientists and clinicians. Each of the research chapters contains contents from a co-first authored publication either published, accepted, or currently under review. The introduction and discussion sections of the thesis are not published elsewhere and were written solely by me. Details of my involvement in the research program for each chapter are provided below.

The work presented in Chapter 2 represents a co-first author work, which is under the first round of revision at Human Mutation titled: GeneBreaker - Variant simulation to improve the diagnosis of rare Mendelian genetic diseases. I devised the work presented, wrote the manuscript, developed downstream benchmarking and test cases, and processed the data to show tool efficacy. The core codebase for variant simulation was implemented by the other co-first author Tamar Av-Shalom, an undergraduate research assistant in the lab. Tamar implemented the variant creation methods, MySQL tables, and online web interface under my supervision, and with my assistance throughout the design and debugging process. I was responsible for the creation of all the figures and tables within this thesis section. A version of this work can be found online at https://www.biorxiv.org/content/10.1101/2020.05.29.124495v1, and a submitted version of this work is currently under revision.

The work presented in Chapter 3 has been published in the New England Journal of Medicine as a co-first author work: “Glutaminase Deficiency Caused by Short Tandem Repeat Expansion in GLS” (van Kuilenburg, Tarailo-Graovac et al. 2019). My role in this collaborative work was to process WES and WGS data for a set of undiagnosed patients with similar and specific biochemical phenotypes. In doing so, I discovered missense variants in the GLS gene, as well as manually identified a repeat expansion in the 5’UTR of the GLS gene. The identification of the repeat expansion was a result of collaborative interactions with Dr. Britt Drogemoller, who was attempting to validate a single nucleotide variant in the five prime untranslated region of GLS using Sanger sequencing. Complications with amplification through PCR led to a thorough manual investigation of the surrounding region in the genome sequence data. Using discordant read signals and extracting full read sequences, I was able to identify a possible short tandem repeat expansion. I then facilitated genotyping of this repeat with the new (at the time) computational method, ExpansionHunter, on a population of PCR-free WGS samples. The samples come from multiple sequencing consortia, and use of ExpansionHunter was guided by the developers of the tool, Dr. Michael Eberle and Dr. Egor Dolzhenko. Processing of population samples was performed by analysts within the respective consortia. During the mechanistic investigation of the impact of the repeat expansion I played a central role in interpreting molecular assays and defining the role of the repeat in a non-methylation mediated form of gene repression. The wet lab work to confirm the repeat expansion and two missense variants was performed primarily by Dr. Britt Drogemoller, and members of Dr. Karen Usdin’s group. Validation of the impact of missense variants was performed by members of Andre van Kuilenburg’s group. The wet lab work for identifying the mechanism of action for the repeat expansion was performed by groups led by Dr. Andre van Kuilenburg, Dr. Mahmoud Pouladi, and Dr. Karen Usdin. As I did not perform any wet lab work presented in this chapter, the details of the molecular assays performed can be found in the original publication, and work presented in this chapter focuses on the bioinformatic analysis of WES/WGS data. I wrote the genome analysis portions of the manuscript, and co-wrote the manuscript with Dr. Antoine van Kuilenburg, Dr. Maja Tarailo-Graovac, Dr. Karen Usdin, and Dr. Clara van Karnebeek. I was also responsible for creation of all the figures presented in this thesis section, although their final formatting comes from the NEJM editorial staff. An online version of this work can be found at https://www.nejm.org/doi/10.1056/NEJMoa1806627. Text and figures are copied from “Glutaminase Deficiency Caused by Short Tandem Repeat Expansion in GLS.” André B.P. van Kuilenburg, Maja Tarailo-Graovac, Phillip A. Richmond, Britt I. Drögemöller, et al. 380:1433-1441. Copyright © (2020) Massachusetts Medical Society. Reprinted with permission. Patients enrolled in this study were consented under the Treatable Intellectual Disability Endeavour (TIDE) protocol, with REB approval number H12-00067, and sub-study REB approval number H11-01142.

The work presented in Chapter 4 has been published in Genome Biology as a co-first author work: “ExpansionHunter Denovo: a computational method for locating known and novel repeat expansions in short-read sequencing data” (Dolzhenko, Bennett et al. 2020). In this collaborative project, I contributed to the development of the outlier test for repeat identification, collation of known pathogenic repeats and their coordinates, analysis of rare repeat expansions in the healthy population, simulation of known pathogenic events, and comparison to alternative methods for RE detection using the set of simulations. The analysis of repeat cohort data was performed by Dr. Mark Bennett, and the source code and algorithm were written by Dr. Egor Dolzhenko and Dr. Michael Eberle of Illumina. The manuscript was co-written with Dr. Egor Dolzhenko in Dr. Michael Eberle’s lab at Illumina, and Dr. Mark Bennett in Dr. Melanie Bahlo’s lab at the Walter and Eliza Hall Research Institute. In addition to writing, I was responsible for creation of all the tables, and select figures including 4-5, 4-8, 4-9, and 4-10 appearing in this thesis section. In this chapter I provide a brief description of the underlying tool ExpansionHunter Denovo, followed by an emphasis on the work I completed within this collaboration. For more details about the tool, validation of repeats with alternative technologies, and detailed supplemental tables please see https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02017-z.

The work in Chapter 5 is in the first round of revision (co-first author): “Demonstrating the utility of flexible sequence queries against indexed short reads with FlexTyper” (Richmond, Kaye et al. 2020), and can be found online at https://www.biorxiv.org/content/10.1101/2020.03.02.973750v1. In this collaboration, I contributed to the overall direction of the work and devised the project, developed test data, tested the software via simulated and real data, assisted in the algorithm development and improvement, and wrote the manuscript alongside co-author Alice Kaye (another PhD student in the Wasserman lab). I created all of the figures and tables appearing in this thesis chapter. Alice Kaye and Godfrain Jacques Kounkou (senior software developer in the Wasserman lab) devised and implemented the code used to produce the results that appear in the thesis.

Table of Contents

Abstract ...... iii

Lay Summary ...... v

Preface ...... vi

Table of Contents ...... x

List of Tables ...... xv

List of Figures ...... xvi

List of Abbreviations ...... xvii

Acknowledgements ...... xix

Dedication ...... xx

Chapter 1: Introduction ...... 1

1.1 Objective ...... 2

1.2 Background ...... 2

1.2.1 Rare genetic disease diagnosis ...... 2

1.2.2 Variant calling and interpretation from WGS data ...... 4

1.2.3 Emergent variants underlying missing heritability ...... 7

1.2.4 The genome is dark and full of terrors ...... 9

1.2.5 Advancements in bioinformatic methodology ...... 11

1.2.6 Relying upon the reference genome ...... 12

1.2.7 Picking the right tool for the job ...... 14

1.3 Chapter summaries and research contribution ...... 15

Chapter 2: Simulation of rare disease scenarios ...... 18

2.1 Introduction ...... 18

2.2 Methods ...... 22

2.2.1 Simulation architecture ...... 22

2.2.2 Host webserver and underlying data ...... 23

2.2.3 Variant simulation walk through ...... 24

2.2.3.1 Initial Configuration ...... 24

2.2.3.2 Variant Creation ...... 24

2.2.3.3 Inheritance modeling ...... 26

2.2.4 Bury the variant within trio setting for testing prioritization approaches ...... 27

2.2.5 Incorporate variants into reads to test detection capacity ...... 27

2.3 Results ...... 29

2.3.1 Simulator-created rare disease scenarios ...... 29

2.3.2 Prioritization testing ...... 31

2.3.3 Testing variant calling ...... 33

2.4 Discussion ...... 37

Chapter 3: Whole genome sequencing enables discovery of noncoding pathogenic short tandem repeat expansion ...... 40

3.1 Introduction ...... 40

3.2 Case Report ...... 41

3.3 Methods ...... 41

3.3.1 Details of patient exome sequencing ...... 41

3.3.2 Pipeline for the analysis of rare disease WES/WGS data ...... 42

3.3.3 Genome sequencing ...... 47

3.3.4 Genotyping GLS locus with ExpansionHunter ...... 47

3.3.5 STR Genotyping in control populations ...... 48

3.4 Results ...... 49

3.4.1 Efficacy of variant prioritization pipeline ...... 49

3.4.2 Analysis of patient WES data identifies single coding variant in GLS ...... 49

3.4.3 Biochemical characterization of GLS ...... 50

3.4.4 Identification of repeat expansion in 5’UTR of GLS in singleton WGS ...... 53

3.4.5 Genotyping repeat using ExpansionHunter and confirmation in population ...... 56

3.4.6 Defining the mechanism for repeat expansion pathogenicity ...... 58

3.5 Discussion ...... 61

3.6 Conclusion ...... 63

Chapter 4: Genome-wide discovery of short tandem repeat expansions ...... 64

4.1 Introduction ...... 64

4.2 Methods ...... 67

4.2.1 ExpansionHunter Denovo overview ...... 67

4.2.2 Defining relevant repeat expansions ...... 70

4.2.3 Synthetic generation of pathogenic repeats, long motifs, eSTRs ...... 70

4.2.4 Initial processing of WGS data ...... 73

4.2.5 Detection of repeat expansions ...... 73

4.3 Results ...... 75

4.3.1 Baseline simulations show cutoffs of EHdn capacity ...... 75

4.3.2 Detection of pathogenic expansions in the repeat expansion cohort ...... 77

4.3.3 Limitations of genome-wide methods for repeat expansion calling ...... 78

4.3.4 Recovering pathogenic expansions in the repeat expansion cohort ...... 81

4.3.5 Landscape of large repeat expansions in diverse population ...... 84

4.3.6 Simulation of repeat expansions for benchmarking genome-wide methods ...... 85

4.3.7 Guiding the usage of EHdn in RE identification ...... 87

4.4 Discussion ...... 91

4.5 Conclusion ...... 93

Chapter 5: Demonstrating the utility of flexible sequence queries against indexed short reads with FlexTyper ...... 94

5.1 Introduction ...... 94

5.2 Methods ...... 97

5.2.1 Overview of FlexTyper ...... 97

5.2.2 Search method for FlexTyper ...... 99

5.2.3 Query generation ...... 101

5.2.4 FM-index creation ...... 101

5.2.5 Post-processing of FlexTyper counts for downstream analysis ...... 102

5.2.6 Data sources for query generation ...... 103

5.2.7 Generating kmers for testing search performance ...... 103

5.2.8 Samples for genotype analysis testing ...... 104

5.2.9 Simulation of pathogen-containing samples ...... 104

5.2.10 BAM coverage calculation for comparison ...... 104

5.2.11 Using Peddy for ancestry, sex, and relatedness typing ...... 105

5.2.12 Running centrifuge on simulated patient data ...... 105

5.3 Results ...... 106

5.3.1 Performance metrics for indexing and querying ...... 106

5.3.2 Genomic coverage and genotype detection within human WGS data ...... 109

5.3.3 Testing for the presence of pathogen sequences in RNA-seq ...... 113

5.4 Discussion ...... 116

5.5 Conclusion ...... 120

Chapter 6: Conclusion ...... 121

6.1 Contributions to the field ...... 121

6.2 Limitations and room for improvement ...... 124

6.3 The future of rare disease bioinformatics ...... 128

6.4 Closing ...... 132

Bibliography ...... 133

List of Tables

Table 2-1 - Data sources for variants within the database available for simulation...... 23

Table 2-2 - A set of rare disease scenarios for small variants...... 30

Table 2-3 - HPO terms for Exomiser testing...... 32

Table 2-4 - Simulated challenging variant scenarios...... 34

Table 3-1 - Information on the bioinformatics pipeline...... 46

Table 4-1 - Curated pathogenic STRs in the reference genome...... 81

Table 5-1 - Indexing performance data for WGS fastq files ...... 108

Table 5-2 – Testing grep vs. FlexTyper ...... 109

Table 5-3 - Performance comparison for simulated spike-in pathogens...... 115

List of Figures

Figure 2-1 – GeneBreaker Overview...... 22

Figure 2-2 - IGV snapshots of simulated complex variants...... 35

Figure 2-3 - IGV snapshots of simulated variants in the Dark Genome...... 36

Figure 3-1 - Profiling GLS activity in patient cells...... 52

Figure 3-2 - Allelic expression imbalance and reduced GLS expression...... 54

Figure 3-3 - IGV snapshots reveal repeat signals in the 5’UTR of GLS...... 55

Figure 3-4 - Genotyping of GLS Probands...... 57

Figure 3-5 - Repeat associated effects...... 60

Figure 4-1 - ExpansionHunter Denovo Overview...... 69

Figure 4-2 - Overview of simulating samples with repeat expansions...... 72

Figure 4-3 - EHdn baseline detection of expanded repeats...... 76

Figure 4-4 - Case-control analysis with repeat expansion cohort...... 78

Figure 4-5 - Venn diagram demonstrating catalog limitations...... 79

Figure 4-6 - Analysis of known expansions in real data...... 83

Figure 4-7 – Distance to telomere/centromere for identified repeats...... 85

Figure 4-8 - Structure of nine complex pathogenic repeats...... 86

Figure 4-9 - Rare REs in the control cohort...... 89

Figure 4-10 - Z-score characteristics of simulated pathogenic REs...... 90

Figure 5-1 - Overview of FlexTyper...... 98

Figure 5-2 - Query search workflow...... 100

Figure 5-3 - Search speeds for FlexTyper ...... 107

Figure 5-4 - WGS Genotyping using FlexTyper...... 112

List of Abbreviations

API - application programming interface

BWA - Burrows-Wheeler aligner

CMA - chromosomal microarray

CNV - copy number variant

dbSNP - database of single nucleotide polymorphisms

DGV - database of genomic variants

GA4GH - global alliance for genomics and health

GATK - Genome analysis toolkit

GLS - glutaminase

GRCh37/GRCh38 - Genome reference consortium human genome version 37/38

HPO - Human phenotype ontology

IGV - integrative genomics viewer

Indel - insertion / deletion

LINE - long interspersed nuclear element

MEI - mobile element insertion

PED - pedigree file format

PID - Primary immunodeficiency

RE - repeat expansion

REST - representational state transfer

SINE - short interspersed nuclear element

SNP - single nucleotide polymorphism

SNV - single nucleotide variant

STR - short tandem repeat

SV - structural variant

SVA - SINE-VNTR-Alu

TIDE - treatable intellectual disability endeavour

UCSC - University of California Santa Cruz

VCF - variant call format

VNTR - variable number tandem repeat

WES - whole exome sequencing

WGS - whole genome sequencing

Acknowledgements

I am grateful to the many talented and kind collaborators and colleagues who assisted me along the PhD journey. I firmly believe that the best science happens in teams, and I’ve been lucky to interact in collaborations spanning multiple countries and continents during my time in graduate school. First and foremost, I would like to acknowledge my mentor and advisor Dr. Wyeth Wasserman. Wyeth nurtured my desire to work on many different projects, and opened up avenues of collaboration. One of the primary collaborators I had during my PhD was Dr. Clara van Karnebeek, a clinician-scientist who led the Treatable Intellectual Disability Endeavour (TIDE) project at BCCHR, and maintained the lead role after a move to The Netherlands. Clara’s passion for solving rare disease cases is contagious, and this opened up several opportunities for bioinformatics innovations to be applied in the healthcare setting. One of those innovations was within short tandem repeat calling, where a collaboration with Dr. Michael Eberle and Dr. Egor Dolzhenko at Illumina showed me the benefits of collaboration between academia and industry. Other members of the lab, primarily Dr. Oriol Fornes, helped increase my exposure beyond applied genome analysis, into the world of eukaryotic gene regulation.

In addition to colleagues and mentors, I would like to acknowledge my family and girlfriend Minju for their support during my PhD. My immediate family abroad (brother Adam, sister Rebecca, mother Margaret, and father Andrew), as well as the local family in Vancouver, BC, have made this journey possible. The last few years of the PhD are stressful, and the marathon that is a PhD program can cause mental, emotional, financial and physical challenges. Having the support of family has allowed me to focus on the task ahead, and push towards the finish line.

Dedication

I dedicate this PhD to Babi Ewa, my grandmother who made life in Vancouver as a graduate student a dream come true. Your unwavering support, and consistent boosts to my ego, have allowed me to endure five years of research and education in my Doctoral degree.

“If you put your mind to it, you can move the mountains.”

–Babi Ewa

Chapter 1: Introduction

In the last twenty years, we have witnessed technology dramatically change the way we utilize genomic information to understand health and disease. It started with an initial human genome project, expanded into a diversity-driven thousand genomes project, and now we look towards the millions of genomes planned to be sequenced in the next few years (Consortium and International Human Genome Sequencing 2001). For the average person, whole genome sequencing (WGS) is likely not going to be life changing. The function of all of the bases in our DNA is incompletely understood, and beyond ancestry tracing and predicting common disease risk, having your genome sequenced currently won’t change your day-to-day life. The same is not true for those affected by rare genetic disease.

Rare genetic disease patients and their families around the globe are enduring what has been termed a diagnostic odyssey, fraught with hundreds of tests and screens, dozens of clinical specialists, and yet no diagnosis for their often devastating symptoms. Even as molecular genetic diagnostics have improved, these previous assays have taken limited snapshots of the genome, either looking for large genomic changes, or interrogating a small fraction of the genome using targeted methods. Sequencing the whole genome in a single assay will reduce the number of tests and therefore the time to diagnosis for these patients. Herein lies the true potential of WGS to transform clinical care, as a one-stop test for identifying the cause–a genomic variant or mutation–of a patient’s disease. However, both the identification of genomic variants from WGS data, and understanding their functional impact on human health, remain critical roadblocks in ending the diagnostic odyssey for many patients.

These roadblocks emerge as a worldwide call to action for genomic researchers and clinicians, who must work collaboratively to utilize WGS to its full potential. It will take novel algorithms, improved software, molecular assays, and interdisciplinary teamwork to realize the full potential of WGS in its capacity to diagnose patients affected by rare genetic diseases. And this...is where my journey begins.

1.1 Objective

The primary goal of this thesis work is to expand the utility of WGS in the diagnosis of rare genetic diseases. As this technology is undergoing worldwide adoption, both the knowledge of the human genome and capacity of bioinformatic methods to detect variants are expanding rapidly. This leads to the continual emergence of many improved bioinformatic methods, which must be tested and deployed to end the diagnostic odyssey of rare disease patients and their families across the globe. The work covers improvements to rare disease scenario simulation, development of novel bioinformatic methods for diagnosis, interrogation of undiagnosed cases leading to the discovery of a novel rare genetic disease, and the development of a reverse mapping method which extracts useful information from rapid sequence queries and doesn’t rely upon aligning reads against the reference genome.

1.2 Background

1.2.1 Rare genetic disease diagnosis

In the past 10 years, early deployment of next generation sequencing (NGS) in the clinical setting has revolutionized the diagnosis of rare genetic disorders. NGS has the potential to identify “pathogenic”–causal for the disease–variants in the genome in a high throughput manner compared to previous single-gene sequencing approaches (Boycott, Vanstone et al. 2013). The process of diagnosing a patient involves sequencing their DNA, identifying genomic differences between the patient and healthy individuals (sometimes including their parents) and interpreting those genomic differences for their potential functional impact. In the clinical setting, this is typically achieved in a collaborative manner between the scientists processing data and the clinical geneticists who make informed decisions about a patient’s diagnosis when presented with a set of candidate pathogenic variants. While conceptually the process is straightforward, it is complicated by technological and algorithmic difficulties, alongside an incomplete understanding of the function of the human genome.

Initial deployment of NGS for rare disease diagnosis focused on using whole exome sequencing (WES), a method which restricts focus to the exonic regions of the genome (Tarailo-Graovac, Shyr et al. 2016). WES works by probing for sequences which match known exonic elements in the genome, primarily focusing on coding sequences with additional coverage of the untranslated regions. In the clinical setting the diagnostic rate for WES is 25-50% in studies with mixed phenotypes (Smedley, Schubach et al. 2016, Dragojlovic, Elliott et al. 2018). The limitations of WES are that it only interrogates a portion of the genome, does not completely cover all exonic regions, and cannot reliably detect the diverse classes of genomic variation. Notably, some of the most well studied rare diseases are caused by copy number variants (CNVs) such as deletions and duplications. While identifying CNVs from WES is possible (Rajagopalan, Murrell et al. 2020), the CNVs must overlap coding regions and identifying the breakpoints of WES-called CNVs is a major challenge. As such, it’s recommended to pair WES with another assay: chromosomal microarray (CMA). The logical successor to WES and CMA in the clinical setting is WGS, as it has proven benefits including improved exon coverage, the ability to call larger genomic structural variants (SVs) and short tandem repeats (STRs), and interrogation of the 98% noncoding portion of the genome (Elgar and Vavouri 2008, Belkadi, Bolze et al. 2015, Stavropoulos, Merico et al. 2016). A hindrance to the rapid uptake of WGS has been the cost in comparison to WES, as WGS has both higher sequencing and data analysis costs (e.g. storage and processing). However, the benefits of WGS are beginning to be realized and a transition away from WES is imminent.

In the early adoption of WGS, numerous anecdotes have emerged of research teams finding the “missing heritability” in complex, undiagnosed cases (Maroilley and Tarailo-Graovac 2019). In several cases, the findings highlight genomic variants beyond the coding regions of the genome which are challenging to interpret, but lead to a dramatically improved understanding of the function of the human genome. Aggregating the knowledge from known causal disease variants propels the development of novel methods in the bioinformatics domain. Bioinformatics plays a critical role by leveraging a combination of algorithmic advances and knowledge to increase the diagnostic capacity of WGS in the rare disease setting. As WGS is ushered into the standard of care for rare disease patients, bioinformatic advancements will lead to an improved diagnostic rate and fundamentally change our understanding of the human genome.

1.2.2 Variant calling and interpretation from WGS data

Analysis of WGS data for the diagnosis of rare genetic diseases can be split into two separate tasks: variant calling and variant interpretation. The primary method for variant calling is anchored to the concept of a linear reference genome, which acts as a reference point enabling rapid comparison between two different WGS samples. First, the WGS “reads”, or short sequences of DNA, are mapped against the reference genome. This process is enabled by compression of the reference genome into a more searchable data structure, typically an FM-index. Then, starting at the first position of chromosome one, the genome is scanned for sequence differences between the mapped reads and the underlying reference sequence. The accumulation of differences at a given position in the reference genome is then considered a genomic variant if it passes a set of thresholds for quality and read support. For several classes of variants, including CNVs and short tandem repeat (STR) expansions, where the size of the variant is larger than a single read, additional read mapping signatures are utilized. One such signal is the read depth or depth of coverage, which considers the pileup of reads over larger windows to identify regions where coverage is higher (duplicated) or lower (deleted) than expected. As most WGS data are paired-end, meaning reads are sequenced from each end of a single DNA fragment, the mapping orientation and distance between each read pair can be informative. For example, if two reads map much further apart than expected from the fragment size, this is indicative of a deletion in the sample relative to the reference. Other insertion-based methods, where one or both reads in a pair do not map uniquely to the human genome, require some form of de novo assembly or other advanced algorithms. The utility of these signals in detecting genomic variants will be explored in subsequent chapters.
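The two mapping signatures described above, read depth over windows and the distance between paired reads, can be illustrated with a minimal Python sketch. It is not the method of any particular variant caller: the function names, the ratio cutoffs (0.6 and 1.4), and the four-standard-deviation rule are all illustrative assumptions, and production tools model coverage and fragment size statistically.

```python
from statistics import median


def classify_depth_windows(depths, low_ratio=0.6, high_ratio=1.4):
    """Label each window by its read depth relative to the median window depth.

    The ratio cutoffs are illustrative assumptions, not published thresholds.
    """
    baseline = median(depths)
    labels = []
    for depth in depths:
        ratio = depth / baseline if baseline else 0.0
        if ratio <= low_ratio:
            labels.append("deletion-like")      # fewer reads than expected
        elif ratio >= high_ratio:
            labels.append("duplication-like")   # more reads than expected
        else:
            labels.append("expected")
    return labels


def is_deletion_like_pair(left_start, right_end, mean_fragment, sd_fragment, n_sd=4):
    """Flag a read pair whose mapped span far exceeds the expected fragment size.

    A pair mapping much further apart on the reference than the sequenced
    fragment length suggests sequence missing from the sample (a deletion).
    The n_sd cutoff is a hypothetical choice for illustration.
    """
    observed_span = right_end - left_start
    return observed_span > mean_fragment + n_sd * sd_fragment


# Toy per-window mean depths for a sample sequenced to roughly 30x.
window_depths = [30, 31, 29, 15, 14, 30, 46, 45, 30]
labels = classify_depth_windows(window_depths)

# Toy read pairs: (leftmost mapped base, rightmost mapped base).
normal_pair = is_deletion_like_pair(1000, 1400, mean_fragment=400, sd_fragment=50)
stretched_pair = is_deletion_like_pair(1000, 6400, mean_fragment=400, sd_fragment=50)
```

In this toy input, the two windows at roughly half the median depth are labelled deletion-like, the two at roughly 1.5 times the median are labelled duplication-like, and only the pair spanning 5,400 bases is flagged as discordant.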

After the set of variants present within an individual has been identified, it needs to be refined from millions of total variants to a small set of causal candidates which are linked to the patient’s phenotype (Eilbeck, Quinlan et al. 2017). This process involves annotation of each observed variant for its frequency in the population, often utilizing large open databases of genomic variants such as gnomAD (Lek, Karczewski et al. 2016). These open databases have seen a dramatic increase in size, with the most recent release containing whole genomes of roughly 70,000 individuals (Karczewski, Francioli et al. 2020). Variants are also annotated against genes and their corresponding genic regions (e.g. coding sequences and introns), which is central to the understanding of functional impact (Cingolani, Platts et al. 2012). Predicting the impact of variants uses both these genic annotations and measures of evolutionary conservation, which is a proxy for function (Cooper and Brown 2008). Several advanced methods which screen a variant for functionality exist, and they typically use machine learning to provide an in silico score for the impact of a variant. Often incorporating, and largely driven by, measures of conservation (Pollard, Hubisz et al. 2010, Ritchie and Flicek 2014), these methods seek to discriminate between a functional and a non-functional genomic variant. Principal among them, and perhaps the most widely used, is the combined annotation dependent depletion (CADD) method, which performs an aggregation of several other well-performing prediction methods (Kircher, Witten et al. 2014). An important component of in silico predictions is that they are updated with the frequently expanding knowledgebase of the human genome, which is a key advantage of CADD (Rentzsch, Witten et al. 2019). An expansion into variant interpretation beyond the coding regions of the genome is also supported by in silico predictors, some of which are beginning to show promise for detecting splice-altering variants within intronic regions (Jaganathan, Kyriazopoulou Panagiotopoulou et al. 2019).
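The refinement from many annotated variants to a few candidates can be sketched as a simple filter on population allele frequency and an in silico score. The field names, thresholds, and variant values below are hypothetical and chosen only to make the example run; they are not recommended cutoffs.

```python
def prioritize(variants, max_af=0.001, min_cadd=20.0):
    """Keep rare, high-scoring variants and rank them by score."""
    kept = [v for v in variants if v["af"] <= max_af and v["cadd"] >= min_cadd]
    return sorted(kept, key=lambda v: v["cadd"], reverse=True)

# Toy annotated variants with made-up frequencies and scores.
variants = [
    {"id": "var1", "af": 0.2500, "cadd": 3.1},
    {"id": "var2", "af": 0.0001, "cadd": 27.4},
    {"id": "var3", "af": 0.0000, "cadd": 34.0},
    {"id": "var4", "af": 0.0004, "cadd": 8.9},
]
print([v["id"] for v in prioritize(variants)])
```

In practice these filters are combined with inheritance pattern and phenotype information, which are omitted here.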


1.2.3 Emergent variants underlying missing heritability

In the past 5 years a shift away from WES and towards WGS for rare genetic disease patients has enabled the investigation of variants beyond the coding regions of the genome. With this shift there are opportunities for discovery of pathogenic variants whose mechanism of action is not that of a missense, nonsense, or frameshift variant, which alters and/or eliminates the functionality of the gene at the protein level. Instead, the mechanism of action could be to disrupt splicing, to affect the stability of the mRNA, or to affect the regulation of the gene at the transcription, post-transcriptional modification, or translation levels. Typically, detecting such abnormalities or disruptions relies upon supplementary assays, such as RNA sequencing (RNA-seq), which can examine the steady state levels of mRNA in a patient’s cells (Marco-Puche, Lois et al. 2019).

One of the strengths of RNA-seq is that it can detect novel splice junctions, one example being a novel splice junction in complex I assembly factor (TIMMDC1) (Kremer, Bader et al. 2017). In a larger study of 50 patients with undiagnosed disorders, some of whom shared specific phenotypes, 17 achieved a diagnosis with supplementation by RNA-seq, including four patients affected by collagen VI–related dystrophy, all sharing an intronic variant disrupting splicing in COL6A1 (Cummings, Marshall et al. 2016). As our understanding of the sequence determinants of splicing matures, it becomes possible to predict these splice-altering events from sequence alone (Jaganathan, Kyriazopoulou Panagiotopoulou et al. 2019). Jaganathan et al. train a neural network, a form of machine learning, to learn the language of splicing, and then apply the method to detect variants which alter splicing with high accuracy. They boldly claim that up to 10% of rare disease patients harbor splice-altering events as the pathogenic variant, suggesting that integration of this scoring scheme, SpliceAI, into analysis pipelines will dramatically boost diagnostic rates.

Beyond splicing alterations, RNA-seq and other mRNA techniques such as qPCR can help identify disruptions to a gene’s expression level and allele specific expression. Allele specific expression, where one allele is expressed higher than the other, is a strong indication of a heterozygous cis-regulatory mechanism. In the 2016 study by Cummings et al., with 50 patients and 440 Genotype-Tissue Expression Consortium (GTEx) normal donors for matched muscle tissue, the authors were underpowered to detect such expression abnormalities. Nevertheless, several examples exist of pathogenic noncoding variants in the untranslated regions, promoter regions, and even distal enhancer elements (Smedley, Schubach et al. 2016). The curation work by Smedley et al. showed that, as of 2016, roughly 450 pathogenic noncoding SNVs and indels were known to be causal for Mendelian rare genetic diseases.

One of the distinct advantages of WGS is that in addition to identifying these noncoding SNVs and indels, it can also identify other forms of noncoding variants which alter gene regulation in cis. A classic example of this is Fragile X syndrome, where an STR expansion is responsible for the hypermethylation and repression of FMR1. STRs are abundant throughout the genome; however, some are prone to massive expansion, which can in turn alter the cis-regulatory context of a gene and cause repression or increased expression. Recent developments in bioinformatics are enabling the identification of expanded STRs in WGS datasets, which may in turn lead to novel discoveries (Dashnow, Lek et al. 2018).


Expanding variant identification and interpretation beyond what was feasible with exome sequencing has the potential to reveal novel insights into the function of the human genome, and end the diagnostic odyssey for many patients.

1.2.4 The genome is dark and full of terrors

A major challenge of applied human genomics has been how to handle the repeats in the human genome. Repetitive elements include larger transposable elements such as long interspersed nuclear elements (LINEs), short interspersed nuclear elements (SINEs), and long terminal repeats (LTRs). These genomic elements are responsible for genome plasticity and evolution, and also drive many of the large genomic differences between individuals. While LINEs, SINEs, and LTRs are large interspersed elements, the other repeats in the genome belong to a class called variable number tandem repeats (VNTRs), which include short tandem repeats (STRs) with repeat unit length of less than 20 base pairs, and minisatellites with repeat units up to 100 base pairs. Collectively, repeats comprise 50-69% of the human genome, with transposable elements making up roughly 45% (Haubold and Wiehe 2006, de Koning, Gu et al. 2011). These repeats are challenging for short read technologies, because the length of the sequenced fragment is often shorter than the length of a full repeat. This leads to ambiguity in read mapping, where a given read could map equally well to several regions of the genome. Analysis pipelines ignore these repeat elements, often masking them or only analyzing reads which are uniquely mapped (Tarailo-Graovac and Chen 2009). Further complicating the non-unique nature of the human genome is the fact that many genes come from gene families and share identical, or nearly identical, sequences with other genes in the genome. This can lead to segments of genes, or entire genes altogether, which are not covered by uniquely mapped reads and are thus camouflaged from accurate profiling for variant or mutation identification (Ebbert, Jensen et al. 2019). When entire regions, typically larger than 1000 base pairs, are duplicated in the genome but present at low copy number, this is referred to as a segmental duplication. Several well studied segmental duplications can have dramatic impact on human health when pathogenic variants affect the genes within them, a notable example being the gene SMN1, which is linked to spinal muscular atrophy (Wirth 2000).

Ignoring these repeat regions in the genome comes at a cost. The Global Alliance for Genomics and Health (GA4GH), together with the Genome in a Bottle consortium, are organizations which oversee the development and maintenance of gold standard reference materials for variant calling and benchmarking. They use unique read mapping and concordance between technologies and software to define ‘gold standard regions’, which can be used to filter genomic variants for high confidence (Zook, Chapman et al. 2014, Krusche, Trigg et al. 2019). Intersecting the Genome in a Bottle gold standard regions with known disease-related genes from OMIM and ClinVar revealed that less than 75% of the exonic bases were covered by gold standard regions (Goldfeder, Priest et al. 2016). A more recent analysis extended this observation based on the unique mapping of reads from different technologies to define the “dark and camouflaged” genes in the genome (Ebbert, Jensen et al. 2019). In their analysis, the authors show that several of these challenging regions are resolved when third generation sequencing technology (or single molecule real-time sequencing) is used. This technology produces longer read lengths, which allow for unique mapping as long reads capture the full length of these repeat elements flanked by unique sequence (Sedlazeck, Lee et al. 2018). While not the focus of this thesis work, long reads hold great promise for conquering the “dark” genome, and the future of rare disease genomics will undoubtedly move towards longer-read sequencing technologies. Until then, researchers will continue to make creative use of short read sequencing technology.

1.2.5 Advancements in bioinformatic methodology

Tremendous progress has been made in the bioinformatics community to rise to the challenge of calling all the variants in the human genome. Advancements continue to be made in the genotyping of SNVs and indels, including the DeepVariant tool, a method from Google’s Brain team which uses deep neural nets to distinguish true variants from mapping artefacts (Poplin, Chang et al. 2018). Beyond SNVs and indels, many novel tools have emerged which utilize different signals present in paired-end reads, and are enabling the discovery of novel pathogenic variants in the rare disease setting. The types of signals used by these tools include split-read and discordant read mapping. Split reads are reads where a portion of the sequence maps uniquely to the genome, and the rest of the read is “soft-clipped” or ignored to allow the mapping of the read to that location. Discordant read mapping is where the paired reads map against the genome in an orientation, or at a distance, which is different from the expectation. Since paired-end reads are sequenced from both ends of a single piece of DNA, they should face towards each other and the distance between them should follow a normal distribution centered on the sequenced fragment length. When mapping these reads against the genome, if either their orientation or the distance between the mapping locations of each end is different from the expectation, then it can indicate the presence of a structural variant (Mahmoud, Gobet et al. 2019).
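The discordant read pair logic described above can be sketched as a small classifier. The function name, the assumed fragment length distribution (mean 500 bp, standard deviation 50 bp), the read length, and the three standard deviation cutoff are all illustrative assumptions rather than the settings of any published tool.

```python
def classify_pair(pos1, strand1, pos2, strand2,
                  read_len=150, mean=500, sd=50, n_sd=3):
    """Classify a read pair by orientation and implied fragment span."""
    # Order the two reads by leftmost mapping coordinate.
    (left_pos, left_strand), (right_pos, right_strand) = sorted(
        [(pos1, strand1), (pos2, strand2)]
    )
    # Expected orientation: left read forward, right read reverse.
    if not (left_strand == "+" and right_strand == "-"):
        return "discordant_orientation"
    span = right_pos + read_len - left_pos
    if span > mean + n_sd * sd:
        return "discordant_too_far"    # consistent with a deletion
    if span < mean - n_sd * sd:
        return "discordant_too_close"  # consistent with an insertion
    return "concordant"

print(classify_pair(1000, "+", 1350, "-"))  # span of 500 bp
print(classify_pair(1000, "+", 3350, "-"))  # span of 2500 bp
print(classify_pair(1000, "+", 1350, "+"))  # same-strand pair
```

A single discordant pair is weak evidence; callers typically require clusters of pairs supporting the same breakpoints.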

For methods which call insertions, where there is the addition of novel sequence not present in the reference genome, utilizing unmapped reads is necessary. This applies to two classes of genomic variants, STRs and mobile element insertions (MEIs), where insertion of sequence can have a dramatic impact on gene function depending on the location and/or size of insertion. One utilization of unmapped reads is when one read in the pair is mapped uniquely against the genome (anchored), and the other is either unmapped or mapping ambiguously (non-anchored).

For MEIs, the sequence of insertion is known to originate from one of a few active mobile genomic elements, so the non-anchored read can be scanned for a match to one of the transposable element sequences, which is the basis of several MEI calling tools including MELT (Gardner, Lam et al. 2017). For short tandem repeats, when the non-anchored read is composed of tandem repeat sequence, e.g. ‘GCAGCAGCA...’ or ‘CCGCCGCCG…’, this is evidence of a repeat expansion. This type of signal is the basis for several emergent short tandem repeat expansion calling methods (Dolzhenko, van Vugt et al. 2017, Dashnow, Lek et al. 2018, Dolzhenko, Deshpande et al. 2019, Mousavi, Shleizer-Burko et al. 2019).
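To make the tandem repeat signal concrete, the sketch below tests whether a read is composed almost entirely of a short repeating unit by comparing each base with the base one unit earlier. The function name, the maximum unit length of 6 bp, and the 90% purity cutoff are assumptions for illustration; this is not the algorithm of any of the cited tools.

```python
def repeat_motif(read, max_unit=6, min_purity=0.9):
    """Return the shortest repeat unit that explains the read, or None."""
    for unit in range(1, max_unit + 1):
        if len(read) < 2 * unit:
            break
        # Count positions that match the base one repeat unit upstream.
        matches = sum(read[i] == read[i - unit] for i in range(unit, len(read)))
        if matches / (len(read) - unit) >= min_purity:
            return read[:unit]
    return None

print(repeat_motif("GCAGCAGCAGCAGCA"))   # repetitive read
print(repeat_motif("ACGTTGCAAGCTTAGC"))  # non-repetitive read
```

The returned unit is reported in the phase in which the read begins, so the same repeat may appear as a rotated motif in different reads.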

1.2.6 Relying upon the reference genome

The standard practice for genome analysis with respect to calling variants has focused on aligning reads against the reference genome, and then post-processing the alignment to find deviations between the mapped (and sometimes unmapped) reads and the expected linear reference sequence. While this approach has remained the status quo, recent discoveries about the extent of complex genomic variation in populations have brought into question whether such an approach is viable (Ballouz, Dobin et al. 2019, Yang, Lee et al. 2019). Representing the reference genome, especially with respect to complex variants, as a graph structure could be a solution to the problem. A key example of this is a novel method which seeks to genotype complex structural variants by utilizing graphs (Chen, Krusche et al. 2019). The method by Chen et al. facilitates local realignment around previously observed structural variant breakpoints, which were identified using longer-read technologies from population sequencing efforts. While local methods (those with a limited genomic search space) can utilize graphs efficiently, extending the analysis genome-wide comes with challenges of computational efficiency. Future work within this space is needed before the full potential of graphs for genotyping can be realized.

Short of a full transition to graph structures, an intermediate approach focuses on the selection of a sample-matched reference genome. Two examples of this include the use of sex-matched reference genomes and population-matched reference genomes (Dilthey, Cox et al. 2015, Webster, Couse et al. 2019). Correctly identifying both the sex and ethnicity is required for the sample-matching process, and can be achieved either through iterative reference-genome alignment (e.g. bootstrapping), or by utilizing methods which extract information from unaligned reads. Bootstrapping involves: 1) alignment against the reference; 2) genotyping the aligned reads; 3) assessing ethnicity and sex chromosome complement; and then 4) realigning against the sample-matched reference. This process can be computationally expensive, and can potentially be replaced by a separate approach which doesn’t utilize reference mapping. Alternative methods could include those which extract information directly from unaligned reads, including approaches in which genotype is determined based on observed kmers (sequences of length ‘k’) within an unmapped read set (Shajii, Yorukoglu et al. 2016, Dolle, Liu et al. 2017, Sun and Medvedev 2019). These approaches either rely upon indexing read sets and searching for target sequences, or hashing the read sets into kmer libraries and searching for target kmers. Improvements to the flexibility and generalizability of these reverse-mapping methods could lead to their adoption into genome analysis workflows.
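The idea of genotyping from kmers in an unmapped read set can be sketched as follows. Allele-specific kmers are derived from the flanking sequence of a SNP, and reads are counted as supporting an allele if they contain any of them. The function names, the flanking sequences, the value of k, and the minimum read support are hypothetical, and reverse-complement matching is omitted for brevity.

```python
def kmer_set(seq, k):
    """All kmers of length k in a sequence (empty if the sequence is too short)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def genotype_snp(reads, left, ref, alt, right, k=5, min_reads=2):
    """Call a diploid genotype from reads carrying allele-specific kmers."""
    ref_kmers = kmer_set(left + ref + right, k)
    alt_kmers = kmer_set(left + alt + right, k)
    ref_only = ref_kmers - alt_kmers
    alt_only = alt_kmers - ref_kmers
    ref_reads = sum(1 for r in reads if kmer_set(r, k) & ref_only)
    alt_reads = sum(1 for r in reads if kmer_set(r, k) & alt_only)
    has_ref = ref_reads >= min_reads
    has_alt = alt_reads >= min_reads
    if has_ref and has_alt:
        return "0/1"
    if has_alt:
        return "1/1"
    if has_ref:
        return "0/0"
    return "./."

# Toy reads: two carry the reference allele (G), two carry the alternate (T).
reads = ["CGTACGTTAG", "GTACGTTAGC", "CGTACTTTAG", "TACTTTAGCA"]
print(genotype_snp(reads, "ACGTAC", "G", "T", "TTAGCA"))
```

This brute-force scan stands in for the indexed or hashed searches used by the cited methods, which scale to whole read sets.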

1.2.7 Picking the right tool for the job

Accurately identifying genomic variants in WGS data is an ongoing research endeavor, and has brought on the development of diverse bioinformatic methods. A prime example of this is within the field of CNV and SV calling, where over 75 tools have been developed and broadly used (Kosugi, Momozawa et al. 2019). Within each of the tool publications, a demonstration of performance, typically using some form of gold standard data and/or simulated data, shows that the new tool can outperform other existing tools. These benchmarks contain an inherent bias, because the authors have a horse in the race and a demonstration of improved performance is often necessary for publication. To supplement these performance measures, numerous “comprehensive” benchmarks surveying anywhere from 5 to 70 tools emerge periodically to help guide the software selection process during pipeline implementation (Cameron, Di Stefano et al. 2019, Kosugi, Momozawa et al. 2019, Zhang, Bai et al. 2019).

Benchmarking is a key component of software adoption and publication, and central to that process is the use of ‘gold standard’ reference variant sets to demonstrate variant calling performance. For several years, the primary benchmark dataset was the NA12878 deletion call set from the landmark human SV calling paper (Mills, Walter et al. 2011). In this work, several SV calling tools are integrated and multiple sequencing technologies are used to define the SV landscape of human genomes. Recently, global initiatives including the Genome in a Bottle, Database of Genomic Variants (DGV) and GA4GH have begun to play key roles in maintaining these benchmarking standards, and expanding to include long-read sequencing technologies and additional individuals. Even as these standards mature, they are still focused on a small set of healthy individuals, often as a trio (mother, father, child). Further, the set of ‘truth’ variants often requires agreement between multiple variant callers and technologies, which excludes some of the more challenging variants which would not be identified in a consensus approach.

Additionally, utilizing the gold standards limits the breadth of variants identified. Several classes of variants which are common in rare disease patients, such as recurrent large CNVs underlying microdeletion and microduplication syndromes, are not represented in these datasets. Utilizing simulation is an effective way to supplement gold standards, and several frameworks exist to facilitate whole genome simulation (Mu, Mohiyuddin et al. 2015, Yue and Liti 2019, Juan, Wang et al. 2020). The process of simulation through a popular tool like VarSim is to integrate predefined sets of variants, sometimes including randomly generated ones, into the reference genome sequence, and then synthetically generate reads from the modified reference sequences.
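The two stages of such a simulation can be sketched on a toy sequence: variants are first integrated into the reference, and reads are then generated from the modified sequence. This is not VarSim's implementation; the function names are hypothetical, variants are given as VCF-style (1-based position, REF, ALT) tuples, and reads are tiled deterministically with no sequencing errors, whereas real simulators sample fragments at random and model error profiles.

```python
def apply_variants(reference, variants):
    """Integrate (pos, ref, alt) variants into a reference sequence."""
    seq = reference
    # Apply from right to left so earlier coordinates stay valid.
    for pos, ref, alt in sorted(variants, reverse=True):
        start = pos - 1
        if seq[start:start + len(ref)] != ref:
            raise ValueError(f"REF allele mismatch at position {pos}")
        seq = seq[:start] + alt + seq[start + len(ref):]
    return seq

def tile_reads(seq, read_len, step):
    """Generate error-free reads at fixed offsets along the sequence."""
    return [seq[i:i + read_len] for i in range(0, len(seq) - read_len + 1, step)]

reference = "ACGTACGTAC"
variants = [(3, "G", "T"), (6, "CGT", "C")]  # one SNV and one 2 bp deletion
mutated = apply_variants(reference, variants)
print(mutated)
print(tile_reads(mutated, read_len=4, step=2))
```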

1.3 Chapter summaries and research contribution

The research presented within this thesis is unified by the underlying goal of reducing the time to diagnosis, and in some cases ending the diagnostic odyssey, for patients affected by rare genetic diseases. The work occurs within a transitionary period, where DNA sequencing technology is constantly improving, and across the globe there are numerous efforts to improve variant calling and interpretation.

Chapter 2 focuses on simulation of rare genetic disease scenarios with the development of a tool called GeneBreaker. This specific simulation strategy, which focuses on user-guided creation of pathogenic variants of multiple classes and genic impacts, provides researchers with a capacity to benchmark analysis pipelines and train the incoming generation of genome analysts. With an infinite number of ways to “break” a gene, the GeneBreaker framework will expand upon the emerging anecdotes of rare genetic disease cases in the literature.

In Chapter 3, I develop a diagnostic pipeline for rare disease analysis, deploy it in a clinical research setting for the analysis of patient DNA sequencing data, and identify a novel noncoding pathogenic short tandem repeat expansion in the untranslated region of glutaminase (GLS). The discovery was the result of a narrow focus on a candidate gene which emerged from whole exome sequencing with a single coding variant supported by biochemical assays, leading to the hypothesis that another gene-disrupting variant was yet to be detected. In an international collaboration, we investigate the molecular impact of the short tandem repeat variant and genotype it across nearly 10,000 individuals to establish its rarity and mechanism of pathogenicity. This work demonstrates the additional capacity to detect pathogenic variants in the genome which are refractory to detection via targeted sequencing methods.

Following the identification of a short tandem repeat expansion in GLS, I collaborate on the development of a novel method, called ExpansionHunter Denovo, for the identification of short tandem repeat expansions in whole genome sequencing data. Within this collaboration, I implement a novel short tandem repeat simulation method and benchmark the capacity of our new tool and other existing tools for the identification of repeat expansions, both pathogenic and nonpathogenic, throughout the genome. We assess the landscape of repeat expansions in the genome, and show that our method has broader detection capacity than existing tools. Application of this methodology to undiagnosed cases holds promise for identifying novel repeat expansions, including a recently discovered class of complex repeat expansions which alter the repeat composition at insertion sites in the genome.

In Chapter 5, we develop an alternative analysis approach which does not rely upon traditional reference genome mapping to extract useful information from high throughput sequencing datasets. The method, termed FlexTyper, indexes the short reads into a rapidly searchable data structure called an FM-index, and then facilitates the rapid searching of user-specified sequence queries against the index. The sequence queries are broken down into kmers (sequences of length ‘k’) based on the type of sequence being searched for. We demonstrate the accuracy and utility of FlexTyper using human whole genome sequencing data, and human RNA-seq data with pathogen spike-in sequences. The types of useful information which can be extracted from substring searches include genotyping predefined SNPs for sex and ancestry typing, and extracting coverage of genomic regions. This information can be used to inform pre-alignment informatic pipelines, such as those which rely upon sex- or population-matched reference genomes, as well as for downstream copy number variant analysis.
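The query decomposition step can be illustrated with a short sketch in which a query sequence is broken into kmers at a fixed stride and each kmer is counted across a read set, giving a coverage-like profile along the query. This is not FlexTyper's implementation: plain substring counting stands in for the FM-index search, and the function names, the stride parameter, and the toy reads are assumptions.

```python
def query_kmers(query, k, stride=1):
    """Break a query sequence into kmers taken every `stride` bases."""
    return [query[i:i + k] for i in range(0, len(query) - k + 1, stride)]

def kmer_hits(query, reads, k, stride=1):
    """Count occurrences of each query kmer across all reads."""
    # str.count tallies non-overlapping matches, which is adequate for
    # kmers that do not overlap themselves.
    return [sum(read.count(kmer) for read in reads)
            for kmer in query_kmers(query, k, stride)]

reads = ["TTACGTTG", "ACGTTGCA", "GGGGGGGG"]
print(query_kmers("ACGTTGCA", k=4, stride=2))
print(kmer_hits("ACGTTGCA", reads, k=4, stride=2))
```

A larger stride trades resolution along the query for fewer searches.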

Together, the research within this thesis aims to expand the utility of whole genome sequencing in the diagnosis of rare genetic diseases.


Chapter 2: Simulation of rare disease scenarios

2.1 Introduction

Next-generation sequencing, and increasingly third-generation sequencing, has been revolutionary in rare disease diagnosis (Wise, Manolio et al. 2019). Sequencing the entire genome identifies millions of variants, which are prioritized and then manually curated to arrive at a diagnosis for the affected individuals. This process occurs both in a familial setting (e.g. trio sequencing of mother-father-proband) and in groups of patients with similar phenotypes (e.g. case series or cohort studies). The analysis process can be broken down into two distinct steps: variant calling and variant interpretation.

Advances in bioinformatic tools for variant calling have recently expanded the types of variants being detected from single nucleotide variants (SNVs) and small insertions and deletions (indels), to now routinely include copy number variants (CNVs), tandem repeat expansions (REs), mobile element insertions (MEIs) and complex structural variants (SVs) such as inversions, insertions, and translocations. A major challenge within the expansion to these more complex variant types is the large amount of noise and artefacts stemming from the limitations of short reads, with lengths of 100-250 base pairs (bp), in resolving repetitive sequences within the genome. These short reads, even when sequenced from two ends of a DNA fragment of ~500 bp in length, cannot span the repetitive elements in the genome. Specifically, interspersed repeats and short tandem repeats whose length often exceeds 500 bp cannot be resolved and uniquely mapped against a reference. These mapping ambiguities lead to strict region-based filtering in order to mitigate noise, in spite of the significant proportion of disease-associated genes overlapping these regions (Goldfeder, Priest et al. 2016, Ebbert, Jensen et al. 2019). As both read length technology and algorithmic approaches continue to evolve, new tools will emerge as candidates to use within diagnostic pipelines (Wenger, Peluso et al. 2019). While benchmarking the variant calling process in human genomes has been a focus of international consortia, the majority of comparisons focus on evaluations of healthy individuals with well characterized variant sets (Krusche, Trigg et al. 2019). In the diagnosis of rare genetic diseases, benchmarks should be focused on assessing the detection capacity of pathogenic or potentially pathogenic variants.

Beyond the landscape of rapidly evolving variant calling approaches is the emergence of a multitude of variant interpretation tools and pipelines. Several in silico effect predictors for the assessment of functional variant impact already exist, and research efforts are now adding interpretation capacities for features outside of the coding regions of the genome, e.g. splice regulating sequences (Jaganathan, Kyriazopoulou Panagiotopoulou et al. 2019). Furthermore, continuously expanding population databases serve as filters to help identify rare genomic events (Lek, Karczewski et al. 2016, Karczewski, Francioli et al. 2020). Both in silico predictors and the population allele frequency of observed variants are used as filters when attempting to identify a pathogenic variant causal for rare genetic disease phenotypes.

Combinations of different variant calling and variant interpretation pipelines are implemented across the world in clinical-grade and research-grade genomic diagnostic laboratories. These approaches utilize tools which consistently receive upgrades to underlying software and databases used within analysis pipelines. As diagnostic pipelines continue to evolve, there is an emerging need for specific performance testing to compare different tools and ensure each new version of an analysis pipeline can identify the broken gene in an applied setting.

There is a demand for developing capacity to create synthetic scenarios of rare disease cases for education and training purposes. As genomic medicine advances into standard of care, there will be a need for easy access to training datasets of rare disease cases of increasing complexity.

Institutional policies and guidelines around access and use of patient data for research purposes beyond that of the specific disease diagnosis vary significantly between different studies and globally (Raza and Hall 2017, Martani, Geneviève et al. 2019). This means that establishing a universal ‘standard’ of bona fide genomes suitable for personnel training and benchmarking is challenging. Further complicating reanalysis of patient data is the possibility of uncovering incidental findings, perhaps affecting either the parents or the proband (Green, Berg et al. 2013).

In light of incidental findings, some institutional policies are strict regarding reanalysis of diagnosed patient data. Even with access to such data for educational purposes, the scale or volume of available data would be limited compared to the potentially infinite possibilities of simulated genomic errors.

To meet these needs and serve the growing community of genomic medicine, we developed GeneBreaker: a simulation tool for Mendelian rare genetic diseases. GeneBreaker has an online web interface for designing custom genetic disease scenarios based on user-guided parameters. It has the capacity to simulate variants of multiple different classes, affecting different genic regions, and can either draw upon known pathogenic variants from resources such as ClinVar (Landrum, Lee et al. 2014) and ClinGen (Rehm, Berg et al. 2015), or facilitate user creation of novel events. Created variants can be embedded within different familial inheritance models, to model real life scenarios which may be encountered within clinical or research settings.

GeneBreaker is deployed as a NodeJS web server, which communicates with a REST API that accesses an underlying data repository (storing gene models and known pathogenic variants) and a variant simulation framework written in the Python programming language. User interaction with the online tool guides the stepwise process of variant simulation as follows: 1) select the gene and transcript to be “broken”; 2) simulate the first variant by selecting variant class and either novel creation or existing pathogenic variant(s); 3) proceed to design the second variant or finish and output a Variant Call Format file (VCF; Figure 2-1A). Variant creation is done with Mendelian disease cases in mind, where a single gene/locus is disrupted in either a dominant (single variant) or recessive (one variant per allele) manner. Gene models and known pathogenic variants are stored within a MySQL database, which is used for variant creation.

Python code interacts with the MySQL database through the REST API to extract variant and gene information, and subsequently creates variants based on user parameters. Code for variant creation and MySQL database interaction can be found at https://github.com/wassermanlab/GeneBreaker. While the primary output of the simulator is a VCF file, there is also the capacity to enable downstream benchmarking. Support for downstream benchmarking includes facilitating a transition to a VarSim-compatible VCF file for full synthetic WGS simulation, and integration into background variant sets for testing annotation and prioritization approaches (Figure 2-1B). Details about the variant simulation process and downstream benchmarking are below.
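For readers unfamiliar with the output format, the sketch below assembles a minimal single-sample VCF for a recessive scenario with one heterozygous variant per allele. It is an illustration rather than GeneBreaker's own code: the function names, the sample name, and the chromosome and positions are placeholders, and only the genotype (GT) field is populated.

```python
def vcf_record(chrom, pos, ref, alt, genotype, variant_id="."):
    """Format one tab-delimited VCF data line with a single GT sample column."""
    fields = [chrom, str(pos), variant_id, ref, alt, ".", ".", ".", "GT", genotype]
    return "\t".join(fields)

def build_vcf(records, sample="PROBAND"):
    """Prepend a minimal VCFv4.2 header to the data lines."""
    header = [
        "##fileformat=VCFv4.2",
        '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\t" + sample,
    ]
    return "\n".join(header + records) + "\n"

# Placeholder coordinates: an SNV and a 2 bp deletion, both heterozygous.
records = [
    vcf_record("chr1", 1000, "A", "T", "0/1"),
    vcf_record("chr1", 2000, "GTC", "G", "0/1"),
]
print(build_vcf(records), end="")
```

The two unphased 0/1 genotypes do not by themselves establish that the variants lie on different alleles; in a trio this is conveyed by the parental genotypes.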


2.2 Methods

2.2.1 Simulation architecture

Figure 2-1 – GeneBreaker Overview.

(A) Overview of GeneBreaker design framework showing user interaction with the website (light blue), connected MySQL tables (red), underlying variant subclasses (dark blue), and output VCF file (yellow). The user interacts with the GeneBreaker website (light blue) which is connected to hidden components for gene description and variant creation/selection. (B) Downstream benchmarking operations enabled by GeneBreaker including splitting variants amongst VCF files according to user-designed pedigree (yellow), and then either burying the variant within open source trios for annotation and prioritization testing or simulating the proband variant as a full synthetic simulation via VarSim (purple).

2.2.2 Host webserver and underlying data

GeneBreaker is hosted on a virtual webserver at the Centre for Molecular Medicine and Therapeutics, with 12GB of RAM and 4 CPUs running CentOS 7. The variant data within the underlying repository comes from open source variant catalogues including ClinVar (Landrum, Lee et al. 2017), ClinGen (Rehm, Berg et al. 2015), and a manually curated set of pathogenic short tandem repeat expansions (Table 2-1) (Dolzhenko, Bennett et al. 2020). Gene models come from the RefSeq annotation database provided by the UCSC Genome Browser (NCBI Homo sapiens Annotation Release 109 (2018-03-29)) (Haeussler, Zweig et al. 2019).

Variant type | Source | Date Acquired
STR (Background) | https://github.com/gymreklab/GangSTR | 2019-07-22
STR (Pathogenic) | https://github.com/Phillip-a-richmond/STR_Analysis/tree/master/CompareSTRDatabases | 2019-07-22
Deletion (CNV, Pathogenic) | https://ftp.ncbi.nlm.nih.gov/pub/dbVar/sandbox/sv_datasets/nonredundant/deletions/ | 2019-07-22
Duplication (CNV, Pathogenic) | https://ftp.ncbi.nlm.nih.gov/pub/dbVar/sandbox/sv_datasets/nonredundant/duplications/ | 2019-07-22
SNV/Indel (Pathogenic + Benign + others) | ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/ | 2019-07-22

Table 2-1 - Data sources for variants within the database available for simulation.


2.2.3 Variant simulation walk through

The user interaction with the simulation process is stepwise and guided, and a video tutorial detailing the creation process is available at: http://genebreaker.cmmt.ubc.ca/more_info. The variant creation process is detailed below.

2.2.3.1 Initial Configuration

Starting at the Variant Designer page (http://genebreaker.cmmt.ubc.ca/variants): 1) User selects reference genome, proband sex, and enters a gene symbol into the ‘gene’ textbox; 2) User clicks “Fetch Transcripts” and all transcripts associated with the gene symbol appear, and the user selects a single transcript which is then displayed in the IGV browser; 3) User proceeds to variant creation by clicking “Next”.

2.2.3.2 Variant Creation

From the Variant 1 Info page: 1) User selects the “Region”, which includes a set of genomic regions: coding, UTR, intronic, and genic (anywhere in the body of the gene). These genomic regions restrict the set of possible variants to those which overlap the defined regions, e.g. coding selection means a variant will have to be created over coding regions of the selected gene; 2) User selects the variant “Type”, either choosing from predefined variant sets: clinvar, clingen copy number variant, and short tandem repeat; or novel creation methods: copy number variant, mobile element insertion, indel, single nucleotide variant; 3) User selects the “Zygosity”: heterozygous or homozygous.


For each of the variant classes, additional information will be required as input. All variant positions are represented in the 1-based half-open coordinate system. Users must choose positions for variants which overlap the selected regions. For clinvar and clingen copy number variant, clicking the “Fetch variants” box will populate the window with coordinate-sorted variants from the ClinVar or ClinGen databases which overlap the defined genomic region.

Clicking on a single variant line will select it and enable the user to proceed to the next page. For copy number variant, the user specifies the start and end positions, as well as the copy change (deletion or duplication). For mobile element insertion, the user specifies the start position and element type: LINE, ALU, or SVA. For indel, the user specifies the start position, and the length as a positive integer for insertion of random nucleotides, or as a negative integer for a deletion.
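The signed-length convention for indels can be illustrated with a minimal Python sketch. This is not GeneBreaker's code: the function name, its arguments, and the choice to treat the start position as the VCF anchor base (the base immediately before the event) are assumptions made for illustration only.

```python
import random


def make_indel_record(chrom, ref_seq, seq_start, pos, length, rng=None):
    """Return (CHROM, POS, REF, ALT) for an indel, written VCF-style.

    ref_seq   -- reference bases covering the region (hypothetical input)
    seq_start -- 1-based genomic coordinate of ref_seq[0]
    pos       -- 1-based coordinate of the anchor base preceding the event
                 (an assumption; the web interface may define it differently)
    length    -- positive: insert that many random bases;
                 negative: delete that many reference bases
    """
    if length == 0:
        raise ValueError("length must be a non-zero integer")
    rng = rng or random.Random()
    i = pos - seq_start  # 0-based index of the anchor base in ref_seq
    if i < 0 or i >= len(ref_seq):
        raise ValueError("pos lies outside the supplied reference sequence")
    anchor = ref_seq[i]
    if length > 0:
        inserted = "".join(rng.choice("ACGT") for _ in range(length))
        return chrom, pos, anchor, anchor + inserted
    deleted = ref_seq[i + 1:i + 1 - length]
    if len(deleted) != -length:
        raise ValueError("deletion runs past the supplied reference sequence")
    return chrom, pos, anchor + deleted, anchor
```

For example, with the toy sequence ACGTACGT starting at coordinate 100, a length of -3 anchored at position 102 gives REF "GTAC" and ALT "G".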

For single nucleotide variant, there are multiple SNV types which can be selected: stop-loss, missense, nonsense, synonymous, or simply creating alternate alleles from any of the four nucleotides A, T, C, and G. All effects are listed independent of the region selected, however the effects must comply with the region in order to proceed. As an example, selecting intronic region and nonsense variant will give an error. For nonsynonymous variant effects (stop-loss, missense, nonsense), the variant position must have the capacity to create an amino acid change. As an example, creating a nonsense (premature stop codon) SNV requires that altering the single base at the specified position will change the codon sequence to become one of the stop codons TAA, TAG, or TGA. To facilitate this, the three-frame translation in the IGV window can be displayed by zooming into the nucleotide level, and clicking the gear on the right side of the IGV window to select “Three-frame Translate”. For short tandem repeat, clicking on the “Fetch Short tandem repeats” box will populate the window with all short tandem repeats which overlap the genic region. Each STR has the repeat motif, and if the repeat is known to be pathogenic then it displays the number of repeat copies which have to be inserted to be considered damaging. After selecting a repeat, enter the repeat length as an integer value of the number of repeats you want to add or remove, using positive or negative integers respectively.
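The requirement that a position must be able to produce a stop codon can be checked with a short Python sketch. It assumes the standard nuclear genetic code, whose stop codons are TAA, TAG, and TGA, and a codon written in transcript orientation; the function name is hypothetical and this is not the check implemented by GeneBreaker.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard nuclear genetic code


def nonsense_alts(codon, offset):
    """List the alternate bases at `offset` (0, 1 or 2) that turn a sense
    codon into a stop codon. An empty list means no nonsense SNV is
    possible at that position."""
    codon = codon.upper()
    if len(codon) != 3 or offset not in (0, 1, 2):
        raise ValueError("expected a 3-base codon and an offset of 0, 1 or 2")
    if codon in STOP_CODONS:
        return []  # already a stop codon, so a change here is not nonsense
    alts = []
    for base in "ACGT":
        if base == codon[offset]:
            continue
        mutated = codon[:offset] + base + codon[offset + 1:]
        if mutated in STOP_CODONS:
            alts.append(base)
    return alts
```

Under these assumptions, the first base of CAG can be changed to T to give TAG, whereas no single change at the second base of GCC yields a stop codon.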

After creating a variant, the user can either repeat the process to create a second variant, or proceed to Family Info to guide the variant inheritance.

2.2.3.3 Inheritance modeling

After the creation of variants, a summary page with Family Info appears and a portion of the variant information is displayed including chromosome, position, reference allele sequence, and alternate allele sequence, adhering to specifications from VCF version 4.2 (https://samtools.github.io/hts-specs/VCFv4.2.pdf). Below the variant summary is information for each of the individuals in the family which will be included in the output, starting with information from the proband including sex, presence of variant 1 (Var1), presence of variant 2 (Var2), and affected status (check box). The user will then add family members by clicking on the “Add Family Member” box, choosing to add a mother, father, sister, or brother. After adding the individual, the user will select the boxes for Var1 and Var2 to indicate whether that individual has the variant or not, and their affected status. In the case of homozygous variants, selecting both Var1 and Var2 will add a homozygous variant for that individual. After adding multiple individuals and completing their variant/affected status, the output files can be downloaded by clicking on the “Download Outputs” box. Both the merged VCF containing variant records for all individuals, and the associated pedigree (PED) file can be downloaded.
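As an illustration of the pedigree output, the following Python sketch writes the six standard PED columns (family ID, individual ID, father ID, mother ID, sex coded 1 for male and 2 for female, and phenotype coded 1 for unaffected and 2 for affected). The function name, the dictionary keys, and the sample identifiers are hypothetical, and the exact layout of the file produced by GeneBreaker may differ.

```python
def ped_lines(family_id, members):
    """Format family members as tab-separated PED rows.

    Each member is a dict with the keys 'id', 'sex' ('male' or 'female')
    and 'affected' (bool); 'father' and 'mother' are optional and default
    to '0', the PED convention for an unknown or absent parent.
    """
    sex_code = {"male": "1", "female": "2"}
    lines = []
    for member in members:
        lines.append("\t".join([
            family_id,
            member["id"],
            member.get("father", "0"),
            member.get("mother", "0"),
            sex_code[member["sex"]],
            "2" if member["affected"] else "1",
        ]))
    return lines


# A trio with an affected male proband and unaffected parents.
trio = [
    {"id": "proband", "father": "father", "mother": "mother",
     "sex": "male", "affected": True},
    {"id": "father", "sex": "male", "affected": False},
    {"id": "mother", "sex": "female", "affected": False},
]
```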


2.2.4 Bury the variant within trio setting for testing prioritization approaches

Simulation of variants within read sets can be performed in two primary ways: 1) incorporating variants into a reference genome sequence (FASTA format) and then simulating reads from the sequence; or 2) incorporating variants directly into mapped read files. While several tools exist for reference-based incorporation and read simulation, we chose to use VarSim (Mu, Mohiyuddin et al. 2015) due to its ease of use and capacity to simulate other variants alongside the pathogenic variants of interest. Variants were incorporated using VarSim’s default configuration with the hg19 (hs37d5) reference genome, background variants from dbSNP common variants version 150 (https://ftp.ncbi.nih.gov/snp/pre_build152/organisms/human_9606_b150_GRCh37p13/), and DGV variants (from VarSim’s installation script). Before variant incorporation, VCF files from GeneBreaker may need to be reformatted if they contain a mobile element insertion using the reformatForVarSim.py script (https://github.com/wassermanlab/GeneBreaker/blob/master/BenchmarkingTransition/FullSimulation/reformatSimToVarSim.py).

BamSurgeon is a tool for incorporating variants directly into mapped read files (Ewing, Houlahan et al. 2015). GeneBreaker VCF files can be parsed for use within BamSurgeon. We chose to demonstrate the utility of GeneBreaker with VarSim, although both approaches are feasible.

2.2.5 Incorporate variants into reads to test detection capacity

Beyond simulating variants for benchmarking variant calling approaches, there is also utility in testing variant prioritization methods. To facilitate this, we collected open source trio PCR-free WGS data from the Polaris project (https://github.com/Illumina/Polaris), and processed it using a standard approach with current tools. The data were mapped against the reference genomes GRCh37 (http://www.bcgsc.ca/downloads/genomes/9606/hg19/1000genomes/bwa_ind/genome/GRCh37-lite.fa) and GRCh38 (ftp://ftp.ensembl.org/pub/release-96/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz) using BWA mem (v0.7.17) with default settings (Li and Durbin 2009). Output SAM files were converted into BAM and sorted using Samtools (v1.9) (Li, Handsaker et al. 2009). Variant calling was done using DeepVariant (Poplin, Chang et al. 2018). Visualizing the mapped reads was done using Integrative Genomics Viewer (IGV) (v2.4.10) (Robinson, Thorvaldsdóttir et al. 2011). We applied the same mapping and conversion procedure, using the GRCh37 reference genome, to simulated data from VarSim.
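The mapping and conversion steps can be sketched in Python as a list of commands. This is a minimal sketch that only assembles the command lines with default settings; the file names are placeholders, the helper names are hypothetical, and the scripts actually used for this work may be organized differently.

```python
import subprocess


def mapping_commands(reference, reads_1, reads_2, sample):
    """Return (argv, stdout_path) pairs for mapping, sorting and indexing.

    bwa mem writes SAM to standard output, so the first step carries the
    path that its output should be redirected to.
    """
    sam = f"{sample}.sam"
    bam = f"{sample}.sorted.bam"
    return [
        (["bwa", "mem", reference, reads_1, reads_2], sam),
        (["samtools", "sort", "-o", bam, sam], None),
        (["samtools", "index", bam], None),
    ]


def run(steps):
    """Execute the steps in order, stopping at the first failure."""
    for argv, stdout_path in steps:
        if stdout_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdout_path, "w") as out:
                subprocess.run(argv, stdout=out, check=True)
```

Calling run(mapping_commands(...)) requires bwa and samtools to be installed and the reference to be indexed beforehand with bwa index.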

The output is a set of VCF files, one per individual, which serve as background variants for both male and female probands (children) and their parents. These variants are deposited in the online repository alongside other full simulations (see Data and Availability). Combining the background variants with the pathogenic variants from the GeneBreaker tool is managed using bash scripts which match the sex and reference genome. These scripts utilize standard tools including bcftools (Li 2011), htslib (bgzip and tabix), and a custom reformatting script (https://github.com/wassermanlab/GeneBreaker/blob/master/BenchmarkingTransition/BuryVariant/reformatSimToDeepVariant.py).


2.3 Results

2.3.1 Simulator-created rare disease scenarios

We created rare disease scenarios of varying difficulty to test the efficacy of GeneBreaker simulations and for use within benchmarking scenarios (Table 2.2). These simulations cover different modes of inheritance, variant classes, and genic impacts. The first set of variants covers several inheritance models for combinations of coding variants, either designed by hand or extracted from ClinVar and other published works. Each of the variants in the table was synthetically generated using the online GeneBreaker interface in either the GRCh37 or GRCh38 genome (Table 2.2). The variants were then assigned to proband, mother, and father according to the inheritance pattern. Lastly, the variants were embedded in the background of open source trios with matched proband sex (Methods). The output from the embedding process is a merged VCF file, which can then be used as input for testing common prioritization workflows, such as the commercial package VarSeq or the open source Exomiser tool (Robinson, Köhler et al. 2014).

Beyond creating cases to test inheritance models, we also demonstrate the ability of GeneBreaker to create combinations of variants that are emerging out of anecdotal reports. These variant combinations span multiple classes and genic impacts, and are responsible for the missing heritability in undiagnosed cases (Maroilley and Tarailo-Graovac 2019). These include a set of cases where pathogenic SNVs and indels lie beyond the coding regions of the gene, and a set of variants which include CNVs, STRs, and MEIs (Table 2.4).


Lastly, we also designed variants within the “dark regions” of the genome, or regions that are inaccessible to standard variant calling pipelines (Goldfeder, Priest et al. 2016, Ebbert, Jensen et al. 2019). We consider these variants important to simulate due to the need to evaluate results from emerging methods capable of genotyping within such regions (Table 2.4).

Inheritance | Proband Sex | Var1 class | Var1 impact | Var1 source | Var2 class | Var2 impact | Var2 source | Gene | Exomiser rank (rank in non-matching inheritance category)
autosomal dominant maternal | Male | SNV | missense | published: 28111307 | - | - | - | JAK1 | 2
autosomal dominant de novo | Male | SNV | missense | clinvar: 91028 | - | - | - | MSH2 | 1
autosomal recessive homozygous | Male | SNV | missense | published: 24332264 | - | - | - | MALT1 | 1
autosomal recessive compound heterozygous | Female | SNV | missense | novel | indel | frameshift | novel | CFTR | 1
autosomal recessive compound heterozygous de novo | Male | SNV | stop-loss | novel | SNV | missense | clinvar: 217654 | INPP5E | N/A (7)
X-linked dominant de novo | Female | indel | frameshift | novel | SNV | nonsense | novel | MECP2 | 1
X-linked recessive homozygous | Male | SNV | missense | published: 32202653 | - | - | - | WAS | 1
X-linked recessive compound heterozygous de novo | Female | indel | frameshift | novel | SNV | nonsense | clinvar: 11696 | SLC6A8 | N/A (2)
X-linked recessive hemizygous de novo | Male | indel | frameshift | clinvar: 11303 | - | - | - | ABCD1 | 2
Y-linked de novo | Male | indel | coding | novel | - | - | - | SRY | N/A (N/A)

Table 2-2 - A set of rare disease scenarios for small variants.

These scenarios were designed covering a range of Mendelian inheritance patterns, and are comprised of either one or two variants affecting a single gene depending on the inheritance pattern. As many inheritance patterns are sex-specific, the intended proband sex is included. Each variant has the class (SNV or indel), impact on the gene, and source. For variants from the literature or ClinVar, the ClinVar ID or PubMed ID is provided. Exomiser rank is provided, counting the rank of the simulated gene amongst its matching inheritance gene category, and in parentheses is the rank within other inheritance categories.

2.3.2 Prioritization testing

To the best of our knowledge, there are no currently available open source tools for prioritizing combinations of different classes of pathogenic variants affecting the same gene. However, the Exomiser tool is a fast and easy-to-use method that can prioritize coding SNVs and indels for Mendelian rare genetic disease cases, and requires as input a merged VCF file, a pedigree (PED) file, and a set of Human Phenotype Ontology (HPO) terms (Robinson, Köhler et al. 2008, Robinson, Köhler et al. 2014). The HPO terms for a set of selected genes were chosen by matching each gene to a disease using OMIM, and then selecting 4-7 HPO phenotype terms which were common for that disease (https://hpo.jax.org) (Table 2.3). We simulated each of the inheritance testing cases (Patients 1-10) and searched the output from Exomiser for the known gene using the command line tool grep (e.g. “grep -w ABCD1 -n ExomiserOutput_AR_genes.tsv”).
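The grep search can be reproduced with a small Python sketch that reports the line number of every whole-word match, approximating the behaviour of the -w and -n options. The function name is hypothetical, and the sketch makes no assumption about the column layout of the Exomiser output beyond it being a line-oriented text file. Any header lines count toward the reported line number, so they must be subtracted to obtain a rank.

```python
import re


def find_gene_lines(lines, gene):
    """Return (line_number, line) for each line containing `gene` as a
    whole word, with 1-based line numbers as printed by `grep -n`."""
    # Lookarounds reject matches flanked by letters, digits or underscores,
    # so searching for ABCD1 does not match ABCD1P1.
    pattern = re.compile(r"(?<!\w)" + re.escape(gene) + r"(?!\w)")
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if pattern.search(line)
    ]


def find_gene_in_file(path, gene):
    """Apply find_gene_lines to a text file on disk."""
    with open(path, encoding="utf-8") as handle:
        return find_gene_lines(handle, gene)
```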

The causal variant was annotated correctly for both user-created SNVs and indels, confirming that our simulation framework for creating novel variants is functional. In seven out of ten cases, the variant was prioritized in the correct inheritance category and was ranked in the top two candidates at the gene level (Table 2.2). The two scenarios (SLC6A8 and INPP5E) with compound heterozygous de novo inheritance patterns caused issues with Exomiser’s interpretation. In a compound heterozygous de novo scenario, one allele of a gene carries a variant inherited from a parent, and a de novo mutation disrupts the other allele of the same gene. In both these scenarios, the variants were ranked in the top 10 for dominant inheritance, likely due to the de novo variant taking priority. For instance, the SLC6A8 gene did not show up in the X-linked recessive candidate gene list, but it ranked second in the X-linked dominant gene list.

Interestingly, the variant created on the Y chromosome in the SRY gene, which is responsible for 46 XY Sex Reversal 1, was not prioritized by Exomiser. It is unclear at which stage this variant was dropped as a result of Exomiser’s inheritance and pathogenicity filtering.

Inheritance | Gene | HPO terms
autosomal dominant maternal | JAK1 | HP:0000964; HP:0001047; HP:0001880; HP:0001508; HP:0032064
autosomal dominant de novo | MSH2 | HP:0200008; HP:0001250; HP:0001276; HP:0012378; HP:0002027; HP:0001276
autosomal recessive homozygous | MALT1 | HP:0000964; HP:0001047; HP:0001581; HP:0004386; HP:0002090; HP:0002205
autosomal recessive compound heterozygous | CFTR | HP:0002613; HP:0002721; HP:0002024; HP:0002206; HP:0002205; HP:0001738
autosomal recessive compound heterozygous de novo | INPP5E | HP:0001252; HP:0001251; HP:0001263; HP:0002553; HP:0001320; HP:0002793
X-linked dominant de novo | MECP2 | HP:0001250; HP:0001257; HP:0005484; HP:0002187
X-linked recessive homozygous | WAS | HP:0000964; HP:0001047; HP:0001880; HP:0001508; HP:0032064
X-linked recessive compound heterozygous de novo | SLC6A8 | HP:0001290; HP:0001270; HP:0000252; HP:0000718; HP:0008583; HP:0000540; HP:0008583
X-linked recessive hemizygous de novo | ABCD1 | HP:0001268; HP:0001250; HP:0000709; HP:0008207; HP:0002180; HP:0002500
Y-linked de novo | SRY | HP:0012245; HP:0011969; HP:0000032; HP:0000098

Table 2-3 - HPO terms for Exomiser testing.

HPO terms used to pair with each gene across different inheritance patterns.

2.3.3 Testing variant calling

As Exomiser is not currently equipped to prioritize CNVs, STRs, and MEIs, we demonstrate the efficacy of their incorporation by simulating and visualizing a full WGS dataset using the larger variants and the set of variants within the dark regions of the genome (Table 2.4). Using VarSim, the ten variants from the CNV/MEI/STR and Dark Genome categories were simulated in a single VarSim run. Each of the regions where variants were integrated was visualized with the Integrative Genomics Viewer (IGV) (Robinson, Thorvaldsdóttir et al. 2011) to validate that the variant was simulated correctly at the read level (Figures 2.2 & 2.3). An example of the heterozygous duplication of part of the DMD gene shows the expected increase in read coverage over the simulated variant (Figure 2.2A), and the LINE1 transposable element insertion within the intron of SLC17A5 has the expected signal of soft-clipped reads both upstream and downstream of the insertion site (Figure 2.2B). For the variants in the dark genome, a homozygous deletion in SMN2 can be visualized, even though the observed reads are not mapping uniquely to the region (Figure 2.3A). Lastly, a four base-pair coding deletion within CFC1 appears in the ambiguously mapped reads, confirming previous reports that specialized methods may be able to locate deletions within these dark and camouflaged regions (Ebbert, Jensen et al. 2019) (Figure 2.3B).
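The coverage change expected over a simulated copy number variant follows from simple copy counting, as the Python sketch below illustrates. It assumes a diploid locus with uniquely mapped reads, which does not hold in dark regions where reads map ambiguously; the function names are hypothetical.

```python
from statistics import mean


def expected_depth_ratio(copy_change, ploidy=2):
    """Expected read depth over a CNV relative to unaffected flanking
    sequence: +1 for a heterozygous duplication, -1 for a heterozygous
    deletion, -2 for a homozygous deletion at a diploid locus."""
    if ploidy + copy_change < 0:
        raise ValueError("copy number cannot be negative")
    return (ploidy + copy_change) / ploidy


def observed_depth_ratio(variant_depths, flank_depths):
    """Mean depth inside the variant divided by mean depth in the flanks."""
    return mean(variant_depths) / mean(flank_depths)
```

A heterozygous duplication is therefore expected to show about 1.5 times the flanking depth, and a homozygous deletion a depth near zero.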


Noncoding SNVs & Indels

Inheritance | Proband Sex | Var1 class | Var1 impact | Var1 source | Var2 class | Var2 impact | Var2 source | Gene
autosomal recessive compound heterozygous | M | SNV | Intronic | clinvar: 99243 | indel | frameshift | novel | ABCA4
autosomal recessive compound heterozygous | F | SNV | Intronic | clinvar: 6770 | SNV | nonsense | novel | PEX10
autosomal dominant de novo | M | indel | Intronic | clinvar: 411336 | - | - | - | APC
autosomal dominant de novo | F | SNV | UTR | clinvar: 556698 | - | - | - | MMACHC
autosomal recessive compound heterozygous | M | SNV | UTR | clinvar: 412264 | SNV | nonsense | clinvar: 523079 | BBS9

CNVs, MEIs, STRs

Inheritance | Proband Sex | Var1 class | Var1 impact | Var1 source | Var2 class | Var2 impact | Var2 source | Gene
autosomal dominant de novo | Female | CNV (partial DEL) | coding | novel | - | - | - | SLC2A1
autosomal recessive compound heterozygous | Female | CNV (full DEL) | coding | novel | indel | frameshift | novel | HEXA
autosomal recessive homozygous | Female | MEI | intron | published: 28187749 | - | - | - | SLC17A5
X-linked dominant de novo | Female | MEI | coding | novel | - | - | - | IKBKG
autosomal recessive homozygous | Female | STR | UTR | published: 30970188 | - | - | - | GLS
X-linked recessive compound heterozygous | Female | CNV | coding | novel | SNV | nonsense | clinvar: 282841 | DMD

Variants in the Dark Genome

Inheritance | Proband Sex | Var1 class | Var1 impact | Var1 source | Var2 class | Var2 impact | Var2 source | Gene
autosomal recessive homozygous | Female | CNV (partial DEL) | coding | published: 32066871 | - | - | - | SMN1
autosomal dominant de novo | Female | Indel | frameshift | novel | - | - | - | CFC1
autosomal dominant de novo | Female | indel | frameshift | novel | - | - | - | MAF
autosomal dominant de novo | Female | indel | frameshift | novel | MEI | coding | novel | RPGR

Table 2-4 - Simulated challenging variant scenarios.

Simulations for Noncoding variants; CNVs, MEIs, and STRs; and Variants in the Dark Genome. The inheritance, sex, and variant information for variant 1 and variant 2 including the class, impact, and source.

Figure 2-2 - IGV snapshots of simulated complex variants.

IGV snapshots of a heterozygous duplication in DMD (A), and a mobile element insertion in SLC17A5 (B).

Each image has the simulated variant (red), the mapped simulated reads, and the affected gene.

Figure 2-3 - IGV snapshots of simulated variants in the Dark Genome.

Snapshots of a homozygous deletion in SMN2 (A), and a four base-pair deletion in CFC1 (B). Each image has the simulated variant (red), the mapped simulated reads, and the affected gene.

2.4 Discussion

The diagnosis of rare genetic diseases will continue to improve as novel methods for calling, interpreting, and prioritizing variants emerge and become deployed in a diagnostic setting. Many variants that were previously challenging to genotype, because of their complexity or their location within “dark” genomic regions, are now regularly identified in WGS datasets. Consequently, there is a critical need for testing the improved pipelines to ensure that such improvements are implemented correctly. While real patient data is paramount for testing the efficacy of a pipeline, access to such data is often restricted by data use agreements and is limited by the number of observed cases with available data. With simulation, a virtually unlimited number of combinations of genomic variants can be designed, from nonsense SNVs to intronic MEIs, and every combination in between.

The introduction of GeneBreaker addresses an unmet need for simulated rare disease cases in a broadly accessible format. GeneBreaker is deployed as a free-to-use website, enabling user creation of pathogenic variants. Downstream of variant simulation, the tool also supports the transition into benchmarking either variant interpretation or variant calling analysis pipelines.

We tested the efficacy of GeneBreaker and the downstream benchmarking transition by simulating rare disease scenarios covering different inheritance models, variant classes, genomic regions, and genic impacts. Using Exomiser, we validated that the variants we simulate have the expected impact, and exposed some limitations in the ability to correctly prioritize variants with challenging inheritance patterns, such as the compound heterozygous de novo pattern. Larger, more complex variants were visualized in IGV to ensure their correct simulation, confirming that VarSim is a viable option for whole dataset simulation to test variant calling capacity. All of the simulated cases are available online (http://genebreaker.cmmt.ubc.ca/premade_cases), and can serve as a starting point for benchmarking.

GeneBreaker is not intended as another genome sequencing data simulator. Such simulation has been broadly explored over the past decade, with a rich collection of tools available (Escalona, Rocha et al. 2016). The scope of GeneBreaker is deliberately narrow: the creation of simulated rare disease genomes. This is achieved by focusing on the generation of diverse forms of genetic disruptions, which can be embedded within real or simulated genome sequencing data. This focus has particular value for two use cases: testing of analysis pipelines and training of interpretation specialists.

When it comes to changing software within a clinical diagnostic pipeline, even if the change only involves upgrading to a newer version of an existing package, there must be rigorous testing to ensure that the modifications do not break the pipeline. Any modification to a standard operating procedure must be tested to ensure that it is still capable of performing at or above the existing diagnostic capacity. With the increased adoption of the reference genome version GRCh38, many pipelines currently utilizing older reference genomes will need mechanisms to test their correct functionality in GRCh38 before the transition. Adopting a new reference genome can sometimes have unanticipated side-effects, as was highlighted with an analysis of missing variant calls from realigning WGS datasets in the UK Biobank (Jia, Munson et al. 2020). A diverse set of variant scenarios involving several inheritance models, genic impacts, and a mix of novel and known pathogenic variants has utility for testing these continuously updated pipelines, without the challenge of handling sensitive patient data.

GeneBreaker has value beyond benchmarking as a resource for training a new generation of genome analysts. Rare genetic diseases affect a sizable portion of the population, and as WGS moves into the standard of care, many medical professionals will need hands-on training in the utilization of this technology. Having synthetic cases, either at the merged variant set or raw data levels, is imperative. We encourage those developing educational materials for the analysis of rare disease genomes to consider using the simulation capacity of GeneBreaker as a training tool.

Future work on GeneBreaker will focus on expanding the simulation capacity to include additional complex variant classes (e.g. inversions and translocations) and variants outside genic regions that are known to disrupt gene regulation. Examples of disruptions to regulatory elements include mutations affecting chromatin organization and enhancers (Smith and Shilatifard 2014, Lupiáñez, Spielmann et al. 2016). A major challenge in extending to regulatory elements is that the relevant genomic regions critical for the regulation of a gene are difficult to define. As these genome annotations improve, we look forward to integrating them into GeneBreaker.

We hope that GeneBreaker is adopted by the growing community of researchers and clinicians who are utilizing WGS in the diagnosis of rare genetic diseases. Feedback is appreciated as we continue to improve the software and simulation capacities. GeneBreaker is available for exploration at: http://GeneBreaker.cmmt.ubc.ca.


Chapter 3: Whole genome sequencing enables discovery of noncoding pathogenic short tandem repeat expansion

3.1 Introduction

Over the past decade, the accelerated discovery of disease-causing genes and an increased frequency of diagnostic success in patients with rare Mendelian diseases (from approximately 10% with the use of traditional genetic testing to approximately 40% with the use of genome-wide testing) have been facilitated by innovation in high-throughput sequencing, as well as integration of multidisciplinary approaches into accurate data interpretation (Tarailo-Graovac, Shyr et al. 2016, Wright, FitzPatrick et al. 2018).

Despite these advances, establishing a diagnosis for such conditions remains a challenge. In most phenotypic categories, more than 50% of the patients with a rare disease lack a molecular diagnosis (Wright, FitzPatrick et al. 2018). One reason for this shortcoming may be that exome sequencing, the preferred method to date, captures only a small portion (<2%) of the genome and is very limited in its capacity to detect copy-number variants, insertions, translocations, inversions, tandem repeat expansions, noncoding and deep-intronic variants, and variants within complex regions of the genome. Indeed, emerging studies have shown that unlike exome sequencing, genome sequencing has the potential to detect all classes of genetic variation and thus has the capacity to increase the diagnostic yield above that of exome sequencing (Gilissen, Hehir-Kwa et al. 2014, Tarailo-Graovac, Drögemöller et al. 2017, Alfares, Aloraini et al. 2018, Guéant, Chéry et al. 2018, Lionel, Costain et al. 2018).


However, the immense amounts of data generated by genome sequencing and limitations of reference genomes impede the detection of such variations. The interpretation of genome-sequencing data can be strengthened through the incorporation of a targeted approach, which harnesses information obtained from phenotypic and functional characterization of patients to focus on specific families of genes. Here, we describe detailed clinical and biochemical phenotyping and genome sequencing in three unrelated patients with similar phenotypes, including prog to identify a novel repeat expansion disorder that causes a deficiency of glutaminase, an enzyme important for neurogenesis and neurotransmission.

3.2 Case Report

Three unrelated patients, all of whom were born to nonconsanguineous white parents of European ancestry after uncomplicated pregnancy and delivery, presented at the age of 3 years (Patients 1 and 2, from the United Kingdom) and 2 years (Patient 3, from Canada) with early-onset delay in both gross and fine motor skills, as well as delayed speech (Figure 3.4A). In all three patients, ataxia developed, along with dependency on the use of a wheelchair or walker. Early brain imaging did not reveal abnormalities; however, at 11 years of age, Patient 3 was found to have cerebellar atrophy on repeated magnetic resonance imaging.

3.3 Methods

3.3.1 Details of patient exome sequencing

Duo-exome sequencing (mother-proband) and trio-exome sequencing (mother-father-proband) analysis for Families 1 and 2 of the TIDEX study were performed using the TruSeq DNA PCR Free (350) kit and Illumina HiSeqX (Macrogen, Korea) through the TIDEX gene discovery project (UBC IRB approval H12-00067). For exome analysis of Family 1 and Family 2, we updated a previously described pipeline for the identification of causal variants from next-generation sequencing data (details below) (Tarailo-Graovac, Shyr et al. 2016). Executable shell scripts for the analysis of the exome data are provided online (https://github.com/Phillip-a-richmond/AnnotateVariants/GLS_Manuscript/). The output candidate variant lists (CVLs) for both Family 1 and Family 2 were manually inspected for variants which fit the phenotype and inheritance pattern. Analysis of variants which fit the inheritance patterns did not identify any fully-penetrant candidates that match the phenotype. Therefore, we examined variants which are identified in the affected proband and meet filters for quality (overlapping Illumina Platinum Genome Confident Regions, not overlapping segmental duplications, DP[*] <= 300, AD[*:1] >= 10, genotype quality >= 30), pathogenicity (GEMINI HIGH/MED impact) and rarity (global gnomAD AF<0.01, global gnomAD Homozygotes <=10). Family 3 from Care4Rare was subjected to trio exome sequencing. Exome target enrichment was performed with the Agilent SureSelect 50 Mb (V3) All Exon Kit; samples were sequenced on the Illumina HiSeq 2000 platform, multiplexing three samples per lane. After removal of duplicate reads, the mean coverage of coding sequence regions ranged from ×70 to ×200. Alignment and variant annotation were performed by the FORGE informatics team at each STIC, using comparable analytical pipelines with publicly available tools and custom scripts as described previously (Beaulieu, Majewski et al. 2014).
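The quality, pathogenicity, and rarity filters listed above can be expressed as a single predicate, sketched here in Python. The record type and its field names are hypothetical and do not correspond to the pipeline's actual data structures; only the thresholds are taken from the text.

```python
from dataclasses import dataclass


@dataclass
class ProbandCall:
    """One variant call in the affected proband (hypothetical field names)."""
    depth: int                  # DP, total read depth
    alt_depth: int              # AD for the alternate allele
    genotype_quality: int       # GQ
    impact_severity: str        # GEMINI impact class: HIGH, MED or LOW
    gnomad_af: float            # global gnomAD allele frequency
    gnomad_homozygotes: int     # global gnomAD homozygote count
    in_confident_region: bool   # overlaps Platinum Genome confident regions
    in_segmental_duplication: bool


def passes_filters(call):
    """Apply the quality, pathogenicity and rarity thresholds from the text."""
    quality = (
        call.in_confident_region
        and not call.in_segmental_duplication
        and call.depth <= 300
        and call.alt_depth >= 10
        and call.genotype_quality >= 30
    )
    pathogenic = call.impact_severity in {"HIGH", "MED"}
    rare = call.gnomad_af < 0.01 and call.gnomad_homozygotes <= 10
    return quality and pathogenic and rare
```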

3.3.2 Pipeline for the analysis of rare disease WES/WGS data

An in-house, open-source bioinformatics pipeline was used to analyze the sequencing from Families 1 and 2. Development of the pipeline focused on capturing the same capacities as the previously described pipeline used by the Treatable Intellectual Disability Endeavour (TIDE) project (Tarailo-Graovac, Shyr et al. 2016), with improvements to the tools used and increased flexibility to add variant annotations without the need for stepwise filtering.

First, changing the pipeline from outdated mapping and variant calling tools, namely Bowtie2 (Langmead and Salzberg 2012) and Samtools mpileup v0.1.19 (Li, Handsaker et al. 2009), synchronizes the analysis with large consortia which produce population databases of genomic variation (e.g. gnomAD) (Karczewski, Francioli et al. 2020). This synchronization, primarily shifting to BWA mem (Li and Durbin 2009) for read mapping and the Genome Analysis Toolkit (McKenna, Hanna et al. 2010) for variant calling, allows for the more accurate matching of variant calls between population databases and local patient variant call format (VCF) files. BWA mem also has the ability to soft-clip reads, which enables the discovery of complex variants, including those with non-reference sequence longer than the read length.

Additionally, comprehensive variant annotation has been recognized as an important component of variant interpretation for rare disease diagnosis (Eilbeck, Quinlan et al. 2017), so the updated pipeline annotates all variants rather than filtering them stepwise. This change was necessary because the previous pipeline performed stepwise filtering based on variant effects, such as having nonsynonymous coding impact, which prevented investigation into candidate noncoding variants. The annotation process also changed from relying on locally hosted MySQL databases for storing variant annotations, to utilizing an ultra-fast VCF annotation tool which leverages compressed and indexed text files for variant annotation (Pedersen, Layer et al. 2016). This shift has enabled the rapid adoption of new in silico scoring metrics, including PrimateAI (Sundaram, Gao et al. 2018) and TrAP (Gelfman, Wang et al. 2017), which assess variant deleteriousness through conservation and splicing respectively. It also enables the efficient querying of the gnomAD database, which receives yearly updates to its variant call set (Karczewski, Francioli et al. 2020).

Altogether, this pipeline includes numerous open source tools, and is centered around the construction of a fully annotated variant database stored in the SQLite format produced by the GEMINI framework (Paila, Chapman et al. 2013). Both ad-hoc and inheritance pattern queries are possible with the GEMINI database, and the data can be processed without leaving a secure environment (e.g. not uploading variants to an external server). Inheritance pattern queries, and queries for general damaging variants, are automated and combined with an Excel template to produce a candidate variant list (CVL). Details can be found at https://github.com/Phillip-a-richmond/AnnotateVariants.

The updated pipeline is as follows (default options unless specified below; citations are provided in Table 3.1):

1) Map pair-end reads to GRCh37 using BWA mem (v0.7.12) with option –M a. (http://www.bcgsc.ca/downloads/genomes/9606/hg19/1000genomes/bwa_ind/genome/)

2) Convert SAM files to BAM, sort, and index using Samtools

3) Duplicate marking using Picard MarkDuplicates

4) Local indel realignment using GATK RealignerTargetCreator and IndelRealigner

5) GVCF calling using GATK HaplotypeCaller (with –emitRefConfidence GVCF option)

6) Multi-sample variant calling from GVCF files using GATK GenotypeGVCFs

7) Variant normalization and splitting using VT normalize and decompose

8) Variant annotation against genes using SNPEff 44

9) Compression and indexing of variants using bgzip and tabix

10) Variant filtering using BCFTools filter with soft filter: --include 'FORMAT/AD[*:1]>=10 && FORMAT/DP[*] < 300'

11) Variant annotation against various databases (Table 3.1) with VCFAnno

12) Conversion of annotated VCF to SQLite database with VCF2DB

13) Queries against SQLite database with GEMINI for inheritance patterns, rare and potentially damaging variants

14) Formatting and addition of gene-based databases using GeminiTable2CVL.py
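The soft filter in step 10 can be read as follows. This is a minimal Python sketch of the expression's logic on toy per-sample values, assuming the documented bcftools behaviour that, with `&&`, the two conditions may be satisfied by different samples. The function name and the dictionary layout are hypothetical; in the pipeline itself BCFTools evaluates the expression directly on the VCF.

```python
# Sketch of: --include 'FORMAT/AD[*:1]>=10 && FORMAT/DP[*] < 300'
# Assumption: a site passes if ANY sample has >= 10 reads supporting the
# first alternate allele AND ANY sample (not necessarily the same one)
# has a total depth below 300.

def passes_soft_filter(samples, min_alt_depth=10, max_depth=300):
    """samples: list of dicts with 'AD' (allele depths, reference first) and 'DP'."""
    alt_supported = any(
        len(s["AD"]) > 1 and s["AD"][1] >= min_alt_depth for s in samples
    )
    depth_ok = any(s["DP"] < max_depth for s in samples)
    return alt_supported and depth_ok


trio_ok = [{"AD": [12, 15], "DP": 27}, {"AD": [30, 0], "DP": 30}]
weak_alt = [{"AD": [30, 3], "DP": 33}]
too_deep = [{"AD": [400, 200], "DP": 600}]

print(passes_soft_filter(trio_ok), passes_soft_filter(weak_alt), passes_soft_filter(too_deep))
```

Because the filter is applied as a soft filter, failing sites would be flagged rather than removed, so they stay available for later inspection.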


| Tool | Version/Date | PMID (or bioRxiv/arXiv) | Source |
|---|---|---|---|
| BWA mem | 0.7.12 | arXiv:1303.3997v1 | https://sourceforge.net/projects/bio-bwa/files/ |
| Samtools | 1.2 | (Li, Handsaker et al. 2009) | http://www.htslib.org/download/ |
| Picard | 1.139 | . | https://broadinstitute.github.io/picard/ |
| GATK | 3.4-46 | (Pedersen, Layer et al. 2016) | https://software.broadinstitute.org/gatk/download/ |
| VT | 0.5772 | . | https://genome.sph.umich.edu/wiki/Vt |
| SNPEff | 4.11 (gene: GRCh37.75) | (Cingolani, Platts et al. 2012) | http://snpeff.sourceforge.net/download.html |
| BCFTools | 1.8 | . | http://www.htslib.org/download/ |
| bgzip | 0.2.5 | . | http://www.htslib.org/download/ |
| tabix | 0.2.5 | . | http://www.htslib.org/download/ |
| VCFAnno | 0.2.8 | (Pedersen, Layer et al. 2016) | https://github.com/brentp/vcfanno |
| VCF2DB | Jan 22, 2018 | . | https://github.com/quinlan-lab/vcf2db |
| GEMINI | 0.19.1 | (Paila, Chapman et al. 2013) | https://gemini.readthedocs.io/en/latest/ |
| ExpansionHunter | 2.5.3 | | |

| Database | Version/Date | PMID (or bioRxiv/arXiv) | Source |
|---|---|---|---|
| GRCh37 | Aug 2017 | . | http://www.bcgsc.ca/downloads/genomes/9606/hg19/1000genomes/bwa_ind/genome/ |
| CADD | 1.3 | (Kircher, Witten et al. 2014) | https://cadd.gs.washington.edu |
| PolyPhen2 | 2.2.2 | (Adzhubei, Jordan et al. 2013) | ftp://genetics.bwh.harvard.edu/pph2/whess/polyphen-2.2.2-whess-2011_12.tab.tar.bz2 |
| PrimateAI | 0.2 | (Sundaram, Gao et al. 2018) | https://github.com/Illumina/PrimateAI |
| gnomAD (exome) | 2.0.2 | (Lek, Karczewski et al. 2016) | http://gnomad.broadinstitute.org/downloads |
| gnomAD (genome) | 2.0.2 | (Lek, Karczewski et al. 2016) | http://gnomad.broadinstitute.org/downloads |
| GEMINI Databases* | 0.19.1 | (Paila, Chapman et al. 2013) | https://gemini.readthedocs.io/en/latest/ |
| RLCRs | April 2018 | (Trost, Walker et al. 2018) | http://tcag.ca/documents/projects/RLCRs_no_Repeat_Masker.zip |
| ConfidentRegions | April 2018 | (Eberle, Fritzilas et al. 2017) | ftp://ussd-ftp.illumina.com/2017-1.0/hg19/small_variants/ |
| OMIM | Aug 2018 | . | http://data.omim.org/downloads/ |
| RefSeq Gene Summary | Oct 2016 | . | ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/RefSeqGene/ |
| RVIS | Release 3, Mar 2016 | (Petrovski, Wang et al. 2013) | http://genic-intolerance.org |
| pLI | Release 0.3.1, Mar 2016 | (Lek, Karczewski et al. 2016) | https://decipher.sanger.ac.uk/info/loss-intolerance |
| HPO | May 2018 | (Köhler, Vasilevsky et al. 2017) | https://hpo.jax.org/app/download/annotation |
| MeSHOPs | Release 1, July 2010 | (Cheung, Ouellette et al. 2012) | http://www.meshops.oicr.on.ca/meshop/index.html |

Table 3-1 - Information on the bioinformatics pipeline.

Contents include software and database versions, acquisition dates, PubMed identifiers (PMIDs), and arXiv citations where applicable.


3.3.3 Genome sequencing

Singleton genome sequencing was performed for the proband of Family 1 in search of a second variant in the GLS gene, expected to be inherited from the mother. For the genome sequencing, we also used the TruSeq DNA PCR Free (350) kit and Illumina’s HiSeq X sequencer at Macrogen, Korea. The genome sequencing data were re-analyzed using the same approach described above, and again the single missense variant in GLS was considered the best candidate. Subsequent manual inspection of the cis-regulatory elements associated with GLS using Integrative Genomics Viewer (IGV v.2.4.10) revealed a potential GCA repeat within the 5’ UTR. The recently published ExpansionHunter (Dolzhenko, van Vugt et al. 2017) software (details below) was used to analyze the locus in further detail and identified a GCA repeat expansion with at least 94 copies on the chromosome the proband had inherited from the mother, while the copy inherited from the father had 8 GCA copies as well as the original missense variant.

3.3.4 Genotyping GLS locus with ExpansionHunter

For Patient 1, ExpansionHunter (v.2.5.3) was run on the duplicate-removed, realigned genome sequencing reads (described above). We chose ExpansionHunter for this case because it can detect larger repeat expansions (Dolzhenko, van Vugt et al. 2017), which can be disruptive when located within the UTRs of genes. Because expanded GCA/CAG/AGC repeats may be present at multiple loci, we ran ExpansionHunter in a conservative mode that only analyzes reads that can be confidently placed at the targeted locus (chromosome 2:191745599-191745646, build GRCh37). This is enabled by setting "CommonUnit" to "true" in the input repeat-specification file (available online at https://github.com/Phillip-a-richmond/STR_Analysis/blob/master/GenerateTargetLoci/GLS_NoOffTarget.json). In this mode, ExpansionHunter produces a size estimate bounded by the length of the sequenced fragment. Thus, for repeats whose size exceeds the fragment length, ExpansionHunter produces a lower-bound size estimate.

3.3.5 STR Genotyping in control populations

We further examined three independent series totaling 8,295 individuals (16,590 alleles) for the presence of the GLS repeat expansion (same methodology as for Patient 1). Series 1 included sequence data from 441 cell lines from the Coriell Institute for Medical Research (www.coriell.org). For these samples, DNA was prepared using PCR-free library preparation kits and sequenced on the Illumina HiSeq X platform (2 x 150 bp paired-end reads). Information about these samples and links for downloading the raw sequence data can be obtained at https://github.com/Illumina/Polaris. Series 2 included DNA samples from 1,658 unrelated individuals (unaffected parents of children diagnosed with autism spectrum disorder) from the MSSNG genomic database (C Yuen, Merico et al. 2017). DNA was derived from whole blood or lymphoblast-derived cell lines, prepared using PCR-free library preparation kits, and sequenced on the Illumina HiSeq X platform (2 x 150 bp paired-end reads). Further details about MSSNG and samples can be found at https://mssng-edge.dnastack.com. Series 3 included DNA samples from 6,196 unrelated individuals with and without ALS from Project MinE (Project Min 2018). DNA was derived from whole blood, prepared using PCR-free library preparation kits, and sequenced on the Illumina HiSeq 2000 platform (2 x 100 bp paired-end reads) and HiSeq X platform (2 x 150 bp paired-end reads). Further details about Project MinE and the samples can be found at http://databrowser.projectmine.com/.


3.4 Results

3.4.1 Efficacy of variant prioritization pipeline

The variant prioritization pipeline was deployed at BC Children’s Hospital in a clinical-research setting, providing analysis capacity to the Treatable Intellectual Disability Endeavour (TIDE) and Primary Immunodeficiency (PID) projects. The pipeline was validated using previously diagnosed patient exome sequencing datasets from the TIDE project. Briefly, ten cases were tested to ensure that the known causal variant and gene appeared in the final annotated candidate variant list. All ten cases were confirmed. As a demonstration of the utility of the pipeline for novel diagnoses, we tracked the diagnostic rate of a single genome analyst who works with both the TIDE and PID projects. A total of 8 diagnoses were made from a set of 14 patients, including a combination of trio (mother, father, and proband), duo (affected proband plus mother, father, or sibling), and singleton (affected proband only) cases. As the majority of these were exome sequencing cases, we further demonstrated the prioritization capacity of this pipeline in the analysis of WGS trio data for mapping the variant causing isolated strabismus in a large pedigree (Ye, Roslin et al. 2020). Altogether, we have implemented a functional variant calling and annotation pipeline for the analysis of patient sequencing data.

3.4.2 Analysis of patient WES data identifies single coding variant in GLS

We hypothesized that the patients had a novel inborn error of metabolism (IEM), and performed exome sequencing and then a combined analysis of Family 1 (proband and both parents) and Family 2 (proband and one parent). We analyzed Family 3 (proband and both parents) independently. The standard bioinformatics approach, which assumes Mendelian modes of inheritance with full penetrance, did not identify any leading gene candidates in any family. We then evaluated the exome sequences of only the probands of Families 1 and 2; rare, deleterious variants in the heterozygous state were identified in 345 and 320 genes respectively, of which 24 and 29 were associated with intellectual disability and/or encephalopathy in the Human Phenotype Ontology (HPO) or within Medical Subject Heading Over-representation Profiles (MeSHOPs) (Cheung, Ouellette et al. 2012, Köhler, Vasilevsky et al. 2017). The heterozygous paternally inherited variant NM_014905:c.938C>T (p.Pro313Leu) in GLS (encoding glutaminase) was of interest given the excellent match between the encoded phosphate-activated glutaminase and the biochemical phenotype (elevated levels of glutamine). No candidate variants were identified in Family 2 that were consistent with the clinical or biochemical phenotype. In a separate study, exome analysis revealed a maternally inherited heterozygous GLS variant NM_014905:c.923dupA (p.Tyr308*) in Patient 3. Both the missense and frameshift GLS variants identified in Patients 1 and 3, respectively, were predicted to be damaging by all tested in silico metrics. The variant p.Tyr308* has an allele frequency of 3.984 × 10^-6 in gnomAD (rs1212883982), and the p.Pro313Leu variant was absent from this database (Lek, Karczewski et al. 2016).

3.4.3 Biochemical characterization of GLS

The results of exome analysis were suggestive of a GLS deficiency but were not conclusive, so we performed further biochemical phenotyping. Functional analyses in lymphocytes and fibroblasts showed markedly reduced GLS activity in all three patients (Figure 3.1A).

Immunoblot analysis showed reduced levels of glutaminase in fibroblasts obtained from all three patients and virtually no glutaminase in the lymphocytes of Patient 2 (Figure 3.1B).


For functional analyses, stable-isotope-labeled glutamine, glucose, and oleate were used for the in situ analysis of the GLS deficiency and its metabolic consequences. In samples obtained from all three patients, we observed markedly decreased levels of isotope-labeled glutamate in fibroblasts that had been incubated with isotope-labeled glutamine, a finding that indicates impaired GLS activity (Figure 3.1D). Recombinantly expressed mutant GLS (carrying the p.Pro313Leu variant) showed residual enzymatic activity of 2.8%, whereas no activity could be detected for the recombinantly expressed p.Tyr308* variant (Figure 3.1C). For the p.Tyr308* variant, molecular modeling suggests that the truncation would prevent the proper formation of the active site and the subunit interfaces.

A glutamate deficit in the neurons of the patients could not be examined, since the induction of pluripotent stem cells from GLS-deficient fibroblasts for differentiation into neurons was not successful, perhaps because GLS has been shown to be essential for the differentiation, proliferation, and survival of human neural progenitor cells (Wang, Huang et al. 2014).

Knockdown of the GLS orthologues glsa, glsl, or both in zebrafish was associated with a smaller body size, curved body, and cardiac edema.


Figure 3-1 - Profiling GLS activity in patient cells.

Panel A shows GLS activity in fibroblasts obtained from all three study patients and from three controls and in peripheral-blood mononuclear (PBM) cells obtained from Patient 2 and from three controls. The GLS activity is measured in nanomoles of glutamate per milligram of protein per minute. Panel B shows the results of immunoblot analysis of GLS in fibroblasts and PBM cells obtained from the same patients (P) and controls (C) shown in Panel A. Panel C shows GLS activity of recombinantly expressed p.Pro313Leu and p.Tyr308* GLS variants, as compared with wild-type (WT) GLS enzyme, expressed in GLS-deficient HEK293-Flp-In cells. Panel D shows the ratio of stable isotope-labeled glutamine to stable isotope-labeled glutamate in fibroblasts obtained from the three patients and three controls. In Panels A and D, the T bars indicate the standard deviation.

3.4.4 Identification of repeat expansion in 5’UTR of GLS in singleton WGS

We hypothesized that the missing heritability in each of the three patients was due to variation at the GLS locus refractory to detection by exome sequencing. We based this hypothesis on the results of the biochemical GLS and flux assays indicating the presence of a GLS deficiency, which was consistent with elevated levels of plasma glutamine. Examining the expression level of GLS in Patients 1 and 2 showed a decreased abundance of messenger RNA (mRNA) compared to a control, indicating that a variant may be repressing expression in both families.

Furthermore, Sanger sequencing of samples of cDNA generated from mRNA obtained from members of Family 1 supported this hypothesis, since the proportion of the c.938T allele relative to the second allele was much higher in the patient than in the father, a finding that was consistent with an imbalance in allelic expression (Figure 3.2). Thus, we sequenced the genome of Patient 1.


Figure 3-2 - Allelic expression imbalance and reduced GLS expression.

A) qPCR of GLS mRNA expression normalized against RPS14 in Patient 1, Patient 2, and a matched healthy control. B) GLS cDNA sequencing of members of Family 1 showing a relative deficit of the mRNA from the expanded allele relative to the GLS c.938T allele in Patient 1.

Manual analyses of cis-regulatory regions around GLS revealed the presence of a GCA repeat in the 5′ untranslated region (UTR) of GLS (Figure 3.3). Specifically, visual inspection of the GCA repeat tract showed several discordantly mapped reads, and several soft-clipped reads whose soft-clipped bases contained GCA repeats. Examining the reads anchored at this location showed that their mates mapped to a separate location in the genome, within a GCA tract in an intron of TCF4 (Figure 3.3).
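The soft-clipping signal described here can also be checked programmatically. The sketch below is a minimal, hypothetical helper in plain Python; it is not part of IGV or ExpansionHunter. It pulls the soft-clipped bases out of a read using its CIGAR string and reports the fraction of those bases that match a perfectly periodic GCA pattern in the best phase. The example read and CIGAR string are made up.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clipped_bases(seq, cigar):
    """Return (leading_clip, trailing_clip) sequences for a read and its CIGAR."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Hard-clipped bases are absent from the read sequence, so ignore them.
    ops = [(n, op) for n, op in ops if op != "H"]
    lead = ops[0][0] if ops and ops[0][1] == "S" else 0
    trail = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return seq[:lead], seq[len(seq) - trail:]


def motif_fraction(bases, motif="GCA"):
    """Fraction of bases matching a perfect tandem repeat of motif, best phase."""
    if not bases:
        return 0.0
    best = 0
    for shift in range(len(motif)):
        matches = sum(
            1 for i, base in enumerate(bases) if base == motif[(i + shift) % len(motif)]
        )
        best = max(best, matches)
    return best / len(bases)


read = "ACGTACGTAC" + "GCAGCAGCAGCA"   # 10 aligned bases, then 12 clipped bases
lead, trail = soft_clipped_bases(read, "10M12S")
print(repr(lead), trail, motif_fraction(trail))
```

A high fraction in the clipped portion of reads at the GLS locus would be consistent with the repeat extending beyond what the reference contains, which is the pattern that prompted the closer look at this region.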

As the GCA tract in TCF4 is longer than that in GLS, I reasoned that the reads map there because they consist entirely of GCA repeats, so their mapping score is higher where more reference bases match. Coincidentally, a novel tool called ExpansionHunter (Dolzhenko, van Vugt et al. 2017) had just been published, which utilizes these discordant read signals to detect repeat expansions that are longer than the read length. ExpansionHunter specifically uses reads that anchor uniquely to a region and whose partner read consists solely of repeated sequence. As ExpansionHunter is designed to work on predefined loci, I defined both the target (GLS) and off-target (TCF4) regions with a repeat motif of ‘GCA’. I genotyped the expansion using configurations in which I either provided or withheld the off-target region. Without the off-target region, I found that the GLS 5’UTR contained more than 90 GCA copies on the chromosome that the proband had inherited from the mother, while the paternally inherited allele had eight GCA copies. Supplying the off-target region increased the estimate to 246 copies; however, since GCA is a common repeat in the genome, the authors of the tool recommended profiling the repeat in the conservative mode.

Figure 3-3 - IGV snapshots reveal repeat signals in the 5’UTR of GLS.

Visualization of the mapped reads for Patient 1 (top) and an anonymized control sample (bottom) for the GLS (left) and TCF4 (right) loci. The repeat location is shown in red, and the coverage track displayed above the reads is set to a fixed coverage for both left and right panels.

3.4.5 Genotyping repeat using ExpansionHunter and confirmation in population

We then performed a triplet repeat–primed PCR assay (Figure 3.4B) and a GLS repeat PCR assay (Figure 3.4C). Triplet repeat–primed PCR assays with a primer complementary to a stretch of GCA repeats indicated that all three patients had large GCA repeat expansions, as did one or both of their parents (Figure 3.4B). The proband in Family 1 had compound heterozygosity for the paternally inherited c.938C→T variant and a maternally inherited GCA repeat expansion allele. The proband in Family 2 was homozygous for GCA repeat expansion alleles inherited from both parents, and the proband in Family 3 had compound heterozygosity for the maternally inherited c.923dupA variant and a paternally inherited GCA repeat expansion allele (Figure 3.4A). Repeat PCR assays with the use of primers flanking the GCA repeat showed major expansion products of approximately 680, 900, and 1500 repeats, respectively, in blood samples obtained from the three probands (Figure 3.4C).

Analysis of this GCA repeat expansion locus in an untargeted collection of 8295 genomes showed that this short tandem repeat had a median size of 14 repeats (i.e., 42 bp) and a bimodal distribution with peaks at 8 and 16 repeats. Of the 8295 analyzed genomes, 1 was heterozygous for an allele with more than 90 repeats, making the allele frequency of this expansion 6.03 × 10^-5 (Figure 3.4D). We additionally profiled the expanded sample allowing for the use of off-target reads, and saw that the estimated expansion size increased to 384 copies.
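The figures in this paragraph follow from simple arithmetic, shown here as a short Python check. Only the numbers come from the text; the variable names and series labels are illustrative.

```python
# Cohort sizes reported for the three control series.
series = {"Coriell/Polaris": 441, "MSSNG": 1658, "Project MinE": 6196}
genomes = sum(series.values())          # 8295 individuals

alleles = 2 * genomes                   # diploid genomes: two alleles each
expanded_alleles = 1                    # one heterozygous carrier (>90 repeats)
allele_frequency = expanded_alleles / alleles

motif_length_bp = 3                     # GCA
median_repeats = 14
median_size_bp = median_repeats * motif_length_bp

print(genomes, alleles, f"{allele_frequency:.2e}", median_size_bp)
# prints: 8295 16590 6.03e-05 42
```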


Figure 3-4 - Genotyping of GLS Probands.

Panel A shows the pedigrees of the three affected families. The presence of c.938C→T (p.Pro313Leu) in Family 1 and c.923dupA (p.Tyr308*) in Family 3 was confirmed on Sanger sequencing. Polymerase-chain-reaction (PCR) amplification of the 5′ region containing the GCA repeat and subsequent Sanger sequencing allowed for the determination of the number of GCA repeats in non-expanded alleles. In family members in whom a large expansion was observed, the number of GCA repeats was estimated on PCR amplification, followed by agarose gel electrophoresis. Panel B shows the results of triplet repeat–primed PCR assay and subsequent capillary electrophoresis to confirm the presence of the GCA expansion. Each vertical line represents a single GCA repeat. Thus, larger expansions are represented by a greater number of vertical lines. The results of Sanger sequencing that correspond to the missense and frameshift variants are shown next to each member of Families 1 and 3. Panel C shows PCR amplification of the expanded GCA repeats in blood samples obtained from the three patients (P), followed by agarose electrophoresis. The approximately 300-bp PCR product visible in the samples from Patients 1 and 3 reflects the non-expanded alleles in these patients, who are heterozygous for an expansion and a point mutation. Panel D shows the genotypes of the GCA repeat locus in an untargeted population, indicating the number of GCA repeats for 8295 persons (16,590 total genotypes) as determined with the use of ExpansionHunter run on PCR-free genome-sequencing data sets.

3.4.6 Defining the mechanism for repeat expansion pathogenicity

To investigate the mechanism of the allele-biased mRNA expression (Figure 3.2B), which is consistent with the trinucleotide-repeat expansion suppressing GLS expression, we carried out pyrosequencing of DNA (upstream and downstream of the trinucleotide-repeat expansion) purified from blood cells; no evidence of increased DNA methylation was seen in samples obtained from any of the patients (Figure 3.5A). The absence of methylation in the patients’ fibroblasts was confirmed by the loss of almost all the GLS PCR product when the genomic DNA was predigested with a methylation-sensitive restriction enzyme.

Chromatin immunoprecipitation assays were performed to determine whether the expansion affected histone modifications of the adjacent GLS promoter. The patients’ alleles showed reduced levels of histone modifications that are characteristic of transcriptionally active regions (H3 acetylation and H3K4 trimethylation) and were enriched for a histone modification characteristic of some transcriptionally silenced regions (H3K9me3) (Figure 3.5B) (Mozzetta, Boyarchuk et al. 2015). This effect was more marked in Patient 2, who carried two expanded repeat alleles. These data suggest that the repeat expansion causes a change in the chromatin configuration, which results in decreased transcription.

To test for the effect of the expanded repeat on transcription initiation and elongation (i.e., the elongation of the nascent RNA strand with the addition of nucleotides during transcription) or translation, we cloned DNA fragments containing the GLS promoter with 13, 104, and approximately 240 GCA repeats into a luciferase reporter construct (pGL3.1). A comparison of the empty vector (pGL3.1) and vector containing the 13-repeat construct confirmed robust GLS promoter activity at baseline and in response to glutamine supplementation. We observed no negative effects of the repeat expansions on luciferase activity (Figure 3.5C). These assays are based on the transient transfection of nonreplicating plasmids containing GLS fragments outside their normal chromosomal and chromatin context. They therefore address some of the direct effects of the cloned repeats on transcription and the efficiency of translation of the resultant transcript, rather than any epigenetic effects that the repeat may have in situ. Thus, taken together, our data suggest that the predominant effect of the repeats is at the level of histone modifications, causing the glutaminase deficiency by promoting the formation of repressive heterochromatin and thereby reducing GLS transcription.


Figure 3-5 - Repeat associated effects.

Panel A shows the percentage methylation of CpG residues in the upstream genomic region (four sites) and the downstream genomic region (four sites) of the repeat in the three study patients and in a control. CpG sites are regions of DNA where a cytosine nucleotide is followed by a guanine nucleotide in the linear sequence of bases from 5′ to 3′. Cytosines in CpG dinucleotides can be methylated to form 5-methylcytosines.

Panel B shows the results of chromatin immunoprecipitation experiments using fibroblasts obtained from Patients 1 and 2 and from a control, indicating the levels of enrichment of H3 acetylation (H3KAc), H3K4me3, and H3K9me3 on the GLS promoter relative to the gene encoding glyceraldehyde-3-phosphate dehydrogenase (GAPDH). Differences between the control sample and those obtained from the two patients were significant for H3KAc, H3K4me3, and H3K9me3 in Patient 2. Differences between enrichment for different histone modifications were evaluated with a t-test to determine significance. The enrichment was calculated as the percentage of input. The y axis shows the enrichment for GLS as the factor change over enrichment for GAPDH. The results are from two independent chromatin immunoprecipitation experiments.

Panel C shows the results of cloning of DNA fragments containing the GLS promoter with 13, 104, and approximately 240 GCA repeats (wild-type GLS-(GCA)13, mutant GLS-(GCA)104, and GLS-(GCA)>200) into luciferase promoter reporter constructs with or without glutamine supplementation at 48 hours. For vector normalization, the activities of two luciferases, firefly and Renilla, were measured in the same cells or lysate in three independent experiments. Two-way analyses of variance with Bonferroni post-tests were used for analysis. One asterisk indicates P<0.001 for the comparison with GLS-(GCA)13 with no glutamine supplementation, and two asterisks indicate P<0.001 for the comparison with GLS-(GCA)13 with glutamine supplementation. In Panel A, the data are shown in box plots with the same measures as described in Figure 3.1D; the T bars indicate the standard deviation of the mean in Panel B and standard error in Panel C.

3.5 Discussion

In this study, we describe the identification of an inborn error of metabolism caused by a novel trinucleotide GCA repeat expansion. The expansion in the 5′ untranslated region of GLS, which encodes glutaminase, results in reduced expression and glutaminase deficiency. The neurodegenerative course of the patients’ development resembles that of another sibling pair of patients with a GLS deficiency caused by a homozygous 8-kb duplication spanning exon 1 (Lynch, Chelban et al. 2018).


Alternative splicing of GLS generates two protein isoforms: glutaminase C and kidney-type glutaminase (Elgadi, Meguid et al. 1999). Both isoforms are localized in mitochondria and expressed in multiple tissues, with high expression levels of kidney-type glutaminase in the brain (i.e., cerebral cortex) and kidney (Elgadi, Meguid et al. 1999, Thul, Åkesson et al. 2017).

Glutamate is the major excitatory neurotransmitter in the brain and can be synthesized from α-ketoglutarate (Takeda, Ishida et al. 2012) or glutamine (the most abundant amino acid) (Altman, Stine et al. 2016) through the action of GLS. Studies in model organisms have shown that the ablation of GLS results in partially impaired glutamatergic synaptic transmission in mice and early death from respiratory problems (Masson, Darmon et al. 2006). The small residual GLS activity that was detected in the fibroblasts and lymphocytes of our patients may account for the comparatively milder phenotypes. Mice that are heterozygous for a GLS deficiency had hippocampal hypoactivity (Gaisler-Salomon, Miller et al. 2009). Patients with other causes of glutamate deficiency have ataxia, as did our patients (Guergueltcheva, Azmanov et al. 2012). Whether elevated glutamine levels affect the phenotype in the three patients we describe here is unclear.

Our study underscores the importance of examining noncoding regions of the genome.

Repetitive elements represent an estimated 50% of the human genome (Tarailo-Graovac and Chen 2009, Hannan 2018), and if such elements are unstable, they may have multiple deleterious effects. Short (1 to 6 bp) tandem repeats (of which the GCA repeat is one) make up more than 3% of the human genome (Subramanian, Mishra et al. 2003), yet relatively few repeat expansion disorders have been described to date. Those that have been reported predominantly affect the nervous system (Hannan 2018). Current emphasis on the use of short-read exome-sequencing methods has made it difficult to systematically identify repetitive elements as a source of genetic deficit in rare Mendelian diseases, but repetitive elements may represent missing heritability in the cause of disease.

3.6 Conclusion

Within this study, a narrow focus on a specific target gene, GLS, allowed the manual identification of a novel pathogenic short tandem repeat expansion. However, in many cases such a specific phenotype and extensive metabolic profiling are not available. This leads to the question of whether this class of variants, short tandem repeat expansions, can be identified from WGS data without knowing where in the genome the expansion occurs. In the following chapter I explore this concept, and collaborate with the original authors of ExpansionHunter (a tool which was critical in solving the GLS case) to develop a genome-wide method for repeat expansion identification.


Chapter 4: Genome-wide discovery of short tandem repeat expansions

4.1 Introduction

High-throughput whole-genome sequencing (WGS) has experienced rapid reductions in per-genome costs over the past ten years (Muir, Li et al. 2016), driving population-level sequencing projects and precision medicine initiatives at an unprecedented scale (Genomes Project, Auton et al. 2015, Gudbjartsson, Helgason et al. 2015, Nagasaki, Yasuda et al. 2015, Erikson, Bodian et al. 2016, Telenti, Pierce et al. 2016, Project Min 2018). The availability of large sequencing datasets now allows researchers to perform comprehensive genome-wide searches for disease-associated variants. The primary limitations of these studies are the completeness of the reference genome and the ability to identify putative causal variations against the reference background. A wide variety of software tools can identify variations relative to the reference genome such as single nucleotide variants and short (1-50 bp) insertions and deletions (McKenna, Hanna et al. 2010, Garrison and Marth 2012, Raczy, Petrovski et al. 2013, Rimmer, Phan et al. 2014, Poplin, Newburger et al. 2016, Poplin, Chang et al. 2018), copy number variants (Abyzov, Urban et al. 2011, Roller, Ivakhno et al. 2016) and structural variants (Abyzov, Urban et al. 2011, Layer, Chiang et al. 2014, Chen, Schulz-Trieglaff et al. 2016). A common feature of these variant callers is their reliance on sequence reads that at least partially align to the reference genome. However, because some variants include large amounts of inserted sequence relative to the reference, methods that can analyze reads that do not align to the reference are also needed.

A particularly important category of variants involving long insertions relative to the reference genome is repeat expansions (REs). One example is the expansion in C9orf72 associated with amyotrophic lateral sclerosis (ALS). This repeat consists of three copies of the CCGGGG motif in the reference (18 bp total), whereas the pathogenic mutations comprise at least 30 copies of the motif (180 bp total) and may encompass thousands of bases (DeJesus-Hernandez, Mackenzie et al. 2011, Renton, Majounie et al. 2011). REs are known to be responsible for dozens of monogenic disorders (La Spada and Paul Taylor 2010, Hannan 2018).

Several recently developed tools can detect REs longer than the standard short-read sequencing read length of 150 bp (Dolzhenko, van Vugt et al. 2017, Tang, Kirkness et al. 2017, Dashnow, Lek et al. 2018, Tankard, Bennett et al. 2018, Dolzhenko, Deshpande et al. 2019, Mousavi, Shleizer-Burko et al. 2019). These tools have all been demonstrated to be capable of accurately detecting pathogenic expansions of simple short tandem repeats (STRs). However, recent discoveries have shown that many pathogenic repeats have complex structures and hence require more flexible methods. For instance: (a) REs causing spinocerebellar ataxia types 31 and 37, familial adult myoclonic epilepsy types 1, 2, 3, 4, 6 and 7, and Baratela-Scott syndrome (Sato, Amino et al. 2009, Seixas, Loureiro et al. 2017, Ishiura, Doi et al. 2018, Corbett, Kroes et al. 2019, Florian, Kraft et al. 2019, LaCroix, Stabley et al. 2019, Yeetong, Pongpanich et al. 2019) occur within an inserted sequence relative to the reference; (b) expanded repeats recently shown to cause spinocerebellar ataxia, familial adult myoclonic epilepsy, and cerebellar ataxia with neuropathy and bilateral vestibular areflexia syndrome have different composition relative to the reference STR (Sato, Amino et al. 2009, Seixas, Loureiro et al. 2017, Ishiura, Doi et al. 2018, Corbett, Kroes et al. 2019, Florian, Kraft et al. 2019, Yeetong, Pongpanich et al. 2019); (c) Unverricht-Lundborg disease, a type of progressive myoclonus epilepsy, is caused by an expansion of a larger, dodecamer (12-mer) motif repeat (Lalioti, Scott et al. 1997). None of the existing variant calling methods is capable of discovering all of these REs.

We have developed ExpansionHunter Denovo (EHdn), a novel method for performing a genome-wide search for expanded repeats, to address the limitations of the existing approaches. EHdn scans the existing alignments of short reads from one or many sequencing libraries, including the unaligned and misaligned reads, to identify approximate locations of long repeats and their nucleotide composition. Unlike other methods designed to identify REs (Dolzhenko, van Vugt et al. 2017, Tang, Kirkness et al. 2017, Dashnow, Lek et al. 2018, Tankard, Bennett et al. 2018, Dolzhenko, Deshpande et al. 2019, Mousavi, Shleizer-Burko et al. 2019), EHdn (a) does not require prior knowledge of the genomic coordinates of the REs, (b) can detect nucleotide composition changes within the expanded repeats, and (c) is applicable to both short and long motifs. EHdn is computationally efficient because it does not re-align reads and, depending on the sensitivity settings, can analyze a single 30-40x WGS sample in about 30 minutes to 2 hours using a single CPU thread.

In this study, we demonstrate that EHdn can be used to rediscover the REs associated with fragile X syndrome (FXS), Friedreich’s ataxia (FRDA), myotonic dystrophy type 1 (DM1), and Huntington’s disease (HD) using case-control analysis to compare a small number of affected individuals (N=14-35) to control samples (N=150). We also show that REs in individual samples can be identified using outlier analysis. We then characterize large (longer than the read length) repeats in our control cohort to investigate baseline variability of these long repeats. Finally, we demonstrate the capabilities of our method by analyzing simulated expansions of various classes of tandem repeats known to play an important role in human disease. Taken together, our findings demonstrate that EHdn is a robust tool for identifying novel pathogenic repeat expansions in both cohort and single-sample outlier analysis, capable of identifying a new, previously inaccessible class of REs.

4.2 Methods

4.2.1 ExpansionHunter Denovo overview

The length of disease-causing REs tends to exceed the read length of modern short-read sequencing technologies (Ashley 2016). Thus, pathogenic expansions of many repeats can be detected by locating reads that are completely contained inside the repeats. As in our previous work (Dolzhenko, van Vugt et al. 2017, Dolzhenko, Deshpande et al. 2019), we call these reads in-repeat reads (IRRs). We implemented a method, ExpansionHunter Denovo (EHdn), for performing a genome-wide search for IRRs in BAM/CRAM files containing read alignments. EHdn computes genome-wide STR profiles containing locations and counts of all identified IRRs. Subsequent comparisons of STR profiles across multiple samples can reveal the locations of the pathogenic repeat expansions.
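The idea of an in-repeat read can be illustrated with a minimal Python sketch. This is not the EHdn implementation: the function names, the 90% identity threshold, and the 2-20 bp motif range are assumptions chosen for illustration. A read is treated as an IRR when it is approximately periodic, and its motif is reported in a canonical form (the lexicographically smallest rotation of the motif or of its reverse complement) so that reads from either strand are counted together.

```python
# Illustrative sketch only; names and thresholds are assumptions, not EHdn internals.

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def is_periodic(read, period, min_identity=0.9):
    """True if at least `min_identity` of bases equal the base `period` positions later."""
    comparisons = len(read) - period
    if comparisons <= 0:
        return False
    matches = sum(read[i] == read[i + period] for i in range(comparisons))
    return matches / comparisons >= min_identity


def canonical_motif(motif):
    """Smallest rotation of the motif or of its reverse complement."""
    revcomp = "".join(COMPLEMENT.get(base, "N") for base in reversed(motif))
    rotations = [m[i:] + m[:i] for m in (motif, revcomp) for i in range(len(m))]
    return min(rotations)


def find_irr_motif(read, min_len=2, max_len=20):
    """Return the canonical motif if the read looks like an in-repeat read, else None."""
    for period in range(min_len, max_len + 1):
        if is_periodic(read, period):
            # The first repeat unit stands in for the motif (a simplification).
            return canonical_motif(read[:period])
    return None
```

Under these assumptions, `find_irr_motif("CAG" * 50)` returns `"AGC"` and `find_irr_motif("TTAGGG" * 25)` returns `"AACCCT"`, the canonical form of the telomeric hexamer.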

Genome-wide STR profiles computed by EHdn contain information about two types of IRRs: anchored IRRs and paired IRRs. Anchored IRRs are IRRs whose mates are confidently aligned to the genomic sequence adjacent to the repeat. Paired IRRs are read pairs where both mates are IRRs with the same repeat motif. Repeats exceeding the read length generate anchored IRRs (Figure 4.1, middle panel). Repeats that are longer than the fragment length of the DNA library produce paired IRRs in addition to anchored IRRs (Figure 4.1, right panel). The genomic coordinates where the anchored reads align correspond to the approximate locations of loci harboring REs and the number of IRRs is indicative of the overall RE length.
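The distinction between the two IRR types can be sketched as a small classification function. The `Read` record, the field names, and the mapping-quality threshold of 50 used to decide whether a mate is "confidently aligned" are assumptions made for this illustration, not a description of EHdn's internal data structures.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Read:
    chrom: str
    pos: int
    mapq: int
    irr_motif: Optional[str]  # canonical motif if the read is an IRR, otherwise None


def classify_pair(first: Read, second: Read, min_anchor_mapq: int = 50) -> Optional[Tuple]:
    """Return ("paired", motif, None), ("anchored", motif, (chrom, pos)) or None."""
    if first.irr_motif and second.irr_motif:
        # Both mates are IRRs: a paired IRR only if the motifs agree.
        if first.irr_motif == second.irr_motif:
            return ("paired", first.irr_motif, None)
        return None
    for irr, mate in ((first, second), (second, first)):
        # One mate is an IRR and the other is a confidently aligned anchor.
        if irr.irr_motif and not mate.irr_motif and mate.mapq >= min_anchor_mapq:
            return ("anchored", irr.irr_motif, (mate.chrom, mate.pos))
    return None
```

Note that the position recorded for an anchored IRR is that of the anchor mate, since the alignment of the IRR itself is not trusted.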

The information about anchored IRRs is summarized in an STR profile for each repeat motif (e.g. CCG) by listing regions containing anchored IRRs in close proximity to each other together with the total number of anchored IRRs identified (Figure 4.1, middle). Note that the mapping positions of anchored IRRs correspond to the positions of anchor reads; mapping positions of IRRs themselves are not used because their alignments are often unreliable. In contrast to anchored IRRs, the origin of paired IRRs cannot be determined if a genome contains multiple long repeats with the same motif. For this reason, STR profiles only contain the overall count of paired IRRs for each observed repeat motif.
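A minimal sketch of how anchor positions might be summarized into per-motif regions is given below. The 500 bp merging distance and the list-based region representation are assumptions made for illustration; the text only states that anchored IRRs "in close proximity" are grouped.

```python
def build_locus_profile(anchors, max_gap=500):
    """Group anchor positions into regions per motif.

    anchors: iterable of (motif, chrom, pos) tuples, one per anchored IRR.
    Returns {motif: [[chrom, start, end, anchored_irr_count], ...]}.
    `max_gap` is a hypothetical merging distance in base pairs.
    """
    profile = {}
    for motif, chrom, pos in sorted(anchors):
        regions = profile.setdefault(motif, [])
        last = regions[-1] if regions else None
        if last and last[0] == chrom and pos - last[2] <= max_gap:
            last[2] = pos   # extend the region to the new anchor
            last[3] += 1    # one more anchored IRR supports it
        else:
            regions.append([chrom, pos, pos, 1])
    return profile
```

Paired IRRs would be stored separately as a single count per motif, because their genomic origin cannot be resolved.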

To compare STR profiles across multiple samples, the profiles must first be merged. During this process, nearby anchored IRR regions are merged across samples and the associated counts are depth-normalized and tabulated for each sample (Figure 4.1, right). The total counts of paired IRRs are also normalized and tabulated for each sample. The resulting per-sample counts can be compared in two ways. If the samples can be partitioned into cases and controls, and a significant subset of cases is hypothesized to contain expansions of the same repeat, a case-control analysis can be performed using a Wilcoxon rank-sum test. Alternatively, for heterogeneous cohorts where no enrichment for any specific expansion is expected, an outlier analysis can be used to flag repeats that are expanded in a small subgroup of cases compared to the rest of the dataset. The outlier analysis bootstraps the sampling distribution of the 95% quantile and then calculates the z-scores for cases that exceed the mean of this distribution. The z-scores are used for ranking the repeat regions. Similar outlier-detection frameworks were also developed for exSTRa (Tankard, Bennett et al. 2018) and STRetch (Dashnow, Lek et al. 2018).
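The outlier score described above can be sketched in a few lines of Python using only the standard library. The number of bootstrap rounds, the nearest-rank quantile, the fixed random seed, and the floor applied to the standard deviation (to avoid division by zero when the bootstrapped quantile never changes) are all assumptions of this sketch rather than documented properties of the EHdn scripts.

```python
import random
import statistics


def quantile(values, q):
    """Nearest-rank quantile of a list of numbers."""
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, int(q * len(ordered)))]


def outlier_z_scores(counts, case_ids, n_boot=200, q=0.95, min_sd=1.0, seed=0):
    """Z-scores for cases whose normalized count exceeds the bootstrapped quantile mean.

    counts: {sample_id: depth-normalized IRR count} for one region (cases and controls).
    """
    rng = random.Random(seed)
    values = list(counts.values())
    # Bootstrap the sampling distribution of the q-quantile.
    boot = [quantile(rng.choices(values, k=len(values)), q) for _ in range(n_boot)]
    mean_q = statistics.mean(boot)
    sd_q = max(statistics.pstdev(boot), min_sd)  # hypothetical floor on the spread
    return {s: (counts[s] - mean_q) / sd_q for s in case_ids if counts[s] > mean_q}
```

Regions would then be ranked by the largest z-score observed among the cases.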

Case-control and outlier analyses can be performed on either anchored IRRs or paired IRRs, which we call locus and motif methods, respectively. Thus, the locus method can identify locations of repeat expansions while the motif method can reveal the overall enrichment for long repeats with a given motif.
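For the case-control comparison, a self-contained sketch of depth normalization followed by a Wilcoxon rank-sum test is shown below. The 40x target depth, the one-sided alternative (cases larger than controls), and the normal approximation without tie correction are assumptions made to keep the example short; a production analysis would more likely rely on an established statistics library.

```python
import math


def normalize(count, depth, target_depth=40.0):
    """Scale an IRR count to a common sequencing depth (target depth is an assumption)."""
    return count * target_depth / depth


def rank_sum_test(cases, controls):
    """One-sided Wilcoxon rank-sum test via the normal approximation.

    Returns (z, p) where p is the probability of a rank sum at least this large
    for the cases under the null hypothesis. Ties receive average ranks.
    """
    pooled = sorted([(v, 1) for v in cases] + [(v, 0) for v in controls])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        for k in range(i, j):
            ranks[k] = (i + 1 + j) / 2.0  # average of ranks i+1 .. j
        i = j
    n1, n2 = len(cases), len(controls)
    w = sum(rank for rank, (_, is_case) in zip(ranks, pooled) if is_case)
    mean_w = n1 * (n1 + n2 + 1) / 2.0
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (w - mean_w) / sd_w
    return z, 0.5 * math.erfc(z / math.sqrt(2.0))
```

The same function applies to both the locus method (anchored IRR counts per region) and the motif method (paired IRR counts per motif); only the input counts differ.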

Figure 4-1 - ExpansionHunter Denovo Overview.

(Left) A search for anchored IRRs is performed across all aligned reads. (Middle) The IRR counts are summarized into STR profiles. (Right) The resulting STR profiles are merged across all samples. If the dataset can be partitioned into cases and controls, IRR counts in these groups are compared for each locus. Alternatively, if no such partition is possible, an outlier analysis is performed.

4.2.2 Defining relevant repeat expansions

A catalog of pathogenic or potentially pathogenic repeat expansions was collated from the literature. We supplemented this catalog with recently reported STRs linked with gene expression (Fotsing, Margoliash et al. 2019), and repeats with longer motifs overlapping with disease genes. For each repeat locus, the repeat motif, pathogenic lower bound, curation source, and genomic coordinates in the GRCh37 and GRCh38 reference genomes were curated. In cases where the pathogenic lower bound is not defined, the smallest pathogenic allele observed in affected patients was used.

4.2.3 Synthetic generation of pathogenic repeats, long motifs, eSTRs

The repeat expansions were simulated following a strategy similar to the one used by BamSurgeon (Ewing, Houlahan et al. 2015). Briefly, a region around a non-expanded target repeat in a control WGS sample was replaced with synthetic reads supporting an expansion in a heterozygous (one expanded allele and one reference-length allele) state (Figure 4.2). Specifically,

● The WGS sample HG03522 from the Polaris Kids cohort (Illumina) was used as the control.

● A FASTA file containing the expanded repeat along with 2 kb of flanking sequence upstream and downstream of the repeat was generated for read simulation.

● Reads were simulated from the FASTA file using ART v2.5.8 (Huang, Li et al. 2012); read length (150 bp), mean insert size (460 bp), and insert size standard deviation (115 bp) were chosen to match the control WGS sample.

● Simulated reads were processed in the same way as the other WGS data (Section 4.2.4) and merged with sample HG03522 using SAMtools (v1.3.1).

Further details about the simulation process can be found here:

● https://github.com/egor-dolzhenko/ehdn-paper-analysis/tree/master/STR_Simulation
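The FASTA-generation step in the list above can be sketched as follows. The function names are hypothetical and the flanks in the usage example are short placeholders rather than the 2 kb of real flanking sequence used in the study; read simulation and merging are not shown.

```python
def expanded_allele(left_flank, motif, copies, right_flank):
    """Sequence of an allele carrying `copies` tandem copies of `motif` between two flanks."""
    return left_flank + motif * copies + right_flank


def to_fasta(name, sequence, width=60):
    """Format a single sequence as a FASTA record with fixed-width lines."""
    lines = [f">{name}"]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines) + "\n"


# Usage with placeholder flanks (hypothetical values, for illustration only).
record = to_fasta("expanded_CAG_100", expanded_allele("ACGT" * 5, "CAG", 100, "TTGCA"))
```

A read simulator could then be pointed at the resulting FASTA file to produce reads supporting the expanded allele.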

Figure 4-2 - Overview of simulating samples with repeat expansions.

STR simulation by replacing reads around a non-expanded repeat in a control WGS sample with synthetic reads supporting an expansion. First an STR BAM is created (left), by isolating a region around the STR, expanding the sequence (or replacing with complex expanded sequence), and then mapping against the reference genome. This file is concatenated with a human BAM file with the corresponding region cut out (right), to produce a final BAM.

4.2.4 Initial processing of WGS data

Human and simulated WGS data were aligned to the GRCh37-lite genome

● http://www.bcgsc.ca/downloads/genomes/9606/hg19/1000genomes/bwa_ind/genome/GRCh37-lite.fa

with BWA-MEM v0.7.17 (Li 2013) using SAMtools v1.9 (Li, Handsaker et al. 2009). WGS data that were originally obtained in BAM format were converted to FASTQ using Bazam v1.0.1 (Sadedin and Oshlack 2019).

4.2.5 Detection of repeat expansions

ExpansionHunter Denovo v0.8.6 was used to generate the genome-wide STR profile for each WGS sample:

ExpansionHunterDenovo profile --reads SAMPLE.bam --reference GRCh37-lite.fa --output-prefix SAMPLE

A manifest file was created for each required comparison; multi-sample STR profiles were then generated and the locus- and motif-based analyses were performed:

ExpansionHunterDenovo merge --reference GRCh37-lite.fa --manifest MANIFEST.tsv --output-prefix OUTPUT

python3 casecontrol.py locus --manifest MANIFEST.tsv --multisample-profile OUTPUT.multisample_profile.json --output-prefix casecontrol_locus.tsv

python3 casecontrol.py motif --manifest MANIFEST.tsv --multisample-profile OUTPUT.multisample_profile.json --output-prefix casecontrol_motif.tsv

python3 outlier.py locus --manifest MANIFEST.tsv --multisample-profile OUTPUT.multisample_profile.json --output-prefix outlier_locus.tsv

python3 outlier.py motif --manifest MANIFEST.tsv --multisample-profile OUTPUT.multisample_profile.json --output-prefix outlier_motif.tsv

The output of each command was sorted by p-value/z-score and the rank of each relevant locus or motif was extracted to evaluate performance.
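The rank extraction described above amounts to a sort followed by a lookup. The sketch below operates on rows already loaded into dictionaries; the field names (`locus`, `pvalue`, `zscore`) are hypothetical and do not necessarily match the column headers written by the analysis scripts.

```python
def rank_of(rows, locus, score_field, descending):
    """1-based rank of `locus` after sorting rows by `score_field`, or None if absent."""
    ordered = sorted(rows, key=lambda row: row[score_field], reverse=descending)
    for rank, row in enumerate(ordered, start=1):
        if row["locus"] == locus:
            return rank
    return None


# Case-control output: smaller p-values rank first.
cc_rows = [{"locus": "A", "pvalue": 0.2}, {"locus": "B", "pvalue": 1e-6}]
cc_rank = rank_of(cc_rows, "B", "pvalue", descending=False)

# Outlier output: larger z-scores rank first.
out_rows = [{"locus": "A", "zscore": 3.1}, {"locus": "B", "zscore": 42.0}, {"locus": "C", "zscore": 12.5}]
out_rank = rank_of(out_rows, "C", "zscore", descending=True)
```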

STRetch (https://github.com/Oshlack/STRetch, commit 5405902) was run using the recommended pipeline for WGS analysis starting from BAM files for each sample independently (STRetch_wgs_bam_pipeline.groovy). The STRetch STR catalog was converted to GRCh37-style chromosome names by removing the "chr" prefix (sed s/chr// hg19.simpleRepeat_period1-6.dedup.sorted.bed > grch37_input_regions.bed) and set as the input_regions parameter. Each sample was compared to the included controls (hg19.PCRfreeWGS_143_STRetch_controls.tsv). The rank of each relevant repeat was extracted from the sorted STRs.tsv output file.

4.3 Results

4.3.1 Baseline simulations show cutoffs of EHdn capacity

To demonstrate the baseline expectation of how the numbers of anchored and paired IRRs vary with repeat length we simulated 2x150 bp reads at 20x coverage with 450 bp mean fragment length for the repeat associated with Huntington’s disease and varied the repeat length from 0 to 340 CAG repeats (0 to 1,020 bp). No IRRs occur when the repeat is shorter than the read length (Figure 4.3, left panel). When the repeat is longer than the read length, but shorter than the fragment length (Figure 4.3, middle panel), the number of anchored IRRs increases proportionally to the length of the repeat. As the length of the repeat approaches and exceeds the mean fragment length (Figure 4.3, right panel), the number of paired IRRs increases linearly with the length of the repeat. Because anchored IRRs require one of the reads to “anchor” outside of the repeat region, the number of anchored IRRs is limited by the fragment length and remains constant as the repeat grows beyond the mean fragment length. It is important to note that real sequence data may introduce additional challenges compared to the simulated data. For example, sequence quality in low complexity regions or interruptions in the repeat may impact the ability to identify some IRRs.

Figure 4-3 - EHdn baseline detection of expanded repeats.

Diagram illustrating the types and counts of reads generated by simulating repeats of different lengths. When the repeat is shorter than the read length (left panels), there are no IRRs associated with the repeat. When a repeat is longer than the read length but shorter than the fragment length (middle panels), anchored IRRs but no paired IRRs are present. As the repeat length approaches and exceeds the fragment length (right panels), paired IRRs are generated in addition to anchored IRRs.

4.3.2 Detection of pathogenic expansions in the repeat expansion cohort

Given a sufficient number of samples with the same phenotype, pathogenic REs may be identified by searching for regions with significantly longer repeats in cases compared to controls. To demonstrate the feasibility of such analyses, we analyzed 91 Coriell samples with experimentally confirmed expansions in repeats associated with Friedreich’s ataxia (FRDA; N=25), myotonic dystrophy type 1 (DM1; N=17), Huntington’s disease (HD; N=14), and fragile X syndrome (FXS; N=35). This dataset has been previously used to benchmark the performance of existing targeted methods (Dolzhenko, van Vugt et al. 2017, Tankard, Bennett et al. 2018, Mousavi, Shleizer-Burko et al. 2019).

The pathogenic cutoffs for FRDA, DM1, and FXS repeats are greater than the read length, so our analysis of simulated data suggests that anchored IRRs are likely to be present in each sample with one of these expansions (Figure 4.3). The pathogenic cutoff for the HD repeat (120 bp) is less than the read length (150 bp) used in this study, so a subset of samples with Huntington’s disease may not contain relevant IRRs, making this expansion harder to detect de novo even though it is detectable with existing methods (Dolzhenko, van Vugt et al. 2017, Tang, Kirkness et al. 2017, Dolzhenko, Deshpande et al. 2019, Mousavi, Shleizer-Burko et al. 2019).

We separately compared samples with expansions in FXN (FRDA), DMPK (DM1), HTT (HD), or FMR1 (FXS) genes (cases) against a control cohort of 150 unrelated Coriell samples of African, European, and East Asian ancestry. Each case-control comparison revealed a clear enrichment of anchored IRRs at the corresponding repeat region (Figure 4.4). This analysis demonstrated that ExpansionHunter Denovo (EHdn) can re-identify known pathogenic repeat expansions without prior knowledge of the location or repeat motif when the pathogenic repeat length is equal to or longer than the read length, provided that the repeat is highly penetrant.

Figure 4-4 - Case-control analysis with repeat expansion cohort.

Genome-wide analysis of anchored IRRs comparing cases with known pathogenic expansions in DMPK, FXN, FMR1 and HTT genes (top to bottom) to 150 controls.

4.3.3 Limitations of genome-wide methods for repeat expansion calling

To evaluate the limitations of catalog-based approaches (Dolzhenko, van Vugt et al. 2017, Tang, Kirkness et al. 2017, Dashnow, Lek et al. 2018, Tankard, Bennett et al. 2018, Mousavi, Shleizer-Burko et al. 2019), we curated a set of 53 pathogenic or potentially pathogenic repeats (Table 4.1) and checked if they were present in two commonly-used catalogs: (a) STRs with up to 6bp motifs from the UCSC genome browser simple repeats track (Benson 1999, Kent, Sugnet et al. 2002) utilized by STRetch and exSTRa and (b) the GangSTR catalog. Nine of the known pathogenic repeats are not present in the reference genome and hence are absent from both catalogs. Out of the remaining 44 loci, 22 loci are present in both catalogs, 12 are missing from the GangSTR catalog and present in the UCSC catalog, five are missing from the UCSC catalog and present in the GangSTR catalog, and five are missing from both catalogs (Figure 4.5, Table 4.1). While it is possible to update the catalogs to include these known pathogenic repeats, the number of missing potentially-pathogenic REs remains unknown.
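The catalog overlap counts quoted above can be re-derived from the presence flags in Table 4.1. The sketch below transcribes those flags in a compact `GENE:<STRetch><GangSTR>` form (the encoding and the category labels are conventions of this example) and tallies the four categories; the STRetch column corresponds to the UCSC simple repeats catalog.

```python
from collections import Counter

# Presence flags transcribed from Table 4.1 as GENE:<STRetch/UCSC><GangSTR>
# (Y = present in the catalog, N = absent).
TABLE_4_1 = """
AR:YN ARX:YY ATN1:YN ATXN1:YY ATXN2:YY ATXN7:YY CACNA1A:YY COMP:NY DMPK:YY HTT:YY
JPH3:YY PPP2R2B:YY TCF4:YN AFF2:YN AFF3:YN ATXN3:YY ATXN8:YY ATXN10:YY C9orf72:YY
C11orf80:YY CBL:YY CNBP:YY CSTB:NY DIP2B:NY FMR1:YY FRA10AC1:YY FXN:YN GLS:YY
LOC642361:NY LRP12:YN NOP56:YY NOTCH2NLC:NN TMEM185A:YN XYLT1:NY ZNF713:YN
FOXL2:NN HOXA13:NN HOXD13:YN PABPN1:NN RUNX2:YY SOX3:NN TBP:YY ZIC2:YN PHOX2B:YN
"""

LABELS = {"YY": "both", "YN": "UCSC only", "NY": "GangSTR only", "NN": "neither"}


def catalog_overlap(table):
    """Count loci by which of the two catalogs contain them."""
    flags = dict(entry.split(":") for entry in table.split())
    return Counter(LABELS[flag] for flag in flags.values())


overlap = catalog_overlap(TABLE_4_1)
```

The tally gives 22 loci in both catalogs, 12 in the UCSC catalog only, five in the GangSTR catalog only, and five in neither, for a total of 44 reference loci.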

Figure 4-5 - Venn diagram demonstrating catalog limitations.

Venn diagram comparing three sets of STR definitions including curated Pathogenic Loci (orange), the GangSTR database (blue), and the STRetch UCSC simple repeats database (purple). The Venn diagram is not normalized to size, and the number of variants in each overlap is shown.

Small Insertions: Insertions <150 bp at pathogenic lower bound

Gene                    PMID      Repeat Motif  Pathogenic Lower Bound  Motif In-frame of Gene  STRetch  GangSTR
AR                      29398703  GCA           38                      CAG                     Y        N
ARX                     29946432  GCG           20                      CGC                     Y        Y
ATN1                    29398703  CAG           48                      CAG                     Y        N
ATXN1                   29398703  CTG           39                      CAG                     Y        Y
ATXN2                   29398703  CTG           33                      CAG                     Y        Y
ATXN7                   29398703  CAG           37                      CAG                     Y        Y
CACNA1A                 29398703  CTG           20                      CAG                     Y        Y
COMP                    9887340   GTC           6                       GAC                     N        Y
DMPK                    29398703  CAG           50                      CTG                     Y        Y
HTT                     29398703  CAG           36                      CAG                     Y        Y
JPH3                    29398703  GCT           41                      GCT                     Y        Y
PPP2R2B                 29398703  GCT           43                      AGC                     Y        Y
TCF4                    28832669  CAG           40                      CTG                     Y        N

Large Insertions: Insertions >150bp at pathogenic lower bound

Gene                    PMID      Repeat Motif  Pathogenic Lower Bound  Motif In-frame of Gene  STRetch  GangSTR
AFF2                    30503517  GCC           200                     GCC                     Y        N
AFF3                    24763282  GCC           300                     CGG                     Y        N
ATXN3                   29398703  CTG           55                      CAG                     Y        Y
ATXN8                   29398703  CTG           80                      CTG                     Y        Y
ATXN10                  29398703  ATTCT         280                     ATTCT                   Y        Y
C9orf72                 29398703  GGCCCC        30                      GGGGCC                  Y        Y
C11orf80                18160775  GCG           500                     GCG                     Y        Y
CBL                     7603564   CGG           500                     CGG                     Y        Y
CNBP                    29398703  CAGG          50                      CAGG                    Y        Y
CSTB                    9012407   CGCGGGGCGGGG  40                      CGCGGGGCGGGG            N        Y
DIP2B                   17236128  GGC           500                     GGC                     N        Y
FMR1                    29398703  CGG           200                     CGG                     Y        Y
FRA10AC1                15203205  CCG           200                     CGG                     Y        Y
FXN                     29398703  GAA           66                      GAA                     Y        N
GLS                     30970188  GCA           650                     GCA                     Y        Y
LOC642361 / NUTM2B-AS1  31332380  CGG           90                      GCC                     N        Y
LRP12                   31332380  CCG           90                      GGC                     Y        N
NOP56                   30503517  GGCCTG        1500                    GGCCTG                  Y        Y
NOTCH2NLC / NBPF19      31332381  GGC           90                      GGC                     N        N
TMEM185A                7874164   GCC           300                     GGC                     Y        N
XYLT1                   30554721  GCC           500                     -                       N        Y
ZNF713                  25196122  CGG           85                      CGG                     Y        N

Degenerate Insertions: Coding regions with flexible nucleotides

Gene                    PMID      Repeat Motif  Pathogenic Lower Bound  Motif In-frame of Gene  STRetch  GangSTR
FOXL2                   12529855  NGC           22                      GCN                     N        N
HOXA13                  10839976  NGC           18                      GCN                     N        N
HOXD13                  20974300  GCN           22                      GCN                     Y        N
PABPN1                  9462747   GCN           11                      GCN                     N        N
RUNX2                   9182765   GCN           27                      GCN                     Y        Y
SOX3                    12428212  NGC           26                      GCN                     N        N
TBP                     29398703  CAN           43                      CAN                     Y        Y
ZIC2                    11285244  GCN           25                      GCN                     Y        N
PHOX2B                  12640453  NGC           500                     GCN                     Y        N

Table 4-1 - Curated pathogenic STRs in the reference genome.

Definitions of small, large, degenerate, and complex repeat expansions. Gene, publication source (PMID), repeat motif, pathogenic lower bound, motif in frame of gene, and presence in the STRetch/GangSTR databases are listed for each repeat locus. For degenerate repeats, ambiguous nucleotides are represented with ‘N’ and the locus is defined with respect to the amino-acid expansion (e.g. poly-Alanine).

4.3.4 Recovering pathogenic expansions in the repeat expansion cohort

In many discovery projects it can be difficult to isolate patients that harbor the same repeat expansion based on the phenotype alone. For instance, the repeat expansion in the C9orf72 gene is present in fewer than 10% of ALS patients and many ataxias can be caused by expansions of a variety of repeats. Such problems call for analysis methods that are suitable for heterogeneous disease cohorts.

To solve this problem, we follow the approach taken previously by others and compare each case sample against the control cohort to identify outliers (Dashnow, Lek et al. 2018, Tankard, Bennett et al. 2018). To demonstrate the efficacy of this approach, we combined each sample from the pool of samples with expansions in FXN, DMPK, HTT, and FMR1 genes with 150 controls to generate a total of 91 datasets, each containing 151 samples. We then performed an outlier analysis on the counts of anchored IRRs (Methods) in each dataset.

In 81% of the datasets, the expanded repeat ranked within the top 10 repeats based on the outlier score (Figure 4.6A). This number increased to 84% when the analysis was restricted to short motifs between 2 and 6 bp (Figure 4.6B). EHdn performed well for DMPK and FXN repeats, identifying these REs within the top 10 ranks for 41 out of 42 cases. The FMR1 expansion was only ranked in the top 10 for 24 out of 35 cases known to have the expansion. This result is consistent with a previous comparison, which found this locus had the poorest performance across all RE detection tools (Tankard, Bennett et al. 2018). The performance for the HTT repeat is surprisingly good considering that EHdn was not designed to detect REs shorter than the read length. The rankings improved further when the analysis was restricted to repeats located close to exons of brain-expressed genes (Figure 4.6C) or when multiple cases (five in this example) were included in the analysis (Figure 4.6D).

Figure 4-6 - Analysis of known expansions in real data.

Ranking of known expansions based on the outlier score computed for anchored IRRs. Each rank originates from a genome-wide analysis of a dataset consisting of one (A-C) or five (D) samples with a known expansion and 150 controls. (A) Ranks for all identified repeats. (B) Ranks for repeats with 2-6bp motifs. (C) Ranks for repeats located in the 5Kbp region around exons of brain-expressed genes. (D) Ranks for datasets with five case samples.

4.3.5 Landscape of large repeat expansions in diverse population

To explore the landscape of large repeats in the general population, we applied EHdn to 150 unrelated Coriell samples of African, European, and East Asian ancestry (Illumina). To limit this analysis to higher confidence repeats, we considered loci where EHdn identified at least five anchored IRRs, corresponding to repeats spanning about 150-200 base pairs and longer, and motifs supported by at least five paired IRRs in a single sample. Altogether, EHdn identified 1,574 unique motifs spanning between two and 20 bp, 94% of which were longer than 6 bp. Of these, 19% were found in at least half of the samples and 23% were found in just one sample. On average, each person had 660 loci with long repeats. As expected, the telomeric hexamer motif AACCCT is particularly abundant and was found in about 23,000 IRRs per sample. Similarly, the centromeric pentamer motif AATGG was found in about 5,000 IRRs per sample. To estimate the number of repeats located outside of the telomeric and centromeric regions (Kent, Sugnet et al. 2002, Karolchik 2004), we stratified the repeats by their distance to the closest telomere/centromere (Figure 4.7). We found that, on average, about 170 of the identified repeats are located closer than 2 Mbp from the nearest telomere or centromere and about 200 are located further than 15 Mbp. We also showed that EHdn can accurately detect long repeats in a control sample by validating 77% of repeats supported by two or more reads in Pacific Biosciences long-read data.
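The stratification by distance to the closest telomere or centromere can be sketched as below. The chromosome length and centromere interval in the usage example are made-up values, and treating the chromosome ends as the telomere positions is an assumption of this sketch; in practice these coordinates would come from a genome annotation track.

```python
def distance_to_landmark(pos, chrom_len, centromere):
    """Distance in bp from `pos` to the nearest chromosome end or centromere boundary."""
    cen_start, cen_end = centromere
    to_telomere = min(pos, chrom_len - pos)
    if cen_start <= pos <= cen_end:
        to_centromere = 0
    else:
        to_centromere = min(abs(pos - cen_start), abs(pos - cen_end))
    return min(to_telomere, to_centromere)


def stratum(distance_bp):
    """Assign a distance to one of the bins discussed in the text."""
    mbp = distance_bp / 1e6
    if mbp < 2:
        return "<2 Mbp"
    if mbp > 15:
        return ">15 Mbp"
    return "2-15 Mbp"


# Hypothetical 100 Mbp chromosome with a centromere at 40-43 Mbp.
example = stratum(distance_to_landmark(1_000_000, 100_000_000, (40_000_000, 43_000_000)))
```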

Figure 4-7 – Distance to telomere/centromere for identified repeats.

Proximity (in megabase pairs) of long (150-200bp and longer) repeats identified in the control population to the closest telomere/centromere.

4.3.6 Simulation of repeat expansions for benchmarking genome-wide methods

To demonstrate that EHdn offers similar performance to catalog-based methods on expansions exceeding the read length, we simulated expansions of the 35 non-degenerate STRs present in the reference (Table 4.1). We focused our comparisons on STRetch because this method was specifically designed to search for novel expansions using a genome-wide catalog and because it was shown to have similar performance to other existing methods (Tankard, Bennett et al. 2018).

Our simulations show that EHdn ranks 33 out of 35 pathogenic repeats in the top 10 at sufficiently long lengths. STRetch prioritizes 26 out of 29 repeats in the top 10 and the six remaining repeats are missing from its catalog. One of the REs detected by EHdn and missed by STRetch is the pathogenic CSTB repeat with a motif length of 12bp. This is because STRetch is limited to detection of motifs with length up to six base pairs.

Some recently discovered REs are composed of motifs that are not present in the reference genome. One such example is the recently-discovered repeat expansion of non-reference motif AAGGG causing cerebellar ataxia with neuropathy and bilateral vestibular areflexia syndrome (CANVAS) (Cortese, Simone et al. 2019, Rafehi, Szmulewicz et al. 2019). Rafehi et al. (Rafehi, Szmulewicz et al. 2019) demonstrated that EHdn is the only computational method capable of discovering this expansion. To further benchmark EHdn’s ability to detect REs with complex structure we simulated nine complex REs with non-reference motifs known to cause disease (Figure 4.8). For eight out of nine REs, including a simulated version of the CANVAS expansion, EHdn was able to detect one or both of the expanded repeats in each locus.

Figure 4-8 - Structure of nine complex pathogenic repeats.

Pathogenic complex repeats present in the reference genome with motifs TTTTA, AAAAT, or AAAAG, and present in pathogenic forms with either altered motifs, or expanded reference motifs in combination with altered motifs.

Since we anticipate that EHdn will be used for discovery of novel pathogenic REs, we additionally simulated expansions of STRs with an associated effect on gene expression (Fotsing, Margoliash et al. 2019). From the catalog of fine-mapped eSTRs, we selected 12 loci with varying motifs linked with known disease genes. Expansions with 200 copies were simulated at these loci and EHdn prioritized all 12 REs in the top five. STRetch also performed well at these loci, although it missed three because they were not present in its catalog.

To highlight that EHdn is not limited to short STR motifs, we tested its capacity to detect an expansion of a known pathogenic repeat with 12 bp motif in the promoter region of the CSTB gene. Using EHdn we detected this simulated expansion at the pathogenic lower bound (40 copies). We further demonstrate the ability of our method to detect REs with longer motifs at other similar loci. We simulated expansions of 27 repeats with 7-10 bp motifs within genes implicated in autosomal recessive genetic diseases. All 27 loci were ranked in the top five in both the locus and motif analyses.

4.3.7 Guiding the usage of EHdn in RE identification

To guide the filtering of candidate REs, we examined how many rare REs healthy samples are expected to carry, determined a Z-score cutoff, and contrasted it with the simulated pathogenic REs described above. Because one of the key advantages of EHdn is the discovery of novel pathogenic repeat expansions, it is important to know how many rare repeats to expect within healthy individuals. To assess this, we ran the EHdn outlier analysis considering each of the 150 samples independently as a case, and compared this sample against the remaining 149. On average, samples had five expansions with a Z-score greater than 10 (Figure 4.9 A). Annotating these repeats and extracting the genic annotations showed that most rare insertions occur within intergenic and intronic regions, consistent with the expectation as these regions account for the majority of the genome (Figure 4.9 B).

Figure 4-9 - Rare REs in the control cohort.

Rare REs identified from the outlier analysis considering each of the 150 controls as a case and comparing against the other 149, with Z-score bins for the observed count of REs in each of the 150 samples plotted as scatter points overlaid with boxplots (A). The variants are further annotated by genic impact (downstream, exonic, intergenic, intronic, ncRNA exonic, ncRNA intronic, splicing, upstream, and in the UTRs), and plotted again as scatter points overlapping boxplots (B).

Next, we compared this with the Z-scores from the simulated pathogenic REs. A comparison between rank and Z-score shows that setting a cutoff of greater than 10 places the pathogenic RE in the top 5, consistent with the observation above (Figure 4.10 A). We further examined the relationship between Z-score and repeat size (Figure 4.10 B), and found that with the exception of ATXN10 (yellow points), a pentamer repeat expansion which is difficult to genotype with multiple tools (data not shown), nearly all repeats are above the Z-score threshold of 10 (horizontal dashed line) as their size exceeds the read length (vertical dashed line). Taken together, setting a Z-score cutoff of 10 will enable the identification of expanded pathogenic repeats in most cases.

Figure 4-10 - Z-score characteristics of simulated pathogenic REs.

Simulated REs are compared for their rank vs. Z-score for each of the reference-based pathogenic repeats (A), showing that the Z-score cutoff of 10 (vertical dashed line) prioritizes the majority of repeats in the top 10. Comparing repeats by size against Z-score (B) shows that as the repeat increases in size, the Z-score increases as well, with the exception of ATXN10 (yellow points).

4.4 Discussion

Here, we introduced a new software tool, ExpansionHunter Denovo (EHdn), that can identify novel REs using high-throughput WGS data. We tested EHdn by comparing samples with known REs against a control group of 150 diverse individuals and performed simulation studies across a range of pathogenic or potentially pathogenic REs. These analyses show that EHdn offers comparable performance to targeted methods on known pathogenic repeats while also being able to detect repeats absent from existing catalogs. In particular, EHdn can be used for discovery of novel repeat expansions not detectable by the current methods because it: (a) does not require prior knowledge of the genomic coordinates of the REs, (b) can detect nucleotide composition changes within the expanded repeats, and (c) is applicable to both short and long motifs.

Recent discoveries have highlighted the importance of complex pathogenic repeat expansions involving non-reference insertions (Sato, Amino et al. 2009, Seixas, Loureiro et al. 2017, Ishiura, Doi et al. 2018, Corbett, Kroes et al. 2019, Florian, Kraft et al. 2019, Yeetong, Pongpanich et al. 2019). EHdn is currently the only method capable of discovering these expansions from BAM or CRAM files without the need for re-alignment of the supporting reads. Additionally, we anticipate that EHdn can replace existing more manual and less computationally efficient discovery pipelines, such as the TRhist-based pipeline (Ishiura, Shibata et al. 2019), where identification of enriched repeat motifs is followed by ad-hoc realignment of relevant reads to the reference genome and manual evaluation of loci where these reads align.

EHdn has some limitations and areas for further improvement. It is limited to the detection of repetitive sequences longer than the read length and cannot, in general, detect shorter expansions. However, detection of these shorter expansions is feasible with the existing catalog-based methods or with structural variant detection methods. It may be possible to extend the detection limit to shorter repeat expansions; however, increasing the search space will lead to increased runtime and reduced power to detect outlier expansions. It is also important to note that while EHdn can analyze reads produced by a variety of read aligners, the same aligner should be used for all samples involved in comparative analyses to eliminate false signals due to aligner differences. All parameters of the sequencing assay (sequencing platform and library preparation kits) should also match as closely as possible to avoid coverage biases and other technical artifacts.

In many previous studies, identification of pathogenic REs required years of work and involved linkage studies to isolate the region of interest followed by targeted sequencing to identify the likely causative mutations. EHdn can be used as a front-line tool in such studies to rapidly identify candidate REs. Once identified, these novel REs can be genotyped using targeted methods (Dolzhenko, van Vugt et al. 2017, Tang, Kirkness et al. 2017, Dolzhenko, Deshpande et al. 2019, Mousavi, Shleizer-Burko et al. 2019) or molecular assays. The benefits of this approach were demonstrated in a recent study, where EHdn successfully identified a novel complex pathogenic RE (Rafehi, Szmulewicz et al. 2019). Hundreds of thousands of genomes from many large disease cohorts have now been sequenced using short-read sequencing and await additional analyses such as RE detection. Additionally, while it is generally easier to analyze the structure of the expanded repeats in long-read data (Mitsuhashi, Frith et al. 2019, Roeck, De Roeck et al. 2019), combining short-read sequencing datasets and methods with long-read data can offer a cost-effective way to conduct large-scale repeat expansion discovery projects.

4.5 Conclusion

We presented ExpansionHunter Denovo, a new genome-wide and catalog-free method to search for REs in WGS data. We demonstrated that EHdn consistently detects REs in real and simulated data. Given the widespread adoption of WGS for rare disease diagnosis, we expect that EHdn will enable further RE discoveries that will likely resolve the genetic cause of disease in many individuals.

Chapter 5: Demonstrating the utility of flexible sequence queries against indexed short reads with FlexTyper

5.1 Introduction

Short-read DNA sequencing enables diverse molecular investigations across life science applications spanning from medicine to agriculture. Obtaining useful information from sequencing datasets typically involves either performing de novo assembly, or mapping the data against one or more reference genomes. The process of mapping sequencing reads (short pieces of DNA read-outs from the DNA sequencer) against reference genomes, or a collection of reference genomes, is made computationally tractable by indexing the reference sequences, commonly performed with a Burrows Wheeler transform or FM-index. Several data analysis pipelines, whether they focus on quantification (e.g. observed gene expression in RNA sequencing data), or identifying sequence differences between a sample and a reference genome

(e.g. genotyping), leverage reference genome mapping as a primary analysis component.

While the status quo has been to utilize linear representations of reference genomes, a transition away from a single haploid reference genome is inevitable (Ballouz, Dobin et al. 2019, Yang,

Lee et al. 2019). This transition is supported by several factors. A large amount of structural variation exists between human populations (Feuk, Carson et al. 2006, MacDonald, Ziman et al.

2014, Levy-Sakin, Pastor et al. 2019). A recent study focusing on ~1000 individuals of African descent identified nearly 200 million bases missing from the most recent reference genome

(Sherman, Forman et al. 2019). Static linear reference genomes which do not capture these large differences between populations impose challenges for accurate genotyping (Ballouz, Dobin et al. 2019, Yang, Lee et al. 2019), with implications in medicine and association studies. An alternative to choosing from a collection of population-specific reference genomes is to use emerging graph genome approaches to unite the data (Dilthey, Cox et al. 2015). As highlighted in a review (Paten, Novak et al. 2017), in either approach, a key challenge in the future will be to determine the most appropriate reference genome(s), or path(s) through a graph genome, to maximize genotyping performance. Knowledge of single nucleotide polymorphism (SNP) genotypes distributed across the genome can be used to guide such choices.

Currently, the primary approach for identifying SNP genotypes across the genome utilizes computationally expensive reference-based read mapping and variant calling strategies (Nielsen,

Paul et al. 2011). Inferring ancestry from specific, population-discriminating SNPs can be performed rapidly with the recently published tool Peddy, which uses fewer than 25,000 SNPs to identify ancestry through principal component analysis (Pedersen and Quinlan 2017). Previous work showed that it is possible to genotype predefined SNPs from unmapped sequence data, circumventing the read mapping and variant calling process (Shajii, Yorukoglu et al. 2016,

Dolle, Liu et al. 2017, Sun and Medvedev 2019). Some approaches focus on kmer (short sequences of length k) hashing and matching to predefined target kmers to perform genotyping of known SNPs, as demonstrated in the VarGeno and LAVA frameworks (Shajii, Yorukoglu et al. 2016, Sun and Medvedev 2019). These approaches are fast, but rely upon indexes of kmers extracted from the reference genome and SNP databases, thus reducing their flexibility for kmers of different length and source. A separate approach is taken by Dolle et al., wherein the entire

1000 Genomes dataset is compressed into an FM-index and queried with kmers spanning polymorphic sites, thus demonstrating the utility of scanning unmapped reads for predefined kmers of interest. The “reverse mapping” highlighted in their approach was applied to aggregated data, but the concept can be extended to the analysis of individual genomes if implemented in a flexible way for diverse types of queries.

Within the paradigm of indexing reads and performing reverse mapping, other useful operations can be performed with increased utility, especially in cases with a diverse set of informative sequences. One example of this is within RNA sequencing (RNA-seq), where analysis of cancer

RNA-seq datasets can reveal the presence of viral pathogens within patient data (Klijn, Durinck et al. 2015). Several tools have been developed to specifically detect these viral pathogens from sequencing data including viGEN (Bhuvaneshwar, Song et al. 2018) and VirTect (Xia, Liu et al.

2019). However, they are hampered by a computationally expensive iterative mapping procedure which first maps against the human reference genome and then subsequently maps against viral genome collections. Other methods, such as Centrifuge (Kim, Song et al. 2016) and Kraken2

(Wood, Lu et al. 2019), rely upon kmer searches against large viral and bacterial databases. Both of these methods are powerful, but come with drawbacks of flexibility and reliance upon phylogenetic relationships between target sequences. Specifically, they require re-indexing of search databases for different query kmer lengths or when the target sequences change.

Nevertheless, these tools are broadly used and thus serve as good comparators for efficacy, as they have both been demonstrated to have utility in detecting viral pathogens within cancer RNA-seq datasets by examining kmer content (https://www.sevenbridges.com/centrifuge/).

Combining the current drive to decrease our reliance upon linear reference genomes, and the wealth of demonstrated utility of reverse mapping approaches, we developed FlexTyper.

FlexTyper is a computational framework which enables the flexible indexing and querying of raw next generation sequencing reads. We show example usage scenarios for FlexTyper by demonstrating the high accuracy of reference-free genotyping of SNPs in single samples, and the ability to identify foreign pathogen sequences within short-read datasets. We hope the flexibility afforded by the framework underpinning FlexTyper will fuel the emerging trend away from the necessity for a static reference genome that currently lies at the heart of the majority of genomic analysis tools.

5.2 Methods

5.2.1 Overview of FlexTyper

Usage can be broken down into three steps: 1) query generation, 2) indexing the raw reads, and 3) querying against the FM-index (Figure 5.1). For query generation, we allow for both custom user query generation, as well as pre-constructed queries from useful databases, such as CytoScanHD array probe queries. Custom queries designed to capture genomic loci can be generated by pairing a user-provided VCF (format v4.3) with a reference genome fasta file. For the capture of potential pathogen sequences, we also allow query generation from one or more fasta files. The files produced from query generation are used as input for subsequent index query operations. The second step is the production of an FM-index from a set of short-read sequences in fastq format. This process includes reverse-complementing the entire read file, and concatenating the transformed reads with the original set. This is done so that the reverse complements of the query kmers do not need to be searched separately. The third step is the core FlexTyper search algorithm, which takes the query input file, generates search kmers, and scans the FM-index for matches. This step creates an output with matching format to the input file, with appended counts of matching reads for each query. A detailed breakdown of these three components is described below.
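The reverse-complement preprocessing described above can be illustrated with a minimal Python sketch. The function names are hypothetical and are not part of FlexTyper; the sketch only shows why appending reverse-complemented reads lets a forward-only kmer search recover reads from either strand.

```python
# Illustrative sketch only; these helper names are assumptions, not FlexTyper code.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def reads_with_reverse_complements(reads):
    """Return the original reads followed by their reverse complements."""
    reads = list(reads)
    return reads + [reverse_complement(r) for r in reads]


reads = ["ACGTT", "GGCAT"]
indexed_set = reads_with_reverse_complements(reads)

# The kmer "AACG" lies on the opposite strand of the first read, so it is
# only found once the reverse complements have been appended.
found_in_original = any("AACG" in r for r in reads)
found_in_indexed_set = any("AACG" in r for r in indexed_set)
```

The cost of this design is a doubling of the text to be indexed, in exchange for halving the number of kmers that must be searched.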

Figure 5-1 - Overview of FlexTyper.

FlexTyper has three primary components: query generation, FM indexing reads, and querying against the FM-index. Query generation includes the capacity to translate VCF files into query files given a reference genome file (Genome Fasta), or to directly create queries from fasta sequences including pathogen genome sequences. Modules VCF2Query.py and Fasta2Query.py facilitate this process. The second component involves creating an FM-index of the raw reads, after reverse complementing and concatenating the read set and performing optional preprocessing steps. The third component executes the queries against the FM-index to produce output files with counts of reference and alternate sequences within the query files.

5.2.2 Search method for FlexTyper

Querying the FM-index for user selected sequences can be conceptually divided into four steps: 1) kmer generation; 2) kmer filtering; 3) kmer searching; and 4) result collation (Figure 5.2).

There are two primary methods of kmer generation for a query: a centered search where the middle position of the query is included in all kmers, and a sliding search which starts at one end of the query and uses a sliding window approach to generate the kmers (Figure 5.2). Centered search can be used for genotyping or estimating coverage over a single position, and the sliding search can be used to count reads which match to any part of a query sequence. The --ignore-duplicates parameter filters query kmers by ignoring kmers that occur in multiple query sequences. After filtration, the kmers are searched for within the FM-index using C++ multithreading and asynchronous programming, using either a single thread on a single index, multiple threads on a single index, a single thread on multiple indexes, or multiple threads on multiple indexes (Figure 5.2). Importantly, asynchronous programming allows the number of threads used during searching to be increased beyond the number of available CPUs. The output from this search process is a collated results map containing the positions of each kmer within the FM-index. These positions are translated to read IDs, and finally collapsed into query counts using the kmer-to-query mapping. Importantly, if multiple kmers from the same query hit the same read, they are recorded as a single count at the query level. For cases of multiple indexes being searched in parallel, the kmer searching and assignment to the query count are performed independently and then merged to produce a final query count table.
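The two kmer generation modes can be sketched in a few lines of Python. This is a simplified illustration, not the FlexTyper C++ implementation: the function names are hypothetical, and the exact choice of the central position and of window start positions is an assumption made for the example.

```python
def sliding_kmers(query, k, step=1):
    """Kmers of length k taken from one end of the query, advancing by step."""
    return [query[i:i + k] for i in range(0, len(query) - k + 1, step)]


def centered_kmers(query, k, step=1):
    """Kmers of length k that all overlap the middle position of the query.

    Assumption: the middle position is index len(query) // 2.
    """
    center = len(query) // 2
    first = max(0, center - k + 1)          # left-most window still covering the center
    last = min(center, len(query) - k)      # right-most window still covering the center
    return [query[i:i + k] for i in range(first, last + 1, step)]


query = "AAAACGTTTT"                         # middle position (index 5) is the "G"
centered = centered_kmers(query, k=4)        # every kmer contains the central "G"
sliding = sliding_kmers(query, k=4, step=3)  # tiles the query from its left end
```

With a centered search, every kmer reports on the same position, which suits genotyping; a sliding search covers the whole query, which suits presence or abundance questions.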

Figure 5-2 - Query search workflow.

Workflow for query search against the FM-index, starting with input queries and settings defined in the Settings.ini file. This example sets a centered search with ignoring of duplicate kmers enabled. 1) Kmer generation has two modes, centered search and sliding search. For a centered search, the position of interest lies in the middle of the query, and kmers are designed to overlap that central position with defined length (k) and step (w). 2) If the ignore-duplicates option is set, kmers collated from the query set are filtered to remove any kmers which were found in multiple query sequences. 3) The filtered kmers are then searched for within the FM-index (left two panels) or multiple indexes (right two panels) of the read set. This can be done using single (top two panels) or multiple (bottom two panels) threads. 4) The results corresponding to a position within the FM-index are then translated back into reads, with hits on reverse complement reads assigned to the primary read, and collapsed into a set for each query. The final counts are reported per query.
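The filtering and collation steps (2 and 4) can be illustrated as follows. The sketch assumes, purely for illustration, that reverse-complemented reads are stored after the original reads in the same order, so that read index i + n_reads refers to the reverse complement of read i. The function names and data layout are hypothetical.

```python
from collections import defaultdict


def filter_shared_kmers(query_kmers):
    """Drop kmers that occur in more than one query (ignore-duplicates behaviour)."""
    owners = defaultdict(set)
    for query_id, kmers in query_kmers.items():
        for kmer in kmers:
            owners[kmer].add(query_id)
    return {
        query_id: [kmer for kmer in kmers if len(owners[kmer]) == 1]
        for query_id, kmers in query_kmers.items()
    }


def count_reads_per_query(query_kmers, kmer_hits, n_reads):
    """Collapse kmer hits into one count per distinct read for each query.

    kmer_hits maps a kmer to read indexes in the concatenated read set
    (original reads first, then their reverse complements).
    """
    counts = {}
    for query_id, kmers in query_kmers.items():
        reads = set()
        for kmer in kmers:
            for read_index in kmer_hits.get(kmer, ()):
                reads.add(read_index % n_reads)  # fold reverse complement onto primary read
        counts[query_id] = len(reads)
    return counts


query_kmers = {"ref": ["ACGT", "CGTA"], "alt": ["ACGT", "CGTC"]}
unique_kmers = filter_shared_kmers(query_kmers)   # "ACGT" is shared, so it is removed
kmer_hits = {"CGTA": [0, 2, 3], "CGTC": [1]}       # made-up hits for three reads
counts = count_reads_per_query(unique_kmers, kmer_hits, n_reads=3)
```

Using a set per query ensures that a read hit by several kmers, or by both a kmer and its reverse-complement copy, contributes a single count.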

5.2.3 Query generation

FlexTyper supports flexible query generation giving users the capacity to query for any target sequence or allele within their read dataset. Query files can be generated from an input fasta and

VCF file (VCF2Query.py), or directly from a fasta file (Fasta2Query.py). Potentially useful queries, including those presented here, are provided online and include all sites from the

CytoScanHD chromosomal microarray, and ancestry discriminating sites (Pedersen and Quinlan

2017). These predefined query sets are available through git-lfs in the online FlexTyper github repository (https://github.com/wassermanlab/OpenFlexTyper). If users wish to directly query a short-read dataset with a set of predetermined kmers, they can provide the kmers as a fasta file and set the k parameter to the length of the kmers in the file.
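As an illustration of how a reference and an alternate query can be derived from a variant record and a reference sequence, the sketch below builds both sequences centered on a single SNP. It is not the VCF2Query.py implementation; the function name, the `flank` parameter, and the toy reference string are assumptions made for the example.

```python
def snp_queries(reference, pos, ref, alt, flank):
    """Build reference and alternate query sequences centered on a SNP.

    pos is 1-based, following the VCF convention; flank is the number of
    bases kept on each side of the variant position.
    """
    i = pos - 1
    if reference[i] != ref:
        raise ValueError("REF allele does not match the reference sequence")
    left = reference[max(0, i - flank):i]
    right = reference[i + 1:i + 1 + flank]
    return left + ref + right, left + alt + right


toy_reference = "GATTACAGATTACA"  # hypothetical sequence, not a real locus
ref_query, alt_query = snp_queries(toy_reference, pos=7, ref="A", alt="G", flank=3)
```

Because both queries share the same flanks and differ only at the central base, a centered search over them counts reads supporting each allele.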

5.2.4 FM-index creation

Generating the FM-index from short-read sequencing datafiles is performed in two steps: preprocessing and indexing. The focus of our work is not on the algorithms used to construct the FM-index, and hence we use two existing utilities to generate a compatible FM-index for FlexTyper. The toolkit Seqtk is used for reformatting compressed fastq files by removing quality scores and non-sequence information to create a sequence-only fasta file, to which the reverse complement of the reads is appended. The output fasta file is then processed using the SDSL-Lite library to generate the FM-index. SDSL builds a suffix array that is used to generate the BWT of the input string, which is then compressed using a wavelet tree and subsampled. The resulting compressed suffix array is streamed to a binary index file. As the memory requirements for indexing large files can be burdensome, we support an option to split the input file and index each chunk of reads independently. Downstream search operations support the use of multiple indexes.
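As a conceptual illustration of how an FM-index supports kmer lookup, the sketch below builds a small, uncompressed index in pure Python and locates matches by backward search. It is not the SDSL-Lite implementation, which uses compressed structures (a wavelet tree and a sampled suffix array); the function names and the `$` sentinel are assumptions made for this example, and the naive suffix sort is only suitable for toy inputs.

```python
def build_fm_index(text):
    """Build a tiny, uncompressed FM-index: suffix array, C table and Occ table."""
    text += "$"  # sentinel, assumed absent from the input and smaller than A/C/G/T
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in suffix_array)  # text[-1] is the sentinel
    alphabet = sorted(set(text))
    # c_table[ch]: number of characters in the text strictly smaller than ch
    c_table, total = {}, 0
    for ch in alphabet:
        c_table[ch] = total
        total += text.count(ch)
    # occ[ch][i]: number of occurrences of ch in bwt[:i]
    occ = {ch: [0] for ch in alphabet}
    for b in bwt:
        for ch in alphabet:
            occ[ch].append(occ[ch][-1] + (b == ch))
    return suffix_array, c_table, occ


def find(pattern, index):
    """Return the sorted start positions of pattern, found by backward search."""
    suffix_array, c_table, occ = index
    lo, hi = 0, len(suffix_array)  # half-open interval of matching suffixes
    for ch in reversed(pattern):
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(suffix_array[i] for i in range(lo, hi))


index = build_fm_index("ACGTACGA")
hits = find("ACG", index)
```

The search cost depends on the kmer length rather than on the size of the indexed text, which is what makes repeated queries against a large read set practical.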

5.2.5 Post-processing of FlexTyper counts for downstream analysis

The output tables from the search process for genotyping can be translated into useful formats for downstream analysis using the fmformatter scripts

(https://github.com/wassermanlab/OpenFlexTyper/tree/master/fmformater). Currently, there is the capacity to output genotype calls in VCF, 23andMe, or Ancestry.com format. Genotype calls are derived here using a basic approach which assigns genotypes given a minimum read count parameter as follows:

Alt < minCount && Ref > minCount: Homozygous reference, 0/0

Alt > minCount && Ref > minCount: Heterozygous alternate, 0/1

Alt > minCount && Ref < minCount: Homozygous alternate, 1/1
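The three rules above translate directly into a small function. The sketch below is a minimal Python rendering, not the fmformatter code; the function name is hypothetical. The rules as stated do not cover counts equal to minCount or cases where both alleles fall below it, so returning a missing genotype (./.) for those cases is an assumption made here.

```python
def call_genotype(ref_count, alt_count, min_count):
    """Assign a genotype from reference and alternate read counts."""
    if alt_count < min_count and ref_count > min_count:
        return "0/0"  # homozygous reference
    if alt_count > min_count and ref_count > min_count:
        return "0/1"  # heterozygous
    if alt_count > min_count and ref_count < min_count:
        return "1/1"  # homozygous alternate
    return "./."      # assumption: anything outside the three stated rules is left uncalled


calls = [
    call_genotype(ref_count=30, alt_count=0, min_count=5),
    call_genotype(ref_count=15, alt_count=14, min_count=5),
    call_genotype(ref_count=0, alt_count=28, min_count=5),
    call_genotype(ref_count=2, alt_count=3, min_count=5),
]
```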

For searches which do not pertain to genotyping, the output tab-separated files can be used as count tables for observed query sequences.

5.2.6 Data sources for query generation

The CytoScanHD SNP probe set was created using the file CytoScanHD_Array.na33.annot.csv acquired from the Affymetrix site (https://www.thermofisher.com/ca/en/home/life-science/microarray-analysis.html). This file was processed using CytoscanSProbe2Query.py (https://github.com/wassermanlab/OpenFlexTyper).

GRCh37 ancestry sites were acquired from https://github.com/brentp/peddy/blob/master/peddy/GRCH37.sites. These sites were then converted into a query file using Sites2Query.py.

Pathogen fasta sequences were acquired from NCBI for EBV (gi|82503188|ref|NC_007605.1|

Human gammaherpesvirus 4), HIV-1 (gi|4558520|gb|AF033819.3| HIV-1), U21941.1 (U21941.1

Human papillomavirus type 70), and FR751039.1 (FR751039.1 Human papillomavirus type

68b). Each of these fasta files was then converted into a query file using the fasta_2_query.py script (https://github.com/wassermanlab/OpenFlexTyper/tree/master/fmformatter).

5.2.7 Generating kmers for testing search performance

To test the performance of the kmer search speed for FlexTyper, we generated sets of 1, 10, 100, and 1000 kmers coming from the CytoScanHD S-probe query set (defined above). These kmers were tested in an isolated fashion within the FlexTyper framework to identify the time required for searching only. We also used all the kmers generated from the CytoScanHD S-probe query set for testing the performance of the entire FlexTyper search algorithm in the hyperthreading analysis.

5.2.8 Samples for genotype analysis testing

To demonstrate the genotyping capacity of FlexTyper at known SNP sites, we used WGS datasets from the Polaris project Diversity Cohort and Kids Cohort

(https://github.com/Illumina/Polaris). These WGS data sets were downloaded as raw fastqs, mapped against GRCh37 genome using BWA mem (v0.7.5) (Li and Durbin 2009), and then converted to BAM format using Samtools (v1.9).

5.2.9 Simulation of pathogen-containing samples

To simulate pathogen samples in RNA-seq data, we first acquired fasta sequences for four different viruses (described above) and simulated reads at various depths of coverage for paired-end 150bp reads using the read simulator ART (Huang, Li et al. 2012). These reads were then concatenated with a human RNA-seq sample available through the Genome England project (https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-6523/samples/). For the multiple viral mixing cases, we chose five different human blood RNA-seq datasets to mix with differing read counts of each virus, as defined in Table 1. The blood RNA-seq data for each simulated patient were derived from the following sample accessions: Patient_1:ERR2322363, Patient_2:ERR2322364, Patient_3:ERR2322365, Patient_4:ERR2322366, Patient_5:ERR2322363.

5.2.10 BAM coverage calculation for comparison

BAM coverage is calculated from an input BAM file and a FlexTyper query file using the script QueryFromBam.py (https://github.com/wassermanlab/OpenFlexTyper/tree/master/extras/bamquery). This script utilizes the Pysam package to extract the reference and alternate alleles present at a given locus as defined from an input FlexTyper query file.

5.2.11 Using Peddy for ancestry, sex, and relatedness typing

After running FlexTyper on the ancestry + chrom X sites, we converted the output format to

VCF for input into Peddy. The resulting VCF files for all nine individuals were then compressed with bgzip and indexed with tabix (v1.9), and then merged with bcftools merge (v1.10.1). Next, a PED file was created detailing the relationships and sex of each individual based on information from the Polaris data repository (https://github.com/Illumina/Polaris). We used the merged VCF and the PED file as input to Peddy (v0.4.3) and ran with default settings to generate ancestry, relatedness, and sex-typing figures (Pedersen and Quinlan 2017).

5.2.12 Running centrifuge on simulated patient data

Centrifuge (v1.0.4) was run on the simulated “patient” fastqs, where each patient had mixed viral and human blood RNA-seq reads. The output report files from centrifuge were parsed for the viral names using grep: Human immunodeficiency virus 1 (HIV-1), Human herpesvirus 4 type 2

(EBV), and “apilloma” to get all strains of Human (and other) Papilloma viruses (HPV). The numReads (number of reads) column was used as a fair comparator to FlexTyper, which also used non-unique read counting for viral detection.
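The collation of report rows by name can be sketched in Python as an alternative to grep. The sketch assumes a tab-separated report with `name` and `numReads` columns; the function name is hypothetical, and the rows, taxonomy identifiers, and counts in the example are placeholders invented for illustration, not results from this work.

```python
import csv
import io


def sum_num_reads(report_text, pattern):
    """Sum numReads over all report rows whose name contains pattern."""
    reader = csv.DictReader(io.StringIO(report_text), delimiter="\t")
    return sum(int(row["numReads"]) for row in reader if pattern in row["name"])


# Placeholder report: every value below is made up for this example.
HEADER = ["name", "taxID", "taxRank", "genomeSize", "numReads", "numUniqueReads", "abundance"]
ROWS = [
    ["Human papillomavirus type 70", "1", "species", "7900", "12", "10", "0.0"],
    ["Human papillomavirus type 68b", "2", "species", "7800", "8", "6", "0.0"],
    ["Human immunodeficiency virus 1", "3", "species", "9200", "5", "5", "0.0"],
]
REPORT = "\n".join("\t".join(fields) for fields in [HEADER] + ROWS)

hpv_reads = sum_num_reads(REPORT, "apilloma")
hiv_reads = sum_num_reads(REPORT, "Human immunodeficiency virus 1")
```

Matching on the substring "apilloma" mirrors the grep pattern in the text and sums over every papillomavirus entry regardless of capitalization of the first letter.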

5.3 Results

5.3.1 Performance metrics for indexing and querying

We used a human whole genome sequencing (WGS) sample to demonstrate the indexing and querying capacities of FlexTyper. Our indexing strategy utilizes open source tools to build the FM-index on a high memory CPU, with at least 1000GB of RAM. While index creation optimization was not the focus of this work, indexing is feasible on standard systems with ~256GB of RAM, as long as the input read dataset is smaller than ~20GB (Table 5.1). Since our search function allows for multiple separate indexes, we incorporated the ability to sub-divide larger read sets into multiple smaller read sets that can be indexed in parallel. For querying, we generated a set of kmers designed from probes on the Affymetrix CytoScanHD genotyping microarray with a centered search process (Figure 5.2) for varying kmer lengths (Figure 5.3A). The CytoScanHD genotyping microarray, chosen for its broad usage in the field of human genetics, has probe sequences designed to uniquely detect well-characterized SNPs. There is a noticeable benefit from multithreading FlexTyper, which we demonstrated by isolating the kmer searching process across 1 to 32 threads (Figure 5.3A). As the number of threads increases, we observe a continuous decrease in search time, and by comparing between observed and expected performance, the performance advantage gained from additional threads does not plateau at 32 threads (Figure 5.3B). As the software is written using asynchronous programming, we tested the upper bound on allocated data threads given a fixed set of 32 CPUs on a single machine with 256GB of RAM. For this analysis, we used an extended set of queries from the CytoScanHD SNP set, for a total of ~6.4 million kmers. We increased the threads from 32 to 512 stepping by 32 and while we do see a decrease in the improvement in speed, there is still a benefit of additional threads (Figure 5.3C). To see where the benefit of increased threads plateaus, we increased threads from 1000 to 16,384 and witnessed little speed increase (<5 minutes) between 5000 and 16,384 threads (Figure 5.3C). Thus, we define the upper bound on data threads for a machine with 32 CPUs and 256GB of RAM to be ~5000 data threads, for a query set of ~6.4 million kmers. It is possible that higher thread counts may improve performance for larger query sets and more powerful computers. Lastly, to highlight the clear advantage over non-indexed methods, we compared FlexTyper to popular non-indexed algorithms achieving a decrease in search time by roughly three orders of magnitude when using FlexTyper (Table 5.2).

Figure 5-3 - Search speeds for FlexTyper

A) FlexTyper search time with kmers of size 25 (blue) and 150 (red), increasing in number from 10-100,000, using one (solid) or 30 (dashed) threads. B) Increasing the number of threads from 1 to 32, for 100,000 kmers of length 25 (solid blue line). Expected values calculated by dividing single thread time by the additional number of threads (dashed black line), with difference between actual and expected plotted (red vertical bar). C) Hyperthreading results for the time (in seconds) vs. thread counts from 32 up to 16,384 (log10 scaled x-axis).

Polaris reads   Memory Utilized (GB)   CPU Utilized (s)   read size   FmIndex size
ERR1955491_0    403.02                 14478.75276        45G         11G (10533.5 MiB)
ERR1955491_1    403.02                 15142.71965        45G         11G (10594.1 MiB)
ERR1955491_2    403.02                 14908.2868         45G         11G (10364.3 MiB)
ERR1955491_3    403.02                 15198.84047        45G         11G (10515.2 MiB)
ERR1955404_0    443.04                 16102.04992        50G         12G (11553.4 MiB)
ERR1955404_1    443.04                 16419.31906        50G         12G (11480.5 MiB)
ERR1955404_2    443.04                 17153.70595        50G         12G (11415.4 MiB)
ERR1955404_3    443.04                 16602.61735        50G         12G (11350.1 MiB)
ERR2304597_0    437.73                 15986.89442        49G         12G (11404.1 MiB)
ERR2304597_1    437.73                 16493.82877        49G         12G (11374.3 MiB)
ERR2304597_2    437.73                 16168.4345         49G         12G (11255.6 MiB)
ERR2304597_3    437.73                 16369.99306        49G         11G (11230.7 MiB)

Table 5-1 - Indexing performance data for WGS fastq files

Paired-end files were concatenated into a single fastq and then chunked into multiple parts to allow computation. Contents include the number of reads, total RAM used, total CPU used, wall time and disk space.

Number of kmers   FM-index    grep             egrep            fgrep
1                 0.107687    114.23           109.073          114.605
10                0.93294     1145.488         1139.098         1114.331
100               3.482785    Did Not Finish   Did Not Finish   Did Not Finish
1000              30.117324   Not Attempted    Not Attempted    Not Attempted

Table 5-2 - Testing grep vs. FlexTyper

The 100 kmer searches using grep, egrep and fgrep ran for multiple hours before they were manually stopped or the machine had crashed. For the largest kmer searches, we chose not to attempt to run grep, egrep and fgrep as they were unlikely to provide results, given that the 100 kmer searches did not finish. All times are in seconds.
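The "roughly three orders of magnitude" difference can be checked directly from the completed rows of Table 5-2. The short Python calculation below uses only the timings reported in the table; the variable names are arbitrary.

```python
# Search times in seconds, copied from the 1-kmer and 10-kmer rows of Table 5-2.
timings = {
    1: {"FM-index": 0.107687, "grep": 114.23, "egrep": 109.073, "fgrep": 114.605},
    10: {"FM-index": 0.93294, "grep": 1145.488, "egrep": 1139.098, "fgrep": 1114.331},
}

# Fold difference of each grep variant relative to the FM-index search.
speedups = {
    n_kmers: {
        tool: seconds / row["FM-index"]
        for tool, seconds in row.items()
        if tool != "FM-index"
    }
    for n_kmers, row in timings.items()
}

for n_kmers, row in speedups.items():
    for tool, fold in row.items():
        print(f"{n_kmers} kmer(s), {tool}: {fold:.0f}x slower than the FM-index")
```

Every ratio falls between 1,000 and 1,300, consistent with the statement in the text.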

5.3.2 Genomic coverage and genotype detection within human WGS data

Knowing whether a given kmer is present or absent from a human WGS datafile (in this instance Illumina short-read, paired-end genome data) can have utility for estimating the depth of coverage for a target region and genotyping SNPs. FlexTyper has the capacity to compute depth of coverage or genotype SNPs from WGS data for both predefined and user-supplied loci. We demonstrate this capacity for genomic sites using the probe sequences from the CytoScanHD microarray, as well as a subset of previously collated population discriminating SNPs (Pedersen and Quinlan 2017). Using these loci, we created query files with a reference and alternate query sequence centered on the biallelic site (Methods).

We first sought to test the read recovery capacity of FlexTyper compared to an alignment-based method which we call BamCoverage. The BamCoverage method involves mapping the reads to the reference genome, and then extracting per-base read coverage over a specific reference coordinate. BamCoverage utilizes the pysam package to extract read pileup over positions defined by the FlexTyper input query file (Methods). Using the CytoScanHD SNP set, we found a high concordance between the read counts from FlexTyper and the depth of coverage from aligned reads (Figure 5.4A). The vast majority, 780,178/797,653 or 97.8%, of sites differed by fewer than 10 reads between FlexTyper and BamCoverage (Figure 5.4B). This discrepancy is similar for both reference and alternate alleles, which is important since most genotyping models assume relative contributions of observed alleles for genotype calling. There were 16,282 sites with a delta (Δ = FlexTyper - BamCoverage) greater than 10, and 4,256 sites with a delta greater than 100. We manually investigated a few of the sites which were overcounted by FlexTyper by more than 100 and found that they are overcounted due to kmers mapping to multiple possible locations. Comparing these overcounted sites with delta greater than 100 to previously defined repeat regions shows that 4189/4256 or 98.4% of the overcounted sites overlapped with predefined repeats (Trost, Walker et al. 2018). The uniqueness of kmers is important for accurate read counting, thus it is recommended to filter such regions when using FlexTyper for genotyping or depth profiling. Lastly, by examining the recovery of reads across the chromosome between FlexTyper and the read alignment approach, it is clear that FlexTyper can accurately capture relative sequence abundance with relevance to copy number variant calling applications (Figure 5.4D).
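The delta-based comparison can be summarized with a short function. The sketch below is illustrative only: the function name is hypothetical and the per-site counts in the example are made up, not taken from the analysis. It tallies the same three categories discussed above.

```python
def summarize_deltas(flextyper_counts, bam_counts):
    """Tally sites by delta, where delta = FlexTyper count - BamCoverage count."""
    deltas = [ft - bam for ft, bam in zip(flextyper_counts, bam_counts)]
    return {
        "abs_delta_lt_10": sum(abs(d) < 10 for d in deltas),  # close agreement
        "delta_gt_10": sum(d > 10 for d in deltas),           # FlexTyper overcounts
        "delta_gt_100": sum(d > 100 for d in deltas),         # strong overcounting
    }


# Made-up counts for five sites; the fourth mimics a kmer from a repeat region.
flextyper_counts = [30, 31, 45, 400, 28]
bam_counts = [30, 29, 30, 32, 35]
summary = summarize_deltas(flextyper_counts, bam_counts)
```

Sites in the last category are the natural candidates for exclusion through a repeat-region filter before genotyping or depth profiling.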

Next, we investigated whether FlexTyper can accurately recover genotypes at the SNP sites profiled from the chromosomal microarray. The genotyping approach we use leverages a minimum count from the reference and alternate allele to assign heterozygous, homozygous alternate, or homozygous reference genotypes (Methods). We applied this basic genotyping algorithm to both FlexTyper and BamCoverage counts to produce a VCF file. These genotypes were compared to an alternate pipeline which uses reference-based mapping and sophisticated variant calling using DeepVariant (Poplin, Chang et al. 2018). For the 797,653 SNPs on the

CytoScanHD microarray, all three methods agree on 99.2% (791,063/797,653) of the sites. For the sites where there was disagreement, we see an overlap with the repeat regions of 5004/5586 or 89.6%, affirming that these repeat regions are responsible for the majority of discordant genotypes. We further demonstrate the accuracy of these genotypes by indexing nine WGS samples from the Polaris project representing diverse populations including three African, three

Southeast Asian, and three European individuals (Chen, Krusche et al.). After indexing, we queried the samples for population discriminating sites and then genotyped the output table to produce a VCF file. The output VCFs were then used within the Peddy tool, and a principal component analysis was performed to predict the ancestry of the samples (Pedersen and Quinlan

2017). In all nine cases the population was correctly determined, as well as the relatedness inference for the three trios (Figure 5.4E). Interestingly, we observed a discrepancy between the listed sex for the child of the European trio, individual HG01683, and the inferred sex from

FlexTyper and Peddy (Figure 5.4F). We followed up on this observation and found that the individual is not an XY male, but rather an XXY individual. Taken together, FlexTyper has the capacity to provide accurate counts of observed reads matching a query sequence, with relevant utilities such as copy number estimation, sample identification, ancestry typing, and sex identification.
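The X chromosome statistic behind the sex-typing result can be sketched as a ratio of heterozygous to homozygous calls. This is a simplified illustration rather than Peddy's implementation: the function name is hypothetical, counting both homozygous classes in the denominator is an assumption, and the genotype lists are made up.

```python
def het_hom_ratio(genotypes):
    """Ratio of heterozygous to homozygous calls among X chromosome genotypes."""
    het = sum(g == "0/1" for g in genotypes)
    hom = sum(g in ("0/0", "1/1") for g in genotypes)
    return het / hom if hom else float("nan")


# Made-up X chromosome calls. A single X copy cannot yield true heterozygous
# sites, whereas two X copies (XX or XXY) can.
one_x_copy = ["0/0", "1/1", "0/0", "1/1"]
two_x_copies = ["0/1", "0/0", "0/1", "1/1"]

ratio_one_copy = het_hom_ratio(one_x_copy)
ratio_two_copies = het_hom_ratio(two_x_copies)
```

An individual listed as male but showing a clearly non-zero ratio, as with the XXY case above, stands out as a sex-label discrepancy.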

Figure 5-4 - WGS Genotyping using FlexTyper.

A) FlexTyper read count compared to the total coverage from BAM file over SNP sites represented on the CytoScanHD microarray. B) Histogram showing the delta, (Δ = FlexTyper - BamCoverage), in read count for both the alternate (red) and reference (blue) alleles. C) Histogram of the same delta as B) but with an extended axis from 100-2000, showing the frequency of over-counting for sites using FlexTyper. D) Scatter plot showing the delta (Δ = FlexTyper - BamCoverage) on the y-axis, plotted across chromosome 1 on the x-axis. E) Principal component analysis showing projection of FlexTyper-derived SNP genotypes from nine individuals of Asian (green), African (red) and European (purple) ancestry. Squares denote FlexTyper genotypes, points denote existing data from the 1000 Genomes project provided by Peddy. F) Sex-typing for these Polaris samples showing the ratio of heterozygous to homozygous sites on the X chromosome (y-axis) for individuals with a listed sex of male (right) and female (left). Each individual is labeled as green (correctly sex-labeled) or red (incorrectly labeled).

5.3.3 Testing for the presence of pathogen sequences in RNA-seq

To demonstrate the capacity of FlexTyper to detect pathogens from RNA-sequencing data, we generated synthetic reads from four relevant viral genomes including Epstein-Barr virus (EBV),

Human Immunodeficiency virus type 1 (HIV-1), and two Human Papilloma virus strains 68b

(HPV FR751039) and 70 (HPV U21941) (Methods). We first examined the impact of various

FlexTyper parameters on the recovery rate of pure, simulated read sets for each of the four viruses and one human blood RNA-seq dataset from the Genome England project

(https://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-6523/samples/). Importantly, varying the parameters k (length of search substring) and w (step-size) changes the specificity and sensitivity of read recovery. When k is set to 15 (a short kmer), there are roughly 1 million off-target hits to the viral genomes for the pure human RNA-seq file. Next, we demonstrated that the kmer uniqueness setting only guarantees that identical kmers cannot appear across queries. Thus, if query specificity is a priority then setting the w parameter to 1 will produce results with the least amount of cross-query assignment. By exploring these parameters, we show that all simulated reads can be recovered with parameters of 30 and 5 for k and w respectively, with low off-target assignment.

Next, to simulate patients infected by one of the four viruses, we spiked-in simulated pathogen reads with the human RNA-seq dataset. Using the optimized parameters derived above, we are able to detect each virus in the patient sample even at low concentrations. We further demonstrate the capacity for FlexTyper to discriminate between spiked-in virus samples by mixing the viruses at differing concentrations (read counts) within the human RNA-seq dataset

(Table 5.2). FlexTyper was run with two settings by varying the k and w parameters for increased sensitivity (k=30, w=5), or increased speed (k=100, w=25). We compared these results with Centrifuge, a software tool that works with unmapped short-read sequencing data by performing read-length (k=150) kmer searches against a database of viral and bacterial genomes

(Kim, Song et al. 2016). For each of the datasets, FlexTyper is able to detect the contaminating pathogen sequences, even in a sample (Patient_5) where we only spiked in the equivalent of a 1x coverage of the viral genome, which equates to roughly 50-1150 reads depending on the size of the viral genome (Table 5.2). For comparison, Centrifuge results were manually combined for viruses of similar naming schema and presented as the sum of non-unique read hits (Table 5.2).
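The relationship between coverage and read count quoted above follows from simple arithmetic: reads = genome length x coverage / read length. The sketch below assumes 150 bp reads (the read length implied by the Centrifuge description) and uses genome lengths back-calculated from the 1x read counts rather than authoritative reference lengths; the function name is hypothetical.

```python
import math


def reads_for_coverage(genome_length: int, coverage: float, read_length: int) -> int:
    """Number of reads needed to reach the requested mean coverage."""
    if genome_length <= 0 or read_length <= 0 or coverage < 0:
        raise ValueError("lengths must be positive and coverage non-negative")
    return math.ceil(genome_length * coverage / read_length)


# Illustrative genome lengths (assumptions, not curated reference values).
illustrative_lengths = {"small virus": 7_800, "large virus": 171_900}
for name, length in illustrative_lengths.items():
    print(name, reads_for_coverage(length, coverage=1.0, read_length=150))
```

With these placeholder lengths the function returns 52 and 1146 reads, which falls within the 50-1150 range described in the text.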

For each of the samples, both Centrifuge and FlexTyper are capable of detecting the spiked-in pathogens. In all simulations, consistent with its use of shorter kmers, FlexTyper is more sensitive in its detection capacity, recovering more reads than Centrifuge. In our collation of viral hits for the Centrifuge data, we observed the limitation that for HPV strains, Centrifuge utilizes a comprehensive genome database with hundreds of distinct strains. Thus, retrieving a combined count of HPV sequences within a sequencing dataset is nontrivial and requires collation over hundreds of viral genome hits. In contrast, FlexTyper is able to detect all of the spiked-in reads for these viral genomes of interest. This is due to the increased flexibility of FlexTyper, which enables the user to define the relevant pathogens to search for without the need for reconstructing a complex bacterial or viral database, as is the case for Centrifuge. In summary, FlexTyper is more sensitive in its detection capacity than Centrifuge, and the flexibility to define the pathogen search space ad hoc could be beneficial in some applications, such as instances when the virus is a novel strain.

           EBV                                  HIV-1
Sample     Expect   FT:30,5  FT:100,25  C       Expect  FT:30,5  FT:100,25  C
Patient_1  1145     1146     861        538     610     610      462        284
Patient_2  11450    11454    8290       5427    6100    6099     4359       2797
Patient_3  114500   114502   83504      54352   61000   61000    43648      28206
Patient_4  1145000  1144990  833955     543924  62      62       45         29
Patient_5  1146     1146     861        538     62      62       45         29

           Total HPV                          U21941                       FR751039
Sample     Expect  FT:30,5  FT:100,25  C      Expect  FT:30,5  FT:100,25   Expect  FT:30,5  FT:100,25
Patient_1  57200   60812    40930      2650   5200    8475     3755        52000   52337    37175
Patient_2  52052   55371    37443      1470   52000   52005    37406       52      3366     37
Patient_3  572     615      422        25     52      92       38          520     523      384
Patient_4  5720    6084     4152       273    520     851      388         5200    5233     3764
Patient_5  104     112      75         3      5       57       38          52      55       37

Table 5-3 - Performance comparison for simulated spike-in pathogens.

Each of the samples, (Patient_1 - Patient_5), with expected (simulated known counts) vs. observed counts for Centrifuge (C) and FlexTyper with k=30/w=5 (FT:30,5) and k=100/w=25 (FT:100,25). Each quantified viral strain includes Epstein Barr Virus (EBV), Human Immunodeficiency Virus-1 (HIV-1), total Human Papillomavirus (HPV), and two strains of HPV (type 70 and 68b). The maxOcc parameter was set to limit the number of hits to one million and non-unique kmers were allowed. For Centrifuge, sub-strain HPV counts were not feasible so counts were aggregated over all papilloma viral strains in the output report file per patient.
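One way to read the table is as a recovery fraction, observed reads divided by expected (spiked-in) reads, for each method. The sketch below applies this to the EBV counts for Patient_1; the helper name recovery_fraction is an assumption introduced for illustration and is not part of FlexTyper or Centrifuge.

```python
def recovery_fraction(observed: int, expected: int) -> float:
    """Fraction of the spiked-in reads that a method reported."""
    if expected <= 0:
        raise ValueError("expected must be positive")
    return observed / expected


# EBV counts for Patient_1, copied from the table above.
expected = 1145
ebv_patient_1 = {"FT:30,5": 1146, "FT:100,25": 861, "C": 538}

for method, observed in ebv_patient_1.items():
    print(f"{method}: {recovery_fraction(observed, expected):.2f}")
```

For this sample the fractions are approximately 1.00, 0.75, and 0.47 for FT:30,5, FT:100,25, and Centrifuge respectively. A fraction above 1 indicates that more reads were assigned to a query than were spiked in, which would be consistent with off-target or cross-query assignment.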

5.4 Discussion

Here we presented FlexTyper, a flexible tool which enables exploratory analysis of short read datasets without the need for alignment to a reference genome. Our framework allows for the custom generation of queries, giving the user total control to perform searches relevant to the problem at hand. We demonstrated three applications, including depth of coverage analysis, accurate SNP genotyping, and sensitive detection of pathogen sequences. FlexTyper is available for the creative use of genomics researchers.

The rapid and accurate recovery of read depth enables innovative usage of FlexTyper in the space of copy number variant profiling. We demonstrated that we can reproduce the depth of coverage of a genomic region without the need for reference-based mapping. As microarrays are replaced by genome sequencing assays, we envision that FlexTyper could be extended to reproduce microarray-style outputs. Further, we show that when genomic queries with counts higher than the expectation arise, these events correspond to repetitive genomic sequences. As such, FlexTyper may not only enable the recovery of read depth in an accurate manner, but it can also inform the quality of a sequence query as a “unique probe” for assessing genomic copy number.


The genotyping case study highlights how pre-alignment analysis of genome sequence data can provide rapid insights into the properties of a sample. SNP genotyping was accurate across the genome, allowing rapid identification of sample ancestry, sample relatedness in the trio setting, and sample sex typing using Peddy (Pedersen and Quinlan 2017). Interestingly, applying Peddy to the output of FlexTyper for open source trio data from the Polaris project revealed a mislabeling of the sex for individual HG01683, which was reported and subsequently amended in the online data repository (https://github.com/Illumina/Polaris/wiki/HiSeqX-Kids-Cohort). Since ancestry and sex information can inform choices in downstream data processing, identifying these discrepancies between labeled sex and inferred sex in a data-driven manner is a critical step of pre-alignment informatics. For instance, mapping against the sample-matched sex chromosomes has been shown to improve performance (Olney, Brotman et al. , Webster, Couse et al. 2019). As such, using FlexTyper, in combination with Peddy, on diverse datasets prior to reference-guided read alignment will lead to improved results from mapping-based pipelines.

The importance of pathogen identification is increasingly recognized. In both cancer profiling (Klijn, Durinck et al. 2015) and public health studies (Gardy, Loman et al. 2015), rapid determination of the presence of pathogen sequences could obviate the need for full reference mapping. Some existing tools designed for viral detection in sequencing data rely upon pre-indexed databases of viral and bacterial sequences, sometimes including a phylogenetic relationship between genomes within the index (Kim, Song et al. 2016, Wood, Lu et al. 2019, Xia, Liu et al. 2019). One such approach, Centrifuge, has been applied to cancer genomes to confirm the presence of viral pathogens. We demonstrated that our approach compares favorably to Centrifuge, with a more sensitive detection level, due to the ability to search for kmers shorter than the read length and the advantage of fine-tuned control over the searchable database. Here we only searched for viral pathogens of interest, although other specific pathogen queries could be performed, such as the presence of antibiotic resistance genes within a patient RNA-seq sample.

We anticipate that the research community will identify diverse and creative uses for “reverse mapping” analysis with FlexTyper, but a few approaches are apparent to us. It is feasible to genotype complex structural variants by searching for sequences overlapping breakpoints, such as those observed in a subpopulation, or events recurrently found in cancer (Sudmant, Rausch et al. 2015, Li, Roberts et al. 2020). Within RNA-seq data, querying for exon-exon splice junctions in a rapid manner can allow isoform quantification, as has been previously demonstrated (Patro, Mount et al. 2014, Bray, Pimentel et al. 2016). Further, a recent report showed the utility of kmer-counting methods in resolving copy number variants within paralogous loci and genes (Shen, Shen et al. 2020). Another group showed the advantage of examining depth of coverage at specific sites across the paralogous genes in Spinal Muscular Atrophy (Chen, Sanchis-Juan et al. 2020). As FlexTyper is well suited for specific sequence recovery operations, scanning with preselected query sequences such as those defined by these studies can enable rapid detection. All of these proposed applications help tackle challenges which are currently a burden for traditional reference-based mapping approaches.
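The breakpoint-based genotyping idea mentioned above can be sketched for the simplest case, a deletion. The example builds one query that exists only on the deleted (alternate) allele and two queries that exist only on the intact (reference) allele. The function name, the 0-based half-open coordinates, and the toy sequence are all assumptions for illustration; counting reads that match each query would then be left to the search tool.

```python
def deletion_queries(reference: str, start: int, end: int, flank: int) -> dict[str, str]:
    """Build breakpoint-spanning queries for a deletion of reference[start:end].

    Coordinates are 0-based and half-open (an assumption of this sketch).
    """
    if not (flank <= start < end <= len(reference) - flank):
        raise ValueError("deletion plus flanks must lie within the reference")
    left_flank = reference[start - flank:start]
    right_flank = reference[end:end + flank]
    return {
        # Reference allele: sequence spanning each intact breakpoint.
        "ref_left": reference[start - flank:start + flank],
        "ref_right": reference[end - flank:end + flank],
        # Alternate allele: the two flanks joined across the deletion.
        "alt": left_flank + right_flank,
    }


toy_reference = "AAAACCCCGGGGTTTT"
print(deletion_queries(toy_reference, start=4, end=12, flank=2))
```

In practice the flank length would be chosen so that each query is shorter than the read length and long enough to be unique in the genome, and the ratio of alternate to reference query counts would inform the genotype.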

We focused this report on the utility of kmer searches against indexed read sets, but recognize that speed and computational resources are an important consideration for adoption of the method. One obvious (but transient) constraint on the utility of FlexTyper is the generation of the FM-index for the sequencing reads. As the FM-index is critical to many aspects of genome-scale sequence analyses, there are diverse efforts to develop novel indexing strategies, such as optimizing FM-index construction using GPUs (Chacón, Marco-Sola et al. 2015) and creating efficient construction algorithms (Labeit, Shun et al. 2017, Chen, Li et al. 2018). Further, the nature of the mapping procedure holds promise with massive parallelization approaches, including those involving GPU acceleration (Hung, Hsu et al. 2018). Moving forward, accelerations to the FM-index generation and reverse mapping approach will result in faster genomic analysis pipelines than is currently possible with alignment-based methods.
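To make the cost and the benefit of the FM-index concrete, the following is a minimal, textbook-style sketch of an FM-index over a small read set with backward search for kmer counting. It is not the FlexTyper implementation: the class name, the use of "#" as a read separator, and the naive suffix sorting (which is far too slow for real sequencing data) are all simplifying assumptions.

```python
from collections import Counter


class FMIndex:
    """Toy FM-index supporting exact-match counting via backward search."""

    def __init__(self, text: str):
        if "$" in text:
            raise ValueError("text must not contain the sentinel '$'")
        text += "$"  # unique sentinel so that all suffixes are distinct
        # Naive suffix array construction; real tools use efficient algorithms.
        suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
        # Burrows-Wheeler transform: the character preceding each sorted suffix.
        self.bwt = "".join(text[i - 1] for i in suffix_array)
        # first[c]: number of characters in the text that sort before c.
        counts = Counter(text)
        self.first = {}
        total = 0
        for c in sorted(counts):
            self.first[c] = total
            total += counts[c]
        # occ[c][i]: occurrences of c in bwt[:i].
        self.occ = {c: [0] * (len(text) + 1) for c in counts}
        for i, ch in enumerate(self.bwt):
            for c in self.occ:
                self.occ[c][i + 1] = self.occ[c][i] + (ch == c)

    def count(self, pattern: str) -> int:
        """Count occurrences of pattern by extending the match right to left."""
        lo, hi = 0, len(self.bwt)
        for c in reversed(pattern):
            if c not in self.first:
                return 0
            lo = self.first[c] + self.occ[c][lo]
            hi = self.first[c] + self.occ[c][hi]
            if lo >= hi:
                return 0
        return hi - lo


reads = ["ACGTACGT", "TTACGA"]
index = FMIndex("#".join(reads))  # separator prevents matches spanning reads
print(index.count("ACG"), index.count("GT"), index.count("GGG"))
```

Once the index is built, each kmer lookup takes time proportional to the kmer length and is independent of the number of reads, which is why index construction, rather than searching, is the dominant cost discussed above.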

Looking to the future, we see the kmer-searching approach of FlexTyper as having great utility when used in conjunction with emergent graph-based representations of the reference genome (Kehr, Trappe et al. 2014, Kaye 2016, Paten, Novak et al. 2017). Whether users seek to select a population specific reference graph as the basis for read mapping, or to introduce Bayesian priors (edge weighting) within a pan-population reference graph, knowledge of population markers spanning chromosomes will be required to inform the processes. Furthermore, it is our expectation that graph-based mapping methods will ultimately use read-based FM-indices, as indexing the reference graph imposes restrictions on the graph structures that can be used and the types of variations that can be incorporated (Paten, Novak et al. 2014, Ghaffaari and Marschall 2019). As the graph-based algorithms mature, approaches such as FlexTyper which enable reverse mapping of sequences against a set of indexed reads will be instrumental in the initial steps of genome analysis pipelines, and in the resolution of challenging regions of the genome.


5.5 Conclusion

Alternative approaches to analysis will be necessary if all genomic variants are to be discovered from whole genome sequencing datasets. As the shift away from a single linear reference genome is imminent, methods which facilitate that shift will have increasing utility in the coming years. The creative use of FlexTyper can aid the variant calling process, either as a pre-alignment reference selector or as a method to query for specific sequences which indicate the presence or absence of a variant.


Chapter 6: Conclusion

Bioinformatics has played, and will continue to play, a pivotal role in the diagnosis of rare genetic diseases. Improvements in sequencing technology paired with bioinformatic innovations will bring an end to the diagnostic odyssey of patients all over the world. Innovation is driven both by large government-level projects which catalog hundreds of thousands to millions of human genomes, as well as one-off anecdotal reports where a handful of patients are affected by a cryptic pathogenic mutation. As we continue to profile populations across the globe, we gain an understanding of the function and diversity of genomic variation between healthy individuals. Properly identifying and cataloguing these variants with standardized approaches is key, and produces invaluable resources for the diagnostic community. On the smaller scale, our understanding of the role and function of the bases in the human genome expands when we observe genomic disruptions which are unique to affected patients. While many innovation-driving anecdotes are solved by supplementing DNA sequencing with additional molecular assays, they still lead to novel bioinformatic methods which can be deployed for DNA sequencing.

6.1 Contributions to the field

Within my doctoral thesis I have focused on bioinformatic innovations which expand the utility of whole genome sequencing for the diagnosis of rare genetic diseases. My efforts occur at the intersection of clinical interpretation of patient data, and research endeavours to solve undiagnosed cases. A central theme to the work is the domino effect of rare disease anecdotes, wherein a single undiagnosed case leads to bioinformatic innovation which in turn has the capacity to diagnose additional cases. Simulation is a key component of utilizing these anecdotes effectively and incorporating the bioinformatic innovation into semi-automated and standardized workflows.

Chapter 2 focuses on the simulation task, where I developed a flexible framework called GeneBreaker with the capacity to reproduce these rare genomic variants in a familial setting. Having this capacity to create synthetic cases which emulate the observed pathogenic variants across the globe enables the uptake of advanced methods, particularly those which focus on detecting challenging forms of genomic variation, into semi-automated workflows. Workflows across the globe can be tested for their capacity to detect these rare and challenging genomic variants, and optimizations which stem from this benchmarking approach may lead to the diagnosis of additional patients. Furthermore, GeneBreaker provides training data for the rapidly growing field of genome analysts in medicine, without the privacy implications of utilizing clinical patient data.

In Chapter 3 I upgrade and implement a semi-automated workflow for the diagnosis of rare genetic diseases from whole exome or whole genome sequencing data. Deploying this workflow in the clinical-research setting has led to the diagnosis of several patients. One set of affected patients with a very specific biochemical profile–which implicated a strong candidate gene glutaminase (GLS)–remained undiagnosed after considering the coding regions with exome sequencing and finding only a single inherited loss-of-function variant. We examined the mRNA for GLS, and found evidence of biased allelic expression, which led to the hypothesis of a noncoding variant refractory to detection from exome sequencing. After performing whole genome sequencing on one of the patients, a manual investigation of the noncoding regions associated with GLS led to the discovery of a cryptic genetic mutation, a short tandem repeat expansion in the 5-prime untranslated region. Molecular investigations for this expansion showed that hypermethylation of the DNA around the promoter of the gene is not the mechanism of action, although some disruption to the local chromatin state was observed. This discovery led to an appreciation of short tandem repeat expansions, and began an endeavor to detect this class of variants genome-wide within whole genome sequencing datasets.

Following the discovery of a novel short tandem repeat expansion, in Chapter 4 I collaborate with researchers at Illumina to develop a novel bioinformatic method which succeeds at detecting this class of variants in an unbiased manner throughout the genome. I demonstrate that other existing methods are limited, and that our approach–which does not utilize predefined repeats in the genome–can detect known pathogenic as well as potentially pathogenic variants in simulated data. Initial full-genome simulations were complicated by repeats present in the reference genome but absent from sequenced human samples. To overcome this, I develop a novel method which can simulate repeat expansions in the background of healthy human genomes, which more accurately represents comparisons between a patient with an expansion and a cohort of healthy samples.

Lastly, in Chapter 5 I develop FlexTyper, a flexible reverse-mapping method which can extract useful information from high throughput sequencing data without mapping against the reference genome. While the standard practice for variant calling relies upon mapping data against a linear reference genome, there is agreement in the field that a shift towards a more graph-like genome is inevitable. Having tools which can extract useful information without reliance upon a linear reference genome will ease the transition to, and adoption of, advanced graph-based methods. I benchmark the capacity for FlexTyper to extract information in a mock-microarray fashion, by searching for microarray probe sequences against indexed human whole genome sequencing samples. Using the mock-microarray approach, I show the capacity to pair with other open source tools and rapidly extract ancestry and sex from a limited number of genotypes. Analyzing open source genomes led to the identification of an individual in an open genome database with Klinefelter syndrome (extra X chromosome), which is a critical finding as it has implications for downstream inheritance modeling within variant interpretation. The method developed in Chapter 5 has immediate utility for sample verification and sex typing which can inform reference-genome alignment choices, and looking to the future this method has the capacity to be a stepping stone towards graph genome implementations.

6.2 Limitations and room for improvement

Possibly the greatest limitation facing bioinformatic innovation is that novel methods will not be adopted by the community. An example of this can be seen in the field of short read data compression, where raw data is still stored as gzipped text files in spite of alternatives existing with smaller file size footprints and no loss of information (Hach, Numanagić et al. 2014). Both software maintenance and ongoing support are necessary for adoption, and the innovations within this thesis work will be maintained beyond their initial deployment. Beyond maintenance, each of the research contributions within this thesis has room for future improvement.

In Chapter 2, where I develop GeneBreaker, a simulation framework for creating rare disease scenarios, future development will improve the utility and scope of the tool. First, the set of variants to be simulated will expand to include additional variant classes and variants beyond genic regions. Inversions and translocations, while challenging to accurately identify in short-read datasets, are emerging as a relevant variant class within rare disease patients when long read technologies are utilized (Mitsuhashi and Matsumoto 2020). Extending beyond genic regions is necessary in order to include associated cis-regulatory regions such as chromatin organization boundaries and enhancer elements. There is evidence that disruptions to chromatin organization boundaries can result in gene dysregulation even when the target gene doesn’t harbor a mutation (Lupiáñez, Kraft et al. 2015). Annotating these boundaries for each gene in a cell-type specific manner is an ongoing research endeavour, and even more challenging is interpreting whether the observed variants will disrupt boundaries (Sikorska and Sexton 2020). Furthermore, gene transcription (from DNA to mRNA) is regulated in cis by enhancer elements, which act together in three-dimensional space. Defining these regulatory elements for genes, especially those elements which are critical and necessary for proper gene function, is an ongoing research endeavour. As these definitions become refined, incorporating them into GeneBreaker will allow the creation of additional challenging rare genetic disease scenarios. Lastly, GeneBreaker’s downstream benchmarking process focuses on the embedding of an SNV or indel in the background of a trio of individuals for small variant prioritization testing, or full single sample simulation for variant calling testing. Ideally, this will be extended to include variant prioritization testing with variants of different classes, which requires background sets of many classes of genomic variants.

Two primary aspects of Chapter 3 include the creation of a semi-automated pipeline for variant prioritization in rare disease cases, as well as the identification of a novel short tandem repeat expansion. While the pipeline for annotation and variant calling is functional, it does not have the capacity to combine variants of different classes and genic impacts within inheritance modeling and filtering. An ideal rare disease genome analysis pipeline would incorporate every possible genomic variant into a gene-centric interpretation framework. Although some commercial tools are beginning to populate this space, having open-source alternatives will be useful as many of these challenging cases will be subject to research-grade interpretation beyond the clinical workflows. A key aspect of this will be large genomic databases populated with the diverse classes of genomic variants, which are now becoming available and can enable researchers to rule out complex variants which are present in the healthy population (Collins, Brand et al. 2020). Beyond pipeline improvement, more understanding is needed to unravel the functional impact of the repeat expansion in the 5-prime untranslated region of GLS. While DNA hypermethylation and disrupted transcriptional elongation were ruled out, the underlying mechanism of repression has not been completely resolved. Profiling the chromatin and observing changes in chromatin marks indicates that there is a change in the activity potential of the region, although how the repeat mediates this is unclear. As this repeat expansion is large (over 1500 bp) and located near the promoter, it’s possible that the repeat expansion is disrupting the capacity of transcriptional initiation in the genome, perhaps through acting on chromatin organization which has been suggested for other repeat loci (Sun, Zhou et al. 2018). Further investigation is needed to concretely define the mechanism.

The development of ExpansionHunter Denovo in Chapter 4 provides the community with a tool to identify repeat expansions, including those which may be pathogenic, in PCR-free whole genome sequencing samples. While we provide a small catalog for repeat expansions in a diverse, healthy population of 150 individuals, the true landscape of repeat expansions, especially complex expansion-insertion events, is unexplored. Using a set of 150 individuals will lead to the identification of repeat expansions in patients which appear to be rare, although a much larger cohort is needed to establish the true allele frequency of such variants. Expanding this catalog to include thousands or tens of thousands of individuals will add more power to rare disease diagnostic approaches. Furthermore, the method currently identifies a region where a repeat expansion has occurred, but does not genotype the expansion and does not provide exact coordinates for the event. Genotyping is important when considering the inheritance patterns for rare Mendelian disease variants, and having the exact coordinates allows for an improved assessment of the molecular impact of the identified variant. Advancing the method to incorporate these components is currently underway.
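The limitation of a 150-person catalog can be quantified with a short calculation. The sketch below assumes an autosomal locus, unrelated diploid individuals, and independent sampling of alleles; the function names are hypothetical. It computes the smallest non-zero allele frequency the cohort can represent and the upper confidence bound on the true allele frequency of a variant that is never observed in the cohort.

```python
def min_observable_af(n_individuals: int, ploidy: int = 2) -> float:
    """Smallest non-zero allele frequency representable in the cohort."""
    return 1 / (n_individuals * ploidy)


def af_upper_bound_if_unseen(
    n_individuals: int, confidence: float = 0.95, ploidy: int = 2
) -> float:
    """Upper bound on the true allele frequency when zero copies are observed.

    Solves (1 - p) ** n_alleles = 1 - confidence for p.
    """
    n_alleles = n_individuals * ploidy
    return 1 - (1 - confidence) ** (1 / n_alleles)


for cohort_size in (150, 10_000):
    print(
        cohort_size,
        f"{min_observable_af(cohort_size):.5f}",
        f"{af_upper_bound_if_unseen(cohort_size):.5f}",
    )
```

Under these assumptions, a variant absent from 150 individuals could still have a population allele frequency of nearly 1%, whereas absence from 10,000 individuals bounds it near 0.015%, which illustrates why a much larger catalog adds power for rare disease filtering.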

In Chapter 5, I demonstrated that the FlexTyper tool can recover target sequences accurately from whole genome sequencing datasets by utilizing chromosomal microarray probe sequences and comparing to the expectation from reference-mapped reads. However, FlexTyper has the ability to query for any short (shorter than read length) sequence. Recent work focusing on genotyping variants, both complex events and variants within challenging genomic regions, has highlighted the importance of quantifying selected short sequences (Chen, Krusche et al. , Chen, Sanchis-Juan et al. 2020, Shen, Shen et al. 2020). For the SMN1/SMN2 segmental duplication which harbors copy number variants underlying spinal muscular atrophy, researchers showed that estimating coverage of a few sequences within these highly similar genes can inform copy number. In the work by Shen et al., the copy number of paralogous genes was estimated by kmer-counting, which is possible with the FlexTyper framework. The primary limitation of FlexTyper is that generating an FM-index for the short-read data is computationally expensive. Fortunately, there are ongoing research efforts to optimize this process which can be integrated into the FlexTyper framework in the future (Labeit, Shun et al. 2017, Chen, Li et al. 2018).
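As a rough illustration of how query counts could inform copy number, the sketch below normalizes the median count of target-specific queries against the median count of background queries assumed to be present at two copies. This is a generic depth-ratio calculation, not the method of Shen et al. or Chen et al., and the function name and example counts are hypothetical.

```python
from statistics import median


def estimate_copy_number(
    target_counts: list[int], background_counts: list[int], ploidy: int = 2
) -> float:
    """Scale the target/background count ratio by the background ploidy."""
    background = median(background_counts)
    if background <= 0:
        raise ValueError("background counts must have a positive median")
    return ploidy * median(target_counts) / background


# Hypothetical per-query read counts at roughly 30x diploid coverage.
background = [30, 29, 31, 30, 28]
target = [45, 44, 46]
print(estimate_copy_number(target, background))  # 3.0 with these counts
```

Using medians rather than means reduces the influence of individual queries that happen to overlap repetitive sequence, although in practice query selection and GC-content effects would also need to be considered.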

6.3 The future of rare disease bioinformatics

Achieving a diagnosis for all patients affected by monogenic rare genetic disease is within reach. At a higher level, several important factors are contributing to this goal including coordinated data-sharing efforts, systematic evaluation of undiagnosed rare disease patient data, large curated databases cataloguing population and pathogenic genomic variants, improved bioinformatic methodology, and increased knowledge of the function of the human genome (Boycott, Hartley et al. 2019). In a comprehensive forecast of the future of rare disease diagnosis, Boycott et al. lay out several areas of future development including investigation into complex classes of variants, disruptions to gene regulation, and rare disease modifiers.

The recent publication of large, comprehensive databases of structural variants of multiple classes (Collins, Brand et al. 2020), paired with thorough benchmarks of optimal performing bioinformatic methods (Kosugi, Momozawa et al. 2019, Sarwal, Niehus et al. 2020), show promise for the detection of pathogenic structural variants in rare disease patients. While these approaches have been tailored for short reads, many structural variants can only be resolved using longer reads, including those which can span repetitive elements such as PacBio’s HiFi technology and Oxford Nanopore’s long-read DNA sequencing (Bowden, Davies et al. 2019, Wenger, Peluso et al. 2019). Similar to short read technology, these long read platforms need to see a decrease in price, improvement in accuracy, and robust bioinformatic method development before achieving their true potential. Long reads hold much promise in the detection of repeat expansions, as the full length of each expansion can be captured in a single read, obviating the need for advanced methods which leverage unmapped reads and paired-end information. A remaining challenge which impedes the ability to identify all the complex variant classes is the reliance upon a linear reference genome, which doesn’t accurately represent the diversity in nucleotide content between individuals (Ballouz, Dobin et al. 2019). Advancements in graph genomes, paired with long read technologies and larger population databases of genomic variants, will enable more complete profiling of the genomic differences between individuals, including those which result in rare disease phenotypes.

The diverse classes of genomic variants which will inevitably be observed will mostly fall within the noncoding regions of the genome. This is due to a lower selective pressure on noncoding sequences, combined with the majority of the genome not coding for protein (Abel, Larson et al. 2020). Being able to interpret variants within these regions requires an understanding of how genes are regulated, and how they are spliced. While much work has gone into the impact of small variants on splicing, the impact of structural variants on splicing is now beginning to be explored (Payer, Steranka et al. 2019), especially in light of examples of disrupted splicing in rare disease patients from mobile element insertions (Tarailo-Graovac, Drögemöller et al. 2017). Advanced methods which can predict impacts on splicing, including those which can interpret structural variants, will be important for the interpretation of rare variants within the introns of relevant disease genes. Beyond splicing, genes can be dysregulated at the transcriptional or post-transcriptional levels. Post-transcriptional regulation is typically facilitated by the interaction of proteins with the untranslated regions of mRNAs, so rare variants over these genic regions could disrupt the capacity for a gene to be translated into protein, or prevent it from being targeted by degradation feedback mechanisms. At the transcriptional level, ongoing research is beginning to decode the language of gene regulation. The regulation of each gene is distinct and complex, and is coordinated by proteins binding the DNA at specific regions proximal to the gene known as enhancers (Wasserman and Sandelin 2004). These proteins which recognize specific DNA sequences are called transcription factors, and they are known to bind DNA and perform many tasks including remodeling the local chromatin landscape, holding the DNA open to allow for biochemical activity, and recruiting additional proteins including RNA polymerase. There are over 1500 transcription factors encoded in the human genome, and knowing the mechanism of action (activation, repression, cooperativity, recruitment, remodeling, etc.) and binding sequence of each transcription factor is an ongoing endeavour (Sandelin, Alkema et al. 2004, Lambert, Jolma et al. 2018, Fornes, Castro-Mondragon et al. 2020). A given gene can be regulated by a complex mix of transcription factors which bind to the promoter and proximal enhancer regions, so knowing if a variant disrupts the ability to bind sequences is important. Several in silico predictors are being developed and improved, and in the near future it will be possible to reliably interpret the functionality of a variant within a cis-regulatory region (Zhou and Troyanskaya 2015, Zhou, Theesfeld et al. 2018, Shi, Fornes et al. 2019). Beyond disrupting the enhancing or silencing elements, disruptions to chromatin compartments referred to as topologically associating domains (TADs) have been implicated in rare diseases (Lupiáñez, Spielmann et al. 2016). While our understanding of the tolerance for disruptions to chromatin boundaries and enhancer elements is currently limited, the emerging catalogs of population-scale whole genome variant sets will begin to reveal where noncoding mutations are tolerated, and where they are pathogenic (Karczewski, Francioli et al. 2020). This, combined with more comprehensive definitions of TADs across tissues and cell types, will enable the interpretation of variants which disrupt TADs of relevant disease genes.

Rare genetic diseases are known to have differences in severity amongst different individuals, even in cases where two individuals share the same exact genomic variant which disrupts a given gene (Rahit and Tarailo-Graovac 2020). Knowing the genetic or environmental modifiers for specific rare diseases, whether they are coding variants disrupting a pathway or regulatory variants affecting a disease gene, can inform the prognosis and in some cases lead to potential treatments. An example of modifiers being identified in rare disease is in Spinal Muscular Atrophy, where alleles within CORO1C and PLS3 were identified as protective, highlighting impaired endocytosis as a potential therapeutic rescue mechanism (Hosseinibarkooie, Peters et al. 2016). Detecting modifiers of rare genetic diseases is challenging, and can be supported by additional assays which examine molecular disruptions at the mRNA, lipid, protein, and DNA methylation levels; collectively termed multi-omic profiling. Modifiers can come in different forms, and can act upon genes in cis or in trans (Harper, Nayee et al. 2015). Identifying modifiers of rare disease is a challenging task, as genetic and environmental contributions can be heterogeneous and elucidating them requires large cohorts, which can be challenging to assemble for rare disease patients. Experimental designs which leverage sibling comparisons, where both siblings are affected by the same disease yet have different outcomes, may hold promise in detecting modifiers at the genetic or molecular level, although observed candidate modifiers from pilot studies need to be confirmed in larger patient cohorts (Richmond, van der Kloet et al. 2020).

6.4 Closing

In the not-too-distant future, whole genome sequencing will be the standard diagnostic approach for patients affected by rare genetic diseases. As we sequence patients and find pathogenic variants underlying their disease phenotypes, we will continue to unravel the complexity of the human genome. While diagnosing a rare genetic disease brings increased knowledge of the human genome, the primary goal is always to benefit the family affected by the disease. Once the broken gene is identified, there is potential for treatment, or planning for the disease progression.

Gene therapy, enzyme replacement therapy, antisense oligonucleotides, and even specially tailored diets are all potential treatment options which can be explored. But first, the broken gene must be found. Hopefully, the work within this thesis will expand the utility of whole genome sequencing in the diagnosis of rare genetic disorders (even if just by a little bit), enabling the identification of the broken genes and putting an end to the diagnostic odyssey of patients around the world.

Bibliography

Abel, H. J., D. E. Larson, A. A. Regier, C. Chiang, I. Das, K. L. Kanchi, R. M. Layer, B. M. Neale, W. J. Salerno, C. Reeves, S. Buyske, N. C. f. C. D. Genomics, T. C. Matise, D. M. Muzny, M. C. Zody, E. S. Lander, S. K. Dutcher, N. O. Stitziel and I. M. Hall (2020). "Mapping and characterization of structural variation in 17,795 human genomes." Nature 583: 83-89.

Abyzov, A., A. E. Urban, M. Snyder and M. Gerstein (2011). "CNVnator: an approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing." Genome Res. 21(6): 974-984.

Adzhubei, I., D. M. Jordan and S. R. Sunyaev (2013). "Predicting functional effect of human missense mutations using PolyPhen-2." Curr Protoc Hum Genet Chapter 7: Unit7 20.

Alfares, A., T. Aloraini, L. A. Subaie, A. Alissa, A. A. Qudsi, A. Alahmad, F. A. Mutairi, A. Alswaid, A. Alothaim, W. Eyaid, M. Albalwi, S. Alturki and M. Alfadhel (2018). "Whole-genome sequencing offers additional but limited clinical utility compared with reanalysis of whole-exome sequencing." Genet. Med. 20(11): 1328-1333.

Altman, B. J., Z. E. Stine and C. V. Dang (2016). "From Krebs to clinic: glutamine metabolism to cancer therapy." Nat. Rev. Cancer 16(11): 749.

Ashley, E. A. (2016). "Towards precision medicine." Nature Reviews Genetics 17(9): 507-522.

Ballouz, S., A. Dobin and J. A. Gillis (2019). "Is it time to change the reference genome?" Genome Biol. 20(1): 159.

Beaulieu, C. L., J. Majewski, J. Schwartzentruber, M. E. Samuels, B. A. Fernandez, F. P. Bernier, M. Brudno, B. Knoppers, J. Marcadier, D. Dyment, S. Adam, D. E. Bulman, S. J. M. Jones, D. Avard, M. T. Nguyen, F. Rousseau, C. Marshall, R. F. Wintle, Y. Shen, S. W. Scherer, F. C. Consortium, J. M. Friedman, J. L. Michaud and K. M. Boycott (2014). "FORGE Canada Consortium: outcomes of a 2-year national rare-disease gene-discovery project." Am. J. Hum. Genet. 94(6): 809-817.

Belkadi, A., A. Bolze, Y. Itan, A. Cobat, Q. B. Vincent, A. Antipenko, L. Shang, B. Boisson, J.-L. Casanova and L. Abel (2015). "Whole-genome sequencing is more powerful than whole-exome sequencing for detecting exome variants." Proc. Natl. Acad. Sci. U. S. A. 112(17): 5473-5478.

Benson, G. (1999). "Tandem repeats finder: a program to analyze DNA sequences." Nucleic Acids Res. 27(2): 573-580.

Bhuvaneshwar, K., L. Song, S. Madhavan and Y. Gusev (2018). "viGEN: An Open Source Pipeline for the Detection and Quantification of Viral RNA in Human Tumors." Front. Microbiol. 9: 1172.

Bowden, R., R. W. Davies, A. Heger, A. T. Pagnamenta, M. de Cesare, L. E. Oikkonen, D. Parkes, C. Freeman, F. Dhalla, S. Y. Patel, N. Popitsch, C. L. C. Ip, H. E. Roberts, S. Salatino, H. Lockstone, G. Lunter, J. C. Taylor, D. Buck, M. A. Simpson and P. Donnelly (2019). "Sequencing of human genomes with nanopore technology." Nat. Commun. 10(1): 1-9.

Boycott, K. M., T. Hartley, L. G. Biesecker, R. A. Gibbs, A. M. Innes, O. Riess, J. Belmont, S. L. Dunwoodie, N. Jojic, T. Lassmann, D. Mackay, I. K. Temple, A. Visel and G. Baynam (2019). "A Diagnosis for All Rare Genetic Diseases: The Horizon and the Next Frontiers." Cell 177(1): 32-37.

Boycott, K. M., M. R. Vanstone, D. E. Bulman and A. E. MacKenzie (2013). "Rare-disease genetics in the era of next-generation sequencing: discovery to translation." Nat. Rev. Genet. 14(10): 681-691.

Bray, N. L., H. Pimentel, P. Melsted and L. Pachter (2016). "Near-optimal probabilistic RNA-seq quantification." Nat. Biotechnol. 34(5): 525-527.

C Yuen, R. K., D. Merico, M. Bookman, J. L Howe, B. Thiruvahindrapuram, R. V. Patel, J. Whitney, N. Deflaux, J. Bingham, Z. Wang, G. Pellecchia, J. A. Buchanan, S. Walker, C. R. Marshall, M. Uddin, M. Zarrei, E. Deneault, L. D'Abate, A. J. S. Chan, S. Koyanagi, T. Paton, S. L. Pereira, N. Hoang, W. Engchuan, E. J. Higginbotham, K. Ho, S. Lamoureux, W. Li, J. R. MacDonald, T. Nalpathamkalam, W. W. L. Sung, F. J. Tsoi, J. Wei, L. Xu, A.-M. Tasse, E. Kirby, W. Van Etten, S. Twigger, W. Roberts, I. Drmic, S. Jilderda, B. M. Modi, B. Kellam, M. Szego, C. Cytrynbaum, R. Weksberg, L. Zwaigenbaum, M. Woodbury-Smith, J. Brian, L. Senman, A. Iaboni, K. Doyle-Thomas, A. Thompson, C. Chrysler, J. Leef, T. Savion-Lemieux, I. M. Smith, X. Liu, R. Nicolson, V. Seifer, A. Fedele, E. H. Cook, S. Dager, A. Estes, L. Gallagher, B. A. Malow, J. R. Parr, S. J. Spence, J. Vorstman, B. J. Frey, J. T. Robinson, L. J. Strug, B. A. Fernandez, M. Elsabbagh, M. T. Carter, J. Hallmayer, B. M. Knoppers, E. Anagnostou, P. Szatmari, R. H. Ring, D. Glazer, M. T. Pletcher and S. W. Scherer (2017). "Whole genome sequencing resource identifies 18 new candidate genes for autism spectrum disorder." Nat. Neurosci. 20(4): 602-611.

Cameron, D. L., L. Di Stefano and A. T. Papenfuss (2019). "Comprehensive evaluation and characterisation of short read general-purpose structural variant calling software." Nat. Commun. 10(1): 3240.

Chacón, A., S. Marco-Sola, A. Espinosa, P. Ribeca and J. C. Moure (2015). "Boosting the FM-Index on the GPU: Effective Techniques to Mitigate Random Memory Access." IEEE/ACM Trans. Comput. Biol. Bioinform. 12(5): 1048-1059.

Chen, N., Y. Li and Y. Lu (2018). A Memory-Efficient FM-Index Constructor for Next-Generation Sequencing Applications on FPGAs. 2018 IEEE International Symposium on Circuits and Systems (ISCAS): 1-4.

Chen, S., P. Krusche, E. Dolzhenko, R. M. Sherman, R. Petrovski, F. Schlesinger, M. Kirsche, D. R. Bentley, M. C. Schatz, F. J. Sedlazeck and M. A. Eberle (2019). "Paragraph: a graph-based structural variant genotyper for short-read sequence data." Genome Biol. 20(1): 291.

Chen, X., A. Sanchis-Juan, C. E. French, A. J. Connell, I. Delon, Z. Kingsbury, A. Chawla, A. L. Halpern, R. J. Taft, N. BioResource, D. R. Bentley, M. E. R. Butchbach, F. L. Raymond and M. A. Eberle (2020). "Spinal muscular atrophy diagnosis and carrier screening from genome sequencing data." Genet. Med. 22(5): 945-953.

Chen, X., O. Schulz-Trieglaff, R. Shaw, B. Barnes, F. Schlesinger, M. Källberg, A. J. Cox, S. Kruglyak and C. T. Saunders (2016). "Manta: rapid detection of structural variants and indels for germline and cancer sequencing applications." Bioinformatics 32(8): 1220-1222.

Cheung, W. A., B. F. Ouellette and W. W. Wasserman (2012). "Quantitative biomedical annotation using medical subject heading over-representation profiles (MeSHOPs)." BMC Bioinformatics 13(1): 249.

Cingolani, P., A. Platts, L. L. Wang, M. Coon, T. Nguyen, L. Wang, S. J. Land, X. Lu and D. M. Ruden (2012). "A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3." Fly 6(2): 80-92.

Collins, R. L., H. Brand, K. J. Karczewski, X. Zhao, J. Alföldi, L. C. Francioli, A. V. Khera, C. Lowther, L. D. Gauthier, H. Wang, N. A. Watts, M. Solomonson, A. O’Donnell-Luria, A. Baumann, R. Munshi, M. Walker, C. W. Whelan, Y. Huang, T. Brookings, T. Sharpe, M. R. Stone, E. Valkanas, J. Fu, G. Tiao, K. M. Laricchia, V. Ruano-Rubio, C. Stevens, N. Gupta, C. Cusick, L. Margolin, K. D. Taylor, H. J. Lin, S. S. Rich, W. S. Post, Y.-D. I. Chen, J. I. Rotter, C. Nusbaum, A. Philippakis, E. Lander, S. Gabriel, B. M. Neale, S. Kathiresan, M. J. Daly, E. Banks, D. G. MacArthur and M. E. Talkowski (2020). "A structural variation reference for medical and population genetics." Nature 581(7809): 444-451.

Consortium, I. H. G. S. and C. International Human Genome Sequencing (2001). "Initial sequencing and analysis of the human genome." Nature 409(6822): 860-921.

Cooper, G. M. and C. D. Brown (2008). "Qualifying the relationship between sequence conservation and molecular function." Genome Res. 18(2): 201-205.

Corbett, M. A., T. Kroes, L. Veneziano, M. F. Bennett, R. Florian, A. L. Schneider, A. Coppola, L. Licchetta, S. Franceschetti, A. Suppa, A. Wenger, D. Mei, M. Pendziwiat, S. Kaya, M. Delledonne, R. Straussberg, L. Xumerle, B. Regan, D. Crompton, A.-F. van Rootselaar, A. Correll, R. Catford, F. Bisulli, S. Chakraborty, S. Baldassari, P. Tinuper, K. Barton, S. Carswell, M. Smith, A. Berardelli, R. Carroll, A. Gardner, K. L. Friend, I. Blatt, M. Iacomino, C. Di Bonaventura, S. Striano, J. Buratti, B. Keren, C. Nava, S. Forlani, G. Rudolf, E. Hirsch, E. Leguern, P. Labauge, S. Balestrini, J. W. Sander, Z. Afawi, I. Helbig, H. Ishiura, S. Tsuji, S. M. Sisodiya, G. Casari, L. G. Sadleir, R. van Coller, M. A. J. Tijssen, K. M. Klein, A. M. J. M. van den Maagdenberg, F. Zara, R. Guerrini, S. F. Berkovic, T. Pippucci, L. Canafoglia, M. Bahlo, P. Striano, I. E. Scheffer, F. Brancati, C. Depienne and J. Gecz (2019). "Intronic ATTTC repeat expansions in STARD7 in familial adult myoclonic epilepsy linked to chromosome 2." Nat. Commun. 10(1): 4920.

Cortese, A., R. Simone, R. Sullivan, J. Vandrovcova, H. Tariq, W. Y. Yau, J. Humphrey, Z. Jaunmuktane, P. Sivakumar, J. Polke, M. Ilyas, E. Tribollet, P. J. Tomaselli, G. Devigili, I. Callegari, M. Versino, V. Salpietro, S. Efthymiou, D. Kaski, N. W. Wood, N. S. Andrade, E. Buglo, A. Rebelo, A. M. Rossor, A. Bronstein, P. Fratta, W. J. Marques, S. Züchner, M. M. Reilly and H. Houlden (2019). "Biallelic expansion of an intronic repeat in RFC1 is a common cause of late-onset ataxia." Nat. Genet. 51(4): 649-658.

Cummings, B. B., J. L. Marshall, T. Tukiainen, M. Lek, S. Donkervoort, A. Reghan Foley, V. Bolduc, L. Waddell, S. Sandaradura, G. L. O'Grady, E. Estrella, H. M. Reddy, F. Zhao, B. Weisburd, K. Karczewski, A. O'Donnell-Luria, D. Birnbaum, A. Sarkozy, Y. Hu, H. Gonorazky, K. Claeys, H. Joshi, A. Bournazos, E. Oates, R. Ghaoui, M. Davis, N. G. Laing, A. Topf, A. Beggs, P. B. Kang, K. N. North, V. Straub, J. Dowling, F. Muntoni, N. F. Clarke, S. T. Cooper, C. G. Bonnemann, D. G. MacArthur and G. T. Consortium (2016). "Improving genetic diagnosis in Mendelian disease with transcriptome sequencing." Sci. Transl. Med. 9(386): eaal5209.

Dashnow, H., M. Lek, B. Phipson, A. Halman, S. Sadedin, A. Lonsdale, M. Davis, P. Lamont, J. S. Clayton, N. G. Laing, D. G. MacArthur and A. Oshlack (2018). "STRetch: detecting and discovering pathogenic short tandem repeat expansions." Genome Biol. 19(1): 121.

de Koning, A. P. J., W. Gu, T. A. Castoe, M. A. Batzer and D. D. Pollock (2011). "Repetitive elements may comprise over two-thirds of the human genome." PLoS Genet. 7(12): e1002384.

DeJesus-Hernandez, M., I. R. Mackenzie, B. F. Boeve, A. L. Boxer, M. Baker, N. J. Rutherford, A. M. Nicholson, N. A. Finch, H. Flynn, J. Adamson, N. Kouri, A. Wojtas, P. Sengdy, G.-Y. R. Hsiung, A. Karydas, W. W. Seeley, K. A. Josephs, G. Coppola, D. H. Geschwind, Z. K. Wszolek, H. Feldman, D. S. Knopman, R. C. Petersen, B. L. Miller, D. W. Dickson, K. B. Boylan, N. R. Graff-Radford and R. Rademakers (2011). "Expanded GGGGCC hexanucleotide repeat in noncoding region of C9ORF72 causes chromosome 9p-linked FTD and ALS." Neuron 72(2): 245-256.

Dilthey, A., C. Cox, Z. Iqbal, M. R. Nelson and G. McVean (2015). "Improved genome inference in the MHC using a population reference graph." Nat. Genet. 47(6): 682-688.

Dolle, D. D., Z. Liu, M. Cotten, J. T. Simpson, Z. Iqbal, R. Durbin, S. A. McCarthy and T. M. Keane (2017). "Using reference-free compressed data structures to analyze sequencing reads from thousands of human genomes." Genome Res. 27(2): 300-309.

Dolzhenko, E., M. F. Bennett, P. A. Richmond, B. Trost, S. Chen, J. J. F. A. van Vugt, C. Nguyen, G. Narzisi, V. G. Gainullin, A. M. Gross, B. R. Lajoie, R. J. Taft, W. W. Wasserman, S. W. Scherer, J. H. Veldink, D. R. Bentley, R. K. C. Yuen, M. Bahlo and M. A. Eberle (2020). "ExpansionHunter Denovo: a computational method for locating known and novel repeat expansions in short-read sequencing data." Genome Biol. 21(1): 102.

Dolzhenko, E., V. Deshpande, F. Schlesinger, P. Krusche, R. Petrovski, S. Chen, D. Emig-Agius, A. Gross, G. Narzisi, B. Bowman, K. Scheffler, J. J. F. A. van Vugt, C. French, A. Sanchis-Juan, K. Ibáñez, A. Tucci, B. R. Lajoie, J. H. Veldink, F. L. Raymond, R. J. Taft, D. R. Bentley and M. A. Eberle (2019). "ExpansionHunter: a sequence-graph-based tool to analyze variation in short tandem repeat regions." Bioinformatics 35(22): 4754-4756.

Dolzhenko, E., J. J. F. A. van Vugt, R. J. Shaw, M. A. Bekritsky, M. van Blitterswijk, G. Narzisi, S. S. Ajay, V. Rajan, B. R. Lajoie, N. H. Johnson, Z. Kingsbury, S. J. Humphray, R. D. Schellevis, W. J. Brands, M. Baker, R. Rademakers, M. Kooyman, G. H. P. Tazelaar, M. A. van Es, R. McLaughlin, W. Sproviero, A. Shatunov, A. Jones, A. Al Khleifat, A. Pittman, S. Morgan, O. Hardiman, A. Al-Chalabi, C. Shaw, B. Smith, E. J. Neo, K. Morrison, P. J. Shaw, C. Reeves, L. Winterkorn, N. S. Wexler, U. S. V. C. R. Group, D. E. Housman, C. W. Ng, A. L. Li, R. J. Taft, L. H. van den Berg, D. R. Bentley, J. H. Veldink and M. A. Eberle (2017). "Detection of long repeat expansions from PCR-free whole-genome sequence data." Genome Res. 27(11): 1895-1903.

Dragojlovic, N., A. M. Elliott, S. Adam, C. van Karnebeek, A. Lehman, J. C. Mwenifumbo, T. N. Nelson, C. du Souich, J. M. Friedman and L. D. Lynd (2018). "The cost and diagnostic yield of exome sequencing for children with suspected genetic disorders: a benchmarking study." Genet. Med. 20(9): 1013-1021.

Ebbert, M. T. W., T. D. Jensen, K. Jansen-West, J. P. Sens, J. S. Reddy, P. G. Ridge, J. S. K. Kauwe, V. Belzil, L. Pregent, M. M. Carrasquillo, D. Keene, E. Larson, P. Crane, Y. W. Asmann, N. Ertekin-Taner, S. G. Younkin, O. A. Ross, R. Rademakers, L. Petrucelli and J. D. Fryer (2019). "Systematic analysis of dark and camouflaged genes reveals disease-relevant genes hiding in plain sight." Genome Biol. 20(1): 97.

Eberle, M. A., E. Fritzilas, P. Krusche, M. Källberg, B. L. Moore, M. A. Bekritsky, Z. Iqbal, H.-Y. Chuang, S. J. Humphray, A. L. Halpern, S. Kruglyak, E. H. Margulies, G. McVean and D. R. Bentley (2017). "A reference data set of 5.4 million phased human variants validated by genetic inheritance from sequencing a three-generation 17-member pedigree." Genome Res. 27(1): 157-164.

Eilbeck, K., A. Quinlan and M. Yandell (2017). "Settling the score: variant prioritization and Mendelian disease." Nat. Rev. Genet. 18(10): 599-612.

Elgadi, K. M., R. A. Meguid, M. Qian, W. W. Souba and S. F. Abcouwer (1999). "Cloning and analysis of unique human glutaminase isoforms generated by tissue-specific alternative splicing." Physiol. Genomics 1(2): 51-62.

Elgar, G. and T. Vavouri (2008). "Tuning in to the signals: noncoding sequence conservation in vertebrate genomes." Trends Genet. 24(7): 344-352.

Erikson, G. A., D. L. Bodian, M. Rueda, B. Molparia, E. R. Scott, A. A. Scott-Van Zeeland, S. E. Topol, N. E. Wineinger, J. E. Niederhuber, E. J. Topol and A. Torkamani (2016). "Whole-Genome Sequencing of a Healthy Aging Cohort." Cell 165(4): 1002-1011.

Escalona, M., S. Rocha and D. Posada (2016). "A comparison of tools for the simulation of genomic next-generation sequencing data." Nat. Rev. Genet. 17(8): 459-469.

Ewing, A. D., K. E. Houlahan, Y. Hu, K. Ellrott, C. Caloian, T. N. Yamaguchi, J. C. Bare, C. P'ng, D. Waggott, V. Y. Sabelnykova, I.-T. D. S. M. C. C. participants, M. R. Kellen, T. C. Norman, D. Haussler, S. H. Friend, G. Stolovitzky, A. A. Margolin, J. M. Stuart and P. C. Boutros (2015). "Combining tumor genome simulation with crowdsourcing to benchmark somatic single-nucleotide-variant detection." Nat. Methods 12(7): 623-630.

Feuk, L., A. R. Carson and S. W. Scherer (2006). "Structural variation in the human genome." Nature Reviews Genetics 7(2): 85-97.

Florian, R. T., F. Kraft, E. Leitão, S. Kaya, S. Klebe, E. Magnin, A.-F. van Rootselaar, J. Buratti, T. Kühnel, C. Schröder, S. Giesselmann, N. Tschernoster, J. Altmueller, A. Lamiral, B. Keren, C. Nava, D. Bouteiller, S. Forlani, L. Jornea, R. Kubica, T. Ye, D. Plassard, B. Jost, V. Meyer, J.-F. Deleuze, Y. Delpu, M. D. M. Avarello, L. S. Vijfhuizen, G. Rudolf, E. Hirsch, T. Kroes, P. S. Reif, F. Rosenow, C. Ganos, M. Vidailhet, L. Thivard, A. Mathieu, T. Bourgeron, I. Kurth, H. Rafehi, L. Steenpass, B. Horsthemke, F. consortium, E. LeGuern, K. M. Klein, P. Labauge, M. F. Bennett, M. Bahlo, J. Gecz, M. A. Corbett, M. A. J. Tijssen, A. M. J. M. van den Maagdenberg and C. Depienne (2019). "Unstable TTTTA/TTTCA expansions in MARCH6 are associated with Familial Adult Myoclonic Epilepsy type 3." Nat. Commun. 10(1): 4919.

Fornes, O., J. A. Castro-Mondragon, A. Khan, R. van der Lee, X. Zhang, P. A. Richmond, B. P. Modi, S. Correard, M. Gheorghe, D. Baranašić, W. Santana-Garcia, G. Tan, J. Chèneby, B. Ballester, F. Parcy, A. Sandelin, B. Lenhard, W. W. Wasserman and A. Mathelier (2020). "JASPAR 2020: update of the open-access database of transcription factor binding profiles." Nucleic Acids Res. 48(D1): D87-D92.

Fotsing, S. F., J. Margoliash, C. Wang, S. Saini, R. Yanicky, S. Shleizer-Burko, A. Goren and M. Gymrek (2019). "The impact of short tandem repeat variation on gene expression." Nat. Genet. 51(11): 1652-1659.

Gaisler-Salomon, I., G. M. Miller, N. Chuhma, S. Lee, H. Zhang, F. Ghoddoussi, N. Lewandowski, S. Fairhurst, Y. Wang, A. Conjard-Duplany, J. Masson, P. Balsam, R. Hen, O. Arancio, M. P. Galloway, H. M. Moore, S. A. Small and S. Rayport (2009). "Glutaminase-deficient mice display hippocampal hypoactivity, insensitivity to pro-psychotic drugs and potentiated latent inhibition: relevance to schizophrenia." Neuropsychopharmacology 34(10): 2305-2322.

Gardner, E. J., V. K. Lam, D. N. Harris, N. T. Chuang, E. C. Scott, W. S. Pittard, R. E. Mills, C. Genomes Project and S. E. Devine (2017). "The Mobile Element Locator Tool (MELT): population-scale mobile element discovery and biology." Genome Res. 27(11): 1916-1929.

Gardy, J., N. J. Loman and A. Rambaut (2015). "Real-time digital pathogen surveillance — the time is now." Genome Biology 16(1).

Garrison, E. and G. Marth (2012). "Haplotype-based variant detection from short-read sequencing." arXiv [q-bio.GN]. 1207.3907.

Gelfman, S., Q. Wang, K. M. McSweeney, Z. Ren, F. La Carpia, M. Halvorsen, K. Schoch, F. Ratzon, E. L. Heinzen, M. J. Boland, S. Petrovski and D. B. Goldstein (2017). "Annotating pathogenic non-coding variants in genic regions." Nat. Commun. 8(1): 236.

Genomes Project, C., A. Auton, L. D. Brooks, R. M. Durbin, E. P. Garrison, H. M. Kang, J. O. Korbel, J. L. Marchini, S. McCarthy, G. A. McVean and G. R. Abecasis (2015). "A global reference for human genetic variation." Nature 526(7571): 68-74.

Ghaffaari, A. and T. Marschall (2019). "Fully-sensitive seed finding in sequence graphs using a hybrid index." Bioinformatics 35(14): i81-i89.

Gilissen, C., J. Y. Hehir-Kwa, D. T. Thung, M. van de Vorst, B. W. M. van Bon, M. H. Willemsen, M. Kwint, I. M. Janssen, A. Hoischen, A. Schenck, R. Leach, R. Klein, R. Tearle, T. Bo, R. Pfundt, H. G. Yntema, B. B. A. de Vries, T. Kleefstra, H. G. Brunner, L. E. L. M. Vissers and J. A. Veltman (2014). "Genome sequencing identifies major causes of severe intellectual disability." Nature 511(7509): 344-347.

Goldfeder, R. L., J. R. Priest, J. M. Zook, M. E. Grove, D. Waggott, M. T. Wheeler, M. Salit and E. A. Ashley (2016). "Medical implications of technical accuracy in genome sequencing." Genome Med. 8(1): 24.

Green, R. C., J. S. Berg, W. W. Grody, S. S. Kalia, B. R. Korf, C. L. Martin, A. L. McGuire, R. L. Nussbaum, J. M. O'Daniel, K. E. Ormond, H. L. Rehm, M. S. Watson, M. S. Williams, L. G. Biesecker, G. American College of Medical and Genomics (2013). "ACMG recommendations for reporting of incidental findings in clinical exome and genome sequencing." Genet. Med. 15(7): 565-574.

Gudbjartsson, D. F., H. Helgason, S. A. Gudjonsson, F. Zink, A. Oddson, A. Gylfason, S. Besenbacher, G. Magnusson, B. V. Halldorsson, E. Hjartarson, G. T. Sigurdsson, S. N. Stacey, M. L. Frigge, H. Holm, J. Saemundsdottir, H. T. Helgadottir, H. Johannsdottir, G. Sigfusson, G. Thorgeirsson, J. T. Sverrisson, S. Gretarsdottir, G. B. Walters, T. Rafnar, B. Thjodleifsson, E. S. Bjornsson, S. Olafsson, H. Thorarinsdottir, T. Steingrimsdottir, T. S. Gudmundsdottir, A. Theodors, J. G. Jonasson, A. Sigurdsson, G. Bjornsdottir, J. J. Jonsson, O. Thorarensen, P. Ludvigsson, H. Gudbjartsson, G. I. Eyjolfsson, O. Sigurdardottir, I. Olafsson, D. O. Arnar, O. T. Magnusson, A. Kong, G. Masson, U. Thorsteinsdottir, A. Helgason, P. Sulem and K. Stefansson (2015). "Large-scale whole-genome sequencing of the Icelandic population." Nat. Genet. 47(5): 435-444.

Guéant, J.-L., C. Chéry, A. Oussalah, J. Nadaf, D. Coelho, T. Josse, J. Flayac, A. Robert, I. Koscinski, I. Gastin, P. Filhine-Tresarrieu, M. Pupavac, A. Brebner, D. Watkins, T. Pastinen, A. Montpetit, F. Hariri, D. Tregouët, B. A. Raby, W. K. Chung, P.-E. Morange, D. S. Froese, M. R. Baumgartner, J.-F. Benoist, C. Ficicioglu, V. Marchand, Y. Motorin, C. Bonnemains, F. Feillet, J. Majewski and D. S. Rosenblatt (2018). "A PRDX1 mutant allele causes a MMACHC secondary epimutation in cblC patients." Nat. Commun. 9(1): 67.

Guergueltcheva, V., D. N. Azmanov, D. Angelicheva, K. R. Smith, T. Chamova, L. Florez, M. Bynevelt, T. Nguyen, S. Cherninkova, V. Bojinova, A. Kaprelyan, L. Angelova, B. Morar, D. Chandler, R. Kaneva, M. Bahlo, I. Tournev and L. Kalaydjieva (2012). "Autosomal-recessive congenital cerebellar ataxia is caused by mutations in metabotropic glutamate receptor 1." Am. J. Hum. Genet. 91(3): 553-564.

Hach, F., I. Numanagić and S. C. Sahinalp (2014). "DeeZ: reference-based compression by local assembly." Nat. Methods 11(11): 1082-1084.

Haeussler, M., A. S. Zweig, C. Tyner, M. L. Speir, K. R. Rosenbloom, B. J. Raney, C. M. Lee, B. T. Lee, A. S. Hinrichs, J. N. Gonzalez, D. Gibson, M. Diekhans, H. Clawson, J. Casper, G. P. Barber, D. Haussler, R. M. Kuhn and W. J. Kent (2019). "The UCSC Genome Browser database: 2019 update." Nucleic Acids Res. 47(D1): D853-D858.

Hannan, A. J. (2018). "Tandem repeats mediating genetic plasticity in health and disease." Nat. Rev. Genet. 19(5): 286-298.

Harper, A. R., S. Nayee and E. J. Topol (2015). "Protective alleles and modifier variants in human health and disease." Nat. Rev. Genet. 16(12): 689-701.

Haubold, B. and T. Wiehe (2006). "How repetitive are genomes?" BMC Bioinformatics 7(1).

Hosseinibarkooie, S., M. Peters, L. Torres-Benito, R. H. Rastetter, K. Hupperich, A. Hoffmann, N. Mendoza-Ferreira, A. Kaczmarek, E. Janzen, J. Milbradt, T. Lamkemeyer, F. Rigo, C. F. Bennett, C. Guschlbauer, A. Büschges, M. Hammerschmidt, M. Riessland, M. J. Kye, C. S. Clemen and B. Wirth (2016). "The Power of Human Protective Modifiers: PLS3 and CORO1C Unravel Impaired Endocytosis in Spinal Muscular Atrophy and Rescue SMA Phenotype." Am. J. Hum. Genet. 99(3): 647-665.

Huang, W., L. Li, J. R. Myers and G. T. Marth (2012). "ART: a next-generation sequencing read simulator." Bioinformatics 28(4): 593-594.

Hung, C.-L., T.-H. Hsu, H.-H. Wang and C.-Y. Lin (2018). "A GPU-based Bit-Parallel Multiple Pattern Matching Algorithm." 2018 IEEE 20th International Conference on High Performance Computing and Communications; IEEE 16th International Conference on Smart City; IEEE 4th International Conference on Data Science and Systems (HPCC/SmartCity/DSS).

Ishiura, H., K. Doi, J. Mitsui, J. Yoshimura, M. K. Matsukawa, A. Fujiyama, Y. Toyoshima, A. Kakita, H. Takahashi, Y. Suzuki, S. Sugano, W. Qu, K. Ichikawa, H. Yurino, K. Higasa, S. Shibata, A. Mitsue, M. Tanaka, Y. Ichikawa, Y. Takahashi, H. Date, T. Matsukawa, J. Kanda, F. K. Nakamoto, M. Higashihara, K. Abe, R. Koike, M. Sasagawa, Y. Kuroha, N. Hasegawa, N. Kanesawa, T. Kondo, T. Hitomi, M. Tada, H. Takano, Y. Saito, K. Sanpei, O. Onodera, M. Nishizawa, M. Nakamura, T. Yasuda, Y. Sakiyama, M. Otsuka, A. Ueki, K.-I. Kaida, J. Shimizu, R. Hanajima, T. Hayashi, Y. Terao, S. Inomata-Terada, M. Hamada, Y. Shirota, A. Kubota, Y. Ugawa, K. Koh, Y. Takiyama, N. Ohsawa-Yoshida, S. Ishiura, R. Yamasaki, A. Tamaoka, H. Akiyama, T. Otsuki, A. Sano, A. Ikeda, J. Goto, S. Morishita and S. Tsuji (2018). "Expansions of intronic TTTCA and TTTTA repeats in benign adult familial myoclonic epilepsy." Nat. Genet. 50(4): 581-590.

Ishiura, H., S. Shibata, J. Yoshimura, Y. Suzuki, W. Qu, K. Doi, M. A. Almansour, J. K. Kikuchi, M. Taira, J. Mitsui, Y. Takahashi, Y. Ichikawa, T. Mano, A. Iwata, Y. Harigaya, M. K. Matsukawa, T. Matsukawa, M. Tanaka, Y. Shirota, R. Ohtomo, H. Kowa, H. Date, A. Mitsue, H. Hatsuta, S. Morimoto, S. Murayama, Y. Shiio, Y. Saito, A. Mitsutake, M. Kawai, T. Sasaki, Y. Sugiyama, M. Hamada, G. Ohtomo, Y. Terao, Y. Nakazato, A. Takeda, Y. Sakiyama, Y. Umeda-Kameyama, J. Shinmi, K. Ogata, Y. Kohno, S.-Y. Lim, A. H. Tan, J. Shimizu, J. Goto, I. Nishino, T. Toda, S. Morishita and S. Tsuji (2019). "Noncoding CGG repeat expansions in neuronal intranuclear inclusion disease, oculopharyngodistal myopathy and an overlapping disease." Nat. Genet. 51(8): 1222-1232.

Jaganathan, K., S. Kyriazopoulou Panagiotopoulou, J. F. McRae, S. F. Darbandi, D. Knowles, Y. I. Li, J. A. Kosmicki, J. Arbelaez, W. Cui, G. B. Schwartz, E. D. Chow, E. Kanterakis, H. Gao, A. Kia, S. Batzoglou, S. J. Sanders and K. K.-H. Farh (2019). "Predicting Splicing from Primary Sequence with Deep Learning." Cell 176(3): 535-548.e524.

Jia, T., B. Munson, H. Lango Allen, T. Ideker and A. R. Majithia (2020). "Thousands of missing variants in the UK Biobank are recoverable by genome realignment." Ann. Hum. Genet. 84(3): 214-220.

Juan, L., Y. Wang, J. Jiang, Q. Yang, Q. Jiang and Y. Wang (2020). "PGsim: A Comprehensive and Highly Customizable Personal Genome Simulator." Front Bioeng Biotechnol 8: 28.

Karczewski, K. J., L. C. Francioli, G. Tiao, B. B. Cummings, J. Alföldi, Q. Wang, R. L. Collins, K. M. Laricchia, A. Ganna, D. P. Birnbaum, L. D. Gauthier, H. Brand, M. Solomonson, N. A. Watts, D. Rhodes, M. Singer-Berk, E. M. England, E. G. Seaby, J. A. Kosmicki, R. K. Walters, K. Tashman, Y. Farjoun, E. Banks, T. Poterba, A. Wang, C. Seed, N. Whiffin, J. X. Chong, K. E. Samocha, E. Pierce-Hoffman, Z. Zappala, A. H. O’Donnell-Luria, E. V. Minikel, B. Weisburd, M. Lek, J. S. Ware, C. Vittal, I. M. Armean, L. Bergelson, K. Cibulskis, K. M. Connolly, M. Covarrubias, S. Donnelly, S. Ferriera, S. Gabriel, J. Gentry, N. Gupta, T. Jeandet, D. Kaplan, C. Llanwarne, R. Munshi, S. Novod, N. Petrillo, D. Roazen, V. Ruano-Rubio, A. Saltzman, M. Schleicher, J. Soto, K. Tibbetts, C. Tolonen, G. Wade, M. E. Talkowski, B. M. Neale, M. J. Daly and D. G. MacArthur (2020). "The mutational constraint spectrum quantified from variation in 141,456 humans." Nature 581(7809): 434-443.

Karolchik, D. (2004). "The UCSC Table Browser data retrieval tool." Nucleic Acids Res. 32(Database issue): D493-D496.

Kaye, A. (2016). Methods for the graphical representation of genomic sequence data. US Patent. Uspto, University of British Columbia.

Kehr, B., K. Trappe, M. Holtgrewe and K. Reinert (2014). "Genome alignment with graph data structures: a comparison." BMC Bioinformatics 15(1): 99.

Kent, W. J., C. W. Sugnet, T. S. Furey, K. M. Roskin, T. H. Pringle, A. M. Zahler and D. Haussler (2002). "The human genome browser at UCSC." Genome Res. 12(6): 996-1006.

Kim, D., L. Song, F. P. Breitwieser and S. L. Salzberg (2016). "Centrifuge: rapid and sensitive classification of metagenomic sequences." Genome Res. 26(12): 1721-1729.

Kircher, M., D. M. Witten, P. Jain, B. J. O'Roak, G. M. Cooper and J. Shendure (2014). "A general framework for estimating the relative pathogenicity of human genetic variants." Nat. Genet. 46(3): 310-315.

Klijn, C., S. Durinck, E. W. Stawiski, P. M. Haverty, Z. Jiang, H. Liu, J. Degenhardt, O. Mayba, F. Gnad, J. Liu, G. Pau, J. Reeder, Y. Cao, K. Mukhyala, S. K. Selvaraj, M. Yu, G. J. Zynda, M. J. Brauer, T. D. Wu, R. C. Gentleman, G. Manning, R. L. Yauch, R. Bourgon, D. Stokoe, Z. Modrusan, R. M. Neve, F. J. de Sauvage, J. Settleman, S. Seshagiri and Z. Zhang (2015). "A comprehensive transcriptional portrait of human cancer cell lines." Nat. Biotechnol. 33(3): 306-312.

Köhler, S., N. A. Vasilevsky, M. Engelstad, E. Foster, J. McMurry, S. Aymé, G. Baynam, S. M. Bello, C. F. Boerkoel, K. M. Boycott, M. Brudno, O. J. Buske, P. F. Chinnery, V. Cipriani, L. E. Connell, H. J. S. Dawkins, L. E. DeMare, A. D. Devereau, B. B. A. de Vries, H. V. Firth, K. Freson, D. Greene, A. Hamosh, I. Helbig, C. Hum, J. A. Jähn, R. James, R. Krause, S. J. F Laulederkind, H. Lochmüller, G. J. Lyon, S. Ogishima, A. Olry, W. H. Ouwehand, N. Pontikos, A. Rath, F. Schaefer, R. H. Scott, M. Segal, P. I. Sergouniotis, R. Sever, C. L. Smith, V. Straub, R. Thompson, C. Turner, E. Turro, M. W. M. Veltman, T. Vulliamy, J. Yu, J. von Ziegenweidt, A. Zankl, S. Züchner, T. Zemojtel, J. O. B. Jacobsen, T. Groza, D. Smedley, C. J. Mungall, M. Haendel and P. N. Robinson (2017). "The Human Phenotype Ontology in 2017." Nucleic Acids Res. 45(D1): D865-D876.

Kosugi, S., Y. Momozawa, X. Liu, C. Terao, M. Kubo and Y. Kamatani (2019). "Comprehensive evaluation of structural variation detection algorithms for whole genome sequencing." Genome Biol. 20(1): 117.

Kremer, L. S., D. M. Bader, C. Mertes, R. Kopajtich, G. Pichler, A. Iuso, T. B. Haack, E. Graf, T. Schwarzmayr, C. Terrile, E. Koňaříková, B. Repp, G. Kastenmüller, J. Adamski, P. Lichtner, C. Leonhardt, B. Funalot, A. Donati, V. Tiranti, A. Lombes, C. Jardel, D. Gläser, R. W. Taylor, D. Ghezzi, J. A. Mayr, A. Rötig, P. Freisinger, F. Distelmaier, T. M. Strom, T. Meitinger, J. Gagneur and H. Prokisch (2017). "Genetic diagnosis of Mendelian disorders via RNA sequencing." Nat. Commun. 8: 15824.

Krusche, P., L. Trigg, P. C. Boutros, C. E. Mason, F. M. De La Vega, B. L. Moore, M. Gonzalez-Porta, M. A. Eberle, Z. Tezak, S. Lababidi, R. Truty, G. Asimenos, B. Funke, M. Fleharty, B. A. Chapman, M. Salit, J. M. Zook, G. Global Alliance for and T. Health Benchmarking (2019). "Best practices for benchmarking germline small-variant calls in human genomes." Nat. Biotechnol. 37(5): 555-560.

La Spada, A. R. and J. Paul Taylor (2010). "Repeat expansion disease: progress and puzzles in disease pathogenesis." Nat. Rev. Genet. 11(4): 247-258.

Labeit, J., J. Shun and G. E. Blelloch (2017). "Parallel lightweight wavelet tree, suffix array and FM-index construction." J. Discrete Algorithms 43: 2-17.

LaCroix, A. J., D. Stabley, R. Sahraoui, M. P. Adam, M. Mehaffey, K. Kernan, C. T. Myers, C. Fagerstrom, G. Anadiotis, Y. M. Akkari, K. M. Robbins, K. W. Gripp, W. A. R. Baratela, M. B. Bober, A. L. Duker, D. Doherty, J. C. Dempsey, D. G. Miller, M. Kircher, M. J. Bamshad, D. A. Nickerson, G. University of Washington Center for Mendelian, H. C. Mefford and K. Sol-Church (2019). "GGC Repeat Expansion and Exon 1 Methylation of XYLT1 Is a Common Pathogenic Variant in Baratela-Scott Syndrome." Am. J. Hum. Genet. 104(1): 35-44.

Lalioti, M. D., H. S. Scott, C. Buresi, C. Rossier, A. Bottani, M. A. Morris, A. Malafosse and S. E. Antonarakis (1997). "Dodecamer repeat expansion in cystatin B gene in progressive myoclonus epilepsy." Nature 386(6627): 847-851.

Lambert, S. A., A. Jolma, L. F. Campitelli, P. K. Das, Y. Yin, M. Albu, X. Chen, J. Taipale, T. R. Hughes and M. T. Weirauch (2018). "The Human Transcription Factors." Cell 172(4): 650-665.

Landrum, M. J., J. M. Lee, M. Benson, G. R. Brown, C. Chao, S. Chitipiralla, B. Gu, J. Hart, D. Hoffman, W. Jang, K. Karapetyan, K. Katz, C. Liu, Z. Maddipatla, A. Malheiro, K. McDaniel, M. Ovetsky, G. Riley, G. Zhou, J. B. Holmes, B. L. Kattman and D. R. Maglott (2017). "ClinVar: improving access to variant interpretations and supporting evidence." Nucleic Acids Res. 46(D1): D1062-D1067.

Landrum, M. J., J. M. Lee, G. R. Riley, W. Jang, W. S. Rubinstein, D. M. Church and D. R. Maglott (2014). "ClinVar: public archive of relationships among sequence variation and human phenotype." Nucleic Acids Res. 42(Database issue): D980-985.

Langmead, B. and S. L. Salzberg (2012). "Fast gapped-read alignment with Bowtie 2." Nature Methods 9(4): 357-359.

Layer, R. M., C. Chiang, A. R. Quinlan and I. M. Hall (2014). "LUMPY: a probabilistic framework for structural variant discovery." Genome Biol. 15(6): R84.

Lek, M., K. J. Karczewski, E. V. Minikel, K. E. Samocha, E. Banks, T. Fennell, A. H. O'Donnell-Luria, J. S. Ware, A. J. Hill, B. B. Cummings, T. Tukiainen, D. P. Birnbaum, J. A. Kosmicki, L. E. Duncan, K. Estrada, F. Zhao, J. Zou, E. Pierce-Hoffman, J. Berghout, D. N. Cooper, N. Deflaux, M. DePristo, R. Do, J. Flannick, M. Fromer, L. Gauthier, J. Goldstein, N. Gupta, D. Howrigan, A. Kiezun, M. I. Kurki, A. L. Moonshine, P. Natarajan, L. Orozco, G. M. Peloso, R. Poplin, M. A. Rivas, V. Ruano-Rubio, S. A. Rose, D. M. Ruderfer, K. Shakir, P. D. Stenson, C. Stevens, B. P. Thomas, G. Tiao, M. T. Tusie-Luna, B. Weisburd, H.-H. Won, D. Yu, D. M. Altshuler, D. Ardissino, M. Boehnke, J. Danesh, S. Donnelly, R. Elosua, J. C. Florez, S. B. Gabriel, G. Getz, S. J. Glatt, C. M. Hultman, S. Kathiresan, M. Laakso, S. McCarroll, M. I. McCarthy, D. McGovern, R. McPherson, B. M. Neale, A. Palotie, S. M. Purcell, D. Saleheen, J. M. Scharf, P. Sklar, P. F. Sullivan, J. Tuomilehto, M. T. Tsuang, H. C. Watkins, J. G. Wilson, M. J. Daly, D. G. MacArthur and Exome Aggregation Consortium (2016). "Analysis of protein-coding genetic variation in 60,706 humans." Nature 536(7616): 285-291.

Levy-Sakin, M., S. Pastor, Y. Mostovoy, L. Li, A. K. Y. Leung, J. McCaffrey, E. Young, E. T. Lam, A. R. Hastie, K. H. Y. Wong, C. Y. L. Chung, W. Ma, J. Sibert, R. Rajagopalan, N. Jin, E. Y. C. Chow, C. Chu, A. Poon, C. Lin, A. Naguib, W.-P. Wang, H. Cao, T.-F. Chan, K. Y. Yip, M. Xiao and P.-Y. Kwok (2019). "Genome maps across 26 human populations reveal population-specific patterns of structural variation." Nat. Commun. 10(1): 1025.

Li, H. (2011). "A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data." Bioinformatics 27(21): 2987-2993.

Li, H. (2013). "Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM." arXiv [q-bio.GN]. 1303.3997.

Li, H. and R. Durbin (2009). "Fast and accurate short read alignment with Burrows-Wheeler transform." Bioinformatics 25(14): 1754-1760.

Li, H., B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis, R. Durbin and 1000 Genome Project Data Processing Subgroup (2009). "The Sequence Alignment/Map format and SAMtools." Bioinformatics 25(16): 2078-2079.

Li, Y., N. D. Roberts, J. A. Wala, O. Shapira, S. E. Schumacher, K. Kumar, E. Khurana, S. Waszak, J. O. Korbel, J. E. Haber, M. Imielinski, PCAWG Structural Variation Working Group, J. Weischenfeldt, R. Beroukhim, P. J. Campbell and PCAWG Consortium (2020). "Patterns of somatic structural variation in human cancer genomes." Nature 578(7793): 112-121.

Lionel, A. C., G. Costain, N. Monfared, S. Walker, M. S. Reuter, S. M. Hosseini, B. Thiruvahindrapuram, D. Merico, R. Jobling, T. Nalpathamkalam, G. Pellecchia, W. W. L. Sung, Z. Wang, P. Bikangaga, C. Boelman, M. T. Carter, D. Cordeiro, C. Cytrynbaum, S. D. Dell, P. Dhir, J. J. Dowling, E. Heon, S. Hewson, L. Hiraki, M. Inbar-Feigenberg, R. Klatt, J. Kronick, R. M. Laxer, C. Licht, H. MacDonald, S. Mercimek-Andrews, R. Mendoza-Londono, T. Piscione, R. Schneider, A. Schulze, E. Silverman, K. Siriwardena, O. C. Snead, N. Sondheimer, J. Sutherland, A. Vincent, J. D. Wasserman, R. Weksberg, C. Shuman, C. Carew, M. J. Szego, R. Z. Hayeems, R. Basran, D. J. Stavropoulos, P. N. Ray, S. Bowdin, M. S. Meyn, R. D. Cohn, S. W. Scherer and C. R. Marshall (2018). "Improved diagnostic yield compared with targeted gene sequencing panels suggests a role for whole-genome sequencing as a first-tier genetic test." Genet. Med. 20(4): 435-443.

Lupiáñez, D. G., K. Kraft, V. Heinrich, P. Krawitz, F. Brancati, E. Klopocki, D. Horn, H. Kayserili, J. M. Opitz, R. Laxova, F. Santos-Simarro, B. Gilbert-Dussardier, L. Wittler, M. Borschiwer, S. A. Haas, M. Osterwalder, M. Franke, B. Timmermann, J. Hecht, M. Spielmann, A. Visel and S. Mundlos (2015). "Disruptions of topological chromatin domains cause pathogenic rewiring of gene-enhancer interactions." Cell 161(5): 1012-1025.

Lupiáñez, D. G., M. Spielmann and S. Mundlos (2016). "Breaking TADs: How Alterations of Chromatin Domains Result in Disease." Trends Genet. 32(4): 225-237.

Lynch, D. S., V. Chelban, J. Vandrovcova, A. Pittman, N. W. Wood and H. Houlden (2018). "GLS loss of function causes autosomal recessive spastic ataxia and optic atrophy." Annals of Clinical and Translational Neurology 5(2): 216-221.

MacDonald, J. R., R. Ziman, R. K. C. Yuen, L. Feuk and S. W. Scherer (2014). "The Database of Genomic Variants: a curated collection of structural variation in the human genome." Nucleic Acids Res. 42(Database issue): D986-992.

Mahmoud, M., N. Gobet, D. I. Cruz-Dávalos, N. Mounier, C. Dessimoz and F. J. Sedlazeck (2019). "Structural variant calling: the long and the short of it." Genome Biology 20(1).

Marco-Puche, G., S. Lois, J. Benítez and J. C. Trivino (2019). "RNA-Seq Perspectives to Improve Clinical Diagnosis." Front. Genet. 10: 1152.

Maroilley, T. and M. Tarailo-Graovac (2019). "Uncovering Missing Heritability in Rare Diseases." Genes 10(4).

Martani, A., L. D. Geneviève, C. Pauli-Magnus, S. McLennan and B. S. Elger (2019). "Regulating the Secondary Use of Data for Research: Arguments Against Genetic Exceptionalism." Front. Genet. 10: 1254.

Masson, J., M. Darmon, A. Conjard, N. Chuhma, N. Ropert, M. Thoby-Brisson, A. S. Foutz, S. Parrot, G. M. Miller, R. Jorisch, J. Polan, M. Hamon, R. Hen and S. Rayport (2006). "Mice lacking brain/kidney phosphate-activated glutaminase have impaired glutamatergic synaptic transmission, altered breathing, disorganized goal-directed behavior and die shortly after birth." J. Neurosci. 26(17): 4660-4671.

McKenna, A., M. Hanna, E. Banks, A. Sivachenko, K. Cibulskis, A. Kernytsky, K. Garimella, D. Altshuler, S. Gabriel, M. Daly and M. A. DePristo (2010). "The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data." Genome Res. 20(9): 1297-1303.

Mills, R. E., K. Walter, C. Stewart, R. E. Handsaker, K. Chen, C. Alkan, A. Abyzov, S. C. Yoon, K. Ye, R. K. Cheetham, A. Chinwalla, D. F. Conrad, Y. Fu, F. Grubert, I. Hajirasouliha, F. Hormozdiari, L. M. Iakoucheva, Z. Iqbal, S. Kang, J. M. Kidd, M. K. Konkel, J. Korn, E. Khurana, D. Kural, H. Y. K. Lam, J. Leng, R. Li, Y. Li, C.-Y. Lin, R. Luo, X. J. Mu, J. Nemesh, H. E. Peckham, T. Rausch, A. Scally, X. Shi, M. P. Stromberg, A. M. Stütz, A. E. Urban, J. A. Walker, J. Wu, Y. Zhang, Z. D. Zhang, M. A. Batzer, L. Ding, G. T. Marth, G. McVean, J. Sebat, M. Snyder, J. Wang, K. Ye, E. E. Eichler, M. B. Gerstein, M. E. Hurles, C. Lee, S. A. McCarroll, J. O. Korbel and 1000 Genomes Project (2011). "Mapping copy number variation by population-scale genome sequencing." Nature 470(7332): 59-65.

Mitsuhashi, S., M. C. Frith, T. Mizuguchi, S. Miyatake, T. Toyota, H. Adachi, Y. Oma, Y. Kino, H. Mitsuhashi and N. Matsumoto (2019). "Tandem-genotypes: robust detection of tandem repeat expansions from long DNA reads." Genome Biol. 20(1): 58.

Mitsuhashi, S. and N. Matsumoto (2020). "Long-read sequencing for rare human genetic diseases." Journal of Human Genetics 65(1): 11-19.

Mousavi, N., S. Shleizer-Burko, R. Yanicky and M. Gymrek (2019). "Profiling the genome-wide landscape of tandem repeat expansions." Nucleic Acids Res. 47(15): e90.

Mozzetta, C., E. Boyarchuk, J. Pontis and S. Ait-Si-Ali (2015). "Sound of silence: the properties and functions of repressive Lys methyltransferases." Nat. Rev. Mol. Cell Biol. 16(8): 499-513.

Mu, J. C., M. Mohiyuddin, J. Li, N. Bani Asadi, M. B. Gerstein, A. Abyzov, W. H. Wong and H. Y. K. Lam (2015). "VarSim: a high-fidelity simulation and validation framework for high-throughput genome sequencing with cancer applications." Bioinformatics 31(9): 1469-1471.

Muir, P., S. Li, S. Lou, D. Wang, D. J. Spakowicz, L. Salichos, J. Zhang, G. M. Weinstock, F. Isaacs, J. Rozowsky and M. Gerstein (2016). "The real cost of sequencing: scaling computation to keep pace with data generation." Genome Biol. 17: 53.

Nagasaki, M., J. Yasuda, F. Katsuoka, N. Nariai, K. Kojima, Y. Kawai, Y. Yamaguchi-Kabata, J. Yokozawa, I. Danjoh, S. Saito, Y. Sato, T. Mimori, K. Tsuda, R. Saito, X. Pan, S. Nishikawa, S. Ito, Y. Kuroki, O. Tanabe, N. Fuse, S. Kuriyama, H. Kiyomoto, A. Hozawa, N. Minegishi, J. Douglas Engel, K. Kinoshita, S. Kure, N. Yaegashi, ToMMo Japanese Reference Panel Project and M. Yamamoto (2015). "Rare variant discovery by deep whole-genome sequencing of 1,070 Japanese individuals." Nat. Commun. 6: 8018.

Nielsen, R., J. S. Paul, A. Albrechtsen and Y. S. Song (2011). "Genotype and SNP calling from next-generation sequencing data." Nature Reviews Genetics 12(6): 443-451.

Olney, K. C., S. M. Brotman, V. Valverde-Vesling, J. Andrews and M. A. Wilson "Aligning RNA-Seq reads to a sex chromosome complement informed reference genome increases ability to detect sex differences in gene expression." bioRxiv. 668376.

Paila, U., B. A. Chapman, R. Kirchner and A. R. Quinlan (2013). "GEMINI: integrative exploration of genetic variation and genome annotations." PLoS Comput. Biol. 9(7): e1003153.

Paten, B., A. Novak and D. Haussler (2014). "Mapping to a Reference Genome Structure." ArXiv e-prints: 1-26.

Paten, B., A. M. Novak, J. M. Eizenga and E. Garrison (2017). "Genome graphs and the evolution of genome inference." Genome Res. 27(5): 665-676.

Patro, R., S. M. Mount and C. Kingsford (2014). "Sailfish enables alignment-free isoform quantification from RNA-seq reads using lightweight algorithms." Nat. Biotechnol. 32(5): 462-464.

Payer, L. M., J. P. Steranka, D. Ardeljan, J. Walker, K. C. Fitzgerald, P. A. Calabresi, T. A. Cooper and K. H. Burns (2019). "Alu insertion variants alter mRNA splicing." Nucleic Acids Res. 47(1): 421-431.

Pedersen, B. S., R. M. Layer and A. R. Quinlan (2016). "Vcfanno: fast, flexible annotation of genetic variants." Genome Biol. 17(1): 118.

Pedersen, B. S. and A. R. Quinlan (2017). "Who's Who? Detecting and Resolving Sample Anomalies in Human DNA Sequencing Studies with Peddy." Am. J. Hum. Genet. 100(3): 406-413.

Petrovski, S., Q. Wang, E. L. Heinzen, A. S. Allen and D. B. Goldstein (2013). "Genic intolerance to functional variation and the interpretation of personal genomes." PLoS Genet. 9(8): e1003709.

Pollard, K. S., M. J. Hubisz, K. R. Rosenbloom and A. Siepel (2010). "Detection of nonneutral substitution rates on mammalian phylogenies." Genome Res. 20(1): 110-121.

Poplin, R., P.-C. Chang, D. Alexander, S. Schwartz, T. Colthurst, A. Ku, D. Newburger, J. Dijamco, N. Nguyen, P. T. Afshar, S. S. Gross, L. Dorfman, C. Y. McLean and M. A. DePristo (2018). "A universal SNP and small-indel variant caller using deep neural networks." Nat. Biotechnol. 36(10): 983-987.

Project MinE ALS Sequencing Consortium (2018). "Project MinE: study design and pilot analyses of a large-scale whole-genome sequencing study in amyotrophic lateral sclerosis." Eur. J. Hum. Genet. 26(10): 1537-1546.

Raczy, C., R. Petrovski, C. T. Saunders, I. Chorny, S. Kruglyak, E. H. Margulies, H.-Y. Chuang, M. Källberg, S. A. Kumar, A. Liao, K. M. Little, M. P. Strömberg and S. W. Tanner (2013). "Isaac: ultra-fast whole-genome secondary analysis on Illumina sequencing platforms." Bioinformatics 29(16): 2041-2043.

Rafehi, H., D. J. Szmulewicz, M. F. Bennett, N. L. M. Sobreira, K. Pope, K. R. Smith, G. Gillies, P. Diakumis, E. Dolzhenko, M. A. Eberle, M. G. Barcina, D. P. Breen, A. M. Chancellor, P. D. Cremer, M. B. Delatycki, B. L. Fogel, A. Hackett, G. M. Halmagyi, S. Kapetanovic, A. Lang, S. Mossman, W. Mu, P. Patrikios, S. L. Perlman, I. Rosemergy, E. Storey, S. R. D. Watson, M. A. Wilson, D. S. Zee, D. Valle, D. J. Amor, M. Bahlo and P. J. Lockhart (2019). "Bioinformatics-Based Identification of Expanded Repeats: A Non-reference Intronic Pentamer Expansion in RFC1 Causes CANVAS." Am. J. Hum. Genet. 105(1): 151-165.

Rahit, K. M. T. H. and M. Tarailo-Graovac (2020). "Genetic Modifiers and Rare Mendelian Disease." Genes 11(3).

Rajagopalan, R., J. R. Murrell, M. Luo and L. K. Conlin (2020). "A highly sensitive and specific workflow for detecting rare copy-number variants from exome sequencing data." Genome Med. 12(1): 14.

Raza, S. and A. Hall (2017). "Genomic medicine and data sharing." Br. Med. Bull. 123(1): 35-45.

Rehm, H. L., J. S. Berg, L. D. Brooks, C. D. Bustamante, J. P. Evans, M. J. Landrum, D. H. Ledbetter, D. R. Maglott, C. L. Martin, R. L. Nussbaum, S. E. Plon, E. M. Ramos, S. T. Sherry and M. S. Watson (2015). "ClinGen — The Clinical Genome Resource." New England Journal of Medicine 372(23): 2235-2242.

Renton, A. E., E. Majounie, A. Waite, J. Simón-Sánchez, S. Rollinson, J. R. Gibbs, J. C. Schymick, H. Laaksovirta, J. C. van Swieten, L. Myllykangas, H. Kalimo, A. Paetau, Y. Abramzon, A. M. Remes, A. Kaganovich, S. W. Scholz, J. Duckworth, J. Ding, D. W. Harmer, D. G. Hernandez, J. O. Johnson, K. Mok, M. Ryten, D. Trabzuni, R. J. Guerreiro, R. W. Orrell, J. Neal, A. Murray, J. Pearson, I. E. Jansen, D. Sondervan, H. Seelaar, D. Blake, K. Young, N. Halliwell, J. B. Callister, G. Toulson, A. Richardson, A. Gerhard, J. Snowden, D. Mann, D. Neary, M. A. Nalls, T. Peuralinna, L. Jansson, V.-M. Isoviita, A.-L. Kaivorinne, M. Hölttä-Vuori, E. Ikonen, R. Sulkava, M. Benatar, J. Wuu, A. Chiò, G. Restagno, G. Borghero, M. Sabatelli, ITALSGEN Consortium, D. Heckerman, E. Rogaeva, L. Zinman, J. D. Rothstein, M. Sendtner, C. Drepper, E. E. Eichler, C. Alkan, Z. Abdullaev, S. D. Pack, A. Dutra, E. Pak, J. Hardy, A. Singleton, N. M. Williams, P. Heutink, S. Pickering-Brown, H. R. Morris, P. J. Tienari and B. J. Traynor (2011). "A hexanucleotide repeat expansion in C9ORF72 is the cause of chromosome 9p21-linked ALS-FTD." Neuron 72(2): 257-268.

Rentzsch, P., D. Witten, G. M. Cooper, J. Shendure and M. Kircher (2019). "CADD: predicting the deleteriousness of variants throughout the human genome." Nucleic Acids Res. 47(D1): D886-D894.

Richmond, P. A., A. M. Kaye, G. J. Kounkou, T. V. Av-Shalom and W. W. Wasserman (2020). "Demonstrating the utility of flexible sequence queries against indexed short reads with FlexTyper." bioRxiv. 2020.03.02.973750.

Richmond, P. A., F. van der Kloet, F. M. Vaz, D. Lin, A. Uzozie, E. J. Graham, M. S. Kobor, S. Mostafavi, P. D. Moerland, P. F. Lange, A. H. C. van Kampen, W. W. Wasserman, M. Engelen, S. Kemp and C. van Karnebeek (2020). "Multi-omic approach to identify phenotypic modifiers underlying cerebral demyelination in X-linked adrenoleukodystrophy." medRxiv: 2020.03.19.20035063.

Rimmer, A., H. Phan, I. Mathieson, Z. Iqbal, S. R. F. Twigg, WGS500 Consortium, A. O. M. Wilkie, G. McVean and G. Lunter (2014). "Integrating mapping-, assembly- and haplotype-based approaches for calling variants in clinical sequencing applications." Nat. Genet. 46(8): 912-918.

Ritchie, G. R. and P. Flicek (2014). "Computational approaches to interpreting genomic sequence variation." Genome Med. 6(10): 87.

Robinson, J. T., H. Thorvaldsdóttir, W. Winckler, M. Guttman, E. S. Lander, G. Getz and J. P. Mesirov (2011). "Integrative genomics viewer." Nat. Biotechnol. 29(1): 24-26.

Robinson, P. N., S. Köhler, S. Bauer, D. Seelow, D. Horn and S. Mundlos (2008). "The Human Phenotype Ontology: a tool for annotating and analyzing human hereditary disease." Am. J. Hum. Genet. 83(5): 610-615.

Robinson, P. N., S. Köhler, A. Oellrich, Sanger Mouse Genetics Project, K. Wang, C. J. Mungall, S. E. Lewis, N. Washington, S. Bauer, D. Seelow, P. Krawitz, C. Gilissen, M. Haendel and D. Smedley (2014). "Improved exome prioritization of disease genes through cross-species phenotype comparison." Genome Res. 24(2): 340-348.

Roeck, A. D., W. De Coster, L. Bossaerts, R. Cacace, T. De Pooter, J. Van Dongen, S. D’Hert, P. De Rijk, M. Strazisar, C. Van Broeckhoven and K. Sleegers (2019). "NanoSatellite: accurate characterization of expanded tandem repeat length and sequence through whole genome long-read sequencing on PromethION." Genome Biology 20(1).

Roller, E., S. Ivakhno, S. Lee, T. Royce and S. Tanner (2016). "Canvas: versatile and scalable detection of copy number variants." Bioinformatics 32(15): 2375-2377.

Sadedin, S. P. and A. Oshlack (2019). "Bazam: a rapid method for read extraction and realignment of high-throughput sequencing data." Genome Biol. 20(1): 78.

Sandelin, A., W. Alkema, P. Engström, W. W. Wasserman and B. Lenhard (2004). "JASPAR: an open-access database for eukaryotic transcription factor binding profiles." Nucleic Acids Res. 32(Database issue): D91-94.

Sarwal, V., S. Niehus, R. Ayyala, S. Chang, A. Lu, N. Darci-Maher, R. Littman, E. Wesel, J. Castellanos, R. Chikka, M. G. Distler, E. Eskin, J. Flint and S. Mangul (2020). "A comprehensive benchmarking of WGS-based structural variant callers." bioRxiv. 2020.04.16.045120.

Sato, N., T. Amino, K. Kobayashi, S. Asakawa, T. Ishiguro, T. Tsunemi, M. Takahashi, T. Matsuura, K. M. Flanigan, S. Iwasaki, F. Ishino, Y. Saito, S. Murayama, M. Yoshida, Y. Hashizume, Y. Takahashi, S. Tsuji, N. Shimizu, T. Toda, K. Ishikawa and H. Mizusawa (2009). "Spinocerebellar ataxia type 31 is associated with "inserted" penta-nucleotide repeats containing (TGGAA)n." Am. J. Hum. Genet. 85(5): 544-557.

Sedlazeck, F. J., H. Lee, C. A. Darby and M. C. Schatz (2018). "Piercing the dark matter: bioinformatics of long-range sequencing and mapping." Nat. Rev. Genet. 19(6): 329-346.

Seixas, A. I., J. R. Loureiro, C. Costa, A. Ordóñez-Ugalde, H. Marcelino, C. L. Oliveira, J. L. Loureiro, A. Dhingra, E. Brandão, V. T. Cruz, A. Timóteo, B. Quintáns, G. A. Rouleau, P. Rizzu, Á. Carracedo, J. Bessa, P. Heutink, J. Sequeiros, M. J. Sobrido, P. Coutinho and I. Silveira (2017). "A Pentanucleotide ATTTC Repeat Insertion in the Non-coding Region of DAB1, Mapping to SCA37, Causes Spinocerebellar Ataxia." Am. J. Hum. Genet. 101(1): 87-103.

Shajii, A., D. Yorukoglu, Y. William Yu and B. Berger (2016). "Fast genotyping of known SNPs through approximate k-mer matching." Bioinformatics 32(17): i538-i544.

Shen, F. and J. M. Kidd (2020). "Rapid, Paralog-Sensitive CNV Analysis of 2457 Human Genomes Using QuicK-mer2." Genes 11(2): 141.

Sherman, R. M., J. Forman, V. Antonescu, D. Puiu, M. Daya, N. Rafaels, M. P. Boorgula, S. Chavan, C. Vergara, V. E. Ortega, A. M. Levin, C. Eng, M. Yazdanbakhsh, J. G. Wilson, J. Marrugo, L. A. Lange, L. K. Williams, H. Watson, L. B. Ware, C. O. Olopade, O. Olopade, R. R. Oliveira, C. Ober, D. L. Nicolae, D. A. Meyers, A. Mayorga, J. Knight-Madden, T. Hartert, N. N. Hansel, M. G. Foreman, J. G. Ford, M. U. Faruque, G. M. Dunston, L. Caraballo, E. G. Burchard, E. R. Bleecker, M. I. Araujo, E. F. Herrera-Paz, M. Campbell, C. Foster, M. A. Taub, T. H. Beaty, I. Ruczinski, R. A. Mathias, K. C. Barnes and S. L. Salzberg (2019). "Assembly of a pan-genome from deep sequencing of 910 humans of African descent." Nat. Genet. 51(1): 30-35.

Shi, W., O. Fornes and W. W. Wasserman (2019). "Gene expression models based on transcription factor binding events confer insight into functional cis-regulatory variants." Bioinformatics 35(15): 2610-2617.

Sikorska, N. and T. Sexton (2020). "Defining Functionally Relevant Spatial Chromatin Domains: It is a TAD Complicated." J. Mol. Biol. 432(3): 653-664.

Smedley, D., M. Schubach, J. O. B. Jacobsen, S. Köhler, T. Zemojtel, M. Spielmann, M. Jäger, H. Hochheiser, N. L. Washington, J. A. McMurry, M. A. Haendel, C. J. Mungall, S. E. Lewis, T. Groza, G. Valentini and P. N. Robinson (2016). "A Whole-Genome Analysis Framework for Effective Identification of Pathogenic Regulatory Variants in Mendelian Disease." Am. J. Hum. Genet. 99(3): 595-606.

Smith, E. and A. Shilatifard (2014). "Enhancer biology and enhanceropathies." Nat. Struct. Mol. Biol. 21(3): 210-219.

Stavropoulos, D. J., D. Merico, R. Jobling, S. Bowdin, N. Monfared, B. Thiruvahindrapuram, T. Nalpathamkalam, G. Pellecchia, R. K. C. Yuen, M. J. Szego, R. Z. Hayeems, R. Z. Shaul, M. Brudno, M. Girdea, B. Frey, B. Alipanahi, S. Ahmed, R. Babul-Hirji, R. B. Porras, M. T. Carter, L. Chad, A. Chaudhry, D. Chitayat, S. J. Doust, C. Cytrynbaum, L. Dupuis, R. Ejaz, L. Fishman, A. Guerin, B. Hashemi, M. Helal, S. Hewson, M. Inbar-Feigenberg, P. Kannu, N. Karp, R. Kim, J. Kronick, E. Liston, H. MacDonald, S. Mercimek-Mahmutoglu, R. Mendoza-Londono, E. Nasr, G. Nimmo, N. Parkinson, N. Quercia, J. Raiman, M. Roifman, A. Schulze, A. Shugar, C. Shuman, P. Sinajon, K. Siriwardena, R. Weksberg, G. Yoon, C. Carew, R. Erickson, R. A. Leach, R. Klein, P. N. Ray, M. S. Meyn, S. W. Scherer, R. D. Cohn and C. R. Marshall (2016). "Whole Genome Sequencing Expands Diagnostic Utility and Improves Clinical Management in Pediatric Medicine." NPJ Genom Med 1.

Subramanian, S., R. K. Mishra and L. Singh (2003). "Genome-wide analysis of microsatellite repeats in humans: their abundance and density in specific genomic regions." Genome Biol. 4(2): R13.

Sudmant, P. H., T. Rausch, E. J. Gardner, R. E. Handsaker, A. Abyzov, J. Huddleston, Y. Zhang, K. Ye, G. Jun, M. H.-Y. Fritz, M. K. Konkel, A. Malhotra, A. M. Stütz, X. Shi, F. P. Casale, J. Chen, F. Hormozdiari, G. Dayama, K. Chen, M. Malig, M. J. P. Chaisson, K. Walter, S. Meiers, S. Kashin, E. Garrison, A. Auton, H. Y. K. Lam, X. J. Mu, C. Alkan, D. Antaki, T. Bae, E. Cerveira, P. Chines, Z. Chong, L. Clarke, E. Dal, L. Ding, S. Emery, X. Fan, M. Gujral, F. Kahveci, J. M. Kidd, Y. Kong, E.-W. Lameijer, S. McCarthy, P. Flicek, R. A. Gibbs, G. Marth, C. E. Mason, A. Menelaou, D. M. Muzny, B. J. Nelson, A. Noor, N. F. Parrish, M. Pendleton, A. Quitadamo, B. Raeder, E. E. Schadt, M. Romanovitch, A. Schlattl, R. Sebra, A. A. Shabalin, A. Untergasser, J. A. Walker, M. Wang, F. Yu, C. Zhang, J. Zhang, X. Zheng-Bradley, W. Zhou, T. Zichner, J. Sebat, M. A. Batzer, S. A. McCarroll, 1000 Genomes Project Consortium, R. E. Mills, M. B. Gerstein, A. Bashir, O. Stegle, S. E. Devine, C. Lee, E. E. Eichler and J. O. Korbel (2015). "An integrated map of structural variation in 2,504 human genomes." Nature 526(7571): 75-81.

Sun, C. and P. Medvedev (2019). "Toward fast and accurate SNP genotyping from whole genome sequencing data for bedside diagnostics." Bioinformatics 35(3): 415-420.

Sun, J. H., L. Zhou, D. J. Emerson, S. A. Phyo, K. R. Titus, W. Gong, T. G. Gilgenast, J. A. Beagan, B. L. Davidson, F. Tassone and J. E. Phillips-Cremins (2018). "Disease-Associated Short Tandem Repeats Co-localize with Chromatin Domain Boundaries." Cell 175(1): 224-238.e15.

Sundaram, L., H. Gao, S. R. Padigepati, J. F. McRae, Y. Li, J. A. Kosmicki, N. Fritzilas, J. Hakenberg, A. Dutta, J. Shon, J. Xu, S. Batzoglou, X. Li and K. K.-H. Farh (2018). "Predicting the clinical impact of human mutation with deep neural networks." Nat. Genet. 50(8): 1161-1170.

Takeda, K., A. Ishida, K. Takahashi and T. Ueda (2012). "Synaptic vesicles are capable of synthesizing the VGLUT substrate glutamate from α-ketoglutarate for vesicular loading." J. Neurochem. 121(2): 184-196.

Tang, H., E. F. Kirkness, C. Lippert, W. H. Biggs, M. Fabani, E. Guzman, S. Ramakrishnan, V. Lavrenko, B. Kakaradov, C. Hou, B. Hicks, D. Heckerman, F. J. Och, C. T. Caskey, J. C. Venter and A. Telenti (2017). "Profiling of Short-Tandem-Repeat Disease Alleles in 12,632 Human Whole Genomes." Am. J. Hum. Genet. 101(5): 700-715.

Tankard, R. M., M. F. Bennett, P. Degorski, M. B. Delatycki, P. J. Lockhart and M. Bahlo (2018). "Detecting Expansions of Tandem Repeats in Cohorts Sequenced with Short-Read Sequencing Data." Am. J. Hum. Genet. 103(6): 858-873.

Tarailo-Graovac, M. and N. Chen (2009). "Using RepeatMasker to identify repetitive elements in genomic sequences." Curr. Protoc. Bioinformatics Chapter 4: Unit 4.10.

Tarailo-Graovac, M., B. I. Drögemöller, W. W. Wasserman, C. J. D. Ross, A. M. W. van den Ouweland, N. Darin, G. Kollberg, C. D. M. van Karnebeek and M. Blomqvist (2017). "Identification of a large intronic transposal insertion in SLC17A5 causing sialic acid storage disease." Orphanet J. Rare Dis. 12(1): 28.

Tarailo-Graovac, M., C. Shyr, C. J. Ross, G. A. Horvath, R. Salvarinova, X. C. Ye, L.-H. Zhang, A. P. Bhavsar, J. J. Y. Lee, B. I. Drögemöller, M. Abdelsayed, M. Alfadhel, L. Armstrong, M. R. Baumgartner, P. Burda, M. B. Connolly, J. Cameron, M. Demos, T. Dewan, J. Dionne, A. M. Evans, J. M. Friedman, I. Garber, S. Lewis, J. Ling, R. Mandal, A. Mattman, M. McKinnon, A. Michoulas, D. Metzger, O. A. Ogunbayo, B. Rakic, J. Rozmus, P. Ruben, B. Sayson, S. Santra, K. R. Schultz, K. Selby, P. Shekel, S. Sirrs, C. Skrypnyk, A. Superti-Furga, S. E. Turvey, M. I. Van Allen, D. Wishart, J. Wu, J. Wu, D. Zafeiriou, L. Kluijtmans, R. A. Wevers, P. Eydoux, A. M. Lehman, H. Vallance, S. Stockler-Ipsiroglu, G. Sinclair, W. W. Wasserman and C. D. van Karnebeek (2016). "Exome Sequencing and the Management of Neurometabolic Disorders." N. Engl. J. Med. 374(23): 2246-2255.

Telenti, A., L. C. T. Pierce, W. H. Biggs, J. di Iulio, E. H. M. Wong, M. M. Fabani, E. F. Kirkness, A. Moustafa, N. Shah, C. Xie, S. C. Brewerton, N. Bulsara, C. Garner, G. Metzker, E. Sandoval, B. A. Perkins, F. J. Och, Y. Turpaz and J. C. Venter (2016). "Deep sequencing of 10,000 human genomes." Proc. Natl. Acad. Sci. U. S. A. 113(42): 11901-11906.

Thul, P. J., L. Åkesson, M. Wiking, D. Mahdessian, A. Geladaki, H. Ait Blal, T. Alm, A. Asplund, L. Björk, L. M. Breckels, A. Bäckström, F. Danielsson, L. Fagerberg, J. Fall, L. Gatto, C. Gnann, S. Hober, M. Hjelmare, F. Johansson, S. Lee, C. Lindskog, J. Mulder, C. M. Mulvey, P. Nilsson, P. Oksvold, J. Rockberg, R. Schutten, J. M. Schwenk, Å. Sivertsson, E. Sjöstedt, M. Skogs, C. Stadler, D. P. Sullivan, H. Tegel, C. Winsnes, C. Zhang, M. Zwahlen, A. Mardinoglu, F. Pontén, K. von Feilitzen, K. S. Lilley, M. Uhlén and E. Lundberg (2017). "A subcellular map of the human proteome." Science 356(6340).

Trost, B., S. Walker, Z. Wang, B. Thiruvahindrapuram, J. R. MacDonald, W. W. L. Sung, S. L. Pereira, J. Whitney, A. J. S. Chan, G. Pellecchia, M. S. Reuter, S. Lok, R. K. C. Yuen, C. R. Marshall, D. Merico and S. W. Scherer (2018). "A Comprehensive Workflow for Read Depth-Based Identification of Copy-Number Variation from Whole-Genome Sequence Data." Am. J. Hum. Genet. 102(1): 142-155.

van Kuilenburg, A. B. P., M. Tarailo-Graovac, P. A. Richmond, B. I. Drögemöller, M. A. Pouladi, R. Leen, K. Brand-Arzamendi, D. Dobritzsch, E. Dolzhenko, M. A. Eberle, B. Hayward, M. J. Jones, F. Karbassi, M. S. Kobor, J. Koster, D. Kumari, M. Li, J. MacIsaac, C. McDonald, J. Meijer, C. Nguyen, I.-S. Rajan-Babu, S. W. Scherer, B. Sim, B. Trost, L. A. Tseng, M. Turkenburg, J. J. F. A. van Vugt, J. H. Veldink, J. S. Walia, Y. Wang, M. van Weeghel, G. E. B. Wright, X. Xu, R. K. C. Yuen, J. Zhang, C. J. Ross, W. W. Wasserman, M. T. Geraghty, S. Santra, R. J. A. Wanders, X.-Y. Wen, H. R. Waterham, K. Usdin and C. D. M. van Karnebeek (2019). "Glutaminase Deficiency Caused by Short Tandem Repeat Expansion in GLS." N. Engl. J. Med. 380(15): 1433-1441.

Wang, Y., Y. Huang, L. Zhao, Y. Li and J. Zheng (2014). "Glutaminase 1 is essential for the differentiation, proliferation, and survival of human neural progenitor cells." Stem Cells Dev. 23(22): 2782-2790.

Wasserman, W. W. and A. Sandelin (2004). "Applied bioinformatics for the identification of regulatory elements." Nat. Rev. Genet. 5(4): 276-287.

Webster, T. H., M. Couse, B. M. Grande, E. Karlins, T. N. Phung, P. A. Richmond, W. Whitford and M. A. Wilson (2019). "Identifying, understanding, and correcting technical artifacts on the sex chromosomes in next-generation sequencing data." Gigascience 8(7).

Wenger, A. M., P. Peluso, W. J. Rowell, P.-C. Chang, R. J. Hall, G. T. Concepcion, J. Ebler, A. Fungtammasan, A. Kolesnikov, N. D. Olson, A. Töpfer, M. Alonge, M. Mahmoud, Y. Qian, C.-S. Chin, A. M. Phillippy, M. C. Schatz, G. Myers, M. A. DePristo, J. Ruan, T. Marschall, F. J. Sedlazeck, J. M. Zook, H. Li, S. Koren, A. Carroll, D. R. Rank and M. W. Hunkapiller (2019). "Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome." Nat. Biotechnol. 37(10): 1155-1162.

Wirth, B. (2000). "An update of the mutation spectrum of the survival motor neuron gene (SMN1) in autosomal recessive spinal muscular atrophy (SMA)." Hum. Mutat. 15(3): 228-237.

Wise, A. L., T. A. Manolio, G. A. Mensah, J. F. Peterson, D. M. Roden, C. Tamburro, M. S. Williams and E. D. Green (2019). "Genomic medicine for undiagnosed diseases." Lancet 394(10197): 533-540.

Wood, D. E., J. Lu and B. Langmead (2019). "Improved metagenomic analysis with Kraken 2." Genome Biol. 20(1): 257.

Wright, C. F., D. R. FitzPatrick and H. V. Firth (2018). "Paediatric genomics: diagnosing rare disease in children." Nature Reviews Genetics 19(5): 253-268.

Xia, Y., Y. Liu, M. Deng and R. Xi (2019). "Detecting virus integration sites based on multiple related sequencing data by VirTect." BMC Med. Genomics 12(Suppl 1): 19.

Yang, X., W.-P. Lee, K. Ye and C. Lee (2019). "One reference genome is not enough." Genome Biol. 20(1): 104.

Ye, X., N. M. Roslin, A. D. Paterson, C. Lyons, V. Pegado, P. A. Richmond, C. Shyr, O. Fornes, X. Han, M. Higginson, C. Ross, D. Giaschi, C. Y. Gregory-Evans, M. Patel and W. W. Wasserman (2020). "Linkage analysis identifies an isolated strabismus locus at 14q12 overlapping with FOXG1 syndrome region." medRxiv: 2020.04.24.20077586.

Yeetong, P., M. Pongpanich, C. Srichomthong, A. Assawapitaksakul, V. Shotelersuk, N. Tantirukdham, C. Chunharas, K. Suphapeetiporn and V. Shotelersuk (2019). "TTTCA repeat insertions in an intron of YEATS2 in benign adult familial myoclonic epilepsy type 4." Brain 142(11): 3360-3366.

Yue, J.-X. and G. Liti (2019). "simuG: a general-purpose genome simulator." Bioinformatics 35(21): 4442-4444.

Zhang, L., W. Bai, N. Yuan and Z. Du (2019). "Comprehensively benchmarking applications for detecting copy number variation." PLoS Comput. Biol. 15(5): e1007069.

Zhou, J., C. L. Theesfeld, K. Yao, K. M. Chen, A. K. Wong and O. G. Troyanskaya (2018). "Deep learning sequence-based ab initio prediction of variant effects on expression and disease risk." Nat. Genet. 50(8): 1171-1179.

Zhou, J. and O. G. Troyanskaya (2015). "Predicting effects of noncoding variants with deep learning–based sequence model." Nature Methods 12(10): 931-934.

Zook, J. M., B. Chapman, J. Wang, D. Mittelman, O. Hofmann, W. Hide and M. Salit (2014). "Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls." Nat. Biotechnol. 32(3): 246-251.
