Genome-wide Investigation of Short Tandem Repeat Variation in Autism Spectrum Disorder

by

Charlotte Michelle Nguyen

A thesis submitted in conformity with the requirements for the degree of Master of Science Graduate Department of Molecular Genetics University of Toronto

© Copyright 2019 by Charlotte Michelle Nguyen

Abstract

Genome-wide Investigation of Short Tandem Repeat Variation in Autism Spectrum Disorder

Charlotte Michelle Nguyen Master of Science Graduate Department of Molecular Genetics University of Toronto 2019

Short tandem repeats (STRs) may contribute to the genetic etiology of autism spectrum disorder (ASD), since several pathogenic STRs are ASD risk factors. Screening for STR variation may uncover novel risk loci; however, this was previously hindered by the difficulty of aligning repetitive sequences in whole genome sequencing (WGS) data. Recently, several tools have been developed to genotype STRs from WGS data. In this study, an STR genotyping pipeline was created to perform a genome-wide screen for repeat variation in an ASD genomic database, MSSNG. This work uncovered 4925 STRs, and variation was identified in ASD-risk loci, repeat disorder loci, and other pathogenic loci. In particular, repeat expansions upstream of CRYBB2 were seen significantly more frequently in individuals with ASD than in a control population. These findings suggest that this STR genotyping pipeline can detect repeat variation in ASD and may lead to the discovery of novel candidate loci.

Acknowledgements

I am greatly appreciative of the support of my supervisors, Dr. Stephen Scherer and Dr. Ryan Yuen. When I first joined Dr. Scherer’s lab, I felt nervous about how I would fare in a group that was publishing ground-breaking research, led by a person who has a legacy and a Wikipedia page. I could not have predicted how kind and generous my peers and Dr. Scherer would be, and how their influence would shape me to become a better scientist and person. Throughout my time, Dr. Scherer went out of his way to give his students once-in-a-lifetime experiences. I will always remember his personal collection of art, seeing Supertramp perform for World Autism Day, and meeting with Canada’s Chief Science Advisor, Dr. Mona Nemer. I am very fortunate and grateful for Dr. Yuen’s guidance. I have spent so many hours in his office talking about everything from our research to the best horror movies. I will always cherish these conversations that carried me throughout graduate school, and the advice that will follow me throughout my life. Dr. Yuen was so patient and supportive in helping me carve a path in my research and career. None of the highs would have been possible without Dr. Yuen, and I could not have gotten through the lows without him either. Dr. Scherer and Dr. Yuen pushed me to perform the best research I could by providing me with superstar mentors and a world-class facility, and by modeling this in their own careers. I am so grateful for all the members of TCAG. This project was a behemoth, and there are so many players to thank. I would like to especially thank Dr. Brett Trost and Dr. Robbie Davies for their mentorship. I learned from the best, and my research and career are forever enriched because of their support, encouragement, and ingenuity. They showed me how to turn letters into beautiful lines of code. I would also like to thank Dr. Zhuozhi Wang, Bhooma Thiruvahindrapuram, and Omar Hamdan for their technical excellence and guidance.
To this day, I still do not understand how they do what they do. Thank you to my fellow graduate students, Lia D’Abate, Ted Higginbotham, and Ada Chan, who so quickly became my friends. The conversations I shared with them are some of my fondest memories of the lab. Thank you for letting me cry sometimes and making me laugh all the time. I am also especially thankful to all the members of the Yuen Lab. With our physical separation and differences in schedules, sometimes we were in different worlds, but every time we visited one another I was reminded about how sweet and amazing they all were. Whether it was in lab meeting or lunch time, I was continuously surprised and impressed by their skills in and out of lab. I would also like to thank my committee members, Dr. Quaid Morris and Dr. Christopher Pearson. It was a blessing and privilege to be able to learn from and be guided by world-class experts. As well, thank you to the members of Dr. Pearson’s lab who have collaborated closely on this project. Your technical expertise and in-depth knowledge of short tandem repeats are mind-blowing. Thank you to my families. To my family-family: mum, dad, Michael, and Anh Quyen, I love you all very much. I am who I am because of you. Thank you for all the support you have given me. To my friend-family: the friends I have made in this department are some of the best friends I have. I feel so lucky to have been part of the boys and bioinfos. There is so much to say, and so many memories to reminisce on, so I will leave it at this: you all rock.

Contents

Abstract

Acknowledgements

List of Tables

List of Figures

List of Abbreviations

1 Introduction
    1.1 Clinical presentation of Autism Spectrum Disorder (ASD)
    1.2 Genetic etiology of ASD established from family-based studies
        1.2.1 The Broader Autism Phenotype (BAP)
    1.3 Genetic variants in ASD
    1.4 Whole Genome Sequencing (WGS)
        1.4.1 Short-read WGS
        1.4.2 Long-read WGS
        1.4.3 MSSNG ASD WGS database
    1.5 Short Tandem Repeats (STR)
        1.5.1 Tandem Repeat Disorders (TRD)
        1.5.2 The role of STRs in common polygenic disorders
        1.5.3 The role of STRs in ASD
        1.5.4 Genetic anticipation in ASD
    1.6 STR genotyping
        1.6.1 Traditional STR detection methods
        1.6.2 STR genotyping algorithms
    1.7 Project rationale

2 Methods: Development and Implementation
    2.1 MSSNG ASD database
    2.2 STR genotyping pipeline development
        2.2.1 Expansion Hunter de novo (EHdn)
        2.2.2 STR Finder
        2.2.3 Expansion Hunter (EH)
        2.2.4 STR genotyping pipeline
    2.3 Pipeline validation using long-read sequencing data
        2.3.1 Comparison of EH tools to other STR callers
        2.3.2 Pipeline validation using twin data
    2.4 Outlier detection method for STR expansions
    2.5 Statistical analysis

3 Results
    3.1 Genome-wide STR genotyping in MSSNG
        3.1.1 STR calling using EHdn
        3.1.2 Identification of reference STRs using STR Finder
        3.1.3 Targeted STR genotyping using EH
    3.2 Identifying potentially clinically relevant STR variation in ASD
        3.2.1 STR variation in MSSNG compared to a control population
    3.3 Validation of STR genotyping pipeline
        3.3.1 Validation of EHdn using long-read sequencing data
        3.3.2 Validation using other STR genotyping tools
        3.3.3 Monozygotic vs. dizygotic twin concordance in EHdn

4 Discussion
    4.1 Overview of results
    4.2 Current study limitations
    4.3 Overall summary and impact of work

5 Future Directions
    5.1 Application of STR genotyping pipeline to other ASD genomic databases
    5.2 Application of STR genotyping pipeline to other disorders

Bibliography

List of Tables

1 Candidate loci detected in MSSNG
2 Outliers identified in MSSNG and 1000 Genomes
3 Validation of genome-wide STR genotyping tools in long-read sequencing data

List of Figures

1 Expansion Hunter de novo and Expansion Hunter read-based approaches
2 STR Finder
3 STR Genotyping Pipeline
4 STR loci detected in MSSNG samples using Expansion Hunter de novo
5 STR motif identified in MSSNG using Expansion Hunter de novo
6 MSSNG allele lengths of the GA repeat upstream of CRYBB2
7 MSSNG pedigrees for GA repeat upstream of CRYBB2
8 Validation rate of EHdn calls supported by ...
9 Comparison of genome-wide STR genotyping tools

List of Abbreviations

ADDM Autism and Developmental Disabilities Monitoring

ADHD Attention Deficit Hyperactivity Disorder

ADL Activities of Daily Life

ASD Autism Spectrum Disorder

BAP Broader Autism Phenotype

BWA Burrows-Wheeler Alignment

CCS Circular Consensus

CNV Copy Number Variant

DM1 Myotonic Dystrophy 1

DSM Diagnostic and Statistical Manual of Mental Disorders

DZ Dizygotic

EH Expansion Hunter

EHDN Expansion Hunter de novo

FRDA Friedreich Ataxia

FXTAS Fragile X-associated Tremor/Ataxia Syndrome

GRCh Genome Reference Consortium Build

ID Intellectual Disability

IRR In-Repeat Read

LGD Likely Gene Disrupting

MZ Monozygotic

NDD Neurodevelopmental Disorder

NF1 Neurofibromatosis

PacBio Pacific Biosciences

PCR Polymerase Chain Reaction

pLI Probability of Loss-of-Function Intolerant Rate

RAN Repeat-Associated Non-ATG

SMRT Single-Molecule Real-Time

SNV Single Nucleotide Variant

STR Short Tandem Repeat

TNR Trinucleotide Repeat

TP Triplet-repeat Primed

TR Tandem Repeat

TRD Tandem Repeat Disorder

UTR Untranslated Region

WGS Whole Genome Sequencing

XLID X-Linked Intellectual Disability


Chapter 1

Introduction

1.1 Clinical presentation of Autism Spectrum Disorder (ASD)

Autism spectrum disorder (ASD) is a lifelong neurodevelopmental condition characterized by deficits in social communication and restricted, repetitive behaviours and interests (DSM-V, 2013). Onset of the disorder occurs during childhood, with symptoms usually appearing before the age of 3 (Ozonoff et al., 2008). Diagnosis is typically made using the Diagnostic and Statistical Manual of Mental Disorders (DSM) criteria for ASD. As per the fifth edition of the DSM, the term “ASD” encompasses a continuum of symptom presentation that was previously categorized as the distinct disorders of autistic disorder, Asperger disorder, childhood disintegrative disorder and pervasive developmental disorder not otherwise specified (DSM-V, 2013).

ASD ranges broadly in phenotypic presentation and functional impact. For example, the core symptoms of impaired social communication and a restricted range of activity and interests vary in severity among affected individuals (Georgiades et al., 2013). Intellectual ability is another example of a variably expressed phenotype; 31% of children with ASD have intellectual disability (ID), 25% place in the borderline range, and 44% have average to above average intellectual ability (Surveillance Summaries, 2018). Moreover, ASD frequently co-occurs with other neurologic, developmental, psychiatric, genetic and chromosomal disorders. Epilepsy is a frequent comorbidity, affecting ~22% of ASD individuals with ID and 8% of those with average intellect (Amiet et al., 2008). In one cohort, attention deficit hyperactivity disorder (ADHD) was reported in 21.3% of children with ASD, while other studies identified rates between 40-75% (Levy et al., 2010). Other common co-diagnoses include obsessive compulsive disorder, oppositional defiant disorder, mood disorder and anxiety disorder (Levy et al., 2010).

The Centers for Disease Control’s Autism and Developmental Disabilities Monitoring (ADDM) Network, a surveillance system for ASD among children aged 8 years across different sites in the United States, estimates the prevalence of the disorder to be 1 in 59 children (Baio, 2018). This rate is higher than estimates from previous years (2008: 1 in 88, 2010: 1 in 68, 2012: 1 in 69); however, this discrepancy may arise from changes in diagnostic criteria and in the surveillance sites used for monitoring, and/or a true increase in ASD prevalence (Surveillance Summaries, 2018). A gender bias has historically been reported for ASD, and recent large-scale epidemiological studies continue to identify a 4:1 male to female bias (Baio, 2018, Matheis et al., 2019, Zwaigenbaum et al., 2012). Several mechanisms for this phenomenon have been proposed, such as the involvement of genes on the X chromosome, the impact of testosterone on brain development, and epigenetic effects on paternally-imprinted X-linked genes (Zwaigenbaum et al., 2012).

1.2 Genetic etiology of ASD established from family-based studies

Family studies have established that the siblings of individuals with ASD have a greater risk of developing the disorder than the general population, indicating that genetic factors are involved in ASD susceptibility. For example, one study that investigated ASD onset among a cohort of infants with an older affected sibling identified a recurrence rate of 18.7% (Ozonoff et al., 2011). Recurrence has also been studied among half and full siblings, and multiple studies have reported greater recurrence between more closely related siblings. For instance, a study of over 5000 families with children affected with ASD observed recurrence to be 9.5% in full siblings compared to 5.2% in half siblings. Furthermore, a population-based cohort study of ~1.5 million children found recurrence of ASD to be up to 7.5% in full siblings and 2.4% in half siblings (Grønborg et al., 2013).

Twin studies affirm a strong genetic component in ASD liability. Several studies between 1977 and 1995 investigated the agreement of narrowly defined autism diagnoses in pairs of monozygotic (MZ) versus dizygotic (DZ) twins and found concordance rates to be between 36-96% in MZ twins and 0-24% in DZ twins, implicating a genetic basis for autism (Bailey et al., 1995, Folstein and Rutter, 1977, Ritvo et al., 1985, Steffenburg et al., 1989). Since then, studies of broadly defined ASD have reported concordance to be 88-95% in MZ twins and 31% in DZ twins (Rosenberg et al., 2009, Taniai et al., 2008). A study that determined concordance for male twins found rates to be 47% in MZ twins and 14% in DZ twins (Lichtenstein et al., 2010). Another study reported concordance as 95.2% in MZ twins and 4.3% in DZ twins (Nordenbæk et al., 2014). More recent estimates of heritability have been reported in the range of 21-83%, depending on the study (Bai et al., 2019, Colvert et al., 2015, Frazier et al., 2014, Ronald and Hoekstra, 2011, Sandin et al., 2017, 2014). Overall, the numerous studies demonstrating that MZ twins are more concordant for ASD diagnoses than DZ twins, together with the large heritability estimates reported, support a complex etiology for ASD that involves a strong genetic component.

1.2.1 The Broader Autism Phenotype (BAP)

The involvement of genetics in ASD susceptibility is also supported by the finding that the unaffected relatives of individuals with ASD express subclinical autistic traits, termed a “Broader Autism Phenotype” (BAP) (Bolton et al., 1994, Piven et al., 1997). In twin studies, the reported discrepancy between MZ and DZ twin concordance widens further when considering a general presentation of autistic symptoms, which include cognitive, social, communication, and language impairment. For example, the initial ASD twin study reported MZ and DZ twin concordance to be 36% and 0%, respectively; however, concordance rates for BAP were 82% and 10%, respectively (Folstein and Rutter, 1977). Since then, higher rates of concordance for BAP in MZ twins compared to DZ twins have continually been reported, ranging between 91-92% for MZ twins and 10-30% for DZ twins (Bailey et al., 1995, Steffenburg et al., 1989). BAP has also been identified in the non-ASD siblings of probands, in up to 20.4% of siblings in one study (Bolton et al., 1994). Moreover, the prevalence of BAP has been studied in second- and third-degree relatives of ASD individuals, 22.5% of whom were reported with at least one ASD-like impairment (Szatmari et al., 2000). As well, up to 24% of the unaffected parents of children with ASD exhibit BAP (Bishop et al., 2008). Altogether, family and twin studies demonstrate that the relatives of autistic individuals express autistic-like symptoms, supporting the influence of genetics in ASD liability.

1.3 Genetic variants in ASD

Hundreds of genetic factors have been implicated in ASD susceptibility, and fall into the categories of ASD-related syndromes, chromosomal abnormalities, copy number variations and gene mutations. Genetic disorders of known cause explain approximately 10% of ASD cases, the most common of which are neurofibromatosis (NF1; <1%), Fragile X syndrome (~1-2% of ASD cases), Rett syndrome (~0.5%) and tuberous sclerosis (~1%) (Devlin and Scherer, 2012). Other syndromes that overlap ASD include adenylosuccinate lyase deficiency, Timothy syndrome, Smith-Lemli-Opitz syndrome, Down syndrome, phenylketonuria, and Angelman syndrome (Betancur, 2011, Moss and Howlin, 2009). Currently, over 100 disease genes and over 40 genomic disorders have been implicated in the etiology of ASD (Betancur, 2011). Intriguingly, all identified ASD-linked genes and loci have also been causally linked to ID, which suggests a genetic overlap between the disorders (Betancur, 2011).

Rare chromosomal abnormalities are found in approximately 5% of individuals with ASD (Devlin and Scherer, 2012). Recurrent and rare rearrangements associated with ASD have been identified across numerous chromosomes. Structural imbalances of chromosome 15 are the most frequently reported chromosomal abnormalities in ASD, and include the regions 15q11-q13, 15q11.2, and 15q13.3 (Bergbaum and Ogilvie, 2016). Other common chromosomal abnormalities linked to ASD have been identified on chromosomes 3, 16, 17, and 22.

Rare copy number variants (CNVs) are genetic risk factors in approximately 5% of ASD cases, with each CNV found in <1% of probands (Devlin and Scherer, 2012). When individuals with ASD were compared to an unaffected control population, cases were found to have a higher global burden of rare, genic CNVs, which include both de novo and inherited variants (Pinto et al., 2010). A previous genome-wide analysis of ASD-related CNVs estimated that a total of 234 de novo CNVs that confer risk to ASD exist (Sanders et al., 2011). One example of a de novo ASD-linked CNV is duplication of 7q11.23, which has been identified in multiple probands, whereas deletion of the same region causes Williams-Beuren syndrome (Sanders et al., 2011). ASD-related CNVs may exert their effects through multiple mechanisms, including haploinsufficiency and dominant or recessive action, and with variable penetrance (Devlin and Scherer, 2012).

Rare penetrant genes explain ~5% of ASD diagnoses (Devlin and Scherer, 2012). A previous study of de novo coding mutations in ASD identified 391 de novo likely-gene disrupting (LGD) mutations in 353 genes, with 27 of these loci being recurrent LGD events and 145 recurrent missense events, in individuals affected with ASD (Iossifov et al., 2014).


1.4 Whole Genome Sequencing (WGS)

Whole genome sequencing (WGS) is the current state-of-the-art technology for variant detection in ASD. Compared to whole-exome sequencing, WGS has increased the yield of de novo mutations detected in the coding genome in ASD (Jiang et al., 2013). Moreover, WGS has also identified de novo mutations predicted to be damaging in the noncoding genome of ASD probands (Yuen et al., 2016). One genome-wide study of genetic variation in familial forms of ASD found that the vast majority (>95%) of small CNVs captured by WGS were missed by high-resolution microarrays (Yuen et al., 2015). Along with detecting clinically relevant mutations in known risk loci, the use of WGS has enabled the identification of novel ASD candidate genes (Yuen et al., 2017). While the use of WGS has improved the resolution of the genetic architecture of ASD, the majority of cases remain without a genetic diagnosis (Jiang et al., 2013, Yuen et al., 2017).

1.4.1 Short-read WGS

Several WGS platforms exist that utilize short reads (up to 150 bp) to perform high-coverage sequencing and call variants relative to the reference genome (Lam et al., 2012). The technologies stemming from two companies, Illumina and Complete Genomics (CG), are the most frequently used and account for the generation of >90% of the complete human genome sequences currently reported (Lam et al., 2012). One study that compared variant detection using the Illumina HiSeq and CG short-read WGS platforms by sequencing a single sample to high depth (~76x) and aligning reads to the human reference genome found that the majority (88.1%) of SNV calls were concordant between the technologies (Lam et al., 2012). Illumina short-read WGS involves three main steps: 1) library preparation, 2) sequencing (cluster amplification, sequencing-by-synthesis, and image analysis), and 3) data analysis. Library preparation of DNA samples for WGS can involve PCR or be PCR-free. PCR-free libraries are beneficial for avoiding systematic biases and nonuniform coverage distributions introduced by PCR, particularly at GC-rich regions (Aird et al., 2011).

A large technical challenge of short-read WGS is repetitive DNA sequences that are found in multiple locations in the genome. Variant detection using short-read WGS requires mapping of the reads to a reference genome using alignment software; however, a major hurdle lies in aligning reads that can map to multiple locations (Treangen and Salzberg, 2012). One strategy to assign an ambiguous read to a genomic location is to use its best alignment, which is the one with the fewest mismatches (Treangen and Salzberg, 2012). The best match method, however, is not always accurate or possible (Treangen and Salzberg, 2012). Moreover, the assignment of reads to an incorrect genomic position can result in false calls in variant detection, such as for SNVs and CNVs (Treangen and Salzberg, 2012). Some alignment programs simplify the challenge of ambiguous reads by limiting analysis to unique regions in the genome and ignoring sequences found in multiple locations; however, this may result in clinically relevant variants being missed (Treangen and Salzberg, 2012, Willems et al., 2014). Therefore, while repetitive DNA is abundant, constituting almost half of the human genome, the difficulty in interpreting these sequences has led to their omission from previous large-scale studies of human genetic variation (Willems et al., 2014).
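To make the best-match strategy concrete, the following is a minimal sketch rather than the behaviour of any particular aligner. It assumes ungapped alignment scored by mismatch count only, and the function names and toy sequences are hypothetical. It shows why a read drawn from a repeat can have several equally good placements, which is the case in which the best-match rule cannot decide.

```python
def count_mismatches(read: str, reference: str, start: int) -> int:
    """Count mismatching bases between the read and the reference window at `start`."""
    window = reference[start:start + len(read)]
    return sum(1 for a, b in zip(read, window) if a != b)


def best_match_positions(read: str, reference: str) -> tuple[int, list[int]]:
    """Return the lowest mismatch count and every start position that achieves it."""
    if not read or len(read) > len(reference):
        raise ValueError("read must be non-empty and no longer than the reference")
    scores = {
        start: count_mismatches(read, reference, start)
        for start in range(len(reference) - len(read) + 1)
    }
    best = min(scores.values())
    return best, [pos for pos, score in scores.items() if score == best]


# Hypothetical toy reference: two CAG-rich segments separated by unique sequence.
reference = "TTGACCAGCAGCAGCAGTTAGGCAGCAGCAGCAGAACT"

unique_hit = best_match_positions("TTGACC", reference)      # read from a unique flank
repeat_hit = best_match_positions("CAGCAGCAG", reference)   # read from inside a repeat

print(unique_hit)  # a single best position
print(repeat_hit)  # several tied best positions, so the placement is ambiguous
```

In this toy case the read from the unique flank has one best position, whereas the read from within the repeat ties across both repeat segments; real aligners handle such ties with additional heuristics that are not modelled here.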

1.4.2 Long-read WGS

WGS that utilizes long read lengths can overcome some of the limitations presented by short-read technology, such as resolving complex genomic regions (Rhoads and Au, 2015, Shi et al., 2016). Single-molecule real-time (SMRT) sequencing, developed by Pacific Biosciences (PacBio), is a long-read sequencing technology that is referred to as PacBio or SMRT sequencing (Rhoads and Au, 2015). One advantage of PacBio sequencing is faster run times compared to that of short-read WGS (Rhoads and Au, 2015). Moreover, the long read lengths of PacBio sequencing enable the resolution of the location and sequence composition of repetitive DNA, since unique regions adjacent to the repetitive sequences are encapsulated in the same read (Rhoads and Au, 2015). PacBio sequencing is especially advantageous for resolving repetitive regions that are longer than 150 bp, which is the read length limit of short-read WGS (Rhoads and Au, 2015). The original PacBio platform was the RS system, which produced mean read lengths of ~1500 bp. The newer version of the system, the RS II, produces mean read lengths over 10 kb and an N50 of more than 20 kb, where N50 refers to the read length at or above which half of the sequenced bases are found (Rhoads and Au, 2015). However, complex and repetitive genomic regions that are longer than PacBio reads do exist, and therefore these sequences remain a technical challenge (Rhoads and Au, 2015, Shi et al., 2016).

PacBio sequencing also has limitations compared to short-read WGS. For example, PacBio sequencing produces reads with high error rates (11-15%) (Rhoads and Au, 2015). These errors are distributed uniformly, and are predominantly insertions and deletions (Shi et al., 2016). The accuracy of PacBio sequencing can be improved by generating circular consensus (CCS) reads, in which the template is sequenced through multiple times (Shi et al., 2016). A coverage of 15 sequencing passes results in >99% accuracy (Rhoads and Au, 2015). However, multiple passes of CCS reads reduce the read length (Shi et al., 2016). Therefore, variant detection using PacBio sequencing at low coverage would be unreliable (Shi et al., 2016). Moreover, PacBio sequencing is more costly than short-read WGS, which prevents the generation of large-scale personal genome datasets using long-read technologies. For instance, the cost per million bases for the PacBio RS platform is 10-27x more than the Illumina HiSeq 2500 (Rhoads and Au, 2015). Altogether, the high error rates and cost of PacBio sequencing make short-read WGS the better suited method for the generation of WGS datasets (Shi et al., 2016).
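Because the N50 statistic quoted above is easy to confuse with a mean or median read length, a short worked sketch may help. It uses the common convention that N50 is the length of the shortest read in the smallest set of longest reads that together contain at least half of all sequenced bases; the read lengths are invented for illustration and do not come from any dataset in this thesis.

```python
def n50(read_lengths: list[int]) -> int:
    """Return the N50 of a collection of read lengths.

    Reads are visited from longest to shortest; the N50 is the length of the
    read at which the running total first reaches half of all sequenced bases.
    """
    if not read_lengths:
        raise ValueError("at least one read length is required")
    total_bases = sum(read_lengths)
    running = 0
    for length in sorted(read_lengths, reverse=True):
        running += length
        if 2 * running >= total_bases:
            return length
    raise AssertionError("unreachable for non-empty input")


# Hypothetical read lengths in base pairs (illustrative values only).
lengths = [20000, 15000, 10000, 5000, 1500, 1000]

print(n50(lengths))                  # N50 of the toy data
print(sum(lengths) / len(lengths))   # mean read length, for comparison
```

For these toy values the N50 (15000 bp) is well above the mean (8750 bp), which illustrates how a few long reads dominate the statistic.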

1.4.3 MSSNG ASD WGS database

The large heterogeneity in the genetic factors involved in ASD susceptibility among individuals warranted thorough approaches that could capture all the genetic factors involved. The improved ability of WGS to capture variants compared to other technologies, the need to scan the genomes of thousands of samples to identify rare variants, and the decreasing cost of WGS resulted in the development of large ASD genomic databases that enable the discovery of novel risk variants (Yuen et al., 2017). MSSNG is one such genomic database, which consists of sample DNA from individuals with ASD and their affected and unaffected family members that were sequenced by short-read WGS platforms from CG or Illumina. MSSNG contains genetic and phenotypic data, such as subject information, family code, and results of diagnostic tests. During the initial stages of the MSSNG project, 5205 DNA samples were sequenced to 40x depth and analyzed for variant discovery, which resulted in the identification of 18 novel ASD candidate genes (Yuen et al., 2017). Currently, the MSSNG cohort has expanded to include ~10,500 samples.

1.5 Short Tandem Repeats (STR)

Short tandem repeats (STRs) are repetitive DNA sequences consisting of tandem copies of 1-6 base-pair (bp) motifs. It is estimated that these repeats make up approximately 3% of the human genome and exist as over 1 million distinct STR loci (Subramanian et al., 2003). Approximately 17% of human genes contain repeats within their open reading frames (Gemayel et al., 2010). The most common STRs are homonucleotide, dinucleotide, and trinucleotide repeats (Hannan, 2018). STRs are found in both the coding and noncoding portions of the genome; however, some repeats are more abundant in certain genomic locations (Subramanian et al., 2003, Tóth et al., 2000). For example, trinucleotide repeats are enriched approximately twofold in exonic regions compared to intronic and intergenic regions on all chromosomes except for chromosome Y (Subramanian et al., 2003).

Tandem repeats (TRs) function in numerous regulatory roles in the genome. One study of repeats with 1-45 bp motifs identified ~500 promoter-associated TRs that were linked to variable expression and methylation of nearby genes (Quilez et al., 2016). These TRs tend to cluster in close proximity (<1 kb) to the genes they act on (Quilez et al., 2016). It has been proposed that trinucleotide STRs may act as spacer elements between functional domains to provide a scaffold, a hinge, or flexibility in mediating protein-protein interactions (Karlin and Burge, 1996). TRs have also been shown to influence chromatin structure; for example, the CTG repeat track associated with the myotonic dystrophy protein kinase (DMPK) gene affects gene expression through chromatin structure (Gemayel et al., 2010). Furthermore, TRs located in promoter and intronic regions have been shown to modulate transcription factor binding (Gemayel et al., 2010).

STRs are some of the most variable types of DNA sequences in the genome. Compared to nonrepetitive DNA, STRs are typically polymorphic in length rather than in the primary sequence (Ellegren, 2004). STRs do not have a uniform mutation rate; instead, rates differ among loci and alleles. Mutation rates can be as high as 10^-2 per cell division at a single STR locus (Fondon et al., 2008). However, a common theme among STRs is that mutation rates increase with increasing repeat track length (Ellegren, 2004). Aside from repeat length, it has been suggested that the flanking sequences of STRs influence the mutation rate (Ellegren, 2004). A recent genome-wide survey of STRs in the human genome found that shorter repeat motifs, higher purity of the repeat motif, longer major alleles, and location in a noncoding region are all associated with increased variability (Willems et al., 2014). It is generally believed that STR length variability arises from replication slippage, which occurs when replicating DNA transiently dissociates and misaligns during re-association, resulting in insertion or deletion of repeat units (Ellegren, 2004). Most of these mutations will be corrected through DNA mismatch repair; however, a small fraction of events will remain and lead to STR length variation (Strand et al., 1993).
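As an illustration of the definition used in this section (tandem copies of a 1-6 bp motif), the sketch below scans a DNA string for perfect repeats with a regular expression. It is a toy example only: the function name, the minimum copy number, and the example sequence are assumptions made for illustration, and it is not how STR Finder, Expansion Hunter, or Expansion Hunter de novo operate.

```python
import re


def find_perfect_strs(sequence: str, min_copies: int = 3, max_motif_len: int = 6):
    """Return (start, motif, copies) for each perfect tandem repeat found.

    The motif group is matched lazily so that the shortest repeating unit is
    reported, for example "A" rather than "AA" within a homopolymer run.
    """
    if min_copies < 2:
        raise ValueError("a tandem repeat needs at least two copies")
    pattern = re.compile(
        r"([ACGT]{1,%d}?)\1{%d,}" % (max_motif_len, min_copies - 1)
    )
    hits = []
    for match in pattern.finditer(sequence.upper()):
        motif = match.group(1)
        copies = len(match.group(0)) // len(motif)
        hits.append((match.start(), motif, copies))
    return hits


# Hypothetical sequence: unique flanks around a CAG x5 repeat and an AT x4 repeat.
example = "GGTCA" + "CAG" * 5 + "TTGC" + "AT" * 4 + "GCC"

for start, motif, copies in find_perfect_strs(example):
    print(start, motif, copies)
```

The sketch reports only uninterrupted repeats, so the motif purity and flanking-sequence effects described above are outside its scope.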

1.5.1 Tandem Repeat Disorders (TRD)

Mutations of STRs are the cause of numerous genetic disorders (Hannan, 2018). Since the 1990s, it has been established that expansions of trinucleotide repeats cause disease, including Huntington’s disease, fragile X syndrome, Friedreich ataxia, myotonic dystrophy and other disorders (Fondon et al., 2008). To date, over 40 TRDs of various repeat motif sizes have been identified, the majority of which are neurological diseases (Castel et al., 2010, Fondon et al., 2008). Trinucleotide repeats (TNRs) in coding regions are the most common disease-causing repeats (Fondon et al., 2008). Moreover, pathogenic mutations are typically expansions rather than contractions, regardless of motif size (Castel et al., 2010). TNRs are generally short and stable in length in the healthy population; however, families with TRDs have longer, unstable tracks (Castel et al., 2010). Furthermore, the variable nature of pathogenic repeat tracks is seen between individuals as well as within the same individual across different tissues (Castel et al., 2010).

The genetics of TRDs are complex. For example, genetic anticipation is frequently seen in TRDs, whereby pathogenic repeat tracks show a tendency towards further expansion in subsequent generations, which results in earlier onset and more rapid progression of disease (Fondon et al., 2008). Genetic anticipation is seen in myotonic dystrophy; parents and grandparents experience onset in adulthood, whereas children and grandchildren may develop the disorder at birth (Orr and Zoghbi, 2007). Moreover, somatic instability has been proposed as a modifier of some TRD-associated phenotypes. For instance, one study of Huntington’s disease suggested that somatic instability is associated with earlier disease onset (Swami et al., 2009). Furthermore, the phenotypic presentation of TRDs tends to be continuous rather than binary (affected vs. unaffected). As well, a single STR locus may be associated with more than one disease. For example, fragile X syndrome is caused by CGG trinucleotide expansions beyond 200 repeats in the 5’ UTR of FMR1, whereas the healthy population tends to have repeat lengths between 5-44. Repeat lengths between ~55-200 in FMR1, termed the pre-mutation, are associated with fragile X tremor-ataxia syndrome (FXTAS), ovarian failure and other disorders (Hannan, 2018).

Multiple modes of pathogenesis exist for TRDs. At a high level, TRDs can be categorized by whether the causative repeat track is located in a coding or noncoding genomic region. Pathogenic exonic STRs commonly code for polyglutamine or polyalanine tracks, as in Huntington’s disease and oculopharyngeal muscular dystrophy, respectively (Hannan, 2018). Moreover, pathogenic repeat tracks may express amino acid repeats in multiple reading frames in a process described as repeat-associated non-ATG (RAN) translation (Zu et al., 2011). Pathogenic mutation of coding STRs can cause disease through change of protein function, toxic gain of protein, and RAN-translated proteotoxicity (Hannan, 2018). Repeat tracks of noncoding TRDs can reside in the intron, 5’ untranslated region, or 3’ untranslated region of the associated gene, as is the case for Friedreich ataxia (FRDA), fragile X syndrome, and myotonic dystrophy 1, respectively (Hannan, 2018). The pathogenic mechanisms for noncoding TRDs include toxic gain of protein function, RAN-translated proteotoxicity, toxic gain of RNA function, epigenetic dysregulation, and loss of gene expression or function (Hannan, 2018).

1.5.2 The role of STRs in common polygenic disorders

Previous studies have suggested that tandem repeat polymorphisms play a role in some common polygenic disorders (Hannan, 2010). One study found that STRs are significantly enriched in the exons of disease-linked genes compared to non-disease-related genes (Madsen et al., 2008). Moreover, a different study found that TNRs are five times more prevalent in cancer-associated genes than in other genes (Haberman et al., 2008). STR variation has previously been linked to various polygenic disorders, such as type 1 diabetes mellitus, which is influenced by both environmental and genetic factors (Hiromine et al., 2007). Researchers found a TGC repeat length variant in the 3’ UTR of the programmed cell death-1 (PDCD1) gene to be associated with development of the disorder (Hiromine et al., 2007). Repeat instability is also linked to various cancers. In colorectal cancer, inactivation of mismatch repair genes results in global STR instability (Wheeler et al., 2000). A recent survey of STR instability across different cancer types found an association with 14 cancer types. The study also found that STRs located within known cancer genes were significantly more unstable than those in genes not associated with cancer (Hause et al., 2016). Given the established role of STR variation in some complex diseases known to be partly genetic, and given that the genetic factors identified for complex disease often confer less disease risk than expected, it has been suggested that STR variation may account for a fraction of this discrepancy, termed “missing heritability” (Press et al., 2014).

1.5.3 The role of STRs in ASD

Several lines of evidence point to a role for STRs in the etiology of ASD. Firstly, several TRD loci are ASD risk loci. For example, fragile X syndrome is one of the most common genetic risk factors for ASD (Wang et al., 2010). Moreover, ASD has previously been linked to myotonic dystrophy 1 (DM1), and one study of 57 individuals with DM1 found that approximately half (n = 28) of cases presented with ASD (Ekström et al., 2008). Other STR disease genes that are known ASD-risk factors include AFF2, ATXN7, CACNA1A, and ARX (Hannan, 2018). Furthermore, ASD is an NDD, and repeat variation has been implicated in the genetic architecture of other NDDs. For instance, ADHD has been linked with repeat variation in the SLC6A3/DAT1 and DRD4 genes (with associated repeat motifs of 40 and 48 bp, respectively) (Franke et al., 2008, Johnson et al., 2008). Altogether, four observations suggest that STR variation may play a role in ASD susceptibility: STR variation is suspected to contribute to the missing heritability of complex diseases, ASD is a known complex disease with a strong genetic component, other NDDs have been associated with STR variation, and several ASD-risk loci are known TRD genes.

1.5.4 Genetic anticipation in ASD

Genetic anticipation is the phenomenon of increasing disease severity in successive generations. Anticipation has been reported in numerous disorders. The phenomenon was first observed in myotonic dystrophy, and one study found that 98% of children with the disorder had earlier onset than their affected parents (Höweler et al., 1989). The molecular basis for anticipation in myotonic dystrophy was identified as trinucleotide repeat expansion (McInnis, 1996). Since then, it has been established that repeat expansion diseases commonly show anticipation, explained by the tendency of expanded repeats to enlarge in subsequent generations due to instability, and by the tendency of longer expansions to cause earlier onset and increased disease severity (Paulson, 2018). Other TRDs that show anticipation include Huntington’s disease and spinocerebellar ataxia; however, not all repeat disorders show anticipation, oculopharyngeal muscular dystrophy being one exception (Paulson, 2018). Anticipation has also been reported in complex disorders, such as bipolar disorder, schizophrenia, Crohn’s disease and psoriasis (Cardoso and Marques, 2018).

Anticipation has been suggested to contribute to the complex inheritance pattern seen in ASD, whereby concordance for ASD is greater when considering the BAP compared to a clinical diagnosis (Stodgell et al., 2000). One theory proposes that since some individuals without an ASD diagnosis express subclinical traits, they may be carriers of ASD-risk genes impacted by anticipation. These risk loci may be transmitted from parents with BAP to children with a full ASD diagnosis, demonstrating anticipation (Stodgell et al., 2000).

Several ASD-risk loci that are also TRD genes show anticipation. One example is the AFF2 gene, which causes non-specific X-linked intellectual disability (XLID), also known as FRAXE syndrome, due to a CGG repeat expansion in the 5’ UTR resulting in gene silencing (Stettner et al., 2011).
The clinical features of this disorder include intellectual disability, communication impairments, hyperactivity, attention problems, and autistic behaviour (Sahoo et al., 2011). Variants in AFF2 have routinely been identified in individuals with ASD, including rare non-synonymous mutations, missense mutations, deletions, and duplications (Mondal et al., 2012, Sahoo et al., 2011, Stettner et al., 2011). Anticipation has been reported in families with members affected with FRAXE (Murray et al., 1996). Fragile X syndrome, which is caused by a CGG repeat expansion in the 5’ UTR of FMR1 that results in hypermethylation and silencing of the gene, is the leading known genetic cause of ASD and also demonstrates anticipation (Murray et al., 1996). The involvement of anticipation in several known ASD-risk loci that are also TRD genes, together with the complex inheritance pattern of the disorder, suggests that anticipation may also act in ASD through STR variation at other susceptibility loci.

1.6 STR genotyping

In order to investigate the role of STRs in complex disorders, such as ASD, genome-wide screens of STR variation in genomic data are needed (Hannan, 2018). Until recently, however, data on STRs was limited to a few thousand loci that were from STR linkage and association panels, forensic analysis, genetic genealogy or genetic diseases (Willems et al., 2014). Moreover, for the vast majority of known STR loci, there was little information on normal allelic distributions and population differences due to the previous lack of high-throughput genotyping technologies for STRs (Willems et al., 2014).

1.6.1 Traditional STR detection methods

Traditional STR detection and genotyping methods rely on a priori knowledge of the genomic location of STRs and scale poorly (Willems et al., 2014). Polymerase chain reaction (PCR) amplification combined with gel electrophoresis is commonly used to size STRs; however, these methods can result in false-positive calls for expansions because the polymerase can stutter and introduce additional repeats. Therefore, PCR is typically used to detect alleles ranging from normal to premutation length and is inaccurate for longer alleles (Ciotti et al., 2004, Lyon et al., 2010). Southern blot analysis is the method of choice for several pathogenic STR loci because it can accurately size large premutations and full mutations; however, this method is expensive, technically challenging and labor intensive (Ciotti et al., 2004, Lyon et al., 2010). PCR has routinely been combined with Southern blot in order to genotype normal to fully expanded alleles (Lyon et al., 2010). Triplet-repeat primed PCR (TP PCR) was developed as a method to overcome the limitations of traditional PCR for expanded alleles; however, this assay can only identify expanded alleles and cannot determine the repeat length (Warner et al., 1996). Recently, TP PCR has been combined with automated capillary electrophoresis to produce a high-throughput assay to screen for expanded STRs (Lyon et al., 2010). This method still requires the use of Southern blot analysis to accurately size expanded alleles (Lyon et al., 2010).

1.6.2 STR genotyping algorithms

Several STR profiling algorithms have been developed which utilize short-read WGS data to detect and genotype STRs. One of the first algorithms developed was lobSTR, which uses the nonrepetitive regions that flank STRs to align reads from WGS data and determine repeat lengths. The major limitation of lobSTR is that the program is designed to call STRs that are covered entirely by a single read, and therefore can only size repeat alleles shorter than the read length (Gymrek et al., 2012). A previous study utilized lobSTR to perform a genome-wide analysis of STR variation in the human population using WGS data from the 1000 Genomes Project, and produced a public catalog of variations for ~700,000 loci identified in 1009 samples (Willems et al., 2014).

Newer STR genotyping algorithms have been developed that can size STRs that are longer than the read length. This is a critical advancement since many pathogenic expansions are close to or beyond the short-read WGS read length of 100-150 bp (Tang et al., 2017). One such tool, ExpansionHunter, uses read pair information, recovery of mismapped reads, reads that partially and fully span the repeat region, and the number of reads mapped to a region from PCR-free short-read WGS data to genotype targeted loci, and is able to accurately size both short and long repeats (Dolzhenko et al., 2017). Another STR caller, TREDPARSE, works similarly to ExpansionHunter as a targeted tool to detect expanded alleles longer than the read length, although TREDPARSE has the additional functionality of using the paired-end distance between reads that span the STR region to predict the size of larger repeats (Tang et al., 2017).

More recently, STR callers have been developed that estimate the size of all known STR loci in the genome from PCR-free short-read WGS data, rather than targeting selected loci. These methods, such as gangSTR and STRetch, rely on the repeat annotation of the human reference genome to determine the genomic STR positions (Dashnow et al., 2018, Mousavi et al., 2019). STRetch detects genome-wide STR loci by utilizing “decoy chromosomes” that are composed of all possible 1-6 bp tandem repeat units, adding the decoys to the reference genome, identifying reads that preferentially map to the decoys, using read-pair information to find the genomic origin of STRs, and using read coverage to size STRs (Dashnow et al., 2018). Altogether, these advances in STR calling tools for WGS data have enabled genome-wide detection of STR variation.


1.7 Project rationale

Although ASD is estimated to have high heritability, hundreds of genetic factors have been identified, and state-of-the-art variant detection technology has been utilized, at least half of cases remain without a genetic diagnosis (Devlin and Scherer, 2012). STR variation may play a role in the genetic architecture of ASD given that STRs are suspected to contribute to susceptibility to a range of complex disorders, STR variation is involved in the risk of developing other NDDs, and several TRD loci are ASD-risk factors. A large-scale screen of STR polymorphism in ASD is needed to identify its contribution to risk. I proposed that applying genome-wide STR calling tools to a large ASD WGS database (MSSNG) would allow for a genome-wide screen for STR variation in ASD. I hypothesized that novel STR variation would be detected in some ASD cases and may lead to the identification of novel risk loci that explain a portion of ASD genetic etiology.

Chapter 2

Methods: Development and Implementation

2.1 MSSNG ASD database

WGS data from 3845 unique samples in the MSSNG autism spectrum disorder (ASD) genomic database were used in this study. The samples represent 811 families and comprise children diagnosed with ASD, their affected and/or unaffected siblings, and their unaffected parents. The dataset contained 2122 samples affected with ASD in total. Sample DNA was derived from whole blood or lymphoblast-derived cell lines, prepared using PCR-free library preparation kits and sequenced on the Illumina HiSeq X platform. Reads were aligned to build GRCh37 of the human reference genome using the Burrows-Wheeler Aligner (BWA; version 0.7.10). Genomic data was stored and accessed through the MSSNG Google cloud database (Yuen et al., 2017).

2.2 STR genotyping pipeline development

2.2.1 Expansion Hunter de novo (EHdn)

Expansion Hunter de novo (EHdn) is a genome-wide STR detection tool for repeats larger than the read length (150 bp). EHdn does not require prior knowledge of the genomic locations of STRs and uses aligned, unaligned, and misaligned reads from short-read PCR-free WGS data to approximate the location and motif composition of STRs with motifs of up to 20 bp. EHdn utilizes read-pairs in which at least one read maps inside an STR region, called in-repeat reads (IRRs). These IRRs can either be anchored, if one mate aligns to the non-repetitive flanking sequence of a repeat, or paired, if both mates align inside an STR and are composed of the same motif (Figure 1). EHdn identifies STRs supported by anchored IRRs, and uses the alignment position of the anchored reads to approximate the genomic coordinates of STRs. EHdn version 0.6.1 with min-anchor-mapq set to 50 and max-irr-mapq set to 40 was used to identify loci for targeted genotyping. EHdn version 0.7.0 with min-anchor-mapq set to 50 and max-irr-mapq set to 60 was used for long-read validation and comparison to other genotyping tools. (Developed by Dr. Egor Dolzhenko, Illumina Inc.)

2.2.2 STR Finder

STR Finder identifies the genomic coordinates of STRs in the human reference genome within a defined search space and for a given repeat motif (Figure 2). This Python program was developed to determine the location in the human reference genome sequence, and the composition, of STRs identified by EHdn, and to translate this information into input files suited to EH for targeted genotyping.

The user first specifies the approximate genomic location and motif of an STR, which may be a call from EHdn or any other repeat of interest. STR Finder pulls the genomic sequence corresponding to the input coordinates, and uses a regular expression (“(.+?)\1+”) to identify all the repeating sequences found in the region to create a list of potential STRs. Within the regular expression, the parentheses indicate a capture group, which in this case represents the repeat motif. In the capture group, the “.” specifies any character, and is used to match either A, T, C or G in the genomic sequence. The “+” indicates to match 1 or more of the preceding token. The combination of “.+” means that the program will match any arrangement of the characters A, T, C or G with a motif length of 1 or more. The “?” specifies the preceding quantifier to be lazy and match as few characters as possible. In the case of STRs, this allows STR Finder to identify the simplest form of motifs. The “\1” specifies to match the result of the first (and only) capture group. A “+” is used again to describe matching 1 or more of the preceding token, which is now the entire capture group. After all possible capture groups in the specified genomic sequence have been matched, a list of all STRs found is returned.

Next, the list of identified STRs is compared to the input repeat motif to select the most similar STR. A repeat motif is similar to the input motif if they are the same length, contain the same characters, and are in any frame or reverse complement of one another. If none of the STRs within the list are similar to the input motif, a lower threshold for similarity is used. Similarity is then defined as STRs with at least 67% sequence similarity to the input motif, in any frame or reverse complement. After reducing the list of potential STRs to those that are similar to the input, the STR with the longest repeat track is selected. The genomic coordinates of the selected STR are determined by string matching the entirety of the selected STR against the genomic sequence. If the genomic sequence contains multiple repeats with the same length and composition as the selected STR, the first STR that appears in the sequence is selected. The genomic coordinates and motif composition of the final STR are recorded in a repeat specification file, which is formatted as input for targeted genotyping by Expansion Hunter. (https://github.com/charlottenguyen/STRFinder)
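The search and selection steps described above can be illustrated with a minimal Python sketch. This is not the STR Finder source code: the function names (`find_tandem_repeats`, `motif_permutations`, `select_str`) and the example region are hypothetical, only the exact-match step is shown (the 67% similarity fallback is omitted), and positions are 0-based offsets into the input string rather than genomic coordinates.

```python
import re

# Lazy capture group (the motif) followed by one or more further copies of it.
TANDEM_REPEAT = re.compile(r"(.+?)\1+")

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def find_tandem_repeats(sequence):
    """Return (motif, copies, start) for every non-overlapping repeat track."""
    tracks = []
    for match in TANDEM_REPEAT.finditer(sequence):
        motif = match.group(1)
        copies = len(match.group(0)) // len(motif)
        tracks.append((motif, copies, match.start()))
    return tracks


def motif_permutations(motif):
    """All frames (rotations) of a motif and of its reverse complement."""
    reverse_complement = motif.translate(_COMPLEMENT)[::-1]
    frames = set()
    for unit in (motif, reverse_complement):
        for shift in range(len(unit)):
            frames.add(unit[shift:] + unit[:shift])
    return frames


def select_str(sequence, input_motif):
    """Pick the longest track whose motif matches the input in any frame or strand."""
    allowed = motif_permutations(input_motif)
    candidates = [t for t in find_tandem_repeats(sequence) if t[0] in allowed]
    if not candidates:
        return None
    # max() keeps the first of equally long tracks, i.e. the earliest in the sequence.
    return max(candidates, key=lambda t: t[1] * len(t[0]))


# Hypothetical 20 bp region, not taken from the thesis data.
region = "CCGTCTGAAGAAGAAGAATC"
print(find_tandem_repeats(region))
print(select_str(region, "AAG"))
```

For this hypothetical region, the sketch reports two tracks (C repeated twice and GAA repeated four times) and selects the GAA track for the input motif AAG, because GAA is one of the frames of AAG.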

2.2.3 Expansion Hunter (EH)

Expansion Hunter (EH) is a targeted STR genotyping tool which uses aligned, unaligned, and misaligned read-pairs that partially or fully contain the specified repeat motif to determine the size of alleles (Figure 1). For STRs shorter than the read length, EH uses reads that fully span the repeat (spanning reads) for genotyping. Repeat regions that are larger than the read length require reads that flank either end of the region for size determination. STRs that are close in size to the read length require read-pairs in which one mate partially overlaps the repeat, termed flanking reads. STRs that are longer than the read length will produce reads that are fully composed of the repeat sequence, termed IRRs. These IRRs will either be paired, where both reads fully contain the repeat, or anchored, where one read will contain non-repetitive flanking sequence. EH uses anchored IRRs to approximate the size of STRs longer than the read length but shorter than the fragment length, and uses paired IRRs to estimate the size of greater-than-fragment-length repeats if the STR motif is sufficiently rare, such that no long repeats with the same motif exist elsewhere in the genome. EH version 2.5.5 with default parameter settings was used for genotyping in this study (Dolzhenko et al., 2017). (https://github.com/Illumina/ExpansionHunter)

2.2.4 STR genotyping pipeline

EHdn, STR Finder, and EH were utilized in conjunction to perform genome-wide detection and genotyping of large STRs in the MSSNG ASD genomic database (Figure 3). First, EHdn was applied individually to 3845 WGS samples to produce a genome-wide catalog of candidate large STR regions for each sample. These individual catalogs were combined and then filtered to retain regions supported by at least 5 anchored IRRs across all the samples, in order to identify high-confidence calls. The reference genomic coordinates and motifs for these high-confidence STRs were determined using STR Finder to produce repeat specification input files for each locus. All repeat specification files that were successfully generated, which together constitute a genome-wide catalog of high-confidence large STRs identified in MSSNG, were used for targeted and genome-wide genotyping using EH. STR variants were annotated using a custom pipeline based on Annovar, which used information from databases of allele frequency, genomic conservation, variant impact predictors, and implication in human genetic disorders (Yuen et al., 2017).
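The catalog-combining and filtering step can be sketched as follows. The data layout is an assumption made for illustration (each per-sample catalog is a dictionary mapping a hypothetical region key to its anchored IRR count) and does not reflect the actual EHdn output format; the sketch also assumes that "across all the samples" means the anchored IRR counts are summed over samples.

```python
from collections import Counter

MIN_ANCHORED_IRRS = 5  # threshold stated in the text


def high_confidence_regions(per_sample_catalogs, min_irrs=MIN_ANCHORED_IRRS):
    """Sum anchored IRR counts per region across samples; keep supported regions."""
    totals = Counter()
    for catalog in per_sample_catalogs:
        for region, anchored_irrs in catalog.items():
            totals[region] += anchored_irrs
    return {region: count for region, count in totals.items() if count >= min_irrs}


# Hypothetical catalogs for three samples; keys are (chromosome, start, end, motif).
catalogs = [
    {("chr1", 1000, 1500, "AAG"): 3, ("chr2", 200, 700, "AC"): 1},
    {("chr1", 1000, 1500, "AAG"): 2},
    {("chr2", 200, 700, "AC"): 2},
]
print(high_confidence_regions(catalogs))
```

In this made-up example only the chr1 region reaches the threshold (3 + 2 = 5 anchored IRRs), so it is the only region that would be passed on to STR Finder.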

[Figure 1 image: schematic of a BAM file showing a repeat sequence with non-repeat flanks, and the three read types: spanning reads, flanking reads, and in-repeat reads.]

Figure 1: Expansion Hunter de novo and Expansion Hunter read-based approaches. The Expansion Hunter tools utilize read-pair information derived from sequence alignment data (BAM) to detect and genotype STRs. Reads that contain repeats, which may consist entirely of repetitive sequence or contain non-repetitive sequences, are used to estimate the location and composition of STRs. This figure is adapted from Dolzhenko et al., 2017.

[Figure 2 image: worked example of STR Finder.]

User input: --region chr17:30603413-30604129 --motif AAG
Reference sequence: CCGTCTAAAAAAATAAAAAGAAGAAGAAGAAGAAGAAGAAGAAGAAGGAAA
STRs found in sequence: C,2 A,7 A,5 GAA,9 G,2 A,2
All input motifs: AAG, AGA, GAA, TCT, CTT, TTC
Select final STR track: GAA,9

Figure 2: STR Finder. STR Finder is designed to identify all repeat tracks within a specified genomic region to find the repeat track that is most similar to the input repeat motif. First, the program extracts the genomic sequence specified by the user. The program then identifies all of the repeat tracks found in the genomic region. To identify the repeat track corresponding to the input specifications, the program determines all permutations (any frame or reverse complement) of the input repeat motif. The repeat tracks that were identified in the genomic sequence are compared to the original motif and its permutations to find a match. If more than one match is found, the longest repeat track is selected as the output STR.

[Figure 3 image: pipeline flowchart with a legend distinguishing data from processes. Elements recoverable from the figure: BAM/CRAM input; ExpansionHunterDenovo, run in batches (1000s), producing candidate STR regions and read depths (JSON); combine_counts.py; compare_anchored_irrs.py; STR-finder, taking STR motifs and the assembly/reference type (samples from different assemblies should not be mixed) and producing one repeat spec file per STR; ExpansionHunter, producing VCF, JSON and log output; Fake_VCFs.R (faked VCF); Annovar (annotations, TSV); combine_summary.R (per-individual information).]

Figure 3: STR Genotyping Pipeline. Expansion Hunter de novo, STR Finder, and Expansion Hunter were used in conjunction to perform genome-wide genotyping of STR variation in short-read WGS databases. Large STRs were detected genome-wide in sample WGS data using EHdn. The genomic location and composition of STRs were resolved in the human reference genome using STR Finder. Genotyping of the STRs was performed using EH. Additional scripts were developed to process the data for annotation and outlier detection.


2.3 Pipeline validation using long-read sequencing data

To assess the accuracy of EHdn, we compared STRs detected by EHdn on short-read WGS data with those detected by other methods on long-read sequencing data. The Venter (HuRef) Genome is a sequenced and de novo assembled genome derived from a single individual and reconstructed using long-range haplotype assembly to create a diploid human genome sequence (Levy et al., 2007). It is regarded as a high-quality, publicly available human genome sequence that has extensive catalogs of variants including SNPs, indels, and structural variation (Zhou et al., 2018). EHdn was run on sequencing data of HuRef DNA generated by WGS at 40x coverage using the Illumina HiSeq X platform with a PCR-free DNA library. Each EHdn call was compared with structural variant calls made by PBSV (version 1 or 2) or Sniffles from PacBio long-read sequencing of HuRef DNA at 100x coverage. Each EHdn call was also compared with variants detected in Complete Genomics sequencing data or present in the HuRef structural variant benchmark (Pang et al., 2010, 2013, 2014). An EHdn call was considered validated if a variant was detected by one of the aforementioned methods within 100 bp upstream or downstream of the call. For additional validation, if an EHdn call was supported by an inserted sequence, it was determined whether at least one copy of the repeat motif (any frame or reverse complement) was found in the insertion.
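The two validation checks described above can be sketched as follows. This is an illustration under stated assumptions, not the code used in the study: calls and variants are represented as hypothetical (chromosome, start, end) tuples, "within 100 bp" is interpreted as overlap after padding the call by 100 bp on each side, and the function names are invented for this example.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def motif_frames(motif):
    """All frames (rotations) of a motif and of its reverse complement."""
    reverse_complement = motif.translate(_COMPLEMENT)[::-1]
    return {
        unit[shift:] + unit[:shift]
        for unit in (motif, reverse_complement)
        for shift in range(len(unit))
    }


def is_validated(call, variants, window=100):
    """True if any variant lies within `window` bp upstream or downstream of the call."""
    chrom, start, end = call
    return any(
        v_chrom == chrom and v_start <= end + window and v_end >= start - window
        for v_chrom, v_start, v_end in variants
    )


def insertion_supports_motif(inserted_sequence, motif):
    """True if the insertion holds at least one copy of the motif in any frame or strand."""
    return any(frame in inserted_sequence for frame in motif_frames(motif))


# Hypothetical call and long-read variant, for illustration only.
ehdn_call = ("chr1", 1000, 1200)
long_read_variants = [("chr1", 1290, 1300)]
print(is_validated(ehdn_call, long_read_variants))
print(insertion_supports_motif("TTCTTCTT", "AAG"))
```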

Furthermore, because EHdn is designed to detect STRs larger than 150 bp, a call was also validated if the repeat region in the reference sequence was equal to or greater than 150 bp. To size an STR, the reference sequence was searched, at the region specified by the EHdn call, to determine if the sequence contained tandem repeats of the EHdn motif (any frame or reverse complement). If the motif did not exist in the reference sequence, repeats with 67% or greater sequence similarity were matched to the EHdn motif (any frame or reverse complement). If multiple STRs with repeat motifs corresponding to the EHdn motif were identified, the longest repeat track was selected for analysis. Next, sequences adjacent to the selected track were analyzed for sequence similarity (90% or greater) to the repeat motif. The length of this entire region, composed of tandem repeats identical or highly similar to the EHdn motif, was used to size the STR in the reference sequence.


2.3.1 Comparison of EH tools to other STR callers

The performance of EHdn was compared to other STR-detection tools, STRetch and gangSTR, which were also validated using the procedure described above (Section 2.3). To standardize calls from the different programs (EHdn, STRetch, and gangSTR), all callsets were restricted to “large” calls. A large call was defined as a call supported by at least two anchored IRRs for EHdn, a call supported by two reads for STRetch, and a call with an allele equal to or greater than 150 bp for gangSTR. EHdn and STRetch are designed to detect large STRs through their specific methodologies, while gangSTR uses an STR-database approach to scan for all STRs. Therefore, gangSTR was restricted to calls equal to or greater than 150 bp to subset the data for large STRs. The different STR programs can detect repeats with variable motif lengths. To further standardize calls, all datasets consisted of repeats with 2-6 bp motifs. Motif lengths equal to 1 bp were excluded from analysis, based on the recommendation from the developer of EHdn, since these repeats were overrepresented in the data. Depending on the composition of a repeat sequence, EHdn may call multiple repeats that originate from the same region. Therefore, after validation was performed on individual calls, calls within 500 bp of one another were merged. Calls were also restricted to those found within 20 kb of a gene, based on the NCBI RefSeq dataset.
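The merging of nearby calls can be sketched as a single pass over sorted intervals. The tuple layout and the function name `merge_nearby_calls` are hypothetical, and the sketch assumes that "within 500 bp" refers to the gap between the end of one call and the start of the next on the same chromosome.

```python
def merge_nearby_calls(calls, max_gap=500):
    """Merge (chrom, start, end) calls whose gap on the same chromosome is <= max_gap."""
    merged = []
    for chrom, start, end in sorted(calls):
        if merged and merged[-1][0] == chrom and start - merged[-1][2] <= max_gap:
            previous = merged[-1]
            # Extend the previous interval instead of starting a new one.
            merged[-1] = (chrom, previous[1], max(previous[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


# Hypothetical calls, for illustration only.
calls = [
    ("chr1", 100, 200),
    ("chr1", 650, 700),    # 450 bp after the first call: merged
    ("chr1", 1300, 1400),  # 600 bp after the merged interval: kept separate
    ("chr2", 150, 250),
]
print(merge_nearby_calls(calls))
```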

2.3.2 Pipeline validation using twin data

From the MSSNG dataset, 40 monozygotic twins (20 twin sets) and 10 dizygotic twins (5 twin sets) were used for analysis. A total of 1627 EHdn calls were detected across all twin samples. A concordant EHdn call is defined as an expansion detected in both or neither of the siblings in a twin set. A discordant call is an expansion detected in only one sibling of a twin set.
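The concordance definition can be expressed as a short sketch. The locus identifiers and the function name `classify_twin_calls` are hypothetical; each twin is represented by the set of loci at which EHdn made a call.

```python
def classify_twin_calls(twin_a_calls, twin_b_calls, all_calls):
    """Split loci into concordant (both or neither twin) and discordant (one twin only)."""
    concordant = {
        locus for locus in all_calls
        if (locus in twin_a_calls) == (locus in twin_b_calls)
    }
    discordant = set(all_calls) - concordant
    return concordant, discordant


# Hypothetical loci, for illustration only.
all_loci = {"locus1", "locus2", "locus3", "locus4"}
twin_a = {"locus1", "locus2"}
twin_b = {"locus1", "locus3"}
concordant, discordant = classify_twin_calls(twin_a, twin_b, all_loci)
print(sorted(concordant), sorted(discordant))
```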

2.4 Outlier detection method for STR expansions

For each STR locus, an outlier expansion was defined as an allele that was equal to or larger than the 99th percentile multiplied by some number n. In our analysis, n ranged from 1 to 3 and was adjusted after visual inspection of the allelic distribution per locus. A sample was considered an outlier if at least one allele was equal to or greater than the defined outlier cut-off. An outlier expansion was considered de novo if it was identified in an ASD proband and the parents had allele lengths below the 99th percentile value of the allelic distribution for that locus. An exponential distribution was used to describe the MSSNG allelic distributions for outlier detection. (Developed by Induja Chandrakumar, Yuen Lab).
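The outlier and de novo definitions can be sketched as follows. This is only an illustration of the cut-off logic: it takes the 99th percentile empirically with the nearest-rank method, which is an assumption made here, whereas the model used in the study described the allelic distributions with an exponential distribution. All function names and allele values are hypothetical.

```python
def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile (pct is an integer from 1 to 100)."""
    ordered = sorted(values)
    rank = max(1, -(-pct * len(ordered) // 100))  # ceiling of pct * N / 100
    return ordered[rank - 1]


def outlier_cutoff(alleles, n=1):
    """Cut-off = 99th percentile of the allelic distribution multiplied by n."""
    return percentile_nearest_rank(alleles, 99) * n


def is_outlier(genotype, cutoff):
    """A sample is an outlier if at least one allele reaches the cut-off."""
    return any(allele >= cutoff for allele in genotype)


def is_de_novo(proband, mother, father, alleles, n=1):
    """Outlier in the proband while all parental alleles are below the 99th percentile."""
    p99 = percentile_nearest_rank(alleles, 99)
    parents_below = all(allele < p99 for allele in (*mother, *father))
    return is_outlier(proband, p99 * n) and parents_below


# Hypothetical repeat counts at one locus, for illustration only.
alleles = [2] * 98 + [10, 60]
cutoff = outlier_cutoff(alleles, n=2)
print(cutoff, is_outlier((2, 60), cutoff))
print(is_de_novo((2, 60), (2, 2), (2, 5), alleles, n=2))
```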

2.5 Statistical analysis

Case/control analysis was performed with children and siblings affected with ASD in MSSNG as cases, and the 1000 Genomes cohort as controls. For each locus, a one-sided (right-tailed) Fisher’s exact test was performed using a defined outlier length as a cut-off to determine the number of samples with and without outlier alleles in each respective cohort. The number of samples with and without outliers in MSSNG was compared to the equivalent data in the 1000 Genomes cohort in a 2x2 contingency table. Fisher’s exact tests were performed in RStudio version 1.1.414 with R version 3.4. The Bonferroni correction method was applied to adjust p-values for multiple testing, with α = 0.05 and 23 hypotheses, representing the number of loci selected for genotyping in both cohorts.
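As an illustration of the test described above, the following Python sketch computes a right-tailed Fisher’s exact p-value from the hypergeometric distribution and applies a Bonferroni adjustment for 23 hypotheses. It is a stand-in for exposition only, since the analysis itself was run in R; the function names and the counts in the example table are hypothetical and are not study data.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """Right-tailed p-value for the 2x2 table [[a, b], [c, d]].

    a: cases with outliers      b: cases without outliers
    c: controls with outliers   d: controls without outliers
    """
    row1, col1, total = a + b, a + c, a + b + c + d
    denominator = comb(total, col1)
    upper = min(row1, col1)
    # Sum hypergeometric probabilities of tables at least as extreme as observed.
    numerator = sum(
        comb(row1, k) * comb(total - row1, col1 - k) for k in range(a, upper + 1)
    )
    return numerator / denominator


def bonferroni(p_value, n_tests=23):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p_value * n_tests)


# Hypothetical counts, for illustration only.
p = fisher_exact_greater(3, 1, 1, 3)
print(p, bonferroni(p))
```

An adjusted p-value below α = 0.05 would then be reported as significant for that locus.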

Chapter 3

Results

3.1 Genome-wide STR genotyping in MSSNG

3.1.1 STR calling using EHdn

EHdn was applied to 3845 PCR-free genomes sequenced on the Illumina HiSeq X platform from the MSSNG database, representing samples from children affected with ASD, their affected and unaffected siblings, and the unaffected parents. A total of 4830 unique STR loci were identified, where each call represents a locus that was estimated to be equal to or greater than 150 bp in at least one MSSNG sample and was filtered to be a high-confidence call. On average, 1418 STR calls were made per sample (Figure 4). The EHdn calls represent motif sizes between 2 and 20 bp, and among the 4830 calls there were 1705 unique motifs. The most frequently seen motif (732/4830) was AAAG, which is consistent with other studies of genome-wide STR variation (Dashnow et al., 2018) (Figure 5). The next most frequent motifs were AAG, AG, and AAAAT (Figure 5). The most frequently seen motif sizes were tetranucleotides, pentanucleotides, trinucleotides, and dinucleotides. After dinucleotides, the next most common motif sizes were between 18 and 20 bp. The least frequently seen motif sizes were between 8 and 13 bp. STRs were called on every chromosome, with calls most frequently seen on chromosomes 2, 1, 4, and 7. The frequencies of STR calls per chromosome were not normalized for the size of the chromosome.

[Figure 4 image: histogram titled “Distribution of STRs in MSSNG”. x-axis: total number of STRs; y-axis: % frequency of number of STRs.]

Figure 4: STR loci detected in MSSNG samples using Expansion Hunter de novo. The distribution of the total number of STRs called per sample in the MSSNG ASD genomic database by the EHdn STR calling tool. On average, 1418 EHdn calls were made per sample. In total, 4830 unique loci were detected in MSSNG.

[Figure 5 image: plot of % frequency of motif (y-axis) against repeat unit, RU (x-axis). Labelled motifs: AAAG, AAG, AG, AAAAT, AT, AAGG, ACC, ATCC, AC, AAAGG, ACGGGAGAGGGAGAGGGAG, AAAAG, AAGGAG, AACCCT, AGGG, AGG, AAGGG, AATGG, AAGAG, CCG.]

Figure 5: STR motifs identified in MSSNG using Expansion Hunter de novo. Short tandem repeat DNA motifs and the frequency of motifs found in MSSNG, called by EHdn. Motifs are arranged in alphabetical order. The composition of the most commonly called motifs is shown.


3.1.2 Identification of reference STRs using STR Finder

A subset of 1821 calls from the 4925 called STR loci from EHdn were chosen for genotyping of the MSSNG database using EH. The chosen loci represent rare (called in less than 1% of unaffected parents), trinucleotide, and X-chromosome STR expansions. Rare loci were chosen since it is expected that the remaining genetic etiology of ASD will be explained by rare variants and because the known STR loci that influence ASD risk are rare, such as the fragile X syndrome locus. Trinucleotide repeats were selected for analysis because they are the most common motif among pathogenic STRs. Calls on the X chromosome were selected because it has been hypothesized that sex chromosomes may contribute to the ASD male bias. To genotype the selected calls, the corresponding reference coordinates and motifs of these STRs were determined using STR Finder. Of the 1821 calls, 1162 reference STRs were identified and subsequently used for EH genotyping.

3.1.3 Targeted STR genotyping using EH

Targeted genotyping of the 1162 calls from EHdn was performed with EH on 3845 MSSNG samples. Genotyping produced two estimated allele sizes at each STR locus per individual. Therefore, at each repeat region 7690 alleles were generated and used to determine the size distribution of each locus in the MSSNG samples.

3.2 Identifying potentially clinically relevant STR variation in ASD

After EH genotyping, STR variants were annotated and outlier detection was applied to all genotyped loci to identify potential candidate variants. Loci selected for further analysis fell into three categories: those selected from annotation of the putative large STRs called by EHdn, STRs known to be associated with a TRD, and STRs with de novo outliers (Table 1).

Based on the annotation of STRs identified by applying EHdn to MSSNG, several loci were chosen for targeted EH genotyping. The criteria used to select STR loci included the inter- and intra-genic location, frequency of the call in affected vs. unaffected individuals, known phenotype in the Online Mendelian Inheritance in Man (OMIM) and Mammalian Phenotype Ontology (MPO) databases, probability of being loss-of-function intolerant (pLI) > 0.85, and familial inheritance pattern of STR expansions.

Several STRs are known to cause TRDs. If the genomic coordinates and repeat motif of a STR call matched a pathogenic locus, it was selected for further analysis. These loci were particularly relevant if repeat expansions beyond the pathogenic range were identified in individuals with ASD.

Loci with de novo expansions were identified by a custom outlier detection model. Briefly, an outlier expansion was defined as an allele in the right tail of the MSSNG allelic distribution for that locus that was equal to or larger than the 99th percentile length. An outlier expanded allele was considered de novo if it was carried by a proband or affected sibling and the unaffected parents both had repeat sizes smaller than the 99th percentile allele length.
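The outlier and de novo definitions above can be illustrated with a minimal sketch. It is not the custom model used in this study: the nearest-rank percentile, the function names, and the toy allele lengths are all assumptions made for illustration.

```python
def percentile_cutoff(allele_lengths, pct=99):
    """Nearest-rank percentile of repeat lengths (pct is an integer percent)."""
    ordered = sorted(allele_lengths)
    # ceil(pct * n / 100) in integer arithmetic, to avoid float rounding
    rank = max(1, -(-pct * len(ordered) // 100))
    return ordered[rank - 1]


def is_outlier(genotype, cutoff):
    """A genotype is an outlier if its longer allele reaches the cutoff."""
    return max(genotype) >= cutoff


def is_de_novo_outlier(child, mother, father, cutoff):
    """Child carries an outlier allele and neither parent does."""
    return (is_outlier(child, cutoff)
            and not is_outlier(mother, cutoff)
            and not is_outlier(father, cutoff))


# Hypothetical allele lengths for one locus (1000 alleles, not study data).
toy_lengths = [1] * 5 + [2] * 984 + [53, 59, 64, 73, 74, 75, 75, 78, 79, 84, 107]
toy_cutoff = percentile_cutoff(toy_lengths)

# A child with genotype 1/75 whose parents are 2/2 and 1/1 is flagged;
# a child whose parent also carries a long allele is not.
flagged = is_de_novo_outlier((1, 75), (2, 2), (1, 1), toy_cutoff)
inherited = is_de_novo_outlier((2, 75), (2, 75), (2, 2), toy_cutoff)
```

Using the longer allele of each genotype keeps the check independent of allele order in the genotyper output.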

3.2.1 STR variation in MSSNG compared to a control population

After initial genotyping of STR loci using EH, the MSSNG database was updated to include more samples and realigned to the latest build of the human reference genome (GRCh38). Therefore, the selected loci of interest were re-genotyped in the latest database. Given the change in build of the reference genome, coordinates of the STRs selected for targeted genotyping were lifted over and adjusted.

The selected loci were also genotyped for comparison in the 1000 Genomes control population, which consisted of 2504 DNA samples sequenced using short-read WGS and aligned to GRCh38. Case/control analysis was performed using a one-sided Fisher’s exact test at each locus, where cases were defined as children and affected siblings with ASD in MSSNG, and controls as the 1000 Genomes cohort. Larger repeat tracks of an STR located upstream of the Crystallin Beta B2 (CRYBB2) gene were seen more frequently in samples affected with ASD in MSSNG than in the 1000 Genomes control population, and this difference was statistically significant (Bonferroni-corrected p-value = 1.830 x 10^-24) (Table 2).

The GA repeat located upstream of the CRYBB2 gene was initially chosen for analysis because several children with ASD were identified to have de novo expansions at this locus (Figure 7). The majority of MSSNG samples genotyped at this locus had a repeat length of 2 (Figure 6). The cut-off of 53 repeats was defined as the outlier expansion length using the aforementioned outlier detection model. There were six families in which affected children had repeat lengths beyond the outlier cut-off and the parents did not (Figure 7). Aside from de novo expansions, a total of 20 affected samples and 22 unaffected samples were identified to have allele sizes greater than 53 (Figure 7). Fifteen of the 20 affected samples with the outlier allele length were males. Recently, a deletion impacting the CRYBB2 gene was identified in an individual affected with ASD using array-CGH analysis (Schuch, 2019). Moreover, a study of gene expression quantitative trait loci in the human cortex from the Psychiatric Genomics Consortium (ADHD, ASD, bipolar disorder, major depressive disorder and schizophrenia) identified CRYBB2 as the most significant association (Kim et al., 2014).
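The case/control comparison described above can be sketched as follows. This is an illustrative implementation of a one-sided Fisher’s exact test with Bonferroni correction using only the Python standard library; the function names are hypothetical, the choice of 22 tests assumes one test per locus in Table 2, and the statistical software actually used in this study is not specified here.

```python
from math import comb


def fisher_one_sided(case_exp, case_no, ctrl_exp, ctrl_no):
    """One-sided Fisher's exact test (enrichment in cases).

    Returns P(X >= case_exp) where X is hypergeometric: the number of
    expansion carriers that fall among cases when margins are fixed.
    """
    n_case = case_exp + case_no
    n_ctrl = ctrl_exp + ctrl_no
    n_exp = case_exp + ctrl_exp
    denom = comb(n_case + n_ctrl, n_exp)
    upper = min(n_case, n_exp)
    tail = sum(comb(n_case, k) * comb(n_ctrl, n_exp - k)
               for k in range(case_exp, upper + 1))
    # Exact integer arithmetic; converted to float only at the end.
    return tail / denom


def bonferroni(p_value, n_tests):
    """Bonferroni adjustment, capped at 1."""
    return min(1.0, p_value * n_tests)


# CRYBB2 counts from Table 2: 240 of 1469 MSSNG affected samples and
# 30 of 907 1000 Genomes samples carried an expansion.
crybb2_p = fisher_one_sided(240, 1229, 30, 877)
crybb2_adj = bonferroni(crybb2_p, 22)  # assumes 22 loci tested
```

A locus with no expansions in either group, such as NPAS1 in Table 2, gives a p-value of exactly 1 under this definition, which is consistent with the table.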

Table 1: Candidate loci detected in MSSNG

Locus    Location (hg19)        Motif      Genic location  Outlier length  Total  Unaff.  Aff.
NPAS1    19:47538450-47538479   AAG        intronic        100             5577   0       1
CACNA1A  19:13524788-13524797   AAAAG      intronic        30              5577   5       11
NAA15    4:140236142-140236156  CAGTT      intronic        17              5137   0       3
FOXP1    3:71223937-71223978    AC         intronic        48              5577   5       4
LMTK2    7:97769377-97769380    AAGG       intronic        23              5577   7       4
UVRAG    11:75526278-75526313   AGCGGCGGC  5’ UTR          10              5577   1       2
RICTOR   5:38990811-38990909    GATATATAT  intronic        24              976    1       1
SENP7    3:101083073-101083100  AAAG       intronic        40              5577   1       6
FAM120C  X:54143788-54143847    AAACT      intronic        NA              NA     1       3
DIP2B    12:50898787-50898807   CGG        5’ UTR          52              3835   1       3
HTT      4:3076605-3076667      CAG        exonic          36              5412   13      16
DMPK     19:46273464-46273523   CAG        3’ UTR          91              3835   2       2
ZNF9     3:128891469-128891500  CAGG       intronic        52              3832   3       4
ATXN8    13:70713518-70713559   CTG        exonic          95              3834   11      11
ZNF710   15:90551660-90551671   AT         intronic        78              3835   6       13
LINGO3   19:2308144-2308173     CGC        5’ UTR          53              3834   10      13
IMMP2L   7:110782314-110782317  AG         intronic        85              3835   4       4
THADA    2:43475386-43475389    AG         intronic        26              3835   15      21
CRYBB2   22:25614805-25614808   GA         upstream        53              3387   22      20
SCOC     4:141281914-141281919  AGA        intronic        52              3835   18      18
IGF2R    6:160401040-160401043  TC         intronic        86              3835   3       4
NRCAM    7:107939482-107939485  AG         intronic        83              3835   3       5
EBF2     8:25860343-25860370    CAGG       intronic        39              3835   13      18

STRs were chosen by annotation, presence of de novo outliers, or status of pathogenicity. Total refers to the number of samples genotyped per locus; Unaff. and Aff. refer to the number of unaffected and affected samples with an outlier allele.

Table 2: Outliers identified in MSSNG and 1000 Genomes

                                                MSSNG affected             1000 Genomes
Locus    Location (GRCh38)      Outlier length  Total  With exp.  No exp.  Total  With exp.  No exp.  p-value
UVRAG    11:75815233-75815268   10              4047   3          4044     2504   1          2503     1
DIP2B    12:50505004-50505024   80              4042   31         4011     2504   7          2497     0.159
ATXN8    13:70139386-70139427   95              4046   13         4033     2504   7          2497     1
ZNF710   15:90008428-90008453   78              4037   17         4020     2497   6          2491     1
CACNA1A  19:13413974-13413983   30              4047   16         4031     2504   38         2466     1
LINGO3   19:2308145-2308174     53              3751   16         3735     2504   18         2486     1
DMPK     19:45770206-45770265   91              4047   3          4044     2504   1          2503     1
NPAS1    19:47035192-47035221   100             4047   0          4047     2504   0          2504     1
THADA    2:43248165-43248250    26              4047   4047       0        2504   2504       0        1
CRYBB2   22:25218737-25218780   53              1469   240        1229     907    30         877      1.830 x 10^-24
SENP7    3:101364229-101364256  40              4047   8          4039     2504   19         2485     1
ZNF9     3:129172578-129172657  52              4047   7          4040     2504   1          2503     1
FOXP1    3:71174786-71174827    48              4047   3          4044     2504   0          2504     1
NAA15    4:139314988-139315002  17              3740   4          3736     2330   7          2323     1
SCOC     4:140360760-140360765  52              4047   29         4018     2504   7          2497     0.284
HTT      4:3074878-3074934      36              3993   10         3983     2504   5          2499     1
RICTOR   5:38990709-38990780    24              982    1          981      423    0          423      1
IGF2R    6:159980008-159980011  86              4047   5          4042     2504   0          2504     1
NRCAM    7:108299038-108299041  83              4047   2          4045     2504   1          2503     1
IMMP2L   7:111142258-111142261  85              4047   5          4042     2504   8          2496     1
LMTK2    7:98140065-98140068    23              4047   1          4046     2504   2          2502     1
EBF2     8:26002827-26002854    39              4047   36         4011     2504   16         2488     1

Outlier allele counts between cases (probands and siblings affected with ASD in MSSNG) and controls (1000 Genomes cohort) were compared using a one-sided Fisher’s exact test. P-values were adjusted for multiple comparisons using the Bonferroni correction method. Exp. refers to expansions.

[Figure 6 image not reproduced: gene track showing CRYBB2 above a histogram for locus MSSNG:22_25614805_25614808; x-axis: # Repeats; y-axis: % Allele Frequency.]

Figure 6: MSSNG allele lengths of the GA repeat upstream of CRYBB2. The relative location of the repeat expansion is highlighted in blue, adjacent to the CRYBB2 gene above the plot. The plot displays the frequency distribution of the MSSNG allele lengths of the GA repeat.

[Figure 7 image not reproduced: pedigree diagrams. Sample IDs and GA repeat genotypes are listed below, paired in the order in which they were extracted from the figure; pedigree structure, sex, and affection status are not recoverable from the text.]

2-1718-002 (2/2), 2-1718-001 (2/75), 2-1738-002 (2/2), 2-1738-001 (1/78), 2-1749-002 (2/2), 2-1749-001 (2/53), 3-0312-101 (2/2), 3-0312-100 (2/73)
2-1718-003 (2/2), 2-1738-003 (2/2), 2-1749-003 (2/2), 3-0312-000 (2/2)
3-0409-101 (2/2), 3-0409-100 (1/1), 3-0630-101 (2/75), 3-0630-100 (2/2), 5-0045-002 (2/2), 5-0045-001 (2/74), 7-0168-002 (2/2), 7-0168-001 (1/1)
3-0409-000 (1/75), 3-0630-000 (2/2), 5-0045-003 (2/2), 7-0168-003 (1/79)
0039201 (2/74), 0039202 (2/2), 024102 (1/1), 024101 (2/75), 060802 (2/2), 060801 (2/66), 076702 (2/2), 076701 (2/2)
0039303 (2/2), 024104 (1/1), 024105 (2/1), 060803 (2/2), 076704 (2/2), 076705 (2/73)

Figure 7: MSSNG pedigrees for GA repeat upstream of CRYBB2. Families with the parents and proband genotyped, and with a member having a repeat length equal to or greater than 53, are shown. Squares are males, circles are females, and filled shapes are samples with ASD. Sample IDs and allele lengths are shown.

3.3 Validation of STR genotyping pipeline

3.3.1 Validation of EHdn using long-read sequencing data

The accuracy of EHdn in calling STRs equal to or greater than 150bp was determined by comparing calls made in HuRef DNA sequenced by short-read WGS to structural variant calls from long-read sequencing data of the same sample. There were 1600 EHdn calls that were supported by 2 anchored IRRs in HuRef DNA, and 48.3% of the calls were validated in the long-read data (Table 3). One of the limitations of EHdn is that the program may generate multiple STR calls from repeats located in the same region if the repeat track is complex. Therefore, STRs that were estimated to be located within 500bp of one another were merged, resulting in 1298 calls that had a 50.9% validation rate (Table 3). There were 1077 STRs in HuRef DNA that were located near genes (<20kbp); however, proximity to genes did not impact validation significantly. EHdn is able to call STRs with motifs up to 20bp in length. By limiting analysis to repeats with 2-6bp motifs, the validation rate of these 645 loci improved to 55.3% (Table 3). There were 376 calls that fulfilled all the aforementioned requirements (merging nearby STRs, near genes, 2-6bp motifs), and 59% were validated in the long-read data (Table 3).

EHdn is able to detect STRs that are present in or absent from the human reference genome sequence. Therefore, if a sample has a large STR expansion that is an insertion relative to the reference sequence based on the repeat motif, EHdn would theoretically be able to detect such repeats. However, STR Finder and EH are only able to detect and genotype STRs that are found in the reference sequence. I restricted EHdn calls to those detected by STR Finder to determine if this would impact the validation rate, and found an increase of 10 percentage points in validation for EHdn calls supported by 2 anchored in-repeat reads (58% vs. 48%) (Table 3). Combining STR Finder filtering with merging nearby calls, restricting to calls near genes, and requiring a 2-6bp motif also improved validation (66% vs. 59%) (Table 3).

The number of STRs called by EHdn is impacted by the number of anchored IRRs required (Figure 8). Requiring EHdn to identify STRs supported by 2 anchored IRRs generated 1600 calls, whereas requiring 5 anchored IRRs reduced the number of calls to 777. However, increasing the stringency of support for calls increased the validation rate. For example, 48.3% of calls supported by 2 anchored IRRs were validated, compared to 59.6% of calls supported by 5 anchored IRRs and 67.0% of calls supported by 10 anchored IRRs. The maximum validation rate was 81.5% for calls supported by 38 anchored IRRs; however, only 27 STRs were identified at this threshold.
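
The trade-off between the number of calls and the validation rate can be tabulated with a short sketch such as the one below. It assumes that each threshold is a minimum number of anchored IRRs and that each call has already been marked as validated or not; the function name and the toy calls are hypothetical and are not study data.

```python
def validation_by_threshold(calls, thresholds):
    """Tabulate call count and validation rate per minimum-IRR threshold.

    calls: list of (anchored_irr_count, validated) tuples.
    Returns a list of (threshold, n_calls, percent_validated) tuples.
    """
    rows = []
    for threshold in thresholds:
        kept = [validated for n_irr, validated in calls if n_irr >= threshold]
        # Rate is undefined when no calls pass the threshold.
        rate = 100.0 * sum(kept) / len(kept) if kept else float("nan")
        rows.append((threshold, len(kept), rate))
    return rows


# Hypothetical calls: (number of anchored IRRs, validated in long reads?)
toy_calls = [(2, False), (3, True), (5, True), (5, False), (10, True), (12, True)]
toy_rows = validation_by_threshold(toy_calls, [2, 5, 10])
```

In this toy input the count of retained calls falls as the threshold rises while the proportion validated increases, which mirrors the pattern reported above.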

3.3.2 Validation using other STR genotyping tools

STRetch and gangSTR are two other genome-wide STR detection and genotyping methods. Validation using long-read sequencing data of HuRef DNA was performed using STRetch and gangSTR, and compared to EHdn (Figure 9, Table 3).

STRetch identifies large STRs with 2-6bp motifs. For comparison to EHdn, STRetch calls were limited to those supported by 2 reads, which resulted in 103 calls that had a 69.9% validation rate. The call set was restricted by merging nearby STRs and identifying STRs near genes, which produced 49 calls with a 69.4% validation rate. There were 20 calls that were shared between STRetch and EHdn, and the majority of these calls were validated (85%) (Table 3).

gangSTR also performs a genome-wide search for STRs, but is not specifically designed to detect large repeats. For comparison to STRetch and EHdn, gangSTR calls were limited to those with an allele equal to or greater than 150bp and with 2-6bp motifs. There were only 7 gangSTR calls which met these criteria, and 57.1% validated. Restricting calls to those near genes and merging nearby calls reduced the set to 4 calls, and 1 call was successfully validated. This gangSTR call overlapped with a call in STRetch and EHdn.

For all callers, filtering call sets for those detected by STR Finder, which represented STRs that were found in the reference sequence, improved validation rates (Figure 9). This improvement was the largest for EHdn.

3.3.3 Monozygotic vs. dizygotic twin concordance in EHdn

Validation of EHdn can also be assessed by comparing the concordance of calls between monozygotic vs. dizygotic twins. MSSNG contains 40 monozygotic twins (10 twin sets) and 10 dizygotic twins (5 twin sets). A total of 1627 EHdn calls were shared between all twin samples. Here, a concordant EHdn call is defined as an expansion detected in both or neither of the individuals of a twin set. A discordant call is an expansion detected in only one individual of a twin set. Across the 1627 STRs, there were significantly more concordant calls in monozygotic than in dizygotic twins (85.3% and 79.2%, respectively; t-test p=0.0001). Correspondingly, there were significantly fewer discordant calls in monozygotic than in dizygotic twins (14.7% and 20.8%, respectively; t-test p=0.0001).
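The concordance definition above translates directly into code. The following is a minimal sketch for a single twin pair, assuming each twin is represented by the set of loci at which EHdn called an expansion; the function name, locus labels, and data are hypothetical.

```python
def concordance_rate(twin_a_calls, twin_b_calls, loci):
    """Percent of loci where both twins, or neither twin, have a call.

    twin_a_calls, twin_b_calls: sets of locus identifiers with an
    expansion call in each twin. loci: all loci being compared.
    """
    concordant = sum(
        (locus in twin_a_calls) == (locus in twin_b_calls) for locus in loci
    )
    return 100.0 * concordant / len(loci)


# Hypothetical loci and calls for one twin pair.
toy_loci = ["L1", "L2", "L3", "L4"]
twin_a = {"L1", "L2"}
twin_b = {"L1", "L3"}
toy_concordance = concordance_rate(twin_a, twin_b, toy_loci)
toy_discordance = 100.0 - toy_concordance
```

Because every locus is either concordant or discordant under this definition, the two rates for a pair always sum to 100%, as they do for the group percentages reported above.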

[Figure 8 image not reproduced: plot titled EHdn; x-axis: # Anchored IRRs; left y-axis: % Validation Rate; right y-axis: total number of calls.]

Figure 8: Validation rate of EHdn calls supported by anchored in-repeat reads. Plot of the number of anchored in-repeat reads used to support EHdn calls (x-axis) vs. the validation rate (y-axis, left) vs. the total number of calls (y-axis, right).

Table 3: Validation of genome-wide STR genotyping tools in long-read sequencing data

EHdn  STRetch  gangSTR  Number  Validation rate (%)  Details
X                       1600    48.313               2 anchored in-repeat reads
X                       1298    50.925               2 anchored in-repeat reads; merge STRs (500bp)
X                       1077    48.561               2 anchored in-repeat reads; near gene (20kb)
X                       645     55.349               2 anchored in-repeat reads; 2-6bp motif
X                       376     59.043               2 anchored in-repeat reads; merge STRs, near gene, 2-6bp motif
X                       291     66.323               2 anchored in-repeat reads; merge STRs, near gene, 2-6bp motif, STR Finder
      X                 103     69.903               2 reads; 2-6bp motif
      X                 49      69.388               2 reads; merge STRs, near gene, 2-6bp motif
      X                 45      71.111               2 reads; merge STRs, near gene, 2-6bp motif, STR Finder
               X        7       57.143               1 allele >= 150bp; 2-6bp motif
               X        4       25                   1 allele >= 150bp; 2-6bp motif, merge STRs, near gene
               X        3       33.333               1 allele >= 150bp; 2-6bp motif, merge STRs, near gene, STR Finder
X     X                 20      55                   EHdn validation (2 anchored in-repeat reads, merge STRs, near gene, 2-6bp motif)
X     X                 20      85                   STRetch validation (2 reads, merge STRs, near gene, 2-6bp motif)
      X        X        1       100                  gangSTR validation (1 allele >= 150bp, 2-6bp motif, merge STRs, near gene); STRetch validation
X              X        2       50                   EHdn validation (2 anchored in-repeat reads, merge STRs, near gene, 2-6bp motif); gangSTR validation
X     X        X        1       100                  EHdn validation, STRetch validation, gangSTR validation

EHdn, STRetch and gangSTR are genome-wide STR detection and genotyping tools which rely on short-read WGS data. Calls made by the various STR callers were validated using long-read sequencing data of HuRef DNA. A call was validated if it was supported by a structural variant caller. Merged STRs refers to the combining of multiple calls that are within 500bp of one another. Near gene refers to calls that are within 20kbp of a gene.
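The merging step defined in the Table 3 legend (combining calls within 500bp of one another) can be sketched as a simple interval merge. This is an illustration only, not the code used in this study: the tuple layout, the function name, and the toy coordinates are assumptions, and the gap is measured from the end of one call to the start of the next.

```python
def merge_nearby_calls(calls, max_gap=500):
    """Merge calls on the same chromosome separated by at most max_gap bp.

    calls: list of (chrom, start, end) tuples with start <= end.
    Returns merged intervals sorted by chromosome name and start.
    """
    merged = []
    for chrom, start, end in sorted(calls):
        if merged and merged[-1][0] == chrom and start - merged[-1][2] <= max_gap:
            prev_chrom, prev_start, prev_end = merged[-1]
            # Extend the previous interval rather than starting a new one.
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged


# Hypothetical approximate call positions.
toy_calls = [
    ("chr1", 1500, 1700),
    ("chr1", 1000, 1150),
    ("chr1", 5000, 5200),
    ("chr2", 1100, 1300),
]
toy_merged = merge_nearby_calls(toy_calls)
```

Here the first two chr1 calls are 350bp apart and collapse into one interval, while the third chr1 call and the chr2 call stay separate.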

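[Figure 9 image not reproduced: Venn diagram of calls from the three tools. Validation rates before (after) STR Finder filtering: EHdn 59% (66%), STRetch 69% (71%), gangSTR 25% (33%). Call counts: EHdn only, 355; EHdn and STRetch only, 19; STRetch only, 29; the four regions involving gangSTR are labeled 1, 2, 1, and 0.]

Figure 9: Comparison of genome-wide STR genotyping tools. The performance of EHdn, STRetch and gangSTR in detecting STRs was assessed by comparing calls to structural variants found in the long-read sequencing data of HuRef DNA. The validation rate of each tool is shown before and after filtering calls with STR Finder (in parentheses). The number of calls detected by each tool, after filtering, is shown.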

Chapter 4

Discussion

4.1 Overview of results

This study is the first genome-wide screen of STR variation in a large ASD WGS database. These findings suggest that STR detection and genotyping tools can be used to identify potentially novel risk loci and to recover known risk loci in ASD.

Given the established role of some STR variation in ASD, I developed a STR genotyping pipeline, consisting of existing published and unpublished tools and my own novel tools, to screen the MSSNG ASD genomic database for novel STR variation genome-wide. Using the STR genotyping pipeline, specifically EHdn, I detected 4925 potential STR loci that had a repeat track at least 150 bp in length in a MSSNG sample. From these calls, a subset of 1162 loci was targeted for genotyping in MSSNG using EH after the coordinates and motif of each STR were identified in the human reference genome by STR Finder. After EH genotyping was performed, the allelic distribution of each locus in MSSNG was determined and analyzed to identify loci of interest.

Potential candidate loci for ASD were selected based on annotation of the variant, whether the variant was known to be associated with a TRD, or whether the region had de novo outliers in ASD samples. Based on the inter- and intra-genic location, frequency in affected vs. unaffected samples, pLI, associated phenotypes, and the inheritance pattern of variants determined by annotation, several loci were selected for follow-up, which included variants in the NPAS1, CACNA1A, NAA15, FOXP1, LMTK2, UVRAG, RICTOR, SENP7 and FAM120C genes. Interestingly, previous studies have implicated NPAS1, CACNA1A, NAA15, FOXP1, RICTOR, and FAM120C as candidate ASD risk genes (Cheng et al., 2018, Chiocchetti et al., 2014, Damaj et al., 2015, De Rubeis et al., 2014, Hamdan et al., 2010, Lozano et al., 2015, O’Roak et al., 2011, Palumbo et al., 2013, Stanco et al., 2014, Stessman et al., 2017, Yuen et al., 2017). In this study, the identification of STR variation in these genes is supportive of their suggested involvement in ASD genetic etiology. Some of the other loci of interest were also found to be associated with clinical phenotypes. For example, SENP7 is associated with malaria and Stargardt disease, and UVRAG with cholecystolithiasis and ascending colon cancer. The identification of STR variation in a cancer-associated gene is unsurprising since STRs located within cancer genes are known to be significantly more unstable than those in genes not linked to cancer (Hause et al., 2016).

Several TRD genes were found to have repeat tracks that were near or beyond the pathogenic range in MSSNG samples, which included the DIP2B, HTT, DMPK, ZNF9, and ATXN8 genes. DM1 has previously been linked to ASD, and the finding that two ASD probands in MSSNG carried STR variants beyond the pathogenic threshold provides further evidence that DM1 is comorbid with ASD (Ekström et al., 2008). Huntington’s disease, mental retardation associated with FRA12A, myotonic dystrophy 2, and spinocerebellar ataxia 8 have not been previously linked to ASD. Since some TRD loci are established ASD-risk genes, such as FMR1, which is the disease locus for Fragile X syndrome, the leading known monogenic form of ASD, other TRDs may also contribute to the genetic architecture of ASD (Wang et al., 2010). Therefore, the TRD loci identified in this study, which have repeat lengths above the pathogenic threshold in ASD samples, may contribute to susceptibility and be novel disease associations of ASD.

In this study, large GA expansions upstream of the CRYBB2 gene were identified significantly more often in individuals with ASD from the MSSNG dataset than in the 1000 Genomes control cohort. While this finding may suggest that CRYBB2 could be a potential ASD risk locus, there are several study limitations that hinder a conclusive result.
For example, in the MSSNG cohort, there were more unaffected individuals (22) identified with the expansion than affected individuals (20), suggesting that the GA expansion does not increase risk for ASD. While the expansion was seen significantly more frequently in the MSSNG affected children than in the control cohort, this study was limited in that the 1000 Genomes and MSSNG collections were not matched. Matching case-control subjects for factors such as sex, age, ethnicity, and DNA source type may limit confounding. Previously, a study of gene expression quantitative trait loci in the human cortex from the Psychiatric Genomics Consortium (ADHD, ASD, bipolar disorder, major depressive disorder and schizophrenia) identified CRYBB2 as the most significant association (Kim et al., 2014). The findings of this study highlight the need to resolve the potential role of the CRYBB2 locus in ASD.

To assess the ability of EHdn to accurately detect large repeats in WGS data, validation was performed by comparing EHdn STR calls to structural variants called in long-read HuRef sequencing data. The majority of EHdn calls were validated (59%). The specificity of EHdn was comparable to, but lower than, that of another genome-wide large STR detection tool, STRetch (59% vs. 69.4%, respectively). EHdn was the more sensitive detection tool, with 222 EHdn calls being successfully validated compared to 34 STRetch calls. Therefore, EHdn is more suited for genome-wide STR detection than STRetch because it can detect a greater number of STRs that successfully validate in long-read sequencing data. gangSTR was not well-suited for these purposes because the tool was only able to detect 4 validated large STRs. Restricting the call sets from each tool to calls detected by STR Finder improved the validation rate in all instances. This implies that the accuracy of STR genotyping tools can be improved by filtering for calls found in the human reference sequence. Validation of EHdn was also analyzed by comparing the concordance of calls in monozygotic vs. dizygotic twins. Monozygotic twins were more concordant for calls than dizygotic twins, and these findings support the utility of EHdn as an accurate STR detection tool.

4.2 Current study limitations

There were several limitations in this study relating to the components of the STR genotyping pipeline and validation using long-read sequencing data.

With regard to the STR genotyping pipeline, EHdn is an imprecise STR detection tool, and this results in errors in calls. For example, EHdn uses read-pair information from short-read WGS data to approximate the location and motif of a STR. If a locus in the genome contains an imperfect repeat track such that adjacent motifs are not exact matches, EHdn will estimate the motif of that STR to be the most commonly represented repeat, or any one of the repeat motifs if there is no single most common motif. If the reads that are used to support an STR call contain sequencing errors, this may lead to EHdn incorrectly assigning the STR motif. Moreover, EHdn uses reads which map to the nonrepetitive flanking sequences adjacent to a repeat region to approximate the location of STRs. However, depending on the inner distance between two paired-end reads where one mate maps to nonrepetitive sequence and the other mate maps inside the repeat region, the genomic coordinates assigned by EHdn may be several hundred bp away from the actual STR location. Targeted genotyping of STR loci requires precise input of the coordinates and motif of a repeat region, and the imprecision of EHdn may result in STRs being missed.

STR Finder was a tool I developed to convert the output of EHdn to serve as input for EH, by identifying the exact coordinates and sequence composition of STRs in the human reference sequence from what was approximated by EHdn. One of the advantages of EHdn compared to other STR genotyping tools is that it is able to detect repeats that are not found in the human reference sequence. However, STR Finder relies on the human reference sequence to identify the exact coordinates and composition of STRs, so it is unable to locate STRs that are absent from the reference sequence. Another limitation of STR Finder is that the program is currently only able to identify simple and perfect repeat tracks, which are STRs composed of repeats that are uninterrupted by non-motif sequence. Therefore, if a repeat track contains a variant such as an indel or SNV, STR Finder would not capture the entire track and would end the repeat immediately before the interruption. In this case, the repeat specified by STR Finder would be shorter than the actual repeat track.

EH is a targeted genotyping tool that was utilized in this study, but it has several drawbacks. First, the program is only able to genotype STRs that consist of the reference repeat unit. However, previous studies have found that expansions of nonreference STRs can lead to disorders, such as benign adult familial myoclonic epilepsy, which is caused by an inserted TTTCA sequence not found in the reference genome (Ishiura et al., 2018). Therefore, this limitation of EH can result in the omission of clinically relevant variants from analysis. Moreover, the version of EH used in this study can only genotype perfect repeat tracks. Complex repeat tracks are STRs that consist of more than one repeat motif and that may be interrupted by nonrepeat sequence, such as in Huntington’s disease, where the pathogenic repeat track is (CAG)n-CAA-CAG-CCG-CCA-(CCG)n in exon 1 of HTT. In Huntington’s disease, variation of the sequences between the CAG and CCG repeats is associated with earlier age of onset and increased somatic repeat instability (Wright, 2019). Therefore, the inability of EH to capture complex STR variation may result in functionally relevant phenomena impacting clinical phenotypes being missed.

A major component of validation for the STR genotyping pipeline was determining the concordance between an STR call in short-read sequencing data and structural variant calls in long-read sequencing data. However, this method is rudimentary because, while an EHdn call is considered validated if it is supported by a structural variant call such as an insertion, the inserted sequence may not reflect the STR motif. For additional validation, the inserted sequence was checked to see if it contained at least one copy of the EHdn motif. However, the motif may not be present because of the high error rate of long-read sequencing, because the inserted sequence was missed by structural variant callers, because the STR is not an insertion relative to the reference, or because the EHdn call was a false positive.
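The perfect-repeat limitation of STR Finder described earlier in this section can be made concrete with a small sketch. The code below is not the STR Finder implementation; it is a hypothetical scanner that extends an uninterrupted run of a motif in both directions from a seed position, with all names and the toy sequence chosen for illustration.

```python
def perfect_repeat_span(seq, start, motif):
    """Extend a perfect, uninterrupted run of motif around start.

    Returns (begin, end, copies) with a half-open interval [begin, end),
    or None if seq does not contain motif at position start.
    """
    m = len(motif)
    if seq[start:start + m] != motif:
        return None
    begin = start
    # Walk left one motif at a time while the match is exact.
    while begin - m >= 0 and seq[begin - m:begin] == motif:
        begin -= m
    end = start + m
    # Walk right; a short slice at the sequence end never equals motif.
    while seq[end:end + m] == motif:
        end += m
    return begin, end, (end - begin) // m


# Toy sequence: five CAG copies, a CAA interruption, then three more CAG.
toy_seq = "TTT" + "CAG" * 5 + "CAA" + "CAG" * 3 + "GGG"
first_run = perfect_repeat_span(toy_seq, 3, "CAG")
second_run = perfect_repeat_span(toy_seq, 21, "CAG")
```

In the toy sequence the scan reports two separate runs of five and three copies, so a single interrupting CAA causes the reported repeat to be shorter than the full nine-unit track.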

More comprehensive and accurate validation could involve targeted long-read sequencing or experimental validation with PCR, gel electrophoresis, and Southern blot analysis; however, this would be very inefficient in terms of cost and labor at a genome-wide scale.

One of the challenges of this study involved selecting potential candidate loci for further genotyping and analysis. Loci were chosen based on annotation, known TRD genes, and an outlier detection model. For loci selected from annotation, one of the criteria used was a high probability of being loss-of-function intolerant (pLI). This metric, however, limits analysis to STR variation found in the coding region of genes. Moreover, while EHdn is designed to identify large STRs, the entire pipeline allows for genome-wide genotyping. Therefore, users of this pipeline would be able to identify STR variation in the form of both expansions and contractions. The focus of this research was placed on identifying STR expansions in individuals with ASD, which may have led to clinically relevant variation being missed. In fact, while most repeat expansion disorders are caused by expansions that occur in the germ line, large contractions have been seen in individuals with TRDs including FRAXA, SCA8 and FRDA (Castel et al., 2010).

Furthermore, the overall aim of this study was to identify STR variation, and therefore other forms of variation and mutation were not investigated. This was short-sighted since the majority of the potential candidate loci identified were not significantly enriched for STR expansions in MSSNG when compared to the control cohort. Consideration of other variation such as SNVs and indels alongside STR expansions and contractions would have improved the ability to identify candidate loci by screening for loci with multiple hits. Increasing the scope of this study to look for other forms of variation could improve the statistical power to identify ASD risk loci.

4.3 Overall summary and impact of work

In summary, a STR genotyping pipeline consisting of EHdn, STR Finder, and EH was developed to perform genome-wide STR detection and genotyping in 3835 WGS samples from the MSSNG ASD genomic database of children affected with ASD and their unaffected family members. In MSSNG, EHdn called 4925 unique loci that were at least 150bp in length in at least one sample. From these calls, 1162 loci were targeted for genotyping. From annotation, analysis of the allelic distribution, and outlier detection, several loci were chosen for further analysis, which represented known ASD-risk loci, TRD genes, susceptibility loci for other disorders, and genes not previously associated with a clinical phenotype. These findings highlight that STR genotyping in a large ASD database can lead to the discovery of STR variation which may contribute to ASD risk.

This research is critical to dissecting the genetic etiology of ASD. The identification of STR variation in ASD may reveal new risk loci and explain a portion of the missing heritability. This workflow can potentially be used in clinical diagnostic testing for high-throughput screening of ASD. Improved screening may lead to earlier detection and treatment intervention, resulting in better outcomes for individuals with ASD.

Chapter 5

Future Directions

5.1 Application of STR genotyping pipeline to other ASD genomic databases

The STR genotyping pipeline developed in this study can be applied to other databases to perform genome-wide STR genotyping. Here, STR genotyping was performed in MSSNG, a WGS resource for ASD. In order to improve the statistical power to detect novel STR variation in ASD, STR genotyping should be performed in another ASD WGS database, such as the Simons Simplex Collection (SSC). The SSC is a repository of WGS data from 1790 ASD simplex families, each of which contains one child affected with ASD and unaffected parents and siblings (Fischbach and Lord, 2010). STR genotyping in SSC can provide further support for potential candidate STR loci detected in MSSNG that are relevant to ASD. For example, a de novo STR variant that was detected in multiple ASD families across MSSNG and SSC would suggest its involvement in ASD susceptibility. Furthermore, some STR loci that were screened in MSSNG may not have been selected for further investigation, potentially because expansions were more frequent in unaffected than in affected individuals. Screening the same STR loci found in this study in the SSC may identify more ASD individuals with expanded repeat tracks, and improve the statistical power to identify regions as potential candidate ASD loci. The SSC genomic dataset contains sample DNA of unaffected siblings of children with ASD, which can be used as a control set for investigation of STR variation. The ability to compare rates of STR variation between children affected with ASD and a matched control set without ASD would improve the statistical power of this study to identify candidate loci.

5.2 Application of STR genotyping pipeline to other disorders

STR variation is known to be the genetic cause of several Mendelian and polygenic disorders, and is expected to contribute to the missing heritability of other disorders. The STR genotyping pipeline developed in this study should be applied to disorders other than ASD to investigate the involvement of STR polymorphism in disease genetic etiology.

Several components of the STR genotyping pipeline have been utilized to identify clinically relevant STR variation. Recently, EHdn was used to identify an expanded intronic pentanucleotide STR in the Replication Factor C Subunit 1 (RFC1) gene which causes cerebellar ataxia with neuropathy and bilateral vestibular areflexia syndrome (Rafehi, 2019). Furthermore, a new version of EH was utilized in Huntington’s disease to determine that expansion of the exonic CAG track in HTT by a loss of interruption mutation leads to earlier age of onset (Wright, 2019). Expansion Hunter was also used to identify GCA expansions in the 5’ UTR of the glutaminase (GLS) gene, which cause glutaminase deficiency (van Kuilenburg et al., 2019).

Repeat variation has been implicated in other NDDs such as ADHD (Franke et al., 2008, Johnson et al., 2008). Therefore, this STR genotyping pipeline can be used to investigate other NDDs such as motor and tic disorders and language communication disorders. Genome-wide screening of STR variation should also be performed in other neurologic disorders such as epilepsy and schizophrenia, given their high rate of comorbidity with ASD.

Bibliography

Aird, D., Ross, M. G., Chen, W.-S., Danielsson, M., Fennell, T., Russ, C., Jaffe, D. B., Nusbaum, C., and Gnirke, A. (2011). Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries. Genome Biology, 12(2):R18.

Amiet, C., Gourfinkel-An, I., Bouzamondo, A., Tordjman, S., Baulac, M., Lechat, P., Mottron, L., and Cohen, D. (2008). Epilepsy in Autism is Associated with Intellectual Disability and Gender: Evidence from a Meta-Analysis. Biological Psychiatry, 64(7):577–582.

Bai, D., Yip, B. H. K., Windham, G. C., Sourander, A., Francis, R., Yoffe, R., Glasson, E., Mahjani, B., Suominen, A., Leonard, H., Gissler, M., Buxbaum, J. D., Wong, K., Schendel, D., Kodesh, A., Breshnahan, M., Levine, S. Z., Parner, E. T., Hansen, S. N., Hultman, C., Reichenberg, A., and Sandin, S. (2019). Association of Genetic and Environmental Factors With Autism in a 5-Country Cohort. JAMA Psychiatry.

Bailey, A., Le Couteur, A., Gottesman, I., Bolton, P., Simonoff, E., Yuzda, E., and Rutter, M. (1995). Autism as a strongly genetic disorder: Evidence from a British twin study. Psychological Medicine, 25(1):63–77.

Baio, J. (2018). Prevalence of Autism Spectrum Disorder Among Children Aged 8 Years — Autism and Developmental Disabilities Monitoring Network, 11 Sites, United States, 2014. MMWR. Surveillance Summaries, 67.

Bergbaum, A. and Ogilvie, C. M. (2016). Autism and chromosome abnormalities: A review. Clinical Anatomy, 29(5):620–627.

Betancur, C. (2011). Etiological heterogeneity in autism spectrum disorders: More than 100 genetic and genomic disorders and still counting. Brain Research, 1380:42–77.

Bishop, D. V. M., Whitehouse, A. J. O., Watt, H. J., and Line, E. A. (2008). Autism and diagnostic substitution: Evidence from a study of adults with a history of developmental language disorder. Developmental Medicine & Child Neurology, 50(5):341–345.

Bolton, P., Macdonald, H., Pickles, A., Rios, P., Goode, S., Crowson, M., Bailey, A., and Rutter, M. (1994). A case-control family history study of autism. Journal of Child Psychology and Psychiatry, and Allied Disciplines, 35(5):877–900.

Cardoso, I. L. and Marques, V. (2018). Trinucleotide repeat diseases - anticipation diseases. Journal of Clinical Genetics and Genomics, 1(1).

Castel, A. L., Cleary, J. D., and Pearson, C. E. (2010). Repeat instability as the basis for human diseases and as a potential target for therapy. Nature Reviews Molecular Cell Biology, 11(3):165–170.

Cheng, H., Dharmadhikari, A. V., Varland, S., Ma, N., Domingo, D., Kleyner, R., Rope, A. F., Yoon, M., Stray-Pedersen, A., Posey, J. E., Crews, S. R., Eldomery, M. K., Akdemir, Z. C., Lewis, A. M., Sutton, V. R., Rosenfeld, J. A., Conboy, E., Agre, K., Xia, F., Walkiewicz, M., Longoni, M., High, F. A., van Slegtenhorst, M. A., Mancini, G. M. S., Finnila, C. R., van Haeringen, A., den Hollander, N., Ruivenkamp, C., Naidu, S., Mahida, S., Palmer, E. E., Murray, L., Lim, D., Jayakar, P., Parker, M. J., Giusto, S., Stracuzzi, E., Romano, C., Beighley, J. S., Bernier, R. A., Küry, S., Nizon, M., Corbett, M. A., Shaw, M., Gardner, A., Barnett, C., Armstrong, R., Kassahn, K. S., Van Dijck, A., Vandeweyer, G., Kleefstra, T., Schieving, J., Jongmans, M. J., de Vries, B. B. A., Pfundt, R., Kerr, B., Rojas, S. K., Boycott, K. M., Person, R., Willaert, R., Eichler, E. E., Kooy, R. F., Yang, Y., Wu, J. C., Lupski, J. R., Arnesen, T., Cooper, G. M., Chung, W. K., Gecz, J., Stessman, H. A. F., Meng, L., and Lyon, G. J. (2018). Truncating Variants in NAA15 Are Associated with Variable Levels of Intellectual Disability, Autism Spectrum Disorder, and Congenital Anomalies. The American Journal of Human Genetics, 102(5):985–994.

Chiocchetti, A. G., Bour, H. S., and Freitag, C. M. (2014). Glutamatergic candidate genes in autism spectrum disorder: An overview. Journal of Neural Transmission, 121(9):1081–1106.

Ciotti, P., Di Maria, E., Bellone, E., Ajmar, F., and Mandich, P. (2004). Triplet Repeat Primed PCR (TP PCR) in Molecular Diagnostic Testing for Friedreich Ataxia. The Journal of Molecular Diagnostics, 6(4):285–289.

Colvert, E., Tick, B., McEwen, F., Stewart, C., Curran, S. R., Woodhouse, E., Gillan, N., Hallett, V., Lietz, S., Garnett, T., Ronald, A., Plomin, R., Rijsdijk, F., Happé, F., and Bolton, P. (2015). Heritability of Autism Spectrum Disorder in a UK Population-Based Twin Sample. JAMA Psychiatry, 72(5):415–423.

Damaj, L., Lupien-Meilleur, A., Lortie, A., Riou, É., Ospina, L. H., Gagnon, L., Vanasse, C., and Rossignol, E. (2015). CACNA1A haploinsufficiency causes cognitive impairment, autism and epileptic encephalopathy with mild cerebellar symptoms. European Journal of Human Genetics, 23(11):1505–1512.

Dashnow, H., Lek, M., Phipson, B., Halman, A., Sadedin, S., Lonsdale, A., Davis, M., Lamont, P., Clayton, J. S., Laing, N. G., MacArthur, D. G., and Oshlack, A. (2018). STRetch: Detecting and discovering pathogenic short tandem repeat expansions. Genome Biology, 19(1).

De Rubeis, S., He, X., Goldberg, A. P., Poultney, C. S., Samocha, K., Ercument Cicek, A., Kou, Y., Liu, L., Fromer, M., Walker, S., Singh, T., Klei, L., Kosmicki, J., Fu, S.-C., Aleksic, B., Biscaldi, M., Bolton, P. F., Brownfeld, J. M., Cai, J., Campbell, N. G., Carracedo, A., Chahrour, M. H., Chiocchetti, A. G., Coon, H., Crawford, E. L., Crooks, L., Curran, S. R., Dawson, G., Duketis, E., Fernandez, B. A., Gallagher, L., Geller, E., Guter, S. J., Sean Hill, R., Ionita-Laza, I., Jimenez Gonzalez, P., Kilpinen, H., Klauck, S. M., Kolevzon, A., Lee, I., Lei, J., Lehtimäki, T., Lin, C.-F., Ma’ayan, A., Marshall, C. R., McInnes, A. L., Neale, B., Owen, M. J., Ozaki, N., Parellada, M., Parr, J. R., Purcell, S., Puura, K., Rajagopalan, D., Rehnström, K., Reichenberg, A., Sabo, A., Sachse, M., Sanders, S. J., Schafer, C., Schulte-Rüther, M., Skuse, D., Stevens, C., Szatmari, P., Tammimies, K., Valladares, O., Voran, A., Wang, L.-S., Weiss, L. A., Jeremy Willsey, A., Yu, T. W., Yuen, R. K. C., The DDD Study, Homozygosity Mapping Collaborative for Autism, Uk10k Consortium, The Autism Sequencing Consortium, Cook, E. H., Freitag, C. M., Gill, M., Hultman, C. M., Lehner, T., Palotie, A., Schellenberg, G. D., Sklar, P., State, M. W., Sutcliffe, J. S., Walsh, C. A., Scherer, S. W., Zwick, M. E., Barrett, J. C., Cutler, D. J., Roeder, K., Devlin, B., Daly, M. J., and Buxbaum, J. D. (2014). Synaptic, transcriptional and chromatin genes disrupted in autism. Nature, 515(7526):209–215.

Devlin, B. and Scherer, S. W. (2012). Genetic architecture in autism spectrum disorder. Current Opinion in Genetics & Development, 22(3):229–237.

Dolzhenko, E., van Vugt, J. J., Shaw, R. J., Bekritsky, M. A., van Blitterswijk, M., Narzisi, G., Ajay, S. S., Rajan, V., Lajoie, B. R., Johnson, N. H., Kingsbury, Z., Humphray, S. J., Schellevis, R. D., Brands, W. J., Baker, M., Rademakers, R., Kooyman, M., Tazelaar, G. H., van Es, M. A., McLaughlin, R., Sproviero, W., Shatunov, A., Jones, A., Al Khleifat, A., Pittman, A., Morgan, S., Hardiman, O., Al-Chalabi, A., Shaw, C., Smith, B., Neo, E. J., Morrison, K., Shaw, P. J., Reeves, C., Winterkorn, L., Wexler, N. S., Housman, D. E., Ng, C. W., Li, A. L., Taft, R. J., van den Berg, L. H., Bentley, D. R., Veldink, J. H., and Eberle, M. A. (2017). Detection of long repeat expansions from PCR-free whole-genome sequence data. Genome Research, 27(11):1895–1903.

Ekström, A.-B., Hakenäs-Plate, L., Samuelsson, L., Tulinius, M., and Wentz, E. (2008). Autism spectrum conditions in myotonic dystrophy type 1: A study on 57 individuals with congenital and childhood forms. American Journal of Medical Genetics Part B: Neuropsychiatric Genetics, 147B(6):918–926.

Ellegren, H. (2004). Microsatellites: Simple sequences with complex evolution. Nature Reviews Genetics, 5(6):435.

Fischbach, G. D. and Lord, C. (2010). The Simons Simplex Collection: A Resource for Identification of Autism Genetic Risk Factors. Neuron, 68(2):192–195.

Folstein, S. and Rutter, M. (1977). Infantile autism: A genetic study of 21 twin pairs. Journal of Child Psychology and Psychiatry, and Allied Disciplines, 18(4):297–321.

Fondon, J. W., Hammock, E. A. D., Hannan, A. J., and King, D. G. (2008). Simple sequence repeats: Genetic modulators of brain function and behavior. Trends in Neurosciences, 31(7):328–334.

Franke, B., Hoogman, M., Vasquez, A. A., Heister, J. G. A. M., Savelkoul, P. J., Naber, M., Scheffer, H., Kiemeney, L. A., Kan, C. C., Kooij, J. J. S., and Buitelaar, J. K. (2008). Association of the dopamine transporter (SLC6A3/DAT1) gene 9–6 haplotype with adult ADHD. American Journal of Medical Genetics Part B: Neuropsychiatric Genetics, 147B(8):1576–1579.

Frazier, T. W., Thompson, L., Youngstrom, E. A., Law, P., Hardan, A. Y., Eng, C., and Morris, N. (2014). A Twin Study of Heritable and Shared Environmental Contributions to Autism. Journal of Autism and Developmental Disorders, 44(8):2013–2025.

Gemayel, R., Vinces, M. D., Legendre, M., and Verstrepen, K. J. (2010). Variable tandem repeats accelerate evolution of coding and regulatory sequences. Annual Review of Genetics, 44:445–477.

Georgiades, S., Szatmari, P., Boyle, M., Hanna, S., Duku, E., Zwaigenbaum, L., Bryson, S., Fombonne, E., Volden, J., Mirenda, P., Smith, I., Roberts, W., Vaillancourt, T., Waddell, C., Bennett, T., Thompson, A., and Pathways in ASD Study Team (2013). Investigating phenotypic heterogeneity in children with autism spectrum disorder: A factor mixture modeling approach: ASD factor mixture model. Journal of Child Psychology and Psychiatry, 54(2):206–215.

Grønborg, T. K., Schendel, D. E., and Parner, E. T. (2013). Recurrence of autism spectrum disorders in full- and half-siblings and trends over time: A population-based cohort study. JAMA Pediatrics, 167(10):947–953.

Gymrek, M., Golan, D., Rosset, S., and Erlich, Y. (2012). lobSTR: A short tandem repeat profiler for personal genomes. Genome Research, 22(6):1154–1162.

Haberman, Y., Amariglio, N., Rechavi, G., and Eisenberg, E. (2008). Trinucleotide repeats are prevalent among cancer-related genes. Trends in Genetics, 24(1):14–18.

Hamdan, F. F., Daoud, H., Rochefort, D., Piton, A., Gauthier, J., Langlois, M., Foomani, G., Dobrzeniecka, S., Krebs, M.-O., Joober, R., Lafrenière, R. G., Lacaille, J.-C., Mottron, L., Drapeau, P., Beauchamp, M. H., Phillips, M. S., Fombonne, E., Rouleau, G. A., and Michaud, J. L. (2010). De Novo Mutations in FOXP1 in Cases with Intellectual Disability, Autism, and Language Impairment. The American Journal of Human Genetics, 87(5):671–678.

Hannan, A. J. (2010). Tandem repeat polymorphisms: Modulators of disease susceptibility and candidates for ‘missing heritability’. Trends in Genetics, 26(2):59–65.

Hannan, A. J. (2018). Tandem Repeats and Repeatomes: Delving Deeper into the ‘Dark Matter’ of Genomes. EBioMedicine, 31:3–4.

Hause, R. J., Pritchard, C. C., Shendure, J., and Salipante, S. J. (2016). Classification and characterization of microsatellite instability across 18 cancer types. Nature Medicine, 22(11):1342–1350.

Hiromine, Y., Ikegami, H., Fujisawa, T., Nojima, K., Kawabata, Y., Noso, S., Asano, K., Fukai, A., and Ogihara, T. (2007). Trinucleotide repeats of programmed cell death-1 gene are associated with susceptibility to type 1 diabetes mellitus. Metabolism, 56(7):905–909.

Höweler, C. J., Busch, H. F. M., Geraedts, J. P. M., Niermeijer, M. F., and Staal, A. (1989). Anticipation in myotonic dystrophy: Fact or fiction? Brain, 112(3):779–797.

Iossifov, I., O’Roak, B. J., Sanders, S. J., Ronemus, M., Krumm, N., Levy, D., Stessman, H. A., Witherspoon, K. T., Vives, L., Patterson, K. E., Smith, J. D., Paeper, B., Nickerson, D. A., Dea, J., Dong, S., Gonzalez, L. E., Mandell, J. D., Mane, S. M., Murtha, M. T., Sullivan, C. A., Walker, M. F., Waqar, Z., Wei, L., Willsey, A. J., Yamrom, B., Lee, Y.-h., Grabowska, E., Dalkic, E., Wang, Z., Marks, S., Andrews, P., Leotta, A., Kendall, J., Hakker, I., Rosenbaum, J., Ma, B., Rodgers, L., Troge, J., Narzisi, G., Yoon, S., Schatz, M. C., Ye, K., McCombie, W. R., Shendure, J., Eichler, E. E., State, M. W., and Wigler, M. (2014). The contribution of de novo coding mutations to autism spectrum disorder. Nature, 515(7526):216–221.

Ishiura, H., Doi, K., Mitsui, J., Yoshimura, J., Matsukawa, M. K., Fujiyama, A., Toyoshima, Y., Kakita, A., Takahashi, H., Suzuki, Y., Sugano, S., Qu, W., Ichikawa, K., Yurino, H., Higasa, K., Shibata, S., Mitsue, A., Tanaka, M., Ichikawa, Y., Takahashi, Y., Date, H., Matsukawa, T., Kanda, J., Nakamoto, F. K., Higashihara, M., Abe, K., Koike, R., Sasagawa, M., Kuroha, Y., Hasegawa, N., Kanesawa, N., Kondo, T., Hitomi, T., Tada, M., Takano, H., Saito, Y., Sanpei, K., Onodera, O., Nishizawa, M., Nakamura, M., Yasuda, T., Sakiyama, Y., Otsuka, M., Ueki, A., Kaida, K.-i., Shimizu, J., Hanajima, R., Hayashi, T., Terao, Y., Inomata-Terada, S., Hamada, M., Shirota, Y., Kubota, A., Ugawa, Y., Koh, K., Takiyama, Y., Ohsawa-Yoshida, N., Ishiura, S., Yamasaki, R., Tamaoka, A., Akiyama, H., Otsuki, T., Sano, A., Ikeda, A., Goto, J., Morishita, S., and Tsuji, S. (2018). Expansions of intronic TTTCA and TTTTA repeats in benign adult familial myoclonic epilepsy. Nature Genetics, 50(4):581.

Jiang, Y.-h., Yuen, R. K. C., Jin, X., Wang, M., Chen, N., Wu, X., Ju, J., Mei, J., Shi, Y., He, M., Wang, G., Liang, J., Wang, Z., Cao, D., Carter, M. T., Chrysler, C., Drmic, I. E., Howe, J. L., Lau, L., Marshall, C. R., Merico, D., Nalpathamkalam, T., Thiruvahindrapuram, B., Thompson, A., Uddin, M., Walker, S., Luo, J., Anagnostou, E., Zwaigenbaum, L., Ring, R. H., Wang, J., Lajonchere, C., Wang, J., Shih, A., Szatmari, P., Yang, H., Dawson, G., Li, Y., and Scherer, S. W. (2013). Detection of Clinically Relevant Genetic Variants in Autism Spectrum Disorder by Whole-Genome Sequencing. The American Journal of Human Genetics, 93(2):249–263.

Johnson, K. A., Kelly, S. P., Robertson, I. H., Barry, E., Mulligan, A., Daly, M., Lambert, D., McDonnell, C., Connor, T. J., Hawi, Z., Gill, M., and Bellgrove, M. A. (2008). Absence of the 7-repeat variant of the DRD4 VNTR is associated with drifting sustained attention in children with ADHD but not in controls. American Journal of Medical Genetics Part B: Neuropsychiatric Genetics, 147B(6):927–937.

Karlin, S. and Burge, C. (1996). Trinucleotide repeats and long homopeptides in genes and proteins associated with nervous system disease and development. Proceedings of the National Academy of Sciences of the United States of America, 93(4):1560–1565.

Kim, Y., Xia, K., Tao, R., Giusti-Rodriguez, P., Vladimirov, V., van den Oord, E., and Sullivan, P. F. (2014). A meta-analysis of gene expression quantitative trait loci in brain. Translational Psychiatry, 4(10):e459.

Lam, H. Y. K., Clark, M. J., Chen, R., Chen, R., Natsoulis, G., O’Huallachain, M., Dewey, F. E., Habegger, L., Ashley, E. A., Gerstein, M. B., Butte, A. J., Ji, H. P., and Snyder, M. (2012). Performance comparison of whole-genome sequencing platforms. Nature Biotechnology, 30(1):78–82.

Levy, S., Sutton, G., Ng, P. C., Feuk, L., Halpern, A. L., Walenz, B. P., Axelrod, N., Huang, J., Kirkness, E. F., Denisov, G., Lin, Y., MacDonald, J. R., Pang, A. W. C., Shago, M., Stockwell, T. B., Tsiamouri, A., Bafna, V., Bansal, V., Kravitz, S. A., Busam, D. A., Beeson, K. Y., McIntosh, T. C., Remington, K. A., Abril, J. F., Gill, J., Borman, J., Rogers, Y.-H., Frazier, M. E., Scherer, S. W., Strausberg, R. L., and Venter, J. C. (2007). The Diploid Genome Sequence of an Individual Human. PLOS Biology, 5(10):e254.

Levy, S. E., Giarelli, E., Lee, L.-C., Schieve, L. A., Kirby, R. S., Cunniff, C., Nicholas, J., Reaven, J., and Rice, C. E. (2010). Autism Spectrum Disorder and Co-occurring Developmental, Psychiatric, and Medical Conditions Among Children in Multiple Populations of the United States. Journal of Developmental & Behavioral Pediatrics, 31(4):267–275.

Lichtenstein, P., Carlström, E., Råstam, M., Gillberg, C., and Anckarsäter, H. (2010). The Genetics of Autism Spectrum Disorders and Related Neuropsychiatric Disorders in Childhood. American Journal of Psychiatry, 167(11):1357–1363.

Lozano, R., Vino, A., Lozano, C., Fisher, S. E., and Deriziotis, P. (2015). A de novo FOXP1 variant in a patient with autism, intellectual disability and severe speech and language impairment. European Journal of Human Genetics, 23(12):1702–1707.

Lyon, E., Laver, T., Yu, P., Jama, M., Young, K., Zoccoli, M., and Marlowe, N. (2010). A Simple, High-Throughput Assay for Fragile X Expanded Alleles Using Triple Repeat Primed PCR and Capillary Electrophoresis. The Journal of Molecular Diagnostics, 12(4):505–511.

Madsen, B. E., Villesen, P., and Wiuf, C. (2008). Short Tandem Repeats in Human Exons: A Target for Disease Mutations. BMC Genomics, 9(1):410.

Matheis, M., Matson, J. L., Hong, E., and Cervantes, P. E. (2019). Gender Differences and Similarities: Autism Symptomatology and Developmental Functioning in Young Children. Journal of Autism and Developmental Disorders, 49(3):1219–1231.

McInnis, M. G. (1996). Anticipation: An old idea in new genes. American Journal of Human Genetics, 59(5):973–979.

Mondal, K., Ramachandran, D., Patel, V. C., Hagen, K. R., Bose, P., Cutler, D. J., and Zwick, M. E. (2012). Excess variants in AFF2 detected by massively parallel sequencing of males with autism spectrum disorder. Human Molecular Genetics, 21(19):4356–4364.

Moss, J. and Howlin, P. (2009). Autism spectrum disorders in genetic syndromes: Implications for diagnosis, intervention and understanding the wider autism spectrum disorder population. Journal of Intellectual Disability Research, 53(10):852–873.

Mousavi, N., Shleizer-Burko, S., Yanicky, R., and Gymrek, M. (2019). Profiling the genome-wide landscape of tandem repeat expansions. Nucleic Acids Research.

Murray, A., Youings, S., Dennis, N., Latsky, L., Linehan, P., McKechnie, N., Macpherson, J., Pound, M., and Jacobs, P. (1996). Population screening at the FRAXA and FRAXE loci: Molecular analyses of boys with learning difficulties and their mothers. Human Molecular Genetics, 5(6):727–735.

Nordenbæk, C., Jørgensen, M., Kyvik, K. O., and Bilenberg, N. (2014). A Danish population-based twin study on autism spectrum disorders. European Child & Adolescent Psychiatry, 23(1):35–43.

O’Roak, B. J., Deriziotis, P., Lee, C., Vives, L., Schwartz, J. J., Girirajan, S., Karakoc, E., MacKenzie, A. P., Ng, S. B., Baker, C., Rieder, M. J., Nickerson, D. A., Bernier, R., Fisher, S. E., Shendure, J., and Eichler, E. E. (2011). Exome sequencing in sporadic autism spectrum disorders identifies severe de novo mutations. Nature Genetics, 43(6):585–589.

Orr, H. T. and Zoghbi, H. Y. (2007). Trinucleotide Repeat Disorders. Annual Review of Neuroscience, 30(1):575–621.

Ozonoff, S., Heung, K., Byrd, R., Hansen, R., and Hertz-Picciotto, I. (2008). The Onset of Autism: Patterns of Symptom Emergence in the First Years of Life. Autism Research, 1(6):320–328.

Ozonoff, S., Young, G. S., Carter, A., Messinger, D., Yirmiya, N., Zwaigenbaum, L., Bryson, S., Carver, L. J., Constantino, J. N., Dobkins, K., Hutman, T., Iverson, J. M., Landa, R., Rogers, S. J., Sigman, M., and Stone, W. L. (2011). Recurrence Risk for Autism Spectrum Disorders: A Baby Siblings Research Consortium Study. Pediatrics, 128(3):e488–e495.

Palumbo, O., D’Agruma, L., Minenna, A. F., Palumbo, P., Stallone, R., Palladino, T., Zelante, L., and Carella, M. (2013). 3p14.1 de novo microdeletion involving the FOXP1 gene in an adult patient with autism, severe speech delay and deficit of motor coordination. Gene, 516(1):107–113.

Pang, A. W., MacDonald, J. R., Pinto, D., Wei, J., Rafiq, M. A., Conrad, D. F., Park, H., Hurles, M. E., Lee, C., Venter, J. C., Kirkness, E. F., Levy, S., Feuk, L., and Scherer, S. W. (2010). Towards a comprehensive structural variation map of an individual human genome. Genome Biology, 11(5):R52.

Pang, A. W. C., MacDonald, J. R., Yuen, R. K. C., Hayes, V. M., and Scherer, S. W. (2014). Performance of High-Throughput Sequencing for the Discovery of Genetic Variation Across the Complete Size Spectrum. G3: Genes, Genomes, Genetics, 4(1):63–65.

Pang, A. W. C., Migita, O., MacDonald, J. R., Feuk, L., and Scherer, S. W. (2013). Mechanisms of Formation of Structural Variation in a Fully Sequenced Human Genome. Human Mutation, 34(2):345–354.

Paulson, H. (2018). Repeat expansion diseases. Handbook of Clinical Neurology, 147:105–123.

Pinto, D., Pagnamenta, A. T., Klei, L., Anney, R., Merico, D., Regan, R., Conroy, J., Magalhaes, T. R., Correia, C., Abrahams, B. S., Almeida, J., Bacchelli, E., Bader, G. D., Bailey, A. J., Baird, G., Battaglia, A., Berney, T., Bolshakova, N., Bölte, S., Bolton, P. F., Bourgeron, T., Brennan, S., Brian, J., Bryson, S. E., Carson, A. R., Casallo, G., Casey, J., Chung, B. H. Y., Cochrane, L., Corsello, C., Crawford, E. L., Crossett, A., Cytrynbaum, C., Dawson, G., de Jonge, M., Delorme, R., Drmic, I., Duketis, E., Duque, F., Estes, A., Farrar, P., Fernandez, B. A., Folstein, S. E., Fombonne, E., Freitag, C. M., Gilbert, J., Gillberg, C., Glessner, J. T., Goldberg, J., Green, A., Green, J., Guter, S. J., Hakonarson, H., Heron, E. A., Hill, M., Holt, R., Howe, J. L., Hughes, G., Hus, V., Igliozzi, R., Kim, C., Klauck, S. M., Kolevzon, A., Korvatska, O., Kustanovich, V., Lajonchere, C. M., Lamb, J. A., Laskawiec, M., Leboyer, M., Le Couteur, A., Leventhal, B. L., Lionel, A. C., Liu, X.-Q., Lord, C., Lotspeich, L., Lund, S. C., Maestrini, E., Mahoney, W., Mantoulan, C., Marshall, C. R., McConachie, H., McDougle, C. J., McGrath, J., McMahon, W. M., Merikangas, A., Migita, O., Minshew, N. J., Mirza, G. K., Munson, J., Nelson, S. F., Noakes, C., Noor, A., Nygren, G., Oliveira, G., Papanikolaou, K., Parr, J. R., Parrini, B., Paton, T., Pickles, A., Pilorge, M., Piven, J., Ponting, C. P., Posey, D. J., Poustka, A., Poustka, F., Prasad, A., Ragoussis, J., Renshaw, K., Rickaby, J., Roberts, W., Roeder, K., Roge, B., Rutter, M. L., Bierut, L. J., Rice, J. P., Salt, J., Sansom, K., Sato, D., Segurado, R., Sequeira, A. F., Senman, L., Shah, N., Sheffield, V. C., Soorya, L., Sousa, I., Stein, O., Sykes, N., Stoppioni, V., Strawbridge, C., Tancredi, R., Tansey, K., Thiruvahindrapduram, B., Thompson, A. P., Thomson, S., Tryfon, A., Tsiantis, J., Van Engeland, H., Vincent, J. B., Volkmar, F., Wallace, S., Wang, K., Wang, Z., Wassink, T. H., Webber, C., Weksberg, R., Wing, K., Wittemeyer, K., Wood, S., Wu, J., Yaspan, B. L., Zurawiecki, D., Zwaigenbaum, L., Buxbaum, J. D., Cantor, R. M., Cook, E. H., Coon, H., Cuccaro, M. L., Devlin, B., Ennis, S., Gallagher, L., Geschwind, D. H., Gill, M., Haines, J. L., Hallmayer, J., Miller, J., Monaco, A. P., Nurnberger Jr, J. I., Paterson, A. D., Pericak-Vance, M. A., Schellenberg, G. D., Szatmari, P., Vicente, A. M., Vieland, V. J., Wijsman, E. M., Scherer, S. W., Sutcliffe, J. S., and Betancur, C. (2010). Functional impact of global rare copy number variation in autism spectrum disorders. Nature, 466(7304):368–372.

Piven, J., Palmer, P., Jacobi, D., Childress, D., and Arndt, S. (1997). Broader autism phenotype: Evidence from a family history study of multiple-incidence autism families. The American Journal of Psychiatry, 154(2):185–190.

Press, M. O., Carlson, K. D., and Queitsch, C. (2014). The overdue promise of short tandem repeat variation for heritability. Trends in Genetics, 30(11):504–512.

Quilez, J., Guilmatre, A., Garg, P., Highnam, G., Gymrek, M., Erlich, Y., Joshi, R. S., Mittelman, D., and Sharp, A. J. (2016). Polymorphic tandem repeats within gene promoters act as modifiers of gene expression and DNA methylation in humans. Nucleic Acids Research, 44(8):3750–3762.

Rhoads, A. and Au, K. F. (2015). PacBio Sequencing and Its Applications. Genomics, Proteomics & Bioinformatics, 13(5):278–289.

Ritvo, E. R., Freeman, B. J., Mason-Brothers, A., Mo, A., and Ritvo, A. M. (1985). Concordance for the syndrome of autism in 40 pairs of afflicted twins. The American Journal of Psychiatry, 142(1):74–77.

Ronald, A. and Hoekstra, R. A. (2011). Autism spectrum disorders and autistic traits: A decade of new twin studies. American Journal of Medical Genetics Part B: Neuropsychiatric Genetics, 156(3):255–274.

Rosenberg, R. E., Law, J. K., Yenokyan, G., McGready, J., Kaufmann, W. E., and Law, P. A. (2009). Characteristics and concordance of autism spectrum disorders among 277 twin pairs. Archives of Pediatrics & Adolescent Medicine, 163(10):907–914.

Sahoo, T., Theisen, A., Marble, M., Tervo, R., Rosenfeld, J. A., Torchia, B. S., and Shaffer, L. G. (2011). Microdeletion of Xq28 involving the AFF2 (FMR2) gene in two unrelated males with developmental delay. American Journal of Medical Genetics Part A, 155(12):3110–3115.

Sanders, S. J., Ercan-Sencicek, A. G., Hus, V., Luo, R., Murtha, M. T., Moreno-De-Luca, D., Chu, S. H., Moreau, M. P., Gupta, A. R., Thomson, S. A., Mason, C. E., Bilguvar, K., Celestino-Soper, P. B. S., Choi, M., Crawford, E. L., Davis, L., Davis Wright, N. R., Dhodapkar, R. M., DiCola, M., DiLullo, N. M., Fernandez, T. V., Fielding-Singh, V., Fishman, D. O., Frahm, S., Garagaloyan, R., Goh, G. S., Kammela, S., Klei, L., Lowe, J. K., Lund, S. C., McGrew, A. D., Meyer, K. A., Moffat, W. J., Murdoch, J. D., O’Roak, B. J., Ober, G. T., Pottenger, R. S., Raubeson, M. J., Song, Y., Wang, Q., Yaspan, B. L., Yu, T. W., Yurkiewicz, I. R., Beaudet, A. L., Cantor, R. M., Curland, M., Grice, D. E., Günel, M., Lifton, R. P., Mane, S. M., Martin, D. M., Shaw, C. A., Sheldon, M., Tischfield, J. A., Walsh, C. A., Morrow, E. M., Ledbetter, D. H., Fombonne, E., Lord, C., Martin, C. L., Brooks, A. I., Sutcliffe, J. S., Cook, E. H., Geschwind, D., Roeder, K., Devlin, B., and State, M. W. (2011). Multiple Recurrent De Novo CNVs, Including Duplications of the 7q11.23 Williams Syndrome Region, Are Strongly Associated with Autism. Neuron, 70(5):863–885.

Sandin, S., Lichtenstein, P., Kuja-Halkola, R., Hultman, C., Larsson, H., and Reichenberg, A. (2017). The Heritability of Autism Spectrum Disorder. JAMA, 318(12):1182–1184.

Sandin, S., Lichtenstein, P., Kuja-Halkola, R., Larsson, H., Hultman, C. M., and Reichenberg, A. (2014). The Familial Risk of Autism. JAMA, 311(17):1770–1777.

Shi, L., Guo, Y., Dong, C., Huddleston, J., Yang, H., Han, X., Fu, A., Li, Q., Li, N., Gong, S., Lintner, K. E., Ding, Q., Wang, Z., Hu, J., Wang, D., Wang, F., Wang, L., Lyon, G. J., Guan, Y., Shen, Y., Evgrafov, O. V., Knowles, J. A., Thibaud-Nissen, F., Schneider, V., Yu, C.-Y., Zhou, L., Eichler, E. E., So, K.-F., and Wang, K. (2016). Long-read sequencing and de novo assembly of a Chinese genome. Nature Communications, 7:12065.

Stanco, A., Pla, R., Vogt, D., Chen, Y., Mandal, S., Walker, J., Hunt, R. F., Lindtner, S., Erdman, C. A., Pieper, A. A., Hamilton, S. P., Xu, D., Baraban, S. C., and Rubenstein, J. L. R. (2014). NPAS1 Represses the Generation of Specific Subtypes of Cortical Interneurons. Neuron, 84(5):940–953.

Steffenburg, S., Gillberg, C., Hellgren, L., Andersson, L., Gillberg, I. C., Jakobsson, G., and Bohman, M. (1989). A twin study of autism in Denmark, Finland, Iceland, Norway and Sweden. Journal of Child Psychology and Psychiatry, and Allied Disciplines, 30(3):405–416.

Stessman, H. A. F., Xiong, B., Coe, B. P., Wang, T., Hoekzema, K., Fenckova, M., Kvarnung, M., Gerdts, J., Trinh, S., Cosemans, N., Vives, L., Lin, J., Turner, T. N., Santen, G., Ruivenkamp, C., Kriek, M., van Haeringen, A., Aten, E., Friend, K., Liebelt, J., Barnett, C., Haan, E., Shaw, M., Gecz, J., Anderlid, B.-M., Nordgren, A., Lindstrand, A., Schwartz, C., Kooy, R. F., Vandeweyer, G., Helsmoortel, C., Romano, C., Alberti, A., Vinci, M., Avola, E., Giusto, S., Courchesne, E., Pramparo, T., Pierce, K., Nalabolu, S., Amaral, D. G., Scheffer, I. E., Delatycki, M. B., Lockhart, P. J., Hormozdiari, F., Harich, B., Castells-Nobau, A., Xia, K., Peeters, H., Nordenskjöld, M., Schenck, A., Bernier, R. A., and Eichler, E. E. (2017). Targeted sequencing identifies 91 neurodevelopmental-disorder risk genes with autism and developmental-disability biases. Nature Genetics, 49(4):515.

Stettner, G. M., Shoukier, M., Höger, C., Brockmann, K., and Auber, B. (2011). Familial intellectual disability and autistic behavior caused by a small FMR2 gene deletion. American Journal of Medical Genetics Part A, 155(8):2003–2007.

Stodgell, C. J., Ingram, J. L., and Hyman, S. L. (2000). The role of candidate genes in unraveling the genetics of autism. In International Review of Research in Mental Retardation, volume 23 of Autism, pages 57–81. Academic Press.

Strand, M., Prolla, T. A., Liskay, R. M., and Petes, T. D. (1993). Destabilization of tracts of simple repetitive DNA in yeast by mutations affecting DNA mismatch repair. Nature, 365(6443):274–276.

Subramanian, S., Mishra, R. K., and Singh, L. (2003). Genome-wide analysis of microsatellite repeats in humans: Their abundance and density in specific genomic regions. Genome Biology, 4(2):R13.

Swami, M., Hendricks, A. E., Gillis, T., Massood, T., Mysore, J., Myers, R. H., and Wheeler, V. C. (2009). Somatic expansion of the Huntington’s disease CAG repeat in the brain is associated with an earlier age of disease onset. Human Molecular Genetics, 18(16):3039–3047.

Szatmari, P., MacLean, J. E., Jones, M. B., Bryson, S. E., Zwaigenbaum, L., Bartolucci, G., Mahoney, W. J., and Tuff, L. (2000). The Familial Aggregation of the Lesser Variant in Biological and Nonbiological Relatives of PDD Probands: A Family History Study. Journal of Child Psychology and Psychiatry, 41(5):579–586.

Tang, H., Kirkness, E. F., Lippert, C., Biggs, W. H., Fabani, M., Guzman, E., Ramakrishnan, S., Lavrenko, V., Kakaradov, B., Hou, C., Hicks, B., Heckerman, D., Och, F. J., Caskey, C. T., Venter, J. C., and Telenti, A. (2017). Profiling of Short-Tandem-Repeat Disease Alleles in 12,632 Human Whole Genomes. The American Journal of Human Genetics, 101(5):700–715.

Taniai, H., Nishiyama, T., Miyachi, T., Imaeda, M., and Sumi, S. (2008). Genetic influences on the broad spectrum of autism: Study of proband-ascertained twins. American Journal of Medical Genetics. Part B, Neuropsychiatric Genetics: The Official Publication of the International Society of Psychiatric Genetics, 147B(6):844–849.

Tóth, G., Gáspári, Z., and Jurka, J. (2000). Microsatellites in Different Eukaryotic Genomes: Survey and Analysis. Genome Research, 10(7):967–981.

Treangen, T. J. and Salzberg, S. L. (2012). Repetitive DNA and next-generation sequencing: Computational challenges and solutions. Nature Reviews Genetics, 13(1):36–46.

van Kuilenburg, A. B., Tarailo-Graovac, M., Richmond, P. A., Drögemöller, B. I., Pouladi, M. A., Leen, R., Brand-Arzamendi, K., Dobritzsch, D., Dolzhenko, E., Eberle, M. A., Hayward, B., Jones, M. J., Karbassi, F., Kobor, M. S., Koster, J., Kumari, D., Li, M., MacIsaac, J., McDonald, C., Meijer, J., Nguyen, C., Rajan-Babu, I.-S., Scherer, S. W., Sim, B., Trost, B., Tseng, L. A., Turkenburg, M., van Vugt, J. J., Veldink, J. H., Walia, J. S., Wang, Y., van Weeghel, M., Wright, G. E., Xu, X., Yuen, R. K., Zhang, J., Ross, C. J., Wasserman, W. W., Geraghty, M. T., Santra, S., Wanders, R. J., Wen, X.-Y., Waterham, H. R., Usdin, K., and van Karnebeek, C. D. (2019). Glutaminase Deficiency Caused by Short Tandem Repeat Expansion in GLS. New England Journal of Medicine, 380(15):1433–1441.

Wang, L. W., Berry-Kravis, E., and Hagerman, R. J. (2010). Fragile X: Leading the Way for Targeted Treatments in Autism. Neurotherapeutics, 7(3):264–274.

Warner, J. P., Barron, L. H., Goudie, D., Kelly, K., Dow, D., Fitzpatrick, D. R., and Brock, D. J. (1996). A general method for the detection of large CAG repeat expansions by fluorescent PCR. Journal of Medical Genetics, 33(12):1022–1026.

Wheeler, J. M. D., Bodmer, W. F., and Mortensen, N. J. M. (2000). DNA mismatch repair genes and colorectal cancer. Gut, 47(1):148–153.

Willems, T., Gymrek, M., Highnam, G., The 1000 Genomes Project Consortium, Mittelman, D., and Erlich, Y. (2014). The landscape of human STR variation. Genome Research, 24(11):1894–1904.

Yuen, R. K., Merico, D., Cao, H., Pellecchia, G., Alipanahi, B., Thiruvahindrapuram, B., Tong, X., Sun, Y., Cao, D., Zhang, T., Wu, X., Jin, X., Zhou, Z., Liu, X., Nalpathamkalam, T., Walker, S., Howe, J. L., Wang, Z., MacDonald, J. R., Chan, A. J., D’Abate, L., Deneault, E., Siu, M. T., Tammimies, K., Uddin, M., Zarrei, M., Wang, M., Li, Y., Wang, J., Wang, J., Yang, H., Bookman, M., Bingham, J., Gross, S. S., Loy, D., Pletcher, M., Marshall, C. R., Anagnostou, E., Zwaigenbaum, L., Weksberg, R., Fernandez, B. A., Roberts, W., Szatmari, P., Glazer, D., Frey, B. J., Ring, R. H., Xu, X., and Scherer, S. W. (2016). Genome-wide characteristics of de novo mutations in autism. npj Genomic Medicine, 1:16027.

Yuen, R. K. C., Merico, D., Bookman, M., Howe, J. L., Thiruvahindrapuram, B., Patel, R. V., Whitney, J., Deflaux, N., Bingham, J., Wang, Z., Pellecchia, G., Buchanan, J. A., Walker, S., Marshall, C. R., Uddin, M., Zarrei, M., Deneault, E., D’Abate, L., Chan, A. J. S., Koyanagi, S., Paton, T., Pereira, S. L., Hoang, N., Engchuan, W., Higginbotham, E. J., Ho, K., Lamoureux, S., Li, W., MacDonald, J. R., Nalpathamkalam, T., Sung, W. W. L., Tsoi, F. J., Wei, J., Xu, L., Tasse, A.-M., Kirby, E., Van Etten, W., Twigger, S., Roberts, W., Drmic, I., Jilderda, S., Modi, B. M., Kellam, B., Szego, M., Cytrynbaum, C., Weksberg, R., Zwaigenbaum, L., Woodbury-Smith, M., Brian, J., Senman, L., Iaboni, A., Doyle-Thomas, K., Thompson, A., Chrysler, C., Leef, J., Savion-Lemieux, T., Smith, I. M., Liu, X., Nicolson, R., Seifer, V., Fedele, A., Cook, E. H., Dager, S., Estes, A., Gallagher, L., Malow, B. A., Parr, J. R., Spence, S. J., Vorstman, J., Frey, B. J., Robinson, J. T., Strug, L. J., Fernandez, B. A., Elsabbagh, M., Carter, M. T., Hallmayer, J., Knoppers, B. M., Anagnostou, E., Szatmari, P., Ring, R. H., Glazer, D., Pletcher, M. T., and Scherer, S. W. (2017). Whole genome sequencing resource identifies 18 new candidate genes for autism spectrum disorder. Nature Neuroscience, 20(4):602–611.

Yuen, R. K. C., Thiruvahindrapuram, B., Merico, D., Walker, S., Tammimies, K., Hoang, N., Chrysler, C., Nalpathamkalam, T., Pellecchia, G., Liu, Y., Gazzellone, M. J., D’Abate, L., Deneault, E., Howe, J. L., Liu, R. S. C., Thompson, A., Zarrei, M., Uddin, M., Marshall, C. R., Ring, R. H., Zwaigenbaum, L., Ray, P. N., Weksberg, R., Carter, M. T., Fernandez, B. A., Roberts, W., Szatmari, P., and Scherer, S. W. (2015). Whole-genome sequencing of quartet families with autism spectrum disorder. Nature Medicine, 21(2):185–191.

Zhou, B., Arthur, J. G., Ho, S. S., Pattni, R., Huang, Y., Wong, W. H., and Urban, A. E. (2018). Extensive and deep sequencing of the Venter/HuRef genome for developing and benchmarking genome analysis tools. Scientific Data, 5:180261.

Zu, T., Gibbens, B., Doty, N. S., Gomes-Pereira, M., Huguet, A., Stone, M. D., Margolis, J., Peterson, M., Markowski, T. W., Ingram, M. A. C., Nan, Z., Forster, C., Low, W. C., Schoser, B., Somia, N. V., Clark, H. B., Schmechel, S., Bitterman, P. B., Gourdon, G., Swanson, M. S., Moseley, M., and Ranum, L. P. W. (2011). Non-ATG–initiated translation directed by microsatellite expansions. Proceedings of the National Academy of Sciences, 108(1):260–265.

Zwaigenbaum, L., Bryson, S. E., Szatmari, P., Brian, J., Smith, I. M., Roberts, W., Vaillancourt, T., and Roncadin, C. (2012). Sex Differences in Children with Autism Spectrum Disorder Identified Within a High-Risk Infant Cohort. Journal of Autism and Developmental Disorders, 42(12):2585–2596.
