
Genome-wide Investigation of Short Tandem Repeat Variation in Autism Spectrum Disorder

by

Charlotte Michelle Nguyen

A thesis submitted in conformity with the requirements for the degree of Master of Science
Graduate Department of Molecular Genetics
University of Toronto

© Copyright 2019 by Charlotte Michelle Nguyen

Abstract

Genome-wide Investigation of Short Tandem Repeat Variation in Autism Spectrum Disorder

Charlotte Michelle Nguyen
Master of Science
Graduate Department of Molecular Genetics
University of Toronto
2019

Short tandem repeats (STRs) may contribute to the genetic etiology of autism spectrum disorder (ASD), since several pathogenic STRs are ASD risk factors. Screening for STR variation may uncover novel risk loci; however, this was previously hindered by the difficulty of aligning repetitive sequences in whole genome sequencing (WGS) data. Recently, several tools have been developed to genotype STRs from WGS data. In this study, an STR genotyping pipeline was created to perform a genome-wide screen for repeat variation in an ASD genomic database, MSSNG. This work uncovered 4925 STRs, and variation was identified in ASD-risk loci, repeat disorder genes, and other pathogenic loci. In particular, repeat expansions upstream of the CRYBB2 gene were seen significantly more frequently in individuals with ASD than in a control population. These findings suggest that this STR genotyping pipeline can detect repeat variation in ASD, and may lead to the discovery of novel candidate loci.

Acknowledgements

I am greatly appreciative of the support of my supervisors, Dr. Stephen Scherer and Dr. Ryan Yuen. When I first joined Dr. Scherer's lab, I felt nervous about how I would fare in a group that was publishing ground-breaking research, led by a person who has a legacy and a Wikipedia page. I could not have predicted how kind and generous my peers and Dr. Scherer would be, and how their influence would shape me to become a better scientist and person.
Throughout my time, Dr. Scherer went out of his way to give his students once-in-a-lifetime experiences. I will always remember his personal collection of art, seeing Supertramp perform for World Autism Day, and meeting with Canada's Chief Science Advisor, Dr. Mona Nemer.

I am very fortunate and grateful for Dr. Yuen's guidance. I have spent so many hours in his office talking about everything from our research to the best horror movies. I will always cherish these conversations that carried me throughout graduate school, and the advice that will follow me throughout my life. Dr. Yuen was so patient and supportive in helping me carve a path in my research and career. None of the highs would have been possible without Dr. Yuen, and I could not have gotten through the lows without him either.

Dr. Scherer and Dr. Yuen pushed me to perform the best research I could by providing me with superstar mentors, a world-class facility, and modeling this in their own careers.

I am so grateful for all the members of TCAG. This project was a behemoth, and there are so many players to thank. I would like to especially thank Dr. Brett Trost and Dr. Robbie Davies for their mentorship. I learned from the best, and my research and career are forever enriched because of their support, encouragement, and ingenuity. They showed me how to turn letters into beautiful lines of code. I would also like to thank Dr. Zhuozhi Wang, Bhooma Thiruvahindrapuram, and Omar Hamdan for their technical excellence and guidance. To this day, I still do not understand how they do what they do.

Thank you to my fellow graduate students, Lia D'Abate, Ted Higginbotham, and Ada Chan, who so quickly became my friends. The conversations I shared with them are some of my fondest memories of the lab. Thank you for letting me cry sometimes and making me laugh all the time.

I am also especially thankful to all the members of the Yuen Lab.
With our physical separation and differences in schedules, sometimes we were in different worlds, but every time we visited one another I was reminded of how sweet and amazing they all were. Whether it was in lab meeting or at lunch time, I was continuously surprised and impressed by their skills in and out of the lab.

I would also like to thank my committee members, Dr. Quaid Morris and Dr. Christopher Pearson. It was a blessing and privilege to be able to learn from and be guided by world-class experts. As well, thank you to the members of Dr. Pearson's lab who have collaborated closely on this project. Your technical expertise and in-depth knowledge of short tandem repeats are mind-blowing.

Thank you to my families. To my family-family: mum, dad, Michael, and Anh Quyen, I love you all very much. I am who I am because of you. Thank you for all the support you have given me. To my friend-family: the friends I have made in this department are some of the best friends I have. I feel so lucky to have been part of the boys and bioinfos. There is so much to say, and so many memories to reminisce on, so I will leave it at this: you all rock.

Contents

Abstract
Acknowledgements
List of Tables
List of Figures
List of Abbreviations

1 Introduction
    1.1 Clinical presentation of Autism Spectrum Disorder (ASD)
    1.2 Genetic etiology of ASD established from family-based studies
        1.2.1 The Broader Autism Phenotype (BAP)
    1.3 Genetic variants in ASD
    1.4 Whole Genome Sequencing (WGS)
        1.4.1 Short-read WGS
        1.4.2 Long-read WGS
        1.4.3 MSSNG ASD WGS database
    1.5 Short Tandem Repeats (STR)
        1.5.1 Tandem Repeat Disorders (TRD)
        1.5.2 The role of STRs in common polygenic disorders
        1.5.3 The role of STRs in ASD
        1.5.4 Genetic anticipation in ASD
    1.6 STR genotyping
        1.6.1 Traditional STR detection methods
        1.6.2 STR genotyping algorithms
    1.7 Project rationale

2 Methods: Development and Implementation
    2.1 MSSNG ASD database
    2.2 STR genotyping pipeline development
        2.2.1 Expansion Hunter de novo (EHdn)
        2.2.2 STR Finder
        2.2.3 Expansion Hunter (EH)
        2.2.4 STR genotyping pipeline
    2.3 Pipeline validation using long-read sequencing data
        2.3.1 Comparison of EH tools to other STR callers
        2.3.2 Pipeline validation using twin data
    2.4 Outlier detection method for STR expansions
    2.5 Statistical analysis

3 Results
    3.1 Genome-wide STR genotyping in MSSNG
        3.1.1 STR calling using EHdn
        3.1.2 Identification of reference STRs using STR Finder
        3.1.3 Targeted STR genotyping using EH
    3.2 Identifying potentially clinically relevant STR Variation in ASD
        3.2.1 STR variation in MSSNG compared to a control population
    3.3 Validation of STR genotyping pipeline
        3.3.1 Validation of EHdn using long-read sequencing data
        3.3.2 Validation using other STR genotyping tools
        3.3.3 Monozygotic vs. dizygotic twin concordance in EHdn

4 Discussion
    4.1 Overview of results
    4.2 Current study limitations
    4.3 Overall summary and impact of work

5 Future Directions
    5.1 Application of STR genotyping pipeline to other ASD genomic databases
    5.2 Application of STR genotyping pipeline to other disorders

Bibliography

List of Tables

1 Candidate loci detected in MSSNG
2 Outliers identified in MSSNG and 1000 Genomes
3 Validation of genome-wide STR genotyping tools in long-read sequencing data

List of Figures

1 Expansion Hunter de novo and Expansion Hunter read-based approaches
2 STR Finder
3 STR Genotyping Pipeline
4 STR loci detected in MSSNG samples using Expansion Hunter de novo
5 STR motif identified in MSSNG using Expansion Hunter de novo
6 MSSNG allele lengths of the GA repeat upstream of CRYBB2
7 MSSNG pedigrees for GA repeat upstream of CRYBB2
8 Validation rate of EHdn calls supported by
9 Comparison of genome-wide STR genotyping tools

List of Abbreviations

ADDM    Autism and Developmental Disabilities Monitoring
ADHD    Attention Deficit Hyperactivity Disorder
ADL     Activities of Daily Life
ASD     Autism Spectrum Disorder
BAP     Broader Autism Phenotype
BWA     Burrows-Wheeler Alignment
CCS     Circular Consensus
CNV     Copy Number Variant
DM1     Myotonic Dystrophy 1
DSM     Diagnostic and Statistical Manual of Mental Disorders
DZ      Dizygotic
EH      Expansion Hunter
EHDN    Expansion Hunter de novo
FRDA    Fragile X Tremor-Ataxia syndrome
GRCh    Genome Reference Consortium Human Genome Build
ID      Intellectual Disability
IRR     In-Repeat Read
LGD     Likely Gene Disrupting
MZ      Monozygotic
NDD     Neurodevelopmental Disorder
NF1     Neurofibromatosis
PacBio  Pacific Biosciences
PCR     Polymerase Chain Reaction
pLI     Probability of Loss-of-Function Intolerant Rate
RAN     Repeat-Associated Non-ATG
SMRT    Single-Molecule Real-Time
SNV     Single Nucleotide Variant
STR     Short Tandem Repeat
TNR     Trinucleotide Repeat
TP      Triplet-repeat Primed
TR      Tandem Repeat
TRD     Tandem Repeat Disorder
UTR     Untranslated Region
WGS     Whole Genome Sequencing
XLID    X-Linked Intellectual Disability

Chapter 1

Introduction

1.1 Clinical presentation of Autism Spectrum Disorder (ASD)

Autism spectrum disorder (ASD) is a lifelong neurodevelopmental condition characterized by deficits in social communication and restricted, repetitive behaviours and interests (DSM-V, 2013). Onset of the disorder occurs during childhood, with symptoms usually appearing before the age of 3 (Ozonoff et al., 2008).
