Genome-Wide Investigation of Short Tandem Repeat Variation in Autism Spectrum Disorder by Charlotte Michelle Nguyen a Thesis

Genome-wide Investigation of Short Tandem Repeat Variation in Autism Spectrum Disorder

by

Charlotte Michelle Nguyen

A thesis submitted in conformity with the requirements for the degree of Master of Science
Graduate Department of Molecular Genetics
University of Toronto

© Copyright 2019 by Charlotte Michelle Nguyen

Abstract

Genome-wide Investigation of Short Tandem Repeat Variation in Autism Spectrum Disorder
Charlotte Michelle Nguyen
Master of Science
Graduate Department of Molecular Genetics
University of Toronto
2019

Short tandem repeats (STRs) may contribute to the genetic etiology of autism spectrum disorder (ASD) since several pathogenic STRs are ASD risk factors. Screening for STR variation may uncover novel risk loci; however, this was previously hindered by the difficulty of aligning repetitive sequences in whole genome sequencing (WGS) data. Recently, several tools have been developed to genotype STRs from WGS data. In this study, an STR genotyping pipeline was created to perform a genome-wide screen for repeat variation in an ASD genomic database, MSSNG. This work uncovered 4925 STRs, and variation was identified in ASD-risk loci, repeat disorder genes, and other pathogenic loci. In particular, repeat expansions upstream of the CRYBB2 gene were seen significantly more frequently in individuals with ASD than in a control population. These findings suggest that this STR genotyping pipeline can detect repeat variation in ASD, and may lead to the discovery of novel candidate loci.

Acknowledgements

I am greatly appreciative of the support of my supervisors, Dr. Stephen Scherer and Dr. Ryan Yuen. When I first joined Dr. Scherer's lab, I felt nervous about how I would fare in a group that was publishing ground-breaking research, led by a person who has a legacy and a Wikipedia page.
I could not have predicted how kind and generous my peers and Dr. Scherer would be, and how their influence would shape me to become a better scientist and person. Throughout my time, Dr. Scherer went out of his way to give his students once-in-a-lifetime experiences. I will always remember his personal collection of art, seeing Supertramp perform for World Autism Day, and meeting with Canada's Chief Science Advisor, Dr. Mona Nemer.

I am very fortunate and grateful for Dr. Yuen's guidance. I have spent so many hours in his office talking about everything from our research to the best horror movies. I will always cherish these conversations that carried me throughout graduate school, and the advice that will follow me throughout my life. Dr. Yuen was so patient and supportive in helping me carve a path in my research and career. None of the highs would have been possible without Dr. Yuen, and I could not have gotten through the lows without him either. Dr. Scherer and Dr. Yuen pushed me to perform the best research I could by providing me with superstar mentors, a world-class facility, and modeling this in their own careers.

I am so grateful for all the members of TCAG. This project was a behemoth, and there are so many players to thank. I would like to especially thank Dr. Brett Trost and Dr. Robbie Davies for their mentorship. I learned from the best, and my research and career are forever enriched because of their support, encouragement, and ingenuity. They showed me how to turn letters into beautiful lines of code. I would also like to thank Dr. Zhuozhi Wang, Bhooma Thiruvahindrapuram, and Omar Hamdan for their technical excellence and guidance. To this day, I still do not understand how they do what they do.

Thank you to my fellow graduate students, Lia D'Abate, Ted Higginbotham, and Ada Chan, who so quickly became my friends. The conversations I shared with them are some of my fondest memories of the lab.
Thank you for letting me cry sometimes and making me laugh all the time.

I am also especially thankful to all the members of the Yuen Lab. With our physical separation and differences in schedules, sometimes we were in different worlds, but every time we visited one another I was reminded about how sweet and amazing they all were. Whether it was in lab meeting or at lunch time, I was continuously surprised and impressed by their skills in and out of the lab.

I would also like to thank my committee members, Dr. Quaid Morris and Dr. Christopher Pearson. It was a blessing and privilege to be able to learn from and be guided by world-class experts. As well, thank you to the members of Dr. Pearson's lab who have collaborated closely on this project. Your technical expertise and in-depth knowledge of short tandem repeats is mind-blowing.

Thank you to my families. To my family-family: mum, dad, Michael, and Anh Quyen, I love you all very much. I am who I am because of you. Thank you for all the support you have given me. To my friend-family: the friends I have made in this department are some of the best friends I have. I feel so lucky to have been part of the boys and bioinfos. There is so much to say, and so many memories to reminisce on, so I will leave it at this: you all rock.

Contents

Abstract
Acknowledgements
List of Tables
List of Figures
List of Abbreviations

1 Introduction
  1.1 Clinical presentation of Autism Spectrum Disorder (ASD)
  1.2 Genetic etiology of ASD established from family-based studies
    1.2.1 The Broader Autism Phenotype (BAP)
  1.3 Genetic variants in ASD
  1.4 Whole Genome Sequencing (WGS)
    1.4.1 Short-read WGS
    1.4.2 Long-read WGS
    1.4.3 MSSNG ASD WGS database
  1.5 Short Tandem Repeats (STR)
    1.5.1 Tandem Repeat Disorders (TRD)
    1.5.2 The role of STRs in common polygenic disorders
    1.5.3 The role of STRs in ASD
    1.5.4 Genetic anticipation in ASD
  1.6 STR genotyping
    1.6.1 Traditional STR detection methods
    1.6.2 STR genotyping algorithms
  1.7 Project rationale
2 Methods: Development and Implementation
  2.1 MSSNG ASD database
  2.2 STR genotyping pipeline development
    2.2.1 Expansion Hunter de novo (EHdn)
    2.2.2 STR Finder
    2.2.3 Expansion Hunter (EH)
    2.2.4 STR genotyping pipeline
  2.3 Pipeline validation using long-read sequencing data
    2.3.1 Comparison of EH tools to other STR callers
    2.3.2 Pipeline validation using twin data
  2.4 Outlier detection method for STR expansions
  2.5 Statistical analysis
3 Results
  3.1 Genome-wide STR genotyping in MSSNG
    3.1.1 STR calling using EHdn
    3.1.2 Identification of reference STRs using STR Finder
    3.1.3 Targeted STR genotyping using EH
  3.2 Identifying potentially clinically relevant STR Variation in ASD
    3.2.1 STR variation in MSSNG compared to a control population
  3.3 Validation of STR genotyping pipeline
    3.3.1 Validation of EHdn using long-read sequencing data
    3.3.2 Validation using other STR genotyping tools
    3.3.3 Monozygotic vs. dizygotic twin concordance in EHdn
4 Discussion
  4.1 Overview of results
  4.2 Current study limitations
  4.3 Overall summary and impact of work
5 Future Directions
  5.1 Application of STR genotyping pipeline to other ASD genomic databases
  5.2 Application of STR genotyping pipeline to other disorders
Bibliography

List of Tables

1 Candidate loci detected in MSSNG
2 Outliers identified in MSSNG and 1000 Genomes
3 Validation of genome-wide STR genotyping tools in long-read sequencing data

List of Figures

1 Expansion Hunter de novo and Expansion Hunter read-based approaches
2 STR Finder
3 STR Genotyping Pipeline
4 STR loci detected in MSSNG samples using Expansion Hunter de novo
5 STR motif identified in MSSNG using Expansion Hunter de novo
6 MSSNG allele lengths of the GA repeat upstream of CRYBB2
7 MSSNG pedigrees for GA repeat upstream of CRYBB2
8 Validation rate of EHdn calls supported by
9 Comparison of genome-wide STR genotyping tools

List of Abbreviations

ADDM    Autism and Developmental Disabilities Monitoring
ADHD    Attention Deficit Hyperactivity Disorder
ADL     Activities of Daily Life
ASD     Autism Spectrum Disorder
BAP     Broader Autism Phenotype
BWA     Burrows-Wheeler Alignment
CCS     Circular Consensus
CNV     Copy Number Variant
DM1     Myotonic Dystrophy 1
DSM     Diagnostic and Statistical Manual of Mental Disorders
DZ      Dizygotic
EH      Expansion Hunter
EHDN    Expansion Hunter de novo
FRDA    Fragile X Tremor-Ataxia syndrome
GRCh    Genome Reference Consortium Human Genome Build
ID      Intellectual Disability
IRR     In-Repeat Read
LGD     Likely Gene Disrupting
MZ      Monozygotic
NDD     Neurodevelopmental Disorder
NF1     Neurofibromatosis
PacBio  Pacific Biosciences
PCR     Polymerase Chain Reaction
pLI     Probability of Loss-of-Function Intolerant Rate
RAN     Repeat-Associated Non-ATG
SMRT    Single-Molecule Real-Time
SNV     Single Nucleotide Variant
STR     Short Tandem Repeat
TNR     Trinucleotide Repeat
TP      Triplet-repeat Primed
TR      Tandem Repeat
TRD     Tandem Repeat Disorder
UTR     Untranslated Region
WGS     Whole Genome Sequencing
XLID    X-Linked Intellectual Disability

Chapter 1

Introduction

1.1 Clinical presentation of Autism Spectrum Disorder (ASD)

Autism spectrum disorder (ASD) is a lifelong neurodevelopmental condition characterized by deficits in social communication and restricted, repetitive behaviours and interests (DSM-V, 2013). Onset of the disorder occurs during childhood, with symptoms usually appearing before the age of 3 (Ozonoff et al., 2008).