Uncovering the Variability, Regulatory Roles and Mutation Rates of Short Tandem Repeats by Thomas F
Total Page:16
File Type:pdf, Size:1020Kb
Uncovering the variability, regulatory roles and mutation rates of short tandem repeats by Thomas F. Willems B.S. in Chemical Engineering, University of California, Berkeley (2011) Submitted to the Computational and Systems Biology Program in partial fulfillment of the requirements for the degree of Doctor of Philosophy at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY June 2016 ○c Massachusetts Institute of Technology 2016. All rights reserved. Author............................................................................ Computational and Systems Biology Program April 26, 2016 Certified by . Yaniv Erlich, PhD Assistant Professor of Computer Science, Columbia University Thesis Supervisor Certified by . Manolis Kellis, PhD Associate Professor of Computer Science, MIT Thesis Supervisor Accepted by....................................................................... Christopher B. Burge, PhD Professor of Biology and Biological Engineering Director, Computational and Systems Biology Graduate Program 2 Uncovering the variability, regulatory roles and mutation rates of short tandem repeats by Thomas F. Willems Submitted to the Computational and Systems Biology Program on April 26, 2016, in partial fulfillment of the requirements for the degree of Doctor of Philosophy Abstract Over the past decade, the advent of next-generation DNA sequencing technologies has ushered in an exciting era of biological research. Through large-scale sequencing projects, scientists have begun to unveil the variability and function of millions of DNA mutations called single nucleotide polymorphisms. Despite this rapid growth in understanding, short tandem repeats (STRs), ge- nomic elements consisting of a repeating pattern of 2-6 bases, have remained poorly understood. Mutating orders of magnitude more rapidly than most of the human genome, STRs have been identified as the causal variants in diseases such as Fragile X syndrome and Huntington’s disease. However, in spite of their potentially profound biological consequences, STRs remain systemat- ically understudied due to difficulties associated with obtaining accurate genotypes. To address this issue, we developed a series of bioinformatics approaches and applied them to population- scale whole-genome sequencing data sets. Using data from the 1000 Genomes Project, we performed the first genome-wide characterization of STR variability by analyzing over 700,000 loci in more than 1000 individuals. Next, we integrated these genotypes with expression data to assess the contribution of STRs to gene expression in humans, uncovering their substantial regulatory role. We then developed a state-of-the-art algorithm to genotype STRs, resulting in vastly improved accuracy and uncovering hundreds of replicable de novo mutations in a deeply sequenced trio. Lastly, we developed a novel approach to estimate mutation rates for STRs on the Y-chromosome (Y-STR), resulting in rates for hundreds of previously uncharacterized markers. Collectively, these analyses highlight the extreme variability of STRs and provide a framework for incorporating them into future studies. Thesis Supervisor: Yaniv Erlich, PhD Title: Assistant Professor of Computer Science, Columbia University Thesis Supervisor: Manolis Kellis, PhD Title: Associate Professor of Computer Science, MIT 3 4 Acknowledgments My journey through graduate school never would have started, let alone finished, if not for an enormous number of people that have supported and inspired me along the way. I sincerely want to thank the following people that have made the past five years a fantastic experience: ∙ Yaniv Erlich, for believing in me, inspiring me and teaching me how to do great science, and for introducing me to the exciting world of genomics. ∙ Manolis Kellis, for adopting me into his lab and for his limitless positive energy and sound scientific advice. ∙ David Housman and Laurie Boyer, for their terrific scientific guidance throughout my PhD. ∙ My labmates: Melissa Gymrek, for teaching me all she knows about STRs and bioinfor- matics and for being a great friend and colleague. Dina Zielinski, for lending her wet lab expertise and challenging me to swimming duels. Assaf Gordon, for his never ending wise programming advice and generosity. Sophie Zaaijer, for always welcoming me and entertaining me at the NYGC. And to all the other members of the Erlich lab that have come and gone, thank you for making it a fun and exciting place to work. ∙ The Whitehead Institute, for providing a fantastic environment for young scientists. ∙ Area Four coffee, for providing the fuel that has sustained me throughout graduate school. ∙ Eric Zhu, for being a sage and tireless friend. ∙ Rosanna Lim and Aanchal Jain, for much needed relaxation during Friday night dinners. ∙ Lionel Lam, for providing timely musical relief and humor. ∙ Xiao Su, for teaching me the value of a coffee or donut break. ∙ Mark, Sean, Steven, William, Karthik and David, for many much needed nights out in Boston. ∙ Jacquie Carota, for her tireless work as our program administrator. ∙ My brother Nick, my sister Julie and my brother-in-law Laurens, for their love and support throughout graduate school. ∙ My girlfriend Su Luo, whose love has carried me through difficult times and has inspired me to aim for new heights. 5 ∙ My parents Paul and Carine, for molding me into the person I am today and prioritizing my education. Mom and Dad, without your endless advice and support, none of this would ever have been possible. 6 Contents 1 Introduction 13 1.1 Overview...................................... 13 1.2 Defining an STR.................................. 15 1.3 STR applications.................................. 16 1.4 Population-scale STR variation.......................... 17 1.5 STRs and complex traits............................. 18 1.6 STR genotyping methods............................. 19 1.7 Y-STR mutation rates............................... 23 2 The landscape of human STR variation 25 2.1 Introduction.................................... 25 2.2 Results....................................... 27 2.2.1 Identifying STR loci in the human genome................ 27 2.2.2 Profiling STRs in 1000 Genomes samples................. 28 2.2.3 Quality assessment............................ 29 2.2.4 Validation using population genetics trends................ 34 2.2.5 Patterns of STR variation......................... 34 2.2.6 The prototypical STR........................... 37 2.2.7 STRs in the NCBI reference and LoF analysis.............. 38 2.2.8 Linkage disequilibrium between STRs and SNPs............. 39 2.3 Discussion..................................... 40 2.4 Methods...................................... 42 2.4.1 Call set generation............................. 42 2.4.2 Estimating the number of samples per locus and number of loci per sample 43 2.4.3 Saturation analysis............................ 43 2.4.4 Mendelian inheritance........................... 44 2.4.5 Capillary electrophoresis comparison................... 44 2.4.6 Heterozygosity calculations........................ 45 2.4.7 Summary statistic comparisons...................... 45 7 2.4.8 Comparison of population heterozygosity................. 45 2.4.9 Deviation of lobSTR calls from the NCBI reference........... 46 2.4.10 Sample clustering............................. 46 2.4.11 STR variability trends........................... 46 2.4.12 Extraction of orthologous chimp STR lengths.............. 47 2.4.13 Rst levels.................................. 47 2.4.14 Assessing linkage disequilibrium...................... 47 2.5 Acknowledgments................................. 48 2.6 Supplemental Text................................. 48 2.6.1 Finding putative STR loci in the human genome............. 48 2.6.2 Comparison of STR thresholds to prior studies.............. 49 2.6.3 Incorporation of annotated STRs..................... 49 2.6.4 Assessment of empirical score thresholds with a permissive call set... 50 2.6.5 Call set integration............................ 51 2.6.6 Homopolymer STRs............................ 52 2.7 Supplemental Tables................................ 53 2.8 Supplemental Figures............................... 57 3 Abundant contribution of short tandem repeats to gene expression variation in humans 67 3.1 Introduction.................................... 68 3.2 Results....................................... 69 3.2.1 Initial genome-wide discovery of eSTRs.................. 69 3.2.2 Partitioning the contribution of eSTR and nearby variants........ 72 3.2.3 The effect of eSTRs in the context of individual SNP eQTLs...... 73 3.2.4 Integrative genomic evidence for a functional role of eSTRs....... 74 3.2.5 The potential role of eSTRs in human conditions............ 77 3.3 Discussion..................................... 79 3.4 Acknowledgements................................. 80 3.5 Author contributions................................ 81 3.6 Methods...................................... 81 3.6.1 Genotype datasets............................. 81 3.6.2 Targeted sequencing of promoter region STRs.............. 81 3.6.3 Expression datasets............................ 82 8 3.6.4 eQTL association testing......................... 82 3.6.5 Controlling for gene-level false discovery rate............... 83 3.6.6 Partitioning heritability using linear mixed models............ 84 3.6.7 Comparing to the lead eSNP....................... 85 3.6.8 Conservation analysis........................... 86 3.6.9 Enrichment of STRs and eSTRs in predicted enhancers........