I IDENTIFYING GENETIC MECHANISMS of CARDIOMETABOLIC
Total Page:16
File Type:pdf, Size:1020Kb
IDENTIFYING GENETIC MECHANISMS OF CARDIOMETABOLIC TRAITS AND DISEASES USING QUANTITATIVE SEQUENCE DATA Martin L. Buchkovich A dissertation submitted to the faculty at the University of North Carolina at Chapel Hill in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Curriculum in Bioinformatics and Computational Biology. Chapel Hill 2015 Approved by: Praveen Sethupathy Yun Li Gregory E. Crawford Terrence S. Furey Karen L. Mohlke i ©2015 Martin L. Buchkovich ALL RIGHTS RESERVED ii ABSTRACT Martin L. Buchkovich: Identifying genetic mechanisms of cardiometabolic traits and diseases using quantitative sequence data (Under the direction of Karen L. Mohlke and Terrence S. Furey) Cardiometabolic diseases are a worldwide health concern. Genetics studies have identified hundreds of genetic loci associated with these diseases and other cardiometabolic risk factors, but gaps remain in the understanding of the biological mechanisms responsible for these associations. Sequence data from quantitative experiments, such as DNase-seq and ChIP-seq, that identify genomic regions regulating gene transcription are helping to fill these gaps. Allelic imbalance at heterozygous sites, or enrichment of one allele, in this data can indicate allelic differences in transcriptional regulation, but reference mapping biases present in sequence alignments prevent accurate allelic imbalance detection. We describe a pipeline, AA-ALIGNER, that removes mapping biases at heterozygous sites and increases allelic imbalance detection accuracy in samples with any amount of genotype data available. When complete genotype information is not available, AA-ALIGNER more accurately detects allelic imbalance at imputed heterozygous sites than heterozygous sites predicted using the sequence data. At predicted heterozygous sites, imbalance detection is more accurate at common variants than other variants. Additionally, imbalance detection with AA-ALIGNER is robust to a variety of experimental and analytical parameters. Using AA-ALIGNER, we detected evidence of allelic imbalance at 22,414 heterozygous sites in data from samples with relevance to cardiometabolic disease and risk factors. We have identified protein binding motifs for one of the imbalanced proteins at a majority of these sites, iii and evidence that imbalance in data for this protein is associated with imbalance in data for other proteins. Additionally, a subset of sites of allelic imbalance are located at expression quantitative trait loci and/or genome-wide association loci for cardiometabolic traits and diseases. These sites are strong candidates to be studied experimentally and we report experimental evidence of allelic differences in protein binding, enhancer activity and/or the regulation of specific genes for a handful of these sites. Using allelic imbalance detection, we have detected differences in protein binding across the genome providing valuable insight into mechanisms of transcriptional regulation. Focusing on cardiometabolic diseases and risk factors, this work demonstrates the utility of allelic imbalance detection in studying genetic effects on the regulation of gene transcription at complex disease- and trait-associated loci. iv To my family, friends, and colleagues who have fostered my love of genetics and made this work possible. v ACKNOWLEDGEMENTS This work is the direct result of a journey that started well before I enrolled in graduate school and subsequently became a Ph.D. candidate. Successful completion of this journey would not be possible without the aid, advice, encouragement and support of many individuals along the way. I value the friendship of all that I have encountered on this journey and take a moment here to acknowledge those who have hand a direct hand in the completion of this work. While a paragraph is inadequate to fully acknowledge their contributions to this work, I would first like to thank and acknowledge my mentors Karen Mohlke and Terry Furey. They provided a positive environment that allowed me to engage in exciting research and fostered my growth as an individual and scientist. I would especially like to thank Karen for providing me with an abundance of opportunities to deepen my understanding of human genetics and complex phenotypes through interactions with collaborators and participation in their research. I am grateful to Terry for broadening my understanding of genomics and computational biology, by continually encouraging me to delve deep into biological questions being asked and find the most appropriate solution for answering these questions. I would also like to acknowledge the contributions of my other committee members, Praveen Sethupathy, Yun Li, and Greg Crawford and thank them for their support and encouragement. Whether it was by advice given from miles away, during unannounced visits to their offices, or through participation in their courses, their advice, ideas, and knowledge were critical to this work. As a member of two labs, I have been privileged to interact with many individuals, who are too many to name. I am especially thankful to members of the Mohlke and Furey labs, both past and present, as I don’t know where I would be without their willingness to provide feedback vi on my work and share ideas. I would also like to acknowledge the many collaborators I have worked with who have expanded my horizons and pushed me to be a better researcher, as well as all others I have interacted with during my time as a graduate student. I will always be grateful for the footprints that my scientific mentors and colleagues have made in my life and will values the friendship we have made. This work would not have been possible without the support and encouragement of many individuals outside of science. I am especially grateful for members of my family. I would specifically like to acknowledge my aunt and uncle, Karen Buchkovich-Sass and Phil Sass, who have been very supportive of my scientific pursuits and facilitated my first extended exposure to molecular biology experiments, and Nicholas Buchkovich, the first member of my immediate family to receive a Ph.D. I thank my wife’s parents, Paul and Sally Baird, and would like to publically acknowledge them for the many hours of service given and miles driven to support my family as I have pursued my degree. Likewise, words cannot express enough my thanks to my parents Terry and Joan Buchkovich, for their continuous support and encouragement. They have given sacrificed much to foster my scientific career and I am sure that they would have sacrificed more had there been fewer miles between us. Finally and most importantly, I acknowledge my wife, Jennie Buchkovich, for her love, encouragement and support. Between working the night shift and giving up her weekends to support our family to acting as a single mother when deadlines neared, her sacrifices have been great. Without them, this work would not have been possible. I cannot thank her enough and will always be grateful to have her at my side. I am also grateful for my sons Kelten and Caleb, for providing me with greater motivation to succeed, than anything else. vii TABLE OF CONTENTS LIST OF TABLES .....................................................................................................................xii LIST OF FIGURES .................................................................................................................. xiii LIST OF ABBREVIATIONS ..................................................................................................... xiv CHAPTER 1: INTRODUCTION ................................................................................................ 1 1.1 Introduction .................................................................................................................. 1 1.2 Genetic variation contributes to cardiometabolic traits and diseases ............................ 1 1.3 Genetic variants at GWAS loci likely influence gene transcription regulation................ 2 1.4 Next-generation sequencing data is a powerful tool in genetic studies ......................... 4 1.5 Sequence mapping biases influence regulatory element identification ......................... 6 1.6 Several methods have been reported to remove reference mapping biases ................ 7 1.6.1 Biased mismatch threshold ................................................................................... 8 1.6.2 Variant masking .................................................................................................... 9 1.6.3 Dual reference genomes ....................................................................................... 9 1.6.4 Modified dual reference genome ..........................................................................10 1.6.5 Allele-aware reference creation ...........................................................................11 1.6.6 Allele-aware aligner .............................................................................................11 1.7 Overview of this work ..................................................................................................12 CHAPTER 2: REMOVING REFERENCE MAPPING BIASES USING LIMITED OR NO GENOTYPE DATA IDENTIFIES ALLELIC DIFFERENCES IN PROTEIN BINDING AT DISEASE-ASSOCIATED LOCI ................................................................................................14 2.1 Background.................................................................................................................14