Biological Role and Disease Impact of Copy Number Variation in Complex Disease
University of Pennsylvania ScholarlyCommons
Publicly Accessible Penn Dissertations
2014

Biological Role and Disease Impact of Copy Number Variation in Complex Disease
Joseph Glessner
University of Pennsylvania, [email protected]

Follow this and additional works at: https://repository.upenn.edu/edissertations
Part of the Bioinformatics Commons, and the Genetics Commons

Recommended Citation
Glessner, Joseph, "Biological Role and Disease Impact of Copy Number Variation in Complex Disease" (2014). Publicly Accessible Penn Dissertations. 1286. https://repository.upenn.edu/edissertations/1286

This paper is posted at ScholarlyCommons. https://repository.upenn.edu/edissertations/1286 For more information, please contact [email protected].

Biological Role and Disease Impact of Copy Number Variation in Complex Disease

Abstract
In the human genome, DNA variants give rise to a variety of complex phenotypes. Ranging from single base mutations to copy number variations (CNVs), many of these variants are neutral in selection and disease etiology, making difficult the detection of true common or rare frequency disease-causing mutations. However, allele frequency comparisons in cases, controls, and families may reveal disease associations. Single nucleotide polymorphism (SNP) arrays and exome sequencing are popular assays for genome-wide variant identification. To limit bias between samples, uniform testing is crucial, including standardized platform versions and sample processing. Bases occupy single points while copy variants occupy segments. Bases are bi-allelic while copies are multi-allelic. One genome also encodes many different cell types. In this study, we investigate how CNV impacts different cell types, including heart, brain and blood cells, all of which serve as models of complex disease.
Here, we describe ParseCNV, a systematic algorithm specifically developed as a part of this project to perform more accurate disease associations using SNP arrays or exome sequencing-generated CNV calls with quality tracking of variants, contributing to each significant overlap signal. Red flags of variant quality, genomic region, and overlap profile are assessed in a continuous score and shown to correlate over 90% with independent verification methods. We compared these data with our large internal cohort of 68,000 subjects, with carefully mapped CNVs, which gave a robust rare variant frequency in unaffected populations. In these investigations, we uncovered a number of loci in which CNVs are significantly enriched in non-coding RNA (ncRNA), Online Mendelian Inheritance in Man (OMIM), and genome-wide association study (GWAS) regions, impacting complex disease. By evaluating thoroughly the variant frequencies in pediatric individuals, we subsequently compared these frequencies in geriatric individuals to gain insight into these variants' impact on lifespan. Longevity-associated CNVs enriched in pediatric patients were found to aggregate in alternative splicing genes. Congenital heart disease is the most common birth defect and cause of infant mortality. When comparing congenital heart disease families, with cases and controls genotyped both on SNP arrays and exome sequencing, we uncovered significant and confident loci that provide insight into the molecular basis of disease. Neurodevelopmental disease affects the quality of life and cognitive potential of many children. In the neurodevelopmental and psychiatric diseases, CACNA, GRM, CNTN, and SLIT gene families show multiple significant signals impacting a large number of developmental and psychiatric disease traits, with the potential of informing therapeutic decision-making.
Through new tool development and analysis of large disease cohorts genotyped on a variety of assays, I have uncovered an important biological role and disease impact of CNV in complex disease.

Degree Type: Dissertation
Degree Name: Doctor of Philosophy (PhD)
Graduate Group: Genomics & Computational Biology
First Advisor: Hakon Hakonarson
Keywords: complex disease, copy number variation, exome sequencing, microarray, single nucleotide polymorphism
Subject Categories: Bioinformatics | Genetics

This dissertation is available at ScholarlyCommons: https://repository.upenn.edu/edissertations/1286

BIOLOGICAL ROLE AND DISEASE IMPACT OF COPY NUMBER VARIATION IN COMPLEX DISEASE
Joseph T. Glessner
A DISSERTATION in Genomics and Computational Biology
Presented to the Faculties of the University of Pennsylvania in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy
2014

Supervisor of Dissertation: Hakon Hakonarson, M.D., Ph.D., Associate Professor of Pediatrics
Graduate Group Chairperson: Li-San Wang, Ph.D., Associate Professor, Pathology and Laboratory Medicine
Dissertation Committee:
John Maris, M.D., Professor of Pediatrics
Mingyao Li, Ph.D., Associate Professor of Biostatistics
Sharon Diskin, Ph.D., Assistant Professor of Pediatrics
Marcella Devoto, Ph.D., Professor of Pediatrics

BIOLOGICAL ROLE AND DISEASE IMPACT OF COPY NUMBER VARIATION IN COMPLEX DISEASE
COPYRIGHT 2014 Joseph T. Glessner
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License. To view a copy of this license, visit http://creativecommons.org/licenses/by-ny-sa/2.0/

Acknowledgment
I am indebted to my mentor Hakon Hakonarson for seeing great potential in me and for guiding me in achieving these great projects and results. Together, we have explored the genetic landscape of structural variations in complex disease and it has been an exciting journey all along the way.
I am appreciative of my dissertation committee: John Maris, Mingyao Li, Sharon Diskin, and Marcella Devoto for challenging me to explain the key novel points of my dissertation work. I thank John Maris especially for being a guiding light during the dissertation process. I thank Nate Berkowitz for outstanding discussion and support through the classes. Fan Li, Scott Sherill-Mix, and Yih-Chii Hwang gave excellent advice and guidance. Thanks to Maja Bucan who helped me through the whole process, always eager to hear what students thought about each aspect of the graduate program and how to improve. Thanks to PCGC, Alex Bick and Wendy Chung, for outstanding support. I thank my colleagues and mentors over the years: Matt Maccani, Jack Bozzuffi, Harry Padden, Randy Zauhar, Struan Grant, and Kai Wang. You have influenced my career in many incredible ways. To my family, especially Dave, I am grateful for your support and believing in me and being there through challenging times. I would like to thank Alison Miller for writing an outstanding book on how to approach and manage the dissertation process, which has been of great value to me.

ABSTRACT
BIOLOGICAL ROLE AND DISEASE IMPACT OF COPY NUMBER VARIATION IN COMPLEX DISEASE
Joseph Glessner
Hakon Hakonarson

In the human genome, DNA variants give rise to a variety of complex phenotypes. Ranging from single base mutations to copy number variations (CNVs), many of these variants are neutral in selection and disease etiology, making difficult the detection of true common or rare frequency disease-causing mutations. However, allele frequency comparisons in cases, controls, and families may reveal disease associations. Single nucleotide polymorphism (SNP) arrays and exome sequencing are popular assays for genome-wide variant identification. To limit bias between samples, uniform testing is crucial, including standardized platform versions and sample processing. Bases occupy single points while copy variants occupy segments.
Bases are bi-allelic while copies are multi-allelic. One genome also encodes many different cell types. In this study, we investigate how CNV impacts different cell types, including heart, brain and blood cells, all of which serve as models of complex disease. Here, we describe ParseCNV, a systematic algorithm specifically developed as a part of this project to perform more accurate disease associations using SNP arrays or exome sequencing-generated CNV calls with quality tracking of variants, contributing to each significant overlap signal. Red flags of variant quality, genomic region, and overlap profile are assessed in a continuous score and shown to correlate over 90% with independent verification methods. We compared these data with our large internal cohort of 68,000 subjects, with carefully mapped CNVs, which gave a robust rare variant frequency in unaffected populations. In these investigations, we uncovered a number of loci in which CNVs are significantly enriched in non-coding RNA (ncRNA), Online Mendelian Inheritance in Man (OMIM), and genome-wide association study (GWAS) regions, impacting complex disease. By evaluating thoroughly the variant frequencies in pediatric individuals, we subsequently compared these frequencies in geriatric individuals to gain insight into these variants’ impact on lifespan. Longevity-associated CNVs enriched in pediatric patients were found to aggregate in alternative splicing genes. Congenital heart disease is the most common birth defect and cause of infant mortality. When comparing congenital heart disease families, with cases and controls genotyped both on SNP arrays and exome sequencing, we uncovered significant and confident loci that provide insight into the molecular basis of disease. Neurodevelopmental disease affects the quality of life and cognitive potential of many children. In the neurodevelopmental and psychiatric diseases, CACNA, GRM, CNTN, and