Population-Haplotype Models for Mapping and Tagging Structural Variation Using Whole Genome Sequencing

Eleni Loizidou

Submitted in part fulfilment of the requirements for the degree of Doctor of Philosophy

Section of Genomics of Common Disease
Department of Medicine
Imperial College London, 2018

Declaration of originality

I hereby declare that the thesis submitted for a Doctor of Philosophy degree is based on my own work. Proper referencing is given to the organisations/cohorts I collaborated with during the project.

Copyright Declaration

The copyright of this thesis rests with the author and is made available under a Creative Commons Attribution Non-Commercial No Derivatives licence. Researchers are free to copy, distribute or transmit the thesis on the condition that they attribute it, that they do not use it for commercial purposes and that they do not alter, transform or build upon it. For any reuse or redistribution, researchers must make clear to others the licence terms of this work.

Abstract

Scientific interest in copy number variation (CNV) is rapidly increasing, mainly due to evidence of its phenotypic effects and its contribution to disease susceptibility. Single nucleotide polymorphisms (SNPs), which are abundant in the human genome, have been widely investigated in genome-wide association studies (GWAS). Despite the notable genomic effects of both CNVs and SNPs, the correlation between them has been relatively understudied. In the past decade, next generation sequencing (NGS) has been the leading high-throughput technology for investigating CNVs and offers mapping at high resolution. We created a map of NGS-defined CNVs tagged by SNPs using the 1000 Genomes Project phase 3 (1000G) sequencing data to examine patterns between the two types of variation in protein-coding genes.
To investigate potential relationships between CNV-tagging SNPs and various phenotypes, we used SNPs reported for disease/phenotype associations in the GWAS catalog. Moreover, we applied our method to DIAGRAM consortium and Northern Finland Birth Cohort (NFBC) data. Our analysis replicated existing CNV-tagging SNPs but also revealed novel relationships between them in almost all the datasets we analysed. We have developed a statistical framework, from a population perspective, for fast and accurate CNV detection. Using 202 drug-target genes defined in collaboration with GlaxoSmithKline (GSK), we applied our framework to the 1000G data. We calculated summary statistics based on the detected CNV calls, including the allele frequency (AF) for each of the 26 populations of the 1000G. In addition, we visualised our results using UCSC genome browser visualisation tracks for all 202 regions and successfully benchmarked our CNV calls by comparing them to a gold standard set of the 1000G CNVs. Overall, in this thesis we present detailed maps of CNVs and CNV-tagging SNPs to enhance existing knowledge of their impact on the human genome.

To my parents, my everything

Acknowledgments

I would like to express my deepest gratitude to my supervisor, Dr. Inga Prokopenko, for her valuable advice, guidance and patience throughout my PhD journey. Her perfectionism and level of sophistication and experience were always a source of inspiration to me. Above all, I need to say thank you for teaching me how to be a true scientist and how to grow academically and scientifically.

I am also deeply grateful to my co-supervisor Dr. Evangelos Bellos, whose support and academic brilliance paved the way and made this long journey look shorter. His encouragement and calm reaction to every situation have been my source of power and strength. A thank you is not enough to express my appreciation for the faith you showed in me. I would also like to express my gratitude to my co-supervisors Prof. Michael Johnson and Dr. Lachlan J. M. Coin for their valuable advice and for accepting me into their research groups. Last but not least, I would like to truly thank Dr. Leonardo Bottolo, the person who initially believed in my abilities and who is indirectly responsible for me being where I am today.

My research project would never have been initiated and completed without the support of the Medical Research Council (MRC) and GlaxoSmithKline (GSK). I would therefore like to express my strongest appreciation to them both.

I owe special thanks to the Cyprus Institute of Neurology and Genetics, and specifically to Prof. Kyproula Christodoulou and Dr. George Spyrou, for generously hosting me at the Bioinformatics department in Cyprus for the last year of my PhD. Thank you for giving me the chance to attend the department’s conferences as a speaker and for treating me as part of your team.

This long adventure seemed shorter with the support of my colleagues, whom I am lucky to now call friends. Special thanks to Dr. Sadia Saeed, Dr. Marika Kaakinen, Dr. Amna Khamis, Charalambos Kkoufou, Mila Anasanti, Dr. Hutokshi Crouch, Abdullah Abdulshakur and Jani Heikkinen for the unforgettable moments, outings and unstoppable laughter. I would also like to thank Patricia Murphy for her assistance with several administrative issues since the first day of my PhD and for her support throughout the years. I am mostly thankful for being a member of an international department, which provided me with the opportunity to meet people from different cultures and mentalities.

Finally, since this thesis is the culmination of my PhD journey, I would like to say the biggest and warmest thank you to my family. They are my favourite people, who have always been my driving force, my inspiration, my inner power. My sisters Antigoni and Marina are a gift from God to me, and their love and support are always unconditional.
Thank you for constantly being by my side to encourage me, even during periods of worry and frustration. My husband George is the man I now share my life with and the person who proves every single day that true love exists. His patience during the three years we were living in different places so I could fulfil my dreams has given me the strength to keep going.

Even though a few words and a thank you will never be enough, I will try to express my deepest gratitude to the two people to whom I owe everything in life: my parents, who provided me with the greatest values. Just by being themselves, they taught me that the biggest achievement is to first be a good person and always treat others in the way you would like to be treated. They never stopped believing in me even when I doubted myself, and they supported my decisions no matter what. Through their actions, they proved that even with the greatest achievements, the important thing is to remain modest and keep working with passion and self-respect for the best outcome. Their passion for work and their love for humanity were the reasons I gained my scientific curiosity. Thank you for loving me unconditionally and giving me the “supplies” to be the person I am today and to have a successful future.

Table of Contents

Abstract
Acknowledgments
Table of Contents
List of figures
List of tables
List of Abbreviations
Chapter 1: Introduction
1.1. Human genome
1.1.1. Human genome variation – Single Nucleotide Polymorphisms (SNPs) and Structural Variation (SV)
1.1.2. CNV description
1.2. Sequencing the human genome
1.2.1. Uncovering CNVs
1.2.2. 1000 Genomes Sequencing Project (1000G)
1.2.3. Data generated by 1000 Genomes Project
1.2.4. Next generation DNA sequencing methods
1.2.5. Whole-genome and whole-exome sequencing
