Statistical Methods for Elucidating Copy Number Variation in High-Throughput Sequencing Studies

Evangelos Bellos

Submitted in part fulfilment of the requirements for the degree of Doctor of Philosophy

Department of Epidemiology and Biostatistics
School of Public Health
Imperial College London, 2014

The copyright of this thesis rests with the author and is made available under a Creative Commons Attribution Non-Commercial No Derivatives licence. Researchers are free to copy, distribute or transmit the thesis on the condition that they attribute it, that they do not use it for commercial purposes and that they do not alter, transform or build upon it. For any reuse or redistribution, researchers must make clear to others the licence terms of this work.

Abstract

Copy number variation (CNV) is pervasive in the human genome and has been shown to contribute significantly to phenotypic diversity and disease aetiology. High-throughput sequencing (HTS) technologies have allowed for the systematic investigation of CNV at an unprecedented resolution. HTS studies offer multiple distinct features that can provide evidence for the presence of CNV. We have developed an integrative statistical framework that jointly analyses multiple sequencing features at the population level to achieve sensitive and precise discovery of CNV. First, we applied our framework to low-coverage whole-genome sequencing experiments and used data from the 1000 Genomes Project to demonstrate a substantial improvement in CNV detection accuracy over existing methods. Next, we extended our approach to targeted HTS experiments, which offer improved cost-efficiency by focusing on a predetermined subset of the genome. Targeted HTS involves an enrichment step that introduces non-uniformity in sequencing coverage across target regions and thus hinders CNV identification.
To that end, we designed a customized normalization procedure that counteracts the effects of enrichment bias and enhances the underlying CNV signal. Our extended framework was benchmarked on contiguous capture datasets, where it was shown to outperform competing strategies by a wide margin. Capture sequencing can also generate large amounts of data in untargeted genomic regions. Although these off-target results can be a valuable source of CNV evidence, they are subject to complex enrichment patterns that confound their interpretation. Therefore, we developed the first normalization strategy that can adapt to the highly heterogeneous nature of off-target capture and thus facilitate CNV investigation in untargeted regions. All in all, we present a generalized CNV detection toolset that has been shown to achieve robust performance across datasets and sequencing platforms and can therefore provide valuable insight into the prevalence and impact of CNV.

Declaration of Originality

The work presented in this thesis has not been previously submitted for any degree, diploma or qualification. All of the work described here is my own, except for what is detailed below.

The RCA capture sequencing cohort presented in chapter 5 was generated in collaboration with several research groups in Singapore. The clinical samples were collected and managed by Dr Ching-Yu Cheng, Dr Chui Ming Gemmy Cheung and Dr Tien Wong of the Singapore Eye Research Institute. The sequencing of the samples was carried out by Dr Sonia Davila’s group at the Genome Institute of Singapore. Zai Yang Phua and Clarabelle Lin generated the sequencing libraries, while Dr Vikrant Kumar performed preliminary data processing, including quality control and read mapping. Lastly, Jordi Maggi was responsible for the qPCR experiments that were designed to validate the findings of my statistical analysis.
To my mother, Eleni, my guardian angel, my guiding light

Acknowledgments

First and foremost, I would like to thank my supervisor, Dr. Lachlan Coin, who inspired and guided me every step of the road. His brilliance illuminated the path of this long scientific journey, while his generosity allowed me to pave my own way. No idea was ever too bold and no obstacle insurmountable. He trusted and believed in me, even when my own confidence faltered. I am immensely grateful to call him a mentor and a friend.

I would also like to extend my deepest gratitude to my second supervisor, Dr. Michael Johnson, whose support and advice have been instrumental to the success of this endeavour. I owe special thanks to Dr. Sonia Davila for welcoming me into her lab in Singapore and entrusting me with a uniquely exciting research opportunity. Our fruitful collaboration was one of the most rewarding experiences of my PhD. Furthermore, I am grateful to Prof. Michael Levin, for taking me under his wing and offering me invaluable opportunities for professional development. As an honorary member of his research group, I had the privilege of interacting with brilliant and warm-hearted people, who have enriched both my academic and my personal life immeasurably.

The 4-year adventure that culminated in this PhD would not have been possible without the love and support of my friends. A proper expression of my gratitude could fill a tome, but I would be remiss if I didn’t mention a few of the wonderful people that have touched my life. My childhood friends, Maria and Alkmini, who have stood by me for as long as I can remember, even when we were separated by oceans and continents. I couldn’t imagine my life without our musical escapades. Adam, who has been there for me since our first day at Imperial. His capacity for empathy is only surpassed by his uncanny talent for Greek rhymes. Efi, with whom I share a bond that transcends friendship.
She keeps my cynicism in check and inspires me to see the world through her rose-coloured glasses. Tas, for our philosophical sparring matches that made my time in Singapore unforgettable and for embracing my corruptive influence. My brotherly friend GM, for his emotional generosity and his inexhaustible spirit. His infectious joie de vivre was often the brightest light at the end of the PhD tunnel. Last, but not least, I want to thank Myrsini, with whom I’ve shared this journey from the very beginning. We helped each other navigate through the labyrinth of academia and build fulfilling new lives in London. She has been my companion, my partner in crime and my confidante.

Finally, I am grateful to my family for being my refuge and my springboard. My cousin, Elena, for being the sister I never had. Her inner power, instinctive kindness and quiet determination have always been my frame of reference. Most of all, however, I would like to thank my father, Spiros, who has been my pillar of strength and my compass through life. He taught me dignity, self-respect and optimism in the face of adversity. He had unwavering faith in me when I doubted myself and challenged me to aim for the stars when I wanted to rest on my laurels. Moreover, he instilled in me a passion for Socratic debate and taught me the immense value of scepticism. Thus, I owe him my scientific curiosity and my sense of wonder at the natural world. He is my greatest champion and my most trusted critic. Without him I wouldn’t be the scientist or the man I am today.

Table of contents

ABSTRACT
ACKNOWLEDGMENTS
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS
CHAPTER 1 INTRODUCTION
1.1 STRUCTURAL VARIATION OVERVIEW
1.2 CNV ORIGINS
1.3 UNCOVERING CNV
1.3.1 Microscopy-based technologies
1.3.2 Microarray-based technologies
1.3.2.1 Array CGH
1.3.2.2 SNP arrays
1.3.2.3 Limitations
1.3.3 PCR-based technologies

