Analysis and Interpretation of Next-Generation Sequencing Data for the Identification of Genetic Variants Involved in Cardiovascular Malformation
Darren T. Houniet
For the degree of Doctor of Philosophy
Newcastle University
Faculty of Medical Sciences
Institute of Genetic Medicine
February 2013

Abstract

Congenital cardiovascular malformation (CVM) affects 7/1000 live births. Approximately 20% of cases are caused by chromosomal and syndromic conditions. Rare Mendelian families segregating particular forms of CVM have also been described. Among the remaining 80% of non-syndromic cases, there is a familial predisposition, implicating as yet unidentified genetic factors. Since the reproductive consequences of CVM for an affected individual are usually severe, evolutionary considerations suggest that predisposing variants are likely to be rare. The overall aim of my PhD was to use next-generation sequencing (NGS) methods to identify such rare, potentially disease-causing variants in CVM.

First, I developed a novel approach to calculate the sensitivity and specificity of NGS data in detecting variants using publicly available population frequency data. My aim was to provide a method that would yield sound estimates of the quality of a sequencing experiment without the need for additional genotyping in the sequenced samples. I developed such a method and demonstrated that it provided comparable results to methods using microarray data as a reference. Furthermore, I evaluated different variant calling pipelines and showed that they have a large effect on sensitivity and specificity.

Following this, the NovoAlign-Samtools and BWA-Dindel pipelines were used to identify single base substitution and indel variants in three pedigrees, in each of which predisposition to a different disease appears to segregate following an autosomal dominant mode of inheritance. I identified potentially causative variants segregating with disease in all three of the pedigrees.
In the pedigrees with Dilated Cardiomyopathy and Hereditary Sclerosing Poikiloderma, these variants were in plausible candidate genes.

Finally, NGS was used to identify rare, potentially disease-causing indel variants in patients with sporadic, non-syndromic forms of CVM characterised by chamber hypoplasia. Two indel calling pipelines were used as a means to increase confidence in the identified indels. These two pipelines achieved the highest sensitivity in calling indels when evaluated using the method described above. In the 133 cases, evaluated for 403 candidate genes, indels were identified in 4 known causative genes for human cardiovascular disease, namely MYL1, NOTCH1, TNNT2, and DSC2.

Acknowledgements

First of all, I would like to thank both my supervisors, Prof. Bernard Keavney and Dr. Mauro Santibanez-Koref, for their fantastic supervision, guidance, support and encouragement throughout my PhD. In particular, thank you to Mauro: your perceptive nature and blunt, but seemingly always correct (!!), advice were much appreciated.

I would also like to thank everyone in the SIGM office: Prof. Heather Cordell, Dr. Ian Wilson, Yaobo Xu, Helen Griffin, Marla Endriga, Kristin Ayres, Jakris Eu-Ahsunthornwattana, Matthieu Miossec, Rebecca Darlay (or Baker), Richard Howey and Valentina Mamasoula. Your constant support has always helped. I would especially like to thank Kristin, Ian and Helen for what seemed like a constant supply of coffee, cakes and biscuits throughout my PhD. Also, thank you to Helen for your morning rants about the metro, weather and children; they got many a day started with a smile!

Thank you also to the extended members of the office and SGIM family, such as Dr. Joanna Elson and Katie Siddle. Katie remains a source of encouragement. Jo was always very encouraging and helped orchestrate many an extended lunch break where many a varied discussion occurred.

Thank you to all the members of the BHF group: Dr. Judith Goodship, Dr. Ana Topf, Dr. Thahira Rahman, Dr.
Elise Glen, Dr. Darroch Hall, Mr. Rafiqul Hussein, Dr. Ruairidh Martin, Dr. Danielle Brown, Mr. Mzwandile Mbele and Ms. Helen Weatherstone. I would like to thank Ana, Thahira, Elise, and Raf for always responding to my nagging questions for data and advice, particularly Elise, to whom I sent a great many emails! Mzwandile, it was great to have a fellow South African in the group! Helen, thank you for your support and endless help with anything that needed doing!

I would also like to thank both of my assessors, Heather and Dr. Simon Pearce. Your insight, support and guidance were very helpful.

Finally, thank you to my fiancé, Jen, for her constant support through my entire PhD. She put up with much moaning about my work and abilities!

Statement of contributions

Unless specified otherwise, all the work in my thesis is entirely my own. I performed all the testing and development of the method proposed in chapter 3, as well as the data analysis, variant filtering and candidate gene identification described in chapter 4 and chapter 5 of my thesis.

Dr. Mauro Santibanez-Koref dealt with the design of the statistical procedure of the method proposed in chapter 3. Dr. Matthew Hurles and Dr. Saeed Al Turki from The Wellcome Trust Sanger Institute provided 19 of the sample files used in chapter 3. They also provided the microarray data for these 19 samples. Sequencing of 12 of the samples used in chapter 3 and all of the samples in chapter 4 was performed by Dr. Thahira Rahman, Dr. Elise Glen, and Mr. Rafiqul Hussein.

Prof. Judith Goodship provided the samples and clinical information for the cases of Atrioventricular Septal Defect and Dilated Cardiomyopathy in chapter 3. Collaborators in South Africa, under Prof. Bongani Mayosi at the University of Cape Town, provided samples and information for the pedigree where cases presented with Hereditary Sclerosing Poikiloderma (HSP). Collaborators in Nantes, France, under Dr.
Sébastien Küry provided sequence information for the second HSP pedigree. Where possible, the identified variants were validated by Dr. Elise Glen and Dr. Thahira Rahman.

The study described in chapter 5 was part of a large, international collaboration involving researchers from across Europe, called HeartRepair. All 133 samples were provided by six centres located in The Netherlands (Academic Medical Center, Amsterdam and Leids Universitair Medisch Centrum, Leiden), England (The University of Newcastle, Newcastle Upon Tyne), Belgium (Katholieke Universiteit, Leuven, University of Leuven, Leuven), and Germany (Max Planck Institute for Molecular Genetics and the Max Delbrück Center for Molecular Medicine in Berlin). In particular, Dr. Alex Postma (AMC, Amsterdam), Mr. Alejandro Sifrim (Katholieke Universiteit, Leuven), Dr. Silke Sperling (Max Planck Institute for Molecular Genetics, Berlin), Prof. Sabine Klaasen (Max Delbrück Center, Berlin), Dr. Peter ten’ Hoen (LUMC, Leiden), Dr. Thahira Rahman, Mr. Rafiqul Hussein, Dr. Ana Topf, Dr. Darroch Hall, Dr. Judith Goodship and Prof. Bernard Keavney (Newcastle University) were all involved in the sample selection, sequencing, project design, identification and validation of single base substitutions. I provided some input for the identification of the single base substitutions; however, as described in chapter 5, I performed all analysis for the identification of indels.
Table of contents

Abstract
Acknowledgements
Contributions
Table of contents
List of figures
List of tables
Abbreviations

1 Introduction
  1.1 Summary
  1.2 Cardiovascular Malformations
    1.2.1 Clinical Epidemiology
    1.2.2 Risk factors
    1.2.3 Environmental risk factors
    1.2.4 Genetic epidemiology
    1.2.5 Mendelian forms
    1.2.6 Sporadic, non-syndromic forms
    1.2.7 Genetic study approaches
  1.3 NGS in Mendelian Diseases
  1.4 NGS in Complex Diseases
  1.5 NGS platforms
    1.5.1 Sanger Sequencing
    1.5.2 Roche 454 GS-FLX System
    1.5.3 Applied Biosystems SOLiD
    1.5.4 Illumina Genome Analyser
    1.5.5 Life Technologies Ion Torrent PGM
    1.5.6 Third generation sequencing
    1.5.7 Comparison of different sequencing platforms
  1.6 NGS Data Analysis
  1.7 Specific aims

2 Methods
  2.1 Methods overview
  2.2 Samples, target enrichment sequencing
    2.2.1 Whole exome sequencing in Mendelian family samples (studied in chapter 4)
    2.2.2 Targeted sequencing of unrelated cases of CVM (studied in chapter 5)
  2.3 Data analysis
    2.3.1 Computers
    2.3.2 Scripting
    2.3.3 Sequence analysis
    2.3.4 Accessory programmes

3 Using population data for assessing next-generation sequencing performance
  3.1 Aim
  3.2 Introduction
  3.3 Method
  3.4 Materials
    3.4.1 Sequence and genotype data
    3.4.2 Comparison of array and sequencing