University of Southampton
Faculty of Environmental and Life Sciences
Biological Sciences

Bioinformatic Analysis of Human Next Generation Sequencing Data; extracting Additional Information, Optimising Mapping and Variant Calling, and Application in a Rare Disease

Volume 1 of 1

By Roshan Kumar Sood

A thesis presented for the degree of Doctor of Philosophy
May 2019

Bioinformatic Analysis of Human NGS Data; extracting Additional Info., Optimising Mapping & Variant Calling, Application in Rare Disease Roshan Kumar Sood Page 2

University of Southampton
Abstract
Faculty of Environmental and Life Sciences
Biological Sciences
A thesis presented for the degree of Doctor of Philosophy
Bioinformatic Analysis of Human Next Generation Sequencing Data; extracting Additional Information, Optimising Mapping and Variant Calling, and Application in a Rare Disease
by Roshan Kumar Sood

With the increased application of Next Generation Sequencing (NGS) to medicine, it is important to test and develop approaches to extract the optimum information from datasets. In this thesis, five aspects of NGS are investigated, ranging from quality control to variant calling. Firstly, a method to estimate contamination from a VCF file was developed, which would be useful in cases where no BAM file is available for use with existing tools. Unmapped reads were investigated to extract additional information from NGS samples; this analysis detected the abundance of oral microbes in saliva samples relative to blood-collected samples, but failed to identify differences between inflammatory bowel disease patients and controls. For a familial trio with a reported rare case of Sedaghatian-type spondylometaphyseal dysplasia (SSMD), sequenced by both whole exome (WES) and whole genome (WGS) sequencing, it was shown that nearly all coding variants from WES were called in WGS despite differences in mean depth of coverage.
This comparison highlighted that, as sequencing costs decrease, WGS will offer the greatest diagnostic value, with potential for future re-analysis of cases that currently cannot be resolved. Using the familial trio, attempts were made to identify causal variant(s) in the gene currently implicated in causing SSMD, Glutathione peroxidase 4 (GPX4). However, no variants, either small SNPs or large structural variants, were identified over the GPX4 gene, and no plausible candidates were identified from the trio. Finally, variant calling of the FCGR low affinity locus was performed using targeted NGS. FCGR genes have been highly duplicated, and so by using customised references it was possible to infer the combinations of alleles across homologous sites. Using this approach, it was possible to predict SNPs in the FCGR3B gene and predict human neutrophil antigen haplotypes involved in the immune response to treatments such as monoclonal antibodies.

Contents

1 Introduction 25
  1.1 Next generation sequencing . . . 25
    1.1.1 First generation . . . 25
    1.1.2 Second generation . . . 27
    1.1.3 Third generation . . . 33
  1.2 Sequencing projects . . . 35
    1.2.1 Human Genome Project . . . 35
    1.2.2 dbSNP . . . 36
    1.2.3 HapMap . . . 36
    1.2.4 ENCODE project . . . 36
    1.2.5 1,000 Genomes . . . 38
    1.2.6 Other sequencing projects . . . 39
    1.2.7 100,000 Genomes project . . . 40
  1.3 Software pipelines . . . 41
    1.3.1 Pre-processing . . . 42
    1.3.2 Read mapping or assembly . . . 44
    1.3.3 Human reference sequence . . . 45
    1.3.4 Post-mapping . . . 46
    1.3.5 Variant calling . . . 46
    1.3.6 Variant annotation . . . 48
    1.3.7 Structural & copy number variants . . . 50
  1.4 Aims . . . 53
2 Estimating contamination levels in exome sequencing using alternate allele frequencies and variant zygosity 55
    2.0.1 Contamination estimation tools . . . 57
    2.0.2 Alternate allele frequency changes with contamination . . . 59
    2.0.3 Applications of machine learning with NGS . . . 63
    2.0.4 Regression models . . . 63
  2.1 Aims . . . 65
  2.2 Materials & methods . . . 66
    2.2.1 Contamination simulations . . . 66
    2.2.2 Alignment and variant calling pipeline . . . 68
    2.2.3 Alternate allele frequency profiles . . . 69
    2.2.4 Measurements used for contamination estimation . . . 71
    2.2.5 Investigating relationships of measurements . . . 72
    2.2.6 Principal component & clustering analysis . . . 72
    2.2.7 Regression analysis - model selection and training . . . 72
    2.2.8 Application of regression models . . . 73
  2.3 Results . . . 74
    2.3.1 Alternate allele frequency profiles and measurements used in contamination estimation . . . 74
    2.3.2 Principal component and clustering analysis . . . 82
    2.3.3 Training regression models . . . 84
    2.3.4 Application of regression models . . . 93
  2.4 Discussion . . . 105
  2.5 Conclusions . . . 111
3 Unmapped reads provide insights to potentially clinically important information and can distinguish collection methods 113
  3.1 Introduction . . . 113
    3.1.1 Unmapped reads . . . 113
    3.1.2 Cross-species contamination . . . 114
    3.1.3 Microbiome information from unmapped reads . . . 115
    3.1.4 Analysis programs . . . 118
    3.1.5 MEGAN6 . . . 119
    3.1.6 Sequencing methods . . . 119
  3.2 Aims . . . 120
  3.3 Materials & methods . . . 121
    3.3.1 Samples . . . 121
    3.3.2 Extraction of unmapped reads . . . 123
    3.3.3 FastQC . . . 124
    3.3.4 BLAST classification of sequences . . . 124
    3.3.5 Creation of taxonomic trees . . . 126
    3.3.6 Clustering analysis . . . 127
    3.3.7 Tandem repeat finder . . . 127
    3.3.8 Depth of coverage across bacterial genomes . . . 127
    3.3.9 Calculating sequence similarity between species . . . 127
    3.3.10 Plotting of bacterial genomes . . . 128
  3.4 Results . . . 129
    3.4.1 Extraction & classification of unmapped reads from exome case and control samples . . . 129
    3.4.2 Comparing collection methods . . . 140
    3.4.3 Unmapped reads of RNA-SEQ and WES from the same individual . . . 145
    3.4.4 Comparison of 16S rRNA sequencing with unmapped WES reads mapped to bacteria . . . 154
    3.4.5 Investigating Cronobacter sakazakii read matches . . . 155
    3.4.6 Mapping unmapped reads back against the human genome . . . 161
    3.4.7 Unclassifiable reads . . . 164
  3.5 Discussion . . . 165
  3.6 Conclusions . . . 174
4 Comparing variant calling using a whole exome and genome sequencing trio 175
  4.1 Introduction . . . 175
    4.1.1 Non-coding variant annotation . . . 176
    4.1.2 Limitations of WES compared to WGS . . . 179
    4.1.3 Variant calling with GATK HaplotypeCaller . . . 182
    4.1.4 ACMG genes . . . 183
  4.2 Aims . . . 183
  4.3 Materials & methods . . . 184
    4.3.1 Samples & sequencing . . . 184
    4.3.2 Quality control . . . 185
    4.3.3 Sample coverage . . . 187
    4.3.4 Variant calling . . . 188
    4.3.5 Annotation of VCF variants . . . 188
    4.3.6 Annotation of non-coding variants . . . 190
    4.3.7 GEMINI variant analysis . . . 190
    4.3.8 Structural and copy number variant pipeline . . . 191
    4.3.9 Comparison of exome and genome variants . . . 194
  4.4 Results . . . 195
    4.4.1 Quality control . . . 195
    4.4.2 Variant calling . . . 202
    4.4.3 Non-coding variants . . . 211
    4.4.4 GEMINI variant calls . . . 214
    4.4.5 Structural & copy number variants . . . 217
    4.4.6 Comparing whole exome with whole genome sequencing . . . 226
  4.5 Discussion . . . 234
  4.6 Conclusions . . . 241
5 Prioritisation of trio called variants 243
  5.1 Introduction . . . 243
    5.1.1 Using NGS to identify disease causing variants . . . 243
    5.1.2 Sedaghatian-type SpondyloMetaphyseal Dysplasia . . . 243
    5.1.3 Short-Rib polydactyly syndromes . . . 246
  5.2 Aims . . . 251
  5.3 Materials & methods . . . 252
    5.3.1 Family & sequencing . . . 252
    5.3.2 Variant calling . . . 253
    5.3.3 GPX4 variant, transcription factors and binding sites . . . 253
    5.3.4 GPX4 coverage . . . 254
    5.3.5.