Understanding and Improving the Identification of Somatic Variants
Total Page:16
File Type:pdf, Size:1020Kb
Understanding and Improving the Identification of Somatic Variants Vinaya Vijayan Dissertation submitted to the faculty of the Virginia Polytechnic Institute and State University in partial fulfillment of the requirements for the degree of Doctor of Philosophy In Genetics, Bioinformatics and Computational Biology Chair: Liqing Zhang Christopher Franck Lenwood S. Heath Xiaowei Wu August 12th, 2016 Blacksburg, VA, USA Keywords: Somatic variants, Somatic variant callers, Somatic point mutations, Short tandem repeat variation, Lung squamous cell carcinoma Understanding and Improving the Identification of Somatic Variants Vinaya Vijayan ABSTRACT It is important to understand the entire spectrum of somatic variants to gain more insight into mutations that occur in different cancers for development of better diagnostic, prognostic and therapeutic tools. This thesis outlines our work in understanding somatic variant calling, improving the identification of somatic variants from whole genome and whole exome platforms and identification of biomarkers for lung cancer. Integrating somatic variants from whole genome and whole exome platforms poses a challenge as variants identified in the exonic regions of the whole genome platform may not be identified on the whole exome platform and vice-versa. Taking a simple union or intersection of the somatic variants from both platforms would lead to inclusion of many false positives (through union) and exclusion of many true variants (through intersection). We develop the first framework to improve the identification of somatic variants on whole genome and exome platforms using a machine learning approach by combining the results from two popular somatic variant callers. Testing on simulated and real data sets shows that our framework identifies variants more accurately than using only one somatic variant caller or using variants from only one platform. Short tandem repeats (STRs) are repetitive units of 2-6 nucleotides. STRs make up approximately 1% of the human genome and have been traditionally used as genetic markers in population studies. We conduct a series of in silico analyses using the exome data of 32 individuals with lung cancer to identify 103 STRs that could potentially serve as cancer diagnostic markers and 624 STRs that could potentially serve as cancer predisposition markers. Overall these studies improve the accuracy in identification of somatic variants and highlight the association of STRs to lung cancer. !iii ACKNOWLEDGEMENTS I would like to thank my mentor, Liqing Zhang, for letting me join her lab while continuing to work on my previous projects and mentoring me through those projects. She has been a terrific mentor and I will eternally be grateful to her for sharing her scientific wisdom every day and chocolates during our lab meetings. I also thank my committee members, Lenwood Heath, Christopher Franck, Xiaowei Wu, for believing in me and for guiding me through tough times. They also provided consistent feedback and valuable insights in my projects. My previous mentors, David Mittelman, Pawel Michalak, Andy Pereira, whose encouragement, support, critique, different styles of working, and different perspectives in science have enriched my experience and helped me grow as a scientist. My initial mentors at National Chemical Laboratory, India, Chetan Gadgil and Mugdha Gadgil, who sparked my interest in research and encouraged me to pursue further studies. My labmates over the years, Arjun and Zalman, who guided me in the right direction through the initial stages of my PhD. Endless discussions with Robin, Gareth, Mohammad, Tithi, Hong and Gustavo about science and life not only helped me understand different aspects of science but also helped me grow as a person. !iv Friends I have made in Blacksburg, Abhijit, Siddhesh, Deepanshu, Raj, Revathy, Sarthak, Saloni, Primal, Nistha and Khushboo who have helped make Blacksburg home away from home. Faizan, for all the weekend phone calls, for not letting distance or the fact that we have not met in around 8 years ever get in the way of an amazing friendship. Shruti, for patiently listening to my endless rants and for being an incredible source of answers for my questions on the personal and professional front. My parents, Vimala and N.K. Vijayan for encouraging and backing up all our dreams through their love, hard work and perseverance. All I can say is thank you and hope those words convey my gratitude towards them. My brother, Vikas, for providing the comic relief in my life regardless of whether it was required. My husband, Nikhil, for his unwavering support and sharing this roller coaster ride of my PhD and life! !v TABLE OF CONTENTS Literature Review 1 1.1 Human Genome Project 2 1.2 Next Generation Sequencing Platforms 2 1.3 Germline and Somatic Variants 3 1.4 Uses of Sequencing 4 1.5 Next Generation Sequencing Collaborations 5 1.6 Mapping Tools 6 1.7 Germline Variant Calling Tools 7 1.8 Somatic Variant Calling Tools 8 1.9 Short Tandem Repeat Calling Tools 9 1.10 Research Plan 10 Evaluation of pipelines detecting somatic point variants and analysis of factors affecting the detection 12 2.1 Abstract 13 2.2 Introduction 14 2.3 Methods 16 2.3.1 Generating Somatic Data sets 16 2.3.2 Algorithms Used 18 2.3.3 Real Data 19 2.4 Results 19 2.4.1 Sensitivity and precision results for pipelines that detect somatic point mutations for exomes 19 2.4.2 Sensitivity and precision results for pipelines that detect somatic point mutations for genomes 22 2.4.3 Causes for undetected true somatic variants by pipelines in exomes 23 2.4.5 Sensitivity and precision vs. tumor sequencing depth for exomes and genomes 25 2.4.6 Sensitivity vs. allele fraction for exomes and genomes 25 2.4.7 Comparing pipelines using real data 26 2.4.8 Number of germline variants misidentified as somatic variants 26 2.5 Discussion 27 2.6 Conclusion 30 2.7 Abbreviations 31 !vi Framework for integration of genome and exome data to improve the identification of somatic variants 46 3.1 Abstract 47 3.2 Introduction 48 3.3 Results 50 3.3.1 Number of somatic variants identified by callers individually 50 3.3.2 Results from different machine learning models 51 3.3.3 Reason for integration of multiple tools and multiple data sets 52 3.3.4 Results for cross-contamination of normal samples 53 3.3.5 Comparison with similar tools 53 3.3.6 Real data validation 54 3.3.7 Robustness of the ensemble method 55 3.4 Discussion 56 3.5 Methods 60 3.5.1 Generating simulated data set 60 3.5.2 Building training and test sets for simulated data 61 3.5.3 Models used to identify somatic variants 61 3.5.4 Building training and test sets for real data 63 3.6 Conclusions 64 3.7 Abbreviations 65 Identifying Short Tandem Repeat Genetic Markers for Lung Squamous Cell Carcinoma 86 4.1 Abstract 87 4.2 Introduction 88 4.3 Results 91 4.3.1 Analysis of STR regions 92 4.3.2 Gene Expression Analysis 93 4.3.3 Functional Annotation Analysis 94 4.4 Discussion 95 4.5 Methods 99 4.5.1 Real Data 99 4.5.2 STR regions 100 4.5.3 STR Analysis in Exomes 101 4.5.4 Gene Expression Analysis 101 4.5.5 Functional Annotation 102 4.6 Conclusion 102 vii! 4.7 Abbreviations 103 Conclusion and Future Directions 148 5.1 Understanding of somatic variants 149 5.2 Improving identification of somatic variants 150 5.3 Identification of short tandem repeats as new biomarkers for cancer 152 5.4 Conclusion 152 Bibliography 155 !viii LIST OF FIGURES Figure 2.1: Sensitivity, precision and F1 score of different pipelines in detecting somatic point mutations in exomes 32 Figure 2.2: Sensitivity, precision and F1 score of different pipelines detecting somatic point mutations in genomes 33 Figure 2.3: Factors affecting the detection of somatic point mutations. 34 Figure 2.4: Sensitivity of different pipelines in detecting somatic point mutations using a high quality exome data set 35 Figure 2.5: Sensitivity and precision as a function of tumor sequencing depth of different pipelines while detecting somatic point mutations in exomes 36 Figure 2.6: Sensitivity as a function of tumor allele fraction while detecting somatic point mutations in exomes 37 Figure 2.7: Mean concordance (in percentages) of somatic variants over three samples of real data between whole exome sequences (WXS) and whole genome sequences (WGS) 38 Figure 2.8: Percentage of germline variants misidentified as somatic variants in real data sets 39 Figure 2.9 : Generating sensitivity and specificity data set 40 Figure S2.1: False positives and false negatives identified by different pipelines while detecting somatic point mutations for exomes 41 Figure S2.2: Number of false positives and false negatives identified by the pipelines while detecting somatic point mutations for genomes 42 Figure 3.1: F1 score for different machine learning models 66 Figure 3.2: Sensitivity, precision and F1 score with MuTect, SomaticSniper, VarScan2, VCMM, and J48 67 Figure 3.3: Distribution of the true positives and false negatives across the depth and allele fractions of whole genome and whole exome 68 Figure 3.4: Performance comparison of somatic variant identification for single platform, i.e., whole genome (WGS) or whole exome (WXS) versus ensemble method 69 !ix Figure 3.5: Performance comparison of SomaticSeq versus our ensemble method on only whole genome platform 70 Figure 3.6: Sensitivity, precision and F1 scores based on different training sets 71 Figure 3.7: Distribution of simulated somatic point mutations across different allele fractions and depths on whole genome and whole exome platforms