Understanding and Improving the Identification of Somatic Variants
Vinaya Vijayan
Dissertation submitted to the faculty of the Virginia Polytechnic Institute and State
University in partial fulfillment of the requirements for the degree of
Doctor of Philosophy
In
Genetics, Bioinformatics and Computational Biology
Chair: Liqing Zhang
Christopher Franck
Lenwood S. Heath
Xiaowei Wu
August 12th, 2016
Blacksburg, VA, USA
Keywords: Somatic variants, Somatic variant callers, Somatic point mutations, Short tandem
repeat variation, Lung squamous cell carcinoma Understanding and Improving the Identification of Somatic Variants
Vinaya Vijayan
ABSTRACT
It is important to understand the entire spectrum of somatic variants to gain more insight into mutations that occur in different cancers for development of better diagnostic, prognostic and therapeutic tools. This thesis outlines our work in understanding somatic variant calling, improving the identification of somatic variants from whole genome and whole exome platforms and identification of biomarkers for lung cancer.
Integrating somatic variants from whole genome and whole exome platforms poses a challenge as variants identified in the exonic regions of the whole genome platform may not be identified on the whole exome platform and vice-versa. Taking a simple union or intersection of the somatic variants from both platforms would lead to inclusion of many false positives
(through union) and exclusion of many true variants (through intersection). We develop the first framework to improve the identification of somatic variants on whole genome and exome platforms using a machine learning approach by combining the results from two popular somatic variant callers. Testing on simulated and real data sets shows that our framework identifies variants more accurately than using only one somatic variant caller or using variants from only one platform. Short tandem repeats (STRs) are repetitive units of 2-6 nucleotides. STRs make up approximately 1% of the human genome and have been traditionally used as genetic markers in population studies. We conduct a series of in silico analyses using the exome data of 32 individuals with lung cancer to identify 103 STRs that could potentially serve as cancer diagnostic markers and 624 STRs that could potentially serve as cancer predisposition markers.
Overall these studies improve the accuracy in identification of somatic variants and highlight the association of STRs to lung cancer.
iii ACKNOWLEDGEMENTS
I would like to thank my mentor, Liqing Zhang, for letting me join her lab while continuing to work on my previous projects and mentoring me through those projects. She has been a terrific mentor and I will eternally be grateful to her for sharing her scientific wisdom every day and chocolates during our lab meetings.
I also thank my committee members, Lenwood Heath, Christopher Franck, Xiaowei Wu, for believing in me and for guiding me through tough times. They also provided consistent feedback and valuable insights in my projects.
My previous mentors, David Mittelman, Pawel Michalak, Andy Pereira, whose encouragement, support, critique, different styles of working, and different perspectives in science have enriched my experience and helped me grow as a scientist.
My initial mentors at National Chemical Laboratory, India, Chetan Gadgil and Mugdha Gadgil, who sparked my interest in research and encouraged me to pursue further studies.
My labmates over the years, Arjun and Zalman, who guided me in the right direction through the initial stages of my PhD. Endless discussions with Robin, Gareth, Mohammad, Tithi, Hong and
Gustavo about science and life not only helped me understand different aspects of science but also helped me grow as a person.