Washington University in St. Louis
Washington University Open Scholarship
Arts & Sciences Electronic Theses and Dissertations
Arts & Sciences
Spring 5-15-2018

Discerning Drivers of Cancer: Computational Approaches to Somatic Exome Sequencing Data
Runjun Kumar
Washington University in St. Louis

Follow this and additional works at: https://openscholarship.wustl.edu/art_sci_etds
Part of the Bioinformatics Commons, and the Oncology Commons

Recommended Citation
Kumar, Runjun, "Discerning Drivers of Cancer: Computational Approaches to Somatic Exome Sequencing Data" (2018). Arts & Sciences Electronic Theses and Dissertations. 1552. https://openscholarship.wustl.edu/art_sci_etds/1552

This Dissertation is brought to you for free and open access by the Arts & Sciences at Washington University Open Scholarship. It has been accepted for inclusion in Arts & Sciences Electronic Theses and Dissertations by an authorized administrator of Washington University Open Scholarship. For more information, please contact [email protected].

WASHINGTON UNIVERSITY IN ST. LOUIS
Division of Biology and Biomedical Sciences
Computational and Systems Biology

Dissertation Examination Committee:
Ron Bose, Chair
Donald F. Conrad
Li Ding
Obi L. Griffith
Daniel C. Link
S. Joshua Swamidass

Discerning Drivers of Cancer: Computational Approaches to Somatic Exome Sequencing Data
by
Runjun D. Kumar

A dissertation presented to The Graduate School of Washington University in partial fulfillment of the requirements for the degree of Doctor of Philosophy

May 2018
St. Louis, Missouri

© 2018, Runjun D. Kumar

Table of Contents

List of Figures .... v
List of Tables .... vi
Acknowledgments .... vii
Abstract .... ix
1. Introduction .... 1
  1.1 Cancer Genome Sequencing .... 2
  1.2 Converting Mutations to Treatments .... 3
  1.3 Specific Approaches .... 5
    1.3.1 Identifying Cancer Genes .... 5
    1.3.2 Predicting Mutation Functional Impact .... 7
    1.3.3 Identifying Tumor Drivers in the Kinome .... 10
  1.4 Connectedness of Approaches .... 12
  1.5 Candidate Contributions .... 13
2. Identifying Cancer Genes .... 14
  2.1 Introduction .... 14
  2.2 Materials and Methods .... 15
    2.2.1 Data Gathering and Quality Control .... 15
    2.2.2 HiConf Cancer Gene Panel Construction .... 15
    2.2.3 Comparison Tools .... 16
    2.2.4 Calculation of Individual Tests .... 17
    2.2.5 Imputation of Missing Data .... 20
    2.2.6 Generation of Ensemble Model .... 20
    2.2.7 Assembly of Validation Gene Panels .... 21
    2.2.8 Cancer Subset Analysis .... 22
    2.2.9 Statistics and Software .... 22
    2.2.10 Data Availability .... 22
  2.3 Results .... 22
    2.3.1 Description of Data .... 22
    2.3.2 Developing a Panel of Known Cancer Genes .... 22
    2.3.3 Assessing Individual Tests .... 23
    2.3.4 Integration into a Single Model .... 25
    2.3.5 Detection of Validation Gene Panels .... 26
    2.3.6 Predicted Cancer Genes .... 27
    2.3.7 Application to Specific Cancer Types .... 28
  2.4 Discussion .... 30
3. Identifying Drivers with Parsimony .... 48
  3.1 Introduction .... 48
  3.2 Materials and Methods .... 49
    3.2.1 Data Gathering and Quality Control .... 49
    3.2.2 Mutation Level Descriptors .... 50
    3.2.3 Gene Level Descriptors .... 51
    3.2.4 Imputation and Data Scaling .... 51
    3.2.5 Adapting the Expectation-Maximization Algorithm .... 52
    3.2.6 Learning Initialization .... 52
    3.2.7 The E-step .... 53
    3.2.8 The M-step .... 55
    3.2.9 Algorithm Stop and Model Training .... 55
    3.2.10 Methodological Controls .... 56
    3.2.11 AUROCs for Measuring Performance .... 57
    3.2.12 Statistics and Software .... 57
    3.2.13 Code Availability & URLs .... 58
  3.3 Results .... 58
    3.3.1 ParsSNP overview .... 58
    3.3.2 Datasets & Analysis Design .... 59
    3.3.3 ParsSNP Training, Robustness and Performance .... 61
    3.3.4 Testing ParsSNP with Pan-Cancer Data .... 63
    3.3.5 Testing ParsSNP with Experimental Data .... 64
    3.3.6 Summary of ParsSNP Performance .... 65
    3.3.7 Application of ParsSNP to an Independent Dataset .... 66
    3.3.8 ParsSNP and Novel Driver Identification ....