Phd Degree in Molecular Medicine (Curriculum in Computational Biology)
Total Page:16
File Type:pdf, Size:1020Kb
PhD degree in Molecular Medicine (curriculum in Computational Biology) European School of Molecular Medicine (SEMM), University of Milan and University of Naples “Federico II” Settore Disciplinare: MED/04 Computational frameworks for the identification of somatic and germline variants contributing to cancer predisposition and development Giorgio Enrico Maria Melloni IIT@SEMM, Milan Matricola n. R10338 Supervisor: Prof. Piergiuseppe Pelicci IEO, Milan Added Supervisor: Dr. Laura Riva IIT@SEMM, Milan Anno accademico 2015-2016 2 TABLE OF CONTENTS 1 ABSTRACT ......................................................................................................................................... 5 2 INTRODUCTION ............................................................................................................................... 6 2.1 CANCER AS AN EVOLUTIONARY PROCESS .................................................................................................... 6 2.2 ACCUMULATING DRIVER MUTATIONS .......................................................................................................... 8 2.3 TUMOR HETEROGENEITY ............................................................................................................................. 10 2.4 CANCER GENOME LANDSCAPES ................................................................................................................. 11 2.5 DRIVER VS PASSENGER: A PROBLEM OF MUTATION RATE ....................................................................... 13 2.6 CANCER GENOMICS AND HUMAN GENETICS ............................................................................................. 14 2.7 CANCER GENOMICS IN THE NGS ERA ........................................................................................................ 16 3 MATERIAL AND METHODS ........................................................................................................ 17 3.1 DATA FORMAT ............................................................................................................................................. 17 3.1.1 VCF format ............................................................................................................................................ 17 3.1.2 MAF format ........................................................................................................................................... 18 3.2 DATA RETRIEVAL .......................................................................................................................................... 19 3.2.1 TCGA ........................................................................................................................................................ 19 3.2.2 ICGC ......................................................................................................................................................... 19 3.2.3 cBioPortal .............................................................................................................................................. 20 3.2.4 COSMIC and CGC ................................................................................................................................ 20 3.2.5 ExAC ......................................................................................................................................................... 20 3.3 DATA PROCESSING AND MANIPULATION ................................................................................................. 21 4 RESULTS ......................................................................................................................................... 22 4.1 DOTS-FINDER: A COMPREHENSIVE TOOL TO ASSESSING DRIVER GENES IN CANCER GENOMES .......... 22 4.1.1 Abstract .................................................................................................................................................. 22 4.1.2 Introduction .......................................................................................................................................... 23 4.1.3 Implementation ................................................................................................................................... 25 4.1.3.1 Overview of DOTS-Finder ....................................................................................................................................... 25 4.1.3.2 The Functional Step: finding tumor suppressor gene and oncogene candidates ........................... 30 4.1.3.3 The Frequentist step: assessing the possible drivers ................................................................................... 35 4.1.4 Material and Methods ....................................................................................................................... 36 4.1.4.1 Availability .................................................................................................................................................................... 36 4.1.4.2 Input Format ................................................................................................................................................................ 36 4.1.4.3 Requirements .............................................................................................................................................................. 37 4.1.4.4 Mutation data .............................................................................................................................................................. 37 4.1.4.5 Databases ..................................................................................................................................................................... 38 4.1.4.6 DOTS-Finder step by step ....................................................................................................................................... 42 4.1.4.6.1 Setting the threshold for TSG-S and OG-S ............................................................................................. 48 4.1.5 Results ..................................................................................................................................................... 50 4.1.5.1 Application of DOTS-Finder to individual cancer types .............................................................................. 50 4.1.5.2 Driver genes and tissue specificity ...................................................................................................................... 52 4.1.5.3 Breast carcinoma ....................................................................................................................................................... 53 4.1.5.4 Thyroid Carcinoma .................................................................................................................................................... 55 4.1.5.5 Acute Myeloid Leukemia ........................................................................................................................................ 56 4.1.5.6 Bladder Carcinoma .................................................................................................................................................... 57 4.1.5.7 Atypical tumor suppressor genes and oncogenes ....................................................................................... 58 4.1.5.8 The importance of considering subsets of samples ..................................................................................... 62 4.1.5.9 Small sample size analysis. The --lax option .................................................................................................... 63 4.1.5.10 Comparison of DOTS-Finder to existing tools using Pan-Cancer12 data .......................................... 65 4.1.5.11 Statistical power using a small number of cancer samples .................................................................... 68 4.1.6 Discussion .............................................................................................................................................. 69 4.2 LOWMACA: EXPLOITING PROTEIN FAMILY ANALYSIS FOR THE IDENTIFICATION OF RARE DRIVER MUTATIONS IN CANCER ........................................................................................................................................... 71 4.2.1 Abstract .................................................................................................................................................. 71 4.2.2 Introduction .......................................................................................................................................... 72 4.2.3 Materials and Methods ..................................................................................................................... 74 4.2.3.1 Software Implementation and Overview ......................................................................................................... 74 3 4.2.3.2 Input Data ..................................................................................................................................................................... 75 4.2.3.3 Alignment and Mapping ......................................................................................................................................... 76 4.2.3.4 Statistical Testing ......................................................................................................................................................