Gene Expression Signatures for Cancer Cell Line Drug Sensitivity and Patient Outcome
Total Page:16
File Type:pdf, Size:1020Kb
This dissertation is submitted for the degree of Doctor of Philosophy Gene expression signatures for cancer cell line drug sensitivity and patient outcome Michael Schubert Submitted August 2016 Corpus Christi College, University of Cambridge EMBL-EBI DECLARATION This dissertation is the result of my own work and includes nothing which is the outcome of work done in collaboration except as declared in the Preface and specified in the text. It is not substantially the same as any that I have submitted, or, is being concurrently submitted for a degree or diploma or other qualific- ation at the University of Cambridge or any other University or similar institution except as declared in the Preface and specified in the text. I further state that no substantial part of my dissertation has already been submitted, or, is being concurrently submitted for any such de- gree, diploma or other qualification at the University of Cambridge or any other University of similar institution except as declared in the Preface and specified in the text. It does not exceed the prescribed word limit for the relevant Degree Committee. Michael Schubert August, 2016 3 SUMMARY The explanation of phenotypes in cancer, such as cell line drug response or patient survival, has largely been focussed on genomic alterations. While this approach has generated many profound insights into cancer biology, it does not directly make statements about the signaling im- pact those cellular aberrations create. With direct measurements much less widely available than gene expression, pathway methods (mostly mapping gene expression onto signalling proteins) have so far fallen short on delivering actionable evidence. This may in part be due to lack of robustness, but these approaches are fundamentally at odds with the notion of tight post-translational control of signal transduc- tion. A way to solve this may be to derive consensus signatures of pathway activity to make inferences about signalling, or signatures of specific drugs to investigate their interactions. As a baseline (chapter 2), I investigated how well pathway meth- ods compare to driver mutations in terms of explaining cell line drug response. I went on to analyse the value of gene expression as a down- stream signature of signaling activity instead of mapping it to the path- way components by means of a previously published platform. This improved over mapping pathway members using Gene Ontology or Re- actome, but it considered only sets of up-regulated genes defined by multiple arbitrary cutoffs. Hence, I extended the data set and created a linear model (chapter 3) that compares favourably to gene set as well as state of the art pathway methods in terms of recovering driver mutations and providing biomarkers for cell line drug response and patient survival (chapter 4). To complement this, I investigate how gene expression signatures of drugs can be used in conjunction with vi- ability data to suggest effective drug combinations where no pathway information is available (chapter 5). To the best of my knowledge, this thesis represents the first compre- hensive analysis of different pathway methods across primary cohorts and cancer cell lines, as well as the first large-scale systematic analysis of drug sensitisation that could lead to new drug combinations. 5 ACKNOWLEDGEMENTS This thesis would not have been possible without the generous support and feedback of many people, for which I am deeply grateful. First and foremost, I would like to thank my supervisor Julio Saez- Rodriguez for providing me with the opportunity to undertake my PhD in his group at EMBL-EBI as well as his support during my PhD. I would also like to extend my gratitude to all members of the Saez Rodriguez group, especially Francesco Iorio, Emanuel Goncalves, Luz Garcia Alonso, Michael Menden and Camille Terfve for fruitful dicus- sions and their friendship. I would like to thank my academic collaborators, among which are my industry supervisor Joanna Betts, the possibility to visit the US site of GSK with Pankaj Agarwal, the Sanger GDSC team lead by Mathew Garnett and Ultan McDermott, as well as people involved in smaller side projects too numerous to name. Many thanks to my TAC members Florian Markowetz, Sarah Teich- mann, and Christoph Merten who I was always looking forward to meet once a year for a useful critique of my project and much needed direction. Finally, I would like to thank the EMBL community for an amazing four years that I spent at the institute. You know who you are. 7 CONTENTS List of Abbreviations 14 List of Figures 15 List of Tables 19 1 introduction 21 1.1 Cancer Biology 21 1.1.1 Significance and epidemiology 21 1.1.2 Causes of genetic variation 21 1.1.3 Functional impact of mutations: the central dogma 23 1.1.4 Malignant transformation 24 1.1.5 Tumour heterogeneity 25 1.2 Therapeutic interventions 25 1.2.1 The therapeutic window 25 1.2.2 Cytotoxic drugs 26 1.2.3 Targeted therapies 26 1.2.4 Development of resistance 27 1.3 Disease models 28 1.3.1 Rationale 28 1.3.2 Cell lines 29 1.3.3 Animal models 30 1.4 Molecular data types and assays 30 1.4.1 Role of molecular data 30 1.4.2 DNA 31 1.4.3 RNA 31 1.4.4 Proteins and phosphorylation 33 1.5 Data sets 34 1.5.1 Public gene expression repositories 34 1.5.2 Gene sets and pathways 35 1.5.3 Primary cancer cohorts 36 1.5.4 Cancer cell line drug sensitivity resources 36 1.5.5 Drug-induced transcriptional changes 37 1.6 Computational methods 38 1.6.1 Patterns of mutations 39 1.6.2 Gene expression clustering 40 1.6.3 Gene expression signatures 42 1.6.4 Networks 43 1.7 Reproducibility of results 46 1.7.1 Experimental and computational 46 9 10 Contents 1.7.2 Scientific software ecosystem 47 1.7.3 Reproducible workflows 48 1.8 Motivation and outlook 49 2 gene set methods for drug response 51 2.1 Methods used throughout this thesis 52 2.1.1 Gene sets 52 2.1.2 Gene Set Variation Analysis (GSVA) 53 2.1.3 Drug associations using the half-maximum inhib- itory concentration (IC50) 53 2.2 Cell Line Drug Response 56 2.2.1 Associations with Mutations 56 2.2.2 Associations with Gene Ontology categories 58 2.2.3 Associations with Reactome pathways 59 2.3 Pathway-responsive genes: The SPEED Platform 59 2.3.1 Separability-optimised Gene Sets 60 2.3.2 Associations between Pathway Scores and Tis- sues 62 2.3.3 Pan-Cancer Drug Associations 64 2.4 Discussion 66 2.4.1 Cell line drug response 66 2.4.2 Pathway-responsive genes 67 3 building an improved model of perturbation-re- sponse genes 69 3.1 Curating and assembling a database of publicly available perturbation experiments 70 3.1.1 Defining scope and format 70 3.1.2 Finding suitable Experiments 72 3.1.3 From Experiments to Expression Data 76 3.2 A model for perturbation-response experiments 77 3.2.1 Building a linear model from z-scores 77 3.2.2 Computing pathway scores 81 3.2.3 Consensus gene signatures reveal pathway-response transcriptional modules 82 3.3 Comparison to pathway mapping methods 85 3.3.1 Transcriptional footprints are fundamentally dif- ferent to pathway expression 85 3.3.2 Computing pathway scores for other methods 87 3.3.3 Comparison within perturbation experiments 89 3.3.4 Comparison across perturbation experiments 92 3.4 Discussion 94 3.4.1 Need for an improved database and model 94 3.4.2 Value of curated perturbation experiments 94 Contents 11 3.4.3 Importance of distinguishing between pathway expression and activity 95 4 functional evaluation of pathway methods in basal gene expression 97 4.1 Perturbation-response signatures for basal gene expres- sion 97 4.1.1 Correlation in basal expression 97 4.1.2 A linear model is easily transferable 97 4.1.3 Dependence on input experiments 99 4.1.4 Pathway scores for other methods 100 4.2 Recall of known pathway modifiers 102 4.2.1 Pathway scores and mutations/CNAs 102 4.2.2 Associations using driver mutations and CNAs 102 4.2.3 Comparison of methods 106 4.3 Cell line drug response using the GDSC 106 4.3.1 Drug associations using GDSC cell lines 106 4.3.2 Pan-cancer Associations with Drug Response 108 4.3.3 Tissue-specific Associations with Drug Response 113 4.4 Patient survival using the TCGA 115 4.4.1 Clinical data and methods 115 4.4.2 Pan-cancer associations with survival 116 4.4.3 Tissue-specific associations with survival 116 4.4.4 Kaplan-Meier survival curves for specific hits 117 4.5 Discussion 118 5 modelling drug interactions 121 5.1 MANTRA and the original Connectivity Map 122 5.1.1 Two-Tailed GSEA using MANTRA 122 5.1.2 Pan-cancer associations 124 5.1.3 Tissue-specific associations 125 5.2 A pan-cancer view of drug sensitisation using LINCS 126 5.2.1 Data quality of the LINCS 126 5.2.2 Creating drug signatures and scoring GDSC cell lines 126 5.2.3 Naive associations with drug response 127 5.2.4 Pathway correlation as a major source for false positives 129 5.2.5 Building an improved model 131 5.3 Tissue-specific synergistic compounds 134 5.3.1 Consensus models for breast cancer cell lines 134 5.3.2 Defining synergy for drug combinations 136 5.3.3 Single-agent drug response curves 138 5.3.4 Synergy of drug combinations 141 5.4 Discussion 143 12 Contents 5.4.1 Original Connectivity Map 143 5.4.2 Pan-cancer view of the LINCS 144 5.4.3 Tissue-specific models and validation 144 6 conclusions 147 bibliography 149 appendix 165 a Associations for baseline methods (chapter 2) 165 a.1 Drug response with unbiased gene sets 165 a.2 SPEED platform 170 b Associations for evaluating signatures (chapter 4) 172 b.1 Pathway scores and mutations 172 b.2 Pathway scores and CNAs 182 b.3 Pathway