BIOINFORMATICS METHODS FOR THE GENETIC AND MOLECULAR CHARACTERISATION OF CANCER

Dissertation submitted for the degree of Doctor of Natural Sciences of the Faculties of Natural Sciences and Technology of Universität des Saarlandes

by Daniel Stöckel

Universität des Saarlandes, MI – Fakultät für Mathematik und Informatik, Informatik

Advisor Prof. Dr. Hans-Peter Lenhof

6 December 2016

Details of the Defence

Date 25 November 2016, 3 pm c.t.

Chair Prof. Dr. Volkhard Helms

First reviewer Prof. Dr. Hans-Peter Lenhof

Second reviewer Prof. Dr. Andreas Keller

Academic assessor Dr. Christina Backes

Dean Prof. Dr. Frank-Olaf Schreyer

This thesis is also available online at: https://somweyr.de/~daniel/dissertation.zip

ABSTRACT

Cancer is a class of complex, heterogeneous diseases of which many types have proven to be difficult to treat due to the high genetic variability between and within tumours. To improve therapy, some cases require a thorough genetic and molecular characterisation that makes it possible to identify mutations and pathogenic processes playing a central role in the development of the disease. Data obtained from modern, biological high-throughput experiments can offer valuable insights in this regard. Therefore, we developed a range of interoperable approaches that support the analysis of high-throughput datasets on multiple levels of detail. Mutations are a main driving force behind the development of cancer. To assess their impact on an affected protein, we designed BALL-SNP, which makes it possible to visualise and analyse single nucleotide variants in a structural context. For modelling the effect of mutations on biological processes we created CausalTrail, which is based on causal Bayesian networks and the do-calculus. Using NetworkTrail, our web service for the detection of deregulated subgraphs, candidate processes for this investigation can be identified. Moreover, we implemented GeneTrail2 for uncovering deregulated processes in the form of biological categories using enrichment approaches. With support for more than 46,000 categories and 13 set-level statistics, GeneTrail2 is currently the most comprehensive web service for this purpose. Based on the analyses provided by NetworkTrail and GeneTrail2 as well as knowledge from third-party databases, we built DrugTargetInspector, a tool for the detection and analysis of mutated and deregulated drug targets. We validated our tools using a Wilm's tumour expression dataset and were able to identify pathogenic mechanisms that may be responsible for the malignancy of the blastemal tumour subtype and might offer opportunities for the development of novel therapeutic approaches.


ZUSAMMENFASSUNG

Cancer is a class of complex, heterogeneous diseases with many subtypes that are difficult to treat due to the genetic variability that exists between and within tumours. To enable better therapy, some cases therefore require a careful genetic and molecular characterisation that makes it possible to identify the mutations and pathogenic processes that play a central role during the development of the disease. Data from modern biological high-throughput experiments can provide valuable insights in this regard. We therefore developed a range of interoperable approaches that support the analysis of high-throughput datasets on multiple levels of detail. Mutations are the driving force behind the development of cancer. To be able to assess their influence on the affected protein, we designed the software BALL-SNP, which makes it possible to visualise and analyse single nucleotide variations in a crystal structure. To model the effect of a mutation within a biological process, we created CausalTrail, which is based on causal Bayesian networks and the do-calculus. Using NetworkTrail, our web service for the detection of deregulated subgraphs, processes can be identified that may serve as candidates for such an investigation. For the detection of deregulated processes in the form of biological categories by means of enrichment approaches, we implemented GeneTrail2. GeneTrail2 supports more than 46,000 categories and 13 statistics for computing enrichment scores. Based on the analysis methods of NetworkTrail and GeneTrail2, as well as the knowledge from third-party databases, we constructed DrugTargetInspector, a tool for the detection and analysis of mutated and deregulated drug targets. We validated our tools using a Wilm's tumour expression dataset, for which we were able to identify pathogenic mechanisms that may be responsible for the malignancy of the blastema-rich subtype and could potentially enable the development of novel therapeutic approaches.


ACKNOWLEDGEMENTS

The thesis you hold in your hands marks the end of a six year journey. When I started my PhD, a fledgling Master of Science, I thought I knew everything about science. Surely I would be done in three years tops and surely my work would be nothing but revolutionary. All the others were certainly just slacking off! Oh, how naïve I was back then! And with the passage of time I learned a lot. Some of the things I learned concerned science, while others did not. I learned that succeeding as a researcher not only takes a lot of theoretical knowledge, but also a lot of dedication and hard work. I learned that the solution to some problems cannot be rushed, but instead requires many small and deliberate steps. But most importantly, I learned that science is not a one-man show, but a team effort. Because of this I would like to take the time to thank the people without whom this thesis would not have been possible.

First of all, I would like to thank my advisor Prof. Dr. Hans-Peter Lenhof for giving me the opportunity to work at his chair. Hans-Peter also gave me the freedom to independently pursue the research topics I was interested in. If, however, a problem arose he would selflessly sacrifice his precious time to help with finding a solution. For this I am deeply grateful. Next, thanks are in order for Prof. Dr. Andreas Keller for reviewing my thesis, for bringing many interesting research opportunities to my attention, and for the energy and enthusiasm he infuses into the people who have the pleasure to work with him.

Naturally, this work would have been impossible without my family and, for obvious reasons, without my parents Thomas and Sabine Stöckel. It was them who woke my interest in science and technology. For as long as I can remember they supported and encouraged me to pursue the career I have chosen for myself and, thus, without them I would not have been able to achieve what I achieved. Similarly, I would like to thank Charlotte De Gezelle for her patience whenever I postponed yet another vacation because: "I am done soon™!" Whenever I became frustrated and needed a shoulder to cry on I could rely on her having an open ear for me. You truly have made my world a brighter place!

Of course, there is another group of people whose importance I cannot stress enough. My proofreaders Lara Schneider, Tim Kehl, Fabian Müller, Florian Schmidt, and Andreas Stöckel fought valiantly against spelling mistakes, layout issues, and incomprehensible prose. Thank you for setting my head straight whenever I was again convinced that what I wrote was perfectly clear. Hint: it wasn't. Equally, I would like to thank my coworkers Alexander Rurainski, Christina Backes, Anna-Katharina Hildebrandt, Oliver Müller, Marc Hellmuth, Lara Schneider, Patrick Trampert, and Tim Kehl for the pleasant and familial working environment. Much of the work presented in this thesis has been a joint endeavour in which they had integral parts. Similarly, I would like to praise all the awesome people from Saarland University, as well as the Universities of Mainz and Tübingen, that I had the pleasure to work with. A special shoutout goes to my colleagues and good friends from the MPI for Computer Science for the many enjoyable discussions during lunch. A thank you is also in order for my students, although I am not certain whether they enjoyed being taught by me as much as I enjoyed the opportunity to teach them. Finally, I am grateful for all my friends and the time we spent together playing board games, celebrating, or simply hanging out. Without you, I would certainly not have managed to pull through and finish this thesis.


CONTENTS

Contents
List of Figures
List of Tables
List of Notations

1 Introduction
  1.1 Motivation
  1.2 Overview

2 Biological Background
  2.1 Cancer
  2.2 Personalised Medicine
  2.3 Biological Assays
  2.4 Wilm's Tumour Data

3 Biological Networks
  3.1
  3.2 Types of Biological Networks
  3.3 Causal Bayesian Networks
  3.4 Gaussian Graphical Models
  3.5 Deregulated Subgraphs

4 Enrichment Algorithms
  4.1 Statistical Tests
  4.2 A Framework for Enrichment Analysis
  4.3 Evaluation
  4.4 Hotelling's T²-Test

5 Graviton
  5.1 The World Wide Web
  5.2 The Graviton Architecture
  5.3 GeneTrail2
  5.4 NetworkTrail
  5.5 Drug Target Inspector

6 BALL
  6.1 The Biochemical Algorithms Library
  6.2 ballaxy
  6.3 PresentaBALL
  6.4 BALL-SNP
  6.5 Summary

7 Conclusion
  7.1 Summary
  7.2 Discussion

A Mathematical Background
  A.1 Combinatorics and Probability Theory
  A.2 Machine Learning

B Wilm's Dataset - Supplementary Tables

C Enrichment Evaluation - Results
  C.1 Enrichments on Synthetic Categories
  C.2 Enrichments on Reactome Categories

D GeneTrail2 - Tables
  D.1 Supported Organisms
  D.2 List of Human Categories

Bibliography


LIST OF FIGURES

1.1 Common causes of death in Germany (2014)
1.2 Levels of cellular regulatory mechanisms

2.1 Life expectancy at birth
2.2 The hallmarks of cancer
2.3 Wilm's tumour treatment stratification according to the SIOP 2001 study
2.4 Breast cancer treatment flow chart
2.5 Transcription in a eukaryotic cell
2.6 Maturation and binding of miRNAs
2.7 Overview of a microarray study
2.8 Manufacturing process of Affymetrix microarrays
2.9 Affymetrix microarrays
2.10 Cost of high-throughput sequencing
2.11 Principle of a 454 sequencer

3.1 TP53 STRING v10 neighbourhood
3.2 KEGG Citrate Cycle pathway
3.3 WikiPathways WNT Signaling pathway
3.4 The student Bayesian network
3.5 CausalTrail information flow
3.6 EM algorithm for fitting BN parameters
3.7 The twin network approach
3.8 CausalTrail main window
3.9 CausalTrail "Load samples" dialog
3.10 CausalTrail parameter fitting convergence
3.11 The CBN constructed by Sachs et al.
3.12 Multivariate Gaussian distribution
3.13 Computing partial correlations using l2-shrinkage
3.14 Partial correlations via the graphical lasso (heatmap)
3.15 Partial correlations via the graphical lasso (network)
3.16 Linear programming illustration
3.17 The branch-and-cut algorithm
3.18 Illustration of the Subgraph ILP constraints
3.19 Biased subgraph selection in a simple path
3.20 Biased subgraph selection in complex topology
3.21 Example of a decision tree
3.22 Observed vs. predicted ILP counts (human)

4.1 Critical region
4.2 Schematic of an enrichment method
4.3 Effect of shrinkage
4.4 An example KS running sum
4.5 Urn model for the hypergeometric test
4.6 Distribution of normalised expression values
4.7 Distributions of per-gene means and std. deviations
4.8 Num. false positives for synthetic null-dataset
4.9 Num. false positives for Reactome null-dataset
4.10 AUC of set-level statistics with synthetic categories
4.11 Densities of estimated Mahalanobis distances

5.1 Estimated number of websites
5.2 Simplified HTTP URL
5.3 The GeneTrail2 architecture
5.4 The Job state machine
5.5 Job–Resource dependency graph
5.6 Possible problems of identifier mapping tables
5.7 Comparison of selected enrichment tools
5.8 Flowchart of the GeneTrail2 workflow
5.9 The comparative enrichment view
5.10 The inverse enrichment view
5.11 PCA plot of the Wilm's tumour mRNA expression data
5.12 Volcano plot for the WT mRNA expression data
5.13 PCA plot of Wilm's tumour miRNA expression data
5.14 TRIM71 – LIN28B interaction
5.15 Stem cell marker expression
5.16 Scatterplot of IGF2 vs. TCF3
5.17 The basic workflow of a NetworkTrail analysis
5.18 Deregulated subgraphs for the WT dataset
5.19 The DrugTargetInspector main view
5.20 Consensus deregulated subgraph rooted in EGFR

6.1 The structure of amino acids
6.2 Realtime raytracing in BALLView
6.3 The BALL plugin system
6.4 Starting page of the ballaxy web service
6.5 BALL-SNP cluster view
6.6 BALL-SNP cluster close-ups

A.1 Cumulative and probability density function
A.2 The bias–variance tradeoff
A.3 Training, tuning, and test set
A.4 The curse of dimensionality


LIST OF TABLES

2.1 Classification of Wilm’s tumours...... 14

3.1 CausalTrail queries for the Sachs et al. dataset...... 54 3.2 Importance of topology descriptors...... 78

4.1 Error-types in hypothesis tests...... 84 4.2 Significant categories for Hotelling’s T 2-test...... 111

5.1 Performance evaluation of GeneTrail2...... 142 5.2 Significantly enriched categories per method...... 146 5.3 Significant let-7 miRNA family members...... 147 5.4 Enriched categories containing RSPO1...... 149 5.5 Pearson correlation coefficient between the expression val- ues of a set of selected genes and TCF3...... 150

B.1 List of Wilm’s Tumour Biopsies...... 191 B.2 Wilm’s Tumour Histologies...... 193 B.3 Mapping of biopsy IDs to mRNA array IDs...... 195 B.4 Mapping of biopsy IDs to miRNA array IDs...... 196 B.5 List of differentially expressed miRNAs...... 197

C.1 Sym. row-wise performance on synthetic categories. . . . . 199 C.2 Asym. row-wise performance on synthetic categories. . . . 199 C.3 Sym. column-wise performance on synthetic categories. . . 200 C.4 Asym. column-wise performance on synthetic categories. . 200 C.5 Sym. row-wise performance on Reactome categories. . . . . 201 C.6 Asym. row-wise performance on Reactome categories. . . . 201 C.7 Sym. column-wise performance on Reactome categories. . 202 C.8 Asym. column-wise performance on Reactome categories. 202


LIST OF NOTATIONS

Sets and intervals

N The set of natural numbers {1, 2, 3, ...}. Inclusion of zero is indicated by an explicit subscript N_0.
R The set of real numbers. Subsets such as the set of positive numbers and the set of non-negative numbers are indicated using super- and subscripts, e.g. R^+ and R^+_0.
B The set of boolean numbers {0, 1}.
|A| The number of elements in a (finite) set.
P(X) The power set of the set X.
A \ B The set difference between A and B.
A^c The complement of set A.
A ⊂ B A is a proper subset of B.
A ⊆ B A is equal to or a subset of B.
[a, b] ⊂ R The closed interval between a and b.
(a, b) ⊂ R The open interval between a and b.
[a, b), (a, b] Half-open intervals between a and b.

Linear Algebra

⃗v ∈ X^n An n-dimensional (column) vector with name v.
A ∈ X^{n×m} A matrix with n rows and m columns.
A^t, ⃗v^t The transpose of the matrix A and the vector ⃗v.
⟨⃗u, ⃗v⟩ The dot product between the vectors ⃗u and ⃗v.

Probability Theory

α ∈ (0, 1) A confidence level used in a statistical test.
θ A set of unspecified parameters of a statistical model.
µ The expected value of a probability distribution.
x̄ The sample mean.
σ, σ² The standard deviation and variance of a probability distribution.
s, s² The sample standard deviation and variance.
Ω The set of results in a probability space.
Σ The σ-algebra of events in a probability space.
Pr(X = x), Pr(x) The probability that the random variable X takes on the value x.
X, X A random sample or a matrix of random samples. In the latter case the rows represent the samples and the columns represent measured variables. The number of rows and columns are represented by n and p.
Σ ∈ R^{n×n} The covariance matrix.
Ω := Σ^{-1} The precision matrix.
S ∈ R^{n×n} The sample covariance matrix.

1 INTRODUCTION

You can have data without information, but you cannot have information without data. — Daniel Keys Moran

Anthropologists classify eras of human history according to defining technologies, concepts, or resources that have irreversibly shaped societies during those times. For example, we refer to ancestral periods as the stone, bronze, or iron age. Later periods became known as the "age of enlightenment" or the "industrial revolution". While it is unknown how future generations will refer to our times, a reasonable proposition seems to be that, today, we live in the "age of information". Via the internet, an unprecedented amount of knowledge is open to more humans than ever before. The omnipresence of networked electronic devices makes it possible to capture profiles of our everyday lives. Social interactions, shopping preferences, location data, and other information are routinely stored in tremendous quantities (cf. [McA+12]). How to make use of this "Big Data", for better or worse, remains an open question. Whatever the answer to this question turns out to be, it will likely redefine every aspect of our modern society. Science is no exception. While natural scientists were among the first users of computers, the amount of captured experimental data remained at manageable scales for a long time, with the notable exception of various high-profile physics projects [BHS09]. However, this has changed during the past decade. In biology, thanks to the development of ever more potent high-throughput methods, the size of recorded data sets has increased dramatically. It is possible to capture complete genomes, transcriptomes, and sizable parts of both the proteome and the metabolome with a single experiment each. Projects like ENCODE [ENC04], DEEP¹, TCGA [McL+08], or 1000 Genomes [10010] have collected vast archives of high-throughput datasets. Stephens et al. [Ste+15] estimate that by 2025 data from genomics alone will require 2 EB to 40 EB of storage (1 exabyte (EB) = 1000 petabytes (PB) = 10^18 bytes). This far exceeds the projected requirements of social platforms such as YouTube (1 EB to 2 EB) and Twitter (1 PB to 17 PB), but also of applications from astronomy such as the data captured by telescopes (1 EB). It is thus fair to say that biology has arrived in the realm of "Big Data". Generating a large amount of information is, however, only a prerequisite for understanding the biological processes that take place in an organism. The determination of the human genome's sequence [Lan+01; Ven+01] already showed conclusively that the main challenge posed by an open biological problem is not necessarily the data generation process, but rather the analysis of the generated measurements. Similarly, making sense of and interpreting the data repeatedly proves to be the main bottleneck in biological high-throughput experiments. For example, knowing the expression value of a single gene at one or even multiple points in time hardly offers any insight into the current state of the cellular program [Gyg+99; Gre+03].

1 http://www.deutsches-epigenom-programm.de/


[Figure 1.1: horizontal bar charts of the most common causes of death for women and men; x-axis: percentage of total deaths. Categories shown: diseases of the circulatory system, neoplasms, diseases of the respiratory system, mental and behavioural disorders, diseases of the digestive system, metabolic diseases, diseases of the nervous system, diseases of the genitourinary system, infectious and parasitic diseases, and others.]

Figure 1.1: Most common causes of death in Germany in 2014. The classification into categories is according to the ICD-10 definitions. Neoplasms (cancer) are the second most common cause of death after diseases of the circulatory system and account for ≈ 25 % to 30 % of all deaths [Sta16].

Instead, the relationships between genes, transcripts, proteins, metabolites, and how they interact, need to be elucidated to be able to truly understand the molecular mechanisms of a single cell and, ultimately, an organism. Thus, a primary goal of computational biology is to unravel the relationships between the various measured entities, given the available wealth of information.

1.1 motivation

In bioinformatics, a main driver behind the development of better data analysis methods are complex, heterogeneous diseases that have time and again proven to be especially difficult to treat without detailed genotypic and phenotypic knowledge. The prime example for a class of such diseases is cancer. In 2014, roughly 25 % to 30 % of the deaths recorded in Germany [Sta16] could be attributed to malign neoplasms (Figure 1.1). (It should be noted that, due to the heterogeneity of cancer, "neoplasms" is an extremely broad class of diseases, whereas "diseases of the circulatory system" is comparatively narrow.) Despite intensive research efforts over the last decades, the timely detection of a tumour remains one of the most effective weapons in the fight against cancer. This is a consequence of the nature of tumourigenesis. Most tumours develop due to (epi-)genetic aberrations. These aberrations are governed by a stochastic process, and the likelihood with which they occur depends on a variety of patient-specific factors that include, but are not limited to, environmental influences, lifestyle, and genetic predisposition. As a result, every case of cancer is unique. Even tumours of the same histopathological type can display substantial variability. In fact, due to the uncontrolled proliferation of cancer cells and thus the increased likelihood of accumulating additional mutations, each tumour is composed of multiple cancer cell populations that all carry a unique set of genetic aberrations [Sha+09; Wei13]. Thus, each case of cancer must be carefully examined individually to ensure an "optimal" therapy [Jai05]. Hence, it is necessary to thoroughly characterise the tumour on a molecular level to allow biologists and clinicians to decipher the prevalent pathogenic mechanisms and, accordingly, to decide on the best available treatment.


[Figure 1.2: layered diagram of cellular regulatory mechanisms along the flow of genetic information, from the chromatin level (methylation, histone modifications, chromatin structure, promoter strength) via transcription (transcription factors, enhancer binding, Pol II) and mRNA processing (splicing, RNA interference, mRNA degradation, mRNA structure, ribosome stalling) to the protein level (folding, phosphorylation, ubiquitinylation).]

Figure 1.2: Levels of cellular regulatory mechanisms. The flow of genomic information is from bottom to top. At each level one or more regulatory mechanisms exist that can prevent (or enable) the flow of information to the next level. Regulatory mechanisms at the lower levels tend to take longer to come into effect but are also more persistent.

High-throughput assays, as mentioned above, are promising technologies for collecting data on which the analysis of a tumour sample can be based. However, due to the size and noisy nature of the captured data, efficient and robust statistical analysis software is needed for working with this information. In this regard, one of the main challenges to overcome is the variability of and within the tumour. Current data sets commonly contain measurements from a mixture of cells. This disregards the genetic differences that exist in a population of tumour cells. Recent advances in sequencing technology suggest that in the near future it will be possible to obtain data for a sizeable population of single cells together with their spatial location in a tumour biopsy [Ang+16; WN15; Uso+15]. How to handle this kind of data effectively, however, is currently an open research problem. Longer-term developments, such as time-series data with single-cell resolution, are also imaginable.


Furthermore, it is important to note that although a single high-throughput method is able to capture comprehensive datasets for one so-called omics type, the collected information only provides a small glimpse at the pathogenic processes that actually take place in a tumour. This is due to the fact that many of the cellular processes, and thus the emergence of complex traits, are subject to a tight regulatory program constraining the flow of genetic information (Figure 1.2). This program manifests as interactions between genes, proteins, miRNAs, and other biological entities that can be modelled as biological networks, where edges represent interactions and nodes represent interaction partners. An integrative analysis of data from multiple omics is, therefore, mandatory to gain deeper insights into the deregulated processes that drive tumour development. Performing such an analysis necessitates theoretical models and software that, in addition to the above requirements, can also cope with heterogeneous data sets.

1.2 overview

This dissertation introduces tools and methods we developed for the analysis of high-throughput data. Our main goal is to support the genetic and molecular characterisation of tumour samples. (While we describe our methods with regard to their applicability in tumour research, most of them can be used to investigate arbitrary datasets.) Commonly, this entails an iterative process in which each step provides information that is used to guide further, more focused examinations. To support such workflows, the approaches we conceived are each targeted at different levels of detail: some methods are especially suited to quickly narrow down large amounts of data into a set of interesting systems. These can then be analysed using more specialised software. Naturally, this requires that the employed tools are interoperable, which was an important concern we considered when designing our approaches. Finally, as outlined above, the ability to draw on knowledge from heterogeneous datasets is essential for understanding complex diseases. Thus, we put special emphasis on methods that make it possible to perform integrative analyses on multi-omics data.

Throughout this thesis, we illustrate the capabilities of the tools we developed using a Wilm's tumour dataset (Section 2.4). Wilm's tumours are childhood renal tumours that, due to various properties such as their relative independence from environmental influences and a comparatively low mutation rate, are an ideal model disease. The main goal of our investigation was the detection of pathogenic mechanisms that may explain the increased malignancy of blastemal subtype tumours compared to other Wilm's tumour subtypes. On the basis of this case study, we highlight contributions that may be suitable for the creation of systems that assist researchers and physicians in devising effective, personalised cancer treatments.

The remainder of this work is structured as follows: in Chapter 2 the required biological background and experimental techniques are introduced.

To this end, a basic discussion of cancer biology is provided. Accompanying this general discussion, we describe Wilm's tumours in more detail. Furthermore, we discuss current treatment options and, especially, the trend towards targeted therapy. Based on this, we motivate the need for advanced computational methods to enable a more accurate, personalised medicine. Afterwards, we introduce some of the biological assays that are available to create detailed biological patient profiles and discuss their properties, advantages, and drawbacks.

An important prerequisite for personalised treatments is that the pathogenic processes that play a central role for a given tumour have been identified. As biological networks are a natural way to model regulatory and other processes, Chapter 3 introduces a set of methods for their analysis, along with some of the most common network types. In particular, we introduce CausalTrail, which we implemented to assess the causal dependencies in regulatory cascades. Using predefined causal Bayesian networks and the do-calculus, CausalTrail is able to assess the impact of a mutation or a drug on nodes downstream of a target regulator. For the end user, both a command-line application and a graphical user interface are provided. With CausalTrail we developed, to the best of our knowledge, the first freely available tool for computing the effect of interventions in a causal Bayesian network structure.

While many biological networks stem from network databases, approaches for inferring a topology directly from data exist. Here, we take a look at the theory behind Gaussian graphical models (GGMs), which can be used to infer a network representing the partial correlation structure between a set of variables. While we will not use GGMs for network inference purposes, we later explore their applicability for enrichment analyses (Section 4.4).

A common characteristic of cancer cells is that their regulatory networks have been reprogrammed. To detect affected parts of this network, methods for the search of deregulated subgraphs can be employed. Deregulated subgraphs are small, usually connected parts of a biological network that contain a large number of, e.g., differentially expressed genes. In Section 3.5, we discuss our integer linear programming (ILP) formulation for discovering deregulated subgraphs in regulatory biological networks [Bac+12]. In contrast to many competing methods, our approach computes an exact solution to the rooted maximum-weight subgraph problem. Furthermore, we describe the first approach for quantifying the influence of the network topology on the detected subgraphs using a combination of sampling and machine learning techniques. Our study underlines the need for and provides the basis for the development of a rigorous framework for determining the significance of a deregulated subgraph. We later revisit the ILP formulation during the introduction of the NetworkTrail web service in Section 5.4.
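To give an intuition for the kind of intervention query CausalTrail, introduced above, is designed to answer, consider the following minimal sketch. It is not CausalTrail code: the three-node network, its conditional probability tables, and all numbers are invented for illustration. The sketch evaluates P(Y = 1 | do(X = 1)) via the back-door adjustment formula, which the do-calculus licenses when the confounder Z blocks all back-door paths from X to Y, and contrasts it with the purely observational P(Y = 1 | X = 1).

```python
# Hedged sketch (not CausalTrail's implementation): back-door adjustment in a
# tiny causal Bayesian network with edges Z -> X, Z -> Y, and X -> Y.
# All variables are binary; the CPTs below are made-up numbers.

p_z = {0: 0.6, 1: 0.4}                                    # P(Z = z)
p_x_given_z = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}  # p_x_given_z[z][x] = P(X = x | Z = z)
p_y1_given_xz = {(0, 0): 0.1, (0, 1): 0.4,                # p_y1_given_xz[(x, z)] = P(Y = 1 | X = x, Z = z)
                 (1, 0): 0.5, (1, 1): 0.9}

def p_y1_do_x(x):
    """P(Y = 1 | do(X = x)) = sum_z P(Y = 1 | X = x, Z = z) * P(Z = z)."""
    return sum(p_y1_given_xz[(x, z)] * p_z[z] for z in (0, 1))

def p_y1_given_x(x):
    """Observational P(Y = 1 | X = x); here Z is *not* held fixed."""
    numerator = sum(p_y1_given_xz[(x, z)] * p_x_given_z[z][x] * p_z[z] for z in (0, 1))
    denominator = sum(p_x_given_z[z][x] * p_z[z] for z in (0, 1))
    return numerator / denominator

print(p_y1_do_x(1))      # 0.66 -- effect of forcing X to 1
print(p_y1_given_x(1))   # 0.78 -- inflated by the confounder Z
```

The two quantities differ because, observationally, X = 1 tends to co-occur with the state of Z that also raises Y, whereas the intervention severs the Z → X dependency; CausalTrail applies the same principle to user-defined networks and queries.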

In Chapter 4, we turn to approaches that are closely related to the detection of deregulated subgraphs. While the aforementioned methods are more or less free to choose any set of connected genes from an input network, enrichment methods rely on predefined categories of biological entities to detect deregulated pathways. To be able to discuss these algorithms, we first provide an introduction into the theory behind statistical hypothesis tests. Next, a general framework for enrichment analysis is presented. Guided by this framework, we discuss some popular enrichment methods in detail. Additionally, we perform an evaluation of the presented algorithms. Based on its results, we derive a set of guidelines for choosing the appropriate enrichment method for a specific set of input data. Finally, an alternative scheme for performing enrichment analyses based on Hotelling's T²-test [Hot31] is introduced and evaluated with respect to its applicability in practice.

Providing and maintaining native, high-performance software that works on multiple operating systems and computer configurations can place a substantial burden on a development team. Web services make it possible to circumvent this problem by providing a centrally managed installation of a given tool on which users can rely. To facilitate the construction of such web services, we created the Graviton platform (Chapter 5). Services based on Graviton are automatically equipped with user session management and a self-documenting workflow system, and are scriptable via a RESTful application programming interface. Furthermore, the platform provides a comprehensive collection of algorithms and data structures that offer solutions to tasks commonly encountered by bioinformatics tools, such as identifier mapping and the parsing of input data. Using Graviton, we created the web services GeneTrail2, NetworkTrail, and DrugTargetInspector (DTI), which facilitate the analysis of multi-omics datasets.

With GeneTrail2, we implemented a web service for conducting enrichment analyses on multi-omics datasets (Section 5.3). GeneTrail2 implements most of the enrichment methods presented in Chapter 4 and is, at the time of writing, the most comprehensive web service for enrichment analysis in existence. It supports data from 14 organisms, integrates information collected from over 30 databases, and offers novel ways of visualising computed enrichments. Using GeneTrail2, we were able to detect molecular mechanisms that may explain the increased malignancy of blastemal subtype Wilm's tumours when compared to other subtypes.

In order to complement our web service for enrichment analysis, we created NetworkTrail, a web service for detecting deregulated subgraphs in regulatory networks (Section 5.4). It provides a user-friendly web interface to our ILP formulation [Bac+12] as well as to the FiDePa [Kel+09] algorithm, which uses ideas from enrichment analysis to search for the most deregulated path of a predefined length.
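To illustrate the simplest flavour of the enrichment analyses referred to above, the following sketch performs an over-representation test with the hypergeometric distribution. It is not GeneTrail2 code, and all numbers are hypothetical; GeneTrail2 offers many additional set-level statistics beyond this basic scheme.

```python
# Minimal over-representation analysis (one of the classic enrichment schemes).
# Hypothetical numbers; a real analysis would loop over thousands of categories
# and correct the resulting p-values for multiple testing.
from scipy.stats import hypergeom

n_background = 20000   # all measured genes
n_test_set   = 250     # e.g. differentially expressed genes
n_category   = 180     # genes annotated with the category (e.g. a pathway)
n_overlap    = 12      # category members inside the test set

# P(X >= n_overlap) when drawing n_test_set genes without replacement
# from a background containing n_category "successes".
p_value = hypergeom.sf(n_overlap - 1, n_background, n_category, n_test_set)
print(f"p = {p_value:.3g}")
```

Rank-based set-level statistics, such as the Kolmogorov–Smirnov running sum (Figure 4.4), avoid having to fix a threshold for the test set; the trade-offs between these approaches are evaluated in Chapter 4.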

A common problem in cancer therapy is that tumours develop a resistance against anti-cancer drugs. For example, the proteins that are targeted by a drug may no longer be functional due to mutations and, thus, the drug will have no effect. We built DTI [Sch+15] with the goal of assisting with the stratification of treatment options based on user-provided gene expression and mutation data (Section 5.5). In particular, it allows the detection, annotation, and analysis of deregulated as well as mutated drug targets. For this, the knowledge from DrugBank [Law+14], the Variant Effect Predictor [McL+10], and experimental resistance data from the GDSC project [Bar+12] has been integrated into DTI. By combining this information with the input data, an overview of potential deregulated drug targets as well as promising treatment options is generated. By leveraging the deep integration of DTI into GeneTrail2 and NetworkTrail, which is made possible by Graviton, further advanced analyses can be started by the user with a single click.

For gauging the effect of a drug on a mutated target, it is essential to take the structure of the affected protein into consideration. To complement DTI in this regard, we created BALL-SNP [Mue+15]. It makes it possible to examine the effect of point mutations by visualising them in, and performing analyses on, the respective crystal structure. To identify possible collaborative effects between single nucleotide variants, BALL-SNP can detect point mutations that are in close spatial proximity by performing cluster analyses. Furthermore, BALL-SNP integrates the output of in silico methods for predicting the impact of a mutation. To build the software, we relied on the PresentaBALL (cf. Section 6.3) infrastructure offered by the BALLView molecular viewer [Mol+05; Mol+06], which is part of the BALL library [KL00; Hil+10]. PresentaBALL was created specifically with the idea in mind to make the functionality offered by BALL available for a wider range of applications such as interactive installations, teaching, and the presentation of research results. In a similar vein, we integrated BALL and BALLView with the Galaxy workflow system [Goe+10; Hil+14a]. For this, we created a suite of command line tools, built on top of BALL, for working with structure data.

Finally, Chapter 7 closes with a summary and discussion of the described work. In particular, we provide a perspective on possible further developments in the presented fields.

Contributions Research projects in bioinformatics often require the expertise and the effort of more than one person to be successful. Due to this, I cannot claim exclusive authorship for all of the work presented in this thesis. To ensure that all contributions can be attributed fairly, sections that discuss shared work are prefaced with a short list of the main contributors. More detailed information can be found in the author lists and the contributions sections of the respective publications.


2 BIOLOGICAL BACKGROUND

If you try and take a cat apart to see how it works, the first thing you have on your hands is a non-working cat. — Douglas Adams, The Salmon of Doubt (2002)

During the last century, life expectancy in developed countries has increased dramatically. Between 1970 and 2010 alone, life expectancy at birth increased by roughly ten years in Germany (see Figure 2.1). This change can be attributed to improved living conditions due to significant advances in technology, sanitation, and medicine [Ril01]. For example, the introduction of cyclosporin into clinical practice allowed more reliable organ transplants [Cal+79]. Imaging techniques like magnetic resonance imaging (MRI) and computer tomography (CT) made it possible to capture high-resolution images from otherwise difficult to reach regions of the body [Bec14]. The invention of genetically modified organisms [MH70] allowed the production of human insulin for the treatment of diabetes [Joh83]. During this period, smallpox was eradicated due to a rigorous vaccination campaign [Fen93]. Thanks to the discovery of antiretroviral drugs and effective treatment regimens, HIV-positive patients at age 20 can expect to live another 30 years or more [ART08]. (In 1983, the human immunodeficiency virus (HIV) had just been discovered [Mon02] and has, since then, claimed millions of lives [HIV+14].) Owing to this development, many of the once common, deadly diseases are no longer an issue in the 21st century. Still, modern medicine suffers from certain flaws and shortcomings. Through the widespread use of antibiotics, resistant strains are beginning to form that are only treatable with great difficulty [Neu92; Nik09]. This development has led to the return of old killers, such as tuberculosis, in some countries [Kam95]. Diseases like SARS [Cho+04], bird flu [Pei+04], or Ebola [Tea+15] that spread quickly via aerosols or body fluids have claimed many victims, especially in less developed countries. Also, genetic diseases such as Huntington's disease [Wal07] or cystic fibrosis [Rio+89] pose challenges when it comes to finding a cure. In particular, many tumour types have eluded effective therapy for decades, despite the remarkable advances that have been made in cancer research.

Reasons for these problems are manifold. In the case of most viral infections only a few effective antiviral drugs are available [De 04]. For influenza, vaccinations exist that, however, only protect against a limited set of viral strains [Bri+00]. For Ebola, no approved vaccinations exist [Gal+14], although several candidates are in the drug development pipeline at the time of writing [Agn+16]. In the case of genetic disorders the cause of the disease is not an external influence, but instead is encoded in the patient's genes. Though a set of best practices has been established for a large number of cancer types (cf. Figure 2.4), the treatment for certain tumours needs to be decided on a case-by-case basis due to their heterogeneity. To combat the above diseases, it is necessary to be able to quickly identify and analyse the predominant pathogenic mechanisms. Based on this information, a treatment that has been specifically tailored towards the patient needs to be devised. Ideally, this would entail the possibility to synthesise novel drugs if necessary.


[Figure 2.1: line chart of life expectancy at birth in years (1970–2010) for Germany, France, and the USA.]

Figure 2.1: Life expectancy at birth in years. Between 1966 and 2013, life expectancy at birth in Germany increased from 70 to 81 years. Source: World Bank http://data.worldbank.org/indicator/SP.DYN.LE00.IN. Retrieved: 2015-12-03

Whilst we are, technologically and ethically, far away from being able to create new drugs on demand, considerable advances have been made towards such a personalised medicine (cf. Section 2.2). In particular, improved experimental methods for data generation provide an essential building block for reaching this goal. For example, the analysis of DNA/RNA samples is more and more becoming a routine procedure in biological laboratories and has started to enter clinical practice [BKE11; Bie12; KGH12]. Methods like cDNA microarrays (Section 2.3.2) and high-throughput sequencing (Section 2.3.3) make it possible to analyse the genetic state of tissue samples in great detail. Recent research suggests that reliably analysing the mRNA abundances within individual cells is not too far off in the future [Naw14], promising e.g. the availability of high-resolution profiles of tumour biopsies [NH11]. Other techniques such as mass spectrometry have enabled capturing sizeable parts of the metabolome [DBJ05] and proteome [WWY01]. For the latter, it is possible to detect protein modifications such as phosphorylations and ubiquitinations that play an important role in the regulatory circuitry of our cells [Man+02]. Given this data, it should be possible to conceive more thorough ways of performing personalised medicine. (The techniques presented in this thesis are generally applicable and not tied to cancer specifically.) However, the analysis of high-throughput data proves to be a difficult task from both a practical as well as a theoretical point of view. To understand some of these issues, this chapter gives an overview of the relevant biological background. In addition, we will discuss some of the experimental methods that are frequently encountered throughout this thesis.

10 2 . 1 c a n c e r perimental methods that are frequently encountered throughout this thesis. We use cancer as a model disease to illustrate the capabilities of our methods. There are various reasons for this choice: first, the disease is of great clinical, but also social and economical relevance. Second, a vast corpus of experimental data derived from tumour biopsies has been compiled and is publicly available. And third, although cancer re- search has progressed tremendously over the last decades, many fun- damental aspects of the acting pathogenic mechanisms remain unre- solved. Thus, studying cancer not only serves the improvement of ther- apy, but also promises to yield exciting biological insights. We start with a discussion of cancer biology in general and the prop- erties of Wilm’s tumours specifically. We then proceed to a more pre- cise definition of personalised medicine and outline the problems that need to be solved for implementing personalised medicine schemes. Fi- nally, we introduce the basics of gene expression and discuss the cDNA microarray as well as the RNA-seq technology for capturing expression profiles.

2.1 cancer

Cancer is a heterogeneous class of diseases characterised by the abnormal growth of tissue. (Much of the information in this section is taken from "The Biology of Cancer" by Weinberg [Wei13]; additional sources are quoted explicitly.) Causes for the development of cancer can be sought in many factors including lifestyle, exposure to pathogens and radiation, mutations, or epigenetic alterations. Ultimately, these factors result in a disruption of the regulatory circuitry of previously healthy cells, which transforms them into tumour cells.

Somatic mutations are a well-researched mechanism that leads to the formation of tumour cells. (Unless specified otherwise, the term "mutation" is used in its broadest meaning: a change or aberration in the genome.) While mutations are usually detected and neutralised by internal cellular controls or the immune system, some mutations go unnoticed. Interestingly, only a few so-called driver mutations are believed to be sufficient for establishing cancer [SCF09]. Examples are mutations that lead to the transformation of "normal" genes to oncogenes: genes that have the potential to induce cancer. (Genes that can be transformed to oncogenes are called proto-oncogenes.) One of the first reported transformation mechanisms is the activation of the T24 oncogene in human bladder carcinomas due to the exchange of a single nucleotide [Red+82]. Once driver mutations have established a foothold, additional mutations are accumulated.

As mentioned above, cancer and especially cancer cells are characterised by their ability to form new tissue and, thus, are able to freely divide. Two non-exclusive hypotheses explain this behaviour: either cancer cells derive from stem cells, or they dedifferentiate from mature cells in order to obtain stem-cell-like characteristics [Sel93]. The newly formed tissue is denoted as neoplasm or tumour; the term neoplasm can be directly translated as "new tissue". (The definitions here pertain to solid tumours. However, many properties such as tumour heterogeneity directly carry over to non-solid tumours such as leukaemia.) Not every neoplasm is immediately life threatening. For example, moles, also known as melanocytic nevi, are pigmented neoplasms of the skin that are, usually, harmless. To reflect this, neoplasms are often classified as either benign or malignant. Whereas benign tumours grow locally and do not invade adjacent tissues, malignant tumours commonly invade into nearby, healthy tissues and may also be able to spawn new tumours, so-called metastases, at different locations.


[Figure 2.2: diagram of the hallmarks of cancer (sustaining proliferative signalling, evading growth suppressors, avoiding immune destruction, enabling replicative immortality, tumour-promoting inflammation, activating invasion & metastasis, inducing angiogenesis, genome instability & mutation, resisting cell death, deregulating cellular energetics), each paired with an exemplary treatment option such as EGFR inhibitors, cyclin-dependent kinase inhibitors, immune-activating anti-CTLA4 mAb, telomerase inhibitors, proapoptotic BH3 mimetics, PARP inhibitors, selective anti-inflammatory drugs, inhibitors of HGF/c-Met, and inhibitors of VEGF signalling.]

Figure 2.2: The hallmarks of cancer as proposed by Hanahan and Weinberg [HW00]. The authors postulate that during tumourigenesis normal cells successively acquire the "hallmark stages" as they evolve towards a malignant tumour. Coloured boxes indicate treatment options for counteracting the molecular changes constituting the respective hallmark.

In its development from healthy to neoplastic tissue, cancer cells go through an evolution that allows them to grow successfully in a host organism. This evolution is driven by selection pressure that is, for example, exerted by the immune system, the competition for nutrients with surrounding cells, and, possibly, anti-cancer drugs. Naturally, each successful cancer cell needs to develop the ability to avoid detection by the immune system and must resist programmed cell death (apoptosis). Furthermore, tumours start to induce the growth of new blood vessels (angiogenesis) to account for the increased energy consumption due to uncontrolled growth and replication. Hanahan and Weinberg [HW00] summarised these objectives under the term hallmarks of cancer. The hallmarks describe a set of characteristics tumour cells typically achieve during their development (Figure 2.2). In a later publication, Hanahan and Weinberg [HW11] added four additional characteristics to the six original hallmarks.

Tumours can be classified according to the cell types they stem from. The most common type of human tumours, carcinomas, derive from epithelial cells and are hence also dubbed epithelial tumours. Epithelia are tissues that can be found throughout the body. They are composed of sheets of cells and act as covers of organs, blood vessels, and cavities. The second class of tumours are called sarcomas and only account for roughly 1 % of all tumours. They stem from e.g. fibroblasts, adipocytes, osteoblasts, or myocytes. Tumours stemming from blood-forming cells are more common. Examples are leukaemias and lymphomas. The last class are tumours that derive from cells of the nervous system. Examples are brain tumours such as gliomas and glioblastomas. While this classification accounts for a large part of all cancer types, some tumours such as melanomas or small-cell lung carcinomas derive from cell lineages that do not directly fit into it, as they stem from different embryonic tissues or have an unclear origin altogether.

In contrast to healthy tissue, which is composed of cells with well-defined tasks, a tumour is usually significantly more heterogeneous. The key to this heterogeneity lies in the rapid rate with which tumour cells tend to accumulate mutations and epigenetic aberrations [LKV98; FT04]. As a specific mutation event only occurs locally in a single cell, it remains specific to this cell and its progeny. This effectively establishes lineages of cancer cell subpopulations. As a direct consequence, measurements from tumour samples often only provide values that have been averaged over an ensemble of genetically diverse cells. (Similar problems exist in HIV genotyping, where only the sequence of one or a few of the most abundant sub-species can be reliably resolved.) Also, not all cells involved in a tumour necessarily carry a defective (epi)genome. For instance, cells forming blood vessels, cells from healthy tissue, and immune cells can be found there, too. Characterising a tumour or predicting its reaction to treatment reliably is thus difficult. Experimental techniques such as single cell sequencing only solve this problem to a certain degree. While they may provide complete knowledge about a set of individual cells, no knowledge is obtained about the billions of remaining cells that make up the tumour.

2 . 1 . 1 Wilm’s Tumour

Wilm’s tumours (WTs) are childhood renal tumours. They comprise 95 % Still, WTs are comparat- of all diagnosed kidney tumours and six percent of all cancers in chil- ively rare due to the low dren under the age of 15. Most WTs are diagnosed in children younger prevalence of child cancers. than five years [Chu+10]. The clinical name nephroblastoma derives from the fact that WTs develop from the metanephrogenic blastema, an em- bryonic tissue. Most often WTs exhibit a triphasic histopathological pattern composed of blastemal, stromal, and epithelial cells although other compositions do exist [BP78]. According to the Société Interna- tionale D’Oncologie Pédiatrique (SIOP) Wilm’s tumours after preoperat- ive chemotherapy are classified as follows [Vuj+02]: first, the presence of an anaplasia is determined. For this, the tumour must contain poorly differentiated cells, e.g. cells that lost defining morphological character- istics or display other potentially malignant transformations such as a nuclear pleomorphism. If the tumour is not anaplastic, the reduction in tumour volume due to the chemotherapy is quantified. If no living tu- mour tissue can be detected, the tumour is labelled as completely necrotic. If its volume decreased by more than two thirds the tumour is classified as regressive. Otherwise, the classification is based on the predominant cell type. If more than two thirds of the living cells are either blastemal, epithelial or stromal cells, the tumour is labelled accordingly. Tumours with an even distribution of cell types are classified as triphasic. Based


Figure 2.3: An exemplary WT treatment schema as used in the SIOP 2001 study. Tumours are stratified into risk groups based on tumour stage and histology. Abbreviations: dactinomycin (A), vincristine (V), doxorubicin (D), doxorubicin/carboplatin/cyclophosphamide/etoposide x 34 weeks (HR). Image taken from Dome, Perlman, and Graf [DPG14].

Risk group           Classification
Low risk             Completely necrotic
Intermediate risk    Regressive
                     Epithelial type
                     Stromal type
                     Triphasic type
                     Focal anaplasia
High risk            Blastemal type
                     Diffuse anaplasia

Table 2.1: Wilm’s tumour classification and risk groups after preoperative chemotherapy according to the SIOP reference [Vuj+02]. Blastemal subtype and diffuse anaplasia tumours are the most aggressive nephroblastoma sub- types. on tumour stage, volume, and histology further treatment decisions are made (Figure 2.3). While WTs are generally associated with high 5-year survival rates of ≈ 85 % [Chu+10; Sre+09], the prognosis for some subtypes is signi- ficantly worse. An example for this are the blastemal subtype tumours. While WTs with a high content of blastemal cells generally respond well to chemotherapy, this is not true in about 25 % of the cases. Such resistant, blastema-rich tumours, which account for almost 10 % of all WTs, are among the most malignant WT types (cf. Table 2.1)[Kin+12; Heu+15]. One of the earliest identified mutations associated with nephro- blastomas is the inactivation of the tumour suppressor WT1 that likely

14 2 . 1 c a n c e r plays an important role in genitourinary development [LH01; Roy+08] and is also associated with human leukaemia [MBS92; Ino+97]. A more recent study by Wegert et al. [Weg+15] links mutations in genes such as SIX1, SIX2, DROSHA, TP53, and IGF2 to specific sub- and phenotypes of WTs. While other, more aggressive cancer types such as lung cancer my be of greater medical interest, WTs possess properties that make them an ideal model disease. First, while a range of histological subtypes ex- ist for this tumour, they are relatively homogeneous in the sense that they carry a comparatively low amount of mutations [Weg+15]. Thus, most biological processes remain intact, which greatly eases the inter- pretation of the data. Second, as the disease commonly occurs in young children, environmental effects play comparatively small roles inthe development of the tumour. Third, as WT are quite rare [BP78], most cases are treated by a small community of experts. Accordingly, they are well and consistently documented.

2.1.2 Cancer Therapy

A core problem of cancer therapy is the timely detection of neoplasms. Many cancers can be treated effectively if detected at an early stage. For example, early stage melanomas can simply be excised with little to no adverse effects. Polyps in the colon, which may later develop into cancer, can be removed during routine colonoscopies [Win+93]. However, when a tumour is not detected during early stages, treatment becomes more difficult. Compared to early stage tumours, late stage tumours have had considerably more time to acquire hallmarks-of-cancer traits (Figure 2.2). Consequently, they possess a less well-defined boundary as they begin to invade healthy, adjacent tissue, preventing effective surgery (cf. [Suz+95]). Also, metastases, which spread via blood or lymph and form new, aggressive tumours in other parts of the body, may have been established [Fid03]. This makes it difficult to reliably assess the success of a treatment, as it is uncertain whether all cancer cells could be successfully removed [Bec+98; Wet+02]. While chemotherapy may be used to combat these developments, drug resistance often limits its efficacy (cf. [Per99; RY08]). In the following we give a more detailed overview of the available options for diagnosis and treatment.

Diagnosis
Performing routine cancer screens can considerably lower the risk of developing cancer [Wen+13; Bre+12]. However, this creates a dilemma when planning an effective scheme for preventive care. First, no single test covering all possible tumour classes exists and it can be assumed that none will exist in the foreseeable future. This means that significant parts of the population would need to undergo several medical screenings in fixed or possibly age-dependent intervals. However, schemes relying on unconditional screenings are problematic. Besides the considerable financial implications, there are also statistical issues that make such an approach infeasible. The problem is that the likelihood to develop a certain type of cancer at a certain point is comparatively low (let us assume about 0.1 %). While in itself not problematic, this fact can have important implications together with a fundamental property of tests based on empiric data: all tests make errors. Generally, one can distinguish between two kinds of errors (cf. Table 4.1; statistical tests are discussed in more detail in Section 4.1). The first class of errors is to detect an effect (e.g. diagnose cancer) when there is none, while the second class is to not detect an effect when there is one. Making an error of the second kind may result in an unfavourable diagnosis because the tumour is detected too late. Making an error of the first kind means that a healthy person must undergo unnecessary and potentially risky follow-up examinations such as biopsies. This is both a waste of resources [OM15] and a burden on the health of the patient. Unfortunately, we can expect the first kind of error, diagnosing cancer although the patient is healthy, to occur far more often than the second kind because we previously assumed that the prevalence of cancer is only 0.1 % (cf. [ZSM04; Etz+02]). The resulting number of unnecessary follow-up examinations renders broad, regular screenings both ethically and economically questionable. Instead, stratification into risk groups and targeted tests based on a patient's case history must be used to achieve effective, early diagnostics; this is in fact recommended by studies and treatment guidelines [Wen+13; Bre+12; Bur+13]. Techniques from bioinformatics may help in this targeted approach. As sequencing costs for a human genome have dropped dramatically in the last decade (Figure 2.10), genotype information as a supplement to classical factors used in risk determination, such as family history and lifestyle, has the potential to improve the accuracy of diagnosis tremendously. Additionally, the development of minimally invasive tests with high specificity can help to reduce the risk for patients significantly. To this end, the search for specific biomarkers is a prolific field of computational biology [Saw08]. In this context, a biomarker is a molecule or a set thereof that can be measured objectively and can be used as an indicator of the state of a pathogenic process. (Various definitions of a “biomarker” exist; Strimbu and Tavel [ST10] provide an overview of commonly used definitions.) For this purpose, also proteins and miRNAs isolated from the blood stream have been used (cf. [Kel+06; Mit+08]).
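To make this concrete, consider a hypothetical screening test with a sensitivity of 99 % and a specificity of 95 % (both numbers are assumptions chosen purely for illustration). With the assumed prevalence of 0.1 %, Bayes' theorem gives the probability that a positive result actually corresponds to cancer:

\[
P(\text{cancer} \mid \text{positive}) = \frac{0.99 \cdot 0.001}{0.99 \cdot 0.001 + 0.05 \cdot 0.999} \approx 0.019
\]

That is, under these assumptions roughly 98 % of all positive screening results would be false alarms, which illustrates why unconditional screening of the whole population is problematic.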

Therapy
Cancer therapy is a large field that is impossible to discuss exhaustively. Here, we give a rough outline of the available treatment options. For treating cancer, three main angles of attack exist: surgery, radiation therapy, and chemotherapy [DLR14]. Each of these approaches comes with distinct advantages and disadvantages. Surgery can be an extremely efficient treatment option, as possibly large tumour volumes can be removed. In ideal cases, all tumour mass can be removed in a single session [Cof+03]. However, it should be noted that the applicability of surgery is severely limited due to its invasiveness: tumours that are difficult to reach, too large in volume, or have developed metastases can prove difficult to treat with surgery alone [Pet+15]. Besides curative purposes, surgery is also an important tool for diagnosis, as often only a biopsy allows to determine whether cancer is present and, if yes, how treatment should be commenced (cf. [Hei+14]). Also, all approaches that rely on gene expression analysis or genotyping require that a biopsy is conducted. (In recent years, technologies for isolating tumour cells from blood have been developed, which may alleviate the need for surgery; cf. [AP13].)

In combination with or instead of surgery, radiation therapy can be used for treatment [Tim+10]. Commonly, radiation therapy directs a beam of ionising radiation at a target tumour. Cells along the beam absorb energy, leading to the formation of free radicals that attack and fragment DNA. In order to minimise the damage to healthy tissue, the beam is applied from multiple angles, leading to an accumulation of an effective radiation dose only in the tumour [DLR14]. Still, radiation therapy can have severe adverse effects that can lead to the induction of secondary cancer (cf. [Tuc+91; Tsa+93; Kry+05; Hal06]).

Complementing surgery and radiation therapy, chemotherapy allows to treat tumours via the administration of drugs. As cancer cells remain by and large human cells, finding drugs that specifically target cancer cells is difficult [KWL07]. To this end, classical chemotherapy uses cytotoxic or cytostatic agents. These drugs target rapidly dividing cells, where they induce cell death or prevent proliferation. This often results in harsh adverse effects such as anæmia, fatigue, hair loss, nausea, or infertility. In contrast to this, targeted therapy attempts to attack specific molecules, most of the time proteins, that are responsible for the deregulation of signalling pathways that takes place in cancer cells [Saw04]. Targeted drugs are not universally applicable, but require tumours to fulfil certain properties in order to be effective. Examples are the presentation of certain antigens on the cell membrane or the target protein carrying a mutation [Saw08]. To achieve the necessary specificity, often biomolecules such as antibodies, which are able to detect cancer cell specific epitopes, are used. Thus, to apply targeted therapy effectively, an analysis of the tumour on the molecular level is necessary. (To assist with this, we created the DrugTargetInspector tool [Sch+15], which we present in Section 5.5.) For cases in which the preconditions for targeted drugs are met, impressive treatment success with comparatively few adverse effects has been reported. A popular example of targeted therapy are drugs targeting the EGF receptor (EGFR), which are only effective in (breast) cancers that carry a certain point mutation in the EGFR kinase domain [Pae+04; Mas+12]. Other types of targeted therapy attempt to attack tumour stem cells or the tumour's microenvironment, e.g. by preventing blood vessel formation [AS07; EH08].

Usually, none of the previously described techniques is used in isolation. Often a combination of surgery, chemotherapy, and radiation therapy is used in different stages of the therapy. Figure 2.4 shows the treatment recommendations of the European Society of Medical Oncology (ESMO) for early breast cancer. In this recommendation, various factors such as tumour volume or the efficacy of previous treatments are considered to recommend the next step in the therapy [Sen+15]. Combining multiple therapy options is sensible for various reasons. First, a single treatment option is often not enough to guarantee the removal of all tumour cells. For example, residues of the tumour (cf. minimal residual disease [Kle+02]) that are difficult to excise during surgery may be treated using adjuvant chemotherapy. Second, as tumours are subject to a highly accelerated selection process due to their high proliferation and mutation rates, it is possible that some tumour cells develop a resistance against the employed cancer drugs. Similarly to HIV therapy, using multiple agents and treatment options increases the selection pressure and thus reduces the number of escape mutants [SS08b; HZ12; BL12]. Third, using multiple, cancer specific drugs during chemotherapy has been reported to be more effective than using only a single formulation, as this increases the specificity of the treatment [Coi+02; Ban+10].

Figure 2.4: Flow chart for the treatment of early breast cancer. Abbreviations: chemotherapy (Cht), breast-conserving surgery (BCS), endocrine therapy (ET), radiotherapy (RT). Image taken from Senkus et al. [Sen+15].

2.2 personalised medicine

In classical medicine, a doctor makes a diagnosis based on the patient's history (anamnesis), the displayed symptoms, and, if required, additional measurements. Based on this, the physician chooses an appropriate therapy. Often, this means administering one of the drugs designed for treating the disease. In recent years, it has become clear that for some diseases the genetic, epigenetic, and biomolecular properties of the patient as well as the disease need to be considered to determine an optimal therapy. The level of detail that needs to be considered can vary by a large margin. In some cases, the membership in a particular ethnic group can provide enough information on the genomic background. For example, the drug BiDil, which was approved by the FDA for treating congestive heart failure, is only effective for the African American parts of the US population [BH06]. In HIV therapy, the genome of the virus can provide crucial information on its resistance or susceptibility to certain antiretroviral drugs [Bee+02]. In heterogeneous diseases like cancer, taking the genes expressed by the tumour into account can help to dramatically increase treatment success [Kon+06]. These observations also become increasingly important for “classical” diseases such as bacterial infections due to the emergence of multi-resistant strains, for which an effective antibiotic needs to be chosen [Neu92]. The development of treatment strategies that are tailored specifically towards a specific case of a disease is called personalised medicine (sometimes the term precision medicine is used).

Deploying personalised medicine, however, proves to be difficult due to a variety of issues. First, the identification of genomic factors that contribute to the risk of developing a disease is problematic. To this end, genome wide association studies (GWAS) are commonly applied. For them to succeed, large cohorts (≥ 10,000 samples) are required [HD05]. Even then, rare variations, which are likely to have a large impact on the disease risk, continue to elude these studies [CG10]. Even if a molecule or process that plays a key role in a disease has been identified, the selection of an appropriate drug is not always possible. This may be because no appropriate drug exists. However, even if a drug existed, it may be overlooked as only a small percentage of the available drugs is annotated with pharmacogenomic information [Fru+08]. (Pharmacogenomics concerns itself with the influence of genomic variations on the effect of drugs.) To remedy this situation, studies such as the Cancer Cell Line Encyclopaedia (CCLE) [Bar+12] and the Genomics of Drug Sensitivity in Cancer (GDSC) [Gar+12] attempt to compile libraries of the effect of drugs on cell lines with a known set of mutations. However, these in vitro measurements are only the first step into the direction of more comprehensive pharmacogenomic information: the overlap between the studies has been reported as “reasonable” by the authors [C+15] and as “highly discordant” [Hai+13] by independent researchers. Drugs that are only effective given a certain mutation must, as all other drugs, undergo clinical trials to ensure safety and efficacy of the drug. Unfortunately, classical study designs, where a control group of patients receiving the standard treatment is monitored in comparison to a group receiving the modified treatment, are inefficient for validating the merits of patient specific drugs [Sch15b]. If, e.g., the genomic trait responsible for the susceptibility to a drug is comparatively rare throughout the population, the number of patients that will respond can be expected to be low and hence the difference between uniformly selected control and sample groups is likely small. To reliably determine drugs that are only effective for parts of the population, improved study designs need to be employed [Sin05; Fre+10].

For bioinformatics, the development of effective personalised medicine schemes poses several challenges such as the reliable analysis of large-scale genomic data, the compilation of databases containing pharmacogenetic knowledge, as well as the training of reliable, statistical recommendation procedures [Fer+11]. Partly due to this, the road towards an “ideal”, personalised medicine is still long. Nevertheless, significant steps are currently being made towards this goal [Les07]. In this work, several tools that may have the potential to assist in choosing personalised treatments are presented. The primary example is DrugTargetInspector (Section 5.5), which employs pharmacogenomic information to judge the influence of somatic mutations on drug efficacy. Similarly, the enrichment and network analyses provided by our proposed tools GeneTrail2 (Section 5.3) and NetworkTrail (Section 5.4) can be used to gain insights into patient data.

2.3 biological assays

An important task in bioinformatics is the processing of data obtained from biological experiments using methods from statistics and computer science. While it is convenient to test new algorithms on synthetic data, every bioinformatics method eventually needs to work with actual measurements. As a consequence, the developed methods need to take the properties of the data generation process and, thus, the used biological assays into account. Furthermore, with the development of high-throughput assays such as microarrays, short read sequencing, or modern mass spectrometry, the amount of generated biological data has grown to a staggering volume [Ste+15]. This places two further requirements on computational methods. First, the approach needs to be efficient enough to potentially process terabytes of data. Second, the employed statistical methods need to be able to cope with the high dimensionality of the data. Here, we discuss the microarray and short-read sequencing technologies as the remainder of this thesis will mostly be concerned with data obtained from these two experimental setups. Notably, the focus will lie on expression datasets. To this end, the next section provides a basic introduction to the fundamental mechanisms that underlie gene expression in eukaryotic cells. Readers familiar with this concept may want to skip directly to the introduction of the microarray platform in Section 2.3.2.

2 . 3 . 1 Gene Expression

The activity of a gene is usually defined as the rate with which it is read by the RNA Polymerase II (PolII) [Kor99] and subsequently translated into protein. This process is also known as gene expression. First, PolII transcribes RNA copies of the gene by reading the anti-sense DNA strand in a 3' to 5' direction (Figure 2.5). (This process is called transcription, as a text using letters from the DNA alphabet is copied into the same text using letters from the RNA alphabet.) As these copies carry the information of the gene out of the nucleus to the translation machinery, they are called messenger RNA (mRNA). After transcription, a 5' cap structure as well as a 3' polyadenyl tail is added to prevent too rapid degradation of the mRNA [Sha76; BKW93]. As eukaryotic genes do not only contain protein coding sequences, but rather are organised into protein coding exons and non-coding introns, further processing is required prior to export from the nucleus [KMR78]. Via splicing, all non-coding parts are removed from the precursor mRNA [WL11].


Figure 2.5: Basic overview of transcription in a eukaryotic cell. After RNA Polymerase II binds at the promoter of a gene, it transcribes the DNA anti-sense strand to mRNA. For transcription, the DNA is temporarily unwound by a helicase protein, creating the so-called transcription bubble. After the mRNA has been transcribed, a 5' cap and a 3' polyadenyl tail are added to prevent degradation. Mature mRNA is created by the (alternative) splicing process, which excises intronic and sometimes exonic sequences.

During this process, some exons may be skipped and thus splicing can produce multiple isoforms of a single gene [Bla03]. This alternative splicing is one mechanism that explains how relatively few genes can give rise to a far larger number of diverse proteins. After splicing, the mRNA is exported to the cytosol, where it is translated into proteins. Besides mRNA, non-coding classes of RNAs exist that play important roles in the regulation and catalysis of cellular processes [Cec87; Edd01; MM06]. One such kind of RNA that has increasingly entered the focus of the scientific community during the last 20 years are the so-called micro RNAs (miRNAs). Each miRNA consists of 21 to 25 nucleotides and possesses the ability to bind the 3' untranslated region (UTR) of an mRNA [Fil+05]. Such binding events can result in the mRNA's degradation or may inhibit its translation into proteins. Thus, miRNAs form an additional regulatory mechanism that controls the rate with which a gene is translated into proteins (cf. Figure 1.2). As with traditional genes, miRNAs are transcribed by PolII, yielding a 3' polyadenylated and 5' capped primary miRNA (pri-miRNA) [Lee+04]. This transcript is then processed into precursor miRNA (pre-miRNA) by the Drosha protein [Den+04] and exported to the cytosol. There, it is cleaved by Dicer and loaded into the Argonaute (AGO) protein of the miRNA-ribonucleoprotein (miRNP) complex [Fil+05]. The loaded complex then detects and binds complementary mRNA in its 3' UTR.


Figure 2.6: Maturation and binding of miRNAs. Primary miRNA is transcribed by PolII and carries the same 3' polyadenyl tail and 5' cap structure as mRNA. A stem loop is excised by the Drosha protein, resulting in precursor miRNA. This pre-miRNA is further processed by Dicer, yielding 5p and 3p mature miRNAs. The miRNA is then loaded into the AGO domain of the miRNP complex, which is then guided to an mRNA with a complementary binding motif in the 3' UTR.

A general assumption is that the expression or effect of a gene is proportional to the number of mRNA or miRNA transcripts that are available at a given point. (Due to post-transcriptional regulation, this assumption often breaks down.) Assays that allow to measure the concentration of a given transcript should thus allow to deduce the regulatory state of a sample and, in turn, the currently active biological processes. With mRNA/miRNA microarrays and RNA-seq, methods for measuring complete transcription profiles have been developed.

2 . 3 . 2 Microarrays

Microarrays or gene chips are experimental platforms that are most commonly used for measuring gene expression levels. (Further common use cases are the measurement of DNA methylation states or protein abundances.) They are based on the hybridisation of complementary DNA strands. In principle, a microarray is a glass, plastic, or silicon slide onto which patches, so-called spots, of oligonucleotide probes have been fixated. In general, the probe sequences are chosen such that they are highly specific for the targeted transcripts. Modern mRNA microarrays contain millions of probes that allow them to cover all known exons of the human genome [Kro04]. To analyse a sample (cf. Figure 2.7), the mRNA is extracted and transcribed back to DNA (reverse transcription). In this process, nucleotides carrying fluorescent labels are incorporated into the resulting, complementary DNA (cDNA).



This labelled cDNA is then applied to the chip, where it hybridises to the respective complementary probes (see Figure 2.7). After a washing step that removes excess cDNA and unspecific binding events, the chips can be read out. Using a laser, the amount of fluorescence for every spot and hence the amount of bound cDNA can be determined.

Figure 2.7: Overview of a microarray-based differential gene expression study. The mRNA is extracted from the sample and reference specimens. It is reverse-transcribed to cDNA, amplified, and labelled with fluorescent nucleotides. The cDNA is then hybridised to a microarray. After fixation and washing steps, the array can be read out using a laser scanner. The obtained raw expression values are normalised and summarised to obtain per-gene expression values. Afterwards, differentially expressed genes can be detected.

Though the principles underlying the microarray technology are straightforward, various practical challenges need to be solved to obtain reliable expression values. (Due to the advent of RNA-seq, only few microarray manufacturers remain; the major ones are Affymetrix, Agilent, and Illumina.) A central issue is the fixation of probes onto the microarray slide. For this purpose, a set of fundamentally different approaches exists. Chips by Affymetrix and Agilent use an in situ process for creating the probes directly on the chip substrate. Affymetrix uses a photolithographic process for synthesising the probes directly on the substrate (Figure 2.8). Agilent uses an ink-jet-like process similar to a regular printer in which nucleotides are successively sprayed onto the chip and incorporated into the probe sequence. (The classic pin-spotting technology is plagued by problems such as mechanical wear on the pins and is nowadays hardly being employed.) The length of the synthesised probes is, however, limited, as with growing probe length errors start to accumulate depending on the used technology. Thus, the probe length varies from manufacturer to manufacturer. For chips from Affymetrix (e.g. GeneChip HuGene 2.0 ST arrays), the probe length is 25 bases [Aff07], whereas chips from Agilent (e.g. SurePrint G3 arrays) use 60 bases [LeP08]. To account for the shorter probe length, Affymetrix chips use multiple probes for the same transcript, whereas Agilent relies on a single probe [Mul+16]. In addition, Affymetrix chips carry perfect match (PM) and mismatch (MM) spots for the same transcript that, in theory, should allow to better quantify the amount of unspecific cross-hybridisation. In practice, the value provided by MM probes is questionable [Iri+03].



Figure 2.8: Manufacturing process of Affymetrix microarrays. (a) The 5' ends of the probes are protected with a cap. For extending a probe, a photolithographic mask is placed on the chip. Ultraviolet light removes the caps and (b) allows new nucleotides to bind. After a washing step (c), the process is repeated with a different mask [INC05].

Chips manufactured by Illumina use the so-called BeadArray technology. Here, beads carrying probes of length 22 to 24 are immobilised in small wells etched into the chip surface. In contrast to other microarray technologies, where the location of a probe on the chip is well known, the beads and hence the probes are randomly distributed across a BeadArray. Accordingly, the beads occupying the wells need to be identified prior to measuring. To this end, techniques based on coding theory developed for “Sequencing by Hybridisation” [SME92] are applied. As no DNA needs to be synthesised on the chip, higher packing densities are possible than for the previous two technologies [Gun+04].

Microarray Preprocessing
Microarray data contains measurement errors and uncertainty. Usually, two types of noise are distinguished: technical noise and biological noise. Whereas biological noise quantifies the natural fluctuations that can occur between two different samples under the same conditions, technical noise accounts for the variability introduced by sample preparation and the experiment itself. In addition, batch effects and confounding factors need to be taken into account during data analysis. Batch effects arise whenever a set of samples needs to be measured in multiple chunks, e.g. due to a large sample size. Examples are changes in external conditions such as temperature or lighting. The same is true for samples analysed using different master mix solutions or conducted by different experimenters from different labs. These factors introduce systematic errors that can lead to shifts in expression intensity and variance between groups of chips. Confounding factors lead to similar problems, but depend on the analysed samples themselves. For example, the age or gender of the patients in the control and sample groups can lead to biases in the measured expression values. For obtaining expression values from microarray experiments, multiple steps are necessary.


Figure 2.9: Affymetrix microarrays. Source: https://commons.wikimedia.org/w/index.php?title=File:Affymetrix-microarray.jpg&oldid=165509627

Scanning and Image Analysis
Capturing the raw data of a chip is achieved using a laser scanner. This will create a grey-scale image that needs to be further analysed. (Microarrays can analyse paired samples simultaneously by using different dyes, so-called dual-channel experiments. Here, we only discuss experiments using one dye, i.e. single-channel experiments.) Depending on the chip technology, multiple image processing steps are needed to detect and address the fluorescence signal from the probes. These may account for slight rotations and spot irregularities. Once the pixels belonging to a spot have been identified, a raw expression value for this spot is computed by integrating over the measured intensities.

Background Correction
While the washing step in microarray protocols eliminates most unspecific binding events, a residual amount of unspecific hybridisations remains. To remove this effect from the raw data, a background correction step is performed.

Normalisation
Once raw expression values have been created, it is necessary to conduct a normalisation step to ensure the compatibility between probes (intra-array) as well as samples (inter-array). Intra-array normalisation also entails the aforementioned background correction step. For inter-array normalisation, the expression value distributions of the chips are made comparable. Various normalisation algorithms and software packages are available. Examples are the variance stabilising normalisation (VSN) [Hub+02] or the quantile normalisation technique [Bol+03].
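As a minimal sketch of the quantile normalisation idea (an illustration of the principle only, not the implementation used by any particular package; ties are handled naively), every sample is forced onto a common reference distribution:

import numpy as np

def quantile_normalise(X):
    """Quantile-normalise a (probes x samples) expression matrix.

    Every column is mapped onto the same empirical distribution,
    namely the mean of the column-wise sorted intensities.
    """
    ranks = np.argsort(np.argsort(X, axis=0), axis=0)  # rank of each value within its column
    reference = np.sort(X, axis=0).mean(axis=1)        # target distribution (row means of sorted columns)
    return reference[ranks]                            # replace each value by the reference value of its rank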

Summarisation
To make the measurements more robust, multiple probes, or copies of a single probe, that all match to the same mRNA are distributed across the chip. These raw values then need to be summarised to a single expression value. To protect against outliers, algorithms such as the median polish procedure [Hol+01] are used during summarisation.
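The median polish procedure can be sketched as follows (a simplified illustration of Tukey's algorithm, not the exact routine used by any summarisation package): row and column medians are alternately swept out of a matrix of log intensities, and the fitted overall and array effects serve as robust per-array expression values.

import numpy as np

def median_polish(X, n_iter=10):
    """Summarise a (probes x arrays) matrix of log intensities.

    Alternately removes probe (row) and array (column) medians; the
    overall effect plus the array effects yield one summarised
    expression value per array.
    """
    residuals = X.astype(float).copy()
    overall = 0.0
    row_eff = np.zeros(X.shape[0])
    col_eff = np.zeros(X.shape[1])
    for _ in range(n_iter):
        row_med = np.median(residuals, axis=1)      # sweep out probe effects
        residuals -= row_med[:, None]
        row_eff += row_med
        delta = np.median(col_eff)                  # move common part into the overall effect
        col_eff -= delta
        overall += delta
        col_med = np.median(residuals, axis=0)      # sweep out array effects
        residuals -= col_med[None, :]
        col_eff += col_med
        delta = np.median(row_eff)
        row_eff -= delta
        overall += delta
    return overall + col_eff                        # summarised expression value per array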

Batch-effect removal
As mentioned above, microarray experiments are highly sensitive to changes in the external conditions. In practice, this means that experiments that were broken up into batches due to a high sample count will incur systematic errors that differ between the batches. This can lead to artefacts that obscure the actual, biological signal when analysing the data using standard workflows [Lee+10]. To account for this effect, batch-effect removal techniques such as SVA [LS07], COMBAT [Che+11], or RUV2 [GS12] can be applied. Alternatively, it is possible to include batch information in addition to other, known, confounding factors into the design matrix of the experiment when computing scores for differential expression. This approach is chosen by the limma R package [Rit+15]. In all cases, the design of the experiment must account for batch-effect removal. This means that one or more samples of every phenotype should be measured in each batch. Otherwise, assuming a linear batch effect model, the phenotype and the confounding factor become linearly dependent, making it impossible to reconstruct the actual expression values.
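The design-matrix approach can be illustrated with a small per-gene linear model in which a batch indicator is added as a covariate (a plain least-squares sketch with made-up variable names, not the moderated statistics used by limma):

import numpy as np

def batch_adjusted_effect(y, phenotype, batch):
    """Fit expression ~ intercept + phenotype + batch for a single gene.

    y         : log expression values of one gene across all samples
    phenotype : 0/1 group labels (e.g. control vs. tumour)
    batch     : 0/1 batch labels (use one column per additional batch)
    Returns the phenotype coefficient, i.e. the group effect adjusted
    for the batch effect. If phenotype and batch are confounded, the
    design matrix becomes rank deficient and the effect cannot be
    estimated, mirroring the caveat described above.
    """
    X = np.column_stack([np.ones_like(y), phenotype, batch]).astype(float)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1]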

Differential Expression
Once the raw data has been pre-processed, it is possible to perform the actual analysis. Microarray data is usually assumed to be normally distributed, although other distributions such as the Laplace distribution can arise (cf. Section 4.3.1). Nevertheless, statistics such as the t-test are commonly applied and provide good results when computing differential expression [C+03]. The previously mentioned limma package uses ANOVA for this purpose. As the p-value computed by statistical tests makes no statement about the biological relevance of the detected effect, it is advisable to examine the effect size [RCH94]. An increasingly popular visualisation technique, which helps to quickly identify significant, biologically interesting genes, is the so-called volcano plot. In a volcano plot, the negative logarithm of the computed p-value is plotted against the log-fold change (e.g. Figure 5.12). We discuss methods for detecting differentially expressed genes in more detail in Section 4.2.2.
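The two quantities shown in a volcano plot can be computed per gene as in the following sketch (a plain Welch t-test on log2 expression values; multiple-testing correction and moderated variance estimates are omitted):

import numpy as np
from scipy import stats

def volcano_values(case, control):
    """Per-gene log-fold change and -log10 p-value for two
    (genes x samples) matrices of log2 expression values."""
    _, p = stats.ttest_ind(case, control, axis=1, equal_var=False)
    logfc = case.mean(axis=1) - control.mean(axis=1)   # difference of log2 means
    return logfc, -np.log10(p)                          # x and y coordinates of the volcano plot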

2 . 3 . 3 High-Throughput Sequencing

In this section, the basic principles behind high-throughput sequencing are introduced. (Although the technology has been available for a decade, it is still often referred to as next-generation sequencing (NGS).) As sequencing and the RNA-seq technology will only play a minor role in this thesis (cf. Section 5.3), we refer the interested reader to the literature for a more thorough treatment. The ability to sequence an organism's genome is an essential building block for understanding the differences between species or between individuals within a species.



Figure 2.10: Development of the costs for sequencing a complete genome as reported by the NHGRI. Due to the development of high-throughput sequencing techniques, sequencing costs have dropped exponentially since the publication of the first human genome [Wet15].

Using the classical Sanger sequencing procedure, relatively long reads can be determined. For example, the Sanger ABI 3730xl platform [GAT16] is able to determine the sequence of DNA fragments with a length of up to 1100 base pairs (bp) in a single experiment. This particular platform is also able to sequence 96 reads in parallel, yielding a throughput of roughly 1 Mbp per day. Based on a back-of-the-envelope calculation, sequencing the human genome, consisting of roughly 3 Gbp, with a tenfold coverage requires sequencing at least 30 Gbp or over 30 million reads. Accordingly, 100 machines would need 300 days for processing all reads. A substantial amount of money and time was thus required to sequence even a single genome, which made it infeasible to sequence a complete genome in routine research and medical settings, such as analysing cancer biopsies. This changed roughly a decade ago with the advent of high-throughput sequencing. Instead of sequencing a single DNA strand, high-throughput sequencing allows to analyse millions of reads in parallel. This is commonly achieved by creating micro reaction environments in each of which an independent sequencing run takes place. How these environments are created depends on the used technology. For example, the Roche 454 sequencers, among the earliest high-throughput sequencers, use small beads carrying a short adapter DNA on their surface (Figure 2.11). These beads are contained in small water droplets in a water-in-oil emulsion. The DNA reads are ligated with complementary adapter DNA and subsequently captured by the beads. Due to the used dilution, only one read is expected to hybridise with one bead. The water droplet then serves as a reaction environment for a PCR. The beads with the amplified DNA are then placed onto a micro reaction plate (PicoTiterPlate) where the actual sequencing takes place.


Figure 2.11: Principle of Roche 454 sequencing. Sample DNA is captured in a reaction microenvironment where it is amplified and attached to a bead. The beads are placed in a micro reaction plate such that exactly one bead occupies one well. The sequence of the sample DNA is then recovered using pyrosequencing. Adapted from: http://www.lifesequencing.com/pages/protocolo-de-secuenciacion?locale=en

Similarly, the Illumina HiSeq platform uses reaction plates on which clusters of cDNA strands are synthesised in etched nano-wells. While the Roche 454 platform employs pyrosequencing [Ahm+00], Illumina technology relies on fluorescent labels. The Roche 454 GS FLX Titanium XL+ system produces reads with an average length of 700 bp [45416], whereas the Illumina HiSeq X platform produces shorter reads of length 150 bp [Ill16]. Newer high-throughput sequencers such as the Oxford Nanopore MinION or PacBio RS II are able to achieve longer read lengths of roughly 1 kbp to 2 kbp and 10 kbp to 15 kbp, respectively [Lav+15; RCS13]. (A disadvantage of the MinION and PacBio technology are the relatively high error rates.) The result of a sequencing run is a list of reads with associated quality scores, usually stored in the FASTQ format. After optional error-correction steps (e.g. [Le+13; Heo+14; Med+11]), it is possible to perform a de novo assembly or align the captured reads to a reference genome. The reference genome based approach is computationally much less demanding and poses no problem to modern hardware [LD09; Dob+13]. However, especially the reliable detection of indels (insertions or deletions) and other genomic aberrations remains a challenge [CWE15]. Fortunately, due to the development of sequencers that produce far longer reads, it becomes more feasible to conduct de novo assembly for increasingly larger genomes [Li+10; Kor+12; CWE15].

Applications of High-Throughput Sequencing
The availability of fast, inexpensive sequencing technologies enabled various, innovative experimental designs beyond “merely” sequencing a genome. Using the bisulfite technique [Fro+92], which converts unmethylated cytosine into uracil, it is possible to determine the methylation state of the genome down to base resolution.

Chromatin immunoprecipitation in combination with sequencing (ChIP-seq) allows to determine the occupancy of DNA with binding proteins [Bar+07b; Par09]. For this, the proteins are cross-linked to the DNA. Then, using nucleases or sonication, the DNA is broken into fragments. After that, all proteins of interest are selected using specific antibodies. Due to the cross-linking, these proteins carry a DNA fragment with them. In the next step, the bound DNA is removed from the purified protein and subsequently sequenced. Aligning the obtained reads to a reference genome reveals occupied protein binding sites. This technique is routinely applied to analyse the binding sites of transcription factors [Val+08] and histone variants [Bar+07b]. Finally, just as mRNA/miRNA microarrays, RNA-seq allows to capture the complete transcriptome of a tissue sample. We will now discuss this particular application in more detail.

RNA-seq

For measuring expression using high-throughput sequencing, RNA is extracted from a sample and, depending on the used protocol, filtered for e.g. short RNA or mRNA [Mar+08a]. Subsequently, the purified RNA is transcribed back into cDNA, which is then sequenced [WGS09]. (Although providing a more direct measure of transcriptional activity than microarrays, RNA-seq does not (yet) report absolute transcript abundances.) By determining the number of sequencing reads that originated from a given transcript, the abundance of this transcript can be estimated. Traditionally, alignments against a reference genome or transcriptome are used to determine the transcript a read originated from. To this end, aligners should be used that are able to map reads across splice junctions, such as the STAR aligner [Dob+13]. Recently, alignment-free methods such as Sailfish [PMK14] and Kallisto [Bra+15] have been published. These approaches rely on k-mer based algorithms for creating “pseudo-alignments”. Salmon [PDK15] computes “lightweight” alignments by detecting approximate positions of a read in the reference. In contrast to the classical, alignment based workflows, these algorithms offer a much improved runtime behaviour with little to no degradation in accuracy. After determining a read-to-transcript assignment, the number of reads that map to a transcript is computed. To directly compare the read counts between two transcripts, they need to be normalised with respect to the transcript length. For this purpose, the fragments per kilobase of exon per million fragments mapped (FPKM) [Tra+10] and transcripts per million (TPM) [WKL12] statistics are used. Differentially expressed genes or transcripts are computed with specialised software such as DESeq2 [LHA14] or EdgeR [RMS10]. These packages take advantage of the discrete probability distribution of read counts to detect differentially expressed genes more reliably. Nevertheless, software originally developed for microarray analysis such as limma is used and produces good results [Rit+15].
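The length normalisation can be written down compactly (a sketch based on the usual definitions of the two statistics, independent of any particular tool):

import numpy as np

def tpm(counts, lengths_kb):
    """Transcripts per million from raw read counts and transcript lengths in kb."""
    rate = counts / lengths_kb          # reads per kilobase
    return rate / rate.sum() * 1e6      # scale so that all values sum to one million

def fpkm(counts, lengths_kb):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return counts / (counts.sum() / 1e6) / lengths_kb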


2 . 3 . 4 Comparison of Microarrays and RNA-seq

Microarrays have established themselves as a well understood, battle-proven experimental technology for capturing expression profiles. In recent years, the RNA-seq technology has been gaining popularity, though. A large advantage of RNA-seq over microarrays is that the raw data can easily be reanalysed for newer genome builds and annotations, as it is not tied to a particular probe design. As raw reads are available, it is also possible to call genomic variations in coding regions [Che+09]. In addition, due to the increasing read length of modern high-throughput sequencers, reliably resolving expressed isoforms becomes more and more feasible [Chh+15]. A further point in favour of RNA-seq is that the obtained read counts are a more direct representation of the actual RNA abundance than the indirect fluorescence signals obtained from microarrays. (Due to the required read amplification and the normalisation based on fragment length, an absolute quantification is difficult even with RNA-seq.) Yet, microarray platforms continue to be used due to well established protocols and analysis pipelines. Furthermore, the expression profiles generated using current microarray platforms remain competitive with RNA-seq based methods in terms of quality. Ultimately, however, RNA-seq is more versatile and can be expected to replace microarrays for most use cases [Zha+14; Man+14; Zha+15].

2.4 wilm's tumour data

Contributions
Most WT tissue samples were collected by the group of Prof. Manfred Gessler. Some samples as well as clinical data were provided by the group of Prof. Norbert Graf. The microarray assays were performed by Nicole Ludwig from the group of Prof. Eckart Meese. Normalisation of the raw expression values was performed by Patrick Trampert.

In Section 2.1.1 we introduced Wilm's tumour and some of its common subtypes. (Parts of this section have been adapted from Stöckel et al. [Stö+16].) We noted that tumours containing a high amount of blastemal cells after pre-operative chemotherapy, the so-called blastemal subtype, are among the most aggressive Wilm's tumour subtypes. To investigate why some tumours with a high blastemal content respond well to chemotherapy while others do not, we generated an expression dataset of Wilm's tumours that were treated with pre-operative chemotherapy according to the SIOP protocol (cf. [Isr+13; DPG14]).

Patient Samples
The dataset consists of 40 mRNA and 47 miRNA expression profiles from 47 tumour biopsies collected from 39 patients. These biopsies comprise four healthy tissue samples as well as 16 blastemal, nine mixed type, and 17 miscellaneous tumour samples that were labelled as described in Section 2.1.1. Before surgery, patients were treated using the SIOP standard regimen consisting of Actinomycin-D, Vincristine, and, in the case of metastases, Doxorubicin (cf. Figure 2.3).


Clinical details of the patients included in the analysis and an overview of the generated expression profiles are given in Appendix B. The research was approved by the local ethics committee (Ethikkommission der Ärztekammer des Saarlandes, No. 136/01; 09/16/2010).

RNA Isolation
Total RNA including miRNAs was isolated from tumour and control tissue using the miRNeasy Kit (Qiagen, Hilden, Germany) according to the manufacturer's instructions. The RNA concentration and integrity were assessed using a NanoDrop 2000 (Thermo Fisher Scientific, Waltham, MA, USA) and Bioanalyzer runs using the PicoRNA Chip (Agilent Technologies, Santa Clara, CA, USA).

miRNA Expression Profiles
miRNA expression profiles were measured using the SurePrint G3 8x60k miRNA microarray (miRBase v16, Cat. no. G4870A) according to the manufacturer's recommendations. Background-corrected, log-transformed expression values were extracted using the Agilent Feature Extraction Software and normalised using quantile normalisation. In total, 47 miRNA expression profiles were generated.

mRNA Expression Profiles
All mRNA expression profiles were measured using SurePrint G3 Human Gene Expression 8x60K v2 microarrays (Cat. no. G4851B, Agilent Technologies, Santa Clara, CA, USA) according to the manufacturer's recommendations. Expression values were extracted, background-corrected, and log-transformed using the Agilent Feature Extraction Software and normalised using the GeneSpring software [CSK01]. In total, 40 mRNA expression profiles were generated.


3 BIOLOGICAL NETWORKS

The whole is more than the sum of its parts. — Aristotle, Metaphysica

Cells are highly complex molecular machines, each of which is composed of billions of biological entities such as genes, proteins, RNAs, and metabolites. These entities fulfil their tasks through interaction. Examples are proteins binding metabolites to catalyse a metabolic reaction or two proteins forming a complex. When a complex is formed, it again can participate in further interactions. Similarly, miRNAs can bind to mRNAs in order to make the translation process less efficient or to inhibit it completely (Section 2.3.1). On the genome level, enhancers and genes can associate with transcription factors. This can lead to the initiation or repression of expression or even to the remodelling of the chromatin structure. Which molecules are able to interact is determined by the chemical and structural properties of the putative interaction partners. (Interactions are typically stochastic: even a highly specific binding interface still allows for unspecific binding, albeit with low probability [VB86].) In practice, the physical location of the molecules inside the cell also plays an important role. For example, a molecule inside the nucleus will not react with another molecule in the cytoplasm, as they are separated by the nuclear membrane. Moreover, not every gene is expressed in every cell and hence not every protein is available everywhere. Consequently, reactions that are possible in one cell type need not be possible in another cell type.

Viewed together, the interactions in a cell form a network in which the entities are connected with edges representing interactions. Considering that there exist approximately 20,000 human genes encoding roughly 80,000 proteins, the number of possible interactions already ranges in the billions [Har+06; Har+12]. (The number of protein coding genes and their transcripts continues to vary; the reported numbers were taken from GENCODE 24.) This completely disregards the vast amount of RNA and metabolites that are present at any point in time. Even when excluding interactions that are impossible due to chemical considerations or tissue specificity, it is clear that charting these networks is a daunting task. In a valiant effort towards reaching this goal, interaction databases have been created with which researchers attempt to catalogue all known biological entities and their interactions in a structured and well-defined fashion. As opposed to the unstructured representation of knowledge found in the literature, the availability of databases enables researchers to efficiently search the stored data. Furthermore, it enables the use of bioinformatics analysis pipelines, which is imperative for the ability to deal with genome-scale datasets. This development led to the emergence of systems biology. In contrast to the classical, bottom-up approach of biology [Laz02], in which parts of a biological system are studied in isolation, systems biology attempts to detect general patterns on a global level [Kir05; Kit02]. This means that instead of studying biological processes in minute detail, systems biology is dedicated to discovering general architectures and building blocks that nature repeatedly uses to create complex biological systems.

To achieve this, systems biology relies on computational methods and machine learning techniques for extracting the information stored in biological databases. In this chapter, we will take a look at various methods that assist researchers in reasoning about biological systems. In each section, we give a short discussion of the respective state of the art. In addition, we present algorithms that exploit the knowledge stored in interaction databases to improve the analysis of experimental data. For this, we first introduce some general terminology. Next, an overview of the available types of databases is provided. After this, we take a look at how Bayesian networks (BNs), and especially causal Bayesian networks, can be employed to test hypotheses about cellular processes. Based on this, we introduce our tool CausalTrail [Sch15a; Stö+15] which, to the best of our knowledge, is the first tool that allows researchers to evaluate causal queries on a causal Bayesian network structure. Afterwards, we give a short introduction to Gaussian graphical models (GGMs). While we will not use GGMs to infer networks, Section 4.4 demonstrates how to use the underlying theory to construct an “enrichment algorithm” that corrects for the correlation structure in biological categories. While the previously mentioned approaches model network structures from a probabilistic point of view, purely graph-theoretical algorithms also exist. As an example, we present an ILP approach for finding deregulated subgraphs in regulatory networks [Bac+12]. In contrast to many other approaches, our algorithm is exact and highly efficient. While algorithms for detecting deregulated subgraphs are by design influenced by the provided topology, it is unclear how large its impact on the computed results is. To this end, we perform an evaluation demonstrating how the network topology and input scores affect the produced solutions.

3.1 graph theory

Before we begin with the explanation of the methods and concepts presented in this chapter, we first introduce some basic terminology from graph theory. In mathematics, graphs describe binary relations (edges) between a set of objects called nodes or vertices. Thus, graphs lend themselves to modelling biological networks. Formally, a graph is defined as follows:

Definition 1 (Graph). Let V be an arbitrary, finite set of nodes or vertices and E ⊆ V × V the set of edges. We call the pair of node and edge set a graph and write G(V,E). G is undirected, if for every edge (u, v) ∈ E also (v, u) ∈ E holds.

A graph G is called simple, if no edge (v, v) ∈ E exists. Unless mentioned otherwise, we assume that all graphs mentioned in this thesis are simple. An often encountered attribute of a node is its degree:


Definition 2 (In- and out-degree). Given a graph G = (V,E), the in-degree dᵥ⁻ of a node v is the number of edges that end in v. Conversely, the out-degree dᵥ⁺ of v is the number of edges starting in v. Another important property of a node is whether it is possible to reach it from another node. To give a proper definition of this notion, we first introduce the concept of a path.

Definition 3 (Path). Let G(V,E) be a graph. If the sequence p = ⟨v₁, v₂, . . . , vₖ⟩ of vertices with k ∈ ℕ, k ≥ 2 fulfils (vᵢ, vᵢ₊₁) ∈ E for all i < k, it is called a path of length k.

If for two vertices u, v ∈ V there exists a path starting in u and ending in v, we say that v is reachable from u. The set of vertices that is reachable from u is called the descendants of u, whereas all vertices that can reach v are called its ancestors. Descendants and ancestors that are only one edge away are called the children and parents of a node, respectively. A graph is called strongly connected if every node v can be reached from every other node u ≠ v. With Ẽ = E ∪ {(v, u) | (u, v) ∈ E} we denote the symmetric closure of the edge set. G is called connected, if H(V, Ẽ) is strongly connected. A subgraph H ⊂ G is a graph H(V′, E′) with V′ ⊂ V and E′ ⊂ E. A connected component of G is a maximal, connected subgraph. In some graphs, it is possible to find paths that start from and end in the same vertex v.

Definition 4 (Cycle and circle). A path C = ⟨v₁, v₂, . . . , vₖ, v₁⟩ is called a cycle. If no vertex, except v₁, appears twice in C, we call C a circle. If a graph does not contain any cycle, it is called acyclic.

In some cases, we refer to edge or node labels. Commonly, these can be thought of as functions mapping the nodes and edges to a value.

Definition 5 (Node and edge labels). Let G(V,E) be a graph. A node label for G is a function w : V → X, where X is an arbitrary set. Similarly, an edge label is a function l : E → X.

We often refer to vertices using an index i ∈ ℕ. For convenience, we use the notation wᵢ := w(vᵢ) for referring to the node label of vᵢ ∈ V. Examples for node labels are scores for differential expression that are attached to each node. We call such labels node weights or simply scores. In a regulatory network it is common to label each edge with the type of the interaction it describes.
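To make the terminology concrete, the following Python sketch (purely illustrative; class and function names are our own) stores a directed graph as adjacency lists and derives in-/out-degrees, descendants, and an acyclicity check from the definitions above:

from collections import defaultdict, deque

class Graph:
    """Minimal directed graph with optional node weights."""
    def __init__(self, edges, weights=None):
        self.succ = defaultdict(set)   # children (outgoing edges)
        self.pred = defaultdict(set)   # parents (incoming edges)
        self.nodes = set()
        for u, v in edges:
            self.succ[u].add(v)
            self.pred[v].add(u)
            self.nodes.update((u, v))
        self.weights = weights or {}   # node labels, e.g. differential expression scores

    def in_degree(self, v):
        return len(self.pred[v])

    def out_degree(self, v):
        return len(self.succ[v])

    def descendants(self, u):
        """All nodes reachable from u, found via breadth-first search."""
        seen, queue = set(), deque([u])
        while queue:
            for w in self.succ[queue.popleft()]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        return seen

    def is_acyclic(self):
        """A graph is acyclic iff no node can reach itself."""
        return all(v not in self.descendants(v) for v in self.nodes)

# Toy regulatory network: TP53 regulates CDKN1A and MDM2; MDM2 in turn targets TP53.
g = Graph([("TP53", "CDKN1A"), ("TP53", "MDM2"), ("MDM2", "TP53")],
          weights={"TP53": 2.3, "MDM2": -0.8, "CDKN1A": 1.1})
assert g.in_degree("TP53") == 1 and not g.is_acyclic()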


3.2 types of biological networks

The sheer number of biological interactions makes it difficult to completely catalogue them. To simplify this task, repositories that specialise in storing one or a selected few interaction types have been created. (Specialising on one network type also makes modelling simpler and allows curators to focus on their area of expertise.) We call the different kinds of networks that are stored in these databases network types. Various factors have led to the creation of a substantial number of interaction databases for each network type. For example, the sources from which interaction data is derived can differ substantially. While some databases rely on manually curated data obtained from the literature, others use text-mining approaches or prediction tools. Again, others only consider interactions confirmed by reliable, highly specific experiments. Finally, some databases also consider evidence from high-throughput experiments as sufficient. Here, we shortly discuss the most common network types: protein-protein interaction, metabolic, and regulatory networks. For each network type, a list of representative databases is provided. In addition, we introduce methods to infer networks directly from high-throughput data. Of these methods, we only discuss co-expression networks in this section. Bayesian networks and Gaussian graphical models are introduced separately in Section 3.3 and Section 3.4, respectively.

3 . 2 . 1 Network Types

Commonly, research has focused on the following three network types: protein-protein interaction, metabolic, and regulatory networks. We briefly discuss each of these network types and list databases from which they can be obtained. However, it should be noted that no single, clear definition of these network types exists and thus some databases contain interactions that stem from multiple types.

Protein-Protein Interaction (PPI) Networks
A protein-protein interaction (PPI) network is a, usually undirected, graph in which nodes represent proteins and edges indicate that two proteins interact (Figure 3.1). PPIs can be determined in a high-throughput fashion using experimental setups such as yeast-two-hybrid [Uet+00; FS89] or affinity purification screens [Rig+99; Tin11; BVR04]. However, several lower throughput, but higher confidence methods for detecting PPIs exist, including co-immunoprecipitation [Sal+02; Bar+07a] or Far-Western Blotting [WLC07]. Instances of PPI network databases are DIP [Sal+04], HPRD [Pra+09], and MINT [Lic+12], which store manually curated PPI data. DIP provides interactions for a wide range of organisms, but is limited to a small set of proteins and high-confidence interactions. MINT also provides entries for a large number of organisms, but is less conservative than DIP. HPRD focuses on human proteins alone. However, with more than 41,000 PPIs it is far more comprehensive than DIP and MINT. STRING [Szk+14] is a metadatabase that incorporates knowledge from primary databases such as DIP and MINT as well as the results of text-mining and prediction tools (cf. Section 3.2.2). Consequently, STRING contains a large number of PPIs that may, however, be of low confidence.



Figure 3.1: Part of the STRING v10 [Szk+14] protein-protein interaction network for human. Coloured edges indicate the types of evidence supporting the interactions of the incident nodes. This network depicts direct interaction partners of TP53, a transcription factor that plays an important role in apoptosis. Proteins such as ATM and MDM2 are known activators and repressors of TP53, respectively.

Metabolic Networks
Metabolic networks describe the chemical reactions that take place in an organism. As most biological reactions involving small molecules are catalysed by enzymes, metabolic networks are commonly depicted as directed, bipartite graphs, in which the first node set consists of metabolites and the second node set consists of enzymes. Alternatively, metabolic networks can be represented as directed graphs, where nodes are metabolites and edges connect educts with products. In this case, the edges are labelled with the enzyme or the class of enzymes that is able to catalyse the reaction. Examples for databases storing metabolic information are KEGG [KG00], Reactome [Jos+05], and MetaCyc [Cas+08]. An example for a metabolic (sub)network can be seen in Figure 3.2.

Regulatory Networks
In contrast to the network types described above, regulatory networks are less well-defined. In general, a regulatory network comprises interactions that regulate biological processes. Examples for regulatory interactions are transcription factors binding to DNA, thereby regulating gene expression, or the activity of kinases which (de)activate other proteins via phosphorylation. While the interactions in a regulatory network are usually directed, sometimes protein-protein binding events are contained in the network if the resulting complex has a regulatory function.


Figure 3.2: KEGG [KG00] reference pathway for the Citrate cycle (TCA cycle). Nodes represent metabolites and edges represent conversions between metabolites. Edge labels are enzyme commission numbers [Web+92] that represent the set of enzymes that is able to catalyse this reaction. Dashed edges represent links to and from other metabolic pathways.

This results in heterogeneous graphs with multiple edge types. When interpreting such a network, each of these edge types needs to be treated differently. Regulatory networks usually contain interactions that have been obtained by curating the literature. Most databases organise their data into subnetworks or regulatory pathways that represent well-defined signalling cascades or biological processes. Databases that are organised in this fashion are KEGG [KG00], Reactome [Jos+05], WikiPathways [Kel+12], and BioCarta [Nis01]. (KEGG and Reactome contain both metabolic and regulatory networks.) In the case of KEGG, BioCarta, or WikiPathways, the pathways are stored isolated from each other. To create a complete regulatory network, they need to be stitched together by the user. An example regulatory network is given in Figure 3.3.

3.2.2 Network Inference

In the discussion above, we already touched on some of the methodology used for constructing biological networks. Most databases are either manually curated, meaning that interactions are only inserted into the database after being reviewed by an editor, or employ text mining algorithms, which automatically extract interactions from the literature. (The STRING database also incorporates interactions obtained via prediction algorithms.) In both cases, the interaction information is usually based on specialised assays that provide evidence for an interaction. To exploit available high-throughput datasets, models using supervised or unsupervised techniques from statistics and machine learning, which can extract networks from expression profiles, have been developed [YVK04; MV08; DLS00]. In this chapter, we discuss the theory behind three types of models that are commonly used for network inference: co-expression networks, Bayesian networks (BNs), and Gaussian graphical models (GGMs). We will only briefly touch on co-expression networks, as they are closely related to, but not identical with, GGMs. We will provide more details on BNs and GGMs, which are based on a substantial corpus of theoretical work. Both network types are graphical representations of probability distributions. Due to this, they are commonly referred to as probabilistic graphical models or simply graphical models. As they are based on an explicit mathematical model, each interaction (or absence thereof) has a well-defined interpretation.



Figure 3.3: WNT pathway obtained from WikiPathways [Kel+12; Kan+10a]. Rectangular nodes represent proteins and mRNA, whereas edges represent regulatory interactions.

Co-expression Networks
Co-expression networks capture associations between genes or, to be more precise, between the expression patterns of genes. They can be interpreted as an approximation to regulatory networks, in the sense that two genes being co-expressed can be an indication for the presence of a regulatory mechanism. The associations are commonly inferred from


microarray or RNA-seq expression datasets (cf. Section 2.3) by computing the Pearson or Spearman (cf. Section 3.5.3 and Appendix A.1.2) correlation between the measured genes. In theory, co-expression networks can be created from any expression dataset. In practice, however, a large number of samples is required to reduce the noise inherent to expression data. COXPRESdb [Oka+14] focuses on co-expression networks for common model organisms. Most networks therein are derived from microarray expression data obtained from public repositories. In contrast, the GeneFriends [DCM15] database is built upon expression profiles measured using the Illumina HiSeq 2000 platform. Other databases such as BAR [Tou+05] or GeneCat [Mut+08] are dedicated to plants.
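To make the construction concrete, the following Python sketch builds a simple co-expression network by thresholding pairwise Spearman correlations. The expression matrix, gene names, and cut-off are hypothetical and not taken from any of the databases above.

import numpy as np
from scipy.stats import spearmanr

# Hypothetical expression matrix: rows are genes, columns are samples.
rng = np.random.default_rng(0)
expr = rng.normal(size=(5, 40))
genes = ["G1", "G2", "G3", "G4", "G5"]

# Pairwise Spearman correlations between all genes (axis=1: rows are variables).
rho, _ = spearmanr(expr, axis=1)

# Connect two genes if their absolute correlation exceeds a chosen cut-off.
cutoff = 0.8
edges = [(genes[i], genes[j], rho[i, j])
         for i in range(len(genes))
         for j in range(i + 1, len(genes))
         if abs(rho[i, j]) >= cutoff]
print(edges)

In practice, the cut-off would be chosen based on the data or replaced by a significance test, and far more samples than in this toy example are needed to obtain stable estimates.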

3.3 causal bayesian networks

Often, regulatory networks are constructed from interactions identified in the literature. Examining the network structure gives insight into the cellular processes in which the nodes of the network take part. To make a quantifiable prediction about the behaviour of the system, the representation purely as a graph is, however, insufficient. For this, the network needs to be adorned with a mechanistic or probabilistic interpretation. One such class of interpretable networks are Bayesian networks (BNs). (A more thorough introduction to (C)BNs can be found in Pearl [Pea09] or Koller and Friedman [KF09].) Bayesian networks are directed, acyclic graphs (DAGs). Each node represents a random variable and each edge encodes a dependence of the target on the source. Due to this structure, BNs are well suited to model hierarchical dependencies and are especially popular for modelling signalling cascades. A small example is given in Figure 3.4. Outside of computational biology, BNs are often employed for the construction of expert systems [SC92] in medicine [Die+97], epidemiology [Won+03], psychology [GT07], and economics [BA00; CVY07]. Let us start by giving a more formal definition of a BN. For this, we need to define when two variables are conditionally independent.

Definition 6 (Conditional Independence [Pea09]). Let X,Y,Z be sets of random variables. We call X and Z conditionally independent given Y if Pr(X | Y,Z) = Pr(X | Y )

holds whenever Pr(Y,Z) > 0.

Using the notion of conditional independence, it is now possible to give a definition of a BN.

Definition 7 (Bayesian Network [KF09]). A Bayesian network is a directed acyclic graph G(V,E), where each node v ∈ V represents a random variable. Let pa_v := {w | (w, v) ∈ E} denote the parents of v. For every node u ∈ V which is not a descendant of v, it holds that u and v are conditionally independent given pa_v (Markov property).



Figure 3.4: The student network is an example of a small BN. Each of the nodes represents a binary variable. Whether a student earns a good grade (G) depends on whether the course was difficult (D) and whether the student possesses a low or a high intelligence (I). Based on the grade, a lecturer may be more or less inclined to write a letter of recommendation (L). Whether the student passes the SAT test (S) only depends on the student's intelligence.

Given a Bayesian network for a probability distribution, we can write the joint probability density as a product of conditional probabilities:

Pr(X_1, ..., X_n) = ∏_{k=1}^{n} Pr(X_k | pa_{X_k})

Often, an edge from parent to child in a BN is assumed to model a causal relationship. Yet, this is not true in general. To see why, let us take a look at the chain rule that can be used to represent a joint probability distribution as a product of conditional distributions:

Pr(X_1, ..., X_n) = ∏_{k=1}^{n} Pr(X_k | X_1, ..., X_{k-1})

(To prove this, use Pr(X | Y) = Pr(X, Y) / Pr(Y) and collapse the resulting telescoping product.)

Clearly, this factorisation gives us a simple way to represent every probability distribution as a BN. Each factor in the product represents a child (X_k) to parent(s) (X_1, ..., X_{k-1}) relation. However, this decomposition is not unique, as it solely depends on the ordering of the variables [Pea09]. As a consequence, we obtain different graph structures, depending on the ordering of the variables. In fact, it is possible to generate cases in which node X_i is a parent of X_j and cases where the opposite is true. If a causal relationship between the two variables exists, only one of the directions encodes this relationship faithfully. Hence, the topological order in a BN cannot be assumed to model a causal relationship.
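A minimal two-variable case makes this concrete: both orderings of the chain rule are valid factorisations of the same distribution, yet they induce opposite edge directions.

Pr(A, B) = Pr(A) Pr(B | A)   (edge A → B)
Pr(A, B) = Pr(B) Pr(A | B)   (edge B → A)

Since both factorisations encode exactly the same joint distribution, the edge direction by itself carries no causal information.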


When does an edge in a BN stand for an actual, causal dependency? To answer this question, it is important to understand the concept of interventions. By observing a system, e.g. by measuring the expression of all genes in a cell, for a sufficient amount of time, we are able to deduce patterns in its behaviour. An example of such a pattern would be: "if node A is in state a1, then node B is in state b2". However, these observations are only sufficient to conclude that the states of the two nodes are correlated. It is uncertain whether node B is in state b2 because node A is in state a1 or vice versa. It is also possible that the states of the two nodes causally depend on a third variable and neither direction represents a valid causal relationship. To determine which of the three possibilities is the case, the experimental setup needs to be changed. Instead of passively observing the system, an external perturbation must be introduced. For example, if the state of node A is artificially fixed at state a1, we can observe the state of node B. If B remains in state b2 regardless of the other variables, this indicates that the state of A is causal for the state of B. An example of such perturbations in a biological setting is the artificial knock-out or overexpression of a gene. Pearl [Pea09] realised that the concept of interventions is not only helpful for detecting causal relationships, but is also fundamental for giving a sound definition of causality. Consider the factorisation of the joint probability induced by the student network (Figure 3.4):

Pr(D, I, G, S, L) = Pr(L | G) Pr(G | D,I) Pr(S | I) Pr(D) Pr(I)

If we want to determine the probability of getting a letter in the case that we know the grade is good, we set the value of G to “good” and remove the factor Pr(G | D,I) from the factorisation:

Pr_{G=good}(D, I, S, L) = Pr(L | G = good) Pr(S | I) Pr(D) Pr(I)

We call Pr_{G=good} an interventional distribution. This allows us to define a causal Bayesian network (CBN).

Definition 8 (Causal Bayesian network [Pea09]). A causal Bayesian network (CBN) for a distribution P is a Bayesian network which is consistent with all possible interventional distributions P_{X=x}. Here, consistent means that the network encodes a valid factorisation of P_{X=x} after the node corresponding to X and all edges going into X have been deleted.

Now we can provide an answer to our initial question: an edge in a BN can only be interpreted as causal if the BN is, in fact, a CBN. This definition gives a natural way to examine the effect of external influences on the observed system. (Due to the lack of measurements of exogenous influences, a CBN cannot always be uniquely determined; the resulting set of equivalent CBNs is called a causal structure.) Given a CBN, the effect of, e.g., a gene knock-out can be modelled by setting the state of the node representing the target gene to unexpressed and deleting all edges from the parent nodes. As the CBN is consistent with the corresponding interventional distribution, we can now query the network for probabilities using standard algorithms. This approach was formalised by Pearl [Pea95a] as the so-called do-calculus. It allows testing hypotheses on how external changes (interventions), such as our gene knockout, affect a system's behaviour. Using the do-calculus, it is also possible to answer counterfactual questions such as: "Would the patient have recovered when given drug B, knowing that he did not recover given drug A?" or "Would the plane have crashed if it had been inspected before take-off, given that it had not been inspected before?". As such, counterfactuals allow decisions to be re-evaluated retrospectively in the light of new evidence. Due to the above properties, BNs and CBNs are popular tools in bioinformatics and have been applied in many scenarios [M+99; Fri+00; Hus03; Sac+05].


Accordingly, a considerable amount of software has been written to train and work with BNs. In general, the available tools solve problems from one of three classes: topology determination, parameter training, and reasoning. Naturally, before being able to use a BN, its topology must be determined. Unfortunately, even the inference of an approximate network topology has been shown to be NP-complete [DL93]. To work around this limitation, a wide variety of heuristic algorithms, ranging from greedy to Monte-Carlo approaches, has been conceived. Examples of such tools are, among many others, BANJO [Smi+06], BNFinder2 [Doj+13], or SMILE [Dru99]. (A comprehensive list of software for BNs is maintained by Kevin Murphy at https://www.cs.ubc.ca/~murphyk/Software/bnsoft.html.) CBNs can be determined using variants of the inductive causation (IC) algorithm. In practice, the most commonly used implementation of the IC algorithm is the PC algorithm [SGS00] as implemented in the pcalg R package [Kal+12]. Once the topology of the BN has been determined, the parameters of the network need to be derived from data. As the parameters simply correspond to the conditional probabilities of a node given its parents, this can be accomplished using trivial counting statistics in the case of discrete data. For scenarios where only few samples are available, regularisation techniques such as Laplace/Lidstone smoothing [CG96] should be utilised. In the case of missing data, an efficient Expectation-Maximisation (EM) [DLR77] scheme can be applied [KF09]. While many tools that infer a topology also compute conditional probability tables (CPTs) as they go, approaches based on the IC algorithm require an additional estimation step. Furthermore, if new data becomes available, retraining the parameters may be preferable to recomputing the network topology. After parameter training, the model can be subjected to Bayesian reasoning. Under Bayesian reasoning we understand the computation of conditional probabilities given a network structure and trained parameters. In this framework, queries such as "In which state is node B, if node A is in state a3?" or "What is the probability of node C being in state c1, if node A and node B are in states a1 and b1, respectively?" can be answered. If the BN under investigation is a CBN, it is also possible to evaluate interventional and counterfactual queries as explained above. We call this causal reasoning. (Note that evaluating interventions or counterfactuals is possible for every BN topology, but not necessarily meaningful.) While a large selection of programs for performing Bayesian reasoning, such as SMILE [Dru99] or the Bayes Net Toolbox [Mur+01], exists, hardly any tool supports causal reasoning. To fill this gap, we developed CausalTrail [Sch15a; Stö+15], a tool for performing causal reasoning using the do-calculus. In the remainder of this section, we give insights into CausalTrail's implementation and give examples of possible application scenarios.
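Before turning to CausalTrail itself, the following Python sketch illustrates what the counting-based estimation with Laplace/Lidstone smoothing mentioned above boils down to for discrete data. The function and variable names are illustrative and do not mirror any particular tool's implementation.

import numpy as np

def estimate_cpt(child, parents, data, n_states, alpha=1.0):
    # data: dict mapping variable name -> array of integer states per sample
    # n_states: dict mapping variable name -> number of states
    # alpha: Laplace/Lidstone pseudo-count (alpha = 0 gives the ML estimate)
    shape = tuple(n_states[p] for p in parents) + (n_states[child],)
    counts = np.full(shape, alpha, dtype=float)
    for i in range(len(data[child])):
        idx = tuple(data[p][i] for p in parents) + (data[child][i],)
        counts[idx] += 1
    # Normalise over the child dimension to obtain Pr(child | parents).
    return counts / counts.sum(axis=-1, keepdims=True)

# Toy example: binary parent A, ternary child B.
rng = np.random.default_rng(1)
data = {"A": rng.integers(0, 2, 100), "B": rng.integers(0, 3, 100)}
cpt = estimate_cpt("B", ["A"], data, {"A": 2, "B": 3})
print(cpt)  # shape (2, 3); each row sums to one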

3.3.1 CausalTrail

Contributions
The original implementation of CausalTrail was written by Florian Schmidt [Sch15a] in the context of his Master's thesis supervised by me. The publication on CausalTrail [Stö+15] was written by me with substantial feedback by Florian Schmidt and Hans-Peter Lenhof. Improvements to the initial code were contributed by me and Florian Schmidt.



Figure 3.5: The flow of information in CausalTrail. The user provides input data as well as a predetermined topology. After a discretisation step, the input data is used to determine parameters for the topology. This yields the conditional probability parameters that fully describe the joint probability distribution. Using the query language, Bayesian and causal reasoning can be performed.


Unlike other tools for working with BN structures, CausalTrail does not attempt to infer the topology of a BN. Instead, the topology as well as the data from which parameters should be learned are specified by the user. For this, software such as the pcalg [Kal+12] package can be used. Afterwards, CausalTrail allows the user to perform Bayesian and causal reasoning using a flexible query language. To our knowledge, CausalTrail is the only freely available software that allows performing causal reasoning, in addition to Bayesian reasoning, on CBNs. The only other tool we are aware of that also supports this is the commercial BayesiaLab software1.

Workflow
A typical workflow is structured as follows (Figure 3.5): the user specifies a network topology, which can be provided in the simple interaction format (SIF) or the trivial graph format (TGF) (see below). Next, a dataset containing measurements for each node in the network needs to be supplied. Here, it is important to note that the software is able to cope with missing values. This makes CausalTrail especially suited for the analysis of multi-omics datasets, where some measurements may be missing or must be discarded due to quality issues. To simplify the parameter learning process, CausalTrail currently requires that each variable in the network only assumes discrete values. Thus, if

1 http://www.bayesialab.com/

the provided data contains continuous variables, the user must select one of several implemented discretisation methods for each such variable. After this step, parameters are trained. If the dataset does not contain missing values, a maximum likelihood estimate is computed. Otherwise, the EM algorithm as described by Koller and Friedman [KF09] is used to provide robust estimates. Once the model has been trained, the user can perform reasoning using the provided query language (see below). When using the GUI, queries can be created interactively or entered manually (Section 3.3.1). In this case, the validity of the query is checked at runtime. We will now take a closer look at the implementation of CausalTrail. Further details can be found in the Master's thesis by Schmidt [Sch15a].

File Formats

For specifying topologies, the SIF and TGF file formats are supported. (The SIF and NA file format specifications are available at http://wiki.cytoscape.org/Cytoscape_User_Manual/Network_Formats.) The SIF file format is a simple, line-based format. Each line contains one or more edges starting at a single node. Edges are encoded by a single node id followed by an interaction type, followed by multiple node ids:

1 -> 2 3

The interaction type is ignored by CausalTrail. For specifying interpretable node names, the node attribute (NA) file format is used. Each NA file starts with a header stating the name of the stored attribute and the type of the attribute as a Java class name:

NodeName (class=java.lang.String) 1 = NodeA 2 = NodeB 3 = NodeC

The TGF file format allows storing node names and edges in a single file. The file starts with a newline-separated list of node ids and their names. After the last node, a # character is inserted. Afterwards, a list of edges follows:

1 NodeA 2 NodeB 3 NodeC # 1 2 1 3

For specifying measurements, CausalTrail expects a matrix. Each row in the matrix corresponds to the name of a node in the network, and each column corresponds to a sample. Missing values can be specified using the string "na".


{ "Grade": { "method": "bracketmedians", "buckets": "2" }, "Letter": { "method": "threshold", "threshold": "2.0" }, "SAT": { "method": "pearsontukey" }, "Intelligence": { "method": "median" }, "Difficulty": { "method": "none" } } Listing 3.1: Example JSON file containing discretisation settings for each variable of the student network.

Discretisation Methods

The ceiling, floor, and round methods discretise the inputs to the nearest integers. In contrast, thresholding-based methods like the arithmetic or harmonic mean, median, z-score, and fixed threshold methods create binary output. The bracket medians and Pearson-Tukey [CCB11] procedures yield three or more output classes. Discretisation methods can be directly specified using the GUI or via a JSON-formatted input file (Listing 3.1).
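As a rough sketch of two of these strategies, median thresholding and quantile-based binning in the spirit of the bracket medians procedure (this is an illustration only, not CausalTrail's actual C++ code):

import numpy as np

def discretise_median(values):
    # Binary discretisation: 1 if the value lies above the median, 0 otherwise.
    return (values > np.median(values)).astype(int)

def discretise_quantiles(values, buckets=3):
    # Quantile-based binning into 'buckets' roughly equally populated classes.
    boundaries = np.quantile(values, np.linspace(0, 1, buckets + 1)[1:-1])
    return np.digitize(values, boundaries)

x = np.array([0.3, 1.7, 3.0, 1.9, 0.1, 2.4])
print(discretise_median(x))     # [0 0 1 1 0 1]
print(discretise_quantiles(x))  # classes 0, 1, and 2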

Parameter Learning

In the case of a dataset without missing values, parameter learning amounts to counting the frequency with which each combination of parent and child values occurs. If the data contains missing values, the EM procedure described by Koller and Friedman [KF09] is used. The idea behind this algorithm is to estimate the most likely state of a miss- ing value given the current parameters. Based on this imputation, a new set of parameters is computed as if no missing values were present. The procedure is iterated until either the parameters converged to a fi- nal value or a fixed number of iterations has been reached (Figure 3.6). As the EM algorithm is a greedy procedure [Edm71; DLR77], the com- puted parameters are not necessarily the globally optimal parameters. To increase the chances of reaching the global optimum, the algorithm is restarted multiple times using different initialisation schemes such as random sampling.
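To illustrate the impute/re-estimate loop of Figure 3.6, the following sketch runs EM for the smallest interesting case, a two-node network A → B with binary variables and missing observations of A. CausalTrail's actual implementation follows Koller and Friedman [KF09] and is considerably more general; everything below (names, data, initialisation) is illustrative.

import numpy as np

def em_parameters(a, b, iters=100, tol=1e-6):
    # a: observations of A with np.nan for missing values; b: complete observations of B
    p_a, p_b = 0.5, np.array([0.5, 0.5])                # initial parameters
    for _ in range(iters):
        # E-step: posterior Pr(A=1 | B) for samples where A is missing.
        lik1 = np.where(b == 1, p_b[1], 1 - p_b[1])
        lik0 = np.where(b == 1, p_b[0], 1 - p_b[0])
        post = np.where(np.isnan(a),
                        p_a * lik1 / (p_a * lik1 + (1 - p_a) * lik0),
                        a)
        # M-step: re-estimate the parameters from the expected counts.
        new_p_a = post.mean()
        new_p_b = np.array([((1 - post) * b).sum() / (1 - post).sum(),   # Pr(B=1 | A=0)
                            (post * b).sum() / post.sum()])              # Pr(B=1 | A=1)
        if abs(new_p_a - p_a) + np.abs(new_p_b - p_b).sum() < tol:       # converged?
            break
        p_a, p_b = new_p_a, new_p_b
    return p_a, p_b

rng = np.random.default_rng(2)
a_true = rng.integers(0, 2, 500).astype(float)
b_obs = (rng.random(500) < np.where(a_true == 1, 0.9, 0.2)).astype(int)
a_obs = a_true.copy()
a_obs[rng.random(500) < 0.3] = np.nan                   # 30% of A is missing
print(em_parameters(a_obs, b_obs))                      # close to (0.5, [0.2, 0.9])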

Bayesian and Causal Reasoning

The basic tool for performing Bayesian reasoning is the ability to compute conditional probabilities of the form Pr(X = x | Z = z), where X and Z stand for disjoint sets of variables.



Figure 3.6: The flow of the basic steps of the parameter learning. First, a set of initial parameters is guessed. These parameters are used to impute the missing values. New parameters are then estimated from the imputed data. If the parameters only changed slightly compared to the previous iteration, the algorithm terminates.

To compute such a probability, the values of X and Z are kept fixed while all remaining variables Y ∉ X ∪ Z are summed out:

Pr(X = x | Z = z) = ∑_y Pr(X = x, Y = y | Z = z)

This process is called marginalisation. Unfortunately, straightforward implementations of this algorithm are not useful in practice. This is due to the fact that for summing out all variables in Y, all possible combinations of their values need to be considered. Even for small examples this quickly leads to a combinatorial explosion of terms inside the sum and hence to exponential runtime. However, in the case of a BN, the factorisation of the joint probability can be exploited to perform the marginalisation more efficiently. The basic idea is that factors which do not have variables in common can be marginalised successively and independently:

Pr(X = x | Z = z) = ∑_{y,w} Pr(x | y) Pr(z | w) = ∑_y Pr(x | y) ∑_w Pr(z | w)

This approach to computing conditional probabilities is called the variable elimination algorithm. While, in theory, it is possible to create factorisations for which the runtime remains exponential in the number of variables, in practice variable elimination greatly improves the performance of computing conditional probabilities. The algorithm is discussed in detail in Koller and Friedman [KF09]. Using this basic functionality, it is also possible to compute the most likely state of a variable, by simply evaluating all possible value assignments. The implementation of fixed-value interventions is straightforward. For each node that should be fixed, the value of the node is set to the interventional value. Then all edges from the parents to the fixed node are removed. Afterwards, the remainder of the query is evaluated using the variable elimination algorithm. The interventions introduced by the do-calculus allow computing counterfactuals using a three-step procedure consisting of the abduction, action, and prediction stages [Pea09]. (Recall that counterfactuals are questions of the form "Would A have been a1 had B been b?".)
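The gain from exploiting the factorisation can be made tangible with a small numerical check using random toy factors with four states per variable: naive enumeration over all joint assignments and the factorised computation give the same value, but the latter replaces k·k terms by k + k terms.

import numpy as np
from itertools import product

rng = np.random.default_rng(3)
k = 4                                      # number of states per variable
f_xy = rng.random((k, k))                  # toy factor standing in for Pr(x | y)
g_zw = rng.random((k, k))                  # toy factor standing in for Pr(z | w)
x, z = 1, 2                                # fixed states of X and Z

# Naive: enumerate all joint assignments of the eliminated variables y and w.
naive = sum(f_xy[x, y] * g_zw[z, w] for y, w in product(range(k), repeat=2))

# Variable elimination: factors without common variables are summed out independently.
factored = f_xy[x, :].sum() * g_zw[z, :].sum()

assert np.isclose(naive, factored)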



For illustration, assume we want to compute the probability that a student would have gotten a letter of recommendation (L = 1) if her grade had been good (G = 1), knowing that, in fact, it was bad (G = 0). First, the joint probability distribution Pr(D, I, G, S, L) is updated with the evidence G′ = 0 provided in the query (abduction). This yields the posterior distribution Pr(D, I, G, S, L | G′ = 0). Here, we use the notation G′ = 0 to differentiate between the current variable G and the evidence that we previously observed G in state 0. Second, all interventions in our question are applied (action), yielding Pr_{G=1}(D, I, S, L | G′ = 0). Last, the desired probability is evaluated on the modified posterior distribution (prediction). The problem with this approach is that explicitly computing the posterior distribution can require excessive amounts of memory. This is due to the fact that incorporating the evidence (G′ = 0) can render the remaining variables dependent, which invalidates the factorisation encoded in the CBN. Hence, a full probability table needs to be computed, which is often infeasible for larger networks. Instead, CausalTrail uses the twin network approach [Pea09] for computing counterfactuals. In this approach, a copy of all endogenous variables (non-source nodes, here G, L, and S) of the network is created. Exogenous nodes (source nodes, here D and I) retain their original edges and gain a connection to the copies of their children. The interventions are then executed on the copied network. When evaluating the query, the original network is used for conditioning (abduction) and the copy is used for prediction. In Figure 3.7, the twin network for a modified student network is given. It should be noted that counterfactuals can only be formulated for endogenous variables, as only they are fully dependent on the state of the remaining network. In contrast, exogenous variables depend on external influences and, thus, cannot be controlled. In addition to the interventions formalised in the do-calculus, CausalTrail also supports adding and removing edges to and from the network. To this end, the network is simply retrained with the respective edge added to or removed from the topology.

Figure 3.7: Twin network for a modified (added edge between "SAT" and "Letter") student network and the query ? L = 1 | G = 0 ! do G = 1. Nodes with a * indicate copies of the original network nodes. The exogenous variables "Intelligence" and "Difficulty" are connected to their original children as well as the respective copies. The intervention was executed on the node G* and thus the edges from D and I to G* were deleted.


query        ::= '?' queries condition? intervention?
queries      ::= 'argmax' '(' Node ')' | assignments
condition    ::= '|' assignments
intervention ::= '!' intervs
assignments  ::= assign (',' assignments)?
intervs      ::= ('do' assign | edge) (',' intervs)?
edge         ::= ('+'|'-') Node Node
assign       ::= Node '=' Value

Listing 3.2: Grammar of the CausalTrail Query Language

Query Language
For formulating hypotheses, CausalTrail provides an intuitive query language. The goal of the language is to allow users to specify a conditional probability or interventional expression that should be computed by CausalTrail. Accordingly, the language tries to mimic the respective mathematical notation. Consider the student network from before (Figure 3.4 and Figure 3.8). The question of how likely it is that a student receives a letter of recommendation (L = 1), given that the student has obtained a good SAT score (S = 1) and the grade has been fixed to bad (G = 0) by an intervention, can be stated as:

Pr_{G=0}(L = 1 | S = 1)

In the query language the same statement is expressed as follows:

? L = 1 | S = 1 ! do G = 0

Every query starts with a '?' followed by a list of nodes for which the posterior probability of a certain state should be computed. Alternatively, it is possible to indicate that the most likely state should be computed by using the argmax function. Following the '|' character, it is possible to list nodes on which the probability should be conditioned. Similarly, interventions can be stated after '!'. They can be expressed using the notation do N = v. Edge additions and removals between the nodes N and M are written as +N M and -N M, respectively. Using edge removal, we can, for instance, query how likely a good score in the SAT is, given that the student had a good grade and under the assumption that the difficulty of the course did not factor into the grade:

? S = 1 | G = 1 ! -D G

Further example queries are given in Table 3.1. The full grammar is given in Listing 3.2.

Graphical User Interface
All functionality provided by CausalTrail can be accessed via the command line interface (CLI). For convenience and better accessibility, we


Figure 3.8: Screenshot of the CausalTrail main window. The current network is shown in the central pane. Nodes that are part of the current query (right pane) are coloured according to their role in the query.

(a) Load samples dialog

(b) Select discretisation method dialog

Figure 3.9: When loading samples, first, the dataset can be examined. Next, a discretisation method needs to be chosen for each variable. Selected discretisation methods can be saved to and restored from a JSON file.


also provide a GUI (cf. Figure 3.8). The GUI is centred around a rendering of the Bayesian network as a directed, acyclic graph. The layout for this graph is generated using the Graphviz [GN00] suite. Queries can be built intuitively via a context menu that can be invoked by right-clicking on network nodes and edges. Alternatively, queries can be entered directly in the "Query Control Panel" (Figure 3.8, top of right pane). Queries entered this way are automatically validated while typing. If detected, syntactical and semantical errors, such as mistyped node names or values, are highlighted. In addition to the plain text representation in the input field, the "Query Control Panel" also provides a structured view, which breaks each query down into components. Nodes or edges that are part of the current query are colour coded in the network representation to improve clarity. An important part of CausalTrail is the management of datasets. To be able to use a BN for reasoning, a dataset needs to be imported, examined, and discretised. To this end, when a dataset is loaded, an interactive table showing its contents is displayed to the user, which can be used to exclude certain samples from the analysis (Section 3.3.1). Afterwards, the user is offered the choice of a discretisation method for each variable. After choosing appropriate methods, the user can store them in a JSON file (cf. Section 5.1.7) for later use. Internally, CausalTrail manages the state of the application as a session. In each session, multiple networks can be loaded and worked on simultaneously. The state of the session can be saved and restored at any point.

Implementation
CausalTrail is written in C++ using features from the C++14 standard2. To facilitate code reuse, all GUI-agnostic functionality, such as routines concerning parameter inference and reasoning, is implemented in a separate core library. Its only dependency is the Boost library collection3. On top of the core library, the command line interface (CLI) and a library providing a graphical user interface (GUI) have been built. For creating the GUI, the Qt5 toolkit4 was used, which allows porting CausalTrail to all major platforms. For computing graph layouts, the Graphviz [GN00] suite is automatically detected and used at runtime. We put special emphasis on the performance and reliability of the implemented methods. To this end, CausalTrail is equipped with a unit test suite written using the Google Test framework5. CMake6 is used as the build system. CausalTrail can be compiled on Linux and Windows, although only the former platform is officially supported. The source code is licensed under the GPLv3 and can be obtained from GitHub7.

2 http://www.iso.org/iso/catalogue_detail.htm?csnumber=64029
3 http://www.boost.org/
4 https://www.qt.io/
5 https://github.com/google/googletest
6 https://cmake.org/
7 https://github.com/unisb-bioinf/causaltrail



Figure 3.10: Dependence of the parameter fitting procedure on the number of samples and network topology. The performance is shown for (a) the Child (20 nodes, 25 edges) and (b) the Water (32 nodes, 66 edges) networks obtained from the Bayesian Network Repository [Scu16]. With increasing sample count a clear improvement in performance can be seen. In the case of the Child (c) network, around 1000 samples are enough to achieve excellent parameter fits. The more complex Water network (d) requires substantially more data. (Graphs taken from Schmidt [Sch15a]).

3.3.2 Examples

We first demonstrate the quality and convergence properties of the implemented parameter learning algorithm. For this purpose, we take a look at the mean square error (MSE) achieved for the parameters by training on data sampled from two benchmark networks. These are the Child network, which models childhood disease due to birth complications [SC92], and the Water network, which describes an expert system for the monitoring of waste water treatment [Jen+89]. Both networks were retrieved from the "Bayesian Network Repository" [Scu16] and are of medium size. We chose these networks to show that not only the number of nodes, but also the connectivity between them can be a limiting factor for the quality of parameter learning. For each network we drew a varying number of random samples from the joint distribution. With an increasing number of samples, the parameters for the Child network, consisting of 20 nodes and 25 edges, quickly converge to the true values. Already with just 100 samples, the estimates reach an acceptable error level of ≈ 0.03. The more complex Water network (32 nodes, 66 edges) requires substantially more data points to obtain good estimates. Even with 10,000 samples, the MSE remains at ≈ 0.19. For both networks, parameter training is completed in a few milliseconds. Considering that the network parameters are probabilities, an error of 0.19 hints at severe problems in the training procedure. While both networks have a similar number of nodes, the Child network is substantially less densely connected. This immediately has an impact on the size of the required CPTs. In addition, the Water network also has, on average, more states for each node than the Child network. This again increases the size of the CPTs and leads to more variability during parameter training.
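The qualitative trend (not the exact figures above) can be reproduced for a single toy distribution: maximum-likelihood estimates obtained by counting approach the true parameters as the number of samples grows.

import numpy as np

rng = np.random.default_rng(4)
true_cpt = np.array([0.2, 0.5, 0.3])       # toy distribution with three states

for n in (10, 100, 1000, 10000):
    samples = rng.choice(3, size=n, p=true_cpt)
    estimate = np.bincount(samples, minlength=3) / n
    print(f"n={n:>6}  MSE={np.mean((estimate - true_cpt) ** 2):.5f}")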



Figure 3.11: The CBN constructed by Sachs et al. [Sac+05], rendered using CausalTrail's SVG export functionality. Nodes represent proteins and edges represent causal dependencies between phosphorylation states. Nodes for which probabilities should be computed are coloured light green. Nodes with fixed values due to an intervention are coloured light yellow. The dashed edges are not considered during evaluation due to the intervention on ERK.

We further demonstrate an application of CausalTrail using the protein signalling network inferred by Sachs et al. [Sac+05]. Here, we consider the largest connected component of the network, which consists of eight nodes. Each node represents a protein for which the degree of phosphorylation was measured. For inferring the topology of the network, the authors used single-cell data obtained from CD4+ primary cells using multicolour flow cytometry. This technique allows determining the phosphorylation state of multiple target proteins using fluorescently labelled antibodies on a single-cell level [PN02]. A visualisation of the network, created using CausalTrail's SVG export functionality, is given in Figure 3.11. The authors validated the existence of the edge between ERK and AKT by showing that an intervention on ERK changes the phosphorylation level of AKT, but has no effect on PKA.


Query                           Result   Probability
? argmax(AKT)                   1        0.354
? argmax(AKT) ! do ERK = 2      2        0.774
? argmax(AKT) ! do ERK = 0      0        0.691
? argmax(PKA)                   2        0.336
? argmax(PKA) ! do ERK = 2      2        0.336
? argmax(PKA) ! do ERK = 0      2        0.336
? argmax(PKA) | ERK = 2         2        0.505
? argmax(PKA) | ERK = 0         0        0.423

Table 3.1: Example queries for the Sachs et al. [Sac+05] dataset. High phosphorylation levels for ERK increase the likelihood of AKT being phosphorylated. In contrast, no such influence is detectable for PKA. The last two rows show the effect of conditioning on ERK.

To this end, the phosphorylation of AKT and PKA was measured with ERK being (i) unperturbed, (ii) stimulated, and (iii) knocked down using siRNAs. Stimulation of ERK was achieved using antibodies targeting CD3 and CD28. Whereas this stimulation had no effect on PKA, it led to an increase in AKT phosphorylation. For the knockdown, again no change of PKA phosphorylation could be detected, whilst the phosphorylation of AKT dropped slightly below the level of the unperturbed case. To test whether the inferred network models the experimental data faithfully, we used the dataset and topology provided by Sachs et al. [Sac+05] to train the parameters of a CBN. We then examined the edge between ERK and AKT more closely. The dataset contains 11,672 measurements of each protein's phosphorylation level. These levels were discretised into the classes low (0), medium (1), and high (2) using the bracket medians procedure. We then computed the most likely phosphorylation state of AKT and PKA in (i) unperturbed, (ii) stimulated, and (iii) ERK knockout cells, which we modelled using interventions that fix the ERK phosphorylation level to high and low, respectively. The computed queries are given in Table 3.1. We find that the in silico stimulation of ERK leads to an increased AKT phosphorylation level. When ERK is knocked out, AKT phosphorylation drops to low, showing that the previous increase was in fact mediated by ERK. In contrast, the activity of ERK has no effect on the phosphorylation of PKA. Note that using an intervention is essential for this observation, as conditioning on ERK would render PKA dependent on ERK, resulting in a different prediction.

3.3.3 Summary

While tools for Bayesian reasoning are commonly available, no freely available programs for causal reasoning exist. With CausalTrail we attempt to fill this gap. CausalTrail enables its users to harness the additional expressivity offered by the do-calculus to formulate and test

biological hypotheses in silico. Our software offers efficient implementations for parameter learning and query evaluation that allow examining experimental data in an interactive fashion. The showcased application of causal reasoning demonstrates that CausalTrail may be a valuable addition to a bioinformatician's toolbox for the interpretation of Bayesian networks.

3.4 gaussian graphical models

Besides Bayesian networks, Gaussian graphical models (GGMs) are a popular choice for inferring network structure from data. In contrast to BNs, GGMs are undirected and encode a multivariate Gaussian distribution. This means that two nodes in a GGM are connected if and only if the corresponding random variables are conditionally dependent. As expression data is commonly assumed to be approximately normally distributed, GGMs are often used for analysing the dependence structure between gene expression measurements. (Note that normality of expression data is a strong and not necessarily true assumption; see Section 4.3.) By assuming a concrete underlying distribution, the model is more constrained than BNs. Still, the size of the networks that can be reliably estimated in practice strongly depends on the number of available samples. In the following, we discuss GGMs in more detail. To this end, we introduce the central notions of the precision matrix and partial correlations. We continue with methods for training GGMs and give a short example of a GGM trained on the Wilm's tumour dataset.

3.4.1 Multivariate Gaussian Distributions

The multivariate Gaussian distribution is a generalisation of the Gaussian distribution to higher dimensional spaces R^p. For this, the mean µ ∈ R is replaced by the vector of means ⃗µ ∈ R^p. For the variance σ², scaling up to a multivariate distribution is not as simple. This is due to the fact that the constituent random variables of the distribution can be dependent. To account for this, a covariance matrix Σ ∈ R^{p×p} is used. Its diagonal elements are the variances of each random variable, while the off-diagonal elements correspond to the covariances between pairs of random variables. If all variables are independent, the covariance matrix thus reduces to a diagonal matrix. The density function of a multivariate Gaussian distribution is given by

f(⃗x) = 1 / √((2π)^p det(Σ)) · exp( −(1/2) (⃗x − ⃗µ)^t Σ^{-1} (⃗x − ⃗µ) )

(Figure 3.12: Two-dimensional multivariate Gaussian distribution. The principal variances τ²_{1,2} correspond to the eigenvalues of Σ.)

The matrix Ω := Σ^{-1} is called the precision matrix. (It is also known as the concentration or, less commonly, the information matrix.) It can be interpreted as a transformation that rescales the samples such that their covariance matrix becomes the identity. This idea is pursued further in Section 4.4. In addition, Ω also contains information about the conditional dependences between the dimensions. Whereas Σ contains the covariances of the constituent random variables, Ω contains the partial covariances. The


partial covariance between two random variables can be understood as the part of their covariance that cannot be explained by the remaining variables. To explain this in more detail, we first introduce the notion of a least-squares fit. Assume that f : R^p → R is a function of the form f(⃗x) = β_1 x_1 + β_2 x_2 + ... + β_p x_p. Furthermore, assume that the outcome y is determined via the relation y = f(⃗x) + ϵ, where ϵ ∼ N(0, σ²_ϵ) is a normally distributed noise variable. The task is to find the best estimate f* of f given a set of N samples (y_i, ⃗x_i). Here, best is defined as the function minimising the least-squares loss function:

∑_{i=1}^{N} (y_i − f(⃗x_i))² = ∑_{i=1}^{N} (y_i − ⟨⃗x_i, β⃗⟩)² = ⟨⃗y − Xβ⃗, ⃗y − Xβ⃗⟩

Here, we use ⃗y ∈ R^N as the vector of observations and X ∈ R^{N×p} as the matrix of predictors. (Note that, assuming centred predictors, X^t X is proportional to the covariance matrix.) After applying basic calculus, we obtain a formula for the weight vector [FHT09]: β⃗* = (X^t X)^{-1} X^t ⃗y. We refer to f* as the least-squares fit of ⃗y given X with coefficients β⃗*. It is now possible to define the partial covariance:

Definition 9 (Partial Covariance [BSS04]). Let X, Y be two random variables and let Z = {Z_1, ..., Z_n} be a set of random variables. Denote the coefficients of the least-squares fits of X and Y given Z as β⃗_X and β⃗_Y, respectively. Then, the partial covariance is defined as the covariance of the residuals:

r_X := X − ⟨Z⃗, β⃗_X⟩,   r_Y := Y − ⟨Z⃗, β⃗_Y⟩

cov(X, Y | Z) := cov(r_X, r_Y)

As with the usual covariance, the partial covariance permits the definition of a measure of correlation. This correlation is called the partial correlation (cf. [BSS04]).

Definition 10 (Partial Correlation). Let X, Y be two random variables and let Z = {Z_1, ..., Z_n} be a set of random variables. The partial correlation of X and Y given Z is defined as

ρ_{XY|Z} := cov(X, Y | Z) / √( var(X|Z) · var(Y|Z) )

using var(X|Z) := cov(X,X|Z) or, given Ω:

ρ_{XY|Z} := − Ω_{XY} / √( Ω_{XX} Ω_{YY} )

It can be shown that two variables in a GGM are conditionally independent if and only if their partial correlation equals zero [Lau96; BSS04]. Equivalently, two distinct random variables are conditionally independent if and only if the corresponding entry in Ω is zero.
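These two routes to the partial correlation can be checked numerically on toy data with three correlated variables: the value obtained from the precision matrix coincides with the correlation of the regression residuals from Definitions 9 and 10. All data below is synthetic.

import numpy as np

rng = np.random.default_rng(5)
n = 5000
X = rng.normal(size=(n, 3)) @ rng.normal(size=(3, 3))   # correlated toy data
X -= X.mean(axis=0)                                     # centre the predictors

# Route 1: partial correlation of variables 0 and 1 given 2 via the precision matrix.
omega = np.linalg.inv(np.cov(X, rowvar=False))
rho_precision = -omega[0, 1] / np.sqrt(omega[0, 0] * omega[1, 1])

# Route 2: correlation of the residuals of least-squares fits on the remaining variable.
Z = X[:, [2]]
beta_x, *_ = np.linalg.lstsq(Z, X[:, 0], rcond=None)
beta_y, *_ = np.linalg.lstsq(Z, X[:, 1], rcond=None)
rho_residuals = np.corrcoef(X[:, 0] - Z @ beta_x, X[:, 1] - Z @ beta_y)[0, 1]

print(rho_precision, rho_residuals)   # both values agree up to numerical precision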


3.4.2 Covariance Matrix Estimation

In theory, fitting a GGM is straightforward. First, the sample covariance matrix is computed. Given a dataset X = {⃗x_1, ..., ⃗x_n} with n ∈ N samples and p ∈ N variables, the sample covariance matrix W is given by:

W = 1/(n−1) ∑_{i=1}^{n} (⃗x_i − ⃗µ)(⃗x_i − ⃗µ)^t

where ⃗µ is the sample mean. More robust estimates of the (co)variances can be used to obtain estimates of the precision matrix that are less susceptible to noise in the input data. An example is the method proposed by Schäfer and Strimmer [SS05], which is based on James-Stein shrinkage [JS61]. (More information on James-Stein shrinkage can be found in Section 4.2.2.) For details, the reader is referred to the cited literature.

3.4.3 Precision Matrix Estimation

The precision matrix can be obtained by computing the inverse of W. In practice, the required matrix inversion creates various problems. If few samples are available compared to the number of variables, small errors in the estimation of the covariance matrix are amplified by the condition number of the matrix [Pre+07]. (The condition number of a matrix is the ratio of the magnitudes of the highest and lowest singular value.) For small sample sizes, where W is close to singular, the condition number is likely to be high. Accordingly, large differences between the true and estimated precision matrix can be expected. To alleviate the effect of small sample sizes, a regularisation approach should be chosen. A direct way to introduce regularisation is to exploit the eigenvalue decomposition of Σ. Given a symmetric matrix A ∈ R^{p×p}, its eigenvalue decomposition is given by

A = V Γ V^t

where V ∈ R^{p×p} is the orthogonal matrix of eigenvectors and Γ ∈ R^{p×p} is the diagonal matrix of eigenvalues. As Σ is positive definite, the eigenvalues are guaranteed to be non-negative. It is then possible to compute Σ^{-1} = V Γ^{-1} V^t, where Γ^{-1} is the (pseudo-)inverse of Γ:

Γ^{-1}_{ii} = 1/Γ_{ii} if Γ_{ii} > 0,   0 otherwise   (3.1)

For small eigenvalues, this results in large entries in Γ^{-1}. Under the assumption that eigenvectors corresponding to small eigenvalues can mostly be attributed to noise, setting the inverse of these small eigenvalues to zero prevents the amplification of this noise. We thus define Γ̃^{-1}, in which only eigenvalues above a user-defined threshold ρ ∈ R⁺_0 are retained:

Γ̃^{-1}_{ii} = 1/Γ_{ii} if Γ_{ii} > ρ,   0 otherwise   (3.2)
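A direct sketch of the truncated inversion from Equations (3.1) and (3.2); the threshold value and the data dimensions are arbitrary.

import numpy as np

def regularised_precision(cov, rho=1e-3):
    # Invert a covariance matrix via its eigendecomposition, discarding all
    # eigenvalues below the threshold rho (Equation 3.2).
    eigval, eigvec = np.linalg.eigh(cov)      # cov = V Gamma V^t
    inv_gamma = np.zeros_like(eigval)
    keep = eigval > rho
    inv_gamma[keep] = 1.0 / eigval[keep]
    return eigvec @ np.diag(inv_gamma) @ eigvec.T

rng = np.random.default_rng(6)
X = rng.normal(size=(20, 50))                 # fewer samples than variables
cov = np.cov(X, rowvar=False)                 # near-singular covariance matrix
omega = regularised_precision(cov)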


A similar idea can be derived from linear regression. As linear regression requires the inversion of a covariance matrix to compute the regression coefficients, it also suffers from cases where Σ is near singular and, thus, has a high condition number. To counteract this, a small constant can be added to the diagonal of the covariance matrix to serve as a regulariser. (The L2 regularisation in ridge regression is a special case of Tikhonov regularisation [Tik63].) This ensures that the solution to the regression problem is unique. Ledoit and Wolf [LW04] explore this idea for computing a precision matrix and describe an optimisation problem that allows obtaining the optimal value of the regularisation parameter. If only few variables are expected to be conditionally dependent, an estimator that prefers a sparse precision matrix is advantageous. To this end, a considerable amount of research has been dedicated towards the computation of L1-penalised estimates of the precision matrix. The idea of using an L1 penalty term stems from the popular lasso regression [Tib96] procedure, which tends toward setting small coefficients to zero, thereby yielding sparse linear models. For the estimate of a precision matrix Ω, the penalised log-likelihood is given by

l(Θ) = log(det(Θ)) − tr(ΣΘ) − ρ||Θ||1 (3.3)

where Θ denotes the estimate for Ω, tr denotes the trace of a matrix, and ||·||_1 is the L1 matrix norm. The parameter ρ ≥ 0 determines the strength of the regularisation. For ρ = 0, Equation (3.3) reduces to the conventional log-likelihood of the precision matrix. For larger values of ρ, the size of the entries of Θ is increasingly penalised and thus the resulting matrix tends to include fewer non-zero values. To see that Equation (3.3) in fact corresponds to the log-likelihood, we start with the logarithm of the probability density. W.l.o.g. we assume that the mean of the samples ⃗x_i is zero.

L(Θ) = log ∏_{i=1}^{N} [ det(Θ)^{1/2} / (2π)^{p/2} · exp( −(1/2) ⃗x_i^t Θ ⃗x_i ) ]
     = ∑_{i=1}^{N} log [ det(Θ)^{1/2} / (2π)^{p/2} · exp( −(1/2) ⃗x_i^t Θ ⃗x_i ) ]
     = (N/2) log(det(Θ)) − (Np/2) log(2π) − (1/2) ∑_{i=1}^{N} ⃗x_i^t Θ ⃗x_i
     = (N/2) log(det(Θ)) − (Np/2) log(2π) − (1/2) ∑_{i=1}^{N} ∑_{l=1}^{p} ∑_{k=1}^{p} x_{il} Θ_{lk} x_{ik}
     = (N/2) log(det(Θ)) − (Np/2) log(2π) − (1/2) ∑_{l=1}^{p} ∑_{k=1}^{p} Θ_{lk} (X^t X)_{lk}
     = (N/2) log(det(Θ)) − (Np/2) log(2π) − (1/2) ∑_{l=1}^{p} (Θ (X^t X))_{ll}
     = (N/2) log(det(Θ)) − (Np/2) log(2π) − (N/2) tr(ΘΣ)

For fitting the model, we can omit the constant (Np/2) log(2π) and rescale with 2/N to obtain Equation (3.3) without the penalty term.

To find an approximate solution, Meinshausen and Bühlmann [MB06] employ the lasso in an iterative fashion to compute the partial covariances as in Definition 9. Based on this work and the results of Banerjee, El Ghaoui, and d'Aspremont [BEd08], Friedman, Hastie, and Tibshirani [FHT08] devised the graphical lasso approach. In contrast to the method proposed by Meinshausen and Bühlmann [MB06], their formulation solves the problem accurately and offers significantly improved runtime behaviour for sparse precision matrices. As optimisation procedure, Friedman, Hastie, and Tibshirani [FHT08] propose a cyclic gradient-descent-based method, which they published as the glasso package for R [R C16]. The principle behind the cyclic gradient descent is the realisation that all entries of the precision matrix are dependent. Instead of optimising one variable completely, each variable takes a short step in the direction of the (sub-)gradient in a round-robin fashion. Cai, Liu, and Luo [CLL11] propose an alternate optimisation target, which they solve using a set of linear programs. In the remainder of this work, we restrict ourselves to the more popular graphical lasso approach.
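Outside of R, one option for computing such a sparse estimate is scikit-learn's GraphicalLasso estimator, which optimises the same kind of L1-penalised log-likelihood. The penalty value and the toy data below are arbitrary placeholders.

import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 15))               # toy expression matrix: samples x genes

model = GraphicalLasso(alpha=0.5).fit(X)     # alpha plays the role of the penalty rho
omega = model.precision_                     # sparse estimate of the precision matrix

# Convert to partial correlations and read off the non-zero edges.
d = np.sqrt(np.diag(omega))
partial_corr = -omega / np.outer(d, d)
np.fill_diagonal(partial_corr, 1.0)
edges = np.argwhere(np.triu(np.abs(partial_corr) > 1e-8, k=1))
print(len(edges), "edges")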

3.4.4 Example

To illustrate a possible application of GGMs, we fitted a co-expression model using data from the Wilm's tumour dataset (Section 2.4) as input. To restrict the number of genes to a manageable size, we used the TFs found in the consensus network published by Juan et al. [Jua+16], which is based on the colocalisation of epigenomic marks. The network itself was determined using a GGM based on ChIP-seq data. Edge directions were added based on prior knowledge. First, we computed the precision matrix for all 47 TFs contained in the network by pseudo-inversion of the covariance matrix (Figure 3.13, left). As the network contains more genes than Wilm's tumour samples, the computed covariance matrix is singular. Accordingly, the precision matrix does not possess a full rank either. This gives rise to a clearly visible block structure in the precision matrix due to a linear dependence between the columns and rows. By adding a small constant value λ = 0.01 to the diagonal of the covariance matrix, this block structure is greatly reduced (Figure 3.13, right). This demonstrates that proper regularisation is crucial to prevent artefacts that hamper the interpretability of the model. While using the ridge strategy is a clear improvement over naïve matrix inversion, the resulting precision matrix is still dense, meaning that every variable has a non-zero partial correlation with the other variables. This makes it difficult to visualise the model as a network structure. Applying the graphical lasso with a relatively strict regularisation factor of ρ = 0.5 (Figure 3.14) results in a much sparser precision matrix. Of the detected 116 edges, 86 could be verified using a custom database collected from TRANSFAC [Mat+06], SignaLink [Faz+13], ChIP-Atlas8, ChipBase [Yan+13], ENCODE [ENC04], Jaspar [San+04], ChEA [Lac+10], as well as the publications by Cole et al. [Col+08] and Marson et al. [Mar+08b].



Figure 3.13: Partial correlations computed for the Wilm's tumour dataset using the naïve, matrix (pseudo)inversion based method (left). A clear block structure is visible due to the covariance matrix being singular. Adding a small, constant value λ = 0.01 to the diagonal of Σ greatly reduces this block structure (right).


Figure 3.14: Partial correlations inferred from the Wilm’s dataset using the glasso package. The penalty parameter ρ was set to 0.5. Red squares represent negative correlations while blue squares represent positive correlations. The values range between -0.2 and 0.3.



Figure 3.15: Gaussian graphical model inferred from the Wilm’s dataset using the glasso package. Edges correspond to non-zero partial correlations. The penalty parameter ρ = 0.5 was used to ensure a network sparse enough for visualisation. Isolated nodes have been discarded from the network.

Thus, the graphical lasso procedure yields 30 false positive edges. To assess the performance of the method, we randomly rewired our computed network ten times and calculated the number of false positives. We obtained a mean of 18.5 false positives with a standard deviation of 1.78. This means that our graphical lasso procedure performs significantly worse than creating edges at random. (Edges detected by our method that could not be found in a database might also be new discoveries; however, our randomisation experiment suggests that most detected edges are likely not real.) Why is this the case? From the network structure (Figure 3.15), we can clearly see that MYCN plays a central role in the transcription factor network. This is surprising, as MYCN only has few verified targets, of which only POU5F1 is present in the network of Juan et al. [Jua+16]. In contrast, known master regulators [Col+08; Riz09; Gol+11] such as SOX2, POU5F1, CTCF, or TCF3, which possess a large number of targets, are only connected to one or a few other genes. A likely explanation may lie in the tendency of L1-penalised methods such as the lasso to select exactly one out of a set of correlated variables [Tib+13; ZH05]. As the correlation between the genes in the network is high, the fact that

8 http://chip-atlas.org/


MYCN is chosen as a hub may simply be an artefact of the method. Another explanation may lie in the way many of the selected genes fulfil their functions. Binding sites for transcription factors such as STAT3, TCF3, CTCF, and ZNF384 can be found in a large number of genes. Due to this, these transcription factors are highly unspecific and rely on building complexes with more specific partners [Wol99; Rav+10]. However, GGMs only consider associations between pairs of genes and are thus unable to detect higher-order effects. In addition, the small amount of data used for training the network is likely to be responsible for the observed results. The inclusion of prior knowledge in the form of known TF binding sites could, in theory, help to improve the performance of the method. To this end, the glasso package offers facilities to adjust the penalty parameter ρ on a per-edge basis. Accordingly, edges that are present in databases could receive a lower penalty. Initial experiments with this approach have, however, not led to substantial improvements. Finally, ρ represents a tuning parameter that needs to be adjusted to obtain the best possible results. For this, however, independent test and training networks would be required (cf. Appendix A.2).
We are nevertheless confident that GGMs can serve as a valuable tool given appropriate initial conditions. This is shown by results such as the network by Juan et al. [Jua+16], which was determined using GGMs. In this case, however, the authors could rely on data from ChIP-seq experiments from which they derived a wealth of co-occurrences of transcription factor binding sites and epigenetic marks. To obtain a similar dataset for gene expression experiments, several thousand samples would need to be produced. This is also the case for a recent breakthrough in protein folding, which relies on the conditional mutual information between residues, a quantity closely related to partial correlations, to achieve tremendous improvements in the accuracy of de novo protein structure prediction [MHS12].

3.5 deregulated subgraphs

Until now, we have only considered methods, such as causal Bayesian networks and Gaussian graphical models, that operate on comparatively small networks. For larger topologies, such as complete PPI or regulatory networks (Section 3.2), these methods are usually not applicable. Nevertheless, a common task during the analysis of, e.g., expression datasets is to identify a portion or subgraph of, for instance, a regulatory network that contains a large number of differentially expressed genes. The idea of searching for deregulated groups of entities is also used by enrichment algorithms (see Chapter 4). The rationale is that (connected) subgraphs are likely part of the same biological process. Hence, identifying a subgraph with a high number of differentially expressed genes hints at the corresponding process being deregulated. This is of interest for analysing tumour samples, where the activity of biological processes can be severely altered due to mutations and other pathogenic mechanisms. To make this problem definition more precise, consider a graph G(V, E) with node scores w : V → R. G represents our regulatory

network and w are scores that measure the differential expression of each gene or protein in G. Assume that a procedure that searches for “interesting” subgraphs inspects all members of a family of admissible subgraphs H = {Hi ⊂ G | i = 1, . . . , n}. For each Hi, the procedure computes a score si = S(Hi, w) that represents the degree of deregulation of the subgraph. Each Hi that is assigned a score surpassing a predefined threshold is called a deregulated subgraph. The deregulated subgraph with the highest score is called the most deregulated subgraph.
In this chapter, we discuss an algorithm for searching the most deregulated subgraph in regulatory networks. The central idea behind its development was the ability to detect molecular key players that can trigger regulatory cascades and are thus likely to be causal for the deregulation observed in the identified subgraph. To achieve this, we developed an algorithm based on integer linear programming (ILP) that searches for a connected subgraph of fixed size that contains a root node. In this context, a root node is a node from which all other vertices in the subgraph are reachable.
After introducing the ILP, we turn our attention to an open problem concerning algorithms for the search of deregulated subgraphs in general. As described above, a subgraph is considered deregulated whenever its score exceeds a predefined threshold. Yet, this definition is somewhat arbitrary and is not connected to any measure of statistical significance. It is thus not possible to assess whether the selected subgraph refers to a significantly deregulated biological process or not. To remedy this situation, an appropriate H0 hypothesis needs to be chosen first. However, already this step proves to be difficult, as various, equally reasonable but not equivalent choices are possible. Here, we will not attempt to find an answer to this problem. Instead, we investigate a closely related, even more fundamental problem. In particular, we present an approach for estimating the likelihood with which a node is selected as part of the most deregulated subgraph, given uninformative scores.

3.5.1 Related Work

Detecting deregulated subgraphs is a large field in computational biology that has brought forth a wide range of conceptually different approaches. In the literature, the term dysregulated subgraph is sometimes used instead. The first published algorithm for searching deregulated subgraphs is the approach by Ideker et al. [Ide+02], which uses a simulated annealing scheme to iteratively improve the score of the detected subnetworks. As scoring function, they generate a background distribution of subgraph weights, and each subgraph score is normalised with the number of vertices. The background generation and the simulated annealing procedure lead to considerable runtime requirements for larger networks. In addition, the size of the detected subgraph is often too large to be interpretable. To prevent this behaviour, various improvements to the scoring function and search heuristics have been published (e.g. Rajagopalan and Agarwal [RA05]). Ulitsky et al. [Uli+10] reduce the problem of finding a deregulated subnetwork to the NP-hard connected


set cover problem. They present a range of heuristics with provable performance guarantees for solving their problem formulation. The first formulation allowing for exact solutions on undirected graphs was stated by Dittrich et al. [Dit+08]. Similarly to Ulitsky et al. [Uli+10], they reduce the task to a well-known problem from computer science: the prize-collecting Steiner tree problem. To solve it, they apply a previously published ILP solution [Lju+06]. In contrast to the ILP presented in this work, the Steiner-tree-based formulation uses both edge and node weights. Recently, Bomersbach, Chiarandini, and Vandin [BCV16] formulated an ILP approach for solving the connected maximum coverage problem in order to detect subnetworks that contain frequently mutated genes in cancer patients. Alcaraz et al. [Alc+11] use ant colony optimisation for extracting deregulated pathways. Later versions of their tool, which added support for the inclusion of prior knowledge as well as robustness analyses [Alc+14], are also available as a web service [Lis+16]. Maxwell, Chance, and Koyutürk [MCK14] describe an algorithm that allows to efficiently enumerate all induced connected subgraphs in a biological network that fulfil a predefined “hereditary property”. Further algorithms can be found in a comprehensive review by Al-Harazi et al. [AlH+15].

3.5.2 Subgraph ILP

Contributions The ILP approach for the detection of deregu- lated networks was implemented by Alexander Rurainski with later, maintenance related contributions by me. Christina Backes wrote the initial draft of the paper, which was completed by Oliver Müller, me, and Hans-Peter Lenhof.

Here, we describe our ILP-based algorithm for detecting deregulated subgraphs in directed graphs such as regulatory networks [Bac+12]. Examples for results obtained using our ILP are given in Section 5.4.5 and Section 5.5.6. As mentioned before, a key characteristic of our algorithm is that it allows for the detection of molecular key players that sit at the top of a regulatory cascade. To this end, we require that the subnetwork is rooted. This means that there exists at least one node from which all other nodes in the selected subnetwork are reachable. The rationale behind this requirement is that a change in a single regulator, such as a mutation or an epigenetic aberration, can influence the expression level of a large number of downstream elements. Examples are driver mutations in cancer (cf. Section 2.1) that transform healthy cells into tumour cells by disrupting key regulatory processes such as “cell cycle control” (cf. Section 2.1). By requiring the presence of a root, we thus aim to detect possible driver genes, which are causal for the deregulation of the subnetwork; here, “causal” is not meant in the strict sense as defined in Section 3.3. For undirected graphs, in which every node can serve as the root node, the problem solved by our algorithm is equivalent to the maximum-weight connected subgraph (MWCS) problem [Dit+08].


Figure 3.16: Two-dimensional LP. The constraints of the LP define the polytope of feasible solutions. The optimal solution (red circle) of an LP can be found in one of the vertices of the polytope. Feasible integer solutions (red crosses) do not generally coincide with vertices. As a consequence, the optimal LP and ILP solutions are commonly neither identical nor in direct proximity.

Additionally, our formulation requires that each subnetwork con- sists of exactly k nodes. This restriction was added to guarantee that the size of the computed subnetworks remains interpretable. While this may seem limiting, it is possible to perform a sweep across a range of subgraph sizes. This provides additional advantages. For example, the stability of the computed solutions can be assessed: if the subgraph to- pology changes dramatically between different values of k there likely exist multiple, equally relevant parts of the network that should not be interpreted in isolation. Before we are able to state the ILP formulation, we first need to introduce some basic concepts about linear programs.

Linear Programs

A linear program (LP) is an optimisation problem that can be stated in terms of a linear objective function subject to a set of linear constraints. Note that an LP does not necessarily possess an optimal solution, or any solution at all for that matter. Let x ∈ R^n be a vector of variables, c ∈ R^n a vector representing the coefficients of the objective function, A ∈ R^{p×n} a matrix representing the constraints, and b ∈ R^p a vector containing the right-hand sides of the constraints. We can write every linear program in the following,



(a) The branch-and-cut algorithm. (b) Effect of cutting planes.

Figure 3.17: Overview of a branch-and-cut algorithm. (a) First, the integrality constraints are relaxed and the LP is solved. If the found solution is worse than a previous, integral solution, the instance is discarded. If the solution is integral and no other instances need to be solved, it is returned as the optimal solution. Otherwise, (b) a cut that removes no integer solution is searched and added to the LP. If no effective cut could be found, the algorithm branches on a variable and solves the two resulting instances.

canonical form:

\[
\begin{aligned}
\max_{\vec{x} \in \mathbb{R}^n} \quad & \vec{c}^{\,t}\vec{x}\\
\text{s.t.} \quad & A\vec{x} \le \vec{b}\\
& x_i \ge 0 \quad \forall i
\end{aligned}
\]

Every constraint defines a closed half-space in which a solution must lie. The intersection of these half-spaces results in a convex polytope containing all feasible solutions (Figure 3.16). Linear programs can be solved in polynomial time, even in the presence of an exponential number of constraints, using the ellipsoid method [Kha80; Kar84]. (The ellipsoid method is also an interior point method; however, the required computations are too time consuming for it to be practically applicable.) In real-world applications, the solution can be found using other interior point [LMS91] or simplex [D+55] algorithms. The latter traces the edges of the polytope in the direction of the objective function's gradient. This procedure is guaranteed to reach the optimal solution, as it can be easily shown that the optimal solution of an LP can be found in one of the vertices of the polytope.
For solving ILPs the situation is different, though. Integer linear programs (ILPs) are LPs in which the variables are restricted to integer values. Due to this additional constraint, vertices seldom coincide with feasible ILP solutions and hence the optimal solution cannot be found using algorithms for solving LPs. In fact, it can be easily shown that solving an ILP is NP-hard; an ILP formulation of, e.g., vertex cover provides a reduction that proves NP-hardness. However, due to the ubiquity of ILP formulations in both science and economy, efficient, exact solvers have been developed. These state-of-the-art solvers typically employ branch-and-cut strategies.
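To make the canonical form and the gap between LP and ILP solutions concrete, the following toy sketch (purely illustrative numbers, not taken from this work) solves a small instance with SciPy's linprog, which minimises and therefore receives the negated objective, and compares the fractional LP optimum with the best integer point found by brute force.

```python
import numpy as np
from scipy.optimize import linprog

# Toy instance of the canonical form: maximise c^T x  s.t.  A x <= b, x >= 0.
c = np.array([1.0, 1.0])
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([4.0, 6.0])

# linprog minimises, so the objective is negated.
res = linprog(-c, A_ub=A, b_ub=b, bounds=[(0, None)] * 2, method="highs")
print(res.x, -res.fun)          # LP optimum: x = (1.2, 1.6), objective 2.8 (fractional vertex)

# Brute force over a small integer grid to obtain the ILP optimum for comparison.
candidates = [np.array([i, j]) for i in range(5) for j in range(5)
              if np.all(A @ np.array([i, j]) <= b)]
best = max(candidates, key=lambda p: c @ p)
print(best, c @ best)           # ILP optimum has objective 2, strictly worse than the relaxation
```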


(a) Disconnected nodes (b) Connection enforced via Eqn. 3.8

(c) Disconnected cycles (d) Connection enforced via Eqn. 3.9.

Figure 3.18: Effect of the in-edge constraint Equation (3.8) and the cycle constraint Equation (3.9). Without the in-edge constraint, the resulting subgraph does not need to be connected (a). With the constraint, a connection to the root is enforced (b). However, disconnected cycles are still legal (c). With the cycle constraint, each cycle either needs to contain the root or must have an in-edge from a chosen node outside of the cycle (d). The root node is indicated by a red background.

In principle, these approaches work as follows (Figure 3.17): in a first step, the algorithm solves a so-called relaxed version of the ILP, in which the integrality constraints have been dropped. The objective value of this solution serves as an upper bound on the optimal integer solution. If the solution is not integral, a new constraint is derived that cuts away the vertex containing the found solution, thereby tightening the polytope around the contained integral points. If no effective cut could be found, the algorithm branches on a variable, meaning that the ILP is divided into two instances: one where the variable has a value lower than a computed bound and one where its value is higher. For binary variables this effectively means that two instances are created in which the value of the branching variable is fixed to zero and one, respectively. Both instances are then solved separately. The objective value of the best current integer solution is kept as a lower bound. This allows to prune instances from the search tree whenever their relaxed LP solution has a lower objective value than the current lower bound. The algorithm iterates until an integer solution can be proven to be optimal. This is the case when, e.g., no other branches remain to be solved. For more details on (integer) linear programming, we refer the reader to the comprehensive introduction by Bertsimas and Tsitsiklis [BT97].

ILP Formulation

Our ILP formulation is based around two types of variables. Let n ∈ N be the number of nodes in the network. Each node i is associated with a binary variable x_i ∈ B that indicates whether it is part of the deregulated subgraph (x_i = 1) or not (x_i = 0). We group these variables into the vector x ∈ B^n. Similarly, each node is associated with a binary variable y_i ∈ B that is equal to one if and only if the node was selected

as the designated root node. The network is allowed to contain more than one node that can act as root; the constraints ensure that exactly one of these nodes is picked as the designated root node. As with x, y ∈ B^n is the vector of the y_i. As we are searching for a maximally weighted subgraph, we aim to maximise the sum of node weights w_i ∈ R, yielding the objective function \( \vec{w}^{\,t}\vec{x} = \sum_{i=1}^{n} w_i x_i \) (Equation (3.4)). Common sources for node weights are scores for differential gene expression or differential protein abundances. Variables for the root nodes are not included in the objective function, as their only purpose is to ensure that each subnetwork contains exactly one designated root (Equation (3.5)). Let us now state the complete ILP formulation:

\[
\begin{aligned}
\max_{\vec{x},\,\vec{y} \in \mathbb{B}^n} \quad & \vec{w}^{\,t}\vec{x} && \text{(3.4)}\\
\text{s.t.} \quad & \sum_{i=1}^{n} y_i = 1 && \text{(3.5)}\\
& \sum_{i=1}^{n} x_i = k && \text{(3.6)}\\
& y_i \le x_i \quad \forall i && \text{(3.7)}\\
& x_i - y_i - \sum_{j \in \mathrm{In}(i)} x_j \le 0 \quad \forall i && \text{(3.8)}\\
& \sum_{i \in C} (x_i - y_i) - \sum_{j \in \mathrm{In}(C)} x_j \le |C| - 1 \quad \forall C && \text{(3.9)}
\end{aligned}
\]

To understand this, we first introduce additional notation. C denotes a subset of nodes that make up a cycle. The function In(i) returns the set of nodes j for which an edge (j, i) exists. In(C) is a shorthand notation for \( \bigcup_{i \in C} \mathrm{In}(i) \). The constraint in Equation (3.6) ensures that exactly k nodes are selected for the final solution. Equation (3.7) assures that a node can only be used as a root node if it is also chosen as part of the subgraph. So far, the constraints do not guarantee that the solution is reachable from the root node or that the graph is connected in any way (Figure 3.18 (a)). Equation (3.8) enforces that every node which is part of the solution is either a root or has a predecessor that is selected, too (Figure 3.18 (b)). Unfortunately, this constraint can be satisfied by a set of disconnected cycles, as every node in a cycle has a predecessor (Figure 3.18 (c)). Thus, a constraint needs to be added for every (directed) cycle in the graph (Figure 3.18 (d)). To prevent disconnected cycles, Equation (3.9) requires that either the root node is contained in the cycle C or some node in the cycle has a predecessor outside of the cycle that is also part of the solution. Unfortunately, the number of cycles in a graph scales roughly exponentially with the average node degree. This means that enumerating all these constraints is not possible due to runtime and memory limitations. We will explain how to circumvent these problems in practice below.

Implementation

We implemented the ILP formulation using the branch-and-cut [PR91] framework provided by the CPLEX solver [Bac+12; IBM12]. In particular, we made use of CPLEX's capabilities to lazily add constraints to

the problem. This is important as an exponential number of cycle constraints can be required. However, only a fraction of these constraints is likely to ever be evaluated during solving, as we can expect that the most deregulated subgraph will mainly centre around the highest scoring nodes. Thus, it is usually more efficient to first solve the problem without any cycle constraints at all. After a solution has been obtained, it can be checked for possibly violated constraints in a callback routine. If any such constraints can be identified, they are added to the problem instance. This procedure is iterated until no further violated constraints can be detected.
Similarly, we use CPLEX's functionality to turn an LP solution into an ILP solution using a heuristic. To this end, we use a simple, multi-greedy approach. Given a non-integer solution, we select all nodes for which x_i is larger than zero. This induces a subgraph consisting of one or more connected components. Each of these components is then expanded or shrunk until it contains k nodes and then returned as a potential solution.
A compiled version of our implementation can be downloaded from http://genetrail.bioinf.uni-sb.de/ilp/Home.html.
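The production implementation relies on CPLEX and its lazy-constraint callbacks. As an illustrative re-implementation sketch under open-source tools (PuLP with the bundled CBC solver; our assumption, not the original code), the callback can be emulated by re-solving and adding a connectivity cut whenever the chosen nodes are not all reachable from the designated root. The cut below is phrased over the unreachable node set with predecessors outside of it, which is one way to realise the intent of Equation (3.9).

```python
import pulp

def most_deregulated_subgraph(nodes, edges, weights, k):
    """Rooted, connected subgraph of size k maximising the sum of node scores.

    nodes   : iterable of node identifiers
    edges   : list of directed edges (u, v)
    weights : dict mapping every node to its deregulation score
    """
    preds = {v: set() for v in nodes}
    succs = {v: set() for v in nodes}
    for u, v in edges:
        preds[v].add(u)
        succs[u].add(v)

    prob = pulp.LpProblem("subgraph_ilp", pulp.LpMaximize)
    x = pulp.LpVariable.dicts("x", nodes, cat="Binary")   # node is selected
    y = pulp.LpVariable.dicts("y", nodes, cat="Binary")   # node is the designated root
    prob += pulp.lpSum(weights[i] * x[i] for i in nodes)                # (3.4)
    prob += pulp.lpSum(y[i] for i in nodes) == 1                        # (3.5)
    prob += pulp.lpSum(x[i] for i in nodes) == k                        # (3.6)
    for i in nodes:
        prob += y[i] <= x[i]                                            # (3.7)
        prob += x[i] - y[i] - pulp.lpSum(x[j] for j in preds[i]) <= 0   # (3.8)

    while True:
        prob.solve(pulp.PULP_CBC_CMD(msg=False))
        chosen = {i for i in nodes if x[i].value() > 0.5}
        root = next(i for i in nodes if y[i].value() > 0.5)

        # Nodes reachable from the root within the chosen subgraph.
        reachable, stack = {root}, [root]
        while stack:
            u = stack.pop()
            for v in succs[u]:
                if v in chosen and v not in reachable:
                    reachable.add(v)
                    stack.append(v)

        unreachable = chosen - reachable
        if not unreachable:
            return chosen, root

        # Connectivity cut in the spirit of Eqn. (3.9): the disconnected set must
        # either lose a node or gain a selected predecessor outside of it.
        outside = set().union(*(preds[i] for i in unreachable)) - unreachable
        prob += (pulp.lpSum(x[i] - y[i] for i in unreachable)
                 - pulp.lpSum(x[j] for j in outside) <= len(unreachable) - 1)
```

The sweep over subgraph sizes discussed above then simply amounts to calling this routine for a range of values of k.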

3.5.3 Topology Bias

Contributions This investigation was adapted from Miriam Bah’s Bachelor’s thesis [Bah12] supervised by Marc Hellmuth and me.

We previously explained that assessing the significance of a deregu- lated subgraph is an open problem (Section 3.5). A first issue lies in the fact that it is unclear on the basis of which H0 hypothesis a p-value should be computed (cf. Section 4.1). At least three different formula- tions with slightly different interpretations come to mind:

1. Is the score obtained for the selected subgraph significant?

2. Is the obtained score significant?

3. Is selecting the current subgraph with the associated score signi- ficant?

While the first and the second question can probably be answered by sampling a substantial amount of score permutations, the third ques- tion is more difficult to answer, as it additionally requires enumerating all possible subgraphs. Here, we do not further pursue the computation of a p-value. In- stead, we concern ourselves with a different, but closely related prob- lem: was a member of the computed subgraph chosen “by chance” or because it contributed valuable information? As an example, consider Figure 3.19. There, connected subgraphs of size k = 3 were selected from a path of length six. To simplify reasoning about the involved


Figure 3.19: Subgraphs of size k = 3 in a path of length six. The number below each node indicates the probability with which it is part of the selected subgraph. Numbers next to subgraphs indicate the probability with which the subgraph is chosen. Probabilities were determined by evaluating all permutations of the scores {2^i | i = 0, . . . , 5}. The two inner subgraphs are less likely to be chosen than the outermost subgraphs. However, the innermost nodes are more than twice as likely to be selected as the outermost nodes.

[Figure 3.20 shows a small directed network of eight nodes (0–7), each annotated with its empirical selection probability (between 0.26 and 0.86).]

Figure 3.20: Bias of node selection probabilities (numbers next to nodes) in a more complex topology. Subgraphs of size k = 4 were selected using all permutations of the scores {2^i | i = 0, . . . , 7}. Node 7 is more than three times less likely to be chosen than node 1.

probabilities, we generated node scores w_i as powers of two, such that each score is larger than the sum of all smaller scores. Thus, every node i has an associated score w_i = 2^i. This guarantees that the node with the highest node score is part of the most deregulated subgraph. Next, we enumerated all possible permutations of the scores and computed the probability with which each node and subgraph is selected given a random score permutation. It can be seen that the probabilities are uniformly distributed neither for subgraphs nor for nodes. Instead, a significant bias towards some subgraphs is present. The probability of selecting a node is given as the sum of the probabilities of the subgraphs containing it and is hence similarly biased. It is important to note that this bias is purely due to the chosen topology, as we enumerated all score permutations. With a slightly more complex topology, such as in Figure 3.20, this effect becomes more pronounced, putting more and more probability mass on hub-like nodes such as nodes 1 and 5.
Naturally, finding deregulated subgraphs relies on the topology of the input network. However, the above observation suggests that the topology has a significant impact on the obtained solutions. To be useful for analysing, e.g., expression data, subgraph methods need to maintain a balance between the constraints imposed via the topology and the information provided by the input scores. If, on the one hand, the solution depends to a large degree on the topology with only minor influence from the input data, the method is not able to uncover interesting biological phenomena. If, on the other hand, the topology has no influence on the result, a simpler, topology-free method could be employed instead.
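The toy experiment behind Figure 3.19 can be reproduced along the following lines; the sketch enumerates all score permutations on the path of length six, where the admissible subgraphs of size k = 3 are the four contiguous windows.

```python
from itertools import permutations
from math import factorial

n, k = 6, 3
windows = [tuple(range(i, i + k)) for i in range(n - k + 1)]  # admissible subgraphs on a path
scores = [2 ** i for i in range(n)]                           # each score exceeds the sum of all smaller ones

node_hits = [0] * n
subgraph_hits = {w: 0 for w in windows}
for perm in permutations(scores):
    best = max(windows, key=lambda w: sum(perm[i] for i in w))  # most deregulated subgraph (no ties)
    subgraph_hits[best] += 1
    for i in best:
        node_hits[i] += 1

total = factorial(n)
print([hits / total for hits in node_hits])              # inner nodes are selected far more often
print({w: hits / total for w, hits in subgraph_hits.items()})
```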


[Figure 3.21, tree structure: Smoker? → Yes: BMI < 25? (Yes: medium risk, No: high risk); No: Age < 50? (Yes: BMI < 30? (Yes: low risk, No: medium risk), No: Gender? (Female: medium risk, Male: high risk)).]

Figure 3.21: Example of a (fictive) decision tree stratifying patients into low, medium, and high risk groups for congestive heart disease. In each decision node (round), a predictor (Smoker, Age, BMI, Gender) is used to partition the training data. Predictions for new data points can be made by following a path from the root to a leaf node (rectangles), which contains the predicted value.

Quantifying the influence of topology and input data on the results of a deregulated subnetwork algorithm is difficult, though. Here, we propose a procedure that allows to judge the bias incurred by methods for the detection of deregulated subgraphs towards certain nodes due to the network topology (for a brief introduction to machine learning refer to Appendix A.2). It is based on predicting the probability that a given node is selected as part of the subnetwork. To this end, we train a random forest model [Bre01] for predicting node inclusion probabilities that were empirically determined by sampling deregulated subnetworks given uniformly distributed input scores. This model uses a set of node topology descriptors as features. Applying the trained predictor to an unknown network allows to estimate the topology bias without a computationally intensive sampling step.
In the following, we explain our procedure in detail. To be able to do so, we first need to introduce the random forest predictor as well as the used topology descriptors. Next, the used training and validation scheme is outlined. Finally, we present the results of our study.

Random Forests

Random forests are non-linear models for classification and regression (cf. Appendix A.2). For more details on random forests, we refer to the original paper by Breiman [Bre01] and the introduction by Friedman, Hastie, and Tibshirani [FHT09]. In particular, they are an ensemble method that is based on averaging the output of B ∈ N decision trees \( \hat{f}_b \) that were trained on bootstrap samples of the data (bagging). A decision tree is a machine learning method that partitions its input space via recursive, binary cuts [Qui86; Qui93]. These cuts can be represented as nodes in a tree (cf. Figure 3.21). Which split is used in each node is determined by evaluating a loss function for each dimension. Examples are the squared-error loss for regression or the Gini index for classification.


The variable that achieves the smallest loss is chosen for cutting. In contrast to common decision trees, which always evaluate all variables to decide on a split, the random forest procedure randomly selects k predictors from which the best candidate must be chosen. Furthermore, each tree is built on a bootstrap sample of the training data. A bootstrap sample for a dataset containing N samples is generated by randomly drawing N samples from the dataset with replacement. Each tree is grown until it perfectly predicts its training data. In a regression scenario, the final random forest model is defined as the average prediction of all trained decision trees:
\[ \hat{f}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}_b(x) \]

In a classification scenario, the output class label is determined via a majority vote. Often, B ≈ 500 is used together with k = √p for classification and k = ⌊p/3⌋ for regression, where p is the number of predictors.
Random forests have various advantages that make them especially suited for quickly obtaining classifiers whose performance is competitive with other state-of-the-art approaches. A reason for this is that random forest models are comparatively straightforward to set up. As inputs they can use continuous, ordinal, and categorical variables and therefore require no special feature encoding techniques. In addition, the model exposes only two tuning parameters: the number of trees B and the number of features considered in each split, k. For both parameters good default values are known, and choosing different but reasonable settings commonly has no critical impact on performance. Due to this, the model is robust with respect to overtraining [Bre01; FHT09].
As each tree is grown on a different bootstrap sample, not every sample has been used for every tree. Thus, for each tree \( \hat{f}_b \) the prediction error for the samples that have not been used to train \( \hat{f}_b \) can be determined. This yields the so-called out-of-bag (OOB) error, which can serve as an estimate of the test error. Thus, cross-validation or a separate tuning set are not strictly required for tuning random forests.
Finally, it is possible to compute feature importances. The feature importance reflects how large the impact of a feature on the predicted value is. There are two ways to compute these importances. The first version is based on the OOB error. For each tree, the OOB error is computed. Then, to obtain the importance of variable j, its values are permuted and the OOB error is computed again. The difference between the two values is averaged across all trees and returned as the variable importance [Bre01]. Alternatively, the decrease of the loss function in each split gives an indication of the influence of the chosen split variable. These values are accumulated across all trees and splits to obtain the final importance [FHT09].
Of course, random forests also have disadvantages. Probably the most important one is that the final model is difficult to interpret. While we can extract the importance of single variables, random forests are highly non-linear and thus the importance measure does not permit a straightforward interpretation such as: “If the value of variable A

increases by 10 %, the risk for cancer increases by 20 %.” In contrast, the weights computed by linear models (Section 3.4.1) or SVMs with linear kernels [FHT09; SS02] have this property. Also, for SVMs with a well-performing, non-linear kernel, examining the feature maps allows to gain mechanistic insight into the learning problem. Furthermore, in the presence of categorical variables, random forests tend to favour those variables with more levels. Consequently, the computed variable importance measures are not reliable in this case and need to be adjusted using sampling techniques [Alt+10].
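As a concrete illustration of the OOB error and the permutation-based importance described above, the following scikit-learn sketch uses synthetic data (a stand-in chosen by us, not data from this thesis).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 3 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.1, size=300)  # only the first two features matter

forest = RandomForestRegressor(n_estimators=500, oob_score=True, random_state=0)
forest.fit(X, y)

print(forest.oob_score_)             # OOB estimate of the generalisation performance (R^2)
print(forest.feature_importances_)   # impurity-based importances (loss decrease per split)

# Permutation importance: drop in performance when a single feature is shuffled.
perm = permutation_importance(forest, X, y, n_repeats=10, random_state=0)
print(perm.importances_mean)
```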

Topology Descriptors

Arguably the most important prerequisite for applying machine learning algorithms is the selection of a suitable set of features and their representation (the terms feature and predictor are often used synonymously). For our task, features are needed that capture the topological properties of a graph node. To this end, we use a set of topology descriptors which have originally been developed for analysing the importance of nodes in large network structures such as the topology of the internet. In this section, we describe and define the used descriptors.
First and foremost, we use simple graph-theoretical properties such as the in- and out-degree of a node. These statistics, however, characterise the node almost in isolation and pay little regard to the surrounding topology. To help assess the importance of a node on a global scale, more sophisticated measures such as the PageRank [Pag+99] or various centrality measures have become popular. Let us, however, start with more basic measures. A natural choice for topology descriptors are the node degrees as introduced in Definition 5. The degree of a node, especially relative to the degrees of the remaining nodes, reflects the number of connections a node has and should thus be a crude approximation of how “hub-like” the node is. Nodes with a high in-degree are also more likely to be leaves in a selected subgraph, whereas nodes with a high out-degree should be more likely to act as root nodes. The clustering coefficient is a generalisation of this idea and also considers how the neighbours of the node are connected. More formally:

Definition 11 (Clustering coefficient). A triangle is a triple of connected nodes. Let τ_Δ(v) denote the number of triangles containing v ∈ V and let k_v ∈ N be the number of adjacent nodes. Then the clustering coefficient for v is defined as
\[ \mathrm{cl}(v) = \frac{\tau_\Delta(v)}{k_v (k_v - 1)} \]

In contrast to the node degrees, centrality measures put a node in a more global context. For example the betweenness centrality [Fre77] counts the number of shortest paths that include a given node.

Definition 12 (Betweenness centrality). Let σ(s, t | v) denote the num- ber of shortest paths between nodes s and t that include v. Furthermore, let σ(s, t) be the total number of shortest paths between the two nodes.


Then the betweenness centrality is given by
\[ c_B(v) = \frac{1}{(|V| - 1)(|V| - 2)} \sum_{s \neq t \neq v} \frac{\sigma(s, t \mid v)}{\sigma(s, t)} \]

The closeness centrality [Bav50; Bea65] is also based on the notion of shortest paths. However, it considers the reciprocal of the average length of all shortest paths starting from the node of interest v. If all remaining nodes in the graph can be reached quickly, v receives a high centrality value. Otherwise, if the length of the shortest paths is long, v has a low centrality.

Definition 13 (Closeness centrality). For a node v ∈ V the closeness centrality is given by
\[ c_C(v) = \frac{n - 1}{\sum_{u \in X_v} \mathrm{dist}(v, u)} \]
Here, dist(v, u) denotes the length of the shortest path from v to u and X_v := {u | dist(v, u) < ∞}.

Whereas betweenness and closeness centrality use shortest paths to determine the importance of a node, Bonacich [Bon72] proposed to use the eigenvector associated with the dominant eigenvalue of the adjacency matrix as a centrality measure.

Definition 14 (Eigenvector centrality). Let A be the adjacency matrix of G with A_vu = 1 if (v, u) ∈ E and 0 otherwise. Then the out-eigenvector centrality c_out(v) is defined as the v-th entry of the principal eigenvector of A. The in-eigenvector centrality c_in(v) is the v-th component of the principal eigenvector of A^t.

The principal eigenvector of a matrix can be easily computed using the power iteration [MP29]:
\[ \vec{v}_{t+1} = \frac{A \vec{v}_t}{\lVert A \vec{v}_t \rVert} \]

Interpreting the eigenvector centrality, though, is not straightforward, as no inherent meaning is attached to an entry of the principal eigenvector. Probably the best intuition is that a node is assumed to be more central if it is connected to other central nodes [Ruh00].
A measure that is closely related to the eigenvector centrality is the PageRank. It models the probability with which a user clicking random links (edges) on web pages (nodes) ends up on a certain site; effectively, the PageRank interprets the adjacency matrix as a Markov process. To avoid local attractors such as nodes with no out-edges, the PageRank includes a constant, uniformly distributed restart probability in its model:

Definition 15 (PageRank). Let In(v) = {u | (u, v) ∈ E} be the in-neighbourhood of v. The PageRank for v and a damping factor α ∈ R_+ is defined as
\[ \mathrm{PR}(v) = \frac{1 - \alpha}{|V|} + \alpha \sum_{u \in \mathrm{In}(v)} \frac{\mathrm{PR}(u)}{d^+(u)} \]


The damping factor α controls the influence of the random restart. The recursive formulation of the PageRank can be rewritten in terms of a matrix power, which allows to obtain the limit distribution via an eigenvalue decomposition [Pag+99].
The above graph measures are well known and implemented in popular software packages such as Cytoscape [Sha+03; Ass+08]. As all of them are general-purpose measures, none directly estimates the number of subgraphs of size k in which a node is contained. Hence, to supplement the existing measures, we devised our own descriptor. It is based on the number of spanning trees that use the outgoing edges of a node. The idea behind using spanning trees is that every rooted, deregulated subgraph can be trivially transformed into a spanning tree of the selected nodes (cf. [Dit+08]). Thus, knowing the number of spanning trees that are present in the neighbourhood of a node should allow a good estimate of the number of deregulated subgraphs the node is part of. The measure relies on Kirchhoff's theorem [Kir47], which states that the number of spanning trees rooted in a vertex can be obtained from the cofactors of the Kirchhoff matrix. Let us start with the definition of a cofactor.

Definition 16 (Cofactor). Given a matrix A ∈ R^{n×n}, the ij-th cofactor of A is given by
\[ C_{ij}(A) := (-1)^{i+j} \det(M_{ij}) \]
where M_{ij} ∈ R^{(n-1)×(n-1)} is the matrix obtained by deleting the i-th row and j-th column from A.

The Kirchhoff matrix, also known as the graph Laplacian, is based on the adjacency matrix. It can be interpreted as a discrete analogue of the Laplacian operator from calculus or the Laplace-Beltrami operator from vector analysis.

Definition 17 (Directed Kirchhoff Matrix). Given the directed graph G(V, E), its Kirchhoff matrix K is defined as:
\[ K_{ij} = \begin{cases} d_i & \text{if } i = j \\ -1 & \text{if } (i, j) \in E \\ 0 & \text{otherwise} \end{cases} \]

By construction, K^t has an eigenvalue of zero corresponding to the eigenvector 1_n, the vector in which each component has the value 1. If G has m connected components, the kernel of K has dimension at least m, as for every connected component we can create an eigenvector for the eigenvalue zero by simply setting all coefficients that belong to nodes inside the connected component to one and all others to zero.
We will now describe the relation between the Kirchhoff matrix and the number of spanning trees. For this, we first need to define the notion of an out-tree.

Definition 18 (Out-tree). An out-tree is a connected, directed acyclic graph which is rooted in a single vertex. The root vertex has in-degree zero and every other vertex has in-degree one.


Given the definitions above, we are able to state the following theorem by Tutte [Tut48] that links the number of spanning trees rooted in a vertex with the cofactors of the Kirchhoff matrix. It is a generalisation of Kirchhoff's Matrix-Tree theorem [Kir47] for undirected graphs.

Theorem 3.5.1 (Directed Matrix-Tree Theorem). Let K be the Kirchhoff matrix of G(V,E). For any vertex i ∈ V , the number of out-trees rooted at i equals Cij(K) for an arbitrary choice of j.

We can use this theorem to compute the number of spanning trees of G. For this, we simply need to select a vertex i from which all other vertices are reachable. If no such vertex exists, G has no spanning tree. Otherwise, we can select an arbitrary column j and compute the number of spanning trees as C_ij(K). For our purposes, this trick is not directly useful, as the number of spanning trees is necessarily the same for all vertices that can serve as the root of a spanning tree. However, we can use this technique to compute the number of spanning trees that use a given edge e by computing the number of spanning trees in G minus the number of spanning trees that remain in the graph after e has been removed:

#spTree(e) = #spTree(G) − #spTree(G − e)

This yields an importance measure for every edge: the more spanning trees use this edge, the more important it is. For computing vertex im- portances, we can simply accumulate the importances of all incident edges.

Definition 19 (Spanning tree weight). Let G(V, E) be a directed graph and let E_v ⊆ E be the set of incident in- and out-edges of the node v ∈ V. We define the spanning tree weight of v as
\[ \mathrm{sp}(v) = \sum_{e \in E_v} \#\mathrm{spTree}(e) \]
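A dense, numerically naive way to evaluate this descriptor is to compute the cofactors of the Kirchhoff matrix directly with NumPy, as in the sketch below; note that the original implementation uses exact arithmetic (GMP/MPFR), and that we read Definition 17 with the in-degree on the diagonal, one consistent reading under which the cofactor counts out-trees as in Theorem 3.5.1.

```python
import numpy as np

def kirchhoff(nodes, edges):
    # -1 at position (u, v) for every edge u -> v and the in-degree on the
    # diagonal (our reading of Definition 17; it also makes K^t annihilate
    # the all-ones vector, as noted in the text).
    idx = {v: i for i, v in enumerate(nodes)}
    K = np.zeros((len(nodes), len(nodes)))
    for u, v in edges:
        K[idx[u], idx[v]] -= 1.0
        K[idx[v], idx[v]] += 1.0
    return K, idx

def num_out_trees(nodes, edges, root):
    # Directed matrix-tree theorem (Theorem 3.5.1): delete the root's row and
    # column and take the determinant, i.e. the cofactor C_rr(K).
    K, idx = kirchhoff(nodes, edges)
    r = idx[root]
    M = np.delete(np.delete(K, r, axis=0), r, axis=1)
    return round(float(np.linalg.det(M)))

def spanning_tree_weight(nodes, edges, root, v):
    # sp(v): spanning trees using an edge incident to v, accumulated over all
    # incident edges via #spTree(e) = #spTree(G) - #spTree(G - e).
    total = num_out_trees(nodes, edges, root)
    weight = 0
    for e in edges:
        if v in e:
            weight += total - num_out_trees(nodes, [f for f in edges if f != e], root)
    return weight

# Hypothetical toy graph: a -> b, b -> c, a -> c.
nodes = ["a", "b", "c"]
edges = [("a", "b"), ("b", "c"), ("a", "c")]
print(num_out_trees(nodes, edges, "a"))            # 2 out-trees rooted at "a"
print(spanning_tree_weight(nodes, edges, "a", "b"))
```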

For computing the degrees, the betweenness and closeness centrality, and the clustering coefficient we used the NetworkAnalyzer [Ass+08] plugin for Cytoscape [Lop+10]. The PageRank and spanning tree weights were computed using a C++ implementation based on the BGL [SLL01], Eigen 3⁹, GMP¹⁰, and MPFR¹¹ libraries. The eigenvector centrality was computed using Python [VD09] as well as the NetworkX [SS08a] and SciPy [J+01] packages.

Predicting ILP Counts

We finally have all required pieces in place to predict the likelihood with which a node is selected as part of a deregulated subgraph. To this end, our procedure for obtaining the node selection probabilities

9 http://eigen.tuxfamily.org
10 https://gmplib.org/
11 http://www.mpfr.org/index.html

is structured in the following way: first, we generate uniformly distributed scores for each node in the network. Subsequently, the scores are permuted 20,000 times and for each permutation the corresponding subgraph of size k = 10 is computed. For each node we count the number of times it is selected as part of a subgraph. When training our random forest predictor, these “ILP counts” are used as a surrogate for each node's probability to be selected by the algorithm. To derive our feature set, we computed the topology descriptors presented above. For training the model, two thirds of the vertices were randomly assigned to the training set and one third to the test set. We used the random forest implementation provided by the randomForest [LW02] package for R [R C16] with default parameters.
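In Python terms, a sketch of this pipeline could look as follows (the thesis itself used the R randomForest package; the graph G and the dictionary ilp_counts obtained from the permutation-and-solve step are assumed to be given, and only a subset of the descriptors is shown).

```python
import numpy as np
import networkx as nx
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

def topology_features(G):
    """Per-node topology descriptors for a directed graph G (subset of those in the text)."""
    descriptors = [
        dict(G.in_degree()),
        dict(G.out_degree()),
        nx.betweenness_centrality(G),
        nx.closeness_centrality(G),
        nx.clustering(G),
        nx.pagerank(G, alpha=0.99),
    ]
    nodes = list(G.nodes())
    X = np.array([[d[v] for d in descriptors] for v in nodes])
    return nodes, X

# G: regulatory network (nx.DiGraph); ilp_counts: dict node -> selection count (assumed given).
nodes, X = topology_features(G)
y = np.array([ilp_counts[v] for v in nodes])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1 / 3, random_state=0)
model = RandomForestRegressor(n_estimators=500, random_state=0).fit(X_train, y_train)

pred = model.predict(X_test)
rmse = np.sqrt(np.mean((pred - y_test) ** 2))
print(rmse / (y.max() - y.min()))        # normalised RMSE as used in the text
print(model.feature_importances_)        # compare with Table 3.2
```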

Results

Using the above procedure, we trained predictors for the KEGG regulatory network, as available in the NetworkTrail [Stö+13] web server (cf. Section 5.4.4), and a random network with the same degree sequence, i.e. the same sorted list of node degrees. The latter was generated using the methodology described by Newman, Strogatz, and Watts [NSW01].
Figure 3.22 shows a scatter plot of the predicted vs. the empirical ILP counts for the KEGG network. Considering that the model was not tuned, the overall agreement is remarkable, with a normalised root-mean-square error (RMSE) of about 0.07. The normalised RMSE is computed by dividing the RMSE by the range of the input ILP counts. This clearly proves that it is possible to predict the likelihood with which a node is selected purely from topological features. For the human KEGG network, our results show that subgraphs obtained using the Subgraph ILP are subject to a substantial bias due to the underlying topology.
For assessing the importance of the features, we examine both the feature importance determined by the random forest model and Spearman's correlation coefficient of the feature with the ILP counts. Spearman's correlation [Spe04] is a non-linear version of Pearson's correlation coefficient [Pea95b]: given two lists X, Y of length k, each value x_i, y_i is assigned its position or rank R(X_i), R(Y_i) in the sorted lists X̃, Ỹ. Spearman's correlation is then simply the Pearson correlation of the ranks:
\[ \rho_S(X, Y) = \frac{\sum_i (R(X_i) - \bar{R}(X))(R(Y_i) - \bar{R}(Y))}{\sqrt{\sum_i (R(X_i) - \bar{R}(X))^2 \sum_i (R(Y_i) - \bar{R}(Y))^2}} \]

Here, R¯(X) and R¯(Y ) denote the mean ranks of X and Y , respectively. The obtained importance weights are given in Table 3.2. Based on this data, the spanning tree measure is clearly the most important predictor for the ILP count. Similar results can be obtained when training the model on the mouse KEGG regulatory network (normalised RMSE ≈ 0.08) or our randomly generated graphs (normalised RMSE ≈ 0.06). In addition,


[Scatter plot: observed ILP counts (log, x-axis) vs. predicted ILP counts (log, y-axis); normalised RMSE: 0.0703.]

Figure 3.22: Log-log plot of the observed vs. predicted ILP counts for the human KEGG network test data. The counts were predicted by applying the random forest model to the test data.

Topology descriptor            ρS        Feature importance
Spanning tree weight           0.8799    1298.23
In-degree                      0.4641    376.76
Betweenness centrality         0.4889    324.29
PageRank (d = 0.99)            0.5124    262.57
Clustering coefficient         0.3898    120.18
Out-degree                     0.0961    102.25
Out-eigenvector centrality     0.3633    94.26
Closeness centrality           -0.1449   93.94
In-eigenvector centrality      0.7369    57.29

Table 3.2: Analysis of the performance of the individual topology descriptors. Both Spearman's correlation ρ_S and the feature importance as determined by the random forest model are given. The spanning tree measure dominates all other descriptors.

training the model on the mouse KEGG network and predicting the ILP counts for the human network results in a normalised RMSE of ≈ 0.09. This result suggests that our predictor is not overfitted to the network it was trained on. Instead, the model generalises to yet unseen data. This further confirms that the node selection bias can in fact be predicted from purely topological properties.

Discussion

We presented theoretical results that suggest that the probability with which a node is selected as part of the most deregulated subgraph highly depends on its topological properties. To confirm this, we measured the bias of the Subgraph ILP induced by the underlying network topology using a sampling technique. To further confirm that this bias is purely based on a node's topological properties, we trained a random forest model on a set of topology descriptors. With this model, we were able to obtain good predictions for the node inclusion frequency. In particular, our novel spanning tree measure proved to be an excellent predictor. This suggests that the most deregulated subgraphs obtained for scores from an experiment might be influenced considerably by the network topology. While this is desirable to a certain extent, the user should keep this bias in mind when interpreting a deregulated subgraph.
For counteracting the described effect, a node reweighting strategy may be employed. We conducted initial experiments in this regard, in which we down-weighted the scores s_i of a node i according to a function of its probability p_i to be selected. Examples for potential functions are \( s_i/p_i \), \( s_i/\sqrt{p_i} \), and \( s_i/p_i^2 \). However, our preliminary results have not shown a significant impact on the selection probability.
There are various ways to continue the presented study. First, confirming the reported results for a different, published algorithm would demonstrate that the described topology bias is in fact a universal phenomenon and is not linked to the particularities of the Subgraph ILP. A candidate algorithm would need to support computing a large number of random samples to obtain reliable estimates of the node inclusion probabilities. Unfortunately, we were not yet able to find a second algorithm meeting this requirement. Additionally, the analysis should be rerun using networks from a different source. While our random networks share no common nodes with the KEGG networks, they are based on the same degree sequence and thus contain the same number of hubs and leaf nodes. Using a completely independent network could therefore result in different distributions of the inclusion probabilities. We are, however, confident that a similar bias is also detectable for different algorithms as well as different networks.
Finally, it should be noted that the reported biases were obtained for random score permutations and thus non-informative scores. In a real-world scenario, the scores used for computing a deregulated subgraph are likely to contain a signal that manifests as, e.g., the differential expression of specific gene sets. In this case, the score distribution

may, to a large degree, overrule the topology bias. For confirming this, a large-scale analysis of diverse expression datasets could be used.

4 ENRICHMENT ALGORITHMS

The analytical power should not be confounded with simple ingenuity; for while the analyst is necessarily ingenious, the ingenious man is often remarkably incapable of analysis. — Edgar Allan Poe, The Murders in the Rue Morgue (1841)

Comparative microarray and RNA-seq studies (Section 2.3) are experimental techniques for detecting differences in the transcriptional state between a sample and a reference group. Both techniques typically result in large matrices of measurements where each row corresponds to the expression values of a single gene and each column contains all measurements of a single sample. Due to the high number of measured genes it is difficult to interpret this data manually. Instead, robust statistical methods for performing automated analyses are required that are capable of condensing the information contained in these high-dimensional datasets into an interpretable report. Ideally, these reports would allow to quickly deduce the molecular key players and regulatory mechanisms that are responsible for the difference between sample and reference groups.
There are several ways to approach this task. For instance, a classifier that uses genes as features can be trained to discriminate between the two groups. Once training is complete, the parameters of the classifier can be examined to derive feature importance measures. The k most important features are then reported as the “key players” that are likely to be responsible for the observed differences. The drawback of this approach is that training a good classifier is by no means a trivial affair. Even then, the identified features are, while certainly discriminative, not necessarily causal for the observed effect. Thus, features which have not been considered by the classifier in lieu of “more informative” predictors may still carry valuable information that is not captured using this approach. Alternatively, techniques that do not focus on classification performance, but rather on providing a better understanding of the input data, can be employed to circumvent these issues. Examples are techniques such as principal component analysis (PCA) [Pea01] (see Section 3.4) or partial least squares (PLS) (cf. [FHT09]) that decompose the data into orthogonal coordinates which either explain the observed variance (PCA) or the group differences (PLS); in practice, regularised versions of PCA or PLS would be necessary. These coordinates can, again, serve as feature importances and can hence be used to extract the most relevant features.
The above approaches do not employ any kind of prior knowledge. However, a large collection of biological databases such as the Gene Ontology (GO) [Ash+00] exists that contain carefully collected, functional annotations for genes, proteins, and other biological entities. Instead of identifying novel patterns in the data, we can use these annotations to check whether known categories of entities behave differently between


the sample and reference group. Enrichment algorithms or enrichment methods are a class of procedures that make use of this information. In this sense, the algorithms for searching deregulated subgraphs (Section 3.5) are enrichment methods that use the admissible subgraphs as categories. As input, an enrichment algorithm relies on one or more sets of entities, so-called categories. Each category commonly corresponds to an annotation such as a GO term. In addition, the algorithm requires a set of measurements such as a gene expression matrix. As output, it computes an enrichment score, which reflects the overall “degree of deregulation”, and an associated significance measure for each input category. Owing to this somewhat loose problem definition, a multitude of enrichment procedures have been devised. In this chapter, we discuss a selection of popular enrichment methods. In particular, we examine their strengths and weaknesses as well as the considerations that are necessary when choosing between algorithms. First, however, as the computation of the significance of an enrichment score is a central part of all enrichment algorithms, we introduce the concepts of hypothesis tests and p-values.

4.1 statistical tests

A central task in science is the verification or falsification of hypotheses. To this end, experiments must be conducted which then provide evid- ence in favour of or against a hypothesis. Evaluating this evidence care- fully and objectively is thus a common responsibility in a research set- ting. For this purpose, statisticians have devised testing strategies that can be used to determine if a hypothesis remains valid or needs to be rejected in the light of new evidence. Two frameworks for conducting such hypothesis tests exist. The first framework was devised by Fisher [Fis25] and is centred around com- puting a p-value that measures how strongly the data contradicts a null hypothesis. The framework by Neyman and Pearson [NP33], on the other hand, has the goal of deciding between a null and an alternat- ive hypothesis. This is achieved by introducing a significance level α that controls the admissible rate of cases in which the null hypothesis is wrongly rejected. In today’s scientific practice a hybrid between the two models is used that explicitly uses both a p-value and a signific- ance level [Lew12]. The validity of this hybrid approach is heavily de- bated as it is, according to some statisticians, an “incoherent mishmash” [Gig93]. The central point of critique is that the current use of hypo- thesis tests neither allows the assessment of the strength of evidence nor allows to control the long-term error rates. Nevertheless, we will introduce concepts from both frameworks without explicitly differen- tiating between the two.

4.1.1 Null and Alternative Hypothesis

The central concept of all hypothesis tests is the so-called null hypothesis H0. It represents a default, commonly accepted belief that is


assumed to be true unless evidence against it is established. If the H0 hypothesis is rejected, a different explanation for an effect needs to be found. This alternative explanation is commonly formulated as the alternative hypothesis H1. (The notion of H1 and the decision between H0 and H1 stem from the Neyman-Pearson framework.) We will formulate hypotheses in terms of the parameters of the underlying distribution. For instance, when a decision should be made whether two groups stem from the same or from two different generating normal distributions, we write

H0 :µ1 = µ2

H1 :µ1 ≠ µ2

This means that we assume the population means of both groups to be equal unless our data provides sufficient evidence against this hypothesis. In that case, we assume H1 to be true. In order to assess the strength of the evidence for rejecting H0, we compute a test statistic T. A test statistic is a function transforming input data into a scalar value. As the input data can be modelled as a random variable, T itself is a random variable, too. To be able to better gauge the value of a test statistic, we can assign a so-called p-value to each outcome.

Definition 20 (p-value). Let H0 be the null hypothesis and T a test statistic (here, T is used as a random variable). Let T0 be the value of the test statistic obtained from a dataset. The p-value is defined as the probability to obtain a result from T that is at least as extreme as T0, given that H0 holds. More formally,
\[ p := \Pr(T \ge T_0 \mid H_0) \]
for a suitable definition of T.
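For the two-group comparison above (H0 : µ1 = µ2), a two-sample t-test yields such a test statistic together with its p-value; the data below are simulated purely for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=1.0, scale=1.0, size=30)      # "sample" group
reference = rng.normal(loc=0.0, scale=1.0, size=30)   # "reference" group

t0, p_value = stats.ttest_ind(sample, reference)      # H0: equal population means
print(t0, p_value)                                     # a small p-value: the data are surprising under H0
```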

It is important to note that the p-value is a conditional probability. It can be interpreted as the surprise of obtaining a certain sample under the assumption that the null hypothesis is true. As such, if we obtain a low p-value, we either observed an unlikely event or made an error in assuming that H0 holds. Unfortunately, the p-value is often misunderstood and misused, despite regular warnings from the statistical community. The gravity of the problem is illustrated by the fact that the American Statistical Association felt forced to release an official statement containing basic principles for the use of p-values [WL16].
If a binary decision between H0 and H1 is required, we need to establish a significance level α. The significance level determines a threshold for the test statistic above which the user is willing to discard H0 and adopt H1. Alternatively, it determines the size of the tail(s) of the null distribution into which the test statistic of an observation has to fall in order to be considered significant (Figure 4.1 illustrates the critical region in a hypothesis test). Once we have made this choice, there are four possible outcomes of the test (Table 4.1). If the H0 or H1 hypothesis holds and is in fact chosen, the test procedure worked correctly. If, however, H0 holds but is rejected, a so-called type I error is committed. Vice versa, if H1 holds and H0 is not rejected, this results in a type II error. The significance level α directly controls the type I error of the test. This means that the expected number of type I errors when conducting the experiment n times with different samples is given as


            H0 correct          H1 correct
H0 chosen   -                   Type II error (β)
H1 chosen   Type I error (α)    -

Table 4.1: Possible errors in which applying a significance test can result. The type I error can be controlled by choosing an appropriate significance threshold α. The type II error is more difficult to control and, besides α, also depends on the test statistic as well as the number of samples.

α · n. The type II error (β) is not directly controlled by α alone. Instead, it additionally depends on the sample size and the chosen test statistics. We call the quantity 1 − β the power of the test. Similar to choosing α, an acceptable power level should be chosen a priori while planning the experiment. Then, depending on α, the appropriate sample size for obtaining the required power needs to be determined. In practice, an a priori power analysis can prove difficult, e.g. when analysing public datasets, as collecting a sufficient number of samples is not possible in that case. Also, for some test statistics it is not possible to reliably estimate their power. Due to this, the power analysis is unfortunately often omitted.
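To make the idea of an a priori power analysis more concrete, the following Python sketch computes the required sample size and the achieved power for a two-sample t-test using the statsmodels package. It is only an illustration: the effect size, significance level, and target power are hypothetical values and not taken from any analysis in this thesis.

from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Samples per group needed to detect a standardised effect of d = 0.8
# at alpha = 0.05 with a power of 1 - beta = 0.8 (hypothetical values).
n_required = analysis.solve_power(effect_size=0.8, alpha=0.05, power=0.8)
print(f"required samples per group: {n_required:.1f}")

# Conversely, the power that is actually achieved with a fixed sample size.
achieved = analysis.power(effect_size=0.8, nobs1=10, alpha=0.05)
print(f"power with 10 samples per group: {achieved:.2f}")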

4.1.2 Multiple Hypothesis Testing

In most studies more than one hypothesis is tested. This is especially true for biological high-throughput experiments such as microarray analyses, in which thousands of genes are tested for differential expression or categories are tested for enrichment. (An illustration of this issue can be found under https://xkcd.com/882/.) For each individual test a fixed significance threshold α is used to determine whether to keep or reject the null hypothesis. Remember that the significance threshold, as introduced in Section 4.1, controls the type I error. Thus, when conducting 100 tests at significance level α = 0.05, we can expect α · 100 = 5 false positive predictions by chance. In order to control for this effect, either the significance threshold or the p-values obtained from the hypothesis tests need to be adjusted. Probably the simplest method to do so is the Bonferroni procedure [Bon35; Bon36]. Assume we conducted n tests, yielding p-values p1, ..., pn. Instead of accepting pi when it is below the significance threshold, we accept pi if n · pi ≤ α holds. This ensures that the probability of making one or more errors is less than α.

Proof. Let H1, . . . , Hn be the set of tested hypotheses and let I0 with n0 = |I0| be the subset of true null hypotheses. The probability of making at least one false discovery is then given by the probability of rejecting one of the null hypotheses in I0:

Pr( ⋃_{i∈I0} (pi ≤ α) )


Substituting α with α/n and applying Boole's inequality yields

Pr( ⋃_{i∈I0} (pi ≤ α/n) ) ≤ Σ_{i∈I0} Pr(pi ≤ α/n) ≤ n0 · α/n ≤ n · α/n = α

The probability of making at least one false discovery is also called the family-wise error rate (FWER). Methods that ensure that the FWER is at most as high as a given threshold are said to control the FWER. As such, the Bonferroni procedure is one of the strictest available methods. Other methods that achieve tighter bounds, such as the Šidák [Šid68; Šid71] or the Holm-Bonferroni [Hol79] procedures, are available. In a biological setting, where the statistical power is already low due to small sample sizes, controlling the FWER is too conservative [Nak04]. Benjamini and Hochberg [BH95] introduced the concept of the false discovery rate (FDR) as well as a method using this statistics as a target for adjustment. In contrast to the FWER, the FDR is the expected proportion of false discoveries among all significant discoveries. Controlling the FDR instead of the FWER gives more control over the number of acceptable false positive discoveries and is especially helpful in scenarios in which false positive detections can be tolerated. The Benjamini-Hochberg procedure controls the FDR for a set of independent hypotheses and works as follows (if the hypotheses are dependent, the procedure by Benjamini and Yekutieli [BY01] should be used instead): given the set of input p-values p1, . . . , pn, which w.l.o.g. we assume to be sorted in ascending order, find the largest k such that

pk ≤ (k/n) · q

where q is the desired FDR level. Given k, reject the null hypothesis for all Hi with i = 1, . . . , k. Alternatively, each p-value can be adjusted using the formula [YB99]:

qi = min{ qi+1, (n/i) · pi, 1 }

Note that while the resulting values still fall into the interval [0, 1], they can no longer be interpreted as p-values. Instead, they indicate the FDR level that can be obtained by rejecting the null hypothesis for all qi ≤ q.
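As a minimal illustration of the two adjustment procedures discussed above, the following Python sketch implements the Bonferroni correction and the Benjamini-Hochberg step-up adjustment. The p-values are made up and the function names are our own; they do not correspond to any particular library.

import numpy as np

def bonferroni(pvals):
    # Multiply each p-value by the number of tests and cap at 1.
    pvals = np.asarray(pvals, dtype=float)
    return np.minimum(pvals * len(pvals), 1.0)

def benjamini_hochberg(pvals):
    # FDR-adjusted q-values: q_i = min(q_{i+1}, (n/i) * p_i, 1).
    pvals = np.asarray(pvals, dtype=float)
    n = len(pvals)
    order = np.argsort(pvals)
    ranked = pvals[order]
    raw = ranked * n / np.arange(1, n + 1)
    # enforce monotonicity from the largest rank downwards
    q = np.minimum(np.minimum.accumulate(raw[::-1])[::-1], 1.0)
    adjusted = np.empty(n)
    adjusted[order] = q
    return adjusted

p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(bonferroni(p))
print(benjamini_hochberg(p))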

4.1.3 Enrichment Methods and Hypothesis Tests

In terms of hypothesis tests, another motivation for the use of enrichment methods, besides the efficient incorporation of prior knowledge, exists. (We made a similar argument concerning deregulated subgraphs in Section 3.5.) Consider the search for differentially expressed genes, where each gene is tested for a significant difference in its expression level between two groups. A limiting factor for these analyses is the statistical power of the used hypothesis tests. In this case, the power quantifies the ability to detect truly differentially expressed genes. When


applied at large scale, such as when screening for differentially expressed genes, classical hypothesis tests tend to have poor power [Efr07]. Efron [Efr08] argues that by pooling entities into predefined groups, and thus performing fewer tests, a significant increase in statistical power can be achieved.

Enrichment Method Naming Issues
Unfortunately, the terminology around enrichment methods is not consistent throughout the literature. The terms "gene-set approaches", "gene-set analysis", and "gene-set enrichment analysis" are used interchangeably. Furthermore, these names imply that the underlying methodology is only applicable for analysing gene expression data. In general, though, these methods can be applied to any annotation defined over a set of biological entities such as proteins, genes, or miRNAs. We thus do not use the term gene-set enrichment analysis and will instead use the more general term enrichment method. Furthermore, this choice avoids possible mix-ups between the general concept of enrichment algorithms and the gene set enrichment analysis (GSEA) method published by Subramanian et al. [Sub+05]. Accordingly, we simply call the result of an enrichment method an enrichment. Instead of using the term gene, we use (biological) entity in order to refer to genes, mRNAs, miRNAs, or proteins.

4.2 a framework for enrichment analysis

The basic principle behind an enrichment algorithm is to determine whether a category contains, e.g., more differentially expressed genes or abundant proteins than expected given the remainder of the dataset. There are various ways to compute this; a detailed description of the mentioned test statistics follows in Section 4.2.4. The popular gene set enrichment analysis (GSEA) [Sub+07] sorts the input entities in descending order with respect to a relevance score (Section 4.2.4). Next, a running sum is computed by traversing the list and increasing the sum each time a member of the category is encountered. Conversely, the sum is decreased for every entity that is not a member of the category. The maximum deviation of the sum from zero is then the final enrichment score for which a p-value can be computed. Alternatively, we can determine a set of "interesting" entities, e.g. by imposing a significance cut-off on differential expression. Given a category, we can then count the number of category members among these "interesting" entities. Using a combinatorial model it is possible to determine how likely it is to obtain at least as many category members by chance. This is the idea behind the overrepresentation analysis (ORA) [Drǎ+03] approach (Section 4.2.4). What both approaches have in common is that they first require a measure or score that reflects the relevance of each input entity. These entity-level scores, in combination with a set of categories, are then used


Figure 4.2: Schematic of an enrichment method adapted from Ackermann and Strimmer [AS09]. All enrichment methods depend on data, categories, and phenotypic information. For computing an enrichment score, either a global test or a three-step procedure is used. Afterwards, a p-value for the obtained enrichment score is computed.

to compute a set-level statistics for which, in turn, a p-value is computed. Ackermann and Strimmer [AS09] recognised these similarities and proposed a generalised framework for building and classifying enrichment methods (Figure 4.2). In this framework, the input is assumed to be a matrix of (normalised) measurements as well as a set of categories that should be tested for significance. Additionally, a phenotype annotation is required to partition the input data into a sample and a reference group. Given this input, an enrichment score is computed for each category. Here, the framework distinguishes between two schemes for computing an enrichment score. The first scheme comprises three steps: evaluating an entity-level statistics, a transformation step, as well as the set-level statistics. The entity-level statistics reduces the input data from multiple samples to a single score for each entity and will be introduced in Section 4.2.2. In Section 4.2.3 we will give a short overview of the transformations that can be applied to the entity-level scores in order to improve the overall performance of the method. Given the transformed scores, we can then use one of the set-level statistics presented in Section 4.2.4 to compute a final enrichment score for each category. The second scheme directly transforms the input data into enrichment scores. As tests adhering to this scheme do not rely on precomputed


scores but use the complete data matrix, they can explicitly incorporate interactions between entities into their underlying models. For this reason, Ackermann and Strimmer [AS09] refer to them as global tests (see Section 4.2.5). After enrichment scores have been computed, the significance of the individual scores must be assessed. For this purpose three strategies exist: sample-based evaluation, entity-based evaluation, and the restandardisation method. We will discuss these strategies in more detail in Section 4.2.1. After this, the raw p-values must be adjusted for multiple testing in order to prevent false positive discoveries (Section 4.1.2). First, however, we will start with how to choose the enrichment method that is the most appropriate for a given research task.

4.2.1 Choosing the Null-Hypothesis

At the beginning of every research task, ideally before any data has been generated, a hypothesis that should either be confirmed or disproved must be chosen. Naturally, this is also true for enrichment analyses. According to Tian et al. [Tia+05], there are two natural ways to state a null hypothesis for deciding the significance of a single category (gene set):

Hypothesis Q1: “The genes in a gene set show the same pattern of associations with the phenotype compared with the rest of the genes.”

Hypothesis Q2: “The gene set does not contain any genes whose ex- pression levels are associated with the phenotype of interest.”

It is important to note that although Q1 and Q2 appear similar, there is a subtle difference. Q1 considers how strongly the entities in a category are associated with the phenotype, compared to entities that are not members of the category. As a result, even if the category does not contain entities that are significantly associated with the phenotype, it can be declared significant, as it (seemingly) shows a stronger pattern of association than the remaining genes. In contrast, Q2 only considers the entities in the category of interest and their association with the phenotype. While Q2 circumvents the problem of Q1, it has its own drawbacks. In particular, larger categories, which are more likely to contain relevant genes by chance, would be significant under Q2. Owing to these interpretations of Q1 and Q2, Goeman and Bühlmann [GB07] categorise enrichment methods employing Q1 as competitive and methods employing Q2 as self-contained. Nam and Kim [NK08] argue that some enrichment methods, including the popular GSEA procedure [Sub+05], neither compute significance values with respect to Q1 nor to Q2, but rather operate on a third null hypothesis. (Ackermann and Strimmer [AS09] also mention a Q3 hypothesis that is based on Efron [Efr08] and is not related to the hypothesis presented here.)

Hypothesis Q3: "None of the gene sets considered is associated with the phenotype."


(a) Entity-based:
Data: Entity-level scores S, set-level score r, and category c.
Result: Empirical p-value for c.
Set i = 0, k = 0
while i < nperm do
    Ŝ = permuteScores(S)
    r̂ = setStatistics(Ŝ, c)
    if r̂ ≥ r then
        k = k + 1
    end
    i = i + 1
end
return p = (k + 1)/nperm

(b) Sample-based:
Data: Data X, set-level score r, and category c.
Result: Empirical p-value for c.
Set i = 0, k = 0
while i < nperm do
    X̂ = permuteSamples(X)
    Ŝ = entityStatistics(X̂)
    r̂ = setStatistics(Ŝ, c)
    if r̂ ≥ r then
        k = k + 1
    end
    i = i + 1
end
return p = (k + 1)/nperm

Algorithm 4.1: Algorithms for computing an empirical p-value for a given category: (a) entity-based, (b) sample-based. The essential difference between the entity-based and the sample-based strategy is when and in which dimension the randomisation takes place. In both cases, a pseudocount is added to avoid p-values of zero [Kni+09].
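A minimal Python sketch of the entity-based variant of Algorithm 4.1 could look as follows. The choice of the mean score of the category members as set-level statistics and the number of permutations are illustrative assumptions and not the defaults of any particular tool.

import numpy as np

def entity_based_p_value(scores, category_idx, n_perm=10_000, seed=0):
    rng = np.random.default_rng(seed)
    set_statistic = lambda s: s[category_idx].mean()
    observed = set_statistic(scores)
    hits = 0
    for _ in range(n_perm):
        permuted = rng.permutation(scores)        # shuffle the entity-level scores
        if set_statistic(permuted) >= observed:   # at least as extreme as observed
            hits += 1
    return (hits + 1) / n_perm                    # pseudocount as in Algorithm 4.1

scores = np.random.default_rng(1).normal(size=1000)
category = np.arange(25)                          # toy category: the first 25 entities
print(entity_based_p_value(scores, category))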

In contrast to Q1 and Q2, Q3 does not make a statement about a single category, but rather assesses a set of categories for a significant association with the phenotype. It can be interpreted as a mixture of Q1 and Q2. Nam and Kim [NK08] argue that this is the case for GSEA because it uses the competitive Kolmogorov-Smirnov (KS) statistics [Kol33; Smi48] as set-level statistics but employs a self-contained strategy for significance assessment. The above observations also have practical implications. First, as in the case of GSEA, the chosen set-level statistics can implicitly determine the H0 hypothesis and thus needs to be chosen with this in mind. Furthermore, the method used for assessing the significance of the enrichment scores needs to be chosen to match the desired hypothesis. To this end, a choice can be made between two main alternatives. Imagine a category Ci with a score ci. We can now ask ourselves how likely it is that another category Cj, consisting of |Ci| randomly selected entities, has a score larger than or equal to ci. We call this scenario, which reflects Q1, the entity-based evaluation strategy, for which we can compute empirical p-values by using a permutation test (Algorithm 4.1). (For some hypothesis tests, such as the KS test or the t-test family, exact entity-based p-values can be computed; we indicate this in the respective discussion.) In essence, a permutation test estimates the null distribution of the set-level scores. This is done by generating random permutations of the entity-level scores that are then used as input for recomputing a collection of null set-level scores. The number of times such a permuted set-level score was equal to or exceeded the original score is counted and divided by the total number of permutations, yielding an empirical p-value. The advantage of the entity-based strategy is that it takes the distribution of all entity-level scores into account. Its disadvantage is that the correlation structure of the entities in a category is neglected. This correlation structure is likely to be important as categories, by design,


tend to consist of functionally related entities [ET07]. However, we can also ask the question how likely it is that Ci was assigned a score larger than or equal to ci given that the phenotype labels of the samples were randomly assigned. We call this scenario, which reflects Q2, the sample-based evaluation strategy. (The terminology used by [ET07] is randomisation for the entity-based and permutation for the sample-based strategy.) Clearly, permuting the samples leaves the correlation structure inside a category intact. Unfortunately, the entity-level score distribution is destroyed, as all scores are recomputed after sample permutation and only the entities in a category have an influence on the final enrichment score [ET07]. Efron and Tibshirani [ET07] argue that in order to combine the strengths of the entity-based and the sample-based evaluation strategy, both should be executed simultaneously. They name this combined strategy the restandardisation method and apply it together with their proposed maxmean statistics. From a computational point of view, the entity-based strategy is often more efficient, as it can exploit the distribution of the enrichment scores. For example, it is possible to compute p-values for t-test based methods using Student's t-distribution. In contrast, the sample-based as well as the restandardisation strategy always necessitate the use of permutation tests, which can require considerable amounts of runtime.

4.2.2 Entity-level Statistics

Entity-level scores are scalar values that reflect the "importance" of an entity. Commonly, entity-level scores are derived from the differential analysis of two sample groups. Scores are computed by evaluating a statistics that reduces the input data to a single scalar value. Ackermann and Strimmer [AS09] report that the choice of entity-level score has only little effect on the overall results of the enrichment procedure. It should be noted, though, that for small sample sizes the choice of an appropriate entity-level statistics may well have a higher influence (see Appendix A.1.3). In the following we will list a few popular choices for such entity-level statistics. Let X = (x1, x2, . . . , xn) ∈ ℝ^(p×n) and Y = (y1, y2, . . . , ym) ∈ ℝ^(p×m) be the sample and reference set, respectively. The sample mean and sample variance of entity k are denoted by x̄k and s²Xk, respectively.

Fold-Change The most basic entity-level score is the fold change. It is defined as the ratio between the mean expression of the two groups X and Y:

fc(k) = x̄k / ȳk

As expression values are larger than zero, the fold change is also positive and distributed around 1. However, the distribution is asymmetric, as cases where ȳk > x̄k holds occupy the interval (0, 1) while the converse cases occupy the interval [1, ∞). To make both ranges comparable it is common to apply a logarithm, yielding the log-fold-change:

log2(fc(k)) = log2(x̄k) − log2(ȳk)


For an expression dataset, a log2-fold-change of 0 is thus equivalent to no change in expression between X and Y, while a log2-fold-change of +1 or −1 can be interpreted as an over- or under-expression by a factor of two, respectively.

The t-Test Family As the fold change is simply defined as the ratio of two means, it does not respect the variance of the measurements. This can be problematic as, for instance, a shift of the mean expression between the sample and reference group for an entity that exhibits a high variability is not as significant as a similar change for an entity that exhibits a much lower variability. Accordingly, fold changes are biased towards entities with high variability, as it is more likely that they obtain a larger score than other entities. To account for this, statistics that incorporate the sample variances should be used. The most common choice for such a statistics is Welch's t-test [Wel47; Sat46], which is appropriate for normally distributed samples. (The degrees of freedom for Welch's t-test are computed via the Welch-Satterthwaite equation.)

tk = (x̄k − ȳk) / √( s²Xk/n + s²Yk/m )

For paired samples, Student's t-test can be applied to the pairwise differences dki := xki − yki (paired samples imply that n = m), resulting in the test statistics:

tk = √n · d̄k / sDk

If applicable, the paired version of the t-test has the advantage that it offers higher statistical power than the independent version. Additionally, simpler versions of the t-test can be applied if the mean of a sample is to be compared against a known, global mean.
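The following sketch shows how the log-fold-change and Welch's t-test could be computed for a single entity using NumPy and SciPy; the expression vectors are synthetic and serve purely as an example.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=8.0, scale=1.0, size=12)   # sample group for entity k
y = rng.normal(loc=7.0, scale=1.2, size=15)   # reference group for entity k

log2_fc = np.log2(x.mean()) - np.log2(y.mean())

# equal_var=False selects Welch's t-test; the degrees of freedom follow
# the Welch-Satterthwaite equation.
t_stat, p_value = stats.ttest_ind(x, y, equal_var=False)

print(f"log2 fold change: {log2_fc:.2f}, t = {t_stat:.2f}, p = {p_value:.3g}")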

Shrinkage t-Test While high-throughput experiments are able to capture profiles for thousands of entities simultaneously, most of the time the number of measured samples is considerably lower. The practical implication of this is that all estimates obtained from the data, such as sample means and sample variances, are subject to a sizeable amount of uncertainty. For the t-test family, errors in the sample variance are especially problematic, as too small or too large variances can over- or understate the importance of an entity. To combat this, Opgen-Rhein and Strimmer [OS07] introduced the shrinkage t-statistic. The idea behind the shrinkage t-statistic is to move ("shrink") the estimated entity variances closer to the overall median variance, thereby increasing small and decreasing large variances (Figure 4.3). James and Stein [JS61] showed that, if multiple, simultaneous estimates are required for some parameters, this combined strategy is guaranteed to perform better than estimating the parameters individually. Most intriguingly, the estimated quantities do not need to be related in any way [EM77]. This phenomenon is often entitled "Stein's paradox". Another advantage of the James-Stein framework is that the optimal amount of shrinkage is determined from the data alone. As a consequence, no additional tuning parameter is introduced into the estimation process.

Figure 4.3: Sketch of the effect of shrinkage procedures. The original data points (solid circles) are shifted towards the shrinkage target x, resulting in new, "shrunk" data points.


In the following we introduce the estimator for the sample variance by Opgen-Rhein and Strimmer [OS07]. Recall the usual, unbiased estimator for the sample variance (Equation (A.1)):

s²Xk = 1/(n − 1) · Σ_{i=1}^{n} (xki − x̄k)²

Instead of directly using s²Xk in the denominator of the t-statistic, we replace it with a convex combination of sXk and the median standard deviation of all genes, smedian:

s⋆Xk = λ⋆ smedian + (1 − λ⋆) sXk

This effectively shifts all variances closer to smedian. The optimal mixing proportion λ⋆ is given by

λ⋆ = min( 1, Σ_{k=1}^{n} var̂(sXk) / Σ_{k=1}^{n} (sXk − smedian)² )    (4.1)

The term var̂(sXk) denotes an estimate of the variance of the sample standard deviations and can easily be computed using the unbiased variance estimator. This formula can be interpreted as follows: if the variances can be reliably estimated from the data and, thus, var̂(sXk) is small, only little shrinkage will take place. Conversely, if they cannot be estimated reliably, the numerator is large and more shrinkage towards the median is used. On the other hand, the denominator in Equation (4.1) measures how well the median standard deviation represents all standard deviations. If this difference is large, less shrinkage is used. For large sample counts, the shrinkage t-statistic converges to the usual t-statistic. This means that the former can be used unconditionally as a replacement for the latter.
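The following rough sketch illustrates the shrinkage idea of Equation (4.1). Note that the way var̂(sXk) is estimated here (a delta-method approximation based on the empirical variance of the squared deviations) is a simplification; the exact estimator used by Opgen-Rhein and Strimmer differs in detail.

import numpy as np

def shrink_standard_deviations(X):
    # X: entities x samples matrix; returns shrunken per-entity standard deviations.
    n = X.shape[1]
    s = X.std(axis=1, ddof=1)                     # per-entity standard deviations
    s_median = np.median(s)

    # crude estimate of Var(s_k): first Var(s^2_k), then the delta method
    w = (X - X.mean(axis=1, keepdims=True)) ** 2
    var_s2 = n / (n - 1) ** 3 * ((w - w.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)
    var_s = var_s2 / (4 * s ** 2)

    lam = min(1.0, var_s.sum() / ((s - s_median) ** 2).sum())   # Equation (4.1)
    return lam * s_median + (1 - lam) * s

X = np.random.default_rng(0).normal(size=(500, 6))
print(shrink_standard_deviations(X)[:5])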

Z-score In some cases, such as when analysing a patient sample for treatment optimisation, it is desirable to compare just a single sample against a larger reference group. As it is not possible to compute variances for a single sample, it is also not possible to use a t-test for computing entity-level scores. To at least incorporate the variance of the reference set, the z-score statistics can be used. It is defined as follows:

zk = (xk − ȳk) / sYk

Using the z-score, normally distributed data can be transformed such that it follows a standard normal distribution N(0, 1), which can be used to determine a p-value for xk.

Wilcoxon rank-sum test The Wilcoxon rank-sum test [Rin08] is a non-parametric test. It shines as an alternative to the independent two-sample t-test if the input data does not follow a normal distribution. (Just as Spearman's correlation (Section 3.5.3) is a rank-based version of Pearson's correlation, the Wilcoxon test is a rank-based version of Student's t-test.) Even in cases where the t-test is applicable, the Wilcoxon rank-sum test offers competitive performance and greater robustness against outliers.


Just as the t-test, it can be used to check whether the expected values of two samples coincide. Compared to the latter, it is solely based on the relative order of the values between the two samples. The test statistics is based on the rank sum W:

W = Σ_{i=1}^{n} R(xi)    (4.2)

Here, R(xi) represents the rank of value xi in the sorted and pooled list of sample values. For n > 25 and m > 25, W is approximately normally distributed [Rin08] with mean µW and variance s²W:

µW = n(n + m + 1)/2        s²W = n · m(n + m + 1)/12    (4.3)

For larger sample sizes, this allows to efficiently approximate p-values. For small sample sizes, the exact p-values for W can be looked up in a table [Kan06]. When the Wilcoxon rank-sum test is used as a set-level statistics (see below), these p-values need to be interpreted with respect to the entity-based strategy.
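A minimal sketch of the rank-sum statistics from Equation (4.2) together with the normal approximation of Equation (4.3) is given below. The example data are synthetic; in practice, an equivalent test is available, e.g., as scipy.stats.ranksums.

import numpy as np
from scipy import stats

def rank_sum_test(x, y):
    n, m = len(x), len(y)
    pooled = np.concatenate([x, y])
    ranks = stats.rankdata(pooled)          # average ranks in the pooled sample
    W = ranks[:n].sum()                     # sum of the ranks of the x-values
    mu_W = n * (n + m + 1) / 2
    var_W = n * m * (n + m + 1) / 12
    z = (W - mu_W) / np.sqrt(var_W)
    p = 2 * stats.norm.sf(abs(z))           # two-sided p-value (normal approximation)
    return W, p

rng = np.random.default_rng(0)
x = rng.normal(0.5, 1.0, size=30)
y = rng.normal(0.0, 1.0, size=35)
print(rank_sum_test(x, y))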

4.2.3 Transformation

Instead of directly using the computed entity-level scores as input for the set-level statistics (cf. Section 4.2.4), it is possible to first apply a score transformation to them. Examples for such transformations are the square root, logarithm, absolute value, or square of the input scores. The main reason for using a transformation is to improve the performance of the enrichment method. In this regard, the most valuable transformations are probably the absolute value and the square, as they allow to fold negative and positive scores together. This is useful when only the deregulation of a given category is of interest and not whether its constituents are up- or downregulated. Also, for some methods in which positive and negative scores are likely to cancel out, applying the absolute value or square transformation can be crucial (cf. Section 4.3.2). Applying the square root or logarithm transformation can be helpful for dampening the effect of outliers.

4.2.4 Set-level Statistics

The actual enrichment can be obtained by computing a set-level statistics. Given a list of entity-level scores, it computes a set-level (or enrichment) score for a given category. The choice of the set-level statistics can have a significant impact on the enrichment method's ability to detect enriched categories (see Section 4.3). Consequently, a large number of set-level statistics has been published. The most popular choices are the KS statistics, used in GSEA [Sub+05], and the hypergeometric test, used in overrepresentation analysis (ORA) [Drǎ+03]. In the following, we will describe a selection of the most commonly used set-level statistics in more detail.


In order to do so, we categorise the methods into non-parametric, parametric, and averaging methods. The hypergeometric test is not directly comparable to the other methods and will, thus, be treated in its own section. We start with the discussion of non-parametric statistics.

Non-Parametric Statistics

Many hypothesis tests, such as the t-test, assume that the input data follows a certain probability distribution. In real measurements, this is rarely the case. While applying such parametric statistics will nonetheless yield p-values, these are likely to be skewed. Non-parametric statistics do not make any assumptions about the distribution underlying their input data. Thus, they are extremely robust and universally applicable. Even in cases where a parametric statistics applies, non-parametric statistics offer a competitive, albeit slightly lower, power [HL56]. (Naturally, parametric tests tend to have higher power than non-parametric competitors if their assumptions hold; cf. [HL56] or [She03].) Two examples of non-parametric set-level statistics that are commonly employed for enrichment analysis are the Wilcoxon statistics and the KS statistics. Both methods are rank-based and are computed by evaluating a running sum. The formulation of the Wilcoxon statistics as a set-level statistics is identical to the entity-level version presented in Section 4.2.2. Whereas the Wilcoxon statistics counts how often samples from group X precede samples from group Y and vice versa, the KS statistics traverses a sorted list of samples and adds or subtracts a constant value depending on whether a sample from group X or Y has been encountered. The maximum deviation from zero is then the output of the KS statistics. More formally, let L = {l1, l2, . . . , ln} be a list of entities ordered according to their entity-level scores with n := |L| and let C ⊂ L be a category with m := |C|. Now, we are able to define the running sum RS:

RS(0) = 0
RS(i) = RS(i − 1) + (n − m)   if li ∈ C
RS(i) = RS(i − 1) − m         otherwise

The value of the test statistics is the maximum deviation RSmax of RS from zero. Figure 4.4 provides an example for the KS running sum. For the entity-based strategy, an exact p-value can be computed using dynamic programming [KBL07]. (The version presented here is more space efficient than the version published by Keller, Backes, and Lenhof [KBL07].) The idea of this approach is to compute the number Z of permutations that achieve an (absolute) score less than RSmax. The final p-value can then be computed as

p = 1 − Z / \binom{n}{m}

Let M ∈ ℕ^((m+1)×(n−m+1)) be the dynamic programming matrix.


Figure 4.4: An example KS running sum over the ranked entities. Red dots represent genes in the category under consideration. The maximum deviation of the running sum from zero (RSmax) is marked with a dashed line and serves as the value of the test statistics.

In order to initialise M we set

M(i, 0) = 1 if i · (n − m) < |RSmax|, and 0 otherwise
M(0, k) = 1 if k · m < |RSmax|, and 0 otherwise

The recurrence is given by

M(i, k) = M(i − 1, k) + M(i, k − 1) if (∗) holds, and 0 otherwise

where (∗) is −|RSmax| < i · (n − m) − k · m < |RSmax|.

Subramanian et al. [Sub+05] proposed a weighted version of the KS statistic. For each entity li ∈ C that is a member of the category, the value wi, which is derived from the entity's score w(li), is added to the running sum:

wi = |w(li)|^p / NR

The term NR := Σ_{li∈C} |w(li)|^p is used as a normalisation factor. The parameter p ∈ ℝ≥0 allows to control how much high-scoring entities are preferred over low-scoring ones. For entities that are not a member of C, the value (n − m)^(−1) is subtracted from the running sum. For p = 0 this formulation is equivalent to the unweighted KS statistics. One benefit of using weights is that it allows to control for scores that are closely distributed around zero and thus should have little influence on the value of the statistics. Additionally, artefacts stemming from entities with a low abundance, such as transcripts with low copy number, are mitigated by using weights. Unfortunately, the weighted version of the test does not allow the computation of an exact p-value, even when using the entity-based strategy. Instead, the permutation test approach presented in Algorithm 4.1 needs to be applied.
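The following sketch computes the weighted KS running sum described above for synthetic scores; the toy category and the random scores are purely illustrative. For p = 0 the computation reduces to a normalised form of the unweighted statistics.

import numpy as np

def ks_enrichment_score(scores, in_category, p=1.0):
    # scores: entity-level scores; in_category: boolean mask of category members
    order = np.argsort(scores)[::-1]            # rank entities by decreasing score
    scores, in_category = scores[order], in_category[order]

    hit = np.abs(scores) ** p * in_category
    hit = hit / hit.sum()                                  # increments for members (N_R normalisation)
    miss = (~in_category) / (~in_category).sum()           # decrement 1/(n - m) for non-members

    running_sum = np.cumsum(hit - miss)
    return running_sum[np.argmax(np.abs(running_sum))]     # maximum deviation from zero

rng = np.random.default_rng(0)
scores = rng.normal(size=1000)
category = np.zeros(1000, dtype=bool)
category[rng.choice(1000, size=50, replace=False)] = True
print(ks_enrichment_score(scores, category))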

Parametric Statistics Representatives of the parametric set-level statistics are the shrinkage and Student’s t-test (cf. Section 4.2.2). As with the Wilcoxon statistics,


the formulations for the t-test family statistics are the same as in the entity-level case. Advantages of using the t-test are that scores as well as p-values for the entity-based strategy can be computed more efficiently compared to the non-parametric case. Two ways of applying the t-test are possible. First, the means of the category members and the remaining genes can be compared (two-sample t-test). Second, the mean of the category members can be compared to the global mean. In the latter case, only the variance of the category is considered (one-sample t-test).

Averaging Statistics

A trivial way to compute an enrichment score for a category is to compute the average of its members' entity-level scores. Similarly, the median or the sum of all category entries can be used. Remarkably, these simple statistics perform competitively compared to more sophisticated methods such as the KS statistics [AS09]. Based on these averaging methods, Efron and Tibshirani [ET07] devised the maxmean statistics. Instead of computing the overall mean of the category, the authors propose to compute the means of the positive and negative members separately and to use the maximum value as the test statistics in order to avoid cancellation of large values with opposing signs (cf. Section 4.3).
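A minimal sketch of the maxmean idea could look as follows; the toy scores are made up. If the direction of deregulation is of interest, the sign of the dominating part can additionally be recorded.

import numpy as np

def maxmean(member_scores):
    member_scores = np.asarray(member_scores, dtype=float)
    m = len(member_scores)
    # average the positive and the negative parts separately over all m members
    pos_mean = member_scores[member_scores > 0].sum() / m
    neg_mean = -member_scores[member_scores < 0].sum() / m
    return max(pos_mean, neg_mean)

print(maxmean([2.1, -0.3, 1.7, -2.5, 0.2]))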

The Hypergeometric Test

Often a researcher obtains a list of "interesting" entities, which we will henceforth call the test set. Examples for test sets are lists of differentially expressed genes or the proteins detected in a sample. The question is whether the members of the test set share a common characteristic. To this end, functional annotations and thus categories can be used. To determine whether the members of a category are over- or underrepresented relative to what is expected by chance, a simple urn model can be used. Suppose we are given a universe of possible entities R with m members, which we will call the reference set. Additionally, we are given a test set T ⊂ R with n := |T| entities and a category C ⊂ R with l := |C|. We can ask ourselves the question: "Is the number of entities k := |T ∩ C| significantly different from the number expected for any randomly chosen set T′?" The probability of drawing a set of length n containing k elements from C can be calculated using basic combinatorics (Figure 4.5):

PH(k, l, m, n) = \binom{l}{k} \binom{m−l}{n−k} / \binom{m}{n}

PH(k, l, m, n) is the probability density function of the hypergeometric distribution (cf. [Riv+07; Bac+07]). The expected number of hits for a randomly chosen test set is given by k′ = l · n / m; the proof uses Vandermonde's identity. It is now possible to compute a one-sided p-value for our input test set T:


pc = Σ_{i=k}^{n} PH(i, l, m, n)   if k > k′
pc = Σ_{i=0}^{k} PH(i, l, m, n)   if k ≤ k′

Figure 4.5: Urn model for the hypergeometric test. The reference set R (|R| = m) contains entities that either belong to the category (red, l balls) or not (blue). From the reference set the test set T (|T| = n) is drawn; it contains k red balls.

If the test set T contains entities that are not members of R, Fisher's exact test should be used instead of the hypergeometric test:

PF(i, k, l, m, n) = \binom{n}{i} \binom{m}{l+k−i} / \binom{m+n}{l+k}

The p-values are defined analogously:

pc = Σ_{i=k}^{n} PF(i, k, l, m, n)   if k > k′
pc = Σ_{i=0}^{k} PF(i, k, l, m, n)   if k ≤ k′
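As an illustration, the one-sided ORA p-value based on the hypergeometric distribution can be computed with SciPy as sketched below; the counts for the reference set, category, and test set are made-up example values.

from scipy.stats import hypergeom

# reference set size m, category size l, test set size n, observed hits k
m, l, n, k = 20_000, 150, 400, 12
expected = l * n / m                      # expected number of hits k'

if k > expected:
    # upper tail: probability of observing at least k category members
    p = hypergeom.sf(k - 1, m, l, n)
else:
    # lower tail: probability of observing at most k category members
    p = hypergeom.cdf(k, m, l, n)

print(f"expected hits: {expected:.1f}, p-value: {p:.3g}")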

Due to its reliance on the test and reference sets, the hypergeometric test is not directly comparable to the other enrichment methods. In this regard, requiring a test set is both the biggest advantage and the biggest disadvantage of the hypergeometric test. On the one hand, the independence from entity-level scores allows to employ the hypergeometric test in settings where the remaining methods are not applicable. An example would be a differential proteomics pipeline in which the presence (or absence) of proteins is detected, but no quantitative information has been collected. On the other hand, constructing the test set in cases where entity-level scores are available can be difficult. Consider a score measuring the degree of differential expression for each gene. Which cut-off should be chosen to distinguish between differentially

expressed and "normal" genes? In essence, such a thresholding step introduces a tuning parameter that can have a large influence on the obtained results. Similarly, the choice of the reference set is also critical for the performance of the method. A larger reference set makes it less likely to randomly draw members of a category and, thus, leads to overall lower p-values. On the other hand, choosing a reference set that is too small may lead to low statistical power. How to choose an appropriate reference set is not always clear. In general, it should comprise all entities that can be detected by the experimental platform. For mRNA microarrays this means that the reference set should consist of the genes for which probes are present on the array. For RNA-seq the answer is less clear. As, in theory, all genes that can be expressed can be detected by the method, a reasonable choice for the reference set would be a list of all actively transcribed genes. For proteomics pipelines, which, at the time of writing, are not capable of capturing all proteins that are present in a sample, there is no clear answer. One approach would be to use the union of all detected proteins across all samples. Alternatively, a list of all proteins can be used. In this case, the resulting, lower p-values should then be compensated by a stricter significance threshold.

4.2.5 Global Tests

Tests that directly transform measurement data into enrichment scores are called "global tests" by Ackermann and Strimmer [AS09]. The advantage of global tests, as opposed to the previously presented methods, is that they can incorporate the correlation between entities into their respective models. Examples are Hotelling's T²-test [Hot31], the global ANCOVA approach [M+05], and the aptly named global test procedure [Goe+04]. In the remainder of the thesis, global tests will not play a further role. An exception is Hotelling's T²-test, which we will describe and evaluate in Section 4.4.2. For more details on the remaining methods, we refer the reader to the cited literature.

4.2.6 Summary

Enrichment analysis is a large field and, accordingly, a wide variety of enrichment algorithms has been created. Commonly, an enrichment algorithm is composed of a procedure for deriving entity-level scores, a set-level statistics, and finally a method for significance assessment. Owing to the choices made for the above components, each algorithm is based on different assumptions on how an enriched category is defined. Hence, each algorithm comes with a trade-off between advantages and disadvantages that need to be carefully vetted each time an enrichment analysis is required. Consequently, a wide variety of reviews exists that attempt to characterise the performance of enrichment methods under different conditions [AS09; ET07; HSL09; Hun+11; KSB12; Nae+12].


This shows that choosing an appropriate enrichment method is by no means a trivial task. In the next section, we also attempt to provide a comparison of some enrichment algorithms. However, we focus on deriving general advice rather than recommending a single technique.

4.3 evaluation

Choosing a "good" enrichment algorithm is difficult. As outlined before, there exist no experimental datasets for which the truly enriched categories are known. Hence, conducting studies on synthetic data remains the only viable option to obtain performance estimates for an enrichment method. In this section, we conduct an evaluation based on synthetic data with the goal to derive general guidelines for choosing an enrichment algorithm that is appropriate for a given situation. It should be noted that this evaluation by no means provides full coverage of every possible usage scenario. Our evaluation is based on the following procedure: first, we randomly create a set of categories. Next, we generate a dataset lacking any kind of differential expression. We will refer to this dataset as the null dataset. Evaluating our categories on a null dataset gives us a baseline for the number of false positives that can be expected for each method. Then, all methods are run on datasets that were generated to produce a randomly chosen subset of enriched categories. We can treat the information whether a category is enriched or not as a class label. This allows us to compute performance metrics for the individual methods that are usually used for evaluating classifiers. Using only categories created from uniformly sampled genes does not necessarily reflect how collections of categories are structured. As, for example, some genes fulfil a wider spectrum of functions or are more thoroughly researched, we expect that they can be found more often in categories than other genes. Accordingly, the enrichment scores for real-world categories can be expected to be more correlated than for our synthetic categories. If this is the case, it may prove difficult to distinguish the truly enriched from merely correlated categories. To account for this effect, we also evaluate the algorithms on a set of categories extracted from Reactome [Jos+05]. In summary, the following experiments were executed:

1. Synthetic categories on data without differential expression.

2. Synthetic categories on data with differential expression.

3. Real-world categories on data without differential expression.

4. Real-world categories on data with differential expression.

Before we present the results of the experiments, we first outline the data generation procedure for categories as well as datasets.


Figure 4.6: Empirical distribution of normalised expression values (black, solid) in the Wilm's tumour dataset. The data is well described by a Laplace distribution (red, dashed). In contrast, the fitted normal distribution (blue, dotted) has too little probability mass in the centre and falls off too slowly. Also, too little probability mass is placed on the tails.

4.3.1 Data Generation

Generating synthetic data for the evaluation of algorithms has various advantages and disadvantages. The major advantage is that it is possible to create data following a known theoretical model and, thus, that all properties of the generated dataset are known. In the context of enrichment analysis this means that the categories containing significantly enriched genes have been determined a priori. However, this is also the biggest disadvantage of synthetic data. As the model for data generation can be freely specified, it is likely that the conducted analyses are biased towards methods that share the same model assumptions. If the model does not reflect experimental data well enough, the results obtained using the generated data only contain limited amounts of information about the performance of the methods in a real-world scenario. Nevertheless, using synthetic data allows to learn about the properties of a method in a controlled environment. Most importantly, though, using synthetic data is often the only way to validate methods for which no gold standard is available.

Figure 4.7: Empirical distribution of the per-gene medians and MADs (black, solid) in the Wilm's tumour dataset. The medians follow a Laplace distribution, while the MADs roughly follow a log-normal distribution (red, dashed).

Expression Values The data generation model used in this study was designed to reproduce the Wilm's tumour microarray expression dataset introduced in Section 2.4. Generally, the expression of microarray data is assumed to follow a normal distribution, which works well for most applications. However, when examining the distribution more carefully, it becomes apparent that a normal distribution does not fit the data perfectly. The experimental density is narrower and has smaller support than the theoretical normal fit. A distribution that models the data much more faithfully is the Laplace distribution:

f(x) = 1/(2σ) · exp( −|x − µ| / (2σ) )

The Laplace distribution is closely related to the normal distribution. Both distributions are parametrised by the location parameter µ and the scale parameter σ. The maximum likelihood estimators for µ and σ in the normal case are the mean and the standard deviation, respectively. In the Laplace case, they are the median and the mean of absolute deviations (MAD),

MAD = 1/N · Σ_{i=1}^{N} |xi − µ|

To indicate that a random variable x follows a normal distribution with mean µ and standard deviation σ we write x ∼ N(µ, σ²). To indicate the same for the Laplace distribution we write x ∼ L(µ, σ), with µ and σ representing the median and MAD. Purdom and Holmes [PH05] propose to use an asymmetric extension of the Laplace distribution to better adapt to skewness in expression datasets. Here, we restrict ourselves to the ideal, symmetric case. In our data generation scheme, we need to sample expression values for each gene from a gene-specific distribution. This allows to simulate differential expression by adding a shift to each gene's expression values. In order to derive per-gene distributions that are compatible with the overall distribution of the expression values, we examined


the distribution of the gene medians (µ0) and MADs (σ0). Again, the Laplace distribution is a good fit for the distribution of the medians. The MADs, however, follow a log-normal distribution (cf. Figure 4.7):

µ0 ∼ L(0, σm)
ln(σ0) ∼ N(µv, σv²)

Here, σm represents the MAD of the gene medians. The variables µv and σv are the mean and standard deviation of the log gene standard deviations, respectively. Together, σm, µv, and σv are hyperparameters which we estimate from the experimental data. We omitted the median of the gene medians from the model, as a constant shift of all expression values will cancel out in the entity-level statistics. Expression values generated following the above scheme do not correspond to differentially expressed genes. Instead, they constitute the null model of "normal" genes. To introduce differential expression, the expression values in the sample group need to be shifted relative to the expression in the reference group. To create a significant shift, the median expression value of the sample group is shifted to the corresponding α = 0.05 significance threshold, which corresponds to a shift of δ ≈ 2.303 for the standard Laplace distribution. To create some variation in the gene scores, this adjusted median is modified by adding Gaussian noise. Finally, the shift is scaled with the gene expression MAD in order to account for the different magnitudes of dispersion across the genes. Put together, this yields the following generative models for null and differential gene expression:

xnull ∼ L(µ0, σ0)
xdiff ∼ L(µ0 + ϵσ0, σ0)
ϵ ∼ N(δ, σϵ²)

The noise parameter σϵ² is set to a constant value of 0.4. The above model only generates overexpressed genes. This means that the distribution of the expression values is asymmetric and all differentially expressed genes are shifted to the right. As this may unfairly bias the results towards certain methods, we additionally generate data where the direction of the shift is chosen via a coin flip.
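The data generation scheme described above can be sketched in a few lines of Python. The hyperparameter values are those reported in Section 4.3.2; using NumPy's scale-based Laplace parametrisation directly in place of the MAD-based notation of the text is a simplifying assumption.

import numpy as np

rng = np.random.default_rng(0)
sigma_m, mu_v, sigma_v = 0.167, -0.336, 0.499
delta, sigma_eps = 2.303, np.sqrt(0.4)    # sigma_eps^2 = 0.4

def generate_gene(n_samples, differential=False):
    mu0 = rng.laplace(loc=0.0, scale=sigma_m)            # per-gene median
    sigma0 = rng.lognormal(mean=mu_v, sigma=sigma_v)     # per-gene MAD
    shift = rng.normal(delta, sigma_eps) * sigma0 if differential else 0.0
    return rng.laplace(loc=mu0 + shift, scale=sigma0, size=n_samples)

null_gene = generate_gene(30)
diff_gene = generate_gene(30, differential=True)
print(null_gene.mean(), diff_gene.mean())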

Categories The categories were created by uniformly sampling genes from a list of gene names obtained from the Agilent SurePrint G3 microarray platform, which was used for measuring the Wilm's tumour data. In total, 500 categories were created. As biological categories are usually not uniformly distributed across the genome, but tend to cluster around well-studied genes and pathways, we additionally perform the same evaluation using real biological categories. For both sets of categories, we randomly selected 10 % of the categories to serve as enriched categories. To artificially enrich them, we selected 33 % and 66 %

of their genes as differentially expressed. The expression values were then generated as described above. As biological categories we use the pathway information contained in the Reactome database [Jos+05] to examine the behaviour of the enrichment algorithms on biological categories. In total, we extracted 1508 categories from Reactome. As with the artificial categories, we randomly selected 10 % of the categories as enriched and chose 33 % and 66 % of their members as differentially expressed.

4.3.2 Results

We evaluated the above model using the parameter settings derived from the Wilm’s tumour dataset (cf. Figure 4.7) as well as the noise parameter σϵ, which we fix to a small value:

µv ≈ −0.336 σv ≈ 0.499

σm ≈ 0.167        σϵ² = 0.4

Furthermore, we consider the following algorithms implemented in the GeneTrail2 server: mean, sum, median, one-sample t-test (1s t-test), two-sample t-test (2s t-test), Wilcoxon test, weighted and unweighted KS, as well as ORA. The ORA method could only be evaluated for the entity-based p-value strategy, due to limitations in the used C++ implementation. To create the required test set for ORA, we sorted the gene list by score and selected the upper and lower 0.025 % into the test set. All computations were repeated ten times with differing random seeds in order to compute standard deviations.

Null-Model In order to judge how the algorithms behave if no signal is present in the data, we generated a dataset without any differentially expressed genes. For this experiment, we tested all methods using the entity- and sample-based p-value strategies. We evaluate the number of false discoveries at the 0.01 significance threshold. Only few differences between the methods could be detected for the synthetic categories (Figure 4.8). For the entity-based strategy, the ORA method achieves the lowest number of false positives. For the sample-based strategy no clear winner exists. It should be noted that in all cases the number of detected false positives is higher than the expected number of 500 · 0.01 = 5 false discoveries. This highlights the importance of adjusting for multiple hypothesis testing. For the Reactome-based categories (Figure 4.9), the results for the entity-based p-value strategy show a higher variance than those for the sample-based strategy. Again, the ORA method on average detects substantially fewer false positives than competing approaches. For the sample-based strategy, as with the synthetic categories, no clear best algorithm can be identified.


Figure 4.8: Number of false positive categories detected for the synthetic null dataset using both the entity-based (left) and the sample-based (right) p-value computation strategy. For both strategies the number of false positives was evaluated at the p-value cut-off 0.01. ORA was only evaluated for the entity-based strategy. In total, 500 categories were evaluated.

Figure 4.9: Number of false positive categories detected for the Reactome null dataset using both the entity-based (left) and the sample-based (right) p-value computation strategy. For both strategies the number of false positives was evaluated at the p-value cut-off 0.01. ORA was only evaluated for the entity-based strategy. In total, 1508 categories were evaluated.


Figure 4.10: AUC of various set-level statistics on synthetic categories where differentially expressed genes are distributed asymmetrically or symmetrically. For p-value computation the entity-based strategy was used. If a category was chosen as significant, 66 % of its genes were generated as differentially expressed.

Significant Categories Next, we take a look at the results for the datasets in which significant categories were generated as explained above. For the datasets where only 33 % of the genes in an enriched category were chosen as differentially expressed, all methods perform poorly with AUC values of approximately 0.5. For the datasets with 66 % significant genes, the performance of the methods improved considerably (Figure 4.10). The most apparent difference can be seen between the symmetric and the asymmetric datasets. Here, all methods except ORA hardly misclassify any categories in the asymmetrically distributed case. The failure of ORA in this case can be attributed to the thresholding used to determine the most up- and downregulated genes, whereas the dataset only contains upregulated genes. In the symmetric case, only the (weighted) KS, max-mean, and ORA set-level statistics manage to achieve near perfect scores. The Wilcoxon statistics relies on comparing the sum of ranks of the category members against the ranks of the non-members. If the entity-level scores are symmetrically distributed, each category roughly contains the same number of low-ranking and high-ranking genes, which effectively cancel out. This effect is also visible for the remaining methods, which are all based on averaging. As with the Wilcoxon test, if genes with a positive and a negative enrichment score are members of a category, their sum is expected to lie in the vicinity of zero. Consequently, these methods should only be employed if the scores have been transformed using the absolute value or square transformation (cf. Section 4.2.3). When applying the set-level statistics to the biological categories, no significant change in behaviour could be detected (cf. Appendix C.2).


4.3.3 Summary

All enrichment methods are able to detect most of the enriched categories, given that the data contains enough signal. If too few genes are differentially expressed, the detection rate drops severely. To ensure optimal performance, the methods and eventual pre-processing steps need to be chosen carefully depending on the problem setting. If the scores of the differentially expressed genes are expected to be distributed symmetrically around zero, a transformation such as squaring or taking the absolute value should be applied in order to prevent cancellation effects. Alternatively, methods like the (weighted) KS or maxmean statistics, which are not as sensitive to the symmetry of the scores, should be used. Besides sensitivity to the symmetry of the input scores, the number of produced false positives can be an issue. To combat this, a more conservative significance threshold can be used. This choice is likely to have a larger impact on the number of false positives than the chosen set-level statistics. If a clear cut-off for distinguishing differentially from normally expressed genes is known, the ORA method offers superior performance for the entity-based permutation scheme. However, as this is seldom the case, using other methods is recommended in practice. Given the choice between two equally performing methods, the one offering a superior runtime behaviour should be chosen. To this end, methods that offer an exact p-value computation are in an advantageous position for the entity-based case. Examples are ORA, the unweighted KS statistics, the t-tests, and the Wilcoxon test. In the sample-based case, the performance of the entity-level statistics tends to dominate the computation time.

To close, we would like to reiterate that the analysis presented here should be taken with a grain of salt. First of all, not all parameters of the enrichment methods and the data generation procedure were tested exhaustively. This means that the performance of some methods may have been underestimated due to unfavourable parameter settings. The data creation process, and especially the creation of significantly enriched categories, can attribute unfair advantages to some of the tested methods. Here, especially the shift of differentially expressed genes to the significance threshold for α = 0.05 can be criticised as too conservative. The reason for this is that, especially together with the random noise term governed by σϵ, some gene expression values actually fall below the significance threshold. We argue that this is not an issue, as the gene expression values are symmetrically distributed around the threshold. Due to this, the mean of the expression still, on average, coincides with the threshold. Also, the differentially expressed genes should still have a higher mean expression than the "null" genes. Hence, as no method used an explicit significance cut-off to detect differentially expressed genes (ORA selected the upper and lower quantiles instead of imposing a fixed threshold), the exact magnitude of the shift should not be an issue. Nevertheless, by examining the distribution of the differentially expressed genes in experimental data, a more

realistic, albeit more complex, model could be derived. Another bias may stem from the fact that the significantly enriched genes are randomly chosen and considered independent from each other. This assumption closely resembles the urn model underlying the hypergeometric test used by ORA. However, for biological datasets, gene expression is most certainly not independent within a category. Therefore, the sample-based p-value strategy, which considers the correlation structure within a category, is in fact at a disadvantage in this evaluation. On the other hand, the ORA method depends on the choice of the threshold that selects the differentially expressed genes. While in this case the 95 % quantile is appropriate due to the data generation procedure, the true value of the threshold is usually not known a priori. Nevertheless, the chosen parameters reflect the settings a user of the GeneTrail2 server would use by default and, therefore, the presented evaluation gives some insight into possible real-world usage.
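As mentioned in the summary, ORA admits an exact p-value computation. A minimal sketch of this computation based on the hypergeometric distribution is given below; the gene counts are invented solely for illustration and the snippet assumes SciPy is available.

from scipy.stats import hypergeom

# Given m genes in total, of which k are differentially expressed, how likely
# is it to observe at least x differentially expressed genes in a category of
# size n purely by chance?  (All counts below are invented.)
m, k, n, x = 20000, 1000, 80, 12

# Upper tail of the hypergeometric distribution: P[X >= x].
p_value = hypergeom.sf(x - 1, m, k, n)
print(p_value)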

4.4 hotelling's T²-test

We previously discussed enrichment procedures, like the Over Representation Analysis (ORA) [Drǎ+03] and the Gene Set Enrichment Analysis (GSEA) [Sub+05], that treat the entities contained in each category as independent. However, we previously assumed that a category consists of genes that are functionally related and thus are likely to be co-regulated. Of course, this suggests that these entities are anything but independent. (The sample-based permutation strategy of Section 4.2.1 accounts for a part of this entity interdependence.) Here, we examine the suitability of Hotelling's T²-test as an enrichment algorithm. Hotelling's T²-test is a global enrichment procedure (Figure 4.2) that accounts for the correlation structure between entities. We structure our presentation as follows: first, we give a general introduction to Hotelling's T²-test. Afterwards, we perform a brief evaluation to illustrate some of its properties.

4.4.1 Mahalanobis Distance

A concept that is central for understanding Hotelling's T²-test [Hot31] is the Mahalanobis distance [Mah36]. To explain it, we need to recall the multivariate generalisation of the Gaussian distribution as introduced in Section 3.4.1. Again, consider the univariate Gaussian distribution

\mathcal{N}(x, \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)    (4.4)

Here, µ, σ² ∈ R represent the mean and variance of the distribution, respectively. Disregarding the normalisation factor in front of the exponential function, the value of the distribution solely depends on the squared distance of x to the mean, which is scaled inversely by the variance: (x − µ)²/(2σ²). In the multivariate Gaussian distribution

\mathcal{N}(\vec{x}, \vec{\mu}, \Sigma) = c \cdot \exp\left(-\frac{1}{2}(\vec{x} - \vec{\mu})^t \Sigma^{-1} (\vec{x} - \vec{\mu})\right)    (4.5)


this is replaced by the term (⃗x − ⃗µ)ᵗΣ⁻¹(⃗x − ⃗µ). In other words, the distance of ⃗x from the centroid ⃗µ ∈ R^p is rescaled by the inverse covariance (or precision) matrix Σ⁻¹ ∈ R^{p×p}. The constant

c := \det(\Sigma)^{-\frac{1}{2}} / (2\pi)^{\frac{p}{2}}

serves as the normalisation factor. The covariance matrix and, thus, Σ⁻¹ are positive definite matrices. Hence, Σ⁻¹ defines a scalar product and a corresponding metric. This metric is called the Mahalanobis distance [Mah36].
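For illustration, the Mahalanobis distance can be computed in a few lines of Python; the vectors and the covariance matrix below are arbitrary examples, and solving the linear system is preferred over forming Σ⁻¹ explicitly.

import numpy as np

def mahalanobis_sq(x, mu, sigma):
    """Squared Mahalanobis distance (x - mu)^t Sigma^{-1} (x - mu)."""
    diff = x - mu
    # Solve Sigma * z = diff instead of inverting Sigma explicitly.
    return float(diff @ np.linalg.solve(sigma, diff))

mu = np.array([0.0, 0.0])
sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])  # a positive definite covariance matrix
x = np.array([1.0, 1.0])

print(mahalanobis_sq(x, mu, sigma))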

4 . 4 . 2 Hotelling’s T 2-Statistics

A common question in statistics is whether two samples originate from the same or from two different distributions (Section 4.1). The entity- and set-level statistics presented in earlier sections (cf. Section 4.2.2) answer this question for a range of univariate cases. However, if the samples follow a multivariate Gaussian distribution, Hotelling's T²-statistic is the appropriate test statistic. It is a straightforward generalisation of Student's t-test [Stu08] (Section 4.2.2). Let N be the number of samples and n_x, n_y the sizes of the two groups, respectively. Using

S_x := \sum_{i=1}^{n_x} (\vec{x}_i - \vec{\mu}_x)(\vec{x}_i - \vec{\mu}_x)^t

and the analogously defined S_y, we obtain the pooled covariance matrix S:

S = \frac{S_x + S_y}{N - 2}    (4.6)

This allows us to compute the T²-statistic as the weighted difference of the sample means

T^2 = \frac{n_x n_y}{N} (\vec{\mu}_x - \vec{\mu}_y)^t S^{-1} (\vec{\mu}_x - \vec{\mu}_y)    (4.7)

In order to evaluate the significance of T², the appropriate quantiles can be computed from the F-distribution via the relationship

\frac{N - p - 1}{(N - 2)\,p}\, T^2 \sim F(p, N - 1 - p)    (4.8)

The probability density function of the F-distribution is given as:

f(x, d_1, d_2) = \frac{\sqrt{\dfrac{(d_1 x)^{d_1}\, d_2^{d_2}}{(d_1 x + d_2)^{d_1 + d_2}}}}{x \, B\left(\frac{d_1}{2}, \frac{d_2}{2}\right)}    (4.9)

where d_1, d_2 ∈ R^+ and

B(x, y) = \int_0^1 t^{x-1} (1 - t)^{y-1} \, dt    (4.10)

is the beta function. As the parameters of the F-distribution must be positive, computing an exact p-value requires more samples than observed genes (N ≥ p + 2).



Figure 4.11: Density plots of the estimated Mahalanobis distance between two multivariate Gaussian distributions of dimensionality p ∈ {10, 20} computed for increasing sample counts. The sample count in the legend reflects the number of samples for each distribution. The vertical line represents the true Mahalanobis distance. More than 100 samples per group are required to achieve reliable distance estimates. Especially for p = 20, low sample counts lead to a large variability in the estimated distance.

Thus, it is infeasible to test for a difference between a large number of genes, e.g. as obtained from a microarray, where the number of available samples is much smaller than the number of genes. However, computing T² and the corresponding p-values for small gene sets (p ≈ 50) is unproblematic for many microarray studies that are available in public repositories like the NCBI Gene Expression Omnibus [Bar+13; EDL02]. In practice, the Hotelling T²-test comes with serious usability problems. The reason for this lies in the required inversion of S, which can amplify noise in the input data as explained in Section 3.4.3. Furthermore, Hotelling's T² has been shown to possess a low statistical power [BS96]. Chen et al. [Che+12] report that regularised versions of the test show a considerably improved performance. To this end, they compute p-values based on an empirical distribution determined via resampling. In addition, they propose using bootstrapping to increase the power for high-dimensional datasets.
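A compact sketch of the statistic defined in Equations 4.6 to 4.8, extended by an optional ridge term on the pooled covariance matrix, is shown below. It is a simplified illustration rather than our production code; note that the F-based p-value is only exact without regularisation, which is why the evaluation below relies on permutation-based empirical p-values instead.

import numpy as np
from scipy.stats import f as f_dist

def hotelling_t2(X, Y, ridge=0.0):
    """Hotelling's T^2 for two sample groups X, Y (rows are samples).

    Returns the T^2 statistic and a p-value derived from the F-distribution
    (Equation 4.8).  `ridge` adds r * I to the pooled covariance matrix; for
    ridge > 0 the F-based p-value is only an approximation.
    """
    nx, p = X.shape
    ny, _ = Y.shape
    N = nx + ny
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    Sx = (X - mx).T @ (X - mx)
    Sy = (Y - my).T @ (Y - my)
    S = (Sx + Sy) / (N - 2) + ridge * np.eye(p)          # Equation 4.6
    diff = mx - my
    t2 = nx * ny / N * diff @ np.linalg.solve(S, diff)   # Equation 4.7
    f_stat = (N - p - 1) / ((N - 2) * p) * t2            # Equation 4.8
    return t2, f_dist.sf(f_stat, p, N - p - 1)

# Example with simulated data: 40 samples per group, 10 genes.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 10))
Y = rng.normal(loc=0.5, size=(40, 10))
print(hotelling_t2(X, Y, ridge=0.01))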


4.4.3 Evaluation

To ensure a certain power level for a statistical test, an appropriate number of samples needs to be selected (Section 4.1.1). The fact that estimating the p × p pooled covariance matrix requires effectively determining p(p + 1)/2 parameters suggests that Hotelling's T² statistic relies on a considerable number of samples [BS96]. To illustrate this, we created two random, multivariate Gaussian distributions of dimensionality p ∈ {10, 20}. For each of the distributions, a fixed number of samples was drawn and used to estimate the pooled covariance matrix as well as the distribution means. Next, the Mahalanobis distance between the two sample groups was computed. Each computation was repeated 10,000 times and density plots were generated. In Figure 4.11 it can be seen that the distance estimates vary considerably for small sample sizes. Only for sample sizes > 100 per group (and thus > 200 samples in total) do the estimates become more reliable. Note that this effect becomes more pronounced as the number of dimensions grows, and thus a further increase in the number of samples is required. This makes it infeasible to directly apply Hotelling's T²-test in an enrichment scenario, as in our experience the number of available samples in biological studies is rarely larger than 100. To give an example of the behaviour of Hotelling's T²-test in a real-world application, we computed enrichments for categories derived from KEGG (cf. Section 5.3) using the Wilm's tumour data (Section 2.4) as input. To this end, we implemented a version of the test statistic using ridge regularisation (cf. Section 3.4.3, Tikhonov [Tik63], or Ledoit and Wolf [LW04]). Empirical p-values were computed using a sample-based permutation strategy as suggested by Chen et al. [Che+12]. To determine the effect of the ridge parameter r, the settings r = {0, 0.001, 0.005, 0.01, 0.05, 0.1} were used. In the case of no regularisation, p-values could only be computed for categories with fewer than 30 members due to the low number of input samples. The most significant categories were obtained for r = 0.001, with 13 categories being significant at FDR ≤ 0.2 (Table 4.2). Other settings of the ridge parameter yield fewer significant categories. Hence, in contrast to local enrichment methods, which yield on the order of 100 significant categories on the same dataset (Section 5.3.5), the sensitivity of the T²-test is comparatively low.
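The sampling experiment behind Figure 4.11 can be reproduced in condensed form as follows; the sketch uses fewer repetitions than the original 10,000, arbitrary illustrative group means, and prints the mean and standard deviation of the estimated distance per sample size instead of drawing density plots.

import numpy as np

rng = np.random.default_rng(42)
p = 10
mu_x, mu_y = np.zeros(p), np.full(p, 0.5)  # illustrative group means

def estimated_distance(n):
    """Estimate the Mahalanobis distance between two groups from n samples each."""
    X = rng.normal(mu_x, 1.0, size=(n, p))
    Y = rng.normal(mu_y, 1.0, size=(n, p))
    S = ((X - X.mean(0)).T @ (X - X.mean(0)) +
         (Y - Y.mean(0)).T @ (Y - Y.mean(0))) / (2 * n - 2)
    diff = X.mean(0) - Y.mean(0)
    return np.sqrt(diff @ np.linalg.solve(S, diff))

for n in (25, 50, 100, 250, 500, 1000):
    estimates = [estimated_distance(n) for _ in range(200)]
    print(n, round(np.mean(estimates), 3), round(np.std(estimates), 3))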

4.4.4 Summary

We described and evaluated the application of Hotelling's T²-test as a global enrichment method. Unfortunately, applying this test directly requires rigorous regularisation, as the number of degrees of freedom quickly overtakes the number of available observations. To further increase the robustness of the method, Ackermann and Strimmer [AS09] argue that the estimator for the covariance matrix should be replaced with their more efficient shrinkage estimator [Str08].


Category                                  p     .001    .005    .01     .05     .1
Chemical carcinogenesis                   66    0.014   0.078   0.071   0.071   0.078
Protein digestion and absorption          86    0.014   0.056   0.028   0.028   0.028
Drug metabolism - cytochrome P450         57    0.047   0.226   0.234   0.282   0.338
Cytokine-cytokine receptor interaction    256   0.085   0.090   0.090   0.096   0.096
Dilated cardiomyopathy                    90    0.102   0.423   0.432   0.502   0.548
Rheumatoid arthritis                      86    0.122   0.078   0.071   0.071   0.078
Prion diseases                            36    0.141   0.384   0.420   0.569   0.687
Inflammatory bowel disease (IBD)          65    0.169   0.423   0.432   0.525   0.567
Basal cell carcinoma                      55    0.169   0.527   0.582   0.691   0.687
Leishmaniasis                             70    0.186   0.423   0.432   0.423   0.426
Pathways in cancer                        326   0.2     0.503   0.511   0.525   0.555
Amoebiasis                                108   0.2     0.078   0.071   0.071   0.078
Retinol metabolism                        53    0.2     0.735   0.722   0.620   0.562

Table 4.2: Significant categories at FDR ≤ 0.2 detected by Hotelling's T²-test using ridge regularisation. Names of the categories as well as the number of contained genes p are given in the first two columns. Subsequent columns contain Benjamini-Hochberg adjusted p-values for various settings of the ridge parameter. Without regularisation (r = 0), none of the categories for which a p-value could be computed were significant, and for the remaining categories no p-values could be computed at all. Hence, results for r = 0 have been excluded.

Nevertheless, the comparatively low power of the T²-test results in many categories being wrongly classified as not enriched, even though they are detected by "classical" enrichment algorithms. As alternatives to the T²-test, various competing approaches have been conceived. An example is the SAM-GS procedure [Din+07], which uses the SAM statistic [TTC01] for scoring gene sets. The SAM statistic, in turn, is based on Dempster's approximation of Hotelling's T²-statistic [Dem58]. The global test algorithm by Goeman et al. [Goe+04] also considers the covariance matrix, but avoids the computation of its inverse. Kong, Pu, and Park [KPP06] propose a dimensionality reduction method based on principal component analysis (PCA). Lu et al. [Lu+05] avoid the construction of the complete covariance matrix by using a search algorithm that maximises the T² distance between sample and reference group. Methods like Gene Graph Enrichment Analysis (GGEA) [Gei+11] or EnrichNet [Gla+12] account for the dependencies

between entities by explicitly including connectivity information obtained from known biological networks.

5 graviton

All problems in computer science can be solved by another level of indirection. But that usually will create another problem.
— David Wheeler, according to Beautiful Code (2007)

An important part of bioinformatics is the development of new methods for the analysis of experimental data. Due to this, a large toolbox of specialised software is available to the end user. In many areas of bioinformatics, such as DNA sequencing, the state of the art is constantly evolving. The same is true for biology, where the size of public databases that organise the available knowledge is constantly growing. Keeping up with this change requires a considerable amount of time and sophistication on the user's side. For researchers writing software, the maintenance burden of ensuring that their code can be deployed on any available platform takes time away from developing new, improved methods. A way to avoid this burden is to offer tools as web services, which comes with the advantage that only a single, centralised installation of the required third-party software and databases needs to be maintained. Additionally, web-based user interfaces work on every computer with a browser and, if designed properly, can be easily controlled via client-side scripts. Users of a service no longer need to update their software regularly and do not need to track databases for updates. Still, despite the fact that maintaining a web server requires far less effort than maintaining a native application, useful bioinformatics web services that integrate well into existing workflows are challenging to construct. As with traditional software packages, a considerable amount of code for parsing file formats, handling malformed user input, and performing statistical analyses needs to be implemented. In addition, business logic for handling user data and sessions or controlling database access is required. Furthermore, a functional and appealing user interface needs to be designed. Especially this latter part is often not necessary when rolling out a package for a scripting language or a native command-line application. In order to avoid reimplementing common functionality for every web service, we built the Graviton framework. To give a comprehensive description of the framework, we first need to familiarise ourselves with the available technologies for implementing a web service. Next, we outline the requirements that Graviton should fulfil. We then give a general overview of the architecture of Graviton, followed by a more in-depth discussion of the individual components GeneTrail2, NetworkTrail, and DrugTargetInspector.



Figure 5.1: Estimated number of available websites starting from the first available site. The WWW grows nearly exponentially. Drops in the statistics are due to changes in the methods used for counting live websites (Total number of Websites [16]; date of retrieval 12.04.2016).

5.1 the world wide web

Berners-Lee, commonly referred to as the "father" of the World Wide Web (WWW, or simply web), and his co-authors describe this vital communication infrastructure as follows [Ber+04]: "The World Wide Web is an information space in which the items of interest, referred to as resources, are identified by global identifiers called Uniform Resource Identifiers (URI)." The web is accessible via the Internet TCP/IP infrastructure and is open to everyone with a valid IP address. This openness is reflected by the fact that, since its inception, the web has grown exponentially and, at the time of writing, consists of almost 1 billion websites [16] (cf. Figure 5.1). The WWW is built upon a number of standardised technologies such as HTTP, HTML, CSS, and JavaScript. HTTP serves as the transport mechanism via which documents are transferred to the user. These documents are typically, but not necessarily, HTML documents (cf. Section 5.1.2) containing arbitrary, structured data. Their appearance can be controlled via the CSS stylesheet language (see Section 5.1.3). Dynamic documents that react to user input or retrieve additional data from the web on demand can be realised using JavaScript (see Section 5.1.5). In the following, we give short introductions into these and other web technologies, which will be needed once we discuss the implementation of the Graviton framework. We start with an introduction to HTTP.


5.1.1 The Hypertext Transfer Protocol

The web is centred around sending documents from server to client (and, to a lesser extent, from client to server, too). The Hypertext Transfer Protocol (HTTP) provides the facilities with which server and client negotiate, for example, the requested document, the encoding used, and the document format. Before discussing the protocol in detail, we introduce the terminology and philosophy behind the design of HTTP.

The REST Principle  A way to view the web is as an information space. Each piece of information, a so-called resource, is identified by a globally unique uniform resource identifier (URI) (see Section 5.1.4). Each HTTP URI encodes a host that stores the requested resource, as well as a path on the host to the location of the resource. HTTP provides the communication channel with which a client and a server (the host) exchange information about the state of a resource. The simplest example of such an exchange is the client issuing a request for obtaining a resource from the server. The server then sends a response containing a representation of the requested resource to the client. It is important to note that a resource can never be sent directly, as it represents an abstract piece of information that exists independently from any particular data format. Thus, an encoding step that transforms a resource into a representation is required before the data can be made available. Other possible mutations of the state of a resource, such as updating or deleting it, are, of course, possible. Fielding and Taylor [FT02] formalised the above concepts under the term representational state transfer (REST). (HTTP and other web technologies were designed pragmatically; further refinement and formalisation of the underlying principles happened post hoc.) In Section 5.1.8 we will revisit REST and explain how it can serve as a guideline for implementing web application programming interfaces (APIs).
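The distinction between a resource and its representations can be made concrete with a small client-side sketch: the same URI is requested twice with different Accept headers, and the server may answer with different representations. The URL below is a placeholder, and the snippet assumes the Python requests library.

import requests

# One resource, two requested representations; which media type the server
# actually returns depends on its content negotiation.
url = "https://example.org/api/results/42"   # placeholder URI

as_json = requests.get(url, headers={"Accept": "application/json"})
as_html = requests.get(url, headers={"Accept": "text/html"})

print(as_json.headers.get("Content-Type"))
print(as_html.headers.get("Content-Type"))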

Protocol  HTTP is a stateless, text-based protocol for document retrieval. It was first defined in the Internet Engineering Task Force (IETF) request for comments (RFC) 1945 [BFF96] and has been updated multiple times since then. In 2015, a successor to HTTP 1.1 was published under the name HTTP/2. Here, we present the protocol in terms of HTTP 1.1, as it is in many regards simpler than its successor. Nevertheless, the introduced concepts apply to both versions. For each resource that a client wants to obtain or modify, an HTTP request message is issued to the server. The server answers each request by sending a response containing the requested data. How the data is represented (encoded) is negotiated by client and server via the use of special header fields.

Requests A request starts with a request method, specifying how the request should be interpreted, followed by a path to a resource and the protocol version. Several request methods are available. The ones most

commonly used are GET, POST, PUT, and DELETE. The methods should be interpreted as follows:

GET     Retrieve a document from the server.

POST    Upload a new document to the server, where it exists below the hierarchy of the request URI.

PUT     Upload a document to the location on the server identified by the request URI or update an already existing resource.

DELETE  Delete the specified document from the server.

In the following lines, a set of header fields can be specified. Each field consists of a field name and a value separated by a colon. Which values are admissible is specific to the respective field. HTTP comes with a set of predefined header fields. Additional fields are usually standardised in their own RFC. The end of the header is indicated by an empty line (carriage return, line feed). As HTTP is a text-based protocol, simple requests can be created using terminal software such as telnet. A sample request is shown in Listing 5.1.

Example HTTP request header fields

Host specifies the authority component of the request URI.

Accept lists the media types the client understands together with a priority expressing the client's preferences. A list of admissible media types is maintained by the Internet Assigned Numbers Authority (IANA) as specified in RFC 6838.

Accept-Language lists the preferred languages of the client, with an associated preference measure.

Accept-Encoding lists the encodings the client is able to handle. This can be used by the server to compress the data stream and thereby reduce bandwidth consumption.

Content-Type states which encoding has been used for the (optional) data that should be transferred to the server. As for the Accept header, media types are used.

In the case of POST and PUT requests, an additional payload can be transferred to the server by appending it after the header. This payload is called the message body. The format of the contained data is not specified in the protocol and must be communicated to the server using the Content-Type header.

Responses  In order to answer a client request, the server sends a response that consists of a status code, header data, and, if the request was successful, the contents of the requested resource. A sample response is shown in Listing 5.2.


POST /api/job/setup/ HTTP/1.1
Host: genetrail2.bioinf.uni-sb.de
User-Agent: Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:40.0) Gecko/20100101 Firefox/40.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-GB,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive

name=gsea

Listing 5.1: Example HTTP POST request. The first line contains the HTTP request method (red), the path of the requested resource (blue), and the protocol version (green). The following lines contain the request header. Each header is composed of a field name (orange) and a value (black). The optional request body (grey) starts after an empty line and contains arbitrarily encoded data for transfer to the server.
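The same request can, of course, also be issued programmatically. The sketch below recreates the POST from Listing 5.1 using the Python requests library; the endpoint and the form field are taken from the listing, but whether the server still accepts this exact payload is not guaranteed here.

import requests

# Recreate the request from Listing 5.1: a form-encoded body ("name=gsea")
# sent to the job set-up resource of the GeneTrail2 API.
response = requests.post(
    "https://genetrail2.bioinf.uni-sb.de/api/job/setup/",
    data={"name": "gsea"},                  # becomes the message body
    headers={"Accept": "application/json"}  # content negotiation
)

print(response.status_code)
print(response.headers.get("Content-Type"))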

Properties  From the description of the protocol we can make two observations that have important implications on how web services can be implemented. First, the HTTP protocol itself is stateless. This means the protocol does not keep track of previous requests by the client or responses by the server. For web applications that rely on state being tracked, such as user logins, custom state handling must be implemented on top of HTTP. Another implication of statelessness is that subsequent GET requests for the same resource should always receive the same response, as GET leaves all resources unchanged. Accordingly, GET responses can be cached by the browser to avoid unnecessary round-trip times. In addition, caching can greatly reduce bandwidth consumption, which is especially important on bandwidth-constrained connections such as mobile devices. For web application developers, this implies that to achieve optimal performance the amount of changes to existing resources should be minimised such that caching can take effect. Second, due to the ordering between requests and responses, it is not possible to send information from server to client without the client sending a request first. (WebSockets alleviate this to some degree.) This implies that the client must periodically query for the status of a long-running computation on the server side in order to be notified about its progress.
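The polling pattern mentioned above can be sketched as follows; the status URL and the structure of the JSON reply are hypothetical and merely illustrate the idea.

import time
import requests

def wait_for_job(status_url, interval=5.0, timeout=600.0):
    """Poll a (hypothetical) job-status resource until it reports completion."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        reply = requests.get(status_url).json()
        if reply.get("finished"):   # the field name is an assumption
            return reply
        time.sleep(interval)        # wait before issuing the next request
    raise TimeoutError("job did not finish in time")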

5.1.2 The Hypertext Markup Language

The next core technology of the web is the Hypertext Markup Language (HTML) [HH16]. (HTML was originally a specification based on the Standard Generalized Markup Language (SGML, ISO 8879:1986); recent versions, however, have forgone all SGML compatibility.) HTML allows documents to be represented as a semantically annotated, hierarchical structure. The hierarchy is created by nesting so-called elements. An element usually consists of a start tag and an end tag that delimit a sequence of text and other elements.


HTTP/1.1 200 OK
Cache-Control: public
Content-Length: 36
Content-Type: text/plain
Last-Modified: Sun, 13 Sep 2015 09:46:05 GMT
Date: Fri, 16 Oct 2015 19:29:07 GMT

This is the content of the response!

Listing 5.2: Example HTTP response. The first line contains the protocol version (green) and a status code (blue). Subsequent lines contain response headers. Each header is composed of a field name (orange) and a value (black). The end of the header and start of the response body (grey), which contains the requested document, is indicated by an empty line.

<!DOCTYPE html>
<html>
  <head>
    <title>Hello world!</title>
  </head>
  <body>
    <h1>Hello world!</h1>
    <p>This is an example HTML page.</p>
    <p>And this is a new paragraph.</p>
  </body>
</html>

Listing 5.3: Example HTML5 document. The type of the document is controlled via the DOCTYPE statement. Metainformation is encoded in the <head> tag, whereas user-visible information is encoded in the <body> tag.
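To make the nesting of elements explicit, a document like the one in Listing 5.3 can be fed to a parser; the following sketch uses Python's built-in html.parser to print the element hierarchy as an indented outline and is meant purely as an illustration.

from html.parser import HTMLParser

class OutlinePrinter(HTMLParser):
    """Print the element hierarchy of an HTML document as an indented outline."""
    def __init__(self):
        super().__init__()
        self.depth = 0
    def handle_starttag(self, tag, attrs):
        print("  " * self.depth + tag)
        self.depth += 1
    def handle_endtag(self, tag):
        self.depth -= 1

document = """<!DOCTYPE html>
<html><head><title>Hello world!</title></head>
<body><h1>Hello world!</h1><p>This is an example HTML page.</p></body></html>"""

OutlinePrinter().feed(document)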

Each element can furthermore be supplied with one or more attributes that determine element-specific properties. While HTML offers tags like <i> (italic), <b> (bold), <br> (line break), and <font> (font properties) that directly control the appearance of the HTML in a browser, the use of these tags is deprecated in modern documents. Instead, the used annotations should express the semantics of their content. For example, HTML offers tags to indicate that some text should be interpreted as an address (<address>), a section (<section>), a paragraph (<p>), or dates and times (<time>). One annotation that deserves special attention is the anchor tag (<a>). It allows the creation of hyperlinks, which are cross-document and even cross-site references that can point to other resources. (Hyperlinks add the "Web" to the World Wide Web.) The importance of these links cannot be overstated, as they make it possible to add additional information to a document by referencing other, supplementary or explanatory documents. This allows the creation of discoverable document systems such as Wikipedia, which, since its inception in 2001, has rendered most printed encyclopediæ obsolete. HTML is available in two flavours: the traditional, SGML-inspired syntax (text/html) and a more recent, XML-compliant [W3C06] syntax (application/xhtml+xml). (XML is, in fact, a simplified version of SGML.)

While the first version has a few syntactic irregularities, such as tags lacking end tags or attributes without values, it is often preferred by web developers due to its relative conciseness when compared to the XML variant. The latter, however, can be processed using standard XML parsers and thus is better suited for automatic processing. The much-hyped fifth version of HTML adds additional semantic tags to the language that allow the creation of documents that are more friendly towards algorithms such as screen readers. Examples are tags such as <nav>, <article>, and <aside>. In addition, tags for embedding audio (