Unsupervised Multiple Kernel Learning Approaches for Integrating Molecular Cancer Patient Data

Unsupervised multiple kernel learning approaches for integrating molecular cancer patient data Dissertation zur Erlangung des Grades des Doktors der Naturwissenschaften der FakultätfürMathematik und Informatik der Universitätdes Saarlandes von Nora K. Speicher Saarbrücken, 2019 ii Tag des Kolloquiums: November 11, 2019 Dekan der Fakultät: Prof. Dr. Sebastian Hack Vorsitzender des Prüfungsausschusses: Prof. Dr. Kurt Mehlhorn Berichterstatter: Prof. Dr. Nico Pfeifer Prof. Dr. Hans-Peter Lenhof Akademischer Mitarbeiter: Dr. Peter Ebert Abstract Cancer is the second leading cause of death worldwide. A charac- teristic of this disease is its complexity leading to a wide variety of genetic and molecular aberrations in the tumors. This heterogene- ity necessitates personalized therapies for the patients. However, currently defined cancer subtypes used in clinical practice for treatment decision-making are based on relatively few selected markers and thus provide only a coarse classification of tumors. The increased availability in multi-omics data measured for cancer patients now offers the possibility of defining more informed cancer subtypes. Such a more fine-grained characterization of cancer subtypes harbors the potential of substantially expanding treatment options in personalized cancer therapy. In this thesis, we identify comprehensive cancer subtypes using mul- tidimensional data. For this purpose, we apply and extend unsupervised multiple kernel learning methods. Three challenges of unsupervised multiple kernel learning are addressed: robustness, applicability, and interpretability. First, we show that regularization of the multiple kernel graph embedding framework, which en- ables the implementation of dimensionality reduction techniques, can increase the stability of the resulting patient subgroups. This improvement is especially beneficial for data sets with a small num- ber of samples. Second, we adapt the objective function of kernel principal component analysis to enable the application of multiple kernel learning in combination with this widely used dimensionality reduction technique. Third, we improve the interpretability of kernel learning procedures by performing feature clustering prior to integrating the data via multiple kernel learning. On the basis of these clusters, we derive a score indicating the impact of a feature cluster on a patient cluster, thereby facilitating further analysis of the cluster-specific biological properties. All three procedures are successfully tested on real-world cancer data. Comparing our newly derived methodologies to established methods provides evi- dence that our work offers novel and beneficial ways of identifying patient subgroups and gaining insights into medically relevant char- acteristics of cancer subtypes. i ii Zusammenfassung Krebs ist eine der häufigsten Todesursachen weltweit. Krebs ist gekennzeichnet durch seine Komplexität, die zu vielen ver- schiedenen genetischen und molekularen Aberrationen im Tumor führt.Die Unterschiede zwischen Tumoren erfordern personalisierte Therapien fürdie einzelnen Patienten. Die Krebssubtypen, die derzeit zur Behandlungsplanung in der klinischen Praxis verwen- det werden, basieren auf relativ wenigen, genetischen oder molekularen Markern und können daher nur eine grobe Unterteilung der Tumoren liefern. Die zunehmende Verfügbarkeit von Multi- Omics-Daten fürKrebspatienten ermöglicht die Neudefinition von fundierteren Krebssubtypen, die wiederum zu spezifischeren Be- handlungen fürKrebspatienten führenkönnten. In dieser Dissertation identifizieren wir neue, potentielle Krebssub- typen basierend auf Multi-Omics-Daten. Hierfürverwenden wir unüberwachtes Multiple Kernel Learning, welches in der Lage ist mehrere Datentypen miteinander zu kombinieren. Drei Heraus- forderungen des unüberwachten Multiple Kernel Learnings werden adressiert: Robustheit, Anwendbarkeit und Interpretierbarkeit. Zunächst zeigen wir, dass die zusätzliche Regularisierung des Multiple Kernel Learning Frameworks zur Implementierung ver- schiedener Dimensionsreduktionstechniken die Stabilitätder iden- tifizierten Patientengruppen erhöht. Diese Robustheit ist besonders vorteilhaft fürDatensätzemit einer geringen Anzahl von Proben. Zweitens passen wir die Zielfunktion der kernbasierten Hauptkom- ponentenanalyse an, um eine integrative Version dieser weit ver- breiteten Dimensionsreduktionstechnik zu ermöglichen. Drittens verbessern wir die Interpretierbarkeit von kernbasierten Lernproze- duren, indem wir verwendete Merkmale in homogene Gruppen un- terteilen bevor wir die Daten integrieren. Mit Hilfe dieser Gruppen definieren wir eine Bewertungsfunktion, die die weitere Auswertung der biologischen Eigenschaften von Patientengruppen erleichtert. Alle drei Verfahren werden an realen Krebsdaten getestet. Den Vergleich unserer Methodik mit etablierten Methoden weist nach, dass unsere Arbeit neue und nützliche Möglichkeiten bietet, um integrative Patientengruppen zu identifizieren und Einblicke in medi- zinisch relevante Eigenschaften von Krebssubtypen zu erhalten. iii Acknowledgments First and foremost, I would like to thank Nico Pfeifer for giving me the opportunity to complete my thesis under his supervision. Nico provided guidance, advice, and shared his technical expertise, but at the same time offered trust and freedom, which allowed me to pursue my own ideas. I also wish to thank Thomas Lengauer for creating such an inspiring place to work, introducing me to the basics of statistical learning and providing valuable feedback on my research. I also wish to thank all members of my thesis committee for investing time and effort in the review process. There are many more people that influenced this thesis and made everyday life and work more fun. Thanks to my former office mates Anna Hake and Peter Ebert for the great discussions. Thanks to the other members of this department for the friendly atmosphere at work. I am particularly thankful to my friends, coffee companions, and (former) colleagues Adrin, Alejandro, Fabian, Lisa, Matthias, Matthias, Nadezhda, Olga, Prabhav, and Sarvesh. I am very grateful to Achim Büch and Georg Friedrich for their technical support whenever needed, and to Ruth Schneppen-Christmann for taking care of the bureaucratic challenges during the last years. Last but certainly not least, I am deeply grateful to Benedikt and my family for their support, love, and constant trust. iv Contents 1 Introduction1 2 Background7 2.1 Biological background......................7 2.1.1 Development of cancer..................7 2.1.2 Molecular data......................9 2.2 Machine learning and kernel methods.............. 13 2.2.1 Kernel methods...................... 14 2.2.2 Multiple kernel learning................. 15 2.3 Dimensionality reduction..................... 17 2.3.1 Locality preserving projections.............. 18 2.3.2 Principal component analysis.............. 19 2.3.3 Graph embedding..................... 20 2.4 Clustering............................. 24 2.4.1 K-means clustering.................... 24 2.4.2 Fuzzy c-means clustering................. 25 2.5 Cluster evaluation......................... 26 2.5.1 Internal cluster evaluation measures........... 26 2.5.2 External cluster evaluation measures.......... 28 2.5.3 Enrichment analysis................... 32 2.6 Related work........................... 34 3 Regularization of unsupervised multiple kernel learning 43 3.1 Overview.............................. 43 3.2 Methods.............................. 45 3.2.1 Regularization in the graph embedding framework... 45 3.2.2 Leave-one-out cross-validation for rMKL-DR...... 49 3.2.3 Materials......................... 49 3.3 Regularized multiple kernel locality preserving projections.. 51 3.3.1 Results and discussion.................. 52 3.3.2 External validation.................... 65 v 3.4 Conclusion............................. 66 4 Multiple kernel principal component analysis 69 4.1 Overview.............................. 69 4.2 PCA in the graph embedding framework............ 70 4.3 Direct extension of kernel principal component analysis.... 73 4.4 Scoring function.......................... 76 4.5 Application to cancer patient data................ 78 4.5.1 Materials......................... 78 4.5.2 Workflow......................... 79 4.5.3 Results and discussion.................. 80 4.6 Conclusion............................. 83 5 Increased interpretability of unsupervised multiple kernel learning 85 5.1 Introduction............................ 85 5.2 Conceptual outline........................ 88 5.3 Methods.............................. 90 5.3.1 Feature cluster impact on patient cluster........ 90 5.3.2 Materials......................... 92 5.4 Results and discussion...................... 94 5.4.1 Parameter selection.................... 94 5.4.2 Robustness of the final clusterings............ 95 5.4.3 Survival analysis..................... 96 5.4.4 Interpretation....................... 97 5.5 Conclusion............................. 102 6 Conclusions and outlook 105 6.1 Summary............................. 105 6.2 Perspectives............................ 107 A List of publications 111 B Licensing, copyright, and plagiarism prevention 113 Bibliography 115 vi List of Figures 2.1 Hallmarks of cancer........................8 2.2 Example of a Kaplan-Meier graph................ 30 2.3 Early, intermediate, and late data integration.......... 35 3.1 rMKL-LPP results with different initializations......... 53 3.2 rMKL-LPP results

Load more