
Unsupervised multiple kernel learning approaches for integrating molecular cancer patient data

Dissertation submitted for the degree of Doctor of Natural Sciences at the Faculty of Mathematics and Computer Science of Saarland University

by Nora K. Speicher

Saarbrücken, 2019

Date of the colloquium: November 11, 2019
Dean of the faculty: Prof. Dr. Sebastian Hack
Chair of the examination committee: Prof. Dr. Kurt Mehlhorn
Reviewers: Prof. Dr. Nico Pfeifer, Prof. Dr. Hans-Peter Lenhof
Academic staff member: Dr. Peter Ebert

Abstract

Cancer is the second leading cause of death worldwide. A characteristic of this disease is its complexity, which leads to a wide variety of genetic and molecular aberrations in the tumors. This heterogeneity necessitates personalized therapies for the patients. However, the cancer subtypes currently used in clinical practice for treatment decision-making are based on relatively few selected markers and thus provide only a coarse classification of tumors. The increased availability of multi-omics data measured for cancer patients now offers the possibility of defining more informed cancer subtypes. Such a fine-grained characterization of cancer subtypes harbors the potential of substantially expanding treatment options in personalized cancer therapy.

In this thesis, we identify comprehensive cancer subtypes using multidimensional data. For this purpose, we apply and extend unsupervised multiple kernel learning methods. Three challenges of unsupervised multiple kernel learning are addressed: robustness, applicability, and interpretability. First, we show that regularization of the multiple kernel graph embedding framework, which enables the implementation of dimensionality reduction techniques, can increase the stability of the resulting patient subgroups. This improvement is especially beneficial for data sets with a small number of samples.
Second, we adapt the objective function of kernel principal component analysis to enable the application of multiple kernel learning in combination with this widely used dimensionality reduction technique. Third, we improve the interpretability of kernel learning procedures by performing feature clustering prior to integrating the data via multiple kernel learning. On the basis of these clusters, we derive a score indicating the impact of a feature cluster on a patient cluster, thereby facilitating further analysis of the cluster-specific biological properties. All three procedures are successfully tested on real-world cancer data. Comparing our newly derived methodologies to established methods provides evidence that our work offers novel and beneficial ways of identifying patient subgroups and gaining insights into medically relevant characteristics of cancer subtypes.

Summary

Cancer is one of the most common causes of death worldwide. It is characterized by its complexity, which leads to many different genetic and molecular aberrations in the tumor. The differences between tumors require personalized therapies for the individual patients. The cancer subtypes currently used for treatment planning in clinical practice are based on relatively few genetic or molecular markers and can therefore provide only a coarse partition of the tumors. The increasing availability of multi-omics data for cancer patients makes it possible to redefine better-informed cancer subtypes, which in turn could lead to more specific treatments for cancer patients.

In this dissertation, we identify new potential cancer subtypes based on multi-omics data. For this purpose, we use unsupervised multiple kernel learning, which is able to combine several data types with one another. Three challenges of unsupervised multiple kernel learning are addressed: robustness, applicability, and interpretability. First, we show that additional regularization of the multiple kernel learning framework for implementing various dimensionality reduction techniques increases the stability of the identified patient groups. This robustness is especially beneficial for data sets with a small number of samples. Second, we adapt the objective function of kernel principal component analysis to enable an integrative version of this widely used dimensionality reduction technique. Third, we improve the interpretability of kernel-based learning procedures by dividing the features used into homogeneous groups before integrating the data. With the help of these groups, we define a scoring function that facilitates further evaluation of the biological properties of patient groups. All three procedures are tested on real-world cancer data. The comparison of our methodology with established methods demonstrates that our work offers new and useful ways of identifying integrative patient groups and gaining insights into medically relevant properties of cancer subtypes.

Acknowledgments

First and foremost, I would like to thank Nico Pfeifer for giving me the opportunity to complete my thesis under his supervision. Nico provided guidance, advice, and shared his technical expertise, but at the same time offered trust and freedom, which allowed me to pursue my own ideas. I also wish to thank Thomas Lengauer for creating such an inspiring place to work, introducing me to the basics of statistical learning, and providing valuable feedback on my research. I also wish to thank all members of my thesis committee for investing time and effort in the review process.
There are many more people that influenced this thesis and made everyday life and work more fun. Thanks to my former office mates Anna Hake and Peter Ebert for the great discussions. Thanks to the other members of this department for the friendly atmosphere at work. I am particularly thankful to my friends, coffee companions, and (former) colleagues Adrin, Alejandro, Fabian, Lisa, Matthias, Matthias, Nadezhda, Olga, Prabhav, and Sarvesh. I am very grateful to Achim Büch and Georg Friedrich for their technical support whenever needed, and to Ruth Schneppen-Christmann for taking care of the bureaucratic challenges during the last years. Last but certainly not least, I am deeply grateful to Benedikt and my family for their support, love, and constant trust.

Contents

1 Introduction
2 Background
  2.1 Biological background
    2.1.1 Development of cancer
    2.1.2 Molecular data
  2.2 Machine learning and kernel methods
    2.2.1 Kernel methods
    2.2.2 Multiple kernel learning
  2.3 Dimensionality reduction
    2.3.1 Locality preserving projections
    2.3.2 Principal component analysis
    2.3.3 Graph embedding
  2.4 Clustering
    2.4.1 K-means clustering
    2.4.2 Fuzzy c-means clustering
  2.5 Cluster evaluation
    2.5.1 Internal cluster evaluation measures
    2.5.2 External cluster evaluation measures
    2.5.3 Enrichment analysis
  2.6 Related work
3 Regularization of unsupervised multiple kernel learning
  3.1 Overview
  3.2 Methods
    3.2.1 Regularization in the graph embedding framework
    3.2.2 Leave-one-out cross-validation for rMKL-DR
    3.2.3 Materials
  3.3 Regularized multiple kernel locality preserving projections
    3.3.1 Results and discussion
    3.3.2 External validation
  3.4 Conclusion
4 Multiple kernel principal component analysis
  4.1 Overview
  4.2 PCA in the graph embedding framework
  4.3 Direct extension of kernel principal component analysis
  4.4 Scoring function
  4.5 Application to cancer patient data
    4.5.1 Materials
    4.5.2 Workflow
    4.5.3 Results and discussion
  4.6 Conclusion
5 Increased interpretability of unsupervised multiple kernel learning
  5.1 Introduction
  5.2 Conceptual outline
  5.3 Methods
    5.3.1 Feature cluster impact on patient cluster
    5.3.2 Materials
  5.4 Results and discussion
    5.4.1 Parameter selection
    5.4.2 Robustness of the final clusterings
    5.4.3 Survival analysis
    5.4.4 Interpretation
  5.5 Conclusion
6 Conclusions and outlook
  6.1 Summary
  6.2 Perspectives
A List of publications
B Licensing, copyright, and plagiarism prevention
Bibliography

List of Figures

2.1 Hallmarks of cancer
2.2 Example of a Kaplan-Meier graph
2.3 Early, intermediate, and late data integration
3.1 rMKL-LPP results with different initializations
3.2 rMKL-LPP results