Machine Learning for Genomic Sequence Analysis
Total Page:16
File Type:pdf, Size:1020Kb
Machine Learning for Genomic Sequence Analysis - Dissertation - vorgelegt von Dipl. Inform. S¨oren Sonnenburg aus Berlin Von der Fakult¨at IV - Elektrotechnik und Informatik der Technischen Universit¨at Berlin zur Erlangung des akademischen Grades Doktor der Naturwissenschaften — Dr. rer. nat. — genehmigte Dissertation Promotionsaussschuß Vorsitzender: Prof. Dr. Thomas Magedanz • Berichter: Dr. Gunnar R¨atsch • Berichter: Prof. Dr. Klaus-Robert Muller¨ • Tag der wissenschaftlichen Aussprache: 19. Dezember 2008 Berlin 2009 D83 Acknowledgements Above all, I would like to thank Dr. Gunnar R¨atsch and Prof. Dr. Klaus-Robert M¨uller for their guidance and inexhaustible support, without which writing this thesis would not have been possible. All of the work in this thesis has been done at the Fraunhofer Institute FIRST in Berlin and the Friedrich Miescher Laboratory in T¨ubingen. I very much enjoyed the inspiring atmosphere in the IDA group headed by K.-R. M¨uller and in G. R¨atsch’s group in the FML. Much of what we have achieved was only possible in a joint effort. As the various fruitful within-group collaborations expose — we are a team. I would like to thank Fabio De Bona, Lydia Bothe, Vojtech Franc, Sebastian Henschel, Motoaki Kawanabe, Cheng Soon Ong, Petra Philips, Konrad Rieck, Reiner Schulz, Gabriele Schweikert, Christin Sch¨afer, Christian Widmer and Alexander Zien for reading the draft, helpful discussions and moral support. I acknowledge the support from all members of the IDA group at TU Berlin and Fraunhofer FIRST and the members of the Machine Learning in Computational Biology group at the Friedrich Miescher Laboratory, especially for the tolerance of letting me run the large number of compute-jobs that were required to perform the experiments. Finally, I would like to thank the system administrators Roger Holst, Andre Noll, An- dreas Schulz, Rolf Schulz and Sebastian Stark — your support is very much appreciated. This work was supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence (IST-2002-506778), and by the Learning and Inference Platform of the Max Planck and Fraunhofer Societies. To my parents. IV Preface When I came into first contact with Machine Learning at a seminar on Support Vector Machines (SVMs) held by Klaus-Robert M¨uller and Bernhard Sch¨olkopf in 1999, I became utterly excited about this subject. During that seminar, I gave an introductory talk on SVMs to an all-knowing audience, but it took a while before I seriously picked up the SVM subject in 2002. For my student research project, Gunnar R¨atsch suggested Hidden Markov Models (HMMs) and their application to Bioinformatics. The idea was to find genes, but as this problem is too complex to start with, I started with the sub- problem of recognising splice sites. HMMs using a certain predefined topology performed quite well. However, we managed to improve on this using string kernels (derived from HMMs) and Support Vector Machines (Sonnenburg et al., 2002, Tsuda et al., 2002b). Unfortunately, training SVMs with these string kernels was computationally very costly. Even though we had a sample of more than 100,000 data points, we could barely afford training on 10,000 instances on a, at that time, cutting-edge Compaq Alpha compute cluster. In contrast to HMMs, the resulting trained SVM classifiers were not easily accessible. After a research break of about 2 years during which I became the group’s system administrator, I started work on exactly these — still valid — topics. All of this research was application driven: improving the splicing signal (and later, other signals) detection in order to construct a gene finder that is applicable on a genomic scale. To this end, I worked on computationally efficient string kernels suitable for detecting biological signals such as the aforementioned splicing signal and on large scale algo- rithms for SVMs with string kernels. The results of these efforts are presented in this doctoral thesis in Chapter 2 and 3. Trying to improve the classification performance of a string kernel SVM using the so-called weighted degree kernel, I investigated multiple kernel learning, which turned out to be useful to understand the learnt SVM classifier. Having learnt lessons about the implied string kernel feature spaces (by working on large scale learning) and since there was still room for improvements in understanding SVMs, I continued to research this subject. I consider the resulting Positional Oligomer Importance Matrices (POIMs) a big step forward toward a full understanding of string kernel based classifiers. Both concepts are described in Chapter 4. Since my research was application motivated, all algorithms were implemented in the very versatile SHOGUN toolbox that I developed in parallel (cf. Appendix C). Equipped with this tool-set, several signal detection problems, like splice site recognition and transcription start site recognition, were tackled. The work on large scale learning with string kernels enabled us to train string kernel SVMs on 10,000,000 instances, obtaining state-of-the-art accuracy, and to apply the trained SVMs to billions of instances. At the same time, I used POIMs to understand the complex SVM decision rule. These improved signal detectors were finally integrated to accurately predict gene structures (R¨atsch et al., 2007), as outlined in Chapter 6 and are used in a full gene finding system (Schweikert et al., 2008) outperforming state-of-the-art methods. V VI Zusammenfassung Die Entwicklung neuer Sequenziertechnologien ebnete den Weg fur¨ kosteneffiziente Ge- nomsequenzierung. Allein im Jahr 2008 werden etwa 250 neue Genome sequenziert. Es ist offensichtlich, dass diese gewaltigen Mengen an Daten effektive und genaue computer- gestutzte¨ Methoden zur Sequenzanalyse erfordern. Diese werden ben¨otigt, um eines der wichtigsten Probleme der Bioinformatik zu l¨osen: die akkurate Lokalisation von Genen auf der DNA. In dieser Arbeit werden auf Basis von Support Vector Machines (SVMs) genaueste genomische Signalerkenner entwickelt, die in Gensuchmaschinen verwendet werden k¨onnen. Die Arbeit untergliedert sich in folgende Themenschwerpunkte: String-Kerne Es wurden String-Kerne zur Detektion von Signalen auf dem Genom entwickelt und erweitert. Die Kerne haben eine in der L¨ange der Eingabesequenzen nur lineare Berechnungskomplexit¨at und sind fur¨ eine Vielzahl von Problemen verwendbar. Dadurch gestaltet sich die Sequenzanalyse sehr effektiv: Mit nur geringem Vorwissen ist es m¨oglich, geeignete String-Kernkombinationen auszuw¨ahlen, die hohe Erkennungsra- ten erm¨oglichen. Large-Scale-Lernen Das Training von SVMs war bisher zu rechenintensiv, um auf Da- ten genomischer Gr¨oße angewendet zu werden. Mithilfe der in dieser Arbeit entwickelter Large-Scale-Lernmethoden ist es nun in kurzer Zeit m¨oglich, string-kern-basierte SVMs auf bis zu zehn Millionen Sequenzen zu trainieren und auf uber¨ sechs Milliarden Sequenz- positionen vorherzusagen. Der entwickelte linadd-Ansatz beschleunigt die Berechnung von Linearkombinationen von String-Kernen, die bereits in linearer Zeit berechenbar sind. Dieser Ansatz erm¨oglicht den Verzicht auf einen Kern-Cache beim SVM-Training und fuhrt¨ somit zu einer drastischen Reduktion des Speicheraufwands. Interpretierbarkeit Ein h¨aufig kritisierter Nachteil von SVMs mit komplexen Kernen ist, dass ihre Entscheidungsregeln fur¨ den Menschen schwer zu verstehen sind. In dieser Arbeit wird die Black Box“ der SVM-Klassifikatoren ge¨offnet“, indem zwei Konzepte ” ” entwickeln werden, die zu ihrem Verst¨andnis beitragen: Multiple Kernel Learning (MKL) und Positional Oligomer Importance Matrices (POIMs). MKL erm¨oglicht die Interpre- tation von SVMs mit beliebigen Kernen und kann dazu verwendet werden heterogene Datenquellen zu fusionieren. POIMs sind besonders gut fur¨ String-Kerne geeignet, um die diskriminativsten Sequenzmotive zu bestimmen. Genom-Sequenzanalyse In dieser Arbeit werden SVMs mit neuentwickelten, sehr pr¨azisen String-Kernen zur Detektion einer Vielzahl von Genom-Signalen, zum Bei- spiel der Transkriptionsstart- oder Spleißstellen, angewendet. Dabei werden unerreichte Genauigkeiten erzielt. Unter Verwendung von POIMs wurden die trainierten Klassifika- toren analysiert und unter anderem viele bereits bekannte Sequenzmuster gefunden. Die verbesserten Signaldetektoren wurden dann zur genauen Genstrukturvorhersage ver- wendet. Die Doktorarbeit schließt mit einem Ausblick, wie in dieser Arbeit entwickelte Methoden die Basis fur¨ eine vollst¨andige Gensuchmaschine legen. Keywords: Support Vector Machine, Interpretierbarkeit, String-Kerne, Large-Scale- Lernen, Multiple Kernel Learning, Positional Oligomer Importance Matrices, Spleiß- stellenerkennung, Transkriptionsstartstellenerkennung, Gensuche VII VIII Summary With the development of novel sequencing technologies, the way has been paved for cost efficient, high-throughput whole genome sequencing. In the year 2008 alone, about 250 genomes will have been sequenced. It is self-evident that the handling of this wealth of data requires efficient and accurate computational methods for sequence analysis. They are needed to tackle one of the most important problems in computational biology: the localisation of genes on DNA. In this thesis, I describe the development of state-of-the- art genomic signal detectors based on Support Vector Machines (SVM) that can be used in gene finding systems. The main contributions of this work can be summarized as follows: String Kernels We have developed and extended