And Cheminformatics

Large Classifier Systems in Bio- and Cheminformatics Jörg S. Wicker TECHNISCHE UNIVERSITÄT MÜNCHEN Lehrstuhl für Bioinformatik Large Classifier Systems in Bio- and Cheminformatics Jörg S. Wicker Vollständiger Abdruck der von der Fakultät für Informatik der Technischen Uni- versität München zur Erlangung des akademischen Grades eines Doktors der Naturwissenschaften genehmigten Dissertation. Vorsitzende: Univ.-Prof. Gudrun J. Klinker, Ph.D. Prüfer der Dissertation: 1. Univ.-Prof. Dr. Burkhard Rost 2. Univ.-Prof. Dr. Stefan Kramer Johannes Gutenberg Universität Mainz Die Dissertation wurde am 15.07.2013 bei der Technischen Universität München eingereicht und durch die Fakultät für Informatik am 02.09.2013 angenommen. What’s the most you’ve ever lost on a coin toss? Anton Chigurh Abstract Large classifier systems are machine learning algorithms that use multiple classifiers to improve the prediction of target values in advanced classification tasks. Although learning problems in bio- and cheminformatics commonly provide data in schemes suitable for large classifier systems, they are rarely used in these domains. This thesis introduces two new classifiers incorporating systems of classifiers using Boolean matrix decomposition to handle data in a schema that often occurs in bio- and cheminformatics. The first approach, called MLC-BMaD (multi-label classification using Boolean matrix decomposition), uses Boolean matrix decomposition to decompose the labels in a multi- label classification task. The decomposed matrices are a compact representation of the information in the labels (first matrix) and the dependencies among the labels (second matrix). The first matrix is used in a further multi-label classification while the second matrix is used to generate the final matrix from the predicted values of the first matrix. MLC-BMaD was evaluated on six standard multi-label data sets, the experiments showed that MLC-BMaD can perform particularly well on data sets with a high number of labels and a small number of instances and can outperform standard multi-label algorithms. Subsequently, MLC-BMaD is extended to a special case of multi-relational learning, by considering the labels not as simple labels, but instances. The algorithm, called ClassFact (Classification factorization), uses both matrices in a multi-label classification. Each label represents a mapping between two instances. Experiments on three data sets from the domain of bioinformatics show that ClassFact can outperform the baseline method, which merges the relations into one, on hard classification tasks. Furthermore, large classifier systems are used on two cheminformatics data sets, the first one is used to predict the environmental fate of chemicals by predicting biodegradation pathways. The second is a data set from the domain of predictive toxicology. In biodegradation pathway prediction, I extend a knowledge-based system and incorporate a machine learning approach to predict a probability for biotransformation products based on the structure- and knowledge-based predictions of products, which are based on transformation rules. The use of multi-label classification improves the performance of the classifiers and extends the number of transformation rules that can be covered. For the prediction of toxic effects of chemicals, I applied large classifier systems to the ToxCast™ data set, which maps toxic effects to chemicals. As the given toxic effects are not easy to predict due to missing information and a skewed class distribution, I intro- duce a filtering step in the multi-label classification, which finds labels that are usable in multi-label prediction and does not take the others in the prediction into account. Ex- periments show that this approach can improve upon the baseline method using binary classification, as well as multi-label approaches using no filtering. The presented results show that large classifier systems can play a role in future research challenges, especially in bio- and cheminformatics, where data sets frequently consist of more complex structures and data can be rather small in terms of the number of instances compared to other domains. iv Zusammenfassung Große Klassifikationssysteme sind Algorithmen des Maschinellen Lernens, die mehrere Klassifizierer kombinieren, um die Vorhersagen von Zielvariablen in fortgeschrittenen Klassifikationsproblemen zu verbessern. Obwohl die Daten von Lernprobleme in Bio- und Chemieinformatik häufig in einer Form existieren, die große Klassifikationssysteme ermöglichen, werden sie in diesen Domänen nur selten benutzt. Diese Dissertation führt zwei neue Klassifizierer ein, die Boolesche Matrixzerlegung benutzen, um Daten zu ver- arbeiten, die häufig in Bio- und Chemieinformatik vorkommen. Der erste Ansatz, MLC-BMaD (Multi-Label Classification using Boolean Matrix De- composition), benutzt Boolesche Matrixzerlegung, um Labels in einem Multi-Label Klas- sifikationsproblem zu zerlegen. Die zerlegten Matrizen sind eine kompakte Repräsenta- tion der Informationen in den Labels (erste Matrix) und der Abhängigkeiten zwischen den Labels (zweite Matrix). Die erste Matrix wird in eine weiteren Multi-Label Klas- sifikation verwendet, während die zweite verwendet wird, um die finale Matrix aus den vorhergesagten Werten der ersten Matrix zu generieren. MLC-BMaD wurde auf sechs Standard Multi-Label Datensätzen evaluiert. Die Experimente zeigen, dass MLC-BMaD besonders gute Vorhersagen trifft auf Datensätzen mit vielen Labels und wenig Instan- zen und bessere Vorhersagen trifft als Standard Multi-Label Klassifizierer. Anschließend wird MLC-BMaD erweitert, um einen besonderen Fall des Multi-Relationalen Lernens abzudecken, indem die Labels nicht als einfache Labels, sondern als Instanzen betrachtet werden. Der Algorithmus, genannt ClassFact (Classification factorization) benutzt beide Matrizen in einer Multi-Label Klassifikation. Jedes Label repräsentiert eine Abbildung zwischen zwei Instanzen. Experimente auf drei Datensätzen aus der Bioinformatik zeigen, dass ClassFact auf schwierigen Klassifikationsproblemen bessere Vorhersagen liefert als die Standard Methode, die alle Relationen in eine Relation vereinigt. Weiterhin werden große Klassifikationssysteme auf zwei Datensätze aus der Chemie- informatik angewandt, der erste Datensatz verwendet große Klassifikationssysteme, um Abbauprodukte in der Natur vorherzusagen. Der zweite Datensatz kommt aus der Domäne der Toxizitätsvorhersage. In der Vorhersage von Abbauprodukten erweitere ich ein wissensbasiertes System mit einem Ansatz des Maschinellen Lernens, um eine Wahrscheinlichkeit für einzelne Abbauprodukte vorherzusagen, die auf Transformations- regeln basieren. Multi-Label Klassifikation kann die Vorhersagen verbessern und die Anzahl der verwendeten Transformationsregeln vergrößern. Für die Vorhersage von toxischen Effekten von Chemikalien wurden große Klassifikationssysteme auf dem ToxCast™ Datensatz verwendet. Da die toxischen Effekte aufgrund fehlender Informationen und schiefer Klassenverteilungen nicht leicht vorherzusagen sind, wurde ein Filter Schritt in die Multi-Label Klassifikation eingeführt. Dieser findet Label, die in einer Multi-Label Klassifikation benutzt werden können, und ignoriert andere in der Vorhersage. Experi- mente zeigen, dass dieser Ansatz bessere Vorhersagen liefert als die Standard Methode, die binäre Klassifikation verwendet, sowie Multi-Label Klassifikation ohne Filtern. Die präsentierten Ergebnisse zeigen, dass große Klassifikationssysteme eine Rolle in zukünftigen Forschungsherausforderungen spielen können, speziell in Bio- und Chemie- informatik, wo Datensätze häufig eine komplexere Struktur besitzen und klein in Bezug auf die Anzahl der Instanzen sein können. vi Acknowledgments First of all, I have to thank Stefan Kramer. I visited his lecture on machine learning during my undergrad studies, where he introduced me to the fascinating area of research, later giving me the chance to work on a project for developing an inductive database. After finishing my undergrad studies, he invited me to join his group; I accepted his offer and continued working in this area. He provided the perfect mix of inspiration, guidance and freedom in my research, making the work in academia a real pleasure for me. He supported me in the best way while graduating, letting me find my own way and interests in research, but providing me with just enough specific directions and ideas. In the past years of my research I met, visited and collaborated with many wonderful and inspirational people that influenced my work or my life in one way or another. In the following, I would like to thank them individually for their respective contributions. With Yana Bromberg and Burkhard Rost I had the chance to work on the prediction of the effects of SNPs. Both not only introduced me to a new research area but also provided me with a new perspective on machine learning. I visited the Eawag institute in Zürich twice and worked with Kathrin Fenner on biodegradation pathway prediction during and beyond the visits at Eawag. She introduced me to the field of biodegradation pathway prediction and the numerous interesting challenges in this area. In 2009 I visited the Jožef Stefan Institute in Ljubljana for a month. Many thanks go to everyone in Ljubljana for making my visit the great experience it was, especially Sašo Džeroski who invited me and Darko Aleksovski, with whom I had numerous interesting discussions on multi-label classification. A big part of the last years, I worked on the OpenTox project. In this respect, I want to thank Barry Hardy

Load more