
Machine Learning Methods for Life Sciences: Intelligent Data Analysis in Bio- and Chemoinformatics

Submitted by Diplom-Physiker Johannes Mohr from Berlin

Dissertation approved by Faculty IV - Electrical Engineering and Computer Science of the Technische Universität Berlin in fulfillment of the requirements for the academic degree of Doktor der Naturwissenschaften (Dr. rer. nat.)

Doctoral committee:

Chair: Prof. Dr.-Ing. Jörg Raisch
Reviewer: Prof. Dr. rer. nat. Klaus Obermayer
Reviewer: Prof. Dr. rer. nat. Sepp Hochreiter

Date of the oral defense: 19.12.2008

Berlin 2009

D 83

This thesis is dedicated with love to my wife Sambu and my daughter Yumi.

Summary

In recent years, experimental techniques in the life sciences have advanced rapidly. In addition, the integration of methods from different disciplines has led to the formation of new fields of research, such as imaging genetics, molecular medicine and biological psychology. The experimental progress has been accompanied by a growing need for intelligent data analysis, which aims to analyze a given dataset in the most promising way while incorporating domain knowledge. This includes the representation of the data, the choice of variables, the preprocessing, the model assumptions, the choice of methods for prediction, model selection and regularization, as well as the interpretation of the results. The topic of this thesis is intelligent data analysis in the fields of bioinformatics and chemoinformatics using machine learning methods.

The goal of imaging genetics is to gain insight into genetically influenced psychiatric diseases through association studies between potentially relevant genetic variables and endophenotypes. In this thesis, two different methods for exploratory analysis are developed: the first is based on P-SVM feature selection for multiple regression and models additive and multiplicative gene effects on an endophenotype using a sparse regression model. The second method introduces a new learning paradigm called target selection to model an association between a single genetic variable and a multidimensional endophenotype. Often, several different genetic association models are represented in the literature, and the question is how much evidence a measured dataset provides for each of these models. For this purpose, a model comparison method for imaging genetics based on information criteria is proposed in this thesis.

The goal of quantitative structure-activity relationship (QSAR) analysis is to predict the biological activity of a compound from its molecular structure. Traditionally, QSAR methods are based on a set of molecular descriptors which are used to build a prediction model. In this thesis, a descriptor-free method for 3D QSAR analysis is proposed, which introduces the concept of molecule kernels to capture the similarity between the 3D structures of two molecules. The molecule kernels can be used together with the P-SVM, a recently introduced support vector machine for dyadic data, to build explanatory QSAR models which no longer require an explicit construction of descriptors. The resulting models directly use the structural similarity between the compounds to be predicted and a set of support molecules. The proposed method is applied to QSAR and genotoxicity datasets.

Abstract

In the past few years, experimental techniques in the life sciences have undergone rapid progress. Moreover, the integration of methods from different disciplines has led to the formation of new fields of research, like imaging genetics, molecular medicine and biological psychology. The experimental progress has come along with an increasing need for intelligent data analysis, which aims at analyzing a given dataset in the most promising way while taking domain knowledge into account. This includes the representation of the data, the choice of variables, the preprocessing, the handling of missing values, the model assumptions, the choice of methods for prediction, model selection and regularization, as well as the interpretation of the results. The topic of this thesis is intelligent data analysis in the fields of bioinformatics and chemoinformatics using machine learning techniques.

The goal of imaging genetics is to gain insight into genetically determined psychiatric diseases by association studies between potentially relevant genetic variants and endophenotypes. In this thesis, two different methods for an exploratory analysis are developed: The first method is based on P-SVM feature selection for multiple regression and models additive and multiplicative gene effects on an endophenotype using a sparse regression model. The second method introduces a new learning paradigm called target selection to model the association between a single genetic variable and a multidimensional endophenotype. Often, several different models for genetic association are suggested in the literature, and the question is how much evidence a measured dataset provides for each of them. For this purpose, a method for model comparison in imaging genetics is suggested in this thesis, which is based on the use of information criteria.

The aim of quantitative structure activity relationship (QSAR) analysis is to predict the biological activity of compounds from their molecular structure. Traditionally, QSAR methods are based on extracting a set of molecular descriptors and using them to build a predictive model. In this thesis, a descriptor-free method for 3D QSAR analysis is proposed, which introduces the concept of molecule kernels to measure the similarity between the 3D structures of a pair of molecules. The molecule kernels can be used together with the P-SVM, a recently proposed support vector machine for dyadic data, to build explanatory QSAR models which do not require an explicit descriptor construction. The resulting models make direct use of the structural similarities between the compounds which are to be predicted and a set of support molecules. The proposed method is applied to QSAR and genotoxicity datasets.

Contents

1 Introduction
   1.1 Machine Learning for Life Sciences
   1.2 Organization of this Thesis
   1.3 A Short Primer on Genetics and MRI
       1.3.1 Genetics
       1.3.2 Magnetic Resonance Imaging

2 MLGPA to Model Epistatic Genetic Effects on a Quantitative Phenotype
   2.1 MLGPA
       2.1.1 Generative Model
       2.1.2 Missing Values
       2.1.3 Variable Selection
       2.1.4 Significance Test
       2.1.5 P-SVM Variable Selection for Multiple Linear Regression
       2.1.6 Hyperparameter Selection
   2.2 Simulation Study: Sensitivity and Specificity of MLGPA
   2.3 Application to Genomic Imaging Data
       2.3.1 Medical Background
       2.3.2 Methods
       2.3.3 Results
       2.3.4 Discussion
   2.4 Summary and Conclusions

3 Target Selection to Model a Single Genetic Effect on a Multidimensional Phenotype
   3.1 Introduction
   3.2 Target Selection
       3.2.1 A New Learning Paradigm
       3.2.2 Objective Function for Target Selection
       3.2.3 Learning Method for Target Selection
       3.2.4 Model Evaluation and Significance Testing
   3.3 Application to Genetic Association Analysis
       3.3.1 Introduction
       3.3.2 Experiments
       3.3.3 Results
       3.3.4 Discussion
   3.4 Summary and Conclusions

4 Model Comparison in Genomic Imaging Using Information Criteria
   4.1 Introduction
   4.2 Criteria for Model Selection
       4.2.1 Akaike Information Criterion
       4.2.2 Bayesian Information Criterion
   4.3 Application to Genomic Imaging
       4.3.1 Introduction
       4.3.2 Materials and Methods
       4.3.3 Results
       4.3.4 Discussion
   4.4 Summary and Conclusions

5 Molecule Kernels for QSAR and Genotoxicity Prediction
   5.1 QSAR
       5.1.1 Introduction
       5.1.2 Molecular and Structural Formulas
       5.1.3 Geometry Optimization
   5.2 Molecule Kernels
       5.2.1 Definition
       5.2.2 Properties
       5.2.3 Model Building and Prediction
   5.3 Explanatory QSAR Models from Molecule Kernels
       5.3.1 Building an Explanatory Model
       5.3.2 …
   5.4 Application to QSAR Analysis
       5.4.1 QSAR Datasets
       5.4.2 Assessment of Generalization Performance
       5.4.3 Results
   5.5 Application to Genotoxicity Prediction
       5.5.1 Chromosome Aberration Dataset
       5.5.2 Results
   5.6 Summary and Conclusions

A Appendix
   A.1 Hyper-Parameter Optimization on QSAR Datasets

Bibliography

Index

1 Introduction

1.1 Machine Learning for Life Sciences

Life sciences encompass all research on the structures of living beings and the processes in which living beings are involved. Apart from general biology, this also includes disciplines like medicine, genetics, pharmacy, toxicology, nutrition science, biochemistry, biophysics, psychology and neurobiology. In the past few years, life science related technologies have made huge progress. For example, the development of functional magnetic resonance imaging in the early nineties was a breakthrough in the field of brain imaging, since it enabled researchers to noninvasively record brain signals at a resolution of a few millimeters. An example from molecular biology is the gene expression microarray, a high-throughput technique which allows the expression levels of thousands of genes to be assessed in parallel. Moreover, experimental techniques from different fields are now often combined, allowing the integration of different measurements in order to uncover the mechanisms of life and the natural environment. Examples are the recently evolved field of genomic imaging, in which brain imaging techniques like (functional) magnetic resonance imaging are combined with genotyping (Mohr et al., 2006), or the combined analysis of psychological scores and genetic markers (Lohoff et al., 2008). As a consequence of this synergy among different subjects, the field of life sciences has become increasingly interdisciplinary.

These advances in experimental techniques have come along with the need for intelligent data analysis. In most studies, multiple measurements and other attributes are combined, resulting in heterogeneous datasets with a large number of variables, which might be of different types, have different data ranges, or contain missing values. Frequently, the available sample size is relatively small, because some of the measurements are either costly or difficult to obtain, as is often the case in genetics (Mohr et al.,


2008d) and medical research (Mohr et al., 2008b). The final goal of data analysis is either to elicit possible relations between the measured variables or to build predictive models. Usually, only a subset of the variables is relevant for the given problem, and both model selection and model regularization are required. Given a specific research question, the task of intelligent data analysis is to apply methods which are most useful and revealing for a given problem domain and dataset. Specifically, it needs to be decided which variables are included in the analysis, what preprocessing is applied, what is done with missing values, what model assumptions are made a priori, which data analysis methods and statistical tests are applied, what model selection and regularization techniques are employed and, finally, how the results are interpreted. All this should be done based on the available domain knowledge and the data characteristics.

If the analysis is confirmatory, i.e. the goal is to test a given hypothesis, only the variables involved in this hypothesis need to be measured, and standard statistical tests are sufficient. However, if previous studies have led to ambiguous results and several candidate hypotheses are available, then all these different models need to be compared on the current dataset. In general, this requires adequate procedures for model comparison. A third case is the fully exploratory analysis, where a large number of measurements and attributes are collected which, according to the experience and theories of the domain expert, are of potential importance to the research question. Since there is a large number of possible hypotheses (or models), much thought has to be put into the choice of analysis techniques and model assumptions in order to avoid data dredging and multiple testing issues.

In the life sciences, many objects of research come naturally as structured data, i.e. they can be expressed as a sequence, tree or graph. For example, DNA can be viewed as a sequence of nucleotides, and chemical molecules can be represented as graphs. However, most statistical or pattern analysis methods are based on a representation of the data as vectors in a Euclidean space. For this reason, structured data is often transformed into Euclidean vectors by describing the objects through a series of characteristics. However, this description only captures certain aspects of the original objects, and some of the information which was contained in the structure is usually lost. Another problem is that it is usually not clear a priori which characteristics or features of the objects are relevant for the task. Moreover, the descriptor variables are usually quite heterogeneous, since they capture very different aspects of the original objects. Thus it is doubtful whether it is justified to embed these variables into a Euclidean space and use a Euclidean distance measure. Therefore, it is desirable to derive methods which can directly work with structured data (Bakhir et al., 2007).

This thesis focuses on intelligent data analysis in different areas of the life sciences. Four machine learning methods are proposed and applied to problems in bioinformatics and chemoinformatics. Similar to the life sciences, the still very young research area of machine learning has undergone a rapid advancement over the last three decades. This development started at the end of the eighties, when data-driven approaches were applied to problems that the artificial intelligence field of computer science had previously attempted to solve by rule-based methods, such as expert systems. The principles of

neural information processing in the brain inspired artificial neural networks (ANNs), where a parallel architecture of a large number of simple, interconnected units is used to obtain a distributed representation of information, and where the adjustment of the connection weights is data-driven and follows local learning rules. The theoretical foundations of neural networks were already laid in the forties, fifties and sixties of the last century (McCulloch and Pitts, 1943; Rosenblatt, 1958), but a major breakthrough came in the eighties, when principles from physics (Hopfield, 1982; Kirkpatrick et al., 1983; Ackley et al., 1985) and information theory (Linsker, 1988) were applied to ANNs. Moreover, new architectures, like RBF nets (Broomhead and Lowe, 1988), and new learning algorithms based on cost functions, such as the backpropagation algorithm for multilayer perceptrons (Rumelhart et al., 1986), were proposed. Another important milestone of the 1980s was the development of self-organizing maps (Kohonen, 1982b,a), using competitive learning. In the nineties, concepts from statistical learning theory (Vapnik and Chervonenkis, 1991) led to large-margin methods (Vapnik, 1998), such as support vector machines, and other kernel methods (Schölkopf and Smola, 2002). Several mathematical areas, for example graph theory, probability theory, statistics and estimation theory, gave rise to new sub-areas of machine learning, such as independent component analysis (Hyvärinen et al., 2001), Gaussian processes (Williams and Rasmussen, 1996) and probabilistic graphical models (Jordan et al., 1998; Edwards, 2000; Cowell et al., 1999). While originally most machine learning algorithms were designed for vectorial data, an increasing number of methods for structured data, like sequences, trees, and graphs, have been developed in recent years (Bakhir et al., 2007).

Like the life sciences, machine learning has thus evolved to be a very diverse and interdisciplinary field of research. Similar to the effect the new technological developments had on the experimental life sciences, the increasing availability of computational power facilitates the application of computationally intensive machine learning techniques for intelligent data analysis. Further details on the history of machine learning can for example be found in (Haykin, 1994; Duda et al., 2001; Ripley, 1996). In the following, a brief overview of some principal concepts of machine learning will be given.

Machine learning is an area of research which is concerned with the development of mathematical principles and algorithms to find a suitable mathematical description of some process or phenomenon based on collected or measured data. Commonly, the observations or measurements are subject to noise or stochasticity. In the context of machine learning, they are interpreted as outcomes of random experiments mapped by random variables to a set of numerical values. The purpose of building a model can be to learn something about the relations between the random variables, to understand the underlying mechanisms or to make predictions on new data. While in the first two cases the model has to be explicit enough to allow interpretation, in the third case a 'black box' model would also be sufficient. The process of building models can be based on either inductive or deductive principles. In deductive reasoning, general principles are used to infer specific conclusions. Inductive reasoning, on the other hand, uses collected information to derive a general principle. The process of data-driven modeling

using computational statistical techniques is an example of inductive learning.

Broadly, one differentiates between supervised learning, where a teacher provides target values or a reinforcement signal, and unsupervised learning, where the goal is to find structure in the data while no external feedback is available. Examples of supervised learning methods are neural networks and support vector machines; examples of unsupervised learning methods are clustering algorithms, self-organizing maps and component analysis. These two types of learning are combined in semi-supervised learning, where both labeled and unlabeled examples are employed. In supervised learning, an explicit model is constructed from labeled training data. Once the model is constructed, the training data is no longer needed, since prediction on the test data is done using the model alone. If the goal is only prediction, transductive learning can also be used, where inference is based directly on the knowledge of labeled and unlabeled data at prediction time.

In every modeling process, some assumptions have to be made, either explicitly or implicitly. This usually involves assumptions about the dataset used to build the model, e.g. the assumption that the data points are independent and identically distributed (i.i.d.), and assumptions about the underlying process which generated the data. In principle, the whole scientific process relies on a series of decisions and assumptions, starting with the phrasing of a research question, followed by the design and conduct of an experimental or observational study and the final analysis of the data. Among other things, a scientist chooses which variables need to be measured or collected to investigate a certain phenomenon. At the point of data analysis, the researcher has to decide which of these variables should actually be included in the analysis and in what fashion, what kind of models will be considered and which statistical methods should be employed for model building and testing. While for unsupervised learning the quality of a model is usually judged by subjective criteria (depending on the specific purpose), in supervised learning an objective criterion for model comparison is provided by the prediction performance on yet unseen data from the same distribution as the training data.

An important point is that a model needs to be falsifiable. The scientific modeling process as a whole consists of an iterative procedure: Based on some observations, a model is built. The model is able to make predictions, which can be compared with the actual observations. The result is then used to validate and, if necessary, to improve the model. No matter how good the modeling approach, one has to bear in mind that statistics cannot be used to prove causality ('correlation does not imply causation'). This is due to the fact that in real-world situations there is always an infinite number of factors which could influence the random variable of interest. While a thoroughly working researcher will always try to consider all factors or covariates which might influence the target, it still remains unknown whether all relevant variables have been measured and included in the analysis. A statistical dependency between two variables could either indicate a direct causal effect or result from a common influence of an unmeasured 'hidden' variable.

Here it is important to note that standard statistical procedures, like analysis of variance and regression analysis, as well as the machine learning methods proposed

in this thesis, always presume a specific effect model, specifying input variables (also called independent or predictive variables) and a target variable (also called dependent or output variable). The assumed direction from input to target does not necessarily correspond to the direction from cause to effect.

1.2 Organization of this Thesis

The main part of this thesis is divided into four chapters. In the first three of these chapters, novel approaches for bioinformatics are suggested: three machine learning methods for genotype-phenotype analysis are introduced and applied to problems in genomic imaging. The final chapter focuses on learning with structured data in the area of chemoinformatics. A kernel method for molecules is proposed and applied to quantitative structure activity relationship (QSAR) analysis and genotoxicity prediction.

The goal of genotype-phenotype association studies is to model potential genetic influences on certain observed characteristics of a living being. In this thesis, the special case of population association studies that involve a set of candidate polymorphisms and some quantitative phenotypes is considered. Conventional statistical methods can either not model additive effects or cannot properly account for multiplicative interactions if the sample size is small. In addition, they require corrections for multiple testing, which leads to a reduction in power. Moreover, conventional methods are not suitable for analyzing multidimensional phenotypic patterns associated with the same genetic variable. In order to overcome these limitations, two novel statistical tests based on machine learning techniques are proposed in chapters 2 and 3 of this thesis.

Chapter 2 introduces a novel method for detecting both additive and multiplicative (epistatic) effects of several polymorphisms on a real-valued phenotype (Mohr et al., 2008b). The method, called machine learning genotype-phenotype analysis (MLGPA), assumes a multiple regression model in which a phenotype is modeled as a linear combination of genetic variables under the assumption of Gaussian noise. The prediction function is chosen by a learning machine which combines the Potential Support Vector Machine (P-SVM) for variable selection with ordinary least-squares regression. The model complexity is adjusted using an information criterion. A leave-one-out cross-validation estimate of the generalization error is then used as test statistic, and the hypothesis of independence between predictors and target is tested via a label permutation test. The method is applied to study synergistic effects of the dopaminergic and glutamatergic system on hippocampal volume in alcohol-dependent patients.

In chapter 3, an alternative model assumption is made, in which a single genetic variable influences a multidimensional phenotype. For this purpose, a new learning paradigm called target selection is introduced (Mohr et al., 2008e). The target selection paradigm is constituted by a generative model, in which there is a set of discrete random variables Y, one of which (Yg) influences the distribution of a second set of variables X. This gives rise to an optimal classification model, which specifies the class probabilities of the relevant target variable Yg given the input X. The learning task in target selection is to recover this model as well as possible from a given dataset

by selecting one of the target variables and using a probabilistic classification method to fit a parametric model. An objective function is proposed for this setting, which is derived from the concept of mutual information. Moreover, a learning method for target selection is suggested, based on a probabilistic C-SVM as classification method. The new paradigm and the proposed algorithm for target selection are applied to genotype-phenotype association analysis. For this purpose, the vector of phenotypes is identified with the input variables of a learning machine, while the binary genetic variables are identified with the targets. A label permutation test is used to assess whether an established dependency between the multidimensional phenotype and one of the genetic variables is significant. The proposed method is evaluated on the dataset from the study in chapter 2, which is re-analyzed with the target selection method under the alternative model assumptions described above.

The methods in chapters 2 and 3 limit the use of prior knowledge to the choice of candidate markers and phenotypes and some assumptions about the general nature of the models. Often, however, there is already a limited set of certain candidate hypotheses (or models) available, either from previous studies or the literature. Then the goal is to compare the set of a priori models based on a given dataset. In the framework of intelligent data analysis, such a study can be seen as part of a larger modeling process, in which multiple hypotheses are entertained. Whenever relevant experimental or observational data is collected, it will lend more support to some of the hypotheses and less support to others. Repeated collection of data (also by other research groups) will then over time lead to a refinement of the 'working set' of model hypotheses. In chapter 4 of this thesis, information criteria (ICs) are employed to compare a given set of candidate regression models on a genomic imaging dataset (Mohr et al., 2008c). Two theoretically well-founded information criteria, the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC), are used to compare the different candidate hypotheses. In contrast to a model comparison based on likelihood-ratio tests, the model comparison based on ICs does not require the models to be nested. Moreover, all models can be compared at once; there is no need for pairwise comparisons.

The aim of quantitative structure activity relationship (QSAR) analysis is to predict the biological activity or toxicity of a chemical compound based on its molecular structure. Traditionally, a set of molecular descriptors is extracted from the structure, embedded into a Euclidean vector space and used for building a predictive model. However, for structured data like molecules it is usually advantageous to employ algorithms which can work directly with the structural representation itself. Therefore, in chapter 5 a novel QSAR approach is proposed that can directly work on molecular structures and is independent of the spatial pre-alignment of the compounds (Mohr et al., 2008a). It introduces the concept of a molecule kernel to measure the similarity between the 3D structures of two molecules. Predictors can be built using the molecule kernel in conjunction with the potential support vector machine (P-SVM), a recently proposed machine learning method for dyadic data. The resulting models make direct use of the structural similarities between the compounds in the test set

and a subset of the training set, and do not require an explicit descriptor construction. A problem of the descriptor-based methods is that they work in the quite abstract descriptor space, so that the resulting models can be used for prediction, but do not allow an insight into the underlying mechanisms and are hard to interpret. In drug development, however, there is a high interest in explanatory models, as they can be used by the expert to suggest specific modifications which influence the activity or toxicity of a compound. In this work, it is shown how the molecule kernel method can be used to obtain such explanatory models. Furthermore, a method for visualizing these models is suggested. The molecule kernel method is applied to QSAR and genotoxicity (chromosome aberration test) prediction, and the generalization performance is compared to state-of-the-art techniques. This evaluation shows that the proposed method outperforms its competitors in both application areas.

1.3 A Short Primer on Genetics and MRI

In the following, some concepts from genetics which are used in chapters 2, 3 and 4 will be briefly reviewed. Since the phenotypes analyzed in these chapters are based on magnetic resonance imaging (MRI) measurements, a quick explanation of the basic physical principles of (functional) MRI is also given. This is not meant to be an exhaustive introduction to these fields. Instead, the purpose of this primer is to familiarize readers not working in these areas with some basic concepts and technical terms used in this thesis.

1.3.1 Genetics

The total genetic information of a living being is referred to as the genome. For humans, the genome consists of 23 pairs of chromosomes, which are contained in every cell. The main chromosomal component is deoxyribonucleic acid (DNA), which is the carrier of genetic information. The molecular structure of DNA consists of two intertwined chains held together by hydrogen bonds. The main elements of DNA are the purine bases guanine (G) and adenine (A) and the pyrimidine bases cytosine (C) and thymine (T), where G pairs exclusively with C, and A with T. During the process of transcription, the DNA is transformed (by an enzyme called RNA polymerase) into messenger ribonucleic acid (mRNA), which is a single-stranded polynucleotide that is complementary to the DNA, but uses uracil (U) instead of thymine (T).

In addition to the sequences of DNA which code for proteins, the exons, there are also regulatory sequences which direct and regulate protein synthesis. They lie either before the coding sequence (5' untranslated region, or 5'UTR) or behind the coding sequence (3' untranslated region, or 3'UTR). In addition, there are also non-coding intragenic regions, or introns, which are removed during the transcription of DNA into mRNA by a process called splicing.

The mRNA is then translated by ribosomes into proteins that have biochemical or structural functions, a process called translation. The sequences of nucleotides code for

20 different amino acids, which form the building blocks for proteins. The amino acids combine in sequence to form a polypeptide chain, which then folds into a specific 3D shape. Proteins are made up of one or more such chains, and may act as catalysts (enzymes).

The process from DNA to proteins is summarized in Figure 1.1. The field of genomics is concerned with the study of the genome. In transcriptomics, the expression level of mRNAs in a given cell population is studied. For example, DNA microarrays are used as a high-throughput technique to reveal the relative amounts of mRNA in a cell. However, the number of protein molecules which are synthesized based on a given mRNA molecule not only depends on the mRNA expression level but also on the translation-initiation features of the mRNA sequence. The set of proteins expressed at a given time under defined conditions is studied in the area of proteomics.

Figure 1.1: Genomics, Transcriptomics and Proteomics
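As a toy illustration of the transcription and translation steps just described, the following Python sketch maps a DNA coding strand to mRNA and then to amino acids. The sequence and the deliberately tiny codon table are made up for this example; the full genetic code comprises 64 codons.

    # Toy sketch of transcription (DNA -> mRNA) and translation (mRNA -> protein).
    # Only four codons of the genetic code are listed here for brevity.
    CODON_TABLE = {"AUG": "Met", "UUU": "Phe", "GGC": "Gly", "UAA": "STOP"}

    def transcribe(dna_coding_strand):
        # The mRNA carries the coding sequence with uracil (U) in place of thymine (T).
        return dna_coding_strand.replace("T", "U")

    def translate(mrna):
        protein = []
        for i in range(0, len(mrna) - 2, 3):      # read the message codon by codon
            amino = CODON_TABLE.get(mrna[i:i + 3], "?")
            if amino == "STOP":                   # a stop codon ends translation
                break
            protein.append(amino)
        return protein

    mrna = transcribe("ATGTTTGGCTAA")
    print(mrna, translate(mrna))                  # AUGUUUGGCUAA ['Met', 'Phe', 'Gly']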

A single nucleotide polymorphism (SNP) represents an exchange of one single nucleotide at a specific locus within the genome sequence (see Fig. 1.2) that is transmitted to all offspring. While in principle there could be four different alleles (forms), corresponding to the four nucleotides, almost all common SNPs have a low mutation rate and therefore only two alleles. For such a diallelic SNP, one distinguishes the more frequent form (the major allele) from the less frequent form (the minor allele) within a population. For a variation to be considered a SNP, the minor allele must occur in at least 1% of the population. SNPs can be found at approximately every 300-1000 base pairs and form the main basis of inter-individual differences in humans. More than 10 million common SNPs exist in the human genome.

Loci close to each other on a chromosome are more frequently inherited together (linkage), because they are less likely to be separated by chromosomal crossover. Crossover occurs during meiosis, when matching regions on matching chromosomes break and then reconnect to the other chromosome. This breakage and rejoining of parental

chromosomes results in an exchange of genes and is called genetic recombination. A non-random association of the alleles at two loci (not necessarily on the same chromosome) within a population sample is called linkage disequilibrium (LD).

Figure 1.2: Single Nucleotide Polymorphism. On the left, the original double-stranded DNA is shown; on the right, the mutated DNA with the introduction of a SNP, in this case a substitution from C to A. The corresponding base on the second chain is altered accordingly.

A haplotype is a combination of alleles at different loci on the same chromosome. Since each individual has two chromosomes, each person has two haplotypes, the set of which is often called the diplotype. In the measurement of a SNP (genotyping), usually the phase information cannot be obtained, which means that it cannot be determined which allele came from which chromosome. Thus the measurement of a SNP results in three different genotypes: two homozygous (AA, aa) and one heterozygous (Aa), where A refers to the major and a to the minor allele. Other genetic markers, like microsatellites, can have more than two alleles, and have more than three different genotypes. A haplotype block is a set of closely located loci which are inherited together. These concepts are illustrated in Figure 1.3. Much of the human genome is organized in a block-like structure, with no or little recombination occurring within the blocks, only at their boundaries.

Since experimental techniques to directly obtain haplotype information are too costly, the haplotype configuration for each individual needs to be inferred by statistical methods. Various methods for haplotype reconstruction (phasing) have been proposed, which, depending on the optimization criterion and algorithm used, yield slightly different results. Moreover, the assignment of haplotypes to individuals is probabilistic. While knowing the true diplotype would be more informative than knowing only the genotype, this is usually no longer the case for the inferred diplotype, due to the phase uncertainty (Balding, 2006). However, for a high linkage disequilibrium between markers, the uncertainty in the individual haplotype assignments is quite low, and it is valid to use the most likely haplotype configuration in the further analysis.

In this thesis, only data from population association studies is analyzed. In population association studies, a number of candidate polymorphisms are genotyped for a collection of unrelated individuals. Unrelated means that the relationships between the subjects are assumed to be only distant. The data used in a population study should always be tested for Hardy-Weinberg equilibrium (HWE), which holds at a locus in a population when the two alleles within an individual are statistically independent. A deviation from HWE could for example be caused by inbreeding, or by population

Figure 1.3: Haplotype, Diplotype and Genotype

stratification, where the sample contains subgroups of individuals which are on average more closely related to each other than the rest of the population (Balding, 2006). Loci for which deviations from HWE are found using the χ2 test or Fisher's exact test are usually discarded.

A phenotype is an observable characteristic of an organism. Genotype-phenotype association studies aim at identifying genetic patterns that influence certain phenotypes. In general, this could be genetically inherited traits of a person, e.g. the eye color. In a medical context, however, the phenotypes usually consist of a status variable, indicating whether a person is suffering from a certain disease, or a so-called endophenotype, a quantitative trait associated with the disease, e.g. blood pressure. In the first case, the phenotype is encoded as a nominal variable; in the second case, it is represented by a quantitative variable. Diseases with a genetic component are divided into two categories: While in simple diseases a single genetic factor is involved, in complex diseases an influence of several genetic factors is assumed, interacting with each other and the environment.

In general, a genotype-phenotype analysis can be based either on SNPs or on haplotypes. For SNPs, there are two possible models of allele interaction, the dominant and the linear model. In the dominant model, the effect depends on whether the dominant allele is present or not. In an exploratory analysis, it is usually not clear which allele is dominant; therefore, a dominant model needs to be considered for each of the two alleles. In the linear model, it is assumed that the heterozygous case lies directly between the two homozygous ones. Even if the SNP that actually causes the effect

on a phenotype is not genotyped, it might be in LD with one of the genotyped SNPs, in which case a genotype-phenotype association can be found, though it is not directly causal.

Analysis based on haplotypes makes use of the block-like structure of the genome, and leads to a dimension reduction by accounting for the high correlations of polymorphisms within such a block. The reason for this is that some of the many possible allele combinations at the loci within a haplotype block will not occur, or only very rarely. Because haplotypes include frequently occurring combinations of alleles at different loci, they can implicitly account for some epistatic effects among the SNPs. However, haplotype-based models are subject to the uncertainty in the phasing. It is also not quite clear how to deal with rare haplotypes: Leaving them out leads to a loss of information and possibly a loss in power, while including them increases the dimensionality of the problem. Moreover, the block boundaries will depend on the phasing algorithm used, its parameters, the spread of the SNPs across the gene, and the sample population. Also, the block structure is usually not hard, but soft, i.e. there can be correlations between SNPs from different blocks, and some recombinations could have occurred within a block. Therefore, the hard block structure is usually an idealized view of reality which is valid only as an approximation.
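To make the HWE check mentioned above concrete, the following sketch computes the χ2 statistic for a diallelic locus from its genotype counts. This is a minimal illustration; the counts are made up, and the thesis does not prescribe a particular implementation.

    # Chi-square test for Hardy-Weinberg equilibrium at a diallelic locus.
    # Under HWE, the expected genotype frequencies are p^2, 2p(1-p), (1-p)^2.
    from scipy.stats import chi2

    def hwe_chi2(n_AA, n_Aa, n_aa):
        n = n_AA + n_Aa + n_aa
        p = (2 * n_AA + n_Aa) / (2 * n)                 # estimated frequency of allele A
        expected = [n * p**2, 2 * n * p * (1 - p), n * (1 - p)**2]
        observed = [n_AA, n_Aa, n_aa]
        stat = sum((o - e)**2 / e for o, e in zip(observed, expected))
        return stat, chi2.sf(stat, df=1)                # one degree of freedom

    stat, p_value = hwe_chi2(n_AA=50, n_Aa=40, n_aa=10)
    print(stat, p_value)   # loci with a significant deviation would be discarded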

1.3.2 Magnetic Resonance Imaging

This thesis focuses on the study of complex psychiatric diseases, like alcohol dependence or schizophrenia. In complex psychiatric diseases, associations with the disease status are often obscured, since the disease risk is probably influenced by a large variety of genetic and environmental factors. Therefore, current psychiatric research aims at analyzing genetic influences on intermediate phenotypes more directly related to neurobiology. One way to obtain such endophenotypes is the use of brain imaging techniques, such as magnetic resonance imaging (MRI) and functional magnetic resonance imaging (fMRI).

Magnetic resonance imaging (MRI) measurements are based on the fact that protons have a quantum mechanical property called spin, which leads to a magnetic moment. In the strong static magnetic field of an MRI scanner, the spins align either parallel or antiparallel to the field vector B0. The spin magnetic moments precess around an axis along the direction of the field at the Larmor frequency ω0, which is proportional to the strength of the external magnetic field and is given by

ω0 = γ ‖B0‖ ,    (1.1)

where γ is the gyromagnetic ratio.
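As a worked example of eq. (1.1): for protons, γ ≈ 2.675 × 10^8 rad s^-1 T^-1, so in a static field of B0 = 1.5 T (a field strength chosen here purely for illustration) one obtains ω0 ≈ 4.0 × 10^8 rad/s, corresponding to a precession frequency ν0 = ω0/(2π) ≈ 64 MHz, i.e. in the radio-frequency range.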

Figure 1.4: Spin-Echo Sequence

Since the parallel alignment corresponds to a lower energy level and is preferred, there is a small net magnetic moment. An electromagnetic radio-frequency (rf) pulse at the Larmor frequency is able to transfer some of its energy to the protons, lifting them into the higher energy state. This leads to a reduction of the magnetization in the direction of B0 (longitudinal magnetization). At the same time, the rf pulse synchronizes the phase of the precession, leading to a magnetic moment in the plane orthogonal to the external field (transversal magnetization). A pulse of exactly the right duration to tilt the net magnetic moment orthogonal to the direction of B0 is called a 90° pulse. The transversal component of the magnetization can be measured as an oscillatory signal induced in the receiver coil, which is called free induction decay (FID).

The moment the radio frequency pulse is switched off, the longitudinal magnetization increases again, a process described by the longitudinal relaxation time constant T1 (ranging from 300 to 2000 ms). The reason for this lies in the so-called spin-lattice relaxation: the spins precessing at the Larmor frequency tend to transfer energy to the magnetic fields of their surroundings (the lattice). This energy transfer is most effective if the precession of the magnetic fields in the lattice is close to the Larmor frequency, and if the lattice molecules are medium-sized and slow-moving. Therefore, the composition of the surroundings influences T1: the small and fast-moving water molecules lead to a long T1, whereas fat molecules result in a short T1.

Independently of the longitudinal relaxation, perturbations by the magnetic fields of other spins and the macromolecules in the tissue cause a random change of the precession frequency of each proton, leading to a dephasing of the proton spins. This decay process is characterized by the transversal relaxation time T2, which lies in the range of 30-150 ms. T2 in surroundings consisting of water is longer than in less pure liquids containing larger molecules. The reason for this is that larger molecules move less quickly, so their local magnetic fields do not cancel out as well. In addition to this so-called spin-spin relaxation, local inhomogeneities of the macroscopic field and susceptibility variations of the tissue (e.g. at air/tissue interfaces, component changes) cause the total magnetization in the volume element to decrease. This effect is characterized by the much shorter relaxation time T2* ≈ (1/3) T2, which is observed as the FID signal.

In a so-called spin-echo sequence, a 90° rf pulse is followed by a 180° rf pulse (Figure 1.4). After the 90° rf pulse, the spins start to dephase. The 180° rf pulse inverts the spin magnetic moments, so that the dephasing spins run back into phase, causing the

Figure 1.5: Dephasing of the spins due to T2* decay and rephasing after the 180° pulse of the spin-echo sequence

Figure 1.6: Multi-Spin-Echo Sequence

spin-echo signal (Figure 1.5). Important sequence parameters are the time to echo (TE), which corresponds to the time from the 90° pulse to the echo, and the time to repeat (TR), which indicates the time to a repetition of the sequence. If a sequence of several 180° pulses is used to rephase the dephasing protons, several spin echoes are obtained, and one speaks of a multi-spin-echo sequence (Figure 1.6). The amplitudes of the echoes are independent of the inhomogeneous fields, which can be considered stationary during the TE, and the signal decrease from one echo to the next is given by pure T2 decay.

By varying TE and TR, different weightings (e.g. T1, T2, proton-density) can be realized, corresponding to different tissue contrasts. In T1-weighting, the white matter in the brain appears white, the gray matter gray and the cerebrospinal fluid (CSF) dark, while in T2 and T2* weighted images these contrasts are reversed. Since different tissues have different relaxation times and proton densities, specific pulse sequences can be used to obtain different contrasts. By additional application of magnetic field gradients, 3D images coding for the local signal strength can be obtained. Magnetic resonance imaging (MRI) can thus be used to obtain anatomical 3D images of the brain

with a good soft tissue contrast.

Several “fast” sequences have been developed which allow a more rapid acquisition of MR images. These are often based on so-called gradient echoes, which are generated as follows: A short application of a gradient field amplifies local field inhomogeneities, speeding up the dephasing of the protons. Application of the same gradient field in the opposite direction leads to a rephasing of the protons, and an echo signal. Since the time-consuming application of the 180° pulse is avoided, this allows for a much shorter TR than the spin-echo sequence. Moreover, instead of a 90° pulse, the fast sequences use smaller flip angles α (10° ≤ α ≤ 35°), so that the longitudinal magnetization is not totally canceled out. This leaves enough longitudinal magnetization for the next pulse even for a short TR. In addition, for a smaller flip angle the rf pulse duration is much shorter.

With the special technique of functional magnetic resonance imaging (fMRI), the blood-oxygenation-level-dependent (BOLD) signal is measured. This signal measures the haemodynamic response and depends on many factors, e.g. the ratio of oxyhemoglobin to deoxyhemoglobin, the blood volume and the blood flow. It is generally assumed that the BOLD signal is an indirect measure of neural activity. The rationale behind this is that an increased neural activity leads to an increased demand for oxygen, which is overcompensated by the vascular system, increasing the amount of oxygenated hemoglobin relative to deoxygenated hemoglobin, which in turn increases the BOLD signal. The rapid technique of echo-planar imaging (EPI) can be used to produce functional MRI images at video rates.

2 MLGPA to Model Epistatic Genetic Effects on a Quantitative Phenotype

2.1 MLGPA

In this chapter, a special case of population association analysis is considered, in which the goal is to test for potential associations between a small number of candidate polymorphisms and a quantitative phenotype while taking epistatic effects (additive and multiplicative interactions between alleles) into account. Whereas the term epistasis was initially used to describe the suppression of a phenotypic characteristic of one gene by another, here the expression 'epistatic effects' refers to the more recent denotation describing common gene-gene interactions of both additive and multiplicative nature.

Association analysis for single SNPs can be conducted with analysis of variance (ANOVA), linear regression or tests for group differences. However, if epistatic interactions between several SNPs need to be considered, the analysis becomes more involved. For ANOVA, the inclusion of such effects leads to a large number of parameters which need to be estimated. If the number of parameters exceeds the number of examples, the parameters are no longer estimable. Moreover, with an increasing number of included SNPs and interaction variables, the results become increasingly unreliable and are difficult to interpret. Also, in order to assess which groups are different from which, a large number of follow-up tests needs to be carried out, which requires corrections for multiple testing. An alternative to ANOVA, which can also be applied in the case of many genetic variables and few examples, is to conduct separate tests for group differences, one for each single SNP and each interaction variable. This requires separate tests for both models of allele interaction (dominant and linear), since in an exploratory analysis both models are possible. Therefore, a large


number of tests has to be carried out, for which multiple testing corrections need to be applied. Two severe drawbacks of this approach are its inability to model additive effects between several genetic variables and the necessity for multiple testing corrections, which lead to a reduction in statistical power. These problems often discourage researchers from conducting additional analyses beyond single SNP tests, even though such an analysis might be very revealing (Balding, 2006).

In order to overcome these issues, a method for finding potential genotype-phenotype associations based on machine learning techniques is introduced in this chapter. The so-called Machine Learning Genotype-Phenotype Analysis (MLGPA) makes it possible to test for associations between a set of candidate genetic markers and a quantitative phenotype whenever there is no prior hypothesis about which and how many of the candidate markers are potentially involved and which model of allele interaction should be assumed. In MLGPA, dependencies are detected by using an unbiased estimate of the generalization error of a learning machine as test statistic. A significantly low estimated generalization error implies a dependency between genes and phenotype which the learning machine was able to learn. In this case, a regression model learned by the learning machine on the dataset can be expected to generalize well to unseen examples.

2.1.1 Generative Model

In MLGPA, the following generative model for the data is assumed: For the case where a dependency between the measured genetic variables and the phenotype exists, the phenotype is assumed to be generated by a linear combination of a sparse set of genetic regressors (from the larger set of measured genetic variables) and additive noise. The regressors are binary (0/1) variables encoding either the absence/presence of an allele of a single SNP or the simultaneous presence of two specific alleles of two separate SNPs ('interaction variable'). Since each SNP is encoded by two binary regressors indicating the presence of the major or minor allele, which can be combined in a linear combination, this model includes both the dominant and the linear model of allele interaction. The sparseness of the set of regressors accounts for the assumption that from a measured set of genetic variables only very few (if any) are likely to influence a given phenotype. For the case where no dependency between the measured genetic variables and the phenotype exists, the phenotype is assumed to be pure noise. Due to this generative model, the functional dependency between a set of independent variables (genetic variables) and a quantitative dependent variable (phenotype) is described in MLGPA by a multiple linear regression ansatz,

ŷ(x^(k)) = Σ_{i=1}^{d} a_i · x_i^(k) + c ,    (2.1)

where ŷ(x^(k)) denotes the value of the dependent variable predicted for case k, x_i^(k) the value of the i-th input variable for case k, a_i the corresponding regression coefficient, c the offset and d the number of input variables. Therefore, the regression coefficients

parameterize a set of prediction functions. Due to the sparseness assumption, most regression coefficients are assumed to be zero, motivating the search for parsimonious models via the use of variable selection.

Note that the above generative model does not correspond to an ANOVA model. An ANOVA model conventionally does not include complex interaction terms for factors unless it also includes all simpler terms for those factors, so that usually the full set of parameters is fitted. In contrast to this, MLGPA does not conduct an analysis of variance, but is based on the assumption of a sparse generative model, which should be recovered from the data as well as possible in order to achieve a good generalization performance. Therefore, variable selection is used to exclude unnecessary parameters from the model before fitting the coefficients. Thus the inclusion of an interaction variable in the regression model does not necessarily imply the inclusion of the variables of the corresponding single alleles, unless this is warranted by the additional information they provide about the phenotype. Obviously, the regression coefficients of MLGPA do not correspond to main or interaction effects of a traditional ANOVA model.
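To make the encoding described above concrete, the following sketch constructs the binary regressors for a single subject: two allele-presence variables per SNP plus pairwise interaction variables. This is a minimal illustration, not the thesis' own code; the SNP names, genotype strings and naming scheme are made up.

    # Binary genotype encoding: per SNP, 0/1 indicators for the presence of the
    # major allele (A) and the minor allele (a); 'interaction variables' encode
    # the simultaneous presence of two specific alleles of two separate SNPs.
    from itertools import combinations

    def encode_genotypes(snps):
        # snps: dict mapping SNP name -> genotype string ("AA", "Aa" or "aa")
        features = {}
        for name, gt in snps.items():
            features[name + "_major"] = int("A" in gt)
            features[name + "_minor"] = int("a" in gt)
        # multiplicative terms for alleles belonging to different SNPs
        for (n1, v1), (n2, v2) in combinations(list(features.items()), 2):
            if n1.split("_")[0] != n2.split("_")[0]:
                features[n1 + "*" + n2] = v1 * v2
        return features

    print(encode_genotypes({"snp1": "Aa", "snp2": "aa"}))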

2.1.2 Missing Values

Sometimes, a dataset contains missing values for SNPs. MLGPA resolves this issue by using a probabilistic approach, where the probabilities of the respective allele configurations are estimated from the frequencies in the dataset. Taking the expectation with respect to the independent variables yields a regression ansatz with probabilistic instead of binary independents. In the presence of missing values, instead of eq. 2.1 the following regression function is used,

ŷ(E(x^(k))) = Σ_{i=1}^{d} α_i · [ P(x_i^(k) = 1) · 1 + P(x_i^(k) = 0) · 0 ] + c    (2.2)
            = Σ_{i=1}^{d} α_i · P(x_i^(k) = 1) + c ,    (2.3)

where E(···) denotes the expectation. After the probabilistic input variables have been constructed, redundancies in the dataset are removed; that is, input variables which are perfectly correlated or anti-correlated on the dataset cannot be distinguished and are summarized in a single variable.
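A minimal sketch of this expectation-based handling of missing values, assuming that a missing entry of a binary regressor is replaced by the frequency-based estimate of P(x_i = 1) as in eqs. (2.2)-(2.3); the array contents are illustrative:

    # Replace missing binary regressor values by their expected value,
    # P(x_i = 1), estimated from the observed frequencies in the dataset.
    import numpy as np

    def fill_missing(X):
        # X: 2D array of 0/1 regressors, with np.nan marking missing entries
        X = X.astype(float).copy()
        for i in range(X.shape[1]):
            col = X[:, i]
            p1 = np.nanmean(col)            # estimate of P(x_i = 1)
            col[np.isnan(col)] = p1         # expectation of the binary variable
        return X

    X = np.array([[1, 0], [np.nan, 1], [0, np.nan]])
    print(fill_missing(X))                  # missing entries become column frequencies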

2.1.3 Variable Selection

In traditional regression analysis, the parameters of all regressors are fitted on the whole dataset, yielding the maximum likelihood solution in which the mean squared error (MSE) is minimized. Then, statistical tests are conducted to determine whether any of the regression coefficients are significantly different from zero. A critical issue is the selection of variables to be included in the regression equation. Even if a predictive variable does not have an effect on the dependent variable in the population, so that the

’true’ regression coefficient is zero, the corresponding maximum likelihood parameter estimate on a finite sample will usually be non-zero, which also influences the regres- sion coefficients of other variables. Therefore, inclusion of many irrelevant variables leads to an increased variance of prediction and imprecise inference. Moreover, if more predictive variables are present than observations, the parameters are not estimable anymore. Both of these issues apply to imaging genetics studies, where a large number of potentially relevant genetic predictive variables need to be evaluated, often exceeding the available sample size, and it is usually assumed that only a small number of these predictors (if any) will actually influence the phenotype in the underlying population. Sometimes, variable selection procedures like forwards selection, backwards elimina- tion and subset selection are used to reduce the number of regressors. A problem with the use of variable selection procedures in traditional regression analysis is that they are likely to overfit on the dataset. This means that the resulting model fits the data better than it fits the underlying population, yielding too optimistic p-values, since the tabulated test statistics do not account for the effects of variable selection (Afifi et al., 2004). Therefore, significance tests for individual regression models are not valid any- more if variable selection has been used. In contrast to this, the test for significantly low generalization error of a learning machine which is employed in MLGPA takes the use of variable selection correctly into account, as will be shown in the following. In statistical learning theory, the interest lies not in testing the fit of a model to a given dataset, but in the generalization error of learning machines. In machine learning terminology, the independent variables correspond to the input variables and the dependent variable to the output variable of a learning machine. A learning machine is a procedure which selects a predictor from a family of functions on the basis of a given sample of input-output pairs (training dataset) assumed to be independent and identically distributed (i.i.d.) observations. The goal of learning is to minimize the generalization error, which is the expected prediction error on yet unseen examples drawn from the same distribution (test dataset). Results from statistical learning theory (Vapnik, 1998) provide probabilistic bounds on the generalization performance of learning machines on small samples and have led to the development of methods with good generalization ability, like support vector machines (SVMs) (Sch¨olkopf and Smola, 2002). In machine learning, variable selection (Kohavi and John, 1997; Guyon and Elisseeff, 2003) serves two main purposes: First, it improves the generalization ability of a learning machine by removing irrelevant variables. Second, it allows the identification of models which include only the relevant variables which best explain an underlying dependency. Note that applying variable selection to the training set does not induce a bias on the generalization error estimate, since the test set is never used in the variable selection process. A learning machine aiming to learn the above described generative models for genotype-phenotype relationships needs to provide procedures for variable selection and parameter estimation. 
For variable selection, MLGPA employs the recently proposed "Potential Support Vector Machine" (P-SVM) (Hochreiter and Obermayer, 2006). In contrast to a conventional support vector machine, which expands the prediction function into a sparse set of data points (Schölkopf and Smola, 2002), the P-SVM expands it into a sparse set of variables.

In previous work, the P-SVM has been shown to be a powerful method for selecting a compact set of relevant variables (Hochreiter and Obermayer, 2005); it has been successfully applied to problems characterized by large numbers of variables and small sample sizes (Hochreiter and Obermayer, 2004). The mathematical formulation of the P-SVM is described in section 2.1.5. In the P-SVM approach, the number of selected variables is implicitly controlled by a hyperparameter ε. In MLGPA, this hyperparameter is selected from a set of candidates using the Bayesian information criterion (Schwarz, 1978), which balances model complexity against model fit. The details of the hyperparameter selection procedure are given in section 2.1.6. For prediction, MLGPA uses ordinary least squares regression, which adjusts the regression parameters for the previously selected variables.

2.1.4 Significance Test

In MLGPA, the estimated generalization error of the learning machine on the given dataset is used as the test statistic. An unbiased estimate of the generalization mean squared error is obtained via leave-one-out cross-validation (LOOCV). In this subsampling procedure, each case is used in turn for testing, while the remaining dataset is used for training. Thus, for each cross-validation fold a separate model is learned, which may differ slightly with respect to the selected variables and coefficients. The LOOCV estimate of the generalization mean squared error (MSE) is then given by

    E_{LOOCV}(D) = \frac{1}{n} \sum_{k=1}^{n} \left( \hat{y}_{D \setminus (x^{(k)}, y^{(k)})}(x^{(k)}) - y^{(k)} \right)^2,   (2.4)

where D = {(x^{(i)}, y^{(i)})}, i ∈ {1, ..., n}, denotes the dataset, and \hat{y}_{D \setminus (x^{(k)}, y^{(k)})}(x^{(k)}) is the predicted value on input vector x^{(k)}, where the learning machine has been trained on the reduced dataset D \ (x^{(k)}, y^{(k)}) in which the k-th case has been removed. E_{LOOCV}(D) gives an unbiased estimate of how the learning machine would perform on yet unseen data, and is compared to the error distribution under the H0 hypothesis of independence using a permutation test. This is done by generating a large number of datasets with permuted output values, and performing on each of them the same analysis as on the original data, including hyperparameter selection, variable selection and regression parameter fitting. The fraction of datasets with a generalization error lower than or equal to that of the original dataset yields a p-value, which corresponds to the probability that such a low generalization error is due to correlations by chance, although input and output are in fact independent (H0). This p-value is then compared to a significance level α.
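To make the procedure concrete, the following sketch (in Python; an illustration, not part of the original method description) computes the LOOCV estimate of eq.(2.4) and the permutation p-value. An ordinary least squares learner from scikit-learn serves as a stand-in; in MLGPA, P-SVM variable selection and hyperparameter selection would additionally run inside each fold.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import LeaveOneOut

    def loocv_mse(X, y):
        """LOOCV estimate of the generalization MSE, eq. (2.4)."""
        sq_errors = []
        for train, test in LeaveOneOut().split(X):
            model = LinearRegression().fit(X[train], y[train])
            sq_errors.append((model.predict(X[test])[0] - y[test][0]) ** 2)
        return np.mean(sq_errors)

    def permutation_p_value(X, y, n_perm=300, seed=0):
        """Fraction of permuted datasets whose LOOCV error is <= the observed one."""
        rng = np.random.default_rng(seed)
        observed = loocv_mse(X, y)
        hits = sum(loocv_mse(X, rng.permutation(y)) <= observed
                   for _ in range(n_perm))
        return hits / n_perm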

MLGPA tests the generalization performance of a learning machine instead of the fit of specific regression models to the data. If a significantly low MSE was found, with high probability the learning machine was able to learn a true underlying dependency between the set of genetic variables and a phenotype. Then, the learning machine can be used to select a specific model, this time using the whole dataset as training set, which can be expected to generalize well to unseen data from the same distribution. The resulting regression coefficients provide information on the direction and strength of the effects.

In contrast to the conventional method, where severe multiple testing issues are caused by testing for group differences for each genetic configuration, testing the generalization performance of the learning machine on the dataset requires only a single test (per phenotype). However, if more than one phenotype is considered, the false discovery rate (FDR), which is the expected proportion of erroneous rejections among all rejections of H0, needs to be controlled. In MLGPA this is done using the Benjamini-Hochberg procedure (Benjamini and Hochberg, 1995; Benjamini and Yekutieli, 2001), which has been shown to be a much more powerful approach to multiple testing than procedures controlling the family-wise error rate, e.g. the Bonferroni procedure.

A key feature of MLGPA is the use of variable selection during the training of the learning machine. If a statistical model has too many parameters, overfitting occurs: the parameters are adjusted to the noise in the data. The model will then fit the training data well, but will not generalize to unseen data. Therefore, it is important to use regularization to adjust the complexity of the model to the level which allows for good generalization, a principle known as Occam's razor in the machine learning and statistics literature. In MLGPA, regularization is achieved by terms in the objective function of the P-SVM which enforce the selection of a sparse set of variables, and by the adjustment of the model complexity via the Bayesian information criterion.

The list of the variables selected during the LOOCV, ranked according to selection frequency, can be used to judge the robustness of the variable selection by comparing it to the variables selected in the final regression model. If the variables selected most often (e.g. at least half the time) do not coincide with the ones in the final model, it is likely that there are several competing models with almost identical prediction performance, so the existing dependency can be modeled almost equally well using different subsets of variables. For example, if several SNPs are in high linkage disequilibrium (LD), only one of the corresponding allele variables will be selected to provide a sparse solution. Therefore, it is advisable to also look at the SNPs which are in high LD with the ones involved in the final regression model, as they might be the ones actually responsible for an observed effect.
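The Benjamini-Hochberg step-up procedure referenced above is straightforward to implement. The following sketch (an illustration, not from the original text) rejects all hypotheses whose ordered p-values fall below the line (k/m)·q:

    import numpy as np

    def benjamini_hochberg(p_values, q=0.05):
        """Step-up FDR control: reject H0 for the k smallest p-values, where
        k is the largest (1-based) rank with p_(k) <= (k / m) * q."""
        p = np.asarray(p_values)
        m = len(p)
        order = np.argsort(p)
        passes = p[order] <= np.arange(1, m + 1) / m * q
        rejected = np.zeros(m, dtype=bool)
        if passes.any():
            k = np.nonzero(passes)[0].max()  # 0-based index of largest passing rank
            rejected[order[:k + 1]] = True
        return rejected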

2.1.5 P-SVM Variable Selection for Multiple Linear Regression

Suppose we have d genetic variables concatenated into an input vector x = (x_1, ..., x_d). The parametric prediction function of P-SVM regression is f(x) = x^T w + b, where the parameters are w (a d-dimensional column vector) and b. Let X be the d × n matrix containing the d-dimensional input vectors for n cases, normalized to zero mean and unit variance, and let y be the n-dimensional vector of the corresponding output variables. The primal optimization problem of the P-SVM is given by (Hochreiter and Obermayer,

2004)

    \min_{w} \; \frac{1}{2} \| X^T w \|^2   (2.5)

    \text{s.t.} \quad X(X^T w - y) + \epsilon \mathbf{1} \geq \mathbf{0}, \qquad X(X^T w - y) - \epsilon \mathbf{1} \leq \mathbf{0},   (2.6)

where 1 is a d × 1 vector of ones, 0 a d × 1 vector of zeros, and ε a hyperparameter of the model. The objective, eq.(2.5), can be derived from bounds on the generalization error using the technique of covering numbers (Hochreiter and Obermayer, 2005), while the constraints, eq.(2.6), ensure that the empirical MSE is minimized. In order to avoid overfitting (adapting the model to noise in the data), a violation of the individual constraints is allowed up to a value of ε, in analogy to the concept of ε-insensitive loss (Schölkopf and Smola, 2002). The corresponding dual optimization problem is given by

    \min_{\alpha^+, \alpha^-} \; \frac{1}{2} (\alpha^+ - \alpha^-)^T X X^T (\alpha^+ - \alpha^-) - y^T X^T (\alpha^+ - \alpha^-) + \epsilon \, \mathbf{1}^T (\alpha^+ + \alpha^-)   (2.7)

    \text{s.t.} \quad \mathbf{0} \leq \alpha^+, \quad \mathbf{0} \leq \alpha^-,   (2.8)

where α+ and α− are vectors of Lagrange multipliers. This convex optimization problem has a global solution, which is found using an efficient sequential minimal optimization (SMO) procedure (Knebel et al., 2008). With α = α+ − α−, the parameters become w = α and b = \frac{1}{n} \sum_{i=1}^{n} y_i. The prediction function of the P-SVM is then given by

    f(x) = \sum_{j=1}^{d} \alpha_j x_j + b.   (2.9)

The first term of the dual objective function in eq.(2.7) corresponds to the empirical covariance matrix X X^T of the data, and the second term to the correlation y^T X^T between input and target. The third term enforces a sparse solution for α. All variables x_j for which α_j is non-zero are called support variables. The objective thus enforces a compact set of support variables, punishes the conjoint selection of correlated variables, and rewards the selection of variables correlated with the target. P-SVM regression leads to an expansion of the regression function in terms of a sparse set of support variables; only these variables are needed for prediction.

The hyperparameter ε controls the threshold below which training errors are tolerated. For large values of ε, no support variables are chosen, and the prediction function corresponds to the sample mean prediction. As ε becomes smaller, more and more variables become support variables. If ε is too small, no training error is tolerated, making the model prone to overfitting. Thus, ε-regularization corresponds to adjusting the model complexity via the number of support variables as a compromise between high model bias and high model variance.
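Because at the optimum α+_j and α−_j are never both non-zero, the dual (2.7)-(2.8) can be rewritten as an ℓ1-regularized quadratic problem in α = α+ − α−, namely min_α ½ α^T XX^T α − (Xy)^T α + ε‖α‖_1. The following sketch (plain coordinate descent with soft thresholding under this reformulation, not the SMO solver of Knebel et al.) illustrates how the support variables emerge:

    import numpy as np

    def psvm_fit(X, y, eps, n_sweeps=200):
        """Coordinate-descent sketch of the P-SVM dual, eqs. (2.7)-(2.8).
        X: d x n matrix of z-scored inputs; y: standardized targets."""
        K = X @ X.T   # empirical covariance term of eq. (2.7)
        c = X @ y     # correlation between inputs and target
        alpha = np.zeros(X.shape[0])
        for _ in range(n_sweeps):
            for j in range(len(alpha)):
                rho = c[j] - K[j] @ alpha + K[j, j] * alpha[j]
                # soft thresholding: the eps term of eq. (2.7) zeroes weak variables
                alpha[j] = np.sign(rho) * max(abs(rho) - eps, 0.0) / K[j, j]
        b = y.mean()                      # = 0 for standardized outputs
        support = np.nonzero(alpha)[0]    # the support variables
        return alpha, b, support

Larger values of eps drive more coordinates to exactly zero, reproducing the bias-variance trade-off described above.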

2.1.6 Hyperparameter Selection

In MLGPA, the hyperparameter ε of the P-SVM is selected on each training set of the LOOCV from a set of candidates. For each candidate, the learning machine consisting of P-SVM variable selection and ordinary least squares (OLS) regression is trained, and the Bayesian information criterion (BIC) is evaluated, which for regression functions with Gaussian error distribution is given by

    BIC = n \ln(MSE) + k \ln(n),   (2.10)

where n is the number of cases, MSE the mean squared error, and k the number of regression parameters. The BIC can be used to compare different models (which need not be nested) by evaluating the fit of a model to the data while penalizing the number of free parameters. A theoretical derivation of the BIC as the Laplace approximation of the likelihood of the model parameters will be given in section 4.2.2. For each training fold i, the candidate ε producing the lowest BIC is selected, and the corresponding regression model is chosen. Suppose we are given a dataset of n cases consisting of input-output pairs, D = {(x^{(i)}, y^{(i)})}, i ∈ {1, ..., n}, where the output values y have been standardized to zero mean and unit variance. The outline of the algorithm used for hyperparameter selection and the calculation of the generalization error is given in Algorithm 1.

Algorithm 1 Hyperparameter Selection and Calculation of Generalization Error

BEGIN PROCEDURE
for i = 1, ..., n do
    test point: (x^(i), y^(i))
    training set: T_i = D \ (x^(i), y^(i)) of size n − 1
    for every candidate ε do
        fit regression model M(i, ε) by
            1. P-SVM variable selection on T_i using ε
            2. parameter fit on T_i via OLS regression
        calculate BIC(i, ε)
    end for
    select ε_opt = argmin_ε BIC(i, ε)
    use M(i, ε_opt) to predict y^(i) from x^(i)
end for
calculate E_LOOCV(D)
END PROCEDURE
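A compact Python rendering of Algorithm 1 might look as follows. To keep the sketch self-contained, the P-SVM selection step is replaced by a simple correlation-threshold stand-in; in MLGPA proper, the selection procedure of section 2.1.5 would be used. Rows of X are cases here.

    import numpy as np

    def select_variables(Xtr, ytr, eps):
        # Stand-in for P-SVM variable selection (section 2.1.5): keep variables
        # whose absolute correlation with the target exceeds eps (hypothetical).
        return np.nonzero(np.abs(Xtr.T @ ytr) / len(ytr) > eps)[0]

    def ols_fit(Xtr, ytr):
        # OLS with intercept; coefficients ordered as [variables..., intercept].
        A = np.column_stack([Xtr, np.ones(len(ytr))])
        return np.linalg.lstsq(A, ytr, rcond=None)[0]

    def algorithm_1(X, y, eps_candidates):
        """LOOCV with nested BIC-based hyperparameter selection (Algorithm 1)."""
        n = len(y)
        sq_errors = []
        for i in range(n):
            tr = np.delete(np.arange(n), i)          # training set T_i
            best = None
            for eps in eps_candidates:
                sel = select_variables(X[tr], y[tr], eps)
                coef = ols_fit(X[tr][:, sel], y[tr])
                pred = np.column_stack([X[tr][:, sel], np.ones(n - 1)]) @ coef
                mse = np.mean((y[tr] - pred) ** 2)
                bic = (n - 1) * np.log(mse) + (len(sel) + 1) * np.log(n - 1)  # eq. (2.10)
                if best is None or bic < best[0]:
                    best = (bic, sel, coef)
            _, sel, coef = best
            y_hat = np.append(X[i, sel], 1.0) @ coef  # predict the held-out case
            sq_errors.append((y_hat - y[i]) ** 2)
        return np.mean(sq_errors)                     # E_LOOCV(D), eq. (2.4)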

2.2 Simulation Study: Sensitivity and Specificity of MLGPA

In order to assess the sensitivity and specificity of the MLGPA method, a simulation study was conducted in which MLGPA was applied to synthetic datasets. The analysis was carried out for different numbers of input variables (m=20, 40), different numbers of relevant input variables (k=1, 2) and different sample sizes (n=20, 30, 40). The specificity (TN/(TN+FP), where TN denotes true negatives and FP denotes false positives) was assessed using Gaussian noise targets. The sensitivity (TP/(TP+FN), where TP denotes true positives and FN denotes false negatives) was assessed using simulated regression targets with additive noise at different noise-to-signal ratios (NSR=0, 0.2, 0.4, 0.6, 0.8, 1), where the NSR is defined as the standard deviation of the noise divided by the standard deviation of the signal.

For each of these settings, 300 artificial datasets were generated in the following fashion: The input variables were drawn from a uniform distribution on the interval (0,1). The noise targets were generated following a standard normal distribution. The simulated regression targets were generated by selecting k ("relevant") variables, multiplying their values with random regression coefficients chosen uniformly from the intervals (-0.6,-0.1) and (0.1,0.6), and adding Gaussian noise at the specified noise-to-signal ratio. The significance level for the tests was set to α=0.05, and the number of iterations of the permutation test to 300. The estimated sensitivities and specificities for the different conditions are listed in Table 2.1.

The specificity of the method was always at least 95%, which shows that the Type I error is correctly bounded by the chosen significance level of α=0.05. The sensitivity of the method depended on the noise-to-signal ratio, the sample size, and the total number of input variables. Even for only 20 data points in a 40-dimensional input space, a sensitivity of almost 100% was reached at noise levels up to an NSR of 0.4. The noise level up to which a sensitivity of 100% was reached increased with the available sample size. This dependency can be explained by the fact that, for a fixed sample size, with decreasing noise level the signal is less and less confounded by the noise. Increasing the sample size increases the robustness with which an underlying dependency can be distinguished from the noise component, leading to an increased sensitivity.

Reducing the number of relevant input variables from k=2 (m=20) to k=1 (m=20) slightly increased the sensitivity at a fixed NSR. For small k, the contribution of the individual variables to the signal is higher and less likely to be obscured by noise. The reduction of irrelevant input variables from 38 (m=40, k=2) to 18 (m=20, k=2) also brought an increase in sensitivity. The generalization capability of a learning machine increases when fewer spurious variables are present, because fewer correlations by chance appear in the data. Although MLGPA can deal with datasets containing more variables than examples, it is sensible to keep the number of input variables small in order to increase the sensitivity. For an evaluation of the variable selection, the median numbers of correctly or spuriously selected variables in the final regression model are given in Table 2.2.
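As an illustration (not taken from the thesis), the generation of one such artificial dataset could be sketched as follows; the NSR enters through the standard deviation of the added noise:

    import numpy as np

    def make_dataset(n, m, k, nsr, seed=0):
        """One synthetic dataset of the simulation study: n cases, m variables,
        k relevant variables, additive Gaussian noise at noise-to-signal ratio nsr."""
        rng = np.random.default_rng(seed)
        X = rng.uniform(0.0, 1.0, size=(n, m))
        relevant = rng.choice(m, size=k, replace=False)
        # coefficients drawn uniformly from (-0.6, -0.1) or (0.1, 0.6)
        coefs = rng.uniform(0.1, 0.6, size=k) * rng.choice([-1.0, 1.0], size=k)
        signal = X[:, relevant] @ coefs
        y = signal + rng.normal(0.0, nsr * signal.std(), size=n)
        return X, y, relevant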

                                        Sensitivity
    m   k   n    Specificity   NSR=0   NSR=0.2   NSR=0.4   NSR=0.6   NSR=0.8   NSR=1
    20  1   20   96%           100%    100%      100%      97%       75%       50%
            30   96%           100%    100%      100%      100%      94%       76%
            40   95%           100%    100%      100%      100%      99%       93%
    20  2   20   95%           100%    100%      100%      92%       65%       45%
            30   95%           100%    100%      100%      99%       89%       75%
            40   97%           100%    100%      100%      100%      98%       90%
    40  2   20   96%           100%    100%      99%       76%       45%       27%
            30   96%           100%    100%      100%      97%       76%       59%
            40   97%           100%    100%      100%      100%      97%       79%

Table 2.1: Simulation study: sensitivity and specificity of MLGPA. This table lists the results of the simulation study assessing the sensitivity and specificity of MLGPA for different numbers of total variables (m), different numbers of relevant input variables (k=1, 2) and different sample sizes (n). For each setting, the analysis was carried out for different noise-to-signal ratios (NSR). For details, see text.

The analysis shows that usually both relevant variables were found, while with increasing noise-to-signal level some spurious variables were also selected. This effect can be explained by random correlations in the data and is lessened for larger sample sizes and fewer irrelevant input variables.

2.3 Application to Genomic Imaging Data

Several genes of the dopaminergic and glutamatergic neurotransmitter systems have been found to be associated with alcohol dependence and related intermediate phenotypes. Here, genetic variants of the catechol-O-methyltransferase (COMT) and the metabotropic glutamate receptor 3 (mGluR3) gene in alcohol-dependent patients and their association with volumetric measurements of brain structures were evaluated. Conventional statistical methods based on tests for group differences encounter limitations in the assessment of epistatic effects. Therefore, the novel MLGPA method is applied to find potential associations between the genetic variants and volumetric measurements obtained from structural MRI. Hippocampal volume was found to be associated with epistatic effects of the COMT and mGluR3 genes in alcohol-dependent patients but not in controls. These data are in line with prior studies supporting a role for dopamine-glutamate interaction in the modulation of alcohol dependence.

2.3.1 Medical Background

Alcohol dependence is a widespread disorder, affecting about 10% of the population (Sher et al., 2005). Due to its somatic, psychiatric and social consequences, a major

                Type of                  NSR                          of a
    m   k   n   variables    0     0.2   0.4   0.6   0.8   1         total of
    20  1   20  True         1     1     1     1     1     1         1
                Spurious     0     0     0     1     1     2         19
            30  True         1     1     1     1     1     1         1
                Spurious     0     0     0     1     1     1         29
            40  True         1     1     1     1     1     1         1
                Spurious     0     0     0     0.5   1     1         39
    20  2   20  True         2     2     2     2     2     2         2
                Spurious     0     0     0     1     1     2         18
            30  True         1     2     2     2     2     2         2
                Spurious     0     0     0     1     1     1         28
            40  True         2     2     2     2     2     2         2
                Spurious     0     0     0     0     1     1         38
    40  2   20  True         2     2     2     2     2     2         2
                Spurious     0     0     0     1     2     3         18
            30  True         1     2     2     2     2     2         2
                Spurious     0     0     0     1     2     2         28
            40  True         1     2     2     2     2     2         2
                Spurious     0     0     0     1     2     2         38

Table 2.2: Simulation Study: Evaluating the Variable Selection. This table compares the variables selected in the final regression models of MLGPA to the variables used in generating the targets, for the different noise-to-signal ratios (NSRs) analyzed in the simulation study. This analysis was conducted for different numbers of total variables (m), different numbers of relevant variables (k) and different sample sizes (n). For each setting, the median numbers of correctly and spuriously selected variables are listed. The last column provides the maximum possible number of true and spurious variables as reference. For details, see text.

economic impact is related to alcohol dependence. Well-designed twin, adoption and family studies have shown that genetic factors play a considerable role for disease risk and symptom characteristics, with 40-60% of the risk variance explained by genetic influences (Goodwin, 1975; Kendler et al., 1994; Bierut et al., 2002; Koehnke, 2008). First-degree relatives of alcohol-dependent patients display a 3-4 fold increased risk of developing the disorder, and there is a 55% or higher concordance rate in monozygotic twins compared to a 28% rate for dizygotic twins (Goodwin, 1975). Several linkage studies have been performed that gave evidence for many contributing chromosomal loci, where - among others - glutamatergic and dopaminergic genes are located (Edenberg and Foroud, 2006).

Genetic association studies have shown a relation of alcohol dependence and/or alcohol-related traits with genes of the glutamatergic system, such as N-methyl-D-aspartate receptor channel 1 (NR1), N-methyl-D-aspartate receptor channel 2B (NR2B), glutamate transporter 1 (GLT1), glutamate receptor ionotropic kainate 3 (GluR7), and clock genes such as Period 2 (PER2) affecting glutamate reuptake (Schumann et al., 2005; Spanagel et al., 2005), as well as genes of the dopaminergic neurotransmitter system, e.g. dopamine receptor D1 (DRD1), dopamine receptor D3 (DRD3), dopamine transporter (DAT) and catechol-O-methyltransferase (COMT) (Enoch et al., 2006; Bowirrat and Oscar-Berman, 2005; Dick and Bierut, 2006). Imaging studies have further supported the role of dopaminergic and glutamatergic alterations in the pathogenesis of alcohol dependence (Heinz et al., 2003; Wong et al., 2003).

However, genetic association studies in complex diseases often give inconsistent results, as was also shown for the genes mentioned above (Bolos et al., 1990; Schumann et al., 2005; Hines et al., 2005). The reasons for this ambiguity are manifold, including genetic and allelic heterogeneity of the disease, interactive gene-gene effects, contributing external factors and comparatively small sample sizes (Buckland, 2001). Over the last years, the concept of intermediate phenotypes has been promoted, which tries to dissect complex psychiatric diseases such as alcohol dependence into more basic neurobiological components (Enoch et al., 2003; Hines et al., 2005). Parameters such as electrophysiological measures (Reischies et al., 2005) or structural and functional imaging data (Hariri et al., 2002; Breiter and Gasic, 2004; Heinz et al., 2005; Pezawas et al., 2005) have been shown to be reliable tools for genetic association studies in alcoholism and other psychiatric disorders.

Several studies have evaluated structural brain alterations associated with alcohol dependence (Hommer, 2003; Pfefferbaum, 2004; Spampanato et al., 2005). Among the subcortical and cortical regions found to be different in alcoholics, hippocampal volume reductions have been reported to exceed the general cerebral atrophy found in alcohol-dependent individuals (Agartz et al., 1999; Beresford et al., 2006; Mechtcheriakov et al., 2007). These alterations raise the question whether hippocampal volume may represent a stable or "trait" marker of the disorder that already shows alterations in a presymptomatic stage, or whether the hippocampus is more susceptible to alcohol-toxic influences than other brain areas.
In the following study, the proposed MLGPA method was applied to evaluate effects of group II metabotropic glutamate receptor 3 (mGluR3) genetic variants and

their interaction with COMT variations in alcohol-dependent patients and matched controls. COMT was chosen for its potential role in the development of alcohol dependence, as indicated by several human studies (Kauhanen et al., 2000; Oroszi and Goldman, 2004). mGluRs play a vital role in synaptic plasticity (Bortolotto et al., 1999), and an essential role for mGluR3 in LTD (long-term depression) and a modulatory role for mGluR3 in LTP (long-term potentiation) have been found. In addition, it has been suggested that activation of mGluRs modulates excitation and inhibition of dopaminergic mesencephalic neurons (Meltzer et al., 1997; Shen and Johnson, 1997; Campusano et al., 2002; Abarca et al., 2004). Recently, significant statistical epistasis between COMT and mGluR3 has been shown (Nicodemus et al., 2007): several genetic variants of the mGluR3 gene increased the risk of schizophrenia conferred by COMT SNPs, whereas mGluR3 itself did not have an influence on disease risk. Also, epistatic effects of COMT and mGluR3 on working memory function in healthy subjects have been found (Tan et al., 2007). Based on these findings, COMT-mGluR3 interactions were evaluated for their effects on the volume of brain structures measured by magnetic resonance imaging (MRI) in alcohol-dependent patients and controls.

2.3.2 Methods

Patients

This study included 38 patients of Central European descent (31 male, 7 female; mean age 41 ± 7 years, range 26-57 years) diagnosed with alcohol dependence according to ICD-10 and DSM-IV. The severity of alcohol dependence was assessed with the Alcohol Dependence Scale (Skinner and Horn, 1984), and the amount of lifetime alcohol intake was measured with the Lifetime Drinking History (Skinner and Sheu, 1982). Patients had no previous substance dependence or current substance abuse other than alcoholism, which was confirmed by random breath and urine drug testing. All multi-drug abusers were excluded prior to study enrollment. Detoxification was undertaken according to general medical practice using benzodiazepines or chlormethiazole for not more than 5 days. Medication was stopped at least one week prior to the MRI scan. 41 age-matched healthy volunteers of Central European descent were included as controls (26 male, 15 female; mean age 39 ± 8 years, range 25-61 years). A standardized clinical assessment with the Structured Clinical Interview I & II (First et al., 1997, 2001) was performed to exclude other axis I psychiatric disorders (and axis II disorders in healthy volunteers). All individuals included in the study were free from any continuous medication and severe somatic disorders, including major neurological and hepatic complications in alcohol-dependent patients. All patients and controls gave fully informed written consent prior to their participation. The study was approved by the local ethics committee and was performed in accordance with the Declaration of Helsinki (1964).

Figure 2.1: Cerebral Regions Included in Segmentation. MRI scan, axial slice, tilted forward, depicting the following regions: red: caudate nucleus; green: putamen; dark green (small area ventral to the putamen): accumbens nucleus; light blue: globus pallidus; dark blue: amygdala; yellow: hippocampus.

Imaging Data

Structural imaging was performed using a 1.5 T clinical whole-body MRI scanner (Magnetom VISION; Siemens, Erlangen, Germany) equipped with a standard quadrature head coil. A morphological 3D T1-weighted MPRAGE (magnetization prepared rapid gradient echo) image dataset (1x1x1 mm voxel size, field of view (FOV) 256 mm, 162 slices, TR = 11.4 ms, TE = 4.4 ms, flip angle α = 12°) covering the whole head was acquired. Images were analyzed at the Massachusetts General Hospital as part of the Phenotype Genotype Project in Addiction and Mood Disorder. A brief synopsis of the procedures is provided here, following those published previously (Makris et al., 2004). Image positions were normalized by imposing a standard three-dimensional coordinate system on each three-dimensional image. Gray-white matter segmentation was performed using a semi-automated intensity contour mapping algorithm for cerebral exterior definition and signal intensity histogram distributions for demarcation of

gray-white matter borders. The cerebral regions investigated in our study are depicted in Figure 2.1. Targeted brain regions were segmented as individual structures using a contour line and manual editing following probabilistically weighted landmarks, cross-referencing of lateral, coronal, and sagittal views, and anatomist supervision. Landmarks and definitions for the targeted brain regions have been well defined elsewhere (Caviness et al., 1996; Makris et al., 2004). Segmentation was performed by two BA-level MR technicians blinded to subject diagnosis and group status, and with a randomly ordered sequence of subjects. Intra-rater and inter-rater reliabilities were assessed via percent common voxel assignments (PCVA) (Seidman et al., 2002). Reliabilities were between 85.9 and 90.1. These values are consistent with the intra-rater and inter-rater reliabilities reported in previous studies (Goldstein et al., 2002; Seidman et al., 2002; Makris et al., 2004). To rule out gross volumetric effects, total head circumference was calculated for all subjects. Group differences were not significant for this measure (p > 0.05).

Genetic Analysis

For genetic analysis, 30 ml of EDTA blood were collected from each individual. Eight single nucleotide polymorphisms (SNPs) distributed over the mGluR3 gene were genotyped (for SNP details see Table 2.3). SNPs were selected for potential functional relevance, location on physical and genetic maps, and minor allele frequency, based on the University of California Santa Cruz (UCSC) Human Genome Browser, the National Center for Biotechnology Information (NCBI) database and the Applied Biosystems (ABI) SNP Browser software. Two SNPs were removed after genotyping due to limited minor allele frequency (Table 2.3). For the assessment of gene-gene interactions, three SNPs of the COMT gene were chosen (rs2097603, rs4680 [Val158Met], rs165599) (Table 2.3) according to prior publications (Meyer-Lindenberg et al., 2006; Nicodemus et al., 2007). Primers were designed for amplification of the relevant genomic regions by polymerase chain reaction (PCR); products were cut by allele-specific restriction enzymes and visualized after gel electrophoresis. Primer information and specific assay conditions are available on request.

Standard Statistical Methods

Standard statistical analysis was undertaken using the Statistical Product and Service Solutions software (SPSS, version 12.0). Group differences in demographic characteristics were assessed by χ²-test (gender) and Mann-Whitney U-test (age). Influences of gender and age on volumetric hippocampal measures were calculated using the Mann-Whitney U-test (gender) and Spearman correlation (age). The χ²-test was used for the evaluation of SNP effects on disease status. For effects of each SNP on the volumetric data, the Mann-Whitney U-test was used for pairwise comparison in a dominant model, assuming a dominant function for one allele. The Jonckheere-Terpstra test was chosen for the evaluation of a linear model. For the assessment of interactive SNP effects of the two genes, interaction variables were generated and evaluated via the Kruskal-Wallis test.
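Most of this battery is available in scipy.stats; the following fragment (an illustration with dummy stand-in data, not from the thesis; the Jonckheere-Terpstra test is not part of scipy and is omitted) shows the corresponding calls:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    vol_pat = rng.normal(3.2, 0.4, 38)     # dummy hippocampal volumes, patients
    vol_ctl = rng.normal(3.5, 0.4, 41)     # dummy hippocampal volumes, controls
    age = rng.uniform(25, 60, 79)
    volume = np.concatenate([vol_pat, vol_ctl])
    geno_counts = np.array([[10, 18, 10],  # dummy genotype counts per group
                            [12, 20, 9]])

    u, p_group = stats.mannwhitneyu(vol_pat, vol_ctl)          # group difference
    rho, p_age = stats.spearmanr(age, volume)                  # age effect
    chi2, p_snp, dof, _ = stats.chi2_contingency(geno_counts)  # SNP vs. status
    h, p_int = stats.kruskal(vol_pat, vol_ctl[:20], vol_ctl[20:])  # group comparison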

                   Chromosomal   Minor allele   Minor allele    Allelic
                   location      frequency      frequency/S     variants   Location
    mGluR3
    *rs2073549     80977558      0.008          0.0             A/T        5'UTR
    rs1990040      81017056      0.310          0.279           A/G        Intron 1
    rs2214653      81049475      0.417          0.353           G/A        Intron 1
    rs10238436     81074120      0.267          0.227           C/G        Intron 1
    rs6947784      81099550      0.333          0.279           C/G        Intron 2
    rs2299219      81121812      0.250          0.198           T/C        Intron 3
    rs10266758     81146463      0.258          0.272           G/T        Intron 3
    *rs17161026    81172229      0.011          0.013           C/T        Exon 4
    COMT
    rs2097603      3779279       0.390          0.448           A/G        Promoter
    rs4680         3802473       0.483          0.500           A/G        Exon, AA exchange
    rs165599       3807982       0.461          0.317           A/G        3' near gene

Table 2.3: SNPs genotyped within the mGluR3 and COMT genes. Characteristics of the genotyped SNPs: chromosomal location and minor allele frequencies given for the Caucasian population are based on the NCBI database. Minor allele frequency/S refers to the allele frequencies in our study; they were not found to differ from the published frequencies. Allelic variants show major/minor allele. Location depicts the position within the genome. AA exchange: amino acid exchange (for rs4680 [Val158Met], G codes for valine and A codes for methionine). * Due to low minor allele frequencies, these mGluR3 SNPs were excluded from further analysis.

MLGPA

The six measured mGluR3 SNPs that showed an acceptable allele distribution in our dataset and the three COMT SNPs (rs2097603, rs4680 [Val158Met], rs165599) were incorporated in the analysis (Table 2.3). All 18 alleles of the 9 SNPs and all pairwise multiplicative interactions involving one allele from the COMT gene and one allele from the mGluR3 gene were included, yielding a total of 90 predictive variables. Since the dataset contained missing values, probabilistic inputs were estimated according to the procedure described above. The candidate values for the hyperparameter ε were {1.0, 1.05, ..., 3}. The false discovery rate was controlled at q = 0.05.
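The count of 90 variables follows from 18 single-allele indicators (9 SNPs × 2 alleles) plus 6 × 12 = 72 pairwise COMT-mGluR3 products. A sketch of how such a design matrix could be assembled (the encoding as binary allele indicator columns is an assumption for illustration):

    import numpy as np

    def build_design_matrix(comt, mglur3):
        """comt: n x 6 allele indicators (3 COMT SNPs x 2 alleles);
        mglur3: n x 12 allele indicators (6 mGluR3 SNPs x 2 alleles).
        Returns the n x 90 matrix of single alleles plus all pairwise products."""
        singles = np.hstack([comt, mglur3])             # n x 18
        pairs = np.einsum('ni,nj->nij', comt, mglur3)   # n x 6 x 12 products
        return np.hstack([singles, pairs.reshape(len(comt), -1)])  # n x 90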

2.3.3 Results

Analysis of the demographic data by standard statistical methods did not reveal any significant differences in age (p = 0.403) or gender (p = 0.072) between patients and controls. Evaluation of the effects of disease status on the volumetric measurements suggested significant differences for the hippocampus (right: p = 0.006; left: p = 0.044; total: p = 0.014) (Z = -2.767 / -2.012 / -2.460), nucleus pallidus (p = 0.001 / 0.003 / 0.001) (Z = -3.417 / -2.945 / -3.209) and nucleus accumbens (p = 0.006 / 0.021 / 0.008) (Z = -2.764 / -2.312 / -2.652), with smaller volumes in patients compared to controls. These differences could not be attributed to global volume diminishment in the patient group, because all images had been normalized for whole brain volume. For further analysis, only these three brain regions were included.

Gender effects on the volumetric data were found in the control group for hippocampus (p = 0.004 / 0.025 / 0.005) (Z = 2.896 / 2.247 / 2.788) and nucleus accumbens (p = 0.002 / 0.011 / 0.002) (Z = -3.059 / -2.544 / -3.140), with smaller volumes for female subjects, but not in the patient group (p > 0.05). No gender effects were found in either group for the volume of the nucleus pallidus (p > 0.05). Age did not have any effect on the volumetric data in the controls or in the patient group (p > 0.05). The lifetime alcohol amount, measured as total amount, amount over the last 5 years and over the last year, did not have any effect on hippocampal volume (p > 0.05). The genotypes of all SNPs were in Hardy-Weinberg equilibrium. Allele frequencies of all SNPs were in accordance with published data (NCBI SNP database) (Table 2.3). No association was found for any of these SNPs with the diagnosis of alcohol dependence (p > 0.05).

Standard Statistical Methods

Due to low minor allele frequencies, two mGluR3 SNPs were excluded from further statistical analysis (see Table 2.3). Each SNP by itself, as well as all possible SNP-SNP interactions between mGluR3 and COMT, was tested for group differences in the volumetric data of hippocampus, nucleus pallidus and nucleus accumbens (right/left/total). Due to the number of SNPs that had to be considered (9), the different models of allelic interaction (dominant and linear) that had to be calculated separately (2), and the

    Phenotype                E_LOOCV(D)    p-value
    Right Hippocampus        0.6656240     0.0023*
    Left Hippocampus         0.715100      0.0057*
    Bilateral Hippocampus    0.555630      0.0004*

Table 2.4: MLGPA: Generalization error estimates and p-values. The estimated generalization mean squared error E_LOOCV(D), eq.(2.4), is given for the phenotypes for which a significant dependency was found on the patient sample. The table also lists the corresponding p-values, calculated using 10000 random permutations of the output values. Results that remain significant when controlling the FDR at level 0.05 are marked by *.

phenotypic variables that had to be taken into account (9), a large number of tests was necessary (162 tests). Additionally, interactive effects on the phenotypic variables further increased the number of tests (324). After applying the Benjamini-Hochberg procedure to correct for multiple testing, no results remained significant.

MLGPA

Significant dependencies (controlling the FDR at q=0.05) were found for left, right and bilateral hippocampus volume in the patient group. The generalization errors and p-values for these phenotypes are listed in Table 2.4, the variables selected during the LOOCV in Table 2.5, and the regression models learned on the whole dataset in Table 2.6. For the controls, no significant dependencies were found. For all three phenotypes, the three variables selected more than half of the time in the LOOCV were the same ones used in the regression models learned on the whole dataset, indicating the robustness of the variable selection. Among alcohol-dependent patients, two multiplicative interaction variables between an allele of COMT and an allele of mGluR3, rs6947784(C)•rs165599(G) and rs2214653(G)•rs165599(G), were associated with an increase in volume for the left, right and bilateral hippocampus, while a decrease in volume was associated with rs10266758(G)•rs165599(A) for the left and bilateral hippocampus and with rs10266758(G) for the right hippocampus.

The fact that all three predictive variables included in the final regression model were multiplicative interaction variables between COMT and mGluR3 alleles, although additive interactions between single alleles of both genes were also allowed, points towards a multiplicative interaction between these systems. In order to check whether a significant generalization performance could also be reached using purely additive interactions of single alleles, a post-hoc analysis was carried out in which only the 18 single alleles from both genes, and no multiplicative interaction variables, were included. Due to the multiple regression ansatz used, arbitrary additive interactions between single alleles were allowed. MLGPA was applied in exactly the same way as before. However, in this purely additive setting no significant effects (controlling the FDR at q=0.05) were found, neither for patients nor for controls.

    Phenotype                Variable                        S     ᾱ
    Right Hippocampus        rs2214653(G)•rs165599(G)        38    0.712656
                             rs6947784(C)•rs165599(G)        38    1.153186
                             rs10266758(G)                   36    -1.008475
                             rs10238436(C)•rs2097603(A)      17    0.442371
                             rs10266758(G)•rs165599(A)       2     -0.541244
                             rs2214653(A)•rs2097603(A)       2     0.428792
                             rs1990040(G)•rs2097603(A)       1     0.364769
    Left Hippocampus         rs6947784(C)•rs165599(G)        38    0.911440
                             rs10266758(G)•rs165599(A)       37    -0.767511
                             rs2214653(G)•rs165599(G)        37    0.577732
                             rs2097603(T)                    4     0.511084
                             rs2214653(A)•rs4680(G)          2     0.503293
                             rs2214653(G)•rs2097603(T)       1     0.337316
    Bilateral Hippocampus    rs2214653(G)•rs165599(G)        38    0.603048
                             rs6947784(C)•rs165599(G)        38    1.090808
                             rs10266758(G)•rs165599(A)       37    -0.636309
                             rs1990040(G)•rs2097603(A)       1     0.449874

Table 2.5: MLGPA: Variables selected in the LOOCV for the three phenotypes for which a significant dependency was found by MLGPA on the patient sample. The • in the variable names denotes a multiplicative interaction term. Each allele is described by the name of the SNP and the nucleotide (in brackets). S denotes the number of times a variable was selected in the outer leave-one-out cross-validation loop, which indicates the robustness of the selection. ᾱ denotes the average regression coefficient.

    Phenotype                Regression Model
    Right Hippocampus        Y = 1.0923 × rs6947784(C)•rs165599(G)
                                 + 0.75052 × rs2214653(G)•rs165599(G)
                                 − 1.044 × rs10266758(G)
                                 + 0.31685
    Left Hippocampus         Y = 0.90765 × rs6947784(C)•rs165599(G)
                                 + 0.57617 × rs2214653(G)•rs165599(G)
                                 − 0.76875 × rs10266758(G)•rs165599(A)
                                 + 0.097414
    Bilateral Hippocampus    Y = 1.0873 × rs6947784(C)•rs165599(G)
                                 + 0.6026 × rs2214653(G)•rs165599(G)
                                 − 0.63139 × rs10266758(G)•rs165599(A)
                                 − 0.063448

Table 2.6: MLGPA: Regression models. This table shows the regression models learned by the learning machine on the whole patient sample in a post-hoc analysis in the cases where a significant dependency was found. The • in the variable names denotes a multiplicative interaction term between two alleles. Each allele is described by the name of the SNP and the nucleotide (in brackets).

2.3.4 Discussion

This study provides evidence that the combined set of genetic variants of the mGluR3 and COMT genes is associated with hippocampal volume in alcohol-dependent patients but not in controls. Our study gave evidence that epistatic effects between these two systems, i.e. combined gene-gene interactions, are likely to be involved. This result is particularly interesting in light of a recent study by Nicodemus et al. (2007), who showed an epistatic effect of COMT and mGluR3 concerning the risk for schizophrenia. Also, Tan et al. (2007) found epistatic effects of COMT and mGluR3 on working memory function in healthy subjects. In our study, the selected variables indicate an additive superposition of multiplicative pairwise interactions between the three mGluR3 SNPs rs6947784, rs2214653, rs10266758 and the COMT SNP rs165599 that influence hippocampal volume in the present sample. While additive interaction refers to the added effects of certain alleles of single SNPs (each independent of the rest), multiplicative interaction refers to an effect of a certain allele combination involving multiple SNPs. However, for the setting where only the single alleles and not the multiplicative interaction terms were included, no significant effect was found. Therefore, based on our sample, the epistatic effects of mGluR3 and COMT genotypes seem to manifest mainly in a non-additive, multiplicative manner, causing a decreased hippocampal volume in the absence or presence of specific allele combinations involving both COMT and mGluR3 SNPs. At present, information about the functionality of

the relevant mGluR3 SNPs does not exist. Therefore, conclusions concerning glutamate levels cannot be drawn from our study. For rs165599, located in the 3'UTR of the COMT gene, a reduction of COMT expression with the G allele has been shown, which consecutively leads to increased dopamine levels (Dempster et al., 2006). However, other authors did not confirm the functional relevance of rs165599 (Chen et al., 2004). In our study, the G allele was associated with an enlarged hippocampal volume. Any conclusions drawn from these results concerning COMT expression and subsequent dopamine levels would be rather speculative.

Both the glutamate and the dopamine neurotransmitter system have often been implicated in the pathogenesis of alcohol dependence (Tsai et al., 1998; Heinz et al., 2003; Koob, 2003; Schumann et al., 2005). Glutamatergic neurotransmission, mediated by metabotropic glutamate receptors, modulates cue-induced drug-seeking behaviour (Markou, 2007). Metabotropic glutamate receptors are G-protein linked glutamate-sensitive receptors that control the activity of membrane enzymes and ion channels (Conn and Pin, 1997). mGluR3s belong to the group II mGluRs, which are predominantly found presynaptically, where they reduce transmitter release as autoreceptors (Shigemoto et al., 1997). Group II metabotropic glutamate receptor agonists that bind to metabotropic glutamate 2/3-receptors decrease alcohol self-administration and cocaine-seeking in rats (Backstrom and Hyytia, 2005; Peters and Kalivas, 2006). Other metabotropic glutamate receptors are also involved in withdrawal symptoms and relapse risk (Dravolina et al., 2007; Kotlinska and Bochenski, 2007).

Coordinated signaling of the dopaminergic and glutamatergic systems at the cellular level is also critical for long-term plasticity and reward-related learning in corticostriatal networks. Activity-dependent changes in synaptic strength are in most cases mediated by glutamate receptors (Bortolotto et al., 1999) and potentiated by the activation of dopamine receptors (Jay, 2003). Cells that receive dopaminergic and glutamatergic input act as coincidence detectors in associative learning, and thereby alter the activity of neural ensembles (Kelley, 2004). Activation of mGluRs modulates excitation and inhibition of dopaminergic mesencephalic neurons (Meltzer et al., 1997; Shen and Johnson, 1997; Campusano et al., 2002; Abarca et al., 2004). A recent study suggests an essential role for mGluR3 in LTD, and a modulatory role for mGluR3 in LTP, with effects being mediated by distinct pre- and postsynaptic loci (Altinbilek and Manahan-Vaughan, 2007). It is assumed that mGluR3 agonists have neuroprotective properties mediated by astroglial group II receptors that induce glutamate uptake (Yao et al., 2005; Corti et al., 2007). mGluR3 activation in astrocytes leads to an increased formation and release of TGF-β, which in turn protects neighbouring neurons against excitotoxic cell death (Bruno et al., 1998). Also, activation of group II mGluRs prevents neurotoxic effects triggered by NMDA in the rat hippocampus (Pizzi et al., 1996).

Studies in animals and humans suggest a close functional relationship between memory formation in the hippocampus and dopaminergic neuromodulation (Lisman and Grace, 2005). In aged animals, decreasing dopamine neurotransmission impairs learning (Hersi et al., 1995). In humans, enhancement of dopamine pools with L-DOPA produces a significant improvement in memory (Knecht et al., 2004), and activation of

dopaminergic midbrain structures by reward is associated with enhanced hippocampus-dependent formation of explicit memory (Wittmann et al., 2005). The hippocampus regulates this release of dopamine by disinhibition of the ventral tegmental area (Legault and Wise, 2001), emphasizing its role in reward-related learning. Dopaminergic activity has also been found to influence neuroplastic processes via neurotrophic factors (Ji et al., 2005).

In alcohol-dependent patients, reduced hippocampal volumes exceeding the general brain atrophy have been observed (Sullivan et al., 1995; Agartz et al., 1999; Beresford et al., 2006; Mechtcheriakov et al., 2007). This finding may result from an increased vulnerability of the hippocampus to the neurotoxic effects of alcohol compared to other brain regions. However, the decrease may, at least in part, also precede alcohol dependence. Reductions of hippocampal volume have been found in adolescent heavy drinkers (De Bellis et al., 2000; Medina et al., 2007). Adolescents with short-term alcohol use disorder also showed decreased left hippocampal volume, suggesting alterations of hippocampal volume in premorbid stages (Nagel et al., 2005). Taking these data together, an impact of functional genetic variants in dopaminergic and glutamatergic genes may lead to a reduction of hippocampal volume through alterations of different neuroprotective pathways.

An association of COMT/mGluR3 genetic variants with hippocampal volume was not observed in the control group. This finding supports the hypothesis that a gene-gene interaction specific to alcohol-dependent patients is related to hippocampal atrophy. It has been shown in other studies that certain gene effects might only become relevant when distinct exogenous conditions are present (Heinz et al., 2000). Chronic alcohol exposure might function as a stressor to the brain which leaves certain intracerebral structures such as the hippocampus more susceptible to genetic effects. However, the possibility cannot be excluded that the sample size in this study was not large enough to detect similar alterations in the control group, or to dissociate subtypes within the control population that show a similar epistatic effect. Continued genetic and volumetric studies in adolescents with and without a family history of alcoholism may help to further elucidate these interactions.

Also, in our study a gender effect was seen in the control group, with female subjects having significantly reduced volumes of the hippocampus compared to male controls. The lack of a gender effect in the patient group may be due to toxic effects of alcohol that interfere with gender effects. Also, the alcohol amount in the male population was significantly higher compared with the female group, which might cause a pronounced reduction of hippocampal volume and thereby obscure naturally occurring size differences.

Subsequent studies are needed to follow up on the questions raised by these findings. First, a replication sample will be collected to examine the reproducibility of this initial analysis. Second, further molecular studies need to be performed to investigate the specific functional consequences the identified genetic variants might have. Third, it should be examined whether individuals with the particular genetic risk constellation for reduced hippocampal volume demonstrate specific clinical features such as alterations in the development of tolerance, withdrawal symptoms, relapse rates, and impaired memory

function, since these specific phenomena are modified by dopaminergic and glutamatergic neurotransmission in alcohol-dependent patients (Tsai et al., 1998; Heinz et al., 2003).

2.4 Summary and Conclusions

In this chapter, population association studies with quantitative variables were considered under the assumption that epistatic effects play a role and that the available sample size is small. For this setting, a method for genotype-phenotype analysis based on machine learning techniques (MLGPA) was proposed, which overcomes the limitations of conventional statistical methods. MLGPA does not require prior assumptions about the number of relevant predictive variables, allows for multiplicative as well as additive epistatic effects, and uses both models (dominant and linear) of allele interaction. Missing SNP values are dealt with in a probabilistic fashion. Since only a single test needs to be conducted for each phenotype, the proposed method reduces multiple testing issues and can deal with high-dimensional datasets of limited sample size.

MLGPA is conceptually different from approaches based on tests for group differences, where for each group division specified by one of the numerous genetic configurations a separate test needs to be conducted, unless there is a single a priori hypothesis. In MLGPA, the significance of the estimated generalization performance of a learning machine on the given dataset is tested. If the result of the test is not significant, one cannot conclude that there is no genotype-phenotype association, only that the learning machine was not able to find any on the dataset. This is analogous to statistical tests for group differences, where failure to reject the H0 hypothesis does not imply that H0 is true. If the result of the test in MLGPA is significant, one can trust the learning machine to learn a model on the given dataset which carries over well to yet unseen examples from the same distribution as the dataset. In this context, it is important that the method uses a cross-validation approach, in which variable selection, hyperparameter selection and parameter fitting make exclusive use of the training set of the respective cross-validation fold, and that the same procedure is also applied to the datasets sampled from the H0 hypothesis during the label permutation test. Therefore, the significance test employed in MLGPA correctly takes the effect of variable selection into account.

Since the MLGPA method conducts a test for a significantly low generalization error of the learning machine on the given dataset, only a single statistical test is conducted per phenotype. Therefore, multiple testing corrections are not required if a single phenotype is analyzed. Only if several phenotypes are tested does the multiple testing issue arise, because for each phenotype a separate test is conducted. In MLGPA, this problem is handled by controlling the false discovery rate (FDR). The permutation test guarantees that the specificity, and therefore the type I error rate, is independent of the noise level in the data and controlled by the chosen significance level. The sensitivity, however, decreases with increasing noise-to-signal ratio, as the simulation study shows. For a higher noise level, a larger sample size is required to reach the same

sensitivity.

The MLGPA method can be modified by employing other algorithms instead of the P-SVM for variable selection on each training set. However, most common variable selection procedures might get stuck in a local optimum (e.g. stepwise regression procedures based on a series of statistical tests), or are computationally too demanding during the conduction of the label permutation test (e.g. an extensive all-subsets search). The P-SVM was chosen because a computationally efficient algorithm exists to find the global optimum of its objective function, and because it was found to provide good generalization performance on problems with many variables and limited sample size. This is important, since the sensitivity of MLGPA crucially depends on the generalization capability of the learning machine: Poor learning machines using bad variable selection algorithms will have a low sensitivity and will rarely find significant dependencies in the data, but even then the type I error rate will be controlled.

3 Target Selection to Model a Single Genetic Effect on a Multidimensional Phenotype

3.1 Introduction

In genetic association studies, researchers test for significant dependencies between candidate genetic variants and a set of phenotypic variables. In chapter 2, this problem was approached using feature selection¹. The binary genetic variables were considered as input and each of the real-valued phenotypes in turn as output of a learning machine. The learning machine conducted ordinary least squares regression on a sparse subset of selected input variables obtained by the P-SVM (Hochreiter and Obermayer, 2006), a recently proposed feature selection approach. Thus, for each phenotype a sparse regression model was obtained, whose complexity was adjusted using an information criterion. The generalization performance of the learned model was then assessed using a label permutation test, where the whole learning procedure, including feature selection and regression, was conducted on a large number of datasets with permuted output values. The leave-one-out cross-validation (LOOCV) estimate of the generalization mean squared error was used as test statistic, and a p-value was calculated by assessing the percentage of generalization error estimates on the permuted datasets lower than or equal to the one obtained on the original dataset. If this p-value was lower than a chosen significance level, the learned regression model was significant. This means that it is unlikely that such a good generalization performance could have been obtained (due to random correlations in the data) under the H0 hypothesis of statistical independence

¹The expression feature selection is equivalent to variable selection. In this chapter, the expression feature selection is deliberately used to distinguish input variable selection from the novel concept of target variable selection.


between the input variables and the output variable.

However, even though the learned models are very sparse, the prediction function is a linear combination of a small number of genetic variables. While this could model an actual additive effect present in the population, it could also be that some of the chosen regressors only fit noise in the given sample and do not represent population effects. While there is no way to tell for sure without conducting replication studies on different samples, the possibility of overfitting has to be taken into account, because genetic effects on phenotypes are usually small, and it is likely that deviations of the residuals from the normal distribution occur or that outliers are present in the data.

In order to reduce such potential overfitting effects, it might be prudent to restrict the model for small sample sizes to the single genetic variable which is most likely to influence the phenotypes. Furthermore, the method of chapter 2 considers each phenotype separately. But often it is expected that a subgroup of the measured phenotypic variables is subject to the same genetic influence. Treating this group of variables as a multidimensional phenotype is likely to reduce noise and to improve the robustness of prediction. Moreover, in scenarios like multi-voxel pattern analysis of fMRI data (Norman et al., 2006), the phenotypic vector is very high-dimensional, and the association to the genotype might manifest in a complex pattern. In the following, it will be shown that the above issues can be resolved with the help of a new learning paradigm called target selection.

This chapter is structured as follows: First, the proposed learning paradigm of target selection is introduced. This is followed by the derivation of an objective function for target selection based on the concept of mutual information. Then, a learning machine for target selection is suggested, and the question of model evaluation and significance testing is discussed. Finally, the proposed method is used to re-analyze the data from the genetic imaging study of chapter 2 under different model assumptions. The goal of this analysis is to test for potential associations between a multidimensional phenotype and one of the genetic variables.

3.2 Target Selection

3.2.1 A New Learning Paradigm In the target selection paradigm, the following generative model of the data is assumed: There are several discrete random variables Y = {Y1,Y2,...,Ym}, which take values from different sets of labels Y1, Y2,..., Ym. These variables are distributed according to P (Y) and have the marginal distributions P (Yi), i = 1,...,m. One of these variables, Yg, is assumed to give rise to the distribution of another set of random variables X = {X1, X2,...,Xd}∈X via a set of Mg conditional distribution functions

    P(X \mid Y_g = c_j), \quad c_j \in \mathcal{Y}_g,   (3.1)

where M_g is the number of different labels in \mathcal{Y}_g, and where the functions P(X | Yg = cj), j = 1, ..., Mg, are assumed to be smooth functions of arbitrary shape. Using the Bayes

theorem, the optimal prediction model for Yg given X is obtained as

    P(Y_g = c_k \mid X) = \frac{P(X \mid Y_g = c_k) \, P(Y_g = c_k)}{\sum_{j=1}^{M_g} P(X \mid Y_g = c_j) \, P(Y_g = c_j)}.   (3.2)

In the following, we refer to the distribution P(Yg | X) defined in eq.(3.2) also as the 'true' distribution, since it is the optimal distribution for predicting Yg under the assumed generative model. Viewing this setting as a classification problem, X = {X1, X2, ..., Xd} ∈ 𝒳 can be considered as the multidimensional input of a learning machine, while Y = {Y1, Y2, ..., Ym} forms a set of target (or output) variables. Under the assumed generative model, all targets Yi≠g are statistically independent of the inputs, given Yg:

    Y_{i≠g} ⊥ X | Y_g.    (3.3)

Let us assume that we are given a dataset consisting of N input-target pairs D = (x^(i), y^(i)), i = 1,...,N, which are assumed to be sampled from the above generative model. The goal of target selection is to recover the optimal prediction model, (3.2), as well as possible given only the dataset D. This involves (i) the selection of a target variable Y_t from Y = {Y_1, Y_2, ..., Y_m}, and (ii) the choice of a probabilistic classification function which predicts the class probabilities of the selected target variable given the inputs. This task is to be solved by a learning machine, denoted by LM1, which needs to select the target as well as the parameters and hyperparameters of the classifier. The learning machine LM1 will incorporate a second learning machine LM2, which is responsible for fitting the model parameters θ of a probabilistic classification model specified by a parameterized function p̂_θ(Y_t | X). We assume that the learning machine LM2 has a set of hyperparameters ψ. For a selected target variable and given hyperparameters ψ, LM2 adjusts the parameters θ on the dataset D. An example of such a learning machine LM2 is a probabilistic neural network with d input neurons and one sigmoid output neuron, a quadratic loss function, and the conjugate gradient algorithm for learning the weight parameters of the network, where the hyperparameter is the number of hidden units. Another example, which will be used in the experiments, is a probabilistic support vector machine (Chang and Lin, 2001) with hyperparameter C.
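To make the role of LM2 concrete, the following is a minimal sketch (not the code used in this thesis) of such a probabilistic classifier with a single hyperparameter C, using scikit-learn's SVC, which wraps LIBSVM and obtains class probabilities via Platt scaling; the function name fit_lm2 and the toy data are purely illustrative.

    # A hypothetical LM2: fit the parameters theta of p_theta(Y_t | X)
    # for one selected binary target.
    import numpy as np
    from sklearn.svm import SVC

    def fit_lm2(X, y_t, C=1.0):
        # probability=True enables probabilistic outputs (Platt scaling);
        # class_weight could additionally adjust C for unbalanced classes.
        clf = SVC(kernel="linear", C=C, probability=True)
        clf.fit(X, y_t)
        return clf

    rng = np.random.default_rng(0)
    X = rng.normal(size=(40, 2))                                # toy inputs
    y = (X[:, 0] + 0.5 * rng.normal(size=40) > 0).astype(int)   # toy binary target
    clf = fit_lm2(X, y, C=4.0)
    print(clf.predict_proba(X[:3]))   # estimated class probabilities per sample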

3.2.2 Objective Function for Target Selection

In the following, we propose an objective function for the target selection paradigm, which is based on the idea of choosing the model from the given model class in which the input variables provide the most information about the target. The informativeness of the input variables for a specific target can be assessed via the concept of mutual information (Cover and Thomas, 1991). The mutual information between two discrete variables X and Y with a joint probability mass function p(x, y) and marginal probability mass

functions p(x) and p(y) is defined as the Kullback-Leibler divergence between the joint and the product of the marginals,

    I(X;Y) = D_{KL}(p(x,y) \| p(x)p(y))    (3.4)
           = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log \frac{p(x,y)}{p(x)p(y)}    (3.5)
           = E_{p(x,y)} \left[ \log \frac{p(X,Y)}{p(X)p(Y)} \right].    (3.6)

The mutual information measures the mutual dependence of the two variables, and can also be expressed via the marginal entropy H(Y) and the conditional entropy H(Y|X),

    I(X;Y) = H(Y) − H(Y|X).    (3.7)

Since the entropies measure the amount of uncertainty about a variable, mutual information corresponds to the reduction in uncertainty that the knowledge of one variable provides about the other. Note that mutual information can be defined analogously for continuous variables (Cover and Thomas, 1991), if the probability mass functions are replaced by densities, the sums are replaced by integrals over dx and dy, and the entropies are replaced by differential entropies. In the following, however, for simplicity the discrete notation will be used to cover both cases. Here, the concept of mutual information is extended to the multivariate case of a set of variables X = {X_1, X_2, ..., X_d} and a single variable Y:

    I(X;Y) = H(Y) − H(Y | X_1, X_2, ..., X_d)    (3.8)
           = D_{KL}(p(x,y) \| p(x)p(y))    (3.9)
           = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log \frac{p(x,y)}{p(x)p(y)}    (3.10)
           = E_{p(x,y)} \left[ \log \frac{p(X,Y)}{p(X)p(Y)} \right].    (3.11)

Since p(Y,X) = p(Y|X) p(X), eq.(3.11) can also be written as

    I(X;Y) = E_{p(x,y)} \left[ \log \frac{p(Y|X)}{p(Y)} \right].    (3.12)

The extended mutual information reflects the reduction in uncertainty about Y due to knowledge of the set of variables X. Applied to the above classification setting, Y corresponds to a target variable Y_t, t = 1,...,m, and X to the input variables. Eq.(3.12) specifies the information gain the input variables provide about target Y_t. We do not know the distribution functions p(Y_t|X) and p(Y_t); however, we can estimate distribution models on the training data D. The model for the prior probabilities p̂_φ(Y_t), parameterized by φ, is obtained by counting the relative frequencies with which the labels occur in the training data. To

model the conditional probability distribution of the target variable given the inputs, the learning machine LM2, which was introduced in section 3.2.1, is used. For given hyperparameters ψ and target variable index t, LM2 adjusts the parameters θ of a parametric probabilistic model p̂_θ(Y_t|X) on the dataset D. The resulting model, p̂_θ(Y_t|X; D), is just a function of X. Given a dataset D, the goal of target selection is to let LM1 choose the target Y_t and the hyperparameters ψ such that the objective

    G(ψ, t | D) = E_{p(x,y)} \left[ \log \frac{\hat{p}_\theta(Y_t | X; D)}{\hat{p}_\phi(Y_t; D)} \right]    (3.13)

is maximized. We call this function the expected log-likelihood gain. The expected log-likelihood gain is a function of the hyperparameters and the target variable index, and depends on the used model class as well as on the given dataset. Unlike the mutual information, the expected log-likelihood gain need not be non-negative. Eq.(3.13) corresponds to the expected increase in log-likelihood which the probabilistic classification model trained on dataset D achieves on data sampled from the true underlying distribution, compared to the estimated prior model. In practice, this objective is not directly accessible, since the true distribution p(X,Y) is not known. However, if the mathematical expectation in eq.(3.13) is replaced by the empirical average over a test dataset T of size N_T assumed to be drawn from the distribution p(x,y), a calculable objective is obtained,

    G_{emp}(ψ, t | T, D) = \frac{1}{N_T} \sum_{i=1}^{N_T} \log \frac{\hat{p}_\theta(y_t^{(i)} | x^{(i)}; D)}{\hat{p}_\phi(y_t^{(i)}; D)}.    (3.14)

We call the function G_emp the empirical log-likelihood gain. According to the law of large numbers, the empirical log-likelihood gain, eq.(3.14), will converge almost surely to the expected log-likelihood gain, eq.(3.13),

    G_{emp}(ψ, t | T, D) \xrightarrow{\text{a.s.}} G(ψ, t | D) \quad \text{for } N_T \to \infty,    (3.15)

i.e.

    P\left( \lim_{N_T \to \infty} G_{emp}(ψ, t | T, D) = G(ψ, t | D) \right) = 1.    (3.16)

It is important to note that the target selection paradigm is fundamentally different from the feature selection paradigm (Kohavi and John, 1997; Guyon and Elisseeff, 2003). This is illustrated in Figure 3.1. In feature selection, the goal is to select a subset of the input variables which is relevant for predicting a given target, and the performance criterion is the prediction performance on that target. In contrast, the aim of target selection is to find the target variable with the most informative input-target relationship, and to fit a parametric function to build a probabilistic model for the chosen target variable conditional on the inputs. Unlike in feature selection, the performance criterion for target selection is not the prediction performance on the selected target Y_t, but the additional information about the target which is provided by the input variables X.

Figure 3.1: Comparison of the feature selection paradigm to the target selection paradigm. Red boxes denote variables selected by the learning machine, blue boxes variables which are not selected, and green boxes variables which are always selected.

While the motivation of feature selection is either to improve the prediction performance of a classifier, or to obtain an improved explanatory model by retaining only the input variables most relevant for predicting the targets, target selection shares only the second of these motives. Its aim is to find the presumably most informative relation between the multidimensional input and one of the target variables, in order to recover the true model as well as possible. Note that if an alternative generative model is considered, in which only a subset of the variables X_g ⊂ X is influenced by the class variable Y_g, the target selection paradigm can also be combined with feature selection. This is illustrated in Figure 3.2. In this case, the task of the learning machine would be to select both a target variable Y_t and a subset X_s of the input variables, and to learn the corresponding probabilistic classification function.
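As a small illustration of eq.(3.14), the following sketch computes the empirical log-likelihood gain for one target from predicted class probabilities and the prior label frequencies; the function name and the toy numbers are ours and purely illustrative.

    import numpy as np

    def empirical_gain(p_model, y_test, prior):
        # p_model: (N_T, n_classes) probabilities p_theta(. | x^(i); D) on the test set
        # y_test:  (N_T,) integer labels; prior: (n_classes,) label frequencies in D
        log_model = np.log(p_model[np.arange(len(y_test)), y_test])
        log_prior = np.log(prior[y_test])
        return np.mean(log_model - log_prior)   # eq.(3.14)

    p_model = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]])
    prior = np.array([0.5, 0.5])
    print(empirical_gain(p_model, np.array([0, 1, 0]), prior))

A model that predicts the labels better than the prior yields a positive gain; a model worse than the prior yields a negative one.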

3.2.3 Learning Method for Target Selection

In target selection, the learning machine LM1 is required to perform two tasks:

1. The target variable Y_t has to be chosen.

2. A probabilistic classification function for the desired target has to be learned.

Figure 3.2: Target selection combined with feature selection. Red boxes denote variables selected by the learning machine, blue boxes variables which are not selected.

In addition to the choice of the target variable Y_t, the hyperparameters ψ also have to be adjusted. If a separate validation set is available, the optimal values ψ_opt and t_opt can be found by evaluating the empirical log-likelihood gain, eq.(3.14), which the models trained on the training sample at given values of t and ψ achieve on the validation set. This can be done by using a grid search involving all targets and several candidate values for ψ. However, if there is not enough data available to split it into separate training and validation sets of suitable size, leave-one-out cross-validation (LOOCV) has to be used instead. Then, ψ_opt and t_opt are determined on the given training sample D = (x^(i), y^(i)), i = 1,...,N, via a grid search procedure in which the LOOCV value of the empirical log-likelihood gain, eq.(3.14), is maximized. In the following, we consider the special case of binary class variables, and suggest an algorithm for the learning machine LM1 that makes use of LOOCV. It incorporates the learning machine LM2, which conducts the parameter estimation for a given target and hyperparameters. In principle, any probabilistic classification algorithm can be used for LM2. Here, the C-SVM with probabilistic outputs and linear kernel which is implemented in LIBSVM (Chang and Lin, 2001) is employed. It provides probabilistic outputs and allows one to adjust for unbalanced classes by assigning different values of C to the two classes. This method has only one hyperparameter, ψ = C, which penalizes high values of the slack variables; see (Vapnik, 1995; Schölkopf and Smola, 2002) for details. The proposed learning machine LM1 is described in pseudo code in Algorithm 2; a sketch of one possible implementation follows below.
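The following is one possible reading of Algorithm 2 as runnable code, under the assumption of binary 0/1 targets whose both classes occur in every leave-one-out training fold; it is a sketch, not the original implementation.

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import LeaveOneOut

    def target_selection(X, Y, C_grid=(1, 2, 4, 8, 16)):
        # X: (N, d) inputs; Y: (N, m) binary targets (0/1)
        N, m = Y.shape
        gains = {(C, t): [] for C in C_grid for t in range(m)}
        for train, test in LeaveOneOut().split(X):
            for t in range(m):
                # prior label frequencies p_phi on the training fold
                prior = np.bincount(Y[train, t], minlength=2) / len(train)
                for C in C_grid:
                    clf = SVC(kernel="linear", C=C, probability=True)
                    clf.fit(X[train], Y[train, t])
                    p = clf.predict_proba(X[test])[0]  # columns ordered as [0, 1]
                    y = Y[test, t][0]
                    # G_emp on the left-out point, eq.(3.14) with N_T = 1
                    gains[(C, t)].append(np.log(p[y]) - np.log(prior[y]))
        C_opt, t_opt = max(gains, key=lambda k: np.mean(gains[k]))
        final = SVC(kernel="linear", C=C_opt, probability=True).fit(X, Y[:, t_opt])
        return C_opt, t_opt, final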

3.2.4 Model Evaluation and Significance Testing

The generalization performance of the target selection model trained on D using Algorithm 2 can be assessed on an independent test set T by evaluating the empirical log-likelihood gain at the optimal values C_opt and t_opt chosen on the training set, Ĝ = G_emp(C_opt, t_opt | T, D). If the available sample size is small (as is usually the case with genotype-phenotype

Algorithm 2 Learning Machine for Target Selection
BEGIN PROCEDURE
  for all leave-one-out CV folds k = 1,...,N do
    Training set D_k, test point T_k
    for t = 1, 2, ..., m do
      Estimate p̂_φ(y_t(T_k); D_k)
      for C = {1, 2, 4, 8, 16} do
        Train C-SVM on D_k using Y_t and C
        Calculate likelihood p̂_θ(y_t(T_k) | x; D_k)
        Calculate G_emp(C, t | T_k, D_k)
      end for
    end for
  end for
  Ḡ_emp(C, t) = (1/N) Σ_k G_emp(C, t | T_k, D_k)
  (C_opt, t_opt) = arg max_{(C,t)} Ḡ_emp(C, t)
  Train C-SVM on all samples in D using t_opt and C_opt
END PROCEDURE

data), cross-validation (CV) has to be used for model evaluation. Therefore, a nested CV procedure is employed: the outer CV loop is used for evaluating the predictive performance, the inner loop for hyperparameter optimization and target selection. In the following we restrict ourselves to the special case of leave-one-out cross-validation; however, for larger datasets other CV schemes might be preferred. The final performance measure, the LOOCV estimate of the empirical log-likelihood gain, is calculated as

    \hat{G}_{LOOCV}(D) = \frac{1}{N} \sum_{n=1}^{N} G_{emp}(C_{opt}^{(n)}, t_{opt}^{(n)} | T_n, D_n),    (3.17)

where C_opt^(n) and t_opt^(n) are the optimal values for C and t obtained with Algorithm 2 on the training set of the nth outer LOOCV fold. In order to determine whether it is likely that the obtained value of Ĝ_LOOCV(D) could have been achieved even though inputs and targets are in fact independent, a numerical significance test using Ĝ_LOOCV(D) as test statistic can be conducted. This test is based on generating a sufficiently large number of datasets in which the values of the whole set of target variables Y are randomly permuted. This does not change the empirical distributions P(X_1, X_2, ..., X_d) and P(Y_1, Y_2, ..., Y_m), so the internal correlations among the members of X and Y are not changed. Instead, the purpose is to destroy the correlations between X and Y. Any remaining correlations are purely by chance, arising from a particular random permutation. The generated datasets therefore represent samples from the factorizing distribution P(X)P(Y), and can be used to assess the H0 hypothesis of independence.

To do this correctly, the whole nested cross-validation procedure for model learning (including target selection, hyperparameter selection and model parameter fitting), as well as the model evaluation using Ĝ_LOOCV, has to be carried out on each of the generated datasets. The p-value of the test is then calculated by dividing the number of permuted datasets on which a value of Ĝ was achieved that is greater than or equal to the one on the original dataset by the total number of permutations. This p-value gives the significance of the learned model in terms of its empirical log-likelihood gain, and can be compared to a chosen significance level α, which bounds the probability of a type I error. If the p-value is lower than α, the H0 hypothesis is rejected.
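A hedged sketch of this permutation test follows; evaluate_g_loocv is a hypothetical stand-in for the whole nested-CV pipeline (target selection, hyperparameter selection, parameter fitting and evaluation of Ĝ_LOOCV).

    import numpy as np

    def permutation_p_value(X, Y, evaluate_g_loocv, n_perm=2000, seed=0):
        rng = np.random.default_rng(seed)
        g_orig = evaluate_g_loocv(X, Y)
        count = 0
        for _ in range(n_perm):
            # permuting whole rows of Y breaks the X-Y coupling but keeps
            # the internal correlations among the members of Y
            Y_perm = Y[rng.permutation(len(Y))]
            if evaluate_g_loocv(X, Y_perm) >= g_orig:
                count += 1
        return count / n_perm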

3.3 Application to Genetic Association Analysis

3.3.1 Introduction

The goal of genetic association studies is to determine whether there is a significant relation between a set of genetic variables and a set of phenotypes. For example, several psychiatric diseases, like alcohol dependence and schizophrenia, are known from twin studies to possess a major genetic component. For these so-called complex diseases, it is assumed that a large number of small genetic effects as well as environmental factors influence the disease risk of an individual. Although several candidate genes have been proposed to have an influence, the genetic disposition for these diseases is as yet far from being completely understood. Instead of trying to find a direct association with disease status, a promising strategy is to measure some intermediate phenotypes, which are more directly related to neurobiology. It is considered to be more likely that a genetic effect can be found for these intermediate phenotypes, which can for example be obtained using imaging techniques (like MRI, fMRI or EEG), and are usually encoded by a set of real-valued variables. The measured genetic data usually consist of a set of promising candidate markers, so-called single nucleotide polymorphisms (SNPs). These are point mutations in which a single nucleotide (A, C, G or T) is exchanged, and they form the main source of genetic differences between individuals. Because there are two chromosomes, two alleles are measured for each SNP. Since it is not known from which of the chromosomes an allele originated, each SNP appears either in one of two homozygous forms (two identical alleles) or in the heterozygous form (two different alleles). In chapter 2, the following encoding scheme for the genetic variables was suggested: Each SNP is represented by two binary variables, coding for the presence of each of the two alleles. Moreover, in order to assess multiplicative interactions between groups of SNPs, a further set of binary interaction variables can be constructed; a small sketch of this encoding is given below. To apply the target selection paradigm, the set of binary genetic variables is considered as the set of targets Y, and the phenotypic variables are considered as the set of real-valued inputs X. Then the framework described in section 3.2.4 can be used to test for significant associations.
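A sketch of this encoding (with made-up genotypes and allele labels; not the study's data) might look as follows.

    import numpy as np

    def encode_snp(genotypes, alleles):
        # genotypes like ["GA", "GG", "AA"] -> one presence indicator per allele
        return np.array([[int(a in g) for a in alleles] for g in genotypes])

    comt_allele   = encode_snp(["GA", "GG", "AA"], alleles=("G", "A"))
    mglur3_allele = encode_snp(["CT", "CC", "TT"], alleles=("C", "T"))
    # multiplicative interaction variables: one COMT allele indicator
    # times each mGluR3 allele indicator
    interaction = comt_allele[:, [0]] * mglur3_allele
    print(interaction)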

3.3.2 Experiments

In this thesis, the proposed method is applied to the re-analysis of the data from chapter 2, where a genetic association study was conducted to analyze genetic effects on phenotypes associated with alcohol dependence. The study contained 38 alcohol-dependent patients and 41 controls. For details about the medical background, data acquisition and statistical analysis see chapter 2. The genetic data consisted of two sets of markers from two candidate genes, catechol-O-methyltransferase (COMT) and metabotropic glutamate receptor 3 (mGluR3). Six mGluR3 SNPs (rs1990040, rs2214653, rs10238436, rs6947784, rs2299219, rs10266758) and three COMT SNPs (rs2097603, rs4680, rs165599) were incorporated in the analysis. One aim of the study was to look for potential interactions between the COMT and mGluR3 genes, therefore a set of genetic interaction variables was constructed. All 18 alleles of the 9 SNPs and all pair-wise multiplicative interactions which involved one allele from the COMT gene and one allele from the mGluR3 gene were included, yielding a total of 90 binary genetic variables. As phenotypes, volumetric measurements were obtained using structural MRI and segmentation in a standardized 3D space. The left/right/bilateral volumes of hippocampus, nucleus pallidus and nucleus accumbens were included in the genetic association analysis. In this work, the proposed target selection method was applied to re-analyze the above dataset. For each of the three brain structures a separate test was conducted, and the false discovery rate for these three tests was controlled using the Benjamini-Hochberg procedure (Benjamini and Hochberg, 1995; Benjamini and Yekutieli, 2001). The three volumetric variables of each brain region were used as input variables, and the 90 binary genetic variables were used as classification targets.

3.3.3 Results

The target selection method was applied to all three brain structures, separately for the patient and the control sample. A significant dependency (controlling the false discovery rate at 0.05) was found only for the hippocampus in the patient sample. The selected target variable and the p-value are shown in Table 3.1. The selected genetic variable corresponds to a multiplicative interaction variable between one COMT and one mGluR3 SNP, which suggests the presence of an epistatic interaction between these two systems. For ease of comparison, the results obtained in chapter 2 using MLGPA are reprinted in Tables 3.2 and 3.3. Models with a significant genotype-phenotype association were found for the left, right and bilateral hippocampus, each involving a linear combination of three genetic variables. The interaction variable which was selected by the target selection model was also present in the regression models, where it was always assigned the largest regression coefficient.

Hippocampus

Selected Target Variable       Ĝ_LOOCV(D)   p-value
rs6947784(C) • rs165599(G)     0.264950     < 0.0005

Table 3.1: Results (Target Selection Method). Significant probabilistic classification model (each allele is described by the name of the SNP and the nucleotide in brackets; the • in the variable names denotes a multiplicative interaction term between two alleles), LOOCV estimate of the predictive empirical log-likelihood gain Ĝ_LOOCV(D), and p-value (based on 2100 permutations).

Phenotype               E_LOOCV(D)   p-value
Right Hippocampus       0.6656240    0.0023
Left Hippocampus        0.715100     0.0057
Bilateral Hippocampus   0.555630     0.0004

Table 3.2: Results from chapter 2: p-values. Estimated generalization mean squared error for significant regression models (controlling the FDR at 0.05) and p-values (10000 random permutations).

Phenotype               Regression Model
Right Hippocampus       Y = 1.0923 × rs6947784(C) • rs165599(G)
                            + 0.75052 × rs2214653(G) • rs165599(G)
                            − 1.044 × rs10266758(G) + 0.31685
Left Hippocampus        Y = 0.90765 × rs6947784(C) • rs165599(G)
                            + 0.57617 × rs2214653(G) • rs165599(G)
                            − 0.76875 × rs10266758(G) • rs165599(A) + 0.097414
Bilateral Hippocampus   Y = 1.0873 × rs6947784(C) • rs165599(G)
                            + 0.6026 × rs2214653(G) • rs165599(G)
                            − 0.63139 × rs10266758(G) • rs165599(A) − 0.063448

Table 3.3: Results from chapter 2: Regression models. Significant regression models (patient sample).

3.3.4 Discussion

The application of the target selection method to genotype-phenotype analysis was illustrated by re-analyzing data from an imaging genetics study. Using target selection, a significant genotype-phenotype relationship was found between an interaction variable (involving an allele from both the COMT and the mGluR3 gene) and the volume of the hippocampus in alcohol-dependent patients. This finding is consistent with the results of the previous analysis in chapter 2, which was based on a regression model and the use of feature selection. The genetic interaction variable found in the target selection model corresponded to the variable with the largest regression coefficient in all three regression models. The result obtained from target selection indicates that this interaction variable on its own already has a significant association with the hippocampal volume. The main difference between the two approaches lies in the model assumptions one wishes to make: The method described in chapter 2 assumes a model which includes additive interactions between the genetic variables. This, however, comes at the cost of a more complex model in which some of the selected regressors might not represent a population effect, but are due to fitting noise in the given sample. The proposed method based on target selection, on the other hand, restricts itself to modeling a single genetic effect, which reduces the model complexity and makes it more likely that the genetic effect can be reproduced in the population. Moreover, interactions between different phenotypes can be taken into account.

3.4 Summary and Conclusions

In this chapter, target selection was introduced as a new learning paradigm for classification, which can be used for detecting a significant input-target relationship in the data. The difference to the traditional classification paradigm with multiple target variables lies in the fact that the learning machine chooses one of the target variables and learns the prediction function only for the selected target. The objective function for target selection is the expected log-likelihood gain, which was derived from the concept of mutual information. This specific objective function was chosen since the usual objective of minimizing the total misclassification error is not suitable for the target selection paradigm. The reason for this is that for different classification targets the relative size of the classes will usually vary, i.e. some class labels will appear more often in the underlying population than others. Therefore, the larger classes will have a higher a priori probability, and these prior class probabilities will usually depend on the target variable. In the case of genetic association analysis, this means that some genetic variants will be more frequent in the population than others. If a learning machine always chooses the larger class irrespective of X, targets with a very unbalanced class distribution will still achieve a low misclassification error. In the extreme case P(Y = +1) → 1, all examples in the training and test set will be from the larger positive class, so zero misclassification error

can be achieved by this procedure. However, the goal of target selection is to find a target for which a strong statistical dependency between X and Y can be established. Therefore, for target selection an objective function must be used which takes into account only the additional information provided by the input variables. The expected log-likelihood gain fulfills this criterion. However, the expected log-likelihood gain is not accessible in practice, since the true distribution is unknown. Instead, the use of the empirical log-likelihood gain on a test dataset was suggested. A learning machine for target selection which is based on the use of a C-SVM with probabilistic outputs as a classifier was proposed. However, the described framework can easily incorporate other probabilistic classification algorithms. The learning machine for target selection maximizes the empirical log-likelihood gain with respect to the selected target and the hyperparameters of the classification function using cross-validation on the training sample. The proposed method was applied to the field of genotype-phenotype analysis, where it was used to model the effect a single genetic variable has on a multidimensional pattern of phenotypes. In a re-analysis of the data from the genomic imaging study in chapter 2, the finding of a significant genotype-phenotype association for the hippocampal volume could be reproduced. However, the target selection method yielded a sparser genetic model, which included only the regressor with the largest regression coefficient from the regression model learned by MLGPA. In general, the target selection model can be applied to settings where it is assumed that a single genetic variable (which can be a multiplicative interaction variable) influences a multidimensional vectorial phenotype. The genetically influenced pattern for the volumetric variables analyzed in this study was low-dimensional and can be expected to be rather simple, so that a linear classifier already gives good results. However, in tasks like the identification of genetic influences in a multi-voxel pattern analysis of an fMRI study (Norman et al., 2006), the phenotypic vectors will be high-dimensional, and the phenotypic patterns influenced by the genotype could be rather complex. Such problems are well-suited for the application of the target selection paradigm. While a probabilistic SVM was used in the experiments, other probabilistic classification methods could also be applied, e.g. probabilistic artificial neural networks. Moreover, the target selection paradigm could also be combined with the use of feature selection. Sometimes, the phenotype is influenced by non-genetic variables, such as sex and age. Chance correlations between these covariates and one of the genetic variables might lead to spurious genotype-phenotype associations. However, the target selection paradigm can take this into account by testing for associations between each genetic variable and the covariates, and excluding from the analysis those genetic variables for which a significant association is found.

4 Model Comparison in Genomic Imaging Using Information Criteria

4.1 Introduction

In exploratory data analysis, there is a large or even infinite number of hypotheses, and prior knowledge enters mainly via constraints on the functional form or complexity of a predictor. Examples of this have been encountered in chapters 2 and 3, where different assumptions about the general form of the genotype-phenotype relationship were made. In chapter 2, additive and multiplicative effects between alleles at a small number of loci were considered, while in chapter 3 it was assumed that a single genetic variable is associated with a multidimensional phenotype. The choice of candidate polymorphisms and phenotypes included in the analysis was made taking prior knowledge and domain expert knowledge into account. However, the search for the exact model was in both cases exploratory, and a large number of models was included in the analysis without giving preference to any of them. In fact, model selection itself was part of the analysis, either via the mechanism of feature selection or via the newly introduced framework of target selection. Note that these methods of analysis result in one final model, chosen by the learning machine, while all other models are discarded. In contrast to the above exploratory setting, a researcher often has a limited set of well-specified candidate hypotheses (or models), for which there is some scientific support, usually from previous studies or the literature. In a hypothesis-driven analysis, a dataset is obtained, and the goal is to use this data for comparing a limited set of models which were chosen a priori. Such a study can be seen as part of a larger modeling process, in which multiple hypotheses are entertained. Whenever relevant experimental or observational data is collected, it will lend more support to


some of the hypotheses and less support to others. Repeated collection of data (also by other research groups) will then over time lead to a refinement of the 'working set' of model hypotheses. In this chapter, a method for comparing a set of genotype-phenotype association models based on different haplotypes or single SNPs using information criteria is suggested. For this purpose, several models of genotype-phenotype association based on hypotheses derived from former studies needed to be compared on a given dataset, which combined functional magnetic resonance imaging (fMRI) measurements with the genotyping of candidate SNPs. Since the phenotypes correspond to the blood oxygenation level dependent (BOLD) activity measured by fMRI, they are quantitative variables, and the models take the form of multiple linear regression models with a varying number of free parameters. Moreover, the models are not nested, which means that the regressors of a simpler model are not always included in the more complex ones. The goal is to find the most suitable model that explains the data. Naively, one might say that the model which fits the data best (has the highest likelihood) is also the best model. However, since models of higher complexity also have a higher capacity for fitting the data well, there is the risk that they 'overfit' on the given data sample (Breiman et al., 1984). If that happens, the model will not generalize well to other datasets sampled from the population. In fact, the purpose of statistics lies in the realization of appropriate predictions (Akaike, 1985). The importance of the concept of future observation for clarifying the structure of an inference procedure was already realized by Fisher (Fisher, 1935) and Guttman (Guttman, 1967). In this predictive point of view, one is interested in the ability of a model to generalize to the population, not in the fit of the model on a specific sample. Therefore, a model of exactly the right complexity needs to be chosen. This is often interpreted as an application of the principle of Occam's razor (Duda et al., 2001), which says that while a model should explain the data well, it should also be as simple as possible. Necessarily, the choice of model will depend on the available sample size, since the smaller an effect, the more examples are needed to detect it. Thus, for comparing a set of candidate genotype-phenotype models on a given data sample, suitable theoretical concepts have to be employed that allow weighing the model fit against the model complexity. Likelihood-ratio tests (Wald, 1943) are able to take the model complexity into account. However, they are limited to pairwise comparisons and require nested models, and therefore cannot be applied here. Instead, alternative methods based on information theory or Bayesian analysis need to be employed. Information criteria allow the selection of a 'best' model from a set of candidates and the ranking and weighting of the remaining models (Burnham and Anderson, 2002). In contrast to likelihood-ratio tests, model comparison based on information criteria does not require the models to be nested, i.e. the methods can be employed even when the models possess different sets of regressors. Moreover, all models can be compared at once, and there is no need for pairwise comparisons. In this chapter, two information criteria, the Akaike information criterion (AIC) and the Bayesian information criterion (BIC), are used to compare different candidate hypotheses. They have been selected

for their thorough theoretical foundations, which will be explained in section 4.2. The proposed method is applied to a genomic imaging study, which combines genetic analysis with functional magnetic resonance imaging (fMRI). The goal of this study was to assess whether haplotype analysis has an advantage over single-SNP methods in a model comparison of the effects of the COMT gene on the central processing of aversive stimuli. Studies on genomic imaging have increased considerably over the last years, and the application of haplotypes in this field of research has also become popular. However, the observation that a combination of different genetic markers increases the explained variance of functional brain activation does not necessarily mean that the usage of haplotypes is always preferable, since models of higher complexity have a higher capacity for fitting the data well, but might also lead to overfitting. For this study, catechol-O-methyltransferase (COMT), one of the most extensively studied examples in genetic psychiatric research and genomic imaging, was used for a comparison of individual single nucleotide polymorphism (SNP) and haplotype analysis.

4.2 Criteria for Model Selection

In the case of multiple linear regression models, the model complexity is captured by the number of free parameters k. Let the set of hypotheses be denoted by M. In case the models in M are nested (each model is a sub-model of all more complex ones), a pairwise comparison between models can be done by using a likelihood-ratio test. Consider a given dataset D with sample size n, and two nested models M_1 ⊂ M_2, where M_1 has k_1 parameters θ_1 and M_2 has k_2 parameters θ_2 (k_2 > k_1). Let p_{M_i}(D|θ̂_i) denote the likelihood of model M_i at the maximum likelihood parameters θ̂_i. Then the log-likelihood ratio between the two models is asymptotically χ²-distributed,

    -2 \ln \frac{p_{M_1}(D|\hat{\theta}_1)}{p_{M_2}(D|\hat{\theta}_2)} \xrightarrow{n \to \infty} \chi^2_{k_2 - k_1},    (4.1)

with the degrees of freedom corresponding to the difference in the number of parameters (Wald, 1943; Akaike, 1985). The likelihood-ratio test rejects the less complex hypothesis M_1 if the value of the left hand side in eq.(4.1) is larger than the (1 − α) point of a χ² distribution with k_2 − k_1 degrees of freedom, where α is the previously chosen significance level, e.g. α = 0.05. Despite their widespread use, the problem with significance levels in general is that they are chosen arbitrarily: at level α = 0.05, a p-value of p = 0.0499 will be regarded as a significant finding, while p = 0.0501 would be deemed insignificant, although their actual difference is in fact quite negligible. Specific values such as 0.01 or 0.05 have no inherent meaning; they were simply chosen by the scientific community to filter out a sufficient number of false positive results in publications, but using them means introducing an artificial dichotomy between 'good' and 'bad' models, where actually the evidence or support for a model should be viewed as a smoothly varying function of the data. More natural ways of dealing with this problem are founded in information theory (Cover and Thomas, 1991) or Bayesian analysis (Duda et al., 2001; MacKay, 2003)

and directly compare the support provided by the data for each model within a set of promising candidates. In this thesis, two information criteria will be used for this purpose, AIC and BIC. While AIC and BIC are derived from different theoretical backgrounds, both have a similar form: they consist of the logarithm of the likelihood obtained using the maximum likelihood estimator, penalized by a term which depends on the number of degrees of freedom. In contrast to likelihood-ratio tests, they do not require that the models are nested, and are not limited to pairwise comparisons.

4.2.1 Akaike Information Criterion

The AIC was originally proposed by Akaike (Akaike, 1974). For a probabilistic model M described by a parametric density p_M with k parameters θ and a given dataset D, it has the form

    AIC(D, M) = -2 \ln p_M(D|\hat{\theta}) + 2k,    (4.2)

where n is the sample size and θ̂ is the maximum likelihood estimate of the parameters. Lower scores are better, i.e. the model which minimizes AIC(D, M) is to be preferred (Akaike, 1985). The first term in eq.(4.2) corresponds to the negative log-likelihood function, the second one penalizes the number of parameters. In the case of regression analysis, which is based on the minimization of the mean squared error, the AIC takes the form

    AIC(D, M) = n \ln\left( \frac{1}{n} \sum_{i=1}^{n} \hat{e}_i^2 \right) + 2k,    (4.3)

where the ê_i are the estimated residual errors under the maximum likelihood model. The number of free parameters k is the total number of regression coefficients including the intercept, plus one for the residual variance (Burnham and Anderson, 2002). The theoretical derivation of the AIC requires the concept of the Kullback-Leibler divergence (KL-divergence). The KL-divergence (Kullback and Leibler, 1951) measures the difference between two probability densities p(x) and q(x) for a continuous random variable X as

    D_{KL}(p \| q) = \int_{-\infty}^{+\infty} p(x) \log \frac{p(x)}{q(x)}\, dx.    (4.4)

The KL-divergence is always non-negative, and zero only if p(x) = q(x). The term "divergence" is used instead of "distance" since, although it provides a notion of how different two probability distributions are, the KL-divergence is not a metric: it is not symmetric, i.e. in general D_{KL}(p‖q) ≠ D_{KL}(q‖p), and the triangle inequality is not fulfilled (Cover and Thomas, 1991). Let us now assume that the data was generated by drawing independent samples from a hypothetical density p(x), which is called a "generative model", since it is assumed to have generated the data. If q(x) is assumed to be a model of p(x) which is

described by a parameter vector θ, we obtain a so-called parametric density model q_θ(x). The KL-divergence D_{KL}(p‖q_θ) then corresponds to the information loss which occurs if the model q_θ(x) is used instead of the generative model p(x). It can be shown (Burnham and Anderson, 2002) that (under certain regularity conditions) minimizing the AIC corresponds to minimizing an asymptotically unbiased estimator of the expected KL-divergence between the density p(x) and the maximum likelihood estimate q_{θ̂_n}(x) on an independent and identically distributed (i.i.d.) sample D = (x_1, ..., x_n) of size n drawn from p(x),

    E_D\left[ D_{KL}(p \| q_{\hat{\theta}_n}) \right] = \underbrace{\int p(x) \ln p(x)\, dx}_{=\,\text{const}} - E_D\left[ \int p(x) \ln q_{\hat{\theta}_n}(x)\, dx \right].    (4.5)

Here, the expectation is denoted by E_D and is taken with respect to the distribution of an i.i.d. sample of size n. The first term on the right hand side is a constant which does not depend on θ. The expected KL-divergence, eq.(4.5), describes the expected loss of information about x when a maximum likelihood model estimated on the given dataset is used to approximate reality. Model selection can be done using the AIC by choosing from a given set of models M = {M_i}, i = 1,...,m, the model M_j which fulfills

    M_j = \arg\min_i AIC(M_i, D).    (4.6)

This procedure simultaneously conducts model selection and parameter estimation (Akaike, 1981) and is asymptotically equivalent to cross-validation (Stone, 1977). The AIC does not require that the generative model is within the class of candidate models; it just requires that it is close enough in the sense of the Kullback-Leibler divergence (Burnham and Anderson, 2002). Note that the AIC is an asymptotic criterion which requires n → ∞ and strictly holds only if p(x) = q_{θ_0}(x) is an element of the model class. If p(x) is not a member of the model class, but close enough, it still holds approximately. Since the AIC does not require nested models and can be used to compare an arbitrary number of different models, it is more generally applicable than likelihood-ratio tests (Akaike, 1981, 1985). Moreover, it is based on a well-defined criterion (finding the model with the lowest expected KL-divergence to reality) and does not require the choice of an arbitrary significance level. An interesting question is whether the use of the AIC decision criterion can be viewed in the framework of likelihood-ratio tests, in order to get an insight into their relation. In the following, this is done for the task of deciding between two nested models M_1 ⊂ M_2, where one can directly compare the effect of the AIC as model selection criterion to the likelihood-ratio test, eq.(4.1). Model M_1 will be rejected according to the AIC if

    AIC(D, M_2) < AIC(D, M_1).    (4.7)

∆k    α        ∆k    α
1     0.1573    8     0.0424
2     0.1353    9     0.0352
3     0.1116    10    0.0293
4     0.0916    20    0.0050
5     0.0752    30    0.0009
6     0.0620    40    0.0002
7     0.0512    45    0.0001

Table 4.1: Comparison between AIC and Likelihood-Ratio Test. Significance levels α of a likelihood ratio test corresponding to the AIC for different values of the difference in the number of free parameters ∆k.
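The α values in Table 4.1 can be reproduced (up to rounding) from the χ² survival function evaluated at 2∆k, as derived in eq.(4.8) below; a minimal check using scipy:

    from scipy.stats import chi2

    for dk in (1, 2, 3, 10, 20, 40):
        # alpha = P(chi2_dk > 2*dk), the significance level implied by the AIC
        print(dk, round(chi2.sf(2 * dk, df=dk), 4))
    # e.g. dk=1 -> 0.1573 and dk=2 -> 0.1353, matching the table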

Inserting eq.(4.2) yields

    -2 \ln \frac{p_{M_1}(D|\hat{\theta}_1)}{p_{M_2}(D|\hat{\theta}_2)} > 2(k_2 - k_1),    (4.8)

where the left side has the same form as in eq.(4.1). This criterion corresponds to conducting a likelihood-ratio test where the significance level α is not fixed to some arbitrary value (e.g. 0.05), but depends on ∆k = k_2 − k_1. By equating the right hand side of eq.(4.8) with the values of the χ²_{∆k} statistic, the significance level a specific value of ∆k corresponds to in a likelihood-ratio test can be calculated. This was done for various values of ∆k and is displayed in Table 4.1. This calculation shows that with increasing complexity difference between the two models, the χ²-test significance level corresponding to the AIC is lowered, so it becomes harder to reject the less complex model M_1 in favor of the more complex model M_2. Note that the AIC does not require nested models; they were only used in the above analysis to allow a direct comparison to the likelihood-ratio test. The individual values of the AIC contain arbitrary constants and depend on the sample size, so only their differences are interpretable. The differences to the minimal value within the considered set of m models M = {M_i}, i = 1,...,m,

    \Delta_i = AIC(D, M_i) - \min_j AIC(D, M_j),    (4.9)

correspond to the loss of information if model M_i is used instead of the best one in the set. Since the constants and sample size effects have been removed from these rescaled values, they can be used to compare the evidence for models in different scenarios. Using these values, Akaike (Akaike, 1981) defined the likelihoods for specific models as p(D|M_i) = exp(−∆_i/2), which can be normalized over the model set to yield the so-called Akaike weights,

    w_i = \frac{\exp(-\Delta_i/2)}{\sum_{j=1}^{m} \exp(-\Delta_j/2)}, \quad i = 1, \ldots, m.    (4.10)

In a Bayesian interpretation, the Akaike weights w_i correspond to the probability that the model M_i is the best model with respect to the expected KL-divergence. The ratio of two Akaike weights then provides an evidence ratio for the respective models. If for the most complex model in the set the condition n < 40k holds, the asymptotic AIC should be replaced by the AICc (Sugiura, 1978), which includes a small-sample correction term and is given by

    AICc(D, M) = -2 \ln p_M(D|\hat{\theta}) + 2k + \frac{2k(k+1)}{n-k-1}.    (4.11)

In the large sample limit, the AICc becomes equivalent to the AIC and retains its asymptotic properties. It is therefore advocated in the literature (Burnham and Anderson, 2002) that the AICc should always be used in practice. In the case of regression analysis, the AICc can be calculated as

    AICc(D, M) = n \ln\left( \frac{1}{n} \sum_{i=1}^{n} \hat{e}_i^2 \right) + 2k + \frac{2k(k+1)}{n-k-1}.    (4.12)

4.2.2 Bayesian Information Criterion

The BIC, also called the Schwarz criterion (Schwarz, 1978), has the form

    BIC(D, M) = -2 \ln p_M(D|\hat{\theta}) + k \ln n,    (4.13)

where p_M is a parametric density model with k parameters θ and D is a dataset with sample size n. Again, the lower the score, the better the model. In the regression case, the BIC can be expressed as

    BIC(D, M) = n \ln\left( \frac{1}{n} \sum_{i=1}^{n} \hat{e}_i^2 \right) + k \ln n,    (4.14)

where the ê_i are the estimated residual errors under the maximum likelihood model. The number of free parameters k is the total number of regression coefficients including the intercept, plus one for the residual variance (Burnham and Anderson, 2002). In the following, it is shown how the BIC is obtained as the asymptotic limit of a Laplace approximation of the likelihood of the model parameters. Assuming a set of candidate models {M_i}, i = 1,...,m, parameterized by θ_i ∈ Θ_i ⊂ R^{k_i}, the posterior probability for the model M_i given the dataset D can be calculated according to Bayes' theorem as

    P(M_i | D) = \frac{p(D|M_i) P(M_i)}{p(D)} = \frac{p(D|M_i) P(M_i)}{\sum_{j=1}^{m} p(D|M_j) P(M_j)},    (4.15)

where p(D|M_i) is the likelihood that the dataset was generated by model M_i and is given by

    p(D|M_i) = \int p_{M_i}(D|\theta_i)\, p(\theta_i|M_i)\, d\theta_i,    (4.16)

with the parameter prior p(θ_i|M_i). A Taylor approximation of the log-likelihood around the maximum likelihood parameter θ̂_i yields

    \ln p_{M_i}(D|\theta_i) \approx \ln p_{M_i}(D|\hat{\theta}_i) + (\theta_i - \hat{\theta}_i)^T \frac{\partial \ln p_{M_i}(D|\hat{\theta}_i)}{\partial \theta_i} + \frac{1}{2} (\theta_i - \hat{\theta}_i)^T \frac{\partial^2 \ln p_{M_i}(D|\hat{\theta}_i)}{\partial \theta_i \partial \theta_i^T} (\theta_i - \hat{\theta}_i)
                            = \ln p_{M_i}(D|\hat{\theta}_i) + \frac{1}{2} (\theta_i - \hat{\theta}_i)^T \frac{\partial^2 \ln p_{M_i}(D|\hat{\theta}_i)}{\partial \theta_i \partial \theta_i^T} (\theta_i - \hat{\theta}_i),    (4.17)

where the second equality holds because at the maximum likelihood solution the gradient of the likelihood with respect to the parameters must vanish. Inserting the exponential of eq.(4.17) into eq.(4.16) gives

    p(D|M_i) \approx p_{M_i}(D|\hat{\theta}_i) \int \exp\left( \frac{1}{2} (\theta_i - \hat{\theta}_i)^T \frac{\partial^2 \ln p_{M_i}(D|\hat{\theta}_i)}{\partial \theta_i \partial \theta_i^T} (\theta_i - \hat{\theta}_i) \right) p(\theta_i|M_i)\, d\theta_i.    (4.18)

If an uninformative, improper prior p(θ_i|M_i) = 1 is chosen for the parameters, the integral in eq.(4.18) corresponds to integrating over a k_i-dimensional unnormalized Gaussian distribution, giving

    p(D|M_i) \approx p_{M_i}(D|\hat{\theta}_i)\, (2\pi)^{k_i/2} \det\left( \frac{\partial^2 \ln p_{M_i}(D|\hat{\theta}_i)}{\partial \theta_i \partial \theta_i^T} \right)^{-1/2}    (4.19)
             = p_{M_i}(D|\hat{\theta}_i)\, (2\pi)^{k_i/2}\, n^{-k_i/2} \det\left( \frac{1}{n} \cdot \frac{\partial^2 \ln p_{M_i}(D|\hat{\theta}_i)}{\partial \theta_i \partial \theta_i^T} \right)^{-1/2}.    (4.20)

Eq.(4.19) can be proven by conducting an orthogonal transformation into the eigenbasis spanned by the eigenvectors ξ_j corresponding to the eigenvalues λ_j of ∂² ln p_{M_i}(D|θ̂_i)/∂θ_i ∂θ_i^T, in which the Hessian becomes a diagonal matrix. Note that since θ̂_i is a maximum, the Hessian of the log-likelihood function is negative definite, and therefore the eigenvalues λ_j are all negative. The integral can then be separated into a product of one-dimensional Gaussian integrals,

    \prod_{j=1}^{k} \int \exp\left( -\frac{1}{2} |\lambda_j| \xi_j^2 \right) d\xi_j = \prod_{j=1}^{k} \sqrt{\frac{2\pi}{|\lambda_j|}} = \sqrt{\frac{(2\pi)^k}{\prod_{j=1}^{k} |\lambda_j|}}    (4.21)
    = \sqrt{\frac{(2\pi)^k}{\det\left( \frac{\partial^2 \ln p_{M_i}(D|\hat{\theta}_i)}{\partial \theta_i \partial \theta_i^T} \right)}}.    (4.22)


Taking the logarithm of eq.(4.20) yields

    \ln p(D|M_i) \approx \ln p_{M_i}(D|\hat{\theta}_i) + \frac{k_i}{2} \ln(2\pi) - \frac{k_i}{2} \ln n - \frac{1}{2} \ln \det\left( \frac{1}{n} \cdot \frac{\partial^2 \ln p_{M_i}(D|\hat{\theta}_i)}{\partial \theta_i \partial \theta_i^T} \right).    (4.23)

For sufficiently large sample size n this becomes

    \ln p(D|M_i) \approx \ln p_{M_i}(D|\hat{\theta}_i) - \frac{k_i}{2} \ln n.    (4.24)

Multiplying eq.(4.24) by (−2) yields exactly the BIC, eq.(4.13),

    -2 \ln p(D|M_i) \approx -2 \ln p_{M_i}(D|\hat{\theta}_i) + k_i \ln n = BIC(D, M_i).    (4.25)

Therefore, minimizing BIC(D,M_i) corresponds to maximizing the approximated likelihood that the dataset was generated by model M_i. The approximation eq.(4.24) can be plugged into eq.(4.15) to yield an approximate expression for the posterior probability of model M_i,

    P(M_i | D) \approx \frac{\exp\left( \ln p_{M_i}(D|\hat{\theta}_i) - (k_i/2) \ln n \right) P(M_i)}{\sum_{j=1}^{m} \exp\left( \ln p_{M_j}(D|\hat{\theta}_j) - (k_j/2) \ln n \right) P(M_j)}
               = \frac{\exp(-0.5\, BIC(D,M_i))\, P(M_i)}{\sum_{j=1}^{m} \exp(-0.5\, BIC(D,M_j))\, P(M_j)}.    (4.26)

Eq.(4.26) provides the probability that model M_i is the best model of the set (according to the used approximation). Similar to the analysis carried out for the AIC in section 4.2.1, model selection using the BIC can be directly compared to the likelihood-ratio test, eq.(4.1), if one considers the special case of two nested models M_1 ⊂ M_2. Model M_1 will be rejected according to the BIC if

    BIC(D, M_2) < BIC(D, M_1).    (4.27)

Inserting eq.(4.13) yields

    -2 \ln \frac{p_{M_1}(D|\hat{\theta}_1)}{p_{M_2}(D|\hat{\theta}_2)} > (k_2 - k_1) \ln n.    (4.28)

Note that the left hand side has the same form as in eq.(4.1). The criterion eq.(4.28) corresponds to conducting a likelihood-ratio test with a variable significance level depending on both ∆k = k_2 − k_1 and n. Equating the right hand side of eq.(4.28) to the values of the χ²_{∆k} statistic for specific values of ∆k and n yields the significance levels in the corresponding likelihood-ratio test. The results of this analysis are displayed in Table 4.2. This calculation shows that with increasing complexity difference between the two models and with increasing sample size n, the χ²-test significance level corresponding to the BIC is lowered, so it becomes harder to reject the less complex model M_1 in favor of the more complex model M_2.

          n = 10       n = 100      n = 1000     n = 10000    n = 100000
∆k = 1    0.12916      0.031876     0.0085823    0.0024065    0.00069114
∆k = 2    0.1          0.01         0.001        0.0001       1e-005
∆k = 3    0.074897     0.0031673    0.00012017   4.3409e-006  1.5246e-007
∆k = 4    0.056052     0.001021     1.4816e-005  1.9421e-007  2.4026e-009
∆k = 5    0.042107     0.00033375   1.8596e-006  8.8646e-009  3.868e-011
∆k = 10   0.010652     1.4036e-006  6.6794e-011  1.9984e-015  0
∆k = 20   0.00079296   3.1712e-011  0            0            0
∆k = 30   6.4451e-005  7.7716e-016  0            0            0
∆k = 40   5.4655e-006  0            0            0            0

Table 4.2: Comparison between BIC and Likelihood-Ratio Test. Shown are the significance levels α of a likelihood-ratio test corresponding to the BIC, for different values of the difference in the number of free parameters ∆k and sample size n.
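To make the use of both criteria concrete, the following sketch scores a set of candidate regression models by AICc (eq. 4.12) and BIC (eq. 4.14) from their residuals and converts the resulting differences into Akaike weights (eq. 4.10); the function names and inputs are illustrative, not taken from the study.

    import numpy as np

    def aicc(residuals, k):
        n = len(residuals)
        return (n * np.log(np.mean(residuals ** 2)) + 2 * k
                + 2 * k * (k + 1) / (n - k - 1))               # eq.(4.12)

    def bic(residuals, k):
        n = len(residuals)
        return n * np.log(np.mean(residuals ** 2)) + k * np.log(n)  # eq.(4.14)

    def akaike_weights(scores):
        deltas = np.asarray(scores) - np.min(scores)           # eq.(4.9)
        w = np.exp(-deltas / 2.0)
        return w / w.sum()                                     # eq.(4.10)

    # models: dict mapping name -> (residual vector, number of free parameters k)
    # scores = [aicc(r, k) for r, k in models.values()]
    # print(dict(zip(models, akaike_weights(scores))))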

4.3 Application to Genomic Imaging

4.3.1 Introduction

Genomic imaging has been a productive research field over the last years, and publications in this area have increased dramatically. Catechol-O-methyltransferase (COMT) on chromosome 22q11, the most relevant dopamine-degrading enzyme in the prefrontal cortex, is one of the most extensively studied examples in complex psychiatric diseases. Several studies demonstrated an effect of the COMT genotype on frontal activation elicited by working memory function (Egan et al., 2001; Ho et al., 2005) and affective stimuli (Smolka et al., 2005; Drabant et al., 2006; Smolka et al., 2007). A functionally relevant single nucleotide polymorphism (SNP) in exon 4 (rs4680, G/A) coding for an amino acid substitution from valine to methionine at position 158 (Val158Met) leads to reduced enzyme thermostability. Subjects with the Met allele, coding for low COMT expression and presumably high prefrontal extracellular dopamine concentrations, display increased working memory performance with less brain activation, suggestive of more efficient information processing (Egan et al., 2001). On the other hand, Met allele carriers have also been suggested to display an increased vulnerability to negative mood states, and brain imaging studies showed increased limbic activation elicited by aversive stimuli for this genotype (Smolka et al., 2005, 2007; Drabant et al., 2006). Whereas initially only discrete genetic variants, mostly SNPs, were analyzed, recent studies in genomic imaging incorporated multiple polymorphisms both within the same gene as well as distributed over several genes (Meyer-Lindenberg et al., 2006; Tan et al., 2007; Buckholtz et al., 2007; Nicodemus et al., 2007). It was assumed that the inclusion of multiple genetic variants and haplotypes rather than individual SNPs takes better account of phenotype variability. Meyer-Lindenberg investigated the effect of

Val158Met in combination with a COMT P2 promoter SNP (rs2097603), as well as a combined analysis with a third SNP located in the 3' region of the gene (rs165599). The study confirmed the hypothesis that optimal prefrontal dopamine concentrations follow an "inverted U shape" dependent on the Val158Met genotype, and that additional genetic variation in the COMT gene can alter dopamine degradation and push subjects above or below optimal dopamine availability. Although the G allele of rs2097603 codes for low COMT expression and thereby increased dopamine availability, subjects with the haplotype G-A (rs2097603-rs4680) display a rather inefficient prefrontal response, probably due to too high levels of dopamine. Nackley et al. (2006) showed that another combination of COMT genetic variants, consisting of Val158Met and three adjacent SNPs (rs6269-rs4633-rs4818-rs4680), modulates protein expression according to the stability of the mRNA secondary structure of the corresponding haplotype. Specifically, increased COMT activity and consequently low pain sensitivity (LPS) with Val158 was only found when this allele was embedded in the haplotype G-C-G-G (rs6269-rs4633-rs4818-rs4680), whereas the haplotype A-C-C-G displayed the lowest COMT activity and high pain sensitivity (HPS) as a result of its unstable mRNA secondary structure. This study highlights the more profound information provided by haplotype analysis compared to single genetic alterations. The different haplotypes in the COMT gene have never been compared with each other and with individual SNPs in a genomic imaging study. The observation that a combination of different genetic markers increases the explained variance of functional brain activation does not necessarily mean that modeling genotype effects with more markers is preferable. While models of higher complexity have a higher capacity for fitting the training data well, they might lead to a poor representation of reality, a feature known as overfitting (Breiman, 1994; Bishop, 2006). In the following study, the above described information criteria are used to evaluate the support which a measured dataset provides for a single SNP model and several haplotype models suggested in the literature for modeling COMT effects on the central processing of affective stimuli.

4.3.2 Materials and Methods

Subjects

60 right-handed healthy volunteers (age 41.0 ± 9.2 [M ± SD] years, 11 females) participated in the study after providing informed, written consent according to the Declaration of Helsinki. The Ethics Committee of the University of Heidelberg approved the study. All subjects were of central European descent. Data from 48 of these subjects have been published in previous studies (Smolka et al., 2005, 2007), while an additional 12 subjects were recruited for the current study. A standardized clinical assessment with the Structured Clinical Interview I and II (First et al., 1997, 2001) was performed to exclude subjects with a lifetime history of axis I or II psychiatric disorders according to DSM-IV and ICD-10. Present drug abuse was excluded with urine tests. Only subjects free of any medication were included. Additionally, we assessed

the level of education, anxiety (SCL-90-R, STAI) (Spielberger et al., 1970; Derogatis, 1983) and depression (SCL-90-R, CES-D, HAMD) (Derogatis, 1983; Hamilton, 1986; Radloff, 1997).

Imaging Study

For emotion induction, we used affectively unpleasant, pleasant and neutral pictures. Each category was represented by 18 pictures. Pleasant and unpleasant cues were taken from the International Affective Picture System IAPS (Lang et al., 1999), in which images are standardized across the dimensions of emotional valence and arousal (Lang, 1995). IAPS catalog numbers were 1440, 2340, 2391, 4220, 4250, 4680, 5260, 5450, 5470, 5480, 5623, 5660, 5830, 7580, 8120, 8190, 8300, 8510 for positive pictures and 2800, 3000, 3015, 3080, 3102, 3140, 3180, 3230, 3261, 3350, 6360, 6570, 9040, 9340, 9520, 9570, 9910, 9921 for negative pictures. Participants were instructed to passively view the stimuli, because even simple rating tasks can alter the brain activation pattern (Taylor et al., 2003). The stimuli were presented for 750 ms using an event-related design and were arranged in an individually randomized order for each subject. To reconstruct the blood oxygen level dependent (BOLD) event-related time course, it is necessary to sample data points at different peristimulus time points. This was achieved by a random jitter between the intertrial interval and the acquisition time, resulting in an equal distribution of data points after each single stimulus. The intertrial interval was randomized between 3 and 6 acquisition times (i.e. 9.9 - 19.8 seconds). During the intertrial interval a fixation cross was presented. Scanning was performed with a 1.5 T whole-body tomograph (Magnetom VISION; Siemens, Erlangen, Germany) equipped with a standard quadrature head coil. For functional magnetic resonance imaging (fMRI), 24 slices were acquired every 3.3 sec (4 mm thickness, 1 mm gap) using an EPI sequence (TR = 1.8 ms, TE = 66 ms, α = 90°) with an in-plane resolution of 64 × 64 pixels (FOV 220 mm), resulting in a voxel size of 3.4 × 3.4 × 5 mm³. FMRI slices were oriented axially parallel to the AC-PC line. For anatomical reference, we acquired a morphological 3D T1-weighted magnetization prepared rapid gradient echo (MPRAGE) image data set (1 × 1 × 1 mm³ voxel size, FOV 256 mm, 162 slices, TR = 11.4 ms, TE = 4.4 ms, α = 12°) covering the whole head. After the MRI scan, a subset of the stimuli (8/18 per category) was again presented on a computer monitor for 6 s and assessed for arousal and valence according to the standardized procedure described by Bradley and Lang (1994).

Genotyping

For genetic analysis, 30 ml of EDTA blood were collected from each individual. According to prior publications (Meyer-Lindenberg et al., 2006; Nicodemus et al., 2007), six SNPs of the COMT gene were chosen for genotyping (rs2097603, rs6269, rs4633, rs4818, rs4680 [Val158Met], rs165599), see Figure 4.1. Primers were designed for amplification of relevant genomic regions by polymerase chain reaction (PCR); products

were cut by allele specific restriction enzymes and visualized after gel electrophoresis. Primer information and specific assay conditions are available on request. Haplotype construction was performed using HAP Haplotype Analysis Tool (Halperin and Eskin, 2004).

Figure 4.1: Structure of the catechol-O-methyltransferase (COMT) gene. Green boxes depict exons (i.e. the coding sequence which is translated into amino acids). SNPs selected for our study (marked in blue) are scattered throughout the gene, i.e. the promoter region (expression regulating sequence), exons and introns (sequences between exons, some of which might have regulatory functions, while others are functionally inactive).

Data Analysis

Data were analyzed with Statistical Parametric Mapping (SPM5; Wellcome Department of Imaging Neuroscience, London, U.K.). After temporal and spatial realignment, the images were spatially normalized to a standard EPI template using a 12-parameter affine transformation with additional nonlinear components and subsequently re-sampled at a resolution of 3 × 3 × 3 mm³. The functional data were smoothed using an isotropic Gaussian kernel for group analysis (12 mm FWHM).

Statistical analysis on the first (individual) level was performed by modeling the different conditions (unpleasant, pleasant and neutral pictures; delta functions convolved with a synthetic hemodynamic response function and its time derivative) as explanatory variables within the context of the general linear model (GLM) on a voxel-by-voxel basis with SPM5.

To detect associations between the genotypes and fMRI activation elicited by unpleasant stimuli on a voxel-by-voxel basis, the contrast images of all subjects (signal change of unpleasant versus neutral pictures) were included in a second level multiple regression analysis with SPM5. All regression models included sex and age as additional covariates, as well as a constant regressor. To assess how much variance is explained by the additional SNPs, we applied five different models (see Table 4.3): The first model (SNP1) only incorporated rs4680 (COMT Val158Met), and genotype was coded as the number of low-activity Met158 alleles (0, 1 or 2); the second model (SNP2) incorporated a combination of rs4680 and rs2097603; for the latter SNP, genotype was coded as the number of low-activity G alleles. Both were informed models, assuming that brain activity elicited by unpleasant stimuli would increase with decreasing COMT activity.

The other three models were based on haplotypes according to Meyer-Lindenberg et al. (2006) and Nackley et al. (2006), see Table 4.5. Another haplotype model from

Model | SNPs / Haplotype pattern    | k
SNP1  | rs4680                      | 5
SNP2  | rs4680, rs2097603           | 6
HAP1  | rs2097603-rs4680            | 7
HAP2  | rs2097603-rs4680-rs165599   | 11
HAP3  | rs6269-rs4633-rs4818-rs4680 | 12

Table 4.3: Different Genetic Models. This table shows the different models, the involved SNPs or haplotype patterns and the total number of free parameters (k).

Meyer-Lindenberg et al. (2006) (rs737865-rs4680-rs165599) was not chosen for our analysis because it showed only a weak effect on activation of the prefrontal cortex during a working memory task. For the selected haplotype models we used (number of haplotypes − 1) regressors depicting the frequency of a specific haplotype (0, 1 or 2) for a subject; a sketch of this coding is given below. In these models, no assumption about the direction of the effect was made. The analysis of the genotype × task interaction was restricted to areas showing robust activation related to the task (unpleasant versus neutral stimuli, including age and sex as covariates) as defined by an F-contrast of p < 0.01 and a cluster size (KE) of at least 10 voxels. To estimate the global effect of genotype on brain reactivity, we calculated the number of voxels of the volume of interest (VOI) in which the respective genotype model was significantly associated (p < 0.05, uncorrected) with cue-elicited brain activity.
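To make the regressor coding concrete, the following minimal sketch shows how the SNP and haplotype design-matrix columns described above could be constructed. The function names and the example genotypes are hypothetical illustrations, not part of the original SPM5 analysis pipeline.

    import numpy as np

    def snp_regressor(genotypes, low_activity_allele):
        """Code each subject's genotype as the number of low-activity
        alleles (0, 1 or 2), as in the SNP1/SNP2 models."""
        return np.array([g.count(low_activity_allele) for g in genotypes])

    def haplotype_regressors(subject_haplotypes, haplotype_list):
        """Build (number of haplotypes - 1) count regressors: column h holds
        the number of copies (0, 1 or 2) of haplotype h carried by a subject.
        The last haplotype is dropped to avoid collinearity with the
        constant regressor."""
        kept = haplotype_list[:-1]
        X = np.zeros((len(subject_haplotypes), len(kept)))
        for i, (h1, h2) in enumerate(subject_haplotypes):
            for j, hap in enumerate(kept):
                X[i, j] = (h1 == hap) + (h2 == hap)
        return X

    # Hypothetical usage for the HAP1 model (rs2097603-rs4680):
    haps = ["AA", "AG", "TA", "TG"]
    subjects = [("TG", "AA"), ("TG", "TG"), ("AA", "AG")]
    X_hap1 = haplotype_regressors(subjects, haps)   # shape (3, 3)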

Model Comparison Procedure

Two information criteria were used for the model comparison: the Akaike information criterion in its small-sample correction (AICc), eq.(4.11), and the Bayesian information criterion (BIC), eq.(4.13). Only those voxels in which a task-related activation was found were included in the analysis. For the model comparison it was assumed that the population density can be described by a generative regression model, and all included voxels are considered as different datasets sampled from the same underlying distribution. Altogether, five regression models with different regressors and numbers of parameters were compared. For each model, the maximum likelihood parameters were fitted for each of the included voxels within the general linear model framework of SPM5. The residual mean squared error for each voxel was then used in the calculation of the different information criteria. The number of free parameters k, which is required for the calculation of the information criteria, comprises the number of regressors, the intercept and the residual variance. The information criteria were averaged over all voxels, i.e. over all different datasets sampled from the population density.
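A minimal sketch of this computation, assuming eqs.(4.11) and (4.13) take their standard least-squares forms under Gaussian residuals (this assumption is consistent with the values reported in Table 4.6 for n = 60):

    import numpy as np

    def information_criteria(mse, n, k):
        """AICc and BIC for a least-squares fit with Gaussian residuals.
        mse: residual mean squared error of one voxel, n: sample size,
        k: number of free parameters (regressors + intercept + residual
        variance)."""
        ll = n * np.log(mse)                         # n*log(RSS/n), up to a constant
        aic = ll + 2 * k
        aicc = aic + 2 * k * (k + 1) / (n - k - 1)   # small-sample correction
        bic = ll + k * np.log(n)
        return aicc, bic

    def model_weights(scores):
        """Akaike weights or BIC posterior model probabilities (under equal
        priors) from a vector of criterion values; lower scores are better."""
        scores = np.asarray(scores)
        delta = scores - scores.min()
        w = np.exp(-0.5 * delta)
        return w / w.sum()

    # The criteria are averaged over all voxels of the VOI before the models
    # are compared, e.g. aicc_model = np.mean([information_criteria(m, 60, k)[0]
    # for m in voxel_mse]).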

SNP       | Alleles | Allele frequencies | Published allele frequencies | Genotype frequencies | Published genotype frequencies
rs2097603 | A/T*    | 0.4/0.6            | N/A                          | 0.17/0.47/0.37       | N/A
rs6269    | G/A     | 0.41/0.59          | 0.49/0.51                    | 0.15/0.52/0.33       | N/A
rs4633    | C/T     | 0.51/0.49          | 0.48/0.52                    | 0.23/0.55/0.22       | 0.22/0.53/0.25
rs4818    | C/G*    | 0.59/0.41          | 0.59/0.41                    | 0.33/0.52/0.15       | 0.36/0.45/0.19
rs4680    | G/A     | 0.53/0.47          | 0.48/0.52                    | 0.27/0.53/0.20       | 0.22/0.53/0.25
rs165599  | G/A     | 0.33/0.67          | 0.46/0.54                    | 0.13/0.40/0.47       | 0.22/0.48/0.30

Table 4.4: SNP characteristics. Allele and genotype frequencies of the single nucleotide polymorphisms (SNPs) in our study population, compared to published frequencies from the NCBI SNP database (http://www.ncbi.nlm.nih.gov/) in individuals of European descent. If known, alleles are displayed with the ancestral allele first; all other SNPs are marked with *. Genotype frequencies are given as homozygous allele 1 / heterozygous / homozygous allele 2. If several database sources are available in NCBI, European HapMap data are chosen. N/A = not available. rs4680 (Val158Met) is the functional SNP that has been examined in multiple studies and is used in our study as model SNP1. rs2097603 is located in the P2 promoter region of the gene and also has an effect on COMT expression. The combined analysis of rs4680 and rs2097603 was used as model SNP2.

4.3.3 Results

Genotyping

Genotyping results are depicted in Table 4.4. Genotypes of all SNPs were in Hardy-Weinberg equilibrium. Allele frequencies of all SNPs concurred with published data (NCBI SNP database). Haplotype phases were inferred with the HAP Haplotype Analysis Tool, which reports the most probable haplotype. Haplotype frequencies are comparable to published data (Meyer-Lindenberg et al., 2006; Nackley et al., 2006). Identified haplotypes and their frequencies are shown in Table 4.5.
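Hardy-Weinberg equilibrium is commonly checked with a Pearson chi-square test on the observed genotype counts; a minimal sketch of such a test follows (the counts in the usage line are hypothetical, and the original study may have used a different testing procedure):

    import numpy as np
    from scipy.stats import chi2

    def hwe_chi2(n_aa, n_ab, n_bb):
        """Pearson chi-square test for Hardy-Weinberg equilibrium from
        genotype counts (homozygous a, heterozygous, homozygous b)."""
        n = n_aa + n_ab + n_bb
        p = (2 * n_aa + n_ab) / (2 * n)                      # frequency of allele a
        expected = np.array([p**2, 2 * p * (1 - p), (1 - p)**2]) * n
        observed = np.array([n_aa, n_ab, n_bb])
        stat = ((observed - expected) ** 2 / expected).sum()
        return stat, chi2.sf(stat, df=1)                      # 1 degree of freedom

    # Hypothetical genotype counts for one SNP in 60 subjects:
    stat, pval = hwe_chi2(16, 32, 12)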

Main effect of the task irrespective of genotype

Significant differences in brain activity elicited by unpleasant stimuli compared to neutral stimuli were found in a total of 5962 voxels (3 × 3 × 3 mm), i.e. a volume of 161 ml. The main effect of unpleasant versus neutral stimuli comprised a distributed neuronal network of primary, secondary and tertiary visual regions in the occipital, temporal and parietal lobe, bilateral amygdala, hippocampus and parahippocampus, and regions in the left and right medial, ventrolateral and dorsolateral prefrontal cortex (cf. Figure 4.2, left panel). The following analyses of the genotype × task interactions were restricted to these voxels (VOI).

HAP1 (rs2097603-rs4680):
  AA 0.28 | AG 0.12 | TA 0.18 | TG 0.42

HAP2 (rs2097603-rs4680-rs165599):
  AAA 0.242 | AAG 0.042 | AGG 0.025 | AGA 0.092 | TAG 0.025 | TGA 0.192 | TGG 0.242 | TAA 0.142

HAP3 (rs6269-rs4633-rs4818-rs4680):
  GCGG 0.392 | ATCA 0.433 | ACCG 0.100 | GTCG 0.008 | ATCG 0.008 | GCCG 0.008 | ACGG 0.008 | ATCA 0.033 | ATGG 0.008

Table 4.5: Selected haplotypes for comparison. Both haplotype models rs2097603-rs4680 and rs2097603-rs4680-rs165599 were studied by Meyer-Lindenberg et al. (2006). rs6269-rs4633-rs4818-rs4680 comprises the haplotypes examined by Nackley et al. (2006); according to Nackley et al. (2006), GCGG has the highest COMT expression, ATCA has average COMT expression and ACCG has the lowest COMT expression.

COMT genotype and processing of unpleasant visual stimuli

Regression analyses revealed significant associations (p < 0.05) between all COMT genotype models and the BOLD fMRI response to unpleasant versus neutral visual stimuli in limbic and prefrontal brain areas as well as in brain areas related to visuospatial attention (Fig. 4.2). Depending on the model, brain activation was significantly associated with genotype in between 31% (HAP2; 1878 voxels; 8 haplotypes) and 74% (SNP2; 4440 voxels) of the VOI. SNP1, the simplest model, incorporating only rs4680 (COMT Val158Met), yielded significant associations in 66% (3911 voxels) of the VOI, HAP1, the simplest haplotype model (4 haplotypes), in 42% (2531 voxels) of the VOI, and HAP3, the most complex haplotype model (9 haplotypes), in 44% (2606 voxels) of the VOI.

Comparison of different models

The results of the model comparison are given in Table 4.6. For each of the models, the average values of AICc (eq.(4.11)) and BIC (eq.(4.13)) were calculated. The models with lower values in these information scores should be preferred. The value of the log-likelihood term, LL = n log( (1/n) Σ_{i=1}^{n} ê_i² ), is also listed for comparison. A very revealing quantity is the relative belief one should put into the models given the data, according to the information criteria. This is provided in Table 4.6 by the Akaike weights wAICc (eq.(4.10)), which were obtained using the AICc, and the posterior model probabilities P(Model|D)BIC (eq.(4.26)), which arise from the BIC criterion under the assumption of equal prior probabilities.

An interesting result of this analysis is that both information criteria assign exactly the same ranking to the five models. This ranking is reflected in the ordering of Table 4.6 from best (top) to worst (bottom). Both assign the strongest belief value

Figure 4.2: Effect of genetic variation of COMT on processing of aversive stimuli. The glass brain at left depicts the brain regions that are activated by the task itself (p < 0.05, KE ≥ 10) and was used as VOI for the analyses of the effects of the five different genetic models. The other panels show for each of these five models in how many voxels a significant effect of genotype was detected (p < 0.05, KE ≥ 10). SNP1: rs4680; SNP2: rs2097603-rs4680; HAP1: rs2097603-rs4680; HAP2: rs2097603-rs4680-rs165599; HAP3: rs6269-rs4633-rs4818-rs4680. KE: cluster extent in voxels.

to model SNP1 (rs4680) (80.8% for BIC and 61.3% for AICc), while the second best model, SNP2, only receives less than half of this value (17.3% for BIC and 30.8% for AICc). The best haplotype-based model (HAP1) only gets 1.9% (BIC) and 7.8% (AICc) of belief, and the remaining two haplotype models (HAP2 and HAP3) get less than one per mille of belief for both criteria. If we consider only the log likelihood, without taking the model complexity into account, the SNP1 model still scores best, but HAP1 follows closely and has a higher likelihood than SNP2. However, the information criteria assign a much higher Akaike weight or posterior to SNP2 than to HAP1. As expected from theory, the AIC preferred more complex models compared to the BIC. The fact that both information criteria, though having different theoretical foundations (section 4.2), arrive at exactly the same model ranking provides additional support for the results.

4.3.4 Discussion

The method for model comparison based on information criteria was applied to the question of whether effects of the COMT gene on central processing of affective cues are best modeled by an individual SNP, a combination of two SNPs or one of several haplotypes suggested in the literature. The main finding of this study is that, based on

Model | k  | LL        | AICc      | wAICc    | BIC       | P(Model|D)BIC
SNP1  | 5  | 45.993338 | 57.104449 | 0.625525 | 66.465061 | 0.807627
SNP2  | 6  | 44.982722 | 58.567627 | 0.300968 | 69.548789 | 0.172818
HAP1  | 7  | 45.247297 | 61.401144 | 0.072984 | 73.907709 | 0.019546
HAP2  | 11 | 44.866936 | 72.366936 | 0.000303 | 89.904726 | 0.000007
HAP3  | 12 | 42.379705 | 73.018003 | 0.000219 | 91.511840 | 0.000003

Table 4.6: Results of the model comparison using BIC and AICc, showing: model, number of free parameters (k), value of the log likelihood (LL), value of the corrected Akaike information criterion (AICc), Akaike weights (wAICc), value of the Bayesian information criterion (BIC) and posterior model probability according to the BIC (P(Model|D)BIC). See Table 4.5 for details on the haplotype models.

our data, the strongest belief value was assigned by both AIC and BIC to the model that included only the (traditionally most often studied) COMT Val158Met SNP (rs4680), while the second best model, which included an additional SNP located in the promoter region, received less than half of this value. This finding may help to explain why many genomic imaging studies found significant effects of the COMT Val158Met polymorphism, even without exploring further genetic variation in the COMT gene (Egan et al., 2001; Ho et al., 2005; Smolka et al., 2005). The initial assumption that additional information on genetic background will always be preferable over a simple genetic model has been refuted in this study.

Note that the more complex an underlying model is, the more data are needed to back it up. Because the noise level and the effect strength are unknown, there is no way of knowing a priori how much data is needed to support a model of a certain complexity. Therefore, it cannot be ruled out that for a replication study with a much larger sample size, the model comparison might favor a haplotype-based model over the individual SNP model. Thus, for each measured dataset, the model comparison procedures have to be carried out anew.

As mentioned before, haplotype-based models suffer from the uncertainty arising from phase inference. It might be a limitation of our study that we used only the haplotypes with the highest likelihood. If the candidate SNPs are in high LD, the uncertainty is low (Balding, 2006) and does not affect the analysis. If the LD is low, the amount of haplotype uncertainty could be directly incorporated into the regression analysis. However, it does not affect the proposed use of information criteria for model comparison.

Altogether, our study elucidates why COMT Val158Met was often successfully correlated with imaging data and suggests a way of finding the best model for genetic analysis. We want to emphasize that our finding does not imply that haplotype-based models are generally inferior to models based on individual SNPs. Haplotype models may be better for explaining COMT effects on cognitive performance but not on affective processing. We want to point out, rather, that techniques for model comparison

need to be employed which can establish whether the increase in likelihood potentially achieved with a more complex, haplotype-based model is substantial enough to warrant the increase in model complexity. Whenever haplotype models are employed in genetic studies, we suggest the use of the above technique based on information criteria for model comparison to identify the most informative model. Although it is tempting to use the model which fits the data best and therefore has the highest likelihood, this may not be the most suitable model for the particular study, genetic variants and phenotypic data.

4.4 Summary and Conclusions

In this chapter, it was shown how model selection techniques based on information criteria can be applied in genomic imaging to assess different models of genotype-phenotype association in a functional magnetic resonance imaging (fMRI) study. For this purpose, two model selection criteria, AIC and BIC, were applied to the multiple regression models resulting from a second level analysis of a specific BOLD contrast under different genetic hypotheses. The information criteria used in this analysis were chosen for their theoretical foundations in information theory (AIC) and Bayesian analysis (BIC). Whenever haplotype models are employed in genetic studies, the proposed technique based on information criteria for model comparison should be used to identify the most informative model. Although it is tempting to use the model which fits the data best and therefore has the highest likelihood, this may not be the most suitable model for the particular study, phenotypic variables and dataset.

Statistical theory is often concerned with the asymptotic behavior of an estimator in the hypothetical case of infinite sample size. If the estimator converges to the true value as the number of data examples goes to infinity, the estimator is called consistent. It can be shown (Burnham and Anderson, 2002) that BIC is consistent, while AIC is not. This means that if the generative model is contained in the model set, BIC will for infinite sample size always select it as the best model, whereas AIC has a nonzero probability of selecting a more complex model, which, however, will have zero KL divergence to the generative model (Burnham and Anderson, 2002). Note that the likelihood ratio test, eq.(4.1), is also not consistent. The underlying reason for this is that the test always has a probability of type I errors (i.e. a false positive rate) equal to the chosen significance level α, irrespective of the sample size, even if this goes to infinity.

Consistency is concerned with the asymptotic case of infinite sample size and requires that the generative model is in the model set. In reality, sample size is finite, and the generative model, which is a mathematical abstraction in any case, is not necessarily contained in the set. For finite n, the probability that a too simple model is chosen is smaller for AIC than for BIC. Both criteria have their advantages and disadvantages, and both have a different but theoretically meaningful justification. In general, the BIC will result in more parsimonious models than the AIC, which will tend to choose slightly more complex models. If the two criteria disagree on which model is best, or if several

models are assigned almost equal Akaike weights or posterior model probabilities, then more than one model should be considered adequate. Information criteria like AIC and BIC allow an easy comparison of different models. In comparison to likelihood ratio tests, they can be applied to non-nested models, and do not require significance tests with arbitrarily chosen significance levels. However, this does not justify a procedure in which all possible models are mechanistically enumerated and compared. Instead, the particular choice of candidate models must be based on the domain knowledge and expertise of the researcher and represent the way the researcher is looking at the data (Akaike, 1985; Fienberg, 1980). The model comparison conducted in this chapter meets this requirement, because it is based on a selection of models previously proposed in the literature.

5 Molecule Kernels for QSAR and Genotoxicity Prediction

5.1 QSAR

5.1.1 Introduction

The 3D structure of a molecule is closely related to its physical, chemical and biological properties. This is expressed in the similarity principle: "Similar structures have similar physicochemical properties and biological activities". In quantitative structure-activity relationship (QSAR) analysis the aim is to predict the biological activity, toxicity or mutagenicity of a drug. This is useful in drug discovery during the search for lead compounds, where the aim is to maximize the potency or selectiveness of a drug, while at the same time looking for a compound with good pharmacokinetic properties and minimum toxicity. These depend on the local and global electronic, hydrophobic, lipophilic and steric properties of the compound, which are implicitly determined by its 3D structure.

Traditionally, QSAR analysis starts with a representation of the molecular structure, from which a large number of descriptors are generated and concatenated into a descriptor vector. These descriptors replace the initial representation of the molecule. They explicitly encode some aspects of the information which is implicitly contained in the original structure. According to the 'dimensionality' they represent, the descriptors are categorized into different classes: 0D descriptors contain counts of entities like atoms, elements and bond types, 1D descriptors consist of path and walk counts, 2D descriptors describe the topology of the molecule and are based on the structural formula of a molecule, and 3D descriptors require the reconstruction of the 3D geometry of the compound. These include descriptors obtained by 3D-QSAR methods


using force-field calculations and physicochemical models. On the basis of this descriptor representation, a predictor is learned on the given training data, which assigns a regression value or a class label to a molecule. Since the number of descriptors is very large, often exceeding the available sample size, these predictors usually involve feature selection or feature construction (e.g., principal component analysis) to reduce the input dimensionality. The most popular choice in chemoinformatics is the partial least-squares method (PLS) (Wold et al., 1984, 2001), an embedded method for feature construction in regression tasks.

Recent advances in the field of statistical learning theory (Vapnik, 1998) have led to the development of kernel methods, yielding predictors with very good generalization performance working in high dimensional feature spaces. In general, kernels are functions which take two objects (data points, examples) as input and assign a scalar output value, which is interpreted as a measure of similarity between the objects. The values of the kernel for all pairs of examples in a given dataset are summarized into a matrix called the kernel matrix. Well-known examples of kernel methods are the support vector machines (Schölkopf and Smola, 2002) for classification and regression. Usually, these approaches are applied to vectorial data; however, in the past few years an increasing number of kernel techniques for structured data, like sequences, trees and graphs, have been developed (Bakhir et al., 2007). In QSAR analysis, support vector machines have been mainly applied to descriptor vectors, although recently approaches based on positive-definite graph kernels (Gärtner et al., 2003; Kashima et al., 2003; Ralaivola et al., 2005) have been suggested. These graph kernel methods define a similarity function between two molecules by considering them as graphs, in which atoms correspond to vertices and bonds to edges. However, the runtime complexity of these algorithms often grows quickly with the number of atoms in the molecule. The calculation of an all-subgraphs kernel, which calculates the number of all subgraph-isomorphisms, is practically infeasible (NP-hard) (Gärtner et al., 2003). Other graph kernel approaches count common walks in two graphs (product graph kernels (Gärtner et al., 2003)) or calculate the expectation of a kernel over all pairs of label sequences in two graphs using random walks (marginalized graph kernels (Kashima et al., 2003)). Nevertheless, the runtime complexity of these approaches still scales with O(n⁶), where n is the number of atoms in a molecule, making them impractical for larger-size molecules. Recently, computationally more efficient positive-definite graph kernels based on molecular fingerprints have been proposed (Ralaivola et al., 2005). These calculate the similarity of vectors of counts of all labeled paths up to a maximum length, derived by depth-first searches starting from each vertex of the molecular graph. However, all the above graph kernels make use of path or walk counts, like 1D descriptors, but do not consider the full information contained in the 3D molecular structures.

In this chapter a novel kernel method for QSAR analysis is proposed, which is not based on the construction of descriptor vectors, but directly evaluates the similarity between two molecular 3D structures. It makes use of the fact that information about the physicochemical properties is already implicitly contained in the 3D structure.
So instead of trying to explicitly extract these properties in the form of descriptor vectors, the

similarity between all pairs of molecular structures can directly be used for prediction. To this end, a new family of structural similarity measures called molecule kernels is introduced. In contrast to graph kernels, molecule kernels take both the topology and the 3D geometry of the molecules into account. However, the resulting kernel matrix is not positive definite, a property required by conventional kernel methods like support vector machines. Therefore, the recently proposed potential support vector machine (P-SVM) for dyadic data (Hochreiter and Obermayer, 2006) is used as predictor, which does not require positive definite kernel matrices. If trained on a molecule kernel matrix, the P-SVM implicitly encodes information about certain structural elements or substructures which are relevant for predicting the desired endpoint. Unlike methods based on structural libraries, where the presence or absence of elements from a predefined set of structures is encoded explicitly, in our approach the structural elements are encoded implicitly via the parameters of the predictor and the values of the molecule kernel matrix. Like other 3D QSAR approaches, the proposed method requires suitable 3D conformations, which are assumed to have been determined using geometry optimization techniques or molecular mechanics. However, a spatial prealignment of the compounds with respect to each other or any grid is not necessary.

Suitable kernels for molecules need to fulfill two criteria: First, they need to capture specific aspects of the 3D structure of two molecules, which are expressed in the form of a similarity function. Second, they need to be efficiently computable. The problem of maximizing a similarity function which depends on the spatial alignment of two compounds leads to a continuous optimization problem with many local optima. If gradient-based optimization techniques are employed, the solution will most likely correspond to a local optimum. A method based on randomized search heuristics has been proposed by Kearsley and Smith (1990); however, it depends on the choice of an alignment parameter and is not guaranteed to identify the global optimum. The molecule kernels proposed in this chapter transform the continuous optimization problem into a discrete optimization problem by considering the set of all at least locally matching alignments¹. The smallest unit describing a unique 3D alignment of two compounds consists of a pair of ordered bipods (V-shaped subfragments), one from each compound. This is illustrated in figure 5.1, where two molecules are aligned according to the alignment of two matching bipods. Molecule kernels make use of this fact by finding the global optimum in the space of all matching bipod alignments, a problem which can be solved with a runtime complexity of O(n⁴).

Moreover, while most other 3D QSAR methods assume that the activity of all compounds is mediated by the same interaction mechanism between ligand and receptor protein, the molecule kernel approach does not require this assumption. The reason for this lies in the fact that instead of using a global alignment of all molecules to a common scaffold, receptor, or grid, molecule kernels are based on the optimum pairwise alignments between molecules. The mutual similarities can involve different alignments, which can correspond to different active groups for each pair of molecules.

¹These terms are used here informally to give an intuition; their rigorous definition will be given in section 5.2.1.

(a) Two molecules with a pair of matching bipods (marked in dark gray) which will be used for molecular alignment.

(b) The same molecules after a translation of molecule 2 such that the central bipod atoms of the two bipods are superposed. (c) Molecule alignment corresponding to the matching bipod alignment obtained after the rotation which optimally aligns the bipod atoms in a least squares sense.

Figure 5.1: Illustration of the concept of a molecular alignment using bipod matching. For ease of visualization this is shown using a simple 2D example. All formal definitions and the details of the alignment procedure will be given in the methods section.

So different mechanisms can be modeled at once, as long as they are all sufficiently represented in the training data.

The predictive performance of the proposed method is compared on publicly available datasets to results from several other QSAR methods which were taken from the literature. In the following, these methods will be briefly reviewed. The technique of hologram QSAR (HQSAR) (Heritage and Lowis, 1999) divides each molecule into

a set of overlapping structural fragments and uses their frequencies as 2D descriptors. One widely used 3D-QSAR approach is comparative molecular field analysis (CoMFA) (Cramer et al., 1988), which is based on the assumption that similar steric and electrostatic fields of molecules lead to similar activity. CoMFA requires the spatial superimposition of the molecular structures, which needs expert knowledge, is time-consuming and might introduce user bias. After alignment, a 3D grid is generated around the molecule and local potential fields are calculated at the grid points. Steric fields are modeled by the Lennard-Jones potential, and electrostatic fields are computed using the Coulomb potential of the partial atomic charges. Then the interaction energies between certain probes and the molecule are calculated at the grid points and used as descriptor vectors in order to predict the biological activity. A problem with this approach lies in the abrupt change in potential at the Van-der-Waals surface of the molecule, which leads to a critical dependency of the CoMFA results on the grid spacing and the relative orientation of molecules and grid. An alternative technique which is less sensitive to these issues is comparative molecular similarity index analysis (CoMSIA) (Klebe et al., 1994). It makes use of a Gaussian approximation of the force fields which is not subject to sudden potential changes. In addition to steric and electrostatic fields, hydrophobic fields and donor/acceptor fields are sometimes also modeled. Another 3D QSAR approach is the GRIND method (Pastor et al., 2000), where GRid INdependent Descriptors encode the spatial distribution of molecular interaction fields (MIFs) based on an autocorrelation function. An extension of this, the anchor-GRIND method (Fontaine et al., 2005), allows the inclusion of a priori chemical and biological knowledge. The user defines a specific position of the molecular structure (the "anchor point"), which is used as reference in the comparison of the MIFs of the compounds. This allows a better description of the compounds in the vicinity of a common substructure. It requires less human supervision than the previous 3D-QSAR methods, but is only applicable for a set of compounds sharing a common scaffold. Finally, QSAR by eigenvalue analysis (EVA) (Ferguson et al., 1997) uses a descriptor based on molecular vibrational spectra. It has the advantage of being invariant to the spatial alignment of the molecule.

5.1.2 Molecular and Structural Formulas

The molecular formula indicates the presence of a certain element in a molecule by the elemental symbol and provides information about the respective number of atoms via a subscript, e.g. H2O for water. Molecules with the same molecular formula but a different topological or spatial arrangement are called isomers. Isomers fall into several categories: In structural isomers, the functional groups lie at different topological positions in the structure or are split up into different subgroups, or chains have different forms of branching. In contrast to these, stereoisomers possess the same topological structure, but some atoms or functional groups have different spatial positions. One subclass are the non-superimposable mirror-symmetric enantiomers, which often have different chemical properties when interacting with other enantiomers.


Figure 5.2: The structural formulas of cis- and trans-decalinediol.

In pharmacology this is important, since enantiomers often occur in living beings, and sometimes one enantiomer has a desirable effect, while its mirror-symmetric counterpart is toxic. The other subclass of stereoisomers is called diastereomers. A subgroup of these are the conformational isomers, which differ by rotation around a single bond. Different conformers can convert into each other by rotation without breaking chemical bonds. Some conformers will have lower energy than others, due to Coulomb, Van-der-Waals or steric interactions with neighboring atoms. In situations where there is a double bond, or sometimes in the presence of ring structures, such a rotation cannot actually be conducted. Then the isomerism is called cis-trans isomerism. Here, 'cis' refers to the case where the groups are oriented in the same direction, and 'trans' to the one where they are oriented in opposite directions.

Structural formulas express information about the topological arrangement of the atoms within a molecule and the nature of the bonds (single, double or triple). For organic compounds, a kind of structural formula often called a skeletal formula is usually used as a graphical representation of the topological structure, where each vertex denotes an atom and each line a bond. The element of an atom is denoted by its elemental symbol, but the elemental symbols of carbon atoms and of hydrogen atoms bonded to carbon are left out. Instead, each unlabeled vertex represents a carbon atom, and the number of hydrogen atoms attached to a particular carbon atom is equal to four minus the number of bonds drawn to that carbon, since carbon has four covalent bonds. Skeletal formulas cannot distinguish between different conformers. Up and down bond symbols (wedges) can be used to distinguish cis- and trans-configurations. An example of such a structural formula is given in figure 5.2. While this example shows that structural formulas can visualize certain 3D topological features, the actual molecular structure of a compound is 3D and its geometry cannot be fully represented by structural formulas. Instead, the relative 3D coordinates of all the atoms need to be given.

Linear notations allow a compact representation of chemical structures by alphanumerical strings. An important example is the SMILES notation, which was developed in 1986 by Daylight Chemical Information Systems and is often used as a convenient way of entering chemical structures into computer systems. The topology of a molecule can be represented as a graph, whose nodes denote atoms and whose edges

denote chemical bonds, which has led to the applicability of graph-theoretic methods to computational chemistry. An adjacency matrix can be used to describe the topology. A bond matrix additionally includes information on the bond types. In computer systems, molecules are often represented using a list of all atoms and a connection table, which is a sparse representation of the bond matrix. However, the coding of the molecular graph as a matrix does not account for stereochemical information. In order to represent the full 3D molecular structure, file formats like SD files or PDB files are usually used, which include the 3D position of each atom, and sometimes further information like partial charges.
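As a minimal illustration of these matrix representations (the ethanol example and all names are hypothetical, not taken from the thesis):

    import numpy as np

    # Hydrogen-suppressed ethanol: atoms C-C-O, single bonds only.
    atoms = ["C", "C", "O"]
    bonds = [(0, 1, 1), (1, 2, 1)]               # (atom i, atom j, bond type)

    n = len(atoms)
    adjacency = np.zeros((n, n), dtype=int)      # 1 where any bond exists
    bond_matrix = np.zeros((n, n), dtype=int)    # bond type instead of 1
    for i, j, t in bonds:
        adjacency[i, j] = adjacency[j, i] = 1
        bond_matrix[i, j] = bond_matrix[j, i] = t

    # A connection table is a sparse representation of the bond matrix:
    connection_table = [(i, j, t) for i, j, t in bonds]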

5.1.3 Geometry Optimization

The 3D coordinates of the atoms can be calculated by determining low-energy conformations using a quantum mechanical approach, where the Schrödinger equation is solved under some approximations, or classical molecular mechanics, where the atoms are considered mass points between which several forces are acting. These give rise to energy terms for bond lengths, bond angles and torsion angles, as well as Van-der-Waals and Coulomb energies. This energy function can be minimized using procedures like conjugate gradient descent. Coordinates taken from crystallographic databases can be used as initialization. Usually, there is more than one energy minimum, therefore several stable conformations are possible. The space of possible conformations can either be explored by systematic conformation analysis or by random search using Monte Carlo methods or simulated annealing. There are several (commercial) computer programs available for conducting geometry optimization, starting from atom type and connectivity information only, e.g. CORINA. Often, however, only one conformation is given as output. In the context of QSAR analysis, it is not quite clear which conformation the molecule will take in vivo at the place where the biological activity actually occurs. If several possible conformations are available, the QSAR analysis should make use of them. However, if only one conformation is given for each molecule, at least this conformation should have been obtained by the same software under the same conditions.
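A toy sketch of such a molecular-mechanics minimization, reduced to harmonic bond-length terms and minimized with conjugate gradients via SciPy; real force fields include the angle, torsion, Van-der-Waals and Coulomb terms mentioned above, and all values here are illustrative:

    import numpy as np
    from scipy.optimize import minimize

    bonds = [(0, 1), (1, 2)]          # hypothetical 3-atom chain
    r0, k_bond = 1.5, 300.0           # equilibrium length (angstrom), force constant

    def energy(flat_coords):
        """Harmonic bond-stretch energy only -- a stand-in for a full force field."""
        x = flat_coords.reshape(-1, 3)
        return sum(0.5 * k_bond * (np.linalg.norm(x[i] - x[j]) - r0) ** 2
                   for i, j in bonds)

    x_init = np.random.default_rng(0).normal(size=(3, 3))    # rough initial geometry
    result = minimize(energy, x_init.ravel(), method="CG")   # conjugate gradient descent
    x_opt = result.x.reshape(-1, 3)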

5.2 Molecule Kernels

5.2.1 Definition

In the following, the mathematical formalism necessary for defining molecule kernels is introduced.

Definition 1 (Molecule). A molecule is an attributed undirected graph M = (A, B, ε, τ, ξ), where

• A = {A_1, ..., A_N} is a set of vertices called atoms

• B ⊆ A^[2] = {{A_i, A_j} | A_i, A_j ∈ A, i ≠ j} is a set of edges corresponding to the bonds between the atoms

• ε : A → E is a mapping of the atoms to a set of labels E corresponding to the chemical elements

• τ : B → T is a mapping of the bonds to a set of labels T corresponding to the bond types

• ξ : A → R³ is a mapping of the atoms to their 3D coordinates

The edges of the graph, the bonds, have labels corresponding to the bond types, which can be represented by integer numbers such that T = {1, 2, 3, 4} (1: a single bond, 2: a double bond, 3: a triple bond, 4: an aromatic bond). The vertices of the graph, the atoms, have labels corresponding to the different chemical elements. Therefore we can choose the set E to consist of all elemental symbols, E = {H, He, Li, Be, B, C, N, O, ...}. Moreover, the atoms have real-valued, three-dimensional attribute vectors, corresponding to the three-dimensional atomic coordinates in units of angstroms. Note that the rotation and translation of the whole molecule with respect to the global coordinate system is arbitrary. A molecule contains both topological information, which is determined by the graph structure, as well as geometrical information, which is determined by the coordinate mapping ξ.

In a QSAR learning task, we are given a dataset D = {(M_p, t_p), p = 1,...,m} consisting of pairs of molecules and target values. The target values t_p can be either real-valued or binary class labels (+1, −1). To distinguish between different molecules from the dataset, the constituents of molecule M_p are denoted by superscript p indices, e.g. the n-th atom in molecule M_p is denoted by A^p_n, while its coordinates are denoted by ξ^p(A^p_n). If we are given the coordinates of two molecules M_p and M_q in the same coordinate system, a molecular alignment is a rigid body transformation (involving only translations and rotations) of the coordinates of M_q.
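As a minimal sketch, Definition 1 maps naturally onto a small data structure; this is only an illustration under the above conventions, not code from the thesis:

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class Molecule:
        """Attributed undirected graph M = (A, B, eps, tau, xi) of Definition 1."""
        elements: list        # eps: elemental symbol per atom, e.g. ["C", "O"]
        bonds: list           # B: edges as pairs of atom indices
        bond_types: list      # tau: 1 single, 2 double, 3 triple, 4 aromatic
        coords: np.ndarray    # xi: (N, 3) array of 3D positions in angstroms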

Definition 2 (Bipod). A bipod B^p_ijk is an ordered triplet of atoms from the same molecule M_p connected by two bonds,

B^p_ijk = (A^p_i, A^p_j, A^p_k), with {A^p_i, A^p_j} ∈ B and {A^p_j, A^p_k} ∈ B,        (5.1)

for which the vectors ξ^p(A^p_i) − ξ^p(A^p_j) and ξ^p(A^p_k) − ξ^p(A^p_j) are not colinear². Thus a bipod is a V-shaped subfragment of a molecule; see figure 5.1(a) for an example. Note that, in general, B^p_ijk ≠ B^p_kji, because the triplets are ordered. The middle atom A^p_j in a bipod B^p_ijk will be referred to as the central bipod atom.

²The requirement of non-colinearity makes sure that a pair of matching bipods can later be used to define a unique spatial alignment (otherwise the alignment would allow arbitrary rotations around the colinear bipod axes).
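Enumerating the bipods of a molecule is straightforward; the following minimal sketch does this for the hypothetical Molecule structure from the earlier sketch and is an illustration, not the original implementation:

    import numpy as np
    from itertools import permutations

    def bipods(mol):
        """Enumerate all ordered bipods (i, j, k) of a Molecule (Definition 2):
        j is the central atom, {i, j} and {j, k} are bonds, and the two bond
        vectors must not be colinear."""
        neighbors = {a: [] for a in range(len(mol.elements))}
        for i, j in mol.bonds:
            neighbors[i].append(j)
            neighbors[j].append(i)
        for j, nbrs in neighbors.items():
            for i, k in permutations(nbrs, 2):        # ordered pairs of neighbors
                v1 = mol.coords[i] - mol.coords[j]
                v2 = mol.coords[k] - mol.coords[j]
                if np.linalg.norm(np.cross(v1, v2)) > 1e-8:   # non-colinearity check
                    yield (i, j, k)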

Definition 3 (Matching Bipod Alignment). Let θ be a constant. Assume there exists a pair of bipods (B^p_ijk, B^q_rst) belonging to molecules M_p and M_q such that ε(A^p_i) = ε(A^q_r), ε(A^p_j) = ε(A^q_s) and ε(A^p_k) = ε(A^q_t). Moreover, assume there is a transformation F : R³ → R³, x ↦ x̃ = RT(x), such that

(ξ^p(A^p_i) − Fξ^q(A^q_r))² ≤ θ
(ξ^p(A^p_j) − Fξ^q(A^q_s))² ≤ θ        (5.2)
(ξ^p(A^p_k) − Fξ^q(A^q_t))² ≤ θ,

where T : R³ → R³, x ↦ x̃ is a translation which superposes the central bipod atoms, i.e.

Tξ^q(A^q_s) = ξ^p(A^p_j),        (5.3)

and R : R³ → R³, x ↦ x̃ is the rotation around the superposed central bipod atoms which optimally aligns the bipods in a least-squares sense. Then the transformation F is called a matching bipod alignment. If applied to the coordinates of all atoms in molecule M_q, it uniquely specifies a molecular alignment.

Thus, for a matching bipod alignment the elements of all corresponding atoms in the bipods have to be identical, and if the bipods are aligned 'optimally' (using a translation and a rotation), the squared Euclidean distances between corresponding atoms must lie below a threshold θ. The introduction of a threshold accounts for small variability in the geometry optimization and numerical inaccuracies. An illustrative example of a matching bipod alignment is given in Figure 5.1.

Given two ordered bipods, the least-squares 3D rotation matrix is uniquely determined. The following proposition shows how it can be calculated. Let us assume that the two central bipod atoms have already been aligned by a simple translation T. To simplify the equations we move to a new coordinate system, which has its origin at the position of the aligned central bipod atoms. Let us denote the matrix of the coordinate vectors of the three atoms in bipod B^q_rst in this new system by X, a 3 × 3 matrix, where the rows denote the coordinates in 3D space and the columns the 3 atoms. Equivalently, let Y represent the coordinates of bipod B^p_ijk. Further, let UWV^T be the singular value decomposition (SVD) of YX^T, where W is a 3 × 3 diagonal matrix containing the (non-negative) singular values of YX^T, and where U and V are 3 × 3 orthogonal matrices.

Proposition 1 (Optimal Rotation Matrix). The rotation matrix R which corresponds to the optimal alignment of two given bipods B^p_ijk and B^q_rst in a least-squares sense can be calculated by

R = U diag(1, 1, det(UV^T)) V^T.        (5.4)

Proof. The goal is to find the rotation matrix that minimizes the average squared distances between corresponding atoms in the two bipods. This means we would like to minimize the following cost function:

S = Tr((RX − Y)^T (RX − Y))
  = Tr(X^T R^T R X − X^T R^T Y − Y^T R X + Y^T Y)
  = Tr(X^T X − 2 X^T R^T Y + Y^T Y),        (5.5)

where R^T R = I was used. As a rotation matrix, R must be a special orthogonal matrix, which has the properties

R^T R = I = R^{-1} R = R R^T        (5.6)
det R = +1.        (5.7)

Eq.(5.6) gives rise to the set of constraints

R R^T − I = 0,        (5.8)

where 0 is a 3 × 3 matrix of zeros. The cost function eq.(5.5) together with the constraints eq.(5.8) yields the following Lagrangian,

L = Tr(X^T X − 2 X^T R^T Y + Y^T Y) + Tr(Λ(R R^T − I)),        (5.9)

where Λ is the matrix of (unknown) Lagrange multipliers. Note that Λ must be symmetric, since R R^T − I is symmetric. The Lagrangian L should be minimized with respect to the elements of R:

∂L/∂R = −2 Y X^T + (Λ + Λ^T) R = −2 Y X^T + 2 Λ R = 0,        (5.10)

where we made use of the fact that Λ is symmetric. It follows that

Λ R = Y X^T.        (5.11)

This can be solved by singular value decomposition,

Y X^T = U W V^T,        (5.12)

where W is a 3 × 3 diagonal matrix containing the (non-negative) singular values of Y X^T, and where U and V are 3 × 3 orthogonal matrices. Now we can determine Λ from eq.(5.11) by using the fact that R must be orthogonal:

(Λ R)(Λ R)^T = (Y X^T)(Y X^T)^T
⇒ Λ R R^T Λ^T = (U W V^T)(U W V^T)^T
⇒ Λ Λ^T = U W V^T V W U^T
⇒ Λ² = U W² U^T
⇒ Λ = U W U^T.        (5.13)

Inserting this result in eq.(5.11) yields

Λ R = Y X^T
⇒ U W U^T R = U W V^T
⇒ U^T U W U^T R = U^T U W V^T
⇒ W^{-1} W U^T R = W^{-1} W V^T
⇒ U^T R = V^T
⇒ U U^T R = U V^T
⇒ R = U V^T.        (5.14)

With eq.(5.14) we have obtained an expression for the optimal rotation matrix R. However, so far the constraint eq.(5.7), det R = +1, has not been used. Therefore the orthogonal solution could still describe a reflection (for det R = −1). In order to make sure that in fact a rotation matrix is obtained, the following modification should be used (Challis, 1995),

R = U diag(1, 1, det(UV^T)) V^T,        (5.15)

which ensures that constraint eq.(5.7) is fulfilled and R is indeed a rotation matrix.
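This construction of eq.(5.15) is known in the literature as the Kabsch algorithm and can be implemented directly with an SVD routine. A minimal NumPy sketch (variable names are illustrative):

    import numpy as np

    def optimal_rotation(X, Y):
        """Least-squares rotation R minimizing ||RX - Y||_F, following eq.(5.15).
        Columns of X and Y hold the coordinates of the three bipod atoms after
        the central bipod atoms have been translated to the origin."""
        U, W, Vt = np.linalg.svd(Y @ X.T)                 # Y X^T = U W V^T
        D = np.diag([1.0, 1.0, np.linalg.det(U @ Vt)])    # enforce det R = +1
        return U @ D @ Vt

    # Sanity check: recover a known rotation about the z-axis.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(3, 3))
    c, s = np.cos(0.3), np.sin(0.3)
    R_true = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    R_est = optimal_rotation(X, R_true @ X)               # approximately R_true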

Definition 4 (Set of all Matching Bipod Alignments). The set of all pairs of matching bipods from two molecules M_p and M_q defines a set of molecular alignments, which is called the set of all matching bipod alignments Ω_pq.

Ω_pq is a finite subset of the (infinite) set of all possible molecular alignments. It is special in that bipod alignments correspond to the most general locally matching alignments which define a molecular alignment. Larger structures than bipods (like tripods, rings, or certain subfragments) would also allow a unique 3D orientation, but the possible number of alignments is a much smaller subset of Ω_pq. Smaller structures than bipods (i.e., pairs of atoms connected by bonds) do not allow a unique alignment, as they do not fix the relative rotation of the molecules around such a bond.

Definition 5 (Molecule Kernels). Let w_i ∈ Ω_pq denote the i-th bipod alignment from the set of all possible matching bipod alignments between two molecules M_p and M_q. Consider a similarity function s(M_p, M_q, w_i), which assigns a similarity value to the molecules at a specific bipod alignment. Then a molecule kernel between the two molecules can be calculated as

k(M_p, M_q) = max_{w_i ∈ Ω_pq} s(M_p, M_q, w_i).        (5.16)

A molecule kernel corresponds to the maximum similarity value under all possible matching bipod alignments. Different molecule kernels can be defined, based on different choices of the similarity function.

In this chapter, two different molecule kernels (MK1 and MK2) are investigated, which will be defined in the following. Assume we are given two aligned molecules M_p and M_q, whose alignment is specified by w_h ∈ Ω_pq. Let N_p be the number of atoms in M_p, and N_q be the number of atoms in M_q. Let us further assume that the atomic coordinates of the aligned molecules are given in the common reference frame as x^p_i, i = 1,...,N_p and x^q_j, j = 1,...,N_q. Let N be the set of matching atoms

N = { A^p_i : min_{j: ε(A^q_j) = ε(A^p_i)} (x^p_i − x^q_j)² < θ, i = 1,...,N_p, j = 1,...,N_q }.        (5.17)

Let the number of matching atoms N^h_pq be defined as the cardinality of the set N under the matching bipod alignment w_h. Using this number as the similarity measure, we obtain a kernel which we denote by MK1:

Definition 6 (Unnormalized Atomwise Correspondence Kernel (MK1)).

k(M_p, M_q) = max_{w_h ∈ Ω_pq} N^h_pq.        (5.18)

This kernel can also be normalized to the range [0, 1] by using the Jaccard index as the similarity function. The Jaccard index is a similarity measure for two sets, in which the size of the intersection is divided by the size of the union of the sets. This normalizes the atomwise correspondence, taking the size of both molecules into account, and yields a second molecule kernel, which we denote by MK2:

Definition 7 (Normalized Atomwise Correspondence Kernel (MK2)).

k(M_p, M_q) = max_{w_h ∈ Ω_pq} N^h_pq / (N_p + N_q − N^h_pq).        (5.19)

Note that these kernel functions are symmetric with respect to the interchanging of the two molecules. The above definitions of atomwise correspondence kernels allow for some numerical inaccuracies and a certain variability in the atoms' positions via the threshold θ > 0. This is the same hyperparameter which was already used in the spatial matching of the bipod atoms. It should be small enough that one atom of molecule M_q which matches one atom of molecule M_p matches only that atom and no others. In our experiments, θ was always fixed at θ = 0.25, which is small enough to ensure a unique assignment of corresponding atoms.

The similarity function determines which general aspects of the molecular representation (e.g., atom types, bond types, chemophysical properties) are modeled by the kernel. The above-defined atomwise correspondence kernels are rather simple examples of molecule kernels, since they take only the spatial superposition of atoms from the same element into account. They do not account for the spatial match of atoms belonging to different elements, nor do the bond types enter the analysis. More elaborate types of molecule kernels can be constructed in the above framework

by using similarity functions which take such information into account. In this context it should be mentioned that the adaptation to the chemical space of a particular class of compounds and a particular endpoint is not handled by a specific choice of molecule kernel, but by the learning machine used for prediction (see section 5.2.3).
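To make MK1 and MK2 concrete, the following sketch evaluates eqs.(5.18) and (5.19) given a generator of candidate rigid transforms (R, t) of M_q, such as the bipod alignments enumerated earlier combined with the optimal rotation of eq.(5.15). The function names are hypothetical and the code is an illustration under these assumptions, not the original implementation.

    import numpy as np

    THETA = 0.25  # squared-distance threshold used throughout this chapter

    def matching_atoms(xp, xq, elems_p, elems_q):
        """Cardinality of the set N of eq.(5.17): atoms of M_p that have an
        atom of the same element in M_q within sqrt(THETA) angstroms."""
        count = 0
        for i, e in enumerate(elems_p):
            d2 = [np.sum((xp[i] - xq[j]) ** 2)
                  for j, f in enumerate(elems_q) if f == e]
            if d2 and min(d2) < THETA:
                count += 1
        return count

    def molecule_kernel(mp, mq, alignments, normalized=False):
        """MK1 (eq.(5.18)) or MK2 (eq.(5.19)): maximum over all matching
        bipod alignments, each given as a rigid transform (R, t) of M_q."""
        best = 0.0
        for R, t in alignments:
            xq = mq.coords @ R.T + t                      # apply the transform
            n = matching_atoms(mp.coords, xq, mp.elements, mq.elements)
            if normalized:
                n = n / (len(mp.elements) + len(mq.elements) - n)  # Jaccard index
            best = max(best, n)
        return best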

5.2.2 Properties

Molecule kernels are symmetric,

k(B, A) = k(A, B),        (5.20)

and indefinite. The normalized atomwise correspondence kernel is bounded to lie between zero and one. Molecule kernels are guaranteed to find the largest similarity value in which at least two bipods match, since the alignments of all matching bipods are checked.

The runtime complexity of the molecule kernel on two molecules of the same size n depends on the runtime complexity of the calculation of the similarity function in eq.(5.16). As an example, for the atomwise correspondence molecule kernels the runtime complexity scales with O(n⁴). This can be seen as follows: Assume the worst case, that all atoms in both molecules are from the same element, and that each atom has a degree of d (i.e. d topological neighbors). Then each of the n atoms from molecule M_q can be centered on each of the n atoms of molecule M_p, yielding n² combinations. For each of the atoms in such a pair there are d(d − 1) combinations of neighboring atoms which can be used to form ordered bipods. Therefore the number of all possible pairs of matching bipods is n²d²(d − 1)², which is O(n²). The search for the nearest neighbors requires a further n² operations per alignment, leading to a total runtime complexity of O(n⁴). This is better than the time complexity of all-subgraph kernels (Gärtner et al., 2003), whose calculation is NP-hard, and of product (Gärtner et al., 2003) and marginalized (Kashima et al., 2003) graph kernels, whose runtime complexity scales with n⁶. In order to illustrate how the runtime complexity of O(n⁴) compares to O(n⁶), n⁴ and n⁶ are plotted as functions of molecule size n in Figure 5.3. Note that in real datasets, the two molecules are usually of different size and contain more than one element. Thus, the total number of matching bipods is usually quite small, and the nearest neighbors only need to be evaluated for atoms from the same element. This allows the efficient calculation of this molecule kernel, even for molecules containing several hundreds of atoms.

5.2.3 Model Building and Prediction

Kernel matrices resulting from molecule kernels are not necessarily positive semidefinite, and even if they are on the training set, they might not be on the test set. Therefore, standard kernel methods requiring positive-semidefinite kernel matrices cannot be employed for prediction. However, for the recently proposed potential support vector machine (P-SVM) (Hochreiter and Obermayer, 2006) this restriction does not hold.


Figure 5.3: Runtime complexity as a function of molecule size. The functions n⁶ and n⁴, corresponding to the order of the runtime complexity of random walk graph kernels and molecule kernels, respectively, are plotted for growing molecule size n.

The mathematical reason for this is that its (dual) optimization problem only depends on the kernel matrix K via K^T K. Therefore, the P-SVM can handle arbitrary kernel matrices, which do not have to be square or positive semidefinite. It can be used for both classification and regression tasks. For the mathematical formulation of the P-SVM see (Hochreiter and Obermayer, 2006). The P-SVM objective function is a convex optimization problem, therefore a global solution exists, which can be found using an efficient sequential minimal optimization (SMO) algorithm³. See (Knebel et al., 2008) for details. The result of the optimization is a set of so-called Lagrange parameters α_j, one for each training set example, which provide a measure of how individual molecules in the training set affect the prediction. The P-SVM usually obtains a sparse solution, which means that many of these α_j values will be zero. Each nonzero α_j corresponds to a molecule, which we will call a 'support molecule'. The sign of the corresponding α_j serves as a class indicator (for classification) or shows whether the respective molecule is associated with an increase or decrease in activity (for regression). Its absolute value indicates how relevant a particular molecule is for the prediction (Hochreiter and Obermayer, 2006).

The process of model building and prediction using molecule kernels and the P-SVM is illustrated in figure 5.4. First, the molecule kernel matrix K on the training set is calculated by evaluating the molecule kernel k(M_p, M_q) for all pairs of compounds

³The P-SVM software is available under the GNU General Public License from the Neural Information Processing Group at the Berlin Institute of Technology (http://ni.cs.tu-berlin.de/software/psvm/index.html).


Figure 5.4: Process of model building and prediction using molecule kernels

M_p and M_q from the training set. For a training set containing m compounds, this corresponds to an m × m symmetric matrix with ones on the diagonal, which requires the calculation of m · (m − 1)/2 scalar kernel values. For the calculation of the kernel matrix, only the molecular structures and not the activity values (or class labels) are needed. In the next step, a learning machine is trained using both the calculated kernel matrix K and the activity values (or class labels) in order to build a model. The P-SVM is used as the learning machine, which requires the selection of two hyperparameters (C and ε). This is done by minimizing the leave-one-out cross-validation (LOOCV) prediction error on the training set over a discrete grid of hyperparameter values. Then the P-SVM is trained at the optimal hyperparameters using the full training set, which yields the final model. The prediction function for a molecule M is specified via the set of α values and an offset b. It takes the form

f(M) = Σ_{j=1}^{m} α_j k(M_j, M) + b        (5.21)

for regression and

f(M) = sign( Σ_{j=1}^{m} α_j k(M_j, M) + b )        (5.22)

for classification. Note that only the molecules M_j corresponding to nonzero α_j, the support molecules, are needed for prediction.

This model can then be used for prediction in the following way: First, the kernel matrix K_pr is calculated between the set of m_sv support molecules and the set of m_pr molecules for which we wish to predict the activity (or the class label). This is done by evaluating the molecule kernel for all pairs involving a member from each of these sets. The calculation of the m_sv × m_pr kernel matrix K_pr thus requires m_sv · m_pr kernel evaluations. In a regression setting, the P-SVM then yields the m_pr unknown activity values via eq.(5.21). In a classification setting, predictions of class labels are obtained via eq.(5.22).

While the choice of kernel influences which general properties of molecules are considered for evaluating the similarity, the adaptation to the chemical space spanned by the compounds in the training set and to a particular endpoint is handled by the P-SVM. Intuitively, this works as follows: By generating a linear combination of the kernel values of all support molecules with the respective test molecule, the presence or absence of certain 3D substructures of the support molecules is used to assign a regression value or class label. However, this is done not explicitly, but implicitly by the support vector machine based on the calculated kernel matrix.
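Under the same assumptions as the sketches above, the prediction step of eqs.(5.21) and (5.22) reduces to a few lines; alpha and b are assumed to have been obtained from the trained P-SVM, and only the support molecules enter the sum:

    import numpy as np

    def psvm_predict(test_mols, support_mols, alpha, b, kernel, classify=False):
        """Evaluate eqs.(5.21)/(5.22): a linear combination of molecule-kernel
        similarities between each test molecule and the support molecules."""
        K_pr = np.array([[kernel(s, m) for s in support_mols]
                         for m in test_mols])    # m_pr x m_sv kernel matrix
        f = K_pr @ alpha + b                     # regression value, eq.(5.21)
        return np.sign(f) if classify else f     # class label via eq.(5.22)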

5.3 Explanatory QSAR Models from Molecule Kernels

Descriptor-based QSAR methods encode the molecular properties via a large number of rather diverse descriptor variables, which usually involve a high level of abstraction from the original structure. Usually, these descriptor vectors are embedded into a Euclidean space in which the QSAR models are built. Since the descriptors themselves are very heterogeneous, it is not clear how the distances in this space can be interpreted. During the model building process, the descriptors are combined in a way which obscures the interpretation of the models even further. For example, if the partial least squares method (Wold et al., 1984, 2001) is used, the prediction is based on a linear combination of the original descriptors. In general, models based on descriptor vectors allow predictions to be made, but are difficult to interpret and do not give insight into the mechanisms underlying the activity. In contrast to this, the prediction function of the molecule kernel method is expressed as a linear combination of structural similarities to a set of support molecules. These structural similarities can be used together with the corresponding Lagrange multipliers to gain insight into the structural properties which influence the activity. In this section, a method for building an explanatory QSAR model is suggested, which is based on the molecule kernel method. It makes it possible to calculate the contribution of each single atom to the activity predicted for a molecule. Moreover, a method for 3D visualization of these contributions is suggested.

5.3.1 Building an Explanatory Model

During the calculation of the molecule kernels defined in section 5.2.1, the optimal matching bipod alignment w_opt ∈ Ω is determined as the one in which the number

of matching atoms, N^h_{pq}, is maximized. The corresponding set of matching atoms from both molecules is given according to eq. (5.17) as

    \mathcal{N}(w_{opt}) = \Big\{ A_i^p \,:\, \min_{j,\;\epsilon(A_j^q) = \epsilon(A_i^p)} \big(x_i^p - x_j^q\big)^2 < \theta, \quad i = 1,\dots,N_p, \; j = 1,\dots,N_q \Big\}        (5.23)

The cardinality of this set is used in the calculation of the molecule kernel k(M_p, M_q). However, the set can also be used in the construction of the explanatory model, as will be explained in the following. Consider a molecule M with N atoms. Assume that the set \mathcal{N}(w_opt) in eq. (5.23) was constructed for every combination of molecule M with one of the support molecules M_p, p = 1, ..., m_sv, during the calculation of the corresponding kernel values. These sets are now used to construct, for each support molecule, a binary vector m_p of length N, in which the matching atoms are assigned the value 1 and the non-matching atoms the value 0. The resulting vectors {m_p}, p = 1, ..., m_sv, can be used together with the values of the Lagrange multipliers {α_p}, p = 1, ..., m_sv, to construct an influence vector s, which reflects the contribution of the individual atoms in M to the prediction. In the case of regression analysis, this is the influence on the regression value; for classification, it is the influence on the value of the prediction function before the application of the sign function. In order to construct the vector s, it has to be taken into account that the prediction function of the P-SVM is actually optimized in a rescaled space, in which each row of the kernel matrix on the training set has zero mean and unit variance. This rescaling is always possible and leads to several simplifications in the derivation of the P-SVM. However, one has to keep in mind that the Lagrange multipliers returned by the P-SVM were calculated with respect to the standardized kernel matrix, whereas for the calculation of the influence vector s, the regression parameters with respect to the original kernel matrix are needed. Therefore, the regression parameters α̃_j and b̃ obtained by the P-SVM in the form of Lagrange multipliers and offset must be transformed into the space of the original kernel matrix, i.e. into the parameters of eqs. (5.21) and (5.22). Let

    \mu_p = \frac{1}{m} \sum_{q=1}^{m} k(M_p, M_q)                                    (5.24)

and

    \sigma_p = \sqrt{ \frac{1}{m} \sum_{q=1}^{m} \big( k(M_p, M_q) - \mu_p \big)^2 }, (5.25)

be the mean and standard deviation of each column p of the kernel matrix.

Then the prediction function becomes

    f(M) = \sum_{p=1}^{m} \tilde{\alpha}_p \, \frac{k(M_p, M) - \mu_p}{\sigma_p} + \tilde{b}                                                                        (5.26)

         = \sum_{p=1}^{m} \underbrace{\frac{\tilde{\alpha}_p}{\sigma_p}}_{=:\,\alpha_p} k(M_p, M) + \underbrace{\Big(\tilde{b} - \sum_{p=1}^{m} \frac{\tilde{\alpha}_p}{\sigma_p}\,\mu_p\Big)}_{=:\,b}                            (5.27)

for regression and

    f(M) = \mathrm{sign}\Big( \sum_{p=1}^{m} \tilde{\alpha}_p \, \frac{k(M_p, M) - \mu_p}{\sigma_p} + \tilde{b} \Big)                                               (5.28)

m m  α˜p α˜p  = sign k(Mp, M) + − µp + ˜b (5.29)  σp  σp  p=1 p=1  X X   αp:=    b:=   |{z}  for classification. | {z } Thus the values of αj in eqs.(5.21) and (5.22) can be obtained from the values of α˜j in eqs.(5.26) and (5.28) through division by σj,

    \alpha_p = \frac{\tilde{\alpha}_p}{\sigma_p}.                                     (5.30)

Using this, the influence vector s, which reflects the individual contribution of each atom in M to the predicted value, can be calculated. For the kernel MK1, it is given by

    s^{(1)} = \sum_{p=1}^{m} \frac{\tilde{\alpha}_p}{\sigma_p} \, \mathbf{m}_p.       (5.31)

For the kernel MK2, the normalization by the number of atoms of both molecules additionally has to be taken into account, which yields

    s^{(2)} = \sum_{p=1}^{m} \frac{\tilde{\alpha}_p}{\sigma_p} \, \frac{\mathbf{m}_p}{N_p + N - k_1(M_p, M)},                                                       (5.32)

where N_p is the number of atoms in support molecule M_p and k_1 is the molecule kernel MK1. By summing the components of the influence vector s and adding an offset which does not depend on the molecule M, the prediction for the activity is now obtained as

    f(M) = \sum_{j=1}^{N} s_j^{(i)} + \Big( \tilde{b} - \sum_{p=1}^{m} \frac{\mu_p \tilde{\alpha}_p}{\sigma_p} \Big), \qquad i = 1, 2                               (5.33)

for regression and as

    f(M) = \mathrm{sign}\Big( \sum_{j=1}^{N} s_j^{(i)} + \Big( \tilde{b} - \sum_{p=1}^{m} \frac{\mu_p \tilde{\alpha}_p}{\sigma_p} \Big) \Big), \qquad i = 1, 2      (5.34)

for classification. Note that the value of the prediction resulting from eqs. (5.33) and (5.34) is the same as the one obtained by calculating eqs. (5.21) and (5.22), which use the kernel values instead of the influence vector. However, the proposed framework of influence vectors gives direct access to the contributions of the single atoms, which can be used for interpreting the resulting prediction model.
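To make the bookkeeping of eqs. (5.30)–(5.33) concrete, the following minimal Python sketch computes the influence vector for MK1 and reproduces the prediction from it (variable names are illustrative; match_vectors stacks the binary vectors m_p row-wise):

    import numpy as np

    def influence_vector_mk1(match_vectors, alpha_tilde, sigma):
        # match_vectors: (m_sv, N) binary matrix; row p flags the atoms of M
        # matched by support molecule M_p. Eq. (5.30) rescales the Lagrange
        # multipliers; eq. (5.31) sums the weighted match vectors.
        alpha = alpha_tilde / sigma
        return alpha @ match_vectors  # s_i = sum_p alpha_p * (m_p)_i

    def predict_from_influences(s, alpha_tilde, sigma, mu, b_tilde):
        # Eq. (5.33): atom influences plus a molecule-independent offset
        # reproduce the prediction f(M) of eq. (5.21).
        return s.sum() + b_tilde - np.sum(mu * alpha_tilde / sigma)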

5.3.2 Visualization

The explanatory model described in section 5.3.1 yields an influence vector which contains the contribution of each atom to the activity prediction. In the following, a visualization technique is proposed which allows an easier interpretation of the influence values. For this purpose, the 3D structure of the molecule is visualized as a ball-and-stick model, in which the atoms correspond to spheres and the bonds to lines connecting the spheres. The components of the influence vector are then represented by the colors of the individual atoms. In order to allow an easy comparison of the results obtained for different molecules, the following color coding scheme is suggested: the colorbar follows a prism spectrum, where zero influence corresponds to a shade of green, high positive influences are represented by red colors, and high negative influences by blue colors. The color coding is symmetric with respect to zero, and its range is normalized by the largest absolute influence, such that the whole color range is used for each molecule. The colors therefore provide a relative measure of influence for comparing the atoms within each individual molecule. Thus, the same color might correspond to different values of influence for different molecules, but the sign of the influence can always be read off from the color. The suggested visualization method was implemented in MATLAB and is illustrated in figure 5.5 on a molecule from the AChE training dataset (see section 5.4). The top figure visualizes the influences obtained with molecule kernel MK1, while the bottom figure depicts the influences obtained with MK2. For both kernels, an oxygen atom with a double bond (dark red) and one hydrogen atom (on the left) have the highest positive influence on the activity prediction. A large part of the molecule has a slightly positive influence (orange), and some of the atoms do not contribute (green). There are only slight differences between the models obtained by the two molecule kernels.
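The color coding just described could be realized, for instance, as in the following Python/matplotlib sketch (the thesis implementation was written in MATLAB; here the 'jet' colormap stands in for the prism-like spectrum):

    import numpy as np
    import matplotlib.pyplot as plt

    def atom_colors(s):
        # Symmetric about zero and normalized by the largest absolute
        # influence: zero maps to mid-spectrum (green), high positive
        # influences to red, high negative influences to blue.
        smax = max(np.max(np.abs(s)), 1e-12)  # guard against all-zero input
        return plt.cm.jet(0.5 * (np.asarray(s) / smax + 1.0))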

Figure 5.5: Visualization of the explanatory QSAR model obtained by the molecule kernel method for MK1 (top) and MK2 (bottom) on the same molecule.

The explanatory model and its visualization can be used in two ways: First, it can be applied to a labeled training dataset in order to discover likely candidates for pharmacophores, i.e. ensembles of steric and electronic features that are necessary to ensure the optimal supramolecular interactions with a specific biological target and to trigger (or block) its biological response. On the other hand, it can be employed to show the structural features responsible for the prediction of unlabeled molecules. Thus the proposed molecule kernel method not only provides a predictive model, it can also explain the prediction of a particular model in terms of structural features.

5.4 Application to QSAR Analysis

5.4.1 QSAR Datasets

A crucial requirement for comparing QSAR methods is the availability of sufficient suitable data. This ensures that the molecular structures and activities in the selected training sets do not deviate too far from those in the test set. It also requires some redundancy in the training data, such that the relevant patterns can be recognized on the training data and exploited by the QSAR method (the learning machine). It is usually assumed that the data points are independent and identically distributed (i.i.d.) samples from an underlying population distribution. The dataset must be large enough that both training and test set are likely to be representative of this distribution. In the following, the QSAR datasets used in the method comparison are described. These specific datasets were chosen for the following reasons: (1) they are large enough to allow a comparison of methods, (2) they are publicly available4 in structural form (SD files), (3) results for specific training and test sets are reported in the literature, and (4) the activity values of the training and test sets have sufficiently similar distributions to be representative of the population distribution. To check that the latter holds for the activity values of the regression datasets, the histograms of training and test set were compared, and it was verified that the test set did not contain activities in a range which is not, or only insufficiently, represented by the training set. The mean and standard deviation of the number of atoms and bonds in each training and test set are given in Table 5.1. The binding affinity for the regression datasets is measured by pIC50 values, which correspond to the negative logarithm of the inhibitor concentration giving a 50% reduction of specific binding.
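The histogram check mentioned above could, for example, be carried out as follows (a sketch; y_train and y_test denote the pIC50 vectors of the two sets):

    import numpy as np
    import matplotlib.pyplot as plt

    def compare_activity_histograms(y_train, y_test, bins=15):
        # Overlay normalized pIC50 histograms of training and test set to
        # verify that the test activities lie in ranges covered by training.
        lo = min(np.min(y_train), np.min(y_test))
        hi = max(np.max(y_train), np.max(y_test))
        edges = np.linspace(lo, hi, bins + 1)
        plt.hist(y_train, bins=edges, density=True, alpha=0.5, label="training set")
        plt.hist(y_test, bins=edges, density=True, alpha=0.5, label="test set")
        plt.xlabel("pIC50")
        plt.ylabel("relative frequency")
        plt.legend()
        plt.show()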

Fontaine Dataset (Classification)

This dataset contains 435 molecules of the benzamidine family, which are dichotomized into factor Xa inhibitors of low and high affinity. The dataset contains 156 compounds with low activity and 279 compounds with high activity. It was used by Fontaine et al. (2005) for 3D QSAR modeling with the Anchor-GRIND method. There, the dataset was randomly divided into a training set (290 compounds, 99 inactive, 191 active) and a test set (145 compounds, 57 inactive, 88 active) (Fontaine et al., 2005).

4 All the used datasets are publicly available at http://www.cheminformatics.org/datasets/

Dataset    Nr. of atoms     Nr. of bonds     Nr. of atoms   Nr. of bonds
           (training set)   (training set)   (test set)     (test set)
Fontaine   59.7 ± 11.1      62.5 ± 11.6      60.0 ± 11.1    62.9 ± 11.7
ACE        43.6 ± 18.7      44.2 ± 19.6      40.1 ± 17.9    40.8 ± 18.8
AChE       56.4 ± 5.4       58.9 ± 5.7       56.4 ± 6.3     59.2 ± 6.8
BZR        36.3 ± 5.8       38.8 ± 6.2       36.4 ± 5.1     38.8 ± 5.4
DHFR       40.7 ± 6.9       42.8 ± 7.1       40.4 ± 7.7     42.7 ± 7.9

Table 5.1: Statistics on the datasets: mean ± standard deviation of the number of atoms and bonds in each training and test set.

ACE Dataset (Regression)

The ACE dataset comprises a set of 114 angiotensin converting enzyme (ACE) inhibitors originally published by Depriest et al. (1993). The activities range from pIC50 values of 2.1 to 9.9. It was used by Sutherland et al. (2004) to compare a large variety of QSAR methods, where it was split into a training set containing 76 compounds and a test set with 38 compounds.

AChE Dataset (Regression)

The AChE dataset contains 111 acetylcholinesterase (AChE) inhibitors whose pIC50 values lie in the range between 4.3 and 9.5. It was assembled by Sutherland et al. (2004), who used it in their QSAR method comparison and divided it into a training set (74 compounds) and a test set (37 compounds).

BZR Dataset (Regression)

The BZR dataset consists of 163 benzodiazepine receptor ligands whose pIC50 values lie in the range between 5.5 and 8.9. A subset of it was used in (Sutherland et al., 2004) and subdivided into a training set of 98 compounds and a test set with 49 substances.

DHFR Dataset (Regression)

The DHFR dataset consists of 397 dihydrofolate reductase (DHFR) inhibitors with pIC50 values for the rat liver enzyme ranging from 3.3 to 9.8. It was compiled by Sutherland et al. (2004). From the original data, a training set of 237 compounds and a test set of 124 compounds were generated (as well as a further set of 36 inactive compounds, which was not used here).

5.4.2 Assessment of Generalization Performance

All methods used the same training-test split for evaluating the predictive performance. The data points from the test set were not used at all for model building. This

includes the choice of hyperparameters (parameters which are considered fixed during the learning of the other model parameters, e.g. parameters representing the model complexity or the smoothness of a function), feature selection, feature construction, model selection, and parameter fitting. A measure of the generalization performance of the built models was obtained by applying them to the test set. The measures of performance for classification problems are based on the numbers of false positives (FP), true positives (TP), false negatives (FN), and true negatives (TN), as well as the size of the positive class (PC) and the size of the negative class (NC). From these, sensitivity (TP/PC), specificity (TN/NC), balanced error rate [0.5 (FN/PC + FP/NC)], and concordance [(TN + TP)/(PC + NC)] can be calculated. However, for the method comparison on the Fontaine dataset, only concordance can be used, since only this result is stated in the literature (Fontaine et al., 2005). For regression problems, the mean squared error (MSE) can be used, which is the PRESS (predicted residual sum of squares) averaged over the test set. The PRESS is defined as

    \mathrm{PRESS} = \sum_{i=1}^{N_{test}} \big( y^{(i)} - y_{pred}(x^{(i)}) \big)^2. (5.35)

In QSAR studies, usually r^2_{pred} is used instead,

    r^2_{pred} = \frac{SD - \mathrm{PRESS}}{SD},                                      (5.36)

where SD is defined as

    SD = \sum_{i=1}^{N_{test}} (y_i - \bar{y})^2,                                     (5.37)

with ȳ denoting the average value of y. There are two different conventions in the QSAR literature with respect to the data points used in this average. In one, ȳ is the average over the target values of the training set; in the other, ȳ is the average over the target values of the test set. In order to allow a comparison of results, we here stick to the first convention, which was used in (Sutherland et al., 2004).
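For reference, the performance measures used in this comparison follow directly from their definitions; the sketch below uses the training-set-mean convention for ȳ:

    import numpy as np

    def classification_metrics(tp, fp, tn, fn):
        # Sensitivity, specificity, balanced error rate and concordance
        # from the confusion-matrix counts.
        pc, nc = tp + fn, tn + fp  # positive / negative class sizes
        return {"sensitivity": tp / pc,
                "specificity": tn / nc,
                "balanced_error_rate": 0.5 * (fn / pc + fp / nc),
                "concordance": (tp + tn) / (pc + nc)}

    def r2_pred(y_test, y_pred, y_train):
        # Eqs. (5.35)-(5.37), with the mean over the training targets as y-bar.
        press = np.sum((np.asarray(y_test) - np.asarray(y_pred)) ** 2)
        sd = np.sum((np.asarray(y_test) - np.mean(y_train)) ** 2)
        return (sd - press) / sd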

5.4.3 Results

All datasets were analyzed using the two molecule kernels defined in eqs. (5.18) (MK1) and (5.19) (MK2). The model building process was carried out as described in section 5.2.3, using the P-SVM as predictor for both classification and regression. The P-SVM implementation by (Knebel et al., 2008) was used in the analysis, which is available under the GNU General Public License from the Neural Information Processing Group at the Berlin Institute of Technology. For the proposed method, two hyperparameters of the P-SVM needed adjustment. The hyperparameter search was carried out systematically via a two-phase grid search, using leave-one-out cross-validation on the training set, which is described in the appendix. In the following, this whole model building method is referred to as the "molecule kernel method".

Results of the Molecule Kernel Method

For the classification dataset, the confusion matrices for the predictions of the trained models on the test set are shown in Table 5.2 (MK1) and Table 5.3 (MK2). Table 5.4 lists the resulting performance statistics. For the regression datasets, the fitted predictions for both training and test set are plotted versus the actual activities as scatter plots in figure 5.6. As performance statistics, the values for MSE, PRESS, SD, and r^2_{pred} are listed in Table 5.5.

                            Actual label
                            +1        -1        Σ
Predicted label   +1        TP: 87    FP: 6     93
                  -1        FN: 1     TN: 51    52
                  Σ         PC: 88    NC: 57    N: 145

Table 5.2: Fontaine dataset: Confusion matrix of the molecule kernel method using MK1 on the test set

                            Actual label
                            +1        -1        Σ
Predicted label   +1        TP: 87    FP: 7     94
                  -1        FN: 1     TN: 50    51
                  Σ         PC: 88    NC: 57    N: 145

Table 5.3: Fontaine dataset: Confusion matrix of the molecule kernel method using MK2 on the test set

                                                MK1      MK2
Sensitivity [TP/PC]                             0.987    0.989
Specificity [TN/NC]                             0.895    0.877
Balanced error rate [0.5 (FN/PC + FP/NC)]       0.067    0.058
Concordance [(TN + TP)/N]                       0.945    0.952

Table 5.4: Fontaine dataset: Predictive performance measures of the molecule kernel method on the test set

      Dataset      ACE       AChE     BZR      DHFR
MK1   MSE          1.906     0.792    0.614    0.687
      PRESS        72.441    29.286   30.062   85.173
      SD           171.878   61.677   45.329   235.561
      r^2_{pred}   0.579     0.525    0.337    0.638
MK2   MSE          2.049     0.863    0.591    0.658
      PRESS        77.862    31.877   28.965   81.536
      SD           171.878   61.677   45.329   235.561
      r^2_{pred}   0.547     0.483    0.361    0.654

Table 5.5: Regression datasets: Predictive performance measures of molecule kernel methods MK1 and MK2 on the test set.

[Figure 5.6: four rows of scatter plots, one per dataset: (a) ACE dataset, (b) AChE dataset, (c) BZR dataset, (d) DHFR dataset. Each row shows predictions on training and test data for MK1 (left) and MK2 (right).]

Figure 5.6: Application of the molecule kernel method on the regression datasets: scatter plots showing the value of the regression target Y_pred predicted by the model against the true value Y. Left: MK1 (training and test set), right: MK2 (training and test set).

Comparison to other QSAR methods

The above results were compared to the results of other QSAR methods taken from the literature. Exactly the same training and test sets were used here and in the respective publications. For the classification dataset (Fontaine dataset), results were available for two variants of the Anchor-GRIND method: the one-block version (using Anchor-MIF descriptors) and the two-block version (using both Anchor-MIF and MIF-MIF descriptors). For details on the methods and their application to the dataset, see the publication by Fontaine et al. (2005) and the references therein. For the regression datasets (ACE, AChE, BZR, DHFR), results were available for descriptors calculated with CoMFA, CoMSIA basic (with steric and electrostatic fields), CoMSIA extra (with additional hydrogen-bonding fields, hydrophobic fields, or both), EVA, HQSAR, and traditional 2D and 2.5D descriptors5, using PLS as predictor. For details on the various methods and their application to the datasets, see the paper by Sutherland et al. (2004) and the references therein. For the comparison of methods on the classification dataset, the correct classification rate (concordance) on the test set is given for each method in Table 5.6. The molecule kernel method achieves a 5% prediction error rate, which is much lower than the prediction error rates of the Anchor-GRIND methods (16% for the two-block model and 12% for the one-block model). For the regression datasets, the value of r^2_{pred} is given for all compared methods in Table 5.7.

Dataset: Fontaine   Molecule       Molecule       Anchor-GRIND      Anchor-GRIND
                    Kernel MK1     Kernel MK2     (2-block model)   (1-block model)
Concordance         0.95           0.95           0.84              0.88

Table 5.6: Classification dataset: The correct classification rate (concordance) is given for the different methods. The results for the Anchor-GRIND method are taken from the literature (Fontaine et al., 2005). The best result is printed in bold font.

On the ACE dataset, the molecule kernel method MK1 performs best (r^2_{pred} = 0.58), followed by MK2 (r^2_{pred} = 0.55). All other QSAR methods except EVA and HQSAR reach r^2_{pred} values around 0.5. On the AChE dataset, MK1 gives the best result (r^2_{pred} = 0.50) and MK2 scores second best (r^2_{pred} = 0.48). Good results are also achieved by CoMFA (r^2_{pred} = 0.47) and the two CoMSIA approaches (r^2_{pred} = 0.44). On the BZR dataset as well, the two molecule kernel methods show the best performance, with r^2_{pred} = 0.36 (MK2) and r^2_{pred} = 0.34 (MK1), while all other methods yield values of r^2_{pred} ≤ 0.2. On the DHFR dataset, the molecule kernel methods perform best with an r^2_{pred} of 0.65 (MK2) and 0.64 (MK1). They are followed by HQSAR (r^2_{pred} = 0.63), CoMFA (r^2_{pred} = 0.59), and EVA (r^2_{pred} = 0.57).

5 Sutherland et al. (2004) use the term 2.5D descriptors to distinguish descriptors which involve straightforward calculations, like molecular volume, from descriptors based on force-field calculations.

Dataset   MK1    MK2    CoMFA   CoMSIA   CoMSIA   EVA    HQSAR   2D     2.5D
                                basic    extra
ACE       0.58   0.55   0.49    0.52     0.49     0.36   0.30    0.47   0.51
AChE      0.50   0.48   0.47    0.44     0.44     0.28   0.37    0.16   0.16
BZR       0.34   0.36   0.00    0.08     0.12     0.16   0.17    0.14   0.20
DHFR      0.64   0.65   0.59    0.52     0.53     0.57   0.63    0.47   0.49

Table 5.7: Regression datasets: r^2_{pred} for the different datasets and methods. All results except those for the two molecule kernels are taken from the literature (Sutherland et al., 2004). The best result for each dataset is printed in bold font, the second best in bold italics.

5.5 Application to Genotoxicity Prediction

During drug development, not only the potency of a pharmaceutical drug but also its potential genotoxicity has to be taken into account. The gene-damaging potential of a drug needs to be assessed as early in the process as possible, since a later failure to pass a regulatory test is very costly for the pharmaceutical company. It is therefore advantageous to conduct two in vitro tests at an early stage of drug development: the Ames test, which is a bacterial mutagenicity test, and the chromosome aberration (CA) test, which is based on a mammalian cell assay in which the chromosome-damaging potential of chemical compounds is assessed by detecting a microscopically visible formation of aberrant chromosomes. However, the use of the CA test in early discovery stages is rather restricted by costs and compound availability (Rothfuss et al., 2006). Machine learning approaches for toxicity prediction, called in silico methods, offer an alternative for high-throughput drug screening. Most machine-learning approaches have focused on the task of predicting the outcome of the Ames test, where relatively good predictive accuracies (> 70%) have been reached (White et al., 2003). However, for the CA test, no similar prediction performance has been achieved (Rothfuss et al., 2006). One reason for this is that, in contrast to QSAR datasets, which usually contain a set of so-called congeneric molecules, i.e. compounds that belong to a well-defined family with a common mechanism, typical toxicity databases span a wide set of substances with various underlying mechanisms (Klopman and Rosenkranz, 1994; Rothfuss et al., 2006). Structural damage can be caused by direct drug-DNA interactions (e.g. via an electrophilic site which either exists somewhere in the molecule or is produced by its metabolism), as a result of incorrect DNA repair processes, or by the interaction of drugs with enzymes involved in DNA replication and transcription. Moreover, the number of chromosomes might change via the interaction with cellular proteins involved in chromosome segregation. Finally, certain environmental influences (excessive cytotoxicity, pH, and temperature) might also cause (non-physiological) structural chromosome aberrations in the cell culture.

Another reason is that public CA test data is scarce and the test is not well standardized (Rothfuss et al., 2006): different cells from different species are used, usually no aberration frequencies are given, and the outcome of the test is judged rather subjectively. This, and the already mentioned non-physiological chromosomal damage, most likely lead to considerable label noise among different data collections. For these reasons, routinely conducted in silico mutagenicity prediction is usually based only on the prediction of the outcome of the Ames test. An extension of in silico procedures by a prediction of the CA test results would therefore allow for a more extensive judgment of potential genotoxic effects of candidate substances in early drug development. In this chapter, the molecule kernel method is applied to the problem of CA test prediction.

5.5.1 Chromosome Aberration Dataset

This project was carried out in cooperation with Bayer-Schering Pharma, who compiled a CA dataset from publicly available datasets. Details about the included datasets are listed in Table 5.8. The goal of this study was to apply the molecule kernel method to this data and to compare its performance to the model developed by Bayer-Schering Pharma (Sutter, 2008) using MCASE (MCASE, Beachwood, USA).

Dataset                    Description                       Experimental system                 Neg.   Pos.
(Kirkland et al., 2005)    450 industrial, environmental,    CA test on human lymphocytes        168    282
                           and pharmaceutical compounds      and cell lines
(Snyder and Green, 2001)   229 marketed pharmaceuticals      CA test on Chinese hamster ovary    189    40
                                                             cells, Chinese hamster lung cells,
                                                             V79 cells, MCL-5 human
                                                             lymphoblastoid cells and human
                                                             peripheral blood lymphocytes
(Serra et al., 2003)       383 organic compounds including   CA test on Chinese hamster lung     271    112
                           known carcinogens, drugs, food    cells
                           additives, agrochemicals,
                           cosmetic materials, medicinal
                           products and household materials

Table 5.8: Public CA datasets which were combined for the analysis.

MCASE (Klopman and Rosenkranz, 1994) is based on finding topological fragments, ranging from 2 to 10 atoms in length, in combination with 2D distances between atoms, which are statistically correlated with activity (biophores) and inactivity (biophobes), respectively. In addition, the program detects fragments that act as modulators of activity and takes some calculated physicochemical descriptors into account. These can

be used to explain why some of the molecules containing a biophore are not active. A limitation of MCASE is that compounds containing ions, molecular clusters (such as hydrates), and rare atoms (such as Mn, Ca, or K) are not accepted for model generation. Consequently, compounds containing such structural features were automatically eliminated from the training set by the program during model construction. In order to allow a comparison of results between MCASE and the molecule kernel method, the corresponding compounds were also excluded from the molecule kernel method, so that the training and test sets were identical. After cleaning and removal of duplicates, N = 819 compounds remained, 324 positive ones and 495 negative ones. On this dataset, the classification performance of the methods was compared using 7-fold cross-validation. The division into cross-validation folds was the same for all methods. Ideally, sensitivity and specificity should be about equal, and both as high as possible. A high sensitivity ensures that possibly genotoxic substances are identified as early as possible in the drug development process. A high specificity is desirable in order to avoid ruling out potential drug candidates by wrongly classifying them as genotoxic. Since the dataset is rather unbalanced, containing more negative examples than positive ones, application of the P-SVM is expected to result in a high specificity but low sensitivity. This is due to the fact that the P-SVM cost function aims at minimizing the total mean squared deviation of the predictions from the true class labels (+1 and −1) on the training set. A possible solution would be the removal of a sufficient number of compounds from the negative class to balance the dataset. However, this would result in different training and test sets for the two methods, and would make the results depend on the random subsample which was removed. Therefore, in addition to the analysis using the P-SVM, the molecule kernel matrix was interpreted as vectorial data. This was done by treating the rows of the kernel matrix as feature vectors, each feature measuring the similarity of a molecule to a specific molecule in the training set. In this framework, a C-SVM with RBF kernel was applied using the LIBSVM implementation (Chang and Lin, 2001), which allows assigning different C-values to the two classes to account for the class imbalance of the dataset. Note that the resulting kernel matrix of the RBF kernel no longer represents a kernel between two molecules, since it depends on the set of molecules in the training set. Instead, it should be considered as a nonlinear transformation, based on the molecule kernel matrix, which is used for prediction.
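A sketch of this last step, assuming scikit-learn's SVC class (which wraps the LIBSVM implementation cited above; the original analysis used LIBSVM directly), with the rows of the precomputed molecule kernel matrix serving as feature vectors:

    import numpy as np
    from sklearn.svm import SVC  # scikit-learn's SVC wraps LIBSVM

    def fit_class_weighted_svm(K_train, y_train, c_pos, c_neg, gamma):
        # K_train: (m, m) molecule kernel matrix; its rows are used as
        # feature vectors. Different effective C values per class
        # compensate the class imbalance of the CA dataset.
        clf = SVC(kernel="rbf", gamma=gamma, C=1.0,
                  class_weight={+1: c_pos, -1: c_neg})
        clf.fit(K_train, y_train)
        return clf

    # For prediction, the rows of the test kernel matrix must contain the
    # similarities to the *training* molecules, so that training and test
    # data live in the same feature space:  y_hat = clf.predict(K_test)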

5.5.2 Results

The results obtained with the MCASE method and the molecule kernel method under 7-fold cross-validation are shown in Table 5.9. There are two different MCASE models. The first one (unambiguous) only classifies compounds of the test set where all fragments in the structure are known from the training set, and in which either strong biophores (i.e. pharmacophores or toxicophores) or no biophores were found. The second MCASE model (ambiguous) also provides predictions for some

structures in which the found biophores were not strong (only a weak majority of compounds containing the fragment were positive), or in which no known biophores were found but some fragments were not present in the training set. Both molecule kernels MK1 and MK2 were used once in combination with the P-SVM and once in combination with the C-SVM. Table 5.9 lists the obtained values for sensitivity (TP/PC), specificity (TN/NC), correct classification rate ((TN + TP)/(PC + NC)), and coverage (the percentage of the test set for which a prediction can be made), where TP is the number of true positives, TN the number of true negatives, PC the size of the positive class, and NC the size of the negative class. The average coverage of the unambiguous MCASE model is only about 3/4 of the test set, whereas the ambiguous MCASE model achieves an average coverage of about 98%, which is close to the 100% coverage of the molecule kernel methods; thus the ambiguous model is more suitable for the comparison. However, the prediction performance of the ambiguous MCASE model is only slightly worse than that of the unambiguous one. For both molecule kernels (MK1 and MK2), the specificity and sensitivity obtained with the P-SVM were better than the results obtained for the two MCASE models, while the coverage was always 100%. As expected, the specificity obtained with the P-SVM was much higher than the sensitivity, since the data was unbalanced. The C-SVM accounted for the unbalanced classes and obtained a better balance between sensitivity and specificity. Considering the requirement of equally good sensitivity and specificity, the molecule kernel MK2 in combination with the C-SVM performed best, with a sensitivity of 65.3% and a specificity of 74.1%. This model reaches about the same specificity as the ambiguous MCASE model, but has an almost 20% higher sensitivity.

Method                 Sensitivity [%]   Specificity [%]   Correct Classification   Coverage [%]
                                                           Rate [%]
MCASE (unambiguous)    49.5 ± 7.4        77.6 ± 2.6        67.2 ± 1.7               76.9 ± 3.0
MCASE (ambiguous)      46.8 ± 8.3        75.8 ± 3.8        64.6 ± 3.2               98.3 ± 1.0
MK1 + P-SVM            56.6 ± 6.8        81.2 ± 3.3        71.3 ± 2.7               100.0 ± 0.0
MK2 + P-SVM            53.8 ± 7.7        82.3 ± 3.7        70.8 ± 3.2               100.0 ± 0.0
MK1 + C-SVM            63.1 ± 4.2        71.3 ± 4.4        68.3 ± 3.7               100.0 ± 0.0
MK2 + C-SVM            65.3 ± 5.8        74.1 ± 5.2        70.6 ± 4.4               100.0 ± 0.0

Table 5.9: Method comparison on the chromosome aberration dataset. Shown are the mean values and standard deviations under 7-fold CV. The MCASE results were obtained by Bayer-Schering Pharma (Sutter, 2008).

5.6 Summary and Conclusions

In this chapter, a new kernel method for QSAR analysis was introduced, which is based on a novel similarity measure (the molecule kernel) and the use of the P-SVM as predictor. Instead of using descriptor vectors, the proposed molecule kernel method implicitly employs similarities in the 3D structures for prediction. In contrast to graph kernel QSAR approaches, which are based on counting paths and walks in the molecular graph, the molecule kernel method takes the 3D geometry of the molecular structure into account.

The molecule kernels represent a measure of structural similarity between two given compounds. The P-SVM uses a linear combination of the pairwise similarities of the molecules in the training set to build a predictive activity model. This model implicitly extracts 3D substructures relevant for the prediction of a given endpoint. A necessary precondition for this to work is that the respective substructures are present often enough in the training set, so that the underlying pattern can be learned.

A problem of descriptor-based QSAR models is that the prediction function is expressed in the space of the descriptor variables; therefore the mechanisms underlying the activity of a molecule are hard to investigate. In contrast, the prediction function of the molecule kernel method is expressed as a linear combination of structural similarities to a set of support molecules. In this thesis, it was shown how the molecule kernel between a molecule and the set of support molecules can be used together with the respective Lagrange multipliers to build an explanatory QSAR model. Moreover, a visualization technique for this explanatory model was proposed, which is based on color coding the relative influence each atom has on the activity value or class label. This makes it possible to understand why a predicted molecule was assigned a certain target value. If applied to the labeled training set, it might help to gain insight into the general structural properties which influence the activity and might give an idea where modifications to the structure should be applied.

Two different molecule kernels were investigated. Both use similarity functions based on the spatial match of atoms belonging to the same element, which are, however, of different functional form. The molecule kernel method was applied in two different application domains relevant in early drug discovery: QSAR analysis and genotoxicity prediction.

In the QSAR domain, the method was compared on four regression datasets and one classification dataset to several state-of-the-art descriptor-based QSAR methods. These included approaches based on traditional 2D and 2.5D descriptor vectors as well as a variety of 3D QSAR methods (CoMFA, CoMSIA, HQSAR, EVA). In these experiments, the results of the two proposed molecule kernels using the P-SVM as predictor were consistently better than the results reported in the literature for the other QSAR methods included in the comparison. The empirical evidence suggests that the proposed descriptor-free method offers a promising alternative to existing descriptor-based approaches.

In the genotoxicity prediction domain, the molecule kernel method was applied to the prediction of the outcome of the chromosome aberration (CA) test. The classification performance was compared to MCASE, a commercial software package often used by the pharmaceutical industry for this task, on a large CA dataset. Compared to MCASE, the proposed method achieved an almost 20% increase in sensitivity at equal specificity. This indicates that the molecule kernel method can also be successfully applied to areas like genotoxicity prediction, where, unlike in receptor-ligand systems, many different (and often unknown) mechanisms are at work.

The two investigated kernels are based on different similarity measures and therefore yield different numerical values for the kernel matrix. However, their predictive performance was very similar on all datasets. This could be due to the fact that both kernels use the same level of molecular representation, the atomwise correspondence. These simple kernels were already able to capture enough of the relevant structural similarities to yield excellent prediction performance, which even outperformed much more complex, force-field based approaches on several datasets. Based on the given framework, other molecule kernels can be constructed using more sophisticated similarity measures, which also model bond similarities or electrostatic and steric properties. As an alignment-free method, molecule kernels require neither a user-supplied alignment of the molecules, as in CoMFA and CoMSIA, nor the selection of anchor points, as in the Anchor-GRIND method. This saves time, makes the method usable by nonexperts, and eliminates potential user bias. In contrast to other 3D QSAR methods, molecule kernels do not require the assumption that there is a single interaction mechanism at the same active site of a macromolecule.

A Appendix

A.1 Hyper-Parameter Optimization on QSAR Datasets

In the following, the results of the optimization of the hyperparameters for the P-SVM on the QSAR datasets analyzed in section 5.4 are given for the two molecule kernels MK1 and MK2. The hyperparameters C and ε of the P-SVM were selected using a two-phase grid search for the parameter values giving minimal leave-one-out cross-validation (LOOCV) error on each training set. In the first phase, a coarse initial grid search was performed, with grid points lying at ε ∈ {0.1, 0.2, ..., 1.0}, C ∈ {1, 2, ..., 10} for regression and ε ∈ {0.05, 0.1, ..., 1.0}, C ∈ {0.05, 0.1, ..., 1.0} for classification. In the second phase, a refined grid search was conducted around the minimum found in the first grid search. The new search grid was chosen as an equally spaced 9 × 9 grid centered on the found minimum, such that the start- and endpoints were 1/5 of the previous grid step size away from the points neighboring the minimum in the previous grid. The result of the hyperparameter selection via the two-phase grid search on the training set of the classification dataset (Fontaine) is shown in figure A.1. The predictive balanced error rate is shown as a combined surface and contour plot over the hyperparameters C and ε of the P-SVM for the first phase (left) and second phase (right) of the grid search. For the regression datasets (ACE, AChE, BZR, and DHFR), the results of the hyperparameter search, conducted using a two-phase grid search on the respective training sets, are shown in figures A.2, A.3, A.4, and A.5. The combined surface and contour plots depict the mean squared predictive error as a function of the hyperparameters C and ε of the P-SVM.
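A minimal Python sketch of this two-phase procedure, assuming a callable loocv_error(eps, C) that trains the P-SVM and returns the LOOCV error on the training set:

    import numpy as np

    def refine_grid(center, step):
        # Phase-2 axis: 9 equally spaced points centered on the phase-1
        # minimum; the endpoints lie step/5 inside the neighboring points.
        half_width = step - step / 5.0
        return np.linspace(center - half_width, center + half_width, 9)

    def two_phase_grid_search(loocv_error, eps_grid, c_grid):
        errors = np.array([[loocv_error(e, c) for c in c_grid] for e in eps_grid])
        i, j = np.unravel_index(np.argmin(errors), errors.shape)
        eps_fine = refine_grid(eps_grid[i], eps_grid[1] - eps_grid[0])
        c_fine = refine_grid(c_grid[j], c_grid[1] - c_grid[0])
        errors = np.array([[loocv_error(e, c) for c in c_fine] for e in eps_fine])
        i, j = np.unravel_index(np.argmin(errors), errors.shape)
        return eps_fine[i], c_fine[j]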


(a) MK1, optimum hyperparameters: ε = 0.02, C = 0.38, with a balanced error rate of 0.018140

(b) MK2, optimum hyperparameters: ε = 0.02, C = 0.52, with a balanced error rate of 0.028241

Figure A.1: Application of the molecule kernel method on the Fontaine dataset: hyperparameter grid search. Left: phase 1, right: phase 2. The LOOCV balanced error rate is shown as a function of the hyperparameters C and ε.

(a) MK1, optimum hyperparameters: ε = 0.04, C = 1.6, with a LOOCV-MSE of 1.749284

(b) MK2, optimum hyperparameters: ε = 0.14, C = 2.0, with a LOOCV-MSE of 1.769261

Figure A.2: Application of the molecule kernel method on the ACE dataset: hyperparameter grid search. Left: phase 1, right: phase 2.

(a) MK1, optimum hyperparameters: ε = 0.02, C = 3.8, with a LOOCV-MSE of 0.729378

(b) MK2, optimum hyperparameters: ε = 0.02, C = 3.6, with a LOOCV-MSE of 0.710225

Figure A.3: Application of the molecule kernel method on the AChE dataset: hyperparameter grid search. Left: phase 1, right: phase 2.

(a) MK1, optimum hyperparameters: ε = 0.04, C = 2.4, with a LOOCV-MSE of 0.254526

(b) MK2, optimum hyperparameters: ε = 0.02, C = 5.4, with a LOOCV-MSE of 0.223651

Figure A.4: Application of the molecule kernel method on the BZR dataset: hyperparameter grid search. Left: phase 1, right: phase 2.

(a) MK1, optimum hyperparameters: ε = 0.24, C = 6, with a LOOCV-MSE of 0.540629

(b) MK2, optimum hyperparameters: ε = 0.12, C = 3.6, with a LOOCV-MSE of 0.543014

Figure A.5: Application of the molecule kernel method on the DHFR dataset: hyperparameter grid search. Left: phase 1, right: phase 2.

Bibliography

Abarca, J., Campusano, J., Bustos, V., Noriega, V., and Aliaga, E. (2004). Functional interactions between somatodendritic dopamine release, glutamate receptors and brain-derived neurotrophic factor expression in mesencephalic structures of the brain. Brain Research Reviews, 47:126–144.

Ackley, D. H., Hinton, G. E., and Sejnowski, T. J. (1985). A learning algorithm for Boltzmann machines. Cognitive Science, 9:147–169.

Afifi, A., Clark, V., and May, S. (2004). Computer-Aided Multivariate Analysis. Texts in Statistical Science Series. Chapman & Hall/CRC, Boca Raton, London, New York, Washington, D.C., fourth edition.

Agartz, I., Momenan, R., Rawlings, R., Kerich, M., and Hommer, D. (1999). Hippocampal volume in patients with alcohol dependence. Archives of General Psychiatry, 56:356–363.

Akaike, H. (1974). A new look at the statistical model identification. IEEE Trans. Automat. Control, 19(6):716–723.

Akaike, H. (1981). Likelihood of a model and information criteria. Journal of Econometrics, 16:3–14.

Akaike, H. (1985). A celebration of statistics. In Atkinson, A. and Fienberg, S., editors, The ISI Centenary Volume, pages 1–24. Springer Verlag, New York.

Altinbilek, B. and Manahan-Vaughan, D. (2007). Antagonism of group III metabotropic glutamate receptors results in impairment of LTD but not LTP in the hippocampal CA1 region, and prevents long-term spatial memory. European Journal of Neuroscience, 26:1166–1172.

Backstrom, P. and Hyytia, P. (2005). Suppression of alcohol self-administration and cue-induced reinstatement of alcohol seeking by the mGlu2/3 receptor agonist LY379268 and the mGlu8 receptor agonist (S)-3,4-DCPG. European Journal of Pharmacology, 528:110–118.

Bakhir, G. H., Hofmann, T., Schölkopf, B., Smola, A. J., Taskar, B., and Vishwanathan, S. V. N. (2007). Predicting Structured Data. MIT Press.


Balding, D. (2006). A tutorial on statistical methods for population association studies. Nature Reviews Genetics, 7:781–791.

Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. B, 57(1):289–300.

Benjamini, Y. and Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency. The Annals of Statistics, 29(4):1165–1188.

Beresford, T., Arciniegas, D., Alfers, J., Clapp, L., Martin, B., Du, Y., Liu, D., Shen, D., and Davatzikos, C. (2006). Hippocampus volume loss due to chronic heavy drinking. Alcoholism: Clinical and Experimental Research, pages 1866–1870.

Bierut, L., Saccone, N., Rice, J., Goate, A., Foroud, T., Edenberg, H., Almasy, L., Conneally, P., Crowe, R., Hesselbrock, V., Li, T., Nurnberger, J. J., Porjesz, B., Schuckit, M., Tischfield, J., Begleiter, H., and Reich, T. (2002). Defining alcohol-related phenotypes in humans: The Collaborative Study on the Genetics of Alcoholism. Alcohol Research & Health, 26:208–213.

Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Information Science and Statistics. Springer Science + Business Media, New York.

Bolos, A., Dean, M., Lucas-Derse, S., Ramsburg, M., Brown, G., and Goldman, D. (1990). Population and pedigree studies reveal a lack of association between the dopamine D(2) receptor gene and alcoholism. Journal of the American Medical Association, 264:3156–3160.

Bortolotto, Z., Fitzjohn, S., and Collingridge, G. (1999). Roles of metabotropic glutamate receptors in LTP and LTD in the hippocampus. Current Opinion in Neurobiology, 9:299–304.

Bowirrat, A. and Oscar-Berman, M. (2005). Relationship between dopaminergic neurotransmission, alcoholism, and reward deficiency syndrome. American Journal of Medical Genetics Part B: Neuropsychiatric Genetics, 132:29–37.

Bradley, M. and Lang, P. (1994). Measuring emotion: the self-assessment manikin and the semantic differential. Journal of Behavioral Therapy & Experimental Psychiatry, 25:49–59.

Breiman, L. (1994). Bagging predictors. Technical Report 421, Department of Statis- tics, UC Berkeley. ftp://ftp.stat.berkeley.edu/pub/tech-reports/421.ps.Z.

Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and Regression Trees. Wadsworth International Group, Belmont, CA.

Breiter, H. and Gasic, G. (2004). A general circuitry processing reward/aversion information and its implications for neuropsychiatric illness. In Gazzaniga, M., editor, The Cognitive Neurosciences, volume III, pages 1043–1065, Cambridge. MIT Press.

Broomhead, D. S. and Lowe, D. (1988). Multivariable functional interpolation and adaptive networks. Complex Systems, 2:321–355.

Bruno, V., Battaglia, G., Casabona, G., Copani, A., Caciagli, F., and Nicoletti, F. (1998). Neuroprotection by glial metabotropic glutamate receptors is mediated by transforming growth factor-beta. Journal of Neuroscience, 18:9594–9600.

Buckholtz, J., Sust, S., Tan, H., Mattay, V., Straub, R., Meyer-Lindenberg, A., Weinberger, D. R., and Callicott, J. H. (2007). fMRI evidence for functional epistasis between COMT and RGS4. Mol Psychiatry, 12(10):893–895.

Buckland, P. (2001). Genetic association studies of alcoholism - problems with the candidate gene approach. Alcohol and Alcoholism, 36:99–103.

Burnham, K. P. and Anderson, D. R. (2002). Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. Springer, Berlin, New York.

Campusano, J., Abarca, J., Forray, M., Gysling, K., and Bustos, G. (2002). Modulation of dendritic release of dopamine by metabotropic glutamate receptors in rat substantia nigra. Biochemical Pharmacology, 63:1343–1352.

Caviness, V. J., Kennedy, D., Richelme, C., Rademacher, J., and Filipek, P. (1996). The human brain age 7-11 years: A volumetric analysis based on magnetic resonance images. Cerebral Cortex, 6:726–736.

Challis, J. (1995). A procedure for determining rigid body transformation parameters. J. Biomechanics, 28(6):733–737.

Chang, C.-C. and Lin, C.-J. (2001). LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Chen, J., Lipska, B., Halim, N., Ma, Q., Matsumoto, M., Melhem, S., Kolachana, B., Hyde, T., Herman, M., Apud, J., Egan, M., Kleinman, J., and Weinberger, D. (2004). Functional analysis of genetic variation in catechol-o-methyltransferase (COMT): effects on mRNA, protein, and enzyme activity in postmortem human brain. American Journal of Human Genetics, 75:807–821.

Conn, P. and Pin, J. (1997). Pharmacology and functions of metabotropic glutamate receptors. Annual Review of Pharmacology and Toxicology, 37:205–237.

Corti, C., Battaglia, G., Molinaro, G., Riozzi, B., Pittaluga, A., Corsi, M., Mugnaini, M., Nicoletti, F., and Bruno, V. (2007). The use of knock-out mice unravels distinct roles for mGlu2 and mGlu3 metabotropic glutamate receptors in mechanisms of neurodegeneration/neuroprotection. Journal of Neuroscience, pages 8297–8308.

Cover, T. M. and Thomas, J. A. (1991). Elements of Information Theory. Wiley, New York.

Cowell, R., Dawid, A., Lauritzen, S., and Spiegelhalter, D. (1999). Probabilistic Networks and Expert Systems. Statistics for Engineering and Computer Science. Springer.

Cramer, R., Patterson, D., and Bunce, J. (1988). Comparative molecular field analysis (CoMFA). 1. Effect of shape on binding of steroids to carrier proteins. J. Am. Chem. Soc., 110:5959–5967.

De Bellis, M., Clark, D., Beers, S., Soloff, P., Boring, A., Hall, J., and Kersh, A. (2000). Hippocampal volume in adolescent-onset alcohol use disorders. American Journal of Psychiatry, 157:737–744.

Dempster, E., Mill, J., Craig, I., and Collier, D. (2006). The quantification of COMT mRNA in post mortem cerebellum tissue: diagnosis, genotype, methylation and expression. Biomed Central Medical Genetics, 7:10.

Depriest, S. A., Mayer, D., Naylor, C. B., and Marshall, G. R. (1993). 3D QSAR of angiotensin-converting enzyme and thermolysin inhibitors: A comparison of CoMFA models based on deduced and experimentally determined active-site geometries. Journal of the American Chemical Society, 115:5372–5384.

Derogatis, L. (1983). Symptom Checklist 90-Revised/SCL-90-R: Administration, Scoring and Procedures Manual II. Clinical Psychometric Research, Baltimore, MD.

Dick, D. and Bierut, L. (2006). The genetics of alcohol dependence. Current Psychiatry Reports, 8:151–157.

Drabant, E., Hariri, A., Meyer-Lindenberg, A., Munoz, K., Mattay, V., Kolachana, B., Egan, M., and Weinberger, D. (2006). Catechol o-methyltransferase val158met genotype and neural mechanisms related to affective arousal and regulation. Arch Gen Psychiatry, 63(12):1396–1406.

Dravolina, O., Zakharova, E., Shekunova, E., Zvartau, E., Danysz, W., and Bespalov, A. (2007). mGlu1 receptor blockade attenuates cue- and nicotine-induced reinstatement of extinguished nicotine self-administration behavior in rats. Neuropharmacology, 52:263–269.

Duda, R. O., Hart, P. E., and Stork, D. G. (2001). Pattern Classification. John Wiley & Sons, New York, 2nd edition.

Edenberg, H. and Foroud, T. (2006). The genetics of alcoholism: identifying specific genes through family studies. Addiction Biology, 11:386–396.

Edwards, D. (2000). Introduction to Graphical Modelling. Springer Texts in Statistics, 2nd edition.

Egan, M., Goldberg, T., Kolachana, B., Callicott, J., Mazzanti, C., Straub, R., Goldman, D., and Weinberger, D. (2001). Effect of COMT Val108/158 Met genotype on frontal lobe function and risk for schizophrenia. Proc Natl Acad Sci U S A, 98(12):6917–6922.

Enoch, M., Waheed, J., Harris, C., Albaugh, B., and Goldman, D. (2006). Sex differences in the influence of COMT Val158Met on alcoholism and smoking in Plains American Indians. Alcoholism: Clinical and Experimental Research, 30:399–406.

Enoch, M. A., Schuckit, M., Johnson, B. A., and Goldman, D. (2003). Genetics of alcoholism using intermediate phenotypes. Alcohol Clin Exp Res, 27:169–176.

Ferguson, A. M., Heritage, T., Jonathon, P., Pack, S. E., and Phillips, L. (1997). EVA: A new theoretically based molecular descriptor for use in QSAR/QSPR analysis. Journal of Computer-Aided Molecular Design, 11:143–152.

Fienberg, S. (1980). Analysis of Cross-Classified Categorical Data. MIT Press, Cambridge, MA, 2nd edition.

First, M., Spitzer, R., Gibbon, M., and Williams, J. (1997). Structured Clinical Interview for DSM-IV Personality Disorders (SCID-II). American Psychiatric Press, Washington, D.C.

First, M., Spitzer, R., Gibbon, M., and Williams, J. (2001). Structured Clinical Interview for DSM-IV-TR Axis I Disorders, Research Version, Patient Edition With Psychotic Screen (SCID-I/P W/ PSY SCREEN). Biometrics Research, New York State Psychiatric Institute, New York.

Fisher, R. (1935). The fiducial argument in statistical inference. Ann. Eugenics, 6:391–398.

Fontaine, F., Pastor, M., Zamora, I., and Sanz, F. (2005). Anchor-GRIND: Filling the gap between standard 3D QSAR and the GRid-INdependent Descriptors. Journal of Medicinal Chemistry, 48:2687–2694.

Gärtner, T., Flach, P., and Wrobel, S. (2003). On graph kernels: Hardness results and efficient alternatives. In Schölkopf, B. and Warmuth, M., editors, Proc. of the Sixteenth Annual Conference on Computational Learning Theory and the Seventh Annual Workshop on Kernel Machines, volume 2777 of Lecture Notes in Computer Science, Heidelberg. Springer-Verlag.

Goldstein, J., Seidman, L., O'Brien, L., Horton, N., Kennedy, D., Makris, N., Caviness, V. J., Faraone, S. V., and Tsuang, M. T. (2002). Impact of normal sexual dimorphisms on sex differences in structural brain abnormalities in schizophrenia assessed by magnetic resonance imaging. Archives of General Psychiatry, 59:154–164.

Goodwin, D. (1975). Genetic determinants of alcohol addiction. Advances in Experimental Medicine and Biology, 56:339–355.

Guttman, I. (1967). The use of the concept of future observations in goodness-of-fit problems. Journal of the Royal Statistical Society, Series B, 29:38–100.

Guyon, I. and Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182. Special Issue on Variable and Feature Selection.

Halperin, E. and Eskin, E. (2004). Haplotype reconstruction from genotype data using imperfect phylogeny. Bioinformatics, 20(12):1842–1849.

Hamilton, M. (1986). The Hamilton rating scale for depression. In Sartorius, N. and Ban, T. A., editors, Assessment of Depression, pages 143–152. Springer, Berlin.

Hariri, A., Mattay, V. S., Tessitore, A., Kolachana, B., Fera, F., Goldman, D., Egan, M., and Weinberger, D. (2002). Serotonin transporter genetic variation and the response of the human amygdala. Science, 297:400–403.

Haykin, S. (1994). Neural Networks : A Comprehensive Foundation. Macmillan, New York.

Heinz, A., Jones, D., Mazzanti, C., Goldman, D., Ragan, P., Hommer, D., Linnoila, M., and Weinberger, D. (2000). A relationship between serotonin transporter genotype and in vivo protein expression and alcohol neurotoxicity. Biological Psychiatry, 47:643–649.

Heinz, A., Schafer, M., Higley, J., Krystal, J., and Goldman, D. (2003). Neurobiological correlates of the disposition and maintenance of alcoholism. Pharmacopsychiatry, 36(Suppl 3):561–70.

Heinz, A., Siessmeier, T., Wrase, J., Buchholz, H., Gruender, G., Kumakura, Y., Cumming, P., Schreckenberger, M., Smolka, M., Roesch, F., Mann, K., and Bartenstein, P. (2005). Correlation of alcohol craving with striatal dopamine synthesis capacity and D2/3 receptor availability: a combined [18F]DOPA and [18F]DMFP PET study in detoxified alcoholic patients. American Journal of Psychiatry, 162:1515–1520.

Heritage, T. and Lowis, D. (1999). Molecular hologram QSAR. In Rational Drug Design: Novel Methodology and Practical Applications. Oxford University Press, New York.

Hersi, A., Rowe, W., Gaudreau, P., and Quirion, R. (1995). Dopamine D1 receptor ligands modulate cognitive performance and hippocampal acetylcholine release in memory-impaired aged rats. Neuroscience, 69:1067–1074.

Hines, L., Ray, L., Hutchison, K., and Tabakoff, B. (2005). Alcoholism: the dissection for endophenotypes. Dialogues in Clinical Neuroscience, 7:153–163.

Ho, B., Wassink, T., O’Leary, D., Sheffield, V., and Andreasen, N. (2005). Catechol-O-methyltransferase Val158Met gene polymorphism in schizophrenia: working memory, frontal lobe MRI morphology and frontal cerebral blood flow. Mol Psychiatry, 10(229):287–298.

Hochreiter, S. and Obermayer, K. (2004). Gene selection for microarray data. In Schölkopf, B., Tsuda, K., and Vert, J.-P., editors, Kernel Methods in Computational Biology, pages 319–355. MIT Press.

Hochreiter, S. and Obermayer, K. (2005). Nonlinear feature selection with the potential support vector machine. In Guyon, I., Gunn, S., Nikravesh, M., and Zadeh, L., editors, Feature Extraction: Foundations and Applications. Springer.

Hochreiter, S. and Obermayer, K. (2006). Support vector machines for dyadic data. Neural Computation, 18:1472–1510.

Hommer, D. (2003). Male and female sensitivity to alcohol-induced brain damage. Alcohol Research & Health, 27:181–185.

Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences of the USA, 79:2554–2558.

Hyvärinen, A., Karhunen, J., and Oja, E. (2001). Independent Component Analysis. Adaptive and Learning Systems for Signal Processing, Communications, and Control, S. Haykin, editor. John Wiley & Sons, New York.

Jay, T. (2003). Dopamine: a potential substrate for synaptic plasticity and memory mechanisms. Progress in Neurobiology, 69:375–390.

Ji, Y., Pang, P. T., Feng, L., and Lu, B. (2005). Cyclic AMP controls BDNF-induced TrkB phosphorylation and dendritic spine formation in mature hippocampal neurons. Nature Neuroscience, 8:164–172.

Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. (1998). An introduction to variational methods for graphical models. In Jordan, M. I., editor, Learning in Graphical Models, pages 105–162. Kluwer.

Kashima, H., Tsuda, K., and Inokuchi, A. (2003). Marginalized kernels between labeled graphs. In Fawcett, T. and Mishra, N., editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 321–328. AAAI Press.

Kauhanen, J., Hallikainen, T., Tuomainen, T., Koulu, M., Karvonen, M., Salonen, J., and Tiihonen, J. (2000). Association between the functional polymorphism of catechol-O-methyltransferase gene and alcohol consumption among social drinkers. Alcoholism: Clinical and Experimental Research, 24:135–139.

Kearsley, S. K. and Smith, G. (1990). An alternative method for the alignment of molecular structures: Maximizing electrostatic and steric overlap. Tetrahedron Computer Methodology, 3(6c):615–633.

Kelley, A. (2004). Memory and addiction: shared neural circuitry and molecular mechanisms. Neuron, 44:161–179.

Kendler, K., Neale, M., Heath, A., Kessler, R., and Eaves, L. (1994). A twin-family study of alcoholism in women. American Journal of Psychiatry, 151:707–715.

Kirkland, D., Aardema, M., Henderson, L., and Mueller, L. (2005). Evaluation of the ability of a battery of three in vitro genotoxicity tests to discriminate rodent carcinogens and non-carcinogens. I. Sensitivity, specificity, and relative predictivity. Mutation Research, 584(1-2):1–256.

Kirkpatrick, S., Gelatt, C. D., and Vecchi, M. P. (1983). Optimization by simulated annealing. Science, 220:671–680.

Klebe, G., Abraham, U., and Mietzner, T. (1994). Molecular similarity indices in a comparative analysis (CoMSIA) of drug molecules to correlate and predict their biological activity. Journal of Medicinal Chemistry, 37(24):4130–46.

Klopman, G. and Rosenkranz, H. S. (1994). Approaches to SAR in carcinogenesis and mutagenesis. Prediction of carcinogenicity/mutagenicity using MULTI-CASE. Mutation Research, 305:33–46.

Knebel, T., Hochreiter, S., and Obermayer, K. (2008). An SMO algorithm for the potential support vector machine. Neural Computation, 20:271–287.

Knecht, S., Breitenstein, C., Bushuven, S., Wailke, S., Kamping, S., Floel, A., Zwitserlood, P., and Ringelstein, E. (2004). Levodopa: faster and better word learning in normal humans. Annals of Neurology, 56:20–26.

Koehnke, M. (2008). Approach to the genetics of alcoholism: A review based on pathophysiology. Biochemical Pharmacology, 75:160–177.

Kohavi, R. and John, G. H. (1997). Wrappers for feature subset selection. Artificial Intelligence, 97(1-2):273–324.

Kohonen, T. (1982a). Analysis of a simple self-organizing process. Biological Cybernetics, 43:135–140.

Kohonen, T. (1982b). Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43:59–69.

Koob, G. (2003). Alcoholism: Allostasis and beyond. Alcoholism: Clinical and Experimental Research, 27:232–243.

Kotlinska, J. and Bochenski, M. (2007). Comparison of the effects of mGluR1 and mGluR5 antagonists on the expression of behavioral sensitization to the locomotor effect of morphine and the morphine withdrawal jumping in mice. European Journal of Pharmacology, 558:113–118.

Kullback, S. and Leibler, R. A. (1951). On information and sufficiency. Annals of Mathematical Statistics, 22:79–86.

Lang, P. (1995). The emotion probe. Studies of motivation and attention. Am Psychol., 50:372–385.

Lang, P., Bradley, M., and Cuthbert, B. (1999). The International Affective Picture System (IAPS). Center for Research in Psychophysiology, University of Florida, Gainesville, FL.

Legault, M. and Wise, R. (2001). Novelty-evoked elevations of nucleus accumbens dopamine: dependence on impulse flow from the ventral subiculum and glutamatergic neurotransmission in the ventral tegmental area. European Journal of Neuroscience, 13:819–828.

Linsker, R. (1988). Self-organization in a perceptual network. Computer, 21(3):105–117.

Lisman, J. and Grace, A. (2005). The hippocampal-VTA loop: controlling the entry of information into long-term memory. Neuron, 46:703–713.

Lohoff, F., Lautenschlager, M., Mohr, J., Ferraro, T., Sander, T., and Gallinat, J. (2008). Association between variation in the vesicular monoamine transporter 1 gene on chromosome 8p and anxiety-related personality traits. Neuroscience Letters, 434:41–45.

MacKay, D. (2003). Information Theory, Inference and Learning Algorithms. Cambridge University Press.

Makris, N., Gasic, G., Seidman, L., Goldstein, J., Gastfriend, D., Elman, I., Albaugh, M., Hodge, S., Ziegler, D., Sheahan, F., Caviness, V. J., Tsuang, M., Kennedy, D., Hyman, S., Rosen, B., and Breiter, H. (2004). Decreased absolute amygdala volume in cocaine addicts. Neuron, 44:729–740.

Markou, A. (2007). The role of metabotropic glutamate receptors in drug reward, motivation and dependence. Drug News and Perspectives, 20:103–108.

McCulloch, W. S. and Pitts, W. H. (1943). A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biophys., 5:115–133.

Mechtcheriakov, S., Brenneis, C., Egger, K., Koppelstaetter, F., Schocke, M., and Marksteiner, J. (2007). A distinct pattern of cerebral atrophy in patients with alcohol addiction revealed by voxel-based morphometry. Journal of Neurology, Neurosurgery and Psychiatry, 78:610–614.

Medina, L., Schweinsburg, A., Cohen-Zion, M., Nagel, B., and Tapert, S. (2007). Effects of alcohol and combined marijuana and alcohol use during adolescence on hippocampal volume and asymmetry. Neurotoxicology Teratology, 29:141–152.

Meltzer, L., Serpa, K., and Christoffersen, C. (1997). Metabotropic glutamate receptor- mediated inhibition and excitation of substantia nigra dopamine neurons. Synapse, 26:184–193.

Meyer-Lindenberg, A., Nichols, T., Callicott, J., Ding, J., Kolachana, B., Buckholtz, J., Mattay, V., Egan, M., and Weinberger, D. (2006). Impact of complex genetic variation in COMT on human brain function. Molecular Psychiatry, 11:867–877.

Mohr, J., Jain, B., and Obermayer, K. (2008a). Molecule kernels: A descriptor- and alignment-free quantitative structure activity relationship approach. Journal of Chemical Information and Modeling. In Press.

Mohr, J., Puls, I., Wrase, J., Hochreiter, S., Heinz, A., and Obermayer, K. (2006). P-SVM variable selection for discovering dependencies between genetic and brain imaging data. In Proceedings of the International Joint Conference on Neural Networks (IJCNN).

Mohr, J., Puls, I., Wrase, J., Priller, J., Behr, J., Kitzrow, W., Makris, N., Breiter, H., Obermayer, K., and Heinz, A. (2008b). Synergistic effects of the dopaminergic and glutamatergic system on hippocampal volume in alcohol-dependent patients. Biological Psychology, 79(1):126–136. The first two authors have contributed equally.

Mohr, J., Puls, I., Wrase, J., Vollstädt-Klein, S., Lemenager, T., Volmert, C., Rapp, M., Obermayer, K., Heinz, A., and Smolka, M. (2008c). A model comparison of COMT effects on central processing of affective stimuli - Advantage of haplotype analysis for genomic imaging? In Review. The first two authors have contributed equally.

Mohr, J., Seo, S., and Obermayer, K. (2008d). Automated microarray classification based on P-SVM gene selection. In Proceedings of the ICMLA ’08: The Seventh International Conference on Machine Learning and Applications, San Diego, CA, USA. The first two authors have contributed equally. In Press.

Mohr, J., Seo, S., Puls, I., Heinz, A., and Obermayer, K. (2008e). Target selection: A new learning paradigm and its application to genetic association studies. In Proceedings of the ICMLA ’08: The Seventh International Conference on Machine Learning and Applications, San Diego, CA, USA. In Press.

Nackley, A. G., Shabalina, S. A., Tchivileva, I. E., Satterfield, K., Korchynskyi, O., Makarov, S. S., Maixner, W., and Diatchenko, L. (2006). Human Catechol-O-Methyltransferase Haplotypes Modulate Protein Expression by Altering mRNA Secondary Structure. Science, 314(5807):1930–1933.

Nagel, B., Schweinsburg, A., Phan, V., and Tapert, S. (2005). Reduced hippocampal volume among adolescents with alcohol use disorders without psychiatric comorbidity. Psychiatry Research, 139:181–190.

Nicodemus, K., Kolachana, B., Vakkalanka, R., Straub, R., Giegling, I., Egan, M., Rujescu, D., and Weinberger, D. (2007). Evidence for statistical epistasis between catechol-O-methyltransferase (COMT) and polymorphisms in RGS4, G72 (DAOA), mGluR3, and DISC1: influence on risk of schizophrenia. Human Genetics, 120:889–906.

Norman, K. A., Polyn, S. M., Detre, G. J., and Haxby, J. V. (2006). Beyond mind-reading: multi-voxel pattern analysis of fMRI data. Trends in Cognitive Sciences, 10(9):424–430.

Oroszi, G. and Goldman, D. (2004). Alcoholism: genes and mechanisms. Pharmacogenomics, 5:1037–1048.

Pastor, M., Cruciani, G., McLay, I., Pickett, S., and Clementi, S. (2000). Grid-independent descriptors (GRIND): A novel class of alignment-independent three-dimensional molecular descriptors. Journal of Medicinal Chemistry, 43:3233–3243.

Peters, J. and Kalivas, P. (2006). The group II metabotropic glutamate receptor agonist, LY379268, inhibits both cocaine- and food-seeking behavior in rats. Psychopharmacology, 186:143–149.

Pezawas, L., Meyer-Lindenberg, A., Drabant, E., Verchinski, B., Munoz, K., Kolachana, B., Egan, M., Mattay, V., Hariri, A., and Weinberger, D. (2005). 5-HTTLPR polymorphism impacts human cingulate-amygdala interactions: a genetic susceptibility mechanism for depression. Nature Neuroscience, 8:828–834.

Pfefferbaum, A. (2004). Alcoholism damages the brain, but does moderate alcohol use? Lancet Neurology, 3:143–144.

Pizzi, M., Consolandi, O., Memo, M., and Spano, P. (1996). Activation of multiple metabotropic glutamate receptor subtypes prevents NMDA-induced excitotoxicity in rat hippocampal slices. European Journal of Neuroscience, 8:1516–1521.

Radloff, L. (1997). The CES-D scale: A self-report depression scale for research in the general population. Appl Psychol Meas., 1:385–401.

Ralaivola, L., Swamidass, S., Saigo, H., and Baldi, P. (2005). Graph kernels for chemical informatics. Neural Networks, 18:1093–1110.

Reischies, F., Neuhaus, A., Hansen, M., Mientus, S., Mulert, C., and Gallinat, J. (2005). Electrophysiological and neuropsychological analysis of a delirious state: the role of the anterior cingulate gyrus. Psychiatry Research, 138:171–181.

Ripley, B. D. (1996). Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge, U. K.

Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386–408.

Rothfuss, A., Steger-Hartmann, T., Heinrich, N., and Wichard, J. (2006). Computational prediction of the chromosome-damaging potential of chemicals. Chemical Research in Toxicology, 19:1313–1319.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323:533–536.

Schölkopf, B. and Smola, A. J. (2002). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge.

Schumann, G., Saam, C., Heinz, A., Mann, K., and Treutlein, J. (2005). The NMDA receptor system: genetic risk factor for alcoholism. Nervenarzt, 76:1355–1362.

Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6:461–464.

Seidman, L., Faraone, S., Goldstein, J., Kremen, W., Horton, N., Makris, N., Toomey, R., Kennedy, D., Caviness, V., and Tsuang, M. (2002). Left hippocampal volume as a vulnerability indicator for schizophrenia: a magnetic resonance imaging morphometric study of nonpsychotic first-degree relatives. Archives of General Psychiatry, 59:839–849.

Serra, J., Thompson, E., and Jurs, P. (2003). Development of binary classification of structural chromosome aberrations for a diverse set of organic compounds from molecular structure. Chem. Res. Toxicology, 16:153–163.

Shen, K. and Johnson, S. (1997). A slow excitatory postsynaptic current mediated by G-protein-coupled metabotropic glutamate receptors in rat ventral tegmental dopamine neurons. European Journal of Neuroscience, 9:48–54.

Sher, K., Grekin, E., and Williams, N. (2005). The development of alcohol use disorders. Annual Review of Clinical Psychology, 1:493–523.

Shigemoto, R., Kinoshita, A., Wada, E., Nomura, S., Ohishi, H., Takada, M., Flor, P., Neki, A., Abe, T., Nakanishi, S., and Mizuno, N. (1997). Differential presynaptic localization of metabotropic glutamate receptor subtypes in the rat hippocampus. Journal of Neuroscience, 17:7503–7522.

Skinner, H. and Horn, J. (1984). Alcohol Dependence Scale: User’s Guide. Addiction Research Foundation, Toronto.

Skinner, H. and Sheu, W. J. (1982). Reliability of alcohol use indices: The Lifetime Drinking History and the MAST. Journal of Studies on Alcohol, 43:1157–1170.

Smolka, M., Bühler, M., and Schumann, G. (2007). Gene-gene effects on central processing of aversive stimuli. Mol Psychiatry, 12:307–317.

Smolka, M. N., Schumann, G., Wrase, J., Grusser, S. M., Flor, H., Mann, K., Braus, D. F., Goldman, D., Buchel, C., and Heinz, A. (2005). Catechol-O-methyltransferase val158met genotype affects processing of emotional stimuli in the amygdala and prefrontal cortex. J Neurosci, 25:836–842.

Snyder, R. and Green, J. (2001). A review of the genotoxicity of marketed pharmaceuticals. Mutation Research, 488:151–169.

Spampanato, M., Castello, M., Rojas, R., Palacios, E., Frascheri, L., and Descartes, F. (2005). Magnetic resonance imaging findings in substance abuse: alcohol and alcoholism and syndromes associated with alcohol abuse. Topics in Magnetic Resonance Imaging, 16:223–230.

Spanagel, R., Pendyala, G., Abarca, C., Zghoul, T., Sanchis-Segura, C., Magnone, M., Lascorz, J., Depner, M., Holzberg, D., Soyka, M., Schreiber, S., Matsuda, F., Lathrop, M., Schumann, G., and Albrecht, U. (2005). The clock gene Per2 influences the glutamatergic system and modulates alcohol consumption. Nature Medicine, 11:35–42.

Spielberger, C., Gorsuch, R., and Lushene, R. (1970). Manual for the State-Trait Anxiety Inventory. Consulting Psychologists Press, Palo Alto, CA.

Stone, M. (1977). Asymptotic equivalence of choice of models by cross-validation and Akaike’s criterion. Journal of the Royal Statistical Society, Series B, 39:44–47.

Sugiura, N. (1978). Further analysis of the data by Akaike’s information criterion and the finite corrections. Communications in Statistics, Theory and Methods, A7:13–26.

Sullivan, E., Marsh, L., Mathalon, D., and Lim, K. (1995). Anterior hippocampal volume deficits in nonamnesic, aging chronic-alcoholics. Alcoholism: Clinical and Experimental Research, 19:110–122.

Sutherland, J. J., O’Brien, L. A., and Weaver, D. F. (2004). A comparison of methods for modeling quantitative structure-activity relationships. Journal of Medicinal Chemistry, 47:5541–5554.

Sutter, A. (2008). Bayer-Schering Pharma (personal communication).

Tan, H., Chen, Q., Sust, S., Buckholtz, J., Meyers, J., Egan, M., Mattay, V., Meyer-Lindenberg, A., Weinberger, D., and Callicott, J. (2007). Epistasis between catechol-O-methyltransferase and type II metabotropic glutamate receptor 3 genes on working memory brain function. In Proceedings of the National Academy of Sciences U.S.A., volume 104, pages 12536–12541.

Taylor, S., Phan, K., Decker, L., and Liberzon, I. (2003). Subjective rating of emotionally salient stimuli modulates neural activity. Neuroimage, 18:650–659.

Tsai, G., Ragan, P., Chang, R., Chen, S., Linnoila, V., and Coyle, J. (1998). Increased glutamatergic neurotransmission and oxidative stress after alcohol withdrawal. American Journal of Psychiatry, 155:726–732.

Vapnik, V. and Chervonenkis, A. (1991). The necessary and sufficient conditions for consistency in the empirical risk minimization method. Pattern Recognition and Image Analysis, 1(3):283–305.

Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. Springer Verlag, New York.

Vapnik, V. N. (1998). Statistical Learning Theory. Adaptive and learning systems for signal processing, communications, and control. Wiley, New York.

Wald, A. (1943). Tests of statistical hypotheses concerning several parameters when the number of observations is large. Transactions of the American Mathematical Society, 54:426–482.

White, A., Mueller, R., Gallavan, R., Aaron, S., and Wilson, A. (2003). A multiple in silico program approach for the prediction of mutagenicity from chemical structure. Mutation Research, 539:77–89.

Williams, C. K. I. and Rasmussen, C. E. (1996). Gaussian processes for regression. In Touretzky, D. S., Mozer, M. C., and Hasselmo, M. E., editors, Advances in Neural Information Processing Systems (NIPS) 8, pages 514–520. MIT Press.

Wittmann, B., Schott, B., Guderian, S., Frey, J., Heinze, H., and Duzel, E. (2005). Reward-related fMRI activation of dopaminergic midbrain is associated with enhanced hippocampus-dependent long-term memory formation. Neuron, 45:459–467.

Wold, S., Ruhe, A., Wold, H., and Dunn, W. (1984). The collinearity problem in linear regression. The partial least squares approach to generalized inverses. SIAM J. Sci. Stat. Comput., 5:735–743.

Wold, S., Sjostrom, M., and Eriksson, L. (2001). PLS-regression: A basic tool of chemometrics. Chemom. Intell. Lab. Syst., 58:109–130.

Wong, D., Maini, A., Rousset, O., and Brasic, J. (2003). Positron emission tomography - a tool for identifying the effects of alcohol dependence on the brain. Alcohol Research & Health, 27:161–173.

Yao, H., Ding, J., Zhou, F., Wang, F., Hu, L., Sun, T., and Hu, G. (2005). Enhancement of glutamate uptake mediates the neuroprotection exerted by activating group II or III metabotropic glutamate receptors on astrocytes. Journal of Neurochemistry, 92:948–961.

Index

3D QSAR methods, 73
AIC, 56
AICc, 59
Akaike information criterion, 56
Akaike weights, 59
alcohol dependence, 24
allele, 8
Ames test, 100
Anchor-GRIND, 77
ANOVA, 17
artificial neural networks, 3
backwards elimination, 18
Bayesian information criterion, 59
BIC, 19, 22, 59
biophobes, 101
biophores, 101
bipod, 80
BOLD, 14, 54, 64
causality, 4
chemical space, 85
chromosome, 7
chromosome aberration test, 100
CoMFA, 77
complex disease, 10
CoMSIA, 77
COMT, 62
conformational analysis, 79
consistency, 71
correct classification rate, 103
coverage, 103
cross-validation, 19, 45, 46
crossover, 8
CV, 19, 46
deoxyhemoglobin, 14
descriptors, 73
diplotype, 9
DNA, 7
dominant model, 10
dual problem, 21
empirical log-likelihood gain, 43
enantiomers, 77
endophenotype, 10
EPI, 14
epistasis, 15
EVA, 77
exon, 7
expected log-likelihood gain, 43
exploratory data analysis, 53
false discovery rate, 20
feature selection, 43
FID, 12
fMRI, 14, 54
free induction decay, 12
functional magnetic resonance imaging, 54
generalization error, 18
genome, 7
genomic imaging, 62
genomics, 8
genotoxicity, 100
genotype, 9
genotype-phenotype association, 5
geometry optimization, 79
grid search, 107
GRIND, 77
haplotype, 9, 54, 63
haplotype block, 9
Hardy-Weinberg equilibrium, 9
HQSAR, 76
hyperparameter, 19, 95
hypothesis-driven analysis, 53
i.i.d., 18
in silico, 100
information criteria, 6
input variable, 18
intelligent data analysis, 1
interaction variable, 16
intron, 7
isomers, 77
Jaccard index, 84
kernel, 74
kernel matrix, 74
KL-divergence, 56
Kullback-Leibler divergence, 42, 56, 57
Larmor frequency, 11
learning machine, 18
leave-one-out cross-validation, 19, 45, 46, 95
life sciences, 1
likelihood ratio test, 54, 55
linear model, 10
linkage, 8
linkage disequilibrium, 9
longitudinal magnetization, 11
longitudinal relaxation time, 12
LOOCV, 19, 45, 46
machine learning, 2, 3
magnetic resonance imaging, 27
major allele, 8
matching bipod alignment, 81
messenger RNA, 7
microarray, 1
minor allele, 8
missing values, 17
MLGPA, 5, 16
model comparison, 66
molecular alignment, 80
molecular formula, 77
molecule, 79
molecule kernel, 6, 75, 79
MRI, 11, 27
mRNA, 7
multi-spin-echo, 13
mutagenicity, 101
mutual information, 42
noise-to-signal ratio, 23
NSR, 23
Occam’s razor, 20, 54
OLS, 22
output variable, 18
overfitting, 20, 40, 54, 63
oxyhemoglobin, 14
P-SVM, 18, 85
permutation test, 19, 46
pharmacophore, 91
phase, 9
phasing, 9
phenotype, 10
pIC50, 93
population association study, 9
precession, 12
PRESS, 95
protein, 7
proteomics, 8
QSAR, 6, 73
radio-frequency pulse, 11
recombination, 9
regularization, 20
regulatory sequences, 7
rf pulse, 11
ribosome, 7
RNA polymerase, 7
rotation matrix, 82
semi-supervised learning, 4
sensitivity, 23, 103
significance level, 55
similarity function, 83
similarity principle, 73
simple disease, 10
single nucleotide polymorphism, 8
singular value decomposition, 81
skeletal formula, 78
SMO, 21, 86
SNP, 8
specificity, 23, 103
spin, 11
spin-echo, 12
spin-lattice relaxation, 12
spin-spin relaxation, 12
splicing, 7
statistical learning theory, 18
stereoisomers, 77
structural formula, 78
structural isomers, 77
structured data, 74
subset selection, 18
supervised learning, 4
support molecule, 86
support variables, 21
support vector machine, 18
target selection, 5, 40
TE, 13
test data, 18
test statistic, 19
time to echo, 13
time to repeat, 13
toxicity, 100
TR, 13
training data, 18
transcription, 7
transcriptomics, 8
transductional learning, 4
translation, 7
transversal magnetization, 12
transversal relaxation time, 12
unsupervised learning, 4
UTR, 7
variable selection, 18