Predicting the Function of Drug-Like Molecules Methods and Applications
Total Page:16
File Type:pdf, Size:1020Kb
Predicting the function of drug-like molecules methods and applications Inaugural-Dissertation to obtain the academic degree Doctor rerum naturalium (Dr. rer. nat.) submitted to the Department of Biology, Chemistry and Pharmacy of Freie Universität Berlin by Hao Wang from Liupanshui, Guizhou province, People’s Republic of China January 2017 Diese Arbeit wurde in dem Zeitraum von November 2012 bis Januar 2017 unter der Leitung von Prof. Dr. Ernst Walter Knapp am Institute für Chemie und Biochemie der Freien Universität Berlin im Fachbereich Biologie, Chemie und Pharmazie durchgeführt. 1. Gutacher: Prof. Dr. Ernst-Walter Knapp 2. Gutacher: Prof. Dr. Gerhard Wolber Disputation am ___16. 3. 2017_________ Statutory Declaration I hereby testify that this thesis is the result of my own work and research, except for any explicitly referenced material, whose source is listed in the bibliography. This work contains material that is the copyright property of others, which must not be reproduced without the permission of the copyright own. Acknowledgement First of all, I would like to thank Prof. Dr. Ernst Walter Knapp, China Scholarship Council, Free University of Berlin and Xiamen University for giving me this precious opportunity to study in Germany. I very much appreciate the supervision of Prof. Dr. Ernst Walter Knapp and valuable discussions during my doctoral work. In addition, I would also like to thank Dr. Bentzien Jörg and Dr. Ingo Mugge in Boehringer-Ingelheim group for their cooperation and valuable suggestions for my research projects. I am thankful for the support and great help of my colleagues in AG Knapp during my doctoral studies. Finally, I also thank Computational Systems Biology research training group (DFG - Graduiertenkolleg 1772) and Dahlem Research School of Free University of Berlin for their financial support and excellent program that allowed me to obtain useful academic skills and to extend my academic visions. For the proofreading of this dissertation, I would like to thank Mr. Jovan Dragelj, Dr. Nadia Elgohobashi-Meinhardt, Dr. Cong Li, and Ms. Wendy Ashleigh Teo. Thanks to Dr. Florian Krull for translating my summary of dissertation into German. 在德国留学四年,生命中太多的事情在里发生。似乎度过了一段超过四年 的时间。离亲人,离熟悉的境。回首四年,需要感谢的人太多。在里, 感谢在柏林和相遇的每一个朋友。和你们一起,收获了很多,也有了无数值得珍 惜的时。其中,可能也有一些在当时特别难过的时刻。但是,在看来,那些难 过都变了快乐回忆的一部分。特别感谢在中国的的爸爸和妈妈,王海生生 和吴枚女士,有的其他亲人们。因为他们支持,才能够如此安心的完的 博士课题。最后,在里也谢谢晓妤。谢谢你在博士课题完前的最后一年,走 了的生命,陪度过了特别难熬的最后一段。 选择读博是人生的历练,特别是在国外读博。完段修行,给了一个不同 的视角去审视人生。也希望一段宝贵的人生财富能为未来人生路带来更多的启 示。 The table of contents 1. Introduction ............................................................................................................................ 1 1.1 Computer-Aided Drug Design ....................................................................................... 3 1.2 Machine Learning Techniques ....................................................................................... 6 1.2.1 Decision Tree ........................................................................................................... 7 1.2.2 Random Forest ......................................................................................................... 9 1.2.3 Artificial Neural Network....................................................................................... 10 1.2.4 Naïve Bayesian Classifier ...................................................................................... 13 2 Methods ................................................................................................................................. 16 2.1 Dataset .......................................................................................................................... 16 2.2 Features ........................................................................................................................ 17 2.3 Cross-validation ........................................................................................................... 18 2.4 Quality measures .......................................................................................................... 19 2.5 Normalization ............................................................................................................... 21 2.6 Linear classifier ............................................................................................................ 22 2.6.1 Linear discriminate function .................................................................................. 22 2.6.2 Training .................................................................................................................. 25 2.6.3 Objective function .................................................................................................. 27 2.6.3.1 Loss function............................................................................................................................... 27 2.7 Gradient descent algorithm .......................................................................................... 28 2.8 Regularization .............................................................................................................. 31 2.8.1 Lasso Regularization ............................................................................................. 31 2.8.2 Ridge regression Regularization ............................................................................ 32 2.9 DemPred ....................................................................................................................... 32 2.10 DemFeature .................................................................................................................. 33 2.10.1 DemFeature-1 ........................................................................................................ 34 2.10.2 DemFeature-2 ........................................................................................................ 35 2.11 Quadratic features ........................................................................................................ 36 2.12 Comparison of the performance between two models ................................................. 37 2.13 Confidence measure ..................................................................................................... 38 3 Applications ......................................................................................................................... 40 3.1 Project 1: Kaggle™ competition .................................................................................. 40 3.1.1 Dataset ................................................................................................................... 42 3.1.2 Results .................................................................................................................... 43 3.1.2.1 Prediction results of DemPred .................................................................................................... 44 3.1.2.1.1 Linear features prediction results .......................................................................................... 45 3.1.2.1.2 Prediction results with quadratic features .............................................................................. 49 3.1.2.2 Prediction results of DemFeature ................................................................................................ 53 3.1.2.2.1 Prediction results of DemFeature-1 ....................................................................................... 53 3.1.2.2.2 Prediction result of DemFeature-2 ........................................................................................ 57 3.1.2.3 Comparison with results from the Kaggle™ competition ........................................................... 62 3.1.2.4 P-values to compare different models ........................................................................................ 63 3.1.2.5 Measure confidence of prediction ............................................................................................... 65 3.1.3 Discussion .............................................................................................................. 66 3.1.4 Conclusion ............................................................................................................. 70 3.2 Project 2: Drug-induced phospholipidosis prediction .................................................. 71 3.2.1 in silico methods assessing the potential of drug inducing PDL ........................... 76 3.2.1.1 Ploemen model for predicting PLD ............................................................................................ 77 3.2.1.2 Pelletier model: modified Ploemen model .................................................................................. 77 I 3.2.1.3 Tomizawa model ......................................................................................................................... 78 3.2.1.4 Hanumegowda model ................................................................................................................. 78 3.2.1.5 SMARTS models ........................................................................................................................ 79 3.2.2 Phospholipidosis database ..................................................................................... 81 3.2.2.1 Goracci phospholipidosis database ................................................................................................... 81 3.2.2.2 Independent phospholipidosis test dataset .......................................................................................