
SVM and a Novel POOL Method Coupled with THEMATICS for

Protein Prediction

A DISSERTATION

SUBMITTED TO THE COLLEGE OF COMPUTER AND INFORMATION SCIENCE

OF NORTHEASTERN UNIVERSITY

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS

FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

By

Wenxu Tong

April 2008

©Wenxu Tong, 2008

ALL RIGHTS RESERVED

Acknowledgements

I have many people to thank. I am most indebted to my advisor, Dr. Ron Williams. He generously allowed me to work on a problem with an idea he developed decades ago and guided me through the research to turn it from a mere idea into a useful system for solving important problems. Without his kindness, wisdom and persistence, this dissertation would not exist.

Another person I am fortunate to have met and worked with is Dr. Mary Jo Ondrechen, my co-advisor. It was she who developed the THEMATICS method on which this dissertation builds. Her guidance and help during my research were critical to me.

I am grateful to all the committee members for reading and commenting on my dissertation. In particular, I am thankful to Dr. Jay Aslam, who provided a great deal of advice for my research, and Dr. Bob Futrelle, who brought me into the field and provided much help in the writing of this dissertation. I also thank Dr. Budil for the time he spent serving on my committee, especially given the difficulties and inconvenience he unfortunately experienced during that time.

I am fortunate to have worked in the THEMATICS group with Dr. Leo Murga, Dr. Ying Wei and my fellow graduate student, soon to be Dr. Heather Brodkin. I also thank Dr. Jun Gong, Dr. Emine Yilmaz and soon-to-be Dr. Virgiliu Pavlu; without their generous help, my journey through the tunnel toward my degree would have been much darker and harder.

I would not be who I am without the love and support of my parents, Yunkun Tong and Xiazhu Wang. Thanks to their raising us and giving us the best education possible through all the hardship they endured, both my sister, Wenyi Tong, and I have received Ph.D. degrees, the highest degree one can earn. I am so grateful to them and proud of them.

Last but definitely not least, I thank Ying Yang, my beloved wife. Without her patience and confidence in me, I cannot imagine having done what I have done. I humbly dedicate this dissertation to her.

Table of Contents

Abstract ...... 12

1 Introduction ...... 13

1.1 THEMATICS and protein active site prediction...... 14

1.2 Machine Learning ...... 15

2. Background and Related Work ...... 18

2.1 Protein Active Site Prediction...... 18

2.2 Machine Learning ...... 20

2.2.1 Commonly used supervised learning methods ...... 20

2.2.2 Probability based approach...... 22

2.2.3 Performance measure for classification problems ...... 23

2.3 THEMATICS ...... 27

2.3.1 The THEMATICS method and its features ...... 27

2.3.2 Statistical analysis with THEMATICS...... 32

2.3.3 Challenges of the site prediction problem using THEMATICS data ...... 34

3.Applying SVM to THEMATICS...... 37

3.1 Introduction...... 38

3.2 THEMATICS curve features used in the SVM...... 38

3.3 Training ...... 40

3.4 Results...... 41

3.4.1 Success in site prediction...... 42

3.4.2 Success in catalytic residue prediction...... 42

3.4.3 Incorporation of non-ionizable residues...... 43

3.4.4 Comparison with other methods...... 49

3.5 Discussion ...... 52

3.5.1 Cluster number and size ...... 52

3.5.2 Failure analysis...... 53

3.5.3 Analysis of high filtration ratio cases...... 53

3.5.4 Some specific examples ...... 56

3.6 Conclusions...... 58

3.7 Next step ...... 58

4. New Method: Partial Order Optimal Likelihood (POOL)...... 61

4.1 Ways to estimate class probabilities...... 61

4.1.1 Simple joint probability table look-up...... 61

4.1.2 Naïve Bayes method ...... 62

4.1.3 The K-nearest-neighbor method...... 63

4.1.4 POOL method ...... 63

4.1.5 Combining CPE's...... 64

4.2 POOL method in detail ...... 65

4.2.1 Maximum likelihood problem with monotonicity assumption ...... 65

4.2.2 Convex optimization and K.K.T. conditions ...... 67

4.2.3 Finding Minimum Sum of Squared Error (SSE) ...... 69

4.2.4 POOL algorithm ...... 71

4.2.5 Proof that the POOL algorithm finds the minimum SSE...... 73

4.2.6 Maximum likelihood vs. minimum SSE...... 80

4.3 Additional computational steps...... 86

4.3.1 Preprocessing ...... 86

4.3.2 Interpolation...... 87

5. Applying the POOL Method with THEMATICS in Protein Active Site Prediction ...... 88

5.1 Introduction...... 89

5.2 THEMATICS curves and other features used in the POOL method ...... 90

5.3 Performance measurement ...... 94

5.4 Computational procedure ...... 96

5.5 Results...... 99

5.5.1 Ionizable residues using only THEMATICS features...... 99

5.5.2 Ionizable residues using THEMATICS plus cleft information...... 102

5.5.3 All residues using THEMATICS plus cleft information...... 107

5.5.4 All residues using THEMATICS, cleft information and sequence conservation, if applicable...... 110

5.5.5 Recall-filtration ratio curves...... 115

5.5.6 Comparison with other methods...... 117

5.5.7 Rank of the first positive...... 124

5.6 Discussion ...... 126

6. Summary and Conclusions...... 130

6.1 Contributions...... 132

6.2 Future research ...... 134

Appendices ...... 136

Appendix A. The training set used in THEMATICS-SVM ...... 136

Appendix B. The test set used in THEMATICS-SVM...... 138

Appendix C. The 64 protein testing set used in THEMATICS-POOL...... 147

Appendix D. The 160 protein testing set used in THEMATICS-POOL...... 151

Bibliography...... 161

List of Figures

Figure 2.1 Titration curves...... 30

Figure 3.1 The success rate for site prediction on a per-protein basis ...... 46

Figure 3.2 Distribution of the 64 proteins across different values for the filtration ratio...... 48

Figure 3.3 Recall-false positive rate plot (ROC curves) of SVM versus other methods...... 51

Figure 3.4 SVM prediction for protein1QFE...... 56

Figure 3.5 The SVM prediction for 2PLC...... 57

Figure 4.1 Three cases of G in relation to the convex cone of constraints...... 76

Figure 5.1 Averaged ROC curve comparing POOL(T4), Wei’s statistical analysis and Tong’s SVM using THEMATICS features...... 101

Figure 5.2 Averaged ROC curves comparing different methods of predicting ionizable active site residues using a combination of THEMATICS and geometric features of ionizable residues only...... 103

Figure 5.3 Averaged ROC curve comparing POOL methods applied to ionizable residues only CHAIN(TION, G) and to all residues CHAIN(TALL, G)...... 109

Figure 5.4 Averaged ROC curves comparing different methods of combining THEMATICS, geometric and sequence conservation features of all residues...... 113

Figure 5.5 Averaged RFR curve for CHAIN(T, G, C) on the 160 protein test set...... 116

Figure 5.6 ROC curves comparing CHAIN(T, G), CHAIN(T, G, C) and Petrova’s method...... 122

Figure 5.7 Histogram of the first annotated active site residue...... 125

List of Tables

Table 2.1 Confusion matrix of classification labeling...... 25

Table 3.1 Performance of the SVM predictions alone versus the SVM regional predictions that include all residues within a 6Å sphere of each SVM-predicted residue...... 44

Table 3.2 Comparison of THEMATICS-SVM and other methods...... 50

Table 5.1 Wilcoxon signed-rank tests between methods shown in figure 5.2 ...... 105

Table 5.2 Wilcoxon signed-rank tests between methods shown in figure 5.4 ...... 114

Table 5.3 Comparison of sensitivity, precision, and AUC of CHAIN(T, G, C) with Youn’s reported results for proteins in the same family, super family, and fold...... 120

Table 5.4 Comparison of CHAIN(T, G) and CHAIN(T, G, C) with Petrova’s method. ....121

Table 5.5 Comparison of CHAIN(T, G) and CHAIN(T, G, C) with Xie’s method...... 123

List of Abbreviations

ANN  Artificial Neural Network
ASA  Area of Solvent Accessibility
AUC  Area Under the Curve
AveS  Averaged Specificity
CPE  Class Probability Estimator
CSA  Catalytic Site Atlas
E.C. number  Enzyme Commission number
H-H equation  Henderson-Hasselbalch equation
K.K.T. conditions  Karush-Kuhn-Tucker conditions
k-NN  k-Nearest Neighbor method
MAP  Maximum a posteriori
MAS  Mean Average Specificity
MCC  Matthews Correlation Coefficient
ML  Maximum Likelihood
PDB  Protein Data Bank
POOL  Partial Order Optimal Likelihood
RFR curve  Recall-Filtration Ratio curve
ROC curve  Receiver Operating Characteristic curve
SSE  Sum of Squared Errors
SVM  Support Vector Machine
THEMATICS  Theoretical Microscopic Titration Curves
VC dimension  Vapnik-Chervonenkis dimension

Abstract

Protein active site prediction is a very important problem in bioinformatics. THEMATICS is a simple and effective method, based on the special electrostatic properties of ionizable residues, for predicting such sites from protein three-dimensional structure alone. The process involves distinguishing computed titration curves with perturbed shape from normal ones; the differences are subtle in many cases. In this dissertation, I develop and apply special machine learning techniques to automate the process and achieve higher sensitivity than other methods while maintaining high specificity. I first present the application of support vector machines (SVMs) to automate active site prediction using THEMATICS; at the time this work was developed, it achieved better performance than any other 3D-structure-based method. I then present the more recently developed Partial Order Optimal Likelihood (POOL) method, which estimates the probabilities of residues being active under certain natural monotonicity assumptions.

The dissertation shows that applying the POOL method just on THEMATICS features outperforms the

SVM results. Furthermore, since the overall approach is based on estimating certain probabilities from labeled training data, it provides a principled way to combine the use of THEMATICS features with other non-electrostatic features proposed by others. In particular, I consider the use of geometric features as well, and the resulting classifiers are the best structure-only predictors yet found. Finally, I show that adding in sequence-based conservation scores, where applicable, yields a method that outperforms all existing methods while using only whatever combination of structure-based or sequence-based features is available.

Chapter 1

Introduction

This dissertation employs both standard and novel machine learning techniques to automate one aspect of

the problem of protein function prediction from the three-dimensional structure. In addition to applying

established techniques, particularly the support vector machine (SVM), I introduce a novel method, called partial order optimal likelihood (POOL), to perform the task of selection of functionally important sites in protein structures. In my approach to the protein function prediction problem, I start with just the 3D structure of proteins and use THEMATICS, one of the most effective methods which focuses on electrostatic features of residues. Later, I also add some geometric features, and then the conservation of residues among homologous sequences, when available, into our system to achieve better results.

1.1 THEMATICS and protein active site prediction.

Function prediction (predicting protein function from protein structure) is an important and challenging

task in genomics and proteomics 1. More and more protein structures have been deposited in the PDB

(Protein Data Bank) database, many with unknown functions. As of this writing, there are over 3600 protein structures in the PDB of unknown or uncertain function. The recent development of generating structures from proteins expressed from gene sequences using high throughput methods 2-6 only makes

effective and efficient function prediction even more important, as most of these Structural Genomics

proteins are of unknown function.

Determination of active sites, including enzyme catalytic sites, ligand binding sites, recognition epitopes,

and other functionally important sites is one of the keys to protein function prediction.

In addition, the importance of site prediction goes beyond predicting active sites for proteins with

unknown function. Even for a protein with known function, it is not necessarily true that the active site of

that protein is fully or partially characterized. Correctly finding the active site of a protein is always a

prerequisite to understanding the protein’s catalytic mechanism. It also opens the door to the design of

ligands to inhibit, activate, or otherwise modify the protein’s function. Protein engineering applications to

design a protein of particular functions 7-9 also require knowledge of the proper features needed to create a

functioning active site.

Because of its importance in genomics and proteomics, many different methods have been developed to

predict the active site of a protein 10-21. We will survey some of these methods in a later section. But among them, there is one particular method, namely THEMATICS (Theoretical Microscopic Titration

Curves), which is powerful, accurate and precise 16, 22-24. Based on protein 3D structure alone, it can predict correct active sites that are highly localized in small regions of the proteins’ structures.

The details of the THEMATICS method will be given later, but the key point of this method is that it takes advantage of the special chemical and electrostatic properties of active site residues, since active site residues tend to have anomalous titration behavior; THEMATICS generates the titration curves of ionizable residues of a protein. In its original formulation the presence of two or more residues with perturbed titration curves in physical proximity is considered a reliable predictor of the active sites for proteins.

For the THEMATICS method to work well, one needs a criterion to distinguish the perturbed titration curves from normal, unperturbed ones, which is not a trivial task. My work uses machine-learning technology to automate this process. With the SVM, I approached this problem as a classification task, predicting each residue as either an active site residue or not. Later, I developed the

POOL method to solve this problem by rank-ordering the residues in a protein according to their probability of being in an active site, based on how perturbed their titration curves are in addition to some other 3D-structure-based information. Later still, sequence conservation information, if available, was added as well.

1.2 Machine Learning.

Machine learning is a well-developed field in computer science. There are many types of tasks, ranging

from reinforcement learning 25 to more passive forms of learning, like supervised learning and unsupervised learning 26. A typical supervised learning task is to learn a function or classifier from a set of training data. If the output of the function is a label from some finite set of classes, it is called a classification problem. If the output is a continuous value, it is called a regression problem. A training set

consists of a set of training examples, i.e. pairs of input-output vectors. The machine-learning problem is typically an optimization problem: generate a function that, given a valid input, produces an output that generalizes from the seen training data in a “reasonable” way. The quantity to be kept small is the generalization error, which is the error a trained machine will make on unseen data drawn from the same distribution as the population. For an unsupervised learning task, there is no labeled training data. Usually, the task in unsupervised learning is to cluster the observed data according to some criteria, or to fit a model to represent the observed data. I will focus on supervised learning; here my learning task is essentially a classification task. As will be described below, in one part of this work the goal will be to estimate actual class probabilities, which in some respects is like a regression problem.

Chapter 2

Background and Related Work


2.1 Protein Active Site Prediction.

Since the main focus of this dissertation is in Computer Science, I will just briefly survey some of the

methods used for this application, to serve as a background for the method comparison later in my

dissertation.

There are two major classes of methodology used to predict protein active sites. Almost all current

methods in active site prediction use one of them or a mix of both approaches.

The first methodology is based on sequence comparison, or evolutionary information derived from

sequence alignments. The rationale is that active sites of a protein are important regions in the proteins,

and that the amino acids, termed residues, in active sites therefore should be more conserved throughout

evolution than some other regions of the protein. If we can find highly conserved regions among

sequences in similar proteins from different sources (species/tissues), or even in different proteins but

with similar functions, most likely active sites should consist of subsets within these regions. This is a

valid assumption and indeed, many methods have been developed based on this approach, such as

ConSurf 27, Rate4Site 28, and others 29-32.

However there are two drawbacks to this approach.

First, in order to use this method, there have to be at least 10, and preferably 50, different protein

sequences with certain degrees of similarity in order to get reliable results. The method does not work

well if the similarities between sequences are either too high or too low. There are studies showing that

sequence-based methods can transfer reliably the extracted functional information only when applied to

sequences with sequence identity of at least 40% 33, 34. This drawback makes the method unsuitable for many proteins, particularly Structural Genomics proteins, since they often do not have enough similar sequences within a suitable range of similarities.

Second, although most active site residues tend to be conserved through evolution, it is certainly not true

that all conserved regions of a protein are active sites. Residues in protein sequences can be conserved

for a variety of reasons, not just because of involvement in active sites. One well-known counterexample

is the set of residues that stabilize the structure of the protein; they are so important to the protein that

once mutated, the protein will not have the proper structure to perform its function. These residues will

be conserved among different protein homologues, even if they are not active sites. Therefore typically

sites predicted from sequence based methods are non-local, spanning a much larger area than the true

active site. Another difficulty arises for cases where an active site region in a protein is less conserved

than other regions of the protein, especially when the function and/or substrate of the proteins in the class

are somewhat versatile.

The second methodology is structure-based active site prediction. There are different properties that have

been studied and used in different methods, such as electrostatic properties as in THEMATICS 16,

residue interaction as in the graph theoretic method SARIG 35, van der Waals binding energy of a probe

molecule as in Q-site Finder 20, geometric cleft location as in surfNet 36 and castP 37, and a geometric shape descriptor termed geometric potential 38.

There are also studies that combine results from different methods, employing either statistical or

machine-learning techniques. Among all such studies, I list a few examples that either use similar residue

properties or similar machine learning methods as I used in my earlier and current study. P-cats 21 uses a k-nearest-neighbor method to smooth the joint probability lookup table; a study by Gutteridge uses a neural network and spatial clustering to predict the location of active sites 18; Petrova’s work uses a

support vector machine (SVM) to predict catalytic residues 39; and Youn’s work uses a support vector machine to predict catalytic residues in proteins 40. All these methods use both sequence conservation and

3D structural information.

Depending on the properties that these methods are based on, the computational cost and accuracy varies.

Among all these methods, THEMATICS is the most accurate to date. The computational cost is acceptable. To analyze a typical protein using THEMATICS takes less than an hour on a desktop PC, although actual CPU times depend on protein size. The details of this method will be explained in a later section.

Although THEMATICS is the most effective and accurate method among these when used on its own, it is natural to consider whether the predictions can be improved by using additional information. I examine this using both geometric and conservation information, and find that this is indeed the case.

2.2 Machine Learning.

Machine learning is a very broad subfield of artificial intelligence. It is almost impossible to survey the whole area in this dissertation. Here, I focus on just supervised learning, mostly classification. Even this area is still too broad, and I will briefly introduce the framework and some of the most commonly used methods with their basic principles.

2.2.1 Commonly used supervised learning methods.

The first method is called the artificial neural network (ANN) or just neural network (NN) 26, 41. It is based on a computational model of a group of interconnected nodes (neurons), akin to the nervous system in humans. Each neuron has a certain number of inputs and typically one output. The input to a neuron can be either the features of the input data or the outputs of other neurons. The output of one neuron can serve as input to multiple neurons. Typically, there is a weight associated with each input of each neuron. At processing time (classifying a query instance), each neuron in the network takes its inputs, computes their weighted sum using the associated weights, and generates its output by applying some nonlinear function f. There are different flavors of ANN structure, such as feedforward versus recurrent networks. During training time, a cost function is defined to estimate the accuracy of the ANN with respect to the data, essentially a measurement of how much error the current ANN makes on the training data. The learning process is to find the optimal setting of the structure and/or weights of the

ANN to minimize the cost function on the training data. ANN in general is a very powerful method, and it has been used in numerous applications, including protein active site prediction 18. One drawback is that ANN is a somewhat black-box method, meaning that although one may find a very good classifier, the structure of the network and the weights associated with each input may not reveal much useful information about why it works.
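For concreteness, here is a minimal sketch of the forward computation of a single neuron as described above; the logistic sigmoid is just one common choice for the nonlinear function f, not a claim about any particular network used in the cited studies.

```python
import numpy as np

def neuron_output(inputs, weights, bias):
    """One artificial neuron: weighted sum of the inputs followed by a
    nonlinear activation f (here the logistic sigmoid)."""
    z = np.dot(weights, inputs) + bias
    return 1.0 / (1.0 + np.exp(-z))

# Example: a neuron with three inputs.
print(neuron_output(np.array([0.2, -1.0, 0.5]),
                    np.array([1.5, 0.3, -0.8]),
                    bias=0.1))
```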

Another popular and intuitively appealing method is the nearest neighbor method, or its more general form, the k-nearest-neighbor method (k-NN) 42. The principal idea is to classify the query instance based on its k nearest neighbors among the training set, which represent the most “similar” training instances.

The success of this method relies on several choices; k and the distance function that defines what

“similar” means are among the most important ones. The training or learning process of this method is somewhat different from most other machine learning techniques. In most cases, instead of solving an optimization problem, it uses cross validation directly to select the best k and best distance function.

However, this method and the naïve Bayes method, which will be discussed later, are both susceptible to the presence of correlated features. The method is also susceptible to the presence of noisy and irrelevant features.
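A minimal k-NN sketch under these choices follows; Euclidean distance stands in for the "similarity" measure, and both k and the distance function are assumptions that would normally be chosen by cross validation as noted above.

```python
import numpy as np
from collections import Counter

def knn_classify(query, X_train, y_train, k=5):
    """Classify `query` by majority vote among its k nearest training
    instances under Euclidean distance."""
    dists = np.linalg.norm(X_train - query, axis=1)   # distance to every training instance
    nearest = np.argsort(dists)[:k]                   # indices of the k closest
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]
```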

The support vector machine (SVM) 43, 44 is a relatively recently developed method. A one-sentence description of this method is that it uses the “kernel trick” to find the best linear separator (hyperplane) in kernel space for instances which are not linearly separable in feature space. There are two major advantages of SVM. First, unlike ANN or some other classification techniques, the linear separator the SVM finds is not merely one that successfully separates the two classes (in the hard-margin case) or that makes the “fewest” errors (in the soft-margin case), but the best among all such separators. The reason it finds the “best” among all possible good separators is that it maximizes the margin, which measures how “far” the separator can move without making more mistakes on the training data, and there is a rigorous proof that a classifier giving the maximum margin tends to make the fewest errors on the testing data. The second advantage is that the “kernel trick” maps instances in the original feature space to instances in the kernel space; in certain cases instances that are linearly inseparable in feature space become linearly separable in kernel space, the kernel transform is easy to compute, and the explicit form of the mapping function between the two spaces is not required to be known. To successfully use this method,

one needs to select the right kernel. There are commonly used kernels, but to take full advantage of using

and developing kernels is not a trivial task. I have applied this method to protein active site prediction

with success 45.

Last, I will mention boosting, another method developed quite recently which has achieved a lot of

success 46, 47. Boosting is a meta-learning algorithm. Boosting occurs in stages, by incrementally adding

to the current learned function. At every stage, a weak learner (i.e., one that has accuracy greater than

chance) is trained with the data. The output of the weak learner is then added to the learned function, with

some strength (proportional to how accurate the weak learner is). Then, the data is re-weighted: examples

that the current learned function gets wrong are "boosted" in importance, so that future weak learners will

attempt to fix the errors. If every weak learner is guaranteed to perform better than random guessing, boosting can quickly find a learned function whose training error falls below any preset threshold. It is a very powerful method for combining different learners into one “super system”.

There is also a well-developed mathematical theory showing that boosting can lower generalization error.
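The reweighting scheme described above can be sketched as a generic AdaBoost-style loop. This is an illustrative sketch only, not the implementation of any cited study; `train_weak` is a hypothetical helper returning any classifier with better-than-chance weighted accuracy, and labels are assumed to be in {-1, +1}.

```python
import numpy as np

def adaboost(X, y, train_weak, n_rounds=10):
    """Minimal AdaBoost sketch: y in {-1, +1}; train_weak(X, y, w) must
    return a classifier h whose h.predict(X) beats chance under weights w."""
    n = len(y)
    w = np.full(n, 1.0 / n)                 # example weights, initially uniform
    ensemble = []                           # list of (alpha, weak_learner) pairs
    for _ in range(n_rounds):
        h = train_weak(X, y, w)
        pred = h.predict(X)
        err = np.sum(w * (pred != y)) / np.sum(w)
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))   # learner strength
        w *= np.exp(-alpha * y * pred)      # "boost" the misclassified examples
        w /= w.sum()
        ensemble.append((alpha, h))
    return ensemble

def predict(ensemble, X):
    score = sum(a * h.predict(X) for a, h in ensemble)
    return np.sign(score)
```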

2.2.2 Probability based approach.

A lot of machine learning work overlaps with statistics, where probability is used to classify instances

directly. The first idea introduced here is Bayesian inference, which is based on Bayes’ theorem 26. Bayes’ theorem may be written as:

P(A | B) = P(B | A) ∗ P(A) / P(B)

The probability of an event A conditional on another event B is generally different from the probability of

B conditional on A. However, there is a definite relationship between the two, and Bayes' theorem is the

statement of that relationship. P(A) and P(B) are called prior probabilities, and P(A|B) is called the conditional probability of A given B.

Bayes’ theorem is important to a number of applications, including several different places in the present problem. One way to formulate our problem is to generate the hypothesis H that gives the probability that a residue with certain features is in the active site, based on the observed data (D), or training examples, i.e., P(H|D). Take H here to be a look-up table giving the probability of positive for different feature values x. To use such an H at query time, just go to the entry that matches x and read off the corresponding probability. Usually the number of ways to fill out the look-up table H is infinite, so how should one choose? One simple answer is to choose the H that gives the largest P(D|H); this is called the maximum likelihood (ML) hypothesis. Taking P(H), the prior probabilities of H, into consideration, one could pick the H that gives the largest P(D|H)P(H); this gives the maximum a posteriori (MAP) hypothesis. Notice that ML is equivalent to MAP with a “flat prior”, i.e., when the prior probabilities of all hypotheses under consideration are the same. The POOL method I introduce below for active site prediction is MAP with all hypotheses satisfying the monotonicity constraints having a flat prior, while all other hypotheses have a prior probability of 0. Both the ML and MAP methods pick out the “most favorable” hypothesis H* out of all possible ones based on the data, and use it to predict the “most probable class” of new query data: given query data q, the probability that q is in class c is P(C=c | q, H*), and the objective is to find the particular c that gives the largest P(C=c | q, H*). There is another way to do prediction at query time, namely to consider the predictions from all possible hypotheses and sum the results with the probability of each hypothesis as weights: P(C=c|q, D) = ∑P(C=c|q, Hi, D)*P(Hi|D). This is called the Bayes classifier, which gives an optimal result, but is often difficult to compute in practice.
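As a toy illustration of the look-up-table hypothesis discussed above (not the POOL construction itself, which is developed in Chapter 4): for this hypothesis class, the H that maximizes P(D|H) simply records, in each bin of the feature value, the empirical fraction of positive training examples. The binned feature values below are hypothetical.

```python
from collections import defaultdict

def ml_lookup_table(features, labels):
    """ML estimate of P(positive | binned feature value): the likelihood of
    the data is maximized by the empirical positive fraction in each bin."""
    counts = defaultdict(lambda: [0, 0])      # bin -> [positives, total]
    for x, y in zip(features, labels):
        counts[x][0] += int(y == 1)
        counts[x][1] += 1
    return {b: pos / tot for b, (pos, tot) in counts.items()}

# Query time: read off P(positive | x) for the bin the query falls in.
table = ml_lookup_table([0, 0, 1, 1, 1, 2], [0, 0, 0, 1, 1, 1])
print(table)   # {0: 0.0, 1: 0.666..., 2: 1.0}
```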

2.2.3 Performance measure for classification problems.

In the context of binary classification tasks, the terms true positive, true negative, false positive and false negative are used to compare the classification given to an item (the class label assigned to the item by a classifier) with the desired correct classification (the class the item actually belongs to). This is illustrated by the confusion matrix below:


                                 Predicted classification
                                 Positive                 Negative
Actual           Positive        True Positive (TP)       False Negative (FN)
classification   Negative        False Positive (FP)      True Negative (TN)

Table 2.1. Confusion matrix of classification labeling.


In the confusion matrix above, TP, FP, FN and TN are the numbers of true positive, false positive, false negative and true negative instances, respectively. Although all the information about a classifier’s performance is contained in the confusion matrix, people tend to use other measurements derived from it to compare the performance of classifiers. Some commonly used ones are listed below:

recall = sensitivity = TP / (TP + FN)

specificity = TN / (FP + TN)

precision = positive predictive value = TP / (TP + FP)

negative predictive value = TN / (TN + FN)

false positive rate = 1 − specificity = FP / (TN + FP)

accuracy = (TP + TN) / (TP + TN + FP + FN)

error = 1 − accuracy = (FP + FN) / (TP + TN + FP + FN)

filtration ratio = (TP + FP) / (TP + TN + FP + FN)

MCC = (TP ∗ TN − FP ∗ FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))

Most of the measurements above are very straightforward and can be treated as mere definitions with no need for further explanation, with the exception of the filtration ratio and the Matthews correlation

coefficient (MCC). I invented the term filtration ratio, to be used in place of precision and false positive rate in the present problem. One of the advantages of the filtration ratio is that it is the only measurement listed above that can be determined from the predicted classification alone, without knowledge of the actual classification, which is always unknown in practice. Another reason I invented and use the filtration ratio is that in the present problem, the information available in the literature about actual positives is incomplete, even in our training and testing datasets. Thus a certain fraction of the nominal false positives probably are not false. In situations like ours, where one expects true positives to represent a fairly small proportion of all the instances and the measured false positive rate is suspect, using the filtration ratio is more appropriate than some other measurements, such as precision. Both of these issues will be discussed in more detail later in the dissertation. On the other hand, the MCC is widely used in machine learning as a measure of the quality of binary classifications. It takes into account both sensitivity and selectivity and is generally regarded as a balanced measure which can be used even if the classes are of very different sizes. It returns a value between -1 and +1: a coefficient of +1 represents a perfect prediction, 0 an average random prediction and -1 the worst possible prediction. While there is no perfect way of describing the confusion matrix of true and false positives and negatives by a single number, the MCC is generally regarded as one of the best such measures. In the denominator of the

MCC, if any of the four sums is zero, the denominator can arbitrarily be set to one; this results in an MCC of zero, which can be shown to be the correct limiting value 48.
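The following sketch computes the measurements listed above directly from the four confusion-matrix counts, including the zero-denominator convention for the MCC; it assumes the other individual denominators are nonzero.

```python
import math

def classification_measures(tp, fp, fn, tn):
    """Compute the standard measures from confusion-matrix counts."""
    total = tp + fp + fn + tn
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "recall / sensitivity":      tp / (tp + fn),
        "specificity":               tn / (fp + tn),
        "precision":                 tp / (tp + fp),
        "negative predictive value": tn / (tn + fn),
        "false positive rate":       fp / (tn + fp),
        "accuracy":                  (tp + tn) / total,
        "error":                     (fp + fn) / total,
        "filtration ratio":          (tp + fp) / total,
        # MCC convention: if any sum in the denominator is zero, set the
        # denominator to one, giving the limiting value of zero.
        "MCC": (tp * tn - fp * fn) / (denom if denom != 0 else 1.0),
    }
```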

2.3 THEMATICS

I am going to discuss the THEMATICS method in more detail because this is the basis for most of the input data in the work of this dissertation.

2.3.1 The THEMATICS method and its features.

In the application of THEMATICS, one begins with the 3D structure of the query protein, solves the

Poisson-Boltzmann (P-B) equations using well-established methods 49-52, and then performs a Monte

Carlo procedure to compute the proton occupation of each ionizable residue as a function of pH.

Each such function is called a titration curve, as shown in figure 2.1 (a).

From the theoretical titration curves computed from the 3D structure of a query protein, THEMATICS identifies residues (amino acids) that exhibit significant deviation from Henderson-Hasselbalch (H-H) behavior, which I now describe.

A typical ionizable residue in a protein obeys the H-H equation, which may be expressed as a proton occupation O as a function of pH as:

O(pH) = (10^(pH − pKa) + 1)^(−1)    (1)

For the residues that form a cation upon protonation (Arg, His, Lys, and the N-terminus), the mean net charge C on a particular residue is equal to O, whereas for the residues that form an anion upon deprotonation (Asp, Cys, Glu, Tyr, and the C-terminus), the mean net charge is given by (O − 1), as:

C( pH ) = O( pH ) cationic (2)

C( pH ) = O( pH) −1 anionic (3)

Note that C represents the average net charge on a particular residue for a large ensemble of protein molecules. Equations (1) - (3) have the sigmoid shape that is typical of a weak acid or base that obeys the

H-H equation. Thus, as pH increases, the predicted average charge falls sharply in a pH range close to the pKa, which is defined as the pH at which that residue is protonated in exactly half of the protein molecules in the ensemble.
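For reference, a short sketch of equations (1)–(3) follows, giving the ideal H-H proton occupation and mean net charge over a pH range. This is illustrative only; the actual THEMATICS curves come from the Poisson-Boltzmann and Monte Carlo calculations described above, not from this closed form.

```python
import numpy as np

def hh_occupation(ph, pka):
    """Equation (1): ideal Henderson-Hasselbalch proton occupation."""
    return 1.0 / (10.0 ** (ph - pka) + 1.0)

def hh_mean_charge(ph, pka, cationic):
    """Equations (2)-(3): mean net charge for cationic (Arg, His, Lys,
    N-terminus) vs. anionic (Asp, Cys, Glu, Tyr, C-terminus) residues."""
    occ = hh_occupation(ph, pka)
    return occ if cationic else occ - 1.0

ph = np.linspace(0, 14, 141)
charge_his = hh_mean_charge(ph, pka=6.0, cationic=True)    # falls from +1 toward 0
charge_glu = hh_mean_charge(ph, pka=4.4, cationic=False)   # falls from 0 toward -1
```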

Underlying the THEMATICS approach is the observation that the computed titration curves tend to deviate more from this H-H shape for ionizable residues belonging to active sites than for ionizable residues not belonging to such sites. The key step in the application of the THEMATICS approach is thus recognizing significant deviation from H-H behavior in the shape of these predicted titration curves.

When THEMATICS was first developed, visual inspection of the computed curves was used to identify

THEMATICS positive residues. Although simple, it is inefficient, vulnerable to bias, and in some cases ineffective, since some deviations of the curves are subtle and not easily recognized visually, as indicated by figure 2.1(b).



Figure 2.1. Titration curves. (a) A standard HH curve (black), a typical perturbed curve (red), and a typical unperturbed curve from residues not in active site (blue). (b) Titration curves from active site residues (red) versus non-active-site residues (green) from a set of 20 proteins (Appendix A); only glutamate residues are shown.

Before introducing the methods for automation of the classification of residues using THEMATICS, I will first present the feature extraction process.

In order to perform any machine-learning or statistical analysis on titration curves, one needs to find features that are easy to compute and effective at distinguishing positive from negative instances.

I have defined features that may be used to measure the deviation of a particular titration curve from H-H behavior. In particular, four features extracted from the titration curves are most useful in separating

THEMATICS positive residues from the others.

These four features are based on the first four moments of the derivatives of the titration curves, as I now describe briefly. A more detailed description can be found in Ko’s study 24.

Define the variable x to be the offset of the pH from the pKa, as:

x = pH − pKa ,    (4)

Then equation (1) for Henderson-Hasselbalch titration curves becomes:

O(x) = (10^x + 1)^(−1) .    (5)

The key observation on which the moment analysis is based is that, for any titration curve O(x), whether of Henderson-Hasselbalch form or not, the corresponding derivative

f(pH) = −dO/dx = −dO/d(pH)    (6)

is effectively a probability density function (ignoring those rare cases when the titration curve fails to be a non-increasing function of x, in which case this derivative function takes on negative values) 53.

The nth moment of f is defined as

Mn = ∫ (pH)^n f(pH) d(pH)    (7)

and the corresponding nth central moment µn is

µn = ∫ (pH − M1)^n f(pH) d(pH)    (8)

where these integrals are over all space (−∞ to +∞).

The features I use are based on the first moment M1 and the second, third, and fourth central moments µ2,

µ3, and µ4, respectively, of the derivatives f. For a pure H-H titration curve these moments are

M1 = pKa , µ2 = 0.620, µ3 = 0, and µ4 = 1.62. (9)

It is interesting to note that, for an arbitrary probability density function, M1 is its mean and µ2 is its variance, while µ3 and µ4 are related to the skewness and kurtosis, respectively, standard quantities used in statistics to measure departure from normality.

When applied to a general titration curve of ionizable residues in a protein, the pKa shift is closely related to how much M1 differs from the free-solution pKa. Those residues that interact strongly with other ionizing residues in such a way that the predicted titration functions O(pH) are elongated will have broader first derivative functions f and thus have generally higher values for µ2 and especially µ4. The moment µ3 measures the asymmetry of the function f and has a nonzero value for any residue that interacts with other ionizing groups in such a way that the strength of this interaction in the range pH < pKa is different from that in the range pH > pKa.

Thus it is clear that the first moment and the second, third, and fourth central moments are useful measures for determining deviation from H-H behavior.
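A minimal numerical sketch of equations (6)–(8) is given below. It assumes the titration curve is sampled on a pH grid and is essentially non-increasing, so that f behaves as a probability density; on an ideal H-H curve it should approximately reproduce the values in equation (9).

```python
import numpy as np

def titration_moments(ph, occ):
    """Compute M1 and central moments mu2, mu3, mu4 of f = -dO/d(pH)
    from a sampled titration curve (equations (6)-(8))."""
    f = -np.gradient(occ, ph)              # derivative, eq. (6)
    f = np.clip(f, 0.0, None)              # ignore rare negative excursions
    f /= np.trapz(f, ph)                   # normalize to a density
    m1 = np.trapz(ph * f, ph)              # first moment, eq. (7)
    mu2, mu3, mu4 = (np.trapz((ph - m1) ** n * f, ph) for n in (2, 3, 4))
    return m1, mu2, mu3, mu4

# Sanity check on an ideal H-H curve with pKa = 7: expect roughly
# M1 = 7, mu2 = 0.620, mu3 = 0, mu4 = 1.62 (equation (9)).
ph = np.linspace(0, 14, 2801)
occ = 1.0 / (10.0 ** (ph - 7.0) + 1.0)
print(titration_moments(ph, occ))
```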

The methods introduced below all use some of the four features described above, with some additional features in some specific methods.

2.3.2 Statistical analysis with THEMATICS.

One automated analysis was proposed and studied by Ko 24. Ko introduced simple statistical metrics to automatically evaluate the degree of perturbation of a titration curve from H-H behavior. The method is simple, just looking at two of the above features, namely µ3 and µ4. A statistical Z-score was computed on

these features; i.e., for every curve analyzed, the deviations of the µ3 and µ4 values from the mean, expressed in units of the standard deviation, over all curves from the same protein were computed. Any residue whose titration curve had an absolute value of the Z-score of µ3 or of µ4 greater than

1.0 was classified as a THEMATICS positive residue and any such residues with at least one other

THEMATICS positive residue located within 9Å were reported as active site candidates.
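A sketch of the thresholding rule just described is shown below (standard Z-scores of µ3 and µ4 over all ionizable residues of one protein, cutoff 1.0); the subsequent 9Å spatial clustering step is omitted.

```python
import numpy as np

def ko_positive_residues(mu3, mu4, cutoff=1.0):
    """Flag residues whose |Z(mu3)| or |Z(mu4)| exceeds the cutoff, where the
    Z-scores are computed over all ionizable residues of a single protein."""
    mu3 = np.asarray(mu3, dtype=float)
    mu4 = np.asarray(mu4, dtype=float)
    z3 = (mu3 - mu3.mean()) / mu3.std()
    z4 = (mu4 - mu4.mean()) / mu4.std()
    return (np.abs(z3) > cutoff) | (np.abs(z4) > cutoff)
```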

Good results were obtained for the identification of active sites in a set of 44 proteins with Ko’s method.

Although this method has excellent recall of catalytic sites, identifying the correct catalytic site for 90% of the proteins, the recall rate is lower (about 50%) for the identification of catalytic residues. It is desirable to improve the catalytic residue recall rate and also to expand the method to include predictions of non-ionizable residues.

Wei studied Ko’s analysis and modified it 54, introducing a new fractional parameter, α, which typically runs between 0.95 and 1.0. In this method, the mean and the standard deviation of µ3 and µ4 are calculated from the titration curves from a portion of the residues in a protein, in contrast to the whole population in Ko’s method. The portion of the residues excluded from the sample are the residues with titration curves with µ3 and µ4 values in the highest (1-α) fraction. This modification did yield better recall of annotated catalytic residues than Ko’s analysis, but the optimal α is different for different proteins and was finally fixed at 0.99, the value that yields the best overall performance when averaged over the set of annotated proteins. The purpose of this α is to exclude residues with titration curves with the most extreme µ3 and µ4 values from influencing the mean and standard deviation of the population too much and thus yielding better statistics and slightly more reliable predictions.

Meanwhile, Yang developed another rule-based statistical analysis identifying THEMATICS positive residues 55. In addition to the four features described earlier, her method uses an additional feature, a value called the buffer range R, which measures the width of the pH range over which the residue is

partially ionized. Also, outliers are selected within each amino acid type, when possible, instead of over the entire set of ionizable residues. The performance of this method is a little better than that of Ko’s method.

The three statistics-based analyses listed above all employ handcrafted cutoff values to differentiate the positive from the negative instances. The study described in this dissertation begins with the hypothesis that a machine learning method can utilize similar sets of features, define a threshold in a systematic way, and achieve better performance in practice.

2.3.3 Challenges of the site prediction problem using THEMATICS data.

One of the challenges of the task to classify THEMATICS results arises from special characteristics of the training data set.

First, the vast majority of the residues in the training data are negative examples. Literature-confirmed active site residues typically constitute less than 3% of all residues. At the same time, the negative examples, which comprise most of the data in the training set, share some “common” property, while the relatively few positive examples exhibit “abnormal” behavior in varied ways. This is one of the key reasons that a simple outlier detection process like Ko’s analysis is quite successful at this problem. But it is not clear how such a method can incorporate additional non-THEMATICS features to possibly improve the active-site prediction.

Secondly, the nature of this problem limits the quality of the training data. The ultimate goal of the project is to predict the active sites of proteins using THEMATICS data; however, the absolute criterion for labeling a residue in a protein as active is that someone has done the experiment in the lab and published the result supporting the claim. There are databases collecting such annotations, and by no means are these annotations complete. There is another subtlety in that although THEMATICS positive residues have been shown to be a very reliable indicator of active sites, THEMATICS sometimes predicts additional nearby residues that are not annotated as active, including second shell residues. There is some evidence to support the hypothesis that these second shell residues may be important. Alternatively, they

may be affected by the special electrostatic field created by the nearby active site residues. The

THEMATICS positive residues in the second case may not be shown experimentally to be active site residues. Because residue activity is often measured in a kinetics experiment and a number of factors can sometimes cause large errors in these experiments, the training set inevitably contains some positive instances that are misclassified in the first place, or some instances that cannot be correctly distinguished by the model. In particular, there are most probably instances of true positives that are improperly annotated as negatives, simply because no experiments have been tried on the vast majority of residues.

In order to overcome these two obstacles, in my earlier work with neural networks and the SVM method, I “cleaned” the training set. Instead of using just literature-confirmed positive instances, I also labeled as positive the “apparent” THEMATICS positives near a known active site, even though they are not experimentally identified as active site residues. I also removed some of the isolated THEMATICS positive instances from the training set. Although this data cleaning did improve the results, it is ad hoc and lacks a systematic justification.

For any machine learning problem, if there is some prior belief, or bias, which turns out to be true, applying it should always help the performance.

After studying THEMATICS and its application in protein active site prediction, it would be fair to conclude the following THEMATICS principles as prior belief:

THEMATICS Principle 1: The more perturbed the titration curve is (relative to other titration curves in the same protein), the greater the probability that residue is in the active site.

THEMATICS Principle 2: The more perturbed the neighboring titration curves are (relative to other titration curves in the same protein), the greater the probability that residue is in the active site.

The ad hoc method used before implicitly cleaned the data based on the THEMATICS principles.

In addition to THEMATICS features, to which we can apply THEMATICS principles, there are some non-THEMATICS features having either positive or negative correlation to the probability that a residue

35 is located in the active site. Those features may not be a reliable indicator by themselves, but combined with THEMATICS methods, they may improve the overall prediction accuracy.

While there may be ways to enforce inductive bias in classifiers like neural networks and SVMs, I believe the most straightforward approach is instead to try to estimate P(class | attributes) nonparametrically, while enforcing these principles as constraints, as explained in Chapters 4 and 5.

Chapter 3

Applying SVM to THEMATICS

3.1 Introduction.

As discussed in earlier chapters, THEMATICS is a technique for the prediction of local interaction sites in a protein from its three-dimensional structure alone. Various approaches have been taken to automate and standardize the process, with varying sensitivity and specificity. Here, I will present my first work on this project, using a support vector machine with four features extracted from THEMATICS alone to predict the active sites of enzymes.

In this chapter it is shown how support vector machines (SVM’s) may be combined with THEMATICS to achieve a substantially higher recall rate for catalytic residues with only a small sacrifice in specificity when compared to Ko’s statistical analysis of THEMATICS 24. It is argued that clusters predicted by

THEMATICS-SVM are small, local networks of ionizable residues with strong coupling between their protonation events; these characteristics appear to be very common, perhaps nearly universal, in enzyme active sites. Performance of THEMATICS-SVM in active site prediction is compared with other 3D- structure-based methods, including THEMATICS combined with previous analyses and shown to return equal or better recall with generally higher specificity and lower filtration ratio. The high specificity and low filtration ratio translate to better quality, more localized, predictions.

This work builds on the prior work of Ko using variants of some of the same features that were found to be successful there, plus some additional features. Results of our method are presented for 60 different proteins. In this chapter, I also present a way to extend the method’s capabilities to the prediction of non-ionizable residues.

3.2 THEMATICS curve features used in the SVM

To use an SVM to classify residues as either likely or not likely to be in the active site, I represent the computed titration curves as points in a four-dimensional space. These four features are based on the first four moments of these curves, as described in section 2.3.1.

The four features, namely M1, µ2, µ3 and µ4, are conceptually similar to those described in Ko’s analysis 24, except that I slightly modified the normalization process to prevent both the sample mean and sample standard deviation from being too strongly influenced by extreme values. A more robust estimator than the standard Z-score is used to distinguish “typical” from “atypical” titration curves within a single protein. In my normalization, each of the four moments was normalized to its corresponding robust Z-score Z′, which is defined as its deviation from the median, divided by the normalized interquartile distance, the difference between the 75th percentile and 25th percentile values, for the corresponding feature across all ionizable residues in that protein. The normalization factor of 1.349 comes from the normal distribution with a mean of zero and a standard deviation of one. Thus for a given feature Φ, I define Z′ as:

Z′(Φ) = 1.349 ∗ (Φ − MEDIAN(Φ)) / (PERCENTILE(Φ, 0.75) − PERCENTILE(Φ, 0.25))    (10)

where the median and corresponding percentiles are based on the value of that feature for all ionizable residues in a given protein. Thus this method achieves the same effect as Wei’s method 54 without introducing an extra parameter to be fine-tuned.

For the even-numbered moments, the robust Z-score Z′n for the nth central moment is defined as:

Z′n = Z′(µn ) (n even) (11)

The only even-numbered moments used in the present study are the second and fourth, so the corresponding robust Z-scores Z′ are Z′2 and Z′4.

Likewise the only odd-numbered moments are the first and third. Their corresponding robust Z-scores are the deviations of the absolute values from the median. In particular, we define Z′3 as

Z′3 = Z′(µ3 ) (12)

The population over which the median and percentiles are computed includes residues of different types with different free pKa’s. In order to compare the computed first moments across all residue types, the offset first moment for a given residue is defined as:

M1^offset = M1 − pKa(free)    (13)

where pKa(free) is the pKa for that residue in free solution. Note that by equation (9), an H-H residue has M1^offset = 0.

This offset may be compared across all residues in the protein. Thus Z′1 is defined as:

Z′1 = Z′(|M1^offset|).    (14)

Note that only the first moment requires this modification to make all residue types in the protein comparable since the H-H equation has only one free parameter, the residue type-dependent translation parameter pKa.

To summarize, the result of all these computations is to create, for each ionizable residue in any given protein, a 4-tuple of descriptors (Z′1, Z′2, Z′3, Z′4) of the theoretical titration curve. Z′2, Z′3, and Z′4 describe the shape of the curve and Z′1 measures its displacement along the horizontal axis.
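A compact sketch of this computation (equations (10)–(14)) is given below; the per-residue moment arrays passed in are hypothetical inputs for a single protein.

```python
import numpy as np

def robust_z(values):
    """Equation (10): robust Z-score of a feature relative to all ionizable
    residues of the same protein, using the median and the interquartile
    distance scaled by 1.349."""
    v = np.asarray(values, dtype=float)
    q25, q75 = np.percentile(v, [25, 75])
    return 1.349 * (v - np.median(v)) / (q75 - q25)

def residue_descriptors(m1, mu2, mu3, mu4, pka_free):
    """Per-residue 4-tuples (Z'1, Z'2, Z'3, Z'4) for one protein, following
    equations (11)-(14); pka_free holds each residue's free-solution pKa."""
    z1 = robust_z(np.abs(np.asarray(m1, float) - np.asarray(pka_free, float)))
    z2 = robust_z(mu2)              # eq. (11), n = 2
    z3 = robust_z(np.abs(mu3))      # eq. (12), absolute third central moment
    z4 = robust_z(mu4)              # eq. (11), n = 4
    return np.column_stack([z1, z2, z3, z4])
```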

3.3 Training.

A set of 20 proteins was used as the training set. The protein names, the E.C. numbers and the PDB ID for each of the 20 proteins in the training set are listed in Appendix A.

The labeling of the titration curves for training purposes was performed as follows: All residues listed in

CatRes/CSA as active were labeled positive. Also labeled positive were ionizable residues located near such annotated active residues whose titration curves appeared perturbed on visual inspection. All other residues were labeled negative, with the exception of a few residues with visually perturbed titration curves, no literature annotation, and no other perturbed residues nearby; they were removed from the training data set entirely. (Note that such residues with perturbed titration

curves that are not in spatial proximity with other perturbed residues are not considered predictive in

THEMATICS.)

From the 1575 ionizable residues in the 20 protein training set, I removed 46 isolated residues with perturbed titration curves. This leaves a training set of 1529 ionizable residues, among which 140 are labeled as positive training examples. For each ionizable residue in the training set, the four moment-based features and the corresponding labels were fed into the SVM using SVMLight 56. For both training and classification, the quadratic kernel K(x, z) = (1 + ⟨x, z⟩)^2 was used.

The relative cost of misclassification of positive and negative training examples was set such that false negatives were penalized 10 times as much as false positives. This was done because there are many more negative examples than positive examples in the training set, because of the aim to increase the residue recall rate, and because I have much more confidence in the labeling of the false negatives than the false positives (see sections 2.2.3 and 3.4). In addition, a linear kernel and several other choices of parameters were tried, but these resulted in either similar or slightly more training errors.
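The experiments above used SVMLight; a roughly equivalent scikit-learn sketch is shown below under the stated settings (hypothetical array names, with the 10:1 misclassification-cost ratio expressed through class weights rather than SVMLight's cost-factor option).

```python
from sklearn.svm import SVC

def train_thematics_svm(X_train, y_train):
    """X_train: (n_residues, 4) array of (Z'1, Z'2, Z'3, Z'4); y_train: 1 for
    active site residues, 0 otherwise. kernel='poly' with degree=2, gamma=1
    and coef0=1 gives K(x, z) = (1 + <x, z>)^2; class_weight penalizes errors
    on the positive (minority) class 10 times as heavily."""
    svm = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0,
              class_weight={1: 10.0, 0: 1.0})
    return svm.fit(X_train, y_train)
```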

3.4 Results

Typical criteria used to measure classifier performance are recall (also called sensitivity), the number of correctly predicted positives divided by the number of actual positives, and precision (related to specificity), the number of correctly predicted positives divided by the total number of predicted positives. Ideally, both measures are 100%, which means all and only the true positives are identified as such by the classifier. In the present case, one can be more confident of the true positive data, because for every labeled active residue there is experimental evidence supporting that labeling. On the other hand, true negative data are not as reliable because the experiments are incomplete; some important residues may not have been tested experimentally. Furthermore, some of the experimental literature has not been included in the CatRes/CSA database, because of the difficulty of exhaustive literature searching. A better indicator of the selectivity of the method for present purposes is the filtration ratio, the fraction of total

residues that are reported as positive. The goal of the system is then to achieve a high recall with a low filtration ratio.

A set of 64 test proteins was selected randomly from the CatRes database 57. There is no overlap between this test set of 64 proteins and the set of 20 proteins used to train the SVM. The trained SVM was applied to the test set to measure the overall accuracy of the method, assuming that the CatRes annotations define the true positive residues. Results are summarized here, while a detailed list of all proteins studied, with all predicted residues and clusters, can be found in Appendix B.

3.4.1 Success in site prediction.

First I examine the degree of matching between our predictions and the CatRes list for each protein.

Overall, the SVM identified an average of 2.7 clusters per subunit. Based on the overlap of the predicted active site and the CatRes listed set, the prediction for a protein is assigned to one of three categories. If

50% or more of the CatRes listed active residues were found by the system, we consider this a correct site prediction. If some, but fewer than half, of the CatRes listed active residues were found by our system, we consider it partially correct. If none of the CatRes listed active residues were found by our system, we consider the site prediction incorrect. This type of categorization has been used previously 18. Measuring this degree of overlap of predicted clusters with just the ionizable CatRes listed active-site residues, the percentages of proteins for which the predictions are correct, partially correct, and incorrect are 86%, 5% and 9% respectively, as shown in figure 3.1(a).
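The per-protein scoring rule just described can be written out compactly as a sketch; `predicted` and `annotated` are assumed to be sets of residue identifiers, with at least one annotated residue per protein.

```python
def site_prediction_category(predicted, annotated):
    """Classify one protein's prediction as 'correct' (>= 50% of annotated
    active residues found), 'partially correct' (some but < 50%), or
    'incorrect' (none found)."""
    found = len(set(predicted) & set(annotated))
    if found >= 0.5 * len(annotated):
        return "correct"
    return "partially correct" if found > 0 else "incorrect"
```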

3.4.2 Success in catalytic residue prediction.

Out of the 9303 ionizable residues from the 64 proteins, 1338 were predicted as active site candidates by the SVM, forming 244 clusters. There are 233 ionizable residues labeled as active site residues in the

CatRes database and 182 of them were found by our SVM, corresponding to a global residue recall of

78%. The average residue recall rate, averaged over all 64 proteins, is 76%.

For these 64 proteins, with the filtration ratio defined as the number of predicted residues over the total of 32016 residues (both ionizable and non-ionizable), the average is only 3.9%. This ratio is less than 8% for each of the 64 proteins. The average precision, or fraction of predicted residues that are known true positives, is 20% over the 64 protein set, using only the CatRes/CSA annotations to define the true positives.

3.4.3 Incorporation of non-ionizable residues.

Since not all active site residues are ionizable, it is also of interest to see how well the SVM-reported residues serve as predictors of activity in their spatial vicinity. Therefore I also define a THEMATICS positive region to be the spatial region within 6Å of any residue that belongs to a THEMATICS positive cluster. This may allow the method to find some catalytically important residues that do not have a perturbed titration curve (including non-ionizable residues). The total number of residues found by this criterion across the 64 test proteins is 4795, out of 32016 total residues. Among 366 residues that are labeled as active site residues in CatRes, 263 were found by the system, corresponding to a global recall of 72%, while the average recall per protein is 81%. The average precision, or fraction of predicted residues that are known true positives, is 21% over the 64 protein set, using only the CatRes/CSA annotations to define the true positives.

Table 3.1 compares the performance of the straight SVM predictions versus the SVM+Region predictions.

While the expansion to include the neighborhood surrounding the predicted residue leads to a somewhat higher recall rate, there is considerable sacrifice in the precision and increase in the filtration ratio.


Method            Recall   Precision   Filtration Ratio
SVM only          61%      21%         4%
SVM + 6Å region   81%      8%          13%

Table 3.1. Performance of the SVM predictions alone versus the SVM regional predictions that include all residues within a 6Å sphere of each SVM-predicted residue. Shown are average values of recall (true positive residues over all known positive residues), precision (true positive residues over all predicted residues), and filtration ratio (residues predicted over total residues in the protein), where averaging is performed over the set of 64 test proteins.


Using the same criteria described above for judging correctness of the predictions, but this time counting all residues in THEMATICS positive regions and comparing with all the CatRes listed active-site residues, the percentages of correct, partially correct, and incorrect site predictions are 88%, 4% and 8% respectively (Figure 3.1(b)).


[Figure 3.1: pie charts. (a) 86% correct, 5% partially correct, 9% incorrect. (b) 88% correct, 4% partially correct, 8% incorrect.]

Figure 3.1. The success rate for site prediction on a per-protein basis: (a) ionizable residues only; (b) all residues, extending the SVM’s predictions by including all residues within 6Å of each predicted residue.


Figure 3.2 shows histograms of the filtration ratios achieved using the trained SVM on a per protein basis. These filtration ratios are expressed in three different ways: 1) all predicted residues over all residues; 2) ionizable residues predicted over all ionizable residues; and 3) ionizable residues predicted over all residues. Here, 1) is obtained using the 6Å neighborhood criterion and 2) and 3) are obtained from the straight SVM prediction of ionizable residues only. Using all residues as the basis, the filtration ratio for the SVM (ionizable) predictions is less than 10% for all 64 proteins. There was only one protein out of the 64 for which the filtration ratio for all residues predicted (using the 6Å neighborhood criterion) was higher than 25%. For this protein, human Glyoxalase I, the method identified about 18% of its ionizable residues as candidates and about 27% of all its residues as candidates (using the 6Å neighborhood criterion). For well over 90% of the proteins studied, the filtration ratio was better than 20% in both the ionizable/ionizable and the all-residue cases. Even better, in 70% of the proteins tested the method reported less than 15% of the ionizable residues as candidates, and in 61% of the proteins in the test set less than 15% of all residues were identified as candidates using the 6Å neighborhood criterion. It is important to note that for most of the examples with large filtration ratios there is a sound functional basis for the high ratio, e.g. the active site binds multiple substrate or cofactor molecules and thus has an unusually large interaction site. In other cases with high filtration ratios, the protein has an interaction site of typical size but an unusually small number of total ionizable residues.


[Figure 3.2: histogram of the number of proteins in each filtration ratio range (0-5% through 25-30%) for the three ratios All/All, Ionizable/Ionizable, and Ionizable/All.]

Figure 3.2. Distribution of the 64 proteins across different values of the filtration ratio. Filtration ratios are expressed as: 1) all predicted residues over all residues; 2) ionizable residues predicted over all ionizable residues; and 3) ionizable residues predicted over all residues. The all-residue predictions in 1) are obtained using the 6Å neighborhood criterion.

3.4.4 Comparison with other methods.

It is useful to compare the results of the present method with some other catalytic site prediction methods that are based on 3D structure alone. The other methods used for this comparison are QSiteFinder 20 and SARIG 35, both of which have publicly available servers, and Ko's statistical analysis of THEMATICS 24. Of the 64 proteins used for the THEMATICS-SVM test set, one was too large for both SARIG and QSiteFinder and three others were too large for QSiteFinder. These four were deleted from the test set, and thus the comparison results reported here are for the remaining 60 proteins.

The average (per protein) values for recall, precision, filtration ratio, false positive rate, and MCC of each method are listed in Table 3.2. Two sets of results are given for QSiteFinder, one using only the top site and the other using a combination of the top three sites. This combination of the top three sites is the basis for the success rate reported for this method in the original article 20. The values in Table 3.2 use all annotated residues, including non-ionizable residues, as the basis for the recall and all residues in the protein as the basis for the filtration ratio for all of the methods. Thus the theoretical maximum recall rate for THEMATICS-SVM (without the 6Å region) is less than 100%, because some known catalytic residues are non-ionizable.

Figure 3.3 plots recall as a function of false positive rate. The solid line represents Wei's THEMATICS analysis with variable parameter α. The performance of Ko's analysis, QSiteFinder, SARIG, and the present SVM is depicted as points. The recall and the false positive rate of all the methods in this plot were measured over ionizable residues only.


Method                       Recall   Precision   Filtration Ratio   False Positive Rate   MCC
THEMATICS-SVM                61%      20.0%       3.8%               3.1%                  0.31
THEMATICS-SVM-Region         80%      8.1%        13.2%              12.3%                 0.22
THEMATICS-Statistical (Ko)   44%      23.5%       2.6%               2.0%                  0.29
QSiteFinder (top one)        33%      4.6%        7.5%               7.2%                  0.094
QSiteFinder (top three)      61%      4.2%        16.2%              15.8%                 0.12
SARIG                        61%      8.0%        11.2%              10.6%                 0.18

Table 3.2. Comparison of THEMATICS-SVM and other methods. Shown in the table are the recall, precision, filtration ratio, false positive rate, and Matthews correlation coefficient (MCC) of THEMATICS as well as of other site prediction methods including THEMATICS-Statistical (Ko’s method), QSiteFinder, and SARIG. These quantities are per-protein averages over a comparison set consisting of the 60 proteins from the 64-protein test set for which results could be obtained with all methods.


[Figure 3.3: recall (0 to 1) versus false positive rate (0 to 0.6). A curve shows Wei's method; points mark the SVM, QSiteFinder (largest one), QSiteFinder (largest three), SARIG, and Ko's method.]

Figure 3.3. Recall versus false positive rate (ROC-style plot) for the SVM and the other methods. The curve for Wei's method, a generalization of Ko's method, is obtained by varying the parameter α.


The comparisons in Table 3.2 and Figure 3.3 show that THEMATICS-SVM achieves catalytic residue recall as good as or better than the other methods, while simultaneously achieving substantially better precision with lower filtration ratios and false positive rates. This translates to more localized, more precise predictions of generally better quality, as reflected in the higher MCC. Table 3.2 shows that, compared with Ko's statistical analysis of THEMATICS, the SVM analysis gives significantly better sensitivity to catalytic residues with only a small concomitant drop in precision. Figure 3.3 and the MCC values in Table 3.2 further show that the tradeoff between recall and filtration ratio/false positive rate is better with the SVM analysis, since it outperforms Wei's method, a generalization and extension of Ko's method.

3.5 Discussion

3.5.1 Cluster number and size.

The SVM study (without the neighboring region) found 244 clusters as active site candidates, with an average of 2.7 clusters per chain in the 64-protein test set. While the number of true positive clusters per chain is probably somewhat higher than 1.0, 2.7 is too high. Ko's statistical criteria reported 1.7 clusters per chain, although I note that the present method was designed to increase the recall rate of true positive residues. Among the 244 clusters reported by the present method, 90 are pairs, and it appears that many of these pairs are false positives resulting from the chance proximity of two residues with similar pKa's. Simple geometry-based rules have been investigated that do eliminate many of these pairs 58. The average size of a cluster is 5.5 ionizable residues. These values include catalytic residues, binding residues, and also some "second shell" residues, residues that are nearest neighbors of the known catalytic residues and second-nearest neighbors of the reacting substrate. It is indeed possible that these "second shell" residues play some supporting role in catalysis, or they may simply be subject to the same strongly pH-dependent field as the catalytic residues. Experimental site-directed mutagenesis studies are in progress to elucidate this.

3.5.2 Failure analysis.

Although this method effectively found the active sites in most of the proteins, there remain a few cases where it failed to find the correct active sites. There are a variety of possible causes for this. In some cases, the "failure" may not be a failure at all. An example is cytosolic ascorbate peroxidase (1APX), in which the active site listed in CatRes is at the distal side of the heme ring. My method did not find that particular site, but it did find a cluster at the proximal side of the heme ring, which has been identified as an active site by several studies 59, 60. This suggests that some of the discrepancy between the CatRes labeling and the SVM results may result from incomplete or incorrect information in the reference database. Failure can also occur in cases where the active site environment is very hydrophobic. In the case of DNA photolyase (1DNP), the listed active site consists of three tryptophan residues, which are not ionizable. Even in that case, the SVM method did find a cluster that lies between those active tryptophan residues and the cofactors FAD and MTF. Indeed the predicted residues may actually be involved in the electron transfer process, which proceeds from the tryptophan residues to the cofactors. In some other cases, the SVM method found a cluster of residues that bind to a substrate or cofactor. These binding residues may exhibit such strong perturbations in their titration curves that the actual catalytic residues are missed by the classifier.

3.5.3 Analysis of high filtration ratio cases.

There are a few proteins for which the SVM method gives a high filtration ratio, meaning the active site candidates our method produces constitute an unusually high fraction of the total residues in the protein. Most of these cases are cofactor-dependent enzymes and/or systems with larger substrate molecules, where the site truly is a larger fraction of the protein's surface. The protein with the highest filtration ratio is human Glyoxalase I (1FRO), for which the SVM selects 18% of the ionizable residues and SVM+Region selects 27% of all residues. Glyoxalase I catalyzes the glutathione-dependent conversion of 2-oxoaldehydes to 2-hydroxycarboxylic acids. Each of its subunits binds glutathione, substrate, and zinc. In addition to the catalytic residue E172, the SVM method also finds the glutathione-binding residues R37, E99, and R122, the zinc-binding residue H126, plus some interfacial residues. Another example is arginine kinase (1BG0), which catalyzes phosphate transfer between arginine and ATP, and thus its site binds arginine, phosphate and MgADP. The predicted residues surround the site of interaction between the arginine, the ATP, and the reacting phosphate group.

Still other cases of high filtration ratio result from small enzyme size. Enzyme IIA-lactose (1E2A), a part of the lactose/cellobiose-specific family of enzymes II in the sugar phosphotransferase system, is such an example. It has only 36 ionizable residues. For this protein the SVM method found two clusters of three ionizable residues each, one of which is the known active site, giving a filtration ratio of 17% of ionizable residues. The all-residue analysis yielded a 21% filtration ratio. In this case, the site of interaction truly constitutes a high fraction of the total residues in the structure.

3.5.4 Some specific examples.

Type I 3-Dehydroquinate dehydratase from Salmonella typhi (PDB ID 1QFE) is an important enzyme found in plants and microorganisms. It functions as part of the shikimate pathway, which is essential for the biosynthesis of aromatic compounds including folate, ubiquinone and the aromatic amino acids. The absence of the shikimate pathway in animals makes it an attractive target for antimicrobial agents. The study of its structure, active sites and reaction mechanism opens the way for the design of highly specific enzyme inhibitors with potential importance as selective therapeutic agents 61. For this protein, the SVM predicts a total of eight residues in two clusters, [E46, E86, D114, E116, H143, K170] and [D50, H51]. The known catalytic residues are E86, H143, and K170 and are shown as red sticks in Figure 3.4. The other five predicted residues are shown as yellow sticks. The three predicted catalytic residues, plus one additional predicted residue (E46), are nearest neighbors of the reacting substrate molecule, as determined by the LPC server 62. The remaining four residues are "second shell" residues, each of which is in contact with at least two first shell residues. The elongated theoretical titration curves obtained for the eight predicted residues reflect their membership in subsets of ionizable residues with similar pKa's in close proximity.

Phosphatidylinositol-specific phospholipase C exists in both eukaryotic and prokaryotic cells. Catalyzing the hydrolysis of 1-phosphatidyl-D-myo-inositol-3,4,5-triphosphate into the second messenger molecules diacylglycerol and inositol-1,4,5-triphosphate, it plays an important role in signal transduction and other important biological processes 63. This catalytic process is tightly regulated by reversible phosphorylation and binding of regulatory proteins. The SVM prediction for phosphatidylinositol-specific phospholipase C from Listeria monocytogenes (PDB ID 2PLC) is shown in Figure 3.5. The correctly predicted active site residues are shown as red sticks and the additional predicted residues are shown as yellow sticks. Two residues missed by the SVM, but found by the SVM+Region selection, are shown in purple. Note how the additional (yellow) residues occupy a second layer immediately surrounding the known true positive residues (red and purple).


Type I 3-Dehydroquinate dehydratase

Figure 3.4. SVM prediction for protein 1QFE. The SVM predicts a total of eight residues in two clusters, among which three known catalytic residues are shown as red sticks; the other five residues are shown as yellow sticks.


Phosphatidylinositol phospholipase C

Figure 3.5. The SVM prediction for 2PLC. The correctly predicted active site residues are shown as red sticks and the additional predicted residues are shown as yellow sticks. Two residues missed by the SVM, but found by the SVM+Region selection, are shown in purple.


3.6 Conclusions.

THEMATICS-SVM is a relatively straightforward method. Combining SVM with THEMATICS achieves a higher recall rate for catalytic residues than earlier THEMATICS analyses, with only a small sacrifice in precision. Precision rates for THEMATICS predictions tend to be considerably better than for other methods. The more localized, more specific predictions offer enhanced usefulness for applications such as functional classification and specific ligand design.

The set of residues with perturbed titration behavior is a small subset of the set of ionizable residues with strong electrostatic interactions or shifted pKa's. Thus THEMATICS is more selective than simple identification of the strongly interacting residues. A previous study 64 showing that titration-based methods give a large number of false positives was based on the less selective electrostatic properties. The present study confirms the comparatively low false positive rate for THEMATICS.

SVM+Region, the extended version of THEMATICS-SVM that incorporates residues within a 6Å sphere of each predicted residue, does deliver improved recall, but with a sacrifice in precision. The major advantage of THEMATICS over other 3D-structure-based methods is the superior precision; SVM+Region loses most of this advantage.

THEMATICS selects ionizable residues with strong coupling between their protonation events. These localized networks of interacting residues are good predictors of active sites. This feature may be a fundamental property of enzyme active sites and may be a factor to consider in protein engineering.

3.7 Next step.

Although applying SVM with THEMATICS gave us a system that predicts protein active sites with better sensitivity and specificity, there is still room for improvement. One possible approach would be using a larger and better annotated training set and fine-tuning the SVM with different kernels and parameters. Most likely, this approach would improve the performance of the THEMATICS-SVM method, but there are limitations. The fundamental basis for this approach, as well as other methods using THEMATICS, is the use of a binary classification system to select the THEMATICS-positive residues and cluster them based on their physical proximity. If we could develop a new system that can rank order the residues based on their likelihood of being in the active site, it would be much easier for a user to decide how far down the list to go. Furthermore, if these rankings are based on probability estimates of each residue being in the active site, we have principled ways to further combine the results from different systems to obtain a more powerful system. So in the next chapter, I take a brand new approach and develop the Partial Order Optimal Likelihood (POOL) method, which yields probability estimates rather than binary classification decisions.

Chapter 4

New Method: Partial Order Optimal Likelihood (POOL).


4.1 Ways to estimate class probabilities.

For convenience, I use class probability to refer to $P(C = c \mid X_1 = x_1, \ldots, X_n = x_n)$, which is the probability that an instance with attributes $X_1$ to $X_n$ having values $x_1$ to $x_n$, respectively, belongs to class c. Notice that it is different from another commonly used term, the class-conditional probability, meaning the probability $P(X_1 = x_1, \ldots, X_n = x_n \mid C = c)$. I refer to any method that estimates class probabilities as a class probability estimator (CPE). I distinguish two classes of CPEs. One is parametric, which involves selecting a function to model the class probabilities of the training data and then applying a learning process to select the parameters of that function giving the best fit to the training data. Examples are the logistic regression method 65 and neural networks. The other class is nonparametric and includes the novel POOL method described below. There is no pre-defined function in such nonparametric CPEs, and the learning process estimates the class probabilities directly from the training data. I believe the nonparametric CPEs are more general and I will focus my discussion on this class. There are different nonparametric CPEs, with different ways of computing and representing the probability estimates. I will introduce four of them as follows.

4.1.1 Simple joint probability table look-up.

The most conceptually straightforward way is to use a lookup table having the same number of dimensions as the number of attributes used to represent the data. The values of each feature are quantized into small intervals. The probability estimates are calculated by taking the ratio of the number of instances in each class to the total number of instances with each feature value combination. The complete table has to be stored as the representation of the probability estimates. Clearly, both computation and representation will be exponential in the number of features. The quantization is clearly necessary if the attributes are distributed over a continuum.

If we had an unlimited set of training data and it had exactly the same distribution as that of the actual test data, this would be the best learning system, because there is no information loss caused by any model restriction. But in reality, the training set is always limited and there will be attribute combinations in the test data that have never appeared in the training set. Because of this limitation, such a lookup table has seldom been a realistic choice in any real problem. Some model abstraction has to be done to make generalization from observed training instances to unknown test instances possible.
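A minimal sketch of such a table-based CPE, assuming two-class data whose continuous attributes are quantized into fixed-width bins; the bin width and the dictionary-backed table are illustrative choices of this sketch, not details of the dissertation's implementation.

```python
from collections import defaultdict

class JointTableCPE:
    """Class probability estimator backed by a quantized joint lookup table."""

    def __init__(self, bin_width=0.5):
        self.bin_width = bin_width
        self.pos = defaultdict(int)    # count of class-1 instances per attribute-bin combination
        self.total = defaultdict(int)  # total count per attribute-bin combination

    def _key(self, x):
        # Quantize each attribute into an interval index.
        return tuple(int(v // self.bin_width) for v in x)

    def fit(self, X, y):
        for x, c in zip(X, y):
            k = self._key(x)
            self.total[k] += 1
            self.pos[k] += int(c == 1)
        return self

    def predict_proba(self, x):
        k = self._key(x)
        if self.total[k] == 0:
            return None  # attribute combination never seen in training
        return self.pos[k] / self.total[k]

cpe = JointTableCPE().fit([[0.1, 0.2], [0.1, 0.3], [1.4, 0.2]], [1, 0, 1])
print(cpe.predict_proba([0.2, 0.4]))   # 0.5: same bin as the first two training points
print(cpe.predict_proba([3.0, 3.0]))   # None: combination absent from the table
```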

4.1.2 Naïve Bayes method.

Assuming conditional independence, i.e.,

$P(X_1 = x_1, \ldots, X_d = x_d \mid C = c) = \prod_{i=1}^{d} P(X_i = x_i \mid C = c)$,   (15)

one can use Bayes' theorem to estimate the class probability P(class | attributes) by

$P(C = c \mid X_1, \ldots, X_d) = \frac{P(C = c)}{P(X_1, \ldots, X_d)} \prod_{i=1}^{d} P(X_i \mid C = c) \propto P(C = c) \prod_{i=1}^{d} P(X_i \mid C = c)$.   (16)

P(C = c) is the probability of an instance being in one particular class c, and this probability can be estimated by the ratio of the number of instances in c to the total number of instances in the training examples. $P(X_i \mid C = c)$ can be estimated as the ratio of the number of instances belonging to class c with the corresponding attribute value to the number of instances in c; this, too, is obtained by counting the instances in class c with the corresponding values. $P(X_1, \ldots, X_d)$ is independent of the class C, and it can be treated as a constant. Since the sum of the class probabilities over all classes is 1, one can easily solve for the probability of each class, knowing their relative ratios. This method is linear in both computation and representation of the probability estimates in terms of the number of attributes d, since it only needs to compute and store the $P(X_i \mid C)$ and $P(C)$ estimates.

This is an indirect method in the sense that it estimates the class probability P(class | attributes) by first estimating the class-conditional probability P(attributes | class) and then using conditional independence to convert it. This method also requires quantization, unless some parametric form of the probability density function is used.
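The following sketch spells out equation (16) for binary classes with discrete (already quantized) attribute values; the Laplace smoothing constant is an assumption added here to avoid zero counts and is not part of the formulas above.

```python
from collections import defaultdict

class NaiveBayesCPE:
    """Estimates P(C = c | x) via Bayes' theorem and conditional independence (eq. 16)."""

    def fit(self, X, y, alpha=1.0):
        self.alpha = alpha
        self.classes = sorted(set(y))
        self.class_count = defaultdict(int)
        self.feat_count = defaultdict(int)  # keyed by (class, attribute index, value)
        for x, c in zip(X, y):
            self.class_count[c] += 1
            for i, v in enumerate(x):
                self.feat_count[(c, i, v)] += 1
        self.n = len(y)
        return self

    def predict_proba(self, x):
        # Unnormalized P(C=c) * prod_i P(X_i = x_i | C = c), then normalize over classes.
        scores = {}
        for c in self.classes:
            s = self.class_count[c] / self.n
            for i, v in enumerate(x):
                # Laplace smoothing; the 2 in the denominator assumes binary attribute values.
                s *= (self.feat_count[(c, i, v)] + self.alpha) / (self.class_count[c] + 2 * self.alpha)
            scores[c] = s
        z = sum(scores.values())
        return {c: s / z for c, s in scores.items()}

cpe = NaiveBayesCPE().fit([(0, 1), (0, 0), (1, 1), (1, 0)], [1, 0, 1, 0])
print(cpe.predict_proba((0, 1)))  # class 1 favored
```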

Although I do use a naïve Bayes method to combine probabilities from different types of features in the application, it is not enough because it is not apparent how one can apply the THEMATICS principles in this method, since the constraints themselves are on class probabilities instead of the class-conditional probabilities. Furthermore, because some of the features used in THEMATICS are clearly correlated, the conditional independence assumption is violated, which makes the naïve Bayes probability estimates suspect.

4.1.3 The K-nearest-neighbor method.

The third method to introduce is the k-nearest-neighbor method. It estimates the class probability P(class | attributes) by taking the k training instances with the "closest" attributes to the query point and then counting how many of them are in that particular class. This is a direct method since it gives a direct estimate of the class probability, and it does this estimation without explicit quantization. In some sense, it implicitly quantizes the attributes by taking the range of the values from the nearest neighbors. This method has a lazy evaluation feature, since the estimates for a given feature combination are calculated at query time. All the training instances have to be stored, and unless some indexing scheme is used, the computation and representation cost of this method is linear in the number of training instances.

This method can estimate the class probabilities when there are no matches between the attributes of the query instance and the training instances, and with properly selected k, it may even reduce some of the negative effect caused by noise in the training data, but there is no apparent way to apply THEMATICS principles with this method. Furthermore, correlation in the THEMATICS features is a potential problem, although a correctly weighted distance function can compensate for this.
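A minimal k-nearest-neighbor class probability estimate is sketched below under a Euclidean distance, which is an illustrative choice; a weighted distance, as noted above, could compensate for correlated features.

```python
import math

def knn_class_probability(query, X, y, k=5):
    """Estimate P(class = 1 | query) as the fraction of positives among the k nearest training instances."""
    dist = lambda a, b: math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    neighbors = sorted(zip(X, y), key=lambda xy: dist(xy[0], query))[:k]
    return sum(c for _, c in neighbors) / k

X = [(0.0, 0.0), (0.1, 0.2), (0.9, 1.0), (1.0, 0.8), (0.5, 0.5)]
y = [0, 0, 1, 1, 1]
print(knn_class_probability((0.2, 0.1), X, y, k=3))  # 1/3: one positive among the three closest points
```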

4.1.4 POOL method.

The fourth method is the one proposed here, the Partial Order Optimal Likelihood (POOL) method, the details of which will be introduced in the next section.

It finds the "best" estimate, in the sense of maximizing the likelihood, among all possible estimates of the class probabilities P(class | attributes) that satisfy given monotonicity assumptions. It uses convex optimization to update the probability estimates of the training instances. It stores these probability estimates as reference points and computes the probability estimates of query instances at query time, although in principle one can compute the whole table and store it. With lazy evaluation, this method has a linear computation and representation cost in terms of the number of training instances.

It is a direct method, since it estimates P(class | attributes) directly. In principle it does not need quantization, although in our application we have used quantization for convenience. It is a learning system that makes full use of the training data and enforces some prior belief, like the THEMATICS principles, to minimize the effect caused by noise in the training set. The POOL method is conceptually simple, computationally efficient, and appears to be effective in practice, as will be shown below.

4.1.5 Combining CPE’s.

Although one can put all the features together and use one CPE to estimate the class probability in one step, this may not be the best choice in practice. In most cases, it is better to group the features into different groups and use several, say l, CPEs, each with one group of features $\vec{X}_i$. Each of these smaller CPEs can be obtained by any of the methods described above, as well as by any parametric method. At query time, the class probability $P(C = c \mid \vec{X}_1, \ldots, \vec{X}_l)$ can be estimated based on the class probability estimates $P(C = c \mid \vec{X}_i)$ from each of the smaller CPEs according to the naïve Bayes combination rule:

$P(C = c \mid \vec{X}_1 = \vec{x}_1, \ldots, \vec{X}_l = \vec{x}_l) \propto P(C = c) \prod_{i=1}^{l} \frac{P(C = c \mid \vec{X}_i = \vec{x}_i)}{P(C = c)}$,   (17)

which is easily derived from Bayes' theorem under the assumption that $\{\vec{X}_i \mid C\}_{i=1}^{l}$ are mutually independent random variables, i.e.,

$P(\vec{X}_1 = \vec{x}_1, \ldots, \vec{X}_l = \vec{x}_l \mid C = c) = \prod_{i=1}^{l} P(\vec{X}_i = \vec{x}_i \mid C = c)$.   (18)

We use the term chaining to mean the application of this naïve Bayes combination rule.

Although strictly speaking one can only use chaining to estimate the class probability when the conditional independence assumption holds, in practice, even if the conditional independence assumption is not strictly true, one might still be able to get some useful results, as with other applications of naïve Bayes, especially when it comes to relative rankings 66.
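A sketch of the chaining rule in equation (17) for the two-class case, assuming each group-level CPE already supplies P(C = 1 | group features); the function name and the toy numbers are assumptions of this example.

```python
def chain(group_probs, prior):
    """Combine per-group estimates P(C=1 | X_i) into one estimate via the naive Bayes rule (eq. 17).

    group_probs -- list of P(C = 1 | X_i = x_i), one per feature group
    prior       -- P(C = 1)
    """
    # Unnormalized scores for class 1 and class 0, then renormalize.
    score1 = prior
    score0 = 1.0 - prior
    for p in group_probs:
        score1 *= p / prior
        score0 *= (1.0 - p) / (1.0 - prior)
    return score1 / (score1 + score0)

# Two groups both mildly favoring the positive class reinforce each other.
print(chain([0.6, 0.7], prior=0.5))  # ~0.78
```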

4.2 POOL (Partial Order Optimal Likelihood) method in detail.

4.2.1 Maximum likelihood problem with monotonicity assumption.

Before introducing POOL, I first introduce the notion of a class probability monotonicity assumption:

Definition 1: Class Probability Monotonicity Assumption:

The Class Probability Monotonicity Assumption refers to the property that for each class c, given a partial order $\preceq_c$ on the attribute space X, for any $x_i, x_j \in X$ with $x_i \preceq_c x_j$, it must be that $P(C = c \mid X = x_i) \le P(C = c \mid X = x_j)$. For 2-class classification, $\preceq_-$ is the opposite of $\preceq_+$, i.e., $x_i \preceq_+ x_j$ iff $x_j \preceq_- x_i$.

An important special case that inspired my development of this approach is when the attribute space has the form $\mathbb{R}^n$. In this case, we can define a partial order on X as follows: given $x = (x_1, \ldots, x_n) \in X$ and $y = (y_1, \ldots, y_n) \in X$, define $x \preceq y$ if $x_i \le y_i$ for all i. We call this the coordinatewise partial order on X induced by the ordering on $\mathbb{R}$.

The basic idea of POOL is to find a CPE for which the training data likelihood is highest among all CPEs that conform to the monotonicity assumptions. The effect of the monotonicity assumptions is to create a set of inequality constraints that relate the class probabilities of certain pairs of points.

Definition 2: Likelihood function L(H) of hypothesis H.

Assume given a hypothesis space containing a family of hypotheses H, i.e., probability density functions (for continuous distributions) or probability mass functions (for discrete distributions), and $\vec{X}_1, \ldots, \vec{X}_n$ as random draws with an actual sample $\vec{x}_1, \ldots, \vec{x}_n$. Since the draws are i.i.d., by definition

$P(\vec{X}_i = \vec{x} \mid H) = P(\vec{X}_j = \vec{x} \mid H)$ for any i, j, $\vec{x}$, H,

and for each hypothesis H we may compute the probability density with which we observed $\vec{X}_1, \ldots, \vec{X}_n$ as a function of H,

$L(H) = P(\vec{X}_1 = \vec{x}_1, \ldots, \vec{X}_n = \vec{x}_n \mid H)$.   (19)

Now, restricting attention to 2-class problems, where the class labels are 0 and 1, I define $p_i = P(C = 1 \mid H, \vec{X}_i = \vec{x}_i)$, and we can write

$P(C_i = c_i \mid H, \vec{X}_i = \vec{x}_i) = p_i^{c_i} (1 - p_i)^{1 - c_i}$.   (20)

Then,

$L(H) = \prod_{i=1}^{n} P(\vec{X}_i = \vec{x}_i, C_i = c_i \mid H) = \prod_{i=1}^{n} P(C_i = c_i \mid H, \vec{X}_i = \vec{x}_i) \, P(\vec{X}_i = \vec{x}_i)$.   (21)

After substitution of (20), the likelihood function in our problem becomes:

$L(H) = \prod_{i=1}^{n} p_i^{c_i} (1 - p_i)^{1 - c_i} \, P(\vec{X}_i = \vec{x}_i) \propto \prod_{i=1}^{n} p_i^{c_i} (1 - p_i)^{1 - c_i}$.   (22)

Given a monotonicity assumption, finding maximum likelihood probability estimates becomes a constrained optimization problem: maximize L subject to a set of inequality constraints of the form $p_i \le p_j$, one for each $(\vec{x}_i, \vec{x}_j)$ pair in the training data generating the partial order via transitive closure. The solution of this problem assigns probability estimates to only the training data. For attribute combinations not observed in the training data, the monotonicity constraints only determine upper and lower bounds on their class probabilities; one can then use some form of interpolation to assign actual estimates.
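As an illustration of how the monotonicity assumption turns into pairwise constraints, the sketch below enumerates the coordinatewise partial order over a small set of training points; representing a constraint as an index pair (i, j) meaning $p_i \le p_j$ is an assumption of this example.

```python
def coordinatewise_constraints(points):
    """Return index pairs (i, j) with points[i] <= points[j] in every coordinate (i != j).

    Each pair encodes the inequality constraint p_i <= p_j required by the
    class probability monotonicity assumption for the positive class.
    """
    pairs = []
    for i, xi in enumerate(points):
        for j, xj in enumerate(points):
            if i != j and all(a <= b for a, b in zip(xi, xj)):
                pairs.append((i, j))
    return pairs

points = [(0.1, 0.2), (0.3, 0.5), (0.2, 0.1)]
print(coordinatewise_constraints(points))  # [(0, 1), (2, 1)]: the third point is incomparable with the first
```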

4.2.2 Convex optimization and K.K.T. conditions.

Given a real vector space¹ X together with a convex, real-valued function

$f : A \to \mathbb{R}$,

defined on a convex subset A of X, convex optimization is the problem of finding the point x in A for which the value f(x) is smallest.

Convex optimization has been studied for a long time. It has many good properties, such as the fact that if a local minimum exists, it must also be a global minimum; this makes methods like gradient descent work for solving the problem without the danger of getting stuck in a local minimum instead of finding the global minimum. Many methods have been developed to solve it efficiently. There are standard methods like gradient projection methods, line-search methods, and interior-point methods, as well as specialized versions dedicated to specific problems of this form. How to solve convex optimization problems in general is still an active research area.

One way to find and prove the constrained optimal point in a convex optimization problem is to use the Karush-Kuhn-Tucker conditions (K.K.T. conditions), a generalization of Lagrange multipliers.

K.K.T. conditions:

1 In order to describe the convex optimization problem in its general form, I use X to denote the vector space, and x to denote a vector (or point) in that space, until further notice. As a general rule, whenever I discuss general optimization problems, I use x, while p is used in specializing to solve for probabilities, as in later sections.

Consider the problem:

minimize f(x)
subject to $g_i(x) \le 0$ and $h_j(x) = 0$,

where f(x) is the function to be minimized, $g_i(x)$ (i = 1,…,m) are the inequality constraints and $h_j(x)$ (j = 1,…,l) are the equality constraints.

Necessary conditions:

Suppose that the objective function $f : \mathbb{R}^n \to \mathbb{R}$ and the constraint functions $g_i : \mathbb{R}^n \to \mathbb{R}$ and $h_j : \mathbb{R}^n \to \mathbb{R}$ are continuously differentiable at a point $x^* \in S$. If $x^*$ is a local minimum, then there exist constants $\lambda > 0$, $\mu_i \ge 0$ (i = 1,…,m) and $\nu_j$ (j = 1,…,l) such that

$\lambda \nabla f(x^*) + \sum_{i=1}^{m} \mu_i \nabla g_i(x^*) + \sum_{j=1}^{l} \nu_j \nabla h_j(x^*) = 0$   (23)

and

$\mu_i \, g_i(x^*) = 0$ for all i = 1,…,m.   (24)

Sufficient conditions:

If the objective function $f : \mathbb{R}^n \to \mathbb{R}$ and the constraint functions $g_i : \mathbb{R}^n \to \mathbb{R}$ and $h_j : \mathbb{R}^n \to \mathbb{R}$ are convex functions, the point $x^* \in S$ is a feasible point, and there exist $\mu_i \ge 0$ (i = 1,…,m) and $\nu_j$ (j = 1,…,l) such that

$\nabla f(x^*) + \sum_{i=1}^{m} \mu_i \nabla g_i(x^*) + \sum_{j=1}^{l} \nu_j \nabla h_j(x^*) = 0$   (25)

and

$\mu_i \, g_i(x^*) = 0$ for all i = 1,…,m,   (26)

then $x^*$ is the global minimum.

4.2.3 Finding Minimum Sum of Squared Error (SSE).

In addition to likelihood, sum of squared error is another commonly used measurement of how well the model fits the data, and it is easier to work with than the likelihood L. In this section, I will present an approach to compute minimum SSE under the monotonicity assumption, and in the next section, I will prove that we can find the maximum of likelihood L by finding the minimum of SSE.

Definition 3: Sum of squared error (SSE). To estimate how close the estimated function, in our case the class probability estimate² $p(\vec{x})$, is to the observations of n $(\vec{x}, c)$ pairs, we compute

$SSE = \sum_{i=1}^{n} (c_i - p_i)^2$, where $p_i = p(\vec{x}_i)$.   (27)

Let $\vec{p} = (p_1, \ldots, p_n)$. SSE is a quadratic function of $\vec{p}$, and the class probability monotonicity assumptions form a set of linear inequality constraints. This problem is then a special case of the convex optimization problem called a quadratic programming problem, another well studied class of convex optimization problems. As a matter of fact, training an SVM, a recently developed machine learning system, is itself a quadratic programming problem.

I developed the POOL algorithm, described below, to find the solution generating the minimum SSE (and therefore the maximum likelihood) under the monotonicity constraints. Very recently, I have discovered some existing literature describing an approach called isotonic or monotonic regression 67, but most of this work is focused on one-dimensional problems. There is also earlier literature focusing on the total order case; the pool adjacent violators algorithm (PAVA) 68 and monotonic smoothing 69 are such examples. Although some of these reports pointed out that the extension of this problem to multiple dimensions could be framed as convex optimization, the emphasis of the literature in this field seems to focus primarily on one-dimensional problems.

2 In this specific case of estimating class probabilities, I use $\vec{x}$ to denote the instances and p to denote the vector to which I try to assign values to minimize SSE.

Compared with convex optimization in its general form, the present case is very special. First, I want to optimize a summation of terms, each consisting of only one component of the vector, i.e., there are no cross products between different components of the vector. This feature leads to a very simple gradient of the target function S, such that after rescaling coordinates, the direction toward the global optimum of S can be determined locally by choosing each component variable $p_i$ to optimize the ith term in S, subject to the same constraints being met.

The constraints are special, too. Each constraint is a linear inequality constraint containing two terms, in addition to the implicit constraints of $0 \le p_i \le 1$ implied by the $p_i$ being probabilities. In formal terms, this is a sparse problem.

Another special feature of the present problem is that, by a simple scaling of coordinates in space, the negative gradient vector at any point points directly to the unconstrained global optimal point, i.e., if one knows the "best improving" direction at one point, following that direction (in a straight line) will lead to the optimal point.

We start from a point where all constraints are active, where active means that an inequality constraint is met by equality, i.e., the constraint $p_i \le p_j$ is actually satisfied by $p_i = p_j$. Also, all the constraints are linear, and since the gradient never changes direction along the path from point x to the unconstrained global optimal point O, we have a very special property: if the "best improving" direction moves "away" from an active constraint, that constraint will not become active again in the optimized solution. It is also true that if the "best improving" direction "runs into" an active constraint, the constraint will still be active in the final optimum solution. This makes it possible to determine the active constraints in the final solution at the starting point, and once the active constraints in the final solution are determined, it is easy to calculate the exact $\vec{p}$ that gives the constrained optimal value of the objective function $f(\vec{p})$. Since it is known that the optimum solution is achieved with these active constraints active, meaning the corresponding probabilities are equal, all one needs to do is to partition the data into pools having the same (average) probabilities as required by the active constraints, and thereby obtain the exact solution.

These special conditions make it possible to develop special algorithms, like the one presented in the next subsection, to efficiently and accurately solve this subclass of convex optimization problems. In a more general convex optimization problem without these special conditions, a solution can only be approximated by numerical methods to a certain degree of accuracy, and the solution process typically involves re-computing active constraint sets as the algorithm iterates toward a solution.
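For the one-dimensional, totally ordered special case mentioned above, the pooling behavior can be sketched directly; this is the classical pool adjacent violators algorithm, shown only to illustrate the pooling idea that POOL generalizes to partial orders, and the weighting by group size is an assumption chosen to match the SSE objective.

```python
def pava(values, weights):
    """Weighted isotonic regression on a totally ordered sequence (pool adjacent violators).

    values[i]  -- observed proportion (e.g. n_plus / n_total) at position i
    weights[i] -- group size n_total at position i
    Returns the non-decreasing fit minimizing the weighted sum of squared errors.
    """
    # Each block stores [weighted mean, total weight, number of original positions].
    blocks = []
    for v, w in zip(values, weights):
        blocks.append([v, w, 1])
        # Merge backwards while the non-decreasing order is violated.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            w_new = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / w_new, w_new, n1 + n2])
    fit = []
    for m, _, n in blocks:
        fit.extend([m] * n)
    return fit

print(pava([0.0, 0.8, 0.5, 1.0], [2, 1, 1, 2]))  # [0.0, 0.65, 0.65, 1.0]
```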

4.2.4 POOL algorithm.

The input to the algorithm is the training set D and the constraint matrix $C_{n \times m}$. Each column in this matrix corresponds to a single constraint of the form $x_i \le x_j$ and contains two non-zero entries.

This algorithm consists of three steps:

• Determine the active set A, which consists of all the active constraints, at the starting point.

• Given A, compute the corresponding partitioning (pools) of data.

• For each pool, compute the average across all data in the pool and assign this common average

value to each instance in the pool.

POOL(D, $C_{n \times m}$):
• Initialize the starting point S at the origin.
• Compute $\vec{G} \leftarrow \alpha \, \nabla f(S)$ (in our re-scaled case, $G_i = n_{+\_i} / n_{total\_i}$, where $n_{+\_i}$ and $n_{total\_i}$ are the number of positive samples and the total number of samples for the ith instance).
• Initialize $\vec{\mu} \leftarrow \vec{0}$.
• Until the termination condition* has been met:
  • Compute $\vec{H} \leftarrow C_{n \times m} \, \vec{\mu}$.
  • Compute $\vec{F} \leftarrow \vec{G} - \vec{H}$.
  • Compute $\Delta\vec{\mu} \leftarrow C_{n \times m}^{T} \, \vec{F}$.
  • Update $\vec{\mu} \leftarrow \vec{\mu} + \alpha \, \Delta\vec{\mu}$.
  • If $\mu_i < 0$ then $\mu_i \leftarrow 0$ (i = 1,…,m).
• Build the transitive closure of x:
  • Let each $x_i$ (i = 1,…,n) be in its own set.
  • For each i (i = 1,…,m), if $\mu_i > 0$, look in the constraint matrix $C_{n \times m}$ to find the two non-zero entries $C_{ai}$ and $C_{bi}$ in the ith column and union the sets containing $x_a$ and $x_b$ into a new set, replacing the two previous sets, if they are not already in the same set.
• Go through all the sets built in the above step. Set $p_i$ (i = 1,…,n) to the sum of the $n_+$ over all x in the same set divided by the sum of the $n_{total}$ in that set.

* The termination condition is a threshold set by the user to determine convergence of $\vec{F}$; in our program, the size of $\vec{F}$ is monitored until the difference between two iterations is less than $10^{-9}$. α is the step size controlling the rate of change in moving toward the improving direction; in our case it is set to 0.05.

In the above algorithm, the first four steps compute the active set A by solving the dual problem of determining the $\mu_i$ minimizing $\left| \vec{G} - \sum_{i=1}^{m} \mu_i \vec{C}_i \right|^2$, subject to $\mu_i \ge 0$ for all i. Constrained gradient descent, such as gradient projection, is used to solve this. The constraint $g_i$ is active iff $\mu_i > 0$.

The reason that I can apply the active set A determined at the starting point to the solution point is the fact that I rescale the coordinate system, thus preventing the distortion of the spherical gradient by the different weights of each term in the objective function.

The original objective function is

$SSE = \sum_{i=1}^{n} \left( n_{total\_i} \, p_i^2 - 2 \, n_{+\_i} \, p_i + n_{+\_i} \right)$.   (28)

After applying the appropriate transformation

$p_i' = \sqrt{n_{total\_i}} \; p_i$,   (29)

the objective function becomes

$SSE = \sum_{i=1}^{n} \left( p_i'^2 - 2 \, \frac{n_{+\_i}}{\sqrt{n_{total\_i}}} \, p_i' + n_{+\_i} \right)$,   (30)

which has spherical level surfaces, since all quadratic terms have the same coefficient.

Usually, n and m are large numbers, but $C_{n \times m}$ is sparse because most of its entries are 0. To improve efficiency in storage and computation, I use the indices of the non-zero entries of $C_{n \times m}$ instead of the matrix itself. In practice, the POOL algorithm as implemented in my program runs very fast.
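A compact sketch following the same three phases listed above (dual gradient iteration for the active set, pooling via union of constrained instances, per-pool averaging); the NumPy representation, the fixed iteration cap, the numerical tolerances, and the toy counts are assumptions of this sketch rather than details of the actual program.

```python
import numpy as np

def pool_fit(n_pos, n_total, constraints, alpha=0.05, tol=1e-9, max_iter=200000):
    """Sketch of POOL: monotonicity-constrained class probability estimates.

    n_pos[i], n_total[i] -- positive and total counts for the ith distinct attribute combination
    constraints          -- list of (a, b) pairs encoding p_a <= p_b
    """
    n_pos = np.asarray(n_pos, dtype=float)
    n_total = np.asarray(n_total, dtype=float)
    w = np.sqrt(n_total)
    G = n_pos / w                       # unconstrained optimum in the rescaled coordinates
    # Constraint normals in the rescaled space: p_a - p_b <= 0 -> +1/sqrt(w_a) at a, -1/sqrt(w_b) at b.
    C = np.zeros((len(n_pos), len(constraints)))
    for l, (a, b) in enumerate(constraints):
        C[a, l] = 1.0 / w[a]
        C[b, l] = -1.0 / w[b]
    # Phase 1: projected gradient on the dual variables mu >= 0 (identifies the active set).
    mu = np.zeros(len(constraints))
    for _ in range(max_iter):
        F = G - C @ mu
        step = alpha * (C.T @ F)
        mu = np.maximum(mu + step, 0.0)
        if np.linalg.norm(step) < tol:
            break
    # Phase 2: union-find pooling of instances joined by active constraints (mu > 0).
    parent = list(range(len(n_pos)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for l, (a, b) in enumerate(constraints):
        if mu[l] > 1e-8:
            parent[find(a)] = find(b)
    # Phase 3: each pool gets the ratio of summed positives to summed totals.
    p = np.empty(len(n_pos))
    for root in set(find(i) for i in range(len(n_pos))):
        members = [i for i in range(len(n_pos)) if find(i) == root]
        p[members] = n_pos[members].sum() / n_total[members].sum()
    return p

# Two adjacent groups violate monotonicity (0.8 before 0.25) and get pooled to 5/9.
print(pool_fit([0, 4, 1, 5], [4, 5, 4, 5], [(0, 1), (1, 2), (2, 3)]))
```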

4.2.5 Proof that the POOL algorithm finds the minimum SSE.

In this subsection, I will prove that the POOL algorithm gives the minimum SSE solution under the monotonicity constraints.

In the present case, the objective function is a quadratic function of x³, the constraint functions $g_i$ are linear functions, and there are no equality constraints $h_j$, so both the necessary and the sufficient K.K.T. conditions hold. This also implies that if one can find $\mu_i \ge 0$ (i = 1,…,m) satisfying the above two equations, one finds the global minimum $x^*$.

Since there are no equality constraints in the present problem, one does not have to consider the $\nu_j$ and the last term in the K.K.T. equations (23) and (25) above. Here I show my approach to find the $\mu_i$.

3 In order to make the proof consistent with the form stated in section 4.2.2, I use x instead of p as in section 4.2.3; the function value we are minimizing is still SSE.

First, notice that all of the constraints can be put into two categories: one is $0 \le x_i \le 1$ for all i = 1,…,n; the other is $x_i \le x_j$ for some i and j. The first category is automatically satisfied from the starting point to the end during the optimization process. There is an interesting feature of the m constraints of the second category: all the bounding hyperplanes of the feasible region intersect at a line where the $x_i$ are equal for all i = 1,…,n. Our algorithm takes one point on this line, where all $x_i$ are equal to 0, as the starting point.

The outward pointing normals of the m constraints at the starting point form a convex cone. There is also a vector $\vec{G}$ at that point, pointing to the global minimum of the objective function when there are no constraints. $\vec{G}$ is the negative gradient of the objective function at x.

As mentioned earlier in 4.2.3, the present problem has the special feature that, by a simple scaling of coordinates in space, the negative gradient vector $\vec{G}$ becomes a constant vector $\vec{O}$ (pointing to where the unconstrained minimum of f(x) is located) minus the vector $\vec{X}$, where $\vec{X}$ is the vector from the origin to the point x,

$\nabla f(x) = \vec{X} - \vec{G}$.   (31)

So the key point is to find the active constraints in the final solution at the starting point S. Without loss of generality, for the sake of convenience in stating the proof, we assume the starting point S is located at the origin. As Figure 4.1 shows, there are three cases based on where $\vec{G}$ points.

[Figure 4.1(a)] [Figure 4.1(b)] [Figure 4.1(c)]

Figure 4.1. Three cases of $\vec{G}$ in relation to the convex cone of constraints. Define the convex cone as $\{\vec{V} \mid \vec{V} = \sum_i w_i \vec{C}_i \text{ with } w_i \ge 0 \; \forall i\}$. $\vec{G}$ is the negative gradient of the objective function at the starting point S, pointing to the global optimum $O^*$. $\vec{C}_1$ and $\vec{C}_2$ are the constraint vectors forming the convex cone. $\vec{H}$ is the projection of $\vec{G}$ onto the convex cone formed by the constraint vectors, and $x^*$ is the solution that gives the constrained optimum of the objective function. 4.1(a) shows the case where the unconstrained global minimum is located in the feasible region; 4.1(b) shows the case where the unconstrained global minimum is located inside the convex cone formed by the constraint normals; 4.1(c) shows the case where the unconstrained global minimum is located outside both the feasible region and the convex cone formed by the constraint normals.


Figure 4.1(a) shows the special case where the unconstrained optimum O is located in the feasible region, meaning $\vec{O}^*$ points at O. This is the simplest scenario. All one needs to do to compute $x^*$ is to add $\vec{G}$ to $\vec{S}$.

It is trivial to show that $x^*$ satisfies the sufficient K.K.T. condition by letting $\mu_i = 0$ (i = 1,…,m), since $x^*$ is where the unconstrained global optimum O is, so $\nabla f(x^*)$ is 0. Another way to look at it is that a global optimum is always a local optimum.

Figure 4.1(b) shows another special case, where all constraints have to be active to reach the minimum and the starting point S happens to be the optimal point $x^*$; in other words, there is a non-negative assignment of the $\mu_i$ (i = 1,…,m) that makes $\vec{G} = \sum_{i=1}^{m} \mu_i \vec{C}_i$, where $\vec{C}_i$ is the outward normal of the half-space defined by the ith inequality constraint $g_i(x) \le 0$, i.e., $g_i(x) = \vec{C}_i \cdot \vec{x} \le 0$, and $x^* = \vec{X}^* + \vec{S}$. Clearly $x^*$ is feasible because it is the origin and $\vec{C}_i \cdot \vec{X}^* = 0$. The proof that $x^*$ satisfies the sufficient K.K.T. condition is the same as the more general proof for Figure 4.1(c), using the fact that $x^* = S$ and $\vec{X}^* = 0$.

Figure 4.1(c) is the general case, in which the negative gradient $\vec{G}$ cannot be expressed as a linear combination of the $\vec{C}_i$ with a non-negative assignment of coefficients $\mu_i$ (i = 1,…,m). In this case, among all possible non-negative assignments of the $\mu_i$ giving $\vec{H} = \sum_{i=1}^{m} \mu_i \vec{C}_i$, one finds the specific one that gives the minimum distance between the tip of this $\vec{H}$ vector and O. That is, we seek $\vec{H}$ of this form minimizing the length of the vector $\vec{G} - \vec{H}$. We then set $x^* = S + \vec{G} - \vec{H}$.

First, we show that $x^*$ is feasible, i.e., $\vec{C}_i \cdot \vec{X}^* = g_i(x^*) \le 0$ (i = 1,…,m), where $\vec{X}^*$ denotes the vector from the origin (the starting point S) to $x^*$.

Since $x^*$ is located at $S + \vec{G} - \vec{H}$ and S is the origin, we get $\vec{X}^* = \vec{G} - \vec{H}$. Note that the $\mu_i$ are special in that they are non-negative and they give the minimum length of the vector $\vec{G} - \vec{H}$, which is the same as the length of $\vec{X}^*$.

We have

$\frac{\partial (\vec{X}^* \cdot \vec{X}^*)}{\partial \mu_i} = 0$ when $\mu_i > 0$   (32)

and

$\frac{\partial (\vec{X}^* \cdot \vec{X}^*)}{\partial \mu_i} \ge 0$ when $\mu_i = 0$ (i = 1,…,m).   (33)

Since

$\vec{X}^* = (\vec{G} - \vec{H}) = \left( \vec{G} - \sum_{i=1}^{m} \mu_i \vec{C}_i \right)$,   (34)

we have

$\frac{\partial (\vec{X}^* \cdot \vec{X}^*)}{\partial \mu_i} = 2 \, \vec{C}_i \cdot \left( \sum_{i=1}^{m} \mu_i \vec{C}_i - \vec{G} \right) = -2 \, \vec{C}_i \cdot \vec{X}^*.$   (35)

Substituting (35) into (32) and (33), we get

$\vec{C}_i \cdot \vec{X}^* = 0$ when $\mu_i > 0$   (36)

and

$\vec{C}_i \cdot \vec{X}^* \le 0$ when $\mu_i = 0$ (i = 1,…,m).   (37)

Combining (36) and (37), we get $g_i(x^*) = \vec{C}_i \cdot \vec{X}^* \le 0$ (i = 1,…,m), i.e., $x^*$ is feasible.

Next, we show that $\nabla f(x^*) + \sum_{i=1}^{m} \mu_i \nabla g_i(x^*) = 0$. The following has already been shown in (31):

$\nabla f(x^*) = \vec{X}^* - \vec{G} = -\vec{H}$   (31)

and

$\sum_{i=1}^{m} \mu_i \nabla g_i(x^*) = \sum_{i=1}^{m} \mu_i \nabla (\vec{C}_i \cdot \vec{X}^*) = \sum_{i=1}^{m} \mu_i \vec{C}_i = \vec{H}$.   (38)

Thus, we have

$\nabla f(x^*) + \sum_{i=1}^{m} \mu_i \nabla g_i(x^*) = 0$.   (39)

Last, we show that $\mu_i \, g_i(x^*) = 0$ (i = 1,…,m). Geometrically, since $\vec{X}^* = \vec{G} - \vec{H}$ has the minimum size, $\vec{X}^*$ must be orthogonal to $\vec{H}$, i.e., $\vec{H} \cdot \vec{X}^* = 0$. Substituting $\vec{H}$ with $\sum_{i=1}^{m} \mu_i \vec{C}_i$, we have

$\left( \sum_{i=1}^{m} \mu_i \vec{C}_i \right) \cdot \vec{X}^* = \sum_{i=1}^{m} \mu_i (\vec{C}_i \cdot \vec{X}^*) = \sum_{i=1}^{m} \mu_i \, g_i(x^*) = 0,$   (40)

and given that $g_i(x^*) \le 0$ and $\mu_i \ge 0$ (i = 1,…,m), (40) holds only when $\mu_i \, g_i(x^*) = 0$ for all i from 1 to m.

This completes the proof that $x^*$ satisfies the sufficient K.K.T. condition, and verifies that $x^*$ is the minimum of the objective function under the monotonicity assumptions.

I could have used the $x^*$ described above to find the probability assignment of every data point in the table that gives the minimum sum of squared error, but in the present program I actually use the $\mu_i$ to find the active constraints that the optimal solution should satisfy and compute $x^*$ based on that, by combining together all instances in the groups under equality constraints and assigning the corresponding p values to be the total positive instances observed divided by the total instances in the group. This procedure is more accurate than computing from $S + \vec{G} - \vec{H}$. This way of computing $x^*$ generates the same $x^*$ as before, since it gives the minimum SSE assignments of p for each group of active constraints.

4.2.6 Maximum likelihood vs. minimum SSE.

Theorem 1. In this problem of finding estimated probabilities under monotonicity constraints, a value of $\vec{p}$ that minimizes SSE will also maximize the likelihood L subject to the same monotonicity assumption.⁴

While this is well known when there are no constraints, it is not true under arbitrary constraints; what I show is that it also holds under the particular constraints used here.

As defined earlier in (22), the likelihood function L is defined as follows for binary problems:

$L(\vec{P}) = \prod_{i=1}^{n} p_i^{c_i} (1 - p_i)^{1 - c_i} \, P(\vec{X}_i = \vec{x}_i) \propto \prod_{i=1}^{n} p_i^{c_i} (1 - p_i)^{1 - c_i}$.   (22)

Since the present problem only involves assigning the $p_i$, given $\vec{X}_i = \vec{x}_i$, I will only use the $\prod_{i=1}^{n} p_i^{c_i} (1 - p_i)^{1 - c_i}$ part from now on for $L(\vec{P})$.

Adopting the convention that $0^0 = 1$, each factor of $L(\vec{P})$ can be rewritten as:

$L_i(p_i, c_i) = p_i^{c_i} (1 - p_i)^{1 - c_i}$  $(0 \le p_i \le 1)$.

Before we prove that the solution from minimizing SSE also maximizes $L(\vec{P})$, I prove the following lemma.

Lemma 1. Suppose we wish to maximize a function having the form

$F(\vec{P}) = F(p_1, \ldots, p_m, p_{m+1}, \ldots, p_n) = F_1(p_1, \ldots, p_m) \cdot F_2(p_{m+1}, \ldots, p_n)$,

where $\vec{P}$ is partitioned into two groups, $p_1, \ldots, p_m$ and $p_{m+1}, \ldots, p_n$, subject to a given set of constraints. If an assignment of $\vec{P}$, namely $(p_1 = p_1^*, \ldots, p_m = p_m^*, p_{m+1} = p_{m+1}^*, \ldots, p_n = p_n^*)$, maximizes $F_1$ and $F_2$ under the same constraints, then $\vec{P}^*$ maximizes $F(\vec{P})$ under the same constraints.

This lemma is readily proved by contradiction. If there is another assignment $\vec{P}'$ under the same constraints that gives $F(\vec{P}') > F(\vec{P}^*)$, then give the same assignments of $\vec{P}'$ to $F_1$ and $F_2$. Then at least one of the following has to be true: $F_1(\vec{P}') > F_1(\vec{P}^*)$ or $F_2(\vec{P}') > F_2(\vec{P}^*)$. But if either one of them is true, the assumption that $\vec{P}^*$ maximizes $F_1$ and $F_2$ under those constraints is violated.

4 Since this is to prove that the same solution minimizing SSE will also maximize L, and I use this in the special probability estimation problem presented here, I once again switch to $\vec{p}$ and P to denote the vector variable and the vector space, respectively, instead of using x and X as earlier.

With Lemma 1 proved, I can prove Theorem 1. Based on the values of $p_i$ from the minimum SSE solution $\vec{P}^*$, I break equation (22) into two parts after moving the factors with $p_i = 0$ or $p_i = 1$ to the end of the equation:

$L(\vec{P}) = L_1(\vec{P}) \cdot L_2(\vec{P})$, where

$L_1(\vec{P}) = \prod_{i=n'+1}^{n} p_i^{c_i} (1 - p_i)^{1 - c_i}$ where $p_i = 0$ or $p_i = 1$,

and

$L_2(\vec{P}) = \prod_{i=1}^{n'} p_i^{c_i} (1 - p_i)^{1 - c_i}$ where $0 < p_i < 1$.

Clearly, since $\vec{P}^*$ is the minimum SSE solution under the same constraint set as the maximum-likelihood problem, if I can show that $\vec{P}^*$ maximizes both $L_1(\vec{P})$ and $L_2(\vec{P})$, I will have proved Theorem 1 based on Lemma 1.

Showing that $\vec{P}^*$ maximizes $L_1(\vec{P})$ is easy. Notice that in $L_1(\vec{P})$, $p_i$ can only be 1 or 0, and the only way $p_i$ can be 1 in the minimum SSE solution is when $c_i$ has the same value, namely 1; the same is true when $p_i$ is 0. Substituting $p_i$ and $c_i$ with either both 0 or both 1 gives an $L_1(\vec{P})$ value of 1, which is already the unconstrained maximum, and obviously it is the constrained maximum also.

I will show in the remainder of this section that $\vec{P}^*$ also maximizes $L_2(\vec{P})$ under the same constraint set.

Since all the $p_i$'s in $L_2(\vec{P})$ have values strictly larger than 0 and less than 1, a negative-log-likelihood function may be defined as:

$G(\vec{P}) = -\log L_2 = -\sum_{i=1}^{n'} c_i \log p_i - \sum_{i=1}^{n'} (1 - c_i) \log(1 - p_i)$.   (41)

Since for $0 < x, y \le 1$ we have $-\tfrac{1}{2}(\log x + \log y) \ge -\log\!\left(\tfrac{x + y}{2}\right)$, $-\log p_i$ is a convex function, as is $-\log(1 - p_i)$. Since $G(\vec{P})$ is a weighted sum of convex functions with positive weights, it is a convex function, too.

Taking the derivative of G with respect to $p_i$, we get:

$\frac{\partial G}{\partial p_i} = -c_i \cdot \frac{1}{p_i} + (1 - c_i) \cdot \frac{1}{1 - p_i} = \frac{p_i - c_i}{p_i (1 - p_i)}$.   (42)

Notice that the derivative of the SSE function

$F = \sum_{i=1}^{n} (c_i - p_i)^2$   (43)

is

$\frac{\partial F}{\partial p_i} = 2 (p_i - c_i)$.   (44)

Comparing (44) with (42), (42) can be rewritten as follows, with $\vec{P}$ and $\vec{C}$ representing the vectors $(p_1, \ldots, p_n)$ and $(c_1, \ldots, c_n)$, respectively, and $\mathrm{diag}\!\left(\frac{1}{2 p_i (p_i - 1)}\right)$ denoting the diagonal matrix with these entries:

$\frac{\partial G}{\partial \vec{P}} = -\,\mathrm{diag}\!\left(\frac{1}{2 p_i (p_i - 1)}\right) \left( 2 (\vec{P} - \vec{C}) \right) = -\,\mathrm{diag}\!\left(\frac{1}{2 p_i (p_i - 1)}\right) \frac{\partial F}{\partial \vec{P}}$.   (45)

If we can find $\mu_{G\_l}$ and a specific $\vec{P}^*$ satisfying the following K.K.T. conditions for G:

$\nabla G(\vec{P}^*) + \sum_{l=1}^{m} \mu_{G\_l} \nabla g_l(\vec{P}^*) = 0$   (46)

and

$\mu_{G\_l} \, g_l(\vec{P}^*) = 0$,   (47)

then we know that $\vec{P}^*$ is the optimal solution corresponding to the minimum of the negative-log-likelihood function $G(\vec{P}^*)$, and hence to the maximum of the likelihood function $L_2(\vec{P}^*)$.

Now, we will show that the $\vec{P}^*$ obtained by solving the minimum SSE problem with the same set of constraints $g(\vec{P})$, along with the $\mu_G$ constructed from the $\mu_F$ corresponding to the minimum SSE solution, does satisfy the K.K.T. conditions (46) and (47).

First, since $\vec{P}^*$ is the solution minimizing the SSE function, based on the K.K.T. necessary condition there is a corresponding $\mu_F$ satisfying

$\nabla F(\vec{P}^*) + \sum_{l=1}^{m} \mu_{F\_l} \nabla g_l(\vec{P}^*) = 0$   (48)

with

$\mu_{F\_l} \, g_l(\vec{P}^*) = 0$.   (49)

From (49), we have either

$\mu_{F\_l} = 0$

or

$g_l(\vec{P}^*) = 0$,

this latter case meaning that the inequality constraint is actually met with equality. Without loss of generality, let the lth constraint function $g_l$ be $g_l(\vec{P}) = p_{a(l)} - p_{b(l)}$, where a(l) and b(l) correspond to the indices of the variables p appearing in the constraint according to the given partial order. Since $g_l(\vec{P}^*) = 0$, we let $p_{a(l)} = p_{b(l)} = p_l$. Now we can construct

$\mu_{G\_l} = -\frac{1}{2 p_l (p_l - 1)} \, \mu_{F\_l}$  (l = 1,…,m).   (50)

Since F and G have the same constraint set, $\nabla g_{G\_l}(\vec{P}^*) = \nabla g_{F\_l}(\vec{P}^*)$, and, component by component,

$\left[\nabla G(\vec{P}^*) + \sum_{l=1}^{m} \mu_{G\_l} \nabla g_{G\_l}(\vec{P}^*)\right]_i = -\frac{1}{2 p_i (p_i - 1)} \frac{\partial F}{\partial p_i} - \sum_{l=1}^{m} \frac{\mu_{F\_l}}{2 p_l (p_l - 1)} \left[\nabla g_{G\_l}(\vec{P}^*)\right]_i$.

As mentioned earlier, $p_l$ equals the two p's appearing in $g_{F\_l}$ wherever $\mu_{F\_l} \ne 0$, so the above equation can be rewritten as:

$\left[\nabla G(\vec{P}^*) + \sum_{l=1}^{m} \mu_{G\_l} \nabla g_{G\_l}(\vec{P}^*)\right]_i = -\frac{1}{2 p_i (p_i - 1)} \left(\frac{\partial F}{\partial p_i} + \sum_{l=1}^{m} \mu_{F\_l} \left[\nabla g_{F\_l}(\vec{P}^*)\right]_i\right)$  (i = 1,…,n').   (51)

Since $\vec{P}^*$ and the corresponding $p_i$ are the assignments giving the minimum of the SSE function F, from (39) we have

$\frac{\partial F}{\partial p_i} + \sum_{l=1}^{m} \mu_{F\_l} \left[\nabla g_{F\_l}(\vec{P}^*)\right]_i = 0$.   (52)

Substituting (52) into (51), we have

$\nabla G(\vec{P}^*) + \sum_{l=1}^{m} \mu_{G\_l} \nabla g_{G\_l}(\vec{P}^*) = 0$.   (53)

Since the functions G and F share the same constraint set $g(\vec{P})$,

$\mu_{G\_l} \, g_l(\vec{P}^*) = -\frac{1}{2 p_l (p_l - 1)} \, \mu_{F\_l} \, g_l(\vec{P}^*) = 0$.   (54)

Equations (53) and (54) show that the $\vec{P}^*$ minimizing the SSE function F also minimizes the negative-log-likelihood function G and therefore maximizes the likelihood function $L_2(\vec{P})$.

r * r Notice the subtlety that the constraint set we used to show P maximized L2 (P) is actually a subset of the original constraint set, because some of the constraints in the original problem do not play roles in r maximizing L2 (P) since they involve some variables of pi with value of 1 or 0, which are not in

r r * r L2 (P) at all. This is not an issue, because we show P already maximized L2 (P) in a less constrained r manner, L2 (P) cannot get a more optimized value with a tighter constraint. On the other hand, since r P* is actually a solution from minimum SSE under the original constraints, it will not violate any r constraints in the original constraint set but not in the one we used to optimize L2 (P) .

Combining all the results above with Lemma 1 completes the proof of Theorem 1.

By Theorem 1, the solution obtained by minimizing the sum of squared errors with the POOL algorithm is the same as the maximum likelihood solution.

4.3 Additional computational steps.

4.3.1 Preprocessing.

In general, $C_{n \times m}$ is given by specific problem requirements. In the present study it is derived from the THEMATICS principles. Since the data in this problem are sparse and n is large, instead of writing out all the constraints, we introduce the idea of immediate successor and immediate predecessor:

Definition 4: Immediate successor.

y is an immediate successor of x if and only if $x \prec y$ and there is no z such that $x \prec z \prec y$.

There is a trivial linear-time algorithm that, given the monotonicity assumptions, can find all the immediate successors of a particular instance with particular attribute values in a single scan of all the instances and their attributes. With the immediate successors of each instance known, the transitive closure contains the whole monotonicity assumption. We use this fact to build $C_{n \times m}$ in quadratic time. For storage efficiency, we could even store $C_{n \times m}$ in a single $m \times 2$ array, if all scaling factors are 1, by just storing the indices of the instances with 1 and -1 as their coefficients, respectively; in the present case, with a different scaling factor for each cell, we use three $m \times 2$ arrays, one storing the indices and the other two storing the scaling factors.
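As an illustration of this preprocessing step, the following Python sketch builds the immediate-successor pairs under a coordinate-wise partial order and collects one inequality constraint per pair. The function names (immediate_successors, build_constraint_indices) and the naive inner minimality check are my own choices for clarity; the single-scan linear-time variant described above would avoid that inner loop.

```python
import numpy as np

def immediate_successors(i, X):
    """Indices j such that X[j] is an immediate successor of X[i] under the
    coordinate-wise partial order: X[i] <= X[j] componentwise, X[i] != X[j],
    and no third instance lies strictly between them."""
    n = len(X)
    succ = [j for j in range(n)
            if j != i and np.all(X[j] >= X[i]) and np.any(X[j] > X[i])]
    return [j for j in succ
            if not any(k != j and np.all(X[j] >= X[k]) and np.any(X[j] > X[k])
                       for k in succ)]

def build_constraint_indices(X):
    """One inequality constraint P(i) <= P(j) per immediate-successor pair,
    stored as an m x 2 index array as described in the text."""
    pairs = [(i, j) for i in range(len(X)) for j in immediate_successors(i, X)]
    return np.array(pairs, dtype=int).reshape(-1, 2)

# toy example with four instances in a 2-D attribute space
X = np.array([[0.1, 0.2], [0.3, 0.2], [0.1, 0.5], [0.4, 0.6]])
print(build_constraint_indices(X))
```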

4.3.2 Interpolation.

If the training set does not have an instance with a specific attribute value, an interpolation scheme is

needed to estimate the class probability with this attribute value. There are some required properties this

interpolation scheme should have. One is that all the interpolated class probabilities should conform to the

monotonicity assumption over the whole virtual table. In addition to that, there are also some desirable

properties. For example, whenever possible, the interpolation should reflect the monotonicity assumption strictly. In other words, where the original monotonicity assumption specifies that P(C=c|X=x) ≤ P(C=c|Y=y) when x ≤ y, if x is strictly less than y we would prefer the interpolated values to satisfy the strict inequality P(C=c|X=x) < P(C=c|Y=y), rather than merely conforming to the non-strict monotonicity assumption as a whole. Another desirable property is that, whenever possible, P(C=c|X) be continuous over X.

After testing some interpolation schemes in the present application, we found that a linear interpolation between the maximum and minimum allowed P, based on the Manhattan distances from instance X to its limiting predecessor and successor, gives good results in practice.
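A minimal sketch of such an interpolation, under the assumption that the limiting predecessor and successor and their probability estimates are already known; the helper name and the exact form of the weighting are illustrative, not the exact implementation used in this work.

```python
import numpy as np

def interpolate_probability(z, pred, succ, p_pred, p_succ):
    """Linearly interpolate a class probability for an unseen point z lying
    between a limiting predecessor `pred` (probability p_pred) and a limiting
    successor `succ` (probability p_succ) in the partial order.
    The weight is based on Manhattan (L1) distances, so the result stays
    within [p_pred, p_succ] and respects the monotonicity assumption."""
    d_pred = np.sum(np.abs(np.asarray(z) - np.asarray(pred)))
    d_succ = np.sum(np.abs(np.asarray(succ) - np.asarray(z)))
    if d_pred + d_succ == 0:          # z coincides with both bounds
        return 0.5 * (p_pred + p_succ)
    w = d_pred / (d_pred + d_succ)    # closer to succ -> larger weight
    return (1 - w) * p_pred + w * p_succ

# toy example: interpolate between bounds at (0.2, 0.3) and (0.6, 0.7)
print(interpolate_probability([0.3, 0.4], [0.2, 0.3], [0.6, 0.7], 0.1, 0.5))
```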

The result of applying this new POOL method to the protein active site prediction problem will be given

in the next chapter.

Chapter 5

Applying the POOL Method with THEMATICS in Protein Active Site Prediction.

5.1 Introduction.

In Chapter 3, I reported the use of an SVM with THEMATICS to predict protein active sites based on protein 3D structure alone and introduced one way to expand the original THEMATICS method to include non-ionizable residues in the prediction. Although the SVM method outperforms all prior structure-based methods, including other approaches using THEMATICS, and achieves similar or slightly better performance than methods using both structural and sequence comparison information, there is still room for further improvement of the method, as we briefly discussed in section 3.7.

One straightforward improvement is the addition of more information about each residue to the learning system. In the study reported in this chapter, in addition to THEMATICS features, I add other features, such as the size of the cleft in which a surface residue resides and the conservation of the residue among proteins of similar sequence, and examine how helpful they are in improving the sensitivity and the specificity of the prediction.

Another improvement comes from changing a classification problem into a ranking problem. In a

classification problem, the results are binary labels of either positive or negative, with nothing in between.

One of the disadvantages of this approach is that it is less convenient, or in some cases even impossible, for the users to fine-tune their results if they want to improve the sensitivity at the cost of lowering the specificity, or vice versa. In this study, the result is a ranked list of all residues in a protein based on their likelihood of being in active sites. Users can choose a cut-off best suited to their needs in different situations. One previous method, called PCats 21, generates such a rank-ordered list of probabilities.

Because my method actually estimates probabilities, I can easily combine probability estimates from different methods using the chaining method I introduced in Section 4.1.5. This overcomes one of the common hurdles encountered when including more features in the system, as described in the paragraph above. This study shows that the combined probability estimates with additional features work better than a single probability estimate.

In the SVM approach, there is a 9 Å cut-off used to form SVM-positive residues into clusters as active site residue candidates. This threshold seems to give good predictions, but it is arbitrary and could be optimized in a more systematic way. In this study, I eliminated this step and used features that capture the degree of perturbation of the titration curves of nearby residues. This approach is more systematic and is optimized over the whole process.

By combining THEMATICS with the power of the POOL method, which enforces the monotonicity hypothesis in the learning system, I achieve substantially higher sensitivity and specificity simultaneously than the SVM method, which I had already shown in Chapter 3 to be the best among the 3D-structure-based methods. Performance can be improved further with other 3D-structure-based features, including the

size of the cleft in which surface residues reside. Performance can also be improved using sequence

conservation scores for individual residues, obtained from a sequence alignment of proteins of similar

sequence, provided there are enough such proteins. Note that this latter enhancement turns a purely 3D

structure based method into a sequence and structure based method.

A set of 64 different proteins from the CatRes (CSA) database is used to compare the performance of the

different methods for functional site prediction. A more complete selection of 160 proteins from the

CatRes (CSA) database is used to further confirm the advantage gained by adding extra features beyond THEMATICS to form an improved prediction system. In this chapter, I

also improve the way I extend the method to predict non-ionizable active site residues. In addition, we use

ROC curves to compare the performance between different methods and RFR (Recall-Filtration Ratio) curves to guide potential users in setting the actual cut-off in practice.

5.2 THEMATICS curves and other features used in the POOL method

In the work presented in Chapter 3, I used moments of the first derivative curves of the titration curves.

These were defined analogously to the moments of density functions, as these first derivative curves are

essentially probability distribution functions 53. One aspect of these prior approaches such as 24, 45, 54 is the use of spatial clustering as a way of reducing the number of apparent false positives. That is, residues are reported as positive by the method if and only if they are in sufficiently close spatial proximity to at least one other residue identified as a candidate by the outlier detector in Ko's and Wei's approach or the SVM in Tong's approach. The overall identification process in these prior approaches thus involves two stages, where the first stage makes a binary yes/no decision on each residue. In this new approach I do not begin with such a binary decision because it is my goal to assign to every (ionizable) residue a probability that it is an active-site residue. Thus, as an alternative to this clustering approach, I instead consider what I call environment features. For a given scalar feature x, I define the value of the environment feature x_env(r) for a given residue r to be

$$x_{env}(r) = \frac{\sum_{r' \neq r} w(r')\, x(r')}{\sum_{r' \neq r} w(r')} \qquad (55)$$

where r' is an ionizable residue whose distance d(r', r) to residue r is less than 9 Å, and the weight w(r') is given by 1/d(r', r)².
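As a concrete illustration of equation (55), the sketch below computes the environment feature of one raw scalar feature (e.g. µ3) for a single residue. Representing each residue by a single coordinate and the fallback value when no neighbor lies within the cut-off are assumptions made for this sketch only.

```python
import numpy as np

def environment_feature(i, coords, values, radius=9.0):
    """x_env for residue i, per equation (55): the 1/d^2-weighted average of the
    raw feature values of all other ionizable residues within `radius` angstroms.
    Each residue is represented here by a single representative coordinate."""
    num, den = 0.0, 0.0
    for j, (c, x) in enumerate(zip(coords, values)):
        if j == i:
            continue
        d = float(np.linalg.norm(np.asarray(c) - np.asarray(coords[i])))
        if 0.0 < d < radius:
            w = 1.0 / d ** 2
            num += w * x
            den += w
    return num / den if den > 0 else 0.0  # fallback when no neighbor (assumption)

# toy example: three residues with mu3 values 1.0, 2.0, 4.0
coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
mu3 = [1.0, 2.0, 4.0]
print(environment_feature(0, coords, mu3))
```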

In this study, I use the same features µ3 and µ4 used in the Ko approach, along with the additional features µ3^env and µ4^env, as an alternative to the clustering stage. Thus every ionizable residue in any protein is assigned the 4-dimensional feature vector (µ3, µ4, µ3^env, µ4^env), which is the THEMATICS feature vector for ionizable residues.

Although µ3 and µ4 themselves are only defined for ionizable residues, the environment features µ3^env and µ4^env are well-defined for non-ionizable residues. For non-ionizable residues, the THEMATICS features I use are the 2-dimensional feature vectors (µ3^env, µ4^env).

There is one additional subtlety that all THEMATICS-based methods have had to address, and the current

approach is no exception: the need for some kind of normalization across proteins. In Ko’s and Wei’s

approach, the raw features are individually transformed into Z-scores by subtracting the within-protein

mean and dividing by the within-protein standard deviation. Similarly, in my SVM approach, the raw

features are likewise transformed into robust Z-scores by subtracting the within-protein median and

dividing by the within-protein interquartile distance. Here, I apply yet another within-protein feature

transformation to each feature, which I call rank normalization. Within each protein, each feature value is

ranked from lowest to highest in that protein, and each data point is then assigned a number uniformly

across the interval [0,1] based on the rank of that feature in that protein. The highest value for that feature

is thus transformed to 1, and the lowest value is transformed to 0. Note that unlike the use of Z-scores or

robust Z-scores, this is a nonlinear transformation of the raw feature values. For each scalar feature x,

denote its within-protein rank-normalized value as $\tilde{x}$, which by definition lies in [0,1]. I extend the use of this notation to feature vectors in the obvious way; that is, $\tilde{x} = (\tilde{x}_1, \tilde{x}_2, \tilde{x}_3, \tilde{x}_4)$.
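A minimal sketch of the within-protein rank normalization described above; how ties are handled is not specified in the text, so breaking them by position in the sort is an assumption of this sketch.

```python
import numpy as np

def rank_normalize(values):
    """Within-protein rank normalization: the lowest value maps to 0, the highest
    to 1, and intermediate values are spaced uniformly over [0, 1] by rank.
    Ties are broken by position in the sort (an assumption)."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values)             # indices from lowest to highest
    ranks = np.empty_like(order)
    ranks[order] = np.arange(len(values))  # rank of each original entry
    return ranks / (len(values) - 1.0)     # uniform spacing over [0, 1]

# toy example: five raw mu3 values for one protein
print(rank_normalize([0.7, -1.2, 3.4, 0.0, 2.1]))
# -> [0.5  0.   1.   0.25 0.75]
```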

Note that the use of within-protein rank normalization does not affect the within-protein partial order used in the THEMATICS Principles, which I introduced in Section 2.3.3. That is, $x \preceq y$ holds for raw feature vectors x and y in the same protein if and only if $\tilde{x} \preceq \tilde{y}$. However, when I combine data from multiple

proteins for training and use the results to make predictions for new proteins, as I describe in more detail

later, this actually amounts to making an even stronger monotonicity assumption across proteins in which

the within-protein rankings replace the raw feature values. This is obviously a more controversial

assumption, but some such approach is required to be able to train on multiple proteins and make

predictions for novel proteins, and, as I show below, this approach appears to give good results.

As discussed in Chapter 2, in addition to THEMATICS, there are other methods for predicting active site

residues. They use features such as geometric position of residues, amino acid type information and

sequence conservation, which are very different from the electrostatic information I use in THEMATICS.

It is reasonable to speculate that if I combine these features with THEMATICS features, I may get better

performance. I tested this hypothesis in this study.

In addition to THEMATICS features, I try the cleft feature, which is a number I assigned for every

residue in a given protein based on the rank of the size of the cleft to which the residue belongs. One

special value is assigned to every residue not on the protein surface, and another is assigned to every

residue on the surface but not within any cleft. Ignoring these special values, it is easy to construct the

monotonicity assumption that the larger the cleft to which a residue belongs, the more likely that residue

is to belong to the active site. I can apply POOL on the cleft feature based on this monotonicity

assumption.

ConSurf 27 is a sequence comparison based method that identifies functionally important regions on the surface of a protein of known three-dimensional (3D) structure, based on the phylogenetic relations between its close sequence homologues. If there are more than five homologues (the method is considered reliable if the number of homologues is greater than 10) to the query protein, it can assign a score between 1 and 9 to each residue in the query sequence based on how conserved this residue is among those homologues. The larger the score is, the more conserved the residues are. With some exceptions discussed in Chapter 2, it is commonly believed that the more conserved a residue is, the more likely it is functionally important. This gives me another monotonicity assumption to which I can apply

POOL. I call this the ConSurf feature. In this study, if the protein has more than 10 homologues, I used the scores ConSurf assigns to each residue as their ConSurf feature values. For the proteins with 10 or fewer homologues, I assign 0 as the ConSurf feature values for all their residues. Since I am only interested in the rank list of residues within a protein, rather than across proteins, this special treatment will not affect the final results.

In addition to these features, I also tried features such as residue type and ASA (area of solvent accessibility) of residues in our study but found that including these did not improve the performance. No further details will be given here about those features that did not improve overall performance.

5.3 Performance measurement.

Before presenting the results, I must first decide how to measure the performance of the system. For a standard classification problem, performance is typically measured by recall, false-positive rate and the Matthews correlation coefficient (MCC). Within a specific system, recall and false-positive rate usually affect each other: lowering the false-positive rate will most likely lower recall at the same time, so it only makes sense to report both metrics together. Although MCC is a single metric that measures the overall performance of a classification system, it only measures the performance at one specific setting. If I want to measure the performance of my system at different thresholds, ROC (receiver operating characteristic) curves, which plot recall against the false-positive rate, are the answer. One can compare two systems by comparing their ROC curves. In general, one can say a system giving a higher recall and a lower false-positive rate at the same time outperforms a system giving a lower recall and a higher false-positive rate at that specific setting. If the ROC curve from system A always lies to the upper-left of the ROC curve from system B, one can conclude that system A dominates system B and always outperforms it. Studies have also shown that the area under the ROC curve (AUC) is a very reliable single-value assessment for the evaluation of different machine learning algorithms 70.

In order to generate ROC curves, I need to be able to calculate recall and false-positive rate values, which

come from classification problems. In the POOL system, the result for each protein is a ranked list based

on the probability of a residue being in the active site. A natural way to draw a ROC curve for every

protein is to move the cutoff one residue at a time from the top to the end of the list. The resulting ROC

curve has a staircase shape: only recall increases when an active site residue is encountered and only the false

positive rate increases when a non-active-site residue is encountered.

We define the average specificity (AveS) for each protein in the set:

$$\mathrm{AveS} = \frac{\sum_{r=1}^{N} S(r) \cdot pos(r)}{\text{Number of positive examples}} \qquad (56)$$

where r is the rank, N is the number of residues in the protein, pos(r) is a binary function that indicates whether the residue at rank r is annotated as an active site residue in the reference database (pos(r) = 1) or not (pos(r) = 0), and S(r) is the specificity at cut-off rank r.

It is not hard to see that AveS represents the area under the ROC curve (AUC). This is analogous to AveP, the area under the Recall-Precision curve, used in the information retrieval field. Unlike MCC, AveS is a

single-number measurement of the performance of a classification system over the whole range of

different cutoff settings, rather than at a single setting.
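The following sketch computes AveS for one protein directly from equation (56), given a ranked list of binary labels (1 = annotated active site residue). Taking S(r) to be the fraction of negatives falling below the cut-off rank is my reading of the definition; with this reading AveS equals the AUC, as stated above.

```python
import numpy as np

def average_specificity(labels_by_rank):
    """AveS, equation (56): labels_by_rank holds the 0/1 labels of a protein's
    residues ordered from highest to lowest predicted probability. S(r) is the
    specificity when the list is cut after rank r, i.e. the fraction of
    negatives that fall below the cut-off."""
    labels = np.asarray(labels_by_rank)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        return float('nan')                       # AveS undefined in this case
    total = 0.0
    for r, is_pos in enumerate(labels, start=1):  # r is the 1-based rank
        if is_pos:
            negatives_below = n_neg - np.sum(labels[:r] == 0)
            total += negatives_below / n_neg      # S(r) at cut-off rank r
    return total / n_pos

# toy ranked list: 10 residues, positives at ranks 1, 3 and 6
print(average_specificity([1, 0, 1, 0, 0, 1, 0, 0, 0, 0]))   # ~0.81
```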

Since the AveS is a measurement on a ROC curve for predicting active site residues from only one protein, I need a measurement for the performance on a set of proteins. For this, I use Mean Average

Specificity (MAS), which is the mean of AveS of all the proteins in the set. For all methods that generate a

ranked list as in this study, including the POOL method, and one SVM method, I report the corresponding

Mean Average Specificity (MAS) from all the proteins in the test set. As in all statistical analysis, a

difference between the means does not mean much without further analysis of the statistical significance of the observed difference. In order to test the significance of the observed difference, I

perform the Wilcoxon signed-rank test 71 on AveS from different methods to estimate the probability of

observing such a difference under the null hypothesis that the observed better-performing method is

actually not better than the other.
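As an illustration, per-protein AveS values from two methods could be compared with a paired, one-sided Wilcoxon signed-rank test as follows. The AveS arrays are made up for the example; the one-sided alternative in scipy matches the null hypothesis described above.

```python
import numpy as np
from scipy.stats import wilcoxon

# per-protein AveS values for two methods (made-up numbers for illustration)
aves_method_a = np.array([0.95, 0.91, 0.88, 0.97, 0.93, 0.90, 0.85, 0.96])
aves_method_b = np.array([0.93, 0.90, 0.89, 0.95, 0.91, 0.88, 0.84, 0.94])

# one-sided test of the null hypothesis that method A is not better than B;
# a small p-value is evidence that A outperforms B on a per-protein basis
stat, p_value = wilcoxon(aves_method_a, aves_method_b, alternative='greater')
print(f"Wilcoxon statistic = {stat}, one-sided p-value = {p_value:.4f}")

# number of proteins on which A strictly outperforms B, as reported in the tables
print("A better on", int(np.sum(aves_method_a > aves_method_b)),
      "of", len(aves_method_a), "proteins")
```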

To visually compare the performances from different methods, I generate the averaged ROC curve for

each POOL method by computing the recall and false-positive rate after truncating the list after each of

the positive residues in turn, followed by linearly interpolating the value at each recall value and

computing the mean of the interpolated false-positive rate value from all proteins in the dataset.
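The averaging scheme just described can be sketched as follows: for each protein, the (recall, false-positive-rate) points obtained by truncating after each positive are linearly interpolated onto a common recall grid, and the interpolated false-positive rates are then averaged across proteins. The grid resolution and the use of numpy.interp are my own choices for illustration.

```python
import numpy as np

def roc_points(labels_by_rank):
    """(recall, FPR) pairs after truncating the ranked 0/1 list at each positive."""
    labels = np.asarray(labels_by_rank)
    n_pos, n_neg = labels.sum(), len(labels) - labels.sum()
    pts = [(0.0, 0.0)]
    tp = fp = 0
    for is_pos in labels:
        if is_pos:
            tp += 1
            pts.append((tp / n_pos, fp / n_neg))
        else:
            fp += 1
    return np.array(pts)

def averaged_roc(per_protein_labels, grid=np.linspace(0.0, 1.0, 101)):
    """Mean interpolated false-positive rate at each recall value on the grid."""
    fprs = []
    for labels in per_protein_labels:
        pts = roc_points(labels)
        fprs.append(np.interp(grid, pts[:, 0], pts[:, 1]))  # FPR as a function of recall
    return grid, np.mean(fprs, axis=0)

# two toy proteins
recall_grid, mean_fpr = averaged_roc([[1, 0, 1, 0, 0], [0, 1, 0, 0, 1, 0]])
print(mean_fpr[::25])   # mean FPR at recall 0, 0.25, 0.5, 0.75, 1.0
```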

Although ROC curves and their associated AveS values are good ways to compare performance between

different methods, they do not directly provide a guide for the user to select the cut-off values, because

neither recall nor false-positive rate is known to users unless they happen to know the true positives of

their proteins up front. I use another plot that I call the RFR curve; this is a plot of recall against filtration

ratio. Its purpose is to provide a guide for the user to select their cut-offs. They are almost the same as

the ROC curves except that filtration ratios are used in place of false-positive rates.

Since for every protein in the dataset POOL generates a ranked list of residues based on their probabilities of being in the active site, and from this list one generates a corresponding ROC curve and a corresponding RFR curve, I need to average these curves into a single ROC curve and a single RFR curve for the whole dataset for comparison purposes. Since it is more natural to ask for any given method what its expected false-positive rate is for given values of recall, this is what I use for the averaged ROC curve.

Another important fact about ROC curves is that there need be no prior commitment to how specific classifiers are created. They express the tradeoff no matter how the classifiers are parameterized. On the other hand, for the user who wants to use a fixed-proportion cutoff scheme, I provide averaged RFR

(recall-filtration ratio) curves; these curves give the expected recall for given filtration ratio values.

5.4 Computational procedure.

The three-dimensional coordinate files for the protein structures were downloaded from the Protein Data

Bank (http://www.rcsb.org/pdb/). In order to predict the theoretical titration curve of each ionizable

residue in the structure, finite-difference Poisson-Boltzmann calculations were performed using UHBD 72

on each protein followed by the program HYBRID 73, which calculates average net charge as a function of pH. These titration curves were obtained for each ionizable residue: Arg, Asp, Cys, Glu, His, Lys, Tyr, and the N- and C- termini. The pH range we simulated for all curves is from -15.0 to 30.0, in increments of 0.2 pH units. This wide theoretical pH range is necessary for proper numerical integration of the first derivative functions. The structures were processed and analyzed to obtain the central moments, as described in Chapters 2 and 3.

These individual features, the central moments µ3 and µ4, were then rank-normalized within each protein,

and thus assigned values in the interval [0,1], as described earlier. This four-dimensional representation of

each curve was used for training and for testing. The results given in the remaining sections were based

on eight-fold cross-validation on a set of 64 proteins or 10-fold cross-validation on a set of 160 proteins,

both taken from the Catalytic Site Atlas (CSA) database 57, 74. The labels were taken directly from the

CSA database; if a residue is identified there as active in catalysis, it was labeled as positive in my dataset.

If not so identified in the CSA, we labeled it as negative. The CSA annotations, although incomplete,

constitute the best source of active residue labels for enzymes. In anticipation that the POOL method

would not be overly sensitive to mislabeled data, I performed no hand tuning of the labels and omitted no

residues during training, in contrast to the SVM work reported in Chapter 3.

For the eight-fold cross-validation procedure, I randomly divided the 64-protein set into eight folds of

eight proteins each, training on seven of the eight folds (56 proteins) and testing on the remaining fold (8

proteins). For the ten-fold cross-validation procedure, I randomly divided the 160-protein set into ten

folds of sixteen proteins each, training on nine of the ten folds (144 proteins) and testing on the remaining

fold (16 proteins). Training was performed by applying the POOL method to obtain a function $\hat{P}(1 \mid \tilde{x})$ for each rank-normalized feature vector $\tilde{x}$ in the appropriate feature space $[0,1]^k$ (where k = 4 for the POOL method applied to the four THEMATICS features of ionizable residues as stated earlier, denoted POOL(T4); k = 5 for the POOL method applied to the four THEMATICS features of ionizable residues plus the geometric feature of cleft size, denoted POOL(T4G); k = 1 for the POOL method applied just to the geometric feature of cleft size, denoted POOL(G); and k = 2 for POOL applied to non-ionizable residues, denoted POOL(T2)). An additional detail is that for training we quantized the multi-dimensional data points. For example, for POOL(T4), each rank-normalized feature fell into one of 20 bins whose sizes varied depending on their distance from 0.0. In particular, the lowest-ranked bins covered the half-open intervals [0.0, 0.2), [0.2, 0.4), [0.4, 0.6), [0.6, 0.7), and there were 16 more bins of width 0.02 above that, with one special bin for 1.0. Thus the lowest-ranking data were quantized more coarsely than the remaining data. This is appropriate since these data tend to have very low average probability of being in the active site anyway, because the vast majority of residues are negatives. Thus the inability to make fine distinctions among these low-probability candidates does not degrade the

overall quality of the results. It does, however, improve the efficiency of the training procedure

significantly, so this is an important component of the analysis. This is especially helpful in the 10-fold

cross-validation on the 160-protein set. The typical training set of 144 proteins contained about 14500

ionizable residues, which fell into more than 6000 quantized bins in the 4-dimensional space used for

POOL(T4). The corresponding number of inequality constraints was about 35,000-40,000.
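The bin layout described above can be written down explicitly. The sketch below reproduces my reading of the stated bin edges (coarse bins up to 0.7, 0.02-wide bins above, and a separate bin for exactly 1.0) and maps a rank-normalized value to its bin index.

```python
import numpy as np

# bin edges for one rank-normalized feature: coarse bins below 0.7,
# 0.02-wide bins from 0.7 up to 1.0, and a separate bin for exactly 1.0
EDGES = np.concatenate(([0.0, 0.2, 0.4, 0.6],
                        np.round(np.arange(0.7, 1.0001, 0.02), 2)))

def bin_index(v):
    """Map a rank-normalized value in [0, 1] to one of 20 bins: 19 half-open
    intervals defined by EDGES plus a special bin for exactly 1.0."""
    if v >= 1.0:
        return len(EDGES) - 1                        # special bin for 1.0 (index 19)
    return int(np.searchsorted(EDGES, v, side='right') - 1)

def quantize(feature_vector):
    """Quantize a whole rank-normalized feature vector, e.g. the 4-D POOL(T4) input."""
    return tuple(bin_index(v) for v in feature_vector)

print(quantize([0.05, 0.65, 0.71, 1.0]))   # -> (0, 3, 4, 19)
```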

One final detail is that the probability estimates generated by the POOL method as I have applied it tend

to have numerous ties as well as some places where there is no well-defined value. The latter places

occur because the method only assigns values to existing data points (or bins containing data in the case

of our use of quantization). The locally constant regions occur both because of the quantization applied to

the training data at the outset and because the data pools created by the algorithm acquire a single value.

In cells where no value is defined, the interpolation scheme I use is to simply assign a value linearly

interpolated based on the Manhattan distance between the least upper bound and the greatest lower bound

for that cell based on the monotonicity constraint. Finally, since both the data pooling performed by the

algorithm and this interpolation scheme tend to lead to ties, I use the Manhattan distance from the origin

of the four THEMATICS features as a tie-breaker for any residues whose probability estimates are

identical. This simply imposes a slight bias toward strict monotonicity even though the mathematical

formulation I use to determine these probabilities is based on a non-strict monotonicity assumption,

making it possible to obtain well-defined rankings for all the residues in a protein.
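Putting these pieces together, the final per-protein ranking can be produced by sorting on the estimated probability and breaking ties with the Manhattan distance of the THEMATICS feature vector from the origin. The data structures are illustrative, and reading the tie-break as favoring the larger distance (consistent with the bias toward strict monotonicity) is an assumption of this sketch.

```python
import numpy as np

def ranked_residues(residue_ids, probabilities, feature_vectors):
    """Rank residues of one protein by estimated active-site probability
    (descending), breaking ties by the Manhattan distance of the rank-normalized
    THEMATICS feature vector from the origin (larger distance ranks higher)."""
    l1 = [float(np.sum(np.abs(f))) for f in feature_vectors]
    order = sorted(range(len(residue_ids)),
                   key=lambda i: (-probabilities[i], -l1[i]))
    return [residue_ids[i] for i in order]

# toy example: three residues, two of which share the same probability estimate
ids = ['ASP-52', 'GLU-35', 'LYS-13']
probs = [0.80, 0.80, 0.15]
feats = [[0.9, 0.8, 0.7, 0.95], [0.6, 0.5, 0.9, 0.4], [0.1, 0.2, 0.1, 0.3]]
print(ranked_residues(ids, probs, feats))   # ASP-52 above GLU-35 by the tie-break
```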

I use CASTp 37, which uses the weighted Delaunay triangulation and the alpha complex for shape measurements to calculate the cleft information for each residue in the protein. The clefts were ranked based on their sizes in decreasing order and each residue having atoms located in any cleft is assigned the rank number of the largest of the clefts where its atoms are located. One special value is assigned to every residue not on the protein surface, and another is assigned to every residue on the surface but not within any cleft. Ignoring these special values, the monotonicity assumption is that the larger the cleft to which a residue belongs, the more likely that residue is to belong to the active site.

I use ConSurf 27 to calculate the sequence conservation information for residues in each protein. ConSurf

takes a protein sequence and finds its closest sequence homologues using MUSCLE 75, a multiple-sequence alignment algorithm. Two sequences with similarity higher than a preset threshold are treated as homologues. ConSurf analyzes the homologues of the query sequence and determines how conserved each residue in the query protein is among these homologues. In order to normalize the result and make it comparable between different proteins with different numbers of homologues and with different degrees of overall conservation, the program labels each residue with a conservation score between 1 and 9, with

9 being the most conserved and 1 being the most variable. If there exist more than 50 homologues for the query sequence, the 50 homologues closest to the query sequence are analyzed. If there are fewer than six homologues, the method will not work. For proteins with 6-10 homologues, ConSurf does report a conservation score, but these scores are less reliable. In this study, I only use the conservation scores from

ConSurf when there are at least 11 homologues for a protein. Under the assumption that active site residues tend to be more conserved than others, we apply the POOL method on the conservation score with the monotonicity assumption that the larger the conservation score a residue has, the more likely that residue is to belong to the active site.

5.5 Results

The results presented in this section are based on two sets of proteins, a set of 64 test proteins selected

randomly from the CSA database 57, 74, and a 160-protein set covering most of the CSA database. A

detailed list of the proteins in both sets and the CSA-labeled positive residues within that protein can be

found in Appendices C and D. In each case, the results are based on eight-fold cross-validation for the 64

protein set and ten-fold cross-validation for the 160 protein set. The ROC curves and RFR curves I

display show average performance over all proteins in all of the test sets, using the averaging methods

described in section 5.4.

5.5.1 Ionizable residues using only THEMATICS features.

First I evaluate the ability of POOL with the four THEMATICS features, POOL(T4), to predict ionizable

residues in the active site. For the purposes of Figures 5.1 and 5.2, only the CSA-annotated ionizable active

site residues are taken as the labeled positives. Thus if a method successfully predicts all of the labeled

ionizable active residues, the true positive rate is 100%. The prediction of all active residues, including

the non-ionizable ones, is addressed below.

Figure 5.1 shows the ROC curve, true positive fraction (TP) as a function of false positive fraction (FP),

obtained using POOL(T4), with just the four-dimensional THEMATICS feature vectors described earlier

(solid curve). Recall that the POOL method computes maximum-likelihood probability estimates, but for

these ROC curves, only the rankings of all residues within a single protein matter. For comparison, I also

show in Figure 5.1 a corresponding ROC curve for the earlier THEMATICS statistical approach

introduced by Ko et al. 24 and refined by Wei et al. 54 (dashed curve), plus the single point (X)

corresponding to the THEMATICS SVM-based approach 45. The data set used for the statistical curve

consists of the same 64 proteins used here. Note that the POOL(T4) curve always lies above and to the

left of the statistical curve for all non-zero values of the true positive fraction. For any given non-zero

value of the FP fraction, the true positive fraction is always higher for POOL(T4) than for the statistical

selector. The point representing the particular SVM classifier is based on a separate set of data, trained

and tested on data sets somewhat different from the present data set, so the results are not strictly

comparable. Nevertheless, this point lies well below the POOL(T4) curve and strongly suggests that

POOL(T4) is superior to the SVM approach 45. Below I present further evidence that POOL outperforms an SVM on this active-site classification task. Thus POOL(T4) represents our best method yet for identifying ionizable active-site residues using THEMATICS features alone.


Figure 5.1 Averaged ROC curve comparing POOL(T4), Wei's statistical analysis and Tong's SVM using THEMATICS features. Shown in the plot are the averaged ROC curve for POOL(T4) (solid curve), Wei's statistical analysis (dashed curve) and Tong's SVM (point X), using THEMATICS features on ionizable residues only for the prediction of annotated active site ionizable residues. POOL(T4) outperforms both the SVM and Wei's method.

5.5.2 Ionizable residues using THEMATICS plus cleft information.

Next I evaluate the three different ways of combining THEMATICS features with cleft size information.

Figure 5.2 shows averaged ROC curves for these three different methods, along with the best-performing

THEMATICS-only method, POOL(T4). The three methods are: (i) POOL(T4G), which uses the POOL method with the 5-dimensional concatenated feature vectors of THEMATICS and cleft size rank (G stands for geometric feature); (ii) SVM(T4G), which uses a support vector machine trained using the same 5-dimensional feature vectors, with varying threshold; and (iii) CHAIN(POOL(T4), POOL(G)), the result of chaining POOL(T4) estimates with POOL(G) estimates.


Figure 5.2 Averaged ROC curves comparing different methods of predicting ionizable active site residues using a combination of THEMATICS and geometric features of ionizable residues only. The method using chaining to combine the THEMATICS features and geometric information has the best performance.

To compare the averaged ROC curves from Figure 5.2 quantitatively, I compute the area under the curve

for each ROC curve in the figure using the mean average specificity (MAS). The MAS values for

CHAIN(POOL(T4), POOL(G)), POOL(T4), POOL(T4G) and SVM(T4G) are 0.939, 0.921, 0.909 and

0.903, respectively. Figure 5.2 and the MAS values show the comparison of averaged performance between different methods. In order to estimate the statistical significance of the performance differences considering all pair-wise comparison results (i.e., on a per-protein basis), I perform the Wilcoxon signed-rank test. Table 5.1 shows the p-value of the Wilcoxon signed-rank test, the probability of observing the specified AveS measurement under the null hypothesis that the method listed in the corresponding row does not outperform the method listed in the corresponding column, as the first number in each cell. The number N in parentheses indicates the number of proteins, out of the 64, for which the method in that row outperforms the method in that column. For the remaining (64-N) proteins in the set, the two methods either give equal performance or the method in the column outperforms the method in the row.

                           SVM(T4G)        POOL(T4G)       POOL(T4)
CHAIN(POOL(T4), POOL(G))   <0.0001 (53)    <0.0001 (59)    <0.0001 (46)
POOL(T4)                   0.0002 (40)     0.0006 (41)     -
POOL(T4G)                  0.038 (37)      -               -

Table 5.1 Wilcoxon signed-rank tests between methods shown in Figure 5.2. The first number in each cell is the Wilcoxon p-value, the probability of observing such a difference under the null hypothesis that the method in the corresponding row does not outperform the method in the corresponding column. The number in parentheses is the number of proteins out of 64 for which the method in the row outperforms the method in the column.


The figure and the table above clearly show that chaining the POOL(T4) and POOL(G) probability estimates is the method that gives the best performance. It is interesting to note that this method,

CHAIN(POOL(T4), POOL(G)), is the only one that outperforms POOL(T4) alone. It is also interesting to note that POOL(T4) is consistently at least as good as SVM(T4G), and is significantly better than

SVM(T4G) in the upper recall range, even though the latter has the advantage of the additional cleft information. In general, there is little difference between POOL(T4), SVM(T4G), and POOL(T4G) in the lower recall range, but for recall above about 0.6, POOL(T4) has a significantly lower false positive rate, on average, than the other two, given equal recall. So these ROC curves provide strong evidence that

CHAIN(POOL(T4), POOL(G)) is the only one of the methods reported to date that is capable of taking good advantage of additional geometric information that is not contained in THEMATICS features alone and thereby outperforms any purely THEMATICS-based method so far.

The better performance of this chained method CHAIN(POOL(T4), POOL(G)) over POOL(T4) alone is consistent throughout the ROC curve. For recall rates greater than 0.50, the TP fraction for the chained method is better than that of POOL(T4) by roughly 10% for a given FP fraction. This qualitative trend is apparent from visual inspection of the ranked lists from the two methods. For a typical protein, these two ranked lists tend to be very similar, with annotated positive residues generally ranking a little higher, on average, in the list resulting from chaining.

I believe that the observation that chaining the two four- and one- dimensional estimators gives better results than applying POOL directly to the single, five-dimensional concatenated feature vector is probably an overfitting issue. There may be too much flexibility when POOL is used with a high- dimensional input space5, especially when the data are sparse.

5 As a side note, as far as possible worst-case performance is concerned, it is easy to show that applying coordinate- wise monotonicity with even a 2-dimensional input space has infinite VC dimension.

Since I established that chaining the POOL probabilities gives better results, from the next section on, I

omit POOL from the method reference, to make the change of features more distinguishable.

CHAIN(POOL(T4), POOL(G)) will be abbreviated as CHAIN(T4, G) instead.

5.5.3 All residues using THEMATICS plus cleft information.

So far only predictions for ionizable residues have been described. The THEMATICS environment

variables are now used to incorporate predictions for non-ionizable residues in the active site.

Figure 5.3 shows the ROC curve for a combined method by which a single merged list ranking all

residues, both ionizable and non-ionizable, in a protein is generated; this merged method is denoted CHAIN(T_ALL, G). The method assigns probability estimates for ionizable residues using the best of the previous ionizables-only estimators, the chained estimator corresponding to the best ROC curve, CHAIN(POOL(T4), POOL(G)), in Figure 5.2. It also assigns probability estimates to non-ionizable residues using POOL with the two

THEMATICS environment features chained with POOL(G), and then rank orders all the residues based

on their probability estimates. Also included in Figure 5.3 for comparison is a ROC curve, CHAIN(T_ION,

G) based on the same estimates for the ionizable residues but assigning probability estimates of zero to

all non-ionizable residues. Note that the data for this latter method are essentially the same as those of the

chained CHAIN(POOL(T4), POOL(G)) curve of Figure 5.2, except that the denominator for the recall

values is now the number of total active-site residues in the protein, whether ionizable or not, and the

denominator for the false positive rate is now the total number of non-active-site residues in the protein,

ionizable or not. The improved ROC curve for the merged-estimate method CHAIN(T_ALL, G) compared to the curve for the ionizables-only method CHAIN(T_ION, G) indicates that taking into account both

THEMATICS environment variables and cleft information does indeed help identify the non-ionizable active-site residues. When the lists are merged, the rankings of some annotated positive ionizable residues may be lowered, but it is apparent that this effect is more than offset, on average, by the rise in the ranking of some annotated positive non-ionizable residues that are obviously missed by excluding them

altogether. If this were not the case, then one would expect the merged curve to cross below (and to the right of) the comparison curve in the lower recall (and lower false positive) range.


Figure 5.3 Averaged ROC curves comparing POOL methods applied to ionizable residues only, CHAIN(T_ION, G), and to all residues, CHAIN(T_ALL, G).

The MAS values for CHAIN(T_ALL, G) and CHAIN(T_ION, G) are 0.933 and 0.833, respectively. The p-value of the Wilcoxon signed-rank test of observing such AveS values under the null hypothesis that CHAIN(T_ALL, G) does not outperform CHAIN(T_ION, G) is <0.0001, and CHAIN(T_ALL, G) outperforms CHAIN(T_ION, G) in 31 of the 64 proteins. The number of proteins for which CHAIN(T_ALL, G) outperforms CHAIN(T_ION, G) may seem low, but both methods perform the same in 25 out of the 64 proteins. For many of these latter cases, the protein does not have any non-ionizable residues in the active site.

I have shown that the extension of the POOL method to non-ionizable residues gives a satisfactory result.

From now on, all residues are included in the study and I will just use T to indicate the way I apply

THEMATICS in TALL: For ionizable residues, I estimate the probability of being in active sites by

chaining the result of the POOL method on four THEMATICS features; for non-ionizable residues, I

estimate the probability of being in active sites by chaining the result of POOL method on two

THEMATICS features; I then combine the results and rank order the list of all residues based on their

probability of being in active sites.

5.5.4 All residues using THEMATICS, cleft information and sequence conservation, if

applicable.

So far all the information I used in protein active site prediction is only derived from the protein 3D

structures, in other words, no sequence comparison information is used. As discussed in Chapter 2, it

makes the method applicable to those proteins with no or very few sequence homologues; indeed many of

the newly discovered protein structures from Structural Genomics projects have few or no sequence

homologues. It is generally true that most active site residues tend to be more conserved than others, with

only a few exceptions. Based on this observation, I believe, if I can include the sequence conservation

into our system when the information is available, I may get better performance. I put this hypothesis to

the test in this section, and the result is presented in Figure 5.4.

This figure shows the ROC curves using different features on the 160-protein set, with all residues

included as I did in 5.5.3. The reason I used the 160 protein set instead of the 64 set is that not all the

proteins have reliable sequence conservation information to use, which will be explained later. If I still

use the 64 protein set, the number of proteins with reliable sequence conservation information may not be

large enough to perform significance testing. Also, using a 160-protein set will show that the performance

improvement using different feature sets is consistent between different test sets. As demonstrated in

Figure 5.2, chaining the POOL results together works better than applying POOL directly on high

dimensional features, so I use chaining to combine the features of THEMATICS, cleft and the sequence

conservation information.

As pointed out earlier, not all proteins have enough homologues to perform reliable sequence

conservation analysis. In this study, I use ConSurf to do the sequence analysis. As a requirement, it needs

more than five homologues to perform conservation analysis and the result is claimed to be more reliable

if the number of homologues is larger than 10. In this study, I will only use the conservation information

when the protein has more than 10 homologues. For those not having enough homologues (28 out of 160

in this case), I assign 1 as the probability estimate from the conservation POOL table. Since the ranking is performed for residues within the same protein, this treatment is valid and will not affect the results from

other proteins in the set.

There are four curves for comparison6: CHAIN(T) uses the four THEMATICS features for ionizable residues and the two THEMATICS features for the non-ionizables; CHAIN(T, G) uses both

THEMATICS and the cleft feature; CHAIN(T, C) uses both THEMATICS and the sequence conservation information; while CHAIN(T, G, C) uses all three features by chaining.

Figure 5.4 shows, among all four curves, CHAIN(T) is dominated by all other three curves, suggesting that including either cleft or sequence conservation features, or both, can improve the performance. Both

CHAIN(T, C) and CHAIN(T, G, C) dominate CHAIN(T, G), suggesting that incorporating sequence

6 I use the notation CHAIN(T) for consistency. This could also have been notated more simply as “T”.

conservation information does improve performance more than just incorporating cleft information alone.

Surprisingly, CHAIN(T, C) and CHAIN(T, G, C) have very similar performance, although in the recall range below 80%, CHAIN(T, G, C) performs slightly better.

The MAS values for CHAIN(T, G, C), CHAIN(T, C), CHAIN(T, G) and CHAIN(T) are 0.925, 0.923, 0.907 and 0.899, respectively. The p-values of the Wilcoxon signed-rank test of observing such AveS measurements under the null hypothesis that the method in the row does not outperform the method in the column are listed in Table 5.2, as the first number in each cell. The number in parentheses indicates the number of proteins for which the method in that row outperforms the method in that column:


Figure 5.4 Averaged ROC curves comparing different methods of combining THEMATICS, geometric and sequence conservation features of all residues. The method using chaining to combine THEMATICS, geometric and sequence conservation features has the best performance.

                 CHAIN(T)        CHAIN(T, G)     CHAIN(T, C)
CHAIN(T, G, C)   <0.0001 (115)   <0.0001 (95)    <0.0001 (103)
CHAIN(T, C)      <0.0001 (101)   0.0008 (89)     -
CHAIN(T, G)      <0.0001 (101)   -               -

Table 5.2 Wilcoxon signed-rank tests between methods shown in figure 5.4.


5.5.5 Recall-filtration ratio curves.

The results reported so far are all in the form of ROC curves. As discussed earlier, my analysis is not committed to any particular cutoff or rule to select the active site residues from the top of the list. For instance, users can select the top k residues in the ranked list of residues ordered by the estimated probability of being in the active site, or they can select the residues with an estimated probability of being in the active site greater than a certain cutoff value, or they can select the top p percent of the residues in the ranked list. Among the three methods listed above, I think the third would probably be preferred in general, since it is less susceptible to variation in protein size and availability of sequence conservation information. In this case, RFR (recall-filtration ratio) curves may be more useful than ROC curves.

Since the main purpose for the RFR-curve is to provide a guide for users to select the appropriate cut-off,

I only report the results for CHAIN(T, G, C), which performs the best among all the methods. The test was performed on all residues ranked by their probability of being in the active site and I average the recall for each filtration ratio value to get the averaged RFR curve. For the curve shown in Figure 5.5, for example, choosing the top 10% of the residues from the ranked list gives an average recall of 90%, while choosing the top 5% of the residues from the ranked list gives an average recall of 79%.


Figure 5.5 Averaged RFR curve for CHAIN(T, G, C) on the 160 protein test set.

5.5.6 Comparison with other methods.

In addition, I also compare the CHAIN(T, G) and CHAIN(T, G, C) results with the results from some

other top performing active site prediction methods, particularly, Petrova’s method 39, Youn’s method 40, and Xie’s geometric potential method 38. All these methods use SVM. The first two use both sequence conservation and 3D structural information, while Xie’s method uses structural information only.

The authors of the three methods report results for their datasets using different measures from those I used in my studies. Therefore I simply take the results from CHAIN(T, G) and CHAIN(T, G, C) on the 160 protein test set and compare them with their results in their form of analysis. Because the performance measures are not obtained from the same dataset, the results are not strictly comparable, but qualitatively, the comparisons below give a good idea of the relative performance.

In order to compare my results with theirs at a similar recall level, I used a 4% filtration ratio cutoff in the

POOL method to compare with Youn’s method, and a variable filtration ratio cutoff to compare with

Petrova’s method. Note that while our test set consists of proteins with a wide variety of different folds and functions, Youn’s results are reported for sets of proteins with common fold or with similar structure and function. Performance on the more varied set is a much more realistic test of predictive capability on proteins of unknown function, particularly novel folds. Performance on a set of structurally or functionally related proteins is also substantially better than performance on a diverse set, as one would expect and as has been demonstrated by Petrova and Wu 39.

Youn’s method 40 achieved about 57% recall at 18.5% precision with MAS (AUC) of 0.929, using both

sequence conservation and structural information when they train and test on proteins from the same

family; however the performance dropped when the training and testing is performed on proteins of the

same superfamily and fold level, while our CHAIN(T, G, C) with a preset 4% filtration ratio cutoff,

achieves the averaged recall of 64.68% with averaged precision of 19.07%, and an MAS (AUC) of 0.925 for all 160 proteins in the test set, consisting of proteins from completely different folds and classes.

Without the use of sequence conservation, the CHAIN(T, G) achieves the averaged recall, the averaged

precision and the AUC of 61.74%, 18.06% and 0.907, respectively. Our chained POOL method thus does about as well as Youn's method, even when we exclude conservation information, and a little better with conservation information included, even though our diverse test set is one for which good performance is most difficult to achieve. The complete results are shown in Table 5.3.

Petrova’s method 39 measured the performance of their method globally using all residues in all proteins, instead of computing the recall, accuracy, false positive rate and MCC values for each protein and then averaging them. Like Youn’s method, they use both sequence conservation information and 3D structural properties as input to the SVM. They use a dataset that they call the benchmarking dataset that contains a wide variety of proteins that are dissimilar in sequence, are structurally diverse, and span the full range of

E.C. classes of chemical functions. This dataset constitutes a fair test of how a method will perform on structural genomics proteins of unknown function for which sequence conservation information is available. Their method achieves a global residue level 89.8% recall with an overall predictive accuracy of 86%, with an MCC of 0.23 and a 13% false positive rate on a subset of 79 proteins from CatRes database. Testing on the 72 proteins from their set that also appear in my 160 protein set, CHAIN(T, G,

C) with a 10% filtration ratio cutoff achieves a residue level 88.6% recall at the overall predictive accuracy of 91.0%, with an MCC of 0.28 and a 9% false positive rate. The resulting residue level recall, overall predictive accuracy and the MCC from the CHAIN(T, G) are 85.2%, 91.0% and 0.27, respectively.

The results for Petrova’s method and for the present CHAIN methods with different filtration ratio cutoffs are shown in Table 5.4 and the ROC curves in Figure 5.6. CHAIN(T, G, C) achieves comparable recall with somewhat better accuracy and a lower false positive rate. CHAIN(T, G) performs almost as well, even without conservation information.

In 38, a purely 3D structure based method, the performance was reported in the following fashion: their method achieves at least a 50% recall with 20% or less false positive rate for 85% of the proteins they analyzed. The performance of the CHAIN(T, G) and CHAIN(T, G, C) methods measured in the same way is listed in table 5.5. Xie’s method should be compared against CHAIN(T, G), because these methods

do not use conservation data. CHAIN(T, G) achieves at least a 50% recall with a false positive rate of

20% or less for 96% of all proteins.

The results in the tables clearly show that CHAIN(T, G), which only uses 3D structural information of proteins, achieves about as good or even better performance than that of these best performing current active site prediction methods. When additional sequence conservation information is available, the

CHAIN(T, G, C) performs still better.

Method / Data set               Sensitivity (%)   Precision (%)   AUC
Youn / Family                   57.02             18.51           0.9290
Youn / Superfamily              53.93             16.90           0.9135
Youn / Fold                     51.11             17.13           0.9144
CHAIN(T, G, C) / all proteins   64.68             19.07           0.925
CHAIN(T, G) / all proteins      61.74             18.06           0.907

Table 5.3 Comparison of the sensitivity, precision, and AUC of CHAIN(T, G, C) and CHAIN(T, G) with Youn's reported results for proteins in the same family, superfamily, and fold.

Method (filtration ratio cutoff)   Residue-level recall   Residue-level accuracy   Residue-level false positive rate   Residue-level MCC
Petrova's method                   89.8%                  86%                      13%                                 0.23
CHAIN(T, G, C) 7%                  81.4%                  90.9%                    6.2%                                0.31
CHAIN(T, G, C) 8%                  85.2%                  91.0%                    7.1%                                0.31
CHAIN(T, G, C) 9%                  85.6%                  91.0%                    8.0%                                0.29
CHAIN(T, G, C) 10%                 88.6%                  91.0%                    9.0%                                0.28
CHAIN(T, G, C) 12%                 90.2%                  89.1%                    11%                                 0.26
CHAIN(T, G, C) 15%                 91.9%                  86.1%                    14%                                 0.23
CHAIN(T, G) 7%                     73.7%                  93.7%                    6.1%                                0.28
CHAIN(T, G) 8%                     77.1%                  92.8%                    7.0%                                0.28
CHAIN(T, G) 9%                     81.8%                  91.9%                    8.0%                                0.27
CHAIN(T, G) 10%                    85.2%                  91.0%                    9.0%                                0.27
CHAIN(T, G) 12%                    86.9%                  89.0%                    11%                                 0.25
CHAIN(T, G) 15%                    89.4%                  86.0%                    14%                                 0.22

Table 5.4 Comparison of CHAIN(T, G) and CHAIN(T, G, C) with Petrova’s method.


Figure 5.6 ROC curves comparing CHAIN(T, G), CHAIN(T, G, C) and Petrova’s method.

Method           Recall ≥   False positive rate <   Achieved for
Xie              50%        20%                     85%
CHAIN(T, G, C)   50%        20%                     97%
CHAIN(T, G, C)   80%        20%                     84%
CHAIN(T, G, C)   60%        10%                     85%
CHAIN(T, G)      50%        20%                     96%
CHAIN(T, G)      80%        20%                     77%
CHAIN(T, G)      60%        10%                     81%

Table 5.5 Comparison of CHAIN(T, G) and CHAIN(T, G, C) with Xie’s method. Each method achieves at least the specified recall rate with a false positive rate less than specified for the percentage of proteins in the last column.

5.5.7 Rank of the first positive.

The last result I will present is one that is only applicable to methods that generate a ranked list: the rank of the first true positive in the list. This metric is useful for users who are interested in finding a few of the active site residue candidates and who do not necessarily need to know all of the active site residues.

Users could use the list from the POOL method to guide their site-directed mutagenesis experiments by going down the ranked list one residue at a time; once the first active site residue is found, it should be easier to find the rest of the active site residues by examining its neighbors. A histogram giving the rank of the first active site residue found by CHAIN(T, G, C) on the 160 protein set is shown in Figure 5.7. The median rank of the first true positive active site residue in the 160 protein set with the CHAIN(T, G, C) method is two. For 46 out of 160 proteins, the first residue in the resulting ranked list is an annotated active site residue. 65.0%, 81.3% and 90.0% of the 160 proteins have the first annotated active site residue located within the top 3, 5 and 10 residues of the ranked list, respectively. Such measurements are not easily made for binary classification methods.
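The statistics reported in this subsection are straightforward to compute from the per-protein ranked lists; the sketch below does so on made-up toy data.

```python
import numpy as np

def first_positive_rank(labels_by_rank):
    """1-based rank of the first annotated active site residue in a ranked list."""
    for r, is_pos in enumerate(labels_by_rank, start=1):
        if is_pos:
            return r
    return None   # no annotated positive in the list

def summarize(per_protein_labels, cutoffs=(3, 5, 10)):
    ranks = [first_positive_rank(l) for l in per_protein_labels]
    ranks = [r for r in ranks if r is not None]
    print("median rank of first true positive:", int(np.median(ranks)))
    for c in cutoffs:
        frac = np.mean([r <= c for r in ranks])
        print(f"first positive within top {c}: {100 * frac:.1f}% of proteins")

# toy data: four proteins with first positives at ranks 1, 2, 4 and 7
summarize([[1, 0, 0], [0, 1, 0], [0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0, 1]])
```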


Figure 5.7. Histogram of the first annotated active site residue. Top: rank of the first annotated active site residue in the ranked list from CHAIN(T, G, C) on the 160 protein set. Bottom: the cumulative distribution of the first annotated active site residue in the ranked list from CHAIN(T, G, C) on the 160 protein set.

5.6 Discussion.

In this chapter, I presented the application of the POOL method using THEMATICS plus some other features for protein active site prediction.

I started with the application of the POOL method just on THEMATICS features, with features similar to those I used before in the SVM method, as well as those used in Ko and Wei’s statistical analysis 24, 54.

My results show that the POOL method outperforms all of the earlier THEMATICS methods with no training data cleaning and no clustering after the classification. This suggests that by emphasizing the underlying THEMATICS principles, the POOL method makes better use of the training data and automatically limits the adverse effect that noise in the training data set might have caused in the other methods. The results also supply further evidence that the THEMATICS principles are valid in reality. In some sense, this opens another possible application for the POOL method, to verify the underlying monotonicity hypothesis, which could be worth further investigation in the future.

I also tested different ways of incorporating additional features into the learning system. Not surprisingly, the results show that in order to improve performance, I have to incorporate the right features in the right way. In addition to using CASTp to obtain the size rank of the cleft in which a residue resides, I also tried solvent accessible area, residue type and some other features. Unfortunately, these extra features did not improve the performance of the POOL method. One possible reason might be overfitting, or correlation between these features and other features already present in the system. Even for features that were found to improve performance, how they are incorporated matters.

The results show that chaining the results from separate POOL estimates is better than simply combining all the available features into one POOL table of very high dimension. As mentioned earlier, the reason behind this is probably overfitting: combining features into a high-dimensional POOL table causes the number of probabilities that need to be estimated to grow exponentially, while in most cases the training data can only grow linearly. In other words, the high dimensionality makes the table too sparse for accurate probability estimates.
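The sparsity argument can be made concrete with a back-of-the-envelope count of table cells; the bin counts below are illustrative only, not the discretization used in this work.

def cells_single_table(bins_per_feature, num_features):
    """Number of cells when all features go into one joint table."""
    return bins_per_feature ** num_features

def cells_chained(bins_per_feature, group_sizes):
    """Total cells when features are split into small groups whose
    probability estimates are chained (combined afterwards)."""
    return sum(bins_per_feature ** g for g in group_sizes)

bins = 10
print(cells_single_table(bins, 6))     # 1,000,000 cells for one 6-D table
print(cells_chained(bins, [4, 1, 1]))  # 10,020 cells for chained low-D estimates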

I also extended the application of THEMATICS to all residues, not just ionizable residues, in a natural way and showed that it is effective. Although the performance for predicting non-ionizable residues is not as good as the performance for predicting ionizable ones, this extension does provide a way to combine features from THEMATICS, which by itself can only be applied to ionizable residues directly, with some other features, making the comparison with the performance of other methods more accurate and fair.

The incorporation of sequence conservation information does improve the prediction when there are enough homologues with appropriate similarities. The POOL method gives us a means for easily utilizing this information when it is available, while not affecting the training and classification when it is not.

When comparing with other methods, especially methods that use binary classification instead of a ranked list, I have to commit to a specific cutoff value and turn my system into a binary classifier. The results in Section 5.5.6 clearly show that the POOL method using THEMATICS and geometric features achieves performance equivalent to or better than that of the other methods, even in cases where those methods are tested on very special groups of proteins. Because it does not require sequence homologues, my method is also more widely applicable, for example to proteins with few or no sequence homologues such as some Structural Genomics proteins, whereas both Youn's and Petrova's methods need sequence alignments from homologues.

The performance of Youn's and Petrova's methods degrades significantly when sequence conservation information is not available, whereas the performance of my CHAIN(T, G) method does not depend on such information at all. The results also show that when sequence conservation information is available, it can further improve the performance.

Interestingly, when I compare the performance of CHAIN(T, G) and CHAIN(T, G, C) in Figure 5.4, it is apparent that the addition of the conservation information does improve the performance a little, but not to the extent observed previously for sequence-structure methods. Typically the conservation information is the most important input feature, and without it performance is substantially worse 18. This suggests that the 3D structure based THEMATICS features are quite powerful compared with other 3D structure based features.

When looking at the recall and false positive rates of the results from all the protein active site prediction methods, one must keep in mind that the annotation of the catalytic residues in the protein dataset is never perfect. Since most of the labeling comes from experimental evidence, some active site residues are not labeled as positive simply because no experiment has been designed and carried out to verify the role of that specific residue. Since, as mentioned several times in this dissertation, I use the CatRes/CSA annotations as the sole criterion for evaluating performance in order to keep the comparisons consistent, the resulting false positive rate may be higher than it is in reality. There is evidence supporting the functional importance of some residues that are not labeled as active site residues in the CatRes/CSA database 24, 58, yet these residues have high ranks in the list from the POOL method and are classified as positive by THEMATICS-SVM and by the THEMATICS statistical analysis as well.

Although I evaluated the performance of the POOL method by using filtration ratio values as a cutoff, this was done only for the purpose of comparing with other protein active site prediction methods that use a binary classification scheme. The ranked list of residues, ordered by their probability of being in the active site, contains much more information than traditional binary classification labels. The analysis of the rank of the first annotated positive residue in Section 5.5.7 shows just one use of the extra information contained in a ranked list. There are many possible measurements of performance, depending on the actual application, and in turn many possible applications that can benefit from results in ranked list form. It is noteworthy that P-Cats 21 uses a k-nearest neighbor method to estimate the probability of a residue being in an active site of a protein, which in principle could be the basis for producing a ranked list as the result, but their method only uses the probability estimates to assign binary labels; residues with probability larger than 0.50 are labeled as positive and the others as negative. Although the online server of their method 76 does report the probability estimates along with the corresponding binary labels, the potential benefits of using a ranked list instead of binary labeling have not been fully addressed either in the paper or online.
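For reference, thresholding a ranked list with a filtration ratio can be sketched as follows (hypothetical names; the residue identifiers are made up): the top fraction of the list is labeled positive and everything else negative.

import math

def binarize_by_filtration_ratio(ranked_residues, filtration_ratio):
    """Label the top filtration_ratio fraction of the ranked list as positive."""
    n_positive = math.ceil(filtration_ratio * len(ranked_residues))
    top = set(ranked_residues[:n_positive])
    return {res: (res in top) for res in ranked_residues}

# Example: a hypothetical ranked list of 10 residues and an 8% filtration ratio.
ranked = ['H57', 'D102', 'S195', 'K60', 'E70', 'R90', 'Y94', 'C42', 'N101', 'T54']
labels = binarize_by_filtration_ratio(ranked, 0.08)
print([r for r, positive in labels.items() if positive])  # ['H57']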

In conclusion, I have established that applying the POOL method with THEMATICS and other features yields what appears to be the best protein active site prediction system found to date, and that it provides more information than other active site prediction methods.

Chapter 6

Summary and Conclusions.

Here, I summarize the work I have presented in this dissertation.

This dissertation starts with an introduction to the central problem I try to solve: using machine learning methods to build an automated system that can predict active sites from protein structure alone, and that can further improve the prediction by using sequence conservation when that information is available.

The second chapter briefly surveys the background of protein active site prediction and the relevant machine learning techniques, especially probability based learning, which forms one foundation of the POOL method. This chapter also introduces THEMATICS, an effective and accurate protein active site predictor that uses only structural information about proteins, which forms the other foundation of this work.

The third chapter of my dissertation reports my work on using an SVM with structure-only information, which achieved better performance not only than other competing methods that use many kinds of features, but also than THEMATICS with statistical analysis. At the end of this chapter, I explain the limitations of using traditional machine learning techniques, framed as classification, to solve the protein active site prediction problem.

Chapter Four of my dissertation proposes the novel POOL method as an approach to a class of problems involving the estimation of class probabilities under multi-dimensional monotonicity constraints. In this chapter, I describe the properties of such problems and a framework within which they can be formulated and solved. I present an algorithm for solving them, together with a mathematical proof that the solution is optimal under both the sum-of-squared-error and the maximum-likelihood criteria.

In Chapter Five, the protein active site prediction problem is reframed from a standard binary classification problem into a ranked list problem. The dissertation presents the application of the POOL method to estimate the class probabilities of residues being in the active site under class probability monotonicity assumptions. Using the POOL method and THEMATICS, I achieved better performance than with the THEMATICS-SVM system. After incorporating more features, including sequence conservation information, and extending the method from ionizable residues only to all residues, the POOL method achieved the best performance so far, compared with both the earlier THEMATICS methods and other existing methods that use SVMs and many kinds of structural and sequence information from proteins.

My work has established the following two claims:

THEMATICS is an effective and accurate protein active site predictor and can be automated by different machine learning techniques. Incorporating more features makes it even more effective and accurate in protein active site prediction.

The POOL method is an efficient way to estimate probabilities with maximum likelihood under multi-dimensional monotonicity constraints. It provides a platform that allows probability estimates to be combined easily. It can be used in protein active site prediction and potentially in many other applications where monotonicity constraints play a role, such as disease detection from markers in the blood and risk assessment.

6.1 Contributions.

Listed below are the novel contributions to using machine learning techniques with THEMATICS in protein active site prediction that have been presented in this dissertation.

Use SVM with THEMATICS in active site prediction. The work of using an SVM with THEMATICS in protein active site prediction presented in this dissertation is the first successful approach that uses machine learning techniques to automate the THEMATICS method. It outperforms both the THEMATICS-statistical method and other 3D structure based protein active site prediction methods.

Turn the protein active site prediction problem into a probability based ranked list problem. This dissertation frames protein active site prediction as a ranked list problem instead of a traditional binary classification problem. Although the P-Cats method 21 also estimates the probability of residues being in the active site, using a k-nearest neighbor method, it is still framed as a binary classification problem. This dissertation emphasizes the benefits of the ranked list scheme, which gives users more control in setting their own cutoff thresholds. The probability estimates behind the ranked list make possible the next contribution, combining the results from different methods.

Combine probability estimates by applying the POOL method to different features. Since all the results of the POOL method are essentially probability estimates, it becomes possible to utilize more features using the chaining technique. This makes the method less susceptible to the problem of analyzing sparse data in high dimensions.

Introduce monotonicity constraints into machine learning. This is a successful approach that enforces a prior belief about the data in the machine learning task, and the results indicate that when the prior belief is indeed correct, incorporating and enforcing this knowledge can improve the performance of the learning system.

Develop a novel POOL method. This dissertation frames the problem of assigning probabilities under multi-dimensional monotonicity constraints with minimum sum of squared error (SSE) as a special form of convex optimization problem and develops the POOL algorithm, which solves it more efficiently and accurately than a general-purpose convex optimization solver.
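For readers unfamiliar with this class of problems, the one-dimensional special case is classical isotonic regression, which the pool-adjacent-violators procedure solves. The sketch below shows only that 1-D case; it is not the multi-dimensional POOL algorithm itself, which handles a partial order over several features.

def pool_adjacent_violators(labels):
    """Least-squares monotone (non-decreasing) fit to binary labels that are
    ordered by a single feature value: the one-dimensional special case of
    estimating class probabilities under a monotonicity constraint."""
    # Each block stores (mean value, number of points pooled into it).
    blocks = []
    for y in labels:
        blocks.append((float(y), 1))
        # Pool adjacent blocks while the non-decreasing constraint is violated.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            v2, n2 = blocks.pop()
            v1, n1 = blocks.pop()
            blocks.append(((v1 * n1 + v2 * n2) / (n1 + n2), n1 + n2))
    # Expand block means back to one probability estimate per input point.
    fitted = []
    for value, count in blocks:
        fitted.extend([value] * count)
    return fitted

# Binary labels sorted by increasing feature value (1 = active site residue).
print(pool_adjacent_violators([0, 0, 1, 0, 1, 1, 0, 1, 1, 1]))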

Prove that minimizing SSE maximizes likelihood in the present problem. This dissertation proves, using the Karush-Kuhn-Tucker (KKT) conditions, that the probability assignment minimizing SSE also maximizes the likelihood under the multi-dimensional monotonicity constraints.

Use the POOL method with THEMATICS in protein active site prediction. This dissertation presents a practical application of the POOL method by applying it with THEMATICS to protein active site prediction. It outperforms all other protein active site prediction methods to date.

Use the environment feature to incorporate influences of nearby residues in protein active site prediction. This dissertation introduces an environment feature (see 5.2) as a new way to incorporate the influence of nearby residues in active site prediction. This gives two benefits: first, it makes the THEMATICS method applicable to non-ionizable residues; second, it avoids the extra clustering step after classification that was used in the THEMATICS-statistical analysis and the THEMATICS-SVM study.
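The general idea can be illustrated with the following sketch: a residue of any type receives a feature derived from the THEMATICS-based scores of ionizable residues nearby. The distance cutoff and the use of the maximum over neighbors are assumptions made here for illustration and are not necessarily the definition given in Section 5.2.

import math

def distance(a, b):
    """Euclidean distance between two 3-D coordinates."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def environment_feature(residue_coord, ionizable_residues, cutoff=9.0):
    """Illustrative 'environment' value for any residue: an aggregate of the
    THEMATICS-derived scores of ionizable residues within a distance cutoff.
    Cutoff value and aggregation by maximum are assumptions of this sketch."""
    nearby = [score for coord, score in ionizable_residues
              if distance(residue_coord, coord) <= cutoff]
    return max(nearby) if nearby else 0.0

# Hypothetical data: (coordinate, THEMATICS-derived score) for ionizable residues.
ionizables = [((1.0, 0.0, 0.0), 0.8), ((12.0, 0.0, 0.0), 0.2)]
print(environment_feature((2.0, 1.0, 0.0), ionizables))  # 0.8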

6.2 Future research.

There are many possible ways to extend the research described in this dissertation, some of which I have mentioned in earlier sections. Here I outline some directions in which the application of the POOL method with THEMATICS could be developed to further improve the performance of protein active site prediction.

One place for further research is the use of the ranked list from the POOL method. As mentioned earlier, the POOL method does not commit to any rule or value for the cutoff, but users may still want to know "exactly" which residues are in the active site. Although I suggested the filtration ratio as a possible cutoff and used it for comparison with other methods, it is still coarse and I believe it can be improved. One possible way is to look at the actual probability estimates the POOL method gives; the magnitudes of, and the differences between, adjacent estimates in the ranked list may give some clue about where to place the cutoff. Another approach is to use machine learning as a first screening step and involve humans more in refining the predictions. Human experts can examine the 3D structure of the protein to see where the residues near the top of the ranked list are located. Residues in some areas, such as in clefts near the surface and close to one another, are more likely to be in the active site than others near the top of the ranked list that are deeply buried or isolated. Of course one can also feed the result from the POOL method, either the ranks or the raw probability estimates, into another machine learning system to further improve the performance. Most likely, some normalization of these features would be needed if cross-protein training and testing is used.
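As one concrete version of the "look at the gaps between adjacent probability estimates" idea, a cutoff could be placed at the largest drop near the top of the list. The sketch below is purely illustrative, with hypothetical names and values; it is not a procedure evaluated in this work.

def cutoff_at_largest_gap(sorted_probs, max_candidates=20):
    """Pick a cutoff where the drop between adjacent probability estimates
    (sorted in descending order) is largest, within the top few residues."""
    top = sorted_probs[:max_candidates]
    gaps = [top[i] - top[i + 1] for i in range(len(top) - 1)]
    split = gaps.index(max(gaps)) + 1  # number of residues kept as positive
    return split, top[split - 1]       # count and the probability at the cut

# Hypothetical POOL probability estimates, already sorted in descending order.
probs = [0.92, 0.88, 0.85, 0.41, 0.39, 0.12, 0.10, 0.08]
print(cutoff_at_largest_gap(probs))  # (3, 0.85): keep the top 3 residues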

Another possible area for improving performance in protein active site prediction is feature selection. Some of the methods I compared against in Section 5.5.6 use many more features. Although it is not true that the more features one uses, the better the performance one gets, it is worth exploring the use of more features, including both simple ones extracted directly from the 3D structures or sequences, and more sophisticated ones such as the four THEMATICS features I used, or even results from other machine learning systems. These features can be fed either to the POOL method, or to another learning system along with the result of the POOL method.

I will mention one more area where further research is needed, although it is beyond the scope of protein active site prediction. Once one knows the active site of a protein, the next natural question is what function that site performs; this is the next step in function prediction after active site prediction. I believe the exact shapes of the THEMATICS titration curves of a given residue, and of the residues within a certain region, may give some clues about the class of reaction the site catalyses. In principle, this might also be addressed with machine learning. It is clearly a very challenging but rewarding task.

Appendices

Appendix A. The training set used in THEMATICS-SVM

Name EC Classification PDB ID

Acetylcholinesterase (E.C. 3.1.1.7) 1ACE

3-Ketoacyl-CoA thiolase (E.C. 2.3.1.16) 1AFW

Ornithine carbamoyltransferase (E.C. 2.1.3.3) 1AKM

Glutamate racemase (E.C. 5.1.1.3) 1B74

Alanine racemase (E.C. 5.1.1.1) 1BD0

Adenosine kinase (E.C. 2.7.1.20) 1BX4

Subtilisin Carlsberg (E.C. 3.4.21.62) 1CSE

Micrococcal nuclease (E.C. 3.1.31.1) 1EY0

Oxalate oxidase (E.C. 1.2.3.4) 1FI2

DNA-(apurinic or apyrimidinic site) lyase (E.C. 4.2.99.18) 1HD7

2-Amino-4-hydroxy-6-hydroxymethyl-dihydropteridine pyrophosphokinase (E.C. 2.7.6.3) 1HKA

Colicin E3 Immunity Protein (E.C. 3.1.21.-) 1JCH

L-lactate dehydrogenase (E.C. 1.1.1.27) 1LDG

Papain (E.C. 3.4.22.2) 1PIP

Mannose-6-phosphate isomerase (E.C. 5.3.1.8) 1PMI

Pepsin (E.C. 3.4.23.1) 1PSO

Triosephosphate Isomerase (E.C. 5.3.1.1) 1TPH

Aldose reductase (E.C. 1.1.1.21) 2ACS

HIV-1 protease (E.C. 3.4.23.16) 2AID

Mandelate Racemase (E.C. 5.1.2.2) 2MNR

Appendix B. The test set used in THEMATICS-SVM

The following table gives the testing results for 64 CatRes/CSA proteins. Bold indicates residues that are CatRes/CSA active, ionizable, and correctly predicted in a cluster of just ionizables by the SVM. Bold italic indicates residues that are CatRes/CSA active, missed in the SVM result, but found in our cluster if neighbors within 6Å are included. Underline indicates CatRes/CSA active residues missed by both criteria. []ab means the symmetric clusters appear in both chain a and chain b. [XXXab] means residue XXX in both chain a and chain b is in the same cluster. The number to the left of ";" in the recall and filtration ratio columns indicates the percentage obtained from the SVM and clustering on ionizable residues only, and the number to the right indicates the result from the test including neighbors within 6Å.

PDB Code Protein Name CatRes SVM Recall (%) Filtration Result Positive Reported ratio (%) SVM only / Positive SVM-region 1AL6 Citrate (si)- H274 [H246a H249a 67;100 5;17 Correct/Correct R413a E420a H320 H246b H249b D375 R413b E420b] [Y231b H235b H238a H274a R329b D375b R401b R421a] [Y231a H235a H238b H274b R329a D375a R401a R421b] [D174 D257]a,b [Y158 Y167]a, b [Y318 Y330]a,b 1APX Heme peroxidase R38 H42 [E65 H68] 0;33 2;10 Incorrect/Partial [H163 D208] correct N71 [H116 E244] 1AQ2 Phosphoenol pyruvate H232 [R65 K70 100;100 5;16 Correct/Correct carboxykinase C125 Y207 K254 E210 K212 R333 K213 H232 C233 K254 D268 D269 E270 H271 E282 C285 Y286 E311 R333 Y336] [C408 Y421 Y524] [E36 H146] 1B3R D130 K185 [C52 H54 60;100 3;20 Correct/Correct D129 D130 D189 E154 E155 N190 D189 C194 C194 C227]

138 [Y220 Y256] 1B6B Aralkylamine N- S97 L111 [E54 H127]a 30;70 5;17 Partial acetyltransferase [H120 H122]ab correct/Correct H122 L124 [H145 H174]a Y168 [E50 E52]b [E190 E192]b [C160 Y168]b 1B93 Methylglyoxyl synthase H19 G66 [H19 D71 H98 67;100 3;11 Correct/Correct D71 D91 D101] H98 D101 1BG0 R126 E225 [Y68 Y89 H90 100;100 7;20 Correct/Correct H99 R124 R229 R126 C127 R280 D192 E224 R309 E225 D226 R229 C271 R280 R309 E314 H315 R330 E335] [H185 H284] [Y145 R208] [Y134 K151] 1BOL T2 H46 E105 [Y116 Y121 67;100 4;13 Correct/Correct E128 D129 H109 D132 Y202] [E105 H109] 1BWD L-arginine:Inosamine- D108 R127 [E9 E37 Y53 86;100 5;15 Correct/Correct phosphate H87 H102 amidinotransferase D179 D103 C105 H227 R107 D108 D229 E130 D179 H227 D229 H331 H278 C279 C332 H331 C332]ab [D30 Y161]b 1BZY Hypoxanthine-guanine E133 D134 [R100 K102 80;100 4;12 Correct/Correct phosphoribosyltransferase D193] D137 [Y104 D137 K165 K165] R169 [E133 D134] 1CD5 Glucosamine-6-phosphate D72 D141 [D72 E73 Y74 25;25 3;9 Partial correct/ isomerase Y85] Partial correct H143 E148 [H19 E198] [K124 Y128] 1CHD Protein-glutamate S164 T165 [H233 E235 40;80 3;10 Partial correct / methylesterase H248] Correct H190 [H190 H256 M283 D286] D286 1COY E361 H447 [R44 C57 R65 0;33 2;10 Incorrect / Partial Y92 Y107 correct N485 Y219 Y446] [Y21 K225] [R328 Y376] 1CQQ H40 E71 [Y97 C100 0;0 2;5 Incorrect / Incorrect K153] G145 C147 1CTT [H102 E104 100;100 3;11 Correct/Correct C129 H131 E104 C132]ab [E138 H203 E229]ab [C217 Y252]b 1D0S Nicotinate-nucleotide- [D69 H70 25;75 1;8 Partial Correct / dimethylbenzimidazole E174 D242 Correct phosphoribosyltransferase E174 E317 D263] G176

139 K213 1DAA Aminotransferase class-IV E177 [R22a R98a 67;100 4;14 Correct / Correct H100a Y31b K145 L201 E32b H47b R50b Y88b Y114b K145b] [Y31a E32a H47a R50a Y88a K145a R98b] [C142 E177]ab [R22 R93]b 1DAE T11 K15 [D10ab C151ab 50;75 2;11 Correct/Correct H154ab] K37 S41 [K15 K37]ab 1DB3 GDP-mannose 4,6- T132 E134 [D13 Y128 75;100 5;16 Correct/Correct dehydratase Y177 C179 Y156 H186] K160 [D105 E134 Y156 K160] [H228 D231 E344] [K2 Y26] [E315 E317] 1DL2 Mannosyl-oligosaccharide E132 R136 [D86 E132 75;100 5;16 Correct/Correct 1,2-alpha- E207 D275 D275 E435 E279 K283 D336 H337 Y389 E399 E435 E438 E503 Y507 E526 H528] [K216 Y235 E290 Y293] [D52 H68] 1DNK I E78 H134 [E39 Y76 E78 100;100 4;10 Correct/Correct H134 D168 D212 D212 H252] H252 [R31 Y32] 1DNP Deoxyribodipyrimidine W306 [H44 E106 0;0 7;15 Incorrect / Incorrect photolyase E109 C251 W359 R278 R282 W382 D372 D374]ab [R8 D10 D130]a [E318 Y365]ab [D409 D431]ab [D327 D331]b [D354 H453]ab [K353 Y464]a 1DZR dTDP-4-dehydrorhamnose H63 D170 [R60 H63 D84 50;100 4;14 Correct/Correct 3,5-epimerase H120 Y133 Y139 E144] 1E2A IIAlac H78 Q80 [H78abc 75;75 5;21 Correct/Correct D81abc D81 H82 H82abc] [E32 H94 H95]b [E32 H94 H95]c [E32 H95]a 1EBF D219 [K117 E208 100;100 2;10 Correct/Correct D210 D213 K223 D214 D219 K223] 1FRO E172 [R37a E99a 100;100 6;27 Correct/Correct Y114a D165a D167a R122b

140 H126b E172b] [R37c E99c Y114c D165c D167c H126d E172d] [R37b E99b Y114b D165b D167b H126a E172a] [Y70 H102 E107]abcd [H126c E172c E99d] [C138b K151b] [K150d K158d] [D165d D167d] [H115ab] 1GOG C228 [C228 Y272 75;100 2;8 Correct/Correct H334 C383 Y272 Y405 H442 W190 Y495 H496 Y495 H581] [H85 D166 H522 D524] [D324 D404] 1GRC Phosphoribosylglycinamide N106 [H54ab E70ab 50;50 5;14 Correct/Correct formyltransferase H73ab E74ab H108 S135 Y78ab] D144 [H108 H137 D144]a [Y67 R90]ab [E44ab] 1HXQ UDP-glucose--hexose-1- C160 [E182 E232 50;75 5;18 Correct/Correct phosphate uridylyltransferase H281 H296 H164 H298] H166 [H115 E152 Q168 H164 H166] [R13 Y201 R211 R324] [E121 D267 H342] [D183 H292] 1I7D DNA III E7 K8 [E7 K8 H44 67;67 4;11 Correct/Correct D103 D105 R330 E107 E114 D136 Y320 D332 C333 H340 C372 D379 H381 H382 Y410 D520 E525] [H100 D113] [E286 E458] 1IDJ R176 R236 [D186 D217 50;50 3;10 Correct/Correct D221 R236 K239 D242] [H247 E272] [H178 H210] 1KAS 3-oxoacyl-[acyl-carrier C163 [C163 H303 75;100 3;14 Correct/Correct protein] synthase D311 E314 H303 K335 H340 H340 E349 C395]ab F400 [H168ab H172ab D181ab] [E115ab H118ab] 1LBA T7 Lysomsome Y46 K128 [H17 C18 Y46 50;100 5;18 Correct/Correct H47 H68 C80

141 H122 C130] 1MAS D14 N168 [H157ab 67;100 4;13 Correct/Correct E158ab H241 D192ab H195a] [D10 D14 D15 E166 H241 D242]b [D10 D14 D15 H241 D242]a [E265ab] [K44 Y92]a 1MHY C151 T213 [D74a K78a 50;100 8;25 Correct/Correct H80a E89a R171a D172a C173a R179a K45b Y46b K49b R186b D190b D196b E199b C200b D270b D418b H439b H446b D450b E454b E460b E462b R463b Y464b E465b C466b H467b E471b R45c D51c Y54c E58c E62c H112c R116c K12c D133c] [H39a C57a Y99a H109a H110a D176a E71b D75b E111b E114b D143b E144b H147b H149b C151b D170b R172b R175b E209b D242b E243b H246b] [H166 K189 H252 D256 R264]a [K104 Y162 H168 R360]b [Y112a K116a Y290a K65b] [R98 Y288 Y351]b [D243 E246]a 1MPP Mucoropepsin D32 S35 [D9 D11 E13 50;75 1;4 Correct/Correct Y75 D215 D32 D215] 1NID (Copper D98 H255 [H95a D98a 100;100 5;19 Correct/Correct Containing) H100a H135a C136a H145a H255c E279c H306c] [H95c D98c H100c H135c C136c H145c H255b E279b H306b] [H95b D98b H100b H135b C136b H145b H255a E279a

142 H306a] [E180 D182 H245 E310]abc [D251abc] [H260 Y293 R296]abc 1NSP Nucleoside-diphosphate K16 N119 [C16 H55 Y56 67;100 5;15 Correct/Correct kinase E58 R109 H122 H122 E133] 1NZY 4-Chlorobenzoyl Coenzyme F64 H90 [D160abc 40;60 4;16 Partial correct / A Dehalogenase E163abc Correct G114 E175abc W137 D178abc] D145 [D123b D145b Y150b R154b H218b C228c E232c] [C228a E232a H90c D145c Y150c R154c H218c] [D145a Y150a R154a H218a C228b E232b] [H138 D168]b 1PGS D60 E206 [Y62 R80 0;0 3;14 Incorrect / Incorrect Y116] [Y161 Y293] [Y183 K302] [H224 Y277] 1PJB K74 H95 [E8 E13 R15 75;75 3;9 Correct / Correct K72 K74 E75 E117 D269 Y93 H95 Y116 E117] 1PKN R72 R119 [R72 D112 33;67 3;12 Partial correct / E117 K269 Correct K269 T327 E271 D295 S361 E363 E299] [C316 D356 C357 E385 R444 R466] [D224 D227] [K265 Y465] 1PNL Penicillin Acylase S1 A69 [E152a K179a 0;0 4;18 Incorrect / Incorrect H192a D73b N241 D74b D76b Y180b Y190b Y196b D252b] [D12a H18a Y31a D38a R39a Y96a Y33b H38b H520b] [R145a Y27b Y31b Y52b] [R263 K394]b [D484 D501]b [R479 Y528]b [E80 H123]b [Y33 K106]a 1PUD Queuine tRNA- D102 [K55a D315b 100;100 3;10 Correct / Correct ribosyltransferase (tRNA- C318b C320b guanine transglycosylase) C323b E348b H349b] [D315a C318a C320a C323a E348a H349a K55b] [R38 R60

143 R362]ab [D102 D280]ab 1QFE 3-dehydroquinate E86 H143 [E46 E86 D114 100;100 3;12 Correct / Correct dehydratase E116 H143 K170 K170]ab [D50 H51]ab 1QPR Quinolinate K140 E201 [R139 R146 67;67 5;13 Correct / Correct phosphoribosyltransferase K150 H161 (decarboxylating) (Type II) D222 R162 K172 D173 E199 E201 D203 D222 E246] [D57 D80] 1QQ5 2-haloacid dehalogenase D8 T12 [D8 R39 Y89 56;100 4;13 Correct / Partial K147 Y153 Correct R39 N115 D176] K147 S171 [Y95 R192] N173 F175 D176 1QUM Deoxyribonuclease IV E261 [H69 D70 E94 100;100 4;12 Correct / Correct E145 C177 D179 C181 H182 H216 E261] 1RA2 I5 M20 [K38 K109 0;0 2;6 Incorrect / Incorrect D27 L28 Y111] F31 L54 I94 1UAE UDP-N-acetylglucosamine N23 C115 [K22 R91 50;100 4;14 Correct / Correct 1-carboxyvinyltransferase C115 R120 D305 E188 E190 R397 D231 E234 H299 R331 D369 R371 H394 R397 Y399 K405] 1UAG UDP-N- K115 [K115 H183 67;100 9;8 Correct / Correct acetylmuramoylalanine--D- Y187 Y194 glutamate N138 R302 K319 H183 D346 K348 C413 R425] 1ULA Purine-nucleoside H86 E89 [D134a H135a 67;67 3;11 Correct / Correct (type 1) Y166a E201c N243 E205c Y249c] [E201a E205a Y249a D134b H135b Y166b] [E201b E205b Y249b D134c H135c Y166c] [H257 E258]ac [H86 E89]abc 1UOK Oligo-1,6-glucosidase D199 E255 [Y12 Y15 Y39 100;100 6;20 Correct/ Correct D60 D64 D98 D329 H103 H161 D169 D199 E255 H283 D285 Y324 H328 D329 R332 R336 H356 Y365 E368 E369 D385E387 D416 R419 Y464 R471 Y495 R497] [D21 D29]

144 1VAO Vanillyl Y108 [Y108 D170 100;100 4;15 Correct / Correct Y187 R312 D170 D317 R398 H422 E410 E464 Y503 H466 Y503 R504] R504 [D59 H61 H422 H506] [Y148 D167 H193] [H467 C470] [H313 Y440] [Y276 Y358] 1WGI Inorganic D117 [E48 K56 E58 100;100 5;15 Correct / Correcy Y89 E101 H107 D115 D117 D120 D147 D152 Y192]a [E48 K56 E58 E101 D115 D117 D120 D147 D152 Y192]b [E123 D159 D162]b [H87ab] 1YTW Protein Tyrosine E290 D356 [C259 Y261 33;67 2;11 Partial Correct / H270 Y301 Correct H402 H350 H402 C403 R409 C403] T410 2CPO Heme Chloroperoxidase H105 [E104 H105 100;100 3;7 Correct / Correct D106 H107 E183 D113 E161 D168 E183] 2HDH 3-hydroxyacyl-CoA S137 H158 [R209a Y214ab 50;100 3;10 Correct / Correct dehydrogenase E217ab E170 N208 R220ab R224a Y242a H275a] [H158a E170b] [H158b E170a] [H266b H275b] 2HGS Glutathione Synthase R125 S151 [D24a H107b 50;100 5;21 Correct / Correct E214b R221b G369 E224b R236b R450 Y265b R267b Y270b E287b K293b C294b D296b Y432b] [H107a E214a R221a E224a R236a Y265a R267a Y270a E287a K293a C294a D296a Y432a D24b] [R125 D127 E144 K305 K364 E368 Y375 E425 R450 K452]a [R125 D127 E144 K305 K364 E368 Y375 E425 R450]b [H163 D469]ab 2JCW Superoxide H63 R143 [H46 H48 H63 50;100 5;16 Correct / Correct

145 H71 H80 D83 H120 D124]ab 2PFL Formate C-acetyltransferase W333 [D74a H84a 50;100 6;19 Correct / Correct R141ab C418 K142ab C419 H144ab G734 R174ab D180ab Y181ab R183a R218ab E221ab E222ab E225ab Y240a Y259a Y262ab K267a E368ab D413ab H498ab Y499ab H501ab D502ab D503ab Y504ab Y506ab E507ab H514ab R520ab Y594ab R595ab] [Y172 R176 R319 Y323 E400 C418 C419 R435 Y490 Y612 H704 R731 Y735]ab [H84 Y240 Y259 K267]b [C159 Y444]b [R316 D330]ab 2PLC 1-phosphatidylinositol H45 D46 [H45 D46 D82 60;100 3;11 Correct / Correct E128 D204 R84 H93 H236 D278] D278 [Y71 K115] 2THI pyridinylase C113 E241 [Y16 Y50 D64 100;100 5;16 Correct / Correct C113 E171 D175 Y222 Y239 E241 D265 Y270 D272]ab [E37ab D84ab E284ab] [Y323 Y333]b [Y180 R349]ab [H282ab] 8TLN M4 E143 [D138 H142 50;100 4;13 Correct / Correct E143 E166 H231 D170 E177 D185 E190] [K18 D72 Y76 K182]

Appendix C. The 64 protein test set used in THEMATICS-POOL

PDB Code Protein Name E.C. Number CSA Annotated Active Site Residues

1A05 1,4-Diacid decarboxylating dehydrogenase 1.1.1.85 Y140, K190, D222

1A26 ADP-ribosyltransferase 2.4.2.30 Y907, E988

1A4I Methylenetetrahydrofolate Dehydrogenase 1.5.1.5 K56

1A4S Aldehyde dehydrogenase (NAD+) / Betaine-aldehyde dehydrogenase 1.2.1.8 N166, E263, C297

1AFW Acetyl-CoA C-acyltransferase 2.3.1.16 C125, H375, C403, G405

1AKM Ornithine Carbamoyltransferase 2.1.3.3 R106, H133, Q136, D231, C273, R319

1AOP Sulphite reductase 1.8.1.2 R83, R153, K215, K217 C483

1APX Heme peroxidase 1.11.1.11 R38, H42, N71

1B6B Aralkylamine N-acetyltransferase 2.3.1.87 S97, L111, H122, L124, Y168

1BG0 Arginine Kinase 2.7.3.3 R126, E225, R229, R280, R309

1BRM Aspartate-beta-semialdehyde dehydrogenase 1.2.1.11 C135, Q162, H274

1BRW Pyrimidine-nucleoside phosphorylase 2.4.2.2 H82, R168, S183, K187

1BWD L-arginine:Inosamine-phosphate 2.1.4.2 D108, R127, D179, amidinotransferase H227, D229 H331, C332

1BZY Hypoxanthine-guanine 2.4.2.8 E133, D134, D137, phosphoribosyltransferase K165, R169

1C3J DNA beta-glucosyltransferase 2.4.1.27 E22, D100

1COY Cholesterol Oxidase 1.1.3.6 E361, H447, N485

1CQQ Picornain 3C 3.4.22.28 H40, E71, G145, C147

1D0S Nicotinate-nucleotide-dimethylbenzimidazole 2.4.2.21 E317 phosphoribosyltransferase

1D4A NAD(P)H dehydrogenase (quinone) 1.6.99.2 G149, Y155, H161

1D4C Succinate dehydrogenase (Fumarate reductase) 1.3.99.1 H364, R401, H503, R544

1DII 4-cresol dehydrogenase 1.17.99.1 Y73, Y95, E380, E427, H436, R474

1DLI UDP-glucose 6-dehydrogenase 1.1.1.22 T118, E145, K204, N208, C260, D264

1DO8 Malate dehydrogenase 1.1.1.39 Y112, K183, D278

1E2A Histidine Kinase IIAlac 2.7.1.69 H78, Q80, D81, H82

1EBF Homoserine dehydrogenase 1.1.1.3 D219, K223

1FOH Phenol 2-monooxygenase 1.14.13.7 D54, R281, Y289

1FUG Methionine adenosyltransferase 2.5.1.6 H14, K165, R244, K245, K265, K269, D271

1G72 Methanol dehydrogenase 1.1.99.8 D297

1GET Glutathione reductase 1.6.4.2 C42, C47, K50, Y177, E181, H439, E444

1GOG Galactose Oxidase 1.1.3.9 C228, Y272, W290, Y495

1GPR The IIAglc Histidine kinase 2.7.1.69 T66, H68, H83, G85

1GRC Phosphoribosylglycinamide formyltransferase 2.1.2.2 N106, H108, S135, (GARTFase II) D144

1IVH Isovaleryl-CoA dehydrogenase 1.3.99.10 E254

1JDW Glycine amidinotransferase 2.1.4.1 D254, H303, C407

1KAS 3-oxoacyl-[acyl-carrier protein] synthase 2.3.1.41 C163, H303, H340, F400

1L9F Monomeric sarcosine oxidase 1.5.3.1 H45, R49, H269, C315

1LCB Thymidylate synthase 2.1.1.45 E60, R178, C198, S219, D221, D257, H259

1LXA UDP-N-acetylglucosamine acyltransferase 2.3.1.129 H125

1MBB UDP-N-acetylmuramate dehydrogenase 1.1.1.158 R159, S229, E325

1MHL Mammalian Myeloperoxidase 1.11.1.7 Q91, H95, R239

1MLA [Acyl-carrier protein] 2.3.1.39 S92, H201, Q250 S-malonyltransferase

1MOQ Glucosamine--fructose-6-phosphate 2.6.1.16 E481, K485, E488, aminotransferase (isomerising domain) H504, K603

1MPY Extradiol Catecholic Dioxygenase 1.13.11.2 H199, H246, Y255

1NID Nitrite Reductase 1.7.99.3 D98, H255

1NSP Nucleoside-diphosphate kinase 2.7.4.6 K16, N119, H122

1OFG Glucose-fructose oxidoreductase 1.1.99.28 K129, Y217

1PFK 2.7.1.11 G11, R72, T125, D127, R171

1PJB Alanine dehydrogenase 1.4.1.1 K74, H95, E117, D269

1PKN Pyruvate Kinase 2.7.1.40 R72, R119, K269, T327, S361, E363

1PUD Queuine tRNA-ribosyltransferase (tRNA- 2.4.2.29 D102 guanine transglycosylase)

1R51 Urate Oxidase 1.7.3.3 R176, Q228

1RA2 Dihydrofolate reductase 1.5.1.3 I5, M20, D27, L28, F31, L54, I94

1UAE UDP-N-acetylglucosamine 2.5.1.7 N23, C115, D305, R397 1-carboxyvinyltransferase

1ULA Purine-nucleoside phosphorylase (type 1) 2.4.2.1 H86, E89, N243

1VAO Vanillyl Alcohol Oxidase 1.1.3.13 Y108, D170, H422, Y503, R504

1VNC Chloride peroxidase 1.11.1.10 K353, H404

1XVA Glycine N-methyltransferase 2.1.1.20 E15

1ZIO Adenylate kinase 2.7.4.3 K13, R127, R160, D162, D163, R171

2ALR Mammalian Aldehyde Reductase 1.1.1.2 Y49, K79

2BBK Methylamine dehydrogenase 1.4.99.3 D32, W57, D76, W108, Y119, T122

2CPO Heme Chloroperoxidase 1.11.1.10 H105, E183

2HDH 3-hydroxyacyl-CoA dehydrogenase 1.1.1.35 S137, H158, E170, N208

2JCW Superoxide dismutase 1.15.1.1 H63, R143

3PCA Protocatechuate dioxygenase 1.13.11.3 Y447, R457

Appendix D. The 160 protein test set used in THEMATICS-POOL

PDB Code Protein Name E.C. Number CSA Annotated Active Site Residues

12AS Aspartate--ammonia ligase 6.3.1.1 D46, R100, Q116

13PK Phosphoglycerate kinase 2.7.2.3 R39, K219, G376, G399

1A05 1,4-Diacid decarboxylating 1.1.1.85 Y140, K190, D222 dehydrogenase

1A26 ADP-ribosyltransferase 2.4.2.30 Y907, E988

1A4I Methylenetetrahydrofolate 1.5.1.5 K56 Dehydrogenase

1A4S Aldehyde dehydrogenase 1.2.1.8 N166, E263, C297 (NAD+) / Betaine-aldehyde dehydrogenase

1AE7 Phospholipase A2 (PLA2) 3.1.1.4 G30, H48, D99

1AFW Acetyl-CoA C- 2.3.1.16 C125, H375, C403, acyltransferase G405

1AH7 Phospholipase C 3.1.4.3 D55

1AKM Ornithine 2.1.3.3 R106, H133, Q136, Carbamoyltransferase D231, C273, R319

1ALK Alkaline phosphatase 3.1.3.1 S102, R166

1AOP Sulphite reductase 1.8.1.2 R83, R153, K215, K217 C483

1APX Heme peroxidase 1.11.1.11 R38, H42, N71

1APY Aspartylglucosylaminidase 3.5.1.26 T183, T201, T234, G235

1AQ2 Phosphoenol pyruvate 4.1.1.49 H232, K254, R333 carboxykinase

1AW8 Aspartate 1-decarboxylase 4.1.1.11 Y58

1B3R Adenosylhomocysteinase 3.3.1.1 D130, K185, D189, N190, C194

1B57 Fructose-bisphosphate aldolase (class II) 4.1.2.13 D109, E182, N286

1B66 6-pyruvoyl tetrahydropterin 4.6.1.10 C42, D88, H89, E133 synthase

1B6B Aralkylamine N- 2.3.1.87 S97, L111, H122, acetyltransferase L124, Y168

1B73 Glutamate racemase 5.1.1.3 D7, S8, C70, E147, C178, H180

1B93 Methylglyoxyl synthase 4.2.99.11 H19, G66, D71, D91, H98, D101

1BCR D 3.4.16.6 G53, S146, Y147, D338, H397

1BG0 Arginine Kinase 2.7.3.3 R126, E225, R229, R280, R309

1BJP 4-oxalocrotonate 5.3.2.0 P1, R39, F50 tautomerase

1BML /streptokinase 3.4.21.7 H603, S608, D646

1BOL Ribonuclease T2 3.1.27.1 H46, E105, H109

1BRM Aspartate-beta- 1.2.1.11 C135, Q162, H274 semialdehyde dehydrogenase

1BRW Pyrimidine-nucleoside 2.4.2.2 H82, R168, S183, phosphorylase K187

1BS4 3.5.1.31 G45, Q50, L91, E133

1BTL Beta-Lactamase Class A 3.5.2.6 S70, K73, S130, E166

1BWD L-arginine:Inosamine- 2.1.4.2 D108, R127, D179, phosphate H227, D229 H331, amidinotransferase C332

1BWP 2-acetyl-1- 3.1.1.47 S47, G74, N104, alkylglycerophosphocholine D192, H195

1BZY Hypoxanthine-guanine 2.4.2.8 E133, D134, D137, phosphoribosyltransferase K165, R169

1C3C Adenylosuccinate lyase 4.3.2.2 H68, H141, E275

1C3J DNA beta- 2.4.1.27 E22, D100 glucosyltransferase

1CB8 Chondroitin AC lyase 4.2.2.5 H225, Y234, R288

1CD5 Glucosamine-6-phosphate 5.3.1.10 D72, D141, H143, isomerase E148

1CHD Protein-glutamate 3.1.1.61 S164, T165, H190, methylesterase M283, D286

1CHK 3.2.1.132 E22, D40

1CHM 3.5.3.3 H232, E262, E358

1COY Cholesterol Oxidase 1.1.3.6 E361, H447, N485

1CQQ Picornain 3C 3.4.22.28 H40, E71, G145, C147

1CTT Cytidine Deaminase 3.5.4.5 E104

1D0S Nicotinate-nucleotide- 2.4.2.21 E317 dimethylbenzimidazole phosphoribosyltransferase

1D4A NAD(P)H dehydrogenase 1.6.99.2 G149, Y155, H161 (quinone)

1D4C Succinate dehydrogenase 1.3.99.1 H364, R401, H503, (Fumerate reductase) R544

1D8C 4.1.3.2 D270, E272, R338, D631

1D8H Polynucleotide 5'- 3.1.3.33 R393, E433, K456, phosphatase R458

1DAA Aminotransferase class-IV 2.6.1.21 K145, E177, L201

1DAE Dethiobiotin synthase 6.3.3.3 T11, K15, K37, S41

1DB3 GDP-mannose 4,6- 4.2.1.47 T132, E134, Y156, dehydratase K160

1DBT Orotidine-5'- 4.1.1.23 D60, K62 monophosphate decarboxylase

1DCO 4a-hydroxytetrahydrobiopterin dehydratase 4.2.1.96 H62, H63, H80, D89

1DGS NAD+ dependent DNA 6.5.1.2 K116, D118, R196, ligase K312

1DII 4-cresol dehydrogenase 1.17.99.1 Y73, Y95, E380, E427, H436, R474

1DIZ DNA-3-methyl adenine 3.2.2.21 Y222, W272, D238 II

1DL2 Mannosyl-oligosaccharide 3.2.1.113 E132, R136, D275, E435 1,2-alpha-mannosidase

1DLI UDP-glucose 1.1.1.22 T118, E145, K204, N208, C260, D264 6-dehydrogenase

1DNK Deoxyribonuclease I 3.1.21.1 E78, H134, D212, H252

1DO8 Malate dehydrogenase 1.1.1.39 Y112, K183, D278

1DQS 3-dehydroquinate synthase 4.6.1.3 H275

1DZR dTDP-4-dehydrorhamnose 5.1.3.13 H63, D170 3,5-epimerase

1E2A Histidine Kinase IIAlac 2.7.1.69 H78, Q80, D81, H82

1EBF Homoserine dehydrogenase 1.1.1.3 D219, K223

1EF8 Methylmalonyl-CoA 4.1.1.41 H66, G110, Y140 decarboxylase

1EUG 3.2.2.3 D64, H187 (Uracil DNA glycosylase)

1EYI Fructose-1,6- 3.1.3.11 D68, D74, E98 bisphosphatase

1FGH 4.2.1.3 D100, H101, H147, D165, H167, E262, H642

1FOH Phenol 2-monooxygenase 1.14.13.7 D54, R281, Y289

1FRO Lactoylglutathione lyase 4.4.1.5 E172

1FUA L-fuculose-phosphate aldolase 4.1.2.17 E73

1FUG Methionine 2.5.1.6 H14, K165, R244, adenosyltransferase K245, K265, K269, D271

1FUI 5.3.1.3 E337, D361

1G72 Methanol dehydrogenase 1.1.99.8 D297

1GET Glutathione reductase 1.6.4.2 C42, C47, K50, Y177, E181, H439, E444

1GIM Adenylosuccinate 6.3.4.4 D13, H41, Q224 synthetase

1GOG Galactose Oxidase 1.1.3.9 C228, Y272, W290, Y495

1GPM GMP synthase 6.3.5.2 G59, C86, Y87, H181, E183, D239

1GPR The IIAglc Histidine kinase 2.7.1.69 T66, H68, H83, G85

1GRC Phosphoribosylglycinamide 2.1.2.2 N106, H108, S135, formyltransferase D144 (GARTFase II)

1GTP GTP Cyclohydrolase 3.5.4.16 H112, H179

1HFS Stromelysin-1 () 3.4.24.17 E202, M219

1HXQ UDP-glucose--hexose-1- 2.7.7.12 C160, H164, H166, phosphate Q168 uridylyltransferase

1I7D DNA topoisomerase III 5.99.1.2 E7, K8, F328, R330

1IVH Isovaleryl-CoA 1.3.99.10 E254 dehydrogenase

1JDW Glycine amidinotransferase 2.1.4.1 D254, H303, C407

1KAS 3-oxoacyl-[acyl-carrier 2.3.1.41 C163, H303, H340, protein] synthase F400

1KFU m- Form II 3.4.22.17 Q99, C105, H262, N286

1KRA Urease 3.5.1.5 H219, D221, H320, R336

1L9F Monomeric sarcosine 1.5.3.1 H45, R49, H269, C315 oxidase

1LBA T7 Lysomsome 3.5.1.28 Y48, K128

1LCB Thymidylate synthase 2.1.1.45 E60, R178, C198, S219, D221, D257, H259

1LXA UDP-N-acetylglucosamine 2.3.1.129 H125 acyltransferase

1MAS Purine nucleosidase 3.2.2.1 D14, N168, H241

1MBB UDP-N-acetylmuramate 1.1.1.158 R159, S229, E325 dehydrogenase

1MHL Mammalian 1.11.1.7 Q91, H95, R239 Myeloperoxidase

1MHY Methane Monooxygenase 1.14.13.25 C151, T213

1MKA 3-hydroxydecanoyl-[acyl- 4.2.1.60 H70, V76, G79, C80, carrier protein] dehydratase D84

1MLA [Acyl-carrier protein] 2.3.1.39 S92, H201, Q250 S-malonyltransferase

1MOQ Glucosamine--fructose-6- 2.6.1.16 E481, K485, E488, phosphate aminotransferase H504, K603 (isomerising domain)

1MPP Mucoropepsin 3.4.23.23 D32, S35, Y75, D215

1MPY Extradiol Catecholic 1.13.11.2 H199, H246, Y255 Dioxygenase

1NBA Carbamoylsarcosine 3.5.1.59 D51, K144, A172, T173, C177

1NID Nitrite Reductase 1.7.99.3 D98, H255

1NSP Nucleoside-diphosphate 2.7.4.6 K16, N119, H122 kinase

1NZY Chlorobenzoate Dehalogenase 3.8.1.6 F64, H90, G114, W137, D145

1OFG Glucose-fructose 1.1.99.28 K129, Y217 oxidoreductase

1PFK Phosphofructokinase 2.7.1.11 G11, R72, T125, D127, R171

1PGS Peptide 3.5.1.52 D60, E206 Aspartylglucosaminidase

1PJB Alanine dehydrogenase 1.4.1.1 K74, H95, E117, D269

1PKN Pyruvate Kinase 2.7.1.40 R72, R119, K269, T327, S361, E363

1PS1 4.6.1.5 F77, R157, R173, N219, K226, R230, S305, H309

1PUD Queuine tRNA- 2.4.2.29 D102 ribosyltransferase (tRNA- guanine transglycosylase)

1PYA 4.1.1.22 Y62, S81, F195, E197

1PYM Phosphoenolpyruvate 5.4.2.9 G47, L48, D58, K120

1QFE 3-dehydroquinate 4.2.1.10 E86, H143, K170 dehydratase

1QPR Quinolinate 2.4.2.19 R105, K140, E201, phosphoribosyltransferase D222 (decarboxylating) (Type II)

1QQ5 2-haloacid dehalogenase 3.8.1.2 D8, T12, R39, N115, K147, S171, N173, F175, D176

1QUM Deoxyribonuclease IV 3.1.21.2 E261

1R51 Urate Oxidase 1.7.3.3 R176, Q228

1RA2 Dihydrofolate reductase 1.5.1.3 I5, M20, D27, L28, F31, L54, I94

1RBL Ribulose bisphosphate 4.1.1.39 K175, K177, K201, carboxylase D203, H294, H327

1REQ Methylmalonyl-CoA mutase 5.4.99.2 Y89, H244, K604, D608, H610

1RPT High molecular weight 3.1.3.2 R11, H12, R15, R79, H257, D258

1SMN Serratia marcescens 3.1.30.2 R87, H89, N119 nuclease

1TYF CLP Protease (clpP) 3.4.21.92 G68, S97, M98, H122, D171

1UAE UDP-N-acetylglucosamine 2.5.1.7 N23, C115, D305, R397 1-carboxyvinyltransferase

1UAG UDP-N- 6.3.2.9 K115, N138, H183 acetylmuramoylalanine--D- glutamate ligase

1ULA Purine-nucleoside 2.4.2.1 H86, E89, N243 phosphorylase (type 1)

1UOK Oligo-1,6-glucosidase 3.2.1.10 D199, E255, D329

1VAO Vanillyl Alcohol Oxidase 1.1.3.13 Y108, D170, H422, Y503, R504

1VNC Chloride peroxidase 1.11.1.10 K353, H404

1WGI Inorganic pyrophosphatase 3.6.1.1 D117

1XVA Glycine N- 2.1.1.20 E15 methyltransferase

1YTW Protein Tyrosine 3.1.3.48 E290, D356, H402, Phosphatase C403, R409, T410

1ZIO Adenylate kinase 2.7.4.3 K13, R127, R160, D162, D163, R171

2ACY Acylphosphatase 3.6.1.7 R23, N41

2ADM ADENINE-N6-DNA- 2.1.1.72 N105, P106, Y108 METHYLTRANSFERASE

2ALR Mammalian Aldehyde 1.1.1.2 Y49, K79 Reductase

2BBK Methylamine dehydrogenase 1.4.99.3 D32, W57, D76, W108, Y119, T122

2BMI METALLO-BETA- 3.5.2.6 D86, N176 LACTAMASE

2CPO Heme Chloroperoxidase 1.11.1.10 H105, E183

2HDH 3-hydroxyacyl-CoA 1.1.1.35 S137, H158, E170, dehydrogenase N208

2HGS Glutathione Synthase 6.3.2.3 R125, S151, G369, R450

2JCW Superoxide dismutase 1.15.1.1 H63, R143

2PDA 1.2.7.1 E64

2PFL Formate C-acetyltransferase 2.3.1.54 W333, C418, C419, G734

2PHK Protein /threonine 2.7.1.38 D149, K151 kinase

2PLC 1-phosphatidylinositol 3.1.4.10 H45, D46, R84, H93, phosphodiesterase D278

2THI Thiamine pyridinylase 2.5.1.2 C113, E241

3CSM Chorismate mutase 5.4.99.5 R16, R157, K168, E246

3ECA / 3.5.1.1 T12, Y25, T89, D90, K162

3PCA Protocatechuate 1.13.11.3 Y447, R457 dioxygenase

4KBP Purple Acid Phosphatase 3.1.3.2 H202, H295, H296

5COX Prostaglandin- 1.14.99.1 Q203, H207, Y385 Endoperoxide Synthase

5ENL Enolase 4.2.1.11 E168, E211, K345, H373

5FIT diadenosine P1, P3- 3.6.1.29 Q83, H94, H96 triphosphate (ApppA) hydrolase

8TLN Metalloproteinase M4 3.4.24.27 E143, H231

9PAP Thiol-protease (papain) 3.4.22.2 Q19, C25, H159, N175


Bibliography

1. Schmid, M. B., Structural Proteomics: The potential of high throughput structure determination. Trends Microbiol 2002, 10 (Suppl.), S27-S31. 2. Stultz, C. M.; White, J. V.; Smith, T. F., Structural Analysis Based on State-space Modeling. Protein Sci 1993, 2, 305-314. 3. Combet, C.; Jambon, M.; Deleage, G.; Geourjon, C., Geno3D: automatic comparative molecular modeling of protein. Bioinformatics 2002, 18, (1), 213-214. 4. Venclovas, C., Comparative Modeling of CASP4 target proteins: combining results of sequence search with three-dimensional structure assessment. Proteins Suppl 2001, 5, 47-54. 5. Lambert, C.; Leonard, N.; De Bolle, X.; Depiereux, E., ESyPred3D: Prediction of proteins 3D structures. Bioinformatics 2002, 18, (9), 1250-1256. 6. Terwilliger, T. C., Waldo, G., Peat, TS, Newman, JM, Chu, K., Berendzen, J., Class- directed structure determination: Foundation for a protein structure initiative. Protein Sci 1998, 7, 1851-1856. 7. Madern, D.; Pfister, C.; Zaccai, G., Mutation at a Single Acidic Amino Acid Enhances the Halophilic Behaviour of Malate Dehydrogenase from Haloarcula Marismortui in Physiological Salts. European Journal of 1995, 230, 1088. 8. Looger, L. L., M.A. Dwyer, J.J. Smith, and H.W. Hellinga, Computational design of receptor and sensor proteins with novel functions. Nature 2003, 423, 185-190. 9. Oshiro, C. M., I.D. Kuntz and R.M.A. Knegtel, Molecular Docking and Structure-based Design. In Encyclopedia of Computational Chemistry, Schleyer, P. v. R., Ed. Wiley: Chichester, West Sussex, U.K, 1998; pp 1606-1613. 10. Lichtarge, O., Sowa, M.E., A. Philippi, Evolutionary traces of functional surfaces along signaling pathway. Methods in Enzymology 2002, 344, 536-556. 11. Lima, C. D., M.G. Klein, and W.A. Hendrickson, Structure-based analysis of catalysis and substrate definition in the HIT . Science 1997, 278, 286-290. 12. Marcotte, E. M., Pellegrini, M., Thompson, M. J., Yeates, T. O., Eisenberg, D., A combined algorithm for genome-wide prediction of protein function. Nature 1999, 402, (6757), 83-6. 13. Marcotte, E. M., Computational genetics: finding protein function by nonhomology methods. Curr Opin Struct Biol 2000, 10, (3), 359-65. 14. Mattos, C.; Ringe, D., Locating and characterizing binding sites on proteins. Nat Biotechnol 1996, 14, (5), 595-9. 15. Mlinsek, G., Novic, M., Hodoscek, M., Solmajer, T., Prediction of enzyme binding: human inhibition study by quantum chemical and artificial intelligence methods based on x-ray structures. J Chem Inf Comput Sci 2001, 41, (5), 1286-94. 16. Ondrechen, M. J., J.G. Clifton and D. Ringe, THEMATICS: A simple computational predictor of enzyme function from structure. Proc. Natl. Acad. Sci. (USA) 2001, 98, 12473- 12478. 17. Elcock, A. H., Prediction of functionally important residues based solely on the computed energetics of protein structure. J Mol Biol 2001, 312, 885-896.

161 18. Gutteridge, A., G. Bartlett, and J.M. Thornton, Using a neural network and spatial clustering to predict the location of active sites in enzymes. Journal of Molecular Biology 2003, 330, 719-734. 19. Innis, C. A., A.P. Anand, and R. Sowdhamini, Prediction of functional sites in proteins using conserved functional group analysis. Journal of Molecular Biology 2004, 337, 1053-1068. 20. Laurie, A. T. R., and R.M. Jackson, Q-SiteFinder: An energy-based method for the prediction of protein-ligand binding sites. Bioinformatics 2005, 21, 1908-1916. 21. Ota, M.; K. Kinoshita; Nishikawa, K., Prediction of catalytic residues in enzymes based on known tertiary structure, stability profile, and sequence conservation. Journal of Molecular Biology 2003, 327, 1053-1064. 22. Ondrechen, M. J., THEMATICS as a tool for functional genomics. Genome Informatics 2002, 13, 563-564. 23. Ondrechen, M. J., Identification of functional sites based on prediction of charged group behavior. In Current Protocols in Bioinformatics, Baxevanis, A. D.; Davison, D. B.; Page, R. D. M.; Petsko, G. A.; Stein, L. D.; Stormo, G. D., Eds. John Wiley & Sons: Hoboken, N.J., 2004; pp 8.6.1 - 8.6.10. 24. Ko, J., L.F. Murga, P. Andre, H. Yang, M.J. Ondrechen, R.J. Williams, A. Agunwamba, and D.E. Budil, Statistical Criteria for the Identification of Protein Active Sites Using Theoretical Microscopic Titration Curves. Proteins: Structure Function Bioinformatics 2005, 59, 183-195. 25. Kaelbling, L.; Littman, M.; Moore, A., Reinforcement Learning: A Survey. J. of Artificial Intelligence Research. 1996, 4, 237-285. 26. Mitchell, T. M., Machine Learning. McGraw-Hill: New York, 1997. 27. Landau M.; Mayrose I.; Rosenberg Y.; Glaser F.; Martz E.; T., P.; N., B.-T., ConSurf: identification of functional regions in proteins by surface-mapping of phylogenetic information. . Nucl. Acids Res. 2005, 33, W299-W302. 28. Pupko, T., R.E. Bell, I. Mayrose, F. Glaser, & N. Ben-Tal, Rate4Site: An algorithmic tool for the identification of functional regions in proteins by surface mapping of evolutionary determinants within their homologues. Bioinformatics 2002, 18, S71-S77. 29. Fetrow, J. S., Siew, N., Di Gennaro, J. A., Martinez-Yamout, M., Dyson, H. J., Skolnick, J., Genomic-scale comparison of sequence- and structure-based methods of function prediction: does structure provide additional insight? Protein Sci 2001, 10, (5), 1005-14. 30. Lichtarge, O., H. R. Bourne and F. E. Cohen., An evolutionary trace method defines binding surfaces common to protein families. J Mol Biol 1996, 257, (2), 342-58. 31. Yao, H., D.M. Kristensen, I. Mihalek, M.E. Sowa, C. Shaw, M. Kimmel, L. Kavraki, & O. Lichtarge, An accurate, sensitive, and scalable method to identify functional sites in proteins. J Mol Biol 2003, 326, 255-261. 32. Cheng, G.; Qian, B.; Samudrala, R.; Baker, D., Improvement in protein functional site prediction by distinguishing structural and functional constraints on protein family evolution using computational design. Nucleic Acid Research 2005, 33, (18), 5861-5867. 33. Devos, D.; Valencia, A., Practical limits of function prediction. Proteins: Structure, Functing and Genetics 2000, 4, 98-107. 34. Wilson, M. A., C.V. St. Amour, J.L. Collins, D. Ringe and G.A. Petsko, The 1.8 A resolution crystal structure of YDR533Cp from Saccharomyces cerevisiae: A member of the DJ- 1/ThiJ/PfpI superfamily. Proc Natl Acad Sci U S A 2004, 101, 1531-1536.

162 35. Amitai, G.; Shemesh, A.; Sitbon, E.; Shklar, M.; Netanely, D.; Venger, I.; Shmuel, P., Network Analysis of Protein Structures Identifies Functional Residues Journal of Molecular Biology 2004, 344, (4), 1135-1146. 36. Laskowski, R. A., SURFNET: A program for visualizing molecular surfaces, cavities and intermolecular interactions. J Mol Graph 1995, 13, 323-330. 37. Dundas, J.; Ouyang, Z.; Tseng, J.; Binkowski, A.; Turpaz, Y.; Liang, J., CASTp: computed atas of surface topography of proteins with structural and topographical mapping of functionally annotated residues. . Nucl. Acids Res. 2006, 34, W116-W118. 38. Xie, L.; Bourne, P. E., A robust and efficient algorithm for the shape description of protein structures and its application in predicting ligand binding sites. BMC Bioinformatics 2007, 8, s4-s9. 39. Petrova, N.; Wu, C., Prediction of catalytic residues using Support Vector Machine with selected protein sequence and structural properties. BMC Bioinformatics 2006, 7, (1), 312. 40. Youn, E.; Peters, B.; Radivojac, P.; Mooney, S. D., Evaluation of features for catalytic residue prediction in novel folds. Protein Sci. 2007, 16, 216-226. 41. Duda, R. O.; Hart, P. E.; Stork, D. G., Pattern Classification. Wiley: New York, 2001; p 654. 42. Belur, V., Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques,. 1991. 43. Vapnik, V., Statistical Learning Theory. Springer-Verlag: New York, 1998. 44. Schlkopf, B.; Smola, A., Learning with Kernels. MIT Press: Cambridge, MA, 2002. 45. Tong, W.; Williams, R. J.; Wei, Y.; Murga, L. F.; Ko, J.; Ondrechen, M. J., Enhanced performance in prediction of protein active sites with THEMATICS and support vector machines. Protein Sci. 2008, 17, 333-341. 46. Freund, Y.; Schapire, R., A short introduction to boosting. J of Japanese Society for Artificial Intelligence. 1999, 14, (5), 771-780. 47. Schapire, R. E., The Strength of Weak Learnability. Machine Learning 1990, 5, (2), 197- 227. 48. Matthews, B. W., Comparison of the predicted and observed secondary structure of T4 phage . Biochim. Biophys. Acta 1975, 405, 442-451 49. Yang, A. S., Gunner, M. R., Sampogna, R., Sharp, K., Honig, B., On the calculation of pKas in proteins. Proteins 1993, 15, (3), 252-65. 50. Bashford, D.; Karplus, M., Multiple-site Titration Curves of Proteins: An Analysis of Exact and Approximate Methods for Their Calculation. J. Phys. Chem. 1991, 95, 9556-9561. 51. Warwicker, J.; Watson, H. C., Calculation of the electric potential in the active site cleft due to alpha-helix dipoles. J Mol Biol 1982, 157, (4), 671-9. 52. Antosiewicz, J., Briggs, J.M., Elcock, A.H., Gilson, M.K., and McCammon, J.A., Computing the Ionization States of Proteins with a Detailed Charge Model. J. Comp. Chem. 1996, 17, 1633-1644. 53. Di Cera, E., S.J. Gill, and J. Wyman, Binding Capacity: and buffering in biopolymers. Proc Natl Acad Sci U S A 1988, 85, 449-452. 54. Wei, Y.; Ko, J.; Murga, L.; Ondrechen, M. J., Selective prediction of Interaction sites in protein structures with THEMATICS. BMC Bioinformatics 2007, 8, 119. 55. Yang, T. Statistical applications for structure-based protein function prediction. Northeastern University, Boston, 2007. 56. Joachims, T., Making large-Scale SVM Learning Practical. Advances in Kernel Methods. MIT-Press: Cambridge, MA, 1999.

163 57. Bartlett, G. J., C.T. Porter, N. Borkakoti, and J.M. Thornton, Analysis of Catalytic Residues in Enzyme Active Sites. J Mol Biol 2002, 324, 105-121. 58. Wei, Y. Computed Electrostatic Properties of Protein 3D Structure for Functional Annotation and Biomedical Application. Northeastern University, Boston, 2007. 59. Patterson, W. R.; Poulos, T., Crystal Structure of Recombinant Pea Cytosolic . Biochemistry 1995, 34, (13), 4331-4341. 60. Edwards, S. L.; Xuong, N. h.; Hamlin, R. C.; Kraut, J., Crystal Structure of Cytochrome c Peroxidaes Compound I. Biochemistry 1987, 26, 1503-1511. 61. Gourley, D. G.; Shrive, A. K.; Polikarpov, I.; Krell, T.; Coggins, J. R.; Hawkins, A. R.; Isaacs, N. W.; Sawyer, L., The two types of 3-dehydroquinase have distinct structures but catalyze the same overall reaction. Nat Struct Biol. 1999, 6, 521-525. 62. Sobolev, V., A. Sorokine, J. Prilusky, E.E. Abola, and M. Edelman, Automated analysis of interatomic contacts in proteins. Bioinformatics 1999, 15, 327-332. 63. Moser, J.; Gerstel, B.; Meyer, J. E.; Chakraborty, T.; Wehland, J.; Heinz, D. W., Crystal structure of the phosphatidylinositol-specific phospholipase C from the human pathogen Listeria monocytogenes. . J. Mol. Biol. 1997, 273, 269-282. 64. Bate, P., and J. Warwicker, Enzyme/non-enzyme discrimination and prediction of enzyme active site location using charge-based methods. J Mol Biol 2004, 340, 263-276. 65. Perlich, C.; Provost, F.; Simonoff, J., Tree induction vs. logistic regression: a learning- curve analysis. The Journal of Machine Learning Research 2003, 4, 211-255. 66. Domingos, P.; Pazzani, M., Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier. Machine Learning 1997, 29, 103-130. 67. Best, M. J.; Chakravarti, N., Active set algorithms for isotonic regression; a unifying framework. Mathematical Programming 1990, 47, 425–439. 68. Barlow, R. E.; Bartholomew, D. J.; Bremmer, J. M.; Brunk, H. D., Statistical Inference under Order Restrictions. Wiley: 1972. 69. Ramsay, J. O., Estimating smooth monotone functions. Journal of the Royal Statistical Society, Series B 1998, 60, 365-375. 70. Bradley, A. P., The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition 1997, 30, (7), 1145-1159. 71. Wilcoxon, F., Individual comparisons by ranking methods. Biometrics 1945, 1, 80-83. 72. Madura, J. D., J.M. Briggs, R.C. Wade, M.E. Davis, B.A. Luty, A. Ilin, J. Antosiewicz, M.K. Gilson, B. Bagheri, L.R. Scott, & J.A. McCammon, Electrostatics and diffusion of molecules in solution - Simulations with the University of Houston Brownian Dynamics program. Comp Phys Commun 1995, 91, 57-95. 73. Gilson, M. K., Multiple-site titration and molecular modeling: two rapid methods for computing energies and forces for ionizable groups in proteins. Proteins 1993, 15, (3), 266-82. 74. Porter, C. T.; Bartlett, G. J.; Thornton, J. M., The Catalytic Site Atlas: a resource of catalytic sites and residues identified in enzymes using structural data. . Nucl Acids Res 2004, 32, D129-133. 75. Edgar, R. C., MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research. 2004, 32, 1792-1797. 76. Kinoshita, K.; Ota, M., P-cats: prediction of catalytic residues in proteins from their tertiary structures Bioinformatics 2005, 21, 3570-3571.
