
SVM and a Novel POOL Method Coupled with THEMATICS for

Protein Prediction

A DISSERTATION

SUBMITTED TO THE COLLEGE OF COMPUTER AND INFORMATION SCIENCE

OF NORTHEASTERN UNIVERSITY

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS

FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY

By

Wenxu Tong

April 2008

©Wenxu Tong, 2008

ALL RIGHTS RESERVED

Acknowledgements

I have many people to thank. I am most indebted to my advisor, Dr. Ron Williams. He generously allowed me to work on a problem with an idea he developed decades ago and guided me through the research to turn it from a mere idea into a useful system for solving important problems. Without his kindness, wisdom and persistence, this dissertation would not exist.

Another person I am fortunate to have met and worked with is Dr. Mary Jo Ondrechen, my co-advisor. It was she who developed the THEMATICS method on which this dissertation builds. Her guidance and help during my research were critical to me.

I am grateful to all the committee members for reading and commenting on my dissertation. In particular, I am thankful to Dr. Jay Aslam, who provided a great deal of advice for my research, and Dr. Bob Futrelle, who brought me into the field and provided much help in the writing of this dissertation. I also thank Dr. Budil for the time he spent serving on my committee, especially given the difficulties and inconvenience he unfortunately experienced during that time.

I am fortunate to have worked in the THEMATICS group with Dr. Leo Murga, Dr. Ying Wei and my fellow graduate student, soon to be Dr. Heather Brodkin. I also thank Dr. Jun Gong, Dr. Emine Yilmaz and soon-to-be Dr. Virgiliu Pavlu; without their generous help, my journey through the tunnel toward my degree would have been much darker and harder.

I would not be who I am without the love and support of my parents, Yunkun Tong and Xiazhu Wang. Thanks to their raising us and giving us the best education possible through all the hardship they endured, both my sister, Wenyi Tong, and I have received Ph.D. degrees, the highest degree one can earn. I am so grateful to them and proud of them.

Last but definitely not least, I thank Ying Yang, my beloved wife. Without her patience and confidence in me, I cannot imagine having done what I have done. I humbly dedicate this dissertation to her.

Table of Contents

Abstract ...... 12

1 Introduction ...... 13

1.1 THEMATICS and protein active site prediction...... 14

1.2 Machine Learning ...... 15

2. Background and Related Work ...... 18

2.1 Protein Active Site Prediction...... 18

2.2 Machine Learning ...... 20

2.2.1 Commonly used supervised learning methods ...... 20

2.2.2 Probability based approach...... 22

2.2.3 Performance measure for classification problems ...... 23

2.3 THEMATICS ...... 27

2.3.1 The THEMATICS method and its features ...... 27

2.3.2 Statistical analysis with THEMATICS...... 32

2.3.3 Challenges of the site prediction problem using THEMATICS data ...... 34

3.Applying SVM to THEMATICS...... 37

3.1 Introduction...... 38

3.2 THEMATICS curve features used in the SVM...... 38

3.3 Training ...... 40

3.4 Results...... 41

3.4.1 Success in site prediction...... 42

3.4.2 Success in catalytic residue prediction...... 42

3.4.3 Incorporation of non-ionizable residues...... 43

3.4.4 Comparison with other methods...... 49

3.5 Discussion ...... 52

3.5.1 Cluster number and size ...... 52

3.5.2 Failure analysis...... 53

3.5.3 Analysis of high filtration ratio cases...... 53

3.5.4 Some specific examples ...... 56

3.6 Conclusions...... 58

3.7 Next step ...... 58

4. New Method: Partial Order Optimal Likelihood (POOL)...... 61

4.1 Ways to estimate class probabilities...... 61

4.1.1 Simple joint probability table look-up...... 61

4.1.2 Naïve Bayes method ...... 62

4.1.3 The K-nearest-neighbor method...... 63

4.1.4 POOL method ...... 63

4.1.5 Combining CPE's...... 64

4.2 POOL method in detail ...... 65

4.2.1 Maximum likelihood problem with monotonicity assumption ...... 65

4.2.2 Convex optimization and K.K.T. conditions ...... 67

4.2.3 Finding Minimum Sum of Squared Error (SSE) ...... 69

4.2.4 POOL algorithm ...... 71

4.2.5 Proof that the POOL algorithm finds the minimum SSE...... 73

4.2.6 Maximum likelihood vs. minimum SSE...... 80

4.3 Additional computational steps...... 86

4.3.1 Preprocessing ...... 86

4.3.2 Interpolation...... 87

5. Applying the POOL Method with THEMATICS in Protein Active Site Prediction ...... 88

5.1 Introduction...... 89

5.2 THEMATICS curves and other features used in the POOL method ...... 90

5.3 Performance measurement ...... 94

5.4 Computational procedure ...... 96

5.5 Results...... 99

5.5.1 Ionizable residues using only THEMATICS features...... 99

5.5.2 Ionizable residues using THEMATICS plus cleft information...... 102

5.5.3 All residues using THEMATICS plus cleft information...... 107

5.5.4 All residues using THEMATICS, cleft information and sequence conservation, if applicable...... 110

5.5.5 Recall-filtration ratio curves...... 115

5.5.6 Comparison with other methods...... 117

5.5.7 Rank of the first positive...... 124

5.6 Discussion ...... 126

6. Summary and Conclusions...... 130

6.1 Contributions...... 132

6.2 Future research ...... 134

Appendices ...... 136

Appendix A. The training set used in THEMATICS-SVM ...... 136

Appendix B. The test set used in THEMATICS-SVM...... 138

Appendix C. The 64 protein testing set used in THEMATICS-POOL...... 147

Appendix D. The 160 protein testing set used in THEMATICS-POOL...... 151

Bibliography...... 161

List of Figures

Figure 2.1 Titration curves...... 30

Figure 3.1 The success rate for site prediction on a per-protein basis ...... 46

Figure 3.2 Distribution of the 64 proteins across different values for the filtration ratio...... 48

Figure 3.3 Recall-false positive rate plot (ROC curves) of SVM versus other methods...... 51

Figure 3.4 SVM prediction for protein1QFE...... 56

Figure 3.5 The SVM prediction for 2PLC...... 57

Figure 4.1 Three cases of G in relation to the convex cone of constraints...... 76

Figure 5.1 Averaged ROC curve comparing POOL(T4), Wei’s statistical analysis and Tong’s SVM using THEMATICS features...... 101

Figure 5.2 Averaged ROC curves comparing different methods of predicting ionizable active site residues using a combination of THEMATICS and geometric features of ionizable residues only...... 103

Figure 5.3 Averaged ROC curve comparing POOL methods applied to ionizable residues only CHAIN(TION, G) and to all residues CHAIN(TALL, G)...... 109

Figure 5.4 Averaged ROC curves comparing different methods of combining THEMATICS, geometric and sequence conservation features of all residues...... 113

Figure 5.5 Averaged RFR curve for CHAIN(T, G, C) on the 160 protein test set...... 116

Figure 5.6 ROC curves comparing CHAIN(T, G), CHAIN(T, G, C) and Petrova’s method...... 122

Figure 5.7 Histogram of the first annotated active site residue...... 125

List of Tables

Table 2.1 Confusion matrix of classification labeling...... 25

Table 3.1 Performance of the SVM predictions alone versus the SVM regional predictions that include all residues within a 6Å sphere of each SVM-predicted residue...... 44

Table 3.2 Comparison of THEMATICS-SVM and other methods...... 50

Table 5.1 Wilcoxon signed-rank tests between methods shown in figure 5.2 ...... 105

Table 5.2 Wilcoxon signed-rank tests between methods shown in figure 5.4 ...... 114

Table 5.3 Comparison of sensitivity, precision, and AUC of CHAIN(T, G, C) with Youn’s reported results for proteins in the same family, super family, and fold...... 120

Table 5.4 Comparison of CHAIN(T, G) and CHAIN(T, G, C) with Petrova’s method. ....121

Table 5.5 Comparison of CHAIN(T, G) and CHAIN(T, G, C) with Xie’s method...... 123

List of Abbreviations

ANN  Artificial Neural Network
ASA  Area of Solvent Accessibility
AUC  Area Under the Curve
AveS  Averaged Specificity
CPE  Class Probability Estimator
CSA  Catalytic Site Atlas
E.C. number  Enzyme Commission number
H-H equation  Henderson-Hasselbalch equation
K.K.T. conditions  Karush-Kuhn-Tucker conditions
k-NN  k-Nearest Neighbor method
MAP  Maximum a posteriori
MAS  Mean Average Specificity
MCC  Matthews Correlation Coefficient
ML  Maximum Likelihood
PDB  Protein Data Bank
POOL  Partial Order Optimal Likelihood
RFR curve  Recall-Filtration Ratio curve
ROC curve  Receiver Operating Characteristic curve
SSE  Sum of Squared Errors
SVM  Support Vector Machine
THEMATICS  Theoretical Microscopic Titration Curves
VC dimension  Vapnik-Chervonenkis dimension

Abstract

Protein active site prediction is a very important problem in bioinformatics. THEMATICS is a simple and effective method, based on the special electrostatic properties of ionizable residues, for predicting such sites from protein three-dimensional structure alone. The process involves distinguishing computed titration curves with perturbed shape from normal ones; the differences are subtle in many cases. In this dissertation, I develop and apply special machine learning techniques to automate the process and achieve higher sensitivity than other methods while maintaining high specificity. I first present the application of support vector machines (SVMs) to automate active site prediction using THEMATICS; at the time this work was developed, it achieved better performance than any other 3D-structure-based method. I then present the more recently developed Partial Order Optimal Likelihood (POOL) method, which estimates the probabilities of residues being active under certain natural monotonicity assumptions.

The dissertation shows that applying the POOL method just on THEMATICS features outperforms the

SVM results. Furthermore, since the overall approach is based on estimating certain probabilities from labeled training data, it provides a principled way to combine the use of THEMATICS features with other non-electrostatic features proposed by others. In particular, I consider the use of geometric features as well, and the resulting classifiers are the best structure-only predictors yet found. Finally, I show that adding in sequence-based conservation scores, where applicable, yields a method that outperforms all existing methods while using only whatever combination of structure-based or sequence-based features is available.

Chapter 1

Introduction

This dissertation employs both standard and novel machine learning techniques to automate one aspect of

the problem of protein function prediction from the three-dimensional structure. In addition to applying

established techniques, particularly the support vector machine (SVM), I introduce a novel method, called partial order optimal likelihood (POOL), to perform the task of selection of functionally important sites in protein structures. In my approach to the protein function prediction problem, I start with just the 3D structure of proteins and use THEMATICS, one of the most effective methods which focuses on electrostatic features of residues. Later, I also add some geometric features, and then the conservation of residues among homologous sequences, when available, into our system to achieve better results.

1.1 THEMATICS and protein active site prediction.

Function prediction (predicting protein function from protein structure) is an important and challenging

task in genomics and proteomics 1. More and more protein structures have been deposited in the PDB

(Protein Data Bank) database, many with unknown functions. As of this writing, there are over 3600 protein structures in the PDB of unknown or uncertain function. The recent development of generating structures from proteins expressed from gene sequences using high throughput methods 2-6 only makes

effective and efficient function prediction even more important, as most of these Structural Genomics

proteins are of unknown function.

Determination of active sites, including enzyme catalytic sites, ligand binding sites, recognition epitopes,

and other functionally important sites is one of the keys to protein function prediction.

In addition, the importance of site prediction goes beyond predicting active sites for proteins with

unknown function. Even for a protein with known function, it is not necessarily true that the active site of

that protein is fully or partially characterized. Correctly finding the active site of a protein is always a

prerequisite to understanding the protein’s catalytic mechanism. It also opens the door to the design of

ligands to inhibit, activate, or otherwise modify the protein’s function. Protein engineering applications to

design a protein of particular functions 7-9 also require knowledge of the proper features needed to create a

functioning active site.

Because of its importance in genomics and proteomics, many different methods have been developed to

predict the active site of a protein 10-21. We will survey some of these methods in a later section. But among them, there is one particular method, namely THEMATICS (Theoretical Microscopic Titration

Curves), which is powerful, accurate and precise 16, 22-24. Based on protein 3D structure alone, it can predict correct active sites that are highly localized in small regions of the proteins’ structures.

The details of the THEMATICS method will be given later, but the key point of this method is that it takes advantage of the special chemical and electrostatic properties of active site residues, since active site residues tend to have anomalous titration behavior; THEMATICS generates the titration curves of ionizable residues of a protein. In its original formulation the presence of two or more residues with perturbed titration curves in physical proximity is considered a reliable predictor of the active sites for proteins.

For the THEMATICS method to work well, one needs a criterion to distinguish the perturbed titration curves from normal, unperturbed ones, which is not a trivial task. My work uses machine-learning technology to automate this process. With the SVM, I approached this problem as a classification task, predicting each residue as either an active site residue or not. Later, I developed the

POOL method to solve this problem by rank-ordering the residues in a protein according to their probability of being in an active site, based on how perturbed their titration curves are in addition to some other 3D-structure-based information. Later still, sequence conservation information, if available, was added as well.

1.2 Machine Learning.

Machine learning is a well-developed field in computer science. There are many types of tasks, ranging

from reinforcement learning 25 to more passive forms of learning, like supervised learning and unsupervised learning 26. A typical supervised learning task is to learn a function or classifier from a set of training data. If the output of the function is a label from some finite set of classes, it is called a classification problem. If the output is a continuous value, it is called a regression problem. A training set

consists of a set of training examples, i.e. pairs of input-output vectors. The machine-learning problem is typically an optimization problem: generate a function that, given a valid input, produces an output that generalizes from the seen training data in a “reasonable” way. The quantity to be kept small is the generalization error, which is the error a trained machine will make on unseen data drawn from the same distribution as the population. For an unsupervised learning task, there is no labeled training data. Usually, the task in unsupervised learning is to cluster the observed data according to some criteria, or to fit a model to represent the observed data. I will focus on supervised learning; here my learning task is essentially a classification task. As will be described below, in one part of this work the goal will be to estimate actual class probabilities, which in some respects is like a regression problem.

Chapter 2

Background and Related Work


2.1 Protein Active Site Prediction.

Since the main focus of this dissertation is in Computer Science, I will just briefly survey some of the

methods used for this application, to serve as a background for the method comparison later in my

dissertation.

There are two major classes of methodology used to predict protein active sites. Almost all current

methods in active site prediction use one of them or a mix of both approaches.

The first methodology is based on sequence comparison, or evolutionary information derived from

sequence alignments. The rationale is that active sites of a protein are important regions in the proteins,

and that the amino acids, termed residues, in active sites therefore should be more conserved throughout

evolution than some other regions of the protein. If we can find highly conserved regions among

sequences in similar proteins from different sources (species/tissues), or even in different proteins but

with similar functions, most likely active sites should consist of subsets within these regions. This is a

valid assumption and indeed, many methods have been developed based on this approach, such as

ConSurf 27, Rate4Site 28, and others 29-32.

However there are two drawbacks to this approach.

First, in order to use this method, there have to be at least 10, and preferably 50, different protein

sequences with certain degrees of similarity in order to get reliable results. The method does not work

well if the similarities between sequences are either too high or too low. There are studies showing that

sequence-based methods can transfer reliably the extracted functional information only when applied to

sequences with sequence identity of at least 40% 33, 34. This drawback makes the method unsuitable for many proteins, particularly Structural Genomics proteins, since they often do not have enough similar sequences within a suitable range of similarities.

Second, although most active site residues tend to be conserved through evolution, it is certainly not true

that all conserved regions of a protein are active sites. Residues in protein sequences can be conserved

for a variety of reasons, not just because of involvement in active sites. One well-known counterexample

is the set of residues that stabilize the structure of the protein; they are so important to the protein that

once mutated, the protein will not have the proper structure to perform its function. These residues will

be conserved among different protein homologues, even if they are not active sites. Therefore typically

sites predicted from sequence based methods are non-local, spanning a much larger area than the true

active site. Another difficulty arises for cases where an active site region in a protein is less conserved

than other regions of the protein, especially when the function and/or substrate of the proteins in the class

are somewhat versatile.

The second methodology is structure-based active site prediction. There are different properties that have

been studied and used in different methods, such as electrostatic properties as in THEMATICS 16,

residue interaction as in the graph theoretic method SARIG 35, van der Waals binding energy of a probe

molecule as in Q-site Finder 20, geometric cleft location as in surfNet 36 and castP 37, and a geometric shape descriptor termed geometric potential 38.

There are also studies that combine results from different methods, employing either statistical or

machine-learning techniques. Among all such studies, I list a few examples that either use similar residue

properties or similar machine learning methods as I used in my earlier and current study. P-cats 21 uses a k-nearest-neighbor method to smooth the joint probability lookup table; a study by Gutteridge uses a neural network and spatial clustering to predict the location of active sites 18; Petrova’s work uses a

support vector machine (SVM) to predict catalytic residues 39; and Youn’s work uses a support vector machine to predict catalytic residues in proteins 40. All these methods use both sequence conservation and

3D structural information.

Depending on the properties that these methods are based on, the computational cost and accuracy varies.

Among all these methods, THEMATICS is the most accurate to date. The computational cost is acceptable. To analyze a typical protein using THEMATICS takes less than an hour on a desktop PC, although actual CPU times depend on protein size. The details of this method will be explained in a later section.

Although THEMATICS is the most effective and accurate method among these when used on its own, it is natural to consider whether the predictions can be improved by using additional information. I examine this using both geometric and conservation information, and find that this is indeed the case.

2.2 Machine Learning.

Machine learning is a very broad subfield of artificial intelligence. It is almost impossible to survey the whole area in this dissertation. Here, I focus on just supervised learning, mostly classification. Even this area is still too broad, and I will briefly introduce the framework and some of the most commonly used methods with their basic principles.

2.2.1 Commonly used supervised learning methods.

The first method is called the artificial neural network (ANN) or just neural network (NN) 26, 41. It is based on a computational model of a group of interconnected nodes (neurons), akin to the nervous system in humans. Each neuron has a certain number of inputs and typically one output. The input to a neuron can be either the features of the input data or the outputs of other neurons. The output of one neuron can serve as input to multiple neurons. Typically, there is a weight associated with each input of each neuron. At processing time (classifying a query instance), each neuron in the network takes its inputs, computes their weighted sum using the associated weights, and generates its output by applying some nonlinear function f. There are different flavors of ANN structure, such as feedforward versus recurrent networks. During training time, a cost function is defined to estimate the accuracy of the ANN with respect to the data, essentially a measurement of how much error the current ANN makes on the training data. The learning process is to find the optimal setting of the structure and/or weights of the

ANN to minimize the cost function on the training data. ANN in general is a very powerful method, and it has been used in numerous applications, including protein active site prediction 18. One drawback is that ANN is a somewhat black-box method, meaning that although one may find a very good classifier, the structure of the network and the weights associated with each input may not reveal much useful information about why it works.
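For concreteness, here is a minimal sketch of the forward computation of a single neuron as described above; the logistic sigmoid is just one common choice for the nonlinear function f, not a claim about any particular network used in the cited studies.

```python
import numpy as np

def neuron_output(inputs, weights, bias):
    """One artificial neuron: weighted sum of the inputs followed by a
    nonlinear activation f (here the logistic sigmoid)."""
    z = np.dot(weights, inputs) + bias
    return 1.0 / (1.0 + np.exp(-z))

# Example: a neuron with three inputs.
print(neuron_output(np.array([0.2, -1.0, 0.5]),
                    np.array([1.5, 0.3, -0.8]),
                    bias=0.1))
```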

Another popular and intuitively appealing method is the nearest neighbor method, or its more general form, the k-nearest-neighbor method (k-NN) 42. The principal idea is to classify the query instance based on its k nearest neighbors among the training set, which represent the most “similar” training instances.

The success of this method relies on several choices; k and the distance function that defines what

“similar” means are among the most important ones. The training or learning process of this method is somewhat different from most other machine learning techniques. In most cases, instead of solving an optimization problem, it uses cross validation directly to select the best k and best distance function.

However, this method and the naïve Bayes method, which will be discussed later, are both susceptible to the presence of correlated features. The method is also susceptible to the presence of noisy and irrelevant features.
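A minimal k-NN sketch under these choices follows; Euclidean distance stands in for the "similarity" measure, and both k and the distance function are assumptions that would normally be chosen by cross validation as noted above.

```python
import numpy as np
from collections import Counter

def knn_classify(query, X_train, y_train, k=5):
    """Classify `query` by majority vote among its k nearest training
    instances under Euclidean distance."""
    dists = np.linalg.norm(X_train - query, axis=1)   # distance to every training instance
    nearest = np.argsort(dists)[:k]                   # indices of the k closest
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]
```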

The support vector machine (SVM) 43, 44 is a relatively recently developed method. A one-sentence description of this method is that it uses the “kernel trick” to find the best linear separator (hyperplane) in kernel space for instances which are not linearly separable in feature space. There are two major advantages of SVM. First, unlike ANN or some other classification techniques, the linear separator the SVM finds is not merely one that successfully separates the two classes (in the hard-margin case) or that makes the “fewest” errors (in the soft-margin case), but the best among all such separators. The reason it finds the “best” among all possible good separators is that it maximizes the margin, which measures how “far” the separator can move without making more mistakes on the training data, and there is a rigorous proof that a classifier giving the maximum margin tends to make the fewest errors on the testing data. The second advantage is that the “kernel trick” maps instances in the original feature space to instances in the kernel space; in certain cases instances that are linearly inseparable in feature space become linearly separable in kernel space, the kernel transform is easy to compute, and the explicit form of the mapping function between the two spaces is not required to be known. To successfully use this method,

one needs to select the right kernel. There are commonly used kernels, but to take full advantage of using

and developing kernels is not a trivial task. I have applied this method to protein active site prediction

with success 45.

Last, I will mention boosting, another method developed quite recently which has achieved a lot of

success 46, 47. Boosting is a meta-learning algorithm. Boosting occurs in stages, by incrementally adding

to the current learned function. At every stage, a weak learner (i.e., one that has accuracy greater than

chance) is trained with the data. The output of the weak learner is then added to the learned function, with

some strength (proportional to how accurate the weak learner is). Then, the data is re-weighted: examples

that the current learned function gets wrong are "boosted" in importance, so that future weak learners will

attempt to fix the errors. If every weak learner is guaranteed to perform better than random guessing, boosting can quickly find a learned function whose training error falls below any preset threshold. It is a very powerful method for combining different learners into one “super system”.

There is also a well-developed mathematical theory showing that boosting can lower generalization error.
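The reweighting scheme described above can be sketched as a generic AdaBoost-style loop. This is an illustrative sketch only, not the implementation of any cited study; `train_weak` is a hypothetical helper returning any classifier with better-than-chance weighted accuracy, and labels are assumed to be in {-1, +1}.

```python
import numpy as np

def adaboost(X, y, train_weak, n_rounds=10):
    """Minimal AdaBoost sketch: y in {-1, +1}; train_weak(X, y, w) must
    return a classifier h whose h.predict(X) beats chance under weights w."""
    n = len(y)
    w = np.full(n, 1.0 / n)                 # example weights, initially uniform
    ensemble = []                           # list of (alpha, weak_learner) pairs
    for _ in range(n_rounds):
        h = train_weak(X, y, w)
        pred = h.predict(X)
        err = np.sum(w * (pred != y)) / np.sum(w)
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))   # learner strength
        w *= np.exp(-alpha * y * pred)      # "boost" the misclassified examples
        w /= w.sum()
        ensemble.append((alpha, h))
    return ensemble

def predict(ensemble, X):
    score = sum(a * h.predict(X) for a, h in ensemble)
    return np.sign(score)
```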

2.2.2 Probability based approach.

A lot of machine learning work overlaps with statistics, where probability is used to classify instances

directly. The first idea introduced here is Bayesian inference, which is based on Bayes’ theorem 26. Bayes’ theorem may be written as:

P(A | B) = P(B | A) ∗ P(A) / P(B)

The probability of an event A conditional on another event B is generally different from the probability of

B conditional on A. However, there is a definite relationship between the two, and Bayes' theorem is the

statement of that relationship. P(A) and P(B) are called prior probabilities, and P(A|B) is called the conditional probability of A given B.

Bayes’ theorem is important to a number of applications, including several different places in the present problem. One way to formulate our problem is to generate the hypothesis H that gives the probability that a residue with certain features is in the active site, based on the observed data (D), or training examples, i.e., P(H|D). Take H here to be a look-up table giving the probability of positive for different feature values x. To use such an H at query time, just go to the entry that matches x and read off the corresponding probability. Usually the number of ways to fill out the look-up table H is infinite, so how should one choose? One simple answer is to choose the H that gives the largest P(D|H); this is called the maximum likelihood (ML) hypothesis. Taking P(H), the prior probabilities of H, into consideration, one could pick the H that gives the largest P(D|H)P(H); this gives the maximum a posteriori (MAP) hypothesis. Notice that ML is equivalent to MAP with a “flat prior”, i.e., when the prior probabilities of all hypotheses under consideration are the same. The POOL method I introduce below for active site prediction is MAP with all hypotheses satisfying the monotonicity constraints having a flat prior, while all other hypotheses have a prior probability of 0. Both the ML and MAP methods pick out the “most favorable” hypothesis H* out of all possible ones based on the data, and use it to predict the “most probable class” of new query data: given query data q, the probability that q is in class c is P(C=c | q, H*), and the objective is to find the particular c that gives the largest P(C=c | q, H*). There is another way to do prediction at query time, namely to consider the predictions from all possible hypotheses and sum the results with the probability of each hypothesis as weights: P(C=c|q, D) = ∑P(C=c|q, Hi, D)*P(Hi|D). This is called the Bayes classifier, which gives an optimal result, but is often difficult to compute in practice.
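As a toy illustration of the look-up-table hypothesis discussed above (not the POOL construction itself, which is developed in Chapter 4): for this hypothesis class, the H that maximizes P(D|H) simply records, in each bin of the feature value, the empirical fraction of positive training examples. The binned feature values below are hypothetical.

```python
from collections import defaultdict

def ml_lookup_table(features, labels):
    """ML estimate of P(positive | binned feature value): the likelihood of
    the data is maximized by the empirical positive fraction in each bin."""
    counts = defaultdict(lambda: [0, 0])      # bin -> [positives, total]
    for x, y in zip(features, labels):
        counts[x][0] += int(y == 1)
        counts[x][1] += 1
    return {b: pos / tot for b, (pos, tot) in counts.items()}

# Query time: read off P(positive | x) for the bin the query falls in.
table = ml_lookup_table([0, 0, 1, 1, 1, 2], [0, 0, 0, 1, 1, 1])
print(table)   # {0: 0.0, 1: 0.666..., 2: 1.0}
```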

2.2.3 Performance measure for classification problems.

In the context of binary classification tasks, the terms true positive, true negative, false positive and false negative are used to compare the classification given to an item (the class label assigned to the item by a classifier) with the desired correct classification (the class the item actually belongs to). This is illustrated by the confusion matrix below:


                                 Predicted classification
                                 Positive                 Negative
Actual           Positive        True Positive (TP)       False Negative (FN)
classification   Negative        False Positive (FP)      True Negative (TN)

Table 2.1. Confusion matrix of classification labeling.


In the confusion matrix above, TP, FP, FN and TN are the numbers of true positive, false positive, false negative and true negative instances, respectively. Although all the information about a classifier’s performance is contained in the confusion matrix, people tend to use other measurements derived from it to compare the performance of classifiers. Some commonly used ones are listed below:

recall = sensitivity = TP / (TP + FN)

specificity = TN / (FP + TN)

precision = positive predictive value = TP / (TP + FP)

negative predictive value = TN / (TN + FN)

false positive rate = 1 − specificity = FP / (TN + FP)

accuracy = (TP + TN) / (TP + TN + FP + FN)

error = 1 − accuracy = (FP + FN) / (TP + TN + FP + FN)

filtration ratio = (TP + FP) / (TP + TN + FP + FN)

MCC = (TP ∗ TN − FP ∗ FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))

Most of the measurements above are very straightforward and can be treated as mere definitions with no need for further explanation, with the exception of the filtration ratio and the Matthews correlation

coefficient (MCC). I invented the term filtration ratio, to be used in place of precision and false positive rate in the present problem. One of the advantages of the filtration ratio is that it is the only measurement listed above that can be determined from the predicted classification alone, without knowledge of the actual classification, which is always unknown in practice. Another reason I invented and use the filtration ratio is that in the present problem, the information available in the literature about actual positives is incomplete, even in our training and testing datasets. Thus a certain fraction of the nominal false positives probably are not false. In situations like ours, where one expects true positives to represent a fairly small proportion of all the instances and the measured false positive rate is suspect, using the filtration ratio is more appropriate than some other measurements, such as precision. Both of these issues will be discussed in more detail later in the dissertation. On the other hand, the MCC is widely used in machine learning as a measure of the quality of binary classifications. It takes into account both sensitivity and selectivity and is generally regarded as a balanced measure which can be used even if the classes are of very different sizes. It returns a value between -1 and +1: a coefficient of +1 represents a perfect prediction, 0 an average random prediction and -1 the worst possible prediction. While there is no perfect way of describing the confusion matrix of true and false positives and negatives by a single number, the MCC is generally regarded as one of the best such measures. In the denominator of the

MCC, if any of the four sums is zero, the denominator can arbitrarily be set to one; this results in an MCC of zero, which can be shown to be the correct limiting value 48.
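The following sketch computes the measurements listed above directly from the four confusion-matrix counts, including the zero-denominator convention for the MCC; it assumes the other individual denominators are nonzero.

```python
import math

def classification_measures(tp, fp, fn, tn):
    """Compute the standard measures from confusion-matrix counts."""
    total = tp + fp + fn + tn
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "recall / sensitivity":      tp / (tp + fn),
        "specificity":               tn / (fp + tn),
        "precision":                 tp / (tp + fp),
        "negative predictive value": tn / (tn + fn),
        "false positive rate":       fp / (tn + fp),
        "accuracy":                  (tp + tn) / total,
        "error":                     (fp + fn) / total,
        "filtration ratio":          (tp + fp) / total,
        # MCC convention: if any sum in the denominator is zero, set the
        # denominator to one, giving the limiting value of zero.
        "MCC": (tp * tn - fp * fn) / (denom if denom != 0 else 1.0),
    }
```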

2.3 THEMATICS

I am going to discuss the THEMATICS method in more detail because this is the basis for most of the input data in the work of this dissertation.

2.3.1 The THEMATICS method and its features.

In the application of THEMATICS, one begins with the 3D structure of the query protein, solves the

Poisson-Boltzmann (P-B) equations using well-established methods 49-52, and then performs a Monte

Carlo procedure to compute the proton occupation of each ionizable residue as a function of pH.

Each such function is called a titration curve, as shown in figure 2.1 (a).

From the theoretical titration curves computed from the 3D structure of a query protein, THEMATICS identifies residues (amino acids) that exhibit significant deviation from Henderson-Hasselbalch (H-H) behavior, which I now describe.

A typical ionizable residue in a protein obeys the H-H equation, which may be expressed as a proton occupation O as a function of pH as:

O(pH) = (10^(pH − pKa) + 1)^(−1)    (1)

For the residues that form a cation upon protonation (Arg, His, Lys, and the N-terminus), the mean net charge C on a particular residue is equal to O, whereas for the residues that form an anion upon deprotonation (Asp, Cys, Glu, Tyr, and the C-terminus), the mean net charge is given by (O − 1), as:

C( pH ) = O( pH ) cationic (2)

C( pH ) = O( pH) −1 anionic (3)

Note that C represents the average net charge on a particular residue for a large ensemble of protein molecules. Equations (1) - (3) have the sigmoid shape that is typical of a weak acid or base that obeys the

H-H equation. Thus, as pH increases, the predicted average charge falls sharply in a pH range close to the pKa, which is defined as the pH at which that residue is protonated in exactly half of the protein molecules in the ensemble.
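For reference, a short sketch of equations (1)–(3) follows, giving the ideal H-H proton occupation and mean net charge over a pH range. This is illustrative only; the actual THEMATICS curves come from the Poisson-Boltzmann and Monte Carlo calculations described above, not from this closed form.

```python
import numpy as np

def hh_occupation(ph, pka):
    """Equation (1): ideal Henderson-Hasselbalch proton occupation."""
    return 1.0 / (10.0 ** (ph - pka) + 1.0)

def hh_mean_charge(ph, pka, cationic):
    """Equations (2)-(3): mean net charge for cationic (Arg, His, Lys,
    N-terminus) vs. anionic (Asp, Cys, Glu, Tyr, C-terminus) residues."""
    occ = hh_occupation(ph, pka)
    return occ if cationic else occ - 1.0

ph = np.linspace(0, 14, 141)
charge_his = hh_mean_charge(ph, pka=6.0, cationic=True)    # falls from +1 toward 0
charge_glu = hh_mean_charge(ph, pka=4.4, cationic=False)   # falls from 0 toward -1
```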

Underlying the THEMATICS approach is the observation that the computed titration curves tend to deviate more from this H-H shape for ionizable residues belonging to active sites than for ionizable residues not belonging to such sites. The key step in the application of the THEMATICS approach is thus recognizing significant deviation from H-H behavior in the shape of these predicted titration curves.

When THEMATICS was first developed, visual inspection of the computed curves was used to identify

THEMATICS positive residues. Although simple, it is inefficient, vulnerable to bias, and in some cases ineffective, since some deviations of the curves are subtle and not easily recognized visually, as indicated by figure 2.1(b).



Figure 2.1. Titration curves. (a) A standard HH curve (black), a typical perturbed curve (red), and a typical unperturbed curve from residues not in active site (blue). (b) Titration curves from active site residues (red) versus non-active-site residues (green) from a set of 20 proteins (Appendix A); only glutamate residues are shown.

Before introducing the methods for automation of the classification of residues using THEMATICS, I will first present the feature extraction process.

In order to perform any machine-learning or statistical analysis on titration curves, one needs to find features that are easy to compute and effective at distinguishing positive from negative instances.

I have defined features that may be used to measure the deviation of a particular titration curve from H-H behavior. In particular, four features extracted from the titration curves are most useful in separating

THEMATICS positive residues from the others.

These four features are based on the first four moments of the derivatives of the titration curves, as I now describe briefly. A more detailed description can be found in Ko’s study 24.

Define the variable x to be the offset of the pH from the pKa, as:

x = pH − pKa ,    (4)

Then equation (1) for Henderson-Hasselbalch titration curves becomes:

O(x) = (10^x + 1)^(−1) .    (5)

The key observation on which the moment analysis is based is that, for any titration curve O(x), whether of Henderson-Hasselbalch form or not, the corresponding derivative

f(pH) = −dO/dx = −dO/d(pH)    (6)

is effectively a probability density function (ignoring those rare cases when the titration curve fails to be a non-increasing function of x, in which case this derivative function takes on negative values) 53.

The nth moment of f is defined as

Mn = ∫ (pH)^n f(pH) d(pH)    (7)

and the corresponding nth central moment µn is

µn = ∫ (pH − M1)^n f(pH) d(pH)    (8)

where these integrals are over all space (−∞ to +∞).

The features I use are based on the first moment M1 and the second, third, and fourth central moments µ2,

µ3, and µ4, respectively, of the derivatives f. For a pure H-H titration curve these moments are

M1 = pKa , µ2 = 0.620, µ3 = 0, and µ4 = 1.62. (9)

It is interesting to note that, for an arbitrary probability density function, M1 is its mean and µ2 is its variance, while µ3 and µ4 are related to the skewness and kurtosis, respectively, standard quantities used in statistics to measure departure from normality.

When applied to a general titration curve of ionizable residues in a protein, the pKa shift is closely related to how much M1 differs from the free-solution pKa. Those residues that interact strongly with other ionizing residues in such a way that the predicted titration functions O(pH) are elongated will have broader first derivative functions f and thus have generally higher values for µ2 and especially µ4. The moment µ3 measures the asymmetry of the function f and has a nonzero value for any residue that interacts with other ionizing groups in such a way that the strength of this interaction in the range pH < pKa is different from that in the range pH > pKa.

Thus it is clear that the first moment and the second, third, and fourth central moments are useful measures for determining deviation from H-H behavior.
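A minimal numerical sketch of equations (6)–(8) is given below. It assumes the titration curve is sampled on a pH grid and is essentially non-increasing, so that f behaves as a probability density; on an ideal H-H curve it should approximately reproduce the values in equation (9).

```python
import numpy as np

def titration_moments(ph, occ):
    """Compute M1 and central moments mu2, mu3, mu4 of f = -dO/d(pH)
    from a sampled titration curve (equations (6)-(8))."""
    f = -np.gradient(occ, ph)              # derivative, eq. (6)
    f = np.clip(f, 0.0, None)              # ignore rare negative excursions
    f /= np.trapz(f, ph)                   # normalize to a density
    m1 = np.trapz(ph * f, ph)              # first moment, eq. (7)
    mu2, mu3, mu4 = (np.trapz((ph - m1) ** n * f, ph) for n in (2, 3, 4))
    return m1, mu2, mu3, mu4

# Sanity check on an ideal H-H curve with pKa = 7: expect roughly
# M1 = 7, mu2 = 0.620, mu3 = 0, mu4 = 1.62 (equation (9)).
ph = np.linspace(0, 14, 2801)
occ = 1.0 / (10.0 ** (ph - 7.0) + 1.0)
print(titration_moments(ph, occ))
```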

The methods introduced below all use some of the four features described above, with some additional features in some specific methods.

2.3.2 Statistical analysis with THEMATICS.

One automated analysis was proposed and studied by Ko 24. Ko introduced simple statistical metrics to automatically evaluate the degree of perturbation of a titration curve from H-H behavior. The method is simple, just looking at two of the above features, namely µ3 and µ4. A statistical Z-score was computed on

these features; i.e., for every curve analyzed, the deviations of the µ3 and µ4 values from the mean, expressed in units of the standard deviation, over all curves from the same protein were computed. Any residue whose titration curve had an absolute value of the Z-score of µ3 or of µ4 greater than

1.0 was classified as a THEMATICS positive residue and any such residues with at least one other

THEMATICS positive residue located within 9Å were reported as active site candidates.
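A sketch of the thresholding rule just described is shown below (standard Z-scores of µ3 and µ4 over all ionizable residues of one protein, cutoff 1.0); the subsequent 9Å spatial clustering step is omitted.

```python
import numpy as np

def ko_positive_residues(mu3, mu4, cutoff=1.0):
    """Flag residues whose |Z(mu3)| or |Z(mu4)| exceeds the cutoff, where the
    Z-scores are computed over all ionizable residues of a single protein."""
    mu3 = np.asarray(mu3, dtype=float)
    mu4 = np.asarray(mu4, dtype=float)
    z3 = (mu3 - mu3.mean()) / mu3.std()
    z4 = (mu4 - mu4.mean()) / mu4.std()
    return (np.abs(z3) > cutoff) | (np.abs(z4) > cutoff)
```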

Good results were obtained for the identification of active sites in a set of 44 proteins with Ko’s method.

Although this method has excellent recall of catalytic sites, identifying the correct catalytic site for 90% of the proteins, the recall rate is lower (about 50%) for the identification of catalytic residues. It is desirable to improve the catalytic residue recall rate and also to expand the method to include predictions of non-ionizable residues.

Wei studied Ko’s analysis and modified it 54, introducing a new fractional parameter, α, which typically runs between 0.95 and 1.0. In this method, the mean and the standard deviation of µ3 and µ4 are calculated from the titration curves from a portion of the residues in a protein, in contrast to the whole population in Ko’s method. The portion of the residues excluded from the sample are the residues with titration curves with µ3 and µ4 values in the highest (1-α) fraction. This modification did yield better recall of annotated catalytic residues than Ko’s analysis, but the optimal α is different for different proteins and was finally fixed at 0.99, the value that yields the best overall performance when averaged over the set of annotated proteins. The purpose of this α is to exclude residues with titration curves with the most extreme µ3 and µ4 values from influencing the mean and standard deviation of the population too much and thus yielding better statistics and slightly more reliable predictions.

Meanwhile, Yang developed another rule-based statistical analysis identifying THEMATICS positive residues 55. In addition to the four features described earlier, her method uses an additional feature, a value called the buffer range R, which measures the width of the pH range over which the residue is

partially ionized. Also, outliers are selected within each amino acid type, when possible, instead of over the entire set of ionizable residues. The performance of this method is a little better than that of Ko’s method.

The three statistics-based analyses listed above all employ handcrafted cutoff values to differentiate the positive from the negative instances. The study described in this dissertation begins with the hypothesis that a machine learning method can utilize similar sets of features, define a threshold in a systematic way, and achieve better performance in practice.

2.3.3 Challenges of the site prediction problem using THEMATICS data.

One of the challenges of the task to classify THEMATICS results arises from special characteristics of the training data set.

First, the vast majority of the residues in the training data are negative examples. Literature-confirmed active site residues typically constitute less than 3% of all residues. At the same time, the negative examples, which comprise most of the data in the training set, share some “common” property, while the relatively few positive examples exhibit “abnormal” behavior in varied ways. This is one of the key reasons that a simple outlier detection process like Ko’s analysis is quite successful at this problem. But it is not clear how such a method can incorporate additional non-THEMATICS features to possibly improve the active-site prediction.

Secondly, the nature of this problem limits the quality of the training data. The ultimate goal of the project is to predict the active sites of proteins using THEMATICS data; however, the absolute criterion for labeling a residue in a protein as active is that someone has done the experiment in the lab and published the result supporting the claim. There are databases collecting such annotations, and by no means are these annotations complete. There is another subtlety in that although THEMATICS positive residues have been shown to be a very reliable indicator of active sites, THEMATICS sometimes predicts additional nearby residues that are not annotated as active, including second shell residues. There is some evidence to support the hypothesis that these second shell residues may be important. Alternatively, they

may be affected by the special electrostatic field created by the nearby active site residues. The

THEMATICS positive residues in the second case may not be shown experimentally to be active site residues. Because residue activity is often measured in a kinetics experiment and a number of factors can sometimes cause large errors in these experiments, the training set inevitably contains some positive instances that are misclassified in the first place, or some instances that cannot be correctly distinguished by the model. In particular, there are most probably instances of true positives that are improperly annotated as negatives, simply because no experiments have been tried on the vast majority of residues.

In order to overcome these two obstacles, in my earlier work with neural networks and the SVM method, I “cleaned” the training set. Instead of using just literature-confirmed positive instances, I also labeled as positive the “apparent” THEMATICS positives near a known active site, even though they are not experimentally identified as active site residues. I also removed some of the isolated THEMATICS positive instances from the training set. Although this data cleaning did improve the results, it is ad hoc and lacks a systematic justification.

For any machine learning problem, if there is some prior belief, or bias, which turns out to be true, applying it should always help the performance.

After studying THEMATICS and its application in protein active site prediction, it would be fair to conclude the following THEMATICS principles as prior belief:

THEMATICS Principle 1: The more perturbed the titration curve is (relative to other titration curves in the same protein), the greater the probability that residue is in the active site.

THEMATICS Principle 2: The more perturbed the neighboring titration curves are (relative to other titration curves in the same protein), the greater the probability that residue is in the active site.

The ad hoc method used before implicitly cleaned the data based on the THEMATICS principles.

In addition to THEMATICS features, to which we can apply THEMATICS principles, there are some non-THEMATICS features having either positive or negative correlation to the probability that a residue

35 is located in the active site. Those features may not be a reliable indicator by themselves, but combined with THEMATICS methods, they may improve the overall prediction accuracy.

While there may be ways to enforce inductive bias in classifiers like neural networks and SVMs, I believe the most straightforward approach is instead to try to estimate P(class | attributes) nonparametrically, while enforcing these principles as constraints, as explained in Chapters 4 and 5.

Chapter 3

Applying SVM to THEMATICS

3.1 Introduction.

As discussed in earlier chapters, THEMATICS is a technique for the prediction of local interaction sites in a protein from its three-dimensional structure alone. Various approaches have been taken to automate and standardize the process, with varying sensitivity and specificity. Here, I will present my first work on this project, using a support vector machine with four features extracted from THEMATICS alone to predict the active sites of enzymes.

In this chapter it is shown how support vector machines (SVM’s) may be combined with THEMATICS to achieve a substantially higher recall rate for catalytic residues with only a small sacrifice in specificity when compared to Ko’s statistical analysis of THEMATICS 24. It is argued that clusters predicted by

THEMATICS-SVM are small, local networks of ionizable residues with strong coupling between their protonation events; these characteristics appear to be very common, perhaps nearly universal, in enzyme active sites. Performance of THEMATICS-SVM in active site prediction is compared with other 3D- structure-based methods, including THEMATICS combined with previous analyses and shown to return equal or better recall with generally higher specificity and lower filtration ratio. The high specificity and low filtration ratio translate to better quality, more localized, predictions.

This work builds on the prior work of Ko using variants of some of the same features that were found to be successful there, plus some additional features. Results of our method are presented for 60 different proteins. In this chapter, I also present a way to extend the method’s capabilities to the prediction of non-ionizable residues.

3.2 THEMATICS curve features used in the SVM

To use an SVM to classify residues as either likely or not likely to be in the active site, I represent the computed titration curves as points in a four-dimensional space. These four features are based on the first four moments of these curves, as described in section 2.3.1.

The four features, namely M1, µ2, µ3 and µ4, are conceptually similar to those described in Ko’s analysis 24, except that I slightly modified the normalization process to prevent both the sample mean and sample standard deviation from being too strongly influenced by extreme values. A more robust estimator than the standard Z-score is used to distinguish “typical” from “atypical” titration curves within a single protein. In my normalization, each of the four moments was normalized to its corresponding robust Z-score Z′, which is defined as its deviation from the median, divided by the normalized interquartile distance, the difference between the 75th percentile and 25th percentile values, for the corresponding feature across all ionizable residues in that protein. The normalization factor of 1.349 comes from the normal distribution with a mean of zero and a standard deviation of one. Thus for a given feature Φ, I define Z′ as:

Z′(Φ) = 1.349 ∗ (Φ − MEDIAN(Φ)) / (PERCENTILE(Φ, 0.75) − PERCENTILE(Φ, 0.25))    (10)

where the median and corresponding percentiles are based on the value of that feature for all ionizable residues in a given protein. Thus this method achieves the same effect as Wei’s method 54 without introducing an extra parameter to be fine-tuned.

For the even-numbered moments, the robust Z-score Z′n for the nth central moment is defined as:

Z′n = Z′(µn ) (n even) (11)

The only even-numbered moments used in the present study are the second and fourth, so the corresponding robust Z-scores Z′ are Z′2 and Z′4.

Likewise the only odd-numbered moments are the first and third. Their corresponding robust Z-scores are the deviations of the absolute values from the median. In particular, we define Z′3 as

Z′3 = Z′(µ3 ) (12)

The population over which the median and percentiles are computed includes residues of different types with different free pKa’s. In order to compare the computed first moments across all residue types, the offset first moment for a given residue is defined as:

M1^offset = M1 − pKa(free)    (13)

where pKa(free) is the pKa for that residue in free solution. Note that by equation (9), an H-H residue has M1^offset = 0.

This offset may be compared across all residues in the protein. Thus Z′1 is defined as:

Z′1 = Z′(|M1^offset|).    (14)

Note that only the first moment requires this modification to make all residue types in the protein comparable since the H-H equation has only one free parameter, the residue type-dependent translation parameter pKa.

To summarize, the result of all these computations is to create, for each ionizable residue in any given protein, a 4-tuple of descriptors (Z′1, Z′2, Z′3, Z′4) of the theoretical titration curve. Z′2, Z′3, and Z′4 describe the shape of the curve and Z′1 measures its displacement along the horizontal axis.
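A compact sketch of this computation (equations (10)–(14)) is given below; the per-residue moment arrays passed in are hypothetical inputs for a single protein.

```python
import numpy as np

def robust_z(values):
    """Equation (10): robust Z-score of a feature relative to all ionizable
    residues of the same protein, using the median and the interquartile
    distance scaled by 1.349."""
    v = np.asarray(values, dtype=float)
    q25, q75 = np.percentile(v, [25, 75])
    return 1.349 * (v - np.median(v)) / (q75 - q25)

def residue_descriptors(m1, mu2, mu3, mu4, pka_free):
    """Per-residue 4-tuples (Z'1, Z'2, Z'3, Z'4) for one protein, following
    equations (11)-(14); pka_free holds each residue's free-solution pKa."""
    z1 = robust_z(np.abs(np.asarray(m1, float) - np.asarray(pka_free, float)))
    z2 = robust_z(mu2)              # eq. (11), n = 2
    z3 = robust_z(np.abs(mu3))      # eq. (12), absolute third central moment
    z4 = robust_z(mu4)              # eq. (11), n = 4
    return np.column_stack([z1, z2, z3, z4])
```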

3.3 Training.

A set of 20 proteins was used as the training set. The protein names, the E.C. numbers and the PDB ID for each of the 20 proteins in the training set are listed in Appendix A.

The labeling of the titration curves for training purposes was performed as follows: All residues listed in

CatRes/CSA as active were labeled positive. Also labeled positive were ionizable residues located near such annotated active residues whose titration curves appeared perturbed on visual inspection. All other residues were labeled negative, with the exception of a few residues with visually perturbed titration curves, no literature annotation, and no other perturbed residues nearby; they were removed from the training data set entirely. (Note that such residues with perturbed titration

curves that are not in spatial proximity with other perturbed residues are not considered predictive in

THEMATICS.)

From the 1575 ionizable residues in the 20 protein training set, I removed 46 isolated residues with perturbed titration curves. This leaves a training set of 1529 ionizable residues, among which 140 are labeled as positive training examples. For each ionizable residue in the training set, the four moment-based features and the corresponding labels were fed into the SVM using SVMLight 56. For both training and classification, the quadratic kernel K(x, z) = (1 + ⟨x, z⟩)^2 was used.

The relative cost of misclassification of positive and negative training examples was set such that false negatives were penalized 10 times as much as false positives. This was done because there are many more negative examples than positive examples in the training set, because of the aim to increase the residue recall rate, and because I have much more confidence in the labeling of the false negatives than the false positives (see sections 2.2.3 and 3.4). In addition, a linear kernel and several other choices of parameters were tried, but these resulted in either similar or slightly more training errors.
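The experiments above used SVMLight; a roughly equivalent scikit-learn sketch is shown below under the stated settings (hypothetical array names, with the 10:1 misclassification-cost ratio expressed through class weights rather than SVMLight's cost-factor option).

```python
from sklearn.svm import SVC

def train_thematics_svm(X_train, y_train):
    """X_train: (n_residues, 4) array of (Z'1, Z'2, Z'3, Z'4); y_train: 1 for
    active site residues, 0 otherwise. kernel='poly' with degree=2, gamma=1
    and coef0=1 gives K(x, z) = (1 + <x, z>)^2; class_weight penalizes errors
    on the positive (minority) class 10 times as heavily."""
    svm = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0,
              class_weight={1: 10.0, 0: 1.0})
    return svm.fit(X_train, y_train)
```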

3.4 Results

Typical criteria used to measure classifier performance are recall (also called sensitivity), the number of correctly predicted positives divided by the number of actual positives, and precision (related to specificity), the number of correctly predicted positives divided by the total number of predicted positives. Ideally, both measures are 100%, which means all and only the true positives are identified as such by the classifier. In the present case, one can be more confident of the true positive data, because for every labeled active residue there is experimental evidence supporting that labeling. On the other hand, true negative data are not as reliable because the experiments are incomplete; some important residues may not have been tested experimentally. Furthermore, some of the experimental literature has not been included in the CatRes/CSA database, because of the difficulty of exhaustive literature searching. A better indicator of the selectivity of the method for present purposes is the filtration ratio, the fraction of total

residues that are reported as positive. The goal of the system is then to achieve a high recall with a low filtration ratio.

A set of 64 test proteins was selected randomly from the CatRes database 57. There is no overlap between this test set of 64 proteins and the set of 20 proteins used to train the SVM. The trained SVM was applied to the test set to measure the overall accuracy of the method, assuming that the CatRes annotations define the true positive residues. Results are summarized here, while a detailed list of all proteins studied, with all predicted residues and clusters, can be found in Appendix B.

3.4.1 Success in site prediction.

First I examine the degree of matching between our predictions and the CatRes list for each protein.

Overall, the SVM identified an average of 2.7 clusters per subunit. Based on the overlap of the predicted active site and the CatRes listed set, the prediction for a protein is assigned to one of three categories. If

50% or more of the CatRes listed active residues were found by the system, we consider this a correct site prediction. If some, but fewer than half, of the CatRes listed active residues were found by our system, we consider it partially correct. If none of the CatRes listed active residues were found by our system, we consider the site prediction incorrect. This type of categorization has been used previously 18. Measuring this degree of overlap of predicted clusters with just the ionizable CatRes listed active-site residues, the percentages of proteins for which the predictions are correct, partially correct, and incorrect are 86%, 5% and 9% respectively, as shown in figure 3.1(a).
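The per-protein scoring rule just described can be written out compactly as a sketch; `predicted` and `annotated` are assumed to be sets of residue identifiers, with at least one annotated residue per protein.

```python
def site_prediction_category(predicted, annotated):
    """Classify one protein's prediction as 'correct' (>= 50% of annotated
    active residues found), 'partially correct' (some but < 50%), or
    'incorrect' (none found)."""
    found = len(set(predicted) & set(annotated))
    if found >= 0.5 * len(annotated):
        return "correct"
    return "partially correct" if found > 0 else "incorrect"
```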

3.4.2 Success in catalytic residue prediction.

Out of the 9303 ionizable residues from the 64 proteins, 1338 were predicted as active site candidates by the SVM, forming 244 clusters. There are 233 ionizable residues labeled as active site residues in the

CatRes database and 182 of them were found by our SVM, corresponding to a global residue recall of

78%. The average residue recall rate, averaged over all 64 proteins, is 76%.

For these 64 proteins, with the filtration ratio defined as the number of predicted residues over the total of 32016 residues (both ionizable and non-ionizable), the average is only 3.9%. This ratio is less than 8% for each of the 64 proteins. The average precision, or fraction of predicted residues that are known true positives, is 20% over the 64 protein set, using only the CatRes/CSA annotations to define the true positives.

3.4.3 Incorporation of non-ionizable residues.

Since not all active site residues are ionizable, it is also of interest to see how well the SVM-reported residues serve as predictors of activity in their spatial vicinity. Therefore I also define a THEMATICS positive region to be the spatial region within 6Å of any residue that belongs to a THEMATICS positive cluster. This may allow the method to find some catalytically important residues that do not have a perturbed titration curve (including non-ionizable residues). The total number of residues found by this criterion across the 64 test proteins is 4795, out of 32016 total residues. Among 366 residues that are labeled as active site residues in CatRes, 263 were found by the system, corresponding to a global recall of 72%, while the average recall per protein is 81%. The average precision, or fraction of predicted residues that are known true positives, is 21% over the 64 protein set, using only the CatRes/CSA annotations to define the true positives.

Table 3.1 compares the performance of the straight SVM predictions versus the SVM+Region predictions.

While the expansion to include the neighborhood surrounding the predicted residue leads to a somewhat higher recall rate, there is considerable sacrifice in the precision and increase in the filtration ratio.


Method            Recall   Precision   Filtration Ratio
SVM only          61%      21%         4%
SVM + 6Å region   81%      8%          13%

Table 3.1. Performance of the SVM predictions alone versus the SVM regional predictions that include all residues within a 6Å sphere of each SVM-predicted residue. Shown are average values of recall (true positive residues over all known positive residues), precision (true positive residues over all predicted residues), and filtration ratio (residues predicted over total residues in the protein), where averaging is performed over the set of 64 test proteins.


Using the same criteria described above for judging correctness of the predictions, but this time counting all residues in THEMATICS positive regions and comparing with all the CatRes listed active-site residues, the percentages of correct, partially correct, and incorrect site predictions are 88%, 4% and 8% respectively (Figure 3.1(b)).


[Figure 3.1: pie charts. (a) 86% correct, 5% partially correct, 9% incorrect. (b) 88% correct, 4% partially correct, 8% incorrect.]

Figure 3.1. The success rate for site prediction on a per-protein basis: (a) ionizable residues only; (b) all residues, extending the SVM’s predictions by including all residues within 6Å of each predicted residue.


Figure 3.2 shows histograms of the filtration ratios achieved using the trained SVM on a per protein basis. These filtration ratios are expressed in three different ways: 1) all predicted residues over all residues; 2) ionizable residues predicted over all ionizable residues; and 3) ionizable residues predicted over all residues. Here, 1) is obtained using the 6Å neighborhood criterion and 2) and 3) are obtained from the straight SVM prediction of ionizable residues only. Using all residues as the basis, the filtration ratio for the SVM (ionizable) predictions is less than 10% for all 64 proteins. There was only one protein out of the 64 for which the filtration ratio for all residues predicted (using the 6Å neighborhood criterion) was higher than 25%. For this protein, human Glyoxalase I, the method identified about 18% of its ionizable residues as candidates and about 27% of all its residues as candidates (using the 6Å neighborhood criterion). For well over 90% of the proteins studied, the filtration ratio was better than 20% in both the ionizable/ionizable and the all-residue cases. Even better, in 70% of the proteins tested the method reported less than 15% of the ionizable residues as candidates, and in 61% of the proteins in the test set less than 15% of all residues were identified as candidates using the 6Å neighborhood criterion. It is important to note that for most of the examples with large filtration ratios there is a sound functional basis for the high ratio, e.g. the active site binds multiple substrate or cofactor molecules and thus has an unusually large interaction site. In other cases with high filtration ratios, the protein has an interaction site of typical size but an unusually small number of total ionizable residues.


[Figure 3.2: histogram of the number of proteins in each filtration ratio range (0-5% through 25-30%) for the three ratios All/All, Ionizable/Ionizable, and Ionizable/All.]

Figure 3.2. Distribution of the 64 proteins across different values of the filtration ratio. Filtration ratios are expressed as: 1) all predicted residues over all residues; 2) ionizable residues predicted over all ionizable residues; and 3) ionizable residues predicted over all residues. The all-residue predictions in 1) are obtained using the 6Å neighborhood criterion.

3.4.4 Comparison with other methods.

It is useful to compare the results of the present method with some other catalytic site prediction methods that are based on 3D structure alone. The other methods used for this comparison are QSiteFinder 20 and SARIG 35, both of which have publicly available servers, and Ko's statistical analysis of THEMATICS 24. Of the 64 proteins used for the THEMATICS-SVM test set, one was too large for both SARIG and QSiteFinder and three others were too large for QSiteFinder. These four were deleted from the test set, and thus the comparison results reported here are for the remaining 60 proteins.

The average (per protein) values for recall, precision, filtration ratio, false positive rate, and MCC of each method are listed in Table 3.2. Two sets of results are given for QSiteFinder, one using only the top site and the other using a combination of the top three sites. This combination of the top three sites is the basis for the success rate reported for this method in the original article 20. The values in Table 3.2 use all annotated residues, including non-ionizable residues, as the basis for the recall and all residues in the protein as the basis for the filtration ratio for all of the methods. Thus the theoretical maximum recall rate for THEMATICS-SVM (without the 6Å region) is less than 100%, because some known catalytic residues are non-ionizable.

Figure 3.3 plots recall as a function of false positive rate. The solid line represents Wei's THEMATICS analysis with variable parameter α. The performance of Ko's analysis, QSiteFinder, SARIG, and the present SVM is depicted as points. The recall and the false positive rate of all the methods in this plot were measured over ionizable residues only.


Method                       Recall   Precision   Filtration Ratio   False Positive Rate   MCC
THEMATICS-SVM                61%      20.0%       3.8%               3.1%                  0.31
THEMATICS-SVM-Region         80%      8.1%        13.2%              12.3%                 0.22
THEMATICS-Statistical (Ko)   44%      23.5%       2.6%               2.0%                  0.29
QSiteFinder (top one)        33%      4.6%        7.5%               7.2%                  0.094
QSiteFinder (top three)      61%      4.2%        16.2%              15.8%                 0.12
SARIG                        61%      8.0%        11.2%              10.6%                 0.18

Table 3.2. Comparison of THEMATICS-SVM and other methods. Shown in the table are the recall, precision, filtration ratio, false positive rate, and Matthews correlation coefficient (MCC) of THEMATICS as well as of other site prediction methods including THEMATICS-Statistical (Ko’s method), QSiteFinder, and SARIG. These quantities are per-protein averages over a comparison set consisting of the 60 proteins from the 64-protein test set for which results could be obtained with all methods.


[Figure 3.3: recall (0 to 1) versus false positive rate (0 to 0.6). A curve shows Wei's method; points mark the SVM, QSiteFinder (largest one), QSiteFinder (largest three), SARIG, and Ko's method.]

Figure 3.3. Recall versus false positive rate (ROC-style plot) for the SVM and the other methods. The curve for Wei's method, a generalization of Ko's method, is obtained by varying the parameter α.


The comparisons in Table 3.2 and Figure 3.3 show that THEMATICS-SVM achieves catalytic residue recall as good as or better than the other methods, while simultaneously achieving substantially better precision with lower filtration ratios and false positive rates. This translates to more localized, more precise predictions of generally better quality, as reflected in the higher MCC. Table 3.2 shows that, compared with Ko's statistical analysis of THEMATICS, the SVM analysis gives significantly better sensitivity to catalytic residues with only a small concomitant drop in precision. Figure 3.3 and the MCC values in Table 3.2 further show that the tradeoff between recall and filtration ratio/false positive rate is better with the SVM analysis, since it outperforms Wei's method, a generalization and extension of Ko's method.

3.5 Discussion

3.5.1 Cluster number and size.

The SVM study (without the neighboring region) found 244 clusters as active site candidates, with an average of 2.7 clusters per chain in the 64-protein test set. While the number of true positive clusters per chain is probably somewhat higher than 1.0, 2.7 is too high. Ko's statistical criteria reported 1.7 clusters per chain, although I note that the present method was designed to increase the recall rate of true positive residues. Among the 244 clusters reported by the present method, 90 are pairs, and it appears that many of these pairs are false positives resulting from the chance proximity of two residues with similar pKa's. Simple geometry-based rules have been investigated that do eliminate many of these pairs 58. The average size of a cluster is 5.5 ionizable residues. These values include catalytic residues, binding residues, and also some "second shell" residues, residues that are nearest neighbors of the known catalytic residues and second-nearest neighbors of the reacting substrate. It is indeed possible that these "second shell" residues play some supporting role in catalysis, or they may simply be subject to the same strongly pH-dependent field as the catalytic residues. Experimental site-directed mutagenesis studies are in progress to elucidate this.

3.5.2 Failure analysis.

Although this method effectively found the active sites in most of the proteins, there remain a few cases where it failed to find the correct active sites. There are a variety of possible causes for this. In some cases, the "failure" may not be a failure at all. An example is cytosolic ascorbate peroxidase (1APX), in which the active site listed in CatRes is at the distal side of the heme ring. My method did not find that particular site, but it did find a cluster at the proximal side of the heme ring, which has been identified as an active site by several studies 59, 60. This suggests that some of the discrepancy between the CatRes labeling and the SVM results may result from incomplete or incorrect information in the reference database. Failure can also occur in cases where the active site environment is very hydrophobic. In the case of DNA photolyase (1DNP), the listed active site consists of three tryptophan residues, which are not ionizable. Even in that case, the SVM method did find a cluster that lies between those active tryptophan residues and the cofactors FAD and MTF. Indeed the predicted residues may actually be involved in the electron transfer process, which proceeds from the tryptophan residues to the cofactors. In some other cases, the SVM method found a cluster of residues that bind to a substrate or cofactor. These binding residues may exhibit such strong perturbations in their titration curves that the actual catalytic residues are missed by the classifier.

3.5.3 Analysis of high filtration ratio cases.

There are a few proteins for which the SVM method gives a high filtration ratio, meaning the active site candidates our method produces constitute an unusually high fraction of the total residues in the protein. Most of these cases are cofactor-dependent enzymes and/or systems with larger substrate molecules, where the site truly is a larger fraction of the protein's surface. The protein with the highest filtration ratio is human Glyoxalase I (1FRO), for which the SVM selects 18% of the ionizable residues and SVM+Region selects 27% of all residues. Glyoxalase I catalyzes the glutathione-dependent conversion of 2-oxoaldehydes to 2-hydroxycarboxylic acids. Each of its subunits binds glutathione, substrate, and zinc. In addition to the catalytic residue E172, the SVM method also finds the glutathione-binding residues R37, E99, and R122, the zinc-binding residue H126, plus some interfacial residues. Another example is arginine kinase (1BG0), which catalyzes phosphate transfer between arginine and ATP, and thus its site binds arginine, phosphate and MgADP. The predicted residues surround the site of interaction between the arginine, the ATP, and the reacting phosphate group.

Still other cases of high filtration ratio result from small enzyme size. Enzyme IIA-lactose (1E2A), a part of the lactose/cellobiose-specific family of enzymes II in the sugar phosphotransferase system, is such an example. It has only 36 ionizable residues. For this protein the SVM method found two clusters of three ionizable residues each, one of which is the known active site, giving a filtration ratio of 17% of ionizable residues. The all-residue analysis yielded a 21% filtration ratio. In this case, the site of interaction truly constitutes a high fraction of the total residues in the structure.

3.5.4 Some specific examples.

Type I 3-Dehydroquinate dehydratase from Salmonella typhi (PDB ID 1QFE) is an important enzyme found in plants and microorganisms. It functions as part of the shikimate pathway, which is essential for the biosynthesis of aromatic compounds including folate, ubiquinone and the aromatic amino acids. The absence of the shikimate pathway in animals makes it an attractive target for antimicrobial agents. The study of its structure, active sites and reaction mechanism opens the way for the design of highly specific enzyme inhibitors with potential importance as selective therapeutic agents 61. For this protein, the SVM predicts a total of eight residues in two clusters, [E46, E86, D114, E116, H143, K170] and [D50, H51]. The known catalytic residues are E86, H143, and K170 and are shown as red sticks in Figure 3.4. The other five predicted residues are shown as yellow sticks. The three predicted catalytic residues, plus one additional predicted residue (E46), are nearest neighbors of the reacting substrate molecule, as determined by the LPC server 62. The remaining four residues are "second shell" residues, each of which is in contact with at least two first shell residues. The elongated theoretical titration curves obtained for the eight predicted residues reflect their membership in subsets of ionizable residues with similar pKa's in close proximity.

Phosphatidylinositol-specific phospholipase C exists in both eukaryotic and prokaryotic cells. Catalyzing the hydrolysis of 1-phosphatidyl-D-myo-inositol-3,4,5-triphosphate into the second messenger molecules diacylglycerol and inositol-1,4,5-triphosphate, it plays an important role in signal transduction and other important biological processes 63. This catalytic process is tightly regulated by reversible phosphorylation and binding of regulatory proteins. The SVM prediction for phosphatidylinositol-specific phospholipase C from Listeria monocytogenes (PDB ID 2PLC) is shown in Figure 3.5. The correctly predicted active site residues are shown as red sticks and the additional predicted residues are shown as yellow sticks. Two residues missed by the SVM, but found by the SVM+Region selection, are shown in purple. Note how the additional (yellow) residues occupy a second layer immediately surrounding the known true positive residues (red and purple).


Type I 3-Dehydroquinate dehydratase

Figure 3.4. SVM prediction for protein 1QFE. The SVM predicts a total of eight residues in two clusters, among which three known catalytic residues are shown as red sticks; the other five residues are shown as yellow sticks.


Phosphatidylinositol phospholipase C

Figure 3.5. The SVM prediction for 2PLC. The correctly predicted active site residues are shown as red sticks and the additional predicted residues are shown as yellow sticks. Two residues missed by the SVM, but found by the SVM+Region selection, are shown in purple.


3.6 Conclusions.

THEMATICS-SVM is a relatively straightforward method. Combining SVM with THEMATICS achieves a higher recall rate for catalytic residues than earlier THEMATICS analyses, with only a small sacrifice in precision. Precision rates for THEMATICS predictions tend to be considerably better than for other methods. The more localized, more specific predictions offer enhanced usefulness for applications such as functional classification and specific ligand design.

The set of residues with perturbed titration behavior is a small subset of the set of ionizable residues with strong electrostatic interactions or shifted pKa's. Thus THEMATICS is more selective than simple identification of the strongly interacting residues. A previous study 64 showing that titration-based methods give a large number of false positives was based on the less selective electrostatic properties. The present study confirms the comparatively low false positive rate for THEMATICS.

SVM+Region, the extended version of THEMATICS-SVM that incorporates residues within a 6Å sphere of each predicted residue, does deliver improved recall, but with a sacrifice in precision. The major advantage of THEMATICS over other 3D-structure-based methods is the superior precision; SVM+Region loses most of this advantage.

THEMATICS selects ionizable residues with strong coupling between their protonation events. These localized networks of interacting residues are good predictors of active sites. This feature may be a fundamental property of enzyme active sites and may be a factor to consider in protein engineering.

3.7 Next step.

Although applying SVM with THEMATICS gave us a system that predicts protein active sites with better sensitivity and specificity, there is still room for improvement. One possible approach would be using a larger and better annotated training set and fine-tuning the SVM with different kernels and parameters. Most likely, this approach would improve the performance of the THEMATICS-SVM method, but there are limitations. The fundamental basis for this approach, as well as other methods using THEMATICS, is the use of a binary classification system to select the THEMATICS-positive residues and cluster them based on their physical proximity. If we could develop a new system that can rank order the residues based on their likelihood of being in the active site, it would be much easier for a user to decide how far down the list to go. Furthermore, if these rankings are based on probability estimates of each residue being in the active site, we have principled ways to further combine the results from different systems to obtain a more powerful system. So in the next chapter, I take a brand new approach and develop the Partial Order Optimal Likelihood (POOL) method, which yields probability estimates rather than binary classification decisions.

Chapter 4

New Method: Partial Order Optimal Likelihood (POOL).


4.1 Ways to estimate class probabilities.

For convenience, I use class probability to refer to $P(C = c \mid X_1 = x_1, \ldots, X_n = x_n)$, which is the probability that an instance with attributes $X_1$ to $X_n$ having values $x_1$ to $x_n$, respectively, belongs to class c. Notice that it is different from another commonly used term, the class-conditional probability, meaning the probability $P(X_1 = x_1, \ldots, X_n = x_n \mid C = c)$. I refer to any method that estimates class probabilities as a class probability estimator (CPE). I distinguish two classes of CPEs. One is parametric, which involves selecting a function to model the class probabilities of the training data and then applying a learning process to select the parameters of that function giving the best fit to the training data. Examples are the logistic regression method 65 and neural networks. The other class is nonparametric and includes the novel POOL method described below. There is no pre-defined function in such nonparametric CPEs, and the learning process estimates the class probabilities directly from the training data. I believe the nonparametric CPEs are more general and I will focus my discussion on this class. There are different nonparametric CPEs, with different ways of computing and representing the probability estimates. I will introduce four of them as follows.

4.1.1 Simple joint probability table look-up.

The most conceptually straightforward way is to use a lookup table having the same number of dimensions as the number of attributes used to represent the data. The values of each feature are quantized into small intervals. The probability estimates are calculated by taking the ratio of the number of instances in each class to the total number of instances with each feature value combination. The complete table has to be stored as the representation of the probability estimates. Clearly, both computation and representation will be exponential in the number of features. The quantization is clearly necessary if the attributes are distributed over a continuum.

If we had an unlimited set of training data and it had exactly the same distribution as that of the actual test data, this would be the best learning system, because there is no information loss caused by any model restriction. But in reality, the training set is always limited and there will be attribute combinations in the test data that have never appeared in the training set. Because of this limitation, such a lookup table has seldom been a realistic choice in any real problem. Some model abstraction has to be done to make generalization from observed training instances to unknown test instances possible.
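A minimal sketch of such a table-based CPE, assuming two-class data whose continuous attributes are quantized into fixed-width bins; the bin width and the dictionary-backed table are illustrative choices of this sketch, not details of the dissertation's implementation.

```python
from collections import defaultdict

class JointTableCPE:
    """Class probability estimator backed by a quantized joint lookup table."""

    def __init__(self, bin_width=0.5):
        self.bin_width = bin_width
        self.pos = defaultdict(int)    # count of class-1 instances per attribute-bin combination
        self.total = defaultdict(int)  # total count per attribute-bin combination

    def _key(self, x):
        # Quantize each attribute into an interval index.
        return tuple(int(v // self.bin_width) for v in x)

    def fit(self, X, y):
        for x, c in zip(X, y):
            k = self._key(x)
            self.total[k] += 1
            self.pos[k] += int(c == 1)
        return self

    def predict_proba(self, x):
        k = self._key(x)
        if self.total[k] == 0:
            return None  # attribute combination never seen in training
        return self.pos[k] / self.total[k]

cpe = JointTableCPE().fit([[0.1, 0.2], [0.1, 0.3], [1.4, 0.2]], [1, 0, 1])
print(cpe.predict_proba([0.2, 0.4]))   # 0.5: same bin as the first two training points
print(cpe.predict_proba([3.0, 3.0]))   # None: combination absent from the table
```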

4.1.2 Naïve Bayes method.

Assuming conditional independence, i.e.,

$P(X_1 = x_1, \ldots, X_d = x_d \mid C = c) = \prod_{i=1}^{d} P(X_i = x_i \mid C = c)$,   (15)

one can use Bayes' theorem to estimate the class probability P(class | attributes) by

$P(C = c \mid X_1, \ldots, X_d) = \frac{P(C = c)}{P(X_1, \ldots, X_d)} \prod_{i=1}^{d} P(X_i \mid C = c) \propto P(C = c) \prod_{i=1}^{d} P(X_i \mid C = c)$.   (16)

P(C = c) is the probability of an instance being in one particular class c, and this probability can be estimated by the ratio of the number of instances in c to the total number of instances in the training examples. $P(X_i \mid C = c)$ can be estimated as the ratio of the number of instances belonging to class c with the corresponding attribute value to the number of instances in c; this, too, is obtained by counting the instances in class c with the corresponding values. $P(X_1, \ldots, X_d)$ is independent of the class C, and it can be treated as a constant. Since the sum of the class probabilities over all classes is 1, one can easily solve for the probability of each class, knowing their relative ratios. This method is linear in both computation and representation of the probability estimates in terms of the number of attributes d, since it only needs to compute and store the $P(X_i \mid C)$ and $P(C)$ estimates.

This is an indirect method in the sense that it estimates the class probability P(class | attributes) by first estimating the class-conditional probability P(attributes | class) and then using conditional independence to convert it. This method also requires quantization, unless some parametric form of the probability density function is used.
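The following sketch spells out equation (16) for binary classes with discrete (already quantized) attribute values; the Laplace smoothing constant is an assumption added here to avoid zero counts and is not part of the formulas above.

```python
from collections import defaultdict

class NaiveBayesCPE:
    """Estimates P(C = c | x) via Bayes' theorem and conditional independence (eq. 16)."""

    def fit(self, X, y, alpha=1.0):
        self.alpha = alpha
        self.classes = sorted(set(y))
        self.class_count = defaultdict(int)
        self.feat_count = defaultdict(int)  # keyed by (class, attribute index, value)
        for x, c in zip(X, y):
            self.class_count[c] += 1
            for i, v in enumerate(x):
                self.feat_count[(c, i, v)] += 1
        self.n = len(y)
        return self

    def predict_proba(self, x):
        # Unnormalized P(C=c) * prod_i P(X_i = x_i | C = c), then normalize over classes.
        scores = {}
        for c in self.classes:
            s = self.class_count[c] / self.n
            for i, v in enumerate(x):
                # Laplace smoothing; the 2 in the denominator assumes binary attribute values.
                s *= (self.feat_count[(c, i, v)] + self.alpha) / (self.class_count[c] + 2 * self.alpha)
            scores[c] = s
        z = sum(scores.values())
        return {c: s / z for c, s in scores.items()}

cpe = NaiveBayesCPE().fit([(0, 1), (0, 0), (1, 1), (1, 0)], [1, 0, 1, 0])
print(cpe.predict_proba((0, 1)))  # class 1 favored
```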

Although I do use a naïve Bayes method to combine probabilities from different types of features in the application, it is not enough because it is not apparent how one can apply the THEMATICS principles in this method, since the constraints themselves are on class probabilities instead of the class-conditional probabilities. Furthermore, because some of the features used in THEMATICS are clearly correlated, the conditional independence assumption is violated, which makes the naïve Bayes probability estimates suspect.

4.1.3 The K-nearest-neighbor method.

The third method to introduce is the k-nearest-neighbor method. It estimates the class probability P(class | attributes) by taking the k training instances with the "closest" attributes to the query point and then counting how many of them are in that particular class. This is a direct method since it gives a direct estimate of the class probability, and it does this estimation without explicit quantization. In some sense, it implicitly quantizes the attributes by taking the range of the values from the nearest neighbors. This method has a lazy evaluation feature, since the estimates for a given feature combination are calculated at query time. All the training instances have to be stored, and unless some indexing scheme is used, the computation and representation cost of this method is linear in the number of training instances.

This method can estimate the class probabilities when there are no matches between the attributes of the query instance and the training instances, and with properly selected k, it may even reduce some of the negative effect caused by noise in the training data, but there is no apparent way to apply THEMATICS principles with this method. Furthermore, correlation in the THEMATICS features is a potential problem, although a correctly weighted distance function can compensate for this.
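A minimal k-nearest-neighbor class probability estimate is sketched below under a Euclidean distance, which is an illustrative choice; a weighted distance, as noted above, could compensate for correlated features.

```python
import math

def knn_class_probability(query, X, y, k=5):
    """Estimate P(class = 1 | query) as the fraction of positives among the k nearest training instances."""
    dist = lambda a, b: math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    neighbors = sorted(zip(X, y), key=lambda xy: dist(xy[0], query))[:k]
    return sum(c for _, c in neighbors) / k

X = [(0.0, 0.0), (0.1, 0.2), (0.9, 1.0), (1.0, 0.8), (0.5, 0.5)]
y = [0, 0, 1, 1, 1]
print(knn_class_probability((0.2, 0.1), X, y, k=3))  # 1/3: one positive among the three closest points
```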

4.1.4 POOL method.

The fourth method is the one proposed here, the Partial Order Optimal Likelihood (POOL) method, the details of which will be introduced in the next section.

It finds the "best" estimate, in the sense of maximizing the likelihood, among all possible estimates of the class probabilities P(class | attributes) that satisfy given monotonicity assumptions. It uses convex optimization to update the probability estimates of the training instances. It stores these probability estimates as reference points and computes the probability estimates of query instances at query time, although in principle one can compute the whole table and store it. With lazy evaluation, this method has a linear computation and representation cost in terms of the number of training instances.

It is a direct method, since it estimates P(class | attributes) directly. In principle it does not need quantization, although in our application we have used quantization for convenience. It is a learning system that makes full use of the training data and enforces some prior belief, like the THEMATICS principles, to minimize the effect caused by noise in the training set. The POOL method is conceptually simple, computationally efficient, and appears to be effective in practice, as will be shown below.

4.1.5 Combining CPE’s.

Although one can put all the features together and use one CPE to estimate the class probability in one step, this may not be the best choice in practice. In most cases, it is better to group the features into different groups and use several, say l, CPEs, each with one group of features $\vec{X}_i$. Each of these smaller CPEs can be obtained by any of the methods described above, as well as by any parametric method. At query time, the class probability $P(C = c \mid \vec{X}_1, \ldots, \vec{X}_l)$ can be estimated based on the class probability estimates $P(C = c \mid \vec{X}_i)$ from each of the smaller CPEs according to the naïve Bayes combination rule:

$P(C = c \mid \vec{X}_1 = \vec{x}_1, \ldots, \vec{X}_l = \vec{x}_l) \propto P(C = c) \prod_{i=1}^{l} \frac{P(C = c \mid \vec{X}_i = \vec{x}_i)}{P(C = c)}$,   (17)

which is easily derived from Bayes' theorem under the assumption that $\{\vec{X}_i \mid C\}_{i=1}^{l}$ are mutually independent random variables, i.e.,

$P(\vec{X}_1 = \vec{x}_1, \ldots, \vec{X}_l = \vec{x}_l \mid C = c) = \prod_{i=1}^{l} P(\vec{X}_i = \vec{x}_i \mid C = c)$.   (18)

We use the term chaining to mean the application of this naïve Bayes combination rule.

Although strictly speaking one can only use chaining to estimate the class probability when the conditional independence assumption holds, in practice, even if the conditional independence assumption is not strictly true, one might still be able to get some useful results, as with other applications of naïve Bayes, especially when it comes to relative rankings 66.
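A sketch of the chaining rule in equation (17) for the two-class case, assuming each group-level CPE already supplies P(C = 1 | group features); the function name and the toy numbers are assumptions of this example.

```python
def chain(group_probs, prior):
    """Combine per-group estimates P(C=1 | X_i) into one estimate via the naive Bayes rule (eq. 17).

    group_probs -- list of P(C = 1 | X_i = x_i), one per feature group
    prior       -- P(C = 1)
    """
    # Unnormalized scores for class 1 and class 0, then renormalize.
    score1 = prior
    score0 = 1.0 - prior
    for p in group_probs:
        score1 *= p / prior
        score0 *= (1.0 - p) / (1.0 - prior)
    return score1 / (score1 + score0)

# Two groups both mildly favoring the positive class reinforce each other.
print(chain([0.6, 0.7], prior=0.5))  # ~0.78
```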

4.2 POOL (Partial Order Optimal Likelihood) method in detail.

4.2.1 Maximum likelihood problem with monotonicity assumption.

Before introducing POOL, I first introduce the notion of a class probability monotonicity assumption:

Definition 1: Class Probability Monotonicity Assumption:

The Class Probability Monotonicity Assumption refers to the property that for each class c, given a partial order $\preceq_c$ on the attribute space X, for any $x_i, x_j \in X$ with $x_i \preceq_c x_j$, it must be that $P(C = c \mid X = x_i) \le P(C = c \mid X = x_j)$. For 2-class classification, $\preceq_-$ is the opposite of $\preceq_+$, i.e., $x_i \preceq_+ x_j$ iff $x_j \preceq_- x_i$.

An important special case that inspired my development of this approach is when the attribute space has the form $\mathbb{R}^n$. In this case, we can define a partial order on X as follows: given $x = (x_1, \ldots, x_n) \in X$ and $y = (y_1, \ldots, y_n) \in X$, define $x \preceq y$ if $x_i \le y_i$ for all i. We call this the coordinatewise partial order on X induced by the ordering on $\mathbb{R}$.

The basic idea of POOL is to find a CPE for which the training data likelihood is highest among all CPEs that conform to the monotonicity assumptions. The effect of the monotonicity assumptions is to create a set of inequality constraints that relate the class probabilities of certain pairs of points.

Definition 2: Likelihood function L(H) of hypothesis H.

Assume given a hypothesis space containing a family of hypotheses H, i.e., probability density functions (for continuous distributions) or probability mass functions (for discrete distributions), and $\vec{X}_1, \ldots, \vec{X}_n$ as random draws with an actual sample $\vec{x}_1, \ldots, \vec{x}_n$. Since the draws are i.i.d., by definition

$P(\vec{X}_i = \vec{x} \mid H) = P(\vec{X}_j = \vec{x} \mid H)$ for any i, j, $\vec{x}$, H,

and for each hypothesis H we may compute the probability density with which we observed $\vec{X}_1, \ldots, \vec{X}_n$ as a function of H,

$L(H) = P(\vec{X}_1 = \vec{x}_1, \ldots, \vec{X}_n = \vec{x}_n \mid H)$.   (19)

Now, restricting attention to 2-class problems, where the class labels are 0 and 1, I define $p_i = P(C = 1 \mid H, \vec{X}_i = \vec{x}_i)$, and we can write

$P(C_i = c_i \mid H, \vec{X}_i = \vec{x}_i) = p_i^{c_i} (1 - p_i)^{1 - c_i}$.   (20)

Then,

$L(H) = \prod_{i=1}^{n} P(\vec{X}_i = \vec{x}_i, C_i = c_i \mid H) = \prod_{i=1}^{n} P(C_i = c_i \mid H, \vec{X}_i = \vec{x}_i) \, P(\vec{X}_i = \vec{x}_i)$.   (21)

After substitution of (20), the likelihood function in our problem becomes:

$L(H) = \prod_{i=1}^{n} p_i^{c_i} (1 - p_i)^{1 - c_i} \, P(\vec{X}_i = \vec{x}_i) \propto \prod_{i=1}^{n} p_i^{c_i} (1 - p_i)^{1 - c_i}$.   (22)

Given a monotonicity assumption, finding maximum likelihood probability estimates becomes a constrained optimization problem: maximize L subject to a set of inequality constraints of the form $p_i \le p_j$, one for each $(\vec{x}_i, \vec{x}_j)$ pair in the training data generating the partial order via transitive closure. The solution of this problem assigns probability estimates to only the training data. For attribute combinations not observed in the training data, the monotonicity constraints only determine upper and lower bounds on their class probabilities; one can then use some form of interpolation to assign actual estimates.
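As an illustration of how the monotonicity assumption turns into pairwise constraints, the sketch below enumerates the coordinatewise partial order over a small set of training points; representing a constraint as an index pair (i, j) meaning $p_i \le p_j$ is an assumption of this example.

```python
def coordinatewise_constraints(points):
    """Return index pairs (i, j) with points[i] <= points[j] in every coordinate (i != j).

    Each pair encodes the inequality constraint p_i <= p_j required by the
    class probability monotonicity assumption for the positive class.
    """
    pairs = []
    for i, xi in enumerate(points):
        for j, xj in enumerate(points):
            if i != j and all(a <= b for a, b in zip(xi, xj)):
                pairs.append((i, j))
    return pairs

points = [(0.1, 0.2), (0.3, 0.5), (0.2, 0.1)]
print(coordinatewise_constraints(points))  # [(0, 1), (2, 1)]: the third point is incomparable with the first
```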

4.2.2 Convex optimization and K.K.T. conditions.

Given a real vector space¹ X together with a convex, real-valued function

$f : A \to \mathbb{R}$,

defined on a convex subset A of X, convex optimization is the problem of finding the point x in A for which the value f(x) is smallest.

Convex optimization has been studied for a long time. It has many good properties, such as the fact that if a local minimum exists, it must also be a global minimum; this makes methods like gradient descent work for solving the problem without the danger of getting stuck in a local minimum instead of finding the global minimum. Many methods have been developed to solve it efficiently. There are standard methods like gradient projection methods, line-search methods, and interior-point methods, as well as specialized versions dedicated to specific problems of this form. How to solve convex optimization problems in general is still an active research area.

One way to find and prove the constrained optimal point in a convex optimization problem is to use the Karush-Kuhn-Tucker conditions (K.K.T. conditions), a generalization of Lagrange multipliers.

K.K.T. conditions:

1 In order to describe the convex optimization problem in its general form, I use X to denote the vector space, and x to denote a vector (or point) in that space, until further notice. As a general rule, whenever I discuss general optimization problems, I use x, while p is used in specializing to solve for probabilities, as in later sections.

Consider the problem:

minimize f(x)
subject to $g_i(x) \le 0$ and $h_j(x) = 0$,

where f(x) is the function to be minimized, $g_i(x)$ (i = 1,…,m) are the inequality constraints and $h_j(x)$ (j = 1,…,l) are the equality constraints.

Necessary conditions:

Suppose that the objective function $f : \mathbb{R}^n \to \mathbb{R}$ and the constraint functions $g_i : \mathbb{R}^n \to \mathbb{R}$ and $h_j : \mathbb{R}^n \to \mathbb{R}$ are continuously differentiable at a point $x^* \in S$. If $x^*$ is a local minimum, then there exist constants $\lambda > 0$, $\mu_i \ge 0$ (i = 1,…,m) and $\nu_j$ (j = 1,…,l) such that

$\lambda \nabla f(x^*) + \sum_{i=1}^{m} \mu_i \nabla g_i(x^*) + \sum_{j=1}^{l} \nu_j \nabla h_j(x^*) = 0$   (23)

and

$\mu_i \, g_i(x^*) = 0$ for all i = 1,…,m.   (24)

Sufficient conditions:

If the objective function $f : \mathbb{R}^n \to \mathbb{R}$ and the constraint functions $g_i : \mathbb{R}^n \to \mathbb{R}$ and $h_j : \mathbb{R}^n \to \mathbb{R}$ are convex functions, the point $x^* \in S$ is a feasible point, and there exist $\mu_i \ge 0$ (i = 1,…,m) and $\nu_j$ (j = 1,…,l) such that

$\nabla f(x^*) + \sum_{i=1}^{m} \mu_i \nabla g_i(x^*) + \sum_{j=1}^{l} \nu_j \nabla h_j(x^*) = 0$   (25)

and

$\mu_i \, g_i(x^*) = 0$ for all i = 1,…,m,   (26)

then $x^*$ is the global minimum.

4.2.3 Finding Minimum Sum of Squared Error (SSE).

In addition to likelihood, sum of squared error is another commonly used measurement of how well the model fits the data, and it is easier to work with than the likelihood L. In this section, I will present an approach to compute minimum SSE under the monotonicity assumption, and in the next section, I will prove that we can find the maximum of likelihood L by finding the minimum of SSE.

Definition 3: Sum of squared error (SSE). To estimate how close the estimated function, in our case the class probability estimate² $p(\vec{x})$, is to the observations of n $(\vec{x}, c)$ pairs, we compute

$SSE = \sum_{i=1}^{n} (c_i - p_i)^2$, where $p_i = p(\vec{x}_i)$.   (27)

Let $\vec{p} = (p_1, \ldots, p_n)$. SSE is a quadratic function of $\vec{p}$, and the class probability monotonicity assumptions form a set of linear inequality constraints. This problem is then a special case of the convex optimization problem called a quadratic programming problem, another well studied class of convex optimization problems. As a matter of fact, training an SVM, a recently developed machine learning system, is itself a quadratic programming problem.

I developed the POOL algorithm, described below, to find the solution generating the minimum SSE (and therefore the maximum likelihood) under the monotonicity constraints. Very recently, I have discovered some existing literature describing an approach called isotonic or monotonic regression 67, but most of this work is focused on one-dimensional problems. There is also earlier literature focusing on the total order case; the pool adjacent violators algorithm (PAVA) 68 and monotonic smoothing 69 are such examples. Although some of these reports pointed out that the extension of this problem to multiple dimensions could be framed as convex optimization, the emphasis of the literature in this field seems to focus primarily on one-dimensional problems.

2 In this specific case of estimating class probabilities, I use $\vec{x}$ to denote the instances and p to denote the vector to which I try to assign values to minimize SSE.

Compared with convex optimization in its general form, the present case is very special. First, I want to optimize a summation of terms, each consisting of only one component of the vector, i.e., there are no cross products between different components of the vector. This feature leads to a very simple gradient of the target function S, such that after rescaling coordinates, the direction toward the global optimum of S can be determined locally by choosing each component variable $p_i$ to optimize the ith term in S, subject to the same constraints being met.

The constraints are special, too. Each constraint is a linear inequality constraint containing two terms, in addition to the implicit constraints of $0 \le p_i \le 1$ implied by the $p_i$ being probabilities. In formal terms, this is a sparse problem.

Another special feature of the present problem is that, by a simple scaling of coordinates in space, the negative gradient vector at any point points directly to the unconstrained global optimal point, i.e., if one knows the "best improving" direction at one point, following that direction (in a straight line) will lead to the optimal point.

We start from a point where all constraints are active, where active means that an inequality constraint is met by equality, i.e., the constraint $p_i \le p_j$ is actually satisfied by $p_i = p_j$. Also, all the constraints are linear, and since the gradient never changes direction along the path from point x to the unconstrained global optimal point O, we have a very special property: if the "best improving" direction moves "away" from an active constraint, that constraint will not become active again in the optimized solution. It is also true that if the "best improving" direction "runs into" an active constraint, the constraint will still be active in the final optimum solution. This makes it possible to determine the active constraints in the final solution at the starting point, and once the active constraints in the final solution are determined, it is easy to calculate the exact $\vec{p}$ that gives the constrained optimal value of the objective function $f(\vec{p})$. Since it is known that the optimum solution is achieved with these active constraints active, meaning the corresponding probabilities are equal, all one needs to do is to partition the data into pools having the same (average) probabilities as required by the active constraints, and thereby obtain the exact solution.

These special conditions make it possible to develop special algorithms, like the one presented in the next subsection, to efficiently and accurately solve this subclass of convex optimization problems. In a more general convex optimization problem without these special conditions, a solution can only be approximated by numerical methods to a certain degree of accuracy, and the solution process typically involves re-computing active constraint sets as the algorithm iterates toward a solution.
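For the one-dimensional, totally ordered special case mentioned above, the pooling behavior can be sketched directly; this is the classical pool adjacent violators algorithm, shown only to illustrate the pooling idea that POOL generalizes to partial orders, and the weighting by group size is an assumption chosen to match the SSE objective.

```python
def pava(values, weights):
    """Weighted isotonic regression on a totally ordered sequence (pool adjacent violators).

    values[i]  -- observed proportion (e.g. n_plus / n_total) at position i
    weights[i] -- group size n_total at position i
    Returns the non-decreasing fit minimizing the weighted sum of squared errors.
    """
    # Each block stores [weighted mean, total weight, number of original positions].
    blocks = []
    for v, w in zip(values, weights):
        blocks.append([v, w, 1])
        # Merge backwards while the non-decreasing order is violated.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            w_new = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / w_new, w_new, n1 + n2])
    fit = []
    for m, _, n in blocks:
        fit.extend([m] * n)
    return fit

print(pava([0.0, 0.8, 0.5, 1.0], [2, 1, 1, 2]))  # [0.0, 0.65, 0.65, 1.0]
```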

4.2.4 POOL algorithm.

The input to the algorithm is the training set D and the constraint matrix $C_{n \times m}$. Each column in this matrix corresponds to a single constraint of the form $x_i \le x_j$ and contains two non-zero entries.

This algorithm consists of three steps:

• Determine the active set A, which consists of all the active constraints, at the starting point.

• Given A, compute the corresponding partitioning (pools) of data.

• For each pool, compute the average across all data in the pool and assign this common average

value to each instance in the pool.

POOL(D, $C_{n \times m}$):
• Initialize the starting point S at the origin.
• Compute $\vec{G} \leftarrow \alpha \, \nabla f(S)$ (in our re-scaled case, $G_i = n_{+\_i} / n_{total\_i}$, where $n_{+\_i}$ and $n_{total\_i}$ are the number of positive samples and the total number of samples for the ith instance).
• Initialize $\vec{\mu} \leftarrow \vec{0}$.
• Until the termination condition* has been met:
  • Compute $\vec{H} \leftarrow C_{n \times m} \, \vec{\mu}$.
  • Compute $\vec{F} \leftarrow \vec{G} - \vec{H}$.
  • Compute $\Delta\vec{\mu} \leftarrow C_{n \times m}^{T} \, \vec{F}$.
  • Update $\vec{\mu} \leftarrow \vec{\mu} + \alpha \, \Delta\vec{\mu}$.
  • If $\mu_i < 0$ then $\mu_i \leftarrow 0$ (i = 1,…,m).
• Build the transitive closure of x:
  • Let each $x_i$ (i = 1,…,n) be in its own set.
  • For each i (i = 1,…,m), if $\mu_i > 0$, look in the constraint matrix $C_{n \times m}$ to find the two non-zero entries $C_{ai}$ and $C_{bi}$ in the ith column and union the sets containing $x_a$ and $x_b$ into a new set, replacing the two previous sets, if they are not already in the same set.
• Go through all the sets built in the above step. Set $p_i$ (i = 1,…,n) to the sum of the $n_+$ over all x in the same set divided by the sum of the $n_{total}$ in that set.

* The termination condition is a threshold set by the user to determine convergence of $\vec{F}$; in our program, the size of $\vec{F}$ is monitored until the difference between two iterations is less than $10^{-9}$. α is the step size controlling the rate of change in moving toward the improving direction; in our case it is set to 0.05.

In the above algorithm, the first four steps compute the active set A by solving the dual problem of determining the $\mu_i$ minimizing $\left| \vec{G} - \sum_{i=1}^{m} \mu_i \vec{C}_i \right|^2$, subject to $\mu_i \ge 0$ for all i. Constrained gradient descent, such as gradient projection, is used to solve this. The constraint $g_i$ is active iff $\mu_i > 0$.

The reason that I can apply the active set A determined at the starting point to the solution point is the fact that I rescale the coordinate system, thus preventing the distortion of the spherical gradient by the different weights of each term in the objective function.

The original objective function is

$SSE = \sum_{i=1}^{n} \left( n_{total\_i} \, p_i^2 - 2 \, n_{+\_i} \, p_i + n_{+\_i} \right)$.   (28)

After applying the appropriate transformation

$p_i' = \sqrt{n_{total\_i}} \; p_i$,   (29)

the objective function becomes

$SSE = \sum_{i=1}^{n} \left( p_i'^2 - 2 \, \frac{n_{+\_i}}{\sqrt{n_{total\_i}}} \, p_i' + n_{+\_i} \right)$,   (30)

which has spherical level surfaces, since all quadratic terms have the same coefficient.

Usually, n and m are large numbers, but $C_{n \times m}$ is sparse because most of its entries are 0. To improve efficiency in storage and computation, I use the indices of the non-zero entries of $C_{n \times m}$ instead of the matrix itself. In practice, the POOL algorithm as implemented in my program runs very fast.
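A compact sketch following the same three phases listed above (dual gradient iteration for the active set, pooling via union of constrained instances, per-pool averaging); the NumPy representation, the fixed iteration cap, the numerical tolerances, and the toy counts are assumptions of this sketch rather than details of the actual program.

```python
import numpy as np

def pool_fit(n_pos, n_total, constraints, alpha=0.05, tol=1e-9, max_iter=200000):
    """Sketch of POOL: monotonicity-constrained class probability estimates.

    n_pos[i], n_total[i] -- positive and total counts for the ith distinct attribute combination
    constraints          -- list of (a, b) pairs encoding p_a <= p_b
    """
    n_pos = np.asarray(n_pos, dtype=float)
    n_total = np.asarray(n_total, dtype=float)
    w = np.sqrt(n_total)
    G = n_pos / w                       # unconstrained optimum in the rescaled coordinates
    # Constraint normals in the rescaled space: p_a - p_b <= 0 -> +1/sqrt(w_a) at a, -1/sqrt(w_b) at b.
    C = np.zeros((len(n_pos), len(constraints)))
    for l, (a, b) in enumerate(constraints):
        C[a, l] = 1.0 / w[a]
        C[b, l] = -1.0 / w[b]
    # Phase 1: projected gradient on the dual variables mu >= 0 (identifies the active set).
    mu = np.zeros(len(constraints))
    for _ in range(max_iter):
        F = G - C @ mu
        step = alpha * (C.T @ F)
        mu = np.maximum(mu + step, 0.0)
        if np.linalg.norm(step) < tol:
            break
    # Phase 2: union-find pooling of instances joined by active constraints (mu > 0).
    parent = list(range(len(n_pos)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for l, (a, b) in enumerate(constraints):
        if mu[l] > 1e-8:
            parent[find(a)] = find(b)
    # Phase 3: each pool gets the ratio of summed positives to summed totals.
    p = np.empty(len(n_pos))
    for root in set(find(i) for i in range(len(n_pos))):
        members = [i for i in range(len(n_pos)) if find(i) == root]
        p[members] = n_pos[members].sum() / n_total[members].sum()
    return p

# Two adjacent groups violate monotonicity (0.8 before 0.25) and get pooled to 5/9.
print(pool_fit([0, 4, 1, 5], [4, 5, 4, 5], [(0, 1), (1, 2), (2, 3)]))
```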

4.2.5 Proof that the POOL algorithm finds the minimum SSE.

In this subsection, I will prove that the POOL algorithm gives the minimum SSE solution under the monotonicity constraints.

In the present case, the objective function is a quadratic function of x³, the constraint functions $g_i$ are linear functions, and there are no equality constraints $h_j$, so both the necessary and the sufficient K.K.T. conditions hold. This also implies that if one can find $\mu_i \ge 0$ (i = 1,…,m) satisfying the above two equations, one finds the global minimum $x^*$.

Since there are no equality constraints in the present problem, one does not have to consider the $\nu_j$ and the last term in the K.K.T. equations (23) and (25) above. Here I show my approach to find the $\mu_i$.

3 In order to make the proof consistent with the form stated in section 4.2.2, I use x instead of p as in section 4.2.3; the function value we are minimizing is still SSE.

First, notice that all of the constraints can be put into two categories: one is $0 \le x_i \le 1$ for all i = 1,…,n; the other is $x_i \le x_j$ for some i and j. The first category is automatically satisfied from the starting point to the end during the optimization process. There is an interesting feature of the m constraints of the second category: all the bounding hyperplanes of the feasible region intersect at a line where the $x_i$ are equal for all i = 1,…,n. Our algorithm takes one point on this line, where all $x_i$ are equal to 0, as the starting point.

The outward pointing normals of the m constraints at the starting point form a convex cone. There is also a vector $\vec{G}$ at that point, pointing to the global minimum of the objective function when there are no constraints. $\vec{G}$ is the negative gradient of the objective function at x.

As mentioned earlier in 4.2.3, the present problem has the special feature that, by a simple scaling of coordinates in space, the negative gradient vector $\vec{G}$ becomes a constant vector $\vec{O}$ (pointing to where the unconstrained minimum of f(x) is located) minus the vector $\vec{X}$, where $\vec{X}$ is the vector from the origin to the point x,

$\nabla f(x) = \vec{X} - \vec{G}$.   (31)

So the key point is to find the active constraints in the final solution at the starting point S. Without loss of generality, for the sake of convenience in stating the proof, we assume the starting point S is located at the origin. As Figure 4.1 shows, there are three cases based on where $\vec{G}$ points.

[Figure 4.1(a)] [Figure 4.1(b)] [Figure 4.1(c)]

Figure 4.1. Three cases of $\vec{G}$ in relation to the convex cone of constraints. Define the convex cone as $\{\vec{V} \mid \vec{V} = \sum_i w_i \vec{C}_i \text{ with } w_i \ge 0 \; \forall i\}$. $\vec{G}$ is the negative gradient of the objective function at the starting point S, pointing to the global optimum $O^*$. $\vec{C}_1$ and $\vec{C}_2$ are the constraint vectors forming the convex cone. $\vec{H}$ is the projection of $\vec{G}$ onto the convex cone formed by the constraint vectors, and $x^*$ is the solution that gives the constrained optimum of the objective function. 4.1(a) shows the case where the unconstrained global minimum is located in the feasible region; 4.1(b) shows the case where the unconstrained global minimum is located inside the convex cone formed by the constraint normals; 4.1(c) shows the case where the unconstrained global minimum is located outside both the feasible region and the convex cone formed by the constraint normals.


Figure 4.1(a) shows the special case where the unconstrained optimum O is located in the feasible region, meaning $\vec{O}^*$ points at O. This is the simplest scenario. All one needs to do to compute $x^*$ is to add $\vec{G}$ to $\vec{S}$.

It is trivial to show that $x^*$ satisfies the sufficient K.K.T. condition by letting $\mu_i = 0$ (i = 1,…,m), since $x^*$ is where the unconstrained global optimum O is, so $\nabla f(x^*)$ is 0. Another way to look at it is that a global optimum is always a local optimum.

Figure 4.1(b) shows another special case, where all constraints have to be active to reach the minimum and the starting point S happens to be the optimal point $x^*$; in other words, there is a non-negative assignment of the $\mu_i$ (i = 1,…,m) that makes $\vec{G} = \sum_{i=1}^{m} \mu_i \vec{C}_i$, where $\vec{C}_i$ is the outward normal of the half-space defined by the ith inequality constraint $g_i(x) \le 0$, i.e., $g_i(x) = \vec{C}_i \cdot \vec{x} \le 0$, and $x^* = \vec{X}^* + \vec{S}$. Clearly $x^*$ is feasible because it is the origin and $\vec{C}_i \cdot \vec{X}^* = 0$. The proof that $x^*$ satisfies the sufficient K.K.T. condition is the same as the more general proof for Figure 4.1(c), using the fact that $x^* = S$ and $\vec{X}^* = 0$.

Figure 4.1(c) is the general case, in which the negative gradient $\vec{G}$ cannot be expressed as a linear combination of the $\vec{C}_i$ with a non-negative assignment of coefficients $\mu_i$ (i = 1,…,m). In this case, among all possible non-negative assignments of the $\mu_i$ giving $\vec{H} = \sum_{i=1}^{m} \mu_i \vec{C}_i$, one finds the specific one that gives the minimum distance between the tip of this $\vec{H}$ vector and O. That is, we seek $\vec{H}$ of this form minimizing the length of the vector $\vec{G} - \vec{H}$. We then set $x^* = S + \vec{G} - \vec{H}$.

First, we show that $x^*$ is feasible, i.e., $\vec{C}_i \cdot \vec{X}^* = g_i(x^*) \le 0$ (i = 1,…,m), where $\vec{X}^*$ denotes the vector from the origin (the starting point S) to $x^*$.

Since $x^*$ is located at $S + \vec{G} - \vec{H}$ and S is the origin, we get $\vec{X}^* = \vec{G} - \vec{H}$. Note that the $\mu_i$ are special in that they are non-negative and they give the minimum length of the vector $\vec{G} - \vec{H}$, which is the same as the length of $\vec{X}^*$.

We have

$\frac{\partial (\vec{X}^* \cdot \vec{X}^*)}{\partial \mu_i} = 0$ when $\mu_i > 0$   (32)

and

$\frac{\partial (\vec{X}^* \cdot \vec{X}^*)}{\partial \mu_i} \ge 0$ when $\mu_i = 0$ (i = 1,…,m).   (33)

Since

$\vec{X}^* = (\vec{G} - \vec{H}) = \left( \vec{G} - \sum_{i=1}^{m} \mu_i \vec{C}_i \right)$,   (34)

we have

$\frac{\partial (\vec{X}^* \cdot \vec{X}^*)}{\partial \mu_i} = 2 \, \vec{C}_i \cdot \left( \sum_{i=1}^{m} \mu_i \vec{C}_i - \vec{G} \right) = -2 \, \vec{C}_i \cdot \vec{X}^*.$   (35)

Substituting (35) into (32) and (33), we get

$\vec{C}_i \cdot \vec{X}^* = 0$ when $\mu_i > 0$   (36)

and

$\vec{C}_i \cdot \vec{X}^* \le 0$ when $\mu_i = 0$ (i = 1,…,m).   (37)

Combining (36) and (37), we get $g_i(x^*) = \vec{C}_i \cdot \vec{X}^* \le 0$ (i = 1,…,m), i.e., $x^*$ is feasible.

Next, we show that $\nabla f(x^*) + \sum_{i=1}^{m} \mu_i \nabla g_i(x^*) = 0$. The following has already been shown in (31):

$\nabla f(x^*) = \vec{X}^* - \vec{G} = -\vec{H}$   (31)

and

$\sum_{i=1}^{m} \mu_i \nabla g_i(x^*) = \sum_{i=1}^{m} \mu_i \nabla (\vec{C}_i \cdot \vec{X}^*) = \sum_{i=1}^{m} \mu_i \vec{C}_i = \vec{H}$.   (38)

Thus, we have

$\nabla f(x^*) + \sum_{i=1}^{m} \mu_i \nabla g_i(x^*) = 0$.   (39)

Last, we show that $\mu_i \, g_i(x^*) = 0$ (i = 1,…,m). Geometrically, since $\vec{X}^* = \vec{G} - \vec{H}$ has the minimum size, $\vec{X}^*$ must be orthogonal to $\vec{H}$, i.e., $\vec{H} \cdot \vec{X}^* = 0$. Substituting $\vec{H}$ with $\sum_{i=1}^{m} \mu_i \vec{C}_i$, we have

$\left( \sum_{i=1}^{m} \mu_i \vec{C}_i \right) \cdot \vec{X}^* = \sum_{i=1}^{m} \mu_i (\vec{C}_i \cdot \vec{X}^*) = \sum_{i=1}^{m} \mu_i \, g_i(x^*) = 0,$   (40)

and given that $g_i(x^*) \le 0$ and $\mu_i \ge 0$ (i = 1,…,m), (40) holds only when $\mu_i \, g_i(x^*) = 0$ for all i from 1 to m.

This completes the proof that $x^*$ satisfies the sufficient K.K.T. condition, and verifies that $x^*$ is the minimum of the objective function under the monotonicity assumptions.

I could have used the $x^*$ described above to find the probability assignment of every data point in the table that gives the minimum sum of squared error, but in the present program I actually use the $\mu_i$ to find the active constraints that the optimal solution should satisfy and compute $x^*$ based on that, by combining together all instances in the groups under equality constraints and assigning the corresponding p values to be the total positive instances observed divided by the total instances in the group. This procedure is more accurate than computing from $S + \vec{G} - \vec{H}$. This way of computing $x^*$ generates the same $x^*$ as before, since it gives the minimum SSE assignments of p for each group of active constraints.

4.2.6 Maximum likelihood vs. minimum SSE.

Theorem 1. In this problem of finding estimated probabilities under monotonicity constraints, a value of $\vec{p}$ that minimizes SSE will also maximize the likelihood L subject to the same monotonicity assumption.⁴

While this is well known when there are no constraints, it is not true under arbitrary constraints; what I show is that it also holds under the particular constraints used here.

As defined earlier in (22), the likelihood function L is defined as follows for binary problems:

$L(\vec{P}) = \prod_{i=1}^{n} p_i^{c_i} (1 - p_i)^{1 - c_i} \, P(\vec{X}_i = \vec{x}_i) \propto \prod_{i=1}^{n} p_i^{c_i} (1 - p_i)^{1 - c_i}$.   (22)

Since the present problem only involves assigning the $p_i$, given $\vec{X}_i = \vec{x}_i$, I will only use the $\prod_{i=1}^{n} p_i^{c_i} (1 - p_i)^{1 - c_i}$ part from now on for $L(\vec{P})$.

Adopting the convention that $0^0 = 1$, each factor of $L(\vec{P})$ can be rewritten as:

$L_i(p_i, c_i) = p_i^{c_i} (1 - p_i)^{1 - c_i}$  $(0 \le p_i \le 1)$.

Before we prove that the solution from minimizing SSE also maximizes $L(\vec{P})$, I prove the following lemma.

Lemma 1. Suppose we wish to maximize a function having the form

$F(\vec{P}) = F(p_1, \ldots, p_m, p_{m+1}, \ldots, p_n) = F_1(p_1, \ldots, p_m) \cdot F_2(p_{m+1}, \ldots, p_n)$,

where $\vec{P}$ is partitioned into two groups, $p_1, \ldots, p_m$ and $p_{m+1}, \ldots, p_n$, subject to a given set of constraints. If an assignment of $\vec{P}$, namely $(p_1 = p_1^*, \ldots, p_m = p_m^*, p_{m+1} = p_{m+1}^*, \ldots, p_n = p_n^*)$, maximizes $F_1$ and $F_2$ under the same constraints, then $\vec{P}^*$ maximizes $F(\vec{P})$ under the same constraints.

This lemma is readily proved by contradiction. If there is another assignment $\vec{P}'$ under the same constraints that gives $F(\vec{P}') > F(\vec{P}^*)$, then give the same assignments of $\vec{P}'$ to $F_1$ and $F_2$. Then at least one of the following has to be true: $F_1(\vec{P}') > F_1(\vec{P}^*)$ or $F_2(\vec{P}') > F_2(\vec{P}^*)$. But if either one of them is true, the assumption that $\vec{P}^*$ maximizes $F_1$ and $F_2$ under those constraints is violated.

4 Since this is to prove that the same solution minimizing SSE will also maximize L, and I use this in the special probability estimation problem presented here, I once again switch to $\vec{p}$ and P to denote the vector variable and the vector space, respectively, instead of using x and X as earlier.

With Lemma 1 proved, I can prove Theorem 1. Based on the values of $p_i$ from the minimum SSE solution $\vec{P}^*$, I break equation (22) into two parts after moving the factors with $p_i = 0$ or $p_i = 1$ to the end of the equation:

$L(\vec{P}) = L_1(\vec{P}) \cdot L_2(\vec{P})$, where

$L_1(\vec{P}) = \prod_{i=n'+1}^{n} p_i^{c_i} (1 - p_i)^{1 - c_i}$ where $p_i = 0$ or $p_i = 1$,

and

$L_2(\vec{P}) = \prod_{i=1}^{n'} p_i^{c_i} (1 - p_i)^{1 - c_i}$ where $0 < p_i < 1$.

Clearly, since $\vec{P}^*$ is the minimum SSE solution under the same constraint set as the maximum-likelihood problem, if I can show that $\vec{P}^*$ maximizes both $L_1(\vec{P})$ and $L_2(\vec{P})$, I will have proved Theorem 1 based on Lemma 1.

Showing that $\vec{P}^*$ maximizes $L_1(\vec{P})$ is easy. Notice that in $L_1(\vec{P})$, $p_i$ can only be 1 or 0, and the only way $p_i$ can be 1 in the minimum SSE solution is when $c_i$ has the same value, namely 1; the same is true when $p_i$ is 0. Substituting $p_i$ and $c_i$ with either both 0 or both 1 gives an $L_1(\vec{P})$ value of 1, which is already the unconstrained maximum, and obviously it is the constrained maximum also.

I will show in the remainder of this section that $\vec{P}^*$ also maximizes $L_2(\vec{P})$ under the same constraint set.

Since all the $p_i$'s in $L_2(\vec{P})$ have values strictly larger than 0 and less than 1, a negative-log-likelihood function may be defined as:

$G(\vec{P}) = -\log L_2 = -\sum_{i=1}^{n'} c_i \log p_i - \sum_{i=1}^{n'} (1 - c_i) \log(1 - p_i)$.   (41)

Since for $0 < x, y \le 1$ we have $-\tfrac{1}{2}(\log x + \log y) \ge -\log\!\left(\tfrac{x + y}{2}\right)$, $-\log p_i$ is a convex function, as is $-\log(1 - p_i)$. Since $G(\vec{P})$ is a weighted sum of convex functions with positive weights, it is a convex function, too.

Taking the derivative of G with respect to $p_i$, we get:

$\frac{\partial G}{\partial p_i} = -c_i \cdot \frac{1}{p_i} + (1 - c_i) \cdot \frac{1}{1 - p_i} = \frac{p_i - c_i}{p_i (1 - p_i)}$.   (42)

Notice that the derivative of the SSE function

$F = \sum_{i=1}^{n} (c_i - p_i)^2$   (43)

is

$\frac{\partial F}{\partial p_i} = 2 (p_i - c_i)$.   (44)

Comparing (44) with (42), (42) can be rewritten as follows, with $\vec{P}$ and $\vec{C}$ representing the vectors $(p_1, \ldots, p_n)$ and $(c_1, \ldots, c_n)$, respectively, and $\mathrm{diag}\!\left(\frac{1}{2 p_i (p_i - 1)}\right)$ denoting the diagonal matrix with these entries:

$\frac{\partial G}{\partial \vec{P}} = -\,\mathrm{diag}\!\left(\frac{1}{2 p_i (p_i - 1)}\right) \left( 2 (\vec{P} - \vec{C}) \right) = -\,\mathrm{diag}\!\left(\frac{1}{2 p_i (p_i - 1)}\right) \frac{\partial F}{\partial \vec{P}}$.   (45)

If we can find $\mu_{G\_l}$ and a specific $\vec{P}^*$ satisfying the following K.K.T. conditions for G:

$\nabla G(\vec{P}^*) + \sum_{l=1}^{m} \mu_{G\_l} \nabla g_l(\vec{P}^*) = 0$   (46)

and

$\mu_{G\_l} \, g_l(\vec{P}^*) = 0$,   (47)

then we know that $\vec{P}^*$ is the optimal solution corresponding to the minimum of the negative-log-likelihood function $G(\vec{P}^*)$, and hence to the maximum of the likelihood function $L_2(\vec{P}^*)$.

Now, we will show that the $\vec{P}^*$ obtained by solving the minimum SSE problem with the same set of constraints $g(\vec{P})$, along with the $\mu_G$ constructed from the $\mu_F$ corresponding to the minimum SSE solution, does satisfy the K.K.T. conditions (46) and (47).

First, since $\vec{P}^*$ is the solution minimizing the SSE function, based on the K.K.T. necessary condition there is a corresponding $\mu_F$ satisfying

$\nabla F(\vec{P}^*) + \sum_{l=1}^{m} \mu_{F\_l} \nabla g_l(\vec{P}^*) = 0$   (48)

with

$\mu_{F\_l} \, g_l(\vec{P}^*) = 0$.   (49)

From (49), we have either

$\mu_{F\_l} = 0$

or

$g_l(\vec{P}^*) = 0$,

this latter case meaning that the inequality constraint is actually met with equality. Without loss of generality, let the lth constraint function $g_l$ be $g_l(\vec{P}) = p_{a(l)} - p_{b(l)}$, where a(l) and b(l) correspond to the indices of the variables p appearing in the constraint according to the given partial order. Since $g_l(\vec{P}^*) = 0$, we let $p_{a(l)} = p_{b(l)} = p_l$. Now we can construct

$\mu_{G\_l} = -\frac{1}{2 p_l (p_l - 1)} \, \mu_{F\_l}$  (l = 1,…,m).   (50)

Since F and G have the same constraint set, $\nabla g_{G\_l}(\vec{P}^*) = \nabla g_{F\_l}(\vec{P}^*)$, and, component by component,

$\left[\nabla G(\vec{P}^*) + \sum_{l=1}^{m} \mu_{G\_l} \nabla g_{G\_l}(\vec{P}^*)\right]_i = -\frac{1}{2 p_i (p_i - 1)} \frac{\partial F}{\partial p_i} - \sum_{l=1}^{m} \frac{\mu_{F\_l}}{2 p_l (p_l - 1)} \left[\nabla g_{G\_l}(\vec{P}^*)\right]_i$.

As mentioned earlier, $p_l$ equals the two p's appearing in $g_{F\_l}$ wherever $\mu_{F\_l} \ne 0$, so the above equation can be rewritten as:

$\left[\nabla G(\vec{P}^*) + \sum_{l=1}^{m} \mu_{G\_l} \nabla g_{G\_l}(\vec{P}^*)\right]_i = -\frac{1}{2 p_i (p_i - 1)} \left(\frac{\partial F}{\partial p_i} + \sum_{l=1}^{m} \mu_{F\_l} \left[\nabla g_{F\_l}(\vec{P}^*)\right]_i\right)$  (i = 1,…,n').   (51)

Since $\vec{P}^*$ and the corresponding $p_i$ are the assignments giving the minimum of the SSE function F, from (39) we have

$\frac{\partial F}{\partial p_i} + \sum_{l=1}^{m} \mu_{F\_l} \left[\nabla g_{F\_l}(\vec{P}^*)\right]_i = 0$.   (52)

Substituting (52) into (51), we have

$\nabla G(\vec{P}^*) + \sum_{l=1}^{m} \mu_{G\_l} \nabla g_{G\_l}(\vec{P}^*) = 0$.   (53)

Since the functions G and F share the same constraint set $g(\vec{P})$,

$\mu_{G\_l} \, g_l(\vec{P}^*) = -\frac{1}{2 p_l (p_l - 1)} \, \mu_{F\_l} \, g_l(\vec{P}^*) = 0$.   (54)

Equations (53) and (54) show that the $\vec{P}^*$ minimizing the SSE function F also minimizes the negative-log-likelihood function G and therefore maximizes the likelihood function $L_2(\vec{P})$.

r * r Notice the subtlety that the constraint set we used to show P maximized L2 (P) is actually a subset of the original constraint set, because some of the constraints in the original problem do not play roles in r maximizing L2 (P) since they involve some variables of pi with value of 1 or 0, which are not in

r r * r L2 (P) at all. This is not an issue, because we show P already maximized L2 (P) in a less constrained r manner, L2 (P) cannot get a more optimized value with a tighter constraint. On the other hand, since r P* is actually a solution from minimum SSE under the original constraints, it will not violate any r constraints in the original constraint set but not in the one we used to optimize L2 (P) .

Combining all the results above with Lemma 1 completes the proof of Theorem 1.

By Theorem 1, the solution obtained by minimizing the sum of squared errors with the POOL algorithm is the same as the maximum likelihood solution.

4.3 Additional computational steps.

4.3.1 Preprocessing.

In general, $C_{n \times m}$ is given by specific problem requirements. In the present study it is derived from the THEMATICS principles. Since the data in this problem are sparse and n is large, instead of writing out all the constraints, we introduce the idea of immediate successor and immediate predecessor:

Definition 4: Immediate successor.

y is an immediate successor of x if and only if $x \prec y$ and there is no z such that $x \prec z \prec y$.

There is a trivial linear-time algorithm that, given the monotonicity assumptions, can find all the immediate successors of a particular instance with particular attribute values in a single scan of all the instances and their attributes. With the immediate successors of each instance known, the transitive closure contains the whole monotonicity assumption. We use this fact to build $C_{n \times m}$ in quadratic time. For storage efficiency, we could even store $C_{n \times m}$ in a single $m \times 2$ array, if all scaling factors are 1, by just storing the indices of the instances with 1 and -1 as their coefficients, respectively; in the present case, with a different scaling factor for each cell, we use three $m \times 2$ arrays, one storing the indices and the other two storing the scaling factors.
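As an illustration of this preprocessing step, the following Python sketch builds the immediate-successor pairs under a coordinate-wise partial order and collects one inequality constraint per pair. The function names (immediate_successors, build_constraint_indices) and the naive inner minimality check are my own choices for clarity; the single-scan linear-time variant described above would avoid that inner loop.

```python
import numpy as np

def immediate_successors(i, X):
    """Indices j such that X[j] is an immediate successor of X[i] under the
    coordinate-wise partial order: X[i] <= X[j] componentwise, X[i] != X[j],
    and no third instance lies strictly between them."""
    n = len(X)
    succ = [j for j in range(n)
            if j != i and np.all(X[j] >= X[i]) and np.any(X[j] > X[i])]
    return [j for j in succ
            if not any(k != j and np.all(X[j] >= X[k]) and np.any(X[j] > X[k])
                       for k in succ)]

def build_constraint_indices(X):
    """One inequality constraint P(i) <= P(j) per immediate-successor pair,
    stored as an m x 2 index array as described in the text."""
    pairs = [(i, j) for i in range(len(X)) for j in immediate_successors(i, X)]
    return np.array(pairs, dtype=int).reshape(-1, 2)

# toy example with four instances in a 2-D attribute space
X = np.array([[0.1, 0.2], [0.3, 0.2], [0.1, 0.5], [0.4, 0.6]])
print(build_constraint_indices(X))
```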

4.3.2 Interpolation.

If the training set does not have an instance with a specific attribute value, an interpolation scheme is

needed to estimate the class probability with this attribute value. There are some required properties this

interpolation scheme should have. One is that all the interpolated class probabilities should conform to the

monotonicity assumption over the whole virtual table. In addition to that, there are also some desirable

properties. For example, whenever possible, the interpolation should reflect the monotonicity assumption strictly. In other words, where the original monotonicity assumption specifies that P(C=c|X=x) ≤ P(C=c|Y=y) when x ≤ y, if x is strictly less than y we would prefer the interpolated values to satisfy the strict inequality P(C=c|X=x) < P(C=c|Y=y), rather than merely conforming to the non-strict monotonicity assumption as a whole. Another desirable property is that, whenever possible, P(C=c|X) be continuous over X.

After testing some interpolation schemes in the present application, we found that a linear interpolation between the maximum and minimum allowed P, based on the Manhattan distances from instance X to its limiting predecessor and successor, gives good results in practice.
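A minimal sketch of such an interpolation, under the assumption that the limiting predecessor and successor and their probability estimates are already known; the helper name and the exact form of the weighting are illustrative, not the exact implementation used in this work.

```python
import numpy as np

def interpolate_probability(z, pred, succ, p_pred, p_succ):
    """Linearly interpolate a class probability for an unseen point z lying
    between a limiting predecessor `pred` (probability p_pred) and a limiting
    successor `succ` (probability p_succ) in the partial order.
    The weight is based on Manhattan (L1) distances, so the result stays
    within [p_pred, p_succ] and respects the monotonicity assumption."""
    d_pred = np.sum(np.abs(np.asarray(z) - np.asarray(pred)))
    d_succ = np.sum(np.abs(np.asarray(succ) - np.asarray(z)))
    if d_pred + d_succ == 0:          # z coincides with both bounds
        return 0.5 * (p_pred + p_succ)
    w = d_pred / (d_pred + d_succ)    # closer to succ -> larger weight
    return (1 - w) * p_pred + w * p_succ

# toy example: interpolate between bounds at (0.2, 0.3) and (0.6, 0.7)
print(interpolate_probability([0.3, 0.4], [0.2, 0.3], [0.6, 0.7], 0.1, 0.5))
```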

The result of applying this new POOL method to the protein active site prediction problem will be given

in the next chapter.

Chapter 5

Applying the POOL Method with THEMATICS in Protein Active Site Prediction.

5.1 Introduction.

In Chapter 3, I reported the use of an SVM with THEMATICS to predict protein active sites based on protein 3D structure alone and introduced one way to expand the original THEMATICS method to include non-ionizable residues in the prediction. Although the SVM method outperforms all prior structure-based methods, including other approaches using THEMATICS, and achieves similar or slightly better performance than methods using both structural and sequence comparison information, there is still room for further improvement of the method, as we briefly discussed in section 3.7.

One straightforward improvement is the addition of more information about each residue to the learning system. In the study reported in this chapter, in addition to THEMATICS features, I add other features, such as the size of the cleft in which a surface residue resides and the conservation of the residue among proteins of similar sequence, and examine how helpful they are in improving the sensitivity and the specificity of the prediction.

Another improvement comes from changing a classification problem into a ranking problem. In a

classification problem, the results are binary labels of either positive or negative, with nothing in between.

One of the disadvantages of this approach is that it is less convenient, or in some cases even impossible, for the users to fine-tune their results if they want to improve the sensitivity at the cost of lowering the specificity, or vice versa. In this study, the result is a ranked list of all residues in a protein based on their likelihood of being in active sites. Users can choose a cut-off best suited to their needs in different situations. One previous method, called PCats 21, generates such a rank-ordered list of probabilities.

Because my method actually estimates probabilities, I can easily combine probability estimates from different methods using the chaining method I introduced in Section 4.1.5. This overcomes one of the common hurdles encountered when including more features in the system, as described in the paragraph above. This study shows that the combined probability estimates with additional features work better than a single probability estimate.

In the SVM approach, there is a 9 Å cut-off used to form SVM-positive residues into clusters as active site residue candidates. This threshold seems to give good predictions, but it is arbitrary and could be optimized in a more systematic way. In this study, I eliminated this step and used features that capture the degree of perturbation of the titration curves of nearby residues. This approach is more systematic and is optimized over the whole process.

By combining THEMATICS with the power of the POOL method, which enforces the monotonicity hypothesis in the learning system, I achieve substantially higher sensitivity and specificity simultaneously than the SVM method, which I had already shown in Chapter 3 to be the best among the 3D-structure-based methods. Performance can be improved further with other 3D-structure-based features, including the

size of the cleft in which surface residues reside. Performance can also be improved using sequence

conservation scores for individual residues, obtained from a sequence alignment of proteins of similar

sequence, provided there are enough such proteins. Note that this latter enhancement turns a purely 3D

structure based method into a sequence and structure based method.

A set of 64 different proteins from the CatRes (CSA) database is used to compare the performance of the

different methods for functional site prediction. A more complete selection of 160 proteins from the

CatRes (CSA) database is used to further confirm the advantage gained by adding extra features beyond THEMATICS to form an improved prediction system. In this chapter, I

also improve the way I extend the method to predict non-ionizable active site residues. In addition, we use

ROC curves to compare the performance between different methods and RFR (Recall-Filtration Ratio) curves to guide potential users in setting the actual cut-off in practice.

5.2 THEMATICS curves and other features used in the POOL method

In the work presented in Chapter 3, I used moments of the first derivative curves of the titration curves.

These were defined analogously to the moments of density functions, as these first derivative curves are

essentially probability distribution functions 53. One aspect of these prior approaches such as 24, 45, 54 is the use of spatial clustering as a way of reducing the number of apparent false positives. That is, residues are reported as positive by the method if and only if they are in sufficiently close spatial proximity to at least one other residue identified as a candidate by the outlier detector in Ko's and Wei's approach or the SVM in Tong's approach. The overall identification process in these prior approaches thus involves two stages, where the first stage makes a binary yes/no decision on each residue. In this new approach I do not begin with such a binary decision because it is my goal to assign to every (ionizable) residue a probability that it is an active-site residue. Thus, as an alternative to this clustering approach, I instead consider what I call environment features. For a given scalar feature x, I define the value of the environment feature x_env(r) for a given residue r to be

$$x_{env}(r) = \frac{\sum_{r' \neq r} w(r')\, x(r')}{\sum_{r' \neq r} w(r')} \qquad (55)$$

where r' is an ionizable residue whose distance d(r', r) to residue r is less than 9 Å, and the weight w(r') is given by 1/d(r', r)².
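As a concrete illustration of equation (55), the sketch below computes the environment feature of one raw scalar feature (e.g. µ3) for a single residue. Representing each residue by a single coordinate and the fallback value when no neighbor lies within the cut-off are assumptions made for this sketch only.

```python
import numpy as np

def environment_feature(i, coords, values, radius=9.0):
    """x_env for residue i, per equation (55): the 1/d^2-weighted average of the
    raw feature values of all other ionizable residues within `radius` angstroms.
    Each residue is represented here by a single representative coordinate."""
    num, den = 0.0, 0.0
    for j, (c, x) in enumerate(zip(coords, values)):
        if j == i:
            continue
        d = float(np.linalg.norm(np.asarray(c) - np.asarray(coords[i])))
        if 0.0 < d < radius:
            w = 1.0 / d ** 2
            num += w * x
            den += w
    return num / den if den > 0 else 0.0  # fallback when no neighbor (assumption)

# toy example: three residues with mu3 values 1.0, 2.0, 4.0
coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
mu3 = [1.0, 2.0, 4.0]
print(environment_feature(0, coords, mu3))
```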

In this study, I use the same features µ3 and µ4 used in the Ko approach, along with the additional features µ3^env and µ4^env, as an alternative to the clustering stage. Thus every ionizable residue in any protein is assigned the 4-dimensional feature vector (µ3, µ4, µ3^env, µ4^env), which is the THEMATICS feature vector for ionizable residues.

Although µ3 and µ4 themselves are only defined for ionizable residues, the environment features µ3^env and µ4^env are well-defined for non-ionizable residues. For non-ionizable residues, the THEMATICS features I use are the 2-dimensional feature vectors (µ3^env, µ4^env).

There is one additional subtlety that all THEMATICS-based methods have had to address, and the current

approach is no exception: the need for some kind of normalization across proteins. In Ko’s and Wei’s

approach, the raw features are individually transformed into Z-scores by subtracting the within-protein

mean and dividing by the within-protein standard deviation. Similarly, in my SVM approach, the raw

features are likewise transformed into robust Z-scores by subtracting the within-protein median and

dividing by the within-protein interquartile distance. Here, I apply yet another within-protein feature

transformation to each feature, which I call rank normalization. Within each protein, each feature value is

ranked from lowest to highest in that protein, and each data point is then assigned a number uniformly

across the interval [0,1] based on the rank of that feature in that protein. The highest value for that feature

is thus transformed to 1, and the lowest value is transformed to 0. Note that unlike the use of Z-scores or

robust Z-scores, this is a nonlinear transformation of the raw feature values. For each scalar feature x,

denote its within-protein rank-normalized value as $\tilde{x}$, which by definition lies in [0,1]. I extend the use of this notation to feature vectors in the obvious way; that is, $\tilde{x} = (\tilde{x}_1, \tilde{x}_2, \tilde{x}_3, \tilde{x}_4)$.
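A minimal sketch of the within-protein rank normalization described above; how ties are handled is not specified in the text, so breaking them by position in the sort is an assumption of this sketch.

```python
import numpy as np

def rank_normalize(values):
    """Within-protein rank normalization: the lowest value maps to 0, the highest
    to 1, and intermediate values are spaced uniformly over [0, 1] by rank.
    Ties are broken by position in the sort (an assumption)."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values)             # indices from lowest to highest
    ranks = np.empty_like(order)
    ranks[order] = np.arange(len(values))  # rank of each original entry
    return ranks / (len(values) - 1.0)     # uniform spacing over [0, 1]

# toy example: five raw mu3 values for one protein
print(rank_normalize([0.7, -1.2, 3.4, 0.0, 2.1]))
# -> [0.5  0.   1.   0.25 0.75]
```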

Note that the use of within-protein rank normalization does not affect the within-protein partial order used in the THEMATICS Principles, which I introduced in Section 2.3.3. That is, $x \preceq y$ holds for raw feature vectors x and y in the same protein if and only if $\tilde{x} \preceq \tilde{y}$. However, when I combine data from multiple

proteins for training and use the results to make predictions for new proteins, as I describe in more detail

later, this actually amounts to making an even stronger monotonicity assumption across proteins in which

the within-protein rankings replace the raw feature values. This is obviously a more controversial

assumption, but some such approach is required to be able to train on multiple proteins and make

predictions for novel proteins, and, as I show below, this approach appears to give good results.

As discussed in Chapter 2, in addition to THEMATICS, there are other methods for predicting active site

residues. They use features such as geometric position of residues, amino acid type information and

sequence conservation, which are very different from the electrostatic information I use in THEMATICS.

It is reasonable to speculate that if I combine these features with THEMATICS features, I may get better

performance. I tested this hypothesis in this study.

In addition to THEMATICS features, I try the cleft feature, which is a number I assigned for every

residue in a given protein based on the rank of the size of the cleft to which the residue belongs. One

special value is assigned to every residue not on the protein surface, and another is assigned to every

residue on the surface but not within any cleft. Ignoring these special values, it is easy to construct the

monotonicity assumption that the larger the cleft to which a residue belongs, the more likely that residue

is to belong to the active site. I can apply POOL on the cleft feature based on this monotonicity

assumption.

ConSurf 27 is a sequence comparison based method that identifies functionally important regions on the surface of a protein of known three-dimensional (3D) structure, based on the phylogenetic relations between its close sequence homologues. If there are more than five homologues (the method is considered reliable if the number of homologues is greater than 10) to the query protein, it can assign a score between 1 and 9 to each residue in the query sequence based on how conserved this residue is among those homologues. The larger the score is, the more conserved the residues are. With some exceptions discussed in Chapter 2, it is commonly believed that the more conserved a residue is, the more likely it is functionally important. This gives me another monotonicity assumption to which I can apply

POOL. I call this the ConSurf feature. In this study, if the protein has more than 10 homologues, I used the scores ConSurf assigns to each residue as their ConSurf feature values. For the proteins with 10 or fewer homologues, I assign 0 as the ConSurf feature values for all their residues. Since I am only interested in the rank list of residues within a protein, rather than across proteins, this special treatment will not affect the final results.

In addition to these features, I also tried features such as residue type and ASA (area of solvent accessibility) of residues in our study but found that including these did not improve the performance. No further details will be given here about those features that did not improve overall performance.

5.3 Performance measurement.

Before presenting the results, I must first decide how to measure the performance of the system. For a standard classification problem, performance is typically measured by recall, false-positive rate and the Matthews correlation coefficient (MCC). Within a specific system, recall and false-positive rate usually affect each other: lowering the false-positive rate will most likely lower recall at the same time, so it only makes sense to report both metrics together. Although MCC is a single metric that measures the overall performance of a classification system, it only measures the performance at one specific setting. If I want to measure the performance of my system at different thresholds, ROC (receiver operating characteristic) curves, which plot recall against the false-positive rate, are the answer. One can compare two systems by comparing their ROC curves. In general, one can say a system giving a higher recall and a lower false-positive rate at the same time outperforms a system giving a lower recall and a higher false-positive rate at that specific setting. If the ROC curve from system A always lies to the upper-left of the ROC curve from system B, one can conclude that system A dominates system B and always outperforms it. Studies have also shown that the area under the ROC curve (AUC) is a very reliable single-value assessment for the evaluation of different machine learning algorithms 70.

In order to generate ROC curves, I need to be able to calculate recall and false-positive rate values, which

come from classification problems. In the POOL system, the result for each protein is a ranked list based

on the probability of a residue being in the active site. A natural way to draw a ROC curve for every

protein is to move the cutoff one residue at a time from the top to the end of the list. The resulting ROC

curve has a staircase shape: only recall increases when an active site residue is encountered and only the false

positive rate increases when a non-active-site residue is encountered.

We define the average specificity (AveS) for each protein in the set:

$$\mathrm{AveS} = \frac{\sum_{r=1}^{N} S(r) \cdot pos(r)}{\text{Number of positive examples}} \qquad (56)$$

where r is the rank, N is the number of residues in the protein, pos(r) is a binary function that indicates whether the residue at rank r is annotated as an active site residue in the reference database (pos(r) = 1) or not (pos(r) = 0), and S(r) is the specificity at cut-off rank r.

It is not hard to see that AveS represents the area under the ROC curve (AUC). This is analogous to AveP, the area under the Recall-Precision curve, used in the information retrieval field. Unlike MCC, AveS is a

single-number measurement of the performance of a classification system over the whole range of

different cutoff settings, rather than at a single setting.
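The following sketch computes AveS for one protein directly from equation (56), given a ranked list of binary labels (1 = annotated active site residue). Taking S(r) to be the fraction of negatives falling below the cut-off rank is my reading of the definition; with this reading AveS equals the AUC, as stated above.

```python
import numpy as np

def average_specificity(labels_by_rank):
    """AveS, equation (56): labels_by_rank holds the 0/1 labels of a protein's
    residues ordered from highest to lowest predicted probability. S(r) is the
    specificity when the list is cut after rank r, i.e. the fraction of
    negatives that fall below the cut-off."""
    labels = np.asarray(labels_by_rank)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        return float('nan')                       # AveS undefined in this case
    total = 0.0
    for r, is_pos in enumerate(labels, start=1):  # r is the 1-based rank
        if is_pos:
            negatives_below = n_neg - np.sum(labels[:r] == 0)
            total += negatives_below / n_neg      # S(r) at cut-off rank r
    return total / n_pos

# toy ranked list: 10 residues, positives at ranks 1, 3 and 6
print(average_specificity([1, 0, 1, 0, 0, 1, 0, 0, 0, 0]))   # ~0.81
```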

Since the AveS is a measurement on a ROC curve for predicting active site residues from only one protein, I need a measurement for the performance on a set of proteins. For this, I use Mean Average

Specificity (MAS), which is the mean of AveS of all the proteins in the set. For all methods that generate a

ranked list as in this study, including the POOL method, and one SVM method, I report the corresponding

Mean Average Specificity (MAS) from all the proteins in the test set. As in all statistical analysis, a

difference between the means does not mean much without further analysis of the statistical significance of the observed difference. In order to test the significance of the observed difference, I

perform the Wilcoxon signed-rank test 71 on AveS from different methods to estimate the probability of

observing such a difference under the null hypothesis that the observed better-performing method is

actually not better than the other.
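As an illustration, per-protein AveS values from two methods could be compared with a paired, one-sided Wilcoxon signed-rank test as follows. The AveS arrays are made up for the example; the one-sided alternative in scipy matches the null hypothesis described above.

```python
import numpy as np
from scipy.stats import wilcoxon

# per-protein AveS values for two methods (made-up numbers for illustration)
aves_method_a = np.array([0.95, 0.91, 0.88, 0.97, 0.93, 0.90, 0.85, 0.96])
aves_method_b = np.array([0.93, 0.90, 0.89, 0.95, 0.91, 0.88, 0.84, 0.94])

# one-sided test of the null hypothesis that method A is not better than B;
# a small p-value is evidence that A outperforms B on a per-protein basis
stat, p_value = wilcoxon(aves_method_a, aves_method_b, alternative='greater')
print(f"Wilcoxon statistic = {stat}, one-sided p-value = {p_value:.4f}")

# number of proteins on which A strictly outperforms B, as reported in the tables
print("A better on", int(np.sum(aves_method_a > aves_method_b)),
      "of", len(aves_method_a), "proteins")
```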

To visually compare the performances from different methods, I generate the averaged ROC curve for

each POOL method by computing the recall and false-positive rate after truncating the list after each of

the positive residues in turn, followed by linearly interpolating the value at each recall value and

computing the mean of the interpolated false-positive rate value from all proteins in the dataset.
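The averaging scheme just described can be sketched as follows: for each protein, the (recall, false-positive-rate) points obtained by truncating after each positive are linearly interpolated onto a common recall grid, and the interpolated false-positive rates are then averaged across proteins. The grid resolution and the use of numpy.interp are my own choices for illustration.

```python
import numpy as np

def roc_points(labels_by_rank):
    """(recall, FPR) pairs after truncating the ranked 0/1 list at each positive."""
    labels = np.asarray(labels_by_rank)
    n_pos, n_neg = labels.sum(), len(labels) - labels.sum()
    pts = [(0.0, 0.0)]
    tp = fp = 0
    for is_pos in labels:
        if is_pos:
            tp += 1
            pts.append((tp / n_pos, fp / n_neg))
        else:
            fp += 1
    return np.array(pts)

def averaged_roc(per_protein_labels, grid=np.linspace(0.0, 1.0, 101)):
    """Mean interpolated false-positive rate at each recall value on the grid."""
    fprs = []
    for labels in per_protein_labels:
        pts = roc_points(labels)
        fprs.append(np.interp(grid, pts[:, 0], pts[:, 1]))  # FPR as a function of recall
    return grid, np.mean(fprs, axis=0)

# two toy proteins
recall_grid, mean_fpr = averaged_roc([[1, 0, 1, 0, 0], [0, 1, 0, 0, 1, 0]])
print(mean_fpr[::25])   # mean FPR at recall 0, 0.25, 0.5, 0.75, 1.0
```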

Although ROC curves and their associated AveS values are good ways to compare performance between

different methods, they do not directly provide a guide for the user to select the cut-off values, because

neither recall nor false-positive rate is known to users unless they happen to know the true positives of

their proteins up front. I use another plot that I call the RFR curve; this is a plot of recall against filtration

ratio. Its purpose is to provide a guide for the user to select their cut-offs. They are almost the same as

the ROC curves except that filtration ratios are used in place of false-positive rates.

Since for every protein in the dataset POOL generates a ranked list of residues based on their probabilities of being in the active site, and from this list one generates a corresponding ROC curve and a corresponding RFR curve, I need to average these curves into a single ROC curve and a single RFR curve for the whole dataset for comparison purposes. Since it is more natural to ask for any given method what its expected false-positive rate is for given values of recall, this is what I use for the averaged ROC curve.

Another important fact about ROC curves is that there need be no prior commitment to how specific classifiers are created. They express the tradeoff no matter how the classifiers are parameterized. On the other hand, for the user who wants to use a fixed-proportion cutoff scheme, I provide averaged RFR

(recall-filtration ratio) curves; these curves give the expected recall for given filtration ratio values.

5.4 Computational procedure.

The three-dimensional coordinate files for the protein structures were downloaded from the Protein Data

Bank (http://www.rcsb.org/pdb/). In order to predict the theoretical titration curve of each ionizable

residue in the structure, finite-difference Poisson-Boltzmann calculations were performed using UHBD 72

on each protein followed by the program HYBRID 73, which calculates average net charge as a function of pH. These titration curves were obtained for each ionizable residue: Arg, Asp, Cys, Glu, His, Lys, Tyr, and the N- and C- termini. The pH range we simulated for all curves is from -15.0 to 30.0, in increments of 0.2 pH units. This wide theoretical pH range is necessary for proper numerical integration of the first derivative functions. The structures were processed and analyzed to obtain the central moments, as described in Chapters 2 and 3.

These individual features, the central moments µ3 and µ4, were then rank-normalized within each protein,

and thus assigned values in the interval [0,1], as described earlier. This four-dimensional representation of

each curve was used for training and for testing. The results given in the remaining sections were based

on eight-fold cross-validation on a set of 64 proteins or 10-fold cross-validation on a set of 160 proteins,

both taken from the Catalytic Site Atlas (CSA) database 57, 74. The labels were taken directly from the

CSA database; if a residue is identified there as active in catalysis, it was labeled as positive in my dataset.

If not so identified in the CSA, we labeled it as negative. The CSA annotations, although incomplete,

constitute the best source of active residue labels for enzymes. In anticipation that the POOL method

would not be overly sensitive to mislabeled data, I performed no hand tuning of the labels and omitted no

residues during training, in contrast to the SVM work reported in Chapter 3.

For the eight-fold cross-validation procedure, I randomly divided the 64-protein set into eight folds of

eight proteins each, training on seven of the eight folds (56 proteins) and testing on the remaining fold (8

proteins). For the ten-fold cross-validation procedure, I randomly divided the 160-protein set into ten

folds of sixteen proteins each, training on nine of the ten folds (144 proteins) and testing on the remaining

fold (16 proteins). Training was performed by applying the POOL method to obtain a function $\hat{P}(1 \mid \tilde{x})$ for each rank-normalized feature vector $\tilde{x}$ in the appropriate feature space $[0,1]^k$ (where k = 4 for the POOL method applied to the four THEMATICS features of ionizable residues as stated earlier, denoted POOL(T4); k = 5 for the POOL method applied to the four THEMATICS features of ionizable residues plus the geometric feature of cleft size, denoted POOL(T4G); k = 1 for the POOL method applied just to the geometric feature of cleft size, denoted POOL(G); and k = 2 for POOL applied to non-ionizable residues, denoted POOL(T2)). An additional detail is that for training we quantized the multi-dimensional data points. For example, for POOL(T4), each rank-normalized feature fell into one of 20 bins whose sizes varied depending on their distance from 0.0. In particular, the lowest-ranked bins covered the half-open intervals [0.0, 0.2), [0.2, 0.4), [0.4, 0.6), [0.6, 0.7), and there were 16 more bins of width 0.02 above that, with one special bin for 1.0. Thus the lowest-ranking data were quantized more coarsely than the remaining data. This is appropriate since these data tend to have very low average probability of being in the active site anyway, because the vast majority of residues are negatives. Thus the inability to make fine distinctions among these low-probability candidates does not degrade the

overall quality of the results. It does, however, improve the efficiency of the training procedure

significantly, so this is an important component of the analysis. This is especially helpful in the 10-fold

cross-validation on the 160-protein set. The typical training set of 144 proteins contained about 14500

ionizable residues, which fell into more than 6000 quantized bins in the 4-dimensional space used for

POOL(T4). The corresponding number of inequality constraints was about 35,000-40,000.
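The bin layout described above can be written down explicitly. The sketch below reproduces my reading of the stated bin edges (coarse bins up to 0.7, 0.02-wide bins above, and a separate bin for exactly 1.0) and maps a rank-normalized value to its bin index.

```python
import numpy as np

# bin edges for one rank-normalized feature: coarse bins below 0.7,
# 0.02-wide bins from 0.7 up to 1.0, and a separate bin for exactly 1.0
EDGES = np.concatenate(([0.0, 0.2, 0.4, 0.6],
                        np.round(np.arange(0.7, 1.0001, 0.02), 2)))

def bin_index(v):
    """Map a rank-normalized value in [0, 1] to one of 20 bins: 19 half-open
    intervals defined by EDGES plus a special bin for exactly 1.0."""
    if v >= 1.0:
        return len(EDGES) - 1                        # special bin for 1.0 (index 19)
    return int(np.searchsorted(EDGES, v, side='right') - 1)

def quantize(feature_vector):
    """Quantize a whole rank-normalized feature vector, e.g. the 4-D POOL(T4) input."""
    return tuple(bin_index(v) for v in feature_vector)

print(quantize([0.05, 0.65, 0.71, 1.0]))   # -> (0, 3, 4, 19)
```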

One final detail is that the probability estimates generated by the POOL method as I have applied it tend

to have numerous ties as well as some places where there is no well-defined value. The latter places

occur because the method only assigns values to existing data points (or bins containing data in the case

of our use of quantization). The locally constant regions occur both because of the quantization applied to

the training data at the outset and because the data pools created by the algorithm acquire a single value.

In cells where no value is defined, the interpolation scheme I use is to simply assign a value linearly

interpolated based on the Manhattan distance between the least upper bound and the greatest lower bound

for that cell based on the monotonicity constraint. Finally, since both the data pooling performed by the

algorithm and this interpolation scheme tend to lead to ties, I use the Manhattan distance from the origin

of the four THEMATICS features as a tie-breaker for any residues whose probability estimates are

identical. This simply imposes a slight bias toward strict monotonicity even though the mathematical

formulation I use to determine these probabilities is based on a non-strict monotonicity assumption,

making it possible to obtain well-defined rankings for all the residues in a protein.
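Putting these pieces together, the final per-protein ranking can be produced by sorting on the estimated probability and breaking ties with the Manhattan distance of the THEMATICS feature vector from the origin. The data structures are illustrative, and reading the tie-break as favoring the larger distance (consistent with the bias toward strict monotonicity) is an assumption of this sketch.

```python
import numpy as np

def ranked_residues(residue_ids, probabilities, feature_vectors):
    """Rank residues of one protein by estimated active-site probability
    (descending), breaking ties by the Manhattan distance of the rank-normalized
    THEMATICS feature vector from the origin (larger distance ranks higher)."""
    l1 = [float(np.sum(np.abs(f))) for f in feature_vectors]
    order = sorted(range(len(residue_ids)),
                   key=lambda i: (-probabilities[i], -l1[i]))
    return [residue_ids[i] for i in order]

# toy example: three residues, two of which share the same probability estimate
ids = ['ASP-52', 'GLU-35', 'LYS-13']
probs = [0.80, 0.80, 0.15]
feats = [[0.9, 0.8, 0.7, 0.95], [0.6, 0.5, 0.9, 0.4], [0.1, 0.2, 0.1, 0.3]]
print(ranked_residues(ids, probs, feats))   # ASP-52 above GLU-35 by the tie-break
```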

I use CASTp 37, which uses the weighted Delaunay triangulation and the alpha complex for shape measurements to calculate the cleft information for each residue in the protein. The clefts were ranked based on their sizes in decreasing order and each residue having atoms located in any cleft is assigned the rank number of the largest of the clefts where its atoms are located. One special value is assigned to every residue not on the protein surface, and another is assigned to every residue on the surface but not within any cleft. Ignoring these special values, the monotonicity assumption is that the larger the cleft to which a residue belongs, the more likely that residue is to belong to the active site.

I use ConSurf 27 to calculate the sequence conservation information for residues in each protein. ConSurf

takes a protein sequence and finds its closest sequence homologues using MUSCLE 75, a multiple-sequence alignment algorithm. Two sequences with similarity higher than a preset threshold are treated as homologues. ConSurf analyzes the homologues of the query sequence and determines how conserved each residue in the query protein is among these homologues. In order to normalize the result and make it comparable between different proteins with different numbers of homologues and with different degrees of overall conservation, the program labels each residue with a conservation score between 1 and 9, with

9 being the most conserved and 1 being the most variable. If there exist more than 50 homologues for the query sequence, the 50 homologues closest to the query sequence are analyzed. If there are fewer than six homologues, the method will not work. For proteins with 6-10 homologues, ConSurf does report a conservation score, but these scores are less reliable. In this study, I only use the conservation scores from

ConSurf when there are at least 11 homologues for a protein. Under the assumption that active site residues tend to be more conserved than others, we apply the POOL method on the conservation score with the monotonicity assumption that the larger the conservation score a residue has, the more likely that residue is to belong to the active site.

5.5 Results

The results presented in this section are based on two sets of proteins, a set of 64 test proteins selected

randomly from the CSA database 57, 74, and a 160-protein set covering most of the CSA database. A

detailed list of the proteins in both sets and the CSA-labeled positive residues within that protein can be

found in Appendices C and D. In each case, the results are based on eight-fold cross-validation for the 64

protein set and ten-fold cross-validation for the 160 protein set. The ROC curves and RFR curves I

display show average performance over all proteins in all of the test sets, using the averaging methods

described in section 5.4.

5.5.1 Ionizable residues using only THEMATICS features.

First I evaluate the ability of POOL with the four THEMATICS features, POOL(T4), to predict ionizable

residues in the active site. For the purposes of Figures 5.1 and 5.2, only the CSA-annotated ionizable active

site residues are taken as the labeled positives. Thus if a method successfully predicts all of the labeled

ionizable active residues, the true positive rate is 100%. The prediction of all active residues, including

the non-ionizable ones, is addressed below.

Figure 5.1 shows the ROC curve, true positive fraction (TP) as a function of false positive fraction (FP),

obtained using POOL(T4), with just the four-dimensional THEMATICS feature vectors described earlier

(solid curve). Recall that the POOL method computes maximum-likelihood probability estimates, but for

these ROC curves, only the rankings of all residues within a single protein matter. For comparison, I also

show in Figure 5.1 a corresponding ROC curve for the earlier THEMATICS statistical approach

introduced by Ko et al. 24 and refined by Wei et al. 54 (dashed curve), plus the single point (X)

corresponding to the THEMATICS SVM-based approach 45. The data set used for the statistical curve

consists of the same 64 proteins used here. Note that the POOL(T4) curve always lies above and to the

left of the statistical curve for all non-zero values of the true positive fraction. For any given non-zero

value of the FP fraction, the true positive fraction is always higher for POOL(T4) than for the statistical

selector. The point representing the particular SVM classifier is based on a separate set of data, trained

and tested on data sets somewhat different from the present data set, so the results are not strictly

comparable. Nevertheless, this point lies well below the POOL(T4) curve and strongly suggests that

POOL(T4) is superior to the SVM approach 45. Below I present further evidence that POOL outperforms an SVM on this active-site classification task. Thus POOL(T4) represents our best method yet for identifying ionizable active-site residues using THEMATICS features alone.


Figure 5.1 Averaged ROC curve comparing POOL(T4), Wei's statistical analysis and Tong's SVM using THEMATICS features. Shown in the plot are the averaged ROC curve for POOL(T4) (solid curve), Wei's statistical analysis (dashed curve) and Tong's SVM (point X), using THEMATICS features on ionizable residues only for the prediction of annotated active site ionizable residues. POOL(T4) outperforms both the SVM and Wei's method.

5.5.2 Ionizable residues using THEMATICS plus cleft information.

Next I evaluate the three different ways of combining THEMATICS features with cleft size information.

Figure 5.2 shows averaged ROC curves for these three different methods, along with the best-performing

THEMATICS-only method, POOL(T4). The three methods are: (i) POOL(T4G), which uses the POOL method with the 5-dimensional concatenated feature vectors of THEMATICS and cleft size rank (G stands for geometric feature); (ii) SVM(T4G), which uses a support vector machine trained using the same 5-dimensional feature vectors, with varying threshold; and (iii) CHAIN(POOL(T4), POOL(G)), the result of chaining POOL(T4) estimates with POOL(G) estimates.


Figure 5.2 Averaged ROC curves comparing different methods of predicting ionizable active site residues using a combination of THEMATICS and geometric features of ionizable residues only. The method using chaining to combine the THEMATICS features and geometric information has the best performance.

To compare the averaged ROC curves from Figure 5.2 quantitatively, I compute the area under the curve

for each ROC curve in the figure using the mean average specificity (MAS). The MAS values for

CHAIN(POOL(T4), POOL(G)), POOL(T4), POOL(T4G) and SVM(T4G) are 0.939, 0.921, 0.909 and

0.903, respectively. Figure 5.2 and the MAS values show the comparison of averaged performance between different methods. In order to estimate the statistical significance of the performance differences considering all pair-wise comparison results (i.e., on a per-protein basis), I perform the Wilcoxon signed-rank test. Table 5.1 shows the p-value of the Wilcoxon signed-rank test, the probability of observing the specified AveS measurement under the null hypothesis that the method listed in the corresponding row does not outperform the method listed in the corresponding column, as the first number in each cell. The number N in parentheses indicates the number of proteins, out of the 64, for which the method in that row outperforms the method in that column. For the remaining (64-N) proteins in the set, the two methods either give equal performance or the method in the column outperforms the method in the row.

                           SVM(T4G)        POOL(T4G)       POOL(T4)
CHAIN(POOL(T4), POOL(G))   <0.0001 (53)    <0.0001 (59)    <0.0001 (46)
POOL(T4)                   0.0002 (40)     0.0006 (41)     -
POOL(T4G)                  0.038 (37)      -               -

Table 5.1 Wilcoxon signed-rank tests between methods shown in Figure 5.2. The first number in each cell is the Wilcoxon p-value, the probability of observing such a difference under the null hypothesis that the method in the corresponding row does not outperform the method in the corresponding column. The number in parentheses is the number of proteins out of 64 for which the method in the row outperforms the method in the column.


The figure and the table above clearly show that chaining the POOL(T4) and POOL(G) probability estimates is the method that gives the best performance. It is interesting to note that this method,

CHAIN(POOL(T4), POOL(G)), is the only one that outperforms POOL(T4) alone. It is also interesting to note that POOL(T4) is consistently at least as good as SVM(T4G), and is significantly better than

SVM(T4G) in the upper recall range, even though the latter has the advantage of the additional cleft information. In general, there is little difference between POOL(T4), SVM(T4G), and POOL(T4G) in the lower recall range, but for recall above about 0.6, POOL(T4) has a significantly lower false positive rate, on average, than the other two, given equal recall. So these ROC curves provide strong evidence that

CHAIN(POOL(T4), POOL(G)) is the only one of the methods reported to date that is capable of taking good advantage of additional geometric information that is not contained in THEMATICS features alone and thereby outperforms any purely THEMATICS-based method so far.

The better performance of this chained method CHAIN(POOL(T4), POOL(G)) over POOL(T4) alone is consistent throughout the ROC curve. For recall rates greater than 0.50, the TP fraction for the chained method is better than that of POOL(T4) by roughly 10% for a given FP fraction. This qualitative trend is apparent from visual inspection of the ranked lists from the two methods. For a typical protein, these two ranked lists tend to be very similar, with annotated positive residues generally ranking a little higher, on average, in the list resulting from chaining.

I believe that the observation that chaining the two four- and one- dimensional estimators gives better results than applying POOL directly to the single, five-dimensional concatenated feature vector is probably an overfitting issue. There may be too much flexibility when POOL is used with a high- dimensional input space5, especially when the data are sparse.

5 As a side note, as far as possible worst-case performance is concerned, it is easy to show that applying coordinate- wise monotonicity with even a 2-dimensional input space has infinite VC dimension.

Since I established that chaining the POOL probabilities gives better results, from the next section on, I

omit POOL from the method reference, to make the change of features more distinguishable.

CHAIN(POOL(T4), POOL(G)) will be abbreviated as CHAIN(T4, G) instead.

5.5.3 All residues using THEMATICS plus cleft information.

So far only predictions for ionizable residues have been described. The THEMATICS environment

variables are now used to incorporate predictions for non-ionizable residues in the active site.

Figure 5.3 shows the ROC curve for a combined method by which a single merged list ranking all

residues, both ionizable and non-ionizable, in a protein is generated; this merged method is denoted CHAIN(T_ALL, G). The method assigns probability estimates for ionizable residues using the best of the previous ionizables-only estimators, the chained estimator corresponding to the best ROC curve, CHAIN(POOL(T4), POOL(G)), in Figure 5.2. It also assigns probability estimates to non-ionizable residues using POOL with the two

THEMATICS environment features chained with POOL(G), and then rank orders all the residues based

on their probability estimates. Also included in Figure 5.3 for comparison is a ROC curve, CHAIN(T_ION,

G) based on the same estimates for the ionizable residues but assigning probability estimates of zero to

all non-ionizable residues. Note that the data for this latter method are essentially the same as those of the

chained CHAIN(POOL(T4), POOL(G)) curve of Figure 5.2, except that the denominator for the recall

values is now the number of total active-site residues in the protein, whether ionizable or not, and the

denominator for the false positive rate is now the total number of non-active-site residues in the protein,

ionizable or not. The improved ROC curve for the merged-estimate method CHAIN(T_ALL, G) compared to the curve for the ionizables-only method CHAIN(T_ION, G) indicates that taking into account both

THEMATICS environment variables and cleft information does indeed help identify the non-ionizable active-site residues. When the lists are merged, the rankings of some annotated positive ionizable residues may be lowered, but it is apparent that this effect is more than offset, on average, by the rise in the ranking of some annotated positive non-ionizable residues that are obviously missed by excluding them

altogether. If this were not the case, then one would expect the merged curve to cross below (and to the right of) the comparison curve in the lower recall (and lower false positive) range.


Figure 5.3 Averaged ROC curves comparing POOL methods applied to ionizable residues only, CHAIN(T_ION, G), and to all residues, CHAIN(T_ALL, G).

The MAS values for CHAIN(T_ALL, G) and CHAIN(T_ION, G) are 0.933 and 0.833, respectively. The p-value of the Wilcoxon signed-rank test of observing such AveS values under the null hypothesis that CHAIN(T_ALL, G) does not outperform CHAIN(T_ION, G) is <0.0001, and CHAIN(T_ALL, G) outperforms CHAIN(T_ION, G) in 31 of the 64 proteins. The number of proteins for which CHAIN(T_ALL, G) outperforms CHAIN(T_ION, G) may seem low, but both methods perform the same in 25 out of the 64 proteins. For many of these latter cases, the protein does not have any non-ionizable residues in the active site.

I have shown that the extension of the POOL method to non-ionizable residues gives a satisfactory result.

From now on, all residues are included in the study and I will just use T to indicate the way I apply

THEMATICS in TALL: For ionizable residues, I estimate the probability of being in active sites by

chaining the result of the POOL method on four THEMATICS features; for non-ionizable residues, I

estimate the probability of being in active sites by chaining the result of POOL method on two

THEMATICS features; I then combine the results and rank order the list of all residues based on their

probability of being in active sites.

5.5.4 All residues using THEMATICS, cleft information and sequence conservation, if

applicable.

So far all the information I used in protein active site prediction is only derived from the protein 3D

structures, in other words, no sequence comparison information is used. As discussed in Chapter 2, it

makes the method applicable to those proteins with no or very few sequence homologues; indeed many of

the newly discovered protein structures from Structural Genomics projects have few or no sequence

homologues. It is generally true that most active site residues tend to be more conserved than others, with

only a few exceptions. Based on this observation, I believe, if I can include the sequence conservation

into our system when the information is available, I may get better performance. I put this hypothesis to

the test in this section, and the result is presented in Figure 5.4.

This figure shows the ROC curves using different features on the 160-protein set, with all residues

included as I did in 5.5.3. The reason I used the 160 protein set instead of the 64 set is that not all the

proteins have reliable sequence conservation information to use, which will be explained later. If I still

use the 64 protein set, the number of proteins with reliable sequence conservation information may not be

large enough to perform significance testing. Also, using a 160-protein set will show that the performance

improvement using different feature sets is consistent between different test sets. As demonstrated in

Figure 5.2, chaining the POOL results together works better than applying POOL directly on high

dimensional features, so I use chaining to combine the features of THEMATICS, cleft and the sequence

conservation information.

As pointed out earlier, not all proteins have enough homologues to perform reliable sequence

conservation analysis. In this study, I use ConSurf to do the sequence analysis. As a requirement, it needs

more than five homologues to perform conservation analysis and the result is claimed to be more reliable

if the number of homologues is larger than 10. In this study, I will only use the conservation information

when the protein has more than 10 homologues. For those not having enough homologues (28 out of 160

in this case), I assign 1 as the probability estimate from the conservation POOL table. Since the ranking is performed for residues within the same protein, this treatment is valid and will not affect the results from

other proteins in the set.

There are four curves for comparison6: CHAIN(T) uses the four THEMATICS features for ionizable residues and the two THEMATICS features for the non-ionizables; CHAIN(T, G) uses both

THEMATICS and the cleft feature; CHAIN(T, C) uses both THEMATICS and the sequence conservation information; while CHAIN(T, G, C) uses all three features by chaining.

Figure 5.4 shows, among all four curves, CHAIN(T) is dominated by all other three curves, suggesting that including either cleft or sequence conservation features, or both, can improve the performance. Both

CHAIN(T, C) and CHAIN(T, G, C) dominate CHAIN(T, G), suggesting that incorporating sequence

6 I use the notation CHAIN(T) for consistency. This could also have been notated more simply as “T”.

conservation information does improve performance more than just incorporating cleft information alone.

Surprisingly, CHAIN(T, C) and CHAIN(T, G, C) have very similar performance, although in the recall range below 80%, CHAIN(T, G, C) performs slightly better.

The MAS values for CHAIN(T, G, C), CHAIN(T, C), CHAIN(T, G) and CHAIN(T) are 0.925, 0.923, 0.907 and 0.899, respectively. The p-values of the Wilcoxon signed-rank test of observing such AveS measurements under the null hypothesis that the method in the row does not outperform the method in the column are listed in Table 5.2, as the first number in each cell. The number in parentheses indicates the number of proteins for which the method in that row outperforms the method in that column:


Figure 5.4 Averaged ROC curves comparing different methods of combining THEMATICS, geometric and sequence conservation features of all residues. The method using chaining to combine THEMATICS, geometric and sequence conservation features has the best performance.

                 CHAIN(T)        CHAIN(T, G)     CHAIN(T, C)
CHAIN(T, G, C)   <0.0001 (115)   <0.0001 (95)    <0.0001 (103)
CHAIN(T, C)      <0.0001 (101)   0.0008 (89)     -
CHAIN(T, G)      <0.0001 (101)   -               -

Table 5.2 Wilcoxon signed-rank tests between methods shown in figure 5.4.


5.5.5 Recall-filtration ratio curves.

The results reported so far are all in the form of ROC curves. As discussed earlier, my analysis is not committed to any particular cutoff or rule to select the active site residues from the top of the list. For instance, users can select the top k residues in the ranked list of residues ordered by the estimated probability of being in the active site, or they can select the residues with an estimated probability of being in the active site greater than a certain cutoff value, or they can select the top p percent of the residues in the ranked list. Among the three methods listed above, I think the third would probably be preferred in general, since it is less susceptible to variation in protein size and availability of sequence conservation information. In this case, RFR (recall-filtration ratio) curves may be more useful than ROC curves.

Since the main purpose for the RFR-curve is to provide a guide for users to select the appropriate cut-off,

I only report the results for CHAIN(T, G, C), which performs the best among all the methods. The test was performed on all residues ranked by their probability of being in the active site and I average the recall for each filtration ratio value to get the averaged RFR curve. For the curve shown in Figure 5.5, for example, choosing the top 10% of the residues from the ranked list gives an average recall of 90%, while choosing the top 5% of the residues from the ranked list gives an average recall of 79%.


Figure 5.5 Averaged RFR curve for CHAIN(T, G, C) on the 160 protein test set.

5.5.6 Comparison with other methods.

In addition, I also compare the CHAIN(T, G) and CHAIN(T, G, C) results with the results from some

other top performing active site prediction methods, particularly, Petrova’s method 39, Youn’s method 40, and Xie’s geometric potential method 38. All these methods use SVM. The first two use both sequence conservation and 3D structural information, while Xie’s method uses structural information only.

The authors of the three methods report results for their datasets using different measures from those I used in my studies. Therefore I simply take the results from CHAIN(T, G) and CHAIN(T, G, C) on the 160 protein test set and compare them with their results in their form of analysis. Because the performance measures are not obtained from the same dataset, the results are not strictly comparable, but qualitatively, the comparisons below give a good idea of the relative performance.

In order to compare my results with theirs at a similar recall level, I used a 4% filtration ratio cutoff in the

POOL method to compare with Youn’s method, and a variable filtration ratio cutoff to compare with

Petrova’s method. Note that while our test set consists of proteins with a wide variety of different folds and functions, Youn’s results are reported for sets of proteins with common fold or with similar structure and function. Performance on the more varied set is a much more realistic test of predictive capability on proteins of unknown function, particularly novel folds. Performance on a set of structurally or functionally related proteins is also substantially better than performance on a diverse set, as one would expect and as has been demonstrated by Petrova and Wu 39.

Youn’s method 40 achieved about 57% recall at 18.5% precision with MAS (AUC) of 0.929, using both

sequence conservation and structural information when they train and test on proteins from the same

family; however the performance dropped when the training and testing is performed on proteins of the

same superfamily and fold level, while our CHAIN(T, G, C) with a preset 4% filtration ratio cutoff,

achieves the averaged recall of 64.68% with averaged precision of 19.07%, and an MAS (AUC) of 0.925 for all 160 proteins in the test set, consisting of proteins from completely different folds and classes.

Without the use of sequence conservation, the CHAIN(T, G) achieves the averaged recall, the averaged

precision and the AUC of 61.74%, 18.06% and 0.907, respectively. Our chained POOL method thus does about as well as Youn's method, even when we exclude conservation information, and a little better with conservation information included, even though our diverse test set is one for which good performance is most difficult to achieve. The complete results are shown in Table 5.3.

Petrova’s method 39 measured the performance of their method globally using all residues in all proteins, instead of computing the recall, accuracy, false positive rate and MCC values for each protein and then averaging them. Like Youn’s method, they use both sequence conservation information and 3D structural properties as input to the SVM. They use a dataset that they call the benchmarking dataset that contains a wide variety of proteins that are dissimilar in sequence, are structurally diverse, and span the full range of

E.C. classes of chemical functions. This dataset constitutes a fair test of how a method will perform on structural genomics proteins of unknown function for which sequence conservation information is available. Their method achieves a global residue level 89.8% recall with an overall predictive accuracy of 86%, with an MCC of 0.23 and a 13% false positive rate on a subset of 79 proteins from CatRes database. Testing on the 72 proteins from their set that also appear in my 160 protein set, CHAIN(T, G,

C) with a 10% filtration ratio cutoff achieves a residue level 88.6% recall at the overall predictive accuracy of 91.0%, with an MCC of 0.28 and a 9% false positive rate. The resulting residue level recall, overall predictive accuracy and the MCC from the CHAIN(T, G) are 85.2%, 91.0% and 0.27, respectively.

The results for Petrova’s method and for the present CHAIN methods with different filtration ratio cutoffs are shown in Table 5.4 and the ROC curves in Figure 5.6. CHAIN(T, G, C) achieves comparable recall with somewhat better accuracy and a lower false positive rate. CHAIN(T, G) performs almost as well, even without conservation information.

In 38, a purely 3D structure based method, the performance was reported in the following fashion: their method achieves at least a 50% recall with 20% or less false positive rate for 85% of the proteins they analyzed. The performance of the CHAIN(T, G) and CHAIN(T, G, C) methods measured in the same way is listed in table 5.5. Xie’s method should be compared against CHAIN(T, G), because these methods

do not use conservation data. CHAIN(T, G) achieves at least a 50% recall with a false positive rate of

20% or less for 96% of all proteins.

The results in the tables clearly show that CHAIN(T, G), which only uses 3D structural information of proteins, achieves about as good or even better performance than that of these best performing current active site prediction methods. When additional sequence conservation information is available, the

CHAIN(T, G, C) performs still better.

Method / Data set               Sensitivity (%)   Precision (%)   AUC
Youn / Family                   57.02             18.51           0.9290
Youn / Superfamily              53.93             16.90           0.9135
Youn / Fold                     51.11             17.13           0.9144
CHAIN(T, G, C) / all proteins   64.68             19.07           0.925
CHAIN(T, G) / all proteins      61.74             18.06           0.907

Table 5.3 Comparison of the sensitivity, precision, and AUC of CHAIN(T, G, C) and CHAIN(T, G) with Youn's reported results for proteins in the same family, superfamily, and fold.

Method (filtration ratio cutoff)   Residue-level recall   Residue-level accuracy   Residue-level false positive rate   Residue-level MCC
Petrova's method                   89.8%                  86%                      13%                                 0.23
CHAIN(T, G, C) 7%                  81.4%                  90.9%                    6.2%                                0.31
CHAIN(T, G, C) 8%                  85.2%                  91.0%                    7.1%                                0.31
CHAIN(T, G, C) 9%                  85.6%                  91.0%                    8.0%                                0.29
CHAIN(T, G, C) 10%                 88.6%                  91.0%                    9.0%                                0.28
CHAIN(T, G, C) 12%                 90.2%                  89.1%                    11%                                 0.26
CHAIN(T, G, C) 15%                 91.9%                  86.1%                    14%                                 0.23
CHAIN(T, G) 7%                     73.7%                  93.7%                    6.1%                                0.28
CHAIN(T, G) 8%                     77.1%                  92.8%                    7.0%                                0.28
CHAIN(T, G) 9%                     81.8%                  91.9%                    8.0%                                0.27
CHAIN(T, G) 10%                    85.2%                  91.0%                    9.0%                                0.27
CHAIN(T, G) 12%                    86.9%                  89.0%                    11%                                 0.25
CHAIN(T, G) 15%                    89.4%                  86.0%                    14%                                 0.22

Table 5.4 Comparison of CHAIN(T, G) and CHAIN(T, G, C) with Petrova’s method.


Figure 5.6 ROC curves comparing CHAIN(T, G), CHAIN(T, G, C) and Petrova’s method.

Method           Recall ≥   False positive rate <   Achieved for
Xie              50%        20%                     85%
CHAIN(T, G, C)   50%        20%                     97%
CHAIN(T, G, C)   80%        20%                     84%
CHAIN(T, G, C)   60%        10%                     85%
CHAIN(T, G)      50%        20%                     96%
CHAIN(T, G)      80%        20%                     77%
CHAIN(T, G)      60%        10%                     81%

Table 5.5 Comparison of CHAIN(T, G) and CHAIN(T, G, C) with Xie’s method. Each method achieves at least the specified recall rate with a false positive rate less than specified for the percentage of proteins in the last column.

5.5.7 Rank of the first positive.

The last result I will present is one that is only applicable to methods that generate a ranked list: the rank of the first true positive in the list. This metric is useful for users who are interested in finding a few of the active site residue candidates and who do not necessarily need to know all of the active site residues.

Users could use the list from the POOL method to guide their site-directed mutagenesis experiments by going down the ranked list one residue at a time; once the first active site residue is found, it should be easier to find the rest of the active site residues by examining its neighbors. A histogram giving the rank of the first active site residue found by CHAIN(T, G, C) on the 160 protein set is shown in Figure 5.7. The median rank of the first true positive active site residue in the 160 protein set with the CHAIN(T, G, C) method is two. For 46 out of 160 proteins, the first residue in the resulting ranked list is an annotated active site residue. 65.0%, 81.3% and 90.0% of the 160 proteins have the first annotated active site residue located within the top 3, 5 and 10 residues of the ranked list, respectively. Such measurements are not easily made for binary classification methods.
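The statistics reported in this subsection are straightforward to compute from the per-protein ranked lists; the sketch below does so on made-up toy data.

```python
import numpy as np

def first_positive_rank(labels_by_rank):
    """1-based rank of the first annotated active site residue in a ranked list."""
    for r, is_pos in enumerate(labels_by_rank, start=1):
        if is_pos:
            return r
    return None   # no annotated positive in the list

def summarize(per_protein_labels, cutoffs=(3, 5, 10)):
    ranks = [first_positive_rank(l) for l in per_protein_labels]
    ranks = [r for r in ranks if r is not None]
    print("median rank of first true positive:", int(np.median(ranks)))
    for c in cutoffs:
        frac = np.mean([r <= c for r in ranks])
        print(f"first positive within top {c}: {100 * frac:.1f}% of proteins")

# toy data: four proteins with first positives at ranks 1, 2, 4 and 7
summarize([[1, 0, 0], [0, 1, 0], [0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0, 1]])
```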


Figure 5.7. Histogram of the first annotated active site residue. Top: rank of the first annotated active site residue in the ranked list from CHAIN(T, G, C) on the 160 protein set. Bottom: the cumulative distribution of the first annotated active site residue in the ranked list from CHAIN(T, G, C) on the 160 protein set.

5.6 Discussion.

In this chapter, I presented the application of the POOL method using THEMATICS plus some other features for protein active site prediction.

I started with the application of the POOL method just on THEMATICS features, with features similar to those I used before in the SVM method, as well as those used in Ko and Wei’s statistical analysis 24, 54.

My results show that the POOL method outperforms all of the earlier THEMATICS methods with no training data cleaning and no clustering after the classification. This suggests that by emphasizing the underlying THEMATICS principles, the POOL method makes better use of the training data and automatically limits the adverse effect that noise in the training data set might have caused in the other methods. The results also supply further evidence that the THEMATICS principles are valid in reality. In some sense, this opens another possible application for the POOL method, to verify the underlying monotonicity hypothesis, which could be worth further investigation in the future.

I also tested different ways of incorporating additional features into the learning system. Not surprisingly, the results show that in order to improve performance, I have to incorporate the right features in the right way. In addition to using CASTp to obtain the size rank of the cleft in which a residue resides, I also tried solvent accessible area, residue type and some other features. Unfortunately, these extra features did not improve the performance of the POOL method. One possible reason might be overfitting, or correlation between these features and other features already present in the system. Even for features that were found to improve performance, how they are incorporated matters.

The results show that chaining the results from separate POOL estimates is better than simply combining all the available features into one POOL table of very high dimension. As mentioned earlier, the reason behind this is probably overfitting: combining features into a high-dimensional POOL table causes the number of probabilities that need to be estimated to grow exponentially, while in most cases the training data can only grow linearly. In other words, the high dimensionality makes the table too sparse for accurate probability estimates.
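The sparsity argument can be made concrete with a back-of-the-envelope count of table cells; the bin counts below are illustrative only, not the discretization used in this work.

def cells_single_table(bins_per_feature, num_features):
    """Number of cells when all features go into one joint table."""
    return bins_per_feature ** num_features

def cells_chained(bins_per_feature, group_sizes):
    """Total cells when features are split into small groups whose
    probability estimates are chained (combined afterwards)."""
    return sum(bins_per_feature ** g for g in group_sizes)

bins = 10
print(cells_single_table(bins, 6))     # 1,000,000 cells for one 6-D table
print(cells_chained(bins, [4, 1, 1]))  # 10,020 cells for chained low-D estimates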

I also extended the application of THEMATICS to all residues, not just ionizable residues, in a natural way and showed that it is effective. Although the performance for predicting non-ionizable residues is not as good as the performance for predicting ionizable ones, this extension does provide a way to combine features from THEMATICS, which by itself can only be applied to ionizable residues directly, with some other features, making the comparison with the performance of other methods more accurate and fair.

The incorporation of sequence conservation information does improve the prediction when there are enough homologues with appropriate similarities. The POOL method gives us a means for easily utilizing this information when it is available, while not affecting the training and classification when it is not.

When comparing with other methods, especially methods that use binary classification instead of a ranked list, I have to commit to a specific cutoff value and turn my system into a binary classifier. The results in Section 5.5.6 clearly show that the POOL method using THEMATICS and geometric features achieves performance equivalent to or better than that of the other methods, even in cases where those methods are tested on very special groups of proteins. Because it does not require sequence homologues, my method is also more widely applicable, for example to proteins with few or no sequence homologues such as some Structural Genomics proteins, whereas both Youn's and Petrova's methods need sequence alignments from homologues.

The performance of Youn's and Petrova's methods degrades significantly when sequence conservation information is not available, whereas the performance of my CHAIN(T, G) method does not depend on such information at all. The results also show that when sequence conservation information is available, it can further improve the performance.

Interestingly, when I compare the performance of CHAIN(T, G) and CHAIN(T, G, C) in Figure 5.4, it is apparent that the addition of the conservation information does improve the performance a little, but not to the extent observed previously for sequence-structure methods. Typically the conservation information is the most important input feature, and without it performance is substantially worse 18. This suggests that the 3D structure based THEMATICS features are quite powerful compared with other 3D structure based features.

When looking at the recall and false positive rates of the results from all the protein active site prediction methods, one must keep in mind that the annotation of the catalytic residues in the protein dataset is never perfect. Since most of the labeling comes from experimental evidence, some active site residues are not labeled as positive simply because no experiment has been designed and carried out to verify the role of that specific residue. Since, as mentioned several times in this dissertation, I use the CatRes/CSA annotations as the sole criterion for evaluating performance in order to keep the comparisons consistent, the resulting false positive rate may be higher than it is in reality. There is evidence supporting the functional importance of some residues that are not labeled as active site residues in the CatRes/CSA database 24, 58, yet these residues have high ranks in the list from the POOL method and are classified as positive by THEMATICS-SVM and by the THEMATICS statistical analysis as well.

Although I evaluated the performance of the POOL method by using filtration ratio values as a cutoff, this was done only for the purpose of comparing with other protein active site prediction methods that use a binary classification scheme. The ranked list of residues, ordered by their probability of being in the active site, contains much more information than traditional binary classification labels. The analysis of the rank of the first annotated positive residue in Section 5.5.7 shows just one use of the extra information contained in a ranked list. There are many possible measurements of performance, depending on the actual application, and in turn many possible applications that can benefit from results in ranked list form. It is noteworthy that P-Cats 21 uses a k-nearest neighbor method to estimate the probability of a residue being in an active site of a protein, which in principle could be the basis for producing a ranked list as the result, but their method only uses the probability estimates to assign binary labels; residues with probability larger than 0.50 are labeled as positive and the others as negative. Although the online server of their method 76 does report the probability estimates along with the corresponding binary labels, the potential benefits of using a ranked list instead of binary labeling have not been fully addressed either in the paper or online.
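For reference, thresholding a ranked list with a filtration ratio can be sketched as follows (hypothetical names; the residue identifiers are made up): the top fraction of the list is labeled positive and everything else negative.

import math

def binarize_by_filtration_ratio(ranked_residues, filtration_ratio):
    """Label the top filtration_ratio fraction of the ranked list as positive."""
    n_positive = math.ceil(filtration_ratio * len(ranked_residues))
    top = set(ranked_residues[:n_positive])
    return {res: (res in top) for res in ranked_residues}

# Example: a hypothetical ranked list of 10 residues and an 8% filtration ratio.
ranked = ['H57', 'D102', 'S195', 'K60', 'E70', 'R90', 'Y94', 'C42', 'N101', 'T54']
labels = binarize_by_filtration_ratio(ranked, 0.08)
print([r for r, positive in labels.items() if positive])  # ['H57']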

In conclusion, I have established that applying the POOL method with THEMATICS and other features yields what appears to be the best protein active site prediction system found to date, and that it provides more information than other active site prediction methods.

Chapter 6

Summary and Conclusions.

Here, I summarize the work I have presented in this dissertation.

This dissertation starts with an introduction to the central problem I try to solve: using machine learning methods to build an automated system that can predict active sites from protein structure alone, and that can further improve the prediction by using sequence conservation when that information is available.

The second chapter briefly surveys the background of protein active site prediction and the relevant machine learning techniques, especially probability based learning, which forms one foundation of the POOL method. This chapter also introduces THEMATICS, an effective and accurate protein active site predictor that uses only structural information about proteins, which forms the other foundation of this work.

The third chapter of my dissertation reports my work on using an SVM with structure-only information, which achieved better performance not only than other competing methods that use many kinds of features, but also than THEMATICS with statistical analysis. At the end of this chapter, I explain the limitations of using traditional machine learning techniques, framed as classification, to solve the protein active site prediction problem.

Chapter Four of my dissertation proposes the novel POOL method as an approach to a class of problems involving the estimation of class probabilities under multi-dimensional monotonicity constraints. In this chapter, I describe the properties of such problems and a framework within which they can be formulated and solved. I present an algorithm for solving them, together with a mathematical proof that the solution is optimal under both the sum-of-squared-error and the maximum-likelihood criteria.

In Chapter Five, the protein active site prediction problem is reframed from a standard binary classification problem into a ranked list problem. The dissertation presents the application of the POOL method to estimate the class probabilities of residues being in the active site under class probability monotonicity assumptions. Using the POOL method and THEMATICS, I achieved better performance than with the THEMATICS-SVM system. After incorporating more features, including sequence conservation information, and extending the method from ionizable residues only to all residues, the POOL method achieved the best performance so far, compared with both the earlier THEMATICS methods and other existing methods that use SVMs and many kinds of structural and sequence information from proteins.

My work has established the following two claims:

THEMATICS is an effective and accurate protein active site predictor and can be automated by different machine learning techniques. Incorporating more features makes it even more effective and accurate in protein active site prediction.

The POOL method is an efficient way to estimate probabilities with maximum likelihood under multi-dimensional monotonicity constraints. It provides a platform that allows probability estimates to be combined easily. It can be used in protein active site prediction and potentially in many other applications where monotonicity constraints play a role, such as disease detection from markers in the blood and risk assessment.

6.1 Contributions.

Listed below are the novel contributions to using machine learning techniques with THEMATICS in protein active site prediction that have been presented in this dissertation.

Use SVM with THEMATICS in active site prediction. The work of using an SVM with THEMATICS in protein active site prediction presented in this dissertation is the first successful approach that uses machine learning techniques to automate the THEMATICS method. It outperforms both the THEMATICS-statistical method and other 3D structure based protein active site prediction methods.

Turn the protein active site prediction problem into a probability based ranked list problem. This dissertation frames protein active site prediction as a ranked list problem instead of a traditional binary classification problem. Although the P-Cats method 21 also estimates the probability of residues being in the active site, using a k-nearest neighbor method, it is still framed as a binary classification problem. This dissertation emphasizes the benefits of the ranked list scheme, which gives users more control in setting their own cutoff thresholds. The probability estimates behind the ranked list make possible the next contribution, combining the results from different methods.

Combine probability estimates by applying the POOL method to different features. Since all the results of the POOL method are essentially probability estimates, it becomes possible to utilize more features using the chaining technique. This makes the method less susceptible to the problem of analyzing sparse data in high dimensions.

Introduce monotonicity constraints into machine learning. This is a successful approach that enforces a prior belief about the data in the machine learning task, and the results indicate that when the prior belief is indeed correct, incorporating and enforcing this knowledge can improve the performance of the learning system.

Develop a novel POOL method. This dissertation frames the problem of assigning probabilities under multi-dimensional monotonicity constraints with minimum sum of squared error (SSE) as a special form of convex optimization problem and develops the POOL algorithm, which solves it more efficiently and accurately than a general-purpose convex optimization solver.
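For readers unfamiliar with this class of problems, the one-dimensional special case is classical isotonic regression, which the pool-adjacent-violators procedure solves. The sketch below shows only that 1-D case; it is not the multi-dimensional POOL algorithm itself, which handles a partial order over several features.

def pool_adjacent_violators(labels):
    """Least-squares monotone (non-decreasing) fit to binary labels that are
    ordered by a single feature value: the one-dimensional special case of
    estimating class probabilities under a monotonicity constraint."""
    # Each block stores (mean value, number of points pooled into it).
    blocks = []
    for y in labels:
        blocks.append((float(y), 1))
        # Pool adjacent blocks while the non-decreasing constraint is violated.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            v2, n2 = blocks.pop()
            v1, n1 = blocks.pop()
            blocks.append(((v1 * n1 + v2 * n2) / (n1 + n2), n1 + n2))
    # Expand block means back to one probability estimate per input point.
    fitted = []
    for value, count in blocks:
        fitted.extend([value] * count)
    return fitted

# Binary labels sorted by increasing feature value (1 = active site residue).
print(pool_adjacent_violators([0, 0, 1, 0, 1, 1, 0, 1, 1, 1]))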

Prove that minimizing SSE maximizes likelihood in the present problem. This dissertation proves, using the Karush-Kuhn-Tucker (KKT) conditions, that the probability assignment minimizing SSE also maximizes the likelihood under the multi-dimensional monotonicity constraints.

Use the POOL method with THEMATICS in protein active site prediction. This dissertation presents a practical application of the POOL method by applying it with THEMATICS to protein active site prediction. It outperforms all other protein active site prediction methods to date.

Use the environment feature to incorporate influences of nearby residues in protein active site prediction. This dissertation introduces an environment feature (see 5.2) as a new way to incorporate the influence of nearby residues in active site prediction. This gives two benefits: first, it makes the THEMATICS method applicable to non-ionizable residues; second, it avoids the extra clustering step after classification that was used in the THEMATICS-statistical analysis and the THEMATICS-SVM study.
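The general idea can be illustrated with the following sketch: a residue of any type receives a feature derived from the THEMATICS-based scores of ionizable residues nearby. The distance cutoff and the use of the maximum over neighbors are assumptions made here for illustration and are not necessarily the definition given in Section 5.2.

import math

def distance(a, b):
    """Euclidean distance between two 3-D coordinates."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def environment_feature(residue_coord, ionizable_residues, cutoff=9.0):
    """Illustrative 'environment' value for any residue: an aggregate of the
    THEMATICS-derived scores of ionizable residues within a distance cutoff.
    Cutoff value and aggregation by maximum are assumptions of this sketch."""
    nearby = [score for coord, score in ionizable_residues
              if distance(residue_coord, coord) <= cutoff]
    return max(nearby) if nearby else 0.0

# Hypothetical data: (coordinate, THEMATICS-derived score) for ionizable residues.
ionizables = [((1.0, 0.0, 0.0), 0.8), ((12.0, 0.0, 0.0), 0.2)]
print(environment_feature((2.0, 1.0, 0.0), ionizables))  # 0.8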

6.2 Future research.

There are many possible ways to extend the research described in this dissertation, some of which I have mentioned in earlier sections. Here I outline some directions in which the application of the POOL method with THEMATICS could be developed to further improve the performance of protein active site prediction.

One place for further research is the use of the ranked list from the POOL method. As mentioned earlier, the POOL method does not commit to any rule or value for the cutoff, but users may still want to know "exactly" which residues are in the active site. Although I suggested the filtration ratio as a possible cutoff and used it for comparison with other methods, it is still coarse and I believe it can be improved. One possible way is to look at the actual probability estimates the POOL method gives; the magnitudes of, and the differences between, adjacent estimates in the ranked list may give some clue about where to place the cutoff. Another approach is to use machine learning as a first screening step and involve humans more in refining the predictions. Human experts can examine the 3D structure of the protein to see where the residues near the top of the ranked list are located. Residues in some areas, such as in clefts near the surface and close to one another, are more likely to be in the active site than others near the top of the ranked list that are deeply buried or isolated. Of course one can also feed the result from the POOL method, either the ranks or the raw probability estimates, into another machine learning system to further improve the performance. Most likely, some normalization of these features would be needed if cross-protein training and testing is used.
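As one concrete version of the "look at the gaps between adjacent probability estimates" idea, a cutoff could be placed at the largest drop near the top of the list. The sketch below is purely illustrative, with hypothetical names and values; it is not a procedure evaluated in this work.

def cutoff_at_largest_gap(sorted_probs, max_candidates=20):
    """Pick a cutoff where the drop between adjacent probability estimates
    (sorted in descending order) is largest, within the top few residues."""
    top = sorted_probs[:max_candidates]
    gaps = [top[i] - top[i + 1] for i in range(len(top) - 1)]
    split = gaps.index(max(gaps)) + 1  # number of residues kept as positive
    return split, top[split - 1]       # count and the probability at the cut

# Hypothetical POOL probability estimates, already sorted in descending order.
probs = [0.92, 0.88, 0.85, 0.41, 0.39, 0.12, 0.10, 0.08]
print(cutoff_at_largest_gap(probs))  # (3, 0.85): keep the top 3 residues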

Another possible area for improving performance in protein active site prediction is feature selection. Some of the methods I compared against in Section 5.5.6 use many more features. Although it is not true that the more features one uses, the better the performance one gets, it is worth exploring the use of more features, including both simple ones extracted directly from the 3D structures or sequences, and more sophisticated ones such as the four THEMATICS features I used, or even results from other machine learning systems. These features can be fed either to the POOL method, or to another learning system along with the result of the POOL method.

I will mention one more area where further research is needed, although it is beyond the scope of protein active site prediction. Once one knows the active site of a protein, the next natural question is what function that site performs; this is the next step in function prediction after active site prediction. I believe the exact shapes of the THEMATICS titration curves of a given residue, and of the residues within a certain region, may give some clues about the class of reaction the site catalyses. In principle, this might also be addressed with machine learning. It is clearly a very challenging but rewarding task.

Appendices

Appendix A. The training set used in THEMATICS-SVM

Name EC Classification PDB ID

Acetylcholinesterase (E.C. 3.1.1.7) 1ACE

3-Ketoacyl-CoA thiolase (E.C. 2.3.1.16) 1AFW

Ornithine carbamoyltransferase (E.C. 2.1.3.3) 1AKM

Glutamate racemase (E.C. 5.1.1.3) 1B74

Alanine racemase (E.C. 5.1.1.1) 1BD0

Adenosine kinase (E.C. 2.7.1.20) 1BX4

Subtilisin Carlsberg (E.C. 3.4.21.62) 1CSE

Micrococcal nuclease (E.C. 3.1.31.1) 1EY0

Oxalate oxidase (E.C. 1.2.3.4) 1FI2

DNA-(apurinic or apyrimidinic site) lyase (E.C. 4.2.99.18) 1HD7

2-Amino-4-hydroxy-6-hydroxymethyl-dihydropteridine pyrophosphokinase (E.C. 2.7.6.3) 1HKA

Colicin E3 Immunity Protein (E.C. 3.1.21.-) 1JCH

L-lactate dehydrogenase (E.C. 1.1.1.27) 1LDG

Papain (E.C. 3.4.22.2) 1PIP

Mannose-6-phosphate isomerase (E.C. 5.3.1.8) 1PMI

Pepsin (E.C. 3.4.23.1) 1PSO

Triosephosphate Isomerase (E.C. 5.3.1.1) 1TPH

Aldose reductase (E.C. 1.1.1.21) 2ACS

HIV-1 protease (E.C. 3.4.23.16) 2AID

Mandelate Racemase (E.C. 5.1.2.2) 2MNR

Appendix B. The test set used in THEMATICS-SVM

The following table gives the testing results for 64 CatRes/CSA proteins. Bold indicates residues that are CatRes/CSA active, ionizable, and correctly predicted in a cluster of just ionizables by the SVM. Bold italic indicates residues that are CatRes/CSA active, missed in the SVM result, but found in our cluster if neighbors within 6Å are included. Underline indicates CatRes/CSA active residues missed by both criteria. []ab means the symmetric clusters appear in both chain a and chain b. [XXXab] means residue XXX in both chain a and chain b is in the same cluster. The number to the left of ";" in the recall and filtration ratio columns indicates the percentage obtained from the SVM and clustering on ionizable residues only, and the number to the right indicates the result from the test including neighbors within 6Å.

PDB Code Protein Name CatRes SVM Recall (%) Filtration Result Positive Reported ratio (%) SVM only / Positive SVM-region 1AL6 Citrate (si)- H274 [H246a H249a 67;100 5;17 Correct/Correct R413a E420a H320 H246b H249b D375 R413b E420b] [Y231b H235b H238a H274a R329b D375b R401b R421a] [Y231a H235a H238b H274b R329a D375a R401a R421b] [D174 D257]a,b [Y158 Y167]a, b [Y318 Y330]a,b 1APX Heme peroxidase R38 H42 [E65 H68] 0;33 2;10 Incorrect/Partial [H163 D208] correct N71 [H116 E244] 1AQ2 Phosphoenol pyruvate H232 [R65 K70 100;100 5;16 Correct/Correct carboxykinase C125 Y207 K254 E210 K212 R333 K213 H232 C233 K254 D268 D269 E270 H271 E282 C285 Y286 E311 R333 Y336] [C408 Y421 Y524] [E36 H146] 1B3R D130 K185 [C52 H54 60;100 3;20 Correct/Correct D129 D130 D189 E154 E155 N190 D189 C194 C194 C227]

138 [Y220 Y256] 1B6B Aralkylamine N- S97 L111 [E54 H127]a 30;70 5;17 Partial acetyltransferase [H120 H122]ab correct/Correct H122 L124 [H145 H174]a Y168 [E50 E52]b [E190 E192]b [C160 Y168]b 1B93 Methylglyoxyl synthase H19 G66 [H19 D71 H98 67;100 3;11 Correct/Correct D71 D91 D101] H98 D101 1BG0 R126 E225 [Y68 Y89 H90 100;100 7;20 Correct/Correct H99 R124 R229 R126 C127 R280 D192 E224 R309 E225 D226 R229 C271 R280 R309 E314 H315 R330 E335] [H185 H284] [Y145 R208] [Y134 K151] 1BOL T2 H46 E105 [Y116 Y121 67;100 4;13 Correct/Correct E128 D129 H109 D132 Y202] [E105 H109] 1BWD L-arginine:Inosamine- D108 R127 [E9 E37 Y53 86;100 5;15 Correct/Correct phosphate H87 H102 amidinotransferase D179 D103 C105 H227 R107 D108 D229 E130 D179 H227 D229 H331 H278 C279 C332 H331 C332]ab [D30 Y161]b 1BZY Hypoxanthine-guanine E133 D134 [R100 K102 80;100 4;12 Correct/Correct phosphoribosyltransferase D193] D137 [Y104 D137 K165 K165] R169 [E133 D134] 1CD5 Glucosamine-6-phosphate D72 D141 [D72 E73 Y74 25;25 3;9 Partial correct/ isomerase Y85] Partial correct H143 E148 [H19 E198] [K124 Y128] 1CHD Protein-glutamate S164 T165 [H233 E235 40;80 3;10 Partial correct / methylesterase H248] Correct H190 [H190 H256 M283 D286] D286 1COY E361 H447 [R44 C57 R65 0;33 2;10 Incorrect / Partial Y92 Y107 correct N485 Y219 Y446] [Y21 K225] [R328 Y376] 1CQQ H40 E71 [Y97 C100 0;0 2;5 Incorrect / Incorrect K153] G145 C147 1CTT [H102 E104 100;100 3;11 Correct/Correct C129 H131 E104 C132]ab [E138 H203 E229]ab [C217 Y252]b 1D0S Nicotinate-nucleotide- [D69 H70 25;75 1;8 Partial Correct / dimethylbenzimidazole E174 D242 Correct phosphoribosyltransferase E174 E317 D263] G176

139 K213 1DAA Aminotransferase class-IV E177 [R22a R98a 67;100 4;14 Correct / Correct H100a Y31b K145 L201 E32b H47b R50b Y88b Y114b K145b] [Y31a E32a H47a R50a Y88a K145a R98b] [C142 E177]ab [R22 R93]b 1DAE T11 K15 [D10ab C151ab 50;75 2;11 Correct/Correct H154ab] K37 S41 [K15 K37]ab 1DB3 GDP-mannose 4,6- T132 E134 [D13 Y128 75;100 5;16 Correct/Correct dehydratase Y177 C179 Y156 H186] K160 [D105 E134 Y156 K160] [H228 D231 E344] [K2 Y26] [E315 E317] 1DL2 Mannosyl-oligosaccharide E132 R136 [D86 E132 75;100 5;16 Correct/Correct 1,2-alpha- E207 D275 D275 E435 E279 K283 D336 H337 Y389 E399 E435 E438 E503 Y507 E526 H528] [K216 Y235 E290 Y293] [D52 H68] 1DNK I E78 H134 [E39 Y76 E78 100;100 4;10 Correct/Correct H134 D168 D212 D212 H252] H252 [R31 Y32] 1DNP Deoxyribodipyrimidine W306 [H44 E106 0;0 7;15 Incorrect / Incorrect photolyase E109 C251 W359 R278 R282 W382 D372 D374]ab [R8 D10 D130]a [E318 Y365]ab [D409 D431]ab [D327 D331]b [D354 H453]ab [K353 Y464]a 1DZR dTDP-4-dehydrorhamnose H63 D170 [R60 H63 D84 50;100 4;14 Correct/Correct 3,5-epimerase H120 Y133 Y139 E144] 1E2A IIAlac H78 Q80 [H78abc 75;75 5;21 Correct/Correct D81abc D81 H82 H82abc] [E32 H94 H95]b [E32 H94 H95]c [E32 H95]a 1EBF D219 [K117 E208 100;100 2;10 Correct/Correct D210 D213 K223 D214 D219 K223] 1FRO E172 [R37a E99a 100;100 6;27 Correct/Correct Y114a D165a D167a R122b

140 H126b E172b] [R37c E99c Y114c D165c D167c H126d E172d] [R37b E99b Y114b D165b D167b H126a E172a] [Y70 H102 E107]abcd [H126c E172c E99d] [C138b K151b] [K150d K158d] [D165d D167d] [H115ab] 1GOG C228 [C228 Y272 75;100 2;8 Correct/Correct H334 C383 Y272 Y405 H442 W190 Y495 H496 Y495 H581] [H85 D166 H522 D524] [D324 D404] 1GRC Phosphoribosylglycinamide N106 [H54ab E70ab 50;50 5;14 Correct/Correct formyltransferase H73ab E74ab H108 S135 Y78ab] D144 [H108 H137 D144]a [Y67 R90]ab [E44ab] 1HXQ UDP-glucose--hexose-1- C160 [E182 E232 50;75 5;18 Correct/Correct phosphate uridylyltransferase H281 H296 H164 H298] H166 [H115 E152 Q168 H164 H166] [R13 Y201 R211 R324] [E121 D267 H342] [D183 H292] 1I7D DNA III E7 K8 [E7 K8 H44 67;67 4;11 Correct/Correct D103 D105 R330 E107 E114 D136 Y320 D332 C333 H340 C372 D379 H381 H382 Y410 D520 E525] [H100 D113] [E286 E458] 1IDJ R176 R236 [D186 D217 50;50 3;10 Correct/Correct D221 R236 K239 D242] [H247 E272] [H178 H210] 1KAS 3-oxoacyl-[acyl-carrier C163 [C163 H303 75;100 3;14 Correct/Correct protein] synthase D311 E314 H303 K335 H340 H340 E349 C395]ab F400 [H168ab H172ab D181ab] [E115ab H118ab] 1LBA T7 Lysomsome Y46 K128 [H17 C18 Y46 50;100 5;18 Correct/Correct H47 H68 C80

141 H122 C130] 1MAS D14 N168 [H157ab 67;100 4;13 Correct/Correct E158ab H241 D192ab H195a] [D10 D14 D15 E166 H241 D242]b [D10 D14 D15 H241 D242]a [E265ab] [K44 Y92]a 1MHY C151 T213 [D74a K78a 50;100 8;25 Correct/Correct H80a E89a R171a D172a C173a R179a K45b Y46b K49b R186b D190b D196b E199b C200b D270b D418b H439b H446b D450b E454b E460b E462b R463b Y464b E465b C466b H467b E471b R45c D51c Y54c E58c E62c H112c R116c K12c D133c] [H39a C57a Y99a H109a H110a D176a E71b D75b E111b E114b D143b E144b H147b H149b C151b D170b R172b R175b E209b D242b E243b H246b] [H166 K189 H252 D256 R264]a [K104 Y162 H168 R360]b [Y112a K116a Y290a K65b] [R98 Y288 Y351]b [D243 E246]a 1MPP Mucoropepsin D32 S35 [D9 D11 E13 50;75 1;4 Correct/Correct Y75 D215 D32 D215] 1NID (Copper D98 H255 [H95a D98a 100;100 5;19 Correct/Correct Containing) H100a H135a C136a H145a H255c E279c H306c] [H95c D98c H100c H135c C136c H145c H255b E279b H306b] [H95b D98b H100b H135b C136b H145b H255a E279a

142 H306a] [E180 D182 H245 E310]abc [D251abc] [H260 Y293 R296]abc 1NSP Nucleoside-diphosphate K16 N119 [C16 H55 Y56 67;100 5;15 Correct/Correct kinase E58 R109 H122 H122 E133] 1NZY 4-Chlorobenzoyl Coenzyme F64 H90 [D160abc 40;60 4;16 Partial correct / A Dehalogenase E163abc Correct G114 E175abc W137 D178abc] D145 [D123b D145b Y150b R154b H218b C228c E232c] [C228a E232a H90c D145c Y150c R154c H218c] [D145a Y150a R154a H218a C228b E232b] [H138 D168]b 1PGS D60 E206 [Y62 R80 0;0 3;14 Incorrect / Incorrect Y116] [Y161 Y293] [Y183 K302] [H224 Y277] 1PJB K74 H95 [E8 E13 R15 75;75 3;9 Correct / Correct K72 K74 E75 E117 D269 Y93 H95 Y116 E117] 1PKN R72 R119 [R72 D112 33;67 3;12 Partial correct / E117 K269 Correct K269 T327 E271 D295 S361 E363 E299] [C316 D356 C357 E385 R444 R466] [D224 D227] [K265 Y465] 1PNL Penicillin Acylase S1 A69 [E152a K179a 0;0 4;18 Incorrect / Incorrect H192a D73b N241 D74b D76b Y180b Y190b Y196b D252b] [D12a H18a Y31a D38a R39a Y96a Y33b H38b H520b] [R145a Y27b Y31b Y52b] [R263 K394]b [D484 D501]b [R479 Y528]b [E80 H123]b [Y33 K106]a 1PUD Queuine tRNA- D102 [K55a D315b 100;100 3;10 Correct / Correct ribosyltransferase (tRNA- C318b C320b guanine transglycosylase) C323b E348b H349b] [D315a C318a C320a C323a E348a H349a K55b] [R38 R60

143 R362]ab [D102 D280]ab 1QFE 3-dehydroquinate E86 H143 [E46 E86 D114 100;100 3;12 Correct / Correct dehydratase E116 H143 K170 K170]ab [D50 H51]ab 1QPR Quinolinate K140 E201 [R139 R146 67;67 5;13 Correct / Correct phosphoribosyltransferase K150 H161 (decarboxylating) (Type II) D222 R162 K172 D173 E199 E201 D203 D222 E246] [D57 D80] 1QQ5 2-haloacid dehalogenase D8 T12 [D8 R39 Y89 56;100 4;13 Correct / Partial K147 Y153 Correct R39 N115 D176] K147 S171 [Y95 R192] N173 F175 D176 1QUM Deoxyribonuclease IV E261 [H69 D70 E94 100;100 4;12 Correct / Correct E145 C177 D179 C181 H182 H216 E261] 1RA2 I5 M20 [K38 K109 0;0 2;6 Incorrect / Incorrect D27 L28 Y111] F31 L54 I94 1UAE UDP-N-acetylglucosamine N23 C115 [K22 R91 50;100 4;14 Correct / Correct 1-carboxyvinyltransferase C115 R120 D305 E188 E190 R397 D231 E234 H299 R331 D369 R371 H394 R397 Y399 K405] 1UAG UDP-N- K115 [K115 H183 67;100 9;8 Correct / Correct acetylmuramoylalanine--D- Y187 Y194 glutamate N138 R302 K319 H183 D346 K348 C413 R425] 1ULA Purine-nucleoside H86 E89 [D134a H135a 67;67 3;11 Correct / Correct (type 1) Y166a E201c N243 E205c Y249c] [E201a E205a Y249a D134b H135b Y166b] [E201b E205b Y249b D134c H135c Y166c] [H257 E258]ac [H86 E89]abc 1UOK Oligo-1,6-glucosidase D199 E255 [Y12 Y15 Y39 100;100 6;20 Correct/ Correct D60 D64 D98 D329 H103 H161 D169 D199 E255 H283 D285 Y324 H328 D329 R332 R336 H356 Y365 E368 E369 D385E387 D416 R419 Y464 R471 Y495 R497] [D21 D29]

144 1VAO Vanillyl Y108 [Y108 D170 100;100 4;15 Correct / Correct Y187 R312 D170 D317 R398 H422 E410 E464 Y503 H466 Y503 R504] R504 [D59 H61 H422 H506] [Y148 D167 H193] [H467 C470] [H313 Y440] [Y276 Y358] 1WGI Inorganic D117 [E48 K56 E58 100;100 5;15 Correct / Correcy Y89 E101 H107 D115 D117 D120 D147 D152 Y192]a [E48 K56 E58 E101 D115 D117 D120 D147 D152 Y192]b [E123 D159 D162]b [H87ab] 1YTW Protein Tyrosine E290 D356 [C259 Y261 33;67 2;11 Partial Correct / H270 Y301 Correct H402 H350 H402 C403 R409 C403] T410 2CPO Heme Chloroperoxidase H105 [E104 H105 100;100 3;7 Correct / Correct D106 H107 E183 D113 E161 D168 E183] 2HDH 3-hydroxyacyl-CoA S137 H158 [R209a Y214ab 50;100 3;10 Correct / Correct dehydrogenase E217ab E170 N208 R220ab R224a Y242a H275a] [H158a E170b] [H158b E170a] [H266b H275b] 2HGS Glutathione Synthase R125 S151 [D24a H107b 50;100 5;21 Correct / Correct E214b R221b G369 E224b R236b R450 Y265b R267b Y270b E287b K293b C294b D296b Y432b] [H107a E214a R221a E224a R236a Y265a R267a Y270a E287a K293a C294a D296a Y432a D24b] [R125 D127 E144 K305 K364 E368 Y375 E425 R450 K452]a [R125 D127 E144 K305 K364 E368 Y375 E425 R450]b [H163 D469]ab 2JCW Superoxide H63 R143 [H46 H48 H63 50;100 5;16 Correct / Correct

145 H71 H80 D83 H120 D124]ab 2PFL Formate C-acetyltransferase W333 [D74a H84a 50;100 6;19 Correct / Correct R141ab C418 K142ab C419 H144ab G734 R174ab D180ab Y181ab R183a R218ab E221ab E222ab E225ab Y240a Y259a Y262ab K267a E368ab D413ab H498ab Y499ab H501ab D502ab D503ab Y504ab Y506ab E507ab H514ab R520ab Y594ab R595ab] [Y172 R176 R319 Y323 E400 C418 C419 R435 Y490 Y612 H704 R731 Y735]ab [H84 Y240 Y259 K267]b [C159 Y444]b [R316 D330]ab 2PLC 1-phosphatidylinositol H45 D46 [H45 D46 D82 60;100 3;11 Correct / Correct E128 D204 R84 H93 H236 D278] D278 [Y71 K115] 2THI pyridinylase C113 E241 [Y16 Y50 D64 100;100 5;16 Correct / Correct C113 E171 D175 Y222 Y239 E241 D265 Y270 D272]ab [E37ab D84ab E284ab] [Y323 Y333]b [Y180 R349]ab [H282ab] 8TLN M4 E143 [D138 H142 50;100 4;13 Correct / Correct E143 E166 H231 D170 E177 D185 E190] [K18 D72 Y76 K182]

Appendix C. The 64 protein test set used in THEMATICS-POOL

PDB Code Protein Name E.C. Number CSA Annotated Active Site Residues

1A05 1,4-Diacid decarboxylating dehydrogenase 1.1.1.85 Y140, K190, D222

1A26 ADP-ribosyltransferase 2.4.2.30 Y907, E988

1A4I Methylenetetrahydrofolate Dehydrogenase 1.5.1.5 K56

1A4S Aldehyde dehydrogenase (NAD+) / Betaine-aldehyde dehydrogenase 1.2.1.8 N166, E263, C297

1AFW Acetyl-CoA C-acyltransferase 2.3.1.16 C125, H375, C403, G405

1AKM Ornithine Carbamoyltransferase 2.1.3.3 R106, H133, Q136, D231, C273, R319

1AOP Sulphite reductase 1.8.1.2 R83, R153, K215, K217 C483

1APX Heme peroxidase 1.11.1.11 R38, H42, N71

1B6B Aralkylamine N-acetyltransferase 2.3.1.87 S97, L111, H122, L124, Y168

1BG0 Arginine Kinase 2.7.3.3 R126, E225, R229, R280, R309

1BRM Aspartate-beta-semialdehyde dehydrogenase 1.2.1.11 C135, Q162, H274

1BRW Pyrimidine-nucleoside phosphorylase 2.4.2.2 H82, R168, S183, K187

1BWD L-arginine:Inosamine-phosphate 2.1.4.2 D108, R127, D179, amidinotransferase H227, D229 H331, C332

1BZY Hypoxanthine-guanine 2.4.2.8 E133, D134, D137, phosphoribosyltransferase K165, R169

1C3J DNA beta-glucosyltransferase 2.4.1.27 E22, D100

1COY Cholesterol Oxidase 1.1.3.6 E361, H447, N485

1CQQ Picornain 3C 3.4.22.28 H40, E71, G145, C147

1D0S Nicotinate-nucleotide-dimethylbenzimidazole 2.4.2.21 E317 phosphoribosyltransferase

1D4A NAD(P)H dehydrogenase (quinone) 1.6.99.2 G149, Y155, H161

1D4C Succinate dehydrogenase (Fumarate reductase) 1.3.99.1 H364, R401, H503, R544

1DII 4-cresol dehydrogenase 1.17.99.1 Y73, Y95, E380, E427, H436, R474

1DLI UDP-glucose 6-dehydrogenase 1.1.1.22 T118, E145, K204, N208, C260, D264

1DO8 Malate dehydrogenase 1.1.1.39 Y112, K183, D278

1E2A Histidine Kinase IIAlac 2.7.1.69 H78, Q80, D81, H82

1EBF Homoserine dehydrogenase 1.1.1.3 D219, K223

1FOH Phenol 2-monooxygenase 1.14.13.7 D54, R281, Y289

1FUG Methionine adenosyltransferase 2.5.1.6 H14, K165, R244, K245, K265, K269, D271

1G72 Methanol dehydrogenase 1.1.99.8 D297

1GET Glutathione reductase 1.6.4.2 C42, C47, K50, Y177, E181, H439, E444

1GOG Galactose Oxidase 1.1.3.9 C228, Y272, W290, Y495

1GPR The IIAglc Histidine kinase 2.7.1.69 T66, H68, H83, G85

1GRC Phosphoribosylglycinamide formyltransferase 2.1.2.2 N106, H108, S135, (GARTFase II) D144

1IVH Isovaleryl-CoA dehydrogenase 1.3.99.10 E254

1JDW Glycine amidinotransferase 2.1.4.1 D254, H303, C407

1KAS 3-oxoacyl-[acyl-carrier protein] synthase 2.3.1.41 C163, H303, H340, F400

1L9F Monomeric sarcosine oxidase 1.5.3.1 H45, R49, H269, C315

1LCB Thymidylate synthase 2.1.1.45 E60, R178, C198, S219, D221, D257, H259

1LXA UDP-N-acetylglucosamine acyltransferase 2.3.1.129 H125

1MBB UDP-N-acetylmuramate dehydrogenase 1.1.1.158 R159, S229, E325

1MHL Mammalian Myeloperoxidase 1.11.1.7 Q91, H95, R239

1MLA [Acyl-carrier protein] 2.3.1.39 S92, H201, Q250 S-malonyltransferase

1MOQ Glucosamine--fructose-6-phosphate 2.6.1.16 E481, K485, E488, aminotransferase (isomerising domain) H504, K603

1MPY Extradiol Catecholic Dioxygenase 1.13.11.2 H199, H246, Y255

1NID Nitrite Reductase 1.7.99.3 D98, H255

1NSP Nucleoside-diphosphate kinase 2.7.4.6 K16, N119, H122

1OFG Glucose-fructose oxidoreductase 1.1.99.28 K129, Y217

1PFK 2.7.1.11 G11, R72, T125, D127, R171

1PJB Alanine dehydrogenase 1.4.1.1 K74, H95, E117, D269

1PKN Pyruvate Kinase 2.7.1.40 R72, R119, K269, T327, S361, E363

1PUD Queuine tRNA-ribosyltransferase (tRNA- 2.4.2.29 D102 guanine transglycosylase)

1R51 Urate Oxidase 1.7.3.3 R176, Q228

1RA2 Dihydrofolate reductase 1.5.1.3 I5, M20, D27, L28, F31, L54, I94

1UAE UDP-N-acetylglucosamine 2.5.1.7 N23, C115, D305, R397 1-carboxyvinyltransferase

1ULA Purine-nucleoside phosphorylase (type 1) 2.4.2.1 H86, E89, N243

1VAO Vanillyl Alcohol Oxidase 1.1.3.13 Y108, D170, H422, Y503, R504

1VNC Chloride peroxidase 1.11.1.10 K353, H404

1XVA Glycine N-methyltransferase 2.1.1.20 E15

1ZIO Adenylate kinase 2.7.4.3 K13, R127, R160, D162, D163, R171

2ALR Mammalian Aldehyde Reductase 1.1.1.2 Y49, K79

2BBK Methylamine dehydrogenase 1.4.99.3 D32, W57, D76, W108, Y119, T122

2CPO Heme Chloroperoxidase 1.11.1.10 H105, E183

2HDH 3-hydroxyacyl-CoA dehydrogenase 1.1.1.35 S137, H158, E170, N208

2JCW Superoxide dismutase 1.15.1.1 H63, R143

3PCA Protocatechuate dioxygenase 1.13.11.3 Y447, R457

Appendix D. The 160 protein test set used in THEMATICS-POOL

PDB Code Protein Name E.C. Number CSA Annotated Active Site Residues

12AS Aspartate--ammonia ligase 6.3.1.1 D46, R100, Q116

13PK Phosphoglycerate kinase 2.7.2.3 R39, K219, G376, G399

1A05 1,4-Diacid decarboxylating 1.1.1.85 Y140, K190, D222 dehydrogenase

1A26 ADP-ribosyltransferase 2.4.2.30 Y907, E988

1A4I Methylenetetrahydrofolate 1.5.1.5 K56 Dehydrogenase

1A4S Aldehyde dehydrogenase 1.2.1.8 N166, E263, C297 (NAD+) / Betaine-aldehyde dehydrogenase

1AE7 Phospholipase A2 (PLA2) 3.1.1.4 G30, H48, D99

1AFW Acetyl-CoA C- 2.3.1.16 C125, H375, C403, acyltransferase G405

1AH7 Phospholipase C 3.1.4.3 D55

1AKM Ornithine 2.1.3.3 R106, H133, Q136, Carbamoyltransferase D231, C273, R319

1ALK Alkaline phosphatase 3.1.3.1 S102, R166

1AOP Sulphite reductase 1.8.1.2 R83, R153, K215, K217 C483

1APX Heme peroxidase 1.11.1.11 R38, H42, N71

1APY Aspartylglucosylaminidase 3.5.1.26 T183, T201, T234, G235

1AQ2 Phosphoenol pyruvate 4.1.1.49 H232, K254, R333 carboxykinase

1AW8 Aspartate 1-decarboxylase 4.1.1.11 Y58

1B3R Adenosylhomocysteinase 3.3.1.1 D130, K185, D189, N190, C194

1B57 Fructose-bisphosphate aldolase (class II) 4.1.2.13 D109, E182, N286

1B66 6-pyruvoyl tetrahydropterin 4.6.1.10 C42, D88, H89, E133 synthase

1B6B Aralkylamine N- 2.3.1.87 S97, L111, H122, acetyltransferase L124, Y168

1B73 Glutamate racemase 5.1.1.3 D7, S8, C70, E147, C178, H180

1B93 Methylglyoxyl synthase 4.2.99.11 H19, G66, D71, D91, H98, D101

1BCR D 3.4.16.6 G53, S146, Y147, D338, H397

1BG0 Arginine Kinase 2.7.3.3 R126, E225, R229, R280, R309

1BJP 4-oxalocrotonate 5.3.2.0 P1, R39, F50 tautomerase

1BML /streptokinase 3.4.21.7 H603, S608, D646

1BOL Ribonuclease T2 3.1.27.1 H46, E105, H109

1BRM Aspartate-beta- 1.2.1.11 C135, Q162, H274 semialdehyde dehydrogenase

1BRW Pyrimidine-nucleoside 2.4.2.2 H82, R168, S183, phosphorylase K187

1BS4 3.5.1.31 G45, Q50, L91, E133

1BTL Beta-Lactamase Class A 3.5.2.6 S70, K73, S130, E166

1BWD L-arginine:Inosamine- 2.1.4.2 D108, R127, D179, phosphate H227, D229 H331, amidinotransferase C332

1BWP 2-acetyl-1- 3.1.1.47 S47, G74, N104, alkylglycerophosphocholine D192, H195

1BZY Hypoxanthine-guanine 2.4.2.8 E133, D134, D137, phosphoribosyltransferase K165, R169

1C3C Adenylosuccinate lyase 4.3.2.2 H68, H141, E275

1C3J DNA beta- 2.4.1.27 E22, D100 glucosyltransferase

1CB8 Chondroitin AC lyase 4.2.2.5 H225, Y234, R288

1CD5 Glucosamine-6-phosphate 5.3.1.10 D72, D141, H143, isomerase E148

1CHD Protein-glutamate 3.1.1.61 S164, T165, H190, methylesterase M283, D286

1CHK 3.2.1.132 E22, D40

1CHM 3.5.3.3 H232, E262, E358

1COY Cholesterol Oxidase 1.1.3.6 E361, H447, N485

1CQQ Picornain 3C 3.4.22.28 H40, E71, G145, C147

1CTT Cytidine Deaminase 3.5.4.5 E104

1D0S Nicotinate-nucleotide- 2.4.2.21 E317 dimethylbenzimidazole phosphoribosyltransferase

1D4A NAD(P)H dehydrogenase 1.6.99.2 G149, Y155, H161 (quinone)

1D4C Succinate dehydrogenase 1.3.99.1 H364, R401, H503, (Fumerate reductase) R544

1D8C 4.1.3.2 D270, E272, R338, D631

1D8H Polynucleotide 5'- 3.1.3.33 R393, E433, K456, phosphatase R458

1DAA Aminotransferase class-IV 2.6.1.21 K145, E177, L201

1DAE Dethiobiotin synthase 6.3.3.3 T11, K15, K37, S41

1DB3 GDP-mannose 4,6- 4.2.1.47 T132, E134, Y156, dehydratase K160

1DBT Orotidine-5'- 4.1.1.23 D60, K62 monophosphate decarboxylase

1DCO 4a-hydroxytetrahydrobiopterin dehydratase 4.2.1.96 H62, H63, H80, D89

1DGS NAD+ dependent DNA 6.5.1.2 K116, D118, R196, ligase K312

1DII 4-cresol dehydrogenase 1.17.99.1 Y73, Y95, E380, E427, H436, R474

1DIZ DNA-3-methyl adenine 3.2.2.21 Y222, W272, D238 II

1DL2 Mannosyl-oligosaccharide 3.2.1.113 E132, R136, D275, E435 1,2-alpha-mannosidase

1DLI UDP-glucose 1.1.1.22 T118, E145, K204, N208, C260, D264 6-dehydrogenase

1DNK Deoxyribonuclease I 3.1.21.1 E78, H134, D212, H252

1DO8 Malate dehydrogenase 1.1.1.39 Y112, K183, D278

1DQS 3-dehydroquinate synthase 4.6.1.3 H275

1DZR dTDP-4-dehydrorhamnose 5.1.3.13 H63, D170 3,5-epimerase

1E2A Histidine Kinase IIAlac 2.7.1.69 H78, Q80, D81, H82

1EBF Homoserine dehydrogenase 1.1.1.3 D219, K223

1EF8 Methylmalonyl-CoA 4.1.1.41 H66, G110, Y140 decarboxylase

1EUG 3.2.2.3 D64, H187 (Uracil DNA glycosylase)

1EYI Fructose-1,6- 3.1.3.11 D68, D74, E98 bisphosphatase

1FGH 4.2.1.3 D100, H101, H147, D165, H167, E262, H642

1FOH Phenol 2-monooxygenase 1.14.13.7 D54, R281, Y289

1FRO Lactoylglutathione lyase 4.4.1.5 E172

1FUA L-fuculose-phosphate aldolase 4.1.2.17 E73

1FUG Methionine 2.5.1.6 H14, K165, R244, adenosyltransferase K245, K265, K269, D271

1FUI 5.3.1.3 E337, D361

1G72 Methanol dehydrogenase 1.1.99.8 D297

1GET Glutathione reductase 1.6.4.2 C42, C47, K50, Y177, E181, H439, E444

1GIM Adenylosuccinate 6.3.4.4 D13, H41, Q224 synthetase

1GOG Galactose Oxidase 1.1.3.9 C228, Y272, W290, Y495

1GPM GMP synthase 6.3.5.2 G59, C86, Y87, H181, E183, D239

1GPR The IIAglc Histidine kinase 2.7.1.69 T66, H68, H83, G85

1GRC Phosphoribosylglycinamide 2.1.2.2 N106, H108, S135, formyltransferase D144 (GARTFase II)

1GTP GTP Cyclohydrolase 3.5.4.16 H112, H179

1HFS Stromelysin-1 () 3.4.24.17 E202, M219

1HXQ UDP-glucose--hexose-1- 2.7.7.12 C160, H164, H166, phosphate Q168 uridylyltransferase

1I7D DNA topoisomerase III 5.99.1.2 E7, K8, F328, R330

1IVH Isovaleryl-CoA 1.3.99.10 E254 dehydrogenase

1JDW Glycine amidinotransferase 2.1.4.1 D254, H303, C407

1KAS 3-oxoacyl-[acyl-carrier 2.3.1.41 C163, H303, H340, protein] synthase F400

1KFU m- Form II 3.4.22.17 Q99, C105, H262, N286

1KRA Urease 3.5.1.5 H219, D221, H320, R336

1L9F Monomeric sarcosine 1.5.3.1 H45, R49, H269, C315 oxidase

1LBA T7 Lysomsome 3.5.1.28 Y48, K128

1LCB Thymidylate synthase 2.1.1.45 E60, R178, C198, S219, D221, D257, H259

1LXA UDP-N-acetylglucosamine 2.3.1.129 H125 acyltransferase

1MAS Purine nucleosidase 3.2.2.1 D14, N168, H241

1MBB UDP-N-acetylmuramate 1.1.1.158 R159, S229, E325 dehydrogenase

1MHL Mammalian 1.11.1.7 Q91, H95, R239 Myeloperoxidase

1MHY Methane Monooxygenase 1.14.13.25 C151, T213

1MKA 3-hydroxydecanoyl-[acyl- 4.2.1.60 H70, V76, G79, C80, carrier protein] dehydratase D84

1MLA [Acyl-carrier protein] 2.3.1.39 S92, H201, Q250 S-malonyltransferase

1MOQ Glucosamine--fructose-6- 2.6.1.16 E481, K485, E488, phosphate aminotransferase H504, K603 (isomerising domain)

1MPP Mucoropepsin 3.4.23.23 D32, S35, Y75, D215

1MPY Extradiol Catecholic 1.13.11.2 H199, H246, Y255 Dioxygenase

1NBA Carbamoylsarcosine 3.5.1.59 D51, K144, A172, T173, C177

1NID Nitrite Reductase 1.7.99.3 D98, H255

1NSP Nucleoside-diphosphate 2.7.4.6 K16, N119, H122 kinase

1NZY Chlorobenzoate Dehalogenase 3.8.1.6 F64, H90, G114, W137, D145

1OFG Glucose-fructose 1.1.99.28 K129, Y217 oxidoreductase

1PFK Phosphofructokinase 2.7.1.11 G11, R72, T125, D127, R171

1PGS Peptide 3.5.1.52 D60, E206 Aspartylglucosaminidase

1PJB Alanine dehydrogenase 1.4.1.1 K74, H95, E117, D269

1PKN Pyruvate Kinase 2.7.1.40 R72, R119, K269, T327, S361, E363

1PS1 4.6.1.5 F77, R157, R173, N219, K226, R230, S305, H309

1PUD Queuine tRNA- 2.4.2.29 D102 ribosyltransferase (tRNA- guanine transglycosylase)

1PYA 4.1.1.22 Y62, S81, F195, E197

1PYM Phosphoenolpyruvate 5.4.2.9 G47, L48, D58, K120

1QFE 3-dehydroquinate 4.2.1.10 E86, H143, K170 dehydratase

1QPR Quinolinate 2.4.2.19 R105, K140, E201, phosphoribosyltransferase D222 (decarboxylating) (Type II)

1QQ5 2-haloacid dehalogenase 3.8.1.2 D8, T12, R39, N115, K147, S171, N173, F175, D176

1QUM Deoxyribonuclease IV 3.1.21.2 E261

1R51 Urate Oxidase 1.7.3.3 R176, Q228

1RA2 Dihydrofolate reductase 1.5.1.3 I5, M20, D27, L28, F31, L54, I94

1RBL Ribulose bisphosphate 4.1.1.39 K175, K177, K201, carboxylase D203, H294, H327

1REQ Methylmalonyl-CoA mutase 5.4.99.2 Y89, H244, K604, D608, H610

1RPT High molecular weight 3.1.3.2 R11, H12, R15, R79, H257, D258

1SMN Serratia marcescens 3.1.30.2 R87, H89, N119 nuclease

1TYF CLP Protease (clpP) 3.4.21.92 G68, S97, M98, H122, D171

1UAE UDP-N-acetylglucosamine 2.5.1.7 N23, C115, D305, R397 1-carboxyvinyltransferase

1UAG UDP-N- 6.3.2.9 K115, N138, H183 acetylmuramoylalanine--D- glutamate ligase

1ULA Purine-nucleoside 2.4.2.1 H86, E89, N243 phosphorylase (type 1)

1UOK Oligo-1,6-glucosidase 3.2.1.10 D199, E255, D329

1VAO Vanillyl Alcohol Oxidase 1.1.3.13 Y108, D170, H422, Y503, R504

1VNC Chloride peroxidase 1.11.1.10 K353, H404

1WGI Inorganic pyrophosphatase 3.6.1.1 D117

1XVA Glycine N- 2.1.1.20 E15 methyltransferase

1YTW Protein Tyrosine 3.1.3.48 E290, D356, H402, Phosphatase C403, R409, T410

1ZIO Adenylate kinase 2.7.4.3 K13, R127, R160, D162, D163, R171

2ACY Acylphosphatase 3.6.1.7 R23, N41

2ADM ADENINE-N6-DNA- 2.1.1.72 N105, P106, Y108 METHYLTRANSFERASE

2ALR Mammalian Aldehyde 1.1.1.2 Y49, K79 Reductase

2BBK Methylamine dehydrogenase 1.4.99.3 D32, W57, D76, W108, Y119, T122

2BMI METALLO-BETA- 3.5.2.6 D86, N176 LACTAMASE

2CPO Heme Chloroperoxidase 1.11.1.10 H105, E183

2HDH 3-hydroxyacyl-CoA 1.1.1.35 S137, H158, E170, dehydrogenase N208

2HGS Glutathione Synthase 6.3.2.3 R125, S151, G369, R450

2JCW Superoxide dismutase 1.15.1.1 H63, R143

2PDA 1.2.7.1 E64

2PFL Formate C-acetyltransferase 2.3.1.54 W333, C418, C419, G734

2PHK Protein /threonine 2.7.1.38 D149, K151 kinase

2PLC 1-phosphatidylinositol 3.1.4.10 H45, D46, R84, H93, phosphodiesterase D278

2THI Thiamine pyridinylase 2.5.1.2 C113, E241

3CSM Chorismate mutase 5.4.99.5 R16, R157, K168, E246

3ECA / 3.5.1.1 T12, Y25, T89, D90, K162

3PCA Protocatechuate 1.13.11.3 Y447, R457 dioxygenase

4KBP Purple Acid Phosphatase 3.1.3.2 H202, H295, H296

5COX Prostaglandin- 1.14.99.1 Q203, H207, Y385 Endoperoxide Synthase

5ENL Enolase 4.2.1.11 E168, E211, K345, H373

5FIT diadenosine P1, P3- 3.6.1.29 Q83, H94, H96 triphosphate (ApppA) hydrolase

8TLN Metalloproteinase M4 3.4.24.27 E143, H231

9PAP Thiol-protease (papain) 3.4.22.2 Q19, C25, H159, N175


Bibliography

1. Schmid, M. B., Structural Proteomics: The potential of high throughput structure determination. Trends Microbiol 2002, 10 (Suppl.), S27-S31. 2. Stultz, C. M.; White, J. V.; Smith, T. F., Structural Analysis Based on State-space Modeling. Protein Sci 1993, 2, 305-314. 3. Combet, C.; Jambon, M.; Deleage, G.; Geourjon, C., Geno3D: automatic comparative molecular modeling of protein. Bioinformatics 2002, 18, (1), 213-214. 4. Venclovas, C., Comparative Modeling of CASP4 target proteins: combining results of sequence search with three-dimensional structure assessment. Proteins Suppl 2001, 5, 47-54. 5. Lambert, C.; Leonard, N.; De Bolle, X.; Depiereux, E., ESyPred3D: Prediction of proteins 3D structures. Bioinformatics 2002, 18, (9), 1250-1256. 6. Terwilliger, T. C., Waldo, G., Peat, TS, Newman, JM, Chu, K., Berendzen, J., Class- directed structure determination: Foundation for a protein structure initiative. Protein Sci 1998, 7, 1851-1856. 7. Madern, D.; Pfister, C.; Zaccai, G., Mutation at a Single Acidic Amino Acid Enhances the Halophilic Behaviour of Malate Dehydrogenase from Haloarcula Marismortui in Physiological Salts. European Journal of 1995, 230, 1088. 8. Looger, L. L., M.A. Dwyer, J.J. Smith, and H.W. Hellinga, Computational design of receptor and sensor proteins with novel functions. Nature 2003, 423, 185-190. 9. Oshiro, C. M., I.D. Kuntz and R.M.A. Knegtel, Molecular Docking and Structure-based Design. In Encyclopedia of Computational Chemistry, Schleyer, P. v. R., Ed. Wiley: Chichester, West Sussex, U.K, 1998; pp 1606-1613. 10. Lichtarge, O., Sowa, M.E., A. Philippi, Evolutionary traces of functional surfaces along signaling pathway. Methods in Enzymology 2002, 344, 536-556. 11. Lima, C. D., M.G. Klein, and W.A. Hendrickson, Structure-based analysis of catalysis and substrate definition in the HIT . Science 1997, 278, 286-290. 12. Marcotte, E. M., Pellegrini, M., Thompson, M. J., Yeates, T. O., Eisenberg, D., A combined algorithm for genome-wide prediction of protein function. Nature 1999, 402, (6757), 83-6. 13. Marcotte, E. M., Computational genetics: finding protein function by nonhomology methods. Curr Opin Struct Biol 2000, 10, (3), 359-65. 14. Mattos, C.; Ringe, D., Locating and characterizing binding sites on proteins. Nat Biotechnol 1996, 14, (5), 595-9. 15. Mlinsek, G., Novic, M., Hodoscek, M., Solmajer, T., Prediction of enzyme binding: human inhibition study by quantum chemical and artificial intelligence methods based on x-ray structures. J Chem Inf Comput Sci 2001, 41, (5), 1286-94. 16. Ondrechen, M. J., J.G. Clifton and D. Ringe, THEMATICS: A simple computational predictor of enzyme function from structure. Proc. Natl. Acad. Sci. (USA) 2001, 98, 12473- 12478. 17. Elcock, A. H., Prediction of functionally important residues based solely on the computed energetics of protein structure. J Mol Biol 2001, 312, 885-896.

161 18. Gutteridge, A., G. Bartlett, and J.M. Thornton, Using a neural network and spatial clustering to predict the location of active sites in enzymes. Journal of Molecular Biology 2003, 330, 719-734. 19. Innis, C. A., A.P. Anand, and R. Sowdhamini, Prediction of functional sites in proteins using conserved functional group analysis. Journal of Molecular Biology 2004, 337, 1053-1068. 20. Laurie, A. T. R., and R.M. Jackson, Q-SiteFinder: An energy-based method for the prediction of protein-ligand binding sites. Bioinformatics 2005, 21, 1908-1916. 21. Ota, M.; K. Kinoshita; Nishikawa, K., Prediction of catalytic residues in enzymes based on known tertiary structure, stability profile, and sequence conservation. Journal of Molecular Biology 2003, 327, 1053-1064. 22. Ondrechen, M. J., THEMATICS as a tool for functional genomics. Genome Informatics 2002, 13, 563-564. 23. Ondrechen, M. J., Identification of functional sites based on prediction of charged group behavior. In Current Protocols in Bioinformatics, Baxevanis, A. D.; Davison, D. B.; Page, R. D. M.; Petsko, G. A.; Stein, L. D.; Stormo, G. D., Eds. John Wiley & Sons: Hoboken, N.J., 2004; pp 8.6.1 - 8.6.10. 24. Ko, J., L.F. Murga, P. Andre, H. Yang, M.J. Ondrechen, R.J. Williams, A. Agunwamba, and D.E. Budil, Statistical Criteria for the Identification of Protein Active Sites Using Theoretical Microscopic Titration Curves. Proteins: Structure Function Bioinformatics 2005, 59, 183-195. 25. Kaelbling, L.; Littman, M.; Moore, A., Reinforcement Learning: A Survey. J. of Artificial Intelligence Research. 1996, 4, 237-285. 26. Mitchell, T. M., Machine Learning. McGraw-Hill: New York, 1997. 27. Landau M.; Mayrose I.; Rosenberg Y.; Glaser F.; Martz E.; T., P.; N., B.-T., ConSurf: identification of functional regions in proteins by surface-mapping of phylogenetic information. . Nucl. Acids Res. 2005, 33, W299-W302. 28. Pupko, T., R.E. Bell, I. Mayrose, F. Glaser, & N. Ben-Tal, Rate4Site: An algorithmic tool for the identification of functional regions in proteins by surface mapping of evolutionary determinants within their homologues. Bioinformatics 2002, 18, S71-S77. 29. Fetrow, J. S., Siew, N., Di Gennaro, J. A., Martinez-Yamout, M., Dyson, H. J., Skolnick, J., Genomic-scale comparison of sequence- and structure-based methods of function prediction: does structure provide additional insight? Protein Sci 2001, 10, (5), 1005-14. 30. Lichtarge, O., H. R. Bourne and F. E. Cohen., An evolutionary trace method defines binding surfaces common to protein families. J Mol Biol 1996, 257, (2), 342-58. 31. Yao, H., D.M. Kristensen, I. Mihalek, M.E. Sowa, C. Shaw, M. Kimmel, L. Kavraki, & O. Lichtarge, An accurate, sensitive, and scalable method to identify functional sites in proteins. J Mol Biol 2003, 326, 255-261. 32. Cheng, G.; Qian, B.; Samudrala, R.; Baker, D., Improvement in protein functional site prediction by distinguishing structural and functional constraints on protein family evolution using computational design. Nucleic Acid Research 2005, 33, (18), 5861-5867. 33. Devos, D.; Valencia, A., Practical limits of function prediction. Proteins: Structure, Functing and Genetics 2000, 4, 98-107. 34. Wilson, M. A., C.V. St. Amour, J.L. Collins, D. Ringe and G.A. Petsko, The 1.8 A resolution crystal structure of YDR533Cp from Saccharomyces cerevisiae: A member of the DJ- 1/ThiJ/PfpI superfamily. Proc Natl Acad Sci U S A 2004, 101, 1531-1536.

162 35. Amitai, G.; Shemesh, A.; Sitbon, E.; Shklar, M.; Netanely, D.; Venger, I.; Shmuel, P., Network Analysis of Protein Structures Identifies Functional Residues Journal of Molecular Biology 2004, 344, (4), 1135-1146. 36. Laskowski, R. A., SURFNET: A program for visualizing molecular surfaces, cavities and intermolecular interactions. J Mol Graph 1995, 13, 323-330. 37. Dundas, J.; Ouyang, Z.; Tseng, J.; Binkowski, A.; Turpaz, Y.; Liang, J., CASTp: computed atas of surface topography of proteins with structural and topographical mapping of functionally annotated residues. . Nucl. Acids Res. 2006, 34, W116-W118. 38. Xie, L.; Bourne, P. E., A robust and efficient algorithm for the shape description of protein structures and its application in predicting ligand binding sites. BMC Bioinformatics 2007, 8, s4-s9. 39. Petrova, N.; Wu, C., Prediction of catalytic residues using Support Vector Machine with selected protein sequence and structural properties. BMC Bioinformatics 2006, 7, (1), 312. 40. Youn, E.; Peters, B.; Radivojac, P.; Mooney, S. D., Evaluation of features for catalytic residue prediction in novel folds. Protein Sci. 2007, 16, 216-226. 41. Duda, R. O.; Hart, P. E.; Stork, D. G., Pattern Classification. Wiley: New York, 2001; p 654. 42. Belur, V., Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques,. 1991. 43. Vapnik, V., Statistical Learning Theory. Springer-Verlag: New York, 1998. 44. Schlkopf, B.; Smola, A., Learning with Kernels. MIT Press: Cambridge, MA, 2002. 45. Tong, W.; Williams, R. J.; Wei, Y.; Murga, L. F.; Ko, J.; Ondrechen, M. J., Enhanced performance in prediction of protein active sites with THEMATICS and support vector machines. Protein Sci. 2008, 17, 333-341. 46. Freund, Y.; Schapire, R., A short introduction to boosting. J of Japanese Society for Artificial Intelligence. 1999, 14, (5), 771-780. 47. Schapire, R. E., The Strength of Weak Learnability. Machine Learning 1990, 5, (2), 197- 227. 48. Matthews, B. W., Comparison of the predicted and observed secondary structure of T4 phage . Biochim. Biophys. Acta 1975, 405, 442-451 49. Yang, A. S., Gunner, M. R., Sampogna, R., Sharp, K., Honig, B., On the calculation of pKas in proteins. Proteins 1993, 15, (3), 252-65. 50. Bashford, D.; Karplus, M., Multiple-site Titration Curves of Proteins: An Analysis of Exact and Approximate Methods for Their Calculation. J. Phys. Chem. 1991, 95, 9556-9561. 51. Warwicker, J.; Watson, H. C., Calculation of the electric potential in the active site cleft due to alpha-helix dipoles. J Mol Biol 1982, 157, (4), 671-9. 52. Antosiewicz, J., Briggs, J.M., Elcock, A.H., Gilson, M.K., and McCammon, J.A., Computing the Ionization States of Proteins with a Detailed Charge Model. J. Comp. Chem. 1996, 17, 1633-1644. 53. Di Cera, E., S.J. Gill, and J. Wyman, Binding Capacity: and buffering in biopolymers. Proc Natl Acad Sci U S A 1988, 85, 449-452. 54. Wei, Y.; Ko, J.; Murga, L.; Ondrechen, M. J., Selective prediction of Interaction sites in protein structures with THEMATICS. BMC Bioinformatics 2007, 8, 119. 55. Yang, T. Statistical applications for structure-based protein function prediction. Northeastern University, Boston, 2007. 56. Joachims, T., Making large-Scale SVM Learning Practical. Advances in Kernel Methods. MIT-Press: Cambridge, MA, 1999.

163 57. Bartlett, G. J., C.T. Porter, N. Borkakoti, and J.M. Thornton, Analysis of Catalytic Residues in Enzyme Active Sites. J Mol Biol 2002, 324, 105-121. 58. Wei, Y. Computed Electrostatic Properties of Protein 3D Structure for Functional Annotation and Biomedical Application. Northeastern University, Boston, 2007. 59. Patterson, W. R.; Poulos, T., Crystal Structure of Recombinant Pea Cytosolic . Biochemistry 1995, 34, (13), 4331-4341. 60. Edwards, S. L.; Xuong, N. h.; Hamlin, R. C.; Kraut, J., Crystal Structure of Cytochrome c Peroxidaes Compound I. Biochemistry 1987, 26, 1503-1511. 61. Gourley, D. G.; Shrive, A. K.; Polikarpov, I.; Krell, T.; Coggins, J. R.; Hawkins, A. R.; Isaacs, N. W.; Sawyer, L., The two types of 3-dehydroquinase have distinct structures but catalyze the same overall reaction. Nat Struct Biol. 1999, 6, 521-525. 62. Sobolev, V., A. Sorokine, J. Prilusky, E.E. Abola, and M. Edelman, Automated analysis of interatomic contacts in proteins. Bioinformatics 1999, 15, 327-332. 63. Moser, J.; Gerstel, B.; Meyer, J. E.; Chakraborty, T.; Wehland, J.; Heinz, D. W., Crystal structure of the phosphatidylinositol-specific phospholipase C from the human pathogen Listeria monocytogenes. . J. Mol. Biol. 1997, 273, 269-282. 64. Bate, P., and J. Warwicker, Enzyme/non-enzyme discrimination and prediction of enzyme active site location using charge-based methods. J Mol Biol 2004, 340, 263-276. 65. Perlich, C.; Provost, F.; Simonoff, J., Tree induction vs. logistic regression: a learning- curve analysis. The Journal of Machine Learning Research 2003, 4, 211-255. 66. Domingos, P.; Pazzani, M., Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier. Machine Learning 1997, 29, 103-130. 67. Best, M. J.; Chakravarti, N., Active set algorithms for isotonic regression; a unifying framework. Mathematical Programming 1990, 47, 425–439. 68. Barlow, R. E.; Bartholomew, D. J.; Bremmer, J. M.; Brunk, H. D., Statistical Inference under Order Restrictions. Wiley: 1972. 69. Ramsay, J. O., Estimating smooth monotone functions. Journal of the Royal Statistical Society, Series B 1998, 60, 365-375. 70. Bradley, A. P., The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition 1997, 30, (7), 1145-1159. 71. Wilcoxon, F., Individual comparisons by ranking methods. Biometrics 1945, 1, 80-83. 72. Madura, J. D., J.M. Briggs, R.C. Wade, M.E. Davis, B.A. Luty, A. Ilin, J. Antosiewicz, M.K. Gilson, B. Bagheri, L.R. Scott, & J.A. McCammon, Electrostatics and diffusion of molecules in solution - Simulations with the University of Houston Brownian Dynamics program. Comp Phys Commun 1995, 91, 57-95. 73. Gilson, M. K., Multiple-site titration and molecular modeling: two rapid methods for computing energies and forces for ionizable groups in proteins. Proteins 1993, 15, (3), 266-82. 74. Porter, C. T.; Bartlett, G. J.; Thornton, J. M., The Catalytic Site Atlas: a resource of catalytic sites and residues identified in enzymes using structural data. . Nucl Acids Res 2004, 32, D129-133. 75. Edgar, R. C., MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research. 2004, 32, 1792-1797. 76. Kinoshita, K.; Ota, M., P-cats: prediction of catalytic residues in proteins from their tertiary structures Bioinformatics 2005, 21, 3570-3571.
