MATCH MATCH Commun. Math. Comput. Chem. 58 (2007) 569-590 Communications in Mathematical and in Computer Chemistry ISSN 0340 - 6253

An improved branch and bound algorithm for the maximum clique problem Janez Konc and Duˇsanka Janeˇziˇc∗

National Institute of Chemistry, Hajdrihova 19, SI-1000 Ljubljana, Slovenia

(Received June 21, 2007)

Abstract

A new algorithm for finding a maximum clique in an undirected graph is described. An approximate coloring algorithm has been improved and used to provide bounds to the size of the maximum clique in a basic algorithm which finds a maximum clique. This basic algorithm was then extended to include dynamically varying bounds. The resulting algorithm is significantly faster than the comparable algorithm.

∗ Author to whom correspondence should be addressed email: [email protected] telephone: 00386-14760200 fax: 00386-14760300 - 570 -

1 Introduction

A clique is a subset S of vertices in a graph G such that each pair of vertices in S is connected by an edge. The maximum clique problem is the problem of finding in a given graph the clique with the largest number of vertices. The algorithms for finding a maximum clique are frequently used in chem- ical information, bioinformatics and computational biology applications [1], where their main application is to search for similarity between molecules.

These algorithms are used for screening of compounds to filter out molecules that are similar to known biologically active molecules and are feasible to be active themselves [2]. Also, these algorithms are used for comparing protein structures, to provide the information about protein function [3] and also the information about possible interactions between proteins [4, 5].

Searching for the maximum clique is often the bottle-neck computational step in these applications. The maximum clique problem is NP-hard [6], and probably no polynomial time algorithm will be possible, but improvements to the existing algorithms can still be effective. Exact algorithms, which can be guaranteed to find the maximum clique, usually use a branch-and-bound ap- proach to the maximum clique problem [7, 8, 9, 10], searching systematically through possible solutions and applying bounds to limit the search space. The tightest bounds come from the vertex-coloring method. This method assigns colors to vertices, so that no two adjacent vertices of a graph G are colored with the same color. The number of colors is the upper bound to the size of the maximum clique in graph G. Vertex-coloring is also known to be NP-hard [6], so a graph can only be colored approximately.

In this paper, we present improvements to an approximate coloring algo- rithm [8]. We use this algorithm in a basic algorithm for finding a maximum clique [7, 8], where the coloring algorithm provides upper bounds to the size of the maximum clique. The effects on the speed of the maximum clique algo- - 571 -

rithm of varying the tightness of upper bounds dynamically during the search are investigated experimentally. We make use of this concept of varying up- per bounds to enhance the performance of this algorithm in agreement with our modified approximate coloring algorithm. The idea is that the tightest and the most computationally demanding upper bounds should be calculated close to the root of the recursion tree of the branch-and-bound algorithm, while on subsequent levels, where the majority of the search takes place, more relaxed and less computationally expensive bounds should be used. Our algorithm has been tested on random graphs and benchmark graphs, which were developed as part of the Second DIMACS Challenge [11], and is compared to the recent leading algorithm [8]. The improvements to the approximate coloring algorithm, together with the dynamical use of upper bounds, reduce the number of steps required to find the maximum clique and improve the run time of the algorithm by as much as an order of magnitude on dense graphs, while preserving its superior performance on sparse graphs.

2 Theory

2.1 Notations

An undirected graph G =(V,E) consists of a set of vertices V = {1, 2, ..., n} and a set of edges E ⊆ V × V . Two vertices v and w are adjacent, if there exists an edge (v, w) ∈ E. For a vertex v ∈ V , a set Γ(v) is the set of all vertices w ∈ V that are adjacent to the vertex v. |Γ(v)| is the degree of vertex v. The maximum degree in G is denoted as Δ(G). Let G(R)=(R, E∩R×R) be the subgraph induced by vertices in R, where R is a subset of V . The density of a graph is calculated as D = |E|/(|V |·(|V |−1)/2). The number of vertices in a maximum clique is denoted by ω(G). - 572 -

2.2 The basic algorithm

A well known basic algorithm for finding a maximum clique [8] (MaxClique) is shown in Figure 1.

Procedure MaxClique(R, C) 1. while R = ∅ do 2. choose a vertex p with a maximum color C(p) from set R; 3. R := R\{p}; 4. if |Q| + C(p) > |Qmax| then 5. Q := Q ∪{p}; 6. if R ∩ Γ(p) = ∅ then 7. obtain a vertex-coloring C of G(R ∩ Γ(p)); 8. MaxClique(R ∩ Γ(p), C); 9. else if |Q| > |Qmax| thenQmax := Q; 10. Q := Q\{p}; 11. else return 12. end while

Figure 1. The basic maximum clique algorithm.

The algorithm MaxClique maintains two global sets Q and Qmax, where Q consists of vertices of the currently growing clique and Qmax consists of vertices of the largest clique currently found. The algorithm starts with an empty set Q, and then recursively adds vertices to (and deletes vertices from) this set, until it can verify that no clique with more vertices can be found. The next vertex to be added to Q is selected from the set of candidate vertices R ⊆ V , which is initially set to R := V . At each step, the algorithm selects a vertex p ∈ R with the maximum color C(p) among the vertices in R, and deletes it from R. C(p) is the upper bound to the size of the maximum clique in the resulting set R. If the sum |Q| + C(p) indicates that a clique larger than the one currently in Qmax can be found in R, then vertex p is added to the set Q. The new candidate set R ∩ Γ(p) with the corresponding vertex- coloring C is calculated and then passed as parameters to the recursive call to the MaxClique procedure. If Rp = ∅ and |Q| > |Qmax|, i.e., the current clique is larger than the currently largest clique found, then the vertices of - 573 -

Q are copied to Qmax. The algorithm then backtracks by removing p from Q and then selects the next vertex from R. This procedure continues until R = ∅.

2.3 Approximate coloring algorithm

The approximate coloring algorithm introduced in Tomita and Seki (2003), provides vertex-coloring in the MaxClique algorithm. All vertices are colored in the candidate set one by one in the order in which they appear in this set.

The algorithm inserts each vertex v ∈ R into the first possible color class Ck, so that v is non-adjacent to all the vertices already in this color class. If the current vertex v has at least one adjacent vertex in each color class C1, ..., Ck, then a new color class Ck+1 is opened and vertex v is inserted here. After all vertices in R have been assigned to their respective color classes, these vertices are then copied from the color classes as they appear in each color class Ck, and in the increasing order with respect to index k, back to R.In this process, a color C(v)=k is assigned to each vertex v ∈ R. The outputs of the algorithm are the new set R and the vertex-coloring C, where colors in the set C correspond to vertices in R.

The number of color classes, which is the upper bound to the size of the maximum clique in the graph induced by R, depend heavily on how vertices are presented to this algorithm. The upper bound is tighter (lower), when vertices in R are considered in a non-increasing order with respect to their degree in G [8, 12].

2.4 Improved approximate coloring algorithm

We have improved the algorithm described above, which is referred to as the original approximate coloring algorithm, so as to maintain the non-increasing order of vertices in the candidate set R. This means that the order of vertices - 574 -

in R after the application of our coloring algorithm to this set is the same to the order of vertices in R prior to the algorithm start. This is not the case with the original approximate coloring algorithm, which orders the vertices in R by their colors, so that the MaxClique algorithm can at each step select a vertex with a maximum color from the set R, which is conveniently the last vertex in this set. We observe that vertices v ∈ R with colors C(v) < |Qmax|− |Q| + 1, need not be ordered by their colors as the MaxClique algorithm will never add these vertices to the current clique Q (line 4, Figure 1). An inherent property of these vertices is that their colors are lower than a certain color, which we denote kmin. We introduce a counter j of such vertices. At the start of the approximate coloring algorithm we calculate color kmin := |Qmax|−|Q| + 1 and we set j := 0. If kmin ≤ 0, then we set kmin := 1, because colors are positive numbers. When in the loop a vertex v at the i − th position in R is assigned to a color class Ck, we test if k

2.5 Coloring example

Figure 3 depicts an undirected graph. The candidate set of vertices in the non-increasing order with respect to their degrees (in parentheses) is R = {7(5), 1(4), 4(4), 2(3), 3(3), 6(3), 5(2), 8(2)}. This set is the input to the ap- proximate coloring algorithm. In Table 1 vertices of the example graph are assigned to color classes; this procedure is the same for both approximate coloring algorithms. - 575 -

Procedure ColorSort(R, C) 1. max no := 1; 2. kmin := |Qmax|−|Q| +1; 3. if kmin ≤ 0 then kmin := 1; 4. j := 0; 5. C1 := ∅; C2 := ∅; 6. for i := 0 to |R|−1 do 7. p := R[i]; {the i-th vertex in R} 8. k := 1; 9. while Ck ∩ Γ(p) = ∅ do 10. k := k +1; 11. if k > maxno then 12. maxno := k; 13. Cmaxno+1 := ∅; 14. end if 15. Ck := Ck ∪{p}; 16. if k

Figure 2. Improved approximate coloring algorithm. - 576 -

8

71

6 2

5 3

4

Figure 3. Example of an undirected graph. A maximum clique (ω =3)is depicted with its edges emphasized.

After assigning vertices to color classes, the original coloring algorithm copies vertices from these classes back to the candidate set, which becomes R = {7(5), 5(2), 1(4), 6(3), 8(2), 4(4), 2(3), 3(3)} with the respective coloring C = {1, 1, 1, 2, 2, 3, 3, 3}. Compared to the set R at the start of this coloring algorithm in this new set R vertices are less ordered with respect to their degrees. We expect that when the original coloring algorithm is used with the MaxClique algorithm, the disorder of vertices in the resulting candidate sets will accu- mulate on the following levels of the recursion. The original approximate coloring algorithm no longer considers vertices which are sorted by their de- grees, which is not efficient.

In the case of our ColorSort algorithm, we set as an example |Qmax| := 2 and |Q| := 0. kmin is calculated by kmin := 2 − 0 + 1 = 3. In the loop where the assignment of vertices to color classes takes place, our al- gorithm shifts vertices with colors k<3 to the front of the set R. After all the vertices have been assigned to color classes, the partial new candi- date set is R = {7, 1, 6, 5, 8}. The remaining vertices are in color classes with k ≥ 3, and in this case only vertices from the color class C3, are - 577 -

Table 1. Vertices of the example graph assigned to color classes with the approximate coloring algorithm. In each row are vertices of the color class Ck, where index k ∈ N is a color of these vertices. The degrees are in parentheses.

kCk 17(5) 5(2) 21(4) 6(3) 8(2) 34(4) 2(3) 3(3)

copied to R in the order in which they appear in this color class. The fi- nal candidate set is then R = {7(5), 1(4), 6(3), 5(2), 8(2), 4(4), 2(3), 3(3)} and the coloring C = {−, −, −, −, −, 3, 3, 3}, where a − indicates that no color has been assigned to the corresponding vertex in set R. It can be seen that the vertices in this candidate set {7(5), 1(4), 6(3), 5(2), 8(2)} are now in decreas- ing order with respect to their degree, in contrast to the set R obtained with the original approximate coloring algorithm, where the same vertices {7(5), 5(2), 1(4), 6(3), 8(2)} are not ordered. In our computational experiments on various graphs, the number of vertices in the candidate sets with k

2.6 Dynamic coloring

Until now, in the MaxClique algorithm the calculation of the degrees and sorting of vertices was performed only once with the initial set of vertices V . The coloring algorithms considered vertices in the candidate set R sorted by their degrees in G. An alternative to this is to recalculate at each and every step of the MaxClique algorithm the degrees of vertices in R in the graph induced by these vertices, i.e., G(R), and sort these vertices in a non- increasing order with respect to their degrees in G(R). Then the ColorSort algorithm considers vertices in R sorted by their degrees in the induced graph - 578 -

G(R) rather than in G. The upper bounds given by this coloring algorithm are then as tight as possible with this approach. The number of steps required to find the maximum clique is reduced to the minimum, but the overall running time of the MaxClique algorithm does not improve, because of the computational expense O(|R|2) of the determination of the degrees and sorting of vertices in R.

We assume that improvement in the performance can be achieved by sorting vertices by their degrees in G(R) only when the candidate set R is suffi- ciently large. Obviously, set R is larger on initial levels of the MaxClique algorithm. With the level of the MaxClique algorithm we denote the num- ber of branches (recursive calls) from the root to the current leaf of the recursion tree. This is because, for large candidate sets the computational expense related to the computation of tighter bound is much smaller than the cost of investigating false solutions, which arise when applying less tight bounds. The same is not true for small candidate sets, where tighter bounds are much less effective in reducing the redundant searching and the compu- tational expense related to the calculation of tighter upper bounds becomes significant.

The number of levels up to which the calculation of the degrees and sorting improves the speed of the maximum clique algorithm has to be determined dynamically, during the search for the maximum clique. For example, in dense graphs, maximum cliques are generally larger than in sparse graphs of equal size. We expect that the number of levels up to which tighter bounds should be used will be higher for dense than for sparse graphs. We also expect that for large graphs this number will be higher than for small graphs of equal density, as larger maximum cliques are generally found in large graphs. In Figure 4 is shown the MaxCliqueDyn algorithm, which we explain in details below.

We introduce global variables S[level] and Sold[level], which hold the sum of steps the MaxCliqueDyn algorithm performs from the root node up to - 579 -

Procedure MaxCliqueDyn(R, C, level) 1. S[level]:=S[level]+S[level − 1] − Sold[level]; 2. Sold[level]:=S[level − 1]; 3. while R = ∅ do 4. choose a vertex p with maximum C(p) (last vertex) from R; 5. R := R\{p}; 6. if |Q| + C[index of p in R] > |Qmax| then 7. Q := Q ∪{p}; 8. if R ∩ Γ(p) = ∅ then 9. if S[level]/ALL STEPS < Tlimit then 10. calculate the degrees of vertices in G(R ∩ Γ(p)); 11. sort vertices in R ∩ Γ(p) in a descending order 12. with respect to their degrees; 13. end if 14. ColorSort(R ∩ Γ(p),C) 15. S[level]:=S[level]+1; 16. ALL STEPS := ALL STEPS +1; 17. MaxCliqueDyn(R ∩ Γ(p),C, level + 1); 18. else if |Q| > |Qmax| then Qmax := Q; 19. Q := Q\{p}; 20. else return 21. end while

Figure 4. Maximum clique algorithm with dynamically varying upper bounds. and including the current level and the sum of steps up to and including the previous level, respectively. We introduce T [level], which is the fraction of steps up to the current level among all the steps completed so far. We recalculate T [level]=S[level]/ALL STEPS on each and every step, where ALL STEPS is a global counter of steps, which is increased by 1 at each step of the MaxCliqueDyn algorithm. With a new parameter, which we call Tlimit, we can limit the use of tighter bounds to certain levels. While T [level]

At each level the MaxCliqueDyn algorithm first updates the sum of steps - 580 -

up to this level by S[level]:=S[level]+S[level − 1] − Sold[level] and the sum of steps up to the previous level by Sold[level]:=S[level − 1]. Each time the algorithm advances to the next recursive level, S[level] is increased by 1. An example of this calculation is shown in Table 2.

Table 2. Counting the steps. Columns represent levels of the recursion. The path of the algorithm is represented by right arrows (recursive calls) and down left facing arrows (backtracks). The notation is S[level](Sold[level]). Number S[3] = 5 is calculated as S[3] = S[3] + S[2] − Sold[3] = 2 + 4 − 2=4 and, because a recursive call follows (right arrow), S[3] = S[3]+1=5. Sold[3] = S[2] = 4.

level 1 2 3 . . . 1(0) → 2(1) → 2(2)

2(1)

2(0) → 4(2) → 5(4) → . .

2.7 Experimental determination of the Tlimit parameter

We determined the parameter Tlimit by experiments on random graphs. We construct a random graph with n vertices by inserting an edge with prob- ability p between each pair of its vertices. We also construct 10 graphs for each size n and probability p, where n is in the interval 100 − 500 and p is in the interval 0.2 − 0.99. These random graphs were then the input to the MaxCliqueDyn algorithm. For each graph, we let MaxCliqueDyn al- gorithm find the maximum clique multiple times, each time with a different parameter Tlimit in the interval 0.0to1.0. When this parameter was set to Tlimit =0.0, no tight bounds were used, as opposed to when Tlimit was set to 1.0, when the tight bounds were used at every step. We plotted the time to find the maximum clique against Tlimit for each n and p and the scaled plots for some of the random graphs and for some of the DIMACS graphs - 581 -

are shown in Figures 5 and 6, respectively.

All the curves for dense graphs in these plots exhibit a minimum when Tlimit is close to 0.05. For sparse graphs the optimal Tlimit is 0.0, but in almost all these cases the additional calculations make the algorithm less than 10% slower (see Table 3).

1

0.9 n=100 p=0.9 n=150 p=0.9 n=200 p=0.9 0.8

0.7 time [s] 0.6

0.5

0.4

0.3 0.05 0.5 1

Tlimit

0.0 Figure 5. The graph showing the effect of varying Tlimit on the time MaxCliqueDyn algorithm requires to find a maximum clique in random graphs with 100, 150, and 200 vertices with p =0.9. A logarithmic scale is used on the x - axis.

We choose Tlimit =0.025 as higher values of this parameter increase the time of the calculation for sparse graphs and a lower Tlimit makes the algorithm slower on very dense graphs. Other values of Tlimit could be chosen in an in- terval (see Table 3) with little change in the time needed to find a maximum clique with the MaxCliqueDyn algorithm. The choice of Tlimit depends on the way the degrees and sorting of vertices are calculated. We calculate the degrees from ground up and use very simple O(|R|2) implementations of the sorting function. With this parameter set to Tlimit =0.025, the computa- tionally expensive calculations of degrees and sorting are performed on about 2.5% of the steps of the MaxCliqueDyn algorithm, although the true num- ber of such steps may vary, because T [level] is calculated dynamically during the search. Adapting to the local nature of the graph under consideration

T [level] can raise above or fall below Tlimit in the search process, thereby - 582 -

1

0.9 brock200-1 p-hat300-3 sanr200-0.9 0.8

0.7

0.6 time [s]

0.5

0.4

0.3

0.2 0.0 0.05 0.5 1

Tlimit

Figure 6. The graph showing the effect of varying Tlimit on the time MaxCliqueDyn algorithm requires to find a maximum clique in some of the DIMACS graphs. A logarithmic scale is used on the x - axis. disallowing or permitting the calculation of tighter bounds.

Table 3. Intervals of values of the parameter T , where the run-time is within 10% of the minimum time. The sizes n are in the upper row and the proba- bilities p of graphs are in the left-most column. Where a clear minimum was observed, it is indicated in boldface.

p\n 100 150 200 300 500 0.2 0-1 0-0.2 0-0.2 0-0.1 0-0.03 0.3 0-0.5 0-0.2 0-0.5 0-0.03 0-0.005 0.4 0-0.2 0-0.05 0-0.035 0-0.015 0-0.02-0.1 0.5 0-0.05 0-0.03 0-0.01-0.2 0.001-0.035-0.1 0.0001-0.02-0.2 0.6 0-0.05 0-0.1 0.005-0.03-0.07 0.001-0.02-0.2 0.005-0.035-0.07 0.7 0.01-0.035-0.2 0.005-0.025-0.1 0.015-0.02-0.2 0.005-0.05-0.2 -- 0.8 0.005-0.03-0.1 0.01-0.03-0.1 0.005-0.05-0.2 0.005-0.05-0.2 -- 0.9 0.015-0.05-0.2 0.02-0.05-0.2 0.025-0.05-0.2 -- -- 0.95 0.01-0.035-0.07 0.01-0.03-0.1 0.015-0.05-0.1 -- -- 0.99 0-1 0-1 0-1 0-1 0-1

2.8 Initialization

We set Q := ∅, Qmax := ∅ and ALL STEPS := 1. The elements of sets S and Sold are set to 0. We calculate the degrees of vertices V in graph G and sort these vertices in a non-increasing order with respect to these degrees. - 583 -

The first Δ(G) vertices in V are colored with numbers 1...Δ(G) and the rest of the vertices in V are assigned a color Δ(G). The first candidate set V , to- gether with its coloring C, is then an input to the MaxCliqueDyn procedure. This initialization follows the standard procedure described elsewhere [8].

3 Experimental results

Our MaxCliqueDyn algorithm was implemented in C. Computational ex- periments were performed on a 1.6GHz AMD Opteron processor with Linux . We also implemented Tomita’s maximum clique finding algorithm MCQ [8]. First, we compared the basic maximum clique algorithm MaxClique with our improved approximate coloring algorithm ColorSort to the MCQ algorithm. Then we compared our MaxCliqueDyn algorithm to the MCQ algorithm on random graphs, and finally, we performed the same comparison on DIMACS graphs. Our user times for the DIMACS machine benchmark graphs r100.5-r500.5 are 0.00, 0.04, 0.40, 2.48, and 9.45 seconds, respectively. We set the limit for the time available to the algorithms to 21600 seconds (= 6h).

Our improved approximate coloring algorithm ColorSort reduces the number of steps to find the maximum clique on random graphs when compared to the original approximate coloring algorithm, when both algorithm are used to provide upper bounds in the basic algorithm MaxClique. The reduction is most apparent for difficult graphs with p in the interval 0.7 − 0.95, e.g., for random graphs with 200 vertices and p =0.95 the number of steps is reduced 2.1 times. For graphs with higher or lower density the number of steps is reduced less. However these graphs in our random test set can be solved in very few steps and because no additional calculation is associated with the modifications to the original approximate coloring algorithm, the computing time decreases proportionately with the numbers of steps. The times and numbers of steps for the MaxClique algorithm using ColorSort - 584 -

algorithm are shown in Table 4.

We have determined the effect of our dynamic calculation of upper bounds on the performance of the MaxCliqueDyn algorithm, compared to the MCQ algorithm on random graphs. Table 4 shows average CPU times for random graphs obtained by both algorithms.

The data in Table 4 suggest that our MaxCliqueDyn algorithm is 1.3 − 12 times faster than the MCQ algorithm on random graphs with probabilities in the interval 0.7 − 0.95; the speed-up tends to increase for larger graphs. With other graphs, the difference between the algorithms is smaller. Our algorithm is slightly slower on certain sparse graphs, where a small number of unnecessery calculations of degrees and sorting of vertices is performed. This is in part because T [level] does not adapt to the local properties of these graphs.

In Table 5 we show the results of MaxCliqueDyn and MCQ algorithms on DIMACS graphs. Because the MCQ algorithm was originally tested only on some of these graphs [8], we tested this algorithm again. The calculation times of our algorithm confirm that the Tlimit =0.025 is well chosen for these graphs, as the calculation times are close to the minimum for most of these graphs. We solve all four very difficult brock800 family instances in under 4 hours compared to the MCQalgorithm, which solves only 2 of these instances and requires considerably more time. We also solved the p_hat700-3 graph in under 4 hours; in this case the MCQ algorithm fails to complete. Times are improved for all but one graph in the brock family, where our times are generally better by a factor 2−3. With our algorithm, p_hat1000-2 is solved 6 times faster, p_hat300-3 3 times, p_hat500-3 almost 8 times, san1000 7 times, san200_0.9_1 almost 6 times, san400_0.7_2 6 times, san400_0.9_1 31 times and sanr200_0.9 is solved almost 6 times faster.

Our maximum clique algorithm does not improve times for very dense (d ≥ 0.99) graphs in the MANN family and for hamming10-2 graph.Table 5 shows that in these cases numbers of steps are unchanged. A small number of - 585 -

Table 4. CPU times [s] and numbers of steps for random graphs; p is the probability of an edge between two vertices of a graph with n vertices.

Graph MCQ MaxClique+CS MaxCliqueDyn+CS n p num. steps CPU time num. steps CPU time num. steps CPU time 100 0.6 1031 0.003 978 0.00265 943 0.0028 100 0.7 2752 0.00925 2525 0.0082 1940 0.00735 100 0.8 6674 0.03 5460 0.02445 4101 0.0205 100 0.9 9279 0.0746 6655 0.05405 4314 0.0384 100 0.95 764 0.0092 712 0.00815 477 0.0061 150 0.5 2526 0.0076 2423 0.00695 2217 0.00735 150 0.6 7852 0.0278 7330 0.02525 5932 0.0238 150 0.7 30057 0.1315 27087 0.1164 20475 0.09445 150 0.8 218972 1.2913 174905 1.0368 88649 0.606 150 0.9 1570126 17.478 854110 9.863 258853 3.464 150 0.95 78246 1.7241 40081 0.88255 14825 0.40785 200 0.4 2585 0.00795 2494 0.00745 2409 0.0078 200 0.5 9774 0.0329 9355 0.0304 7695 0.03095 200 0.6 40694 0.1626 38060 0.14845 31114 0.12995 200 0.7 275430 1.383 246297 1.22485 148726 0.8599 200 0.8 3555022 25.725 2850467 20.658 1424940 11.06 200 0.9 145553091 2117.78 80498743 1199.39 16404111 279.705 200 0.95 22393742 729.479 10433846 340.077 1508657 60.524 300 0.4 15014 0.0507 14591 0.0479 12368 0.05335 300 0.5 76112 0.30745 73311 0.28725 61960 0.2634 300 0.6 641131 2.983 603540 2.767 387959 2.133 300 0.7 8521492 51.915 7669432 46.385 4548515 28.566 300 0.8 530043775 4580.5 434614352 3735 159498760 1442.01 500 0.3 23300 0.09055 22785 0.0863 19665 0.10895 500 0.4 141095 0.6202 137348 0.58595 123175 0.5817 500 0.5 1422379 7.009 1368896 6.579 963385 5.827 500 0.6 23889293 142.553 22678735 132.084 15075757 91.447 1000 0.2 45171 0.29045 44766 0.2753 41531 0.3858 1000 0.3 456758 2.386 449717 2.238 413895 2.332 1000 0.4 6192174 34.708 6040135 32.914 4332149 33.761 1000 0.5 140261760 886.465 136018698 842.86 100756405 655.6 - 586 -

Table 5. CPU times [s] and numbers of steps for DIMACS benchmark graphs; ω is the size of the largest clique found.

Graph MCQ MaxCliqueDyn name ω num. steps CPU time ω num. steps CPU time brock200 1 21 450327 2.74 21 229597 1.56 brock200 2 12 4232 0.017 12 3566 0.018 brock200 3 15 17089 0.088 15 13057 0.0735 brock200 4 17 64332 0.313 17 48329 0.242 brock400 1 27 326153861 2915 27 125736892 1175.68 brock400 2 29 115680020 1204.86 29 44010239 521.01 brock400 3 31 279192244 2297 31 109522985 935.23 brock400 4 33 129575982 1181.96 33 53669377 532.66 brock800 1 ≥ 23 956318168 fail 23 1445025793 14344 brock800 2 ≥ 24 1071802831 fail 24 1304457116 13507 brock800 3 25 1581256139 17050 25 835391899 9263 brock800 4 26 1105720024 13142 26 564323367 7180 c-fat200-1 12 214 0.0005 12 214 0.0005 c-fat200-2 24 239 0.001 24 239 0.0005 c-fat200-5 58 307 0.003 58 307 0.003 c-fat500-10 126 743 0.0355 126 743 0.0345 c-fat500-1 14 517 0.0015 14 517 0.002 c-fat500-2 26 542 0.0025 26 542 0.003 c-fat500-5 64 618 0.0095 64 618 0.0095 hamming6-2 32 62 0.0005 32 62 0.0005 hamming6-4 4 105 0.0005 4 105 0 hamming8-2 128 254 0.018 128 254 0.018 hamming8-4 16 41603 0.333 16 19107 0.1505 hamming10-2 512 1022 1.3 512 2048 6.63 hamming10-4 ≥ 40 16274629 fail ≥ 40 1934328 fail johnson8-2-4 44604460 johnson8-4-4 14 255 0.001 14 221 0.001 johnson16-2-4 8 430130 0.4365 8 643573 0.681 johnson32-2-4 ≥ 16 15 fail ≥ 16 15 fail keller4 11 12209 0.0485 11 8991 0.04 keller5 ≥ 27 162625 fail ≥ 27 25921 fail keller6 ≥ 50 40912647 fail ≥ 52 319878688 fail MANN a9 16 94 0.0005 16 94 0.0005 MANN a27 126 38252 6.92 126 38252 7.56 MANN a45 345 2852231 5480 345 2852231 9037 - 587 -

Graph MCQ MaxCliqueDyn name ω num. steps CPU time ω num. steps CPU time p hat300-1 8 2137 0.006 8 2084 0.0055 p hat300-2 25 9944 0.0795 25 7611 0.063 p hat300-3 36 2519267 30.62 36 629972 9.1 p hat500-1 9 11275 0.043 9 10933 0.041 p hat500-2 36 515253 6.57 36 189060 2.81 p hat500-3 50 239705791 5683 50 25599649 739.16 p hat700-1 11 34229 0.17 11 27931 0.19 p hat700-2 44 4355991 87.49 44 1071470 25.4 p hat700-3 ≥ 57 486128224 fail 62 292408292 13583 p hat1000-1 10 202628 1.04 10 170203 0.9855 p hat1000-2 46 218545180 5189 46 30842192 859.44 p hat1000-3 ≥ 53 798989871 fail ≥ 59 481023263 fail p hat1500-1 12 1283326 8.39 12 1138496 7.75 p hat1500-2 ≥ 55 494607640 fail ≥ 60 405328276 fail p hat1500-3 ≥ 60 239190830 fail ≥ 69 389162339 fail san200 0.7 1 30 1542 0.019 30 983 0.0125 san200 0.7 2 18 1577 0.011 18 1750 0.014 san200 0.9 1 70 268746 2.86 70 28678 0.4815 san200 0.9 2 60 489203 5.52 60 72041 1.48 san200 0.9 3 44 1037194 17.9 44 327704 6.66 san400 0.5 1 13 4182 0.0525 13 3026 0.0255 san400 0.7 1 40 135677 2.84 40 47332 0.8195 san400 0.7 2 30 71937 1.8 30 9805 0.286 san400 0.7 3 22 411539 5.8 22 366505 3.21 san400 0.9 1 100 41072828 2108 100 697695 66.43 san1000 15 224566 11.37 15 114537 1.56 sanr200 0.7 18 182244 0.924 18 104996 0.586 sanr200 0.9 42 41107366 583.26 42 6394315 102.66 sanr400 0.5 13 299625 1.34 13 248369 1.14 sanr400 0.7 21 89775740 602.9 21 37806745 287.78

calculations of the degrees and sorting are still performed, and this accounts for the increased calculation time.

Notably, a discrepancy between times of our version of the MCQ algorithm and times obtained by authors exist for random graphs. It seems that they limited the maximum clique size within their random graphs. Although we used a different computer, our times are directly comparable for the DIMACS benchmark graphs, except for the san400_0.9_1 graph, in which the initial order of vertices has a particular effect on the performance of the maximum clique algorithms we tested. This may be because the last vertex, i.e., the - 588 -

400th vertex, is a member of the maximum clique. With minor modifications to our sorting algorithm, the MCQ algorithm required 722 seconds and our MaxCliqueDyn algorithm 53.44 seconds.

4 Conclusions

In this paper we describe an improved approximate coloring algorithm and an algorithm for finding a maximum clique in an undirected graph. We show that by applying tighter, more computationally expensive upper bounds on a fraction of the search space, it is possible to reduce the time to find the maximum clique. Our algorithm, which has not been fine-tuned, is consider- ably faster than the original MCQ algorithm. A similar strategy is feasible for other types of upper bounds [9], where advantage could be taken of a trade-off between expensive computation and overall speed.

Acknowledgement

The financial support through grants P1-0002 of the Ministry of Higher Ed- ucation, Science, and Technology of Slovenia is acknowledged. The authors thank Prof. G.W.A. Milne for critical reading of the manuscript. - 589 -

References

[1] S. Butenko, W.E. Wilhelm, Clique-detection models in computational biochemistry and genomics, European Journal of Operational Research 173 (2006) 1-17.

[2] C. Hofbauer, H. Lohninger, A. Asz´odi,SURFCOMP: A novel graph- based approach to molecular surface comparison, Journal of Chemical Information and Computer Science 44 (2004) 837-847.

[3] S. Schmitt, D. Kuhn, G. Klebe, A new method to detect related function among proteins independent of sequence and fold homology, Journal of Molecular Biology 323 (2002) 387-406.

[4] J. Konc, D. Janeˇziˇc,A branch and bound algorithm for matching protein structures, Lecture Notes in Computer Science 4432 (2007) 399-406.

[5] J. Konc, D. Janeˇziˇc, Protein-protein binding-sites prediction by pro- tein surface structure conservation, Journal of Chemical Information and Modeling 47 (2007) 940-944.

[6] M.R. Garey, D.S. Johnson, Computers and Intractability: A guide to the Theory of NP-Completeness, Freeman, San Francisco, 1979.

[7] D.R. Wood, An algorithm for finding a maximum clique in a Graph, Operations Research Letters 21 (1997) 211-217.

[8] E. Tomita, T. Seki, An efficient branch-and-bound algorithm for finding a maximum clique, Lecture Notes in Computer Science 2631 (2003) 278- 289.

[9] E. Balas, J. Xue, Weighted and unweighted maximum clique algorithms with upper bounds from fractional coloring, Algorithmica 15 (1996) 397- 412.

[10] P.R.J. Osterg˚ard,A¨ fast algorithm for the maximum clique problem, Discrete Applied Mathematics 120 (2002) 197-207. - 590 -

[11] D.S. Johnson, M.A. Trick (Eds.), Cliques, Coloring, and Satisfiability: Second DIMACS Challenge, DIMACS Series in Disrete Mathematics and Theoretical Computer Science, vol. 26. American Mathematical Society, Providence, 1996.

[12] N. Biggs, Some heuristics for graph colouring, in: R. Nelson, R.J. Wilson (Eds.), Graph colourings, Longman, New York 1990, pp. 87-96. Article

pubs.acs.org/jcim

Exact Parallel Maximum Clique Algorithm for General and Protein Graphs MatjažDepolli,†,⊥ Janez Konc,‡,⊥ Kati Rozman,‡ Roman Trobec,† and Dusankǎ Janezič*̌,‡,§

† Jozef̌ Stefan Institute, Jamova 39, SI-1000 Ljubljana, Slovenia ‡ National Institute of Chemistry, Hajdrihova 19, SI-1000 Ljubljana, Slovenia § Faculty of Mathematics, Natural Sciences and Information Technologies, University of Primorska, Glagoljaskǎ 8, SI-6000 Koper, Slovenia

*S Supporting Information

ABSTRACT: A new exact parallel maximum clique algorithm MaxCliquePara, which finds the maximum clique (the fully connected subgraph) in undirected general and protein graphs, is presented. First, a new branch and bound algorithm for finding a maximum clique on a single computer core, which builds on ideas presented in two published state of the art sequential algorithms is implemented. The new sequential MaxCliqueSeq algorithm is faster than the reference algorithms on both DIMACS benchmark graphs as well as on protein-derived product graphs used for protein structural comparisons. Next, the MaxCliqueSeq algorithm is parallelized by splitting the branch-and-bound search tree to multiple cores, resulting in MaxCliquePara algorithm. The ability to exploit all cores efficiently makes the new parallel MaxCliquePara algorithm markedly superior to other tested algorithms. On a 12-core computer, the parallelization provides up to 2 orders of magnitude faster execution on the large DIMACS benchmark graphs and up to an order of magnitude faster execution on protein product graphs. The algorithms are freely accessible on http://commsys.ijs.si/~matjaz/maxclique.

− ■ INTRODUCTION chemical graphs.10 12 Using chemical heuristics, the algorithm is fine-tuned for comparison of small molecules. The maximum clique problem, one of the most challenging 13 1 A recent review identified several exact maximum clique NP-complete problems in computer science, addresses − algorithms, that were developed between 1990 and 2012.14 19 important questions in bioinformatics2,3 and molecular 4−7 However, all of these run on a single computer core and only modeling. Molecules can be described as graphs with two exact parallel maximum clique algorithms that could exploit vertices representing atoms or group of atoms and with edges computers with multiple cores exist.20,21 The algorithm of representing bonds. Detection of similarities between mole- Pardalos et al.20 is a master−slave type algorithm, in which a fi cules, which implies nding an optimal alignment between their master process running on a dedicated core distributes jobs to atoms or groups of atoms, can be expressed as a problem of slave processes running on the remaining cores that solve the finding a maximum clique in product graphs derived from the maximum clique problem. The second algorithm, by McCreesh molecules that are being compared. Specifically, the maximum and Prosser,21 is an algorithm adapted for use on large clique algorithm allows detection of protein structural computer networks composed of up to 100 computers. Using similarities that are useful in protein classification or protein Net File System for communication between processes, the function prediction,8 and searching large databases of chemical algorithm reveals a significant overhead on startup, which compounds, such as ZINC,9 for structural similarities in small makes it profitably applicable only to large and hard-to-solve compounds,4 which is a key to development of new drugs. RASCAL is an efficient clique-based algorithm for graph Received: April 27, 2013 similarity calculation that has been applied to comparison of Published: August 21, 2013

© 2013 American Chemical Society 2217 dx.doi.org/10.1021/ci4002525 | J. Chem. Inf. Model. 2013, 53, 2217−2228 Journal of Chemical Information and Modeling Article instances of graphs, where that overhead can be compensated. The parallelization of maximum clique algorithms to exploit modern computer architecture with multiple cores is a severely under-explored field in computer science, although it has a high potential impact on molecular modeling. Little is known regarding the efficiency of referred parallel maximum clique algorithms. A study of Pardalos’ algorithm indicates that speedups between 1.8 and 2.6 can be achieved using a computer with four cores on two tested instances of DIMACS [A publicly available collection of benchmark graphs for solving various graph-related problems, including the maximum clique problem, available from http://dimacs. rutgers.edu.] graphs and on random graphs with up to 1000 vertices.20 Another study of the McCreesh and Prosser algorithm indicates that speedups of 10 and more can be obtained routinely on 25 or more desktop machines over large Figure 1. General graph example. Vertices are circles, edges are lines. benchmark graphs that can require minutes or days to be solved A maximum cliqueset of vertices {a, b, c, d}is highlighted in bold. on a single machine.21 The literature however, appears to contain little, if any information on the effects of the parallelization on the speedup of maximum clique algorithms that the size of the maximum clique that can be found in G is on modern computers with multiple cores. Likewise, little is smaller or equal to the chromatic number of G. known on the performance and on the comparison of such Protein Graphs. Proteins can be represented as protein algorithms on a set of DIMACS graphs and protein-derived graphs, as is schematically shown in Figure 2. In protein graphs, product graphs derived from real protein comparison vertices have spatial coordinates, and vertices are placed at applications, e.g., performed by the ProBiS web server.8,22 geometric centers of functional groups of protein surface amino In this work, we first develop a new exact sequential acids. Vertices are labeled with five distinct colors correspond- maximum clique algorithm MaxCliqueSeq, i.e., one that is only ing to the five physicochemical properties, i.e., acceptor, donor, able to exploit a single processor core. This algorithm runs π−π stacking, aliphatic, and acceptor−donor, of the protein’s faster on a single core than two currently leading maximum surface amino acids at the resolution of functional groups. Two 16,23 clique algorithms. Next, with our improved sequential vertices ui and uj of a protein graph G are adjacent, that is, an ∈ algorithm taken as a starting point for the parallelization, we edge (ui, uj) E(G) exists between them, if the distance(ui, uj) develop a new exact parallel maximum clique algorithm, < 15 Å. To generate protein graphs, we followed the MaxCliquePara, which balances its work and uses all the procedure24 introduced by Konc and Janezic with an extension cores efficiently, by traversing multiple search tree branches at that we considered the entire protein surfaces as protein graphs, the same time. The new parallel algorithm is evaluated on rather than only protein surface patches. DIMACS graphs and on a set of protein product graphs, which Protein Product Graphs. A pair of protein graphs can be are target graphs for the newly developed algorithm. Our compared by finding a maximum clique in their product graph, experimental results show that the MaxCliquePara algorithm is maximum clique being the superimposition that aligns the most fast and effective on large protein graphs whose density, i.e., the vertices of the compared protein graphs (Figure 2). The probability that an edge exists between two vertices, is very protein product graph of two protein graphs G1 and G2 is fi × 24 high, and that on average, it outperforms all state of the art de ned on the vertex set V(G1, G2)=V(G1) V(G2). Each maximum clique algorithms known to authors. protein product graph vertex (ui, vi) is composed of two fi ∈ component vertices: a vertex from the rst protein graph (ui ∈ ■ METHODS G1) and a vertex from the second protein graph (vi G2). In general, a protein product graph has x × y vertices if the Graph Notation. An undirected graph G =(V, E) consists respective protein graphs have x and y vertices; however, we ⊆ × of a set of vertices V ={v1, v2, ..., vn} and a set of edges E V reduce its size by considering only vertices with identical V made up of pairs (u, v) of distinct vertices (Figure 1). Two component vertex colors (physicochemical properties). Addi- vertices (u, v) are described as adjacent if (u, v) ∈ E. The tionally, component vertices must have similar neighborhoods neighborhood Γ(v) of a vertex v is defined as Γ(v)={u ∈ V|u is in their corresponding protein graphs. The latter is determined adjacent to v}. The degree of a vertex v, denoted by |Γ(v)|,is as follows: the neighborhood of each protein graph vertex is the number of edges connected to v. A graph is complete if all defined as a sphere with diameter of 6 Å. Spatial dimensions are its vertices are adjacent. A graph H =(VH, EH), is a subgraph of discretized into 24 intervals each with the length of 0.25 Å. A ⊆ ⊆ G,ifVH V and EH E. A clique C is a complete subgraph of distance matrix representing discretized distances between all an undirected graph; the size of a clique C, denoted by |C|,is pairs of vertices in the neighborhood is calculated. The the number of its vertices. A maximum clique is the largest similarity of neighborhoods of the two component vertices p clique in a given graph. Coloring of a graph is a mapping that and r, which form a vertex of the protein product graph, is assigns one color to each vertex in such a way that no two calculated using their corresponding distance matrices Mp and ∑ ij ij ∑ | ij − ij| adjacent vertices have the same color. The smallest possible Mr as similarity = 1...m,1...n (Mp + Mr )/ 1...m,1...n Mp Mr number of colors with which a graph G can be colored is the where indexes i and j run over all m rows and n columns of each chromatic number of G. If a graph G contains a maximum matrix (for more details, see ref 24). The neighborhoods are clique of size k, then at least k different colors are needed to considered similar if similarity > 1.9.24 Finally, two protein color the vertices of this maximum clique. From this, it follows product graph vertices (ui, vi) and (uj, vj) are adjacent (an edge

2218 dx.doi.org/10.1021/ci4002525 | J. Chem. Inf. Model. 2013, 53, 2217−2228 Journal of Chemical Information and Modeling Article

Figure 2. Protein structural comparison using MaxCliquePara maximum clique algorithm. Proteins are first converted to protein graphs (top). For clarity, only a section of each protein graph is shown. Vertices are colored according to their physicochemical properties, i.e., acceptor (red), donor (green), π−π stacking (pink), and aliphatic (yellow). A protein product graph is constructed from both protein graphs (center). A maximum clique of four vertices connected with red edges indicates the similarity between proteins. Finally, the two compared proteins are superimposed (bottom) according to the best alignment of vertices represented by the maximum clique. ∈ ∈ is inserted between them) if (ui, uj) E(G1) and (vi, vj) bounding condition (see section Graph Notation). Although | − | 1 E(G2) and distance(ui, uj) distance(vi, vj) < 0.5 Å, which vertex coloring is an NP-complete problem, fast approximate means that the distances between respective first and second algorithms26 exist that solve it based on the “greedy strategy”, component vertices need to be almost the same in both protein with a complexity of O(n2) with respect to the number of graphs. vertices n to be colored. An example of approximate coloring of Accordingly, we constructed ten benchmark protein product a graph from Figure 1 is shown in Figure 4. graphs, each from a pair of protein structures from the Protein At the first step of this algorithm, the set of graph vertices Data Bank (PDB).25 The name of each protein product graph, Vcopy is empty; therefore, Vcopy = V. In the next step, vertex a is e.g., 1allA−3dbjC reflects the PDB and Chain IDs of the ff selected as the vertex with the highest degree from Vcopy;itis compared proteins. To measure the e ect of protein size, i.e., colored using the first available color, i.e., color no. 1red number of amino acids, on the performance of the maximum (colors are represented as natural numbers), and added to the clique search, we constructed product graphs from proteins set of colored vertices U. The neighborhood vertices of vertex a, with ∼50 to ∼2000 amino acids. In addition, we considered marked with thick edges in the graph and gray color in V (all protein pairs that share ∼10−95% sequence identity. The copy vertices), are then removed from Vcopy, and vertex a is removed resulting protein product graphs serve as a sample of typical fi from V. Since Vcopy becomes empty, it is re lled from V, and the inputs for the proposed algorithm, and we envision their use as  standard future tests for newly developed maximum clique current color is changed to 2 green (color number is algorithms. increased by one). Next, vertex b is selected from Vcopy, Generic Maximum Clique Algorithm. A generic colored green, and added to the set U. The neighborhood sequential maximum clique algorithm employing a branch- vertices of vertex b, marked with gray color in Vcopy (vertices c, and-bound approach is shown in Figure 3. d, and e), are removed from Vcopy, and vertex b is removed from Approximate Graph Coloring Algorithm. A coloring V. The procedure repeats itself until V is empty, when the algorithm assigns colors (or numbers) to vertices of a graph, so graph is colored and the algorithm stops. that no two adjacent vertices have the same colors (or The number of colors used to color the graph, i.e., in this numbers). The number of colors assigned by a coloring example four colors were used, presents the upper bound to the algorithm is used in the maximum clique algorithm as the size of the maximum clique that can be found in this graph, and

2219 dx.doi.org/10.1021/ci4002525 | J. Chem. Inf. Model. 2013, 53, 2217−2228 Journal of Chemical Information and Modeling Article

consequently, more tightly bound to the size of the maximum clique. In comparison to previous coloring algorithms,28 our approximate coloring algorithm16 consistently reduces the number of steps needed to find a maximum clique and consequently lessens the time required to find a maximum clique. Initial Vertex Ordering. Initial order in which vertices are presented to the approximate coloring algorithm significantly affects the performance of the maximum clique algorithm.29 The coloring is tighter, i.e., fewer colors are needed, if the vertices presented to the approximate coloring procedure are ordered by nonincreasing degree,16 so that, if |Γ(u)| ≥ |Γ(v)|, then u is placed before v. However, some freedom remains in the ordering of vertices having the same degrees. Tomita et al.15 described an efficient initial ordering of vertices of the same degree with the introduction of ex-degree, defined for a given Figure 3. Generic maximum clique algorithm. Backslash in X\Y is the vertex as the sum of degrees of all its adjacent vertices. In relative complement (also the set difference) between sets X and Y. ⟨ ⟩ preprocessing, ex-degree is calculated, followed by the initial (input) U, a set of pairs vertex, number , where vertex is a graph sorting by nonincreasing degree and nonincreasing ex-degree vertex, and number is its degree; empty sets C and Cmax, which will hold vertices of the current and maximum clique, respectively. for vertices of equal degrees. After the vertices are sorted, they (execution) On each step, a vertex with maximum number is taken are renumbered, so that the vertex with the highest degree has from U; if bounding condition is true, the vertex is added to current index one, the one with second highest degree has index two, clique in C; coloring replaces numbers in U with colors (see below) in and so on. The adjacency matrix,30 i.e., an n × n matrix for a | U1, followed by a recursive call to function Generic; after recursion, if graph with n vertices in which element a is nonzero if an edge | | | ij C > Cmax , the current clique becomes the maximum clique. (output) exists between vertices i and j, is then reconstructed with the Cmax holds vertices of the maximum clique. renumbered vertices. The consequence of renumbering is a better localization of memory access and therefore more ffi the maximum clique algorithm uses this value to prune the e cient use of the cache memory. Additionally, an adjunct branches of the search tree.27 vertex set (set V in Figures 4 and 5) that maintains the initial In the MaxCliqueDyn algorithm,16 an approximate coloring order of vertices throughout the execution of the maximum algorithm was introduced that was later adopted by the MCS clique algorithm is introduced. 15 Use of Bit Strings. Recently, Segundo et al. introduced a algorithm and enhanced by addition of bit-strings in the 23 BBMC algorithm.23 All three algorithms follow the same maximum clique algorithm that uses the so-called bit-strings realization that color-based vertex ordering is only needed (also called bit-boards) to encode adjacency matrix and adjunct | | − | | ff above a threshold, which is calculated as kmin = Cmax C +1, vertex set V. Bit-strings o er extremely fast set operations such | | | | “ ” “ ” “ ” where Cmax is the size of the current maximum clique and C is as union , intersection , and complement but are slower the size of the clique found on the current branch of the search than lists in counting the number of elements or in extracting tree. Vertices with colors below kmin can remain in their original the elements with the lowest or the highest number. These order, as they will never be used in further recursions as a root deficiencies however are alleviated by modern processors of a sub-branch. In this way, the algorithm keeps most of the encompassing fast instructions for bit manipulation. In vertices in the working set U in a nonincreasing degree addition, bit-strings are more cache-efficient for storing sets ordering. This results in graphs colored with fewer colors and, than the traditionally used arrays, because they store up to eight

Figure 4. Approximate coloring algorithm trace. (input) Leftmost panel graph (same as in Figure 1). (execution) Algorithm proceeds from left to right, and in each consecutive panel, it colors one vertex of a graph. (output) Graph colored with four colors, and a working set of colored vertices U, ordered by their colors (or numbers) (rightmost panel).

2220 dx.doi.org/10.1021/ci4002525 | J. Chem. Inf. Model. 2013, 53, 2217−2228 Journal of Chemical Information and Modeling Article

elements per byte in contrast to arrays that store at most one element per byte. On the other hand, the order of elements in a bit-string is predefined and cannot be changed, as in an adjacency matrix and adjunct vertex set. Ordered sequences of vertices, such as the vertex sequence U, the output of coloring algorithm and used for branching, are stored in arrays. MaxCliqueSeq Maximum Clique Algorithm. We develop a new maximum clique algorithm, MaxCliqueSeq (Figure 5), by combining the existing algorithmic building blocks. The algorithm is composed of approximate coloring algorithm,16 initial vertex ordering algorithm,15 and uses bit- strings23 to encode the adjunct set V and the adjacency matrix; adjacency matrix is not shown in Figure 5, since it is used internally in function Γ to determine the neighborhood of a vertex. These algorithms and concepts were described in the previous sections. This combination of existing algo- rithms,15,16,23 building blocks of different maximum clique algorithms, has proven the fastest on most DIMACS and protein product graphs out of different combinations tested. For example, an experimental algorithm that used a more advanced coloring algorithm15 performed worse in our test. In the implementation, we avoid copying of variables by value; we also keep the code compliant with cache. Consequently, C, Cmax, and the adjacency matrix are global variables. Instead of performing the complement of the neighborhood Γ(v) at each step in the coloring function (see line 47 in Figure 5), a complemented copy of the adjacency matrix is prepared in the preprocessing step. The coloring function then uses the complemented adjacency matrix instead of the original adjacency matrix, decreasing the number of operations required. The source code, written in C++ standard 2011, can be compiled with GCC 4.6 and should be portable to any other modern compiler, since it employs only standard C+ + functions and the standard template library. The trace of this algorithm on an example graph from Figure 1 is shown in Figure 6. The algorithm traverses the search tree from left to right, following the numbered arrows. It starts by adding vertex a to the current clique C (step 1). Since vertex b is connected with vertex a, the clique C is then extended by vertex b (step 2); at the same time, vertex f, which is not connected to vertex b, is removed from U, and can never be considered again within this branch (steps 3 to 5). MaxCliquePara Maximum Clique Algorithm. Here, we introduce the new parallel maximum clique algorithm, MaxCliquePara, based on the MaxCliqueSeq algorithm out- lined in the previous section. A flowchart of the MaxCliquePara algorithm is shown in Figure 7. The algorithm employs the multithreading techniques, which are supported in most programming languages without extra libraries or other sort of support. This makes the algorithm portable to other languages and operating systems. It can run on most modern multicore computer architectures. The parallel algorithm behaves identically to the MaxCliqueSeq algorithm when executed on a single core but takes slightly longer to execute, due to the added thread management code. However, when executed on multiple cores, it easily compensates for this ̅ overhead if the maximum clique problem to be solved is not Figure 5. MaxCliqueSeq algorithm. The overscore in X is the trivial, e.g., high density graphs. complement of a set X and backslash in X\Y is the relative Main Algorithm. complement (also the set difference) between sets X and Y. (input) The main algorithm (see Main algorithm V, a set of graph vertices; U, an ordered sequence of pairs ⟨vertex, in Figure 7) starts with a preprocessing step consisting of the ⟩ number ; adjacency matrix; C and Cmax, current and maximum clique, initial ordering and renumbering of vertices. This step is not respectively (empty at start). (output) Cmax holds vertices of the parallelized, since its execution time, in comparison to the total maximum clique. time needed to find a maximum clique, is significant only on

2221 dx.doi.org/10.1021/ci4002525 | J. Chem. Inf. Model. 2013, 53, 2217−2228 Journal of Chemical Information and Modeling Article

Figure 6. MaxCliqueSeq maximum clique algorithm trace. In each step, first vertex is removed from the working set U (blue arrow in oval) and added to the current clique C (blue vertex in oval); vertices not connected to the first vertex are removed from U (blue cross in oval). Each oval is one step of the algorithm, and recursive calls are arrows numbered 1−9 to indicate their sequence. Branches, not executed due to bounding | | | | ≤ | | condition U + C Cmax are in the reddened area enclosed by a dashed line. Vertex coloring is omitted for clarity; U is used in the bounding condition instead of the number of colors. In step 4, a maximum clique Cmax = {a, b, c, d} is found.

Figure 7. Flowchart of the MaxCliquePara parallel maximum clique algorithm, consisting of main algorithm (top) and parallel thread execution (bottom). small and easily solvable graphs. These graphs are not the target each thread on its own core. At creation, threads may also be graphs of our algorithm and can be often searched more assigned affinity to be associated with a single core throughout ffi e ciently by a sequential algorithm. their lifetime. The alternative is to leave the thread manage- The next step, which also executes sequentially, is the fi ment completely to the operating system, which may rotate preparation of the rst job. This step generates a data structure their host cores, causing a lower cache hit rate. Our preliminary comprised of an empty current clique C and maximum clique tests, however, failed to reveal any significant difference Cmax, a color-numbered working set of vertices U, and an ffi adjunct set of vertices V. between run times of the algorithm with or without a nity The algorithm proceeds with the creation of p worker settings. fi threads, the number of which can be set at runtime. Setting p to The nal step of the main algorithm is to wait for all the the number of cores has proven as the most efficient in our threads to terminate and to free their resources. The maximum tests. Worker threads are created and started in a sequence, clique found is stored in the shared variable Cmax and is

2222 dx.doi.org/10.1021/ci4002525 | J. Chem. Inf. Model. 2013, 53, 2217−2228 Journal of Chemical Information and Modeling Article

Figure 8. Parallel MaxCliquePara algorithm trace executed as two threads (Thread 1 and Thread 2). Each thread execution is the same as described in Figure 6 except that in Thread 2, a new step 5 (blue shaded oval) represents a branch that was pruned by MaxCliqueSeq, but not by MaxCliquePara. In step 5 of Thread 1, a maximum clique Cmax = {a, b, c, d} is found. renumbered back to the original numbering, before being optimization of the storage and protection policy of Cmax would returned to the user. have almost no effect on the execution time and would also Parallel Thread Execution. Each newly created thread (see complicate the program. Threads in Figure 7) enters a loop and first checks for available Threads explore multiple disjunct branches of the search tree, jobs. If no jobs are available, all the work has already been each branch rooted in a different starting vertex as shown in assigned, thus the thread terminates. If a job is available, the Figure 8 on an example graph from Figure 1. Thread 1 explores thread acquires it and attempts to split it into two jobs: if the a branch that has vertex a as a root and all other vertices as number of vertices in the job’s working set U is above the user− candidates. Thread 2 explores another branch, obtained by first defined threshold, the first vertex is removed from U and the job splitting, and has vertex b as a root and all other vertices, modified working set is used to create a new job. The threshold except vertex a, as candidates. The number of candidate vertices size of U for splitting the job, |U| ≥ 5, was determined with in the working set U decreases with each additional thread, and preliminary tests, as splitting jobs with |U| < 5 slowed down the the average execution time decreases accordingly, thus the execution. The new job can then be acquired by a new thread, complex, most time-consuming branches, are searched first. when a core becomes available. The thread then proceeds with When the threads exploring complex branches finish their a search on the modified working set with the removed vertex work, the only remaining threads are those that explore simple added to its current queue. If the job cannot be split as defined, branches, which allows for low idle time and good load- then the thread processes the whole job and the next thread in balancing. line will find no available jobs. In contrast to the sequential algorithm (Figure 6), the The final and most time demanding activity is the job parallel algorithm explores two branches at the same time solution, i.e., searching for the maximum clique in the working (Figure 8). This decreases the number of steps, i.e., 7 as set of vertices U of the acquired job. This is done in function opposed to 9 in the sequential algorithm, and results in faster Expand (Figure 5), whose working set of vertices U is defined execution time (speedup) of the MaxCliquePara algorithm. by the current job rather than being the result of preprocessing However, a branch can remain unpruned (see blue shaded oval as in the sequential algorithm. in Figure 8) in parallel algorithm, whereas the same branch is Communication between threads is achieved through use of pruned in the sequential algorithm (Figure 6). This branch is shared variables, e.g., Cmax, which are protected from multiple not pruned, because, at the time Thread 2 starts traversing this concurrent modifications and accesses with mutual exclusion or branch, Thread 1 did not find the maximum clique with four mutex (see Figure 7).31 Since the use of mutexes exacts an vertices yet (as is the case in the sequential algorithm), and overhead, their number is kept to a minimum. Shared variables, thus, Thread 2 only has the maximum clique with three vertices e.g., integers, which can be read and modified, using atomic in the bounding condition. Therefore, the blue shaded branch operations, are allowed for unprotected access without use of searched by Thread 2 is not pruned in the parallel algorithm, mutexes. For example, the global variable Cmax is write- which makes Thread 2 traverse one more branch compared to | | ffi protected by a mutex, whereas Cmax , which is an integer, is not. the sequential algorithm trace. This ine ciency of the parallel The Cmax could be a thread-private variable and only algorithm is compensated for by the concurrent execution of synchronized upon completion, which could result in faster two threads compared to a single thread in the sequential execution. However, our preliminary tests showed that such an algorithm. The inverse example could be envisioned, in which a

2223 dx.doi.org/10.1021/ci4002525 | J. Chem. Inf. Model. 2013, 53, 2217−2228 Journal of Chemical Information and Modeling Article

a Table 1. Execution Times for DIMACS Graphs on a Single Core Using Three Exact Maximum Clique Algorithms, with Results Given As a Mean of 15 Trials ± Standard Deviation

execution timeb [s] graph name MaxCliquePara MaxCliqueDyn BBMCc Bron−Kerbosch hamming8-2 0.000865 ± 8.5 × 10−5 0.00375 ± 4.8 × 10−5 0.00135 ± 7.9 × 10−5 (N/A) >2 h brock200 2 0.00391 ± 0.00072 0.00735 ± 0.00012 0.00369 ± 0.00021 (<0.001) 0.18 ± 0.0052 keller4 0.00868 ± 0.00073 0.014 ± 0.00023 0.0111 ± 0.00063 (<0.001) 1.25 ± 0.0038 − − san400 0.5 1 0.00966 ± 0.00053 0.00732 ± 7.4 × 10 5 0.0103 ± 5 × 10 5 (0.016) 2990 ± 3.5 p hat300-2 0.00985 ± 5.9 × 10−5 0.0212 ± 0.00021 0.0112 ± 0.00019 (<0.001) 24 ± 0.048 p hat500-1 0.0129 ± 6.7 × 10−5 0.0181 ± 0.00027 0.0122 ± 2.4 × 10−5 (<0.001) 0.228 ± 0.0006 brock200 3 0.0139 ± 0.0014 0.0294 ± 0.00029 0.0151 ± 0.00082 (<0.001) 1.93 ± 0.039 hamming10-2 0.0165 ± 7.8 × 10−5 4.54 ± 0.0078 0.0444 ± 6.9 × 10−5 (0.063) >2 h san200 0.9 2 0.0268 ± 0.00025 0.371 ± 0.0015 0.134 ± 0.0016 (0.062) >2 h san200 0.9 3 0.0299 ± 0.0025 1.89 ± 0.013 1.82 ± 0.031 (0.015) >2 h C125.9 0.0316 ± 0.0018 0.06 ± 0.0038 0.0413 ± 0.00065 (0.016) 1690 ± 1.7 p hat700-1 0.045 ± 0.0038 0.0677 ± 0.0039 0.0451 ± 0.0025 (0.047) 1.08 ± 0.0045 hamming8−4 0.0465 ± 0.0045 0.0367 ± 0.00021 0.0468 ± 0.0027 (0.015) 11.3 ± 0.032 brock200 4 0.048 ± 0.0004 0.0961 ± 0.0013 0.0526 ± 0.0007 (0.063) 7.91 ± 0.031 san400 0.7 2 0.0661 ± 0.00038 0.0944 ± 0.00029 0.297 ± 0.0014 (0.063) >2 h johnson16-2-4 0.094 ± 0.0082 0.225 ± 0.0029 0.104 ± 0.0068 (0.062) 1.18 ± 0.01 sanr200 0.7 0.118 ± 0.0094 0.218 ± 0.0027 0.154 ± 0.0048 (0.125) 27.1 ± 0.077 san400 0.9 1 0.122 ± 0.0012 21 ± 0.047 4.47 ± 0.014 (0.031) >2 h san200 0.9 1 0.145 ± 0.012 0.0547 ± 0.00023 0.175 ± 0.0074 (0.094) >2 h gen200 p0.9 44 0.25 ± 0.0078 0.956 ± 0.0037 0.596 ± 0.0056 (0.187) >2 h p hat1000-1 0.263 ± 0.0044 0.35 ± 0.0038 0.301 ± 0.00067 (0.421) 5.57 ± 0.022 MANN a27 0.265 ± 0.0064 1.99 ± 0.025 0.316 ± 0.004 (0.187) >2 h sanr400 0.5 0.285 ± 0.0064 0.526 ± 0.0047 0.311 ± 0.019 (0.327) 12.6 ± 0.037 brock200 1 0.287 ± 0.0086 0.538 ± 0.0062 0.389 ± 0.0054 (0.312) 167 ± 0.45 p hat500-2 0.303 ± 0.0068 0.852 ± 0.0051 0.487 ± 0.0018 (0.39) >2 h san400 0.7 1 0.362 ± 0.0072 0.217 ± 0.00069 0.438 ± 0.0014 (0.125) >2 h gen200 p0.9 55 0.47 ± 0.01 0.469 ± 0.0015 0.543 ± 0.011 (0.437) >2 h san400 0.7 3 0.564 ± 0.0025 1.43 ± 0.0063 1.5 ± 0.0097 (0.437) >2 h p hat300-3 0.993 ± 0.0091 2.51 ± 0.013 1.66 ± 0.012 (1.31) >2 h san1000 1.7 ± 0.0095 0.284 ± 0.0019 1.75 ± 0.009 (0.375) >2 h p hat1500-1 2.38 ± 0.0093 2.93 ± 0.017 2.92 ± 0.0029 (3.92) >2 h p hat700-2 2.38 ± 0.0081 7.06 ± 0.02 3.68 ± 0.0024 (3.79) >2 h sanr200 0.9 13.6 ± 0.077 25.5 ± 0.065 30.2 ± 0.35 (13.9) >2 h p hat500-3 55.7 ± 0.22 183 ± 0.6 105 ± 0.57 (76.1) >2 h sanr400 0.7 70.2 ± 0.29 99.4 ± 0.25 94.4 ± 0.47 (102) >2 h brock400 4 94.6 ± 0.25 154 ± 0.81 145 ± 0.6 (143) >2 h p hat1000-2 116 ± 0.67 230 ± 1.3 200 ± 0.86 (193) >2 h brock400 2 117 ± 0.77 171 ± 1 161 ± 0.87 (140) >2 h MANN a45 119 ± 1.1 1395 ± 19 156 ± 0.28 (42.4) >2 h brock400 3 196 ± 4.5 348 ± 2.9 332 ± 2 (240) >2 h brock400 1 279 ± 6.5 368 ± 2.4 377 ± 0.81 (348) >2 h p hat700-3 1118 ± 2 3208 ± 40 2080 ± 14 (1640) >2 h C250.9 1299 ± 14 1802 ± 21 2256 ± 33 (1290) >2 h aStatistics of the DIMACS graphs are given in the Supporting Information in Table ST1. bGraphs are sorted by increasing execution times; the fastest execution time in each row is in bold. cExecution times achieved by our implementation of the BBMC algorithm and times reported by San Segundo et al.18 (in parentheses) are given. Latter times cannot be compared with our results due to problems with reproducibility. See the Supporting Information. maximum clique found by one thread would help prune and on typical protein product benchmark graphs to determine branches of another thread, thus acting advantageously on the the algorithm’s performance on general and protein graphs. pruning process in the parallel algorithm. Judging from the Our algorithm was compared with MaxCliqueDyn16 and superlinear speedups achieved on general and protein graphs 23 (see Results), this is often the case. BBMC algorithms, which are currently the fastest maximum clique algorithms available, and Bron−Kerbosch,32 which is ■ RESULTS among the most widely used maximal clique algorithms. All We evaluated the performance of the MaxCliquePara algorithm experiments were repeated 15 times with negligible standard in terms of execution times and speedups on DIMACS graphs, deviations in execution times, therefore average results are used in standard benchmarking of maximum clique algorithms, shown. The test computer with a dual CPU 2.30 GHz Intel

2224 dx.doi.org/10.1021/ci4002525 | J. Chem. Inf. Model. 2013, 53, 2217−2228 Journal of Chemical Information and Modeling Article

Figure 9. Speedup of the parallel MaxCliquePara algorithm on DIMACS graphs. Graphs are sorted by increasing execution times on a single core. Exact values for speedups are given in Supporting Information Table ST2.

Xeon E5-2630, each with six physical cores; the running server algorithm, allowing faster execution on the account of faster version of Ubuntu 12.04 was used. memory access. Superlinear speedup can also be achieved if the General DIMACS Graphs. Results on a single core show parallel algorithm traverses the search space using a more that on most general DIMACS graphs, MaxCliqueSeq is faster efficient trajectory than its corresponding sequential algorithm.] than the BBMC and MaxCliqueDyn algorithms, whereas the speedups, which occur mostly in large and dense instances of Bron−Kerbosch algorithm is always the slowest (Table 1). san and brock graph families (Figure 9). There were 8 Speedups of the MaxCliquePara algorithm on multiple cores superlinear speedups out of 43 graphs tested when executing are shown in Figure 9. Speedups differ greatly between the on two cores (see Supporting Information Table ST2). different DIMACS graphs and there are several superlinear Superlinear speedup occurred, for example, in the san200 0.9 [Superlinear speedup is the speedup greater than k,onk cores. 1 graph, where a speedup of ∼80 was achieved on four cores. In It can be achieved if one or more caches are integrated with this graph, the parallel algorithm was clearly able to prune away individual cores, increasing total cache size available to the significantly more branches of the search tree, due to sharing

2225 dx.doi.org/10.1021/ci4002525 | J. Chem. Inf. Model. 2013, 53, 2217−2228 Journal of Chemical Information and Modeling Article

Table 2. Execution Times on Protein Product Graphs on a Single Core Using Four Exact Maximum Clique Algorithms, with Results Given As a Mean of 15 Trials ± Standard Deviation

execution time [s] graph number of amino clique name acids na pb size MaxCliqueSeq MaxCliqueDyn BBMC Bron−Kerbosch 3zy0D- <50 61 0.98 52 0.000205 ± 0.000064 0.000244 ± 0.000066 0.000217 ± 0.00009 2 h 3zy1A 3p0kA- 200−300 138 0.94 89 0.000896 ± 0.000071 0.00171 ± 0.00009 0.000996 ± 0.00038 >2 h 3gwlB 2uv8I- >2000 200 0.86 69 0.00327 ± 0.00019 0.00866 ± 0.000075 0.00441 ± 0.00033 >2 h 2j6iA 1f82A- 500−1000 271 0.99 247 0.0104 ± 0.00097 0.0401 ± 0.00017 0.0127 ± 0.00093 3330 ± 6.7 1zb7A 2w4jA- 300−400 563 0.98 447 0.0693 ± 0.00073 0.38 ± 0.0012 0.0809 ± 0.0018 0.0353 ± 0.0011 2a2aD 1kzkA- 50−100 451 0.97 346 0.0695 ± 0.0062 0.193 ± 0.0038 0.0616 ± 0.0054 5470 ± 18 3kt2A 2w00B- 1000−1500 346 0.91 143 0.0894 ± 0.00013 0.465 ± 0.0085 6.38 ± 0.095 0.368 ± 0.0068 3h1tA 1allA- 100−200 655 0.97 500 0.113 ± 0.0053 0.606 ± 0.0088 0.669 ± 0.0086 0.000372 ± 0.000016 3dbjC 2fdvC- 400−500 750 0.96 556 0.259 ± 0.002 0.938 ± 0.0017 0.378 ± 0.00095 >2 h 1po5A 3hrzA- 1500−2000 905 0.94 563 5.66 ± 0.072 4.43 ± 0.011 444 ± 1.8 >2 h 2hr0A an is number of vertices in a graph. bp is graph density. cFastest execution time in each row is in bold.

Figure 10. Speedup of the MaxCliquePara algorithm on protein product graphs on multiple cores. Graphs are sorted by the increasing execution time on a single core (left to right, top to bottom). Exact values for speedups are given in Supporting Information Table ST3. the bounding condition between the threads, than its sequential find the maximum clique. For example, in the hamming 8-2 counterpart. graph that is solved very quickly, the parallelization overhead In two graphs, hamming 8-2 and hamming 10-2, the parallel consisting of sequential preprocessing, are the main causes for a speedups were <1 for all numbers of cores and decreased with slowdown. Conversely, in harder graphs, such as p hat500-3, increasing number of cores (Figure 9). There were also less speedup increases almost linearly with each additional core. In extreme cases where speedup increased almost linearly up to such graphs, even the 12 additional virtual cores provided by certain numbers of cores but stagnated or even decreased at hyperthreading increase the speedup, although less than the higher numbers of cores, e.g., in MANN a25 and MANN a45 physical cores (experiments on up to 12 cores use all physical graphs in which speedup increased to ≈4 on up to four cores, cores while the experiments on 24 cores use 12 physical and 12 then stagnated, and fell to ≈3 at 24 cores. This undesired effect virtual cores). These results clearly indicate that MaxCliquePara may be related to a very high density and very large maximum algorithm is significantly faster than two comparable algorithms cliques of these graphs (see Supporting Information Table on most DIMACS graphs and that on some graphs it achieves ST1), causing the parallel algorithm to prune less branches than speedups of up to two orders on multiple cores. the sequential algorithm. Protein Product Graphs. Results on a single core show In the majority of graphs, however, speedup increases with that MaxCliquePara algorithm outperforms the other three the number of cores. In Figure 9, a general trend in speedup tested algorithms on most protein product graphs (Table 2). depending on the execution time of the sequential algorithm The difference in execution times is especially large in favor of can be seen. Speedups are higher for graphs that are harder to our algorithm on 2w00B-3h1tA and 3hrzA-2hr0A graphs, solve, i.e., in which the sequential algorithm needs more time to where MaxCliquePara’s is almost 80 times faster than BBMC

2226 dx.doi.org/10.1021/ci4002525 | J. Chem. Inf. Model. 2013, 53, 2217−2228 Journal of Chemical Information and Modeling Article algorithm. The Bron−Kerbosch algorithm performs better on ■ ACKNOWLEDGMENTS protein product graphs than on DIMACS graphs. It is the The financial support through grant P1-0002 of the Slovenian fastest on two graphs; however, it is orders of magnitude slower Research Agency is acknowledged. than the three maximum clique algorithms on all other graphs. The results on multiple cores show favorable speedups are achieved by the MaxCliquePara algorithm (Figure 10). ■ REFERENCES Speedups reached their peaks at 5−8 threads, but additional (1) Karp, R. M. Reducibility among combinatorial problems; Plenum threads decreased speedups. For example, in graphs 2w00B- Press: New York, 1972. 3h1tA and 3hrzA-2hr0A, MaxCliquePara achieved high (2) Konc, J.; Janezic, D. ProBiS algorithm for detection of structurally speedups of 14 and 43 when using 6 and 7 threads, respectively. similar protein binding sites by local structural alignment. − In both cases speedup decreased when additional threads were Bioinformatics 2010, 26, 1160 1168. added. Speedups were higher on larger protein product graphs, (3) Eblen, J.; Phillips, C.; Rogers, G.; Langston, M. The maximum clique enumeration problem: algorithms, applications, and implemen- suggesting that our algorithm is most suitable for comparison of tations. BMC Bioinformatics 2012, 13, S5. larger protein structures composed of several hundreds of (4) Barker, E. J.; Buttar, D.; Cosgrove, D. A.; Gardiner, E. J.; Kitts, P.; amino acids. These results indicate that the MaxCliquePara Willett, P.; Gillet, V. J. Scaffold hopping using clique detection applied algorithm is very suitable for use in protein structural to reduced graphs. J. Chem. Inf. Model. 2006, 46, 503−511. 2 comparisons. (5) Butenko, S.; Wilhelm, W. E. Clique-detection models in computational biochemistry and genomics. Eur. J. Oper. Res. 2006, ■ CONCLUSION 173,1−17. (6) Artymiuk, P. J.; Poirrette, A. R.; Grindley, H. M.; Rice, D. W.; The newly developed exact parallel maximum clique algorithm Willett, P. A graph-theoretic approach to the identification of three- MaxCliquePara is presented. It demonstrates an efficient dimensional patterns of amino acid side-chains in protein structures. J. solution of maximum clique problems and outperforms most Mol. Biol. 1994, 243, 327−344. widely used sequential algorithms on general DIMACS and (7) Artymiuk, P. J.; Poirrette, A. R.; Rice, D. W.; Willett, P. The use protein-derived benchmark graphs on a single and on multiple of graph theoretical methods for the comparison of the structures of cores. The performed experiments show significant speedups of biological macromolecules. In Molecular Similiarity II; Sen, K. D., Ed.; − the MaxCliquePara algorithm for lower numbers of parallel Springer: Berlin Heidelberg, 1995; pp 73 103. cores on most tested graphs. With up to the maximum 12 (8) Konc, J.; Janezic, D. ProBiS: a web server for detection of structurally similar protein binding sites. Nucleic Acids Res. 2010, 38, available physical cores, and even with the additional 12 W436−W440. hyperthreading-enabled cores (a total of 24 cores), the speedup (9) Irwin, J. J.; Shoichet, B. K. ZINC-a free of commercially scales great on larger and more computationally demanding available compounds for virtual screening. J. Chem. Inf. Model. 2005, graphs. An important advantage of MaxCliquePara is its use of 45, 177−182. the shared memory parallelism, which enables small overhead, (10) Raymond, J. W.; Gardiner, E. J.; Willett, P. Rascal: Calculation and thus fast execution on wide range of graph sizes. of graph similarity using maximum common edge subgraphs. Comput. Superlinear speedups are observed, consistent with expectations J. 2002, 45, 631−644. for an algorithm that traverses a tree of possible solutions. For (11) Raymond, J. W.; Gardiner, E. J.; Willett, P. Heuristics for graphs that are “easy” for the sequential maximum clique similarity searching of chemical graphs using a maximum common − algorithm, i.e., with the execution time in order of milliseconds, edge subgraph algorithm. J. Chem. Inf. Comp. Sci. 2002, 42, 305 316. the parallelism brings no significant benefits. The speedups are, (12) Raymond, J. W.; Willett, P. Maximum common subgraph isomorphism algorithms for the matching of chemical structures. J. however, almost always greater than one, therefore, the Comput.-Aided Mol. Des. 2002, 16, 521−533. MaxCliquePara algorithm can currently be taken as one of (13) Prosser, P. Exact algorithms for maximum clique: A computa- the fastest general maximum clique algorithms for almost all tional study. Algorithms 2012, 5, 545−587. but the most trivial graphs. (14) Carraghan, R.; Pardalos, P. M. An exact algorithm for the maximum clique problem. Oper. Res. Lett. 1990, 9, 375−382. ■ ASSOCIATED CONTENT (15) Tomita, E.; Sutani, Y.; Higashi, T.; Takahashi, S.; Wakatsuki, M. A simple and faster branch-and-bound algorithm for finding a *S Supporting Information maximum clique. In WALCOM: Algorithms and computation; Md. Freely available codes of our implementation for all maximum Saidur, R.; Satoshi, F., Eds.; Springer: Berlin Heidelberg, 2010; Vol. clique algorithms used here. Statistics of DIMACS graphs are in 5942, pp 191−203. Table ST1. Detailed speedups for MaxCliquePara parallel (16) Konc, J.; Janezic, D. An improved branch and bound algorithm maximum clique algorithm are presented in Tables ST2 and for the maximum clique problem. MATCH Commun. Math. Comput. − ST3. This material is available free of charge via the Internet at Chem. 2007, 58, 569 590. http://pubs.acs.org. (17) Fahle, T. Simple and fast: Improving a branch-and-bound algorithm for maximum clique. Lect. Notes. Comput. Sc. 2002, 2461, 485−498. ■ AUTHOR INFORMATION (18) San Segundo, P.; Rodríguez-Losada, D.; Jimenez,́ A. An exact bit-parallel algorithm for the maximum clique problem. Comput. Oper. Corresponding Author − * Res. 2011, 38, 571 581. E-mail: [email protected]. (19) Östergard,̊ P. R. J. A fast algorithm for the maximum clique Author Contributions problem. Discrete Appl. Math. 2002, 120, 197−207. ⊥ M.D. and J.K.: These authors contributed equally. (20) Pardalos, P. M.; Rappe, J.; Resende, M. G. C. An exact parallel algorithm for the maximum clique problem. In High Performance Notes Algorithms and Software in Nonlinear Optimization; Kluwer Academic The authors declare no competing financial interest. Publishers: Dordrecht, The Netherlands, 1998; pp 279−300.

2227 dx.doi.org/10.1021/ci4002525 | J. Chem. Inf. Model. 2013, 53, 2217−2228 Journal of Chemical Information and Modeling Article

(21) McCreesh, C.; Prosser, P. Distributing an Exact Algorithm for Maximum Clique: maximising the costup. Algorithms 2012, 5, 545− 587. (22) Konc, J.; Janezic, D. ProBiS-2012: web server and web services for detection of structurally similar binding sites in proteins. Nucleic Acids Res. 2012, 40, W214−W221. (23) Segundo, P.; Matia, F.; Rodriguez-Losada, D.; Hernando, M. An improved bit parallel exact maximum clique algorithm. Optim. Lett. 2013, 7, 467−479. (24) Konc, J.; Janezic, D. Protein-protein binding-sites prediction by protein surface structure conservation. J. Chem. Inf. Model. 2007, 47, 940−944. (25)Rose,P.W.;Bi,C.;Bluhm,W.F.;Christie,C.H.; Dimitropoulos, D.; Dutta, S.; Green, R. K.; Goodsell, D. S.; Prlic, A.; Quesada, M.; Quinn, G. B.; Ramos, A. G.; Westbrook, J. D.; Young, J.; Zardecki, C.; Berman, H. M.; Bourne, P. E. The RCSB Protein Data Bank: new resources for research and education. Nucleic Acids Res. 2013, 41, D475−D482. (26) Biggs, N. Some heuristics for graph coloring. In Graph Colourings; Longman: New York, 1990; pp 87−96. (27) Wood, D. R. An algorithm for finding a maximum clique in a graph. Oper. Res. Lett. 1997, 21, 211−217. (28) Tomita, E.; Kameda, T. An efficient branch-and-bound algorithm for finding a maximum clique with computational experiments. J. Global Optim. 2007, 37,95−111. (29) Tsai, C. C.; Johnson, M.; Nicholson, V.; Naim, M. A topological approach to molecular similarity analysis and its application. In Studies in Physical Theoretical Chemistry; Elsevier: USA, 1987; Vol. 51,pp 231−236. (30) Janezic, D.; Milicevič ,́ A.; Nikolic,́ S.; Trinajstic,N.́ Graph- Theoretical Matrices in Chemistry. Mathematical Chemistry Mono- graphs, No. 3.; University of Kragujevac: Kragujevac, 2007. (31)Dijkstra,E.W.Solutionofaprobleminconcurrent programming control. Commun. ACM 1965, 8, 569. (32) Bron, C.; Kerbosch, J. Algorithm 457: finding all cliques of an undirected graph. Commun. ACM 1973, 16, 575−577.

2228 dx.doi.org/10.1021/ci4002525 | J. Chem. Inf. Model. 2013, 53, 2217−2228 Vol. 26 no. 9 2010, pages 1160–1168 BIOINFORMATICS ORIGINAL PAPER doi:10.1093/bioinformatics/btq100

Structural bioinformatics Advance Access publication March 19, 2010 ProBiS algorithm for detection of structurally similar protein binding sites by local structural alignment Janez Konc1 and Dušanka Janeži ˇc1,2,∗ 1National Institute of Chemistry, Hajdrihova 19, 1000 Ljubljana and 2University of Primorska, Faculty of Mathematics, Natural Sciences and Information Technologies, Glagoljaška 8, 6000 Koper, Slovenia Associate Editor: Anna Tramontano

ABSTRACT functions, and similar folding by itself does not necessarily imply Motivation: Exploitation of locally similar 3D patterns of physico- evolutionary divergence (Russell et al., 1997). In the case of chemical properties on the surface of a protein for detection divergent evolution, similarity is due to the common origin, such Downloaded from of binding sites that may lack sequence and global structural as accumulation of differences from homologous ancestral protein conservation. structures. In contrast, convergent evolution arises as a result of some Results: An algorithm, ProBiS is described that detects structurally sort of ecological or physical drivers toward a similar solution, even similar sites on protein surfaces by local surface structure alignment. though the structure has arisen independently. A rough estimate is It compares the query protein to members of a database of protein that the frequency of two different folds converging to a similar 3D structures and detects with sub-residue precision, structurally statistically significant side-chain pattern is ∼1% (Russell, 1998). It

similar sites as patterns of physicochemical properties on the protein has been shown that convergent evolution of enzyme binding sites http://bioinformatics.oxfordjournals.org surface. Using an efficient maximum clique algorithm, the program is not a rare phenomenon (Gherardini et al., 2007). identifies proteins that share local structural similarities with the query If a protein has no known function, but a known 3D structure, protein and generates structure-based alignments of these proteins inferences concerning function can be made by comparison to other with the query. Structural similarity scores are calculated for the query proteins. Recently, a number of web servers for local structural protein’s surface residues, and are expressed as different colors on alignment have become available. These provide comparison of the query protein surface. The algorithm has been used successfully pre-selected parts of proteins (e.g. binding sites, user-defined for the detection of protein–protein, protein–small ligand and protein– structural motifs) (Angaran et al., 2009; Debret et al., 2009; DNA binding sites. Shulman-Peleg et al., 2008) against binding sites or whole-protein Availability: The software is available, as a web tool, free of charge structural databases. The MultiBind and MAPPIS servers (Shulman- for academic users at http://probis.cmm.ki.si Peleg et al., 2007, 2008) allow the identification of common Contact: [email protected] spatial arrangements of physicochemical properties such as H-bond Supplementary information: Supplementary data are available at donor, acceptor, aliphatic, aromatic or hydrophobic in a set of user Bioinformatics online. provided protein binding sites defined by interactions with small

molecules (MultiBind) or in a set of user-provided protein–protein by on April 23, 2010 Received on December 2, 2009; revised on February 7, 2010; interfaces (MAPPIS). Others provide comparison of entire protein accepted on February 27, 2010 structures (Ausiello et al., 2008) against a number of user submitted structures. Unlike global alignment approaches, local structural alignment approaches are suited to detection of locally conserved 1 INTRODUCTION patterns of functional groups, which often appear in binding sites In the Protein Data Bank (PDB), there are presently 8000 protein and have significant involvement in ligand binding (Shulman-Peleg structures derived from structural genomic studies and 2000 of these et al., 2007). have no known function (Berman et al., 2002). Binding sites are In this article, we describe an algorithm ProBiS, which, in contrast closely related to protein function, the identification of binding sites to these methods, enables local structural alignment of entire protein in proteins is essential to an understanding of their interactions with surface structure against a large database of protein structures in ligands, including other proteins. Many computational tools for the reasonable time. It detects structurally similar regions in a query analysis (Tuncbag et al., 2009) and prediction (Ezkurdia et al., 2009) protein by mapping structural similarity scores on its surface. The of binding sites have been reported. comparison involves geometry and physicochemical properties, and Binding sites can retain conservation of sequence and structure is conducted at the level of amino acid functional groups. For each (Keskin et al., 2005; Valdar and Thornton, 2001). Structural pairwise comparison of the query protein to a database protein, the conservation however is more prevalent (Lecomte et al., 2005). Even algorithm produces multiple local alignments of the surface regions in the absence of obvious sequence similarity, structural similarity that are found in both; no attempt is made to align the proteins between two protein structures can imply common ancestry, which globally and similar folding is not a requirement for a relationship in turn can suggest a similar function (i.e. a binding site). However, between the two proteins. Since no presumptions about localization it is also possible for structurally similar proteins to have different of binding sites prior to comparison are used, ProBiS may detect new binding sites and suggest ligands that these binding sites may ∗To whom correspondence should be addressed. accommodate.

© The Author(s) 2010. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/ by-nc/2.5), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

[12:04 13/4/2010 Bioinformatics-btq100.tex] Page: 1160 1160–1168 ProBiS for structurally similar protein binding sites

ProBiS is a generalization of our earlier algorithm that predicts patterns of interactions in proteins which perform similar functions but may protein–protein binding sites by searching for conserved protein or may not have different folding patterns. The possible interactions are surface structure and physicochemical properties in proteins with represented with sub-residue precision and a structural similarity search similar folds, and permitting comparison of only a few structural algorithm, which employs a fast maximum clique algorithm and operates neighbors (Carl et al., 2008; Konc and Janežic,ˇ 2007a). Major independently of fold and sequence, performs a local, surface-oriented comparison of the proteins. extensions of the ProBiS algorithm over the earlier algorithm A clique is a subset S of vertices in a graph, composed of vertices and include: edges, such that each pair of vertices in S is connected by an edge. A • Comparison of proteins regardless of their fold similarities. maximum clique in a given graph is the clique with the largest number ProBiS compares the query protein with a database of over of vertices. The maximal clique algorithm of Bron and Kerbosch, which has previously been used to compare binding sites (Schmitt et al., 2002), 23 000 protein structures. In contrast to the previous algorithm, significantly differs from our maximum clique algorithm (Konc and Janežic,ˇ which simultaneously compares only a few proteins with 2007b). While a maximal search is for all cliques that are not subgraphs of similar folds, ProBiS identifies all proteins that share local any other clique, maximum clique algorithms search only for the clique with similarities with the query protein. It then calculates local the maximum number of vertices. Consequently, although both address a structural similarity scores of query protein residues in each NP-hard problem, finding a maximum clique requires an order of magnitude of these retrieved proteins and in this way detects structurally less computation time. similar regions, which often correspond with binding sites. Using a fast maximum clique algorithm and a strategy of dividing Downloaded from graphs into subgraphs, enables our approach to the comparison of complete • Detection of similarities between backbone segments with protein surfaces. ProBiS compares the query protein structure sequentially different conformations (e.g. flexible loops) in the query and to proteins in the PDB and retrieves structures that share local structural database proteins. Such similar backbone segments are detected similarities with the query protein. It uses these similar structures to generate by 3D comparison of protein backbones. a structure-based sequence alignment of the query protein with the proteins • Assignment of a number of different scores to every structural from the database and then calculates the similarity score for each query alignment. These scores are designed to measure statistical protein surface residue, projecting these scores as colors onto the surface of http://bioinformatics.oxfordjournals.org and structural significance and are the basis for the retrieval the query protein. Each of these steps is described in detail in the following sections. of similar proteins from the database; only alignments with favorable scores are retained for further analysis of structural similarity. 2.1 Representation of protein surfaces • Detection of ‘fingerprints’that are highly conserved amino acid Residues on the protein surface are identified by an algorithm that identifies residues within the most reliable local structural alignments the solvent accessible surface atoms (Konc et al., 2006). This subset of all found. These are most commonly of proteins with the same fold the protein atoms is represented as a series of graphs of vertices and edges, as shown in Figure 1. Vertices are points in 3D space, and each replaces one as the query protein. The remaining structural alignments are functional group belonging to a residue on the protein surface. subsequently examined for similar motifs. As a result, proteins Functional groups are specific groups of atoms within these residues that share with the query protein a structurally similar binding responsible for the characteristic interactions of the protein with other site but have a different fold can be retrieved from the protein molecules. Each vertex is labeled with the physicochemical properties of database. a functional group that it replaces, hydrogen bond acceptor (AC), hydrogen bond donor (DO), mixed acceptor/donor (ACDO), aromatic (PI) and aliphatic • Storage of results in a database, each entry of which has a record by on April 23, 2010 of the aligned residues, rotational matrix, translational vector (AL); (Schmitt et al., 2002). To compare two proteins, represented as graphs, a product graph is and alignment scores. constructed as shown in Figure 1. The corresponding vertices in this product The advantages and limitations of ProBiS in the detection of graph are pairs of vertices with identical physicochemical properties (u1, u2), protein–protein, protein–small ligand and protein–DNA binding e.g. two acceptors: (AC, AC), and each of the two vertices in the pair is sites are described using a quantitative performance evaluation on a test set of 39 proteins. The results show that ProBiS outperforms other methods that rank amino acid residues by degree of sequence conservation and also energy-based methods. Unlike global structural alignment approaches, ProBiS yields high- quality local structural alignments of proteins with dissimilar folds, and aligns similar binding sites without presumption of their whereabouts in the compared structures. The extent of structural similarity in protein–protein and protein–small ligand binding sites is also discussed.

2 METHODS ProBiS detects surface structural patches that are common to the query protein and a database protein. The algorithm identifies structurally similar Fig. 1. Different functional groups in proteins are assigned distinct labels sites, whose constituent residues may be scattered in sequence, but are close (see color encoding). Subgraphs are generated from the query protein and together in structure. Such similar patches are often related to ligand binding each database protein and then used to produce product graphs which reveal and the search for them exploits the fact that binding sites share similar the extent of superposition of any pair of subgraphs.

1161

[12:04 13/4/2010 Bioinformatics-btq100.tex] Page: 1161 1160–1168 J.Konc and D.Janeži ˇc Downloaded from

Fig. 2. Schematic representation of the ProBiS algorithm. (A) The query protein structure (Q) is compared in a pairwise manner with each of ∼23 000 non-redundant structures (P). (B) Proteins, represented as graphs of vertices (white dots) and edges (not shown), are divided into n overlapping subgraphs, where n equals the number of vertices and all vertices are within 15 Å of a central vertex: three subgraphs per protein are depicted here as distinctly colored encirclements. A fast distance-matrix-based filtering is applied to them to eliminate non-similar subgraphs. (C) A product graph is constructed for each similar pair of subgraphs (see color encoding in B and C). A maximum clique (thick lines) in a product graph represents the largest similarity between two compared protein subgraphs. (D) Each maximum clique produces a structural alignment of two compared proteins (the alignment shown corresponds to the middle http://bioinformatics.oxfordjournals.org maximum clique in C). (E) Steps A–D are repeated for each protein from the nr-PDB and the results are stored in a MySQL database. Structural similarity scores are calculated and projected on the query protein surface. Structurally similar and variable residues are colored red and blue, respectively. High-scoring residues are considered as predicted structurally similar binding sites.

derived from its own protein graph. Two vertices (u1, u2) and (v1, v2) of a (B) Proteins represented as graphs of vertices and edges, are divided into product graph are connected by an edge if and only if distances between u1 n overlapping subgraphs, where n is the number of vertices in each protein and v1 and between u2 and v2 differ by <2 Å (Konc and Janežic,ˇ 2007a). graph. Figure 2B shows two proteins, with three such subgraphs for each Protein graphs can be regarded as rigid 3D protein structures of vertices, protein depicted as distinctly colored regions. A subgraph of a protein graph and a product graph constructed from two such graphs is an approximate is defined as all vertices within 15 Å of a central vertex, and is compactly representation of all possible rotations and translations of one protein graph represented as a distance matrix of these neighboring vertices. To find if onto the other. A maximum clique in the product graph corresponds to the two such subgraphs, i.e. two distance matrices, represent similar parts of rotational–translational variation that superimposes on, or aligns with the their respective protein surfaces, the two distance matrices are subtracted, largest number of vertices in the two protein graphs. and from the resulting difference matrix, a value of similarity is calculated (as described in Konc and Janežic,ˇ 2007a). For each pair of query and database protein’s subgraphs, which are similar as judged by this value, by on April 23, 2010 2.2 The ProBiS algorithm a product graph is constructed. This ensures that only pairs of subgraphs A local structurally similar site is defined as one or more surface functional that have sufficiently similar geometrical arrangements of vertices and with groups that adopt a similar geometrical arrangement in two or more compared similar physicochemical properties proceed to the next, computationally proteins. The vertices representing these groups can be superimposed with intensive step. a low root mean square deviation (RMSD), but such a local alignment does (C) All sufficiently similar pairs of subgraphs that pass the filtering in not guarantee that entire protein backbones can be superimposed. Protein the previous step are then subjected to the more rigorous maximum clique surface residues are represented as vertices in 3D space and a structurally procedure, which detects vertex-to-vertex correspondences between the two similar site is merely one with a similar geometrical arrangement of vertices. protein graphs being compared. A product graph is constructed for each A maximum clique in the product graph described above is equivalent to a such pair of protein subgraphs as described above. The algorithm then vertex similarity between the two proteins, and in turn, each such vertex finds a maximum clique in each product graph by examining approximately similarity, is equivalent to a local structural alignment of common surface 100–1000 product graphs with up to 1000 vertices each in each protein– residues. A crucial step in the comparison of whole-protein surfaces is the protein comparison. A maximum clique in a product graph corresponds to a division of protein graphs into subgraphs and filtering of these that precedes largest common vertex substructure, which can be translated to a maximum the actual comparison using the computationally intensive maximum clique substructure common to the two compared proteins. Maximum cliques of 3, 4 algorithm. The strategy of the ProBiS algorithm is presented schematically and 3 vertices are shown as thick lines in Figure 2C. in Figure 2. The method, steps A–E corresponding to those in Figure 2, is (D) Each maximum clique is equivalent to a single structural alignment described in greater detail as follows. and superimposition of two compared proteins. The alignments are local, (A) A query protein structure (Q) whose structurally similar regions and allow superimposition of maximum numbers of two protein subgraph are to be detected is compared with each of ∼23 000 non-redundant PDB vertices. Alignment scores such as surface vectors angle, RMSD and structures (P). Residues on the surface of the proteins, identified by a solvent expectation values (E-values) are calculated for each such local super- accessible surface algorithm (Konc et al., 2006), are represented as graphs imposition of the two proteins, as discussed in Section 2.4, subsequently. with vertices and edges, and their functional groups are replaced by one of These alignment scores measure the statistical and structural significance the five vertex types which are shown as white dots in Figure 2B. Finally, of the different local structural superimpositions and allow filtering out of vertices that are separated by <15 Å are connected with edges. insignificant structural alignments. Maximum cliques passing this filter and

1162

[12:04 13/4/2010 Bioinformatics-btq100.tex] Page: 1162 1160–1168 ProBiS for structurally similar protein binding sites

possessing at least five common vertices are joined into clusters, which (1) Surface vectors angle: for each of the two superimposed sets of represent the larger structural similarities in the protein surfaces being vertices, an outer-pointing surface vector originating in the geometric compared. Finally, a search for similarities in flexible parts of the two center and perpendicular to the surface of the protein is constructed. compared proteins is conducted, as described below. These vectors each represent the orientation of one of the two (E) Steps A–D are repeated for each database protein and the resulting superimposed surface patches of vertices and a smaller angle between alignments and their scores are saved to a results database implemented on the them indicates a similar orientation of the surfaces in this region. Local MySQL platform. The significant local structural alignments can be retrieved structural alignments with surface vector angles <90◦ are retained. from this results database, using the alignment scores as search parameters. (2) Surface patch RMSD: calculated for each pair of superimposed Only local structural alignments with favorable scores (Section 2.4) are vertices, this measures the shape similarity of the two superimposed considered further in the calculation of similarity scores. For every surface surface patches. Local structural alignments having surface patch residue in the query protein, ProBiS first sums the number of times it is found RMSD <2.0 Å are accepted. in all favorable local structural alignments. These sums, one for each residue, are then translated into discrete similarity scores ranging from 0 (blue) (3) Surface patch size: structural alignments with fewer than 10 vertices to 9 (red) and the appropriate colors are applied to the structure of the query are discarded. protein. To facilitate the comparison with other methods, all query protein (4) E-value: ProBiS algorithm calculates an E-value for each local surface residues with structural similarity scores of 7, 8 or 9 are arbitrarily structural alignment using the Karlin–Altschul equation (Altschul considered to be parts of binding sites. and Gish, 1996; Altschul et al., 1997; Karlin and Altschul, 1990). More details are given in the Supplementary Material. In this Downloaded from calculation, vertices in the structural alignment are converted back 2.3 Protein flexibility to the corresponding residues and each residue may be represented Many proteins are flexible, capable of adopting different conformations and by more than one vertex. Consequently, only one occurrence of an algorithm that detects similarities in purely rigid structures will have each residue in the alignment is counted. In contrast to sequence only limited value. This problem is addressed by searching for backbone alignment approaches, gap penalties are not included in the calculation segments with residue compositions and orientations that are similar in of E-values, since structurally similar residues on a protein surface the query protein and the database protein, but which may adopt different may by definition, be quite separate in the protein sequence. The lower http://bioinformatics.oxfordjournals.org conformations in the two proteins. Each maximum clique, i.e. its rotational– the E-value, the less likely that the match is by chance, and the greater translational variation, representing a rigid, local similarity, is used to locally significance of the local structural alignment. The threshold E-value < × −4 superimpose the two compared protein structures and then their backbones for acceptance of a structural alignment used here is 1 10 . are examined for pairs of three consecutive amino acid residues meeting the following conditions. The alignment scores used here have been defined by varying them systematically and observing the effects of the changes on the binding site (1) The distance between the two Cα atoms of the two middle residues detection. must be <10 Å. (2) The amino acid codes of three residues of the first protein, must be similar to those of the second protein. A BLOSUM62 (Blocks 2.5 Fingerprint residues Substitution Matrix) matrix is used to score the match of each residue Residues that are important for binding of ligands need not be contiguous pair in these three residue-long alignments (Henikoff and Henikoff, but are typically found in well-defined local 3D arrangements and can be 1992). To exclude spurious matches composed solely of hydrophobic regarded as a fingerprint of the specific binding site. A motif of fingerprint residues, the cumulative score for the three residues must be >12. residues is therefore first identified in a subset of the most reliable structural (3) For each of the three similar residues in a pair of proteins a vector alignments and then a search for a similar motif of residues is conducted by on April 23, 2010 between the first and the third Cα atom is drawn. To ensure similarly among the remaining aligned proteins. oriented matches, the angle between two such vectors must be <45◦. A structure-based sequence alignment of proteins with favorable alignment scores (described in Section 2.4) is constructed in which the Once identified, these backbone segments of three residues are extended structurally aligned protein sequences are listed in the order of their in both directions by adding a new residue to both ends. This is done decreasing alignment lengths. Alignment length is the sum of residues in all independently of sequence order. A pair of residues is only added if it lowers local structural alignments that were found for a pair of compared proteins. the expectation value (see Section 2.4, subsequently) of the alignment. No In the second step, fingerprint residues are identified in the structure-based gaps are allowed and the distances between pairs of newly added residues sequence alignment as follows. may be >10 Å. (1) Only proteins that structurally align to more than one-third of the query protein residues and thus represent the most trusted part of the 2.4 Local structural alignment scores structure-based sequence alignment are considered. A number of scores are used to measure the statistical and structural (2) Discrete similarity scores (between 0 and 9) are calculated for this significance of the local structural alignments and to permit filtering out ‘most trusted’ part of the structure-based sequence alignment. of insignificant alignments (see Section 2.2, step D). Since local structural alignments need not involve a large number of residues, the similarity (3) Structurally similar residues with the highest similarity score, 9 are that is detected could in fact be a frequently occurring 3D motif, having labeled as fingerprint residues. no relationship to common protein function. Further, surface properties in the two aligned surface regions may differ, e.g. the surfaces may have After the fingerprint residues have been identified, their appearance in the different curvatures, or they may be differently oriented. The first step in remainder of structurally aligned proteins is sought and protein structures the calculation of scores is the local superimposition of the two compared that share a locally similar interaction pattern with the query protein are proteins, based on the rotational and translational variation imposed by a retrieved from the structure-based sequence alignment. For the results to be maximum clique. Then, four distinct criteria are applied to each such local accepted, each aligned protein must have a minimum of five such fingerprint structural alignment. residues.

1163

[12:04 13/4/2010 Bioinformatics-btq100.tex] Page: 1163 1160–1168 J.Konc and D.Janeži ˇc

and has been benchmarked as a small-ligand as well as protein– protein binding sites detection tool (Burgoyne and Jackson, 2006). A detailed discussion of the results obtained by ProBiS is provided in the case of biotin carboxylase and in the case of TATA-binding protein (TBP); the unique ability of ProBiS to detect and align similar binding sites in protein structures with different folds and to Fig. 3. Schematic representation of nr-PDB database preparation, con- compare it with other global and local structural alignment methods version to a surface representation and ProBiS results database (MySQL). is described. All experiments described here were performed on a 16 threaded, 2.6 Identification of non-redundant PDB structures 8 core, 2 processor personal computer. On this computer, pairwise A list of more than 23 000 non-redundant single chain protein structures, local alignment of two protein structures with ProBiS requires automatically updated each week, is prepared as follows (Figure 3): the typically <1 s and running ProBiS jobs in parallel, the nr-PDB ∼ RCSB website provides a list of clustered PDBs derived from the 60 000 search with a query protein of ∼200 residues, involving fast filtering ≥ protein structures in the PDB, in which sequences that are 95% identical are followed by structural similarity mapping takes ∼10 min. clustered together. The structure with the highest resolution is selected from each cluster, X-ray crystallographic structures being given preference over NMR structures. The graph representation of all selected protein structures 3.1 Binding sites detection Downloaded from and residues on the surface of the proteins are calculated, and are saved A test set was prepared of 39 proteins that interact either with other into ‘surface files’. These surface files represent a current view of the non- redundant database (nr-PDB), which is updated every week following the proteins, or with small molecules, substrates or cofactors. In cases regular RCSB protein database update. Surface, instead of PDB files are where the ligands were missing from the PDB files, the relevant used with ProBiS, since they enable faster pairwise comparisons. data were supplied from the literature or from homologous cases. As described in the Supplementary Material, we considered the various

2.7 Test set protein structures preparation different types of binding sites as a single united binding region http://bioinformatics.oxfordjournals.org important for protein function and we also considered each binding As a test set for the detection of binding sites, we use 39 non-redundant site type separately. Each of the 39 proteins in the test set was a protein structures, each involved in well characterized protein–protein query used in a search of the database of currently ca. 23 000 non- interactions and crystallized as a complex involving at least one other protein. redundant protein structures extracted from the PDB. The quality of The non-redundancy of the test set is achieved with the blastclust algorithm (Altschul et al., 1997), so that no two proteins (with the exception of two the detected binding regions was measured in terms of the specificity, pairs, see Supplementary Material) have a sequence identity of >30% in a sensitivity and significance of prediction, as defined below. pairwise alignment covering >90% of each of their sequences. Most of these Specificity indicates the proportion of the residues predicted to proteins also bind other ligands, such as cofactors and metal ions, which be in the binding site which are actually in the binding site. If often are missing from the PDB records. Accordingly, we also identified T residues are predicted to be in the binding site, but only S are binding sites for these ligands. A residue is part of a binding site if the correctly predicted, then the specificity is defined as: SP = S/T distance between any of its atoms and an atom from another molecule (e.g. Sensitivity is the proportion of the interface that was predicted. If protein, small molecule, DNA) is less than the sum of their van der Waals the interface requires U residues but only W were correctly predicted radii plus 3.0 Å. Alternatively, if no ligands are available in the crystal to be in the interface, then sensitivity is defined as: SE = W/U structures, known binding site residues obtained from the catalytic site atlas Significance of prediction is the probability P of randomly by on April 23, 2010 (Porter et al., 2004) and literature were used. A detailed description of the identification of the missing binding sites for each of the 39 proteins can be choosing a patch of residues equal in size to the predicted patch, found in the Supplementary Material. but with equal or better correspondence with the actual binding site than the predicted patch (Carl et al., 2008). P is defined as:

  ,    −1 min(Ps Is) 3 RESULTS T I T −I P = s s s s ProBiS detects structurally similar regions and maps them on the Ps i Ps −i i=O surface of the appropriate protein structures. This is accomplished s through the detection of high quality local structural alignments where Ts is the total number of protein residues, Is is the number in a database of non-redundant protein structures. The program’s of residues in the actual protein binding site, Ps is the number of performance in detecting different types of binding sites, e.g. residues in the predicted protein binding site and Os is the number protein–small ligand, protein–protein, protein–DNA, has been of predicted residues that overlap with the actual protein binding studied on a set of 39 protein structures, details of which are provided site. For example, the value P = 0.5 for a patch of predicted residues in Supplementary Material. indicates that in 5 out of 10 attempts, a patch with the same number The local structural similarity scores generated by ProBiS are of randomly chosen residues will lead to a better prediction of the effective predictors of protein binding sites. The predictions, based actual binding site. solely on local structural similarity, are more accurate than those The detailed results from the 39 test set structures are presented in produced by ConSurf (Glaser et al., 2003, 2005), a protein surface the Supplementary Material. The ability of ProBiS to detect protein mapping tool which depends upon sequence conservation and which binding regions in the test set protein structures is compared to that to our knowledge, is the only available tool that maps sequence of ConSurf and to that of Q-SiteFinder in Table 1. ProBiS produces conservation to protein structure. ProBiS is also compared with discrete similarity scores (0–9) and ConSurf produces conservation Q-SiteFinder (Laurie and Jackson, 2005), a method which detects scores (1–9) for surface protein residues, and it is possible to energetic features in protein structures favorable for ligand binding compare the two methods directly. Residues with similarity scores

1164

[12:04 13/4/2010 Bioinformatics-btq100.tex] Page: 1164 1160–1168 ProBiS for structurally similar protein binding sites

Table 1. Binding sites detection with a structural similarity mapping method, similar residues, calculated as the product of the average number of evolutionary conservation mapping method and an energy-based method on residues in a binding site class and the sensitivity of this class, differ a set of 39 proteins less. Small ligand binding sites have, on average, 21.2, and protein– protein binding sites, 16.3 structurally similar residues. The absolute Method Total Interface No. of P-value SP SE similarity as judged by structural similarity scores seems to be × −3 no. of no. predicted ( 10 ) (%) (%) only moderately lower for protein–protein compared to small-ligand residues residues residues binding sites. Clearly, structural similarity does not extend to entire protein– ProBiSa 224 54 62.8 1.4 39.0 43.9 ConSurfb 224 54 60.6 33.0 34.3 38.1 protein binding sites and cannot be used as the only predictor of Q-SiteFinder 224 54 48.5 6.5 39.4 38.2 a protein–protein binding site. Instead, it could be used to detect structurally conserved ‘hot-spots’ (Keskin et al., 2005), residues Tables with detailed results are available in the Supplementary Material. within protein–protein binding sites, with a particularly important aThe alignment scores listed in the ‘Methods’ section were used. function, for example, hydrogen bonding in ligand binding. Hot b ConSurf uses different methods to count conserved residues. We used the Bayesian spots contribute the most to the binding free energy of a protein– method which is enabled by default and gives the best results. protein complex, and their conservation in homologous sequences

has been used to aid in their identification (Guney et al., 2008). Downloaded from 7–9 (ProBiS) and with conservation scores 8–9 (ConSurf), are Protein–protein binding sites are mostly conserved in structural regarded as reliable binding site predictions. Using this definition, homologues while similar small-ligand binding sites are, in addition, the two methods detect almost equal number of structurally found in structurally distant proteins that frequently adopt different similar (ProBiS) and conserved (ConSurf) residues for a test set folds. As an example, the P-loop, which is a small ligand binding site protein. ProBiS retrieves 62.8 structurally similar residues/protein responsible for phosphate binding in the alpha subunit of a G-protein and ConSurf retrieves 60.6 sequence conserved residues/protein (PDB: 1got) is detected as structurally similar across many hundreds

(Table 1). The specificity, sensitivity and significance of prediction http://bioinformatics.oxfordjournals.org of database proteins, including those with different folds, while established by the two methods are directly comparable. The top the protein–protein binding site for the beta subunit, is conserved four predicted sites proposed by Q-SiteFinder as the putative binding substantially in only the nine most similar structural homologues site are considered; a minimum of P-value at this setting is observed retrieved. This is generally observed in the proteins studied and (Supplementary Material). supports a hypothesis that protein–protein binding sites evolve It can be seen from Table 1 that ProBiS achieves on average 4.7% more rapidly than protein–small ligand binding sites. Arguably, this higher specificity and 5.8% higher sensitivity than ConSurf. The − difference may be attributed to their different biological roles: small lower median value, 1.4 × 10 3 for the significance of prediction ligands binding sites govern processes of molecular transformation, also favors the ProBiS algorithm. While ConSurf calculates which are relatively unchanged in time, while protein–protein conservation scores on the basis of multiple sequence alignment binding sites regulate evolutionary transient processes in which typically of hundreds of homologous protein sequences, ProBiS connections with different proteins are developed. usually uses tens of locally similar structures. These results suggest that local structural similarity provides some additional precision to the detection of binding sites. ProBiS outperforms the energy- 3.3 Detailed binding sites detection examples

based Q-SiteFinder method, which produces median P-value of by on April 23, 2010 − Biotin carboxylase (PDB: 1bnc) is a homodimer with two binding 6.5 × 10 3 at a similar specificity as our algorithm and at 5.7% sites: an active site in which the ligand biotin is bound and a binding lower sensitivity. Consequently, structural comparison should be a site for the second identical protein subunit (Fig. 4A and B). A search method of choice in the case when 3D structural data for a protein of the nr-PDB database with one of the homodimer subunits as of interest is available and there are similar structures in the PDB. the query protein, resulted in 178 structures with locally similar surface patches retrieved from the nr-PDB; 79% of the active site 3.2 Structural similarity in protein–protein and pocket and 31% of the protein–protein binding site were found to be protein–small ligand binding sites structurally similar. In biotin carboxylase, the two binding sites are The final test set of 39 proteins includes 39 protein–protein binding ∼12 Å apart and conservation of one is unlikely to be correlated sites, 17 sites in which small ligands bind, 1 protein–DNA binding with that of the other. Notably, the protein–protein interface in site and 1 loop, associated with protein folding. In the 39 proteins, this homodimer was found in another study (Caffrey et al., 2004) the average protein–protein binding site contains 42.8 amino acid which used multiple sequence alignment of this protein’s sequence residues and a small-ligand binding site, 27.9 amino acid residues. homologues to determine the extent of conservation of its surface Every residue involved in binding, to all ligands including cofactors residues, to be less conserved than the rest of the exposed surface. and metal ions is counted in these numbers. Small-ligand binding In contrast, local structural similarity, used by ProBiS seems to be sites have 76.0% of their residues structurally similar. In comparison, a good predictor of the protein–protein binding site, especially of protein–protein binding sites, with the sensitivity of 38.0%, are two residues that are involved in hydrogen bonding between the two times less structurally similar. The only protein–DNA binding site subunits. studied (PDB: 1ytf) was predicted with the sensitivity of 63.0%, TATA-binding protein (TBP) recognizes a characteristic sequence which, despite its larger size (46 residues), suggests some similarities of nucleotides composed of deoxyadenosine and deoxythymidine in terms of its structural similarity scores, to the small-ligand binding (the TATA box) and binds to it, thus marking the starting point sites. The sensitivities in the two major binding site classes are of transcription. TBP (PDB: 1ytf) forms a complex with the markedly different. However, the absolute number of structurally transcription factor IIA (Fig. 4C) and the TATA box (Fig. 4D).

1165

[12:04 13/4/2010 Bioinformatics-btq100.tex] Page: 1165 1160–1168 J.Konc and D.Janeži ˇc Downloaded from http://bioinformatics.oxfordjournals.org by on April 23, 2010

Fig. 4. Structural similarity pattern in the homodimer protein biotin carboxylase (PDB: 1bnc) and in the TATA-binding protein (TBP) (PDB: 1ytf). The proteins are represented as pink cartoon models, TATA box DNA as cyan cartoon model; the structurally similar residues are shown as yellow, orange and red stick models and the interacting residues on the opposing chains are shown as pink stick models. Hydrogen bond pattern occurring (A) between biotin carboxylase and bound ligands, biotin, ADP, Mg2+ and bicarbonate ion; (B) between the two subunits of biotin carboxylase; (C) between the TBP and the transcription factor IIA and (D) between the TBP and the TATA box DNA is shown.

Structurally similar residues in the DNA binding region of TBP of a pair adopt different folds, but have a similar binding site and form multiple hydrogen bonds with the TATA box (Fig. 4D); this perform a similar function (Russell, 1998). These pairs of proteins region has been observed previously to be conserved in sequence were identified by an all-against-all comparison of SCOP (Structural (Patikoglou et al., 1999). ProBiS also identifies protein–protein Classification of Proteins) representatives and the similar binding binding site residues (Fig. 4C) with the sensitivity of 50% (ConSurf site residues identified in each protein pair can be superimposed with the sensitivity of 16.7%). To our knowledge, the structural with low RMSD. similarity involving these residues has not been reported previously. We performed pairwise comparisons using the ProBiS algorithm between the proteins in each pair reporting the highest scoring local structural alignment, i.e. that with the lowest E-value. Then, 3.4 Detection of similar function in structurally we compared the alignments produced by ProBiS with those of the unrelated proteins global and local structural alignment algorithms and obtained the In order to demonstrate the unique ability of ProBiS to detect and results shown in Table 2. All three structural alignment programs align similar binding sites in the absence of fold similarity, we that we used for comparison, DaliLite (Holm et al., 2008), MolLoc examined 10 pairs of protein structures, where the two members (Angaran et al., 2009) and MultiBind (Shulman-Peleg et al., 2008),

1166

[12:04 13/4/2010 Bioinformatics-btq100.tex] Page: 1166 1160–1168 ProBiS for structurally similar protein binding sites

Table 2. Comparison of structural alignments quality of similar binding sites on 10 protein pairs with dissimilar folds

First protein structure Second protein structure ProBiS DaliLite MolLoca MultiBindb

PDB Residue numbers Ligand PDB Residue numbers Ligand RMSDs (Å)

1addA 262,295,181,15,214,238 ZN 1bmcA 168,90,152,86,88,149 ZN 6.53 14.31 13.29 14.59 1eceA 114,162,238,116,161,27 BGC 2dnjA 212,39,134,252,7,170 DNA 9.91 12.91 12.48 8.05 1phrA 12,129,18,19 SO4 1vhrA 124,92,130,131 EPE 1.50 8.94 8.10 5.37 1bmfA 269,270,273,175,176 ANP 1aylA 268,269,213,254,255 ATP 2.34 n/a 3.26 11.93 1ampA 179,117,256,97,228 ZN 1alkA 51,327,331,370,102 ZN 2.14 10.69 6.60 11.91 1ribA 84,237,238,118,121,122 FEO 1vhhA 130,148,127,135,139,142 ZN 4.82 n/a n/a 13.19 1powA 308,311,286 FAD 1inpA 311,370,358 n/a 1.07 35.70 19.75 n/a 1alkA 369,327,101,331,370,166 ZN 1fjmA 64,92,208,125,173,221 MN 7.00 n/a 9.63 7.92 1qbaA 844,845,847,850,849,852 n/a 1eurA 176,177,179,182,181,184 n/a 0.27 46.12 0.30 n/a 2kauC 134,136,219,246,320,364 NI 2mhrA 54,25,106,77,73,62 FEO 12.68 25.95 12.41 12.62 Downloaded from Mean RMSDs (Å) 4.82 22.09 9.54 10.69

aGeometry + atom type method is used; in the two cases, where ligands are not available (n/a), we restricted the comparison to the corresponding residue numbers given above. bThe compared binding sites are defined by the surface region within 8.0 Å from small ligands with codes ZN, MN, SO4, EPE, NI and FEO; the default threshold of 4.0 Å is used elsewhere.

which represent different approaches to structural alignment are a new approach for the detection of binding sites in a 3D protein http://bioinformatics.oxfordjournals.org available as web-servers. Global alignment algorithms such as that structure by searching for the locally similar surface structures in DaliLite maximize the aligned length and, concurrently, minimize in a large database of protein structures. The maximum clique the RMSD between two protein backbones. Local structural technique, the core of the algorithm detects 3D correspondences alignment algorithms, such as that in MolLoc and that in MultiBind, between proteins at a sub-residue level. Local structural similarity minimize the RMSD between preselected regions of the proteins, scores are calculated and mapped to the query protein surface. Such e.g. binding sites. prediction of binding sites which depends on structural similarity The comparison between the methods listed in Table 2 is made by of protein surfaces is useful and accurate and can enjoy success calculating, after superimposition, the RMSD between the similar in structures with dissimilar folding patterns. Structural similarity binding site residues, which had been previously identified (Russell, can provide additional advantages over sequence conservation in 1998). DaliLite produces the worst RMSD, 22.1 Å, presumably the detection of functional regions such as binding sites and it is because the protein structures that are compared do not have similar concluded that structural comparison should be the method of choice folds and consequently global alignment fails to superimpose the when a crystallographic or NMR structure for a protein of interest is binding site residues. For three of the protein pairs that were available. The ProBiS algorithm for detection of structurally similar

examined, DaliLite does not produce any result, probably because binding sites in proteins is freely available as a web-tool. Detailed by on April 23, 2010 the proteins folds are too dissimilar. In the remaining five cases, explanation and instructions to users of ProBiS can be found at http:// DaliLite misaligns the binding sites. MolLoc and MultiBind produce probis.cmm.ki.si. better alignments with RMSDs of 9.5 Å and 10.7 Å, respectively, Funding: Ministry of Higher Education, Science and Technology of but unlike ProBiS and DaliLite, do not support unsupervised Slovenia and the Slovenian Research Agency (No.P1-0002); Jozef comparison. Thus the user has to manually select regions of protein Mianowski Fund (to J. K.). surfaces that are to be compared. Where the ligands are missing in the compared structures, MolLoc allows the compared binding sites Conflict of Interest: none declared. to be defined with a set of user provided residues; we used residues given in Table 2. MultiBind only allows comparisons, where a ligand is present in each of the two compared binding sites, consequently REFERENCES for two pairs we could not perform the alignment; in all other cases, Altschul,S.F. and Gish,W. (1996) Local alignment statistics. Methods Enzymol., 266, the RMSD of the alignment in the top scoring solution is presented. 460–480. In proteins which have different folds, ProBiS produces better Altschul,S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein alignments of the residues in the similar binding site detected by database search programs. Nucleic Acids Res., 25, 3389–3402. Angaran,S. et al. (2009) MolLoc: a web tool for the local structural alignment of Russell et al. (1998) than either DaliLite, MolLoc or MultiBind molecular surfaces. Nucleic Acids Res., 37, W565–W570. to judge from the calculated RMSD between these residues. In Ausiello,G. et al. (2008) FunClust: a web server for the identification of structural motifs addition, ProBiS can align protein structures in an unsupervised in a set of non-homologous protein structures. BMC Bioinformatics, 9,S2. fashion, which allows it to perform database searches. Berman,H.M. et al. (2002) The protein data bank. Acta Crystallogr. D, D58, 899–907. Burgoyne,N.J. and Jackson,R.M. (2006) Predicting protein interaction sites: binding 4 CONCLUSION hot-spots in protein–protein and protein–ligand interfaces. Bioinformatics, 22, 1335–1342. It is well known that protein binding sites are structurally similar and Caffrey,D.R. et al. (2004) Are protein–protein interfaces more conserved in sequence many methods exist that can structurally align proteins. We introduce than the rest of the protein surface? Protein Sci., 13, 190–202.

1167

[12:04 13/4/2010 Bioinformatics-btq100.tex] Page: 1167 1160–1168 J.Konc and D.Janeži ˇc

Carl,N. et al. (2008) Protein surface conservation in binding sites. J. Chem. Info. Mod., Konc,J. and Janežic,D.ˇ (2007b) An improved branch and bound algorithm 48, 1279–1286. for the maximum clique problem. MATCH Commun. Math. Comput. Chem., 58, Debret,G. et al. (2009) RASMOT-3D PRO: a 3D motif search webserver. Nucleic Acids 569–590. Res., 37, W459–W464. Laurie,A.T.R. and Jackson,R.M. (2005) Q-SiteFinder: an energy-based method Ezkurdia,I. et al. (2009) Progress and challenges in predicting protein–protein for the prediction of protein-ligand binding sites. Bioinformatics, 21, 1908–1916. interaction sites. Brief. Bioinform., 10, 233–246. Lecomte,J.T. et al. (2005) Structural divergence and distant relationships in proteins: Gherardini,P.F. et al. (2007) Convergent evolution of enzyme active sites is not a rare evolution of the globins. Curr. Opin. Struct. Biol., 15, 290–301. phenomenon. J. Mol. Biol., 372, 817–845. Patikoglou,G.A. et al. (1999) TATA element recognition by the TATA box-binding Glaser,F. et al. (2003) ConSurf: identification of functional regions in proteins by protein has been conserved throughout evolution. Genes Dev., 13, 3217–3230. surface-mapping of phylogenetic information. Bioinformatics, 19, 163–164. Porter,C.T. et al. (2004) The catalytic site atlas: a resource of catalytic sites and residues Glaser,F. et al. (2005) The ConSurf-HSSP database: the mapping of evolutionary identified in enzymes using structural data. Nucleic Acids Res., 32, D129–133. conservation among homologs onto PDB structures. Proteins: Struct. Funct. Russell,R.B. et al. (1997) Recognition of analogous and homologous protein folds: Bioinform., 58, 610–617. analysis of sequence and structure conservation. J. Mol. Biol., 269, 423–439. Guney,E. et al. (2008) HotSprint: database of computational hot spots in protein Russell,R.B. (1998) Detection of protein three-dimensional side-chain patterns: new interfaces. Nucleic Acids Res., 36, D662–D666. examples of convergent evolution. J. Mol. Biol., 279, 1211–1227. Henikoff,S. and Henikoff,J.G. (1992) Amino acid substitution matrices from protein Schmitt,S. et al. (2002) A new method to detect related function among proteins blocks. Proc. Natl Acad. Sci. USA, 89, 10915–10919. independent of sequence and fold homology. J. Mol. Biol., 323, 387–406. Holm,L. et al. (2008) Searching protein structure databases with DaliLite v.3. Shulman-Peleg,A. et al. (2007) Spatial chemical conservation of hot spot interactions Bioinformatics, 24, 2780–2781. in protein-protein complexes. BMC Biol., 5, 43. Downloaded from Karlin,S. and Altschul,S.F. (1990) Methods for assessing the statistical significance of Shulman-Peleg,A. et al. (2008) MultiBind and MAPPIS: webservers for multiple molecular sequence features by using general scoring schemes. Proc. Natl Acad. alignment of protein 3D-binding sites and their interactions. Nucleic Acids Res., Sci. USA, 87, 2264–2268. 36, W260–W264. Keskin,O. et al. (2005) Hot regions in protein-protein interactions: the organization Tuncbag,N. et al. (2009) A survey of available tools and web servers for analysis of and contribution of structurally conserved hot spot residues. J. Mol. Biol., 345, protein-protein interactions and interfaces. Brief. Bioinform., 10, 217–232. 1281–1294. Valdar,W.S.J. and Thornton,J.M. (2001) Protein–protein interfaces: analysis of Konc,J. et al. (2006) Molecular surface walk. Croat. Chem. Acta, 79, 237–241. amino acid conservation in homodimers. Proteins: Struct. Funct. Genet., 42, Konc,J. and Janežic,D.ˇ (2007a) Protein-protein binding-sites prediction by protein 108–124. surface structure conservation. J. Chem. Info. Mod., 47, 940–944. http://bioinformatics.oxfordjournals.org by on April 23, 2010

1168

[12:04 13/4/2010 Bioinformatics-btq100.tex] Page: 1168 1160–1168 Structure-Based Function Prediction of Uncharacterized Protein Using Binding Sites Comparison

Janez Konc1, Milan Hodosˇcˇek1, Mitja Ogrizek1, Joanna Trykowska Konc1, Dusˇanka Janezˇicˇ1,2* 1 National Institute of Chemistry, Ljubljana, Slovenia, 2 University of Primorska, Faculty of Mathematics, Natural Sciences and Information Technologies, Koper, Slovenia

Abstract A challenge in structural genomics is prediction of the function of uncharacterized proteins. When proteins cannot be related to other proteins of known activity, identification of function based on sequence or structural homology is impossible and in such cases it would be useful to assess structurally conserved binding sites in connection with the protein’s function. In this paper, we propose the function of a protein of unknown activity, the Tm1631 protein from Thermotoga maritima, by comparing its predicted binding site to a library containing thousands of candidate structures. The comparison revealed numerous similarities with nucleotide binding sites including specifically, a DNA-binding site of endonuclease IV. We constructed a model of this Tm1631 protein with a DNA-ligand from the newly found similar binding site using ProBiS, and validated this model by molecular dynamics. The interactions predicted by the Tm1631-DNA model corresponded to those known to be important in endonuclease IV-DNA complex model and the corresponding binding free energies, calculated from these models were in close agreement. We thus propose that Tm1631 is a DNA binding enzyme with endonuclease activity that recognizes DNA lesions in which at least two consecutive nucleotides are unpaired. Our approach is general, and can be applied to any protein of unknown function. It might also be useful to guide experimental determination of function of uncharacterized proteins.

Citation: Konc J, Hodosˇcˇek M, Ogrizek M, Trykowska Konc J, Janezˇicˇ D (2013) Structure-Based Function Prediction of Uncharacterized Protein Using Binding Sites Comparison. PLoS Comput Biol 9(11): e1003341. doi:10.1371/journal.pcbi.1003341 Editor: Alexander Donald MacKerell, University of Maryland, Baltimore, United States of America Received July 15, 2013; Accepted October 1, 2013; Published November 14, 2013 Copyright: ß 2013 Konc et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Funding: Financial support was provided by grant P1-0002 of the Ministry of Higher Education, Science, and Technology of Slovenia and the Slovenian Research Agency. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Competing Interests: The authors have declared that no competing interests exist. * E-mail: [email protected]

Introduction Tm1631, has no human analogues and may be an important source, for example of new targets for development of antimicro- Experimental determination of protein function is the most bials [8]. To elucidate their functions, there is a need for methods reliable way to characterize proteins of unknown activity but it is that go beyond sequence and structure homology and are able to difficult to prioritize functional experiments amongst the many provide testable hypotheses to guide functional experiments. possible functions a protein could perform. To guide experimen- Because binding sites are usually more evolutionarily conserved talists, a number of computer approaches have been developed for structures and more directly linked to function than complete prediction of protein function [1,2]. Web portals have been proteins, comparison of protein binding sites to predict function is created that allow sharing information about protein structures an attractive alternative to sequence- or structural homology-based [3,4]. In spite of these efforts, the gap between proteins with methods [2]. Such evolutionarily conserved binding site structures experimentally determined function and those with unknown can be found by local structural alignment algorithms that detect function is growing [5,6]. A recent study suggests that more than similar residue patterns in protein binding sites irrespective of 40% of known proteins lack any annotation in public databases although many are evolutionarily conserved and probably possess sequence or fold similarity of proteins [9–12]. The algorithm, important biological roles [5]. ProBiS (Protein Binding Sites) [11] compares protein binding sites The TM1631 gene from Thermotoga maritima encodes a protein represented as protein graphs in a pairwise fashion using a fast which is a member of a large and widely distributed Duf72 family maximum clique algorithm [13] on protein product graphs, and of domains of unknown function according to Protein family finds sets of residues that are physicochemically and geometrically (Pfam) classification [7]. The structure of Tm1631 has been related. Querying a target binding site, or target protein structure, determined by Joint Center for Structural Genomics (PDB: 1vpq), against a database of template protein structures, ProBiS retrieves but inferences as to its function are unreliable, because it enjoys proteins with similar binding sites, as defined in this way and from little relationship, only about 7% sequence identity, to proteins the resulting alignments it calculates degrees of structural with diverse known functions. Currently, in 2013, some 3000 conservation for all surface amino acid residues of the target proteins of unknown function in the PDB await characterization of protein. These degrees, mapped to the protein’s surface in their function, and for about one third of these proteins, including different colors, show structural evolutionary conservation in the Tm1631, there is little hope that their function will be discovered target protein’s surface, and predict the location of binding sites as using conventional methods based on sequence or structure validated on the set of 39 protein structures with known binding homology [6]. A substantial proportion of these proteins, including sites [11].

PLOS Computational Biology | www.ploscompbiol.org 1 November 2013 | Volume 9 | Issue 11 | e1003341 Function Prediction of Uncharacterized Protein

Author Summary For a substantial proportion of proteins, their functions are not known since these proteins are not related in sequence to any other known proteins. Binding sites are evolutionarily conserved across very distant protein fam- ilies, and finding similar binding sites between known and unknown proteins can provide clues as to functions of the unknown proteins. We choose one of the ‘‘unknown function’’ proteins, and found, using a novel strategy of binding site comparison to construct a hypothetical protein-ligand complex, subsequently validated by molec- ular dynamics that this protein most likely binds and repairs the damaged DNA similar to known DNA-repair enzymes. Our methodology is general and enables one to determine functions of other proteins currently labelled as ‘‘unknown function’’. We envision that the methodology presented herein, the binding sites comparisons enhanced by molecular dynamics, will stimulate the function prediction of other uncharacterized proteins with struc- tures in the Protein Data Bank and boost experimental functional studies of proteins of unknown functions.

In this work, we investigate a new strategy to predict protein function employing ProBiS enhanced by molecular dynamics (MD) simulation (Figure 1), to find structurally evolutionarily conserved binding sites. We validate the new strategy on a set of 369 well-characterized proteins and then apply it to the unknown Tm1631 protein. The strategy proceeds in a number of steps. We first find the binding site on the Tm1631 protein. Then we search for proteins with similar binding sites in the Protein Data Bank [14] (PDB) using the novel binding sites comparison approach described here. In this way, we identify a previously unknown phosphate binding site on Tm1631 that binds a phosphate group of a nucleic acid ligand. To refine the search and narrow down possible functions of Tm1631 protein, we compare this newly identified phosphate binding site with the binding sites in endonuclease IV nucleic acids binding proteins, which are the closest relatives of Tm1631 according to sequence identity, in the a8b8 triose phosphate isomerase (TIM) barrel fold [15] of which the Tm1631 is a member. A similarity is detected with endonuclease IV DNA-binding site, one of the TIM barrel folds. Based on the superimposition of Tm1631 upon endonuclease IV, we construct a hypothetical model of the Tm1631-DNA complex. Finally, using MD simulation we find that the Tm1631 protein forms favorable interactions with the DNA, which are comparable to those seen in the endonuclease Figure 1. Workflow of the function prediction for the Tm1631 IV-DNA complex. In addition, the binding free energies of protein structure of unknown function. doi:10.1371/journal.pcbi.1003341.g001 Tm1631-DNA model and endonuclease IV-DNA complex are in close agreement. Combined, these findings suggest that the Results proposed Tm1631-DNA complex is valid, and support specu- lation that the cleavage of the DNA phosphodiester bond by Based on the prediction of its binding site, and comparison of Tm1631 is distinct from that of endonuclease IV. Tm1631 can this predicted binding site with the protein structures in the PDB, thus be identified provisionally as a DNA binding enzyme with we propose a DNA-repair function for Tm1631, the protein of endonuclease activity, and experimental investigations can be unknown activity. We find that despite the low sequence identity directed towards the repair of DNA lesions in which at least two of the Tm1631 and endonuclease IV proteins the Tm1631 protein consecutive nucleotides in each DNA strand are unpaired, e.g., binding site is similar to the known DNA-binding site in pyrimidine dimers formed from thymine or cytosine bases in endonuclease IV. Construction of a Tm1631-DNA model by DNA via photochemical reactions [16]. Such comparison of superimposition of the similar binding sites found, and running binding sites and generation of hypothetical protein-ligand MD simulations shows that Tm1631 enjoys favorable interactions models followed by molecular dynamics analysis is a method with DNA, similar to those seen in the endonuclease IV-DNA with which function can be assigned to uncharacterized complex. We find that Tm1631 is probably a new endonuclease proteins. functioning in a different way than endonuclease IV.

PLOS Computational Biology | www.ploscompbiol.org 2 November 2013 | Volume 9 | Issue 11 | e1003341 Function Prediction of Uncharacterized Protein

Detailed view of Tm1631 function Table 1. Top-ranked similar binding sites in proteins of Using ProBiS [17], two structurally conserved patches were different folds found using the predicted binding site in found on the surface of the Tm1631. The first lies in a groove in Tm1631 as query to the binding site comparison approach.a the protein surface at the C-terminal side of the TIM barrel (Figure 2, left), and is at a position where proteins of TIM barrel fold often have an active site [18,19]. We thus considered this Rank PDB Ligand Function patch to be a candidate Tm1631 binding site and used it in a 1 3qrf DNA DNA-binding protein substructure search against the non-redundant PDB. The second structurally conserved patch is on a relatively flat surface (Figure 2, 2 2w9m DNA DNA replication right). Judging by the results from PISA program [20], this second 3 3zte RNA RNA-binding protein patch is a homodimer binding site on Tm1631. We focused our 4 2jjx Nucleotide Transferase/kinase further investigation on the first patch since it promises to reveal 5 2vy0 Other Hydrolase more than the homodimer binding site about the protein’s 6 3fhf DNA DNA repair function. 7 1nsc Other Hydrolase We compared the predicted binding site in the Tm1631 protein with protein structures from the non-redundant PDB using 8 1lvg Nucleotide Transferase/kinase binding site comparison approach (see Methods). This comparison 9 1nio RNA Hydrolase showed that the predicted binding site in Tm1631 is very similar to 10 3zzs RNA Transcription various nucleotide and nucleic acids binding sites in proteins with aThe entire list of similar binding sites is in Table S1 in Text S1. folds unrelated to the fold of Tm1631 (Table 1 and Table S1 in doi:10.1371/journal.pcbi.1003341.t001 Text S1). Out of 10 highest ranked similar binding sites, six were DNA or RNA binding sites, and two were nucleotide binding sites. the Tm1631 residues near the sulfate-262 in their type and Highest ranked were binding sites in enzymes involved in DNA orientation (Figure 3b). replication (Figure 3a), transfer of phosphate groups (Figure 3b), The phosphate binding site is most likely also the active site in and DNA repair (Figure 3c). These results indicated that the the Tm1631 protein, as judged from similarity with active sites in predicted binding site in Tm1631 probably binds a nucleotide polymerase X (2w9m), guanylate kinase (1lvg), and others ligand. (Table 1). Based on the reactions performed by the similar active Further, we detected residues predisposed to phosphate binding sites found, Tm1631 can act on a phosphate group of a nucleotide within the predicted nucleotide binding site. Similarities with the catalyzing nucleophilic substitution or phosphoryl transfer. These phosphate binding patterns that we found in similar nucleotide reactions require electropositive surface potential in the active site binding sites were concentrated on the highly conserved patch of that withdraws electrons from the phosphate group, rendering it residues near the co-crystallized sulfate ion (sulfate-262) in susceptible to nucleophilic attack [21]. The predicted phosphate Tm1631 (Figure 3), suggesting that this surface patch is the binding site in Tm1631 is electropositive (Figure S2 in Text S1), phosphate binding site in Tm1631. Uridine monophosphate and thus agrees in this respect with the proposed reaction and with kinase (2jjx), for example, contains a phosphate binding pattern the mechanisms operating at the similar active sites found. of residues Tyr/His/Arg/Glu/Arg that almost perfectly matched The similar binding sites found suggest that the Tm1631 protein is a nucleotide binding enzyme, i.e. the identified active site binds and catalyzes a reaction on a nucleotide phosphate group. However, our attempts to construct a model of Tm1631 bound to these ligands were unsuccessful, because the resulting models had too many clashes between the nucleotide ligands and the Tm1631 protein, which prevented further investigation as to how Tm1631 could bind with these ligands. To find a nucleotide ligand that could bind to the Tm1631, we focused our search for similar binding sites to only TIM barrel proteins that bind nucleotides. According to the standard structural similarity tool [22], endonuclease IV are the most structurally similar nucleotide, specifically, DNA binding proteins out of the TIM barrel proteins, sharing about 7% sequence identity with Tm1631. Using the predicted binding site in Tm1631 as query, we thus searched for similar patterns in all endonuclease IV crystal structures available in the PDB, and found a similar residue pattern in endonuclease IV DNA binding site (PDB: 2nqj, Chain ID: B) (Figure S3 in Text S1). Endonuclease IV is a DNA-repair enzyme that catalyzes phosphodiester bond cleavage in DNA, which is thought to be a nucleophilic substitution reaction on one of the DNA phosphate groups [23,24]. Endonuclease IV binds to an extra-helical region Figure 2. Tm1631 protein surface conservation analysis by in DNA, that is a region with interrupted base pairing, and ProBiS. Tm1631 is shown in surface representation, which is colored by degrees of structural conservation from unconserved (white) to recognizes the apurinic/apyrimidinic (AP) site in the DNA, conserved (red). The predicted binding site is encircled by a yellow consisting of a nucleotide lacking a base, but with an intact dashed line. sugar-phosphate backbone. The enzyme cleaves phosphodiester doi:10.1371/journal.pcbi.1003341.g002 bond 59 at the AP site, creating a nick in one of the DNA strands.

PLOS Computational Biology | www.ploscompbiol.org 3 November 2013 | Volume 9 | Issue 11 | e1003341 Function Prediction of Uncharacterized Protein

Tm1631 in this model and the shape of the groove in the Tm1631 roughly resembled the crescent-shaped DNA-binding groove found in endonuclease IV (Figure 4, right); in both proteins the grooves bound to the same DNA strand. The model suggested that similar to Arg37 and Tyr72 in endonuclease IV, Tyr47 and Tyr48 in Tm1631 bind to the DNA from within the extra-helical region. In endonuclease IV, these residues stack with the DNA bases from within the extra-helical region, and enable the enzyme to distinguish between damaged and normal DNA [23,24]. Due to their similar physicochemical properties, Tyr47 and Tyr48 could form similar stacking interactions with the bases. The presence of a similar groove in the Tm1631 as can be seen in endonuclease IV, and the two-tyrosine motif that could replace residues binding to the extra-helical region in endonuclease IV, were supportive of our Tm1631-DNA model. However, to view the precise picture of the possible interactions between the Tm1631 and the DNA, we had to refine our model with MD.

Molecular dynamics simulation of the Tm1631-DNA model To examine the plausibility of the induced fit upon binding of DNA to the Tm1631 we performed an MD simulation of the Tm1631-DNA model in water. Although MD is a theoretical experiment, it showed that DNA fragment remains bound to the Tm1631 throughout the 90 ns of simulation. In addition, new interactions not seen in the initial model formed between Tm1631 and DNA during MD. The final Tm1631-DNA model after MD is shown in Figure 5a and 5b. We compared the trajectory of Tm1631-DNA model with that of the endonuclease IV-DNA complex, and found many similarities between the Tm1631 and endonuclease IV binding sites that were initially not detected by the similarity detection with ProBiS. Specifically, the residues Ser10, Tyr48, Gln50, Trp53, Arg54, His79, and Gln114 in Tm1631 that hydrogen bonded with the DNA seemed to be direct equivalents, according to their similar positions in the binding site and similar interactions they Figure 3. Similar evolutionary patterns in nucleotide binding formed with the DNA, to Ser9, Tyr72, Gln38, Trp39, Arg40, sites found in PDB using ProBiS. Predicted Tm1631 binding site His231, and Gln261 in the known DNA binding site of (left) is similar to (right): (a) active site in DNA binding site of polymerase X (2w9m) with DNA ligand that was transposed from homologous endonuclease IV (Figure 5c and 5d). Further, we calculated from protein structure 3au6; (b) allosteric site in uridine monophosphate the trajectories the binding free energy of the Tm1631-DNA kinase (2jjx); (c) active site of DNA-glycosylase (3fhf) with DNA ligand transposed from homologous protein structure 3knt. doi:10.1371/journal.pcbi.1003341.g003

This was consistent with our findings in fold-unrelated non- redundant PDB proteins, which suggested nucleophilic substitu- tion reaction on phosphate group as the reaction catalyzed by Tm1631. In addition, given the similar residue patterns found within their binding sites, their similar sizes of ,270 amino acids, and similar electrostatic potential in their binding sites (Figure S2 and S3 in Text S1), implies that the Tm1631 protein could have a related function to endonuclease IV.

Tm1631-DNA model To test the ‘‘endonuclease function’’ hypothesis, we created a Tm1631-DNA model by transposing the DNA fragment from the endonuclease IV co-crystal structure (2nqj) to the Tm1631 (1vpq) Figure 4. Tm1631-DNA model based on comparison of Tm1631 with superimposition of their binding sites. In our model (Figure 4, protein (1vpq) to known endonuclease IV-DNA complex (2nqj) left), one DNA strand bound into a groove in the surface of from PDB. Tm1631 is white, endonuclease IV is blue, DNA is green and Tm1631, so that the reactive phosphate group, i.e. the phospho- light-blue cartoons, sulfate ions are CPK sticks, crescent-shaped grooves in both proteins are shaded areas. Initial Tm1631-DNA model; Tyr47 and diester bond 59 of the AP site that is cleaved by endonuclease IV, Tyr48 penetrate the DNA’s extra-helical region (left). Endonuclease IV- was located about 5 A˚ from the predicted phosphate binding site. DNA complex (right). There were very few clashes between atoms of the DNA and the doi:10.1371/journal.pcbi.1003341.g004

PLOS Computational Biology | www.ploscompbiol.org 4 November 2013 | Volume 9 | Issue 11 | e1003341 Function Prediction of Uncharacterized Protein

Figure 5. Tm1631-DNA model after 90 ns of MD. Reactive phosphate group in DNA is marked with a red asterisk. (a) Tm1631-DNA model, residues that interact with the DNA are marked. (b) Magnified view of the Tm1631-DNA interface. DNA phosphate groups and residues that interact with the DNA are represented as sticks; black dashed lines denote putative hydrogen bonds and salt bridges. (c) and (d) Schematic picture of Tm1631-DNA and endonuclease IV-DNA interactions. Similar residues in Tm1631 and endonuclease IV binding sites are in white and blue ellipses, respectively. Hydrogen bonds with DNA are shown for amino acid side chains (solid black arrows) and backbone atoms (solid cyan arrows). Stacking interactions with DNA nucleotides are dashed black lines. doi:10.1371/journal.pcbi.1003341.g005 model and of the known endonuclease IV-DNA complex; these between the DNA and Tm1631 during minimization. We also saw were 240614 kcal/mol and 252623 kcal/mol, respectively occasional unpairing of terminal base pairs in the control (Figure S1 in Text S1). This good agreement of binding free simulation of endonuclease IV-DNA complex, which suggests energies indicated that the Tm1631 is a similarly good binder of that this is a common process in DNA bound to endonuclease IV. DNA as endonuclease IV. Contrary to the endonuclease IV-DNA simulation, in our The MD also showed that binding of DNA to Tm1631 requires Tm1631-DNA model, the C8:G8 base pair opened (Figure 5c), no major structural changes from the either partner. The root- so that the C8 rotated ,180u to its original position in mean-square deviation (RMSD) between the Ca atoms of the endonuclease IV (Figure 5d). Most conformational changes in Tm1631 before and after MD was ,1.6 A˚ and the corresponding Tm1631 were found in the loop Asn44-Ser52, which binds the RMSD between the phosphorous atoms of DNA was ,3.7 A˚ . extra-helical region of the DNA fragment. The phenyl rings of This last RMSD could also be attributed to the periodical Tyr47 and Tyr48 rotated ,100u about x1 relative to their position fragmenting and reconstitution of the terminal C14:G17 and in the crystal structure 1vpq to point into the solvent, almost G15:C16 base pairs during MD, and to the formation of T-shaped perpendicular to the protein surface, which enabled them to insert intermediates [25,26], as well as to the relaxation of atomic clashes themselves through the DNA minor groove, where Tyr47 stacked

PLOS Computational Biology | www.ploscompbiol.org 5 November 2013 | Volume 9 | Issue 11 | e1003341 Function Prediction of Uncharacterized Protein with G8, and displaced G9 opposite to the AP site; Tyr48 filled the structure 1vpq (and also in homologous structures 1vpy and 1ztv), gap left by the missing base of the AP site and stacked with the 59 which additionally distinguishes it from endonuclease IV. base (C6). A simulation of Tm1631 in the unbound state Could therefore the Tm1631 be a new kind of endonuclease confirmed that these movements also occur without the DNA that senses a different kind of DNA lesion than endonuclease IV? bound (Figure S4 in Text S1), which indicated that Tyr47 and The two-tyrosine motif in the Tm1631, which prevents base Tyr48 are in a correct conformation to bind the DNA already in pairing between the two DNA strands, resembles the typical the unbound state of Tm1631. Conformational changes also mechanism by which endonucleases sense irregularities like extra- occurred in the loop Arg195-Asp209 in the Tm1631 but this loop helical region in DNA structure, and this indicates that the did not bind to the DNA in our model. These last movements Tm1631 could be an endonuclease. However, in Tm1631 the could be correlated with the high flexibility of this loop indicated cleavage of the phosphodiester bond must follow a different by the high B-factors seen in the crystal structure 1vpq. mechanism than the one employed by endonuclease IV, because, unlike endonuclease IV, Tm1631 has no metal ions in the active Discussion site to coordinate the reactive 59-phosphate of the AP site. Instead, in our Tm1631-DNA model, this phosphate is coordinated by We are proposing a structural model of Tm1631 binding to hydrogen bonds from Asn44, Lys72, and Gln114 (Figure 5b). DNA suggesting that Tm1631 could perform a similar DNA During MD, the phosphate however stays about 3 A˚ from the repair function as endonuclease IV (Video S1). This model was predicted phosphate binding site (Figure 3), where it forms built by superimposition of binding sites of Tm1631 and additional hydrogen bonds with Arg145 and Arg191 (Figure S8 in endonuclease IV proteins. This superimposition differs from the Text S1). These hydrogen bonds enable nucleophilic attack on the backbone superimposition obtained with standard structural phosphorous atom by attracting electrons from the phosphorus alignment tool [22], which can only produce a model in which atom, analogous to catalytic Zn2+ ions in endonuclease IV. Metal many atoms of the DNA and the Tm1631 clash (Figure S5 in Text ions might also be absent due to uncertainties in electron density S1). In contrast, our model, based on the superimposition of or experimental conditions, although they actually bind to the binding sites, had remarkably few clashes between atoms (Figure 4, Tm1631. A similar binding site found in polymerase X, for left). example, has magnesium ions (Figure 3c), which supports this To validate the binding site comparison approach for function hypothesis. prediction of the Tm1631 protein, we performed an experiment, A relatively larger DNA binding groove in Tm1631 compared in which we re-predicted functions of 369 proteins with known to the DNA binding groove in endonuclease IV indicates that functions from the ligAsite [27] benchmark set. We simulated the Tm1631 recognizes a different DNA lesion than endonuclease IV conditions under which the function of the unknown protein (Figure 4). This would also justify the need for the existence of a Tm1631 was determined, i.e., proteins of known function with new DNA-repair enzyme such as Tm1631 aside from the known similar sequences were unavailable (for details see Text S1). Our endonuclease IV. In Tm1631-DNA model, G8:C8 unpair during approach correctly predicted 59% of known protein functions in MD due to bulky Tyr47 and Tyr48 that require larger extra- this benchmark set. In contrast, using the BLAST [28] sequence helical region than Arg37 and Tyr72 in endonuclease IV alignment tool instead of the ProBiS [11] algorithm resulted in (Figure 5a, c). This unpairing of a base pair G8:C8, which is not 43% of protein functions correctly predicted (Table S2 and S3 in seen in the endonuclease IV-DNA complex simulation (Figure 5d), Text S1). suggests that Tm1631 binds preferably DNA lesions, in which two The agreement of binding free energies of the Tm1631-DNA consecutive nucleotides are unpaired, whereas endonuclease IV model and that of the known endonuclease IV-DNA complex binds DNA lesions, in which one nucleotide is unpaired, i.e., the suggests that the hypothetical Tm1631-DNA complex is energet- AP site. Two consecutive unpaired nucleotides appear for example ically favorable. This is additionally supported by the similar in pyrimidine dimers DNA lesions, which are result of photo- number of hydrogen bonds formed by the Tm1631 and dimerization of pyrimidines. Usually, these lesions are repaired by endonuclease IV with the DNA during MD. In Tm1631-DNA UV endonucleases (see, e.g., 4gle), enzymes related to endonucle- complex there were 12, and in endonuclease IV-DNA complex ase IV [32]. The two-tyrosine motif and larger groove may thus there were 14 hydrogen bonds (Figure 5c and 5d). This good preferentially recognize larger DNA lesions, such as the ones agreement between the numbers of hydrogen bonds, in addition to found in pyrimidine dimers. the agreement in binding free energies, allows us to posit that the Finally, we ask, is our developed methodology likely to be useful binding affinity of Tm1631 for DNA is similar to that of to those that experimentally determine functions of unknown endonuclease IV. proteins? We do not have the definitive answer yet. Our model To validate our Tm1631-DNA model, we also used other seems to explain well the existing literature data, as well as it computational methods to predict nucleic acid binding site on agrees with and extends the results of other independent Tm1631 structure [29,30], and to search for two-tyrosine motifs in computational methods. The model shows, at the atomic other endonucleases using sequence alignment [28]. We also resolution, how the Tm1631 could interact with the DNA. Based searched the literature [31] for any information on Duf72 on our computational results and good agreement with all function. The obtained evidence is consistent with our Tm1631- available information on this protein structure, we hope that DNA model (Figure S6 and S7 in Text S1). experimentalists will find this problem challenging and will However, Tm1631 cannot be an endonuclease IV, since the eventually confirm our findings. known (PDB: 2x7v) endonuclease IV of Thermotoga maritima shares ,30% sequence identity with other endonucleases IV, whereas Methods the Tm1631 protein has only ,7% sequence identity with known endonucleases IV. Metal ions have a catalytic role in endonuclease The protein structure (PDB: 1vpq) encoded by the TM1631 IV, binding with the phosphate 59 of the AP site and helping gene was designated here as the query protein. Binding sites were cleave the phosphodiester bond. The Tm1631 protein however predicted using the ProBiS web server [17] at http://probis.cmm. lacks metal ions in its putative active site, as evidenced in crystal ki.si. Comparisons of binding site structures were done using the

PLOS Computational Biology | www.ploscompbiol.org 6 November 2013 | Volume 9 | Issue 11 | e1003341 Function Prediction of Uncharacterized Protein parallel ProBiS program [33] (version 2.4.2) freely available at substructures that we found in nr-PDB proteins could occur http://probis.cmm.ki.si/?what = parallel. MD simulations were anywhere on these proteins’ surfaces and accordingly we filtered carried out on the clusters of personal computers (CROW) at the the similar substructures found, so that only those that corre- National Institute of Chemistry in Ljubljana [34], using the sponded with known binding sites remained. The most reliable CHARMM biomolecular simulation program [35] and CHARM- indication that a region of protein surface is a binding site is if co- Ming web server [36]. Structural and dynamic aspects of the crystallized ligands bind to that region in the PDB file of the molecules were visualized via PyMOL software and surface corresponding protein structure. However, ligands may be absent electrostatics were calculated using APBS program [37]. in a particular protein structure, but can be present in some of the structures of homologous proteins. To define binding sites in the Prediction of binding sites on the Tm1631 protein set of newly found similar proteins, we thus superimposed each of Using the ‘‘Detect Structurally Similar Binding Sites’’ tool on these proteins with its .30% sequence identical homologous the ProBiS web server, and selecting the ‘‘List of PDB/Chain IDs’’ structures in the PDB, and transposed to the corresponding option from the ‘‘Proteins to Compare Against’’ drop-down list, protein ligands present in the homologous proteins. Modified the query protein structure 1vpq.A was compared to two crystal residues, carbohydrates that are covalently linked to the glycosyl- structures, 1vpy.A and 1ztv.A; the query protein has about 30% ation sites of a protein, and non-specific ligands listed at http:// sequence identity with either 1vpy.A or 1ztv.A. From the www.russelllab.org/wiki/index.php/Non-specific_ligand-protein_ structural alignments with these two similar proteins, ProBiS binding were not considered to be legitimate ligands. A binding calculated the degrees of structural conservation for each residue site is defined as residues that are ,3A˚ away from the ligand of the query protein and these were mapped to the surface residues atoms. We then filtered the set of similar proteins to obtain only of the query protein to indicate level of evolutionary conservation those in which the similar substructure detected by ProBiS of each residue. Residues with conservation score of 8–10 on a corresponded with the known binding site in a template protein. scale of 1–10, were considered as putative binding site residues The ‘‘similar proteins’’ that were obtained in this process had [11]. binding sites that were similar to the predicted binding site in Tm1631 thus were possible functional analogs of Tm1631. Binding site comparison Dynamics simulations of proteins allow study of the flexibility of Modeling of the Tm1631-DNA complex binding sites at a detailed level. From an MD trajectory, a We prepared the Tm1631-DNA model using a structural sequence of snapshots or frames of a protein at different times can superimposition by ProBiS of crystal structures 1vpq and 2nqj. be produced [38,39]. Similarly as improvements in molecular The model was built with (i) Tm1631 from the crystal structure docking [40], using more protein frames as input to a search 1vpq, and (ii) a DNA fragment, in which one nucleotide lacks a algorithm such as ProBiS, could increase the likelihood of finding a base, from the endonuclease IV structure 2nqj. The putative similar binding site among template protein structures, compared binding site in 1vpq and the known DNA binding site in 2nqj.A to results obtained with only one static protein frame. Accordingly, were superimposed and the DNA was then transposed from we performed a short, 1 ns, MD simulation of the Tm1631 protein 2nqj.A to 1vpq by copying coordinates of the DNA fragment from (1vpq) in water and quenched 30 frames from this MD trajectory 2nqj to the 1vpq crystal structure. at different time intervals: 20 frames were from the first 100 ps at regular intervals of 5 ps, and 10 frames were from 100 to 1000 ps Molecular dynamics simulations at intervals of 100 ps. Details of the MD simulation are provided We performed MD simulation of the Tm1631-DNA model, and below. Each frame was then separately used as input to the ProBiS two control simulations: first of the unbound Tm1631 protein program. The region in each frame designated for comparison was (PDB: 1vpq), and second of the endonuclease IV-DNA complex defined as the amino acids belonging to the predicted binding site, (PDB: 2nqj). In the simulation of endonuclease IV-DNA complex, that is Ser7, Leu43, Glu42, Asn44, Lys72, Gln114, Glu143, three Zn2+ ions were retained in the binding site since they are Phe144, Arg145, Leu176, Arg191, Trp199, Glu205, Arg207, and known to bind to DNA [24]. The control simulations were done Asn239 (Figure 2, left). The selected binding sites in all frames for comparison with our model and to determine the flexible were then compared individually with the entire non-redundant regions of the proteins and the DNA. To remove atomic clashes PDB (nr-PDB) of some 31,000 protein structures using the and to optimize the atomic coordinates of the complexes, the LOCAL and MOTIF options of the ProBiS program, which steepest descent and adopted basis Newton-Raphson energy restrict the search to only the predicted structurally conserved minimizations were used. The HBUILD tool in CHARMM was binding site in Tm1631. The nr-PDB is the default database of used to add missing hydrogens prior to the minimization. In each proteins used by ProBiS; its generation is described elsewhere [11]. case, the DNA ligand was held fixed and the protein was allowed The similar substructures that were found in the nr-PDB proteins, to move freely during the minimization process. The models were were ranked using the Z-Scores assigned by ProBiS, and only then embedded in a cube of water, which was modelled explicitly those with Z-Score.0.5 were considered further. If different by a rigid TIP3P model; KCl was added to neutralize the system frames shared more similar substructures with the same nr-PDB (for details see Text S1). A trajectory of Tm1631-DNA model, protein, then the substructure with the highest Z-Score was endonuclease IV-DNA complex, and unbound Tm1631 were retained. This procedure resulted in a set of proteins, identified by generated at 310 K and covered 90 ns, 60 ns, and 15 ns, their PDB IDs and Chain IDs, each having a substructure that was respectively, of MD at constant pressure and temperature similar to the predicted binding site in Tm1631. employing periodic boundary conditions. In each simulation the first 3 ns of the MD was used for heating (100 ps) and Filtering of similar binding sites equilibration (2,9 ns); the analysis was performed using the final A similarity between the predicted binding site and a known 20 ns of each simulation, except in the unbound Tm1631 case, similar binding site in a different protein is a link that allows where the first 1 ns of simulation was used for binding site determination of the function of the predicted binding site, an comparison. Hydrogen bonds were calculated using the HBOND uncharacterized region in Tm1631. However, the similar tool in CHARMM, and only those with occupancy .0.5 were

PLOS Computational Biology | www.ploscompbiol.org 7 November 2013 | Volume 9 | Issue 11 | e1003341 Function Prediction of Uncharacterized Protein considered. Restraints were used two times during the Tm1631- whereas the non-polar energy was estimated by solvent accessible DNA model simulation to correct the base-pairing in the DNA surface area (SASA) calculation implemented within the GB (Text S1); no restraints were used during last 47 ns to allow the module (Text S1). We assumed that the entropy changes upon DNA and the Tm1631 protein to position themselves freely binding are similar in both complexes, since in both the DNA is responding to physical forces between them. bound in a very similar conformation. Accordingly, to calculate the relative binding free energies, we neglected the entropy term Energetics analysis (2T?DS). With the exception of the entropy terms, all the energy To compare the relative binding affinities of the Tm1631-DNA terms were calculated for 20,000 snapshots sampled at intervals of and endonuclease IV-DNA complexes, we calculated the relative 1 ps along the last 20 ns of each complex’s MD trajectory (Figure binding free energies for these complexes using the Molecular S1 in Text S1). We chose the last 20 ns for energy calculation, Mechanical/Generalized Born Surface Area (MM/GBSA) ap- since in this time interval no new hydrogen bonds formed between proach [41,42]. In this approach, the binding free energy (DGbind), the Tm1631 and the DNA. is calculated as the sum of the changes of the gas phase molecular mechanics energy, DEMM, the solvation free energy, DGsol, and Supporting Information the conformational entropy of the system upon binding, 2T?DS: Text S1 Supporting information containing Figure S1–S8, Table S1, further details of MD simulations, electrostatic potential DG ~DH{T:DS&DE zDG {T:DS ð1Þ bind MM sol of Tm1631, similar evolutionary pattern in Tm1631 and endonuclease IV, alternative Tm1631-DNA model, validation of Tm1631-DNA model, proposed active site in Tm1631, binding ~ z z DEMM DEinternal DEelectrostatic DEVdw ð2Þ site comparison results, Table S2 and S3, validation of binding site comparison as function prediction approach. (DOC) DG ~DG zDG 3 sol GB SA ð Þ Video S1 A movie illustrating the prediction of Tm1631 protein function. In equation 2, DE is the sum of DE (bond, angle, and MM internal (MP4) dihedral energy), DEelectrostatic (electrostatic energy), and DEVdw (Van der Waals energy); in equation 3, DGsol is the sum of Author Contributions electrostatic solvation energy, DGGB (polar contribution) and non- electrostatic solvation component, DGSA (non-polar contribution). Conceived and designed the experiments: JK MH MO JTK DJ. The polar contribution to the desolvation free energy was Performed the experiments: JK MH MO JTK DJ. Analyzed the data: calculated using the analytical Generalized Born using Molecular JK MH MO JTK DJ. Wrote the paper: JK MH MO JTK DJ. Designed Volume (GBMV) model implemented in CHARMM [43,44], the software used in analysis: JK MH MO JTK DJ.

References 1. Hermann JC, Marti-Arbona R, Fedorov AA, Fedorov E, Almo SC, et al. (2007) 15. Murzin AG, Brenner SE, Hubbard T, Chothia C (1995) SCOP: a structural Structure-based activity prediction for an enzyme of unknown function. Nature classification of proteins database for the investigation of sequences and 448: 775–779. structures. J Mol Biol 247: 536–540. 2. Wilkins AD, Bachman BJ, Erdin S, Lichtarge O (2012) The use of evolutionary 16. Setlow RB, Carrier WL (1966) Pyrimidine dimers in ultraviolet-irradiated patterns in protein annotation. Curr Opin Struc Biol 22: 316–325. DNA’s. J Mol Biol 17: 237–254. 3. Ellrott K, Zmasek CM, Weekes D, Sri Krishna S, Bakolitsa C, et al. (2011) 17. Konc J, Janezic D (2012) ProBiS-2012: web server and web services for detection TOPSAN: a dynamic web database for structural genomics. Nucleic Acids Res of structurally similar binding sites in proteins. Nucleic Acids Res 40: W214– 39 (suppl 1): D494–D496. W221. 4. Stehr H, Duarte JM, Lappe M, Bhak J, Bolser DM (2010) PDBWiki: added 18. Farber GK, Petsko GA (1990) The evolution of alpha/beta barrel enzymes. value through community annotation of the Protein Data Bank. Database 2010: Trends Biochem Sci 15: 228–234. baq009. 19. Reardon D, Farber GK (1995) The structure and evolution of alpha/beta barrel 5. Jaroszewski L, Li Z, Krishna SS, Bakolitsa C, Wooley J, et al. (2009) Exploration proteins. FASEB J 9: 497–503. of uncharted regions of the protein universe. PLoS biology 7: e1000205. 20. Krissinel E, Henrick K (2007) Inference of macromolecular assemblies from 6. Nadzirin N, Firdaus-Raih M (2012) Proteins of Unknown Function in the crystalline state. J Mol Biol 372: 774–797. Protein Data Bank (PDB): An Inventory of True Uncharacterized Proteins and 21. Cotton FA, Hazen EE, Jr., Legg MJ (1979) Staphylococcal nuclease: proposed Computational Tools for Their Analysis. Int J Mol Sci 13: 12761–12772. mechanism of action based on structure of enzyme-thymidine 39,59-bispho- 7. Punta M, Coggill PC, Eberhardt RY, Mistry J, Tate J, et al. (2012) The Pfam sphate-calcium ion complex at 1.5-A resolution. Proc Natl Acad Sci USA 76: protein families database. Nucleic Acids Res 40: D290–D301. 2551–2555. 8. Rosamond J, Allsop A (2000) Harnessing the power of the genome in the search 22. Ye Y, Godzik A (2003) Flexible structure alignment by chaining aligned for new antibiotics. Science 287: 1973–1976. fragment pairs allowing twists. Bioinformatics 19 (suppl 2): ii246–ii255. 9. Schmitt S, Kuhn D, Klebe G (2002) A new method to detect related function 23. Hosfield DJ, Guan Y, Haas BJ, Cunningham RP, Tainer JA (1999) Structure of among proteins independent of sequence and fold homology. J Mol Biol 323: the DNA repair enzyme endonuclease IV and its DNA complex: double- 387–406. nucleotide flipping at abasic sites and three-metal-ion catalysis. Cell 98: 397– 10. Stark A, Russell RB (2003) Annotation in three dimensions. PINTS: Patterns in 408. Non-homologous Tertiary Structures. Nucleic Acids Res 31: 3341–3344. 24. Garcin ED, Hosfield DJ, Desai SA, Haas BJ, Bjoras M, et al. (2008) DNA 11. Konc J, Janezic D (2010) ProBiS algorithm for detection of structurally similar apurinic-apyrimidinic site binding and excision by endonuclease IV. Nat Struct protein binding sites by local structural alignment. Bioinformatics 26: 1160– Mol Biol 15: 515–522. 1168. 25. Bren U, Martinek V, Florian J (2006) Free energy simulations of uncatalyzed 12. Xie L, Bourne PE (2008) Detecting evolutionary relationships across existing fold DNA replication fidelity: structure and stability of T.G and dTTP.G terminal space, using sequence order-independent profile-profile alignments. Proc Natl DNA mismatches flanked by a single dangling nucleotide. J Phys Chem B 110: Acad Sci U S A 105: 5441–5446. 10557–10566. 13. Konc J, Janezic D (2007) An improved branch and bound algorithm for the 26. Bren U, Lah J, Bren M, Martinek V, Florian J (2010) DNA duplex stability: the maximum clique problem. MATCH Commun Math Comput Chem 58: 569– role of preorganized electrostatics. J Phys Chem B 114: 2876–2885. 590. 27. Dessailly BH, Lensink MF, Orengo CA, Wodak SJ (2008) LigASite–a database 14. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, et al. (2000) The of biologically relevant binding sites in proteins with known apo-structures. Protein Data Bank. Nucleic Acids Res 28: 235–242. Nucleic Acids Res 36: D667–D673.

PLOS Computational Biology | www.ploscompbiol.org 8 November 2013 | Volume 9 | Issue 11 | e1003341 Function Prediction of Uncharacterized Protein

28. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local 37. Baker NA, Sept D, Joseph S, Holst MJ, McCammon JA (2001) Electrostatics of alignment search tool. J Mol Biol 215: 403–410. nanosystems: application to microtubules and the ribosome. Proc Natl Acad Sci 29. Tjong H, Zhou HX (2007) DISPLAR: an accurate method for predicting DNA- USA 98: 10037–10041. binding sites on protein surfaces. Nucleic Acids Res 35: 1465–1477. 38. Janezic D, Venable RM, Brooks BR (1995) Harmonic analysis of large 30. Chen YC, Wright JD, Lim C (2012) DR_bind: a web server for predicting DNA- systems. III. Comparison with molecular dynamics. J Comput Chem 16: 1554– binding residues from the protein structure based on electrostatics, evolution and 1566. geometry. Nucleic Acids Res 40: W249–W256. 39. Brooks BR, Janezic D, Karplus M (1995) Harmonic analysis of large systems. I. 31. Wijffels G, Dalrymple B, Kongsuwan K, Dixon NE (2005) Conservation of Methodology. J Comput Chem 16: 1522–1542. eubacterial replicases. IUBMB Life 57: 413–419. 40. Kua J, Zhang Y, McCammon JA (2002) Studying enzyme binding specificity in 32. Meulenbroek EM, Peron Cane C, Jala I, Iwai S, Moolenaar GF, et al. (2013) acetylcholinesterase using a combined molecular dynamics and multiple docking UV damage endonuclease employs a novel dual-dinucleotide flipping mecha- approach. J Am Chem Soc 124: 8260–8267. nism to recognize different DNA lesions. Nucleic Acids Res 41: 1363–1371. 41. Srinivasan J, Cheatham TE, Cieplak P, Kollman PA, Case DA (1998) 33. Konc J, Depolli M, Trobec R, Rozman K, Janezic D (2012) Parallel-ProBiS: fast Continuum solvent studies of the stability of DNA, RNA, and phosphorami- parallel algorithm for local structural comparison of protein structures and date-DNA helices. J Am Chem Soc 120: 9401–9409. binding sites. J Comput Chem 33: 2199–2203. 42. Kollman PA, Massova I, Reyes C, Kuhn B, Huo S, et al. (2000) Calculating 34. Borstnik U, Hodoscek M, Janezic D (2004) Improving the performance of structures and free energies of complex molecules: combining molecular molecular dynamics simulations on parallel clusters. J Chem Inf Comput Sci 44: mechanics and continuum models. Acc Chem Res 33: 889–897. 359–364. 43. Lee MS, Salsbury Jr FR, Brooks Iii CL (2002) Novel generalized Born methods. 35. Brooks BR, Brooks CL, . (2009) CHARMM: the biomolecular simulation J Chem Phys 116: 10606–10614. program. J Comput Chem 30: 1545–1614. 44. Lee MS, Feig M, Salsbury FR, Jr., Brooks CL, (2003) New analytic 36. Miller BT, Singh RP, Klauda JB, Hodoscek M, Brooks BR, et al. (2008) approximation to the standard molecular volume definition and its CHARMMing: a new, flexible web portal for CHARMM. J Chem Inf Model application to generalized Born calculations. J Comput Chem 24: 1348– 48: 1920–1929. 1356.

PLOS Computational Biology | www.ploscompbiol.org 9 November 2013 | Volume 9 | Issue 11 | e1003341

Available online at www.sciencedirect.com

ScienceDirect

Binding site comparison for function prediction and

pharmaceutical discovery

1,3 2

Janez Konc and Dusˇ anka Janezˇicˇ

While structural genomics resulted in thousands of new protein can share same folds, as seen, for example, in proteins of

crystal structures, we still do not know the functions of most of TIM barrel fold; consequently, for many of the proteins

these proteins. One reason for this shortcoming is their unique with similar folds, their functions cannot be assigned by

sequences or folds, which leaves them assigned as proteins of comparing their folds alone. Proteins of uncharacterized

‘unknown function’. Recent advances in and applications of function thus form a relatively large part of the PDB, and

cutting edge binding site comparison algorithms for binding one of the main challenges remains their functional

site detection and function prediction have begun to shed light annotation.

on this problem. Here, we review these algorithms and their use

in function prediction and pharmaceutical discovery. Finding Experimental determination of protein function is the

common binding sites in weakly related proteins may lead to most reliable way to characterize proteins of unknown

the discovery of new protein functions and to novel ways of function, but it is difficult to prioritize experiments

drug discovery. among the many possible functions a protein could per-

form. To guide experimentalists, a number of computer

Addresses

1 approaches were developed for prediction of protein

National Institute of Chemistry, Ljubljana, Slovenia

2

University of Primorska, Faculty of Mathematics, Natural Sciences and function [5–8]. Web portals were created that allow

Information Technologies, Koper, Slovenia sharing information about protein structures [2]. In spite

3

Fulbright Scholar at the National Institutes of Health, Bethesda, USA on

of these efforts, the gap between proteins with exper-

leave from the National Institute of Chemistry, Ljubljana, Slovenia.

imentally determined function and those with unknown

Corresponding author: Janezˇicˇ , Dusˇ anka ([email protected], function is growing.

[email protected])

Here, we review recent approaches to binding site com-

parison that were developed during the last two years,

Current Opinion in Structural Biology 2014, 25:34–39

and their use in protein function prediction and pharma-

This review comes from a themed issue on Theory and Simulation ceutical discovery (Figure 1 and Table 1). Since binding

Edited by Rommie E Amaro and Manju Bansal sites are usually more evolutionary conserved than the

rest of the protein structure, binding site comparison

enables finding functional relationships between evolu-

tionarily distant proteins. Tools for binding site compari-

0959-440X/$ – see front matter, # 2013 Elsevier Ltd. All rights son range from those representing binding sites as

reserved. fingerprints, to approaches that use geometric indexing,

http://dx.doi.org/10.1016/j.sbi.2013.11.012 to graph theoretical approaches, where proteins are

represented as vertices and edges. These tools open a

whole new range of possibilities in drug discovery, and

Introduction may circumvent some of the current limitations of virtual

screening.

For more than ten years, structural genomics efforts have

determined thousands of new protein structures that were

deposited in the Protein Data Bank (PDB) [1]. It is Recent advances in binding site comparison

estimated that over 90% of these deposited structures Binding site comparison approaches were first developed

are not yet described in literature, and still await bio- over a decade ago, and for the first time provided

logical and biochemical characterization [2]. Although researchers tools to explore protein functions across

about two-thirds of the proteins that are categorized different protein folds [9]. Unlike global alignment, bind-

under ‘unknown function’ likely represent very divergent ing site comparison is suited for detection of locally

branches of already known and well-characterized protein conserved residue patterns, which often appear in binding

families [3], a recent analysis indicates that for about 42% sites and are involved in protein function. During the past

of these protein structures, their functions cannot be two years, several fast algorithms appeared allowing com-

determined by standard sequence or structural similarity parison of a selected binding site or entire protein surface

approaches [4]. Despite the enormous increase in the structure against a database of tens of thousands of

number and the diversity of new proteins being uncov- protein structures in minutes. Binding site comparison

ered, the protein fold space is gradually becoming satu- approaches are increasingly moving on-line as most of the

rated. However, proteins having very different sequences newly developed algorithms are offered as web servers.

Current Opinion in Structural Biology 2014, 25:34–39 www.sciencedirect.com

Binding site prediction algorithms Konc and Janezˇicˇ 35

Figure 1

magnitude faster comparisons compared to previous

approaches [9] that used the maximal clique algorithm

[12]. Parallel maximum clique algorithms that exploit



multiple computer cores [13 ], and maximum clique

Binding site algorithm specialized to protein alignment graphs [14]

may provide even further speedup to the existing binding

similarity site comparison approaches.

Various reduced representations of protein surfaces were

Protein A Protein B

developed [15,16], but it seems that the representation

that codes the physicochemical properties of surface

functional groups as hydrogen-bond donor, acceptor,

aliphatic, aromatic or hydrophobic, provides a reasonable

trade-off between the speed of calculation and the

accuracy of similarity detection [9,17]. A combination

of local structural similarity and general structural sim-

ilarity to predict interface residues was developed as a

Aligned binding sites logical extension of either of the two approaches [15,18–

20], and shows an improved performance compared to

global structural similarity approaches alone. With the

rapid increase in the number of different binding site

comparison techniques, it has become important to prop-

erly validate the new methods. Databases, such as LigA-

Function prediction Site [21], that contain known protein–ligand interactions

are especially valuable for this purpose.

and

As algorithms became faster, it has become possible to

Drug repositioning submit entire protein structures, enabling queries without

the previous knowledge of binding site location. This

Ligand homology modeling

allows the prediction of functions of completely unchar-

Conserved targets

acterized proteins, in which binding sites are not yet

Bioisosteres known, and is often achieved by partitioning the proteins

compared into smaller patches of surface amino acid

Improved drug specificity

residues. In ProBiS [17], for example, the user can choose

Current Opinion in Structural Biology to compare entire protein surface or a selected surface

patch only, against a database of more than 35,000 non-

Flowchart of binding site comparison and its use in function prediction redundant protein structures. When choosing the entire

and pharmaceutical discovery. protein surface comparison, the algorithm, without con-

straints to known binding sites, finds surface patches on

the query protein that are most similar to patches found in

These approaches have really come into their own by database proteins.

enabling function prediction of completely uncharacter-

ized proteins previously not thought possible. To provide instant access to pre-calculated binding site

similarities, the data are being deposited into on-line

The speed of the algorithms has increased from either databases. The SCOWLP database contains binding

parallelization or methodological improvement, enabling regions obtained by comparison within protein family

extremely fast database searches. Efficient use of multi- and also predicted binding regions, which were inferred

core computers through parallelization allows signifi- from structurally related proteins across all existing folds

cantly faster comparison of a protein query against the [22]. Another relational database, PoSSuM, includes all

PDB and reduces the calculation time for scanning the the discovered pairs of similar binding sites with anno-

entire PDB from hours to minutes. An example of such tations of various types, and enables exploration of

parallelized algorithm, Parallel-ProBiS [10], for detecting similar binding sites among structures with similar or

similar binding sites on clusters of computers scales different global folds [23]. The ProBiS-Database pro-

almost ideally with the number of computing cores up vides access to local structural similarities of non-redun-

to about 64 computing cores. In approaches that represent dant (>95% sequence identity) PDB protein structures,

proteins as protein graphs, introduction of maximum without reference to known binding sites [24]. These

clique algorithms [11] resulted in about an order of databases are useful for predicting the binding sites and

www.sciencedirect.com Current Opinion in Structural Biology 2014, 25:34–39

36 Theory and Simulation

Table 1

Methods for function prediction and pharmaceutical discovery based on local structural similarity.

Method URL Description

CatSId http://catsid.llnl.gov Search for matches between catalytic sites and proteins

COFACTOR http://zhanglab.ccmb.med.umich.edu/COFACTOR Combines global and local features for function annotation

DoGSiteScorer http://dogsite.zbh.uni-hamburg.de Pocket detection and analysis tool for protein druggability assessment

LHM

FINDSITE http://cssb.biology.gatech.edu/findsitelhm Homology modeling approach to flexible ligand docking

GIRAF http://pdbj.org/giraf/source/distr.tar.gz Fast search and flexible alignment of ligand binding interfaces

IMAAAGINE http://mfrlab.org/grafss/imaaagine Hypothetical 3D side chain patterns

KRIPO N/A Bioisostere replacement

Nucleos http://nucleos.bio.uniroma2.it/nucleos Predicts nucleotide-binding sites

Phosfinder http://phosfinder.bio.uniroma2.it Predicts phosphate binding sites

PocketFEATURE N/A Seeks similar microenvironments within two binding sites and allows

for flexibility

PoSSuM http://possum.cbrc.jp/PoSSuM Database of similar protein–ligand binding and putative pockets

ProBiS http://probis.cmm.ki.si Detects structurally similar binding sites without reference to known

binding sites

SA-Mot http://sa-mot.mti.univ-paris-diderot.fr Extracts and locates structural motifs from protein loops

SCOWLP http://www.scowlp.org Individual and comparative analysis of protein interfaces at atomic

level

SiteBinder http://ncbr.muni.cz/SiteBinder Superimposes multiple small protein fragments

SubCav N/A Fragment-based drug design

TrixP N/A Compares protein binding sites and predicts function

potential ligands, such as small compounds, proteins, geometry and detailed chemical interactions within

nucleic acid, and saccharides, for unbound protein struc- putative binding pockets, but may not recognize distant

tures and for characterizing protein structures with similarities where protein dynamics may have modified

unknown functions. the binding site structures. A solution to this problem has

been to partition binding site surface into smaller patches.

Methods specialized for prediction of binding site for a For example, the TrixP algorithm accounts for the inherent

particular type of ligand and methods for comparison of flexibility of proteins employing a partial shape matching

only a particular structural element of a protein structure routine [32], and the GIRAF combines the refined align-

have emerged. Phosfinder enables the identification of ments based on different local coordinate frames [33]. The

phosphate binding sites in protein structures. It uses a PocketFEATURE algorithm seeks similar micro-environ-

structural comparison algorithm to scan a query structure ments within two binding sites, and assesses overall

against a set of known three-dimensional phosphate binding site similarity by the presence of multiple shared

binding motifs [25]. Similarly, Nucleos identifies nucleo- micro-environments [34], which allows alignment of struc-

tide binding sites in protein structures by comparing the tures involving domain movements.

structure of a query protein against a set of known

template three-dimensional binding sites representing Since many protein structures are still unknown, low-

nucleotide modules, namely the nucleobase, carbo- resolution protein structural models are commonly used

hydrate, and phosphate [26]. The SA-Mot web server as input to function prediction. Starting from a structural

searches for recurrent and conserved structural motifs in model, given by either experimental determination or

loops, which enables easy identification of loop regions computational modeling, COFACTOR identifies tem-

that are important for the protein folding and function plate proteins of similar folds and functional sites by

[27]. Furthermore, IMAAAGINE web server allows threading the target structure through three representa-

querying the PDB for hypothetical three-dimensional tive template libraries that have known protein–ligand

side chain patterns that are not limited to known patterns binding interactions, Enzyme Commission number or



from existing protein structures [28]. Analogous to Gene Ontology terms [35 ]. In the 9th Critical Assess-

multiple sequence alignment, the SiteBinder allows com- ment of Techniques for Protein Structure Prediction

parison and superimposition of multiple protein structural (CASP9) competition, the method achieved significantly

motifs [29], and the CatSId web server matches the three- better binding-site predictions than all other participating

dimensional residue arrangements in a query protein to a methods [18].

library of manually annotated catalytic residues [30,31].

Use in function prediction

Flexibility of binding sites presents a challenge to bind- A challenge in structural genomics is prediction of the

ing site comparison since it introduces uncertainty in the function of uncharacterized proteins. When proteins

binding site geometry. Most current methods focus on the cannot be related to other proteins of known activity,

Current Opinion in Structural Biology 2014, 25:34–39 www.sciencedirect.com

Binding site prediction algorithms Konc and Janezˇicˇ 37

identification of function based on sequence or structural cells. A local structural superimposition across all sub-

homology is impossible and in such cases, structurally types and strains of hemagglutinin available in the PDB

conserved binding sites may be assessed in connection revealed a new conserved region on hemagglutinin, a

with the protein’s function. Although studies where potential conserved target for influenza drug and vaccine

binding site comparison is used to guide experimental development [45].

function determination are rare, there have been success-

ful cases [36,37]. For example, the shrimp alkaline phos- Binding site comparison can also be used to discover

phatase, a protein of unknown function, was predicted to novel bioisosteres, that is, structurally variable molecules

act as a protease, using binding site comparison, and the or molecular fragments that form comparable interactions

predicted function was subsequently confirmed exper- with proteins. Although two binding sites are dissimilar

imentally [38]. Another study predicted interactions in overall, they may still bind the same fragments if they

human ubiquitination pathway on a proteome scale based share suitable subpockets. Detected shared subpockets

on the similarity of interface structural motifs [39]. In a can thus be used in fragment-based drug design to

recent study of unknown proteins, we predicted that a suggest new fragments or to replace existing fragments

protein from Thermotoga maritima, the Tm1631 protein within an already known compound. The KRIPO data-

of unknown activity, has a DNA binding, endonuclease base quantifies the similarities of binding site subpockets

function. We compared the protein to a database of non- across protein families based on pharmacophore finger-

redundant protein structures using ProBiS, and identified prints, and enables finding molecular fragments, that is,



its putative binding site, which was found to be similar to bioisosteres, that bind to similar subpockets [46 ]. Sim-

endonuclease IV binding sites. On the basis of the super- ilarly, SubCav, a method based on pharmacophoric

imposition of Tm1631 and endonuclease IV binding sites, fingerprints combined with a subpocket alignment algor-

we constructed a model of the protein with the DNA, and ithm, allows the similarity in searching and alignment of

subsequently verified its plausibility using molecular subpockets from a PDB-wide database against a user-



dynamics simulation [40 ]. defined query [47].

Use in pharmaceutical discovery Drug repositioning applies established drugs to new dis-

Ligand virtual screening is a widely used tool to assist in ease indications, and a prerequisite for drug repositioning

pharmaceutical discovery. In practice, virtual screening is a drug’s ability to bind to several targets. A recent study

approaches have many limitations, and the development finds a correlation between promiscuity and structural

of new methodologies is required. Remotely related similarity as well as binding site similarity of protein

proteins often share a common binding site occupied targets [48]. The study suggests that global structural

by chemically similar ligands, and such ligands may and binding site similarity better explains the observed

contain conserved anchor functional groups as well as a drug promiscuity in the PDB than physicochemical drug

variable region that accounts for their binding specificity. properties, such as hydrophobicity, molecular weight, or

LHM

Exploiting these insights, FINDSITE employs ligand flexibility. Practical examples of drug repositioning

structural information extracted from weakly related are protein kinase inhibitors that have been repositioned

proteins to perform rapid ligand docking by homology for inhibition of bacterial enzyme D-alanine–D-alanine

modeling. Used in a search for HIV-1 protease inhibitors, ligase, based on the similarities between ATP binding

the method yields significantly improved enrichment sites in protein kinases and in D-alanine–D-alanine ligase

factors compared to standard virtual screening approaches [17,49]. In relation to drug repositioning, off-target effects



[41 ]. can also be predicted using binding site comparison.

Recently, we suggested that debromohymenialdisine,

Predicting druggability and prioritizing certain disease inhibiting the ATP binding site in this human dual

modifying targets for the drug development process are of specificity protein kinase, would probably have undesir-

high relevance in pharmaceutical research. DoGSiteS- able side effects in human patients, due to off-target

corer is an algorithm for pocket and druggability predic- binding to ATP-citrate synthase [17].

tion that considers both global properties of the pocket,

for example, size, shape, and hydrophobicity, as well as High degree of binding site similarity shared among

local similarities shared between pockets. Druggability protein kinases makes designing drug compounds, that

scores are predicted by means of a support vector machine is, kinase inhibitors, with high specificity among the

[42,43]. Druggability of regions on viral proteins, such as kinases difficult. Binding site comparison approaches that

hemagglutinin, strongly depends on their conservative- can find similarities and differences in binding sites

ness across viral strains. Designed antibodies that target within a family help understanding functional differences

conserved regions have a greater chance to avoid virus to that can be exploited in drug design for the development of



escape incapacitation by mutations [44]. ProBiS was used highly specific kinase inhibitors [34,50 ]. For example,

to identify conserved binding sites on hemagglutinin, a using binding site comparison, the cross-reactivities of

protein responsible for binding the influenza virus to the anticancer drugs Tarceva and Gleevec have been

www.sciencedirect.com Current Opinion in Structural Biology 2014, 25:34–39

38 Theory and Simulation

12. Bron C, Kerbosch J: Algorithm 457: finding all cliques of an

rationalized [51]. Another study, involving structural align-

undirected graph. Commun ACM 1973, 16:575-577.

ment of 13 active-conformation kinases, found the pre-

13. Depolli M, Konc J, Rozman K, Trobec R, Janezˇicˇ D: Exact parallel

sence of six water molecules that occur in conserved

 maximum clique algorithm for general and protein graphs.

locations across this group of kinase, and these water J Chem Inf Model 2013, 53:2217-2228.

An exact parallel maximum clique algorithm is developed and shown to

molecules confer much of the stability to their local

provide an order of magnitude faster execution on protein product graphs.

environment and to a key catalytic residue [52].

14. Malod-Dognin N, Andonov R, Yanev N: Maximum cliques in

protein structure comparison. In Experimental algorithms. Edited

Conclusions by Festa P. Springer: Berlin/Heidelberg; 2010:106-117. (Internet).

The recently developed field of binding site comparison 15. Jordan RA, Yasser E-M, Dobbs D, Honavar V: Predicting protein–

has emerged as a powerful new approach. Unlike sequence protein interface residues using local surface structural

similarity. BMC Bioinformatics 2012, 13:41.

or structure alignment, binding site comparison finds

16. Xie L, Bourne PE: A robust and efficient algorithm for the shape

functional relationships among distantly related protein

description of protein structures and its application in

structures and can predict functions of unique proteins. It predicting ligand binding sites. BMC Bioinformatics 2007, 8:S9.

is used to annotate unknown proteins in the PDB and

17. Konc J, Janezic D: ProBiS-2012: web server and web services

may play a major role in pharmaceutical discovery. for detection of structurally similar binding sites in proteins.

Nucleic Acids Res 2012, 40:W214-W221.

Acknowledgements 18. Roy A, Zhang Y: Recognizing protein–ligand binding sites by

global structural alignment and local geometry refinement.

The authors acknowledge financial support through Grant P1-0002 of the

Structure 2012, 20:987-997.

Slovenian Research Agency. JK also acknowledges financial support through

Fulbright Grant.

19. Lee HS, Im W: Identification of ligand templates using local

structure alignment for structure-based drug design. J Chem

Inf Model 2012, 52:2784-2795.

References and recommended reading

Papers of particular interest, published within the period of review,

20. Lee HS, Im W: Ligand binding site detection by local structure

have been highlighted as:

alignment and its performance complementarity. J Chem Inf

Model 2013, 53:2462-2470.

 of special interest

21. Dessailly BH, Lensink MF, Orengo CA, Wodak SJ: LigASite—a

 of outstanding interest

database of biologically relevant binding sites in proteins with

known apo-structures. Nucleic Acids Res 2007, 36:D667-D673.

1. Rose PW, Beran B, Bi C, Bluhm WF, Dimitropoulos D,

22. Teyra J, Samsonov S, Schreiber S, Pisabarro MT: SCOWLP

Goodsell DS, Prlic A, Quesada M, Quinn GB, Westbrook JD et al.:

update: 3D classification of protein–protein, –peptide, –

The RCSB Protein Data Bank: redesigned web site and web

saccharide and –nucleic acid interactions, and structure-

services. Nucleic Acids Res 2010, 39:D392-D401.

based binding inferences across folds. BMC Bioinformatics

2. Ellrott K, Zmasek CM, Weekes D, Sri Krishna S, Bakolitsa C, 2011, 12:398.

Godzik A, Wooley J: TOPSAN: a dynamic web database for

23. Ito J-I, Tabei Y, Shimizu K, Tsuda K, Tomii K: PoSSuM: a database

structural genomics. Nucleic Acids Res 2010, 39:D494-D496.

of similar protein–ligand binding and putative pockets. Nucleic

3. Jaroszewski L, Li Z, Krishna SS, Bakolitsa C, Wooley J, Acids Res 2011, 40:D541-D548.

Deacon AM, Wilson IA, Godzik A: Exploration of uncharted ˇ

24. Konc J, Cesnik T, Konc JT, Penca M, Janezˇicˇ D: ProBiS—

regions of the protein universe. PLoS Biol 2009, 7:e1000205.

database: precalculated binding site similarities and local

4. Nadzirin N, Firdaus-Raih M: Proteins of unknown function in the pairwise alignments of PDB structures. J Chem Inf Model 2012,

Protein Data Bank (PDB): an inventory of true uncharacterized 52:604-612.

proteins and computational tools for their analysis. Int J Mol

25. Parca L, Mangone I, Gherardini PF, Ausiello G, Helmer-Citterich M:

Sci 2012, 13:12761-12772.

Phosfinder: a web server for the identification of phosphate-

5. Wilkins AD, Bachman BJ, Erdin S, Lichtarge O: The use of binding sites on protein structures. Nucleic Acids Res 2011,

evolutionary patterns in protein annotation. Curr Opin Struct 39:W278-W282.

Biol 2012, 22:316-325.

26. Parca L, Ferre F, Ausiello G, Helmer-Citterich M: Nucleos: a web

6. Skolnick J, Zhou H, Gao M: Are predicted protein structures of server for the identification of nucleotide-binding sites in

any value for binding site prediction and virtual ligand protein structures. Nucleic Acids Res 2013, 41:W281-W285.

screening? Curr Opin Struct Biol 2013, 23:191-197.

27. Regad L, Saladin A, Maupetit J, Geneix C, Camproux A-C: SA-Mot:

7. Chen YC, Wright JD, Lim C: DR_bind: a web server for a web server for the identification of motifs of interest extracted

predicting DNA-binding residues from the protein structure from protein loops. Nucleic Acids Res 2011, 39:W203-W209.

based on electrostatics, evolution and geometry. Nucleic Acids

28. Nadzirin N, Willett P, Artymiuk PJ, Firdaus-Raih M: IMAAAGINE: a

Res 2012, 40:W249-W256.

webserver for searching hypothetical 3D amino acid side

8. Cukuroglu E, Gursoy A, Keskin O: HotRegion: a database of chain arrangements in the Protein Data Bank. Nucleic Acids

predicted hot spot clusters. Nucleic Acids Res 2011, Res 2013, 41:W432-W440.

40:D829-D833.

29. Sehnal D, Varˇekova´ RS, Huber HJ, Geidl S, Ionescu C-M,

9. Schmitt S, Kuhn D, Klebe G: A new method to detect related Wimmerova´ M, Kocˇ a J: SiteBinder: an improved approach for

function among proteins independent of sequence and fold comparing multiple protein structural motifs. J Chem Inf Model

homology. J Mol Biol 2002, 323:387-406. 2012, 52:343-359.

10. Konc J, Depolli M, Trobec R, Rozman K, Janezˇicˇ D: Parallel- 30. Kirshner DA, Nilmeier JP, Lightstone FC: Catalytic site

ProBiS: fast parallel algorithm for local structural comparison identification—a web server to identify catalytic site

of protein structures and binding sites. J Comput Chem 2012, structural matches throughout PDB. Nucleic Acids Res 2013,

33:2199-2203. 41:W256-W265.

11. Konc J, Janezic D: An improved branch and bound algorithm 31. Nilmeier JP, Kirshner DA, Wong SE, Lightstone FC: Rapid

for the maximum clique problem. MATCH Commun Math catalytic template searching as an enzyme function prediction

Comput Chem 2007, 58:569-590. procedure. PLoS ONE 2013, 8:e62535.

Current Opinion in Structural Biology 2014, 25:34–39 www.sciencedirect.com

Binding site prediction algorithms Konc and Janezˇicˇ 39

32. Von Behren MM, Volkamer A, Henzler AM, Schomburg KT, 42. Volkamer A, Kuhn D, Grombacher T, Rippmann F, Rarey M:

Urbaczek S, Rarey M: Fast protein binding site comparison via Combining global and local measures for structure-based

an index-based screening technology. J Chem Inf Model 2013, druggability predictions. J Chem Inf Model 2012, 52:360-372.

53:411-422.

43. Volkamer A, Kuhn D, Rippmann F, Rarey M: Predicting enzymatic

33. Kinjo AR, Nakamura H: GIRAF: a method for fast search and function from global binding site descriptors. Proteins 2013,

flexible alignment of ligand binding interfaces in proteins at 81:479-489.

atomic resolution. Biophysics 2012, 8:79-94.

44. Fleishman SJ, Whitehead TA, Ekiert DC, Dreyfus C, Corn JE,

34. Liu T, Altman RB: Using multiple microenvironments to find

Strauch E-M, Wilson IA, Baker D: Computational design of

similar ligand-binding sites: application to kinase inhibitor

proteins targeting the conserved stem region of influenza

binding. PLoS Comput Biol 2011, 7:e1002326.

hemagglutinin. Science 2011, 332:816-821.

35. Roy A, Yang J, Zhang Y: COFACTOR: an accurate comparative

45. Yusuf M, Konc J, Sy Bing C, Trykowska Konc J, Ahmad

 algorithm for structure-based protein function annotation.

Khairudin NB, Janezic D, Wahab HA: Structurally conserved

Nucleic Acids Res 2012, 40:W471-W477.

binding sites of hemagglutinin as targets for influenza drug

A web server for automated structure-based protein function annotation

and vaccine development. J Chem Inf Model 2013,

is developed, and was ranked as the best method for protein–ligand 53:2423-2436.

binding site predictions in CASP9 experiment.

46. Wood DJ, de Vlieg J, Wagener M, Ritschel T: Pharmacophore

36. Real-Guerra R, Staniscuaski F, Zambelli B, Musiani F, Ciurli S,

 fingerprint-based approach to binding site subpocket

Carlini CR: Biochemical and structural studies on native and

similarity and its application to bioisostere replacement.

recombinant Glycine max UreG: a detailed characterization of

J Chem Inf Model 2012, 52:2031-2043.

a plant urease accessory protein. Plant Mol Biol 2012,

A method for mining crystal structure databases for bioisosteres is 78:461-475.

presented.

37. Wong MT, Choi SB, Kuan CS, Chua SL, Chang CH, Normi YM, See

47. Kalliokoski T, Olsson TSG, Vulpetti A: Subpocket analysis

Too WC, Wahab HA, Few LL: Structural modeling and

method for fragment-based drug discovery. J Chem Inf Model

biochemical characterization of recombinant KPN_02809, a

2013, 53:131-141.

zinc-dependent metalloprotease from Klebsiella pneumoniae

MGH 78578. Int J Mol Sci 2012, 13:901-917.

48. Haupt VJ, Daminelli S, Schroeder M: Drug promiscuity in PDB:

protein binding site similarity is key. PLoS ONE 2013,

38. Chakraborty S, Minda R, Salaye L, Bhattacharjee SK, Rao BJ:

8:e65894.

Active site detection by spatial conformity and electrostatic

analysis—unravelling a proteolytic function in shrimp alkaline

49. Triola G, Wetzel S, Ellinger B, Koch MA, Hu¨ bel K, Rauh D,

phosphatase. PLoS ONE 2011, 6:e28470.

Waldmann H: ATP competitive inhibitors of D-alanine–D-alanine

39. Kar G, Keskin O, Nussinov R, Gursoy A: Human proteome-scale ligase based on protein kinase inhibitor scaffolds. Bioorg Med

structural modeling of E2–E3 interactions exploiting interface Chem 2009, 17:1079-1087.

motifs. J Proteome Res 2012, 11:1196-1207.

50. Bryant DH, Moll M, Finn PW, Kavraki LE: Combinatorial

40. Konc J, Hodoscek M, Ogrizek M, Trykowska Konc J, Janezic D:  clustering of residue position subsets predicts inhibitor

 Structure-based function prediction of uncharacterized affinity across the human kinome. PLoS Comput Biol 2013,

protein using binding sites comparison. PLoS Comput Biol 9:e1003087.

2013, 9:e1003341. A method that identifies structural features of the kinase binding site that

One of the ‘unknown function’ proteins was shown, using a novel strategy are correlated with an inhibitor binding and uses these features to predict

of binding site comparison by ProBiS, to have a DNA-repair function if this inhibitor will be capable of binding to uncharacterized kinases.

similar to known endonuclease IV enzymes. The predicted function

awaits experimental confirmation. 51. Kinnings SL, Jackson RM: Binding site similarity analysis for the

functional classification of the protein kinase family. J Chem

41. Brylinski M, Skolnick J: FINDSITELHM: a threading-based Inf Model 2009, 49:318-329.

 approach to ligand homology modeling. PLoS Comput Biol

2009, 5:e1000405. 52. Knight JDR, Hamelberg D, McCammon JA, Kothary R: The role of

A threading-based approach that employs structural information conserved water molecules in the catalytic domain of protein

extracted from weakly related proteins to perform ligand docking, similar kinases. Proteins 2009, 76:527-535.

to homology modeling of protein structures, is developed.

www.sciencedirect.com Current Opinion in Structural Biology 2014, 25:34–39