arXiv:1610.01052v1 [cs.CV] 4 Oct 2016 iepeito 3,[] rgsreig[]adpoenbas protein and protein [5] screening novel drug of [4], [3], prediction [2], prediction proteins site of function application classification and major include clustering the topological conserve regard of more Some this [1]. is in i sequence structure protein This the a fields. than other because structural and significant modern design especially drug of discovery, applications drug terti many biology, Protein in e protein. have significance search a mous neighbor of structural and neighbors comparison structural structure find to step ucin tutrlsimilarity structural function, protei two between method. score our measurement a on on based similarity available structures tested the fam- is statistic get is service of other can web method terms and A in our structures http://research.buet.ac.bd:8080/Comograd/score.html structures. methods of protein benchmark art effectiveness of standard the pairs The bett of of measurements. significantly recog state is match on pattern method current ily present distanc dependent from Our the inspired we carbon vision. is than is computer paper, alpha that and scheme protein this scheme nition scoring from In Our scoring extracted matrices. issue. effective features research and in novel crucial characterist novel accurately similar a a the similarity is to character structural time lead structural Determining often Similar thereof. proteins sites. two binding between and sification hs ehd ifri lgmn ue,cntansadth and score. the constraints calculating rules, rota for alignment and used some in rules functions with differ of or methods set constraints) These some distance alignin with (e.g., or on comparison constraints based similarity t under are of methods structures the these Most two to in comparison. used regards under functions define structures scoring with the methods function to of these dissimilarity between is scoring words, methods score) other useful proposed In numeric a these structures. (i.e., all protein measurement two of decades similarity goal few a past main find the The during [8]. comparison [7], structure protein on more is comparison structural ever. fo automated modern than demand for of the methods struc- advancement NMR), novel better the (X-ray, which with technologies at discovered crystallographic pace being the are and tures structures protein known silico oe n fetv crn ceefrstructure for scheme scoring effective and novel A Keywords Abstract ariepoensrcuecmaio saprerequisite a is comparison structure protein Pairwise o frsac ok nteltrtr aefocused have literature the in works research of lot A lsicto n ariesmlrt measurement similarity pairwise and classification rgdsg 6.Det h rsn oueo available of volume present the to Due [6]. design drug ∗ Poentrir tutr ensisfntos clas- functions, its defines structure tertiary —Protein eateto optrSineadEgneig Banglades Engineering, and Science Computer of Department — ariepoensrcuecmaio,scoring comparison, structure protein Pairwise ealKarim Rezaul .I I. ‡ eateto optrSineadEgneig ntdInt United Engineering, and Science Computer of Department NTRODUCTION † eateto optrSine nvriyo aioa Ca Manitoba, of University Science, Computer of Department ∗ d oi lAziz Al Momin Md. , hr you where binding ed istics tion. nor- real ary † ics he er in Saka Shatabda ,Swakkhar al n g d e s e s r - t , prahs h crn ouei vial sawbservice web scoring a alignment as available art the http://research.buet.ac.bd:8080/Comograd/score.ht is the with at module of score scoring state our The and of approaches. statistical methods performance lear machine popular the of and most compare number statistics also in a We used ing. widely to significance are respect statistical that the measures with analyze score depend We not size. CoMOGPhog does approach protein time the based computation on the alignment believe here We the above, high mentioned [18]. unlike its is in because, score work CoMOGPhog scalability, of prior advantage our potential major in the presented features the ial ebifl rwcnlso nScinV. Section III. in IV Section conclusion Section in draw in given briefly are given we findings are the Finally materials on discussion and the and method Results comprehend our . to II of Section works in details comparison related structure the protein in of development review extensive an r essniiet oa tutr ismlrt n oft known well and the dissimilarity are of category Some structure this in efficient. methods local more to computationally sensitive are tha score less ap a These (dis)similarity. compute are structural to sets of representative feature is sco smart do alignment some the use methods find they to these com- rather beforehand of patte structures from two Most align vision, learning. ideas not computer machine applying and theory, literature recognition graph the geometry, in putational presented been crn ucin n ehd.I hsppr eitouea introduce we paper, this analyze In methods. o are and providin significance functions reports statistical proteins scoring few the a two of only verification find (dis)similar experimental we literature structurally the in how However, answer that ieaueadtems oal nsaogteeare these the among in ones methods notable limitati such [11], inherent most of [10], these the number of and a spite exist literature In there siz [9]. drawbacks, the large and if is expensive protein computationally the is before t structures remarkab between alignment no alignment protein proper are optimal finding there Also often an possible. that alignment find despite score to the computing need Furth approaches dissimilarity. structure these that local is to sensitive comparison much similarity are structure global the to gards h eto h ae sognzda olw.W present We follows. as organized is paper the of rest The oeitrsigadapaigapoce aerecently have approaches appealing and interesting Some l hs tepsgv it ovrossoigmethods scoring various to birth gave attempts these All oee,amjrdabc fteeapoce ihre- with approaches these of drawback major a However, nvriyo niern n Technology and Engineering of University h oOPo score CoMOGPhog CE [12], ‡ n .ShlRahman Sohel M. and rainlUniversity ernational MAlign TM nada e crn ucinbsdon based function scoring new a , MatAlign 1] 1]and [14] [13], 1]and [16] ∗ PAlign SP MASASW ml proaches ermore, these f [15]. DALI these [17]. of e The ons wo re; nd en of n- rn es le g t . . II. RELATED WORKS D. FATCAT There exist a number of attempts in the literature to FATCAT [20] or Flexible structure AlignmenT by Chaining compare protein structures as three dimensional objects su- Aligned fragment pairs with Twists is a unique approach in a perimposing on one another. These methods opened up a path sense that it took the structural rearrangements that the proteins to address this problem by proposing a score function based go through into account. This is different as in most methods on different distance metrics as a measure of the similarity the structure is treated like a fixed body. In comparison with or dissimilarity. Global distance score (GDT) is one of the other concurrent methods like FlexProt [21] and other rigid notable classical methods to compare protein structures using body approaches it produced good results in most cases. superimposing and to calculate a distance score of the align- ment. Some of the major contributions on this field are briefly E. Geometric hashing method reviewed below. Another direction in this field was taken by Shatsky et al. in [22] where they used the geometric hashing method. They A. DALI took 3 atoms or 3 α-carbon atom triplets (both can be done) n DALI [10], or distance alignment matrix method finds an from the protein chains. From n atoms, among C3 triplets, optimal alignment between two structures and then calculates they take triplets with some restrictions. At this point they have an alignment score. It breaks the input structure into hexapep- a set of triplets. Then they prepare a hash table with sides of tide fragments and calculates a distance matrix by evaluating the triangle as keys to the hash table. Then they selectively the contact patterns between successive fragments. [11]. align atoms of the two chains using the hash table. Advanced methods of similar approach finds an optimal alignment between two structure and then calculates some F. ProteinDBS alignment score. A common and popular structural alignment ProteinDBS [23] is another method that uses some common method is the DALI [10], or distance alignment matrix method, features of CBIR (Content Based Image Retrieval) to compare which breaks the input structures into hexapeptide fragments α-carbon distance matrix images. This method is much faster and calculates a distance matrix by evaluating the contact than the previous ones as it compares only some specific image patterns between successive fragments. [11]. features. Notably however, the feature preprocessing done by ProteinDBS is computationally expensive. B. CE TM-align and SP-align CE or Combinatorial Extension method [12] is similar to DALI in that it too breaks each structure in the query set into TM-align [14], [24] is one of the most well known protein a series of fragments that it then attempts to reassemble into structure alignment methods. It first finds an optimal alignment a complete alignment. A series of pairwise combinations of and an alignment matrix and then computes the TM-Score as fragments called aligned fragment pairs, or AFPs, are used to a measure of structural similarity. Another notable method is define a similarity matrix through which an optimal path is SP-Align [15] which employs a similar approach but differs generated to identify the final alignment. Only AFPs that meet in the alignment algorithm and alignment score. given criteria for local similarity are included in the matrix as a means of reducing the necessary search space and thereby G. Other pattern recognition based approaches increasing efficiency [12]. However, in spite of having good accuracy it is impossible to implement these two methods as There exist some other approaches that are based on pat- a real time web service due to their huge computational cost. tern recognition techniques. These approaches usually involve feature extraction, feature translation and some distance score C. Sequential Structure Alignment Program measures. Perhaps, the most successful feature to this end is the α-carbon distance matrix used by Marsolo et al. [25]. The SSAP (Sequential Structure Alignment Program) Marsolo et al. introduced a wavelet based approach that method[19] uses double dynamic programming to produce a resized the α-carbon distance matrices of the protein structures structural alignment based on atom-to-atom vectors in structure before the actual comparison. In this method the α-carbon space. Instead of the alpha carbons typically used in structural matrix is considered as a gray scale image and 2D wavelet alignment, SSAP constructs its vectors from the beta carbons decomposition is applied to resize the images to make the for all residues except glycine, a method which thus takes feature scale invariant. This method reportedly outperformed into account the rotameric state of each residue as well as its existing approaches at that time like DALI and CE, in terms location along the backbone. SSAP works by first constructing of retrieval accuracy, memory utilization and execution time. a series of inter-residue distance vectors between each residue This work is one of the prime motivations behind our work. and its nearest non-contiguous neighbors on each protein. A In the field of pattern recognition, the task of comparing series of matrices are then constructed containing the vector two entity is aided by feature extraction, feature translation differences between neighbors for each pair of residues for and some distance score measure with some distance metric. which vectors were constructed. Dynamic programming ap- Among the most successful features, alpha carbon distance plied to each resulting matrix determines a series of optimal matrix is notable. The alpha carbon distance matrix maps local alignments which are then summed into a ‘summary’ three dimensional structure into two dimensional matrix. It matrix to which dynamic programming is applied again to is also rotation and translation invariant. There are some determine the overall structural alignment. methods used alpha carbon distance matrix as feature and their (a) Domain d1n4ja (b) Extracted feature (c) Domain d2efva1 (d) Extracted feature Fig. 1: Original and gray scale images of α-carbon distance matrix of 2 proteins d1n4ja and d2efva1. Representation of β-sheets are shown in (a), (b) and (c), (d) show the α-helices

own customized distance metric and distance score calculation α-carbon distance matrix. This distance matrix is converted to algorithm. And there are some methods which extracted some a grayscale image. The motivation behind this strategy is the salient features from alpha carbon distance matrix and worked observation that the α helix and anti-parallel beta sheets appear on them. as dark lines parallel to the diagonal dark line and parallel beta sheets appear as dark lines normal to the diagonal dark line H. MatAlign in that grayscale image. Furthermore, β-sheets of two strips MatAlign [16] is similar to the method of Marsolo et appear as one dark line normal to the diagonal; β sheets of al. [25]. It uses alignment of α-carbon distance matrix and three strips appear as two dark lines normal to the diagonal a score based on the alignment by dynamic programming. and one dark line parallel to the diagonal and so on. In general, MatAlign differs from the structure alignment approaches as for a standard β-sheet, the number of points of co-occurrence it does not align 3D structures directly; rather it aligns two of parallel and anti-parallel diagonal lines depends on the dimensional alpha carbon distance matrix images. Mirceva et number of strips in the β-sheets. But as different structures al. introduced MASASW [17] (Matrix Alignment by Sequence have different quantity of α-carbons, the matrix dimension Alignment within Sliding Window) that used Daubechies2 would be different. So we scale the distance matrices to the wavelet transform instead of Haar wavelet transform used by same dimension. With Bi-cubic Interpolation we resize each Marsolo et al. As the name indicates, MASASW used sliding image to the nearest dimension that is a power of 2. Then we window to reduce computation. Mirceva et al. have reported apply Wavelet Transform to resize all the images to dimension that their method outperforms DALI, CE, MatAlign and some 128 × 128. We then take gradient image of the resized images other well known methods and also have shown that using and extract CoMOGrad and PHOG features from the gradient Daubechies2 wavelet gives better accuracy than Haar wavelets. images as described in the following section. Fig. 1 shows the Very recently, in our previous work [18] we have presented tertiary structure of a protein and its gray scale equivalent. a super fast and accurate method to compare protein tertiary structure using an approach based on pattern recognition and computer vision.

In this paper, we attempt to use the same features for B. CoMOGrad and PHOG structural similarity and compare the performance of our method and statistical significance of distance score of our feature against the most widely used structure alignment based In our previous work [18], we have shown how to extract method TM-Align and SP-Align using some well known tools the novel features CoMOGrad and PHOG. For the sake of of statistics and machine learning. The method have been completeness here we briefly review the procedure. The gra- previously compared for accuracy and comparison time for dient angle and magnitude are computed from the gradient retrieval against MASASW in our previous of the images mentioned in the previous section. Because of work [18]. the continuous angle gradient values, we quantize those in 16 bins for CoMOGrad with bin size 22.5 degrees. These 16 III. MATERIALS AND METHODS bins provide us with the CoMOGrad feature vector having 256 features. To extract the Pyramid histogram of oriented Our method to protein structure similarity relies upon gradient (PHOG) [26], we create a quad tree with the original extraction of features from the three dimensional tertiary struc- image taken as the root. Each quad tree node has four children ture. We have recently introduced two novel features named with each child being one fourth of the image corresponding Co-occurrence Matrix of the Oriented Gradient of Distance to the parent. We take quad tree up to level 3 which gives Matrices (CoMOGrad) and Pyramid Histogram of Oriented us 1+4+4*4+4*4*4=85 nodes. With 9 bins (of 40 degrees Gradient (PHOG) in [18]. These features are extracted from each) from the gradient orientation histogram, we get a total of the grayscale images of α-carbon distance matrix. 85*9=768 features of PHOG. So these 768 PHOG features are added with the 256 features of CoMOGrad giving us the new A. Representation of Structures novel feature vector of length 1021. The Euclidean distance From a 3D protein structure we filter only the α-carbons. of two features is used as the measure of dissimilarity of Then we compute their distances from each other and get the structures being compared. C. Measurement of CoMOGPhog Score

Suppose, fq and fi denote the feature vectors of the query protein q and a protein i in the database and N is the length of the features. Then the distance score diq of protein q and i would be calculated according to Equationn 1.

N v 2 diq = u (|fq[j] − fi[j]|) (1) uX tj=1

Clearly, the above distance measure can be calculated in O(N) time where N = 1021, the size of our feature Fig. 2: Combined statistics of our method CoMOGPhog vector. As the feature length is independent of the number of score for SuperFamily classification for binary classifier alpha carbons, comparison time doesn’t vary with the protein discriminant threshold at score=0.011 length. So, when the features are preprocessed and stored in a database, this method guaranties high scalability. To search nearest structures from a given database, our method needs to compute diq for each protein i in the database and then sort method with TM-Score and SP-Score, our results for Super- the results to rank them. Family is shown in Figure 2. In this experiment, for binary classifier discriminant threshold at score=0.011, MCC value is D. Experimental Dataset nearly 0.9, P-value is near 0.84, Sensitivity is near 0.86 and Specificity is near 1 for CoMOGPhog Score. We further use We experimented on 9389 SCOP [27], [28] domains (pro- these measurement techniques to compare our method with tein structure) and compared our method with SP-Score and TM-Score and SP-Score on Family classification. TM-Score. In this dataset, there are respectively 624, 1138, 2368 distinct folds, super-families and families available. The 2) Family Match Posterior Probability or P-Value: Pos- list of these domains are listed in the supplementary files. terior probability or P-Value for a family is defined as the probability of being in the same family for a specific score (CoMOGPhog Score, SP-Score or TM-Score). Using Bayes IV. RESULTS AND DISCUSSION Theorem, we calculate P-Value using Equation 2. To experimentally verify the statistical significance and performance of structure comparison using the CoMOGPhog P (FamilyMatch, score = d) score, we have examined a series of well known and widely P (F amilyMatch|score = d)= (2) used statistical measures. These are Posterior probability or P (score = d) P-Value, Mathews Correlation Coefficient (MCC), Receiver Operating Characteristics (ROC), Sensitivity and Specificity. We have plotted line graph for family match posterior prob- Subsequently, we have compared our results with those of ability for CoMOGPhog Score, SP-Score and TM-Score for all TM-Score and SP-Score. We have taken 9389 protein chains 9389 9389 the ( C2) pairs in Fig. 3. From the posterior probability or which gives C2 pairs. We have computed pairwise score P-Value, the goodness of a distance or score measure can be using CoMOGPhog Score, TM-Score and SP-Score. For the justified. For a binary classifier discriminant function, in the latter two, we have used the code provided by the authors in ideal case, the discriminant function will divide the region into their websites. Implementation of our method is available at two parts with a vertical line at a specific value of discriminant http://github.com/rezaulnkarim/CoMOGPhogExtractor. function, say d. Then P-value will be near to 1 below d and We are taking the Structural Classification of Proteins near to 0 above d or vice versa. In other words, ideally, the extended (SCOPe) 2.03 classification [28] for the comparison plot should be like a step function, either step down or step of CoMOGPhog SCore, TM-Score and SP-Score. There are up, depending on the definition of the discriminant function. six classifications (Class, Fold, SuperFamily, Family, Protein For SP-Score and TM-Score, the higher the value, the more is and Species) which are made from experimentally determined the similarity. So, the plot of P-Value is expected to be like a protein structure. We are considering ‘Family’ as its based on step up function. On the contrary, for the CoMOGPhog score, sequence meaning proteins in the same family have similarity the smaller the distance, the more is the similarity in tertiary in their sequence. Though Superfamily (and Fold, Class) structure implying more chance of being in the same family. classifications are more based on the structure of the protein So, the plot of P-Value for CoMOGPhog score is expected to (as well as our feature), we are still considering Family for an be like a step down function. From Figure 3(a) it is clearly unbiased comparison over the current state of the art methods. evident that our score can fix a discriminant distance value d (nearly 0.011) below which P-Value is near to 1 and above which P-Value is near to 0. And the plot is nearly like a step A. Experimental Results down function. From Figure 3(c), it is clear that for TM-Score, such a point can’t precisely be fixed. And in spite of showing 1) Result for SuperFamily Classification: Though we did a step up-like trend, the plot of TM-Score is not really a step not consider SuperFamily classification for comparing our function as it exhibits too much fluctuations. Similar traits are P-value analysis for CoMOGPHOG-Score P-value analysis for SP-Score P-value analysis for TM-Score 1 1 1.4

0.9 0.9 1.2 0.8 0.8

0.7 0.7 1

0.6 0.6 0.8 0.5 0.5 0.6 0.4 0.4

0.3 0.3 0.4 0.2 0.2 Family match posterior pdf Family match posterior pdf Family match posterior pdf 0.2 0.1 0.1

0 0 0 0 0.05 0.1 0.15 0.2 0.25 0 0.5 1 1.5 2 0 0.2 0.4 0.6 0.8 1 CoMOGPHOG-Score values SP-Score values TM-Score values (a) (b) (c) Fig. 3: Plot of posterior probability of family match against (a) CoMOGPhog-score (b) SP-Score and (c) TM-Score.

MCC analysis for CoMOGPHOG-Score MCC analysis for SP-Score MCC analysis for TM-Score 1 1 0.7

0.9 0.9 0.6 0.8

0.8 0.7 0.5

0.6 0.7 0.4 0.5 0.6 0.3 0.4

0.5 0.3 0.2 MCC for family match MCC for family match MCC for family match 0.2 0.4 0.1 0.1

0 0 0 0.005 0.01 0.015 0.02 0.025 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 CoMOGPHOG-Score partition values SP-Score partition values TM-Score partition values (a) (b) (c) Fig. 4: Plot of MCC values against (a) CoMOGPhog-Score (b) SP-Score and (c) TM-Score values.

found in the case of SP-Score (see Figure 3(b)). This clearly CoMOGPhog score and distance 0.68 for SP-Score. indicates the superiority of CoMOGPhog score over SP-Score Also Fig. 4 (c) depicts that MCC plot for TM-Score is convex and TM-Score in terms of family match. and has peak value at nearly 0.65 near at TM-Score value 3) Mathews Correlation Coefficient for Binary Classifica- = 0.78. This observation suggests the superiority of binary tion: The Matthews Correlation Coefficient (MCC) is widely classifier for family with CoMOGPhog score over TM-Score. used in machine learning as a measure of the quality of And it is clear that the MCC plot of CoMOGPhog score is binary (two-class) classifications. It takes true-false positives more convex than that of both SP-Score and TM-Score which and negatives into account. MCC is generally regarded as a balanced measure which can be used even if the classes are of clearly suggests the superiority of the former. different sizes. The MCC is in essence a correlation coefficient between the observed and predicted binary classifications. 4) Receiver operating characteristic: In statistics, a re- It returns a value between -1 and +1. A coefficient of +1 ceiver operating characteristic (ROC) or an ROC curve is represents a perfect prediction, 0 indicates that the prediction a graphical plot that illustrates the performance of a binary is no better than random and -1 indicates total disagreement between prediction and observation. The computation of MCC classifier system as its discrimination threshold is varied. The is done using Equation 3. curve is created by plotting the true positive rate (also known as TP × TN − FP × FN the sensitivity in biomedical informatics and recall in machine MCC = learning) against the false positive rate (specificity, fall-out) (TP + FP ) × (TP + FN) × (TN + FP ) × (TN + FN) at various threshold settings. The ROC curve is thus the p (3) sensitivity as a function of fall-out. The more the curve is in the upper left region and the steeper the curve, the better the score To examine the performance of binary classifiers designed function. We have plotted ROC curve for the binary classifier with SP-Score, TM-Score and CoMOGPhog score as discrim- systems using SP-Score, TM-Score and CoMOGPhog score as inants, we have evaluated MCC values with binary classifier the discrimination threshold is varied. From Fig. 5, it is clearly discriminant/partition at various values of the scores. For a evident that our score is slightly better than both TM-Score and good classifier, the MCC plot will be a convex plot with peak at SP-Score in this regard. the point of best discriminant value. From Fig. 4, it is clear that MCC plot for CoMOGPhog score and SP-Score are convex 5) Discussion: From all the statistical measures considered and both have peak values at nearly 0.94 at distance 0.011 for above, it is evident that CoMOGPhog score performs well ROC analysis for CoMOGPHOG-Score ROC analysis for TM-Score 1 1

0.9 0.9

0.8 0.8

0.7 0.7

0.6 0.6

0.5 0.5

0.4 0.4

0.3 True Positive Rate True Positive0.3 Rate

0.2 0.2

0.1 0.1

0 0 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 False Positive Rate False Positive Rate (a) CoMOGPhog-Score (b) TM-Score

ROC analysis for SP-Score 1

0.99

0.98

0.97

0.96

0.95 True Positive Rate 0.94

0.93

0.92 0 0.2 0.4 0.6 0.8 1 False Positive Rate (c) SP-Score

Fig. 5: Receiver operating characteristic curve for (a) CoMOGPhog-Score, (b) TM-Score and (b) SP-Score

corresponding to TM-Score and SP-Score with regards to V. CONCLUSION computing protein tertiary structure (dis)similarity. And, more In this paper, we have presented CoMOGPhog score, which importantly, both TM-Score and SP-Score needs to compute is a novel and computationally efficient scoring scheme for an alignment matrix first to calculate the alignment score. structural classification and pairwise similarity measurement. The beauty of CoMOGPhog score is that there is no need to CoMOGPhog score proivides more significant score than TM- find an alignment first. More specifically, CoMOGPhog score Score and SP-Score in terms of family match of pairs of can be simply calculated by computing the rmsd of a feature protein structures and other statistical measurements. Our score set without even aligning two structures. This speeds up the is dependent on novel features extracted from protein distance computation dramatically. Also, since CoMOGPhog score is matrices and is inspired from pattern recognition and computer based on a fixed length feature vector, the comparison time vision. The effectiveness of our method is tested on standard doesn’t vary with the protein length. benchmark structures.

REFERENCES However there are some parameters used for feature ex- [1] and Arthur M Lesk. The relation between the divergence traction which can be further experimented. The parameter for of sequence and structure in proteins. The EMBO journal, 5(4):823, CoMOgrad is number of bins for gradient angle orientation 1986. quantization bin which is used 16 empirically. The parameters [2] Lesley H Greene, Tony E Lewis, Sarah Addou, Alison Cuff, Tim for PHOG are depth of quad tree which is used 3 and number Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nam- of bins for gradient angle orientation quantization bin which is budiry, Adam Reid, et al. The cath domain structure database: new used 9. Both of these values are defined empirically. We expect protocols and classification levels give a more comprehensive resource for exploring evolution. Nucleic acids research, 35(suppl 1):D291– by fine tuning these parameters with extensive experiments, D297, 2007. both the qualitative and quantitative statistical significance and [3] Srayanta Mukherjee and Yang Zhang. Mm-align: a quick algorithm precision of our method can be further improved. We expect for aligning multiple-chain protein complex structures using iterative to include those fine tuning in our future works. dynamic programming. Nucleic acids research, 37(11):e83–e83, 2009. [4] Mu Gao and Jeffrey Skolnick. Dbd-hunter: a knowledge-based method the 15th ACM international conference on Information and knowledge for the prediction of dna–protein interactions. Nucleic acids research, management, pages 24–33. ACM, 2006. 36(12):3978–3992, 2008. [26] A Bosch and A Zisserman. Pyramid histogram of oriented gradients [5] Brian K Shoichet and Brian K Kobilka. Structure-based drug screening (phog). Univ. Oxford Visual Geometry Group, 2013. for g-protein-coupled receptors. Trends in pharmacological sciences, [27] Alexey G Murzin, Steven E Brenner, , and Cyrus Chothia. 33(5):268–272, 2012. Scop: a structural classification of proteins database for the investigation [6] Douglas B Kitchen, H´el`ene Decornez, John R Furr, and J¨urgen Bajorath. of sequences and structures. Journal of molecular biology, 247(4):536– Docking and scoring in virtual screening for drug discovery: methods 540, 1995. and applications. reviews Drug discovery, 3(11):935–949, 2004. [28] Naomi K Fox, Steven E Brenner, and John-Marc Chandonia. Scope: [7] Hitomi Hasegawa and Liisa Holm. Advances and pitfalls of protein Structural classification of proteins?extended, integrating scop and astral structural alignment. Current opinion in structural biology, 19(3):341– data and classification of new structures. Nucleic acids research, 348, 2009. 42(D1):D304–D309, 2014. [8] Marco Biasini, Stefan Bienert, Andrew Waterhouse, Konstantin Arnold, Gabriel Studer, Tobias Schmidt, Florian Kiefer, Tiziano Gallo Cas- sarino, Martino Bertoni, Lorenza Bordoli, et al. Swiss-model: modelling protein tertiary and quaternary structure using evolutionary information. Nucleic acids research, page gku340, 2014. [9] Adam Godzik. The structural alignment between two proteins: Is there a unique answer? Protein science, 5(7):1325–1338, 1996. [10] Liisa Holm and . Dali/FSSP classification of three- dimensional protein folds. Nucleic acids research, 25(1):231–234, 1997. [11] Liisa Holm and Chris Sander. Protein structure comparison by align- ment of distance matrices. Journal of molecular biology, 233(1):123– 138, 1993. [12] Ilya N Shindyalov and Philip E Bourne. Protein structure alignment by incremental combinatorial extension (ce) of the optimal path. Protein engineering, 11(9):739–747, 1998. [13] Jinrui Xu and Yang Zhang. How significant is a protein structure similarity with tm-score= 0.5? , 26(7):889–895, 2010. [14] Yang Zhang and Jeffrey Skolnick. Tm-align: a protein structure alignment algorithm based on the tm-score. Nucleic acids research, 33(7):2302–2309, 2005. [15] Yuedong Yang, Jian Zhan, Huiying Zhao, and Yaoqi Zhou. A new size-independent score for pairwise protein structure alignment and its application to structure classification and nucleic-acid binding predic- tion. Proteins: Structure, Function, and Bioinformatics, 80(8):2080– 2088, 2012. [16] Zeyar Aung and Kian-Lee Tan. MatAlign: precise protein structure comparison by matrix alignment. Journal of bioinformatics and , 4(06):1197–1216, 2006. [17] G. Mirceva, I. Cingovska, Z. Dimov, and D. Davcev. Efficient approaches for retrieving protein tertiary structures. Computational Biology and Bioinformatics, IEEE/ACM Transactions on, 9(4):1166– 1179, July 2012. [18] Rezaul Karim, Mohd Momin Al Aziz, Swakkhar Shatabda, M. Sohel Rahman, Md Abul Kashem Mia, Farhana Zaman, and Salman Rakin. Comograd and phog: From computer vision to fast and accurate protein tertiary structure retrieval. Scientific Reports, 5:13275 EP –, Aug 2015. Article. [19] Christine A Orengo and William R Taylor. Ssap: sequential structure alignment program for protein structure comparison. Computer methods for macromolecular sequence analysis, 1996. [20] Yuzhen Ye and Adam Godzik. Flexible structure alignment by chain- ing aligned fragment pairs allowing twists. Bioinformatics, 19(suppl 2):ii246–ii255, 2003. [21] Maxim Shatsky, , and Haim J Wolfson. Flexible protein alignment and hinge detection. Proteins: Structure, Function, and Bioinformatics, 48(2):242–256, 2002. [22] Maxim Shatsky, Ruth Nussinov, and Haim J Wolfson. A method for simultaneous alignment of multiple protein structures. Proteins: Structure, Function, and Bioinformatics, 56(1):143–156, 2004. [23] Chi-Ren Shyu, Pin-Hao Chi, Grant Scott, and Dong Xu. Proteindbs: a real-time retrieval system for protein structure comparison. Nucleic Acids Research, 32(suppl 2):W572–W575, 2004. [24] Lei Zhang, James Bailey, Arun S Konagurthu, and Kotagiri Ramamo- hanarao. A fast indexing approach for protein structure comparison. BMC bioinformatics, 11(1):1, 2010. [25] Keith Marsolo, Srinivasan Parthasarathy, and Kotagiri Ramamohanarao. Structure-based querying of proteins using wavelets. In Proceedings of