G C A T T A C G G C A T genes

Article Selecting Near-Native Protein Structures from Predicted Decoy Sets Using Ordered Graphlet Degree Similarity

Xu Han, Li Li and Yonggang Lu *

School of Information Science and Engineering, Lanzhou University, Lanzhou 730000, China; [email protected] (X.H.); [email protected] (L.L.) * Correspondence: [email protected]; Tel.: +86-0931-8912778

 Received: 28 November 2018; Accepted: 4 February 2019; Published: 11 February 2019 

Abstract: Effective prediction of protein tertiary structure from sequence is an important and challenging problem in computational structural biology. Ab initio prediction is based on sequence alone, thus, it has a wide application area. With the ab initio method, a large number of candidate protein structures called decoy set can be predicted, however, it is a difficult problem to select a good near-native structure from the predicted decoy set. In this work we propose a new method for selecting the near-native structure from the decoy set based on both contact map overlap (CMO) and graphlets. By generalizing graphlets to ordered graphs, and using a dynamic programming to select the optimal alignment with an introduced gap penalty, a GR_score is defined for calculating the similarity between the three-dimensional (3D) decoy structures. The proposed method was applied to all 54 single-domain targets in CASP11 and all 43 targets in CASP10, and ensemble clustering was used to cluster the protein decoy structures based on the computed CR_scores. The most popular centroid structure was selected as the near-native structure. The experiments showed that compared to the SPICKER method, which is used in I-TASSER, the proposed method can usually select better near-native structures in terms of the similarity between the selected structure and the true native structure.

Keywords: GR_score; dynamic programming; gap penalty; near-native protein; protein structure prediction

1. Introduction The human genome project was first proposed by American scientists in 1985 and officially launched in 1990 [1]. Its purpose is to determine the nucleotide sequence consisting of three billion base pairs contained in a human chromosome, thereby mapping the human genome and identifying the genes and their sequences to decipher humans. With the completion of the program, the gene sequence can be obtained by measuring the obtained map, and the sequence of the corresponding protein can also be inferred using the genetic central dogma [2]. Since the function of genes can be studied via the study of the corresponding proteins produced through gene expression, the use of bioinformatics to discover the function of a protein product of a gene becomes more and more significant. In fact, determining protein functions from genomic sequences is a central goal of bioinformatics [3]. Since the function of proteins is determined by its tertiary structure, the prediction of tertiary structure based on protein sequences is a very important problem. It is known that the number of known protein structures increases exponentially. By the end of the decade, the PDB [4] database size will be more than 150,000 structures at the current rate. However, the newly published UniProtKB/TrEMBL [5] protein database in Jan, 2019 contains 139,694,261 sequence entries. Hence, only a very small part of them have experimentally solved structures.

Genes 2019, 10, 132; doi:10.3390/genes10020132 www.mdpi.com/journal/genes Genes 2019, 10, 132 2 of 13

Therefore, protein tertiary structure prediction becomes an important and challenging problem in computational structural biology. Although many protein tertiary structure prediction methods have been proposed, there is no consensus on which one is the best [6,7]. There are usually three kind of structure prediction methods: , threading or fold recognition, and ab initio modeling [8]. Both homology modeling and threading require known protein structures as templates, thus, they are difficult to be successfully applied in the absence of template structures. In contrast, ab initio modeling does not require a known structure: it directly predicts its spatial structure from the protein sequence. Different from these methods, which directly predict the tertiary structures, there are also methods to predict contact maps of the proteins from sequence information [9,10]. Contact maps can be predicted by finding correlated pairs of amino acids in multiple sequence alignments, or using neural network approaches. The predicted contact maps can then be used to help the tertiary structure prediction of the proteins. To help the development of high-quality protein tertiary structure prediction methods, a worldwide experiment called Critical Assessment of Protein Structure Prediction (CASP) has been held every two years since 1994 [11]. The goal of the CASP is to evaluate existing protein structure prediction methods or detect their flaws. CASP provides research groups with an opportunity to objectively test their structure prediction methods and delivers an independent assessment of the state of the art in protein structure modeling to the research community and software users. The decoy sets, generated by I-TASSER, of single-domain targets in the CASP11 [12] and CASP10 [13] were used in our experiments. These decoy sets can be downloaded from the Zhang Lab website [14]. One of the challenges in designing the ab initio structure prediction method is to select the best near-native model from a large number of predicted decoy structures. Using clustering methods based on structure similarity score have been shown to be superior to using energy function in selecting the near-native structures [15]. To use the clustering methods, a key problem is the computation of the protein structure similarity. Many tools for comparing protein structures and computing structure similarity have been developed. One type of the comparison methods is based on the model superposition, which can be further divided into two categories: the rigid-body approaches and flexible alignment approaches. The rigid-body approaches consider the proteins as rigid objects and aim to find alignments that have the maximum number of mapped residues and the minimum deviations between the mapped structures. The rigid-body approaches mainly differ in how they combine these two objectives [16]. The final score is often expressed in terms of root mean square deviation (RMSD). Combinatorial extension (CE) [17] is a typical example of rigid structure comparison method. It aligns protein structures by chaining the consecutive aligned fragment pairs (AFPs) without twists. These AFPs are combined to evaluate the protein similarity. Global distance test (GDT) [18], also written as GDT-TS (GDT total score), is one of the scores developed to overcome shortcomings of RMSD. The GDT-TS measures the structure similarity by quantifying the number of corresponding atoms in the model that can be superposed within a set of predefined tolerance thresholds to the reference structure. Unlike RMSD, GDT-TS is more robust against small fragments movements, benefited from using several superposition thresholds. The GDT-TS is now a major assessment criterion in CASP. The (TM-score) [19] is a variation of the Levitt-Gerstein (LG) score to assess the quality of protein structure templates and predicted full-length models. All the residues of the modeled proteins are evaluated by a protein size dependent scale, rather than using a specific distance cutoff and focusing only on the fractions of structures as in the GDT-TS. TM-score is more sensitive to the correctness of global topology than the local structural errors, while the RMSD measure is sensitive to local small disorientations which may result in a big overall RMSD change even though the core region of the model may be correct. Because proteins are flexible molecules and can undergo large conformational changes that are not captured by the rigid-body approaches, flexible alignment methods have also been developed. Flexible alignment methods overcome the limitations of the rigid body approaches by either allowing twists between rigidly aligned fragments or by only maximizing local similarities Genes 2019, 10, 132 3 of 13

(in terms of Euclidean distance) [20]. One of the typical flexible alignment methods is FATCAT (flexible structure alignment by chaining aligned fragment pairs with twists) [20]. FATCAT is an improvement based on CE. It first identifies the local AFPs and then produces an optimal combination of these AFPs using dynamic programming, where twists and gap penalty are used to allow flexible alignments. Another type of the protein structure comparison methods is not based on the model superposition. One of the methods is Contact Area Difference (CAD) [21], which evaluates the structure similarities based on contacts. It computes the structure similarity by measuring the differences between the physical contacts of a model and a reference structure, without supposition of the two models. The local Distance Difference Test (lDDT) [22] is another superposition free score that evaluates local distance differences of all atoms in a model, including validation of stereochemical plausibility. The reference can be a single structure, or an ensemble of equivalent structures. It is computed over all pairs of atoms in the reference structure at a distance closer than a predefined threshold, and not belonging to the same residue. There are also methods developed specially for evaluating predicted decoys using both energy functions and the structure information. The random forest-based model quality assessment (RFMQA) [23] predicts a relative score of a decoy model by using its secondary structure, solvent accessibility and knowledge-based potential energy terms. The support-vector-machine-based single-model quality assessment (SVMQA) [24] is trained to predict TM-score and GDT_TS score based on both statistical potential energy terms and structure consistency features. In this article, a new protein structure similarity score, called the GR_score, was defined based on maximum Contact Map Overlap (CMO) [25] which is a superposition free protein structure alignment method defined by Godzik and Skolnick, and the ordered graphlet degree [26] which is a new systematic measure of a network’s local structure similarity. The superposition free structure alignment methods based on contact maps may capture both the local structure similarities from contact maps and the global structure similarities using dynamic programming. Using the ordered graphlet degree can further improve the measuring of the local structure similarities by comparing the local topology structures. Thus, the proposed GR_score can help in measuring the decoy structure similarities, and in selecting the near-native models from a large number of predicted decoy models in ab initio structure prediction.

2. Materials and Methods

2.1. Maximum Contact Map Overlap (CMO) A contact map is an ordered graph, CM = (V, E), where nodes V and edges E are defined as follows. Each node in V represents an amino acid of a protein. It leads to a strict total ordering of the nodes: for two different nodes u and w, either u < w if u is before w in the protein sequence or u > w otherwise. The two nodes u and w are connected by an edge (u, w) ∈ E, if and only if the Euclidean distance between the Cα atoms of the corresponding amino acids is less than a given threshold e. This is presented in Figure1[27].

2.2. Graphlets and Graphlet Degrees Graphlets are small, connected, non-isomorphic and induced subgraphs of a larger graph G = (V, E) having n ≥ 2 nodes [27]. Some nodes are identical to each other topologically within each graphlet, which is considered to belong to the same automorphism orbit to represent that a graphlet can touch a node in V by different ways topologically. The concepts used to summarize the graphlets i degree are: the graphlet degree of node n, represented by dn, is the number of times a graphlet touches node n at orbit i. In the graph degree distribution protocol, the degree distribution is extended to 73 graph degree distributions by using all 2-5 nodes and their corresponding 73 automorphism orbits (the first of the 73 graph degree distributions is the degree distribution) [28]. The ith ordered graphlet i degree of node u, represented by du, , is the number of times an ordered graphlet touches the node u at Genes 2019, 10, 132 4 of 13

GenesGenes 2018 2018, 9,, x9 ,FOR x FOR PEER PEER REVIEW REVIEW 4 of4 of13 13 orbit i. To reduce the calculation times, the five 2-node and 3-node ordered graphlets have been chosen 1 2 14 145145 Therefore,toTherefore, define 14 a orbits 14-dimensionala 14-dimensional (see Figure 2vector)[ vector27]. (𝑑 Therefore, ,(𝑑𝑑,𝑑,…,𝑑,…,𝑑 a 14-dimensional) could) could describe describe vector each each node (d nodeu, d u𝑢, ...𝑢 of , ofad ucontacta )contact could map. describe map. For For 146146 aeach givena given node contact contactu of amap contactmap 𝐶𝑀 𝐶𝑀 map. = (𝑉, = For (𝑉,, 𝐸) there a,𝐸) there given would would contact be be a map limitationa limitationCM = of( V ofthe, theE )degree, theredegree of would ofa nodea node be by a by limitation the the number number of 147147 ofthe ofresidues degreeresidues ofthat that a nodecan can fit by fit in the ina spherea number sphere with ofwith residuesradius radius ɛ. that ɛIn. In fact, can fact, a fit lineara in linear a sphere worst worst withtime time complexity radius complexitye. In could fact, could a be linear be led led 148148 byworstby using using time a distancea complexity distance threshold threshold could beɛ ofɛ led of7.5 by7.5 Å. usingÅ. a distance threshold e of 7.5 Å.

7 7

6 6 8 8

5 5 9 9

4 4 10 10

3 3 11 11

2 2 12 12 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 1011 1112 1213 13 1 1 13 13

(a)((a)(b) b) 149149 150150 FigureFigure 1. 1.((aa ))( aSchematic Schematic) Schematic diagram diagram diagram of ofa proteina protein backbone. backbone. Amin Amino Amino acid acido acid 2 2 is is2 inis in incontact contact contact with with with 12 12 12and and and 4 is4 is in in 151151 contactcontactcontact with with with 10 10 10(the (the distance distance between between two two nodes nodes is isless less than than ɛe). ɛ( ().b )() bThe The) The corresponding corresponding contact contact contact map map map 152152 graph,graph, where where two two edges edges connect connect node node 2 with 2 with 12 12and and 4 with4 with 10 10 [[27].27 [27].].

2-node2-node graphlet graphlet 3-node 3-node graphlets graphlets

3 3 6 6 9 9 1212

1 1 4 4 7 7 1010 1313

2 2 5 5 8 8 1111 1414

𝑮𝟏𝑮𝟏 𝑮𝟐𝑮𝟐 𝑮𝟑𝑮𝟑 𝑮𝟒𝑮𝟒 𝑮𝟓𝑮𝟓 153153 Figure 2. The five 2-node and 3-node ordered graphlets and the corresponding 14 automorphism orbits. 154154 FigureFigure 2. 2.The The five five 2-node 2-node and and 3-node 3-node ordered ordered graphlets graphlets and and the the corresponding corresponding 14 14automorphism automorphism The ordering of the graphlet nodes in each graphlet Gi, i ∈ {1, . . . , 5 } is represented by their colors: 155155 orbits.orbits. The The ordering ordering of ofthe the grap graphlethlet nodes nodes in ineach each graphlet graphlet 𝐺 ,𝑖𝐺,𝑖{1,{1, … ,5 …} ,5 is} isrepresented represented by by their their white nodes

Genes 2019, 10, 132 5 of 13 approach is applied to identify the cluster with the most neighbors under the cutoff excluding the members of cluster found in the previous iterations. Finally, an averaging model, called final model, is built from all the decoy members of the cluster in the current iteration.

2.5. Ensemble Clustering Using the ensemble clustering method as introduced in [30] can avoid local optimality. The most popular centroid structure identified in the ensemble clustering is selected as the near-native structure in the proposed method. The method includes two steps: constructing a distance matrix for the decoy set using a similarity score, and finding the most possible largest cluster centroid using an ensemble k-medoids. A confidence score as described in [30] is used to select the cluster centroid with the maximum score as the near-native structure.

2.6. GR_score

2.6.1. Ordered Graphlet Degree Similarity.

Only Cα atoms were used in the structure comparison in the proposed method. For two proteins A and B, u and w are the different Cα atoms of the two proteins. Based on graphlet degrees, between two nodes u and w, the order graphlet degree similarity is defined as follows [27]:

 !2 1 14 min di , di + 1 S (u w) = u w , ∑ i i  (1) 14 i=1 max du, dw + 1 the range of the similarity score is from 0 to 1. The two nodes having similar local topologies will have a high similarity score.

2.6.2. Structure Alignment Algorithm.

The alignment between two structures having, respectively, n1 and n2 nodes was computed using the Needleman-Wunsch dynamic programming algorithm [31] as in the original CMO, where the score of mapping two nodes is their ordered graphlet degree similarity defined in (1). It corresponds to the following dynamic programming procedure:

T[u, 0] = 0, T[0, w] = 0,   T[u − 1][w − 1] + S[u, w], (2)   T[u, w] = max T[u − 1][w] − g,  T[u][w − 1] − g where the gap penalty g is defined as follows:

n n ∑ 1 ∑ 2 S(u, w) g = α × u=1 w=1 (3) n1n2 where α is a constant parameter that will be discussed in Section 3.2.

2.6.3. Definition of the GR_score. The dynamic programming algorithm introduced in the above section produces the T[u, w] matrix, where u ∈ [1, n1] and w ∈ [1, n2]. Thus, the final similarity score of the two proteins is defined as follows: T[n , n ] GR_score = 1 2 (4) min(n1, n2) Genes 2018, 9, x FOR PEER REVIEW 6 of 13

Genes 2019, 10, 132 6 of 13 202 The range of the similarity score is also from 0 to 1. The closer the value of the GR_score is to 1, the 203 higher the similarity of the two structures; the closer the value of the GR_score is to 0, the lower the 204 Thestructural range ofsimilarity the similarity of the two score proteins. is also from 0 to 1. The closer the value of the GR_score is to 1, the higher the similarity of the two structures; the closer the value of the GR_score is to 0, the lower 205 the2.7. structural Constructing similarity the Distance of the Matrix two proteins.

206 2.7. ConstructingTo get the thedistance Distance matrix Matrix for the clustering method, a similarity matrix for the decoys needed 207 to be constructed, and then we can get the distance matrix by defining 𝑑𝑖𝑠𝑡𝑎𝑛𝑐𝑒 = 1 − 𝑠𝑖𝑚𝑖𝑙𝑎𝑟𝑖𝑡𝑦. 208 TheTo distance get the matrix distance is a matrix symmetric for the matrix clustering whose method, diagonal a elements similarity are matrix all 0. forThe the element decoys in needed𝑙 row 209 toand be constructed,𝑗 column represents and then wethe candissi getmilarity the distance between matrix two decoys by defining 𝑙 and distance𝑗. = 1 − similarity. The distance matrix is a symmetric matrix whose diagonal elements are all 0. The element in lth row 210 and2.8.j thSelectcolumn the Near-native represents Stru the dissimilaritycture using Ensemble between Clustering two decoys l and j.

211 2.8. SelectK-medoids the Near-native was ran Structure m = 500 using times, Ensemble which Clustering was enough to ensure statistical stability, with 212 random initialization. The times a decoy became the centroid of the largest cluster was counted. It m 213 wasK-medoids found that was a reasonable ran = 500 value times, for whichparameter was enoughk used in to k-medoids ensure statistical was five. stability, Finally, with to randomconsider 214 initialization.both the size The and times the a decoyinternal became similarity the centroid of a cluster of the largestin selecting cluster the was near-native counted. It structure, was found a 215 thatconfidence a reasonable score value as defined for parameter in [30] kwas used used. in k-medoids The centroid was five.with Finally,the maximum to consider confidence both the score size 216 andwithin the internalthe cluster similarity centroids of whose a cluster count in selectingwas more the than near-native 70% of the structure, maximum a count confidence was selected score as as 217 definedthe near-native in [30] was structure, used. The where centroid the withcount the was maximum the times confidence a decoy became score within the centroid the cluster of the centroids largest 218 whosecluster. count was more than 70% of the maximum count was selected as the near-native structure, where the count was the times a decoy became the centroid of the largest cluster. 219 3. Results and Discussion 3. Results and Discussion

220 3.1.3.1. Dataset Dataset 221 UpUp to to 54 54 decoys decoys sets sets (from (from CASP11) CASP11) [ 12[12]] and and 43 43 decoys decoys sets sets (from (from CASP10) CASP10) [ 13[13],], which which are are 222 single-domainsingle-domain targets targets and and have have experimental experimental native native structures, structures, were were downloaded downloaded from from Zhang Zhang Lab Lab 223 websitewebsite [14 [14].]. These These decoy decoy sets sets contain contain structurally structurally non-redundant non-redundant set set of of protein protein structures structures from from the the 224 rawraw decoy decoy sets. sets. The The native native structure, structure, the the generated generated model model by by SPICKER SPICKER used used in in I-TASSER I-TASSER [ 32[32]] sever, sever, 225 andand the the best best TM-score TM-score for for the the targettarget inin thethe decoydecoy setset werewere alsoalso downloaded downloaded from from the the Zhang Zhang Lab Lab 226 websitewebsite [14 [14].] (Supplementary materials).

227 3.2.3.2. Parameter Parameter Selection Selection 228 InIn the the dynamic dynamic programming, programming, to to select select a a good good parameter parameterα𝛼,, four four values values of ofα𝛼,, 0.2,0.2, 0.5, 0.5, 1, 1, and and 2, 2, 229 werewere compared. compared. For For each each decoy decoy set, theset, similarity the similari matrixty matrix was obtained was obtained by using theby proposedusing the GR_scoreproposed 230 inGR_score Section 2.6.3 in Subsection using each 2.6.3α value. using Then,each the𝛼 value. most popularThen, the centroid most popular structure centroid was selected structure as thewas 231 near-nativeselected as structurethe near-native by the structure proposed by method. the propos Theed near-native method. The structures near-native selected structures by the selected proposed by 232 methodthe proposed and the method corresponding and the nativecorresponding structures native were structures compared were using compared the TM-score. using the TM-score. 233 InIn the the experiments, experiments, 54 targets54 targets from CASP11from CASP11 were used. were For used. each For target, each four target, different four TM-scores different 234 wereTM-scores produced were from produced four α values, from and four the 𝛼α valuevalues, that and produced the 𝛼 thevalue highest that TM-score produced was the recorded. highest 235 Finally,TM-score for eachwas recorded.α value, the Finally, number for ofeach the 𝛼 targets value, for the which number the highestof the targets TM-scores for which were the produced highest 236 usingTM-scores the α valuewere wasproduced counted. using The the numbers 𝛼 value of was the counted. targets with The the numb highesters of TM-scores the targets for with four theα 237 valueshighest are TM-scores shown in for Figure four3 . 𝛼 values are shown in Figure 3.

15

10 scores -

5

targets with the 0 highest TM The number of the 0.2 0.5 1 2 238 𝛼 239 FigureFigure 3. 3:Parameter Parameter selection. selection

Genes 2018, 9, x FOR PEER REVIEW 7 of 13

Genes 2019, 10, 132 7 of 13 240 It can be seen from Figure 3 that when 𝛼=0.5, the selected near-native structures were more 241 similar to the corresponding native structure, compared to the other 𝛼 values. Thus, the parameter 242 𝛼 was setIt to can 0.5 be in seenthe proposed from Figure method.3 that when α = 0.5, the selected near-native structures were more similar to the corresponding native structure, compared to the other α values. Thus, the parameter α 243 3.3. Experimentalwas set to 0.5 Results in the proposed method.

244 3.3.1.3.3. The Experimental Experimental Results Results for Datasets from CASP10.

245 3.3.1.For the The proposed Experimental method, Results the forGR_score Datasets was from used CASP10. to calculate the similarity matrix of the 43 246 decoy sets from CASP10. Then, the ensemble clustering was used to select the near native structures 247 for each Fortarget. the The proposed near-native method, structure the GR_score selected was by usedthe proposed to calculate method the similarity and the matrixnear-native of the 43 248 structuredecoy generated sets from by CASP10. the SPICKER Then, the method ensemble used clustering in I-TASSER was sever used towere select compared. the near The native TM-sore structures 249 and forthe each GR_score target. between The near-native the selected structure near-nat selectedive structures by the proposed and the method native andstructure the near-native were 250 computed.structure The generated results are by theshown SPICKER in the methodscatter plots used inin I-TASSERFigure 4, in sever which were each compared. target protein The TM-sore is 251 representedand the GR_scoreas one point. between The x-axis the selected represents near-native the GR_score structures or TM-score and the native produced structure by the were proposed computed. 252 method,The resultsand the are y-axis shown represents in the scatter the scores plots inproduced Figure4, by in whichthe SPICKER each target method protein for is the represented same target. as one 253 The point.blue diagonal The x-axis line represents in Figure the 4 represents GR_score ory= TM-scorex. The same produced score does by thenot proposed necessarily method, mean andthe the 254 samey -axismodel. represents the scores produced by the SPICKER method for the same target. The blue diagonal 255 line in Figure4 represents y=x. The same score does not necessarily mean the same model.

1 1

(a) (b)

0.8 0.8

0.6 0.6

0.4 0.4

0.2 0.2

0 0 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 256 GR_scores of the proposed method TM-scores of the proposed method 257 FigureFigure 4: The 4. plotThe of plot GR_scores of GR_scores and TM-scores and TM-scores produced produced by two by methods two methods for datasets for datasets fromfrom CASP10. CASP10.

258 The Thedetails details of the of comparison the comparison can also can be also found be in found Table in 1, Table in which1, in the which first thecolumn first is column the ID isof the 259 the IDtarget of theprotein, target the protein, second the column second columnand the and third the column third column are the are GR_scores the GR_scores of the of selected the selected 260 near-nativenear-native models models by the by theproposed proposed method method and and the the SPICKER SPICKER method, method, the the fourth fourth column and thethe fifth 261 fifthcolumn column are are the the TM-scores TM-scores of of the the selected selected near-native near-native model model by theby proposedthe proposed method method and the and SPICKER the 262 SPICKERmethod. method. All the All scores the werescores computed were computed between between the selected the near-nativeselected near-native model and model the corresponding and the 263 correspondingnative structure. native structure. 264 To betterTo better understand understand the results, the results, the number the number of the oftargets the targets for which for each which method each method produced produced the 265 betterthe results better was results counted. was counted. The results The are results shown are in shown Figure in 5, Figure where5, the where white the bar white represents bar represents the 266 numberthe numberof decoy of sets decoy for sets which for whichour method our method produces produces better betterresults results than thanSPICKER, SPICKER, the gray the graybar bar 267 representsrepresents the number the number of decoy of decoy sets setsfor which for which our ourmethod method produces produces worse worse results results than than SPICKER, SPICKER, 268 and andthe black the black bar represents bar represents the number the number of the of simi the similarlar results results produced produced by the by two the twomethods. methods. It can It can 269 be seenbe seen from from the left the part left partof Figure of Figure 5 that5 that the theproposed proposed method method selected selected more more near-native near-native structures structures 270 withwith higher higher GR_scores, GR_scores, comp comparedared to to thethe SPICKERSPICKER method. method. However, However, when when measuring measuring the similarity the 271 similarityusing theusing TM-score, the TM-score, the SPICKER the SPICKER method method produced produced more near-native more near-native structures structures with higher with scores, 272 higheras canscores, be seenas can from be theseen right from part the of right Figure part5, althoughof Figure the5, although difference the was difference smaller comparedwas smaller to the 273 comparedGR_score to resultthe GR_score on the left result part ofon Figure the left5. Thispart may of Figure be due 5. to factThis that may the be similarity due to fact measure that usedthe in 274 similaritythe proposed measure method used in is the GR_score, proposed instead method of theis GR_score, TM-score. instead of the TM-score. 275 276

Genes 2019, 10, 132 8 of 13

Table 1. The Comparison of GR_scores and TM-scores for datasets from CASP10. The bold number indicates the highest GR_score or TM-score for each target.

GR_Scores of the GR_Scores of TM-Scores of the TM-Scores of Target ID Proposed Method SPICKER Proposed Method SPICKER T0644 0.764 0.781 0.869 0.865 T0645 0.666 0.664 0.932 0.929 T0649 0.423 0.454 0.382 0.369 T0650 0.702 0.703 0.876 0.877 T0654 0.626 0.634 0.819 0.820 T0655 0.672 0.677 0.743 0.749 T0657 0.693 0.681 0.827 0.831 T0659 0.753 0.754 0.909 0.906 T0662 0.727 0.737 0.798 0.796 T0664 0.684 0.671 0.936 0.934 T0665 0.756 0.732 0.738 0.739 T0667 0.643 0.646 0.807 0.803 T0669 0.675 0.641 0.614 0.606 T0672 0.590 0.601 0.785 0.784 T0673 0.535 0.509 0.317 0.350 T0675 0.552 0.606 0.356 0.346 T0676 0.553 0.505 0.503 0.510 T0678 0.599 0.594 0.297 0.362 T0679 0.648 0.625 0.807 0.798 T0680 0.709 0.637 0.699 0.513 T0681 0.700 0.710 0.875 0.872 T0683 0.660 0.639 0.888 0.889 T0688 0.629 0.627 0.862 0.869 T0689 0.734 0.742 0.919 0.927 T0691 0.468 0.464 0.480 0.500 T0692 0.704 0.710 0.921 0.942 T0703 0.673 0.673 0.894 0.895 T0704 0.675 0.677 0.831 0.838 T0708 0.736 0.726 0.887 0.891 T0714 0.781 0.781 0.911 0.911 T0716 0.753 0.752 0.674 0.685 T0721 0.716 0.710 0.870 0.872 T0722 0.780 0.729 0.541 0.513 T0723 0.697 0.702 0.866 0.859 T0733 0.647 0.645 0.864 0.863 T0749 0.755 0.737 0.961 0.963 T0752 0.721 0.729 0.873 0.874 T0753 0.696 0.698 0.797 0.790 T0757 0.760 0.768 0.888 0.893 R0001 0.390 0.333 0.212 0.202 R0008 0.574 0.566 0.522 0.519 R0014 0.536 0.486 0.469 0.393 R0018 0.514 0.508 0.345 0.366 Average 0.657 0.651 0.729 0.726 Genes 2019, 10, 132 9 of 13 GenesGenes 20182018,, 99,, xx FORFOR PEERPEER REVIEWREVIEW 99 ofof 1313

3030 betterbetter thanthan SPICKERSPICKER 2525 equalequal toto SPICKERSPICKER

2020

1515

1010

55 the number of targets the number of targets 00 GR_scoreGR_score TM-score TM-score 280280 Figure 5. The comparison of the two methods using both GR_score and TM-score for datasets 281281 FigureFigurefrom 5:5: CASP10.TheThe comparisoncomparison ofof thethe twotwo methodsmethods usingusing bothboth GRGR_score_score andand TM-scoreTM-score forfor datasetsdatasets fromfrom CASP10.CASP10.

282282 3.3.2.3.3.2. The The Experimental Experimental ResultsResults forfor DatasetsDatasets fromfrom CASP11.CASP11. 283283 ToTo furtherfurther evaluateevaluate thethe proposedproposed method,method, itit waswas alsoalso appliedapplied toto thethe 5454 decoydecoy setssets fromfrom CASP11.CASP11. 284284 TheThe near-nativenear-native structure structurestructure selected selectedselected by byby the thethe proposed proposedproposed method methodmethod and andand the thethe near-native near-nativenear-native structure structurestructure generated generatedgenerated by 285285 thebyby thethe SPICKER SPICKERSPICKER method methodmethod used usedused in I-TASSER inin I-TASSERI-TASSER sever seversever were werewere compared. compared.compared. The results TheThe resultsresults of the of GR_scoreof thethe GR_scoreGR_score are shown areare 286286 inshownshown the left inin scatterthethe leftleft plot scatterscatter in Figure plotplot 6inin, while FigureFigure the 6,6, results whilewhile ofthethe the resultsresults TM-score ofof thethe are TM-scoreTM-score shown in are theare leftshownshown scatter inin the plotthe leftleft in 287287 Figurescatterscatter6 plotplot. inin FigureFigure 6.6.

11 11

(a)(a) (b)(b)

0.80.8 0.80.8

0.60.6 0.60.6

0.40.4 0.40.4

0.20.2 0.20.2

00 00 00 0.2 0.2 0.4 0.4 0.6 0.6 0.8 0.8 1 1 00 0.2 0.2 0.4 0.4 0.6 0.6 0.8 0.8 1 1 288288 GR_scoresGR_scores ofof thethe proposedproposed methodmethod TM-scoresTM-scores ofof thethe proposedproposed methodmethod 289289 FigureFigure 6:6.6: TheThe plot plot of of GR_scores GR_scores andand TM-scoresTM-scores producedproduced byby twotwo methodsmethods forfor datasetsdatasets fromfrom CASP11.CASP11.

290290 DetailedDetailed resultsresults withwith scoresscores forfor allall thethethe targetstargetstargets areareare shown shownshown in inin Table TableTable2 .2.2. 291291 Table 2. The Comparison of GR_scores and TM-scores for datasets from CASP11. The bold number 292292 indicatesTableTable 2:2: TheThe the highestComparisonComparison GR_score ofof GR_scoresGR_scores or TM-score andand for TM-scoresTM-scores each target. foforr datasetsdatasets fromfrom CASP11.CASP11. TheThe boldbold numbernumber 293293 indicatesindicates thethe highesthighest GR_scoreGR_score oror TM-scoreTM-score forfor eacheach target.target. GR_Scores of the GR_Scores of TM-Scores of the TM-Scores of Target ID ProposedGR_scoresGR_scores Method ofof SPICKER ProposedTM-scoresTM-scores Method ofof SPICKER GR_scoresGR_scores ofof TM-scoresTM-scores ofof TargetTarget IDID thethe ProposedProposed thethe ProposedProposed T0759 0.547 SPICKERSPICKER0.530 0.362 SPICKERSPICKER0.356 T0762 0.721MethodMethod 0.728 MethodMethod0.921 0.925 T0763T0759T0759 0.4320.5470.547 0.4160.5300.530 0.3620.3620.272 0.3560.3560.198 T0764T0762T0762 0.6790.7210.721 0.6970.7280.728 0.9210.9210.883 0.9250.9250.885 T0765 0.530 0.597 0.740 0.761 T0766T0763T0763 0.7720.4320.432 0.8000.4160.416 0.2720.272 0.938 0.1980.1980.935 T0764T0764 0.6790.679 0.6970.697 0.8830.883 0.8850.885 T0765T0765 0.5300.530 0.5970.597 0.7400.740 0.7610.761 T0766T0766 0.7720.772 0.8000.800 0.9380.938 0.9350.935

Genes 2019, 10, 132 10 of 13

Table 2. Cont.

GR_Scores of the GR_Scores of TM-Scores of the TM-Scores of Target ID Proposed Method SPICKER Proposed Method SPICKER T0768 0.547 0.544 0.629 0.626 T0769 0.707 0.684 0.747 0.741 T0773 0.729 0.754 0.608 0.812 T0778 0.817 0.836 0.910 0.929 T0782 0.580 0.589 0.691 0.687 T0784 0.717 0.742 0.932 0.937 T0785 0.387 0.380 0.257 0.261 T0786 0.618 0.618 0.782 0.782 T0787 0.594 0.593 0.235 0.235 T0788 0.688 0.681 0.901 0.897 T0792 0.750 0.735 0.665 0.672 T0796 0.585 0.579 0.687 0.666 T0797 0.934 0.934 0.794 0.826 T0798 0.823 0.822 0.936 0.937 T0800 0.523 0.495 0.592 0.575 T0801 0.710 0.703 0.937 0.926 T0803 0.464 0.431 0.475 0.467 T0805 0.706 0.713 0.848 0.843 T0807 0.693 0.691 0.911 0.913 T0811 0.736 0.727 0.942 0.941 T0812 0.503 0.525 0.539 0.536 T0813 0.724 0.712 0.921 0.922 T0815 0.794 0.798 0.888 0.885 T0816 0.647 0.658 0.298 0.296 T0817 0.678 0.675 0.715 0.718 T0819 0.685 0.699 0.916 0.920 T0820 0.472 0.488 0.325 0.324 T0821 0.768 0.769 0.810 0.818 T0822 0.528 0.470 0.514 0.442 T0823 0.621 0.623 0.778 0.779 T0824 0.477 0.446 0.308 0.296 T0825 0.786 0.785 0.511 0.509 T0829 0.603 0.611 0.496 0.584 T0833 0.753 0.736 0.754 0.743 T0835 0.531 0.541 0.697 0.700 T0836 0.532 0.570 0.276 0.276 T0837 0.608 0.604 0.418 0.427 T0838 0.579 0.548 0.577 0.543 T0841 0.715 0.715 0.861 0.926 T0843 0.718 0.713 0.926 0.924 T0847 0.673 0.683 0.788 0.788 T0849 0.610 0.608 0.731 0.730 T0851 0.678 0.717 0.913 0.782 T0854 0.679 0.684 0.795 0.794 T0855 0.576 0.551 0.541 0.494 T0856 0.677 0.683 0.870 0.869 T0857 0.516 0.534 0.475 0.487 T0858 0.673 0.683 0.908 0.910 Average 0.644 0.645 0.688 0.688

To clearly represent the results, the number of the targets for which each method produces the better results was counted. The results are shown in Figure7. It can be seen from the Figure7 that the proposed method can select better near-native structures for more targets compared to the SPICKER method, evaluated either with GR_scores or with TM-scores. Genes 2018, 9, x FOR PEER REVIEW 11 of 13 Genes 2018, 9, x FOR PEER REVIEW 11 of 13 295 To clearly represent the results, the number of the targets for which each method produces the 295 To clearly represent the results, the number of the targets for which each method produces the 296 better results was counted. The results are shown in Figure 7. It can be seen from the Figure 7 that 296 better results was counted. The results are shown in Figure 7. It can be seen from the Figure 7 that 297 the proposed method can select better near-native structures for more targets compared to the 297 theGenes proposed2019, 10, 132 method can select better near-native structures for more targets compared to11 ofthe 13 298 SPICKER method, evaluated either with GR_scores or with TM-scores. 298 SPICKER method, evaluated either with GR_scores or with TM-scores.

30 better than SPICKER better than SPICKER 30 equal to SPICKER equal to SPICKER worse than SPICKER 25 worse than SPICKER 25

20 20

15 15

10 10 the number of targets

the number of targets 5 5

0 0 GR_score TM-score 299 GR_score TM-score 299300 FigureFigure 7: The 7. comparisonThe comparison of the oftwo the methods two methods using both using GR both_score GR_score and TM-score and TM-scorefor datasets for from datasets CASP11. 300 Figure 7: The comparison of the two methods using both GR_score and TM-score for datasets from CASP11. 301 fromTaking CASP11. target T0851 as an example, Figure 8 shows the superposition between the native 301 Taking target T0851 as an example, Figure 8 shows the superposition between the native 302 structure and the near-native structure found by the proposed method and the near-native structure 302 structureTaking and target the near-native T0851 as an structure example, found Figure 8by shows the proposed the superposition method and between the near-native the native structurestructure 303 selected by SPICKER. The red model is the native structure and the blue is the structure selected by 303 selectedand the near-nativeby SPICKER. structure The red found model by is the the proposed native st methodructure and the near-nativeblue is the structure structure selected selected by 304 the proposed method in Figure 8(a), the other blue structure is generated by SPICKER in Figure 8(b). 304 theSPICKER. proposed The method red model in Figure is the native8(a), the structure other blue and structure the blue is generated the structure by selectedSPICKER by in the Figure proposed 8(b). 305 It can be seen from Figure 8 that the SPICKER model has an obvious mismatch in the right half part 305 Itmethod can be inseen Figure from8a, Figure the other 8 that blue the structureSPICKER is model generated has an by obvious SPICKER mismatch in Figure in8 theb. It right can behalf seen part 306 of the protein. 306 offrom the Figure protein.8 that the SPICKER model has an obvious mismatch in the right half part of the protein.

307 307 308 Figure 8: (a) The superposition of T0851 native structure and the near-native structure selected by the 308 Figure 8: 8. (a)(a) The The superposition superposition of of T0851 T0851 native native structure structure and the the near-native near-native stru structurecture selected by the 309 proposed method. (b) The super-position of T0851 native structure and the model selected by 309 proposed method.method. ( b(b)) The The super-position super-position of T0851 of T0851 native native structure structure and the and model the selected model byselected SPICKER. by 310 SPICKER. 310 SPICKER. 4. Conclusions 311 4. Conclusions 311 4. ConclusionsIn this paper, we have proposed a new similarity score, GR_score, for comparing two protein 312 In this paper, we have proposed a new similarity score, GR_score, for comparing two protein 312 structuresIn this based paper, on we both have CMO propos anded order a new graphlet similarity degrees. score, The GR_score, introduced for GR_scorecomparing can two serve protein as a 313 structures based on both CMO and order graphlet degrees. The introduced GR_score can serve as a 313 structuresnew assessment based criterionon both CMO for protein and order structure graphlet comparison. degrees. The It is introduced shown that GR_score the proposed can serve GR_score as a 314 new assessment criterion for protein structure comparison. It is shown that the proposed GR_score 314 newalong assessment with the ensemble criterion clustering for protein can structure be used tocomp selectarison. the near-native It is shown structures that the proposed from the decoyGR_score sets. 315 along with the ensemble clustering can be used to select the near-native structures from the decoy 315 alongCompared with tothe the ensemble state-of-the-art clustering SPICKER can be method, used to the select proposed the near-native method can structures select more from high the qualitydecoy 316 sets. Compared to the state-of-the-art SPICKER method, the proposed method can select more high 316 sets.near-native Compared structures to the state-of-the-art if evaluated using SPICKER the GR_score method, for the datasets proposed from method both CASP10 can select and more CASP11. high 317 quality near-native structures if evaluated using the GR_score for datasets from both CASP10 and 317 qualityIn future near-native work, we willstructures continue if toevaluated improve using the computation the GR_score of thefor similaritydatasets from scores both between CASP10 protein and 318 CASP11. In future work, we will continue to improve the computation of the similarity scores 318 CASP11.structures, In and future to evaluate work, we the similaritywill continue scores to from improve more the aspects. computation of the similarity scores 319 between protein structures, and to evaluate the similarity scores from more aspects. 319 betweenSupplementary protein Materials: structures,Following and to areevaluate available the online similarity at http://www.mdpi.com/2073-4425/10/2/132/s1 scores from more aspects. , code and data used.

Genes 2019, 10, 132 12 of 13

Author Contributions: Conceptualization, Y.L.; methodology, X.H., L.L., and Y.L.; software, X.H., and L.L.; validation X.H.; formal analysis, X.H.; investigation, X.H..; resources, X.H..; data curation, X.H.; writing—original draft preparation, X.H.; writing—review and editing, Y.L.; visualization, X.H.; supervision, Y.L.; project administration, Y.L.; funding acquisition, Y.L. Funding: This research was funded by the National Key R&D Program of China (Grants number 2017YFE0111900, 2018YFB1003205). Acknowledgments: We thank all the reviewers for their valuable comments, which helped us a lot in improving the writing of the paper. Conflicts of Interest: The authors declare no conflict of interest.

References

1. Collins, F.S.; Michael, M.; Aristides, P. The human genome project: Lessons from large-scale biology. Science 2003, 300, 286. [CrossRef][PubMed] 2. Crick, F. Central dogma of molecular biology. Nature 1970, 227, 561–563. [CrossRef][PubMed] 3. Pellegrini, M.; Marcotte, E.M.; Thompson, M.J.; Eisenberg, D.; Yeates, T.O. Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc. Natl. Acad. Sci. USA 1999, 96, 4285–4288. [CrossRef][PubMed] 4. Berman, H.M.; Westbrook, J.; Feng, Z.; Gilliland, G.; Bhat, T.N.; Weissig, H.; Shindyalov, I.N.; Bourne, P.E. The protein data bank. Nucl. Acids Res. 2000, 28, 235–242. [CrossRef][PubMed] 5. UniProtKB/TrEMBL Protein database release statisics. Available online: http://www.ebi.ac.uk/uniprot/ TrEMBLstats (accessed on 16 January 2019). 6. Zhang, Z. An overview of protein structure prediction: From homology to ab initio. Bioc218 2002, 1–10. 7. Hasegawa, H.; Holm, L. Advances and pitfalls of protein . Curr. Opin. Struct. Biol. 2009, 19, 341–348. [CrossRef] 8. Yang, Z.; Jeffrey, S. Automated structure prediction of weakly homologous proteins on a genomic scale. Proc. Natl. Acad. Sci. USA 2004, 101, 7594–7599. 9. Wang, S.; Sun, S.; Li, Z.; Zhang, R.; Xu, J. Accurate de novo prediction of protein contact map by ultra-deep learning model. PLoS Comp. Biol. 2017, 13, e1005324. [CrossRef] 10. Hamilton, N.; Burrage, K.; Ragan, M.A.; Huber, T. Protein contact prediction using patterns of correlation. Proteins 2004, 56, 679–684. [CrossRef] 11. Moult, J.; Pedersen, J.T.; Judson, R.; Fidelis, K. A large-scale experiment to assess protein structure prediction methods. Proteins 1995, 23, ii–iv. [CrossRef] 12. The 11th critical assessment of techniques for protein structure prediction. Available online: http://predictioncenter.org/casp11 (accessed on 7 December 2014). 13. The 10th critical assessment of techniques for protein structure prediction. Available online: http://predictioncenter.org/casp10 (accessed on 7 December 2012). 14. The Yang Zhang Lab. Available online: https://zhanglab.ccmb.med.umich.edu/decoys/ (accessed on 30 June 2018). 15. Shortle, D.; Simons, K.T.; Baker, D. Clustering of low-energy conformations near the native structures of small proteins. Proc. Natl. Acad. Sci. USA 1998, 95, 11158–11162. [CrossRef][PubMed] 16. Godzik, A. The structural alignment between two proteins: Is there a unique answer? Protein Sci. 2010, 5, 1325–1338. [CrossRef][PubMed] 17. Shindyalov, I.N.; Bourne, P.E. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng. 1998, 11, 739–747. [CrossRef][PubMed] 18. Zemla, A.; Venclovas, C.; Moult, J.; Fidelis, K. Processing and analysis of CASP3 protein structure predictions. Proteins 2015, 37, 22–29. 19. Zhang, Y.; Skolnick, J. Scoring function for automated assessment of protein structure template quality. Proteins 2004, 57, 702–710. [PubMed] 20. Ye, Y.; Godzik, A. Flexible structure alignment by chaining aligned fragment pairs allowing twists. Bioinformatics 2003, 19 (Suppl. 2), ii246. [CrossRef] 21. Kliment, O.; Eleonora, K.; Ceslovas, V. CAD-score: A new contact area difference-based function for evaluation of protein structural models. Proteins 2013, 81, 149–162. Genes 2019, 10, 132 13 of 13

22. Valerio, M.; Marco, B.; Alessandro, B.; Torsten, S. IDDT: A local superposition-free score for comparing protein structures and models using distance difference tests. Bioinformatics 2013, 29, 2722–2728. 23. Manavalan, B.; Lee, J.; Lee, J. Random forest-based protein model quality assessment (RFMQA) using structural features and potential energy terms. PLoS ONE 2014, 9, e106542. [CrossRef] 24. Manavalan, B.; Lee, J. SVMQA: Support-vector-machine-based protein single-model quality assessment. Bioinformatics 2017, 33, 2496. [CrossRef] 25. Godzik, A.; Skolnick, J. Flexible algorithm for direct multiple alignment of protein structures and sequences. Bioinformatics 1994, 10, 587–596. [CrossRef] 26. Przulj, N.; Corneil, D.G.; Jurisica, I. Modeling interactome: Scale-free or geometric? Bioinformatics 2004, 20, 3508–3515. [CrossRef][PubMed] 27. Malod-Dognin, N.; Przulj, N. GR-Align: Fast and flexible alignment of protein 3D structures using graphlet degree similarity. Bioinformatics 2014, 30, 1259–1265. [CrossRef][PubMed] 28. Murzin, A.G.; Brenner, S.E.; Hubbard, T.; Chothia, C. SCOP: A structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 1995, 247, 536–540. [CrossRef] 29. Zhang, Y.; Skolnick, J. SPICKER: A clustering approach to identify near-native protein folds. J. Comput. Chem. 2004, 25, 865–871. [CrossRef][PubMed] 30. Li, L.; Lu, Y.; Yan, H. Selecting near-native protein structures from ab initio models using ensemble clustering. Quant. Biol. 2018, 6, 307–312. [CrossRef] 31. Needleman, S.B.; Wunsch, C.D. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 1970, 48, 443–453. [CrossRef] 32. Yang, J.Y.; Yan, R.X.; Roy, A.; Xu, D.; Poisson, J.; Zhang, Y. The I-TASSER Suite: Protein structure and function prediction. Nat. Methods 2015, 12, 7–8. [CrossRef]

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).