Røgen Algorithms Mol Biol (2021) 16:1 https://doi.org/10.1186/s13015-020-00180-3 Algorithms for Molecular Biology

RESEARCH Open Access Quantifying steric hindrance and topological obstruction to superposition Peter Røgen*

Abstract Background: In computational structural biology, structure comparison is fundamental for our understanding of proteins. Structure comparison is, e.g., algorithmically the starting point for computational studies of structural evolution and it guides our eforts to predict protein structures from their sequences. Most methods for of protein structures optimize the distances between aligned and superimposed residue pairs, i.e., the distances traveled by the aligned and superimposed residues during linear interpolation. Considering such a linear interpolation, these methods do not diferentiate if there is room for the interpolation, if it causes steric clashes, or more severely, if it changes the topology of the compared protein backbone curves. Results: To distinguish such cases, we analyze the linear interpolation between two aligned and superimposed backbones. We quantify the amount of steric clashes and fnd all self-intersections in a linear backbone interpolation. To determine if the self-intersections alter the protein’s backbone curve signifcantly or not, we present a path-fnding algorithm that checks if there exists a self-avoiding path in a neighborhood of the linear interpolation. A new path is constructed by altering the linear interpolation using a novel interpretation of Reidemeister moves from knot theory working on three-dimensional curves rather than on knot diagrams. Either the algorithm fnds a self-avoiding path or it returns a smallest set of essential self-intersections. Each of these indicates a signifcant diference between the folds of the aligned protein structures. As expected, we fnd at least one essential self-intersection separating most unknotted structures from a knotted structure, and we fnd even larger motions in proteins connected by obstruction free linear interpolations. We also fnd examples of homologous proteins that are diferently threaded, and we fnd many distinct folds connected by longer but simple deformations. TM-align is one of the most restrictive alignment programs. With standard parameters, it only aligns residues superimposed within 5 Ångström distance. We fnd 42165 topological obstructions between aligned parts in 142068 TM-alignments. Thus, this restrictive alignment procedure still allows topological dissimilarity of the aligned parts. Conclusions: Based on the data we conclude that our program ProteinAlignmentObstruction provides signifcant additional information to alignment scores based solely on distances between aligned and superimposed residue pairs. Keywords: Protein topology, Structural classifcation of proteins, Self-avoiding morphs Mathematic subject classifcation: Primary 92C40, Secondary 92E10, 57K10

Background Proteins are essential cellular tools and as macroscopic tools, some shapes are preferential for fulflling specifc *Correspondence: [email protected] Department of Applied Mathematics and Computer Science, Technical functions. Protein shapes are known to be more pre- University of Denmark, Asmussens Allé, Building 322, Kongens Lyngby, served than their sequences of amino acids: structural Denmark comparison of proteins is therefore an important and

© The Author(s) 2021. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativeco​ ​ mmons.org/licen​ ses/by/4.0/​ . The Creative Commons Public Domain Dedication waiver (http://creativeco​ mmons​ .org/publi​ cdoma​ in/​ zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. Røgen Algorithms Mol Biol (2021) 16:1 Page 2 of 19

active area of research with new structural alignment topographical diferences in chain progression in 3D methods reported to double every fve years for three space”. decades [1]. Aside from the rapid growth of the number Superposition free alignment methods based on of known protein structures, protein structure com- changes in internal distances such as DALI [12], FlexE parison is challenging for several reasons. One is that [13] and lDDT [14] also do not check for topological even for a fxed sequence of amino acids, some proteins changes. Mathematically they cannot tell a structure are highly fexible [2]. Another is that mutations cause from its mirror image. More interestingly, local chiral- plastic deformation that comparison methods should ity changes can occur under minor changes to internal also take into account [1]. On the other hand, proteins distances. And the same is true for the type of apparent reuse the same types of folds, but with the growing topological incorrectness reported in [9] among their top number of known protein structures, the global view 400 distance constrained models, “with the polypeptide is changing from discrete folds into considering larger chain passing through loops in a way that is, according to parts of fold space as a continuum [3]. visual intuition, atypical of fully correct structures”. In protein structure comparison either the shapes Tere are methods for protein structure compari- of proteins’ solvent accessible surfaces [4, 5] or their son that do take topological features of proteins into folded backbones are compared. Surface comparison account. Except for [15], none of these are alignment mainly address how a protein is seen from and possibly methods. Each of these methods frst produces a global interacts with its exterior. Chain or curve comparison description of each protein chain in the form of a set of mainly address how a protein is folded upon itself and descriptor values. Protein chains are subsequently com- is typically used when searching structural evolutionary pared by comparing their descriptor values. Te num- evidence [6] and when judging eforts to predict pro- ber of proteins and not the number of protein pairs tein structures [7]. One exception to this general trend dominate the calculation time. Hence, descriptor-based is that surfaces of protein models generally can be dis- methods are very fast on large data sets as they neither criminated from surfaces of native proteins [8]. It is need nor provide a structural alignment of sub struc- not hard to imagine two long protein chains sharing a tures. Te frst method to distinguish diferent thread- surface but being folded diferently inside the surface. ing of protein models was by calculating the writhe of Here, we therefore only consider structural comparison each model [16]. Te writhe is a signed measure of how 2 based on curve comparison. coiled a space curve is and jumps discontinuously by ± Most methods for pairwise curve comparison of pro- when a curve passes through itself once. In [16] Michael tein structures combine methods for searching alignment Levitt reported a writhe separation between diferently and optimal superposition of subsets of two proteins threaded models. However, two self-intersections may with a score function assessing the quality of this align- have canceling writhe jumps, and also continuous defor- ment. See [1] for an overview of the most used methods. mations change the writhe; thus, the writhe cannot sep- All these score functions are based on distances, and it is arate folds by itself. Te four distinct points involved in our hypothesis that in practice they are blind to topologi- two self-intersections on an open directed curve form cal changes even in the aligned parts as we now explain. one of four orderings. Similarly, three self-intersections Each score function is based on the distances that aligned defne 15 orderings. Generalizations of the writhe called residues are moved when the structures are in opti- Generalized Gauss integrals used in [17] are constructed mal superposition. But its not taken into account if the such that at least one of these descriptors jumps discon- motion may be performed by a continuous deformation tinuously [18] in each of these 1 + 4 + 15 cases. Te set or if it will lead to self-intersections of the protein that of descriptors introduced in [19] also has this property. may change its fold or even tie a knot in it. Mathemati- Te generalized Gauss integrals are powerful enough to cally that is, the score functions of most alignment and tell all fold classes in the protein classifcation CATH ver- superposition methods report similarity or distances in sion 2.4 [17] apart, and they can be normalized to have the space of immersions of proteins. Teoretically, we desired metric properties [18]. However, as in the case of will argue, a pair of topologically distinct embeddings writhe, these types of descriptors are mathematically not of a protein may have a similarity score typically found powerful enough to tell any two confgurations apart due between highly similar protein structures. Tis is appar- to the dimension reduction taking place. As they appar- ently also found in practice as [9] states “Tese classic ently manage to separate protein folds, it is here more structure comparison metrics” (Root Mean Square Devi- interesting that such global descriptors do not point out ation (RMSD) [10], Global Distance Test-Total Score where along the backbone two structures are topologi- (GDT-TS) [7], and TM-score [11]) “need to be supple- cally diferent—the descriptor values are just signifcantly mented by more sophisticated measures, which quantify diferent. A similar remark holds for the topological Røgen Algorithms Mol Biol (2021) 16:1 Page 3 of 19

descriptors based on fat graphs [20, 21], alpha shape [22] onto another and present novel algorithms Protein- and on the average occurrences of all patterns of 3 cross- AlignmentObstruction for fnding such obstruc- ings in planar projections of a protein backbone [19]. Not tions and for evaluating how severe they are. all descriptor-based methods are directly sensitive to All protein backbones form open curves and are pro- topology. Examples are the method with the most accu- duced by extrusion. We here ignore the covalent links rate re-classifcation of protein structures reported, Frag- found when including disulfde bridges [30]. Hence, Bag [23], and the distance excess-based descriptors [24]. combining a complete unfolding of the frst structure Te later method however presents an elegant solution to with a folding of the second from its complete unfolded the problem of detecting (plastic) deformation as caused stage gives a continuous self-avoiding path between by mutations. Te exact diferential form underlying the two arbitrary protein structures. Tus, mathemati- descriptors makes them path independent such that they cally all protein chains share topology class and any e.g. are invariant under any deformation of a loop in a sub-division of protein confguration space into folds protein that preserves the loops terminal points. depend on human judgement and is likely to depend Te efort put into detecting special topological fea- on the specifc structural biological context. Consider tures in a single protein structure illustrates the difculty a pair of shoes with almost identical shoelaces except in detecting topological changes between pairs of protein that the one contains a left-handed trefoil knot and structures. Knots in proteins [25, 26] have been studied the other a right-handed trefoil knot. In the space of for two decades [27, 28]. And lately also lassos [29] and immersions, they are very close, and any structural links [30] in proteins have been investigated. Te dis- alignment program will fnd that, but in the space of crete nature of Monte Carlo sampling allows the protein embeddings, they are not close meaning no short self- backbone to pass through itself between samples, but in avoiding morph between them exists. In structural protein confguration space such a jump corresponds to a biology terms, these two shoelaces share architecture, long and unlikely motion. Tus, in protein structure pre- as the overall building blocks are very similar. Tey may diction, Monte Carlo sampling produces more knotted or may not belong to diferent folds, also referred to as models than found in real proteins. Te reader is encour- topology classes, depending if their diferent threading aged to read the well told story [31] about the eforts is recognized and considered important. But the folding taken both by modelers and assessors in the series of paths have to be very diferent which may not be recog- Critical Assessment of techniques for protein Structures nized. To address the diference between immersed and Prediction (CASP) to avoid knotted and slip-knotted self-avoiding path-length ProteinAlignmentOb- structures that in some cases have been the frst model. struction starts from the alignment and superposi- Te algorithm Pokefnd [31] detects if a disk spanned by tion provided by an alignment program and considers a shorter closed loop in a protein structure is penetrated the linear interpolation between the two protein struc- by another part of the chain. In a protein model, this tures implied by the alignment score. First, it provides local non-protein like topological feature is likely to be a a novel measure of the amount of steric overlap caused modeling error caused by the backbone passing through by the morph. Next, it detects all self-intersections that itself. Such topological model errors are apparently not occur during the interpolation and determines which detected by current structural alignment methods as [31] are avoidable by moves similar to knot theory’s Reide- reports that neither the presence of knots nor of pokes meister moves within a user defned tubular neighbor- in protein models show any signifcant correlation to hood of the original linear morph. Next, it solves the the native-model distance measured by Global Distance problem of avoiding the maximal number of self-inter- Test (GDT). [31] concludes by suggesting adding Poke- sections at the shortest additional morph-length. Either fnd and Knotfnd to the current metric. Teir suggested it fnds a self-avoiding morph close to the original linear method will tell if the number of knots and pokes difers interpolation or it labels the remaining self-intersec- between two protein structures. Hence, one would know tions essential, as they cannot be resolved sufciently that there is at least one topological obstruction between close to the original interpolation. Finally if wanted, a knotted and an unknotted protein and at least one the length of a self-avoiding morph that resolves these obstruction between a poked and a poke free structure. essential self-intersections but gets far from the origi- But topological obstructions separating diferently folded nal alignment is found. In this work, the focus is to structures will not be detected as long as knots and pokes make a fast algorithm that can detect steric and topo- are unchanged; most likely, by being absent in both struc- logical obstruction to a structural alignment and super- tures. We conclude that no current algorithm can guar- position of two protein structures at a calculation time antee to fnd topological obstructions to a structural compatible to that of a structural alignment program. alignment and superposition of one protein structure Te length of the self-avoiding morph provided here is Røgen Algorithms Mol Biol (2021) 16:1 Page 4 of 19

at best a crude approximate of a shortest self-avoiding symmetrical crossing in the xy-plane where the over- morph-length; a calculation that lies outside the scope crossing part of a chain has heights (measured on the of this paper as this author expect it to be too slow for z-axis) 0, 0, 1.25, 2.5, 2.5, 1.25, 0, 0Å. Similarly, the practical application in pairwise protein structure com- under-crossing part follows the y-axis at heights 0, 0, parison. Also it is not the aim to fnd geometrically or – 1.25, – 2.5, – 2.5, – 1.25, 0, 0Å. Consider two larger energetically plausible paths as e.g. considered in [32]. structures that are identical except that the above over- Conceptually, the main aim is to introduce homotopy and under-crossings are interchanged. Te Euclidean of curve embeddings and Reidemeister moves to the motion of the optimal superposition of two structures feld of computational structural biology as spoken for is close to the identity, which we assume for simplic- in [33]. ity. Te total motion is thus given by 4 residues trave- Te description of algorithms presented in the fol- ling 5Å and 4 residues traveling 2.5Å . For an n-residue lowing section assumes no prior knowledge of knot structure we fnd a topology change (tc) for theory and homotopy theory. It is somewhat lengthy, 2 2 tc 4(5Å) 4(2.5Å) 12Å occasionally involved and may be skipped by readers RMSDn + < not interested in the method development. Such read- ≤ n √n ers should read the motivating example below and from tc 1 GDT TSn aligned residues within mÅ − ≥ 4 there jump to the results section. m 1,2,4,8 = 1 n 8 n 8 n 4 n 5 = ( − + − + − + ) =1 4 n n n n − n Methods n tc 1 1 TMn We start by estimating how similar two diferently ≥ n 1+( di )2 threaded tube structures fulflling typical distance ine- i=1 d0 n 8 4 1 4 1 qualities of alpha carbon atoms in proteins can be when = − + + n n 1+( 5 )2 n 1+( 2.5 )2 measured by Root Mean Square Deviation (RMSD) d0 d0 4 [10], Global Distance Test - Total Score (GDT-TS) [7], >1 , for n 100, and TM-score [11]. − n ≥ 1/3 where d0 1.24(n 15) 1.8 . For n 100 we = tc − − tc = fnd RMSD100 1.2Å , GDT TS100 0.95 and A motivating example tc = − = TM100 0.96 . Tese scores are typical for very similar Alpha carbon atom neighbors along the backbone are = structures despite the constructed topology change. Te usually 3.8Å (Ångström) apart. Otherwise, alpha car- persistence length of native proteins may require longer bon pairs are rarely less than 4.5Å apart. Imagine a deformations than used in this example; but in compu- tational modeling when, e.g., trying to minimize the dis- tance between a model and a native structure one may get close to this example. Figure 1 shows a similar cross- ing change between two native 112 residue structures 5.6Å aligned with global RMSD and TM= 0.60. Before describing a method to detect such topological changes during a linear interpolation between structures we start with an algorithm for detecting steric clashes under a lin- ear interpolation. Measuring protein‑protein overlap under a linear interpolation Let two protein chains be represented by the coor- P p1 ...pn dinates of their alpha carbon atoms 0 = 0 0 P p1 ...pn and 1 = 1 1 each with index starting at the N-terminus. Assume they are placed in optimal super- position by RMSD or any other alignment method. We study the linear interpolation where each point i i i Fig. 1 A self-intersection indicated by a black face in the morph p (t) (1 t)p tp = − 0 + 1 traverses the straight line between the alpha carbon curves of CATH2.4 domains 1csgA0 and 1jli00. The two domains share the homology-class 1.20.120.200 Røgen Algorithms Mol Biol (2021) 16:1 Page 5 of 19

segment connecting the i’th alpha carbon in each of of the piecewise linear curve connecting the alpha car- t 0, 1 the two structures and the parameter ∈ may bons. Let P denote the piecewise linear curve through all pi i 1, ..., n Pa pi a 1, ..., n be thought of as a time parameter. We want to check if points , = . Tat is = for = , typical protein distance constraints are violated dur- but Pa is defned for all a 1, n . Te linear interpolation ∈[ ] 0 1 ing the interpolation and quantify how much the pro- between two piecewise linear curves P and P is given by 0 1 tein overlaps itself. For this, we frst fnd the minimal Pa(t) (1 t)Pa tPa for t 0, 1 and a 1, n . i j = − + ∈[ ] ∈[ ] distance between p and p during the interpolation. We want to detect all self-intersections of the backbone pi,j pj pi pi,j pj pi pi,j a Leti,j 0 = 0 − i,j0 , i,1j = 1 − 1 and denote 0 = , curve during the linear interpolation. Furthermore, we want p b p p ab cos θ 1 = , and 0 · 1 = . For all t we have to identify all self-intersections we can avoid by changing the 2 j i 2 i,j i,j 2 initial morph locally by rearranging at most a user defned d (t) p (t) p (t) (1 t)p tp i,j = − 2= − 0 + 1 2 number of residues. Te remaining self-intersections are (1 t)2a2 t 2b2 2t(1 t)ab cos θ . expected to do essential changes to the fold, e.g., as avoid- = − + + − 2 ing them require larger rearrangements than allowed by the d2 t a ab cos θ Te global minimum of i,j is found for ∗ a2 b−2 2ab cos θ user. To do this, we borrow some notation from knot theory = + − and equals concerning knots on closed curves in 3-space. For a knot in i,j i,j 2 3-space a planar projection with over and under crossings sin θa2b2 p p m 1 × 0 2 indicated is called a knot diagram if all crossings are iso- =a2 b2 2ab cos θ = pi,j pi,j 2 lated. Hence, at most two points on the knot are allowed to + − 1 0 2 − be projected to the same point. When changing the plane pi,j pi,j pi,j 2 1 − 0 × 0 2 of projection the crossings will move, crossings may disap- i,j i,j 2 . =  p p pear, new crossings may appear, and in the planar projec- 1 − 0 2 tion existing crossings may pass through each other. Te Te last equation shows that the fraction is bounded by same happens to the planar projection when deforming the pi,j 2 pi,j 2 t 0 0 2 and by symmetry also by 1 2 . Setting ˜ = , knot through a self-avoiding morph. A fundamental result t < 0 t 0 t 1 1 < t if ∗ ; ∗ , if ≤ ∗ ≤ ; and 1, if ∗ , the mini - found in any text book on knot theory states that two closed d2 t 0, 1 d2,interpolation d2 (t) curves share knot type, i.e., the one may be morphed into mum of i,j for denoted i,j i,j ˜ . i,j i,j 2 ∈[ ] = the other through imbeddings by a ambient isotopy if and If p1 p0 2 is vanishing t∗ is ill-defned; but then 2 − only if their knot diagrams are connected by a fnite series di,j is constant and this value is returned. From a set of Reidemeister moves, shown in Fig. 2, plus deformations of native protein structures, we estimate the short- not changing crossings. Crossings of a knot may be associ- est possible distances between alpha carbons i and ated with a sign using the usual right-hand rule explained in dmin( i j ) j to be | − | = 2.8, 4.5, 3.86, 3.47, 3.52, 3.48, the same fgure. Note, that changing the direction of travers- i j 1, ...,7 dmin( i j ) 3.7Å 3.6Å for | − |= and | − | = for i j > 7 overlap ing a knot does not change the sign of the crossing. Te sign | − | . Te overlap, i,j , is thus the max of interpolation of a crossing comes from the orientation of 3-space and is dmin( i j ) di,j for line segments PiPi 1 and PjPj 1 given by the sign of the | − | − and 0(zero). Te mean over- + + MeanOverlap P1, P2 1 overlap determinant detij det Pi 1 Pi Pj 1 Pj Pi Pj . lap is given by = n i

Fig. 2 Reidemeister moves of types 0, 1, 2, and 3, denoted �i, i 0, ...,3 . 0 deforms one arc without changing crossings (not shown). Left, 1 = either adds one crossing to an isolated segment or deletes an isolated loop. In the middle, one of two parallel strands is slid either above or below the other by 2 . To the right: 3 moves an arc across a crossing between two other arcs. By the usual right hand rule, the left most crossing in the picture of 1 is negative and the other crossing in this picture is positive meaning that if you choose a direction of traversing of the curve the streamlines fulfll the right hand rule

(1 si)Pi(t∗) siPi 1(t∗) self-intersection is a minor structural change as we now − + + (1 sj)Pj(t∗) sjPj 1(t∗) explain. Mathematically this self-intersection can be = − + + removed from the morph as this structural change may si, sj 0, 1 has a solution with ∈[ ] . If a self-intersection be performed by two Reidemeister moves of type one, is found, we store the interpolation parameter t∗ , the denoted 1 . We have used that the projection of the loop a i si curve parameters of the self-intersection = + and in Fig. 2 is isolated from the rest of the projected curve. b j sj = + , together with the crossings sign change given In a projection of a globular protein structure, it is likely by the sign of the derivative of detij(t) at t∗ . By checking that the projection of other parts of the backbone pass all pairs of line segments we get a list of self-intersections through the projected loop. Hereby a longer series of with data (ak , bk , signk , tk∗) , where k is an arbitrary index. Reidemeister moves may be needed to connect the pro- PiPi 1 jections of the two loops. To avoid this algorithmic com- A self-intersection between line segments + PjPj 1 plexity we choose to make a 3-dimensional analogue of and + requires that at least some of the spheres Pi Pi 1 Pj Pj 1 Reidemeister moves. We want to know if we can remove centered at , + , and + are overlapping dur- ing the interpolation as calculated in the previous sec- a self-intersection by just changing the morph of the arc tion. Te self-intersection check may be skipped if forming the loop while leaving the rest of the curve and the alpha carbon neighbors are less than 4Å apart and its morph unchanged. We therefore say that a self-inter- ( ) overlapi,j overlapi,j 1 overlapi 1,j overlapi 1,j 1 < 2.6Å. section ak , bk , signk , tk∗ is locally 1-removable if the + + + + + + + Te pre-computed overlap gives an efcient flter before topological disk given by all the triangles connecting two the self-intersection check, but one cannot tell a nearby consecutive points of the loop passing from a self-intersection by the overlap values. Pa t∗ , P t∗ , ..., P t∗ , P t∗ Pa t∗ k k ciel(ak ) k floor(bk ) k bk k = k k      Locally avoidable self‑intersections to the center of mass of all these points is disjoint Some morph self-intersections make a signifcant change from all the other line segments of the tk∗-curve P (t ), ..., P (t ) P (t ), ..., P (t ) to a proteins fold, while others do not. In the mathemati- 1 k∗ floor(ak ) k∗ and ciel(bk ) k∗ n k∗ . In cal notion of topology however all protein chains are top- this case, one can freely deform the loop in a neighbor- ologically equivalent as they all may be refolded e.g. into hood of the topological disk without intersecting the rest one long beta strand. Algorithmically such a morph may of the curve. One possible morph that avoids the self- be constructed by a blow-up technique. A distinction intersection contracts the loop almost to its center of between signifcant and insignifcant fold changes can mass where there is room to throw one arc around the thus only be decided by human judgement and may dif- loop. However, for less complicated loops one would fer depending on the application at hand. Here, we search simply deform the loop by rotating it around the line for insignifcant self-intersections that we can resolve through the point of self-intersection and the center by local changes of the morph and quantify how much of mass of the loop. We therefore penalize a locally 1 P1 longer the resulting morph is. -removable self-intersection by a price equal to twice Imagine the left- and right-handed loops in Fig. 2 the sum of the distances from the points of the loop to aligned on top of each other. Te linear interpolation this line. Te self-intersection shown in Fig. 1 may be between the two structures will have a self-intersection. removed by an 1 move involving a large part of the Loops are the most fexible parts of proteins and such a protein. We therefore let the user decide the maximal Røgen Algorithms Mol Biol (2021) 16:1 Page 7 of 19

accepted backbone length, denoted MaxLength involved self-intersection form a closed curve. If a topological disc in the alternate morph. Only self-intersections with spanned by this closed curve is disjoint from the remain- b a MaxLength 1 k − k ≤ are tested for being removable. ing parts of the curve then the two self-intersections can In a planer projection into a knot diagram the curve seg- be removed essentially by performing two Reidemeister ment from ak to bk need neither be simple nor isolated moves of type two close to this disc. You may construct from the remainder of the projected curve. Te con- this such that one of the moves occurs before the time tk∗ t t structed 1-move is operating on a 3-dimensional curve. and one after. In the general case where j∗ ≤ k∗ it is suf- It is therefore not a real Reidemeister move that acts on fcient that a topological disk whose boundary includes t t t 2-dimensional knot diagrams. Te 1-move need not the two arcs is isolated for some ∈ j∗; k∗ , see Fig. 3. project to a Reidemeister 1 move for two main reasons. To close the curve and to have room for a morph avoid- Firstly, the 3-dimensional loop need not project to a sim- ing the two self-intersections we have to add the line ple curve. For larger MaxLength the loop may even be segments traversed by the self-intersection points in the t t t knotted. Secondly, in the knot diagram the projection of time interval j∗ ≤ ≤ k∗ . In practice we check if the topo- t (t t )/2 the loop is not likely to be isolated from the remaining logical disc connecting the loop for average∗ = j∗ + k∗ projection. Tus the 3-dimensional 1-move may involve to its center of mass is disjoint from the remainder of the t t many Reidemeister moves in a knot diagram and can be curves for = average∗ and if the additional line segments seen as a meta move in the context of knot diagrams. traversed by the self-intersection points are disjoint If a closed double stranded DNA undergoes the self- intersection considered here, it will change its writhe and hereby also the linking number between the two strands by plus or minus two [34]. To feel this coupling between writhe and twisting, hold a length of torsional stif wire in the shape of a left-handed loop such that it is in self-con- tact almost forming a self-intersection. By rotating both ends of the wire, you can force the wire to morph into a right-handed loop. Tat is, a morph avoiding the self- intersection requires two full rotations around the back- bone. Proteins are in principle free to do full rotations around the backbone. One may choose not to penalize the discontinuous writhing caused by 1 loops but we ofer the possibility. Most self-intersections are not 1-removable. A signif- Fig. 3 A space curve homotopy where the helical segment moves cant speed up of the calculation time is obtained by: First, upwards and the straight segment moves inclined forward. The frst intersection of the two segments is shown in blue and the second move the origin of the coordinate system to the center of in green. Following the motion of the point of intersection on the mass of the loop, which then is a vertex in all the triangles blue helical segment between the two times of intersection, you get of the topological disc. Ten to sort all line segments by the black cylinder to the right-hand side. Inside this black cylinder increasing distance from the origin to the center of the the green helical segment can go down and around the intersection line-segments. Finally, to check all line-segment triangle point of the two blue curve segments and back up to the green curve. Similarly, the gray cylinder to the right-hand side shows the pairs for intersection and stop either when an intersec- motion of the frst intersection point on the straight segment. Inside tion is found or when the triangle inequality between the gray cylinder, the green line segment may be altered to go back the disk and line segments implies that an intersection is and around the intersection point. Hence, if the right hand side black impossible. and gray cylinders do not intersect other parts of the curve at a time Te local zigzag shape of general alpha carbon curves between the two times of intersection, then we can alter all curves inside the two cylinders and postpone the intersection. Similarly, if often make self-intersections occur in pairs during a the black and gray cylinders to the left-hand side do not intersect morph. Such a pair corresponds to interchanging over other parts of the curve between the times of the two intersections, and under sliding in Reidemeister moves of type two, then that intersection can be mover forward in time. Now move both 2 , shown in the middle of Fig. 2. Te two crossings, intersections to happen at the mean of the two original intersection times (shown in red). The two red arcs combined with parts of the (aj, bj, signj, tj∗) and (ak , bk , signk , tk∗) , in an 2 move have black and gray cylinders now form a closed loop. If a topological opposite signs and both change sign during transver- disc spanned by this loop avoids the remaining parts of the red sal self-intersections. Without loss of generality, we may curve, then two Reidemeister moves of type 2 can change the initial t t t t assume that j∗ ≤ k∗ . Consider frst the case where j∗ = k∗ . under-sliding of the helical segment to an over-sliding and avoid the At time tk∗ the two arcs connecting the two points of two intersection points Røgen Algorithms Mol Biol (2021) 16:1 Page 8 of 19

from the remainder of the curve in the time interval specifed by E′ followed by all the possible 1 moves and t t t P(E ) j∗ ≤ ≤ k∗ . See Fig. 3. Te frst check is speeded up as let ′ be the sum of the prices of all the 1 and 2 in the case of the 1 move. If the backbone length of the moves used. We claim that for each matching E′ loop is shorter than the user-defned MaxLength , i.e., if N(E ) εP E K εP1all w(e) bj aj bk ak MaxLength , and no obstructions are ′ ′ − + − ≤ + = + − e E (a , b , sign , t ) (a , b , sign , t ) ′ found we say that j j j j∗ and k k k k∗  ∈ form a locally 2-removeable loop and penalize it by P2 and denote this quantity by A E′ . By this claim a maxi- price twice the sum of the distances from the points of E w(e) mal weighted matching ∗ maximizes e E′ and the loop to the line connecting the two points Paj (taverage∗ )  ∈ thus minimizes1 the left-hand side. By construction and Pak (taverage∗ ). 0 P E ε ( ′) 2 and hereby N(E∗) is minimized and As in the case of 1 moves, an 2 move acts on space P≤E ≤ ( ∗) is minimized subject to the constraint that N(E∗) curves. When projected to a knot diagram, an 2 move is minimal, solving both problems at once. may require several ordinary Reidemeister moves and We fnd it most instructive to prove this claim by may thus be thought of as a meta move of knot diagrams. induction on the cardinality of E′ and start with the case Te sum of the crossings signs is unchanged by an 2 E′ where all 1 moves are performed and the claim move and there is thus no net discontinuous contribu- =∅ is trivial. Consider a matching E′′ where ek E′′ and the tion to the writhe of the curve. E.g., the linking num- ∈ induction claim is true for E′ E′′ ek . Ten ber of double stranded DNA undergoing an 2 move is = −{ } unchanged and there is therefore no torsional penalty A E′′ K εP1all w(e) = + − involved. e E  ∈ ′′ K εP1all w(e) w(ek ) Essential self‑intersections = + − e E − ∈ ′ Let vi denote all the self-intersections. Some are 1 N(E′) εP(E′) w(ek ), { } P1 v 0 = + − -removable at price ( i)> . Some pairs vi, vj are 2 P2 v , v 0 [ ] -removable at price ( i j)> . We want to remove where the last equality follows by induction. We use the P2(e ) P2(vi, vj) as many self-intersections as possible at the lowest pos- shorthand k for the price of the edge ek sible cost when being restricted to removing each vi at between the two vertices vi and vj. vi, vj / �1 w(ek ) 2 εP2(ek ) most once. Tese two problems may be solved simulta- If ∈ then = − and A(E ) N(E ) 2 ε P(E ) P2(ek ) neously by one maximal weight matching problem as we ′′ = ′ − + ′ + . As two addi- P 2(e ) now explain. We will denote the self-intersections left tional vertices are removed at the cost  k we fnd A(E ) N(E ) εP(E ) after this process a set of essential self-intersections of ′′ = ′′ + ′′ . v vi 1 vj / �1 w(ek ) εP1(vi) 1 εP2(ek ) the morph. Consider the i ’s as vertices in a graph and If ∈ and ∈ then = + − A(E′′) N(E′) 1 ε P(E′) P1(vi) P2(ek ) let 1 denote the 1-removable subset. Let the constant and = − + − + . ε vj equal one over twice the sum of the prices of all 1 and Here one additional vertex, , is removed and the change P2(ek ) P1(i) 2 moves. Each vertex is associated a weight of cost is as vi not is removed by a 1 move − A E N E P E but by the 2 move ek . Tus ( ′′) ( ′′) ε ( ′′). P1 v , if v = + w v ε ( i) i �1 vi, vj 1 w(ek ) = εP 1(vi) + εP 1(vi) v( i) 1, else.∈ Finally, if ∈ then = εP 2(ek ) A(E ) N(E ) ε P (E ) P 1(vi) and ′′ ′ ′ −P 1 P 2 = P+ − (vj) (ek ) N(E′′) ε (E′′) as no additional We want to remove as many vertices as possible. Hence, − + = + vertices are removed and the cost is changed by using the 2 each vertex in 1 will be removed by an 1 move unless move ek instead of the two 1 moves. Tis completes the proof 2 it’s more favorable to let it take part in an move. of the claim. 2 Hence, we only need to consider how to use the We use the Matlab implementation [35] based on [36]. 2 moves. For this, let each move be represented by an In practice, this calculation uses only a small fraction of vi vj edge, e, connecting vertices and and associate this the computation time and may be considered efcient. edge with the weight We also implemented a greedy algorithm that searches for w(e) wv(vi) wv(vj) εP2(vi, vj). locally contractible 2 moves sorted by increasing back- = + − bone length between the two points of self-intersection and Finally, let K be the number of vertices not in 1 , denote removes the ones found. Usually considerable fewer candi- P1all P1(vi) the total cost of the 1 moves by vi �1 , date 2 moves have to be checked. Hence, the greedy algo- = ∈ E′ and recall that a matching is a subset of all the edges rithm is faster but in some cases it cannot remove as many that uses each vertex at most once. Let N(E′) denote the self-intersections as the optimal method based on maxi- number of vertices left after performing the 2 moves mum weighted matching in graphs. Tere may be several Røgen Algorithms Mol Biol (2021) 16:1 Page 9 of 19

sets with a minimal number of essential self-intersections, entire curve P1, ..., Pm to the point Paj . Pm may always but in most cases the real valued prices of 1 and 2 moves be moved along the straight line to Paj . If the trian- will result in a unique cheapest minimal set. However, for Po, Pm, Pm 1 gle with corners − is disjoint from the curve some morphs a slight change in these prices will change downstream of P(aj) the shortcut from Pm 1 to Po may −Po the selected minimal set. For more complicated morphs we be used. Otherwise a new obstruction point ˜ inside the therefore stress not to emphasise which self-intersections triangle is chosen such that the length of the path Po to are denoted essential but solely to report their cardinality. Po Pm 1 ˜ to − is minimized and all downstream line seg- ments are avoided. See Fig. 4. Te algorithm is greedy Po Pa End‑contractions and accepts the path from ˜ to j as the shortest found and all upstream points will use this as part of their path When a self-intersection is not locally removable, we con- to Paj . Te last encountered obstruction point is moved struct a morph using a standard strategy for unknotting to Po . Ten this shortening of the end-contraction is ropes or shoelaces namely: while keeping most of the rope ˜ continued for all the upstream line segments. Te cost of fxed, pull the rope until a terminus gets free of the loop avoiding the self-intersection (aj, bj, signj, tj∗) is thus the restraining it. Tat is, a self-intersection (aj, bj, signj, tj∗) total distance traveled by the points in the end-contrac- may be avoided by N- or C-terminus contraction. An N-ter- tion both for t 0 and for t 1 together with m times minus contraction frst moves all points upstream of Paj (0) = = the distance traveled by P(aj) in the original linear inter- to this point for interpolation parameter t 0 , then uses = polation, because all the m contracted points travel this the corresponding restriction of the original morph from distance. t 0 to t 1 , and fnally does a reversed end-contraction = = If a1 a2 , ..., an and b1, b2, ..., bn are the param- on the protein structure for t 1 . If the end-contraction is ≤ ≤ ≤ = eter values of the essential self-intersections the com- restricted to be performed along the backbone path, P1(0) is ˙ bined smallest N and C terminus end-contractions moved a total distance of approximately aj3.8Å , P2 is moved removing them all solve: (aj 1)3.8Å approximately − etc. Te total morph length is therefore quadratic in the distance to the terminus. In gen- min N - Contraction(a) C - Contraction(b) 1 a < b L + eral, the following algorithm illustrated on an N-terminus ≤ ≤ s.t. ai a or b bi for all i 1, ..., n. contraction fnds a much shorter end-contraction. ≤ ≤ = Te end-contractions is performed on the original pro- N-contraction a 1 If no is used = and tein structures, and we omit writing the constant inter- b min(b1, b2, ..., bn, L) in order to remove Paj = polation parameter t = 0 or 1. Tat is is shorthand all the self-intersections. If a a1 the frst self- Pa (0) Pa (1) = for either j or j . Let the frst obstruction point intersection is removed from the N-termi- Po Paj m floor(aj) be and let . We have to move the b min(b , ..., bn, L) = = nus end and = 2 , removes the

1600

C-terminus contraction 1400 N-terminus contraction estimate 1200

1000

800

600 Morph lengt h

400

200

0 0102030405060 number of residues

Fig. 4 Left: the dots indicate intersections between triangle Po, Pm, Pm 1 with the remaining structure. The obstruction point Po is chosen to − ˜ minimize the path length Po to Po to Pm 1 . Right: the size of end-contraction morphs given as the sum of distance measured in Ångström traveled ˜ − by all residues are shown as a function of the number of residues contracted. N- and C-terminus contraction seem similar, and their common estimate is explained in the text Røgen Algorithms Mol Biol (2021) 16:1 Page 10 of 19

a aj remaining self-intersections. Tus in general = and is 1-removable the price is set to the minimum of the 1 b min(bj 1, ..., bn, L) = + fulfll the constraints and we price and the end-contraction price. Figure 5 illustrates n 1 only have to check + cases to fnd this minimum. To the full self-avoiding path that lies in a tubular neighbor- speed this calculation up, we use an estimate of the end- hood of the original linear interpolation if essential self- contraction based on the observation that the distance intersection are absent. traveled by residues up to 17 residues away from the self-intersection point grow roughly linearly to 25Å and Treating alignments with gaps is fuctuating around 25Å when further apart along the Optimal structural superposition of two protein chains backbone. Tis gives a reasonable estimate of the sizes includes gaps in the alignments of the chains to compen- of the chosen end-contractions as shown on Fig. 4. Te sate for the naturally occurring insertions and deletions in end-contractions are the computationally most costly homologous protein chains. A canonical example is that a part of all the calculations. As they only serve to quan- loop may be longer in one structure than in the other; but tify the distance between essentially diferent folds, we such that the long loop can be contracted into the shorter ofer the possibility to omit this calculation and instead loop without causing self-intersections of the entire struc- use the estimated size of the end-contractions. Finally, we ture. A sequence or structural alignment determines a par- ofer the possibility to consider a self-intersection as 1 ing of only a part of the residues of the two proteins. We -removable if it is sufciently close to a protein terminus. choose to fll out the alignment gaps using linear interpola- One natural choice is to set the longest allowed distance tion as explained by the following example involving a 12 from the point of self-intersection to an end to half of the residue long Chain 0 and a 10 residue long Chain 1 using user-defned MaxLength . In case a self-intersection also the notation of the program TM-align:

Backbone 0 1 2 3 4 5 6 7 8 9 – – – 10 11 12 TM align . : : : : : Backbone 1 1 2 – 3 4 – – 5 6 7 8 9 10

Fig. 5 This fgure illustrates the confguration space of a chain. The circles at the extreme left and right indicate the two aligned structures. The original linear interpolation is the black thin line segment between them. This line segment in the confguration space has six self-intersections where it crosses the thin red curves representing self-intersecting confgurations. Each Reidemeister move may involve MaxLength residues and can at most deviate a fxed distance, roughly 1/4MaxLength2 , from the original linear interpolation. From left to right, the frst self-intersection is avoidable by an 1 - move and the next two by an 2 - move. If the fourth self-intersection is removed by the short thick light colored 1 - move, then the ffth cannot be removed. Hence, the 2 - move is preferred. The last self-intersection involves too many residues to remove, is denoted essential, and the length of the dotted end-contraction avoiding it is calculated Røgen Algorithms Mol Biol (2021) 16:1 Page 11 of 19

Te frst aligned residue pairs between Chain 0 and re-parametrization. E.g., dmin for two points 4.80 line Chain 1 are (3, 1), (4, 2) and (6, 3). We treat weakly segments apart along the backbone curve is set to 3.51Å dmin 3.47Å 3.52Å aligned pairs, indicated with “.” by TM-align, as aligned found by interpolating = and for points pairs indicated by “:” . Residue 5 of Chain 0 is not being 4 and 5 line segments apart respectively. aligned and we choose to align it with the midpoint A loop-contraction combined with other deformation 1 P2.5 between Chain 1’s second and third residue. Te between two structures likely lead to a pair of self-inter- next two aligned pairs (7, 4) and (10, 8) leave 3 inter- sections of type 2 . If just one of these self-intersec- vals open on Chain 0 and 4 intervals on Chain 1. By tions involves the aligned parts and the other not, then linear interpolation, Curve 0 needs to be traversed at the aligned-aligned self-intersection is probably essen- 3/4 of the speed Curve 1 is traversed resulting in the tial when only considering aligned parts of the morph re-parameterizations: whereas it is removable if all self-intersections are con- sidered. Beside the goal to tell if alignment gaps are topo- New param. 1 2 3 4 5 6 7 8 9 10 logically similar, this is a reason not to restrict the morph Re-param. 0 3 4 5 6 7 7.75 8.5 9.25 10 11 to the aligned parts. IsAligned 1 1 0 1 1 0 0 0 1 1 Re-param. 1 1 2 2.5 3 4 5 6 7 8 9 Applying curve smoothening We represent each protein structure by either the posi- α Te re-parameterized Curve 1 is the piecewise linear tions of its alpha carbon atoms Ci or by a smoothened curve connecting the points: curve. In the smoothening all alpha carbons, except for the frst two and the last two are replaced by the fxed ˜1 1 ˜1 1 ˜1 1 ˜1 1 ˜1 1 α α α α α P1 = P1 P2 = P2 P3 = P2.5 P4 = P3 ... P10 = P9 C aC bC aC C i 2+ i 1+ i + i 1+ i 2 convex combination Pi − − 2 2a b + + . Te = + + a 2.4 b 2.1 Te re-parameterized Curve 0 connects the points: constants = and = are chosen to minimize the total curvature of a collection of protein structures [18]. ˜0 0 ˜0 0 ˜0 0 ˜0 0 ˜0 0 P1 = P3 P2 = P4 P3 = P5 P4 = P6 P5 = P7 All linear interpolations from an alpha carbon curve to ˜0 0 ˜0 0 ˜0 0 ˜0 0 ˜0 0 its smoothened curve are self-intersection free for the P6 = P7.75 P7 = P8.5 P8 = P9.25 P9 = P10 P10 = P11 protein structures used here. Te smoothening straight- 0 Te piecewise linear curve connecting the Pi ’s difers ens both alpha helices and beta-strands, see Fig. 6. Te ˜ 0 slightly form the original curve as e.g. a corner around P8 smoothened representation resembles typical cartoon P0 P0 P0 P0 is cut when connecting ˜6 = 7.75 to ˜7 = 8.5 . However, pictures of protein structures and has the advantage often it is trivial to morph the one curve into the other and the to make it possible to follow an interpolation visually, thickness of the protein backbone guaranties that no self- see Fig. 6. Furthermore, it results in fewer interpolation intersections can occur when doing this. Te structural self-intersections. For the smoothened representation, alignment thus gives rise to studying the self-intersec- the overlap is based on the minimal distances 1.0, 2.1, i j 1, ...,5 tions of the linear interpolation 3.0, 3.4, 3.6Å for residues with indices | − |= apart and 3.7Å otherwise. A self-intersection check may P(t, a) (1 t)P0 tP1 , be avoided if the line-segments are shorter than 3.5Å and = − RePar0(a) + RePar1(a) the sum of the endpoints overlap is < 2.1Å. t 0, 1 a 1 m for ∈[ ] and ∈[ ; ] , where m is determined by the alignment On each line segment the function Results IsAligned(a) is given by linear interpolation of the values When performing linear interpolation between more and zero (for not aligned) and one (for aligned) at the end- more distant structures, overlap will start to grow and points of the line segments. In particular, it is one on line eventually self-intersections that can be removed using 1 segments connecting aligned residues and zero when and 2 moves will emerge. At frst, each move involves connecting non-aligned residues. only a small part of the backbone. For a self-intersection (a , b ) (ak bk signk tk ) A self-intersection for parameter values k k gives the with data ; ; ; we defned the backbone AlignedSum(k) IsAligned(ak ) IsAligned(bk ) length of the corresponding 1 move as bk ak and for an sum = + . − We call the self-intersection aligned-aligned if 2 move it’s backbone length is also the (real) number of AlignedSum(k) 1.5 1.5 > AlignedSum(k)>0.5 line segments it involves. When interpolating structures ≥ , aligned-gap if and gap-gap if AlignedSum(k) 0.5 . Te minimal that are more distant these backbone lengths may grow ≤ allowed distance, dmin , between points along the back- and eventually obstructions to the 1 and 2 moves may bone curve used to defne the overlap are given by lin- occur. Tis we now illustrate with a few examples with ear interpolation of the discrete dmin values after the high sequence identity. Te ∗-marked examples have 100% Røgen Algorithms Mol Biol (2021) 16:1 Page 12 of 19

Fig. 6 Left: the alpha carbon trace of CATH domain 1emd01 and its self-intersection free interpolation with its smoothened black curve. Center and right: the interpolation between CATH domains 1csgA0 and 1jli00 alpha carbon and smoothened curves respectively. The two domains share the homology-class 1.20.120.200. Both interpolations have one self-intersection indicated by a black face. In both cases it may be circumvented by a large 1 move. If the user does not allow such large rearrangements, this self-intersection is called essential and the size of an end-contraction avoiding it is calculated. As the alpha carbon RMSD is 5.6Å this is an example that a change in topology that seems to demand quite diferent folding pathways may happen for a smaller RMSD and for homologous chains

sequence identity and are taken from the Sequence-Simi- motion between the glutamine binding protein’s struc- lar, Structure-Dissimilar Protein Pairs in the PDB found by tures 1GGG and 1WDN with and without ligand is 8Å Koslof and Kolodny [2]. an example of this. Te RMSD= interpolation of hepatitis C virus between the A chain confgurations Examples with high sequence identity 1A1Q and 1 JXP∗ has a mainly local overlap of mean Te A and B chains of the open and closed forms of ade- 0.11Å and only 5 self-intersection checks are needed 7Å nylate kinase, 4AKE and 1ANK, both have RMSD≈ . for the alpha carbon curve interpolation. Te alpha Te interpolation cause no overlap of the alpha car- carbon curve interpolation is self-avoiding. Te large, 10 3Å 23Å bon curves, and next to no mean overlap of − for RMSD= motion between the (un-) and phospho- the smooth curves and consequently self-intersection rylated forms of Odhl∗ is shown in Fig. 7 left. Te free checks are not needed. We may conclude that both the arm performs a long motion and the remaining com- A and B chain interpolations are easy to perform. We pact domain rotates close to a third of a full rotation expect the same conclusion for a hinge-motion of rela- in the RMSD superposition. Te linear interpolation of tively ridged sub-domains since neighboring residues this large rotation contracts the structure and causes perform almost identical motions. Te 5.3Å RMSD 1.0Å mean overlap ( 0.6Å for the smooth curve). Only

Fig. 7 Left: the self-intersection free linear interpolation between the smooth backbones of the un- and phosphorylated forms of Odhl (NMR structures 2KB3 and 2KB4). Center: the interpolation between two confgurations of the SH3 domain given by the A chains of 1AOJ and 1I0C∗ has a self-intersection between line segments approximately 10 and 25 residues from the C-terminal. Right: the interpolation between the 52 residue long helix 1IK7 A chain∗ and its aligned part of the 102 residue folded 1D2Z C chain is self-intersection free Røgen Algorithms Mol Biol (2021) 16:1 Page 13 of 19

the interpolation of the alpha carbon curves has self- Detecting knottedness intersections. Its 3 self-intersections may be removed Te D chain of 2FG6 contains a right handed trefoil knot by one 1 and one 2 , move each of backbone length located in the interval from the 171th to the 238th resi- less than 4 line segments. Note, that all interpolations due [26]. We align the 83 residues 162 to 244 of 2FG6’s considered so far are ‘easy to perform’ even if some of D chain that contain the knot to 408 sequence class rep- them involve larger motions. resentatives of CATH 2.4 domains of compatible lengths RMSD 14Å Te = linear interpolation between two con- from 75 to 100 residues. Using global RMSD super- fgurations of the 60 residue SH3 domain is shown in the position, all these interpolations have at least 1 self- middle of Fig. 7. Te alpha carbon and smooth backbone intersection with and average of 8.4 intersections (3.6 interpolations both have one self-intersection, that in both for the smooth representation). In the following n(m) is cases may be resolved by an 1 move involving about 15 res- short hand for n for the alpha carbon curve and m for idues or by an end-contraction of backbone length counted the smooth representation. Depending on MaxLength as twice the 10 residues it involves. Hence, this self-inter- almost all morphs have at least one essential self-inter- MaxLength MaxLength 10 section is considered as non-essential if e.g. is section. E.g. for = only 3(1) domains of 16 which may be larger than some users will allow. Finally topologies 3.40.30 and 3.66.20 are morphed into the tre- RMSD 13Å the = interpolation from the folded 1D2Z C foil knot without causing essential self-intersections. For chain∗ MaxLength 20 chain to the 52 residue long helix 1IK7 A is shown to = there are 11(11) such cases. We have cut the right in Fig. 7. Te folded structure is monotone in the 2FG6’s D chain relatively close to its trefoil knot, and it direction of the helix and consequently no self-intersections may become unknotted by a similar size end-contraction. occur even though most consider the topology changed. Tis efect is very pronounced as self-avoiding morphs TM-align seldom aligns the entire shorter chain and are found in 6(5), 34(25), 109(81), 177(151), 325(318) of MaxLength 5, 10, 14, 16, 20 introduces generally more gaps than the global RMSD the 408 cases for = respec- alignment. Te resulting interpolations therefore are dif- tively when end-contractions up to backbone-length ferent from the RMSD based above. TM-align aligns MaxLength/2 are allowed. Without end-contractions, 181/183 of the 214 residues of the A/B chains of 4AKE a self-avoiding morph is found only in cases that frstly, and 1ANK and the re-parameterized curve has 220/220 are unknotted by the original interpolation by moving vertices covering the entire original chains. For the alpha an end through an enclosing loop, and that secondly also carbon curves the gaps cause mainly local overlap of mean have a sufciently large MaxLength to avoid other self- 0.79/0.74Å MaxLength 20 and one 1 removable self-intersection. Te intersections locally. For = and with end- same is found in the A chain interpolation between 1A1Q contraction of at most 10 residues, the end-contraction and 1 JXP∗ and between 1GGG’a A chain and 1WDN, may unknot the knot and hereafter the 20 residue long where apart from the local overlap of the alpha carbon Reidemeister moves are powerful enough to fnd a self- interpolation, both the alpha carbon and smooth curve avoiding morph to most of the other folds. Hence, to cap- interpolations have non-local overlap. Te situation is ture the characteristics or topology of a fold; one has to quite diferent in the case of Odhl where TM-align aligns set the maximal allowed end-contraction in concordance only the compact and almost unchanged part of the struc- with how tightly the representative domain is cut. Simi- ture. For the SH3 domain, 35 of the frst 36 residues are larly, many TM-alignment windows are too short to con- aligned by TM-align together with the weakly aligned tain the knot, in which case fewer morph obstructions 47’th residue. Te one self-intersection is again present; are found. this time in the alignment gap and its presence is thus due to the weak alignment of residue 47. In the 1IK7-1D2Z C Data chain example TM-align aligns only 18 almost consecutive We pick a representative set of protein structures by residues identifying two helical segments. Tese exam- taking the sequence family representative CATH2.4 ples fall in two main cases. In the frst, the TM-alignment domains [37] restricting to cases where it is gap free and covers most of the aligned chains. Here the gaps in the has between 75 and 150 residues. Tese are clustered at TM-alignment, that are not taken into account in the TM- 60% sequence homology and give 1034 domains repre- superposition, cause more overlap especially for the alpha senting 1034 sequence families, 521 homology families carbon curve interpolations of which some have a local 1 (H-classes), 281 topologies T-classes, 23 architectures removable self-intersection. Te other main case is when and the 4 main classes. If the longest of a pair of domains TM-align identifes a structurally conserved region whose is at most 10% longer than the shorter domain, we fnd interpolation is trivial but where the global RMSD align- the alignment of the entire shorter domain with mini- ment shows that the full structures are connected by rela- mal RMSD allowing one inner gap in this alignment. Te tively long but obstruction free interpolations. choice of limited diferences in domain lengths together Røgen Algorithms Mol Biol (2021) 16:1 Page 14 of 19

with shorter but frequent domains is done to make line-segment pairs cause 0.73 (0.22 smooth). TM-align global alignment in the best possible agreement with the aligns pairs of alpha carbons that in the fnal superpo- structural classifcation. Te global RMSD alignment sition are at distance at most 5Å. Tis short distance use exhaustive search and is time consuming. Hence, does not prevent self-intersections when interpolating the choice of domain pairs is also necessary to make the superimposed structures. In 142068 TM-aligned the global alignment practical. For each of these 142068 domain pairs, we fnd 42165 (12103 smooth) self-inter- alignments we calculate the RMSD together with overlap sections of aligned parts. For both alignment methods, and topological obstructions found in the correspond- the alpha carbon curves cause almost 3 times as many ing linear interpolations. We do the same based on the self-intersections as the smooth curves. TM-alignments using its standard parameters. We fnd 135698 inter T-class pairs, 6370 intra T-class pairs, and Essential self‑intersections 1304 intra H-class pairs giving 5066 intra T-class pairs Figure 8a shows the average number of essential self- that are not inter H-class pairs. All data from these align- intersections as function of the maximal allowed ments are provided as Additional fles 1, 2, 3, 4 that gen- backbone length, MaxLength , of 1 and 2 moves. erate the fgures on CATH alignments below. Te original number of self-intersection is found for MaxLength 0 = . Tere are clearly more self-intersections On inter and intra fold class morphs for the alpha carbon representation, e.g. on average 13 Tis section concerns the 142,068 structural align- between fold classes compared to the 4.9 for the smooth ments of CATH domain pairs of similar chain length. representation, and there is a relatively low correlation of For the alpha carbon curves 29% of the total overlap is 0.68 between the numbers of self-intersections in the two i j 4 local caused by residues with indices | − |≤ . Rela- representations. However, the essential self-intersection tively large local metric changes when alpha helices are numbers of the two representations are closely related for involved is the main contributor to the local overlap. For MaxLength at least four. See Fig. 8a, b. Teir correlation 4 MaxLength 6 comparison, only 7% of the total overlap is local for the is 0.92 for ≤ ≤ and drops slowly down to MaxLength 20 smooth curves. Te non-local contributions coming from 0.79 for = . aligned regions are highly correlated (0.97) between the Te shorter TM-alignments give fewer essential self- alpha carbon and smoothened curves. intersections than the global RMSD alignments as expected. For the TM-alignments MaxLength needs to be Self‑intersections and alignment gaps larger to remove the additional self-intersections of the n 1 A structure with + alpha carbons has n line seg- alpha carbon curves. See Fig. 8d, e. For both alignment n(n 1)/2 ments and − line segment pairs. If a self-inter- methods, an essential self-intersection for smooth curves AlignedSum 1.5 section has ≥ then we say it is between generally implies one for the alpha carbon curves and two aligned parts of the structure. Otherwise, a least the notion of essential self-intersections therefore seems one alignment gap is involved. Our global RMSD align- robust when restricted to one alignment method. Both ment allows one gap that usually is too short to make the alignments and superpositions provided by the two intersections with itself in the linear interpolation. alignment methods may be very diferent. One should One thousand aligned line segment pairs on average thus not expect essential self-intersections to agree cause 2.2 self-intersections and gap-involving line- between them. However, if there is an essential self-inter- segment pairs cause 1.6. Te smooth representation section in the TM-aligned window, then it is more likely generally cause fewer self-intersections and the cor- than average to fnd one in the RMSD alignment, Fig. 8b. responding numbers are 0.81 and 0.52 respectively. If the TM-aligned window covers most of the structure When using TM-align we connect all inner gaps in the also the opposite tendency is found, Fig. 8e. alignments to check if they cause topological changes. Figure 8c shows the average number of essential We use the Aligned Window Fraction (AWF) given self-intersections as function of the aligned window by the shorter aligned window length divided by the length and Fig. 8f shows the ratio between them. TM- shorter chain length to characterize the coverage of the align can avoid self-intersections by choosing a shorter aligned windows. On average 55% of the residues in aligned window. But also in cases where the TM- the shorter chain are aligned and AWF is 0.8 on aver- alignment window is almost global, it results in fewer age. From allowing similar size end-contractions on essential self-intersection. See Fig. 8f. Possibly both the knotted example above, one hereby expects and TM-aligns freedom to optimize the alignment using fnds signifcantly fewer self-intersections. One thou- multiple gaps and its choice of superposition result in sand aligned line segment pairs on average cause 0.17 fewer essential self-intersections. Figure 8f shows that (0.048 smooth) self-intersection where gap-involving for larger values of MaxLength the local Reidemeister Røgen Algorithms Mol Biol (2021) 16:1 Page 15 of 19

a b c 10 1 non T-pairs, C-alpha 7 3 4 8 non T-pairs, Smooth 0.8 6 T/H-pairs, C-alpha 5 T/H-pairs, Smooth 5 6 8 6 H-pairs, C-alpha 0.6 4 10 H-pairs, Smooth 12 4 0.4 3 14 16 2 20 Average number of

Average number of 2 0.2 Fraction of pairs wit h 1 essential self−intersections essential self-intersections 0 essential self-intersections 0 0510 15 20 0510 15 20 0 40 60 80 100 120 140 Largest alignment window length in residues d MaxLength e MaxLength f 2.5 1 e non T-pairs, C-alpha 3.5 2 non T-pairs, Smooth 0.8 T/H-pairs, C-alpha T/H-pairs, Smooth 3 1.5 H-pairs, C-alpha 0.6 H-pairs, Smooth 2.5 1 0.4 2

Average number of 0.5 0.2 1.5 Fraction of pairs with essential self-intersections essential self-intersections number of essential self−intc.

0 0 RMSD/TM−align ratio of averag 1 0510 15 20 0510 15 20 80 100 120 140 MaxLength MaxLength Largest alignment window length in residues Fig. 8 a: the average number of essential self-intersections found in the global RMSD alignment for pairs of CATH2.4 domains sharing homology class (H-pairs), sharing topology class but not homology class (T/H-pairs), or from distinct topology classes (non T-pairs). b: the fraction of pairs with essential morph self-intersections based on the global RMSD alignments. The plus signs show the fraction of the pairs with an essential self-intersection for the smooth curves that also have one for the alpha carbon curves. The circles show for the smooth curves the fraction of pairs with an essential self-intersection for the TM-alignment that also have one for the global RMSD alignment. c: the average number of essential self-intersections for smooth curves shown as function of the length of the largest aligned window and for MaxLength-values given in the legend. The graphs are smoothened by averaging over alignment windows of length between i – 2 and i 2. Global RMSD alignments are dashed and + TM-alignments in solid lines. d , e: as (a, b) but for the TM-alignment. In (e) the circles show the fraction of pairs with an essential self-intersection for the global RMSD alignment that also have one for the TM-alignment restricted to the 5% cases where AWF 0.9725 . f: the average number of ≥ essential self-intersections for the RMSD alignment divided by the same number for the TM-alignment and shown as function of the largest aligned window length. MaxLength is color-coded as in (c). Dashed curves are for all alignments and solid curves for the 5% cases where AWF 0.9725 ≥

MaxLength 10 moves manage to compensate for many of these addi- essential self-intersections for = , but tional self-intersections caused by the lack of gaps in most classes require more data to be studied in detail. the global alignments. See the included supplementary data. Due to the large global RMSD values of interpolations with essential Morphability and fold classifcation self-intersections, TM-align typically aligns too few Interpolations between pairs of homological structures residues to capture the eventual topological dissimilar- have essential self-intersections relatively often. Fig- ity. Te most striking fnding is not that more than half ure 6 shows an example where the sole self-intersection of all intra T-class global RMSD morphs have essential may be removed by an 1 move involving half of the self-intersections, but that many inter T-pair morphs do MaxLength 10 chain or by a relatively large movement of one of the not. For smooth representation and = , long ends. Tis author fnds it debatable if these two 868 of 879 H-pairs, 1189 of 1295 non-H but T-pairs MaxLength 10 10 domains should share fold class. For = and 1831 of 3852 inter T-pairs with RMSD ≤ do not as many as 13% of all H-pair morphs have essential cause essential self-intersections. Hence, under this self-intersections, but usually for greater RMSD than RMSD restriction, most intra T-pairs, in total 2057 of this example as seen from Fig. 9. Te fraction of H-pair 2174, are morphable, but there is a compatible number morphs with essential self-intersections varies between of self-avoiding morphs of similar length into other fold the classes. E.g., 5 out of the 14 alignments in class classes emphasizing the continuous nature of protein 1.20.120.200 self-intersects as seen on Fig. 6. In total folds. An example is the 7.1Å global RMSD interpola- 47 of the 512 H-classes with intra H-class morphs have tion between the 6 helix domain 2ygsA0 and the 2 helix Røgen Algorithms Mol Biol (2021) 16:1 Page 16 of 19

1 non T-pairs Overlap 0.8 all pairs 15 Torsional effect -moves T/H-pairs 1 -moves 0.6 H-pairs 2 10 Original morph 0.4 End-contractions

0.2 5 Fraction of pairs with

essential self-intersections 0 6810 12 14 16 18 0 5101520

Fig. 9 Left: the frequency of morphs with essential self-intersections for pairs with RMSD below a given maximal RMSD. The MaxLength of 1 and 2 moves is set to 10. Right: the average contributions to the morph length for all intra T morphs shown as function of MaxLength . The absolute value of the signed sum of sign changes represent the torsional efect; the remaining contributions are average values per residue and 4 beta-strand domain 1ha101 which is self-intersec- are longer than local Reidemeister moves. Inter T-class 26Å MaxLength 3 21Å tion free both in the alpha carbon and smooth represen- morphs are on average for = and MaxLength 20 tation. Te 0.49Å(0.03Å smooth curve) mean overlap for = , that is only slightly longer than the has very little 0.03Å(0.001Å) non-local contribution. intra T-class morphs. Te global RMSD alignment thus Hence, even if their local geometry and hydrogen bond- do not fnd a drastic diference in the lengths of inter and ing patterns are very diferent, their full backbones are intra fold class self-avoiding morphs. TM aligns approxi- of typical H-pair distance in our data set, and the linear mately half of the residues, namely those superimposed interpolation causes only minor local steric problems. within 5Å distance. Te morphs lengths of the aligned windows are therefore considerably shorter than those Morph lengths obtained using the global RMSD alignment. Te TM- In case no essential self-intersection is found, the original alignment based morph lengths increase mainly due to 1 L morph length plus the price of the 1 and 2 moves is the gaps in the alignment and to the topological obstruc- MaxLength 3 an estimate of the length of a self-avoiding morph for the tions of the morph. For = the average piecewise linear backbone curve not taking into account morph lengths are 3.2, 6.4 and 13.2Å per residue for intra the thickness of a protein chain. Te shortest overlap H, inter H but intra T, and inter T class morphs respec- d 3.7Å MaxLength 20 free distance is = . Hence, each point on the back tively. For = the similar numbers drop to bonecurve has to traverse half the circumference of a cir- 3.0, 5.5 and 11.0Å. cle with radius d/2 to go from over sliding to under sliding and staying at an overlap free distance. One may there- Discussion π d/2 b a 1 1 fore add the penalty ∗ ∗| k − k + | for moves We are handling a paradox by trying to establish a “topo- π d/2 bj aj b a 2 and ∗ ∗| − + k − k + | for 2 moves. Te logical” distinction in the path connected space of pro- total overlap is the sum of the distances each pair of tein structures, and we emphasize that the set of essential amino acids need to get further apart during the morph self-intersections we fnd are only local obstructions in in order to get an overlap free morph. One may there- the non-convex but connected space of protein embed- fore wish to add and possibly weight the AverageOverlap dings. If a current alignment score function shows high to the morph-length estimate. Similarly, one may intro- similarity then the two confgurations are close, at least, duce a weight for the end-contraction length and for in the space of immersions, which allows self-intersec- the penalty for net torsional efects. Figure 9 shows the tions. We ofer to test if the linear interpolation between average contribution from one residue to morph lengths the two structures is self-avoiding and if not then we ofer on intra T-morphs as function of MaxLength for the an algorithmic search for alternative morphs through smoothened backbone curves. Clearly shorter morphs embeddings in a neighborhood of the linear interpola- are found for larger MaxLength as end-contractions tion. We divide the self-intersections into two sets. Te Røgen Algorithms Mol Biol (2021) 16:1 Page 17 of 19

set of non-essential self-intersections contains the maxi- a case where the backbone curve morph is self-avoiding, the mal number of self-intersections that can be avoided overlap holds information in addition to the morph length. We using a sequence of 1 and 2 moves. Te remaining saw examples of long overlap free morphs between sequence self-intersections are called essential. Tere are three identical highly fexible structures. In these morphs, a neigh- possible reasons why a given self-intersection is marked borhood of most residues undergo a motion close to a rigid essential. Te frst is that a needed move involves motion. Tis cause both little overlap and that the shorter larger parts of the protein chain than permitted by the internal distances are subject to only smaller changes. Hence, user. Te second is that there is a “topological” obstruc- a structural comparison based on local internal distances as tion to a needed move. Te third is that the self-inter- FlexE will also detect that neighboring residues perform similar section can be avoided but it is cheaper to avoid other motions. Tus, the combined information of RMSD and over- self-intersections. Te ”topological” obstruction is a case lap may be similar that of RMSD and FlexE in such cases. where the constructed move will cause at least one new Conclusion and future work self-intersection and is technically found as an intersec- We present a number of measures that quantify obstacles tion between a constructed disk-surface and the pro- to deform a protein structure to an aligned and superim- tein structure. However considering protein structures posed structure and show they give signifcant additional as open curves any such intersection can be avoided by information to the distances usually used in protein struc- deforming the disk-surface. In most cases these intersec- ture alignment score functions. We study the linear inter- tions need to be pushed past at least one chain terminal, polation between two superimposed structures. First, we and this results in a large move that is more efciently quantify the steric problems of this linear interpolation resolved by an end-contraction. In some cases two back- by fnding the shortest distance between each amino acid bone-surface intersections lie close along the backbone pair during the interpolation and compare it to distance and a smaller alteration of the surface may avoid them constraints found in native protein structures. Larger both - think of untying a slip-knot. In this study, there has deformations between sequence similar native struc- most often been only one backbone-surface intersection. tures cause little overlap and are recognized as longer but Hence, its a relatively rare situation that the obstructions simple morphs. Next, we fnd all self-intersections of the to an move can be avoided locally and this lies outside protein backbone during linear interpolation between the scope of this work, but it may become more impor- two aligned structures. We introduce 3-dimensional ver- tant if longer MaxLength are wanted. Algorithmically sions of Reidemeister moves each capable of avoiding we have decided to keep our untying moves local along one or two self-intersections by altering the original lin- the backbone to ensure they can be treated in one-pass. ear morph. Near a protein terminal an end-contraction Mathematically, our novel notion of essential self-inter- can also be used to avoid a self-intersection. We calculate section is hereby not a topological but a geometric notion the length of each of these morph alterations and solve saying that any self-avoiding morph between the two the optimization problem to avoid the maximal num- confgurations has to go further than a given threshold ber of self-intersections at the lowest additional morph away from the linear morph path. It is our hope and main length. If this results in a self-avoiding morph the struc- aim that this tunable threshold refects a general percep- tural similarity found by the alignment program is verifed tion of when two structures have diferent folding paths. for the entire aligned windows including all inner align- In the example shown on Fig. 1 is resolved by 1csgA0 ment gaps. Otherwise we fnd a smallest set of essential x-ray difraction and by NMR. From the alignment 1jli00 self-intersections that cannot be avoided. We demonstrate its clear that the two structures have similar distance matrices that this novel notion of essential self-intersections is rela- and residue contacts leaving the possibility that the morph self- tively robust, e.g., under a change in the representation of intersection points to a modeling error in one of the two struc- protein chains. Te user may input the maximal allowed tures. Te smoothened representation makes visual inspection chain length of Reidemeister moves and of end-contrac- of morphs signifcantly easier. Te speedup of calculations tions. We hope this allows users to tailor an appropriate due to fewer self-intersections in this representation was how- notion of when two structures have diferent threading for ever neutralized as the fltering of potential self-intersections the application at hand. by overlap is less efcient. Te overlap is the sum of the steric We fnd as expected that morphs from unknotted proteins clashes during an imagined linear interpolation between two to a given knotted protein have essential self-intersections aligned and superimposed structures. It is important algorith- except for a few cases where the morphs actually untie the mically and may be interesting structural biologically. Even in Røgen Algorithms Mol Biol (2021) 16:1 Page 18 of 19

knot. Using a global RMSD alignment and a set of sequence Supplementary information representative CATH domains, we fnd a signifcant frac- The online version contains supplementary material available at https​://doi. tion of all homologous protein structure pairs separated by org/10.1186/s1301​5-020-00180​-3. at least one essential self-intersection and give examples of homologous pairs that most will considered as diferent Additional fle 1. SupplementaryDataCATH2_4Domains.m. Matlab script that loads and explains all the raw data from all CATH2.4 alignments. It threaded despite having lower RMSD. Our perhaps most produces fgures and may be altered to investigate the data further. interesting fnding is that many representatives of distinct Additional fle 2. Data from aligning CATH2.4 domains. NeamTM3to- folds may be morphed into each other using only smaller 20NoEndContraction.mat Data from global RMSD and TM-align alterations of the direct linear interpolation. Tis supports alignments of alpha carbon backbone curves. The maximal length of Reidemeister moves is varied from 3 to 20 and end-contractions are not the continuous view on parts of protein fold space. allowed. Similar data for the smooth representation of the backbone Te residues that are aligned by TM-align are super- curve are contained in NeamTMSmooth3to20NoEndContraction. imposed within 5Å . We fnd 42165 (12103 smooth mat and in NeamTMSmooth3to20EndContraction.matwhen allowing end-contractions of MaxLength 2/residues. curve) self-intersections between aligned parts in 142068 = TM-alignments. Tus this close distance still allows Additional fle 3. Data from aligning CATH2.4 domains. NeamTM3to- 20NoEndContraction.mat Data from global RMSD and TM-align topological dissimilarity of the aligned parts. In the TM- alignments of alpha carbon backbone curves. The maximal length of alignments performed in this work approximately half of Reidemeister moves is varied from 3 to 20 and end-contractions are not the residues are aligned on average. Te TM-alignments allowed. Similar data for the smooth representation of the backbone curve are contained in NeamTMSmooth3to20NoEndContraction. result in fewer topological obstructions than expected mat and in NeamTMSmooth3to20EndContraction.matwhen allowing end-contractions of MaxLength 2/residues. alone from their shorter alignment length. An interesting = subject for future research is therefor to develop methods Additional fle 4. Data from aligning CATH2.4 domains. NeamTM3to- that combine the superior structural alignment search 20NoEndContraction.mat Data from global RMSD and TM-align alignments of alpha carbon backbone curves. The maximal length of of e.g. TM-align with the restriction to local topologi- Reidemeister moves is varied from 3 to 20 and end-contractions are not cal similarity defned by the absence of essential morph allowed. Similar data for the smooth representation of the backbone self-intersections. Te current algorithm is in Matlab and curve are contained in NeamTMSmooth3to20NoEndContraction. mat and in NeamTMSmooth3to20EndContraction.matwhen requires two superimposed structures and the alignment allowing end-contractions of MaxLength 2/residues. in TM-align format as input. It is available on request to = the author. When a matching alignment method is devel- Acknowledgements oped, the software can stand-alone and will be release. The author is pleased to thank Kresten Lindorf-Larsen for suggesting exam- With fxed sequence alignment, protein structure predic- ples of fexible proteins and Christian Henriksen for discussions on how to remove the maximal number of self-intersections of a morph such that the tion is a natural application of our present method as struc- resulting additional morph length is minimal and appreciates his elegant for- tural comparison is done for the entire structure. Capable of mulation presented here. In addition, I would like to thank Rachel Kolodny for detecting diferences in threading and pointing out where in interesting discussions on protein fold space causing me to start working on measuring interpolation overlap while enjoying her hospitality in Haifa, and the structure these problems occur, we expect our method to fnally the anonymous reviewers for their very careful reading and thoughtful give an interesting supplement to the distance measures used suggestions for improvements of the manuscript. to assess protein models as e.g. done in the Critical Assess- Authors’ contributions ment of protein Structure Prediction (CASP) experiments. The author read and approved the fnal manuscript. When combined with TM-align, our method can detect cases where structurally similar proteins have diferent Funding folding pathways, which is important for understand- Not applicable. ing protein folding. In addition, alternate threading of a Availability of data and materials similar core structure may change dynamical and thereby All data on the alignments of CATH2.4 domains are supplied as supporting functional properties. Further, we hope our focus on self- material together with a Matlab-script for reading and handling the data. avoiding morphing may contribute to the development Ethics approval and consent to participate of alignment methods and to a more detailed picture of Not applicable. sequence to structure relationship for proteins and RNA. Competing interests The author declares not to have competing interests.

Received: 2 March 2018 Accepted: 17 December 2020 Røgen Algorithms Mol Biol (2021) 16:1 Page 19 of 19

References quickly and accurately. Proceed Nat Acad Sci USA. 2010;107(8):3481–6. 1. Hasegawa H, Holm L. Advances and pitfalls of protein structural align- https​://doi.org/10.1073/pnas.09140​97107​. ment. Curr Opin Struct Biol. 2009;19(3):341–8. https​://doi.org/10.1016/j. 24. Røgen P, Karlsson PW. Parabolic section and distance excess of sbi.2009.04.003. space curves applied to protein structure classifcation. Geom Dedic. 2. Koslof M, Kolodny R. Sequence-similar, structure-dissimilar protein pairs 2008;134(1):91–107. https​://doi.org/10.1007/s1071​1-008-9247-z. in the PDB. Prot Struct Funct Bioinformat. 2008;71(2):891–902. https​://doi. 25. Lua RC. PyKnot: a PyMOL tool for the discovery and analysis of knots in org/10.1002/prot.21770.​ proteins. Bioinformatics. 2012;28(15):2069–71. https​://doi.org/10.1093/ 3. Nepomnyachiy S, Ben-Tal N, Kolodny R. Global view of the protein bioin​forma​tics/bts29​9. universe. Proceed Nat Acad Sci USA. 2014;111(32):11691–6. https​://doi. 26. Lai YL, Chen CC, Hwang JK. pKNOT vol 2: the protein KNOT web server. org/10.1073/pnas.14033​95111​. Nucleic Acids Res. 2012;40(1):228–31. https​://doi.org/10.1093/nar/gks59​2. 4. Venkatraman V, Yang YD, Sael L, Kihara D. Protein-protein docking using 27. Takusagawa F, Kamitori S. A real knot in protein. J Am Chem Soc. region-based 3d zernike descriptors. Bmc Bioinformat. 2009;10(1):407. 1996;118(37):8945–6. https​://doi.org/10.1021/ja961​147m. https​://doi.org/10.1186/1471-2105-10-407. 28. Faisca PFN. Knotted proteins: a tangled tale of structural biology. 5. La D, Esquivel-Rodríguez J, Venkatraman V, Li B, Sael L, Ueng S, Ahrendt Comput Struct Biotechnol J. 2015;13:459–68. https​://doi.org/10.1016/j. S, Kihara D. 3d-surfer: software for high-throughput protein surface csbj.2015.08.003. comparison and analysis. Bioinformatics. 2009;25(21):2843–4. 29. Niemyska W, Dabrowski-Tumanski P, Kadlof M, Haglund E, Sulkowski P, 6. Nepomnyachiy S, Ben-Tal N, Kolodny R. Complex evolutionary foot- Sulkowska JI. Complex lasso: new entangled motifs in proteins. Scie Rep. prints revealed in an analysis of reused protein segments of diverse 2016;6(1):36895. https​://doi.org/10.1038/srep3​6895. lengths. Proceed Nat Acad Sci USA. 2017;114(44):11703–8. https​://doi. 30. Dabrowski-Tumanski P, Sulkowska JI. Topological knots and links in org/10.1073/pnas.17076​42114​. proteins. Proceed Nat Acad Sci USA. 2017;114(13):3415–20. https​://doi. 7. Zemla A, Venclovas C, Moult J, Fidelis K. Processing and analysis org/10.1073/pnas.16158​62114​. of CASP3 protein structure predictions. Prot Struct Funct Bio- 31. Khatib F, Rohl CA, Karplus K. Pokefnd: a novel topological flter for use informat. 1999;37(S3):22–9. https​://doi.org/10.1002/(SICI)1097- with protein structure prediction. Bioinformatics. 2009;25(12):281–8. https​ 0134(1999)37:3 <22:AID-PROT5​>3.0.CO;2-W. ://doi.org/10.1093/bioin​forma​tics/btp19​8. 8. Gamliel R, Kedem+ K, Kolodny R, Keasar C. A library of protein surface 32. Franklin J, Koehl P, Doniach S, Delarue M. Minactionpath: maximum likeli- patches discriminates between native structures and decoys generated hood trajectory for large-scale structural transitions in a coarse-grained by structure prediction servers. BMC Struct Biol. 2011;11(1):20. https​://doi. locally harmonic energy landscape. Nucleic Acids Res. 2007;35(2):477–82. org/10.1186/1472-6807-11-20. https​://doi.org/10.1093/nar/gkm34​2. 9. Marks DS, Colwell LJ, Sheridan R, Hopf TA, Pagnani A, Zecchina R, Sander 33. Mishra R, Bhushan S. Knot theory in understanding proteins. J Mathe Biol. C. Protein 3d structure computed from evolutionary sequence variation. 2012;65(6–7):1187–213. https​://doi.org/10.1007/s0028​5-011-0488-3. Plos ONE. 2011;6(12):28766. https​://doi.org/10.1371/journ​al.pone.00287​ 34. Randrup T, Røgen P. How to Twist a Knot. Arch Math. 1997;68(3):252–64. 66. https​://doi.org/10.1007/s0001​30050​055. 10. Kabsch W. A solution for the best rotation to relate two sets of vectors. 35. Saunders D. Matlab implementation (2013) http://www.mathw​orks.com/ Acta Crystallograph Sect. 1976;A32(5):922–3. matla​bcent​ral/flee​xchan​ge/42827​-weigh​ted-maxim​um-match​ing-in- 11. Zhang Y, Skolnick J. TM-align: a protein structure alignment algorithm gener​al-graph​s based on the TM-score. Nucleic Acids Res. 2005;33(7):2302–9. https​://doi. 36. Galil Z. Efcient algorithms for fnding maximum matching in graphs. org/10.1093/nar/gki52​4. Comput Survey. 1986;18(1):23–38. https​://doi.org/10.1145/6462.6502. 12. Holm L, Sander C. Protein-structure comparison by alignment of distance 37. Sillitoe I, Lewis TE, Cuf A, Das S, Ashford P, Dawson NL, Furnham N, matrices. J Mol Biol. 1993;233(1):123–38. Laskowski RA, Lee D, Lees JG, Lehtinen S, Studer RA, Thornton J, Orengo 13. Perez A, Yang Z, Bahar I, Dill KA, MacCallum JL. Flexe: Using elastic CA. CATH: comprehensive structural and functional annotations for network models to compare models of protein structure. J Chem Theory genome sequences. Nucleic Acids Res. 2014;43(D1):376. https​://doi. Comput. 2012;8(10):3985–91. https​://doi.org/10.1021/ct300​148f. org/10.1093/nar/gku94​7. 14. Mariani V, Biasini M, Barbato A, Schwede T. lddt: a local superposition-free score for comparing protein structures and models using distance difer- ence tests. Bioinformatics. 2013;29(21):2722–8. https​://doi.org/10.1093/ Publisher’s Note bioin​forma​tics/btt47​3. Springer Nature remains neutral with regard to jurisdictional claims in pub- 15. Erdmann M. Protein similarity from knot theory: geometric convolu- lished maps and institutional afliations. tion and line weavings. J Comput Biol. 2005;12(6):609–37. https​://doi. org/10.1089/cmb.2005.12.609. 16. Levitt M. Protein folding by restrained energy minimization and molecu- lar-dynamics. J Mol Biol. 1983;170(3):723–64. 17. Røgen P, Fain B. Automatic classifcation of protein structure by using Gauss integrals. Proceed Nat Acad Sci USA. 2003;100(1):119–24. 18. Røgen P. Evaluating protein structure descriptors and tuning Gauss inte- gral based descriptors. J Phy-Condens Matter. 2005;17(18):1523–38. https​ ://doi.org/10.1088/0953-8984/17/18/010. 19. Røgen P, Sinclair R. Computing a new family of shape descriptors for protein structures. J Chem Inform Comput Sci. 2003;43(6):1740–7. https​:// Ready to submit your research ? Choose BMC and benefit from: doi.org/10.1021/ci034​095a. • fast, convenient online submission 20. Penner RC, Knudsen M, Wiuf C, Andersen JE. Fatgraph models of pro- teins. Communicat Pure Appl Mathe. 2010;63(10):1249–97. https​://doi. • thorough peer review by experienced researchers in your field org/10.1002/cpa.20340​. • rapid publication on acceptance 21. Penner RC, Knudsen M, Wiuf C, Andersen JE. An algebro-topological • support for research data, including large and complex data types description of protein domain structure. PLoS ONE. 2011;6(5):19670. https​ ://doi.org/10.1371/journ​al.pone.00196​70. • gold Open Access which fosters wider collaboration and increased citations 22. Krishnamoorthy B, Provan S, Tropsha A. A topological characterization of • maximum visibility for your research: over 100M website views per year protein structure. Springer Optimizat Applicat. 2007;7:431–54. https​://doi. org/10.1007/978-0-387-69319​-4-22. At BMC, research is always in progress. 23. Budowski-Tal I, Nov Y, Kolodny R. FragBag, an accurate representation of protein structure, retrieves structural neighbors from the entire PDB Learn more biomedcentral.com/submissions