Structural Alignment and Classification

• The importance and the need
• How to compare protein structures?
• Review of several approaches to structural alignment
• Classification methods

Eran Eyal, March 2011

Why compare protein structures?

• Structures are more conserved than sequences.
• Two homologous proteins have the same overall structure.
• Two proteins without detectable sequence similarity may still have the same structure.

[Figure labels: similar sequences; similar structures; same function]

• In the twilight zone of sequence similarity, structural alignment can help to determine correctly the relationship between 2 proteins.
• Structural similarity is therefore a more sensitive method than sequence similarity for determining protein function.
• Similarity in space can point to functional regions that might not be detected from sequence.

1   2   3   4   5   6   7   8   9   10  11  12  13  14
PHE ASP ILE CYS ARG LEU PRO GLY SER ALA GLU ALA VAL CYS
PHE ASN VAL CYS ARG THR PRO --- --- --- GLU ALA ILE CYS
PHE ASN VAL CYS ARG --- --- --- THR PRO GLU ALA ILE CYS

Different sequences

Form similar local structural pattern

Fig. 4. Substrate assisted catalysis (SAC) employed to investigate GTPase mechanism. GTP hydrolysis in G proteins is dependent on Glncat and Argcat for optimal activity. When Glncat is mutated to almost any other residue (marked by X), catalytic activity is impaired (left branch). Using the modified GTP analogs places an aromatic amine (shown in red) correctly into the mutated active site and restores catalytic activity (right branch). The aromatic amine (or alternatively a hydroxyl group, not shown) functionally replaces Glncat. The figure shows engineered SAC employed in heterotrimeric G proteins, in which Glncat and Argcat come in cis from the same protein. When using engineered SAC to investigate the intrinsic GTPase in Ras (10), Ras-GAP is inactive and hence Argcat is not present. Abbreviation: GAP, GTPase-activating protein.

Kuttner et al., 2003

Kosloff & Selinger (2001). TIBS 26, 161-6

Structural alignment is used to better understand molecular processes

[Figure labels: Ground state; Transition state]

Fig. 3. Superimposition of six different G-protein structures complexed with the transition state analog GDP–aluminum fluoride and with relevant GTPase-activating proteins (GAPs). All G proteins are complexed with GDP and aluminum fluoride, placing the proteins in a conformation mimicking the transition state for GTP hydrolysis. Broken lines show the position of the covalent bonds broken and formed during the reaction. The positioning of the catalytic functional groups of the conserved Glncat and Argcat in these six structures is highly similar, even though the proteins in question might differ in overall sequence and structure. This leads to the conclusion that the catalytic mechanism in these G proteins proceeds via a similar transition state and hence by a similar catalytic mechanism. Backbone models for the relevant switch regions of six G proteins and of the finger loops of the relevant GAPs are shown as backbone ribbon diagrams. The crystal structures are of the following proteins complexed with aluminum fluoride [Protein Data Bank (PDB) accession numbers are in square brackets]. Small G proteins with GAPs: Ras (blue) complexed with Ras-GAP (light yellow) [1WQ1]; Rho (green) complexed with P50RhoGAP (yellow) [1TX4]; and CDC42 (cyan) complexed with CDC42GAP (orange) [1GRN]. Heterotrimeric G proteins: transducin (magenta) [1TAD]; Gi (dark red) [1GFI]; and Gi (red) complexed with RGS4 (not shown) [1AGR]. The functional groups of the catalytic glutamine and arginine, the magnesium atoms and the nucleophilic water are drawn as ball and stick models. The GDP and aluminum fluoride are drawn as stick models. This figure was prepared using the Insight II software package.

Evaluation of structural models

[Figure labels: Resolution 2.0 Å; Resolution 2.5 Å]

Superposition vs. structural alignment

There are two types of problems related to structural comparison:
• Superposition problem
• Structural alignment problem

In the superposition problem we know in advance the correspondence between the points in the two structures we want to align.

The superposition problem has a closed-form solution

The problem is a least-squares (RMSD) problem. Given two sets of points of the same size, with known correspondence between the points in the two sets, we have to find the transformation (rotation) matrix and translation vector that minimize the sum of squared distances between the corresponding points.

KABSCH Algorithm

$$\mathrm{MSD} = \frac{1}{n}\sum_{i=1}^{n}\left| R \cdot P_i + A - Q_i \right|^2$$

P, Q: coordinates of the two structures

The problem is to find R_0, A_0 such that:

$$\mathrm{MSD}_{\min} = \frac{1}{n}\sum_{i=1}^{n}\left| R_0 \cdot P_i + A_0 - Q_i \right|^2$$

There are several approaches to solving this problem. These approaches involve calculation of the inverse of the covariance matrix in the problem.
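The minimization of the MSD expression above can be sketched with NumPy, using an SVD of the covariance matrix of the centered coordinates. This is a minimal illustration, not code from the lecture: the function name `kabsch_superpose`, the example coordinates and the reflection check are assumptions added here.

```python
import numpy as np

def kabsch_superpose(P, Q):
    """Find R, A minimizing (1/n) * sum |R @ P_i + A - Q_i|^2.

    P, Q: (n, 3) arrays of corresponding points (hypothetical inputs).
    Returns the rotation R, the translation A and the resulting RMSD.
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    p_mean, q_mean = P.mean(axis=0), Q.mean(axis=0)
    Pc, Qc = P - p_mean, Q - q_mean      # remove the translation first
    H = Pc.T @ Qc                        # 3x3 covariance matrix
    U, _, Vt = np.linalg.svd(H)
    # Guard against an improper rotation (mirror image).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    A = q_mean - R @ p_mean
    diff = P @ R.T + A - Q
    rmsd = float(np.sqrt((diff ** 2).sum() / len(P)))
    return R, A, rmsd

# Example: Q is P rotated by 30 degrees about z and then shifted.
P = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [1.5, 1.5, 0.0], [0.0, 1.5, 1.5]])
angle = np.radians(30.0)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
shift = np.array([1.0, -2.0, 0.5])
Q = P @ R_true.T + shift
R, A, rmsd = kabsch_superpose(P, Q)
```

Because the correspondence is known, the example recovers the rotation and translation used to build Q, and the residual RMSD is numerically zero.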

One possible way to solve this problem is by SVD (Singular Value Decomposition).

The structural alignment problem

In the general structural alignment problem we have to compare different proteins having different lengths and potentially very different sequences. We therefore do not know the correspondence between the residues in advance.

In general there are 2 strategies:
• To compare the 2 proteins directly
• To examine structural features of each protein separately

Which properties of a protein might be used to detect structural similarity to other proteins?

•sequence

•Type and number of secondary structures (sheets, helices)

•Structural arrangement of secondary structures

•Structural attributes of individual amino acids

•Distances between amino acids in the protein

•Solvent accessible surface (SAS)

Backbone dihedral angles might be used for structural alignment

The usual way to quantify similarity between molecules is to measure the overall deviation of the atoms: RMSD

RMSD = root mean square deviation

$$\mathrm{RMSD} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left[(X_{i1}-X_{i2})^2+(Y_{i1}-Y_{i2})^2+(Z_{i1}-Z_{i2})^2\right]}$$

This index amplifies large deviations in local regions.

Examples (one atom out of four displaced, the other three superimposed exactly):

Displacement   Mean distance   RMSD
0.5 Å          0.125 Å         0.25 Å
1 Å            0.25 Å          0.5 Å
2 Å            0.5 Å           1.0 Å
4 Å            1.0 Å           2.0 Å
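The difference between the mean distance and the RMSD can be checked numerically. The following is a small sketch in plain Python; the function name and the four-atom example (three atoms superimposed exactly, one displaced by 2 Å) are assumptions chosen for illustration.

```python
import math

def mean_distance_and_rmsd(coords_a, coords_b):
    """Mean distance and RMSD between two equally long lists of (x, y, z) points."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    dists = [math.dist(p, q) for p, q in zip(coords_a, coords_b)]
    mean = sum(dists) / len(dists)
    rmsd = math.sqrt(sum(d * d for d in dists) / len(dists))
    return mean, rmsd

# Hypothetical example: only the last of four atoms moves, by 2 Å along z.
atoms_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (0.0, 3.8, 0.0)]
atoms_b = atoms_a[:3] + [(0.0, 3.8, 2.0)]
mean, rmsd = mean_distance_and_rmsd(atoms_a, atoms_b)
```

Here the mean distance is 0.5 Å while the RMSD is 1.0 Å: squaring the distances before averaging gives the single large deviation more weight.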

Measuring RMSD is not trivial and requires some subjective decisions

Usually we would like the RMSD calculation to include only a subset of the aligned atoms. The problem is to define this subset.

It is also clear that the RMSD depends on the size of the atom subset. It is more correct to output both the RMSD and the total number of atoms involved in its calculation.

In some structural alignment algorithms the overall alignment score is calculated according to both numbers.

Small RMSD + many atoms = good structural alignment

Z-score

Let S be the score given to some structural alignment. We can try to normalize the scores by expressing them in units of standard deviation.

$$Z = \frac{S - \bar{S}}{\sigma}$$

If we assume a normal distribution of the original scores, then a Z-score of more than 3 suggests that the 2 aligned proteins are significantly structurally similar.

SSAP
http://www.biochem.ucl.ac.uk/cgi-bin/cath/GetSsapRasmol.pl

A program to compare protein structures based on local structural environments.

Uses the dynamic programming technique usually applied to sequence alignment to compare the structural environments of amino acids. In contrast to the original application (sequence alignment), where the scores are taken from a substitution matrix that considers residue types, here the score is the spatial deviation of the surrounding amino acids.

The basic SSAP algorithm, which relies on the sequential order of the amino acids, cannot deal with 2 structures having different topologies.

[Figure: ferritin and cytochrome c′]

The method includes 2 steps of dynamic programming: an initial step to obtain the score between each pair of amino acids, and a second step in which the best overall alignment of the proteins is determined.
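The second step, finding the best path through a matrix of residue-pair scores, can be sketched as an ordinary global dynamic programming alignment. This is not SSAP itself: the score matrix is taken as given, and the function name `align_by_dp` and the linear gap penalty are assumptions made for the example.

```python
def align_by_dp(score, gap=1.0):
    """Best global path through a residue-pair score matrix.

    score[i][j]: similarity of residue i of protein A and residue j of protein B.
    gap: hypothetical linear penalty for leaving a residue unaligned.
    Returns (total score, list of aligned (i, j) index pairs).
    """
    n, m = len(score), len(score[0])
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = best[i - 1][0] - gap
    for j in range(1, m + 1):
        best[0][j] = best[0][j - 1] - gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(
                best[i - 1][j - 1] + score[i - 1][j - 1],  # align i with j
                best[i - 1][j] - gap,                      # skip residue of A
                best[i][j - 1] - gap,                      # skip residue of B
            )
    # Trace back from the corner to recover the aligned pairs.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if best[i][j] == best[i - 1][j - 1] + score[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif best[i][j] == best[i - 1][j] - gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best[n][m], pairs

# Toy scores: residue 1 of A has no good partner in B.
toy_scores = [[5, 0, 0], [0, 0, 0], [0, 5, 0], [0, 0, 5]]
total, pairs = align_by_dp(toy_scores, gap=1.0)
```

In the toy example the path aligns residues 0, 2 and 3 of A with residues 0, 1 and 2 of B and pays one gap penalty for the unmatched residue.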

The score used to compare differences between 2 vectors is:

Sab = a / (b + δ)

δ is the distance between the 2 vectors, and a,b are empirical parameters that should be optimized.
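As a numeric illustration, the pair score and the logarithmic normalization quoted in these slides can be written as two small functions. The function names are hypothetical; the defaults a = 500 and b = 10 are the example values mentioned in the slides, not recommended settings.

```python
import math

def ssap_pair_score(delta, a=500.0, b=10.0):
    """Score for two vectors that differ by the distance delta: a / (b + delta)."""
    return a / (b + delta)

def normalized_ssap(raw_score, a=500.0, b=10.0):
    """Logarithmic rescaling so that the maximal pair score a / b maps to 100."""
    return math.log(raw_score) * 100.0 / math.log(a / b)

identical = ssap_pair_score(0.0)     # identical vectors give the maximum, a / b
shifted = ssap_pair_score(10.0)      # a deviation equal to b halves the score
top = normalized_ssap(identical)
```

With these parameters identical vectors score 50, and the normalized value of that maximum is 100.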

Normalizing scores to compare different proteins

$$S_{ssap} = \frac{\sum_{i,j}\sum_{i',j'} s_{i\to j,\; i'\to j'}}{\mathrm{maxequis}\cdot(\mathrm{maxequis}-11)}$$

If we work with parameters a = 500, b = 10, then the maximal value of Sab is 50.

$$S'_{ssap} = \frac{\ln(S_{ssap})\cdot 100}{\ln(50)}$$

The final score S′ssap is independent of the length of the proteins and has a maximum value of 100.

CE
http://cl.sdsc.edu/ce.html

A program to align structures based on combinatorial extension of an alignment path.

The alignment path is composed of aligned fragment pairs (AFPs). Each AFP is composed of 2 fragments, one from each of the proteins being aligned.

The structural similarity is determined by a set of local distances between residues (Cα atoms) within the fragments. The size (m) of the AFP is fixed (usually 8).

The alignment between proteins A and B is the longest continuous path P of AFPs of size m. Two consecutive AFPs in the alignment path must satisfy one of the following conditions:

[Conditions given as formulas on the original slide]

The algorithm uses different distance-based scoring schemes in different stages:
• AFP-AFP
• AFP-PATH

http://cl.sdsc.edu/ce/all-to-all/all-to-all.html?1HHO

SARF2

http://123d.ncifcrf.gov/sarf2.html
http://www3.ebi.ac.uk/tops/
http://carten.gmd.de/ToPign.html (TOP)

An algorithm to find structural similarity based on comparison of secondary structures.

As such, it can be used to compare proteins only, and only proteins with a minimal content of defined secondary structures.

[Figure legend: strand = triangle; helix = circle]

The idea is to represent secondary structures as vectors and to identify pairs of vectors in similar orientation in the two proteins. Every secondary structure element (SSE) is represented by a vector.

The identification of secondary structures is done in an unbiased way by sliding short (5 aa) classical secondary structure elements along the sequence and superimposing them on the backbone conformation of the proteins to be aligned.

A single SSE does not give any information about the structure of the protein. Two or more SSEs are therefore required.

• SARF compares pairs of secondary structures in different proteins
• Pairs of SSEs having the same orientation are marked and saved
• The orientation between 2 SSEs is defined by a set of 5 parameters

[Figure: the 5 orientation parameters of SSEs i and j: Γij, Dmin i, Dmin j, Dmax i, Dmax j]

The algorithm tries to extend the alignment by union of vector pairs. This stage is done by graph theory algorithms (maximum clique algorithm). To avoid combinatorial explosion, only SSEs close to all other SSEs already in the SSE ensemble are considered.

$$S = \frac{N^3}{1+\mathrm{RMSD}}$$

Based on the S scores, a normalized Z-score is calculated.

Alexandrov & Fischer, 1996

In the final step, the vector solution serves as the starting point to solve the Cα correspondence problem. The exact Cα correspondence is determined by dynamic programming. At this stage the algorithm tries to extend the alignment to other regions of the backbone.

Dali
http://www.ebi.ac.uk/dali/index.html

• Used for FSSP classification
• Uses distance matrices for protein representation
• Can compare different types of molecules
• Deals successfully with gaps and different topology

Distance matrix

Holds the complete information for structural reconstruction, up to the mirror structure.

[Figure: distance matrices; both axes show the amino acid index, 1 to N]

DALI: search for a common 3D pattern of Cα distance maps

The algorithm splits the matrix into (partially overlapping) sub-matrices of size 6. Similar sub-matrices are then searched for in the two molecules.

Following detection of similarity, the algorithm tries to extend the similarity by finding additional overlapping sub-matrices.

[Figure: pairwise 3D alignment of a 3-helix bundle. Source: Mount, Figure 9.15]
1. Sub-matrix: helices a, b
2. Sub-matrix: helices b, c
3. Matrix: helices a, b, c (1 side)
4. Removal of gaps & sequence rearrangement (MC)
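The distance-matrix representation can be illustrated with a few lines of plain Python. This is only a sketch of the representation, not of the DALI program: the function names, the idealized helix coordinates and the mean absolute difference used to compare two sub-matrices are assumptions made for the example.

```python
import math

def distance_matrix(coords):
    """All pairwise distances between a list of (x, y, z) points."""
    return [[math.dist(p, q) for q in coords] for p in coords]

def sub_matrix(dmat, i, j, size=6):
    """Block of the distance matrix: rows i..i+size-1, columns j..j+size-1."""
    return [row[j:j + size] for row in dmat[i:i + size]]

def sub_matrix_difference(block_a, block_b):
    """Mean absolute difference of two equally sized blocks (assumed measure)."""
    diffs = [abs(x - y) for ra, rb in zip(block_a, block_b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)

# Idealized helix-like trace (hypothetical geometry) and its mirror image.
helix = [(2.3 * math.cos(math.radians(100 * k)),
          2.3 * math.sin(math.radians(100 * k)),
          1.5 * k) for k in range(14)]
mirror = [(-x, y, z) for x, y, z in helix]

d_helix = distance_matrix(helix)
d_mirror = distance_matrix(mirror)
max_diff = max(abs(a - b) for ra, rb in zip(d_helix, d_mirror) for a, b in zip(ra, rb))
block_difference = sub_matrix_difference(sub_matrix(d_helix, 0, 6),
                                         sub_matrix(d_helix, 1, 7))
```

The mirror image has exactly the same distance matrix, which is why the matrix determines the structure only up to reflection. Two blocks taken at the same offset along the regular helix are also numerically identical, the kind of repeated pattern a sub-matrix search can pick up.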

Nussinov and Wolfson, 1991

http://pc-gamba.math.tau.ac.il/

•Geometric hashing
•The program tries to find overlapping triangles with many additional overlapping points
•Very efficient. Applicable to large database scanning
•Nothing is assumed regarding the molecules (size, composition, secondary structures)

Haim Wolfson's lab (TAU): http://bioinfo3d.cs.tau.ac.il

•The molecule is represented by a collection of points in space
•The hash table stores the coordinates of all points in the molecule, in the reference frame of each triangle
•The algorithm scans the triangles in each target molecule. For each triangle found in both the target molecule and the database, the algorithm checks the position of the other atoms

http://bioinfo3d.cs.tau.ac.il/c_alpha_match

[Figure: example match, 122/141, Rmsd=1.34]
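The hashing and voting steps listed above can be sketched for a tiny point set. This is a simplified illustration, not the published program: it enumerates every ordered triangle by brute force, and the function names, the bin size of 0.5 and the six model points are assumptions.

```python
import itertools
import math
from collections import defaultdict

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0])

def _unit(v):
    n = math.sqrt(_dot(v, v))
    return (v[0] / n, v[1] / n, v[2] / n)

def local_keys(points, basis, bin_size):
    """Quantized coordinates of the non-basis points in the triangle's own frame."""
    p0, p1, p2 = (points[k] for k in basis)
    u = _sub(p1, p0)
    w = _cross(u, _sub(p2, p0))
    if _dot(w, w) < 1e-12:               # collinear points define no frame
        return None
    e1, e3 = _unit(u), _unit(w)
    e2 = _cross(e3, e1)
    return [tuple(round(_dot(_sub(p, p0), e) / bin_size) for e in (e1, e2, e3))
            for k, p in enumerate(points) if k not in basis]

def build_hash_table(model, bin_size=0.5):
    """Map each quantized coordinate to the model triangles that produce it."""
    table = defaultdict(set)
    for basis in itertools.permutations(range(len(model)), 3):
        for key in local_keys(model, basis, bin_size) or []:
            table[key].add(basis)
    return table

def best_match(table, target, bin_size=0.5):
    """Return (votes, model triangle, target triangle) with the most agreeing points."""
    best = (0, None, None)
    for basis in itertools.permutations(range(len(target)), 3):
        votes = defaultdict(int)
        for key in local_keys(target, basis, bin_size) or []:
            for model_basis in table.get(key, ()):
                votes[model_basis] += 1
        for model_basis, count in votes.items():
            if count > best[0]:
                best = (count, model_basis, basis)
    return best

# Hypothetical model; the target is the same point set rotated about z and shifted.
model = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (5.1, 3.6, 0.4),
         (2.2, 6.3, 2.9), (-1.4, 4.7, 5.5), (0.9, 1.2, 7.3)]
c, s = math.cos(math.radians(40)), math.sin(math.radians(40))
target = [(c * x - s * y + 4.0, s * x + c * y - 1.0, z + 2.5) for x, y, z in model]
votes, model_basis, target_basis = best_match(build_hash_table(model), target)
```

Coordinates expressed in a triangle's own frame do not change under rotation and translation, so a matching triangle pair makes all remaining points of the moved copy fall into the same bins. A practical implementation would restrict the triangles considered and tolerate points that land near bin boundaries.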

Experimental Results

Modified with permission after Maxim Shatsky

FlexProt Algorithm

• Input: two protein molecules A and B, each being represented by the sequence of the 3-D coordinates of its Cα atoms.
• Task: largest flexible alignment by decomposing the two molecules into a minimal number of rigid fragment pairs having similar 3-D structure.

Modified with permission after Maxim Shatsky

[Figure: fraction of TP versus fraction of FP for STRUCTAL, CE, LSQMAN, SSAP, DALI, SSM and "Dream Team"]

Kolodny et al., 2004

Classification databases

• Classification of protein structures is very useful for understanding the relation between structural elements and function.
• The function of new proteins can be deduced from their structural classification in the absence of significant sequence similarity.
• Structural alignment and classification give insight into distinct evolutionary relationships between proteins as well as into convergent evolution.

SCOP (Structural Classification Of Proteins)
http://scop.mrc-lmb.cam.ac.uk/scop/

Classifies proteins by their structures. The classification is mainly manual.

4 hierarchies:
•class
•fold
•superfamily
•family

Hierarchy levels in SCOP

Class
Relative amounts of secondary structures: α, β, β/α, β+α.

Fold
Rough separation according to the tertiary structure and the organization of the secondary structures.
The way secondary structures are connected is important.

Superfamily
High structural similarity
Similar function
Likely common ancestor

Family
Sequence similarity
Very high structural similarity
Very similar function

Protein architecture

[Figure panels: All β; All α; α/β; α+β]

CATH: a hierarchic classification of structures
http://www.biochem.ucl.ac.uk/bsm/cath_new/index.html

The classification is done in a semi-automatic manner using SSAP and standard tools for sequence alignment.

http://www.cathdb.info/

Hierarchy levels in CATH

Class (C)
Relative amounts of secondary structures: α, β, β-α. Different types of β-α are separated in lower levels.

Architecture (A)
Rough separation according to the tertiary structure and the organization of the secondary structures.
The way secondary structures are connected is not important.

[Figure labels: The same architecture; Not the same topology]

Topology (T)
Same overall structure
Same number and organization of secondary structures
Same connection between secondary structures

Homologous superfamily (H)
High structural similarity
Similar function
Likely common ancestor

Family
Sequence similarity
Very high structural similarity
Very similar function

FSSP (Fold classification based on Structure-Structure alignment of Proteins)
http://www.cmbi.kun.nl/swift/fssp/

•Fully automated
•Successfully reconstructs CATH and SCOP classifications
•Uses the DALI algorithm for structural alignment
•Scores for each pair of structures are converted to Z-scores
•All normalized scores are then clustered

• The database currently contains an extended structural family for each of 330 representative protein chains.
• Highly similar structures are excluded

Pair scores:
A-B : 5
A-C : 10
A-D : 4
B-C : 4
B-D : 9
C-D : 6

cluster1: A, C
cluster2: B, D
cluster1-cluster2 : 4

[Figure labels: cluster i; clusters of clusters]
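The toy example above can be reproduced with a short agglomerative clustering sketch. Scoring two clusters by the lowest score between their members is an assumption inferred from the numbers shown (the cluster1-cluster2 score of 4), not a statement about how FSSP is actually implemented; the function name is hypothetical.

```python
def cluster_scores(scores):
    """Repeatedly merge the two clusters with the highest mutual score.

    scores: dict mapping frozenset({x, y}) to a pairwise similarity score.
    The score between two clusters is taken as the minimum over member pairs
    (an assumption that matches the example values).
    Returns the merges as (members of cluster 1, members of cluster 2, score).
    """
    clusters = [frozenset([x]) for x in sorted({x for pair in scores for x in pair})]

    def link(c1, c2):
        return min(scores[frozenset((x, y))] for x in c1 for y in c2)

    merges = []
    while len(clusters) > 1:
        candidates = [(link(c1, c2), i, j)
                      for i, c1 in enumerate(clusters)
                      for j, c2 in enumerate(clusters) if i < j]
        best, i, j = max(candidates, key=lambda t: t[0])
        merges.append((sorted(clusters[i]), sorted(clusters[j]), best))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

pair_scores = {
    frozenset("AB"): 5, frozenset("AC"): 10, frozenset("AD"): 4,
    frozenset("BC"): 4, frozenset("BD"): 9, frozenset("CD"): 6,
}
merges = cluster_scores(pair_scores)
```

The merges come out in the order A with C (score 10), B with D (score 9), and finally the two clusters with each other (score 4).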