Structural Motif Finding
• Automated discovery of 3D motifs for protein function annotation (Polacco and Babbitt, Bioinformatics, January 2006)
• Overall comparison of protein structures may not identify similarities among the functionally significant amino acids or atoms involved in the mechanism of a protein's function.
  • e.g., amino acids on the surface of the protein vs. buried amino acids

What are motifs?
• Wikipedia: a structural motif is a three-dimensional structural element or fold within a protein, which also appears in a variety of other proteins.
• In proteins, structural motifs usually consist of just a few elements, e.g. the 'helix-turn-helix' has just three.

Motif examples
Helix-turn-helix: The DNA-binding domain of the bacterial gene regulatory protein lambda repressor, with the two helix-turn-helix motifs shown in color. The two helices closest to the DNA are the reading or recognition helices, which bind in the major groove and recognize specific gene regulatory sequences in the DNA. (PDB 1lmb)

Motif examples
Zinc finger motif: Two beta strands with an alpha helix end folded over to bind a zinc ion. This motif is seen in transcription factors. A fragment derived from a mouse gene regulatory protein is shown, with three zinc fingers bound spirally in the major groove of a DNA molecule. The inset shows the coordination of a zinc atom by characteristically spaced cysteine and histidine residues in a single zinc finger motif. (PDB 1aay)

Motif examples
Four-helix bundle motif: The four-helix bundle motif can comprise an entire protein domain, and occurs in proteins with many different biochemical functions. Shown here is human growth hormone, a signaling molecule.

Motifs
• The term motif is used in two different ways in structural biology. The first refers to a particular amino-acid sequence that is characteristic of a specific biochemical function.
  • example: CXX(XX)CXXXXXXXXXXXXHXXXH
• Sequence motifs can be recognized by inspecting the amino-acid sequence. Databases of such motifs exist:
  • e.g., PROSITE (http://www.expasy.ch/prosite/)

Motifs
• The second use of the term motif refers to a set of contiguous secondary structure elements that have a particular functional significance.
  • e.g. helix-turn-helix, Greek-key motif
• Usually, sequence motifs are more indicative of a certain function, because a shared structural motif does not always imply similar function. However, detecting functional motifs from sequence alone is difficult because the spacing and ordering of the functional residues can vary.

Motif databases
• Which configuration is functionally important?
  • Determined by experts; therefore, accumulation of known motifs is slow
• The Catalytic Site Atlas (CSA) provides 147 non-redundant active site motifs for enzymes.
• Automated methods of identifying repeated patterns in protein structures generate more motifs (450 non-redundant ligand-binding patterns in the PINTS database). However, they may not provide specific functional information.

GASPS
• Polacco and Babbitt, 2006
• Genetic Algorithm Search for Patterns in Structures
• GASPS goals:
  • for a group of proteins, GASPS should find the motif most useful for identifying the group
  • GASPS should rely on known functional residues as little as possible

GASPS
• In this study a motif is a small set of residues (<10) taken from a single chain.
• Each residue is modeled with two points: the alpha carbon and the geometric center of the side chain.
• They use an earlier method named SPASM (Kleywegt, 1999) to find matches.
• They first find exact sequence matches and then compute an RMSD between the matches.

GASPS
• Only the match with the best RMSD is considered from each structure.
• They identify motifs from a set of related proteins with this method and then test the discriminative power of the discovered motif with leave-one-out cross-validation experiments.
• Five groups of proteins with different functions are used. A motif is extracted from each group and then used to try to predict the function of a new protein.

Another technique
• The fragment transformation method to detect protein structural motifs (Lu et al., Proteins, 2006)
• Motivation: motifs with conserved residues and constrained geometry are easy to detect. However, it is challenging to detect general structural motifs like the ββα-metal binding motif, which have variable conformation and sequence. Such motifs are currently identified by manual procedures using sequence and structure analysis.

Overview
• A structural alignment algorithm that combines both structural and sequence information to identify local structural motifs.
• The method is tested on detecting the ββα-metal binding motif and the treble clef motif.

ββα-metal binding motif
[Figure]

The method
• Let S be the structural motif of length n, and T be the target protein of length m. S and T are represented by their Cα coordinates, (x_1, x_2, ..., x_n) and (y_1, y_2, ..., y_m) respectively.
• The basic unit of the structural alignment is a triplet composed of three consecutive Cα atoms. The structures S and T can be expressed in terms of these triplets, that is,
  S = {σ_1, σ_2, ..., σ_{n-2}} and T = {τ_1, τ_2, ..., τ_{m-2}}
  σ_i = (x_i, x_{i+1}, x_{i+2}) and τ_i = (y_i, y_{i+1}, y_{i+2})

The method
• An (n-2) x (m-2) matrix between the σ and τ triplets can be constructed, where the element M_{i,j} is a rigid body transformation matrix from σ_i to τ_j, that is,
  M_{i,j} σ_i = τ_j

Example transformation
[Figure]

Triplet clustering
• Define the distance between two matched pairs of triplets as the Cartesian distance between the triplets of one pair after the other pair's transformation has been applied.
• Formally:
  • D^{ij}_{kl} = cartesian_distance(M_{i,j} σ_k, τ_l)
  • The smaller the distance, the more similar the transformations M_{i,j} and M_{k,l} are.
  • In other words, the orientations between the triplet pairs (σ_i, τ_j) and (σ_k, τ_l) are more similar.

Triplet clustering
• They cluster the triplet pairs with a very simple algorithm, single-linkage clustering.
• For two triplet pairs (σ_i, τ_j) and (σ_k, τ_l), where i ≠ k and j ≠ l, if the distance D^{ij}_{kl} is smaller than D_0 (a constant threshold, 3 Å in their experiments), then put them in the same cluster.
• If any two elements in two different clusters satisfy the same criterion, then merge the clusters.

Triplet clustering
• After clustering, each cluster represents a structural alignment between the source motif S and the target protein T.
• Each cluster is given a score by a scoring function that combines 3 ranked scores: RMSD, sequence alignment score, and secondary structure alignment score.

Results: ββα-metal binding motif
[Figure]

Results: treble clef finger motif
[Figure]
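
The triplet construction, the distance D^{ij}_{kl}, and the single-linkage step from the fragment transformation slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation of Lu et al.: the rigid body transformation is built here by aligning a local coordinate frame attached to each triplet (the paper may instead use a least-squares superposition), the "Cartesian distance" between two triplets is taken to be the root-mean-square distance over their three atoms, and all function names (`triplets`, `transform`, `triplet_distance`, `cluster_triplet_pairs`) are hypothetical.

```python
import math
from itertools import combinations

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def unit(a):
    n = math.sqrt(dot(a, a))  # zero for collinear triplets, which this sketch does not handle
    return (a[0] / n, a[1] / n, a[2] / n)

def triplets(coords):
    """Split a C-alpha trace into overlapping triplets of consecutive atoms."""
    return [tuple(coords[i:i + 3]) for i in range(len(coords) - 2)]

def frame(triplet):
    """Right-handed orthonormal axes and origin attached to one triplet (assumed construction)."""
    p0, p1, p2 = triplet
    e1 = unit(sub(p1, p0))
    e3 = unit(cross(e1, sub(p2, p0)))
    e2 = cross(e3, e1)
    return (e1, e2, e3), p0

def transform(sigma, tau):
    """Return M as a function: the rigid motion carrying sigma's frame onto tau's frame."""
    (a1, a2, a3), o_s = frame(sigma)
    (b1, b2, b3), o_t = frame(tau)

    def apply(p):
        d = sub(p, o_s)
        c = (dot(d, a1), dot(d, a2), dot(d, a3))  # coordinates of p in sigma's frame
        return tuple(o_t[k] + c[0] * b1[k] + c[1] * b2[k] + c[2] * b3[k]
                     for k in range(3))

    return apply

def triplet_distance(M, sigma_k, tau_l):
    """D^{ij}_{kl}: RMS distance between M(sigma_k) and tau_l (assumed definition)."""
    sq = 0.0
    for p, q in zip(sigma_k, tau_l):
        d = sub(M(p), q)
        sq += dot(d, d)
    return math.sqrt(sq / 3.0)

def cluster_triplet_pairs(S, T, d0=3.0):
    """Single-linkage clustering of all (i, j) triplet pairs, largest cluster first."""
    sig, tau = triplets(S), triplets(T)
    pairs = [(i, j) for i in range(len(sig)) for j in range(len(tau))]
    M = {(i, j): transform(sig[i], tau[j]) for i, j in pairs}
    parent = {p: p for p in pairs}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    # Link two pairs when i != k, j != l and D^{ij}_{kl} < d0; union-find merges clusters.
    for (i, j), (k, l) in combinations(pairs, 2):
        if i != k and j != l and triplet_distance(M[(i, j)], sig[k], tau[l]) < d0:
            parent[find((i, j))] = find((k, l))

    clusters = {}
    for p in pairs:
        clusters.setdefault(find(p), []).append(p)
    return sorted(clusters.values(), key=len, reverse=True)
```

Calling `cluster_triplet_pairs(S, T)` with two lists of (x, y, z) tuples returns lists of (i, j) index pairs; each list plays the role of one candidate alignment between S and T, which the method would then score. The sketch compares every pair of triplet pairs, so it is only practical for short motifs and small targets.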
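
Returning to the sequence motif example from the Motifs slide, CXX(XX)CXXXXXXXXXXXXHXXXH, a pattern written this way can be searched for with an ordinary regular expression. The sketch below assumes a reading of the notation that the slides do not spell out: `X` stands for any residue, a parenthesized stretch is optional, and every other letter is a literal one-letter amino-acid code. The helper names `motif_to_regex` and `find_motif` are hypothetical, and this is not the PROSITE pattern syntax.

```python
import re

def motif_to_regex(motif: str) -> re.Pattern:
    """Translate a slide-style motif string into a compiled regular expression.

    Assumed conventions: 'X' = any residue, '(...)' = optional stretch,
    any other character = literal residue code.
    """
    parts = []
    for ch in motif:
        if ch == "X":
            parts.append(".")
        elif ch == "(":
            parts.append("(?:")   # open a non-capturing group
        elif ch == ")":
            parts.append(")?")    # close it and mark it optional
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))

def find_motif(motif: str, sequence: str):
    """Return (start, matched_text) for each non-overlapping match in the sequence."""
    return [(m.start(), m.group()) for m in motif_to_regex(motif).finditer(sequence)]
```

For example, `find_motif("CXX(XX)CXXXXXXXXXXXXHXXXH", seq)` lists the positions in `seq` where two cysteines separated by two or four residues are followed, twelve residues later, by two histidines three residues apart. As the slides note, fixed patterns like this miss functional sites whose residue spacing or ordering varies.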