
Secondary structure assignment and prediction

Talk overview

• Secondary structure assignment

• Why to predict secondary structures in proteins

• Methods to predict secondary structures in proteins

• Machine learning approaches

• Detailed description of several specific programs (PHD)

• Performance and evaluation

Eran Eyal, May 2011

Automatic assignment of secondary structures to a set of coordinates

Assignment of secondary structures to known three-dimensional structures is a relatively simple bioinformatics task.

Given exact definitions for secondary structures, all we need to do is check which parts of the structure fall within each definition.

Why assign secondary structures automatically and routinely?

•Standardization

•Easy visualization

•Detection of structural motifs and improved sequence-structure searches

•Structural alignment

•Structural classification

What basic structural information is used?

•Hydrogen bond patterns (θ > 120° and r_HO < 2.5 Å)

•Backbone dihedral angles

DSSP algorithm

•The so-called "Dictionary of Secondary Structure of Proteins" (DSSP) by Kabsch and Sander makes its sheet and helix assignments solely on the basis of backbone-backbone hydrogen bonds.

•The DSSP method defines a hydrogen bond when the bond energy is below -0.5 kcal/mol according to a Coulomb approximation of the hydrogen-bond energy.

•The structural assignments are defined such that visually appealing and unbroken structures result.

•The helix definition does not include the terminal residues having the initial and final hydrogen bonds in the helix.

•A minimal-size helix is set to have two consecutive hydrogen bonds, leaving out single helix hydrogen bonds, which are assigned as turns (state 'T').

•Beta-sheet residues (state 'E') are defined as either having two hydrogen bonds in the sheet, or being surrounded by two hydrogen bonds in the sheet.

•The minimal sheet consists of two residues at each partner segment.

•In case of overlaps, alpha-helix is given first priority.
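The Kabsch-Sander electrostatic criterion can be sketched directly from its published form (a minimal sketch; the distances in the usage example are idealized values chosen for illustration, not taken from a real structure):

```python
# Kabsch-Sander Coulomb approximation of the hydrogen-bond energy, as used
# by DSSP: partial charges of +/-0.42e on C=O and +/-0.20e on N-H, and a
# dimensional factor of 332 giving kcal/mol for distances in Angstrom.
Q1, Q2, F = 0.42, 0.20, 332.0

def hbond_energy(r_on, r_ch, r_oh, r_cn):
    """Energy (kcal/mol) between a backbone C=O group and an N-H group."""
    return Q1 * Q2 * F * (1.0 / r_on + 1.0 / r_ch - 1.0 / r_oh - 1.0 / r_cn)

def is_hbond(r_on, r_ch, r_oh, r_cn, cutoff=-0.5):
    """DSSP assigns a hydrogen bond when the energy is below -0.5 kcal/mol."""
    return hbond_energy(r_on, r_ch, r_oh, r_cn) < cutoff

# Idealized geometry (Angstrom) for a typical helix hydrogen bond:
print(round(hbond_energy(2.9, 3.5, 1.9, 3.9), 2))  # -4.24, well below the cutoff
print(is_hbond(2.9, 3.5, 1.9, 3.9))                # True
```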

STRIDE

•The secondary STRuctural IDEntification (STRIDE) method by Frishman and Argos uses an empirically derived hydrogen-bond energy and phi/psi torsion-angle criteria to assign secondary structure.

•Torsion angles are given alpha-helix and beta-sheet propensities according to how close they are to their regions in Ramachandran plots.

•The parameters are optimized to mirror visual assignments made by crystallographers for a set of proteins.

•By construction, the STRIDE assignments agreed better with the expert assignments than DSSP, at least for the data set used to optimize the free parameters.

•Like DSSP, STRIDE assigns the shortest alpha-helix ('H') if it contains at least two consecutive i - i+4 hydrogen bonds.

•In contrast to DSSP, helices are elongated to comprise one or both edge residues if these have acceptable phi/psi angles; similarly, a short helix can be vetoed.

•Hydrogen-bond patterns may be ignored if the phi/psi angles are unfavorable.

•The sheet category does not distinguish between parallel and anti-parallel sheets. The minimal sheet ('E') is composed of two residues.

•The dihedral angles are incorporated into the final sheet assignment criterion as was done for the alpha-helix.

DEFINE

•An algorithm by Richards and Kundrot which assigns secondary structures by matching Cα-coordinates with a linear distance mask of the ideal secondary structures.

•First, strict matches are found, which subsequently are elongated and/or joined, allowing moderate irregularities.

•The algorithm locates the starts and ends of α- and 310-helices, β-sheets, turns and loops. With these classifications the authors are able to assign 90-95% of all residues to at least one of the given secondary structure classes.

Secondary structure prediction
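The distance-mask idea can be illustrated by generating an ideal α-helical Cα trace from textbook helix parameters (rise ≈ 1.5 Å, radius ≈ 2.3 Å, 100° per residue) and reading off the Cα(i)-Cα(i+k) distances a real segment would be matched against. This is a sketch of the principle only; DEFINE uses its own empirically derived masks.

```python
import math

# Textbook alpha-helix parameters (assumed here for illustration).
RISE = 1.5                    # Angstrom per residue along the helix axis
RADIUS = 2.3                  # Angstrom, radius of the C-alpha helix
TWIST = math.radians(100.0)   # rotation per residue

def ideal_helix_ca(n):
    """Generate n C-alpha positions of an ideal alpha-helix."""
    return [(RADIUS * math.cos(i * TWIST),
             RADIUS * math.sin(i * TWIST),
             RISE * i) for i in range(n)]

def distance_mask(coords, max_sep=4):
    """Average Ca(i)-Ca(i+k) distance for k = 1..max_sep over the segment."""
    mask = {}
    for k in range(1, max_sep + 1):
        ds = [math.dist(coords[i], coords[i + k])
              for i in range(len(coords) - k)]
        mask[k] = sum(ds) / len(ds)
    return mask

mask = distance_mask(ideal_helix_ca(10))
# Roughly: d(i,i+1) ~ 3.8, d(i,i+3) ~ 5.0, d(i,i+4) ~ 6.2 Angstrom
for k, d in mask.items():
    print(k, round(d, 2))
```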

•Prediction of tertiary structures based on the sequence is still a very difficult task.

•Prediction of more local structural properties is easier

•Prediction of secondary structures and solvent accessibility (SAS) is important and more feasible

•Prediction of secondary structures is a "bridge" between the linear information and the 3D structure

•Programs in this field often employ different types of machine learning approaches

A-C-H-Y-T-T-E-K-R-G-G-S-G-T-K-K-R-E-A
H-H-H-H-H-H-H-H-O-O-O-O-O-S-S-S-S-S-S

The importance and the need of predicting secondary structures in proteins

The information might give clues concerning the function of the protein and the existence of specific structural motifs.

Intermediate step toward construction of a complete 3D model from the sequence.

Searching for the 3D structure directly: many degrees of freedom, a long search, prone to errors. Searching via the secondary structure: few degrees of freedom, a fast search.

Secondary structure content also allows us to classify a protein into the basic structural classes based on its sequence alone.

Generations in algorithm development

First generation: uses statistics regarding the preferences of individual amino acids. Each amino acid has preferences regarding its appearance in secondary structures; these can be determined by counting amino acids in different secondary structures in known solved structures.

The Chou-Fasman method

The statistics determine the probability of an amino acid being in a particular secondary structure given that it is in the middle of a local sequence segment.

Second generation: the improvements compared to the first generation were the use of better statistics and statistical methods, and looking at a window of adjacent amino acids in the sequence (usually 11-21 amino acids) rather than at individual amino acids. Other segments similar to the given segment might also assist in the prediction; different methods tried to match the segments to segments in the 3D structure database by sequence alignment and other methods.
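A first-generation propensity table can be computed by simple counting, in the spirit of Chou-Fasman (the sequences and state strings below are invented toy data, not real statistics):

```python
from collections import Counter

def propensities(seqs, states, target='H'):
    """P(aa | target state) / P(aa overall), computed from paired
    sequence/state strings -- a first-generation-style statistic."""
    aa_total, aa_in_state = Counter(), Counter()
    n_total = n_state = 0
    for seq, ss in zip(seqs, states):
        for aa, s in zip(seq, ss):
            aa_total[aa] += 1
            n_total += 1
            if s == target:
                aa_in_state[aa] += 1
                n_state += 1
    return {aa: (aa_in_state[aa] / n_state) / (aa_total[aa] / n_total)
            for aa in aa_total}

# Toy example: 'H' = helix, 'C' = coil (invented data).
p = propensities(["AAGEL", "ALGGA"], ["HHCHH", "HHCCC"])
# Residues seen mostly in helix get propensity > 1 (here p['A'], p['L']);
# 'G' never appears in helix in this toy set, so p['G'] == 0.
```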

The GOR method

(GOR helix and strand tables)

General problems of methods in generations I and II. The overall prediction rate was rather low:

• Overall prediction: 60%

• β-strand prediction: ~35%

• Predictions included small secondary elements, with inability to integrate them into longer structures such as those found in protein structures.

Third generation: the improvement of the programs in the third generation was mainly due to the incorporation of evolutionary information. This was done by looking at a multiple alignment which includes sequences similar to the sequence we wish to predict.

Such information, presented as an MSA or in another way, includes plenty of information which cannot be obtained from a single sequence:

• Which regions are more conserved

• Which substitutions are allowed in each position

• Information regarding interacting sites
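The window idea behind GOR-style methods can be sketched as a sum of per-residue log-odds contributions over a sliding window. The table below is invented for illustration and collapses the real GOR tables, which are indexed by the offset within a 17-residue window, to a single value per amino acid:

```python
# Invented helix log-odds contributions (schematic, not GOR parameters).
HELIX_LOGODDS = {'A': 0.5, 'L': 0.4, 'E': 0.3, 'G': -0.6, 'P': -0.9}

def gor_like_score(seq, pos, half_window=8, table=HELIX_LOGODDS):
    """Sum log-odds over a window centered at pos; half_window=8 gives
    the 17-residue window used by GOR."""
    lo = max(0, pos - half_window)
    hi = min(len(seq), pos + half_window + 1)
    return sum(table.get(aa, 0.0) for aa in seq[lo:hi])

seq = "GGPAALLEEAAGG"
# With a narrow window, the helix-favoring middle outscores the Gly/Pro end:
print(gor_like_score(seq, 6, half_window=2) >
      gor_like_score(seq, 0, half_window=2))  # True
```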

Comparison of many sequences of protein families helps to detect conserved regions. Comparison of many sequences of protein families helps to detect interactions in space.

SAARDFFRT--HAAGRFFTFT        AAARDFFRTGGHAAGRFFTFT
SAARDFFRS--GTRAKFFTFT        AAARDFFRSGGHAAGKFFTFT
TAARDFFRF-GKAA-KFFTFT        AAARDFFRTGGHAAGKFFTFT
SAARRFFRTGDHAALDFFTFT        AAARRFFRTGAHAAGDFYTFS
SAARRFFRWHGLAAIDFFTFT        AAARRFFRTGGHAAGDFFTFT

Information obtained from the MSA might help in the prediction. Because the fold of all members of the family is identical, every sequence can contribute to the structure prediction of any other sequence in the family.

The best MSA for this purpose is one which includes many sequences of the family that are not too close to one another.

Introduction to neural networks

Neurons are the basic components of the nervous system.

Every neuron gets information from several other neurons via the dendrites.

The information is processed, and the neuron makes a binary decision whether to transfer a signal to other neurons.

The information is transferred via the axon.
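Per-column conservation in such an alignment can be quantified, for example, with Shannon entropy (a generic sketch; low entropy marks the conserved columns the slides refer to):

```python
import math
from collections import Counter

def column_entropy(msa, col):
    """Shannon entropy (bits) of one alignment column; 0 = fully conserved."""
    counts = Counter(seq[col] for seq in msa)
    n = len(msa)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# First nine columns of the alignment shown above:
msa = [
    "SAARDFFRT",
    "SAARDFFRS",
    "TAARDFFRF",
    "SAARRFFRT",
    "SAARRFFRW",
]
print(column_entropy(msa, 5))  # all-F column: fully conserved -> 0.0
print(column_entropy(msa, 8))  # variable column -> higher entropy
```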

Computational tasks that the nerve system executes:

•Representation of data

•Storing data

•Learning procedures

•Decision making

•Pattern recognition

Neural networks - properties

Systems composed of many simple processors, connected and working in parallel. The information may be obtained by a learning process and stored in the connections between the processors.

The perceptron

The perceptron models the action of a single neuron; it can be used to classify only linearly separable cases.

Example: binary neuron

Inputs: S_i = 0, 1

Output: Θ(W1·S1 + W2·S2 − T)

OR gate: W1 = ?, W2 = ? (T = 0.5)

AND gate: W1 = ?, W2 = ? (T = 1.5)

In practice, usually some differentiable function is used instead of the step function.
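The threshold-unit exercise above can be checked directly; W1 = W2 = 1 is one valid choice of weights:

```python
def theta(x):
    """Step function: 1 if x >= 0, else 0."""
    return 1 if x >= 0 else 0

def perceptron(s1, s2, w1, w2, t):
    """Binary neuron: Theta(w1*s1 + w2*s2 - t)."""
    return theta(w1 * s1 + w2 * s2 - t)

# W1 = W2 = 1 with T = 0.5 gives OR; with T = 1.5 it gives AND.
for s1 in (0, 1):
    for s2 in (0, 1):
        print(s1, s2,
              perceptron(s1, s2, 1, 1, 0.5),   # OR column
              perceptron(s1, s2, 1, 1, 1.5))   # AND column
```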

Networks of layers; networks with feedback

Input → internal representation → output

Training

•Preparation of a large training set

•The neural network gets the input and random initial values for the parameters (weights)

•The network tries to maximize the number of correctly predicted cases by changing the values of the parameters (weights)

To test the net we evaluate its performance on a collection of solved examples (test set).

•The test set should be independent of the training set; the first interaction of the net with this set should occur during evaluation.

•The test set should be large and representative. It is better to use a test set already used for evaluation of other programs designed to solve a similar task.

PHD – a third-generation program that uses neural networks

PHD is the most popular secondary structure prediction program; although other programs reach the same accuracy, it is still very popular today.

The versions of this program implement and demonstrate the elements which are considered the most important for prediction accuracy, and demonstrate the use of machine learning approaches in this field.

Input: sequence of amino acids. Using database sequence searches, similar sequences are found and an MSA is built.

The composition of this alignment is the input to the neural network which is the core of the program

Every position in the input sequence is expressed by 21 parameters: the percentage of each amino acid in that position, plus one parameter which indicates the start/end of the sequence.
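The per-position encoding can be sketched as follows. This is a simplified sketch: 20 amino-acid frequencies from the MSA column plus one "spacer" indicator for positions beyond the sequence ends; the actual PHD feature set carries additional information.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def encode_position(msa, col):
    """21 features for one position: the frequency of each of the 20
    amino acids in the MSA column, plus a spacer flag marking positions
    outside the sequence (simplified from the PHD input encoding)."""
    n = len(msa)
    if col < 0 or col >= len(msa[0]):
        return [0.0] * 20 + [1.0]          # beyond the N-/C-terminus
    counts = [sum(1 for seq in msa if seq[col] == aa) for aa in AMINO_ACIDS]
    return [c / n for c in counts] + [0.0]

# Toy alignment (invented): column 0 holds A, A, G.
msa = ["ACDA", "ACEA", "GCDA"]
v = encode_position(msa, 0)
print(len(v))   # 21 features
print(v[0])     # freq(A) = 2/3
```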

In addition, the input for each position includes global information about the protein composition and the sequence distance between the predicted region and the start/end positions.

Importance of variability in the input sequences

Good alignment!

The neural network includes several layers:

Input layer: sequence -> structure

Intermediate layer: structure -> structure

Output system: summation of several networks

Output: the secondary structure with the highest score is the final prediction for that position.

http://www.embl-heidelberg.de/predictprotein/

Comparison of secondary structure prediction tools

Assignment vs. prediction with the PHD reliability index, for the periplasmic binding protein 4mbp (figure).

Combination of different prediction methods

Every method has errors which can be classified to 2 general types:

1.Systematic errors 2.Non-systematic errors

Several methods can be therefore combined to increase the prediction accuracy

The basic condition for a successful combination is that the errors of the individual methods are not only systematic. Several new methods exploit this fact: they train several neural networks independently and predict based on the average prediction of all the networks.
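Averaging independently trained networks can be sketched like this; the 0-9 scaling of the confidence value is in the spirit of the PHD reliability index, and the per-network output vectors are invented:

```python
def combine(per_network_outputs):
    """Average per-state scores over several independently trained
    networks, pick the winning state, and attach a PHD-style reliability
    index: the difference of the two best scores, scaled to 0-9."""
    states = ('H', 'E', 'L')
    n = len(per_network_outputs)
    avg = [sum(out[i] for out in per_network_outputs) / n for i in range(3)]
    ranked = sorted(range(3), key=lambda i: avg[i], reverse=True)
    ri = int(10 * (avg[ranked[0]] - avg[ranked[1]]))
    return states[ranked[0]], min(ri, 9)

# Three hypothetical networks scoring one position (H, E, L):
nets = [(0.7, 0.2, 0.1), (0.6, 0.3, 0.1), (0.8, 0.1, 0.1)]
state, ri = combine(nets)
print(state)  # 'H' wins after averaging
```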

Another method (Jpred) takes as input the results of several existing methods and predicts based on them. Many web servers are available….

http://www.compbio.dundee.ac.uk/www-jpred/

To understand some of the sequence signals that might be used, we can consider the basic properties of secondary structures.

The α-helix, for example, has a periodicity of 3.6 amino acids. Helices on the protein surface are expected to possess some signal with this periodicity for positions occupied by hydrophilic and hydrophobic side chains.

Finding hydrophobic amino acids at positions i, i+3, i+7, i+10, for example, is a strong indication for a helix.

http://bmerc-www.bu.edu/psa/

Similarly, in surface β-strands, there is a preference for a "zigzag" pattern: for example, hydrophilic side chains at positions i, i+2, i+4… and hydrophobic side chains at positions i+1, i+3, i+5…
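A crude detector for the helical hydrophobicity signal can be written directly from the position pattern mentioned above; the hydrophobic residue set and the all-or-nothing criterion are ad-hoc choices for illustration:

```python
HYDROPHOBIC = set("AILMFVWY")  # ad-hoc hydrophobic residue set

def helix_periodicity_signal(seq, i, offsets=(0, 3, 7, 10)):
    """True if the residues at i, i+3, i+7, i+10 are all hydrophobic --
    the helical-periodicity indication described in the text."""
    if i + max(offsets) >= len(seq):
        return False
    return all(seq[i + k] in HYDROPHOBIC for k in offsets)

# Invented sequence with 'L' at positions 0, 3, 7 and 10:
seq = "LDELKDELKDLE"
print(helix_periodicity_signal(seq, 0))  # True
print(helix_periodicity_signal(seq, 1))  # False
```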

β-strand of CD8

Related topics

•Prediction of secondary structures of membrane proteins

•Prediction of solvent accessibility

Average prediction accuracies (based on the 480-protein set) for 2-state solvent accessibility:

Rel. Acc. (%)   PSIBLAST (%)   HMMER2 (%)   Combined (%)
25              75.0           74.2         76.2
5               79.0           78.8         79.8
0               86.6           86.3         86.5