ü The first strategy, the so-called protein bioinformatics, is aimed at the development of computational tools that enable to decipher the information encoded in the protein sequences, thus enabling the prediction of structure and function ü The second strategy is based on the laws of physics. One of the most important methods here is molecular dynamics (MD) , which predicts structural, dynamical, and energetic (bio)molecular properties based on Newton laws of motion. a ancestral protein
duplication
paralogues
a b
speciation
orthologues
a a
ü orthologues species - often 1 have very similarspecies functions 2 ü paralogues - may have related functions C. Orengo, EBI Molecular function Cellular component Biological process
Family characterization Residues Features Protein Structure HMM, profiles predict transmembrane, prediction SMART, ProtoNet, Everest, localisation etc Comparative Modeling Gene3D, CATH, InterPro Fold recognition Pfam, TIGR, PRINTS, SCOP New fold
analyse residue features Orthologs identification predict disorder, signal predict protein HAMAP, EggNogg, COGS, peptides, localisation interactions: KOGS Barcello, DisoPred, FFpred ligand-protein protein-protein
search for conserved residues State of art Free-ware bioinformatics programs !!! TreeDet, ScoreCons, GEMMA, ProteinKeys ETtrace, SCI-PHY, FunShift
ü A method for scoring similarities between aminoacids. Substitution matrices: Blosum, PAM, Gonnet....
ü A penalty value for the insertions and deletions (gaps)
ü An algorithm for the alignment. Dynamic programing: Needelman e Wunsch (global alignment) or Smith e Waterman (local alignment) Where does the score (S) come from?
The quality of each pair-wise alignment is
represented as a score and the scores are ranked.
Scoring matrices are used to calculate the score of
the alignment base by base (DNA) or amino acid
by amino acid (protein).
The alignment score will be the sum of6 the scores
for each position. What’s a scoring matrix?
Substitution matrices
are used for amino
acid alignments.
each possible residue 6
substitution is given a
score 7
7 A simpler unitary matrix
is used for DNA pairs
(+1 for match, -2
mismatch) 8 BLOSUM vs PAM
BLOSUM 45 BLOSUM 62 BLOSUM 90 PAM 250 PAM 160 PAM 100
More Divergent Less Divergent
BLOSUM 62 is the default matrix in BLAST
2.0. Though it is tailored for comparisons of
moderately distant proteins, it performs well
in detecting closer relationships. A9 search
for distant relatives may be more sensitive
with a different matrix.
Sequence alignment Problem: optimize the corrispondence between aminoacids from two (or more) different sequences. Protein Evolution
2 Sequences AGTTTGAATGTTTTGTGTGAAAGGAGTATACCATGAGATGAGATGACCACCAATCATTTC
||||||||||||||||||| |||||||| ||| | |||||| |||||||||||||||||
AGTTTGAATGTTTTGTGTGTGAGGAGTATTCCAAGGGATGAGTTGACCACCAATCATTTC
Multiple
KFKHHLKEHLRIHSGEKPFECPNCKKRFSHSGSYSSHMSSKKCISLILVNGRNRALLKTl
KYKHHLKEHLRIHSGEKPYECPNCKKRFSHSGSYSSHISSKKCIGLISVNGRMRNNIKT-
KFKHHLKEHVRIHSGEKPFGCDNCGKRFSHSGSFSSHMTSKKCISMGLKLNNNRALLKRl
KFKHHLKEHIRIHSGEKPFECQQCHKRFSHSGSYSSHMSSKKCV------
KYKHHLKEHLRIHSGEKPYECPNCKKRFSHSGSYSSHISSKKCISLIPVNGRPRTGLKTsn Phylogenetic Trees
• It's a way of visualizing evolutionary relationships • Each external node corrisponds to an species • Internal nodes: speciation event • The distance between two nodes is proprtional to the divergence time. • In protein sequences: node -> protein • The distance between two nodes is proprtional to the sequence simillarity. Phylogenetic Trees
%aa Seq1 Seq2 Seq3 Seq4 different Seq1 0 5 11 14
Seq2 0 9 10 2.5 Seq3 0 7
Seq4 0 1 2 %aadifferent Cluster1,2 Seq3 Seq4
Cluster1,2 0 ½[d(1,3)+d(2,3)]=10 ½[d(1,4)+d(2,4)]=12
Seq3 0 7
Seq4 0
3.5 2.5
1 2 3 4 %aadiversi Cluster3,4
Cluster1,2 =½d[(Cluster1,2),3]+d[(Cluster1,2),4)]=11
5.5
3.5 2.5
1 2 3 4
The most important residues at the structural and/or functional level are conserved along evolution, and this can be appreciated from a the alignment of enough members of the family.
Structural domains The aminoacids involved in of the protein protein function
Provides information on:
Residues buried in the core of the protein
Remote homologs search
A profile shows all the information contained in a multiple sequence alignment
The profile is generated by the calculation of the frequencies of each of the aminoacids in all the positions of the alignment.
Used in in PSI-BLAST and PSI-search !!!
The latter uses as algorithm the 'exact' Smith and Waterman alignment protocol
Concepts of Sequence Similarity
Searching The premise:
One sequence by itself is not informative; it must be
analyzed by comparative methods against existing
sequence databases to develop hypothesis
concerning relatives and function.
21 The BLAST algorithm
The BLAST programs (Basic Local Alignment Search
Tools) are a set of sequence comparison
algorithms introduced in 1990 that are used to
search sequence databases for optimal local
alignments to a query.
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) “Basic local alignment 22
search tool.” J. Mol. Biol. 215:403-410.
Altschul SF, Madden TL, Schaeffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ
(1997) “Gapped BLAST and PSI-BLAST: a new generation of protein database
search programs.” NAR 25:3389-3402. 23 What BLAST tells you ...
BLAST reports surprising alignments
- Different than chance
Assumptions
- Random sequences
- Constant composition Evolutionary Homology: descent from a common ancestor Does not always imply similar function Conclusions 24
- Surprising similarities imply evolutionary homology Basic Local Alignment Search Tool
Widely used similarity search tool
Heuristic approach based on Smith Waterman
algorithm
Finds best local alignments
Provides statistical significance 25
25 www, standalone, and network clients BLAST programs Program Description Compares an amino acid query sequence against a protein blastp sequence database. Compares a nucleotide query sequence against a nucleotide blastn sequence database. Compares a nucleotide query sequence translated in all reading frames against a protein sequence database. You could use this blastx option to find potential translation products of an unknown nucleotide sequence. Compares a protein query sequence against a nucleotide tblastn sequence database dynamically translated in all reading frames. Compares the six-frame translations of a nucleotide query tblastx sequence against the six-frame translations of a nucleotide sequence database. 26 more BLAST programs Program Notes Contiguous Nearly identical sequences Megablast Discontiguous Cross-species comparison
PSI-BLAST Automatically generates a position Position specific score matrix (PSSM)
Specific Searches a database of PSI-BLAST RPS-BLAST PSSMs
nucleotide only
protein only 27 BLAST Algorithm
Scoring of matches done using scoring matrices
Sequences are split into words (default n=3)
Speed, computational efficiency
BLAST algorithm extends the initial “seed” hit
into an HSP
HSP = high scoring segment pair = Local optimal alignment
28 Sequence Similarity Searching –
The statistics are important
Discriminating between real and artifactual
matches is done using an estimate of
probability that the match might occur by
chance.
29
We’ll talk more about the meaning of the scores
(S) and e-values (E) that are associated with
BLAST hits What do the Score and the e-value
really mean? The quality of the alignment is represented by the Score (S).
The score of an alignment is calculated as the sum of substitution and gap scores. Substitution scores are given by a look-up table (PAM, BLOSUM) whereas gap scores are assigned empirically . The significance of each alignment is computed as an E value (E).
Expectation value. The number of different alignments with scores equivalent to or better than S that are expected to occur in a database search by chance. The lower the E value, the more significant the score.
30 Notes on E-values
Low E-values suggest that sequences are homologous ๏ Can’t show non-homology Statistical significance depends on both the size of the alignments and the size of the sequence database ‣ Important consideration for comparing results across different searches ‣ E-value increases as database gets bigger ‣ E-value decreases as alignments get longer
31 Homology: Some Guidelines
Similarity can be indicative of homology Generally, if two sequences are significantly similar over entire length they are likely homologous Low complexity regions can be highly similar without being homologous
Homologous sequences not always highly
similar 32 BLAST Algorithm
Scoring of matches done using scoring
matrices
Sequences are split into words (default n=3)
- Speed, computational efficiency
BLAST algorithm extends the initial “seed” hit
into an HSP 33
- HSP = high scoring segment pair = Local optimal alignment How Does BLAST Really Work?
The BLAST programs improved the overall
speed of searches while retaining good
sensitivity (important as databases continue
to grow) by breaking the query and database
sequences into fragments ("words"), and
initially seeking matches between 34fragments.
Word hits are then extended in either
direction in an attempt to generate an
alignment with a score exceeding the
threshold of "S". BLAST Algorithm
35 How Does BLAST Really Work?
The BLAST programs improved the overall
speed of searches while retaining good
sensitivity (important as databases continue
to grow) by breaking the query and database
sequences into fragments ("words"), and
initially seeking matches between 36fragments.
Word hits are then extended in either
direction in an attempt to generate an
alignment with a score exceeding the
threshold of "S". BLAST Algorithm
37 Extending the High Scoring Segment Pair
(HSP)
Minimum
Score (S)
Neighborhood
Score Threshold (T) 38 39 BLAST Algorithm
Scoring of matches done using scoring
matrices
Sequences are split into words (default n=3)
- Speed, computational efficiency
BLAST algorithm extends the initial “seed” hit
into an HSP 40
- HSP = high scoring segment pair = Local optimal alignment KRTIDPQ Blast
BD Score S’ P(>x) = 1 – exp(-Ke-λx)
Probability of finding an alignment with score greater than x
K and λ are parameters related to the maximun value position and to the width of the distribution and depend on the scoring matrices
E-value: expected value !!! is the number of random hits that we expect to obtain with an score S:
E= Kmne(-λS)
S is generally normalized: S’=(λS-lnK)/ln2
S’: bit score and therefore: E=mn2-S’
PSI (Position Specific Iterated) BLAST
• Idea: – to use BLAST results for generating a profile matrix.
– Search in database using the profiles, instead of the sequence.
• Iterative process: until convergence is reached ?
• Profile search • Alignment of a position specific matrix with a sequence: – Is the same as aligning sequences. – The score of aligning one character of the position with a character from the matrix, is given by the matrix. – There is no need of a substitution matrix
• Value calculation
• Dove Pr(ai|col=j) probability of finding the aminoacid ai in column j
• Pr(ai) frequence of finding ai in the alignment.
• In structure prediction, models can best be thought of as “sequence generators” (e.g., Hidden Markov Models) or “sequence classifiers” (e.g., Neural Networks)
• The better (and usually more complex) a model is, the better the performance is likely to be
ü Different types of structure components (match, insertions and deletions) are characterized by a state
ü The family model is thought to be generated by a state machine: starting form Nterm to Cterm, each aa is generated by a emission probability conditioned on the current state, and the transition from one state to an other:
ü All the parameters of the emission probabilities and transition probabilities are learned. Described by
ü A set of possible states: match, insert, deletion. ü A set of possible observations: frequencies of aa in each position. ü A transition probability matrix ü An emision probability matrix (frequencies of aa occurring in a particular state). ü Initial state probabilities.
ü A Markov chain is a model for stochastic generation of sequential phenomena
ü Every position in a chain is equivalent
ü The order of the Markov chain is the number of previous positions on which the current position depends ü e.g., in nucleic acid sequence, 0-order is mononucleotide, 1st-order is dinucleotide, 2nd-order is trinucleotide, etc.
ü The model parameters are the frequencies of the elements at each position (possibly as a function of preceding elements)
In general, sequences are not monolithic, but can be made up of discrete segments
Hidden Markov Models (HMMs) allow us to model complex sequences, in which the character emission probabilities depend upon the state
Think of an HMM as a probabilistic or stochastic sequence generator, and what is hidden is the current state of the model
An HMM is completely defined by its: State-to-state transition matrix (Φ) Emission matrix (H) State vector (x)
We want to determine the probability of any specific (query) sequence having been generated by the model; with multiple models, we then use Bayes’ rule to determine the best model for the sequence
Two algorithms are typically used for the likelihood calculation: Viterbi Forward
Models are trained with known examples A dishonest croupier could use a dice that has a higher probability of landing on a “6,” (e.g., 50%). To avoid being caught, the croupier can switch from a fair die to a loaded die with a certain frequency. For example, he can change the die from fair to loaded after 20 rolls and from loaded to fair after 10 rolls.
We can ask three questions about Hidden Markov Models:
ü Given a series of rolls, what are the probability values of the state transitions (i.e., how can the numbers associated with each of the arrows in Figure 20 be deduced)?
ü Given a series of rolls, what is the probability that the results have been generated by a scheme or model such as the one above?
ü Given a series of rolls, what is the most likely sequence of states that generated it? In other words, which series of events (switching from the loaded to the fair dice and vice versa) most likely generated the observed results? ü Given a set of homologous amino acid sequences, which are the transition probability values (from match to insert, to delete and vice versa) that best describe the family (i.e., which is their Hidden Markov Model)?
ü Given an amino acid sequence, what is the probability that the sequence has been generated by a Hidden Markov Model of a protein family (i.e., what is the probability that the new sequence belongs to the previously aligned family)?
ü Given an amino acid sequence, what is the most likely sequence of matches, insertions, and deletions that makes the sequence fit to the Hidden Markov Model of a protein family? In other words, what is the most likely alignment of the new sequence to the existing model?
Seq1: A C C – E Quantità Inizio - 1 1 – 2 2 – 3 3 - 4 4 - 5 5 - end Seq2: E C E – A Match - match 4 + 1 3 + 1 3 + 1 1 + 1 1 + 1 4 +1 Match - del 0 + 1 1 + 1 0 + 1 3 + 1 0 + 1 0 + 1 Seq3: A C E A A Ins – Del 0 + 1 0 + 1 0 + 1 0 + 1 0 + 1 0 + 1
Seq4: C – E - E Del – match 0 + 1 0 + 1 1 + 1 0 + 1 3 + 1 0 + 1
Frequenze States=7 (+ match-ins, Match - match 0.45 0.36 0.36 0.18 0.18 0.45 Ins-match, ins-ins) Match - del 0.09 0.18 0.09 0.36 0.09 0.09 We have to add a count For each state so the total Ins - Del 0.09 0.09 0.09 0.09 0.09 0.09 counts are 11) Del – match 0.09 0.09 0.18 0.09 0.36 0.09
0.09 delete
0.18 0.36
0.09 Ins
0.18 A 0.43 A 0.17 A 0.14 A 0.43 Inizio C 0.29 C 0.67 C 0.28 C 0.29 Match 0.45 0.36 0.45 E 0.29 E 0.17 0.36 E 0.57 0.18 E 0.29 Identification of functional subfamilies Identification of functional subfamilies
multiple sequence alignment, clustering and amino acid identification from functional subfamily
Structural model 1 = highly conserved
0 = unconserved Where Pi is the fraction of residues of amino acid type i, and M is the number of amino acid types Putative functional site Comparative modelling
mutation of one amino acid:
Ø the protein does not fold any more
Ø the protein accommodates the replacement with minor modifications: evolutionary pressure!
Ø the protein folds in a completely different structure
[ Chothia & Lesk (1986) ]