The First Strategy, the So-Called Protein Bioinformatics, Is

ü The first strategy, the so-called protein bioinformatics, is aimed at the development of computational tools that enable to decipher the information encoded in the protein sequences, thus enabling the prediction of structure and function ü The second strategy is based on the laws of physics. One of the most important methods here is molecular dynamics (MD) , which predicts structural, dynamical, and energetic (bio)molecular properties based on Newton laws of motion. a ancestral protein

duplication

paralogues

a b

speciation

orthologues

a a

ü orthologues species - often 1 have very similarspecies functions 2 ü paralogues - may have related functions C. Orengo, EBI Molecular function Cellular component Biological process

Family characterization Residues Features Protein Structure HMM, profiles predict transmembrane, prediction SMART, ProtoNet, Everest, localisation etc Comparative Modeling Gene3D, CATH, InterPro Fold recognition Pfam, TIGR, PRINTS, SCOP New fold

analyse residue features Orthologs identification predict disorder, signal predict protein HAMAP, EggNogg, COGS, peptides, localisation interactions: KOGS Barcello, DisoPred, FFpred ligand-protein protein-protein

search for conserved residues State of art Free-ware bioinformatics programs !!! TreeDet, ScoreCons, GEMMA, ProteinKeys ETtrace, SCI-PHY, FunShift

ü A method for scoring similarities between aminoacids. Substitution matrices: Blosum, PAM, Gonnet....

ü A penalty value for the insertions and deletions (gaps)

ü An algorithm for the alignment. Dynamic programing: Needelman e Wunsch (global alignment) or Smith e Waterman (local alignment)‏ Where does the score (S) come from?

The quality of each pair-wise alignment is

represented as a score and the scores are ranked.

Scoring matrices are used to calculate the score of

the alignment base by base (DNA) or amino acid

by amino acid (protein).

The alignment score will be the sum of6 the scores

for each position. What’s a scoring matrix?

Substitution matrices

are used for amino

acid alignments.

each possible residue 6

substitution is given a

score 7

7 A simpler unitary matrix

is used for DNA pairs

(+1 for match, -2

mismatch) 8 BLOSUM vs PAM

BLOSUM 45 BLOSUM 62 BLOSUM 90 PAM 250 PAM 160 PAM 100

More Divergent Less Divergent

BLOSUM 62 is the default matrix in BLAST

2.0. Though it is tailored for comparisons of

moderately distant proteins, it performs well

in detecting closer relationships. A9 search

for distant relatives may be more sensitive

with a different matrix.

Sequence alignment Problem: optimize the corrispondence between aminoacids from two (or more) different sequences. Protein Evolution

2 Sequences AGTTTGAATGTTTTGTGTGAAAGGAGTATACCATGAGATGAGATGACCACCAATCATTTC

||||||||||||||||||| |||||||| ||| | |||||| |||||||||||||||||

AGTTTGAATGTTTTGTGTGTGAGGAGTATTCCAAGGGATGAGTTGACCACCAATCATTTC

Multiple

KFKHHLKEHLRIHSGEKPFECPNCKKRFSHSGSYSSHMSSKKCISLILVNGRNRALLKTl

KYKHHLKEHLRIHSGEKPYECPNCKKRFSHSGSYSSHISSKKCIGLISVNGRMRNNIKT-

KFKHHLKEHVRIHSGEKPFGCDNCGKRFSHSGSFSSHMTSKKCISMGLKLNNNRALLKRl

KFKHHLKEHIRIHSGEKPFECQQCHKRFSHSGSYSSHMSSKKCV------

KYKHHLKEHLRIHSGEKPYECPNCKKRFSHSGSYSSHISSKKCISLIPVNGRPRTGLKTsn Phylogenetic Trees

• It's a way of visualizing evolutionary relationships • Each external node corrisponds to an species • Internal nodes: speciation event • The distance between two nodes is proprtional to the divergence time. • In protein sequences: node -> protein • The distance between two nodes is proprtional to the sequence simillarity. Phylogenetic Trees

%aa Seq1 Seq2 Seq3 Seq4 different Seq1 0 5 11 14

Seq2 0 9 10 2.5 Seq3 0 7

Seq4 0 1 2 %aadifferent Cluster1,2 Seq3 Seq4

Cluster1,2 0 ½[d(1,3)+d(2,3)]=10 ½[d(1,4)+d(2,4)]=12

Seq3 0 7

Seq4 0

3.5 2.5

1 2 3 4 %aadiversi Cluster3,4

Cluster1,2 =½d[(Cluster1,2),3]+d[(Cluster1,2),4)]=11

5.5

3.5 2.5

1 2 3 4

The most important residues at the structural and/or functional level are conserved along evolution, and this can be appreciated from a the alignment of enough members of the family.

Structural domains The aminoacids involved in of the protein protein function

Provides information on:

Residues buried in the core of the protein

Remote homologs search

A profile shows all the information contained in a multiple sequence alignment

The profile is generated by the calculation of the frequencies of each of the aminoacids in all the positions of the alignment.

Used in in PSI-BLAST and PSI-search !!!

The latter uses as algorithm the 'exact' Smith and Waterman alignment protocol

Concepts of Sequence Similarity

Searching The premise:

One sequence by itself is not informative; it must be

analyzed by comparative methods against existing

sequence databases to develop hypothesis

concerning relatives and function.

21 The BLAST algorithm

The BLAST programs (Basic Local Alignment Search

Tools) are a set of sequence comparison

algorithms introduced in 1990 that are used to

search sequence databases for optimal local

alignments to a query.

Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) “Basic local alignment 22

search tool.” J. Mol. Biol. 215:403-410.

Altschul SF, Madden TL, Schaeffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ

(1997) “Gapped BLAST and PSI-BLAST: a new generation of protein database

search programs.” NAR 25:3389-3402. 23 What BLAST tells you ...

BLAST reports surprising alignments

- Different than chance

Assumptions

- Random sequences

- Constant composition Evolutionary Homology: descent from a common ancestor Does not always imply similar function Conclusions 24

- Surprising similarities imply evolutionary homology Basic Local Alignment Search Tool

Widely used similarity search tool

Heuristic approach based on Smith Waterman

algorithm

Finds best local alignments

Provides statistical signiﬁcance 25

25 www, standalone, and network clients BLAST programs Program Description Compares an amino acid query sequence against a protein blastp sequence database. Compares a nucleotide query sequence against a nucleotide blastn sequence database. Compares a nucleotide query sequence translated in all reading frames against a protein sequence database. You could use this blastx option to ﬁnd potential translation products of an unknown nucleotide sequence. Compares a protein query sequence against a nucleotide tblastn sequence database dynamically translated in all reading frames. Compares the six-frame translations of a nucleotide query tblastx sequence against the six-frame translations of a nucleotide sequence database. 26 more BLAST programs Program Notes Contiguous Nearly identical sequences Megablast Discontiguous Cross-species comparison

PSI-BLAST Automatically generates a position Position speciﬁc score matrix (PSSM)

Speciﬁc Searches a database of PSI-BLAST RPS-BLAST PSSMs

nucleotide only

protein only 27 BLAST Algorithm

Scoring of matches done using scoring matrices

Sequences are split into words (default n=3)

Speed, computational efﬁciency

BLAST algorithm extends the initial “seed” hit

into an HSP

HSP = high scoring segment pair = Local optimal alignment

28 Sequence Similarity Searching –

The statistics are important

Discriminating between real and artifactual

matches is done using an estimate of

probability that the match might occur by

chance.

We’ll talk more about the meaning of the scores

(S) and e-values (E) that are associated with

BLAST hits What do the Score and the e-value

really mean? The quality of the alignment is represented by the Score (S).

The score of an alignment is calculated as the sum of substitution and gap scores. Substitution scores are given by a look-up table (PAM, BLOSUM) whereas gap scores are assigned empirically . The signiﬁcance of each alignment is computed as an E value (E).

Expectation value. The number of different alignments with scores equivalent to or better than S that are expected to occur in a database search by chance. The lower the E value, the more signiﬁcant the score.

30 Notes on E-values

Low E-values suggest that sequences are homologous ๏ Can’t show non-homology Statistical signiﬁcance depends on both the size of the alignments and the size of the sequence database ‣ Important consideration for comparing results across different searches ‣ E-value increases as database gets bigger ‣ E-value decreases as alignments get longer

31 Homology: Some Guidelines

Similarity can be indicative of homology Generally, if two sequences are signiﬁcantly similar over entire length they are likely homologous Low complexity regions can be highly similar without being homologous

Homologous sequences not always highly

similar 32 BLAST Algorithm

Scoring of matches done using scoring

matrices

Sequences are split into words (default n=3)

- Speed, computational efﬁciency

BLAST algorithm extends the initial “seed” hit

into an HSP 33

- HSP = high scoring segment pair = Local optimal alignment How Does BLAST Really Work?

The BLAST programs improved the overall

speed of searches while retaining good

sensitivity (important as databases continue

to grow) by breaking the query and database

sequences into fragments ("words"), and

initially seeking matches between 34fragments.

Word hits are then extended in either

direction in an attempt to generate an

alignment with a score exceeding the

threshold of "S". BLAST Algorithm

35 How Does BLAST Really Work?

The BLAST programs improved the overall

speed of searches while retaining good

sensitivity (important as databases continue

to grow) by breaking the query and database

sequences into fragments ("words"), and

initially seeking matches between 36fragments.

Word hits are then extended in either

direction in an attempt to generate an

alignment with a score exceeding the

threshold of "S". BLAST Algorithm

37 Extending the High Scoring Segment Pair

(HSP)

Minimum

Score (S)

Neighborhood

Score Threshold (T) 38 39 BLAST Algorithm

Scoring of matches done using scoring

matrices

Sequences are split into words (default n=3)

- Speed, computational efﬁciency

BLAST algorithm extends the initial “seed” hit

into an HSP 40

- HSP = high scoring segment pair = Local optimal alignment KRTIDPQ Blast

BD Score S’ P(>x) = 1 – exp(-Ke-λx)‏

Probability of finding an alignment with score greater than x

K and λ are parameters related to the maximun value position and to the width of the distribution and depend on the scoring matrices

E-value: expected value !!! is the number of random hits that we expect to obtain with an score S:

E= Kmne(-λS)‏

S is generally normalized: S’=(λS-lnK)/ln2

S’: bit score and therefore: E=mn2-S’

PSI (Position Specific Iterated) BLAST

• Idea: – to use BLAST results for generating a profile matrix.

– Search in database using the profiles, instead of the sequence.

• Iterative process: until convergence is reached ?

• Profile search • Alignment of a position specific matrix with a sequence: – Is the same as aligning sequences. – The score of aligning one character of the position with a character from the matrix, is given by the matrix. – There is no need of a substitution matrix

• Value calculation

• Dove Pr(ai|col=j) probability of finding the aminoacid ai in column j

• Pr(ai) frequence of finding ai in the alignment.

• In structure prediction, models can best be thought of as “sequence generators” (e.g., Hidden Markov Models) or “sequence classifiers” (e.g., Neural Networks)

• The better (and usually more complex) a model is, the better the performance is likely to be

ü Different types of structure components (match, insertions and deletions) are characterized by a state

ü The family model is thought to be generated by a state machine: starting form Nterm to Cterm, each aa is generated by a emission probability conditioned on the current state, and the transition from one state to an other:

ü All the parameters of the emission probabilities and transition probabilities are learned. Described by

ü A set of possible states: match, insert, deletion. ü A set of possible observations: frequencies of aa in each position. ü A transition probability matrix ü An emision probability matrix (frequencies of aa occurring in a particular state). ü Initial state probabilities.

ü A Markov chain is a model for stochastic generation of sequential phenomena

ü Every position in a chain is equivalent

ü The order of the Markov chain is the number of previous positions on which the current position depends ü e.g., in nucleic acid sequence, 0-order is mononucleotide, 1st-order is dinucleotide, 2nd-order is trinucleotide, etc.

ü The model parameters are the frequencies of the elements at each position (possibly as a function of preceding elements)

In general, sequences are not monolithic, but can be made up of discrete segments

Hidden Markov Models (HMMs) allow us to model complex sequences, in which the character emission probabilities depend upon the state

Think of an HMM as a probabilistic or stochastic sequence generator, and what is hidden is the current state of the model

An HMM is completely defined by its: State-to-state transition matrix (Φ) Emission matrix (H) State vector (x)

We want to determine the probability of any specific (query) sequence having been generated by the model; with multiple models, we then use Bayes’ rule to determine the best model for the sequence

Two algorithms are typically used for the likelihood calculation: Viterbi Forward

Models are trained with known examples A dishonest croupier could use a dice that has a higher probability of landing on a “6,” (e.g., 50%). To avoid being caught, the croupier can switch from a fair die to a loaded die with a certain frequency. For example, he can change the die from fair to loaded after 20 rolls and from loaded to fair after 10 rolls.

We can ask three questions about Hidden Markov Models:

ü Given a series of rolls, what are the probability values of the state transitions (i.e., how can the numbers associated with each of the arrows in Figure 20 be deduced)?

ü Given a series of rolls, what is the probability that the results have been generated by a scheme or model such as the one above?

ü Given a series of rolls, what is the most likely sequence of states that generated it? In other words, which series of events (switching from the loaded to the fair dice and vice versa) most likely generated the observed results? ü Given a set of homologous amino acid sequences, which are the transition probability values (from match to insert, to delete and vice versa) that best describe the family (i.e., which is their Hidden Markov Model)?

ü Given an amino acid sequence, what is the probability that the sequence has been generated by a Hidden Markov Model of a protein family (i.e., what is the probability that the new sequence belongs to the previously aligned family)?

ü Given an amino acid sequence, what is the most likely sequence of matches, insertions, and deletions that makes the sequence fit to the Hidden Markov Model of a protein family? In other words, what is the most likely alignment of the new sequence to the existing model?

Seq1: A C C – E Quantità Inizio - 1 1 – 2 2 – 3 3 - 4 4 - 5 5 - end Seq2: E C E – A Match - match 4 + 1 3 + 1 3 + 1 1 + 1 1 + 1 4 +1 Match - del 0 + 1 1 + 1 0 + 1 3 + 1 0 + 1 0 + 1 Seq3: A C E A A Ins – Del 0 + 1 0 + 1 0 + 1 0 + 1 0 + 1 0 + 1

Seq4: C – E - E Del – match 0 + 1 0 + 1 1 + 1 0 + 1 3 + 1 0 + 1

Frequenze States=7 (+ match-ins, Match - match 0.45 0.36 0.36 0.18 0.18 0.45 Ins-match, ins-ins) Match - del 0.09 0.18 0.09 0.36 0.09 0.09 We have to add a count For each state so the total Ins - Del 0.09 0.09 0.09 0.09 0.09 0.09 counts are 11) Del – match 0.09 0.09 0.18 0.09 0.36 0.09

0.09 delete

0.18 0.36

0.09 Ins

0.18 A 0.43 A 0.17 A 0.14 A 0.43 Inizio C 0.29 C 0.67 C 0.28 C 0.29 Match 0.45 0.36 0.45 E 0.29 E 0.17 0.36 E 0.57 0.18 E 0.29 Identification of functional subfamilies Identification of functional subfamilies

multiple sequence alignment, clustering and amino acid identification from functional subfamily

Structural model 1 = highly conserved

0 = unconserved Where Pi is the fraction of residues of amino acid type i, and M is the number of amino acid types Putative functional site Comparative modelling

mutation of one amino acid:

Ø the protein does not fold any more

Ø the protein accommodates the replacement with minor modifications: evolutionary pressure!

Ø the protein folds in a completely different structure

[ Chothia & Lesk (1986) ]