
Centre for Integrative Bioinformatics VU

Introduction to … 2007, Lecture 8: Substitution Matrices

Centre for Integrative Bioinformatics VU [1] Substitution matrices – Sequence analysis 2006

Sequence Analysis

Finding relationships between genes and gene products of different species, including those at large evolutionary distances

Archaea

Domain Archaea is mostly composed of cells that live in extreme environments. While they are able to live elsewhere, they are usually not found there because outside of extreme environments they are competitively excluded by other organisms.

Species of the domain Archaea
• are not inhibited by antibiotics,
• lack peptidoglycan in their cell wall (unlike bacteria, which have this sugar/polypeptide compound),
• and can have branched carbon chains in the membrane lipids of the phospholipid bilayer.

Archaea (cont.)

• It is believed that Archaea are very similar to prokaryotes (e.g. bacteria) that inhabited the earth billions of years ago. It is also believed that eukaryotes evolved from Archaea, because they share many mRNA sequences, have similar RNA polymerases, and have introns.

• Therefore, it is generally assumed that the domains Archaea and Bacteria branched from each other very early in history, after which membrane infolding* produced eukaryotic cells in the archaean branch approximately 1.7 billion years ago.

There are three main groups of Archaea:
1. extreme halophiles (salt),
2. methanogens (methane-producing anaerobes),
3. and hyperthermophiles (e.g. living at temperatures >100 °C!).

*Membrane infolding is believed to have led to the nucleus of eukaryotic cells, which is a membrane-enveloped cell organelle that holds the cellular DNA. Prokaryotic cells are more primitive and do not have a nucleus.

Example of sequence database entry for GenBank

LOCUS       DRODPPC      4001 bp            INV       15-MAR-1990
DEFINITION  D.melanogaster decapentaplegic gene complex (DPP-C), complete cds.
ACCESSION   M30116
KEYWORDS    .
SOURCE      D.melanogaster, cDNA to mRNA.
  ORGANISM  Drosophila melanogaster
            Eukaryote; mitochondrial eukaryotes; Metazoa; Arthropoda;
            Tracheata; Insecta; Pterygota; Diptera; Brachycera; Muscomorpha;
            Ephydroidea; Drosophilidae; Drosophila.
REFERENCE   1 (bases 1 to 4001)
  AUTHORS   Padgett, R.W., St Johnston, R.D. and Gelbart, W.M.
  TITLE     A transcript from a Drosophila pattern gene predicts a protein
            homologous to the transforming growth factor-beta family
  JOURNAL   Nature 325, 81-84 (1987)
  MEDLINE   87090408
COMMENT     The initiation codon could be at either 1188-1190 or 1587-1589
FEATURES             Location/Qualifiers
     source          1..4001
                     /organism="Drosophila melanogaster"
                     /db_xref="taxon:7227"
     mRNA            <1..3918
                     /gene="dpp"
                     /note="decapentaplegic protein mRNA"
                     /db_xref="FlyBase:FBgn0000490"
     gene            1..4001
                     /note="decapentaplegic"
                     /gene="dpp"
                     /allele=""
                     /db_xref="FlyBase:FBgn0000490"
     CDS             1188..2954
                     /gene="dpp"
                     /note="decapentaplegic protein (1188 could be 1587)"
                     /codon_start=1
                     /db_xref="FlyBase:FBgn0000490"
                     /db_xref="PID:g157292"
                     /translation="MRAWLLLLAVLATFQTIVRVASTEDISQRFIAAIAPVAAHIPLA
                     SASGSGSGRSGSRSVGASTSTALAKAFNPFSEPASFSDSDKSHRSKTNKKPSKSDANR
                     ……………………
                     LGYDAYYCHGKCPFPLADHFNSTNAVVQTLVNNMNPGKVPKACCVPTQLDSVAMLYL
                     NDQSTBVVLKNYQEMTBBGCGCR"
BASE COUNT  1170 a 1078 c 956 g 797 t
ORIGIN
        1 gtcgttcaac agcgctgatc gagtttaaat ctataccgaa atgagcggcg gaaagtgagc
       61 cacttggcgt gaacccaaag ctttcgagga aaattctcgg acccccatat acaaatatcg
      121 gaaaaagtat cgaacagttt cgcgacgcga agcgttaaga tcgcccaaag atctccgtgc
      181 ggaaacaaag aaattgaggc actattaaga gattgttgtt gtgcgcgagt gtgtgtcttc
      241 agctgggtgt gtggaatgtc aactgacggg ttgtaaaggg aaaccctgaa atccgaacgg
      301 ccagccaaag caaataaagc tgtgaatacg aattaagtac aacaaacagt tactgaaaca
      361 gatacagatt cggattcgaa tagagaaaca gatactggag atgcccccag aaacaattca
      421 attgcaaata tagtgcgttg cgcgagtgcc agtggaaaaa tatgtggatt acctgcgaac
      481 cgtccgccca aggagccgcc gggtgacagg tgtatccccc aggataccaa cccgagccca
      541 gaccgagatc cacatccaga tcccgaccgc agggtgccag tgtgtcatgt gccgcggcat
      601 accgaccgca gccacatcta ccgaccaggt gcgcctcgaa tgcggcaaca caattttcaa
          ………………………….
     3841 aactgtataa acaaaacgta tgccctataa atatatgaat aactatctac atcgttatgc
     3901 gttctaagct aagctcgaat aaatccgtac acgttaatta atctagaatc gtaagaccta
     3961 acgcgtaagc tcagcatgtt ggataaatta atagaaacga g
//

Example of protein sequence database entry for SWISS-PROT (now UniProt)

ID   DECA_DROME     STANDARD;      PRT;   588 AA.
AC   P07713;
DT   01-APR-1988 (REL. 07, CREATED)
DT   01-APR-1988 (REL. 07, LAST SEQUENCE UPDATE)
DT   01-FEB-1995 (REL. 31, LAST ANNOTATION UPDATE)
DE   DECAPENTAPLEGIC PROTEIN PRECURSOR (DPP-C PROTEIN).
GN   DPP.
OS   DROSOPHILA MELANOGASTER (FRUIT FLY).
OC   EUKARYOTA; METAZOA; ARTHROPODA; INSECTA; DIPTERA.
RN   [1]
RP   SEQUENCE FROM N.A.
RM   87090408
RA   PADGETT R.W., ST JOHNSTON R.D., GELBART W.M.;
RL   NATURE 325:81-84 (1987)
RN   [2]
RP   CHARACTERIZATION, AND SEQUENCE OF 457-476.
RM   90258853
RA   PANGANIBAN G.E.F., RASHKA K.E., NEITZEL M.D., HOFFMANN F.M.;
RL   MOL. CELL. BIOL. 10:2669-2677(1990).
CC   -!- FUNCTION: DPP IS REQUIRED FOR THE PROPER DEVELOPMENT OF THE
CC       EMBRYONIC DORSAL HYPODERM, FOR VIABILITY OF LARVAE AND FOR CELL
CC       VIABILITY OF THE EPITHELIAL CELLS IN THE IMAGINAL DISKS.
CC   -!- SUBUNIT: HOMODIMER, DISULFIDE-LINKED.
CC   -!- SIMILARITY: TO OTHER GROWTH FACTORS OF THE TGF-BETA FAMILY.
DR   EMBL; M30116; DMDPPC.
DR   PIR; A26158; A26158.
DR   HSSP; P08112; 1TFG.
DR   FLYBASE; FBGN0000490; DPP.
DR   PROSITE; PS00250; TGF_BETA.
KW   GROWTH FACTOR; DIFFERENTIATION; SIGNAL.
FT   SIGNAL        1      ?       POTENTIAL.
FT   PROPEP        ?    456
FT   CHAIN       457    588       DECAPENTAPLEGIC PROTEIN.
FT   DISULFID    487    553       BY SIMILARITY.
FT   DISULFID    516    585       BY SIMILARITY.
FT   DISULFID    520    587       BY SIMILARITY.
FT   DISULFID    552    552       INTERCHAIN (BY SIMILARITY).
FT   CARBOHYD    120    120       POTENTIAL.
FT   CARBOHYD    342    342       POTENTIAL.
FT   CARBOHYD    377    377       POTENTIAL.
FT   CARBOHYD    529    529       POTENTIAL.
SQ   SEQUENCE   588 AA;  65850 MW;  1768420 CN;
     MRAWLLLLAV LATFQTIVRV ASTEDISQRF IAAIAPVAAH IPLASASGSG SGRSGSRSVG
     ASTSTAGAKA FNRFSEPASF SDSDKSHRSK TNKKPSKSDA NRQFNEVHKP RTDQLENSKN
     KSKQLVNKPN HNKMAVKEQR SHHKKSHHHR SHQPKQASAS TESHQSSSIE SIFVEEPTLV
     LDREVASINV PANAKAIIAE QGPSTYSKEA LIKDKLKPDP STYLVEIKSL LSLFNMKRPP
     KIDRSKIIIP EPMKKLYAEI MGHELDSVNI PKPGLLTKSA NTVRSFTHKD SKIDDRFPHH
     HRFRLHFDVK SIPADEKLKA AELQLTRDAL SQQVVASRSS ANRTRYQBLV YDITRVGVRG
     QREPSYLLLD TKTBRLNSTD TVSLDVQPAV DRWLASPQRN YGLLVEVRTV RSLKPAPHHH
     VRLRRSADEA HERWQHKQPL LFTYTDDGRH DARSIRDVSG GEGGGKGGRN KRHARRPTRR
     KNHDDTCRRH SLYVDFSDVG WDDWIVAPLG YDAYYCHGKC PFPLADHRNS TNHAVVQTLV
     NNMNPGKBPK ACCBPTQLDS VAMLYLNDQS TVVLKNYQEM TVVGCGCR

Definition of a substitution matrix

• Two-dimensional matrix with score values describing the probability of one amino acid or nucleotide being replaced by another during sequence evolution.

Scoring matrices for nucleotide sequences

• Can be simple:
  • e.g. positive value for match and zero for mismatch.
  • frequencies of mutation are equal for all bases.

• Can be more complicated:
  • taking into account transitions and transversions (e.g. Kimura model).

Scoring matrices for nucleotide sequences

• Simple model

      A  C  T  G
  A   1  0  0  0
  C   0  1  0  0
  T   0  0  1  0
  G   0  0  0  1

• Kimura
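The two kinds of nucleotide scoring matrix can be sketched in a few lines of Python. The simple model reproduces the match/mismatch table; the second function only illustrates the idea of scoring transitions (purine to purine, pyrimidine to pyrimidine) differently from transversions. Its score values are hypothetical placeholders, not parameters of the Kimura model, and the function names are illustrative.

```python
BASES = "ACTG"
PURINES = {"A", "G"}  # C and T are the pyrimidines


def simple_matrix(match=1, mismatch=0):
    """Simple model: one score for a match, one for any mismatch."""
    return {(a, b): (match if a == b else mismatch)
            for a in BASES for b in BASES}


def transition_transversion_matrix(match=1, transition=0, transversion=-1):
    """Distinguish transitions from transversions (scores are hypothetical)."""
    scores = {}
    for a in BASES:
        for b in BASES:
            if a == b:
                scores[(a, b)] = match
            elif (a in PURINES) == (b in PURINES):
                # Both purines or both pyrimidines: a transition.
                scores[(a, b)] = transition
            else:
                scores[(a, b)] = transversion
    return scores


simple = simple_matrix()
ti_tv = transition_transversion_matrix()
```

Looking up `ti_tv[("A", "G")]` gives the transition score and `ti_tv[("A", "C")]` the transversion score.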

What is better to align? DNA or protein sequences?

1. Many mutations within DNA are synonymous ⇒ divergence overestimation

2. Evolutionary relationships can be more accurately expressed using a 20×20 amino acid exchange table

3. DNA sequences contain non-coding regions, which should be avoided in homology searches.

4. Non-coding regions are still an issue when translating DNA into (six) protein sequences, one per reading frame, through a codon table.

5. Searching at protein level: frameshifts can occur, leading to stretches of incorrect amino acids and possibly elongation.

However, frameshifts normally result in stretches of highly unlikely amino acids.
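Points 4 and 5 refer to translating a DNA sequence in all six reading frames (three per strand). The following is a minimal sketch of that step, assuming the standard genetic code; the function names are illustrative, and incomplete or ambiguous codons are simply rendered as "X".

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate consecutive complete codons; '*' marks a stop codon."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna):
    """Return the translations of frames +1..+3 and -1..-3."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frame_translation("atggcctaa")
```

Only one of the six frames normally corresponds to the real protein; a frameshift in the DNA moves the rest of the sequence into one of the other frames.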

So? Rule of thumb:

⇒ if ORF exists, then align at protein level

Scoring matrices for amino acid sequences

• Are complicated; scoring has to reflect:
  • Physico-chemical properties of amino acids
  • Likelihood of residues being substituted among truly homologous sequences

• Certain amino acids with similar properties can be substituted more easily, because such substitutions preserve structure/function

• “Disruptive” substitution is less likely to be selected in evolution (e.g. non-functional proteins)

Scoring matrices for amino acid sequences

[Figure; label: Main chain]

Example: … are very common in metal binding motifs

[Figure; labels: Zn, histidine]

Now let’s think about alignments

• Let’s consider a simple alignment: ungapped global alignment of two (protein) sequences, x and y, of length n.

• In scoring this alignment, we would like to assess whether these two sequences have a common ancestor, or whether they are aligned by chance.

$$\frac{\Pr(x, y \mid M)}{\Pr(x, y \mid R)}$$

Numerator: the sequences have a common ancestor. Denominator: the sequences are aligned by chance.

• We therefore want our amino acid substitution table (matrix) to score an alignment by estimating this ratio (= improvement over random).

• In brief, each substitution score is the log-odds probability that amino acid a could change (mutate) into amino acid b through evolution, based on the constraints of our evolutionary model.

Target and background probabilities

• Background probability

If $q_a$ is the frequency of amino acid a in one sequence and $q_b$ is the frequency of amino acid b in another sequence, then the probability of the alignment being random is given by:

Example alignment:

x: A A R S
y: V V K S

$$\Pr(x, y \mid R) = \prod_i q_{x_i} \prod_i q_{y_i}$$

• Target probability

If $p_{ab}$ is now the probability that amino acids a and b have derived from a common ancestor, then the probability that the alignment is due to common ancestry is given by:

Example alignment:

x: A A R S
y: V V K S

$$\Pr(x, y \mid M) = \prod_i p_{x_i y_i}$$

Source of target and background probabilities: high confidence alignments

• Target frequencies

• The “evolutionary true” alignments allow us to get biologically permissible amino acid mutations and derive the frequencies of observed pairs. These are the TARGET frequencies (20x20 combinations).

• Background frequencies

• The BACKGROUND frequencies are simply the frequency at which each amino acid type is observed in these “trusted” data sets (20 values).
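As a minimal sketch of how target and background frequencies could be tallied from trusted ungapped alignments: each aligned column contributes one residue pair, counted in both orders so that the target frequencies are symmetric. The function name and the single toy alignment are hypothetical; real matrices are derived from far larger data sets and with additional corrections.

```python
from collections import Counter


def target_and_background(trusted_alignments):
    """Return (target, background) frequencies from ungapped aligned pairs."""
    pair_counts = Counter()
    residue_counts = Counter()
    for x, y in trusted_alignments:
        if len(x) != len(y):
            raise ValueError("aligned sequences must have equal length")
        for a, b in zip(x, y):
            # Count each column in both orders so that p_ab == p_ba.
            pair_counts[(a, b)] += 1
            pair_counts[(b, a)] += 1
            residue_counts[a] += 1
            residue_counts[b] += 1
    n_pairs = sum(pair_counts.values())
    n_residues = sum(residue_counts.values())
    target = {pair: n / n_pairs for pair, n in pair_counts.items()}
    background = {a: n / n_residues for a, n in residue_counts.items()}
    return target, background


# Toy "trusted" alignment (hypothetical data).
target, background = target_and_background([("AARS", "AAKS")])
```

Both dictionaries sum to 1, so they can be used directly as the probabilities p and q in the formulas of this lecture.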

Log-odds

• Substitution matrices apply logarithmic conversions to describe the probability of amino acid substitutions

• The converted values are the so-called log-odds scores

• So they are simply the logarithms of the ratio between the observed mutation frequency and the frequency of substitution expected by random chance (target / background)

Formulas

• Odds-ratio of two probabilities

$$\frac{\Pr(x, y \mid M)}{\Pr(x, y \mid R)} = \frac{\prod_i p_{x_i y_i}}{\prod_i q_{x_i} \prod_i q_{y_i}} = \prod_i \frac{p_{x_i y_i}}{q_{x_i} q_{y_i}}$$

• The log-odds score of an alignment is therefore given by

$$\log \frac{\Pr(x, y \mid M)}{\Pr(x, y \mid R)} = \sum_i \log \left( \frac{p_{x_i y_i}}{q_{x_i} q_{y_i}} \right)$$

using the identity

$$\log \prod_i x_i = \sum_i \log x_i$$

Logarithmic functions

Logarithms to various bases: red is to base e, green is to base 10, and purple is to base 1.7. Each tick on the axes is one unit. Logarithms of all bases pass through the point (1, 0), because any number raised to the power 0 is 1, and through the points ( b, 1) for base b, because any number raised to the power 1 is itself. The curves approach the y axis but do not reach it, due to the singularity of a logarithm at x = 0.

http://en.wikipedia.org/wiki/Logarithm

So… for a given substitution matrix:

• a positive score means that the frequency of amino acid substitutions found in the high confidence alignments is greater than would have occurred by random chance

• a zero score … that the freq. is equal to that expected by chance

• a negative score … that the freq. is less than that expected by chance

Alignment score

• The alignment score S is given by the sum of all amino acid pair substitution scores:

$$S = \sum_i s(x_i, y_i) = \log \frac{\Pr(x, y \mid M)}{\Pr(x, y \mid R)}$$

• Where the substitution score for any amino acid pair [a,b ] is given by:

$$s(a, b) = \log \frac{p_{ab}}{q_a q_b}$$
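The substitution score can be checked with a few numbers. The sketch below evaluates s(a,b) for made-up target and background probabilities; the base of the logarithm is an assumption (base 2 here, giving scores in bits), since the slides do not fix one.

```python
import math


def log_odds_score(p_ab, q_a, q_b, base=2):
    """s(a,b) = log(p_ab / (q_a * q_b)); base 2 is an assumed choice."""
    return math.log(p_ab / (q_a * q_b), base)


# Hypothetical numbers: both residues have background frequency 0.1,
# so a pair is expected by chance with probability 0.1 * 0.1 = 0.01.
more_than_chance = log_odds_score(0.04, 0.1, 0.1)    # pair seen 4x more often
equal_to_chance = log_odds_score(0.01, 0.1, 0.1)     # pair seen as expected
less_than_chance = log_odds_score(0.0025, 0.1, 0.1)  # pair seen 4x less often
```

The three cases give a positive, a (near) zero and a negative score respectively, matching the interpretation of the score signs given earlier.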

Alignment score

• The total score of an alignment:

EAAS
VF-T

• would be:

$$S = s(E, V) + s(A, F) + \gamma(1) + s(S, T)$$

where $\gamma(1)$ is the penalty for a gap of length 1.
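The sum for this example can be sketched in code. The three substitution scores below are read from the PAM 250 table shown later in this lecture; the gap penalty function is a hypothetical choice, since the slides give no value for γ, and the function names are illustrative.

```python
# Scores for the residue pairs in the example, as listed in the PAM 250 table
# of this lecture. Keys are unordered pairs.
SCORES = {
    frozenset("EV"): -2,
    frozenset("AF"): -4,
    frozenset("ST"): 1,
}


def gap_penalty(length):
    """Hypothetical affine penalty: -8 to open, -2 per extra gap position."""
    return -(8 + 2 * (length - 1))


def alignment_score(x, y, scores, gamma):
    """Sum substitution scores and one gap penalty per run of gaps."""
    if len(x) != len(y):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    run_side, run_len = None, 0
    for a, b in zip(x, y):
        side = "x" if a == "-" else "y" if b == "-" else None
        if side != run_side and run_len:
            total += gamma(run_len)  # close the gap run that just ended
            run_len = 0
        run_side = side
        if side is None:
            total += scores[frozenset((a, b))]
        else:
            run_len += 1
    if run_len:
        total += gamma(run_len)  # gap run reaching the end of the alignment
    return total


S = alignment_score("EAAS", "VF-T", SCORES, gap_penalty)
```

With these assumed values the total is (-2) + (-4) + (-8) + 1; a different gap penalty changes only the third term.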

Empirical matrices

• Are based on surveys of actual amino acid substitutions among related proteins

• Most widely used: PAM and BLOSUM

The PAM series

• The first systematic method to derive amino acid substitution matrices was developed by Margaret Dayhoff et al. (1978), Atlas of Protein Sequence and Structure.

• These widely used substitution matrices are frequently called Dayhoff, MDM (Mutation Data Matrix), or PAM (Point Accepted Mutation) matrices.

• Key idea: trusted alignments of closely related sequences provide information about biologically permissible mutations.

The PAM design

• Step 1. Dayhoff used 71 protein families, made hypothetical phylogenetic trees and recorded the number of observed substitutions (along each branch of the tree) in a 20x20 target matrix.

• Step 2. The target matrix was then converted to frequencies by dividing each cell (a,b) by the sum of all substitutions of a.

$$\Pr(b \mid a) = \frac{A_{ab}}{\sum_c A_{ac}}$$

• Step 3. The target matrix was normalized so that the expected number of substitutions covered 1% of the protein (PAM-1).

$$\Pr(b \mid a, t = 1)$$

• Step 4. Determine the final substitution matrix.

$$s(a, b \mid t) = \log \frac{p_{ab}}{q_a q_b} = \log \frac{\Pr(b \mid a, t)}{q_b}$$
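Steps 2 to 4 can be sketched on a toy three-letter alphabet. This is a simplified illustration under stated assumptions, not Dayhoff's exact procedure: the count matrix (which here includes unchanged positions on the diagonal) is invented, the background frequencies are taken from its row totals, and the 1% normalization is done by one global scaling factor. The base-10 logarithm and all names are likewise assumptions.

```python
import math

# Hypothetical counts A[a][b]; the diagonal holds unchanged positions.
COUNTS = {
    "A": {"A": 90, "S": 8, "T": 2},
    "S": {"A": 8, "S": 85, "T": 7},
    "T": {"A": 2, "S": 7, "T": 91},
}


def pam1_from_counts(counts, target_change=0.01):
    letters = sorted(counts)
    row_sums = {a: sum(counts[a].values()) for a in letters}
    total = sum(row_sums.values())
    q = {a: row_sums[a] / total for a in letters}  # background frequencies
    # Step 2: Pr(b|a) = A_ab / sum_c A_ac
    cond = {a: {b: counts[a][b] / row_sums[a] for b in letters}
            for a in letters}
    # Step 3: rescale the off-diagonal entries so that, on average,
    # 1% of the positions change.
    expected_change = sum(q[a] * (1 - cond[a][a]) for a in letters)
    scale = target_change / expected_change
    pam1 = {}
    for a in letters:
        row = {b: scale * cond[a][b] for b in letters if b != a}
        row[a] = 1.0 - sum(row.values())
        pam1[a] = row
    return pam1, q


def log_odds(pam, q, base=10):
    # Step 4: s(a,b|t) = log(Pr(b|a,t) / q_b)
    return {a: {b: math.log(pam[a][b] / q[b], base) for b in pam[a]}
            for a in pam}


pam1, q = pam1_from_counts(COUNTS)
scores = log_odds(pam1, q)
```

At this short distance, keeping a residue scores positive and every substitution scores negative, because almost all positions are still unchanged.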

PAM units

• One PAM unit is defined as the evolutionary distance at which 1% of the amino acid positions have been changed

• E.g. to construct the PAM 1 substitution table, a group of closely related sequences with mutation frequencies corresponding to one PAM unit is chosen

But there is a whole series of matrices: PAM 10 … PAM 250

• These matrices are extrapolated from the PAM 1 matrix (by matrix multiplication)

[Figure: the 20×20 matrix (rows and columns ARNDCQEGHILKMFPSTWYV) shown as matrix × matrix × matrix = matrix]

Multiply matrices N times to make PAM ‘N’; then take the log.

• So: a PAM is a relative measure of evolutionary distance
• 1 PAM = 1 accepted mutation per 100 amino acids
• 250 PAM = 250 mutations per 100 amino acids, so 2.5 accepted mutations per amino acid
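The extrapolation by repeated multiplication can be sketched on a toy three-letter alphabet. The PAM 1-like matrix below is invented for illustration (each residue stays unchanged with probability 0.99), the background frequencies are assumed uniform, and plain Python lists are used instead of a numerical library.

```python
def mat_mul(p, r):
    """Multiply two square matrices given as lists of rows."""
    n = len(p)
    return [[sum(p[i][k] * r[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def pam_n(pam1, n):
    """Multiply the PAM 1 mutation probability matrix by itself n times."""
    size = len(pam1)
    result = [[1.0 if i == j else 0.0 for j in range(size)]
              for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, pam1)
    return result


def expected_identity(matrix, freqs):
    """Expected fraction of positions that still show the original residue."""
    return sum(q * matrix[i][i] for i, q in enumerate(freqs))


# Hypothetical PAM 1-like matrix: 1% expected change per step.
TOY_PAM1 = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]
FREQS = [1 / 3, 1 / 3, 1 / 3]

identity_1 = expected_identity(pam_n(TOY_PAM1, 1), FREQS)
identity_250 = expected_identity(pam_n(TOY_PAM1, 250), FREQS)
```

Even after 250 steps the expected identity stays above the chance level of 1/3 for this toy alphabet, because many substitutions hit positions that had already changed or revert to the original residue. This is why 250 PAM does not mean that every position differs.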

PAM numbers vs. observed amino acid mutational rates

PAM      Observed            Sequence
Number   Mutation Rate (%)   Identity (%)
  0        0                 100
  1        1                  99
 30       25                  75
 80       50                  50
110       60                  40
200       75                  25
250       80                  20

Note Think about intermediate “substitution” steps …

The PAM 250 matrix

A   2
R  -2  6
N   0  0  2
D   0 -1  2  4
C  -2 -4 -4 -5 12
Q   0  1  1  2 -5  4
E   0 -1  1  3 -5  2  4
G   1 -3  0  1 -3 -1  0  5
H  -1  2  2  1 -3  3  1 -2  6
I  -1 -2 -2 -2 -2 -2 -2 -3 -2  5
L  -2 -3 -3 -4 -6 -2 -3 -4 -2  2  6
K  -1  3  1  0 -5  1  0 -2  0 -2 -3  5
M  -1  0 -2 -3 -5 -1 -2 -3 -2  2  4  0  6
F  -4 -4 -4 -6 -4 -5 -5 -5 -2  1  2 -5  0  9
P   1  0 -1 -1 -3  0 -1 -1  0 -2 -3 -1 -2 -5  6
S   1  0  1  0  0 -1  0  1 -1 -1 -3  0 -2 -3  1  2
T   1 -1  0  0 -2 -1  0  0 -1  0 -2  0 -1 -3  0  1  3
W  -6  2 -4 -7 -8 -5 -7 -7 -3 -5 -2 -3 -4  0 -6 -2 -5 17
Y  -3 -4 -2 -4  0 -4 -4 -5  0 -1 -1 -4 -2  7 -5 -3 -3  0 10
V   0 -2 -2 -2 -2 -2 -2 -1 -2  4  2 -2  2 -1 -1 -1  0 -6 -2  4
B   0 -1  2  3 -4  1  2  0  1 -2 -3  1 -2 -5 -1  0  0 -5 -3 -2  2
Z   0  0  1  3 -5  3  3 -1  2 -2 -3  0 -2 -5  0  0 -1 -6 -4 -2  2  3
    A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z

Note: the W-R exchange is too large (due to paucity of data).

PAM model

• The scores derived through the PAM model are an accurate description of the information content (or the relative entropy) of an alignment (Altschul, 1991).

• PAM 1 corresponds to about 1 million years of evolution.

• PAM 120 has the largest information content of the PAM matrix series: “best” for general alignment.

• PAM 250 is the traditionally most popular matrix: “best” for detecting distant sequence similarity.

Summary: Dayhoff’s PAM matrices

• Derived from global alignments of closely related sequences.

• Matrices for greater evolutionary distances are extrapolated from those for smaller ones.

• The number with the matrix (PAM 40, PAM 100) refers to the evolutionary distance; greater numbers are greater distances.

• Attempts to extend Dayhoff's methodology or re-apply her analysis using databases with more examples:
  • Jones, Thornton and coworkers used the same methodology as Dayhoff but with modern databases (CABIOS 8:275)
  • Gonnet and coworkers (Science 256:1443) used a slightly different (but theoretically equivalent) methodology

The BLOSUM series

• BLOSUM stands for: BLOcks SUbstitution Matrices

• Created by Steve Henikoff and Jorja Henikoff (PNAS 89:10915).

• Derived from local, un-gapped alignments of distantly related sequences.

• All matrices are directly calculated; no extrapolations are used.

• Again: compare observed freqs of each pair to expected freqs
• Then: log-odds matrix.

The Blocks database

• The Blocks Database contains multiple alignments of conserved regions in protein families.

• Blocks are multiply aligned un-gapped segments corresponding to the most highly conserved regions of proteins.

• The blocks for the BLOCKS database are made automatically by looking for the most highly conserved regions in groups of proteins represented in the PROSITE database. These blocks are then calibrated against the SWISS-PROT database to obtain a measure of the random distribution of matches. It is these calibrated blocks that make up the BLOCKS database.

• The database can be searched to classify protein and nucleotide sequences.

The Blocks database

[Figure: gapless alignment blocks]

The BLOSUM series

• BLOSUM 30, 35, 40, 45, 50, 55, 60, 62, 65, 70, 75, 80, 85, 90.

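• The number after the matrix (BLOSUM 62) refers to the percent identity threshold applied to the blocks (in the BLOCKS database) used to construct the matrix: sequences within a block that have >=62% sequence identity are clustered and counted as a single sequence, so the matrix reflects substitutions between sequences that are less than 62% identical;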

• No extrapolations are made in going to higher evolutionary distances

• High number: closely related sequences
• Low number: distant sequences

• BLOSUM62 is the most popular: best for general alignment.

The log-odds matrix for BLOSUM 62

PAM versus BLOSUM

PAM:
• Based on an explicit evolutionary model
• Derived from small, closely related proteins with ~15% divergence
• Higher PAM numbers to detect more remote sequence similarities
• Errors in PAM 1 are scaled 250X in PAM 250

BLOSUM:
• Based on empirical frequencies
• Uses much larger, more diverse set of protein sequences (30-90% ID)
• Lower BLOSUM numbers to detect more remote sequence similarities
• Errors in BLOSUM arise from errors in alignment

Comparing exchange matrices

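• To compare amino acid exchange matrices, the "Entropy" value can be used. This is a relative entropy value (H) which describes the amount of information available per aligned residue pair.

$$H = \sum_{i,j} s_{ij} \log_2 \frac{s_{ij}}{p_i p_j}$$

For H to be a relative entropy, $s_{ij}$ must be read here as the target frequency of the aligned pair (i, j), and $p_i$ and $p_j$ as the background frequencies.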
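The relative entropy can be sketched for a toy two-letter alphabet. The pair frequencies are interpreted as target frequencies and the single-letter frequencies as background frequencies, which is an assumption about the notation; all numbers and names are invented for illustration.

```python
import math


def relative_entropy(target, background):
    """H = sum over pairs of p_ij * log2(p_ij / (p_i * p_j)), in bits."""
    return sum(p * math.log2(p / (background[a] * background[b]))
               for (a, b), p in target.items() if p > 0)


BACKGROUND = {"A": 0.5, "B": 0.5}

# Hypothetical target frequencies: identical pairs are over-represented.
CONSERVED = {("A", "A"): 0.4, ("B", "B"): 0.4,
             ("A", "B"): 0.1, ("B", "A"): 0.1}

# Pairs occur exactly as often as expected by chance.
RANDOM = {(a, b): BACKGROUND[a] * BACKGROUND[b]
          for a in BACKGROUND for b in BACKGROUND}

h_conserved = relative_entropy(CONSERVED, BACKGROUND)
h_random = relative_entropy(RANDOM, BACKGROUND)
```

The value is positive when the target frequencies differ from chance and drops to zero when they converge to the random model, which is what happens to a matrix series at very large evolutionary distances.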

Evolution and Matrix “landscape”

• Recent evolution →

• Ancient evolution → convergence to random model

Specialized matrices

• Several other amino acid exchange matrices have been constructed, for situations in which non-standard amino acid frequencies occur

• Secondary structure based (Lüthy R, McLachlan AD, Eisenberg D, Proteins 1991; 10(3):229-39)

[Figure panels: E, H, C]

Specialized matrices

• Transmembrane specific substitution matrices:

• PHAT (Ng P, Henikoff J, Henikoff S, Bioinformatics 2000;16(9):760-766) Built from predicted hydrophobic and transmembrane regions of the Blocks database

• BATMAS (Sutormin RA, Rakhmaninova AB, Gelfand S, Proteins 2003; 51(1):85-95) Derived from predicted TM-kernels of bacterial proteins

A note on reliability

• All these matrices are designed using standard evolutionary models.

• Circular problem: alignments are needed to derive a matrix, and a matrix is needed to make the alignments.

• It is important to understand that evolution is not the same for all proteins, not even for the same regions of proteins.

• No single matrix performs best on all sequences. Some are better for sequences with few gaps, and others are better for sequences with fewer identical amino acids.

• Therefore, when aligning sequences, applying a general model to all cases is not ideal. Rather, re-adjustment can be used to make the general model better fit the given data.

Pair-wise alignment quality versus sequence identity

• Vogt et al., JMB 249, 816-831, 1995

[Figure label: Twilight zone]

Take-home messages - 1

• If ORF exists, then align at protein level.

• Amino acid substitution matrices reflect the log-odds ratio between the evolutionary and random model and can therefore help in determining homology via the alignment score.

• The evolutionary and random models depend on generalized data sets used to derive them. This is not an ideal solution.

Take-home messages - 2

• Apart from the PAM and BLOSUM series, a great number of further matrices have been developed.

• Matrices have been made based on DNA, protein structure, information content, etc.

• For local alignment, BLOSUM62 is often superior; for distant (global) alignments, BLOSUM50, GONNET, or (still) PAM250 work well.

• Remember that gap penalties are always a problem: unlike the matrices themselves, there is no formal way to calculate their values -- you can follow recommended settings, but these are based on trial and error and not on a formal framework.
