Chap. 2 Pairwise alignment

• The most basic question in sequence analysis: are two sequences related?
• Key issues:
1. What kinds of alignment should be considered?
2. What scoring system should be used to rank alignments?
3. What algorithm should be used to find optimal (or good) scoring alignments?
4. What statistical method should be used to evaluate the significance of an alignment score?

2.1 Introduction

The scoring model

• Evolutionary forces that shape molecular sequences (e.g. DNA): mutation (substitution, insertion/deletion or indel) and selection (positive, negative, neutral).
• If the total log-likelihood score of an alignment (a measure of relatedness) is a sum of terms for each aligned pair of residues, plus terms for each gap, then we expect identities and conservative substitutions to be more likely in real alignments than by chance, so they receive positive scores. Non-conservative substitutions are expected less often than by chance and receive negative scores.

2.2 The scoring model

Substitution matrices (for ungapped global alignment)

• The odds ratio of the "match model" M to the unrelated (random) model R is

\[
\frac{p(x, y \mid M)}{p(x, y \mid R)} = \frac{\prod_i p_{x_i y_i}}{\prod_i q_{x_i} \prod_j q_{y_j}} = \prod_i \frac{p_{x_i y_i}}{q_{x_i} q_{y_i}}
\]

• Taking logarithms gives the additive log-odds score

\[
S(x, y) = \sum_i s(x_i, y_i), \qquad \text{where } s(a, b) = \log \frac{p_{ab}}{q_a q_b}
\]
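The log-odds score can be illustrated with a small sketch. The two-letter alphabet, the probabilities, and the function names below are hypothetical and chosen only to make the arithmetic easy to follow; base-2 logarithms (bits) are an assumption, since the slide does not fix the base.

```python
import math

def log_odds_matrix(p, q, base=2.0):
    """Return s(a, b) = log(p_ab / (q_a * q_b)) for every pair listed in p."""
    return {(a, b): math.log(p_ab / (q[a] * q[b]), base)
            for (a, b), p_ab in p.items()}

def ungapped_score(x, y, s):
    """Sum s(x_i, y_i) over the columns of an ungapped alignment."""
    if len(x) != len(y):
        raise ValueError("an ungapped alignment needs equal-length sequences")
    return sum(s[(a, b)] for a, b in zip(x, y))

# Hypothetical toy model: background frequencies q_a and joint
# probabilities p_ab under the match model (illustrative numbers only).
q = {"A": 0.5, "B": 0.5}
p = {("A", "A"): 0.4, ("B", "B"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1}

s = log_odds_matrix(p, q)
score = ungapped_score("AAB", "ABB", s)
```

In this toy model identities are more likely under M than under R, so they score positively, while mismatches score negatively; the alignment score is simply the sum over columns.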

Chemical properties of amino acids

A match score of +3 and a mismatch score of -1 may be good enough for DNA, but not for proteins: for example, leucine is much more likely to be replaced by isoleucine than by glutamate.

(Figure source: Taylor W.R. (1986))

Gap penalties

• Linear penalty score for a gap of length g:

\[ \gamma(g) = -g d \]

• Or affine score:

\[ \gamma(g) = -d - (g - 1) e \]

where d is the gap-open penalty and e is the gap-extension penalty.
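The two gap models can be compared directly with a minimal sketch. The function names and the penalty values d = 8 and e = 2 are assumptions made for illustration, not values taken from the slides.

```python
def linear_gap(g, d):
    """Linear gap score: gamma(g) = -g * d."""
    if g < 1:
        raise ValueError("gap length must be at least 1")
    return -g * d

def affine_gap(g, d, e):
    """Affine gap score: gamma(g) = -d - (g - 1) * e."""
    if g < 1:
        raise ValueError("gap length must be at least 1")
    return -d - (g - 1) * e

# Hypothetical penalties: open d = 8, extend e = 2.
linear_scores = [linear_gap(g, 8) for g in (1, 2, 3)]
affine_scores = [affine_gap(g, 8, 2) for g in (1, 2, 3)]
```

Both models agree for a gap of length 1. With e smaller than d, the affine model penalizes each additional gap position less than the first, so one long gap costs less than several short ones.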

Dayhoff matrices (1978)

• A “chicken & egg” problem: score ↔ alignment. Scores are needed to build alignments, and alignments are needed to estimate scores.
• Two problems with a simple maximum likelihood estimate (MLE) from frequency counts in confirmed alignments:
– It is difficult to obtain a good random sample of confirmed alignments. Alignments tend not to be independent because protein sequences come in families.
– Different pairs of sequences have diverged by different amounts. This suggests that we should use scores that are matched to the expected divergence of the sequences we wish to compare.

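2.8 Deriving score parameters from alignment data

• A_ab: observed counts of aligned residue pairs (a, b).
• Normalizing each row of the counts gives relative substitution frequencies:

\[ a_{jk} = \frac{A_{jk}}{\sum_m A_{jm}} \]

• Scaling by a constant c gives the substitution probabilities:

\[ P_{jk} = c\, a_{jk} \quad (j \neq k), \qquad P_{jj} = c\, a_{jj} + (1 - c) \]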
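The construction of P from the counts A can be sketched as follows. The two-residue alphabet, the counts, and the value c = 0.1 are hypothetical; the scaling constant c is simply treated as given here. Off-diagonal entries are taken as c a_jk, and the diagonal absorbs the remaining 1 - c so that each row sums to 1.

```python
def substitution_probabilities(A, c):
    """Build P from a nested dict of counts A[j][k].

    a_jk = A_jk / sum_m A_jm
    P_jk = c * a_jk              for j != k
    P_jj = c * a_jj + (1 - c)
    """
    P = {}
    for j, row in A.items():
        total = sum(row.values())
        P[j] = {}
        for k, count in row.items():
            a_jk = count / total
            P[j][k] = c * a_jk + ((1.0 - c) if j == k else 0.0)
    return P

# Hypothetical counts for a two-residue alphabet (illustrative only).
A = {"X": {"X": 90, "Y": 10},
     "Y": {"X": 10, "Y": 40}}
P = substitution_probabilities(A, c=0.1)
row_sums = {j: sum(row.values()) for j, row in P.items()}
```

A smaller c shifts more probability onto the diagonal, which corresponds to a shorter evolutionary distance.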

PAM (Point Accepted Mutation) matrices

• Based on 1572 observed mutations in 71 families of closely related proteins (at least 85% identical).
• The PAM matrices imply a model of protein mutation in which the matrix for n PAM units is the n-th power of the PAM1 matrix:

\[ S(n) = S(1)^n \]

• The PAM1 matrix gives substitution probabilities for sequences that have experienced one point mutation for every hundred amino acids.
• Mutations may hit the same site more than once, so sequences reflected in the PAM250 matrix have experienced 250 mutation events for every 100 amino acids, yet only 80 out of every 100 amino acids have been affected.
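The relation S(n) = S(1)^n is a plain matrix power, which a short sketch can make concrete. The 2 x 2 matrix below is a hypothetical stand-in for the real 20 x 20 PAM1 matrix, with a 1% chance of change per step; the helper names are likewise illustrative.

```python
def mat_mul(X, Y):
    """Multiply two square matrices given as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(S, n):
    """Compute S**n by repeated squaring."""
    size = len(S)
    result = [[1.0 if i == j else 0.0 for j in range(size)]
              for i in range(size)]
    base = [row[:] for row in S]
    while n > 0:
        if n & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        n >>= 1
    return result

# Hypothetical two-state "PAM1-like" matrix: 1% change per step.
S1 = [[0.99, 0.01],
      [0.01, 0.99]]
S250 = mat_pow(S1, 250)
```

In this toy model 250 steps correspond to an expected 2.5 events per site, yet the probability of observing a different residue (the off-diagonal entry of S250) stays below 0.5, because repeated and reverse changes at the same site overlap.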

For More Divergent Sequences

Sequence Divergence Through Evolution

BLOSUM Matrices (1992) (BLOSUM62 ≈ PAM120, BLOSUM45 ≈ PAM250)

Target Frequencies, λ, and H (Altschul 1991)

• The most important property of a scoring matrix is its set of target frequencies, the expected frequencies of the individual residue pairs in true alignments. Target frequencies represent the underlying evolutionary model. With target frequencies q_ab and background frequencies p_a (the roles of p and q are swapped relative to the log-odds score in section 2.2), the scale λ is fixed by

\[ 1 = \sum_{a,b} q_{ab} = \sum_{a,b} p_a p_b e^{\lambda s_{ab}} \]

• The relative entropy H of a scoring matrix conveniently summarizes its general behavior. H is the average information per aligned position, in bits (or nats), and is always positive:

\[ H = \sum_{a=1}^{20} \sum_{b=1}^{a} q_{ab} \lambda s_{ab} = \sum_{a=1}^{20} \sum_{b=1}^{a} q_{ab} \ln \frac{q_{ab}}{p_a p_b} \]
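Given a scoring matrix and background frequencies, λ can be found numerically and H computed from it. The sketch below uses bisection on a hypothetical two-letter alphabet with scores +1 for a match and -2 for a mismatch; it sums over all ordered pairs (a, b), which is an assumption about how the pair frequencies are indexed. It also assumes the expected score is negative and at least one score is positive, so that a unique positive λ exists.

```python
import math

def solve_lambda(s, p, lo=1e-6, hi=10.0, iters=200):
    """Bisection for the positive root of sum_ab p_a p_b exp(lam * s_ab) = 1."""
    def f(lam):
        return sum(p[a] * p[b] * math.exp(lam * s[(a, b)]) for (a, b) in s) - 1.0
    if f(lo) >= 0 or f(hi) <= 0:
        raise ValueError("no sign change: check scores or bracket [lo, hi]")
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def target_frequencies(s, p, lam):
    """q_ab = p_a p_b exp(lam * s_ab)."""
    return {(a, b): p[a] * p[b] * math.exp(lam * s[(a, b)]) for (a, b) in s}

def relative_entropy(s, p, lam):
    """H = sum_ab q_ab * lam * s_ab, in nats."""
    q = target_frequencies(s, p, lam)
    return sum(q_ab * lam * s[pair] for pair, q_ab in q.items())

# Hypothetical toy matrix: match +1, mismatch -2, uniform background.
p = {"A": 0.5, "B": 0.5}
s = {("A", "A"): 1, ("B", "B"): 1, ("A", "B"): -2, ("B", "A"): -2}

lam = solve_lambda(s, p)
q = target_frequencies(s, p, lam)
H = relative_entropy(s, p, lam)
```

For this toy matrix the equation reduces to a cubic in e^λ whose relevant root is the golden ratio, which gives a convenient check on the numerical solver.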

The relative entropy H of PAM matrices