APPLICATIONS OF HMMS

CSE/BIMM/BENG 181 MAY 17, 2011 SERGEI L KOSAKOVSKY POND [SPOND@UCSD.EDU] WWW.HYPHY.ORG/PUBS/181/LECTURES

OUTLINE

Definitions and terms

Training approaches

Sequence feature selection

Secondary structure prediction

Probabilistic alignment using HMMs: HMMER

Gene finding [next major topic]

Prokaryotic genes and generalized HMMs

Eukaryotic genes

DEFINITIONS AND REVIEW

A hidden Markov model (HMM) is a generative stochastic model that assigns probabilities to finite-length strings over an alphabet A. A four-tuple (A, Q, Pe, Pt) defines a hidden Markov model H:

A - the finite alphabet over which the observed strings are defined.

Q - the finite collection of hidden states of the model.

Pe (ai|qk) - the probability of emitting character i if the hidden state is k

Pt (qk|qm) - the probability of transition from hidden state m to hidden state k in one step

0 0 0 0 1 1 1 1

H H H H T T T T
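The four-tuple above maps directly onto a small data structure. The sketch below is illustrative only: the two-coin parameters are hypothetical, and the initial-state distribution `start` is an added assumption, since the four-tuple as written does not specify how the chain begins.

```python
import random
from dataclasses import dataclass

@dataclass
class HMM:
    alphabet: list   # A: observable symbols
    states: list     # Q: hidden states
    emit: dict       # emit[q][a]  = Pe(a | q)
    trans: dict      # trans[m][k] = Pt(k | m), i.e. one step from m to k
    start: dict      # initial-state distribution (assumption, not in the four-tuple)

def sample(hmm, n, rng):
    """Generate a hidden path and an observed string, both of length n."""
    path, obs = [], []
    q = rng.choices(hmm.states, weights=[hmm.start[s] for s in hmm.states])[0]
    for _ in range(n):
        path.append(q)
        obs.append(rng.choices(hmm.alphabet,
                               weights=[hmm.emit[q][a] for a in hmm.alphabet])[0])
        q = rng.choices(hmm.states,
                        weights=[hmm.trans[q][s] for s in hmm.states])[0]
    return path, obs

# Hypothetical two-coin model: state 0 favours heads, state 1 favours tails.
coin = HMM(
    alphabet=["H", "T"],
    states=[0, 1],
    emit={0: {"H": 0.9, "T": 0.1}, 1: {"H": 0.1, "T": 0.9}},
    trans={0: {0: 0.9, 1: 0.1}, 1: {0: 0.1, 1: 0.9}},
    start={0: 0.5, 1: 0.5},
)
path, obs = sample(coin, 8, random.Random(0))
```

Only `obs` would be visible in practice; `path` is the hidden part that decoding tries to recover.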

ALGORITHMS

Forward or backward (sum peeling): compute the probability of an observed string a1a2...an given emission and transition probabilities. Runs in time O(|Q|^2 n), or O(|Q| n) for sparse models.
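A minimal sketch of the forward recursion, assuming a non-empty observation, a dense transition table, and an explicit initial distribution `start` (the toy two-state model is hypothetical). A practical implementation would work in log space or rescale to avoid underflow on long strings.

```python
def forward(obs, states, start, trans, emit):
    """Return Pr(obs | model), summing over all hidden paths."""
    # f[q] = Pr(a_1..a_i, q_i = q), updated one observed symbol at a time
    f = {q: start[q] * emit[q][obs[0]] for q in states}
    for a in obs[1:]:
        f = {k: emit[k][a] * sum(f[m] * trans[m][k] for m in states)
             for k in states}
    return sum(f.values())

# Hypothetical two-state model over the alphabet {H, T}.
states = [0, 1]
start = {0: 0.5, 1: 0.5}
trans = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.1, 1: 0.9}}   # trans[m][k]: from m to k
emit = {0: {"H": 0.9, "T": 0.1}, 1: {"H": 0.1, "T": 0.9}}

p = forward("HHHHTTTT", states, start, trans, emit)
```

Each step does |Q| sums of |Q| terms, which is where the O(|Q|^2 n) running time comes from; with a sparse topology the inner sum only visits states that have a nonzero transition into k.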

Decoding (Viterbi): compute the sequence of hidden states q1q2...qn that is most likely to have given rise to an observed sequence a1a2...an. Runs in time O(|Q|^2 n), or O(|Q| n) for sparse models.
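Viterbi decoding can be sketched the same way, replacing the sum with a max and keeping back-pointers. The model parameters and the `start` distribution below are hypothetical; scores are kept as natural-log probabilities.

```python
import math

def viterbi(obs, states, start, trans, emit):
    """Return (log-probability, path) of the most likely hidden path."""
    def lg(p):
        return math.log(p) if p > 0 else float("-inf")

    v = {q: lg(start[q]) + lg(emit[q][obs[0]]) for q in states}
    back = []                                  # one back-pointer table per step
    for a in obs[1:]:
        ptr, nxt = {}, {}
        for k in states:
            best = max(states, key=lambda m: v[m] + lg(trans[m][k]))
            ptr[k] = best
            nxt[k] = v[best] + lg(trans[best][k]) + lg(emit[k][a])
        back.append(ptr)
        v = nxt
    q = max(states, key=lambda s: v[s])        # best final state
    score, path = v[q], [q]
    for ptr in reversed(back):                 # trace the path backwards
        q = ptr[q]
        path.append(q)
    return score, path[::-1]

# Hypothetical two-state model: state 0 favours H, state 1 favours T.
states = [0, 1]
start = {0: 0.5, 1: 0.5}
trans = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.1, 1: 0.9}}
emit = {0: {"H": 0.9, "T": 0.1}, 1: {"H": 0.1, "T": 0.9}}

score, path = viterbi("HHHHTTTT", states, start, trans, emit)
```

For this toy model the decoded path switches state exactly once, at the point where the observations change from H to T.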

Training: estimate transition and/or emission probabilities given

a set of labeled observed sequences (corresponding hidden states are known): frequency counts, possibly corrected

only observed sequences: Baum-Welch, or another non-linear optimization procedure

TRAINING HMMS FROM LABELED SEQUENCES

sequence:  CGATATTCGATTCTACGCGCGTATACTAGCTTATCTGATC
labels:   011111112222222111111222211111112222111110

(The label string is two characters longer than the sequence: the flanking 0s are the start and end state.)

TRANSITIONS: counts A_{i,j}, with a_{i,j} = A_{i,j} / \sum_{h=0}^{|Q|-1} A_{i,h}

             to state
from state   0         1          2
0            0 (0%)    1 (100%)   0 (0%)
1            1 (4%)    21 (84%)   3 (12%)
2            0 (0%)    3 (20%)    12 (80%)

EMISSIONS: counts E_{i,k}, with e_{i,k} = E_{i,k} / \sum_{h=0}^{|\Sigma|-1} E_{i,h}, where |\Sigma| is the alphabet size

             symbol
in state     A         C         G         T
1            6 (24%)   7 (28%)   5 (20%)   7 (28%)
2            3 (20%)   3 (20%)   2 (13%)   7 (47%)

EXAMPLE FROM: HTTP://WWW.GENEPREDICTION.ORG/BOOK/HMM-PART1.PPT
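The counts in this example can be reproduced with a few lines of code. This sketch assumes, as the counts imply, that the label string carries one extra state 0 at each end (start and end) and that state 0 emits nothing; the helper name `normalize` is illustrative.

```python
from collections import Counter

seq    = "CGATATTCGATTCTACGCGCGTATACTAGCTTATCTGATC"
labels = "011111112222222111111222211111112222111110"  # flanked by state 0

# Transition counts A[i, j]: every pair of consecutive labels.
A = Counter(zip(labels, labels[1:]))
# Emission counts E[i, k]: the inner labels line up with the sequence.
E = Counter(zip(labels[1:-1], seq))

def normalize(counts, row):
    """Turn one row of a count table into relative frequencies."""
    total = sum(c for (i, _), c in counts.items() if i == row)
    return {j: c / total for (i, j), c in counts.items() if i == row}

a1 = normalize(A, "1")   # transition probabilities out of state 1
e2 = normalize(E, "2")   # emission probabilities of state 2
```

Here `a1` gives 1/25, 21/25 and 3/25 for transitions from state 1 to states 0, 1 and 2, and `e2` gives 3/15, 3/15, 2/15 and 7/15 for A, C, G and T, matching the tables.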

PROTEIN STRUCTURE PREDICTION

A simple model states that each residue in a folded protein can be assigned to one of three structural features:

An α-helix (offset 4 hydrogen bonds)

A β-strand/sheet

Other (a loop, L)

[Figure: structure of protein 1DZOA]

Cheng and Baldi BMC Bioinformatics 2007 8:113

EMISSION AND TRANSITION FREQUENCIES

The frequency distributions of amino-acid residues differ between structural classes, and these differences can be used to estimate emission probabilities.

To estimate transition probabilities, we simply tabulate how frequently the transitions happen in a large reference dataset with known structure.

STATIONARY FREQUENCIES OF THE HIDDEN MARKOV CHAIN

GOLDMAN, THORNE AND JONES JMB 1996

TRAINING CAVEATS

The probabilities of rare transition events are difficult to estimate from count data.

Some state k may not appear in any of the training sequences. This means #k➔l = 0 for every state l and Pt(k,l) cannot be computed from counts.

One can ‘pad’ the observed counts with pseudocounts r that reflect our prior beliefs:

A_{k,l} = (# of k → l transitions) + r_{k,l}

E_{k,b} = (# of emissions of symbol b from state k) + r_k(b)
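A small sketch of padded estimation. It assumes a uniform pseudocount of 1 for every cell (one common choice, not something the slides prescribe) and adds a hypothetical state "3" that never occurs in the training data; the counts are those of the labeled-sequence example.

```python
def estimate(counts, rows, cols, pseudo=1.0):
    """Add `pseudo` to every cell, then normalize each row to sum to 1."""
    probs = {}
    for i in rows:
        padded = {j: counts.get((i, j), 0) + pseudo for j in cols}
        total = sum(padded.values())
        probs[i] = {j: padded[j] / total for j in cols}
    return probs

# Observed transition counts; state "3" is hypothetical and never observed.
counts = {("0", "1"): 1,
          ("1", "0"): 1, ("1", "1"): 21, ("1", "2"): 3,
          ("2", "1"): 3, ("2", "2"): 12}
P = estimate(counts, rows="0123", cols="0123", pseudo=1.0)
```

With plain counts the row for state "3" would be 0/0; with padding it falls back to the prior (here uniform), and every previously impossible transition receives a small nonzero probability.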

STRUCTURE INFERENCE

Given a trained HMM H and a sequence S we can:

Run Viterbi decoding to infer the most likely hidden path of α, β and L states for the sequence.

Use a forward-backward algorithm to compute the posterior probabilities that a given position i in the amino acid sequence is in an α-helix, a β-sheet or a loop:

p_{i,\alpha} = \Pr\{q_i = \alpha, S \mid H\} / \Pr\{S \mid H\}

p_{i,\beta} = \Pr\{q_i = \beta, S \mid H\} / \Pr\{S \mid H\}

p_{i,L} = \Pr\{q_i = L, S \mid H\} / \Pr\{S \mid H\}
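The posterior computation can be sketched with a forward and a backward pass. Everything about the model below is hypothetical: a three-state toy HMM over a reduced two-letter alphabet ("h" for hydrophobic, "p" for polar) stands in for a trained model over the 20 amino acids, and `start` is an assumed initial distribution.

```python
def posteriors(obs, states, start, trans, emit):
    """Return, for each position i, a dict q -> Pr(q_i = q | obs, model)."""
    # Forward pass: f[i][q] = Pr(a_1..a_i, q_i = q)
    f = [{q: start[q] * emit[q][obs[0]] for q in states}]
    for a in obs[1:]:
        prev = f[-1]
        f.append({k: emit[k][a] * sum(prev[m] * trans[m][k] for m in states)
                  for k in states})
    # Backward pass: b[i][m] = Pr(a_{i+1}..a_n | q_i = m)
    b = [{q: 1.0 for q in states}]
    for a in reversed(obs[1:]):
        nxt = b[0]
        b.insert(0, {m: sum(trans[m][k] * emit[k][a] * nxt[k] for k in states)
                     for m in states})
    total = sum(f[-1].values())                # Pr(S | H)
    return [{q: f[i][q] * b[i][q] / total for q in states}
            for i in range(len(obs))]

# Hypothetical toy parameters.
states = ["alpha", "beta", "loop"]
start = {"alpha": 0.3, "beta": 0.3, "loop": 0.4}
trans = {"alpha": {"alpha": 0.80, "beta": 0.05, "loop": 0.15},
         "beta":  {"alpha": 0.05, "beta": 0.80, "loop": 0.15},
         "loop":  {"alpha": 0.20, "beta": 0.20, "loop": 0.60}}
emit = {"alpha": {"h": 0.5, "p": 0.5},
        "beta":  {"h": 0.7, "p": 0.3},
        "loop":  {"h": 0.2, "p": 0.8}}

post = posteriors("hhhhppphh", states, start, trans, emit)
```

Unlike the single Viterbi path, `post[i]` reports how confident the model is about each position, which is what per-position probability plots display.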

[Figure: per-position secondary-structure predictions for protein d1jyoa, one panel per sequence; x-axis: residue position (0 to 120), y-axis: 0.0 to 0.8. Panels: query (weight=0.0963, Q3=68.5%); sequence 1 (weight=0.0963); sequence 2 (weight=0.146); sequence 3 (weight=0.129); sequence 4 (weight=0.140); sequence 5 (weight=0.109); sequence 6 (weight=0.133); sequence 7 (weight=0.150); consensus (Q3=79.2%); true secondary structure (h1, s1, s2). Source: HTTP://WWW.BIOMEDCENTRAL.COM/1472-6807/6/25]

Q3 - a standard measure of structural prediction accuracy, defined as the proportion of residues assigned to the correct class.

ALHEASGPSVILFGSDVTVPPASNAEQAK (sequence)
hhhhhoooossssooosssooooohhhhh (true)
ohhhoooossssooooosssooohhhhhh (22/29 = 76% - useful)
hhhhhoooohhhhooohhhooooohhhhh (22/29 = 76% - terrible)

Random assignment: Q3 = 33%

State-of-the-art prediction: Q3 ~ 80%

HTTP://NOOK.CS.UCDAVIS.EDU/~KOEHL/CLASSES/CSB/CSB_LECTURE11.PPT
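The Q3 example can be checked directly. The per-class breakdown (`per_class`, an illustrative helper not taken from the slides) is one way to see why two predictions with the same Q3 can differ so much in usefulness.

```python
def q3(true, pred):
    """Fraction of residues assigned to the correct class."""
    assert len(true) == len(pred)
    return sum(t == p for t, p in zip(true, pred)) / len(true)

def per_class(true, pred):
    """For each true class, the fraction of its residues predicted correctly."""
    out = {}
    for c in sorted(set(true)):
        idx = [i for i, t in enumerate(true) if t == c]
        out[c] = sum(pred[i] == c for i in idx) / len(idx)
    return out

true   = "hhhhhoooossssooosssooooohhhhh"
useful = "ohhhoooossssooooosssooohhhhhh"
bad    = "hhhhhoooohhhhooohhhooooohhhhh"

scores = (q3(true, useful), q3(true, bad))          # both 22/29
strand = (per_class(true, useful)["s"], per_class(true, bad)["s"])
```

Both predictions score 22/29, but the second one never predicts a strand: its strand recall is 0, whereas the first recovers 5 of the 7 strand residues with slightly shifted boundaries.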

HMMS ACTUALLY USEFUL FOR STRUCTURE PREDICTION...

[Figure: HMM topology with helix states (H1 to H15), coil states (c1 to c12) and strand states (b1 to b9). Transitions are drawn by probability class: p in [0.1, 0.25[, p in [0.25, 0.5[, p in [0.5, 0.75[, p >= 0.75. States are marked by hydrophilic or hydrophobic preference, and as secondary structure entry or exit states.]

[Figure: log-odds scores of residues in the helix, coil and strand states. Score = log2(p_iq / P_i), where p_iq is the frequency of residue i in state q and P_i is the frequency of residue i in all training sequences.]

HTTP://WWW.BIOMEDCENTRAL.COM/1472-6807/6/25

PROFILE HMM ALIGNMENT/MATCHING

A distant cousin of functionally related sequences in a protein family may have only weak pairwise similarity with each member of the family and thus fail a significance test.

However, it may have weak similarities with many members of the family.

The goal is to align a sequence to all members of the family at once.

A family of related proteins can be represented by its multiple alignment and the corresponding profile.

PROFILE REPRESENTATION OF SEQUENCE FAMILIES

Aligned DNA sequences can be represented by a 4 × N profile matrix reflecting the frequencies of nucleotides at every aligned position.

A protein family can be represented by a 20 × N profile matrix of amino-acid frequencies.

These can be used to estimate the emission probabilities of an HMM.
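As a sketch of how such a profile is built, the function below computes per-column nucleotide frequencies from a tiny, made-up alignment. Gaps are simply skipped, and every column is assumed to contain at least one non-gap character.

```python
def profile(alignment, alphabet="ACGT"):
    """Return one frequency dict per aligned column (N dicts of 4 entries)."""
    n = len(alignment[0])
    cols = []
    for j in range(n):
        column = [s[j] for s in alignment if s[j] in alphabet]   # skip gaps
        cols.append({a: column.count(a) / len(column) for a in alphabet})
    return cols

# Hypothetical alignment of four short DNA sequences.
aln = ["ACGT",
       "ACGA",
       "AC-T",
       "TCGT"]
prof = profile(aln)
```

Each entry of `prof` is one column of the 4 × N matrix. Used as match-state emission probabilities, such raw frequencies would normally be padded with pseudocounts first, so that unseen residues do not get probability zero.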

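[Figure: nucleotide frequency profile for aligned positions of HIV protease; x-axis: position (0 to 7), y-axis: frequency (0 to 1). Frequency rows recovered from the figure, four nucleotide frequencies per position (the nucleotide order of the columns is not recoverable from the extracted text):]

0.000455373  0.000819672  9.10747e-05  0.998634
0.0512143    0.119885     0.000273224  0.828628
0.000335008  0.000167504  0            0.999497
8.37521e-05  8.37521e-05  0.999749     8.37521e-05
0.000167504  0.0274707    0.000167504  0.972194
0.957377     0.0021062    0.0332003    0.00731626
0.0100599    0.981792     0.00108081   0.00706684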

MULTIPLE ALIGNMENTS AND PROTEIN FAMILY CLASSIFICATION

A multiple alignment of a protein family shows variations in conservation along the length of the protein.

Example: after aligning many globin proteins, biologists recognized that the helical regions in globins are more conserved than other regions.

One way to visualize: entropy plots

[Figure: per-site entropy plot for influenza A hemagglutinin, with antigenic sites marked; x-axis: alignment position (0 to 300), y-axis: entropy (0 to 1.5).]
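Per-column entropy is straightforward to compute from an alignment. This sketch uses base-2 logarithms and ignores gaps; both are assumptions, since the plot does not state which logarithm base or gap treatment was used, and the alignment is made up.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (in bits) of the residues in one alignment column."""
    counts = Counter(c for c in column if c != "-")
    total = sum(counts.values())
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())

def entropy_profile(alignment):
    """Entropy of every column of an alignment given as equal-length strings."""
    return [column_entropy(col) for col in zip(*alignment)]

# Hypothetical alignment: column 0 is conserved, column 3 is fully variable.
aln = ["AAGA",
       "AAGC",
       "ACCG",
       "ACCT"]
H = entropy_profile(aln)
```

Conserved columns have entropy 0 and the value grows with variability, so peaks in an entropy plot mark the least conserved positions.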

WHAT ARE PROFILE HMMS

A Profile HMM is a probabilistic representation of a multiple alignment.

A given multiple alignment (of a protein family) is used to build a profile HMM.

This model then may be used to find and score less obvious potential matches of new protein sequences.

NOTE THAT THE HMM TOPOLOGY IS SPARSE.

Sean Eddy, ‘Profile hidden Markov models’, Bioinformatics 1998; 14(9): 755-63

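[Figure: profile HMM with rows of Delete, Insert and Match states between Start and End, and the path taken by the aligned sequence a1 a2 A3 - A4 A5: a1 and a2 are emitted by an Insert state, A3, A4 and A5 by Match states, and the gap (-) corresponds to a Delete state.]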

Grundy, WN. PhD UCSD 1998 HTTP://NOBLE.GS.WASHINGTON.EDU/PAPERS/THESIS.PDF

Alignment                                 Profile HMM alignment

Globally optimal alignment                Viterbi path
Alignment score                           Log Pr{Viterbi path}
--                                        Log Pr{sequence|HMM}
Match score d(c1,c2)                      Position-specific match score log2(p_iq/P_i)
Affine gap score (x indels): a+b(x-1)     a = log2 Pr{match → insert} + log2 Pr{insert → match}
                                          b = log2 Pr{insert → insert}
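The correspondence between affine gap scores and insert-state transitions can be verified numerically. The transition probabilities below are hypothetical, and emission terms are left out so that only the gap (transition) part of the score is compared.

```python
import math

def affine_gap_params(p_mi, p_ii, p_im):
    """Gap-open (a) and gap-extend (b) scores implied by HMM transitions."""
    a = math.log2(p_mi) + math.log2(p_im)   # enter and leave the insert state
    b = math.log2(p_ii)                     # each additional inserted residue
    return a, b

def insert_run_score(x, p_mi, p_ii, p_im):
    """log2 probability of the transitions match -> x inserts -> match."""
    return math.log2(p_mi) + (x - 1) * math.log2(p_ii) + math.log2(p_im)

# Hypothetical transition probabilities for one position of a profile HMM.
p_mi, p_ii, p_im = 0.05, 0.40, 0.60
a, b = affine_gap_params(p_mi, p_ii, p_im)
scores = [insert_run_score(x, p_mi, p_ii, p_im) for x in range(1, 6)]
```

For every run length x the path score equals a + b(x-1), so an insert state with a self-loop is exactly an affine gap model, except that a and b can differ from position to position.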

HMM GAP SCORES

In traditional sequence alignment, gap penalties are largely arbitrary, but not so in HMMs:

\Pr\{match_i \to insert\} + \Pr\{match_i \to match_{i+1}\} + \Pr\{match_i \to delete\} = 1

This creates a natural dependence between match and indel states that has no analogue in traditional sequence alignment.

Insertion states can also contribute emission probabilities \Pr\{a_i \mid insert\}. Setting those to background frequencies yields the traditional assumption (gap - residue = constant score).

But HMMs can accommodate, e.g., the propensity for insertions in outer loops of proteins, which tend to be rich in hydrophilic residues.

TRAINING A PROFILE HMM

A multiple alignment is used to construct the HMM.

Assign each column to a Match state in the HMM. Add Insertion and Deletion states.

Estimate the emission probabilities according to amino acid counts in each column. Different positions in the protein will have different emission probabilities.

Estimate the transition probabilities between Match, Deletion and Insertion states.

The HMM can be further trained (e.g. Baum-Welch) to refine the parameters, starting from these initial values.

PFAM

Pfam describes protein domains.

Each protein domain family in Pfam has:

Seed alignment: manually verified multiple alignment of a representative set of sequences.

HMM built from the seed alignment for further database searches.

Full alignment generated automatically from the HMM.

The distinction between seed and full alignments facilitates Pfam updates. Seed alignments are stable resources. HMM profiles and full alignments can be updated with newly found amino acid sequences.

PFAM USES

Pfam HMMs span entire domains that include both well-conserved motifs and less-conserved regions with insertions and deletions.

Modeling complete domains facilitates better sequence annotation and leads to more sensitive detection.

Example: PF00516 (HIV-1 gp120)

24 sequences in the seed alignment. 75195 in the complete alignment

PF00516: HIV-1 GP120 PROTEIN HMM LOGO

E-VALUES FOR HMM

Much like in BLAST, one can obtain cut-off values for HMM scores and E-value analogs.

The Eddy conjecture (2008) states that:

“...optimal gapped alignment scores (Viterbi scores) follow Gumbel distributions with a constant λ (just as in the ungapped alignment case) and that the expected distribution of total log likelihood ratio scores (Forward scores) asymptotes to an exponential tail with the same constant λ”.

λ = log z, where z is the base of the logarithm used for log-odds (e.g. 2)

For Viterbi scores V the Gumbel distribution is

P(V \geq t) = 1 - \exp\left[-e^{-\lambda (t - \mu)}\right]

where μ and λ are location and scale parameters.

[Figure 2 of Eddy (2008): Viterbi scores follow Gumbel distributions with constant λ. (A) Histogram of maximum likelihood λ estimates from Gumbel fits to multihit local Viterbi scores for 9318 profile HMMs built from Pfam 22.0 seed alignments. Expected λ = log 2 = 0.6931; observed (prototype HMMER3) mean 0.6928 +/- 0.0114; old HMMER2 mean 0.6677 +/- 0.0509; low outlier DUF851 (0.5828), high outlier Sulfakinin (0.8368). (B, C) Log survival plots, P(V>t) versus Viterbi score threshold t (in bits), for two typical Pfam models, RRM_1 and Caudal_act. (D, E) Log survival plots for the extreme outliers DUF851 and Sulfakinin. doi:10.1371/journal.pcbi.1000069.g002]

Eddy SR (2008) A Probabilistic Model of Local Sequence Alignment That Simplifies Statistical Significance Estimation. PLoS Comput Biol 4(5): e1000069. doi:10.1371/journal.pcbi.1000069
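To make the Gumbel form from Eddy (2008) concrete, the sketch below evaluates the survival function P(V ≥ t) = 1 - exp(-e^{-λ(t-μ)}) with the conjectured λ = log 2 for bit scores. The location parameter `mu`, the database size, and the conversion of a P-value into an E-value by multiplying by the number of target sequences are illustrative assumptions here, not values or steps taken from the slides.

```python
import math

LAMBDA_BITS = math.log(2)   # conjectured lambda for scores in bits (z = 2)

def gumbel_survival(t, mu, lam=LAMBDA_BITS):
    """P(V >= t) for a Gumbel distribution with location mu and scale lam."""
    # -expm1(-y) computes 1 - exp(-y) accurately when y is tiny (far tail)
    return -math.expm1(-math.exp(-lam * (t - mu)))

def e_value(t, mu, n_targets, lam=LAMBDA_BITS):
    """Expected number of random hits scoring >= t among n_targets sequences."""
    return n_targets * gumbel_survival(t, mu, lam)

mu = -5.0                                # hypothetical location parameter
p_at_mu = gumbel_survival(mu, mu)        # equals 1 - exp(-1)
p_tail = gumbel_survival(mu + 20, mu)    # close to 2 ** -20
e = e_value(mu + 20, mu, n_targets=1_000_000)
```

In the high-scoring tail the survival function is approximately e^{-λ(t-μ)}, so with λ = log 2 each additional bit of score roughly halves the P-value. This is why a constant λ is convenient: only μ would remain to be calibrated per model.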

A RELATED APPLICATION

E.g. motif-finding

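[Figure: motif-finding HMM running from Start to End through Motif 1 and Motif 2. The states before, between and after the motifs carry probabilities x0, x1, x2 and their complements 1-x0, 1-x1, 1-x2; transitions inside each motif have probability 1.0.]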

Grundy, WN. PhD Thesis UCSD 1998 HTTP://NOBLE.GS.WASHINGTON.EDU/PAPERS/THESIS.PDF

DIFFERENT TYPES OF ALIGNMENT

Query      Database   Program

Sequence   Sequence   BLAST, FASTA...
Profile    Sequence   PSI-BLAST...
Sequence   Profile    PSSM, PFAM, HMMER
Profile    Profile    PROF_SIM, COMPASS, HHsearch

HTTP://BIOINFORMATICS.OXFORDJOURNALS.ORG/CONTENT/VOL21/ISSUE7/IMAGES/LARGE/BTI125F1.JPEG
