Introduction to Hidden Markov Models and Profiles in Sequence Alignment

Utah State University – Spring 2010
STAT 5570: Statistical Bioinformatics
Notes 6.3

References
- Chapters 3-6 of Biological Sequence Analysis (Durbin et al., 2001)
- Chapter 9 of A First Course in Probability (Ross, 1997)
- Eddy, S. R. (1998). Profile hidden Markov models. Bioinformatics, 14:755-763

The “occasionally dishonest casino”
A casino usually uses a fair die, but sometimes (5% of the time) switches to a loaded die. Once using the loaded die, they usually keep using it (90% of the time).
How do you know which die you’re playing?
- not sure, but have to look at many plays to see pattern
- the “state” here is “hidden”
One possible representation of this “model”:

  State    P(stay)  P(switch)  Emission probabilities
  Fair     0.95     0.05       1: 1/6   2: 1/6   3: 1/6   4: 1/6   5: 1/6   6: 1/6
  Loaded   0.90     0.10       1: 1/10  2: 1/10  3: 1/10  4: 1/10  5: 1/10  6: 1/2

Possible partial sequence of rolls:

  Roll  ...62625335636616366646623...
  Die   ...FFFFFFFFLLLLLLLLLFFFFFF...
  (F = “fair” state, L = “loaded” state)

Follow-up to a sequence alignment
- Consider pairwise (or multiple) alignment
- What does alignment mean? It possibly represents common ancestry
- Possible questions:
  - Does alignment describe some “family”?
  - How can we describe its internal structure?
- Can sometimes characterize these “family” structures as a Markov Chain

“Family” Example: CpG islands
- In DNA: C & G are always matched up in helical structure
- CpG (C followed by G) in sequence is rare, but is more frequent in promoters or start regions of a gene
- General idea: CpG at start of gene
- Possible question about our alignment: Does CpG frequency suggest we are in a promoter region?

Markov Chain Examples
- Occasionally dishonest casino: Probability you are using a fair die on one play depends on whether you were using a fair die on the previous play.
- Rain indicator: Chance of rain tomorrow depends on whether it rains today.
- Gambler’s Ruin: A gambler starts out with a fortune and plays a game repeatedly (with same prob. of winning or losing $1 each time) until her fortune reaches either 0 or M. Her sequence of fortunes is a Markov chain.
- CpG Island: Probability you are in a CpG island at one point in alignment depends on whether you were in a CpG island at the previous point.

Markov Chains – a little more formally
- Sequence of random variables: \(X_0, X_1, \ldots\)
- Set of possible values: \(\{0, 1, \ldots, M\}\)
- Think of \(X_t\) as state of process at time \(t\)
- This sequence is a Markov Chain if:
  \[ P\{X_{t+1}=j \mid X_t=i, X_{t-1}=i_{t-1}, \ldots, X_1=i_1, X_0=i_0\} = P\{X_{t+1}=j \mid X_t=i\} = P_{ij} \]
- So state at time \(t+1\) depends only on: state at time \(t\)

Hidden Markov Model (HMM) - vocabulary
- State: in which “family” the process is (CpG vs. not, fair vs. loaded die, etc.)
- Path: \(\pi\); the \(i\)-th state in the path: \(\pi_i\)
- Symbol: observed “outcome” \(x\) (sequence, die, etc.) from unknown state
- Transition probabilities: \(a_{kh} = P\{\pi_i = h \mid \pi_{i-1} = k\}\)
- Emission probabilities: \(e_k(b) = P\{x_i = b \mid \pi_i = k\}\)
- Joint probability of observed sequence \(x\) & state sequence \(\pi\):
  \[ P(x, \pi) = a_{0\pi_1} \prod_i e_{\pi_i}(x_i)\, a_{\pi_i \pi_{i+1}} \]

Estimating the HMM path
Several approaches:
- Viterbi algorithm: focuses on identifying the “most probable state path”, \(\pi^* = \arg\max_{\pi} P(x, \pi)\)
- Forward algorithm: computes the probability of the observed sequence, \(P(x)\), summed over all paths
- Backward algorithm (used together with the forward algorithm): focuses more on posterior state probabilities (position-specific prob. that observation came from state \(k\) given observed sequence)

Estimation when paths unknown: Baum-Welch
- An iterative procedure estimates the transition and emission probabilities
- A special case of the EM algorithm (a general approach to deal with maximum likelihood with missing/incomplete/latent data); think of the state as a missing covariate
- Can also consider an approximation based on iterations of the Viterbi algorithm

Pairwise Alignments as HMMs
Recall notation: x & y are sequences to be aligned, with gap opening penalty of d and gap extension penalty of e.
Let (+w,+v) here represent change in sequence position, with M=match, and X,Y=insertion (gap) in x or y.

  State  Position change  Score in alignment automaton       Emission in HMM
  M      (+1,+1)          s(x_i,y_j) on entering M           p_{x_i y_j}
  X      (+1,+0)          -d to open (M to X), -e to extend  q_{x_i}
  Y      (+0,+1)          -d to open (M to Y), -e to extend  q_{y_j}

  HMM transition probabilities:
    M to M: 1-2δ    M to X: δ      M to Y: δ
    X to X: ε       X to M: 1-ε
    Y to Y: ε       Y to M: 1-ε

- States: insertion (X or Y), match (M)
- δ = probability of moving to a specific insertion state
- ε = prob. of staying in an insertion state

What can be done with pairwise HMM
- Build HMM for a random (non-matched) model
- Evaluate likelihood of matched model by considering log-odds of matched vs. random models
- Search for other alignments: sub-optimal
- Consider posterior probability of alignment: \(P\{x_i \text{ “is aligned with” } y_j \mid x, y\}\)

Using HMMs to describe a “family”
- Suppose we have an alignment of multiple sequences; we can model their “relationship” as a family of sequences. Call this the family’s “profile”
- PSSM (position-specific score matrix): estimate this to describe this particular profile (e.g., should ‘A’ count for more at a particular position in the alignment?)
- Allow for insertions and deletions, where “cost” could also be position-specific
- Use this profile to describe the alignment and look for other similar sequences

Transition structure of a profile HMM
[Figure: profile HMM transition diagram running from Begin to End through states M_j, I_j and D_j]
- j: specific position of profile
- M_j: match state
- I_j: insertion state
- D_j: deletion state

How do we get this “family”?
- Multiple Sequence Alignment: many possible strategies to find and score possible alignments
- One common way: ClustalW, a “progressive alignment” approach
  - construct pairwise distances based on evolutionary distance
  - essentially follow an agglomerative clustering approach, progressively aligning nodes in order of decreasing similarity
  - additional heuristics make final alignment more accurate

Possible Strategies
[Figure: image from HMMER (on bioweb.pasteur.fr server)]

One possible analysis approach
- Obtain multiple alignment using ClustalW: http://www.ebi.ac.uk/Tools/clustalw2
  - creates alignment files in various formats; some are specialized for tree-viewing, for example
  - can get FASTA format of alignment to pass to HMMER
- Obtain HMM model using HMMER: http://bioweb2.pasteur.fr/alignment/intro-en.html
  - creates a “consensus” sequence to summarize the profile (hmmbuild)
  - can use this profile to search database for similar sequences (hmmsearch)

Example (Source: JalView example at ClustalW; same as HW5 data)
Five proteins from different species (FASTA format): Mouse (2), Human, Chicken, Rat

>FOSB_MOUSE Protein fosB
MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGTSYSTPGLSAYSTGGASGSGGPSTSTTTSGPVSARPARARPRRPREETLTPEEEEKRRVRRERNKLAAAKCRNRR
RELTDRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRDLPGSTSAKEDGFGWLLPPPPPPPLPFQSSRDAPPNLTASLFTHSEVQVLGDPFPVVSPSYTSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPSLLAL
>FOSB_HUMAN Protein fosB
MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGEMPGSFVPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPVVDPYDMPGTSYSTPGMSGYSSGGASGSGGPSTSGTTSGPGPARPARARPRRPREETLTPEEEEKRRVRRERNKLAAAKCRNRR
RELTDRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRDLPGSAPAKEDGFSWLLPPPPPPPLPFQTSQDAPPNLTASLFTHSEVQVLGDPFPVVNPSYTSSFVLTCPEVSAFAGAQRTSGSDQPSDPLNSPSLLAL
>FOS_CHICK Proto-oncogene protein c-fos
MMYQGFAGEYEAPSSRCSSASPAGDSLTYYPSPADSFSSMGSPVNSQDFCTDLAVSSANFVPTVTAISTSPDLQWLVQPTLISSVAPSQNRGHPYGVPAPAPPAAYSRPAVLKAPGGRGQSIGRRGKVEQLSPEEEEKRRIRRERNKMAAAKCRNRRRELTDTLQAETDQLEEEKS
ALQAEIANLLKEKEKLEFILAAHRPACKMPEELRFSEELAAATALDLGAPSPAAAEEAFALPLMTEAPPAVPPKEPSGSGLELKAEPFDELLFSAGPREASRSVPDMDLPGASSFYASDWEPLGAGSGGELEPLCTPVVTCTPCPSTYTSTFVFTYPEADAFPSCAAAHRKGSSSN
EPSSDSLSSPTLLAL
>FOS_RAT Proto-oncogene protein c-fos
MMFSGFNADYEASSSRCSSASPAGDSLSYYHSPADSFSSMGSPVNTQDFCADLSVSSANFIPTVTAISTSPDLQWLVQPTLVSSVAPSQTRAPHPYGLPTPSTGAYARAGVVKTMSGGRAQSIGRRGKVEQLSPEEEEKRRIRRERNKMAAAKCRNRRRELTDTLQAETDQLEDEK
SALQTEIANLLKEKEKLEFILAAHRPACKIPNDLGFPEEMSVTSLDLTGGLPEATTPESEEAFTLPLLNDPEPKPSLEPVKNISNMELKAEPFDDFLFPASSRPSGSETARSVPDVDLSGSFYAADWEPLHSSSLGMGPMVTELEPLCTPVVTCTPSCTTYTSSFVFTYPEADSFP
SCAAAHRKGSSSNEPSSDSLSSPTLLAL
>FOS_MOUSE Proto-oncogene protein c-fos
MMFSGFNADYEASSSRCSSASPAGDSLSYYHSPADSFSSMGSPVNTQDFCADLSVSSANFIPTVTAISTSPDLQWLVQPTLVSSVAPSQTRAPHPYGLPTQSAGAYARAGMVKTVSGGRAQSIGRRGKVEQLSPEEEEKRRIRRERNKMAAAKCRNRRRELTDTLQAETDQLEDEK
SALQTEIANLLKEKEKLEFILAAHRPACKIPDDLGFPEEMSVASLDLTGGLPEASTPESEEAFTLPLLNDPEPKPSLEPVKSISNVELKAEPFDDFLFPASSRPSGSETSRSVPDVDLSGSFYAADWEPLHSNSLGMGPMVTELEPLCTPVVTCTPGCTTYTSSFVFTYPEADSFP
SCAAAHRKGSSSNEPSSDSLSSPTLLAL

ClustalW – quick example
[Screenshot of the ClustalW web form, with annotations:]
- set alignment options here
- paste multiple sequences here (in FASTA format, e.g.)
- click Run to start alignment

ClustalW – format output in Jalview (Java applet)
Here, color is by BLOSUM62 score.
Text below is from “File”, “Output to Textbox”, “FASTA” format (others available):

>FOS_RAT/1-380
MMFSGFNADYEASSSRCSSASPAGDSLSYYHSPADSFSSMGSPVNTQDFCADLSVSSANFIPTVTAISTSPD
LQWLVQPTLVSSVAPSQ-------TRAPHPYGLPTPS-TGAYARAGVVKTMSGGRAQSIG------------
--------RRGKVEQLSPEEEEKRRIRRERNKMAAAKCRNRRRELTDTLQAETDQLEDEKSALQTEIANLLK
EKEKLEFILAAHRPACKIPNDLGFPEEMSVTS-LDLTGGLPEATTPESEEAFTLPLLNDPEPK-PSLEPVKN
ISNMELKAEPFDDFLFPASSRPSGSETARSVPDVDLSG--SFYAADWEPLHSSSLGMGPMVTELEPLCTPVV
TCTPSCTTYTSSFVFTYPEADSFPSCAAAHRKGSSSNEPSSDSLSSPTLLAL
>FOS_MOUSE/1-380
MMFSGFNADYEASSSRCSSASPAGDSLSYYHSPADSFSSMGSPVNTQDFCADLSVSSANFIPTVTAISTSPD
LQWLVQPTLVSSVAPSQ-------TRAPHPYGLPTQS-AGAYARAGMVKTVSGGRAQSIG------------
...

HMMER input
[Screenshot of the HMMER web form at http://bioweb2.pasteur.fr/alignment/intro-en.html, with annotation: paste FASTA format here (from ClustalW, for example)]

hmmbuild results
…

hmmbuild results (reformatted by hand)

  HMM     A     C     D   ...     Q     R     S     T   ...
  ...
   15   2.35  4.27  3.26  ...   3.50  3.44  0.99  2.83  ...
   16   3.08  4.91  3.57  ...   3.09  0.88  3.16  3.34  ...
   17   2.66  0.81  4.20  ...   4.12  3.89  2.93  3.13  ...
   18   2.35  4.27  3.26  ...   3.50  3.44  0.99  2.83  ...
   19   2.35  4.27  3.26  ...   3.50  3.44  0.99  2.83  ...
  ...

HMMER – search for “family” members
[Screenshots not reproduced]

hmmsearch results
…

                            12345678901234567890123456789012
  alignfile_data          1 mmfqafagdyeasssrcssaspaadslsyyls
                            mmf++f++dyeasssrcssaspa+dslsyy+s
  gp|BC029814|BC029814_1  1 MMFSGFNADYEASSSRCSSASPAGDSLSYYHS

(more to profile than just SRCSS)

Summary
- Hidden Markov Models: use to describe sequence alignments
  - main idea: how does each portion of alignment represent the “family profile”
- Idea of profile: general “family” characteristics
- Online resources
  - ClustalW: perform multiple alignments
  - HMMER: build (& use) HMM model from multiple alignment
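
Worked sketch: Viterbi decoding for the “occasionally dishonest casino”
The most probable state path from the “Estimating the HMM path” slide can be illustrated on the casino model. The following is a minimal sketch, not code from the course: the transition and emission probabilities are the ones given in these notes, while the uniform starting distribution `START` and the function name `viterbi` are assumptions made for illustration.

```python
import math

STATES = ("F", "L")  # F = fair die, L = loaded die

# Transition probabilities a_kh taken from the casino slide
TRANS = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}

# Emission probabilities e_k(b) taken from the casino slide
EMIT = {
    "F": {r: 1 / 6 for r in "123456"},
    "L": {**{r: 1 / 10 for r in "12345"}, "6": 1 / 2},
}

# Assumption: the notes do not give starting probabilities, so use 50/50
START = {"F": 0.5, "L": 0.5}


def viterbi(rolls):
    """Return the most probable state path (string of F/L) for a string of rolls."""
    # Work with log probabilities to avoid numerical underflow on long sequences
    v = {k: math.log(START[k]) + math.log(EMIT[k][rolls[0]]) for k in STATES}
    back = []  # back[i][h] = best predecessor of state h at position i+1
    for x in rolls[1:]:
        ptr, nxt = {}, {}
        for h in STATES:
            best = max(STATES, key=lambda k: v[k] + math.log(TRANS[k][h]))
            ptr[h] = best
            nxt[h] = v[best] + math.log(TRANS[best][h]) + math.log(EMIT[h][x])
        back.append(ptr)
        v = nxt
    # Trace back from the best final state
    state = max(STATES, key=lambda k: v[k])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return "".join(reversed(path))


rolls = "62625335636616366646623"  # partial roll sequence from the casino slide
path = viterbi(rolls)
print(rolls)
print(path)
```

The decoded path depends on the assumed starting distribution and on how short the sequence is, so it need not reproduce the F/L labels shown on the casino slide exactly.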

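Worked sketch: posterior state probabilities with the forward and backward algorithms
The position-specific posterior probabilities mentioned on the “Estimating the HMM path” slide can be sketched for the same casino model. This is an illustrative implementation under stated assumptions, not the course's code: it uses a uniform starting distribution (`START`, not given in the notes), no explicit end state, and per-position scaling to keep the numbers stable; the function name `posteriors` is hypothetical.

```python
STATES = ("F", "L")  # F = fair die, L = loaded die

# Transition and emission probabilities from the casino slide
TRANS = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}
EMIT = {
    "F": {r: 1 / 6 for r in "123456"},
    "L": {**{r: 1 / 10 for r in "12345"}, "6": 1 / 2},
}

# Assumption: starting probabilities are not given in the notes
START = {"F": 0.5, "L": 0.5}


def posteriors(rolls):
    """Return a list of dicts: P(state at position i = k | all rolls)."""
    n = len(rolls)

    # Forward pass; each column is rescaled to sum to 1
    fwd, scale = [], []
    for i, x in enumerate(rolls):
        if i == 0:
            col = {k: START[k] * EMIT[k][x] for k in STATES}
        else:
            col = {
                h: EMIT[h][x] * sum(fwd[-1][k] * TRANS[k][h] for k in STATES)
                for h in STATES
            }
        s = sum(col.values())
        scale.append(s)
        fwd.append({k: col[k] / s for k in STATES})

    # Backward pass, reusing the forward scale factors
    bwd = [None] * n
    bwd[n - 1] = {k: 1.0 for k in STATES}
    for i in range(n - 2, -1, -1):
        bwd[i] = {
            k: sum(
                TRANS[k][h] * EMIT[h][rolls[i + 1]] * bwd[i + 1][h] for h in STATES
            ) / scale[i + 1]
            for k in STATES
        }

    # With this scaling, forward times backward is already the posterior
    return [{k: fwd[i][k] * bwd[i][k] for k in STATES} for i in range(n)]


rolls = "62625335636616366646623"  # partial roll sequence from the casino slide
for roll, post in zip(rolls, posteriors(rolls)):
    print(roll, round(post["L"], 3))
```

Unlike the single best path from Viterbi, this output gives a per-roll probability of the loaded die, which is useful when several paths are nearly equally likely.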