Outline
ary RNA Search and Task 1: RNA 2 Structure Prediction (last time) Motif Discovery Task 2: RNA Motif Models Covariance Models Training & “Mutual Information”
CSE 427 Task 3: Search Winter 2008 Rigorous & heuristic filtering Task 4: Motif discovery
How to model an RNA “Motif”?
Conceptually, start with a profile HMM: from a multiple alignment, estimate nucleotide/ insert/delete Task 2: Motif Description preferences for each position given a new seq, estimate likelihood that it could be generated by the model, & align it to the model
mostly G del ins all G
1 How to model an RNA “Motif”? RNA Motif Models
Add “column pairs” and pair emission probabilities “Covariance Models” (Eddy & Durbin 1994) for base-paired regions aka profile stochastic context-free grammars aka hidden Markov models on steroids Model position-specific nucleotide preferences and base-pair preferences
Pro: accurate <<<<<<< >>>>>>> Con: model building hard, search sloooow … p a ire d c o lu m n s …
What
A probabilistic model for RNA families “RNA sequence analysis using The “Covariance Model” ≈ A Stochastic Context-Free Grammar covariance models” A generalization of a profile HMM Algorithms for Training Eddy & Durbin From aligned or unaligned sequences Nucleic Acids Research, 1994 Automates “comparative analysis” vol 22 #11, 2079-2088 Complements Nusinov/Zucker RNA folding Algorithms for searching (see also, Ch 10 of Durbin et al.)
2 Main Results Probabilistic Model Search
Very accurate search for tRNA As with HMMs, given a sequence, you calculate (Precursor to tRNAscanSE - current favorite) likelihood ratio that the model could generate the sequence, vs a background model Given sufficient data, model construction comparable to, but not quite as good as, You set a score threshold human experts Anything above threshold → a “hit” Some quantitative info on importance of Scoring: pseudoknots and other tertiary features “Forward” / “Inside” algorithm - sum over all paths Viterbi approximation - find single best path (Bonus: alignment & structure prediction)
Example: searching for Alignment Quality tRNAs
3 Profile Hmm Structure Comparison to TRNASCAN
Fichant & Burks - best heuristic then a i t r n e t e 97.5% true positive i r r e c f
f i
0.37 false positives per MB n d o i
t y l a
CM A1415 (trained on trusted alignment) t u h l g a i l
> 99.98% true positives v <0.2 false positives per MB S e Current method-of-choice is “tRNAscanSE”, a CM- based scan with heuristic pre-filtering (including TRNASCAN?) for performance reasons. Mj: Match states (20 emission probabilities) Ij: Insert states (Background emission probabilities) Dj: Delete states (silent - no emission)
Overall CM CM Structure Architecture A: Sequence + structure One box (“node”) per node of guide tree B: the CM “guide tree” BEG/MATL/INS/DEL just C: probabilities of like an HMM letters/ pairs & of indels MATP & BIF are the key Think of each branch additions: MATP emits pairs being an HMM emitting of symbols, modeling base- both sides of a helix (but pairs; BIF allows multiple 3’ side emitted in helices reverse order)
4 CM Viterbi Alignment CM Viterbi Alignment (the “inside” algorithm) (the “inside” algorithm)
y th Sij = max logP(xij generated starting in state y via path ) xi = i letter of input z y x = substring i,..., j of input maxz[Si+1, j 1 + logTyz + log E x ,x ] match pair ij i j max [S z + logT + log E y ] match/insert left Tyz = P(transition y " z) z i+1, j yz xi ! y z y y Sij = maxz[Si, j 1 + logTyz + log E x ] match/insert right E x ,x = P(emission of xi,x j from state y) j i j z max [S + logT ] delete y z i, j yz Sij = max# logP(xij gen'd starting in state y via path #) yleft yright maxi