Modeling and Searching for Non-Coding RNA

W.L. Ruzzo
http://www.cs.washington.edu/homes/ruzzo

ncRNA: Interest

• extensive noncoding sequence conservation
• even more extensive transcription
• “invisible” structural conservation?
• many RNA binding proteins
• examples: microRNAs, riboswitches

Bottom line: important regulatory roles

Outline

• Why RNA?
• Examples of RNA biology
• Computational Challenges
  – Modeling
  – Search
  – Inference

The “Central Dogma”

DNA → RNA → Protein

[Figure: in a cell, a gene on DNA (chromosome) is copied to RNA (messenger), which codes for protein.]

Fig. 2. The arrows show the situation as it seemed in 1958. Solid arrows represent probable transfers, dotted arrows possible transfers. The absent arrows (compare Fig. 1) represent the impossible transfers postulated by the central dogma. They are the three possible arrows starting from protein.

RNA Secondary Structure: RNA makes helices too

[Figure: an RNA hairpin with its base pairs labeled.]

Non-coding RNA

• Messenger RNA - codes for proteins
• Non-coding RNA - all the rest
  – Before, say, mid 1990’s, 1-2 dozen known (critically important, but narrow roles: e.g. tRNA)
• Since mid 90’s dramatic discoveries
  – Regulation, transport, stability/degradation
  – E.g. “microRNA”: ≈ 100’s in humans
  – By some estimates, ncRNA >> mRNA

“Classical” RNAs

• mRNA
• tRNA
• rRNA
• snRNA (small nuclear - splicing)
• snoRNA (small nucleolar - guides for t/rRNA modifications)
• RNAseP (tRNA maturation; ribozyme in bacteria)
• SRP (signal recognition particle; co-translational targeting of proteins to membranes)
• telomerases

Bacteria

• Triumph of proteins
• 80% of genome is coding DNA
• Functionally diverse
  – receptors
  – motors
  – catalysts
  – regulators (Monod & Jacob, Nobel prize 1965)
  – …

Gene Regulation: The MET Repressor

[Figure (Alberts, et al., 3e.): “The protein way”: the MET repressor protein on DNA, with SAM.]

The riboswitch alternatives (boxed = confirmed riboswitch):

• SAM-I: Grundy & Henkin, Mol. Microbiol 1998; Epshtein, et al., PNAS 2003; Winkler et al., Nat. Struct. Biol. 2003
• SAM-II: Corbino et al., Genome Biol. 2005
• SAM-III: Fuchs et al., NSMB 2006
• SAM-IV: Weinberg et al., RNA 2008
• (+2 more)

Widespread, deeply conserved, structurally sophisticated, functionally diverse, biologically important uses for ncRNA throughout prokaryotic world.

Weinberg, et al. Nucl. Acids Res., July 2007 35: 4809-4819.

RNA on the Rise

In humans
• more RNA- than DNA-binding proteins?
• much more conserved DNA than coding
• MUCH more transcribed DNA than coding

In bacteria
• regulation of MANY genes involves RNA
• dozens of classes & thousands of new examples in just last 5 years

Human Predictions

Evofold
S Pedersen, G Bejerano, A Siepel, K Rosenbloom, K Lindblad-Toh, ES Lander, J Kent, W Miller, D Haussler, "Identification and classification of conserved RNA secondary structures in the human genome." PLoS Comput. Biol., 2, #4 (2006) e33.
• 48,479 candidates (~70% FDR?)

RNAz
S Washietl, IL Hofacker, M Lukasser, A Hüttenhofer, PF Stadler, "Mapping of conserved RNA secondary structures predicts thousands of functional noncoding RNAs in the human genome." Nat. Biotechnol., 23, #11 (2005) 1383-90.
• 30,000 structured RNA elements
• 1,000 conserved across all vertebrates
• ~1/3 in introns of known genes, ~1/6 in UTRs
• ~1/2 located far from any known gene

FOLDALIGN
E Torarinsson, M Sawera, JH Havgaard, M Fredholm, J Gorodkin, "Thousands of corresponding human and mouse genomic regions unalignable in primary sequence contain common RNA structure." Genome Res., 16, #7 (2006) 885-9.
• 1800 candidates from 36970 (of 100,000) pairs

CMfinder
Torarinsson, Yao, Wiklund, Bramsen, Hansen, Kjems, Tommerup, Ruzzo and Gorodkin. Comparative genomics beyond sequence based alignments: RNA structures in the ENCODE regions. Genome Research, Feb 2008, 18(2):242-251 PMID: 18096747
• 6500 candidates in ENCODE alone (better FDR, but still high)

Fastest Human Gene?

[Figure slide.]

Origin of Life?

Life needs
• information carrier: DNA
• molecular machines, like enzymes: Protein
• making proteins needs DNA + RNA + proteins
• making (duplicating) DNA needs proteins

Horrible circularities! How could it have arisen in an abiotic environment?

Origin of Life?

• RNA can carry information too (RNA double helix)
• RNA can form complex structures
• RNA enzymes exist (ribozymes)

The “RNA world” hypothesis: 1st life was RNA-based

Some extant RNAs are relicts of that origin; some are “modern” inventions

ncRNA Example: Xist

• large (12kb?)
• largely unstructured RNA
• required for X-inactivation in mammals

ncRNA Example: 6S

• medium size (175nt)
• structured
• highly expressed in E. coli in certain growth conditions
• sequenced in 1971; function unknown for 30 years

6S mimics an open promoter

[Figure: E. coli 6S RNA. Barrick et al. RNA 2005; Trotochaud et al. NSMB 2005; Willkomm et al. NAR 2005]

ncRNA Example: IRE

Iron Response Element: a short conserved stem-loop, bound by iron response proteins (IRPs). Found in UTRs of various mRNAs whose products are involved in iron metabolism. E.g., the mRNA of ferritin (an iron storage protein) contains one IRE in its 5' UTR. When iron concentration is low, IRPs bind the ferritin mRNA IRE, repressing translation. Binding of multiple IREs in the 3' and 5' UTRs of the transferrin receptor (involved in iron acquisition) leads to increased mRNA stability. These two activities form the basis of iron homeostasis in the vertebrate cell.

Iron Response Element

IRE (partial seed alignment):

Hom.sap.  GUUCCUGCUUCAACAGUGUUUGGAUGGAAC
Hom.sap.  UUUCUUC.UUCAACAGUGUUUGGAUGGAAC
Hom.sap.  UUUCCUGUUUCAACAGUGCUUGGA.GGAAC
Hom.sap.  UUUAUC..AGUGACAGAGUUCACU.AUAAA
Hom.sap.  UCUCUUGCUUCAACAGUGUUUGGAUGGAAC
Hom.sap.  AUUAUC..GGGAACAGUGUUUCCC.AUAAU
Hom.sap.  UCUUGC..UUCAACAGUGUUUGGACGGAAG
Hom.sap.  UGUAUC..GGAGACAGUGAUCUCC.AUAUG
Hom.sap.  AUUAUC..GGAAGCAGUGCCUUCC.AUAAU
Cav.por.  UCUCCUGCUUCAACAGUGCUUGGACGGAGC
Mus.mus.  UAUAUC..GGAGACAGUGAUCUCC.AUAUG
Mus.mus.  UUUCCUGCUUCAACAGUGCUUGAACGGAAC
Mus.mus.  GUACUUGCUUCAACAGUGUUUGAACGGAAC
Rat.nor.  UAUAUC..GGAGACAGUGACCUCC.AUAUG
Rat.nor.  UAUCUUGCUUCAACAGUGUUUGGACGGAAC
SS_cons   <<<<<...<<<<<......>>>>>.>>>>>

ncRNA Example: MicroRNAs

• short (~22 nt) unstructured RNAs excised from ~75nt precursor hairpin
• approx antisense to mRNA targets, often in 3’ UTR
• regulate gene activity, e.g. by destabilizing (plants) or otherwise suppressing (animals) message
• hundreds and growing, each w/ perhaps 10x targets
• Some conserved human to worm; others evolving rapidly

ncRNA Example: T-boxes

[Figure slide.]

ncRNA Example: Riboswitches

• UTR structure that directly senses/binds small molecules & regulates mRNA
• widespread in prokaryotes
• some in eukaryotes

Example: Glycine Regulation

• How is glycine level regulated?
• Plausible answer: transcription factors (proteins) bind to DNA to turn nearby genes on or off

[Figure: glycine (g) and a transcription factor (TF) protein at the DNA of the glycine cleavage enzyme (gce) gene.]

The Glycine Riboswitch

• Actual answer (in many bacteria):

[Figure: glycine (g) binding the 5′ end of the gce mRNA transcribed from the glycine cleavage enzyme gene. Mandal et al. Science 2004]

(Mandal, Lee, Barrick, Weinberg, Emilsson, Ruzzo, Breaker, Science 2004)

[Figure: riboswitch structure, 5’ end, upstream of the gcvT ORF, 3’ end.]

Fig. 3. Cooperative binding of two glycine molecules by the VC I-II RNA. Plot depicts the fraction of VC II (open) and VC I-II (solid) bound to ligand versus the concentration of glycine. The constant, n, is the Hill coefficient for the lines as indicated that best fit the aggregate data from four different regions (fig. S3). Shaded boxes demark the dynamic range (DR) of glycine concentrations needed by the RNAs to progress from 10%- to 90%-bound states.

And…

• More examples means better alignment
• Understand phylogenetic distribution
• Find riboswitch in front of new gene
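The Fig. 3 caption above ties the Hill coefficient n to the dynamic range, i.e. the fold change in glycine concentration needed to go from 10% to 90% bound. A minimal sketch of that relationship, assuming the standard Hill equation (the function names and the half-saturation constant `k_half` are illustrative, not taken from the paper):

```python
def fraction_bound(conc, k_half, n):
    """Hill equation: fraction of RNA bound at ligand concentration conc."""
    return conc**n / (k_half**n + conc**n)


def dynamic_range(n, low=0.10, high=0.90):
    """Fold change in ligand concentration needed to move from `low` to `high` bound.

    From the Hill equation, bound/(1 - bound) = (conc / k_half)**n, so the
    concentration ratio is the ratio of the two odds raised to the power 1/n.
    """
    odds_ratio = (high / (1.0 - high)) / (low / (1.0 - low))
    return odds_ratio ** (1.0 / n)


# Compare non-cooperative binding (n = 1) with two-site cooperative binding (n = 2).
for n in (1, 2):
    print(f"n = {n}: dynamic range is about {dynamic_range(n):.1f}-fold")
```

Under these assumptions, a higher Hill coefficient gives a narrower dynamic range, which is one way to read the shaded boxes in the figure: the cooperative RNA switches over a smaller change in glycine concentration.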

Riboswitches

• ~ 20 ligands known; multiple nonhomologous solutions for some
• dozens to hundreds of instances of each
• TPP known in archaea & eukaryotes
• on/off; transcription/translation; splicing; combinatorial control
• In some bacteria, more riboregulators identified than protein TFs
• all found since ~2003

Why?

• RNA’s fold, and function
• Nature uses what works

Outline

• ncRNA: what/why?
• What does computation bring?
• How to model and search for ncRNA?
• Faster search
• Better model inference

Homology search

Sequence-based
  – Smith-Waterman
  – FASTA
  – BLAST

Sharp decline in sensitivity at ~60-70% identity

So, use structure, too

Impact of RNA homology search

glycine riboswitch operon (Barrick, et al., 2004):
• B. subtilis
• L. innocua
• A. tumefaciens
• V. cholerae
• M. tuberculosis
• (and 19 more species)

Using our techniques, we found… (and 42 more species)

Structure Prediction

• X-ray crystallography, NMR, etc
• Comparative modeling
  – Alignment & compensatory substitutions
• Single-sequence Folding
  – mfold, mccaskill, vienna…
  – est 50-70% accurate up to 200-300 nt
• Multiple sequence alignment/folding

RNA Informatics

• RNA: Not just a messenger anymore
  – Dramatic discoveries
  – Hundreds of families (besides classics like tRNA, rRNA, snRNA…)
  – Widespread, important roles
• Computational tools important
  – Discovery, characterization, annotation
  – BUT: slow, inaccurate, demanding

Q: What’s so hard?

[Figure: secondary-structure drawing of an RNA.]

A: Structure often more important than sequence

Computational Challenges

• Search - given related RNA’s, find more
  – CM-based search
• Modeling - describe a related family
  – Hand-curated alignments -> CMs
• Meta-modeling - what’s a good modeling framework?
  – Covariance Models

Predict Structure from Multiple Sequences

… GA … UC …
… GA … UC …
… GA … UC …
… CA … UG …
… CC … GG …
… UA … UA …

Compensatory mutations reveal structure, but in usual alignment algorithms they are doubly penalized.

“RNA sequence analysis using covariance models”

Eddy & Durbin
Nucleic Acids Research, 1994, vol 22 #11, 2079-2088

What

• A probabilistic model for RNA families
  – The “Covariance Model”
  – ≈ A Stochastic Context-Free Grammar
  – A generalization of a profile HMM
• Algorithms for Training
  – From aligned or unaligned sequences
  – Automates “comparative analysis”
  – Complements Nussinov/Zuker RNA folding
• Algorithms for searching

Main Results

• Very accurate search for tRNA
  – (Precursor to tRNAscan-SE - current favorite)
• Given sufficient data, model construction comparable to, but not quite as good as, human experts
• Some quantitative info on importance of pseudoknots and other tertiary features

Example: searching for tRNAs

[Figure slide.]

Probabilistic Model Search

• As with HMMs, given a sequence, you calculate the likelihood ratio that the model could generate the sequence, vs a background model
• You set a score threshold
• Anything above threshold => a “hit”
• Scoring:
  – “Forward” / “Inside” algorithm - sum over all paths
  – Viterbi approximation - find single best path (Bonus: alignment & structure prediction)
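The score-and-threshold idea above can be sketched with a deliberately simplified model. This is an illustration only: it assumes a gap-free, position-specific model with no paired columns and a uniform background, so it needs neither the Forward/Inside nor the Viterbi algorithm. `BACKGROUND`, `MODEL`, `log_odds`, and `scan` are hypothetical names, not part of any CM software.

```python
import math

# Assumed uniform 0th-order background model.
BACKGROUND = {base: 0.25 for base in "ACGU"}

# Toy 3-column model that favors the motif "GAC" (illustrative numbers).
MODEL = [
    {"A": 0.1, "C": 0.1, "G": 0.7, "U": 0.1},
    {"A": 0.7, "C": 0.1, "G": 0.1, "U": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "U": 0.1},
]


def log_odds(window, model):
    """Log2 likelihood ratio of a window under the model vs. the background."""
    return sum(
        math.log2(model[pos][base] / BACKGROUND[base])
        for pos, base in enumerate(window)
    )


def scan(seq, model, threshold):
    """Return (start, score) for every window whose score exceeds the threshold."""
    width = len(model)
    hits = []
    for start in range(len(seq) - width + 1):
        score = log_odds(seq[start:start + width], model)
        if score > threshold:
            hits.append((start, score))
    return hits


hits = scan("UUGACUU", MODEL, threshold=3.0)
print(hits)
```

A covariance model replaces the per-position sum with a dynamic program over states and paths, but the final step is the same: compare a log-odds score against a threshold and report what clears it.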

Profile HMM Structure

• Mj: Match states (20 emission probabilities)
• Ij: Insert states (Background emission probabilities)
• Dj: Delete states (silent - no emission)

How to model an RNA “Motif”?

• Conceptually, start with a profile HMM:
  – from a multiple alignment, estimate nucleotide/insert/delete preferences for each position
  – given a new seq, estimate likelihood that it could be generated by the model, & align it to the model

[Figure: profile HMM over alignment columns labeled “mostly G”, “del”, “ins”, “all G”.]

How to model an RNA “Motif”?

• Covariance Models (aka “profile SCFG”)
  – Probabilistic models, like profile HMMs, but adding “column pairs” and pair emission probabilities for base-paired regions

[Figure: the same alignment with paired columns marked <<<<<<< >>>>>>>.]

[Figures: “mRNA leader” and “mRNA leader switch?”.]

CM Structure

• A: Sequence + structure
• B: the CM “guide tree”
• C: probabilities of letters/pairs & of indels

Think of each branch being an HMM emitting both sides of a helix (but 3’ side emitted in reverse order)

Overall CM Architecture

• One box (“node”) per node of guide tree
• BEG/MATL/INS/DEL just like an HMM
• MATP & BIF are the key additions: MATP emits pairs of symbols, modeling base-pairs; BIF allows multiple helices

CM Viterbi Alignment

$x_i$ = $i$-th letter of input
$x_{ij}$ = substring $i, \ldots, j$ of input
$T_{yz} = P(\text{transition } y \to z)$
$E^{y}_{x_i,x_j} = P(\text{emission of } x_i, x_j \text{ from state } y)$

$$
S^{y}_{ij} = \max_{\pi} \log P(x_{ij} \text{ generated starting in state } y \text{ via path } \pi)
$$

Viterbi, cont.

$$
S^{y}_{ij} =
\begin{cases}
\max_z \left[ S^{z}_{i+1,j-1} + \log T_{yz} + \log E^{y}_{x_i,x_j} \right] & \text{match pair} \\
\max_z \left[ S^{z}_{i+1,j} + \log T_{yz} + \log E^{y}_{x_i} \right] & \text{match/insert left} \\
\max_z \left[ S^{z}_{i,j-1} + \log T_{yz} + \log E^{y}_{x_j} \right] & \text{match/insert right} \\
\max_z \left[ S^{z}_{i,j} + \log T_{yz} \right] & \text{delete} \\
\max_{i < k \le j} \left[ S^{y_{left}}_{i,k} + S^{y_{right}}_{k+1,j} \right] & \text{bifurcation}
\end{cases}
$$

Time $O(qn^3)$, $q$ states, seq len $n$; compare: $O(qn)$ for profile HMM

Mutual Information

$$
M_{ij} = \sum_{x_i, x_j} f_{x_i,x_j} \log_2 \frac{f_{x_i,x_j}}{f_{x_i} f_{x_j}}; \qquad 0 \le M_{ij} \le 2
$$

Max when no seq conservation but perfect pairing

MI = expected score gain from using a pair state

M.I. Example (Artificial)

Alignment (16 sequences, columns 1-9):

1 2 3 4 5 6 7 8 9
A G A U A A U C U
A G A U C A U C U
A G A C G U U C U
A G A U U U U C U
A G C C A G G C U
A G C G C G G C U
A G C U G C G C U
A G C A U C G C U
A G G U A G C C U
A G G G C G C C U
A G G U G U C C U
A G G C U U C C U
A G U A A A A C U
A G U C C A A C U
A G U U G C A C U
A G U U U C A C U

Single-column counts:

      1   2   3   4   5   6   7   8   9
A    16   0   4   2   4   4   4   0   0
C     0   0   4   4   4   4   4  16   0
G     0  16   4   2   4   4   4   0   0
U     0   0   4   8   4   4   4   0  16

MI (bits) for column pairs i < j; every pair involving column 1, 2, 8, or 9 is 0:

        j=4    j=5    j=6    j=7
i=3     0.30   0      1      2
i=4            0.42   0.55   0.30
i=5                   1      0
i=6                          1

• Cols 1 & 9, 2 & 8: perfect conservation & might be base-paired, but unclear whether they are. M.I. = 0
• Cols 3 & 7: No conservation, but always W-C pairs, so seems likely they do base-pair. M.I. = 2 bits.
• Cols 7->6: unconserved, but each letter in 7 has only 2 possible mates in 6. M.I. = 1 bit.

MI-Based Structure-Learning

Finding optimal MI (i.e. opt pairing of cols) is hard(?)

Finding optimal MI without pseudoknots can be done by dynamic programming:

$$
S_{i,j} = \max
\begin{cases}
S_{i+1,j} \\
S_{i,j-1} \\
S_{i+1,j-1} + M_{i,j} \\
\max_{i<k<j} \left( S_{i,k} + S_{k+1,j} \right)
\end{cases}
$$

Pseudoknots disallowed: the recurrence above. Pseudoknots allowed: $\left( \sum_{i=1}^{n} \max_j M_{i,j} \right) / 2$
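The mutual-information example and the pseudoknot-free dynamic program can be sketched together in a few lines. This is an illustrative reading of the slides, not the authors' code: the alignment is the artificial 16-sequence example, the recurrence is the standard Nussinov-style form, and the function names are hypothetical.

```python
import math
from collections import Counter

# The artificial alignment from the M.I. example (16 sequences, 9 columns).
ALIGNMENT = [
    "AGAUAAUCU", "AGAUCAUCU", "AGACGUUCU", "AGAUUUUCU",
    "AGCCAGGCU", "AGCGCGGCU", "AGCUGCGCU", "AGCAUCGCU",
    "AGGUAGCCU", "AGGGCGCCU", "AGGUGUCCU", "AGGCUUCCU",
    "AGUAAAACU", "AGUCCAACU", "AGUUGCACU", "AGUUUCACU",
]


def mutual_information(aln, i, j):
    """M_ij in bits between alignment columns i and j (0-based)."""
    n = len(aln)
    f_i = Counter(seq[i] for seq in aln)
    f_j = Counter(seq[j] for seq in aln)
    f_ij = Counter((seq[i], seq[j]) for seq in aln)
    return sum(
        (count / n) * math.log2((count / n) / ((f_i[a] / n) * (f_j[b] / n)))
        for (a, b), count in f_ij.items()
    )


def best_nested_pairing(M):
    """Maximum total MI over non-crossing (pseudoknot-free) column pairings."""
    n = len(M)
    S = [[0.0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            inner = S[i + 1][j - 1] if i + 1 <= j - 1 else 0.0
            best = max(S[i + 1][j], S[i][j - 1], inner + M[i][j])
            for k in range(i + 1, j):  # split into two independent sub-intervals
                best = max(best, S[i][k] + S[k + 1][j])
            S[i][j] = best
    return S[0][n - 1]


n_cols = len(ALIGNMENT[0])
M = [
    [mutual_information(ALIGNMENT, i, j) if i < j else 0.0 for j in range(n_cols)]
    for i in range(n_cols)
]
print(round(M[2][6], 2), round(M[5][6], 2), round(best_nested_pairing(M), 2))
```

Columns are 0-based in the code, so `M[2][6]` is the cols 3 & 7 entry. The cubic-time loop mirrors the fourth case of the recurrence; a production implementation would also record back-pointers to recover which columns were paired.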

Accelerating CM search

Zasha Weinberg & W.L. Ruzzo Recomb ‘04, Bioinformatics ‘04, ‘06

Rfam database (Release 7.0, 3/2005)

• 503 ncRNA families
• 280,000 annotated ncRNAs
• 8 riboswitches, 235 small nucleolar RNAs, 8 spliceosomal RNAs, 10 bacterial antisense RNAs, 46 microRNAs, 9 ribozymes, 122 cis RNA regulatory elements, …

Covariance Model

Key difference of CM vs HMM: Pair states emit paired symbols, corresponding to base-paired nucleotides; 16 emission probabilities here.

CM’s are good, but slow

• Rfam Reality: EMBL → Blast → CM → hits (Blast discards junk); 1 month, 1000 computers
• Our Work: EMBL → Z → CM → hits (Z discards junk); ~2 months, 1000 computers
• Rfam Goal: EMBL → CM → hits; 10 years, 1000 computers

Oversimplified CM (for pedagogical purposes only)

[Figure: chain of CM states, each emitting a left symbol and a right symbol from A, C, G, U, –.]

CM to HMM

[Figure: each CM state emits a left and a right symbol from A, C, G, U, –; the HMM uses separate left and right states.]

• CM: 25 emissions per state
• HMM: 5 emissions per state, 2x states

Viterbi/Forward Scoring

• Path π defines transitions/emissions
• Score(π) = product of “probabilities” on π
• NB: ok if “probabilities” aren’t, e.g. ∑≠1
  – E.g. in CM, emissions are odds ratios vs 0th-order background
• For any nucleotide sequence x:
  – Viterbi-score(x) = max{ score(π) | π emits x }
  – Forward-score(x) = ∑{ score(π) | π emits x }

Key Issue: 25 scores → 10

[Figure: one CM pair state with scores P becomes an HMM left state with scores L and a right state with scores R.]

• Need: log Viterbi scores CM ≤ HMM

P_AA ≤ L_A + R_A    P_CA ≤ L_C + R_A    …
P_AC ≤ L_A + R_C    P_CC ≤ L_C + R_C    …
P_AG ≤ L_A + R_G    P_CG ≤ L_C + R_G    …
P_AU ≤ L_A + R_U    P_CU ≤ L_C + R_U    …
P_A– ≤ L_A + R_–    P_C– ≤ L_C + R_–    …

NB: HMM not a prob. model

Rigorous Filtering

• Any scores satisfying the linear inequalities give rigorous filtering

Proof:
CM Viterbi path score
≤ “corresponding” HMM path score
≤ Viterbi HMM path score (even if it does not correspond to any CM path)

Some scores filter better

P_UA = 1 ≤ L_U + R_A
P_UG = 4 ≤ L_U + R_G

Assuming ACGU ≈ 25%:

• Option 1: L_U = R_A = R_G = 2, so L_U + (R_A + R_G)/2 = 4
• Option 2: L_U = 0, R_A = 1, R_G = 4, so L_U + (R_A + R_G)/2 = 2.5
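The two options on the "Some scores filter better" slide can be checked mechanically. The sketch below uses only the numbers on the slide; the function names are hypothetical, and the averaging assumes A and G are equally likely on the right-hand side, as the slide's "ACGU ≈ 25%" suggests.

```python
# CM pair-emission log scores from the slide's toy example.
P = {("U", "A"): 1.0, ("U", "G"): 4.0}


def is_rigorous(P, L, R):
    """True if every CM pair score is bounded by the HMM's left + right scores."""
    return all(score <= L[a] + R[b] for (a, b), score in P.items())


def mean_hmm_score(L, R, left="U", rights=("A", "G")):
    """Average HMM score L[left] + R[b] over equally likely right-hand bases b."""
    return L[left] + sum(R[b] for b in rights) / len(rights)


OPTION_1 = ({"U": 2.0}, {"A": 2.0, "G": 2.0})
OPTION_2 = ({"U": 0.0}, {"A": 1.0, "G": 4.0})

for name, (L, R) in (("option 1", OPTION_1), ("option 2", OPTION_2)):
    print(name, is_rigorous(P, L, R), mean_hmm_score(L, R))
```

Both options satisfy the inequalities, so both are rigorous filters, but option 2 gives a lower average HMM score on background sequence and should therefore let fewer false candidates through to the CM.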

Optimizing filtering

• For any nucleotide sequence x:
  – Viterbi-score(x) = max{ score(π) | π emits x }
  – Forward-score(x) = ∑{ score(π) | π emits x }
• Expected Forward Score
  – E(Li, Ri) = ∑x Forward-score(x)*Pr(x), with Pr(x) under 0th-order background model
  – NB: E is a function of Li, Ri only
• Optimization: Minimize E(Li, Ri) subject to score L.I.s
  – This is heuristic (“forward↓ ⇒ Viterbi↓ ⇒ filter↓”)
  – But still rigorous because “subject to score L.I.s”

Calculating E(Li, Ri)

E(Li, Ri) = ∑x Forward-score(x)*Pr(x)

• Forward-like: for every state, calculate expected score for all paths ending there, easily calculated from expected scores of predecessors & transition/emission probabilities/scores

Minimizing E(Li, Ri)

• Calculate E(Li, Ri) symbolically, in terms of emission scores, so we can do partial derivatives for numerical convex optimization algorithm

$$
\frac{\partial E(L_1, L_2, \ldots)}{\partial L_i}
$$

[Figure labels: “Forward:” and “Viterbi:”.]

“Convex” Optimization

• Convex: local max = global max; simple “hill climbing” works
• Nonconvex: can be many local maxima, << global max; “hill-climbing” fails

What should the probabilities be?

• Convex optimization problem
  – Constraints: enforce rigorous property
  – Objective function: filter as aggressively as possible
• Problem sizes:
  – 1000-10000 variables
  – 10000-100000 inequality constraints

Estimated Filtering Efficiency (139 Rfam 4.0 families)

Filtering fraction    # families (compact)    # families (expanded)
< 10^-4               105                     110
10^-4 - 10^-2           8                      17
.01 - .10              11                       3
.10 - .25               2                       2
.25 - .99               6                       4
.99 - 1.0               7                       3

(Slide annotation beside the table: “≈ break even”.)

Averages 283 times faster than CM

Results: buried treasures

Name                     # found       # found rigorous    # new
                         BLAST + CM    filter + CM
Pyrococcus snoRNA            57            180              123
Iron response element       201            322              121
Histone 3’ element         1004           1106              102
Purine riboswitch            69            123               54
Retron msr                   11             59               48
Hammerhead I                167            193               26
Hammerhead III              251            264               13
U4 snRNA                    283            290                7
S-box                       128            131                3
U6 snRNA                   1462           1464                2
U5 snRNA                    199            200                1
U7 snRNA                    312            313                1

Building CM’s

• Hand-curated alignments + structure as in Rfam are great, but it doesn’t scale
• Example Application: Given 5-20 upstream regions (~500 nt) of orthologous bacterial genes, some (but not all) plausibly regulated by a common riboswitch, could we find it?

Importance of Alignment

• Blue boxes, e.g., should be lined up.
• Structure is invisible otherwise.

CMfinder

Harder: Finding CMs without alignment

Yao, Weinberg & Ruzzo, Bioinformatics, 2006

[Figure: Folding predictions → Smart heuristics → Candidate alignment → CM → Search → Realign (back to the alignment).]

CMfinder Accuracy (on Rfam families with flanking sequence)

[Figure: accuracy plot; legend entries end in “/CW”.]

Summary of Rfam test families and results

[Table not recoverable.]

Li = column i; σ = (α, β) the 2ary struct, α = unpaired, β = paired cols

With MLE params, Iij is the mutual information between cols i and j

Can find it via a simple dynamic programming alg.

A Computational Pipeline for High Throughput Discovery of cis-Regulatory Noncoding RNA in Prokaryotes

PLoS Comp Biol, 2007

Zizhen Yao, Jeffrey Barrick, Zasha Weinberg, Shane Neph, Ronald Breaker, Martin Tompa and Walter L. Ruzzo

An approach for cis-regulatory RNA discovery in bacteria

1. Get all sequenced bacterial genomes
2. Group upstream sequences per CDD
3. Find most promising genes, based on sequence motifs conserved in group
4. From those, find most promising candidates, incorporating structure in the motifs
5. From those, genome-wide searches for more instances
6. Expert analyses (Breaker Lab, Yale)

Pipeline stages, cost, and output:

• Identify CDD group members (< 10 CPU days): 2946 CDD groups
• Retrieve upstream sequences
• Footprinter ranking (< 10 CPU days)
• CMfinder (1 ~ 2 CPU months): 35975 motifs
• Motif postprocessing: 1740 motifs
• RaveNnA (10 CPU months)
• CMfinder refinement (< 1 CPU month)
• Motif postprocessing: 1466 motifs

Genome Scale Search: Why

• Most riboswitches, e.g., are present in ~5 copies per genome
• Throughout (most of) clade
• More examples give better model, hence even more examples, fewer errors
• More examples give more clues to function

Genome Scale Search: How

CMfinder is directly usable for/with search

[Figure: Folding predictions → Smart heuristics → Candidate alignment → CM → Search → Realign (back to the alignment).]

Results

• Process largely complete in
  – bacillus/clostridia
  – gamma proteobacteria
  – cyanobacteria
  – actinobacteria
• Analysis ongoing

Rank   #   CDD    Gene: Description                                            Annotation
  6   69  28178   DHOase IIa: Dihydroorotase                                   PyrR attenuator [22]
 15   33  10097   RplL: Ribosomal protein L7/L1                                L10 r-protein leader; see Supp
 19   36  10234   RpsF: Ribosomal protein S6                                   S6 r-protein leader
 22   32  10897   COG1179: Dinucleotide-utilizing enzymes                      6S RNA [25]
 27   27   9926   RpsJ: Ribosomal protein S10                                  S10 r-protein leader; see Supp
 29   11  15150   Resolvase: N terminal domain
 31   31  10164   InfC: Translation initiation factor 3                        IF-3 r-protein leader; see Supp
 41   26  10393   RpsD: Ribosomal protein S4 and related proteins              S4 r-protein leader; see Supp [30]
 44   30  10332   GroL: Chaperonin GroEL                                       HrcA DNA binding site [46]
 46   33  25629   Ribosomal L21p: Ribosomal prokaryotic L21 protein            L21 r-protein leader; see Supp
 50   11   5638   Cad: Cadmium resistance transporter                          [47]
 51   19   9965   RplB: Ribosomal protein L2                                   S10 r-protein leader
 55    7  26270   RNA pol Rpb2 1: RNA polymerase beta subunit
 69    9  13148   COG3830: ACT domain-containing protein
 72   28   4174   Ribosomal S2: Ribosomal protein S2                           S2 r-protein leader
 74    9   9924   RpsG: Ribosomal protein S7                                   S12 r-protein leader
 86    6  12328   COG2984: ABC-type uncharacterized transport system
 88   19  24072   CtsR: Firmicutes transcriptional repressor of class III      CtsR DNA binding site [48]
100   21  23019   Formyl trans N: Formyl transferase
103    8   9916   PurE: Phosphoribosylcarboxyaminoimidazole
117    5  13411   COG4129: Predicted membrane protein
120   10  10075   RplO: Ribosomal protein L15                                  L15 r-protein leader
121    9  10132   RpmJ: Ribosomal protein L36                                  IF-1 r-protein leader
129    4  23962   Cna B: Cna protein B-type domain
130    9  25424   Ribosomal S12: Ribosomal protein S12                         S12 r-protein leader
131    9  16769   Ribosomal L4: Ribosomal protein L4/L1 family                 L3 r-protein leader
136    7  10610   COG0742: N6-adenine-specific methylase                       ylbH putative RNA motif [4]
140   12   8892   Pencillinase R: Penicillinase repressor                      BlaI, MecI DNA binding site [49]
157   25  24415   Ribosomal S9: Ribosomal protein S9/S16                       L13 r-protein leader; Fig 3
160   27   1790   Ribosomal L19: Ribosomal protein L19                         L19 r-protein leader; Fig 2
164    6   9932   GapA: Glyceraldehyde-3-phosphate dehydrogenase/erythrose
174    8  13849   COG4708: Predicted membrane protein
176    7  10199   COG0325: Predicted enzyme with a TIM-barrel fold
182    9  10207   RpmF: Ribosomal protein L32                                  L32 r-protein leader
187   11  27850   LDH: L-lactate dehydrogenases
190   11  10094   CspR: Predicted rRNA methylase
194    9  10353   FusA: Translation elongation factors                         EF-G r-protein leader

Table 3: High ranking motifs not found in …

Actino Results: finding known RNAs

Rfam Family   Type (metabolite)         Rank        accession   function/metabolite
THI           riboswitch (thiamine)     4           RF00059     thiamin (pyrophosphate?)
ydaO-yuaA     riboswitch (unknown)      19          RF00379     osmotic shock; triggers AA transporters
Cobalamin     riboswitch (cobalamin)    21          RF00174     adenosylcobalamin (aka b12?)
SRP_bact      gene                      28          RF00169     signal recognition particle
RFN           riboswitch (FMN)          39          RF00050     flavin mononucleotide (FMN)
yybP-ykoY     riboswitch (unknown)      48          RF00080     unknown (diverse genes); called SraF in E.coli
gcvT          riboswitch (glycine)      53          RF00504     glycine
S_box         riboswitch (SAM)          401         RF00162     SAM: s-adenyl methionine
tmRNA         gene                      Not found   RF00023     aka 10Sa RNA or SsrA; frees mRNAs from stalled ribosomes
RNaseP        gene                      Not found   RF00010     tRNA maturation; is a ribozyme in bacteria

(Slide annotation: “not cis-regulatory”.)

                               Membership          Overlap             Structure
Rfam                           #     Sn       Sp     nt    Sn    Sp     bp   Sn    Sp
RF00174  Cobalamin             183   0.74(1)  0.97   152   0.75  0.85   20   0.60  0.77
RF00504  Glycine                92   0.56(1)  0.96    94   0.94  0.68   17   0.84  0.82
RF00234  glmS                   34   0.92     1.00   100   0.54  1.00   27   0.96  0.97
RF00168  Lysine                 80   0.82     0.98   111   0.61  0.68   26   0.76  0.87
RF00167  Purine                 86   0.86     0.93    83   0.83  0.55   17   0.90  0.95
RF00050  RFN                   133   0.98     0.99   139   0.96  1.00   12   0.66  0.65
RF00011  RNaseP_bact_b         144   0.99     0.99   194   0.53  1.00   38   0.72  0.78
RF00162  S_box                 208   0.95     0.97   110   1.00  0.69   23   0.91  0.78
RF00169  SRP_bact              177   0.92     0.95    99   1.00  0.65   25   0.89  0.81
RF00230  T-box                 453   0.96     0.61   187   0.77  1.00    5   0.32  0.38
RF00059  THI                   326   0.89     1.00    99   0.91  0.69   13   0.56  0.74
RF00442  ykkC-yxkD              19   0.90     0.53    99   0.94  0.81   18   0.94  0.68
RF00380  ykoK                   49   0.92     1.00   125   0.75  1.00   27   0.80  0.95
RF00080  yybP-ykoY              41   0.32     0.89   100   0.78  0.90   18   0.63  0.66
mean                           145   0.84     0.91   121   0.81  0.82   21   0.75  0.77
median                         113   0.91     0.97   105   0.81  0.83   19   0.78  0.78

Tbl 2: Prediction accuracy compared to prokaryotic subset of Rfam full alignments. Membership: # of seqs in overlap between our predictions and Rfam’s, the sensitivity (Sn) and specificity (Sp) of our membership predictions. Overlap: the avg len of overlap between our predictions and Rfam’s (nt), the fractional lengths of the overlapped region in Rfam’s predictions (Sn) and in ours (Sp). Structure: the avg # of correctly predicted canonical base pairs (in overlapped regions) in the secondary structure (bp), and sensitivity and specificity of our predictions.
(1) After 2nd RaveNnA scan, membership Sn of Glycine and Cobalamin increased to 76% and 98% resp., Glycine Sp unchanged, but Cobalamin Sp dropped to 84%.

[Figures: “mRNA leader” and “mRNA leader switch?”.]

Identification of 22 candidate structured RNAs in bacteria using the CMfinder comparative genomics pipeline

NAR 2007

Zasha Weinberg, Jeffrey E. Barrick, Zizhen Yao, Adam Roth, Jane N. Kim, Jeremy Gore, Joy Xin Wang, Elaine R. Lee, Kirsten F. Block, Narasimhan Sudarsan, Shane Neph, Martin Tompa, Walter L. Ruzzo and Ronald R. Breaker

Motif               RNA?  Cis?  Switch?  Phylum/class      M,V   Cov.  #     Non cis
GEMM                Y     Y     y        Widespread        V     21    322   12/309
Moco                Y     Y     Y        Widespread        M,V   15    105   3/81
SAH                 Y     Y     Y        Proteobacteria    M,V   22    42    0/41
SAM-IV              Y     Y     Y        Actinobacteria    V     28    54    2/54
COG4708             Y     Y     y        Firmicutes        M,V   8     23    0/23
sucA                Y     Y     y        -proteobacteria         9     40    0/40
23S-methyl          Y     Y     n        Firmicutes              12    38    1/37
hemB                Y     ?     ?        -proteobacteria   V     12    50    2/50
  (anti-hemB): (n) (n) (37) (31/37)
MAEB                ?     Y     n        -proteobacteria         3     662   15/646
mini-ykkC           Y     Y     ?        Widespread        V     17    208   1/205
purD                y     Y     ?        -proteobacteria   M     16    21    0/20
6C                  y     ?     n        Actinobacteria          21    27    1/27
alpha-transposases  ?     N     N        -proteobacteria         16    102   39/99
excisionase         ?     ?     n        Actinobacteria          7     27    0/27
ATPC                y     ?     ?        Cyanobacteria           11    29    0/23
cyano-30S           Y     Y     n        Cyanobacteria           7     26    0/23
lacto-1             ?     ?     n        Firmicutes              10    97    18/95
lacto-2             y     N     n        Firmicutes              14    357   67/355
TD-1                y     ?     n        Spirochaetes      M,V   25    29    2/29
TD-2                y     N     n        Spirochaetes      V     11    36    17/36
coccus-1            ?     N     N        Firmicutes              6     246   112/189
gamma-150           ?     N     N        -proteobacteria         9     27    6/27

Comparative genomics beyond sequence based alignments: RNA structures in the ENCODE regions

Torarinsson, Yao, Wiklund, Bramsen, Hansen, Kjems, Tommerup, Ruzzo and Gorodkin.
Genome Research, Feb 2008, 18(2):242-251 PMID: 18096747

Finding vertebrate ncRNAs

• Natural approach: Align, Fold, Score
• UCSC Browser tracks for Evofold, RNAz
• Thousands of candidates

Alignment Matters

[Figure slide.]

Comparison with Evofold, RNAz

[Figure: Venn diagram of CMfinder, Evofold, and RNAz candidates; region counts shown on the slide: 230, 4799, 3134, 44, 548, 169, 1781.]

• Small overlap (w/ highly significant p-values) emphasizes complementarity
• Strong association with known genes
• Strong association with “Indel purified segments” - i.e., apparently under selection

10 of 11 top expressed, usually differentially

Assoc w/ coding genes

Many known human ncRNAs lie in introns. Several of our candidates do, too, including some of the tested ones:
• #6: SYN3 (Synapsin 3)
• #10: TIMP3, antisense within SYN3 intron
• #9: GRM8 (glutamate receptor metabotropic 8)

Estimated FDR

[Figure slide.]

Software

• Infernal - (Eddy et al.) most of Eddy & Durbin
• RaveNnA - (Weinberg) fast filtering
• CMfinder - (Yao) Motif discovery (local alignment)

Open Problems - Better algorithms & scoring

• incorporating phylogeny in model construction & scoring
  – e.g. “mutual information” ignores it
• improve scoring by “shuffling”
• handling pseudoknots
  – for scan filtering
• comparing & clustering RNA structures
• search/alignment/inference with splicing

Open Problems - Better CM’s

• Optional- and variable-length stems
• Riboswitches & other regulatory RNAs often switch between conformations; better search & alignment exploiting both alternatives?
• other ideas
• “Augmented” CM
  – probably too slow for scan, but plausibly could be used for alignment
• Better use of prior knowledge? (GNRA tetraloops, single-stranded A’s…)

Open Problems - Applications & Biology

• clustering intergenic sequences, esp prokaryotic
• systematic look at eukaryotic UTRs
• how to cluster? how to score?
• “swiss-cheese phylogenies”
• evidence for selection (no dN/dS)

Summary

• ncRNA is a “hot” topic
• For family homology modeling: CMs
• Training & search like HMM (but slower)
• Dramatic acceleration possible
• Automated model construction possible
• New computational methods lead to new discoveries