Probabilistic walks with Graeme

Sean Eddy Molecular & Cellular Biology, and Applied Mathematics HHMI and Harvard University eddylab.org

Attempt to repeat some analytic method that is considered unreliable and difficult until patience and hard work yield results similar to those published by the author. Pleasure derived from success, especially if it has come without the supervision of an instructor (that is, working alone), is a clear indication of aptitude for experimental work. Santiago Ramon y Cajal (1916) April, 1993 I had proposed to develop reporter fusions to neural-specific promoters as a tool to visualize axonal processes in live animals, facilitating genetic screens... I have started to play with... green fluorescent protein (GFP) from the jellyfish... I found out at the worm meeting in June that Marty Chalfie’s lab is onto the same idea. Chalfie has already obtained bright fluorescence in the axons of the six touch neurons... [GFP introduced: Chalfie et al, Science 263:802, 1994]

Image from Yishi Jin, UC Santa Cruz GABAergic neurons in C. elegans

Somewhat embarrassingly, my most productive work has been unrelated to the original proposal, resulting from some moonlighting as a computational biologist... and a lingering interest in RNA structure from my work.... I invented a new kind of statistical model, related to HMMs, which can model the two-dimensional structure consensus of RNAs. - progress report for my postdoc grant, 1993

From sequence to RNA structure analysis

HMM algorithm SCFG algorithm Goal (sequence) (RNA structure)

optimal alignment Viterbi CYK P(sequence | model) Forward Inside EM parameter estimation Forward-Backward Inside-Outside

memory complexity: O(MN) O(MN2) time complexity (general): O(M2N) O(M3N3) time complexity (as used): O(MN) O(MN3)

• we can analyze target sequences with secondary structure models; • but the algorithms are computationally expensive.

Software tools for homology search and alignment HMMER Infernal protein homology search: profile HMMs RNA homology search: profile SCFGs http://hmmer.org http://eddylab.org/infernal

55K lines source code 90K lines source code >40,000 downloads/yr >10,000 downloads/yr

human Dicer | Lau et al, Struct Mol Biol 19:436 (2012) glnA glutamine riboswitch | RF01739

Pfam 17929 protein domain families 3016 RNA structure families http://pfam.xfam.org http://rfam.xfam.org

Xfam Consortium and HMMER server team: Rob Finn, European Institute, Cambridge UK A A G A A U Group I IB4 intron (IB4.5) C G Group I IA3 intron (IA3.1) phylum: Woesearchaeota G C C phylum: Woesearchaeota A U accession: KP308748.1 P9.1C G accession: CP010426.1 host gene: LSU rRNA 260 U A host gene: LSU rRNA G U A A P9 G C 280 P9.2 A G A A C G G U G C C A U G G G U U C C U U A C G C A A U G A C G G U A C G A C C G A G G G A A U A 240 U 300 A U P9 A 240 U G 3’ A G C a C G a G C U U A 3’ c A G A u U A UGG g A a U U C g U G C G P10 c C A UU g A A U a A A U G C U A A C g A U U A U U G a U A A U U A A G C 160 A U G C GP9.0 G C u C G A U P5 G C G u U A U A U A P5a U A P5 G C g A U C G C G C g G C c G U A P9.0 A U C G 100 C C g G C U A G u C P5a A A A U P7.2 C G C A P1U a A U C g G C U A A U A A 60 P1 A U A U a U G G c G U 140 A G A U U a U G G u A U 220 C G C G C G A U A A 160 C G 80 A u A C G U a 5’ C G U A A G c C G U U P4 A G A U 20 P7(pseudoknot) C G U A A P7.1 G A u 5’ U A 80 G A P7 A G P4 G C G C 200 U A 100 G C UA G C (pseudo- U C G A U A C A knot) U A C G CA A U A U G A U C G A U U U A A A G C A A G C G G A U G U A 140 A C G C U A A P6 G C U A C A A P6 G C UA C U G C A CU C G C A U A A U U C G G C 180 C A A A A A A G C G C G G C G C A A G C P3 G C P3 G U G A A G U A C G U A U A A U G G U G C A C G A C G G UU U 40 A U G C U A A U A U A A P5bU A P6a U U A U 60 A G C 120 A A U P6a C G A G U G C A U A U U A C G G C C G A U C G A U G C G G C U A G C 120 A G U C C G C G P8 U G C G C G A P8 A A A U 220 G C C G A U A A G 180 G A U U A A A A U U A P2 A U A 456nt A U 200 A U G 20 A U G U U A G C A G U A P2 A U A CG G C G A U G C A U C homology to LAGLI_DADG A G A C homing endonuclease A C (PF00961) C G A U 40 G A A A Nawrocki et al, NAR 46:7970 (2018)

“In my view, biology needs numbers; not after the fashion of physics, but in a good engineering, computational spirit.”