Probabilistic walks with Graeme
Sean Eddy
Molecular & Cellular Biology, and Applied Mathematics
HHMI and Harvard University
eddylab.org
"Attempt to repeat some analytic method that is considered unreliable and difficult until patience and hard work yield results similar to those published by the author. Pleasure derived from success, especially if it has come without the supervision of an instructor (that is, working alone), is a clear indication of aptitude for experimental work."
Santiago Ramon y Cajal (1916)

April, 1993
I had proposed to develop reporter fusions to neural-specific promoters as a tool to visualize axonal processes in live animals, facilitating genetic screens... I have started to play with... green fluorescent protein (GFP) from the jellyfish... I found out at the worm meeting in June that Marty Chalfie’s lab is onto the same idea. Chalfie has already obtained bright fluorescence in the axons of the six touch neurons...
[GFP introduced: Chalfie et al, Science 263:802, 1994]
[Figure: GABAergic neurons in C. elegans. Image from Yishi Jin, UC Santa Cruz]
Somewhat embarrassingly, my most productive work has been unrelated to the original proposal, resulting from some moonlighting as a computational biologist... and a lingering interest in RNA structure from my thesis work.... I invented a new kind of statistical model, related to HMMs, which can model the two-dimensional structure consensus of RNAs.
- progress report for my postdoc grant, 1993
From sequence to RNA structure analysis
Goal                         HMM algorithm      SCFG algorithm
                             (sequence)         (RNA structure)
optimal alignment            Viterbi            CYK
P(sequence | model)          Forward            Inside
EM parameter estimation      Forward-Backward   Inside-Outside
memory complexity            O(MN)              O(MN^2)
time complexity (general)    O(M^2 N)           O(M^3 N^3)
time complexity (as used)    O(MN)              O(MN^3)
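The HMM column of the table can be made concrete with a small sketch. The code below is an illustration only, not HMMER's implementation: it runs Forward and Viterbi side by side on a generic discrete HMM so that the only difference between them (sum versus max over predecessor states) is visible. The function name `forward_and_viterbi` and the dictionary layout of `start`, `trans`, and `emit` are assumptions made for this example. With M states and N observed symbols, the double loop over states gives the general O(M^2 N) time in the table.

```python
def forward_and_viterbi(obs, states, start, trans, emit):
    """Return (P(obs | model), best-path probability, best path).

    Hypothetical layout: start[k], trans[j][k], emit[k][symbol] are probabilities.
    Plain probabilities are used for clarity; real code would work in log space.
    """
    # fwd[k] sums over all state paths ending in k; vit[k] keeps only the best one.
    fwd = {k: start[k] * emit[k][obs[0]] for k in states}
    vit = dict(fwd)
    back = []  # back[i][k] = best predecessor of state k at position i + 1
    for x in obs[1:]:
        new_fwd, new_vit, ptr = {}, {}, {}
        for k in states:
            # Forward: sum over predecessors j.
            new_fwd[k] = emit[k][x] * sum(fwd[j] * trans[j][k] for j in states)
            # Viterbi: max over predecessors j, remembering which one won.
            best = max(states, key=lambda j: vit[j] * trans[j][k])
            new_vit[k] = emit[k][x] * vit[best] * trans[best][k]
            ptr[k] = best
        fwd, vit = new_fwd, new_vit
        back.append(ptr)
    # Trace back the optimal path from the best final state.
    last = max(states, key=lambda k: vit[k])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return sum(fwd.values()), vit[last], path
```

Calling it with, for example, two states that prefer A/T-rich and G/C-rich sequence returns the total probability of the sequence, the probability of the single best state path, and that path. The Forward value is always at least as large as the Viterbi value, since it sums over every path including the best one.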
• we can analyze target sequences with secondary structure models; • but the algorithms are computationally expensive.
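To give a feel for where the extra cost on the structure side comes from, here is a minimal sketch of Nussinov-style base-pair maximization. This is a deliberately simpler stand-in, not the profile SCFG CYK algorithm used in Infernal: it has no model states (no M), but it shares the same shape of dynamic programming over all subsequences i..j with an inner split point k, which is what produces O(N^2) memory and O(N^3) time in the sequence length. The function name `nussinov`, the `min_loop` parameter, and the set of allowed pairs are assumptions chosen for this illustration.

```python
def nussinov(seq, min_loop=3):
    """Maximum number of nested base pairs in an RNA sequence.

    Simplified illustration: every allowed pair scores 1, and a hairpin
    loop must contain at least min_loop unpaired bases.
    """
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    # best[i][j] = max pairs within seq[i..j]; O(N^2) memory.
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):      # subsequence lengths, short to long
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]           # case 1: j is unpaired
            for k in range(i, j - min_loop): # case 2: j pairs with some k
                if (seq[k], seq[j]) in pairs:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1] if n else 0
```

The three nested loops over span, i, and k are the O(N^3) term. A profile SCFG adds a loop over model states on top of this, which is consistent with the O(MN^3) "as used" time in the table above.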
Software tools for homology search and alignment

HMMER                                        Infernal
protein homology search: profile HMMs        RNA homology search: profile SCFGs
http://hmmer.org                             http://eddylab.org/infernal
55K lines source code                        90K lines source code
>40,000 downloads/yr                         >10,000 downloads/yr
[Figure: human Dicer | Lau et al, Nature Struct Mol Biol 19:436 (2012)]
[Figure: glnA glutamine riboswitch | RF01739]
Pfam                                 Rfam
17929 protein domain families        3016 RNA structure families
http://pfam.xfam.org                 http://rfam.xfam.org
Xfam Consortium and HMMER server team: Rob Finn, Alex Bateman
European Bioinformatics Institute, Cambridge UK

[Figure: secondary structure diagrams of two group I introns, with helices labeled P1 through P10 and a P7 pseudoknot.
Group I IB4 intron (IB4.5): phylum Woesearchaeota; accession KP308748.1; host gene LSU rRNA.
Group I IA3 intron (IA3.1): phylum Woesearchaeota; accession CP010426.1; host gene LSU rRNA.
Figure annotation: 456nt, homology to LAGLI_DADG homing endonuclease (PF00961).
Nawrocki et al, NAR 46:7970 (2018)]
“In my view, biology needs numbers; not after the fashion of physics, but in a good engineering, computational spirit.”