Covariance models and Infernal 1.1

Eric Nawrocki

Sean Eddy’s Lab

Howard Hughes Medical Institute Janelia Farm Research Campus

Sequence conservation provides information for homology searches

Conservation levels vary across alignment columns.

[Figure: tree relating an ancestral sequence (past) to eight present-day homologs. The alignment of homologs is summarized as a sequence profile. * marks 100% conserved columns.]

ancestor  GCAUUCGUAGC

          *  ****   *
yeast     GUCUUCGGCAC
fly       GCCUUCGGAGC
cow       GCAUUCGUCGC
mouse     GCUUUCGAUGC
human     GCGUUCGCUGC
chicken   GUAUUCGUAAC
snake     GUGUUCGCGAC
croc      GUUUUCGAGAC

Structure conservation provides additional information

Base-paired positions covary to maintain Watson-Crick complementarity.

[Figure: the same tree and alignment, annotated with the consensus secondary structure, with the hairpin drawn for the ancestor and for each homolog.]

struct    <<<---->->>
yeast     GUCUUCGGCAC
fly       GCCUUCGGAGC
cow       GCAUUCGUCGC
mouse     GCUUUCGAUGC
human     GCGUUCGCUGC
chicken   GUAUUCGUAAC
snake     GUGUUCGCGAC
croc      GUUUUCGAGAC

Levels of sequence and structure conservation in RNA families
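One way to put numbers on the two kinds of conservation, using the toy alignment from the previous slides: per-column entropy finds the 100% conserved columns, and mutual information between base-paired columns measures covariation. This is a minimal sketch, not how Infernal scores alignments. The function names are illustrative, and the frequencies are raw counts with no priors or sequence weighting.

```python
from collections import Counter
from math import log2

# Toy alignment and consensus structure line from the slides.
ALIGNMENT = {
    "yeast":   "GUCUUCGGCAC",
    "fly":     "GCCUUCGGAGC",
    "cow":     "GCAUUCGUCGC",
    "mouse":   "GCUUUCGAUGC",
    "human":   "GCGUUCGCUGC",
    "chicken": "GUAUUCGUAAC",
    "snake":   "GUGUUCGCGAC",
    "croc":    "GUUUUCGAGAC",
}
STRUCTURE = "<<<---->->>"


def column(seqs, i):
    """Residues found in alignment column i (0-based)."""
    return [s[i] for s in seqs]


def entropy(symbols):
    """Shannon entropy in bits of the observed symbol frequencies."""
    n = len(symbols)
    return -sum(c / n * log2(c / n) for c in Counter(symbols).values())


def mutual_information(seqs, i, j):
    """Mutual information in bits between columns i and j (0-based)."""
    pairs = list(zip(column(seqs, i), column(seqs, j)))
    return entropy(column(seqs, i)) + entropy(column(seqs, j)) - entropy(pairs)


def base_pairs(structure):
    """Match '<' with '>' using a stack; returns 0-based (i, j) tuples."""
    stack, pairs = [], []
    for pos, ch in enumerate(structure):
        if ch == "<":
            stack.append(pos)
        elif ch == ">":
            pairs.append((stack.pop(), pos))
    return sorted(pairs)


seqs = list(ALIGNMENT.values())
# '*' under every column whose entropy is zero (100% conserved).
conserved = "".join(
    "*" if entropy(column(seqs, i)) == 0 else " " for i in range(len(STRUCTURE))
)
# Mutual information for each annotated base pair, keyed by 1-based columns.
pair_mi = {
    (i + 1, j + 1): mutual_information(seqs, i, j) for i, j in base_pairs(STRUCTURE)
}
print(conserved)
print(pair_mi)
```

In this alignment the pair of columns 3 and 8 carries 2 bits of mutual information and the pair 2 and 10 carries 1 bit. The pair 1 and 11 carries 0 bits because both columns are invariant, so it is conserved in sequence but shows no covariation.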

[Figure: scatter plot of RNA families. x-axis: sequence profile (HMM) average score (bits), 20 to 500. y-axis: extra bits from structure (CM-HMM average score), 0 to 80. Labeled families include RNaseP (bacterial type A, bacterial type B, archaeal), SRP (euk and arch; bacterial), SSU rRNA (5' domain), IRES (Aphthovirus), tmRNA, RNase MRP, Tbox, 7SK, U2, 6S, tRNA, 5S rRNA, Cobalamin, SECIS, TPP, FMN, RRE, group II intron, snoRNA (SNORA73), histone 3' UTR stem loop, and microRNA (MIR_807).]

Eddy lab software for profile probabilistic models (since 1994)

                       sequence profiles         sequence and structure profiles

models                 profile HMMs              covariance models (CMs)

software               HMMER                     Infernal (prev. COVE)

main use               proteins                  RNAs

database               Pfam (12273 families)     Rfam (1973 families)

performance for RNAs   faster but                slower but
                       less accurate             more accurate

HMMER: http://hmmer.janelia.org
  Eddy SR. PLoS Comp. Biol., 4:e1000069, 2008.
  Eddy SR. Bioinformatics, 14:755-763, 1998.

Infernal: http://infernal.janelia.org
  Nawrocki EP, Kolbe DL, Eddy SR. Bioinformatics, 25(10):1335-1337, 2009.
  Eddy SR, Durbin R. Nucleic Acids Research, 22:2079-2088, 1994.

Profile HMMs: sequence family models built from alignments

One HMM node per alignment column.

yeast     GUCUUCGGCAC
fly       GCCUUCGGAGC
cow       GCAUUCGUCGC
mouse     GCUUUCGAUGC
human     GCGUUCGCUGC
chicken   GUAUUCGUAAC
snake     GUGUUCGCGAC
croc      GUUUUCGAGAC

3 states per node:
(M) Match: emits residues
(I) Insert: inserts extra residues
(D) Delete: deletes residues

Node for column 2: the match state emits C with P(C) = 0.5 and U with P(U) = 0.5.

HMMs generate homologous sequences.

[Figure: profile HMM with a begin state B, nodes 1 to 11 (each with a match state M, an insert state I, and a delete state D), transitions between states, and an end state E.]
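The match emission probabilities can be read straight off the alignment columns. The sketch below reproduces the "node for column 2" numbers from the slide. It uses raw frequencies with an optional pseudocount, which is a simplification: real profile HMM software estimates emissions with priors and sequence weighting. The function name and the `pseudocount` parameter are illustrative.

```python
from collections import Counter

ALIGNMENT = [
    "GUCUUCGGCAC",  # yeast
    "GCCUUCGGAGC",  # fly
    "GCAUUCGUCGC",  # cow
    "GCUUUCGAUGC",  # mouse
    "GCGUUCGCUGC",  # human
    "GUAUUCGUAAC",  # chicken
    "GUGUUCGCGAC",  # snake
    "GUUUUCGAGAC",  # croc
]


def match_emissions(alignment, alphabet="ACGU", pseudocount=0.0):
    """One emission distribution per alignment column (one per HMM node)."""
    total = len(alignment) + pseudocount * len(alphabet)
    model = []
    for col in zip(*alignment):  # iterate over columns
        counts = Counter(col)
        model.append({a: (counts[a] + pseudocount) / total for a in alphabet})
    return model


emissions = match_emissions(ALIGNMENT)
node2 = emissions[1]  # node for column 2 (0-based index 1)
print(node2)
```

With `pseudocount=0.0`, residues never seen in a column get probability zero, which would make any sequence carrying them impossible under the model. A small positive pseudocount avoids that.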

Given a sequence, the most likely path that could have generated that sequence can be computed.

Example: a new homolog, worm (GCGUUCGCGGC), aligns to the model along the path B, M1 through M11, E. Every residue is emitted by a match state, with no inserts or deletes.

Example with an insert and a delete: a further homolog, corn, adds a column to the alignment and skips another ('.' marks a gap).

yeast     GU.CUUCGGCAC
fly       GC.CUUCGGAGC
cow       GC.AUUCGUCGC
mouse     GC.UUUCGAUGC
human     GC.GUUCGCUGC
chicken   GU.AUUCGUAAC
snake     GU.GUUCGCGAC
croc      GU.UUUCGAGAC
worm      GC.GUUCGCGGC
corn      GUGAUUCGU.GC

The path for corn uses the insert state after node 2 (emitting the extra G) and the delete state at node 9. The other ten nodes use their match states.
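The most likely path can be found with the Viterbi algorithm. Below is a minimal sketch on a simplified profile HMM built from the eight-sequence alignment. Everything numeric in it is an assumption made for illustration: the transition probabilities, the uniform insert emissions, and the pseudocount of 0.1. The begin state B is stored as "M0" and leaving the last node for E is treated as free. HMMER's actual model and parameterization differ.

```python
from collections import Counter
from math import log

ALIGNMENT = [
    "GUCUUCGGCAC", "GCCUUCGGAGC", "GCAUUCGUCGC", "GCUUUCGAUGC",
    "GCGUUCGCUGC", "GUAUUCGUAAC", "GUGUUCGCGAC", "GUUUUCGAGAC",
]
ALPHABET = "ACGU"
NEG_INF = float("-inf")

# Hypothetical log transition probabilities, shared by every node.
T = {
    ("M", "M"): log(0.90), ("M", "I"): log(0.05), ("M", "D"): log(0.05),
    ("I", "M"): log(0.50), ("I", "I"): log(0.50),
    ("D", "M"): log(0.50), ("D", "D"): log(0.50),
}
INSERT_EMIT = log(0.25)  # assumed uniform insert emissions


def match_log_emissions(alignment, pseudocount=0.1):
    """Per-column log emission probabilities with a small assumed pseudocount."""
    total = len(alignment) + pseudocount * len(ALPHABET)
    model = []
    for col in zip(*alignment):
        counts = Counter(col)
        model.append({a: log((counts[a] + pseudocount) / total) for a in ALPHABET})
    return model


def best_prev(score, k, i, target):
    """Best (score, state) among states at cell (k, i) allowed to enter target."""
    return max((score[p][k][i] + T[(p, target)], p) for p in "MID" if (p, target) in T)


def viterbi(seq, emis):
    """Most likely state path for seq, as labels such as 'M3', 'I2', 'D9'."""
    K, L = len(emis), len(seq)
    # score[s][k][i]: best log probability of ending in state s of node k
    # after emitting the first i residues.
    score = {s: [[NEG_INF] * (L + 1) for _ in range(K + 1)] for s in "MID"}
    back = {s: [[None] * (L + 1) for _ in range(K + 1)] for s in "MID"}
    score["M"][0][0] = 0.0  # begin state B
    for k in range(K + 1):
        for i in range(L + 1):
            if k > 0 and i > 0:  # match: next node, emits one residue
                sc, prev = best_prev(score, k - 1, i - 1, "M")
                score["M"][k][i] = sc + emis[k - 1][seq[i - 1]]
                back["M"][k][i] = prev
            if i > 0:  # insert: same node, emits one residue
                sc, prev = best_prev(score, k, i - 1, "I")
                score["I"][k][i] = sc + INSERT_EMIT
                back["I"][k][i] = prev
            if k > 0:  # delete: next node, emits nothing
                sc, prev = best_prev(score, k - 1, i, "D")
                score["D"][k][i] = sc
                back["D"][k][i] = prev
    # Trace back from the best state at the last node to the begin state.
    state = max("MID", key=lambda s: score[s][K][L])
    path, k, i = [], K, L
    while (state, k, i) != ("M", 0, 0):
        path.append(f"{state}{k}")
        prev = back[state][k][i]
        if state == "M":
            k, i = k - 1, i - 1
        elif state == "I":
            i -= 1
        else:
            k -= 1
        state = prev
    return path[::-1]


EMISSIONS = match_log_emissions(ALIGNMENT)
print(viterbi("GCGUUCGCGGC", EMISSIONS))  # worm
print(viterbi("GUGAUUCGUGC", EMISSIONS))  # corn
```

Under these assumed parameters the worm sequence follows the all-match path, and the corn sequence gets a path with one insert and one delete. The exact placement of corn's insert and delete can differ from the slide, because columns 3, 8, and 9 each contain two of every residue, so several placements score identically.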

Covariance models (CMs) are built from structure-annotated alignments

A. Input multiple alignment:

[structure]  .::<<<::::>:>>:<<:<.:::.>>>.
human        .AAGACUUCGGAUCUGGCG.ACA.CCC.
mouse        aUACACUUCGGAUG-CACC.AAA.GUGa
orc          .AGGUCUUC-GCACGGGCAgCCAcUUC.
column       1   5   10   15   20   25 28

[Figure: example structure (human), drawn with its base pairs and column numbers.]

B. Guide tree (node type and index, followed in parentheses by the alignment column or column pair the node models):

ROOT 0
MATL 1 (2)
MATL 2 (3)
BIF 3
  BEGL 4               BEGR 14
  MATP 5 (4, 14)       MATL 15 (15)
  MATP 6 (5, 13)       MATP 16 (16, 27)
  MATR 7 (12)          MATP 17 (17, 26)
  MATP 8 (6, 11)       MATL 18 (18)
  MATL 9 (7)           MATP 19 (19, 25)
  MATL 10 (8)          MATL 20 (21)
  MATL 11 (9)          MATL 21 (22)
  MATL 12 (10)         MATL 22 (23)
  END 13               END 23

C. States for 3 guide tree nodes:

MATP 16: MP 50, ML 51, MR 52, D 53, IL 54, IR 55
MATP 17: MP 56, ML 57, MR 58, D 59, IL 60, IR 61
MATL 18: ML 62, D 63, IL 64

[Figure: the complete CM state graph for this guide tree, built from S, MP, ML, MR, IL, IR, D, B, and E states.]

CM searches are slow (especially for large RNAs)

model         DP algorithms      direction           complexity

profile HMM   Viterbi, Forward   left to right       O(N^2)

CM            CYK, Inside        inside to outside   O(N^3 log N)
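The "inside to outside" direction in the table can be illustrated with a much simpler algorithm that shares the same fill order. The sketch below is Nussinov-style base-pair maximization, used here only as a stand-in. It is not CYK over CM states and not what Infernal runs. It fills cells for subsequences i..j in order of increasing length d = j-i+1, and its bifurcation step tries every split point k, which is where the extra factor of N comes from. A CM adds a state index v on top of (i, j). The `min_loop` parameter and the pairing rules are assumptions of this toy.

```python
# Allowed pairs in this toy: Watson-Crick plus G-U wobble.
PAIRING = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs, filled from short to long subsequences."""
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j]: maximum number of pairs within seq[i..j] (inclusive).
    best = [[0] * n for _ in range(n)]
    for d in range(2, n + 1):            # subsequence length, inside to outside
        for i in range(n - d + 1):       # subsequence start point
            j = i + d - 1                # subsequence end point
            # Leave i or j unpaired.
            score = max(best[i + 1][j], best[i][j - 1])
            # Pair i with j, keeping at least min_loop residues in between.
            if j - i - 1 >= min_loop and (seq[i], seq[j]) in PAIRING:
                score = max(score, best[i + 1][j - 1] + 1)
            # Bifurcation: split i..j into i..k and k+1..j.
            for k in range(i, j):
                score = max(score, best[i][k] + best[k + 1][j])
            best[i][j] = score
    return best[0][n - 1]


print(max_base_pairs("GGGAAACCC"))
```

Every cell depends only on cells for shorter subsequences, so by the time a cell is filled all of its inputs are already available.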

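CYK and Inside fill in a three-dimensional matrix indexed by (v, i, j):

v: state
i: subsequence start point
j: subsequence end point
d = j-i+1: subsequence length

[Figure: for non-bifurcation states, the cell for state v covers the subsequence i..j. For bifurcation states, v splits i..j at k into i..k and k+1..j, which are handled by the two child states (labeled w and y in the figure).]

Accelerating CM searches: filtering and banded DP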

• Filtering: search database with faster method first; hits above some threshold survive the filter and are searched with the slow CM.

  – tRNAscan-SE*: filters database using tRNA-specific heuristics (1997)
  – Rfam uses BLAST filters (2002-present)
  – Weinberg and Ruzzo† developed HMM filters for faster searches (2004, 2006).
  – Others have also worked on this (Sun and Buhler‡, Zhang and Bafna§)

• Query dependent banding (QDB)¶: restrict subsequence lengths that can align to each state of the model.

• Infernal 1.0: HMM filtering and QDB yield 100-fold acceleration (2009)
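The filtering idea in the first bullet can be sketched in a few lines: score every candidate with a cheap method, and run the expensive method only on survivors. Both scoring functions below are toy stand-ins invented for this example. The "fast" one looks for the conserved UUCG motif, and the "slow" one counts Watson-Crick pairs at the base-paired positions of the toy alignment from earlier slides. Neither is a real filter or a real CM score.

```python
# Base-paired positions (0-based) from the structure line <<<---->->>.
PAIRS = [(0, 10), (1, 9), (2, 7)]
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def fast_score(window):
    """Toy sequence-only filter: 1 if the conserved UUCG motif is present."""
    return 1 if "UUCG" in window else 0


def slow_score(window):
    """Toy structure-aware score: number of Watson-Crick pairs at PAIRS."""
    return sum((window[i], window[j]) in WATSON_CRICK for i, j in PAIRS)


def filtered_search(windows, filter_threshold, report_threshold):
    """Run slow_score only on windows that pass the fast filter."""
    survivors = [w for w in windows if fast_score(w) >= filter_threshold]
    scored = [(w, slow_score(w)) for w in survivors]
    hits = [(w, s) for w, s in scored if s >= report_threshold]
    return hits, len(survivors)


WINDOWS = [
    "GCAUUCGUAGC",  # motif and all three pairs
    "GCAUUCGAAAA",  # motif but no pairs
    "AAAAAAAAAAA",  # neither
    "GGCAAAAGCCC",  # all three pairs but no motif
]
hits, n_survivors = filtered_search(WINDOWS, filter_threshold=1, report_threshold=3)
print(hits, n_survivors)
```

The last window shows the cost of filtering: it would score as well as the first under the slow method, but the sequence-only filter never lets it through. This is the sensitivity versus speed trade-off that filter thresholds control.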

*Lowe T, Eddy S, NAR 25:955-964, 1997

†Weinberg Z, Ruzzo WL. Bioinformatics 22(1):35-39, 2006.

‡Sun Y, Buhler J, Comput. Systems Bioinf., p145-156, 2008.

§Zhang S et al., Bioinformatics. 22(14):e557-e565, 2006.

¶Nawrocki and Eddy, PLoS Comp Bio, 2007

RMARK: an internal RNA homology search benchmark

• RMARK construction - for each of the 503 Rfam 7.0 seed alignments:

  – single-linkage cluster sequences by sequence identity given the alignment
  – look for a training cluster and testing cluster such that no training/test sequence pair is > 60% identical
  – filter test set so no two test seqs > 70% identical
  – 51 families qualify, with 450 test sequences
  – test seqs are embedded in a 10 Mb (20 Mb counting both strands) pseudo-genome of "realistic" base composition (RMARK2)

Example: RF00373 RNaseP_arch seed alignment

[Figure: single-linkage clustering tree of the 40 seed sequences (numbered 0 to 39), with 0.40 and 0.30 marked on the scale.]
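The clustering and filtering steps above can be sketched as follows. This is a minimal illustration, not the RMARK code: the identity definition (matching columns divided by columns where at least one sequence has a residue), the greedy redundancy filter, the function names, and the toy sequences are all assumptions. Single linkage has a property that fits the benchmark: clustering at 60% means no two sequences in different clusters are more than 60% identical, which is the training/test condition.

```python
from itertools import combinations

GAPS = "-."


def identity(a, b):
    """Fractional identity of two aligned sequences (assumed definition)."""
    cols = [(x, y) for x, y in zip(a, b) if x not in GAPS or y not in GAPS]
    if not cols:
        return 0.0
    return sum(x == y for x, y in cols) / len(cols)


def single_linkage(aligned, threshold):
    """Join any two sequences with identity > threshold, using union-find."""
    names = list(aligned)
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for a, b in combinations(names, 2):
        if identity(aligned[a], aligned[b]) > threshold:
            parent[find(a)] = find(b)
    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return sorted(sorted(c) for c in clusters.values())


def filter_redundant(names, aligned, threshold):
    """Greedily keep sequences that are <= threshold identical to all kept ones."""
    kept = []
    for n in names:
        if all(identity(aligned[n], aligned[k]) <= threshold for k in kept):
            kept.append(n)
    return kept


# Hypothetical aligned sequences, chosen so the identities are easy to check.
TOY = {
    "a": "AAAAAAAAAA",
    "b": "AAAAAAAAAC",
    "c": "AAAAAAACCC",
    "d": "GGGGGGGGGG",
    "e": "GGGGGGGGGU",
}
clusters = single_linkage(TOY, threshold=0.60)
test_set = filter_redundant(clusters[0], TOY, threshold=0.70)
print(clusters, test_set)
```

Here one cluster could serve as the training set and the other as the test set. Within the cluster used for testing, the redundancy filter removes sequences that are more than 70% identical to one already kept.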

[Figure: RMARK ROC curves. x-axis: false positives per MB searched per query (0.001 to 1.0). y-axis: sensitivity (fraction of true positives), 0.0 to 1.0. Running times for each method are listed below.]

method                                  running time
Infernal v1.01 (no filters)             1478.4 hours
Infernal v1.01 default (filters)        19.0 hours
RaveNnA 0.2f default (with filters)     43.0 hours
HMM only                                7.4 hours
BLAST-FPS                               0.12 hours

100 to 1000-fold faster profile HMM searches using HMMER3 (Eddy, PLoS Comp Bio, 2011)

• Striped vector parallelization (Farrar, Bioinformatics, 2007)

Infernal 1.1's HMMER3-based cmsearch pipeline

filter stage                 model   hmmer3       infernal 1.1

F1 (MSV/SSV)                 HMM     P=0.02       P=0.35

    windows > 2*W are split up

F2 (Viterbi)                 HMM     P=0.001      P=0.15

F3 (Forward)                 HMM     P=0.00001    P=0.003

F4 (glocal Forward)          HMM                  P=0.003

    overlapping windows are merged

F5 (glocal envelope defn)    HMM                  P=0.003

F6 (HMM banded CYK)          CM                   P=0.0001

F7 (HMM banded Inside)       CM
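To get a feel for what these thresholds mean, the sketch below turns them into rough upper bounds on how many non-homologous windows reach past each stage. It assumes that P-values are well calibrated, so that a random window passes a stage with probability at most that stage's threshold, and it ignores window splitting and merging. The window count of one million is hypothetical, and the function name is illustrative.

```python
# Thresholds copied from the pipeline table.
INFERNAL_1_1 = [
    ("F1 (MSV/SSV)", 0.35),
    ("F2 (Viterbi)", 0.15),
    ("F3 (Forward)", 0.003),
    ("F4 (glocal Forward)", 0.003),
    ("F5 (glocal envelope defn)", 0.003),
    ("F6 (HMM banded CYK)", 0.0001),
]
HMMER3 = [
    ("F1 (MSV/SSV)", 0.02),
    ("F2 (Viterbi)", 0.001),
    ("F3 (Forward)", 0.00001),
]


def survivor_bounds(n_windows, stages):
    """Upper bound on random windows surviving through each stage.

    A window must pass every earlier stage too, so the bound after a
    stage uses the tightest threshold seen so far.
    """
    bounds, tightest = [], 1.0
    for name, p in stages:
        tightest = min(tightest, p)
        bounds.append((name, round(n_windows * tightest)))
    return bounds


N_WINDOWS = 1_000_000  # hypothetical number of candidate windows
for name, bound in survivor_bounds(N_WINDOWS, INFERNAL_1_1):
    print(f"{name:28s} <= {bound}")
```

The Infernal 1.1 thresholds for the early HMM stages are far looser than HMMER3's, which lets more candidates through to the later, structure-aware stages.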

• Glocal HMM envelope definition trims off residues unlikely to be involved in the hit, enabling use of fast HMM banded CM algorithms.

HMM bands accelerate CM alignment

• main idea: eliminate potential alignments the HMM tells us are very improbable

Infernal v1.1 is 100X faster and more sensitive than v1.0.2

[Figure: ROC curves. x-axis: false positives per model (0.01 to 10.00). y-axis: fraction of true positives found (0.0 to 1.0). Running times for each method are listed below.]

method                                     running time
Infernal 1.1 (--max, no filters)           4404 hours
Infernal 1.1                               0.57 hours
Infernal 1.0.2                             63.7 hours
HMMER3 (nhmmer), sequence-only profile     0.02 hours

Tighter filter thresholds for big databases (Z>20Gb) give 5X further acceleration with little impact on sensitivity

[Figure: ROC curves. x-axis: false positives per model (0.01 to 10.00). y-axis: fraction of true positives found (0.0 to 1.0). Running times for each method are listed below.]

method                                          running time
Infernal 1.1 (--max, no filters)                4404 hours
Infernal 1.1, Z=20Mb defaults                   0.57 hours
Infernal 1.1, Z>20Gb defaults                   0.12 hours
HMMER3 (nhmmer --max), sequence-only profile    0.58 hours

• Searched for 8 RNA families in 26 animal genomes plus a choanoflagellate with RNAMotif, ERPIN, BLASTN, and RaveNnA (for non-vertebrates).
  – families: U5, SRP, U3, RNase MRP, let-7, E2, Y, Vault.

• BLASTN does as well or better than other methods in most cases.
• RaveNnA finds some novel candidates missed by BLASTN in some cases, but is very slow.

[Figure: results per family (U5, SRP, U3, MRP, let-7, E2, Y, vault) in columns labeled R, E, B, and N, apparently one per method (RNAMotif, ERPIN, BLASTN, RaveNnA). Menzel et al., Bioinformatics, 2009.]

[Figure: the same panels with an added column I = Infernal 1.1. Modified from Menzel et al., Bioinformatics, 2009.]

Running times on all genomes (47 Gb) in hours

family   BLASTN   RNAMotif    ERPIN   Infernal
U5         0.12       0.44     3.68       8.50
SRP        1.82     120.73     2.14     290.94
U3         0.09      35.78     4.75      14.18
MRP        0.07       9.16     6.41       4.10
let-7      0.10       0.47    26.80       3.79
E2         0.08       0.58     3.16       4.81
Y          0.14       0.86   199.16      11.49
vault      0.08     140.81    87.67       4.32
total      2.52     308.83   333.77     342.12

Infernal 1.1

infernal.janelia.org

• faster homology searches
• no more BLAST filters for Rfam
• better handling of truncated sequences*
• cmscan program: search a sequence against a CM database (e.g. Rfam)

*Kolbe and Eddy, Bioinformatics, 2009

Thank you Elena and Eric!

Acknowledgements: Sean Eddy, Michael Farrar, Travis Wheeler, Diana Kolbe