RNAmmer: Fast two-level HMM prediction of rRNA in prokaryotic genome sequences

Peter F. Hallin

Peter F. Hallin, s971636. February 21st 2005

This report is written as part of the 10p special course scheduled for the 2nd semester of the International Masters program in Bioinformatics at the Center for Biological Sequence Analysis, Technical University of Denmark. The supervisor is David W. Ussery.

Table of contents

Abstract

Introduction
  BLAST - A common approach
  Strategy

Methods
  Model construction
  Homology reduction of dataset
  Alignment conservation
  HMM building

Results and evaluation
  Selectivity and sensitivity
  Accuracy - deviation in start/stop positions
  Length deviations
  Example of annotations

Proof of Concept

Conclusion

Appendices
  Appendix A: Sequence count in phylogenetic group
  Appendix B: Perl implementation of plotcon algorithm
  Appendix C: Model building and post script generation
  Appendix D: Makefile for RNAmmer- parsing search results and calculating accuracy
  Appendix E: 16s rRNA tree of the Genome Atlas Database
  Appendix F: RNAmmer vs. complete search in a 1,6Mb Bacteria
  Appendix G: RNAmmer source code

References

Abstract

A program has been developed which uses Hidden Markov Models (HMMs) to predict rRNA genes in bacterial DNA. The program uses a short 75 bp conserved region of the molecule to build a model for an initial ‘spotter’, and a full-length HMM to model the entire molecule. This avoids scanning entire genome sequences with large and slow models. The program has been implemented in a Makefile, and results have been gathered from a collection of full genome sequences (n=236). The program has proven significantly better than the previous BLAST approach used in the Genome Atlas Database, and predictions on well-annotated genomes suggest selectivity and sensitivity of prediction in the range of 0.993 to 0.999. During the evaluation of the program, a few GenBank files were identified as having rRNA annotated on the wrong strand.

Introduction

The 16s rRNA molecule has been used as a fingerprint, or evolutionary chronometer, of microbial genomes for years. It is essential for comparative genomics to see genomic properties in a phylogenetic context, and mutations in the 16s rRNA are often used to measure the evolutionary distance between sequenced microbial organisms. The quality of such distance estimates depends on alignment quality, but most important is a proper identification of the individual sequences. Since the rRNA molecules are highly conserved, one would expect them to be easily detectable and consistently annotated in all published GenBank files. Researchers feel tempted to use BLAST to identify rRNA genes. Such predictions can lead to problems, as will be discussed later in this report.

The distributions in figure 1 show the three peaks of 5s, 16s and 23s rRNA in all sequenced bacterial genomes. The width of each major peak is likely to reflect a true variation in the lengths of the molecules, as is visible in panel A. Panel B shows only the first 10 counts of the distribution, to highlight the few (but not least important) poorly annotated molecules with lengths of up to 6,000 bp. Although not visible in these histograms, it will be shown later that even the correct strand can be missed during annotation and compilation of the GenBank/EMBL files. Large subunits with a length of around half that of an average small subunit were also identified. These plots are generated in a somewhat crude manner, since they are a simple extract of all annotated features containing the word ‘rRNA’. Later in this report, we compare the lengths of features where predictions overlap with annotations.

[Figure 1 plots: distribution of lengths of rRNA molecules in all sequenced Bacteria, panels A and B; x-axis: Length (0 to 6000); y-axis: Count]

Figure 1: Panel A: Histogram of all rRNA genes in sequenced bacterial genomes. Panel B: First 10 counts.

BLAST - A common approach

Most services for detecting rRNA are based on a BLAST approach. Relying on this method involves several problems:

A. Each match/mismatch of the alignment contributes with the same weight when calculating the score; in other words, it is a linear scoring scheme.
B. The ‘model’ for a BLAST approach is a database of already sequenced rRNA molecules. If sequences that deviate significantly from the content of this database are to be found, the BLAST approach will fail. It relies on the assumption that the database covers all phylogenetic groups.
C. BLAST does not give any extra weight to domains that are highly conserved.
D. The score of a BLAST hit is difficult to interpret. It will be equal to the score/E-value of the most closely related molecule in the database. If, by chance, the database contains a sequence of the same genus and species as that of a given query, then the final score will be good. However, the score/E-value will be poor if no similar genus and species are found in the database. Scoring depends on the diversity of the database.

Strategy

Our strategy was to obtain structurally aligned sequences from publicly available rRNA databases. By doing so, we avoid forcing an artificial nucleotide conservation into the models, as a sequence alignment would. These alignments can be presented directly to software that generates a profile HMM from an alignment. The European Ribosomal RNA database (Jan Wuyts et al. 2002, http://www.psb.ugent.be/rRNA/) contains a comprehensive collection of SSU and LSU rRNA sequences. These sequences are kept in the ‘distribution format’, which includes information from a structural alignment, as shown in figure 2.

acc:X53497 (accession no.)
src:NoData (source)
str:MUCL 29800, ATCC 18804, CBS 562 (strain info.)
ta1:Eukarya (taxonomic info.)
ta2:Fungi (taxonomic info.)
ta3:Eumycota (taxonomic info.)
ta4:Ascomycotina (taxonomic info.)
ta5:Hemiascomycetes (taxonomic info.)
chg:this sequence is not in EMBL (changes other than del with regards to original EMBL entry)
rem:this is just an example (remarks about the entry)
aut:person 1, person 2 (authors)
ttl:The SSU rRNA sequence of a species (title)
jou:Journal name (journal)
dat:1989 (journal year)
vol:12 (journal volume)
pgs:223-229 (journal pages)
mty:SSU (type of RRNA)
del:500 AUG 800 AAAA (deletions made to keep alignment size down: ...)
seq:organism name (organism name)
------UAU[CUGGU]U-----GA[UCCU^GCCAG^UAGU{-C}AUA-UGCU]--[UGUC
]UCAAAG--AU-UAA[GCC{A-}UGC]A-UGUCUA-[A-GU{A-UAA-}GC]A------AUUUAU-AC------A[G-U{-G--AA}AC-U]GCGAA--UGG[C-UC]AUUA
---AAU-[CAG{UU}AU{--CG}U-U{UA--UU}UGA]UAG--UA--CC------UU-AC
-UA[C(U)UG(G)-AU{AACCG-}UGG]UAAU-U[CUA{-GAGCUA}AU(A)-CA(U)G]CUU------AAA-[AUCCC{
G-A}CU]------GUUU------.....*

Figure 2: Example of a database entry in “distribution format”. The format is easy to parse and has been stored in a mySQL table (Updated version: genome.pfh_public.rdb, frozen version: genome.pfh_public.rdb_report)
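Since the format is described as easy to parse, a minimal Python sketch of such a parser might look as follows. It assumes one field per line, a three-character tag followed by a colon, and aligned sequence lines that follow the `seq:` line and end with `*`. The function name `parse_entry` and the shortened sample entry are illustrative and are not taken from the database tools or from the report's own import code.

```python
def parse_entry(text):
    """Parse one 'distribution format' entry into (fields, aligned_sequence).

    Assumptions (not confirmed by the database documentation): every header
    line is '<3-char tag>:<value>', the 'seq:' line carries the organism
    name, and all following lines hold the aligned sequence, ending in '*'.
    """
    fields = {}
    alignment = []
    in_sequence = False
    for line in text.splitlines():
        line = line.rstrip()
        if not line:
            continue
        if in_sequence:
            # Everything after the 'seq:' line is alignment text.
            alignment.append(line)
        elif len(line) > 3 and line[3] == ":":
            tag, value = line[:3], line[4:]
            fields[tag] = value
            in_sequence = (tag == "seq")
    aligned = "".join(alignment).rstrip("*")
    return fields, aligned


# Shortened, made-up entry in the style of figure 2.
example = """acc:X53497
ta1:Eukarya
mty:SSU
seq:organism name
------UAU[CUGGU]U-----GA
[UCCU^GCCAG^UAGU{-C}AUA]*
"""
fields, aligned = parse_entry(example)
print(fields["acc"], fields["mty"], len(aligned))
```

The resulting dictionary maps directly onto table columns such as `mty` and `ta1`, which are the column names used in the queries shown in this report.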

We believe this database is one of the most consistent and well-maintained SSU/LSU databases on the web. It should be noted that it does not contain the complete LSU structure, since the 5s/8s molecule is missing. In this report we chose to label 5s and 8s as ‘TSU’ (tiny subunit); these sequences were downloaded separately from the Institute of Bioorganic Chemistry at the Polish Academy of Sciences (http://biobases.ibch.poznan.pl/5SData/). Alignments from both these locations have been imported into the same database as mentioned above. Statistics from this table are shown in figure 3.

mysql> select mty,length(clr),count(*),length(clr),stddev(length(clr)) from rdb where ta1 not like '%environment%' group by length(clr);
+-----+-------------+----------+-------------+---------------------+
| mty | length(clr) | count(*) | length(clr) | stddev(length(clr)) |
+-----+-------------+----------+-------------+---------------------+
| TSU |         163 |      835 |         163 |              0.0000 |
| SSU |        6561 |    18754 |        6561 |              0.0000 |
| LSU |        9343 |     1321 |        9343 |              0.0000 |
+-----+-------------+----------+-------------+---------------------+
3 rows in set (7.91 sec)

Figure 3: Counting of all entries in the redundant database - checking length deviation. The LSU+SSU part of the database was downloaded on 15th February 2005. The TSU part was supplied by the authors of the 5s/8s database on 31st August 2004.

For the TSUs, the three kingdoms of Bacteria, Archaea and Eukaryotes are included in the database. For SSU and LSU, Bacteria, Archaea, Eukaryotes, Plastids and Mitochondria are included. For all subunits we chose to exclude environmental samples. This database contains sequences that have been structurally aligned by the respective authors; our work performs a homology reduction to remove highly redundant sequences and builds HMMs directly from the sequences retained by the reduction algorithm.

Methods

Model construction

Appendix A presents the number of sequences and the phylogenetic groups from which these sequences were obtained. The phyla listed are those provided by the authors of the database. We have constructed a library of HMMs to cover the individual kingdoms as well as groups of kingdoms. Groups are defined according to table 1.

                                          Kingdoms covered
          HMM model                       TSU          SSU                  LSU
Groups    All kingdoms available (all)    bac,euk,arc  bac,euk,arc,pla,mit  bac,euk,arc,pla,mit
          Prokaryotes (pro)               arc,bac      arc,bac              arc,bac
Kingdoms  Eukaryotes (euk)                †            †                    †
          Bacteria (bac)                  ‡            ‡                    ‡
          Archaea (arc)                   †            †                    †
          Plastids (pla)                  -            †                    †
          Mitochondria (mit)              -            †                    †

Table 1: Models are constructed in the following kingdoms and groups of kingdoms

† These models are built but neither evaluated nor described in this report
‡ These models are built, evaluated and described in this report

The 5s/8s rRNA database does not cover plastids or mitochondria, so only 5 models are created for TSU (all, pro, bac, arc, euk), while 7 models are created for each of SSU and LSU (all, pro, bac, arc, euk, pla, mit).

Homology reduction of dataset

As observed in appendix A, a few phyla are overrepresented in the database (Proteobacteria, Firmicutes). A few genera within these phyla (e.g. Clostridiaea) are also overrepresented. In order not to build models that are ‘overtrained’ to recognize specific phylogenetic groups or genera, a homology reduction was performed. A homology reduction should not be based on annotation or taxonomic association; rather, it should be based on an objective quantitative measure that removes sequences with unwanted similarity. A well-described method for this is the hobohm2 algorithm (Hobohm et al. 1992), which seeks to remove the elements with the most neighbours, as defined by a cutoff. The algorithm requires a list of similarities between all combinations of sequences. The next section describes how these similarities are calculated. It should be noted that the algorithm does not state how the similarity/distance measure should be obtained; it only states which nodes should be removed at a given cutoff of the similarity/distance measure. The hobohm2 algorithm was implemented in a C program by Hans-Henrik Stærfeldt at CBS and compiled for both Linux (IA64) and Irix64 (MIPS) platforms.
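The removal rule described above can be sketched in a few lines of Python. This is an illustrative reimplementation and not the C program used in this work. It assumes that two sequences count as neighbours when their similarity is at or above the cutoff, and that ties between equally connected sequences are broken by name; the function name `hobohm2` and the toy similarity values are hypothetical.

```python
def hobohm2(similarity, cutoff):
    """Return the set of items kept after a Hobohm-2 style reduction.

    similarity: dict mapping frozenset({a, b}) -> similarity score for
    every pair of items. Two items are treated as neighbours when their
    similarity is >= cutoff (an assumption about the direction of the cutoff).
    """
    neighbours = {}
    for pair, score in similarity.items():
        a, b = tuple(pair)
        neighbours.setdefault(a, set())
        neighbours.setdefault(b, set())
        if score >= cutoff:
            neighbours[a].add(b)
            neighbours[b].add(a)

    while True:
        # Pick the item with the most remaining neighbours (ties: by name).
        worst = max(sorted(neighbours), key=lambda k: len(neighbours[k]), default=None)
        if worst is None or not neighbours[worst]:
            break  # nobody has a neighbour left
        for other in neighbours.pop(worst):
            neighbours[other].discard(worst)
    return set(neighbours)


# Toy example: A is similar to both B and C, D is unlike everything.
sim = {
    frozenset("AB"): 5.0, frozenset("AC"): 4.9, frozenset("BC"): 3.0,
    frozenset("AD"): 1.0, frozenset("BD"): 1.0, frozenset("CD"): 1.0,
}
kept = hobohm2(sim, cutoff=4.85)
print(sorted(kept))
```

Removing the single best-connected item (A) is enough here, which is the point of the rule: it tends to keep more sequences than removing one member of each similar pair in arbitrary order.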

Similarity between sequences

Using the substitution matrix S, as described below:

(1)

- we measure the level of conservation between any two sequences within the three different molecule types, by the following formula:

(2)

- where i and j are nucleotides of the matrix (1) and nij is the number of occurrences of the i-j combination in the alignment. The number of occurrences of gap-gap combinations is denoted nGG and each such occurrence scores zero.
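To make the counting of nij and nGG concrete, the following Python sketch computes a similarity of this general kind for two aligned sequences. Both the scoring function and the normalisation are assumptions made for illustration: `toy_score` is a hypothetical stand-in for the matrix S, and dividing by the number of columns that are not gap-gap is one plausible normalisation, not necessarily the one in formula (2).

```python
from collections import Counter

GAP = "-"


def toy_score(i, j):
    """Hypothetical stand-in for S: 1 for identical nucleotides, 0 otherwise."""
    return 1.0 if i == j and i != GAP else 0.0


def pair_similarity(seq_a, seq_b, score=toy_score):
    """Weighted sum of column-pair counts, ignoring gap-gap columns."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    counts = Counter(zip(seq_a, seq_b))   # n_ij for every column pair (i, j)
    n_gg = counts.pop((GAP, GAP), 0)      # gap-gap columns score zero
    informative = len(seq_a) - n_gg       # assumed normalisation
    if informative == 0:
        return 0.0
    total = sum(n * score(i, j) for (i, j), n in counts.items())
    return total / informative


# Six columns, two of them gap-gap, three identical pairs, one mismatch.
print(pair_similarity("AC-GU-", "AC-GA-"))
```

Because both sequences come from the same structural alignment, the columns can be compared directly without any realignment.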

The following Perl code implements this function using $cpus parallel processes, where $matrix contains S:

(3)

The code above will produce an output file for every entry in the database, and each file will contain the similarity calculation for every other entry in the database. It does so using parallel processing, and the results are concatenated so that we end up with three files, one for every molecule type: TSU, SSU and LSU. Using the hobohm2 algorithm, three individual homology reductions are performed, based on the similarity indexes of each molecule type. Initially, we perform a number of reductions using different cutoff values from 0 to 6 in 0.05 increments. This produces a plot which describes the size and composition of the resulting data sets when choosing different cutoff values (figure 4).

[Figure 4 plots: panels A, B and C]

Figure 4: Dataset composition. The red curve shows the relative number of nodes within the data set after homology reduction as a function of the cutoff value for similarity. The blue curve shows the number of distinct phyla within the reduced data set. The green curve shows the average number of nodes within each phylum. No phylum information could be obtained for the TSU data set.

The overall criterion for the reduction is that the number of distinct phyla should remain the same while sequences that are too similar are removed. Unfortunately, no phylum information was available for the TSUs, and no obvious drop in sequence count was observed that would indicate the removal of overrepresented sequences. A cutoff value was therefore chosen which kept about 80% of the data set. For the LSUs there is a visible drop in sequence count around 4.95, which indicates that many sequences have this (or better) similarity. For the SSUs there was no obvious drop in sequence count of the kind seen for the LSUs.

Type      Before reduction   After reduction   Cut off   % reduction
5s/8s     835                691               4.80      17.2%
16s/18s   18,754             16,694            4.85      11.0%
23s/28s   1,321              1,089             4.90      17.6%

Table 2: Composition of data set before and after homology reduction

All analyses and model construction from this point on are based on the reduced data set just described, using individual cutoff values for the different molecule types.

Alignment conservation

It is commonly understood that a purely sequence-based prediction of rRNA will not be successful, since the model fails to account for structural folding and the interaction of different loci of the sequence. This is true; however, our hypothesis is that this conservation is also detectable at the nucleotide level. We will show that there is a large degree of sequence conservation within these sequences, but it is important to note that we do not derive this conservation from any sequence alignment, since that would not be a true test of our hypothesis. Therefore the structural alignment is used directly.

For every position in the alignment we calculate the conservation in a way similar to the EMBOSS plotcon algorithm (the implementation can be found in appendix B):

(4)

All similarities have been smoothed with a 75-character box filter, and the data have been plotted using the R statistical package. The 75-character window with the highest level of conservation is marked with a red dot in figures 5 through 7, and these regions are the targets for generating the initial ‘spotter’ HMM. The entire alignment is used for the final HMM, except loci where only gaps are present.
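The smoothing and window selection described above can be sketched as follows. The per-column score used here, the fraction of sequence pairs that agree at a column, is a simplified stand-in and not the plotcon-style measure of formula (4); the function names are hypothetical. The report uses a window of 75 columns, while the toy example uses 3 to stay readable.

```python
def column_conservation(alignment):
    """Fraction of sequence pairs that agree at each column (gaps never agree).

    A simple stand-in for the plotcon-style score used in the report.
    """
    n = len(alignment)
    pairs = n * (n - 1) / 2
    scores = []
    for column in zip(*alignment):
        counts = {}
        for ch in column:
            if ch != "-":
                counts[ch] = counts.get(ch, 0) + 1
        agreeing = sum(c * (c - 1) / 2 for c in counts.values())
        scores.append(agreeing / pairs if pairs else 0.0)
    return scores


def best_window(scores, width):
    """Return (start, mean) of the window of `width` columns with the highest mean."""
    if width > len(scores):
        raise ValueError("window wider than alignment")
    total = sum(scores[:width])
    best_start, best_total = 0, total
    for start in range(1, len(scores) - width + 1):
        # Slide the box filter one column to the right.
        total += scores[start + width - 1] - scores[start - 1]
        if total > best_total:
            best_start, best_total = start, total
    return best_start, best_total / width


toy_alignment = ["ACGUACGU", "ACGUAC-U", "UCGUACGA"]
scores = column_conservation(toy_alignment)
start, mean = best_window(scores, width=3)
print(start, round(mean, 3))
```

The columns from `start` to `start + width - 1` would then be cut out of the alignment to build the initial ‘spotter’ model.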

[Figure 5 plots: panels A, B and C]

Figure 5: TSU (5s/8s) conservation maps. The gray curve shows the position-based calculation of similarity based on the plotcon algorithm. The black curve shows the 75-character smoothed value and the red dot shows the highest of these values.

[Figure 6 plots: panels A, B and C]

Figure 6: SSU (16s/18s) conservation maps. The gray curve shows the position-based calculation of similarity based on the plotcon algorithm. The black curve shows the 75-character smoothed value and the red dot shows the highest of these values.

[Figure 7 plots: panels A, B and C]

Figure 7: LSU (23s/28s) conservation maps. The gray curve shows the position-based calculation of similarity based on the plotcon algorithm. The black curve shows the 75-character smoothed value and the red dot shows the highest of these values.

HMM building

The HMMER package (footnote 1), version 2.3.2, was downloaded as source code and compiled for the Altix platform. Initial as well as final models were created using hmmbuild and hmmcalibrate, as documented in the Makefile for this project (appendix C). The models can be obtained from the CBS filesystem at:

/home/projects/pfh/projects/2004-08-20_NewOsloWorkMeeting /modellib2/_.rnammer.

- where the first part of the file name is either all, pro, mit, pla, bac, euk or arc, the second part is either tsu, ssu or lsu, and the extension is either hmm or initial.hmm.

The program RNAmmer works by scanning a sequence on both strands using the initial spotter model. This occupies two simultaneous processors, and for each hit that is detected by the initial model, a new process is started using a region of +/- 3,000 bp around the location of the initial hit. This region is scanned using the final HMM that models the entire molecule. The program is listed in appendix G, whereas the Makefile for parsing results is listed in appendix D.
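The window selection step of this two-level search can be illustrated with a short Python sketch. It only computes the regions that would be handed to the full-length model and does not call HMMER. The coordinate convention (1-based, inclusive), the clipping at the sequence ends and the name `hit_windows` are assumptions made for the example; the actual implementation is the one listed in appendix G.

```python
def hit_windows(hits, genome_length, flank=3000):
    """Return one (start, end) search window per spotter hit.

    hits: list of (start, end) positions of initial spotter hits,
    1-based and inclusive (an assumed convention). Each window extends
    `flank` bases to either side and is clipped to the sequence ends.
    """
    return [
        (max(1, start - flank), min(genome_length, end + flank))
        for start, end in hits
    ]


# Two hypothetical 75 bp spotter hits in a 10 kb sequence.
windows = hit_windows([(500, 574), (8000, 8074)], genome_length=10_000)
print(windows)
```

Each window is small compared to the genome, which is why the slow full-length model only has to scan a few thousand bases per hit.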

[Figure 8 diagram: the genome sequence is first searched with the initial spotter on both strands simultaneously; the final full-length search is then done simultaneously on +/-3000bp windows around all hits found by the initial model]

Figure 8: Approach for initial spotting and final detection

On an SGI Altix system based on 900MHz Intel Itanium2 processors, it takes around 32 minutes to complete hmmsearch on both strands of a 1.64 Mb bacterial genome using a full-length model (on one CPU). The same result is achieved in 23 seconds with RNAmmer, using 2 CPUs for the initial models and 3 CPUs for the final models. Speed measurements are shown in appendix F.

Footnote 1: http://hmmer.wustl.edu/

Results and evaluation

Selectivity and sensitivity

The classification scheme we chose to implement measures selectivity and sensitivity at the genome level, based on calculations of true/false positives/negatives as properties of each nucleotide in the genome. This enables comparison of prediction accuracy between different species or different phyla. We define four variables TP, TN, FP and FN for each base pair as follows:

TP: RNAmmer predicts an rRNA sequence at the nucleotide and the annotation confirms the prediction
TN: RNAmmer does not predict the nucleotide to be an rRNA and the annotation agrees
FP: RNAmmer predicts an rRNA nucleotide but the annotation describes it as being part of a non-rRNA sequence
FN: RNAmmer does not predict an rRNA but the annotation describes it as being a part of an rRNA

Where: and (5)

Schematically, the classification for a particular region would look like this:

[Figure 9 schematic: rows labelled Ann., Pred., TP, TN, FP and FN]

Figure 9: Binary classification is done on the level of single nucleotides. The red feature is an annotated rRNA and the blue feature is a prediction. Gray bars correspond to the classification.
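The per-nucleotide classification can be sketched in Python as below, for one strand at a time. The two ratios are assumptions: sensitivity is taken as TP/(TP+FN) and selectivity as TP/(TP+FP), which is the common usage in gene prediction and is consistent with both measures being zero when no predicted nucleotide coincides with an annotated one, but it is not copied from formula (5). All function names are hypothetical.

```python
def covered(features):
    """Set of 1-based positions covered by inclusive (start, end) features."""
    positions = set()
    for start, end in features:
        positions.update(range(start, end + 1))
    return positions


def classify(predicted, annotated, genome_length):
    """Count TP, TN, FP and FN nucleotides on one strand."""
    pred, ann = covered(predicted), covered(annotated)
    tp = len(pred & ann)
    fp = len(pred - ann)
    fn = len(ann - pred)
    tn = genome_length - tp - fp - fn
    return tp, tn, fp, fn


def sensitivity(tp, fn):
    # Assumed definition: share of annotated rRNA bases that were predicted.
    return tp / (tp + fn) if tp + fn else float("nan")


def selectivity(tp, fp):
    # Assumed definition: share of predicted rRNA bases that are annotated.
    return tp / (tp + fp) if tp + fp else float("nan")


# Toy strand of 100 bases: the prediction is shifted 5 bases downstream.
tp, tn, fp, fn = classify([(16, 35)], [(11, 30)], genome_length=100)
print(tp, tn, fp, fn, sensitivity(tp, fn), selectivity(tp, fp))
```

Building position sets is wasteful for whole genomes but keeps the definitions of the four classes explicit; an interval-based version would give the same counts.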

We have extracted annotated rRNA features from 192 bacterial segments and calculated selectivity and sensitivity as described above. This number is not equal to the number of segments in the database because not all GenBank files have rRNA sequences annotated, which makes the sensitivity and selectivity meaningless for those files. In figure 10 the selectivity and sensitivity are plotted for each phylum as box-and-whisker plots and for all Bacteria as histograms. It is clear that the two phyla Spirochetes and Deinococcus-Thermus show large variation.

[Figure 10 plots: panels A, B, C and D]

Figure 10: Histograms and Box-Whiskers plot for Bacterial rRNA predictions. For the organisms where annotations are provided, we have calculated sensitivity and selectivity based on (5). Panel A: Selectivity within phyla. Panel B: Sensitivity within phyla. Panel C: Selectivity in all bacteria. Panel D: Sensitivity in all bacteria

In table 3, the values for all Bacteria are shown, suggesting that both selectivity and sensitivity are acceptable.

Measure                          Performance
Average sensitivity of genomes   0.958
10% fractile of sensitivities    0.999
50% fractile of sensitivities    0.997
90% fractile of sensitivities    0.879
Average selectivity of genomes   0.951
10% fractile of selectivities    1.000
50% fractile of selectivities    0.993
90% fractile of selectivities    0.919

Table 3: Average and fractiles of selectivity and sensitivity

Accuracy - deviation in start/stop positions

Another important property to measure is the deviation between the start/stop positions of the predictions and those of the annotations. For all sequences where an annotation overlaps a prediction by 10 or more bases, we measure the difference in start and stop positions:

and (6)

Where t denotes the molecule type (TSU, SSU or LSU) and d denotes the strand (+1 for the top strand and -1 for the bottom strand). ‘Start’ and ‘stop’ always refer to the transcriptional direction. Again, the only reference is the annotations, which do not necessarily reflect the truth.

Measured deviation      Value
Start positions (TSU)   -4 bp
Stop positions (TSU)    1 bp
Start positions (SSU)   -6 bp
Stop positions (SSU)    0 bp
Start positions (LSU)   0 bp
Stop positions (LSU)    1 bp

Table 4: Median deviations in start/stop positions of prediction relative to annotation
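The overlap criterion and the strand-aware deviation can be sketched in Python as follows. The sign convention is an assumption and not taken from formula (6): deviations are prediction minus annotation, measured along the direction of transcription, so a negative start deviation means the prediction starts upstream of the annotation. The coordinates in the example are those of the 16s features of Agrobacterium tumefaciens C58 listed later in this report; the function names are hypothetical.

```python
def overlap(a, b):
    """Number of bases shared by two inclusive (left, right) intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def deviations(pred, ann, strand):
    """Return (start_dev, stop_dev) of a prediction relative to an annotation.

    pred, ann: (left, right) genomic coordinates with left <= right.
    strand: +1 for the top strand, -1 for the bottom strand.
    Deviations are measured along the direction of transcription
    (assumed convention), so on the bottom strand the start is the
    right-hand coordinate and the sign is flipped.
    """
    if strand == +1:
        return pred[0] - ann[0], pred[1] - ann[1]
    return ann[1] - pred[1], ann[0] - pred[0]


MIN_OVERLAP = 10  # bases, as stated in the text

pred_top, ann_top = (56547, 58023), (56567, 58003)
if overlap(pred_top, ann_top) >= MIN_OVERLAP:
    print(deviations(pred_top, ann_top, +1))

pred_bottom, ann_bottom = (2516941, 2518417), (2516962, 2518398)
if overlap(pred_bottom, ann_bottom) >= MIN_OVERLAP:
    print(deviations(pred_bottom, ann_bottom, -1))
```

The medians reported in table 4 would then be taken over such values for each molecule type and end; a median is used because a few badly annotated features produce very large deviations.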

Initially these numbers look acceptable, but calculating averages gives a different picture. The averages are not included, since they are dominated by outliers that are likely to be faulty annotations. The box-and-whisker plots in figure 11 confirm the number of outliers and the medians. These outliers are described in the next section, “Length deviations”.

[Figure 11 plots: panels A and B]

Figure 11: Box-and-whisker plots of start/stop position deviation between predictions and annotations. There is only small variation within the large subunits. For all except the large subunits, the deviation in stop positions appears to be greater than for the start positions. The small subunits display the biggest variation, and it appears that stop positions are harder to map accurately. The reason for this may be noisy sequences: not all molecules have been fully sequenced, and the remaining unknown bases are replaced with N’s by the authors (Jan Wuyts et al., 2002).

Length deviations

From the same data set as was used in the previous section, we extracted all features that overlapped by 10 or more base pairs. It is interesting to see whether we can produce a length histogram which displays less noise than that seen in figure 1 in the introduction. With the prediction-to-annotation mapping that was created by looking at overlapping bases, we can put each annotated rRNA feature into its category (molecule type): we take all the annotated rRNA features and, when one overlaps with a prediction, label the annotation with the predicted type (TSU, SSU, LSU). This confirms that the authors of these GenBank files actually map features to the correct regions, but fail either to get the molecule type right or to find the right start and stop positions. The length distributions of annotated features are shown in figure 12 (panels A and B) and those of the predicted features in panel C.

[Figure 12 plots: panels A, B and C]

Figure 12: Length distributions of rRNA which have been confirmed and typed by RNAmmer. Panel A shows all counts (0..350), whereas panel B shows the first 30 counts of annotated rRNA. Green bars are typed by RNAmmer as 5s, blue as 16s and red as 23s. 5s and 16s features are found that are longer than any 23s, and 23s features are found that lie between 5s and 16s; clearly some errors have occurred during annotation. Panel C shows the length distributions of predicted rRNA. The narrow shape of the 5s bar is caused by R’s ‘hist’ function, which cannot draw boxes wider than the difference between the minimum and maximum value in a data set. This in itself indicates a very small length variation in 5s predictions.

Example of annotations

As figure 10 in the previous section shows, the Spirochetes show large variation in prediction selectivity and sensitivity. A closer look at the annotations reveals interesting results. The selectivity and sensitivity are both 0.0000 for the Borrelia burgdorferi B31 genome, which means that not a single annotated rRNA nucleotide is predicted as being rRNA. However, comparing the annotation and the prediction reveals that all rRNAs are predicted on the opposite strand to the annotation. Who’s right?

Predictions by RNAmmer and tRNAscan    Annotations (AE000783)

rRNA  435199  435308  -  5s            rRNA  435201  435312  +  5s (!)
rRNA  435334  438262  -  23s           rRNA  435334  438267  +  23s (!)
rRNA  438444  438553  -  5s            rRNA  438446  438557  +  5s (!)
rRNA  438579  441507  -  23s           rRNA  438590  441508  +  23s (!)
                                       CDS   441713  442552  -
                                       CDS   442688  443248  -
tRNA  443623  443696  +  Ile           tRNA  443623  443696  +  Ile
                                       CDS   443763  444053  -
                                       CDS   444150  444299  +
tRNA  444336  444409  -  Ala           tRNA  444336  444409  -  Ala
rRNA  444581  446108  -  16s           rRNA  444581  446118  +  16s (!)

Table 5: Predicted and annotated rRNAs are present on opposite strands in the Borrelia burgdorferi B31 genome

According to the genome sequencing paper, the operon consists of 16s rRNA, tRNA(Ala), tRNA(Ile), 23s rRNA, 5s rRNA, 23s rRNA and 5s rRNA, all of which are located in the same orientation except for tRNA(Ile). According to their annotation the Ile-tRNA is also located in the positive direction. Our tRNAscan prediction also supports this. The rRNA operon contains four genes between the 5s/23s cluster and the 16s gene: “Four unrelated genes, encoding 3-methyladenine glycosylase, hydrolyase and two with no database match, are also present in the rRNA operon. Three of these genes are transcribed in the same direction as the rRNAs” (Fraser, C.M. et al. 1997). This clearly indicates that the authors found the rRNA on the right strand, but that an error is present in both the GenBank and the NCBI files, AE000783 and NC_001318 respectively:

Figure 13: Atlas visualization of rRNA locations. Light cyan represents annotated features and dark cyan represents the predicted features.

Other such examples were found in the Genome Atlas Database, where predictions and annotations overlap but are found on different strands (Bpseudomallei_K96243_1, Bgarinii_PBi_Main etc.). Another example of a possible classification failure is that of Agrobacterium tumefaciens C58 chromosome I. Only 16s rRNA is annotated for this organism, which prevents the sensitivity and selectivity from being calculated properly:

Predictions by RNAmmer                 Annotations (AE007869)

gene  56547    58023    +  16s         rRNA  56567    58003    +  rRNA
gene  59319    62140    +  23s
gene  62410    62523    +  5s
gene  2512442  2512555  -  5s
gene  2512825  2515646  -  23s
gene  2516941  2518417  -  16s         rRNA  2516962  2518398  -  rRNA

Table 6: Only SSUs are annotated in Agrobacterium tumefaciens C58

Proof of Concept

The following few lines of GNU make code produce the file 16s.names.ph, a neighbour-joining tree based on all predicted 16s rRNAs of all bacterial genomes described in the CBS Genome Atlas Database (Hallin and Ussery 2005), aligned with clustalw. Although this tree is created with no manual inspection, there are no obvious misaligned outliers. The tree is shown in appendix E.

Figure 14: Make rule to construct rRNA tree

Conclusion

Since the RNAmmer approach specifically searches each strand using a profile HMM, it is not likely that a true rRNA would be detected on the wrong strand by the model. We dare to conclude that such contradictions are the result of faulty annotations by the authors. However, we do not believe that our approach finds the exact start and stop positions of a given rRNA molecule. For comparative and statistical purposes, our assumption is that one is better off with consistent annotations that are wrong by 5-10 bp than with annotations that may be on the wrong strand, or with obsolete rRNA annotations. The results of RNAmmer predictions can be used directly to produce alignments and phylogenetic trees with limited manual inspection. Having consistent and accurate rRNA predictions and alignments will enable the comparison of a large number of genomic properties, and hence seeing such properties in their true phylogenetic context.

Appendices

Appendix A: Sequence count in phylogenetic group

mysql> select mty,phyla,count(*) from rdb group by phyla,mty order by mty,phyla;
+------+------+------+
| mty | phyla | count(*) |
+------+------+------+
| lsu | | 729 |
| lsu | Ap Crenarchaeota | 16 |
| lsu | Ap Euryarchaeota | 21 |
| lsu | Bp Actinobacteria | 37 |
| lsu | Bp Aquificae | 2 |
| lsu | Bp Bacteroidetes | 3 |
| lsu | Bp Chlamydiae | 17 |
| lsu | Bp Cyanobacteria | 3 |
| lsu | Bp Deinococcus-Thermus | 1 |
| lsu | Bp Fibrobacteres | 1 |
| lsu | Bp Firmicutes | 104 |
| lsu | Bp Planctomycetes | 1 |
| lsu | Bp Proteobacteria | 214 |
| lsu | Bp Spirochaetes | 13 |
| lsu | Bp Thermotogae | 2 |
| lsu | Ec Alveolata Dinophyceae | 1 |
| lsu | Ec Cryptophyta | 3 |
| lsu | Ec Granuloreticulosea | 1 |
| lsu | Ec Rhodophyta Florideophyceae | 18 |
| lsu | Ec stramenopiles Chrysophyceae | 1 |
| lsu | Ec stramenopiles | 1 |
| lsu | Eo Euglenozoa Kinetoplastida | 3 |
| lsu | Eo Mycetozoa Dictyosteliida | 1 |
| lsu | Ep Alveolata Apicomplexa | 15 |
| lsu | Ep Euglenozoa Euglenida | 1 |
| lsu | Ep Fungi Ascomycota | 14 |
| lsu | Ep Fungi Basidiomycota | 3 |
| lsu | Ep Fungi Chytridiomycota | 1 |
| lsu | Ep Fungi Microsporidia | 6 |
| lsu | Ep Fungi Zygomycota | 5 |
| lsu | Ep Metazoa Arthropoda | 4 |
| lsu | Ep Metazoa Chordata | 23 |
| lsu | Ep Metazoa Nematoda | 1 |
| lsu | Ep Metazoa Platyhelminthes | 1 |
| lsu | Ep stramenopiles Bacillariophyta | 1 |
| lsu | Ep stramenopiles Eustigmatophyceae | 1 |
| lsu | Ep stramenopiles Phaeophyceae | 1 |
| lsu | Ep stramenopiles Xanthophyceae | 1 |
| lsu | Ep Viridiplantae Chlorophyta | 1 |
| lsu | Ep Viridiplantae Streptophyta | 36 |
| lsu | Eu Alveolata Ciliophora | 4 |
| lsu | Eu Cercozoa | 1 |
| lsu | Eu Diplomonadida_group | 4 |
| lsu | Eu Entamoebidae | 1 |
| lsu | Eu Mycetozoa Myxogastria | 2 |
| lsu | Eu stramenopiles | 1 |
| ssu | | 3008 |
| ssu | Ap Crenarchaeota | 56 |
| ssu | Ap Euryarchaeota | 293 |
| ssu | Ap Korarchaeota | 1 |
| ssu | Bp Acidobacteria | 3 |
| ssu | Bp Actinobacteria | 2099 |
| ssu | Bp Aquificae | 14 |
| ssu | Bp Bacteroidetes | 334 |
| ssu | Bp Chlamydiae | 107 |
| ssu | Bp Chlorobi | 28 |
| ssu | Bp Chloroflexi | 9 |
| ssu | Bp Chrysiogenetes | 1 |
| ssu | Bp Cyanobacteria | 294 |
| ssu | Bp Deferribacteres | 11 |
| ssu | Bp Deinococcus-Thermus | 51 |
| ssu | Bp Fibrobacteres | 17 |
| ssu | Bp Firmicutes | 2438 |
| ssu | Bp Fusobacteria | 38 |
| ssu | Bp Nitrospirae | 31 |
| ssu | Bp Planctomycetes | 78 |
| ssu | Bp Proteobacteria | 4644 |
| ssu | Bp Spirochaetes | 306 |
| ssu | Bp Thermodesulfobacteria | 2 |
| ssu | Bp Thermotogae | 25 |
| ssu | Bp Verrucomicrobia | 11 |
| ssu | Ec Acantharea | 6 |
| ssu | Ec Alveolata Dinophyceae | 102 |
| ssu | Ec Cryptophyta | 42 |
| ssu | Ec Glaucocystophyceae | 6 |
| ssu | Ec Granuloreticulosea | 3 |
| ssu | Ec Heterolobosea | 8 |
| ssu | Ec Lobosea | 6 |
| ssu | Ec Parabasalidea Hypermastigia | 20 |
| ssu | Ec Parabasalidea Trichomonada | 29 |
| ssu | Ec Polycystinea | 10 |
| ssu | Ec Rhodophyta Bangiophyceae | 66 |
| ssu | Ec Rhodophyta Florideophyceae | 265 |
| ssu | Ec stramenopiles | 1 |
| ssu | Ec stramenopiles Chrysophyceae | 62 |
| ssu | Ec stramenopiles Dictyochophyceae | 6 |
| ssu | Ec stramenopiles Hyphochytriomycetes | 3 |
| ssu | Ec stramenopiles Pelagophyceae | 14 |
| ssu | Ec stramenopiles Placididea | 1 |
| ssu | Ec stramenopiles Raphidophyceae | 4 |
| ssu | Ef Apusomonadidae | 1 |
| ssu | Ef stramenopiles Oikomonadaceae | 1 |
| ssu | Eg stramenopiles | 5 |
| ssu | Eg stramenopiles Developayella | 1 |
| ssu | Eo Euglenozoa Diplonemida | 2 |
| ssu | Eo Euglenozoa Kinetoplastida | 124 |
| ssu | Eo Mycetozoa Dictyosteliida | 2 |
| ssu | Eo stramenopiles | 7 |
| ssu | Eo stramenopiles Labyrinthulida | 10 |
| ssu | Eo stramenopiles Slopalinida | 3 |
| ssu | Ep Alveolata Apicomplexa | 243 |
| ssu | Ep Euglenozoa Euglenida | 26 |
| ssu | Ep Fungi Ascomycota | 1003 |
| ssu | Ep Fungi Basidiomycota | 424 |
| ssu | Ep Fungi Chytridiomycota | 16 |
| ssu | Ep Fungi Glomeromycota | 36 |
| ssu | Ep Fungi Microsporidia | 105 |
| ssu | Ep Fungi Zygomycota | 73 |
| ssu | Ep Haplosporidia | 8 |
| ssu | Ep Metazoa | 1 |
| ssu | Ep Metazoa Acanthocephala | 21 |
| ssu | Ep Metazoa Annelida | 99 |
| ssu | Ep Metazoa Arthropoda | 662 |
| ssu | Ep Metazoa Brachiopoda | 44 |
| ssu | Ep Metazoa Bryozoa | 2 |
| ssu | Ep Metazoa Chaetognatha | 3 |
| ssu | Ep Metazoa Chordata | 159 |
| ssu | Ep Metazoa Cnidaria | 78 |
| ssu | Ep Metazoa Ctenophora | 3 |
| ssu | Ep Metazoa Cycliophora | 1 |
| ssu | Ep Metazoa Echinodermata | 47 |
| ssu | Ep Metazoa Echiura | 1 |
| ssu | Ep Metazoa Entoprocta | 3 |
| ssu | Ep Metazoa Gastrotricha | 3 |
| ssu | Ep Metazoa Gnathostomulida | 1 |
| ssu | Ep Metazoa Hemichordata | 7 |
| ssu | Ep Metazoa Kinorhyncha | 1 |
| ssu | Ep Metazoa Mollusca | 145 |
| ssu | Ep Metazoa Myxozoa | 38 |
| ssu | Ep Metazoa Myzostomida | 6 |
| ssu | Ep Metazoa Nematoda | 172 |
| ssu | Ep Metazoa Nematomorpha | 5 |
| ssu | Ep Metazoa Nemertea | 2 |
| ssu | Ep Metazoa Onychophora | 2 |
| ssu | Ep Metazoa Orthonectida | 2 |
| ssu | Ep Metazoa Placozoa | 2 |
| ssu | Ep Metazoa Platyhelminthes | 211 |
| ssu | Ep Metazoa Pogonophora | 2 |
| ssu | Ep Metazoa Porifera | 20 |
| ssu | Ep Metazoa Priapulida | 3 |
| ssu | Ep Metazoa Rhombozoa | 3 |
| ssu | Ep Metazoa Rotifera | 7 |
| ssu | Ep Metazoa Sipuncula | 1 |
| ssu | Ep Metazoa Tardigrada | 7 |
| ssu | Ep stramenopiles Bacillariophyta | 45 |
| ssu | Ep stramenopiles | 7 |
| ssu | Ep stramenopiles Eustigmatophyceae | 32 |
| ssu | Ep stramenopiles Phaeophyceae | 35 |
| ssu | Ep stramenopiles Xanthophyceae | 12 |
| ssu | Ep Viridiplantae Chlorophyta | 305 |
| ssu | Ep Viridiplantae Streptophyta | 1219 |
| ssu | Eu Acanthamoebidae | 58 |
| ssu | Eu Alveolata Ciliophora | 117 |
| ssu | Eu Alveolata Perkinsea | 5 |
| ssu | Eu Cercozoa | 39 |
| ssu | Eu Diplomonadida_group | 22 |
| ssu | Eu Entamoebidae | 6 |
| ssu | Eu Fungi | 1 |
| ssu | Eu Fungi/Metazoa_group | 18 |
| ssu | Eu Haptophyceae | 57 |
| ssu | Eu Mycetozoa Myxogastria | 2 |
| ssu | Eu Parabasalidea | 40 |
| ssu | Eu Pelobiontida | 1 |
| ssu | Eu Plasmodiophorida | 2 |
| ssu | Eu stramenopiles | 4 |
| ssu | Eu stramenopiles Oomycetes | 17 |
| ssu | Eu unclassified eukaryotes | 1 |
| TSU | | 835 |
+------+------+------+
168 rows in set (19.12 sec)

Appendix B: Perl implementation of plotcon algorithm

Appendix C: Model building and post script generation

Appendix D: Makefile for RNAmmer- parsing search results and calculating accuracy

Appendix E: 16s rRNA tree of the Genome Atlas Database

[Tree figure; leaf labels as extracted, one per line]

Phytoplasma asteris OY BFirm MA
Mycoplasma pulmonis UAB CTIP BFirm MM
Mycoplasma mobile 163K BFirm MM
Mycoplasma hyopneumoniae 232 BFirm MM
Mycoplasma mycoides SC BFirm MM
Mesoplasma florum L1 BFirm ME
Mycoplasma gallisepticum R BFirm MM
Mycoplasma pneumoniae M129 BFirm MM
Mycoplasma genitalium G37 BFirm MM
Mycoplasma penetrans HF2 BFirm MM
Ureaplasma urealyticum serovar3 BFirm MM
Fusobacterium nucleatum ATCC25586 BFuso FF
Rhodopirellula baltica strain1 BPPP
Campylobacter jejuni RM1221 BProt EC
Campylobacter jejuni NCTC11168 BProt EC
Wolinella succinogenes DSMZ1740 BProt EC
Helicobacter hepaticus ATCC51449 BProt EC
Helicobacter pylori 26695 BProt EC
Helicobacter pylori J99 BProt EC
Desulfovibrio vulgaris Hildenborough BProt DD
Desulfotalea psychrophila LSv54 BProt DD
Bacteriovorax marinus SJ BProt DB
Geobacter sulfurreducens PCA BProt DD
Bdellovibrio bacteriovorus HD100 BProt DB
Gluconobacter oxydans 621H BProt AR
Rickettsia conorii Malish7 BProt AR
Rickettsia typhi Wilmington BProt AR
Rickettsia prowazekii Madrid-E BProt AR
Anaplasma marginale StMaries BProt AR
Ehrlichia ruminantium Gardel BProt AR
Ehrlichia ruminantium Welgevonden-1 BProt AR
Ehrlichia ruminantium Welgevonden-2 BProt AR
Wolbachia endosymbiont TRS BProt AR
Wolbachia pipientis wMel BProt AR
Caulobacter cresentus CB15 BProt AC
Zymomonas mobilis ZM4 BProt AS
Silicibacter pomeroyi DSS3 BProt AR
Rhodopseudomonas palustris CGA009 BProt AR
Bradyrhizobium japonicum USDA110 BProt AR
Agrobacterium tumefaciens C58d BProt AR
Agrobacterium tumefaciens C58d BProt AR
Agrobacterium tumefaciens C58 BProt AR
Agrobacterium tumefaciens C58 BProt AR
Mesorhizobium loti MAFF303099 BProt AR
Bartonella quintana Toulouse BProt AR
Bartonella henselae Houston-1 BProt AR
Brucella melitensis 16M BProt AR
Brucella melitensis 16M BProt AR
Brucella Suis 1330 BProt AR
Brucella Suis 1330 BProt AR
Rhizobium leguminosarum viciae3841 BProt AR
Sinorhizobium meliloti Rm1021 BProt AR
Francisella tularensis SCHUS4 BProt GT
Acidithiobacillus ferrooxidans ATCC23270 BProt GA
Nitrosomonas europaea Schmidt BProt BN
Chromobacterium violaceum ATCC12472 BProt BN
Neisseria gonorrhoeae FA1090 BProt BN
Neisseria meningitidis C FAM18 BProt BN
Neisseria meningitidis B MC58 BProt BN
Neisseria meningitidis A Z2491 BProt BN
Azoarcus species EbN1 BProt BR
Ralstonia solanacearum GMI1000 BProt BR
Ralstonia solanacearum GMI1000 BProt BR
Burkholderia pseudomallei K96243 BProt BB
Burkholderia pseudomallei K96243 BProt BB
Burkholderia mallei ATCC23344 BProt BB
Burkholderia mallei ATCC23344 BProt BB
Burkholderia cepacia J2315 BProt BB
Burkholderia cepacia J2315 BProt BB
Burkholderia cepacia J2315 BProt BB
Bordetella avium Strain BProt BB
Bordetella pertussis TohamaI BProt BB
Bordetella parapertussis 12822 BProt BB
Bordetella bronchiseptica RB50 BProt BB
Xylella fastidiosa Temecula1 BProt GX
Xylella fastidiosa 9a5c BProt GX
Stenotrophomonas maltophilia K279a BProt GX
Xanthomonas oryzae KACC10331 BProt GX
Xanthomonas campestris ATCC33913 BProt GX
Xanthomonas axonopodis citri306 BProt GX
Acinetobacter species ADP1 BProt GP
Coxiella burnetii RSA493 BProt GL
Legionella pneumophila Lens BProt GL
Legionella pneumophila Philadelphia1 BProt GL
Legionella pneumophila Paris BProt GL
Methylococcus capsulatus Bath BProt GM
Alcanivorax borkumensis SK2 BProt GO
Pseudomonas aeruginosa PAO1 BProt GP
Pseudomonas syringae DC3000 BProt GP
Pseudomonas putida KT2440 BProt GP
Idiomarina loihiensis L2TR BProt GA
Mannheimia succiniciproducens MBEL55E BProt GP
Haemophilus ducreyi 3500HP BProt GP
Actinobacillus actinomycetemcomitans HK1651 BProt
Pasteurella multocida Pm70 BProt GP
Haemophilus influenzae Rd BProt GP
Shewanella oneidensis MR1 BProt GA
Vibrio cholerae N16961 BProt GV
Photobacterium profundum SS9 BProt GV
Photobacterium profundum SS9 BProt GV
Vibrio fischeri ES114 BProt GV
Vibrio fischeri ES114 BProt GV
Vibrio vulnificus CMCP6 BProt GV
Vibrio vulnificus YJ016 BProt GV
Vibrio vulnificus CMCP6 BProt GV
Vibrio vulnificus YJ016 BProt GV
Vibrio parahaemolyticus RIMD2210633 BProt GV
Vibrio parahaemolyticus RIMD2210633 BProt GV
Buchnera aphidicola BBp BProt GE
Buchnera aphidicola Sg BProt GE
Buchnera aphidicola APS BProt GB
Blochmannia floridanus Strain BProt GE
Wigglesworthia glossinidia Strain BProt GE
Photorhabdus luminescens laumondiiTTO1 BProt GE
Erwinia carotovora SCRI1043 BProt GE
Serratia marcescens Db11 BProt GE
Yersinia enterocolitica 8081 BProt GE
Yersinia pseudotuberculosis IP32953 BProt GE
Yersinia pestis Mediaevails BProt GE
Yersinia pestis KIM BProt GE
Yersinia pestis CO-92BiovarOrientalis BProt GE
Escherichia coli O157 RIMD0509952 BProt GE
Escherichia coli 042
Escherichia coli CFT073 BProt GE
Escherichia coli K-12 W3110 BProt GE
Escherichia coli O157 EDL93 BProt GE
Escherichia coli K-12 MG1655 BProt GE
Shigella flexneri 2a301 BProt GE
Shigella flexneri 2457T BProt GE
Salmonella bongori 12419 BProt GE
Salmonella enterica ATCC9150 BProt GE
Salmonella typhimurium LT2 BProt GE
Salmonella enterica PT4 BProt GE
Salmonella enterica typhiCT18 BProt GE
Salmonella enterica Ty2 BProt GE
Dehalococcoides ethenogenes 195 BCDDhalo
Parachlamydia species UWE25 BChlam CP
Chlamydia trachomatis DUW 3CX BChlam CC
Chlamydia muridarum Nigg BChlam CC
Chlamydia trachomatis MoPn BChlam CC
Chlamydophila pneumoniae J138 BChlam CC
Chlamydophila pneumoniae CWL029 BChlam CC
Chlamydophila pneumoniae TW183 BChlam CC
Chlamydophila pneumoniae AR39 BChlam CC
Chlamydophila caviae GPIC BChlam CC
Chlamydophila abortis Strain BChlam CC
Gloeobacter violaceus PCC7421 BCyano CG
Synechocystis PCC6803 Strain BCyano CS
Anabaena nostoc PCC7120 BCyano NN
Thermosynechococcus elongatus BP1 BCyano CT
Synechococcus elongatus PCC6301 BCyano CS
Synechococcus sp WH8102 BCyano CS
Prochlorococcus marinus MIT9313 BCyano PP
Prochlorococcus marinus SS120 BCyano PP
Prochlorococcus marinus MED4 BCyano PP
Clostridium difficile 630X BFirm CC
Clostridium perfringens 13 BFirm CC
Clostridium acetobutylicum ATCC824 BFirm CC
Clostridium tetani E88 BFirm CC
Clostridium botulinum ATCC3502 BFirm CC
Lactobacillus plantarum WCFS1 BFirm LL
Lactobacillus acidophilus NCFM BFirm LL
Lactobacillus johnsonii NCC533 BFirm LL
Lactococcus lactis IL1403 BFirm LS
Streptococcus agalactiae V2603 BFirm LS
Streptococcus agalactiae NEM316 BFirm LS
Streptococcus uberis 0140J BFirm BL
Streptococcus pyogenes MGAS10394 BFirm LS
Streptococcus pyogenes MGAS8232 BFirm LS
Streptococcus pyogenes M5Manfredo BFirm LS
Streptococcus pyogenes SSI1 BFirm LS
Streptococcus pyogenes SF370 BFirm LS
Streptococcus pyogenes MGAS315 BFirm LS
Streptococcus pneumoniae TIGR4 BFirm LS
Streptococcus thermophilus CNRZ1066 BFirm LS
Streptococcus thermophilus LMG18311 BFirm LS
Geobacillus kaustophilus HTA426 BFirm BB
Enterococcus faecalis V583 BFirm LE
Listeria innocua Clip11262 BFirm BL
Listeria monocytogenes EGD BFirm BL
Listeria monocytogenes 4b BFirm BL
Oceanobacillus iheyensis HTE831 BFirm BB
Bacillus clausii KSMK16 BFirm BB
Bacillus halodurans C125 BFirm BB
Bacillus subtilis 168 BFirm BB
Bacillus licheniformis DSM13 BFirm BB
Bacillus licheniformis ATCC14580 BFirm BB
Bacillus cereus ATCC10987 BFirm BB
Bacillus cereus ZK BFirm BB
Bacillus cereus ATCC14579 BFirm BB
Bacillus thuringiensis 9727 BFirm BB
Bacillus anthracis Sterne BFirm BB
Bacillus anthracis Ames BFirm BB
Bacillus anthracis Ames0581 BFirm BB
Staphylococcus aureus N315 BFirm BB
Staphylococcus aureus NCTC8325 BFirm BB
Staphylococcus aureus Mu50 BFirm BS
Staphylococcus aureus MW2 BFirm BS
Staphylococcus aureus MSSA476 BFirm BS
Staphylococcus aureus MRSA252 BFirm BS
Staphylococcus aureus COL BFirm BB
Staphylococcus epidermidis RP62A BFirm BS
Staphylococcus epidermidis ATCC12228 BFirm BS
Symbiobacterium thermophilum Strain BActin S
Thermoanaerobacter tengcongensis MB4T BFirm CT
Bifidobacterium longum NCC2705 BActin AB
Propionibacterium acnes KPA171202 BActin AA
Nocardia farcinica IFM10152 BActin AA
Mycobacterium leprae TN BActin AA
Mycobacterium avium k10 BActin AA
Mycobacterium tuberculosis H37Rv BActin AA
Mycobacterium tuberculosis CDC1551 BActin AA
Mycobacterium bovis AF212297 BActin AA
Corynebacterium diphtheriae NCTC13129 BActin AA
Corynebacterium glutamicum ATCC13032 BActin AA
Corynebacterium efficiens YS314 BActin AA
Leifsonia xyli CTCB07 BActin AA
Tropheryma whippelii Twist BActin T
Tropheryma whippelii TW0827 BActin AA
Streptomyces coelicolor A3 BActin AA
Streptomyces avermitilis MA4680 BActin AA
Chlorobium tepidum TLS BCCC
Porphyromonas gingivalis W83 BBBB
Bacteroides fragilis YCH46 BBBB
Bacteroides fragilis NCTC9343 BBC BB
Deinococcus radiodurans R1 BDDD
Thermus thermophilus HB8 BDDT
Thermus thermophilus HB27 BDDT
Aquifex aeolicus VF5 BAqui AA
Thermotoga maritima MSB8 BThermt TT
Leptospira interrogans FiocruzL1130 BSpiro SL
Leptospira interrogans 56601 BSpiro SL
Borrelia garinii PBi BSpiro SS
Borrelia burgdorferi B31 BSpiro SS
Treponema denticola ATCC35405 BSpiro SS
Treponema pallidum Nichols BSpiro SS

Appendix F: RNAmmer vs. complete search in a 1,6Mb Bacteria

Output below shows the time for a typical RNAmmer run. Total search time is 23 sec:

life[pfh]:/home/ibiology1/dave/Bacteria/Campylobacter/jejuni/NCTC11168/Main> date ; remake Cjejuni_NCTC11168_Main.rnammer.ssu.report ; date
Sat Oct 9 12:19:07 MDT 2004
/home/people/pfh/scripts/rnammer/rnammer.pl --kingdom=`cat Cjejuni_NCTC11168_Main.kingdom | perl -ne 'chomp; print;'` --subunit=ssu --cpus=6 --reportlevel=1 --reportstage='' --modellib=/home/projects/pfh/projects/2004-08-20_NewOsloWorkMeeting/collect/test Cjejuni_NCTC11168_Main.fsa > Cjejuni_NCTC11168_Main.rnammer.ssu.report
Sat Oct 9 12:19:30 MDT 2004
life[pfh]:/home/ibiology1/dave/Bacteria/Campylobacter/jejuni/NCTC11168/Main>

The output below shows the time for a typical hmmsearch of the full SSU model against the same genome as above. Note that both strands must be searched, since hmmsearch only scans the top strand of its input; the search is therefore run twice, once on the genome sequence and once on its complement. Total search time is 47 min 20 sec.
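
How the complementary-strand input used in the session below (Cjejuni_NCTC11168_Main.compl.fsa) was produced is not shown in this appendix. One common way to make the other strand visible to a top-strand-only search is to reverse-complement every FASTA record. The Python sketch below illustrates that idea only; the function names reverse_complement and reverse_complement_fasta are hypothetical and are not part of RNAmmer or HMMER, and the 60-column line width is an arbitrary choice.

```python
# Complement table for IUPAC nucleotide codes, upper and lower case.
# Assumption: any other character (e.g. gaps) is passed through unchanged.
_COMPLEMENT = str.maketrans(
    "ACGTURYSWKMBDHVNacgturyswkmbdhvn",
    "TGCAAYRSWMKVHDBNtgcaayrswmkvhdbn",
)


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide string."""
    return seq.translate(_COMPLEMENT)[::-1]


def reverse_complement_fasta(text, width=60):
    """Reverse-complement every record of a FASTA-formatted string.

    Headers are kept as they are; sequence lines are re-wrapped to
    `width` columns. Assumes '>' occurs only at the start of headers.
    """
    records = []
    for block in text.split(">")[1:]:
        header, _, body = block.partition("\n")
        seq = "".join(body.split())  # drop line breaks and whitespace
        rc = reverse_complement(seq)
        lines = [rc[i:i + width] for i in range(0, len(rc), width)]
        records.append(">" + header.strip() + "\n" + "\n".join(lines))
    return "\n".join(records) + "\n"


if __name__ == "__main__":
    print(reverse_complement_fasta(">toy\nAAGT\nCC\n"), end="")
```

A hit reported at positions s to e (1-based) on a reverse-complemented sequence of length L corresponds to positions L-e+1 to L-s+1 on the original top strand, so coordinates from the second search have to be mapped back before they are compared with annotations.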

life[pfh]:/home/people/pfh/campy> date ; /home/projects/pfh/2004-08-20_NewOsloWorkMeeting/altixhmm/hmmsearch -T 900 /home/projts/pfh/2004-08-20_NewOsloWorkMeeting/modellib/bac_ssu.rnammer.hmm Cjejuni_NCTC11168_Main.fna > data ; date ; /home/projects/pf2004-08-20_NewOsloWorkMeeting/altixhmm/hmmsearch -T 900 /home/projects/pfh/2004-08-20_NewOsloWorkMeeting/modellib/bac_ssu.rnaer.hmm Cjejuni_NCTC11168_Main.compl.fsa > data.compl ; date ;
Sat Oct 9 12:30:05 MDT 2004
Sat Oct 9 12:53:33 MDT 2004
Sat Oct 9 13:17:25 MDT 2004
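
The two totals quoted in this appendix can be checked directly from the date stamps printed in the sessions above. The Python sketch below is only a worked calculation on those five stamps; the helper names are hypothetical, and the time-zone token is dropped because every stamp uses the same zone. The figures are wall-clock times: the RNAmmer run was started with --cpus=6, while the CPU usage of the hmmsearch runs is not stated.

```python
from datetime import datetime


def parse_stamp(stamp):
    """Parse a `date` line such as 'Sat Oct 9 12:19:07 MDT 2004'."""
    weekday, month, day, clock, zone, year = stamp.split()
    # Weekday and zone are ignored; all stamps share the same zone (MDT).
    return datetime.strptime(f"{month} {day} {clock} {year}",
                             "%b %d %H:%M:%S %Y")


def elapsed_seconds(start, end):
    """Whole seconds between two `date` stamps."""
    return int((parse_stamp(end) - parse_stamp(start)).total_seconds())


def as_min_sec(seconds):
    """Format a number of seconds as 'M min S sec'."""
    minutes, secs = divmod(seconds, 60)
    return f"{minutes} min {secs} sec"


# Stamps copied from the two sessions in this appendix.
rnammer = elapsed_seconds("Sat Oct 9 12:19:07 MDT 2004",
                          "Sat Oct 9 12:19:30 MDT 2004")
top = elapsed_seconds("Sat Oct 9 12:30:05 MDT 2004",
                      "Sat Oct 9 12:53:33 MDT 2004")
compl = elapsed_seconds("Sat Oct 9 12:53:33 MDT 2004",
                        "Sat Oct 9 13:17:25 MDT 2004")
full = top + compl

print("RNAmmer:            ", as_min_sec(rnammer))  # 0 min 23 sec
print("hmmsearch, top:     ", as_min_sec(top))      # 23 min 28 sec
print("hmmsearch, compl.:  ", as_min_sec(compl))    # 23 min 52 sec
print("hmmsearch, total:   ", as_min_sec(full))     # 47 min 20 sec
print("ratio total/RNAmmer: %.0f" % (full / rnammer))  # 123
```

On these stamps the full two-strand search takes roughly 120 times as long as the RNAmmer run on the same genome.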

Appendix G: RNAmmer source code

References

Claire M. Fraser, Sherwood Casjens, Wai Mun Huang, Granger G. Sutton, Rebecca Clayton, Raju Lathigra, Owen White, Karen A. Ketchum, Robert Dodson, Erin K. Hickey, Michelle Gwinn, Brian Dougherty, Jean-Francois Tomb, Robert D. Fleischmann, Delwood Richardson, Jeremy Peterson, Anthony R. Kerlavage, John Quackenbush, Steven Salzberg, Mark Hanson, Rene Van Vugt, Nanette Palmer, Mark D. Adams, Jeannine Gocayne, Janice Weidman, Teresa Utterback, Larry Watthey, Lisa Mcdonald, Patricia Artiach, Cheryl Bowman, Stacey Garland, Claire Fujii, Matthew D. Cotton, Kurt Horst, Kevin Roberts, Bonnie Hatch, Hamilton O. Smith & J. Craig Venter. Genomic sequence of a Lyme disease spirochaete, Borrelia burgdorferi (1997). Nature 390:580-586

Hallin PF and Ussery DW. CBS Genome Atlas Database: A dynamic storage for bioinformatic results and sequence data (2004). Bioinformatics 20:3682-3686

Maciej Szymanski, Miroslawa Z. Barciszewska, Jan Barciszewski and Volker A. Erdmann. 5S ribosomal RNA database (2000). Nucleic Acids Research 28:166-167

Hobohm U, Scharf M, Schneider R and Sander C. Selection of representative protein data sets (1992). Protein Science 1(3):409-417

Jan Wuyts, Yves Van de Peer, Tina Winkelmans and Rupert De Wachter. The European database on small subunit ribosomal RNA (2002). Nucleic Acids Research 30(1):183-185

HMMER User’s guide, Version 2.3.2, Oct.2003, ftp://ftp.genetics.wustl.edu/pub/eddy/hmmer/CURRENT/Userguide.pdf
