RNAmmer: Fast two-level HMM prediction of rRNA in prokaryotic genome sequences

Peter F. Hallin, s971636. February 21st 2005

This report is written as part of the 10p special course scheduled for the 2nd semester of the International Masters program in Bioinformatics at the Center for Biological Sequence Analysis, Technical University of Denmark. Supervisor is David W. Ussery.

Table of contents

Abstract
Introduction
  BLAST - A common approach
  Strategy
Methods
  Model construction
  Homology reduction of dataset
  Alignment conservation
  HMM building
Results and evaluation
  Selectivity and sensitivity
  Accuracy - deviation in start/stop positions
  Length deviations
  Example of annotations
Proof of Concept
Conclusion
Appendices
  Appendix A: Sequence count in phylogenetic group
  Appendix B: Perl implementation of plotcon algorithm
  Appendix C: Model building and post script generation
  Appendix D: Makefile for RNAmmer - parsing search results and calculating accuracy
  Appendix E: 16s rRNA tree of the Genome Atlas Database
  Appendix F: RNAmmer vs. complete search in a 1,6Mb Bacteria
  Appendix G: RNAmmer source code
References

Abstract

A program has been developed which uses Hidden Markov Models (HMMs) to predict rRNA genes in bacterial DNA. The program uses a short 75 bp conserved region of the molecule to build a model for an initial ‘spotter’, and a full-length HMM to model the entire molecule. This avoids scanning entire genome sequences with large and slow models. The program has been implemented in a Makefile, and results have been gathered from a collection of full genome sequences (n=236). The program has proven significantly better than the previous BLAST approach that was used in the Genome Atlas Database, and predictions on well-annotated genomes suggest selectivity and specificity of prediction in the range of 0.993 to 0.999. During the evaluation of the program, a few GenBank files were identified as having rRNA annotated on the wrong strand.

Introduction

The 16s rRNA molecule has been used as a fingerprint or evolutionary chronometer of microbial genomes for years. It is essential for comparative genomics to see genomic properties in a phylogenetic context, and mutations in the 16s rRNA are often used to measure the evolutionary distance between sequenced microbial organisms. The quality of such distance estimates depends on alignment quality - but most important is proper identification of the individual sequences.
Since the rRNA molecules are highly conserved, one would expect them to be easily detectable and consistently annotated in all published GenBank files. Researchers feel tempted to use BLAST to identify rRNA genes. Such predictions can lead to problems, as will be discussed later in this report.

The distributions below show the three peaks of 5s, 16s and 23s rRNA for all sequenced bacterial genomes. The width of each major peak is likely to reflect a true variation in the lengths of the molecules, as is visible in panel A of figure 1. Panel B shows the same distribution restricted to the first 10 counts, to underline the few (but not least important) poorly annotated molecules with lengths up to 6,000 bp. Although not visible in these histograms, it will be shown later that even the correct strand can be missed in the process of annotating and compiling the GenBank/EMBL files. Large subunits with a length around half that of an average small subunit were also identified. These plots are generated in a somewhat crude manner, since they are a simple extract of annotated features containing the word ‘rRNA’. Later in this report, we will compare the lengths of features where predictions overlap with annotations.

[Figure 1 plot not reproduced. Title: Distribution of lengths of rRNA molecules in all sequenced Bacteria. Panels A and B; x-axis: Length (0 to 6000); y-axis: Count.]

Figure 1: Panel A: Histogram of all rRNA genes in sequenced bacterial genomes. Panel B: First 10 counts.

BLAST - A common approach

Most services for detecting rRNA are based on a BLAST approach. Relying on this method involves several problems:

A. Each match/mismatch of the alignment contributes with the same weight when calculating the score - in other words, it’s a linear scoring scheme.
B. The ‘model’ for a BLAST approach is a database of already sequenced rRNA molecules.
If sequences which deviate significantly from the content of this database are to be found, the BLAST approach will fail. It relies on the assumption that the database represents all phylogenetic groups.
C. BLAST fails to do any form of weighting of domains that are highly conserved.
D. The score of a BLAST hit is difficult to interpret. It will be equal to the score/E-value of the closest related molecule in the database. If, by chance, the database contains a sequence of the same genus and species as that of a given query, then the final score will be good. However, the score/E-value will be poor if no similar genus and species are found in the database. Scoring thus depends on the diversity of the database.

Strategy

Our strategy was to obtain structurally aligned sequences from publicly available rRNA databases. By doing so, we avoid forcing an artificial nucleotide conservation into the models, which a plain sequence alignment would do. These alignments can be presented directly to software that generates a profile HMM from an alignment. The European Ribosomal RNA database (Jan Wuyts et al. 2002, http://www.psb.ugent.be/rRNA/) contains a comprehensive collection of SSU and LSU rRNA sequences. These sequences are kept in the ‘distribution format’, which includes information from a structural alignment, as shown in figure 2.

acc:X53497 (accession no.)
src:NoData (source)
str:MUCL 29800, ATCC 18804, CBS 562 (strain info.)
ta1:Eukarya (taxonomic info.)
ta2:Fungi (taxonomic info.)
ta3:Eumycota (taxonomic info.)
ta4:Ascomycotina (taxonomic info.)
ta5:Hemiascomycetes (taxonomic info.)
chg:this sequence is not in EMBL (changes other than del with regards to original EMBL entry)
rem:this is just an example (remarks about the entry)
aut:person 1, person 2 (authors)
ttl:The SSU rRNA sequence of a species (title)
jou:Journal name (journal)
dat:1989 (journal year)
vol:12 (journal volume)
pgs:223-229 (journal pages)
mty:SSU (type of rRNA)
del:500 AUG 800 AAAA (deletions made to keep alignment size down: <sequence position> <deleted> ...)
seq:organism name (organism name)
--------------------------------------------------------------------------------
--------------------------UAU[CUGGU]U-----GA[UCCU^GCCAG^UAGU{-C}AUA-UGCU]--[UGUC
]UCAAAG--AU-UAA[GCC{A-}UGC]A-UGUCUA-[A-GU{A-UAA-}GC]A---------------------------
-----------------------AUUUAU-AC------------------------------------------------
------------------------------------------A[G-U{-G--AA}AC-U]GCGAA--UGG[C-UC]AUUA
---AAU-[CAG{UU}AU{--CG}U-U{UA--UU}UGA]UAG--UA--CC--------------------------UU-AC
-UA[C(U)UG(G)-AU{AACCG-}UGG]UAAU-U[CUA{-GAGCUA}AU(A)-CA(U)G]CUU------AAA-[AUCCC{
G-A}CU]--------------------------------------------GUUU-------------------------
.....*

Figure 2: Example of a database entry in “distribution format”.

The format is easy to parse, and the entries have been stored in a MySQL table (updated version: genome.pfh_public.rdb, frozen version: genome.pfh_public.rdb_report). We believe this database is one of the most consistent and well-maintained SSU/LSU collections on the web. It should be noted that this database does not contain the complete LSU structure, since the 5s/8s molecule is missing. In this report we chose to label 5s and 8s as ‘TSU’ (tiny sub unit), and these sequences are downloaded separately from the Institute of Bioorganic Chemistry at the Polish Academy of Sciences (http://biobases.ibch.poznan.pl/5SData/). Alignments from both these locations have been imported into the same database as mentioned above. Statistics from this table are shown in figure 3.
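To illustrate why the distribution format of figure 2 is straightforward to parse, the following is a minimal sketch in Python. It is not the code used for this report. It assumes that header lines have the form of a three-character key followed by a colon, that the aligned sequence starts right after the seq: line, and that every non-letter character in the alignment (gaps, brackets and other structure symbols) can be discarded to recover the plain sequence. The function names parse_entry and ungapped are hypothetical.

```python
import re

# Assumption: header keys are three lowercase letters/digits followed by ':'.
HEADER = re.compile(r"^([a-z0-9]{3}):(.*)$")


def parse_entry(text):
    """Split one entry into a dict of header fields and the aligned sequence."""
    fields = {}
    aligned = []
    in_sequence = False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = HEADER.match(line)
        if match and not in_sequence:
            key, value = match.groups()
            fields[key] = value.strip()
            # Assumption: the alignment block begins after the 'seq:' line.
            if key == "seq":
                in_sequence = True
        else:
            aligned.append(line)
    return fields, "".join(aligned)


def ungapped(aligned):
    """Drop gaps and structure symbols, keeping only the nucleotide letters."""
    return re.sub(r"[^A-Za-z]", "", aligned).upper()


entry = """acc:X53497
mty:SSU
seq:organism name
--UAU[CUGGU]U--
--GA[UCCU^GCC]
"""
fields, aligned = parse_entry(entry)
print(fields["acc"], fields["mty"], ungapped(aligned))
```

The header dictionary maps naturally onto table columns such as mty and ta1, while the ungapped sequence gives the molecule length. If real entries carry trailing explanations like those in figure 2, they would need to be stripped from the values first.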
mysql> select mty,length(clr),count(*),length(clr),stddev(length(clr)) from rdb where ta1 not like '%environment%' group by length(clr);
+------+-------------+----------+-------------+---------------------+