Using Comparative Genome Analysis to Identify Problems in Annotated Microbial Genomes

Microbiology (2010), 156, 1909–1917 DOI 10.1099/mic.0.033811-0 Mini-Review Using comparative genome analysis to identify problems in annotated microbial genomes Maria S. Poptsova and J. Peter Gogarten Correspondence Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT 06269-3125, USA Maria S. Poptsova [email protected] Genome annotation is a tedious task that is mostly done by automated methods; however, the accuracy of these approaches has been questioned since the beginning of the sequencing era. Genome annotation is a multilevel process, and errors can emerge at different stages: during sequencing, as a result of gene-calling procedures, and in the process of assigning gene functions. Missed or wrongly annotated genes differentially impact different types of analyses. Here we discuss and demonstrate how the methods of comparative genome analysis can refine annotations by locating missing orthologues. We also discuss possible reasons for errors and show that the second-generation annotation systems, which combine multiple gene-calling programs with similarity-based methods, perform much better than the first annotation tools. Since old errors may propagate to the newly sequenced genomes, we emphasize that the problem of continuously updating popular public databases is an urgent and unresolved one. Due to the progress in genome-sequencing technologies, automated annotation techniques will remain the main approach in the future. Researchers need to be aware of the existing errors in the annotation of even well-studied genomes, such as Escherichia coli, and consider additional quality control for their results. Introduction see Reed et al. (2006) and Reeves et al.(2009). In this paper we briefly review the problems associated with identification Sequencing the first complete genome of Haemophilus of coding sequences (CDS) in bacterial and archaeal influenzae in 1995 opened a new page in genome sciences. genomes and demonstrate how comparative genomics can It took eight more years (till 2003) to increase the number help in the location of missed genes. of complete sequenced genomes to 100. This number was doubled by the year 2005, and by 2010 more than 1000 Bacterial and archaeal genomes, as well as those of some completely sequenced bacterial and archaeal genomes were eukaryotic micro-organisms, have the considerable advant- available at GenBank, with approximately four times this age of usually lacking introns, which makes the process of number of genomes in the process of being sequenced. In gene boundary identification much easier. Nevertheless September of 2009 the GOLD database (Liolios et al., 2008) bacterial or archaeal gene-calling procedures are not error listed 5902 genome projects; 4242 of these were bacterial free. In the absence of introns, it might have seemed that genome projects, of which 1154 were listed as complete and ORFs can be designated as any substring of DNA that 966 in draft. With such a tempo it is obvious that the begins with a start codon and ends with a stop codon. If we burden of genome annotation will be assigned mostly to apply this rule to any bacterial or archaeal genome we will automated methods. However, the accuracy of automated obtain many overlapping and short ORFs. A difficult task approaches has been questioned since the beginning of the for gene prediction software is to decide which one from sequencing era (Friedberg, 2006). two overlapping ORFs represents a true gene. This is especially true for those ORFs that are encoded on the Genome annotation is a multi-level process that includes complementary DNA strands. Gene-prediction methods prediction not just of coding genes, but also of pseudo- (see Do & Choi, 2006; Stothard & Wishart, 2006 for genes, promoter regions, direct and inverted repeats, reviews) are based on hidden Markov models (HMMs), untranslated regions and other genome units. For a such as those implemented in Glimmer (Salzberg et al., comprehensive review of genome and proteome annotation 1998) or GeneMark (Besemer et al., 2001). To apply them to a new sequence, one must train prediction models on a Abbreviations: CDS, coding sequence(s); HMM, hidden Markov model. set of known genes. HMM-based gene prediction methods A supplementary table with details of the 30 Escherichia strains proved to be very efficient; however, they have inherent searched for missing orthologues is available with the online version of restrictions. A training set must be chosen from the this paper. genome that has to be annotated. This implies that genes 033811 G 2010 SGM Printed in Great Britain 1909 M. S. Poptsova and J. P. Gogarten with atypical nucleotide composition could be missed. annotations is of another order. Here we have to Another source of potential errors in gene prediction is the distinguish between two different types of errors: errors choice of a start codon, which can be one of the six in gene calling and errors in functional annotations. Errors potential candidates in prokaryotes (ATG, GTG, TTG, and in gene calling come from the gene-prediction methods, in some cases ATT, CTG and ATC) (see Yada et al., 2001 which are all based on Markov models. The difference in and Zhu et al., 2004 on translation initiation site prediction results depends first on the model itself and also on the in prokaryotes). Stop codons can also have a dual function, training set. An evaluation of gene finders based on HMMs as stop codons TGA or TAG may encode selenocysteine was done by Knapp & Chen (2007); the authors reported and pyrrolysine. Another problem is to decide on the cutoff that no significant improvement in the quality of de novo for filtering short ORFs that might encode small polypep- gene prediction methods occurred during the previous 5 tides. Some genes are missed because they have pro- years. The programs were tested on human sequences, grammed frameshifts (Farabaugh, 1996). Second-generation where CDS prediction is complicated by exonic and annotation systems (see below) try to combine multiple intronic structures. For additional assessment of the quality gene-calling programs, but this sometimes leads to gene of gene-prediction methods see Majoros et al. (2003). overcalling (i.e. gene called present when a gene is in fact Bakke et al. (2009) evaluated three second-generation gene- absent) due to the conflict between multiple gene-prediction annotation systems on the genome of the archaeon algorithms (Armengaud, 2009). Halorhabdus utahensis at a different level of annotation: from the performance of de novo gene-prediction models Genes with wrong functional annotation and missed genes to the functional assignments of genes and pathways. differentially impact different types of analyses. Compara- Comparison of gene-calling methods showed that 90 % of tive genome analysis usually consists of identification of all three annotations share exact stop sites with the other orthologues and paralogues, and instances of gene gain and annotations, but only 48 % of identified genes share both gene loss. Orthologues are used to infer evolutionary start and stop sites. Palleja et al. (2008) performed an relationships between species, and a gene with a wrong interesting investigation of overlapping CDS in prokaryotic annotation may be immediately detected, because in a genomes. They compared overlapping genes with their phylogenetic tree it would be surrounded by genes with corresponding orthologues and found that more than 900 different annotations; however, this assumes that the reported overlaps larger than 60 bp were not real overlaps, annotation error is unique to individual genes and was but annotation errors. not propagated between the genomes that contain the neighbouring genes. Short erroneously identified ORFs In addition to errors in CDS prediction, functional also would not greatly impact an orthologue collection as assignments are not error free either. One of the first they most likely would not have orthologues in other estimations of the error rate of gene functional assignment genomes. They will mistakenly increase the set of ORFans, was performed by comparing three published annotated the origin of which is still a debated question (Siew & genomes of Mycoplasma genitalium (Brenner, 1999). Of Fischer, 2003). However conclusions made about gene gain 702 annotated CDS, 55 cases (8 %) showed discrepancies in and loss, which are based on gene presence and absence annotation. The errors intrinsic to the genome-annotation analysis, could be wrong if a gene was missed as a result of process were discussed by Devos & Valencia (2001), and gene-calling procedures. Missed genes have a particularly the authors emphasized that most functional annotations strong impact on the delineation of the core genome for a in complete genomes are based on a relatively weak large and diverse group of organisms, and on identification sequence identity. The estimation of potential errors of the of strain- or species-specific genes. The percentage of first three published genomes of Haemophilus influenzae, missed genes in one genome likely does not exceed 5–10 %; Mycoplasma genitalium and Methanococcus jannaschii however, even a single missed gene can lead to a wrong showed that, depending on the type of function, the biological inference, especially if the missed gene is a key expected rate of errors varies from less than 5 % to more enzyme of a metabolic pathway. than 40 % (Devos & Valencia, 2001). Bork (2000) estimated that 70 % accuracy in the annotation of functional and structural features has to be considered as Error estimation in annotated genomes a success (see also Table 1 in Bork, 2000). The reason for Sequencing errors in the early 1990s were estimated at the such a high error rate, as Bork pointed out, is that a order of 0.1 % (Bork & Bairoch, 1996). These errors are functional annotation is based on knowledge databases that responsible for frameshifts and stop-codon introduction are still of insufficient quality. Nagy et al. (2008) discuss the (see below), and can influence recognition of a gene and its identification of incorrectly predicted genes and proteins in functional assignment.

Using Comparative Genome Analysis to Identify Problems in Annotated Microbial Genomes

Cryptic Inoviruses Revealed As Pervasive in Bacteria and Archaea Across Earth’S Biomes

Microbial Gene Identification Using Interpolated Markov Models Steven L

Steven L. Salzberg

By Katherine E. L. Cox a Dissertation Submitted to Johns Hopkins University in Conformity with the Requirements for the Degree O

Improved Microbial Gene Identification with GLIMMER

The Effect of Machine Learning Algorithms on Metagenomics Gene

Genome Sequencing and Annotation

Gene Prediction with Glimmer for Metagenomic Sequences Augmented by Classification and Clustering

MCB 1200/1201, Virus Hunting

Machine Learning for Metagenomics: Methods and Tools

Glimmer-MG Release Notes Version 0.1

Information Theory, Graph Theory and Bayesian Statistics Based Improved and Robust Methods in Genome Assembly