Gene Prediction Using the Self-Organizing Map: Automatic Generation of Multiple Gene Models Shaun Mahony*1, James O Mcinerney2, Terry J Smith1 and Aaron Golden3
Total Page:16
File Type:pdf, Size:1020Kb
BMC Bioinformatics BioMed Central Software Open Access Gene prediction using the Self-Organizing Map: automatic generation of multiple gene models Shaun Mahony*1, James O McInerney2, Terry J Smith1 and Aaron Golden3 Address: 1National Centre for Biomedical Engineering Science, NUI, Galway, Galway, Ireland, 2Bioinformatics and Pharmacogenomics Laboratory, NUI, Maynooth, Co. Kildare, Ireland and 3Department of Information Technology, NUI, Galway, Galway, Ireland Email: Shaun Mahony* - [email protected]; James O McInerney - [email protected]; Terry J Smith - [email protected]; Aaron Golden - [email protected] * Corresponding author Published: 05 March 2004 Received: 26 November 2003 Accepted: 05 March 2004 BMC Bioinformatics 2004, 5:23 This article is available from: http://www.biomedcentral.com/1471-2105/5/23 © 2004 Mahony et al; licensee BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL. Abstract Background: Many current gene prediction methods use only one model to represent protein- coding regions in a genome, and so are less likely to predict the location of genes that have an atypical sequence composition. It is likely that future improvements in gene finding will involve the development of methods that can adequately deal with intra-genomic compositional variation. Results: This work explores a new approach to gene-prediction, based on the Self-Organizing Map, which has the ability to automatically identify multiple gene models within a genome. The current implementation, named RescueNet, uses relative synonymous codon usage as the indicator of protein-coding potential. Conclusions: While its raw accuracy rate can be less than other methods, RescueNet consistently identifies some genes that other methods do not, and should therefore be of interest to gene- prediction software developers and genome annotation teams alike. RescueNet is recommended for use in conjunction with, or as a complement to, other gene prediction methods. Background ceding bases in the sequence. These methods are Computational gene prediction methods have yet to undoubtedly very successful, with published sensitivity achieve perfect accuracy, even in the relatively simple rates between 90% and 99% for most prokaryotic prokaryotic genomes. Problems in gene prediction centre genomes. However, as the sensitivity rates of the methods on the fact that many protein families remain uncharac- rise, specificity generally tends to fall, and while the appli- terised. As a result, it seems that only approximately half cation of sophisticated post-processing rules can correct of an organism's genes can be confidently predicted on many false-positive predictions, no method has yet the basis of homology to other known genes [1-3], so ab achieved 100% accuracy. This is especially the case in the initio prediction methods are usually employed to identify more complex eukaryotic gene-finding problem, where many protein-coding regions of DNA. less than 80% of exons in anonymous genomic sequences are correctly predicted by current methods [2,6-8]. Currently, the most popular prokaryotic gene-prediction methods, such as GeneMark.hmm [4] and Glimmer2 [5], For the foreseeable future it does not seem that the exact are based on probabilistic Markov models that aim to pre- set of genes in any organism can be automatically pre- dict each base of a DNA sequence using a number of pre- dicted by any single computational method. In practice, Page 1 of 9 (page number not for citation purposes) BMC Bioinformatics 2004, 5 http://www.biomedcentral.com/1471-2105/5/23 this has meant that the best predictions are to be found by training data would not be useful in the case of Markov- combining evidence from two or more independent based methods, as the data contained in each sub-cluster methods [3,9]. Genome annotation teams often compare may not allow for a good estimation of model parameters. the evidence offered by multiple gene-finders in order to However, generating more specific models for subtle pat- predict the gene complement of a given genome. Because terns found in the training set can only be advantageous if of the degree of 'manual' annotation that now takes place it can be done in a way that minimises loss of overall accu- in the major genome sequencing centres, a gene-predic- racy and produces no extra false-positive predictions. tion tool will be of practical use if it can exclusively predict genes that other gene-finders cannot. This paper aims to show how the Self-Organizing Map neural network algorithm can be used to automatically Many ab initio gene-prediction methods are based on sin- identify the major trends in oligonucleotide variation in a gle models of protein-coding regions and therefore make genome, and in doing so provide multiple gene models the implicit assumption that all protein-coding regions for use in gene prediction. It will be explained that this within a particular genome will share similar statistical approach is an effective solution to the problem of intra- properties. However, evidence has mounted that single genomic variation. Specific examples of genes predicted gene models of intrinsic coding measures are no longer only by this method are offered, thus demonstrating the fully satisfying [10,11]. The problem with single model usefulness of the approach in genome annotation. methods centres on the degree of oligonucleotide compo- sition variation that exists within most genomes. On the A further advantage of using the Self-Organizing Map for codon level, intra-genomic variation in codon bias has gene prediction is the ability of the algorithm to use com- long been correlated with expression level [12]. Counter- plex descriptors as measures of gene coding potential. We balancing the translational selection theory of codon bias demonstrate this ability using relative synonymous codon is the effect of mutational bias [13,14]. Many other, often usage (RSCU) as our measure of gene coding potential. more subtle levels of variation have been recognised over Unlike other gene coding measures, RSCU is not based on the years, with many disparate evolutionary pressures the absolute frequency of k-mers, but instead describes shown to be acting on codon usage bias [15]. For exam- the codon choice for each amino acid. Markov chains ple, strand-specific codon usage biases have often been based on the RSCU measure would have transition prob- recognised [16-20], leading to more general studies of cor- abilities that are conditioned on the underlying amino- relation between the location of the gene on the genome acids. Although theoretically possible [26], the practical and codon bias [21], and the more specific discovery of a computation of such Markov chains would give rise to A+T skewed bias near the replication terminus of bacterial major difficulties. Therefore, the ability of our approach genomes [22]. Other effects shown to shape codon usage to make use of a sophisticated gene coding descriptor such are gene length [23] and selection at the amino acid level as RSCU is a distinct advantage of our approach over [24]. It has also been suggested that content variation can Markov model based methods. occur at the exon level in eukaryotic genes, the possibility existing that some exons in a gene may have different Implementation codon usage patterns to others [1]. Given that some of the Coding measure above pressures on codon usage have only recently been In this study, relative synonymous codon usage (RSCU) discovered, it is likely that some more subtle patterns have vectors are used as the measure of protein-coding poten- yet to be recognised, and therefore it is difficult to predict tial for a given window of sequence. The RSCU value for a the level of compositional variation that will be present in codon 'i' is defined as: an anonymous genomic sequence. Obs RSCU = i ()1 The need for gene-finding methods that can overcome the i Exp problems presented by intra-genomic variation was recog- i nised and addressed in the case of prokaryotic genomes by where Obsi is the observed number of occurrences of GeneMark-Genesis [25], which derives two models for codon 'i', and Expi is the expected number of occurrences each genome according to typical and atypical codon of the same codon (based on the number of times the rel- usage clusters in that genome. This increase in the number evant amino acid is present in the gene and the number of of gene models led to an increase in accuracy of the Gen- synonymous alternatives to 'i', assuming a uniform choice eMark method. While Hayes & Borodovsky experimented of synonymous codons). In order to make the data more with a third ('highly-typical') codon usage cluster and an compatible with the mathematical methods used, the log associated model in some cases, they did not see the need of each RSCUi value is found so that the resulting value is to further sub-cluster the atypical codon usage set in order positive if the codon is used more than expected in that to make even more models. Overly sub-clustering the gene, and negative if the codon is used less than expected. Page 2 of 9 (page number not for citation purposes) BMC Bioinformatics 2004, 5 http://www.biomedcentral.com/1471-2105/5/23 Values were capped at ± 10, and set to 0 in the case of the In this work, similarity between two vectors is measured non-occurrence of an amino acid in the sample. Taking by finding the cosine of the angle between them. A cosine the RSCU values for each of the codons with synonymous of 1 represents exactly similar vectors while a cosine of 0 alternatives (and ignoring the 3 stop codons and the Trp represents exactly dissimilar vectors.