Simplified Matching Algorithm Using a Translated Codon (Tron)

Vol. 16 no. 3 2000 BIOINFORMATICS Pages 190–202

Homology-based gene structure prediction: simplified matching algorithm using a translated codon (tron) and improved accuracy by allowing for long gaps

Osamu Gotoh

Saitama Cancer Center Research Institute, 818 Komuro Ina-machi, Saitama 362-0806, Japan

Received on August 23, 1999; accepted on October 21, 1999

© Oxford University Press 2000

Abstract

Motivation: Locating protein-coding exons (CDSs) on a eukaryotic genomic DNA sequence is the initial and an essential step in predicting the functions of the genes embedded in that part of the genome. Accurate prediction of CDSs may be achieved by directly matching the DNA sequence with a known protein sequence or profile of a homologous family member(s).

Results: A new convention for encoding a DNA sequence into a series of 23 possible letters (translated codon or tron code) was devised to improve this type of analysis. Using this convention, a dynamic programming algorithm was developed to align a DNA sequence and a protein sequence or profile so that the spliced and translated sequence optimally matches the reference the same as the standard protein sequence alignment allowing for long gaps. The objective function also takes account of frameshift errors, coding potentials, and translational initiation, termination and splicing signals. This method was tested on Caenorhabditis elegans genes of known structures. The accuracy of prediction measured in terms of a correlation coefficient (CC) was about 95% at the nucleotide level for the 288 genes tested, and 97.0% for the 170 genes whose product and closest homologue share more than 30% identical amino acids. We also propose a strategy to improve the accuracy of prediction for a set of paralogous genes by means of iterative gene prediction and reconstruction of the reference profile derived from the predicted sequences.

Availability: The source codes for the program ‘aln’ written in ANSI-C and the test data will be available via anonymous FTP at ftp.genome.ad.jp/pub/genomenet/saitama-cc.

Contact: [email protected]

Introduction

Following the completion of genomic sequencing of the yeast Saccharomyces cerevisiae (Goffeau et al., 1996), nearly the complete structure of the nematode Caenorhabditis elegans genome has recently been reported (The C. elegans Sequencing Consortium, 1998). Sequencing projects in several eukaryotic genomes including the human genome are now in progress. Identification of the genes on these genomic sequences and inferring their functions are major themes of current computational genome analyses. One obstacle to gene identification is the fact that typical eukaryotic genes are segmented, and the prediction of precise exonic regions is still a challenging problem (Burge and Karlin, 1998; Claverie, 1997; Murakami and Takagi, 1998).

Most gene-identification methods rely on statistical features of coding and non-coding sequences, and signals around the beginnings and ends of transcription, translation, and splicing. Various algorithms including artificial neural networks (Uberbacher and Mural, 1991), hidden Markov models (Burge and Karlin, 1997), and discriminant analyses (Zhang, 1997) coupled with dynamic programming algorithms (Snyder and Stormo, 1995; Xu et al., 1994) have been used to capture specific signals and derive a final prediction by combining several lines of information. The best-performing programs currently available correctly predict 70–80% of exons (Burge and Karlin, 1998; Claverie, 1997; Murakami and Takagi, 1998). This level of accuracy might be sufficient for classification of the gene into a certain gene family, but is insufficient for some other purposes, such as predicting the structure of the encoded protein and evolutionary studies, since the chance of perfect prediction of the entire coding sequence declines exponentially with the number of exons in the gene.

Several gene-finding methods incorporate the results of sequence similarity searches (Altschul et al., 1990) to significantly improve the overall prediction accuracy (Burset and Guigó, 1996; Cai and Bork, 1998). To achieve still better predictions, however, more direct involvement of homology information appears to be necessary (Mironov et al., 1998). Homology information has been shown to be useful even for better identification of prokaryotic genes (Lolkema and Slotboom, 1998; Pearson et al., 1997). A handful of methods have been proposed to predict eukaryotic gene structures by direct matching of the genomic DNA sequence and a reference protein (Birney and Durbin, 1997; Gelfand et al., 1996; Huang and Zhang, 1996), or cDNA sequence (Florea et al., 1998; Mott, 1997). ‘Genewise’ from Birney and Durbin (1997) appears to be the most general of these methods, since it considers frameshift errors and accepts a protein profile as the reference, while other methods lack either of the features. The use of a protein profile or a profile hidden Markov model, rather than a single sequence, may improve the alignment accuracy, as has been repeatedly experienced for homology searches (Altschul et al., 1997; Park et al., 1998) and multiple sequence alignment (Gotoh, 1996). However, genewise does not necessarily predict the complete gene structure, and sometimes reports only fragmental accounts.

Upon investigating the members of some multi-gene families throughout the genome of C. elegans, I became aware that prediction-based CDSs annotated in public sequence databases, such as GenBank, might be wrong for up to nearly half of the genes (Gotoh, 1998). I reached this notion because multiple protein sequence alignments derived from such CDSs contained several structurally implausible gaps (insertions and deletions). I was able to show that the reassignment of exons greatly improved the quality of alignments (Gotoh, 1998). The principle of the prediction algorithm was straightforward; when the genomic sequence is spliced and translated, the conceptual protein sequence should optimally match the reference sequence or a ‘generalized profile’ as in the usual protein–protein or protein–profile alignment with an affine gap-penalty function. Our profile is more general than the usual (Gribskov et al., 1987) in terms of rigorous treatment of internal gaps of various lengths and positions (Gotoh, 1994). Translational initiation, termination, and splicing signals as well as exonic coding potentials are also taken into account for the objective function to be optimized.

In this paper, I will show the details of the algorithm, which involves several novel features. First, a genomic sequence is encoded into translated codon (or tron) codes. The 23-letter codes can compactly encode the potential translated amino acid sequence without losing any information of the original nucleotide sequence. Second, a matching score for a codon interrupted by a phase-1 or phase-2 intron is rigorously evaluated, where phase-1 and phase-2 introns imply those located between the first and second, and the second and third nucleotides in a codon, respectively. Third, a special form of gap-penalty function that takes account of frameshift errors and long gaps is employed. Finally, an iterative strategy is developed to coherently improve the prediction accuracy for a set of paralogous genes within a genome. Empirical examinations with 291 C. elegans genes of experimentally identified exon–intron organizations indicated that our method accurately predicts exonic sequences with a correlation coefficient of about 97% at the nucleotide level, when the amino acid identities between the reference sequence and the objective gene product exceed the generally recognized range (twilight zone) of reliable global protein sequence alignment (Doolittle, 1981; Sander and Schneider, 1991).

Methods

Tron code

A remarkable feature of the universal genetic code is that the second nucleotide in a codon greatly affects its specificity (Crick, 1968). In fact, all codons for an amino acid have a unique nucleotide at the second position, except for Ser (TCN and AGY) and termination (TAR and TGA) codons. Thus, only 23 letters are necessary and sufficient to unambiguously encode both the original nucleotide sequence and conceptually translated amino acid sequences in the three frames. We propose to call each translated codon a ‘tron’ $\in \Sigma^{t}$, and express them by the standard one-letter amino acid codes $\in \Sigma^{a}$, except for ‘J’, ‘O’, and ‘U’, which are used to represent translated AGY, TAR, and TGA codons, respectively (Figure 1). Thus, $\Sigma^{t} = \{\Sigma^{a}, \text{‘J’}, \text{‘O’}, \text{‘U’}\}$. In a ‘tron sequence’, $b = b_1 b_2 \ldots b_J$, the tron code substitutes for the second nucleotide of a triplet. The tron codes at the first and last sites in a sequence may be determined based on the arbitrary assumption that a fixed nucleotide, ‘A’, occupies the sites immediately before the first and immediately after the last sites of the original nucleotide sequence.

A usual 20 × 20 amino acid exchange matrix, $M(a, b)$ $(a, b \in \Sigma^{a})$, such as PAM250 (Dayhoff et al., 1978) and JTT250 (Jones et al., 1992), is easily expanded to a 20 × 23 amino acid versus tron similarity matrix, $S(a, b)$ $(a \in \Sigma^{a}, b \in \Sigma^{t})$, such that $S(a, b) = M(a, b)$ if $b \in \Sigma^{a}$, $S(a, \text{‘J’}) = M(a, \text{‘S’})$, and $S(a, \text{‘O’}) = S(a, \text{‘U’}) = \min M(a, b)$. The matrix $S(a, b)$ facilitates immediate comparison of an amino acid or a profile vector with a translated codon. On the other hand, a single reference to a 23-element table immediately recovers the original nucleotide at a specific site.

Fig. 1. The tron codes. The proposed tron (translated codon) codes are shown in boldface letters together with arbitrarily chosen numeric codes (1–23).

Boundary signals and coding potential

We used the frequency of ‘ditrons’, i.e. neighboring trons every three sites, in three frames to estimate coding potential. All frequency data were normalized with the corresponding reference frequencies obtained from the 27 Mb of C. elegans genomic sequence that was publicly ...

... window size of 20 derived from first-order Markov models at the nucleotide sequence level (Salzberg, 1997; Zhang and Marr, 1993).
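
The tron encoding and the expanded similarity matrix described under ‘Tron code’ can be illustrated with a short Python sketch. This is not the ‘aln’ program (which the paper states is written in ANSI-C); it is a minimal illustration under stated assumptions. The names `encode_trons`, `decode_trons`, and `expand_matrix` are hypothetical, the standard genetic code is assumed, and ‘MIN M(a, b)’ is read here as the minimum over the whole 20 × 20 matrix, which is one plausible interpretation of the text.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (third base fastest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def codon_to_tron(codon):
    """Map a codon to one of the 23 tron letters."""
    aa = CODON_TO_AA[codon]
    if aa == "S" and codon[1] == "G":      # AGY serine -> 'J'
        return "J"
    if aa == "*":                          # TAR -> 'O', TGA -> 'U'
        return "O" if codon[1] == "A" else "U"
    return aa


# 23-element table: each tron determines the second nucleotide of its codon.
TRON_TO_NT = {codon_to_tron(c): c[1] for c in CODON_TO_AA}


def encode_trons(dna):
    """Encode a DNA string as trons; site j holds the codon centred on j.

    A fixed 'A' is assumed before the first and after the last site,
    following the convention stated in the text.
    """
    padded = "A" + dna + "A"
    return "".join(codon_to_tron(padded[j:j + 3]) for j in range(len(dna)))


def decode_trons(trons):
    """Recover the original nucleotides with one table lookup per site."""
    return "".join(TRON_TO_NT[t] for t in trons)


def expand_matrix(M):
    """Expand a 20 x 20 matrix M[(a, b)] to a 20 x 23 matrix S[(a, tron)].

    Assumption: 'O' and 'U' receive the global minimum of M.
    """
    lowest = min(M[a, b] for a in AMINO_ACIDS for b in AMINO_ACIDS)
    S = {}
    for a in AMINO_ACIDS:
        for b in AMINO_ACIDS:
            S[a, b] = M[a, b]
        S[a, "J"] = M[a, "S"]              # AGY serine scores as serine
        S[a, "O"] = S[a, "U"] = lowest     # termination codons
    return S


if __name__ == "__main__":
    dna = "ATGAGTTAA"                      # ATG AGT TAA
    trons = encode_trons(dna)
    print(trons)                           # NMUEJVLOK
    print(trons[1::3])                     # MJO: the frame starting at site 0
    print(decode_trons(trons) == dna)      # True
```

Because each tron sits at the middle nucleotide of its codon, taking every third tron starting at offset 1, 2, or 0 yields the conceptual translation in each of the three frames, while the 23-element table restores the nucleotide sequence.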
