Vol. 16 no. 3 2000 BIOINFORMATICS Pages 190–202

Homology-based gene structure prediction: simplified matching algorithm using a translated codon (tron) and improved accuracy by allowing for long gaps

Osamu Gotoh

Saitama Cancer Center Research Institute, 818 Komuro Ina-machi, Saitama 362-0806, Japan

Received on August 23, 1999; accepted on October 21, 1999

Abstract

Motivation: Locating protein-coding exons (CDSs) on a eukaryotic genomic DNA sequence is the initial and an essential step in predicting the functions of the genes embedded in that part of the genome. Accurate prediction of CDSs may be achieved by directly matching the DNA sequence with a known protein sequence or profile of a homologous family member(s).

Results: A new convention for encoding a DNA sequence into a series of 23 possible letters (translated codon or tron code) was devised to improve this type of analysis. Using this convention, a dynamic programming algorithm was developed to align a DNA sequence and a protein sequence or profile so that the spliced and translated sequence optimally matches the reference in the same way as in standard protein sequence alignment allowing for long gaps. The objective function also takes account of frameshift errors, coding potentials, and translational initiation, termination and splicing signals. This method was tested on Caenorhabditis elegans genes of known structures. The accuracy of prediction measured in terms of a correlation coefficient (CC) was about 95% at the nucleotide level for the 288 genes tested, and 97.0% for the 170 genes whose product and closest homologue share more than 30% identical amino acids. We also propose a strategy to improve the accuracy of prediction for a set of paralogous genes by means of iterative gene prediction and reconstruction of the reference profile derived from the predicted sequences.

Availability: The source codes for the program ‘aln’ written in ANSI-C and the test data will be available via anonymous FTP at ftp.genome.ad.jp/pub/genomenet/saitama-cc.

Contact: [email protected]

Introduction

Following the completion of genomic sequencing of the yeast Saccharomyces cerevisiae (Goffeau et al., 1996), nearly the complete structure of the nematode Caenorhabditis elegans genome has recently been reported (The C. elegans Sequencing Consortium, 1998). Sequencing projects in several eukaryotic genomes including the human genome are now in progress. Identification of the genes on these genomic sequences and inferring their functions are major themes of current computational genome analyses. One obstacle to gene identification is the fact that typical eukaryotic genes are segmented, and the prediction of precise exonic regions is still a challenging problem (Burge and Karlin, 1998; Claverie, 1997; Murakami and Takagi, 1998).

Most gene-identification methods rely on statistical features of coding and non-coding sequences, and signals around the beginnings and ends of transcription, translation, and splicing. Various algorithms including artificial neural networks (Uberbacher and Mural, 1991), hidden Markov models (Burge and Karlin, 1997), and discriminant analyses (Zhang, 1997) coupled with dynamic programming algorithms (Snyder and Stormo, 1995; Xu et al., 1994) have been used to capture specific signals and derive a final prediction by combining several lines of information. The best-performing programs currently available correctly predict 70–80% of exons (Burge and Karlin, 1998; Claverie, 1997; Murakami and Takagi, 1998). This level of accuracy might be sufficient for classification of the gene into a certain gene family, but is insufficient for some other purposes, such as predicting the structure of the encoded protein and evolutionary studies, since the chance of perfect prediction of the entire coding sequence declines exponentially with the number of exons in the gene. Several gene-finding methods incorporate the results of sequence similarity searches (Altschul et al., 1990)

© Oxford University Press 2000

to significantly improve the overall prediction accuracy (Burset and Guigó, 1996; Cai and Bork, 1998). To achieve still better predictions, however, more direct involvement of homology information appears to be necessary (Mironov et al., 1998). Homology information has been shown to be useful even for better identification of prokaryotic genes (Lolkema and Slotboom, 1998; Pearson et al., 1997). A handful of methods have been proposed to predict eukaryotic gene structures by direct matching of the genomic DNA sequence and a reference protein (Birney and Durbin, 1997; Gelfand et al., 1996; Huang and Zhang, 1996), or cDNA sequence (Florea et al., 1998; Mott, 1997). ‘Genewise’ from Birney and Durbin (1997) appears to be the most general of these methods, since it considers frameshift errors and accepts a protein profile as the reference, while other methods lack either of the features. The use of a protein profile or a profile hidden Markov model, rather than a single sequence, may improve the alignment accuracy, as has been repeatedly experienced for homology searches (Altschul et al., 1997; Park et al., 1998) and multiple sequence alignment (Gotoh, 1996). However, genewise does not necessarily predict the complete gene structure, and sometimes reports only fragmental accounts.

Upon investigating the members of some multi-gene families throughout the genome of C.elegans, I became aware that prediction-based CDSs annotated in public sequence databases, such as GenBank, might be wrong for up to nearly half of the genes (Gotoh, 1998). I reached this notion because multiple protein sequence alignments derived from such CDSs contained several structurally implausible gaps (insertions and deletions). I was able to show that the reassignment of exons greatly improved the quality of alignments (Gotoh, 1998). The principle of the prediction algorithm was straightforward; when the genomic sequence is spliced and translated, the conceptual protein sequence should optimally match the reference sequence or a ‘generalized profile’ as in the usual protein–protein or protein–profile alignment with an affine gap-penalty function. Our profile is more general than the usual one (Gribskov et al., 1987) in terms of rigorous treatment of internal gaps of various lengths and positions (Gotoh, 1994). Translational initiation, termination, and splicing signals as well as exonic coding potentials are also taken into account for the objective function to be optimized.

In this paper, I will show the details of the algorithm, which involves several novel features. First, a genomic sequence is encoded into translated codon (or tron) codes. The 23-letter codes can compactly encode the potential translated amino acid sequence without losing any information of the original nucleotide sequence. Second, a matching score for a codon interrupted by a phase-1 or phase-2 intron is rigorously evaluated, where phase-1 and phase-2 introns imply those located between the first and second, and the second and third nucleotides in a codon, respectively. Third, a special form of gap-penalty function that takes account of frameshift errors and long gaps is employed. Finally, an iterative strategy is developed to coherently improve the prediction accuracy for a set of paralogous genes within a genome. Empirical examinations with 291 C.elegans genes of experimentally identified exon–intron organizations indicated that our method accurately predicts exonic sequences with a correlation coefficient of about 97% at the nucleotide level, when the amino acid identities between the reference sequence and the objective gene product exceed the generally recognized range (twilight zone) of reliable global protein sequence alignment (Doolittle, 1981; Sander and Schneider, 1991).

Methods

Tron code

A remarkable feature of the universal genetic code is that the second nucleotide in a codon greatly affects its specificity (Crick, 1968). In fact, all codons for an amino acid have a unique nucleotide at the second position, except for Ser (TCN and AGY) and termination (TAR and TGA) codons. Thus, only 23 letters are necessary and sufficient to unambiguously encode both the original nucleotide sequence and conceptually translated amino acid sequences in the three frames. We propose to call each translated codon a ‘tron’ $\in \Sigma^{t}$, and express them by the standard one-letter amino acid codes $\in \Sigma^{a}$, except for ‘J’, ‘O’, and ‘U’, which are used to represent translated AGY, TAR, and TGA codons, respectively (Figure 1). Thus, $\Sigma^{t} = \{\Sigma^{a}, \text{‘J’}, \text{‘O’}, \text{‘U’}\}$. In a ‘tron sequence’, $\mathbf{b} = b_1 b_2 \ldots b_J$, the tron code substitutes for the second nucleotide of a triplet. The tron codes at the first and last sites in a sequence may be determined based on the arbitrary assumption that a fixed nucleotide, ‘A’, occupies the sites immediately before the first and immediately after the last sites of the original nucleotide sequence. A usual 20 × 20 amino acid exchange matrix, $M(a, b)$ $(a, b \in \Sigma^{a})$, such as PAM250 (Dayhoff et al., 1978) and JTT250 (Jones et al., 1992), is easily expanded to a 20 × 23 amino acid versus tron similarity matrix, $S(a, b)$ $(a \in \Sigma^{a}, b \in \Sigma^{t})$, such that $S(a, b) = M(a, b)$ if $b \in \Sigma^{a}$, $S(a, \text{‘J’}) = M(a, \text{‘S’})$, and $S(a, \text{‘O’}) = S(a, \text{‘U’}) = \min M(a, b)$. The matrix $S(a, b)$ facilitates immediate comparison of an amino acid or a profile vector with a translated codon. On the other hand, a single reference to a 23-element table immediately recovers the original nucleotide at a specific site.

Boundary signals and coding potential

We used the frequency of ‘ditrons’, i.e. neighboring trons every three sites, in three frames to estimate coding


Fig. 1. The tron codes. The proposed tron (translated codon) codes are shown in boldface letters together with arbitrarily chosen numeric codes (1–23).

potential. All frequency data were normalized with the corresponding reference frequencies obtained from the 27 Mb of C.elegans genomic sequence that was publicly available by August 1998. More specifically, the coding potential at a site $j$, $\varphi^{E}_{j}$, is calculated as:

$$\varphi^{E}_{j} = \varphi^{0}_{j} + \varphi^{-1}_{j-1} + \varphi^{+1}_{j+1} \qquad (1)$$

where $\varphi^{k}_{j} = \log\{ f^{k}(b_j, b_{j+3}) / f(b_j) f(b_{j+3}) \}$ for $k \in \{-1, 0, +1\}$, $f(b)$ is the relative frequency of tron $b$ in general genomic sequence, $f^{0}(a, b)$ is the relative frequency of ditron $(a, b)$ in ‘in-frame’ coding phase, and $f^{k}(a, b)$ ($k = -1$ and $+1$) are those in ‘out-of-frame’. To maintain consistency with a PAM matrix, a common logarithm (base 10) was used to calculate the score tables. This method is nearly the same as that which relies on phase-specific diamino usage, but is a little closer to the most popular methods based on hexamer frequencies or fifth-order Markov models (Fickett and Tung, 1992). The training set was obtained from 6298 CDSs described in the INV class of the GenBank database Release 84 (1994). More than three-quarters of the CDSs in the set were derived from species other than C.elegans. Since translated information is less species specific than the nucleotide sequence (Guigó and Fickett, 1995), and since a ditron table is considerably smaller than a hexamer table ($23^2/4^6 = 1/7.74$), our choice of the ditron method may be appropriate for the present analysis.

The signal strengths of each site as a potential translational start, stop, or 5′ or 3′ splicing boundary were evaluated by conditional probability matrices with a window size of 20 derived from first-order Markov models at the nucleotide sequence level (Salzberg, 1997; Zhang and Marr, 1993). Experimentally verified splicing donor and acceptor sites (8192 each) of C.elegans genes were taken from the homepage of the Sanger Center (URL: http://www.sanger.ac.uk/Projects/C_elegans). Translational start and stop sites were obtained from the CDSs described above. The conditional probabilities were normalized and logarithmically transformed as mentioned above to yield score tables.

Gap-penalty function and matching algorithms

A special form of gap-penalty function was adopted (Figure 2). An insertion or deletion (indel) of k nucleotides was penalized by a restricted affine function (Chao, 1999; Huang and Zhang, 1996) if k was a multiple of 3; otherwise an additional penalty was given to allow, but disfavor, potential frameshifts. Since a constant basal penalty was assigned to an indel longer than a specified length K, our gap-penalty function is a little more general than the most commonly used affine functions. The basic matching algorithm was an extension of the ‘long-gap algorithm’ proposed previously (Gotoh, 1990), but modified so as to not penalize terminal gaps (Sellers, 1979). The algorithm sketched in the Appendix runs in proportion to the product of the lengths of the sequences under comparison, despite the rather complicated form of the gap-penalty function. [The actual computation time is further reduced by use of the ‘cutting-corners approximation’ (Sankoff and Kruskal, 1983)]. It was assumed that an insertion of nucleotides occurs only


[Figure 2: plot of gap penalty and intron penalty (y-axis, 10 to −70) versus gap length in nt (x-axis, 0 to 30).]

Fig. 2. Gap penalties as a function of gap length. The basal form is a restricted affine function, whereas an extra penalty is imposed to a gap whose length is not divisible by three. The shaded area indicates the possible range of a penalty value given to a tentative intron.

at a codon boundary. Moreover, a match between an amino acid or a profile vector and an incomplete codon containing a single- or double-nucleotide deletion was scored solely for that deletion. Although a more elaborate scoring scheme might be possible (Hein, 1994; Peltola et al., 1986), our simple scheme is efficient and likely to perform as well as other alternative schemes (Pearson et al., 1997).

When the reference is a multiple sequence alignment containing internal gaps, the alignment was converted into a generalized profile as described previously (Gotoh, 1994). We used a very restricted version of a ‘candidate list’ algorithm (Gotoh, 1993), in which the maximal number of candidates retained at each iteration node is limited to three. Although this does not guarantee a rigorous optimal alignment, excessive rigor as to detailed alignment is unnecessary, since our major purpose is to determine the gene organization.

An insertion of nucleotides flanked by 5′ and 3′ splicing signals above a given threshold value is regarded as a potential intron, which is weighted by the sum of the flanking splicing signals (ψ5′ and ψ3′) plus a negative constant (intron penalty, −νI) irrespective of the length. At each boundary, the three coding frames were considered independently. Special care must be paid to phase-1 and phase-2 boundaries at which a codon is interrupted by an intron; i.e. the intron is spliced out, the recovered codon is translated, and then a matching score is calculated. A rigorous algorithm would require four variables corresponding to the four kinds of nucleotide at the first position of a phase-1 codon (see Appendix). Likewise, we need at least two variables corresponding to purine and pyrimidine at the third position in a phase-2 codon ($b_j$ in Figure 3(b)). Considering the balance between rigorousness and efficiency, we adopted a compromise algorithm which uses one, one, and two variables for phase-0, phase-1, and phase-2 boundaries, respectively. The accidental appearance of a premature termination codon, the most harmful consequence of using a primitive method that skips regeneration of an interrupted codon, is thus effectively avoided.

A restriction was imposed so that a potential exon must match the reference at least in part. Thus, even if the combined score of the coding potential and exon/intron boundary signals for part of a genomic sequence is positive, that part is not considered a potential exon if the entire region is an ‘insertion’ relative to the reference sequence or profile. This restriction was necessary to avoid excessive false positives, especially outside a real gene.

A conventional traceback procedure is still used, where linked lists are recorded and retrieved (Gotoh, 1990). If we are interested in only the gene structure, the number of stored records was roughly 1/20 of the scan space, and this can be easily accommodated in the main memory of most contemporary workstations. We met no trouble other than a single exception in calculating the test data. A preliminary implementation of the linear space traceback


[Figure 3 panel labels: (a) tron sequence, protein sequence; (b) phase, potential exon 1, intron, exon 2, 5′ to 3′.]

Fig. 3. Alignment paths arriving at a node (i, j). (a) At any node, seven paths (1–7) are considered for calculation of partial alignment scores (Appendix). (b) At a potential splicing acceptor site j − δ (δ ∈ {0, 1, 2}), additional paths from potential donor sites must also be considered.

algorithm (Myers and Miller, 1988) took nearly twice as long to execute as our standard method, and the difference was even greater when the scan space was restricted around the main diagonals. We are planning to develop a hybrid method, in which the linear space algorithm is used at initial phases until the expected storage requirement goes under some upper limit.

Test data and assessment of performance

Caenorhabditis elegans mRNAs containing complete or nearly complete CDSs were retrieved from GenBank Rel. 110 (December, 1998). After removal of identical sequences and alternative transcripts other than the longest one, the corresponding genes throughout the entire C.elegans genome in six chromosomes were identified, and the exon/intron structures were estimated through a ‘blastn’ search (Altschul et al., 1990) followed by sequence alignment (Gotoh, 1990). We were unable to identify complete structures for 8% (35 of 425) of the genes examined, presumably because the ‘chromosomal’ C.elegans genomic sequences (The C.elegans Sequencing Consortium, 1998) are still incomplete. Some parts of the genomic sequences are probably misassembled, since the corresponding mRNA sequences partially match opposite strands. Although several gene structures could be completed by reference to a daily updated version of the C.elegans genome database (URL: http://www.sanger.ac.uk/Projects/C_elegans), we only used the 390 genes identified in the chromosomal sequences. The gene lengths (from the translational initiation codon to the termination codon inclusive) varied from 279 to 35 175 bp and the number of exons in a gene ranged from 1 to 47. A typical gene consisted of 8.59 ± 4.95 exons of length 219 ± 218 nt and introns of length 399 ± 926 nt (Mean ± SD).

Protein sequences similar to each C.elegans sequence translated from an mRNA were searched for in the SWISS-PROT database Release 34 (October, 1996) with the ‘blastp’ program (Altschul et al., 1990). We screened homologues by three criteria: (i) the sequence is derived from an organism other than C.elegans, (ii) the blast probability is less than $10^{-3}$, and (iii) the sequence is shorter than 120% and longer than 80% of the length of the C.elegans protein. Of the 390 genes tested, 291 had at least one homologue that satisfied criteria (i)–(iii). When more than one homologue was found, the one with the least blast probability was used as the ‘reference protein’. In such a case, a multiple sequence alignment was constructed by the ‘prrp’ program (Gotoh, 1996) from the protein homologues. We finally obtained 260 alignments composed of at least two homologues that satisfied criteria (i)–(iii). Columns in each alignment consisting of more than G% (G = 50 by default) deletion characters were removed, and the trimmed alignment was converted into a ‘generalized profile’ (Gotoh, 1994), and then used as a ‘profile reference’.

In principle, it is possible but too demanding to predict the global structure of a gene without any preprocessing. Therefore, to test the performance of our method, we assumed that the location of an objective gene is known within a margin of M nucleotides, i.e. the region of the genomic sequence from the first nucleotide in the initiation codon −M to the last nucleotide of the termination codon +M was subjected to analysis. Unless otherwise specified, M = 100 was used throughout the tests.

Several measures of the accuracy of prediction, i.e. sensitivity (Sn), specificity (Sp), and correlation coefficient (CC) at the nucleotide level, were calculated as shown in Snyder and Stormo (1995, Table 6). Three test genes with no intron were omitted from the calculations. Since Sn and Sp were generally well balanced, we used a single value, Pb, to represent the percentage of correctly predicted exon/intron boundaries, which is the harmonic mean of sensitivity and specificity at the boundary level, and defined as 200 × number of correctly predicted boundaries / (number of real boundaries + number of predicted boundaries). A similar formula was used to


calculate the percentage of exactly predicted exons (Pe) (both ends are correct). Note that these accuracy measures could underestimate the real situation, since possibility of alternative splicing is totally neglected.

Coherent prediction of a set of paralogous gene structures

An iterative strategy was designed to predict exon/intron organizations of individual paralogous genes in a family even if no homologue in other species is known. The basic idea is similar to that of the ‘doubly nested randomized iterative strategy’ (DNR method) for multiple sequence alignment (Gotoh, 1996).

We start with a set of translated sequences that have been previously predicted, say by a statistical gene-finding method, and calculate their multiple sequence alignment by the DNR method. Using this alignment as the seed, structures of individual genes are re-examined in turn. Let $A^{0}_{n,i}$ $(1 \le n \le N,\ 1 \le i \le I)$ be the initial multiple alignment, where N is the number of sequences and I is the length of the alignment. To re-examine the gene structure corresponding to the mth sequence, we assign a weight of $w_n = C\,pw_{m,n}$ to each sequence $n \ne m$ and $w_m = 0$ to sequence m, where C is a normalization factor and $pw_{m,n}$ is the weight for the sequence pair $(m, n)$ in $A^{0}_{n,i}$ calculated by the three-way method (Gotoh, 1995). Optionally, columns predominantly composed of deletion characters are eliminated as described in the preceding section. After all of the N sequences are re-examined, a new alignment $A^{1}_{n,i}$ is constructed from the revised conceptual translation products. This process is repeated until no change in predicted gene organizations is observed.

Full automation of the above procedure was difficult in practice, since a smooth iterative cycle was readily disrupted by the presence of one or a few irregular gene sequences, and there are many causes of such irregularities. A perl script was written to perform a single cycle of re-examination, while outer processes were executed manually. While we relied on the conservation of translated sequences, information regarding intron insertion sites in paralogous genes was not explicitly considered throughout the procedures.

Systems

The program ‘aln’ was developed in ANSI-C and tested on a Sun Ultra-II workstation (300 MHz, 128 Mb main memory) under Solaris 2.5. Aln was originally designed to align a pair of nucleotide sequences or generalized profiles (Gotoh, 1990, 1994), but now accepts any combination of protein and nucleotide sequences or profiles. For gene-structure prediction, the inputs must be a DNA sequence and a protein sequence or multiple alignment, and a few specific options must be set. Several forms of output are selectable, including: (i) DNA versus protein sequence alignment, (ii) gene, predicted cDNA, or translated sequence with or without boundary information, and (iii) a GenBank-like format.

Implementation

Choice of parameter values

Our algorithm uses seven adjustable parameters (Table 1) besides the amino-acid substitution matrix [JTT250 (Jones et al., 1992) by default] and tables for coding potentials and boundary signals. The initial set of these parameter values was chosen according to the following considerations. The gap-penalties associated with protein sequence alignment (u, ν, and K in Table 1) were borrowed from those which best reproduced protein structural alignments (Gotoh, 1996). The factors that controlled the contributions of coding potential (fc) and boundary signals (fb) relative to amino-acid substitutions were respectively set to 1 and 16 by default, which roughly correspond to the inverse relative chance of assignment. We used a large value of 20 as the default for the extra penalty for a frameshift error on the assumption of high-quality genomic sequences. The last parameter, ‘intron penalty’ νI, was chosen experimentally to achieve the highest accuracy in gene-structure prediction. As shown in Figure 4(a) and (b), a broad optimum was observed around νI = 60 ∼ 70 for all of the measures of accuracy, and νI = 68 was used as the default.

Table 1. Default parameter values

Symbol  Default value  Meaning
u       2.0            Gap extension penalty per codon
ν       9.0            Gap opening penalty
K       21 (nt)        Minimum gap length of a constant penalty
x       20.0           Penalty for a frameshift
νI      68.0           Intron penalty
fc      1.0            Relative contribution of a coding potential
fb      16.0           Relative contribution of a boundary signal

The default value for K used in protein sequence alignment corresponds to 30 nucleotides. To search for possibly better restricted affine penalty functions, we examined several values for K under fixed u and ν values. The results showed that K = 21 (7 codons) was optimal. The performance with K = ∞, which corresponds to an affine function, was significantly worse than that with the restricted affine function (Figure 4(a), (b)), where the reduction in performance was mainly ascribed to an increase in the number of false negatives.

A significant reduction in accuracy was also observed (Figure 4(a), (b)) when the contribution of coding potential was disregarded (fc = 0). On the other hand, there


[Figure 4: panels (a)–(d). Axis labels: correlation coefficient (%) versus intron penalty for (a) and (b); margin outside CDS (bp) on the x-axis for (c) and (d).]

Fig. 4. Accuracy of our methods in predicting structures of C.elegans genes. (a) Accuracy of prediction measured in CC at the nucleotide level is plotted as a function of intron penalty νI for a subset of genes for which ID ≥ 30%. The four methods examined used either a restricted affine gap-penalty function (RA) or an affine gap-penalty function (AG) in combination with either value, 0 or 1, for the relative contribution of coding potential fc : RA and fc = 1(◦), RA and fc = 0(•), AG and fc = 1(), and AG and fc = 0(). (b) Same as (a) but all 288 genes are used for the examinations. (c) Dependence of CC (◦), Pb (♦), Pe (), 1 − Sp (), and 1 − Sn () on the length of marginal regions assessed for a subset of genes for which ID≥ 30%. (d) Same as (c) but all 288 genes are used for the examinations.
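The two gap-penalty forms compared in Fig. 4 (RA and AG) can be illustrated with a minimal sketch. The text specifies only that the penalty is restricted affine, that it becomes constant for gaps longer than K, and that gaps whose length is not a multiple of 3 receive an extra frameshift penalty; the exact formula below (opening penalty plus a per-codon extension, with partial codons rounded up) is an assumption made for illustration, as are the function names. Default values follow Table 1.

```python
import math

def gap_penalty(k, u=2.0, v=9.0, K=21, x=20.0):
    """Illustrative penalty (as a positive number) for an indel of k nucleotides.

    u: extension penalty per codon, v: opening penalty,
    K: gap length (nt) beyond which the penalty stays constant,
    x: extra penalty when k is not a multiple of 3 (potential frameshift).
    The closed form used here is an assumption, not the paper's stated formula.
    """
    if k <= 0:
        return 0.0
    capped = min(k, K)                 # restricted affine: no growth beyond K
    codons = math.ceil(capped / 3)     # assumption: partial codons round up
    penalty = v + u * codons
    if k % 3 != 0:
        penalty += x                   # allow, but disfavor, frameshifts
    return penalty

def intron_weight(psi5, psi3, nu_i=68.0):
    """Weight of a tentative intron: flanking splice signals minus the intron
    penalty, independent of intron length (any scaling by fb is omitted here)."""
    return psi5 + psi3 - nu_i

# Restricted affine (RA, K = 21) versus plain affine (AG, K = infinity)
ra_long = gap_penalty(300)               # constant once the gap exceeds K
ag_long = gap_penalty(300, K=math.inf)   # keeps growing with gap length
```

Under these assumptions a 300-nt in-frame gap costs the same as a 21-nt gap with RA, whereas with AG it is penalized far more heavily than a tentative intron, which is one way to read the loss of sensitivity reported for K = ∞.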

Table 2. Summary of tests of aln and other methods for structural prediction of C.elegans genes

ID      No. of genes  Method  CC (%)  Sp (%)  Sn (%)  Pb (%)  Pe (%)
100     385           AGfc0   99.99   99.99   100.00  99.65   99.29
100     385           RAfc1   99.61   99.41   99.98   99.43   98.98
100     218           GW      74.51   99.67   71.46   76.20   73.11
0–30    118           RAfc1   92.36   94.73   96.80   88.00   80.63
30–93   170           RAfc1   96.97   98.15   98.62   94.06   90.77
0–90    288           RAfc1   95.10   96.76   97.89   91.60   86.65
0–93    278           CESC    90.22   93.53   93.91   87.51   82.93
0–93    278           RAfc1   95.02   96.69   97.85   91.53   86.62
0–93    178           GW      74.35   94.33   77.02   60.86   45.02
0–93    178           RAfc1   96.57   97.63   98.52   93.33   89.55
0–93    257           Prof    95.16   96.57   98.00   91.51   86.58
0–93    257           RAfc1   95.09   96.79   97.79   91.78   86.90

The methods tested are RAfc1: aln with a restricted affine gap-penalty function and fc = 1 (the default parameter set, in boldface); AGfc0: aln with an affine gap-penalty function and fc = 0; GW: genewise (Birney and Durbin, 1997); CESC: taken from the dataset provided by The C. elegans Sequencing Consortium (1998); and Prof: aln with profile references.

was virtually no change in overall performance when the contribution of coding potential was doubled (fc = 2, data not shown).

The optimal value for fb varied in parallel with νI, and

[Figure 5: bar chart. Axes: accuracy (%) and number of genes versus amino acid identity (%).]

Fig. 5. Dependence of prediction accuracy on ID. CC (filled bars), Pb (shaded bars) and Pe (open bars) were evaluated for a subset of genes for which IDs are classified within specified ranges. The number of genes in each class is shown by a filled triangle. The same measures, CC (◦), Pb (), and Pe (♦), evaluated for a ‘cumulative’ subset of genes for which the IDs are greater than or within the specified range are also shown together with the number of genes involved (•).
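The measures plotted in Fig. 5 can be sketched as follows. Pb and Pe use the formula given in the Methods (200 × correct / (real + predicted)); for the nucleotide-level Sn, Sp, and CC the sketch assumes the definitions conventional in the gene-finding literature (Sn = TP/(TP+FN), Sp = TP/(TP+FP), and CC as the Pearson correlation of the 2 × 2 contingency table), which may differ in detail from Snyder and Stormo (1995, Table 6). Function names and the half-open (start, end) exon representation are illustrative choices.

```python
import math

def nucleotide_accuracy(real_exons, pred_exons, length):
    """Return (Sn, Sp, CC) at the nucleotide level.

    Exons are (start, end) half-open intervals on a region of `length` nt.
    """
    real = {p for s, e in real_exons for p in range(s, e)}
    pred = {p for s, e in pred_exons for p in range(s, e)}
    tp = len(real & pred)              # coding, predicted coding
    fp = len(pred - real)              # non-coding, predicted coding
    fn = len(real - pred)              # coding, predicted non-coding
    tn = length - tp - fp - fn         # non-coding, predicted non-coding
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    cc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, cc

def percent_correct(real_items, pred_items):
    """200 x correct / (real + predicted): Pb for boundaries, Pe for exons."""
    real, pred = set(real_items), set(pred_items)
    total = len(real) + len(pred)
    return 200.0 * len(real & pred) / total if total else 0.0

# Toy example: the second exon is predicted 2 nt too short at its 5' end.
real = [(10, 20), (30, 40)]
pred = [(10, 20), (32, 40)]
sn, sp, cc = nucleotide_accuracy(real, pred, length=50)
pb = percent_correct([b for ex in real for b in ex],
                     [b for ex in pred for b in ex])   # 75.0
pe = percent_correct(real, pred)                        # 50.0
```

In this toy case one misplaced boundary costs 25 points of Pb and 50 points of Pe while CC stays above 0.9, which mirrors the ordering CC > Pb > Pe seen throughout Table 2.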

good performance was obtained when these parameters satisfied the relation of νI = 3.4 fb + 15 within the range of 10 ≤ fb ≤ 20 (data not shown). Since fb = 16 and νI = 68 closely satisfy this relationship, our initial set of parameter values appears to be near-optimal.

Dependence of performance on sequence similarity

Since our method uses homology information, its performance could significantly depend on the degree of sequence similarity between the objective sequence and the reference. As expected, the measures of performance decline gradually with the percent of amino acid identity (ID) (Figure 5). When ID ≥ 30%, all of the measures (CC, Sp, Sn, Pb, and Pe in Table 2) exceed the corresponding values obtained by popular gene-finding methods applied to human genes (Burge and Karlin, 1998; Claverie, 1997; Murakami and Takagi, 1998). Even for ID < 30%, these measures are comparable to those achieved by the best methods such as GENSCAN (Burge and Karlin, 1997) and MZEF (Zhang, 1997), though direct comparison is difficult due to the species difference.

Although the above examination was performed with a single default set of parameters, better results were obtained if we used several parameter sets depending on similarity classes. For example, coding exons were almost perfectly identified upon self-comparison (ID = 100) with an affine gap penalty (K = ∞), whereas some false positives remained under the default conditions (Table 2). On the other hand, the use of weaker gap penalties gave better performance for distant object-reference pairs. However, further quantitative investigations were not performed to avoid excessive adaptation.

Figures 4(c) and (d) show the dependence of prediction accuracy on the uncertainty of gene coverage. Since the true gene boundaries (transcriptional initiation and termination sites) are rarely known, a marginal region considered here may consist of a 5′ or 3′ untranslated region and the flanking sequence. As more marginal regions are involved in the calculation, the overall performance of our method gradually declines. The increase in errors is largely attributed to the overestimation of exons, as seen at the bottom of Figure 4(c) or (d). Since the ends of a protein are generally prone to vary in sequence and length, the precise identification of translational initiation or termination sites based on sequence homology may be more difficult than identification of intron insertion sites. In addition, a marginal region may actually contain CDS regions of the neighboring gene, since intergenic regions in the C.elegans genome are relatively narrow (The C.elegans Sequencing Consortium, 1998). For better recognition of gene structures, sequence signals associated with the start and stop of transcription would have to be considered. This indicates a future direction for improving our method.


Performance with profile

Most (260 of 291) of the testable genes had more than one homologous protein sequence that passed the three criteria described in the Methods section. The performance of our ‘profile-version’ program was examined on a subset of genes with multiple homologues. The results with the default parameter set were rather disappointing because coding exons were significantly underestimated compared to those predicted with single-protein sequences. This underestimation was caused by negative contributions of non-conserved regions to the alignment score, and could be circumvented by adding a small value (0.5 or 1) to each element of the amino-acid substitution matrix. After this ad hoc adjustment, the predictive power of the profile method was indistinguishable from that with single reference sequences (Table 2). This situation did not change appreciably when we used three systems for weighting each member in a multiple alignment: (1) evenly, (2) according to prrp output, and (3) in proportion to the negative logarithm of blast probability.

Most exon/intron boundaries that were not correctly predicted by the protein method are located in non-conserved regions. Corresponding regions in the reference multiple alignment are also generally divergent among the members, and often have numerous gaps. Moreover, the members that comprise each reference profile could be only locally related, so that their global multiple alignment might be unreliable. These two points are probably the major reasons why the profile method did not greatly improve the prediction accuracy. More stringent quality control of the reference profile will be necessary to improve the performance at the expense of a reduced chance of application.

An example of the coherent prediction of paralogous gene structures

It is frequently observed that the closest relative to a gene is another gene in the same genome. Sonnhammer and Durbin (1997) reported more than 70 protein domain families that have at least 10 members encoded in the C.elegans genome. Most of these domains do not comprise whole proteins, and we are currently interested in determining the entire structure of a gene. Thus, the applicability of our method is currently limited to certain enzyme and receptor families.

As a demonstrative example, G-protein alpha subunit (Gα) genes were examined. The C.elegans genome appears to contain 20 Gα genes (Jansen et al., 1999). The genomic region covering gpa 16 was uncertain, and omitted from the analysis. The initial amino acid sequences for the remaining 19 genes were retrieved from the dataset published by The C.elegans Sequencing Consortium (1998). We call this dataset ‘CESC’, and the contents of this dataset and Wormpep16 largely overlap. The iterative procedure described in the Methods section converged rapidly at the third cycle; these three cycles of gene-structure prediction and multiple alignment took about 10 min on our machine. The sum-of-pairs and weighted sum-of-pairs scores for the multiple alignments were improved by 133.6 and 102.6 per pair, respectively, in the course of the iteration. The final multiple alignment is shown in Figure 6 with predicted sites of intron insertions. Six predicted gene structures (indicated by an asterisk at the end of each sequence in Figure 6) were exactly the same as the published structures (GenBank/EMBL/DDBJ Accession Nos: AB003486, M38249, M38250, M38251, U56864, and X53156). While this particular example was performed with an affine gap penalty, the results with a restricted affine gap penalty were nearly the same as those shown in Figure 6, except for minor variations in the translational initiation sites of a few genes.

Discussion

The accurate prediction of eukaryotic gene structures is by no means trivial, even if homologous protein or cDNA sequences are available. It is particularly difficult when the objective and reference sequences are derived from organisms in different phyla. Mironov et al. (1998) were the first to examine the performance of a gene-prediction method that was primarily dependent on sequence homology. They used a program called ‘Procrustes’ that uses a spliced alignment algorithm (Gelfand et al., 1996). We developed another homology-based gene-identification program, aln, and examined its performance on larger inter-phyla combinations of objects and references.

With respect to the overall accuracy in predicting coding exons, the results of the present examination are probably the best among all of the reports that have been published so far. One reason for this high accuracy might be the nature of the C.elegans genome we examined; most of the previous predictions have been made on vertebrate (mainly human) genes. We chose C.elegans genes for two reasons. First, since the entire genomic sequences are known, many complete genes are easily available. Second, calculations are economical because of the shorter average gene length compared to that of vertebrate genes. Despite their compact sizes, C.elegans genes are not necessarily easier to predict than vertebrate genes. In fact, the average accuracy of CESC entries does not seem to be much better than that attained by the best gene-finding methods for human genes, as exemplified by cytochrome P450 genes (Gotoh, 1998). Even for genes with known corresponding cDNA sequences, the results of the present method were better than CESC (Table 2), although each CESC entry was derived from various


Fig. 6. Multiple sequence alignment of 19 predicted C.elegans Gα proteins. The locations of potential intron insertion sites are indicated by arrows. Downward arrow: phase 0 intron; arrow toward the lower left: phase 1 intron; arrow toward the lower right: phase 2 intron. The structures of the genes marked by asterisks have been verified experimentally.

sources of information, including cDNA/EST sequences, homology, and statistical properties.

The aln algorithm is a straightforward extension of a sequence alignment algorithm that allows for long gaps (Gotoh, 1990), and thus has a simpler structure than that of the spliced alignment algorithm (Gelfand et al., 1996) implemented in Procrustes. Although the target functions to be optimized are nearly the same, Procrustes may run faster than aln, especially when the gene possesses long introns, since Procrustes filters out most potential non-coding regions before the major routine. On the other hand, aln carefully treats gaps corresponding to frameshift errors, long insertions/deletions, and ‘internal’ gaps in the reference profile. The observation that the prediction accuracy was significantly improved by restricted affine gap-penalty functions compared to more common affine functions indicates the superiority of the present algorithm. However, practical application of the present algorithm to vertebrate gene-identification problems requires a further reduction in computational time and space, which can be realized by the introduction of some pre-filtration processes.

Two other related programs are ‘nap’ (Huang and Zhang, 1996) and ‘genewise’ (Birney and Durbin, 1997). Both programs were compiled and run alongside aln on the same machine. Nap reports only the alignment of DNA and protein, and so automatic identification of exon-intron boundaries is difficult. Although nap and aln have similar overall architectures, nap does not consider any statistical properties of the genomic sequence or phases at the 5′ and 3′ ends of an intron. Since these are important factors for the correct identification of gene structures, nap is expected to show significantly lower performance than aln. Genewise (in Wise2 version 2.1.16b, protein model) often reports fragmental gene structures even though the ‘global’ option is used in combination with the ‘worm.gf’ gene-characterization file. Most prominently, a frameshift error easily induces such gene fragmentation. Multiple ‘genes’ or gene fragments were suggested for about one-third (107 of 291) of the test cases where a single gene is expected. Even if we restrict ourselves to uniquely predicted cases, the performance of genewise appears to be considerably inferior to that of aln, as shown in Table 2. Genewise ran about 67 (45) times slower than aln with (without) the cutting-corners approximation, which also prevented extensive comparison of the performances of genewise and aln. This large difference in execution rates is probably due to the fact that aln (and nap) optimizes a single score while genewise calculates probabilities of various states and state transitions.

The efficiency of aln is partly due to the use of tron codes (Figure 1). A measure of similarity between a tron code and an amino acid can be easily obtained as a usual measure between amino acids or between nucleotides. Since an amino acid sequence is much more conservative than a nucleotide sequence, tron codes may be useful not only for sequence alignment between a DNA and a protein but also between diverse DNAs, either genomic or complementary in any combination. Systematic inter-genomic comparisons will be a promising application of such approaches.

We found clear inter-phyla homologues for about three-quarters of the 390 C.elegans full-length cDNAs retrieved from the GenBank database. About two-thirds of these homologues were related to the C.elegans sequences more closely than the generally accepted limit, i.e. the ‘twilight zone’ (Doolittle, 1981; Sander and Schneider, 1991), above which reliable protein sequence alignment is attainable. When applied to this subset of genes, our method performed significantly better than the best available gene-finding programs. Thus, the present approach might be useful for about half of all genes to improve the reliability of predicted gene organizations. Since the full-length cDNA sequences in current databases may represent a biased subset of the whole genes, the chance of finding proper reference protein sequence(s) may be generally less than the above estimate. Nevertheless, this chance will increase with progress in ‘functional genomics’, in which a large number of full-length cDNA sequences are to be determined. This information can be used to identify related genes in other organisms and also weakly or temporally expressed paralogues in the same genome. Our approach will facilitate a better understanding of the structure and evolution of various genes on eukaryotic genomes to be sequenced in the near future.

Acknowledgements

This work was supported in part by a Grant-in-Aid for Scientific Research on Priority Areas, Genome Science, from the Ministry of Education, Science, Sports and Culture of Japan.

Appendix

We present here a dynamic programming algorithm for matching a genomic sequence, $b = b_1 b_2 \ldots b_J$, and a protein sequence, $a = a_1 a_2 \ldots a_I$. We assume that $b$ has been converted into tron codes. Let $H_{i,j}$ be the objective function to be optimized for the subsequences $b_1 b_2 \ldots b_j$ ($j \in [1, J]$) and $a_1 a_2 \ldots a_i$ ($i \in [1, I]$). $H^{\alpha}_{i,j}$ are subsidiary variables where a superscript $\alpha$ ($1 \le \alpha \le 7$) indicates the direction of the alignment path (Figure 3(a)). $H^{0}_{i,j}$ is used as an alias of $H_{i,j}$. We begin the following recursion relations with $H_{i,0} = 0$ for $i \in [0, I]$, $H_{0,j} = \operatorname{MAX}\{\psi^{I}_{j-1} + \varphi^{E}_{j-1},\; H_{0,j-1} + w(1),\; H_{0,j-2} + w(2),\; H_{0,j-3} - u + \varphi^{E}_{j-1}\}$ for $j \in [1, J]$ and $H_{0,j} = -\infty$ for $j < 0$, and $H^{\alpha}_{i,0} = H^{\alpha}_{0,j} = -\infty$ ($i \in [1, I]$, $j \in [1, J]$, $\alpha > 0$).

$$
\begin{aligned}
H^{1}_{i,j} &= \operatorname{MAX}\{H_{i,j-1} + w(1),\; H_{i,j-2} + w(2),\; H_{i,j-3} + w(3) + \varphi^{E}_{j-1},\; H^{1}_{i,j-3} - u + \varphi^{E}_{j-1}\} \\
H^{2}_{i,j} &= \operatorname{MAX}(H_{i,j-3} + w(K),\; H^{2}_{i,j-3}) + \varphi^{E}_{j-1} \\
H^{3}_{i,j} &= \operatorname{MAX}(H_{i-1,j} + w(3),\; H^{3}_{i-1,j} - u) \\
H^{4}_{i,j} &= \operatorname{MAX}(H_{i-1,j} + w(K),\; H^{4}_{i-1,j}) \\
H^{5}_{i,j} &= H_{i-1,j-1} + w(2) \\
H^{6}_{i,j} &= H_{i-1,j-2} + w(1) \\
H^{7}_{i,j} &= H_{i-1,j-3} + S(a_i, b_{j-1}) + \varphi^{E}_{j-1} \\
H^{0}_{i,j} &= \operatorname{MAX}_{\alpha = 1,7} H^{\alpha}_{i,j}
\end{aligned}
\tag{A.1}
$$

where $S(a, b)$ is the measure of similarity between an amino acid $a$ and a tron $b$, $w(k)$ is the gap-penalty function as depicted in Figure 2, and $\varphi^{E}_{j}$ denotes the coding potential associated with $b_j$. A translational initiation or termination signal, $\psi^{I}_{j}$ or $\psi^{T}_{j}$, is also added to $\varphi^{E}_{j}$ when appropriate, but omitted in equation (A.1). The final results are traced back from the point associated with $\operatorname{MAX}(H_{i,J}, H_{I,j})$ for $i \in [1, I]$ and $j \in [1, J]$, where $H_{I,j}$ is slightly modified to incorporate $\psi^{T}_{j}$ but not to assign a gap-open penalty.

If $j - \delta$ ($\delta = 0$, 1, or 2) is a potential splicing acceptor site at which the 3′ splicing signal $\psi^{3}_{j-\delta} > 0$, we need further considerations. Let us define $F^{\alpha,\delta}_{i,j}$ by

$$
F^{\alpha,\delta}_{i,j} = \operatorname{MAX}_{(h-\delta) \in \{5' < j\}} \left( H^{\alpha}_{i,h} + \psi^{5}_{h-\delta} \right)
\tag{A.2}
$$

and

$$
H^{\alpha}_{i,j} = \operatorname{MAX}\left( H^{\alpha}_{i,j},\; F^{\alpha,\delta}_{i,j} + \psi^{3}_{j-\delta} - \nu_{I} \right)
\tag{A.3}
$$

where the set $\{5' < j\}$ consists of potential splicing donor sites $l$ ($\psi^{5}_{l} > 0$) preceding $j$, and $\alpha \in \{0, 1, 2, 7\}$. $H^{\alpha}_{i,j}$ on the right-hand side of equation (A.3) is that obtained with equation (A.1). If the second operand of MAX is greater than the first, $j - \delta$ is regarded as a likely splicing acceptor site, and $H^{\alpha}_{i,j}$ is renewed. For most combinations of $\alpha$ and $\delta$, equation (A.2) can be simplified as

$$
F^{\alpha,\delta}_{i,j} = \operatorname{MAX}\left( H^{\alpha}_{i,h} + \psi^{5}_{h-\delta},\; F^{\alpha,\delta}_{i,h} \right)
\tag{A.4}
$$

In a computer program, $F^{\alpha,\delta}_{i,j}$ can be represented by a single variable for each $\delta$, and should be updated only just after $j - \delta \in \{5'\}$, which takes a constant number of operations. In two exceptional cases of $\alpha = 7$ and $\delta = 1$, and $\alpha = 7$ and $\delta = 2$,

$$
F^{7,1}_{i,j} = \operatorname{MAX}_{(h-1) \in \{5' < j\}} \left( H_{i-1,h-3} + S(a_i, [b_{h-2} b_{h-1} b_{j}]) + \psi^{5}_{h-1} \right)
\tag{A.5}
$$

$$
F^{7,2}_{i,j} = \operatorname{MAX}_{(h-2) \in \{5' < j\}} \left( H_{i-1,h-3} + S(a_i, [b_{h-2} b_{j-1} b_{j}]) + \psi^{5}_{h-2} \right)
\tag{A.6}
$$

where the triplet in brackets should be returned to nucleotides and then translated according to the genetic code. To maintain the maximum value for $F^{7,2}_{i,j}$, we must prepare four variables, rather than just one as in equation (A.4), depending on the type of nucleotide at $b_{h-2}$ (Figure 3(b)). Likewise, for the maximal $F^{7,1}_{i,j}$, we need two variables that correspond to the two possibilities of $b_j$ being purine or pyrimidine (Figure 3(b)) according to the specific feature of the universal genetic code (Figure 1).

References

Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410.
Altschul,S.F., Madden,T.L., Schäffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucl. Acids Res., 25, 3389–3402.
Birney,E. and Durbin,R. (1997) Dynamite: a flexible code generating language for dynamic programming methods used in sequence comparison. ISMB, 5, 56–64.
Burge,C. and Karlin,S. (1997) Prediction of complete gene structures in human genomic DNA. J. Mol. Biol., 268, 78–94.
Burge,C.B. and Karlin,S. (1998) Finding the genes in genomic DNA. Curr. Opin. Struct. Biol., 8, 346–354.
Burset,M. and Guigó,R. (1996) Evaluation of gene structure prediction programs. Genomics, 34, 353–367.
Cai,Y. and Bork,P. (1998) Homology-based gene prediction using neural nets. Anal. Biochem., 265, 269–274.
Chao,K.-M. (1999) Calign: aligning sequences with restricted affine gap penalties. Bioinformatics, 15, 298–304.
Claverie,J.-M. (1997) Computational methods for the identification of genes in vertebrate genomic sequences. Hum. Mol. Genet., 6, 1735–1744.
Crick,F.H.C. (1968) The origin of the genetic code. J. Mol. Biol., 38, 367–379.
Dayhoff,M.O., Schwartz,R.M. and Orcutt,B.C. (1978) A model of evolutionary change in proteins. In Dayhoff,M.O. (ed.), Atlas of Protein Sequence and Structure. Vol. 5, Suppl. 3, National Biomedical Research Foundation, Washington, D.C., pp. 345–352.
Doolittle,R.F. (1981) Similar amino acid sequences: chance or common ancestry? Science, 214, 149–159.
Fickett,J.W. and Tung,C.-S. (1992) Assessment of protein coding measures. Nucl. Acids Res., 20, 6441–6450.


Florea,L., Hartzell,G., Zhang,Z., Rubin,G.M. and Miller,W. (1998) A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Res., 8, 967–974.
Gelfand,M.S., Mironov,A.A. and Pevzner,P.A. (1996) Gene recognition via spliced sequence alignment. Proc. Natl Acad. Sci. USA, 93, 9061–9066.
Goffeau,A., Barrell,B.G., Bussey,H., Davis,R.W., Dujon,B., Feldmann,H., Galibert,F., Hoheisel,J.D., Jacq,C., Johnston,M., Louis,E.J., Mewes,H.W., Murakami,Y., Philippsen,P., Tettelin,H. and Oliver,S.G. (1996) Life with 6000 genes. Science, 274, 546–567.
Gotoh,O. (1990) Optimal sequence alignment allowing for long gaps. Bull. Math. Biol., 52, 359–373.
Gotoh,O. (1993) Optimal alignment between groups of sequences and its application to multiple sequence alignment. Comput. Appl. Biosci., 9, 361–370.
Gotoh,O. (1994) Further improvement in methods of group-to-group sequence alignment with generalized profile operations. Comput. Appl. Biosci., 10, 379–387.
Gotoh,O. (1995) A weighting system and algorithm for aligning many phylogenetically related sequences. Comput. Appl. Biosci., 11, 543–551.
Gotoh,O. (1996) Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments. J. Mol. Biol., 264, 823–838.
Gotoh,O. (1998) Divergent structures of Caenorhabditis elegans cytochrome P450 genes suggest the frequent loss and gain of introns during the evolution of nematodes. Mol. Biol. Evol., 15, 1447–1459.
Gribskov,M., McLachlan,A.D. and Eisenberg,D. (1987) Profile analysis: detection of distantly related proteins. Proc. Natl Acad. Sci. USA, 84, 4355–4358.
Guigó,R. and Fickett,J.W. (1995) Distinctive sequence features in protein coding genic non-coding, and intergenic human DNA. J. Mol. Biol., 253, 51–60.
Hein,J. (1994) An algorithm combining DNA and protein alignment. J. Theor. Biol., 167, 169–174.
Huang,X. and Zhang,J. (1996) Methods for comparing a DNA sequence with a protein sequence. Comput. Appl. Biosci., 12, 497–506.
Jansen,G., Thijssen,K.L., Werner,P., van der Horst,M., Hazendonk,E. and Plasterk,R.H.A. (1999) The complete family of genes encoding G proteins of Caenorhabditis elegans. Nature Genet., 21, 414–419.
Jones,D.T., Taylor,W.R. and Thornton,J.M. (1992) The rapid generation of mutation data matrices from protein sequences. Comput. Appl. Biosci., 8, 275–282.
Lolkema,J.S. and Slotboom,D.-J. (1998) Hydropathy profile alignment: a tool to search for structural homologues of membrane proteins. FEMS Microbiol. Rev., 22, 305–322.
Mironov,A.A., Roytberg,M.A., Pevzner,P.A. and Gelfand,M.S. (1998) Performance-guarantee gene predictions via spliced alignment. Genomics, 51, 332–339.
Mott,R. (1997) EST GENOME: a program to align spliced DNA sequences to unspliced genomic DNA. Comput. Appl. Biosci., 13, 477–478.
Murakami,K. and Takagi,T. (1998) Gene recognition by combination of several gene-finding programs. Bioinformatics, 14, 665–675.
Myers,E.W. and Miller,W. (1988) Optimal alignments in linear space. Comput. Appl. Biosci., 4, 11–17.
Park,J., Karplus,K., Barrett,C., Hughey,R., Haussler,D., Hubbard,T. and Chothia,C. (1998) Sequence comparison using multiple sequences detect three times as many remote homologues as pairwise methods. J. Mol. Biol., 284, 1201–1210.
Pearson,W.R., Wood,T., Zhang,Z. and Miller,W. (1997) Comparison of DNA sequences with protein sequences. Genomics, 46, 24–36.
Peltola,H., Söderlund,H. and Ukkonen,E. (1986) Algorithms for the search of amino acid patterns in nucleic acid sequences. Nucl. Acids Res., 14, 99–107.
Salzberg,S.L. (1997) A method for identifying splice sites and translational start sites in eukaryotic mRNA. Comput. Appl. Biosci., 13, 365–376.
Sander,C. and Schneider,R. (1991) Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins, 9, 56–68.
Sankoff,D. and Kruskal,J.B. (1983) Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley, New York.
Sellers,P.H. (1979) Pattern recognition in genetic sequences. Proc. Natl Acad. Sci. USA, 76, 3041.
Snyder,E.E. and Stormo,G.D. (1995) Identification of protein coding regions in genomic DNA. J. Mol. Biol., 248, 1–18.
Sonnhammer,E.L.L. and Durbin,R. (1997) Analysis of protein domain families in Caenorhabditis elegans. Genomics, 46, 200–216.
The C. elegans Sequencing Consortium (1998) Genome sequence of the nematode C. elegans: a platform for investigating biology. Science, 282, 2012–2018.
Uberbacher,E.C. and Mural,R.J. (1991) Locating protein-coding regions in human DNA sequences by a multiple sensor-neural network approach. Proc. Natl Acad. Sci. USA, 88, 11261–11265.
Xu,Y., Mural,R.J. and Uberbacher,E.C. (1994) Constructing gene models from accurately predicted exons: an application of dynamic programming. Comput. Appl. Biosci., 10, 613–623.
Zhang,M.Q. (1997) Identification of protein coding regions in the human genome by quadratic discriminant analysis. Proc. Natl Acad. Sci. USA, 94, 565–568.
Zhang,M.Q. and Marr,T.G. (1993) A weight array method for splicing signal analysis. Comput. Appl. Biosci., 9, 499–509.
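
Supplementary sketch. The Appendix states that the maximum over preceding donor sites in equation (A.2) can be kept in a single running variable, as in equation (A.4), so that each position costs a constant number of operations. The Python sketch below illustrates only that bookkeeping for one row i, one direction α, and δ = 0. The function names, the list-based inputs, and the restriction to δ = 0 are assumptions made for illustration; this is not code from aln, and it leaves out the acceptor-site update of equation (A.3) and the exceptional cases (A.5) and (A.6).

```python
NEG_INF = float("-inf")


def donor_max_bruteforce(h_row, psi5):
    """Direct form of (A.2) with delta = 0 (hypothetical helper).

    f[j] is the maximum of h_row[h] + psi5[h] over all donor sites h < j,
    where h counts as a donor site when psi5[h] > 0.
    Costs O(J^2) for a row of length J.
    """
    n = len(h_row)
    f = [NEG_INF] * n
    for j in range(n):
        for h in range(j):
            if psi5[h] > 0:
                f[j] = max(f[j], h_row[h] + psi5[h])
    return f


def donor_max_running(h_row, psi5):
    """Running-maximum form in the spirit of (A.4) (hypothetical helper).

    A single variable `best` is updated only just after a donor site is
    passed, so the whole row costs O(J).
    """
    f = []
    best = NEG_INF
    for j in range(len(h_row)):
        f.append(best)  # best over donor sites strictly before j
        if psi5[j] > 0:  # j is a potential donor site
            best = max(best, h_row[j] + psi5[j])
    return f
```

For example, with `h_row = [0.0, 2.0, 1.0, 5.0, 3.0]` and `psi5 = [0.0, 1.5, 0.0, 0.5, 0.0]`, both functions return the same list; the running form simply avoids rescanning earlier donor sites at every position.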
