How Does Eukaryotic Gene Prediction Work?

PRIMER How does eukaryotic gene prediction work? Michael R Brent Computational prediction of gene structure is crucial for interpreting genomic sequences. But how do the algorithms involved work and how accurate are they? ene-prediction programs are used primar- in accuracy using one or more additional and can now predict a perfect open reading Gily to annotate large, contiguous sequences genomes (see below). Expression-based systems frame for more than one-third of known pro- generated by whole-genome sequencing. Most work by aligning cDNA or protein sequences tein-encoding human genes4. In more com- programs used for this purpose aim to predict to one or more genome locations based on pact genomes, exact ORF accuracy can reach the complete exon-intron structures of the sequence similarity. Some expression-based 60–70%. In general, accuracy increases as http://www.nature.com/naturebiotechnology protein-encoding portions of transcripts (open systems are designed to output only alignments the number and sizes of introns in a genome reading frames or ORFs). Some programs also of expressed sequences to the loci from which decrease. Some systems can now use multiple predict 5′ untranslated regions, and a few pre- they were transcribed (native or cis alignment); informants, but results so far indicate rapidly dict only the boundaries of isolated exons. other systems include alignments of expressed diminishing returns as the number of infor- The resulting predictions are often distributed sequences transcribed from other loci or even mants increases4. through web portals such as the University of other species (trans alignment). The greatest limitation of GENSCAN was California Santa Cruz genome browser (http:// Expression-based systems that use only that it predicted too many genes (~45,000 in genome.ucsc.edu/), where users can examine native alignments tend to produce exon-intron human) and exons (~315,000 in human), many and compare predictions in regions of inter- structures that are quite accurate. Their primary of which were false positives. For comparison, est. If whole-genome prediction sets are not limitation is that there are many genomes for today’s best estimates place the number of available for a sequence in question, users can which little or no cDNA sequence is avail- human protein-coding genes at 20,000–21,000 Nature Publishing Group Group Nature Publishing 7 submit genomic sequences to online gene-pre- able. Even fairly deep sequencing of randomly (Michele Clamp, personal communication). diction servers such as the Twinscan/N-SCAN selected cDNA clones fails to elucidate the The best dual-genome predictors have nearly 200 1 © server (http://mblab.wustl.edu/nscan/) . Gene structures of the 20–40% of genes in a typical eliminated this problem, but fragments of pro- predictions can also be used as a springboard eukaryotic genome that are expressed only at cessed pseudogenes can still show up as false- to obtaining more direct evidence of gene low levels or under rare conditions. Including positive exons. Such fragments can now be structures through high-throughput reverse trans alignments in an annotation increases its eliminated by integrating dual-genome de novo transcription (RT)-PCR and sequencing using sensitivity, but the accuracy of trans alignments predictors with automated pseudogene detec- primers designed on the basis of gene predic- depends on the degree of similarity between the tors6. In the past, the number of predicted genes tions. expressed sequence and the locus to which it is was also inflated by the fact that some programs aligned. For loci that cannot be annotated by a could not process entire chromosomes at once. What are the major approaches to gene high identity cDNA or protein alignment, de Splitting chromosomes before processing would prediction? novo systems—which are the primary focus of split genes that crossed boundaries The best Gene-prediction programs can be broadly this tutorial—can provide more accurate pre- dual-genome predictors have nearly eliminated divided into those whose only inputs are the dictions. this problem, but fragments of processed pseu- sequences of one or more genomes (de novo dogenes can still show up as false-positive exons. predictors) and those that also consider cDNA How well does de novo gene prediction Such fragments can now be eliminated by inte- sequences and/or their predicted translations work? grating dual-genome de novo predictors with (expression-based predictors). Originally, de The accuracy of de novo gene prediction has automated pseudogene detectors5. In the past, novo predictors used only the genome sequence improved steadily since the introduction of the number of predicted genes was also inflated to be annotated (the target sequence), but the GENSCAN2 ten years ago. GENSCAN correctly by the inability of some programs to process past five years have witnessed substantial gains predicts an ORF at ~10% of human gene loci an entire chromosome at once. Disassembling that contain a known ORF (gene sensitivity). chromosomes before processing would split Michael Brent is at the Center for Genome The next major improvement came in 2001, genes that crossed boundaries. This problem Sciences, Campus Box 1045, Washington with the advent of dual-genome de novo pre- was recently solved by a new, memory-efficient University, One Brookings Drive, Saint Louis, dictors such as TWINSCAN3. Dual-genome algorithm that eliminates the need for separate Missouri 63130, USA. predictors use alignments between the target analysis of chromosome fragments6. With e-mail: [email protected] genome and a related genome (the informant), pseudogene detection and whole-chromosome NATURE BIOTECHNOLOGY VOLUME 25 NUMBER 8 AUGUST 2007 883 PRIMER inputs, the number of human genes predicted How do de novo gene predictors work? followed by “s,” rare examples notwithstand- by the N-SCAN system fell to 20,138—remark- De novo gene predictors are based on some ing (Fig. 1a). An alternative hypothesis that ably close to the current estimates cited above, variant of hidden Markov models (HMMs)8. solves this problem is that the sender intended although, of course, predicting the right num- To get an intuition for HMMs, imagine that you to type “hot.” However, typing “o” for “s” is an ber of genes does not mean that all the predicted found a scrap of paper containing the typed let- unlikely error, as “o” is nowhere near “s” on the genes are correct! ters “hst” and suppose that these letters are a keyboard. A more likely hypothesis is that the A detailed analysis of the accuracy of various fragment of a message written in English. It is sender intended to type “hat”: “h” is frequently gene-prediction programs spanning the entire not likely that the sender intended to type “hst,” followed by “a” in correctly spelled English range of methods can be found in reference 7. as relatively few English words contain an “h” words, and “a” is frequently mistyped as “s”. This situation can be modeled by an HMM. In HMM terminology, the intended letters are a Unlikely Unlikely More likely called hidden states, or simply states, and the hypothesis hypothesis hypothesis letters on the paper are called observations. For ? each state, the HMM specifies the probabil- Intention h s t h o t h a t ity that it will be followed by each other state (transition probabilities) and the probability ? that it will result in each possible observed let- Outcome h s t h s t h s t ter (emission probabilities). For any particular HMM and any particular sequence of obser- b Less likely hypothesis More likely hypothesis vations, the Viterbi algorithm can be used to efficiently find the most likely sequence of hid- Intron Intron Intron Intron Exon Exon den states. middle end start middle One can imagine a simple application of http://www.nature.com/naturebiotechnology HMMs to de novo gene prediction in which GTAAGA TAGATACTTGTCTGC... GTAAGA TAGATACTTGTCTGC... ...CTC ...CTC the observations are nucleotides of the target c Intron end Intron start sequence and the hidden states are the functions – 6 – 5 – 4 – 3 – 2 –1 1 2 3 4 5 6 they serve in RNA processing and translation, such as the first and second base of an intron (splice donor), a base in the middle region of an intron or the first, second and third base of a codon. Most gene predictors use a more elabo- rate type of HMM called a generalized hidden Markov model (GHMM). The observation cor- d responding to each state of a GHMM may be a Nature Publishing Group Group Nature Publishing Less likely hypothesis More likely hypothesis 7 DNA sequence of any length, whereas in ordi- nary HMMs, the observation is always a single 200 Intron © Exon nucleotide. The states correspond to functions middle such as coding exon, splice donor region of an intron, middle region of an intron and so TAGATACTTGTCTGCCACCCTC... TAGATACTTGTCTGCCACCCTC... on (Fig. 1b). As the input DNA sequence does ||| || || ||||| || ||| ||| || || ||||| || ||| not include the segment boundaries shown in TAGGTATTTATCTGCTACTCTC... TAGGTATTTATCTGCTACTCTC... Figure 1b, the Viterbi algorithm for GHMMs must consider the most likely segmentation e Less likely hypothesis More likely hypothesis of the DNA sequence as well as the most likely state sequence for each segmentation. Intron Exon For each state, there is a model that defines middle the probability of each possible observation string. For example, consider the first six TAGATACTTGTCTGCCACCCTC... TAGATACTTGTCTGCCACCCTC... nucleotides of an intron (splice donor region).

Load more