"An Overview of Gene Identification: Approaches, Strategies, and Considerations"

An Overview of Gene Identification: UNIT 4.1 Approaches, Strategies, and Considerations Modern biology has officially ushered in a new era with the completion of the sequencing of the human genome in April 2003. While often erroneously called the “post-genome” era, this milestone truly marks the beginning of the “genome era,” a time in which the availability of sequence data for many genomes will have a significant effect on how science is performed in the 21st century. While complete human sequence data is now available at an overall accuracy of 99.99%, the mere availability of all of these As, Cs, Ts, and Gs still does not answer some of the basic questions regarding the human genome—how many genes actually comprise the genome, how many of these genes code for multiple gene products, and where those genes actually lie along the complement of human chromosomes. Current estimates, based on preliminary analyses of the draft sequence, place the number of human genes at ∼30,000 (International Human Genome Sequencing Consortium, 2001). This number is in stark contrast to previously-suggested estimates, which had ranged as high as 140,000. A number that is in the 30,000 range brings into question the one-gene, one-protein hypothesis, underscoring the importance of processes such as alternative splicing in the generation of multiple gene products from a single gene. Finding all of the genes and the positions of those genes within the human genome sequence—and in other model organism genome sequences as well—requires the development and application of robust computational methods, some of which are listed in Table 4.1.1. These methods provide the first, best guess not only of the number and position of all genes, but of the structure of each individual gene as well. These predictions, brought under the banner of “sequence-based annotation,” help to increase the intrinsic value of genome sequence data found within the public databases. REMEMBERING BIOLOGY IN DEDUCING GENE STRUCTURE In considering the problem of gene identification, it is important to briefly review the basic biology underlying what will become, in essence, a mathematical problem (Fig. 4.1.1). At the DNA level, upstream of a given eukaryotic gene there are promoters and other regulatory elements that control the transcription of that gene. The gene itself is discontinuous, being comprised of both introns and exons. Once this stretch of DNA is transcribed into an RNA molecule, both ends of the RNA are modified, with the 5′ end being capped and a poly(A) signal being placed at the 3′ end. The RNA molecule reaches maturity when the introns are spliced out, based on short consensus sequences found both at the intron-exon boundaries and within the introns themselves. Once splicing has occurred and the start and stop codons have been established, the mature mRNA is transported through a nuclear pore into the cytoplasm, at which point translation can take place. While the process of moving from DNA to protein is obviously more complex in eukaryotes than in prokaryotes, the mere fact that it can be described in its entirety in eukaryotes would lead one to believe that predictions can confidently be made as to the exact positions of introns and exons. Unfortunately, the signals that control the process of moving from the DNA level to the protein level are not very well defined, precluding their use as foolproof indicators of gene structure. For example, upwards of 70% of promoter regions contain a TATA box, but because the remainder do not, the presence (or absence) of a TATA box in and of itself cannot be used to assess whether a region is a Finding Genes Contributed by Andreas D. Baxevanis 4.1.1 Current Protocols in Bioinformatics (2004) 4.1.1-4.1.9 Copyright © 2004 by John Wiley & Sons, Inc. Supplement 6 Table 4.1.1 Web Sites for Common Gene Finding Programs Web site URL Banbury Cross http://igs-server.cnrs-mrs.fr/igs/banbury The Encyclopedia of DNA elements (ENCODE) http://www.genome.gov/encode FGENES http://genomic.sanger.ac.uk/gf/gf.shtml FirstEF http://rulai.cshl.org/tools/FirstEF geneid (UNIT 4.3) http://www1.imim.es/geneid.html GeneMachine http://research.nhgri.nih.gov/genemachine/ GeneMark (UNITS 4.5 & 4.6) http://opal.biology.gatech.edu/GeneMark/ GeneParser http://beagle.colorado.edu/~eesnyder/GeneParser.html GENSCAN http://genes.mit.edu/GENSCAN.html Genotator http://www.fruitfly.org/~nomi/genotator/ GlimmerM (UNIT 4.4) http://www.tigr.org/software/glimmerm/ GRAIL (UNIT 4.9) http://compbio.ornl.gov/tools/index.shtml GRAIL-EXP (UNIT 4.9) http://compbio.ornl.gov/grailexp/ HMMgene http://www.cbs.dtu.dk/services/HMMgene/ MZEF (UNIT 4.2) http://www.cshl.org/genefinder PROCRUSTES http://www-hto.usc.edu/software/procrustes RepeatMasker (UNIT 4.10) http://ftp.genome.washington.edu/RM/RepeatMasker.html Sputnik http://rast.abajian.com/sputnik/http://genes.cse.wustl.edu promoter exon 1 intron 1 exon 2 intron 2 exon 3 intron 3 exon 4 DNA transcription RNA end modification cap 5' polyA GU AG GU AG GU AG splicing start stop mature cap polyA mRNA nucleus cytoplasm translation polyA Figure 4.1.1 The central dogma of molecular biology. Proceeding from the DNA through the RNA to the protein level, various sequence features and modifications can be identified that can be used An Overview of in the computational deduction of gene structure. These include the presence of promoter and Gene regulatory regions, intron-exon boundaries, and both start and stop signals. Unfortunately, these Identification signals are not always present, and when present may not always be in the same form or context. The reader is referred to the text for greater detail. 4.1.2 Supplement 6 Current Protocols in Bioinformatics promoter. Similarly, during end modification, the poly(A) tail may be present or absent, or may not contain the canonical AATAAA. Adding to these complications is the fact that an open reading frame is required but is not sufficient to identify a region as an exon. Given these and other considerations, there is at present no straightforward method that will allow 100% confidence in the prediction of introns or exons. It is true, however, that the availability of finished genome sequences from a variety of organisms has led to a better understanding of gene structure and is, in turn, leading to the development of better methods for identifying genes in what is, essentially, “anonymous” DNA. CATEGORIZING THE METHODS Briefly, gene-finding strategies can be grouped into three major categories. Content-based methods rely on the overall, bulk properties of a sequence in making their determinations. Characteristics considered here include the frequency at which particular codons are used, the periodicity of repeats, and the compositional complexity of the sequence. Because different organisms use synonymous codons with different frequency, such clues can provide insight to help determine which regions are more likely to be exons. In site-based methods, by contrast, the focus is on the presence or absence of a specific sequence, pattern, or consensus. These methods are used to detect features such as donor and acceptor splice sites, binding sites for transcription factors, poly(A) tracts, and start and stop codons. Finally, comparative methods make determinations based on sequence homology. Here, translated sequences are subjected to database searches against protein sequences (e.g., BLASTX; UNIT 3.4) in order to see whether a previously characterized coding region corresponds to a region in the query sequence. While this is conceptually the most straightforward of the three approaches, it is restrictive in that most newly discovered genes do not have gene products that match anything in the protein databases. Also, the modular nature of proteins and the fact that there are only a limited number of protein motifs (Chothia and Lesk, 1986) makes it difficult to predict anything more than just exonic regions in this way. The reader is referred to a number of excellent reviews detailing the theoretical underpinnings of these various classes of methods (Claverie, 1997a,b, 1998; Guigó, 1997; Snyder and Stormo, 1997; Stormo, 2000; Rogic et al., 2001). While many of the gene prediction methods belong strictly to one of these three classes of methods, most of those that will be discussed in this chapter combine the strength of different classes of methods in order to optimize their predictions. HOW WELL DO THE METHODS WORK? Given the complexity of the problem at hand and the range of approaches for tackling it, it is important for investigators to appreciate when and how each particular method should be applied. A recurring theme in this chapter will be the fact that each method will perform differently depending on the nature of the data. Put another way, while one method may be best for human finished sequence, another may be better for sequences (whether they be finished or unfinished) from another organism. The reader will also notice that the various methods in this chapter produce different types of results—in some cases, lists of putative exons are returned but these exons are not in a genomic context; in other cases, complete gene structures are predicted, but possibly at a cost of less reliable individual exon predictions. Returning to the cautionary note that different methods will perform better or worse, depending on the system being examined, it becomes important to be able to quantify the performance of each of these algorithms. Several studies have systematically examined the rigor of these methods using a variety of test data sets (Burset and Guigó, 1996; Claverie, 1997a; Snyder and Stormo, 1997; Stormo, 2000; Rogic et al., 2001).

"An Overview of Gene Identification: Approaches, Strategies, and Considerations"

The New Science of Metagenomics: Revealing the Secrets of Our Microbial Planet Is Available from the National Academies Press, 500 Fifth Street, NW, Washington, D.C

Genomics and Its Impact on Science and Society: the Human Genome Project and Beyond

Gene Prediction and Genome Annotation

The Human Genome Project Focus of the Human Genome Project

Gene Structure Prediction

From the Human Genome Project to Genomic Medicine a Journey to Advance Human Health

A Curated Benchmark of Enhancer-Gene Interactions for Evaluating Enhancer-Target Gene Prediction Methods

Comparative Genome Evolution Working Group

There Is a Lot of Research on Gene Prediction Methods

Gene Prediction Using Deep Learning

The Genome Project–Write Issues, Enabling Inclusive Decision-Making on the Topics Mentioned Above (7)

Prediction of Protein-Protein Interactions and Essential Genes Through Data Integration