Downloaded from genome.cshlp.org on October 1, 2021 - Published by Cold Spring Harbor Laboratory Press

REVIEW

Eukaryotic Promoter Recognition

James W. Fickett1,3 and Artemis G. Hatzigeorgiou2

1Bioinformatics, SmithKline Beecham Pharmaceuticals, King of Prussia, Pennsylvania 19406; 2Synaptic Ltd., 13671 Acharnai, Greece

3Corresponding author. E-MAIL [email protected]; FAX (610) 270-5580.

7:861–878 ©1997 by Cold Spring Harbor Laboratory Press ISSN 1054-9803/97 $5.00 GENOME RESEARCH

Computational analysis of polymerase II (Pol II) promoters may contribute to improved gene identification and to prediction of the expression context of genes. Before assessing the state of computational promoter recognition per se in the main body of this review, we will provide a context by giving a brief overview of these two problems.

Partitioning a Genome into Genes

Only recently has it become common to determine eukaryotic genomic sequences large enough to contain several genes. With these data comes a new problem for gene finding programs: to partition a set of exons correctly among several genes.

One line of development in eukaryotic gene identification begins with coding region identification by statistical means and adds pattern recognition for sites of transcriptional, splicing, and translational control to produce algorithms capable of suggesting overall gene structure (for review, see Gelfand 1995; Fickett 1996a). To date, most development effort has focused on integration of the various kinds of pattern information in the relatively simple case where a single complete gene is present in the input sequence. In this case, current algorithms usually suggest a putative protein translation similar to that in the literature, though there is still significant room for improvement (Burset and Guigo 1996). The extension of these algorithms to deal with a sequence containing multiple or partial genes is just beginning (Burge and Karlin 1997; http://gnomic.stanford.edu/~chris/GENSCANW.html). Because the signals that control the start and stop of transcription and translation, and the location of splicing, are still not very well understood, it is not uncommon for a gene-finding algorithm to confuse internal with initial and terminal exons, thus wrongly partitioning the exons. The problem is compounded by our incomplete understanding of alternative splicing control elements.

Another line of development in gene identification is based on homology (e.g., Gish and States 1993; Gelfand et al. 1996). If there is a close homolog in the databases to one of the genes in the sequence under analysis, sequence similarity will usually group the exons for this gene correctly. Still, in many cases there is no close homolog and no guarantee when there is some homolog that the encoded protein lacks insertions/deletions.

Clearly, some means of recognizing the beginnings of genes, probably via the promoter, or the ends, probably by means of the polyadenylation signal or translation termination signal (e.g., Kondrakhin et al. 1994; Wahle and Keller 1996; Dalphin et al. 1997; Solovyev and Salamov 1997), would enable a major advance. The promoter seems to be a much richer signal than the 3′ processing signals, though, as we shall see below, it is not easy to take advantage of the information in the promoter.

Determining the Correct Protein Translation

Of course, the single most important goal in gene identification is to correctly deduce the protein product(s) of the gene. After partitioning the genome into genes, the greatest difficulty in eukaryotes is correctly determining the splicing structure. Locating the correct initiation codon is also a difficult and important step in this case. If the transcription start site (TSS) is known, and there is no intron interrupting the 5′-untranslated region, Kozak’s (1996) rules can probably locate the correct initiation codon in most cases.

In prokaryotes the problem is of a different nature. Because splicing is normally absent, dividing the genome into gene units is ordinarily straightforward. This does not make the correct deduction of protein product trivial, however, for finding the correct initiation codon within an open reading frame (ORF) is difficult. In this case, promoter location, though useful, does not provide the key information that it does for eukaryotes because of the existence of multicistronic operons. Rather, for prokaryotes, the key need is reliable localization of the ribosome binding site (Shine and Dalgarno 1974).

Determination of Expression Context

Many experimental techniques are being developed for cataloging the expression context of genes (e.g., Prashar and Weismann 1996 and references therein). Development of computer algorithms to predict expression context from genomic sequence has received much less attention but may represent an important opportunity.

Gene expression is regulated at many levels, including chromatin packing (for review, see Kingston et al. 1996), transcription initiation (see below), polyadenylation (for review, see Wahle and Keller 1996), splicing (for review, see McKeown 1992), mRNA stability (e.g., Decker and Parker 1994), translation initiation (for review, see Kozak 1992), and others. But it is generally thought that the single most important point of regulation is at transcription initiation. The initiation of transcription seems to be regulated in large part by coordinate binding of many proteins to the promoter and, for some genes, to one or more enhancers. Specific combinations of binding sites, then, may provide the information necessary to suggest a particular expression context, and it is here that computational work to date has focused.

In most cases, researchers in this area have taken the locations of transcriptional regulatory regions (promoters and enhancers) as given and, in attempting to define those patterns in the DNA (combinations of binding sites) that determine expression context, have only attempted to give patterns with sufficient information content to sort regulatory regions into those that are active in a particular context and those that are not (e.g., Claverie and Sauvaget 1985; Fondrat and Kalogeropoulos 1994; Pedersen et al. 1996; Rosenblueth et al. 1996). For this approach to be successful in the long run, reliable algorithms must be developed for the recognition of promoters and enhancers in general. Another approach to the problem is to attempt to define patterns with very high information content, capable of distinguishing regulatory regions active in a specific context from all the other DNA in the genome (e.g., Fickett 1996b; Tronche et al. 1997). With this approach, one can imagine that general promoter recognition would eventually consist of separately recognizing a large number of specific cases. It is too early to clearly define the benefits of either strategy, and in any case, techniques developed with one approach will almost certainly transfer in part to the other.

Eukaryotic Promoter Recognition

In the rest of the paper we concentrate on the key problem of general eukaryotic promoter recognition. First, we review a few salient points from recent advances in biochemical understanding of transcription initiation, next, the core computational resources and techniques are discussed, and then currently available tools are described. To give some feeling for the current state of the art, the application of these tools to some recently determined promoter sequences is also described. Finally, we discuss prospects for the future.

Eukaryotic Transcription Initiation

The biochemical mechanisms controlling transcription initiation in eukaryotes are currently under intense investigation. Recent advances are reviewed in, for example, Burley and Roeder (1996); Chao and Young (1996); Kaiser and Meisterernst (1996); Kornberg (1996); Novina and Roy (1996); Roeder (1996); Stargell and Struhl (1996); Verrijzer and Tjian (1996); Ptashne and Gann (1997); Smale (1997). Here we will attempt to summarize the conclusions most relevant to sequence analysis.

The so-called preinitiation complex (PIC) recognizes the core promoter and initiates transcription. The PIC includes, besides Pol II, the general initiation factors (or general transcription factors, GTFs) TFIIA, TFIIB, TFIID, TFIIE, TFIIF, and TFIIH. Each of these may itself be a multiprotein complex. TFIID, which consists of TATA-binding protein [TBP; the so-called TATA box is ∼25 bp upstream of the transcription start site (TSS) in metazoans] and several TBP-associated factors (TAFs), is the only one of these known to have site-specific DNA-binding ability (though several other GTFs are known to be in close contact with the DNA; cf. Coulombe et al. 1994). TBP is one of the major determinants of this DNA-binding specificity, and the consensus sequence or position weight matrix (PWM) often used to recognize the TATA box (Bucher 1990) is probably characterizing the DNA-binding specificity of TBP (see Singer et al. 1990; Wiley et al. 1992).

synergistically. There may be many such regions, and they may either enhance or repress expression of the gene in particular circumstances (see Yuh and Davidson 1996 for an elegant example). Often these specific regulatory regions are active even if their location or orientation is changed, in which case

Around the TSS there is a loosely conserved initiator region (abbreviated Inr; for review, see Kaufmann et al. 1996; Smale 1997) that is one determinant of promoter strength and, in the absence of a TATA box, can determine the location of the TSS. To some extent, the TATA box and the Inr are interchangeable. For example, TFIID containing a mutated TBP defective in DNA binding cannot function on TATA-only promoters, but supports transcription from Inr-containing promoters (Martinez et al.