Transcription Factors
Proc. Natl. Acad. Sci. USA Vol. 86, pp. 5698-5702, August 1989
Biochemistry

Association of charge clusters with functional domains of cellular transcription factors
(DNA-binding domains/activating domains/"zinc finger"/"leucine zipper")

VOLKER BRENDEL AND SAMUEL KARLIN*
Department of Mathematics, Stanford University, Stanford, CA 94305
Contributed by Samuel Karlin, April 19, 1989

ABSTRACT Using rigorous statistical methods, we have identified and evaluated unusual properties of the distribution of charged residues within the sequences of eukaryotic cellular transcription factors. Virtually all transcription factors, including GAL4, c-Jun, C/EBP, CREB, Oct-1, Oct-2, Sp1, Egr-1, CTF-1, steroid and thyroid hormone receptors, and others, carry one or more highly significant charge clusters. For the most part these clusters (conserved within families of homologous proteins) are of positive net charge but contain also substantial numbers of acidic residues. Predominantly basic charge clusters are often, but not exclusively, associated with DNA-binding domains, and vice versa. Negative charge clusters of note occur only in the yeast protein PHO4 and in the proteins encoded at the Drosophila loci zeste (z) and knrl. This dearth of statistically significant negative charge clusters raises questions with respect to the generality of acidic activation domains. A number of sequences (Oct-1, Oct-2, zeste, Dhr23, E75, and knrl) contain multiple charge clusters together with one or more significantly long uncharged regions. The occurrence of multiple charge clusters is a rare phenomenon (found in less than 3% of all proteins, mainly in Drosophila developmental control proteins and in transactivators of eukaryotic DNA viruses). Most of the proteins with zinc-binding "fingers" carry a mixed charge cluster centered at the zinc-finger motif preceded by a long uncharged stretch, suggestive of a modular structure for these proteins.

Eukaryotic transcription control involves a variety of regulatory proteins that bind specifically to promoter and enhancer elements and associate with RNA polymerase II and/or other factors in elaborate complexes (1, 2). It has been suggested that regions of concentrated charge in the proteins involved in these complexes play an important role in mediating the various protein-nucleic acid and protein-protein interactions. For several transcription factors, including the yeast proteins GAL4 and GCN4, mammalian "zinc-finger" proteins like Sp1 and Krox-20, and homeodomain-containing transcription factors (for example, Oct-2 and Pit-1), the DNA-binding domains are associated with basic regions. GAL4 and GCN4 additionally contain one or more separate activating domains, which are acidic (3, 4). By contrast, Sp1 has no acidic regions but contains long uncharged regions that apparently harbor the activation function and may contribute to greater affinity for DNA (5). Several structural motifs are implicated in DNA binding (for review, see ref. 6). The characteristics of transactivating domains have remained unclear except for the quoted predominance of net negative charge in some cases and the complete absence of charged residues in other cases.

We have been generally interested in the distribution of charged residues along protein sequences. Previously we established methods to discern statistically significant charge clusters (defined as 25- to 75-residue segments high in specific charge content) as well as significantly long runs of certain charge types and special periodic patterns of charged residues (7). Application of these methods to a large number of viral proteins revealed a striking richness of significant charge configurations in transactivators and other regulatory proteins of mainly eukaryotic DNA viruses (7-9). A paramount example is the Epstein-Barr virus (EBV)-encoded nuclear antigen EBNA1, a protein required for replication and maintenance of viral genomes in latently infected cells (10). The distribution of the charged residues in this protein distinguishes four separate charge clusters, two of negative and two of positive sign; moreover, the positive charge clusters center on long tandem reiterations of (+, 0),† and the C-terminal negative cluster carries, with two mismatches, the periodic charge pattern (G, -, -)₇; all these distributional features are statistically significant in the sense that less than 1% of random sequences of the same amino acid composition would contain any one of these structures (7). Significant charge configurations also occur in all other primary nuclear antigens of EBV expressed in the latent state as well as in a strong transactivator of the lytic cycle (pBMLF1), in the immediate early transactivators ICP0 and ICP4 of herpes simplex virus type 1, p62 and p63 of varicella-zoster virus, and IE1 of cytomegalovirus; in the E1A 32-kDa protein of adenovirus; in the large tumor (T) antigens of papovaviruses; and in E1 and E7 of papillomavirus (7-9). All of these proteins are involved in the regulation of early viral transcription or viral replication and in some cases in oncogenic transformation.

Our objective here is to provide a more complete analysis of the charge distribution in transcriptional regulators. To this end we have screened, in addition to viral proteins, more than 2500 primarily human, mouse, Drosophila, yeast, and Escherichia coli proteins for significant charge configurations, emphasizing clusters, runs, and periodic patterns involving either or both charge types. Significance was determined as described previously (7, 11). The data show that significant charge configurations abound among cellular proteins involved in regulatory functions, including nuclear transcription factors, steroid and thyroid hormone receptors, nuclear protooncogene products, developmental control proteins, high molecular weight heat shock proteins, and transmembrane proteins like the voltage-gated ion channels, growth factor and neurotransmitter receptors, and opsins, but are uncommon for the bulk of enzymes and constitutive proteins. We shall focus in this paper on nuclear transcription factors and hormone receptors. Heat shock proteins are discussed in ref. 11; the detailed results for the other classes of proteins will be discussed elsewhere.

Abbreviation: EBV, Epstein-Barr virus.
*To whom reprint requests should be addressed.
†Protein sequences are displayed in the standard one-letter code. + stands for R, K, or H; - for D or E; and 0 for all other amino acids. Some authors treat H as uncharged; however, due to the low frequency of H in most proteins, this does not affect the qualitative outcome of our type of analysis.

The publication costs of this article were defrayed in part by page charge payment. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. §1734 solely to indicate this fact.

Nuclear Transcription Factors

Some 20 protein factors involved in the regulation of eukaryotic transcription have been identified and sequenced. Four groupings may be distinguished: (i) the GCN4-related proteins, characterized by a similar DNA-binding domain toward the C terminus; (ii) transcription factors containing multiple zinc-finger sequences; (iii) transcription factors with a POU domain (12); and (iv) factors of more individual profile, including GAL4, C/EBP, PHO4, zeste and others. Significant charge clusters occur in virtually all of these proteins while, in contrast to many viral transactivators, there occur almost no significant runs or periodic patterns of charge (PHO4 is an exception; see below). In Fig. 1 we ascertained positive (overwhelmingly basic), negative (overwhelmingly acidic), and mixed charge clusters (involving substantial numbers of charged residues of both signs, where the relative numbers of basic to acidic components may be balanced or skewed). Also shown are significantly long uncharged segments. The characterization of shared and contrasting tendencies and the statistics with respect to occurrence, multiplicity, type, and location of the charge clusters suggest diagnostic sequence features that can provide insights into protein function, structure, and classification.

[Fig. 1: text labels recovered from the figure; the drawing and the caption are not part of this extract. Each protein is followed by the three values printed in parentheses beneath its name and then by its clusters, given as type, residue range, and two counts over the segment length. Cluster-type symbols are reproduced as extracted and may be unreliable; "?" marks a symbol that is missing or illegible.]

GCN4 (281 14.2 16.4): ± cluster at 231-276: 16,7/46
c-jun (331 12.1 8.5): ± cluster at 246-285: 16,7/40
CPC1 (269 10.4 13.4): ± cluster at 201-258: 18,12/58
Oct-2 (463 10.6 7.6): ± cluster at 10-42: 6,8/33; ± cluster at 170-200: 7,7/31; ± cluster at 280-338: 17,8/59
Oct-1 (743 5.5 4.8): + cluster at 272-301: 7,7/30; + cluster at 378-436: 16,8/59
Pit-1 (291 17.2 12.0): + cluster at 212-273: 20,10/62
unc-86 (467 14.1 8.1): ± cluster at 362-429: 21,8/68
Sp1 ((696) 7.3 4.7): ? cluster at 532-623: 33,6/92
Kr (466 16.7 7.1): ? cluster at 240-353: 38,15/114
Krox-20 (445 14.4 7.0): ± cluster at 260-365: 36,10/106
Egr-1 (533 9.8 6.9): ? cluster at 315-426: 38,12/112
GLI (1106 12.6 9.2): ± cluster at 216-336: 31,24/121; ± cluster at 347-428: 25,10/82
ADR1 (1323 14.0 12.1): no cluster label recovered
TFIIIA (344 25.6 11.0): no cluster label recovered
GAL4 (881 11.4 9.6): ? cluster at 15-53: 15,2/39
C/EBP (358 15.1 9.8): + cluster at 273-336: 20,11/64
c-fos (380 10.0 13.4): + cluster at 125-194: 19,17/70
fra-1 (275 14.2 11.3): ? cluster at 105-183: 23,22/79
CREB: + cluster at 92-144: 9,15/53

The DNA-binding domains of GAL4, the GCN4-related proteins, the homeobox part of the POU domain, and the zinc-finger sequences are all highly charged, delimiting by our analysis positive charge clusters or mixed charged clusters with basic residues predominant. Long uncharged segments are frequent in the zinc-finger-containing proteins and also occur in a number of the other transcription factors.
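
The charge alphabet defined in the footnote (+ for R, K, or H; - for D or E; 0 for all other amino acids) and the significance criterion quoted in the introduction (fewer than 1% of random sequences of the same amino acid composition showing the feature) can be illustrated with a minimal Python sketch. The function names, the toy sequence, and the shuffle-based test are assumptions made for illustration only. The authors' actual statistical procedures are those of refs. 7 and 11 and are not reproduced here; the shuffle test is a simple stand-in, not their method.

```python
import random

# Footnote convention: + stands for R, K, or H; - for D or E; 0 otherwise.
POSITIVE = set("RKH")
NEGATIVE = set("DE")


def encode_charge(seq):
    """Map a one-letter protein sequence onto the +/-/0 charge alphabet."""
    return "".join(
        "+" if aa in POSITIVE else "-" if aa in NEGATIVE else "0"
        for aa in seq.upper()
    )


def segment_counts(seq, start, end):
    """Return (positives, negatives, length) for residues start..end
    (1-based, inclusive)."""
    charges = encode_charge(seq[start - 1:end])
    return charges.count("+"), charges.count("-"), len(charges)


def max_window_charge(charges, window):
    """Largest number of charged residues found in any window of the
    given length (sliding-window count)."""
    flags = [c != "0" for c in charges]
    if len(flags) <= window:
        return sum(flags)
    current = best = sum(flags[:window])
    for i in range(window, len(flags)):
        current += flags[i] - flags[i - window]
        best = max(best, current)
    return best


def shuffle_p_value(seq, window, trials=1000, seed=0):
    """Illustrative stand-in for a significance test: the fraction of
    composition-preserving shuffles whose densest window is at least as
    charged as the densest window of the observed sequence."""
    rng = random.Random(seed)
    charges = list(encode_charge(seq))
    observed = max_window_charge(charges, window)
    hits = 0
    for _ in range(trials):
        rng.shuffle(charges)
        if max_window_charge(charges, window) >= observed:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    # Hypothetical toy sequence, not taken from the paper.
    toy = "A" * 40 + "KRKEKRDKRK" + "A" * 40
    print(encode_charge(toy[38:52]))
    print(segment_counts(toy, 41, 50))
    print(shuffle_p_value(toy, window=25))
```

A window of 25 residues is used in the usage lines because the introduction defines clusters as 25- to 75-residue segments; treating H as uncharged, as the footnote says some authors do, would only require removing "H" from POSITIVE.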