Proc. Natl. Acad. Sci. USA Vol. 82, pp. 1609-1613, March 1985 Biochemistry

Complete sequence of a gene encoding a human type I keratin: Sequences homologous to enhancer elements in the regulatory region of the gene
(DNA sequence/gene expression/nuclease S1 mapping)
DOUGLAS MARCHUK, SEAN MCCROHON, AND ELAINE FUCHS
Department of Molecular Genetics and Cell Biology, The University of Chicago, Chicago, IL 60637
Communicated by A. A. Moscona, November 5, 1984

ABSTRACT We report here the complete nucleotide sequence of a gene encoding the 50-kDa keratin expressed in abundance in human epidermal cells. According to its sequence, this gene has a single transcriptional initiation site and a single polyadenylylation signal. Nuclease S1 mapping of this gene with total human epidermal mRNA confirmed the presence of a single initiation site for the 50-kDa keratin gene. When the regulatory sequences 5' upstream from this gene were examined, three sequences that share significant homology with viral and immunoglobulin enhancer elements were found. In comparison, the sequence of the regulatory region of vimentin, a structurally similar gene, was highly divergent [Quax, W., Egberts, W. V., Hendriks, W., Quax-Jeuken, Y. & Bloemendal, H. (1983) Cell 35, 215-223]. This finding may provide a clue to understanding the molecular mechanisms underlying the widely varying levels of expression of different intermediate filament genes in different tissues.

The publication costs of this article were defrayed in part by page charge payment. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. §1734 solely to indicate this fact.

Abbreviations: bp, base pair(s); IF, intermediate filament.

The intermediate filaments (IFs) are a family of 8- to 10-nm fibers that constitute a part of the cytoskeleton in virtually all higher eukaryotic cells (for a review, see ref. 1). The keratins are a complex group of about 20 different proteins that comprise the IFs in epithelial cells (for a review, see ref. 2). They can be further subdivided into two distinct sequence classes, type I and type II (3, 4). At least one member of each of these two keratin classes is expressed at all times, suggesting the importance of each of these types of sequences in filament assembly (2, 5-7).

Although the two keratin types share only 25-30% amino acid homology with one another, the individual members of a single class can be very closely related, as judged by positive hybrid translation (5, 8) and by sequence analyses (4, 9-12). Despite their similarities, many of the keratins of a single type are encoded by separate mRNAs (5, 13-15). Whether multiple mRNAs might arise from a single keratin gene, however, has not been examined in detail.

The level of expression of IF proteins in different cells and tissues varies considerably. Even within a single IF family, such as the keratins, the amounts of these proteins in different epithelia can be dramatically different. In basal epidermal cells, almost 30% of the total protein synthesized is keratin (16). The 50-kDa type I and the 56-kDa type II keratin are major keratins present in these cells. Other pairs of type I and type II keratins are expressed in other epithelia, but they are usually less abundant (2, 4-6).

In this paper, we present the complete nucleotide sequence of the 50-kDa type I human keratin gene expressed in cultured human epidermal cells. An analysis of this gene has enabled us to explore the possibility of multiple initiation or polyadenylylation signals in a keratin gene. A comparison of the 5' regulatory sequences of the keratin gene (this paper) with those of the vimentin gene (17) has revealed a possible explanation for the different levels of expression of these two genes in different tissues.

MATERIALS AND METHODS

The 50-kDa keratin gene was isolated and characterized as described (18). To obtain the complete sequence of this gene, we first subcloned into plasmid pUC8 two restriction endonuclease fragments that hybridized with a cloned 50-kDa keratin cDNA, KB-2 (3). The 5' end of the gene was sequenced by the method of Maxam and Gilbert (19). For the remainder of the gene, we applied the M13-dideoxy strategy (20) and the shotgun cloning method of Anderson (21) as described (18). For 100% of the coding sequences and for 90% of the intron sequences, multiple and frequently opposite strands were sequenced.

RESULTS

Isolation, Identification, and Complete Nucleotide Sequence of the Human 50-kDa Keratin Gene. Using a 1380-base-pair (bp) cDNA probe complementary to the 50-kDa keratin mRNA (3), we isolated the human gene encoding this keratin. Hybridization studies showed that this is the only human gene bearing a 3' noncoding sequence identical to that of the 50-kDa keratin cDNA (18). To elucidate the complete structure of the 50-kDa keratin gene, we determined the entire nucleotide sequence of the gene and its 5' and 3' flanking regions. An outline of the structure of the gene is illustrated in Fig. 1. The DNA sequence corresponding to the complete mRNA for the 50-kDa keratin is shown in Fig. 2 along with the complete predicted amino acid sequence for the protein. The positions of the introns were determined by comparing the sequence of the gene with that of the previously published cDNA sequence (9). Intron positions are noted by triangles, and the complete sequences of these introns are given in Fig. 3.

The Transcription and Translation Initiation Sites of the Human 50-kDa Keratin Gene. At a position 85 nucleotides 5' upstream from the putative initiator ATG codon is found the sequence T-A-T-A-A-A (for a review, see ref. 24). Fifty nucleotides downstream from the TATA box, the sequence C-T-T-C-T-G is found. This is a known consensus sequence that frequently appears downstream from the cap site of eukaryotic mRNAs (25).

To determine the precise transcription initiation site, we hybridized human epidermal mRNA with a radiolabeled genomic fragment encompassing 251 nucleotides 5' upstream


FIG. 1. A schematic diagram of the human 50-kDa keratin gene. An outline of the structure of the 50-kDa keratin gene was drawn to scale. The eight exons are represented as boxes and are identified by roman numerals. The introns are represented by the thin connecting lines. kb, Kilobase.
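The intron layout drawn in Fig. 1 can be cross-checked against the numbering printed in Fig. 2. The sketch below is an illustration only, not part of the original analysis. It assumes that every numbered row of Fig. 2 displays 25 codons (75 coding nucleotides), so that any excess in a row's numbering span is intron sequence. The names `ROWS_WITH_INTRONS`, `FIG3_SIZES`, and `implied_intron_length` are hypothetical; the row numbers are those printed in Fig. 2 and the sizes are those listed in the legend to Fig. 3.

```python
# Hypothetical consistency check between Fig. 2 numbering and Fig. 3 intron sizes.
# Assumption: each Fig. 2 row shows 25 codons, so span - 75 = intron length.

# (first nucleotide number, last nucleotide number, codons shown) per row
ROWS_WITH_INTRONS = {
    "I": (511, 1841, 25),
    "II": (1917, 2553, 25),
    "III": (2629, 3038, 25),
    "IV": (3114, 3271, 25),
    "V": (3347, 3516, 25),
    "VI": (3592, 3760, 25),
    "VII": (3761, 4399, 25),
}

# Intron sizes in bp as listed in the Fig. 3 legend
FIG3_SIZES = {"I": 1256, "II": 562, "III": 335, "IV": 83,
              "V": 95, "VI": 94, "VII": 564}


def implied_intron_length(first, last, codons):
    """Nucleotides spanned by a row minus its coding nucleotides."""
    span = last - first + 1
    return span - 3 * codons


implied = {name: implied_intron_length(*row)
           for name, row in ROWS_WITH_INTRONS.items()}

short_introns = [name for name, size in implied.items() if size < 100]
long_introns = [name for name, size in implied.items() if size > 1000]

print(implied)
print("introns under 100 bp:", short_introns)
print("introns over 1 kb:", long_introns)
```

Under this assumption the implied lengths agree with the Fig. 3 legend, and the counts match the statement in the text that three of the seven introns are shorter than 100 bp and only one exceeds 1 kilobase.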

and 172 nucleotides 3' downstream from the putative site. When the hybrid was subsequently treated with endonuclease S1 to digest the unprotected DNA, a mRNA-protected DNA fragment of exactly 172 nucleotides was generated. DNA sequence analysis of this fragment indicated that the mRNA protection extended exactly to the A-C sequence appearing 26 nucleotides downstream from the TATA box (Fig. 4). The size of the 5' untranslated region is 60 nucleotides, in good agreement with estimates previously obtained for electrophoretic migration of the 50-kDa keratin mRNA (13). The first ATG 3' downstream from the TATA box is homologous to the consensus translation start sequence, A/G-X-X-A-U-G-G (27). The high serine content fits well with the general richness of serine residues at the amino termini of epidermal keratins (28).

The complete gene sequence encodes a polypeptide of 472 amino acid residues (Fig. 2). A molecular mass of 51,591 Da is predicted, which is in good agreement with the size of 50 kDa originally estimated on the basis of NaDodSO4/polyacrylamide gel electrophoresis (16). The predicted amino acid composition matches very well with that determined by chemical means for the 50-kDa keratin (9).

Transcription Polyadenylylation Signals. Only a single polyadenylylation site A-A-T-A-C-A was found at the 3' end of the gene. This site appeared 127 nucleotides (residues 4596-4601, Fig. 2) 3' downstream from the translational stop codon, TGA. The entire 3' noncoding region of this gene, including the polyadenylylation signal, shared complete sequence identity with the corresponding region of the 50-kDa keratin cDNA (9). The exact polyadenylylation site for the 50-kDa keratin transcript could be determined from a comparison of the cDNA and gene sequences, and it occurs at nucleotide residue 4624, i.e., 23 nucleotides 3' downstream from the polyadenylylation signal. In the 481 bp 3' downstream from the first polyadenylylation signal, no additional signals were found.

Possible Regulatory Sequences. In order to search for possible regulatory sequences for the 50-kDa keratin gene, we determined the sequence of 253 nucleotide residues 5' upstream from the transcriptional initiation site. At three positions 5' upstream from the TATA box, the sequence T-(G)₁₋₃-A-A-A-G was found (underlined in Fig. 2). A comparison of these sequences with the consensus enhancer sequence G-T-G-G-A-A-A-G found in some viral and immunoglobulin genes indicates a striking homology (Fig. 5; refs. 29-31).

A Comparison Between the Genes Encoding Vimentin and Keratin. The only other IF gene sequenced to date is the one encoding hamster vimentin (17). A comparison of the 50-kDa human keratin gene with the vimentin gene reveals several interesting features. When the two genes were aligned for optimal homology, it was discovered that six of the seven keratin introns were positioned nearly identically to the corresponding introns of vimentin (18). This is remarkable considering that the gene sequences share only 42% homology and that their proteins are only 29% homologous. The conservation of intron positions in these genes provides a strong indication that the two genes had a common origin. Although intron position is highly conserved among IF genes, both the sizes and the sequences of the introns have diverged considerably. The 5' and 3' noncoding sequences of the transcriptional domain of the genes are also highly divergent.

In addition to the divergence in noncoding sequences of the two genes, the sequences 5' upstream from the genes were also very different. In the keratin gene, three enhancer-like sequences were discovered in the region 5' upstream from the TATA box. In contrast, only a single sequence sharing identity with only six of the eight enhancer core residues was found in the vimentin gene (17). This sequence is located very close to the TATA box (6 bp 5' upstream) and is on the noncoding strand.

Unusual Features Within the Keratin Introns? When the introns of the 50-kDa keratin gene were analyzed for unusual sequences or structural features, few interesting regions were found (Fig. 3). There are no segments in which a long stretch of stable intrastrand secondary structure can be formed. The longest stretch of perfect dyad symmetry in any of the introns consists of only nine potential base pairs. Perfect palindromic sequences are also short (<10 nucleotides). In part, the low degree of potential secondary structure in the keratin introns may be attributed to the small size of the introns. Three of the seven introns are less than 100 bp, and only one is larger than 1 kilobase.

Perhaps the most unusual feature of any of the seven introns occurs in intron III, in which there is an extreme imbalance of purine residues on one strand and pyrimidine residues on the other (Fig. 3). Immediately preceding this purine-rich segment, there is a 116-bp sequence that shares 71% homology with the human Alu consensus sequence (23). Interestingly, only one of the two Alu repeat sequences is present.

DISCUSSION

About 10 different type I keratins have been identified (2). These proteins are differentially expressed in different tissues and at different stages of differentiation and development. To date we have isolated from a human genomic library a number of genes that share substantial homology with our cloned 50-kDa keratin cDNA (unpublished results). Although we have not yet fully characterized these genes, it is likely that there are multiple genes for the type I keratin subfamily. In the 50-kDa keratin gene and its flanking regions, there is no evidence for multiple initiation or polyadenylylation sites. The possibility of differential splicing within the internal introns of the RNA transcript is unlikely, since intron position is highly conserved. Thus, this gene seems to encode a single transcript and most likely a single keratin. This distinguishes the keratins from vimentin, where a single gene gives rise to the tissue-specific expression of multiple mRNAs (32, 33).
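The enhancer-like pattern T-(G)1-3-A-A-A-G reported for the 5' flanking region can be written as a regular expression. The following is a minimal sketch, assuming a plain-text DNA string as input; the names `ENHANCER_LIKE` and `find_enhancer_like` are hypothetical, and the example string is only the first printed row of the 5' flanking sequence in Fig. 2 (the row beginning at -250), not the full 253-nucleotide region.

```python
import re

# T, then one to three G residues, then A-A-A-G
ENHANCER_LIKE = re.compile(r"TG{1,3}AAAG")


def find_enhancer_like(seq):
    """Return (0-based start, matched text) for each non-overlapping hit."""
    seq = seq.upper()
    return [(m.start(), m.group()) for m in ENHANCER_LIKE.finditer(seq)]


# First row of the 5' flanking sequence as printed in Fig. 2
upstream_row = "CCCAGGGTCATGGGAAAGT"
hits = find_enhancer_like(upstream_row)
print(hits)

# The consensus enhancer core G-T-G-G-A-A-A-G also contains the pattern
core_hits = find_enhancer_like("GTGGAAAG")
print(core_hits)
```

The search scans one strand only. A site on the complementary strand, such as the one described for vimentin, would be found by running the same function on the reverse complement of the sequence.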

-250 CCCAGGGTCATGGGAAAGT

-230 -210 -190 -170 -150 GTAGC1CAGGCCCACACCTCCCCCTGTGAATCACGCCTGGCGGGACAAGAAAGCCCAAAACACTCCAAACAATGAGTTTCCAGTAAAATATGACAGA

-130 -110 -90 -70 -50 CATGATGAGGCGGATGAGAGGAGGGACCTGCCTGGGAGTTGGCGCTAGC CGCCCCCT

-29 -24 0 20 40 60 ACCCATGAGTATAAAGCACTCGCATCCCTTTGCAATTTACCCGAGCACCTTCTCTrCACTCAGCCFNCTGCTCGCTCGCTCACCTCCCTCCTCTGCACC

61 ATG ACT ACC TGC AGC CGC CAG TTC ACC TCC TCC AGC TCC ATG AAG GGC TCC TGC GGC ATC GGG GGC GGC ATC GGG 135 1 MET Thr Thr Cys Ser Arg Gln Phe Thr Ser Ser Ser Ser MET Lys Gly Ser Cys Gly Ile Gly Gly Gly Ile Gly 25

136 GCG GGC TCC AGC CGC ATC TCC TCC GTC CTG GCC GGA GGG TCC TGC CGC GCC CCC AAC ACC TAC GGG GGC GGC CTG 210 26 Ala Gly Ser Ser Arg Ile Ser Ser Val Leu Ala Gly Gly Ser Cys Arg Ala Pro Asn Thr Tyr Gly Gly Gly Leu 50

211 TCT GTC TCA TCC TCC CGC TTC TCC TCT GGG GGA GCC TAT GGG TTG GGG GGC GGC TAT GGC GGT GGC TTC AGC AGC 285 51 Ser Val Ser Ser Ser Arg Phe Ser Ser Gly Gly Ala Tyr Gly Leu Gly Gly Gly Tyr Gly Gly Gly Phe Ser Ser 75

286 AGC AGC AGC AGC TTT GGT AGT GGC TTT GGG GGA GGA TAT GGT GGT GGC CTT GGT GCT GGC TTG GGT GGT GGC TTT 360 76 Ser Ser Ser Ser Phe Gly Ser Gly Phe Gly Gly Gly Tyr Gly Gly Gly Leu Gly Ala Gly Leu Gly Gly Gly Phe 100

361 GGT GGT GGC TTT GCT GGT GGT GAT GGG CTT CTG GTG GGC AGT GAG AAG GTG ACC ATG CAG AAC CTC AAT GAC CGC 435 101 Gly Gly Gly Phe Ala Gly Gly Asp Gly Leu Leu Val Gly Ser Glu Lys Val Thr MET Gln Asn Leu Asn Asp Arg 125

436 CTG GCC TCC TAC CTG GAC AAG GTG CGT GCT CTG GAG GAG GCC AAC GCC GAC CTG GAA GTG AAG ATC CGT GAC TGG 510 126 Leu Ala Ser Tyr Leu Asp Lys Val Arg Ala Leu Glu Glu Ala Asn Ala Asp Leu Glu Val Lys Ile Arg Asp Trp 150

511 TAC CAG AGG CAG CGG CCT GCT GAG ATC AAA GAC TAC AGT CCC TAC TTC AAG ACC ATT GAG GAC CTG AGG AAC AAG ▼Int 1 1841 151 Tyr Gln Arg Gln Arg Pro Ala Glu Ile Lys Asp Tyr Ser Pro Tyr Phe Lys Thr Ile Glu Asp Leu Arg Asn Lys 175

1842 ATT CTC ACA GCC ACA GTG GAC AAT GCC AAT GTC CTT CTG CAG ATT GAC AAT GCC CGT CTG GCC GCG GAT GAC TTC 1916 176 Ile Leu Thr Ala Thr Val Asp Asn Ala Asn Val Leu Leu Gln Ile Asp Asn Ala Arg Leu Ala Ala Asp Asp Phe 200

1917 CGC ACC AA ▼Int 2 G TAT GAG ACA GAG TTG AAC CTG CGC ATG AGT GTG GAA GCC GAC ATC AAT GGC CTG CGC AGG GTG CTG 2553 201 Arg Thr Lys Tyr Glu Thr Glu Leu Asn Leu Arg MET Ser Val Glu Ala Asp Ile Asn Gly Leu Arg Arg Val Leu 225

2554 GAC GAA CTG ACC CTG GCC AGA GCT GAC CTG GAG ATG CAG ATT GAG AGC CTG AAG GAG GAG CTG GCC TAC CTG AAG 2628 226 Asp Glu Leu Thr Leu Ala Arg Ala Asp Leu Glu MET Gln Ile Glu Ser Leu Lys Glu Glu Leu Ala Tyr Leu Lys 250

2629 AAG AAC CAC GAG GAG ▼Int 3 GAG ATG AAT GCC CTG AGA GGC CAG GTG GGT GGA GAT GTC AAT GTG GAG ATG GAC GCT GCA 3038 251 Lys Asn His Glu Glu Glu MET Asn Ala Leu Arg Gly Gln Val Gly Gly Asp Val Asn Val Glu MET Asp Ala Ala 275

3039 CCT GGC GTG GAC CTG AGC CGC ATT CTG AAC GAG ATG CGT GAC CAG TAT GAG AAG ATG GCA GAG AAG AAC CGC AAG 3113 276 Pro Gly Val Asp Leu Ser Arg Ile Leu Asn Glu MET Arg Asp Gln Tyr Glu Lys MET Ala Glu Lys Asn Arg Lys 300

3114 GAT GCC GAG GAA TGG TTC TTC ACC AAG ▼Int 4 ACA GAG GAG CTG AAC CGC GAG GTG GCC ACC AAC AGC GAG CTG GTG CAG 3271 301 Asp Ala Glu Glu Trp Phe Phe Thr Lys Thr Glu Glu Leu Asn Arg Glu Val Ala Thr Asn Ser Glu Leu Val Gln 325

3272 AGC GGC AAG AGC GAG ATC TCG GAG CTC CGG CGC ACC ATG CAG AAC CTG GAG ATT GAG CTG CAG TCC CAG CTC AGC 3346 326 Ser Gly Lys Ser Glu Ile Ser Glu Leu Arg Arg Thr MET Gln Asn Leu Glu Ile Glu Leu Gln Ser Gln Leu Ser 350

3347 ATG ▼Int 5 AAA GCA TCC CTG GAG AAC AGC CTG GAG GAG ACC AAA GGT CGC TAC TGC ATG CAG CTG GCC CAG ATC CAG GAG 3516 351 MET Lys Ala Ser Leu Glu Asn Ser Leu Glu Glu Thr Lys Gly Arg Tyr Cys MET Gln Leu Ala Gln Ile Gln Glu 375

3517 ATG ATT GGC AGC GTG GAG GAG CAG CTG GCC CAG CTC CGC TGC GAG ATG GAG CAG CAG AAC CAG GAG TAC AAG ATC 3591 376 MET Ile Gly Ser Val Glu Glu Gln Leu Ala Gln Leu Arg Cys Glu MET Glu Gln Gln Asn Gln Glu Tyr Lys Ile 400

3592 CTG CTG GAC GTG AAG ACG CGG CTG GAG CAG GAG ATC GCC ACC TAC CGC CGC CTG CTG GAG GGC GAG GAC GCC CA ▼Int 6 C 3760 401 Leu Leu Asp Val Lys Thr Arg Leu Glu Gln Glu Ile Ala Thr Tyr Arg Arg Leu Leu Glu Gly Glu Asp Ala His 425

3761 CTC TCC TCC TCC CAG TTC TCC TCT GGA TCG CAG TCA TCC AGA GAT G ▼Int 7 TG ACC TCC TCC AGC CGC CAA ATC CGC ACC 4399 426 Leu Ser Ser Ser Gln Phe Ser Ser Gly Ser Gln Ser Ser Arg Asp Val Thr Ser Ser Ser Arg Gln Ile Arg Thr 450

4400 AAG GTC ATG GAT GTG CAC GAT GGC AAG GTG GTG TCC ACC CAC GAG CAG GTC CTT CGC ACC AAG AAC TGA GGCTGCC 4475 451 Lys Val MET Asp Val His Asp Gly Lys Val Val Ser Thr His Glu Gln Val Leu Arg Thr Lys Asn *

4476 CAGCCCCGCTCAGGCCTAGGAGGCCCCCCGTGTGGACACAGATCCCACTGGAAGATCCCCTCTCCTGCCCAAGCACTTCACAGCTGGACCCTGCTTCAC 4574 4596 4601 4575 CCTCACCCCCTCCTGGCAATCAATACAGCTTCATTATCTGAGTTGCATAATTCTCGCCTCTCTCTGGTCATTGTTAGGAGTGGGGGTGGGGAGAAAGTG 4673

4674 GGAGAGCATCTCTTTGGAGCTTGTCATGCACCTGGCTATGGCCCCTGGGACTGGGAGAAAAGTCCTGGGGGTGGGTTGGGCTCAGGTCCCAGGATATCT 4772

4773 TTCGCCATCTCAGAAGACACAGATAGATGTGTGTACCAGGTCATATGTGGTGTCTCCTAGGGTACGGAGGGATATTCATTCAETTACTCACTCATETTC 4871

4872 ATGTGTGTCCATTCATTCACCAGATATTGAGTGCCTCTATGTCAGGCACTATGTTAGGTTAAGGATTCCTGATGTTTTTGTGTATCAGGGATTCCTTGG 4970

4971 AGAATATTGAAAGCTATAGATCGTTCCTTCTGCCCCCTACCTTCAAATAAGCATACATACATTTGCATACATGTCATGGGGTTCATGGGTCTCCTAGAG 5069

5070 CTCCTTACCGGAGT

FIG. 2. The nucleotide sequence of the human 50-kDa keratin gene and its corresponding amino acid sequence. Nucleotide residues are numbered from 1 to 5083 starting at the first residue after the transcription initiation site. The positions of the introns are indicated by triangles. The "TATA" box and the polyadenylylation signal are in boldface type and underlined. The transcriptional initiation sequence C-T-T-C-T-G and the putative translational initiator ATG are highlighted in boldface type. The location of the capped nucleotide and transcription initiation site is marked by a squiggly arrow. The three sequences upstream from the TATA box that share homology with the consensus enhancer sequence (see text) are underlined. The amino acid sequence of the encoded keratin is given below the nucleotide sequence.

The 50-kDa keratin gene is coexpressed with the 56-kDa type II human keratin (4). Type I and type II keratins are frequently expressed in pairs, and different epithelia seem to express different pairs of keratins (2, 5, 6). Until more genomic sequences for the keratins become available, we are not able to determine whether there are any regulatory sequences 5' upstream from the keratin genes that are unique to each coordinately expressed pair of genes. For the moment, we can only state that no significant homology was found between the 5' regulatory regions of vimentin and keratins, two differentially expressed IF genes.

It is interesting that at three positions within 230 nucleo-

C G +- + gt~q ttttttttttttttt I COBSs:AA fig G C TA G SI 175 LYS~ AMG qtqgtgaat gggcagcaga aggcaccatt ccagctagct ccttctggga acaattcatg ccccaggccg ctgagacctt aagatttctc tataggacag agtccacccc agatcccttc tttcgaggtc ttggatgccc taagactgat cagtgagaag atgctttccc ttccccaggc ctcctcatcc ccttctgatc tcaaatcctc agaccatgtg agatcagtga ttcctatcct tacatttttt agaggaagca gttgaagctt cgagaggtgc tgtgaccagc tgcaggtcac w atagcaaatt aatggcagag caaggctggg gcccttgtgc ctaccttcca gcacaggagg z .ms cagctacttg ttctccagca caggggagga gtgaggctct aacgggacca ggcaagacat LAJ ccaaaccact cattagctca ctagtctggg ctgtggttgc cgccgcccat aagccttggt acaggctggt ccctccccac agccaggcgg gcatggagag cctgcagaga caattagtgt G ggtcccttga tgtgccctgc acagagagag cctggcaggc ttgtgccctg actctagccc T _Ut I cctcctccct gctcccacat tacttgggag ccctccctgc tggagtctgt tgggctctaa tgacttgcat ggattaggga aattcaagtg atgaggtggg gaaatccaac cagactcagg ggccaatata tcttcttatt cttcccctga gtcctctttt ctaatcccct gtgttagttg ggttttatct cttcacaaag ttccacttga agtcccatgg cctgtgagct tgaaaaggaa z A tgtgcatatc tgcagaggac tggcagggct ggcctgatgc agacagaaga gaggtcagct caggaaggag ggctagggag cctcagatta ttctcctcac atggaggtgg ggaacttgaa gccccaagac ccattaggtt ttgcccagca agacactggc agaattggga ccagaactcc CG tgggctttcg attctaagcc cggggctgcc atctaccccc tctgttgacc atgagttagc aaagtcttag gacaggcctg gggcatctgt tttcctttgg gctgctatgg tcaagttttg CG tgggggaaaa gggggattca ggcaagaaca tgaagcaaga gcttaatgta ggctacagtg C G aagtccagct tgtgaagtcc atttgacaaa ttacctgtgc cttttccatc ctgcag AT! 
G C Ile A T 176 GC 202 C G Thr Ly AT ACC AA gtgaqtttga aatggtgggc cagaacatcc agtgtcccca gagtagggca tttttggagc C G agtgtttccc aaatagaact agccagtacc aggataggtg catgaaaact ccctggggtg C G cttataaaag aataagactc ttgggcccca cccttggagt tttgattcag ctatttatag caggttacct gggtgattct ggtccacagc caggtttcag aaccgctgct ttagggagag UU A II gcactttcca cttccccagc tgcccttgaa gtataggaag gaatcatagt tggaggactt ctgcattatt tgttggctga agctagaagt gcaaccccct cctgatttct gcagcaagat G A-k-_.- gaactgcctt atccccagcc cgcaggaatg ttcatatctg agcaatcaat gggcactgtg ttcaaccacg ccatttttca agattggctc cttaaaccac ccacaaggca ccagctctgg A gagaagctgc agggagaaga gaacaaagcc ctcgctgtga tcaggatggg tgtctcatac _w.bb cttttctctg gggtcattcc as G TAT ^ Tsyra 4.ss 204 ..I 255 23 =. Glut GAG j5qaqaacta tatggaaaag tcagcttaaa agaaatgcag ggaggctggg tacagtggtg cgtgcccata gtcccagcta cttgggaggc tgagacagga ggatcacttg acacaggag FIG. 4. Mapping the transcription site of the 50-kDa keratin. A III tttgagtcca gctgggca caaggttaga ccctgtccaa 423-bp Ava II genomic fragment, containing 172 bp of sequence aaaaagagag agagagagag agagattgag agagagagag agaagggaga gtcgagatag complementary to the 5' end of the 50-kDa keratin mRNA and aattgtgatg gtgggagggc agtattcagg cctaaggaac accaatccgc tgccatggtg 251 gaactcctga ctgtggactg tccctggct tqcag GAG bp of sequence 5' upstream from the putative transcription initiation Glu site, was isolated and 5' end-radiolabeled (9). The labeled fragment 256 was denatured, and the two strands were separated by polyacrylam- 309 ide gel electrophoresis (19). Ninety percent of each of the two la- LyS beled strands were subjected to each of the four chemical reactions AAG gtggtgtca tttgaggtgg aaggaaccca gaccacctgc cttctggggc cttctggtgt IV gaatggcatt ctcttttttg cag ACA of Maxam and Gilbert (19). Ten percent of each strand was dena- IV~~~~~~~~~~~~~Thr tured and hybridized with 5-Mg aliquots of human epidermal mRNA 310 (13) as described (26). 
RNA-DNA hybrids were subsequently treat- 3514 ed with S1 endonuclease to digest any part of the DNA fragment that did not hybridize (26). All samples were resolved on a DNA se- ATS; qtaatag tgccaggaag ggtggtgcac ccaggactgg cagggagaga acggccacac V tcactaatcg ttgattccc ttccctccct cacag AAA quencing gel. Lanes: 1-4, C, C + T, G + A, and G Maxam and Lys Gilbert sequencing reactions; 5, nuclease Si-digested DNA frag- 352 ment from the mRNA-DNA hybrid of the strand complementary to 424 the mRNA. The sequence of the strand complementary to the mRNA is shown, so that a direct correlation can be made between GCC CA 5Eagtcttg gccctcccct tagtccgccc cccccatggc actctcacgg ccccaccatg VI tatctaatga tcctgtcctt ttctattttc acaq C CTC the transcription initiation site and the relative migration of the nu- Leu clease Si-treated hybrid. The two S1 bands of 173 and 172 nucleo- P 426 tides probably represent nuclease S1 bands from capped and un- 426 capped mRNAs, respectively. Asp Vt GAT G gtaagaccct cctcctctgc aggcCtgggc tccaggccac cctctgtacc ccaagcaggt ctaggcattg gctaggggct ccgtgagggg ctgagctcta gtgctgtcac ccagtttccC tide residues 5' upstream from the TATA box of the keratin ttgtgaacct ccttgggtgg aagaagctat tttctaaacc ctccttaggg ctaggagagg cagcccccac ctcttgcctt ctacgtggtg tctgtggcag atcctattag ctgttgtggt gene, the sequence T-(G)1_3-A-A-A-G is found. This se- cagcaccatg aacaagggcc ctacagcggt cttcccactg agaccactcc attgggtgaa quence is similar to the consensus enhancer sequence G-T- VII tatggatgga accagccagg tgtgagctct taggaagctc taatctgagg gcaaagactc tgtctctgac ctttgggagc cctcgtctga aagaaatgtg ttgatggtat cagtgcttgg G-G-A-A-A-G found in some viral and immunoglobulin gcaacagcag ggagtgaagc agtaatcagg ggagagggca atggggagcc agtttgagtt genes expressed at high levels (29-31). Enhancer elements tcctcacctt cttggcctcc ttactcctga ttagtccatt gtctgtccac ctctggtaac gtcctcttcc cacctcttcc ccag TG M'C often function in a tissue-specific manner (30, 31). 
FIG. 3. Complete intron sequences of the 50-kDa keratin gene. The consensus donor and acceptor sequences commonly found for most exon-intron junctions (22) are listed at the top, with the arrows marking the excision points. Introns are identified on the left by Roman numerals, and their sizes are: I, 1256 bp; II, 562 bp; III, 335 bp; IV, 83 bp; V, 95 bp; VI, 94 bp; and VII, 564 bp. Nucleotide residues at the keratin exon-intron junctions that are homologous with the consensus sequence are underlined. In intron III, the sequence homologous to the human Alu DNA consensus sequence (23) and the purine-rich stretch following this sequence are in boldface type.

Recently, specific interactions between enhancer-containing sequences and cellular components have been identified (34). If the proteins that interact with enhancer elements are tissue-specific, then it is easy to see how the presence of these sequences within a gene or its flanking regions might regulate the expression of that gene in a tissue-specific manner. Moreover, since these sequences have been shown to have a profound effect on the transcriptional activation of a gene, they might also be important in regulating the amount of protein synthesized in a particular cell.

Although we do not yet know if these sequences enhance the transcription of the 50-kDa keratin gene or influence its

tissue-specific expression, we do know that the 50-kDa type I keratin and its pair, the 56-kDa type II keratin, are expressed in abundance comprising up to 15% of the total protein synthesis of the basal epidermal cell (16). In contrast, the vimentin gene, containing fewer (if any) enhancer-like elements, is expressed in fibroblasts at much lower levels (1). We expect that these sequences might account for the wide variability in the levels and tissue-specific expression of these IF genes. However, the exact nature of the differential regulation of these genes must await further analyses of the sequences and factors that mediate their expression.

We thank Dr. Ed Fritsch (Genetics Institute, Boston) for the human genomic library. We thank Drs. Don Cleveland and Kevin Sullivan (Johns Hopkins University School of Medicine, Baltimore, MD) for invaluable discussions and for procedures concerning Sanger sequencing and nuclease S1 mapping. We thank Dr. Israel Hanukoglu (Technion Institute of Technology, Haifa, Israel) for guidance in Maxam and Gilbert sequencing. Finally, we are grateful to Meg Eichman, Angela Tyner, Dr. Kwan Hee Kim, Dr. Kurt Drickamer, and Rose Wheaton for their willingness to provide help and assistance whenever it was needed. This work was carried out by a grant from the National Institutes of Health. E.F. is the recipient of a National Institutes of Health Career Development Award and a Presidential Young Investigator Award.

"CORE" GTG G AAA
GTTAGGGT GAAAGCCCCAGG; SV40 238 215
CAAGGGGAATGGAAAGTGCCAGAC KERATIN -71 -48
CTGTGGGTG&TG&AAGCCAAGGG< G; KERATIN 50 kDa -88 -65
GGGTCCGATGGGAAAGrGTAGCCIT KERATIN - 253-
AGGTATCT6TGGTA&lCGGTTCCT MSV 134 165
GAGGGCGTGTGT¶TGCAAGAGG POLYOMA 5180 5203
TTGTAGCTGTGGTTTGAAGAAGTG I9CP 450 473
TTUAAGAAWAAWa-7I-AAA^ACA I C 467 490
TTTACCTAGTGGTTTTATT Ig C 448 430
GATGGCtCTCWGAAIGTCCCCTCT I CK 3957 3980
ATCAAGAGCTGG&AiGAGAGGGT% Ig CX1 926 c948

FIG. 5. A comparison of sequences in the 5' regulatory region of the keratin gene with those of viral and immunoglobulin enhancer elements. Sequences are aligned at the putative "core" (underlined) sequences (29). The residues that match those in the simian virus 40 (SV40) enhancer element, including eight nucleotides on either side of the "core" sequence, are indicated by an asterisk. These include either an A or T residue in positions 5-7 of the "core" sequence. @ indicates that the enhancer element is on the strand complementary to the coding strand and in the direction 3' to 5'. Numberings of the sequences are as described (31). MSV, murine sarcoma virus.

1. Lazarides, E. (1982) Annu. Rev. Biochem. 51, 219-250.
2. Moll, R., Franke, W. W., Schiller, D., Geiger, B. & Krepler, R. (1982) Cell 31, 11-24.
3. Fuchs, E., Coppock, S., Green, H. & Cleveland, D. (1981) Cell 27, 75-84.
4. Hanukoglu, I. & Fuchs, E. (1983) Cell 33, 915-924.
5. Kim, K. H., Rheinwald, J. G. & Fuchs, E. (1983) Mol. Cell. Biol. 3, 495-502.
6. Nelson, W. G. & Sun, T.-T. (1983) J. Cell Biol. 97, 244-251.
7. Franke, W. W., Schiller, D. L., Hatzfeld, R. & Winter, S. (1983) Proc. Natl. Acad. Sci. USA 80, 7113-7117.
8. Jorcano, J. L., Magin, T. M. & Franke, W. W. (1984) J. Mol. Biol. 176, 21-37.
9. Hanukoglu, I. & Fuchs, E. (1982) Cell 31, 243-252.
10. Crewther, W. G., Dobb, M. G., Dowling, L. M. & Harrap, B. S. (1980) Proc. Sixth Quinquennial Intl. Wool Text. Conf. 2, 79-81.
11. Dowling, L. M., Parry, D. A. D. & Sparrow, L. G. (1983) Biosci. Rep. 3, 73-78.
12. Steinert, P. M., Rice, R. H., Roop, D. R., Trus, B. L. & Steven, A. C. (1983) Nature (London) 302, 794-800.
13. Fuchs, E. & Green, H. (1979) Cell 17, 573-582.
14. Roop, D. R., Hawley-Nelson, P., Cheng, C. K. & Yuspa, S. H. (1983) Proc. Natl. Acad. Sci. USA 80, 716-720.
15. Magin, T. M., Jorcano, J. L. & Franke, W. W. (1983) EMBO J. 2, 1387-1392.
16. Sun, T.-T. & Green, H. (1978) J. Biol. Chem. 253, 2053-2058.
17. Quax, W., Egberts, W. V., Hendriks, W., Quax-Jeuken, Y. & Bloemendal, H. (1983) Cell 35, 215-223.
18. Marchuk, D., McCrohon, S. & Fuchs, E. (1984) Cell 38, 491-498.
19. Maxam, A. M. & Gilbert, W. (1980) Methods Enzymol. 65, 499-560.
20. Sanger, F., Coulson, A. R., Barrel, B. G., Smith, A. J. H. & Roe, B. A. (1980) J. Mol. Biol. 143, 161-178.
21. Anderson, S. (1981) Nucleic Acids Res. 9, 3015-3026.
22. Mount, S. M. (1982) Nucleic Acids Res. 10, 459-472.
23. Jelinek, W. R. & Schmid, C. W. (1982) Annu. Rev. Biochem. 51, 813-844.
24. Breathnach, R. & Chambon, P. (1981) Annu. Rev. Biochem. 50, 349-383.
25. Baralle, F. E. & Brownlee, G. G. (1978) Nature (London) 274, 84-87.
26. Berk, A. J. & Sharp, P. A. (1977) Cell 12, 721-732.
27. Kozak, M. (1981) Nucleic Acids Res. 9, 5233-5252.
28. Steinert, P. M. & Idler, W. W. (1975) Biochem. J. 151, 603-614.
29. Moreau, P., Hen, R., Wasylyk, B., Everett, R., Gaub, M. P. & Chambon, P. (1981) Nucleic Acids Res. 9, 6047-6068.
30. Laimins, L. A., Khoury, G., Gorman, C., Howard, B. & Gruss, P. (1982) Proc. Natl. Acad. Sci. USA 79, 6453-6457.
31. Gillies, S. D., Morrison, S. L., Oi, V. T. & Tonegawa, S. (1983) Cell 33, 717-728.
32. Zehner, S. E. & Paterson, B. M. (1983) Proc. Natl. Acad. Sci. USA 80, 911-915.
33. Capetanaki, Y. G., Ngai, J., Flytzanis, C. N. & Lazarides, E. (1983) Cell 35, 411-420.
34. Scholer, H. R. & Gruss, P. (1984) Cell 36, 403-411.