Proc. Natl. Acad. Sci. USA Vol. 80, pp. 4253-4257, July 1983 Biochemistry

Protein structural domains in the unc-54 myosin heavy chain are not separated by introns
(ATP binding site/eukaryotic promoters/splicing/colinearity with genetic map)

JONATHAN KARN, SYDNEY BRENNER, AND LESLIE BARNETT

Laboratory of Molecular Biology, Medical Research Council Centre, Hills Road, Cambridge CB2 2QH, England

Contributed by Sydney Brenner, March 22, 1983

ABSTRACT The 1,966-amino acid unc-54 myosin heavy chain sequence was determined from DNA sequence studies of the cloned gene. The gene is split by eight short introns, 48-561 base pairs long, and appears to lack a "TATA" box at its promoter. The physical map of the gene was aligned with the genetic map by locating two point mutations and three internal deletions: 0.01 map units correspond to approximately 5 kilobases. Comparison of the unc-54 sequence with the sequence of a second myosin heavy chain from nematode indicates that the globular head sequence, S-1, is more highly conserved than the α-helical coiled-coil rod. Major sites of proteolysis in S-1 are associated with variable sequences that have the characteristics of surface loops. In both genes there is no correlation between the positions of introns and the major protein structural domains.

The publication costs of this article were defrayed in part by page charge payment. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. §1734 solely to indicate this fact.

Abbreviations: kb, kilobase(s); bp, base pair(s).

The major myosin heavy chain isozyme of the body wall musculature of the soil nematode Caenorhabditis elegans is encoded by the unc-54 I gene (1, 2). Mutants in this gene produce paralyzed muscle (3-5) and are readily isolated. Most unc-54 alleles are recessive null mutants that fail to produce the unc-54 heavy chain. There are also a number of dominant and semidominant mutations with normal levels of altered myosin heavy chains that either fail to assemble into thick filaments or assemble normally but have defective function (2, 4, 5).

This extensive genetic material provides a basis for defining sequences in the myosin heavy chain gene that are of importance to thick filament assembly, tension generation, and gene expression. In previous papers we have described the molecular cloning of the unc-54 gene in lambdoid phages (6, 7). Here we present the complete nucleotide sequence of the gene, identify transcriptional and processing signals, and analyze the deduced amino acid sequence of the heavy chain molecule.

METHODS

Sequence Determination Strategy. Twenty-two recombinants in bacteriophage λ1059 containing 15- to 20-kilobase (kb) fragments of nematode DNA have been isolated that hybridized to unc-54 cDNA probes (6, 7). It was noted that cleavage with HindIII, Xho I, and Sac I produced numerous small restriction fragments from the unc-54 gene. Because vector arms are not cut by these enzymes, these fragments could be quickly cloned by "shotgun" experiments into bacteriophage M13 vectors for sequence analysis by the dideoxy method (8). This yielded a considerable portion of the rod sequence. To complete the sequence, restriction fragments overlapping regions of previously determined sequence were purified by gel electrophoresis on low-melting temperature agarose. Short fragments suitable for insertion into M13 vectors were generated by subsequent cleavage with restriction enzymes with 4-base-pair (bp) recognition sequences, purified, and cloned. Although this "semidirected" approach involved more laborious cloning experiments than a more "random" approach (8, 9), new sequence can be accumulated at comparable rates with the two methods.

Location of Introns and Termini. The sequence of approximately 800 residues of the 3' end of the unc-54 mRNA, covering the three introns within the rod sequence and most of the 3' untranslated region, was determined from a cDNA clone prepared by A. R. MacLeod in this laboratory (unpublished data). The positions of the other introns were inferred from inspection of the sequence and confirmed by nuclease S1 mapping (10). Some of the intron positions could be verified by comparing the deduced unc-54 protein sequence with those of homologous peptides from rabbit skeletal muscle myosin determined by M. Elzinga and his colleagues (11-13) and R. C. Lu (14) or with nucleotide sequences of the other myosin heavy chain isoforms of nematode (unpublished data). Identification of the 5' end of the unc-54 gene was based on homology to one of these other genes (myosin II) and to the amino-terminal sequence of the rabbit skeletal myosin heavy chain.

Isolation of Other Nematode Myosin Heavy Chain Genes. The λ1059 clone pool was screened by using a cloned BamHI-Bgl II fragment covering the active thiol region of the unc-54 gene (residues 3,989-4,401). This was chosen as a probe because the amino acid sequence of this region is highly conserved between nematode and rabbit (12, 15). Fifty-two hybridizing clones were classified by restriction enzyme digestion. Three major sequences in addition to unc-54 clones were represented. Sequence analysis has shown that each is a myosin heavy chain sequence, and it seems likely that they include the genes corresponding to the residual body-wall myosin and the two pharyngeal myosin heavy chains, but this has not yet been demonstrated formally.

SEQUENCE OF THE unc-54 GENE

Correlations with the Genetic Map. The sequence of the unc-54 gene is given in Fig. 1. The gene is split by eight short introns that vary in size from 561 to 48 bp; the entire coding sequence for the 1,966-amino-acid heavy chain covers 7,266 bp. The positions of the amber mutation e1300 (residue 7,993; ref. 16) and the missense mutation s74 (residue 2,885; unpublished data) have been obtained from sequence analyses. This permits alignment of the physical map with the genetic map produced by Waterston and his colleagues (4) and demonstrates that 0.01 map units correspond to approximately 5,000 bp. The approximate end points of the deletions in e675, e190, and s291 have been determined also by Southern transfer experiments and are consistent with the genetic data. The recessive alleles e1201 and e1092 give rise to truncated myosin heavy chains,

FIG. 1. Nucleotide sequence of the unc-54 myosin heavy chain gene. Encoded amino acids are given on the line above the sequence in the standard one-letter code.
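The calibration quoted in the text (0.01 map units for approximately 5,000 bp) can be applied to the two sequenced point mutations as a quick arithmetic check. The sketch below is illustrative only: the positions are the residue numbers given in the text, while the function names and the assumption of a uniform bp-per-map-unit ratio across the gene are ours, not the authors'.

```python
# Residue numbers (Fig. 1 coordinates) quoted in the text.
E1300_POS = 7993  # amber mutation e1300
S74_POS = 2885    # missense mutation s74

# Calibration stated in the text: 0.01 map units ~ 5,000 bp.
# Assumption for this sketch: the ratio is uniform across the gene.
BP_PER_MAP_UNIT = 5000 / 0.01


def physical_distance(pos_a: int, pos_b: int) -> int:
    """Distance in bp between two positions on the sequence."""
    return abs(pos_a - pos_b)


def to_map_units(distance_bp: float) -> float:
    """Convert a physical distance to map units using the stated calibration."""
    return distance_bp / BP_PER_MAP_UNIT


distance = physical_distance(E1300_POS, S74_POS)
print(f"e1300 to s74: {distance} bp, about {to_map_units(distance):.4f} map units")
```

The two mutations lie 5,108 bp apart on the sequence, which the stated calibration converts to roughly 0.01 map units.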

which may be detected by cell-free translation of mRNA from these strains (17). The approximate location of these alleles, calculated from the size of these fragments, is also in agreement with the genetic map.

Unusual Features at Intron Junctions. The introns in the unc-54 gene are unusually small and infrequent but show well-conserved junction sequences. The 48-bp intron is the shortest intron yet described. The G-T-A-G rule is followed, although some deviations from vertebrate consensus sequences at intron junctions are observed (18). The 5' end of each intron is defined by the 10-bp consensus sequence: G(8)-T(8)-R(7)-A(7)-G(7)-T(6)-T(6)-T(6)-T(6)-Y(6), where R and Y are unspecified purine and pyrimidine residues, respectively, and the numbers in parentheses give the frequency of occurrence of the nucleotide in the eight sequences. This consensus sequence is longer than usual. The 3' ends of the introns are defined by the consensus sequence: Y(6)-A(5)-Y(5)-Y(5)-A(8)-Y(6)-Y(6)-N-Y(8)-N-N-N-Y(5)-N-T(7)-T(8)-T(6)-C(7)-A(8)-G(8), where N is an unknown nucleotide. This sequence is unusual because of the frequent presence of purines in the normally long polypyrimidine tract that precedes the invariant A-G (18). The invariant A residue at position 6 of the consensus is a second unusual feature. Statistical analysis (19) of codon usage within introns and exons has shown that codon-usage biases in coding sequences are not followed within the introns, which are A+T-rich and frequently contain homopolymer tracts. Introns also have been observed in two collagen genes from C. elegans (20). These genes also have extended consensus sequences at the junctions, and the

FIG. 2. Comparison between the nucleotide sequence of the unc-54 gene (upper sequence) and that of a second myosin heavy chain gene from nematode (myosin II; lower sequence) in the region of the amino-terminal exon. Dashes are used as gaps introduced to maximize homology; asterisks between the sequences indicate identical nucleotides. The amino acids encoded by the unc-54 gene are shown (standard one-letter code) above the nucleotide sequence; the amino acids encoded by myosin II are shown below the sequence. Note the two blocks of 5' terminal homology.
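The intron-junction consensus sequences given in the text can be written with the usual ambiguity codes (R, Y, N) and compared against a candidate junction position by position. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, the frequency counts in parentheses are ignored, and the example donor site GTGAGTATTT is our transcription of the start of the first unc-54 intron as it appears in the figures, so it should be treated as illustrative rather than authoritative.

```python
# Ambiguity codes used in the text: R = purine, Y = pyrimidine, N = any base.
AMBIGUITY = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "N": "ACGT",
}

# Consensus sequences from the text, with the per-position counts dropped.
DONOR_CONSENSUS = "GTRAGTTTTY"               # 5' end of intron, 10 bp
ACCEPTOR_CONSENSUS = "YAYYAYYNYNNNYNTTTCAG"  # 3' end of intron, 20 bp


def consensus_matches(site: str, consensus: str) -> int:
    """Count positions at which `site` agrees with `consensus`."""
    if len(site) != len(consensus):
        raise ValueError("site and consensus must have the same length")
    return sum(base in AMBIGUITY[code]
               for base, code in zip(site.upper(), consensus))


def follows_gt_ag_rule(intron: str) -> bool:
    """True if the intron starts with GT and ends with AG."""
    seq = intron.upper()
    return seq.startswith("GT") and seq.endswith("AG")


# Assumed example: first ten bases of the first intron, as transcribed
# from the figures (illustrative only).
first_intron_start = "GTGAGTATTT"
score = consensus_matches(first_intron_start, DONOR_CONSENSUS)
print(f"{first_intron_start}: {score}/{len(DONOR_CONSENSUS)} positions match")
```

A match count is a deliberately crude measure; a position weight matrix built from the per-position frequencies would use more of the information in the consensus, but the counts alone do not specify the minority bases.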

"25 Kd" Peptide * * ** * t******** *********** ** * ** * * MEHEKDPGWQYLRRTREQULEDOSKPYDSKKNVWIPDPEEGYLAGEITATKGDQV1IVTAREMSUIQVTLKKELUQEMNPPKFEKTEDMSNLSFLNDASVLHNLRSRYAAMLIYTYSGLF DQSRPYDSKKNUWIPDAEEGYIEGUIKGPGPKADIUIUTAGGKD--VTLKKDIUQEVNPPKFEKTEDMSNLTELNDRSVLWNLRSRYAAMLIYTYSGLF 10 20 30 40 50 60 70 s0 90 100 ili 120 E------ATP binding ?------I C---"loop"---2 "50 Kd" Peptide * * * * * ***** ******* CVUINPYKRLPIYTDSCARMFMGKRKTEMPPHLFAVSDEAYRNMLQDHENQSMLITGESGAGKTENTKKUICYFAAVGASQQEGGAEVDPNKKKUTLEDQIUQTNPULEAFGNAKTURNN CVuINPYKRLPIYTDSUARMFMGKRRTEMPPHLFAVSDQAYRYMLQDHENQSML11GESGAGKTENTKKVICYFATVGASQKAALKEGEKE---UTLEDQIVQTNPVLEAFGNAKTVRNN 130 140 150 160 170 10 190 200 210 220 230 240 * * ** **- ** * * ** ** * NSSRFGKFIRIHFNKHGRLASCDIEHYLLEKSRVIRQAPGE'RCYHIFYQIYSDFRPFELKKELLLDLPIKDYWFJAQAELI IDGIDDVEEFQLTDEAFDILNFSAUEKQDCYRLMSAHMHM NSSRFGKFIRIHFNKHGTLASCDIEHYLLEKSRVIROAPGLRCYHIFYQIYSDFKPQLRDELLLNHPISNYWFVAQAELLIDGIDDTEEFQLTDEAFDVLKFSPTEKMDCYRLMSAHMHM 250 260 270 Z00 290 300 310 320 330 340 350 360 GNMKFKGRPREEGAEPDGTUEAEKASNMYGIGC*E*EFLKALTKPRUKVGTEWVSKGON*CEQVNWA*UGAMKGLY*SRVFN*WLKKCNLTLDQKGIDRDYFIGVLDIAGFEIFDFNSFEGLWI* * * *** ** * ** * * * * * GNMKFKGRPREEQAEPDGQVEAERACNMYGIGDVQFLKALVSPRUKVGTENVSKGONVDQVHWAIGAMAKGLYARVFHWLVKKCNLTLDQKGIDRDYFIGVLDIAGFEIFDFNSFEGLWI 370 380 390 400 410 420 430 440 450 460 470 480 * * * * * NFUNEKLQQFFNHHMFVLEQEEYAREGLIQWTFIDFGLDLQACIELIEKPLGIISMLDEECIUPKATDLTLASKLTDQHLGKHPNFEKPKPPKGKQGEAHFAMRHYAGTVRYNCLNWLEKNNF'UNEKLQQFFNHHI*FVLEQEEYAREGIQWTFIDF'GLDLOACIE LIEKPLGIISMLDEECIUPKATDIITLAQKLTDG4LGKHPNF'EKPKPPKGKQGEAHLAMRHYAGTVRYNULNWLEKN 490 500 510 520 530 540 550 560 570 590 590 600 (---"loop"---] "23 peptide Active Thiols Kd"' * * * * **F**** * * * IPNEKKQSGMZDAALIJLNQLTCNGULEGIRICRKGFP^ ^ KDPLNDTVVSVMKASKKNDLLVEIWQDYTTQEEAAAAAKAGGG--RKGGKSGSFMlVSMMYRESLNKLMTMLHKTHPHFIRCIIPIEKKQSGMIDAALVLNQLTCNGULEGIRICRKGFPKDPLNDTVVSAMKQSKGNDLLUEIWQDYTTQEEAAAKAKEGGGGGKKKGKSGSFMIIJSMLYRESLNNLMTIILNKTHPHF'IRCI 610 620 630 640 650 660 670 680 690 700 710 
720 E----"Swivel -----] * * * * * *** **** ** * * * * * * ** * * ***** *** ** ***** * * * * NJRTLHPDFVGRYAILAAKEAKSDDDKKKCAEAIMSKLVNDGSLSEEMFRIGLIKVFFKAGVLAHLEDIRDEKLATILTGFQSGIRWHLGLKDRKRRMEQRAGLLIVQRNVRSWCTLRTWE NRTOHPDFVORYAILAAKEAKSSDDMKTCAGAILQALINQXQLNDEQFRIGHTKVF FKAGUUAHIEDLRDDKLNQIITGFQSAIRWYTATADAGARRKQLNSYIILQRNIRSWCVLRTWD 730 740 750 760 770 780 790 B00 810 620 030 640 "Rod" * * ** * * * ** ***** * **** ***E **EE**AK t*****T * ****** *E*L **L**A*** * *** * *E* *** ****** *** WFKLYGKVKPMLKAGKEAEELEKINDKVKALEDSLAKEEKLRKELEESSKL VEEKTSLFTHLESTKTQLSD EERLAKLEAOQKDASKQLSELNDQLADNEDRTADUQRAKKKIEAEVE WFLLFGKLRPGLKCGKMAEEMIKMAEEOKULEAEAKKAESARKSGEEAYAKLSAERSKLLEALELTOGGSAAIEEKLTRLNSARQEUEKSLNDANDRLSEHEEKNADLEKQRRKAQQEVE 850 860 870 990 890 900 910 920- 930 940 950 960 * ALKoIQL~nLRKEsE~sK~oIs~oE~oQDEI4KNKEKHEFINRKLAEDLQ;EE*DK¢NHQN;*K**KLEQTLD*DLE*DSLEREKR:RAD*DKQKRKVEGELK*AQE*NIDE*e******e*c:* ** ** * ** *c*c ** * * ** * c* **c**e** * ***e * *e**e * * t4LKK';IEAVDGNLAKSLEEKAAKENGIHSLGDEMNSGDETIGKINKEKKLLEENNRQLVDDLQAEEAKQ ANRRGKLEGTLDEMEEAVEREKRIRAETEKSKRKVEGELKGAQETIDE1090 970 980 990 l00e 1010 1020 1030 1040 (--"Weak"--]1050 1060 1070 *****~****** :W** **** * *** * * **** * * * * *** * * * ****** *** *** ** * * * ** * SGRQRHDLENNLKKKESELHSUSSRLEDEQALVSKLQRQIKDGQSRISELEEELENERQSRSKADRAKSDLQRELEELGEKLDEQGGATAAQUEUNKKREAELAKLRRDLEEANMNHENQ LSArKLETDASLKKKEADIHALGURIEDEQALANRLTRQSKENAQRrIEIEDELEtIEROSRSKADRARAELQRELDELNERLDEQNKQLEIQQDNNKKKDSEIIKFRRDLDEKNMANEDQ 1090 1190 lia 1120 1130 1140 1150 1160 1170 l19 1190 1200 **** * ** c*** ** * * * * * ***c**** ** * c**c**** * * ** * * ecec * eec * * * cc **ce*e* * eec LGGLRKKHTDAVAELTDQLDQLNKAKAKUEKDKAQAURDAEDLAAQLDQETSGXLNNEKLAKQFELQLTELQSKADEQSRQLODFTSLKGRLHSENGDLURQLEDAESQVNQLTRLKSQL MAMIRRKNNDOISALTNTLDALGKSKAKIEKEKGVLQKELDDINAQVDQETKSRUEQERLAKQYEIQVAELQQKUDEQSRGIGEYTSTKGRLSNDNSDLARQUEELEIHLATINRAKTAF 1210 1220 1230 1240 1250 1260 1270 1280 1290 1300 1310 1320 * * ***** ** ** **.* e**** * * ** * e*c**c **e***** 
TSGLEEARRTADEEARERTV*AAQAKNYQIEAEOLOESLEEEIEGKNEILRGLSKANADIQQWKARFEGEGLLKADELEDAKRRQAQKINELQEALDAANSKNASLEKTKSRLUGDLDDA**e**c* e**e*** ece* *e****** SSOLUEAKKAAEDELHERQEFHAACKNLEHELDOCHELLEEQINGKDDIORGLSRJ.NSEISOWKARYEGEGLUGSEELEELKRKQMNRUMDLQEALSAAQNKUISLEKAKGKLLAETEDA 1330 1340 1350 1360 1370 1380 1390 1400 1410 1420 1430 1440 *c ecee*******cee** e*ec******* ******ceeeeee* cc*cecec*** ecc IKIRRLE*IEKEELQHALDEAE*c * * RSDUDRHLTUIASLEKKQRAFDKIVDDWKRKVDDIQKEIDATTRDSRNTSTEVFKLRSSMDNLSEQIETLRRENKIFSQEIRDINEQITQGGRTYQEVHKSVRRLEQEKDELQHALDEAEOUDUERANGU*SALEKKQKGFDKIIDDEWRKKTDDL*AELDGAQRDLRNTSTDLFKAKN*QEELAEUUEGLRRENKSLSQEIKDLTDQLGEGGRSUHEEQK 1450 1460 1470 1480 1490 1500 1510 1520 1530 1549 1550 1560 AALEAEESKVLRAQUEVSQIRSEIEKRIQEKEEEFENTRKNHARALESIQASLETEAKGKAELLRIKKKLEGDINELEIALDHANKANA*DAKNLKRYQEQURELQLQVEEEQRNGADTRcc* * * * ee* * * * **ee * * cc ecec* AALEAEESKULRLGIEUQQIRSEIEKRIQEKEEEFENTRKNHQRALESIQASLEIEAKSKAELARAKKKLETDINQLEIALDHANKANVDAQKNLKKLFDQUKELQGQUDDEQRRREEIR 1570 1500 1590 1600 1610 1620 1630 1648 1650 1660 1670 1680 cece **e*e s * ecec cecec ** * *ee**e* e*ee*** cc* * * * ce*ee * e* ec* * * * cc ** * EQFFNAEKRATLLGSEKEELLVANEAAERARKQAEYEAADARDQANEANAQUSSI1SAKRKLEGEIQAIHADLDETLNEYKAAEERSKKAIADATRLAEELRQEQEHSQHUDRLRKGLEQSAMKRKUCENEVQIARNELDEYLNELKASEERARKAAADADRLAEEVRGEQEHAUHUDRQRKSLEL ENYLAAEKRLAIALSESEDLAHRIEASDKHKKQLEIEQAELKSSNTELIGNNAAL 1750 1760 1770 1780 1790 1900 OLKEIQVRLDEAEAAALKGGKKUIAKLEGRVRELESELDGEQRRFQDANKNLGRADRRVRELQFQVDEDKKNFERLQDLIDKLQQKLKTQKKQUEEAEELANLNLOKYKQLTHQLEDAEE1690 1700 1710 1720 1730 1740 1910 1820 1830 1840 1850 1860 1870. 1090 1890 1900 1910 1920 RADQAENSLSKMRSKSRASASVAPGLQSSASAAVIRSPSRARASDF*-----"Tail ipce-----i 1930 1940 1950 1960 FIG. 3. Comparison between the protein sequences of the unc- 54 myosin heavy chain (upper sequence) and a fragment of a third myosin heavy chain gene from nematode (myosin I; lower sequence). Positions of major features are indicated in the diagram. 
The active thiol residues (12), cysteine-705 and cysteine-715, are indicated by a caret over each residue. Lysine-128 is homologous to a trimethyllysine residue in rabbit skeletal muscle. There are no homologous residues for the two other modified amino acids in rabbit muscle, a second trimethyllysine and a 3-methylhistidine. Asterisks above the sequence indicate amino acid substitutions. Note that sites of proteolysis correspond to poorly conserved sequences and that rod sequences (residues 850-1,944) are less well conserved than head sequences (residues 1-849). The entire 1,966-amino acid unc-54 heavy chain has a calculated molecular size of 229,486 daltons. The major proteolytic fragments defined by homology to rabbit skeletal muscle myosin (11-14) and their calculated molecular sizes are: "S-1", residues 1-849 (95,392 daltons); "25 Kd", residues 1-212 (22,690 daltons); "50 Kd", residues 213-645 (49,840 daltons); "23 Kd", residues 646-849 (22,898 daltons); "Rod", residues 850-1,944 (125,836 daltons); "Short S-2", residues 850-1,165 (36,858 daltons); "LMM", residues 1,165-1,944 (88,996 daltons); and "Tailpiece", residues 1,945-1,966 (8,422 daltons). In the unc-54 gene, introns are found after residues 21, 64, 114, 266, 528, 1,750, 1,822, and 1,897. Introns are found after residues 21, 114, 231, 266, 322, and 1,750 in the myosin I gene. The terms "swivel", "loop", and "weak" are defined in the text.
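
The fragment boundaries and intron positions listed in the legend can be compared directly. The following is a minimal sketch, not part of the original analysis: it takes the last residue of each fragment as the boundary, and it assumes that the myosin I intron positions are given in the unc-54 numbering of the alignment. The names `BOUNDARIES`, `nearest_boundary`, and `report` are illustrative only.

```python
# Fragment boundaries from the Fig. 3 legend, recorded as the last residue
# before the next fragment begins. The legend gives residue 1,165 for both
# short S-2 and LMM; it is used here as the junction (an assumption).
BOUNDARIES = {
    '"25 Kd" / "50 Kd"': 212,
    '"50 Kd" / "23 Kd"': 645,
    "S-1 / rod": 849,
    "short S-2 / LMM": 1165,
    "rod / tailpiece": 1944,
}

# Residues after which introns are found (Fig. 3 legend).
UNC54_INTRONS = [21, 64, 114, 266, 528, 1750, 1822, 1897]
MYOSIN_I_INTRONS = [21, 114, 231, 266, 322, 1750]


def nearest_boundary(intron_after, boundaries=BOUNDARIES):
    """Return (distance in residues, boundary name) for the closest boundary."""
    return min((abs(intron_after - pos), name) for name, pos in boundaries.items())


def report(introns):
    """List (intron position, distance, nearest boundary) for each intron."""
    return [(pos, *nearest_boundary(pos)) for pos in introns]


if __name__ == "__main__":
    for label, introns in (("unc-54", UNC54_INTRONS), ("myosin I", MYOSIN_I_INTRONS)):
        print(label)
        for pos, dist, name in report(introns):
            print(f"  intron after residue {pos}: {dist} residues from {name}")
```

Under these assumptions no intron coincides with a fragment boundary; the closest unc-54 intron (after residue 1,897) lies 47 residues from the rod/tailpiece junction.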

introns are short and less frequent than in the collagen genes of vertebrates. Short introns may be a general feature of invertebrates with minimal genome sizes like C. elegans.

Termini. In Fig. 2, a comparison is made between the unc-54 gene sequence and the 5'-terminal sequence of a second myosin heavy chain gene from C. elegans (myosin II). Two blocks of homology, separated by 27 bp, are found upstream of the initiator methionine. It seems likely that these sequences represent the promoter sequences for the genes and the capping site for the messenger RNA. Neither shows a "TATA" box sequence (21), nor do we find sequences of this type anywhere near the beginning of the gene. The unc-54 gene coding sequence terminates with an ochre codon and appears to have a rather long 3' untranslated region. A typical polyadenylylation signal, A-A-T-A-A-A, is present at residue 8,344, 266 bp after the termination codon. This sequence is present in the 3'-terminal cDNA clone whose sequence was determined. Unfortunately, this clone has no poly(A) tract after this sequence, leaving the exact terminus of the mRNA uncertain.

PROTEIN STRUCTURAL FEATURES

Sites of Limited Proteolysis. Limited proteolysis of myosin splits the molecule into a number of structurally distinct components, including the globular heads (designated S-1), which have enzymatic functions, and the α-helical coiled-coil rods (designated S-2 and LMM), which form the thick filament backbone (22, 23). Tryptic digestion of S-1 produces three discrete fragments of approximately 25,000, 50,000, and 23,000 daltons, aligned in this order within the heavy chain polypeptide (23, 24). The 23,000-dalton peptide is known to contain two cysteinyl residues, SH-1 and SH-2, which may be selectively alkylated and are required for myosin ATPase activity (12). These structural domains have been demonstrated for myosins from a wide variety of species, but sequences surrounding the sites of major proteolytic cleavage are only available for rabbit skeletal myosin. Tentative assignments of the major proteolytic cleavage sites in unc-54, based on homology with the known points of cleavage in the rabbit skeletal myosin sequence determined by M. Elzinga and colleagues (11-14), are indicated in Fig. 3, and the predicted molecular sizes of peptides generated by cleavage at these points are given in the legend to Fig. 3. The sensitive sites in S-1 are in sequences that differ amongst the nematode myosins (Fig. 3) and show sequence characteristics of surface loops.

Potential ATP Binding Site. Walker et al. (25) have noted that a wide range of ATP binding proteins, including the α and β subunits of membrane ATP synthetase complex, adenylate kinase, Escherichia coli rec A, phosphofructokinase, and myosin, share related sequences that may be involved in adenine nucleotide binding. The region of homology involves residues 139-206 of the unc-54 sequence and corresponds to a region of the molecule known to bind photoaffinity analogues of ATP (26). In the adenylate kinase crystal structure (27), the adenine nucleotide binding site is a hydrophobic pocket formed between a β sheet, a glycine-rich loop (residues 16-22), and an α helix (residues 23-30). The nematode sequence shows sufficient homology to suggest an analogous structure, with residues 171-176 as the β sheet, residues 177-184 as the loop, and residues 185-198 as the α helix. Within this cleft, serine-179 could form a hydrogen bond to the 2' OH group of ATP in a manner analogous to serine-19 in adenylate kinase, and the ε-amino group of lysine-183 could interact with the terminal phosphate of ATP.

Rod Sequences. The rod sequence begins with an invariant proline residue (proline-850). The sequence is highly repetitive and composed of 40 cycles of a 28-amino-acid repeat, interrupted at four regularly spaced points by single additional amino acids, termed "skip residues" (15, 28, 29). The repeat pattern is interrupted also at a restricted site corresponding to the S-2/LMM junction. This is the only region in the rod where hydrophobic residues appear on the surface of the coiled coil. We have called this region a "weak" spot to indicate possible flexibility of the rod (15, 28). The rod sequence ends with a nonhelical "tailpiece" beginning with proline-1,943. The tailpiece appears to be a general feature of invertebrate myosins (27) and is also present in the other nematode isozymes.

Head Sequences Are Highly Conserved but Rod Sequences Are Variable. Comparison of the myosin I gene and unc-54 gene sequences shows that 683 out of 828 residues (82.4%) in S-1 are invariant. Most of the amino acid substitutions are conservative, but certain localized regions are highly variable. These include the surface loops noted above, the "swivel" sequence between the active thiols, the beginning of the rod (residues 808-849), and a region in the 25,000-dalton ("25 Kd") peptide (residues 46-68). The high sequence variability suggests that each of these regions might have flexible ill-defined structures. Comparison between the unc-54 sequence and known sequence fragments of rabbit myosin S-1 (11, 12) shows 64.5% matching residues (316 out of 491 residues). In contrast, there are only 53.8% matching residues between unc-54 and myosin I (511 out of 911) in the rod, whereas between unc-54 and rabbit skeletal muscle the rod shows only 42.5% matching amino acids (27). It is notable, however, that the chemical characteristics of the rod sequences are strongly preserved, and that the positions of the hydrophobic, charged, and skip residues are invariant. One section of the LMM region is highly conserved. There are only 18 substitutions between residues 1,430 and 1,555. This section of the rod appears to participate in most of the charge interactions between molecules (15, 27).

Intron Positions Are Not Correlated with Protein Structural Domains. Introns are frequently found between regions coding for structural units in proteins (30-32). The myosin heavy chain genes from nematode do not appear to follow this pattern. Introns do not separate the three major proteolytic subfragments of the head, nor do they separate head and rod sequences. The rod sequence is highly periodic (15, 27) and shows a more regular and persistent repeat than any other fibrous protein yet described. It is notable that the introns do not separate repetitive elements in the rod, are not regularly spaced, and are not associated with the skip residues or weak spot. Although the rod sequences of unc-54 and myosin I genes share a common location for an intron, the myosin II and III genes have uniquely placed introns (unpublished data) in this region. The pattern of intron positions in the nematode myosin genes is inconsistent with proposals that proteins evolve by rearranging exons coding for discrete structural and functional domains (33, 34).

We thank Andrew McLachlan for numerous discussions and instruction in protein folding; Rita M. Fishpool for cheerful assistance; P. Goelet, H. E. Huxley, M. F. Perutz, and R. Staden for their advice; and M. Elzinga for communicating sequence data.

1. Epstein, H. F., Waterston, R. H. & Brenner, S. (1974) J. Mol. Biol. 90, 291-300.
2. MacLeod, A. R., Waterston, R. H., Fishpool, R. M. & Brenner, S. (1977) J. Mol. Biol. 114, 133-140.
3. Brenner, S. (1974) Genetics 77, 71-94.
4. Waterston, R. H., Smith, K. C. & Moerman, D. G. (1982) J. Mol. Biol. 158, 1-15.
5. Moerman, D. G., Plurad, S., Waterston, R. H. & Baillie, D. L. (1982) Cell 29, 773-778.
6. Karn, J., Brenner, S., Barnett, L. & Cesareni, G. (1980) Proc. Natl. Acad. Sci. USA 77, 5172-5176.
7. MacLeod, A. R., Karn, J. & Brenner, S. (1980) Nature (London) 291, 386-390.
8. Sanger, F., Coulson, A. R., Barrell, B. G., Smith, A. J. H. & Roe, B. (1980) J. Mol. Biol. 143, 161-178.
9. Anderson, S. (1981) Nucleic Acids Res. 9, 3015-3027.
10. Berk, A. & Sharp, P. A. (1978) Cell 14, 695-711.
11. Elzinga, M. & Trus, B. (1980) in Methods in Peptide and Protein Sequence Analysis, ed. Birr, Chr. (Elsevier, Amsterdam), pp. 213-224.
12. Elzinga, M. & Collin, J. H. (1972) Proc. Natl. Acad. Sci. USA 74, 4281-4284.
13. Capony, J. P. & Elzinga, M. (1981) Biophys. J. 33, 148 (abstr.).
14. Lu, R. C. (1980) Proc. Natl. Acad. Sci. USA 77, 2010-2013.
15. McLachlan, A. D. & Karn, J. (1982) Nature (London) 299, 226-231.
16. Wills, N., Gesteland, R. F., Karn, J., Barnett, L., Bolten, S. & Waterston, R. H. (1983) Cell, in press.
17. MacLeod, A. R., Karn, J., Waterston, R. H. & Brenner, S. (1979) in Nonsense Mutations and tRNA Suppressors, eds. Celis, J. E. & Smith, J. D. (Academic, New York), pp. 301-311.
18. Mount, S. M. (1982) Nucleic Acids Res. 10, 459-472.
19. Staden, R. & McLachlan, A. D. (1982) Nucleic Acids Res. 10, 141-156.
20. Kramer, J. M., Cox, G. N. & Hirsh, D. (1982) Cell 30, 599-606.
21. Benoist, C. & Chambon, P. (1981) Nature (London) 290, 307-310.
22. Lowey, S., Slayter, H. S., Weeds, A. & Baker, H. (1969) J. Mol. Biol. 42, 1-29.
23. Weeds, A. & Pope, B. J. (1977) J. Mol. Biol. 111, 129-157.
24. Cardinaud, R. (1979) Biochimie 61, 807-821.
25. Walker, J. E., Saraste, M., Runswick, M. J. & Gay, N. J. (1982) EMBO J. 1, 945-951.
26. Szilagyi, L., Balint, M., Sreter, F. A. & Gergely, J. (1979) Biochim. Biophys. Acta 593, 207-211.
27. Pai, E. G., Sachsenheimer, W., Schirmer, R. H. & Schultz, G. E. (1977) J. Mol. Biol. 114, 37-45.
28. McLachlan, A. D. & Karn, J. (1983) J. Mol. Biol. 164, 605-626.
29. Karn, J., McLachlan, A. D. & Barnett, L. (1982) in Muscle Development: Molecular and Cellular Control, eds. Pearson, M. L. & Epstein, H. F. (Cold Spring Harbor Laboratory, Cold Spring Harbor, NY).
30. Dugaiczyk, A., Woo, S. L. C., Lai, E. C., Mace, M. L., McReynolds, L. & O'Malley, B. W. (1978) Nature (London) 274, 328-333.
31. Wozney, J., Hanahan, D., Morimoto, R., Boedtker, H. & Doty, P. (1981) Proc. Natl. Acad. Sci. USA 78, 712-716.
32. Cochet, M., Gannon, F., Hen, R., Maroteaux, L., Perrin, F. & Chambon, P. (1979) Nature (London) 282, 567-574.
33. Gilbert, W. (1978) Nature (London) 271, 501.
34. Lewin, R. (1972) Science 217, 921-922.