The EMBO Journal vol.6 no.5 pp.1393-1401, 1987

Conserved sequence elements in the 5' region of the Ultrabithorax transcription unit

C.Deborah Wilde and Michael Akam

Department of Genetics, Downing Street, Cambridge CB2 3EH, UK

Communicated by M.Ashburner

Clones homologous to the 5' region of the Ultrabithorax gene of Drosophila melanogaster have been isolated from D. pseudoobscura, D. funebris and Musca domestica. Regions that encode most of the Ubx protein have been sequenced in all three of these species, and the 5' upstream region has been sequenced in D. funebris to a point ~1000 bases upstream of the probable mRNA start site. Here we compare these sequences with those described elsewhere for D. melanogaster. Deduced sequences of the Ubx protein show 8% (D. pseudoobscura), 15% (D. funebris) and 22% (M. domestica) divergence from D. melanogaster. However, these figures mask very different rates of evolution in different regions of the protein. A glycine-rich ('hinge') region is conserved in each of these species, although its length is variable. Comparison of D. funebris and D. melanogaster sequences in the long 5' untranslated leader region of the mRNA, and in the region immediately upstream of the start point of transcription, reveals tightly conserved elements embedded in an otherwise non-homologous sequence. These conserved elements include a 118-bp region that spans the mRNA start site, an internally repetitive (TAA)n region in the untranslated leader and a short repeated motif immediately upstream of the ATG codon that initiates the major open reading frame of the Ubx protein. Two other conserved elements were identified upstream of the transcription start site; both elements have structural features consistent with a role as recognition sites for regulatory proteins.

Key words: bithorax complex/diptera/evolution/upstream elements

IRL Press Limited, Oxford, England

Introduction

The Ultrabithorax (Ubx) gene of the Drosophila bithorax complex (BX-C) controls segment identity in parasegments 5 and 6 of the Drosophila embryo, and plays a minor role in the more posterior abdominal segments (Lewis, 1978, 1981; Casanova et al., 1985; Sanchez-Herrero et al., 1985). Within each of these regions, and within each major germ layer, Ubx gene products show different patterns of expression (Akam, 1983; Akam and Martinez-Arias, 1985; White and Wilcox, 1984, 1985a; Beachy et al., 1985). Most, and possibly all functions of the Ubx gene are mediated by homeoproteins encoded in long 5' and 3' exons, and in short microexons located within the 70-kb intron of the Ubx transcription unit (Bender et al., 1983; Akam et al., 1985; Beachy et al., 1985; Hogness et al., 1985). Genetic (Lewis, 1951, 1963, 1978), and more recently molecular evidence shows that the correct expression of these proteins requires other sequences within the Ubx transcription unit, and the close linkage of DNA in the bxd region of the Ultrabithorax domain (Bender et al., 1983, 1985; Hogness et al., 1985; White and Wilcox, 1985b). The normal regulation of Ubx also requires the products of many other identified genes; these include functions required early in development to establish the initial spatial pattern of expression (Ingham et al., 1986; Ingham and Martinez-Arias, 1986; White and Lehmann, 1986) and others whose products are required to maintain the repression of Ubx in those segments where the gene is not normally active (e.g. extra sex combs, Polycomb) (Lewis, 1978; Struhl, 1981; Ingham, 1985; Jürgens, 1985; Struhl and Akam, 1985). Ubx is also regulated directly or indirectly by other homeotic genes, including the abdominal-A and abdominal-B functions of the BX-C, and itself regulates genes of the Antennapedia complex (Hafen et al., 1984; Harding et al., 1986; Struhl and White, 1986).

A highly conserved 180-bp region of the Ubx 3' region is found in many homeotic genes. This encodes the 60 amino acid homeobox, a protein domain which probably mediates DNA binding (Laughon and Scott, 1984; McGinnis et al., 1984; Scott and Weiner, 1984; Shepherd et al., 1984).

[Fig. 1 tree labels: Cyclorrhaphan Diptera; Drosophila melanogaster, Drosophila pseudoobscura, Drosophila funebris, Musca domestica; time axis in M years.]

Fig. 1. Relationships of the Dipteran species used for sequence comparisons (Throckmorton, 1975; Hennig, 1981). Estimates of divergence times for the calyptrate and acalyptrate Cyclorrhapha, for the different subgenera of the genus Drosophila, and for the melanogaster and obscura species groups within the subgenus Sophophora are based on immunological comparisons of the haemolymph proteins (Beverley and Wilson, 1984).

To identify other regions of both the Ubx protein and the Ubx transcription unit that are critical for normal function, we have isolated and sequenced clones homologous to the 5' region of the Ubx gene from three other Dipteran species,

D. pseudoobscura, D. funebris and Musca domestica. D. pseudoobscura is a member of the same sub-genus as D. melanogaster (the Sophophora), but these two species probably diverged 40-60 million years ago (Throckmorton, 1975; Beverley and Wilson, 1984). D. funebris is in a different sub-genus (Drosophila), and M. domestica is a distantly related member of the same sub-order of the Diptera, the Cyclorrhapha (Figure 1). The relationship of all of these species to D. melanogaster is sufficiently distant that unconstrained DNA sequences have diverged extensively, allowing functional elements to be identified by sequence conservation.

Results

Isolation and sequencing of Ubx homologous clones

Preliminary genomic Southern hybridization experiments revealed that each of 10 tested species of the genus Drosophila contained unique restriction fragments strongly homologous to a probe spanning the 5' exon of the Ubx gene (Figure 2). These fragments were clearly distinguishable above a background of heterogeneous fragments showing weak homology to the probe. Genomic clones containing the strongly homologous fragments were isolated from phage libraries of D. funebris and D. pseudoobscura DNA by standard techniques (see Materials and methods).

Fig. 2. Genomic Southern hybridizations to EcoRI-digested DNA from different Drosophila and other Dipteran species. (A) Hybridization of the Ubx 5' probe at moderate stringency. Specific Ubx homologues are evident in all the Drosophila species, and are just detectable in Musca and Calliphora. Arrowheads indicate the cloned fragments in D. pseudoobscura, D. funebris and Musca. (B) Hybridization of the same probe at low stringency to Musca and D. melanogaster DNA. (GGX)n sequences in the Ubx probe now reveal repeated elements present in both species. (C) Hybridization of the Musca Ubx clone to Musca and Drosophila DNA at low stringency. The Musca clone hybridizes strongly to the Ubx fragments of Drosophila sequences, and more weakly to a background of repeated sequences. Each track contains 2.5 μg of DNA. The Musca genome is ~10-fold larger than that of Drosophila, and therefore the molar concentration of unique sequences is lower in the Musca track. The tree above the species names indicates the phylogenetic grouping of the species used, with no attempt to show divergence times. The Drosophila species melanogaster, simulans, erecta and yakuba are all in the melanogaster species group. D. saltans, D. willistoni and D. pseudoobscura are all in the subgenus Sophophora. D. virilis, D. hydei and D. funebris are all in the subgenus Drosophila, and so should share a common ancestor with D. melanogaster. Calliphora, Sarcophaga and Musca are all calyptrate Cyclorrhapha.

Initially we had some difficulty distinguishing a unique Ubx-homologous fragment in Musca DNA. Under moderately stringent hybridization conditions (Figure 2) our probe hybridized to a prominent pattern of multiple bands. However, progressively more stringent washes of parallel filters revealed a 1.8-kb fragment hybridizing most stably to the Ubx probe (Figure 2). One of four clones isolated with the same probe from a Musca genomic library hybridized specifically to this fragment and, when hybridized to Drosophila DNA, uniquely to the 5' Ubx fragment originally used as probe. We were therefore confident that this clone contained a specific Musca homologue of Ubx. The other Musca clones isolated in the same screen hybridized strongly to repeated sequences in the Musca genome, and more weakly to multiple bands in the Drosophila genome. It is now clear that the hybridization to repetitive sequences observed with these Ubx probes is mediated by the stretch of GGX repeats in the Ubx 5' exon (see below, and R.Weinzierl and M.Akam, in preparation). Ubx 5' probes which do not contain this repeat hybridize at moderate stringency only to Ubx sequences, and detect no other fragments in the Drosophila genome.

The nucleotide sequence of the cloned Ubx homologous fragments was determined by the dideoxy chain termination method (Sanger et al., 1977; Biggin et al., 1983) using a combination of random and directed subcloning into M13 (Figures 3 and 4).

Fig. 3. Organization of conserved elements in the Ubx 5' region. (A) D. melanogaster, (B) D. pseudoobscura, (C) D. funebris, (D) M. domestica. Sequenced genomic fragments from the four species are aligned by homology of the major open reading frame (cross-hatched box). The locations of other conserved elements are shown by boxes. These correspond to the regions highlighted in Figure 4. Solid boxes are used for upstream elements and conserved sequences in the Ubx intron. A stippled box shows the partially conserved TAA repeat. Splice donor sites are shown by arrowheads, and the transcription start site is shown with a bold arrow. Sequencing strategies for each fragment are also shown: fragments obtained by sonication, cloned into the SmaI site of M13mp8, and fragments obtained by restriction enzyme digestion, cloned into appropriately cleaved M13mp8 or mp9. Restriction enzyme sites: R, EcoRI; H, HindIII; B, BamHI; Bg, BglII; P, PstI; A, AhaIII; X, XhoII. All R and H sites are shown; others only when used for cloning or sequencing. Scale bar, 200 bp.

Sequence analysis

Preliminary observations of the sequences obtained showed that the fragments sequenced from D. pseudoobscura, D. funebris and Musca each contain a potential protein coding sequence closely homologous to that present in the Ubx 5' exon of D. melanogaster (Beachy, 1986; Weinzierl et al., 1987). In Figure 3, the sequenced regions have been aligned and oriented with respect to this open reading frame. The sequenced fragment from Musca begins at a position corresponding to the second amino acid of the D. melanogaster protein; that from D. pseudoobscura begins just 86 bases upstream of the protein start. Both extend through the 5' exon and into the Ubx intron. The sequenced region of D. funebris extends 2.1 kb upstream of the start of the open reading frame, but only within the open reading frame are there extensive tracts of homology between the two species. Elsewhere islands of homology are separated by regions of apparently unrelated sequence (Figures 3 and 4).

The structure of the predicted Ubx proteins

In D. pseudoobscura and D. funebris the presence of upstream termination codons in frame with the long open reading frame suggests that protein synthesis in these species initiates at the methionine codon equivalent to that used in D. melanogaster. For Musca, we cannot rule out the possibility that the long open reading frame extends further 5', beyond the limits of our clone.

In all of the species, the open reading frame reads through to a consensus splice donor site at a position equivalent to the first of two that are used in D. melanogaster (Beachy et al., 1985; K.Kornfeld, Beachy, 1986; R.B.Saint, P.A.Beachy, D.A.Peattie, P.J.Harte, M.Goldschmidt-Clermont and D.S.Hogness, in preparation). The second splice site, 27 bases downstream, is also conserved in the two other Drosophila species, the intervening nine amino acids being conserved completely in D. pseudoobscura but only partly in D. funebris (Figure 5). No second splice donor site is apparent in Musca. In the genomic sequence of each species, this open reading frame terminates a short distance beyond the splice donor site, ruling out the possibility of much longer readthrough proteins.

Within the protein coding region, nucleotide differences between the species are predominantly third base changes or synonymous codon substitutions. Thus the overall divergence at the DNA level of 15% (D. pseudoobscura), 18% (D. funebris) and 28% (Musca) within the exon leads to amino acid divergence of 8, 15 and 22% respectively. These figures mask considerable differences in the frequency of amino acid substitution in different regions of the protein (Figure 5). The amino-terminal 110 amino acids are relatively conserved, showing only 14% divergence even in Musca. Then follows a glycine-rich region, which in D. melanogaster comprises 13 consecutive glycine


a GGATCCCCGA GCAAG~GATCA GCLIGCPrAG'rP IAGGAGCP(GC CAGCAGCA3'C AG~C 1-'AAC-,CA 3'lP(IfACCCCAGC AG AGCATGA ACTCGTACTT TGAGCAGGCC TCGTC 1 1 120 ACGGCCATCC GCACCAGGCC ACGGGCATGG CCATGGGCTC CGGCGGCCAC CACGACCAGA CGGCCAGCGC CGCCGCGGCC GCCTACCGGG GCTT.TCCCCT CTCGCTGGGC ATGTCGCCGT 121 181 240 ACGCCAACCA CCATCTGCAG CGCACCACCC AGGACTCGCC CTACGACGCC AGCATTACGG CCGCCTGCAA CAAGATCTAC GGCGACGGGG CCAGCGCCTA CAAGCAGGAC TGCCTCAACA 241 301 360 TCAAGGCGGA CGCTGTCAAC GGCTACAAGG ACATCTGGAA CACGGGCGGC TCCAACGGCG GCGGCACAGG CGGCGGAGGG GGCGGCGGCG GAGGTGGCAA TGCATCGAAC GGATCGAACG 361 421 480 CGGGCAATGC CGCCAACGGA CAGAACAACG CCGCGGGCGG CATGCCCGTG CGCCCATCCG CCTGCACCCC CGACTCGCGC GTCGGCGGCT ACTTGGACAC ATCGGGCGGC AGCCCCGTCA 481 541 600 GCCACCGCGG CGGCAGCGCC GGGGGCAACG TGAGCGCCGG CGGCGGGGGC CAAAGCGGCC AAAGCGGCGC CCCAGGCGTG GGCGTCGGTG TGGGAGTGGG CGTGGGCGCC GGGGCGGGCA 601 661 720 CCGCCTGGAA CGCCAACTGC ACCATCTCGG GCGCCGCGGC GGCCCAGACA GCAGCCGCCA GCAGCTTACA CCAGGCCAGC AATCACACAT TCTACCCCTG GATGGCTATC GCAITGL=GT 721 781 40 GCCCAGAAGA TCCGACCAAA TGACGTC CACGGTGCCA -rTGCTCATCC CCATTGCCCA ACCCCTGZIAT CCACAACCCG CAACCCTrCTG ACAAAGCTTT GCTrGTP,CTTC 841 GCCGIOGGTGTT 901 960 TTCCCAGTrTG PP'APrCGA'rGG CGTCPrAPCTrA TrCAGATrc'rATr rAGACATCGG ACCCAACAGA TTTTCAGCTA AAACAACAGC AATCATGTCA ATCGT'rTAAC TTCC'PGGGAA 961 ATATGTAGGT 1021 10800 GTTrGCPTPrA'P PrTCG'P)\GTAA ATGAAACAAC TAAAATAATA ATAAATAAAG CTGCATGAAC AAGATTTCAA c,rTrTAG,rCTA PTTTrAAGcCG CTrCATTCAAG TCTTCAATCT 1081 GCCGATrGGTT 1141 1200 TATTcG'PrGTr CCG3GCAATTT A.AGCGATT'TA GCCAGCCCAT AAATrAAAGCC AGTGAGCTCC CCGGGACCGA GGCAATCTCT ATCCGTrcCTC 0201 T,rcCACCTC,OC AGGAGTrGTCG CACACCTTrAc 1261 1 320 AGCGCTrcCGTr CTGCCTrCATrA ATTTTCAACA CTTrTGTrTT8C 'rAArTPrTTGT CACCTrTTAGT TTTrAATGATG CTCAGCACCG CGCCAGTGGG GGCGACTGAG GCAGCGTCCT 1321 AGTrCGTPCGTC 1381 1440 GTCGTCG'T.CA GTrAATTGTrTG CGTCGGCAAA AGTGGCAACA AAAGTGAGCT GCAATTTGTG CAACAAGAGA GCTGCGAGTrC GGCAACATGA ACAGCACACG GCGAGCAGCG 1441 TAAATTATGG 1501 1 560 CA'rGC'PGCTG GAAGACOCCAC AGAGCAGCCA CAGGAGCCAG 
c,rc'rPT,TTCA GAAAAATTCA GCAGCAAGGA ACGGGGAAAA ACGAATrcCTT TCAGGACACA CACTGAGAGC 1361 AGAAGTCGCG 1621 1680 CTTATTACAG TTAACTACGC CATTATGTAA rTATrGTTCAG GACCTCGCAC CCAGCCTGCA GGGAAGCCAC CAGCCGAGCG CTTGAG,TT173 1681 AGGCGAACGG AGATTrcCAGT G3AGGGTCAGA 1741 1800 ACAATrGGCPrG GAGTAGGTCA CGGTrTcCCCT TC'rGGrcCTTr GGCGGAATCT AGGCGGTCAG CGATCGAATA TAAATTACAC TTTAGAGTTC C:GAACTrTGCC AAGTTCrcPrG AATTC 1001 1861

GAATTCTTAT TATTAAACGC b TTGAAACTCT AATAACAAAT TTATTTGATA TTTTTATGAC AGTTTTTAAG TTTGTTGCCA GATTCGAATT TGTAAAGAAT GCTTAGTTAT ATAAT

TAATATAAAT GTAATCATTA ATATTGAAAT AATAAAATTA TCAAATCAAT ATGCATAACT GATGTGCATT ATTTTATTAA TATAATTTAT CTTAAACATA ACTCTTAAGA 121 TTTGAACAAA 180 240 TTACAGAAAT TCTCCCTATT GGATATTATT TGTATTAAAA TGGAATTATA TTAAAATTTG AAGGGCATTT AAATGTTAAA AAATACTAAT ACATATATAA ATACATAGAT 241 TTCTTTGTAT 301 360 AATTATCCAT TTTAAATAAT TTGAAATTTG TTTTATGTTT TACAATAATT TCATATTTAG ATAGTTTAGT ATTAAAATAA TATGAAATTA AATTTAAAGG AAGATAAATA 361 TTCAACAAGA 421 480 TAATGGCTTT TATAAAAAGC GGACAAAATG CAGCGCGCAA ATTCATAAAA CGAAATTTAA CAAAGAAAGT ATCTGAAACT GTATTTATAT CTGCAGCGCG TGCCGTGCAG 481 TTGGCAACTG 541 600 GCAGCGGACG ACTGTC ICT GGCAACTGGC GGGC TGAA TTGAGTGCAA ATGAGCGAAA CTCTTTTGAG TGCAAATGAA TGCGTATGCA CGCTCGCCTG GCCTAI¶ IAAATA 601 661

ICTCCTCCATG ATGAATTTCd ACGATACAGA AGCCAGAGTA AATTTTCCAT ATGAGCAAAT GATGAATACC GAACAAGTTG GCCACTCAAA CATGTTCGAA TGAATGCGCT 721 CGAACGCAAC 781 840 AAATGAACAC ACAGTTCACT GAGTGAGAGC GCCAGCAAGA GCGAGCGAAG AAGTTTGAGT AGACCTTATC ATATGCTGAG CGCAGTGAGT GAGGCTCATT CTGTTTCAGT 841 TTCGCTCATT 901 960 CATTTTGTTG TTTCGCCTCT CTGTTTTGGC TGCTTCTTCT TCTTGCTGGT GCTGCTGTCT CTGCTGCTGC TGCTGCTGCT GCTGCGTATC GGGGGAGCTC 961 TACAGGCAAA CGACTCTGTC 1021 1080 GACGTTGCTG TTGCCGTTGC CGTTGCCGTA TGCCGTTTGC CGTTGCCGTT TGCTGTTGCC GTTGCCGTTC GCTGCTTGTG GCCCGTTGCC 1081 ATTCAGTTGA GTTTAGTTGA GCCGC 1141 -T-u ---A.OW- - - - - IAAACACIAA AACTGC'GATT TGGT ACA TTCGTTCGAT GGCAACG;I TGGATAACAG GCGCGCGCTT TGTTTTATTA TCCACATTAT 1201 CAGCGGCATT ATTGTTAT* GTATTGTACG I ~~~~~~~~~~~~~~261 1 320 CTCAATTTTA ATGTTGAATG GCCCTCGCGC TCTGTATTCG CATTCACAAT CGCATTCGTA TCTGTAGTGG CCAGTTAATA GTATCTCAAG TTGAATTTCA 1321 ATTTCAATCT GAATCTGAAT 1381 1440 TTGAATTTTG TATCTCAAAA CTGAAACTGA ATCCGAATCC GAAACGTCAG TTGAACGTGT AGCGTGAGCG CGCGTCGCCG TCGTCGGTCG 1441 TTTCGGTCAA_TGCTAATAAC AATAATAATT 1501 1560 ATAATAACAA TAATAATAAC ACAATAATAA TAATAGTTTG TAATAACAAT AATGATAACG CTATGAATGT TAATCGAGTC ATCGAGTTCA CAATGATTGA 1561 ATAGACGCCA CTACAAATAC 1621 1680 TTCAAATCAA TATTTACTTA TTAACAAACA AAATTTATAT GGTGCGAGTG TCAAGTGAAT AAAATACGAG TGCAATTTGT ATAAAAATAC GAAACCAAAA TCAAGTATAC 1681 GCCACATGAA 1741 1800 ACTCTCTTAC AATACAATAA TTATTTAAAC TTAATTAATT TTACAACTTA TACGAGCCGG C0 8 AACAAAT TTTGAAATTT TCAAAATTTA 1801 ?AAAA TTTAAGCAAA 186 1920 ATTAAATTTA AATTAAAAAC TAAATCAAAA GTAAAATATT TTGAATAAAT ATTATATACT CAAATTTGTT AATAACTTAA TCGTATTGCC CAGTGCCCAC AGTGAGCACA 1921 ACACCCGAGA 1981 2040 GTTACACATT TGTATACGAG TGAAGAGTGT GCCGCGCGCA GTTACTTTTA GTAGCTGCTG TAGCTGCTGT GTCACAA 2041 AGCC*ICCGC CAAAd8TT ACCGCCAAG CATGAACTCC 2101 2160 TATTTTGAG AGGCCTCCGG CTTCTATGGC CATCCGCACC AGGCGTCAGG CATGGCAATG GGCTCGGGCG GTCATCACGA TCAGACGACG GCCAGCGCCG CGGCAGCTGC 2161 CTACAGAGGA 2221 2280 TTCACGCTGC CGCTGGGCAT GTCCCCCTAC GCCAACCATC ATCTGCAGCG CACCACACAG GACTCGCCGT ACGATGCGAG 
CATCACGGCT GCCTGCAACA 2281 AGATCTACGG CGATGCTGGC 2341 2400 AGCGCCTACA AGCAGGATTG CCTCAACATT AAAGCAGATG CCGTCAATGG CTACAAGGAC ATATGGAACA CGGGCGGTGC 2481 CAATGGTGGG GGCGGTGGAG GCGGTGGCGG TGCAGCCACC 2461 2528 GCTGGCAACA CCTCCAACGG CTCCAATGCA CCCAACGCTG CCAATGGACA GAATAATGCG GGGGGCGGTG GCATGCCCGT TCGCCCATCC GCCTGCACGC 2521 CCGACTCCCG CGTCGGCGGC 2581 2640 TACTTGGACA CATCGGGCGG CAGCCCTGTC AGTCATCGCG GTGGCGGCAG CGCTGGTGGC GGCAATGGCA CTGCCGGCGG 2641 CGTCCCACAG_AATTCAGCCA GCGGCGTAGG CGGCGGTGTG 2701 276 GGCGCGGGCA CAGCCTGGAA TGCCAATTGC ACCATCTCGG GCGCTGCAGC GGCCCAAACA 2761 GCGGCCGCAA GCAGTTTACA CCAGCCCGGC AATCACACCT TCTACCCCTG GATGGCAATC 2821 2880 GCA42TGAGT CCACAGCAGA TCCAATCAAA CAGCCTAGTG GTAACCTCTC TACTCATATA 2881 +TGAGTGTC CTCATAACTC ATAAGATTCA GTTATGTTAT GTTTAGTTTT CATTTTCACA 2941 3000 TT 3001 1396 Ubx sequence conservation

C GAATTCATAT TTTGAACAAG CTTCTGGCTT TTACGGCCAT CCACATCAGG CCACTGGCAT GTCCATGGGT ACCGCTGGCC ATCATGATCA GTCGGCCACC GCCGCAGCAG CGGCCTACAG 1 61 120 AGGTTTCCCC CTATCGCTGG GCATGACACC CTACACCAAT CATCATCTGC AACGTTCTAC CCAAGATTCC CCCTACGATG CCAGTATTAC CGCCGCCTGT AACAAAATCT ACGGTGATGG 121 181 240 CAATGCCTAC AAACAAGATT GCCTCAACAT CAAATCGGAT ACAATAAACG GCTACAAAGA CATATGGAAT ACCACAGCGA ACGGGGGTGG AGCCGGTGGT GGCGGCGGTA CTGGTGGTGG 241 301 360 TGGTGGCGGT TCAGCGGGCT CGGCGAACGG TGCAAATAAT ACCGCCAATG GTCAAAATAC AAGTGGCGGT GGTGGAGCTG GTGGAGGCGG TGGTATGCCC GTCAGACCCT CAGCCTGTAC 361 421 480 ACCGGACTCA CGGGTGGGCG GCTATTTGGA CACATCGGGC GGTAGTCCGG TAAGCCATCG CGGCGGTAGT GCCGGTGTTG TGGGCGGCGC CGGTACCGGG GTTGGGCAAA GTGGTCAGAG 481 541 600 TGCTAATGTC GGTGGTGCTG GCGGTGTAGG CGGGGCAACG GCTTGGAATG CGAATTGTAC AATTTCAGGA GCCGCCGCCG CACAAACAGC AAGTAGTTTA CACCAGGCCA GCAATCACAC 601 661 720 ATTCTATCCC TGGATGGCTA TCGCAI=TGAGTTGGAAAAA CATTAATTTT ATCATTACTA AAAATGAGTA CAAGAAATGA TACAAATTCA GGGCTTTGAA AAAATTGAAT TAGATTCAAT 7 21 781 8 40 TAAAAAATTT AATTCACCAA GCTCACTAGA TTTCCAGATA CAAGGAGATC AGGAATCGAA CCATGACCAA AAATATTTTT TCATATGTTA GTAGGGGTAA AAAACAAAAT AATAATTGCA 841 901 960 GTGAATACTT AAAGGCATCA GCAGTGTTTT ACCGAAAGTA AAGTTTCTCA CTCTCACTTC TACATATTTT TGGAAAATGC CAATAATTAG ATATATAGAG AACTCCCTAT GTGAAAACAA 961 1021 1080 TGAGTAATAT TATTCTAACA ACCTCCTCGA ATGTATTGTT AGCGTTTTCT GAAAAAGTTA TAAGTTTGCG CCTCTATAAA TTAAGGCTGG TGTTAAGGCC AGGTTTACCC AATTGAACAA 1081 1141 1200 AGTGTATATT TTATTAATTA ATTACATCAT TGCATATAAT ACAAAGTTCA GTCAGCAGTT CCATAAATCA CATAAAATTA TGAAAGT 1201 1261

Fig. 4. Sequences of the genomic fragments cloned from (a) D. pseudoobscura, (b) D. funebris and (c) M. domestica. The long open reading frame in each fragment is overlined. The initiating methionine codons are marked with arrowheads, and the terminating splice donor sites are shown by arrows. Bases in D. funebris homologous to the transcription start sites in D. melanogaster are shown by the large double-flighted arrow. The immediately following upstream open reading frame is overlined with a broken line. Its termination codon is marked with dots. Sequences conserved (≥17/21 bases) between D. melanogaster and the different species within the 5' UT region of the mRNA and upstream of the mRNA start are boxed. The partially conserved TAA repeat is underlined. Regions of the intron of D. pseudoobscura that match the Drosophila sequence (≥17/21 bases) are marked with asterisks. Restriction sites used for cloning are also underlined.
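The 17/21-base criterion used to mark conserved elements in Figure 4 can be sketched as a sliding-window identity count. This is only a minimal illustration, not the authors' procedure (the text reports Diagon comparisons; Staden, 1982). The function names are hypothetical, the sequences are assumed to be pre-aligned without gaps, and the two 34-base strings are copied from Figure 6 as it appears in this copy, with the single unreadable D. funebris position written as N and counted as a mismatch.

```python
def window_matches(a, b, window=21):
    """Return the number of identical bases in each ungapped window.

    Both sequences must already be aligned to the same length; positions
    holding anything other than A, C, G or T never count as a match.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    same = [x == y and x in "ACGT" for x, y in zip(a.upper(), b.upper())]
    return [sum(same[i:i + window]) for i in range(len(same) - window + 1)]


def conserved_windows(a, b, window=21, min_matches=17):
    """Start offsets (0-based) of windows with at least min_matches identities."""
    scores = window_matches(a, b, window)
    return [i for i, score in enumerate(scores) if score >= min_matches]


# 34-base upstream element as printed in Figure 6 of this copy.
# Assumption: the one unreadable D. funebris base is written as N.
D_MEL_34 = "AAGGAAAATCAGCCCTCCTCCATGATGAATTTCC"
D_FUN_34 = "AAGAAAAATCAGCCCNCCTCCATGATGAATTTCC"

scores = window_matches(D_MEL_34, D_FUN_34)
print(min(scores), max(scores))  # worst and best 21-base window
print(conserved_windows(D_MEL_34, D_FUN_34))
```

Under these assumptions every 21-base window across the element stays above the 17-match threshold, which is what one would expect of a region boxed as conserved. Treating unreadable bases as mismatches is a deliberately conservative choice.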

residues located in a larger glycine-rich region; this in part is encoded by a repeat of the sequence GGC/T. A string of glycines within a glycine-rich region is present in each of the species, but the sequence organization of this part of the protein varies sufficiently to make direct alignments difficult. Adjacent to the glycine-rich region is a stretch of 36 amino acids which is perfectly conserved in all of the species, then a further variable region of 20-30 amino acids, and a final well-conserved region of 40 amino acids preceding the splice donor site.

Two short peptides encoded within the Ubx 5' exon are conserved in other homeotic genes. One is the sequence Tyr-Pro-Trp-Met (YPWM) located just prior to the splice donor sites that terminate the Ubx 5' exon (Figure 5). This sequence occurs at an equivalent position, which in the protein is just upstream of the homeo domain, in [...], Antennapedia, Deformed and abdominal-A (F.Karch and W.Bender, personal communication), in three human homeobox-containing genes (Mavilio et al., 1986) and in the mouse genes Hox 1.1, Hox 1.3 and Hox 2.1 (Krumlauf et al., 1987). The other conserved peptide is the sequence Met-(Asn)-Ser-Tyr-Phe (MXSYF), which lies at the amino terminus of the Ubx protein. We notice that the sequence Met-X-Ser-Tyr-Phe occurs internally but close to the amino terminus in Antennapedia (X = Thr) and is the initiating sequence in the mouse gene Hox 2.1 (X = Ser; Krumlauf et al., 1987). Both of these extremely conserved peptides lie within larger conserved domains of the Ubx 5' exon, but little if any further homology can be detected within these 5' coding sequences between Ubx and even the most closely related homeobox-containing genes of Drosophila.

Fig. 5. Protein sequences coded by the open reading frames from D. pseudoobscura, D. funebris and M. domestica compared with the corresponding sequence from D. melanogaster. Numbers refer to amino acid positions in the D. melanogaster sequence (Beachy, 1986; Weinzierl et al., 1987). The two short peptides that are conserved in other homeotic genes are boxed; positions of splice donor sites are indicated by arrowheads. The Dayhoff one-letter amino acid code is used; gaps have been inserted to align homologies.

Conserved features of upstream sequence

The sequenced region from D. funebris extends 2.1 kb 5' from the codon that initiates the long Ubx open reading frame. In D. melanogaster the equivalent region includes ~970 bases of the 5' untranslated leader sequence of the major embryonic transcripts, and ~1 kb upstream of the transcription start site (Hogness et al., 1985; Saari and Bienz, 1987; R.B.Saint et al., in preparation; M.Biggin and R.Tjian, personal communication).

Three regions of the 5' untranslated leader sequence of D. melanogaster are strongly conserved in D. funebris. One of these is a 118-base sequence that spans the start point of transcription in D. melanogaster; it presumably identifies the functionally

equivalent region of D. funebris, implying that this species also has a long (~920 base) untranslated leader sequence. In both species, this sequence includes the first methionine codon of the message, which initiates a short open reading frame located almost immediately downstream of the mRNA start site. These open reading frames do not end in good splice donor sequences, their ATG codons are not embedded in a good consensus for protein initiation in Drosophila (Kozak, 1984; Cavener, 1987), and their codon usage does not match that of other Drosophila genes. In addition, the length and 3' region of these short open reading frames are not conserved between the two species, suggesting that they do not encode a functional polypeptide.

In both species, this 5' conserved region is separated from the long open reading frame by a sequence containing multiple termination codons in all three frames, ruling out the possibility that translation reads through from the first methionine codon to the major open reading frame of the embryonic transcripts. These multiple terminators are encoded by the second conserved feature of the sequence, an imperfect T/CAA repeat, which lies at equivalent positions in the two species.

A third strongly conserved region of the mRNA 5' leader lies immediately upstream of the methionine codon that initiates the major open reading frame of the homeoprotein. There is a 25/27-base match between the sequence ending nine bases upstream of the ATG in D. funebris and that ending nine bases upstream of the ATG in D. melanogaster and also D. pseudoobscura. This conserved motif includes a direct repeat of the sequence ACCGCCA (Figure 6).

Other regions of the D. funebris leader sequence show moderate matches to D. melanogaster, including one (bases 1863-1883) that passes the 17/21 criterion used in Figure 4. These matches lie in regions where both sequences are extremely A-rich, and may have no other specific significance.

In the region upstream of the transcription start site, Diagon (Staden, 1982) comparisons reveal two strongly conserved elements (Figures 3 and 6). A 34-base motif containing an 11-base imperfect inverted repeat lies at -250 in D. melanogaster, and -300 in D. funebris. [The origins for these coordinates are taken as the 5' end of the D. melanogaster mRNA, as determined by primer extension and S1 nuclease analysis (Hogness et al., 1985; Saari and Bienz, 1987; R.B.Saint et al., in preparation), and the equivalent position in the homologous sequence for D. funebris.] Further 5', there is a conserved 18-base motif containing a direct repeat of the sequence ACTGGC. This lies at position -300 in D. melanogaster and -400 in D. funebris.

34-mer
AAGAAAAATCAGCCC_CCTCCATGATGAATTTCC D. funebris
AAGGAAAATCAGCCCTCCTCCATGATGAATTTCC D. melanogaster

Fig. 6. Conserved nucleotide sequence elements. (a) The 24-bp element found in the mRNA 5' UT region of D. pseudoobscura, D. funebris and D. melanogaster. Dashes indicate the distance of this sequence from the initiation methionine codon (boxed). The directly repeated heptamer is indicated by boxes and arrows. Asterisks indicate divergent nucleotides. (b) The 18-bp and 34-bp elements found 5' to the transcription start site in D. funebris. Direct repeats within the elements are indicated by arrows. Arrowheads show the axis of symmetry in the imperfect palindrome of the 34-bp element. Symmetrical nucleotides are lined. Asterisks indicate divergent nucleotides.

With the exception of these conserved elements, the sequences of D. melanogaster and D. funebris in both upstream transcribed and non-transcribed regions have diverged to such an extent that it is not possible to align them.

Sequence conservation in the Ubx intron

The sequenced regions of D. pseudoobscura and Musca extend downstream from the splice donor sites that terminate the 5' exon of the major embryonic Ubx RNAs. In D. melanogaster, sequences within the first 1.8 kb of this intron are present in a 4.7-kb poly(A)- transcript which is only expressed early in development. The structure of this RNA is not known, but the genomic sequence of the proximal intron fragment does not reveal any likely protein coding sequence (R.B.Saint et al., in preparation). Within the 500 bases sequenced in Musca, there is no detectable homology to the Drosophila sequence. There is, however, an intriguing pattern of sequence conservation between D. melanogaster and D. pseudoobscura. Conserved elements ranging in length from 20 to 50 bases are scattered throughout the region sequenced. They do not contain open reading frames of any significant length, and are not flanked by consensus splice sites. Their significance is obscure (Figures 3 and 4).

Discussion

Genetic studies revealed the importance of the Ubx gene as a determinant in Drosophila development for correct segmental identity in the thorax and abdomen of the fly. More recent molecular studies, using hybridization and antibody probes, have shown that transcripts and proteins from the Ubx gene are localized in developing Drosophila embryos with distributions that are broadly consistent with those predicted from the phenotypes of different mutations in the Ultrabithorax domain of the BX-C (Akam, 1983; White and Wilcox, 1984, 1985a; Akam and Martinez-Arias, 1985). At a biochemical level, however, much still remains to be answered about the function of the Ubx gene product and about the mechanisms by which its expression is regulated in the very precise developmental and tissue-specific patterns that have been revealed by in situ studies. The finding that the Ubx gene contains a 'homeobox' which encodes a protein domain with potential DNA binding properties (Laughon and Scott, 1984; McGinnis et al., 1984; Scott and Weiner, 1984; Shepherd et al., 1984), and the observation that Ubx proteins are localized to nuclei (White and Wilcox, 1984; Beachy et al., 1985), provide a partial answer to the molecular function of Ubx in development, suggesting that it may control other effector genes by acting directly as a transcriptional regulator.

The approach that we have adopted here is to use inter-species DNA sequence comparisons to identify regions within the Ubx gene, apart from the homeobox, that have been conserved in evolution. Evolutionarily conserved sequences both pinpoint elements that are likely to be required for the correct expression of the Ubx gene and identify regions within the Ubx protein coding domain that may be subject to functional constraints. In discussing our results we assume that sequence divergence reflects a lack of functional constraints, and not functional divergence of the Ubx genes themselves. Implicit is the naive assumption that the role and regulation of the Ubx gene is identical throughout the genus Drosophila, and very similar in all of the Diptera. This assumption remains to be tested, but it seems unlikely that major changes in either the expression or function of key homeotic genes have occurred rapidly in the evolution of this morphologically conservative lineage.

At the Ubx locus, comparison between species in different families or subgenera of the Drosophila can effectively define even short (20 bp) conserved elements in non-coding regions, which otherwise show high rates of sequence divergence. Similar observations have been made at the heat shock hsp 82 loci in Drosophila (Blackman and Meselson, 1986) and at the engrailed locus (Kassis et al., 1986). These close comparisons do not discriminate so effectively between constrained and less constrained residues within a protein, because sequence divergence is generally slower throughout the protein coding regions. Nevertheless, it is striking that differences in the amino acid sequence of the Ubx proteins from the different species are not uniformly distributed throughout the 5' coding exon. We note that charged amino acid residues are found exclusively in the more conserved regions of the protein, whilst most residues in the variable regions are non-polar or hydrophobic in character. This suggests that the conserved regions of the sequence may comprise a hydrophilic exterior to the molecule, constrained by many functional interactions.

cDNA clones from D. melanogaster shows that either of two splice donor sites (sites I and II) may be used to terminate the Ubx 5' exon, the resulting proteins differing by the presence or absence of nine amino acids. Sites I and II are conserved in all of the three Drosophila species examined, but site II is not found in M. domestica. This suggests two possibilities. This potential for protein micro-heterogeneity may not be significant for the functioning of the Ubx homologue in Musca. Alternatively the 27 base sequences flanked by the alternative splice sites in the Drosophila species may in Musca be found as a separate micro-exon downstream of the fragment we have sequenced. The role of this micro-heterogeneity in Drosophila species is still not clear; the relatively high variation of this sequence (four out of nine amino acid differences) between D. pseudoobscura and D. funebris indicates that structural constraints on this region cannot be very stringent.

The conserved elements in the region downstream of the splice site in D. pseudoobscura may identify sequences that form part of the early non-adenylated 4.7-kb RNA. While the conservation in this region is both striking and quite extensive, it cannot be conceptually translated into a protein sequence of any length. It
The more variable regions would form the interior of is not clear whether this region forms part of an RNA that func- the protein where sequence may be constrained only by a require- tions in some way other than protein coding, whether it comprises ment for hydrophobicity. a non-translated region of an mRNA, or whether its conservation One recurring feature in a number of Drosophila 'develop- reflects some regulatory function at the DNA level. mental' genes is the occurrence of repetitive sequences encoding The 5' untranslated region short homopolymeric runs of a single amino acid (Laughon et Several conserved features of the sequence comprising the 5' un- al., 1985). At Ubx (GGX)n encodes polyglycine in the 5' exon, translated (UT) region of the Ubx mRNA are unusual; it is very and polyalanine in the 3' exon; this repeat also occurs at at least long (925-960 bases), it contains AUG codons upstream of that one other site within the major Ubx intervening sequence, where which initiates the long open reading frame, and sequence el- it is probably not translated (R.Weinzierl and M.Akam, in prep- ements within it are strongly conserved between the Drosophila aration). This same GGX repeat recurs in the Dfd gene where species. With few exceptions mRNA 5' UT regions are < 160 it encodes glycine. Elsewhere permutations of (CAG)n- the bases in length, contain no AUG codons upstream of the start opa repeat (Wharton et al., 1985) - encode strings of glutamine, point of translation, and show no conservation of sequence be- alanine or serine residues in Notch, Antp, Dfd, en and ftz. A tween different species (Kozak, 1983, 1984; Hunt, 1985). 
Ex- different repeat (AAT/C), which in Ubx is present in the 5' un- ceptions to these general 'rules' are heat shock mRNAs, whose translated region of the Ubx gene, is also found in Dfd where 5' UT regions are 180-250 bases long and contain two conserved it encodes a string of asparagine residues within the protein se- regions (Holmgren et al., 1981; McGarry and Lindquist, 1985), quence (Laughon et al., 1985). The significance ofthese repetitive and the mRNA for a yeast general amino acid metabolism gene sequences is enigmatic. In engrailed the opa-encoded glutamine GCN4 whose 577-base 5' UT region contains four upstream and alanine strings are not conserved between D. virilis and D. AUG codons (Hinnesbusch, 1984; Thireos et al., 1984; Mueller melanogaster (Kassis et al., 1986), suggesting that they are not and Hinnebusch, 1986). Both these mRNAs are subject to trans- critical for the functioning of this gene. In contrast, in Ubx the lational control and this has been shown to be dependent on (GGX)n encoded glycine-rich sequence is conserved across simi- features of the mRNA 5' UT region (Hinnesbusch, 1984; Klem- lar evolutionary distances. Despite some variation in its detailed enz et al., 1985; McGarry and Lindquist, 1985). Since the structure, the conservation of a glycine-rich sequence and the unusual, but conserved, features of the Ubx mRNA 5' UT region underlying (GGX)n motif at the same position in all four species are also characteristic of mRNAs known to be under translational described here argues for it having some important role. The control, we suspect that regulation of Ubx gene expression, at most obvious role for this region is at the level of protein second- least in part, will be at the level of mRNA translation. 
Indeed, ary structure; the polyglycine string and glycine-rich sequence a comparison of Ubx transcript and Ubx protein distributions in may act as a 'hinge' region in the Ubx protein (Beachy et al., the blastoderm embryo suggests that the earliest patterns of Ubx 1985). It is also possible that at the DNA level the underlying transcription are not reflected in corresponding protein distri- nucleotide may serve some regulatory or structural butions. While this may be simply a problem of detection, it may role that is responsible for its evolutionary conservation. Simple also reflect a requirement for other zygotic gene products before repeated sequences show a much higher local rate of sequence the Ubx mRNA may be efficiently translated (A.Martinez-Arias, variation than adjacent unique sequences, even in locations where personal communication). The length of the Ubx 5' UT region no differential constraints seem likely (Tautz et al., 1986). Hence and the presence of upstream AUG codons does suggest that the rapid changes in the glycine-rich region of the Ubx protein may Ubx mRNA may be inherently badly translated (see Lomedico reflect not only the evolutionary constraints on this region of the and McAndrew, 1982; Kozak, 1983; Johansen et al., 1984; Liu protein molecule, but also the relative instability of the sequence et al., 1984). If this is indeed the case, one role for the conserved that encodes it. element immediately adjacent to the initiator AUG codon in Ubx The Ubx gene potentially encodes a family of related proteins may be to bind a specific factor which increases the efficiency generated from alternatively spliced transcripts. The structure of of translation initiation at this site. 1399 C.D.Wilde and M.Akam

Translational control may be a useful mechanism in complex developmental systems as it provides a second level of control which can act combinationally with transcriptional regulation to generate more precise patterns of gene expression. It may be particularly important in genes such as Ubx with very long transcription units, where there would be a significant delay between transcription initiation and protein expression. In these cases translational control may be required for rapid modulation of the intracellular concentration of the encoded protein. It is interesting that in addition to Ubx, two other Drosophila genes with very long transcription units, the 120-kb Antennapedia gene and the 60-kb ecdysone-inducible gene at 74EF, also have long (1-2 kb) mRNA 5' UT regions that could be involved in translational regulation (Schneuwly et al., 1986; Scott et al., personal communication; K.Burtis, C.Thummel, W.Jones and D.S.Hogness, personal communication).

Initially we assumed that the repeating TAA motif that is conserved in the 5' UT region functioned only to provide multiple stop codons. It is now clear that this sequence can act in vitro as a binding site for the Ubx protein (Beachy, 1986; M.Krasnow, L.Gavis and D.Hogness, personal communication). It may therefore be of some significance in vivo for the regulation of Ubx expression.

and upstream elements

We do not know what functions are mediated by the conserved sequence of 118 bases around the transcription start site, or by the upstream conserved elements. Factors required for transcription of the heat shock genes bind to the TATA box and to DNA sequences corresponding to the first 30-40 nucleotides of the heat shock mRNA 5' UT region (Parker and Topol, 1984a,b; Wu, 1984). The conservation of the 118-bp sequence around the Ubx transcription start site may well indicate that it is a site to which analogous transcription factors bind. In addition, again by analogy with heat shock genes, the conserved sequences at the 5' end of the mRNA could also reflect their requirement for regulation of Ubx mRNA translation.

The short conserved elements that lie further upstream of Ubx both have structural features that suggest they may be recognition sites for proteins involved in transcriptional regulation. The 34-base palindromic sequence is similar in structure, but not in sequence, to elements that lie 200-1800 nucleotides upstream of yeast genes under mating-type control which mediate repression by binding mating-type proteins (Johnson and Herskowitz, 1985; Miller et al., 1985). The 18-base sequence contains a direct repeat of the heptamer ACTGGCC. Sequences upstream of the ovalbumin gene which are bound by transcription factors also contain short heptameric direct repeats (Pastorcic et al., 1986). Thus in Ubx the conserved 18-base sequence might similarly serve as a recognition site for regulatory transcription factors.

The position of these elements (200-500 nucleotides from the proposed transcription start site) is also consistent with their playing a role in Ubx regulation, even though genetic evidence points to Ubx having an unusually large 5' regulatory region that spans the 25-kb bxd region immediately upstream of Ubx (Lewis, 1978; Beachy et al., 1985; Bender et al., 1985; Hogness et al., 1985; White and Wilcox, 1985b). The segmentation gene fushi tarazu similarly appears to require a relatively large (6 kb) region for its fully normal expression after P element transformation; however, a promoter proximal region of 600 bases is sufficient to establish many aspects of its normal expression (Hiromi et al., 1985). Thus we anticipate that experimental manipulation of the sequences we have identified may reveal major components of Ubx regulation.

Materials and methods

Strains

D. melanogaster DNA was isolated from the wild-type strain Oregon-R (Stanford), and from a multiply marked stock mm carrying an isogenized third chromosome. Other Drosophila species were obtained from the Drosophila stock centre at the University of Texas, now at Bowling Green, OH. M. domestica and Sarcophaga bullata were the gift of Zoecon Inc., California; Calliphora erythrocephala the gift of Dr G.D.Mann.

Southern hybridization

DNA was prepared from a number of different Drosophila and Dipteran species essentially as described by Schachat and Hogness (1973), using as starting material a crude preparation of nuclei from homogenized adults. These were digested to completion with EcoRI, electrophoresed on 0.7% agarose gels, and transferred to azothiophenol (ATP) paper, prepared as described by Seed (1982). These filters were hybridized to probes from the 5' region of the D. melanogaster transcription unit or to phage clones isolated from Musca. Hybridization was in 50% formamide, 5 x SSPE, 0.5 µg/ml denatured salmon sperm DNA, 0.1% SDS, 1 x Denhardt's solution (Maniatis et al., 1982) at 42°C (standard stringency) or 30°C (low stringency). Final washes of the filters were at 65°C in 0.2 x SSPE/0.1% SDS (high stringency) or at 45°C in 2 x SSPE/0.1% SDS (low stringency).

Screening of libraries

DNA from D. funebris and M. domestica was digested to completion with EcoRI, ligated into the insertion vector λ607 (Murray et al., 1977), packaged and plated out onto host Escherichia coli K802. These libraries were screened with a probe prepared by nick-translation of the D. melanogaster insert from the plasmid DM3108, a 3.3-kb EcoRI fragment which spans the 5' region of the Ubx transcription unit, using conditions of reduced stringency (hybridization buffer as above at 33°C, low stringency wash). Hybridizing phage were isolated and purified using standard procedures (Maniatis et al., 1982).

The D. pseudoobscura library used was a gift from R.Blackman; it comprised size-fractionated partial MboI fragments ligated into Charon 30 arms (Blackman and Meselson, 1986). It was screened using the D. melanogaster Ubx fragment as probe under similar conditions to those described above; hybridization for 24 h at 37°C in 50% formamide, 5 x SSC, 1 x Denhardt's, 0.1% SDS followed by washing in 2 x SSC, 0.1% SDS at 37°C for four changes of 20 min each. Isolated phages were restriction enzyme mapped and the Ubx-homologous subfragments were cloned into plasmid vectors.

DNA sequencing

DNA to be sequenced was subcloned into M13 vectors and single-stranded DNA prepared from infected cells by standard procedures. Subclones were made either by ligation of restriction enzyme fragments into appropriately cleaved M13mp8, or M13mp9, or by ligation of randomly sheared fragments (prepared by sonication) into SmaI-cut, phosphatased M13mp8.

DNA was sequenced by the dideoxy chain termination method of Sanger et al. (1977) using a 35S-labelled nucleotide and buffer gradient polyacrylamide gels to enhance resolution of sequencing reaction products (Biggin et al., 1983). DNA sequences were compiled using the DBUTIL programs of Staden (1982a), and analysed using DIAGON and ANALYSEQ programs (Staden, 1982b; Staden and McLachlan, 1982). The reference table of 18 000 Drosophila codons was compiled from the EMBO sequence library by M.Ashburner. The D. melanogaster sequence used for comparisons was kindly provided by P.Beachy and D.Peattie (Beachy, 1986; R.B.Saint et al., in preparation) and in the far upstream region by M.Bienz (Saari and Bienz, 1987).

In calculating the nucleotide and protein sequence divergences, insertions and deletions in the sequence have been given the same weight as single base or amino acid substitutions.

Acknowledgements

We thank R.Bankier, B.Barrell and members of their laboratory for help and advice on all matters to do with DNA sequencing; M.Bishop for assistance with computing and M.Ashburner for his codon usage table. D.Hogness and members of his laboratory provided valuable information before publication; we thank especially D.Peattie and P.Beachy for providing details of the Ubx sequence. We thank D.Hogness, M.Ashburner and A.Martinez-Arias for comments on the manuscript, G.Tear for assistance with the D. funebris sequence and Rosemarie Baines for preparing the typescript. This work was supported by the Medical Research Council of Great Britain.

References

Akam,M. (1983) EMBO J., 2, 2075-2084.
Akam,M. and Martinez-Arias,A. (1985) EMBO J., 4, 1689-1700.
Akam,M., Martinez-Arias,A., Weinzierl,R. and Wilde,C.D. (1985) Cold Spring Harbor Symp. Quant. Biol., 50, 195-200.
Beachy,P.A. (1986) Ph.D. Thesis, Stanford University, CA.
Beachy,P.A., Helfand,S.L. and Hogness,D.S. (1985) Nature, 313, 545-551.
Bender,W., Akam,M., Karch,F., Beachy,P.A., Peifer,M., Spierer,P., Lewis,E.B. and Hogness,D.S. (1983) Science, 221, 23-29.
Bender,W., Weiffenbach,B., Karch,F. and Peifer,M. (1985) Cold Spring Harbor Symp. Quant. Biol., 50, 173-180.
Beverley,S.M. and Wilson,A.C. (1984) J. Mol. Evol., 21, 1-13.
Biggin,M.D., Gibson,T.J. and Hong,G.F. (1983) Proc. Natl. Acad. Sci. USA, 80, 3963-3965.
Blackman,R.K. and Meselson,M. (1986) J. Mol. Biol., 188, 499-515.
Casanova,J., Sanchez-Herrero,E. and Morata,G. (1985) Cell, 42, 663-669.
Cavener,D.R. (1987) Nucleic Acids Res., 15, 1353-1361.
Hafen,E., Levine,M. and Gehring,W.J. (1984) Nature, 307, 287-289.
Harding,K., Wedeen,C., McGinnis,W. and Levine,M. (1985) Science, 229, 1236-1242.
Hennig,W. (1981) Insect Phylogeny. Wiley, Chichester, UK.
Hinnebusch,A.G. (1984) Proc. Natl. Acad. Sci. USA, 81, 6442-6446.
Hiromi,Y., Kuroiwa,A. and Gehring,W. (1985) Cell, 43, 603-613.
Hogness,D.S., Lipshitz,H.D., Beachy,P.A., Peattie,D.A., Saint,R.A., Goldschmidt-Clermont,M., Harte,P.J., Gavis,E.R. and Helfand,S.L. (1985) Cold Spring Harbor Symp. Quant. Biol., 50, 151-194.
Holmgren,R., Corces,V., Morimoto,R., Blackman,R. and Meselson,M. (1981) Proc. Natl. Acad. Sci. USA, 78, 3775-3778.
Hunt,T. (1985) Nature, 316, 580-581.
Ingham,P.W. (1985) Trends Genet., 1, 112-116.
Ingham,P.W. and Martinez-Arias,A. (1986) Nature, 324, 592-596.
Ingham,P.W., Ish-Horowicz,D. and Howard,K.R. (1986) EMBO J., 5, 1659-1665.
Johansen,H., Schumperli,D. and Rosenberg,M. (1984) Proc. Natl. Acad. Sci. USA, 81, 7698-7702.
Johnson,A.D. and Herskowitz,I. (1985) Cell, 42, 237-247.
Jurgens,G. (1985) Nature, 316, 153-155.
Kassis,J.A., Wong,M.L. and O'Farrell,P.H. (1985) Mol. Cell Biol., 5, 3600-3609.
Kassis,J.A., Pool,S., Wright,D. and O'Farrell,P.H. (1986) EMBO J., 5, 3583-3589.
Klemenz,R., Hultmark,D. and Gehring,W.J. (1985) EMBO J., 4, 2053-2060.
Kozak,M. (1983) Microbiol. Rev., 47, 1-45.
Kozak,M. (1984) Nucleic Acids Res., 12, 857-872.
Krumlauf,R., Holland,P.W.H., McVey,J.H. and Hogan,B.L.M. (1987) Development, 99, 603-618.
Laughon,A. and Scott,M.P. (1984) Nature, 310, 25-31.
Laughon,A., Carroll,S.B., Storfor,F.A., Riley,P.D. and Scott,M.P. (1985) Cold Spring Harbor Symp. Quant. Biol., 50, 253-262.
Lewis,E.B. (1951) Cold Spring Harbor Symp. Quant. Biol., 16, 159-174.
Lewis,E.B. (1963) Am. Zool., 1, 33-56.
Lewis,E.B. (1978) Nature, 276, 565-570.
Lewis,E.B. (1981) In Brown,D.D. and Fox,C.F. (eds), Developmental Biology Using Purified Genes, ICN-UCLA Symposia on Molecular and Cell Biology. Academic Press, New York, pp. 189-208.
Liu,C.-C., Simonsen,C.C. and Levinson,A.D. (1984) Nature, 309, 82-85.
Lomedico,P.T. and McAndrew,S.J. (1982) Nature, 299, 221-226.
McGarry,T.J. and Lindquist,S. (1985) Cell, 42, 903-911.
McGinnis,W., Levine,M.S., Hafen,E., Kuroiwa,A. and Gehring,W.J. (1984) Nature, 308, 428-433.
Maniatis,T., Fritsch,E.F. and Sambrook,J. (1982) Molecular Cloning. A Laboratory Manual. Cold Spring Harbor Laboratory Press, New York.
Mavilio,F., Simeone,A., Giampaolo,A., Faiella,A., Zappavigna,Z., Acampora,D., Poiana,G., Russo,G., Peschke,C. and Boncinelli,F. (1986) Nature, 324, 664-668.
Miller,A.M., MacKay,V.L. and Nasmyth,K.A. (1985) Nature, 314, 598-602.
Mueller,P.P. and Hinnebusch,A. (1986) Cell, 45, 201-207.
Murray,N.E., Brammer,W.J. and Murray,K. (1977) Mol. Gen. Genet., 150, 53-61.
Parker,C.S. and Topol,J. (1984a) Cell, 36, 357-369.
Parker,C.S. and Topol,J. (1984b) Cell, 37, 273-282.
Pastorcic,M., Wang,M., Elbrecht,A., Tsai,S.Y., Tsai,M.-J. and O'Malley,B.W. (1986) Mol. Cell Biol., 6, 2784-2791.
Saari,G. and Bienz,M. (1987) EMBO J., in press.
Sanchez-Herrero,E., Vernos,I., Marco,R. and Morata,G. (1985) Nature, 313, 108-113.
Sanger,F., Nicklen,S. and Coulson,A.R. (1977) Proc. Natl. Acad. Sci. USA, 74, 5463-5467.
Schachat,F.H. and Hogness,D.S. (1973) Cold Spring Harbor Symp. Quant. Biol., 38, 371-381.
Schneuwly,S., Kuroiwa,A., Baumgartner,P. and Gehring,W.J. (1986) EMBO J., 5, 733-739.
Scott,M.P. and Weiner,A.J. (1984) Proc. Natl. Acad. Sci. USA, 81, 4115-4120.
Seed,B. (1982) Nucleic Acids Res., 10, 1799-****.
Shepherd,J.C.W., McGinnis,W., Carrasco,A.E., DeRobertis,E.M. and Gehring,W.J. (1984) Nature, 310, 70-71.
Staden,R. (1982a) Nucleic Acids Res., 10, 4731-4751.
Staden,R. (1982b) Nucleic Acids Res., 10, 2951-2961.
Staden,R. and McLachlan,A. (1982) Nucleic Acids Res., 10, 141-156.
Struhl,G. (1981) Nature, 293, 36-41.
Struhl,G. and Akam,M. (1985) EMBO J., 4, 3259-3264.
Struhl,G. and White,R.A.H. (1985) Cell, 43, 507-519.
Tautz,D., Trick,M. and Dover,G.A. (1986) Nature, 322, 652-656.
Thireos,G., Driscoll Penn,M. and Greer,H. (1984) Proc. Natl. Acad. Sci. USA, 81, 5096-5100.
Throckmorton,L. (1975) In King,R.C. (ed.), Handbook of Genetics, The Phylogeny, Ecology and Geography of Drosophila. Plenum Press, New York, pp. 421-469.
Weinzierl,R., Axton,J.M., Ghysen,A. and Akam,M. (1987) Genes and Development, in press.
Wharton,K.A., Yedvobnick,B., Finnerty,V.G. and Artavanis-Tsakonas,S. (1985) Cell, 40, 55-62.
White,R.A.H. and Wilcox,M. (1984) Cell, 39, 163-171.
White,R.A.H. and Wilcox,M. (1985a) EMBO J., 4, 2035-2043.
White,R.A.H. and Wilcox,M. (1985b) Nature, 318, 563-567.
White,R.A.H. and Lehmann,R. (1986) Cell, 47, 311-321.
Wu,C. (1984) Nature, 309, 229-234.

Received on January 26, 1987