
Proc. Natl. Acad. Sci. USA Vol. 94, pp. 5177–5182, May 1997 Evolution

Molecular origin of the mosaic sequence arrangements of higher primate α-globin duplication units (α-globin locus/gibbon/DNA deletion/evolution)

ARNOLD D. BAILEY*†, CHIEN CELIA SHEN*, AND CHE-KUN JAMES SHEN*‡§

*Section of Molecular and Cellular Biology, University of California, Davis, CA 95616; and ‡Institute of Molecular Biology, Academia Sinica, Nankang, Taipei, Republic of China

Communicated by R. W. Allard, Bodega Bay, CA

ABSTRACT The human adult α-globin locus consists of three pairs of homology blocks (X, Y, and Z) interspersed with three nonhomology blocks (I, II, and III), and three Alu family repeats, Alu1, Alu2, and Alu3. It has been suggested that an ancient primate α-globin-containing unit was ancestral to the X, Y, and Z blocks and the Alu1/Alu2 repeats. However, the evolutionary origin of the three nonhomologous blocks has remained obscure. We have now analyzed the sequence organization of the entire adult α-globin locus of gibbon (Hylobates lar). DNA segments homologous to human block I occur in both duplication units of the gibbon α-globin locus. Detailed interspecies sequence comparisons suggest that nonhomologous blocks I and II, as well as another sequence, IV, were all part of the ancestral α-globin-containing unit prior to its tandem duplication. However, sometime thereafter, block I was deleted from the human α1-globin-containing unit, and block II was also deleted from the α2-globin-containing unit in both human and gibbon. These were probably independent events, each mediated by an illegitimate recombination process. Interestingly, the end points of these deletions coincide with potential insertion sites of Alu family repeats. These results suggest that the shaping of DNA segments in eukaryotic genomes involved the retroposition of repetitive DNA elements in conjunction with simple DNA recombination processes.

The publication costs of this article were defrayed in part by page charge payment. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. §1734 solely to indicate this fact.

The sequence reported in this paper has been deposited in the GenBank database (accession no. 94634).

Copyright © 1997 by THE NATIONAL ACADEMY OF SCIENCES OF THE USA 0027-8424/97/945177-6$2.00/0
PNAS is available online at http://www.pnas.org.

†Present address: Department of Molecular Biophysics and Biochemistry, Yale University School of Medicine, New Haven, CT 06520.
§To whom reprint requests should be addressed.

The adult α-globin locus appears to provide a model for the analysis of evolution of tandem duplication units in higher primates. Sequence analyses have shown that the two human adult α-globin genes are contained within the 3′ portions of a pair of tandemly arranged duplication units which consist of three pairs of homology blocks X, Y, and Z (1–5). These homology blocks are interrupted by three nonhomology blocks, I, II, and III, with three of their boundaries precisely defined by the presence of Alu family repeats (see Fig. 1; refs. 4 and 5). It has been proposed that a tandem duplication that occurred prior to the divergence of higher primates generated two DNA segments with identical sequences (1–7). The exact homology between these duplication units was subsequently disrupted by nucleotide substitutions, as well as by DNA rearrangements resulting in a mosaic arrangement of the homology/nonhomology blocks (4, 5, 7). Segmental gene conversion, a process of nonreciprocal transfer of genetic information, may have been responsible for maintenance of the high degrees of sequence homology of the X and Z blocks (2–6).

The above evolutionary scenario is consistent with subsequent genomic mapping and DNA sequencing of the orthologous, or parallel, loci from the orangutan and two Old World monkeys, the olive baboon and the rhesus monkey (reviewed in ref. 7). Unfortunately, little information regarding the evolutionary origin of the nonhomology blocks I, II, and III is available from these analyses. One or more of these nonhomology blocks could have belonged to the ancestral DNA unit prior to its tandem duplication. Alternatively, they could be foreign to the adult α-globin locus, brought into it by various DNA recombination events. The gross organization of the α-globin duplication units of orangutan is nearly identical to that of human (8). Humans and orangutans diverged relatively recently, only 10–15 million years ago. The α2-globin unit of the Old World monkeys, whose lineage diverged earlier from humans, is also similar to that of human (9). Unfortunately, a major portion of their α1-globin unit has been unclonable (9, 10).

More recently, we have cloned the adult α-globin locus from gibbon (Hylobates lar) (11), which started to diverge from either the great apes or the Old World monkeys 15–19 million years ago (12). As shown below in Results and Discussion, detailed sequence comparison between human and this nonhuman catarrhine has provided data regarding the evolutionary origin of the nonhomology DNA blocks, as well as several rearrangement processes that appear to have played a major role in the shaping of the adult α-globin loci in higher primates.

MATERIALS AND METHODS

Both the molecular cloning and DNA sequencing of the gibbon α-globin locus by the chain-termination method (13) have been described in detail elsewhere (14, 15).

RESULTS AND DISCUSSION

The linkage maps of the adult α-globin loci of human and gibbon are shown in Fig. 1a. The sequences of the gibbon locus from the X(α2) block to downstream of the α1-globin gene are aligned with the orthologous human region in Fig. 1b for comparison. The gibbon locus is similar to the human locus in that it also consists of three pairs of homology blocks, X(α2)/X(α1), Y(α2)/Y(α1), and Z(α2)/Z(α1). As in human and Old World monkeys, the nonhomology block I(α2) (nucleotides 1368–1724, Fig. 1b) and an Alu family repeat (Alu3, nucleotides 1739–2661, Fig. 1b) are also present between the gibbon X(α2) and Y(α2) blocks. As shown previously, the gibbon Alu3 is a triplet Alu sequence, while doublet and singlet Alu repeats were found at the parallel positions in human and rhesus, respectively (Fig. 1; refs. 4, 9, 14). These doublet and triplet Alu sequences are believed to be the result



FIG. 1. (a) Linkage maps of the human and gibbon adult α-globin locus. The range of maps spans from the ψα1 to immediately downstream of the α1-globin gene. Three pairs of homology blocks, X, Y, and Z, are indicated by open bars. Interspersed among the homology blocks are the nonhomology blocks I, II, III, and IV, and the different Alu family repeats indicated by the arrows. The numbers below the map represent the nucleotide positions of the ends of different blocks of the gibbon locus sequence as shown in b. X(α2), 1–1052; I(α2), 1368–1724; Y(α2), 2662–2831; Z(α2), 2832–4554; X(α1), 4555–5607; I(α1), 5608–5939 and 6239–6263; IV(α1), 6264–6444; II(α1), 6777–7103; Y(α1), 7104–7292; III(α1), 7293–7516; and Z(α1), 7517–9238. Note that the X blocks assigned in this diagram do not include the Alu family repeats, as previously defined (4, 5). (b) Nucleotide sequence alignment of the gibbon and human α-globin loci. The gibbon sequence (G) is shown in full, with the corresponding human sequence (H) aligned underneath it. The first nucleotide of the gibbon X(α2) block is denoted as number 1. Human nucleotides identical to those in the gibbon are indicated by dashes, and different nucleotides are indicated by the appropriate letters. Single-base deletions are left blank, while relative deletions longer than 1 bp are indicated by horizontal lines. The homology and nonhomology blocks are individually indicated by brackets. Alu family repeats are represented by horizontal line arrows with the short direct repeats flanking them boxed. The α2- and α1-globin genes are represented by horizontal open arrows. Not shown are the sequences of the α-globin genes (2, 3), the Alu family repeats (4, 5, 14), and the central parts of the blocks X, Y, and Z (15). The numbers of bp in the middle of the latter blocks indicate their total lengths, including the sequences shown. (c) Sequence alignment of the nonhomology blocks human I(α2), gibbon I(α2), and gibbon I(α1). The gibbon I(α1) sequence is shown in full, with those of the human and gibbon I(α2) blocks aligned underneath. The symbols used are the same as in b. (d) Sequence alignment of the human and gibbon Alu1 and Alu2 repeats. The sequence of gibbon Alu2 is shown in full, with those of the human Alu2 (4), human Alu1 (4), and gibbon Alu1 (15) aligned underneath. The pairwise homology comparisons of these four repeats are further described in Table 1.
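The block coordinates listed in the legend of Fig. 1a can be converted to block lengths with a few lines of arithmetic. The following is a minimal sketch, not part of the original analysis: the coordinates are copied from the legend (1-based, inclusive, gibbon numbering), while the dictionary layout, the ASCII labels `a1`/`a2` standing for α1/α2, and the function names are illustrative assumptions.

```python
# Block coordinates (1-based, inclusive) copied from the legend of Fig. 1a.
# "a2" and "a1" stand for the alpha2- and alpha1-globin units (ASCII labels
# chosen here for convenience). A block listed with two ranges is one that
# is interrupted by an inserted element.
BLOCKS = {
    "X(a2)": [(1, 1052)],
    "I(a2)": [(1368, 1724)],
    "Y(a2)": [(2662, 2831)],
    "Z(a2)": [(2832, 4554)],
    "X(a1)": [(4555, 5607)],
    "I(a1)": [(5608, 5939), (6239, 6263)],
    "IV(a1)": [(6264, 6444)],
    "II(a1)": [(6777, 7103)],
    "Y(a1)": [(7104, 7292)],
    "III(a1)": [(7293, 7516)],
    "Z(a1)": [(7517, 9238)],
}


def block_length(segments):
    """Total length in nucleotides of a block given as inclusive ranges."""
    return sum(end - start + 1 for start, end in segments)


def block_lengths(blocks):
    """Map each block name to its total length."""
    return {name: block_length(segs) for name, segs in blocks.items()}


if __name__ == "__main__":
    for name, length in block_lengths(BLOCKS).items():
        print(f"{name:8s} {length:5d} nt")
```

Under these coordinates, the two segments of gibbon I(α1) sum to the same length as gibbon I(α2). Lengths computed this way refer to the gibbon sequence only and say nothing about alignment gaps relative to human.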

of sequential insertion of singlet Alu family repeats, at the I(α2)/Y(α2) junctions, during primate evolution (14).

The organization of the gibbon α1-globin unit is similar to that of human in the relative arrangement of the blocks X(α1), II(α1), Y(α1), III, and Z(α1) (Fig. 1a). However, close examination of the two sequences reveals several unique features of the gibbon locus. First, an array of DNA sequences, nonhomology I(α1) (nucleotides 5608–5939 and 6239–6263, Fig. 1b), occurs immediately downstream of the gibbon X(α1). This I(α1) region is homologous in sequence to both the human and gibbon I(α2) blocks (Fig. 1c), but it is interrupted by an Alu repeat (Alu4, nucleotides 5940–6223, Fig. 1b) inserted into its 3′ portion. Second, two sequences, IV and Alu2, at nucleotide positions 6264–6444 and 6479–6776, respectively (Fig. 1b), are located between the gibbon I(α1) and II(α1) blocks. Block IV does not exist in the α2-globin unit of either human or gibbon, and it is followed by Alu2, an Alu family repeat exhibiting a 93% sequence identity to the human Alu2 repeat. This sequence identity is significantly higher compared with those of the three pairwise comparisons: gibbon Alu2/gibbon Alu1 (85%), gibbon Alu2/human Alu1 (86%), or gibbon Alu1/human Alu2 (86%) (Table 1). Furthermore, the gibbon Alu2/human Alu2 pair has the highest number of common sites (Table 1). These sequence comparisons suggest that gibbon Alu2, which apparently has been inserted in between blocks II(α1) and IV(α1) with the sequence 5′-GTCCCCACATCTCA-3′ being the site of integration, is most likely the parallel element, or orthologue, of human Alu2. Following this, human Alu1 and Alu2 repeats could not be paralogous as postulated previously (4, 5).

Table 1. Pairwise comparison of Alu1 and Alu2 repeats of gibbon and human

Sequences compared    No. of differences    Common sites    % identity
G-Alu1/H-Alu1         24                    19              92
G-Alu2/H-Alu2         21                    20              93
G-Alu1/G-Alu2         42                    1               85
H-Alu1/G-Alu2         40                    3               86
G-Alu1/H-Alu2         38                    3               87
H-Alu1/H-Alu2         39                    2               86

Pairwise comparisons of the sequences among the four Alu family repeats were made to determine their evolutionary relationship. The percent identity values were calculated according to the sequence alignment of Fig. 1d. A common site is where the two compared Alu repeat sequences have the same nucleotide, but the other two Alu repeats have different base(s). G, gibbon; H, human.

We propose the following evolutionary scenario for the α2- and α1-globin units: (i) The sequence organization of both α-globin duplication units in ancient higher primates was at one time similar to the present-day gibbon α1-globin unit except for the Alu2 repeat; (ii) subsequently, prior to the divergence of human and gibbon, Alu1 was inserted at the X(α2)/I(α2) junction; (iii) Alu2 was inserted at the IV(α1)/II(α1) junction; (iv) an ancestral singlet Alu3 was inserted at the 5′ boundary of the Y(α2) block (Fig. 2); and (v) gross sequence arrangement of this ancient α1-globin unit has since been conserved in the gibbon lineage. But during evolution of humans, DNA recombination has occurred between the A+T-rich sequence at the 3′ end of X(α1) block and the very 5′ end of the Alu2 repeat (Figs. 2 and 3). This recombination process, presumably facilitated by the homology near the recombination junctions (Fig. 3b), led to the deletion of blocks I(α1) and IV(α1), and it also brought the Alu2 repeat to a position immediately downstream from the X(α1) block (Fig. 3a). This scenario is consistent with the sequence comparison of the Alu1 and Alu2 repeats (Table 1), and it also provides an explanation for our previous finding. That is, human Alu2, despite its existence at a genomic position paralogous to the human Alu1, is flanked only at its 5′, and not the 3′, side with the sequence 5′-CTAAAATCC-3′ that has served as the insertion site for the human Alu1 repeat.

Alternatively, one could propose that an ancient Alu1 repeat was inserted at the 3′ end of the X block prior to duplication of the α-globin unit. Subsequently in the human lineage, DNA deletion occurred by homologous recombination between the Alu repeat at the 3′ end of the X(α1) block and an Alu repeat at the IV/II junction. This would also produce the arrangement now seen in the human α1-globin unit (Fig. 1). However, neither Old World monkeys (9) nor gibbons (Fig. 1) have an Alu repeat at the 3′ ends of their X(α1) blocks. For this latter model to be viable, then, one has to hypothesize that tandem duplication of the α-globin unit and/or deletion of certain Alu repeats occurred independently in different lineages during evolution of higher primates.

In the α2-globin unit, on the other hand, it is most likely that a DNA recombination event occurred between the junction of blocks I(α2) and IV(α2), and the junction of blocks II(α2) and Y(α2) (Figs. 2 and 4a) prior to the separation of the gibbon/human lineage and the Old World monkeys. Interestingly, the DNA substrates near the breakage–rejoining points of this illegitimate recombination process also exhibit quite extensive sequence homology, 53% for perfect base-pairing and 75% if

FIG. 2. Evolutionary shaping of the human α2-/α1-globin duplication units by retroposition of Alu family repeats and by simple DNA deletion events. The proposed scheme of evolution has been deduced from sequence comparison of the duplication units between human and gibbon (see text for more details). The ancestral unit consisted of blocks X, I, IV, II, Y, and Z. The evolutionary origin of block III is uncertain at the present time. It could belong to the ancestral unit, but was deleted from the α2-, but not α1-, globin unit. Alternatively, it could have been inserted into the α1-globin unit after the duplication event. After tandem duplication, independent retroposition of different Alu family repeats into the adult α-globin locus occurred. Of the four depicted repeats, Alu1, Alu2, and Alu3 were inserted in the human/gibbon ancestor, with the insertion timing of Alu3 as early as prior to the separation of gibbon/human from the Old World monkeys (9, 14). Although Alu4 is present in the gibbon lineage, whether it was inserted prior to or after the human/gibbon divergence is unknown, since the DNA region containing its insertion site has been deleted from the human genome.

FIG. 3. (a) Diagram of DNA recombination process deleting blocks I(α1) and IV(α1) from the human α1-globin unit. The proposed recombination is accomplished by DNA breakage and rejoining of the X(α1)/I(α1) junction and the IV(α1)/Alu2 junction. See text for more details. (b) Sequence alignment of the above two junctions of gibbon α1-globin unit and DNA at the human X(α1)/Alu2 junction. The tandemly arranged repeats of 5′-CAAAA-3′ (horizontal arrows) are likely generated by a series of amplifications of the short sequence CA and/or CAA, presumably only in the gibbon lineage and not in the human (4). The short direct repeats, or insertion sites, of human Alu1 and gibbon Alu2 are boxed. The open triangles denote the boundaries of adjacent blocks around which the hypothesized breakage and rejoining processes have occurred. The nucleotide positions, as used in Fig. 1b, of the first and last base of each sequence are indicated by the numbers.

purine–purine and pyrimidine–pyrimidine homologies are accounted (Fig. 4b). It should be noted that the timing of this deletion event relative to the insertion of the Alu3 repeat at the very 5′ end of the Y(α2) block is unknown. However, it remains a viable possibility that the two events occurred at the same time during evolution.

These proposed DNA deletion events and the insertions of various Alu family repeats during evolution of the adult α2-/α1-globin duplication units in higher primates are summarized in Fig. 2. Despite the still unknown origin of the nonhomology block III, the model of Fig. 2 appears to provide a satisfactory explanation for the generation of the present-day mosaic organization of this locus in the human and gibbon lineages. Several interesting genetic implications emerge from this model. It seems reasonable to suggest that all eukaryotic genomes arose from series of tandem duplication events (refs. 16 and 17, and references therein), such as the one generating the ancestor of the adult α-globin gene duplication units in higher primates. Sequences of originally identical tandem duplication units then diverged due to nucleotide substitutions (reviewed in ref. 18), which could be counterbalanced by the processes of gene correction (see ref. 19 for references). The eukaryotic genomes could have been further shaped by the expansion through insertions of various DNA sequences including the repetitive DNA elements, and by down-sizing events such as DNA deletions (20). In the scenario of Fig. 2, we see no evidence for insertion processes that would bring unique DNA sequence(s) into the α-globin locus from other parts of the primate genomes. Instead, deletion of unique DNA sequences belonging to the original duplication units and retroposition (21, 22) of repetitive DNA elements, e.g., the Alu family repeats (23), appear to be the two major mechanisms for shaping of the adult α-globin duplication units during higher primate evolution. Whether this is generally true for the evolution of most eukaryotic genomes is an intriguing question. Furthermore, the above two DNA rearrangement processes are probably tightly linked to each other in the evolution of eukaryotic genomes, a situation similar to some transposon-induced DNA rearrangements in bacteria, yeast, and plant cells (for examples, see refs. 24–26).

The scenario of Fig. 2 also reinforces the previous observation that potential insertion sites of repetitive DNA elements such as the Alu family repeats are also hot spots for other DNA recombination processes (refs. 4 and 27–32 and references therein). Three of the four recombination junctions depicted in Figs. 3 and 4 apparently functioned as the insertion sites for Alu family repeats at one time or another during genomic evolution of the higher primates. The fact that some sequences could be utilized both as insertion sites for Alu family repeats (Fig. 1b; ref. 4), and independently as the substrates for illegitimate recombination (Fig. 3) further suggests that it is an intrinsic property of these genomic sequences, not the presence of repetitive DNA elements per se, that renders them hot spots of various DNA recombination processes. Among the possibilities is the recognition of these sequences by DNA recombinases and/or recombination complexes. Interestingly, our evolutionary analysis of the α-globin loci of higher primates is similar to the recent findings in yeast cells that retrotransposons preferentially insert at, and heal, chromosomal breaks (33–35).

FIG. 4. (a) Diagram of the DNA recombination process deleting blocks IV and II of the human and gibbon α2-globin units. The breakage and rejoining occurred at the I(α2)/IV(α2) and II(α2)/Y(α2) junctions. (b) Sequence alignment of the above two recombination end points of gibbon, and DNA sequences of the I(α2)/Y(α2) junctions of both gibbon and human. For simplicity of comparison, the sequences of the gibbon and human Alu3 repeat (4, 14) are omitted, but their sites of insertion are still shown (boxed). Again, the open triangles denote the boundaries of adjacent blocks around which breakage and rejoining have occurred.

We thank our colleagues John Hess, Jeng-Pyng Shaw, and Jon Marks for setting the basis of this work. This research was supported by National Institutes of Health Grant DK29800.

1. Lauer, J., Shen, C.-K. J. & Maniatis, T. (1980) Cell 20, 119–130.
2. Liebhaber, S. A., Goosen, M. J. & Kan, Y. W. (1981) Nature (London) 290, 26–29.
3. Michelson, A. M. & Orkin, S. H. (1983) J. Biol. Chem. 258, 15245–15254.
4. Hess, J. F., Fox, M., Schmid, C. W. & Shen, C.-K. J. (1983) Proc. Natl. Acad. Sci. USA 80, 5970–5974.
5. Hess, J. F., Schmid, C. W. & Shen, C.-K. J. (1984) Science 226, 67–70.
6. Zimmer, E. A., Martin, S. L., Beverly, S. M., Kan, Y. W. & Wilson, A. C. (1980) Proc. Natl. Acad. Sci. USA 77, 2158–2162.
7. Shen, C.-K. J. (1990) in Molecular Evolution, eds. Clegg, M. T. & O’Brien, S. J. (Wiley-Liss, New York), pp. 75–83.
8. Marks, J., Shaw, J.-P. & Shen, C.-K. J. (1986) Proc. Natl. Acad. Sci. USA 83, 1413–1417.
9. Shaw, J.-P., Marks, J. & Shen, C.-K. J. (1991) J. Mol. Evol. 33, 506–531.
10. Shaw, J.-P. (1986) Ph.D. thesis (Univ. of California, Davis).
11. Bailey, A. D., Stanhope, M., Slightom, J. L., Goodman, M., Shen, C. C. & Shen, C.-K. J. (1992) J. Biol. Chem. 267, 18398–18406.
12. Szalay, F. & Delson, E. (1979) Evolutionary History of the Primates (Academic, New York).
13. Sanger, F., Nicklen, S. & Coulson, A. R. (1977) Proc. Natl. Acad. Sci. USA 74, 5463–5467.
14. Bailey, A. D. & Shen, C.-K. J. (1993) Proc. Natl. Acad. Sci. USA 90, 7205–7209.
15. Bailey, A. D. (1992) Ph.D. thesis (Univ. of California, Davis).
16. Ohno, S. (1970) Evolution by Gene Duplication (Springer, New York).
17. Schughart, K., Kappen, C. & Ruddle, F. H. (1989) Proc. Natl. Acad. Sci. USA 86, 7067–7071.
18. Stewart, C.-B. (1993) Nature (London) 361, 603–607.
19. Shen, S. H., Slightom, J. L. & Smithies, O. (1981) Cell 26, 191–203.
20. Cavalier-Smith, T., ed. (1985) The Evolution of Genome Size (Wiley, New York).
21. Singer, M. F. (1982) Cell 28, 433–434.
22. Weiner, A. M., Deininger, P. L. & Efstratiadis, A. (1986) Annu. Rev. Biochem. 55, 631–661.
23. Jelinek, W. R. & Schmid, C. W. (1982) Annu. Rev. Biochem. 51, 770–771.
24. Calos, M. P. & Miller, J. H. (1980) Cell 20, 579–595.
25. Downs, K. M., Brennan, G. & Liebman, S. W. (1985) Mol. Cell. Biol. 5, 3451–3457.
26. Martin, C. R., Mackay, S. & Carpenter, R. (1988) Genetics 119, 171–184.
27. Vanin, E. F., Henthorn, P. S., Kioussis, D., Grosveld, F. & Smithies, O. (1983) Cell 35, 701–709.
28. Lehrman, M. A., Russell, D. W., Goldstein, J. L. & Brown, M. S. (1985) Proc. Natl. Acad. Sci. USA 83, 3679–3683.
29. Nicholls, R. D., Fischel-Ghodsian, N. & Higgs, D. R. (1987) Cell 49, 369–378.
30. Morris, T. & Thacker, J. (1993) Proc. Natl. Acad. Sci. USA 90, 1392–1396.
31. Margot, J.-B., Demers, G. W. & Hardison, R. C. (1989) J. Mol. Biol. 205, 15–40.
32. Hardison, R. C., Krane, D., Vandenbergh, D., Cheng, J.-F., Mansberger, J., Taddie, J., Schwartz, S., Huang, X. & Miller, W. (1991) J. Mol. Biol. 222, 233–249.
33. Teng, S.-C., Kim, B. & Gabriel, A. (1996) Nature (London) 383, 641–644.
34. Moore, J. K. & Haber, J. E. (1996) Nature (London) 383, 644–646.
35. Boeke, J. (1996) Nature (London) 383, 579–580.
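Note on the junction homology figures. Results and Discussion quotes two homology values for the recombination substrates of Fig. 4b: one counting only identical bases, and one that also counts purine–purine and pyrimidine–pyrimidine pairs. The sketch below shows one way such a pair of percentages might be computed from two aligned sequences. It is an illustration under stated assumptions, not the authors' procedure: the function names, the treatment of gap characters as mismatches, the use of the full alignment length as the denominator, and the toy sequences are all hypothetical.

```python
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def same_class(a, b):
    """True if both bases are purines or both are pyrimidines."""
    pair = {a, b}
    return pair <= PURINES or pair <= PYRIMIDINES


def junction_homology(seq1, seq2):
    """Return (% identical, % same purine/pyrimidine class) for two
    aligned sequences of equal length.

    Assumptions made for this sketch: any character other than A, C, G,
    or T (for example a gap) counts as a mismatch, and the denominator
    is the full alignment length.
    """
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    if not seq1:
        raise ValueError("sequences must not be empty")
    pairs = list(zip(seq1.upper(), seq2.upper()))
    exact = sum(1 for a, b in pairs if a == b and a in "ACGT")
    by_class = sum(1 for a, b in pairs if same_class(a, b))
    n = len(pairs)
    return 100.0 * exact / n, 100.0 * by_class / n


if __name__ == "__main__":
    # Toy sequences, not taken from Fig. 4b.
    print(junction_homology("ACGTACGT", "ACATGCTT"))
```

Because identical bases always fall in the same class, the second value can never be lower than the first, which matches the direction of the two percentages quoted in the text.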