Proc. Natl. Acad. Sci. USA Vol. 79, pp. 3251-3254, May 1982 Genetics

Allelic and nonallelic homology of a supergene family

(gene conversion/domain transfer in evolution/major histocompatibility complex/population genetics)

TOMOKO OHTA

National Institute of Genetics, Mishima, 411, Japan

Communicated by Motoo Kimura, February 23, 1982

ABSTRACT A model to explain the high degree of polymorphism at the major histocompatibility complex (MHC) is described. The model incorporates domain transfer between the different loci in a supergene family by either gene conversion or double unequal crossing-over. Population genetics theory is used to formulate changes in the probabilities of allelic and nonallelic gene identities, and equilibrium values are obtained. The observed degree of allelic and nonallelic homology in the complex can be explained by assuming that a domain is converted at a rate of $10^{-5}$ to $10^{-6}$ per generation and reasonable values of other parameters. This rate of domain transfer is compatible with the observed high mutation rate at marker loci in the major histocompatibility complex.

The publication costs of this article were defrayed in part by page charge payment. This article must therefore be hereby marked "advertisement" in accordance with 18 U. S. C. §1734 solely to indicate this fact.

Abbreviation: MHC, major histocompatibility complex.

A supergene was originally defined as "a group of linked genes mechanically held together on a chromosome and usually held together as a unit" (1). The major histocompatibility complexes (MHC) of man and mouse are among the most thoroughly studied cases of such supergenes (2, 3). An enigmatic observation on the MHC is that gene identity among the different loci in the complex region is only slightly lower than that among alleles; the gene homology among alleles is ≈90% in terms of amino acid identity and ≈85% among genes of different loci (between HLA-A and -B or H2-K and -D) (4). Based on recent findings of domain transfer in evolution (5-10) and of a large number of cross-hybridizing genomic clones in this region (11, 12), I suggested that domain transfer between the loci in the MHC, either by gene conversion or by double unequal crossing-over, is responsible for the observed gene homology (13). The hypothesis is a revised form of the proposal of Bodmer (14) and Silver and Hood (15), in which each marker region such as H2-K or HLA-A contains a cluster of many loci of which only one is expressed. In the new model, each marker region comprises a single copy, yet the total complex region contains a large number of related genes, including pseudogenes.

Earlier, I pointed out that the population genetics theory of multigene families (16) is useful for predicting the degree of gene homology in a supergene family under various parameter values. However, my previous study (16) was based on the model of unequal crossing-over of Smith (17), using an approximate treatment, and does not give an explicit answer for the allelic and nonallelic gene homology of a multigene family. The model of gene conversion or double unequal crossing-over is simpler than the previous one of unequal crossing-over, because no shift of positions of loci on the chromosome occurs, merely transfer of gene segments from one locus to another. The purpose of this report is to clarify the relationship of allelic and nonallelic gene identity of a supergene family. The main difference between multigene and supergene families may be that the former consists of genes under the same control whereas the latter may include genes under different controls. The presence or absence of shifts of position of loci on the chromosome might reflect such a difference.

BASIC THEORY

Let us consider a randomly mating population of effective size $N$. A supergene family consists of $n$ tandemly arranged homologous genes and is evolving under gene conversion, recombination at meiosis, mutation, and random genetic drift. In this section, intrachromosomal gene conversion is assumed to be solely responsible for the transfer of gene segments. For simplicity, it is assumed that each unit is converted at a constant rate by any one of the remaining $(n - 1)$ genes with equal likelihood. For details of some models of chromatid interaction that result in gene conversion, see Nagylaki and Petes (18). These authors have shown that a small conversional advantage or disadvantage may have a large effect on the concerted evolution of repeated genes; however, I assume here for simplicity no polarity of conversion. Let $\lambda$ be the rate at which a gene is converted in one generation. See Fig. 1 for an illustration of conversion. The actual process of conversion is likely to involve one piece (or even a part of a piece) of a split gene (5-10). In the analyses below, a smaller unit such as an amino acid or a nucleotide site is considered. Thus, $\lambda$ is the average rate at which the small unit is converted by the homologous unit of another locus belonging to the supergene family.

FIG. 1. Diagram of the model of gene conversion.

Recombination at meiosis is assumed always to be equal, and we let $\beta$ be the rate per supergene family per generation. The infinite-allele model of Kimura and Crow (19), in which all mutations are unique, not preexisting ones, is assumed, and we let $v$ be the mutation rate per small unit (amino acid site or nucleotide site) per generation. As in my previous studies (16, 20), the changes of probability of gene identity (identity coefficients) by the above processes are formulated, and the equilibrium values are obtained and examined.

Let $f$ be the average probability of allelic identity, $c_1$ be the average identity probability of genes at different loci of the supergene family on one chromosome, and $c_2$ be that of two genes taken from different loci of two homologous chromosomes of the population. Fig. 2 depicts these three identity coefficients. Note that, although the term "gene identity" is used, it usually means amino acid or nucleotide identity, and hereafter I use unit and gene interchangeably. It should also be noted that, because of the assumption that gene conversion occurs with equal likelihood between any two loci of the family, identity coefficients do not depend on the position on the chromosome when equilibrium is reached.

FIG. 2. Diagram showing the meaning of the three identity coefficients.

In the following analyses, I assume that the parameters $v$, $\lambda$, $\beta$, and $1/N$ are $\ll 1$, so that their products can be ignored. Let us start from the change of $f$. It changes, by mutation and random genetic drift in one generation, by the amount, as in Kimura and Crow (19),

\[
\Delta_{\mathrm{mut,drift}} E(f) = -2vf + \frac{1}{2N}(1 - f), \tag{1}
\]

where $\Delta_{\mathrm{mut,drift}} E(\,)$ is the expected change by mutation and random genetic drift. By gene conversion with the rate $\lambda$ per unit, it changes by the amount

\[
\Delta_{\mathrm{conv}} E(f) = 2\lambda (c_2 - f), \tag{2}
\]

where $\Delta_{\mathrm{conv}} E(\,)$ is the expected change by conversion. The formula is derived from the consideration that, if one of the two units compared for identity is converted, the $f$ value of this pair is the same as $c_2$ before conversion. Thus, since equal interchromosomal recombination has no effect on $f$, the total change of $f$ in one generation is

\[
\Delta f = -\left(2v + \frac{1}{2N} + 2\lambda\right) f + \frac{1}{2N} + 2\lambda c_2 . \tag{3}
\]

The changes of $c_1$ and $c_2$ take almost the same form as the corresponding ones in the model of unequal crossing-over (20). Through gene conversion, $c_1$ changes according to the formula

\[
\Delta_{\mathrm{conv}} E(c_1) = a(1 - c_1), \tag{4}
\]

where $a = 2\lambda/(n - 1)$. This factor is the proportion among randomly chosen pairs from $n$ units that are identical because one of them has been converted by the other. Note that $a$ is used as a parameter in the same way as in the model of unequal crossing-over (equation 7 of ref. 20), since one cycle of unequal crossing-over has roughly the same effect as one conversion. The change of $c_1$ by mutation and interchromosomal recombination is the same as in the previous model (see equations 5, 6, and 9 of ref. 20), and we have, for the total change of $c_1$ in one generation,

\[
\Delta c_1 = -\left(2v + a + \frac{\beta}{3}\right) c_1 + \frac{\beta}{3} c_2 + a . \tag{5}
\]

The coefficient $\beta/3$ is obtained by assuming that the point of recombination is uniformly distributed over the total region of $n$ genes and taking the expectation of the probability that the two units of a chromosome come from different chromosomes after recombination.

Next, $c_2$ changes by gene conversion with the same coefficient as $c_1$, but the proportion, $a$, of $c_2$ comes from $f$:

\[
\Delta_{\mathrm{conv}} E(c_2) = a(f - c_2) . \tag{6}
\]

The changes of $c_2$ by other processes are the same as in the previous model (equations 5, 6, and 9 of ref. 20); however, it is assumed here that $\beta/(6N)$ is negligibly small. The total change of $c_2$ in one generation is thus

\[
\Delta c_2 = -\left(2v + \frac{1}{2N} + a\right) c_2 + \frac{1}{2N} c_1 + a f . \tag{7}
\]

The equilibrium values of the identity coefficients may be obtained from Eqs. 3, 5, and 7 by putting $\Delta f = \Delta c_1 = \Delta c_2 = 0$. The solutions are

\[
\hat{c}_2 = \frac{\dfrac{a}{2N}\left(na + \dfrac{1}{2N} + 4v + \dfrac{\beta}{3}\right)}{(a + 2v)\left(\dfrac{1}{2N} + 2v\right)\left(na + \dfrac{1}{2N} + 2v\right) + \dfrac{\beta}{3}\left(\dfrac{a + 2v}{2N} + 2nav + 4v^2\right)},
\]

\[
\hat{c}_1 = \frac{\beta \hat{c}_2 + 3a}{\beta + 3a + 6v},
\]

and

\[
\hat{f} = \frac{2N(n - 1)a\hat{c}_2 + 1}{2N(n - 1)a + 1 + 4Nv}, \tag{8}
\]

where $\hat{c}_2$, $\hat{c}_1$, and $\hat{f}$ are the equilibrium values. Note that $(n - 1)a = 2\lambda$ by definition.

When $n = 2$, this model should be comparable with the model of unequal crossing-over of a small multigene family of two tandem genes (21), and one cycle of unequal crossing-overs is equivalent to one gene conversion. However, there are some errors in the previous formulation. First, equation 1 of ref. 21 is derived from the assumption that crossing-over of the latter phase of the cycle (see figure 1 of ref. 21) takes place at the terminal points of the paired region and not at the middle point. This assumption is not realistic and, if it is assumed that crossing-over also occurs at the middle point and with probability equal to that at the terminal ones, the coefficient of $\gamma$ in equation 1 of ref. 21 should be 1/3 instead of 1/2. If it is assumed that crossing-over takes place anywhere in the paired region, the coefficient changes again. Thus, a more general formulation would be, instead of equation 1 of ref. 21,

\[
c_1' = c_1 + k\gamma (1 - c_1), \tag{9}
\]

where $k\gamma$ is the rate per one family by which one of the two genes is duplicated to replace the other through one cycle of unequal crossing-overs and corresponds to $n\lambda$ of the present model with $n = 2$. Second, the coefficients of $\gamma$ in equations 2 of ref. 21 also need revision, and the equations should be replaced by

\[
f_0' = f_0 + k\gamma (f_1 - f_0)
\]

and

\[
f_1' = f_1 + k\gamma (f_0 - f_1), \tag{10}
\]

where $f_0$ and $f_1$, respectively, correspond to our $f$ and $c_2$ with $n = 2$. Also, $\beta$ in the model of unequal crossing-over is defined as the rate at which the two markers are recombined and corresponds to $\beta/3$ of the present formulation. With the above revisions, the cycle model for two tandem genes becomes the same as the present model of gene conversion.

When gene homology is studied by nucleotide identity, it is better to assume a finite number of allelic states (actually four) rather than an infinite number. Such a model is known as the K-allele model (22), and it is easy to extend my analyses to this model. Let us assume that there are $K$ allelic states and that any allele mutates to a specific one of the $(K - 1)$ remaining states at the rate $v/(K - 1)$, so that the total rate is $v$. By letting $v^* = Kv/(K - 1)$, we can show that the changes of identity coefficients by mutation become (see ref. 22)

\[
\Delta_{\mathrm{mut}} E(f) = -2v^* f + \frac{2v^*}{K},
\]

\[
\Delta_{\mathrm{mut}} E(c_1) = -2v^* c_1 + \frac{2v^*}{K},
\]

and

\[
\Delta_{\mathrm{mut}} E(c_2) = -2v^* c_2 + \frac{2v^*}{K} . \tag{11}
\]

Changes of identity coefficients by other processes are the same as in the infinite-allele model, and the equilibrium solutions become

\[
\hat{c}_2 = \frac{\dfrac{a}{2N}\left(na + \dfrac{1}{2N} + 4v^* + \dfrac{\beta}{3}\right) + \dfrac{2v^*}{K}\left[\left(na + \dfrac{1}{2N} + 2v^*\right)\left(\dfrac{\beta}{3} + \dfrac{1}{2N} + 2v^*\right) + a(na + 2v^*)\right]}{(a + 2v^*)\left(\dfrac{1}{2N} + 2v^*\right)\left(na + \dfrac{1}{2N} + 2v^*\right) + \dfrac{\beta}{3}\left(2nav^* + \dfrac{a + 2v^*}{2N} + 4v^{*2}\right)},
\]

\[
\hat{c}_1 = \frac{\beta \hat{c}_2 + 3a + 6v^*/K}{\beta + 3a + 6v^*},
\]

and

\[
\hat{f} = \frac{2N(n - 1)a\hat{c}_2 + 1 + 4Nv^*/K}{2N(n - 1)a + 1 + 4Nv^*} . \tag{12}
\]

Note that, when $K \to \infty$, $v^* = v$ and Eqs. 12 reduce to Eqs. 8.

In Table 1, some examples of equilibrium identity coefficients are given. Since our interest is amino acid or nucleotide identity at homologous sites of a supergene such as the MHC, the mutation rate is assumed to be very small. Under such an assumption, the identity coefficients are expected to be only slightly different between the infinite-allele and the K-allele models with $K = 4$, and the values of Table 1 are computed by Eqs. 8 for the infinite-allele model. Based on recent knowledge of molecular evolution (23), a mutation rate ($v$) of $10^{-8}$ was assumed. Also, the effective population size of mammals was taken as $10^4$-$10^5$ (23). The interchromosomal recombination rate was assumed to be $10^{-3}$ as between markers of the MHC (3, 12). The number of genes ($n$) is 10, again as in the MHC (11, 12). The parameter for which there is the least information is the conversion rate, and various values are assumed for it. Parameter values are $\lambda = 5 \times 10^{-6}$, $v = 10^{-8}$, $n = 50$, $N = 5 \times 10^4$, and $\beta = 10^{-3}$, unless otherwise specified in the table. With this particular set of parameter values, the allelic homology ($f$) is 0.918 and the nonallelic homology ($c_1 \approx c_2$) is 0.839. These values are about what are observed in H-2 of mouse or HLA of man (ref. 4). When the conversion rate ($\lambda$) is smaller, the difference between $f$ and $c_1$ or $c_2$ becomes larger and vice versa. For the other parameters, the higher the mutation rate and the larger the number of loci and population size, the lower the homology, as expected.

DISCUSSION

As shown by the present analysis, the observed allelic and nonallelic homology at the MHC may be explained by assuming that the gene conversion rate per domain is $10^{-6}$-$10^{-5}$ and reasonable values of other parameters. Although gene conversion is assumed to be responsible for transfer of gene segments (domain), double unequal crossing-over has the same effect. Now what does this rate of domain transfer imply? It has long been recognized that the mutation rate is high in H-2 (24). On the other hand, it has been found that a gene of H-2 or HLA contains eight exons corresponding to protein domains (12). If any one of the domains has a chance of conversion of $10^{-6}$-$10^{-5}$, the total gene would have a chance of conversion eight times as high, i.e., $10^{-5}$-$10^{-4}$. Since conversions would be classified as mutations in a skin grafting experiment, the above estimate of the rate of domain transfer is compatible with the observed high rate of mutation.

The real evolutionary process of a supergene, however, may not be as simple as the model, and conversion may not take place randomly among gene members. Then, we would expect a more complicated organization for a supergene. As stated above, the model of gene conversion or double unequal crossing-over is simpler than that of the model of unequal crossing-over (16), because no shift of positions of loci on the chromosome occurs. Therefore, the present theory is more exact than the previous one of multigene families. In particular, the relationship between allelic and nonallelic gene identity is ambiguous in the case of the multigene family. Only two identity coefficients, $C_{w1}$ and $C_{w2}$ corresponding to $c_1$ and $c_2$, are formulated in the approximate analysis (16, 20) and, even in a more
exact approach (25, 26), calculation of the effect of shift of positions is not completely precise. Presumably, the approximate analyses are appropriate for large multigene families when unequal crossing-over occurs frequently and chromosomal positions are often shifted.

Table 1. Equilibrium values of allelic ($f$) and nonallelic ($c_1$ and $c_2$) identity

Parameter                          Value    f        c1 ≈ c2
λ (in units of 10^-6)              1        0.937    0.630
                                   2        0.926    0.745
                                   3        0.922    0.794
                                   4        0.920    0.821
                                   5        0.918    0.839
                                   6        0.918    0.851
                                   8        0.917    0.867
                                   10       0.917    0.877
                                   50       0.930    0.924
                                   100      0.940    0.938
v (in units of 10^-8)              0.1      0.990    0.981
                                   0.5      0.956    0.912
                                   1.0      0.918    0.839
                                   1.5      0.887    0.776
                                   2.0      0.859    0.722
n                                  10       0.981    0.964
                                   20       0.964    0.930
                                   30       0.947    0.897
                                   40       0.932    0.867
                                   50       0.918    0.839
                                   60       0.905    0.812
                                   100      0.860    0.721
N (in units of 10^4)               1        0.982    0.897
                                   2        0.966    0.882
                                   3        0.949    0.867
                                   4        0.934    0.852
                                   10       0.849    0.775

Parameters: λ = 5 × 10^-6, v = 10^-8, n = 50, N = 5 × 10^4, and β = 10^-3, unless otherwise specified.

I thank Dr. Motoo Kimura for stimulating discussion and encouragement throughout the course of this study and for his suggestion to extend the analysis to a general model of the K allele. I also thank Drs. James F. Crow, Walter F. Bodmer, and Kenichi Aoki for carefully going over the manuscript and making many valuable suggestions to improve the presentation. This is contribution no. 1419 from the National Institute of Genetics, Mishima, Japan.

1. Darlington, C. D. & Mather, K. (1949) The Elements of Genetics (Allen & Unwin, London).
2. Snell, G. D. (1981) Science 213, 172-178.
3. Bodmer, W. F. (1979) in Human Genetics: Possibilities and Realities, Ciba Foundation Series 66 (Elsevier, North-Holland), pp. 205-229.
4. Ploegh, H. L., Orr, H. T. & Strominger, J. L. (1981) Cell 24, 287-299.
5. Miyata, T., Yasunaga, T., Yamawaki-Kataoka, Y., Obata, M. & Honjo, T. (1980) Proc. Natl. Acad. Sci. USA 77, 2143-2147.
6. Slightom, J. L., Blechl, A. E. & Smithies, O. (1980) Cell 21, 627-638.
7. Liebhaber, S. A., Goossens, M. & Kan, Y. W. (1981) Nature (London) 290, 26-29.
8. Miyata, T. & Yasunaga, T. (1981) Proc. Natl. Acad. Sci. USA 78, 450-453.
9. Schreier, P. H., Bothwell, A. L. M., Mueller-Hill, B. & Baltimore, D. (1981) Proc. Natl. Acad. Sci. USA 78, 4495-4499.
10. Yamawaki-Kataoka, Y., Nakai, S., Miyata, T. & Honjo, T. (1982) Proc. Natl. Acad. Sci. USA 79, 2623-2627.
11. Steinmetz, M., Frelinger, J. G., Fisher, D., Hunkapiller, T., Pereira, D., Weissman, S. M., Uehara, H., Nathenson, S. & Hood, L. (1981) Cell 24, 125-134.
12. Steinmetz, M., Moore, K. W., Frelinger, J. G., Sher, B. T., Shen, F.-W., Boyse, E. A. & Hood, L. (1981) Cell 25, 683-692.
13. Ohta, T. (1982) Proc. Natl. Acad. Sci. USA 79, 1940-1944.
14. Bodmer, W. F. (1973) Transplant. Proc. 5, 1471-1476.
15. Silver, J. & Hood, L. (1976) Proc. Natl. Acad. Sci. USA 73, 599-603.
16. Ohta, T. (1980) Evolution and Variation of Multigene Families, Lecture Notes in Biomathematics (Springer, New York), Vol. 37.
17. Smith, G. P. (1974) Proc. Cold Spring Harbor Symp. Quant. Biol. 38, 507-513.
18. Nagylaki, T. & Petes, T. D. (1982) Genetics, in press.
19. Kimura, M. & Crow, J. F. (1964) Genetics 49, 725-738.
20. Ohta, T. (1978) Genet. Res. 31, 13-28.
21. Ohta, T. (1981) Genet. Res. 37, 133-149.
22. Kimura, M. (1968) Genet. Res. 11, 247-269.
23. Kimura, M. (1979) Sci. Am. 241 (5), 94-104.
24. Klein, J. (1978) Adv. Immunol. 26, 44-146.
25. Kimura, M. & Ohta, T. (1979) Proc. Natl. Acad. Sci. USA 76, 4001-4005.
26. Ohta, T. (1982) Genetics, in press.
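
Numerical check of the equilibrium values. The following is a minimal sketch, not part of the original article, that evaluates the equilibrium identity coefficients of Eqs. 8 for the infinite-allele model and confirms that they make the one-generation changes of Eqs. 3, 5, and 7 vanish. The function names `equilibrium` and `residuals` are hypothetical, and the formulas are written under the assumption that the drift term is 1/(2N), the recombination coefficient is β/3, and a = 2λ/(n - 1), as defined in the text.

```python
def equilibrium(lam, v, n, N, beta):
    """Equilibrium identity coefficients (f, c1, c2), assuming the form of Eqs. 8."""
    a = 2.0 * lam / (n - 1)   # pairwise conversion parameter, a = 2*lambda/(n-1)
    d = 1.0 / (2.0 * N)       # random drift term, 1/(2N)
    b = beta / 3.0            # recombination coefficient, beta/3
    num = a * d * (n * a + d + 4 * v + b)
    den = ((a + 2 * v) * (d + 2 * v) * (n * a + d + 2 * v)
           + b * ((a + 2 * v) * d + 2 * n * a * v + 4 * v * v))
    c2 = num / den
    c1 = (beta * c2 + 3 * a) / (beta + 3 * a + 6 * v)
    f = (2 * N * (n - 1) * a * c2 + 1) / (2 * N * (n - 1) * a + 1 + 4 * N * v)
    return f, c1, c2


def residuals(f, c1, c2, lam, v, n, N, beta):
    """One-generation changes (Eqs. 3, 5, 7); each should be ~0 at equilibrium."""
    a = 2.0 * lam / (n - 1)
    d = 1.0 / (2.0 * N)
    b = beta / 3.0
    df = -(2 * v + d + 2 * lam) * f + d + 2 * lam * c2
    dc1 = -(2 * v + a + b) * c1 + b * c2 + a
    dc2 = -(2 * v + d + a) * c2 + d * c1 + a * f
    return df, dc1, dc2


# Standard parameter set quoted for Table 1.
params = dict(lam=5e-6, v=1e-8, n=50, N=5e4, beta=1e-3)
f, c1, c2 = equilibrium(**params)
print(f"f = {f:.3f}, c1 = {c1:.3f}, c2 = {c2:.3f}")
print("residuals:", residuals(f, c1, c2, **params))
```

With the standard parameter set the sketch gives f ≈ 0.918 and c1 ≈ c2 ≈ 0.839, the values quoted in the text and in Table 1; changing `lam`, `v`, `n`, or `N` one at a time gives a way to compare against the other rows of the table.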