Estimating Genetic Divergence and Genetic Variability with Restriction

Estimating Genetic Divergence and Genetic Variability with Restriction

Proc. NatL Acad. Sci. USA Vol. 78, No. 10, pp. 6329-6333, October 1981 Genetics Estimating genetic divergence and genetic variability with restriction endonucleases (heterozygosity/base substitutions/population genetics/maximum likelihood) WILLIAM R. ENGELS Laboratory of Genetics, University of Wisconsin-Madison, Madison, Wisconsin 53706 Communicated by Harry Harris, June 4, 1981 ABSTRACT Restriction endonucleases cut DNA at specific n homologous segments, each approximately L nucleotides sites determined by the local nucleotide sequence. By comparing long. This sample might have come from n individuals drawn related DNA segments with respect to where such cuts are made, from a population whose variability we wish to measure, or from one can estimate the extent of sequence homology between the representatives of n species whose genetic divergence is of in- segments. Empirical methods are presented here for using these terest. Alternatively, they might be homologous regions within data to measure- the proportion of mismatches between two se- a single genome such as the At and Gy human globin genes quences, the proportion of polymorphic positions in a series of discussed below, which arose by gene duplication and have sequences, or the degree ofheterozygosity in a population. These diverged in evolutionary time. Each segment in the sample is methods do not require any assumptions concerning the evolu- treated with one or a series restriction tionary or population genetic processes involved. One can also use of endonucleases, and the data to calculate the precision ofeach ofthese estimates. When the lengths ofthe resulting fragments are determined. Assume the positions of the cuts are not determined, these estimates can for the moment that these lengths allow us to determine the be made, using only the lengths of the resulting DNA fragments, exact point at which each enzyme cuts each fragment. (Data in by means of a maximum likelihood procedure. Several examples which this final step was not taken, and therefore only identity demonstrate the usefulness of these methods to study genetic dif- offragment lengths can be determined, will be considered be- ferences in regions of the genome not amenable to study by other low for the case of n = 2.) Ifj is the length of the recognition methods. sequence ofa given endonuclease-usually 4-6 base pairs-we may define a site as any sequence ofj positions (base pairs). A A large part ofexperimental population genetics in the last two cleavage site is defined as any site where at least one member decades has dealt with protein differences within and between of the sample was cleaved. If there are m cleavage sites, then populations; see chapter 7 ofWright (1) for a review. More re- the data consist of the values c1, c2, . .., cm (1 ' ci < n) rep- cently, the use ofrestriction endonucleases and accompanying resenting the numbers of members in the sample cut at each DNA technology has made it possible to study differences at site by one ofthe enzymes. The total numberofcuts at all cleav- the level of DNA sequences. Several studies (2-5) have dem- age sites will be denoted by c. onstrated variability in DNA sequences as revealed by variation in the lengths ofDNA fragments after digestion by one or more ESTIMATORS sequence-specific restriction enzymes. A major advantage of Frequency of polymorphism this method is that it can be used forany segment ofthe genome, A site may be considered polymorphic if at least one of its j regardless of whether or not it codes for a soluble protein. It positions is polymorphic in our sample. Ifk ofthe cleavage sites is only necessary to be able to identify the particular fragments are observed to be polymorphic for the recognition sequence of interest by prior purification of a given class of DNA (2, 4, (that is, 1 ' ci s n - 1 for exactly k sites), we might consider 5) or by hybridization to a labeled homologous probe (3). Any k/m as an estimate ofthe proportion ofpolymorphic sites in the variability at cleavage sites within the DNA thus identified will entire segment. This estimator appears to be the one most often be detected, provided there are no large deletions or insertions used (2, 3). However, as pointed out by Ewens et al. (11), and causing observable length differences. by Nei and Li (8) for the case of n = 2, this estimator contains There have been several discussions of how such data may a serious ascertainment bias because it is conditioned on the be used in genetic mapping (6), or how they may be related to presence of at least one cut. We may correct this bias by as- models of evolutionary divergence (7-10) or steady-state pop- suming that the frequency of the recognition sequence among ulation genetics (11) in order to estimate the parameters in these monomorphic sites of whatever type has the same expectation models. My purpose in this paper is to present empirical meth- as its frequency in the sample as a whole. That is, the probability ods for estimating genetic divergence or population variability that a given site is monomorphic does not depend on its nu- independent ofevolutionary or population genetic models. The cleotide sequence. This assumption may be written as precision of these estimates can also be obtained with minimal assumptions concerning population structure. These methods E(m- k) = (L - j + 1) can be applied to cases in which all restriction sites in the sample X P(monomorphic site) P(recognition sequence), [1] have been mapped, as well as those in which only the lengths of restriction fragments are available. in which P( ) indicates probability given a randomly chosen site. Data and Definitions. Notation will follow that of Ewens et Noting that al. (11) wherever possible. Our sample is assumed to consist of E(c) = n(L - j + 1) P(recognition sequence), [2] The publication costs ofthis article were defrayed in part by page charge we have payment. This article must therefore be hereby marked "advertise- ment" in accordance with 18 U. S. C. §1734 solely to indicate this fact. P(monomorphic site) = nE(-) [3] 6329 Downloaded by guest on September 29, 2021 6,20Dean Genetics: Engels Proc. NatL Acad. Sci. USA 78 (1981) Replacing the expectations by their observed values gives the Heterozygosity estimator If the n sequences come from a random sample of individuals site) = ( k) in the population, we may estimate the heterozygosity in the P(monomorphic [41 population. Let vi represent the true frequency in the popu- For a heuristic interpretation of this estimator, note that it is lation of the recognition sequence at site i. Then by an as- the proportion ofall cuts in the sample that occurred in mono- sumption similar to Eq. 1 we write morphic sites. A special case of Eq. 4 was given by Nei and Li (8), (their [10] corrected by changing "xx" to "nx"). X drj = (L - j + 1) P(homozygous site) P(recognition sequence), To estimate the polymorphism, p, at single positions, one which may be combined with Eq. 2 to yield possibility suggested by previous authors (7, 8) is to use P(homozygous site) = E(c) p = 1 - P(monomorphic site)11I [5] Noting that from the assumption that each position is independent of its 1)1 neighbors. When the experiment includes several enzymes 7,li2=E I~(Cj- with various lengths of recognition sequences, some weighted [ n(n- 1)] average of the estimates from Eq. 5 must be used (8). A rea- sonable alternative suggested by Ewens et al. (11) is to asume from p. 69 of ref. 12, and replacing expectations by their ob- that a given site may be polymorphic at no more than one ofits served values, leads to the estimator by j positions. With this assumption, p may be estimated P(po- = - 1) lymorphic site)/j, or P(homozygous site) lcj(c(C(n -1) [9] c- n(m- k) [6] Finally, if we assume as before that a given site may be het- ic erozygous at no more than one position so that The interpretation of Eq. 6 is seen by noting that the nu- H = P(heterozygous site)/j [10] merator is the total number ofcuts at or near polymorphic po- sitions, and the denominator is the total number ofpositions in is the heterozygosity per position, we have the estimator the sample recognized by the endonuclease. This estimator has nc - 1) [11] the advantage of being easily extended to several endonu- jc(n -1) cleases. The same formula may be used with c, m, and k being summed over all enzymes, andj redefined as the average length This estimator requires no specific population genetic model. ofrecognition sequences weighted by the total number ofcuts Furthermore, Eq. 10 is true under the stated assumption re- made by each enzyme. The interpretations of the numerator gardless ofwhetherj refers to a single enzyme or to an average and denominator are thus preserved. weighted by number ofcuts over a heterogeneous series ofen- When A is used to measure genetic divergence between two donucleases. This is because j measures the expected number genomes, it is an estimate ofthe proportion ofbase mismatches. of positions per recognized site in both cases. Therefore, the Ifthe n segments are taken from a larger population, p must be estimator 11 may be applied directly when several enzymes are interpreted as an estimate of the frequency of polymorphism used. only in the sample at hand; it does not estimate polymorphism When n = 2, heterozygosity and polymorphism have the in the population as a whole. However, under the Wright-Fisher same meaning, and we should expect H to be equal to f. This model of random sampling with mutation but no selection, p equality is readily demonstrated by substituting n = 2, c = 2m may be used in estimating population parameters.

View Full Text

Details

  • File Type
    pdf
  • Upload Time
    -
  • Content Languages
    English
  • Upload User
    Anonymous/Not logged-in
  • File Pages
    5 Page
  • File Size
    -

Download

Channel Download Status
Express Download Enable

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

  • Not to be reproduced or distributed without explicit permission.
  • Not used for commercial purposes outside of approved use cases.
  • Not used to infringe on the rights of the original creators.
  • If you believe any content infringes your copyright, please contact us immediately.

Support

For help with questions, suggestions, or problems, please contact us