
Lecture 1, 23rd January (2nd week): An introduction to population genetics

Gil McVean

What is population genetics? Like so many branches of biology, what we think of today as population genetics would hardly be recognised by the founding fathers of the discipline. If you had been studying population genetics 80 years ago, you would probably have been working on the inheritance of microscopic traits in Drosophila or developing efficient crossing schemes for agricultural breeding; 30 years ago you may have been analysing levels of protein polymorphism and population differentiation. These days, if you work in population genetics you are more likely to be interested in using DNA sequence variation to map disease genes in humans, or to identify sites of adaptive evolution in viral genomes. But of course there is a link between all three types of study: the genetics of variation. Broadly speaking, population genetics can be defined as the study of the genetical basis of naturally occurring variation, with the aim of describing and understanding the evolutionary forces that create variation within species and which lead to differences between species. For example, this picture represents sequence-level variation in a gene called LPL, thought perhaps to play some role in hereditary heart disease. The types of questions we might want to ask of the data are: A) Can we detect a link between sequence variation and a predisposition to heart disease? B) What does variation in this gene tell us about the history of humans? C) Can we detect the influence of natural selection on the recent history of the gene? In turn, such questions raise other, more technical, but still critical issues: A) How many individuals should I sequence from? B) How much sequence should I collect from each? C) What is the best way to sample individuals in order to answer my question? These questions are not easy to address, and of course they are not independent of one another. What I hope to achieve during this course is to give you some understanding of how to begin answering such questions, and a feeling for the underlying theoretical models and methods.

Some definitions

This lecture is meant to be an introduction to the subject of population genetics. One of the most important things is to know what a population geneticist means by terms you are already familiar with, because it is more than likely that the two meanings are not the same. The most important term is the gene. To the molecular geneticist this means an open reading frame and all the associated regulatory elements. The classical geneticist’s view is only slightly different, but its starting point is the phenotype, not the genotype. A geneticist would call a gene a region of a chromosome to which they can map a phenotype. Long before DNA was understood to be the material of heredity, geneticists were talking about genes. To the evolutionary biologist, a gene is also defined by its ability to recombine, but in the evolutionary sense. The best definition of what a gene means to an evolutionary biologist comes from GC

Williams, who talks about a region of genetic material sufficiently small that it is not broken up by recombination, and which can therefore be acted upon in a coherent manner by natural selection. This is not the most concise definition, but the essence is that the gene is the unit of selection. There are other terms which it is important to understand. An allele can be defined as one of two or more possible forms of a gene. Alleles can exist naturally, or may be induced by mutagenesis. The key point is that alleles are mutually exclusive. The final term of key importance is polymorphism. Loosely, this means any gene or locus for which multiple forms exist in nature. However, because we can never sample every individual in a species, a more practicable, but arbitrary, definition is used – namely, a gene is polymorphic if the most common allele is at a frequency of less than 99%.

Historical developments in the understanding of the genetic basis of variation

The field of population genetics was created almost exactly 100 years ago, prompted by the rediscovery of Mendel’s laws of inheritance. But to understand the importance of this discovery it is important to go back even further, to the experiments of Mendel and, of course, to Darwin. Although Mendel didn’t realise it, his discovery that certain traits of seed coat and colour in peas are inherited in a particulate manner was critical to the widespread acceptance of Darwin and Wallace’s theory of evolution by natural selection. In its most simplified form, Darwin’s theory consists of just three statements: A) Organisms produce too many offspring. B) Heritable differences exist in traits influencing the adaptation of an organism to its environment. C) Organisms that are better adapted have a higher chance of survival. The problem was that, in the way Darwin envisaged inheritance, differences between organisms would be rapidly diluted through mating. In particular, he envisaged a form of blending inheritance in which offspring were intermediate between parental forms. Mendel’s discovery that traits could be inherited in a discrete manner of course changed that view, though it was not until de Vries, Correns and Tschermak von Seysenegg independently rediscovered both the phenomenon and, consequently, Mendel’s work that this was acknowledged. The idea that traits can be inherited in such a simple manner is extremely powerful, and following from de Vries and the others, many different traits showing such simple patterns of inheritance were rapidly described. For example, in humans the most well known examples are traits such as the ABO blood group and albinism. However, while the discovery of particulate inheritance solved one problem, it created an even greater one as well. The problem was that many geneticists, de Vries among them, came to understand the genetic nature of variation simply in terms of large, discrete differences – the difference between round and wrinkly peas, or the difference between pink and white flowers. But the Darwinian view is one of gradualism; that there exists a continuum of variation on which natural selection can act. De Vries was the first to use the term mutation – and by mutation he meant changes in the genetic material that led to large differences in phenotype. On the other hand, naturalists and systematists were developing a coherent view of evolution by natural selection that rested almost entirely upon the notion of small changes. The views of saltationists like Goldschmidt, with his ‘hopeful monsters’, and empiricists such as Dobzhansky seemed to be almost entirely incompatible.

The solution is of course that the gradual, quantitative differences of the neo-Darwinians are in fact composed of the cumulative effects of many different loci, each behaving in a Mendelian, particulate fashion. By studying patterns of inheritance of bristle number in Drosophila, Morgan was able to show (1915) that even minute differences can behave in a Mendelian fashion. Similar results were found by Jennings in Paramecium. Also important were the artificial selection experiments of Castle and Sturtevant on quantitative traits in rats and Drosophila, which showed that selection acting on genes of small effect is effective. Nilsson-Ehle (1909), working on pigmentation in wheat kernels, showed that the additive contribution of just a few loci (three in his case) could generate an apparently continuous distribution of phenotypes. In short, the link had been made between Mendelian inheritance and Darwin’s theory of evolution by natural selection.

The population genetics of continuous variation

The first major contribution of theoretical population genetics to the understanding of natural variation arose from the discovery that Mendelian inheritance could underlie apparently continuous traits. In 1918, RA Fisher published a paper demonstrating how the phenotypic variation in a trait, and correlations between relatives, could be used to partition variation into genetic and environmental components, and also how the genetic component could be further partitioned into terms representing additive, dominant and epistatic contributions across loci. This finding, along with earlier work on quantitative theories of inheritance, had two important consequences. First, it naturally gave rise to a method for estimating the genetic contribution to variation for any quantitative trait. Second, it provided a means of predicting the effect of any artificial selection regime, as practised by agriculturalists – and of course a framework within which to develop more efficient methods of breeding crops and animals with more desirable qualities. Traits affected by multiple loci are called polygenic traits, but the term multifactorial is often used in order to emphasise the importance of environmental influences. Multifactorial traits can be further broken down into three types: A) Continuous traits, e.g. height, birth weight, milk yield. B) Meristic traits, e.g. bristle number in Drosophila. C) Discrete traits with continuous liability, e.g. polygenic disease and other threshold traits. Fisher’s results provided a means of directly estimating the contribution of genetics to variation in the phenotype, a quantity which is termed heritability. There are two formulations of heritability, one known as “narrow-sense” heritability, the other as “broad-sense” heritability. “Narrow-sense” heritability is defined via the relationship between parental and offspring values for some trait. For example, if we plot mid-parent value (x) against offspring value (y) and fit the linear model y = c + bx, the linear coefficient b is estimated by

b = Cov(x, y) / Var(x)

and the relationship between b and heritability (h) is given by

b = h²
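As an illustration, here is a minimal sketch (not from the lecture) of this regression estimator applied to simulated mid-parent and offspring values under a purely additive model; the sample size and the assumed true heritability of 0.6 are arbitrary choices for illustration.

```python
# Minimal sketch: estimate narrow-sense heritability from the regression of
# offspring phenotype on mid-parent phenotype. Simulated data, purely additive model.
import numpy as np

rng = np.random.default_rng(1)
n_families = 500
h2_true = 0.6                 # assumed additive heritability (illustrative)
sigma_p = 1.0                 # phenotypic standard deviation

# Parental breeding values and phenotypes.
a_mother = rng.normal(0, np.sqrt(h2_true) * sigma_p, n_families)
a_father = rng.normal(0, np.sqrt(h2_true) * sigma_p, n_families)
p_mother = a_mother + rng.normal(0, np.sqrt(1 - h2_true) * sigma_p, n_families)
p_father = a_father + rng.normal(0, np.sqrt(1 - h2_true) * sigma_p, n_families)

# Offspring breeding value = mean of parental values plus Mendelian segregation noise.
a_child = 0.5 * (a_mother + a_father) + rng.normal(0, np.sqrt(h2_true / 2) * sigma_p, n_families)
p_child = a_child + rng.normal(0, np.sqrt(1 - h2_true) * sigma_p, n_families)

# Slope of offspring on mid-parent: b = Cov(x, y) / Var(x), which equals h^2.
mid_parent = 0.5 * (p_mother + p_father)
b = np.cov(mid_parent, p_child)[0, 1] / np.var(mid_parent, ddof=1)
print(f"estimated h^2 from mid-parent regression: {b:.2f}")   # close to 0.6
```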

What is the importance of this number? There are two ways this can be approached. First, it tells us something about what would happen were we to carry out artificial selection, something that is of fundamental importance to agricultural breeding programmes. Suppose only individuals with a trait value greater than some threshold are allowed to breed. If the difference between the mean value of the selected individuals and that of the entire population is the selection differential, S = x̄S − x̄, then the selection response, defined as the difference between the mean of the offspring and that of the parental population, R = x̄′ − x̄, is given by

R = h²S

For obvious reasons, another term for this form of heritability is realised heritability. The other way in which we can think about the importance of heritability is that it tells us something about the genetic basis of a trait. For example, if we are interested in reducing the incidence of some disease, an estimate of the heritability would give us an indication of whether it is worth trying to find genes involved in the disease. In fact, it turns out that we can write the phenotypic variance in a trait as a sum of components,

σ²P = σ²A + σ²D + σ²I + σ²E

where the terms on the right are, respectively, the additive genetic variance, dominance effects, epistatic effects and environmental effects. “Narrow-sense” heritability is then an estimate of

h² = σ²A / σ²P

This figure shows some examples of heritability estimates for various traits in agricultural animals. Another obvious method for estimating heritability, at least in humans, is to use twins. Because twins share the same intra-uterine environment, the comparison of mono- and di-zygotic twins should theoretically give an estimate of the total genetic contribution to variation. This measure, which includes the first three components of variance, is called the “broad-sense” heritability. Specifically, we can estimate heritability from the relationship

h² = (rM − rD) / (1 − rD)

where rM and rD are the trait correlations for monozygotic and dizygotic twins respectively. This figure shows estimates for various traits in humans. Two things are of note. First, there is very considerable variation in heritabilities between traits: for example, fertility has a very low heritability, while height and finger-print properties have very high heritabilities. Cognitive properties, such as IQ measures, and behavioural distinctions, such as extroversion, typically have a heritability of about 0.5. At the low end of the scale are features such as fertility, which have most likely been under strong selection – and as you will learn in a few weeks, selection tends to remove genetic variation. Finally, a note of caution should be added – the measurement of heritabilities in humans is notoriously unreliable because of unaccounted-for environmental correlations. For example, monozygotic twins tend to be treated more similarly than dizygotic twins. Any estimate of heritability in humans should be treated both with caution and as an upper limit, particularly for behavioural traits.
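To make these relationships concrete, here is a small hedged worked example that strings the formulas above together; all the numbers (the variance components, the selection differential and the twin correlations) are illustrative assumptions rather than values from the lecture.

```python
# Hedged worked example of the heritability relationships above.
# All numbers are illustrative assumptions, not data from the lecture.

# Variance components (arbitrary units): sigma2_P = sigma2_A + sigma2_D + sigma2_I + sigma2_E
var_A, var_D, var_I, var_E = 0.4, 0.1, 0.05, 0.45
var_P = var_A + var_D + var_I + var_E
h2_narrow = var_A / var_P                       # narrow-sense heritability
print(f"narrow-sense h^2 = {h2_narrow:.2f}")    # 0.40

# Breeder's equation R = h^2 * S: predicted response to truncation selection.
S = 2.0                                         # selected mean minus population mean
R = h2_narrow * S
print(f"selection differential S = {S}, predicted response R = {R:.2f}")

# Broad-sense heritability from twin correlations, h^2 = (rM - rD) / (1 - rD).
r_MZ, r_DZ = 0.80, 0.50                         # assumed MZ and DZ trait correlations
h2_broad = (r_MZ - r_DZ) / (1 - r_DZ)
print(f"broad-sense h^2 from twins = {h2_broad:.2f}")   # 0.60
```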

A similar approach can be taken with threshold traits – except that rather than measuring correlations, you measure the concordance of a trait (the proportion of comparisons with identical phenotype). This table shows concordance for a number of clinical traits in humans, and a corresponding estimate of the genetic component.
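A minimal sketch of how concordance, as defined above, could be computed for monozygotic and dizygotic twin pairs; the affection-status data are made up for illustration, and no particular estimator of the genetic component is implied.

```python
# Concordance for a threshold (present/absent) trait in twin pairs, using the
# definition above: the proportion of pairs in which both twins have the same phenotype.
# The data below are invented for illustration.
mz_pairs = [(1, 1), (1, 0), (1, 1), (0, 0), (1, 1), (1, 0)]   # (twin1, twin2) affection status
dz_pairs = [(1, 0), (0, 0), (1, 0), (1, 1), (1, 0), (1, 0)]

def concordance(pairs):
    """Proportion of pairs in which both twins have the same phenotype."""
    return sum(a == b for a, b in pairs) / len(pairs)

print(f"MZ concordance: {concordance(mz_pairs):.2f}")   # 0.67
print(f"DZ concordance: {concordance(dz_pairs):.2f}")   # 0.33
# A higher MZ than DZ concordance points to a genetic component to the trait.
```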

Further developments in the history of population genetics

The period of activity in evolutionary biology in the 1930s and 40s has come to be known as the neo-Darwinian synthesis. Researchers from the very different fields of systematics, palaeontology, cytology and genetics were all amassing evidence that Darwin’s gradualist theory of evolution by natural selection was both theoretically and empirically feasible, and that evidence for its influence was everywhere in nature. What was the contribution of population genetics to this slow revolution? This is an area that has been much debated. There is no doubt that the work produced by key figures such as Fisher, Haldane and Wright is now considered a fundamental part of the synthesis. But at the time, few biologists probably appreciated, let alone understood, their complicated mathematical problem-solving. The most important developments in population genetics all came within a few years of each other. JBS Haldane’s book The Causes of Evolution (1932) and Fisher’s The Genetical Theory of Natural Selection (1930) were both published in the early 30s, and are devoted to explaining the power of natural selection in generating adaptation. On the other side of the Atlantic, Sewall Wright published his famous paper Evolution in Mendelian Populations in 1931. To many people, Wright goes somewhat further than either Fisher or Haldane in providing a mathematical basis for the Darwinian synthesis. Critics of population genetics have argued that evolutionary biology is about three areas – adaptation, divergence and speciation – but that population genetics only considers the first. However, Wright was utterly absorbed in the way in which chance differences between populations can lead to evolution. Although he does not explicitly deal with the subject, his work has greatly influenced the way in which we think about how populations can diverge from each other, eventually leading to speciation. His idea of an adaptive landscape is one of the most persistent images in the field. The next major event in theoretical population genetics occurred in the late 1950s, when Motoo Kimura started publishing work which used diffusion theory methods, originally introduced by Wright and Fisher, to study the fate of alleles in populations. These papers involve some very difficult maths, but the key results are remarkably neat. Although he is mainly remembered for his work on the neutral theory of evolution, which I will discuss later this lecture, his work, much of it with Tomoko Ohta, has laid the foundation for much of what we understand about the behaviour of selected alleles. Kimura was the first theoretician I have discussed to have the benefit of knowing that genetic information is stored in DNA within chromosomes. Although DNA had been chemically identified as a component of cells in the late 1880s, it was not until the experiments of Avery and his colleagues in the early 1940s that its role in heredity was discovered. Even then these results were not widely accepted, and it was not until 1952 that Hershey and Chase provided incontrovertible evidence – just one year before the famous discovery of the structure of DNA by Crick and Watson. One of the most striking things about the development of theoretical population genetics is just how little it owes to an understanding of the

mechanistic basis of gene function. Mayr (1959) once derided the field as nothing more than ‘bean-bag genetics’. But to a large extent, it is that simplicity which makes population genetics so powerful a tool.

Empirical methods for detecting genetic variation

Since about 1960, almost all the revolutions in population genetics have been technical, and they have mainly involved the discovery of novel methods for detecting and quantifying population-level variation at the molecular level. Prior to this point, few methods of detecting variation were available to the empirical researcher. The most important was the use of serological methods to analyse antigenic diversity in blood cells. The injection of blood cells into rabbits causes them to raise an immune response, such that when antibodies extracted from the animal’s serum are mixed with blood cells of the same type, the blood cells coagulate and precipitate – something that can be visualised on a microscope slide. By this method, an amazing level of molecular diversity was revealed on both the red and white blood cells. Of course we all know about the ABO system, and most will know about the Rhesus system too. But over 50 different blood groups have been identified – here is a table of allele frequencies for 12 of them in the white English population. Antigenic diversity on the white blood cells is even more remarkable, controlled by multiple HLA loci within the MHC cluster on chromosome 6. Each locus has many different allelic forms, and the frequency of the alleles varies considerably between populations. For example, the second most common allele at HLA-A in Europeans, which is at a frequency of about 16%, is at a frequency of only 1% in Japanese. But few proteins can be assayed by serological methods. For this reason, the discovery of a technique called protein electrophoresis in the mid 60s was of enormous importance. Proteins are made of amino acids, some of which carry either a positive or a negative charge. In solution, proteins act as electrostatically charged particles. So, if they are placed in a gel of starch, agar or another polymer, and oppositely charged poles are placed at either end of the gel, they will tend to move towards the pole of the opposite charge, and the rate at which they move is a function of their charge and size. Differences in amino acid composition can cause differences in the overall charge. Protein variants that migrate at different rates are known as allozymes, or electromorphs. After a period of time, the position of the proteins in the gel can be visualised either by staining, or by making use of enzymatic properties of the molecule. For example, peptidase B is visualised through a reaction of snake venom with horseradish peroxidase and the tripeptide substrate leu-gly-gly. Protein electrophoresis is remarkably effective at detecting protein variation: it is estimated that 85-90% of all amino acid substitutions result in electrophoretically distinct molecules. Following the introduction of the technique into population genetics by Harris (in humans) and Lewontin and Hubby (in Drosophila), levels of protein variation were assayed in a wide range of organisms.

Measurements of genetic variation

Before considering the results of these experiments, it is necessary to describe how genetic, or protein, variation can be quantified. There are two simple measures which are widely used – polymorphism and heterozygosity. Polymorphism is the proportion of loci at which different alleles can be detected. It says nothing about how many alleles there are, or what frequencies they are at, just whether any differences can be detected.

Heterozygosity at a locus is the proportion of individuals in which two distinct alleles can be detected, with the obvious caveat that heterozygosity can only be measured in diploid individuals. So, for example, if we collect the following data on allozyme variation, 3 of the 4 loci are polymorphic, and the average heterozygosity is … Why should we be interested in these two particular measures of variation? The answer is that these numbers are the key quantities in any theoretical population genetic understanding of the forces shaping variation, and under certain models they have a very simple relationship to underlying parameters of mutation, selection and genetic drift. When these methods were used to survey phylogenetically diverse taxa, from humans to plants, a remarkably high level of polymorphism was consistently found. For example, in 30 species of mammals surveyed, with an average of 28 loci per species, about 1 in 5 loci are found to be polymorphic, and heterozygosity is about 5%. Invertebrates appear to be slightly more polymorphic than vertebrates: for example, polymorphism in Drosophila is at about 0.5 and heterozygosity at about 0.15.
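As a concrete illustration of these two measures, here is a minimal sketch that applies the 99% polymorphism criterion and the heterozygosity definition above to a made-up table of diploid allozyme genotypes; the locus names and genotypes are invented for illustration.

```python
# Polymorphism and average heterozygosity from toy diploid genotype data.
# Each genotype is a pair of allele labels observed in one individual.
genotypes = {
    "Adh":   [("F", "S"), ("F", "F"), ("S", "S"), ("F", "S")],
    "Est-6": [("A", "A"), ("A", "A"), ("A", "A"), ("A", "A")],
    "Pgm":   [("1", "2"), ("1", "1"), ("2", "2"), ("1", "2")],
    "Mdh":   [("a", "a"), ("a", "b"), ("a", "a"), ("a", "a")],
}

def is_polymorphic(locus_genotypes, cutoff=0.99):
    """Polymorphic if the most common allele is at a frequency below the cutoff."""
    alleles = [a for g in locus_genotypes for a in g]
    freqs = [alleles.count(x) / len(alleles) for x in set(alleles)]
    return max(freqs) < cutoff

def heterozygosity(locus_genotypes):
    """Proportion of individuals carrying two distinct alleles at the locus."""
    return sum(a != b for a, b in locus_genotypes) / len(locus_genotypes)

poly = [locus for locus, g in genotypes.items() if is_polymorphic(g)]
mean_het = sum(heterozygosity(g) for g in genotypes.values()) / len(genotypes)
print(f"{len(poly)} of {len(genotypes)} loci polymorphic; mean heterozygosity = {mean_het:.2f}")
# With these toy data: 3 of 4 loci polymorphic, mean heterozygosity about 0.31.
```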

Genetic load arguments and the rise of the neutral theory of molecular evolution

Why should such levels of polymorphism have been considered high when they were first described? The answer is that up until that point the most widespread belief among evolutionary biologists was that the majority of variation was maintained by balancing selection. That is, that polymorphism is maintained at loci because organisms carrying two different alleles are significantly fitter than those carrying two alleles of the same type. The reasons for the widespread acceptance of this theory are not particularly clear, but the paradigm of sickle-cell anaemia and protection against malaria was important, as was the evidence from ecology for the functional nature of many phenotypic polymorphisms (for example, wing patterns in butterflies). The finding that a large proportion of loci are polymorphic was problematic for this theory, because it meant that natural selection must be incredibly efficient at balancing polymorphisms at many loci. For example, if 30% of proteins show allozyme variation and there are in the region of 100,000 proteins encoded by the human genome, then natural selection must be maintaining polymorphism at about 30,000 loci. Why should this be a problem? The reason is very simple – heterozygotes do not just produce heterozygous offspring. Mendelian segregation ensures that homozygous offspring are produced as well. So just by chance, we expect the number of heterozygous loci to vary considerably between individuals, and as a consequence we can expect fitness to vary considerably between individuals. This type of argument gives rise to the notion of genetic load, a concept that has historically been very important in the development of population genetics theory. Genetic load is defined as the difference in fitness between a population and its theoretical optimum. For example, in the case of heterozygote advantage, the most fit population would be heterozygous at every locus. The actual population cannot achieve this because of Mendelian segregation. For example, consider the following scheme:

Genotype   Fitness   Frequency
AA         1 − s     x²
Aa         1         2x(1 − x)
aa         1 − s     (1 − x)²

Ignoring drift in a finite population, the equilibrium frequency of each allele is 0.5, so the genetic load due to that locus is

L = (wopt − w̄) / wopt = s/2

So, if there are 30,000 loci, each maintained by balancing selection with a small selection coefficient of, say, 1%, then if fitnesses are multiplicative across loci, the genetic load due to segregation (called the segregation load) is such that the ratio of average fitness to best possible fitness is about 5×10⁻⁶⁶. In other words, if absolute fitness is relevant to the survival of a species, then humans should be extinct. Although striking, the relevance of a theoretical population that can never exist is not clear – in any finite population no individual is going to be heterozygous at every locus. We can, however, ask a more directly relevant question – given Mendelian segregation, we expect variation between individuals in the number of heterozygous loci. How much variation in fitness should we expect? The answer is staggering. If all alleles are at a frequency of 0.5, then the expected number of heterozygous loci is 15,000 in humans, but the variance is considerable, such that in a population of 10,000 people we would expect the difference between the highest and lowest number of heterozygous loci to be about 500. If the selection coefficient is 1% again, this translates to a 150-fold difference in fitness. This level of variation in fitness is simply not seen in any natural population. There are further counter-arguments to what I have just presented, and currently genetic load arguments attract little widespread interest or acceptance. But the important point is to understand that this type of thinking led people to question the belief that natural selection was responsible for maintaining all variation. Such arguments, coupled with observations demonstrating the constancy of rates of molecular evolution, and the growing data showing how little of the eukaryotic genome is actually involved in protein encoding, led Kimura (1968) and King and Jukes (1969) to the conclusion that the majority of changes at the DNA level are of little or no functional consequence to the organism. Today, it is hard to appreciate just how revolutionary this argument was. For decades, evolutionary biologists had been amassing evidence about the incredible power of natural selection for creating adaptation. Yet here was the claim that the vast majority of change in the genes and proteins which make an organism are completely neutral. Naturally, there was much opposition to the idea, but when data on the rate of evolution and levels of variation at the DNA level began to accumulate, the neutral theory had to be taken seriously.
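The load and fitness-variance arithmetic above is easy to check numerically. Here is a hedged sketch that reproduces the rough orders of magnitude under the stated assumptions (30,000 overdominant loci, s = 1%, allele frequencies of 0.5, multiplicative fitness, a sample of 10,000 people); the choice of ±3.7 standard deviations for the sample extremes is an approximation introduced here, not a figure from the lecture.

```python
# Hedged numerical check of the segregation-load argument above.
import math

n_loci = 30_000     # loci assumed to be maintained by heterozygote advantage
s = 0.01            # selection coefficient against each homozygote

# Per-locus load L = (w_opt - w_bar)/w_opt = s/2, so with multiplicative fitness
# the ratio of mean to optimal fitness over all loci is (1 - s/2)^n_loci.
relative_fitness = (1 - s / 2) ** n_loci
print(f"mean / optimal fitness ~ {relative_fitness:.1e}")        # ~5e-66

# The number of heterozygous loci per individual is roughly Binomial(n_loci, 0.5).
mean_het = 0.5 * n_loci                                          # 15,000
sd_het = math.sqrt(n_loci * 0.25)                                # ~87
# In a sample of 10,000 people the extremes lie very roughly +/- 3.7 sd from the mean,
# giving a spread of the same order as the ~500 loci quoted above.
spread = 2 * 3.7 * sd_het
print(f"expected spread in heterozygous loci ~ {spread:.0f}")    # ~640

# Each extra homozygous locus multiplies fitness by (1 - s), so a difference of
# ~500 heterozygous loci corresponds to a fitness ratio of about:
fitness_ratio = (1 - s) ** -500
print(f"fitness ratio between extremes ~ {fitness_ratio:.0f}")   # ~150
```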

Direct measures of variation at the DNA level

Allozyme studies had provided a fascinating glimpse into variation at the molecular level, but it was not until the advent of techniques directly assaying DNA variability that the complete picture began to emerge. The first technique to be developed was the use of restriction fragment length polymorphisms (RFLPs). The genomes of many bacteria contain enzymes called restriction endonucleases, which are thought to be used in defence against phages, and which cut DNA at specific sequence motifs. As with proteins, DNA is a charged molecule and will move through a gel down an electrostatic gradient at a rate which is a function of its size. It is then possible to denature the DNA, transfer it from the gel onto a membrane, and probe it with radioactively labelled homologous sequence – a procedure known as the Southern blot (after Ed Southern). By using a series of restriction enzymes it is possible to build up a map of the DNA sequence in terms of the relative locations of different restriction sites. How does this help analyse variation? The reason is that a

certain proportion of differences at the DNA level will affect restriction sites, which results in different patterns of restriction fragments. The advent of the polymerase chain reaction (PCR) changed the face of genetics, and that includes population genetics. The two most important ways in which PCR has enabled the large-scale analysis of genetic variation are, first, through the use of microsatellite markers and, second, through enabling rapid sequencing in order to study single-nucleotide polymorphisms, or SNPs. Microsatellites are very short motifs, typically only 2-4 base pairs long, that occur in tandem repeats within genomes. Their replication appears to be highly unstable, such that the number of repeats changes at a much higher rate than single point mutations. The average rate for point mutations is about 10⁻⁸ per generation. For microsatellites, it is about 10⁻⁵ per generation, that is 1000 times higher. The first study of DNA sequence-level variation through complete sequencing was carried out in 1983 by Marty Kreitman on the Alcohol dehydrogenase gene of Drosophila melanogaster. Full DNA sequencing, or resequencing as it is sometimes known, is clearly the only way of knowing the true extent of genetic variability, but until recently it was both very expensive and laborious. The twin developments of PCR sequencing and precision electrophoresis as carried out by ABI machines have made sequencing a much easier task, though it is still relatively expensive compared to using microsatellites or allozymes. A common strategy among groups wishing to analyse sequence-level variation in many individuals is to fully sequence a subsample of the sequences to identify a set of polymorphisms, then use much faster, cheaper methods to assay for these polymorphic sites in a much larger set of samples. These days it is not uncommon to find sequencing surveys of over 100 individuals, where several kb from each individual have been characterised. Just as for protein polymorphisms, we can characterise variation at the DNA level by a series of statistics. The most commonly used statistics are the number of sites which are polymorphic in a sample, the average pairwise differences between sequences, and the number of haplotypes. Because the number of segregating sites changes with the number of sequences analysed, and so does the number of haplotypes, the easiest statistic to compare between loci and species is the average pairwise difference per site compared. This statistic shows somewhat greater variation between species than measures of allozyme variability. For example, the average pairwise differences in genes for humans and Drosophila are about 0.1% and 1% respectively, whereas average heterozygosity at allozymes is about 0.06 and 0.14 respectively. What have studies of DNA-level variation told us that we would not have expected from patterns of allozyme variation? Two observations are of particular importance. The first is that different types of mutation at the DNA level have different levels of polymorphism. In particular, sites in coding regions at which some or all mutations have no effect on the amino acid encoded have higher levels of polymorphism than sites where mutations lead to amino acid changes. This observation is clearly evidence in favour of the neutral theory, because the less constrained sites have higher effective mutation rates. However, it is not true to say that all mutations that leave the amino acid unaltered are neutral.
Levels of polymorphism in noncoding regions, such as 5’ and 3’ untranslated regions and introns, tend to be lower in Drosophila than synonymous diversity. These patterns suggest that subtle constraints act on non-coding regions, perhaps through their influence on gene regulation, or their role in nucleic acid structure and stability.
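To fix ideas, here is a minimal sketch of how the three summary statistics mentioned above (segregating sites, average pairwise differences per site, and the number of haplotypes) could be computed from an aligned sample of sequences; the toy alignment is invented for illustration and its diversity is far higher than the real values quoted above.

```python
# Summary statistics of DNA sequence variation from a toy alignment.
from itertools import combinations

seqs = [
    "ATGCCGTTAC",
    "ATGCCGTTAC",
    "ATGACGTTGC",
    "ATGACGCTAC",
    "ATGCCGTTGC",
]
n_sites = len(seqs[0])

# Segregating sites: positions at which more than one base is observed.
S = sum(len({s[i] for s in seqs}) > 1 for i in range(n_sites))

# Average pairwise differences per site (nucleotide diversity, often written pi).
pair_diffs = [sum(a != b for a, b in zip(s1, s2)) for s1, s2 in combinations(seqs, 2)]
pi = sum(pair_diffs) / len(pair_diffs) / n_sites

# Number of distinct haplotypes in the sample.
n_haplotypes = len(set(seqs))

print(f"S = {S}, pi = {pi:.3f}, haplotypes = {n_haplotypes}")   # S = 3, pi = 0.160, haplotypes = 4
```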

The other striking pattern revealed from comparisons of DNA and allozyme variation is their lack of concordance. That is, while allozyme variation is remarkably constant across species, DNA level variation shows a much greater range. For example, Drosophila has about twice the allozyme heterozygosity of humans, but 10 times the DNA variation. What might cause this discrepancy? One possibility is that there really are subtle selective effects maintaining allozyme variation, but the issue is not resolved.

Contemporary issues in population genetics

What are the issues in current population genetic research? First, and perhaps most obviously, there is enormous interest in the use of population genetic methods to address questions in medical genetics. Association mapping is becoming an important tool in the detection of genes involved in multifactorial disease, because it can detect effects that are too weak to pick up in pedigree analysis. In addition, a population genetic approach to human disease has the advantage that it sees the whole picture at once – pedigree analysis may locate genes, but these may contribute only a tiny fraction of the variability in risk within a population. As the use of population genetic data becomes more widespread, so it is necessary to develop the appropriate statistical methods with which to analyse data. This is a major focus in theoretical population genetics research. Much of the research is into developing computationally intensive methods for full likelihood analysis of DNA sequences. That is, under a given model, you find the parameter values that maximise the probability of observing the data. The sorts of questions these methods aim to address are whether it is possible to detect the trace of adaptive evolution in genomes, and how much recombination there is in DNA sequences. Both of these subjects will be the focus of later lectures. Finally, there are unanswered theoretical and empirical questions which still demand the attention of population geneticists, and to a large extent these are the same questions as were being addressed by the earliest practitioners. Questions like: “what maintains quantitative variation in populations?”, “how does selection acting simultaneously at multiple loci complicate the path of evolution?” and “can population genetics explain speciation?” Whether this is cause for excitement or depression, I will leave you to decide.
