
Training course in Quantitative and

Biosciences East and Central Africa- International Livestock Research Institute (BecA-ILRI) Hub

Nairobi, KENYA May 30-June 10, 2016

POPULATION AND QUANTITATIVE GENETICS

GENOME ORGANIZATION AND GENETIC MARKERS

SELECTION THEORY

BREEDING STRATEGIES

Samuel E Aggrey, PhD Professor Department of Poultry Science Institute of Bioinformatics

University of Georgia Athens, GA 30602, USA [email protected]

Preface

These lecture notes were written in an attempt to cover parts of Population Genetics, Quantitative Genetics and Molecular Genetics for postgraduate students, and also as a refresher for those working in the field. The course material is not a textbook and is not meant to be copied, duplicated or sold. This text is unedited, and I am solely responsible for all conceptual mistakes, grammatical errors and typos. Genetics is a life-long course and cannot be covered in a few lectures. Because of time constraints, only selected parts of population, quantitative and molecular genetics will be covered in this course. The course will cover some of the evolutionary forces that change allele frequency between generations, such as selection and gene flow, and some aspects of Quantitative and Molecular Genetics.

To those men who have kept us awake for over two centuries and I believe would continue to do so for many more centuries!

POPULATION GENETICS

The study of the genetic composition of biological populations, and of the changes in genetic composition that result from the operation of various factors including (a) natural selection, (b) mutation, (c) random genetic drift and (d) gene flow.

Genetic composition
1. The number of alleles at a locus
2. The frequency of alleles at a locus
3. The frequency of genotypes at a locus
4. Transmission of alleles from one generation to the next

Population
A group of breeding individuals

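Single locus: Locus A with two alleles A1 and A2

Let P, H and Q be the frequencies of the genotypes A1A1, A1A2 and A2A2 (P + H + Q = 1), and let p and q be the frequencies of the alleles A1 and A2. Then

$$p = P + \tfrac{1}{2}H$$

$$q = Q + \tfrac{1}{2}H$$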

Derivation of the Hardy-Weinberg principle

Ideal population
1. There are two sexes and the population consists of sexually mature individuals.
2. All matings between males and females are equally probable (independent of the distance between mates, their genotype, or the age of the individuals).
3. The population is large and the actual frequency of each mating is equal to its Mendelian expectation.
4. Meiosis is fair. We assume that there is no segregation distortion, no gamete competition, and no difference in the developmental ability of eggs or the fertilizing ability of sperm.
5. All matings produce the same number of offspring, on average.

Thus, the frequency of a particular genotype in the pool of newly formed zygotes is:

$$\sum (\text{frequency of mating}) \times (\text{frequency of genotype produced from that mating})$$

$$f(A_1A_1 \text{ in zygotes}) = P^2 + \tfrac{1}{2}PH + \tfrac{1}{2}PH + \tfrac{1}{4}H^2 = \left(P + \tfrac{1}{2}H\right)^2 = p^2$$

$$f(A_1A_2) = 2pq$$

$$f(A_2A_2) = q^2$$

6. Generations do not overlap.
7. There is no difference among genotype groups in the probability of survival.
8. There is no migration, mutation, drift or selection.

Hardy-Weinberg Law

In a large random mating population, in the absence of mutation, migration, selection and random drift, allele frequency remains the same from generation to generation. Furthermore, there is a simple relationship between allele frequency and genotypic frequency.

Why is the Hardy-Weinberg principle so important? Is there any population anywhere in the world or outer space that satisfies all assumptions? Evolutionary forces acting within populations cause a violation of at least one of these assumptions, and departures from Hardy-Weinberg proportions are one way in which we detect those forces and estimate their magnitude. The most significant evolutionary factors are selection (natural or artificial), non-random mating and gene flow.


Fig. 1 shows the relationship between allele frequency and the three genotypic frequencies for a population in Hardy-Weinberg proportions:
1. The heterozygote is the most common genotype for intermediate allele frequencies.
2. One of the homozygotes is the most common genotype when the allele frequency is not intermediate.
3. The heterozygote is the most common genotype only when q is between ⅓ and ⅔, i.e. over ⅓ of the range of q.
4. When q is between 0 and ⅓, A1A1 is the most common, and when q is between ⅔ and 1, A2A2 is the most common.
5. The maximum frequency of the heterozygote occurs when q = 0.5. This can be shown directly by setting the derivative of the H-W heterozygosity, 2pq = 2q(1 − q), equal to zero and solving for q:

$$\frac{d[2q(1-q)]}{dq} = 2 - 4q = 0 \quad\Rightarrow\quad q = \tfrac{1}{2}$$

Here, we assume that the generations are non-overlapping, i.e. the parents die after producing progeny, and the progeny then become the next parental generation.
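The properties listed for Fig. 1 can be checked numerically. The following is a minimal sketch in Python; the function names `hardy_weinberg` and `most_common_genotype` are illustrative choices, not part of the course material.

```python
def hardy_weinberg(q):
    """Return (f(A1A1), f(A1A2), f(A2A2)) for a given frequency q of allele A2."""
    p = 1.0 - q
    return p * p, 2.0 * p * q, q * q


def most_common_genotype(q):
    """Return the name of the genotype with the highest Hardy-Weinberg frequency."""
    freqs = dict(zip(("A1A1", "A1A2", "A2A2"), hardy_weinberg(q)))
    return max(freqs, key=freqs.get)


if __name__ == "__main__":
    # Scan a few allele frequencies on either side of 1/3 and 2/3.
    for q in (0.1, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9):
        print(q, hardy_weinberg(q), most_common_genotype(q))
```

Scanning q in finer steps shows the heterozygote becoming the most common genotype only between ⅓ and ⅔, with its frequency peaking at q = 0.5.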

Testing for deviation from Hardy-Weinberg Equilibrium

Departure from Hardy-Weinberg equilibrium can be tested using a sample of individuals scored for their genotypes. The genetic model provided by Hardy-Weinberg generates the expected genotypic frequencies at equilibrium, which we can compare with the observed genotypic frequencies. The chi-square test of goodness of fit and the likelihood ratio test can both be used to test for departure from Hardy-Weinberg equilibrium; the chi-square test is an approximation to the likelihood ratio test. To perform a chi-square goodness of fit test, we first estimate the allele frequencies from the observed genotype counts, then use them to generate the expected genotype numbers. We can compute the chi-square statistic as:

$$X^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$$

where O and E are the observed and expected numbers of a particular genotype and n is the number of genotypic classes. From the calculated value of X² and the table value of X² we can obtain the probability that a deviation as large as the one observed would arise by chance. The degrees of freedom used to determine the significance of the X² value are equal to the number of genotypic classes, n, minus one, then minus the number of parameters estimated from the data. One degree of freedom is always lost because we use the data to estimate allele frequency; for two alleles this leaves 3 − 1 − 1 = 1 degree of freedom. We can use the chi-square distribution to test whether the value of X² is too large to be the result of sampling error. In doing so we are performing a one-tailed test. The chi-square expression for two alleles is given as:

$$X^2 = \frac{(N_{11} - \hat{p}^2 N)^2}{\hat{p}^2 N} + \frac{(N_{12} - 2\hat{p}\hat{q}N)^2}{2\hat{p}\hat{q}N} + \frac{(N_{22} - \hat{q}^2 N)^2}{\hat{q}^2 N}$$

An alternative way to measure the difference between observed and expected frequencies is to calculate the standardized deviation of the observed heterozygote frequency from its Hardy-Weinberg expectation, which provides the inbreeding coefficient or, more generally, the fixation index, F:

$$F = \frac{2pq - H}{2pq} = 1 - \frac{H}{2pq}$$

It can be shown that

$$X^2 = F^2 N$$

For two alleles, the chi-square goodness of fit test for Hardy-Weinberg proportions is equivalent to the test for inbreeding, F = 0. However, F is unstable as the expected value (E) approaches zero, and is therefore not useful for rare or very common alleles. For E = 0 and O > 0, F = −∞, and for E = 0 and O = 0, F is undefined. Deviation from Hardy-Weinberg proportions can also be tested using the likelihood ratio test, which is described in most statistical texts.


The B/b locus is responsible for plumage color in chickens found in the Rift Valley. The B allele expresses black plumage, which is completely dominant over the b allele for brown plumage.

Phenotype   Genotype   Observed number   Expected number
Black       BB         290               p̂²N = 289.444
Black       Bb         496               2p̂q̂N = 497.112
Brown       bb         214               q̂²N = 213.444
Total                  1,000             1,000

P = 290/1000 = 0.29; H = 496/1000 = 0.496; Q = 214/1000 = 0.214; P + H + Q = 1.0
p̂ = P + ½H = 0.29 + ½(0.496) = 0.538; q̂ = Q + ½H = 0.214 + ½(0.496) = 0.462; p̂ + q̂ = 1.0

Note: Chi-square is allergic to fractions and ratios, but really likes integers!

$$X^2 = \frac{(290 - 289.444)^2}{289.444} + \frac{(496 - 497.112)^2}{497.112} + \frac{(214 - 213.444)^2}{213.444} = 0.0050$$

The table value of X² at p = 0.05 with 1 degree of freedom is 3.84. Since the calculated X² is lower than the table value, we conclude that the data do not deviate significantly from Hardy-Weinberg proportions.

$$F = 1 - \frac{H}{2pq} = 1 - \frac{0.496000}{0.497112} = 0.002237$$

$$X^2 = F^2 N = 0.0050$$
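The plumage example can be reproduced with a few lines of Python. This is a minimal sketch under the two-allele model used above; the function name `hwe_chi_square` is a hypothetical helper, and the critical value 3.84 is simply the table value quoted in the text (no statistics library is used).

```python
def hwe_chi_square(n11, n12, n22):
    """Chi-square goodness-of-fit test for Hardy-Weinberg proportions (two alleles).

    n11, n12, n22 are observed COUNTS (not frequencies) of the three genotypes.
    Returns (p_hat, q_hat, expected_counts, chi_square, F).
    """
    n = n11 + n12 + n22
    p_hat = (2 * n11 + n12) / (2 * n)      # p = P + H/2, computed by allele counting
    q_hat = 1.0 - p_hat
    expected = (p_hat ** 2 * n, 2 * p_hat * q_hat * n, q_hat ** 2 * n)
    observed = (n11, n12, n22)
    chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    f = 1.0 - (n12 / n) / (2 * p_hat * q_hat)   # F = 1 - H / (2pq)
    return p_hat, q_hat, expected, chi_square, f


if __name__ == "__main__":
    p_hat, q_hat, expected, x2, f = hwe_chi_square(290, 496, 214)
    print("p =", p_hat, "q =", q_hat)
    print("expected counts:", expected)
    print("X2 =", x2, " F =", f, " F^2 * N =", f * f * 1000)
    print("deviates from H-W at the 5% level:", x2 > 3.84)   # 1 degree of freedom
```

The last printed line compares the statistic with the tabulated critical value, and the `F^2 * N` output illustrates the identity X² = F²N for two alleles.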


Extension of Hardy-Weinberg's Law: Multiple Alleles

Let us consider a single locus with three alleles A1, A2 and A3 with frequencies p, q and r, respectively.

Hardy-Weinberg frequencies for three autosomal alleles at a single locus

Allele (frequency)   A1 (p)       A2 (q)       A3 (r)
A1 (p)               A1A1  p²     A1A2  pq     A1A3  pr
A2 (q)               A2A1  qp     A2A2  q²     A2A3  qr
A3 (r)               A3A1  rp     A3A2  rq     A3A3  r²

Genotype   Frequency        Number
A1A1       p²               N11
A1A2       pq + pq = 2pq    N12
A1A3       pr + pr = 2pr    N13
A2A2       q²               N22
A2A3       qr + qr = 2qr    N23
A3A3       r²               N33
TOTAL      1.0              N

Please note that p + q + r = 1, and the key to solving multiple-allele problems is to break them down so that each step resembles a two-allele problem. Start with the A3A3 homozygote:

$$f(A_3A_3) = r^2 = \frac{N_{33}}{N} \quad\Rightarrow\quad r = \sqrt{\frac{N_{33}}{N}}$$

From here, let's reduce the problem to a two-allele locus involving the allele A3. The genotypes that contain only A2 and A3 are A2A2, A2A3 and A3A3, with expected total frequency under H-W:

$$q^2 + 2qr + r^2 = \frac{N_{22} + N_{23} + N_{33}}{N}$$

From basic algebra, $(a + b)^2 = a^2 + 2ab + b^2$. This implies $(q + r)^2 = q^2 + 2qr + r^2$. Therefore:

$$(q + r)^2 = \frac{N_{22} + N_{23} + N_{33}}{N}$$

$$q + r = \sqrt{\frac{N_{22} + N_{23} + N_{33}}{N}}$$

$$q = \sqrt{\frac{N_{22} + N_{23} + N_{33}}{N}} - \sqrt{\frac{N_{33}}{N}}$$

Since p + q + r = 1, then p = 1 − (q + r):

$$p = 1 - \sqrt{\frac{N_{22} + N_{23} + N_{33}}{N}}$$

The ABO blood group in humans is determined by three alleles, A, B and O.

Allele (frequency)   A (p)      B (q)      O (r)
A (p)                AA  p²     AB  pq     AO  pr
B (q)                AB  pq     BB  q²     BO  qr
O (r)                AO  pr     BO  qr     OO  r²

Genotype   Frequency        Number
AA         p²               N11
AB         pq + pq = 2pq    N12
AO         pr + pr = 2pr    N13
BB         q²               N22
BO         qr + qr = 2qr    N23
OO         r²               N33

In the year 1825, the director general of ILRI-Musastan ordered a staff nurse to collect blood samples of all capacity building course participants. Of the 1,825 individuals sampled, 700 were type A, 250 were type B, 75 were type AB and 800 were type O. Determine the frequency of the A, B and O alleles.

Hint:

Phenotype   Genotype   H-W expectation   Number
A           AA + AO    p² + 2pr          700
B           BB + BO    q² + 2qr          250
AB          AB         2pq               75
O           OO         r²                800
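One way to work through the hint is the square-root method derived above for three alleles: types B and O together have expected frequency q² + 2qr + r² = (q + r)², and type O alone has frequency r². The sketch below applies that reasoning in Python; the function name `abo_allele_frequencies` is a hypothetical helper, and this simple estimator is an assumption of the sketch (it is not a maximum-likelihood estimate and does not use the A and AB counts beyond the total).

```python
from math import sqrt


def abo_allele_frequencies(n_a, n_b, n_ab, n_o):
    """Estimate (p, q, r) for alleles A, B, O from phenotype counts.

    Uses r = sqrt(f(O)), q + r = sqrt(f(B) + f(O)), and p = 1 - (q + r).
    """
    n = n_a + n_b + n_ab + n_o
    r = sqrt(n_o / n)                    # f(OO) = r^2
    q_plus_r = sqrt((n_b + n_o) / n)     # f(BB) + f(BO) + f(OO) = (q + r)^2
    q = q_plus_r - r
    p = 1.0 - q_plus_r
    return p, q, r


if __name__ == "__main__":
    p, q, r = abo_allele_frequencies(n_a=700, n_b=250, n_ab=75, n_o=800)
    print("p(A) =", round(p, 4), "q(B) =", round(q, 4), "r(O) =", round(r, 4))
```

The three estimates sum to 1 by construction. Comparing the expected number of AB individuals, 2pqN, with the observed 75 gives a rough check on how well the Hardy-Weinberg model fits these data.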


Natural Selection at One Locus

Differential viability and fertility

Natural selection occurs when genotypes in a population differ in survival, fertility or reproduction. In this case, we multiply each genotype's frequency by its fitness, where fitness is a reflection of the genotype's probability of survival and its relative participation in reproduction. Assume a population with a single autosomal locus with two alleles, A1 and A2, and three diploid genotypes A1A1, A1A2 and A2A2 with fitnesses denoted w11, w12 and w22, respectively. Unless w11, w12 and w22 are all equal, natural selection will occur, possibly leading the genetic composition of the population to change. Before the operation of natural selection (generation 0), the genotypes are in Hardy-Weinberg equilibrium and the frequencies of the A1 and A2 alleles are p0 and q0, respectively (p0 + q0 = 1). The genotypes of generation 0 produce progeny that become generation 1, with frequencies of A1 and A2 denoted by p1 and q1, respectively (p1 + q1 = 1). In both generations, the allele frequency is considered at the zygote stage and may differ from the adult allele frequency if there is differential viability.

Assuming there is no mutation, and that Mendel's law of segregation is operational, then an A1A1 genotype will produce only A1 gametes, an A2A2 genotype will produce only A2 gametes, and an A1A2 genotype will produce A1 and A2 gametes in equal proportion. Therefore, the proportion of A2 gametes, and thus the frequency of the A2 allele in generation one at the zygotic stage, is:

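$$q_1 = \frac{q_0^2 w_{22} + \tfrac{1}{2}(2 p_0 q_0 w_{12})}{\bar{w}}$$

$$q_1 = \frac{q_0^2 w_{22} + p_0 q_0 w_{12}}{\bar{w}} \qquad [1]$$

Equation [1] is known as a 'recurrence' equation, as it expresses the frequency of the A2 allele in generation 1 in terms of its frequency in generation 0. The change in frequency between generations can then be written as:

$$\Delta q = q_1 - q_0 = \frac{q_0^2 w_{22} + p_0 q_0 w_{12}}{\bar{w}} - q_0 = \frac{q_0^2 w_{22} + p_0 q_0 w_{12} - q_0 \bar{w}}{\bar{w}}$$

If we substitute $\bar{w} = p^2 w_{11} + 2pq\,w_{12} + q^2 w_{22}$ from Table 3, use $q = 1 - p$, and drop the generation subscript, the equation above simplifies to:

$$\Delta q = \frac{pq\,w_{12} + q^2 w_{22} - q(p^2 w_{11} + 2pq\,w_{12} + q^2 w_{22})}{\bar{w}}$$

$$= \frac{q(pq\,w_{22} - pq\,w_{12} - p^2 w_{11} + p^2 w_{12})}{\bar{w}}$$

$$= \frac{pq[q(w_{22} - w_{12}) - p(w_{11} - w_{12})]}{\bar{w}} \qquad [2]$$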

Equations [1] and [2] show, in precise terms, how fitness differences between genotypes will lead to evolutionary change. If Δq =0 then no allele frequency change has occurred and the population is in allelic equilibrium. It is worth mentioning that Δq =0 does not mean that no natural selection has occurred. The condition for that is w11=w12=w22. It is possible for natural selection to occur and have no effect on allele frequency.
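Equations [1] and [2] are easy to check numerically. The sketch below is a minimal Python illustration; the function names (`mean_fitness`, `next_q`, `delta_q`) and the fitness values in the demonstration are assumptions chosen for the example, not values taken from the course tables.

```python
def mean_fitness(q, w11, w12, w22):
    """Average fitness: p^2*w11 + 2pq*w12 + q^2*w22, with q the frequency of A2."""
    p = 1.0 - q
    return p * p * w11 + 2.0 * p * q * w12 + q * q * w22


def next_q(q, w11, w12, w22):
    """Equation [1]: frequency of A2 among zygotes of the next generation."""
    p = 1.0 - q
    return (q * q * w22 + p * q * w12) / mean_fitness(q, w11, w12, w22)


def delta_q(q, w11, w12, w22):
    """Equation [2]: change in the frequency of A2 over one generation."""
    p = 1.0 - q
    numerator = p * q * (q * (w22 - w12) - p * (w11 - w12))
    return numerator / mean_fitness(q, w11, w12, w22)


if __name__ == "__main__":
    # Illustrative fitnesses: selection against A2 (w11 > w12 > w22).
    q = 0.3
    print(next_q(q, 1.0, 0.9, 0.4) - q, delta_q(q, 1.0, 0.9, 0.4))
    # Selection acting (fitnesses unequal) but no allele frequency change:
    print(delta_q(1.0 / 3.0, 0.8, 1.0, 0.6))
```

The first line prints the same value twice, confirming that [2] is just [1] minus q. The second call uses unequal fitnesses at an allele frequency where Δq is zero (up to rounding), which illustrates that Δq = 0 does not by itself mean that no selection is occurring.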

Directional selection

If Δq > 0, then natural selection has led the A2 allele to increase in frequency; if Δq < 0, then natural selection has led the A1 allele to increase in frequency. If w11 > w12 > w22, then the A1A1 genotype is fitter than A1A2, which in turn is fitter than A2A2, in which case Δq must be negative (so long as neither p nor q is 0). In each generation, the frequency of the A1 allele will be greater than in the previous generation until it eventually reaches fixation and the A2 allele is eliminated from the population. Once A1 reaches fixation (p = 1 and q = 0), no further evolutionary change will occur. In this case, the A1 allele confers a fitness advantage on the genotypes that carry it, and its relative frequency in the population will increase from generation to generation until it is fixed. The opposite outcome (fixation of A2) occurs when w22 > w12 > w11. Table 4 illustrates a numerical example of directional natural selection. Fig. 2 illustrates allele frequency under Hardy-Weinberg proportions where there is no differential viability, w11 = w12 = w22 = 1.0, and the average fitness is 1.0 from generation to generation. Assuming w22 = 0.4 as in Table 4, the frequency of A1 increases and that of A2 decreases non-linearly until A1 reaches fixation, as illustrated in Fig. 3. Ultimately, the population will be monomorphic for the homozygous genotype with the highest fitness.

Stabilizing selection

An interesting situation arises when the heterozygote is superior in fitness to the two homozygotes. In this case w11 < w12 > w22, and an equilibrium is reached with both alleles present in the population. Since q must be non-negative, a polymorphic equilibrium is possible only when there is heterozygote superiority or inferiority; heterozygote superiority is also known as heterosis. With heterozygote superiority, natural selection maintains heterogeneity and preserves genetic variation. Unlike directional selection, stabilizing or balancing selection tends to keep both alleles in the population, and the allele frequencies converge to a polymorphic equilibrium (Fig. 4).

Disruptive selection Under disruptive selection (w11>w12


Coefficient of selection The speed with which allele or genotype frequency changes, is driven by the relative fitness for each allele or genotype. Fitness (w11, w12 and w22) is a relative value, usually measured in comparison with the most-fit allele/genotype in the population. Selection coefficient, s, measures the reduction in fitness for a selected allele or genotype compared to the most-fit allele/genotype in a population. Selection against an allele may operate either through reduced viability or reduced fertility or reduced mating ability or different combinations of the three. Therefore, allele frequency needs to be deduced from the zygote stage of the parent generation to the zygote stage of the progeny generation. The coefficient of selection measures the proportionate reduction in gametic contribution of a genotype compared to the most-fit genotype. The contribution of the most fit genotype is taken to be 1, and the contribution of the genotype selected against is 1 - s. If the selection coefficient for a genotype is 0.60; the fitness is then 0.4, which means that for every 100 zygotes produced by the most-fit genotype, only 40 are produced by the genotype selected against.

Dominance

To explore the effects of dominance, we can specify the fitnesses using two parameters: s, representing the difference in fitness between the two homozygotes, and h, representing the degree of dominance (which sets the fitness of the heterozygote). Let

w11 = 1
w12 = 1 − hs
w22 = 1 − s

The parameter h together with s determines the fitness of the heterozygote.
a. If h = 0, the heterozygote has fitness 1, the same as the A1A1 homozygote: the A1 allele is completely dominant.
b. Conversely, if h = 1, the fitness of the heterozygote is the same as that of the A2A2 homozygote (1 − s): the A2 allele is completely dominant.
c. If 0 < h < 1, the heterozygote's fitness is somewhere between those of the homozygotes: there is incomplete dominance.
d. If h = ½ exactly, the alleles have additive effects: the heterozygote fitness is the average of the two homozygotes' fitnesses.
e. If h < 0, the heterozygote's fitness is greater than 1, and thus greater than that of the A1A1 homozygote; this is called overdominance.
f. Similarly, if h > 1, the heterozygote has lower fitness than the A2A2 homozygote (and of course also the A1A1 homozygote); this is underdominance.
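The cases a to f can be summarized in a short Python sketch. The function names `fitnesses` and `dominance_class` and the returned labels are illustrative choices, and the sketch assumes s > 0 (selection against A2A2).

```python
def fitnesses(s, h):
    """Return (w11, w12, w22) = (1, 1 - h*s, 1 - s) for selection coefficient s
    and degree of dominance h."""
    return 1.0, 1.0 - h * s, 1.0 - s


def dominance_class(h):
    """Classify the fitness relationship implied by h (assuming s > 0)."""
    if h < 0:
        return "overdominance"            # heterozygote fitter than both homozygotes
    if h == 0:
        return "A1 completely dominant"   # heterozygote as fit as A1A1
    if h == 0.5:
        return "additive"                 # heterozygote exactly intermediate
    if h < 1:
        return "incomplete dominance"
    if h == 1:
        return "A2 completely dominant"   # heterozygote as unfit as A2A2
    return "underdominance"               # heterozygote less fit than both homozygotes


if __name__ == "__main__":
    for h in (-0.5, 0.0, 0.25, 0.5, 1.0, 1.5):
        print(h, dominance_class(h), fitnesses(0.6, h))
```

With s = 0.6 the A2A2 fitness is 0.4, matching the worked figure in the section on the coefficient of selection; varying h then moves the heterozygote from above 1 (h < 0) to below 0.4 (h > 1).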


Table 5 Fitness values for different fitness relationships

                            A1A1    A1A2    A2A2    Description
General fitness             w11     w12     w22
Recessive lethal            1       1       0       No dominance, selection against A2A2
Detrimental allele          1       1       1-s     No dominance, selection against A2
Dominance                   1       1-hs    1-s     Partial dominance of A1, selection against A2
Dominance                   1       1       1-s     Complete dominance of A1, selection against A2
Dominance                   1-s     1-s     1       Complete dominance of A1, selection against A1
Heterozygote advantage      1-s1    1       1-s2    Overdominance, selection against A1A1 & A2A2
Heterozygote disadvantage   1+s1    1       1+s2    Underdominance, selection against A1A2

Lethal alleles

These are alleles that cause an organism to die only when present in the homozygous state. If the mutation is caused by a dominant lethal allele, the heterozygote for the allele will show the lethal phenotype and the dominant homozygote is impossible. If the mutation is caused by a recessive lethal allele, the homozygote for the allele will have the lethal phenotype. Most lethal alleles are recessive. Many lethal alleles prevent cell division and kill the organism at an early age. Some lethal alleles exert their effect later in life, e.g. Huntington disease, which is characterized by progressive degeneration of the nervous system, dementia and early death between 30 and 50 years of age.

Dominant lethal alleles: They modify the Mendelian 3:1 ratio to 2:1. The organisms die before they can produce progeny, so the mutant dominant allele is removed from the population in the same generation in which it arose. Fully dominant lethal alleles kill the carrier in both the homozygous and heterozygous states. Huntington's disease and creeper legs (short and stunted) in chickens are dominant lethals where the homozygote does not survive.

Recessive lethal alleles: A recessive lethal kills the carrier individual only in the homozygous state. Recessive lethals may be of two kinds: (1) one which has no obvious phenotypic effect in the heterozygote, and (2) one which exhibits a distinctive phenotype in the heterozygous state. In many cases, lethal alleles become operative at the onset of sexual maturity. Examples of recessive lethals in cattle are osteopetrosis (Angus and Red Angus) and pulmonary hypoplasia and anasarca (PHA) (Shorthorn). In humans, common examples are cystic fibrosis (poorly functioning Cl ion transport proteins in the lungs), Tay-Sachs disease (an enzyme unable to break down specific membrane lipids), sickle cell anemia and brachydactyly. The relative fitness for a recessive lethal is presented in Table 5.


                       A1A1   A1A2   A2A2   Total
Initial frequency      p²     2pq    q²     1
Fitness                1      1      0
Gametic contribution   p²     2pq    0      w̄ = p(1 + q)

From Equation [1],

$$q_1 = \frac{q^2 w_{22} + pq\,w_{12}}{\bar{w}}$$

The average fitness under a recessive lethal (w12 = 1, w22 = 0) is $\bar{w} = p^2 + 2pq = p(1 + q)$. Therefore,

$$q_1 = \frac{pq}{p(1 + q)} = \frac{q}{1 + q} \qquad [3]$$

$$\Delta q = q_1 - q_0 = \frac{q_0}{1 + q_0} - q_0 = -\frac{q_0^2}{1 + q_0}$$

The mean fitness reaches 1 when the population is fixed for A1. Equation [3] is a recursive relationship: the allele frequency at any time t + 1 is a function of the frequency at time t, or

$$q_{t+1} = \frac{q_t}{1 + q_t}, \qquad q_2 = \frac{q_1}{1 + q_1}$$

When we substitute the value of q1 from equation [3] into this expression, it becomes:

$$q_2 = \frac{q_0}{1 + 2q_0}$$

This relationship can be generalized to give the frequency in generation t as a function of the frequency at generation 0:

$$q_t = \frac{q_0}{1 + t q_0}$$

Since there are no surviving recessive homozygotes, the maximum possible allele frequency is 0.5, reached when all individuals are heterozygotes. Fig. 6 demonstrates the expected decline in the frequency of a recessive lethal allele from two starting frequencies. When the allele frequency is high, it is reduced very quickly.
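The closed-form result for a recessive lethal can be compared with repeated application of the one-generation recurrence. The following Python sketch does this; the function names `lethal_q_closed_form` and `lethal_q_iterated` and the starting frequency are assumptions made for the illustration.

```python
def lethal_q_closed_form(q0, t):
    """Frequency of a recessive lethal after t generations: q0 / (1 + t*q0)."""
    return q0 / (1.0 + t * q0)


def lethal_q_iterated(q0, t):
    """Same quantity obtained by applying q -> q / (1 + q) t times."""
    q = q0
    for _ in range(t):
        q = q / (1.0 + q)
    return q


if __name__ == "__main__":
    q0 = 0.5   # illustrative starting frequency (all individuals heterozygous)
    for t in (0, 1, 2, 5, 10, 50, 100):
        print(t, lethal_q_closed_form(q0, t), lethal_q_iterated(q0, t))
```

The two columns agree, and the output shows the pattern described above: the decline is fast while q is high and becomes progressively slower as the allele gets rare.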

High throughput data has delineated lethal haplotypes. This in theory would allow us to identify carrier animals and avoid mating them. That would eliminate recessive lethal alleles faster than elimination from natural selection.


Selection against recessives

                       A1A1   A1A2   A2A2       Total
Initial frequency      p²     2pq    q²         1
Fitness                1      1      1-s
Gametic contribution   p²     2pq    q²(1-s)    w̄ = 1 − sq²

From Equation [1],

$$q_1 = \frac{q^2 w_{22} + pq\,w_{12}}{\bar{w}}$$

When selecting against recessives, w12 = 1, w22 = 1 − s, and the average fitness is 1 − sq². Therefore, q1 can be written as:

$$q_1 = \frac{q^2(1 - s) + pq}{1 - sq^2} = \frac{q(1 - sq)}{1 - sq^2}$$

The change in frequency of A2 is therefore given as:

$$\Delta q = -\frac{sq^2(1 - q)}{1 - sq^2}$$

Both the average fitness and the change in allele frequency are functions of the allele frequency and the selection coefficient. Selection against recessive alleles is very efficient at first, but becomes progressively slower because, as the allele frequency decreases, a sizeable proportion of the remaining recessive alleles is carried by heterozygotes. Therefore, natural selection alone cannot entirely eliminate the recessive allele even if it is lethal.
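To see how selection against a recessive slows down as the allele becomes rare, the expressions for q1 and Δq can be evaluated directly. This is a minimal Python sketch; the function names and the values of s and q are illustrative assumptions.

```python
def recessive_next_q(q, s):
    """Next-generation frequency of A2 with fitnesses 1, 1, 1 - s."""
    return q * (1.0 - s * q) / (1.0 - s * q * q)


def recessive_delta_q(q, s):
    """One-generation change in q under selection against the recessive."""
    return -s * q * q * (1.0 - q) / (1.0 - s * q * q)


if __name__ == "__main__":
    s = 0.6   # illustrative selection coefficient
    for q in (0.5, 0.2, 0.1, 0.05, 0.01):
        print(q, recessive_delta_q(q, s), recessive_next_q(q, s) - q)
```

The two printed changes agree for every q, and their magnitude shrinks roughly with q² once q is small. Setting s = 1 recovers the recessive lethal case, q1 = q/(1 + q).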


More than one locus – Linkage and Under random mating alleles at all autosomal loci combine at random to form genotypes to attain equilibrium under Hardy-Weinberg law. The basic assumption here is that transmission of alleles at a given locus across generations is independent of alleles at another locus. We also assume that fitness of genotypes at one locus is not affected by genotypes at another locus. For several loci, these assumptions would likely be violated.

Let's consider an A locus with two alleles A1 and A2 at frequencies $p_A$ and $q_A$, and a B locus also with two alleles B1 and B2 at frequencies $p_B$ and $q_B$, respectively. Under Hardy-Weinberg proportions, $p_A + q_A = 1$ and $p_B + q_B = 1$, and the expected genotypic frequencies are $p_A^2 + 2p_Aq_A + q_A^2$ and $p_B^2 + 2p_Bq_B + q_B^2$, respectively. Alleles at the A locus may combine at random or in a non-random way with alleles at the B locus.

Random association of alleles showing expected gametic frequency under equilibrium

Allele (frequency)   A1 (p_A)          A2 (q_A)
B1 (p_B)             A1B1  p_A p_B     A2B1  p_B q_A
B2 (q_B)             A1B2  p_A q_B     A2B2  q_A q_B

Let's use some classical notation to represent the actual gametic frequencies. Let r, s, t and u represent the actual or observed gametic frequencies of A1B1, A1B2, A2B1 and A2B2, respectively, with r + s + t + u = 1. Under random association of alleles, each gametic frequency equals the product of its allele frequencies ($r = p_Ap_B$, $s = p_Aq_B$, $t = q_Ap_B$, $u = q_Aq_B$), so that ru = st. The state of random gametic association between alleles of different genes is called LINKAGE EQUILIBRIUM. If two loci are in linkage equilibrium, it means that they are inherited completely independently in each generation. An example would be loci that are on two different chromosomes and encode unrelated, non-interacting proteins. Under random mating and the other assumptions of Hardy-Weinberg equilibrium, linkage equilibrium between loci is attainable. However, unlike the single-locus case, the attainment of gametic or linkage equilibrium depends on the rate of recombination in genotypes heterozygous at both loci. There are two types of double heterozygotes:

$$\frac{A_1B_1}{A_2B_2} \quad \text{coupling heterozygote}$$

$$\frac{A_1B_2}{A_2B_1} \quad \text{repulsive heterozygote}$$

Gamete   Expected frequency   Observed frequency   Type
A1B1     p_A p_B              r                    Coupling
A1B2     p_A q_B              s                    Repulsive
A2B1     p_B q_A              t                    Repulsive
A2B2     q_A q_B              u                    Coupling

The observed gametic frequency differs from the expected gametic frequency by an amount D. We measure the non-randomness of the gametic frequencies by means of deviation from two loci equilibrium. D is the gametic disequilibrium coefficient. Gametic disequilibrium is often referred to as linkage disequilibrium. This may be confusing because genes or loci need not be linked to be in gametic disequilibrium. The gametic disequilibrium coefficient, D is similar to the effect of inbreeding on genotypic frequencies at a single locus. The Heterozygote deficit interpretation of inbreeding coefficient, F, has been called a “one-locus disequilibrium” coefficient.

$$r = p_Ap_B + D$$

$$s = p_Aq_B - D$$

$$t = q_Ap_B - D$$

$$u = q_Aq_B + D$$

The most common expression of D is:

$$D = ru - st$$

D is therefore the difference between the products of the coupling and the repulsive gametic frequencies.

$$D = (p_Ap_B + D)(q_Aq_B + D) - (p_Aq_B - D)(q_Ap_B - D)$$

[You can work on the proof in your spare time.] If two genes are in linkage disequilibrium, it means that certain alleles of each gene are inherited together more often than would be expected by chance. This may be due to actual genetic linkage, i.e. the genes are closely located on the same chromosome. Or it could be due to some form of functional interaction where some combinations of alleles at the two loci affect the viability of potential offspring. It should be noted that an observed non-random association of alleles or genotypes need not be caused by their chromosomal location. Any of the evolutionary forces (mutation, random genetic drift, selection and gene flow) can, at least temporarily, cause such associations.
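The two ways of writing D, as ru − st and as the deviation of r from its expected value, can be compared numerically before attempting the proof. The Python sketch below does this; the function name `gametic_disequilibrium` and the gametic frequencies used are hypothetical.

```python
def gametic_disequilibrium(r, s, t, u):
    """Return (p_A, p_B, D from ru - st, D from r - p_A*p_B).

    r, s, t, u are the observed frequencies of gametes A1B1, A1B2, A2B1, A2B2
    and must sum to 1.
    """
    p_a = r + s          # A1 is carried by A1B1 and A1B2 gametes
    p_b = r + t          # B1 is carried by A1B1 and A2B1 gametes
    d_from_products = r * u - s * t
    d_from_deviation = r - p_a * p_b
    return p_a, p_b, d_from_products, d_from_deviation


if __name__ == "__main__":
    print(gametic_disequilibrium(0.4, 0.1, 0.2, 0.3))   # illustrative frequencies
    print(gametic_disequilibrium(0.5, 0.0, 0.0, 0.5))   # only coupling gametes
    print(gametic_disequilibrium(0.0, 0.5, 0.5, 0.0))   # only repulsive gametes
```

Both expressions return the same D for any set of gametic frequencies that sums to 1, which is the identity the proof establishes.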

Recombination

Let's consider the gametes produced by the coupling double heterozygote A1B1/A2B2. They are of four types:

Type 1: A1B1 non-recombinant with frequency (1-c)/2
Type 2: A1B2 recombinant with frequency c/2
Type 3: A2B1 recombinant with frequency c/2
Type 4: A2B2 non-recombinant with frequency (1-c)/2

Gametic types 1 and 4 are called non-recombinants because the alleles are associated in the same manner as in the previous generation. Gametic types 2 and 3 are known as recombinants because the alleles are associated differently than in the previous generation. As a result of Mendelian segregation, f(A1B1) = f(A2B2) and f(A1B2) = f(A2B1). However, f(A1B2) + f(A2B1) does not have to be equal to f(A1B1) + f(A2B2). The proportion of recombinant gametes produced by the double heterozygote is called the recombination fraction, c, and the proportion of non-recombinant gametes is 1 − c. The recombination fraction between genes depends on whether they are on the same chromosome, and also on the physical distance between them. During meiosis, the four chromatids (of two genes) align. The two inner chromatids can undergo breakage and exchange of parts (recombination) between the two chromatids. Thus, only 50% (0.5) of the chromatids can undergo recombination.

Therefore, the maximum recombination rate is cmax = 0.5. For genes on different chromosomes or far apart on the same chromosome, the recombination fraction is c = 0.5, as the four gametic types are produced in equal frequency. Genes that have c < 0.5 must necessarily be on the same chromosome, and such genes are said to be linked. When c = 0, the two genes are so close to each other that a break between them almost never happens, and they are transmitted together as "one super gene".

Gametic disequilibrium and frequency of gamete change over time

The gametic disequilibrium changes from one generation to the next. Let the frequencies of A1B1, A1B2, A2B1 and A2B2 be r, s, t and u, respectively. Now, let's construct the gametic frequencies of the offspring.

Proportion among gametes

Genotype      A1B1       A1B2       A2B1       A2B2
A1B1/A1B1     1          0          0          0
A1B1/A1B2     ½          ½          0          0
A1B1/A2B1     ½          0          ½          0
A1B1/A2B2     ½(1-c)     ½c         ½c         ½(1-c)
A1B2/A1B2     0          1          0          0
A1B2/A2B1     ½c         ½(1-c)     ½(1-c)     ½c
A1B2/A2B2     0          ½          0          ½
A2B1/A2B1     0          0          1          0
A2B1/A2B2     0          0          ½          ½
A2B2/A2B2     0          0          0          1

There are ten different two-locus genotypes, therefore a full mating table would take 100 rows. Assuming Hardy-Weinberg equilibrium, we can calculate the frequency with which any one genotype will produce a particular gamete.

Genotypes and the frequencies of their progeny gametes

Genotype     Frequency   A1B1            A1B2            A2B1            A2B2
A1B1/A1B1    r²          r²
A1B1/A1B2    2rs         rs              rs
A1B1/A2B1    2rt         rt                              rt
A1B1/A2B2    2ru         (1-c)ru         (c)ru           (c)ru           (1-c)ru
A1B2/A1B2    s²                          s²
A1B2/A2B1    2st         (c)st           (1-c)st         (1-c)st         (c)st
A1B2/A2B2    2su                         su                              su
A2B1/A2B1    t²                                          t²
A2B1/A2B2    2tu                                         tu              tu
A2B2/A2B2    u²                                                          u²
Total        1           r' = r − cD0    s' = s + cD0    t' = t + cD0    u' = u − cD0


The frequencies of the four gametes after one generation of random mating are:

$$r' = r - cD_0, \qquad s' = s + cD_0, \qquad t' = t + cD_0, \qquad u' = u - cD_0$$

where D0 is the LD in the preceding generation. The disequilibrium after one generation is:

$$D_1 = r'u' - s't' = (r - cD_0)(u - cD_0) - (s + cD_0)(t + cD_0) = D_0(1 - c)$$

This recursive relationship leads to the general relationship:

$$D_t = D_0(1 - c)^t$$

where Dt is the value of D at generation t. The LD decays each generation at a rate determined by the degree of recombination. The maximum value of D (+0.25) occurs when there are only coupling gametes (r = u = 0.5). The minimum value of D (−0.25) occurs when there are only repulsive gametes (s = t = 0.5). Thus, the value of D varies from −0.25 to +0.25. If there is free recombination between two loci (either on different chromosomes or far apart on the same chromosome, where c = ½), D would be essentially eliminated in about 7 generations (starting from D0 = 0.25, D7 = 0.00195). However, if c is much less than 0.5, e.g. 0.05, the decay in disequilibrium will take a substantial period of time.

A major problem with D is that its maximum value changes as a function of the allele frequencies at the two loci. As a result, standardizing D to its maximum possible value was proposed by Lewontin (1964):

$$D' = \frac{D}{D_{max}}$$

Dmax is equal to the lesser of $p_Aq_B$ or $p_Bq_A$ if D is positive, or the lesser of $p_Ap_B$ or $q_Aq_B$ if D is negative. D′ varies between −1 and 1 regardless of the allele frequencies at the two loci, and it provides a metric for comparing the LD to the maximum possible value it can take.

To determine how long it takes for D to decay to a given value D*, the recursive equation for Dt can be solved for the number of generations, t:

$$t = \frac{\ln(D^*/D_0)}{\ln(1 - c)}$$

When c = 0.1, it takes 6.58 and 21.85 generations for half and 90% of the LD, respectively, to disappear; for c = 0.05, it takes 13.51 and 44.89 generations, respectively.
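The decay of LD and the number of generations needed to reach a given fraction of the starting value follow directly from the relationship between Dt, D0 and c. The Python sketch below evaluates both; the function names `ld_after` and `generations_to_reach` are hypothetical, and the second function takes the ratio D*/D0 (the fraction of LD remaining) as its argument.

```python
from math import log


def ld_after(d0, c, t):
    """Disequilibrium after t generations: D_t = D_0 * (1 - c)**t."""
    return d0 * (1.0 - c) ** t


def generations_to_reach(fraction_remaining, c):
    """Generations needed for D to fall to fraction_remaining * D_0.

    Solves (1 - c)**t = fraction_remaining for t; requires 0 < c < 1.
    """
    return log(fraction_remaining) / log(1.0 - c)


if __name__ == "__main__":
    print("D7 with free recombination:", ld_after(0.25, 0.5, 7))
    for c in (0.1, 0.05):
        half = generations_to_reach(0.5, c)      # half of the LD has disappeared
        ninety = generations_to_reach(0.1, c)    # 90% of the LD has disappeared
        print("c =", c, "half:", round(half, 2), "90%:", round(ninety, 2))
```

Because the result is a number of generations rather than calendar time, converting it to years requires multiplying by the generation interval of the species.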


The correlation coefficient, r, is also used as a measure of LD:

$$r^2 = \frac{D^2}{p_A p_B q_A q_B}$$

where r is the square root of the expression above (not to be confused with the gametic frequency r used earlier). When the allele frequencies are the same at both loci, r² ranges from 0 to 1. When the allele frequencies differ between the two loci, the maximum attainable values of r² and r are somewhat smaller. The value of the chi-square, X², is numerically equal to r²N, where N is the total number of chromosomes examined. The biological meaning of r is that it is the correlation between alleles present on the same chromosome.

APPLICATION Originally the definition of LD was in terms of gametic frequencies because that allowed for the possibility that the loci are on different chromosomes. However, the usual application now is to loci on the same chromosome. In that case, the allele pair AB is a haplotype, and 푝퐴퐵 is the observed haplotype frequency. 퐷퐴퐵 is estimated from the allele and haplotype frequencies in the sample.

$$D_{AB} = P_{AB} - P_A P_B$$

The quantity $D_{AB}$ is the coefficient of linkage disequilibrium defined for a specific pair of alleles, A and B, and does not depend on how many other alleles are at the two loci. Each pair of alleles has its own D. The values for different pairs of alleles are constrained by the fact that the allele frequencies at both loci and the haplotype frequencies have to add up to 1. If both loci have two alleles, e.g. SNPs, the constraint is strong enough that one value of D is needed to characterize LD between those loci, and $D_{AB} = -D_{Ab} = -D_{aB} = D_{ab}$, where a and b are the other alleles. In this case, D is used without a subscript. The sign of D is arbitrary and depends on which pair of alleles one starts with.

Higher-order disequilibria: The disequilibria can be considered for alleles at three or more loci. For alleles at three loci (A, B, and C) the third-order coefficient is:

$$D_{ABC} = P_{ABC} - P_A D_{BC} - P_B D_{AC} - P_C D_{AB} - P_A P_B P_C$$

where $D_{AB}$, $D_{BC}$ and $D_{AC}$ are pairwise disequilibrium coefficients. $D_{ABC}$ can be viewed as analogous to the three-way interaction term in an analysis of variance and can be interpreted as the non-independence among these alleles that is not accounted for by the pairwise coefficients.

Another measure is $\partial_A$, defined to be:

$$\partial_A = p_A + \frac{D}{p_B}$$

It is the conditional probability that a chromosome carries an A allele, given that it carries a B allele. It is useful for characterizing the extent to which a particular allele is associated with a genetic disease.

Estimating and testing significance of Linkage Disequilibrium
For most populations the only information available is the frequency distribution of multi-locus genotypes. The gametic composition of most zygotes can be resolved from the genotype (e.g. an A1A2B1B1 individual must come from A1B1 and A2B1 gametes). Double heterozygotes, however, can come from the union of A1B1 and A2B2 gametes or of A1B2 and A2B1 gametes, and cannot be resolved definitively. Assuming random mating, it is not necessary to discriminate between coupling and repulsion heterozygotes. In this case, the unbiased estimator of D is given by

$$\hat{D}_{A_1B_1} = \frac{N}{N-1}\left[\frac{4N_{A_1A_1B_1B_1} + 2\left(N_{A_1A_1B_1B_2} + N_{A_1A_2B_1B_1}\right) + N_{A_1A_2B_1B_2}}{2N} - 2\hat{p}_{A_1}\hat{p}_{B_1}\right]$$

where N is the total sample size, the terms in the numerator are the observed numbers of the four genotypes, and $\hat{p}_{A_1}$ and $\hat{p}_{B_1}$ are estimates of allele frequency.

Examples of LD

         B1B1   B1B2   B2B2   Total
A1A1       40     60     28     128
A1A2       10     48     36      94
A2A2        4     14     26      44
Total      54    122     90     266

A locus                               B locus
A1A1  PA = 128/266 = 0.4812           B1B1  PB = 54/266 = 0.2030
A1A2  HA = 94/266 = 0.3534            B1B2  HB = 122/266 = 0.4586
A2A2  QA = 44/266 = 0.1654            B2B2  QB = 90/266 = 0.3383
pA = 0.4812 + ½(0.3534) = 0.6579      pB = 0.2030 + ½(0.4586) = 0.4323
qA = 0.1654 + ½(0.3534) = 0.3421      qB = 0.3383 + ½(0.4586) = 0.5677


$$\hat{D} = \frac{266}{266-1}\left[\frac{4(40) + 2(60 + 10) + 48}{2(266)} - 2(0.6579)(0.4323)\right] = 0.0856$$

What does this mean? Since $\hat{D}$ is positive, the maximum value of D is the lesser of $q_A p_B$ and $p_A q_B$. Since $q_A p_B$ = 0.3421 × 0.4323 = 0.1479 and $p_A q_B$ = 0.6579 × 0.5677 = 0.3735, we choose the former. Therefore,

$$D' = \frac{\hat{D}}{D_{max}} = \frac{0.0856}{0.1479} = 0.5790$$

This tells us that $\hat{D}$ is about 57.90% of its maximum value. With a given recombination rate, c, the value of D will change over time.

$$r^2 = \frac{\hat{D}^2}{p_A p_B q_A q_B} = \frac{0.0856^2}{0.6579 \times 0.4323 \times 0.3421 \times 0.5677} = 0.1327$$

$$X^2 = r^2 N = 0.1327 \times 266 = 35.2868$$

There are 4 chromosomal types, and since we estimated two allele frequencies from the data, the degrees of freedom = 4 - 1 - 2 = 1. Since 35.2868 is greater than the tabulated $X^2$ value at p = 0.05 and 1 df (= 3.84), we can conclude that the gametic types are not in linkage equilibrium.
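The worked example above can be retraced with a short script. This is only a sketch under stated assumptions: the function name `ld_from_genotypes` and the 3 x 3 list layout are illustrative choices, not part of the course material, and the chi-square is computed as r² times the number of individuals, as in the worked example.

```python
def ld_from_genotypes(counts):
    """Estimate LD statistics from a 3x3 table of two-locus genotype counts.

    Rows are A1A1, A1A2, A2A2; columns are B1B1, B1B2, B2B2
    (layout assumed for this sketch).
    """
    n = sum(sum(row) for row in counts)
    row_tot = [sum(row) for row in counts]
    col_tot = [sum(col) for col in zip(*counts)]
    # Allele frequencies: p = P + 1/2 H
    p_a = (row_tot[0] + 0.5 * row_tot[1]) / n
    p_b = (col_tot[0] + 0.5 * col_tot[1]) / n
    q_a, q_b = 1 - p_a, 1 - p_b
    # Weighted count of genotypes carrying A1 and B1 (numerator of the estimator)
    weighted = 4 * counts[0][0] + 2 * (counts[0][1] + counts[1][0]) + counts[1][1]
    d = n / (n - 1) * (weighted / (2 * n) - 2 * p_a * p_b)
    # D_max depends on the sign of D
    if d > 0:
        d_max = min(p_a * q_b, q_a * p_b)
    else:
        d_max = min(p_a * p_b, q_a * q_b)
    r2 = d ** 2 / (p_a * q_a * p_b * q_b)
    return {"D": d, "D_prime": d / d_max, "r2": r2, "chi2": r2 * n}


example = [[40, 60, 28], [10, 48, 36], [4, 14, 26]]
for name, value in ld_from_genotypes(example).items():
    print(name, round(value, 4))
```

Small differences from the hand calculation in the last decimal place are expected, because the script does not round the intermediate allele frequencies.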

LD with SNP data
Without considering the distance between two polymorphic SNPs, let’s visualize the following on bovine chromosome 1:

SNP1              SNP2
AGGT CCT…………..GATT CAA
AGGT CCT…………..GATT CAA

SNP1                           SNP2
Allele  Allele  Frequency      Allele  Allele  Frequency
1       G       pA             1       A       pB
2       C       qA             2       T       qB


Combination of SNPs into haplotypes

                 SNP2 allele
                 A      T
SNP1 allele  G   GA     GT
             C   CA     CT

Haplotype   Expected frequency   Observed frequency
GA          pApB                 r = pApB + D
GT          pAqB                 s = pAqB - D
CA          qApB                 t = qApB - D
CT          qAqB                 u = qAqB + D

Let’s consider some SNP data from 1,000 bulls: GA = 280; GT = 300; CA = 75; CT = 345

Haplotype  Number  Observed frequency   Allele  Allele frequency   Haplotype  Expected frequency
GA         280     r = 0.2800           G       pA = 0.580         GA         0.58 × 0.355 = 0.2059
GT         300     s = 0.3000           C       qA = 0.420         GT         0.58 × 0.645 = 0.3741
CA          75     t = 0.0750           A       pB = 0.355         CA         0.42 × 0.355 = 0.1491
CT         345     u = 0.3450           T       qB = 0.645         CT         0.42 × 0.645 = 0.2709

$$D_0 = ru - st = (0.28 \times 0.345) - (0.30 \times 0.075) = 0.0741$$

Alternatively, $D_{GA}$ can also be calculated as:

$$D_{GA} = r - p(G) \times p(A) = 0.2800 - 0.2059 = 0.0741$$
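As a cross-check, the two routes to D for the bull SNP data can be computed side by side. This is a minimal sketch; the function name `haplotype_ld` and its argument order (GA, GT, CA, CT) are assumptions made for illustration.

```python
def haplotype_ld(n_ga, n_gt, n_ca, n_ct):
    """Return (pA, pB, D from ru - st, D from r - pA*pB) for haplotype counts."""
    n = n_ga + n_gt + n_ca + n_ct
    r, s, t, u = n_ga / n, n_gt / n, n_ca / n, n_ct / n
    p_a = r + s  # frequency of allele G at SNP1
    p_b = r + t  # frequency of allele A at SNP2
    d_cross = r * u - s * t      # D = ru - st
    d_direct = r - p_a * p_b     # D = observed - expected frequency of GA
    return p_a, p_b, d_cross, d_direct


p_a, p_b, d_cross, d_direct = haplotype_ld(280, 300, 75, 345)
print(round(p_a, 3), round(p_b, 3), round(d_cross, 4), round(d_direct, 4))
```

Both expressions are algebraically identical because r + s + t + u = 1, so either can be used.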


The gametic frequencies in a population of 1,000 chickens for the naked neck (Na/na) and dominant I (I/i) loci are as follows:

Gamete   Frequency
Na-I     0.180   (r)
Na-i     0.707   (s)
na-I     0.061   (t)
na-i     0.052   (u)

Expected allele frequency

f(Na) = f(Na-I) + f(Na-i) = 0.180 + 0.707 = 0.887 = $p_A$
f(na) = f(na-I) + f(na-i) = 0.061 + 0.052 = 0.113 = $q_A$
f(Na) + f(na) = 0.887 + 0.113 = 1.000

f(I) = f(Na-I) + f(na-I) = 0.180 + 0.061 = 0.241 = $p_B$
f(i) = f(Na-i) + f(na-i) = 0.707 + 0.052 = 0.759 = $q_B$
f(I) + f(i) = 0.241 + 0.759 = 1.000

Expected gametic frequencies under linkage (gametic) equilibrium
f(Na-I) = f(Na) x f(I) = 0.887 x 0.241 = 0.2138
f(Na-i) = f(Na) x f(i) = 0.887 x 0.759 = 0.6732
f(na-I) = f(na) x f(I) = 0.113 x 0.241 = 0.0272
f(na-i) = f(na) x f(i) = 0.113 x 0.759 = 0.0858

$$D_0 = ru - st = (0.180 \times 0.052) - (0.707 \times 0.061) = -0.0338$$

Observed frequency = Expected frequency + $D_0$

Observed frequency of Na-I = [f(Na) x f(I)] + D0 = 0.2138 – 0.0338 = 0.1800


The decay in LD under different recombination rates is shown in Fig 7. When there is no linkage (c = ½), LD is almost zero by generation 7. However, it takes much longer for LD to decay when the recombination rate is closer to 0.

Since $\hat{D}$ is negative, the maximum value of D is the lesser of $p_A p_B$ and $q_A q_B$. Since $p_A p_B$ = f(Na) x f(I) = 0.887 x 0.241 = 0.2138 and $q_A q_B$ = f(na) x f(i) = 0.113 x 0.759 = 0.0858, we choose the latter. Therefore,

$$D' = \frac{\hat{D}}{D_{max}} = \frac{-0.03377}{0.08577} = -0.3937$$

This tells us that $\hat{D}$ is about 39.37% of its maximum (absolute) value.

The observed frequency at generation t = Expected frequency at t = 0 + $D_t$, where

$$D_t = D_0 (1 - c)^t$$

and c is the recombination rate. Assuming c = 0.1, at generation 2, $D_2$ = -0.0274. The observed frequency of Na-I will be 0.2138 - 0.0274 = 0.1864.
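The decay of D over generations can be tabulated with a few lines of code. This is a sketch; the helper name `ld_decay` is hypothetical, and the inputs are the values from the naked neck example.

```python
def ld_decay(d0, c, generations):
    """Return [D_0, D_1, ..., D_t] using D_t = D_0 * (1 - c)**t."""
    return [d0 * (1 - c) ** t for t in range(generations + 1)]


d0 = -0.0338          # D_0 from the naked neck / dominant I example
expected_na_i = 0.2138  # expected frequency of Na-I at equilibrium
for t, d_t in enumerate(ld_decay(d0, c=0.1, generations=5)):
    print(t, round(d_t, 4), round(expected_na_i + d_t, 4))
```

With c = ½ the same function shows why unlinked loci approach equilibrium quickly: after 7 generations less than 1% of the initial disequilibrium remains.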

Now we can test whether $D_0$ is significantly different from zero using the chi-square test.
Null hypothesis: The observed gametic frequencies do not deviate from the expected gametic frequencies.
Since $X^2$ must be computed from counts rather than from frequencies or fractions, we have to use the observed and expected numbers.

$$X^2 = \frac{(180 - 213.8)^2}{213.8} + \frac{(707 - 673.2)^2}{673.2} + \frac{(61 - 27.2)^2}{27.2} + \frac{(52 - 85.8)^2}{85.8} = 62.3571$$

Degrees of freedom = 4 - 1 - 1 (for estimating f(Na) from the data) - 1 (for estimating f(I) from the data) = 1. The tabulated $X^2$ at 1 df and p = 0.05 is 3.84. We can reject the null hypothesis and conclude that the observed gametic frequencies are not in linkage equilibrium, i.e. the two loci are in linkage disequilibrium.
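The chi-square test above can be sketched in code as follows. The function name `gametic_chi_square` and the argument order are assumptions for illustration. Because the script does not round the expected counts to one decimal place, its result is close to, but not exactly, the hand-calculated value.

```python
def gametic_chi_square(n_ab_11, n_ab_12, n_ab_21, n_ab_22):
    """Chi-square for linkage equilibrium from four gametic counts.

    Arguments are counts of gametes A-B, A-b, a-B, a-b (order assumed here).
    """
    observed = [n_ab_11, n_ab_12, n_ab_21, n_ab_22]
    n = sum(observed)
    p_a = (n_ab_11 + n_ab_12) / n   # frequency of allele A
    p_b = (n_ab_11 + n_ab_21) / n   # frequency of allele B
    expected = [
        n * p_a * p_b,
        n * p_a * (1 - p_b),
        n * (1 - p_a) * p_b,
        n * (1 - p_a) * (1 - p_b),
    ]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


chi2 = gametic_chi_square(180, 707, 61, 52)   # Na-I, Na-i, na-I, na-i
print(round(chi2, 2), chi2 > 3.84)            # compare with 3.84 (1 df, p = 0.05)
```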

Population genetics of LD
Linkage disequilibrium is affected by the following:
Selection (both natural and artificial)
Genetic drift
Population subdivision and bottlenecks
Inbreeding, inversion and

Applications of LD
Mutation, gene mapping, QTL studies, breeding value estimation
Detecting natural selection


Population structure and Gene flow
So far we have assumed that a population is ‘homogeneous’, and that the characteristics of subpopulations sampled from the population would be identical. This assumption may not be true. The distribution of individuals and the gene (allele) flow connections between different subpopulations can be important in . By population structure we mean that, instead of being a single, simple population, the population may have substructure, i.e., differences in among the subpopulations due to different evolutionary reasons (genetic drift, nonrandom mating, selection, etc.). The overall population of subpopulations is referred to as the total population (T). The individual components of the total population are referred to as subpopulations (S), local populations or demes. In many real populations, there may not be obvious structure, and the population is continuous. However, even in effectively continuous populations, different areas or regions can have different allele frequencies because mating in the total population is usually nonrandom. In humans, even within a country that shares the same language, there are often regional language differences that suggest substructure, but it is difficult to find the exact boundary where the changeover occurs. Such a population is structured, but continuous in space. Population structure can therefore be said to exist when the population, taken as a whole, deviates from Hardy-Weinberg proportions because its subpopulations differ in allele frequency.

Reduction in heterozygosity is one of the major consequences of population substructure. The deviation from the expected heterozygote frequency in a population is measured by the inbreeding coefficient, F, which compares the observed heterozygote frequency with the heterozygote frequency expected under Hardy-Weinberg equilibrium. The heterozygosity ($H_E$) under equilibrium is the frequency of the heterozygotes (2pq). With inbreeding, $H_E$ is reduced by a factor $1 - F$. Therefore, the observed frequency of heterozygotes ($H_0$) becomes $2pq(1 - F)$.

$$F = \frac{H_E - H_0}{H_E} = 1 - \frac{H_0}{H_E}$$

The reduction in heterozygote frequency implies an increase in the frequency of homozygotes. The reduction in heterozygote frequency is divided equally between the two homozygotes. The change in heterozygote frequency is given as

퐻퐸 − 퐻0 = 2푝푞 − 2푝푞(1 − 퐹) = 2푝푞 − [2푝푞 − 2푝푞퐹] = 2푝푞퐹


This implies that the two homozygotes each have their frequency increased by $\frac{2pqF}{2} = pqF$. The reason why the lost heterozygotes are divided equally between the two homozygotes is that each heterozygous genotype carries one copy of each of the two alleles.

The observed and expected genotypic frequencies are therefore given as:

Expected genotypic frequency under inbreeding
                              A1A1          A1A2            A2A2
Expected genotype frequency   p²            2pq             q²
Observed genotype frequency   p² + pqF      2pq(1 - F)      q² + pqF

If a gene has multiple alleles, 퐴1, 퐴2, … 퐴푛 with respective frequencies 푝1, 푝2, … , 푝푛 where 푝1 + 푝2 + ⋯ + 푝푛 = 1, with inbreeding coefficient, F, then

$$\text{frequency of } A_iA_i = p_i^2(1 - F) + p_i F$$
$$\text{frequency of } A_iA_j = 2p_ip_j(1 - F)$$
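The multi-allele expressions can be checked numerically. The sketch below assumes a hypothetical helper `inbred_genotype_freqs` that takes a list of allele frequencies and F; the example values (p = 0.7, q = 0.3, F = 2/7) are chosen only for illustration.

```python
def inbred_genotype_freqs(allele_freqs, f):
    """Genotype frequencies under inbreeding coefficient f.

    Returns a dict keyed by (i, j) with i <= j, where i and j index alleles.
    """
    freqs = {}
    k = len(allele_freqs)
    for i in range(k):
        for j in range(i, k):
            p_i, p_j = allele_freqs[i], allele_freqs[j]
            if i == j:
                # homozygote: p_i^2 (1 - F) + p_i F
                freqs[(i, j)] = p_i * p_i * (1 - f) + p_i * f
            else:
                # heterozygote: 2 p_i p_j (1 - F)
                freqs[(i, j)] = 2 * p_i * p_j * (1 - f)
    return freqs


two_alleles = inbred_genotype_freqs([0.7, 0.3], f=2 / 7)
for genotype, freq in two_alleles.items():
    print(genotype, round(freq, 4))
```

For two alleles, $p^2(1-F) + pF$ simplifies to $p^2 + pqF$, so this agrees with the two-allele table, and the genotype frequencies always sum to 1.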

F coefficients
If individuals mate within subpopulations, they are more likely to mate with related individuals than if they mated randomly over the entire population. Wright provided an approach to partitioning the genetic variation in subpopulations that gives a clear description of differentiation. If $H_T$ and $H_S$ are the heterozygosity in the total population and the average heterozygosity of the subpopulations, respectively, Wright’s fixation index, $F_{ST}$, measures the average change in heterozygosity in subpopulations relative to the total heterozygosity:

$$F_{ST} = \frac{H_T - H_S}{H_T} = 1 - \frac{H_S}{H_T}$$

If individuals mate at random within the whole population, then $H_T = 2pq$. On the other hand, if there is spatial structure and individuals mate within subpopulations, then the frequency of heterozygotes will depend on the allele frequency in each subpopulation, $H_i = 2p_iq_i$ for subpopulation i. If there are a total of k subpopulations, with $N_i$ individuals in subpopulation i and $N = \sum N_i$, then $H_S$ is their weighted average:

$$H_S = \frac{1}{N}\sum_{i=1}^{k} N_i \, 2p_iq_i$$


Within each subpopulation, there can be a deviation from the heterozygote frequency expected within that subpopulation. Using the same logic,

$$F_{IS} = \frac{H_S - H_I}{H_S} = 1 - \frac{H_I}{H_S}$$

where $F_{IS}$ is a measure of the deviation from Hardy-Weinberg proportions of expected heterozygotes within subpopulations. Similarly, $F_{IT}$ measures the deviation from Hardy-Weinberg proportions of expected heterozygotes within the whole population.

$$F_{IT} = \frac{H_T - H_I}{H_T} = 1 - \frac{H_I}{H_T}$$

The heterozygosity $H_I$ within subpopulations is calculated from the observed heterozygote frequency within the subpopulations.

Consequently, $1 - F_{IS} = \frac{H_I}{H_S}$; $1 - F_{IT} = \frac{H_I}{H_T}$ and $1 - F_{ST} = \frac{H_S}{H_T}$.

Since $H_I = H_S(1 - F_{IS})$, we have $1 - F_{IT} = \frac{H_S(1 - F_{IS})}{H_T}$, and since $\frac{H_S}{H_T} = 1 - F_{ST}$,

$$1 - F_{IT} = (1 - F_{ST})(1 - F_{IS})$$

If individuals are mating completely at random over the entire population, then there will be no local variation in allele frequency and each subpopulation will have the same expected heterozygosity as the total population. In that case 퐹푆푇=0 and there will be no differentiation among subpopulations. At the other extreme, if each subpopulation is completely isolated and alleles have become fixed within each subpopulation, then there is no heterozygosity within the subpopulations. In that case 퐹푆푇=1 and there is maximum differentiation among subpopulations


Practical example: A population of 1,600 individuals was divided into three subpopulations and genotyped for the gene responsible for juicy meat in a delicacy goat in Yourland.

Observed numbers
                     AA     Aa     aa     Total
Subpopulation 1     125    250    125       500
Subpopulation 2      55     30     15       100
Subpopulation 3      80    440    480     1,000
Total population    260    720    620     1,600

Subpopulation 1
$P_1 = \frac{125}{500} = 0.25$; $H_1 = \frac{250}{500} = 0.50$; $Q_1 = \frac{125}{500} = 0.25$; $p_1 = P_1 + ½H_1 = 0.5$; $q_1 = 0.5$

Subpopulation 2
$P_2 = \frac{55}{100} = 0.55$; $H_2 = \frac{30}{100} = 0.30$; $Q_2 = \frac{15}{100} = 0.15$; $p_2 = P_2 + ½H_2 = 0.7$; $q_2 = 0.3$

Subpopulation 3
$P_3 = \frac{80}{1000} = 0.08$; $H_3 = \frac{440}{1000} = 0.44$; $Q_3 = \frac{480}{1000} = 0.48$; $p_3 = P_3 + ½H_3 = 0.3$; $q_3 = 0.7$

Total population
$P_{T0} = \frac{260}{1600} = 0.1625$; $H_{T0} = \frac{720}{1600} = 0.45$; $Q_{T0} = \frac{620}{1600} = 0.3875$

$p_{T0} = P_{T0} + ½H_{T0} = 0.3875$; $q_{T0} = 0.6125$

Expected numbers
                     AA          Aa          aa          Total
Subpopulation 1     125         250         125            500
Subpopulation 2      49          42           9            100
Subpopulation 3      90         420         490          1,000
Total population    240.2496    759.5008    600.2496     1,600


Expected frequency:
$AA_1 = p_1^2 = 0.5^2 = 0.25$; $Aa_1 = 2p_1q_1 = 2 \times 0.5 \times 0.5 = 0.50$; $aa_1 = q_1^2 = 0.5^2 = 0.25$
$AA_2 = p_2^2 = 0.7^2 = 0.49$; $Aa_2 = 2p_2q_2 = 2 \times 0.7 \times 0.3 = 0.42$; $aa_2 = q_2^2 = 0.3^2 = 0.09$
$AA_3 = p_3^2 = 0.3^2 = 0.09$; $Aa_3 = 2p_3q_3 = 2 \times 0.3 \times 0.7 = 0.42$; $aa_3 = q_3^2 = 0.7^2 = 0.49$
$AA_{T0} = p_{T0}^2 = 0.3875^2 = 0.150156$; $Aa_{T0} = 2p_{T0}q_{T0} = 2 \times 0.3875 \times 0.6125 = 0.474688$; $aa_{T0} = q_{T0}^2 = 0.6125^2 = 0.375156$

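Inbreeding coefficient in subpopulations and total population

$$F_{s1} = 1 - \frac{H_1}{H_{E1}} = 1 - \frac{H_1}{2p_1q_1} = 1 - \frac{0.50}{0.50} = 0.000$$

$$F_{s2} = 1 - \frac{H_2}{H_{E2}} = 1 - \frac{H_2}{2p_2q_2} = 1 - \frac{0.30}{0.42} = 0.2857$$

$$F_{s3} = 1 - \frac{H_3}{H_{E3}} = 1 - \frac{H_3}{2p_3q_3} = 1 - \frac{0.44}{0.42} = -0.0476$$

$$F_{T0} = 1 - \frac{H_{T0}}{H_{ET0}} = 1 - \frac{H_{T0}}{2p_{T0}q_{T0}} = 1 - \frac{0.450000}{0.474688} = 0.0520$$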

In subpopulation 1, the observed heterozygotes are the same as expected.
In subpopulation 2, there are fewer heterozygotes observed than expected.
In subpopulation 3, there are more heterozygotes observed than expected.

The observed and expected genotypic frequencies in subpopulation 2 ($F_{s2}$ = 0.2857 and pqF = 0.059997):

                              A1A1               A1A2                 A2A2
Expected genotype frequency   p² = 0.49          2pq = 0.42           q² = 0.09
Observed genotype frequency   p² + pqF = 0.55    2pq(1 - F) = 0.30    q² + pqF = 0.15
                              = P2               = H2                 = Q2

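$$H_I = \frac{H_1N_1 + H_2N_2 + H_3N_3}{N} = \frac{0.5 \times 500 + 0.30 \times 100 + 0.44 \times 1000}{1600} = 0.4500$$

$$H_S = \frac{H_{E1}N_1 + H_{E2}N_2 + H_{E3}N_3}{N} = \frac{0.5 \times 500 + 0.42 \times 100 + 0.42 \times 1000}{1600} = 0.445$$

$$H_T = 2p_{T0}q_{T0} = 2 \times 0.3875 \times 0.6125 = 0.474688$$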


$$F_{IS} = 1 - \frac{H_I}{H_S} = 1 - \frac{0.450}{0.445} = -0.0112$$

$$F_{ST} = 1 - \frac{H_S}{H_T} = 1 - \frac{0.445}{0.475} = 0.0632$$

$$F_{IT} = 1 - \frac{H_I}{H_T} = 1 - \frac{0.450}{0.475} = 0.0526$$

Verification
$$1 - F_{IT} = (1 - F_{ST})(1 - F_{IS})$$

$$(1 - 0.0526) = (1 - 0.0632)(1 - (-0.0112))$$
$$0.9474 \approx 0.9368 \times 1.0112 = 0.9473$$
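The F-statistics for the goat example can be recomputed with the sketch below. The function name `f_statistics` and the tuple layout (AA, Aa, aa) are assumptions for illustration. The script uses the unrounded $H_T$ = 0.474688 rather than 0.475, so $F_{ST}$ and $F_{IT}$ come out slightly lower than the hand-calculated values.

```python
def f_statistics(subpops):
    """Wright's F-statistics from genotype counts.

    subpops: list of (n_AA, n_Aa, n_aa) tuples, one per subpopulation.
    Heterozygosities are averaged with weights equal to subpopulation size.
    """
    sizes = [sum(g) for g in subpops]
    n_total = sum(sizes)
    h_obs = [g[1] / n for g, n in zip(subpops, sizes)]
    p = [(g[0] + 0.5 * g[1]) / n for g, n in zip(subpops, sizes)]
    h_exp = [2 * p_i * (1 - p_i) for p_i in p]
    h_i = sum(h * n for h, n in zip(h_obs, sizes)) / n_total
    h_s = sum(h * n for h, n in zip(h_exp, sizes)) / n_total
    p_bar = sum(p_i * n for p_i, n in zip(p, sizes)) / n_total
    h_t = 2 * p_bar * (1 - p_bar)
    return {
        "H_I": h_i, "H_S": h_s, "H_T": h_t,
        "F_IS": 1 - h_i / h_s,
        "F_ST": 1 - h_s / h_t,
        "F_IT": 1 - h_i / h_t,
    }


goats = [(125, 250, 125), (55, 30, 15), (80, 440, 480)]
for name, value in f_statistics(goats).items():
    print(name, round(value, 4))
```

The identity $1 - F_{IT} = (1 - F_{ST})(1 - F_{IS})$ holds exactly for the unrounded values, which makes it a convenient sanity check.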

Some general conclusions
Subpopulation 1 is consistent with Hardy-Weinberg proportions.
Subpopulation 2 has experienced some inbreeding.
Subpopulation 3 may have experienced heterozygote advantage or disassortative mating, since it has more heterozygotes than expected.

Conclusion concerning the overall degree of genetic differentiation (푭푺푻) Subdivision of population, possibly due to genetic drift accounts for 6.32% of the total genetic variation. The differentiation led to deficiency of heterozygotes over the total population.


QUANTITATIVE GENETICS

Genetic decomposition of a locus on the phenotype

The nature of quantitative traits: A quintessential question all quantitative geneticists ask is: How much of the variation in a population with respect to a particular trait is due to genetic causes and how much is due to environmental factors? The phenotype (P) can be partitioned into a genotypic value (G) and an environmental deviation (E). 푃 = 퐺 + 퐸 We will focus our attention on the genetic component, G. Let’s consider a single gene A with two alleles A1 and A2 combining into A1A1, A1A2 and A2A2

Let $a$, $d$ and $-a$ be the arbitrary genotypic values for A1A1, A1A2 and A2A2, respectively. The difference between the two homozygotes is 2a. The value of a is a deviation from 0 (the mid-point), which is the average of the two homozygotes. The heterozygote, A1A2, has a value of d = ak, where k is the degree of dominance. The alleles A1 and A2 behave in a completely additive manner when k = 0. When k = +1, the A1 allele is completely dominant over the A2 allele; when k = -1, the A2 allele is completely dominant over the A1 allele. k > +1 means overdominance, and k < -1 means underdominance.

Let’s look at a data set. The genotypic values of an AluI polymorphic site in the 5’-region of the bovine growth hormone gene for milk fat are as follows:
AluI (-/-): -25, designated A2A2
AluI (+/-): -23, designated A1A2
AluI (+/+): -10, designated A1A1

The midpoint of the two homozygotes = [-25 + (-10)]/2 =-17.5. The value of a=-10-(-17.5) = 7.5 and d = -23-(-17.5)= -5.5; k=d/a = -5.5/7.5=-0.73.
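The same arithmetic can be written as a small function. This is a sketch; the name `genotypic_scale` and the argument order (A1A1, A1A2, A2A2) are illustrative choices.

```python
def genotypic_scale(g_a1a1, g_a1a2, g_a2a2):
    """Return (midpoint, a, d, k) from the three genotypic values."""
    midpoint = (g_a1a1 + g_a2a2) / 2
    a = g_a1a1 - midpoint   # homozygote deviation from the midpoint
    d = g_a1a2 - midpoint   # heterozygote deviation from the midpoint
    k = d / a               # degree of dominance
    return midpoint, a, d, k


midpoint, a, d, k = genotypic_scale(-10, -23, -25)  # AluI milk fat example
print(midpoint, a, d, round(k, 2))
```

Because k is negative and lies between -1 and 0, the heterozygote is closer to A2A2, i.e. A2 is partially dominant over A1 in this example.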


Population mean

Let’s estimate the population mean (μ) of N individuals assuming a single locus with two alleles.

Expression of Population Mean
Genotype   Frequency   Genotypic value   Frequency x value
A1A1       p²          +a                p²a
A1A2       2pq         d                 2pqd
A2A2       q²          -a                -q²a

$$\mu = \frac{\sum \text{frequency} \times \text{value}}{\sum \text{frequency}}$$

$$\mu_G = \frac{p^2a - q^2a + 2pqd}{p^2 + 2pq + q^2}$$

The denominator is equal to 1. The numerator can be rewritten as $a(p^2 - q^2) + 2pqd$, where $p^2 - q^2 = (p + q)(p - q)$. Therefore, the population mean can be written as:

$$\mu_G = a(p - q) + 2pqd$$

The homozygotes contribute a(p - q) and the heterozygotes contribute 2pqd to the population mean.

From Fig 9, the population mean depends on allele frequency. The population mean decreases with increasing frequency of the unfavorable allele (Fig 9a). The population mean increases with increasing frequency of the favorable allele (Fig 9b).


Population mean under additivity (k=0): We have already established that d=ka, therefore, when k=0, d=0. 휇퐺 = 푎(푝 − 푞) Since p = 1 – q, 휇퐺 = 푎(1 − 푞 − 푞) = 푎(1 − 2푞)

Population mean under complete dominance (k=1): Under complete dominance, k = 1, which means d = a.

$$\mu_G = a(p - q) + 2pqa$$
$$\mu_G = a(1 - q - q) + 2aq(1 - q)$$
$$\mu_G = a - 2aq + 2aq - 2aq^2$$
$$\mu_G = a(1 - 2q^2)$$
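The two special cases can be verified numerically against the general expression. This is a sketch with hypothetical values (a = 10, q = 0.3); the function name `population_mean` is an illustrative choice.

```python
def population_mean(a, d, q):
    """Population mean mu_G = a(p - q) + 2pqd for a single biallelic locus."""
    p = 1 - q
    return a * (p - q) + 2 * p * q * d


a, q = 10.0, 0.3
additive = population_mean(a, d=0.0, q=q)   # should equal a(1 - 2q)
dominant = population_mean(a, d=a, q=q)     # should equal a(1 - 2q^2)
print(round(additive, 4), round(a * (1 - 2 * q), 4))
print(round(dominant, 4), round(a * (1 - 2 * q ** 2), 4))
```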

Genetic Model The genotypic value of an individual can be written in term of the genetic decomposition of the genotype.

퐺 = 퐴 + 퐷 + 퐼

The genotypic value equals the sum of the breeding value, A, the dominance deviation, D, and the epistatic (interaction) deviation, I. For simplicity, we will ignore the epistatic deviation and concentrate on the breeding value (additive value) and the dominance deviation.

퐺 = 퐴 + 퐷

Genotypic value, G
The genotypic value can be written as a deviation from the population mean.

$$G_{A_1A_1} = a - \mu_G$$
$$G_{A_1A_2} = d - \mu_G$$
$$G_{A_2A_2} = -a - \mu_G$$

$$G_{A_1A_1} = a - [a(p - q) + 2pqd] = a - pa + qa - 2pqd = a(1 - p + q) - 2pqd = a(1 - 1 + q + q) - 2pqd$$
$$G_{A_1A_1} = 2q(a - dp)$$

Similarly, $G_{A_1A_2} = a(q - p) + d(1 - 2pq)$ and $G_{A_2A_2} = -2p(a + qd)$.


BREEDING (Additive) VALUES (A)

An individual’s breeding value can be said to be the sum of the additive effects of the individual’s alleles. The concept of additive effects arises from the fact that parents pass on their alleles to their progeny and not their genotype. Therefore, the value of an individual judged by the mean value of its progeny is called the individual’s breeding value. The breeding value for an individual at a locus is defined as the sum of the additive effects of the alleles at the locus.

Allelic value of A1 (α1)
An A1 gamete can combine at random with either an A1 gamete to produce A1A1, with genotypic value +a, or an A2 gamete to produce A1A2, with genotypic value d. Taking into account the proportions in which they occur, the allelic value of A1 = pa + qd. The mean deviation of the progeny from the population mean is:

$$pa + qd - \mu_G = pa + qd - [a(p - q) + 2dpq] = q[a + d(q - p)]$$

[Note: p + q = 1; and 1 - 2p = p + q - 2p = q - p]

Allelic value of A2 (α2)
An A2 gamete can combine at random with either an A2 gamete to produce A2A2, with genotypic value -a, or an A1 gamete to produce A1A2, with genotypic value d. Taking into account the proportions in which they occur, the allelic value of A2 = -qa + pd. The mean deviation of the progeny from the population mean is:

$$-qa + pd - \mu_G = -qa + pd - [a(p - q) + 2dpq] = -p[a + d(q - p)]$$

When there are only two alleles at a locus, it is more convenient to express their additive effects in terms of the average effect of allele substitution.

$$\alpha_1 = q[a + d(q - p)]$$
$$\alpha_2 = -p[a + d(q - p)]$$

The effect of substituting one allele with the other is $\alpha = \alpha_1 - \alpha_2$; that is, the average change in the genotypic value when an A2 allele is replaced by an A1 allele.

$$\alpha = \alpha_1 - \alpha_2 = qa + dq^2 - dpq + pa + dpq - dp^2 = qa + pa + dq^2 - dp^2$$
$$\alpha = a(p + q) + d(q^2 - p^2)$$

Note that $p + q = 1$ and $(q^2 - p^2) = (q + p)(q - p)$, so

$$\alpha = a + d(q - p)$$


An individual’s breeding value A is the sum of all additive effects of its alleles. When mating is random, the breeding value of an individual is twice the expected mean deviation of its progeny from the population mean. The deviation is multiplied by two since only one half of the parental alleles are transmitted to each progeny. Therefore, we can estimate the breeding value of an individual by mating it to random individuals from the population and taking twice the deviation of its offspring mean from the population mean. Breeding values can be estimated under several scenarios.

The breeding values are:

$$A_{ij} = \begin{cases} 2\alpha_1 & \text{if genotype is } A_1A_1 \\ \alpha_1 + \alpha_2 & \text{if genotype is } A_1A_2 \\ 2\alpha_2 & \text{if genotype is } A_2A_2 \end{cases}$$

Mean breeding value: The sum over genotypes of the breeding value multiplied by the genotype frequency gives the mean breeding value.

                       A1A1     A1A2         A2A2
Frequency              p²       2pq          q²
Breeding value         2qα      (q - p)α     -2pα

Mean breeding value = 2p²qα + 2pq(q - p)α - 2pq²α

$$\bar{A} = 2pq\alpha(p + q - p - q) = 0$$

Dominance deviation (D)

From the genetic model, we can calculate the dominance deviation as:

$$D = G - A$$

Since we have already derived both G and A, we can deduce D. Dominance deviations arise from the interaction between alleles at a locus. In the absence of dominance, G = A. Let’s write G in terms of α:

$$G_{A_1A_1} = 2q(a - pd) \quad \text{and} \quad \alpha = a + d(q - p), \text{ so } a = \alpha - dq + dp$$

$$G_{A_1A_1} = 2qa - 2pqd$$
$$G_{A_1A_1} = 2q(\alpha - dq + dp) - 2pqd = 2q\alpha - 2dq^2 + 2pqd - 2pqd$$

Therefore, $G_{A_1A_1} = 2q(\alpha - qd)$. Similarly, $G_{A_1A_2} = (q - p)\alpha + 2pqd$ and $G_{A_2A_2} = -2p(\alpha + pd)$.

                       A1A1           A1A2                A2A2
Frequency              p²             2pq                 q²
Genotypic value, G     2q(α - qd)     (q - p)α + 2pqd     -2p(α + pd)
Breeding value, A      2qα            (q - p)α            -2pα
Dominance, D = G - A   -2q²d          2pqd                -2p²d
Mean dominance         -2p²q²d        +4p²q²d             -2p²q²d       (sum = 0)

COMPONENTS OF GENETIC VARIATION

Genetics as a subject focuses on variability on several levels. Without variability, there is nothing to study. It is therefore important to quantify variability and partition the variability into its components. A single locus with two alleles provides us with three genotypes. We can therefore compute the genotypic variation.

Estimation of variation: In general we study variation by estimating the variance. Variance can be estimated as:

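$$\sigma^2 = \sum f_iX_i^2 - \left(\frac{\sum f_iX_i}{\sum f_i}\right)^2 = \sum f_iX_i^2 - \mu^2$$

where $f_i$ are frequencies that sum to 1, or, for N individual observations,

$$\sigma^2 = \frac{1}{N}\left[\sum X_i^2 - \frac{\left(\sum X_i\right)^2}{N}\right]$$

or

$$\sigma^2 = \frac{1}{N}\sum (X_i - \mu)^2$$

However, if $X_i^* = X_i - \mu$, then $\sigma_{X^*}^2 = \sum f_iX_i^{*2}$.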


GENOTYPIC VARIATION

The genotypic variance, $\sigma_G^2$, can be estimated as:

$$\sigma_G^2 = \sum f_{ij}G_{ij}^2 - \mu_G^2$$

Since we have already calculated $G_{ij}$ as a deviation from the population mean μ,

$$\sigma_G^2 = \sum f_{ij}G_{ij}^2$$

$$\sigma_G^2 = p^2G_{A_1A_1}^2 + 2pqG_{A_1A_2}^2 + q^2G_{A_2A_2}^2$$

$$G_{ij} = \begin{cases} 2q(\alpha - qd) & \text{if genotype is } A_1A_1 \\ (q - p)\alpha + 2pqd & \text{if genotype is } A_1A_2 \\ -2p(\alpha + pd) & \text{if genotype is } A_2A_2 \end{cases}$$

Thus,

$$\sigma_G^2 = p^2[2q(\alpha - qd)]^2 + 2pq[(q - p)\alpha + 2pqd]^2 + q^2[-2p(\alpha + pd)]^2$$

$$\sigma_G^2 = 2pq\alpha^2 + (2pqd)^2$$

Partitioning of the Genetic Variance
Earlier on we defined $G = A + D$. The genetic model contains both the additive and dominance values. The variance of G is:

$$\sigma_G^2 = \sigma_A^2 + \sigma_D^2 + 2Cov_{AD}$$

In a population under Hardy-Weinberg equilibrium (without inbreeding), the covariance between the breeding value and the dominance deviation is zero.

$$Cov_{AD} = \sum f_{ij}A_{ij}D_{ij} = [(p^2)(2q\alpha)(-2q^2d)] + [(2pq)((q - p)\alpha)(2pqd)] + [(q^2)(-2p\alpha)(-2p^2d)]$$
$$= -4p^2q^3\alpha d + 4p^2q^2(q - p)\alpha d + 4p^3q^2\alpha d$$
$$Cov_{AD} = 4p^2q^2\alpha d(-q + q - p + p) = 0$$

Therefore, we can drop the covariance from the above model, giving

$$\sigma_G^2 = \sigma_A^2 + \sigma_D^2$$


Additive genetic variance, $\sigma_A^2$

We can use the same logic used in calculating the genetic variance to calculate the additive genetic variance. Since we have already calculated $A_{ij}$ as a deviation from the population mean μ,

$$\sigma_A^2 = \sum f_{ij}A_{ij}^2$$

$$A_{ij} = \begin{cases} 2\alpha_1 = 2q\alpha & \text{if genotype is } A_1A_1 \\ \alpha_1 + \alpha_2 = (q - p)\alpha & \text{if genotype is } A_1A_2 \\ 2\alpha_2 = -2p\alpha & \text{if genotype is } A_2A_2 \end{cases}$$

$$\sigma_A^2 = p^2(2q\alpha)^2 + 2pq[(q - p)\alpha]^2 + q^2(-2p\alpha)^2$$
$$\sigma_A^2 = 4p^2q^2\alpha^2 + 2pq(q - p)^2\alpha^2 + 4p^2q^2\alpha^2$$
$$= 2pq\alpha^2(2pq + q^2 - 2pq + p^2 + 2pq)$$
$$= 2pq\alpha^2(p^2 + 2pq + q^2)$$
$$\sigma_A^2 = 2pq\alpha^2$$

Dominance variance, $\sigma_D^2$
We have already calculated $D_{ij}$ as a deviation from the population mean μ, therefore,

$$\sigma_D^2 = \sum f_{ij}D_{ij}^2$$

$$D_{ij} = \begin{cases} -2q^2d & \text{if genotype is } A_1A_1 \\ 2pqd & \text{if genotype is } A_1A_2 \\ -2p^2d & \text{if genotype is } A_2A_2 \end{cases}$$

$$\sigma_D^2 = p^2(-2q^2d)^2 + 2pq(2pqd)^2 + q^2(-2p^2d)^2$$
$$= 4p^2q^4d^2 + 8p^3q^3d^2 + 4p^4q^2d^2$$
$$= 4p^2q^2d^2(q^2 + 2pq + p^2)$$

$$\sigma_D^2 = (2pqd)^2$$

$$\sigma_G^2 = 2pq\alpha^2 + (2pqd)^2$$


Fig 10 The genotypic (VG), additive (VA) and dominance (VD) variances at different allele frequency

If there is no dominance (d = 0), the dominance variance $\sigma_D^2 = 0$, resulting in $\sigma_G^2 = \sigma_A^2$. If there is complete dominance (d = a), the additive variance becomes

$$\sigma_A^2 = 8pq^3a^2$$

When p = q = 0.5,

$$\sigma_A^2 = ½a^2 \quad \text{and} \quad \sigma_D^2 = ¼d^2$$


Genetic parameter estimations under different allele frequencies

                                     q = 0.1                    q = 0.5                    q = 0.8
                               A1A1    A1A2    A2A2       A1A1    A1A2    A2A2       A1A1    A1A2    A2A2
Egg weight                     50      45      30         50      45      30         50      45      30
Genotypic value, G             10      5       -10        10      5       -10        10      5       -10
Genotypic frequency, f         0.81    0.18    0.01       0.25    0.50    0.25       0.04    0.32    0.64

Population mean = a(p - q) + 2pqd      8.9                        2.5                        -4.4
α = a + d(q - p)                       6                          10                         13

Additive effect
  α1 = qα                              0.6                        5                          10.4
  α2 = -pα                             -5.4                       -5                         -2.6

Breeding value, A              1.2     -4.8    -10.8      10      0       -10        20.8    7.8     -5.2
Mean breeding value            0.972   -0.864  -0.108     2.5     0       -2.5       0.832   2.496   -3.328

Dominance deviation, D         -0.1    0.9     -8.1       -2.5    2.5     -2.5       -6.4    1.6     -0.4
Mean dominance deviation       -0.081  0.162   -0.081     -0.625  1.25    -0.625     -0.256  0.512   -0.256

Additive variance                      6.48                       50                         54.08
Dominance variance                     0.81                       6.25                       2.56
Genetic variance                       7.29                       56.25                      56.64
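The entries of this table follow directly from the formulas derived earlier, so they can be regenerated for any allele frequency. The sketch below assumes a hypothetical function `locus_parameters`; a = 10 and d = 5 come from the egg weight example (midpoint 40).

```python
def locus_parameters(a, d, q):
    """Single-locus parameters for genotypes (A1A1, A1A2, A2A2)."""
    p = 1 - q
    alpha = a + d * (q - p)                 # average effect of substitution
    mean = a * (p - q) + 2 * p * q * d      # population mean
    freqs = (p * p, 2 * p * q, q * q)
    breeding = (2 * q * alpha, (q - p) * alpha, -2 * p * alpha)
    dominance = (-2 * q * q * d, 2 * p * q * d, -2 * p * p * d)
    var_a = 2 * p * q * alpha ** 2          # additive variance
    var_d = (2 * p * q * d) ** 2            # dominance variance
    return {
        "mean": mean, "alpha": alpha, "freqs": freqs,
        "breeding": breeding, "dominance": dominance,
        "var_a": var_a, "var_d": var_d, "var_g": var_a + var_d,
    }


for q in (0.1, 0.5, 0.8):
    res = locus_parameters(a=10, d=5, q=q)
    print(q, round(res["mean"], 2), round(res["alpha"], 2),
          round(res["var_a"], 2), round(res["var_d"], 2), round(res["var_g"], 2))
```

Note how the additive variance depends strongly on allele frequency even though a and d are fixed, which is the point the table illustrates.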


MOLECULAR GENETICS APPLIED TO

GENOME ORGANIZATION What is a genome? A genome is an organism’s complete set of DNA, including all of its genes. Each genome contains all of the information needed to build and maintain that organism. The genome is made up of the DNA in chromosomes as well as the DNA in mitochondria.

The genome contains the instructions, or blueprint, for all activity in an organism. The instructions are written in the four-letter language of DNA, i.e. Adenine, Cytosine, Thymine and Guanine (shortened to A, C, T and G). Almost every cell in a eukaryotic organism contains a complete copy of these instructions. The genetic instructions are stored in pairs of chromosomes. Each chromosome contains genes, which contain the direct instructions for a cell to make a protein. The genome contains coding sequences (genes) and non-coding sequences of DNA.


The genome contains:
1. STRUCTURAL GENES: DNA segments that code for some specific or proteins. They encode mRNAs, tRNAs, snRNAs, scRNAs, etc.
2. FUNCTIONAL SEQUENCES: Regulatory sequences that occur as regulatory elements (initiation sites, promoter regions, terminator regions, etc.)
3. NON-FUNCTIONAL SEQUENCES: Introns, repetitive sequences, and all the unknowns

DNA: Double-stranded helical structure.
NUCLEOSOME: DNA complexed with histones. Each nucleosome consists of eight histone proteins around which the DNA wraps 1.65 times.
CHROMATOSOME: A nucleosome plus the H1 histone.
Nucleosomes fold up to produce a 30-nm fiber that forms loops averaging 300 nm in length, which are compressed and folded to produce a 250-nm wide fiber. The tight coiling of the 250-nm fiber produces the chromatid of a chromosome.

We can all agree with these noble, hard-working scientists that the genome is very complex and that we may never grasp all of its complexity. Our knowledge about the genome keeps improving, but there are still many unanswered questions.


We know that about 5-10% of the genome encodes genes. What is the function of the other 90%? So far there are no good answers. In the 1990s, the non-coding regions were referred to as junk DNA, but nobody uses the term junk DNA anymore. Our knowledge of the genome keeps improving, and some of the so-called junk DNA contains elements that control gene transcription. Non-coding RNA, e.g. microRNA, can affect gene transcription depending on its location. A fairly balanced article on junk DNA in the post-ENCODE era, and the controversy that ensued, can be found in PLoS Genetics: http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1004351

THE DOUBLE HELIX
Deoxyribonucleic acid (DNA) has a double-stranded helical structure, and it encodes the genetic instructions used in the development and function of all known living organisms and many viruses. The two strands of DNA run in opposite directions to each other. Attached to each sugar is one of four nucleobases. It is the sequence of these four nucleobases along the backbone that encodes or biological information. The four nucleobases are two purines (Adenine and Guanine) and two pyrimidines (Cytosine and Thymine). In the double helix structure, adenine bonds with thymine (A-T) and guanine bonds with cytosine (G-C). Under the genetic code, RNA strands are translated to specify the sequence of amino acids within proteins. The RNA strands are initially created using DNA strands as a template in a process called transcription.


Ribonucleic acid (RNA), unlike DNA, is a single strand that folds onto itself rather than a paired double strand. In RNA, the pyrimidine thymine is replaced by uracil. One of the universal functions of RNA is protein synthesis, where messenger RNA (mRNA) molecules direct the assembly of proteins on ribosomes. This process uses transfer RNA (tRNA) molecules to deliver amino acids to the ribosome, where ribosomal RNA (rRNA) links amino acids together to form proteins.

GENE
A gene was defined at least four decades before the DNA structure was discovered. To a population geneticist, a gene is the basic unit of inheritance; genes come in pairs, and one member of each pair is transmitted from parent to progeny. A more refined definition of a gene is a sequence (instruction manual) on a chromosome that encodes a protein or a polypeptide.

A gene consists of a 5' untranslated region (5' UTR), or leader sequence, that extends to the position of the first codon used in translation. The 3' UTR, or trailer sequence, is the portion of an mRNA from the position of the last codon used in translation to the 3' end of the mRNA. The frame of a gene consists of exons and introns. An exon is any sequence encoded by a gene that remains within the final mature RNA product of that gene. An intron is a noncoding part of a gene that is spliced out before the RNA is translated into a protein.


MOLECULAR MARKERS

What is the composition of the intergenic noncoding part of the genome?

Genome Studies
1. Improve annotation of the genome
2. Function and regulation of coding genes
3. Posttranslational regulation of genes
4. Extract potential functions from non-coding and intergenic DNA

For Animal and Poultry Breeding
1. Map quantitative trait loci
2. Identify genes associated with traits of economic importance
3. Estimation of genomic breeding values
4.
5. Gene flow
6. Population studies
7. Epidemiological studies
8. Domestication
9. Toxicity and many others


To date, a large proportion of genome studies have been possible because of genetic markers.
GENETIC MARKER: A DNA sequence that can be detected and whose inheritance can be monitored. The three properties that define a genetic marker are locus specificity, polymorphism and ease of genotyping. A marker is said to be polymorphic when it exists in more than one form.

Types of genetic markers
1. Restriction fragment length polymorphism (RFLP)
2. Variable number of tandem repeats (VNTR)
   a. Minisatellites
   b. DNA fingerprinting
   c. Microsatellites
3. Sequence-tagged sites (STS) and expressed sequence tags (EST)
4. Random amplified polymorphic DNA (RAPD)
5. Amplified fragment length polymorphism (AFLP)
6. Single-stranded conformation polymorphism (SSCP)
7. PCR amplification of specific alleles (PASA)
8. Copy number variation (CNV)
9. Single nucleotide polymorphism (SNP)
   a. Anonymous SNP: no known effect on gene function; these have been used extensively in gene mapping, linkage disequilibrium and diversity studies.
   b. cSNP: located within a protein-coding sequence; may interfere with gene function by altering the amino acid sequence.
   c. Candidate SNP: a SNP thought to have a putative functional effect.
   d. rSNP: a SNP in the regulatory region of a gene; the regulatory region affects gene expression, e.g. a mutation in the 5' UTR of the endoglin gene affects translational initiation and alters the reading frame in hereditary hemorrhagic telangiectasia (a vascular disorder).
   e. pSNP: when a phenotype is changed as a result of altered protein function, a cSNP or rSNP may be labelled a pSNP.
   f. Synonymous SNP: a change occurs in a cSNP, but the cSNP still codes for the same amino acid.

There are several laboratory methods used to detect the aforementioned genetic markers. Those methods would not be the subject of this course. The most commonly used markers in farm animal studies are microsatellites and SNPs.


SELECTION THEORY

Selection response (R) is the gain made in the progeny generation by mating selected parents. Response to selection can be evaluated in the short or the long term.

The success of selection decisions depends on a number of factors:

1. How heritable is the trait under selection (i.e. the trait in the breeding goal)?
2. How much genetic variation for that trait is there in the population?
3. What is the average accuracy of the EBV, and thus the accuracy of selection?
4. What proportion of the animals will be selected for breeding?
5. In case genetic gain is to be expressed per year, rather than per generation: how long is a generation?

To optimize the success of a breeding program it is important to balance the relatively short-term goal of achieving high genetic gain against the long-term maintenance of the population by controlling the rate of inbreeding.

SHORT-TERM RESPONSE: Predict a few generations of selection response when the base population (generation 0) additive genetic variance ($\sigma_A^2$) is sufficient to make a satisfactory prediction using the breeders' equation (Lush, 1937):

$R = h^2 S$

LONG-TERM RESPONSE: As selection proceeds, allele frequencies change and the base population genetic parameters fail to predict long-term response.

CHANGES IN THE MEAN:

The within-generation mean: This reflects the difference between the mean of the entire population and that of the selected population. Selection can cause changes in the distribution of . The within-generation change is what is referred to as the Selection Differential (S).

The within-generation change in the mean due to selection is:

$S = X_s - X_0$

Where $X_0$ is the population mean (Generation 0) before selection and $X_s$ is the mean of the selected parents that produce the progeny population (Generation 1).


The between-generation mean: This is the response to selection, R, which measures the change in mean between the population before selection and the progeny of the selected parents.

$R = X_1 - X_0$

Where $X_1$ is the mean of the progeny population (Generation 1) before any further selection.

Weighted selection differential: The joint effects of natural and artificial selection affect selection response. Natural selection is always on the side of fitness and can be in the same direction or oppose artificial selection.

Important assumption in evaluating predictions of genetic gain: environmental influences remain constant across generations

Let’s examine the unweighted and weighted selection differential and ascertain how they are influenced by natural selection.

Data from a long term selection program:

1. Calculate the unweighted selection differential.
2. Calculate the weighted selection differential.
3. What is the direction of natural selection?


Mating            Male (ram)   Female (ewe)   # of offspring measured
Population mean   24 kg        22 kg
1                 22           20             2
2                 35           29             1
3                 23           22             1
4                 20           24             2
5                 24           20             2
6                 30           27             2
7                 30           30             0
8                 37           22             0
9                 22           20             6
10                19           20             10
                                              N=26
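One way to work through the exercise is sketched below in Python. It is only an illustration under stated assumptions: each sex's selection differential is taken as the mean of the selected parents minus that sex's population mean, the weighted version weights each parent by the number of offspring measured, and the two sexes are averaged with equal weight. The variable and function names are hypothetical and do not come from the notes.

```python
# Data copied from the table above: (ram weight kg, ewe weight kg, offspring measured)
matings = [
    (22, 20, 2), (35, 29, 1), (23, 22, 1), (20, 24, 2), (24, 20, 2),
    (30, 27, 2), (30, 30, 0), (37, 22, 0), (22, 20, 6), (19, 20, 10),
]
POP_MEAN_RAM = 24.0  # kg, population mean of males before selection
POP_MEAN_EWE = 22.0  # kg, population mean of females before selection


def unweighted_mean(values):
    """Simple average: every selected parent counts once."""
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Average in which each parent counts once per offspring measured."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


rams = [m[0] for m in matings]
ewes = [m[1] for m in matings]
offspring = [m[2] for m in matings]

# Assumption: the two sexes are averaged with equal weight.
s_unweighted = ((unweighted_mean(rams) - POP_MEAN_RAM)
                + (unweighted_mean(ewes) - POP_MEAN_EWE)) / 2
s_weighted = ((weighted_mean(rams, offspring) - POP_MEAN_RAM)
              + (weighted_mean(ewes, offspring) - POP_MEAN_EWE)) / 2

print(f"Unweighted S = {s_unweighted:.2f} kg")
print(f"Weighted S   = {s_weighted:.2f} kg")
```

If the weighted differential is smaller than the unweighted one, the heavier parents left fewer offspring, which would indicate natural selection acting against the direction of artificial selection.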

Prediction of response to selection from the proportion selected: Selection intensity (i)

The selection differential is of limited use when comparing the strength of selection on different traits or in different populations. When planning a selection program, it is useful to predict the genetic change from a given selection strategy before even selecting the parental population to breed. This is possible when truncation selection (selection of individuals above or below a certain truncation point or threshold) is practiced. The selection differential can be derived from the distribution of predicted breeding values or phenotypic values and knowledge of the proportion of selected individuals. The standardized selection differential, usually called the selection intensity (i), is the selection differential expressed as a fraction of the phenotypic standard deviation. The selection intensity is a more useful measure for predicting selection response or comparing different selection strategies or responses in different populations.

$i = \frac{S}{\sigma_p}$

Where $\sigma_p$ is the phenotypic standard deviation of the trait. This implies $S = i\sigma_p$.

The breeders' equation can therefore be written as:

$R = i h^2 \sigma_p$
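The link between the proportion selected and the selection intensity can be sketched in Python as follows. This is an illustration, not part of the original notes: it assumes a normally distributed trait and a very large population, in which case i equals the height of the standard normal curve at the truncation point divided by the proportion selected. The function names are hypothetical.

```python
from statistics import NormalDist


def selection_intensity(p: float) -> float:
    """Selection intensity i when the top fraction p is selected.

    Assumes truncation selection on a normally distributed trait in a
    very large population.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    std_normal = NormalDist()  # mean 0, standard deviation 1
    truncation_point = std_normal.inv_cdf(1.0 - p)
    return std_normal.pdf(truncation_point) / p


def predicted_response(p: float, h2: float, sigma_p: float) -> float:
    """R = i * h^2 * sigma_p, the form of the breeders' equation given above."""
    return selection_intensity(p) * h2 * sigma_p


for p in (0.50, 0.20, 0.10, 0.05):
    print(f"proportion selected {p:.2f} -> i = {selection_intensity(p):.3f}")
```

In small populations the intensity for a given proportion is somewhat lower than this large-population value, so tabulated values may differ slightly.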

The breeders' equation theoretically holds for a single generation of selection from an unselected base population. The reliability of using the breeders' equation to predict response to selection beyond one generation depends on:


1. The accuracy of the heritability estimate
2. Absence of environmental changes between generations
3. Insignificant change in the heritability estimate from that of the base population

From population genetics, we learned that heritability depends on allele frequency. Selection changes allele frequency. Therefore, heritability should be expected to change with selection. Thus, in the strictest sense, the breeders' equation is valid only for one generation. However, heritability is not expected to change significantly in the first few generations of selection, and in practice the breeders' equation has been used to predict short-term response (up to 3-5 generations of selection).

Accuracy

The breeders’ Equation can be extended beyond choosing an individual solely on the basis of its phenotype.

$h^2\sigma_p = \frac{\sigma_A^2}{\sigma_p^2}\sigma_p = \left(\frac{\sigma_A}{\sigma_p}\right)\sigma_A = h\sigma_A$

We can rewrite the response to selection equation as:

$R = i h \sigma_A$

Where h is the correlation between the phenotypic and breeding values, $h = r_{AP}$, which quantifies the ability to predict the breeding value of an individual from the individual's phenotype. This is in essence the accuracy of the selection scheme used to select parents. We can therefore express the breeders' equation in terms of accuracy of selection as:

$R = i r_{AP} \sigma_A$

Response = intensity × accuracy of predicting BV × standard deviation of BV

1. Single measurement on an animal

The EBV of an animal can be estimated by regressing the animal's BV on its phenotype. With a single measurement on an animal, the regression coefficient $b_{AP}$ equals the heritability $h^2$:

$b_{AP} = \frac{\sigma_{AP}}{\sigma_P^2} = \frac{\sigma_A^2}{\sigma_P^2} = h^2$

The EBV, $\hat{A}$, of an animal is $\hat{A} = h^2(P - \bar{P})$ and $Acc = \sqrt{b_{AP} \times g} = \sqrt{h^2 \times g}$

Where P is the phenotypic value of the trait, $\bar{P}$ is the population mean, and g is the relationship between the individual(s) being measured and the individual for which we are estimating BV. The value of g is 1.0 for an individual's own performance. It is 0.5 for full sibs, progeny or parents and 0.25 for half sibs or grandparents.

Example 1:

Daily feed consumption (FC) of two individuals A and B is 128 g and 135 g respectively. The mean FC is 120 g, with a heritability of 0.20. Predict the EBV and accuracy of A and B for FC.

A:

$EBV = h^2(P - \bar{P}) = 0.20 \times (128 - 120) = 1.6$ g

$Acc = \sqrt{h^2 \times g} = \sqrt{0.20 \times 1} = 0.45$

B:

$EBV = h^2(P - \bar{P}) = 0.20 \times (135 - 120) = 3.0$ g

$Acc = \sqrt{h^2 \times g} = \sqrt{0.20 \times 1} = 0.45$

Individual B has a higher EBV for FC than A, but both estimates have the same accuracy.
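The calculation in Example 1 can be reproduced with a short Python sketch. The function names are hypothetical, and the phenotypes used are the ones that appear in the worked calculation (128 g for A and 135 g for B).

```python
import math


def ebv_single_record(phenotype: float, pop_mean: float, h2: float) -> float:
    """EBV from one own record: h^2 * (P - population mean)."""
    return h2 * (phenotype - pop_mean)


def accuracy_single_record(h2: float, g: float = 1.0) -> float:
    """Accuracy sqrt(h^2 * g); g = 1.0 for the animal's own performance."""
    return math.sqrt(h2 * g)


H2 = 0.20        # heritability of feed consumption
POP_MEAN = 120.0  # g per day

ebv_a = ebv_single_record(128.0, POP_MEAN, H2)
ebv_b = ebv_single_record(135.0, POP_MEAN, H2)
acc = accuracy_single_record(H2)

print(f"A: EBV = {ebv_a:.1f} g, accuracy = {acc:.2f}")
print(f"B: EBV = {ebv_b:.1f} g, accuracy = {acc:.2f}")
```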

2. Repeated measurement on an animal

Some traits can be measured several times during an animal's lifetime, for example feed consumption, body weight and egg production. If a trait is measured several times during an animal's life, each value should be used in the estimate of breeding value. The relationship between repeated records, termed "repeatability", becomes important. Repeatability ($r_e$) is a measure of the reliability or strength of the relationship between repeated measurements on an individual. When using repeated measurements on an individual, g is still 1.0 since the animal being measured and the animal the BV is obtained for are the same. The value of $b_{AP}$ is now a function of the number of records (n), heritability ($h^2$) and repeatability ($r_e$).

With repeated measurements on an animal:

$b_{AP} = \frac{n h^2}{1 + (n-1) r_e}$ and $Acc = \sqrt{\frac{n h^2}{1 + (n-1) r_e} \times g}$


Example 2:

Assume that the daily feed intake of individual A (128 g) is an average of 5 measurements, with a repeatability of 0.40. Predict the EBV and accuracy of A.

$EBV = \frac{n h^2}{1 + (n-1) r_e}(P - \bar{P}) = \frac{5 \times 0.20}{1 + (5-1) \times 0.40} \times (128 - 120) = 3.08$

$Acc = \sqrt{\frac{n h^2}{1 + (n-1) r_e} \times g} = \sqrt{\frac{5 \times 0.20}{1 + (5-1) \times 0.40} \times 1.0} = 0.62$

Repeated measurements on A improve its EBV and accuracy for feed intake.
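Example 2 can be checked with the following Python sketch, which applies the repeated-records regression coefficient given above. The function name is hypothetical.

```python
import math


def regression_repeated(n: int, h2: float, repeatability: float) -> float:
    """b_AP = n*h^2 / (1 + (n - 1)*re) for the mean of n repeated records."""
    return n * h2 / (1.0 + (n - 1) * repeatability)


N_RECORDS = 5
H2 = 0.20
REPEATABILITY = 0.40
G = 1.0  # records are on the animal itself

b_ap = regression_repeated(N_RECORDS, H2, REPEATABILITY)
ebv = b_ap * (128.0 - 120.0)  # mean of A's records minus population mean
accuracy = math.sqrt(b_ap * G)

print(f"b_AP = {b_ap:.3f}, EBV = {ebv:.2f} g, accuracy = {accuracy:.2f}")
```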

Accuracy of Estimated Breeding Values for different heritability, repeatability and number of measurements on an animal.

                               Number of measurements
Heritability   Repeatability   1      5      10
0.10           0.25            0.32   0.50   0.55
               0.50            0.32   0.41   0.43
               0.75            0.32   0.35   0.36
0.25           0.25            0.50   0.79   0.88
               0.50            0.50   0.65   0.67
               0.75            0.50   0.56   0.57
0.50           0.50            0.71   0.91   0.95
               0.75            0.71   0.79   0.80

Traits with low heritability benefit from multiple measurements since each additional record contributes to the total information available, especially when the repeatability is low. If the repeatability is high, multiple measurements do not add much to the accuracy of the EBV.
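The pattern in the accuracy table can be explored with a small Python sketch that evaluates the repeated-records accuracy formula over the same grid of heritability, repeatability and number of records. The function name and the grid layout are illustrative choices, not part of the notes.

```python
import math


def accuracy_repeated(n: int, h2: float, repeatability: float, g: float = 1.0) -> float:
    """Accuracy of an EBV based on the mean of n repeated records."""
    return math.sqrt(n * h2 / (1.0 + (n - 1) * repeatability) * g)


# Heritability values and the repeatabilities tabulated for each of them.
grid = [
    (0.10, (0.25, 0.50, 0.75)),
    (0.25, (0.25, 0.50, 0.75)),
    (0.50, (0.50, 0.75)),
]
record_counts = (1, 5, 10)

print("h2    re    " + "  ".join(f"n={n:<3d}" for n in record_counts))
for h2, repeatabilities in grid:
    for re_value in repeatabilities:
        row = [accuracy_repeated(n, h2, re_value) for n in record_counts]
        print(f"{h2:.2f}  {re_value:.2f}  " + "   ".join(f"{a:.2f}" for a in row))
```

Changing `record_counts` shows how quickly the gain from extra records levels off when repeatability is high.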


3. Information from Relatives

In a closed population, there are bound to be full sibs (FS; both parents in common) and half sibs (HS; one parent in common) that provide additional information for estimating BV. Siblings have a proportion of their alleles (genes) in common. Full sibs have half of their alleles in common, and half sibs have a quarter of their alleles in common. In pigs, cattle, sheep and goats, siblings are initially reared together, and the common environment among siblings (maternal environment, temperature, food supply) creates additional similarity. In commercial poultry, however, similarity due to common environment is non-existent. In non-commercial poultry, where the hen incubates her own eggs and broods her chicks, similarity of siblings due to common environment is in play when estimating BV. The similarity among siblings, t, depends on the siblings involved.

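$t_{HS} = \tfrac{1}{4}h^2 + c_{HS}^2$ and $t_{FS} = \tfrac{1}{2}h^2 + c_{FS}^2$

where $c^2$ is the environmental correlation among sibs. Using the regression coefficient $\frac{n g h^2}{1 + (n-1)t}$, the EBV and accuracy are given as:

$EBV = \frac{n g h^2}{1 + (n-1)t}(P - \bar{P})$ and $Acc = \sqrt{\frac{n g h^2}{1 + (n-1)t} \times g}$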

where n is the number of siblings, t is the correlation among sibs, g is the genetic relationship among sibs. For full sibs, g=½, and for half sibs, g=¼.

Example 3:

Individual A has 5 half sibs with a mean FC of 128 g. Predict the EBV and accuracy of A when the environmental correlation $c^2$ is (a) 0, and (b) 0.125. The population mean for FC is 120 g and $h^2$ is 0.20. Assume (c) that the 5 records were obtained from full sibs, and $c^2$ is 0.125.

(a) $t_{HS} = \tfrac{1}{4} \times 0.20 + 0 = 0.05$, and g = 0.25

$EBV = \frac{n g h^2}{1 + (n-1)t}(P - \bar{P}) = \frac{5 \times 0.25 \times 0.20}{1 + (5-1) \times 0.05} \times (128 - 120) = 1.67$

$Acc = \sqrt{\frac{n g h^2}{1 + (n-1)t} \times g} = \sqrt{\frac{5 \times 0.25 \times 0.20}{1 + (5-1) \times 0.05} \times 0.25} = 0.23$

(b) $t_{HS} = \tfrac{1}{4} \times 0.20 + 0.125 = 0.175$, and g = 0.25

$EBV = \frac{n g h^2}{1 + (n-1)t}(P - \bar{P}) = \frac{5 \times 0.25 \times 0.20}{1 + (5-1) \times 0.175} \times (128 - 120) = 1.18$

$Acc = \sqrt{\frac{n g h^2}{1 + (n-1)t} \times g} = \sqrt{\frac{5 \times 0.25 \times 0.20}{1 + (5-1) \times 0.175} \times 0.25} = 0.19$

When there is no measurement on the animal, the EBV predicted from relatives is low. The higher the value of t, the lower the EBV.

(c) $t_{FS} = \tfrac{1}{2} \times 0.20 + 0.125 = 0.225$, and g = 0.50

$EBV = \frac{n g h^2}{1 + (n-1)t}(P - \bar{P}) = \frac{5 \times 0.50 \times 0.20}{1 + (5-1) \times 0.225} \times (128 - 120) = 2.11$

$Acc = \sqrt{\frac{n g h^2}{1 + (n-1)t} \times g} = \sqrt{\frac{5 \times 0.50 \times 0.20}{1 + (5-1) \times 0.225} \times 0.50} = 0.36$

Sib information never results in really high accuracy. Full sib information is limited by environmental correlations among the sibs. It should not replace individual’s own record if it can be obtained. Rather, it should be used to supplement the information on the individual if sib information happens to be available.
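The three cases of Example 3 can be recomputed with a Python sketch that applies the sib formulas exactly as stated earlier (t = g·h² + c², regression coefficient n·g·h²/(1 + (n − 1)t), accuracy as the square root of the regression coefficient times g). The function name and the returned tuple are illustrative assumptions.

```python
import math


def sib_ebv_and_accuracy(sib_mean, pop_mean, n, g, h2, c2):
    """EBV and accuracy of an animal predicted from the mean of n sibs.

    g is 0.25 for half sibs and 0.50 for full sibs; c2 is the
    environmental correlation among the sibs.
    """
    t = g * h2 + c2                           # correlation among sibs
    b = n * g * h2 / (1.0 + (n - 1) * t)      # regression coefficient
    return b * (sib_mean - pop_mean), math.sqrt(b * g)


H2, POP_MEAN, SIB_MEAN, N_SIBS = 0.20, 120.0, 128.0, 5

cases = {
    "(a) half sibs, c2 = 0":     dict(g=0.25, c2=0.0),
    "(b) half sibs, c2 = 0.125": dict(g=0.25, c2=0.125),
    "(c) full sibs, c2 = 0.125": dict(g=0.50, c2=0.125),
}

results = {}
for label, params in cases.items():
    results[label] = sib_ebv_and_accuracy(SIB_MEAN, POP_MEAN, N_SIBS, h2=H2, **params)
    ebv, acc = results[label]
    print(f"{label}: EBV = {ebv:.2f} g, accuracy = {acc:.2f}")
```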


Progeny testing

Using the mean of a parent's progeny to predict the parent's breeding value is an alternative predictor of an individual's breeding value. The correlation between the mean of n progeny and the breeding value of the parent is

$r_{AP} = \sqrt{\frac{n}{n + a}}$, where $a = \frac{4 - h^2}{h^2}$

which is equivalent to

$r_{AP} = \sqrt{\frac{n h^2}{4 + h^2(n - 1)}}$

Example: A breeder selects the top 20% of sheep based on the performance of 10 offspring. The heritability of udder size is 0.10, with a phenotypic variance of 50. Predict the response to selection that the breeder will achieve with this strategy. A selected proportion of 20% results in a selection intensity of 1.4.

$r_{AP} = \sqrt{\frac{10 \times 0.10}{4 + 0.10(10 - 1)}}$

The breeder is disappointed and wants more genetic gain. Predict how much improvement can be achieved by selecting the top 10% instead of the top 20% for breeding. What changed?

The breeder is still not completely satisfied because he wants more genetic gain, and decides to base the selection on the performance of 15 instead of 10 offspring. Predict the selection response for this new situation. What changed?
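A possible way to work through the progeny-testing exercise is sketched below in Python. The intensity of 1.4 for the top 20% is given in the example; the value of 1.755 for the top 10% is an assumption taken from standard normal-theory tables and is not stated in the notes. The additive genetic standard deviation is derived as the square root of h² times the phenotypic variance. Function names are hypothetical.

```python
import math


def progeny_test_accuracy(n: int, h2: float) -> float:
    """r_AP = sqrt(n*h^2 / (4 + h^2*(n - 1))) for the mean of n progeny."""
    return math.sqrt(n * h2 / (4.0 + h2 * (n - 1)))


def response(intensity: float, accuracy: float, sigma_a: float) -> float:
    """R = i * r_AP * sigma_A."""
    return intensity * accuracy * sigma_a


H2 = 0.10
PHENOTYPIC_VARIANCE = 50.0
sigma_a = math.sqrt(H2 * PHENOTYPIC_VARIANCE)  # additive genetic SD

I_TOP_20 = 1.4    # given in the example
I_TOP_10 = 1.755  # assumption: table value for 10% selected

scenarios = [
    ("top 20%, 10 progeny", I_TOP_20, 10),
    ("top 10%, 10 progeny", I_TOP_10, 10),
    ("top 10%, 15 progeny", I_TOP_10, 15),
]

for label, intensity, n_progeny in scenarios:
    r_ap = progeny_test_accuracy(n_progeny, H2)
    print(f"{label}: r_AP = {r_ap:.3f}, R = {response(intensity, r_ap, sigma_a):.2f}")
```

Comparing the first two scenarios isolates the effect of selection intensity, while comparing the last two isolates the effect of accuracy.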

From Response per generation to Response per year

The breeders' equation thus far calculates response to selection per generation. However, to calculate the selection response per year, the generation interval is required.

In quantitative genetics, generation intervals are generally defined as the average age of parents at the birth of their offspring. In this definition, the generation interval is based on the contributions of parental age classes to newborn offspring; i.e., the average age of parents is calculated as the sum of ages at birth of offspring weighted by the contribution of each age class to newborn offspring. This approach is adopted in the well-known gene flow procedure (Hill 1974).

The breeders' equation per year can be written as:

$R_{yr} = \frac{i \, r_{AP} \, \sigma_A}{L}$

The generation interval L can be calculated separately for males and females and averaged.

Equal numbers of 2 and 3 year old bulls selected as parents: $L_{males} = 2.5$ years

Equal numbers of 2, 3 and 4 year old cows selected as parents: $L_{females} = 3.0$ years

$L_{average} = 2.75$ years

Age structure of animals selected for breeding

Age      2     3     4     5    TOTAL
Male     10    7     3          20
Female   200   175   100   25   500

$L_{male} = \frac{(10 \times 2) + (7 \times 3) + (3 \times 4)}{10 + 7 + 3} = 2.65$ yr

$L_{female} = \frac{(200 \times 2) + (175 \times 3) + (100 \times 4) + (25 \times 5)}{200 + 175 + 100 + 25} = 2.90$ yr

$L_{average} = \frac{2.65 + 2.90}{2} = 2.775$ yr
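The generation interval calculation above is a weighted average of parental ages, which can be sketched in Python as follows. The dictionary layout (age mapped to number of parents) and the function name are illustrative choices.

```python
def generation_interval(parents_by_age: dict) -> float:
    """Average parental age, weighting each age class by its number of parents."""
    total_parents = sum(parents_by_age.values())
    return sum(age * count for age, count in parents_by_age.items()) / total_parents


# Age structure copied from the table above: {age in years: number selected}
males = {2: 10, 3: 7, 4: 3}
females = {2: 200, 3: 175, 4: 100, 5: 25}

l_male = generation_interval(males)
l_female = generation_interval(females)
l_average = (l_male + l_female) / 2  # sexes averaged with equal weight

print(f"L_male = {l_male:.2f} yr, L_female = {l_female:.2f} yr, "
      f"L_average = {l_average:.3f} yr")
```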

High selection intensity means high generation interval, and low selection intensity means low generation interval. This does not fit well with maximizing i/L.

i/L should be OPTIMIZED

Optimizing genetic gain will require a balance between increase of the accuracy and increase of the generation interval


Selection Path

The selection strategies for males and females are different. The major differences between the sexes are:

1. In mammals there is a limited reproductive capacity in females. We assume that population size is the same across generations. We should be aware that selected animals should be capable of producing sufficient progeny to maintain population size. Males generally can produce more progeny than females and, as a result, selection intensity is higher in males than in females. We should also be mindful of the direction of natural selection to ensure that sufficient progeny are produced.
2. The information sources for estimating breeding values in males and females may be different. Males may be selected based on progeny performance, whereas females are selected on their own performance, leading to differences in accuracy of selection.
3. The generation interval for the sexes may also be different. If males are selected based on progeny testing, then on average the age at which males are used for breeding will be different from that of females.

The aforementioned differences in males and females require different selection paths when determining response to selection per year. The breeders’ equation can be written as:

$R_{yr} = \frac{R_m + R_f}{L_m + L_f} = \frac{i_m r_{AP,m} \sigma_A + i_f r_{AP,f} \sigma_A}{L_m + L_f}$

The intensity of selection and accuracy of selection and generation interval may be different in males and females. The genetic standard deviation, however, is a population parameter and is, therefore, the same between males and females.

A sheep breeder has a flock of 200 ewes and is selecting for weaning weight. Rams are first selected at 2 years old and mated for 2 years. Ewes are first selected at 2 years old and mated for 5 years. Each ram is mated to 20 ewes, the lambing rate is 80%, the sex ratio is 50:50, and there is no significant mortality in adults. The heritability is 0.11 and the phenotypic variance is 0.25 kg². Calculate the response to selection per year.

Age structure of animals selected for breeding

Age      2    3    4    5    6    TOTAL
Male     5    5                   10
Female   40   40   40   40   40   200

200 ewes with an 80% lambing rate means 160 lambs in total (80 of each sex). Select 5 out of 80 males each year. The proportion is 5/80 = 6.25%, corresponding to a selection intensity i of ~1.98. Select 40 out of 80 females each year. The proportion is 40/80 = 50%, corresponding to a selection intensity i of 0.798. Calculate the response to selection per year.
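A hedged sketch of how the sheep example could be worked with the two-path equation is given below in Python. The selection intensities (1.98 and 0.798) and the age structure are taken from the example. The notes do not state how breeding values are estimated, so the sketch assumes that both sexes are selected on their own weaning weight, which makes the accuracy equal to h in both paths; that assumption and all names are illustrative.

```python
import math


def generation_interval(parents_by_age: dict) -> float:
    """Average parental age weighted by the number of parents per age class."""
    return (sum(age * n for age, n in parents_by_age.items())
            / sum(parents_by_age.values()))


H2 = 0.11
PHENOTYPIC_VARIANCE = 0.25              # kg^2
sigma_a = math.sqrt(H2 * PHENOTYPIC_VARIANCE)

# Assumption: selection on own phenotype, so accuracy r_AP = h for both sexes.
accuracy_males = accuracy_females = math.sqrt(H2)

I_MALES = 1.98     # 5 of 80 selected, value given in the example
I_FEMALES = 0.798  # 40 of 80 selected, value given in the example

l_males = generation_interval({2: 5, 3: 5})
l_females = generation_interval({2: 40, 3: 40, 4: 40, 5: 40, 6: 40})

r_males = I_MALES * accuracy_males * sigma_a
r_females = I_FEMALES * accuracy_females * sigma_a
r_per_year = (r_males + r_females) / (l_males + l_females)

print(f"L_m = {l_males:.1f} yr, L_f = {l_females:.1f} yr")
print(f"Response per year = {r_per_year:.4f} kg")
```

If a different information source were used for one sex (for example progeny testing of rams), only the corresponding accuracy and generation interval would need to change.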


We can define four selection paths:

Sires to breed sires (SS): This is the most stringent selection path, used to breed the fathers of new sires. Only elite sires become sires of sires.

Sires to breed dams (SD): Within the sires this is a less stringent selection path. These sires will be the fathers of the breeding females (the dams).

Dams to breed sires (DS): This is the most stringent selection path within the dams, used to breed new sires. Only elite dams become dams of sires.

Dams to breed dams (DD): This is the least stringent selection path. It depends on the studbook whether there are selection criteria for new dams.

$R_{yr} = \frac{R_{SS} + R_{SD} + R_{DS} + R_{DD}}{L_{SS} + L_{SD} + L_{DS} + L_{DD}}$

Selection response can be divided into a number of selection paths, the number depending on the number of differences in selection intensity and the accuracy of selection.


LIVESTOCK BREEDING STRATEGIES

Samuel E Aggrey, PhD
University of Georgia, Athens, GA 30602, USA

Several panels have been assembled in the past by governments, international agencies and non-profit organizations to map out strategies to improve livestock productivity in developing countries. The goals have been laudable but the outcomes have fallen far below expectations. Breeding strategy in the developing world has become synonymous with turning the axle of poultry and livestock production to mirror that of advanced countries. In the developing world genetic improvement has come to imply upgrading a herd, usually that of a national livestock research institute. Several crossbreeding projects were initiated all across Africa with the goal of quickly upgrading low producing indigenous and adapted breeds with high producing exotic breeds from Europe or North America. Management of crossbred herds did not match their genetic potential and as a result the expected productivity was not realized. The crossbreeding approach to genetic improvement was not done in a sustainable manner and currently only remnants of such projects exist. It should be pointed out that in a few cases, crossbreeding on private farms with improved nutrition and management has been successful, but these cases are not enough to meet the massive demand for meat and livestock products.

Genetic improvement is a long term endeavor and short term approaches are bound to yield limited or no success at all. Funding for genetic improvement projects from most international agencies only lasts for about 5 years. Funding from national governments could be as short as one year. A total mismatch of a long term endeavor with very short term funding can only point in the direction of limited success if not failure.

In recent times, scientific jargon has been embraced in several projects. Biotechnology is presented as the silver bullet expected to radically transform the whole agricultural sector in the developing world. The argument here is not about the potential of biotechnology. When a high powered fuel is put into a non-functioning engine, the vehicle will still not move. All other parts of the vehicle must also be functioning. Genomics, high throughput science, biotechnology and nanotechnology, when applied in the proper environment, can lead to tremendous increases in productivity. However, I would argue that before any of these advanced technologies are adopted en masse, the well proven methodologies need to be adopted first.

In the developing world, breeding strategies need to have at least four basic components:
1. Assessment
2. Preplanning
3. Technical mechanics of genetic improvement
4. Sustainability


A. ASSESSMENT OF EXISTING SYSTEM

Assessment can be done in five broad areas to answer basic questions to determine whether genetic improvement is even needed at all.

1. Current Production System
   a. Who are the breeders?
   b. Who are the animal keepers?
   c. What are the management practices?
   d. Can the current production system support an improvement program?
   e. Is reduction in herd size or animal numbers possible?
   f. What are the logistics and infrastructure?
   g. What is the environmental impact?
   h. Is the current production system sustainable?

2. Existing Input and Support
   a. Water
   b. Labor
   c. Animal health care
   d. Extension
   e. Training support
   f. Research support

3. Cultural and Social Practices
   a. What is the cultural/societal value of animals?
   b. What is the significance of raising and/or keeping animals?

4. Current Breeding Practices
   a. How do genes flow from breeding to producing animals?
      i. How do farmers obtain replacement animals?
      ii. Pure or crossbred? Or no form of improvement?

5. Market Analysis
   a. What is the size of the overall market?
   b. Can the market improve or grow?
   c. Is there demand for the product?
   d. What is the purchasing power of the population?
   e. Are there export possibilities?
   f. Can the market accommodate improvement in the production system?

There should be a fact based justification for genetic improvement. When there is a demand for a product, there is no need to convince producers to produce more.


GENETIC IMPROVEMENT IS A LONG TERM PROGRAM

What we learned from past attempted programs

1. Short term funding (≤5 years) has been a colossal FAILURE.
2. An economically sustainable plan for the long term is required.
3. A genetic diversity plan (biodiversity) should be required for the long term.

Otherwise, do not start!

B. PREPLANNING

In the preplanning stage, both livestock keepers and consumers should be adequately involved in the early planning and genetic improvement programs. Some questions also need to be adequately answered at this stage.

1. Is there a demand for increased productivity?
2. Are improved animals needed by livestock keepers without exceeding their capacity to manage the animals?
3. Will increased supply of external inputs (diet, vaccines, housing, etc.) increase productivity rather than a new breed?
4. Will consumers accept a new breed, improved strain or crossbred?

In most cases in Africa, livestock keepers have their own breeding criteria and any genetic improvement program should take that into account when defining the breeding objective. For example, Karamoja pastoralists prefer coat color, body size, conformation, horn configuration and temperament as traits suitable for marketing. In Ethiopia, there are preferred phenotypic characteristics of chickens. After all, the breeding objective should be based on projected profits under future conditions of production and not merely on the potential to change traits genetically. The definition of profit may differ from place to place. Whereas some places use monetary value to define profit, others may simply use herd size.

It is during the preplanning stage that priorities and the sustainability plan for the entire breeding strategy should be developed.

PRIORITIES
a. Short term
b. Medium term
c. Long term

1. Can the objectives of the priorities be achieved in the given time?
2. Is there any funding in place, or expected in the future, for any of the priority steps?
3. Are outcome benchmarks clearly defined?
4. Can the outcomes be achieved?


C. TECHNICAL MECHANICS OF GENETIC IMPROVEMENT

BREEDING OBJECTIVES

The breeding objective is defined based on projected profits under future conditions of production, not merely on the potential to change traits genetically

Breeding is always aimed at the future. Decisions you make now will influence the future generation(s). The breeding goal that you have defined indicates what you think will be important in the future. You have analyzed the market and have an idea about what customers will demand some years from now. Will it be mainly milk or butter or cheese? Will it be mainly pork chops or ham or bacon? Will it be mainly breast meat or legs or full carcasses? Finally, you have an idea about the expected developments in production systems and regulations. What are the new developments related to housing systems, nutrition, etc., and how are they expected to influence the performance of your animals? Has the (inter)national government announced new regulations that may limit your current production system? Should you anticipate these upcoming changes?

This means that the best animals for the future conditions of production need to be developed. How does one define the "best animal"? The definition of the best animal is subjective, depending on (1) the function of the animal, (2) culture, (3) market structure, (4) production environment, (5) legislation, (6) population structure [pyramidal or segmented] and (7) environmental limitations.

Cattle are kept for meat, milk and draft. Depending on the function of the animal within that particular society, the best animal can be defined. A high milking cow may be suitable for Wisconsin, but in the hills of Ethiopia, a hardy cow may be suitable.

The best animal should function well within the production and climatic environment and be culturally acceptable.

Broiler (meat-type) chicken processing changes in the USA

                    1980 (% processed)   1990 (% processed)
Whole birds         67%                  23%
Cut-ups             33%                  67%
Further processed                        10%

The type of bird needed for cut-ups and further processing is different from that needed for whole birds. This means that breeders have to anticipate future markets and develop birds that meet those demands. Such a bird will also be the best animal for the future.


The best animal may not necessarily be a high performance animal for a particular animal product (milk, meat or fiber), but could be an average performance animal with reasonable resistance to an endemic disease. Defining the best animal is not an easy one and requires inputs from animal keepers, consumers, breeders and other stakeholders. Matching genotypes with suitable environments and societal acceptability depends on the availability of wide range of genotypes to choose from. A thorough knowledge of similar genotypes in other tropical regions, including nutrition and local diseases is needed. The phenotypes may be acceptable but may not necessarily cope in a new environment. The following may be considered in selecting the best animal:

1. Genetically improving locally adaptable indigenous animals.
2. Introducing breeds/strains from similar environment(s).
3. Crossbreeding of locally adaptable animals with high producing animals from similar environment(s).
4. Crossbreeding with exotic breed(s) with a clear pathway for reliable supply of exotics.
5. Developing a synthetic breed.

India has been successful in developing several local poultry strains, most of which are strains of choice in commercial poultry production. The Australian Brangus cattle are about 3/8 Brahman and 5/8 Angus in their genetic makeup. The cattle are usually sleek black in color, but reds are also acceptable. Australian Brangus are also good walkers and foragers and "do well" in a wide variety of situations. South Africa has successfully developed both cattle and poultry breeds.

Data Recording System

Any serious genetic improvement program should have the infrastructure for collecting data. Without data collection it is almost impossible to undertake any form of tractable genetic improvement. Large cattle herds are kept by pastoralists in Nigeria and Eastern Africa. There are also many households that own small numbers of animals. Involvement of animal keepers in a genetic improvement program offers the opportunity to collect data on their animals. A data repository center with high storage and computing capacity is absolutely essential in developing any improvement program. In the USA, the US Department of Agriculture is responsible for storage and analysis of dairy cattle data. Beef cattle data are handled by the various breed associations and some large cattle ranches. Swine and poultry data are handled by their respective private breeding companies. A data repository agency needs to be identified in each African country and its role clearly defined. In recent times, the prospects of biotechnology and genomic selection have been projected as the "savior" for genetic improvement in the developing world. Regardless of the potential of genomic selection, phenotypic data and pedigree information have to be collected.

While it is possible to realize genetic gain with well-defined phenotypes without genomic information, it is NOT possible to realize gains without well-defined phenotypes even with genomic information (Henryon et al. 2014)


When the infrastructure for the well proven methods of genetic improvement is in place, advanced technologies become easy to adopt. Several novel approaches can be devised for data collection. Models can be developed to predict unmeasured phenotypes from the measurement of a few easy-to-measure phenotypes.

Figure 1 The livestock breeding and improvement cycle

GENETIC IMPROVEMENT PLAN

1. ANIMAL POPULATION AND POPULATION STRUCTURE

A breeding scheme defines the breeding objectives for the production of the next generation of animals. An animal breeding scheme is a combination of recording selected traits, estimating breeding values, selecting potential parents and a mating program for the selected parents, including appropriate (artificial) reproduction methods. The breeding scheme will also depend on the population structure.


(a) Breeding programs with separate breeding and production populations

Separation of breeding and production populations allows the breeder to focus on the objectives of each population. The purpose of the breeding population is genetic improvement in traits of interest. The production population is the vehicle through which commercial production is enhanced. Genetic material from the breeding population should constantly influence the production population. Most commercial dairy farmers in developed countries and some parts of Africa purchase semen from improved bulls to constantly upgrade their herds. A breeding program in Africa can concentrate on developing males and then sell them to local producers to improve their flocks in exchange for data collection. There are several advantages to doing so in addition to data collection. This automatically includes the animal keeper in the breeding scheme. Nobody kills the golden goose. When the farmer sees the benefits of improved animals without the burden of keeping males, such a scheme is bound to be successful. Over time, this strategy can become part of the sustainability plan.

When the farmer links the receipt of genetic material to profits, it becomes easy for the farmer to pay for such genetic material. That is when the breeding strategy becomes sustainable.

Figure 2 The components of a sustainable animal breeding scheme

Components of the above structure can be adopted for sustainable genetic improvement in the developing world for cattle and small ruminants and even pigs.

(b) Breeding programs with a pyramidal structure

This structure is often seen in species where trait recording is extensive and also very expensive. Under this structure only a small number of individuals relative to the production population are recorded. Genetic improvement is done in a limited number of animals and these animals become the source of gene flow to the production population. The genetic improvement in small elite pure lines, the multiplication in the next generation with a much larger number of animals (parents) and the generation of the production animals in very large numbers in the final generation lead to a pyramidal structure of such a breeding and production program. This is a strategy usually employed by poultry and pig operations in developed countries. Whereas some companies house and develop only elite pure lines, others develop an integrated system from pure lines to the commercial animal.

Figure 3 The classic pyramidal structure of livestock genetic improvement

Under the pyramid structure, consumer concerns, lobby groups and food service concerns from the bottom of the pyramid bubble up into the pure lines. Over time, these concerns are addressed in the genetic improvement programs in the pure lines. Poultry breeding companies develop animals for different markets and can respond more quickly to market changes than cattle breeders, especially since the generation interval is far shorter in poultry than in cattle.

In a pyramid structure, all sources of genetic variation are exploited. Selection response is realized in the elite pure lines. The additive genetic variance, the accuracy of estimated breeding values and the selection intensity become important, as these three factors determine genetic gain. The grandparent and parent multiplication levels exploit heterosis via non-additive genetic variance.
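The way the three factors named above combine can be sketched with the commonly used form of the selection response, R = i × r × σ_A, where σ_A is the square root of the additive genetic variance. The function names and the numbers below are illustrative assumptions, not values from this course, and dividing by the generation interval to obtain an annual figure is a simplification.

```python
import math


def response_per_generation(intensity, accuracy, additive_variance):
    """Expected response per generation, R = i * r * sigma_A.

    intensity         -- standardized selection intensity (i)
    accuracy          -- accuracy of estimated breeding values (r), 0 to 1
    additive_variance -- additive genetic variance of the trait
    """
    sigma_a = math.sqrt(additive_variance)
    return intensity * accuracy * sigma_a


def response_per_year(intensity, accuracy, additive_variance, generation_interval):
    """Expected response per year: per-generation response / generation interval."""
    per_generation = response_per_generation(intensity, accuracy, additive_variance)
    return per_generation / generation_interval


if __name__ == "__main__":
    # Hypothetical inputs chosen only to show the arithmetic.
    r_gen = response_per_generation(intensity=1.4, accuracy=0.6, additive_variance=25.0)
    r_year = response_per_year(1.4, 0.6, 25.0, generation_interval=2.0)
    print(f"response per generation: {r_gen:.2f} trait units")
    print(f"response per year:       {r_year:.2f} trait units")
```

Doubling the accuracy or the intensity doubles the expected response, whereas doubling the additive genetic variance raises it only by a factor of √2, because the variance enters through its square root.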

In commercial pig breeding programs, and in some rare cases of poultry breeding, a three-way cross is usually applied. The next figure illustrates a commercial three-way cross. Usually, the terminal male is selected on growth, feed efficiency and other production characteristics. The final female is usually a crossbred, taking advantage of both production and reproduction traits.


Figure 4 Three way commercial cross breeding scheme

2. SELECTION OR IMPROVEMENT STRATEGY

This stage includes breeding value estimation, selection criteria and genetic models. After estimating breeding values and evaluating the effect of alternative selection decisions on the genetic response to selection, the actual practical selection and mating of animals can begin. Selection programs can maximize genetic gains at a fixed inbreeding rate, e.g. ≤1%, or at any level that will limit the accumulation of inbreeding. It is at this stage that factors such as selection intensity and generation interval are optimized. Several options can be pursued, including:

(a) Mass selection
(b) Optimum contribution selection (OCS): maximizing long-term gains by maximizing the weighted genetic merit of selected parents while constraining the relationship between parents
(c) Index selection
(d) Single or multi-trait selection
(e) Correlated traits

Selection allows the parents of the next generation to be chosen. However, a mating plan needs to be in place to ensure that diversity is always maintained and that inbreeding does not accrue too quickly.
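The idea behind option (b) can be illustrated with a deliberately simplified sketch. It assumes every selected parent contributes equally and simply tries all parent groups of a fixed size, keeping the group with the highest mean estimated breeding value (EBV) whose average co-ancestry stays under a cap. Real OCS implementations optimize unequal contributions with numerical solvers; the function names, the kinship matrix and the EBVs here are hypothetical.

```python
from itertools import combinations


def group_coancestry(members, kinship):
    """Average kinship over all pairs of members, self-pairs included."""
    n = len(members)
    total = sum(kinship[i][j] for i in members for j in members)
    return total / (n * n)


def select_parents(ebv, kinship, n_parents, max_coancestry):
    """Return (group, mean EBV) of the best group that respects the cap.

    Returns (None, None) when no group satisfies the constraint.
    """
    best_group, best_merit = None, None
    for group in combinations(range(len(ebv)), n_parents):
        if group_coancestry(group, kinship) > max_coancestry:
            continue  # too closely related, constraint violated
        merit = sum(ebv[i] for i in group) / n_parents
        if best_merit is None or merit > best_merit:
            best_group, best_merit = group, merit
    return best_group, best_merit


if __name__ == "__main__":
    # Hypothetical data: candidates 0 and 1 are full sibs (kinship 0.25),
    # everyone else is unrelated; self-kinship is 0.5 (non-inbred).
    ebv = [10.0, 9.0, 6.0, 5.0]
    kinship = [
        [0.50, 0.25, 0.00, 0.00],
        [0.25, 0.50, 0.00, 0.00],
        [0.00, 0.00, 0.50, 0.00],
        [0.00, 0.00, 0.00, 0.50],
    ]
    print("tight cap:", select_parents(ebv, kinship, 2, max_coancestry=0.30))
    print("loose cap:", select_parents(ebv, kinship, 2, max_coancestry=0.50))
```

With the tight cap the two best animals cannot both be used because they are full sibs, so some genetic merit is given up in exchange for a less related group of parents. That trade-off is what the constraint on relationships expresses.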


Mating Strategy

1. Enables selection to align ancestors closer to exact threshold linear relationship.
2. Reduces the rate of inbreeding and the risk of alleles being lost through genetic drift.
3. Reduces variation in the accuracy of breeding values between selected candidates by increasing connectivity.
4. Genomic information can enable us to develop mating designs that disperse genetic contributions more efficiently than pedigree information:
a. Minimizing co-ancestry mating.
b. Minimizing the covariance between ancestral contributions.
c. Maximizing the probability that all ancestors contribute chromosomal segments to all allocated matings.

EVALUATION OF IMPROVEMENT STRATEGY

The traits in the breeding objectives may not necessarily be the selection traits; therefore, it is important that the traits in the breeding objective and the selected traits are evaluated each year. The following evaluation criteria can be considered:

1. Selection response in selected traits.
2. Selection response in breeding objective traits.
3. Annual rate of inbreeding and .
4. Annual cost of breeding program including appreciation/depreciation of fixed costs.

The annual rate of inbreeding can be used as an indirect measure of diversity in the elite populations.
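A minimal sketch of how the rate of inbreeding might be monitored, assuming average inbreeding coefficients are available for two successive periods. It uses the standard relations ΔF = (F_t − F_{t−1}) / (1 − F_{t−1}) and N_e = 1 / (2ΔF). Both relations are defined per generation, so applying them to annual averages is an approximation. The function names and the numbers are illustrative only.

```python
def rate_of_inbreeding(f_previous, f_current):
    """Rate of inbreeding between two successive periods.

    f_previous, f_current -- average inbreeding coefficients (0 to 1)
    """
    return (f_current - f_previous) / (1.0 - f_previous)


def effective_population_size(delta_f):
    """Effective population size implied by a per-generation rate delta_f."""
    if delta_f <= 0:
        raise ValueError("delta_f must be positive to imply a finite N_e")
    return 1.0 / (2.0 * delta_f)


if __name__ == "__main__":
    # Hypothetical averages for two successive generations of an elite line.
    delta_f = rate_of_inbreeding(f_previous=0.020, f_current=0.0298)
    print(f"rate of inbreeding: {delta_f:.4f}")
    print(f"implied N_e:        {effective_population_size(delta_f):.0f}")
```

Under these relations, a rate of inbreeding of 1% per generation corresponds to an effective population size of 50, which gives one way to read the ≤1% example mentioned earlier in this section.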

It is important to compare the theoretical expected response to the realized response. The actual weighted selection intensity could be used to evaluate the theoretical response. If there is a discrepancy, its causes need to be ascertained. Potential sources of discrepancy may be:

(a) Bias in the estimation of breeding values.
(b) Inappropriate genetic model.
(c) Some environmental factors not considered or accounted for.
(d) Selection criteria not strictly adhered to.
(e) Unexpected correlated response in other traits.

DISSEMINATION OF GENETIC MATERIAL TO PRODUCTION POPULATIONS

From here on, the alleles (genes) of the improved population are disseminated to the production population in a way that depends on the population structure. Mostly, several forms of crossbreeding are pursued to take advantage of heterosis, or hybrid vigor. Heterosis is the change in performance of crossbred animals over the average of their parental populations.
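Heterosis is commonly quantified as the deviation of the crossbred mean from the mid-parent mean, often expressed as a percentage of the mid-parent mean. The sketch below assumes a simple two-breed cross with equal weighting of the two parental means. The function name and the figures are hypothetical.

```python
def percent_heterosis(crossbred_mean, parent_a_mean, parent_b_mean):
    """Percent heterosis relative to the mid-parent mean.

    100 * (crossbred mean - mid-parent mean) / mid-parent mean
    """
    mid_parent = (parent_a_mean + parent_b_mean) / 2.0
    return 100.0 * (crossbred_mean - mid_parent) / mid_parent


if __name__ == "__main__":
    # Hypothetical means for some trait in two pure lines and their cross.
    h = percent_heterosis(crossbred_mean=105.0, parent_a_mean=90.0, parent_b_mean=110.0)
    print(f"heterosis: {h:.1f}% of the mid-parent mean")
```

A positive value means the crossbreds outperform the average of the parental lines. A value of zero means the cross performs at the mid-parent level, even if it still exceeds the poorer of the two parents.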


ECONOMIC AND GENETIC SUSTAINABILITY OF BREEDING PROGRAM

A breeding program is the organized structure set up to realize the desired gain in the production population. It is important for producers to also have a sense of improvement in their populations. Producers can only judge the benefit of a breeding program when the productivity of their animals improves and their “profit” margins go up. It is easy for farmers to pay for genetic material when they make a direct link of their profit margins to the genetic material they received. Economic sustainability can be achieved only when producers of improved animals can recover their cost and make a profit from recipients of their improved animals.

Pertinent questions to ask at this point are:

1. Can breeding programs sponsored for up to five years be economically sustainable?
2. Is the breeding program also genetically sustainable?

Genetic variation is the raw material for genetic improvement. When a genetic improvement strategy leads to genetic gain in traits, there is a loss of genetic variation. The inbreeding level and genetic diversity in the indigenous populations being improved for production also need to be constantly monitored to ensure that genetic variation between breeds (biodiversity) is preserved for the future.
