Linkage Disequilibrium

Biostatistics 666 Last Lecture z Common study designs z Descriptions of a single locus • Frequencies • Genotype Frequencies z Hardy-Weinberg Equilibrium • Exact Test for HWE • Procedure is analogous to tests for paired data The Allelic Architecture of Disease What is it and how do we discover it?

Rare, high penetrance mutations – use linkage

Common, low penetrance variants – use association Magnitude of effect

Frequency in population Association Studies and Linkage Disequilibrium z If all polymorphisms were independent at the population level, association studies would have to examine every one of them… z Linkage disequilibrium makes tightly linked variants strongly correlated producing cost savings for association studies Today … z Descriptors for multiple markers z Relating allele frequencies at pairs of loci • Linkage equilibrium • Linkage disequilibrium z D’ and r² statistics Frequencies

Locus B Totals B b

Locus A A pAB pAb pA

a paB pab pa

Totals pB pb 1.0 Linkage Equilibrium

pAB = pA pB

pAb = pA pb = pA(1− pB )

paB = pa pB = (1− pA ) pB

pab = pa pb = (1− pA )(1− pB ) Linkage Disequilibrium

pAB ≠ pA pB

pAb ≠ pA pb = pA(1− pB )

paB ≠ pa pB = (1− pA ) pB

pab ≠ pa pb = (1− pA )(1− pB ) A new mutation…

Before Mutation

AG

CG

After Mutation

AG

CG

CCMutation For a new mutation… z One haplotype frequency is zero z Linkage equilibrium does not hold • In contrast, linkage disequilibrium z How do haplotype frequencies change over time? Recombination

Before Recombination AG

CG

CC

After Recombination AG

CG

CC

AC

Recombinant Haplotype Equilibrium or Disequilibrium? z We will present simple argument for why linkage equilibrium holds for most loci z Balance of factors • (a function of population size) • Random mating • Distance between markers • … Disequilibrium Coefficient DAB

DAB = pAB − pA pB

pAB = pA pB + DAB

pAb = pA pb − DAB

paB = pa pB − DAB

pab = pa pb + DAB Why Equilibrium is Reached… z Eventually, random mating and recombination should ensure that mutations spread from original haplotype to all in the population… z Simple argument: • Assume fixed allele frequencies over time Recombination Rate (θ) z Probability of an odd number of crossovers between two loci z Proportion of time from two different grand-parents occur in the same gamete z Increases with physical (base-pair) distance, but rate of increase varies across genome Without Recombination

B b

A pA pB + DAB pA pb − DAB pA

a pa pB − DAB pa pb + DAB pa

pB pb

Haplotype Frequencies Remain Stable Over Time P=1- θ With Recombination B b

A pA pB pA pb pA

a pA pb pa pb pa

pB pb

Haplotype Frequencies Are Function of Allele Frequencies P=θ Overall Change

θ B θ b θ A pA pB + (1 − )DAB pA pb − (1 − )DAB pA a pA pb − (1 − )DAB pa pb + (1 − θ )DAB pa

pB pb

Disequilibrium Decreases… Predictions z Disequilibrium will decay each generation • In a large population z After t generations… t t 0 • D AB = (1-θ) D AB z A better model should allow for changes in allele frequencies over time… Decay of D with Time

1.000

0.900 Ө = 0.0001 0.800

0.700 Ө = 0.001

0.600 Ө = 0.01

0.500

0.400 Ө = 0.1

Disequilibrium 0.300

0.200 Ө = 0.5 0.100

0.000 1 10 100 1000 Generations Linkage Equilibrium z In a large random mating population haplotype frequencies converge to a simple function of allele frequencies

z (The reverse is always true!) Linkage Disequilibrium z Chromosomes are Ancestor mosaics Present-day z Tightly linked markers • Alleles not randomly associated • Reflect ancestral haplotypes z Recombination, Mutation, Drift DAB is hard to interpret

z Sign is arbitrary … • A common convention is to set A, B to be the common allele and a, b to be the rare allele z Range depends on allele frequencies • Hard to compare between markers Range of DAB z Must be greater than

• Max(-pApB,-papb) z Must be smaller than

• Min(pA -pApB, pB -pApB) z Constraints ensure that all haplotype frequencies are greater than zero Alternative Measures z The most popular measures • D’ and |D’| • ∆²or r2 z Other common measures • Chi-squared • P-value D’

⎧ D AB D < 0 ⎪ AB ⎪ max( pA pB , pa pb ) D' AB = ⎨ DAB ⎪ DAB > 0 ⎩⎪ min( pA pb , pa pB ) z Ranges between –1 and +1 • More likely to take extreme values when allele frequencies are small • ±1 implies at least one of the observed haplotypes was not observed Raw |D’| data from Chr22

1.00

0.80

0.60 D'

0.40

0.20

0.00 0 200 400 600 800 1000 Physical Distance (kb) More on D’ z Pluses: • D’ = 1 or D’ = -1 means no mandatory recombinants between markers • Generally, higher D’ means two markers are better surrogates for each other z Minuses: • D’ estimates inflated in small samples • D’ estimates inflated when one allele is rare Effect of Allele Frequencies ∆² (also called r2)

D2 ∆² = AB pA (1− pA ) pB (1− pB ) χ ² = 2n z Ranges between 0 and 1 • 1 when the two markers provide identical information • 0 when they are in perfect equilibrium z Expected value is 1/2n Raw ∆² data from Chr22

1.00

0.80

0.60 2 r

0.40

0.20

0.00 0 200 400 600 800 1000 Physical Distance (kb) Summary of Disequilibrium in the Genome z How much disequilibrium is there? z What are good predictors of disequilibrium? z What are good predictors of variation in disequilibrium? Extent of Linkage Disequilibrium

12

8

Half-Life of R² Half-Life 4

0 12345678910111213141516171819202122 Chromosome

LD extends further in the larger chromosomes, which have lower recombination rates Most chromosomes exhibit regions of extended LD on the megabase scale Extent of LD

Extent of LD Across the Genome (in 1Mb Windows) Proportion of Genome of Proportion 0.00 0.02 0.04 0.06 0.08 0.10

0 1020304050

Extent of LD (r²) Average Extent: 11.9 kb Median Extent: 7.8 kb 10th percentile: 3.5 kb 90th percentile: 20.9 kb Genomic Variation in Disequilibrium (CEPH)

Expected r2 at 30kb Bright Red > 0.88 Dark Blue < 0.12

Heather Munro Dense Region 1

z Chromosome 7 • 157 markers / 520 kb • 27.0 – 27.5 Mb D’ • Average LD region

z SNP picking (33/157 = 21%) • 12 unique SNPs • 21 tagging SNPs • Others, average r² = 0.73 R²

0.0 1.0 Predicting Disequilibrium z How can we predict ∆2? • Depends on allele frequencies • Degree of recombination z How can we predict LD for multiple markers? z One option: simulation… Question:

z How would you set up a simulation to predict D’ or r² for a pair of markers? z What parameters might be of interest? Coalescent Models …

z A better approach to predicting LD and distribution of genetic variation z Read book chapter by Richard Hudson