<<

Linkage Disequilibrium

Biostatistics 666 Last Lecture z Basic properties of a locus • Frequencies • Genotype Frequencies z Hardy-Weinberg Equilibrium • Relationship between allele and genotype frequencies that holds for most genetic markers z Exact Tests for Hardy-Weinberg Equilibrium Today … z We’ll consider properties of pairs of z frequencies z Linkage equilibrium z Linkage disequilibrium Let’s consider the history of two neighboring alleles… Alleles that exist today arose through ancient mutation events…

Before Mutation

A

After Mutation

A

C Mutation One allele arose first, and then the other…

Before Mutation

AG

CG

After Mutation

AG

CG

CCMutation Recombination generates new arrangements for ancestral alleles

Before Recombination AG

CG

CC

After Recombination AG

CG

CC

AC

Recombinant Haplotype Linkage Disequilibrium z Chromosomes are mosaics Ancestor z Extent and conservation of mosaic pieces depends on Present-day • Recombination rate • Mutation rate • Population size • z Combinations of alleles at very close markers reflect ancestral Why is linkage disequilibrium important for gene mapping? Association Studies and Linkage Disequilibrium z If all polymorphisms were independent at the population level, association studies would have to examine every one of them… z Linkage disequilibrium makes tightly linked variants strongly correlated producing cost savings for association studies Tagging SNPs z In a typical short chromosome segment, there are only a few distinct haplotypes z Carefully selected SNPs can determine status of other SNPs

Haplo1 TT T 30% Haplo2 TT T 20%

Haplo3 TT T 20% Haplo4 TT T 20%

Haplo5 TT T 10% Linkage Disequilibrium Enables Studies z In contrast to linkage studies, association studies can identify variants with relatively small individual contributions to disease risk z However, they require detailed measurement of genetic variation and there are >10,000,000 catalogued genetic variants z Until recently, studies limited to candidate genes or regions • A hit-and-miss approach… z Because assay costs are decreasing and a modest number of variants can represent all others, genome-wide association studies are now possible. The Allelic Architecture of Disease What is it and how do we discover it?

Rare, high penetrance mutations – use linkage

Common, low penetrance variants – use association Magnitude of effect

Frequency in population Basic Descriptors of Linkage Disequilibrium Commonly Used Descriptors z Haplotype Frequencies • The frequency of each type of chromosome • Contain all the information provided by other summary measures z Commonly used summaries • D • D’ • r2 or ∆2 Haplotype Frequencies

Locus B Totals B b

Locus A A pAB pAb pA

a paB pab pa

Totals pB pb 1.0 Linkage Equilibrium Expected for Distant Loci

pAB = pA pB

pAb = pA pb = pA(1− pB )

paB = pa pB = (1− pA ) pB

pab = pa pb = (1− pA )(1− pB ) Linkage Disequilibrium Expected for Nearby Loci

pAB ≠ pA pB

pAb ≠ pA pb = pA(1− pB )

paB ≠ pa pB = (1− pA ) pB

pab ≠ pa pb = (1− pA )(1− pB ) Disequilibrium Coefficient DAB

DAB = pAB − pA pB

pAB = pA pB + DAB

pAb = pA pb − DAB

paB = pa pB − DAB

pab = pa pb + DAB DAB is hard to interpret

z Sign is arbitrary … • A common convention is to set A, B to be the common allele and a, b to be the rare allele z Range depends on allele frequencies • Hard to compare between markers What is the range of DAB? z What are the maximum and minimum

possible values of DAB when

• pA = 0.3 and pB = 0.3

• pA = 0.2 and pB = 0.1 z Can you derive a general formula for this range? D’ – A scaled version of D

⎧ DAB ⎪ DAB <0 ⎪min( pA pB , pa pb ) D'AB =⎨ DAB ⎪ DAB >0 ⎩⎪min( pA pb , pa pB ) z Ranges between –1 and +1 • More likely to take extreme values when allele frequencies are small • ±1 implies at least one of the observed haplotypes was not observed More on D’ z Pluses: • D’ = 1 or D’ = -1 means no evidence for recombination between the markers • If allele frequencies are similar, high D’ means the markers are good surrogates for each other z Minuses: • D’ estimates inflated in small samples • D’ estimates inflated when one allele is rare ∆² (also called r2)

D2 ∆² = AB pA (1− pA ) pB (1− pB ) χ ² = 2n z Ranges between 0 and 1 • 1 when the two markers provide identical information • 0 when they are in perfect equilibrium z Expected value is 1/2n More on r2 z r2 = 1 implies the markers provide exactly the same information z The measure preferred by population geneticists z Measures loss in efficiency when marker A is replaced with marker B in an association study • With some simplifying assumptions (e.g. see Pritchard and Przeworski, 2001) When does linkage equilibrium hold? Equilibrium or Disequilibrium? z We will present simple argument for why linkage equilibrium holds for most loci z Balance of factors • (a function of population size) • Random mating • Distance between markers • … Why Equilibrium is Reached… z Eventually, random mating and recombination should ensure that mutations spread from original haplotype to all haplotypes in the population… z Simple argument: • Assume fixed allele frequencies over time Generation t, Initial Configuration

B b

A pA pB + DAB pA pb − DAB pA

a pa pB − DAB pa pb + DAB pa

pB pb

Assume arbitrary values for the allele frequencies and disequilibrium coefficient Generation t+1, Without Recombination

B b

A pA pB + DAB pA pb − DAB pA

a pa pB − DAB pa pb + DAB pa

pB pb

Haplotype Frequencies Remain Stable Over Time Outcome has probability 1- r Generation t+1, With Recombination B b

A pA pB pA pb pA

a pA pb pa pb pa

pB pb

Haplotype Frequencies Are Function of Allele Frequencies Outcome has probability r Generation t+1, Overall

θ B θ b θ A pA pB + (1 − )DAB pA pb − (1 − )DAB pA a pA pb − (1 − )DAB pa pb + (1 − θ )DAB pa

pB pb

Disequilibrium Decreases… Recombination Rate (r) z Probability of an odd number of crossovers between two loci z Proportion of time alleles from two different grand-parents occur in the same gamete z Increases with physical (base-pair) distance, but rate of increase varies across genome Decay of D with Time

1.000

0.900 r =Ө 0.0001= 0.0001 0.800 Ө = 0.001 0.700 r = 0.001

0.600 rӨ = =0.01 0.01

0.500

0.400 rӨ = = 0.1 0.1

Disequilibrium 0.300

0.200 Өr = = 0.50.5 0.100

0.000 1 10 100 1000 Generations Predictions z Disequilibrium will decay each generation • In a large population z After t generations… t t 0 • D AB = (1-θ) D AB z A better model should allow for changes in allele frequencies over time… Linkage Equilibrium z In a large random mating population haplotype frequencies converge to a simple function of allele frequencies Some Examples of Linkage Disequilibrium Data Summary of Disequilibrium in the Genome z How much disequilibrium is there? z What are good predictors of disequilibrium? z What are good predictors of variation in disequilibrium? Raw |D’| data from Chr22

1.00

0.80

0.60 D'

0.40

0.20

0.00 0 200 400 600 800 1000 Physical Distance (kb) Raw ∆² data from Chr22

1.00

0.80

0.60 2 r

0.40

0.20

0.00 0 200 400 600 800 1000 Physical Distance (kb) Summarizing Disequilibrium ComparingEmpirical of LD Populations(chromosome 2, R²) …

0.6

0.4

Han, Japanese Yoruba CEPH Statistic 0.2

0 0 50 100 150 200 250 Distance (kb)

LD extends further in CEPH and the Han/Japanese than in the Yoruba Comparing Genomic Regions … z Rather than compare curves directly, it is convenient to a pick a summary for the decay curves z One common summary is the distance at which the curve crosses a threshold of interest (say 0.50) Extent of Linkage Disequilibrium

12

8

Half-Life of R² Half-Life 4

0 12345678910111213141516171819202122 Chromosome

LD extends further in the larger chromosomes, which have lower recombination rates But within each chromosome, there is still huge variability! Extent of LD

Extent of LD Across the Genome (in 1Mb Windows) Proportion of Genome of Proportion 0.00 0.02 0.04 0.06 0.08 0.10

0 1020304050

Extent of LD (r²) Average Extent: 11.9 kb Median Extent: 7.8 kb 10th percentile: 3.5 kb 90th percentile: 20.9 kb Genomic Variation in LD (CEPH)

Expected r2 at 30kb Bright Red > 0.88 Dark Blue < 0.12 Genomic Variation in LD (JPT + HCB)

Expected r2 at 30kb Bright Red > 0.88 Dark Blue < 0.12 Genomic Variation in LD (YRI)

Expected r2 at 10kb Bright Red > 0.88 Dark Blue < 0.12 Genomic Distribution of LD (YRI)

Expected r2 at 30kb Bright Red > 0.88 Dark Blue < 0.12 What factors might contribute to genomic variation in LD? Today … z Basic descriptors of linkage disequilibrium z Learn when linkage disequilibrium is expected to hold (or not!)