Statistical Genomics and Bioinformatics Workshop 8/16/2013
Statistical Genomics and Bioinformatics Workshop: Genetic Association and RNA-Seq Studies
Population Genetics and Genome-wide Genetic Association Studies (GWAS)
Brooke L. Fridley, PhD
University of Kansas Medical Center
Study of Genetic Association
Cases
Controls
Genetic association studies compare the frequency of genetic variants between a group of cases and a group of controls to determine whether specific variants are associated with disease.
Genetic Analysis Strategies
[Figure: effect size plotted against allele frequency, marking the regions addressed by linkage analysis, GWAS association, and rare variant analysis]
Ardlie, Kruglyak & Seielstad (2002) Nature Reviews Genetics
Zondervan & Cardon (2004) Nature Reviews Genetics
Genetics of Complex Traits
• Multiple genes / variants
– Common and rare variants
– Interactions, Haplotypes, Pathways
• Environment
– Gene-Environment interaction
In reality, much more complex!
NHGRI GWAS Catalog (8/11/2013) http://www.genome.gov/gwastudies/
Population Genetics
Recombination
Before meiosis:
A1 B1 D1
A2 B2 D2

Crossovers occur during meiosis.

After recombination:
A1 B1 D2
A2 B2 D1
Linkage Disequilibrium (LD)
• Particular alleles at neighboring loci tend to be co-inherited.
• For tightly linked loci, this co-inheritance might lead to associations between alleles in the population.
• LD describes the situation where particular alleles at nearby loci occur together on the same chromosome more often than expected by chance.

A1 B1 D2
A2 B2 D1
Linkage of a Marker with a Disease Locus
[Figure: diagram of marker alleles A and a across individuals, illustrating linkage between a marker and a disease locus]
Use of LD for Genetic Association Studies
• SNPs are markers; many have no function
• Indirect association: LD between SNP marker and causal variant
LD measures
• LD varies by race
– Recent populations have higher LD (less recombination)
– African populations are more genetically diverse than European populations
• LD is a pairwise measure
• D = observed − expected haplotype frequency
• D′ (−1 ≤ D′ ≤ 1)
– Standardized: D′ = D / (maximum possible value of D)
– Related to recombination history; |D′| = 1 means no recombination
• r² (0 ≤ r² ≤ 1)
– Squared correlation coefficient between the alleles at the two loci
– Less sensitive to low MAFs
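The three measures above can be computed directly from the four haplotype frequencies at a pair of biallelic loci. The following is a minimal sketch; the function name `ld_measures` and its argument names are illustrative assumptions, not part of the workshop material.

```python
def ld_measures(p_AB, p_Ab, p_aB, p_ab):
    """Return (D, D', r^2) from the four haplotype frequencies of two biallelic loci."""
    total = p_AB + p_Ab + p_aB + p_ab
    if abs(total - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")

    p_A = p_AB + p_Ab          # allele frequency of A at locus 1
    p_B = p_AB + p_aB          # allele frequency of B at locus 2

    # D = observed minus expected frequency of the AB haplotype
    d = p_AB - p_A * p_B

    # Maximum possible |D| given the allele frequencies depends on the sign of D
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = d / d_max if d_max > 0 else 0.0

    # r^2 is the squared correlation between the alleles at the two loci
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


# Two haplotypes only: D' = 1 and r^2 = 1
two_haplotypes = ld_measures(0.5, 0.0, 0.0, 0.5)
# Three haplotypes: D' = 1 but r^2 < 1
three_haplotypes = ld_measures(0.5, 0.25, 0.0, 0.25)
```

With only two haplotypes present both D′ and r² equal 1, while with three haplotypes D′ is still 1 but r² drops to about 0.33, which is why r² = 1 is the stronger condition.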
LD Measures: r²
• r² = 1: ‘perfect’ LD
– Occurs if only two (of four possible) haplotypes are present
– Two markers provide identical information
– Stronger condition than ‘complete’ LD
• r² = 0: the two markers are in perfect equilibrium
• Sample size needed to detect association using a surrogate marker is N / r², where N is the sample size needed when the causal variant itself is genotyped
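The N / r² rule is simple arithmetic, sketched here with a hypothetical helper (`surrogate_sample_size` is an illustrative name, and rounding up to a whole subject is an assumption):

```python
from math import ceil


def surrogate_sample_size(n_direct, r2):
    """Sample size needed at a surrogate marker, given the size needed at the causal variant.

    n_direct: sample size required if the causal variant were genotyped directly
    r2:       LD (r^2) between the surrogate marker and the causal variant
    """
    if not 0 < r2 <= 1:
        raise ValueError("r2 must be in (0, 1]")
    return ceil(n_direct / r2)
```

For instance, a study needing 1000 subjects at the causal variant would need about 1250 at a marker with r² = 0.8 and 2000 at a marker with r² = 0.5.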
LD by Racial Populations
[Figure: LD plots for RRM1 in an African American population and in a White, Non-Hispanic population]
Genetic data support hypotheses that humans migrated out of Africa
• National Geographic Society’s Genographic Project, https://genographic.nationalgeographic.com/genographic/lan/en/atlas.html
Typical Steps in a GWAS
• Study Enrollment (case-control; cohort)
• Extract DNA from blood sample
• Genotype (Illumina, Affymetrix or NGS)
• QC genotypes / data
• Genotype-phenotype association analysis
– Population Stratification
– Genetic Models
– Multiple Testing Correction
– Environment, interactions
• Visualization of results
• Replication / Validation
Study Designs

Intervention studies (experiments)
1. Clinical trials
2. Pharmacogenomic studies

Observational studies
1. The simplest is the Cross-Sectional (Prevalence) design, which is conducted entirely in the present.
2. The Cohort (Prospective) design measures exposure in the present and the phenotype in the future.
3. The Case-Control (Retrospective) design measures the phenotype in the present and looks backwards for exposure history.
Study Designs in Genetic Epidemiology
Case‐Control Studies
Case-control studies are used to identify factors that may contribute to a medical condition by comparing subjects who have that condition (the ‘cases’) with subjects who do not have the condition but are otherwise similar (the ‘controls’).
Case-control studies are retrospective and non-randomized.
Case‐Control selection
Cases and controls should be sampled from the same homogeneous population
– Confounders: age, sex, ethnic background, ...

Population-based cases: include all subjects, or a random sample of all subjects, with the disease at a single point or during a given period of time in the defined population.
Hospital-based cases: all patients in a hospital department at a given time.
Selection of Controls
• Study base: controls can be used to characterise the distribution of exposure
• Comparable accuracy: equal reliability in the information obtained from cases and controls (to avoid systematic misclassification)
• Overcome confounding: elimination of confounding through control selection (matching or stratified sampling)
Comparison of Study Designs
Cohort study
• Rare exposure
• Examine multiple effects of a single exposure
• Minimizes bias in exposure determination
• Direct measurements of incidence of the disease

Case-control study
• Quick, inexpensive
• Well-suited to the evaluation of diseases with long latency period
• Rare diseases
• Examine multiple etiologic factors for a single disease
Comparison of Study Designs
Cohort study
• Not rare diseases
• Prospective: expensive and time consuming
• Retrospective: inadequate records
• Validity can be affected by losses to follow-up

Case-control study
• Not rare exposure
• Incidence rates cannot be estimated unless the study is population based
• Retrospective, non-randomized nature limits the conclusions that can be drawn from them
Data Structure
individual  affection  gender  SNP 1  SNP 2  …  SNP n
1           1          F       2      1      …  2
2           1          M       2      2      …  1
3           0          F       1      2      …  2
4           1          F       1      1      …  2
5           0          M       0      -9     …  1

individual = sample id; affection = case/control; SNP 1 … SNP n = genotypes
Allele and Genotype frequencies
Allele distribution (counts alleles: 2*N observations)

            T     A     Total
Cases                   2R
Controls                2S
Total                   2N

Genotype distribution (counts genotypes: N observations)

            T|T   T|A   A|A   Total
Cases                         R
Controls                      S
Total                         N
Hardy-Weinberg Equilibrium (HWE)
Predicts constant genotype frequencies within a large, randomly mating population. If these rules hold, then the locus is in HWE.

Let p = minor allele frequency (MAF)
Let 1 − p = major allele frequency
Minor/Rare allele = a
Major/Common allele = A

Genotype     P(Genotype)
AA           (1 − p)(1 − p) = (1 − p)²
aA or Aa     p(1 − p) + (1 − p)p = 2p(1 − p)
aa           p · p = p²
Hardy-Weinberg Equilibrium (HWE)
• Explanations for deviation from HWE
– Non-random mating (i.e., population stratification)
– Genotyping errors
– Preferential selection
• What if a marker is not in HWE?
– Use tests for association that do not depend on HWE
Hardy-Weinberg Equilibrium (HWE)
Tests for HWE:
• Goodness-of-fit statistic (Chi-square test)
– d.f. = number of distinct heterozygotes = k(k−1)/2 for a marker with k alleles
– With small n, p-values may not be accurate
• Exact test
– Based on the likelihood of the genotypes under HWE, conditioned on the observed allele frequencies
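For a biallelic SNP (k = 2, so 1 d.f.), the goodness-of-fit test can be sketched with the standard library alone. The function name and the example counts are illustrative assumptions; the p-value uses the closed-form tail of a 1-d.f. chi-square distribution.

```python
from math import erfc, sqrt


def hwe_chi2_test(n_AA, n_Aa, n_aa):
    """Chi-square goodness-of-fit test for HWE at a biallelic locus (1 d.f.).

    Returns (statistic, p_value). Expected counts come from the observed
    allele frequency, as in the HWE genotype-probability formulas.
    """
    n = n_AA + n_Aa + n_aa
    p = (2 * n_aa + n_Aa) / (2 * n)          # observed frequency of allele a
    if p in (0.0, 1.0):
        raise ValueError("monomorphic marker: HWE test is undefined")

    expected = ((1 - p) ** 2 * n, 2 * p * (1 - p) * n, p ** 2 * n)
    observed = (n_AA, n_Aa, n_aa)
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

    # Upper tail of chi-square with 1 d.f.: P(X > x) = erfc(sqrt(x / 2))
    return stat, erfc(sqrt(stat / 2))


# Hypothetical counts: the first set matches HWE proportions, the second does not
in_hwe = hwe_chi2_test(64, 32, 4)
out_of_hwe = hwe_chi2_test(50, 20, 30)
```

The second example has far fewer heterozygotes than expected, so its p-value falls well below the 10⁻⁶ threshold commonly used in GWAS quality control. For small samples the exact test is preferable, as noted above.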
HWE vs. Linkage Disequilibrium
[Figure: diagram of chromosome pairs carrying alleles at loci A, B, and D, contrasting HWE with LD]
Quality Control of SNPs
• Exclude SNPs that fail the Hardy-Weinberg test
– Observed genotype proportions are not consistent with those expected from the observed allele frequency
– HWE p-value < 10⁻⁶ (for GWAS)
• Exclude SNPs with genotyping success rate < 95%
• Exclude SNPs with differential missingness in cases and controls
Quality Control of Samples
• Poor quality samples
– Sample genotype success rate < 95 to 97.5%
– Greater proportion of heterozygous genotypes than expected
• Related individuals (if independent samples)
– Based on pair-wise comparisons of similarity of genotypes (IBD estimation)
• Samples with misspecified gender
Genetic Models
• T = Minor Allele
• C = Major Allele

Genotype             Co-dominant / General   Additive   Dominant   Recessive
TT (homozygous)      0  0                    -1 (2)     0          0
TC (heterozygous)    1  0                     0 (1)     1          0
CC (wildtype)        1  1                     1 (0)     1          1
Degrees of freedom   2                        1         1          1

Lettre, Lange, Hirschhorn (2007) Genetic model testing and statistical power in population-based association studies of quantitative traits, Genetic Epidemiology
Statistical Models for GWAS
• Depending on the study design, more complicated models can be used
– Gene-environment, repeated measures, non-parametric methods
• Time to Event Outcome = Cox Proportional Hazards Models
• Binary (case/control) = Logistic Regression Models; Chi-Square Tests
• Quantitative = Linear Regression Models
Single Locus Analysis

Trait     C/C (wildtype)   C/T (heterozygous)   T/T (homozygous)   Missing   Not Missing
Control   387 (72%)        136 (25%)            18 (3%)            12        541
Case      304 (64%)        152 (32%)            21 (4%)            5         477

P-values by test
2 df Chisq-test       0.029
Fisher’s Exact test   0.029
Armitage trend test   0.011
Allelic test          0.009

P-values by genetic model
Dominant         0.006
Recessive        0.374
Additive         0.011
Unstructured:1   0.012
Unstructured:2   0.231
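Three of the test p-values above can be recomputed from the genotype counts alone. The sketch below uses textbook formulas for the Pearson chi-square, the allelic test, and the Cochran-Armitage trend test with weights 0, 1, 2; the function names are illustrative, and whether the original analysis used exactly these variants is an assumption. Fisher's exact test and the model-based p-values are not reproduced.

```python
from math import erfc, exp, sqrt

# Genotype counts from the table, ordered C/C, C/T, T/T
controls = [387, 136, 18]
cases = [304, 152, 21]


def p_chi2_1df(x):
    """Upper-tail p-value for a chi-square statistic with 1 d.f."""
    return erfc(sqrt(x / 2.0))


def p_chi2_2df(x):
    """Upper-tail p-value for a chi-square statistic with 2 d.f."""
    return exp(-x / 2.0)


def genotype_chi2(cases, controls):
    """Pearson chi-square on the 2 x 3 genotype table (2 d.f.)."""
    n_cases, n_controls = sum(cases), sum(controls)
    n = n_cases + n_controls
    stat = 0.0
    for r, s in zip(cases, controls):
        column_total = r + s
        for observed, row_total in ((r, n_cases), (s, n_controls)):
            expected = row_total * column_total / n
            stat += (observed - expected) ** 2 / expected
    return stat


def allelic_chi2(cases, controls):
    """Pearson chi-square on the 2 x 2 allele-count table (1 d.f.)."""
    def allele_counts(g):
        return 2 * g[0] + g[1], g[1] + 2 * g[2]   # (C alleles, T alleles)
    a, b = allele_counts(cases)
    c, d = allele_counts(controls)
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def trend_chi2(cases, controls, weights=(0, 1, 2)):
    """Cochran-Armitage trend statistic (1 d.f.), weights = copies of T."""
    n_cases, n_controls = sum(cases), sum(controls)
    n = n_cases + n_controls
    totals = [r + s for r, s in zip(cases, controls)]
    sum_w_cases = sum(w * r for w, r in zip(weights, cases))
    sum_w_total = sum(w * t for w, t in zip(weights, totals))
    sum_w2_total = sum(w * w * t for w, t in zip(weights, totals))
    numerator = n * (n * sum_w_cases - n_cases * sum_w_total) ** 2
    denominator = n_cases * n_controls * (n * sum_w2_total - sum_w_total ** 2)
    return numerator / denominator


p_genotype = p_chi2_2df(genotype_chi2(cases, controls))
p_allelic = p_chi2_1df(allelic_chi2(cases, controls))
p_trend = p_chi2_1df(trend_chi2(cases, controls))
```

To three decimals these agree with the tabulated 0.029 (2 df chi-square), 0.009 (allelic), and 0.011 (trend).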
Haplotypes
• Combination of alleles at adjacent loci on a single (haploid) chromosome
• Some sections of DNA are preserved across generations (LD)
• These sections make up haplotype blocks
• Association analyses that focus on haplotypes use LD between markers & causative loci

A1 B1 D2
A2 B2 D1
Haplotypes: Why study?
LD Mapping
• LD mapping can be used to find the causative locus (mutation) for a Mendelian disease.
– Takes advantage of the fact that in the region of a causative locus, haplotypes of diseased subjects share more ancestry than haplotypes of unaffected subjects

Ardlie, Nature Reviews Genetics 3:299-309, 2002
Determining Haplotypes
No heterozygous sites:
Marker   Allele 1   Allele 2
SNP 1    A          A
SNP 2    G          G
SNP 3    G          G
Haplotype Pair 1: A-G-G / A-G-G

One heterozygous site:
Marker   Allele 1   Allele 2
SNP 1    A          T
SNP 2    G          G
SNP 3    G          G
Haplotype Pair 1: A-G-G / T-G-G

Two heterozygous sites:
Marker   Allele 1   Allele 2
SNP 1    A          T
SNP 2    G          C
SNP 3    G          G
Haplotype Pair 1: A-G-G / T-C-G
Haplotype Pair 2: A-C-G / T-G-G
Determining Haplotypes
• Can be directly observed if at most 1 site is heterozygous.
• If H = # heterozygous sites > 1, then 2^(H−1) haplotype pairs are consistent with the observed data.
• H = 20: over 1 million (2^20) possible haplotypes, forming 2^19 = 524,288 possible haplotype pairs
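The count of 2^(H−1) can be checked by enumerating the phase configurations for an unphased genotype. This is a minimal sketch; `haplotype_pairs` and its input format (one unordered allele pair per SNP) are illustrative assumptions.

```python
from itertools import product


def haplotype_pairs(genotype):
    """List every haplotype pair consistent with an unphased genotype.

    genotype: list of (allele1, allele2) tuples, one per SNP, phase unknown.
    The orientation of the first heterozygous site is held fixed so that
    (h1, h2) and (h2, h1) are not counted twice.
    """
    het_sites = [i for i, (a, b) in enumerate(genotype) if a != b]
    free_sites = het_sites[1:]          # sites whose phase can still be flipped

    pairs = []
    for flips in product([False, True], repeat=len(free_sites)):
        flipped = {site for site, flip in zip(free_sites, flips) if flip}
        h1 = "".join(b if i in flipped else a for i, (a, b) in enumerate(genotype))
        h2 = "".join(a if i in flipped else b for i, (a, b) in enumerate(genotype))
        pairs.append((h1, h2))
    return pairs


# Genotype with two heterozygous sites: SNP 1 = A/T, SNP 2 = G/C, SNP 3 = G/G
example = haplotype_pairs([("A", "T"), ("G", "C"), ("G", "G")])
```

For the genotype A/T, G/C, G/G this yields the two pairs A-G-G / T-C-G and A-C-G / T-G-G. Exhaustive enumeration grows exponentially in H, which is why statistical phasing is used in practice.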
Haplotype Estimation
• Pedigree analysis
– Genotyping family members allows phase determination
• Molecular haplotyping
– Limited to short DNA sequences
• Likelihood approach
– EM algorithm is used to estimate haplotype frequencies when phase is ambiguous
– Bayesian estimation using PHASE or fastPHASE
Haplotype Estimation
Posterior probabilities of haplotype pairs:

Subject   Hap1   Hap2   Posterior: Pr(H1,H2|G)
1         1      4      1.00
2         2      3      0.75
2         1      4      0.25
3         1      1      1.00
4         3      3      1.00

Haplotype counts per subject, weighted by the posterior probabilities:

Subject   Hap1   Hap2   Hap3   Hap4
1         1      0      0      1
2         0.25   0.75   0.75   0.25
3         2      0      0      0
4         0      0      2      0
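The second table follows from the first by summing, for each subject, the posterior probability of every pair that contains a given haplotype. A short sketch, with the function name and tuple layout chosen for illustration:

```python
from collections import defaultdict

# (subject, hap1, hap2, posterior probability), taken from the first table
posterior_rows = [
    (1, 1, 4, 1.00),
    (2, 2, 3, 0.75),
    (2, 1, 4, 0.25),
    (3, 1, 1, 1.00),
    (4, 3, 3, 1.00),
]


def expected_haplotype_counts(rows, n_haplotypes):
    """Posterior-weighted count of each haplotype for each subject."""
    counts = defaultdict(lambda: [0.0] * n_haplotypes)
    for subject, hap1, hap2, prob in rows:
        # Each pair contributes its probability once per haplotype copy,
        # so a homozygous pair such as (1, 1) adds 2 * prob to haplotype 1.
        counts[subject][hap1 - 1] += prob
        counts[subject][hap2 - 1] += prob
    return dict(counts)


design = expected_haplotype_counts(posterior_rows, n_haplotypes=4)
```

Each row of `design` sums to 2, and the rows can be used as covariates in a haplotype association model.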
Power and Sample Size Calculations
Parameters needed for genetic association studies
– Desired power (1 − β)
– Sample size (N)
– Type I error level (α)
   Depends on 1- or 2-sided test
   For GWA with a large number of hypotheses, adjustments need to be considered (discussed in detail tomorrow).
– Probability of exposure
   Allele frequency of variant allele
   Genotype frequency
– Effect size: GRR (genotype relative risk)
Factors affecting power
• LD between the disease allele and marker allele
• Difference between disease allele and marker allele frequencies
• Phenotype misclassification error rates
– Miscoding (affected coded as unaffected, and vice versa)
– Locus heterogeneity: mutations in different genes leading to clinically indistinguishable phenotypes
– Using a surrogate that is not 100% predictive, e.g., clinical dementia to diagnose Alzheimer’s disease
• Genotype misclassification error rates
– Genotype calls include some errors

Recommendation: Specify higher power values when computing sample size requirements.
Power Table example
Minimal detectable ORs for various minor allele frequencies and r²; power = 80%, alpha = 0.05, number of controls = 982.

MAF    r²     Main Group   Subgroup 1   Subgroup 2   Subgroup 3
              (n=1172)     (n=278)      (n=140)      (n=365)
0.10   1.00   1.31         1.48         1.63         1.43
       0.80   1.34         1.54         1.72         1.49
0.20   1.00   1.23         1.36         1.49         1.33
       0.80   1.25         1.41         1.55         1.37
0.40   1.00   1.19         1.31         1.42         1.27
       0.80   1.21         1.35         1.48         1.31
Population Stratification
• Result of confounding, if one or more ethnic subgroups within a population have both:
– higher prevalence of an allele
– higher risk of disease
• Leads to spurious associations between marker and disease.
• Due to:
– Sampling without regard to ethnic background
– Recently admixed populations
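A small numerical illustration of this confounding, using made-up allele counts: within each subpopulation the allele is unrelated to disease (odds ratio 1), yet pooling the two subpopulations produces an apparent association. All counts and names below are hypothetical.

```python
def odds_ratio(case_A, case_a, control_A, control_a):
    """Allelic odds ratio for allele A versus allele a."""
    return (case_A * control_a) / (case_a * control_A)


# Population 1: allele A is common (80%) and disease is common (more cases sampled)
pop1 = {"case_A": 160, "case_a": 40, "control_A": 80, "control_a": 20}
# Population 2: allele A is rare (20%) and disease is rare (fewer cases sampled)
pop2 = {"case_A": 20, "case_a": 80, "control_A": 40, "control_a": 160}

# Ignoring ancestry: add the two tables cell by cell
pooled = {cell: pop1[cell] + pop2[cell] for cell in pop1}

or_pop1 = odds_ratio(**pop1)      # no association within population 1
or_pop2 = odds_ratio(**pop2)      # no association within population 2
or_pooled = odds_ratio(**pooled)  # spurious association after pooling
```

Here the within-population odds ratios are both 1, while the pooled odds ratio is 2.25, driven entirely by the differences in allele frequency and disease risk between the two groups.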
Population Stratification
[Figure: genotypes (AA, AB, BB) of cases and controls drawn from Population 1 and Population 2]
Population Stratification
• Population stratification can be a problem for association studies: an association found could be due to the underlying structure of the population and not to a disease-associated locus.
• Need to adjust for population sub-structure in the analysis to prevent spurious associations.
• Should not rely completely on self-reported race and ethnicity
– With GWAS data, ancestry can be determined from the genotypes.
STRUCTURE Plots
Race coding:
1 = NA (missing race)
2 = American Indian, Alaskan Native
3 = Asian
4 = AA
5 = Unknown
6 = White

[Figure annotation: subject removed from analysis]

*Seeded with LCLs of CA, AA and HCA ancestry
Principal Components
• Components are linear combinations of genotypes
• Each PC captures as much of the remaining variability in the genotypes as possible; each individual receives a score on each PC
• If population stratification is present, axes of variation may have geographic interpretation
• Use PCs as covariates in the phenotype vs. genotype model.
Principal Components
Genotype Imputation
• Uses LD structure (from data and/or reference data) to infer unknown genotypes
– 1KGP (all racial groups)
• Software: MACH, BEAGLE, IMPUTE2, fastPHASE
• Remember to QC imputed data
• Use either expected genotype (“dosage”) or posterior probabilities of genotype class in association analysis
– Do not use most likely genotype
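The difference between the dosage and the most likely genotype can be seen in a few lines. This is a sketch under the usual convention that genotypes are coded as 0, 1, or 2 copies of the counted allele; the function names are illustrative and not tied to any of the imputation packages listed.

```python
def expected_dosage(p0, p1, p2):
    """Expected allele count given posterior probabilities of 0, 1, 2 copies."""
    if abs(p0 + p1 + p2 - 1.0) > 1e-6:
        raise ValueError("genotype probabilities must sum to 1")
    return p1 + 2.0 * p2


def most_likely_genotype(p0, p1, p2):
    """Best-guess genotype: discards the imputation uncertainty."""
    probs = (p0, p1, p2)
    return max(range(3), key=lambda g: probs[g])


# A confidently imputed genotype and an uncertain one (hypothetical probabilities)
confident = (0.01, 0.04, 0.95)
uncertain = (0.45, 0.50, 0.05)
```

For the uncertain call the best guess is 1 copy, but the dosage is only 0.6; using the dosage in the association model carries that uncertainty forward rather than treating the call as observed.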
Observed Genotypes
Determine Haplotype
Impute Missing Genotypes
Refining Region with Imputation
• Genotype imputation with 1KGP to refine region/signal
• Confirm imputation results with genotyping
Genome-wide Association Study (GWAS): AROMATASE INHIBITOR (AI) PHARMACOGENOMICS
Breast cancer treatment
• 3 AIs (anastrozole, letrozole and exemestane) are commonly used in treatment of breast cancer.
• However, response varies between women.
• Goal of study is to determine genetic predictors of response to AI treatment.
Genotyping
• 835 women recruited from:
– MCR (23.5%), MCA (11.6%), MCF (5.3%)
– MD Anderson (43.5%)
– Memorial Sloan-Kettering (16.1%)
• Genotyped on Illumina 610 SNP Array (after QC, 563,945 SNPs for analysis)
• Measured blood drug levels and pre- and post- hormone levels, breast density, and BMD
Summary Baseline Hormone
• N = 776 subjects with baseline hormone levels
• Highly skewed distributions

Hormone           N     Mean    SD       Min   Median   Max
Estrone           765   21.6    13.26    0     18.9     111
Estradiol         774   5.6     6.91     0     4.0      111
Estrone-C         752   315.1   310.21   4.1   236.5    3320
Androstenedione   774   461.4   251.68   0     422.5    2470
Testosterone      774   171.8   128.38   0     145.5    1720
Assessment of Clinical Covariates
• One person has missing BMI and thus was removed from all analyses
– BMI is an important variable that must be accounted for in the analysis.
• For the GWAS of the baseline hormone levels, we included, for each phenotype, additional covariates if the p-value < 0.01, after inclusion of BMI and study site.
Single SNP Analysis
• Genotype coded in terms of number of minor alleles (additive model)
• Adjustment for BMI, age, site, race, 6 eigenvectors and covariates (p < 0.01).
• Due to the skewness of the data, we used a quasi-likelihood model
Estradiol: Ch8 Region
Multiple Locus Analysis
• Multiple markers within a single gene or region
– PCA
– LASSO and Shrinkage Methods
• Multiple genes that are:
– Within the same pathway
– Biologically related
• Effect is apparent only with an interaction
– Gene-Gene (epistasis)
– Gene-Environment
– (Gene-Drug; Pharmacogenomics)
Epistasis
Definition: the interaction between two or more genes to control a single phenotype
Example: In Labradors,
• dogs with a BB or Bb genotype are black
• dogs with a bb genotype are chocolate
• that is, unless they have the ee genotype at a second locus
• regardless of the genotype at the B locus, ee dogs are yellow
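The Labrador rules can be written as a small lookup, which makes the masking effect of the second locus explicit. The function name and the genotype strings are illustrative.

```python
def coat_color(b_genotype, e_genotype):
    """Coat color from genotypes at the B locus and the E locus.

    b_genotype: "BB", "Bb", or "bb"
    e_genotype: "EE", "Ee", or "ee"
    """
    # The ee genotype masks the B locus entirely (epistasis)
    if e_genotype == "ee":
        return "yellow"
    # Otherwise a single B allele is enough for black
    if "B" in b_genotype:
        return "black"
    return "chocolate"
```

The effect of the B locus on coat color is visible only among dogs that are not ee, so a single-locus test at B that ignores the E locus would see a diluted effect.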
Pharmacogenomics: Genetic vs. PGx effect
• Example: Study to determine genetic effects related to weight gain in depressed patients treated with SSRIs.
– Genotype only depressed patients on an SSRI.
– Observe an association between a gene and change in weight of patients.
• Question: Is it a genetic effect or PGx effect?
– Variant related to obesity in general?
– Variant related to obesity when treated with an SSRI?
PGx Effect = Gene*Drug Interaction
• Study involving multiple treatment arms
• Able to assess a “truly” PGx effect

Example: Weight Gain PGx study
• Outcome: weight change for patient
• Drug indicator: 1 if patient on drug, 0 if patient on placebo
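One common way to express a gene-by-drug interaction for this example is a linear model with an interaction term. The symbols below (Y_i, G_i, D_i and the β coefficients) are illustrative assumptions, not notation recovered from the slide.

```latex
Y_i = \beta_0 + \beta_1 G_i + \beta_2 D_i + \beta_3 \,(G_i \times D_i) + \varepsilon_i
```

Here Y_i is the weight change for patient i, D_i is the drug indicator (1 on drug, 0 on placebo), and G_i is the genotype, for example the number of minor alleles. Under this sketch, β_1 is the genetic effect present regardless of treatment, β_2 is the drug effect, and β_3 is the PGx effect: the amount by which the genetic effect differs between drug and placebo. Estimating β_3 requires patients in both arms, which is why a study of treated patients alone cannot separate a genetic effect from a PGx effect.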