Statistical Genomics and Bioinformatics Workshop 8/16/2013

Statistical Genomics and Bioinformatics Workshop: Genetic Association and RNA-Seq Studies

Population Genetics and Genome-wide Genetic Association Studies (GWAS)
Brooke L. Fridley, PhD
University of Kansas Medical Center


Study of Genetic Association

Cases

Controls

Genetic association studies compare the frequency of genetic variants between cases and controls to determine whether specific variants are associated with disease.

Genetic Analysis Strategies

[Figure: analysis strategies (linkage, GWAS association, rare variant analysis) arranged by effect size and allele frequency]

Ardlie, Kruglyak & Seielstad (2002) Nature Reviews Genetics
Zondervan & Cardon (2004) Nature Reviews Genetics

Genetics of Complex Traits

• Multiple genes / variants
  – Common and rare variants
  – Interactions, pathways
• Environment
  – Gene-environment interaction

In reality, much more complex!


NHGRI GWAS Catalog (8/11/2013) http://www.genome.gov/gwastudies/


Population Genetics


Recombination

Before meiosis:
  A1 B1 D1
  A2 B2 D2

Crossovers occur during meiosis:
  A1 B1 D2
  A2 B2 D1

After recombination:
  A1 B1 D2
  A2 B2 D1

Linkage Disequilibrium (LD)

• Particular alleles at neighboring loci tend to be co-inherited.
• For tightly linked loci, this co-inheritance might lead to associations between alleles in the population.
• LD describes the situation where particular alleles at nearby loci occur together on the same chromosome more often than expected by chance.

Example haplotypes:
  A1 B1 D2
  A2 B2 D1

Linkage of a Marker with a Disease Locus

[Figure: pedigree in which each family member is labeled with marker alleles A and a]

Use of LD for Genetic Association Studies

• SNPs are markers; many have no function
• Indirect association
  – LD between SNP marker and causal variant

LD measures
• LD varies by race
  – Recent populations have higher LD (less recombination)
  – African populations are more genetically diverse than European populations
• Pairwise measure
• $D$ = observed − expected haplotype frequency
• $D'$ ($-1 \le D' \le 1$)
  – Standardized: $D' = D / D_{\max}$, where $D_{\max}$ is the maximum possible value of $D$
  – Related to recombination history; $|D'| = 1$ means no recombination
• $r^2$ ($0 \le r^2 \le 1$)
  – Squared correlation coefficient
  – Less sensitive to low MAFs
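The three measures above can be computed from one haplotype frequency and two allele frequencies. The following is a minimal sketch, assuming two biallelic, polymorphic loci; the function name `ld_measures` and its argument names are illustrative and not from the slides.

```python
def ld_measures(p_ab, p_a, p_b):
    """Pairwise LD between two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A (locus 1) and B (locus 2)
    p_a, p_b: frequencies of allele A and allele B (both assumed strictly between 0 and 1)
    """
    d = p_ab - p_a * p_b  # observed minus expected haplotype frequency
    # The largest |D| attainable depends on the sign of D and the allele frequencies
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return {"D": d, "D_prime": d_prime, "r2": r2}


# Hypothetical frequencies: allele A always travels with B, but B is more common
print(ld_measures(p_ab=0.3, p_a=0.3, p_b=0.5))
```

With these hypothetical inputs D′ reaches 1 while r² stays well below 1, which illustrates why r² = 1 is a stronger condition than D′ = 1.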


LD Measures: $r^2$

• $r^2 = 1$: ‘perfect’ LD
  – Occurs if only two (of four possible) haplotypes are present
  – Two markers provide identical information
  – Stronger condition than ‘complete’ LD
• $r^2 = 0$: the two markers are in perfect equilibrium
• Sample size needed to detect association using a surrogate marker is $N/r^2$, where $N$ is the sample size needed if the causal variant itself were genotyped

LD by Racial Populations

[Figure: LD plots for RRM1 in an African American population and in a White, non-Hispanic population]

Genetic data support hypotheses that humans migrated out of Africa

• National Geographic Society’s Genographic Project, https://genographic.nationalgeographic.com/genographic/lan/en/atlas.html

Typical Steps in a GWAS

• Study enrollment (case-control; cohort)
• Extract DNA from blood sample
• Genotype (Illumina, Affymetrix or NGS)
• QC genotypes / data
• Genotype-phenotype association analysis
  – Population stratification
  – Genetic models
  – Multiple testing correction
  – Environment, interactions
• Visualization of results
• Replication / validation

Study Designs

• Intervention studies (experiments)
  1. Clinical trials
  2. Pharmacogenomic studies
• Observational studies
  1. The simplest is the cross-sectional (prevalence) design, which is conducted entirely in the present.
  2. The cohort (prospective) design measures exposure in the present and the phenotype in the future.
  3. The case-control (retrospective) design measures the phenotype in the present and looks backward for exposure history.

Study Designs in Genetic Epidemiology


Case-Control Studies

• Case-control studies are used to identify factors that may contribute to a medical condition by comparing subjects who have that condition (the ‘cases’) with subjects who do not have the condition but are otherwise similar (the ‘controls’)
• Case-control studies are retrospective and non-randomized

Case-Control Selection

• Cases and controls should be sampled from the same homogeneous population
  – Confounders: age, sex, ethnic background, ...
• Population-based cases: include all subjects, or a random sample of all subjects, with the disease at a single point or during a given period of time in the defined population.
• Hospital-based cases: all patients in a hospital department at a given time

Selection of Controls

• Study base: Controls can be used to characterise the distribution of exposure
• Comparable accuracy: Equal reliability in the information obtained from cases and controls (to avoid systematic misclassification)
• Overcome confounding: Elimination of confounding through control selection (matching or stratified sampling)

Comparison of Study Designs

Cohort study
• Rare exposure
• Examine multiple effects of a single exposure
• Minimizes bias in exposure determination
• Direct measurements of incidence of the disease

Case-control study
• Quick, inexpensive
• Well-suited to the evaluation of diseases with long latency period
• Rare diseases
• Examine multiple etiologic factors for a single disease

Comparison of Study Designs

Cohort study
• Not rare diseases
• Prospective: expensive and time consuming
• Retrospective: inadequate records
• Validity can be affected by losses to follow-up

Case-control study
• Not rare exposure
• Incidence rates cannot be estimated unless the study is population based
• Retrospective, non-randomized nature limits the conclusions that can be drawn

Data Structure

individual   affection   gender   SNP 1   SNP 2   …   SNP n
1            1           F        2       1       …   2
2            1           M        2       2       …   1
3            0           F        1       2       …   2
4            1           F        1       1       …   2
5            0           M        0       -9      …   1

individual = sample id; affection = case/control; SNP 1 … SNP n = genotypes

Allele and Genotype frequencies

Allele distribution
           T     A     Total
Cases                  2R
Controls               2S
Total                  2N

Genotype distribution
           T|T   T|A   A|A   Total
Cases                        R
Controls                     S
Total                        N

Counting alleles gives 2*N observations; counting genotypes gives N observations.

Hardy-Weinberg Equilibrium (HWE)

• Predicts constant genotype frequencies within a large, randomly mating population.
• If these conditions hold, then the locus is in HWE
• Let $p$ = minor allele frequency (MAF)
  – Minor/rare allele = a
  – Major/common allele = A
• Let $1 - p$ = major allele frequency

Genotype     P(Genotype)
AA           $(1-p)(1-p) = (1-p)^2$
aA or Aa     $p(1-p) + (1-p)p = 2p(1-p)$
aa           $p \cdot p = p^2$

Hardy-Weinberg Equilibrium (HWE)

• Explanations for deviation from HWE
  – Non-random mating (e.g., population stratification)
  – Genotyping errors
  – Preferential selection
• What if a marker is not in HWE?
  – Use tests for association that do not depend on HWE

Hardy-Weinberg Equilibrium (HWE)

Tests for HWE:
• Goodness-of-fit statistic (chi-square test)
  – d.f. = # distinct heterozygotes = $k(k-1)/2$ for a locus with $k$ alleles
  – With small n, p-values may not be accurate.
• Likelihood exact test
  – Based on likelihood of genotypes under HWE, conditioned on observed allele frequencies
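For a biallelic SNP (k = 2, so 1 d.f.), the goodness-of-fit test can be sketched as below. This is an illustration rather than the workshop's own code; the function name `hwe_chisq` is made up here, the locus is assumed to be polymorphic, and the 1 d.f. chi-square tail is computed with the standard-library identity P(X > x) = erfc(sqrt(x/2)).

```python
import math


def hwe_chisq(n_AA, n_Aa, n_aa):
    """Chi-square goodness-of-fit test for HWE at a biallelic, polymorphic SNP.

    Returns (chi-square statistic, p-value on 1 d.f.).
    """
    n = n_AA + n_Aa + n_aa
    p = (2 * n_aa + n_Aa) / (2 * n)  # frequency of allele a
    expected = [(1 - p) ** 2 * n, 2 * p * (1 - p) * n, p ** 2 * n]
    observed = [n_AA, n_Aa, n_aa]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square, 1 d.f.
    return chi2, p_value


# Hypothetical counts that match HWE exactly for p = 0.4
print(hwe_chisq(36, 48, 16))
```

As the slide notes, the chi-square approximation may be inaccurate for small counts, where an exact test is preferable.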


HWE vs. LD

[Figure: example haplotypes over alleles A/a, B/b, and D/d contrasting HWE with LD]

Quality Control of SNPs

• Exclude SNPs that fail the Hardy-Weinberg test
  – Observed genotype proportions are not consistent with those expected from the observed allele frequencies
  – HWE p-value < $10^{-6}$ (for GWAS)
• Exclude SNPs with genotyping success rate < 95%
• Exclude SNPs with differential missingness in cases and controls

Quality Control of Samples

• Poor quality samples
  – Sample genotype success rate < 95 to 97.5%
  – Greater proportion of heterozygous genotypes than expected
• Related individuals (if independent samples are assumed)
  – Based on pair-wise comparisons of similarity of genotypes (IBD estimation)
• Samples with misspecified gender

Genetic Models
• T = minor allele, C = major allele

Genotype             Co-dominant / General   Additive   Dominant   Recessive
TT (homozygous)      0  0                    -1 (2)     0          0
TC (heterozygous)    1  0                     0 (1)     1          0
CC (wildtype)        1  1                     1 (0)     1          1
Degrees of freedom   2                        1         1          1

Lettre, Lange, Hirschhorn (2007) Genetic model testing and statistical power in population-based association studies of quantitative traits, Genetic Epidemiology
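A genotype can be turned into model-specific covariates with a small helper. The sketch below is an assumption-laden illustration: it codes every model on the minor-allele count (the parenthesized additive values in the table), whereas the slide's dominant and recessive columns are oriented on the C allele. The function name `code_genotype` is hypothetical.

```python
def code_genotype(minor_allele_count):
    """Numeric codings of one genotype, given its number of minor alleles (0, 1 or 2)."""
    if minor_allele_count not in (0, 1, 2):
        raise ValueError("minor_allele_count must be 0, 1 or 2")
    g = minor_allele_count
    return {
        "additive": g,                          # linear in allele count, 1 d.f.
        "dominant": int(g >= 1),                # carriers vs. non-carriers, 1 d.f.
        "recessive": int(g == 2),               # minor homozygotes vs. the rest, 1 d.f.
        "general": (int(g == 1), int(g == 2)),  # two indicator variables, 2 d.f.
    }


for count, label in [(0, "CC"), (1, "TC"), (2, "TT")]:
    print(label, code_genotype(count))
```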


Statistical Models for GWAS

• Depending on the study design, more complicated models can be used
  – Gene-environment, repeated measures, non-parametric methods
• Time-to-event outcome = Cox proportional hazards models
• Binary (case/control) = logistic regression models; chi-square tests
• Quantitative = linear regression models

Single Locus Analysis

Trait     C/C (wildtype)   C/T (heterozygous)   T/T (homozygous)   Missing   Not Missing
Control   387 (72%)        136 (25%)            18 (3%)            12        541
Case      304 (64%)        152 (32%)            21 (4%)            5         477

P-values
Test                   P-value      Genetic model    P-value
2 df Chi-square test   0.029        Dominant         0.006
Fisher’s exact test    0.029        Recessive        0.374
Armitage trend test    0.011        Additive         0.011
Allelic test           0.009        Unstructured:1   0.012
                                    Unstructured:2   0.231
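Two of the tests in the table can be recomputed from the genotype counts using only the standard library. This is a sketch, not the software used for the slide: the helper names are made up, and the p-values rely on the closed-form chi-square tails exp(-x/2) for 2 d.f. and erfc(sqrt(x/2)) for 1 d.f.

```python
import math


def chisq_stat(table):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


def allele_counts(genotype_counts):
    """Convert (C/C, C/T, T/T) genotype counts to (C, T) allele counts."""
    cc, ct, tt = genotype_counts
    return [2 * cc + ct, ct + 2 * tt]


# Non-missing genotype counts (C/C, C/T, T/T) from the slide
controls = [387, 136, 18]
cases = [304, 152, 21]

# Genotype test on the 2x3 table: 2 d.f., upper tail is exp(-x/2)
p_genotype = math.exp(-chisq_stat([controls, cases]) / 2)

# Allelic test on the 2x2 allele-count table: 1 d.f., upper tail is erfc(sqrt(x/2))
x_allelic = chisq_stat([allele_counts(controls), allele_counts(cases)])
p_allelic = math.erfc(math.sqrt(x_allelic / 2))

print(f"2 df genotype test: p = {p_genotype:.3f}")
print(f"allelic test:       p = {p_allelic:.3f}")
```

Both values should agree with the slide's 0.029 and 0.009 to three decimals.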


Haplotypes

• Combination of alleles at adjacent loci on a single (haploid) chromosome
• Some sections of DNA are preserved across generations (LD)
• These sections make up haplotype blocks
• Association analyses that focus on haplotypes use LD between markers and causative loci

Example haplotypes:
  A1 B1 D2
  A2 B2 D1

Haplotypes: Why study?

[Figure: LD mapping diagram with the mutation marked]

LD Mapping
• LD mapping can be used to find the causative locus for a Mendelian disease.
  – Takes advantage of the fact that, in the region of a causative locus, haplotypes of diseased subjects share more ancestry than haplotypes of unaffected subjects

Ardlie et al., Nature Reviews Genetics 3:299-309, 2002

Determining Haplotypes

Example 1 (no heterozygous sites)
Marker   Allele 1   Allele 2
SNP 1    A          A
SNP 2    G          G
SNP 3    G          G
Haplotype pair 1: A-G-G / A-G-G

Example 2 (one heterozygous site)
Marker   Allele 1   Allele 2
SNP 1    A          T
SNP 2    G          G
SNP 3    G          G
Haplotype pair 1: A-G-G / T-G-G

Example 3 (two heterozygous sites)
Marker   Allele 1   Allele 2
SNP 1    A          T
SNP 2    G          C
SNP 3    G          G
Haplotype pair 1: A-G-G / T-C-G
Haplotype pair 2: A-C-G / T-G-G

Determining Haplotypes

• Can be directly observed if there is at most 1 heterozygous site.

• If H = # heterozygous sites > 1, then $2^{H-1}$ haplotype pairs are consistent with the observed data.

• H = 20: $2^{19}$ = 524,288 possible haplotype pairs, drawn from $2^{20}$ (over 1 million) possible haplotypes
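The counting rule can be checked by enumerating the phase assignments directly. The sketch below assumes unphased genotypes are given as one allele pair per SNP; `haplotype_pairs` is an illustrative name, not from the slides.

```python
from itertools import product


def haplotype_pairs(genotype):
    """List the distinct haplotype pairs consistent with an unphased genotype.

    genotype: list of (allele1, allele2) tuples, one per SNP.
    """
    het_sites = [i for i, (a, b) in enumerate(genotype) if a != b]
    pairs = set()
    # Keep the phase of the first heterozygous site fixed so each pair is counted once,
    # then try both phases at every remaining heterozygous site.
    for flips in product([False, True], repeat=max(len(het_sites) - 1, 0)):
        flip_at = dict(zip(het_sites[1:], flips))
        hap1, hap2 = [], []
        for i, (a, b) in enumerate(genotype):
            if flip_at.get(i, False):
                a, b = b, a
            hap1.append(a)
            hap2.append(b)
        pairs.add(("-".join(hap1), "-".join(hap2)))
    return sorted(pairs)


# Genotype with two heterozygous sites (SNP 1 = A/T, SNP 2 = G/C, SNP 3 = G/G)
for pair in haplotype_pairs([("A", "T"), ("G", "C"), ("G", "G")]):
    print(pair)
```

Enumeration is only practical for small H; statistical phasing, described next, avoids listing all pairs.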


Haplotype Estimation

• Pedigree analysis
  – Genotyping family members allows phase determination
• Molecular haplotyping
  – Limited to short DNA sequences
• Likelihood approach
  – EM algorithm is used to estimate haplotype frequencies when phase is ambiguous
  – Bayesian estimation using PHASE or fastPHASE

Haplotype Estimation

Subject   Hap1   Hap2   Posterior: Pr(H1,H2|G)
1         1      4      1.00
2         2      3      0.75
2         1      4      0.25
3         1      1      1.00
4         3      3      1.00

Expected haplotype counts, weighted by the posterior probabilities:

Subject   Hap1   Hap2   Hap3   Hap4
1         1      0      0      1
2         0.25   0.75   0.75   0.25
3         2      0      0      0
4         0      0      2      0
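The second table follows from the first by adding each pair's posterior probability to the counts of both haplotypes in the pair. A minimal sketch, with an illustrative function name and the slide's four subjects as input:

```python
def expected_haplotype_counts(posteriors, n_haplotypes):
    """Posterior-weighted haplotype counts for one subject.

    posteriors: list of (hap1, hap2, probability); haplotypes are numbered from 1.
    """
    counts = [0.0] * n_haplotypes
    for hap1, hap2, prob in posteriors:
        counts[hap1 - 1] += prob
        counts[hap2 - 1] += prob  # a homozygous pair contributes twice to the same haplotype
    return counts


# Posterior haplotype pairs from the slide
subjects = {
    1: [(1, 4, 1.00)],
    2: [(2, 3, 0.75), (1, 4, 0.25)],
    3: [(1, 1, 1.00)],
    4: [(3, 3, 1.00)],
}
design = {s: expected_haplotype_counts(p, 4) for s, p in subjects.items()}
for subject, row in design.items():
    print(subject, row)
```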


Power and Sample Size Calculations

• Parameters needed for genetic association studies
  – Desired power (1 − β)
  – Sample size (N)
  – Type I error level (α)
    • Depends on 1- or 2-sided test
    • For GWA with a large number of hypotheses, adjustments need to be considered (discussed in detail tomorrow).
  – Probability of exposure
    • Allele frequency of variant allele
    • Genotype frequency
  – Effect size: GRR (genotype relative risk)

Factors affecting power

• LD between the disease allele and marker allele
• Difference between disease allele and marker allele frequencies
• Phenotype misclassification error rates
  – Miscoding (affected coded as unaffected, and vice versa)
  – Locus heterogeneity: mutations in different genes leading to clinically indistinguishable phenotypes
  – Using a surrogate that is not 100% predictive, e.g., clinical dementia to diagnose Alzheimer’s disease
• Genotype misclassification error rates
  – Genotype calls include some errors
• Recommendation: Specify higher power values when computing sample size requirements.

Power Table example

Minimal detectable ORs for various minor allele frequencies and $r^2$; power = 80%, alpha = 0.05, number of controls = 982.

MAF    $r^2$   Main Group (n=1172)   Subgroup 1 (n=278)   Subgroup 2 (n=140)   Subgroup 3 (n=365)
0.10   1.00    1.31                  1.48                 1.63                 1.43
0.10   0.80    1.34                  1.54                 1.72                 1.49
0.20   1.00    1.23                  1.36                 1.49                 1.33
0.20   0.80    1.25                  1.41                 1.55                 1.37
0.40   1.00    1.19                  1.31                 1.42                 1.27
0.40   0.80    1.21                  1.35                 1.48                 1.31

Population Stratification

• Result of confounding: occurs if one or more ethnic subgroups within a population have both
  – higher prevalence of an allele
  – higher risk of disease
• Produces spurious associations between marker and disease.
• Due to:
  – Sampling without regard to ethnic background
  – Recently admixed populations

Population Stratification

[Figure: genotypes (AA, AB, BB) of cases and controls drawn from Population 1 and Population 2]

Population Stratification

• Population stratification can be a problem for association studies: an observed association could be due to the underlying structure of the population rather than a disease-associated locus.
• Need to adjust for population sub-structure in the analysis to prevent spurious associations.
• Should not rely completely on self-reported race and ethnicity
  – With GWAS data, population structure can be determined from the genotypes.
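A toy calculation shows how stratification alone can create an association. All counts below are hypothetical and chosen so that, within each population, allele B is equally frequent in cases and controls; the populations differ only in allele frequency and in the proportion of cases. The helper `allelic_chisq` is an illustrative 1 d.f. chi-square test on allele counts.

```python
import math


def allelic_chisq(case_counts, control_counts):
    """1 d.f. chi-square test on a 2x2 table of allele counts; returns (statistic, p-value)."""
    table = [case_counts, control_counts]
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            stat += (table[i][j] - expected) ** 2 / expected
    return stat, math.erfc(math.sqrt(stat / 2))


# Hypothetical (B, A) allele counts
pop1 = {"cases": [20, 180], "controls": [80, 720]}     # B frequency 0.10, few cases
pop2 = {"cases": [400, 400], "controls": [100, 100]}   # B frequency 0.50, many cases

within1 = allelic_chisq(pop1["cases"], pop1["controls"])
within2 = allelic_chisq(pop2["cases"], pop2["controls"])

# Pooling the two populations ignores ancestry
pooled_cases = [a + b for a, b in zip(pop1["cases"], pop2["cases"])]
pooled_controls = [a + b for a, b in zip(pop1["controls"], pop2["controls"])]
pooled = allelic_chisq(pooled_cases, pooled_controls)

print("within population 1:", within1)
print("within population 2:", within2)
print("pooled:", pooled)
```

The pooled test is strongly significant even though neither population shows any association, which is the spurious signal that adjustment for population structure is meant to remove.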


STRUCTURE Plots

[Figure: STRUCTURE plots by race code, with one subject marked as removed from analysis]

Race coding: 1 = NA (missing race); 2 = American Indian, Alaskan Native; 3 = Asian; 4 = AA; 5 = Unknown; 6 = White

*Seeded with LCLs of CA, AA and HCA ancestry

Principal Components

• Components are linear combinations of genotypes
• Each PC captures as much of the remaining variability in the genotypes across individuals as possible
• If population stratification is present, axes of variation may have a geographic interpretation
• Use PCs as covariates in the phenotype vs. genotype model.

Principal Components


Genotype Imputation

• Uses LD structure (from the study data and/or reference data) to infer unknown genotypes
  – Reference data: 1KGP (all racial groups)
• Software: MACH, BEAGLE, IMPUTE2, fastPHASE
• Remember to QC imputed data
• Use either the expected genotype (“dosage”) or the posterior probabilities of genotype class in association analysis
  – Do not use the most likely genotype
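The difference between the dosage and the most likely genotype is easy to see numerically. A minimal sketch, assuming each imputed genotype comes as three posterior probabilities for 0, 1 and 2 copies of the minor allele; the function names and the example probabilities are hypothetical.

```python
def dosage(p0, p1, p2):
    """Expected minor-allele count given posterior probabilities of 0, 1 and 2 copies."""
    return 0 * p0 + 1 * p1 + 2 * p2


def most_likely_genotype(p0, p1, p2):
    """Hard call: the genotype class with the highest posterior probability."""
    probs = [p0, p1, p2]
    return probs.index(max(probs))


# Hypothetical, poorly resolved imputed genotype
probs = (0.55, 0.40, 0.05)
print("dosage:", dosage(*probs))
print("most likely genotype:", most_likely_genotype(*probs))
```

Here the hard call is 0 copies while the dosage is about half a copy; the hard call discards the imputation uncertainty that the dosage retains.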


Observed Genotypes


Determine Haplotype


Impute Missing Genotypes


Refining Region with Imputation

• Genotype imputation with 1KGP to refine region/signal
• Confirm imputation results with genotyping

Genome-wide Association Study (GWAS): AROMATASE INHIBITOR (AI) PHARMACOGENOMICS


Breast cancer treatment

• 3 AIs (anastrozole, letrozole and exemestane) are commonly used in treatment of breast cancer.
• However, response varies between women.
• Goal of study is to determine genetic predictors of response to AI treatment.

Genotyping

• 835 women recruited from:
  – MCR (23.5%), MCA (11.6%), MCF (5.3%)
  – MD Anderson (43.5%)
  – Memorial Sloan-Kettering (16.1%)
• Genotyped on Illumina 610 SNP array (563,945 SNPs for analysis after QC)
• Measured blood drug levels, pre- and post-treatment hormone levels, breast density, and BMD

Summary Baseline Hormone

• N = 776 subjects with baseline hormone levels
• Highly skewed distributions

Hormone           N     Mean    SD       Min   Median   Max
Estrone           765   21.6    13.26    0     18.9     111
Estradiol         774   5.6     6.91     0     4.0      111
Estrone-C         752   315.1   310.21   4.1   236.5    3320
Androstenedione   774   461.4   251.68   0     422.5    2470
Testosterone      774   171.8   128.38   0     145.5    1720

Assessment of Clinical Covariates

• One person had missing BMI and thus was removed from all analyses
  – BMI is an important variable that must be accounted for in the analysis.
• For the GWAS of the baseline hormone levels, we included, for each phenotype, additional covariates with p-value < 0.01 after inclusion of BMI and study site.

Single SNP Analysis

• Genotype coded in terms of number of minor alleles (additive model)
• Adjustment for BMI, age, site, race, 6 eigenvectors and covariates (p < 0.01).
• Due to the skewness of the data, we used a quasi-likelihood model

Estradiol: Ch8 Region


Multiple Locus Analysis

• Multiple markers within a single gene or region
  – PCA
  – LASSO and shrinkage methods
• Multiple genes that are:
  – Within the same pathway
  – Biologically related
• Effect is apparent only with an interaction
  – Gene-gene (epistasis)
  – Gene-environment
  – Gene-drug (pharmacogenomics)

Epistasis

Definition: the interaction between two or more genes to control a single phenotype

Example: In Labradors,
• dogs with a BB or Bb genotype are black
• dogs with a bb genotype are chocolate
• that is, unless they have the ee genotype at a second locus
• regardless of the genotype at the B locus, ee dogs are yellow

Pharmacogenomics: Genetic vs. PGx effect

• Example: Study to determine genetic effects related to weight gain in depressed patients treated with SSRIs.
  – Genotype only depressed patients on an SSRI.
  – Observe an association between a gene and change in weight of patients.
• Question: Is it a genetic effect or a PGx effect?
  – Variant related to obesity in general?
  – Variant related to obesity when treated with an SSRI?

PGx Effect = Gene*Drug Interaction

• Study involving multiple treatment arms
• Able to assess a “true” PGx effect

Example: Weight Gain PGx study

$Y = \beta_0 + \beta_1 D + \beta_2 G + \beta_3 (D \times G) + \epsilon$

  $Y$ = weight change for patient
  $D$ = 1 if patient on drug, 0 if patient on placebo
  $G$ = patient genotype coded in terms of # minor alleles
  $\beta_1$ = drug effect, $\beta_2$ = genetic effect, $\beta_3$ = PGx effect
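Because the drug indicator is binary, the least-squares estimates of the interaction model can be obtained by fitting a separate straight line in each treatment arm: the placebo-arm slope estimates the genetic effect, and the difference in slopes between arms estimates the PGx effect. The sketch below uses this shortcut with hypothetical, noise-free data; it gives point estimates only, and the names `slope_intercept` and `fit_pgx_model` are illustrative. Testing whether the PGx effect is zero would additionally require standard errors from a regression routine.

```python
def slope_intercept(x, y):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def fit_pgx_model(drug, genotype, response):
    """Point estimates for Y = b0 + b1*D + b2*G + b3*(D*G), fitted arm by arm."""
    fits = {}
    for arm in (0, 1):
        g = [gi for di, gi in zip(drug, genotype) if di == arm]
        y = [yi for di, yi in zip(drug, response) if di == arm]
        fits[arm] = slope_intercept(g, y)
    (a_placebo, s_placebo), (a_drug, s_drug) = fits[0], fits[1]
    return {
        "b0": a_placebo,           # baseline
        "b1": a_drug - a_placebo,  # drug effect
        "b2": s_placebo,           # genetic effect
        "b3": s_drug - s_placebo,  # PGx (gene*drug) effect
    }


# Hypothetical data generated from Y = 1 + 2*D + 0.5*G + 1.5*D*G with no noise
drug = [0, 0, 0, 1, 1, 1]
genotype = [0, 1, 2, 0, 1, 2]
response = [1.0, 1.5, 2.0, 3.0, 5.0, 7.0]
print(fit_pgx_model(drug, genotype, response))
```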


PGx Effect = Gene*Drug Interaction

[Figure: panels for Drug 1 and Drug 2]

Power to detect Drug*Gene Effect

• 500 cases and 500 controls, 50-50 split on two drugs, α = 0.05
  – For PGx power, marginal drug and gene effects set to 1.2
• Lower power to detect interaction effects compared to marginal effects

Tests for Gene*Drug Interaction

• Similar methods used to detect gene*environment interaction
• Regression type models ($Y$ = response, $D$ = drug indicator, $G$ = genotype)

  $Y = \beta_0 + \beta_1 D + \beta_2 G + \beta_3 (D \times G) + \epsilon$

  – $H_0$: $\beta_3 = 0$ (low power)
  – $H_0$: $\beta_2 = \beta_3 = 0$ (“omnibus” test) (Kraft et al., 2007)
• Stratified analysis by treatment group
  – Analysis of genetic effect completed within each treatment group
  – Combination of test statistics across groups
• Stepwise-pooled analysis

  $Y = \beta_0 + \beta_1 D + \beta_2 G + \epsilon$

  – Pooled model assuming the SNP effect is the same over treatments
  – $H_0$: $\beta_2 = 0$
  – Follow up top SNPs for interaction with drug.

Resampling and Empirical P-values

• Permutation/randomization test:
  – Rearrange case/control labels, keeping genotype data fixed (null distribution).
  – Calculate the test statistic (TS) for each permuted phenotype
  – Do this many times to estimate the distribution of the TS under the null hypothesis
  – Compare the observed TS to the permuted TS
• P-value is based on the number of times the permuted-data TS was more extreme than the observed TS.
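The steps above can be sketched for a single SNP as follows. The test statistic here (difference in mean minor-allele count between cases and controls) is a stand-in chosen for brevity, not the statistic used in the workshop; ties are counted as "at least as extreme", and the function names and data are hypothetical.

```python
import random


def mean_difference(genotypes, labels):
    """Test statistic: mean minor-allele count in cases minus that in controls."""
    cases = [g for g, lab in zip(genotypes, labels) if lab == 1]
    controls = [g for g, lab in zip(genotypes, labels) if lab == 0]
    return sum(cases) / len(cases) - sum(controls) / len(controls)


def permutation_p_value(genotypes, labels, n_perm=1000, seed=0):
    """Two-sided empirical p-value from shuffling case/control labels."""
    rng = random.Random(seed)
    observed = abs(mean_difference(genotypes, labels))
    shuffled = list(labels)
    n_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # rearrange labels; genotypes stay fixed
        if abs(mean_difference(genotypes, shuffled)) >= observed:
            n_extreme += 1
    return n_extreme / n_perm


# Hypothetical data: 1 = case, 0 = control; genotypes are minor-allele counts
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
genotypes = [2, 1, 2, 1, 1, 0, 1, 0, 0, 1]
print(permutation_p_value(genotypes, labels))
```

The precision of the empirical p-value is limited by the number of permutations: with 1,000 permutations it cannot resolve p-values below 0.001.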


Empirical P-value

Phenotype (observed and permuted) and SNP1 genotype:

Obs   Perm1   Perm2   ..   PermK   SNP1
1     0       1       ..   1       AA
1     1       0       ..   0       aA
…     …       …       ..   …       …
0     1       0       ..   0       AA
0     1       1       ..   1       aa

TS:
-3    -1      0.5     ..   2

[Figure: histogram of permuted TS values (axis from -4 to 4) with the observed TS marked]

Observed TS = -3
P-value = 10/500 = 0.02

Permutation Based P-values

• Take-home message:
  – Computer intensive
  – Valuable insight into underlying distribution of statistics
  – Helpful in removing spurious associations
  – Automatically built into many tools

Multiple Testing


Multiple Testing

Thousands of genes in an experiment

Thousands of hypotheses are tested

Probability of committing a type I error (false discovery) increases sharply

Multiplicity problem!

Adjustment for Multiple Testing

• A GWAS will result in both:
  – False positives (Type I error)
  – False negatives (Type II error), if correction for multiple comparisons is conservative or power is inadequate
• α = P(Type I error) and β = P(Type II error)
• Power = 1 − β
• α and β act inversely: α ↓ results in β ↑ and power ↓.

Number of tests   Expected # false positives (α = 0.05)
100               5
1,000             50
500,000           25,000
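The table is simply the number of tests times α, assuming every null hypothesis is true. The sketch below reproduces it and, under the additional assumption of independent tests, also gives the probability of at least one false positive; the function names are illustrative.

```python
def expected_false_positives(n_tests, alpha=0.05):
    """Expected number of false positives when every null hypothesis is true."""
    return n_tests * alpha


def prob_at_least_one_false_positive(n_tests, alpha=0.05):
    """Chance of one or more false positives, assuming independent tests."""
    return 1 - (1 - alpha) ** n_tests


for m in (100, 1000, 500000):
    print(m, expected_false_positives(m), prob_at_least_one_false_positive(m))
```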


Multiple testing strategies

• Perform fewer tests
  – Subset the genes to consist of those involved in some biological pathway of interest
• Be cautious in interpretation
• Ad hoc adjustment
  – Use significance level of 1% instead of 5%
• Control a different error rate which incorporates the number of tests performed
  – Family-wise error rate (FWER): probability of at least one type I error among all tests performed
  – False discovery rate (FDR): expected proportion of false discoveries out of the total significant findings

Standard control of errors

[Figure: frequency distributions of the test statistic x under H0 (no signal) and H1 (signal); a cutoff separates the false negative rate (FNR) and false positive rate (FPR) regions]

False positives, false negatives, and false discovery rate

[Figure: frequency distributions of the test statistic x under H0 (no signal) and H1 (signal), split by a cutoff into four regions]

U: true negatives, U = (1-α)*m0
V: false positives
T: false negatives
S: true positives, S = m1 - T

Multiple testing: types of errors

                 Truth about population
Decision         H0 is true     H0 is NOT true    Total
Accept H0        U              T (Type II)       m - R
Reject H0        V (Type I)     S                 R
Total            m0             m1                m

FWER: Pr(V ≥ 1)
FDR: E[V/(V+S)] = E[V/R]

Family-Wise Error Rate

• Family-wise error rate (FWER):
  – P(at least one false positive)
• Single-step FWER methods: same correction made to all tests (e.g., Bonferroni)
  – GWAS standard for significance: $p < 5 \times 10^{-8}$ for α = 0.05 (Bonferroni correction for roughly 1 million independent tests)
• Sequential FWER methods: correction for a test depends on the other test results (e.g., Hochberg step-up method)
• Assumes independent tests
  – Methods will be conservative if tests are correlated

False Discovery Rate and q-value

• Goal: To balance the error rates by accepting a certain # of false positives
• FDR = expected proportion of false positives among SNPs identified as “significant”.
  – Controls the proportion of false positives instead of P(at least one false positive)
  – Benjamini and Hochberg (1995)
• q-value = expected proportion of false positive results among all features as or more extreme than the observed result
  – Storey (2002); Storey & Tibshirani (2003)

FWER versus FDR

Gene     Ordered P-values   FDR adjusted P-values   Bonferroni adjusted P-values
Gene1    0.0004             0.0055                  0.0055
Gene2    0.0010             0.0057                  0.0144
Gene3    0.0016             0.0057                  0.0223
Gene4    0.0016             0.0057                  0.0230
Gene5    0.0077             0.0214                  0.1071
Gene6    0.0198             0.0457                  0.2744
Gene7    0.0237             0.0474                  0.3318
Gene8    0.0310             0.0543                  0.4340
Gene9    0.1010             0.1571                  1.0000
Gene10   0.1570             0.2198                  1.0000
Gene11   0.2840             0.3615                  1.0000
Gene12   0.5420             0.5848                  1.0000
Gene13   0.5430             0.5848                  1.0000
Gene14   0.7930             0.7930                  1.0000

Test out these approaches on some data of our own
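As a starting point for trying these adjustments, here is a minimal standard-library sketch of the Bonferroni and Benjamini-Hochberg adjusted p-values. The function names are illustrative. Applied to the 14 ordered p-values above, the results will differ slightly from the table in some rows, presumably because the displayed p-values are rounded.

```python
def bonferroni(p_values):
    """Bonferroni adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg (FDR) adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Ordered p-values from the slide (as displayed, i.e. rounded)
p = [0.0004, 0.0010, 0.0016, 0.0016, 0.0077, 0.0198, 0.0237,
     0.0310, 0.1010, 0.1570, 0.2840, 0.5420, 0.5430, 0.7930]
for raw, fdr, bonf in zip(p, benjamini_hochberg(p), bonferroni(p)):
    print(f"{raw:.4f}  {fdr:.4f}  {bonf:.4f}")
```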


FWER versus FDR

• FDR controlling procedures are generally more powerful than FWER controlling procedures:
  – More likely to detect real differences as significant
  – Allow for an acceptable rate of type I errors among significant findings
  – Not appropriate when strict control of any type I errors is desired
  – Exploratory setting
  – Large scale multiple testing problems (e.g., genomics data)
• FWER controlling procedures
  – Strict control of type I errors
  – More appropriate in a confirmatory setting

Gold Standard for Correction

• Many multiple testing correction methods assume independent tests of hypothesis
  – Most are conservative when the tests deviate from independence
• If tests are highly correlated, then FWER << α
• Multiple testing corrections that account for the correlation structure are more powerful
• Can take advantage of this using resampling/permutation methods to incorporate the observed structure

Permutation Testing

• Accounts for the correlation between tests but assumes exchangeability between subjects.
  1. Run analysis for each SNP using observed phenotype data.
  2. Permute (“shuffle”) the phenotype and re-run the complete analysis for all SNPs.
  3. Repeat Step 2 K times
     – K controls the precision of the permutation p-value
  4. Permutation p-value = proportion of permutations in which the most significant permuted-data p-value < observed-data p-value.

Permutation   Most significant p-value   < Observed p-value?
1             0.0001                     No
2             0.0005                     No
..
10,000        0.0000001                  Yes
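The procedure can be sketched in a few lines. As a simplification, and unlike the slide, this sketch tracks the largest test statistic across SNPs in each permutation instead of the smallest p-value; the statistic (absolute difference in mean minor-allele count between cases and controls) is unstandardized and chosen only for brevity. All names and data are hypothetical.

```python
import random


def max_abs_statistic(snp_genotypes, labels):
    """Largest |case mean - control mean| of minor-allele count across all SNPs."""
    n_case = sum(labels)
    n_control = len(labels) - n_case
    best = 0.0
    for genotypes in snp_genotypes:
        case_sum = sum(g for g, lab in zip(genotypes, labels) if lab == 1)
        control_sum = sum(genotypes) - case_sum
        best = max(best, abs(case_sum / n_case - control_sum / n_control))
    return best


def genome_wide_permutation_p(snp_genotypes, labels, n_perm=1000, seed=0):
    """Adjusted p-value for the top SNP: share of permutations whose best SNP does as well."""
    rng = random.Random(seed)
    observed = max_abs_statistic(snp_genotypes, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # the same shuffled phenotype is used for every SNP
        if max_abs_statistic(snp_genotypes, shuffled) >= observed:
            hits += 1
    return hits / n_perm


# Hypothetical data: 3 SNPs genotyped on 5 cases (1) and 5 controls (0)
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
snps = [
    [2, 1, 2, 1, 1, 0, 1, 0, 0, 1],
    [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    [1, 1, 2, 0, 1, 1, 0, 2, 1, 1],
]
print(genome_wide_permutation_p(snps, labels))
```

Shuffling the phenotype once per permutation and reusing it for every SNP is what preserves the correlation between tests.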


External Validation

• One of the best ways to determine if a significant result is “real”
• For SNPs/genes found to be interesting in a previous study, re-assess in a separate, independent data set
• Replicating in a second data set greatly enhances evidence of effect
• Trick: finding this second study
• Direction of association must be the same in both data sets

Acetaminophen Toxicity

• Acetaminophen (Tylenol) is considered safe
• Acetaminophen overdose is the major cause of acute hepatic failure in the US
• A subset of healthy adults taking 4 grams/day develop elevations in markers for liver damage
• Overdose is treated with N-acetyl cysteine (NAC) to replenish GSH stores

Genome-wide SNPs vs. NAPQI IC50

[Figure: genome-wide association results with SNP rs2880961 labeled]

Genome-wide SNPs vs. NAPQI IC50

[Figure: observed vs. expected –log10(1-p)]

Genome-wide SNPs vs. NAPQI IC50

• Permutation p‐value for top SNP is 0.0345


Nearest known gene is ~300 kb away


Limitations of GWAS

• Not very predictive
• Explain little heritability
• Focus on common variation
• Many associated variants are not causal

Common Errors in Association Studies
Cardon and Bell (2001)

• Small sample size
• Subgroup analysis and multiple testing
• Random error
• Poorly matched control group
• Failure to attempt study replication
• Failure to detect LD with adjacent loci
• Over-interpreting results and positive publication bias
• Unwarranted ‘’ declaration after identifying association in arbitrary genetic region

Questions?
