
SNP Genotyping - Overview

Overview of SNP Genotyping

Debbie Nickerson
Department of Sciences
University of Washington
[email protected]

• Project Rationale
• Genotyping Strategies/Technical Leaps
• Data Management/Quality Control

SNP Genotyping

[Diagram: allele discrimination with a matched versus mismatched probe and target (C allele vs. T allele; target G or A)]

• Allele-Specific Hybridization: matched probe hybridizes, mismatched probe fails to hybridize (Eclipse, Dash, Molecular Beacon, Affymetrix)
• Taqman: matched probe is degraded, mismatched probe fails to degrade
• Polymerase Extension (+ddCTP): C incorporated, or fails to incorporate (Fluorescence Polarization, Sequenom)
• Oligonucleotide Ligation: ligate, or fail to ligate (SNPlex, Parallele, Illumina)

SNP Project Rationale

• Heritability
• Power - Number of Individuals
• Number of SNPs - Candidate Gene, Pathway, Genome (5-10 SNPs, 400 to 1,000, 10K, 500K)
• DNA requirements
• Cost

SNP Typing Formats

Low: Microtiter Plates - Fluorescence
  e.g. Taqman
  - Good for a few markers, lots of samples
  - PCR prior to genotyping

Medium: Size Analysis by Electrophoresis
  e.g. SNPlex
  - Intermediate multiplexing reduces costs
  - Genotype directly on genomic DNA - new paradigm for high throughput

High: Arrays - Custom or Universal
  e.g. Illumina, ParAllele, Affymetrix
  - Highly multiplexed - 96, 1,500 SNPs and beyond (500K+)

Defining the scale of the genotyping project is key to selecting an approach:

Scale: 1000 individuals

SNPs                                                  Total cost         Cost per SNP/genotype
5 to 10 SNPs in a candidate gene (many approaches)    $6,000             ~0.60 (expensive)
48 (to 96) SNPs in a handful of candidate genes       ~$29,000           ~0.25 to 0.30
384 - 1,536 SNPs (cost reductions based on scale)     $57,600-122,880    ~0.08 - 0.15
10,000-20,000 SNPs (defined and custom formats)       >$250,000          ~0.03
300,000 to 500,000 SNPs (defined format)              $800,000           ~0.002
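The project totals quoted for 1000 individuals follow from individuals x SNPs x cost per SNP/genotype. A minimal sketch of that arithmetic, assuming the per-genotype prices are in US dollars and using the upper or lower end of each quoted range (the function name `project_cost` is illustrative, not from the slides):

```python
def project_cost(n_individuals, n_snps, cost_per_genotype):
    """Total genotyping cost = individuals x SNPs x cost per SNP/genotype."""
    return n_individuals * n_snps * cost_per_genotype


# Scenarios taken from the scale slide, all for 1000 individuals.
# (label, number of SNPs, assumed cost in dollars per SNP/genotype)
scenarios = [
    ("candidate gene, 10 SNPs", 10, 0.60),
    ("handful of candidate genes, 96 SNPs", 96, 0.30),
    ("384 SNPs", 384, 0.15),
    ("1,536 SNPs", 1536, 0.08),
]

for label, n_snps, unit_cost in scenarios:
    total = project_cost(1000, n_snps, unit_cost)
    print(f"{label}: ${total:,.0f}")
```

The unit price falls faster than the SNP count rises only up to a point, so the total still grows with scale; that is the trade-off the slide is pointing at.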

Many Approaches to Genotype a Handful of SNPs

PCR region prior to SNP genotyping
- Adds to cost
- Many use modified primers - the more modified, the higher the cost

• Taqman
• Single base extension - Fluorescence Polarization; Sequenom - Mass Spec
• Eclipse
• Dash
• Molecular Beacons

Taqman

Genotyping with fluorescence-based homogeneous assays (single-tube assay) = 1 SNP/tube

Genotype Calling - Cluster Analysis

[Plot: signal for SNP 1252 - C versus signal for SNP 1252 - T]

Genotyping by Mass Spectrometry - 24

SNPlex Assay - 48 SNPs

[Diagram: two allele-specific probes (universal PCR priming site, ZipCode1 or ZipCode2, allele-specific sequence) and one locus-specific probe (locus-specific sequence, universal PCR priming site) annealed to the genomic DNA target]

1. Ligation - ligation product formed (homozygote shown in this case)
2. Clean-up

Technological Leap - No advance PCR

Universal PCR after preparing multiple regions for analysis
- Several approaches use a primer-specific step on genomic DNA followed by PCR of the ligated products
- Different strategies and different readouts: SNPlex, Illumina, Parallele (Affymetrix)

Also, reduced representation - Affymetrix
- Cut with restriction enzyme, then ligate linkers, amplify from linkers, and follow by chip hybridization to read out

PCR & ZipChute Hybridization

3. Multiplexed Universal PCR (Univ. PCR Primer with Biotin; Univ. PCR Primer)
4. Capture double-stranded DNA - microtiter plate (Streptavidin)
5. Denature double-stranded DNA
6. Wash away one strand
7. ZipChute Hybridization

Detection

9. Characterize on Capillary Sequencer

[Figure: capillary traces for SNP 1 and SNP 2]

SNPlex Readout

ZipChute 1    N       C    Position 1
ZipChute 2    NN      A    Position 2
ZipChute 3    NNN     T    Position 3
ZipChute n    N(n)    T    Position n

• n ~ 48/lane
• ~2000 lanes/day
• ~96,000 /day

Multiplexed Genotyping - Universal Tag Readouts

[Diagram: Locus 1 and Locus 2 specific sequences (alleles shown: C, T, A, G) carry Tag1 and Tag2 sequences that hybridize to complementary cTag1 and cTag2 sequences on a substrate (bead or chip); Bead Array and Chip Array with Tag 1 to Tag 4]

- Multiplex ~96 - 20,000 SNPs
- Not dependent on primary PCR
- Illumina, ParAllele, Affymetrix

Arrays - High Density Genotyping

Thousands of SNPs and Beyond

• “Bead” Arrays - Illumina
  – Manufactured by self-assembly
  – Beads identified by decoding

Sentrix™ Platform

Sentrix™ 96 Multi-array Matrix matches standard microtiter plates (96 - 1536 SNPs/well)
Up to ~140,000 assays per matrix

Illumina Assay - 3 Primers per SNP

GoldenGate™ Assay

[Diagram: allele-specific extension and ligation on the genomic DNA template, PCR with common primers, hybridization to universal illumiCode addresses, and fluorescent image readout]

• Allele-specific sequences (1, 2), each with a 5' universal forward sequence (Universal PCR Sequence 1, Universal PCR Sequence 2)
• Locus-specific sequence with the illumiCode™ sequence tag (address) and a 3' universal reverse sequence (Universal PCR Sequence 3)
• 1-20 nt gap between the allele-specific and locus-specific sequences; Polymerase (extension) and Ligase (ligation)
• Template amplification: PCR with common primers (Cy3 Universal Primer 1, Cy5 Universal Primer 2, Universal Primer P3)
• Readout examples: illumiCode #561 A/A, illumiCode #217 T/T, illumiCode #1024 C/T

Sentrix™ BeadArray

• ~1.5 mm diameter bundle
• Up to ~50,000 features
• 3 micron diameter beads
• ~5 micron center-to-center
• Many internal replicates - at least 30 beads per code
• Currently: 1,536 SNPs genotyped per bundle

Sentrix™ BeadArray Reader

• Confocal laser scanning system, Resolution 0.8 micron
• Two lasers: 532, 635 nm
• Supports Cy3 & Cy5 imaging
• Supports Sentrix Arrays (96 bundle) and Slides for 100k fixed formats

Process Controls

[Figure: control panels - Mismatch, High AT/GC, Gender, Gap, First Hyb, Second Hyb, Contamination]

Illumina Readout for Sentrix Array

> 1,000 SNPs Assayed on 96 Samples

Multiplexed Genotyping - Universal Tag Readouts

[Diagram: locus-specific sequences with Tag sequences hybridized to complementary cTag sequences on a bead or chip substrate]

- Multiplex ~96 - 20,000 SNPs
- Not dependent on primary PCR
- Illumina, ParAllele, Affymetrix

Parallele - Defined and Custom Formats

- Intermediate Strategy
- Multiplex ~ 20,000 SNPs
- Affymetrix readout - Universal Arrays

Parallele Technology (MIP)

Molecular Inversion Probes (MIP)

Affymetrix’s Chip

Whole Genome Association Strategies

Two Platforms Available - Different Designs

- Affymetrix

- Illumina

500K: Content Optimized SNP Selection

• Initial Selection: 48 people
  – 2.2M SNPs From Public & Perlegen
  – 25 million genotypes
  – 16 each Caucasian, African, Asian
  – All HapMap samples
• Maximize performance: Second selection over 400 people
  – 270 HapMap Samples
  – 130 diversity samples
  – Accuracy
    • HW, Mendel error, reproducibility
  – Call rates
• Maximize information content:
  – Prioritize SNPs based on LD & HapMap (Broad Institute)

[Flow diagram, Affymetrix GeneChip Mapping 500K Array Set: ~2,200,000 SNPs -> 48 individuals (call rate, concordance) -> ~650K SNPs -> 400 samples (call rate, accuracy) -> LD -> 500K SNPs]

80% genome coverage of Mapping 500K

• 500K run on 270 HapMap samples
• Pairwise r² analysis for common SNPs (MAF>0.05)
• Robust coverage across populations at r²=0.8
  – CEPH, Asian ~66%
  – Yoruba ~45%
• 2 & 3 marker predictors (multimarker) further increase coverage

Mapping 500K Set

• >500K SNPs
  – 2 array set
• Performance
  – 93-98% call rate range (>95% average)
  – >99.5% concordance with HapMap Genotypes, 99.9% reproducibility
• SNP lists, annotation and genotype data available without restriction at Affymetrix.com
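The coverage figures quoted for the Mapping 500K set come from a pairwise r² analysis of common SNPs. A minimal sketch of that kind of calculation, assuming phased 0/1 haplotypes are available; the toy data, the SNP names, and the choice to count a typed SNP as covering itself are illustrative assumptions, not details from the slides:

```python
def maf(hap):
    """Minor allele frequency of a SNP coded as 0/1 per chromosome."""
    p = sum(hap) / len(hap)
    return min(p, 1 - p)


def r_squared(hap_a, hap_b):
    """Pairwise r^2 = D^2 / (pA(1-pA) pB(1-pB)), with D = pAB - pA*pB."""
    n = len(hap_a)
    p_a = sum(hap_a) / n
    p_b = sum(hap_b) / n
    p_ab = sum(a * b for a, b in zip(hap_a, hap_b)) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:  # a monomorphic SNP has no defined r^2; treat as 0
        return 0.0
    d = p_ab - p_a * p_b
    return d * d / denom


def coverage(haplotypes, typed, threshold=0.8, min_maf=0.05):
    """Fraction of common SNPs whose max r^2 to any typed SNP >= threshold.

    `typed` must be a non-empty list of SNP names present in `haplotypes`.
    """
    targets = [s for s, h in haplotypes.items() if maf(h) > min_maf]
    if not targets:
        return 0.0
    covered = sum(
        1
        for s in targets
        if max(r_squared(haplotypes[s], haplotypes[t]) for t in typed) >= threshold
    )
    return covered / len(targets)


# Toy data: 8 chromosomes, 4 SNPs; only "snp1" is on the hypothetical array.
haps = {
    "snp1": [0, 0, 0, 0, 1, 1, 1, 1],
    "snp2": [0, 0, 0, 0, 1, 1, 1, 1],  # perfect LD with snp1
    "snp3": [1, 1, 1, 1, 0, 0, 0, 0],  # perfect LD, alleles swapped
    "snp4": [0, 1, 0, 1, 0, 1, 0, 1],  # independent of snp1
}
print(coverage(haps, typed=["snp1"]))
```

Multimarker (2 and 3 marker) predictors, which the slide says raise coverage further, are not modeled here.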

Illumina - Infinium I & II

10K - 300K

Infinium II Assay

Single Base Extension
• Two haptens/colors
• Whole genome amplified DNA

[Diagram: (a) whole genome amplified (WGA) target hybridized to bead-bound probes for SNP1, SNP2, SNP3, ... SNP; (b) single base extension with hapten-labeled nucleotides (A-DNP, C-Bio); (c) signal read out as Green/Red]

HumanHap-1 Genotyping BeadChip Content Strategy

Maximize coverage of human variation by choosing tag SNPs to uniquely identify haplotypes.

Tag SNP selection process:
1. Examine HapMap Phase I SNPs with MAF > 0.05 in CEU
2. Bin SNPs in high LD with one another using ldSelect (Carlson, et al. 2004)
3. Select tag SNP with highest design score for each bin.

HumanHap300 Content

• Tag SNPs
  – r² ≥ 0.80 for bins containing SNPs within 10kb of genes or in evolutionarily conserved regions (ECRs)
  – r² ≥ 0.70 for bins containing SNPs outside of genes or ECRs.
• Additional Content
  – ~8,000 nsSNPs
  – ~1,500 tag SNPs selected from high density SNP data in the MHC region
• Total 317,503 loci
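The binning step of the tag SNP selection process can be illustrated with a simplified greedy procedure. This is a sketch in the spirit of LD binning, not the ldSelect implementation: it assumes pairwise r² values are already computed, and it uses the seed SNP of each bin as the tag, whereas the slide selects the tag by design score. The function name and the toy r² values are hypothetical.

```python
def greedy_ld_bins(snps, r2, threshold=0.8):
    """Greedy LD binning (simplified sketch).

    snps: list of SNP ids.
    r2:   dict mapping frozenset({a, b}) -> pairwise r^2 (missing pairs = 0).
    Repeatedly takes the unbinned SNP that exceeds the threshold with the
    most other unbinned SNPs and bins it together with those SNPs.
    Returns a list of (seed_snp, members) tuples.
    """
    def linked(a, b):
        return r2.get(frozenset((a, b)), 0.0) >= threshold

    remaining = list(snps)
    bins = []
    while remaining:
        # Seed = SNP with the most partners above threshold (ties: first listed).
        seed = max(
            remaining,
            key=lambda s: sum(linked(s, o) for o in remaining if o != s),
        )
        members = [seed] + [o for o in remaining if o != seed and linked(seed, o)]
        bins.append((seed, members))
        remaining = [s for s in remaining if s not in members]
    return bins


# Hypothetical pairwise r^2 values for five SNPs (pairs not listed count as 0).
toy_r2 = {
    frozenset(("A", "B")): 0.90,
    frozenset(("A", "C")): 0.85,
    frozenset(("B", "C")): 0.60,
    frozenset(("D", "E")): 0.95,
}
bins = greedy_ld_bins(["A", "B", "C", "D", "E"], toy_r2, threshold=0.8)
print(bins)
```

Passing `threshold=0.7` instead of 0.8 mirrors the looser criterion the content strategy applies to bins outside genes and ECRs.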

HumanHap300 Data Quality

127 samples, 25 trios, 15 replicates

Parameter                       Percent
Call rate                       99.93%
Reproducibility                 >99.99%
Mendelian Inconsistencies       0.035%
Concordance with HapMap Data    99.69%

HumanHap300 Genomic Coverage by Population

[Plot: Coverage of Phase I+II Data (y axis, 0.0 to 1.0) versus Max r² (x axis, >0 to >0.9)]
CEU: mean 0.87, median 0.97
CHB+JPT: mean 0.80, median 0.94
YRI: mean 0.57, median 0.55
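The Mendelian inconsistency rate in the HumanHap300 data quality table is measured on parent-child trios. A minimal sketch of such a check, assuming autosomal SNPs and genotypes written as two-character strings; the function names and the example trios are illustrative assumptions:

```python
from itertools import product


def mendel_consistent(father, mother, child):
    """True if the child's genotype can be formed from one allele per parent.

    Genotypes are two-character strings such as "CT"; allele order is ignored.
    Assumes an autosomal SNP.
    """
    possible = {"".join(sorted(f + m)) for f, m in product(father, mother)}
    return "".join(sorted(child)) in possible


def mendel_error_rate(trios):
    """Fraction of fully called (father, mother, child) trios that are inconsistent."""
    called = [t for t in trios if None not in t]
    if not called:
        return 0.0
    errors = sum(not mendel_consistent(*t) for t in called)
    return errors / len(called)


# Hypothetical genotypes for one SNP in four trios (None = missing call).
trios = [
    ("CC", "TT", "CT"),   # consistent
    ("CC", "CC", "CT"),   # inconsistent: no parent carries T
    ("CT", None, "TT"),   # skipped: mother not called
    ("CT", "CT", "CC"),   # consistent
]
print(mendel_error_rate(trios))
```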

HumanHap500 Content Strategy

– Analysis of full HapMap data set (Phase I + II) using HumanHap300 SNP list
– Fill in regions of low LD requiring higher density of tag SNPs
– Content Strategy
  • r² ≥ 0.80 for bins containing SNPs within 10kb of genes or in evolutionarily conserved regions (ECRs) in CEU
  • r² ≥ 0.70 for bins containing SNPs outside of genes or ECRs in CEU
  • r² ≥ 0.80 for large bins (≥ 3 SNPs) in CHB+JPT population
  • r² ≥ 0.70 for large bins (≥ 5 SNPs) in YRI population

Preliminary HumanHap500 Genomic Coverage by Population

[Plot: Coverage of Phase I+II Loci (y axis, 0.0 to 1.0) versus Max r² (x axis, >0 to >0.9)]
CEU: mean 0.95, median 1.0
CHB+JPT: mean 0.93, median 1.0
YRI: mean 0.75, median 0.88

Data Quality Control

• Estimating Error Rates
• Hardy Weinberg Equilibrium
• Frequency Analysis
• Missing Data

Measuring Error Rates

• Genotype replicate samples
• Error rates generally <1%
• Error rates are SNP specific

                Rep 1
                CC    CT    TT
Rep 2    CC     24     1     0
         CT      0    50     0
         TT      0     0    25

Measuring Error Rates

• Genotype replicate samples
• Absolute number of replicates is more important than percentage
  – E.g. 10% of 200?

                Rep 1
                CC    CT    TT
Rep 2    CC     24     1     0
         CT      0    50     0
         TT      0     0    25

Replicate samples

• Replicates can also detect sample handling errors
  – Wrong plate
  – Plate rotation
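The replicate table can be turned into a discordance rate, and the point about absolute replicate numbers can be made concrete with a simple bound. A minimal sketch, assuming the table rows are Rep 2 and the columns Rep 1; the zero-error bound is a standard binomial argument added here for illustration, not something stated on the slides:

```python
# Rows: Rep 2 genotype; columns: Rep 1 genotype (order CC, CT, TT).
table = [
    [24, 1, 0],
    [0, 50, 0],
    [0, 0, 25],
]


def discordance_rate(matrix):
    """Off-diagonal (discordant) replicate pairs divided by all pairs."""
    total = sum(sum(row) for row in matrix)
    concordant = sum(matrix[i][i] for i in range(len(matrix)))
    return (total - concordant) / total


def upper_bound_zero_errors(n_pairs, confidence=0.95):
    """Largest discordance rate still compatible with 0 errors in n pairs.

    Solves (1 - p)^n = 1 - confidence for p.
    """
    return 1 - (1 - confidence) ** (1 / n_pairs)


print(discordance_rate(table))        # 1 discordant pair out of 100
print(upper_bound_zero_errors(20))    # e.g. 10% of 200 samples replicated
print(upper_bound_zero_errors(200))   # every sample replicated
```

With only 20 replicate pairs, even a run with no discordant calls cannot rule out a double-digit percent discordance rate for that SNP, which is why the count of replicates matters more than the percentage.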

Sample Handling Errors

• Sexing samples
• Other known genotypes
  – Blood type
  – HLA
  – Etc.

Hardy Weinberg Equilibrium

• Given
  – p = Allele 1 frequency
  – q = 1-p
• Expectations
  – p² = frequency 11
  – 2pq = frequency 12
  – q² = frequency 22
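The Hardy Weinberg expectations can be compared with observed genotype counts using a chi-square statistic. A minimal sketch, assuming a biallelic autosomal SNP; the function names and the example counts are illustrative, and the chi-square test is one common choice rather than a method named on the slides:

```python
def hwe_expected(n11, n12, n22):
    """Expected genotype counts (11, 12, 22) under Hardy Weinberg Equilibrium."""
    n = n11 + n12 + n22
    p = (2 * n11 + n12) / (2 * n)   # allele 1 frequency
    q = 1 - p
    return p * p * n, 2 * p * q * n, q * q * n


def hwe_chi_square(n11, n12, n22):
    """Chi-square statistic of observed vs. expected counts (1 degree of freedom)."""
    observed = (n11, n12, n22)
    expected = hwe_expected(n11, n12, n22)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


# 25 / 50 / 25 matches p = 0.5 exactly, so the statistic is 0.
print(hwe_chi_square(25, 50, 25))
# 40 / 20 / 40 has the same allele frequency but a homozygote excess.
# Values above 3.84 are significant at the 0.05 level for 1 degree of freedom.
print(hwe_chi_square(40, 20, 40))
```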

Hardy Weinberg Disequilibrium

• Heterozygote excess
  – Biologic
    • Differential survival
  – Technical
    • Nonspecific assays
    • Duplicated regions
• Homozygote excess
  – Biologic
    • Population stratification
    • Null allele
  – Technical
    • Allele dropout

HWE Example

Small et al, NEJM 2002 v347 p1135

Frequency Analysis

[Plots: tagSNP Allele Frequency and tagSNP Haplotype Frequency, IGAP versus PGA, for White European and Black African samples]

• Check haplotype and allele frequencies against standard populations

Data Quality Control

• Estimate Error Rates from Replicates
• Check Hardy Weinberg Equilibrium
• Check Allele and Haplotype Frequencies
• Check Missing Data
  - Site specific
  - Sample specific
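The site-specific and sample-specific missing data checks amount to computing call rates along the two axes of the genotype matrix. A minimal sketch, assuming genotypes are stored per sample with `None` for a missing call; the data layout, function name, and example values are illustrative assumptions:

```python
def call_rates(genotypes):
    """Per-sample and per-SNP call rates.

    genotypes: dict mapping sample id -> list of calls in a fixed SNP order,
               with None marking a missing call.
    Returns (per_sample, per_snp): a dict of sample call rates and a list of
    SNP call rates in the same SNP order.
    """
    samples = list(genotypes)
    n_snps = len(genotypes[samples[0]])
    per_sample = {
        s: sum(g is not None for g in genotypes[s]) / n_snps for s in samples
    }
    per_snp = [
        sum(genotypes[s][j] is not None for s in samples) / len(samples)
        for j in range(n_snps)
    ]
    return per_sample, per_snp


# Hypothetical data: 4 samples x 4 SNPs.
data = {
    "s1": ["CC", "CT", "TT", None],
    "s2": ["CC", None, "TT", None],
    "s3": ["CT", "CT", "TT", None],
    "s4": [None, None, None, None],   # failed sample
}
per_sample, per_snp = call_rates(data)
print(per_sample)
print(per_snp)
```

In this toy set the last SNP fails in every sample (a site-specific problem) and `s4` fails at every SNP (a sample-specific problem); the two are easiest to separate when both rates are inspected.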

Detection of Outliers of the Distribution

[Figure panels: X-linked SNP, Unknown SNP]

SNP Genotyping Summary

1. Many different genotyping approaches are available - Low to high throughput

2. Some platforms permit users to pick custom SNPs but the highest throughputs are available only in fixed contents.

3. Not all custom SNPs will work for every format. Multiple formats will be required to carry out most projects targeting specific SNPs

4. There are still trade-offs for throughput - Samples vs. SNPs

5. Costs still dictate study design.

6. Regardless of the study - design, quality control and tracking will rule the day! Laboratory Information Management Systems are key in every study design (Key: Track - Samples, Assays, Completion rate, Reproducibility/Error Analysis)

Carlson et al, in preparation
