Imputation of data

Introduction to theory and implementation of Genomic Selection

Vienna, 11 March 2015 – Marco Bink & Rianne van Binsbergen

Background

Scientist Statistical
“Development and application of statistical methods to model and predict complex traits from genotypic and other types of high-throughput data, in plants & animals”

3rd year PhD student
“Investigate the benefits of using whole sequence data in selection and breeding of animals and plants”
• Study accuracy of genotype imputation in case of whole-genome sequence data
• Study accuracy of genomic prediction based on imputed sequence data

Overview

• Context in plants
• What is genotype imputation and why is it needed?
• Which cases of genotype imputation?
• What methods are available?
• How to measure correctness of imputation?
• What factors affect imputation?
• What are the current challenges?
• Impact on Genomic Prediction?

Introduction - Plants: Numerous and huge variation

• Model plant: Arabidopsis (www.1001genomes.org)
• Field crops: Maize, soya, rice, barley, wheat (grasses)
• Tuber crops: Potato
• Vegetable crops: Tomato, lettuce, cucumber
• Fruit trees: Apple, peach, cherry
• Forestry trees: Loblolly pine, Maritime pine
• Etc.

Genome complexity
• Length
• Ploidy level
• Duplication
• Repeats
• Heterogeneity

Example: Genome of Maritime pine (Pinus pinaster)
EU FP7 project ProCoGen

Diploid, 2n = 24

Despite this mega-genome, the genetic size is similar to that found in other plants (~150 cM / chromosome)

Impact on genotyping resources

[Figure: levels of genotyping resources, plotted against the number of plant species]
• Re-sequencing on multiple accessions
• Reference genome available
• (Commercial) SNP arrays available
• Scarcely any markers available

What is genotype imputation?

“The statistical inference of unobserved

or

“The process of predicting genotypes that are not directly assayed in a sample of individuals”

Cases of genotype imputation

1. Impute randomly missing genotypic data

2. Impute genotypic data for alignment of different SNP arrays

3. Impute genotypic data from low-density SNP array to high-density SNP array

4. Impute genotypic data from low coverage sequencing data

COMPLETE genotypic data

1. Impute randomly missing genotypic data
2. Impute for alignment of different SNP arrays
3. Impute from low- to high-density
4. Impute from low coverage sequencing data

1. Randomly missing genotypic data

Low percentage of genotypes have not been called, e.g., <5%

“Missing At Random”

[Figure: genotypes of Individuals 1 to 4; sporadic missing calls (?) in Individuals 2, 3 and 4 are filled in by imputation]

2. Alignment of different SNP arrays

Different genotyping platforms are used, with only a certain percentage of SNPs common across platforms

“Missing Not At Random”

[Figure: genotypes of Individuals 1 and 2 (platform A) and Individuals 3 and 4 (platform B); SNPs absent from an individual's platform (?) are filled in by imputation]

3. Impute from low-density to high-density

Save money: genotyping at high density is often more expensive than genotyping at low density

Major incentive: Genomic Prediction
• Low-density array to high-density array
• High-density array to whole-genome sequence data

[Figure: reference individuals genotyped at high density and an individual genotyped at low density; the missing genotypes (?) are filled in to give the individual with imputed genotype]

4. Low coverage sequencing data

Coverage = average number of reads per nucleotide

Figure adapted from: Commins, et al. Biological Procedures Online 2009, 11:52-78

• A large part of the genome might not be covered
  - Varies among individuals and among loci
  - Data are Missing At Random
• Uncertainty in genotype calls!
• No methods developed (yet) to deal with these data specifically
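To make the coverage definition concrete, here is a minimal sketch. It assumes reads are placed uniformly at random over the genome, so that the depth at a nucleotide is roughly Poisson distributed; that assumption, the function names and the example numbers are illustrative and not taken from the slides.

```python
import math


def mean_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of reads per nucleotide (total sequenced bases / genome size)."""
    return n_reads * read_length / genome_size


def expected_uncovered_fraction(coverage: float) -> float:
    """Fraction of nucleotides with zero reads, under the Poisson assumption."""
    return math.exp(-coverage)


# Hypothetical run: 10 million reads of 100 bp on a 1 Gb genome gives 1x coverage.
cov = mean_coverage(10_000_000, 100, 1_000_000_000)
print(cov, round(expected_uncovered_fraction(cov), 3))
```

Under these assumptions more than a third of the genome has no read at 1x coverage, which is why a large part of the genome may be missing at low coverage.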


Imputation methods

• Naïve approaches
• General statistical approaches
• Population-based approaches
• Family-based approaches

Naïve approaches

Useful if software in a subsequent step cannot handle data with missing genotypes

Examples:
- Marker mean value
  • Based on allele frequencies
- Heterozygous value (inbred lines)
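The marker mean approach can be sketched as follows. It assumes genotypes are coded as allele counts (0, 1, 2) in an individuals-by-markers array with missing calls stored as NaN; the coding and the function name are assumptions for illustration.

```python
import numpy as np


def impute_marker_mean(genotypes):
    """Replace each missing genotype (NaN) by the mean of the observed genotypes at that marker."""
    g = np.asarray(genotypes, dtype=float)
    marker_means = np.nanmean(g, axis=0)   # equals 2 x allele frequency under 0/1/2 coding
    imputed = g.copy()
    rows, cols = np.where(np.isnan(imputed))
    imputed[rows, cols] = marker_means[cols]
    return imputed


example = [[0, 2],
           [np.nan, 2],
           [2, np.nan]]
print(impute_marker_mean(example))
```

The imputed values are generally not integers, which is acceptable for software that works with allele dosages but not for software that expects discrete genotype calls.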


General statistical approaches

Random forest
- Machine-learning algorithm that uses an ensemble of decision trees
- Average of multiple trees is the predicted value

k-nearest neighbour
a. For individuals i and j and a single SNP, similarity scores are calculated:
   - 1 if identical genotypes
   - 0.5 if missing genotypes
   - p if different genotypes
b. For the population with N individuals, a matrix of scores S_ij is obtained
c. The k nearest neighbours of individual i are the individuals j with the highest scores S_ij

Missing genotypes of individual i are estimated as the major allele of the k nearest neighbours (if frequency > f)

(Huang, et al. 2010. Nat Genet 42: 961-967)
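The k-nearest neighbour steps above can be sketched as follows. This is an illustration, not the implementation of Huang et al.: it assumes genotypes coded as integers with -1 for missing, sums the per-SNP scores over all SNPs to obtain S_ij, and uses placeholder values for k, the mismatch score p and the frequency threshold f.

```python
import numpy as np

MISSING = -1  # assumed code for a missing genotype


def similarity_matrix(g, p=-1.0):
    """S[i, j] = sum over SNPs of: 1 if identical, 0.5 if either is missing, p if different."""
    n = g.shape[0]
    s = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            missing = (g[i] == MISSING) | (g[j] == MISSING)
            same = (g[i] == g[j]) & ~missing
            diff = ~same & ~missing
            s[i, j] = same.sum() + 0.5 * missing.sum() + p * diff.sum()
    return s


def knn_impute(g, k=2, p=-1.0, f=0.5):
    """Fill missing genotypes with the most frequent call among the k nearest neighbours."""
    g = np.asarray(g)
    s = similarity_matrix(g, p)
    np.fill_diagonal(s, -np.inf)            # an individual is not its own neighbour
    out = g.copy()
    for i in range(g.shape[0]):
        neighbours = np.argsort(-s[i], kind="stable")[:k]
        for snp in np.where(g[i] == MISSING)[0]:
            calls = g[neighbours, snp]
            calls = calls[calls != MISSING]
            if calls.size == 0:
                continue                     # nothing to borrow from, leave missing
            values, counts = np.unique(calls, return_counts=True)
            best = counts.argmax()
            if counts[best] / calls.size > f:
                out[i, snp] = values[best]
    return out
```

Genotypes that do not reach the frequency threshold stay missing, so the threshold trades call rate against error rate.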

Population-based approaches

• Commonly used in human studies
• Based on short-range LD information in a population
• Often based on Hidden Markov models (HMM)
  - Relate an observed process (i.e. observed unphased genotypes) to an underlying unobserved or hidden state of interest (i.e. phase and true genotypes)
• Several methods available
  - Differ in accuracy, computational requirements, speed, ...
  o Beagle (Browning and Browning. 2009. AJHG 84: 210-223)
  o Impute v2 (Howie, et al. 2009. PLoS Genet 5: e1000529)
  o fastPHASE (Scheet and Stephens. 2006. AJHG 78: 629-644)
  o ...

E.g. Beagle (Browning and Browning. 2009. AJHG 84: 210-223)

Corresponding pair of haplotype clusters will be merged if smallest merging score < threshold

Merging score ~ probability that allele sequences at markers l+1, l+2, … differ
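To illustrate how a hidden Markov model relates observed alleles to a hidden state, here is a deliberately simplified haploid sketch. It is not Beagle's haplotype-cluster model: the hidden state is assumed to be which reference haplotype is being copied at each marker, and the switch and error probabilities are arbitrary illustrative values.

```python
import numpy as np


def impute_allele_probabilities(ref, target, switch=0.1, error=0.01):
    """Posterior probability of allele 1 at every marker of a target haplotype.

    ref    : (K, M) array of 0/1 alleles for K reference haplotypes
    target : length-M array with 0/1 alleles, or -1 where the allele is missing
    """
    ref = np.asarray(ref)
    target = np.asarray(target)
    k, m = ref.shape
    # Transition: keep copying the same haplotype, or switch to a random one.
    trans = np.full((k, k), switch / k) + np.eye(k) * (1.0 - switch)

    def emission(marker):
        if target[marker] < 0:               # missing allele carries no information
            return np.ones(k)
        return np.where(ref[:, marker] == target[marker], 1.0 - error, error)

    fwd = np.zeros((m, k))
    fwd[0] = emission(0) / k
    fwd[0] /= fwd[0].sum()
    for t in range(1, m):
        fwd[t] = emission(t) * (fwd[t - 1] @ trans)
        fwd[t] /= fwd[t].sum()               # rescale to avoid underflow

    bwd = np.ones((m, k))
    for t in range(m - 2, -1, -1):
        bwd[t] = trans @ (emission(t + 1) * bwd[t + 1])
        bwd[t] /= bwd[t].sum()

    posterior = fwd * bwd
    posterior /= posterior.sum(axis=1, keepdims=True)
    # Sum the posterior mass of the reference haplotypes carrying allele 1.
    return (posterior * ref.T).sum(axis=1)
```

At a missing marker the result is driven entirely by the flanking observed alleles, which is the sense in which these methods exploit short-range LD.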

Family-based approaches

• Utilize pedigree information (long-range LD)
• Possible to impute ungenotyped individuals
• Several methods available
  - Differ in accuracy, computational requirements, speed, ...
  o findhap (VanRaden, et al. 2011. Genet Sel Evol 43: 10)
  o AlphaImpute (Hickey, et al. 2012. Genet Sel Evol 44: 9)
  o FImpute (Sargolzaei, et al. 2014. BMC Genomics 15: 478)
  o ...

E.g. AlphaImpute (Hickey, et al. 2011. Genet Sel Evol 43: 1-13)

Expansion of Long-range phasing (LRP) algorithm:
• Heuristic method for phasing of marker genotypes, which uses information from both related and seemingly unrelated individuals
• Invoking the concepts of surrogate parents and Erdős numbers
• Combined with haplotype library imputation

Homozygous loci across both core and tails ⇒ define surrogates
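The surrogate step can be illustrated with a small sketch. It assumes that a surrogate parent is an individual with no opposing homozygous genotypes relative to the individual being phased across the core and its tails, with genotypes coded 0/1/2; the function names, the window arguments and the tolerance are hypothetical and this is not AlphaImpute code.

```python
import numpy as np


def opposing_homozygotes(g1, g2):
    """Count loci where one individual is 0 and the other is 2 (cannot share a haplotype there)."""
    g1, g2 = np.asarray(g1), np.asarray(g2)
    return int(np.sum(((g1 == 0) & (g2 == 2)) | ((g1 == 2) & (g2 == 0))))


def find_surrogates(genotypes, proband, start, stop, max_opposing=0):
    """Indices of individuals compatible with the proband over markers start..stop-1."""
    window = np.asarray(genotypes)[:, start:stop]
    return [j for j in range(window.shape[0])
            if j != proband
            and opposing_homozygotes(window[proband], window[j]) <= max_opposing]


genotypes = [[0, 2, 1, 0],
             [0, 1, 1, 0],
             [2, 2, 1, 0],
             [1, 1, 1, 1]]
print(find_surrogates(genotypes, proband=0, start=0, stop=4))
```

Only homozygous loci are informative here, because a heterozygous genotype is compatible with either allele being shared.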


Measures of imputation correctness

• Imputation error rate: % of alleles (or genotypes) imputed incorrectly
• Imputation accuracy: correlation between true and imputed genotypes
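Both measures are simple to compute once true genotypes are available for the imputed positions. The sketch below assumes genotypes coded 0/1/2 for one locus across individuals and computes the error rate at the genotype level; the function names are illustrative.

```python
import numpy as np


def imputation_error_rate(true, imputed):
    """Fraction of genotypes imputed incorrectly."""
    true, imputed = np.asarray(true), np.asarray(imputed)
    return float(np.mean(true != imputed))


def imputation_accuracy(true, imputed):
    """Pearson correlation between true and imputed genotypes."""
    return float(np.corrcoef(np.asarray(true, float), np.asarray(imputed, float))[0, 1])


true = [0, 1, 2, 1, 0, 2]
imputed = [0, 1, 2, 2, 0, 2]
print(imputation_error_rate(true, imputed), imputation_accuracy(true, imputed))
```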

Measures of imputation correctness

Validation by imputing “masked” genotypes

[Figure: validation scheme. Reference genotypes → put SNPs to missing (validation) → imputation → imputed genotypes → calculate imputation error rate or imputation accuracy]

I. Impute randomly missing genotypic data

• Typically: some percentage of genotypes have not been called by the genotyping algorithm
  - Proportion of uncalled genotypes is usually low (e.g., <5%)
  - Missing genotypes can be regarded as occurring at random

• Maize (Hickey et al., 2012)
  - 35,081 SNPs uniquely mapped on the physical map
  - 1163 inbred lines, mostly from (sub)tropical germplasm
  - IMPUTE2 software

III. Impute genotypic data from low-density SNP array to high-density SNP array

Cost-saving genotyping strategy:
• High-Density SNP Platform (HDP)
  - Key individuals, e.g., ancestors of selection candidates or mapping parents
• Low-Density SNP Platform (LDP) (= subset of HDP!)
  - Others, e.g., selection candidates or mapping progeny

• Maize (Hickey et al., 2012)
  - 35,081 SNPs uniquely mapped on the physical map
  - 1163 inbred lines, mostly from (sub)tropical germplasm
  - Masking 50%, 75%, ..., 99.2%
  - 275 SNPs!

Maize (Hickey et al., 2012)

Proportion of correctly imputed genotypes severely affected by (low) MAF

EU FP7 project FruitBreedomics

Apple: well-documented pedigree records + old cultivars still available (simple via clones)
1285 individuals descending from 42 founders

[Figure: apple pedigree with cultivars and selections (e.g. Delicious, Jonathan, Cox, McIntosh, Fuji, Gala) leading to 22 full-sib families; 2 full-sib families are masked]

Apple genome - 17 chromosomes
• Duplications ⇒ paralogous regions
• Highly heterozygous ⇒ null-alleles!

HDP: 20K SNP array ⇒ ~9K SNPs
LDP: 512 SNP array ⇒ 373 SNPs

2 masked families:
• 12_F
• FuGa

2 approaches:
• AlphaImpute
• BEAGLE

• Work in progress
  - Severe differences in accuracy due to capturing pedigree info?
  - Particular for type of germplasm and marker densities?

Measures of imputation correctness

Imputation error rate
• Used in most studies
• Performance appears better at low MAF
• No (obvious) relation with genomic prediction accuracy

Imputation accuracy
• Independent of MAF
• Variation in true and imputed genotypes needed
• Linear relationship with genomic prediction accuracy

Calus MPL, Bouwman AC, Hickey JM, Veerkamp R F, Mulder HA (2014). Evaluation of measures of correctness of genotype imputation in the context of genomic prediction: a review of livestock applications. Animal 8(11): 1743-1753.
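A small constructed example shows why the error rate looks good at low MAF while the correlation does not. The data are invented for illustration: 100 individuals at one SNP, of which 2 carry the minor allele, and an imputation that always assigns the major homozygote.

```python
import numpy as np


def error_rate_and_accuracy(true, imputed):
    """Return (error rate, correlation); the correlation is NaN if either vector has no variation."""
    true = np.asarray(true, dtype=float)
    imputed = np.asarray(imputed, dtype=float)
    error = float(np.mean(true != imputed))
    if true.std() == 0 or imputed.std() == 0:
        return error, float("nan")
    return error, float(np.corrcoef(true, imputed)[0, 1])


true = np.array([0] * 98 + [1] * 2)      # low-MAF SNP
always_major = np.zeros(100)              # uninformative imputation
error, accuracy = error_rate_and_accuracy(true, always_major)
print(error, accuracy)
```

The error rate is only 2% although nothing was learned about the minor allele, and the correlation cannot even be computed because the imputed genotypes show no variation.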

Measures of imputation correctness

• Imputation accuracy is often calculated per locus
• Individual-specific imputation accuracy: true and imputed genotypes need to be standardized (mean and variance) at each locus
  - Otherwise the accuracy is overestimated
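The effect of standardization on individual-specific accuracy can be sketched as follows. The sketch assumes that each locus is centred and scaled with the mean and standard deviation of the true genotypes at that locus (one possible choice; the slide does not specify which estimates to use), and the genotype matrices are invented for illustration.

```python
import numpy as np


def individual_accuracy(true, imputed, standardize=True):
    """Correlation across loci between true and imputed genotypes, one value per individual."""
    true = np.asarray(true, dtype=float)
    imputed = np.asarray(imputed, dtype=float)
    if standardize:
        mean = true.mean(axis=0)
        sd = true.std(axis=0)
        keep = sd > 0                      # monomorphic loci cannot be scaled
        true = (true[:, keep] - mean[keep]) / sd[keep]
        imputed = (imputed[:, keep] - mean[keep]) / sd[keep]
    return np.array([np.corrcoef(t, g)[0, 1] for t, g in zip(true, imputed)])


# Rows are individuals, columns are loci with very different allele frequencies.
true = np.array([[0, 2, 0, 2],
                 [0, 2, 0, 2],
                 [0, 2, 1, 1],
                 [1, 1, 0, 2]])
imputed = np.array([[0, 2, 0, 2]] * 4)     # every individual gets the common genotype
print(individual_accuracy(true, imputed, standardize=False))
print(individual_accuracy(true, imputed, standardize=True))
```

Without standardization the differences in allele frequency between loci produce a positive correlation even for the individuals whose rare genotypes were all imputed wrongly; after standardization that spurious agreement disappears.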

Measures of imputation correctness

Both measures require that true genotypes are known!
→ In practice true genotypes are unknown...

Examples of metrics that estimate accuracy:
• allelic R² (e.g. calculated by Beagle)
• standardized allele-frequency error
• ratio of imputed allele dosage variance and true allele dosage variance
• allele-frequency correlation

Browning and Browning. 2009. Am J Hum Genet 84: 210-223
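As an illustration of a metric that needs no true genotypes, the dosage variance ratio can be sketched as below. The sketch assumes the unknown true dosage variance is approximated by its Hardy-Weinberg expectation 2p(1-p), with p estimated from the imputed dosages; this approximation and the function name are assumptions, not something stated on the slide.

```python
import numpy as np


def dosage_variance_ratio(dosages):
    """Variance of imputed allele dosages divided by the expected variance 2p(1-p)."""
    d = np.asarray(dosages, dtype=float)
    p = d.mean() / 2.0                      # allele frequency estimated from the dosages
    expected = 2.0 * p * (1.0 - p)
    if expected == 0:
        return float("nan")                 # monomorphic: ratio undefined
    return float(d.var() / expected)


print(dosage_variance_ratio([0, 1, 1, 2]))          # confident calls
print(dosage_variance_ratio([1.0, 1.0, 1.0, 1.0]))  # every dosage shrunk to the mean
```

Uncertain imputation pulls dosages toward their mean, which lowers their variance and therefore the ratio.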

Factors affecting imputation

• # Reference individuals
• Relationship between individuals
• LD between known SNP and imputed SNP
• Distance between known SNP and imputed SNP
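LD between a known and an imputed SNP is commonly summarized as r². A minimal sketch, assuming phased haplotypes coded 0/1 at the two SNPs (the function name and the example data are illustrative):

```python
import numpy as np


def ld_r2(hap_a, hap_b):
    """r^2 between two SNPs from phased 0/1 haplotypes: D^2 / (pA (1-pA) pB (1-pB))."""
    a = np.asarray(hap_a, dtype=float)
    b = np.asarray(hap_b, dtype=float)
    p_a, p_b = a.mean(), b.mean()
    d = np.mean(a * b) - p_a * p_b          # D = p_AB - p_A * p_B
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    return float("nan") if denom == 0 else float(d * d / denom)


print(ld_r2([0, 0, 1, 1], [0, 0, 1, 1]))    # complete LD
print(ld_r2([0, 0, 1, 1], [0, 1, 0, 1]))    # no LD
```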

Genotype imputation

[Figure: commercial DNA marker genotypes with gaps (e.g. - C - C - A) are imputed using the whole-genome sequence data of 114 reference animals]

Factors affecting imputation (example)

van Binsbergen R, Bink M, Calus MPL, van Eeuwijk FA, Hayes BJ, Hulsegge I et al (2014). Accuracy of imputation to whole-genome sequence data in Holstein Friesian cattle. Genetics Selection Evolution 46.

Factors affecting imputation (continued)

• # SNPs on lower density panel

Factors affecting imputation (example)

Larger gap between two panels → less accurate imputation
- e.g. 6,000 → 777,000 vs. 50,000 → 777,000

Stepwise imputation can increase accuracy
- e.g. from 6,000 → 50,000 → 777,000

Genotype imputation - accuracy (= correlation between true and predicted genotype)

Scenario        # DNA markers on panel
                50,000     777,000
80% animals     0.46       0.83
60% animals     0.43       0.81
40% animals     0.37       0.77

Two-step approach: 0.65

[Figure 1: Number of SNPs on BTA 1]

van Binsbergen R, Bink M, Calus MPL, van Eeuwijk FA, Hayes BJ, Hulsegge I et al (2014). Accuracy of imputation to whole-genome sequence data in Holstein Friesian cattle. Genetics Selection Evolution 46.

Factors affecting imputation (continued)

• MAF of imputed SNP
• SNP location on chromosome

Factors affecting imputation (example)

van Binsbergen R, Bink M, Calus MPL, van Eeuwijk FA, Hayes BJ, Hulsegge I et al (2014). Accuracy of imputation to whole-genome sequence data in Holstein Friesian cattle. Genetics Selection Evolution 46.

Further challenges

• Imputation of whole-genome sequence data
  - Many low MAF SNPs
  - Uncertainty of genotype calling
    (especially with low coverage & GBS)
  - Correctness of genome assembly / map
  - Small populations – synteny between (sub)species

Cost-saving genotyping strategies

• Low density arrays
  - Initial development costs: minimal number of samples
  - Germplasm-specific
• GBS or low coverage sequence
  - Bioinformatics pipeline for data-handling: Quality Control
  - Imputation methods
  - Reference sequence(s)
  - Parents / panel

Genomic prediction

2 scenarios
• 777,000 DNA markers
• Whole-genome sequence data

1000 bull consortium: www.1000bullgenomes.com

Genomic prediction methods

Method 1 (GREML)
Assumption: all base-pairs have an equally small effect on the phenotype

Method 2 (BSSVS)
Assumption: a few base-pairs have a large effect, the rest have a small effect

Prediction reliabilities (= squared correlation between original and predicted phenotype)

[Figure: bar chart of prediction reliability (axis 0 to 0.6) for the traits SCS, IFL and PY. Series: Traditional - REML (pedigree data); Method 1 - GREML (777,000 markers); Method 2 - BSSVS (777,000 markers); Method 1 - GREML (sequence data); Method 2 - BSSVS (sequence data)]

Discussion: sequence-based Genomic Prediction

• Sequence data did not improve prediction reliability in the current set-up
• Number of training animals was relatively low
  → 3416 bulls to estimate ~12,000,000 marker effects
• Methods did not ‘find’ causal mutations in sequence data
  → Training animals too closely related?
  → More strict on accuracy of imputed genotypes?
  → Other methods needed? Mixture of priors, include prior knowledge on QTL and candidate genes?

Thanks!

Questions?


References

Bouwman, A. C., and R. F. Veerkamp. 2014. Consequences of splitting whole-genome sequencing effort over multiple breeds on imputation accuracy. BMC Genet 15: 105. http://www.biomedcentral.com/1471-2156/15/105

Browning, B. L., and S. R. Browning. 2009. A unified approach to genotype imputation and haplotype-phase inference for large data sets of trios and unrelated individuals. Am J Hum Genet 84: 210-223. http://www.sciencedirect.com/science/article/pii/S0002929709000123

Calus, M. P. L., A. C. Bouwman, J. M. Hickey, R. F. Veerkamp, and H. A. Mulder. 2014. Evaluation of measures of correctness of genotype imputation in the context of genomic prediction: a review of livestock applications. Animal 8(11): 1743-1753. http://dx.doi.org/10.1017/S1751731114001803

Commins, J., C. Toft, and M. Fares. 2009. Computational Biology Methods and Their Application to the Comparative Genomics of Endocellular Symbiotic Bacteria of Insects. Biological Procedures Online 11: 52 - 78. http://www.biologicalproceduresonline.com/content/11/1/52

Hickey, J., B. Kinghorn, B. Tier, J. Wilson, N. Dunstan, and J. van der Werf. 2011. A combined long-range phasing and long haplotype imputation method to impute phase for SNP genotypes. Genet Sel Evol 43: 1-13. http://dx.doi.org/10.1186/1297-9686-43-12

Hickey, J. M., B. P. Kinghorn, B. Tier, J. H. J. van der Werf, and M. A. Cleveland. 2012. A phasing and imputation method for pedigreed populations that results in a single-stage genomic evaluation. Genet Sel Evol 44: 9. http://www.gsejournal.org/content/44/1/9

Howie, B. N., P. Donnelly, and J. Marchini. 2009. A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet 5: e1000529. http://dx.doi.org/10.1371%2Fjournal.pgen.1000529

Huang, X., X. Wei, T. Sang, Q. Zhao, Q. Feng, Y. Zhao, C. Li, C. Zhu, T. Lu, Z. Zhang, M. Li, D. Fan, Y. Guo, A. Wang, L. Wang, L. Deng, W. Li, Y. Lu, Q. Weng, K. Liu, T. Huang, T. Zhou, Y. Jing, W. Li, Z. Lin, E. S. Buckler, Q. Qian, Q.-F. Zhang, J. Li, and B. Han. 2010. Genome-wide association studies of 14 agronomic traits in rice landraces. Nat Genet 42: 961-967. http://dx.doi.org/10.1038/ng.695

Sargolzaei, M., J. Chesnais, and F. Schenkel. 2014. A new approach for efficient genotype imputation using information from relatives. BMC Genomics 15: 478. http://www.biomedcentral.com/1471-2164/15/478

Scheet, P., and M. Stephens. 2006. A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase. Am J Hum Genet 78: 629-644. http://www.sciencedirect.com/science/article/pii/S000292970763701X

van Binsbergen, R., M. C. A. M. Bink, M. P. L. Calus, F. A. van Eeuwijk, B. J. Hayes, I. Hulsegge, and R. F. Veerkamp. 2014. Accuracy of imputation to whole-genome sequence data in Holstein Friesian cattle. Genet Sel Evol 46: 41. http://www.gsejournal.org/content/46/1/41

VanRaden, P. M., J. R. O. O'Connell, G. R. Wiggans, and K. A. Weigel. 2011. Genomic evaluations with many more genotypes. Genet Sel Evol 43: 10. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3056758/