Novel Statistical Methods in Quantitative Genetics
Dedicated to my family in Shiyan, Hubei, China, especially to my grandparents for their indispensable guidance all these years.

List of papers

This thesis is based on the following papers, which are referred to in the text by their Roman numerals.

I Rönnegård, L., Shen, X. and Alam, M. (2010). hglm: a package for fitting hierarchical generalized linear models. The R Journal. 2(2):20-28.
II Shen, X., Rönnegård, L. and Carlborg, Ö. (2011). How to deal with genotype uncertainty in variance component quantitative trait loci analyses. Genetics Research, Cambridge. 93(5):333-342.
III Shen, X., Rönnegård, L. and Carlborg, Ö. (2011). Hierarchical likelihood opens a new way of estimating genetic values using genome-wide dense marker maps. BMC Proceedings. 5(Suppl 3):S14.
IV Nelson, R., Shen, X. and Carlborg, Ö. (2011). qtl.outbred: interfacing outbred line cross data with the R/qtl mapping software. BMC Research Notes. 4:154.
V Shen, X., Pettersson, M., Rönnegård, L. and Carlborg, Ö. (2012). Inheritance beyond plain heritability: variance controlling genes in Arabidopsis thaliana. Submitted.
VI Shen, X., Alam, M., Fikse, F. and Rönnegård, L. (2012). Fast generalized ridge regression for models including heteroscedastic effects in quantitative genetics. Manuscript.

Reprints were made with permission from the publishers.

Contents

Part I: Introduction
1 Background & Aims
2 Discovering Genetic Architecture
2.1 Single-predictor analysis
2.1.1 QTL analysis
2.1.2 Genome-wide association study
2.2 Multiple-predictor analysis
2.2.1 Polygenic effects estimation
2.2.2 Interaction effects and variance heterogeneity
3 Predictive Modeling
Part II: Summary of Papers
4 Implementing Hierarchical Generalized Linear Models (Paper I)
5 Quantitative Trait Loci Interval Mapping
5.1 Variance component QTL model (Paper II)
5.2 QTL regression model (Paper IV)
6 Fitting The Entire Genome
6.1 Double HGLM (Paper III)
6.2 Heteroscedastic effects model (Paper VI)
7 Beyond Plain Heritability: Variance-Controlling Genes (Paper V)
Part III: Discussion & Conclusion
8 Discussion
8.1 Heritability: How much can we explain?
8.2 New data types: How to integrate information?
8.3 Future development
9 Conclusion
References

Nomenclature

AMD  age-related macular degeneration
ANOVA  analysis of variance
BLUP  best linear unbiased predictor
BMI  body mass index
cM  centi-Morgan
CRP  C-reactive protein
DHGLM  double HGLM
DNA  deoxyribonucleic acid
EBV  estimated breeding value
EM  expectation-maximization
FPR  false positive rate
GBLUP  genomic BLUP
GC  genomic control
GEBV  genomic EBV
GLM  generalized linear model
GLMM  generalized linear mixed model
GM  genetically modified
GO  gene ontology
GRAMMAR  genome-wide rapid association using mixed model and regression
GWAS  genome-wide association study
h-likelihood  hierarchical likelihood
HDL  high-density lipoprotein
HEM  heteroscedastic effects model
HGLM  hierarchical generalized linear model
HMM  hidden Markov model
IBD  identity-by-descent
IBS  identity-by-state
IWLS  iterative weighted least squares
LAF  low-variance allele frequency
LASSO  least absolute shrinkage and selection operator
LD  linkage disequilibrium
LM  (normal) linear model (regression)
LMM  (normal) linear mixed model
LOD  logarithm of odds
LRT  likelihood ratio test
MCMC  Markov chain Monte Carlo
ML  maximum likelihood
MME  mixed model equations
PCA  principal component analysis
PQL  penalized quasi-likelihood
QTL  quantitative trait locus/loci
REML  restricted maximum likelihood
RR  ridge regression
SNP  single nucleotide polymorphism
SSR  simple sequence repeat
TBV  true breeding value
VC  variance component
vGWAS  variance-heterogeneity GWAS
vQTL  variance-controlling QTL

PART I: INTRODUCTION

1. Background & Aims

“We have now entered a new era of large-scale genetics unthinkable even a few years ago.”
-- Peter Donnelly

It was by investigating genetics that modern statistics was founded. GALTON (1886) regressed the mid-parents’ heights on their children’s, which opened the gate to a science that filters out knowledge from chaos, i.e. statistics.
FISHER (1918) studied correlation in Mendelian inheritance, which brought analysis of variance (ANOVA) into the field of probability theory and statistics. Genetics and statistics seem destined to meet each other because of the different kinds of uncertainty in heredity that we still do not understand.

Another classic example of genetics driving the development of statistics is the mixed model, a.k.a. random effects model or variance component model, which is the basis of most studies described in this thesis. Millions of poultry and livestock are evaluated every year via what are called Henderson’s mixed model equations (HENDERSON 1953, 1984; VAN VLECK 1998). Although some statisticians initially objected that one should not calculate point estimates for random effects, the mixed model equations have been proven to give the so-called best linear unbiased prediction, or BLUP (ROBINSON 1991). BLUP is in fact a general method for estimating random effects and dealing with correlated observations, and it has great predictive power. Nowadays, BLUP is widely applied not only in animal breeding and genetics but also in technology and the social sciences.

The goal of statistical genetics is to analyze genetic data and explain how variation in the observed traits is affected by genetic variation. Since relatedness among individuals is one of the essential attributes of statistical analysis in genetics, the key to achieving this goal is to trace inheritance among individuals. Before molecular markers were applied, inheritance could be traced only through kinship constructed from a pedigree structure. Such kinship can be used to model inheritance in, e.g., farm animals and human families. After molecular markers became available, statistical tools for genetic analyses became more diverse.
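As a minimal sketch of how the mixed model equations and a pedigree-based relationship matrix fit together, the Python snippet below solves Henderson's equations for the model y = Xb + Zu + e with Var(u) = G and Var(e) = R. The function name solve_mme, the toy pedigree, the phenotypes, and the variance components (treated as known here) are all illustrative assumptions and are not taken from the thesis or its papers.

```python
import numpy as np

def solve_mme(y, X, Z, G, R):
    """Solve Henderson's mixed model equations for y = X b + Z u + e.

    G = Var(u) and R = Var(e) are assumed known and invertible.
    Returns (b_hat, u_hat): fixed-effect estimates and BLUPs of u.
    """
    Rinv = np.linalg.inv(R)
    Ginv = np.linalg.inv(G)
    # Left-hand side: the coefficient matrix of the mixed model equations.
    lhs = np.block([
        [X.T @ Rinv @ X, X.T @ Rinv @ Z],
        [Z.T @ Rinv @ X, Z.T @ Rinv @ Z + Ginv],
    ])
    rhs = np.concatenate([X.T @ Rinv @ y, Z.T @ Rinv @ y])
    solution = np.linalg.solve(lhs, rhs)
    p = X.shape[1]
    return solution[:p], solution[p:]

# Hypothetical pedigree: unrelated parents (1, 2) and two full-sib offspring (3, 4).
# A is the additive relationship matrix (twice the kinship coefficients).
A = np.array([
    [1.0, 0.0, 0.5, 0.5],
    [0.0, 1.0, 0.5, 0.5],
    [0.5, 0.5, 1.0, 0.5],
    [0.5, 0.5, 0.5, 1.0],
])

sigma_u2 = 2.0   # assumed additive genetic variance
sigma_e2 = 1.0   # assumed residual variance

y = np.array([10.0, 12.0, 11.5, 13.0])   # made-up phenotypes
X = np.ones((4, 1))                      # overall mean as the only fixed effect
Z = np.eye(4)                            # one record per individual
G = sigma_u2 * A
R = sigma_e2 * np.eye(4)

b_hat, u_hat = solve_mme(y, X, Z, G, R)
print("estimated mean:", b_hat)
print("predicted breeding values:", u_hat)
```

The same solution can be obtained from the generalized least squares estimate of b and the formula u_hat = G Z' V^{-1} (y - X b_hat) with V = Z G Z' + R; the mixed model equations avoid inverting V, which is what makes them practical for large evaluations. In practice the variance components would be estimated, for example by REML, rather than fixed in advance.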
Because the inheritance of a specific segment of DNA can be traced, statistical tests can help identify genes or quantitative trait loci (QTL) that regulate quantitative traits. Analyzing most traits is difficult because of the complexity of the genetic architecture underlying them. Owing to the fast development of genotyping and sequencing techniques, statistical analyses for different genetics problems have started using the same kind of high-dimensional genomic data. Both QTL interval mapping and genome-wide association studies (GWAS) utilize the genetic information carried by high-density SNPs to identify major genetic loci that affect complex traits. Using the same high-density SNPs, breeders also reconstruct the kinship of the studied individuals directly by comparing their genomes and thereby give better evaluations of the animals, i.e. GBLUP. It is now an exciting moment in quantitative genetics: a consistent type of high-dimensional data can be used to address so many of the problems regarding inheritance that we are interested in.

The aim of this thesis is to develop new statistical tools in both QTL analysis and genomic evaluation. Most of the proposed statistical methods focus on modeling genetic variance, especially using random effects models. Properly modeling genetic variance could help us better understand the genetic contribution to various types of complex traits.

The thesis contains three parts: the introduction, the summary of papers, and the discussion. Part I briefly introduces the background knowledge related to my research. Chapter 2 introduces the basics of several statistical tools for gene mapping, including QTL analysis, GWAS, whole-genome shrinkage estimation methods, and interactions and their connection to the idea of variance-controlling genes. Chapter 3 introduces the use of dense genomic markers in prediction and discusses model selection. Part II summarizes the papers that this