
Novel Statistical Methods in Quantitative Genetics


Dedicated to my family in Shiyan, Hubei, China, especially to my grandparents for their indispensable guidance all these years.

List of papers

This thesis is based on the following papers, which are referred to in the text by their Roman numerals.

I Rönnegård, L., Shen, X. and Alam, M. (2010). hglm: a package for fitting hierarchical generalized linear models. The R Journal. 2(2):20-28.
II Shen, X., Rönnegård, L. and Carlborg, Ö. (2011). How to deal with genotype uncertainty in variance component quantitative trait loci analyses. Genetics Research, Cambridge. 93(5):333-342.
III Shen, X., Rönnegård, L. and Carlborg, Ö. (2011). Hierarchical likelihood opens a new way of estimating genetic values using genome-wide dense marker maps. BMC Proceedings. 5(Suppl 3):S14.
IV Nelson, R., Shen, X. and Carlborg, Ö. (2011). qtl.outbred: interfacing outbred line cross data with the R/qtl mapping software. BMC Research Notes. 4:154.
V Shen, X., Pettersson, M., Rönnegård, L. and Carlborg, Ö. (2012). Inheritance beyond plain heritability: variance-controlling genes in Arabidopsis thaliana. Submitted.
VI Shen, X., Alam, M., Fikse, F. and Rönnegård, L. (2012). Fast generalized ridge regression for models including heteroscedastic effects in . Manuscript.

Reprints were made with permission from the publishers.

Contents

Part I: Introduction

1 Background & Aims

2 Discovering Genetic Architecture
  2.1 Single-predictor analysis
    2.1.1 QTL analysis
    2.1.2 Genome-wide association study
  2.2 Multiple-predictor analysis
    2.2.1 Polygenic effects estimation
    2.2.2 Interaction effects and variance heterogeneity

3 Predictive Modeling

Part II: Summary of Papers

4 Implementing Hierarchical Generalized Linear Models (Paper I)

5 Quantitative Trait Loci Interval Mapping
  5.1 Variance component QTL model (Paper II)
  5.2 QTL regression model (Paper IV)

6 Fitting The Entire Genome
  6.1 Double HGLM (Paper III)
  6.2 Heteroscedastic effects model (Paper VI)

7 Beyond Plain Heritability: Variance-Controlling Genes (Paper V)

Part III: Discussion & Conclusion

8 Discussion
  8.1 Heritability: How much can we explain?
  8.2 New data types: How to integrate information?
  8.3 Future development

9 Conclusion

References

Nomenclature

AMD  age-related macular degeneration
ANOVA  analysis of variance
BLUP  best linear unbiased predictor
BMI  body mass index
cM  centi-Morgan
CRP  C-reactive protein
DHGLM  double HGLM
DNA  deoxyribonucleic acid
EBV  estimated breeding value
EM  expectation-maximization
FPR  false positive rate
GBLUP  genomic BLUP
GC  genomic control
GEBV  genomic EBV
GLM  generalized linear model
GLMM  generalized linear mixed model
GM  genetically modified
GO  gene ontology
GRAMMAR  genome-wide rapid association using mixed model and regression
GWAS  genome-wide association study
h-likelihood  hierarchical likelihood
HDL  high-density lipoprotein
HEM  heteroscedastic effects model
HGLM  hierarchical generalized linear model
HMM  hidden Markov model
IBD  identity-by-descent
IBS  identity-by-state
IWLS  iterative weighted least squares
LAF  low-variance allele frequency
LASSO  least absolute shrinkage and selection operator
LD  linkage disequilibrium
LM  (normal) linear model (regression)
LMM  (normal) linear mixed model
LOD  logarithm of odds
LRT  likelihood ratio test
MCMC  Markov chain Monte Carlo
ML  maximum likelihood
MME  mixed model equations
PCA  principal component analysis
PQL  penalized quasi-likelihood
QTL  quantitative trait locus/loci
REML  restricted maximum likelihood
RR  ridge regression
SNP  single nucleotide polymorphism
SSR  simple sequence repeat
TBV  true breeding value
VC  variance component
vGWAS  variance-heterogeneity GWAS
vQTL  variance-controlling QTL

PART I: INTRODUCTION

1. Background & Aims

“We have now entered a new era of large-scale genetics unthinkable even a few years ago.”

—— Peter Donnelly

It was by investigating genetics that modern statistics was founded. GALTON (1886) regressed the mid-parents’ heights on their children’s, which opened the gate to a science that filters out knowledge from chaos, i.e. statistics. FISHER (1918) studied correlation in , which brought analysis of variance (ANOVA) into the field of probability theory and statistics. Genetics and statistics seem destined to meet each other because of the different kinds of uncertainty in that we still do not understand.

Another classic example of genetics driving the development of statistics is the mixed model, a.k.a. random effects model or variance component model, which is the basis of most studies described in this thesis. Millions of poultry and livestock are evaluated every year via what are called Henderson’s mixed model equations (HENDERSON 1953, 1984; VAN VLECK 1998). Although some statisticians initially objected that one should not calculate point estimates for random effects, the mixed model equations have been proven to give the so-called best linear unbiased prediction, or BLUP (ROBINSON 1991). It is in fact a general method for estimating random effects and dealing with correlated observations, and it has great predictive power. Nowadays, BLUP is widely applied not only in and genetics but also in technology and social sciences.

The goal of statistical genetics is to analyze genetic data and explain how the variation in the observed traits is affected by . Since one of the essential attributes of statistical analysis in genetics is the relatedness among individuals, the key to achieving such a goal is to trace inheritance among individuals. Before molecular markers were applied, inheritance could be traced only through kinship constructed from a pedigree structure. Such a kinship can be used to model inheritance in e.g. farm animals and human families.
After molecular markers became available, statistical tools for genetic analyses became more diverse. By tracing the inheritance of a specific segment of DNA, statistical tests can help identify genes or quantitative trait loci (QTL) that regulate quantitative traits. Analyzing most traits is difficult because of the complexity of the genetic architecture underlying them. Due to the fast development of and techniques, statistical analyses for different genetics problems now start using the same kind of high-dimensional genomic data. Both QTL interval mapping and genome-wide association studies (GWAS) utilize the genetic information carried by high-density SNPs to identify major genetic loci that affect . Also using the same high-density SNPs, breeders reconstruct the kinship of the studied individuals directly by comparing their and give better evaluations of the animals, i.e. GBLUP. This is an exciting moment in quantitative genetics: a consistent type of high-dimensional data can be used to address so many of the problems regarding inheritance that interest us.

The aim of this thesis is to develop new statistical tools in both QTL analysis and genomic evaluation. Most of the proposed statistical methods focus on modeling genetic variance, especially using random effects models. Properly modeling genetic variance could help us better understand the genetic contribution to various types of complex traits.

The thesis contains three parts: the introduction, the summary of papers, and the discussion. Part I briefly introduces the background knowledge related to my research. Chapter 2 introduces the basics of several statistical tools for , including QTL analysis, GWAS, whole-genome shrinkage estimation methods, and interactions with the connection to the idea of variance-controlling genes. Chapter 3 introduces the use of genomic dense markers in prediction and discusses model selection. Part II summarizes the papers that this thesis is based on. Chapter 4 describes the implementation of the hglm package in R (Paper I), which is a fundamental tool for fitting random effects models in most of my other papers. Chapter 5 covers both a theoretical investigation (Paper II) and an application tool (Paper IV) in QTL interval mapping; the former concerns variance component models and the latter linear regression. Chapter 6 extends the analysis strategy from repeated single marker testing to fitting multiple markers simultaneously.
A double-layer random effects model allowing heteroscedastic marker effects was found to be powerful in both QTL identification and genomic evaluation (Paper III). Simplifying this double-layer model made it possible to use the classic Henderson’s mixed model equations to re-weight markers according to their uneven contribution to the variation of the studied trait (Paper VI). Chapter 7, at the end, brings up a different way of looking at genetic variance through variance heterogeneity (Paper V), which has recently become a compelling topic in quantitative genetics research. Part III discusses some topics related to this thesis. Chapter 8 discusses two major challenges in current statistical genetics research, missing heritability and new data types, and foresees what should be done in the near future. Chapter 9 concludes the thesis by summarizing the contributions of the work.

2. Discovering Genetic Architecture

“Simple vs. Complex traits: the real definition - Simple: things we have deluded ourselves into thinking we understand. Complex: things we’re pretty sure we don’t understand.”

—— Eleanor Feingold

From Mendelian factors to appearance, from to , from what is concealed to what is revealed, understanding the genetic architectures underlying our appearance, diseases, fitness, and so on, has always been an ultimate goal of genetics research. Once a complete genetic function or pathway is understood, one can tell the story behind a phenotypic phenomenon, so that a valid prediction or even genetic modification can be made.

Some “simple” traits, for instance some appearance traits, can be explained quite well through the variation of a single gene. One of our most essential traits, sex, is determined by the gene SRY (sex-determining region Y) on the Y chromosome (WALLIS et al. 2008). This gene, existing in all placental mammals and marsupials, encodes a protein (therian testis determining factor, TDF) that initiates male sex determination. In other words, this single gene SRY genetically modifies a female into a male. Besides sex, for example, rolling tongue is a dominant trait determined by a single gene, and woolly hair is a recessive trait also under the regulation of a single gene. These traits, a.k.a. Mendelian traits, can be explained simply by a gene that follows Mendel’s law of segregation.

However, most of the traits we see are rather complicated. They are, usually partially, affected by several genes. The genes influence but rarely determine the observed traits; therefore, complex traits are not 100% hereditary. For instance, human height has recently been shown to be rather polygenic, affected by plenty of genes along the entire genome that have very small effects (YANG et al. 2011a). Furthermore, there is no doubt that a trait like human height is modified by nutrition and other environmental factors.
Hence, most of the time, when a complex trait is studied, one can expect to detect several genes or loci in the genome that have relatively strong effects, while many other genes with small effects remain undetectable because of the polygenic nature of the studied trait and the limited detection power of the statistical tests.

Genes have very different magnitudes of effects, and the underlying genetic pathways and networks are fairly complex. Unfortunately, our capacity to obtain sufficient sample sizes is too limited to model or reveal the complete pathways underlying a particular complex trait. What we often do is focus on the additive marginal effect of each locus, trying to extract as much genetic variance as possible from the genome.

There are two main drawbacks in the current routine that are difficult to overcome. First, only the loci that show additive effects can be discovered. Additive effects contribute the most to prediction; however, non-additive effects such as effects from gene-gene interactions, a.k.a. epistasis, are missing. The second drawback could be more worrying: only the loci with sufficiently big effects can be detected. As mentioned above, there are complex traits that are polygenic, which indicates that many loci with small effects contribute to the trait. Even when the sample size is large, statistical power only allows mapping loci with strong effects. The detection power restricts us to a certain level of effect size.

In this chapter, I classify the popular statistical genetics methods for gene mapping into two categories: single- and multiple-predictor analyses. The single-predictor analysis covers QTL interval mapping in experimental designs and genome-wide association studies. The section on multiple-predictor analysis starts with multiple additive effects modeling, which fits the entire genome simultaneously. Thereafter, gene-gene and gene-environment interactions are briefly introduced, together with their connection to the idea of detecting genes that show variance heterogeneity.

2.1 Single-predictor analysis

This section briefly covers two types of single-predictor analysis strategies: QTL analysis and genome-wide association study (GWAS). I call them “single-predictor” because the predictor is either a single genetic marker (GWAS) or a single position between flanking markers (QTL analysis). This type of method focuses on one position on the genome at a time, fits a parametric model, and performs a hypothesis test on the genetic parameter. Both QTL analysis and GWAS require a genome-wide screening for significant loci. Single-predictor analysis is the simplest and most direct way to detect potential causal genes for quantitative traits.

2.1.1 QTL analysis

A QTL (quantitative trait locus) is a chromosomal segment containing one or more genes that contribute to the variation observed for a quantitative trait. The genetic effect of a QTL is the combined effect of the genes located in the segment. GARDNER and LATTA (2007) reported that the average confidence interval around a QTL is about 15.6 cM, based on more than 200 mapped QTL. Generally speaking, QTL analysis is a statistical method that links two types of information, phenotypic data (trait measurements) and genotypic data (DNA variants), in an attempt to explain the genetic basis of variation in complex traits. QTL analysis allows researchers in fields as diverse as , , and medicine to link complex phenotypes to specific regions of . The goal of a QTL analysis is to identify the action, interaction, number, and precise location of the regions (QTL) affecting a certain complex trait of interest. Two things are required in order to conduct a QTL analysis in an experimental population:

1. Two or more strains of that differ genetically with regard to the trait of interest;
2. Genetic markers that distinguish between these parental lines.

By typing genetic markers as tags along the genome, DNA information of a population can be obtained. The QTL interval mapping strategy was developed when genetic markers were not as dense across the genome as today's SNP arrays. When people started using micro-satellite markers (SSRs), every marker, regardless of its information content, could be very precious. Techniques such as interval mapping were invented for mining as much information as possible from the limited number of sparse markers, trying to detect QTL harbored within flanking markers.

An example of a powerful experimental design is the F2 intercross. To carry out the QTL analysis, the parental strains are intercrossed, resulting in heterozygous (F1) individuals. These individuals are then mated to produce F2 individuals. The phenotypes and genotypes of the derived (F2) population are scored. Such an experimental design makes it possible to do QTL interval mapping, which is a linkage analysis technique that infers genotypes between flanking markers in order to identify QTL not lying on the marker positions. The possibility of doing such inference comes from recombination/crossing-over between homologous chromosomes. During meiosis, chromosome segments are shuffled, so that the pieces, including the genetic markers therein, that are genetically linked to a QTL influencing the trait of interest will co-segregate with trait values more frequently, whereas unlinked markers will not show significant association with the phenotype (Figure 2.1).

Statistical modeling is of central importance for identifying QTL. It is an open question how the phenotypic and genotypic data should be associated. The simplest linear model (2.1) considers the phenotypic value $y_{ik}$ of the i:th individual with marker genotype $k$ as a mean value $\mu$ plus a marker effect $b_k$ and a residual error $e_{ik}$ that is usually assumed to be normally distributed.

$$y_{ik} = \mu + b_k + e_{ik} \tag{2.1}$$

This is a one-way ANOVA model, with the presence of a QTL being indicated by a significant between-genotype variance. Instead of using only marker genotypes, for interval mapping with genotype uncertainty, the maximum likelihood (ML) method uses the full information from the marker-trait distribution, so it is expected to be more powerful. Assuming that the distribution of the phenotype for an individual with QTL genotype $Q_k$ is normal with mean $\mu_k$ and variance $\sigma^2$, the likelihood for individual $i$ with phenotypic value $y_{ik}$ given marker genotype $M_j$ is

$$L(y_{ik} \mid M_j) = \sum_{k=1}^{N} \varphi(y_{ik}, \mu_k, \sigma^2)\, P(Q_k \mid M_j) \tag{2.2}$$

where $\varphi(y_{ik}, \mu_k, \sigma^2)$ denotes the density function for a normal distribution with mean $\mu_k$ and variance $\sigma^2$, and a total number of $N$ QTL genotypes is assumed (see also LYNCH and WALSH 1998). In the likelihood framework, testing whether a QTL is associated with the trait of interest is based on the LRT statistic,

$$\lambda = -2 \log\left( \frac{\max L_0(y)}{\max L(y)} \right) \tag{2.3}$$

where $L_0(y)$ is the likelihood under the null hypothesis assuming no QTL, so that $\lambda \geq 0$. By plotting the likelihood-ratio statistic (or a closely related quantity, e.g. the LOD score) as a function of map position of the putative QTL, the likelihood profile displays graphically the amount of support for a QTL at a particular map position.
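As a minimal sketch of how (2.2) and (2.3) fit together, the following Python code evaluates the mixture log-likelihood for given parameter values and converts a pair of maximized log-likelihoods into the LRT statistic and a LOD score. The function names and the toy data are hypothetical; the code does not perform the maximization itself (in practice an EM algorithm would). For illustration it uses the special case where QTL genotypes are known, so that the ML estimates are simply group means and the ML residual variance.

```python
import math

def normal_pdf(y, mean, var):
    """Density of a normal distribution N(mean, var) evaluated at y."""
    return math.exp(-(y - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def mixture_loglik(ys, geno_probs, means, var):
    """Log of the product over individuals of the likelihood (2.2).

    ys         : phenotypes, one per individual
    geno_probs : for each individual, P(Q_k | M_j) for every QTL genotype k
    means      : genotype means mu_k
    var        : common residual variance sigma^2
    """
    total = 0.0
    for y, probs in zip(ys, geno_probs):
        lik = sum(p * normal_pdf(y, m, var) for p, m in zip(probs, means))
        total += math.log(lik)
    return total

def lrt_and_lod(loglik_alt, loglik_null):
    """LRT statistic (2.3) and the LOD score, LOD = LRT / (2 ln 10)."""
    lrt = 2.0 * (loglik_alt - loglik_null)
    return lrt, lrt / (2.0 * math.log(10.0))

# Toy data (hypothetical): genotypes are known, so probabilities are 0 or 1.
ys = [1.0, 1.2, 2.0, 2.2, 3.0, 3.2]
probs = [(0, 0, 1), (0, 0, 1), (0, 1, 0), (0, 1, 0), (1, 0, 0), (1, 0, 0)]
means_alt = [3.1, 2.1, 1.1]  # group means for QQ, Qq, qq

# ML residual variance under the QTL model and under the null model.
fitted = [sum(p * m for p, m in zip(pr, means_alt)) for pr in probs]
var_alt = sum((y - f) ** 2 for y, f in zip(ys, fitted)) / len(ys)
mean_null = sum(ys) / len(ys)
var_null = sum((y - mean_null) ** 2 for y in ys) / len(ys)

loglik_alt = mixture_loglik(ys, probs, means_alt, var_alt)
loglik_null = mixture_loglik(ys, [(1,)] * len(ys), [mean_null], var_null)
lrt, lod = lrt_and_lod(loglik_alt, loglik_null)
```

In this known-genotype case the statistic reduces to n times the log ratio of the two ML variances, which gives a quick check that the pieces are wired together correctly.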

Figure 2.1. A schematic diagram for a QTL mapping process in an F2 intercross design. a, a pair of homologous chromosomes in the F1 population recombine to produce gametic chromosomes that exist in the F2 population. b, a linkage analysis results in a test statistic (e.g. likelihood ratio) profile along the chromosome, where after a permutation test, the peak above the significance threshold is considered as a significant QTL.

ML estimators were originally computationally demanding and required specialized algorithms, such as the EM algorithm (LANDER and BOTSTEIN 1989; DEMPSTER et al. 1977). Fortunately, a creative use of regression was shown to often provide an excellent approximation to the ML solution. HALEY and KNOTT (1992) proposed a simple regression directly on the genotype probabilities that approximates the likelihood profile very well for ML interval mapping. Using the parameterization of FALCONER and MACKAY (1996), assuming two alleles $Q$ and $q$ at the QTL, the genotypic means are $\mu_{QQ} = \mu + a$, $\mu_{Qq} = \mu + d$, $\mu_{qq} = \mu - a$, where $a$ and $d$ are the additive and dominance effects, respectively. The regression is conducted as

$$y_{ik} = \mu + a \cdot x_a(Q_k) + d \cdot x_d(Q_k) + e_{ik} \tag{2.4}$$

Taking the expectation over all individuals with QTL genotype $Q_k$ gives

$$\mu_{Q_k} = \mu + a \cdot x_a(Q_k) + d \cdot x_d(Q_k) \tag{2.5}$$

and given marker genotype $M_j$, we also have

\begin{align}
\mu_{Q_k} &= (\mu + a)P(QQ \mid M_j) + (\mu + d)P(Qq \mid M_j) + (\mu - a)P(qq \mid M_j) \tag{2.6} \\
&= \mu + a \cdot [P(QQ \mid M_j) - P(qq \mid M_j)] + d \cdot P(Qq \mid M_j) \tag{2.7}
\end{align}

so that comparing (2.5) with (2.7) gives

$$x_a(Q_k) = P(QQ \mid M_j) - P(qq \mid M_j) \tag{2.8}$$

$$x_d(Q_k) = P(Qq \mid M_j) \tag{2.9}$$

Hence, as long as the genotype probabilities are inferred using the recombination frequencies and flanking marker information, a QTL scan can be done by directly regressing the phenotypic records on $P(QQ \mid M_j) - P(qq \mid M_j)$ and $P(Qq \mid M_j)$ for each map position. When it comes to an outbred line cross, calculating genotype probabilities can be more complicated than in an inbred line cross. However, in any case, the same QTL regression model can be applied. NELSON et al. (2011) used this fact to connect two pieces of software: a fast program for calculating genotype probabilities in outbred line crosses (NETTELBLAD et al. 2009) and a popular tool for mapping QTL in inbred line crosses (BROMAN 2003).

It should be noted that “ghost” QTL can show up when two or more linked QTL are located on the same chromosome, because the chromosome segment between two linked QTL can carry information from both sides. ZENG (1994) showed that when mapping QTL between markers $j$ and $j + 1$, including markers $j - 1$ and $j + 2$ as “guards” could properly account for the QTL information outside the current interval, which is known as composite interval mapping.

Another popular way of modeling QTL effects is to model the effects of QTL alleles as random instead of fixed effects. The fixed effects models assume a pre-defined number of QTL alleles and try to estimate their individual allele substitution effects, whereas the random effects models assume that the founder alleles of the population are drawn from a distribution, and the properties of this distribution are inferred in the analysis.
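Returning to the regression of (2.4) with the regressors (2.8) and (2.9), the scan at a single map position can be sketched as below. This is a minimal illustration in plain Python, not the implementation used in any of the cited software: the genotype probabilities are assumed to be supplied by some other program, and the function names, the small Gauss-Jordan solver, and the noise-free toy data are all hypothetical.

```python
def hk_regressors(p_QQ, p_Qq, p_qq):
    """Additive and dominance regressors from (2.8) and (2.9)."""
    return p_QQ - p_qq, p_Qq

def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def hk_fit(ys, probs):
    """Least-squares estimates (mu, a, d) of model (2.4) at one position.

    probs holds (P(QQ|M), P(Qq|M), P(qq|M)) for each individual.
    """
    rows = []
    for p in probs:
        xa, xd = hk_regressors(*p)
        rows.append([1.0, xa, xd])
    # Normal equations X'X b = X'y for the three regression coefficients.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    return solve_linear(xtx, xty)

# Toy data (hypothetical): phenotypes generated without noise from
# mu = 10, a = 2, d = 0.5, so the fit should recover these values.
probs = [(0.9, 0.1, 0.0), (0.1, 0.8, 0.1), (0.0, 0.1, 0.9),
         (0.25, 0.5, 0.25), (0.6, 0.3, 0.1)]
ys = [10.0 + 2.0 * (p[0] - p[2]) + 0.5 * p[1] for p in probs]
mu_hat, a_hat, d_hat = hk_fit(ys, probs)
```

A genome scan would repeat `hk_fit` at each map position and compare the residual sum of squares with that of the null model containing only the mean.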
To treat the allele substitution effects as random, one can apply a variance component model at each chromosomal position, which has the form

$$y = X\beta + Zu + e \tag{2.10}$$

where $y$ is the trait response vector, $\beta$ is the fixed effect vector, $u$ is the random QTL effect that has a zero mean and variance $\mathrm{Var}(u) = \frac{1}{2}\sigma_g^2 I_q$, and $e$ is the error term with a zero mean and variance $\mathrm{Var}(e) = \sigma_e^2 I_N$. $u$ includes the founder allele substitution effects, and the incidence matrix $Z$ relates individuals and their inherited allele substitution effects. Instead of $Z$, genetic information at a certain chromosomal position is often obtained by calculating the IBD matrix $\Pi$ such that (RÖNNEGÅRD and CARLBORG 2007)

$$\Pi = \frac{1}{2} Z Z' \tag{2.11}$$

The variance-covariance matrix of $y$ is then

$$V = \Pi \sigma_g^2 + I \sigma_e^2 \tag{2.12}$$

where $\sigma_g^2$ is the genotypic variance and $\sigma_e^2$ is the residual variance. Since the QTL effects are modeled as random, the IBD matrix $\Pi$ is actually a correlation matrix for the correlated random effects. Each correlation element is inferred not only from the pedigree structure but also from the information of the markers flanking the tested position. The existence of QTL is therefore determined by testing the genetic variance component $\sigma_g^2$. To adjust the bias of variance components estimation, REML is commonly used, where the adjusted profile likelihood (PAWITAN 2001; LEE et al. 2006) of $\theta = (\sigma_g^2, \sigma_e^2)'$

$$L(\theta \mid \Pi, y) = f(y \mid \Pi, \theta) = |2\pi V|^{-1/2} \exp\left\{ -\frac{1}{2} (y - X\hat{\beta})' V^{-1} (y - X\hat{\beta}) \right\} \left| \frac{X' V^{-1} X}{2\pi} \right|^{-1/2} \tag{2.13}$$

is maximized to estimate and test $\theta$. Such a variance component QTL model is a powerful tool in QTL interval mapping, especially in outbred crosses. For instance, RÖNNEGÅRD et al. (2008) developed flexible intercross analysis based on variance component QTL models, where by including information about the population structure, i.e. line-origin, in the IBD matrix, it is possible to model and test the magnitude of segregation within each parental line. SHEN et al. (2011b) considered the complex uncertainty in the IBD matrix itself and showed that a variance component QTL model gains more power in QTL identification when the full likelihood is applied.

In general, QTL analysis is regarded as a low-power method, especially in terms of mapping accuracy. Large sample sizes are required to map QTL with precision. Roughly, with 200-300 F2 individuals, a QTL accounting for 5% of total variation can be mapped into a 40 cM interval, and over 10 000 F2 individuals are required to map this QTL into a 1 cM interval (LYNCH and WALSH 1998). Most of the time, even 1 cM is not a short interval at all in terms of the number of genes lying under the significant QTL peak. Therefore, one would not consider using an F2 population to fine-map a QTL. When a dense marker map is available for a certain population, association mapping can offer better mapping precision.
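To make the covariance structure of (2.10) to (2.12) concrete, the sketch below builds the IBD matrix from a small incidence matrix and then assembles the covariance matrix of the phenotypes. The three individuals, four founder alleles, and variance values are hypothetical, and the incidence matrix is assumed to be known exactly; in a real analysis the elements of the IBD matrix are inferred from pedigree and flanking-marker information.

```python
def ibd_matrix(z):
    """Pi = (1/2) Z Z' as in (2.11); z is a list of rows of Z."""
    n = len(z)
    return [[0.5 * sum(a * b for a, b in zip(z[i], z[j])) for j in range(n)]
            for i in range(n)]

def phenotypic_covariance(pi, var_g, var_e):
    """V = Pi * sigma_g^2 + I * sigma_e^2 as in (2.12)."""
    n = len(pi)
    return [[pi[i][j] * var_g + (var_e if i == j else 0.0) for j in range(n)]
            for i in range(n)]

# Hypothetical incidence matrix: rows are individuals, columns are founder
# alleles A1..A4. Individual 1 carries (A1, A2), individual 2 carries
# (A1, A3), and individual 3 carries (A3, A4).
Z = [[1, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 0, 1, 1]]

PI = ibd_matrix(Z)
V = phenotypic_covariance(PI, var_g=2.0, var_e=1.0)
```

Individuals sharing one founder allele at the tested position get an IBD coefficient of 0.5, those sharing none get 0, and each non-inbred individual has 1 on the diagonal, which is why the matrix behaves as a correlation matrix for the random QTL effects.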

2.1.2 Genome-wide association study

GWAS is a simple idea that many researchers did not believe in even a few years ago. Unlike a QTL analysis, the markers themselves are tested instead of the intervals flanked by markers. In order to obtain good power, a large sample size is needed in a population-based association study. Since the LD blocks are so small compared to those in e.g. an F2 population, in order to map a causal locus, a very dense marker map is required so that a certain marker can be in strong linkage with the causal gene. In a cover letter of Science published seven years ago, KLEIN et al. (2005) reported a causal polymorphism of complement factor H that regulates age-related macular degeneration (AMD). Although the detection of this gene was statistically very lucky, it successfully indicated the inherent potential of the GWAS strategy. From then on, more and more loci were mapped via GWAS, regulating, for instance, human disease-related traits like blood pressure (LEVY et al. 2009), blood lipids (AULCHENKO et al. 2009; TESLOVICH et al. 2010), coronary heart disease (WANG et al. 2011), breast cancer (TURNBULL et al. 2010), uterine fibroids (CHA et al. 2011), etc., as well as complex traits in other species such as mice (VALDAR et al. 2006), maize (TIAN et al. 2011), Arabidopsis (ATWELL et al. 2010) and so on.

Figure 2.2. A schematic flowchart for a genome-wide association study. A narrow GWAS data analysis basically includes the 3rd and 4th steps, where statistical tests are performed throughout the genome at each SNP marker, and the LD pattern under the significant association signals is examined.

Unlike QTL analyses in experimental designs, GWAS does not require an artificial cross of parental lines. Only a randomly sampled population is needed. This certainly sacrifices the advantage of localizing QTL between flanking markers; however, if the SNPs are dense enough, given a sufficient sample size, a group of polymorphisms that are linked to the causal gene would show significant association. Such an association signal generally has much better resolution than a QTL analysis in an intercross design. Figure 2.2 shows how a GWAS is usually conducted, where the association scan in the middle is the most important step that statistical modeling contributes to. The LD pattern can be used to derive haplotypes for testing window-wise associations. Bonferroni correction is commonly used to determine the genome-wide significance threshold; however, it can be conservative because of LD, or too liberal because of confounding in the population. A significance threshold from permutations can be a convincing alternative. Nevertheless, in order to achieve “overwhelming” significance, replicating the detected association is the ultimate solution, which is especially important in a drug discovery process (see also KINGSMORE et al. 2008). When the same analysis routine is done in several different studies, a further meta-analysis can accumulate power from the combined results.
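The two ways of setting a genome-wide threshold mentioned above can be sketched as follows. This is an illustration under stated assumptions rather than the procedure of any particular GWAS package: the single-marker statistic is taken to be n times the squared correlation between phenotype and genotype code, the permutation scheme shuffles phenotypes and records the genome-wide maximum, and all names and the simulated data are hypothetical.

```python
import random

def score_stat(y, x):
    """n * r^2 for one marker, approximately chi-square(1) under no association."""
    n = len(y)
    my, mx = sum(y) / n, sum(x) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0  # monomorphic marker or constant phenotype
    return n * sxy * sxy / (sxx * syy)

def bonferroni_alpha(alpha, n_tests):
    """Per-test significance level under Bonferroni correction."""
    return alpha / n_tests

def permutation_threshold(y, genotypes, n_perm=200, alpha=0.05, seed=1):
    """Empirical upper-alpha quantile of the genome-wide maximum statistic.

    Phenotypes are shuffled to break any marker-trait association while
    keeping the LD structure among markers intact.
    """
    rng = random.Random(seed)
    y_perm = list(y)
    maxima = []
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        maxima.append(max(score_stat(y_perm, x) for x in genotypes))
    maxima.sort()
    index = min(n_perm - 1, int((1.0 - alpha) * n_perm))
    return maxima[index]

# Simulated data (hypothetical): 50 individuals, 20 unlinked markers.
rng = random.Random(42)
n_ind, n_markers = 50, 20
genotypes = [[rng.choice([0, 1, 2]) for _ in range(n_ind)]
             for _ in range(n_markers)]
phenotypes = [rng.gauss(0.0, 1.0) for _ in range(n_ind)]
threshold = permutation_threshold(phenotypes, genotypes)
```

Because the permuted data keep the correlation among markers, the resulting threshold adapts to LD, whereas the Bonferroni level treats all tests as independent.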

Many statistical methods have been developed particularly for GWAS (see BALDING 2006; CANTOR et al. 2010), together with quite a few computational tools (e.g. AULCHENKO et al. 2007b; PURCELL et al. 2007; YANG et al. 2010b). Most GWAS basically only look at the additive effect of each single SNP (the dominance effect may be included but due to the extra degree of freedom, it is often avoided). So the most common parametric model is

$$y_{ij} = \mu + x_j \beta + e_{ij} \tag{2.14}$$

for individual $i$ with SNP genotype $x_j$ (coded as 0, 1 and 2 for instance), where $y_{ij}$ is the phenotype, $\mu$ is the overall mean, $e_{ij}$ is the residual, and $\beta$ is the additive SNP effect. A p-value is obtained by performing a t- or Wald test on $\beta$. It is important for a GWAS analysis to store p-values (or equivalently, standard errors), which makes future meta-analysis possible. In a meta-analysis combining results from $N$ different studies, a pooled estimate of $\beta$ can be calculated as

$$\hat{\beta} = \frac{\sum_{i=1}^{N} w_i \hat{\beta}_i}{\sum_{i=1}^{N} w_i} \tag{2.15}$$

where $w_i = 1/s_i^2$ and $s_i$ is the standard error of $\hat{\beta}_i$. The standard error of $\hat{\beta}$ is computed as

$$s = \sqrt{\frac{H}{N}} \tag{2.16}$$

where $H$ is the harmonic mean of all the $s_i^2$'s. Hence, a meta-analysis reduces the standard error of the estimated effect by a factor of $\sqrt{N}$.

Among many issues in GWAS (see WANG et al. 2005b; MCCARTHY et al. 2008), population stratification might be the most worrying one that requires sophisticated statistical methods to handle (PRICE et al. 2010). A simple solution is genomic control (GC) (DEVLIN and ROEDER 1999). GC is used to shrink inflation of the test scores ($-\log_{10}$ p-values). When testing for the single genetic effect, say additive effect, in GWAS, the null distribution of the test statistic for the nominal p-values is $\chi^2$ with 1 degree of freedom. Since most of the SNPs are not expected to be associated with the trait, the sample distribution of the $\chi^2$'s across the genome is expected to resemble the null distribution. If inflation exists, an observed $\chi^2$ value becomes $\lambda \cdot \chi^2$; therefore the $\chi^2$'s can be adjusted using $\lambda$, i.e. the inflation factor estimated by comparing the distribution of the observed $\chi^2$'s and the $\chi^2$ distribution with 1 degree of freedom. $\lambda$ can be estimated in different ways. In the R package GenABEL (AULCHENKO et al. 2007b), $\lambda$ is estimated as the regression slope of the observed $\chi^2$'s on the null. The original $\lambda$ estimator proposed by DEVLIN and ROEDER (1999) is the ratio of the observed median of the $\chi^2$'s to the theoretical median of the $\chi^2$ distribution with 1 degree of freedom, $\chi^2(1)$, i.e.

$$\hat{\lambda} = \frac{\mathrm{median}\{\chi_1^2, \chi_2^2, \ldots, \chi_p^2\}}{\mathrm{median}\{\chi^2(1)\}} \tag{2.17}$$

where $p$ is the number of tested positions or a big subset of them. In practice, many publications used the number 0.456 to approximate $\mathrm{median}\{\chi^2(1)\}$; however, $\mathrm{median}\{\chi^2(1)\} \approx 0.455$ and is actually even slightly less than 0.455. This is a curiosity, since it means that many GWAS results have been published slightly over-liberal.

More sophisticated than GC, the principal component analysis (PCA) method proposed by PRICE et al. (2006) has been widely adopted, and more recently, mixed-model-based methods have become popular as well (e.g. AULCHENKO et al. 2007a; KANG et al. 2008, 2010; LIPPERT et al. 2011). The mixed model generally has a form of

$$y_{ij} = \mu + x_j \beta + u_i + e_{ij} \tag{2.18}$$

where, compared with (2.14), the extra random effect term $u_i$ is the polygenic effect, and $u \sim N(0, G\sigma_u^2)$. $G$ is a genomic kinship matrix estimated from the SNP data (e.g. VANRADEN 2008; KANG et al. 2008; YANG et al. 2010a). The fundamental problem of population stratification is due to the similarity in the DNA sequence between individuals who also have similar phenotypes, so that the inflated signals caused by genetic background will cause false discoveries. Therefore, the correlation in population structure needs to be removed in the analysis, but certainly, the loci with small effects simultaneously become undetectable.
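Two of the summary-statistic calculations described in this section, the pooled estimate of (2.15) and (2.16) and the median-based inflation factor of (2.17), are small enough to sketch directly. The function names and the input numbers below are hypothetical. The theoretical median of the chi-square distribution with 1 degree of freedom is computed as the square of the 0.75 quantile of the standard normal distribution, which follows from the fact that a chi-square(1) variable is a squared standard normal.

```python
import math
from statistics import NormalDist, median

def pooled_estimate(betas, ses):
    """Inverse-variance weighted estimate (2.15) and its standard error (2.16)."""
    weights = [1.0 / s ** 2 for s in ses]
    beta = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    n = len(ses)
    harmonic_mean = n / sum(weights)  # harmonic mean H of the s_i^2
    return beta, math.sqrt(harmonic_mean / n)

# Median of chi-square(1): P(Z^2 <= m) = 0.5 gives m = (Phi^{-1}(0.75))^2.
CHI2_1_MEDIAN = NormalDist().inv_cdf(0.75) ** 2

def gc_lambda(chi2_values):
    """Median-based inflation factor of (2.17)."""
    return median(chi2_values) / CHI2_1_MEDIAN

def gc_adjust(chi2_values):
    """Divide every test statistic by the estimated inflation factor."""
    lam = gc_lambda(chi2_values)
    return [c / lam for c in chi2_values]

# Hypothetical inputs: two studies with equal standard errors, and a
# handful of genome-wide test statistics.
beta_hat, se_hat = pooled_estimate([0.10, 0.20], [0.05, 0.05])
chi2_obs = [0.1, 0.5, 0.91, 2.0, 0.3]
lam_hat = gc_lambda(chi2_obs)
chi2_adj = gc_adjust(chi2_obs)
```

With two equally precise studies the pooled standard error is the single-study value divided by the square root of 2, and `CHI2_1_MEDIAN` evaluates to a value just below 0.455, in line with the remark on the commonly used approximation.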

2.2 Multiple-predictor analysis

Since genetic variation is a combined effect of multiple genes, looking at only one locus at a time should not be the optimal way to detect complex genetic architecture. Multiple regression types of models have been used to fit more genetic variants or even the whole genome simultaneously. Moreover, gene-gene and gene-environment interactions could also be interesting phenomena that may help us understand complex genetic networks. In this section, some popular approaches with basic theories for estimating polygenic effects across the genome are introduced. Interaction analysis is included, and due to its connection to variance heterogeneity, recent developments on mapping genes affecting phenotypic variability are introduced as well.

2.2.1 Polygenic effects estimation

Estimation of polygenic effects does not necessarily require genome-wide genotyping. In variance component QTL analysis, and also in GWAS, a linear model with polygenic effects is commonly used to account for the genetic background other than the major QTL effects. Especially in QTL analysis, polygenic effects can be addressed by modeling random effects with a correlation structure derived from the pedigree information, without knowing which genes cause the polygenic effects. With dense SNP markers, one can estimate the polygenic effects of all the available markers using a “super-saturated” model. This kind of model is very useful in current quantitative genetics since it provides a powerful unified framework for both QTL mapping and genomic evaluation (see Chapter 3).

The model is “super-saturated” because it fits more markers ($p$) than the number of individuals ($n$). It generally uses a linear predictor like

\[ \mu + \sum_{j=1}^{p} Z_j u_j \tag{2.19} \]

to model the phenotype, where $\mu = X\beta$ may include some fixed effects, and $Z_j$ and $u_j$ are the genotype coding and random effect for the $j$:th SNP, respectively. When modeling the $u_j$'s as a random sample drawn from a normal distribution, the model is an LMM, or in the terminology used in genomic prediction, a GBLUP model (MEUWISSEN et al. 2001). By solving such a random effects model, one can obtain shrinkage estimates for all the SNPs, which are identical to those obtained from a ridge regression. A good property of ridge regression is that it was originally developed for overcoming collinearity in linear regression problems (e.g. HASTIE et al. 2009), so that LD in the genomic data will have little impact on the estimation of the effects of strongly linked SNPs.

XU (2003) claimed that a GBLUP model is not proper for QTL mapping because of the equally strong shrinkage for each SNP assumed in the model. SHEN et al. (2012a) validated this point by applying a randomization test to a GBLUP model. Therefore, in order to shrink the effects of non-QTL positions more strongly while highlighting the QTL, different methods, most of which are Bayesian, have been proposed to assign unequal weights to different SNPs (e.g. MEUWISSEN et al. 2001; XU 2003; WANG et al. 2005a; XU 2007; YI and XU 2008; VERBYLA et al. 2009; RÖNNEGÅRD and LEE 2010; HABIER et al. 2011; SHEN et al. 2011a, 2012a). All these methods basically allow heteroscedastic effects of the SNPs, such as

\[ u_j \sim N(0, \sigma_{u_j}^2) \tag{2.20} \]

instead of $u_j \sim N(0, \sigma_u^2)$. The $\sigma_{u_j}^2$'s differ between different ways of penalization. The results from such models can be impressive in terms of the QTL mapping profile: major QTL can be identified clearly by just looking at their effects. However, although the Bayesian approach does not emphasize significance testing, testing remains an essential problem to solve before these whole-genome models can be applied more widely.
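To make the difference between a common variance and the SNP-specific variances in (2.20) concrete, here is a minimal Python sketch (standard library only). It is an illustration under stated assumptions, not the algorithm of any cited method: the phenotype is assumed to be already centred so that no fixed effects are needed, the variance components are assumed known, and the names `solve` and `snp_blup` are hypothetical. The estimates solve $(Z'Z + \mathrm{diag}(\sigma_e^2/\sigma_{u_j}^2))\,u = Z'y$, which reduces to ordinary ridge regression when all $\sigma_{u_j}^2$ are equal.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def snp_blup(Z, y, sigma2_e, sigma2_u):
    """Shrinkage estimates of SNP effects.

    Z        : n x p genotype codings (list of rows)
    y        : centred phenotypes (assumption: no fixed effects left)
    sigma2_e : residual variance
    sigma2_u : list of p SNP-specific variances; all equal gives ridge regression
    """
    n, p = len(Z), len(Z[0])
    lhs = [[sum(Z[i][j] * Z[i][k] for i in range(n)) for k in range(p)]
           for j in range(p)]
    for j in range(p):
        lhs[j][j] += sigma2_e / sigma2_u[j]  # SNP-specific penalty
    rhs = [sum(Z[i][j] * y[i] for i in range(n)) for j in range(p)]
    return solve(lhs, rhs)


if __name__ == "__main__":
    Z = [[1, 0], [-1, 0], [0, 1], [0, -1]]  # toy orthogonal codings
    y = [2.0, -2.0, 1.0, -1.0]
    print(snp_blup(Z, y, 1.0, [1.0, 1.0]))  # equal shrinkage of both SNPs
    print(snp_blup(Z, y, 1.0, [1.0, 0.1]))  # second SNP shrunk more strongly
```

In the toy data the unpenalized least-squares effects would be 2 and 1; a smaller variance for the second SNP pulls its estimate closer to zero while leaving the first unchanged.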

2.2.2 Interaction effects and variance heterogeneity

Epistasis is always there, interesting to discover and understand, even though additive effects often play the most important role (HILL et al. 2008). Epistasis is complicated to model and explain, and there is some difference between epistasis and statistical interaction. Epistasis, in its simplest form, refers to an interaction between a pair of loci, where the phenotypic effect of one locus depends on the genotype at the other (CARLBORG and HALEY 2004). From the statistical point of view, such an interaction happens when the combined effects of a pair of loci are not additive (COX 1984).

Let us assume two loci A and B that affect a particular trait y. A linear model that contains only the interaction between A and B does not make sense, because “Now since the presence of the interaction places no restrictions at all on how A varies as B changes (and vice versa), we ought to be very surprised if either margin were null. Why should it be?”, wrote NELDER (1994). Therefore, in order to test interaction in a linear model, given the genotypes at two loci, one has to perform a two-way ANOVA comparing

\[ y_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + e_{ijk} \tag{2.21} \]

with

\[ y_{ijk} = \mu + \alpha_i + \beta_j + e_{ijk} \tag{2.22} \]

where $\mu$ is the overall mean, $\alpha_i$ is the effect of locus A, $\beta_j$ is the effect of locus B, $\gamma_{ij}$ is the interaction effect between loci A and B, $e_{ijk}$ is the residual, and $i$, $j$ and $k$ are the indices for locus A genotypes, locus B genotypes and individuals, respectively. Here, $\gamma_{ij}$ does not contain any additive margin of either locus. Testing $\gamma_{ij}$ genome-wide requires a two-dimensional scan. For $p$ loci on the genome, $p(p-1)/2$ analyses of variance need to be performed, which is computationally intensive. Unfortunately, such a pairwise scan has low power because so many tests are done, making it difficult for an epistatic pair of loci to stand out. When epistasis exists, it cannot always be statistically proved (because of limited power and the removal of additive margins), and vice versa, a statistically significant interaction does not necessarily indicate a molecular interaction between two proteins (because of false discoveries and because the interaction can be indirect).

As a phenomenon, an epistatic pathway (G×G), as well as gene-environment interactions (G×E), can be exciting to find. Relatively speaking, looking for a significant G×E interaction is less difficult if the environment is well measured. In fact, knowing only one locus that is potentially interacting with some other loci or factors can also be exciting. Interestingly, several recent studies (PARÉ et al. 2010; STRUCHALIN et al. 2010; RÖNNEGÅRD and VALDAR 2011), developed at almost the same time, noticed that variance heterogeneity at a locus can be caused by interactions that involve this locus.
Simply searching for loci showing significant variance heterogeneity can, to use PARÉ et al. (2010)'s word, prioritize such loci, since they have the potential to be involved in G×G or G×E interactions. The power of detecting potentially interacting loci using a variance heterogeneity test seems to be good (PARÉ et al. 2010). However, for such a variance heterogeneity test (Levene or Brown-Forsythe test, see Chapter 7), STRUCHALIN et al. (2010) discovered an interesting, and curious, behavior in power. Assuming that the model causing variance heterogeneity at the tested locus is (2.21), and that there is no main effect of this locus, we would like to know how the power of the variance heterogeneity test varies as the other effects change. Given a certain amount of interaction effect, what STRUCHALIN et al. (2010) found is an “M”-shaped curve for the power against the main effect of the other interacting factor, which looks surprising. In order to visualize the point completely, I made the 2D Figure 2.3 to show the power pattern against both the main effect of the other interacting factor and the interaction effect. Plotted in Figure 2.3 is the Brown-Forsythe test statistic, which is proportional to the non-centrality parameter that represents power (HEWITT and HEATH 1988; LIU and RAUDENBUSH 2004). As expected, the power of the variance heterogeneity test varies as the interaction effect changes. When the interaction effect is null, no heterogeneity of variance is generated, so the test has no power. However, even when there is a certain amount of interaction effect, if it happens to be a half of the main effect of the other interacting factor, the test loses all its power (see the discussion of STRUCHALIN et al. 2010). This peculiar trend in Figure 2.3 makes variance heterogeneity a maybe-sufficient but unnecessary condition for interaction effects. Namely, if variance heterogeneity is caused by interaction,

\[ \text{variance heterogeneity} \Rightarrow \text{interaction effect, but variance heterogeneity} \nLeftarrow \text{interaction effect.} \]

Researchers such as JIMENEZ-GOMEZ et al. (2011) reported QTL with variance heterogeneity as QTL controlling stochastic noise. They do not claim that the variance heterogeneity is necessarily generated by interaction effects. Nonetheless, the underlying model that generates variance heterogeneity can be an interaction model.
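For readers unfamiliar with the Brown-Forsythe test mentioned above, the statistic can be sketched in a few lines of Python (standard library only). This is a generic textbook form, a one-way ANOVA F statistic computed on absolute deviations from the group medians, and not the code used to produce Figure 2.3; the function name is hypothetical, the groups are assumed to be the phenotypes split by genotype class, and no p-value is computed because that would require an F distribution that the standard library does not provide.

```python
from statistics import mean, median


def brown_forsythe(groups):
    """Brown-Forsythe statistic: one-way ANOVA F on |y - group median|.

    groups : list of lists, one list of phenotypes per genotype class.
    Assumes at least two groups and non-zero within-group variation
    of the absolute deviations.
    """
    dev = [[abs(v - median(g)) for v in g] for g in groups]
    k = len(dev)                       # number of genotype classes
    n = sum(len(d) for d in dev)       # total number of individuals
    grand = sum(sum(d) for d in dev) / n
    between = sum(len(d) * (mean(d) - grand) ** 2 for d in dev) / (k - 1)
    within = sum((v - mean(d)) ** 2 for d in dev for v in d) / (n - k)
    return between / within


if __name__ == "__main__":
    same_spread = [[1, 2, 3], [11, 12, 13]]   # different means, equal variance
    diff_spread = [[1, 2, 3], [10, 20, 30]]   # different variance
    print(brown_forsythe(same_spread))
    print(brown_forsythe(diff_spread))
```

A shift in mean alone leaves the statistic at zero, which is why the test targets variance differences between genotype classes rather than mean differences.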

[Figure 2.3. Panel title: Brown-Forsythe test statistic; x-axis: main effect of interacting factor; y-axis: interaction effect.]

Figure 2.3. The Brown-Forsythe test statistic in a two-way interaction model. Assuming a two-way interaction model with standard normal residuals, including the main effect of the tested locus, the main effect of the other interacting factor, and their interaction effect. The value of the Brown-Forsythe test statistic depends on both the main effect of the factor and also the interaction effect. Since the test statistic is proportional to the non-centrality parameter, this figure shows the power of the variance heterogeneity test in an interaction model.

3. Predictive Modeling

“However beautiful the strategy, you should occasionally look at the results.”

—— Winston Churchill

Because of the complexity in the underlying genetic architecture of quantitative traits, there is usually little confidence that our effect estimate for an individual locus is correct. However, the situation is better when predicting individual phenotypes. In fact, prediction is an essential part of statistical analysis. Since we cannot yet understand the functions of all the genetic variants, summing all their effects together could help prediction more than using only the detected individual loci. In Chapter 2, we have already seen the use of whole-genome models for QTL identification. Here I introduce the predictive usage of the models that fit all the available genome-wide markers.

Genomic evaluation or genomic selection basically means predicting individual breeding values and performing selection based on genetic markers. Before dense genomic markers were used, farm animals were evaluated according to their pedigree kinship, from which linear mixed models (LMMs) were developed. Assuming an LMM for a particular phenotype y,

\[ y = X\beta + Z_a a + e \tag{3.1} \]

where $\beta$ are the fixed effects with design matrix $X$, $a$ are the animal effects with incidence matrix $Z_a$, and $e$ are the residuals. The LMM in animal breeding can contain other random effects such as maternal effects as well, but to illustrate the basic idea, I focus on the simple animal model (3.1). If we assume that both $a$ and $e$ are multivariate normally distributed, where $a \sim N(0, A\sigma_a^2)$ and $e \sim N(0, I\sigma_e^2)$, the BLUP for $a$ can be obtained by solving Henderson's mixed model equations (MME), i.e.

\[ \begin{pmatrix} X'X & X'Z_a \\ Z_a'X & Z_a'Z_a + \frac{\sigma_e^2}{\sigma_a^2} A^{-1} \end{pmatrix} \begin{pmatrix} \beta \\ a \end{pmatrix} = \begin{pmatrix} X'y \\ Z_a'y \end{pmatrix} \tag{3.2} \]

The relatedness of the animals whose breeding values are to be estimated is given by the kinship matrix $A$ derived from the pedigree (see e.g. LYNCH and WALSH 1998). Nowadays, since genome-wide dense SNPs are available for typing many farm animals, the kinship between each pair of individuals can be derived by comparing their DNA information directly. Instead of $A$, one can calculate a genomic kinship matrix $G$. A commonly used $G$ matrix is the IBS-like matrix proposed by VANRADEN (2008), adjusted for allele frequencies at each SNP. If we construct an incidence matrix $Z$ for $n$ individuals and $p$ SNPs, so that $Z$ has $n$ rows and $p$ columns, then by scaling the codings in $Z$ using allele frequencies, we obtain the $G$ matrix (YANG et al. 2010a). We have a general random effects model for genomic evaluation,

\[ y = X\beta + Zu + e \tag{3.3} \]

where $u$ are the allele substitution effects for each SNP, and $G \propto ZZ'$. If we assume $u \sim N(0, I\sigma_u^2)$, (3.3) is a GBLUP model (MEUWISSEN et al. 2001), which is equivalent to a ridge regression. If we assign different variance components or “weights” as hyper-parameters to different SNPs, (3.3) becomes a BayesA-like model (MEUWISSEN et al. 2001) or the DHGLM (RÖNNEGÅRD and LEE 2010; SHEN et al. 2011a). There are quite a few other models (see Chapter 2), but basically all share the shape of (3.3) to model the mean of $y$. Properly re-weighting the markers along the genome has the potential to improve prediction of breeding values compared to GBLUP (see e.g. PSZCZOLA et al. 2011; SHEN et al. 2012a), because the SNP effects contributing to a particular trait are usually much more heavy-tailed than Gaussian. The key to improving the genomic evaluation model, as well as its use in QTL mapping, is variable selection (or prioritization), i.e. adding more weight to the functional QTL.

Figure 3.1 shows an example comparing the shrinkage magnitudes of three different methods for re-weighting markers, where the heteroscedastic effects model (HEM; SHEN et al. 2012a) shrinks the small effects more strongly than the ridge regression, and the LASSO (with 10-fold cross validation) shrinks most of the small effects to exactly zero. When the genetic effects are normally distributed, HEM has no advantage compared to ridge regression, but for genetic effects with a skewed distribution, stronger shrinkage estimates have better predictive power (SHEN et al. 2012a). Thus, no single model is always the best. In order to capture the genetic information well and predict the phenotypic outcome, proper variable selection routines should be chosen considering the nature of the phenotype. However, all these methods are purely “mathematical”, ignoring the biological information underlying each typed marker. Variable selection and shrinkage with respect to e.g. gene annotation information would be useful in future studies (see Chapter 8).
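As an illustration of how a genomic kinship matrix $G \propto ZZ'$ can be built from SNP data, the following minimal Python sketch (standard library only) follows one common allele-frequency-adjusted construction in the spirit of VANRADEN (2008). Several points are assumptions made for the sketch rather than statements from the text: genotypes are coded as 0/1/2 allele counts, allele frequencies are estimated from the sample itself, at least one SNP is polymorphic, and the function name `vanraden_g` is hypothetical.

```python
def vanraden_g(M):
    """Genomic relationship matrix from an n x p matrix of 0/1/2 allele counts.

    Each SNP column is centred by twice its allele frequency, and Z Z' is
    scaled by 2 * sum_j p_j (1 - p_j).
    """
    n, p = len(M), len(M[0])
    freq = [sum(row[j] for row in M) / (2 * n) for j in range(p)]
    Z = [[row[j] - 2 * freq[j] for j in range(p)] for row in M]
    scale = 2 * sum(f * (1 - f) for f in freq)
    return [[sum(Z[i][k] * Z[j][k] for k in range(p)) / scale
             for j in range(n)]
            for i in range(n)]


if __name__ == "__main__":
    # Three individuals, two SNPs (toy data)
    M = [[0, 2],
         [1, 1],
         [2, 0]]
    for row in vanraden_g(M):
        print(row)
```

The resulting matrix can take the place of the pedigree-based $A$ in (3.2); in practice it often needs a small ridge added to the diagonal before inversion, since it is singular by construction after centring.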

Figure 3.1. Comparison of SNP effect estimates from ridge regression (RR), heteroscedastic effects model (HEM) and LASSO. The blue cloud and red scatters compare HEM and LASSO estimates, respectively, against RR estimates. The analyzed quantitative trait is days to flowering time under long day (18 °C, 16 hrs daylight) published by ATWELL et al. (2010). 167 inbred lines were phenotyped for this trait, where a 250K SNP array was used for genotyping, and 216 130 SNPs were available for analysis.


PART II: SUMMARY OF PAPERS

4. Implementing Hierarchical Generalized Linear Models (Paper I)

“A unified framework is provided for viewing and extending many existing methods.”

—— Youngjo Lee & John A. Nelder1

Hierarchical generalized linear models (HGLMs; LEE and NELDER 1996), implemented in the R (R DEVELOPMENT CORE TEAM 2011) package hglm (RÖNNEGÅRD et al. 2010), are a fundamental tool throughout this thesis. The original major advantage of HGLMs compared to normal/generalized linear mixed models is the ability to fit non-normal random effects. However, because of the flexibility in the fitting algorithm and its internal connection with Henderson's mixed model equations, our implementation is additionally capable of: 1. Estimating variance components when we have correlated random effects; 2. Including fixed effects in a model for the residual variance. In fact, the second point has been extended to model any variance component so that HGLMs can be used as a powerful tool in both multiple QTL mapping and genomic evaluation (RÖNNEGÅRD and LEE 2010; SHEN et al. 2011a, 2012a) (Paper III & VI). The HGLMs provide a general unified framework in statistics that can be applied to many random effects problems. All the papers that this thesis is based on, except Paper IV, have used the hglm package for different purposes.

The story of HGLMs starts from the normal linear mixed model (LMM) as follows, where $\beta$ and $u$ are the fixed and random effects, respectively, and the distribution of the response $y$ is determined by $\beta$ and the variance components $\theta = (\sigma_u^2, \sigma_e^2)'$.

\[ y \mid \beta, u, \theta \sim N(X\beta + Zu, I\sigma_e^2) \tag{4.1} \]

\[ u \sim N(0, I\sigma_u^2) \tag{4.2} \]

so that the variance-covariance matrix of $y$ is

\[ \mathrm{Var}(y) = ZZ'\sigma_u^2 + I\sigma_e^2 \tag{4.3} \]

If we define $A = ZZ'$, this implies that a linear mixed model with correlated random effects, e.g. the animal model, can be re-formulated as an ordinary linear mixed model by decomposing the correlation/relationship matrix $A$. For fitting random effects models, at the time of writing, hglm is the only package in R that allows an arbitrary user-defined design matrix $Z$ for the random effects.
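The decomposition of $A$ mentioned above can be illustrated with a Cholesky factorization. The sketch below is plain Python (standard library only) and is not code from the hglm package; the function name is hypothetical and $A$ is assumed to be symmetric and positive definite. If $A = LL'$, then correlated effects $a \sim N(0, A\sigma_u^2)$ can be written as $a = Lu$ with independent $u \sim N(0, I\sigma_u^2)$, so that $Z = L$ (or $Z = Z_a L$ when an incidence matrix $Z_a$ is present) gives $\mathrm{Var}(y)$ as in (4.3).

```python
from math import sqrt


def cholesky(A):
    """Lower-triangular L with L L' = A, for a symmetric positive-definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L


if __name__ == "__main__":
    # Toy relationship matrix for two individuals with relationship 0.5
    A = [[1.0, 0.5],
         [0.5, 1.0]]
    L = cholesky(A)
    # Rebuild A from L L' as a check
    rebuilt = [[sum(L[i][k] * L[j][k] for k in range(2)) for j in range(2)]
               for i in range(2)]
    print(L)
    print(rebuilt)
```

Any factor $Z$ with $ZZ' = A$ would serve the same purpose; the Cholesky factor is simply a convenient choice when $A$ is positive definite.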

1Page 800, LEE and NELDER (1996)

Before digging into the algorithm implemented in the hglm package, I hereby illustrate the likelihood theory underlying the models. First of all, there are several reasons for introducing random effects into a statistical model, of which the most fundamental one is to predict unobservables. The classic Fisher likelihood (FISHER 1922; EDWARDS 1972; PRATT 1976; PAWITAN 2001) was designed for estimation but not prediction, so that when unobservable/uncertain/random factors exist and need to be predicted, the classic likelihood theory becomes powerless. With the fast development of computing tools, another school of thought in statistics happens to be able to do such predictions for random effects, i.e. the Bayesian school. One of the methods that Bayesians utilize is Markov chain Monte Carlo (MCMC), which allows sampling posterior distributions for random components in the model so that further inference can be done.

When $u$ is normal, the flexibility of HGLMs for handling arbitrary distributions of $y$ is the same as that of generalized linear mixed models (GLMMs; see BRESLOW and CLAYTON 1993), and the fitting procedure is relatively straightforward since HGLMs can be formulated as inter-connected GLMs (LEE and NELDER 2001; LEE et al. 2006). Interestingly, the inter-connected GLMs include weighted gamma GLMs for estimating variance components or dispersion parameters in HGLMs. The flow of the HGLM algorithm is demonstrated in Figure 4.1. Since each part of the estimation procedure can be executed as a GLM, the name of hierarchical GLMs makes more sense when described in this way.

Figure 4.1. An illustration of the iterative weighted least squares (IWLS) algorithm for fitting HGLMs based on a normal linear mixed model (LMM). VC = variance components; MME = mixed model equations; Coef. = coefficients (fixed and random effects); GLM = generalized linear model; LRT = likelihood ratio test.

When $y$ is non-normal, the algorithm simply applies a link function as a GLM does (MCCULLAGH and NELDER 1989) during the step of solving the MME. The essential difference between HGLMs and GLMMs is the flexibility of the distribution of the random effects. For non-normal random effects, in order to apply a link function for the random effects, the original LMM can be re-formulated as an augmented model with response (LEE and NELDER 2001; LEE et al. 2006)

\[ y_a = \begin{pmatrix} y \\ \psi \end{pmatrix} \tag{4.4} \]

where in the estimation procedure $\psi = E[u] = 0$ if the random effects are assumed to be normally distributed with a zero mean. Generally, viewing the h-likelihood estimation as an augmented GLM, we have $E[y_i] = \mu_i$, $\mathrm{Var}(y_i) = \phi_i V(\mu_i)$, $E[\psi_i] = u_i$, and $\mathrm{Var}(\psi_i) = \lambda_i V_a(u_i)$, where $V(\cdot)$ and $V_a(\cdot)$ are GLM variance functions, and the $\phi_i$'s and $\lambda_i$'s are the dispersion parameters, which are the variance components in LMMs. In cases where either the response or the random effects are non-normal, such an augmented response is replaced by the adjusted response

\[ z_a = \begin{pmatrix} z \\ \zeta \end{pmatrix} \tag{4.5} \]

where the elements are

\[ z_i = \eta_i + (y_i - \mu_i)\frac{\partial \eta_i}{\partial \mu_i} \tag{4.6} \]

and

\[ \zeta_i = v_i + (\psi_i - u_i)\frac{\partial v_i}{\partial u_i} \tag{4.7} \]

In the linearization equations (4.6) and (4.7), $\eta$ and $v$ are the linear predictors for the original response $y$ and random effects $u$, respectively. Namely, $\eta = g(\mu) = X\beta + Zv$ and $g_a(u) = v$, where $g(\cdot)$ and $g_a(\cdot)$ are two link functions. Constructing the model matrix as

\[ T = \begin{pmatrix} X & Z \\ 0 & I \end{pmatrix} \tag{4.8} \]

the effects can be estimated by iterative weighted least squares (IWLS) for the GLM

\[ T'\Sigma^{-1}T \begin{pmatrix} \beta \\ v \end{pmatrix} = T'\Sigma^{-1} z_a \tag{4.9} \]

where $\Sigma = \Gamma W^{-1}$ with $\Gamma = \mathrm{diag}(\Phi, \Lambda)$, $\Phi = \mathrm{diag}(\phi_i)$, $\Lambda = \mathrm{diag}(\lambda_i)$, and the iterative weight matrix $W = \mathrm{diag}(W_0, W_1)$ has elements $W_{0i} = (\partial \mu_i / \partial \eta_i)^2 V(\mu_i)^{-1}$ and $W_{1i} = (\partial u_i / \partial v_i)^2 V_a(u_i)^{-1}$. For a normal-normal HGLM, i.e. a linear mixed model, one can show that equation (4.9) is identical to Henderson's MME. The most interesting part of the IWLS fitting algorithm is the update of the estimates for the dispersion parameters or variance components. The $\phi_i$'s and $\lambda_i$'s can be estimated via weighted gamma GLMs, where the responses are the squared deviance residuals and the prior weights are $(1 - h_{ii})/2$. Here $h_{ii}$ is the $i$:th hat-value or leverage from (4.9). The gamma GLM family is fitted since one can show that the variance of the squared deviance residuals is proportional to the square of their mean, which is the only assumption we need for fitting gamma GLMs.
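The claim that (4.9) reduces to Henderson's MME in the normal-normal case can be spelled out in a few lines. The derivation below is a sketch under the assumptions of the linear mixed model (4.1)-(4.2): identity links, unit variance functions, $\phi_i = \sigma_e^2$, $\lambda_i = \sigma_u^2$ and $\psi = 0$, so that $W = I$, $v = u$ and $z_a = y_a$.

```latex
% Normal-normal HGLM: W = I, v = u, \psi = 0
\Sigma = \begin{pmatrix} I\sigma_e^2 & 0 \\ 0 & I\sigma_u^2 \end{pmatrix},
\qquad
z_a = \begin{pmatrix} y \\ 0 \end{pmatrix}

% Plugging T from (4.8) into both sides of (4.9)
T'\Sigma^{-1}T =
\begin{pmatrix}
  X'X/\sigma_e^2 & X'Z/\sigma_e^2 \\
  Z'X/\sigma_e^2 & Z'Z/\sigma_e^2 + I/\sigma_u^2
\end{pmatrix},
\qquad
T'\Sigma^{-1}z_a =
\begin{pmatrix} X'y/\sigma_e^2 \\ Z'y/\sigma_e^2 \end{pmatrix}

% Multiplying (4.9) through by \sigma_e^2 gives the MME
\begin{pmatrix}
  X'X & X'Z \\
  Z'X & Z'Z + \frac{\sigma_e^2}{\sigma_u^2} I
\end{pmatrix}
\begin{pmatrix} \beta \\ u \end{pmatrix}
=
\begin{pmatrix} X'y \\ Z'y \end{pmatrix}
```

This is the same system as (3.2) with $Z_a = Z$, $a = u$ and $A = I$; a general relationship matrix $A$ enters by replacing $I/\sigma_u^2$ with $A^{-1}/\sigma_u^2$.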
Since GLMs are used for estimating the dispersion parameters, further modeling of the variance components becomes straightforward. At the time of writing, the most advanced normal-normal HGLM that our hglm package is capable of fitting can be formulated as

\[ y \mid \beta, u_k, \theta \sim N\Big(X\beta + \sum_{k=1}^{K} Z_k u_k,\ \mathrm{diag}(\exp(X_d \beta_d))\Big) \tag{4.10} \]

\[ u_k \sim N(0, A_k \sigma_{u_k}^2) \tag{4.11} \]

where $\theta = (\beta_d, \sigma_{u_1}^2, \ldots, \sigma_{u_K}^2)'$, and the subscript $d$ stands for “dispersion”. The other common distributions for the response variable and the random effects that can be handled by hglm, with common link functions, are listed in Table 4.1 (reproduced from Table 1 in Paper I).

Table 4.1. Commonly used distributions and link functions possible to fit with hglm.

Model name           y|u family   Link g(µ)      u family    Link g_a(u)
Linear mixed model   Gaussian     identity       Gaussian    identity
Binomial conjugate   Binomial     logit          Beta        logit
Binomial GLMM        Binomial     logit          Gaussian    identity
Binomial frailty     Binomial     comp-log-log   Gamma       log
Poisson GLMM         Poisson      log            Gaussian    identity
Poisson conjugate    Poisson      log            Gamma       log
Gamma GLMM           Gamma        log            Gaussian    identity
Gamma conjugate      Gamma        inverse        Inv-Gamma   inverse
Gamma-Gamma          Gamma        log            Gamma       log

The augmented-GLM way of presenting the MME turns out to allow much more flexibility in fitting different kinds of sophisticated hierarchical models, especially when GLMs are used for estimating variance components. For instance, in Paper III, we model the variance component for genetic random effects, i.e. $\sigma_u^2$ in (4.2), instead of the residual variance component $\sigma_e^2$, using a second-layer random effects model, so that the model becomes a particular double HGLM (DHGLM) for genomic data, which is different from the DHGLM described in the literature (LEE and NELDER 2006; LEE et al. 2006).

Together with professor Yurii Aulchenko, based on the hglm package, we have implemented the function polygenic_hglm as an alternative to the function polygenic in the current version of GenABEL - a popular R package for GWAS (AULCHENKO et al. 2007b). polygenic_hglm estimates the polygenic effects based on the IBS matrix more efficiently than the original numerical algorithm. Since REML estimation is done by hglm, polygenic_hglm also produces standard error estimates for the included fixed effects, which makes it possible to test covariates in a polygenic effects model.

The hglm package successfully implemented a unified statistical inference framework for random effect models. Its capacity in modeling variance components is fairly flexible, so it has great potential in large-scale genetics studies as well as other scientific fields.

5. Quantitative Trait Loci Interval Mapping

“The generation of such high-density maps is not possible for a majority of species in practice... a marker analysis cannot unambiguously separate the genetic effects of a QTL from the recombination fraction between the markers and QTL.”

—— Rongling Wu, Chang-Xing Ma & George Casella1

Nowadays, mapping potentially functional loci is often done by simply associating a large number of SNPs to different complex traits in a population (GWAS). Such an association study strategy has become very popular during the last 4-5 years. However, although dense marker maps have been developed, the technique of QTL interval mapping in experimental crosses is still useful and should never be abandoned. One reason is that dense marker maps are certainly not available for all species. Another reason is that we have definitely not obtained complete information from DNA sequence, nor have we fully understood the haplotypes along the genome.

The background theory of QTL interval mapping is introduced in Chapter 2. Here, I demonstrate the contributions in Paper II & IV that are related to interval mapping techniques.

1 Page 223, WU et al. (2007)

5.1 Variance component QTL model (Paper II)

Paper II (SHEN et al. 2011b) extended an “old” idea in QTL analysis that the uncertainty in genotypes should be properly considered when modeling the genetic effects as either fixed effects (ELSTON and STEWART 1971; MORTON and MACLEAN 1974; LANDER and BOTSTEIN 1989) or random effects (SCHORK 1993; KRUGLYAK and LANDER 1995).

When fitting the QTL effects as fixed effects, the distribution method (full likelihood method taking the genotype uncertainty into account) has been shown to be well approximated by a simple linear regression on the genotype probabilities (HALEY and KNOTT 1992), which has become a widely used tool for QTL mapping. For instance, Paper IV provided a convenient interface for using such a regression method (NELSON et al. 2011). However, when modeling the QTL effects as random, there has not been a universal solution to integrate the information about uncertain genotypes into the analysis. The difficulty comes from the “correlation matrix” inferred by combining the genetic marker information and the pedigree structure, i.e. the IBD (identity-by-descent) matrix. The usual way of using the IBD matrix is to calculate the average amount of shared alleles between relatives and plug it into a linear mixed model as the correlation of random effects, referred to as the expectation method. But the expectation method throws away all the information except the mean value in the distribution of the IBD matrix. For small human full-sib families, previous studies have derived the full likelihood function considering the uncertainty in the IBD matrix (XU 1996; GESSLER and XU 1996). This is doable because the family sizes are small. Unfortunately, when one moves from human families to an animal pedigree and from full-sibs to F2 intercross designs, with possible as well, it is almost impossible to analytically derive the distribution of a big complex IBD matrix.

Statistical inference for the random QTL effects is done by testing the corresponding variance component. Restricted maximum likelihood (REML) is used to correct bias in variance component estimation. Let us denote the phenotype vector as $y$, the IBD matrix as $\Pi$, and the other parameters in the random effects model as $\theta$. The traditional expectation method conducts likelihood-ratio tests (LRT) through the likelihood

\[ L_E = L(\theta \mid y, E[\Pi]) \tag{5.1} \]

Often people just call $E[\Pi]$ the IBD matrix, regardless of the uncertainty in the matrix itself. Instead, the distribution method utilizes the joint likelihood for both $\theta$ and $\Pi$, so that the inference on $\theta$ should be done directly through its marginal likelihood

\[ L_D = L(\theta \mid y) = \sum_{\Pi} L(\theta, \Pi \mid y) = \sum_{\Pi} L(\theta \mid y, \Pi) P(\Pi) = E_{\Pi}[L(\theta \mid y, \Pi)] \tag{5.2} \]

Therefore, from equations (5.1) and (5.2), we clearly see that the exact way of constructing the likelihood function is to average all the possible likelihood functions over the probability space of $\Pi$.

Obviously, as suggested by earlier studies (XU 1996), one can just apply a Monte Carlo sampling strategy from the probability space of $\Pi$ and obtain an empirical realization of the IBD matrix distribution. Drawing $m$ imputes for $\Pi$, the marginal likelihood of the parameters, $L_D$, can be approximated as

\[ \tilde{L}_D(\theta \mid y) \approx \frac{1}{m} \sum_{i=1}^{m} L(\theta \mid y, \Pi_i) \tag{5.3} \]

according to (5.2). However, calculating these imputed likelihood functions is not an easy task, because the likelihood functions generally have extremely small values, even at their maxima, which are beyond the numeric precision of current computers. One cannot simply calculate the average of the corresponding log-likelihoods and transform back, since the mean of the logarithms is not the logarithm of the mean. Hence, one contribution of Paper II is to derive and implement a Newton-Raphson-based EM algorithm for solving such a problem. We found that the algorithm we derived uses log-likelihood values of individual imputes, for which closed-form solutions are already available (HARVILLE 1977) (see Paper II for details).
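The numerical difficulty with (5.3) can be illustrated with a short Python sketch (standard library only). This is not the Newton-Raphson-based EM algorithm of Paper II; it is the standard log-sum-exp identity, shown only as a way to evaluate the logarithm of the averaged likelihood at a fixed $\theta$ from the log-likelihoods of the individual imputes. The function name and the example values are hypothetical.

```python
from math import exp, log


def log_mean_likelihood(loglik):
    """log of (1/m) * sum_i exp(loglik[i]), computed without underflow.

    Subtracting the largest log-likelihood before exponentiating keeps
    every term in [0, 1], so the sum never underflows to zero.
    """
    top = max(loglik)
    return top + log(sum(exp(l - top) for l in loglik) / len(loglik))


if __name__ == "__main__":
    loglik = [-2000.0, -2003.0, -1998.5]   # hypothetical per-impute log-likelihoods
    print(exp(loglik[0]))                  # underflows to 0.0 in double precision
    print(sum(loglik) / len(loglik))       # mean of logs: not the quantity in (5.3)
    print(log_mean_likelihood(loglik))     # log of the averaged likelihood
```

The result is always at least as large as the mean of the log-likelihoods (Jensen's inequality), which is one way to see why averaging logs and transforming back is not a valid shortcut.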

In Paper II, the performance of the distribution method compared to the expectation method was examined using two small examples and also some real experimental data. Interestingly, the comparison on a real pig intercross data set (ANDERSSON et al. 1994) showed better QTL mapping precision for the distribution method. This is another finding of Paper II that contributes to the literature. Figure 5.1 (reproduced from Figure 3b,c in Paper II) compares the maximized REML likelihood function values from both methods at a putative QTL with rather low marker information (see Figure 4 in Paper II). The two methods gave similar likelihood values; however, when a causal QTL exists, the distribution method has better power. Even when no QTL exists at all, the distribution method has a significantly lower tendency to generate false positives (p-value = $2.3 \times 10^{-14}$ from a Wilcoxon test for this particular simulation, testing whether the log-likelihood values from both methods differ).

Figure 5.1. Simulation results for power of interval mapping using the distribution method compared to the expectation method. A QTL was simulated between two flanking markers on pig chromosome 6 (see Paper II for details about the data). 1 000 simulations were executed for comparing the log-likelihoods from the expectation and the distribution methods. The points above/below the diagonal are in red/blue, indicating that the distribution method has larger power than the expectation method (left panel), or that the distribution method has lower false positive rate (FPR) than the expectation method (right panel). The numbers in color show the corresponding percentages of the sets of points.

The study done in Paper II gives us a better understanding of how genotype uncertainty affects the classic variance component QTL analysis. The difference between the distribution method and the traditional expectation method might not be substantial in many cases, especially when the population size is sufficiently large. However, for partially informative markers, the distribution method has a tendency to improve the profile of a QTL scan so that it contributes to QTL fine mapping.

5.2 QTL regression model (Paper IV)

In contrast to the theoretical research in Paper II, Paper IV contributes to QTL interval mapping technique by providing an application tool. The idea is to use Karl Broman's qtl package (BROMAN 2003) in R to help us perform QTL analysis in outbred line crosses. But R/qtl is a tool designed for intercrosses of inbred lines, and was not developed for outbred line crosses. The idea here is to import pre-calculated genotype probabilities from outbred line cross data into a fake R/qtl object. Thereafter, one can use R/qtl as a “slave” to do all the jobs including 1D/2D QTL mapping and permutation tests, which have already been efficiently implemented in R/qtl.

The qtl.outbred routine works because the underlying statistical models for analyzing inbred and outbred line crosses are identical. No matter how genotype probabilities are inferred for each chromosomal locus, a simple linear regression (HALEY and KNOTT 1992) is utilized to fit the data. Therefore, the functions in R/qtl for both 1D and 2D QTL scans, together with the permutation test, work exactly the same way for outbred line cross data, as long as the correct genotype probabilities are given.
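The regression on genotype probabilities referred to above can be sketched as follows in plain Python (standard library only). This is only an illustration of the idea behind HALEY and KNOTT (1992), not code from R/qtl or qtl.outbred: the function name is hypothetical, only an additive effect is fitted (a dominance term based on the heterozygote probability is omitted for brevity), and the genotype probabilities are assumed to have been computed beforehand by some other tool.

```python
def haley_knott_additive(y, prob):
    """Least-squares fit of y on the expected additive genotype coding.

    y    : phenotypes
    prob : prob[i] = (P(QQ), P(Qq), P(qq)) for individual i at the tested position
    Returns (intercept, additive effect, residual sum of squares).
    """
    # Expected additive coding: +1 for QQ, 0 for Qq, -1 for qq
    x = [p_QQ - p_qq for p_QQ, _, p_qq in prob]
    n = len(y)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    a = sxy / sxx
    mu = ybar - a * xbar
    rss = sum((yi - mu - a * xi) ** 2 for xi, yi in zip(x, y))
    return mu, a, rss


if __name__ == "__main__":
    # Toy data: three fully informative individuals and one uncertain genotype
    prob = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (0.5, 0.5, 0.0)]
    y = [12.0, 10.0, 8.0, 11.0]
    print(haley_knott_additive(y, prob))
```

Because the regression only sees the probabilities, it does not matter whether they come from an inbred or an outbred cross, which is the point made in the paragraph above.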

Table 5.1. Functions in the R package qtl.outbred.

Function    Description
calc.prob   Calculating genotype probabilities using the triM algorithm
impo.prob   Importing calculated genotype probabilities to R/qtl

The qtl.outbred package consists of two main functions (Table 5.1, reproduced from Figure 1 in Paper IV), where the one called impo.prob is necessary for any analysis, which does the trick of turning R/qtl into a “slave”. First of all, a dataset for R/qtl should be prepared in its required format, for instance, an Excel sheet. R/qtl handles F2 inbred line cross data so that a pedigree structure needs to be given, and so do F2 outbred line crosses. In the Excel sheet, the user just needs to insert the pedigree structure as if it was an inbred line cross, and the remaining fake genotypes can be arbitrarily given. This creates a data format that R/qtl can read into R, with the correct number of individuals in each generation. R/qtl package itself is capable of calculating genotype probabilities for inbred line crosses, and the calculation results are stored as a component in the list of a qtl object. impo.prob in qtl.outbred creates such a component for outbred line cross data. Certainly, impo.prob requires pre-calculated genotype probabilities from other existing tools. The package can import the output format from the popular web-based tool GridQTL (SEATON et al. 2006). Besides, qtl.outbred provides an alternative of using the triM algorithm, implemented in the C++ software cnF2freq (NETTELBLAD et al. 2009), to calculate genotype probabilities for outbred line crosses. The function calc.prob in the qtl.outbred package does this calculation job as a user-friendly R interface for the C++ software. The package qtl.outbred is a useful tool that provides a simple and fast interface in R for QTL analyses in outbred populations. Using qtl.outbred, one is also able to produce neat QTL scan results and nice figures in the shape of R/qtl (see Figure 1 in Paper IV).

6. Fitting The Entire Genome

“Because of the high dimensionality of the model, it violates the usual rule of parsimony in model fitting. Fortunately, we were able to penalize the small effects and give them negligible weights so that their inclusion should have negligible effects on the analysis.” -- Shizhong Xu¹

Single-marker analysis is good and single-marker analysis is not good.

It is good because of the convenience in statistical model fitting and testing. By focusing on a single locus on the genome, the result is easy to explain. From a practical point of view, for instance in the GWAS context, the reported single-marker p-values from a common regression model can easily be used in further meta-analysis (THOMPSON et al. 2011), which makes it possible for a consortium to conduct big studies by combining results from many groups. Single-marker analysis also gives a clear indication of where to invest, clone and potentially make drugs or GM products from.

It is not good because, by looking at a limited part of the genome at a time, the power in gene detection becomes fairly limited, since genes contribute together to a certain phenotype or even create complex biochemical networks. After a multiple testing procedure, the loci revealed across the genome explain a limited amount of phenotypic variance, which is far less than the heritability estimated in many other studies. This poor capacity in capturing genetic variance also makes the predictive power of the identified loci rather limited.

Using a multiple regression model, one can include a couple of loci in the same model, trying to understand the genetic effects and the underlying biology better. However, multiple linear regression has two vital drawbacks that make such analyses difficult to proceed with. First, the regression model has limited degrees of freedom, depending on the number of observations, so that too many covariates lead to over-fitting. This makes it impossible to fit the whole genome in one unified model, since there are usually far more genetic markers than individuals in the studied population. Second, statistical testing of covariates in a multiple linear regression can be strongly affected by multi-collinearity, which is common in such studies since a small region on a chromosome often shows a certain magnitude of linkage disequilibrium (LD) (FALCONER and MACKAY 1996). Therefore, fitting all the marker effects as random effects became a natural way to model the whole genome. For example, fitting a linear mixed model, or equivalently a ridge regression, not only saves degrees of freedom, but also deals with multi-collinearity in the model matrix.
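The multi-collinearity argument can be made concrete with a minimal sketch, assuming two markers in complete LD (identical genotype columns) and an arbitrary penalty `lam` chosen purely for illustration. The function name and the toy data are hypothetical.

```python
def ridge_two_markers(z1, z2, y, lam):
    """Solve (Z'Z + lam*I) u = Z'y for two markers in closed form (2x2 case)."""
    a = sum(v * v for v in z1) + lam          # first diagonal element plus penalty
    d = sum(v * v for v in z2) + lam          # second diagonal element plus penalty
    b = sum(v * w for v, w in zip(z1, z2))    # off-diagonal element of Z'Z
    r1 = sum(v * t for v, t in zip(z1, y))
    r2 = sum(v * t for v, t in zip(z2, y))
    det = a * d - b * b
    if det == 0:
        raise ZeroDivisionError("Z'Z is singular: effects are not identifiable")
    return ((d * r1 - b * r2) / det, (a * r2 - b * r1) / det)


z1 = [-1.0, 0.0, 1.0, 1.0, -1.0]   # centred genotype codes (toy values)
z2 = list(z1)                      # second marker in complete LD with the first
y = [-2.0, 0.0, 2.0, 2.0, -2.0]

try:
    ridge_two_markers(z1, z2, y, lam=0.0)   # ordinary least squares
    ols_solvable = True
except ZeroDivisionError:
    ols_solvable = False

u1, u2 = ridge_two_markers(z1, z2, y, lam=1.0)
```

With `lam = 0` the normal equations are singular, whereas any positive penalty makes them solvable and splits the shared signal equally between the two markers, each shrunk towards zero.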

¹ Page 800, XU (2003)

Denote the number of observations or individuals as n and that of explanatory variables or genetic markers as p. There are plenty of statistical methods or perspectives to handle such p ≫ n problems in quantitative genetics, for instance, ridge regression (e.g. MALO et al. 2008), linear mixed model (GBLUP; MEUWISSEN et al. 2001), partial regression (e.g. ZENG 1993, 1994), LASSO (TIBSHIRANI 1996), Bayesian model selection (e.g. MEUWISSEN et al. 2001; XU 2003; YI and XU 2008), etc. Generally, all these methods apply some amount of shrinkage to each of the estimated effects, resulting in high-dimensional models useful for predicting phenotypes and potentially also for identifying QTL.

In this chapter, the two papers about whole genome models in this thesis are summarized. Instead of the popular Bayesian routines, Paper III (SHEN et al. 2011a) showed that the DHGLM performs no worse than the BayesA method (MEUWISSEN et al. 2001) and is computationally faster. Paper VI (SHEN et al. 2012a) presented a generalized ridge regression method that is a non-iterative simplified version of the double-layer model used in Paper III, which has a substantial computational advantage in fitting general p ≫ n problems.

6.1 Double HGLM (Paper III)

Paper III analyzed the common dataset (SZYDLOWSKI and PACZYŃSKA 2011) issued for the participants of the 14th QTLMAS workshop in Poznań, Poland, 2010². The simulated dataset consists of 3 226 individuals in 5 generations, where the phenotypic records for the 900 individuals in the F4 generation were not given, nor were the true QTL coordinates and their effects. Two traits, one quantitative and the other binary, sharing pleiotropic QTL, were simulated. The very first analysis attempted QTL mapping for both traits using variance component models. The results were reasonable but with low precision in QTL mapping, which is not surprising for a linkage analysis. Therefore, an alternative analysis that fits a double HGLM (DHGLM; LEE and NELDER 2006) was tried instead; this model was first implemented by RÖNNEGÅRD and LEE (2010), who extended our hglm package.

² URL: http://jay.up.poznan.pl/qtlmas2010/

The reported results were compared during the conference, and the performance of DHGLM was good in both QTL detection (MUCHA et al. 2011) and genomic prediction (PSZCZOLA et al. 2011) compared to the other methods. DHGLM had the best QTL mapping accuracy for the simulated quantitative trait (Figure 6.1). In Figure 6.1, the Bayesian methods (BOUWMAN et al. 2011; SUN et al. 2011; CALUS et al. 2011) performed well as expected. BOUWMAN et al. (2011) used Gibbs sampling for variable selection (GEORGE and MCCULLOCH 1993), which is implemented in the iBay software (JANSS 2009). SUN et al. (2011) used their BayesCπ method (HABIER et al. 2011), whereas CALUS et al. (2011) used BayesC (VERBYLA et al. 2009). Besides, NETTELBLAD (2011) performed haplotype inference based on hidden Markov models (HMMs) for the multi-generation pedigree, and KARACAÖREN et al. (2011) reported their results from the GRAMMAR algorithm (AULCHENKO et al. 2007a). What can be seen is that the whole-genome methods out-perform the single-marker analyses. Unlike the Bayesian methods, DHGLM, based on the extended likelihood (BJØRNSTAD 1996) or h-likelihood (LEE and NELDER 1996), is deterministic in its fitting algorithm, so no intensive sampling like MCMC is required.

[Figure 6.1: plot with one entry per reported analysis (Shen et al.; Bouwman et al.; Sun & Dekkers; Nettelblad; Calus et al.; Karacaören et al.; Coster and Calus). Horizontal axis: # Mapped QTL / # Reported QTL, ticks from 0.2 to 1.2.]

Figure 6.1. Comparison of QTL mapping accuracy of all reported results for the simulated quantitative trait. The figure is reproduced from Sebastian Mucha’s presentation at the 14th QTLMAS workshop in Poznań, Poland, 2010. The horizontal axis shows the ratio of the number of mapped QTL to the number of reported QTL. One reported location could map more than one simulated QTL position if the simulated QTL are very close to each other.

The DHGLM used in Paper III is different from the original version given by LEE and NELDER (2006). Instead of modeling the residual variance in an LMM, we model the marker-specific genetic variance of the random effects (see also RÖNNEGÅRD and LEE 2010). Suppose we have the following random effect model for the entire genome,

\[ y = X\beta + Zg + e \tag{6.1} \]

where $y$ is the vector of phenotypic records, $g \sim N(0, \mathrm{diag}(\lambda))$ are the SNP effects, $\lambda = (\lambda_1, \lambda_2, \ldots, \lambda_m)'$ are the variances of the SNP effects, and the residuals $e \sim N(0, \sigma^2 I)$. The fixed effects $\beta$ include an intercept and the sex effect in Paper III to reduce the residual errors. If $\lambda$ is a vector of identical values, model (6.1) is just an ordinary LMM, from which one can obtain GBLUP (MEUWISSEN et al. 2001). Instead of assigning different prior distributions for the variance of each SNP effect (as the Bayesian methods do), we further model the effect-specific variance using a second layer of random effect model, namely another layer of HGLM,

\[ \log\lambda = \mathbf{1}a + b \tag{6.2} \]

with an intercept $a$ and normally distributed random effects $b \sim N(0, \sigma_b^2 I)$ as the linear predictor. In the fitting algorithm (see Paper III), model (6.2) actually fits a gamma GLMM. As we have already seen in Chapter 4, the HGLM algorithm estimates the fixed and random effects by solving Henderson’s MME, which is an LM itself (e.g. equation 4.9), and it estimates the variance components or dispersion parameters using gamma GLMs. Therefore, simply by iterating several LMs and GLMs, the algorithm quickly converges to the ML or REML estimates of the parameters in the DHGLM. At convergence, the inference for each part of the DHGLM is based on the h-likelihood

\begin{align*}
h(y, g, b \mid \beta, \sigma^2, a, \sigma_b^2)
  &= \log f(y \mid \beta, g, \sigma^2) + \log f(g \mid a, b) + \log f(b \mid \sigma_b^2) \\
  &= -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}(y - X\beta - Zg)'(y - X\beta - Zg) \\
  &\quad - \frac{1}{2}\sum_{j=1}^{m}\log(2\pi e^{a+b_j}) - \frac{1}{2}\sum_{j=1}^{m}\frac{g_j^2}{e^{a+b_j}} \\
  &\quad - \frac{m}{2}\log(2\pi\sigma_b^2) - \frac{1}{2\sigma_b^2}\,b'b
\end{align*}

where $j$ is the marker index, $m$ is the number of markers, and $n$ is the number of individuals. The h-likelihood is simply the logarithm of the joint density of the data (observed information) and the random effects (unobserved information), viewed as a function of the parameters in the model.

Since adjacent loci on the same chromosome are generally linked, we extended the model by allowing a correlation structure for $b$. The correlated random effects $b_j$ follow a multivariate normal distribution with a mean of zero and a variance-covariance matrix

\[
A = \sigma_b^2
\begin{pmatrix}
1 & \rho & \rho^2 & \cdots & \rho^{m-2} & \rho^{m-1} \\
\rho & 1 & \rho & \cdots & \rho^{m-3} & \rho^{m-2} \\
\rho^2 & \rho & 1 & \cdots & \rho^{m-4} & \rho^{m-3} \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
\rho^{m-2} & \rho^{m-3} & \rho^{m-4} & \cdots & 1 & \rho \\
\rho^{m-1} & \rho^{m-2} & \rho^{m-3} & \cdots & \rho & 1
\end{pmatrix}
\tag{6.3}
\]

where $\rho \in [0, 1)$. When $\rho = 0$, the loci are assumed to be independent; and when $0 < \rho < 1$, the correlation between two loci is a monomial function of $\rho$ (RÖNNEGÅRD and LEE 2010). In this work, $\rho$ is pre-defined before the model fitting. This effectively smooths the marker-specific variance $\lambda$, which reduces the noise in the profile and highlights the QTL signals (Figure 6.2). A better estimate or “guess” of the correlation might improve the power of the analysis, e.g. using LD information in the genotype data to estimate such correlation (feedback from professor Yurii Aulchenko during the conference presentation).

According to the extended likelihood principle (BJØRNSTAD 1996), inference of the random genetic effects $g$ should be done through the h-likelihood, fixed effects $\beta$ through the marginal likelihood, and variance components $\sigma^2$ and $\sigma_b^2$ through the adjusted profile likelihood (LEE et al. 2007). For the genomic model here, what we care about is the $g$ term, which tells how the genetic effects differ among different markers. The estimation of the genetic random effects $g$ together with their prediction errors $\lambda$ (PAWITAN 2001) can be done using the inter-connected GLM algorithm (see Paper I & III). This provides h-likelihood estimates of the marker-specific genetic random effects, i.e. $\hat{g}$. Such an estimation procedure based on the h-likelihood is deterministic and computationally efficient compared to Bayesian MCMC routines, without losing the flexibility in model construction. Although conducting proper statistical tests for $\hat{g}$ is not straightforward when directly considering their prediction errors $\hat{\lambda}$, a randomization test for the entire genome could be a solution (see Paper VI). In Figure 6.2, the “threshold” is given by an overall genetic variance component estimate from LMM, which can be regarded as a genome-wide average of the marker-specific variance estimates $\hat{\lambda}$.
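The correlation structure (6.3) is easy to construct explicitly. The following is a minimal sketch with a hypothetical function name and toy parameter values; it only builds the matrix and is not part of the fitting algorithm in Paper III.

```python
def ar1_covariance(m, rho, sigma2_b):
    """Return the m x m matrix with entries sigma2_b * rho**|i - j|, as in (6.3)."""
    if not 0.0 <= rho < 1.0:
        raise ValueError("rho must lie in [0, 1)")
    return [[sigma2_b * rho ** abs(i - j) for j in range(m)] for i in range(m)]


A = ar1_covariance(4, 0.5, 2.0)              # neighbouring markers correlated
A_independent = ar1_covariance(3, 0.0, 1.0)  # rho = 0: independent loci
```

The covariance between two markers decays with their distance in marker index, which is what produces the smoothing of the variance profile when rho is set above zero.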

[Figure 6.2: marker-specific variance profile. Horizontal axis: Marker index (0 to 10000); vertical axis: Variance of Marker Effect (0.0 to 0.4). Legend: Additive QTL, Epistatic QTL, Imprinted QTL.]

Figure 6.2. Marker-specific variance profile along the genome for QTL identification of the simulated quantitative trait of the 14th QTLMAS workshop common dataset. The horizontal dashed line gives the variance component estimate of the random genetic effects if LMM/GBLUP is fitted. The colors separate the five simulated chromosomes. The simulated true QTL positions are indicated by vertical bars.

Regarding the QTL mapping precision in Figure 6.1, we reported the peaks above the overall genetic variance component estimate from LMM as detected QTL, and some other smaller ones as suggestive. One reason that our method is able to provide clearly centered QTL coordinates could be the smoothing we add to b: a smoothed curve, instead of scattered dots, was generated as the marker-specific variance profile.

The Bayesian methods (see review by SORENSEN 2009) assume the additive genetic effect a of each marker to be random with a certain mean E[a] and variance Var(a). However, the marker-specific variance Var(a) is a “hyper-parameter” that is not straightforward to interpret. One Bayesian interpretation of Var(a), clarified by GIANOLA et al. (2009), is the uncertainty about the unknown additive effect a. According to this interpretation, Var(a) = 0 means that a = E[a] without uncertainty, but it does not necessarily mean a = 0, since E[a] can differ from zero. Nevertheless, for models that assume a normally distributed random effect per marker, such as DHGLM (RÖNNEGÅRD and LEE 2010; SHEN et al. 2011a), E[a] = 0. Therefore, in models such as DHGLM, markers with zero variance have no effect, and those with large variance also have large effects.

The random effects part of the linear predictor, $Z\hat{g}$, was used directly for calculating genomic estimated breeding values (GEBVs). The results were not the best but promising compared to other sophisticated models used by breeders (PSZCZOLA et al. 2011). We had good prediction for the binary trait of the 900 young individuals in the F4 generation: a correlation of 0.72 between true breeding values (TBVs) and GEBVs. For the quantitative trait, a correlation of 0.60 between TBVs and GEBVs was obtained. The major reason for the worse performance on the quantitative trait was that a more complex genetic architecture was simulated for it. Since we did not include a family effect in the analysis, some genetic effects, such as imprinting QTL, were not optimally considered.
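As a minimal sketch of the two quantities used in this evaluation, the following computes GEBVs as the random-effect part of the linear predictor and their Pearson correlation with true breeding values. The genotype codes, effect estimates and TBVs are hypothetical toy values, not the QTLMAS data.

```python
from math import sqrt


def gebv(Z, g_hat):
    """Genomic estimated breeding values: the random-effect part Z g_hat."""
    return [sum(z_ij * g_j for z_ij, g_j in zip(row, g_hat)) for row in Z]


def pearson(x, y):
    """Pearson correlation between two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


Z = [[0, 1, 2], [1, 1, 0], [2, 0, 1], [0, 2, 2]]   # toy SNP genotypes
g_hat = [0.5, -0.2, 0.1]                           # toy marker effect estimates
tbv = [0.1, 0.2, 1.2, -0.3]                        # toy true breeding values

predicted = gebv(Z, g_hat)
accuracy = pearson(tbv, predicted)
```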

6.2 Heteroscedastic effects model (Paper VI)

In a memoir of Charles Henderson published by the National Academies Press (VAN VLECK 1998), the author describes that around half a century ago, Henderson’s idea was to simply modify the least squares equations by plugging in the matrix G, i.e.

\[
\begin{pmatrix} X'R^{-1}X & X'R^{-1}Z \\ Z'R^{-1}X & Z'R^{-1}Z + G^{-1} \end{pmatrix}
\begin{pmatrix} \beta \\ u \end{pmatrix}
=
\begin{pmatrix} X'R^{-1}y \\ Z'R^{-1}y \end{pmatrix}
\tag{6.4}
\]

which is now well known as Henderson’s MME. Given the estimates of the variance components $\theta$ in $R$ and $G$, the MME actually solves the following LMM,

\begin{align}
y \mid \beta, u, \theta &\sim N(X\beta + Zu, R) \tag{6.5} \\
u &\sim N(0, G) \tag{6.6}
\end{align}

If $R = I\sigma_e^2$ and $G = I\sigma_u^2$, the LMM is simplified to be (4.1) and (4.2), and the MME can be written as

\[
\begin{pmatrix} X'X & X'Z \\ Z'X & Z'Z + \lambda I \end{pmatrix}
\begin{pmatrix} \beta \\ u \end{pmatrix}
=
\begin{pmatrix} X'y \\ Z'y \end{pmatrix}
\tag{6.7}
\]

where $\lambda = \hat\sigma_e^2 / \hat\sigma_u^2$ is the ratio of the estimated residual variance to the estimated genetic variance component. As we have shown in Paper I (see also RÖNNEGÅRD and CARLBORG 2007), an LMM with correlated random effects can be solved via the simplified MME (6.7). We can see from (6.7) that Henderson modified the least squares equations by adding a common shrinkage to the random effects, which is what ridge regression does. This creates a very important advantage of using random effects: more covariates than the number of individuals can be fitted in a linear model. Since Henderson’s form of the MME (6.4) has the G matrix in it, the MME can actually handle unequal shrinkage for different random effects.

In fact, what the DHGLM in Paper III and also the Bayesian methods (e.g. MEUWISSEN et al. 2001; XU 2003) generally do is to estimate different variance components or shrinkage parameters for different markers along the genome. Pre-defining a vector $\boldsymbol{\lambda}$ of shrinkage parameters for all the markers in the MME (6.7), we have the “re-weighted” MME

\[
\begin{pmatrix} X'X & X'Z \\ Z'X & Z'Z + \mathrm{diag}(\boldsymbol{\lambda}) \end{pmatrix}
\begin{pmatrix} \beta \\ u \end{pmatrix}
=
\begin{pmatrix} X'y \\ Z'y \end{pmatrix}
\tag{6.8}
\]
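To make the re-weighted MME concrete, here is a minimal sketch that builds and solves the system directly for a toy dataset with more markers than individuals. It assumes marker-specific shrinkage values placed on the diagonal of the random-effects block; the function names and data are hypothetical, and this brute-force solve is exactly the approach that becomes infeasible for a large number of markers.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def reweighted_mme(X, Z, y, lam):
    """Solve the MME with marker-specific shrinkage lam[j] for marker j."""
    W = [xr + zr for xr, zr in zip(X, Z)]          # rows of the joint design [X Z]
    k, q = len(X[0]), len(W[0])
    C = [[sum(w[i] * w[j] for w in W) for j in range(q)] for i in range(q)]
    for j, lam_j in enumerate(lam):
        C[k + j][k + j] += lam_j                   # penalise random effects only
    rhs = [sum(w[i] * yi for w, yi in zip(W, y)) for i in range(q)]
    sol = solve(C, rhs)
    return sol[:k], sol[k:]


# Toy data: n = 3 individuals, intercept only, p = 4 markers.
X = [[1.0], [1.0], [1.0]]
Z = [[1.0, 0.0, -1.0, 1.0],
     [0.0, 1.0, 1.0, -1.0],
     [-1.0, -1.0, 0.0, 0.0]]
y = [1.0, 2.0, 0.5]

beta_rr, u_rr = reweighted_mme(X, Z, y, [1.0, 1.0, 1.0, 1.0])    # common shrinkage
beta_grr, u_grr = reweighted_mme(X, Z, y, [1.0, 1.0, 1.0, 1e9])  # marker 4 heavily shrunk
```

A very large shrinkage value for one marker pushes its estimated effect to essentially zero while the other markers keep absorbing the signal.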

The contribution of Paper VI is to provide a heteroscedastic effects model (HEM) that basically answers these two questions:

1. Based on DHGLM, what kind of quantity can be assigned to $\boldsymbol{\lambda}$?
2. How can such a huge least squares problem be solved, given that the matrix $Z'Z + \mathrm{diag}(\boldsymbol{\lambda})$ has a dimension equal to the number of markers?

The DHGLM used in Paper III suggests that the “working response” for $\lambda$ in the second layer (6.2) could be a useful predictor of $\boldsymbol{\lambda}$, i.e. simply $\lambda_j = \hat\sigma_e^2 / \hat\sigma_{u_j}^2$ and

\[ \hat\sigma_{u_j}^2 = \frac{\hat{u}_j^2}{1 - h_{jj}} \tag{6.9} \]

where $\hat{u}_j$ is the estimated effect for marker $j$ from GBLUP, and $h_{jj}$ is the corresponding hat-value. This answers question 1 above. Also, we know that transforming the individual effects $a$ to the marker effects $u$ is possible as follows,

\[ u = Z'G^{-1}a \tag{6.10} \]

where the pre-defined weights for markers (6.9) go into $G$. Equation (6.10) gets rid of the huge matrix $Z'Z + \mathrm{diag}(\boldsymbol{\lambda})$ for solving $u$, but in order to apply the quantity (6.9), huge matrix manipulation needs to be avoided in the calculation of $h_{jj}$ as well. In the following fitting algorithm, we show that $h_{jj}$ can also be obtained via a transformation technique. In the algorithm, steps 1-4 fit GBLUP and steps 5-8 fit a generalized RR. The algorithm also includes a Cholesky decomposition of the genomic relationship matrix $G$ to simplify the computations and the transformation of leverages (Step 5). Step 5 is a new derivation that answers question 2 above.
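Why the transformation (6.10) avoids the marker-sized system can be seen from a standard matrix identity. The sketch below ignores the fixed effects and assumes a common shrinkage $\lambda$, with $G = ZZ'$ invertible and $\hat{a}$ the fitted individual effects; these simplifications are assumptions made for illustration and are not the derivation given in Paper VI.

```latex
\begin{align*}
Z'(ZZ' + \lambda I_n) &= (Z'Z + \lambda I_p)\,Z' \\
\Rightarrow\quad (Z'Z + \lambda I_p)^{-1} Z' &= Z'(ZZ' + \lambda I_n)^{-1} \\
\hat{u} = (Z'Z + \lambda I_p)^{-1} Z'y &= Z'(G + \lambda I_n)^{-1} y, \qquad G = ZZ' \\
\hat{a} = Z\hat{u} = G\,(G + \lambda I_n)^{-1} y \quad &\Rightarrow\quad \hat{u} = Z'G^{-1}\hat{a}
\end{align*}
```

Only $n \times n$ matrices need to be inverted on the right-hand sides, which is the source of the computational gain when $p \gg n$.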

Algorithm. (Fitting heteroscedastic effects model, HEM) Given a phenotype vector $y$ (size $n \times 1$) that belongs to any GLM family (MCCULLAGH and NELDER 1989), a fixed effects design matrix $X$ (size $n \times k$) and the SNP genotype matrix $Z$ (size $n \times p$), the GBLUP (ridge regression) and HEM (generalized ridge regression) can be computed as:

1. Calculate $G = ZZ'$, its inverse $G^{-1}$ and its Cholesky decomposition $L$ s.t. $LL' = G$;
2. Fit a GLMM (generalized linear mixed model) with response $y$, fixed effects $X$ and random effects design matrix $L$. This fits the animal model as a GLMM with correlated random effects;
3. From step 2, store the estimated variance components $\hat\sigma_b^2$, $\hat\sigma_e^2$ and the animal effects $\hat{a}$. Calculate $\lambda = \hat\sigma_e^2 / \hat\sigma_b^2$;
4. Transform $\hat{a}$ back to the SNP effects $\hat{b} = Z'G^{-1}\hat{a}$;
5. Define
   \[ C_v = \frac{1}{\hat\sigma_e^2} \begin{pmatrix} X'X & X'L \\ L'X & L'L + \lambda I_n \end{pmatrix} \]
   and divide the inverse of $C_v$ into blocks
   \[ C_v^{-1} = \begin{pmatrix} C_v^{11} & C_v^{12} \\ C_v^{21} & C_v^{22} \end{pmatrix} \]
   Define a transformation matrix $M = Z'G^{-1}L$. Calculate the leverage for each random SNP effect as
   \[ h_{jj} = 1 - M_j \left( I_n - C_v^{22} / \hat\sigma_b^2 \right) M_j' \]
   where $M_j$ is the $j$:th row of the transformation matrix $M$;
6. Define a diagonal matrix $W$ with each diagonal element
   \[ w_{jj} = \frac{\hat{b}_j^2}{1 - h_{jj}} \]
   and update $G$ to be $G^* = ZWZ'$, which is vectorized in implementation since $W$ is diagonal. Calculate $G^{*-1}$, and $L^*$ s.t. $L^*L^{*\prime} = G^*$;
7. Fit a GLMM with response $y$, fixed effects $X$ and random effects design matrix $L^*$;
8. From step 7, transform the updated individual effects $\hat{a}$ back to the SNP effects $\hat{b} = Z'G^{*-1}\hat{a}$.
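Step 6 of the algorithm above can be sketched as follows. The function names and toy numbers are hypothetical, and the leverages are simply given rather than computed as in step 5; the point is only that the weighted relationship matrix can be accumulated without ever forming the p × p matrix W.

```python
def hem_weights(b_hat, h):
    """Step 6 weights: w_jj = b_hat_j**2 / (1 - h_jj)."""
    return [b * b / (1.0 - h_j) for b, h_j in zip(b_hat, h)]


def weighted_grm(Z, w):
    """G* = Z W Z' for diagonal W, using only the vector of diagonal weights."""
    n = len(Z)
    G = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(i, n):
            s = sum(w_j * z_i * z_k for w_j, z_i, z_k in zip(w, Z[i], Z[k]))
            G[i][k] = G[k][i] = s      # G* is symmetric
    return G


Z = [[1.0, 0.0, -1.0],
     [0.0, 1.0, 1.0]]          # toy genotypes: n = 2, p = 3
b_hat = [0.2, -0.1, 0.4]       # toy GBLUP marker effects (step 4)
h = [0.5, 0.75, 0.2]           # toy leverages (would come from step 5)

w = hem_weights(b_hat, h)
G_star = weighted_grm(Z, w)
```

Markers with large estimated effects or large leverages receive large weights and therefore contribute more to the updated relationship matrix used in step 7.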

In this algorithm, GLMMs are estimated based on penalized quasi-likelihood (PQL) for MME (see R package hglm and its algorithm in Paper I).

I will not repeat the detailed results in Paper VI here. In short, the marker-specific shrinkage quantity (6.9) that we developed can highlight strong QTL nicely and improve genomic evaluation by a significant amount. Regarding efficiency, our R package bigRR³ has a big computational advantage when the number of covariates greatly exceeds the number of individuals. For instance, for a dataset with 100 individuals and 1 million markers, on my laptop, bigRR fits a ridge regression in about 1 minute, and in another minute, the fitting of the generalized ridge regression proposed in this paper is done.

By simplifying the DHGLM in Paper III, we sacrificed the nice smoothing (6.3) that we added to the marker-specific variance components. In order to guarantee a fast and stable DHGLM algorithm with a nice QTL mapping profile, further research is required to incorporate some smoothing technique into the simplified algorithm.

³ Package URL: https://r-forge.r-project.org/R/?group_id=1301

7. Beyond Plain Heritability: Variance-Controlling Genes (Paper V)

“The , or uniformity, of an individual’s character is not only of great practical importance in and food production but is also of scientific and evolutionary interest.” —— Lars Rönnegård & William Valdar1

1 Page 435, RÖNNEGÅRD and VALDAR (2011)

Simple methods often give us promising results. This is not because the sophisticated methods are worse, but because of the unbeatable robustness of simple methods such as linear regression, Student’s t-test, ANOVA, etc. In Paper V (SHEN et al. 2012b), it was a simple test for variance heterogeneity that revealed promising and interesting results.

The publicly available Arabidopsis dataset was re-analyzed by screening the genome using the Brown-Forsythe test (BROWN and FORSYTHE 1974) for variance heterogeneity. The Brown-Forsythe test used for such a “vGWAS” is a robust statistical test for the equality of group variances in terms of phenotypic distribution (STRUCHALIN et al. 2010). If the phenotypic value is $y_{ij}$ for individual $i$ with genotype $j$, where $j = 1, \ldots, m$ and $i = 1, \ldots, n_j$, the absolute deviation from the median of each genotype is

\[ y^*_{ij} = |y_{ij} - \tilde{y}_j| \tag{7.1} \]

where $\tilde{y}_j$ is the median of the phenotypic values of the individuals that have genotype $j$. Performing a one-way ANOVA on $y^*_{ij}$, we have the ANOVA F statistic

\[
F = \frac{(n - m)\sum_{j=1}^{m} n_j \left(\bar{y}^*_{\cdot j} - \bar{y}^*_{\cdot\cdot}\right)^2}
         {(m - 1)\sum_{j=1}^{m}\sum_{i=1}^{n_j} \left(y^*_{ij} - \bar{y}^*_{\cdot j}\right)^2}
\tag{7.2}
\]

where $n_j$ is the number of observations in group $j$ and $n = \sum_j n_j$ is the total number of individuals. This F statistic follows an F distribution with $m - 1$ and $n - m$ degrees of freedom. When $n$ is large enough, $(m - 1)F$ can be approximated by a $\chi^2$ statistic with $m - 1$ degrees of freedom. The nominal p-values calculated using such $\chi^2$ statistics were used in vGWAS with a Bonferroni corrected significance threshold.

After conservatively filtering the detected associations (genomic control was applied, see Chapter 2) and database checking, the most striking example is the molybdenum transporter MOT1 in the plant (Figure 3 in Paper V). The analysis of the molybdenum content trait shows a clean genome-wide scan profile, where the only significant peak is located around MOT1, validated by BAXTER et al. (2008) via a QTL analysis in an F2 intercross with further molecular experiments. There is one SNP typed right in the exon segment of MOT1. From the quantitative genetics point of view, this molecular evidence is sufficiently exciting to draw more attention to the variance-control topic.

Another clear and significant signal is a locus that we named variation in serration, or VS, which regulates the variability in leaf serration of the plant (Figure 4 in Paper V). It is interesting that the position of the VS locus coincides with an important candidate transcription factor, ANAC13 (RIECHMANN et al. 2000). However, the association peak does not seem to be caused by the candidate ANAC13, since the signal on ANAC13 is quite low and, according to the SNP data, ANAC13 is not in strong linkage with the significant segments around it. Therefore, the VS locus requires further fine-mapping or molecular experiments to be understood better.

In contrast to the traditional phenotypic variance dissection (see Chapter 8), and considering genetic variance heterogeneity effects, a suggestive way of re-dissecting the phenotypic variance $V_P$ is proposed in Paper V. For a single locus, we dissect the phenotypic variance into the variance due to the mean shift between genotypes, $V_M$, the variance due to the variance heterogeneity, $V_V$, and the remaining residual variance $V_R$, i.e.

\[ V_P = V_M + V_V + V_R \tag{7.3} \]

Since inbred lines are analyzed in Paper V, there is no dominance and consequently $V_M = V_A$. Comparing to (8.1), we have $V_R \le V_E$, where equality holds if and only if $V_V = 0$, i.e. $V_V$ captures a part of $V_E$ that is not pure stochastic noise, but actually due to genetics. For 52 quantitative phenotypes, we compared the portions $V_M/V_P$ and $V_V/V_P$ for all the available SNPs across the genome (Figure 1 in Paper V). The genetic contribution to the variance of the phenotypes seems to be as common as that to the mean.
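The Brown-Forsythe statistic defined in (7.1) and (7.2), which drives the vGWAS scan described in this chapter, can be sketched in a few lines. The function name and the toy phenotypes are hypothetical; the sketch returns the F statistic and its degrees of freedom only and leaves the p-value calculation aside.

```python
from statistics import median


def brown_forsythe(groups):
    """Brown-Forsythe F statistic for equality of group variances.

    groups : list of lists, one list of phenotypes per genotype class.
    Returns (F, df1, df2).
    """
    # (7.1): absolute deviations from each group's median
    dev = []
    for g in groups:
        med = median(g)
        dev.append([abs(v - med) for v in g])
    m = len(dev)
    n = sum(len(d) for d in dev)
    group_means = [sum(d) / len(d) for d in dev]
    grand_mean = sum(sum(d) for d in dev) / n
    # (7.2): one-way ANOVA F statistic on the absolute deviations
    between = sum(len(d) * (gm - grand_mean) ** 2
                  for d, gm in zip(dev, group_means))
    within = sum((v - gm) ** 2
                 for d, gm in zip(dev, group_means) for v in d)
    F = (n - m) * between / ((m - 1) * within)
    return F, m - 1, n - m


low_var = [9.0, 10.0, 11.0, 10.0, 10.0]    # toy phenotypes, genotype 1
high_var = [6.0, 14.0, 10.0, 2.0, 18.0]    # toy phenotypes, genotype 2
F, df1, df2 = brown_forsythe([low_var, high_var])
```

Both toy groups share the same median, so a test on the mean would see nothing, while the statistic above reflects the difference in spread.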

Figure 7.1. A schematic example illustrating how a high-variance allele can arise from a “restriction” type of interaction. When a high level of trait expression has a negative effect, a restrictor is in operation to reduce the trait level. In a studied population, different individuals react differently to the restriction. Due to the mixture of over-restricted and under-restricted individuals in the population, the group of individuals that have the high-expression (radical) allele will show a larger group variance compared to the individuals that have the normal stable allele.

As shown by both PARÉ et al. (2010) and STRUCHALIN et al. (2010), and also discussed by RÖNNEGÅRD and VALDAR (2011), variance heterogeneity of a single locus is possibly due to interaction effects (epistasis or gene-by-environment interaction). In Paper V, an interesting finding is that the candidate gene FRI for flowering-time-related traits shows interaction with vernalization condition (Figure 6 in Paper V). Even though FRI mainly has a strong mean-controlling effect, its variance heterogeneity is also substantial due to interaction.

Here, instead of repeating the results in Paper V in detail, I continue the discussion in Chapter 2 on the relationship between interaction and variance heterogeneity. For example, both PARÉ et al. (2010) and STRUCHALIN et al. (2010) have reported a variance-heterogeneity signal associated with C-reactive protein levels (CRP). PARÉ et al. (2010) noticed a significant interaction between the variance-prioritized locus and BMI (p-value = 7.2 × 10^-10). However, there are many factors that could potentially affect CRP and thus might interact with the detected locus. In particular, the CRP level responds to inflammation, which is usually treated with medicine taken by the individuals. Since individuals react differently to the treatment, larger variance may be created. Figure 7.1 shows a general and perhaps common situation in which a radical allele could potentially become the high-variance allele as well, because of the diverse response to a certain “restriction”. The restriction could come from another protein, a treatment, a different environment, etc. Therefore, for quantitative traits with known factors that restrict the outcome, mapping variance-controlling loci can be more useful for finding interactions.
Let us assume the two-way interaction model (2.21) again. Since true positive variance heterogeneity of a certain locus is a sign of interaction effects, can we predict anything about the interaction effect via the variance-controlling locus? The answer seems to be yes. Based on the theory and definitions in Paper V, Figure 7.2 shows some simulation results under model (2.21). Fixing the main effect of the other interacting factor to be null, for a particular low-variance allele frequency (LAF)², the portion of the phenotypic variance due to variance heterogeneity ($V_V/V_P$, see Paper V) is a function of the broad sense heritability ($H^2$). This suggests that one could potentially predict $H^2$ from $V_V/V_P$, if the variance heterogeneity of a single marker comes from a two-way interaction model. Therefore, starting from loci showing variance heterogeneity could potentially reveal more broad sense heritability due to interaction.

² LAF is the frequency of the allele that generates low phenotypic variance. Genotypic data from inbred lines were analyzed in Paper V, so the low-variance allele frequency was equivalent to the low-variance genotype frequency.
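How an interaction shows up as variance heterogeneity at a single locus can be sketched with a generic two-way interaction model. The symbols below ($x$ for the genotype at the tested locus, $z$ for the unobserved interacting factor, and the $\beta$ coefficients) are assumptions made for illustration and need not match the notation of model (2.21); $z$ is taken to be independent of both $x$ and the residual $e$.

```latex
\begin{align*}
y &= \mu + \beta_x x + \beta_z z + \beta_{xz}\, x z + e,
  \qquad \mathrm{Var}(e) = \sigma^2 \\
\mathrm{Var}(y \mid x) &= (\beta_z + \beta_{xz}\, x)^2\, \mathrm{Var}(z) + \sigma^2
\end{align*}
```

With $\beta_z = 0$, as in the simulations, the conditional variance reduces to $\beta_{xz}^2 x^2 \mathrm{Var}(z) + \sigma^2$, so the genotype groups differ in variance whenever $\beta_{xz} \neq 0$.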

[Figure 7.2: $V_V/V_P$ (vertical axis, 0.0 to 1.0) plotted against $H^2$ (horizontal axis, 0.0 to 1.0), with one series for each LAF = 0.1, 0.2, ..., 0.9.]

Figure 7.2. The relationship between the phenotypic variance ($V_P$) portion due to variance heterogeneity ($V_V$) and the broad sense heritability ($H^2$) assuming a two-way interaction model. The two-way interaction model includes the main effect of the tested locus, the main effect of the other interacting factor (set to null), and their interaction effect. The broad sense heritability is estimated as the coefficient of determination of the full interaction model. LAF = low-variance allele frequency (see Paper V).


Part III: Discussion & Conclusion

8. Discussion

“Statistics are like bikinis. What they reveal is suggestive, but what they conceal is vital.”

—— Aaron Levenstein1

Many topics and challenges have arisen in current quantitative genetics (e.g. TIWARI and SCHORK 2011). One should notice that the development of analytical tools in quantitative genetics is always driven by data coming from recent biotechnologies. Experimental design is important; however, the common routine of genetic analysis is like “get the data and see what we can do”. “Old” methods could become less useful as soon as the data type changes. For instance, many researchers have developed methods for variance component QTL mapping, which are now rarely used in populations where dense SNPs are available for GWAS. We should therefore try to think more about how to deal with coming issues in genetic analysis. Some discussion on predictive modeling is given in Chapter 3. Here, I will mainly discuss two big challenges in statistical genetics: missing heritability (MAHER 2008) and new data types such as rare variants (GIBSON 2012), and how they relate to the statistical methods developed in this thesis. Some potential future developments are also discussed.

8.1 Heritability: How much can we explain? Variation in a phenotype has both its genetic and environmental components. The phenotypic variance VP can be dissected as

VP = VG +VE (8.1) where VG and VE are the genetic and environmental variance, respectively (FALCONER and MACKAY 1996). The portion in VP that is determined by genetics is called the broad sense heritability, i.e. V H2 = G (8.2) VP

VG is further dissected to be VA +VD +VI, where VA, VD, and VI are the additive, dom- inance, and interaction variance, respectively. In this formulation, VI includes both

1Quoted by Nature Genetics 24, page 11, January 2000.

54 the gene-by-gene and gene-by-environment interactions. The additive component VA plays a central role in phenotypic prediction, so the narrow sense heritability

V h2 = A (8.3) VP has drawn the most attention in quantitative genetics research. A general knowledge about heritability is that traits related to fitness are generally less heritable than others, e.g. litter size in pigs (h2 ≈ 5%) and egg production in fruit flies (h2 ≈ 20%). Traits like human height mentioned above and back fat in pigs (h2 ≈ 70%) have very high heritability but with fairly high complexity in heredity as well. The mappable loci often uncover a small proportion of the heritability of a complex trait, whereas the whole genome explains much more than that. An interesting trait that has been studied since GALTON (1886) is human height. It has been concluded from many studies to be highly heritable (h2 ≈80%), but the discovered influential genes determine only a limited amount of variation in height, for instance, even the major locus HMGA2 explains only ∼0.3%. Based on the GBLUP model, a recent study conducted by a research group in Australia found that human height is 45% in- fluenced by around 300 thousands common variants typed in our genome (YANG et al. 2010a). Still, a substantial amount (h2 ≈35%) of heritability is not there in the com- mon SNPs. Taking blood lipids as another example, it is around 30% heritable, how- ever, the major locus APOE explains only ∼ 0.5% of the variation in total cholesterol, and the highly significant locus CETP explains just ∼ 2.5% of HDL (AULCHENKO et al. 2009;TESLOVICH et al. 2010). Even all the 95 loci reported by TESLOVICH et al. (2010) explains only about 5% of the cholesterol variation. What are the reasons that cause heritability to be missing? For instance, we might have missed alleles with rather small effects; most causal variants are not really ob- served; sex chromosomes are not fully considered; chromosomal rearrangements and rare variants are not considered. 
From a statistical modeling point of view, dominance, epistasis, gene-environment interaction, etc., contribute to broad sense heritability but are not addressed in the currently popular predictive models. The HEM developed in Paper VI has the potential to fit more regressors in a computationally efficient way, so it is worth trying to include dominance and epistasis in the analysis. Rare variants might be treated as random effects assuming a particular random effects distribution in HGLMs, but further investigation is certainly required.

The findings of variance-controlling loci in Paper V do not directly contribute to the “missing heritability”, at least when the phenotypic distribution per genotype is normal. This can easily be shown using the information gain of KENT (1983) (see footnote 2). The information gain is computed by comparing the model assuming constant residual variance with the one assuming residual variance heterogeneity. When the residuals follow a normal distribution, the information gain equals zero. However, as we discussed in Chapter 2, variance-controlling loci can potentially contribute to broad sense heritability through interaction (e.g. PARÉ et al. 2010), which remains to be revealed by further studies.
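Putting dominance and epistasis into the analysis, as suggested above, amounts to adding extra regressors per marker and per marker pair. The following is a minimal sketch of one common coding, assuming genotypes are stored as 0, 1 or 2 copies of a reference allele. The coding and all function names are assumptions made for illustration and are not taken from Paper VI.

```python
def additive_code(genotype):
    """Additive regressor: -1, 0, 1 for 0, 1, 2 copies of the allele."""
    if genotype not in (0, 1, 2):
        raise ValueError("genotype must be 0, 1 or 2")
    return genotype - 1


def dominance_code(genotype):
    """Dominance regressor: 1 for heterozygotes, 0 for homozygotes."""
    if genotype not in (0, 1, 2):
        raise ValueError("genotype must be 0, 1 or 2")
    return 1 if genotype == 1 else 0


def design_row(genotypes):
    """Regressors for one individual.

    Returns the additive codes for all markers, then the dominance codes,
    then the additive-by-additive products for every marker pair.
    """
    add = [additive_code(g) for g in genotypes]
    dom = [dominance_code(g) for g in genotypes]
    m = len(genotypes)
    epi = [add[i] * add[j] for i in range(m) for j in range(i + 1, m)]
    return add + dom + epi


# One individual typed at three markers.
row = design_row([0, 1, 2])
print(len(row), row)
```

With m markers this gives 2m + m(m-1)/2 columns, so the number of regressors grows quadratically. That growth is the reason a computationally efficient fitting method matters here.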

2 The information gain is defined as the difference in the information of FRASER (1965), which is determined by the likelihood of the model.

8.2 New data types: How to integrate information?

Current biotechnology is generating new assays for DNA and even RNA sequences. As sequencing develops, modeling such sequence data is a big challenge for statistical analysis tools. On a SNP array, what we see are the typed tags, mostly common in the population, rather than the real complete sequence. When individuals are sequenced, we get not only information on haplotypes but also a number of rare variants (BODMER and BONILLA 2008). The rare variants in a DNA segment are sparse like stars in the sky; however, instead of being shiny, they can be rather annoying. First of all, rare variants do have effects, but how to collect these effects at the population level is a difficult question to answer.

Both common and rare variants are typed in different kinds of DNA segments, for which plenty of annotation or gene ontology (GO) information is available in databases, but it is disappointing that we do not use such information in gene mapping and predictive modeling. A hierarchical modeling framework is required to re-weight typed variants according to where they are typed, for instance, whether a particular SNP is located in an exon of an annotated gene or in an intron instead. Certainly, SNP arrays are not designed for typing causal variants but just tags; nevertheless, as sequencing data become available, treating all the variants the same regardless of where they are located seems to be incorrect. The DHGLM in Paper III (or a similar HGLM) gives us an option to model such biological information. In random effects models, the annotation information could be modeled as a dispersion model for the marker-specific genetic variance component.
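One way to write down the idea in the last sentence is sketched below. The annotation vector a_j, the coefficient vector γ and the log link are assumptions introduced here for illustration; they are not taken from Paper III.

```latex
\begin{align}
y &= X\beta + Zu + e, \qquad e \sim N(0, \sigma_e^2 I), \\
u_j &\sim N(0, \sigma_j^2), \qquad j = 1, \dots, m, \\
\log \sigma_j^2 &= a_j^{\top}\gamma .
\end{align}
```

Here a_j could hold indicators such as "SNP j lies in an exon of an annotated gene". A positive element of γ would then give such markers a larger variance σ_j², which means less shrinkage of their effects u_j.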

8.3 Future development

Future development of the work in this thesis would be based mainly on the novelties in Papers III, V, and VI. The double-layer random effects model (DHGLM) in Paper III has good potential to incorporate complex biological information into gene mapping and genomic evaluation. In short, genetic variants typed at different segments of DNA should be weighted differently, and this can be achieved using a DHGLM or an HGLM with structured genetic variance. Such a double-layer model can be used to model genetic variants in a sliding window and test the joint genetic effects. As sequencing data become available, sliding window analysis seems to be a way to score and test rare variants. The generalized ridge regression developed in Paper VI can be regarded as a fast approximation to the DHGLM in Paper III. Although it is not as flexible as the DHGLM or other similar Bayesian methods, its computational advantage should help to make hierarchical models more useful in fitting high-dimensional genomic data.

The statistical test and quantitative theory in Paper V are currently quite simple; however, the results are already striking enough for us to pay attention to variance-controlling genes. There are quite a few open issues in mapping variance-controlling QTL (vQTL; RÖNNEGÅRD and VALDAR 2011). For instance, population stratification cannot be solved by simply using a linear mixed model, for which a double-layer random effects model could be a solution (YANG et al. 2011b); the contribution of vQTL to genomic selection is not clear yet; the relationship between variance heterogeneity and heritability needs to be further investigated; and so forth. Some of the topics mentioned here are ongoing at the time of writing.
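To make the notion of a test for variance-controlling loci concrete, here is a minimal sketch of the Brown–Forsythe statistic (BROWN and FORSYTHE 1974) applied to phenotypes grouped by genotype. This section does not state which statistic Paper V uses, so the choice of this test, the function name and the toy data are assumptions made for illustration.

```python
from statistics import mean, median


def brown_forsythe_f(groups):
    """Brown-Forsythe F statistic for equality of variances.

    groups: one list of phenotype values per genotype class.
    The statistic is a one-way ANOVA F computed on the absolute
    deviations of each value from its own group median.
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    if k < 2 or n <= k or any(len(g) == 0 for g in groups):
        raise ValueError("need at least two non-empty groups and n > k")

    # Absolute deviations from the group medians.
    z = [[abs(y - median(g)) for y in g] for g in groups]
    z_means = [mean(g) for g in z]
    z_grand = sum(sum(g) for g in z) / n

    between = sum(len(g) * (m - z_grand) ** 2
                  for g, m in zip(z, z_means)) / (k - 1)
    within = sum((v - m) ** 2
                 for g, m in zip(z, z_means) for v in g) / (n - k)
    if within == 0:
        raise ValueError("no within-group variation in the deviations")
    return between / within


# Toy data: two genotype classes with the same median but different spread.
f_stat = brown_forsythe_f([[4, 5, 6], [0, 5, 10]])
print(f"Brown-Forsythe F = {f_stat:.4f}")
```

Under the null hypothesis of equal variances the statistic is compared with an F distribution with k - 1 and n - k degrees of freedom. The sketch ignores relatedness and population stratification, which are among the open issues listed above.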

9. Conclusion

“In population genetics there is usually little reason for confidence that an estimate is correct even to within an order of magnitude, but reaching it faster is definitely progress.”

—— Rosalind Harding1

This thesis develops some analytical methods that cover different branches of statistical genetics, including QTL analysis, GWAS, and genomic evaluation. More information has been taken into account so that accuracy in QTL mapping using variance component models can be improved (Paper II). Mapping variance heterogeneity in GWAS has been shown to be able to discover significant loci missed in earlier studies (Paper V). Double hierarchical generalized linear models, and a new generalized ridge regression, have been implemented and applied to high-dimensional genomic data analysis, showing promising results in both QTL mapping and genomic prediction (Papers III & VI). In conclusion, the work presented here provides novel insights into modeling genetic variance, which is the major contribution of this thesis.

Most of the tools developed in this thesis have been implemented as R packages and are available online, including a general statistical tool for fitting random effects models (package hglm, Paper I), an efficient generalized ridge regression for high-dimensional data (package bigRR, Paper VI), a double-layer mixed model for genomic data analysis (package iQTL, Paper III), a stochastic IBD matrix calculator (package MCIBD, used in Paper II), a computational interface for QTL mapping (package qtl.outbred, Paper IV), and a GWAS analysis tool for mapping variance-controlling loci (package vGWAS, Paper V).

1 In the discussion on STEPHENS and DONNELLY (2000), Journal of the Royal Statistical Society, Series B 62(4), page 638.

Sammanfattning på Svenska (Summary in Swedish)

This thesis develops and evaluates statistical methods for different types of genetic analyses, including quantitative trait loci (QTL) analysis, genome-wide association studies (GWAS) and genomic evaluation. The most important result of the thesis is to provide new insights into modeling genetic variation, particularly via models with random effects.

A method for QTL analysis was developed in which uncertainty in inheritance is included in the model. This model was shown to correct, to some extent, for bias in the estimates and to increase the precision of QTL mapping.

Double hierarchical generalized linear models, as well as a simplified version, were developed and applied in whole-genome analysis. The methods showed high accuracy in QTL mapping and genomic prediction.

An analysis of publicly available GWAS data identified significant loci in Arabidopsis that control the phenotypic variance rather than the mean. This study strengthens the evidence for the existence of variance-controlling genes.

The studies in the thesis are accompanied by R packages that are available online. These include a statistical tool for fitting models with random effects (hglm), generalized ridge regression for high-dimensional data (bigRR), an analysis tool for genomic data (iQTL), a stochastic IBD matrix calculator (MCIBD), a computational program and interface for QTL mapping (qtl.outbred), and an analysis tool for mapping variance-controlling loci in GWAS (vGWAS).

Acknowledgements

The work conducted in this thesis was performed in both Uppsala (Swedish University of Agricultural Sciences) and Borlänge (Dalarna University). The Swedish Foundation for Strategic Research is acknowledged for financial support. Thanks to the National Graduate School in Scientific Computing for conducting and funding many useful courses.

This thesis would not have been possible without support from other people. First of all, I owe my deepest gratitude to my two supervisors, Lars Rönnegård and Örjan Carlborg. They have provided support in many different ways. I am grateful to Lars for guiding me into the field of statistical genetics. Lars has contributed his selfless support to each of my projects, with both responsibility and talent. Not only has it always been interesting to discuss ideas and work with him, but playing table tennis and billiards with him has also been a lot of fun. I would like to thank Örjan for interesting discussions about creative scientific ideas and also some outlook on life. In particular, the nice open-minded computational genetics group that Örjan is leading has been a wonderful environment for my research.

It is a pleasure for me to thank the previous and current members of the computational genetics group in Uppsala. Thanks to Ronnie Nelson for working together on our published package and for his friendly smile that makes people feel happy. Thanks to Mats Pettersson for co-authoring with plenty of technical and intellectual support. Thanks to Marcin Kierczak for sharing the office with me and growing Arabidopsis for me. Thanks to François Besnier for discussions on variance component QTL analysis and his generous support on IBD matrix calculation. Thanks to Weronica Ek for challenging me with her data and traveling sleepily with me back from the US. Thanks to Lucy Crooks for interesting discussions and language support with her Queen's English. Many thanks also to Anna Johansson, Stefan Marklund, Xidan Li, Jiazhong Guo, Zheya Sheng and Muhammad Ahsan for their help with science and life.

I am indebted to my colleagues in Borlänge for both their support with statistical knowledge and the nice working environment that they have created. Thanks to Moudud Alam for sharing the office with me and helping me a lot to understand the inference theory of mixed models. Thanks to Kenneth Carling for providing funding for several conferences that I attended and also for letting me help in the skiing race in Falun. Thanks to Changli He for being my teacher since my master's studies and enlightening me with his experience. Also, many thanks to Majbritt Felleki, Richard Stridbeck, Dao Li, Xiangli Meng, Mengjie Han and Ola Nääs for their help and interesting discussions. I would also like to thank Fan Yang Wallentin, who lectured in Borlänge as well, for her instructions and guidance for my study and the opportunity to present my work at the statistics department of Uppsala University. Thanks to Johan Bring, who worked in Borlänge, for his helpful supervision during my master's studies. Also, thanks to Mikael Möller, who also worked in Borlänge, for interesting discussions and his technical support on LaTeX.

It is a great honor for me to thank Yurii Aulchenko for the opportunity to visit Rotterdam and learn from him. Yurii is such a nice and talented person to work with, who is very pedagogical and patient in discussions. We not only discussed several ideas to work on; he also introduced me to a lot of common knowledge about human populations and genome-wide association studies that I was not aware of. I also would like to acknowledge him for buying me coffee during our discussions in Rotterdam. I am grateful to Freddy Fikse for co-authoring with his knowledge in the animal breeding area and some very useful suggestions. I also would like to thank Carl Nettelblad for his support via his software for genotype probability calculation and other interesting discussions.

Without the advice from the lecturers of the courses that I have taken, it would not have been possible for me to build up knowledge for science. I would like to thank Hossein Jorjani for offering his quantitative genetics course and interesting discussions in Texas. Thanks to Youngjo Lee for lecturing at the winter course in Vålådalen about hierarchical generalized linear models, and also for a lot of interesting discussions with him about both research and skiing. Thanks to Maya Neytcheva for her useful lectures on numerical methods and matrix computation. Thanks to Dietrich von Rosen for not only his lectures but also quite a few interesting discussions regarding science and other aspects. Also, thanks to other lecturers including Bruce Walsh, Martin Berggren, Jonas Lindemann, Lars Eldén, Jarrod Hadfield and Jeffrey S. Racine for their lectures on both theory and software packages.

In addition, I would like to show my gratitude to many friends from the same master's program at Dalarna University as me, who are currently also doing their doctoral research in Sweden. These include Ying Li at the Swedish University of Agricultural Sciences; Jianxin Wei, Xingwu Zhou, and Xijia Liu at Uppsala University; Feng Li, Yuli Liang, Ying Pang, and Chengcheng Hao at Stockholm University. All of them have been good buddies on the way towards a PhD.

Thanks to Jinzhi Hu, Liang Tian, and other friends for enjoying tennis together with me. Thanks to Kaweng Ieong and his family (especially “Abǐ”) for being my friendly neighbors. Thanks to Liang Tian again and his family (especially “Pípi”) for being my friends in Uppsala; traveling with them has always been great fun.

Thanks to all my friends in China, especially to the couple Zirui Yu and Xiaoling Tan who, at the time of writing, are planning a trip to Sweden to attend my dissertation defense. Although the distance is long, the support has never disappeared.

Thanks to Linda, for sharing life and goals with me, and also for her help with my Sammanfattning på Svenska.

Finally, I am very grateful to my family and would like to dedicate all my achievements to them, especially to my grandparents for everything that they have been giving selflessly.

References

ANDERSSON, L., C. HALEY, H. ELLEGREN, S. KNOTT, M. JOHANSSON, et al., 1994 Genetic mapping of quantitative trait loci for growth and fatness in pigs. Science 263: 1771–1774.
ATWELL, S., Y. S. HUANG, B. J. VILHJALMSSON, G. WILLEMS, M. HORTON, et al., 2010 Genome-wide association study of 107 phenotypes in Arabidopsis thaliana inbred lines. Nature 465: 627–631.
AULCHENKO, Y. S., D.-J. DE KONING, and C. HALEY, 2007a Genomewide rapid association using mixed model and regression: a fast and simple method for genomewide pedigree-based quantitative trait loci association analysis. Genetics 177: 577–585.
AULCHENKO, Y. S., S. RIPATTI, I. LINDQVIST, D. BOOMSMA, I. M. HEID, et al., 2009 Loci influencing lipid levels and coronary heart disease risk in 16 European population cohorts. Nature Genetics 41: 47–55.
AULCHENKO, Y. S., S. RIPKE, A. ISAACS, and C. VAN DUIJN, 2007b GenABEL: an R package for genome-wide association analysis. Bioinformatics 23: 1294–1296.
BALDING, D. J., 2006 A tutorial on statistical methods for population association studies. Nature Reviews Genetics 7: 781–791.
BAXTER, I., B. MUTHUKUMAR, H. C. PARK, P. BUCHNER, B. LAHNER, et al., 2008 Variation in molybdenum content across broadly distributed populations of Arabidopsis thaliana is controlled by a mitochondrial molybdenum transporter (MOT1). PLoS Genetics 4: e1000004.
BJØRNSTAD, J. F., 1996 On the generalization of the likelihood function and the likelihood principle. Journal of the American Statistical Association 91: 791–806.
BODMER, W., and C. BONILLA, 2008 Common and rare variants in multifactorial susceptibility to common diseases. Nature Genetics 40: 695–701.
BOUWMAN, A. C., L. L. G. JANSS, and H. C. M. HEUVEN, 2011 A Bayesian approach to detect QTL affecting a simulated binary and quantitative trait. BMC Proceedings 5(Suppl 3): S2.
BRESLOW, N. E., and D. G. CLAYTON, 1993 Approximate inference in generalized linear mixed models. Journal of the American Statistical Association 88: 9–25.
BROMAN, K. W., 2003 R/qtl: QTL mapping in experimental crosses. Bioinformatics 19: 889–890.
BROWN, M. B., and A. B. FORSYTHE, 1974 Robust tests for equality of variances. Journal of the American Statistical Association 69: 364–367.
CALUS, M. P. L., H. A. MULDER, and R. F. VEERKAMP, 2011 Estimating genomic breeding values and detecting QTL using univariate and bivariate models. BMC Proceedings 5(Suppl 3): S5.
CANTOR, R. M., K. LANGE, and J. S. SINSHEIMER, 2010 Prioritizing GWAS results: A review of statistical methods and recommendations for their application. American Journal of Human Genetics 86: 6–22.
CARLBORG, Ö., and C. S. HALEY, 2004 Epistasis: too often neglected in complex trait studies? Nature Reviews Genetics 5: 618–625.
CHA, P.-C., A. TAKAHASHI, N. HOSONO, S.-K. LOW, N. KAMATANI, et al., 2011 A genome-wide association study identifies three loci associated with susceptibility to uterine fibroids. Nature Genetics 43: 447–450.
COX, D. R., 1984 Interaction. International Statistical Review / Revue Internationale de Statistique 52: 1–31.
DEMPSTER, A. P., N. M. LAIRD, and D. B. RUBIN, 1977 Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological) 39: 1–38.
DEVLIN, B., and K. ROEDER, 1999 Genomic control for association studies. Biometrics 55: 997–1004.
EDWARDS, A., 1972 Likelihood. Cambridge University Press, Cambridge.
ELSTON, R., and J. STEWART, 1971 A general model for the genetic analysis of pedigree data. Human Heredity 21: 523–542.
FALCONER, D. S., and T. F. MACKAY, 1996 Introduction to Quantitative Genetics. Longman, London.
FISHER, R. A., 1918 The correlation between relatives on the supposition of Mendelian inheritance. Transactions of the Royal Society of Edinburgh 52: 399–433.
FISHER, R. A., 1922 On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London, Series A 222: 309–368.
FRASER, D. A. S., 1965 On information in statistics. The Annals of Mathematical Statistics 36: 890–896.
GALTON, F., 1886 Regression towards mediocrity in hereditary stature. The Journal of the Anthropological Institute of Great Britain and Ireland 15: 246–263.
GARDNER, K. M., and R. G. LATTA, 2007 Shared quantitative trait loci underlying the genetic correlation between continuous traits. Molecular Ecology 16: 4195–4209.
GEORGE, E. I., and R. E. MCCULLOCH, 1993 Variable selection via Gibbs sampling. Journal of the American Statistical Association 88: 881–889.
GESSLER, D. D. G., and S. XU, 1996 Using the expectation or the distribution of the for mapping quantitative trait loci under the random model. American Journal of Human Genetics 59: 1382–1390.
GIANOLA, D., G. DE LOS CAMPOS, W. G. HILL, E. MANFREDI, and R. FERNANDO, 2009 Additive genetic variability and the Bayesian alphabet. Genetics 183: 347–363.
GIBSON, G., 2012 Rare and common variants: twenty arguments. Nature Reviews Genetics 13: 135–145.
HABIER, D., R. L. FERNANDO, K. KIZILKAYA, and D. J. GARRICK, 2011 Extension of the Bayesian alphabet for genomic selection. BMC Bioinformatics 12.
HALEY, C., and S. KNOTT, 1992 A simple regression method for mapping quantitative trait loci in line crosses using flanking markers. Heredity 69: 315–324.
HARVILLE, D. A., 1977 Maximum likelihood approaches to variance component estimation and to related problems. Journal of the American Statistical Association 72: 320–338.
HASTIE, T., R. TIBSHIRANI, and J. FRIEDMAN, 2009 The Elements of Statistical Learning. Springer.
HENDERSON, C. R., 1953 Estimation of variance and covariance components. Biometrics 9: 226–252.
HENDERSON, C. R., 1984 Applications of Linear Models in Animal Breeding. University of Guelph, Guelph, ON, 3rd edition.
HEWITT, J. K., and A. C. HEATH, 1988 A note on computing the chi-square noncentrality parameter for power analyses. Behavior Genetics 1: 105–108.
HILL, W. G., M. E. GODDARD, and P. M. VISSCHER, 2008 Data and theory point to mainly additive genetic variance for complex traits. PLoS Genetics 4: e1000008.
JANSS, L. L. G., 2009 iBay manual version 1.47. Janss , Leiden, the Netherlands.
JIMENEZ-GOMEZ, J. M., J. A. CORWIN, B. JOSEPH, J. N. MALOOF, and D. J. KLIEBENSTEIN, 2011 Genomic analysis of QTLs and genes altering natural variation in stochastic noise. PLoS Genetics 7: e1002295.
KANG, H. M., J. H. SUL, S. K. SERVICE, N. A. ZAITLEN, S.-Y. KONG, et al., 2010 Variance component model to account for sample structure in genome-wide association studies. Nature Genetics 42: 348–354.
KANG, H. M., N. A. ZAITLEN, C. M. WADE, A. KIRBY, D. HECKERMAN, et al., 2008 Efficient control of population structure in model association mapping. Genetics 178: 1709–1723.
KARACAÖREN, B., T. SILANDER, J. M. ÁLVAREZ-CASTRO, C. S. HALEY, and D. J. DE KONING, 2011 Association analyses of the MAS-QTL data set using grammar, principal components and Bayesian network methodologies. BMC Proceedings 5(Suppl 3): S8.
KENT, J. T., 1983 Information gain and a general measure of correlation. Biometrika 70: 163–173.
KINGSMORE, S. F., I. E. LINDQUIST, J. MUDGE, D. D. GESSLER, and W. D. BEAVIS, 2008 Genome-wide association studies: progress and potential for drug discovery and development. Nature Reviews Drug Discovery 7: 221–230.
KLEIN, R. J., C. ZEISS, E. Y. CHEW, J.-Y. TSAI, R. S. SACKLER, et al., 2005 Complement factor H polymorphism in age-related macular degeneration. Science 308: 385–388.
KRUGLYAK, L., and E. LANDER, 1995 Complete multipoint sib-pair analysis of qualitative and quantitative traits. American Journal of Human Genetics 57: 439–454.
LANDER, E., and D. BOTSTEIN, 1989 Mapping Mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics 121: 185–199.
LEE, Y., and J. A. NELDER, 1996 Hierarchical generalized linear models (with discussion). Journal of the Royal Statistical Society. Series B (Methodological) 58: 619–678.
LEE, Y., and J. A. NELDER, 2001 Hierarchical generalised linear models: A synthesis of generalised linear models, random-effect models and structured dispersions. Biometrika 88: 987–1006.
LEE, Y., and J. A. NELDER, 2006 Double hierarchical generalized linear models (with discussion). Applied Statistics 55: 139–185.
LEE, Y., J. A. NELDER, and M. NOH, 2007 H-likelihood: problems and solutions. Statistics and Computing 17: 49–55.
LEE, Y., J. A. NELDER, and Y. PAWITAN, 2006 Generalized Linear Models with Random Effects: Unified Analysis via H-likelihood. Chapman & Hall/CRC.
LEVY, D., G. B. EHRET, K. RICE, G. C. VERWOERT, L. J. LAUNER, et al., 2009 Genome-wide association study of blood pressure and . Nature Genetics .
LIPPERT, C., J. LISTGARTEN, Y. LIU, C. M. KADIE, R. I. DAVIDSON, et al., 2011 Fast linear mixed models for genome-wide association studies. Nature Methods 8: 833–835.
LIU, X., and S. RAUDENBUSH, 2004 A note on the noncentrality parameter and effect size estimates for the F test in ANOVA. Journal of Educational and Behavioral Statistics 29: 251–255.
LYNCH, M., and B. WALSH, 1998 Genetics and Analysis of Quantitative Traits. Sinauer Associates, Inc.
MAHER, B., 2008 The case of the missing heritability. Nature 456: 18–21.
MALO, N., O. LIBIGER, and N. J. SCHORK, 2008 Accommodating linkage disequilibrium in genetic-association analyses via ridge regression. American Journal of Human Genetics 82: 375–385.
MCCARTHY, M. I., G. R. ABECASIS, L. R. CARDON, D. B. GOLDSTEIN, J. LITTLE, et al., 2008 Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nature Reviews Genetics 9: 356–369.
MCCULLAGH, P., and J. A. NELDER, 1989 Generalized Linear Models. Chapman & Hall/CRC.
MEUWISSEN, T., B. HAYES, and M. GODDARD, 2001 Prediction of total genetic value using genome-wide dense marker maps. Genetics 157: 1819–1829.
MORTON, N., and C. MACLEAN, 1974 Analysis of family resemblance. III. Complex segregation of quantitative traits. American Journal of Human Genetics 26: 489–503.
MUCHA, S., M. PSZCZOLA, T. STRABEL, A. WOLC, P. PACZYŃSKA, et al., 2011 Comparison of analyses of the QTLMAS XIV common dataset. II: QTL analysis. BMC Proceedings 5(Suppl 3): S2.
NELDER, J. A., 1994 The statistics of linear models: back to basics. Statistics and Computing 4: 221–234.
NELSON, R. M., X. SHEN, and Ö. CARLBORG, 2011 qtl.outbred: Interfacing outbred line cross data with the R/qtl mapping software. BMC Research Notes 4: 154.
NETTELBLAD, C., 2011 Haplotype inference based on hidden Markov models in the QTL-MAS 2010 multi-generational dataset. BMC Proceedings 5(Suppl 3): S10.
NETTELBLAD, C., S. HOLMGREN, L. CROOKS, and Ö. CARLBORG, 2009 cnF2freq: Efficient determination of genotype and haplotype probabilities in outbred populations using Markov models. Lecture Notes in Bioinformatics (LNBI) 5462: 307–319.
PARÉ, G., N. R. COOK, P. M. RIDKER, and D. I. CHASMAN, 2010 On the use of variance per genotype as a tool to identify quantitative trait interaction effects: a report from the Women’s Genome Health Study. PLoS Genetics 6: e1000981.
PAWITAN, Y., 2001 In All Likelihood: Statistical Modelling and Inference Using Likelihood. Oxford Science Publications.
PRATT, J. W., 1976 F. Y. Edgeworth and R. A. Fisher on the efficiency of maximum likelihood estimation. The Annals of Statistics 4: 501–514.
PRICE, A. L., N. J. PATTERSON, R. M. PLENGE, M. E. WEINBLATT, N. A. SHADICK, et al., 2006 Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 38: 904–909.
PRICE, A. L., N. A. ZAITLEN, D. REICH, and N. PATTERSON, 2010 New approaches to population stratification in genome-wide association studies. Nature Reviews Genetics 11: 459–463.
PSZCZOLA, M., T. STRABEL, A. WOLC, S. MUCHA, and M. SZYDLOWSKI, 2011 Comparison of analyses of the QTLMAS XIV common dataset. I: genomic selection. BMC Proceedings 5(Suppl 3): S1.
PURCELL, S., B. NEALE, K. TODD-BROWN, L. THOMAS, M. A. R. FERREIRA, et al., 2007 PLINK: A tool set for whole-genome association and population-based linkage analyses. American Journal of Human Genetics 81: 559–575.
R DEVELOPMENT CORE TEAM, 2011 R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.
RIECHMANN, J. L., J. HEARD, G. MARTIN, L. REUBER, C.-Z. JIANG, et al., 2000 Arabidopsis transcription factors: Genome-wide comparative analysis among eukaryotes. Science 290: 2105–2110.
ROBINSON, G. K., 1991 That BLUP is a good thing: The estimation of random effects. Statistical Science 6: 15–32.
RÖNNEGÅRD, L., F. BESNIER, and Ö. CARLBORG, 2008 An improved method for quantitative trait loci detection and identification of within-line segregation in F2 intercross designs. Genetics 178: 2315–2326.
RÖNNEGÅRD, L., and Ö. CARLBORG, 2007 Separation of base allele and sampling term effects gives new insights in variance component QTL analysis. BMC Genetics 8: 1.
RÖNNEGÅRD, L., and Y. LEE, 2010 Hierarchical generalized linear models have a great potential in genetics and animal breeding. In Proceedings World Congress on Genetics Applied to Livestock Production, Leipzig, Germany.
RÖNNEGÅRD, L., X. SHEN, and M. ALAM, 2010 hglm: A package for fitting hierarchical generalized linear models. The R Journal 2: 20–28.
RÖNNEGÅRD, L., and W. VALDAR, 2011 Detecting major genetic loci controlling phenotypic variability in experimental crosses. Genetics 188: 435–447.
SCHORK, N. J., 1993 Extended multipoint identity-by-descent analysis of human quantitative traits: Efficiency, power, and modeling considerations. American Journal of Human Genetics 53: 1306–1319.
SEATON, G., J. HERNANDEZ, J. GRUNCHEC, I. WHITE, J. ALLEN, et al., 2006 GridQTL: A grid portal for QTL mapping of compute intensive datasets. In Proceedings of the 8th World Congress on Genetics Applied to Livestock Production, August 13-18, 2006. Belo Horizonte, Brazil.
SHEN, X., M. ALAM, F. FIKSE, and L. RÖNNEGÅRD, 2012a Fast generalized ridge regression for models including heteroscedastic effects in quantitative genetics. Manuscript.
SHEN, X., M. PETTERSSON, L. RÖNNEGÅRD, and Ö. CARLBORG, 2012b Inheritance beyond plain heritability: variance-controlling genes in Arabidopsis thaliana. Submitted.
SHEN, X., L. RÖNNEGÅRD, and Ö. CARLBORG, 2011a Hierarchical likelihood opens a new way of estimating genetic values using genome-wide dense marker maps. BMC Proceedings 5(Suppl 3): S14.
SHEN, X., L. RÖNNEGÅRD, and Ö. CARLBORG, 2011b How to deal with genotype uncertainty in variance component quantitative trait loci analyses. Genetics Research, Cambridge 93: 333–342.
SORENSEN, D., 2009 Developments in statistical analysis in quantitative genetics. Genetica 136: 319–332.
STEPHENS, M., and P. DONNELLY, 2000 Inference in molecular population genetics (with discussion). Journal of the Royal Statistical Society, Series B 62: 605–655.
STRUCHALIN, M. V., A. DEHGHAN, J. C. M. WITTEMAN, C. V. DUIJN, and Y. S. AULCHENKO, 2010 Variance heterogeneity analysis for detection of potentially interacting genetic loci: method and its limitations. BMC Genetics 11: 92.
SUN, X., D. HABIER, R. L. FERNANDO, D. J. GARRICK, and J. C. M. DEKKERS, 2011 Genomic breeding value prediction and QTL mapping of QTLMAS2010 data using Bayesian methods. BMC Proceedings 5(Suppl 3): S13.
SZYDLOWSKI, M., and P. PACZYŃSKA, 2011 QTLMAS 2010: simulated dataset. BMC Proceedings 5(Suppl 3): S3.
TESLOVICH, T. M., K. MUSUNURU, A. V. SMITH, A. C. EDMONDSON, I. M. STYLIANOU, et al., 2010 Biological, clinical and population relevance of 95 loci for blood lipids. Nature 466: 707–713.
THOMPSON, J. R., J. ATTIA, and C. MINELLI, 2011 The meta-analysis of genome-wide association studies. Briefings in Bioinformatics 12: 259–269.
TIAN, F., P. J. BRADBURY, P. J. BROWN, H. HUNG, Q. SUN, et al., 2011 Genome-wide association study of leaf architecture in the maize nested association mapping population. Nature Genetics 43: 159–162.
TIBSHIRANI, R., 1996 Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B 58: 267–288.
TIWARI, H. K., and N. J. SCHORK, 2011 Grand challenges in statistical genetics/genomics methodology. Frontiers in Genetics 2: 1–2.
TURNBULL, C., S. AHMED, J. MORRISON, D. PERNET, A. RENWICK, et al., 2010 Genome-wide association study identifies five new breast cancer susceptibility loci. Nature Genetics 42: 504–507.
VALDAR, W., L. C. SOLBERG, D. GAUGUIER, S. BURNETT, P. KLENERMAN, et al., 2006 Genome-wide of complex traits in heterogeneous stock mice. Nature Genetics 38: 879–887.
VAN VLECK, L. D., 1998 Charles Roy Henderson: A Biographical Memoir. National Academies Press, Washington D.C.
VANRADEN, P. M., 2008 Efficient methods to compute genomic predictions. Journal of Dairy Science 91: 4414–4423.
VERBYLA, K. L., B. J. HAYES, P. J. BOWMAN, and M. E. GODDARD, 2009 Accuracy of genomic selection using stochastic search variable selection in Australian Holstein Friesian dairy . Genetics Research, Cambridge 91: 307–311.
WALLIS, M., P. WATERS, and J. GRAVES, 2008 Sex determination in mammals - before and after the evolution of SRY. Cellular and Molecular Life Sciences 65: 3182–3195.
WANG, F., C.-Q. XU, Q. HE, J.-P. CAI, X.-C. LI, et al., 2011 Genome-wide association identifies a susceptibility locus for in the Chinese Han population. Nature Genetics 43: 345–349.
WANG, H., Y.-M. ZHANG, X. LI, G. L. MASINDE, S. MOHAN, et al., 2005a Bayesian shrinkage estimation of quantitative trait loci parameters. Genetics 170: 465–480.
WANG, W. Y. S., B. J. BARRATT, D. G. CLAYTON, and J. A. TODD, 2005b Genome-wide association studies: theoretical and practical concerns. Nature Reviews Genetics 6: 109–118.
WU, R., C. MA, and G. CASELLA, 2007 Statistical Genetics of Quantitative Traits: Linkage, Maps, and QTL. Springer Science + Business Media, LLC.
XU, S., 1996 Computation of the full likelihood function for estimating variance at a quantitative trait locus. Genetics 144: 1951–1960.
XU, S., 2003 Estimating polygenic effects using markers of the entire genome. Genetics 163: 789–801.
XU, S., 2007 An empirical Bayes method for estimating epistatic effects of quantitative trait loci. Biometrics 63: 513–521.
YANG, J., B. BENYAMIN, B. P. MCEVOY, S. GORDON, A. K. HENDERS, et al., 2010a Common SNPs explain a large proportion of the heritability for human height. Nature Genetics 42: 565–569.
YANG, J., S. H. LEE, M. E. GODDARD, and P. M. VISSCHER, 2010b GCTA: A tool for genome-wide complex trait analysis. American Journal of Human Genetics 88: 76–82.
YANG, J., T. A. MANOLIO, L. R. PASQUALE, E. BOERWINKLE, N. CAPORASO, et al., 2011a Genome partitioning of genetic variation for complex traits using common SNPs. Nature Genetics 43: 519–525.
YANG, Y., O. F. CHRISTENSEN, and D. SORENSEN, 2011b Analysis of a genetically structured variance heterogeneity model using the Box-Cox transformation. Genetics Research, Cambridge 93: 33–46.
YI, N., and S. XU, 2008 Bayesian LASSO for quantitative trait loci mapping. Genetics 179: 1045–1055.
ZENG, Z.-B., 1993 Theoretical basis for separation of multiple linked gene effects in mapping quantitative trait loci. Proceedings of the National Academy of Sciences, USA 90: 10972–10976.
ZENG, Z.-B., 1994 Precision mapping of quantitative trait loci. Genetics 136: 1457–1468.
