
Perspectives

Study designs: Opinion

Predicting genetic predisposition in humans: the promise of whole-genome markers

Gustavo de los Campos, Daniel Gianola and David B. Allison

Abstract | Although genome-wide association studies have identified markers that are associated with various human traits and diseases, our ability to predict such phenotypes remains limited. A perhaps overlooked explanation lies in the limitations of the genetic models and statistical techniques commonly used in association studies. We propose that alternative approaches, which are largely borrowed from animal breeding, provide potential for advances. We review selected methods and discuss the challenges and opportunities ahead.

The continued advance of genome assessment technologies has brought the promise of genomic medicine [1–3]. Genome-wide association (GWA) studies have uncovered many loci related to genetic predisposition to human diseases and traits. However, in most cases, these loci explain such a small fraction of phenotypic variability that their use for predicting diseases is limited [3,4].

Several explanations [3,4] have been proposed for our scant progress in predicting health outcomes from genetic markers. First, the currently identified SNPs might not fully describe genetic diversity. For instance, these SNPs may not capture some forms of genetic variability that are due to copy number variation.

Second, genetic mechanisms might involve complex interactions among genes and between genes and environmental conditions, or epigenetic mechanisms, which are not fully captured by additive models. However, opportunities may exist for improving predictions by exploiting additive genetic variation [5].

A third explanation, the one we focus on here, lies in the limitations posed by the genetic models and statistical methods that are commonly used to study genetic predisposition in humans. Indeed, single-marker regression (SMR), the most commonly used approach to study the association between diseases and genotypes in humans, makes sense under the assumption that only a few genes affect genetic predisposition. This approach is unsatisfactory for many important human traits, which may be affected by a large number of small-effect, possibly interacting, genes [6,7].

Quantitative genetic theory [8,9] addresses this latter problem. The foundations of this theory were established early in the twentieth century by Fisher [10] and Wright [11], who proposed methods for describing the resemblance between relatives and for estimating trait heritability. Building on those principles, quantitative geneticists [12,13] developed pedigree-based methods to predict genetic values (BOX 1).

Over many decades, these predictions were used for selective breeding in animals and plants. More recently, methods for whole-genome marker-enabled prediction (WGP) of genetic values were developed [14]. Unlike GWA studies, these methods use all available genetic information jointly. Crucially, these methods focus on the prediction of genetic values and phenotypes, rather than on the identification of specific genes, which has been the central focus of human GWA studies. Positive results from simulation studies [14,15] and empirical evidence [16–21] have prompted the relatively quick adoption of these methods for commercial breeding.

Here we outline the potential use of these quantitative genetic methods for predicting health-related outcomes. We first describe the methodology and then discuss the challenges and opportunities associated with the application of WGP to disease-related traits in humans. We propose that, even within the limits imposed by currently identified SNPs, alternative statistical methods may offer opportunities to advance our ability to predict disease. These methods can be readily applied to human traits, as the type of data required for implementing them is the same as that used in standard GWA studies.

Whole-genome marker-enabled prediction

Building predictive models of complex phenotypes can be extremely challenging, as such traits can be affected by many loci that interact in cryptic ways. Ideally, one would select a model (that is, a subset of markers and interaction terms) in the set of all possible models that can be built from p marker genotypes. Models can be compared based on a model comparison criterion or by traditional hypothesis testing. However, when p is large, exploring all possible models is not feasible. The search among different models can be simplified by, for example, ruling out epistasis. Nevertheless, with a large p, it is also not feasible to test all possible additive and dominance models. In practice, most human GWA studies choose models by selecting markers based on some form of SMR. Unfortunately, when markers are in linkage disequilibrium (LD) with many quantitative trait loci (QTLs), a situation that is highly likely for complex traits, SMR yields inconsistent estimates of marker effects. For these and other reasons, selecting a model when the number of candidate predictors is large is a daunting task, and the initial SMR approach has not been very successful for complex traits.

WGP methods. An alternative is to infer a predictive function using all available markers jointly. Such WGP methods were pioneered by Meuwissen et al. [14], who proposed regressing phenotypes on all marker covariates jointly using a linear model. With $p \gg n$ (in which n is the number of

880 | December 2010 | Volume 11 www.nature.com/reviews/genetics © 2010 Macmillan Publishers Limited. All rights reserved

individuals in the data set), one can usually infer genetic values accurately even when large uncertainty about marker effects persists. That is, predictions can be made even when the information about the effect of each genetic marker is limited. The problem of uncovering signal from noisy data in large-p with small-n problems is not unique to genomic applications. Alternatives to the linear model [14] exist in the statistical and machine-learning literature. The remainder of this section provides an overview of WGP methods. We begin by describing a general formulation of the problem and then introduce several ways in which markers can be incorporated into models.

Standard quantitative genetic models. In a quantitative genetic model [8], a continuous phenotype $y_i$ ($i = 1,\dots,n$) is described as the sum of a genetic signal (termed 'genetic value') $g_i$ and of a model residual $\varepsilon_i$, which includes all sources of variation omitted in $g_i$; that is, $y_i = g_i + \varepsilon_i$. The model may include additional effects (for example, effects of experimental conditions), which are ignored here for simplicity. This model is used to derive concepts such as heritability and also to derive predictions of genetic values given phenotypes (BOX 1).

Marker-based quantitative genetic models. In WGP models, genetic values are viewed as a function of all available markers. Consequently, the genetic model becomes $y_i = g(x_i, \theta) + \varepsilon_i$, in which $g(x_i, \theta)$ is a function mapping from the marker genotypes $x_i = (x_{i1},\dots,x_{ip})'$ onto genetic values and $\theta$ denotes the collection of model unknowns (parameters to be estimated from the data). Several methods are available; they differ in how marker genotypes are incorporated into $g(x_i, \theta)$ and in how parameters are estimated.

Predicting phenotypes. Predictions of yet-to-be observed phenotypes (for example, assessment of genetic risk of new patients) are commonly derived in two steps. First, the model is fitted to a reference sample (or training sample); this yields estimates of model unknowns, $\hat{\theta}$. Once an estimate of the unknowns is available, prediction of genetic predisposition of individuals with yet-to-be observed phenotypes is performed by evaluating the genetic function with parameters replaced by estimates, that is, $\hat{g}_i = g(x_i, \hat{\theta})$.

These methods have been commonly applied to continuous traits (for example, body weight). BOX 2 shows how this methodology could be applied to a human disease trait and BOX 3 gives an example drawn from the animal-breeding literature [21], in which WGP methods are compared with family-based predictions. As mentioned above, WGP can be implemented using different statistical models and estimation techniques. An overview of the most commonly used is provided next.

Linear models. Meuwissen, Hayes and Goddard [14] pioneered the use of WGP. They suggested incorporating dense markers into statistical models using a very simple idea: regress phenotypes on marker genotypes using a linear regression model, that is,

$$g(x_i, \beta) = \sum_{j=1}^{p} x_{ij}\beta_j$$

The genetic model becomes:

$$y_i = \sum_{j=1}^{p} x_{ij}\beta_j + \varepsilon_i \qquad \text{[model 1]}$$

In an additive model, $x_{ij}$ represents the number of copies of a diallelic marker (for example, a SNP), that is, $x_{ij} \in \{0,1,2\}$, and $\beta_j$ is the additive effect of the allele coded as one at the jth marker. The predicted genetic

Box 1 | Quantitative genetic concepts: from heritability to prediction

In a standard quantitative genetic model [8], a continuous phenotype ($y_i$; $i = 1,\dots,n$) is expressed as $y_i = g_i + \varepsilon_i$, in which $g_i$ is a genetic value and $\varepsilon_i$ is a non-genetic component.

Variance components
When genetic values and model residuals are uncorrelated, the phenotypic variance $\mathrm{Var}(y_i) = \sigma^2_p$ can be decomposed as $\sigma^2_p = \sigma^2_g + \sigma^2_\varepsilon$, in which $\sigma^2_g$ is the genetic variance and $\sigma^2_\varepsilon$ is the variance due to non-genetic factors. Genetic values can be further decomposed into additive ($a_i$), dominance ($d_i$) and epistatic ($\zeta_i$) components as $g_i = a_i + d_i + \zeta_i$. Under the conditions described elsewhere [55,56], these components are uncorrelated; therefore, $\sigma^2_g = \sigma^2_a + \sigma^2_d + \sigma^2_\zeta$, in which $\sigma^2_a = \mathrm{Var}(a_i)$, $\sigma^2_d = \mathrm{Var}(d_i)$ and $\sigma^2_\zeta = \mathrm{Var}(\zeta_i)$ are genetic variance components due to additive, dominance and epistatic effects, respectively.

Broad-sense heritability is the proportion of the phenotypic variance that can be attributed to genetic factors, that is,

$$H^2 = \frac{\sigma^2_g}{\sigma^2_g + \sigma^2_\varepsilon}$$

Narrow-sense heritability is the proportion of phenotypic variance that can be attributed to additive genetic effects, that is,

$$h^2 = \frac{\sigma^2_a}{\sigma^2_g + \sigma^2_\varepsilon}$$

Resemblance between relatives
This can be quantified through the expected correlation between genetic values of related individuals, $\mathrm{Cor}(g_i, g_{i'})$. For example, Sewall Wright's method of path coefficients can be used to evaluate the expected degree of resemblance due to additive effects over complex pedigrees, and Cockerham [55] and Kempthorne [56] developed a complementary theory for describing the resemblance between relatives due to dominance and diverse forms of epistasis.

Pedigree-based predictions
The resemblance between relatives can be used to predict genetic values using phenotypic and pedigree information. Building upon ideas from Fisher [10] and Wright [11], Henderson [12,13] developed statistical methods for predicting genetic values for infinitesimal traits. In matrix notation, the genetic model is $y = g + \varepsilon$, in which $y = \{y_i\}$, $g = \{g_i\}$ and $\varepsilon = \{\varepsilon_i\}$ are vectors of phenotypes, genetic values and model residuals, respectively. The (co)variance matrices of genetic values and model residuals are denoted $\mathrm{Cov}\{g_i, g_{i'}\} = G$ and $\mathrm{Cov}\{\varepsilon_i, \varepsilon_{i'}\} = R$, respectively. In pedigree-based models, $G = G_0\sigma^2_g$, in which $G_0$ contains pedigree-derived (co)variances describing the resemblance between relatives and $\sigma^2_g$ is a variance parameter to be estimated from data. Under multivariate normality, the Best Linear Unbiased Predictor (BLUP) [12] of g given y is $E[g \mid y] = G[G + R]^{-1}y = Hy$, in which $H = G[G + R]^{-1}$ is a matrix generalization of heritability.

Whole-genome marker-enabled prediction
Marker-based prediction models can be obtained by simply replacing $G_0$ in the BLUP equations with a marker-based relationship matrix; examples of this are found in REFS 52,57–60. Another approach consists of describing genetic values as functions of marker genotypes; these methods are further discussed in the main text and in BOX 2. Hence, the modern availability of genome-wide SNP data connects the long-standing and well-developed domain of the study of the resemblance between relatives as a function of pedigree relations with the study of associations of genetic markers with phenotypes into a single unified field.
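The BLUP expression in Box 1, $E[g \mid y] = G[G + R]^{-1}y$, can be illustrated with a minimal sketch. Everything numeric here is an assumption made for illustration: two full sibs with an additive relationship of 0.5, $\sigma^2_g = \sigma^2_\varepsilon = 0.5$ (so heritability is 0.5), $R = \sigma^2_\varepsilon I$, and phenotypes already centred. The helper names (`inv2`, `matvec`, `blup`) are hypothetical, and the closed-form inverse restricts the sketch to two individuals.

```python
def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def matvec(m, v):
    """Matrix-vector product for nested lists."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


def blup(G0, var_g, var_e, y):
    """BLUP of genetic values, G (G + R)^{-1} y, with R = var_e * I.

    Toy version for exactly two individuals (uses a 2x2 inverse).
    """
    n = len(y)
    G = [[var_g * G0[i][j] for j in range(n)] for i in range(n)]
    V = [[G[i][j] + (var_e if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    return matvec(G, matvec(inv2(V), y))


# Assumed toy inputs: two full sibs, centred phenotypes.
G0 = [[1.0, 0.5], [0.5, 1.0]]   # pedigree-derived relationships
var_g, var_e = 0.5, 0.5         # genetic and residual variances
y = [1.0, 0.0]                  # phenotypic deviations from the mean

H2 = var_g / (var_g + var_e)    # broad-sense heritability, 0.5 here
g_hat = blup(G0, var_g, var_e, y)
print(H2, g_hat)                # g_hat is approximately [0.467, 0.133]
```

In this toy case the second sib has an average phenotype but still receives a positive predicted genetic value, because the prediction borrows information from the relative's phenotype through $G_0$.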


value of an individual whose genotype is $x_i = \{x_{ij}\}$ is obtained by multiplying marker genotype codes by estimated marker effects $\{\hat{\beta}_j\}$ and summing across markers; that is,

$$\hat{g}_i = \sum_{j=1}^{p} x_{ij}\hat{\beta}_j$$

The example provided in BOX 3 shows an application of a linear model for prediction of genetic values of Holstein sires.

Penalized estimation methods. Using current genotyping technologies, the number of markers (p) typically exceeds the number of individuals in the data set (n), and the estimation of marker effects through ordinary least squares (OLS) is not feasible. Instead, penalized estimation and Bayesian estimation methods are commonly used; these overcome infeasibility, reduce the mean-squared error of estimates and may prevent over-fitting. Penalized estimates are obtained as the solution to an optimization problem, the objective function of which embeds a compromise between a measure of goodness of fit (for example, a residual sum of squares) and a measure of model complexity or penalty component. In linear models (for example, model 1), the penalty component is usually a function of marker effects, for example, the sum of squares of regression coefficients. Relative to estimates that are obtained by optimization of a goodness-of-fit measure alone (for example, OLS or maximum likelihood), penalized estimates are shrunk towards zero; this introduces bias but reduces the variance of estimates, yielding a smaller mean-squared error. For a given number of markers, the bias and variance of estimates decrease with increasing sample size, and therefore so does the mean-squared error of estimates.

Several penalized estimation methods (for example, ridge regression (RR) [22], Least Absolute Shrinkage and Selection Operator (LASSO) [23] and Elastic Net [24]) are available; they differ according to the penalty function used and consequently in the type of shrinkage of estimates. Penalized estimation methodology is an active area of statistical research and new methods are rapidly emerging (for an overview, see REF. 27).

Bayesian estimation methods. These approaches offer an alternative way of obtaining shrinkage estimates of marker effects. Indeed, for most penalized estimates (for example, RR [22], LASSO [23] and Elastic Net [24]), there is an equivalent Bayesian estimate. In Bayesian models, shrinkage of estimates of effects is controlled by the prior distribution that was assigned to marker effects. Different types of priors induce different types of shrinkage of estimates of effects. The Gaussian prior yields estimates equivalent to those obtained with RR, with an extent of shrinkage that is homogeneous across markers. This type of shrinkage may not be appropriate if some markers are linked to QTLs whereas others are located in regions of the genome that are not associated with genetic variances. However, using a scaled-t or a double exponential prior, as in the Bayes A model [14] and the Bayesian LASSO of Park and Casella [19,25], respectively, yields marker-specific shrinkage of effect estimates.

The Bayesian connection is useful in many respects: first, non-continuous and censored phenotypes can be dealt with easily as missing data problems; second, unlike penalized-estimation methods, Bayesian models provide measures of uncertainty about estimates and predictions; and third, regularization parameters can be dealt with by assigning an appropriate prior to these unknowns.

Semi-parametric models. The linear model 1 accounts for additive effects, but genetic predisposition may involve non-additive

Box 2 | Applying whole-genome prediction methods to human diseases

In whole-genome marker-enabled prediction (WGP) models, genetic values $\{g_i\}$ are described as a function of marker genotypes and unknown parameters, that is, $g_i = g(x_i, \theta)$, in which $x_i = (x_{i1},\dots,x_{ip})'$ is a vector of marker genotypes, $\theta$ is a vector of parameters (for example, marker effects) and $g(x_i, \theta)$ is a function mapping from genotypes and parameter values onto genetic values. In a standard marker-based quantitative genetic model, a continuous phenotype ($y_i$, for example, body weight) is expressed as $y_i = g(x_i, \theta) + \varepsilon_i$, in which $\varepsilon_i$ represents non-genetic factors. Most of the literature on WGP methods focuses on these types of traits. The figure shows the steps in applying a WGP model to human diseases.

[Figure: steps in applying a WGP model to human diseases. Genetic model for a disease trait: $p(d_i = 1) = \eta\{g(x_i, \theta)\}$, relating a disease indicator, through a link function (for example, probit or logit), to genetic values expressed as a function of marker genotypes $x_i$; $g(x_i, \theta)$ may be a linear model, $g(x_i, \beta) = \sum_{j=1}^{p} x_{ij}\beta_j$, or semi-parametric (reproducing kernel Hilbert spaces methods, neural networks). Training the model: data (phenotypes and genotypes, $(y_i; x_i)$) enter penalized estimation or Bayesian inference, giving the estimate $\hat{\theta} \rightarrow g(\cdot, \hat{\theta})$. Prediction of disease risk: $p(d_{i'} = 1) = \eta\{g(x_{i'}, \hat{\theta})\}$, the probability of disease of a new patient, given the genotype of the new patient and the parameter estimate derived from a training data set.]

Genetic model for disease traits
Binary traits such as disease status can be related with genetic values through a link function $\eta\{\cdot\}$, which maps from continuous genetic values onto probabilities of disease occurrence. The probit and logit links are two common choices for binary traits. The regression becomes $p(d_i = 1) = \eta\{g(x_i, \theta)\}$, in which $d_i = 1$ ($d_i = 0$) indicates the presence (or absence) of disease. The genetic function can be specified using parametric or semi-parametric methods, and an overview of some of these methods is given in the main text.

Training the model
The data required for training this model consist of a sample of individuals with genotypic and disease status information, that is, $S_{\mathrm{TRN}} = \{d_i, x_i\}_{i=1}^{N_{\mathrm{TRN}}}$. Usually, the number of unknowns in the model vastly exceeds the number of individuals, and parameters are estimated using some form of penalized or Bayesian estimation procedure, some of which are discussed in the main text.

Prediction of disease risk
The estimation procedure yields estimates of the unknown parameters, $\hat{\theta}$; these estimates can be used to predict the genetic predisposition of new patients. After replacing parameters with their estimates, the estimated probability of disease of a new patient is $\hat{p}(d_{i'} = 1 \mid x_{i'}, \hat{\theta}) = \eta\{g(x_{i'}, \hat{\theta})\}$.
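The penalized-estimation idea described in the main text (a residual sum of squares plus a sum-of-squares penalty on marker effects, for use when p exceeds n) can be sketched as follows. This is a minimal illustration, not the procedure of any cited study: the genotype codes, phenotypes and penalty value `lam` are made up, and the function names are hypothetical. The sketch uses the identity $(X'X + \lambda I)^{-1}X'y = X'(XX' + \lambda I)^{-1}y$, so that only an n-by-n system has to be solved.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def ridge_effects(X, y, lam):
    """Ridge estimates of marker effects: beta = X' (X X' + lam I)^{-1} y."""
    n, p = len(X), len(X[0])
    K = [[sum(X[i][k] * X[j][k] for k in range(p)) + (lam if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    alpha = solve(K, y)
    return [sum(X[i][j] * alpha[i] for i in range(n)) for j in range(p)]


def predict(X, beta):
    """Predicted genetic values: sum over markers of x_ij * beta_j."""
    return [sum(xij * bj for xij, bj in zip(row, beta)) for row in X]


# Assumed toy data: n = 3 individuals, p = 5 markers coded 0/1/2,
# so X'X is singular and OLS has no unique solution.
X = [[0, 1, 2, 1, 0],
     [1, 1, 0, 2, 1],
     [2, 0, 1, 0, 1]]
y = [1.2, -0.4, -0.8]            # centred phenotypes (made up)
lam = 1.0                        # penalty weight (arbitrary choice)

beta_hat = ridge_effects(X, y, lam)
g_new = predict([[1, 0, 2, 1, 1]], beta_hat)[0]  # a hypothetical new genotype
print(beta_hat, g_new)
```

In practice `lam` would be chosen by cross-validation or, in the Bayesian formulation, inferred from the data; a larger value shrinks all marker effects more strongly towards zero.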


actions such as dominance or epistasis. In principle, the linear model can be extended to accommodate these effects. However, with large p, including all possible interactions is computationally feasible only to a limited extent. An alternative is to use semi-parametric methods, such as reproducing kernel Hilbert spaces (RKHS) regressions [26] or neural networks [27].

Gianola et al. [28,29] suggested using RKHS methods for WGP. Unlike parametric regression models, in which the genetic function is explicitly defined, in RKHS regressions a collection of real-valued functions is implicitly defined by choosing a 'reproducing kernel', $K(x_i, x_{i'})$ (REF. 26). This function maps from pairs of genotypes $(x_i, x_{i'})$ into a real number and must be positive semi-definite [26]. From a Bayesian perspective [30–32], the reproducing kernel defines the a priori correlations between evaluations of the function (that is, genetic values) at pairs of genotypes, $\mathrm{Cor}[g(x_i), g(x_{i'})]$. The choice of kernel is the central element of model specification. Some parametric models (for example, ridge regression) can be represented as RKHS regressions [32–34]. Alternatively, kernels can be chosen to maximize the performance of the model (for example, predictive ability). To this end, one can develop algorithms that evaluate a wide variety of kernels and pick one that is optimal according to some model selection criterion (for example, a measure of predictive ability). Overviews of how this can be implemented are given in REFS 32–34.

In linear models and in RKHS, the basis functions used to regress phenotypes on markers are defined a priori and this imposes some constraints on the types of patterns that these methods can capture. In neural networks, the basis functions used to regress phenotypes are inferred from the data and this gives neural networks great flexibility. This generality comes with a price: the interpretation of parameter estimates is not straightforward and over-fitting may occur [27]. Pre-selection of markers and use of penalized or Bayesian estimation methods are ways of confronting over-fitting.

Evidence for the usefulness of WGP

Factors that affect the accuracy of WGP. The usefulness of WGP methods in the context of preventive and personalized medicine will depend on how prevalent a disease is, the heritability of the trait and the accuracy with which genetic predisposition (that is, genetic values) can be inferred. Several simulation studies [14,15] in animal breeding indicate that these methods can yield accurate predictions

Box 3 | Whole-genome marker-enabled prediction: an example application

[Figure: predictive correlation (vertical axis, 0 to 0.7) versus number of SNPs (horizontal axis, 0 to 10,000), with horizontal reference lines for the full model (32,518 SNPs) and for PA (no SNP).]

Accurate estimates of the breeding values of dairy sires can be obtained by evaluating the performance of a large number of daughters of each sire (progeny testing). However, progeny testing is expensive and many years are required to collect such information. This delays breeding decisions and reduces the rate of genetic gain. The best pedigree-based predictor of the genetic value of newborn sires is the average estimated genetic value of the parents (PA). This is the conceptual equivalent of using family history in human applications [61].

Whole-genome marker-enabled prediction (WGP) offers an alternative method for predicting the genetic values of young sires. Unlike PA, WGP can account for genetic differences between individuals with equivalent pedigrees (that is, those due to sampling of genes at meiosis). Vazquez et al. [21] compared the performance of WGP with that of PA.

Data
The data consisted of 4,608 Holstein sires genotyped using the Illumina BovineSNP50 Bead Chip. Sires born before 1999 (n = 2,821) were used to train the models and those born between 1999 and 2003 (n = 893) were used for validation. The target of prediction was sires' predicted transmitting ability for US-Holstein Net Merit Index (Net Merit PTA), a highly accurate estimate of the sire's ability to produce valuable offspring.

Models
Predictions were obtained using a linear regression

$$y_i = \mu + PA_i\,\beta_{PA} + \sum_{j=1}^{p} x_{ij}\beta_j + \varepsilon_i$$

in which $y_i$ is the Net Merit PTA of the ith sire, $\mu$ is an effect common to all sires, $PA_i$ is the average Net Merit PTA of the parents of sire i, $\beta_{PA}$ is a regression coefficient, $x_{ij} \in \{0,1,2\}$ are counts of the number of copies of one of the alleles of the jth SNP, $\beta_j$ is the additive effect of the same SNP and $\varepsilon_i$ is a model residual. Models differed on how many SNPs (from zero to 32,518 SNPs) were included, how the SNPs were selected (here we present results for evenly spaced SNPs only) and on whether the regression on PA was included. Regression coefficients were estimated using the sires in the training set and the predictive ability was evaluated by the correlation between the Net Merit PTA and WGP in the validation sample.

Results
The figure above, produced with data from REF. 21, gives the correlation between the WGP of Net Merit PTA and progeny test Net Merit PTA versus the number of SNPs in models with (blue dots) and without (red dots) the PA. The horizontal lines give the predictive correlation of PA (no SNPs) and of a model including 32,518 SNPs. The PA alone (p = 0) yielded a predictive correlation of 0.41, whereas WGP including all available markers (with or without PA) reached a predictive correlation of 0.65. The predictive ability increased monotonically with the number of markers, and the difference between the correlations obtained with and without PA decreased as the number of markers in the model increased. This occurs because, in infinitesimal traits, as the number of markers increases so does the proportion of genetic variance at quantitative trait loci that can be explained by markers [41].


of genetic values. For instance, Meuwissen et al. [14] reported a correlation between estimated and true genetic values as high as 0.85 for a trait with 0.5 heritability.

Empirical evidence has partially confirmed these expectations. The most extensive empirical evaluations are found in dairy cattle [17,18,20,21], but these methods have also been evaluated in several traits and breeds of beef cattle [35], broilers [16,36], wheat [19,37], maize [37,38] and mice [19,39]. Overall, empirical evaluations have demonstrated the superiority of WGP of genetic values over pedigree-based prediction. However, gains in accuracy are smaller than those anticipated by simulation studies [7,40].

In the class of linear models (for example, model 1), theory and empirical evidence suggest that the accuracy of estimates of genetic values depends mainly on two factors [41]: the proportion of the genetic variance at QTLs explained by markers (due to LD with QTLs) and the accuracy of estimates of marker effects. Extended LD and a large number of markers increase the proportion of genetic variance at QTLs that can be accounted for by markers. The larger the data set used to train the model and the higher the heritability of the trait, the higher the accuracy of estimates of marker effects.

The choice of model can also affect the accuracy of estimates of genetic values. The literature in this respect mostly focuses on comparing, in the context of a linear model, different shrinkage methods. Simulation studies [14,15] suggest the superiority of models using marker-specific shrinkage of estimates of effects (for example, Bayes A [14], Bayes B [14] and Bayesian LASSO [25]) over those performing across-markers homogeneous shrinkage of estimates (for example, RR [22] or, in a Bayesian context, a linear model with a Gaussian prior for marker effects). However, a few simulation studies [42] did not confirm this and, more importantly, empirical evidence suggests only small differences between different shrinkage methods [40].

WGP in humans

Potential impact for individual and public health. The potential use of WGP will depend on the prevalence of the disease, the relevance of genetic predisposition (that is, the heritability of the trait), on how accurately genetic values can be inferred from a training sample and on practical features, such as the costs of treatment and disease.

To illustrate this concept, consider a disease for which the incidence is 10%, and assume that an effective preventive measure (for example, vaccination) exists. However, when applied, this measure induces a negative outcome that is as bad as the event it is intended to prevent in 10% of individuals receiving it. In the absence of information about which factors affect predisposition and ignoring the monetary cost of prevention, it is equally good or bad to apply the prophylaxis to everyone or no one.

How useful would WGP be in this case? A genetic model for the above-mentioned disease can be obtained using the type of generalized linear models described in BOX 2. To illustrate, we consider a probit model [43] (for details of this model, see Supplementary information S1 (box)). Briefly, in the probit model, disease is assumed to occur if an unobservable continuous phenotype, liability to disease ($y_i$), exceeds a certain threshold $\tau$. The regression model for liability to disease takes the form $y_i = g(x_i, \theta) + \varepsilon_i$, in which $g(x_i, \theta)$ is a genetic component and $\varepsilon_i$ represents non-genetic factors. Now, assume that the correlation between the inferred genetic signal $g(x_{i'}, \hat{\theta})$ and true liability $y_i$ is 0.5. How valuable might this correlation be? In this case, if we apply the preventive procedure only to those with $\hat{p}(d_{i'} = 1 \mid x_{i'}, \hat{\theta}) > 0.2$, approximately one-third of the population will be treated. This is expected to reduce disease incidence from 10% to about 3.1% and the proportion of the population who experience either of the negative events (disease or iatrogenic) from 10% to approximately 6.4% (for details, see Supplementary information S1 (box)).
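The bookkeeping behind these figures can be checked with a few lines of arithmetic. The sketch below only recombines numbers quoted in the text (10% incidence, 10% iatrogenic rate, about one-third treated, 3.1% residual incidence); it does not reproduce the probit calculation of the Supplementary information. Two points are assumptions: the threshold `tau` supposes that liability follows a standard normal distribution, and the "treat everyone" line supposes that prevention is fully effective in those treated.

```python
from statistics import NormalDist

incidence = 0.10          # baseline disease incidence (from the text)
iatrogenic_rate = 0.10    # harm rate among treated individuals (from the text)
treated_fraction = 1 / 3  # "approximately one-third" treated (from the text)
incidence_after = 0.031   # residual disease incidence (from the text)

# Liability threshold implied by a 10% incidence, assuming standard
# normal liability (an assumption of this sketch).
tau = NormalDist().inv_cdf(1 - incidence)

# Negative events without genetic information.
treat_none = incidence          # nobody treated: 10% develop disease
treat_all = iatrogenic_rate     # everybody treated: 10% harmed (assumes
                                # prevention works in everyone treated)

# Negative events when only the predicted high-risk third is treated.
iatrogenic_events = treated_fraction * iatrogenic_rate
negative_after = incidence_after + iatrogenic_events
relative_reduction = 1 - negative_after / incidence

print(round(tau, 3))                 # about 1.282
print(round(negative_after, 3))      # about 0.064
print(round(relative_reduction, 2))  # about 0.36
```

The last two values match the approximately 6.4% of negative events quoted above and the corresponding relative reduction of about 36%.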

Glossary

Bayesian estimation
Bayesian inferences are based on the posterior distribution of the unknowns given the data. Following Bayes' rule, this distribution is proportional to the product of the distribution of the data given the unknowns times the prior distribution of the unknowns.

Basis function
In regression analysis, basis functions are functions of predictors used to construct the regression. Polynomials, exponentials and logarithms are examples of basis functions commonly used for parametric regressions.

Censored phenotype
Censoring occurs when, for some individuals, the phenotypic information consists of bounds but the actual phenotypic value is unknown. This is commonly observed in longevity studies when, at the time of analysis, some patients may still be alive.

Genomic medicine
The use of genome information in the prevention, diagnosis and treatment of disorders.

Goodness of fit
A measure of how well a model fits the data in a training sample. The log likelihood and R-squared statistic are commonly used measures of goodness of fit. The residual sum of squares is a commonly used measure of lack of fit.

LASSO
The Least Absolute Shrinkage and Selection Operator [23] is a penalized estimation method commonly used in regression. The penalty function in LASSO is the sum of the absolute values of the regression coefficients. LASSO performs variable selection and shrinkage simultaneously.

Objective function
The function whose value is minimized or maximized in an optimization problem.

Ordinary least squares
The ordinary least squares estimates of parameters in a regression model are obtained by minimizing the residual sum of squares of the regression.

Over-fitting
A term used to describe the situation in which a model fits the training data well but fails to perform well when used to predict outcomes of a collection of subjects (testing data) that was not used to fit the model.

Parametric regression model
A regression model in which the regression function is set to have a known functional form (for example, a polynomial).

Penalized estimation
Penalized estimates are commonly used in situations in which the number of unknowns is large with respect to the number of records. Penalized estimates are obtained by solving an optimization problem whose objective function embeds a compromise between a goodness-of-fit measure and a measure of model complexity or penalty function.

Quantitative genetic theory
Genetic, mathematical and statistical models used to study traits that are affected by a large number of genes.

Regression model
A statistical model used to describe relationships (for example, a conditional mean) between a response variable and a set of predictors through a regression function involving some parameter(s) to be estimated from data.

Semi-parametric regression model
A regression model in which the regression function is not assumed to be a member of a parametric family.

Shrinkage
In standard estimation methods (for example, maximum likelihood or OLS) estimates are obtained by optimizing with respect to a goodness-of-fit or lack-of-fit measure. Relative to these estimates, Bayesian and penalized estimates are shrunk towards some values (typically zero). This prevents over-fitting and, under certain conditions, may reduce mean-squared error of estimates and predictions.

Training data
The data set used to fit a model.
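The difference between the two kinds of shrinkage defined in the glossary can be seen in a small worked example. The sketch assumes the special case of orthonormal predictors ($X'X = I$), where both estimators have closed forms in terms of the OLS estimate $b$: with the objective $\tfrac{1}{2}\lVert y - X\beta \rVert^2 + \tfrac{\lambda}{2}\sum_j \beta_j^2$ the ridge estimate is $b/(1+\lambda)$, and with $\tfrac{1}{2}\lVert y - X\beta \rVert^2 + \lambda\sum_j \lvert\beta_j\rvert$ the LASSO estimate is the soft-thresholded value. The OLS values and `lam` are invented, and genome-wide marker data are not orthonormal, so this only illustrates the qualitative behaviour.

```python
def ridge_shrink(b_ols, lam):
    """Ridge estimate under orthonormal predictors: proportional shrinkage."""
    return b_ols / (1.0 + lam)


def lasso_shrink(b_ols, lam):
    """LASSO estimate under orthonormal predictors: soft-thresholding."""
    if b_ols > lam:
        return b_ols - lam
    if b_ols < -lam:
        return b_ols + lam
    return 0.0          # small effects are set exactly to zero


ols = [2.0, 0.6, -0.3, -1.5]   # made-up OLS estimates of four effects
lam = 0.5                      # arbitrary penalty weight

ridge = [ridge_shrink(b, lam) for b in ols]
lasso = [lasso_shrink(b, lam) for b in ols]

print(ridge)   # every effect scaled by 1 / 1.5, none exactly zero
print(lasso)   # every effect moved 0.5 towards zero; the third becomes 0.0
```

Ridge shrinks all effects by the same proportion, which is the homogeneous shrinkage described in the text, whereas LASSO removes small effects entirely, which is why it is said to perform variable selection and shrinkage simultaneously.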

884 | DECEMBER 2010 | VOLUME 11 www.nature.com/reviews/genetics © 2010 Macmillan Publishers Limited. All rights reserved PERSPECTIVES

Therefore, WGP can turn a useless procedure into one that contributes to improved public health by reducing the incidence of negative events by 36%.

Opportunities. Accurate predictions of genetic predisposition to human diseases should be useful for preventive and personalized medicine. Applications of marker-enabled prediction of genetic values in humans with SNP data have already occurred and can be seen as progressively moving along a scale from the simplest to the more sophisticated. For example, Holzapfel et al.44 used three SNPs in the fat mass and obesity associated (FTO) gene, which is more strongly associated with body mass index (BMI) than any other gene currently identified, and found that these SNPs only account for about 0.006% of the variance in BMI. Similarly, small sets of (3 to 18) SNPs have generally explained little variation in susceptibility to traits such as Alzheimer's disease, pigmentation, BMI and diabetes45–49.

Prediction models using larger numbers of SNPs have become more common. In most cases, these are built using a subset of SNPs, usually preselected based on results from SMR50, or by combining results from SMR into risk scores51. In spirit, this approach is similar to that of WGP; however, a main difference is that in SMR the association between phenotypes and markers is assessed one marker at a time, whereas in WGP the effects of all markers are jointly inferred. More recently, Visscher and colleagues have advanced the use of WGP methods in humans by regressing height on thousands of SNPs simultaneously52, thereby presaging the type of methods we offer here. Their results are encouraging and suggest that WGP methods can account for a much larger percentage of the expected heritability of the trait than that accounted for by models based on a small number of preselected SNPs.

Challenges. Application of WGP methods to human data will pose challenges. Typically, the computational requirements are much higher than those of SMR; however, several algorithms and software packages are available. The feasibility of these techniques has also been shown by several applications in plants19,37, animals17–21,36,39 and humans, as just mentioned52.

Relative to agricultural […], predicting genetic values in humans may be more challenging because the extent of LD in human populations is smaller than that observed in agricultural species, which have a long and intensive history of selection7. However, the number of available markers in humans is considerably larger and this may increase the accuracy with which genetic values can be inferred.

Most applications encountered in the literature on WGP deal with continuous uncensored traits. Therefore, further developments are needed to extend these methods to non-continuous and censored outcomes, as these are commonly encountered in health-related traits. Extending the type of WGP methods discussed here to binary or ordered outcomes is straightforward (BOX 2). Also, in Bayesian models, censoring can be dealt with easily as a missing data problem. However, incorporating dense molecular markers into semi-parametric survival models such as penalized Cox regressions53, although theoretically feasible, is expected to be computationally challenging, especially with p >> n.

The generalization properties of WGP remain an open question. Can a model trained using individuals of European descent be used to predict genetic predisposition among patients of African descent? Almost all empirical evidence in animal breeding comes from within-breed prediction evaluations, and the ability of these models to predict genetic values in distantly related individuals is not well known.

Finally, developing statistical methods that can capture (and exploit for prediction) complex interactions among genes and between genes and observable environmental factors will certainly prove challenging. However, even for traits that are affected by complex interactions, additive models may prove useful from a predictive standpoint5; we should remember that "…all models are wrong; the practical question is how wrong do they have to be not to be useful" (REF. 54).

Conclusions and recommendations
Our field has invested heavily in generating genotypic and phenotypic data for GWA studies of health-related traits in humans, and information systems have been developed (for example, the database of Genotypes and Phenotypes (dbGaP)) to deposit, maintain and distribute data. These data have been analysed primarily from a single perspective: that of detecting the individual variants associated with disease risk. This methodology is clearly unsatisfactory for traits affected by a large number of genes. The class of WGP methods commonly used in animal breeding is particularly appropriate for dealing with this type of trait. We conjecture that relatively small investments directed at analysing the available information from a different perspective may yield important progress in an area that has so far proven elusive.

Gustavo de los Campos is at the Section on Statistical Genetics, Biostatistics, University of Alabama at Birmingham, 1665 University Boulevard, Alabama 35294, USA.
Daniel Gianola is at the Departments of Animal Sciences, Dairy Science and Biostatistics and Medical Informatics, University of Wisconsin-Madison, 1675 Observatory Dr., Wisconsin 53706, USA.
David B. Allison is at the Section on Statistical Genetics, Biostatistics, University of Alabama at Birmingham, 1665 University Boulevard, Alabama 35294, USA.
e-mails: [email protected]; [email protected]; [email protected]
doi:10.1038/nrg2898
Published online 3 November 2010

1. Guttmacher, A. E. & Collins, F. S. Genomic medicine – a primer. N. Engl. J. Med. 347, 1512–1520 (2002).
2. Dominiczak, A. F. & McBride, M. W. Genetics of common polygenic stroke. Nature Genet. 35, 116–117 (2003).
3. Maher, B. Personal genomes: the case of the missing heritability. Nature 456, 18–21 (2008).
4. Manolio, T. A. et al. Finding the missing heritability of complex diseases. Nature 461, 747–753 (2009).
5. Hill, W. G., Goddard, M. E. & Visscher, P. M. Data and theory point to mainly additive genetic variance for complex traits. PLoS Genet. 4, e1000008 (2008).
6. Lander, E. S. & Schork, N. J. Genetic dissection of complex traits. Science 265, 2037–2048 (1994).
7. Goddard, M. E. & Hayes, B. J. Mapping genes for complex traits in domestic animals and their use in breeding programmes. Nature Rev. Genet. 10, 381–391 (2009).
8. Falconer, D. S. & Mackay, T. F. C. Introduction to Quantitative Genetics 4th edn (Longman, Harlow, UK, 1996).
9. Hill, W. G. Understanding and using quantitative genetic variation. Philos. Trans. R. Soc. Lond. B 365, 73–85 (2010).
10. Fisher, R. The correlation between relatives on the supposition of Mendelian inheritance. Trans. R. Soc. Edinb. Earth Sci. 52, 399–433 (1918).
11. Wright, S. Systems of mating. Parts I.–V. Genetics 6, 111–178 (1921).
12. Henderson, C. R. Estimation of genetic parameters. Ann. Math. Stat. 21, 309–310 (1950).
13. Henderson, C. R. Best linear unbiased estimation and prediction under a selection model. Biometrics 31, 423–447 (1975).
14. Meuwissen, T. H., Hayes, B. J. & Goddard, M. E. Prediction of total genetic values using genome-wide dense marker maps. Genetics 157, 1819–1829 (2001).
15. Habier, D., Fernando, R. L. & Dekkers, J. C. M. The impact of genetic relationships information on genome-assisted breeding values. Genetics 177, 2389–2397 (2007).
16. González-Recio, O. et al. Non-parametric methods for incorporating genomic information into genetic evaluations: an application to mortality in broilers. Genetics 178, 2305–2313 (2008).
17. VanRaden, P. M. et al. Reliability of genomic predictions for North American Holstein bulls. J. Dairy Sci. 92, 16–24 (2009).
18. Hayes, B. J., Bowman, P. J., Chamberlain, A. J. & Goddard, M. E. Genomic selection in dairy cattle: progress and challenges. J. Dairy Sci. 92, 433–443 (2009).
19. de los Campos, G. et al. Predicting quantitative traits with regression models for dense molecular markers and pedigrees. Genetics 182, 375–385 (2009).
20. Weigel, K. A. et al. Predictive ability of direct genomic values for lifetime net merit of Holstein sires using selected subsets of single nucleotide polymorphism markers. J. Dairy Sci. 92, 5248–5257 (2009).


21. Vazquez, A. et al. Predictive ability of subsets of SNP with and without parent average in US Holsteins. J. Dairy Sci. 2010 (doi:10.3168/jds.2010–3335).
22. Hoerl, A. E. & Kennard, R. W. Ridge regression: biased estimation for non-orthogonal problems. Technometrics 12, 55–67 (1970).
23. Tibshirani, R. Regression shrinkage and selection via the LASSO. J. R. Stat. Soc. Series B 58, 267–288 (1996).
24. Zou, H. & Hastie, T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. Series B 67, 301–320 (2005).
25. Park, T. & Casella, G. The Bayesian LASSO. J. Am. Stat. Assoc. 103, 681–686 (2008).
26. Wahba, G. Spline Models for Observational Data (Society for Industrial and Applied Mathematics, Philadelphia, 1990).
27. Hastie, T., Tibshirani, R. & Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction 2nd edn (Springer-Verlag, New York, 2009).
28. Gianola, D., Fernando, R. L. & Stella, A. Genomic-assisted prediction of genetic value with semiparametric procedures. Genetics 173, 1761–1776 (2006).
29. Gianola, D. & van Kaam, J. B. Reproducing kernel Hilbert spaces regression methods for genomic assisted prediction of quantitative traits. Genetics 178, 2289–2303 (2008).
30. Kimeldorf, G. S. & Wahba, G. A correspondence between Bayesian estimation on stochastic process and smoothing by splines. Ann. Math. Stat. 41, 495–502 (1970).
31. de los Campos, G., Gianola, D. & Rosa, G. J. M. Reproducing kernel Hilbert spaces regression: a general framework for genetic evaluation. J. Anim. Sci. 87, 1883–1887 (2009).
32. de los Campos, G., Gianola, D., Rosa, G. J. M., Weigel, K. & Crossa, J. Semi-parametric genomic-enabled prediction of genetic values using reproducing kernel Hilbert spaces regressions. Genetics Res. 92, 295–308 (2010).
33. Shawe-Taylor, J. & Cristianini, N. Kernel Methods for Pattern Analysis (Cambridge Univ. Press, UK, 2004).
34. Schaid, D. J. Genomic similarity and kernel methods I: advancements by building on mathematical and statistical foundations. Hum. Hered. 70, 109–131 (2010).
35. Garrick, D. J. The nature, scope and impact of some whole-genome analyses in beef cattle in 9th World Congress on Genetics Applied to Livestock (Leipzig, Germany, 2010).
36. Long, N. et al. Radial basis function regression methods for predicting quantitative traits using SNP markers. Genetics Res. 92, 209–225 (2010).
37. Crossa, J. et al. Prediction of genetic values of quantitative traits in plant breeding using pedigree and molecular markers. Genetics 2 Sep 2010 (doi:10.1534/genetics.110.118521).
38. Piepho, H. P. Ridge regression and extensions for genomewide selection in maize. Crop Sci. 49, 1165–1176 (2009).
39. Legarra, A., Robert-Granié, C., Manfredi, E. & Elsen, J. M. Performance of genomic selection in mice. Genetics 180, 611–618 (2008).
40. Jannink, J. L., Lorenz, A. J. & Hiroyoshi, I. Genomic selection in plant breeding: from theory to practice. Brief. Funct. Genomics 9, 166–177 (2010).
41. Goddard, M. E. Genomic selection: prediction of accuracy and maximization of long term response. Genetica 136, 245–257 (2009).
42. Zhong, S., Dekkers, J. C., Fernando, R. L. & Jannink, J. L. Factors affecting accuracy from genomic selection in populations derived from multiple inbred lines: a barley case study. Genetics 182, 355–364 (2009).
43. Gianola, D. Theory and analysis of threshold characters. J. Anim. Sci. 54, 1079–1096 (1982).
44. Holzapfel, C. et al. Genes and lifestyle factors in obesity: results from 12462 subjects from MONICA/KORA. Int. J. Obes. 1–8 (2010).
45. Seshadri, S. et al. Genome-wide analysis of genetic loci associated with Alzheimer disease. JAMA 303, 1832–1840 (2010).
46. Valenzuela, R. K. et al. Predicting phenotype from genotype: normal pigmentation. J. Forensic Sci. Soc. 55, 315–322 (2010).
47. Willer, C. J. et al. Six new loci associated with body mass index highlight a neuronal influence on body weight regulation. Nature Genet. 41, 25–34 (2008).
48. Zhao, J. et al. The role of obesity-associated loci identified in genome-wide association studies in the determination of pediatric BMI. Obesity 17, 2254–2257 (2009).
49. van Hoek, M. et al. Predicting type 2 diabetes based on polymorphisms from genome-wide association studies: a population-based study. Diabetes 57, 3122–3128 (2008).
50. Wray, N. R., Goddard, M. E. & Visscher, P. M. Prediction of individual genetic risk to diseases from genome-wide association studies. Genome Res. 17, 1520–1528 (2007).
51. Purcell, S. M. et al. Common polygenic variation contributes to risk of schizophrenia and bipolar disorder. Nature 460, 748–752 (2009).
52. Yang, J. et al. Common SNPs explain a large proportion of the heritability for human height. Nature Genet. 42, 565–569 (2010).
53. Witten, D. M. & Tibshirani, R. Survival analysis with high-dimensional covariates. Stat. Methods Med. Res. 19, 29–51 (2010).
54. Box, G. E. P. & Draper, N. R. Empirical Model-Building and Response Surfaces (Wiley, New York, 1987).
55. Cockerham, C. C. An extension of the concept of partitioning hereditary variance for analysis of covariance among relatives when epistasis is present. Genetics 39, 859–882 (1954).
56. Kempthorne, O. The correlation between relatives in a random mating population. Proc. R. Soc. Lond. B 143, 103–113 (1954).
57. Lynch, M. & Ritland, K. Estimation of pairwise relatedness with molecular markers. Genetics 152, 1753–1766 (1999).
58. Eding, J. H. & Meuwissen, T. H. Marker based estimates of between and within population kinships for the conservation of genetic diversity. J. Anim. Breed. Genet. 118, 141–159 (2001).
59. Visscher, P. M. et al. Assumption-free estimation of heritability from genome-wide identity-by-descent sharing between full siblings. PLoS Genet. 2, e41 (2006).
60. Hayes, B. J. & Goddard, M. E. Prediction of breeding values using marker-derived relationship matrices. J. Anim. Sci. 86, 2089–2092 (2008).
61. Feng, R., McClure, L. A., Tiwari, H. K. & Howard, G. A new estimate of family disease history providing improved prediction of disease risks. Stat. Med. 28, 1269–1283 (2009).

Acknowledgements
We are grateful to K. Grimes, A. Vazquez, Y. Klimentidis and S. Cofield for their helpful comments on this paper.

Competing interests statement
The authors declare competing financial interests: see Web version for details.

FURTHER INFORMATION
Gustavo de los Campos's homepage: http://www.soph.uab.edu/ssg/people/campos
dbGaP: http://www.ncbi.nlm.nih.gov/gap
Nature Reviews Genetics series on Study designs: http://www.nature.com/nrg/series/studydesigns/index.html
Nature Reviews Genetics series on Modelling: http://www.nature.com/nrg/series/modelling/index.html
Nature Reviews Genetics series on Genome-wide association studies: http://www.nature.com/nrg/series/gwas/index.html

SUPPLEMENTARY INFORMATION
See online article: S1 (box)
