Quick viewing(Text Mode)

A Comparative Review of Dimension Reduction Methods in Approximate Bayesian Computation

A Comparative Review of Dimension Reduction Methods in Approximate Bayesian Computation

arXiv:1202.3819v3 [stat.ME] 11 Jun 2013 aee vector rameter eirdistribution terior .G .Bu,M .Nns .PageadS .Sisson A. S. and Prangle D. Nunes, A. M. Blum, B. G. M. Approximate Computation in Dimension Bayesian of Methods Review Reduction Comparative A ot ae,Sde 02 Australia. New 2052, of Sydney University Wales, Statistics, South Professor, and Associate Mathematics is of Sisson School A. S. 4YF, Kingdom. LA1 United Lancaster University, Fylde Lancaster is Department, College, Prangle Statistics D. and and Mathematics Associate Lecturer, Research Senior is Nunes 03 o.2,N.2 189–208 2, DOI: No. 28, Vol. 2013, Science Statistical -84,Fac e-mail: France Grenoble,F-38041, 5525, Scientifique, UMR Recherche TIMC-IMAG la Laboratoire de National Centre Fourier,

.G .Bu sRsac soit,Universit´e Joseph Associate, Research is Blum B. G. M. c ttsia Science article Statistical original the the of by reprint published electronic an is This ern iesfo h rgnli aiainand pagination in detail. original typographic the from differs reprint aeinifrnei yial oue ntepos- the on focused typically is inference Bayesian nttt fMteaia Statistics Mathematical of Institute 10.1214/12-STS406 uto,lklho-reifrne euaiain var regularization, inference, likelihood-free duction, phrases: and words Key sets. data th analysis and the illustrate models through ing We techniques procedure. reduction dimension regularization s sel these the a subset and as best criteria, a information regression Bayesian is and first Akaike on The based reduction. we se dimension addition, subset of In methods best regularization. sp and of are techniques o methods consisting projection methods The classes literature. principal exclusive compre ABC the a nonmutually the of provide in performance we proposed the article reduction of this In comparison qu information. and central of observ a loss the sets, imal from statistics data summary full low-dimensional than derive bas rather computations statistics, requires summary ABC of statis likelihoo implementation intractable summary practical computationally of observed problem the and come simulated between comparisons Abstract. .INTRODUCTION 1. θ nttt fMteaia Statistics Mathematical of Institute , ∈ 03 o.2,N.2 189–208 2, No. 28, Vol. 2013, p Θ ( θ ⊆ | y [email protected] obs R prxmt aeincmuain(B)mtosmk s of use make methods (ABC) computation Bayesian Approximate q 2013 , ) , ∝ q ≥ p ( y ,rpeetn the representing 1, obs | θ ) p ( prxmt aeincmuain ieso re- dimension computation, Bayesian Approximate θ fapa- a of ) .A. M. . This . in 1 rat frsac ra.Se o xml,Lopes example, for See, ( com- Beaumont areas. and and of research number analysis increasing of an the breadth in for models statistical tool plex popular a becoming n loihsta i oda ape rman likeli- from the samples hood, when draw distribution to posterior aim approximate that algorithms and computation t aafo h model, gener- the quickly from to data feasible ate is it where but intractable, by evddata served ieiod(oe)function, (model) likelihood ( uino h uldt set, data full the distri- of posterior bution the First, distribution. posterior the methods. ABC of application ( Fan and Sisson beliefs, prior one’s of updating 2010 B nrdcstopicplapoiain to approximations principal two introduces ABC p ( θ ,Bamn ( Beaumont ), | p s ( obs y obs ) y ∝ | al selection. iable obs θ AC eest aiyo models of family a to refers (ABC) ,i nvial rcomputationally or unavailable is ), p ( 2010 Y ∈ s 2011 obs ucin.A the As functions. d nrdc w new two introduce ddt ihmin- with data ed efrac of performance e 2010 ftrechalleng- three of h term The . eto methods, lection | ,Broel,BnzoadMona and Benazzo Bertorelle, ), sini o to how is estion cn ssridge uses econd θ do etr of vectors on ed o ata vriwo the of overview partial a for ) ) esv review hensive cinmethod ection p i nothree into lit ( ,Cileye l ( al. Csill´ery et ), ist over- to tics θ dimension f y p ,where ), ( ∼ θ | p y p prxmt Bayesian approximate ( obs ( ·| y θ obs p ,i approximated is ), .ACi rapidly is ABC ). ( s θ | obs θ ,truhthe through ), ,hvn ob- having ), = 2010 S ( y obs and ) is ) 2 BLUM, NUNES, PRANGLE AND SISSON a vector of summary statistics of lower dimension curse of dimensionality (e.g., Blum (2010a)), and so than the data yobs. In this manner, p(θ|sobs) ≈ keeping dim(s) ≤ dim(y) as small as possible helps p(θ|yobs) is a good approximation if sobs is highly in- to improve algorithmic efficiency. The second ap- formative for the model parameters, and p(θ|sobs)= proximation (1) allows the sampler weights (or ac- p(θ|yobs) if sobs is sufficient. As p(sobs|θ) is also likely ceptance probabilities, if one considers rejection-based to be computationally intractable if p(yobs|θ) is com- samplers, such as ) to be putationally intractable, a second approximation is free of intractable likelihood terms. constructed as pABC(θ|sobs)= p(θ,s|sobs) ds, with In practice, however, there is typically a trade- off between the two approximations: if the dimen- (1) p(θ,s|s ) ∝ K (ks − s k)p(s|θ)p(θ), obs ǫ obsR sion of s is large so that the first approximation, where Kǫ(kuk)= K(kuk/ǫ)/ǫ is a standard smooth- p(θ|sobs) ≈ p(θ|yobs), is good, the second approxi- ing kernel with scale parameter ǫ> 0. As a result of mation may then be poor due to the inefficiency of (1), approximating the target p(θ|sobs) by pABC(θ| kernel smoothing in large dimensions. Conversely, sobs) can be shown to be a good approximation if the if the dimension of s is small while the second ap- kernel scale parameter, ǫ, is small enough, following proximation (1) will be good (with a small kernel standard kernel density estimation arguments (e.g., scale parameter, ǫ), any loss of information in the Blum (2010a)). mapping sobs = S(yobs) means that the first approx- In combination, both approximations allow for prac- imation may be poor. Naturally, a low-dimensional tical methods of sampling from pABC(θ|sobs) that and near-sufficient statistic, s, would provide a near- avoid explicit evaluation of the intractable likelihood optimal and balanced choice. function, p(yobs|θ). A simple rejection-sampling al- For a given set of summary statistics, much work gorithm to achieve this was proposed by Pritchard has been done on deriving more efficient sampling et al. (1999) (see also Marjoram et al. (2003)), which algorithms to reduce the effect of the second ap- produces draws from p(θ,s|sobs). In general terms, proximation by allowing a smaller value for the ker- an importance-sampling version of this algorithm nel scale parameter, ǫ, which in turn improves the proceeds as follows: approximation pABC(θ|sobs) ≈ p(θ|sobs). The greater the algorithmic efficiency, the smaller the scale pa- (1) Draw a candidate parameter vector from the rameter that can be achieved for a given compu- prior, θ′ ∼ p(θ); tational burden. These algorithms include Markov (2) Draw summary statistics from the model s′ ∼ chain Monte Carlo (Marjoram et al. (2003); Bor- p(s|θ′); tot, Coles and Sisson (2007)) and sequential Monte (3) Assign to (θ′,s′) a weight, w′, that is propor- Carlo techniques (Sisson, Fan and Tanaka (2007); tional to K (ks′ − s k). ǫ obs Toni et al. (2009); Beaumont et al. (2009); Drovandi Here, the sampling distribution for (θ′,s′) is the and Pettitt (2011); Peters, Fan and Sisson (2012); prior predictive distribution, p(s|θ)p(θ), and the tar- Del Moral, Doucet and Jasra (2012)). By contrast, get distribution is p(θ,s|sobs). Using equation (1), the regression-based methods described in Section 2.1 it is then straightforward to compute the impor- do not aim at reducing the scale parameter ǫ but tance weight for the pair (θ′,s′). The weight is pro- rather explicitly account for the imperfect match ′ ′ ′ ′ ′ ′ portional to p(θ ,s |sobs)/[p(s |θ )p(θ )] = Kǫ(ks − between observed and simulated summary statistics sobsk), which is free of intractable likelihood terms, (Beaumont, Zhang and Balding (2002); Blum and p(s′|θ′). The manner by which the intractable like- Fran¸cois (2010)). lihoods cancel between sampling and target distri- Achieving a good trade-off between the two ap- butions forms the basis for the majority of ABC proximations revolves around the identification of a algorithms. set of summary statistics, s, which are both low- Clearly, both ABC approximations to the poste- dimensional and highly informative for θ. A number rior distribution help to avoid the computational of methods, primarily based on dimension reduc- intractability of the original problem. The first ap- tion ideas, have been proposed to achieve this (Joyce proximation allows the kernel weighting of the sec- and Marjoram (2008); Wegmann, Leuenberger and ond approximation, Kǫ(ks − sobsk), to be performed Excoffier (2009); Nunes and Balding (2010); Blum on a lower dimension than that of the original data, and Fran¸cois (2010); Blum (2010b); Fearnhead and yobs. Kernel smoothing is known to suffer from the Prangle (2012)). The choice of summary statistics DIMENSION REDUCTION METHODS IN ABC 3 is one of the most important aspects of a statis- to reduce their dimension while minimizing informa- tical analysis using ABC methods (along with the tion loss. Note that the most suitable set of summary choice of algorithm). Poor specification of s can have statistics for an analysis may be data set depen- a large and detrimental impact on both ABC model dent, as the information content of summary statis- approximations. tics may vary within the parameter space, Θ (an In this article we provide the first detailed re- exception is when sufficient statistics are known). view and comparison of the performance of the cur- As such, any analysis should also consider establish- rent methods of dimension reduction for summary ing potentially different summary statistics when re- statistics within the ABC framework. We character- implementing any model with a different data set. ize these methods into three nonmutually exclusive Methods of summary statistics dimension reduc- classes: (i) best subset selection, (ii) projection tech- tion for ABC can be broadly classified into three niques and (iii) regularization approaches. As part of nonmutually exclusive classes. The first class of meth- this analysis, we introduce two additional novel tech- ods follows a best subset selection approach. Here, niques for dimension reduction within ABC. The candidate subsets are evaluated and ranked accord- first adopts the ideas of Akaike and Bayesian in- ing to various information-based criteria, such as formation criteria to the ABC framework, whereas measures of sufficiency (Joyce and Marjoram (2008)) the second makes use of ridge regression as a reg- or the entropy of the posterior distribution (Nunes ularization procedure for ABC. The dimension re- and Balding (2010)). In this article we contribute duction methods are compared through the analy- additional criteria for this process derived from Akaike sis of three challenging models and data sets. These and Bayesian information criteria arguments. From involve the analysis of a coalescent model with re- these criteria, the highest ranking subset (or, al- combination (Joyce and Marjoram (2008)), an eval- ternatively, a subset consisting of those summary uation of the evolutionary fitness cost of mutation statistics which demonstrate clear importance) is in drug-resistant tuberculosis (Luciani et al. (2009)) then chosen for the final analysis. and an assessment of the number and size-distribu- The second class of methods can be considered as tion of particle inclusions in the production of clean projection techniques. Here, the dimension of (s1,..., steels (Bortot, Coles and Sisson (2007)). sp) is reduced by considering linear or nonlinear The layout of this article is as follows: in Section 2 combinations of the summary statistics. These meth- we classify and review the existing methods of sum- ods make use of a regression layer within the ABC mary statistic dimension reduction in ABC, and in framework, whereby the response variable, θ, is re- Section 3 we outline our two additional novel meth- gressed by the (possibly transformed) predictor vari- ods. A comparative analysis of the performance of ables, s (Beaumont, Zhang and Balding (2002); Blum each of these methods is provided in Section 4. We and Fran¸cois (2010)). These projection methods in- conclude with a discussion. clude partial least squares regression (Wegmann, Leuenberger and Excoffier (2009)), feed-forward neu- 2. CLASSIFICATION OF ABC DIMENSION ral networks (Blum and Fran¸cois (2010)) and re- REDUCTION METHODS gression guided by minimum expected posterior loss In a typical ABC analysis, an initial collection of considerations (Fearnhead and Prangle (2012)). ⊤ statistics s = (s1,...,sp) is chosen by the modeler, In this article we introduce a third class of meth- the elements of which have the potential to be infor- ods for dimension reduction in ABC, based on regu- ⊤ mative for the model parameters, θ = (θ1,...,θq). larization techniques. Using ridge regression, we also Choice of these initial statistics is highly problem make use of the regression layer between the pa- specific, and the number of candidate statistics, p, rameter θ and the summary statistics, s. However, often considerably outnumbers the number of model rather than explicitly considering a selection of sum- parameters, q, that is, p ≫ q (e.g., Bortot, Coles mary statistics, we propose to approach this implic- and Sisson (2007); Allingham, King and Mengersen itly, by shrinking the regression coefficients toward (2009); Luciani et al. (2009)). For example, Bor- zero so that uninformative summary statistics have tot, Coles and Sisson (2007) and Allingham, King the weakest contribution in the regression equation. and Mengersen (2009) use the ordered observations In the remainder of this section we discuss each of S(y) = (s(1),...,s(p)) so that there is no loss of infor- these methods in more detail. We first describe the mation at this stage. The analysis then proceeds by ideas behind ABC regression adjustment strategies either using all p statistics in full or by attempting (Beaumont, Zhang and Balding (2002); Blum and 4 BLUM, NUNES, PRANGLE AND SISSON

Fran¸cois (2010)), as many of the dimension reduc- regression model for the log of the squared residuals, tion techniques build on this framework. that is, log(θi − mˆ (si))2 = log σ2(si)+ ηi, where the ηi are independent, zero-mean variates with com- 2.1 Regression Adjustment in ABC mon variance. The equivalent adjustment to (3) is Standard ABC methods suffer from the curse of then given by dimensionality in that the rate of convergence of ∗i i i σˆ(sobs) posterior expectations with respect to p (θ|s ) (5) θ =m ˆ (sobs) + [θ − mˆ (s )] , ABC obs σˆ(si) (such as the Nadaraya–Watson estimator of the pos- terior mean) decreases dramatically as the dimen- whereσ ˆ(s) denotes the estimate of σ(s). The kernel sion of the summary statistics, p, increases (Blum scale parameter, ǫ, plays the same role as for the ho- (2010a)). ABC regression adjustment (Beaumont, moscedastic model, except with more flexibility on Zhang and Balding (2002)) aims to avoid this by deviations from homoscedasticity. Nott et al. (2013) explicitly modeling the discrepancy between s and have demonstrated that regression adjustment ABC ∗i sobs. When describing regression adjustment meth- algorithms produce samples, {θ }, for which first- ods, for notational simplicity and clarity of exposi- and second-order moment summaries approximate tion, we assume that the parameter of interest, θ, is adjusted expectation and variance for a Bayes lin- univariate (i.e., q = 1). Regression adjustment meth- ear analysis. We do not describe here an alterna- ods may be readily applied to multivariate θ, by us- tive regression adjustment method where the sum- ing a different regression equation for each parame- mary statistics are rather considered as the depen- ter, θ1,...,θq, separately. dent variables and the parameters as the indepen- The simplest model for this is a homoscedastic dent variables of the regression (Leuenberger and Wegmann (2010)). regression in the region of sobs, so that θi = m(si)+ ei, 2.2 Best Subset Selection Methods where (θi,si) ∼ p(s|θ)p(θ) are i = 1,...,n draws from Best subset selection methods are conceptually the prior predictive distribution, m(si)= E[θ|s = si] simple, but are cumbersome to manage for large is the mean function, and the ei are zero-mean ran- numbers of potential summary statistics, s = (s1,..., p dom variates with common variance. To estimate sp). Exhaustive enumeration of the 2 − 1 possible the conditional mean m(·), Beaumont, Zhang and combinations of summary statistics is practically in- Balding (2002) assumed a linear model feasible beyond a moderate value of p. This is espe- i ⊤ i cially true of Markov chain Monte Carlo or sequen- (2) m(s )= α + β s tial Monte Carlo based analyses, which require one in the neighborhood of sobs. An estimate of the mean sampler implementation per combination. As a re- function,m ˆ (·), is obtained by minimizing the weight- sult, stochastic or deterministic (greedy) search pro- n i i i 2 ed least squares criterion i=1 w km(s ) − θ k , cedures, such as forward or backward selection, are i i where w = Kǫ(ks −sobsk). A weighted sample from required to implement them. the posterior distribution, p P (θ|s ), is then ob- ABC obs 2.2.1 A sufficiency criterion The first principled tained by the adjustment approach to dimension reduction in ABC was the ∗i i i (3) θ =m ˆ (sobs) + (θ − mˆ (s )) ε-sufficiency concept proposed by Joyce and Marjo- for i = 1,...,n. In the above, the kernel scale param- ram (2008), which was used to determine whether eter ǫ controls the bias-variance trade-off: increasing to include an additional summary statistic, sk, to ǫ reduces variance by increasing the effective sample a model already containing statistics s1,...,sk−1. size—the number of accepted simulations when us- Here, noting that the difference between the log like- ing a uniform kernel K—but increases bias arising lihoods of p(s1,...,sk|θ) and p(s1,...,sk−1|θ) is from departures from a linear mean function m(·) log p(sk|s1,...,sk−1,θ), Joyce and Marjoram (2008) and homoscedastic error structure (Blum (2010a)). defined the set of statistics s1,...,sk−1 to be ε-suffi- Blum and Fran¸cois (2010) proposed the more flex- cient relative to sk if ible, heteroscedastic model δk = suplog p(sk|s1,...,sk−1,θ) θ (4) θi = m(si)+ σ(si)ei, (6) − inf log p(sk|s1,...,sk−1,θ) where σ2(si)= V[θ|s = si] denotes the conditional θ variance. This variance is estimated using a second ≤ ε. DIMENSION REDUCTION METHODS IN ABC 5

Accordingly, if an estimate of δk (i.e., the “score” date combinations of summary statistics. Since en- of sk relative to s1,...,sk−1) is greater than ε, then tropy measures information and a lack of random- there is enough additional information content in ness (Shannon (1948)), the authors propose mini- sk to justify including it in the model. In practice, mizing the entropy of the approximate posterior, Joyce and Marjoram (2008) implement a conceptu- pABC(θ|sobs), over subsets of the summary statistics, ally equivalent assessment, whereby sk is added to s, as a proxy for determining maximal information the model if the ratio of posteriors about a parameter of interest. High entropy results from a diffuse posterior sample, whereas low entropy p (θ|s ,...,s ,s ) R (θ)= ABC 1 k−1 k is obtained from a posterior which is more precise k p (θ|s ,...,s ) ABC 1 k−1 in nature. differs from one by more than some threshold value Nunes and Balding (2010) estimate entropy us- T (θ) for any value of θ. As such, a statistic sk will be ing the unbiased kth nearest neighbor estimator of added to the model if the resulting posterior changes Singh et al. (2003). For a weighted posterior sample, 1 1 n n i sufficiently at any point. The threshold, T (θ), is (w ,θ ),..., (w ,θ ), where i w = 1, this estima- user-specified, with one particular choice described tor can be written as P in Section 5 of Joyce and Marjoram (2008). πq/2 This procedure can be implemented within any Eˆ = log − ψ(k) + log n Γ(q/2+1) stepwise search algorithm, each of which have vari- (7)   ous pros and cons. Following the definition (6), the n i ˆ−1 resulting optimal subset of summary statistics is + q w log Ci (k/(n − 1)), then ε-sufficient relative to each one of the remain- Xi=1 ing summary statistics. Here ε intuitively represents where q = dim(θ), ψ(x)=Γ′(x)/Γ(x) denotes the an acceptable error in determining whether sk con- digamma function, and where Cˆi(·) denotes the em- tains further useful information in addition to s1,..., pirical distribution function of the Euclidean dis- sk. This quantity is also user-specified, and so the fi- tance from θi to the remainder of the weighted pos- nal optimal choice of summary statistics will depend terior sample, that is, of the weighted samples {(w ˜j , on the chosen value. i j j j j kθ − θ k)}j6=i, wherew ˜ = w / w . Following Sensitivity to the choice of ε aside, this approach j6=i Singh et al. (2003), the original work of Nunes and may be criticized in that it assumes that every change P Balding (2010) used k = 4 and was based on an to the posterior obtained by adding a statistic, s , k equally weighted posterior sample (i.e., with wi = is beneficial. It is conceivable that attempting to in- 1/n, i = 1,...,n), so that Cˆ−1(k/(n−1)) denotes the clude a completely noninformative statistic, where i Euclidean distance from θi to its kth closest neigh- the observed statistic is unlikely to have been gen- bor in the posterior sample {θ1,...,θi−1,θi+1,...,θn}. erated under the model, will result in a sufficiently While minimum entropy could in itself be used to modified posterior as measured by ε, but one which evaluate the informativeness of a vector of summary is more biased away from the true posterior p(θ|y ) obs statistics for θ (although see the criticism of entropy than without including s . A toy example illustrat- k below), Nunes and Balding (2010) propose a sec- ing this was given by Sisson and Fan (2011). ond stage to their analysis, which aims to assess the A further criticism is that the amount of computa- performance of a candidate set of summary statis- tion required to evaluate R (θ) for all θ, and on mul- k tics using a measure of posterior error. For exam- tiple occasions, is considerable, especially for large q. ple, when the true parameter vector, θ , is known, In practice, Joyce and Marjoram (2008) considered true the authors suggest the root sum of squared errors θ to be univariate, and approximated continuous θ (RSSE), given by over a discrete grid in order to keep computational overheads to acceptable levels. As such, this method n 1/2 i i 2 appears largely restricted to dimension reduction for (8) RSSE = w kθ − θtruek , univariate parameters (q = 1). ! Xi=1 i 2.2.2 An entropy criterion Nunes and Balding where the measure kθ −θtruek compares the compo- (2010) propose the entropy of a distribution as a nents of θ on a suitable scale (and so some component- heuristic to measure the informativeness of candi- wise standardization may be required). Naturally, 6 BLUM, NUNES, PRANGLE AND SISSON the true parameter value, θtrue, is unknown in prac- 2.2.3 AIC and BIC criteria Information criteria tice. However, if the simulated summary statistics based on Akaike and Bayesian information are nat- from the samples (θi,si) are treated as observed ural best subset selection techniques for summary i data, it is clear that θtrue = θ for the posterior statistic dimension reduction in ABC analyses. We i pABC(θ|s ). As such, the RSSE can be easily com- introduce and develop these criteria in Section 3.1. puted with a leave-one-out technique. 2.3 Projection Techniques As the subset of summary statistics that mini- mizes (8) will likely vary over observed data sets, si, Selecting a best subset of summary statistics from Nunes and Balding (2010) propose minimizing the s = (s1,...,sp) suffers from the problem that it may average RSSE over some number of simulated data require several statistics to provide the same infor- sets which are close to the observed, sobs. To avoid mation content as a single, highly informative statis- circularity, Nunes and Balding (2010) define these tic that was not specified in the initial set, s. To “close” data sets to be the j = 1,...,n∗ simulated avoid this, projection techniques aim to combine the j j data sets, {s }, that minimize ksME − sMEk, where elements of s through linear or nonlinear transfor- j sME and sME are the vectors of minimum entropy mations, in order to construct a potentially much summary statistics computed via (7) from sj and lower-dimensional set of highly informative statis- the observed summary statistics, sobs, respectively. tics. That is, the quantity One of the main advantages of projection tech- n∗ niques is that, unlike best subset selection methods, 1 (9) RSSE = RSSE they scale well with increasing numbers of summary n∗ j j=1 statistics. They can handle large numbers of possibly X uninformative summary statistics, in addition to ac- is minimized (over subsets of summary statistics), counting for high levels of interdependence and mul- where RSSE corresponds to (8) using the simulated j ticollinearity. A minor disadvantage of projection data set sj. techniques is that the final sets of projected sum- This approach is intuitive and is attractive be- mary statistics typically (but not universally) lack cause the second stage directly measures error in the interpretability. In addition, most projection meth- posterior with respect to a known truth, θtrue, which is not typically considered in other ABC dimension ods require the specification of a hyperparameter reduction approaches, albeit at the extra computa- that governs the number of projections to perform. tional expense of a two-stage procedure. A weak- 2.3.1 Partial least squares regression Partial least ness of the first stage, however, is the assumption squares regression seeks the orthogonal linear combi- that addition of an informative statistic will reduce nations of the explanatory variables which have high the entropy of the resulting posterior distribution. variance and high correlation with the response vari- An example of when this does not occur is when able (e.g., Boulesteix and Strimmer (2007); Vinzi the posterior distribution is diffuse with respect to et al. (2010); Abdi and Williams (2010)). Wegmann, the prior—for instance, if an overly precise prior is Leuenberger and Excoffier (2009) proposed the use located in the distributional tails of the posterior of partial least squares regression for dimension re- (e.g., Jeremiah et al. (2011)). In this case, attempt- duction in ABC, where the explanatory variables ing to include an informative additional statistic, sk, are the suitably (e.g., Box–Cox) transformed sum- can result in a distribution that is more diffuse than mary statistics, s, and the response variables is the with sk excluded. As such, the entropic approach is parameter vector, θ. therefore mostly suited to models with relatively dif- The output of a partial least squares analysis is fuse prior distributions. Another potential criticism the set of k orthogonal components of the regression of the first stage is that minimizing the entropy does design matrix minimal not necessarily provide the subset of suffi- 1 1 cient statistics. This provides an argument for con- 1 s1 · · · sp . . . . sidering the mutual information between θ and s, (10) X =  . . .. .  rather than the entropy (Barnes et al. (2012); see 1 sn · · · sn also Filippi, Barnes and Stumpf (2012)). However,  1 p    it is clear that the overall approach of Nunes and that are optimally correlated (in a specific sense) i Balding (2010) could easily be implemented with al- with θ. Here, sj denotes the jth component of the ternative first-stage selection criteria. ith simulated summary statistic, si. To choose the DIMENSION REDUCTION METHODS IN ABC 7 appropriate number of orthogonal components, Weg- The reduced and nonlinearly transformed summary mann, Leuenberger and Excoffier (2009) examine statistics of the hidden units, zj , are then combined the root mean square error of θ for each value of k, as through the regression function of the neural net- estimated by a leave-one-out cross-validation strat- work egy. For a fixed number of components, k, this cor- H (2) (2) responds to (13) m(s)= g ωj zj + ω0 , ! n 1/2 Xj=1 1 −i i i 2 (2) (11) RMSEk = kmˆ k (s ) − θ k , where ω denotes the weights of the second layer n ! j Xi=1 of the neural network and g(·) is a link function. −i A similar neural network is used to model log σ2(s) wherem ˆ k (s) denotes the mean response of the par- tial least squares regression, estimated without the (e.g., Nix and Weigend (1995)), with the possibility ith simulated summary statistic, si (e.g., Mevik and of allowing for a different number of hidden units to Cederkvist (2004)). The optimal number of compo- estimate heteroscedasticity in the regression adjust- nents is then chosen by inspection of the RMSEk val- ment compared to that in the mean function m(s). ues, based on minimum gradient change arguments Rather than dynamically determining the number (e.g., Mevik and Wehrens (2007)). of hidden units H, Blum and Fran¸cois (2010) pro- A potential disadvantage of partial least squares pose to specify a fixed value, such as H = q where regression, as performed by Wegmann, Leuenberger q = dim(θ) is the number of parameters to infer. The and Excoffier (2009), is that it aims to infer a global weights of the neural network are then obtained by linear relationship between θ and s based on draws minimizing the regularized least-squares criterion from the prior predictive distribution, p(s|θ)p(θ). n i i i 2 2 This may differ from the relationship observed in w km(s ) − θ k + λkωk , i=1 the region around s = sobs, and as such may pro- X duce unsuitable orthogonal components as a result. where ω is the vector of all weights in the neural i i A workaround for this would be to follow Fearn- network model for m(s), w = Kǫ(ks − sobsk) is the head and Prangle (2012) (see Section 2.3.3) and weight of the sample (θi,si) ∼ p(s|θ)p(θ), and λ> 0 elicit the relationship between θ and s based on sam- denotes the regularization parameter (termed the ples from a truncated prior (θi,si) ∼ p(s|θ)p(θ)I(θ ∈ weight-decay parameter for neural networks). The ΘR), where ΘR ⊂ Θ restricts the samples, θi, to re- idea of regularization is to shrink the weights to- gions of significant posterior density. One simple way ward zero so that only informative summary statis- to identify such a region is through a pilot ABC tics contribute in the models (12) and (13) for m(s). analysis (Fearnhead and Prangle (2012)). Following the estimation of m(s), a similar regular- ization criterion is used to estimate log σ2(s). Both 2.3.2 Neural networks In the regression setting, mean and variance functions can then be used in the feed-forward neural networks can be considered as a regression adjustment of equation (5). nonlinear generalization of the partial least squares regression technique described above. Blum and Fran- 2.3.3 Minimum expected posterior loss Fearnhead ¸cois (2010) proposed the neural network as a ma- and Prangle (2012) proposed a decision-theoretic di- chine learning approach to dimension reduction by mension reduction method with a slightly different estimating the conditional mean and variance func- aim to previous dimension reduction approaches. tions, m(·) and σ2(·) in the nonlinear, heteroscedas- Here, rather than constructing appropriate summary tic regression adjustment model (4)—see Section 2.1. statistics to ensure that pABC(θ|sobs) ≈ p(θ|yobs) is The neural network reduces the dimension of the a good approximation, pABC(θ|sobs) is alternatively summary statistics to H

In the same manner the BIC can be defined as over-adjustment by adjusting the parameter values q via (3) in the direction of uninformative summary 2 (16) BIC =n ˜ log σˆj + d logn. ˜ statistics. jY=1 To avoid this, implicit dimension reduction within Alternative penalty terms involving the hat matrix the regression framework can be performed by alter- of the regression could also be used in the above natively minimizing the regularized weighted sum of (e.g., Hurvich, Simonoff and Tsai (1998); Irizarry squares (Hoerl and Kennard (1970)) (2001); Konishi, Ando and Imoto (2004)). n It is instructive to note that in using the linear re- (17) wikθi − (α + β⊤si)k2 + λkβk2 gression adjustment (3), the above information cri- i=1 teria may be expressed as X with regularization parameter λ> 0. As with the q regularization component within the neural network xIC=n ˜ log Var(θ∗) + penalty term, j model of Blum and Fran¸cois (2010) (Section 2.3.2), jY=1 with ridge regression the risk of over-adjustment is ∗ where θj is the jth element of the regression ad- reduced because the regression coefficients, β, are ∗ ∗ ∗ justed vector θ = (θ1,...,θq ). As such, up to the shrunk toward zero by imposing a penalty on their penalty terms, both AIC and BIC seek the com- magnitudes. Note that while we consider ridge re- bination of summary statistics that minimizes the gression here, a number of alternative regularization product of the marginal variances of the adjusted procedures could be implemented, such as the Lasso posterior sample. Similarly to the entropy criterion method. of Nunes and Balding (2010) (see Section 2.2.2), An additional advantage of ridge regression is that these information criterion will select those sum- standard least squares estimates, (ˆα , βˆ )⊤ = (X⊤ · mary statistics that maximize the precision of the LS LS W X)−1X⊤W Θ, are not guaranteed to have a unique posterior distribution, p (θ|s ). However, un- ABC obs solution. Here X is a n × (p + 1) design matrix given like Nunes and Balding (2010), this precision is in equation (10), Θ = (θ1,...,θn) is the n×1 column traded off by a penalty for model complexity. vector of sampled θi, and W = diag(w1, . . ., wn) is A rationale for the construction of AIC and BIC in this manner is that the summary statistics that an n × n diagonal matrix of weights. The lack of should be included within an ABC analysis are those a unique solution can arise through multicolinear- ity of the summary statistics, which can result in which are good predictors of θ. However, an obvious ⊤ requirement for AIC or BIC to identify an infor- singularity of the matrix X W X. In contrast, min- mative statistic is that the statistic varies (with θ) imization of the regularized weighted sum of squares within the local range of the regression model. If (17) always has a unique solution, provided that ⊤ a statistic is informative outside of this range, but λ> 0. This solution is given by (ˆαridge, βˆridge) = ⊤ −1 ⊤ uninformative within it, it will not be identified as (X W X + λIp) X W Θ, where Ip denotes the p × informative under these criteria. p identity matrix. There are several approaches for 3.2 Regularization via Ridge Regression dealing with the regularization parameter λ, includ- ing cross-validation and generalized cross-validation As described in Section 2.1, the local-linear regres- to identify an optimal value of λ (Golub, Heath and sion adjustment of Beaumont, Zhang and Balding Wahba (1979)), as well as averaging the regularized (2002) fits the linear model ⊤ estimates (ˆαridge, βˆridge) obtained for different val- θi = α + β⊤si + ei ues of λ (Taniguchi and Tresp (1997)). based on the prior predictive samples (θi,si) ∼ p(s|θ)p(θ) and with regression weights given by wi = 4. A COMPARATIVE ANALYSIS i Kǫ(ks − sobsk). (As before, we describe the case We now provide a comparative analysis of the pre- where θ is univariate for notational simplicity and viously described methods of dimension reduction clarity of exposition, but the approach outlined be- within the context of three previously studied anal- low can be readily implemented for each compo- yses in the ABC literature. Specifically, this includes nent of a multivariate θ.) However, in fitting the the analysis of a coalescent model with recombina- model by minimizing the weighted least squares cri- n i ⊤ i i 2 tion (Joyce and Marjoram (2008)), an evaluation of teria, i=1 w kα − β s − θ k , there is a risk of the evolutionary fitness cost of mutation in drug- P DIMENSION REDUCTION METHODS IN ABC 11 resistant tuberculosis (Luciani et al. (2009)) and an When using neural networks or ridge regression to assessment of the number and size-distribution of estimate the conditional mean and variance, m(s) particle inclusions in the production of clean steels and σ2(s), we take the pointwise median of the es- (Bortot, Coles and Sisson (2007)). timated functions obtained with the regularization Each analysis is based on n = 1,000,000 simula- parameters λ = 10−3, 10−2 and 10−1. These values of tions where the parameter θ is drawn from the prior λ assume that the summary statistics and the pa- distribution p(θ). The performance of each method rameters have been standardized before fitting the is measured through the RSSE criterion (9) follow- regression function (Ripley (1994)). However, be- ing Nunes and Balding (2010), based on the same cause the optimization procedure for neural networks randomly selected subset of n∗ = 100 samples (θi, (the R function nnet) only finds local optima, in i s ) = (θtrue,sobs) as “observed” data sets. When eval- this case we take the pointwise median of ten esti- uating the RSSE error measure of equation (8), we mated functions, with each optimization initialized give a weight wi = 1 for the accepted simulations from a different random starting point, and ran- and a weight of 0 otherwise. As the value of the domly choosing the regularization parameter with RSSE (8) depends on the scale of each parameter, equal probability from the above values (see Tanigu- we standardize the parameters in each example by chi and Tresp (1997)). dividing the parameter values by the standard de- 4.1 Example 1: A Coalescent Analysis viation obtained from the n = 1,000,000 simulations (with the exception of the first example, where the This model was previously considered by Joyce parameters are on similar scales). For comparative and Marjoram (2008) and Nunes and Balding (2010), ease, and to provide a performance baseline for each each while proposing their respective ABC dimen- example, all RSSE results are presented as relative sion reduction strategies (see Sections 2.2.1 and 2.2.2). to the RSSE obtained when using the maximal vec- The analysis focuses on joint estimation of the scaled tor of summary statistics and no regression adjust- mutation rate, θ˜, and the scaled recombination rate, ment. In this manner, a relative RSSE of x/−x de- ρ, in a coalescent model with recombination (Nord- notes an x% worsening/improvement over the base- borg (2007)). Under this model, 5001 base pair DNA line score. sequences for 50 individuals are generated from the Within each ABC analysis, we use Euclidean dis- coalescent model, with recombination, under the infi- ms tance within an Epanechnikov kernel Kǫ(ks−sobsk). nite-sites mutation model, using the software The Euclidean distances are computed after stan- (Hudson (2002)). The initial summary statistics, s = dardizing the summary statistics with a robust esti- (s1,...,s6), are the number of segregating sites (s1), mate of the standard deviation (the mean absolute the pairwise mean number of nucleotidic differences 2 deviation). The kernel scale parameter, ǫ, is deter- (s2), the mean R across pairs separated by <10% mined as the value at which exactly 1% of the sim- of the simulated genomic regions (s3), the number of i i ulations, (θ ,s ), have nonzero weight. This yields distinct haplotypes (s4), the frequency of the most exactlyn ˜ = 10,000 simulations that form the final common haplotype (s5) and the number of singleton sample from each posterior. To perform the method haplotypes (s6). of Fearnhead and Prangle (2012), a randomly cho- We first examine the performance of ABC with- sen 10% of the n simulations were used to fit the out using dimension reduction techniques. For differ- regression model that determines the choice of sum- ent parameter combinations, θ,˜ ρ and (θ,˜ ρ), we com- mary statistics, with the remaining 90% used for the pute the relative RSSE obtained with a single op- ABC analysis. The final ABC sample sizen ˜ = 10,000 timal summary statistic and the relative RSSE ob- was kept equal to the other methods by slightly ad- tained when using all six population genetic statis- justing the scale parameter, ǫ. In addition, for the tics (s1–s6) (Table 1). When estimating θ˜ only, we method of Fearnhead and Prangle (2012), follow- find that using only the number of segregating sites ing exploratory analyses, the regression model (14) (s1) provides lower relative RSSE than when includ- was fitted using f(s) = (s,s2,s3,s4) for examples ing all 6 summary statistics even when performing 1 and 2 (as described in Section 2.3.3) and using regression adjustment. For all other parameter com- f(s) = (log s, [log s]2, [log s]3, [log s]4) for example 3, binations, using a single statistic produces substan- always resulting in 4 × p independent variables in tially worse than the rejection algorithm with all the regression model of equation (14). summary statistics. For all inferences [i.e., of θ˜, ρ 12 BLUM, NUNES, PRANGLE AND SISSON

Table 1 Relative RSSE for examples 1 and 2. The leftmost column shows the minimal RSSE when considering only one summary statistic (with no regression adjustment). Rightmost columns show relative RSSE using all summary statistics under no, homoscedastic and heteroscedastic regression adjustment. All RSSE are relative to the RSSE obtained when using no regression adjustment with all summary statistics. The score of the best method in each analysis (row) is emphasised in boldface

All summary statistics One optimal statistic (no adj.) Noadj. Homoadj. Heteroadj.

Example 1 θ˜ −7 (s1) 0 −3 −3 ρ 9 (s5) 0 −5 −4 (θ,ρ˜ ) 7(s1) 0 0 −7 Example 2 α 6 0 −3 −3 c −7 0 −5 −5 ρ −9 0 −8 −8 µ −14 0 −5 −6 (α, c, ρ, µ) 5 0 −4 −4

and (θ,˜ ρ) jointly], regression adjustments generally least squares when estimating (θ,˜ ρ) jointly. By con- improve the inference when using all six summary trast, ridge regression provides no improvement over statistics, which is consistent with previous results the standard regression adjustment (the “All” col- (Nunes and Balding (2010)). The only exception is umn). when jointly estimating (θ,˜ ρ), where homoscedastic Based on these results, a loose performance rank- linear adjustment neither decreases nor increases the ing of the dimension reduction methods can be ob- error obtained with the pure rejection algorithm. tained by computing, for each method, the mean Next, we investigate the performance of each di- (relative) RSSE over all parameter combinations θ˜, mension reduction technique. Table 2 and Figure 1 ρ and (θ,˜ ρ) using the heteroscedastic adjustment. show the relative RSSE obtained under each dimen- The worst performers were ridge regression and the sion reduction method for each parameter combi- ε-sufficiency criterion (with a mean relative RSSE nation and under heteroscedastic regression adjust- of −3%). These are followed by the standard regres- ment. For all three examples, more complete tables sion adjustment with all summary statistics (−5%) that contain the results obtained with no regression and the AIC/BIC, neural nets and the posterior adjustment and homoscedastic adjustment can be loss method (−6%). The best performing methods found in the supplementary information to this ar- are partial least squares (−10%) and the two-stage ticle (Blum et al. (2013)). entropy-based procedure (−16%). The performance achieved with AIC, AICc or BIC 4.2 Example 2: The Fitness Cost of Drug is comparable to (i.e., the same or slightly better Resistant Tuberculosis than) the result obtained when including all six pop- ulation genetic statistics. When using the ε-sufficiency We now consider an example of Markov processes criterion, we find that the performance is improved for epidemiological modeling. If a pathogen, such for the inference on θ˜ only. The only best subset as Mycobacterium tuberculosis, mutates to gain an selection method for dimension reduction that sub- evolutionary advantage, such as antibiotic resistance, stantially and uniformly improves the performance it is biologically plausible that this mutation will of ABC posterior estimates is the entropy-based ap- come at a cost to the pathogen’s relative fitness. proach. For the projection techniques, all methods Based on a stochastic model to describe the trans- (partial least squares, neural nets and minimum ex- mission and evolutionary dynamics of Mycobacterium pected posterior loss) outperform the adjustment tuberculosis, and based on incidence and genotypic method based on all six population genetics statis- data of the IS6110 marker, Luciani et al. (2009) es- tics, with a large performance advantage for partial timated the posterior distribution of the pathogen’s DIMENSION REDUCTION METHODS IN ABC 13

Table 2 Relative RSSE for examples 1–3 for different parameter combinations using each method of dimension reduction, and under heteroscedastic regression adjustment. Columns denote no dimension reduction (All), BIC, AIC, AICc, the ε-sufficiency criterion (ε-suff), the two-stage entropy procedure (Ent), partial least squares (PLS), neural networks (NNet), minimum expected posterior loss (Loss) and ridge regression (Ridge). All RSSE are relative to the RSSE obtained when using no regression adjustment with all summary statistics. The score of the best method in each analysis (row) is emphasized in boldface

Best subset selection Projection techniques Regularization

All BIC AIC AICc ε-suff Ent PLS NNet1 Loss Ridge1

θ˜ −3 −5 −5 −5 −6 −11 −6 −4 −7 1 ρ −4 −6 −6 −6 0 −12 −7 −8 −7 −3 (θ,ρ˜ ) −7 −7 −7 −7 – −24 −16 −7 −6 −6 α −3 −15 −15 −15 0 −17 −13 −15 −17 −15 c −5 −15 −15 −15 −8 −15 −8 −12 −9 −9 ρ −8 −16 −16 −16 −8 −16 1 −12 −9 −10 µ −6 −18 −18 −18 −8 −13 −10 −13 −12 −12 (α, c, ρ, µ) −4 −19 −19 −19 – −13 −10 −9 −12 −11 τ −49 −47 −47 −48 −19 −52 −22 −20/−42 −75 −48/−48 σ −45 −46 −47 −46 −15 −50 −15 −21/−37 −56 −43/−43 ξ −27 −29 −29 −28 −13 −32 −28 −7/−41 −41 −26/−44 (τ,σ,ξ) −39 −39 −40 −39 – −42 −11 −4/−38 −60 −39/−32

1For the third example, the first value is found by integrating out the regularization parameter, whereas the second one is found by choosing an optimal regularization parameter with cross-validation. In examples 1 and 2, integration over the regularization parameter is performed.

Fig. 1. Relative RSSE for the different methods of dimension reduction in the three examples. All RSSE are relative to the RSSE obtained when using no regression adjustment with all summary statistics. Methods of dimension reduction include no dimension reduction (All), AIC/BIC, the ε-sufficiency criterion (ε-suff), the two-stage entropy procedure (Ent), partial least squares (PLS), neural networks (NNet), minimum expected posterior loss (Loss) and ridge regression (Ridge). The crosses correspond to situations for which there is no result available. 14 BLUM, NUNES, PRANGLE AND SISSON transmission cost and relative fitness. The model −15% and −19%. The ε-sufficiency criterion pro- contained q = 4 free parameters: the transmission duces more equivocal results, however, as the er- rate, α, the transmission cost of drug resistant strains, ror is sometimes increased with respect to baseline c, the rate of evolution of resistance, ρ, and the mu- performance (e.g., +6% when estimating α with ho- tation rate of the IS6110 marker, µ. moscedastic adjustment) and sometimes reduced (e.g., Luciani et al. (2009) summarized information gen- −8% for c, ρ and θ with heteroscedastic adjust- erated from the stochastic model through p = 11 ment). As with the previous example, the entropy summary statistics. These statistics were expertly criterion provides a clear improvement to the ABC elicited as quantities that were expected to be infor- posterior, and this improvement is almost compara- mative for one or more model parameters, and in- ble to that produced by AIC/BIC. Finally, the pro- cluded the number of distinct genotypes in the sam- jection and regularization methods mostly all pro- ple, gene diversity for sensitive and resistant cases, vide comparable and substantive improvements com- the proportion of resistant cases and measures of the pared to the baseline error, with only partial least degree of clustering of genotypes, etc. It is consid- squares producing more equivocal results (e.g., +1% ered likely that there is dependence and potentially when estimating ρ). replicate information within these statistics. Based on these results, the loose performance rank- As before, we examine the relative performance ing of the dimension reduction methods determines of the statistics without using dimension reduction the worst performers to be the standard least squares techniques. Table 1 shows that for the univariate regression adjustment (with a mean relative RSSE analysis of c, ρ or µ, performing rejection sampling of −5%), the ε-sufficiency approach (−6%) and par- ABC with a single, well-chosen summary statistic tial least squares (−8%). These are followed by ridge can provide an improved performance over a simi- regression (−11%), neural networks and the pos- lar analysis using all 11 summary statistics, under terior loss method (−12%). The best performing any form of regression adjustment. In particular, the methods for this analysis are the two-stage entropy- proportion of isolates that are drug resistant is the based procedure (−15%) and the AIC/BIC criteria individual statistic which is most informative to esti- (−17%). mate c (with a relative RSSE of −7%) and ρ (−9%). In this example, it is interesting to compare the For the marker mutation rate, µ, the most infor- performance of the standard linear regression ad- mative statistic is the number of distinct genotypes, justment of all 11 summary statistics (mean rela- with a relative RSSE of −14%. Conversely, an analy- tive RSSE of −5%) with that of the ridge regres- sis using all summary statistics with a regression ad- sion equivalent (mean relative RSSE of −11%). The justment offers the best inferential performance for increase in performance with ridge regression may α alone, or for (α, c, ρ, µ). These results provide sup- be attributed to its more robust handling of mul- port for recent arguments in favor of “marginal re- ticolinearity of the summary statistics than under gression adjustments” (Nott et al. (2013)), whereby the standard regression adjustment. To see this, Fig- the univariate marginal distributions of a full mul- ure 2 illustrates the relationship between the relative tivariate ABC analysis are replaced by separately RSSE (again, relative to using all summary statis- estimated marginal distributions using only statis- tics and no regression adjustment) and the condi- tics relevant for each parameter. Here, more pre- tion number of the matrix X⊤W X, for both the cisely estimated margins can improve the accuracy standard regression (top panel) and ridge regres- of the multivariate posterior sample, beyond the ini- sion (bottom panel) adjustments based on inference tial analysis. for (α, c, ρ, µ). The condition number of X⊤W X is The performance results of each dimension reduc- given by κ = λmax/λmin, where λmax and λmin are tion method are shown in Table 2 and Figure 1. the largest and smallest eigenvalues of X⊤W X. Ex- p In contrast with the previous example, here the use tremely large condition numbers are evidence for of the AIC/BIC criteria can substantially decrease multicolinearity. posterior errors. For example, compared to the lin- Figure 2 demonstrates that for large values of the ear adjustment of all 11 parameters, which produces condition number (e.g., for κ> 108), the least-squares- a mean relative RSSE between −3% and −8% de- based regression adjustment clearly performs very pending on the parameter (Table 2), using the AIC/ poorly. The region of κ> 108 corresponds to almost BIC criteria results in a relative RSSE of between 5% of all simulations, and for these cases the rel- DIMENSION REDUCTION METHODS IN ABC 15

Fig. 2. Scatterplots of relative RSSE versus the condition number of the matrix X⊤W X for linear least squares (top) and ridge (bottom) regression adjustments. Points are based on joint inference for (α, c, ρ, µ) in example 2 using 1000 randomly i selected vectors of summary statistics, s , as “observed” data. When the minimum eigenvalue, λmin, is zero, the (infinite) condition number is arbitrarily set to be 1025 for visual clarity (open circles on the scatterplot). ative error is hugely increased (w.r.t. rejection) to standard stereological problem (e.g., Baddeley and anywhere between 5% and 200%. In contrast, for Jensen (2004)), whereby inference is required on the ridge regression, the relative errors corresponding to size and number of 3-dimensional inclusions, based κ> 108 are not larger than the errors obtained for on data obtained from those inclusions that inter- nonextreme condition numbers. This analysis clearly sect with a 2-dimensional slice. The model assumes a illustrates that, unlike ridge regression, the stan- Poisson point process of inclusion locations with rate dard least squares regression adjustment can per- parameter τ > 0 and that the distribution of inclu- form particularly poorly when there is multicolin- sion size exceedances above a measurement thresh- earity between the summary statistics. old of 5µm are drawn from a generalized Pareto In terms of the original analysis of Luciani et al. distribution with scale and shape parameters σ> 0 (2009) which used all eleven summary statistics with and ξ, following standard extreme value theory ar- no regression adjustment (although with a very low guments (e.g., Coles (2001)). value for ǫ), the above results indicate that a more The observed data consist of 112 cross-sectional efficient analysis may have been achieved by using a inclusion diameters measured above 5µm. The sum- suitable dimension reduction technique. mary statistics thereby comprise 112 equally spaced 4.3 Example 3: Quality Control in the quantiles of the cross-sectional diameters, in addi- Production of Clean Steels tion to the number of inclusions observed, yielding Our final example concerns the statistical mod- p = 113 summary statistics in total. The ordering of eling of extreme values. In the production of clean the summary statistics creates strong dependences steels, the occurrence of microscopic imperfections between them, a fact which can be exploited by (termed inclusions) is unavoidable. The strength of dimension reduction techniques. Bortot, Coles and a clean steel block is largely dependent on the size Sisson (2007) considered two models based on spher- of the largest inclusion. Bortot, Coles and Sisson ical or ellipsoidal shaped inclusions. Our analysis (2007) considered an extreme value twist on the here focuses on the ellipsoidal model. 16 BLUM, NUNES, PRANGLE AND SISSON

By construction, the large number (2113) of possi- procedure to determine the regularization parame- ble combinations of summary statistics means that ter within ridge regression, there is also a mean gain the best subset selection methods are strictly not in performance from −39% to −42%, although the practicable for this analysis, unless the number of joint parameter inference on (τ,σ,ξ) actually per- summary statistics is reduced further a priori. In forms worse under this alternative approach. The order to facilitate at least some comparison with variability in these results highlights the importance the other dimension reduction approaches, for the of making an optimal choice of the regularization best subset selection methods only, we consider six parameter for an ABC analysis. candidate subsets. Each subset consists of the num- The minimum expected posterior loss approach ber of observed inclusions in addition to 5, 10, 20, performs particularly well here. This approach has 50, 75 or 112 empirical quantiles of the inclusion also been shown to perform well in a similar anal- size exceedances (the latter corresponds to the com- ysis: that of performing inference using quantiles plete set of summary statistics). Due to the extreme of a large number of independent draws from the value nature of this analysis, the parameter esti- (intractable) g-and-k distribution (Fearnhead and mates are likely to be more sensitive to the pre- Prangle (2012)). cise values of the larger quantiles. As such, rather The loose performance ranking of each of the di- than using equally spaced quantiles, we use a scheme mension reduction methods finds that the worst per- which favors quantiles closer to the maximum inclu- formers are the ε-sufficiency criterion (with a mean sion and we always include the maximum inclusion. relative RSSE of −16%) and partial least squares The relative RSSE obtained under each dimension (−19%). Neural networks and AIC/BIC perform just reduction method is shown in Table 2 and Figure 1. as well as standard least squares regression (−40%), In comparison to an analysis using all 113 summary ridge regression slightly outperforms standard re- statistics and regression adjustment (the “All” col- gression (−42%) and the entropy-based approach umn), the best subset selection approaches do not in is a further slight improvement at −44%. The clear general offer any improvement. While the entropy- winner in this example is the posterior loss approach based method provides a slight improvement, the with a mean relative RSSE of −58%. relative RSSE under the ε-sufficiency criterion is substantially worse (along with partial least squares). 5. DISCUSSION Of course, these results are limited to the few subsets of statistics considered and it is possible that alter- The process of dimension reduction is a critical native subsets could perform substantially better. and influential part of any ABC analysis. In this However, it is computationally untenable to evalu- article we have provided a comparative review of ate this possibility based on exhaustive enumeration the major dimension reduction approaches (and in- of all subsets. troduced two new ones) in order to provide some When using neural networks to perform the re- guidance to the practitioner in choosing the most gression adjustment based on computing the point- appropriate technique for their own analysis. A sum- wise median of the m(s) and σ2(s) estimates, ob- mary of the qualitative features of each dimension tained using varying regularization parameter val- reduction method is shown in Table 3, and a com- ues (see the introduction to Section 4), the relative parison of the relative performances of each method performance is quite poor (left-hand side RSSE val- for each example is illustrated in Figure 3. As with ues in Table 2). The mean relative RSSE is −13% each individual example, we may compute an over- for neural networks, compared to −40% for het- all performance ranking of the dimension reduction eroscedastic least squares regression. As an alterna- methods by averaging the mean relative RSSE val- tive approach, rather than averaging over the regu- ues over the examples. Performing worse, on aver- larization parameter λ, we rather choose the value age, than a standard least squares regression adjust- of λ ∈ {10−3, 10−2, 10−1} that minimizes the leave- ment with no dimension reduction (with an overall one-out error of θ [equation (11)]. This approach mean relative RSSE of −17%) is the ε-sufficiency considerably improves the performance of the net- technique (−8%) and partial least squares (−12%). work (right-hand side RSSE values in Table 2) with Performing better, on average, than standard least the mean relative RSSE improving to the same level squares regression is ridge regression and neural net- as for heteroscedastic regression. Adopting the same works (−19%) and AIC/BIC (−21%). In this study, DIMENSION REDUCTION METHODS IN ABC 17

Table 3 Summary of the main features of the different methods of dimension reduction for ABC

Class Method Hyper-parameter Choice of hyper-parameter Computational burden

Best subset selection AIC/BIC None – Substantial/greedy alg. ε-suff T (θ) User choice Substantial/greedy alg. Ent None – Substantial/greedy alg.

Projection techniques PLS Number of PLS components, k Cross-validation Weak NNet Regularization parameter, λ Integration or cross-validation Moderate (optimization algorithm) Loss Choice of basis functions BIC Weak (closed-form solution)

Regularization Ridge Regularization parameter, λ Integration or cross-validation Weak (closed-form solution)

the top performers, on average, were the entropy- ther gains in performance can then be obtained by based procedure and the minimum expected pos- combining regression adjustment with dimension re- terior loss approach, with an overall mean relative duction procedures, although in some cases (such RSSE of −25%. It is worth emphasizing that the as with the ε-sufficiency technique and partial least potential gains in performing a regression adjust- squares) performance can sometimes worsen. ment alone (with all summary statistics and no di- While being ranked in the top three, a clear dis- mension reduction) can be quite substantial. This advantage of the entropy-based procedure and the suggests that regression adjustment should be an AIC/BIC criteria is the quantity of computation re- integral part of the majority of ABC analyses. Fur- quired. This primarily occurs as the best subset se-

Fig. 3. Mean relative RSSE values using each method of dimension reduction and for each example. Methods of dimension reduction include no dimension reduction (All), AIC/BIC, the ε-sufficiency criterion (ε-suff), the two-stage entropy procedure (Ent), partial least squares (PLS), neural networks (NNet), minimum expected posterior loss (Loss) and ridge regression (Ridge). For examples 1 and 2, the results for ridge regression and neural networks estimate m(s) and σ2(s) have been obtained by taking the pointwise median curve over varying values of the regularization parameter; λ = 10−3, 10−2 and 10−1 (see introduction to Section 4). For example 3, an optimal value of λ was chosen based on a cross-validation procedure (see Section 4.3). 18 BLUM, NUNES, PRANGLE AND SISSON lection procedures require evaluation of all 2p po- the third, where a cross-validation procedure to se- tential models. For examples 1 and 2, a greedy al- lect a single best parameter value produced much gorithm was able to find the optimum solution in improved results. This problem can be particularly a reasonable time. This was not possible for exam- critical for neural networks with large numbers of ple 3. Additionally, in this latter case, for the subsets summary statistics, p, as the number of network of summary statistics considered, the performance weights is much larger than p, and, accordingly, mas- obtained by implementing computationally expen- sive shrinkage of the weights (i.e., large values of λ) sive methods of dimension reduction was barely an is required to avoid overfitting. improvement over the computationally cheap, least The posterior loss approach produced the supe- squares regression adjustment. This raises the im- rior performance in the third example. In general, a portant point that the benefits of performing poten- strong performance of this method can be primarily tially expensive forms of dimension reduction over, attributed to two factors. First, in the presence of say, the simple linear regression adjustment, should large numbers of highly dependent summary statis- be evaluated prior to their full implementation. We tics, the extra analysis stage in determining the most also note that the second stage of the entropy-based appropriate regression model (14) by choosing f(s) method (Section 2.2.2) targets minimization of (9), through, for example, BIC diagnostics, affords the the same error measure used in our comparative opportunity to reduce the complexity of the regres- analysis. As such, this approach is likely to be nu- sion in a simple and relatively low-parameterized merically favored in our results. manner. This was not a primary contributor in ex- The top ranked (ex aequo) minimum expected ample 3, however, as the regression [equation (14)] posterior loss approach particularly outperforms was directly performed on the full set of 113 statis- other dimension reduction methods in the final ex- tics. Given the benefits of using regularization meth- ample (the production of clean steels). In such anal- ods in this setting, it is possible that a ridge re- yses, with large numbers of summary statistics (here gression model would allow a more robust estimate p = 113), nonlinear methods such as neural networks of the posterior mean (as a summary statistic) as may become overparametrized, and simpler alterna- part of this process. Second, the posterior loss ap- tives, such as least squares or ridge regression ad- proach determines the number of summary statis- justment, can work more effectively. This is natu- tics to be equal to the number of posterior quan- rally explained through the usual bias-variance trade- tities of interest—in this case, q = 3 posterior pa- off: more complex regression models such as neural rameter means. This small number of derived sum- networks reduce the bias of the estimate of m(s) mary statistics naturally allows more precise poste- [and optionally σ2(s)], but in doing so the variance rior statements to be made, compared to dimension of the estimate is increased. This effect can be espe- reduction methods that produce a much larger num- cially acute for high-dimensional regression (Geman, ber of equally informative statistics. Of course, the Bienenstock and Doursat (1992)). dimension advantage here is strongly related to the Our analyses indicate that the original least number of parameters (q = 3) and summary statis- squares, linear regression adjustment (Beaumont, tics (p = 113) in this example. However, it is not Zhang and Balding (2002)) can sometimes perform fully clear how any current methods of dimension quite well, despite its simplicity. However, the pres- reduction for ABC would perform for substantially ence of multicolinearity between the summary statis- more challenging analyses with considerably higher tics can cause severe performance degradation, com- numbers of parameters and summary statistics. This pared to not performing the regression adjustment is because the curse of dimensionality in ABC (Blum (see Figure 2). In such situations, regularization pro- (2010a)) has tended to restrict existing applications cedures, such as ridge regression (e.g., example 2 and of ABC methods to problems of moderate parameter Figure 2) and projection techniques, can be benefi- dimension, although this may change in the future. cial. What is very apparent from this study is that However, an important issue with regularization there is no single “best” method of dimension re- procedures, such as neural networks and ridge re- duction for ABC. For example, while the posterior gression, is the handling of the regularization param- loss and entropy-based methods were the best per- eter, λ. The “averaging” procedure that was used in formers for example 3, AIC and BIC were ranked the first two examples proved quite suboptimal in first in the analysis of example 2, and partial least DIMENSION REDUCTION METHODS IN ABC 19 squares outperformed the posterior loss approach in SUPPLEMENTARY MATERIAL example 1. A number of factors can affect the most Supplement to “A Comparative Review of Di- suitable choice for any given analysis. As discussed mension Reduction Methods in Approximate Bayes- above, these can include the number of initial sum- ian Computation” (DOI: 10.1214/12-STS406SUPP; mary statistics, the amount of dependence and mul- .pdf). The supplement contains for each of the three ticolinearity within the statistics, the computational overheads of the dimension reduction method, the examples a comprehensive comparison of the errors requirement to suitably determine hyperparameters obtained with the different methods of dimension and sensitivity to potentially large numbers of un- reduction. informative statistics. One important point to understand is that all of REFERENCES the ABC analyses of this review were performed us- Abdi, H. and Williams, L. J. (2010). Partial least square ing the rejection algorithm optionally followed by regression, projection on latent structure regression. Wiley some form of regression adjustment. While alterna- Interdiscip. Rev. Comput. Stat. 2 433–459. tive, potentially more efficient and accurate methods Aeschbacher, S., Beaumont, M. A. and Futschik, A. of ABC posterior simulation exist, such as Markov (2012). A novel approach for choosing summary statistics chain Monte Carlo or sequential Monte Carlo based in approximate Bayesian computation. Genetics 192 1027– samplers, the computational cost of separately im- 1047. p Akaike, H. (1974). A new look at the statistical model iden- plementing such an algorithm 2 times (in the case tification. IEEE Trans. Automat. Control AC-19 716–723. of best subset selection methods) means that such MR0423716 dimension reduction methods can become rapidly Allingham, D., King, R. A. R. and Mengersen, K. L. untenable, even for small p. The price of the benefit (2009). Bayesian estimation of quantile distributions. Stat. of using the more computationally practical, fixed Comput. 19 189–201. MR2486231 large number of samples is that decisions on the di- Baddeley, A. and Jensen, E. B. V. (2004). Stereology for mension reduction of the summary statistics will be Statisticians. Chapman & Hall/CRC, Boca Raton, FL. Barnes, C., Filippi, S., Stumpf, M. P. H. and Thorne, T. made on potentially worse estimates of the posterior (2012). Considerate approaches to constructing summary than those available under superior sampling algo- statistics for ABC model selection. Stat. Comput. 22 1181– rithms. As such, the final derived summary statistics 1197. MR2992293 may in fact not be those which are most appropri- Barthelme,´ S. and Chopin, N. (2011). Expectation- ate for subsequent use in, for example, Markov chain propagation for summary-less, likelihood-free inference. Monte Carlo or sequential Monte Carlo based algo- Available at http://arxiv.org/abs/1107.5959. Beaumont, M. A. (2010). Approximate Bayesian computa- rithms. tion in evolution and ecology. Annual Review of Ecology, However, this price is arguably a necessity. It is Evolution, and Systematics 41 379–406. practically important to evaluate the performance of Beaumont, M. A., Zhang, W. and Balding, D. J. (2002). any dimension reduction procedure in a given anal- Approximate Bayesian computation in population genet- ysis. Here we used a criterion [the RSSE of equa- ics. Genetics 162 2025–2035. tion (9)] that is based on a leave-one-out procedure. Beaumont, M. A., Marin, J. M., Cornuet, J. M. and When using a fixed, large number of samples, eval- Robert, C. P. (2009). Adaptivity for ABC algorithms: The ABC–PMC scheme. Biometrika 96 983–990. uation of such a performance diagnostic is entirely Bertorelle, G., Benazzo, A. and Mona, S. (2010). ABC practicable, as no further model simulations are re- as a flexible framework to estimate demography over space quired. This idea is also relevant to methods of di- and time: Some cons, many pros. Mol. Ecol. 19 2609–2625. mension reduction for model selection (Barnes et al. Blum, M. G. B. (2010a). Approximate Bayesian computa- (2012); Estoup et al. (2012)) where a misclassifica- tion: A nonparametric perspective. J. Amer. Statist. Assoc. tion rate based on a leave-one-out procedure can 105 1178–1187. MR2752613 Blum, M. G. B. (2010b). Choosing the summary statistics serve as a comparative criterion. and the acceptance rate in approximate Bayesian computa- tion. In COMPSTAT 2010: Proceedings in Computational ACKNOWLEDGMENTS Statistics (G. Saporta and Y. Lechevallier, eds.) 47– S. A. Sisson is supported by the Australian Re- 56. Springer, New York. Blum, M. G. B. Franc¸ois, O. search Council through the Discovery Project Scheme and (2010). Non-linear regres- sion models for approximate Bayesian computation. Stat. (DP1092805). M. G. B. Blum is supported by the Comput. 20 63–73. MR2578077 French National Research Agency (DATGEN project, Blum, M. G. B., Nunes, M. A., Prangle, D. and Sis- ANR-2010-JCJC-1607-01). son, S. A. (2013). Supplement to “A Comparative Review 20 BLUM, NUNES, PRANGLE AND SISSON

of Dimension Reduction Methods in Approximate Bayesian Hoerl, A. E. and Kennard, R. W. (1970). Ridge regression: Computation.” DOI:10.1214/12-STS406SUPP. Biased estimation for nonorthogonal problems. Technomet- Bonassi, F. V., You, L. and West, M. (2011). Bayesian rics 12 55–67. learning from marginal data in bionetwork models. Stat. Hudson, R. R. (2002). Generating samples under a Wright– Appl. Genet. Mol. Biol. 10 Art. 49, 29. MR2851291 Fisher neutral model of genetic variation. Bioinformatics Bortot, P., Coles, S. G. and Sisson, S. A. (2007). In- 18 337–338. ference for stereological extremes. J. Amer. Statist. Assoc. Hurvich, C. M., Simonoff, J. S. and Tsai, C.-L. (1998). 102 84–92. MR2345549 Smoothing parameter selection in nonparametric regres- Boulesteix, A.-L. and Strimmer, K. (2007). Partial sion using an improved Akaike information criterion. J. R. least squares: A versatile tool for the analysis of high- Stat. Soc. Ser. B Stat. Methodol. 60 271–293. MR1616041 dimensional genomic data. Brief. Bioinformatics 8 32–44. Hurvich, C. M. and Tsai, C.-L. (1989). Regression and time Coles, S. (2001). An Introduction to Statistical Modeling of series model selection in small samples. Biometrika 76 297– Extreme Values. Springer, London. MR1932132 307. MR1016020 Csillery,´ K., Franc¸ois, O. and Blum, M. G. B. (2012). Irizarry, R. A. (2001). Information and posterior probabil- abc: An R package for approximate Bayesian computation. ity criteria for model selection in local likelihood estima- Methods in Ecology and Evolution 3 475–479. tion. J. Amer. Statist. Assoc. 96 303–315. MR1952740 Csillery,´ K., Blum, M. G. B., Gaggiotti, O. and Jasra, A., Singh, S. S., Martin, J. S. and McCoy, E. Franc¸ois, O. (2010). Approximate Bayesian computa- (2012). Filtering via approximate Bayesian computation. tion in practice. Trends in Ecology and Evolution 25 410– Statist. Comput. 22 1223–1237. 418. Jeremiah, E., Sisson, S. A., Marshall, L., Mehrotra, R. Del Moral, P., Doucet, A. and Jasra, A. (2012). An and Sharma, A. (2011). Bayesian calibration and uncer- adaptive sequential Monte Carlo method for approximate tainty analysis for hydrological models: A comparison of Bayesian computation. Stat. Comput. 22 1009–1020. adaptive-Metropolis and sequential Monte Carlo samplers. Drovandi, C. C. and Pettitt, A. N. (2011). Estimation of Water Resources Research 47 W07547, 13pp. parameters for macroparasite population evolution using Joyce, P. and Marjoram, P. (2008). Approximately suf- approximate Bayesian computation. Biometrics 67 225– ficient statistics and Bayesian computation. Stat. Appl. 233. MR2898834 Genet. Mol. Biol. 7 Art. 26, 18. MR2438407 Drovandi, C. C., Pettitt, A. N. and Faddy, M. J. (2011). Jung, H. and Marjoram, P. (2011). Choice of summary Approximate Bayesian computation using indirect infer- statistic weights in approximate Bayesian computation. ence. J. R. Stat. Soc. Ser. C. Appl. Stat. 60 317–337. Stat. Appl. Genet. Mol. Biol. 10 Art. 45, 25. MR2851287 MR2767849 Konishi, S., Ando, T. and Imoto, S. (2004). Bayesian infor- Estoup, A., Lombaert, E., Marin, J. M., Guille- mation criteria and smoothing parameter selection in radial maud, T., Pudlo, P., Robert, C. and Cornuet, J. M. basis function networks. Biometrika 91 27–43. MR2050458 (2012). Estimation of demo-genetic model probabilities Leuenberger, C. and Wegmann, D. (2010). Bayesian com- with approximate Bayesian computation using linear dis- putation and model selection without likelihoods. Genetics criminant analysis on summary statistics. Molecular Ecol- 184 243–252. ogy Resources 12 846–855. Lopes, J. S. and Beaumont, M. A. (2010). ABC: A useful Fan, Y., Nott, D. J. and Sisson, S. A. (2012). Regression Bayesian tool for the analysis of population data. Infect. density estimation ABC. Unpublished manuscript. Genet. Evol. 10 826–833. Fearnhead, P. and Prangle, D. (2012). Constructing sum- Luciani, F., Sisson, S. A., Jiang, H., Francis, A. R. and mary statistics for approximate Bayesian computation: Tanaka, M. M. (2009). The epidemiological fitness cost of Semi-automatic approximate Bayesian computation (with drug resistance in Mycobacterium tuberculosis. Proc. Natl. discussion). J. R. Stat. Soc. Ser. B Stat. Methodol. 74 419– Acad. Sci. USA 106 14711–14715. 474. MR2925370 Marjoram, P., Molitor, J., Plagnol, V. and Tavare, S. Filippi, S., Barnes, C. P. and Stumpf, M. P. H. (2012). (2003). Markov chain Monte Carlo without likelihoods. Contribution to the discussion of Fearnhead and Pran- Proc. Natl. Acad. Sci. USA 100 15324–15328. gle (2012). Constructing summary statistics for approxi- Mevik, B.-H. and Cederkvist, H. R. (2004). Mean squared mate Bayesian computation: Semi-automatic approximate error of prediction (MSEP) estimates for principal compo- Bayesian computation. J. R. Stat. Soc. Ser. B Stat. nent regression (PCR) and partial least squares regression Methodol. 74 459–460. (PLSR). Journal of Chemometrics 18 422–429. Geman, S., Bienenstock, E. and Doursat, R. (1992). Neu- Mevik, B.-H. and Wehrens, R. (2007). The pls package: ral networks and the bias/variance dilemma. Neural Com- Principal component and partial least squares regression put. 4 1–58. in R. Journal of Statistical Software 18 1–24. Golub, G. H., Heath, M. and Wahba, G. (1979). General- Minka, T. (2001). Expectation propagation for approximate ized cross-validation as a method for choosing a good ridge . Proceedings of Uncertainty in Artificial parameter. Technometrics 21 215–223. MR0533250 Intelligence 17 362–369. Heggland, K. and Frigessi, A. (2004). Estimating func- Nakagome, S., Fukumizu, K. and Mano, S. (2012). Kernel tions in indirect inference. J. R. Stat. Soc. Ser. B Stat. approximate Bayesian computation for population genetic Methodol. 66 447–462. MR2062387 inferences. Available at http://arxiv.org/abs/1205.3246. DIMENSION REDUCTION METHODS IN ABC 21

Nix, D. A. and Weigend, A. S. (1995). Learning local er- Sedki, M. A. and Pudlo, P. (2012). Contribution to the ror bars for nonlinear regression. In Advances in Neural discussion of Fearnhead and Prangle (2012). Constructing Information Processing Systems 7 (NIPS‘94) (G. Tesauo, summary statistics for approximate Bayesian computation: D. Touretzky and T. Leen, eds.) 489–496. MIT Press, Semi-automatic approximate Bayesian computation. J. R. Cambridge. Stat. Soc. Ser. B Stat. Methodol. 74 466–467. Nordborg, M. (2007). Coalescent theory. In Handbook of Shannon, C. E. (1948). A mathematical theory of com- Statistical Genetics, 3rd ed. (D. J. Balding, M. J. Bishop munication. Bell System Tech. J. 27 379–423, 623–656. and C. Cannings, eds.) 179–208. Wiley, Chichester. MR0026286 Nott, D. J., Fan, Y. and Sisson, S. A. (2012). Contri- Singh, H., Misra, N., Hnizdo, V., Fedorowicz, A. and bution to the discussion of Fearnhead and Prangle (2012). Demchuk, E. (2003). Nearest neighbor estimates of en- Constructing summary statistics for approximate Bayesian tropy. Amer. J. Math. Management Sci. 23 301–321. computation: Semi-automatic approximate Bayesian com- MR2045530 putation. J. R. Stat. Soc. Ser. B Stat. Methodol. 74 466. Sisson, S. A., Fan, Y. and Tanaka, M. M. (2007). Sequen- Nott, D. J., Fan, Y., Marshall, L. and Sisson, S. A. tial Monte Carlo without likelihoods. Proc. Natl. Acad. Sci. (2013). Approximate Bayesian computation and Bayes USA 104 1760–1765 (electronic). MR2301870 linear analysis: Towards high-dimensional approximate Sisson, S. A. and Fan, Y. (2011). Likelihood-free Markov Bayesian computation. J. Comput. Graph. Statist. To ap- chain Monte Carlo. In Handbook of Markov Chain Monte pear. Carlo (S. P. Brooks, A. Gelman, G. Jones and Nunes, M. A. and Balding, D. J. (2010). On optimal se- X. L. Meng, eds.) 319–341. CRC Press, Boca Raton, lection of summary statistics for approximate Bayesian FL. computation. Stat. Appl. Genet. Mol. Biol. 9 Art. 34, 16. Taniguchi, M. and Tresp, V. (1997). Averaging regularized MR2721714 estimators. Neural Comput. 9 1163–1178. Peters, G. W., Fan, Y. and Sisson, S. A. (2012). On se- Toni, T., Welch, D., Strelkowa, N., Ipsen, A. and quential Monte Carlo, partial rejection control and approx- Stumpf, M. P. (2009). Approximate Bayesian computa- imate Bayesian computation. Stat. Comput. 22 1209–1222. tion scheme for parameter inference and model selection in MR2992295 dynamical systems. Journal of the Royal Society Interface Pritchard, J. K., Seielstad, M. T., Perez-Lezaun, A. 6 187–202. and Feldman, M. W. (1999). Population growth of human Vinzi, V. E., Chin, W. W., Henseler, J. and Wang, H., Y chromosomes: A study of Y chromosome microsatellites. eds. (2010). Handbook of Partial Least Squares: Con- Mol. Biol. Evol. 16 1791–1798. cepts, Methods and Applications. Springer, Heidelberg. Ripley, B. D. (1994). Neural networks and related methods MR2742562 for classification. J. R. Stat. Soc. Ser. B Stat. Methodol. 56 Wegmann, D., Leuenberger, C. and Excoffier, L. 409–456. MR1278218 (2009). Efficient approximate Bayesian computation cou- Schwarz, G. (1978). Estimating the dimension of a model. pled with Markov chain Monte Carlo without likelihood. Ann. Statist. 6 461–464. MR0468014 Genetics 182 1207–1218.