Deviance Information Criterion for Model Selection: A Theoretical Justification*
Yong Li (Renmin University of China), Jun Yu (Singapore Management University), Tao Zeng (Zhejiang University)
September 7, 2021
Abstract

Deviance information criterion (DIC) has been extensively used for model selection based on MCMC output. No rigorous justification has been provided for DIC in the literature. This paper shows that, when a plug-in predictive distribution is used and under a set of regularity conditions, DIC is an asymptotically unbiased estimator of the expected Kullback-Leibler divergence between the data generating process and the plug-in predictive distribution. High-order expansions to DIC and the effective number of parameters are developed, facilitating investigation of the effect of the prior.
Keywords: AIC; DIC; Expected loss function; Kullback-Leibler divergence; Model comparison; Plug-in predictive distribution
1 Introduction
A highly important statistical inference problem often faced by model builders and empirical researchers is model selection. Many penalty-based information criteria have been proposed to select from a set of candidate models. In the frequentist statistical framework, perhaps the most popular information criterion is AIC. Arguably, one of the most important developments for model selection in the Bayesian literature in the last twenty
*We wish to thank Eric Renault, Peter Phillips and David Spiegelhalter for their helpful comments. Yong Li, School of Economics, Renmin University of China, Beijing, China. Jun Yu, School of Economics and Lee Kong Chian School of Business, Singapore Management University, Singapore. Tao Zeng, School of Economics and Academy of Financial Research, Zhejiang University, Zhejiang, China.
years is the deviance information criterion (DIC) of Spiegelhalter et al. (2002).¹ DIC is understood as a Bayesian version of AIC. Like AIC, it trades off a measure of model adequacy against a measure of complexity and is concerned with how hypothetical replicate data predict the observed data. However, unlike AIC, DIC takes prior information into account.

DIC is constructed based on the posterior mean of the log-likelihood or the deviance and has several desirable features. Firstly, DIC is easy to calculate when the likelihood function is available in closed form and the posterior distributions of the models are obtained by Markov chain Monte Carlo (MCMC) simulation. Secondly, it applies to a wide range of statistical models. Thirdly, unlike Bayes factors (BF), it is not subject to the Jeffreys-Lindley-Bartlett paradox and can be calculated when vague or even improper priors are used.

However, as acknowledged in Spiegelhalter et al. (2002, 2014), the decision-theoretic justification of DIC is not rigorous in the literature. In fact, in the heuristic justification given by Spiegelhalter et al. (2002), the frequentist framework and the Bayesian framework were mixed. The first contribution of the present paper is to provide a rigorous decision-theoretic justification of DIC purely in a frequentist setup. It can be shown that DIC is an asymptotically unbiased estimator of the expected Kullback-Leibler (KL) divergence between the data generating process (DGP) and the plug-in predictive distribution when the posterior mean is used. This justification is similar to how AIC has been justified. The second contribution of the present paper is to develop high-order expansions to DIC and the effective number of parameters that allow us to easily see the effect of the prior on DIC and the effective number of parameters.

The paper is organized as follows. Section 2 explains how to treat model selection as a decision problem and gives a brief review of the decision-theoretic justification of AIC that helps explain our theory for DIC. Section 3 provides a rigorous decision-theoretic justification of DIC of Spiegelhalter et al. (2002) under a set of regularity conditions and shows why DIC can be interpreted as the Bayesian version of AIC. In Section 4, we give two examples to illustrate the effect of the prior distribution on DIC in finite samples. Section 5 concludes the paper. The online appendix collects the proofs of the theoretical results in the paper.

Throughout the paper, we use $:=$, tr, vec, $\otimes$, $o(1)$, $o_p(1)$, $O_p(1)$, $\overset{p}{\to}$ to denote definitional equality, trace, the vec operator that converts a matrix into a column vector, the Kronecker product, tending to zero, tending to zero in probability, boundedness in probability, and convergence in probability, respectively. Moreover, we use $\bar{\theta}_n$, $\hat{\theta}_n$, $\theta_n^p$ to denote the posterior mean, the quasi maximum likelihood (QML) estimator, and the pseudo true parameter value of $\theta$, respectively.

¹According to Spiegelhalter et al. (2014), Spiegelhalter et al. (2002) was the third most cited paper in international mathematical sciences between 1998 and 2008. Up to September 2021, it has received 12675 citations on Google Scholar.
2 Decision-theoretic Justification of AIC
There are essentially two strands of literature on model selection.² The first strand aims to answer the following question: which model best explains the observed data? The BF (Kass and Raftery, 1995) and its variations belong to this strand. They compare models by examining "posterior probabilities" given the observed data and search for the "true" model. BIC is a large-sample approximation to BF, although it is based on the maximum likelihood estimator. The second strand aims to answer the following question: which model gives the best predictions of future observations generated by the same mechanism that gives the observed data? Clearly, this is a utility-based approach where the utility is set to prediction. Ideally, we would like to choose the model that gives the best overall predictions of future values. Some cross-validation-based criteria have been developed where the original sample is split into a training set and a validation set (Vehtari and Lampinen, 2002; Zhang and Yang, 2015). Unfortunately, different ways of sample splitting often lead to different outcomes. Alternatively, based on hypothetical replicate data generated by the same mechanism that gives the observed data, some predictive information criteria have been proposed for model selection. They minimize a loss function associated with the predictive decisions. AIC and DIC are two well-known criteria in this framework. After the decision is made about which model should be used for prediction and how predictions should be made, a unique prediction action for future values can be obtained to fulfill the original goal. The latter approach is what we follow in the present paper. Given the relevance of prediction in practice, not surprisingly, such an approach to model selection has been widely used in applications.

²For more information about the literature, see Vehtari and Ojanen (2012) and Burnham and Anderson (2002).
2.1 Predictive model selection as a decision problem
Assume that the probabilistic behavior of the observed data, $y = (y_1, y_2, \cdots, y_n)' \in \mathbf{Y}$, is described by a set of probabilistic models $\{M_k\}_{k=1}^{K} = \{p(y|\theta_k, M_k)\}_{k=1}^{K}$, where $n$ is the sample size, $\theta_k$ (without confusion, we simply write it as $\theta$) is the set of parameters in candidate model $M_k$, and $p(\cdot)$ is a probability density function (pdf). Formally, the model selection problem can be taken as a decision problem of selecting a model among $\{M_k\}_{k=1}^{K}$, where the action space has $K$ elements, namely, $\{d_k\}_{k=1}^{K}$, where $d_k$ means $M_k$ is selected.
For the decision problem, a loss function, $\ell(y, d_k)$, which measures the loss of decision $d_k$ as a function of $y$, must be specified. Given the loss function, the expected loss (or risk) can be defined as (Berger, 1985)
\[
\mathrm{Risk}(d_k) = E_y\left[\ell(y, d_k)\right] = \int \ell(y, d_k)\, g(y)\, dy,
\]
where $g(y)$ is the pdf of the DGP of $y$. Hence, the model selection problem is equivalent to optimizing the statistical decision,
\[
k^* = \arg\min_k \mathrm{Risk}(d_k).
\]
Based on the set of candidate models $\{M_k\}_{k=1}^{K}$, the model $M_{k^*}$ with the decision $d_{k^*}$ is selected.

Let $y_{rep} = (y_{1,rep}, \cdots, y_{n,rep})'$ be the hypothetical replicate data, independently generated by the same mechanism that gives $y$. Assume the sample size in $y_{rep}$ is the same as that in $y$ (i.e., $n$). Consider the predictive density of this replicate experiment for a candidate model $M_k$. The plug-in predictive density for $M_k$ can be expressed as $p\left(y_{rep}|\hat{\theta}_n(y), M_k\right)$, where $\hat{\theta}_n(y)$ is the QML estimate of $\theta$ based on $y$ (when there is no confusion, we simply write $\hat{\theta}_n(y)$ as $\hat{\theta}_n$ or even $\hat{\theta}$) and is defined by
\[
\hat{\theta}_n(y) = \arg\max_{\theta \in \Theta} \ln p\left(y|\theta, M_k\right),
\]
which is the global maximum interior to $\Theta$. The quantity that has been used to measure the quality of the candidate model, in terms of its ability to make predictions, is the KL divergence between $g(y_{rep})$ and $p\left(y_{rep}|\hat{\theta}_n(y), M_k\right)$, multiplied by 2,
\[
2 \times KL\left[g(y_{rep}), p\left(y_{rep}|\hat{\theta}_n(y), M_k\right)\right]
= 2\int \ln\left[\frac{g(y_{rep})}{p\left(y_{rep}|\hat{\theta}_n(y), M_k\right)}\right] g(y_{rep})\, dy_{rep}.
\]
Naturally, the loss function associated with decision $d_k$ is
\[
\ell(y, d_k) = 2 \times KL\left[g(y_{rep}), p\left(y_{rep}|\hat{\theta}_n(y), M_k\right)\right]
= 2\int \ln\left[\frac{g(y_{rep})}{p\left(y_{rep}|\hat{\theta}_n(y), M_k\right)}\right] g(y_{rep})\, dy_{rep}.
\]
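When the DGP is known, as in a simulation study, this KL-based loss can be approximated numerically. The following Python sketch is our illustrative addition (not from the paper): the normal-mean DGP, candidate model, sample sizes, and seed are all assumptions made purely for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 100

# Assumed toy setting (for illustration only):
# DGP g: y_i ~ N(0, 1); candidate model M_k: y_i ~ N(theta, 1) with theta unknown.
y = rng.normal(loc=0.0, scale=1.0, size=n)
theta_hat = y.mean()  # QML/ML estimate of theta based on the observed data y

# Monte Carlo approximation of the loss
#   l(y, d_k) = 2 * KL[g(y_rep), p(y_rep | theta_hat)]
#             = 2 * E_{y_rep ~ g}[ln g(y_rep) - ln p(y_rep | theta_hat)].
n_rep = 5000
draws = np.empty(n_rep)
for r in range(n_rep):
    y_rep = rng.normal(loc=0.0, scale=1.0, size=n)
    log_g = stats.norm.logpdf(y_rep, loc=0.0, scale=1.0).sum()
    log_p = stats.norm.logpdf(y_rep, loc=theta_hat, scale=1.0).sum()
    draws[r] = 2.0 * (log_g - log_p)

print("Monte Carlo estimate of the loss:", draws.mean())
# For this toy model the loss is available in closed form: n * theta_hat**2,
# which can be used to check the Monte Carlo approximation.
print("Closed-form loss for this model :", n * theta_hat**2)
```

In this assumed setting the Monte Carlo estimate should be close to the closed-form value $n\hat{\theta}^2$, which provides a simple sanity check of the construction.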
As a result, the model selection problem is,
\[
k^* = \arg\min_k \mathrm{Risk}(d_k) = \arg\min_k E_y\left[\ell(y, d_k)\right]
= \arg\min_k \left\{ E_y E_{y_{rep}}\left[2\ln g\left(y_{rep}\right)\right] + E_y E_{y_{rep}}\left[-2\ln p\left(y_{rep}|\hat{\theta}_n(y), M_k\right)\right] \right\}.
\]
Since $g(y_{rep})$ is the DGP, $E_{y_{rep}}\left[2\ln g(y_{rep})\right]$ is the same across all candidate models and, hence, is dropped from the above equation. Consequently,
\[
k^* = \arg\min_k \mathrm{Risk}(d_k) = \arg\min_k E_y E_{y_{rep}}\left[-2\ln p\left(y_{rep}|\hat{\theta}_n(y), M_k\right)\right].
\]
The smaller $\mathrm{Risk}(d_k)$ is, the better the candidate model performs when $p\left(y_{rep}|\hat{\theta}_n(y), M_k\right)$ is used to predict $g(y_{rep})$. The optimal decision makes it necessary to evaluate the risk. AIC is an asymptotically unbiased estimator of $E_y E_{y_{rep}}\left[-2\ln p\left(y_{rep}|\hat{\theta}_n(y), M_k\right)\right]$.
2.2 AIC for predictive model selection
To show the asymptotic unbiasedness of AIC, let us first fix some notation. When there is no confusion, we simply write the candidate model $p(y|\theta, M_k)$ as $p(y|\theta)$, where $\theta \in \Theta \subseteq \mathbb{R}^P$ (i.e., the dimension of $\theta$ is $P$). Although $-2\ln p\left(y|\hat{\theta}_n(y)\right)$ is a natural estimate of $E_{y_{rep}}\left[-2\ln p\left(y_{rep}|\hat{\theta}_n(y)\right)\right]$, it is asymptotically biased. Let
\[
c(y) = E_{y_{rep}}\left[-2\ln p\left(y_{rep}|\hat{\theta}_n(y)\right)\right] - \left[-2\ln p\left(y|\hat{\theta}_n(y)\right)\right] \qquad (1)
\]
be the "optimism". Under a set of regularity conditions, one can show that $E_y\left(c(y)\right) \to 2P$ as $n \to \infty$. Hence, if we let $AIC = -2\ln p\left(y|\hat{\theta}_n(y)\right) + 2P$, then, as $n \to \infty$,
\[
E_y(AIC) - E_y E_{y_{rep}}\left[-2\ln p\left(y_{rep}|\hat{\theta}_n(y)\right)\right] \to 0.
\]
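To make the bias correction concrete, the following Python sketch (an illustrative simulation we add here, not part of the paper) compares the average of $AIC$ with a Monte Carlo estimate of $E_y E_{y_{rep}}\left[-2\ln p\left(y_{rep}|\hat{\theta}_n(y)\right)\right]$ in a toy normal model with unknown mean and variance, so that $P = 2$; the DGP and all simulation settings are assumptions for illustration only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, P = 200, 2            # model: y_i ~ N(mu, sigma^2); theta = (mu, sigma^2), so P = 2
n_sim, n_rep = 300, 200  # outer draws of y and inner draws of y_rep (illustrative sizes)

aic_vals, risk_vals = [], []
for s in range(n_sim):
    # Assumed DGP (illustration only): y_i ~ N(0, 1), so the model is correctly specified.
    y = rng.normal(loc=0.0, scale=1.0, size=n)
    mu_hat, sig_hat = y.mean(), y.std()  # ML estimates (np.std uses 1/n by default)

    loglik_y = stats.norm.logpdf(y, loc=mu_hat, scale=sig_hat).sum()
    aic_vals.append(-2.0 * loglik_y + 2.0 * P)

    # Monte Carlo estimate of E_{y_rep}[-2 ln p(y_rep | theta_hat(y))] for this y
    inner = np.empty(n_rep)
    for r in range(n_rep):
        y_rep = rng.normal(loc=0.0, scale=1.0, size=n)
        inner[r] = -2.0 * stats.norm.logpdf(y_rep, loc=mu_hat, scale=sig_hat).sum()
    risk_vals.append(inner.mean())

avg_aic = np.mean(aic_vals)
avg_risk = np.mean(risk_vals)
avg_fit = np.mean(np.array(aic_vals) - 2.0 * P)   # average of -2 ln p(y | theta_hat(y))

print("average AIC                   :", avg_aic)
print("Monte Carlo plug-in risk      :", avg_risk)
print("average optimism (risk - fit) :", avg_risk - avg_fit)
```

With these illustrative settings, the average AIC and the Monte Carlo risk estimate should be close, and the average optimism should be near $2P = 4$, which is the content of the asymptotic claim above.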
To see why a penalty term, $2P$, is needed in AIC, let
\[
\theta_n^p = \arg\min_{\theta \in \Theta} \frac{1}{n} KL\left[g(y), p(y|\theta)\right] \qquad (2)
\]
be the pseudo true parameter value; let $\hat{\theta}_n(y_{rep})$ be the QML estimate of $\theta$ obtained from $y_{rep}$; and let $p(y|\theta)$ be a "good approximation" to the DGP, a concept that will be defined formally in the next section. Note that
\[
E_y E_{y_{rep}}\left[-2\ln p\left(y_{rep}|\hat{\theta}_n(y)\right)\right]
\]
\[
= E_y E_{y_{rep}}\left[-2\ln p\left(y_{rep}|\hat{\theta}_n(y_{rep})\right)\right] \qquad (T1)
\]
\[
+ \left\{ E_y E_{y_{rep}}\left[-2\ln p\left(y_{rep}|\theta_n^p\right)\right] - E_y E_{y_{rep}}\left[-2\ln p\left(y_{rep}|\hat{\theta}_n(y_{rep})\right)\right] \right\} \qquad (T2)
\]
\[
+ \left\{ E_y E_{y_{rep}}\left[-2\ln p\left(y_{rep}|\hat{\theta}_n(y)\right)\right] - E_y E_{y_{rep}}\left[-2\ln p\left(y_{rep}|\theta_n^p\right)\right] \right\}. \qquad (T3)
\]
Clearly, term $T1$ is the same as $E_y\left[-2\ln p\left(y|\hat{\theta}_n(y)\right)\right]$. Term $T2$ is the expectation of the likelihood ratio statistic based on $y_{rep}$. Under a set of regularity conditions that ensure $\sqrt{n}$-consistency and the asymptotic normality of the QML estimate, we have $T2 = T3 + o(1)$. To approximate term $T3$, if $\hat{\theta}_n(y) \overset{p}{\to} \theta_n^p$, we have
\[
T3 = E_y\left\{-2 E_{y_{rep}}\left[\frac{\partial \ln p\left(y_{rep}|\theta_n^p\right)}{\partial \theta'}\right]\left(\hat{\theta}_n(y) - \theta_n^p\right)\right\}
+ E_y E_{y_{rep}}\left[\left(\hat{\theta}_n(y) - \theta_n^p\right)'\left(-\frac{\partial^2 \ln p\left(y_{rep}|\theta_n^p\right)}{\partial \theta \partial \theta'}\right)\left(\hat{\theta}_n(y) - \theta_n^p\right)\right] + o(1).
\]
By the definition of $\theta_n^p$, $E_{y_{rep}}\left[\partial \ln p\left(y_{rep}|\theta_n^p\right)/\partial\theta\right] = 0$, implying that
\[
E_y E_{y_{rep}}\left[\frac{\partial \ln p\left(y_{rep}|\theta_n^p\right)}{\partial \theta'}\left(\hat{\theta}_n(y) - \theta_n^p\right)\right]
= E_{y_{rep}} E_y\left[\frac{\partial \ln p\left(y_{rep}|\theta_n^p\right)}{\partial \theta'}\left(\hat{\theta}_n(y) - \theta_n^p\right)\right] = 0.
\]
Consequently, under the same regularity conditions used for approximating $T2$, we have
\[
T3 = \mathrm{tr}\left\{E_{y_{rep}}\left[-\frac{\partial^2 \ln p\left(y_{rep}|\theta_n^p\right)}{\partial \theta \partial \theta'}\right] E_y\left[\left(\hat{\theta}_n(y) - \theta_n^p\right)\left(\hat{\theta}_n(y) - \theta_n^p\right)'\right]\right\} = P + o(1).
\]
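As a concrete check of this trace formula, consider a worked special case that we add for illustration (it is not taken from the paper): the DGP is $y_i \overset{iid}{\sim} N(\mu_0, 1)$, the candidate model is $y_i \sim N(\theta, 1)$ with $P = 1$, so that $\theta_n^p = \mu_0$ and $\hat{\theta}_n(y) = \bar{y}$. Then
\[
E_{y_{rep}}\left[-\frac{\partial^2 \ln p\left(y_{rep}|\theta_n^p\right)}{\partial \theta \partial \theta'}\right] = n,
\qquad
E_y\left[\left(\hat{\theta}_n(y) - \theta_n^p\right)\left(\hat{\theta}_n(y) - \theta_n^p\right)'\right] = \mathrm{Var}(\bar{y}) = \frac{1}{n},
\]
so that $T3 = \mathrm{tr}\left\{n \cdot \tfrac{1}{n}\right\} = 1 = P$, in line with the general result.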
Following Burnham and Anderson (2002), we have
\[
E_y E_{y_{rep}}\left[-2\ln p\left(y_{rep}|\hat{\theta}_n(y)\right)\right] = E_y\left[-2\ln p\left(y|\hat{\theta}_n(y)\right)\right] + 2P + o(1) = E_y\left(AIC\right) + o(1),
\]
that is, AIC is an unbiased estimator of $E_y E_{y_{rep}}\left[-2\ln p\left(y_{rep}|\hat{\theta}_n(y)\right)\right]$ asymptotically. From the decision viewpoint, among candidate models, AIC selects a model that minimizes the expected loss asymptotically when the plug-in predictive distribution is used for making predictions. The prediction is achieved by $p\left(y_{rep}|\hat{\theta}_n(y)\right)$.

The decision-theoretic justification of AIC rests on a frequentist framework. Specifically, it requires a careful choice of the KL divergence, the use of QML, and a set of regularity conditions that ensure $\sqrt{n}$-consistency and the asymptotic normality of QML. The penalty term in AIC arises from two sources. First, the pseudo true parameter value has to be estimated. Second, the estimate obtained from the observed data is not the same as that from the replicate data. Moreover, as pointed out in Burnham and Anderson (2002), the justification of AIC requires the candidate model to be a "good approximation" to the DGP for $T3$ to be $P$ asymptotically, although a formal definition of "good approximation" was not provided in the literature.
3 Decision-theoretic Justification of DIC
3.1 DIC
Spiegelhalter et al. (2002) propose DIC for Bayesian model selection. The criterion is based on the deviance
\[
D(\theta) = -2\ln p\left(y|\theta\right),
\]
and takes the form of
\[
DIC = \bar{D}(\theta) + P_D. \qquad (3)
\]
The first term, interpreted as a Bayesian measure of model fit, is defined as the posterior expectation of the deviance, that is,
\[
\bar{D}(\theta) = E_{\theta|y}\left[D(\theta)\right] = E_{\theta|y}\left[-2\ln p\left(y|\theta\right)\right].
\]
The better the model fits the data, the larger the log-likelihood value, and hence, the smaller the value for $\bar{D}(\theta)$. The second term, used to measure the model complexity and also known as the "effective number of parameters", is defined as the difference between the posterior mean of the deviance and the deviance evaluated at the posterior mean of the parameters:
\[
P_D = \bar{D}(\theta) - D\left(\bar{\theta}_n(y)\right) = -2\int \left[\ln p\left(y|\theta\right) - \ln p\left(y|\bar{\theta}_n(y)\right)\right] p\left(\theta|y\right) d\theta, \qquad (4)
\]
where $\bar{\theta}_n(y)$ is the posterior mean of $\theta$ based on $y$, defined by $\int \theta\, p\left(\theta|y\right) d\theta$. When there is no confusion, we simply write $\bar{\theta}_n(y)$ as $\bar{\theta}_n$ or even $\bar{\theta}$. DIC can be rewritten in two equivalent forms: