Deviance Information Criterion for Model Selection: A Theoretical Justification*
Yong Li (Renmin University of China), Jun Yu (Singapore Management University), Tao Zeng (Zhejiang University)
September 7, 2021
Abstract

Deviance information criterion (DIC) has been extensively used for model selection based on MCMC output. No rigorous justification has been provided for DIC in the literature. This paper shows that, when a plug-in predictive distribution is used and under a set of regularity conditions, DIC is an asymptotically unbiased estimator of the expected Kullback-Leibler divergence between the data generating process and the plug-in predictive distribution. High-order expansions to DIC and the effective number of parameters are developed, facilitating investigation of the effect of the prior.
Keywords: AIC; DIC; Expected loss function; Kullback-Leibler divergence; Model comparison; Plug-in predictive distribution
1 Introduction
A highly important statistical inference problem often faced by model builders and empirical researchers is model selection. Many penalty-based information criteria have been proposed to select from a set of candidate models. In the frequentist statistical framework, perhaps the most popular information criterion is AIC. Arguably, one of the most important developments for model selection in the Bayesian literature in the last twenty
*We wish to thank Eric Renault, Peter Phillips and David Spiegelhalter for their helpful comments. Yong Li, School of Economics, Renmin University of China, Beijing, China. Jun Yu, School of Economics and Lee Kong Chian School of Business, Singapore Management University, Singapore. Tao Zeng, School of Economics and Academy of Financial Research, Zhejiang University, Zhejiang, China.
years is the deviance information criterion (DIC) of Spiegelhalter et al. (2002).¹ DIC is understood as a Bayesian version of AIC. Like AIC, it trades off a measure of model adequacy against a measure of complexity and is concerned with how hypothetical replicate data predict the observed data. However, unlike AIC, DIC takes prior information into account.

DIC is constructed based on the posterior mean of the log-likelihood or the deviance and has several desirable features. Firstly, DIC is easy to calculate when the likelihood function is available in closed form and the posterior distributions of the models are obtained by Markov chain Monte Carlo (MCMC) simulation. Secondly, it applies to a wide range of statistical models. Thirdly, unlike Bayes factors (BF), it is not subject to the Jeffreys-Lindley-Bartlett paradox and can be calculated when vague or even improper priors are used.

However, as acknowledged in Spiegelhalter et al. (2002, 2014), the decision-theoretic justification of DIC is not rigorous in the literature. In fact, in the heuristic justification given by Spiegelhalter et al. (2002), the frequentist framework and the Bayesian framework were mixed. The first contribution of the present paper is to provide a rigorous decision-theoretic justification of DIC purely in a frequentist setup. It can be shown that DIC is an asymptotically unbiased estimator of the expected Kullback-Leibler (KL) divergence between the data generating process (DGP) and the plug-in predictive distribution when the posterior mean is used. This justification is similar to how AIC has been justified. The second contribution of the present paper is to develop high-order expansions to DIC and the effective number of parameters that allow us to easily see the effect of the prior on DIC and the effective number of parameters.

The paper is organized as follows. Section 2 explains how to treat model selection as a decision problem and gives a brief review of the decision-theoretic justification of AIC that helps explain our theory for DIC. Section 3 provides a rigorous decision-theoretic justification of DIC of Spiegelhalter et al. (2002) under a set of regularity conditions and shows why DIC can be interpreted as the Bayesian version of AIC. In Section 4, we give two examples to illustrate the effect of the prior distribution on DIC in finite samples. Section 5 concludes the paper. The online appendix collects the proofs of the theoretical results in the paper.

Throughout the paper, we use $:=$, tr, vec, $\otimes$, $o(1)$, $o_p(1)$, $O_p(1)$, $\overset{p}{\to}$ to denote definitional equality, trace, the vec operator that converts a matrix into a column vector, the Kronecker product, tending to zero, tending to zero in probability, boundedness in probability, and convergence in probability, respectively. Moreover, we use $\bar{\theta}_n$, $\hat{\theta}_n$, $\theta_n^p$ to denote the posterior mean, the quasi maximum likelihood (QML) estimator, and the pseudo true parameter value of $\theta$, respectively.

¹According to Spiegelhalter et al. (2014), Spiegelhalter et al. (2002) was the third most cited paper in international mathematical sciences between 1998 and 2008. Up to September 2021, it has received 12675 citations on Google Scholar.
2 Decision-theoretic Justification of AIC
There are essentially two strands of literature on model selection.² The first strand aims to answer the following question: which model best explains the observed data? The BF (Kass and Raftery, 1995) and its variations belong to this strand. They compare models by examining "posterior probabilities" given the observed data and search for the "true" model. BIC is a large-sample approximation to BF, although it is based on the maximum likelihood estimator. The second strand aims to answer the following question: which model gives the best predictions of future observations generated by the same mechanism that gives the observed data? Clearly, this is a utility-based approach where the utility is set to prediction. Ideally, we would like to choose the model that gives the best overall predictions of future values. Some cross-validation-based criteria have been developed where the original sample is split into a training set and a validation set (Vehtari and Lampinen, 2002; Zhang and Yang, 2015). Unfortunately, different ways of sample splitting often lead to different outcomes. Alternatively, based on hypothetical replicate data generated by the same mechanism that gives the observed data, some predictive information criteria have been proposed for model selection. They minimize a loss function associated with the predictive decisions. AIC and DIC are two well-known criteria in this framework. After the decision is made about which model should be used for prediction and how predictions should be made, a unique prediction action for future values can be obtained to fulfill the original goal. The latter approach is what we follow in the present paper. Given the relevance of prediction in practice, not surprisingly, such an approach to model selection has been widely used in applications.

²For more information about the literature, see Vehtari and Ojanen (2012) and Burnham and Anderson (2002).
2.1 Predictive model selection as a decision problem
Assume that the probabilistic behavior of the observed data, $y = (y_1, y_2, \cdots, y_n)' \in \mathbf{Y}$, is described by a set of probabilistic models $\{M_k\}_{k=1}^{K} = \{p(y|\theta_k, M_k)\}_{k=1}^{K}$, where $n$ is the sample size, $\theta_k$ (without confusion, we simply write it as $\theta$) is the set of parameters in candidate model $M_k$, and $p(\cdot)$ is a probability density function (pdf). Formally, the model selection problem can be taken as a decision problem of selecting a model among $\{M_k\}_{k=1}^{K}$, where the action space has $K$ elements, namely, $\{d_k\}_{k=1}^{K}$, where $d_k$ means $M_k$ is selected.
For the decision problem, a loss function, $\ell(y, d_k)$, which measures the loss of decision $d_k$ as a function of $y$, must be specified. Given the loss function, the expected loss (or risk) can be defined as (Berger, 1985)
\[
\mathrm{Risk}(d_k) = E_y\left[\ell(y, d_k)\right] = \int \ell(y, d_k)\, g(y)\, dy,
\]
where $g(y)$ is the pdf of the DGP of $y$. Hence, the model selection problem is equivalent to optimizing the statistical decision,
\[
k^* = \arg\min_k \mathrm{Risk}(d_k).
\]
Based on the set of candidate models $\{M_k\}_{k=1}^{K}$, the model $M_{k^*}$ with the decision $d_{k^*}$ is selected.

Let $y_{rep} = (y_{1,rep}, \cdots, y_{n,rep})'$ be the hypothetical replicate data, independently generated by the same mechanism that gives $y$. Assume the sample size in $y_{rep}$ is the same as that in $y$ (i.e., $n$). Consider the predictive density of this replicate experiment for a candidate model $M_k$. The plug-in predictive density for $M_k$ can be expressed as $p\left(y_{rep}|\hat{\theta}_n(y), M_k\right)$, where $\hat{\theta}_n(y)$ is the QML estimate of $\theta$ based on $y$ (when there is no confusion, we simply write $\hat{\theta}_n(y)$ as $\hat{\theta}_n$ or even $\hat{\theta}$) and is defined by
\[
\hat{\theta}_n(y) = \arg\max_{\theta \in \Theta} \ln p\left(y|\theta, M_k\right),
\]
which is the global maximum interior to $\Theta$. The quantity that has been used to measure the quality of the candidate model, in terms of its ability to make predictions, is the KL divergence between $g(y_{rep})$ and $p\left(y_{rep}|\hat{\theta}_n(y), M_k\right)$, multiplied by 2,
\[
2 \times KL\left[g(y_{rep}), p\left(y_{rep}|\hat{\theta}_n(y), M_k\right)\right]
= 2\int \ln\left[\frac{g(y_{rep})}{p\left(y_{rep}|\hat{\theta}_n(y), M_k\right)}\right] g(y_{rep})\, dy_{rep}.
\]
Naturally, the loss function associated with decision $d_k$ is
\[
\ell(y, d_k) = 2 \times KL\left[g(y_{rep}), p\left(y_{rep}|\hat{\theta}_n(y), M_k\right)\right]
= 2\int \ln\left[\frac{g(y_{rep})}{p\left(y_{rep}|\hat{\theta}_n(y), M_k\right)}\right] g(y_{rep})\, dy_{rep}.
\]
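When the DGP is known, as in a simulation study, this KL-based loss can be approximated numerically. The following Python sketch is our illustrative addition (not from the paper): the normal-mean DGP, candidate model, sample sizes, and seed are all assumptions made purely for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 100

# Assumed toy setting (for illustration only):
# DGP g: y_i ~ N(0, 1); candidate model M_k: y_i ~ N(theta, 1) with theta unknown.
y = rng.normal(loc=0.0, scale=1.0, size=n)
theta_hat = y.mean()  # QML/ML estimate of theta based on the observed data y

# Monte Carlo approximation of the loss
#   l(y, d_k) = 2 * KL[g(y_rep), p(y_rep | theta_hat)]
#             = 2 * E_{y_rep ~ g}[ln g(y_rep) - ln p(y_rep | theta_hat)].
n_rep = 5000
draws = np.empty(n_rep)
for r in range(n_rep):
    y_rep = rng.normal(loc=0.0, scale=1.0, size=n)
    log_g = stats.norm.logpdf(y_rep, loc=0.0, scale=1.0).sum()
    log_p = stats.norm.logpdf(y_rep, loc=theta_hat, scale=1.0).sum()
    draws[r] = 2.0 * (log_g - log_p)

print("Monte Carlo estimate of the loss:", draws.mean())
# For this toy model the loss is available in closed form: n * theta_hat**2,
# which can be used to check the Monte Carlo approximation.
print("Closed-form loss for this model :", n * theta_hat**2)
```

In this assumed setting the Monte Carlo estimate should be close to the closed-form value $n\hat{\theta}^2$, which provides a simple sanity check of the construction.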
As a result, the model selection problem is,
\[
k^* = \arg\min_k \mathrm{Risk}(d_k) = \arg\min_k E_y\left[\ell(y, d_k)\right]
= \arg\min_k \left\{ E_y E_{y_{rep}}\left[2\ln g\left(y_{rep}\right)\right] + E_y E_{y_{rep}}\left[-2\ln p\left(y_{rep}|\hat{\theta}_n(y), M_k\right)\right] \right\}.
\]
Since $g(y_{rep})$ is the DGP, $E_{y_{rep}}\left[2\ln g(y_{rep})\right]$ is the same across all candidate models and, hence, is dropped from the above equation. Consequently,
\[
k^* = \arg\min_k \mathrm{Risk}(d_k) = \arg\min_k E_y E_{y_{rep}}\left[-2\ln p\left(y_{rep}|\hat{\theta}_n(y), M_k\right)\right].
\]
The smaller $\mathrm{Risk}(d_k)$ is, the better the candidate model performs when $p\left(y_{rep}|\hat{\theta}_n(y), M_k\right)$ is used to predict $g(y_{rep})$. The optimal decision makes it necessary to evaluate the risk. AIC is an asymptotically unbiased estimator of $E_y E_{y_{rep}}\left[-2\ln p\left(y_{rep}|\hat{\theta}_n(y), M_k\right)\right]$.
2.2 AIC for predictive model selection
To show the asymptotic unbiasedness of AIC, let us first fix some notation. When there is no confusion, we simply write the candidate model $p(y|\theta, M_k)$ as $p(y|\theta)$, where $\theta \in \Theta \subseteq \mathbb{R}^P$ (i.e., the dimension of $\theta$ is $P$). Although $-2\ln p\left(y|\hat{\theta}_n(y)\right)$ is a natural estimate of $E_{y_{rep}}\left[-2\ln p\left(y_{rep}|\hat{\theta}_n(y)\right)\right]$, it is asymptotically biased. Let
\[
c(y) = E_{y_{rep}}\left[-2\ln p\left(y_{rep}|\hat{\theta}_n(y)\right)\right] - \left[-2\ln p\left(y|\hat{\theta}_n(y)\right)\right] \qquad (1)
\]
be the "optimism". Under a set of regularity conditions, one can show that $E_y\left(c(y)\right) \to 2P$ as $n \to \infty$. Hence, if we let $AIC = -2\ln p\left(y|\hat{\theta}_n(y)\right) + 2P$, then, as $n \to \infty$,
\[
E_y(AIC) - E_y E_{y_{rep}}\left[-2\ln p\left(y_{rep}|\hat{\theta}_n(y)\right)\right] \to 0.
\]
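To make the bias correction concrete, the following Python sketch (an illustrative simulation we add here, not part of the paper) compares the average of $AIC$ with a Monte Carlo estimate of $E_y E_{y_{rep}}\left[-2\ln p\left(y_{rep}|\hat{\theta}_n(y)\right)\right]$ in a toy normal model with unknown mean and variance, so that $P = 2$; the DGP and all simulation settings are assumptions for illustration only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, P = 200, 2            # model: y_i ~ N(mu, sigma^2); theta = (mu, sigma^2), so P = 2
n_sim, n_rep = 300, 200  # outer draws of y and inner draws of y_rep (illustrative sizes)

aic_vals, risk_vals = [], []
for s in range(n_sim):
    # Assumed DGP (illustration only): y_i ~ N(0, 1), so the model is correctly specified.
    y = rng.normal(loc=0.0, scale=1.0, size=n)
    mu_hat, sig_hat = y.mean(), y.std()  # ML estimates (np.std uses 1/n by default)

    loglik_y = stats.norm.logpdf(y, loc=mu_hat, scale=sig_hat).sum()
    aic_vals.append(-2.0 * loglik_y + 2.0 * P)

    # Monte Carlo estimate of E_{y_rep}[-2 ln p(y_rep | theta_hat(y))] for this y
    inner = np.empty(n_rep)
    for r in range(n_rep):
        y_rep = rng.normal(loc=0.0, scale=1.0, size=n)
        inner[r] = -2.0 * stats.norm.logpdf(y_rep, loc=mu_hat, scale=sig_hat).sum()
    risk_vals.append(inner.mean())

avg_aic = np.mean(aic_vals)
avg_risk = np.mean(risk_vals)
avg_fit = np.mean(np.array(aic_vals) - 2.0 * P)   # average of -2 ln p(y | theta_hat(y))

print("average AIC                   :", avg_aic)
print("Monte Carlo plug-in risk      :", avg_risk)
print("average optimism (risk - fit) :", avg_risk - avg_fit)
```

With these illustrative settings, the average AIC and the Monte Carlo risk estimate should be close, and the average optimism should be near $2P = 4$, which is the content of the asymptotic claim above.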
To see why a penalty term, $2P$, is needed in AIC, let
\[
\theta_n^p = \arg\min_{\theta \in \Theta} \frac{1}{n} KL\left[g(y), p(y|\theta)\right] \qquad (2)
\]
be the pseudo true parameter value; let $\hat{\theta}_n(y_{rep})$ be the QML estimate of $\theta$ obtained from $y_{rep}$; and let $p(y|\theta)$ be a "good approximation" to the DGP, a concept that will be defined formally in the next section. Note that
\[
E_y E_{y_{rep}}\left[-2\ln p\left(y_{rep}|\hat{\theta}_n(y)\right)\right]
\]
\[
= E_y E_{y_{rep}}\left[-2\ln p\left(y_{rep}|\hat{\theta}_n(y_{rep})\right)\right] \qquad (T1)
\]
\[
+ \left\{ E_y E_{y_{rep}}\left[-2\ln p\left(y_{rep}|\theta_n^p\right)\right] - E_y E_{y_{rep}}\left[-2\ln p\left(y_{rep}|\hat{\theta}_n(y_{rep})\right)\right] \right\} \qquad (T2)
\]
\[
+ \left\{ E_y E_{y_{rep}}\left[-2\ln p\left(y_{rep}|\hat{\theta}_n(y)\right)\right] - E_y E_{y_{rep}}\left[-2\ln p\left(y_{rep}|\theta_n^p\right)\right] \right\}. \qquad (T3)
\]
Clearly, term $T1$ is the same as $E_y\left[-2\ln p\left(y|\hat{\theta}_n(y)\right)\right]$. Term $T2$ is the expectation of the likelihood ratio statistic based on $y_{rep}$. Under a set of regularity conditions that ensure $\sqrt{n}$-consistency and the asymptotic normality of the QML estimate, we have $T2 = T3 + o(1)$. To approximate term $T3$, if $\hat{\theta}_n(y) \overset{p}{\to} \theta_n^p$, we have
\[
T3 = E_y\left\{-2 E_{y_{rep}}\left[\frac{\partial \ln p\left(y_{rep}|\theta_n^p\right)}{\partial \theta'}\right]\left(\hat{\theta}_n(y) - \theta_n^p\right)\right\}
+ E_y E_{y_{rep}}\left[\left(\hat{\theta}_n(y) - \theta_n^p\right)'\left(-\frac{\partial^2 \ln p\left(y_{rep}|\theta_n^p\right)}{\partial \theta \partial \theta'}\right)\left(\hat{\theta}_n(y) - \theta_n^p\right)\right] + o(1).
\]
By the definition of $\theta_n^p$, $E_{y_{rep}}\left[\partial \ln p\left(y_{rep}|\theta_n^p\right)/\partial\theta\right] = 0$, implying that
\[
E_y E_{y_{rep}}\left[\frac{\partial \ln p\left(y_{rep}|\theta_n^p\right)}{\partial \theta'}\left(\hat{\theta}_n(y) - \theta_n^p\right)\right]
= E_{y_{rep}} E_y\left[\frac{\partial \ln p\left(y_{rep}|\theta_n^p\right)}{\partial \theta'}\left(\hat{\theta}_n(y) - \theta_n^p\right)\right] = 0.
\]
Consequently, under the same regularity conditions used for approximating $T2$, we have
\[
T3 = \mathrm{tr}\left\{E_{y_{rep}}\left[-\frac{\partial^2 \ln p\left(y_{rep}|\theta_n^p\right)}{\partial \theta \partial \theta'}\right] E_y\left[\left(\hat{\theta}_n(y) - \theta_n^p\right)\left(\hat{\theta}_n(y) - \theta_n^p\right)'\right]\right\} = P + o(1).
\]
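As a concrete check of this trace formula, consider a worked special case that we add for illustration (it is not taken from the paper): the DGP is $y_i \overset{iid}{\sim} N(\mu_0, 1)$, the candidate model is $y_i \sim N(\theta, 1)$ with $P = 1$, so that $\theta_n^p = \mu_0$ and $\hat{\theta}_n(y) = \bar{y}$. Then
\[
E_{y_{rep}}\left[-\frac{\partial^2 \ln p\left(y_{rep}|\theta_n^p\right)}{\partial \theta \partial \theta'}\right] = n,
\qquad
E_y\left[\left(\hat{\theta}_n(y) - \theta_n^p\right)\left(\hat{\theta}_n(y) - \theta_n^p\right)'\right] = \mathrm{Var}(\bar{y}) = \frac{1}{n},
\]
so that $T3 = \mathrm{tr}\left\{n \cdot \tfrac{1}{n}\right\} = 1 = P$, in line with the general result.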
Following Burnham and Anderson (2002), we have
\[
E_y E_{y_{rep}}\left[-2\ln p\left(y_{rep}|\hat{\theta}_n(y)\right)\right] = E_y\left[-2\ln p\left(y|\hat{\theta}_n(y)\right)\right] + 2P + o(1) = E_y\left(AIC\right) + o(1),
\]
that is, AIC is an unbiased estimator of $E_y E_{y_{rep}}\left[-2\ln p\left(y_{rep}|\hat{\theta}_n(y)\right)\right]$ asymptotically. From the decision viewpoint, among candidate models, AIC selects a model that minimizes the expected loss asymptotically when the plug-in predictive distribution is used for making predictions. The prediction is achieved by $p\left(y_{rep}|\hat{\theta}_n(y)\right)$.

The decision-theoretic justification of AIC rests on a frequentist framework. Specifically, it requires a careful choice of the KL divergence, the use of QML, and a set of regularity conditions that ensure $\sqrt{n}$-consistency and the asymptotic normality of QML. The penalty term in AIC arises from two sources. First, the pseudo true parameter value has to be estimated. Second, the estimate obtained from the observed data is not the same as that from the replicate data. Moreover, as pointed out in Burnham and Anderson (2002), the justification of AIC requires the candidate model to be a "good approximation" to the DGP for $T3$ to be $P$ asymptotically, although a formal definition of "good approximation" was not provided in the literature.
3 Decision-theoretic Justification of DIC
3.1 DIC
Spiegelhalter et al. (2002) propose DIC for Bayesian model selection. The criterion is based on the deviance
\[
D(\theta) = -2\ln p\left(y|\theta\right),
\]
and takes the form of
\[
DIC = \bar{D}(\theta) + P_D. \qquad (3)
\]
The first term, interpreted as a Bayesian measure of model fit, is defined as the posterior expectation of the deviance, that is,
\[
\bar{D}(\theta) = E_{\theta|y}\left[D(\theta)\right] = E_{\theta|y}\left[-2\ln p\left(y|\theta\right)\right].
\]
The better the model fits the data, the larger the log-likelihood value, and hence, the smaller the value for $\bar{D}(\theta)$. The second term, used to measure the model complexity and also known as the "effective number of parameters", is defined as the difference between the posterior mean of the deviance and the deviance evaluated at the posterior mean of the parameters:
\[
P_D = \bar{D}(\theta) - D\left(\bar{\theta}_n(y)\right) = -2\int \left[\ln p\left(y|\theta\right) - \ln p\left(y|\bar{\theta}_n(y)\right)\right] p\left(\theta|y\right) d\theta, \qquad (4)
\]
where $\bar{\theta}_n(y)$ is the posterior mean of $\theta$ based on $y$, defined by $\int \theta\, p\left(\theta|y\right) d\theta$. When there is no confusion, we simply write $\bar{\theta}_n(y)$ as $\bar{\theta}_n$ or even $\bar{\theta}$. DIC can be rewritten in two equivalent forms: