(2019) "Discussion of 'Prior-Based Bayesian Information Criterion
Total Page:16
File Type:pdf, Size:1020Kb
Statistical Theory and Related Fields, 2019, Vol. 3, No. 1, 26-29
https://doi.org/10.1080/24754269.2019.1611143

SHORT COMMUNICATION

Discussion of 'Prior-based Bayesian Information Criterion (PBIC)'

Bertrand S. Clarke
Statistics Department, University of Nebraska-Lincoln, Lincoln, NE, USA

ARTICLE HISTORY: Received 16 April 2019; accepted 22 April 2019; published online 2 May 2019.

1. Summary

The authors have the basis for a reformulation of the BIC as we think of it now. This problem is both hard and important. In particular, to address it, the authors have put six incisive ideas in sequence. The first is the separation of parameters that are common across models versus those that are not. The second is the use of an orthogonal (why not orthonormal?) transformation of the Fisher information matrix to get diagonal entries $d_i$ that summarise the parameter-by-parameter efficiency of estimation. The third is using Laplace's method only on the likelihood, i.e. Taylor expanding the log-likelihood and using the MLE rather than centring a Taylor expansion at the posterior mode. (From an estimation standpoint the difference between the MLE and a posterior mode is $O(1/n)$ and can be neglected.) The fourth is the particular prior selection that the third step enables. Since the prior is not approximated by, say, $\pi(\hat\theta)$, the prior can be chosen to have an impact, and the only way the prior won't wash out is if its tails are heavy enough. Fifth is defining an effective sample size $n_i^e$ that differs from parameter to parameter. Finally, sixth, is imposing a relationship between the diagonal elements $d_i$ and the 'unit information' $b_i$ by way of the $n_i^e$.

Taken together, the result is a PBIC that arises as an approximation to $-2 \log m(x^n)$, where $X^n = (X_1, \ldots, X_n) = (x_1, \ldots, x_n) = x^n$ is the data. This matches the $O(1)$ asymptotics of the usual BIC.

The main improvement in perspective on the BIC that this paper provides is the observation that different efficiencies for estimating different parameters are important to include in model selection. Intuitively, if a parameter is easier to estimate in one model (larger Fisher information) than the corresponding parameter in another model, then ceteris paribus the first model should be preferred. (The use of ceteris paribus covers a lot of ground, but helps make the point about efficiency.) Neglecting comparative efficiencies of parameters is an important gap to fill in the literature on the BIC and model selection more generally.

The focus on the Fisher information $I(\cdot)$ – see Sec. 3.2 in particular – supports this view; however, one must wonder if there is more to be gained from either the off-diagonal elements of $I$ or from the orthogonal (orthonormal) matrix $O$. The constraint $b_i = n_i^e d_i$ is also a little puzzling. It makes sense because $b_i$ is interpretable as something like the Fisher information relative to parameter $i$. (In this sense it is not clear why it is called the 'unit information'.) The prior selection is very perceptive – and works – but there does not seem to be any unique, general conceptual property that it possesses. Even though it gives an effective result, the prior selection seems a little artificial. The authors may of course counter-argue that one of the reasons to use a prior is precisely that it represents information one has outside the data.

Setting aside such nit-picking, let us turn to the substance of the contribution.

2. Other forms for the BIC?

For comparison, let us try to modify the BIC in three other ways. (All notation and terminology here is the same as in the paper, unless otherwise noted.) The first is a refinement of the BIC to identify the constant in Result 1.1. The second is to look more closely at the contrast between the PBIC the authors propose and a more conventional approach. The third is a discussion of an alternative that starts with an effective sample size rather than bringing it in via the prior.

First observe that the conventional expression for the BIC is actually only accurate to $O_P(1)$, not $o_P(1)$. However, the constant term can be identified. Let $x^n$ be IID $P_\theta$. Staring at Result 1.1 and using a standard Laplace's method analysis of $m(x^n)$ gives that
$$\log \frac{p(X^n \mid \hat\theta)\,\pi(\hat\theta)}{m(X^n)} - \frac{p}{2}\log\frac{n}{2\pi} - \frac{1}{2}\log\det \hat I(\hat\theta) \to 0, \tag{1}$$
in $P_\theta$-probability; see Clarke and Barron (1988). So a more refined version of the BIC expression, which approximates $-2\log m(x^n)$, is
$$\mathrm{BIC}_{\mathrm{better}} = -2\ell(\hat\theta) + p\log\frac{n}{2\pi} = -2\log m(x^n) - \log\det\hat I(\hat\theta) + 2\log\pi(\hat\theta) + o_P(1). \tag{2}$$

Using (2) may largely address Problem 1 as identified by the authors. Minimising $\mathrm{BIC}_{\mathrm{better}}$ over candidate models is loosely like maximising $m(x^n)$ subject to a penalty term in $p$ and $I$, i.e. loosely like finding the model that achieves the maximal penalised maximum likelihood if the mixture density were taken as the likelihood. Expression (2) can be re-arranged to give an expression for $m(x^n)$. Indeed, one can plausibly argue that maximising $m(x^n)$ over models (and priors) under some restrictions should be a useful statistic for model selection.
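To make the size of the correction concrete, here is a minimal numerical sketch, not anything from the paper: it computes the usual BIC and the refined criterion of (2) for an IID Gaussian model with unknown mean and variance ($p = 2$). The simulated sample and the function names are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def bic_standard(loglik_hat, p, n):
    """Conventional BIC: -2 * (log-likelihood at the MLE) + p * log(n)."""
    return -2.0 * loglik_hat + p * np.log(n)

def bic_better(loglik_hat, p, n):
    """Refined criterion from (2): the penalty uses log(n / (2*pi)),
    retaining the O_P(1) constant that the usual BIC drops."""
    return -2.0 * loglik_hat + p * np.log(n / (2.0 * np.pi))

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=200)   # hypothetical IID sample
n, p = x.size, 2                               # mean and variance are free

mu_hat = x.mean()
sigma_hat = x.std()                            # MLE uses the 1/n variance
loglik_hat = norm.logpdf(x, mu_hat, sigma_hat).sum()

print(bic_standard(loglik_hat, p, n))
print(bic_better(loglik_hat, p, n))            # smaller by p * log(2*pi)
```

The two expressions differ by the constant $p\log(2\pi) \approx 1.84\,p$, which does not grow with $n$ but can matter when candidate models fit comparably well.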
Maximising $m(x^n)$ in this way is intuitively reasonable ... until you want to take the intuition of the authors into account: different $\theta_j$'s in $\theta = (\theta_1, \ldots, \theta_p)$ require different sample sizes to estimate equally well, or correspond to different effective sample sizes. One expects this effect to be greater as more and more models are under consideration. It is therefore natural to focus on the parameters that distinguish the models from each other rather than the common parameters. So, for ease of exposition, we assume that $\theta = \theta_{(1)}$, i.e. that $\theta_{(2)}$ does not appear. (In simple examples like linear regression, $\theta_{(2)}$ often corresponds to the intercept and can be removed by centring the data.)

So, second, let us look at Laplace's method applied to $m(x^n)$. Being informal about a second-order Taylor expansion and using standard notation gives
$$m(x^n) = \int p(x^n \mid \theta)\,\pi(\theta)\,d\theta = p(x^n \mid \hat\theta) \int e^{-(n/2)(\theta-\hat\theta)^T \hat I(\tilde\theta)(\theta-\hat\theta)}\,\pi(\theta)\,d\theta.$$
(The domain of integration is $\mathbb{R}^p$, but it can be cut down to a ball $B(\hat\theta, \epsilon)$ by allowing error terms of order $O(e^{-n\gamma})$ for suitable $\gamma > 0$. Then the Taylor expansion can be used. Finally, one can go back to the original domain of integration by adding an exponentially small error term.) Standard conditions (see e.g. Clarke & Barron, 1988) give that $\tilde\theta$ can be replaced by $\theta_T$ and $\hat I$ by $I$ to get
$$m(x^n) \approx p(x^n \mid \hat\theta) \int e^{-(n/2)(\theta-\theta_T)^T I(\theta_T)(\theta-\theta_T)}\,\pi(\theta)\,d\theta. \tag{3}$$

So far, this is standard. It becomes more interesting when the technique of the authors is invoked. Essentially, they diagonalise $I(\theta_T)$. For this, the $p$ eigenvalues must be strictly positive, but that is not usually a difficult assumption to satisfy. Write $D = O^T I(\theta_T) O$, where $O$ is an orthonormal matrix, i.e., a rotation, and $D = \mathrm{diag}(d_1, \ldots, d_p)$. (The authors use an orthogonal matrix, but an orthonormal matrix seems to give cleaner results.) Now consider the transformation $\xi = O^T(\theta - \theta_T)$, so that $d\xi = d\theta$ by the orthonormality of $O$. Note that the transformation has been simplified since the argument of $O$ is $\theta_T$. Now, the integral on the right hand side of expression (3) is
$$\int e^{-(n/2)\xi^T O^T I(\theta_T) O \xi}\,\pi(O(\theta_T)\xi + \theta_T)\,d\xi = \int e^{-(n/2)\xi^T D \xi}\,\pi(O(\theta_T)\xi + \theta_T)\,d\xi. \tag{4}$$

At this point the authors, rather than using Laplace's method on the integral, choose $\pi$ as a product of individual $\pi_i$'s, one for each $\xi_i$. Each factor in that product has hyperparameters $\lambda_i$, $d_i$, and $b_i$, and the resulting $p$-dimensional integral in (4) has a closed form, as given at the end of Sec. 2.
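The diagonalisation behind (4) can be checked mechanically with a few lines of numpy; this is only a sketch, and the matrix standing in for $I(\theta_T)$ is made up. numpy.linalg.eigh returns eigenvalues $d_i$ and an orthonormal matrix $O$ with $D = O^T I(\theta_T) O$.

```python
import numpy as np

# Hypothetical 3x3 Fisher information at theta_T: any symmetric
# positive definite matrix serves for the sketch.
I_T = np.array([[4.0, 1.0, 0.5],
                [1.0, 3.0, 0.2],
                [0.5, 0.2, 2.0]])

# eigh gives eigenvalues d_i and orthonormal eigenvectors (columns of O),
# so that I_T = O @ diag(d) @ O.T, i.e. D = O.T @ I_T @ O.
d, O = np.linalg.eigh(I_T)

D = O.T @ I_T @ O
assert np.allclose(D, np.diag(d))       # the diagonalisation used in (4)
assert np.all(d > 0)                    # all p eigenvalues strictly positive
assert np.isclose(abs(np.linalg.det(O)), 1.0)  # volume-preserving: d(xi) = d(theta)

# Geometric mean s of the d_i's; it reappears below as a sort of
# average Fisher information across the xi_i's.
s = np.exp(np.log(d).mean())
print(np.round(d, 3), round(float(s), 3))
```

Once the quadratic form separates coordinatewise in the $\xi_i$'s, a product prior over the $\xi_i$'s factorises the integral, which is what gives the authors their closed form.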
An alternative is the more conventional approach of recognising that, as $n \to \infty$, the integrand converges to unit mass at $\xi = 0$. Using this gives that $m(x^n)$ is approximately
$$p(x^n \mid \hat\theta)\,\pi(\theta_T)\,(2\pi)^{p/2}\det(nD)^{-1/2} \int \frac{e^{-(1/2)\xi^T ((nD)^{-1})^{-1}\xi}}{(2\pi)^{p/2}\det(nD)^{-1/2}}\,d\xi = p(x^n \mid \hat\theta)\,\pi(\theta_T)\,(2\pi)^{p/2}\,\frac{1}{n^{p/2}\bigl((\prod_{i=1}^p d_i)^{1/p}\bigr)^{p/2}} = p(x^n \mid \hat\theta)\,\pi(\theta_T)\,\frac{(2\pi)^{p/2}}{(ns)^{p/2}},$$
where $s$ is the geometric mean of the $d_i$'s. The geometric mean is the side length of a $p$-dimensional cuboid with volume equal to $\prod_{i=1}^p d_i$; thus $s$ plays the role of a sort of average Fisher information for the collection of $\xi_i$'s. This sequence of approximations gives
$$\log m(x^n) = \ell(\hat\theta) + \log \pi(\theta_T) + \frac{p}{2}\log(2\pi) - \frac{p}{2}\log(ns).$$
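As a numerical sanity check on the last display, and only under made-up values, the sketch below verifies that $\det(nD)^{-1/2}$ collapses to $(ns)^{-p/2}$ when $s$ is the geometric mean of the $d_i$'s, and then assembles the resulting approximation to $\log m(x^n)$. The eigenvalues, the maximised log-likelihood, and the log prior at $\theta_T$ are hypothetical placeholders.

```python
import numpy as np

n, p = 200, 3
d = np.array([4.33, 2.87, 1.80])     # hypothetical eigenvalues d_i
s = np.exp(np.log(d).mean())         # geometric mean of the d_i's

# The Gaussian integral behind the conventional approximation:
# (2*pi)^(p/2) * det(nD)^(-1/2) collapses to (2*pi / (n*s))^(p/2).
lhs = (2 * np.pi) ** (p / 2) * np.linalg.det(n * np.diag(d)) ** -0.5
rhs = (2 * np.pi / (n * s)) ** (p / 2)
assert np.isclose(lhs, rhs)

# Assembling the approximation to log m(x^n); the next two numbers
# are stand-ins for l(theta_hat) and log pi(theta_T).
loglik_hat = -412.6
log_prior_at_theta_T = -2.1
log_m = (loglik_hat + log_prior_at_theta_T
         + (p / 2) * np.log(2 * np.pi) - (p / 2) * np.log(n * s))
print(round(float(log_m), 3))
```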