STATISTICAL THEORY AND RELATED FIELDS
2019, VOL. 3, NO. 1, 26–29
https://doi.org/10.1080/24754269.2019.1611143

SHORT COMMUNICATION

Discussion of ‘Prior-based Bayesian Information Criterion (PBIC)’

Bertrand S. Clarke

Statistics Department, University of Nebraska-Lincoln, Lincoln, NE, USA

ARTICLE HISTORY Received 16 April 2019; Accepted 22 April 2019

1. Summary

The authors have the basis for a reformulation of the BIC as we think of it now. This problem is both hard and important. In particular, to address it, the authors have put six incisive ideas in sequence. The first is the separation of parameters that are common across models versus those that aren't. The second is the use of an orthogonal (why not orthonormal?) transformation of the Fisher information matrix to get diagonal entries $d_i$ that summarise the parameter-by-parameter efficiency of estimation. The third is using Laplace's method only on the likelihood, i.e. Taylor expanding the log-likelihood and using the MLE rather than centring a Taylor expansion at the posterior mode. (From an estimation standpoint the difference between the MLE and a posterior mode is $O(1/n)$ and can be neglected.) The fourth is the particular prior selection that the third step enables. Since the prior is not approximated by, say, $\pi(\hat\theta)$, the prior can be chosen to have an impact, and the only way the prior won't wash out is if its tails are heavy enough. Fifth is defining an effective sample size $n_i^e$ that differs from parameter to parameter. Finally, sixth, is imposing a relationship between the diagonal elements $d_i$ and the ‘unit information’ $b_i$ by way of the $n_i^e$.

Taken together, the result is a PBIC that arises as an approximation to $-2\log m(x^n)$, where $X^n = (X_1, \ldots, X_n) = (x_1, \ldots, x_n) = x^n$ is the data. This matches the $O(1)$ asymptotics of the usual BIC.

The main improvement in perspective on the BIC that this paper provides is the observation that different efficiencies for estimating different parameters are important to include in model selection. Intuitively, if a parameter is easier to estimate in one model (larger Fisher information) than the corresponding parameter in another model, then ceteris paribus the first model should be preferred. (The use of ceteris paribus covers a lot of ground, but helps make the point about efficiency.) Neglecting comparative efficiencies of parameters is an important gap to fill in the literature on the BIC and model selection more generally. The focus on the $I(\cdot)$ – see Sec. 3.2 in particular – supports this view; however, one must wonder if there is more to be gained from either off-diagonal elements of $I$ or from the orthogonal (orthonormal) matrix $O$. The constraint $b_i = d_i/n_i^e$ is also a little puzzling. It makes sense because $b_i$ is interpretable as something like the Fisher information relative to parameter $i$. (In this sense it's not clear why it's called the ‘unit information’.) The prior selection is very perceptive – and works – but there does not seem to be any unique, general conceptual property that it possesses. Even though it gives an effective result, the prior selection seems a little artificial. The authors may of course counter-argue that one of the reasons to use a prior is precisely that it represents information one has outside the data.

Setting aside such nit-picking, let us turn to the substance of the contribution.
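For the reader's convenience, the ingredients above can be assembled into the criterion itself. Using the paper's notation, and taking the two penalty terms as they are quoted in Sec. 2 below (the precise definition of the $v_i$ should be taken from the paper), the PBIC has the form

$$\mathrm{PBIC} = -2\ell(\hat\theta) + \sum_{i=1}^{p}\log\bigl(1 + n_i^e\bigr) - 2\sum_{i=1}^{p}\log\!\left(\frac{1 - e^{-v_i}}{\sqrt{2}\,v_i}\right),$$

where $\ell(\hat\theta)$ is the log-likelihood at the MLE.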
2. Other forms for the BIC?

For comparison, let us try to modify the BIC in three other ways. (All notation and terminology here is the same as in the paper, unless otherwise noted.) The first is a refinement of the BIC to identify the constant in Result 1.1. The second is to look more closely at the contrast between the PBIC the authors propose and a more conventional approach. The third is a discussion of an alternative that starts with an effective sample size rather than bringing it in via the prior.

First observe that the conventional expression for the BIC is actually only accurate to $O_P(1)$, not $o_P(1)$. However, the constant term can be identified. Let $x^n$ be IID $P_\theta$. Staring at Result 1.1 and using a standard Laplace's method analysis of $m(x^n)$ gives that

$$\log\frac{p(X^n \mid \hat\theta)\,\pi(\hat\theta)}{m(X^n)} - \frac{p}{2}\log\frac{n}{2\pi} - \frac{1}{2}\log\det \hat I(\hat\theta) \to 0, \qquad (1)$$

in $P_\theta$-probability; see Clarke and Barron (1988).
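To make the step from (1) to the refined criterion below explicit: writing $\ell(\hat\theta) = \log p(X^n \mid \hat\theta)$ and multiplying (1) by $2$, the $o_P(1)$ statement rearranges to

$$-2\log m(X^n) = -2\ell(\hat\theta) - 2\log\pi(\hat\theta) + p\log\frac{n}{2\pi} + \log\det\hat I(\hat\theta) + o_P(1),$$

and adding $p\log 2\pi - \log\det\hat I(\hat\theta) + 2\log\pi(\hat\theta)$ to both sides produces display (2) below. This is pure algebra, recorded here only so the reader does not have to reconstruct it.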


So, a more refined version of the BIC expression, in which the MLE approximates the posterior mode, is

$$\mathrm{BIC}_{\mathrm{better}} = -2\ell(\hat\theta) + p\log n = -2\log m(x^n) + p\log 2\pi - \log\det\hat I(\hat\theta) + 2\log\pi(\hat\theta) + o_P(1). \qquad (2)$$

Using (2) may largely address Problem 1 as identified by the authors. Minimising $\mathrm{BIC}_{\mathrm{better}}$ over candidate models is loosely like maximising $m(x^n)$ subject to a penalty term in $p$ and $I$, i.e. loosely like finding the model that achieves the maximal penalised maximum likelihood if the mixture density were taken as the likelihood. Expression (2) can be re-arranged to give an expression for $m(x^n)$. Indeed, one can plausibly argue that maximising $m(x^n)$ over models (and priors) under some restrictions should be useful for model selection.
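To see what the identified constant amounts to in a concrete case, here is a minimal sketch, assuming an IID $N(\mu, 1)$ model with a $N(0, \tau^2)$ prior on $\mu$; the model, the prior, and all names are illustrative choices of this discussant, not the paper's.

```python
import numpy as np

def neg2_log_m(x, tau=1.0):
    """o_P(1)-accurate approximation to -2 log m(x^n), obtained by
    rearranging display (2), for the illustrative model X_i ~ N(mu, 1)
    IID with prior mu ~ N(0, tau^2)."""
    n, p = len(x), 1
    mu_hat = x.mean()                                   # MLE of mu
    loglik = -0.5 * n * np.log(2 * np.pi) - 0.5 * np.sum((x - mu_hat) ** 2)
    log_det_I = 0.0                                     # per-observation Fisher information for N(mu, 1) is 1
    log_prior = -0.5 * np.log(2 * np.pi * tau ** 2) - 0.5 * mu_hat ** 2 / tau ** 2
    bic = -2 * loglik + p * np.log(n)                   # the conventional BIC
    return bic - p * np.log(2 * np.pi) + log_det_I - 2 * log_prior

rng = np.random.default_rng(0)
x = rng.normal(0.3, 1.0, size=200)
print(neg2_log_m(x))   # differs from the conventional BIC by the O_P(1) terms in (2)
```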
This is intuitively reasonable ... until you want to take the intuition of the authors into account, namely that different $\theta_j$'s in $\theta = (\theta_1, \ldots, \theta_p)$ require different sample sizes to estimate equally well, or correspond to different effective sample sizes. One expects this effect to be greater as more and more models are under consideration. It is therefore natural to focus on the parameters that distinguish the models from each other rather than the common parameters. So, for ease of exposition, we assume that $\theta = \theta_{(1)}$, i.e. that $\theta_{(2)}$ does not appear. (In simple examples $\theta_{(2)}$ often corresponds to the intercept and can be removed by centring the data.)

So, second, let us look at Laplace's method applied to $m(x^n)$. Being informal about a second order Taylor expansion and using standard notation gives

$$m(x^n) = \int p(x^n \mid \theta)\,\pi(\theta)\,d\theta = p(x^n \mid \hat\theta)\int e^{-(n/2)(\theta - \hat\theta)^T \hat I(\tilde\theta)(\theta - \hat\theta)}\,\pi(\theta)\,d\theta.$$

(The domain of integration is $\mathbb{R}^p$, but this can be cut down to a ball $B(\hat\theta, \epsilon)$ by allowing error terms of order $O(e^{-n\gamma})$ for suitable $\gamma > 0$. Then the Taylor expansion can be used. Finally, one can go back to the original domain of integration again by adding an exponentially small error term.) Standard conditions (see e.g. Clarke & Barron, 1988) give that the $\tilde\theta$ can be replaced by $\theta_T$ and the empirical Fisher information, $\hat I$, can be replaced by the actual Fisher information. Thus:

$$m(x^n) \approx p(x^n \mid \hat\theta)\int e^{-(n/2)(\theta - \hat\theta)^T I(\theta_T)(\theta - \hat\theta)}\,\pi(\theta)\,d\theta.$$

The integrand is a normal density that can be integrated in closed form, apart from the $\pi$. By another approximation (one that seems to be asymptotically tight up to an $O_P(n^{\epsilon p/2})$ factor, where $\epsilon$ can be arbitrarily small) we get:

$$m(x^n) \approx p(x^n \mid \hat\theta)\int e^{-(n/2)(\theta - \theta_T)^T I(\theta_T)(\theta - \theta_T)}\,\pi(\theta)\,d\theta. \qquad (3)$$

So far, this is standard. It becomes more interesting when the technique of the authors is invoked. Essentially, they diagonalise $I(\theta_T)$. For this, the $p$ eigenvalues must be strictly positive, but that is not usually a difficult assumption to satisfy. Write $D = O^T I(\theta_T)\,O$, where $O$ is an orthonormal matrix, i.e., a rotation, and $D = \mathrm{diag}(d_1, \ldots, d_p)$. (The authors use an orthogonal matrix, but an orthonormal matrix seems to give cleaner results.) Now consider the transformation $\xi = O^T(\theta - \theta_T)$, so that $d\xi = d\theta$ by the orthonormality of $O$. Note that the transformation has been simplified since the argument of $O$ is $\theta_T$. Now, the integral on the right hand side of expression (3) is

$$\int e^{-(n/2)\xi^T O^T I(\theta_T) O \xi}\,\pi(O(\theta_T)\xi + \theta_T)\,d\xi,$$

viz.

$$\int e^{-(n/2)\xi^T D \xi}\,\pi(O(\theta_T)\xi + \theta_T)\,d\xi. \qquad (4)$$

At this point the authors, rather than using Laplace's method on the integral, choose $\pi$ as a product of individual $\pi_i$'s, one for each $\xi_i$. Each factor in that product has hyperparameters $\lambda_i$, $d_i$, and $b_i$, and the resulting $p$-dimensional integral in (4) has a closed form, as given at the end of Sec. 2.

An alternative is the more conventional approach of recognising that as $n \to \infty$ the integrand converges to unit mass at $\xi = 0$. Using this gives that $m(x^n)$ is approximately

$$p(x^n \mid \hat\theta)\,w(\theta_T)\,(2\pi)^{p/2}\det(nD)^{-1/2}\int \frac{e^{-(1/2)\xi^T((nD)^{-1})^{-1}\xi}}{(2\pi)^{p/2}\det(nD)^{-1/2}}\,d\xi = p(x^n \mid \hat\theta)\,w(\theta_T)\,(2\pi)^{p/2}\,\frac{1}{n^{p/2}\bigl((\prod_{i=1}^p d_i)^{1/p}\bigr)^{p/2}} = p(x^n \mid \hat\theta)\,w(\theta_T)\,\frac{(2\pi)^{p/2}}{(ns)^{p/2}},$$

where $s$ is the geometric mean of the $d_i$'s. The geometric mean is the side length of a $p$-dimensional cuboid with volume equal to $\prod_{i=1}^p d_i$. Thus, $s$ plays the role of a sort of average Fisher information for the collection of $\xi_i$'s. This sequence of approximations gives

$$\log m(x^n) = \ell(\hat\theta) + \log w(\theta_T) + \frac{p}{2}\log(2\pi) - \frac{p}{2}\log(ns).$$

This leads to a form of the BIC as

$$\mathrm{BIC}_S = -2\ell(\hat\theta) + p\log n = -2\log m(x^n) + 2\log w(\theta_T) + p\log(2\pi) - p\log s + o(1). \qquad (5)$$

Comparing (5) and (2), the only difference is that the Fisher information is summarised by $s$, a sort of average efficiency that in effect puts all parameters on the same scale.
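The identity that makes (5) and (2) line up is $\log\det I(\theta_T) = \sum_i \log d_i = p\log s$, since the determinant is the product of the eigenvalues. A quick numerical check, with a made-up positive definite matrix standing in for the Fisher information:

```python
import numpy as np

# A made-up symmetric positive-definite "Fisher information" for p = 3.
I_theta = np.array([[2.0, 0.3, 0.1],
                    [0.3, 1.5, 0.2],
                    [0.1, 0.2, 0.8]])

d = np.linalg.eigvalsh(I_theta)      # the diagonal of D after the rotation O
s = np.exp(np.log(d).mean())         # geometric mean of the d_i's
p = len(d)

# log det I(theta_T) and p log s agree, so the -log det term in (2) and the
# -p log s term in (5) are the same summary of estimation efficiency.
print(np.log(np.linalg.det(I_theta)), p * np.log(s))
```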

Roughly, $p\log s$ and $\log\det I(\theta)$ correspond to the term $\sum_{i=1}^p \log(1 + n_i^e)$ in the PBIC. The extra term in the PBIC, $-2\sum_{i=1}^p \log\bigl((1 - e^{-v_i})/(\sqrt{2}\,v_i)\bigr)$, seems to correspond to the log prior density term.

As a third way to look at the BIC, observe that neither (5) nor (2) has any clear analog to $n_i^e$, apart from the treatment of Fisher information and its interpretation as an efficiency. So, two natural questions are what the effective sample sizes mean and what they are doing. In the PBIC they are introduced as hyperparameters and are restricted to linear models. For instance, in Example 3.3, effective sample sizes are average precisions divided by the maximal precision, even though it is unclear why this expression has a claim to be an effective sample size.

On the other hand, in Sec. 3.2 a general definition of $n_j^e$ in terms of entries of $I(\theta)$ is given for each $j = 1, \ldots, p$. This is a valid generalisation of sample size because the $n_j^e$'s reduce to $n$. Indeed, in the IID case with large $n$, $I^*_{jj}(\hat\theta) \approx n I_{jj}(\theta_T)$ and $w_{ij} \approx 1/n$. So, $\sum_{i=1}^n w_{ij} I^*_{ijj}(\theta_T) \approx (1/n)\sum_{i=1}^n I^*_{ijj}(\theta_T) \approx (1/n)(n I_{jj}(\theta_T)) = I_{jj}(\theta_T)$. This gives $n_j^e \approx n I_{jj}(\theta_T)/I_{jj}(\theta_T) = n$. In this generalisation, each $n_j^e$ is closely related to the Fisher information and hence to the relative efficiency of estimating different parameters. Indeed, $n_i^e$ is, roughly, the total Fisher information for $\theta_i$ (over the sample) as a fraction of the convex combination of Fisher informations for the $\theta_j$'s over the data.

Now, it may make sense to use the definition of $n_j^e$ in Sec. 3.2 to generalise the BIC directly, i.e. find the $n_j^e$'s first, since they depend only on the Fisher informations and on $x^n$, and use them to propose a new BIC. For instance, consider

$$\mathrm{BIC}_{\mathrm{TESS}} = -2\ell(\hat\theta) + \sum_{i=1}^p \log n_i^e. \qquad (6)$$

In (6), the concept of effective sample size is used to account for the different efficiencies of estimating different parameters, making it valid to compare them. Note that (6) levels the playing field for the $f_i(\cdot \mid \theta)$'s in the log-likelihood so that they do not need to be modified. Thus, effective sample sizes have a meaning something like the sample size required to make the estimation of one parameter (to a prescribed accuracy) close to the sample size required to estimate another parameter (to the same accuracy), a parallel to the appearance of the geometric mean in (5).

At this point, one can go back to (3) and (4) and seek ways to justify using $n_i^e$ in place of $n$. Because (4) is nearly a product of univariate integrals, it may be possible to regard the elements on the diagonal of $D$ as a form of the Fisher information that permits replacement of $n$ with $n_i^e$. Similarly, the geometric mean used in (5) may be related (by, say, $\log$) to the ratios of sums of Fisher informations used to define $n_i^e$ in Sec. 3.2, thereby relating (5) and (6). Finally, (6) is not obviously related to $m(x^n)$, but one can hope that a suitably reformulated Laplace's method on (3) and (4) may lead to a compatible expression for it.

One interesting query the authors are well-placed to answer is whether the results of Sec. 5.5 hold if the PBIC is replaced by (6). After all, there should be reasonable conditions under which all the $n_i^e$'s from Sec. 3.2 increase fast enough with $n$, e.g. for all $n$, $0 < \eta < \min_j I^*_{jj} \le I^*_{jj} < \max_j I^*_{jj} < B < \infty$.
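To make (6) concrete, here is a sketch of $\mathrm{BIC}_{\mathrm{TESS}}$ under one plausible reading of the effective sample size: total Fisher information for $\theta_j$ over the sample divided by the maximal per-observation information, which reduces to $n$ in the IID case and mimics the flavour of Example 3.3. This reading is an assumption of this discussant, not the Sec. 3.2 formula.

```python
import numpy as np

def bic_tess(loglik_hat, I_obs):
    """Sketch of BIC_TESS in (6).  I_obs is an (n, p) array whose (i, j)
    entry is the Fisher information about theta_j carried by observation i.
    n_j^e is taken as total information over the sample divided by the
    maximal per-observation information; this reduces to n in the IID case
    but is NOT the exact Sec. 3.2 definition."""
    n_eff = I_obs.sum(axis=0) / I_obs.max(axis=0)   # one n_j^e per parameter
    return -2 * loglik_hat + np.log(n_eff).sum()

# Toy heteroscedastic regression: the information about theta_j contributed
# by observation i is x_ij^2 / sigma_i^2, so parameters seen mostly through
# noisy observations get smaller effective sample sizes.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
sigma2 = rng.uniform(0.5, 4.0, size=100)
I_obs = X ** 2 / sigma2[:, None]
print(bic_tess(loglik_hat=-140.0, I_obs=I_obs))     # loglik_hat is a placeholder
```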
3. Where to from here?

The authors have a very promising general definition in Sec. 3.2. Establishing a relationship between $n_j^e$ and the effective sample size formulae proposed for linear models would be useful but, more fundamentally, the question is whether the $n_j^e$ from Sec. 3.2 makes sense in such simpler contexts. If it does, then the fact that it differs from ‘TESS’ may not be very important. We strongly agree with the authors who write, apropos of $n_j^e$, that it should ‘be viewed primarily as a starting point for future investigations of effective sample size’. (They actually limit this point to nonlinear models, but for the sake of a satisfying overall theory it should apply to linear models as well.)

Another tack is to be overtly information-theoretic by defining an effective sample size in terms of codelength. One form of the relative entropy, see Clarke and Barron (1988), is implicit in (2). However, one can use an analogous formulation to convert a putative sample of size $n$ to an effective sample. Use a nonparametric method to form $h(x; x^n)$, an estimate of the density of $X$. Then, choose a ‘distortion rate’ $r$ and find $z^m$ for the smallest value of $m$ that satisfies $D(h(\cdot; x^n) \,\|\, h(\cdot; z^m)) \le r$, where $D(\cdot\|\cdot)$ is the relative entropy. This is the effective sample and sample size, since it recreates the empirical density with a tolerable level of distortion. The larger $r$ is, the more distortion is allowed and the smaller $m$ will be. Information-theoretically, this is the same as approximating a Shannon code based on $h(\cdot; x^n)$ by a Shannon code based on $h(\cdot; z^m)$ in terms of small redundancy, in, say, bits. This definition for effective sample size requires choosing $r$, but $D$ is in bits, so it would make sense for $r$ to be some number of bits per symbol, e.g. $\epsilon(\sigma n)^c$, where $\mathrm{Var}(X) = \sigma^2$, for some $c \in (0, 1]$, with $\epsilon = 1/2$ as a default.

Another way to look at this procedure for finding an effective sample size is via data compression. In this context, the rate distortion function is a well-studied quantity; see Cover and Thomas (2006), Chap. 10. The problem is that it's not obvious how to obtain an effective sample size from the rate distortion function or, in the parlance of information theory, a set of lower dimensional canonical representatives that achieve the rate distortion function lower bound. On the other hand, this can be done in practice, and further study may yield good solutions.
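A rough instantiation of this codelength procedure, with a Gaussian kernel density estimate standing in for $h$ and a random subsample standing in for $z^m$ (both choices are this discussant's; the proposal above leaves them open):

```python
import numpy as np
from scipy.stats import gaussian_kde

def effective_sample_size(x, r, grid_size=512, seed=0):
    """Smallest m with D(h(.; x^n) || h(.; z^m)) <= r, where h is a Gaussian
    KDE and z^m is a random subsample of x; D is computed in bits by
    numerical integration on a grid."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(x.min() - 3 * x.std(), x.max() + 3 * x.std(), grid_size)
    dx = grid[1] - grid[0]
    h_full = gaussian_kde(x)(grid)
    for m in range(5, len(x) + 1):
        z = rng.choice(x, size=m, replace=False)
        h_sub = gaussian_kde(z)(grid)
        # relative entropy D(h_full || h_sub) in bits
        kl = np.sum(h_full * (np.log2(h_full + 1e-300) - np.log2(h_sub + 1e-300))) * dx
        if kl <= r:
            return m
    return len(x)

x = np.random.default_rng(3).normal(size=200)
print(effective_sample_size(x, r=0.05))   # effective sample size at distortion rate r
```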

Finally, the rate distortion function is the result of an operation performed on a Shannon mutual information that, for parametric families, usually has an expression in terms of the Fisher information. Likewise, it is well known that certain relative entropies can be expressed in terms of Fisher information. So, the definitions of effective sample size from an information theory perspective (via rate distortion) and from Sec. 3.2 (via efficiency) may ultimately coincide.

Disclosure statement

No potential conflict of interest was reported by the author.

References

Clarke, B., & Barron, A. (1988). Information-theoretic asymptotics of Bayes methods (Technical Report #26). Statistics Department, University of Illinois.

Cover, T., & Thomas, J. (2006). Elements of information theory (2nd ed.). Hoboken, NJ: John Wiley and Sons.