
arXiv:2104.08324v1 [stat.ML] 16 Apr 2021

On the Robustness to Misspecification of α-Posteriors and Their Variational Approximations

Marco Avella Medina
Department of Statistics, Columbia University, New York, NY 10027, USA
[email protected]

José Luis Montiel Olea
Department of Economics, Columbia University, New York, NY 10027, USA
[email protected]

Cynthia Rush
Department of Statistics, Columbia University, New York, NY 10027, USA
[email protected]

Amilcar Velez
Department of Economics, Northwestern University, Evanston, IL 60208, USA
[email protected]

Abstract

α-posteriors and their variational approximations distort standard posterior inference by downweighting the likelihood and introducing variational approximation errors. We show that such distortions, if tuned appropriately, reduce the Kullback-Leibler (KL) divergence from the true, but perhaps infeasible, posterior distribution when there is potential parametric model misspecification. To make this point, we derive a Bernstein-von Mises theorem showing convergence in total variation distance of α-posteriors and their variational approximations to limiting Gaussian distributions. We use these distributions to evaluate the KL divergence between true and reported posteriors. We show this divergence is minimized by choosing α strictly smaller than one, assuming there is a vanishingly small probability of model misspecification. The optimized value of α becomes smaller as the misspecification becomes more severe. The optimized KL divergence increases logarithmically in the degree of misspecification and not linearly as with the usual posterior.

Keywords: α-posterior, variational inference, model misspecification, robustness.

1. Introduction

A recent body of work in statistics (Grünwald, 2011; Holmes and Walker, 2017; Grünwald, 2012; Bhattacharya et al., 2019; Miller and Dunson, 2019; Knoblauch et al., 2019) and probabilistic machine learning (Huang et al., 2018; Higgins et al., 2017; Burgess et al., 2018) has analyzed the statistical properties of α-posteriors and their variational approximations. The α-posteriors—also known as fractional, tempered, or power posteriors—are proportional to the product of the prior and the α-power of the likelihood (Ghosal and Van der Vaart, 2017, Chapter 8.6). Their variational approximations are defined as those distributions within some tractable subclass that minimize the Kullback-Leibler (KL) divergence to the α-posterior; see Alquier and Ridgway (2020), Definition 1.2. We contribute to this growing literature by investigating the robustness-to-misspecification of α-posteriors and their variational approximations, with a focus on low-dimensional, parametric models. Our analysis—motivated by the seminal work of Gustafson (2001)—is based on a simple idea. Suppose two different procedures lead to incorrect a posteriori inference (either due to a likelihood misspecification or computational considerations). Define one procedure to be more robust than the other if it is closer—in terms of KL divergence—to the true posterior. Is it true that α-posteriors (or their variational approximations) are more robust than standard posteriors? We answer this question using asymptotic approximations. We establish a Bernstein-von Mises theorem (BvM) in total variation distance for α-posteriors (Theorem 1) and for their (Gaussian mean-field) variational approximations (Theorem 2). Our result allows for both model misspecification and non-i.i.d. data. The main assumptions are that the likelihood ratio of the presumed model is stochastically locally asymptotically normal (LAN) as in Kleijn and Van der Vaart (2012) and that the α-posterior concentrates around the (pseudo-)true parameter at rate √n. Our theorem generalizes the results in Wang and Blei (2019a,b), who focus on the case in which α = 1. We also extend the results of Li et al. (2019), who establish the BvM theorem for α-posteriors under a weaker norm, but under more primitive conditions. These asymptotic distributions allow us to study our suggested measure of robustness by computing the KL divergence between multivariate Gaussians with parameters that depend on the data, the sample size, and the 'curvature' of the likelihood. One interesting observation is that relative to the BvM theorem for the standard posterior or its variational approximation, the choice of α only re-scales the limiting variance, with no mean effect. The new scaling is as if the observed sample size were α · n instead of n, but the location for the Gaussian approximation continues to be the maximum likelihood estimator. Thus,

the posterior mean of α-posteriors and their variational approximations has the same limit regardless of the value of α.

When computing our measures of robustness, we think of a researcher that, ex-ante, places some small exogenous probability ǫ_n of model misspecification and is thus interested in computing an expected KL. This forces us to analyze the KL divergence under both correct and incorrect specification. Under the assumption that as the sample size n increases, the probability of misspecification decreases so that nǫ_n → ε for a constant ε ∈ (0, ∞), we establish three main results (Theorem 3) that we believe speak to the robustness of α-posteriors and their variational approximations.

The first result shows that for a large enough sample size, the expected KL divergence between α-posteriors and the true posterior is minimized for some α*_n ∈ (0, 1). This means that, for a properly tuned value of α ∈ (0, 1), inference based on the α-posterior is asymptotically more robust than regular posterior inference (corresponding to the case α = 1).

Our calculations suggest that α*_n decreases as both the probability of misspecification ǫ_n and the difference between the true and pseudo-true parameter increase, where by pseudo-true parameter we mean the point in the parameter space that provides the best approximation (in terms of KL divergence) to the true data generating process. In other words, this analysis makes the reasonable suggestion that as the probability of the model being wrong increases, one should put less emphasis on it when computing the posterior.

The second result demonstrates that the Gaussian mean-field variational approximations of α-posteriors inherit some of the robustness properties of α-posteriors. In particular, it is shown that the expected KL divergence between the true posterior and the mean-field variational approximation to the α-posterior is also minimized at some α*_n ∈ (0, 1). Our second result thus provides support to the claim that variational approximations to α-posteriors are more robust to model misspecification than regular variational approximations (corresponding to α = 1). Our result also provides some theoretical support for the recent work of Higgins et al. (2017) that suggests introducing an additional hyperparameter β to temper the posterior when implementing variational inference (see Eq. 9 and the discussion that follows it).

The final result contrasts the expected KL of the optimized α-posterior (and also of the optimized α-variational approximation) against the expected KL of the regular posterior. We find that the latter increases linearly in the magnitude of misspecification (measured by the difference between the true parameter θ_0 and the pseudo-true parameter θ*), while the former does so logarithmically. This suggests that when the model misspecification is large,

there will be significant gains in robustness from using α-posteriors and their variational approximations.

1.1 Related Work

The idea that Bayesian procedures can be enhanced by adding an α parameter to decrease the influence of the likelihood in the posterior calculation has been suggested independently by many authors in the past, predating the more recent research studying its robustness properties under the name of α-posteriors. This includes work by Vovk (1990), McAllester (2003), Barron and Cover (1991), Walker and Hjort (2001), and Zhang et al. (2006).

A large part of the theoretical literature studying the robustness of α-posteriors and their variational approximations has focused on nonparametric or high-dimensional models. In these works, the term robustness has been used to mean that, in large samples, α-posteriors and their variational approximations concentrate around the pseudo-true parameter, even when the standard posterior does not. Bhattacharya et al. (2019) illustrate this point by providing examples of heavy-tailed priors in both density estimation and regression analysis where the α-posterior can be guaranteed to concentrate at (near) minimax rate, but the regular posterior cannot. Alquier and Ridgway (2020) also derive concentration rates for α-posteriors for high-dimensional and non-parametric models. Although comparison of the conditions required to verify concentration properties is natural in nonparametric or high-dimensional models, for the analysis of low-dimensional parametric models there are alternative suggestions in the literature. For example, Wang and Blei (2019a) compare the posterior predictive distribution based on the regular posterior and also its variational approximation, and they show that under likelihood misspecification, the difference between the two posterior predictive distributions converges to zero. Grünwald and Van Ommen (2017) also used (in-sample) predictive distributions to assess the performance of α-posteriors. They show that in a misspecified linear regression problem (where the researcher assumes homoskedasticity, but the data is heteroskedastic) α-posteriors can outperform regular posteriors.

Our work complements this recent body of work by demonstrating robustness in a different sense, while still connecting to previous work on the subject. While it is true that one may be content with just studying 'first order' properties of α-posteriors (like contraction) whenever the BvM does not hold, it is not yet clear how to properly formalize the first order benefits of α-posteriors and their variational approximations under misspecification. Yang et al. (2020) show that the necessary conditions for optimal contraction in misspecified models can be relaxed, but there are no results yet that formally tease apart the benefits of α-posteriors compared to the usual posteriors. Our results exploit heavily the BvM theorem which, in a

sense, is based on 'second order' approximations. We believe that our results may be useful in studying other models where BvM results are available, but are more complicated than the one considered in this paper, in that the underlying model is not necessarily low-dimensional and/or parametric; for example, semiparametric models where the objects of interest are smooth functionals as in Castillo and Rousseau (2015).

More importantly, our results are an attempt to provide an additional rationale for the use of variational inference methods due to their robustness to model misspecification, rather than solely on the basis of computational considerations. We hope that our results will encourage the use of variational methods in fields like applied statistics and econometrics where variational inference remains somewhat underutilized despite its tremendous impact in machine learning. Recent applications of variational inference in econometrics include Bonhomme (2021), Mele and Zhu (2019), and Koop and Korobilis (2018).

1.2 Organization

The rest of this paper is organized as follows. Section 2 presents definitions, notations, and our general framework. Section 3 presents the Bernstein-von Mises theorem for α-posteriors and their variational approximations. Section 4 presents our main results on the theoretical analysis of our suggested measure of robustness. Section 5 presents an example concerning a Gaussian linear regression model with omitted variables. Section 6 presents some concluding remarks and discussion. The proofs of our main results are collected in Appendix A.

2. Notation and General Framework

2.1 Paper Notation and Definitions

We begin by introducing notation and definitions that we will use throughout the paper. We use φ(·|µ, Σ) to denote the density of a multivariate normal random vector with mean µ and covariance matrix Σ. The indicator function of event A, namely the function that takes the value 1 if event A occurs and 0 otherwise, is denoted 1{A}. If p and q are two densities with respect to Lebesgue measure in R^p, the total variation distance between them, denoted d_TV(p, q), equals

\[
d_{TV}(p, q) \equiv \frac{1}{2} \int_{\mathbb{R}^p} |p(u) - q(u)| \, du . \tag{1}
\]


See Section 2.9 in Van der Vaart (2000) for more details on the total variation distance. The KL divergence between the two distributions p and q, denoted K(p‖q), is defined as

\[
\mathcal{K}(p \,\|\, q) \equiv \int p(u) \log\left( \frac{p(u)}{q(u)} \right) du . \tag{2}
\]

For a sequence of densities p_n on random variables X_n, we say that X_n = o_{p_n}(1) if lim_n P_{p_n}(‖X_n‖ > ǫ) = 0 for every ǫ > 0, and that the sequence X_n is 'bounded in p_n-probability' if for every ǫ > 0 there exists M_ǫ > 0 such that P_{p_n}(‖X_n‖ < M_ǫ) ≥ 1 − ǫ.
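The following minimal numerical sketch (not part of the original text) evaluates definitions (1) and (2) for two illustrative univariate Gaussian densities on a grid; the means, variances, and grid are arbitrary choices made only for illustration.

```python
# Numerical evaluation of d_TV (eq. 1) and K(p||q) (eq. 2) for two toy densities.
import numpy as np
from scipy.stats import norm

u = np.linspace(-10.0, 10.0, 200001)        # integration grid
du = u[1] - u[0]
p = norm.pdf(u, loc=0.0, scale=1.0)         # density p (illustrative)
q = norm.pdf(u, loc=1.0, scale=1.5)         # density q (illustrative)

d_tv = 0.5 * np.sum(np.abs(p - q)) * du     # total variation distance, eq. (1)
kl_pq = np.sum(p * np.log(p / q)) * du      # KL divergence, eq. (2)

print(f"d_TV(p, q) ~ {d_tv:.4f}, K(p||q) ~ {kl_pq:.4f}")
```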

2.2 Statistical Model

Let F_n ≡ { f_n(·|θ) : θ ∈ Θ ⊆ R^p } be a parametric family of densities used as a statistical model for the random vector X^n ≡ (X_1, ..., X_n). The statistical model may be misspecified, in the sense that if f_{0,n} is the true density for the random vector X^n, then f_{0,n} may not belong to F_n. As usual, we define the maximum likelihood (ML) estimator—denoted by θ̂_{ML-F_n}—as

\[
\hat{\theta}_{\mathrm{ML}\text{-}\mathcal{F}_n} \equiv \arg\max_{\theta \in \Theta} f_n(X^n \mid \theta) . \tag{3}
\]

For simplicity, we assume that the ML estimator is unique, and that there is a parameter value θ* in the interior of Θ for which √n(θ̂_{ML-F_n} − θ*) is asymptotically normal. Sufficient conditions for the consistency and asymptotic normality of the ML estimator under model misspecification with i.i.d. data can be found in Huber (1967); White (1982); Kleijn and Van der Vaart (2012). Examples of other papers establishing consistency and asymptotic normality under misspecification for certain types of non-i.i.d. data are given in

Pouzo et al. (2016). If the model is correctly specified, then θ* is simply the true parameter, but if the model is misspecified, then θ* provides the best approximation (in terms of KL divergence) to the true data generating process. In the latter misspecified case, it is then common to refer to θ* as the pseudo-true parameter. We focus on the case in which the ML estimator is asymptotically normal to highlight the fact that none of our results depend on atypical asymptotic distributions. The main restriction we impose on F_n is the √n-stochastic local asymptotic normality (LAN) condition of Kleijn and Van der Vaart (2012) around θ*.

Assumption 1 Denote ∆_{n,θ*} ≡ √n(θ̂_{ML-F_n} − θ*). There exists a positive definite matrix V_{θ*} such that

\[
R_n(h) \equiv \log\left( \frac{f_n(X^n \mid \theta^* + h/\sqrt{n})}{f_n(X^n \mid \theta^*)} \right) - h^\top V_{\theta^*} \Delta_{n,\theta^*} + \frac{1}{2}\, h^\top V_{\theta^*} h , \tag{4}
\]

satisfies

\[
\sup_{h \in K} |R_n(h)| \to 0 , \tag{5}
\]

in f_{0,n}-probability, for any compact set K ⊆ R^p.

Noticing that

\[
h^\top V_{\theta^*} \Delta_{n,\theta^*} - \frac{1}{2}\, h^\top V_{\theta^*} h
= \log\left( \frac{\phi\big(\theta^* + h/\sqrt{n} \mid \hat{\theta}_{\mathrm{ML}\text{-}\mathcal{F}_n}, (n V_{\theta^*})^{-1}\big)}{\phi\big(\theta^* \mid \hat{\theta}_{\mathrm{ML}\text{-}\mathcal{F}_n}, (n V_{\theta^*})^{-1}\big)} \right) , \tag{6}
\]

we can see that Assumption 1 implies that the corresponding likelihood ratio process for F_n approximates, asymptotically, that of a normal random variable.

2.3 α-posteriors and their Variational Approximations

We now present the definition of α-posteriors and their variational approximations. Starting from the statistical model F_n, a prior density π for θ, and a scalar α > 0, the α-posterior is defined as the distribution having density:

\[
\pi_{n,\alpha}(\theta \mid X^n) \equiv \frac{[f_n(X^n \mid \theta)]^{\alpha}\, \pi(\theta)}{\int [f_n(X^n \mid \theta)]^{\alpha}\, \pi(\theta)\, d\theta} . \tag{7}
\]

See Chapter 8.6 in Ghosal and Van der Vaart (2017) for a textbook definition.
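As a sketch of definition (7) in action, the following snippet (not from the original text) computes the α-posterior on a grid for a toy model—i.i.d. N(θ, 1) data with a N(0, 5²) prior; the sample size, prior, and values of α are all illustrative assumptions.

```python
# Toy alpha-posterior per eq. (7): exp(alpha * log-likelihood + log-prior), renormalized.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=1.0, size=50)                 # observed sample (illustrative)
theta = np.linspace(-2.0, 4.0, 4001)                        # parameter grid
dtheta = theta[1] - theta[0]

log_lik = -0.5 * ((x[:, None] - theta[None, :]) ** 2).sum(axis=0)   # log f_n(X^n | theta), up to a constant
log_prior = -0.5 * (theta / 5.0) ** 2                               # log pi(theta), up to a constant

def alpha_posterior(alpha):
    log_post = alpha * log_lik + log_prior
    post = np.exp(log_post - log_post.max())
    return post / (post.sum() * dtheta)

for a in (1.0, 0.5, 0.1):
    pa = alpha_posterior(a)
    mean = (theta * pa).sum() * dtheta
    var = ((theta - mean) ** 2 * pa).sum() * dtheta
    print(f"alpha={a:>4}: mean ~ {mean:.3f}, variance ~ {var:.4f}")
# Smaller alpha leaves the posterior location essentially unchanged but inflates the
# variance, in line with the discussion of Theorem 1 below.
```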

The projection of (7)—in Kullback-Leibler (KL) divergence—onto the space of probability distributions with independent marginals, also referred to as the mean-field family and denoted Q_MF, provides the mean-field variational approximation to the α-posterior:

\[
\widetilde{\pi}_{n,\alpha}(\cdot \mid X^n) \in \arg\min_{q \in \mathcal{Q}_{\mathrm{MF}}} \mathcal{K}\big( q \,\|\, \pi_{n,\alpha}(\cdot \mid X^n) \big) . \tag{8}
\]

There is a trade-off between choosing a flexible and rich enough domain for the optimization in (8), so that q can be close to π_{n,α}(θ|X^n), and choosing a domain that is also constrained enough such that the optimization is computationally feasible. The projection onto the Gaussian mean-field family, studied in equation (8), is a particular case of the more general variational approximations studied in the recent work of Alquier and Ridgway (2020), who allow for other sets of distributions over which the KL is minimized. A key insight of the variational framework is that minimizing the KL divergence in (8) is equivalent to solving

the program

\[
\widetilde{\pi}_{n,\alpha}(\cdot \mid X^n) \equiv \arg\min_{q \in \mathcal{Q}_{\mathrm{MF}}} \left\{ \frac{1}{\alpha}\, \mathcal{K}(q \,\|\, \pi) - \int q(\theta) \log\big( f_n(X^n \mid \theta) \big)\, d\theta \right\} . \tag{9}
\]

The objective function in (9) is reminiscent of penalized estimation: it involves a data-fitting term (the average log-likelihood) and a regularization or penalization term that forces the distribution q to be close to a baseline prior π with regularization parameter 1/α.
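A short derivation sketch (not displayed in the original text) of why the program in (9) is equivalent to the KL minimization in (8); it only uses the definition of the α-posterior in (7):

\[
\begin{aligned}
\mathcal{K}\big( q \,\|\, \pi_{n,\alpha}(\cdot \mid X^n) \big)
&= \int q(\theta)\log q(\theta)\, d\theta - \alpha \int q(\theta)\log f_n(X^n \mid \theta)\, d\theta - \int q(\theta)\log \pi(\theta)\, d\theta + \log \int [f_n(X^n \mid \theta)]^{\alpha}\pi(\theta)\, d\theta \\
&= \alpha \left[ \frac{1}{\alpha}\, \mathcal{K}(q \,\|\, \pi) - \int q(\theta)\log f_n(X^n \mid \theta)\, d\theta \right] + \mathrm{const},
\end{aligned}
\]

where the constant does not depend on q. Since α > 0, minimizing the left-hand side over q ∈ Q_MF is the same as solving (9).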

The optimization scheme in (9) has been the subject of recent work in the representation learning literature in Burgess et al. (2018) and Higgins et al. (2017) under the name of the β-variational autoencoder. The optimization problem has also been studied in axiomatic decision theory; see, for example, the multiplier preferences introduced in Hansen and Sargent (2001) and their axiomatization in Strzalecki (2011). More generally, the objective function in (9) with an arbitrary divergence function is analogous to the so-called divergence preferences studied in Maccheroni et al. (2006). We think this literature could be potentially useful in understanding the role of the different divergence functions in penalizations, as well as the multiplier parameter α.

3. Bernstein-von Mises Theorem for α-posteriors and their Variational Approximations

Before presenting a formal definition of the measure of robustness used in this paper, we show that α-posteriors and their variational approximations are asymptotically normal. This extends the Bernstein-von Mises (BvM) theorem for misspecified models in Kleijn and Van der Vaart (2012), which shows that posteriors under misspecified models are asymptotically normal, and the recent Variational BvM of Wang and Blei (2019a), which shows that variational approximations of true posteriors are asymptotically normal. We show that the total variation distance between the studied distribution and its limiting Gaussian distribution converges in probability to zero with growing sample size.

3.1 BvM for α-posteriors

We say that the α-posterior π_{n,α}(·|X^n) defined in (7) concentrates at rate √n around θ* if, for every sequence of constants r_n → ∞,

\[
\mathbb{E}_{f_{0,n}}\left[ \int \mathbf{1}\{ \| \sqrt{n}\,(\theta - \theta^*) \| > r_n \}\, \pi_{n,\alpha}(\theta \mid X^n)\, d\theta \right] \to 0 . \tag{10}
\]


Notice that

\[
\mathbb{E}_{f_{0,n}}\left[ \int \mathbf{1}\{ \| \sqrt{n}\,(\theta - \theta^*) \| > r_n \}\, \pi_{n,\alpha}(\theta \mid X^n)\, d\theta \right]
= \mathbb{E}_{f_{0,n}}\left[ \mathbb{P}_{\pi_{n,\alpha}(\cdot \mid X^n)}\big( \| \sqrt{n}\,(\theta - \theta^*) \| > r_n \big) \right] .
\]

Theorem 1 Suppose that the prior density π is continuous and positive on a neighborhood around the (pseudo-)true parameter θ*, and that π_{n,α}(·|X^n) concentrates at rate √n around θ*, as in (10). If Assumption 1 holds, then

\[
d_{TV}\Big( \pi_{n,\alpha}(\cdot \mid X^n),\; \phi\big(\cdot \mid \hat{\theta}_{\mathrm{ML}\text{-}\mathcal{F}_n}, V_{\theta^*}^{-1}/(\alpha n)\big) \Big) \to 0 , \tag{11}
\]

in f_{0,n}-probability, where d_TV(·, ·) denotes the total variation distance defined in (1) and V_{θ*} is the positive definite matrix satisfying Assumption 1.

In a nutshell, the theorem states that the α-posterior distribution behaves asymptotically as a multivariate normal distribution, centered at the ML estimator, θ̂_{ML-F_n}, which is based on the potentially misspecified model F_n. Thus, the theorem shows that the choice of α does not asymptotically affect the location of the α-posterior distribution. However, Theorem 1 shows that the asymptotic covariance matrix of the α-posterior is given by V_{θ*}^{-1}/(αn); hence, the parameter α inflates the asymptotic variance when α < 1, and deflates it otherwise. The matrix V_{θ*} is the second-order term in the stochastic LAN approximation in Assumption 1, and its inverse is the usual variance in the BvM theorem for correctly or incorrectly specified models. Intuitively, V_{θ*} can be thought of as measuring the curvature of the likelihood. The proof and its details are presented in Appendix A.1. For the sake of exposition, we present a brief intuitive argument for why the result should hold. By assumption, the α-posterior concentrates around θ* at rate √n, in the sense of (10). Consider the log-likelihood ratio, for some vector h ∈ R^p,

\[
\log\left( \frac{\pi_{n,\alpha}(\theta^* + h/\sqrt{n} \mid X^n)}{\pi_{n,\alpha}(\theta^* \mid X^n)} \right)
= \log\left( \left[ \frac{f_n(X^n \mid \theta^* + h/\sqrt{n})}{f_n(X^n \mid \theta^*)} \right]^{\alpha} \frac{\pi(\theta^* + h/\sqrt{n})}{\pi(\theta^*)} \right) . \tag{12}
\]

If π_{n,α}(·|X^n) were exactly a multivariate normal having mean θ̂_{ML-F_n} and covariance matrix V_{θ*}^{-1}/(αn), the log-likelihood ratio in (12) would equal

\[
h^\top (\alpha V_{\theta^*})\, \sqrt{n}\,\big( \hat{\theta}_{\mathrm{ML}\text{-}\mathcal{F}_n} - \theta^* \big) - \frac{1}{2}\, h^\top (\alpha V_{\theta^*})\, h . \tag{13}
\]

The log-likelihood ratios in (12) and (13) are not equal, since π_{n,α}(·|X^n) is not exactly multivariate normal, but the continuity of π at θ* and the stochastic LAN property of Assumption

1 make them equal up to an o_{f_{0,n}}(1) term.


Most of the work in the proof of Theorem 1 consists of relating the closeness in log-likelihood ratios to closeness in total variation distance. The arguments we use to make this connection follow verbatim the arguments used by Kleijn and Van der Vaart (2012). To the best of our knowledge the result in Theorem 1 has not appeared previously in the literature, although Section 4.1 in Li et al. (2019) presents a different version of this result in a weaker metric (convergence in distribution, as opposed to total variation), but using lower-level conditions (as opposed to our high-level assumptions).
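The following simulation sketch (not part of the original text) illustrates the conclusion of Theorem 1 in a toy, correctly specified conjugate model where the α-posterior is exactly Gaussian: X_i ~ N(θ_0, 1) with a N(0, 10²) prior. The parameter values and grid are illustrative assumptions; the total variation distance to the limiting Gaussian shrinks as n grows.

```python
# Sanity check of Theorem 1 in a conjugate Gaussian toy model (all numbers illustrative).
import numpy as np

rng = np.random.default_rng(0)
theta0, tau2, alpha = 1.0, 100.0, 0.5

def tv_to_limit(n):
    x = rng.normal(theta0, 1.0, size=n)
    theta_ml = x.mean()                             # ML estimator; V_{theta*} = 1 here
    post_prec = alpha * n + 1.0 / tau2              # exact alpha-posterior precision
    post_mean = alpha * x.sum() / post_prec
    half_width = 10 / np.sqrt(alpha * n)
    grid = np.linspace(theta_ml - half_width, theta_ml + half_width, 20001)
    dg = grid[1] - grid[0]
    def npdf(m, v):
        return np.exp(-0.5 * (grid - m) ** 2 / v) / np.sqrt(2 * np.pi * v)
    p_alpha = npdf(post_mean, 1.0 / post_prec)      # exact alpha-posterior density
    p_limit = npdf(theta_ml, 1.0 / (alpha * n))     # Gaussian limit from Theorem 1
    return 0.5 * np.sum(np.abs(p_alpha - p_limit)) * dg

for n in (10, 100, 1000, 10000):
    print(f"n = {n:>5}: d_TV(alpha-posterior, limit) ~ {tv_to_limit(n):.4f}")
```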

3.2 BvM for Variational Approximations of α-posteriors

Given that π_{n,α} is close to a multivariate normal distribution, it is natural to conjecture that the variational approximation π̃_{n,α} will converge to the projection of such a multivariate normal distribution onto the mean-field family Q_MF. To formalize this argument, let Q_{GMF-p} denote the family of multivariate normal distributions of dimension p with independent marginals. An element in this family is parameterized by a vector µ ∈ R^p and a positive semi-definite diagonal covariance matrix Σ ∈ R^{p×p}. When convenient, we denote such an element as q(·|µ, Σ) and will always implicitly assume that Σ is diagonal. We focus on the Gaussian mean-field approximation to the α-posterior. Given a sample of size n, this can be defined as the Gaussian distribution with parameters µ̃_n, Σ̃_n satisfying

\[
\widetilde{\pi}_{n,\alpha}(\cdot \mid X^n) = q(\cdot \mid \widetilde{\mu}_n, \widetilde{\Sigma}_n) \in \arg\min_{q \in \mathcal{Q}_{\mathrm{GMF}\text{-}p}} \mathcal{K}\big( q(\cdot) \,\|\, \pi_{n,\alpha}(\cdot \mid X^n) \big) . \tag{14}
\]

Theorem 1 has shown that π_{n,α}(·|X^n) is close to a multivariate normal distribution with mean θ̂_{ML-F_n} and variance V_{θ*}^{-1}/(αn). Let q(·|µ*_n, Σ*_n) denote the multivariate normal distribution in the Gaussian mean-field family closest to such limit. That is,

\[
q(\cdot \mid \mu_n^*, \Sigma_n^*) = \arg\min_{q \in \mathcal{Q}_{\mathrm{GMF}\text{-}p}} \mathcal{K}\Big( q(\cdot) \,\|\, \phi\big(\cdot \mid \hat{\theta}_{\mathrm{ML}\text{-}\mathcal{F}_n}, V_{\theta^*}^{-1}/(\alpha n)\big) \Big) . \tag{15}
\]

Algebra shows that the optimization problem (15) has a simple (and unique) closed-form solution when V_{θ*} is positive definite; namely:

\[
\mu_n^* = \hat{\theta}_{\mathrm{ML}\text{-}\mathcal{F}_n} , \qquad \Sigma_n^* = \mathrm{diag}(V_{\theta^*})^{-1}/(\alpha n) . \tag{16}
\]

In words, q(·|µ*_n, Σ*_n) is the distribution in the Gaussian mean-field family having the same mean as the limiting distribution of π_{n,α}(·|X^n) but, as we will show, underestimates the covariance. We would like to show that the total variation distance between the distributions


π̃_{n,α}(·|X^n) and q(·|µ*_n, Σ*_n) converges in probability to zero, provided the prior and the likelihood satisfy some regularity conditions. To do this, let θ* and R_n(h) be defined as in Assumption 1.

Assumption 2 For any sequence (µ_n, Σ_n) such that (√n(µ_n − θ*), nΣ_n) is bounded in f_{0,n}-probability, the residual R_n(h) in Assumption 1 and the prior π are such that

\[
\int \phi\big( h \mid \sqrt{n}(\mu_n - \theta^*), n\Sigma_n \big) \log\left( \frac{\pi(\theta^* + h/\sqrt{n})}{\pi(\theta^*)} \right) dh \to 0 , \tag{17}
\]

and

\[
\int \phi\big( h \mid \sqrt{n}(\mu_n - \theta^*), n\Sigma_n \big)\, R_n(h)\, dh \to 0 . \tag{18}
\]

In both cases the convergence is in f_{0,n}-probability.

Theorem 2 Suppose that (√n(µ̃_n − θ*), nΣ̃_n) is bounded in f_{0,n}-probability, where (µ̃_n, Σ̃_n) is the sequence defining π̃_{n,α}(·|X^n) = q(·|µ̃_n, Σ̃_n). If Assumptions 1 and 2 hold, then

\[
d_{TV}\big( \widetilde{\pi}_{n,\alpha}(\cdot \mid X^n),\; q(\cdot \mid \mu_n^*, \Sigma_n^*) \big) \to 0 , \tag{19}
\]

in f_{0,n}-probability, where µ*_n and Σ*_n are defined in (16).

In words, Theorem 2 shows that the Gaussian mean-field approximation to the α-posterior converges to the Gaussian mean-field approximation of the asymptotic distribution of the α-posterior. Indeed, the mean and variance parameter of this normal distribution are obtained by projecting the limiting distribution obtained in Theorem 1 onto the Gaussian mean-field family.
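As a numerical sketch of the closed form in (16) (not part of the original text): the KL projection of a Gaussian N(m, S) onto Gaussians with diagonal covariance keeps the mean and uses diag(S^{-1})^{-1} as its covariance. The matrix S below is an arbitrary illustrative choice standing in for V_{θ*}^{-1}/(αn).

```python
# Check that the closed-form mean-field projection minimizes the Gaussian KL divergence.
import numpy as np

def kl_gauss(mu1, S1, mu2, S2):
    # KL divergence between N(mu1, S1) and N(mu2, S2); see formula (43) in Appendix A.
    p = len(mu1)
    S2_inv = np.linalg.inv(S2)
    d = mu2 - mu1
    return 0.5 * (np.log(np.linalg.det(S2) / np.linalg.det(S1))
                  + np.trace(S2_inv @ S1) + d @ S2_inv @ d - p)

m = np.array([0.3, -0.1])
S = np.array([[1.0, 0.6],
              [0.6, 2.0]])                                   # target covariance (illustrative)
Sigma_star = np.diag(1.0 / np.diag(np.linalg.inv(S)))        # diag(S^{-1})^{-1}, as in (16)

best = kl_gauss(m, Sigma_star, m, S)
# Rescaling the diagonal covariance never decreases the KL divergence to N(m, S).
for scale in (0.8, 0.9, 1.1, 1.2):
    assert kl_gauss(m, scale * Sigma_star, m, S) >= best
print("closed-form projection attains the smallest KL among the candidates tried:", best)
```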

A detailed proof of Theorem 2 can be found in Appendix A.2. A similar result was obtained by Wang and Blei (2019a) for the case of α = 1. Thus, Theorem 2 can be viewed as a generalization of their variational BvM Theorem, applicable to the variational approximations of α-posteriors. We note however that our proof technique is quite different from theirs. Indeed, we require a simpler set of assumptions because we restrict ourselves to Gaussian mean-field variational approximations to the α-posterior. This enables us to work out a simplified argument that explicitly leverages formulas obtained by computing the KL divergence between two Gaussians. The key intermediate step in our proof is an asymptotic representation result stated formally in Lemma 6 in Appendix C showing that, under Assumptions 1 and 2, for any sequence (µ_n, Σ_n) such that (√n(µ_n − θ*), nΣ_n) is bounded in f_{0,n}-probability,

we have that

\[
\mathcal{K}\big( q(\cdot \mid \mu_n, \Sigma_n) \,\|\, \pi_{n,\alpha}(\cdot \mid X^n) \big)
= \mathcal{K}\Big( q(\cdot \mid \mu_n, \Sigma_n) \,\|\, \phi\big(\cdot \mid \hat{\theta}_{\mathrm{ML}\text{-}\mathcal{F}_n}, V_{\theta^*}^{-1}/(\alpha n)\big) \Big) + o_{f_{0,n}}(1) .
\]

We use this lemma to show that projecting the α-posterior onto the space of Gaussian mean-field distributions is approximately equal to projecting the α-posterior's total variation limit in Theorem 1. In particular, the proof of Theorem 2 shows that

\[
\mathcal{K}\big( \widetilde{\pi}_{n,\alpha}(\cdot \mid X^n; \widetilde{\mu}_n, \widetilde{\Sigma}_n) \,\|\, q(\cdot \mid \mu_n^*, \Sigma_n^*) \big) \to 0 , \tag{20}
\]

in f_{0,n}-probability. The statement in (19) then follows from the above limit by Pinsker's inequality, i.e., d_TV(P, Q) ≤ √(2 K(P‖Q)) for any two probability distributions P and Q. (See part iii) of Lemma B.1 in Ghosal and Van der Vaart (2017) for a textbook reference on Pinsker's inequality.)

4. Misspecification Robustness Analysis

This section introduces the measure of robustness we study in this work and our main results related to it. As we will explain below, the main idea is to measure closeness—in terms of KL divergence—of the α-posterior or its variational approximations to the correct posterior. To formalize this discussion, we introduce a bit more notation. In order to analyze model misspecification, following Gustafson (2001), we posit a correctly specified parametric model that takes the form G_n ≡ { g_n(·|θ, γ) : θ ∈ Θ, γ ∈ Γ } (notice that under model misspecification, G_n will differ from F_n). Correct specification of G_n here simply means that, for any sample size n, there exist parameters (θ*_n, γ*_n) ∈ Θ × Γ for which g_n(·|θ*_n, γ*_n) equals f_{0,n}. Note that the parameter θ is well-defined in both F_n and G_n. Let π* be a prior over Θ × Γ. We let π*_n(θ|X^n) denote the posterior for θ based on G_n and π*. When F_n is misspecified, we refer to π*_n(θ|X^n) as the true posterior, and when F_n is correctly specified, the true posterior is simply given by π_{n,α} evaluated at α = 1. In a slight abuse of notation, we denote the ML estimator of θ based on G_n simply as θ̂_ML. Our goal is to compute the KL divergence between α-posteriors (or their variational approximations) and the correct posterior. To do so, we need to consider two cases: one in which F_n is misspecified (in the sense that f_{0,n} ∉ F_n) and another in which F_n is correctly specified. We assume that the decision maker does not know ex-ante whether the model is correctly specified or not, and we denote by ǫ_n the probability of F_n being misspecified (consequently, 1 − ǫ_n denotes the probability of correct specification).


To assess the robustness of α-posteriors, our interest is in computing the expected value of the KL divergence:

\[
r_n(\alpha) \equiv \epsilon_n\, \mathcal{K}\big( \pi_n^*(\theta \mid X^n) \,\|\, \pi_{n,\alpha}(\theta \mid X^n) \big) + (1 - \epsilon_n)\, \mathcal{K}\big( \pi_{n,1}(\theta \mid X^n) \,\|\, \pi_{n,\alpha}(\theta \mid X^n) \big) . \tag{21}
\]

The first KL divergence term on the right side of (21) measures how difficult it is to distinguish between the true posterior under model misspecification and the α-posterior. The second is the KL divergence between the regular posterior (α = 1) and the α-posterior. These terms are weighted by the probability of misspecification. Values of α that lead to smaller values of r_n(α) are said to be more robust to parametric misspecification, as the corresponding α-posterior is 'closer' to the true posterior. To the best of our knowledge, the idea of using the KL divergence between reported posteriors and true posteriors was first introduced by Gustafson (2001).

Likewise, we could analyze the robustness of variational α-posteriors by studying:

\[
\widetilde{r}_n(\alpha) \equiv \epsilon_n\, \mathcal{K}\big( \pi_n^*(\theta \mid X^n) \,\|\, \widetilde{\pi}_{n,\alpha}(\theta \mid X^n) \big) + (1 - \epsilon_n)\, \mathcal{K}\big( \pi_{n,1}(\theta \mid X^n) \,\|\, \widetilde{\pi}_{n,\alpha}(\theta \mid X^n) \big) , \tag{22}
\]

where (22) is the same as (21), except we have replaced the α-posterior π_{n,α}(θ|X^n) in (21) with its variational approximation π̃_{n,α}(θ|X^n) in (22). Both r_n(α) and r̃_n(α) are random variables (where the randomness comes from the sampled data X^n used to construct the posterior distributions) and their magnitudes depend on the structure of the correctly and incorrectly specified models, on the priors, and on the sample size. We are able to make progress on the analysis of (21) and (22) by relying on asymptotic approximations to the infeasible posterior, the regular posterior, and the α-posterior and its variational approximation.

It is well known that the Bernstein-von Mises theorem for correctly specified models—e.g., DasGupta (2008), p. 291—implies that under some regularity conditions on the statistical model G_n and the prior π* (analogous to Assumption 1 and (10)), the true, infeasible posterior π*_n(θ|X^n) is close in total variation distance to the p.d.f. of a N(θ̂_ML, Ω/n) random variable. Theorems 1 and 2 naturally suggest surrogates for (21) and (22), where we replace the densities by their asymptotically normal approximations. We denote the surrogate measures by r*_n(α) and r̃*_n(α). Define

e αn∗ arg min rn∗ (α) , αn∗ arg min rn∗ (α) , (23) α 0 α 0 ≡ ≥ ≡ ≥ e e

where

\[
\begin{aligned}
r_n^*(\alpha) \equiv{}& \epsilon_n\, \mathcal{K}\Big( \phi(\cdot \mid \hat{\theta}_{\mathrm{ML}}, \Omega/n) \,\Big\|\, \phi\big(\cdot \mid \hat{\theta}_{\mathrm{ML}\text{-}\mathcal{F}_n}, V_{\theta^*}^{-1}/(\alpha n)\big) \Big) \\
&+ (1 - \epsilon_n)\, \mathcal{K}\Big( \phi\big(\cdot \mid \hat{\theta}_{\mathrm{ML}\text{-}\mathcal{F}_n}, V_{\theta^*}^{-1}/n\big) \,\Big\|\, \phi\big(\cdot \mid \hat{\theta}_{\mathrm{ML}\text{-}\mathcal{F}_n}, V_{\theta^*}^{-1}/(\alpha n)\big) \Big) ,
\end{aligned}
\tag{24}
\]

and

\[
\begin{aligned}
\widetilde{r}_n^*(\alpha) \equiv{}& \epsilon_n\, \mathcal{K}\Big( \phi(\cdot \mid \hat{\theta}_{\mathrm{ML}}, \Omega/n) \,\Big\|\, \phi\big(\cdot \mid \hat{\theta}_{\mathrm{ML}\text{-}\mathcal{F}_n}, \mathrm{diag}(V_{\theta^*})^{-1}/(\alpha n)\big) \Big) \\
&+ (1 - \epsilon_n)\, \mathcal{K}\Big( \phi\big(\cdot \mid \hat{\theta}_{\mathrm{ML}\text{-}\mathcal{F}_n}, V_{\theta^*}^{-1}/n\big) \,\Big\|\, \phi\big(\cdot \mid \hat{\theta}_{\mathrm{ML}\text{-}\mathcal{F}_n}, \mathrm{diag}(V_{\theta^*})^{-1}/(\alpha n)\big) \Big) .
\end{aligned}
\tag{25}
\]

Theorem 3 Let p ≡ dim(θ), let V_{θ*} be the positive definite matrix satisfying Assumption 1, and let Ṽ_{θ*} ≡ diag(V_{θ*}). Suppose θ̂_ML → θ_0 in f_{0,n}-probability. If θ_0 ≠ θ* and nǫ_n → ε ∈ (0, ∞), then

\[
\alpha_n^* \to \frac{p}{p + \varepsilon (\theta_0 - \theta^*)^\top V_{\theta^*} (\theta_0 - \theta^*)} < 1 , \tag{26}
\]
\[
\widetilde{\alpha}_n^* \to \frac{p}{\mathrm{tr}\big(\widetilde{V}_{\theta^*} V_{\theta^*}^{-1}\big) + \varepsilon (\theta_0 - \theta^*)^\top \widetilde{V}_{\theta^*} (\theta_0 - \theta^*)} < 1 , \tag{27}
\]

where the convergence is in f_{0,n}-probability.

We now discuss the meaning and implications of Theorem 3. Equation (26) says that, for a properly tuned value of α, the α-posterior is—with high probability and in large samples—more robust than the regular posterior. The limit of α*_n suggests that the properly tuned value of α decreases as the probability of misspecification increases. This makes conceptual sense: it seems reasonable to down-weight the likelihood if it is known that, very likely, it is misspecified. On the other extreme, if the likelihood is known to be correct there is no gain from down-weighting the likelihood, as this simply creates a difference in the scaling of the α-posterior relative to the true posterior.
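The next sketch (not part of the original text) evaluates the surrogate objective (24) numerically and compares its minimizer with the limit in (26). All numerical inputs are placeholders: in particular, we set Ω = V_{θ*}^{-1} for simplicity and replace the ML estimators by their probability limits θ_0 and θ*.

```python
# Illustrative check of (24) and the limit in (26); every value below is a placeholder.
import numpy as np
from scipy.optimize import minimize_scalar

def kl_gauss(mu1, S1, mu2, S2):
    p = len(mu1)
    S2_inv = np.linalg.inv(S2)
    d = mu2 - mu1
    return 0.5 * (np.log(np.linalg.det(S2) / np.linalg.det(S1))
                  + np.trace(S2_inv @ S1) + d @ S2_inv @ d - p)

p = 2
V = np.array([[2.0, 0.3], [0.3, 1.0]])
Omega = np.linalg.inv(V)               # assumed equal to V^{-1} only for illustration
theta0 = np.array([0.0, 0.0])          # true parameter
theta_star = np.array([0.5, -0.2])     # pseudo-true parameter
eps_const = 2.0                        # epsilon in n * eps_n -> epsilon
n = 10_000
eps_n = eps_const / n

def r_star(alpha):
    # surrogate expected KL, eq. (24), with ML estimators replaced by their limits
    term_mis = kl_gauss(theta0, Omega / n, theta_star, np.linalg.inv(V) / (alpha * n))
    term_ok = kl_gauss(theta_star, np.linalg.inv(V) / n, theta_star, np.linalg.inv(V) / (alpha * n))
    return eps_n * term_mis + (1 - eps_n) * term_ok

opt = minimize_scalar(r_star, bounds=(1e-3, 2.0), method="bounded")
delta = theta0 - theta_star
alpha_limit = p / (p + eps_const * delta @ V @ delta)     # limit in (26)
print(f"numerical minimizer ~ {opt.x:.4f}, Theorem 3 limit ~ {alpha_limit:.4f}")
```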

The formula also says that if the misspecification implied by the model is large (the difference between θ_0 and θ* is large), then the properly tuned value of α must be small. Equation (27) refers to the properly calibrated value of α for the variational approximations of α-posteriors. The analysis is similar to what we have already discussed, but here we need to take into account that variational approximations tend to further distort the variance matrix. Indeed, Theorem 2 has shown that the asymptotic variance of the variational approximations is Ṽ_{θ*}^{-1}/(αn), as opposed to V_{θ*}^{-1}/(αn), where we have defined Ṽ_{θ*} ≡ diag(V_{θ*}), for positive definite V_{θ*}. It is well known that the variational approximation understates


the variances of each coordinate of θ (Blei et al., 2017); hence [(Ṽ_{θ*})^{-1}]_{jj} ≤ [V_{θ*}^{-1}]_{jj} for j = 1, ..., p. Therefore, tr(Ṽ_{θ*} V_{θ*}^{-1}) ≥ p, which establishes the inequality in (27). There is one additional derivation associated with the formulae in Theorem 3. Consider the optimized expected value of the KL divergence based on the asymptotic approximations to

α-posteriors and their variational approximations. From the definition of r*_n and Theorem 3:

\[
2 \lim_{n \to \infty} r_n^*(\alpha_n^*) = -\, p\log(p) + p \log\big( p + \varepsilon (\theta_0 - \theta^*)^\top V_{\theta^*} (\theta_0 - \theta^*) \big) . \tag{28}
\]

Likewise,

\[
2 \lim_{n \to \infty} \widetilde{r}_n^*(\widetilde{\alpha}_n^*) = -\, p\log(p) + p \log\Big( \mathrm{tr}\big(\widetilde{V}_{\theta^*} V_{\theta^*}^{-1}\big) + \varepsilon (\theta_0 - \theta^*)^\top \widetilde{V}_{\theta^*} (\theta_0 - \theta^*) \Big) + \log\left( \frac{|V_{\theta^*}|}{|\widetilde{V}_{\theta^*}|} \right) . \tag{29}
\]

Compare these two equations with the expected KL of the usual posterior (α = 1). In this case, the KL under correct specification equals zero, so that the only relevant piece is the KL distance under misspecification. Algebra shows that

\[
2 \lim_{n \to \infty} r_n^*(1) = \varepsilon (\theta_0 - \theta^*)^\top V_{\theta^*} (\theta_0 - \theta^*) .
\]

Thus, the expected KL distance for the regular posterior increases linearly in the magnitude of the misspecification (the distance between θ_0 and θ*). The optimized KL for both the α-posteriors and their variational approximations is also monotonically increasing in this term, but its growth is logarithmic. One final remark concerns the well-known result that the asymptotic variance of Bayesian posteriors under misspecified models does not coincide with the asymptotic 'sandwich' covariance matrix of the Maximum Likelihood Estimator under misspecification. This suggests that instead of targeting the true (perhaps infeasible) posterior, it might make more sense to target the artificial posterior suggested by Müller (2013), which is a normal centered at the Maximum Likelihood estimator but with sandwich covariance matrix. This choice of target distribution does not change our results. The reason is that under our assumptions the part of the expected Kullback-Leibler divergence under misspecification only depends on the difference between the true and pseudo-true parameter and the covariance matrix of the reported posterior.
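The following snippet (not part of the original text) illustrates the linear-versus-logarithmic comparison numerically: here q stands for the quadratic form ε(θ_0 − θ*)^⊤V_{θ*}(θ_0 − θ*), and its grid of values is arbitrary.

```python
# Linear growth of the regular posterior's expected KL versus logarithmic growth of the
# optimized alpha-posterior's expected KL, following (28) and the display above.
import numpy as np

p = 2
for q in (0.1, 1.0, 10.0, 100.0, 1000.0):
    kl_regular = q                           # 2 * lim r_n*(1): linear in q
    kl_optimized = p * np.log((p + q) / p)   # 2 * lim r_n*(alpha_n*), from (28): logarithmic in q
    print(f"q = {q:>7.1f}: regular = {kl_regular:>8.2f}, optimized = {kl_optimized:>6.2f}")
```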

5. Illustrative Example

To illustrate our main results, this section studies an example of model misspecification in the form of a linear regression model with omitted variables. Our objective is twofold. First,

we present a simple environment where the high-level assumptions of our main theorems can be easily verified and discussed. Second, an appropriate choice of priors in this model yields closed-form solutions for the α-posteriors and their (Gaussian mean-field) variational approximations. Thus, it is possible to provide additional details about the nature of the Bernstein-von Mises theorems that we have established, as well as the optimal choice of α.

5.1 True and misspecified model

Consider a random sample of an outcome variable Y_i with control variables W_i ∈ R^p and Z_i ∈ R^d. The true data generating process is a homoskedastic Gaussian linear regression model

\[
Y_i = \theta_0^\top W_i + \gamma_0^\top Z_i + \varepsilon_i ,
\]

where ε_i ~ N(0, σ_ε²) independently of W_i and Z_i. The joint distribution of W_i and Z_i is assumed to have a density h(w_i, z_i) with respect to the Lebesgue measure on R^{p+d}.

The statistician faces an omitted variables problem, in that she would like to estimate θ_0 but only observes (Y_i, W_i). The statistician's misspecified model posits that

\[
Y_i = \theta^\top W_i + u_i ,
\]

where u_i is assumed to be univariate normal with mean zero and a presumably known variance σ_u², independently of W_i. For simplicity, we assume that the statistician has correctly specified the marginal distribution of W_i.

5.2 Pseudo-true parameter and LAN assumption

If we denote the data as X^n ≡ {(Y_i, W_i)}_{i=1}^n, the likelihood is

\[
f(X^n \mid \theta) = \frac{1}{(2\pi\sigma_u^2)^{n/2}} \exp\left( -\frac{1}{2\sigma_u^2} \sum_{i=1}^n \big( Y_i - \theta^\top W_i \big)^2 \right) \prod_{i=1}^n h(W_i) ,
\]

and the Maximum Likelihood estimator is simply the least-squares estimator of θ:

\[
\hat{\theta}_{\mathrm{ML}} = \left( \frac{1}{n} \sum_{i=1}^n W_i W_i^\top \right)^{-1} \frac{1}{n} \sum_{i=1}^n W_i Y_i .
\]

It is straightforward to show that under mild assumptions on the joint distribution of (W_i, Z_i), the Maximum Likelihood estimator is √n-asymptotically normal around the pseudo-true

parameter:

\[
\theta^* \equiv \theta_0 + \mathbb{E}[W_i W_i^\top]^{-1}\, \mathbb{E}[W_i Z_i^\top]\, \gamma_0 ,
\]

which equals the true parameter, θ_0, plus the usual omitted variable bias formula. Algebra shows that as long as the sample second moments of W_i converge in probability (under the true model) to the positive definite matrix E[W_i W_i^⊤], the stochastic LAN assumption is satisfied with

\[
V_{\theta^*} \equiv \mathbb{E}[W_i W_i^\top] / \sigma_u^2 .
\]

This means that, in this example, the curvature of the likelihood around the pseudo-true parameter does not depend on θ*.
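A simulation sketch (not part of the original text) of the pseudo-true parameter formula above: the dimensions, coefficients, and correlations are illustrative assumptions chosen only so that omitting Z_i biases the least-squares estimator.

```python
# Least squares with an omitted variable converges to theta* = theta0 + E[WW']^{-1} E[WZ'] gamma0.
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
theta0, gamma0 = np.array([1.0, -0.5]), np.array([0.8])

# W (2-dim) and Z (1-dim) are jointly Gaussian and correlated (illustrative covariance).
cov = np.array([[1.0, 0.2, 0.5],
                [0.2, 1.0, 0.3],
                [0.5, 0.3, 1.0]])
WZ = rng.multivariate_normal(np.zeros(3), cov, size=n)
W, Z = WZ[:, :2], WZ[:, 2:]
Y = W @ theta0 + Z @ gamma0 + rng.normal(size=n)

theta_ols = np.linalg.solve(W.T @ W, W.T @ Y)             # misspecified ML / least squares
EWW, EWZ = cov[:2, :2], cov[:2, 2:]
theta_star = theta0 + np.linalg.solve(EWW, EWZ) @ gamma0  # pseudo-true parameter
print("OLS estimate:", np.round(theta_ols, 3), " pseudo-true parameter:", np.round(theta_star, 3))
```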

5.3 α-posteriors and their variational approximations

Consider now the setting where we assume a commonly used Gaussian prior, denoted π, for θ. In particular, suppose

\[
\theta \sim \mathcal{N}\big( \mu_\pi, \sigma_u^2 \Sigma_\pi^{-1} \big) .
\]

The computation of the α-posterior in this set-up is straightforward, as the α-power of the Gaussian likelihood is itself Gaussian with the new scale divided by α. Thus, algebra shows that the α-posterior for the linear regression model is also multivariate normal with mean parameter

\[
\mu_{n,\alpha} \equiv \left( \frac{1}{n} \sum_{i=1}^n W_i W_i^\top + \frac{1}{\alpha n}\Sigma_\pi \right)^{-1} \left( \frac{1}{\alpha n}\Sigma_\pi \mu_\pi + \frac{1}{n} \sum_{i=1}^n W_i Y_i \right) , \tag{30}
\]

and covariance matrix

\[
\Sigma_{n,\alpha} \equiv \frac{\sigma_u^2}{\alpha n} \left( \frac{1}{n} \sum_{i=1}^n W_i W_i^\top + \frac{1}{\alpha n}\Sigma_\pi \right)^{-1} . \tag{31}
\]

Because we have closed-form solutions for the α-posteriors, one can readily show that they concentrate at rate √n around θ* for any fixed (α, µ_π, Σ_π). This is shown in Appendix B.1.

Since the assumptions of Theorem 1 are met, the total variation distance between the α-posterior and the multivariate normal

\[
\mathcal{N}\left( \hat{\theta}_{\mathrm{ML}},\; \frac{\sigma_u^2}{\alpha n}\, \mathbb{E}[W_i W_i^\top]^{-1} \right) , \tag{32}
\]

must converge in probability to zero. In fact, in this example, it is possible to establish a stronger result: the KL distance between π_{n,α} and the distribution in (32) converges in probability to zero (see Appendix B.2). This example thus raises the question of whether entropic Bernstein-von Mises theorems are more generally available for α-posteriors in misspecified models.¹

This simple linear regression example also shows that the Bernstein-von Mises theorem is not likely to hold if α is chosen in a way that approaches zero very quickly. In particular, consider a sequence α_n for which α_n n converges to a strictly positive constant. Then, in this simple example the total variation distance between the α_n-posterior and the distribution in (32) will be bounded away from zero. This is shown in Appendix B.3.

Finally, in this example the mean-field Gaussian variational approximation of the α-posterior also has a closed-form expression. Algebra shows that the variational approximation has exactly the same mean as the α-posterior, but variance equal to

\[
\widetilde{\Sigma}_{n,\alpha} \equiv \frac{\sigma_u^2}{\alpha n} \left[ \mathrm{diag}\left( \frac{1}{n} \sum_{i=1}^n W_i W_i^\top + \frac{1}{\alpha n}\Sigma_\pi \right) \right]^{-1} . \tag{33}
\]

In this example, it is possible to show that the Bernstein-von Mises theorem holds for the variational approximation to the α-posterior (not only in total variation distance, but also in KL divergence). The assumptions of Theorem 2 are verified in Appendix B.4.
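The sketch below (not part of the original text) evaluates the closed forms (30), (31), and (33) on freshly simulated illustrative data and compares the α-posterior covariance with the Theorem 1 limit in (32) (with E[W_i W_i^⊤] replaced by its sample analogue); σ_u, the prior, and α are placeholder choices.

```python
# Closed-form alpha-posterior and its mean-field variance for the regression example.
import numpy as np

rng = np.random.default_rng(2)
n, p = 5_000, 2
W = rng.multivariate_normal(np.zeros(p), np.array([[1.0, 0.2], [0.2, 1.0]]), size=n)
Y = W @ np.array([1.0, -0.5]) + rng.normal(size=n)

sigma_u2 = 1.0
mu_pi, Sigma_pi = np.zeros(p), np.eye(p)     # prior N(mu_pi, sigma_u^2 * Sigma_pi^{-1})
alpha = 0.5

S_WW = (W.T @ W) / n
S_WY = (W.T @ Y) / n
M = S_WW + Sigma_pi / (alpha * n)

mu_alpha = np.linalg.solve(M, Sigma_pi @ mu_pi / (alpha * n) + S_WY)             # eq. (30)
Sigma_alpha = (sigma_u2 / (alpha * n)) * np.linalg.inv(M)                        # eq. (31)
Sigma_alpha_mf = (sigma_u2 / (alpha * n)) * np.linalg.inv(np.diag(np.diag(M)))   # eq. (33)

theta_ml = np.linalg.solve(S_WW, S_WY)
limit_cov = (sigma_u2 / (alpha * n)) * np.linalg.inv(S_WW)                       # covariance in (32), sample analogue
print("alpha-posterior mean:", np.round(mu_alpha, 3), " ML estimator:", np.round(theta_ml, 3))
print("max |Sigma_alpha - limit covariance|:", np.abs(Sigma_alpha - limit_cov).max())
```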

5.4 Expected KL and Optimal α

Finally, we discuss the expected KL criterion and the choice of α in the context of our example. In Section 4, we defined r_n(α) as the expected KL divergence between the true posterior of θ and the α-posterior. As explained there, this expected KL measure is typically difficult to compute because—with the exception of some stylized examples such as our linear regression model with omitted variables—the α-posteriors and their variational approximations are not available in closed-form.

For this reason, we chose to work instead with a surrogate measure r*_n(α), which, motivated by the Bernstein-von Mises theorem, replaces the true posterior and the α-posterior by their asymptotic approximations. In our linear regression example, it is possible to formalize the relationship between r_n(α) and r*_n(α). Algebra shows that for any fixed α and any Gaussian prior,

1. Clarke (1999) showed that—in a smooth parametric model with a well-behaved prior—the relative entropy between a posterior density and an appropriate normal tends to zero in probability and in mean. If an analogous result were available for standard posterior distributions in misspecified parametric models, it might be possible to extend it to cover α-posteriors.

we have r_n(α) − r*_n(α) → 0 as n → ∞, as one would have expected (see Appendix B.6). One of the challenges in generalizing this result is that the Bernstein-von Mises theorems we have established are in total variation, thus we cannot use them directly to analyze the behavior of the expected KL divergence.

Regarding the choice of α, we have shown in Theorem 3 that the limit of the optimal choice is

\[
\alpha^* = \frac{p}{p + \varepsilon (\theta_0 - \theta^*)^\top V_{\theta^*} (\theta_0 - \theta^*)} ,
\]

under the assumption that the true parameter, θ_0, is different from the pseudo-true parameter

θ*. In our example, this happens whenever γ_0 ≠ 0 (i.e., there are indeed omitted variables) and the omitted variables are correlated with the observed controls (i.e., when the omitted variable bias is different from zero).

Since in the linear regression example it is possible to compute r_n(α) explicitly, it is also possible to choose α to minimize this expression. Algebra shows that for any α′ ≠ α*, we have that r_n(α*) < r_n(α′) for n sufficiently large.

6. Concluding Remarks and Discussion

In this work, we have studied the robustness to model misspecification of α-posteriors and their variational approximations with a focus on parametric, low-dimensional models. To formalize the notion of robustness we built on the seminal work of Gustafson (2001) and his suggested measure of sensitivity to parametric model misspecification. To state it simply, if two different procedures both lead to incorrect a posteriori inference (either due to model misspecification or computational considerations), one procedure is more robust (or less sen- sitive) than the other if it is closer—in terms of KL divergence—to the true posterior. Thus, we analyzed the KL divergence between true posteriors and the distributions reported by either the α-posterior approach or their variational approximations.

Obtaining general results about the properties of the KL divergence between true and reported posteriors is quite challenging, as this will typically depend on the priors, the data, the statistical model, and the form of misspecification. We were able to make progress by relying on asymptotic approximations to α-posteriors and their variational approximations.

In particular, we established a Bernstein-von Mises (BvM) theorem in total variation distance for α-posteriors (Theorem 1) and for their (Gaussian mean-field) variational approximations (Theorem 2). Our results provided a generalization of the results in Wang and Blei (2019a,b), who focus on the case in which α = 1. We also extend the results of Li et al. (2019),

who establish the BvM theorem for α-posteriors under a weaker norm (weak convergence), but under more primitive conditions.

We think these asymptotic approximations have value per se. For example, we learned that relative to the BvM theorem for the standard posterior or its variational approximation, the choice of α only re-scales the limiting variance. The new scaling acts as if the observed sample size were α · n instead of n, but the location for the Gaussian approximation continues to be the Maximum Likelihood estimator. Since choosing α < 1 inflates the α-posterior's variance relative to the usual posterior, the tempering parameter corrects some of the variance understatement of standard variational approximations to the posterior. Also, there is some recent work considering variational approximations using the α-Rényi divergence instead of the KL divergence (Jaiswal et al., 2020). It might be interesting to explore whether it is possible to derive a result analogous to Theorem 2, where we approximate the limiting distribution of α-Rényi approximate posteriors by projecting the limiting distribution obtained in Theorem 1.

The main use of the asymptotic approximations in our paper, however, was simply to facilitate the computation of the suggested measure of robustness. This required elementary calculations once we have multivariate Gaussians with parameters that depend on the data, the sample size, and the ‘curvature’ of the likelihood.

An important caveat of our results is that we focused on analyzing the KL divergence between the limiting distributions, as opposed to the limit of the KL divergence between reported and true posteriors. Although in some simple models the two are equivalent (for example, a linear regression model with omitted variables and Gaussian priors), the general result requires further exploration. Unfortunately, we do not yet have a good solution. It is easy to show that the function (P, Q) ↦ K(P‖Q) is lower semi-continuous in total variation distance, so it might be possible to get a bound on the KL divergence we want to study with the KL divergence of the Gaussian limits. However, the interesting part is not the divergences themselves, but rather the α's that minimize them. We think that perhaps the use of the Theorem of the Maximum (Berge, 1963) and its generalizations could be useful for this analysis. It is possible that the continuity results can be strengthened in our case, since we are dealing with Gaussians in the limit, but we do not yet have answers. Also, a formal analysis might require the derivation of entropic BvM theorems, where the distance is measured using KL divergence, as in Clarke (1999).

Finally, even though our paper has a theoretical prescription for choosing the tempering parameter α, further research is needed to translate this into a practical recommendation.

As discussed in detail, our calculations suggest that α*_n tends to be smaller as both the probability

of misspecification ǫ_n and the difference between the true and pseudo-true parameters increase. It might be possible to hypothesize some value for the probability of misspecification. However, the true parameter is not known (and cannot be estimated consistently under misspecification). We leave the question of how to optimally choose the tempering parameter for α-posteriors and their approximations for future research. One promising line of work is to do a full Bayesian treatment of the problem as in Wang et al. (2017).

Acknowledgments

We would like to thank Pierre Alquier, St´ephane Bonhomme, Yun Ju, Debdeep Pati, and Yixin Wang for extremely helpful comments and suggestions. All errors remain our own. Cynthia Rush would like to acknowledge support for this project from the National Science Foundation (NSF CCF #1849883), the Simons Institute for the Theory of Computing, and NTT Research.

Appendix A. Proofs of Main Results

A.1 Proof of Theorem 1

This proof follows Theorem 2.1 in Kleijn and Van der Vaart (2012), which shows that the posterior under misspecification is asymptotically normal, but our proof is adapted and

simplified for the α-posterior framework. In what follows, we let ∆_{n,θ*} ≡ √n(θ̂_{ML-F_n} − θ*), as in Assumption 1.

By the Theorem 1 assumption that θ* is in the interior of Θ ⊂ R^p, there exists a sufficiently small δ > 0 such that the open ball B_{θ*}(δ) ≡ {θ : ‖θ − θ*‖ < δ} is a neighborhood of the (pseudo-)true parameter θ* in Θ. In particular, we choose δ such that B_{θ*}(δ) belongs to the neighborhood of θ* in which it is assumed π is continuous and positive. Note that for any compact set K_0 ⊂ R^p including the origin, we can find an integer N_0 ≡ N_0(K_0, B_{θ*}(δ)) sufficiently large such that for any vector h ∈ K_0, we have that the perturbation of θ* in the direction h/√n, meaning θ* + h/√n, belongs to B_{θ*}(δ) whenever n ≥ N_0.

The goal is to show that the total variation distance between the α-posterior of θ, denoted π_{n,α}(·|X^n), and a multivariate Normal distribution with mean θ̂_{ML-F_n} and variance V_{θ*}^{-1}/(nα) goes to zero in f_{0,n}-probability. Because the total variation distance is invariant to the simultaneous re-centering and scaling of both measures being compared (Van der Vaart

(2000)), it is more convenient to work with the α-posterior of the transformation √n(θ − θ*)

and compare it to the similarly re-centered and scaled multivariate Normal distribution, i.e.

one with mean ∆_{n,θ*} and variance V_{θ*}^{-1}/α.

For vectors g, h ∈ K_0, the following random variable will be used to bound the average total variation distance between the α-posterior and the alleged multivariate normal limit:

\[
f_n(g, h) \equiv \left\{ 1 - \frac{\phi_n(h)\, \pi_{n,\alpha}^{\mathrm{LAN}}(g \mid X^n)}{\pi_{n,\alpha}^{\mathrm{LAN}}(h \mid X^n)\, \phi_n(g)} \right\}^{+} , \tag{34}
\]

where φ_n(h) ≡ n^{-1/2} φ(h | ∆_{n,θ*}, V_{θ*}^{-1}/α) and π^{LAN}_{n,α}(h|X^n) ≡ n^{-1/2} π_{n,α}(θ* + h/√n | X^n) are scaled versions of the densities that we want to compare using the total variation distance, and {x}^+ = max{0, x} denotes the positive part of x. Define also π_n(h) ≡ n^{-1/2} π(θ* + h/√n) to be the density of the prior distribution of the transformation √n(θ − θ*). It follows that f_n in (34) is well-defined on K_0 × K_0 for all n > N_0, as in this regime π^{LAN}_{n,α}(h|X^n) is guaranteed to be positive since θ* + h/√n belongs to B_{θ*}(δ), as discussed above.

Let B_0(r_n) denote a closed ball of radius r_n around 0. Since d_TV ≤ 1 and the expectation is linear, we have that for any sequence r_n and for any η > 0:

\[
\begin{aligned}
&\mathbb{E}_{f_{0,n}}\left[ d_{TV}\big( \pi_{n,\alpha}^{\mathrm{LAN}}(\cdot \mid X^n), \phi_n(\cdot) \big) \right] \\
&\quad\le \mathbb{E}_{f_{0,n}}\left[ d_{TV}\big( \pi_{n,\alpha}^{\mathrm{LAN}}(\cdot \mid X^n), \phi_n(\cdot) \big)\, \mathbf{1}\Big\{ \sup_{g,h \in B_0(r_n)} f_n(g,h) \le \eta \Big\} \right] + \mathbb{P}_{f_{0,n}}\left( \sup_{g,h \in B_0(r_n)} f_n(g,h) > \eta \right) .
\end{aligned}
\tag{35}
\]

The proof is completed by bounding the two terms on the right side of (35).

First we bound the expectation on the right side of (35). Lemma 4 in Appendix C implies

\[
d_{TV}\big( \pi_{n,\alpha}^{\mathrm{LAN}}(\cdot \mid X^n), \phi_n \big) \le \sup_{g,h \in B_0(r_n)} f_n(g,h) + \int_{\|h\| > r_n} \pi_{n,\alpha}^{\mathrm{LAN}}(h \mid X^n)\, dh + \int_{\|h\| > r_n} \phi_n(h)\, dh , \tag{36}
\]

therefore

\[
\begin{aligned}
&\mathbb{E}_{f_{0,n}}\left[ d_{TV}\big( \pi_{n,\alpha}^{\mathrm{LAN}}(\cdot \mid X^n), \phi_n(\cdot) \big)\, \mathbf{1}\Big\{ \sup_{g,h \in B_0(r_n)} f_n(g,h) \le \eta \Big\} \right] \\
&\quad\le \eta + \mathbb{E}_{f_{0,n}}\left[ \int_{\|h\| > r_n} \pi_{n,\alpha}^{\mathrm{LAN}}(h \mid X^n)\, dh \right] + \mathbb{E}_{f_{0,n}}\left[ \int_{\|h\| > r_n} \phi_n(h)\, dh \right] .
\end{aligned}
\tag{37}
\]


The bound in (37) uses the non-negativity of φ_n(·), which implies

\[
\mathbb{E}_{f_{0,n}}\left[ \int_{\|h\| > r_n} \phi_n(h)\, dh\; \mathbf{1}\Big\{ \sup_{g,h \in B_0(r_n)} f_n(g,h) \le \eta \Big\} \right] \le \mathbb{E}_{f_{0,n}}\left[ \int_{\|h\| > r_n} \phi_n(h)\, dh \right] ,
\]

and a similar upper bound for the third term on the right side of (37). In addition, by the concentration assumption of the theorem (defined in equation (10)), there exists an integer

N_1(η, ǫ) such that, for all n > N_1(η, ǫ),

\[
\mathbb{E}_{f_{0,n}}\left[ \int_{\|h\| > r_n} \pi_{n,\alpha}^{\mathrm{LAN}}(h \mid X^n)\, dh \right] < \epsilon . \tag{38}
\]

Also, by Lemma 5.2 in Kleijn and Van der Vaart (2012), which exploits properties of the multivariate normal distribution, there exists an integer N_2(η, ǫ) such that for all n > N_2(η, ǫ),

\[
\mathbb{E}_{f_{0,n}}\left[ \int_{\|h\| > r_n} \phi_n(h)\, dh \right] < \epsilon . \tag{39}
\]

Now plugging (38) and (39) into (37), defining Ñ(η, ǫ) = max{N_1(η, ǫ), N_2(η, ǫ)}, we find for all n > Ñ(η, ǫ),

Lemma 5 in Appendix C shows that, for a given η,ǫ > 0, there exists a sequence r n → + and N(η, ǫ) such that the second term on the right side of equation (35) is small for ∞ n > N(η, ǫ); that is, for all n > N(η, ǫ),

P f0,n sup fn(g, h) > η ǫ . (41) g,h B0(rn) ! ≤ ∈ We mention that it is in the proof of Lemma 5 that we make use the stochastic LAN condition in Assumption 1.

Finally we conclude from (35), using the bounds in (40) and (41), that for all n > max N(η, ǫ), N(η, ǫ) , { }

e E d πLAN ( Xn), φ ( ) η +2ǫ + ǫ = η +3ǫ . f0,n TV n,α ·| n · ≤ h  i A standard application of Markov’s inequality gives the desired result.

23 Avella Medina and Montiel Olea and Rush and Velez

A.2 Proof of Theorem 2

Let π ( Xn)= q( µ , Σ ) be the Gaussian mean-field approximation to the α-posterior, n,α ·| ·| n n defined in (14). We will prove that e e e

n ∗ 1 (πn,α ( X ) q( µn∗ , Σn∗ )) = φ( µn, Σn) φ( θML- n , diag(αnVθ )− ) 0 , (42) K ·| || ·| K ·| || ·| F →   in f e -probability where the equality followe e since q( bµ∗ , Σ∗ ) is Gaussian with mean and 0,n ·| n n covariance defined in (16). The statement in (19) follows by Pinsker’s inequality as it en- sures that convergence in Kullback-Leibler divergence implies convergence in total variation distance.

Note that the KL divergence between two p-dimensional Gaussian distributions can be computed explicitly as

1 Σ2 1 1 (φ( µ , Σ ) φ( µ , Σ )) = log | | + tr Σ− Σ +(µ µ )⊤Σ− (µ µ ) p . K ·| 1 1 || ·| 2 2 2 Σ 2 1 2 − 1 2 2 − 1 −  | 1|   (43)

Applying the identity (43) we see that

∗ 1 φ( µn, Σn) φ( θML- n , diag(αnVθ )− ) = T1 + T2 , K ·| || ·| F   where e e b

1 ∗ T1 = (θML- n µn)⊤diag(αnVθ )(θML- n µn) , (44) 2 F − F − 1 1 b p 1 b Σn − T2 = tr(diag(αnVeθ∗ )Σn) + log | e| . (45) 2 − 2 2 diag(αnVθ∗ ) ! | e | e To prove (42), we will prove that both T1 = of0,n(1) and T2 = of0,n(1). The key step to establishing these results is the asymptotic representation result of Lemma 6. It shows that, under Assumptions 1 and 2, for any sequence (µn, Σn)—possibly dependent on the data—that is bounded in f0,n-probability, we have:

n 1 (q( µn, Σn) πn,α( X )) = q( µn, Σn) φ( θML- n , Vθ−∗ /(αn)) + of0,n (1) . K ·| || ·| K ·| || ·| F   b This means that the KL divergence between any normal density q( µ , Σ ) and the α- ·| n n posterior π ( Xn) is eventually close to the KL divergence between the same density and n,α ·| the α-posterior’s total variation limit (which we have characterized in Theorem 1).


∗ 1 We use two intermediate steps to relate (µn, Σn) to (θML n , diag(Vθ )− /(αn)). −F

e e b

Claim 1. We start by showing that T1 = of0,n (1). Since µn and Σn are the parameters that solve the variational approximation in (14), they are thus the parameters that minimize the e KL divergence between the Gaussian mean-field family ande the α-posterior. It follows that for every n,

n n (q( µn, Σn) πn,α( X )) (q( θML- n , Σn) πn,α( X )) . (46) K ·| || ·| ≤K ·| F || ·| e b e Using Lemma 6, wee can evaluate each of the KL divergences above up to an of0,n (1) term using the asymptotic Gaussian limit of π ( Xn) given in Thm 1. Indeed, since both n,α ·| (√n(µn θ∗), nΣn) and (√n(θML- n θ∗), nΣn) are bounded in f0,n-probability, we can − F − apply Lemma 6 to find e e b e n 1 (q( µn, Σn) πn,α( X )) = (q( µn, Σn) φ( θML- n , Vθ−∗ /(αn))) + of0,n (1) , K ·| || ·| K ·| || ·| F n 1 ∗ (q( θML- n , Σn) πn,α( X )) = (q( θML- n , Σn) φ( θML- n , Vθ− /(αn))) + of0,n (1) , K ·| eF e || ·| K ·| e eF || b ·| F b e b e b and plugging the above into (46) gives

$$\mathcal{K}\left(q(\cdot\mid\tilde\mu_n,\tilde\Sigma_n)\,\|\,\phi(\cdot\mid\hat\theta_{\mathrm{ML}\text{-}F_n},V_{\theta^*}^{-1}/(\alpha n))\right) \le \mathcal{K}\left(q(\cdot\mid\hat\theta_{\mathrm{ML}\text{-}F_n},\tilde\Sigma_n)\,\|\,\phi(\cdot\mid\hat\theta_{\mathrm{ML}\text{-}F_n},V_{\theta^*}^{-1}/(\alpha n))\right) + o_{f_{0,n}}(1). \quad (47)$$
Therefore, it follows from (43) and (47) that

$$\frac{1}{2}\left(\hat\theta_{\mathrm{ML}\text{-}F_n}-\tilde\mu_n\right)^\top(\alpha n V_{\theta^*})\left(\hat\theta_{\mathrm{ML}\text{-}F_n}-\tilde\mu_n\right) = \mathcal{K}\left(q(\cdot\mid\tilde\mu_n,\tilde\Sigma_n)\,\|\,\phi(\cdot\mid\hat\theta_{\mathrm{ML}\text{-}F_n},V_{\theta^*}^{-1}/(\alpha n))\right) - \mathcal{K}\left(q(\cdot\mid\hat\theta_{\mathrm{ML}\text{-}F_n},\tilde\Sigma_n)\,\|\,\phi(\cdot\mid\hat\theta_{\mathrm{ML}\text{-}F_n},V_{\theta^*}^{-1}/(\alpha n))\right) \le o_{f_{0,n}}(1). \quad (48)$$
Since $V_{\theta^*}$ is positive definite, (48) implies that $\sqrt{\alpha n}\,(\hat\theta_{\mathrm{ML}\text{-}F_n}-\tilde\mu_n) \to 0$ in $f_{0,n}$-probability, and hence $T_1 = o_{f_{0,n}}(1)$.

Claim 2. We now show that $T_2 = o_{f_{0,n}}(1)$ by relating $\tilde\Sigma_n$ to $\operatorname{diag}(V_{\theta^*})^{-1}/(\alpha n)$. The optimality of $\tilde\mu_n$ and $\tilde\Sigma_n$ defined by (14) once again implies that, for every $n$:

$$\mathcal{K}\left(q(\cdot\mid\tilde\mu_n,\tilde\Sigma_n)\,\|\,\pi_{n,\alpha}(\cdot\mid X^n)\right) \le \mathcal{K}\left(q(\cdot\mid\hat\theta_{\mathrm{ML}\text{-}F_n},\operatorname{diag}(V_{\theta^*})^{-1}/(\alpha n))\,\|\,\pi_{n,\alpha}(\cdot\mid X^n)\right). \quad (49)$$


As in Claim 1, this inequality and Lemma 6 imply that

$$\mathcal{K}\left(q(\cdot\mid\tilde\mu_n,\tilde\Sigma_n)\,\|\,\phi(\cdot\mid\hat\theta_{\mathrm{ML}\text{-}F_n},V_{\theta^*}^{-1}/(\alpha n))\right) \le \mathcal{K}\left(q(\cdot\mid\hat\theta_{\mathrm{ML}\text{-}F_n},\operatorname{diag}(V_{\theta^*})^{-1}/(\alpha n))\,\|\,\phi(\cdot\mid\hat\theta_{\mathrm{ML}\text{-}F_n},V_{\theta^*}^{-1}/(\alpha n))\right) + o_{f_{0,n}}(1). \quad (50)$$
Next, using the inequality in (50), applying (43) to each term, and noting that (48) implies

$\frac{1}{2}(\hat\theta_{\mathrm{ML}\text{-}F_n}-\tilde\mu_n)^\top(\alpha n V_{\theta^*})(\hat\theta_{\mathrm{ML}\text{-}F_n}-\tilde\mu_n) \le o_{f_{0,n}}(1)$, we obtain
$$\frac{1}{2}\left[\operatorname{tr}(\alpha n V_{\theta^*}\tilde\Sigma_n) - p + \log\frac{|V_{\theta^*}^{-1}/(\alpha n)|}{|\tilde\Sigma_n|}\right] \le \frac{1}{2}\left[\operatorname{tr}\left(\alpha n V_{\theta^*}\operatorname{diag}(V_{\theta^*})^{-1}/(\alpha n)\right) - p + \log\frac{|V_{\theta^*}^{-1}/(\alpha n)|}{|\operatorname{diag}(V_{\theta^*})^{-1}/(\alpha n)|}\right] + o_{f_{0,n}}(1).$$
Further noting that $\operatorname{tr}\left(V_{\theta^*}\operatorname{diag}(V_{\theta^*})^{-1}\right) = p$, we see that the above inequality is equivalent to

$$\frac{1}{2}\left[\operatorname{tr}(\alpha n V_{\theta^*}\tilde\Sigma_n) - p + \log\frac{|V_{\theta^*}^{-1}/(\alpha n)|}{|\tilde\Sigma_n|} - \log\frac{|V_{\theta^*}^{-1}/(\alpha n)|}{|\operatorname{diag}(V_{\theta^*})^{-1}/(\alpha n)|}\right] \le o_{f_{0,n}}(1). \quad (51)$$
Since $\tilde\Sigma_n$ is diagonal,

$$\operatorname{tr}(\alpha n V_{\theta^*}\tilde\Sigma_n) = \operatorname{tr}\left(\operatorname{diag}(\alpha n V_{\theta^*})\tilde\Sigma_n\right),$$
and consequently the left-hand side of (51) equals $T_2$, which was defined in (45). Moreover, $T_2$ is nonnegative, as it equals the KL divergence between two normals

with the same mean but variances $\tilde\Sigma_n$ and $\operatorname{diag}(V_{\theta^*})^{-1}/(\alpha n)$, so we conclude that $T_2 = o_{f_{0,n}}(1)$.


A.3 Proof of Theorem 3

We note that $r_n^*$ and $\tilde r_n^*$, defined in (24) and (25), are both expectations involving KL divergences of two multivariate Gaussian distributions; hence (43) allows us to compute $r_n^*(\alpha)$ and $\tilde r_n^*(\alpha)$ explicitly as

$$r_n^*(\alpha) = \frac{1}{2}\left(\alpha A_n(V_{\theta^*}) - p\log(\alpha) + B_n(V_{\theta^*})\right), \qquad \tilde r_n^*(\alpha) = \frac{1}{2}\left(\alpha A_n(\tilde V_{\theta^*}) - p\log(\alpha) + B_n(\tilde V_{\theta^*})\right),$$
where

$$A_n(\Sigma) \equiv \epsilon_n\operatorname{tr}(\Sigma\Omega) + (1-\epsilon_n)\operatorname{tr}(\Sigma V_{\theta^*}^{-1}) + n\epsilon_n\left(\hat\theta_{\mathrm{ML}\text{-}F_n}-\hat\theta_{\mathrm{ML}}\right)^\top\Sigma\left(\hat\theta_{\mathrm{ML}\text{-}F_n}-\hat\theta_{\mathrm{ML}}\right),$$
$$B_n(\Sigma) \equiv -p + \epsilon_n\log\left(|\Omega|^{-1}|\Sigma|^{-1}\right) + (1-\epsilon_n)\log\left(|V_{\theta^*}|\,|\Sigma|^{-1}\right).$$


Using this notation, we see that $r_n^*$ and $\tilde r_n^*$ are convex functions of $\alpha$. This implies that first-order conditions pin down the optimal $\alpha_n^*$ and $\tilde\alpha_n^*$ defined in (23). These are equal to
$$\alpha_n^* = \frac{p}{A_n(V_{\theta^*})} \quad\text{and}\quad \tilde\alpha_n^* = \frac{p}{A_n(\tilde V_{\theta^*})}. \quad (52)$$

By assumption, we have $\hat\theta_{\mathrm{ML}} \overset{p}{\to} \theta_0$ and $n\epsilon_n \to \varepsilon \in (0,\infty)$. In addition, we know that $\hat\theta_{\mathrm{ML}\text{-}F_n} \overset{p}{\to} \theta^*$. It follows that $A_n(\Sigma) \overset{p}{\to} \operatorname{tr}(\Sigma V_{\theta^*}^{-1}) + \varepsilon(\theta^*-\theta_0)^\top\Sigma(\theta^*-\theta_0)$. Denote by $\alpha^*$ and $\tilde\alpha^*$ the limits in $f_{0,n}$-probability of $\alpha_n^*$ and $\tilde\alpha_n^*$. These limits can be computed by replacing $\Sigma$ in the expressions above with $V_{\theta^*}$ or $\tilde V_{\theta^*}$ and taking the limit of (52). This implies
$$\alpha_n^* \to \alpha^* \equiv \frac{p}{p + \varepsilon(\theta_0-\theta^*)^\top V_{\theta^*}(\theta_0-\theta^*)}, \qquad \tilde\alpha_n^* \to \tilde\alpha^* \equiv \frac{p}{\operatorname{tr}(\tilde V_{\theta^*}V_{\theta^*}^{-1}) + \varepsilon(\theta_0-\theta^*)^\top\tilde V_{\theta^*}(\theta_0-\theta^*)}.$$

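To fix ideas, the limits above are easy to evaluate for concrete inputs. The sketch below is our own illustration (the matrix $V_{\theta^*}$, the parameter values, and $\varepsilon$ are made up); it simply codes the two closed-form limits, with $\tilde V_{\theta^*} = \operatorname{diag}(V_{\theta^*})$.

```python
import numpy as np

def optimal_alphas(V, theta_star, theta0, eps):
    """Limiting alpha* and alpha~* from the display above, with V~ = diag(V)."""
    p = V.shape[0]
    d = theta0 - theta_star
    V_tilde = np.diag(np.diag(V))
    alpha_star = p / (p + eps * d @ V @ d)
    alpha_tilde_star = p / (np.trace(V_tilde @ np.linalg.inv(V)) + eps * d @ V_tilde @ d)
    return alpha_star, alpha_tilde_star

V = np.array([[2.0, 0.5], [0.5, 1.0]])
theta_star, theta0 = np.array([0.0, 0.0]), np.array([0.3, -0.1])
print(optimal_alphas(V, theta_star, theta0, eps=1.0))  # both values are below one
```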
To conclude that $\tilde\alpha^* < 1$, it is sufficient to prove that $\operatorname{tr}(\tilde V_{\theta^*}V_{\theta^*}^{-1}) \ge p$. Recall that the trace of a matrix is the sum of its eigenvalues and the determinant is their product. Applying the arithmetic mean-geometric mean inequality, we have

$$\frac{1}{p}\operatorname{tr}\left(\tilde V_{\theta^*}V_{\theta^*}^{-1}\right) \ge \left|\tilde V_{\theta^*}V_{\theta^*}^{-1}\right|^{1/p}.$$
Then, since $|\tilde V_{\theta^*}V_{\theta^*}^{-1}| = |\tilde V_{\theta^*}||V_{\theta^*}^{-1}| = |\tilde V_{\theta^*}||V_{\theta^*}|^{-1}$, it is sufficient to prove that $|\tilde V_{\theta^*}| \ge |V_{\theta^*}|$, where $\tilde V_{\theta^*} = \operatorname{diag}(V_{\theta^*})$. Because $V_{\theta^*}$ is a positive semi-definite matrix, this is exactly Hadamard's inequality (see Theorem 7.8.1 in Horn and Johnson (2012)).

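The inequality $\operatorname{tr}(\tilde V_{\theta^*}V_{\theta^*}^{-1}) \ge p$ invoked above can be probed numerically. The following sketch is our own check on randomly generated positive definite matrices; it is not part of the proof.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 4
for _ in range(1000):
    A = rng.standard_normal((p, p))
    V = A @ A.T + 1e-3 * np.eye(p)    # random positive definite matrix
    V_tilde = np.diag(np.diag(V))      # its diagonal part
    assert np.trace(V_tilde @ np.linalg.inv(V)) >= p - 1e-8
    # Hadamard's inequality |diag(V)| >= |V|, checked on the log scale
    assert np.linalg.slogdet(V_tilde)[1] >= np.linalg.slogdet(V)[1] - 1e-10
print("tr(V~ V^{-1}) >= p and |diag(V)| >= |V| hold in all draws")
```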
Appendix B. Technical work for the illustrative example

B.1 Verification of α-posterior concentration

We first want to prove that the $\alpha$-posterior concentrates at the rate $\sqrt{n}$ around $\theta^*$, as defined in (10). In other words, we need to show that for every sequence $r_n\to\infty$, we have

$$\mathbb{E}_{f_{0,n}}\left[\pi_{n,\alpha}\left(\sqrt{n}\,\|\theta-\theta^*\| > r_n \mid X^n\right)\right] \to 0. \quad (53)$$
Step 1: Compute an upper bound for the probability inside the brackets using Markov's inequality.

Note that in our illustrative example we have $\pi_{n,\alpha}(\theta\mid X^n) \sim \mathcal{N}(\mu_{n,\alpha},\Sigma_{n,\alpha})$, where $\mu_{n,\alpha}$ and $\Sigma_{n,\alpha}$ were defined in (30) and (31). Therefore, by Markov's inequality and the normal

distribution of the $\alpha$-posterior, we have

$$\pi_{n,\alpha}\left(\|\theta-\theta^*\|^2 > r_n^2/n \mid X^n\right) \le \frac{n}{r_n^2}\,\mathbb{E}_{\pi_{n,\alpha}(\cdot\mid X^n)}\left[\|\theta-\theta^*\|^2\right] = \frac{n}{r_n^2}\left(\|\mu_{n,\alpha}-\theta^*\|^2 + \operatorname{tr}(\Sigma_{n,\alpha})\right), \quad (54)$$
which gives the upper bound for the probability inside the brackets.

Step 2: Conclude that the expected value of the probability goes to zero.

Lemma 7 in Appendix C implies that the sequence $(\sqrt{n}(\mu_{n,\alpha}-\theta^*), n\Sigma_{n,\alpha})$ is bounded in $f_{0,n}$-probability. It follows that both $\|\sqrt{n}(\mu_{n,\alpha}-\theta^*)\|^2$ and $\operatorname{tr}(n\Sigma_{n,\alpha})$ are bounded in $f_{0,n}$-probability. This means that for every $\epsilon > 0$ there exists an $M_\epsilon > 0$ such that

$$\mathbb{P}_{f_{0,n}}(A_\epsilon) \le \epsilon, \quad (55)$$

where $A_\epsilon = \left\{\|\sqrt{n}(\mu_{n,\alpha}-\theta^*)\|^2 + \operatorname{tr}(n\Sigma_{n,\alpha}) > M_\epsilon\right\}$. By linearity of the expectation, we have that for any sequence $r_n$ and any $\epsilon > 0$:

$$\mathbb{E}_{f_{0,n}}\left[\pi_{n,\alpha}\left(\sqrt{n}\|\theta-\theta^*\| > r_n \mid X^n\right)\right] \le \mathbb{E}_{f_{0,n}}\left[\pi_{n,\alpha}\left(\sqrt{n}\|\theta-\theta^*\| > r_n \mid X^n\right)\mathbb{1}\{A_\epsilon^c\}\right] + \mathbb{E}_{f_{0,n}}\left[\mathbb{1}\{A_\epsilon\}\right], \quad (56)$$
where $\mathbb{1}\{A_\epsilon\}$ is the indicator function of the event $A_\epsilon$. The first term in (56) can be bounded using (54) and the definition of $A_\epsilon$, leading to

$$\mathbb{E}_{f_{0,n}}\left[\pi_{n,\alpha}\left(\sqrt{n}\|\theta-\theta^*\| > r_n \mid X^n\right)\mathbb{1}\{A_\epsilon^c\}\right] \le \mathbb{E}_{f_{0,n}}\left[\frac{1}{r_n^2}\left(\|\sqrt{n}(\mu_{n,\alpha}-\theta^*)\|^2 + \operatorname{tr}(n\Sigma_{n,\alpha})\right)\mathbb{1}\{A_\epsilon^c\}\right] \le \frac{M_\epsilon}{r_n^2}.$$
Using (55), we see that the second term in (56) is smaller than $\epsilon$. Hence, we conclude that

$$\mathbb{E}_{f_{0,n}}\left[\pi_{n,\alpha}\left(\sqrt{n}\|\theta-\theta^*\| > r_n \mid X^n\right)\right] \le \frac{M_\epsilon}{r_n^2} + \epsilon,$$
which is sufficiently small since $\epsilon > 0$ was arbitrary, $M_\epsilon$ is constant, and $r_n\to\infty$. This verifies (53).

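The bound just derived can also be seen in a small simulation. The sketch below is our own illustration (the sample sizes, prior parameters, and the choice $r_n = \log n$ are made up): it forms the conjugate $\alpha$-posterior of the illustrative Gaussian linear model, using the normal updates consistent with those spelled out in the proof of Lemma 7, and reports the Markov bound from (54).

```python
import numpy as np

rng = np.random.default_rng(1)
p, alpha, sigma_u = 2, 0.5, 1.0
theta_star = np.array([1.0, -0.5])
mu_pi, Sigma_pi = np.zeros(p), np.eye(p)   # prior N(mu_pi, sigma_u^2 Sigma_pi^{-1})

for n in [100, 1_000, 10_000, 100_000]:
    W = rng.standard_normal((n, p))
    Y = W @ theta_star + sigma_u * rng.standard_normal(n)
    WtW, WtY = W.T @ W, W.T @ Y
    # conjugate alpha-posterior moments for the Gaussian linear model
    precision = (alpha * WtW + Sigma_pi) / sigma_u**2
    Sigma_na = np.linalg.inv(precision)
    mu_na = Sigma_na @ ((Sigma_pi @ mu_pi + alpha * WtY) / sigma_u**2)
    r_n = np.log(n)
    markov_bound = n / r_n**2 * (np.sum((mu_na - theta_star)**2) + np.trace(Sigma_na))
    print(n, markov_bound)   # the bound from (54) shrinks as n grows
```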
B.2 KL distance for Theorem 1

In our illustrative example, we can compute the KL distance between the $\alpha$-posterior distribution $\pi_{n,\alpha}$ and the distribution defined in (32), since both distributions are multivariate

normal. Using (43), we find the KL divergence to equal

$$\frac{1}{2}\left[-p + \log\frac{|(V_{\theta^*}\alpha n)^{-1}|}{|\Sigma_{n,\alpha}|} + \operatorname{tr}(V_{\theta^*}\alpha n\,\Sigma_{n,\alpha}) + (\mu_{n,\alpha}-\hat\theta_{\mathrm{ML}})^\top V_{\theta^*}\alpha n\,(\mu_{n,\alpha}-\hat\theta_{\mathrm{ML}})\right].$$
Lemma 7 implies that the previous expression converges to 0 in $f_{0,n}$-probability. This means that the KL distance between $\pi_{n,\alpha}$ and the distribution in (32) goes to zero in $f_{0,n}$-probability.

B.3 What if α goes to zero very quickly?

Suppose that the sequence $\alpha_n$ satisfies $n\alpha_n \to \alpha_0 > 0$. This implies that, in $f_{0,n}$-probability,

$$\mu_{n,\alpha_n} \to \left(\mathbb{E}[W_iW_i^\top] + \frac{\Sigma_\pi}{\alpha_0}\right)^{-1}\left(\frac{\Sigma_\pi\mu_\pi}{\alpha_0} + \mathbb{E}[W_iW_i^\top]\theta^*\right), \qquad \Sigma_{n,\alpha_n} \to \frac{\sigma_u^2}{\alpha_0}\left(\mathbb{E}[W_iW_i^\top] + \frac{\Sigma_\pi}{\alpha_0}\right)^{-1}.$$

Using these limits, we show that the total variation distance between the $\alpha_n$-posterior distribution $\pi_{n,\alpha_n}$ and the distribution in (32) is bounded away from zero. We do so using the fact that the square of the Hellinger distance is a lower bound for the total variation distance.

In our illustrative example, the αn-posterior distribution πn,αn and the distribution in (32) are both multivariate normal distributions. Then, we can compute the square of the Hellinger distance between these distributions. Using Lemma B.1 part ii) of Ghosal and Van der Vaart (2017), we obtain

$$1 - \frac{|\Sigma_{n,\alpha_n}|^{1/4}\,|(V_{\theta^*}\alpha_n n)^{-1}|^{1/4}}{\left|\left(\Sigma_{n,\alpha_n} + (V_{\theta^*}\alpha_n n)^{-1}\right)/2\right|^{1/2}}\exp\left(-\frac{1}{8}(\mu_{n,\alpha_n}-\hat\theta_{\mathrm{ML}})^\top\left(\frac{\Sigma_{n,\alpha_n} + (V_{\theta^*}\alpha_n n)^{-1}}{2}\right)^{-1}(\mu_{n,\alpha_n}-\hat\theta_{\mathrm{ML}})\right),$$
which converges in $f_{0,n}$-probability to a positive number. To verify this, notice that $\Sigma_{n,\alpha_n}$ and $(V_{\theta^*}\alpha_n n)^{-1}$ converge to different limits, which guarantees

$$\lim_{n\to\infty}\frac{|\Sigma_{n,\alpha_n}|^{1/4}\,|(V_{\theta^*}\alpha_n n)^{-1}|^{1/4}}{\left|\left(\Sigma_{n,\alpha_n} + (V_{\theta^*}\alpha_n n)^{-1}\right)/2\right|^{1/2}} < 1,$$
where the limit is taken in $f_{0,n}$-probability and the inequality follows by applying the arithmetic mean-geometric mean inequality.

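For completeness, the closed-form squared Hellinger distance between two Gaussians used above (Lemma B.1 of Ghosal and Van der Vaart, 2017) is simple to evaluate. The sketch below is our own, with illustrative (made-up) parameter values standing in for the two distinct limits.

```python
import numpy as np

def hellinger_sq_gaussians(mu1, Sigma1, mu2, Sigma2):
    """Squared Hellinger distance between N(mu1, Sigma1) and N(mu2, Sigma2)."""
    Sigma_bar = (Sigma1 + Sigma2) / 2
    diff = mu1 - mu2
    log_bc = (0.25 * np.linalg.slogdet(Sigma1)[1]
              + 0.25 * np.linalg.slogdet(Sigma2)[1]
              - 0.5 * np.linalg.slogdet(Sigma_bar)[1]
              - 0.125 * diff @ np.linalg.inv(Sigma_bar) @ diff)
    return 1.0 - np.exp(log_bc)   # H^2 = 1 - Bhattacharyya coefficient

# two distinct limiting Gaussians, as in the case n * alpha_n -> alpha_0 > 0
print(hellinger_sq_gaussians(np.zeros(2), np.eye(2), np.array([0.2, 0.0]), 2 * np.eye(2)))
```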
B.4 Verification of Assumption 2

In our illustrative example:

$$\pi(\theta) \sim \mathcal{N}\left(\mu_\pi,\, \sigma_u^2\Sigma_\pi^{-1}\right), \quad\text{and}\quad R_n(h) = h^\top Q_n\Delta_{n,\theta^*} - \frac{1}{2}h^\top Q_n h,$$

where $\Delta_{n,\theta^*} = \sqrt{n}(\hat\theta_{\mathrm{ML}}-\theta^*)$ and
$$Q_n \equiv \frac{\sum_{i=1}^n W_iW_i^\top}{n\sigma_u^2} - V_{\theta^*}.$$

In our illustrative example, $Q_n$ converges to zero in $f_{0,n}$-probability since $V_{\theta^*} = \mathbb{E}[W_iW_i^\top]/\sigma_u^2$.

Let us take a sequence $(\mu_n,\Sigma_n)$ such that $(\sqrt{n}(\mu_n-\theta^*), n\Sigma_n)$ is bounded in $f_{0,n}$-probability. Then, equation (17) in Assumption 2 becomes

$$\int\phi\left(h\mid\sqrt{n}(\mu_n-\theta^*), n\Sigma_n\right)\left[-\frac{1}{2}\frac{h^\top\Sigma_\pi h}{n\,\sigma_u^2} + \frac{h^\top\Sigma_\pi}{\sqrt{n}\,\sigma_u^2}(\mu_\pi-\theta^*)\right]dh = -\frac{1}{2n}\bar\mu_n^\top\frac{\Sigma_\pi}{\sigma_u^2}\bar\mu_n - \frac{1}{2n}\operatorname{tr}\left(n\Sigma_n\frac{\Sigma_\pi}{\sigma_u^2}\right) + \frac{1}{\sqrt{n}}\bar\mu_n^\top\frac{\Sigma_\pi}{\sigma_u^2}(\mu_\pi-\theta^*), \quad (57)$$
where $\bar\mu_n = \sqrt{n}(\mu_n-\theta^*)$. By assumption, the sequence $(\bar\mu_n, n\Sigma_n)$ is bounded in $f_{0,n}$-probability. This implies that (57) goes to zero in $f_{0,n}$-probability. Equation (18) in Assumption 2 can be computed explicitly as

$$\int\phi\left(h\mid\sqrt{n}(\mu_n-\theta^*), n\Sigma_n\right)\left[h^\top Q_n\Delta_{n,\theta^*} - \frac{1}{2}h^\top Q_n h\right]dh = \bar\mu_n^\top Q_n\Delta_{n,\theta^*} - \frac{1}{2}\bar\mu_n^\top Q_n\bar\mu_n - \frac{1}{2}\operatorname{tr}(Q_n\, n\Sigma_n).$$
By assumption, the sequence $(\bar\mu_n, n\Sigma_n)$ is bounded in $f_{0,n}$-probability. Since $Q_n$ converges to zero in $f_{0,n}$-probability, the expression above goes to zero in $f_{0,n}$-probability.

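The computations in (57) and in the display above rely only on the Gaussian moment identities $\mathbb{E}[h^\top b] = m^\top b$ and $\mathbb{E}[h^\top Q h] = m^\top Q m + \operatorname{tr}(QS)$ for $h\sim\mathcal{N}(m,S)$. The following Monte Carlo sketch is our own check of the closed form for equation (18); all numerical values are made up.

```python
import numpy as np

rng = np.random.default_rng(2)
p = 3
Q = 0.05 * np.array([[1.0, 0.2, 0.0], [0.2, 0.5, 0.1], [0.0, 0.1, 0.3]])  # small, like Q_n
Delta = np.array([0.4, -0.2, 0.1])
m, S = np.array([0.1, 0.0, -0.3]), 0.01 * np.eye(p)

h = rng.multivariate_normal(m, S, size=1_000_000)
mc = np.mean(h @ Q @ Delta - 0.5 * np.einsum('ij,jk,ik->i', h, Q, h))
closed_form = m @ Q @ Delta - 0.5 * m @ Q @ m - 0.5 * np.trace(Q @ S)
print(mc, closed_form)   # the two values agree up to Monte Carlo error
```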
B.5 KL distance for Theorem 2

Lemma 7 shows that in our illustrative example, the sequence (µn,α, Σn,α) verifies that

$(\sqrt{n}(\mu_{n,\alpha}-\theta^*), n\Sigma_{n,\alpha})$ is bounded in $f_{0,n}$-probability. Then, we can apply Theorem 2. The proof presented in Section A.2 shows that, in $f_{0,n}$-probability,

$$\mathcal{K}\left(\tilde\pi_{n,\alpha}(\cdot\mid X^n)\,\|\,\phi(\cdot\mid\hat\theta_{\mathrm{ML}\text{-}F_n},\operatorname{diag}(V_{\theta^*})^{-1}/(\alpha n))\right) \to 0.$$

This means that the Bernstein-von Mises theorem holds, in KL divergence, for the variational approximation to the $\alpha$-posterior.

B.6 About the optimal α for robustness

Let us recall the definition of rn(α) presented in Section 4 in equation (21):

$$r_n(\alpha) = \epsilon_n\,\mathcal{K}\left(\phi(\cdot\mid\nu_n^*,\Omega_n^*)\,\|\,\phi(\cdot\mid\mu_{n,\alpha},\Sigma_{n,\alpha})\right) + (1-\epsilon_n)\,\mathcal{K}\left(\phi(\cdot\mid\mu_{n,1},\Sigma_{n,1})\,\|\,\phi(\cdot\mid\mu_{n,\alpha},\Sigma_{n,\alpha})\right),$$

where $\nu_n^*$ is the true posterior mean and $\Omega_n^*$ is the true posterior covariance matrix. The expression above is equal to

$$\begin{aligned} r_n(\alpha) = \frac{1}{2}\Big\{ &\epsilon_n\operatorname{tr}\left(\Sigma_{n,\alpha}^{-1}\Omega_n^*\right) + \alpha(1-\epsilon_n)\operatorname{tr}\left((\alpha n\Sigma_{n,\alpha})^{-1}n\Sigma_{n,1}\right) \\ &+ \alpha n\,\epsilon_n(\mu_{n,\alpha}-\nu_n^*)^\top(\alpha n\Sigma_{n,\alpha})^{-1}(\mu_{n,\alpha}-\nu_n^*) + \alpha n(1-\epsilon_n)(\mu_{n,1}-\mu_{n,\alpha})^\top(\alpha n\Sigma_{n,\alpha})^{-1}(\mu_{n,1}-\mu_{n,\alpha}) \\ &- p\log(\alpha) - p + \epsilon_n\log\frac{|\alpha n\Sigma_{n,\alpha}|}{|n\Omega_n^*|} + (1-\epsilon_n)\log\frac{|\alpha n\Sigma_{n,\alpha}|}{|n\Sigma_{n,1}|}\Big\}. \end{aligned}$$

Notice that $\Sigma_{n,\alpha}^{-1}\Omega_n^* \to \alpha V_{\theta^*}\Omega$ in $f_{0,n}$-probability, since it can be proved that $n\Omega_n^* \to \Omega$ in the well-specified model, for some positive definite matrix $\Omega$. Lemma 7 implies that $(\alpha n\Sigma_{n,\alpha})^{-1}n\Sigma_{n,1} \to I_p$ in $f_{0,n}$-probability. Moreover, $n(\mu_{n,1}-\mu_{n,\alpha})$ is bounded in $f_{0,n}$-probability, which implies that $(\mu_{n,1}-\mu_{n,\alpha})^\top\Sigma_{n,\alpha}^{-1}(\mu_{n,1}-\mu_{n,\alpha}) \to 0$ in $f_{0,n}$-probability. Since $n\epsilon_n\to\varepsilon$, we conclude that, in $f_{0,n}$-probability, $r_n(\alpha)\to r_\infty(\alpha)$, where
$$r_\infty(\alpha) \equiv \frac{1}{2}\left(\alpha p + \alpha\varepsilon(\theta^*-\theta_0)^\top V_{\theta^*}(\theta^*-\theta_0) - p\log(\alpha) - p\right).$$

Using the notation introduced in the proof of Theorem 3 in Section A.3, we have

$$r_n^*(\alpha) = \frac{1}{2}\left(\alpha A_n(V_{\theta^*}) - p\log(\alpha) + B_n(V_{\theta^*})\right),$$
where

$$A_n(\Sigma) \equiv \epsilon_n\operatorname{tr}(\Sigma\Omega) + (1-\epsilon_n)\operatorname{tr}(\Sigma V_{\theta^*}^{-1}) + n\epsilon_n\left(\hat\theta_{\mathrm{ML}\text{-}F_n}-\hat\theta_{\mathrm{ML}}\right)^\top\Sigma\left(\hat\theta_{\mathrm{ML}\text{-}F_n}-\hat\theta_{\mathrm{ML}}\right),$$
$$B_n(\Sigma) \equiv -p + \epsilon_n\log\left(|\Omega|^{-1}|\Sigma|^{-1}\right) + (1-\epsilon_n)\log\left(|V_{\theta^*}|\,|\Sigma|^{-1}\right).$$

Notice that $A_n(V_{\theta^*}) \to p + \varepsilon(\theta^*-\theta_0)^\top V_{\theta^*}(\theta^*-\theta_0)$ and $B_n(V_{\theta^*}) \to -p$ in $f_{0,n}$-probability. This implies that

$$r_n^*(\alpha) \to r_\infty(\alpha) = \frac{1}{2}\left(\alpha p + \alpha\varepsilon(\theta^*-\theta_0)^\top V_{\theta^*}(\theta^*-\theta_0) - p\log(\alpha) - p\right).$$
Then, we can conclude that $r_n(\alpha) - r_n^*(\alpha) \to 0$ in $f_{0,n}$-probability. In particular, for $\alpha = \alpha^*$ defined in Theorem 3 and any $\alpha' \neq \alpha^*$, we have that $r_n(\alpha^*)$ is close to $r_\infty(\alpha^*)$ and $r_n(\alpha')$ is close to $r_\infty(\alpha')$. Since $r_\infty(\alpha^*) < r_\infty(\alpha')$ by definition of $\alpha^*$, it follows that, for large $n$, $r_n(\alpha^*) < r_n(\alpha')$.

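As a quick numerical illustration of this last step (our own sketch, with made-up values for $p$, $\varepsilon$, and the quadratic term), one can minimize $r_\infty(\alpha)$ on a grid and compare the minimizer with the closed form $\alpha^* = p/(p + \varepsilon(\theta^*-\theta_0)^\top V_{\theta^*}(\theta^*-\theta_0))$ from Theorem 3.

```python
import numpy as np

p, eps = 2, 1.5
quad = 0.8   # plays the role of (theta* - theta_0)' V_theta* (theta* - theta_0)

def r_inf(a):
    return 0.5 * (a * p + a * eps * quad - p * np.log(a) - p)

grid = np.linspace(1e-3, 2.0, 200_001)
alpha_grid = grid[np.argmin(r_inf(grid))]
alpha_closed = p / (p + eps * quad)
print(alpha_grid, alpha_closed)   # both are approximately 0.625, and below one
```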

Appendix C. Technical Lemmas

Lemma 4 Consider sequences of densities $\phi_n$ and $\psi_n$. For a given compact set $K\subset\mathbb{R}^p$, suppose that the densities $\psi_n$ and $\phi_n$ are positive on $K$. Then,

$$d_{TV}(\psi_n,\phi_n) \le \sup_{g,h\in K} f_n(g,h) + \int_{\mathbb{R}^p\setminus K}\psi_n(h)\,dh + \int_{\mathbb{R}^p\setminus K}\phi_n(h)\,dh,$$
where we have defined the function

$$f_n(g,h) = \left\{1 - \frac{\phi_n(h)\,\psi_n(g)}{\psi_n(h)\,\phi_n(g)}\right\}^+. \quad (58)$$

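Before turning to the proof, the bound can be checked numerically for specific densities. The sketch below is our own (it uses scipy quadrature, a pair of univariate Gaussians, and $K = [-3, 3]$, all chosen purely for illustration); it evaluates both sides of the inequality in Lemma 4.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

psi = lambda x: norm.pdf(x, loc=0.0, scale=1.0)
phi = lambda x: norm.pdf(x, loc=0.3, scale=1.2)
K = (-3.0, 3.0)

# total variation distance (densities are negligible beyond +/- 20)
tv = 0.5 * quad(lambda x: abs(psi(x) - phi(x)), -20.0, 20.0)[0]

# sup over K x K of f_n(g, h) = max(0, 1 - phi(h) psi(g) / (psi(h) phi(g)))
grid = np.linspace(*K, 601)
G, H = np.meshgrid(grid, grid)
sup_f = np.max(np.maximum(0.0, 1.0 - phi(H) * psi(G) / (psi(H) * phi(G))))

tail_psi = 1.0 - quad(psi, *K)[0]
tail_phi = 1.0 - quad(phi, *K)[0]
print(tv, sup_f + tail_psi + tail_phi)   # the first number is smaller, as Lemma 4 states
```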
Proof. First, denote $a_n = \left(\int_K\psi_n(g)\,dg\right)^{-1}$ and $b_n = \left(\int_K\phi_n(g)\,dg\right)^{-1}$. Notice that both are well defined since $\phi_n$ and $\psi_n$ are assumed positive on $K$. We will assume throughout, without loss of generality, that $a_n \ge b_n$.

First, by definition, $d_{TV}(\psi_n,\phi_n) = \frac{1}{2}\int|\psi_n(h)-\phi_n(h)|\,dh$. Then, since $|x| = 2\{x\}^+ - x$, it follows that the total variation is equal to
$$\begin{aligned} d_{TV}(\psi_n,\phi_n) &= \frac{1}{2}\int|\psi_n(h)-\phi_n(h)|\,dh \\ &= \int_{\mathbb{R}^p}\{\psi_n(h)-\phi_n(h)\}^+\,dh - \frac{1}{2}\int_{\mathbb{R}^p}(\psi_n(h)-\phi_n(h))\,dh \\ &= \int_{\mathbb{R}^p}\{\psi_n(h)-\phi_n(h)\}^+\,dh \\ &= \int_K\{\psi_n(h)-\phi_n(h)\}^+\,dh + \int_{\mathbb{R}^p\setminus K}\{\psi_n(h)-\phi_n(h)\}^+\,dh. \end{aligned} \quad (59)$$
Now, since $\psi_n,\phi_n$ are non-negative on all of $\mathbb{R}^p$, it follows that $\{\psi_n(h)-\phi_n(h)\}^+ \le \psi_n(h) + \phi_n(h)$. This provides a bound for the second term on the right side of (59):

$$\int_{\mathbb{R}^p\setminus K}\{\psi_n(h)-\phi_n(h)\}^+\,dh \le \int_{\mathbb{R}^p\setminus K}\psi_n(h)\,dh + \int_{\mathbb{R}^p\setminus K}\phi_n(h)\,dh. \quad (60)$$

To complete the proof, we show that the first term on the right side of (59) is upper bounded by $\sup_{g,h\in K} f_n(g,h)$. For all $h\in K$, we have the following identity,
$$\frac{\phi_n(h)}{\psi_n(h)} = \frac{a_n}{b_n}\int_K\frac{\phi_n(h)\,\psi_n(g)}{\psi_n(h)\,\phi_n(g)}\,b_n\phi_n(g)\,dg.$$


Thus, we can rewrite the first term on the right side of (59) as follows,

$$\begin{aligned} \int_K\{\psi_n(h)-\phi_n(h)\}^+\,dh &= \int_K\left\{1-\frac{\phi_n(h)}{\psi_n(h)}\right\}^+\psi_n(h)\,dh \\ &= \int_K\left\{1-\frac{a_n}{b_n}\int_K\frac{\phi_n(h)\,\psi_n(g)}{\psi_n(h)\,\phi_n(g)}\,b_n\phi_n(g)\,dg\right\}^+\psi_n(h)\,dh. \end{aligned} \quad (61)$$
Now, applying Jensen's inequality to the convex function $f(x) = \{1-x\}^+$, we have that $\{1-\mathbb{E}[X]\}^+ \le \mathbb{E}\{1-X\}^+$. Applying this to the final expression above, and recalling the definition of $f_n(g,h)$ in (58), we find

$$\begin{aligned} \int_K&\left\{1-\frac{a_n}{b_n}\int_K\frac{\phi_n(h)\,\psi_n(g)}{\psi_n(h)\,\phi_n(g)}\,b_n\phi_n(g)\,dg\right\}^+\psi_n(h)\,dh \\ &\le \int_K\int_K\left\{1-\frac{a_n}{b_n}\frac{\phi_n(h)\,\psi_n(g)}{\psi_n(h)\,\phi_n(g)}\right\}^+ b_n\phi_n(g)\,\psi_n(h)\,dg\,dh \le \int_K\int_K f_n(g,h)\,b_n\phi_n(g)\,\psi_n(h)\,dg\,dh, \end{aligned} \quad (62)$$
where the final step uses that when $a_n/b_n \ge 1$, we have $\{1-(a_n/b_n)x\}^+ \le \{1-x\}^+$ for $x \ge 0$. We finally note that,

$$\int_K\int_K f_n(g,h)\,b_n\phi_n(g)\,\psi_n(h)\,dg\,dh \le \left(\sup_{g,h\in K}f_n(g,h)\right)\int_K\int_K b_n\phi_n(g)\,\psi_n(h)\,dg\,dh = \left(\int_K\psi_n(h)\,dh\right)\sup_{g,h\in K}f_n(g,h) \le \sup_{g,h\in K}f_n(g,h).$$

The final inequality uses that $a_n^{-1} = \int_K\psi_n(g)\,dg \le 1$.

Lemma 5 Assume there exists a $\delta > 0$ such that the prior density $\pi$ is continuous and positive on $B_{\theta^*}(\delta)$, the closed ball of radius $\delta$ around $\theta^*$, and that Assumption 1 holds. For any $\eta,\epsilon > 0$, there exists a sequence $r_n\to+\infty$ and an integer $N(\eta,\epsilon) > 0$ such that for all $n > N(\eta,\epsilon)$, with $f_n(g,h)$ defined in (34),

$$\mathbb{P}_{f_{0,n}}\left(\sup_{g,h\in B_0(r_n)}f_n(g,h) > \eta\right) \le \epsilon,$$
where $B_0(r_n)$ denotes the closed ball of radius $r_n$ around $0$.

Proof: The proof has two steps. In Step 1, we prove the claim for any fixed $r > 0$ instead of a sequence $r_n$. In Step 2, we construct the sequence $r_n$ using equation (68).

Step 1: First notice that for any $r > 0$, there exists an integer $N_0(r) := \lceil 4r^2/\delta^2\rceil > 0$ such that $\theta^* + h/\sqrt{n} \in B_{\theta^*}(\delta)$ whenever $h\in B_0(r)$ and $n \ge N_0(r)$. To see that this is true, notice

that if $\|h\| \le r$ and $n \ge 4r^2/\delta^2$, then $\|h\|/\sqrt{n} \le \delta/2 < \delta$. This ensures that the function $f_n(g,h)$ is well-defined whenever $g,h\in B_0(r)$.

Recall the definition of the $\alpha$-posterior $\pi_{n,\alpha}$ in (7), as well as that of the scaled densities $\pi_{n,\alpha}^{LAN}(h\mid X^n) = n^{-p/2}\pi_{n,\alpha}(\theta^*+h/\sqrt{n}\mid X^n)$ and $\pi_n(h) = n^{-p/2}\pi(\theta^*+h/\sqrt{n})$ introduced in the proof of Theorem 1. Then we see that for any two sequences $\{h_n\}, \{g_n\}$ in $B_0(r)$ and $n > N_0(r)$,

$$\frac{\pi_{n,\alpha}^{LAN}(g_n\mid X^n)}{\pi_{n,\alpha}^{LAN}(h_n\mid X^n)} = \frac{\pi_{n,\alpha}\left(\theta^*+\frac{g_n}{\sqrt{n}}\mid X^n\right)}{\pi_{n,\alpha}\left(\theta^*+\frac{h_n}{\sqrt{n}}\mid X^n\right)} = \frac{\left[f_n\left(X^n\mid\theta^*+\frac{g_n}{\sqrt{n}}\right)\right]^\alpha\pi\left(\theta^*+\frac{g_n}{\sqrt{n}}\right)}{\left[f_n\left(X^n\mid\theta^*+\frac{h_n}{\sqrt{n}}\right)\right]^\alpha\pi\left(\theta^*+\frac{h_n}{\sqrt{n}}\right)} = \frac{\left[f_n\left(X^n\mid\theta^*+\frac{g_n}{\sqrt{n}}\right)\right]^\alpha\pi_n(g_n)}{\left[f_n\left(X^n\mid\theta^*+\frac{h_n}{\sqrt{n}}\right)\right]^\alpha\pi_n(h_n)}. \quad (63)$$
Thus, by the definition in (34), with the notation $s_n(h_n) = \left[f_n(X^n\mid\theta^*+h_n/\sqrt{n})/f_n(X^n\mid\theta^*)\right]^\alpha$,
$$f_n(g_n,h_n) = \left\{1 - \frac{\phi_n(h_n)}{\pi_{n,\alpha}^{LAN}(h_n\mid X^n)}\frac{\pi_{n,\alpha}^{LAN}(g_n\mid X^n)}{\phi_n(g_n)}\right\}^+ = \left\{1 - \frac{\phi_n(h_n)\,s_n(g_n)\,\pi_n(g_n)}{\phi_n(g_n)\,s_n(h_n)\,\pi_n(h_n)}\right\}^+.$$
Recall that $\phi_n(h_n) = \phi(h_n\mid\Delta_{n,\theta^*},V_{\theta^*}^{-1}/\alpha)$ and notice that, since $\|h_n\|/\sqrt{n} < \delta$ as discussed at the beginning of Step 1, the above is well-defined (i.e., $\pi_{n,\alpha}^{LAN}(h_n\mid X^n)$ is positive).

Next, for any sequence $h_n\in B_0(r)$, Assumption 1 implies

$$\log(s_n(h_n)) = \alpha\log\left(\frac{f_n(X^n\mid\theta^*+\frac{h_n}{\sqrt{n}})}{f_n(X^n\mid\theta^*)}\right) = h_n^\top\alpha V_{\theta^*}\Delta_{n,\theta^*} - \frac{1}{2}h_n^\top\alpha V_{\theta^*}h_n + o_{f_{0,n}}(1), \quad (64)$$
and algebra shows that the logarithm of the normal density $\phi_n$ can be written as

$$\log\phi_n(h_n) = -\frac{p}{2}\log(2\pi) + \frac{1}{2}\log(\det(\alpha V_{\theta^*})) - \frac{1}{2}(h_n-\Delta_{n,\theta^*})^\top\alpha V_{\theta^*}(h_n-\Delta_{n,\theta^*}),$$
hence

$$\log\left(\frac{s_n(h_n)}{\phi_n(h_n)}\right) = o_{f_{0,n}}(1) + \frac{p}{2}\log(2\pi) - \frac{1}{2}\log(\det(\alpha V_{\theta^*})) + \frac{1}{2}\Delta_{n,\theta^*}^\top\alpha V_{\theta^*}\Delta_{n,\theta^*}.$$

Now, for any sequence $g_n\in B_0(r)$, define

$$b_n(g_n,h_n) \equiv \frac{\phi_n(h_n)\,s_n(g_n)\,\pi_n(g_n)}{\phi_n(g_n)\,s_n(h_n)\,\pi_n(h_n)}. \quad (65)$$


Then we conclude

$$\log(b_n(g_n,h_n)) = \log\left(\frac{\phi_n(h_n)\,s_n(g_n)\,\pi_n(g_n)}{\phi_n(g_n)\,s_n(h_n)\,\pi_n(h_n)}\right) = o_{f_{0,n}}(1), \quad (66)$$
where we have used that $\pi(\theta^*+g_n/\sqrt{n}), \pi(\theta^*+h_n/\sqrt{n}) \to \pi(\theta^*)$ as $n\to\infty$.

Since $h_n, g_n$ are arbitrary sequences in $B_0(r)$, the result in (66) is equivalent to saying that for any fixed $r$ there exists an integer $\tilde N_0(r,\epsilon,\eta)$ such that for $n > \max\{\tilde N_0(r,\epsilon,\eta), N_0(r)\}$:
$$\mathbb{P}_{f_{0,n}}\left(\sup_{g_n,h_n\in B_0(r)}|\log(b_n(g_n,h_n))| > \eta\right) \le \epsilon. \quad (67)$$

Next, notice that

$$|\log(b_n(g_n,h_n))| \ge |\log(\min\{1,\,b_n(g_n,h_n)\})| = |\log(1-f_n(g_n,h_n))| \ge f_n(g_n,h_n),$$
where the equality follows since $f_n(g,h) = 1 - \min\{b_n(g,h),1\}$ by definition, and the final inequality follows by noting that the function $f(x) = |\log(1-x)| - x$ is increasing for $x\in[0,1]$ and $f(0)=0$. Thus, by (67), for all $n > \max\{\tilde N_0(r,\eta,\epsilon), N_0(r)\}$,

Pf0,n sup log(bn(gn, hn)) > η ǫ . ≤ gn,hn B0(r) | | ! ≤ ∈

Step 2: Let $N^*(\epsilon,\eta) = \max\{\tilde N_0(1,\epsilon,\eta),\, 4/\delta^2\}$, for $\tilde N_0(r,\epsilon,\eta)$ defined just above (67), and for all $n > N^*(\epsilon,\eta)$ let
$$r_n = \max\left\{r\in\mathbb{R}\,\middle|\, r \le \delta\sqrt{n}/2 \text{ and } n > \tilde N_0(r,\epsilon,\eta)\right\}.$$

Notice this is well defined since, for any $n > N^*(\epsilon,\eta)$, the choice $r = 1$ is valid because $\frac{\delta}{2}\sqrt{n} > 1$ and $n > \tilde N_0(1,\epsilon,\eta)$ by the definition of $N^*(\epsilon,\eta)$. For $n \le N^*(\epsilon,\eta)$ we can define $r_n$ arbitrarily, e.g. $r_n = 1$. First notice that $r_n\to\infty$. Indeed,
$$r_n = \max\left\{r\in\mathbb{R}\,\middle|\, r\le\delta\sqrt{n}/2 \text{ and } n > \tilde N_0(r,\epsilon,\eta)\right\} \ge \min\left\{\delta\sqrt{n}/2,\ \max\left\{r\in\mathbb{R}\,\middle|\, n > \tilde N_0(r,\epsilon,\eta)\right\}\right\}.$$

Clearly $\delta\sqrt{n}/2\to\infty$, so $r_n\to\infty$ if $\max\{r\in\mathbb{R}\mid n > \tilde N_0(r,\epsilon,\eta)\}\to\infty$. This is true since, if it were not, there would be some $r_{\max}$ such that $\tilde N_0(r,\epsilon,\eta) = \infty$ for $r > r_{\max}$. However,


Assumption 1 holds for any compact set $K\subset\mathbb{R}^p$, in particular for any ball $B_0(r)$ with $r < \infty$, hence $\tilde N_0(r,\epsilon,\eta) < \infty$ for any $r < \infty$.

Finally, following the proof in Step 1, we show that for all $n > N^*(\epsilon,\eta)$,

$$\mathbb{P}_{f_{0,n}}\left(\sup_{g,h\in B_0(r_n)}f_n(g,h) > \eta\right) \le \epsilon. \quad (69)$$

First, $\theta^* + h_n/\sqrt{n}\in B_{\theta^*}(\delta)$ whenever $h_n\in B_0(r_n)$, since $\|h_n\|/\sqrt{n} \le r_n/\sqrt{n} \le \delta/2$. This guarantees that $f_n(g_n,h_n)$ is well-defined, as discussed in Step 1. Then, for all $n > N^*(\epsilon,\eta)$, by the definition of $r_n$ we have $n > \tilde N_0(r_n,\epsilon,\eta)$; hence, by the work in (67)-(68), the bound in (69) holds.

Lemma 6 Suppose Assumptions 1 and 2 hold. Let $(\mu_n,\Sigma_n)$ be a sequence such that $(\sqrt{n}(\mu_n-\theta^*), n\Sigma_n)$ is bounded in $f_{0,n}$-probability. Then,

$$\mathcal{K}\left(\phi(\cdot\mid\mu_n,\Sigma_n)\,\|\,\pi_{n,\alpha}(\cdot\mid X^n)\right) = \mathcal{K}\left(\phi(\cdot\mid\mu_n,\Sigma_n)\,\|\,\phi(\cdot\mid\hat\theta_{\mathrm{ML}\text{-}F_n},V_{\theta^*}^{-1}/(\alpha n))\right) + o_{f_{0,n}}(1).$$
Proof: Using the change of variables $h \equiv \sqrt{n}(\theta-\theta^*)$ and reparametrizing $\bar\mu_n \equiv \sqrt{n}(\mu_n-\theta^*)$, we have that

$$\begin{aligned} \mathcal{K}\left(\phi(\cdot\mid\mu_n,\Sigma_n)\,\|\,\pi_{n,\alpha}(\cdot\mid X^n)\right) &= \int\phi(\theta\mid\mu_n,\Sigma_n)\log\left(\frac{\phi(\theta\mid\mu_n,\Sigma_n)}{\pi_{n,\alpha}(\theta\mid X^n)}\right)d\theta \\ &= \int\phi(h\mid\bar\mu_n,n\Sigma_n)\log\left(\frac{n^{p/2}\phi(h\mid\bar\mu_n,n\Sigma_n)}{\pi_{n,\alpha}(\theta^*+h/\sqrt{n}\mid X^n)}\right)dh \\ &= I_1 + I_2 + I_3, \end{aligned} \quad (70)$$
where

$$I_1 \equiv \int\phi(h\mid\bar\mu_n,n\Sigma_n)\log\phi(h\mid\bar\mu_n,n\Sigma_n)\,dh,$$
$$I_2 \equiv -\int\phi(h\mid\bar\mu_n,n\Sigma_n)\log\left(\frac{\pi_{n,\alpha}(\theta^*+h/\sqrt{n}\mid X^n)}{\pi_{n,\alpha}(\theta^*\mid X^n)}\right)dh,$$
$$I_3 \equiv -\left[\log\left(\frac{\pi_{n,\alpha}(\theta^*\mid X^n)}{\phi(\theta^*\mid\hat\theta_{\mathrm{ML}\text{-}F_n},V_{\theta^*}^{-1}/(\alpha n))}\right) + \log\left(\frac{1}{n^{p/2}}\phi(\theta^*\mid\hat\theta_{\mathrm{ML}\text{-}F_n},V_{\theta^*}^{-1}/(\alpha n))\right)\right].$$
We will compute the three terms separately and show that their sum gives the desired result.

The first term $I_1$ is the negative of the entropy of a Gaussian distribution with mean $\bar\mu_n$ and covariance matrix $n\Sigma_n$. A direct computation of this quantity gives

$$I_1 = -\frac{p}{2}\log(2\pi) - \frac{p}{2} - \frac{1}{2}\log|n\Sigma_n|. \quad (71)$$


To compute the second term I2, notice that

$$\frac{\pi_{n,\alpha}(\theta^*+h/\sqrt{n}\mid X^n)}{\pi_{n,\alpha}(\theta^*\mid X^n)} = \frac{f_n(X^n\mid\theta^*+h/\sqrt{n})^\alpha\,\pi(\theta^*+h/\sqrt{n})}{f_n(X^n\mid\theta^*)^\alpha\,\pi(\theta^*)},$$
hence

$$I_2 = -\alpha\int\phi(h\mid\bar\mu_n,n\Sigma_n)\log\left(\frac{f_n(X^n\mid\theta^*+h/\sqrt{n})}{f_n(X^n\mid\theta^*)}\right)dh - \int\phi(h\mid\bar\mu_n,n\Sigma_n)\log\left(\frac{\pi(\theta^*+h/\sqrt{n})}{\pi(\theta^*)}\right)dh, \quad (72)$$
and we consider the two terms on the right side of (72) separately. First, notice that by

Assumption 2, the second term is $o_{f_{0,n}}(1)$. For the first term, applying Assumption 1 we find

$$\begin{aligned} -\alpha\int\phi(h\mid\bar\mu_n,n\Sigma_n)\log\left(\frac{f_n(X^n\mid\theta^*+h/\sqrt{n})}{f_n(X^n\mid\theta^*)}\right)dh &= -\alpha\int\phi(h\mid\bar\mu_n,n\Sigma_n)\left[h^\top V_{\theta^*}\Delta_{n,\theta^*} - \frac{1}{2}h^\top V_{\theta^*}h + R_n(h)\right]dh \\ &= -\alpha\int\phi(h\mid\bar\mu_n,n\Sigma_n)\left[h^\top V_{\theta^*}\Delta_{n,\theta^*} - \frac{1}{2}h^\top V_{\theta^*}h\right]dh + o_{f_{0,n}}(1). \end{aligned} \quad (73)$$
Plugging this into (72) and solving the integral, we find

$$\begin{aligned} I_2 &= -\alpha\int\phi(h\mid\bar\mu_n,n\Sigma_n)\left[h^\top V_{\theta^*}\Delta_{n,\theta^*} - \frac{1}{2}h^\top V_{\theta^*}h\right]dh + o_{f_{0,n}}(1) \\ &= -\alpha\,\bar\mu_n^\top V_{\theta^*}\Delta_{n,\theta^*} + \frac{\alpha}{2}\left(\bar\mu_n^\top V_{\theta^*}\bar\mu_n + \operatorname{tr}(n\Sigma_n V_{\theta^*})\right) + o_{f_{0,n}}(1) \\ &= \frac{\alpha}{2}\left((\Delta_{n,\theta^*}-\bar\mu_n)^\top V_{\theta^*}(\Delta_{n,\theta^*}-\bar\mu_n) - \Delta_{n,\theta^*}^\top V_{\theta^*}\Delta_{n,\theta^*} + \operatorname{tr}(n\Sigma_n V_{\theta^*})\right) + o_{f_{0,n}}(1). \end{aligned} \quad (74)$$
Next, replacing $\Delta_{n,\theta^*} = \sqrt{n}(\hat\theta_{\mathrm{ML}\text{-}F_n}-\theta^*)$ and $\bar\mu_n = \sqrt{n}(\mu_n-\theta^*)$ in (74) yields
$$I_2 = \frac{1}{2}(\hat\theta_{\mathrm{ML}\text{-}F_n}-\mu_n)^\top(\alpha n V_{\theta^*})(\hat\theta_{\mathrm{ML}\text{-}F_n}-\mu_n) + \frac{1}{2}\operatorname{tr}(\Sigma_n\alpha n V_{\theta^*}) - \frac{1}{2}(\hat\theta_{\mathrm{ML}\text{-}F_n}-\theta^*)^\top(\alpha n V_{\theta^*})(\hat\theta_{\mathrm{ML}\text{-}F_n}-\theta^*) + o_{f_{0,n}}(1). \quad (75)$$
Let us now turn to the term $I_3$. We first claim that

$$\log\left(\frac{\pi_{n,\alpha}(\theta^*\mid X^n)}{\phi(\theta^*\mid\hat\theta_{\mathrm{ML}\text{-}F_n},V_{\theta^*}^{-1}/(\alpha n))}\right) = o_{f_{0,n}}(1), \quad (76)$$

and we will prove this in what follows. Result (76) implies that

$$\begin{aligned} I_3 &= -\log\left(n^{-p/2}\phi(\theta^*\mid\hat\theta_{\mathrm{ML}\text{-}F_n},V_{\theta^*}^{-1}/(\alpha n))\right) + o_{f_{0,n}}(1) \\ &= \frac{p}{2}\log(2\pi) - \frac{1}{2}\log|\alpha V_{\theta^*}| + \frac{1}{2}(\hat\theta_{\mathrm{ML}\text{-}F_n}-\theta^*)^\top(\alpha n V_{\theta^*})(\hat\theta_{\mathrm{ML}\text{-}F_n}-\theta^*) + o_{f_{0,n}}(1). \end{aligned} \quad (77)$$

Finally, combining (70), (75) and (77) we conclude that

$$\begin{aligned} \mathcal{K}\left(\phi(\cdot\mid\mu_n,\Sigma_n)\,\|\,\pi_{n,\alpha}(\cdot\mid X^n)\right) &= -\frac{1}{2}\left(\log|\alpha n V_{\theta^*}| + \log|\Sigma_n| - \operatorname{tr}(\Sigma_n\alpha n V_{\theta^*}) - (\hat\theta_{\mathrm{ML}\text{-}F_n}-\mu_n)^\top(\alpha n V_{\theta^*})(\hat\theta_{\mathrm{ML}\text{-}F_n}-\mu_n) + p\right) + o_{f_{0,n}}(1) \\ &= \mathcal{K}\left(\phi(\cdot\mid\mu_n,\Sigma_n)\,\|\,\phi(\cdot\mid\hat\theta_{\mathrm{ML}\text{-}F_n},V_{\theta^*}^{-1}/(\alpha n))\right) + o_{f_{0,n}}(1), \end{aligned}$$
where the last equality follows from (43).

It therefore remains to verify (76) in order to complete the proof. We first notice that the change of variables $h \equiv \sqrt{n}(\theta-\theta^*)$ leads to
$$\frac{\pi_{n,\alpha}(\theta^*\mid X^n)}{\phi(\theta^*\mid\hat\theta_{\mathrm{ML}\text{-}F_n},V_{\theta^*}^{-1}/(\alpha n))} = \frac{\pi_{n,\alpha}^{LAN}(0\mid X^n)}{\phi_n(0)}, \quad (78)$$
where $\phi_n(h) \equiv \phi(h\mid\Delta_{n,\theta^*},V_{\theta^*}^{-1}/\alpha)$ and $\pi_{n,\alpha}^{LAN}(h\mid X^n) \equiv n^{-p/2}\pi_{n,\alpha}(\theta^*+h/\sqrt{n}\mid X^n)$ are the densities of $h = \sqrt{n}(\theta-\theta^*)$ induced by $\phi(\cdot\mid\hat\theta_{\mathrm{ML}\text{-}F_n},V_{\theta^*}^{-1}/(\alpha n))$ and $\pi_{n,\alpha}(\cdot\mid X^n)$, respectively. Consider the function $b_n(g,h)$ defined as

$$b_n(g,h) \equiv \frac{\pi_{n,\alpha}^{LAN}(h\mid X^n)}{\phi_n(h)}\cdot\frac{\phi_n(g)}{\pi_{n,\alpha}^{LAN}(g\mid X^n)},$$
and note that it is the same as the function defined in (65). Therefore, using (67), we can guarantee the existence of $r_n$ such that, for any $\eta > 0$,

$$\lim_{n\to\infty}\mathbb{P}_{f_{0,n}}\left(\sup_{g,h\in B_0(r_n)}|\log(b_n(g,h))| > \eta\right) = 0.$$
This implies that $\sup_{g,h\in B_0(r_n)} b_n(g,h) = 1 + o_{f_{0,n}}(1)$. Define $c_n = \left(\int_{B_0(r_n)}\pi_{n,\alpha}^{LAN}(h\mid X^n)\,dh\right)^{-1}$ and $d_n = \left(\int_{B_0(r_n)}\phi_n(h)\,dh\right)^{-1}$. Using this notation, we have that (78) is equal to
$$\frac{d_n}{c_n}\,\frac{\pi_{n,\alpha}^{LAN}(0\mid X^n)}{\phi_n(0)}\int_{B_0(r_n)}\frac{\phi_n(g)}{\pi_{n,\alpha}^{LAN}(g\mid X^n)}\,c_n\,\pi_{n,\alpha}^{LAN}(g\mid X^n)\,dg,$$

which, by the definition of $b_n$, is bounded above by

$$\frac{d_n}{c_n}\sup_{g,h\in B_0(r_n)} b_n(g,h) = \frac{d_n}{c_n}\left(1 + o_{f_{0,n}}(1)\right).$$
By the concentration assumption (10) on the $\alpha$-posterior, we have that $c_n\to 1$ in $f_{0,n}$-probability. Furthermore, from Lemma 5.2 in Kleijn and Van der Vaart (2012), it follows that $d_n\to 1$ in $f_{0,n}$-probability. This implies that
$$\frac{\pi_{n,\alpha}^{LAN}(0\mid X^n)}{\phi(0\mid\Delta_{n,\theta^*},V_{\theta^*}^{-1}/\alpha)} \le 1 + o_{f_{0,n}}(1).$$
In a similar way, we can conclude

$$\frac{\phi(0\mid\Delta_{n,\theta^*},V_{\theta^*}^{-1}/\alpha)}{\pi_{n,\alpha}^{LAN}(0\mid X^n)} \le 1 + o_{f_{0,n}}(1).$$
Using (78) and the two inequalities above, we can conclude (76).

Lemma 7 Let $(\mu_{n,\alpha},\Sigma_{n,\alpha})$ be the sequence defined in (30) and (31). Denote by $\theta^*$ the (pseudo-)true parameter of the illustrative example in Section 5. Then, the sequence $(\sqrt{n}(\mu_{n,\alpha}-\theta^*), n\Sigma_{n,\alpha})$ is bounded in $f_{0,n}$-probability. Moreover, if $\hat\theta_{\mathrm{ML}}$ is the maximum likelihood estimator of $\theta^*$, we have that $n(\mu_{n,\alpha}-\hat\theta_{\mathrm{ML}})$ is bounded in $f_{0,n}$-probability.

Proof: Using (30), we have that $\sqrt{n}(\mu_{n,\alpha}-\theta^*)$ is equal to
$$\left(\frac{1}{n}\sum_{i=1}^n W_iW_i^\top + \frac{1}{\alpha n}\Sigma_\pi\right)^{-1}\left(\frac{1}{\alpha\sqrt{n}}\Sigma_\pi(\mu_\pi-\theta^*) + \left(\frac{1}{n}\sum_{i=1}^n W_iW_i^\top\right)\sqrt{n}(\hat\theta_{\mathrm{ML}}-\theta^*)\right),$$
where
$$\hat\theta_{\mathrm{ML}} = \left(\frac{1}{n}\sum_{i=1}^n W_iW_i^\top\right)^{-1}\frac{1}{n}\sum_{i=1}^n W_iY_i.$$

Since $\theta^*$ is the (pseudo-)true parameter and $\hat\theta_{\mathrm{ML}}$ is the maximum likelihood estimator, $\sqrt{n}(\hat\theta_{\mathrm{ML}}-\theta^*)$ converges in distribution to a multivariate normal distribution, and $n^{-1}\sum_{i=1}^n W_iW_i^\top$ converges to $\mathbb{E}[W_iW_i^\top]$ in $f_{0,n}$-probability. By Slutsky's theorem, we conclude that $\sqrt{n}(\mu_{n,\alpha}-\theta^*)$ converges in distribution, which implies that $\sqrt{n}(\mu_{n,\alpha}-\theta^*)$ is bounded in $f_{0,n}$-probability.


Using (31), we have that nΣn,α is equal to

$$n\Sigma_{n,\alpha} \equiv \frac{\sigma_u^2}{\alpha}\left(\frac{1}{n}\sum_{i=1}^n W_iW_i^\top + \frac{1}{\alpha n}\Sigma_\pi\right)^{-1},$$
and this converges in $f_{0,n}$-probability to $\frac{\sigma_u^2}{\alpha}\left(\mathbb{E}[W_iW_i^\top]\right)^{-1} = \frac{1}{\alpha}V_{\theta^*}^{-1}$. It follows that $n\Sigma_{n,\alpha}$ is bounded in $f_{0,n}$-probability.

Finally, algebra shows that $n(\mu_{n,\alpha}-\hat\theta_{\mathrm{ML}})$ is equal to
$$\left(\frac{1}{n}\sum_{i=1}^n W_iW_i^\top + \frac{1}{\alpha n}\Sigma_\pi\right)^{-1}\frac{1}{\alpha}\Sigma_\pi\left(\mu_\pi-\hat\theta_{\mathrm{ML}}\right),$$
which converges in $f_{0,n}$-probability to $\left(\mathbb{E}[W_iW_i^\top]\right)^{-1}\frac{1}{\alpha}\Sigma_\pi\left(\mu_\pi-\hat\theta_{\mathrm{ML}}\right)$. This implies that $n(\mu_{n,\alpha}-\hat\theta_{\mathrm{ML}})$ is bounded in $f_{0,n}$-probability.

References

Pierre Alquier and James Ridgway. Concentration of tempered posteriors and of their variational approximations. Annals of Statistics, 48(3):1475–1497, 2020.

Andrew R Barron and Thomas M Cover. Minimum complexity density estimation. IEEE transactions on information theory, 37(4):1034–1054, 1991.

Claude Berge. Topological Spaces: including a treatment of multi-valued functions, vector spaces, and convexity. Dover publications, 1st english edition, 1963.

Anirban Bhattacharya, Debdeep Pati, and Yun Yang. Bayesian fractional posteriors. Annals of Statistics, 47(1):39–66, 2019.

David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for statisticians. Journal of the American statistical Association, 112(518):859–877, 2017.

Stéphane Bonhomme. Teams: Heterogeneity, sorting, and complementarity. University of Chicago, Becker Friedman Institute for Economics Working Paper, (2021-15), 2021.

Christopher P Burgess, Irina Higgins, Arka Pal, Loic Matthey, Nick Watters, Guillaume Desjardins, and Alexander Lerchner. Understanding disentangling in β-VAE. arXiv preprint arXiv:1804.03599, 2018.


Ismaël Castillo and Judith Rousseau. A Bernstein–von Mises theorem for smooth functionals in semiparametric models. Annals of Statistics, 43(6):2353–2383, 2015.

Bertrand S. Clarke. Asymptotic normality of the posterior in relative entropy. IEEE Transactions on Information Theory, 45(1):165–176, 1999.

Anirban DasGupta. Asymptotic theory of statistics and probability. Springer Science & Business Media, 2008.

Subhashis Ghosal and Aad Van der Vaart. Fundamentals of nonparametric Bayesian inference. Cambridge University Press, 2017.

Peter Grünwald. Safe learning: bridging the gap between Bayes, MDL and statistical learning theory via empirical convexity. In Proceedings of the 24th Annual Conference on Learning Theory, pages 397–420. JMLR Workshop and Conference Proceedings, 2011.

Peter Grünwald. The safe Bayesian. In International Conference on Algorithmic Learning Theory, pages 169–183. Springer, 2012.

Peter Grünwald and Thijs Van Ommen. Inconsistency of Bayesian inference for misspecified linear models, and a proposal for repairing it. Bayesian Analysis, 12(4):1069–1103, 2017.

Paul Gustafson. On measuring sensitivity to parametric model misspecification. Journal of the Royal Statistical Society: Series B, 63(1):81–94, 2001.

Lars Peter Hansen and Thomas J Sargent. Robust control and model uncertainty. American Economic Review, 91(2):60–66, 2001.

Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. β-VAE: Learning basic visual concepts with a constrained variational framework. In Proceedings of the 5th International Conference on Learning Representations, volume 2, page 6, 2017.

CC Holmes and SG Walker. Assigning a value to a power likelihood in a general bayesian model. Biometrika, 104(2):497–503, 2017.

Roger A Horn and Charles R Johnson. Matrix Analysis. Cambridge University Press, 2012.

Chin-Wei Huang, Shawn Tan, Alexandre Lacoste, and Aaron Courville. Improving explorability in variational inference with annealed variational objectives. In Advances in Neural Information Processing Systems, pages 9724–9734, 2018.


Peter J Huber. The behavior of maximum likelihood estimates under nonstandard conditions. In Proc. Fifth Berkeley Sympos. Math. Statist. and Probability, volume 1, pages 221–233. Berkeley, CA: Univ. California Press, 1967.

Prateek Jaiswal, Vinayak Rao, and Harsha Honnappa. Asymptotic consistency of α-Rényi-approximate posteriors. Journal of Machine Learning Research, 21(156):1–42, 2020.

Bas J. K. Kleijn and Aad W. Van der Vaart. The Bernstein-von-Mises theorem under misspecification. Electronic Journal of Statistics, 6:354–381, 2012.

J. Knoblauch, J. Jewson, and T. Damoulas. Generalized variational inference: Three arguments for deriving new posteriors. arXiv:1904.02063, 2019.

Gary Koop and Dimitris Korobilis. Variational Bayes inference in high-dimensional time-varying parameter models. 2018.

Yong Li, Nianling Wang, and Jun Yu. Improved marginal likelihood estimation via power posteriors and importance sampling. SMU Economics and Statistics Working Paper Series, Paper No. 16-2019, 2019.

Fabio Maccheroni, Massimo Marinacci, and Aldo Rustichini. Ambiguity aversion, robustness, and the variational representation of preferences. Econometrica, 74(6):1447–1498, 2006.

David A McAllester. Pac-bayesian stochastic model selection. Machine Learning, 51(1):5–21, 2003.

Angelo Mele and Lingjiong Zhu. Approximate variational estimation for a model of network formation. Available at SSRN 2909829, 2019.

Jeffrey W Miller and David B Dunson. Robust Bayesian inference via coarsening. Journal of the American Statistical Association, 114(527):1113–1125, 2019.

Ulrich K Müller. Risk of Bayesian inference in misspecified models, and the sandwich covariance matrix. Econometrica, 81(5):1805–1849, 2013.

Demian Pouzo, Zacharias Psaradakis, and Martin Sola. Maximum likelihood estimation in possibly misspecified dynamic models with time inhomogeneous markov regimes. Available at SSRN 2887771, 2016.

Tomasz Strzalecki. Axiomatic foundations of multiplier preferences. Econometrica, 79(1): 47–73, 2011.


Aad W Van der Vaart. Asymptotic statistics, volume 3. Cambridge university press, 2000.

Volodimir G Vovk. Aggregating strategies. Proc. of Computational Learning Theory, 1990, 1990.

Stephen Walker and Nils Lid Hjort. On bayesian consistency. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(4):811–821, 2001.

Yixin Wang and David Blei. Variational bayes under model misspecification. In Advances in Neural Information Processing Systems, pages 13357–13367, 2019a.

Yixin Wang and David M. Blei. Frequentist consistency of variational Bayes. Journal of the American Statistical Association, 114(527):1147–1161, 2019b. doi: 10.1080/01621459.2018.1473776. URL https://doi.org/10.1080/01621459.2018.1473776.

Yixin Wang, Alp Kucukelbir, and David M. Blei. Robust probabilistic modeling with Bayesian data reweighting. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3646–3655. PMLR, 06–11 Aug 2017.

Halbert White. Maximum likelihood estimation of misspecified models. Econometrica, 50 (1):1–25, 1982.

Yun Yang, Debdeep Pati, and Anirban Bhattacharya. α-variational inference with statistical guarantees. Annals of Statistics, 48(2):886–905, 2020.

Tong Zhang. From ε-entropy to KL-entropy: Analysis of minimum information complexity density estimation. The Annals of Statistics, 34(5):2180–2210, 2006.
