Article

Geometric Variational Inference
Philipp Frank 1,2,* , Reimar Leike 1 and Torsten A. Enßlin 1,2
1 Max–Planck Institut für Astrophysik, Karl-Schwarzschild-Straße 1, 85748 Garching, Germany; [email protected] (R.L.); [email protected] (T.A.E.)
2 Faculty of Physics, Ludwig-Maximilians-Universität München, Geschwister-Scholl-Platz 1, 80539 München, Germany
* Correspondence: [email protected]
Abstract: Efficiently accessing the information contained in non-linear and high dimensional probability distributions remains a core challenge in modern statistics. Traditionally, estimators that go beyond point estimates are either categorized as Variational Inference (VI) or Markov-Chain Monte-Carlo (MCMC) techniques. While MCMC methods that utilize the geometric properties of continuous probability distributions to increase their efficiency have been proposed, VI methods rarely use the geometry. This work aims to fill this gap and proposes geometric Variational Inference (geoVI), a method based on Riemannian geometry and the Fisher information metric. It is used to construct a coordinate transformation that relates the Riemannian manifold associated with the metric to Euclidean space. The distribution, expressed in the coordinate system induced by the transformation, takes a particularly simple form that allows for an accurate variational approximation by a normal distribution. Furthermore, the algorithmic structure allows for an efficient implementation of geoVI which is demonstrated on multiple examples, ranging from low-dimensional illustrative ones to non-linear, hierarchical Bayesian inverse problems in thousands of dimensions.
Keywords: variational methods; Bayesian inference; Fisher information metric; Riemann manifolds
1. Introduction

In modern statistical inference and machine learning it is of utmost importance to access the information contained in complex and high dimensional probability distributions. In particular in Bayesian inference, it remains one of the key challenges to approximate samples from the posterior distribution, or the distribution itself, in a computationally fast and accurate way. Traditionally, there have been two distinct approaches towards this problem: the direct construction of posterior samples based on Markov Chain Monte-Carlo (MCMC) methods [1–3], and the attempt to approximate the probability distribution with a different one, chosen from a family of simpler distributions, known as variational inference (VI) [4–7] or variational Bayes (VB) methods [8–10]. While MCMC methods are attractive due to their theoretical guarantees to reproduce the true distribution in the limit, they tend to be more expensive compared to variational alternatives. On the other hand, the family of distributions used in VI is typically chosen ad hoc. While VI aims to provide an appropriate approximation within the chosen family, the entire family may be a poor approximation to the true distribution.

In recent years, MCMC methods have been improved by incorporating geometric information of the posterior, especially by means of Riemannian manifold Hamilton Monte-Carlo (RMHMC) [11], a particular hybrid Monte-Carlo (HMC) [12,13] technique that constructs a Hamiltonian system on a Riemannian manifold with a metric tensor related to the Fisher information metric of the likelihood distribution and the curvature of the prior. For VI methods, however, the geometric structure of the true distribution has rarely been utilized to motivate and enhance the family of distributions used during optimization. One of the few examples is [14], where the Fisher metric has been used to reformulate the task of VI by means of α-divergences in the mean-field setting.
In addition, Metric Gaussian Variational Inference (MGVI) [15] is a powerful variational approximation technique for the family of normal distributions that utilizes infinitesimal geometric properties of the posterior. In MGVI the family is parameterized in terms of the mean m, and the covariance matrix is set to the inverse of the metric tensor evaluated at m. This choice ensures that the true distribution and the approximation share the same geometric properties infinitesimally, i.e., at the location of the mean m. In this work we extend the geometric correspondence used by MGVI to be valid not only at m, but also in a local neighborhood of m. We achieve this extension by means of an invertible coordinate transformation from the coordinate system used within MGVI, in which the curvature of the prior is the identity, to a new coordinate system in which the metric of the posterior becomes (approximately) the Euclidean metric. We use a normal distribution in these coordinates as the approximation to the true distribution and thereby establish a non-Gaussian posterior approximation in the MGVI coordinate system. The resulting algorithm, called geometric Variational Inference (geoVI), can be computed efficiently and is inherently similar to the implementation of MGVI. This is not by mere coincidence: to linear order, geoVI reproduces MGVI. In this sense, the geoVI algorithm is a non-linear generalization of MGVI that captures the geometric properties encoded in the posterior metric not only infinitesimally, but also in a local neighborhood of the mean. We include an implementation of the proposed geoVI algorithm in the software package Numerical Information Field Theory (NIFTy [16]), a versatile library for signal inference algorithms.
Mathematical Setup

Throughout this work, we consider the joint distribution P(d, s) of observational data d ∈ Ω and the unknown signal s that is to be inferred. This distribution is factorized into the likelihood of observing the data, given the signal, P(d|s), and the prior distribution P(s). In general, only a subset of the signal, denoted as s′, may be directly constrained by the likelihood, such that P(d|s) = P(d|s′), and therefore there may be additional hidden variables in s that are unobserved by the data, but part of the prior model. Thus, the prior distribution P(s) may possess a hierarchical structure that summarizes our knowledge about the system prior to the measurement, and s represents everything in the system that is of interest to us, but about which our knowledge is uncertain a priori. We do not put any constraints on the functional form of P(s), and assume that the signal s solely consists of continuous real valued variables, i.e., s ∈ X ⊂ R^M. This enables us to regard s as coordinates of the space on which P(s) is defined and to use geometric concepts such as coordinate transformations to represent probability distributions in different coordinate systems. Probability densities transform in a probability mass preserving fashion. Specifically, let f : R^M → X be an invertible function, and let s = f(ξ). Then the distributions P(s) and P(ξ) relate via
$$\int P(s)\,\mathrm{d}s = \int P(\xi)\,\mathrm{d}\xi \; . \qquad (1)$$

This allows us to express P(s) by means of the pushforward of P(ξ) by f. We denote the pushforward as
$$P(s) = (f_\star P(\xi))(s) = \int \delta(s - f(\xi))\, P(\xi)\,\mathrm{d}\xi = \left.\left( \left|\frac{\mathrm{d}f}{\mathrm{d}\xi}\right|^{-1} P(\xi) \right)\right|_{\xi = f^{-1}(s)} \; . \qquad (2)$$
Under mild regularity conditions on the prior distribution, there always exists an f that relates the complex hierarchical form of P(s) to a simple distribution P(ξ) [17]. We choose f such that P(ξ) takes the form of a normal distribution with zero mean and unit covariance and call such a distribution a standard distribution:
P(ξ) = N(ξ; 0, 1) , (3)
where N (ξ; m, D) denotes a multivariate normal distribution in the random variables ξ with mean m and covariance D. We may express the likelihood in terms of ξ as
P(d|ξ) ≡ P(d|s′ = f′(ξ)) , (4)
where f′ is the part of f that maps onto the observed quantities s′. In general, f′ is a non-invertible function and is commonly referred to as the generative model or generative process, as it encodes all the information necessary to transform a standard distribution into the observed quantities, subject to our prior beliefs. Using Equation (4), we get by means of Bayes' theorem that the posterior takes the form
$$P(\xi|d) = \frac{P(\xi, d)}{P(d)} = \frac{P(d|\xi)\,\mathcal{N}(\xi; 0, \mathbb{1})}{P(d)} \; . \qquad (5)$$
Using the push-forward of the posterior, we can recover the posterior statistics of s via
P(s|d) = (f⋆ P(ξ|d))(s) , (6)
which means that we can fully recover the posterior properties of s, which typically has a physical interpretation as opposed to ξ. In particular Equation (6) implies that if we are able to draw samples from P(ξ|d) we can simply generate posterior samples for s since s = f (ξ).
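To make this setup concrete, the following minimal sketch (an illustrative log-normal generative model f with hypothetical parameter values, not taken from the paper) shows how standardized coordinates ξ ∼ N(0, 1) are pushed forward to the signal s = f(ξ); posterior samples of s are obtained the same way once samples of ξ from P(ξ|d) are available.

```python
import numpy as np

rng = np.random.default_rng(42)

def f(xi, sigma_p=1.0):
    """Illustrative generative model: maps standardized coordinates xi
    to a log-normal signal s = exp(sigma_p * xi)."""
    return np.exp(sigma_p * xi)

# Draw prior samples in the standardized coordinates xi ~ N(0, 1) ...
xi_prior = rng.standard_normal(size=5)
# ... and push them forward to obtain prior samples of the signal s.
s_prior = f(xi_prior)

# Equation (6): posterior samples of s follow by applying f to samples of
# xi drawn from P(xi|d), e.g., as produced by the geoVI procedure below.
print(s_prior)
```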
2. Geometric Properties of Posterior Distributions

In order to access the information contained in the posterior distribution P(ξ|d), in this work we wish to exploit the geometric properties of the posterior, in particular with the help of Riemannian geometry. Specifically, we define a Riemannian manifold using a metric tensor, related to the Fisher information metric of the likelihood and a metric for the prior, and establish a (local) isometry of this manifold to Euclidean space. The associated coordinate transformation gives rise to a coordinate system in which, hopefully, the posterior takes a simplified form, despite the fact that probabilities do not transform in the same way as metric spaces do. As we will see, in cases where the isometry is global, and in addition the transformation is (almost) volume-preserving, the complexity of the posterior distribution can be absorbed (almost) entirely into this transformation.

To begin our discussion, we need to define an appropriate metric for posterior distributions. To this end, consider the negative logarithm of the posterior, sometimes also referred to as the information Hamiltonian, which takes the form
H(ξ|d) ≡ − log(P(ξ|d)) = H(d|ξ) + H(ξ) − H(d) . (7)
A common choice to extract geometric information from this Hamiltonian is the Hessian C of H. Specifically
$$C(\xi) \equiv \frac{\partial^2 H(\xi|d)}{\partial\xi\,\partial\xi'} = \frac{\partial^2 H(d|\xi)}{\partial\xi\,\partial\xi'} + \mathbb{1} \equiv C_{d|\xi}(\xi) + \mathbb{1} \; , \qquad (8)$$
where the identity matrix arises from the curvature of the prior (information Hamiltonian). While C provides information about the local geometry, it turns out to be unsuited for our approach to construct a coordinate transformation, as it is not guaranteed to be positive definite for all ξ. An alternative, positive definite, measure for the curvature can be obtained by replacing the Hessian of the likelihood with its Fisher information metric [18], defined as
$$M_{d|\xi}(\xi) = \left\langle \frac{\partial H(d|\xi)}{\partial\xi}\,\frac{\partial H(d|\xi)}{\partial\xi'} \right\rangle_{P(d|\xi)} = \left\langle \frac{\partial^2 H(d|\xi)}{\partial\xi\,\partial\xi'} \right\rangle_{P(d|\xi)} = \left\langle C_{d|\xi}(\xi) \right\rangle_{P(d|\xi)} \; . \qquad (9)$$
The Fisher information metric can be understood as a Riemannian metric defined over the statistical manifold associated with the likelihood [19], and is a core element in the field of information geometry [20] as it provides a distance measure between probability distributions [21]. Replacing Cd|ξ with Md|ξ in Equation (8) we find
$$M(\xi) \equiv M_{d|\xi}(\xi) + \mathbb{1} = \left\langle C_{d|\xi}(\xi) \right\rangle_{P(d|\xi)} + \mathbb{1} = \left\langle C(\xi) \right\rangle_{P(d|\xi)} \; , \qquad (10)$$

which, from now on, we refer to as the metric M. As the Fisher metric of the likelihood is a symmetric, positive-semidefinite matrix, we get that M is a symmetric, positive-definite matrix for all ξ. It is noteworthy that, upon insertion, we find that the metric M is defined as the expectation value of the Hessian of the posterior Hamiltonian C w.r.t. the likelihood P(d|ξ). Therefore, in some way, we may regard M as the measure for the curvature in case the observed data d is unknown, and the only information given is the structure of the model itself, as encoded in P(d|ξ). This connection is only of qualitative nature, but it highlights a key limitation of M when used as the defining property of the posterior geometry. From a Bayesian perspective, only the data d that is actually observed is of relevance, as the posterior is conditioned on d. Therefore a curvature measure that arises from marginalization over the data must be sub-optimal compared to a measure conditioned on the data, as it ignores the local information that we gain from observing d. Nevertheless, we find that in many practical applications M encodes enough relevant information about the posterior geometry that it provides us with a valuable metric to construct a coordinate transformation. It is noteworthy that attempts have been made to resolve this issue via a more direct approach that recovers a positive definite matrix from the Hessian of the posterior while retaining the local information of the data; e.g., in [22], the SoftAbs non-linearity is applied to the Hessian and the resulting positive definite matrix is used as a curvature measure. In our practical applications, however, we are particularly interested in solving very high dimensional problems, and applying a matrix non-linearity is currently too expensive to give rise to a scalable algorithm for our purposes. Therefore we rely on the metric M as a measure for the curvature of the posterior, and leave possible extensions to future research.
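As a concrete, hedged illustration of Equations (9) and (10) (a standard computation for Gaussian noise, sketched here rather than reproduced from the paper's Appendix A): for a likelihood P(d|ξ) = N(d; s(ξ), N) with fixed noise covariance N, the Fisher information metric becomes

$$H(d|\xi) = \tfrac{1}{2}\big(d - s(\xi)\big)^T N^{-1} \big(d - s(\xi)\big) + \mathrm{const.} \quad\Rightarrow\quad M_{d|\xi}(\xi) = \left(\frac{\partial s}{\partial\xi}\right)^T N^{-1}\,\frac{\partial s}{\partial\xi} \; ,$$

so that M(ξ) = (∂s/∂ξ)ᵀ N⁻¹ (∂s/∂ξ) + 1. Choosing x(ξ) = N^{-1/2} s(ξ) then already provides a decomposition of the form used in Section 2.1 below.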
2.1. Coordinate Transformation

Our goal is to construct a coordinate system y and an associated transformation g that maps from ξ to y, in which the posterior metric M takes the form of the identity matrix 1. The motivation is that if M captures the geometric properties of the posterior, a coordinate system in which this metric becomes trivial should also be a coordinate system in which the posterior distribution takes a particularly simple form. For an illustrative example see Figure 1. To do so, we require the Fisher metric of the likelihood M_{d|ξ} to be the pullback of the Euclidean metric. Specifically we propose a function x(ξ) such that
$$M_{d|\xi} \overset{!}{=} \left(\frac{\partial x}{\partial\xi}\right)^T \frac{\partial x}{\partial\xi} \; , \qquad (11)$$

where T denotes the adjoint of a matrix. As outlined in Appendix A, for many practically relevant likelihoods such a decomposition is possible by means of an inexpensive to evaluate function x (here with "inexpensive" we mean that applying the function x(ξ) has a similar computational cost compared to applying the likelihood function P(d|ξ) to a specific ξ). Given x, we can rewrite the posterior metric M as
$$M = \left(\frac{\partial x}{\partial\xi}\right)^T \frac{\partial x}{\partial\xi} + \mathbb{1} \; . \qquad (12)$$

In order to relate this metric to Euclidean space, we aim to find the isometry g that relates the Riemannian manifold associated with the metric M to Euclidean space. Specifically we seek to find an invertible function g satisfying
$$M(\xi) = \left(\frac{\partial x}{\partial\xi}\right)^T \frac{\partial x}{\partial\xi} + \mathbb{1} \overset{!}{=} \left(\frac{\partial g}{\partial\xi}\right)^T \frac{\partial g}{\partial\xi} \; . \qquad (13)$$

In general, i.e., for a general function x(ξ), however, this decomposition does not exist globally. Nevertheless, there exists a transformation g(ξ; ξ̄) based on an approximative Taylor series around an expansion point ξ̄, that results in a metric M̃(ξ) such that
$$M(\xi) \approx \tilde{M}(\xi) \equiv \left(\frac{\partial g(\xi;\bar\xi)}{\partial\xi}\right)^T \frac{\partial g(\xi;\bar\xi)}{\partial\xi} \; , \qquad (14)$$

in the vicinity of ξ̄. This transformation g can be obtained up to an integration constant by Taylor expanding Equation (14) around ξ̄ and solving for the Taylor coefficients of g in increasing order. We express g in terms of its Taylor series using the Einstein sum convention as
$$g^i(\xi;\bar\xi) = \bar g^i + \bar g^i_{,j}\,\big(\xi - \bar\xi\big)^j + \bar g^i_{,jk}\,\big(\xi - \bar\xi\big)^j \big(\xi - \bar\xi\big)^k + \ldots \; , \qquad (15)$$

where repeated indices are summed over, $a_{,i}$ denotes the partial derivative of a w.r.t. the i-th component of ξ, and $\bar s$ denotes a (tensor) field s(ξ) evaluated at the expansion point ξ̄. We begin to expand Equation (14) around ξ̄ and obtain for the zeroth order
$$\bar M_{ij} \equiv \bar x^\alpha_{,i}\,\bar x^\alpha_{,j} + \delta_{ij} \overset{!}{=} \bar g^\alpha_{,i}\,\bar g^\alpha_{,j} \; . \qquad (16)$$
Expanding Equation (14) to first order yields
$$\bar x^\alpha_{,ik}\,\bar x^\alpha_{,j} + \bar x^\alpha_{,i}\,\bar x^\alpha_{,jk} \overset{!}{=} \bar g^\alpha_{,ik}\,\bar g^\alpha_{,j} + \bar g^\alpha_{,i}\,\bar g^\alpha_{,jk} \; , \qquad (17)$$
and therefore

$$\bar g^i_{,jk} = \bar M^{i\gamma}\,\bar g^\beta_{,\gamma}\,\bar x^\alpha_{,\beta}\,\bar x^\alpha_{,jk} \; , \qquad (18)$$

where $\bar M^{ij} = \big(\bar M^{-1}\big)_{ij}$ denotes the components of the inverse of $\bar M$. Thus, to first order in the metric (meaning to second order in the transformation) the expansion remains solvable for a general x. Proceeding with the second order, however, we get that
$$\bar x^\alpha_{,ikl}\,\bar x^\alpha_{,j} + \bar x^\alpha_{,ik}\,\bar x^\alpha_{,jl} + \bar x^\alpha_{,il}\,\bar x^\alpha_{,jk} + \bar x^\alpha_{,i}\,\bar x^\alpha_{,jkl} \overset{!}{=} \bar g^\alpha_{,ikl}\,\bar g^\alpha_{,j} + \bar g^\alpha_{,ik}\,\bar g^\alpha_{,jl} + \bar g^\alpha_{,il}\,\bar g^\alpha_{,jk} + \bar g^\alpha_{,i}\,\bar g^\alpha_{,jkl} \; , \qquad (19)$$
which does not exhibit a general solution for $\bar g^i_{,jkl}$ in higher dimensions, due to the fact that the third derivative has to be invariant under arbitrary permutations of the latter three indices jkl. However, in analogy to Equation (18), we may set
$$\bar g^i_{,jkl} = \bar M^{i\gamma}\,\bar g^\beta_{,\gamma}\,\bar x^\alpha_{,\beta}\,\bar x^\alpha_{,jkl} \; , \qquad (20)$$
which cancels the first and the last term of Equation (19), and study the remaining error which takes the form
$$
\begin{aligned}
e_{ijkl} &= \bar x^\alpha_{,ik}\,\bar x^\alpha_{,jl} + \bar x^\alpha_{,il}\,\bar x^\alpha_{,jk} - \bar g^\alpha_{,ik}\,\bar g^\alpha_{,jl} - \bar g^\alpha_{,il}\,\bar g^\alpha_{,jk} \\
&= \bar x^\alpha_{,ik}\,\bar x^\alpha_{,jl} + \bar x^\alpha_{,il}\,\bar x^\alpha_{,jk} - \bar x^\alpha_{,ik}\,\bar x^\gamma_{,\alpha}\,\bar M^{\gamma\delta}\,\bar x^\delta_{,\beta}\,\bar x^\beta_{,jl} - \bar x^\alpha_{,il}\,\bar x^\gamma_{,\alpha}\,\bar M^{\gamma\delta}\,\bar x^\delta_{,\beta}\,\bar x^\beta_{,jk} \\
&= \bar x^\alpha_{,ik}\left(\delta_{\alpha\beta} - \bar x^\gamma_{,\alpha}\,\bar M^{\gamma\delta}\,\bar x^\delta_{,\beta}\right)\bar x^\beta_{,jl} + \bar x^\alpha_{,il}\left(\delta_{\alpha\beta} - \bar x^\gamma_{,\alpha}\,\bar M^{\gamma\delta}\,\bar x^\delta_{,\beta}\right)\bar x^\beta_{,jk} \; .
\end{aligned}
\qquad (21)
$$

Let $X^i_{\ j} \equiv \bar x^i_{,j}$; the expression in the parentheses takes the form
$$\mathbb{1} - X \bar M^{-1} X^T = \mathbb{1} - X\left(\mathbb{1} + X^T X\right)^{-1} X^T = \left(\mathbb{1} + X X^T\right)^{-1} \equiv \mathcal{M} \; , \qquad (22)$$
and thus Equation (21) reduces to
$$e_{ijkl} = \bar x^\alpha_{,ik}\,\mathcal{M}_{\alpha\beta}\,\bar x^\beta_{,jl} + \bar x^\alpha_{,il}\,\mathcal{M}_{\alpha\beta}\,\bar x^\beta_{,jk} \; . \qquad (23)$$
The impact of this error contribution can be qualitatively studied using the spectrum λ(ℳ) of the matrix ℳ. This spectrum may exhibit two extreme cases, a so-called likelihood dominated regime, where the spectrum λ(XXᵀ) ≫ 1, and a prior dominated regime where λ(XXᵀ) ≪ 1. In the likelihood dominated regime, we get that λ(ℳ) ≪ 1 and thus the contribution of the error is small, whereas in the prior dominated regime λ(ℳ) ≈ 1, which yields an O(1) error. However, in the prior dominated regime, the entire metric M is close to the identity as we are in the standard coordinate system of the prior ξ and therefore higher order derivatives of x are small. As a consequence, the error is of the order O(1) only in regimes where the third (and higher) order of the expansion is negligible compared to the first and second order. An exception occurs when the expansion point is close to a saddle point of x, i.e., in cases where the first derivative of x becomes small (and therefore the metric is close to the identity), but higher order derivatives of x may be large. For the moment, we proceed under the assumption that the change of x, as a function of ξ, is sufficiently monotonic throughout the expansion regime. We discuss the implications of violating this assumption in Section 5.3.
[Figure 1: left panel shows the non-linear posterior P(ξ|d) in the prior coordinates (ξ1, ξ2) together with the expansion point ξ̄ and the back-transformed coordinate lines g⁻¹(y; ξ̄); right panel shows the transformed distribution P(y|d) = [g⋆P(ξ|d)](y) in the coordinates y = g(ξ; ξ̄). See the caption below.]
Figure 1. Non-linear posterior distribution P(ξ|d) in the standard coordinate system of the prior distribution ξ (left) and the transformed distribution P(y|d) (right) in the coordinate system y where the posterior metric becomes (approximately) the identity matrix. P(y|d) is obtained from P(ξ|d) via the push-forward through the transformation g which relates the two coordinate systems. The functional form of g is derived in Section 2.1 and depends on an expansion point ξ̄ (orange dot in the left image), and g is set up such that ξ̄ coincides with the origin in y. To visualize the transformation, the coordinate lines of y (black mesh grid on the right) are transformed back into ξ-coordinates using the inverse coordinate transformation g⁻¹ and are displayed as a black mesh in the original space on the left. In addition, note that while the transformed posterior P(y|d) arguably takes a simpler form compared to P(ξ|d), it does not become trivial (e.g., identical to a standard distribution) as there remain small asymmetries in the posterior density. There are multiple reasons for these deviations which are discussed in more detail in Section 2.2, once we have established how the transformation g is constructed.
If we proceed to higher order expansions of Equation (14), we notice that a repetitive picture emerges: the leading order derivative tensor $\bar g^i_{,j\ldots}$ that appears in the expansion may be set in direct analogy to Equations (18) and (20) as
$$\bar g^i_{,j\ldots} = \bar M^{i\gamma}\,\bar g^\beta_{,\gamma}\,\bar x^\alpha_{,\beta}\,\bar x^\alpha_{,j\ldots} \; , \qquad (24)$$
where ... denotes the higher order derivatives. The remaining error contributions at each order take a similar form as in Equation (23), where the matrix ℳ reappears in between all possible combinations of the remaining derivatives of x that appear using the Leibniz rule. Note that for increasing order, the number of terms that contribute to the error also increases. Specifically, for the n-th order expansion of Equation (14) we get $m = \sum_{k=1}^{n-1}\binom{n}{k}$ contributions to the error. Therefore, even if each individual contribution by means of ℳ is small, the expansion error eventually becomes large once high order expansions become relevant. Therefore the proposed approximation only remains locally valid around ξ̄. Nevertheless, we may proceed to combine the derivative tensors of g determined above in order to get the Jacobian of the transformation g as
$$
\begin{aligned}
g^i_{,j}(\xi) &\equiv \bar g^i_{,j} + \bar g^i_{,jk}\,\big(\xi - \bar\xi\big)^k + \tfrac{1}{2}\,\bar g^i_{,jkl}\,\big(\xi - \bar\xi\big)^k \big(\xi - \bar\xi\big)^l + \ldots \\
&= \bar g^i_{,j} + \bar M^{i\alpha}\,\bar g^\beta_{,\alpha}\,\bar x^\gamma_{,\beta}\left( \bar x^\gamma_{,jk}\,\big(\xi - \bar\xi\big)^k + \tfrac{1}{2}\,\bar x^\gamma_{,jkl}\,\big(\xi - \bar\xi\big)^k \big(\xi - \bar\xi\big)^l + \ldots \right) \; ,
\end{aligned}
\qquad (25)
$$

or equivalently
$$\bar g^\alpha_{,i}\, g^\alpha_{,j}(\xi) = \delta_{ij} + \bar x^\alpha_{,i}\left( \bar x^\alpha_{,j} + \bar x^\alpha_{,jk}\,\big(\xi - \bar\xi\big)^k + \tfrac{1}{2}\,\bar x^\alpha_{,jkl}\,\big(\xi - \bar\xi\big)^k \big(\xi - \bar\xi\big)^l + \ldots \right) \; . \qquad (26)$$

From the zeroth order, Equation (16), we get that $\bar g^i_{,j} = \big(\sqrt{\bar M}\big)^i_{\ j}$ up to a unitary transformation, and we can sum up the Taylor series in x of Equation (26) to arrive at an index-free representation of the Jacobian as
$$\frac{\partial g}{\partial\xi} = \sqrt{\bar M}^{\,-1}\left( \mathbb{1} + \left.\left(\frac{\partial x}{\partial\xi}\right)^T\right|_{\bar\xi} \frac{\partial x}{\partial\xi} \right) \; . \qquad (27)$$
Upon integration, this yields a transformation
$$g(\xi) - g(\bar\xi) = \sqrt{\bar M}^{\,-1}\left( \big(\xi - \bar\xi\big) + \left.\left(\frac{\partial x}{\partial\xi}\right)^T\right|_{\bar\xi} \big( x(\xi) - x(\bar\xi) \big) \right) \; . \qquad (28)$$

The resulting transformation takes an intuitive form: the approximation to the distance between a point g(ξ) and the transformed expansion point g(ξ̄) consists of the distance w.r.t. the prior measure (ξ − ξ̄) and the distance w.r.t. the likelihood measure (x(ξ) − x(ξ̄)), back-projected into the prior domain using the local transformation at ξ̄. Finally, the metric at ξ̄ is used as a measure for the local curvature. Equation (28) is only defined up to an integration constant, and therefore, without loss of generality, we may set g(ξ̄) = 0 to obtain the final approximative coordinate transformation as
$$g(\xi;\bar\xi) = \sqrt{\bar M}^{\,-1}\left( \big(\xi - \bar\xi\big) + \left.\left(\frac{\partial x}{\partial\xi}\right)^T\right|_{\bar\xi} \big( x(\xi) - x(\bar\xi) \big) \right) \equiv \sqrt{\bar M}^{\,-1}\,\tilde g(\xi;\bar\xi) \; . \qquad (29)$$
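In one dimension the transformation of Equation (29) can be written down explicitly. The following sketch (an illustrative scalar log-normal-type x(ξ) with hypothetical parameter values, not taken from the paper) evaluates g̃(ξ; ξ̄) and g(ξ; ξ̄) and checks the defining properties at the expansion point:

```python
import numpy as np

sigma_p, sigma_n = 1.0, 0.1  # illustrative prior/noise scales (assumptions)

def x(xi):
    """x(xi) such that the Fisher metric is (dx/dxi)^2 in this scalar case."""
    return np.exp(sigma_p * xi) / sigma_n

def dx(xi):
    """Derivative of x w.r.t. xi."""
    return sigma_p * np.exp(sigma_p * xi) / sigma_n

def g(xi, xi_bar):
    """Approximate coordinate transformation g(xi; xi_bar), Equation (29)."""
    M_bar = dx(xi_bar) ** 2 + 1.0                                # metric at xi_bar
    g_tilde = (xi - xi_bar) + dx(xi_bar) * (x(xi) - x(xi_bar))   # Equation (29)
    return g_tilde / np.sqrt(M_bar)                              # sqrt(M_bar)^{-1} g_tilde

# g vanishes at the expansion point by construction ...
assert np.isclose(g(0.3, 0.3), 0.0)
# ... and its derivative there equals sqrt(M_bar), so the pulled-back metric
# (dg/dxi)^2 matches M(xi_bar), i.e., becomes the identity in y at xi_bar.
eps = 1e-6
print((g(0.3 + eps, 0.3) - g(0.3, 0.3)) / eps, np.sqrt(dx(0.3) ** 2 + 1.0))
```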
2.2. Basic Properties

In order to study a few basic properties of this transformation, for simplicity, we first consider a posterior distribution with a metric that allows for an exact isometry g_iso to
Euclidean space. Specifically let giso be a coordinate transformation satisfying Equation (13). The posterior distribution in coordinates y = giso(ξ) is given via the push-forward of P(ξ|d) through giso as
$$P(y|d) \propto (g_{\mathrm{iso}\,\star} P(\xi|d))(y) = \left.\left( \left|\frac{\partial g_\mathrm{iso}}{\partial\xi}\right|^{-1} P(\xi|d) \right)\right|_{\xi = g_\mathrm{iso}^{-1}(y)} = \left.\frac{P(\xi|d)}{\sqrt{|M(\xi)|}}\right|_{\xi = g_\mathrm{iso}^{-1}(y)} \; , \qquad (30)$$

and the information Hamiltonian takes the form
$$H(y|d) = \left.\left( H(\xi|d) + \tfrac{1}{2}\log(|M(\xi)|) \right)\right|_{\xi = g_\mathrm{iso}^{-1}(y)} + H_0 \equiv \tilde H\big(\xi = g_\mathrm{iso}^{-1}(y)\big) + H_0 \; , \qquad (31)$$
where H_0 denotes y-independent contributions. We may study the curvature of the posterior in coordinates y, given as:
$$C(y) = \frac{\partial\xi}{\partial y}\,\frac{\partial^2 \tilde H(\xi)}{\partial\xi\,\partial\xi'}\left(\frac{\partial\xi'}{\partial y'}\right)^T + \frac{\partial \tilde H(\xi)}{\partial\xi}\,\frac{\partial^2 \xi}{\partial y\,\partial y'} \qquad \text{with} \qquad \xi = g_\mathrm{iso}^{-1}(y) \;\; \text{and} \;\; \xi' = g_\mathrm{iso}^{-1}(y') \; , \qquad (32)$$
which we can use to construct a metric M(y) in analogy to Equation (10) by taking the expectation value of the curvature w.r.t. the likelihood. This yields
$$
\begin{aligned}
M(y) &= \frac{\partial\xi}{\partial y}\, M(\xi) \left(\frac{\partial\xi'}{\partial y'}\right)^T + \left\langle \frac{1}{2}\,\frac{\partial\xi}{\partial y}\,\frac{\partial^2 \log(|M(\xi)|)}{\partial\xi\,\partial\xi'}\left(\frac{\partial\xi'}{\partial y'}\right)^T + \frac{\partial \tilde H(\xi)}{\partial\xi}\,\frac{\partial^2 \xi}{\partial y\,\partial y'} \right\rangle_{P(d|\xi)} \\
&\equiv \frac{\partial\xi}{\partial y}\, M(\xi) \left(\frac{\partial\xi'}{\partial y'}\right)^T + R(y) = \mathbb{1} + R(y) \; .
\end{aligned}
\qquad (33)
$$
The first term yields the identity, as it is the defining property of g_iso. Furthermore, in case we are able to say that R(y) is small compared to the identity, we notice that the quantity M(ξ) (Equation (10)), which we referred to as the posterior metric, approximately transforms like a proper metric under g_iso. In this case we find that the isometry g_iso between the Riemannian manifold associated with M(ξ) and the Euclidean space is also a transformation that removes the complex geometry of the posterior. To further study R(y), we consider its two contributions separately, where for the first part, the log-determinant (or logarithmic volume), we get that it becomes small compared to the identity if
$$\frac{1}{2}\,\frac{\partial^2 \log(|M(\xi)|)}{\partial\xi\,\partial\xi'} \overset{!}{\ll} M(\xi) \; . \qquad (34)$$

Therefore, the curvature of the log-determinant of the metric has to be much smaller than the metric itself. To study the second term of R(y), we may again split the discussion into a prior and a likelihood dominated regime, depending on the ξ at which we evaluate the expression. In a prior dominated regime we get that
$$\frac{\partial^2 \xi}{\partial y\,\partial y'} \approx 0 \; , \qquad (35)$$
as the metric is close to the identity in this regime (and therefore ξ ≈ y). In a likelihood dominated regime we get that H˜ ≈ H(d|ξ) and therefore
$$\left\langle \frac{\partial \tilde H(\xi)}{\partial\xi} \right\rangle_{P(d|\xi)} \approx \left\langle \frac{\partial H(d|\xi)}{\partial\xi} \right\rangle_{P(d|\xi)} = \left\langle -\frac{1}{P(d|\xi)}\,\frac{\partial P(d|\xi)}{\partial\xi} \right\rangle_{P(d|\xi)} = 0 \; . \qquad (36)$$

So at least in a prior dominated regime, as well as a likelihood dominated regime, the posterior Hamiltonian transforms in an analogous way as the manifold, under the transformation g_iso, if Equation (34) also holds true.
For a practical application, however, in all but the simplest cases the isometry giso is not accessible, or might not even exist. Therefore, in general, we have to use the approximation g(ξ; ξ¯), as defined in Equation (29), instead. We may express the transformation of the metric using g(ξ; ξ¯), and find that
$$M(y) = \sqrt{\bar M}\left( \mathbb{1} + \left(\frac{\partial x}{\partial\xi}\right)^T \left.\frac{\partial x}{\partial\xi}\right|_{\bar\xi} \right)^{-1}\left( \mathbb{1} + \left(\frac{\partial x}{\partial\xi}\right)^T \frac{\partial x'}{\partial\xi'} \right)\left( \mathbb{1} + \left.\left(\frac{\partial x}{\partial\xi}\right)^T\right|_{\bar\xi} \frac{\partial x'}{\partial\xi'} \right)^{-1}\sqrt{\bar M} + \tilde R(y) \; , \qquad (37)$$
now with ξ = g⁻¹(y; ξ̄) and analogously for ξ′. R̃ is defined by replacing g_iso with g in the entire expression for R. We notice that this transformation does not yield the identity, except when evaluated at the expansion point ξ = ξ̄. Therefore, in addition to the error R̃, there is a deviation from the identity related to the expansion error as one moves away from ξ̄. At this point we would like to emphasize that the posterior Hamiltonian H and the Riemannian manifold constructed from M are only loosely connected due to the errors described by R̃ and the additional expansion error. They are arguably small in many cases and in the vicinity of ξ̄, but we do not want to claim that this correspondence is valid in general (see Section 5.3). Nevertheless, we find that in many cases this correspondence works well in practice. Some illustrative examples are given in Section 3.1.2.
3. Posterior Approximation

Using the derived coordinate transformation for posterior approximation rests mainly on the idea that, in the transformed coordinate system, the posterior takes a simpler form. In particular, we aim to remove parts (if not most) of the complex geometry of the posterior, such that a simple probability distribution, e.g., a Gaussian distribution, yields a good approximation.
3.1. Direct Approximation

Assuming that all the errors discussed in the previous section are small enough, we may attempt to directly approximate the posterior distribution via a unit Gaussian in the coordinates y, as in this case the transformed metric M(y) is close to the identity. As the coordinate transformation g, defined via Equation (29), is only known up to an integration constant by construction, the posterior approximation is achieved by a shifted unit Gaussian in y. This shift needs to be determined, which we can do by maximizing the transformed posterior distribution
$$P(y|d) \propto \left.\left( \left|\frac{\partial g}{\partial\xi}\right|^{-1} P(\xi, d) \right)\right|_{\xi = g^{-1}(y;\bar\xi)} \; , \qquad (38)$$
w.r.t. y. Here g−1(y; ξ¯) denotes the inverse of g(ξ; ξ¯) w.r.t. its first argument. Equivalently we can minimize the information Hamiltonian H(y|d), defined as
$$H(y|d) \equiv -\log(P(y|d)) = \left.\left( H(\xi, d) + \tfrac{1}{2}\log\big|\tilde M\big| \right)\right|_{\xi = g^{-1}(y;\bar\xi)} \equiv \tilde H\big(\xi = g^{-1}(y;\bar\xi)\big) \; . \qquad (39)$$
Minimizing H(y|d) yields the maximum a posteriori solution y*, which, in case the posterior is close to a unit Gaussian in the coordinates y, can be identified with the shift in y. As g is an invertible function, we may instead minimize H̃ w.r.t. ξ and apply g to the result in order to obtain y*. Specifically
$$y^* \equiv \underset{y}{\mathrm{argmin}}\,\big(H(y|d)\big) = g\Big( \underset{\xi}{\mathrm{argmin}}\,\big(\tilde H(\xi)\big) \Big) \; . \qquad (40)$$
Therefore we can circumvent the inversion of g at any point during optimization. Now suppose that we use any gradient based optimization scheme to minimize for ξ, starting from some initial position ξ0. If we set the expansion point ξ¯, used to construct g, to be equal to ξ0, we notice that
$$\tilde M(\bar\xi) = M(\bar\xi) \qquad (41)$$
$$\left.\frac{\partial \tilde M}{\partial\xi}\right|_{\xi=\bar\xi} = \left.\frac{\partial M}{\partial\xi}\right|_{\xi=\bar\xi} \; , \qquad (42)$$
as the expansion of the metric is valid to first order by construction. Therefore, if we set the expansion point ξ̄ to the current estimate of ξ after every step, we can replace the approximated metric M̃ with the true metric M and arrive at an optimization objective of the form

$$\bar\xi = \underset{\xi}{\mathrm{argmin}}\left( H(\xi, d) + \tfrac{1}{2}\log(|M(\xi)|) \right) \; . \qquad (43)$$

Note that g(ξ̄; ξ̄) = 0 by construction, and therefore y* = 0, as there is a degeneracy between a shift in y and a change of the expansion point ξ̄. Once the optimal expansion point ξ̄ is found, we directly retrieve a generative process to sample from our approximation to the posterior distribution. Specifically
P(y|d) ≈ N(y; 0, 1) (44)
ξ = g⁻¹(y; ξ̄) , (45)
where g−1 is only implicitly defined using Equation (29) and therefore its inverse applica- tion has to be approximated numerically in general.
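In one dimension, the optimization of Equation (43) is particularly compact. The sketch below (an illustrative scalar log-normal model of the kind also used in Section 3.1.2, with hypothetical data and noise values) finds the expansion point ξ̄ by minimizing H(ξ, d) + ½ log|M(ξ)|:

```python
import numpy as np
from scipy.optimize import minimize_scalar

sigma_p, sigma_n, d = 1.0, 0.1, 1.5   # illustrative values (assumptions)

def s(xi):
    """Log-normal signal model s(xi) = exp(sigma_p * xi)."""
    return np.exp(sigma_p * xi)

def hamiltonian(xi):
    """H(xi, d) = -log P(d|xi) - log P(xi), up to xi-independent constants."""
    return 0.5 * ((d - s(xi)) / sigma_n) ** 2 + 0.5 * xi ** 2

def metric(xi):
    """Posterior metric M(xi) = (ds/dxi)^2 / sigma_n^2 + 1 (scalar case)."""
    return (sigma_p * s(xi) / sigma_n) ** 2 + 1.0

def objective(xi):
    # Equation (43): Hamiltonian plus half the log-determinant of the metric.
    return hamiltonian(xi) + 0.5 * np.log(metric(xi))

xi_bar = minimize_scalar(objective).x
print("expansion point:", xi_bar)
```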
3.1.1. Numerical Approximation to Sampling

Recall that

$$y = g(\xi;\bar\xi) = \sqrt{\bar M}^{\,-1}\,\tilde g(\xi;\bar\xi) \; . \qquad (46)$$

To generate a posterior sample for ξ we have to draw a random realization for y from a unit Gaussian, and then solve Equation (46) for ξ. To avoid the matrix square root of M̄, we may instead define

$$z \equiv \sqrt{\bar M}\, y = \tilde g(\xi;\bar\xi) \qquad \text{with} \qquad P(z) = \mathcal{N}(z; 0, \bar M) \; . \qquad (47)$$
Sampling from P(z) is much more convenient than constructing the matrix square root, since
$$\bar M = \mathbb{1} + \left.\left( \left(\frac{\partial x}{\partial\xi}\right)^T \frac{\partial x}{\partial\xi} \right)\right|_{\bar\xi} \; , \qquad (48)$$
and therefore a random realization may be generated using
$$z = \eta_1 + \left.\left(\frac{\partial x}{\partial\xi}\right)^T\right|_{\bar\xi} \eta_2 \qquad \text{with} \qquad \eta_i \sim \mathcal{N}(\eta_i; 0, \mathbb{1}) \; , \ i \in \{1, 2\} \; . \qquad (49)$$
Finally, a posterior sample ξ is retrieved by inversion of Equation (47). We numerically approximate the inversion by minimizing the squared difference between z and g̃(ξ). Specifically,

$$\xi = \underset{\xi}{\mathrm{argmin}}\ \tfrac{1}{2}\, (z - \tilde g(\xi))^T (z - \tilde g(\xi)) \; . \qquad (50)$$

Note that if g is invertible then also g̃ is invertible, as M̄ is a symmetric positive definite matrix. Therefore the quadratic form of Equation (50) has a unique global optimum at zero, which corresponds to the inverse of Equation (47).
In practice, this optimum is typically only reached approximately. For an efficient numerical approximation, throughout this work, we employ a second order quasi-Newton method, named NewtonCG [23], as implemented in the NIFTy framework. Within the NewtonCG algorithm, we utilize the metric M̃(ξ) as a positive-definite approximation to the curvature of the quadratic form in Equation (50). Furthermore, its inverse application, required for the second order optimization step of NewtonCG, is approximated with the conjugate gradient (CG) [24] method, which requires the metric to be only implicitly available via matrix-vector products. In addition, in practice we find that the initial position ξ⁰ of the minimization procedure can be set to be equal to the prior realization η₁ used to construct z (Equation (49)) in order to improve convergence, as ξ = η₁ is the solution of Equation (47) for all degrees of freedom unconstrained by the likelihood. Alternatively, for weakly non-linear problems, initializing ξ⁰ as the solution of the linearized problem
$$z = \tilde g(\xi;\bar\xi)\big|_{\xi=\bar\xi} + \left.\frac{\partial \tilde g}{\partial\xi}\right|_{\xi=\bar\xi}\,\big(\xi - \bar\xi\big) = \left.\left( \mathbb{1} + \left(\frac{\partial x}{\partial\xi}\right)^T \frac{\partial x}{\partial\xi} \right)\right|_{\xi=\bar\xi}\big(\xi - \bar\xi\big) \; , \qquad (51)$$
can significantly improve the convergence. The full realization of the sampling procedure is summarized in Algorithm 1.
Algorithm 1: Approximate posterior samples using inverse transformation

Function drawSample(Location ξ̄, Transformation x(ξ), Jacobian ∂x/∂ξ):
    A ← ∂x/∂ξ |_{ξ=ξ̄}
    η₁ ∼ N(η₁; 0, 1)
    η₂ ∼ N(η₂; 0, 1)
    z ← η₁ + Aᵀ η₂
    ξ⁰ ← η₁   or   ξ⁰ ← Solve(z = (1 + AᵀA)(ξ⁰ − ξ̄)) for ξ⁰   (see Equation (51))
    Function Energy(ξ):
        g̃ ← (ξ − ξ̄) + Aᵀ(x(ξ) − x(ξ̄))
        return ½ (z − g̃)ᵀ(z − g̃)
    ξ* ← NewtonCG(Energy, ξ⁰)
    return ξ*
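For small, dense problems, Algorithm 1 can be transcribed almost literally into NumPy/SciPy. The sketch below is only an illustration: the actual NIFTy implementation relies on implicit, matrix-free operators and its own NewtonCG, which are not reproduced here; a generic BFGS minimizer stands in for NewtonCG.

```python
import numpy as np
from scipy.optimize import minimize

def draw_sample(xi_bar, x_func, jac_func, rng):
    """Approximate posterior sample following Algorithm 1 (dense sketch).

    xi_bar   : expansion point (1d array)
    x_func   : transformation x(xi) from the Fisher metric decomposition
    jac_func : Jacobian dx/dxi as a dense (n_x, n_xi) array
    """
    A = jac_func(xi_bar)                          # Jacobian at the expansion point
    eta1 = rng.standard_normal(xi_bar.shape)      # eta_1 ~ N(0, 1)
    eta2 = rng.standard_normal(A.shape[0])        # eta_2 ~ N(0, 1)
    z = eta1 + A.T @ eta2                         # realization of N(0, M_bar)

    def energy(xi):
        # Squared residual between z and g_tilde(xi; xi_bar), Equation (50).
        g_tilde = (xi - xi_bar) + A.T @ (x_func(xi) - x_func(xi_bar))
        return 0.5 * np.sum((z - g_tilde) ** 2)

    # Initial position: the prior realization eta_1 (see Section 3.1.1).
    res = minimize(energy, x0=eta1, method="BFGS")
    return res.x
```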
3.1.2. Properties

We may qualitatively study some basic properties of the coordinate transformation and the associated approximation using illustrative one and two dimensional examples. To this end, consider a one dimensional log-normal prior model with zero mean and standard deviation σ_p of the form
s(ξ) = e^{σ_p ξ} with P(ξ) = N(ξ; 0, 1) , (52)
from which we obtain a measurement d subject to independent, additive Gaussian noise with standard deviation σn such that the likelihood takes the form
P(d|ξ) = N(d; s(ξ), σ_n²) . (53)
The posterior distribution is given as
P(ξ|d) ∝ P(d|ξ) P(ξ) = N(d; s(ξ), σ_n²) N(ξ; 0, 1) , (54)
and its metric takes the form
$$M(\xi) = \frac{1}{\sigma_n^2}\left(\frac{\partial s(\xi)}{\partial\xi}\right)^2 + 1 = \left(\frac{\sigma_p}{\sigma_n}\right)^2 e^{2\sigma_p\xi} + 1 \; . \qquad (55)$$
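The exact isometry g_iso used for the comparison in the following paragraph can be obtained numerically by integrating the square root of Equation (55); a minimal sketch with the σ_p = 3, σ_n = 0.3 values quoted below (the anchoring constant, i.e., where g_iso vanishes, is an arbitrary choice here):

```python
import numpy as np
from scipy.integrate import cumulative_trapezoid

sigma_p, sigma_n = 3.0, 0.3

xi_grid = np.linspace(-3.0, 3.0, 2001)
sqrt_M = np.sqrt((sigma_p / sigma_n) ** 2 * np.exp(2 * sigma_p * xi_grid) + 1.0)

# g_iso(xi) = integral of sqrt(M(xi)) d(xi); fixed here such that g_iso(-3) = 0.
g_iso = cumulative_trapezoid(sqrt_M, xi_grid, initial=0.0)
```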
In this one-dimensional example we can construct the exact transformation g_iso that maps from ξ to the transformed coordinates y by integrating the square root of Equation (55) over ξ. The resulting transformation can be seen in the central panel of Figure 2, for an example with σ_p = 3 and σ_n = 0.3 and measured data d = 0.5. In addition, we depict the approximated transformation g(ξ; ξ̄) for multiple expansion points ξ̄ ∈ {−1, −0.6, −0.2}. We see that the quality of the function approximation depends on the choice of the expansion point ξ̄, as the approximation error is smallest in the vicinity of ξ̄. In order to transform the posterior distribution P (Equation (54)) into the new coordinate system, not all parts of the transformation are equally relevant, and therefore different expansion points result in more or less complex transformed distributions (see top panel of Figure 2). Finally, if we use a standard distribution in the transformed coordinates y and transform it back using the inverse transformations g⁻¹(y; ξ̄), we find that the approximation quality of the resulting distributions Q_ξ̄ depends on ξ̄. The distributions are illustrated in the left panel of Figure 2, together with the Kullback–Leibler (KL) divergence between the true posterior distribution P and the approximations Q_ξ̄. We also illustrate the "geometrically optimal" approximation using a standard distribution in y and the optimal transformation g_iso and find that, while the approximation error becomes minimal in this case, it remains non-zero. Considering the discussion in Section 2.2, this result is to be expected due to the error contribution from the change in volume associated with the transformation g. As a comparison we also depict the optimal linear approximation of P, that is, a normal distribution in the coordinates ξ with optimally chosen mean and standard deviation. We see that even the worst expansion point ξ̄ = −0.2, which is far away from the optimum, still yields a better approximation of the posterior.

As a second example we consider the task of inferring the mean m and variance v of a single, real-valued Gaussian random variable d. In terms of s = (m, v), the likelihood takes the form

P(d|s) = N(d; m, v) . (56)

Furthermore we assume a prior model for s by means of a generative model of the form
m = ξ₁ and v = exp[3(ξ₂ + 2ξ₁)] , (57)
where ξ₁ and ξ₂ follow standard distributions a priori. This artificial model results in a linear prior correlation between the mean and the log-variance and thus introduces a non-linear coupling between m and v. The resulting two-dimensional posterior distribution P(ξ₁, ξ₂) can be seen in the left panel of Figure 3, together with the two marginals P(ξ₁) and P(ξ₂), for a given measurement d = 0. We approximate this posterior distribution following the direct approach described in Section 3.1, where the expansion point ξ̄ is obtained from minimizing the sum of the posterior Hamiltonian and the log-determinant of the metric (see Equation (43)). The resulting approximative distribution Q_D is shown in the right panel of Figure 3, where the location of ξ̄ is indicated as a blue cross. In comparison to the true distribution, we see that both the joint distribution and the marginals are in good agreement qualitatively, which is also supported quantitatively by a small KL divergence between P and Q_D (see Figure 3). The difference between P and Q_D appears to increase in regions further away from the expansion point, which is to be expected due to the local nature of the approximation. However, non-linear features such as the sharp peak at the "bottom" of P (Figure 3) are also present in Q_D, although slightly less prominent. This demonstrates that relevant non-linear structure can, to some degree, be captured by the coordinate transformation g derived from the metric M of the posterior.
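For this two-dimensional example, the unnormalized posterior Hamiltonian can be written down directly; the following sketch (with the measurement d = 0 quoted above) is the density that Figure 3 visualizes, up to normalization:

```python
import numpy as np

d = 0.0  # single measured data point, as in the example above

def hamiltonian(xi1, xi2):
    """Negative log-posterior -log P(xi1, xi2 | d), up to a constant."""
    m = xi1                              # mean, Equation (57)
    v = np.exp(3.0 * (xi2 + 2.0 * xi1))  # variance, Equation (57)
    neg_log_like = 0.5 * ((d - m) ** 2 / v + np.log(v))   # -log N(d; m, v)
    neg_log_prior = 0.5 * (xi1 ** 2 + xi2 ** 2)           # standard prior on xi
    return neg_log_like + neg_log_prior
```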
[Figure 2 (figure content not recoverable from the text layer; legend excerpt): KL(P; Q_iso) = 0.0333, KL(P; Q_{ξ̄=−1.0}) = 0.0582, KL(P; Q_{ξ̄=−0.6}) = 0.0489.]