15 | Variational inference

15.1 Foundations

Variational inference is a statistical inference framework for probabilistic models that comprise unobservable random variables. Its general starting point is a joint probability density function over observable random variables y and unobservable random variables ϑ,

p(y, ϑ) = p(ϑ) p(y|ϑ),   (15.1)

where p(ϑ) is usually referred to as the prior density and p(y|ϑ) as the likelihood. Given an observed value of y, the first aim of variational inference is to determine the conditional density of ϑ given the observed value of y, referred to as the posterior density. The second aim of variational inference is to evaluate the marginal density of the observed data, or, equivalently, its logarithm

ln p(y) = ln ∫ p(y, ϑ) dϑ.   (15.2)

Eq. (15.2) is commonly referred to as the log marginal likelihood or log model evidence. The log model evidence allows for comparing different models in their plausibility to explain observed data. In the variational inference framework it is not the log model evidence itself which is evaluated, but rather a lower bound approximation to it. This is due to the fact that if a model comprises many unobservable variables ϑ, the integration on the right-hand side of (15.2) can become analytically burdensome or even intractable. To nevertheless achieve its two aims, variational inference in effect replaces an integration problem with an optimization problem. To this end, variational inference exploits a set of information theoretic quantities as introduced in Chapter 11 and below. Specifically, the following log model evidence decomposition forms the core of the variational inference approach (Figure 15.1):

ln p(y) = F(q(ϑ)) + KL( q(ϑ) ‖ p(ϑ|y) ).   (15.3)

In eq. (15.3), q(ϑ) denotes an arbitrary probability density over the unobservable variables which is used as an approximation of the posterior density p(ϑ|y). In the following, q(ϑ) is referred to as variational density. In words, (15.3) states that for an arbitrary variational density q(ϑ), the log model evidence comprises the sum of two information theoretic quantities: the so-called variational free energy, defined as

F(q(ϑ)) := ∫ q(ϑ) ln( p(y, ϑ) / q(ϑ) ) dϑ,   (15.4)

and the KL divergence between the true posterior density p(ϑ|y) and the variational density q(ϑ),

KL( q(ϑ) ‖ p(ϑ|y) ) = ∫ q(ϑ) ln( q(ϑ) / p(ϑ|y) ) dϑ.   (15.5)

Based on these definitions, it is straightforward to show the validity of the log model evidence decomposition:

Figure 15.1. Visualization of the log model evidence decomposition that lies at the heart of the variational inference approach. The upper vertical bar represents the log model evidence, which is a function of the probabilistic model p(y, ϑ) and is constant for any observation of y. As shown in the main text, the log model evidence can readily be rewritten into the sum of the variational free energy term F (q(ϑ)) and a KL divergence term KL(q(ϑ)||p(ϑ|y)), if one introduces an arbitrary variational density over the unobservable variables ϑ. Maximizing the variational free energy hence minimizes the KL divergence between the variational density q(ϑ) and the true posterior density p(ϑ|y) and renders the variational free energy a better approximation of the log model evidence. Equivalently, minimizing the KL divergence between the variational density q(ϑ) and the true posterior density p(ϑ|y) maximizes the free energy and also renders it a tighter approximation to the log model evidence ln p(y).

Proof of (15.3)

By definition, we have

F(q(ϑ)) = ∫ q(ϑ) ln( p(y, ϑ) / q(ϑ) ) dϑ
= ∫ q(ϑ) ln( p(y) p(ϑ|y) / q(ϑ) ) dϑ
= ∫ q(ϑ) ln p(y) dϑ + ∫ q(ϑ) ln( p(ϑ|y) / q(ϑ) ) dϑ
= ln p(y) ∫ q(ϑ) dϑ + ∫ q(ϑ) ln( p(ϑ|y) / q(ϑ) ) dϑ
= ln p(y) − ∫ q(ϑ) ln( q(ϑ) / p(ϑ|y) ) dϑ
= ln p(y) − KL( q(ϑ) ‖ p(ϑ|y) ),   (15.6)

from which eq. (15.3) follows immediately. □

The log model evidence decomposition (15.3) can be used to achieve the aims of variational inference as follows: first, the non-negativity property of the KL divergence has the consequence that the variational free energy F(q(ϑ)) is always smaller than or equal to the log model evidence, i.e.,

F (q(ϑ)) ≤ ln p(y). (15.7)

This fact can be exploited in the numerical application of variational inference to probabilistic models: because the log model evidence is a fixed quantity which only depends on the choice of p(y, ϑ) and a specific data realization, manipulating the variational density q(ϑ) for a given data set in such a manner that the variational free energy increases has two consequences: first, the lower bound to the log model evidence becomes tighter, and the variational free energy a better approximation to the log model evidence. Second, because the left-hand side of eq. (15.3) remains constant, the KL divergence between the true posterior and its variational approximation decreases, which renders the variational density q(ϑ) an increasingly better approximation to the true posterior distribution p(ϑ|y). Because the variational free energy is a lower bound to the log model evidence, it is also referred to as the evidence lower bound (ELBO). The maximization of a variational free energy in terms of a variational density is a very general approach for posterior density and log model evidence approximation. Like the maximum likelihood approach, it serves as a guiding principle rather than a concrete numerical algorithm. In other words, algorithms that make use of the variational free energy log model evidence decomposition are jointly referred to as variational inference algorithms, but many variants exist. In the following two sections, we will discuss two specific variants and illustrate them with examples. The variants will be referred to as free-form mean-field variational inference and fixed-form mean-field variational inference. Here, the term mean-field refers to a factorization assumption with respect to the variational densities

over s sets of the unobserved random variables,

q(ϑ) = ∏_{i=1}^{s} q(ϑᵢ).   (15.8)

Such a factorization allows the variational free energy to be optimized independently for the variational densities q(ϑᵢ) in a coordinate-wise fashion for i = 1, ..., s, a procedure sometimes referred to as coordinate ascent variational inference (CAVI). The free-form and fixed-form variants of mean-field variational inference then differ in their assumptions about the variational densities q(ϑᵢ): the defining feature of the free-form mean-field variational inference approach is that the parametric form of the variational densities is not predetermined, but analytically evaluated based on a central result from variational calculus. As such, the free-form mean-field variational inference approach is useful to emphasize the roots of variational inference in variational calculus, but it is also analytically quite demanding. In a functional neuroimaging context, the free-form mean-field approach thus serves primarily didactic purposes. The fixed-form mean-field variational inference approach, on the other hand, is characterized by predetermined functional forms of the variational densities and is of high practical relevance in functional neuroimaging. In particular, a fixed-form mean-field variational inference approach that rests on Gaussian variational densities enjoys wide-spread popularity in functional neuroimaging (under the label variational Bayes Laplace algorithm) and in theoretical neuroscience (under the label free energy principle). In contrast to the free-form mean-field approach, the fixed-form mean-field approach is less analytically demanding and replaces a variational optimization problem with a standard numerical optimization problem. This is achieved by analytically evaluating the variational free energy in terms of the parameters of the variational densities.
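As a minimal numerical check of the decomposition (15.3) and the bound (15.7), the following sketch assumes a toy conjugate Gaussian model, ϑ ∼ N(0, 1) and y|ϑ ∼ N(ϑ, 1) with a single observation, for which ln p(y), the exact posterior, and all required expectations are available in closed form; the variable names and values are illustrative:

```python
import numpy as np
from scipy.stats import norm

# toy conjugate model (assumed for illustration): theta ~ N(0, 1), y | theta ~ N(theta, 1);
# then ln p(y) = ln N(y; 0, 2) and p(theta | y) = N(theta; y/2, 1/2)
y, m, s2 = 1.3, 0.4, 0.8   # observation and an arbitrary variational density q = N(m, s2)

# variational free energy F(q) = E_q[ln p(y, theta)] + differential entropy of q
F = (norm.logpdf(y, m, 1.0) - 0.5 * s2        # E_q[ln p(y | theta)]
     + norm.logpdf(m, 0.0, 1.0) - 0.5 * s2    # E_q[ln p(theta)]
     + 0.5 * np.log(2 * np.pi * np.e * s2))   # entropy of q

# KL divergence between q and the exact posterior N(y/2, 1/2)
kl = np.log(np.sqrt(0.5 / s2)) + (s2 + (m - y / 2) ** 2) / (2 * 0.5) - 0.5

lme = norm.logpdf(y, 0.0, np.sqrt(2.0))       # log model evidence ln N(y; 0, 2)
print(np.isclose(F + kl, lme), F <= lme)      # eq. (15.3) holds and F is a lower bound, eq. (15.7)
```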

15.2 Free-form mean-field variational inference

Free-form mean-field variational inference rests on a factorization of the variational density over sets of unobserved random variables

q(ϑ) = q(ϑs) q(ϑ\s),   (15.9)

referred to as a mean-field approximation. In (15.9), ϑ\s denotes all unobserved variables not in the sth group. For the factorization (15.9), the variational free energy becomes a function of two arguments, namely q(ϑs) and q(ϑ\s). Due to the complexity of the integrals involved, a simultaneous analytical maximization of the variational free energy with respect to both its arguments is often difficult to achieve, and a coordinate-wise approach, i.e., maximization first with respect to q(ϑs) and second with respect to q(ϑ\s), is preferred. Notably, the assumed factorization over sets of variables corresponds to the assumption that the respective variables form stochastically independent contributions to the multivariate posterior, which, depending on the true form of the generative model, may have weak or strong implications for the validity of the ensuing posterior inference.

The question is thus how to obtain the arguments q(ϑs) and q(ϑ\s) that maximize the variational free energy. It turns out that this challenge corresponds to a well-known problem in statistical physics, which has long been solved in a general fashion using variational calculus (Hinton and Van Camp, 1993). In contrast to ordinary calculus, which deals with the optimization of functions with respect to real numbers, variational calculus deals with the optimization of functions (in the context of variational calculus also referred to as functionals) with respect to functions. Using variational calculus, it can be shown that the variational free energy is maximized with respect to the unobserved variable partition ϑs, if q(ϑs) is set proportional (i.e., equal up to a scaling factor) to the exponential of the expected log joint probability of y and ϑ under the variational density over ϑ\s. Formally, this can be written as

q(ϑs) ∝ exp( ∫ q(ϑ\s) ln p(y, ϑ) dϑ\s ).   (15.10)

The result stated in eq. (15.10) is fundamental. It represents the general free-form mean-field variational inference strategy for obtaining variational densities over unobserved variables in light of data while maximizing the lower bound to the log model evidence. We thus refer to eq. (15.10) as the free-form variational inference theorem. In the following, we provide two proofs that (15.10) maximizes the variational free energy with respect to q(ϑs). The first proof is constructive in that it uses the constrained Gateaux derivative approach from variational calculus (Chapter 7) to generate the solution (15.10). The second proof eschews recourse to variational calculus techniques and uses a reformulation of the variational free energy in terms of a KL divergence involving the right-hand side of (15.10) (cf. Tzikas et al. (2008)).


Proof I of (15.10)

We first note that the aim of the free-form mean-field variational inference approach is to approximate the log marginal likelihood ln p(y) by iteratively maximizing its lower bound F(q(ϑs), q(ϑ\s)) with respect to the arguments q(ϑs) and q(ϑ\s). For iterations i = 1, 2, ..., during the first maximization, of finding q^{(i+1)}(ϑs), q^{(i)}(ϑ\s) is treated as a constant, while during the second maximization, of finding q^{(i+1)}(ϑ\s), q^{(i+1)}(ϑs) is treated as a constant. However, as ϑs and ϑ\s may be used interchangeably, we here concern ourselves only with the case of maximizing F(q(ϑs), q(ϑ\s)) with respect to q(ϑs). To obtain an expression for q^{(i+1)}(ϑs), we thus consider the variational free energy functional

F( q(ϑs), q^{(i)}(ϑ\s) ) = ∫∫ q(ϑs) q^{(i)}(ϑ\s) ln( p(y, ϑ) / ( q(ϑs) q^{(i)}(ϑ\s) ) ) dϑs dϑ\s,  where  ∫ q(ϑs) dϑs = 1.   (15.11)

In this case, the extended Lagrange function (cf. Chapter 7) is given by

F̄( q(ϑs), q^{(i)}(ϑ\s) ) = ∫ q(ϑs) q^{(i)}(ϑ\s) ln( p(y, ϑ) / ( q(ϑs) q^{(i)}(ϑ\s) ) ) dϑ\s + λ q(ϑs) − λ.   (15.12)

Furthermore, the Gateaux derivative δF̄( q(ϑs), q^{(i)}(ϑ\s) ) is given by the derivative of F̄ with respect to q(ϑs), because F̄ is not a function of q′(ϑs). One thus obtains

δF̄( q(ϑs), q^{(i)}(ϑ\s) )
= ∂/∂q(ϑs) ( ∫ q(ϑs) q^{(i)}(ϑ\s) ln( p(y, ϑ) / ( q(ϑs) q^{(i)}(ϑ\s) ) ) dϑ\s + λ q(ϑs) − λ )
= ∂/∂q(ϑs) ( ∫ q(ϑs) q^{(i)}(ϑ\s) ln p(y, ϑ) dϑ\s − ∫ q(ϑs) q^{(i)}(ϑ\s) ln( q(ϑs) q^{(i)}(ϑ\s) ) dϑ\s + λ q(ϑs) − λ )
= ∂/∂q(ϑs) ( q(ϑs) ∫ q^{(i)}(ϑ\s) ln p(y, ϑ) dϑ\s − ∫ q(ϑs) q^{(i)}(ϑ\s) ln q(ϑs) dϑ\s − ∫ q(ϑs) q^{(i)}(ϑ\s) ln q^{(i)}(ϑ\s) dϑ\s + λ q(ϑs) − λ )
= ∂/∂q(ϑs) ( q(ϑs) ∫ q^{(i)}(ϑ\s) ln p(y, ϑ) dϑ\s − q(ϑs) ln q(ϑs) ∫ q^{(i)}(ϑ\s) dϑ\s − q(ϑs) ∫ q^{(i)}(ϑ\s) ln q^{(i)}(ϑ\s) dϑ\s + λ q(ϑs) − λ )
= ∫ q^{(i)}(ϑ\s) ln p(y, ϑ) dϑ\s − ln q(ϑs) − ∫ q^{(i)}(ϑ\s) ln q^{(i)}(ϑ\s) dϑ\s + λ
= ∫ q^{(i)}(ϑ\s) ln p(y, ϑ) dϑ\s − ln q(ϑs) + c,   (15.13)

where

c := λ − ∫ q^{(i)}(ϑ\s) ln q^{(i)}(ϑ\s) dϑ\s.   (15.14)

Setting the Gateaux derivative to zero thus yields

ln q^{(i+1)}(ϑs) = ∫ q^{(i)}(ϑ\s) ln p(y, ϑ) dϑ\s + c.   (15.15)

Taking the exponential and subsuming the multiplicative constant under the proportionality factor then yields the free-form variational inference theorem for mean-field approximations

q^{(i+1)}(ϑs) ∝ exp( ∫ q^{(i)}(ϑ\s) ln p(y, ϑ) dϑ\s ).   (15.16)

□

Proof II of (15.10)

Consider maximization of the variational free energy with respect to q(ϑs)


Figure 15.2. The log model evidence decomposition visualized in Figure 15.1 is exploited in numerical algorithms for free-form VB inference: based on a mean-field approximation q(ϑ) = q(ϑs)q(ϑ\s), the variational free energy can be maximized in a coordinate-wise fashion. Maximizing the variational free energy in turn has two implications: it decreases the KL divergence between q(ϑ) and the true posterior p(ϑ|y) and renders the variational free energy a closer approximation to the log model evidence. This holds true, because the log model evidence for a given observation ỹ is constant (represented by the constant length of the vertical bar) and the KL divergence is non-negative.

F( q(ϑs) q(ϑ\s) ) = ∫∫ q(ϑs) q(ϑ\s) ln( p(y, ϑ) / ( q(ϑs) q(ϑ\s) ) ) dϑs dϑ\s
= ∫∫ q(ϑs) q(ϑ\s) ( ln p(y, ϑ) − ln q(ϑs) − ln q(ϑ\s) ) dϑs dϑ\s
= ∫∫ q(ϑs) q(ϑ\s) ( ln p(y, ϑ) − ln q(ϑs) ) dϑ\s dϑs − ∫∫ q(ϑs) q(ϑ\s) ln q(ϑ\s) dϑs dϑ\s
= ∫∫ q(ϑs) q(ϑ\s) ( ln p(y, ϑ) − ln q(ϑs) ) dϑ\s dϑs − ∫ q(ϑ\s) ln q(ϑ\s) ( ∫ q(ϑs) dϑs ) dϑ\s
= ∫∫ q(ϑs) q(ϑ\s) ln p(y, ϑ) dϑ\s dϑs − ∫∫ q(ϑs) q(ϑ\s) ln q(ϑs) dϑ\s dϑs − ∫ q(ϑ\s) ln q(ϑ\s) · 1 dϑ\s
= ∫ q(ϑs) ( ∫ q(ϑ\s) ln p(y, ϑ) dϑ\s ) dϑs − ∫ q(ϑs) ln q(ϑs) ( ∫ q(ϑ\s) dϑ\s ) dϑs − c
= ∫ q(ϑs) ( ∫ q(ϑ\s) ln p(y, ϑ) dϑ\s ) dϑs − ∫ q(ϑs) ln q(ϑs) · 1 dϑs − c
= ∫ q(ϑs) ( ∫ q(ϑ\s) ln p(y, ϑ) dϑ\s ) dϑs − ∫ q(ϑs) ln q(ϑs) dϑs − c
= ∫ q(ϑs) ln( exp( ∫ q(ϑ\s) ln p(y, ϑ) dϑ\s ) ) dϑs − ∫ q(ϑs) ln q(ϑs) dϑs − c
= ∫ q(ϑs) ( ln exp( ∫ q(ϑ\s) ln p(y, ϑ) dϑ\s ) − ln q(ϑs) ) dϑs − c
= ∫ q(ϑs) ln( exp( ∫ q(ϑ\s) ln p(y, ϑ) dϑ\s ) / q(ϑs) ) dϑs − c
= − ∫ q(ϑs) ln( q(ϑs) / exp( ∫ q(ϑ\s) ln p(y, ϑ) dϑ\s ) ) dϑs − c
= − KL( q(ϑs) ‖ exp( ∫ q(ϑ\s) ln p(y, ϑ) dϑ\s ) ) − c.   (15.17)

Maximizing the negative KL divergence by setting

q(ϑs) = exp( ∫ q(ϑ\s) ln p(y, ϑ) dϑ\s )   (15.18)

thus maximizes the variational free energy. □

Based on the free-form variational inference theorem, algorithmic implementations of variational inference can use an iterative coordinate-wise variational free energy ascent. For iterations i = 0, 1, 2, ..., this strategy proceeds as follows. The ascent starts by initializing q^{(0)}(ϑs) and q^{(0)}(ϑ\s), commonly by equating them to the prior distributions over ϑs and ϑ\s, respectively. Based on (15.10), it then continues by maximizing the variational free energy F( q^{(i)}(ϑs), q^{(i)}(ϑ\s) ), first with respect to the density q^{(i)}(ϑs), given q^{(i)}(ϑ\s), yielding the updated density q^{(i+1)}(ϑs). Then, by exchanging the labelling of ϑs and ϑ\s in eq. (15.9), the ascent continues by maximizing the variational free energy with respect to the density q^{(i)}(ϑ\s), given q^{(i+1)}(ϑs), yielding q^{(i+1)}(ϑ\s). This procedure is then iterated until convergence. Commonly, the initialization step sets the variational density q^{(0)}(ϑ) to the prior distribution p(ϑ). This defines the starting point of the iterative procedure as representative of the knowledge about the

unknown variables before observed data is taken into account. Further, this choice often enables the use of the well-known benefits of parameterized conjugate priors in the context of variational inference. The initialization of the variational density in terms of the prior distribution, and the subsequent optimization of the variational densities, should, however, not be confused with an empirical Bayesian approach, in which priors themselves are learned from the data: on each iteration of the variational inference algorithm sketched above, the variational density corresponds to the approximate posterior distribution, not an updated prior distribution. An empirical Bayesian extension of the variational inference algorithm, on the other hand, would correspond to a variation of the prior distribution (specifying the variational inference algorithm starting conditions) after convergence, with the aim of increasing the log model evidence per se. The variational inference algorithm as described here merely increases the lower bound to the fixed log model evidence, which is determined by the choice of the prior p(ϑ) and likelihood p(y|ϑ), i.e., the generative model p(y, ϑ). To summarize the above, a general iterative algorithm for free-form mean-field variational inference is outlined below. This iterative scheme shares some similarities with expectation-maximization algorithms for models comprising unobserved variables (Dempster et al., 1977; Wu, 1983). In fact, variational inference can be viewed as a generalization of expectation-maximization algorithms for maximum likelihood estimation to the Bayesian setting. For the general linear model, this line of thought is further investigated in Chapter 22.

An iterative free-form mean-field variational inference algorithm.
Initialization
0. Initialize q^{(0)}(ϑs) and q^{(0)}(ϑ\s) appropriately, e.g., by setting q^{(0)}(ϑs) ∝ p(ϑs) and q^{(0)}(ϑ\s) ∝ p(ϑ\s).
Until convergence
1. Set q^{(i+1)}(ϑs) proportional to exp( ∫ q^{(i)}(ϑ\s) ln p(y, ϑ) dϑ\s ).
2. Set q^{(i+1)}(ϑ\s) proportional to exp( ∫ q^{(i+1)}(ϑs) ln p(y, ϑ) dϑs ).

Free-form mean-field variational inference for a Gaussian model

Probabilistic model

To demonstrate free-form mean-field variational inference, we consider the estimation of the expectation and precision parameter of a univariate Gaussian distribution based on n independently and identically distributed data realizations yᵢ, i = 1, ..., n (Penny and Roberts, 2000; Bishop, 2006; Chappell et al., 2008; Murphy, 2012). To this end, we assume that the yᵢ, i = 1, ..., n are generated by a univariate Gaussian distribution with true, but unknown, expectation parameter µ ∈ ℝ and precision parameter λ > 0. We denote the concatenation of the data realizations by y := (y₁, ..., yₙ)ᵀ ∈ ℝⁿ. To recapitulate, the aim of variational inference is, based on appropriately chosen prior densities, first, to obtain posterior densities that quantify the remaining uncertainty over the true, but unknown, unobservable variables given the observable variables, and second, to obtain an approximation to the log model evidence, i.e., the log probability of the data given the probabilistic model. In the current example, the probabilistic model takes the form of the joint probability density function

p(y, µ, λ) = p(µ, λ) p(y|µ, λ) = p(µ, λ) ∏_{i=1}^{n} p(yᵢ|µ, λ).   (15.19)

A possible choice for a prior joint density of the unobservable variables is given by the product of a univariate Gaussian density for µ and a Gamma density for λ, i.e.,

p(y, µ, λ) := p(µ) p(λ) p(y|µ, λ) := N( µ; m_µ, s_µ² ) G( λ; a_λ, b_λ ) ∏_{i=1}^{n} N( yᵢ; µ, λ⁻¹ ).   (15.20)

Note that many other prior densities are conceivable. In fact, a more commonly discussed scenario is the case of a non-independent Gaussian-Gamma prior density (Bishop, 2006; Murphy, 2012). With respect to the factorized prior density considered here, the non-independent Gaussian-Gamma prior density has the advantage that it belongs to the conjugate-exponential class and allows for the derivation of an exact

analytical solution for the form of the posterior distribution. On the other hand, it is not clear in which applied scenarios a dependency of the prior over the expectation parameter µ on the prior density over λ is in fact a reasonable assumption. We here thus focus on the factorized prior density, as it corresponds to a more parsimonious choice than its non-factorized counterpart. Furthermore, it demonstrates how variational inference can be used to derive posterior density approximations in model scenarios where no analytical treatment is possible.

Variational inference

For the posterior density, we consider the mean-field approximation

p(µ, λ|y) ≈ q(µ)q(λ). (15.21)

Recall that the free-form mean-field variational inference theorem states that the variational density over the unobservable variable partition ϑs is given by

q(ϑs) ∝ exp( ∫ q(ϑ\s) ln p(y, ϑ) dϑ\s ).   (15.22)

For the current example,

q(µ, λ) := q(µ) q(λ),   (15.23)

and thus

q(µ) = c_µ exp( ∫ q(λ) ln p(y, µ, λ) dλ )   (15.24)

and

q(λ) = c_λ exp( ∫ q(µ) ln p(y, µ, λ) dµ ),   (15.25)

where c_µ and c_λ denote proportionality constants that render the proportionality statement in (15.22) equalities in (15.24) and (15.25), respectively. In the following, we shall derive an iterative scheme based on the equations above. For this purpose, it is first helpful to explicitly denote the iterative nature of the approach by denoting the variational densities q(µ) and q(λ) as q^{(i)}(µ) and q^{(i)}(λ). This also stresses the fact that in eqs. (15.24) and (15.25), the left-hand variational densities refer to their state at the (i + 1)th algorithm iteration, while the right-hand variational densities refer to their state at the ith algorithm iteration. Second, as we are dealing with densities from the exponential family, it is helpful to log transform both eqs. (15.24) and (15.25). For i = 0, 1, 2, ..., eqs. (15.24) and (15.25) may thus be rewritten as

ln q^{(i+1)}(µ) := ∫ q^{(i)}(λ) ln p(y, µ, λ) dλ + ln c_µ   (15.26)

and

ln q^{(i+1)}(λ) := ∫ q^{(i+1)}(µ) ln p(y, µ, λ) dµ + ln c_λ.   (15.27)

To obtain an expression for q^{(i+1)}(µ), we first note that we can express eq. (15.26) as

ln q^{(i+1)}(µ) = −(1/2) ⟨λ⟩_{q^{(i)}(λ)} Σ_{i=1}^{n} (yᵢ − µ)² − (1/(2 s_µ²)) (µ − m_µ)² + c̃_µ,   (15.28)

where c̃_µ denotes a constant including additive terms devoid of µ. Based on (15.28) and using the completing-the-square theorem for Gaussian distributions (cf. Chapter 10), we can then infer that q^{(i+1)}(µ) is proportional to a Gaussian density

q^{(i+1)}(µ) = N( µ; m_µ^{(i+1)}, s_µ^{2(i+1)} ),   (15.29)

with parameters

m_µ^{(i+1)} = ( m_µ + s_µ² ⟨λ⟩_{q^{(i)}(λ)} Σ_{i=1}^{n} yᵢ ) / ( 1 + n s_µ² ⟨λ⟩_{q^{(i)}(λ)} )  and  s_µ^{2(i+1)} = s_µ² / ( 1 + n s_µ² ⟨λ⟩_{q^{(i)}(λ)} ).   (15.30)

Next, to obtain an expression for q^{(i+1)}(λ), we first note that we can express eq. (15.27) as

ln q^{(i+1)}(λ) = (n/2) ln λ − (λ/2) ⟨ Σ_{i=1}^{n} (yᵢ − µ)² ⟩_{q^{(i+1)}(µ)} + (a_λ − 1) ln λ − λ/b_λ + c̃_λ,   (15.31)


Figure 15.3. Free-form variational inference for the Gaussian. (A) The panels depict the true underlying data model p(y|µ, λ), for µ = 1 and λ = 5 as a solid line and N = 10 samples yᵢ from this model on the abscissa as red dots. Based on these samples, on each iteration of the VB algorithm, a variational approximation q(µ)q(λ) is updated. The first panel of (A) shows the univariate Gaussian model as approximated by the expectations over q(µ) and q(λ) as a dashed line. The second panel of (A) shows the effect of the update of the density q(µ) on the first iteration of the algorithm. As q(µ) governs the mean of the univariate Gaussian, the dashed Gaussian is now centered on the mean of the data points. The third panel of (A) shows the effect of the update of the density q(λ) on the first iteration of the algorithm. As q(λ) governs the precision of the univariate Gaussian model, the dashed Gaussian updates its variance based on the data variability. The fourth and fifth panels of (A) show the corresponding two steps on the 8th iteration. (B) The panels of (B) show the factorized variational density q(µ)q(λ) over VB algorithm iterations. The white dot in each panel indicates the true underlying parameters that gave rise to the observed data. Note that these parameters were not sampled from the prior density, but that the prior density embeds the initial uncertainty about this true, but unknown, parameter value before the observation of any data. The ordering of the panels is as in (A). (C) The panel shows the evolution of the variational free energy over iterations of the VB algorithm. For the current model and data set, the variational free energy levels off from approximately 4 iterations onwards. In the variational inference framework, the final value of the variational free energy after convergence of the algorithm corresponds to the approximation to the log model evidence ln p(y).

where c̃_λ denotes a constant including additive terms devoid of λ. Expressing the right-hand side of (15.31) in multiplicative terms involving λ and taking exponentials, it then follows that q^{(i+1)}(λ) is proportional to a Gamma density

G( λ; a_λ^{(i+1)}, b_λ^{(i+1)} )   (15.32)

with parameters

a_λ^{(i+1)} = n/2 + a_λ  and  b_λ^{(i+1)} = ( 1/b_λ + (1/2)( Σ_{i=1}^{n} yᵢ² − 2 Σ_{i=1}^{n} yᵢ m_µ^{(i+1)} + n( (m_µ^{(i+1)})² + s_µ^{2(i+1)} ) ) )⁻¹.   (15.33)

A number of things are noteworthy. First, the Gaussian and Gamma density forms of the variational densities q^{(i+1)}(µ) and q^{(i+1)}(λ) follow directly from the form of the probabilistic model eq. (15.20) and the free-form mean-field theorem for variational inference. In other words, the functional forms of the densities q^{(i+1)}(µ) and q^{(i+1)}(λ) are not predetermined, but automatically fall out of the variational inference approach, hence the name free-form variational inference. Second, if the variational density q^{(0)}(λ) is initialized using the prior density p(λ), the expected value ⟨λ⟩_{q^{(i)}(λ)} is determined by the prior parameters a_λ and b_λ for i = 0, and by the variational density parameters a_λ^{(i)} and b_λ^{(i)} for i = 1, 2, .... In other words, the parameter update equations (15.30) and (15.33) are fully determined in terms of the prior density parameters a_λ, b_λ, m_µ, s_µ², the data realizations y₁, y₂, ..., yₙ, and the variational density parameters m_µ^{(i)}, s_µ^{2(i)}, a_λ^{(i)} and b_λ^{(i)}. Third, an explicit form of the variational free energy is not required for its maximization by means of the variational densities q^{(i)}(µ) and q^{(i)}(λ). It is nevertheless useful to evaluate it in order to monitor the progression of the iterative algorithm. For the current example, it takes the form


F : ℝⁿ × ℝ × ℝ>0 × ℝ>0 × ℝ>0 × ℝ × ℝ>0 × ℝ>0 × ℝ>0 → ℝ, ( y, m_µ^{(i)}, s_µ^{2(i)}, a_λ^{(i)}, b_λ^{(i)}, m_µ, s_µ², a_λ, b_λ ) ↦

F( y, m_µ^{(i)}, s_µ^{2(i)}, a_λ^{(i)}, b_λ^{(i)}, m_µ, s_µ², a_λ, b_λ )
:= (n/2)( ψ(a_λ^{(i)}) + ln b_λ^{(i)} ) − (1/2) a_λ^{(i)} b_λ^{(i)} ( Σ_{i=1}^{n} yᵢ² + n( (m_µ^{(i)})² + s_µ^{2(i)} ) − 2 m_µ^{(i)} Σ_{i=1}^{n} yᵢ ) − (n/2) ln 2π
 − (1/2) ln( s_µ² / s_µ^{2(i)} ) − ( m_µ² + (m_µ^{(i)})² + s_µ^{2(i)} − 2 m_µ^{(i)} m_µ )/( 2 s_µ² ) + 1/2
 − ( a_λ^{(i)} − 1 ) ψ(a_λ^{(i)}) + ln b_λ^{(i)} + a_λ^{(i)} + ln Γ(a_λ^{(i)}) − ln Γ(a_λ) − a_λ ln b_λ + ( a_λ − 1 )( ψ(a_λ^{(i)}) + ln b_λ^{(i)} ) − a_λ^{(i)} b_λ^{(i)} / b_λ,   (15.34)

i.e., the average likelihood term of eq. (15.55) minus the two KL divergence terms KL( q^{(i)}(µ) ‖ p(µ) ) and KL( q^{(i)}(λ) ‖ p(λ) ) evaluated in eqs. (15.57) and (15.58) below, where Γ and ψ denote the Gamma and digamma functions, respectively. We visualize the free-form mean-field variational inference approach for the expectation and precision parameter of a univariate Gaussian in Figure 15.3.
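The update equations (15.30) and (15.33) translate directly into an iterative scheme. The following sketch is a minimal illustration (the function name and argument conventions are assumptions, not part of the original text); it expects y as a one-dimensional NumPy array and parameterizes the Gamma density by shape a and scale b, so that ⟨λ⟩_{q(λ)} = a b:

```python
import numpy as np

def free_form_vb_gaussian(y, m_mu, s2_mu, a_lam, b_lam, n_iter=20):
    """Coordinate-wise free-form VB updates of eqs. (15.30) and (15.33).

    y: 1-D NumPy array of data realizations; remaining arguments are the prior
    parameters m_mu, s2_mu (Gaussian) and a_lam, b_lam (Gamma, shape-scale)."""
    n = len(y)
    a_i, b_i = a_lam, b_lam                  # initialize q(lambda) with the prior
    for _ in range(n_iter):
        e_lam = a_i * b_i                    # <lambda> under q(lambda) = G(a, b)
        m_i = (m_mu + s2_mu * e_lam * y.sum()) / (1 + n * s2_mu * e_lam)   # eq. (15.30)
        s2_i = s2_mu / (1 + n * s2_mu * e_lam)                             # eq. (15.30)
        a_i = n / 2 + a_lam                                                # eq. (15.33)
        b_i = 1 / (1 / b_lam + 0.5 * (np.sum(y**2) - 2 * y.sum() * m_i
                                      + n * (m_i**2 + s2_i)))              # eq. (15.33)
    return m_i, s2_i, a_i, b_i
```

With data simulated as in Figure 15.3, the returned parameters approximate the posterior over (µ, λ); the free energy (15.34) can be evaluated alongside the updates to monitor convergence.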

Proof of eqs. (15.28), (15.29) and (15.30)

We first note that with the probabilistic model (15.20), eq. (15.26) can be rewritten as

ln q^{(i+1)}(µ) = ⟨ ln p(y, µ, λ) ⟩_{q^{(i)}(λ)} + ln c_µ
= ⟨ ln( ∏_{i=1}^{n} p(yᵢ|µ, λ) p(µ) p(λ) ) ⟩_{q^{(i)}(λ)} + ln c_µ
= ⟨ Σ_{i=1}^{n} ln p(yᵢ|µ, λ) ⟩_{q^{(i)}(λ)} + ⟨ ln p(µ) ⟩_{q^{(i)}(λ)} + ⟨ ln p(λ) ⟩_{q^{(i)}(λ)} + ln c_µ.   (15.35)

Substitution of the example-specific densities (cf. eq. (15.20))

p(µ) = N( µ; m_µ, s_µ² ),  p(λ) = G( λ; a_λ, b_λ ),  and  p(yᵢ|µ, λ) = N( yᵢ; µ, λ⁻¹ )   (15.36)

then yields

ln q^{(i+1)}(µ) = ⟨ Σ_{i=1}^{n} ln( λ^{1/2} (2π)^{−1/2} exp( −(λ/2)(yᵢ − µ)² ) ) ⟩_{q^{(i)}(λ)}
+ ⟨ ln( (2π s_µ²)^{−1/2} exp( −(1/(2 s_µ²))(µ − m_µ)² ) ) ⟩_{q^{(i)}(λ)}
+ ⟨ ln( (1/(Γ(a_λ) b_λ^{a_λ})) λ^{a_λ−1} exp( −λ/b_λ ) ) ⟩_{q^{(i)}(λ)}
+ ln c_µ
= ⟨ (n/2) ln λ − (n/2) ln 2π − (λ/2) Σ_{i=1}^{n} (yᵢ − µ)² ⟩_{q^{(i)}(λ)}
+ ⟨ −(1/2) ln s_µ² − (1/2) ln 2π − (1/(2 s_µ²))(µ − m_µ)² ⟩_{q^{(i)}(λ)}
+ ⟨ −ln Γ(a_λ) − a_λ ln b_λ + (a_λ − 1) ln λ − λ/b_λ ⟩_{q^{(i)}(λ)}
+ ln c_µ.   (15.37)

Grouping all terms devoid of µ in a constant c̃_µ and accounting for the linearity of expectations then results in

ln q^{(i+1)}(µ) = −(1/2) ⟨λ⟩_{q^{(i)}(λ)} Σ_{i=1}^{n} (yᵢ − µ)² − (1/(2 s_µ²)) (µ − m_µ)² + c̃_µ.   (15.38)

Next, to use the completing-the-square theorem for inferring that q^{(i+1)}(µ) conforms to a Gaussian density with the parameters of eq. (15.29), we first rewrite the right-hand side of eq. (15.28) as a quadratic expression in µ. We have


ln q^{(i+1)}(µ) = −(1/2) ⟨λ⟩_{q^{(i)}(λ)} Σ_{i=1}^{n} (yᵢ − µ)² − (µ − m_µ)²/(2 s_µ²) + c̃_µ
= −(1/2) ⟨λ⟩_{q^{(i)}(λ)} ( Σ_{i=1}^{n} yᵢ² − 2µ Σ_{i=1}^{n} yᵢ + nµ² ) − ( µ² − 2µ m_µ + m_µ² )/(2 s_µ²) + c̃_µ
= −(1/2) ( ⟨λ⟩_{q^{(i)}(λ)} Σ_{i=1}^{n} yᵢ² − ⟨λ⟩_{q^{(i)}(λ)} 2µ Σ_{i=1}^{n} yᵢ + ⟨λ⟩_{q^{(i)}(λ)} nµ² + (1/s_µ²) µ² − (2 m_µ/s_µ²) µ + (1/s_µ²) m_µ² ) + c̃_µ
= −(1/2) ( ⟨λ⟩_{q^{(i)}(λ)} nµ² + (1/s_µ²) µ² − ⟨λ⟩_{q^{(i)}(λ)} 2µ Σ_{i=1}^{n} yᵢ − (2 m_µ/s_µ²) µ + ⟨λ⟩_{q^{(i)}(λ)} Σ_{i=1}^{n} yᵢ² + (1/s_µ²) m_µ² ) + c̃_µ
= −(1/2) ( ( ⟨λ⟩_{q^{(i)}(λ)} n + 1/s_µ² ) µ² − ( 2 a_λ^{(i)} b_λ^{(i)} Σ_{i=1}^{n} yᵢ + 2 m_µ/s_µ² ) µ + ⟨λ⟩_{q^{(i)}(λ)} Σ_{i=1}^{n} yᵢ² + (1/s_µ²) m_µ² ) + c̃_µ.   (15.39)

Resolving brackets and grouping terms devoid of µ with the constant c̃_µ, resulting in the new constant c̃′_µ, and re-expressing the coefficient of µ² then results in

ln q^{(i+1)}(µ) = −(1/2)( ⟨λ⟩_{q^{(i)}(λ)} n + 1/s_µ² ) µ² + ( ⟨λ⟩_{q^{(i)}(λ)} Σ_{i=1}^{n} yᵢ + m_µ/s_µ² ) µ + c̃′_µ
= −(1/2)( ( ⟨λ⟩_{q^{(i)}(λ)} n s_µ² + 1 )/s_µ² ) µ² + ( ⟨λ⟩_{q^{(i)}(λ)} Σ_{i=1}^{n} yᵢ + m_µ/s_µ² ) µ + c̃′_µ
= −(1/2)( ( 1 + n s_µ² ⟨λ⟩_{q^{(i)}(λ)} )/s_µ² ) µ² + ( ⟨λ⟩_{q^{(i)}(λ)} Σ_{i=1}^{n} yᵢ + m_µ/s_µ² ) µ + c̃′_µ.   (15.40)

Using the completing-the-square theorem (cf. Chapter 10) in the form

exp( −(1/2) a x² + b x ) ∝ N( x; a⁻¹ b, a⁻¹ )   (15.41)

then yields

q^{(i+1)}(µ) ∝ N( µ; m_µ^{(i+1)}, s_µ^{2(i+1)} ),   (15.42)

where the variational parameters m_µ^{(i+1)}, s_µ^{2(i+1)} may be expressed in terms of the expectation of λ under the ith variational density q^{(i)}(λ), the prior parameters m_µ and s_µ², and the data yᵢ as

s_µ^{2(i+1)} = ( ( 1 + n s_µ² ⟨λ⟩_{q^{(i)}(λ)} ) / s_µ² )⁻¹ = s_µ² / ( 1 + n s_µ² ⟨λ⟩_{q^{(i)}(λ)} )   (15.43)

and

m_µ^{(i+1)} = ( s_µ² / ( 1 + n s_µ² ⟨λ⟩_{q^{(i)}(λ)} ) ) ( ⟨λ⟩_{q^{(i)}(λ)} Σ_{i=1}^{n} yᵢ + m_µ/s_µ² )
= ( m_µ + s_µ² ⟨λ⟩_{q^{(i)}(λ)} Σ_{i=1}^{n} yᵢ ) / ( 1 + n s_µ² ⟨λ⟩_{q^{(i)}(λ)} ).   (15.44)

□


Proof of eq. (15.31) and eq. (15.33)

In analogy to the derivation of eq. (15.28) we have

ln q^{(i+1)}(λ) = ⟨ Σ_{i=1}^{n} ln( λ^{1/2} (2π)^{−1/2} exp( −(λ/2)(yᵢ − µ)² ) ) ⟩_{q^{(i+1)}(µ)}
+ ⟨ ln( (2π s_µ²)^{−1/2} exp( −(1/(2 s_µ²))(µ − m_µ)² ) ) ⟩_{q^{(i+1)}(µ)}
+ ⟨ ln( (1/(Γ(a_λ) b_λ^{a_λ})) λ^{a_λ−1} exp( −λ/b_λ ) ) ⟩_{q^{(i+1)}(µ)}
+ ln c_λ
= ⟨ (n/2) ln λ − (n/2) ln 2π − (λ/2) Σ_{i=1}^{n} (yᵢ − µ)² ⟩_{q^{(i+1)}(µ)}
+ ⟨ −(1/2) ln s_µ² − (1/2) ln 2π − (1/(2 s_µ²))(µ − m_µ)² ⟩_{q^{(i+1)}(µ)}
+ ⟨ −ln Γ(a_λ) − a_λ ln b_λ + (a_λ − 1) ln λ − λ/b_λ ⟩_{q^{(i+1)}(µ)}
+ ln c_λ.   (15.45)

Grouping all terms devoid of λ in a constant c̃_λ and using the linearity of expectations then simplifies the above to

ln q^{(i+1)}(λ) = ⟨ (n/2) ln λ − (λ/2) Σ_{i=1}^{n} (yᵢ − µ)² ⟩_{q^{(i+1)}(µ)} + ⟨ (a_λ − 1) ln λ − λ/b_λ ⟩_{q^{(i+1)}(µ)} + c̃_λ
= (n/2) ln λ − (λ/2) ⟨ Σ_{i=1}^{n} (yᵢ − µ)² ⟩_{q^{(i+1)}(µ)} + (a_λ − 1) ln λ − λ/b_λ + c̃_λ.   (15.46)

Reorganizing the right-hand side of equation (15.31) in multiplicative terms involving λ and expressing the expectations of µ under the variational density q^{(i+1)}(µ) yields

ln q^{(i+1)}(λ) = ( n/2 + a_λ − 1 ) ln λ − ( 1/b_λ + (1/2) ⟨ Σ_{i=1}^{n} (yᵢ − µ)² ⟩_{q^{(i+1)}(µ)} ) λ + c̃_λ
= ( n/2 + a_λ − 1 ) ln λ − ( 1/b_λ + (1/2) ⟨ Σ_{i=1}^{n} yᵢ² − 2µ Σ_{i=1}^{n} yᵢ + nµ² ⟩_{q^{(i+1)}(µ)} ) λ + c̃_λ
= ( n/2 + a_λ − 1 ) ln λ − ( 1/b_λ + (1/2)( Σ_{i=1}^{n} yᵢ² − 2 Σ_{i=1}^{n} yᵢ ⟨µ⟩_{q^{(i+1)}(µ)} + n ⟨µ²⟩_{q^{(i+1)}(µ)} ) ) λ + c̃_λ
= ( n/2 + a_λ − 1 ) ln λ − ( 1/b_λ + (1/2)( Σ_{i=1}^{n} yᵢ² − 2 Σ_{i=1}^{n} yᵢ m_µ^{(i+1)} + n( (m_µ^{(i+1)})² + s_µ^{2(i+1)} ) ) ) λ + c̃_λ.   (15.47)

Taking the exponential on both sides then yields

q^{(i+1)}(λ) ∝ λ^{( n/2 + a_λ − 1 )} exp( −( 1/b_λ + (1/2)( Σ_{i=1}^{n} yᵢ² − 2 Σ_{i=1}^{n} yᵢ m_µ^{(i+1)} + n( (m_µ^{(i+1)})² + s_µ^{2(i+1)} ) ) ) λ ).   (15.48)

Up to a normalization constant, q^{(i+1)}(λ) is thus given by a Gamma density function

q^{(i+1)}(λ) ∝ G( λ; a_λ^{(i+1)}, b_λ^{(i+1)} )   (15.49)

with parameters

a_λ^{(i+1)} := n/2 + a_λ  and  b_λ^{(i+1)} := ( 1/b_λ + (1/2)( Σ_{i=1}^{n} yᵢ² − 2 Σ_{i=1}^{n} yᵢ m_µ^{(i+1)} + n( (m_µ^{(i+1)})² + s_µ^{2(i+1)} ) ) )⁻¹.   (15.50)

□


Proof of eq. (15.34)

We first reformulate the variational free energy functional as

F(q(ϑ)) = ∫ q(ϑ) ln( p(y, ϑ) / q(ϑ) ) dϑ
= ∫ q(ϑ) ln( p(y|ϑ) p(ϑ) / q(ϑ) ) dϑ
= ∫ q(ϑ) ( ln p(y|ϑ) − ln( q(ϑ) / p(ϑ) ) ) dϑ
= ∫ q(ϑ) ln p(y|ϑ) dϑ − ∫ q(ϑ) ln( q(ϑ) / p(ϑ) ) dϑ
= ∫ q(ϑ) ln p(y|ϑ) dϑ − KL( q(ϑ) ‖ p(ϑ) ),   (15.51)

where the first term on the right-hand side is sometimes referred to as the average likelihood and the second term is the KL divergence between the variational and prior distributions. We next evaluate the average likelihood term. To this end, substitution of the relevant probability densities yields

∫ q(ϑ) ln p(y|ϑ) dϑ = ∫∫ q^{(i)}(µ) q^{(i)}(λ) ln( ∏_{i=1}^{n} N( yᵢ; µ, λ⁻¹ ) ) dµ dλ
= ∫∫ q^{(i)}(µ) q^{(i)}(λ) ln( ∏_{i=1}^{n} (λ/2π)^{1/2} exp( −(λ/2)(yᵢ − µ)² ) ) dµ dλ
= ∫∫ q^{(i)}(µ) q^{(i)}(λ) ( (n/2) ln λ − (n/2) ln 2π − (λ/2) Σ_{i=1}^{n} (yᵢ − µ)² ) dµ dλ
= (n/2) ∫∫ q^{(i)}(µ) q^{(i)}(λ) ln λ dµ dλ − ∫∫ q^{(i)}(µ) q^{(i)}(λ) (λ/2) Σ_{i=1}^{n} (yᵢ − µ)² dµ dλ − (n/2) ln 2π
= (n/2) ∫ q^{(i)}(λ) ln λ dλ − ∫ q^{(i)}(λ) (λ/2) ( ∫ q^{(i)}(µ) Σ_{i=1}^{n} (yᵢ − µ)² dµ ) dλ − (n/2) ln 2π.   (15.52)

The first integral term on the right-hand side of eq. (15.52) is the expectation of the logarithm of λ under the variational density q^{(i)}(λ) = G( λ; a_λ^{(i)}, b_λ^{(i)} ) and evaluates to (cf. Johnson et al., 1994)

∫ q^{(i)}(λ) ln λ dλ = ψ( a_λ^{(i)} ) + ln b_λ^{(i)},   (15.53)

where ψ denotes the digamma function. The second integral term on the right-hand side of eq. (15.52) evaluates to

∫ q^{(i)}(λ) (λ/2) ( ∫ q^{(i)}(µ) Σ_{i=1}^{n} (yᵢ − µ)² dµ ) dλ = ∫ q^{(i)}(λ) (λ/2) ( ∫ q^{(i)}(µ) ( Σ_{i=1}^{n} yᵢ² − 2µ Σ_{i=1}^{n} yᵢ + nµ² ) dµ ) dλ
= (1/2) ∫ q^{(i)}(λ) λ ( Σ_{i=1}^{n} yᵢ² − 2 Σ_{i=1}^{n} yᵢ ∫ q^{(i)}(µ) µ dµ + n ∫ q^{(i)}(µ) µ² dµ ) dλ
= (1/2) a_λ^{(i)} b_λ^{(i)} ( Σ_{i=1}^{n} yᵢ² − 2 m_µ^{(i)} Σ_{i=1}^{n} yᵢ + n( (m_µ^{(i)})² + s_µ^{2(i)} ) ).   (15.54)

The average likelihood term in eq. (15.51) thus evaluates to

∫ q(ϑ) ln p(y|ϑ) dϑ = (n/2)( ψ( a_λ^{(i)} ) + ln b_λ^{(i)} ) − (1/2) a_λ^{(i)} b_λ^{(i)} ( Σ_{i=1}^{n} yᵢ² − 2 m_µ^{(i)} Σ_{i=1}^{n} yᵢ + n( (m_µ^{(i)})² + s_µ^{2(i)} ) ) − (n/2) ln 2π.   (15.55)

To evaluate the KL divergence term in eq. (15.51), we first note that with the additivity property of the KL divergence for factorized densities (cf. Chapter 12), we have

KL( q(µ) q(λ) ‖ p(µ) p(λ) ) = KL( q(µ) ‖ p(µ) ) + KL( q(λ) ‖ p(λ) ).   (15.56)

In the current example, the variable µ is governed by Gaussian densities for both the prior density p(µ) and the variational densities. More specifically, in the variational inference algorithm, the prior density for µ has


parameters m_µ, s_µ², while the variational density q(µ) corresponds to the ith variational density q^{(i)}(µ) with parameters m_µ^{(i)}, s_µ^{2(i)}. With the known form of the KL divergence for univariate Gaussian densities, we thus have

KL( q^{(i)}(µ) ‖ p(µ) ) = KL( N( µ; m_µ^{(i)}, s_µ^{2(i)} ) ‖ N( µ; m_µ, s_µ² ) )
= (1/2) ln( s_µ² / s_µ^{2(i)} ) + ( m_µ² + (m_µ^{(i)})² + s_µ^{2(i)} − 2 m_µ^{(i)} m_µ )/( 2 s_µ² ) − 1/2.   (15.57)

Similarly, the variable λ is governed by Gamma densities for both the prior and the variational densities. Specifically, the prior density of λ has parameters a_λ and b_λ, while the variational density q(λ) corresponds to the ith variational Gamma distribution over λ with parameters a_λ^{(i)} and b_λ^{(i)}. With the known form of the KL divergence for Gamma densities (stated here for the shape-scale parameterization used throughout this chapter), we thus have

KL( q^{(i)}(λ) ‖ p(λ) ) = KL( G( λ; a_λ^{(i)}, b_λ^{(i)} ) ‖ G( λ; a_λ, b_λ ) )
= ( a_λ^{(i)} − 1 ) ψ( a_λ^{(i)} ) − ln b_λ^{(i)} − a_λ^{(i)} − ln Γ( a_λ^{(i)} ) + ln Γ( a_λ ) + a_λ ln b_λ − ( a_λ − 1 )( ψ( a_λ^{(i)} ) + ln b_λ^{(i)} ) + a_λ^{(i)} b_λ^{(i)} / b_λ.   (15.58)

The KL divergence term in eq. (15.51) thus evaluates to

KL( q(ϑ) ‖ p(ϑ) ) = (1/2) ln( s_µ² / s_µ^{2(i)} ) + ( m_µ² + (m_µ^{(i)})² + s_µ^{2(i)} − 2 m_µ^{(i)} m_µ )/( 2 s_µ² ) − 1/2
+ ( a_λ^{(i)} − 1 ) ψ( a_λ^{(i)} ) − ln b_λ^{(i)} − a_λ^{(i)} − ln Γ( a_λ^{(i)} ) + ln Γ( a_λ ) + a_λ ln b_λ − ( a_λ − 1 )( ψ( a_λ^{(i)} ) + ln b_λ^{(i)} ) + a_λ^{(i)} b_λ^{(i)} / b_λ.   (15.59)

Combining (15.55) and (15.59) according to (15.51) then yields (15.34). □

15.3 Fixed-form mean-field variational inference

The central idea of fixed-form mean-field variational inference is to pre-define the parametric form of the factorized variational density

q(ϑ) = ∏_{i=1}^{k} q(ϑᵢ)   (15.60)

at all stages of an iterative algorithm for the maximization of the variational free energy. Because the joint density p(y, ϑ) is defined during the formulation of the probabilistic model of interest, this entails that all densities of the variational free energy

F(q(ϑ)) = ∫ q(ϑ) ln( p(y, ϑ) / q(ϑ) ) dϑ   (15.61)

are defined in parametric form at all times of the procedure. If the integral on the right-hand side of eq. (15.61) can be analytically evaluated (or at least be approximated) as a function of the parameters of the variational densities q(ϑᵢ), i = 1, ..., k, the variational problem of maximizing a functional with respect to probability density functions is rendered a problem of multivariate optimization, which in turn can be addressed using the standard machinery of nonlinear optimization (Chapter 4). In the following, we will exemplify the fixed-form mean-field variational inference approach using a nonlinear Gaussian model with a single mean-field partition, which forms the basis for many models in functional neuroimaging (Friston, 2008).
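As a minimal sketch of this idea, the toy conjugate Gaussian model used in the sketch at the end of Section 15.1 can be treated in fixed form by choosing q(ϑ) = N(m, s²), evaluating the variational free energy analytically in (m, ln s), and maximizing it with a standard numerical optimizer; the parameterization and all names are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

y = 1.3   # single observation of the toy model theta ~ N(0, 1), y | theta ~ N(theta, 1)

def neg_free_energy(params):
    """Negative variational free energy for the fixed form q(theta) = N(m, exp(2*rho))."""
    m, rho = params
    s2 = np.exp(2 * rho)   # variance parameterized via rho = ln(s) for unconstrained optimization
    avg_loglik = norm.logpdf(y, m, 1.0) - 0.5 * s2      # E_q[ln p(y | theta)]
    avg_logprior = norm.logpdf(m, 0.0, 1.0) - 0.5 * s2  # E_q[ln p(theta)]
    entropy = 0.5 * np.log(2 * np.pi * np.e * s2)
    return -(avg_loglik + avg_logprior + entropy)

res = minimize(neg_free_energy, x0=[0.0, 0.0])
m_opt, s2_opt = res.x[0], np.exp(2 * res.x[1])
print(m_opt, s2_opt)                                 # approaches the exact posterior N(y/2, 1/2)
print(-res.fun, norm.logpdf(y, 0.0, np.sqrt(2.0)))   # optimized F approaches ln p(y)
```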

Fixed-form variational inference for a non-linear Gaussian model

Probabilistic model

We consider the following hierarchical nonlinear Gaussian model comprising an unobservable random vector x and an observable random vector y

x = µ_x + η   (15.62)
y = f(x) + ε,   (15.63)

PMFN | © 2019 Dirk Ostwald CC BY-NC-SA 4.0 Fixed-form mean-field variational inference 165

where x, µ_x, η ∈ ℝᵐ and y, ε ∈ ℝⁿ, and ε and η are random vectors with distributions

η ∼ N( 0_m, Σ_x )  and  ε ∼ N( 0_n, λ_y⁻¹ I_n ),   (15.64)

where Σ_x ∈ ℝ^{m×m} is positive-definite, λ_y > 0, and f : ℝᵐ → ℝⁿ is a multivariate vector-valued function. To apply the fixed-form mean-field variational inference approach to this model, we first consider the joint density implicit in eqs. (15.62) and (15.63). It is given by

p(y, x) = p(y|x) p(x),   (15.65)

where the conditional density of the observable random variable is specified by

p(y|x) = N( y; f(x), λ_y⁻¹ I_n )   (15.66)

and the marginal distribution of the unobserved random vector x is specified by

p(x) = N( x; µ_x, Σ_x ).   (15.67)

In functional form, we can thus write the joint density (15.65) as the product of two multivariate Gaussian distributions,

p(y, x) = N( y; f(x), λ_y⁻¹ I_n ) N( x; µ_x, Σ_x ).   (15.68)

We assume that the prior parameters µx and Σx, as well as the likelihood parameter λy are known. The aim of variational inference applied to (15.68) is to obtain an approximation to the posterior distribution p(x|y) and the log model evidence ln p(y).
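For concreteness, the following sketch simulates a single data realization from the model (15.62)-(15.64); the dimensions, parameter values, and the nonlinear function f are purely hypothetical choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical dimensions and model constituents
m, n = 2, 4
mu_x, Sigma_x = np.zeros(m), np.eye(m)    # prior parameters (assumed known)
lambda_y = 4.0                            # observation noise precision (assumed known)

def f(x):
    """A hypothetical nonlinear observation function f: R^m -> R^n."""
    A = np.array([[1.0, 0.5], [0.2, 1.0], [0.3, 0.3], [0.8, 0.1]])
    return np.tanh(A @ x)

# forward simulation of eqs. (15.62)-(15.64)
x_true = mu_x + rng.multivariate_normal(np.zeros(m), Sigma_x)
y = f(x_true) + rng.multivariate_normal(np.zeros(n), np.eye(n) / lambda_y)
```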

Variational inference

To achieve these aims, fixed-form mean-field variational inference approximates the posterior distribution p(x|y) using a predefined parametric form of the variational distribution. A common choice is to use a Gaussian, i.e.,

q(x) := N( x; m_x, S_x ).   (15.69)

This latter definition is often referred to as the Laplace approximation in the functional neuroimaging literature (e.g., Friston et al., 2007a). This is unfortunate, because the term Laplace approximation is used in the machine learning and statistics literature for the approximation of an arbitrary probability density function with a Gaussian density and not for the definition of a variational distribution in terms of a Gaussian distribution. The definition of q(x) as in eq. (15.69) allows for reformulating the variational problem of maximizing the variational free energy as a standard problem in nonlinear optimization. Substitution of eqs. (15.68) and (15.69) on the right-hand side of eq. (15.61) yields the variational free energy function

F : ℝᵐ × ℝ^{m×m} → ℝ, (m_x, S_x) ↦ F(m_x, S_x) := ∫ N( x; m_x, S_x ) ln( ( N( y; f(x), λ_y⁻¹ I_n ) N( x; µ_x, Σ_x ) ) / N( x; m_x, S_x ) ) dx.   (15.70)

Note that (15.70) specifies the variational free energy explicitly as a multivariate real-valued function in the variational density parameters m_x and S_x. From a mathematical perspective, it is worth noting that the fixed-form reformulation of the variational free energy by no means results in a trivial problem. First, the value of the function F is defined by an integral term involving the nonlinear function f, which, as discussed below, can often only be approximated. This is important, because it calls into question the validity of the optimized free energy approximation to the log model evidence. However, as of now, the magnitude of the ensuing approximation error as a function of the degree of nonlinearity of f does not seem to have been systematically studied in the literature. Second, the function F is not a simple real-valued multivariate function in the sense that its arguments are not just arbitrary real vectors. Its second argument is a covariance parameter, which has a predefined structure, i.e., it has to be a positive-definite matrix. Fortunately, optimization of the function F with respect to this parameter can be achieved analytically, as discussed below. We next provide the functional form of the free energy function (15.70) and its derivation and then proceed to discuss its optimization with respect to the variational parameters S_x and m_x.


Approximate evaluation of the variational free energy

Using a multivariate first-order Taylor approximation of the nonlinear function f, the function defined in eq. (15.70) can be approximated as

F : ℝᵐ × ℝ^{m×m} → ℝ, (m_x, S_x) ↦

F(m_x, S_x) := −(n/2) ln 2π + (n/2) ln λ_y − (λ_y/2)( y − f(m_x) )ᵀ( y − f(m_x) ) − (λ_y/2) tr( J^f(m_x)ᵀ J^f(m_x) S_x )
−(m/2) ln 2π − (1/2) ln |Σ_x| − (1/2)( m_x − µ_x )ᵀ Σ_x⁻¹ ( m_x − µ_x ) − (1/2) tr( Σ_x⁻¹ S_x )
+(1/2) ln |S_x| + (m/2) ln(2πe),   (15.71)

where tr denotes the trace operator and J^f(m_x) denotes the Jacobian matrix of the function f evaluated at the variational expectation parameter. Note that, formally, different symbols for the function defined in (15.70) and its approximation provided in (15.71) would be appropriate.
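Assuming user-supplied routines for f and its Jacobian J^f, the approximation (15.71) can be evaluated directly as a function of the variational parameters; the following sketch is one possible implementation (the function name and argument conventions are assumptions, not part of the original text):

```python
import numpy as np

def approx_free_energy(y, m_x, S_x, mu_x, Sigma_x, lambda_y, f, jac_f):
    """Approximate variational free energy of eq. (15.71) for q(x) = N(m_x, S_x).

    f and jac_f are callables returning f(m_x) (shape (n,)) and the Jacobian
    J^f(m_x) (shape (n, m)); the remaining arguments follow the text's notation."""
    n, m = len(y), len(m_x)
    r = y - f(m_x)                        # prediction error y - f(m_x)
    J = jac_f(m_x)
    Sigma_inv = np.linalg.inv(Sigma_x)
    d = m_x - mu_x
    return (-0.5 * n * np.log(2 * np.pi) + 0.5 * n * np.log(lambda_y)
            - 0.5 * lambda_y * r @ r
            - 0.5 * lambda_y * np.trace(J.T @ J @ S_x)
            - 0.5 * m * np.log(2 * np.pi) - 0.5 * np.linalg.slogdet(Sigma_x)[1]
            - 0.5 * d @ Sigma_inv @ d - 0.5 * np.trace(Sigma_inv @ S_x)
            + 0.5 * np.linalg.slogdet(S_x)[1] + 0.5 * m * np.log(2 * np.pi * np.e))
```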

Derivation of (15.71)

Using the properties of the logarithm and the linearity of the integral, we first decompose the variational free energy integral as follows:

F(q(x)) = ∫ q(x) ln( p(y, x) / q(x) ) dx
= ∫ q(x) ( ln p(y, x) − ln q(x) ) dx
= ∫ q(x) ln p(y, x) dx − ∫ q(x) ln q(x) dx.   (15.72)

Of the remaining two integral terms, the latter corresponds to the negative differential entropy of a multivariate Gaussian distribution, which is well known to be a nonlinear function of the variational covariance parameter S_x:

∫ q(x) ln q(x) dx = ⟨ ln N( x; m_x, S_x ) ⟩_{N(x; m_x, S_x)} = −(1/2) ln |S_x| − (m/2) ln(2πe).   (15.73)

There thus remains the evaluation of the first integral term, which corresponds to the expectation of the log joint density of the observed and unobserved random variables under the variational density of the unobserved random variables. Substitution of p(y, x) results in

⟨ ln p(y, x) ⟩_{q(x)} = ⟨ ln( N( y; f(x), λ_y⁻¹ I_n ) N( x; µ_x, Σ_x ) ) ⟩_{N(x; m_x, S_x)}
= ⟨ ln N( y; f(x), λ_y⁻¹ I_n ) ⟩_{N(x; m_x, S_x)} + ⟨ ln N( x; µ_x, Σ_x ) ⟩_{N(x; m_x, S_x)}
= ⟨ ln( (2π)^{−n/2} |λ_y⁻¹ I_n|^{−1/2} exp( −(1/2)( y − f(x) )ᵀ( λ_y⁻¹ I_n )⁻¹( y − f(x) ) ) ) ⟩_{N(x; m_x, S_x)}
+ ⟨ ln( (2π)^{−m/2} |Σ_x|^{−1/2} exp( −(1/2)( x − µ_x )ᵀ Σ_x⁻¹ ( x − µ_x ) ) ) ⟩_{N(x; m_x, S_x)}
= ⟨ −(n/2) ln 2π + (n/2) ln λ_y − (λ_y/2)( y − f(x) )ᵀ( y − f(x) ) ⟩_{N(x; m_x, S_x)}
+ ⟨ −(m/2) ln 2π − (1/2) ln |Σ_x| − (1/2)( x − µ_x )ᵀ Σ_x⁻¹ ( x − µ_x ) ⟩_{N(x; m_x, S_x)}
= −(n/2) ln 2π + (n/2) ln λ_y − (λ_y/2) ⟨ ( y − f(x) )ᵀ( y − f(x) ) ⟩_{N(x; m_x, S_x)}
−(m/2) ln 2π − (1/2) ln |Σ_x| − (1/2) ⟨ ( x − µ_x )ᵀ Σ_x⁻¹ ( x − µ_x ) ⟩_{N(x; m_x, S_x)}.   (15.74)

There thus remain two integral terms. Of these, the latter can be evaluated using the Gaussian expectation theorem (Chapter 10). Specifically, we have

⟨ ( x − µ_x )ᵀ Σ_x⁻¹ ( x − µ_x ) ⟩_{N(x; m_x, S_x)} = ⟨ xᵀ Σ_x⁻¹ x − xᵀ Σ_x⁻¹ µ_x − µ_xᵀ Σ_x⁻¹ x + µ_xᵀ Σ_x⁻¹ µ_x ⟩_{N(x; m_x, S_x)}
= ⟨ xᵀ Σ_x⁻¹ x ⟩_{N(x; m_x, S_x)} − 2 µ_xᵀ Σ_x⁻¹ ⟨ x ⟩_{N(x; m_x, S_x)} + µ_xᵀ Σ_x⁻¹ µ_x
= tr( Σ_x⁻¹ S_x ) + m_xᵀ Σ_x⁻¹ m_x − 2 µ_xᵀ Σ_x⁻¹ m_x + µ_xᵀ Σ_x⁻¹ µ_x
= tr( Σ_x⁻¹ S_x ) + ( m_x − µ_x )ᵀ Σ_x⁻¹ ( m_x − µ_x ).   (15.75)


There thus remains the evaluation of the first integral term in (15.74). We first note that we can write this term as

⟨ ( y − f(x) )ᵀ( y − f(x) ) ⟩_{N(x; m_x, S_x)} = ⟨ yᵀy − yᵀ f(x) − f(x)ᵀ y + f(x)ᵀ f(x) ⟩_{N(x; m_x, S_x)}
= yᵀy − 2 yᵀ ⟨ f(x) ⟩_{N(x; m_x, S_x)} + ⟨ f(x)ᵀ f(x) ⟩_{N(x; m_x, S_x)}.   (15.76)

We are thus led to the evaluation of the expectation of a Gaussian random variable x under the nonlinear transformation f. In the functional neuroimaging literature, the function f is then approximated using a multivariate first-order Taylor expansion in order to evaluate the remaining expectations (cf. Friston et al., 2007a). Denoting the Jacobian matrix of f evaluated at the variational expectation parameter m_x by J^f(m_x), we thus have

f(x) ≈ f(m_x) + J^f(m_x)( x − m_x ).   (15.77)

By replacing f(x) in the first expectation of the right-hand side of eq. (15.76) with the approximation (15.77), we obtain

⟨ f(x) ⟩_{N(x; m_x, S_x)} ≈ ⟨ f(m_x) + J^f(m_x)( x − m_x ) ⟩_{N(x; m_x, S_x)}
= f(m_x) + J^f(m_x)( ⟨ x − m_x ⟩_{N(x; m_x, S_x)} )
= f(m_x) + J^f(m_x)( ⟨ x ⟩_{N(x; m_x, S_x)} − m_x )
= f(m_x) + J^f(m_x)( m_x − m_x )
= f(m_x).   (15.78)

Further, replacing f(x) in the second expectation of the right-hand side of (15.76) with the approximation (15.77), we obtain

⟨ f(x)ᵀ f(x) ⟩_{N(x; m_x, S_x)} ≈ ⟨ ( f(m_x) + J^f(m_x)( x − m_x ) )ᵀ( f(m_x) + J^f(m_x)( x − m_x ) ) ⟩_{N(x; m_x, S_x)}
= ⟨ f(m_x)ᵀ f(m_x) + 2 f(m_x)ᵀ J^f(m_x)( x − m_x ) + ( J^f(m_x)( x − m_x ) )ᵀ( J^f(m_x)( x − m_x ) ) ⟩_{N(x; m_x, S_x)}
= f(m_x)ᵀ f(m_x) + 2 f(m_x)ᵀ J^f(m_x) ⟨ x − m_x ⟩_{N(x; m_x, S_x)} + ⟨ ( J^f(m_x)( x − m_x ) )ᵀ( J^f(m_x)( x − m_x ) ) ⟩_{N(x; m_x, S_x)}.   (15.79)

Considering the first remaining expectation then yields

⟨ x − m_x ⟩_{N(x; m_x, S_x)} = ⟨ x ⟩_{N(x; m_x, S_x)} − m_x = m_x − m_x = 0.   (15.80)

To evaluate the second remaining expectation, we first rewrite it as

⟨ ( J^f(m_x)( x − m_x ) )ᵀ( J^f(m_x)( x − m_x ) ) ⟩_{N(x; m_x, S_x)} = ⟨ ( x − m_x )ᵀ J^f(m_x)ᵀ J^f(m_x)( x − m_x ) ⟩_{N(x; m_x, S_x)}   (15.81)

and note that ( x − m_x )ᵀ ∈ ℝ^{1×m}, J^f(m_x)ᵀ ∈ ℝ^{m×n}, J^f(m_x) ∈ ℝ^{n×m}, and ( x − m_x ) ∈ ℝ^{m×1}. Application of the Gaussian expectation theorem then yields

⟨ ( x − m_x )ᵀ J^f(m_x)ᵀ J^f(m_x)( x − m_x ) ⟩_{N(x; m_x, S_x)} = ( m_x − m_x )ᵀ J^f(m_x)ᵀ J^f(m_x)( m_x − m_x ) + tr( J^f(m_x)ᵀ J^f(m_x) S_x )
= tr( J^f(m_x)ᵀ J^f(m_x) S_x ).   (15.82)

We thus have

⟨ f(x)ᵀ f(x) ⟩_{N(x; m_x, S_x)} = f(m_x)ᵀ f(m_x) + tr( J^f(m_x)ᵀ J^f(m_x) S_x ).   (15.83)

In summary, we obtain the following approximation for the first expectation term on the right-hand side of (15.74):

⟨ ( y − f(x) )ᵀ( y − f(x) ) ⟩_{N(x; m_x, S_x)} = yᵀy − 2 yᵀ ⟨ f(x) ⟩_{N(x; m_x, S_x)} + ⟨ f(x)ᵀ f(x) ⟩_{N(x; m_x, S_x)}
≈ yᵀy − 2 yᵀ f(m_x) + f(m_x)ᵀ f(m_x) + tr( J^f(m_x)ᵀ J^f(m_x) S_x )
= ( y − f(m_x) )ᵀ( y − f(m_x) ) + tr( J^f(m_x)ᵀ J^f(m_x) S_x ).   (15.84)

Concatenating the results, we have thus obtained the following approximation of the expectation of the log joint density of observed and unobserved random variables under the variational density

⟨ ln p(y, x) ⟩_{q(x)} ≈ −(n/2) ln 2π + (n/2) ln λ_y − (λ_y/2)( ( y − f(m_x) )ᵀ( y − f(m_x) ) + tr( J^f(m_x)ᵀ J^f(m_x) S_x ) )
−(m/2) ln 2π − (1/2) ln |Σ_x| − (1/2)( tr( Σ_x⁻¹ S_x ) + ( m_x − µ_x )ᵀ Σ_x⁻¹ ( m_x − µ_x ) ).   (15.85)


Together with the previously evaluated entropy term, we have thus obtained an approximation for the variational free energy functional under the fixed-form assumption q(x) = N( x; m_x, S_x ) that can be written as a multivariate real-valued function in the variational density parameters m_x and S_x:

F : ℝᵐ × ℝ^{m×m} → ℝ, (m_x, S_x) ↦

F(m_x, S_x) := −(n/2) ln 2π + (n/2) ln λ_y − (λ_y/2)( y − f(m_x) )ᵀ( y − f(m_x) ) − (λ_y/2) tr( J^f(m_x)ᵀ J^f(m_x) S_x )
−(m/2) ln 2π − (1/2) ln |Σ_x| − (1/2)( m_x − µ_x )ᵀ Σ_x⁻¹ ( m_x − µ_x ) − (1/2) tr( Σ_x⁻¹ S_x )
+(1/2) ln |S_x| + (m/2) ln(2πe).   (15.86)

□

Maximization with respect to the variational variance parameter

Optimizing nonlinear multivariate real-valued functions such as (15.71) is the fundamental aim of nonlinear optimization (Chapter 4). Intuitively, many nonlinear optimization methods are based on a simple premise: from basic calculus we know that a necessary condition for an extremal point at a given location in the input space of a function is that the first derivative evaluates to zero at this point, i.e., the function is neither increasing nor decreasing. If one extends this idea to functions of multidimensional entities, one can show that one may maximize the function F with respect to its input argument S_x based on a simple formula. Omitting all terms of the function F that do not depend on S_x and which hence do not contribute to changes in the value of F as S_x changes, we can write the first derivative of F with respect to S_x suggestively as

∂/∂S_x F(m_x, S_x) = −(λ_y/2) J^f(m_x)ᵀ J^f(m_x) − (1/2) Σ_x⁻¹ + (1/2) S_x⁻¹.   (15.87)

Setting the derivative of F with respect to Sx to zero and solving for the extremal argument Sx then yields the following update rule for the variational covariance parameters

S_x = ( λ_y J^f(m_x)ᵀ J^f(m_x) + Σ_x⁻¹ )⁻¹.   (15.88)

Proof of (15.88)

We only provide a heuristic proof to demonstrate the general idea. A formal mathematical proof would also require the characterization of the function F as a concave function and a sensible notation for derivatives of functions of multivariate entities such as vectors and positive-definite matrices. Here, we use the ∂-notation for partial derivatives. We have

∂/∂S_x F(m_x, S_x) = −(λ_y/2) ∂/∂S_x tr( J^f(m_x)ᵀ J^f(m_x) S_x ) − (1/2) ∂/∂S_x tr( Σ_x⁻¹ S_x ) + (1/2) ∂/∂S_x ln |S_x|,   (15.89)

which, using the following rules for matrix derivatives involving the trace operator and logarithmic determinants (cf. equations (103) and (57) in Petersen et al. (2006))

∂/∂X tr( A Xᵀ ) = A  and  ∂/∂X ln |X| = ( Xᵀ )⁻¹   (15.90)

with S_x = S_xᵀ yields

∂/∂S_x F(m_x, S_x) = −(λ_y/2) J^f(m_x)ᵀ J^f(m_x) − (1/2) Σ_x⁻¹ + (1/2) S_x⁻¹.   (15.91)

Setting the above to zero then yields the equivalent relations

∂/∂S_x F(m_x, S_x) = 0
⇔ −(λ_y/2) J^f(m_x)ᵀ J^f(m_x) − (1/2) Σ_x⁻¹ + (1/2) S_x⁻¹ = 0   (15.92)
⇔ S_x = ( λ_y J^f(m_x)ᵀ J^f(m_x) + Σ_x⁻¹ )⁻¹.

□


Maximization with respect to the variational expectation parameter

In contrast to the variational covariance parameter, maximization of the variational free energy function with respect to m_x cannot be achieved analytically, but requires an iterative numerical optimization algorithm. In the functional neuroimaging literature, the algorithm employed to this end is fairly specific, but related to standard nonlinear optimization algorithms such as gradient and Newton descents (Chapter 4). To simplify the notational complexity of the discussion below, we first rewrite the variational free energy function of eq. (15.71) as a function of only the variational expectation parameter, assuming that it has been maximized with respect to the variational covariance parameter S_x previously, and remove any additive terms devoid of m_x. The function of interest then takes the form

F : ℝᵐ → ℝ, m_x ↦ F(m_x) := −(λ_y/2)( y − f(m_x) )ᵀ( y − f(m_x) ) − (λ_y/2) tr( J^f(m_x)ᵀ J^f(m_x) S_x ) − (1/2)( m_x − µ_x )ᵀ Σ_x⁻¹ ( m_x − µ_x ).   (15.93)

As discussed in Chapter 4, numerical optimization schemes usually work by guessing an initial value for the maximization argument of the nonlinear function under study and then iteratively updating this guess according to some update rule. A very basic gradient ascent scheme for the function specified in eq. (15.93) is provided as Algorithm 1. A prerequisite for the application of this algorithm is the availability of the gradient of the function F evaluated at m_x^{(k)} for k = 0, 1, 2, .... In the functional neuroimaging literature, it is suggested to approximate this gradient analytically by omitting higher derivatives of the function f with respect to m_x (Friston et al., 2007a). The function (15.93) comprises first derivatives of the function f with respect to m_x in the form of the Jacobian J^f(m_x) in the second term. If this term is omitted, the gradient of F evaluates to

∇F(m_x) = λ_y J^f(m_x)ᵀ( y − f(m_x) ) − Σ_x⁻¹( m_x − µ_x )   (15.94)

and the update rule for the variational expectation parameter takes the form

m_x^{(k+1)} = m_x^{(k)} + λ_y J^f(m_x^{(k)})ᵀ( y − f(m_x^{(k)}) ) − Σ_x⁻¹( m_x^{(k)} − µ_x )  for k = 0, 1, 2, ...   (15.95)

Proof of (15.94)

We have

∇F(m_x) = −(λ_y/2) ∂/∂m_x ( ( y − f(m_x) )ᵀ( y − f(m_x) ) ) − (λ_y/2) ∂/∂m_x tr( J^f(m_x)ᵀ J^f(m_x) S_x ) − (1/2) ∂/∂m_x ( ( m_x − µ_x )ᵀ Σ_x⁻¹ ( m_x − µ_x ) ).   (15.96)

Notably, the second term above involves second-order derivatives of the function f with respect to m_x. Following Friston et al. (2007a) we neglect these terms, and obtain, using the rules of calculus for multivariate real-valued functions (Petersen and Pedersen, 2012),

∇F(m_x) = −(λ_y/2)( −2 )( ∂/∂m_x f(m_x) )ᵀ( y − f(m_x) ) − (1/2)( 2 Σ_x⁻¹( m_x − µ_x ) )
= λ_y J^f(m_x)ᵀ( y − f(m_x) ) − Σ_x⁻¹( m_x − µ_x ).   (15.97)

□

Algorithm 1: A gradient ascent algorithm
Initialization
1. Define a starting point m_x^{(0)} ∈ ℝᵐ and set k := 0. If ∇F(m_x^{(0)}) = 0, stop! m_x^{(0)} is a zero of ∇F. If not, proceed to iterations.
Until convergence
1. Set m_x^{(k+1)} := m_x^{(k)} + κ ∇F(m_x^{(k)}).
2. If ∇F(m_x^{(k+1)}) = 0, stop! m_x^{(k+1)} is a zero of ∇F. If not, go to 3.
3. Set k := k + 1 and go to 1.
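A minimal sketch of such a scheme, interleaving the analytical covariance update (15.88) with gradient ascent steps based on the gradient (15.94), might look as follows; the function name, the fixed step size κ, and the use of a fixed number of iterations instead of a convergence criterion are illustrative assumptions:

```python
import numpy as np

def fixed_form_vb(y, mu_x, Sigma_x, lambda_y, f, jac_f, kappa=1e-2, n_iter=1000):
    """Sketch of fixed-form VB for the nonlinear Gaussian model: gradient ascent on m_x
    interleaved with the analytical covariance update of eq. (15.88)."""
    Sigma_inv = np.linalg.inv(Sigma_x)
    m_x = np.array(mu_x, dtype=float)     # initialize at the prior expectation
    for _ in range(n_iter):
        J = jac_f(m_x)
        S_x = np.linalg.inv(lambda_y * J.T @ J + Sigma_inv)              # eq. (15.88)
        grad = lambda_y * J.T @ (y - f(m_x)) - Sigma_inv @ (m_x - mu_x)  # eq. (15.94)
        m_x = m_x + kappa * grad          # gradient ascent step (Algorithm 1)
    return m_x, S_x
```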


From an optimization perspective, gradient ascent schemes for the maximization of nonlinear functions are suboptimal. Furthermore, as can be shown in simple univariate examples, the approximated gradient can easily fail to reliably identify the necessary condition for an extremal point. The gradient ascent scheme of Algorithm 1 is thus of little data-analytical relevance in functional neuroimaging. It is, however, of interest with respect to its speculative neurobiological implementation in the context of the free energy principle for perception, as discussed below. A more robust method is a globalized Newton scheme with Hessian modification and numerically evaluated gradients and Hessians, documented as Algorithm 2 (Ostwald and Starke, 2016). Intuitively, this algorithm works by approximating the target function F at the current iterate m_x^{(k)} by a second-order Taylor expansion and analytically determining an extremal point of this approximation at each iteration. The location of this extremal point corresponds to the search direction p_k. If, however, the Hessian H^F(m_x^{(k)}) is not positive-definite, there is no guarantee that p_k is an ascent direction. Especially in regions far away from a local extremal point, this can often be the case. Thus, a number of modification techniques have been developed that minimally change H^F(m_x^{(k)}), but render it positive-definite, such that an ascent direction is obtained. Finally, on each iteration the Newton step-size t_k is determined such as to yield an increase in the target function, but with a not too short step-size. This approach is referred to as backtracking and the conditions for sensible step-lengths are given by the necessary and sufficient Wolfe conditions. Notably, the algorithm documented as Algorithm 2 is a standard approach for the iterative optimization of a nonlinear function, and hence analytical results on its performance bounds are available. For a detailed discussion of this algorithm, see Nocedal and Wright (2006).

Algorithm 2: A globalized Newton method with Hessian modification
Initialization
1. Define a starting point m_x^{(0)} ∈ ℝᵐ and set k := 0. If ∇F(m_x^{(0)}) = 0, stop! m_x^{(0)} is a zero of ∇F. If not, proceed to iterations.
Until convergence
1. Evaluate the Newton search direction p_k := ( H^F(m_x^{(k)}) )⁻¹ ∇F(m_x^{(k)}).
2. If p_kᵀ ∇F(m_x^{(k)}) < 0, p_k is a descent direction. In this case, modify H^F(m_x^{(k)}) to render it positive definite.
3. Evaluate a step-size t_k fulfilling the sufficient Wolfe condition using the following algorithm: set t_k := 1 and select ρ ∈ ]0, 1[, c₁ ∈ ]0, 1[. Until F( m_x^{(k)} + t_k p_k ) ≥ F( m_x^{(k)} ) + c₁ t_k ∇F(m_x^{(k)})ᵀ p_k, set t_k := ρ t_k.
4. Set m_x^{(k+1)} := m_x^{(k)} + t_k ( H^F(m_x^{(k)}) )⁻¹ ∇F(m_x^{(k)}).
5. If ∇F(m_x^{(k+1)}) = 0, stop! m_x^{(k+1)} is a zero of ∇F. If not, go to 6.
6. Set k := k + 1 and go to 1.

Finally, the actual optimization scheme employed in much of the functional neuroimaging literature, because it has been implemented in SPM, is more specific. It derives from the local linearization method for nonlinear stochastic dynamical systems as suggested by Ozaki (1992) and is often formulated in differential equation form (cf. Friston et al., 2007a). We provide a standard iterative formulation of this approach as Algorithm 3. Note that the analytical and convergence properties of this algorithm are not very well understood and provide an interesting avenue for future research.

Algorithm 3: A local linearization-based gradient ascent algorithm

Initialization
1. Define a starting point m_x^{(0)} ∈ R^m and set k := 0. If ∇F(m_x^{(0)}) = 0, stop! m_x^{(0)} is a zero of ∇F. If not, proceed to the iterations.

Until convergence


1. Set m_x^{(k+1)} := m_x^{(k)} + (exp(τ H^F(m_x^{(k)})) − I)(H^F(m_x^{(k)}))^{−1} ∇F(m_x^{(k)}), where exp(·) denotes the matrix exponential and τ > 0 is a temporal regularization parameter.
2. If ∇F(m_x^{(k+1)}) = 0, stop! m_x^{(k+1)} is a zero of ∇F. If not, go to 3.
3. Set k := k + 1 and go to 1.
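A minimal Python sketch of the local linearization update is given below. It assumes user-supplied gradient and Hessian functions for F and uses the matrix exponential from scipy; it illustrates the update rule only, not the SPM implementation, and all names are hypothetical.

import numpy as np
from scipy.linalg import expm

def local_linearization_ascent(gradF, hessF, m0, tau=1e-4, tol=1e-3, max_iter=100):
    """Algorithm 3 sketch: local linearization-based update of the variational expectation parameter.

    gradF and hessF return the gradient and Hessian of the variational free energy F;
    tau is the temporal regularization parameter of the local linearization scheme.
    """
    m = np.asarray(m0, dtype=float)
    for k in range(max_iter):
        g = gradF(m)
        if np.linalg.norm(g) < tol:
            break
        H = hessF(m)
        # m^(k+1) := m^(k) + (expm(tau * H) - I) H^{-1} grad F(m^(k))
        m = m + (expm(tau * H) - np.eye(m.size)) @ np.linalg.solve(H, g)
    return m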

The free energy principle for perception

The free energy principle is a body of work that attempts to cast biological self-organization in terms of random dynamical systems and that emerged under this label over the last two decades. With respect to cognition, the free energy principle aims to provide computational, algorithmic, and neurobiological-implementational descriptions of both sensation and action. A central tenet of the free energy principle is that cognitive agents engage in variational inference. In the following, we shall briefly discuss a subset of ideas that emerged in the application of variational inference as an algorithmic depiction of perception (cf. Friston, 2005; Bogacz, 2017). The free energy principle for perception postulates that free energy maximization in hierarchical models can be neurobiologically implemented using predictive processing. In general, predictive processing theories of perception assert that neurobiological systems encode models of the world and reconcile their model-based predictions with incoming sensory information to arrive at percepts. In hierarchical frameworks, the idea is that low-level areas represent prediction errors between the top-down conveyed predictions and the actual sensory representation. From the perspective of the free energy principle, we can integrate this intuitive idea and the formal idea of fixed-form mean-field variational inference for the nonlinear Gaussian model (15.62) as follows (Figure 15.4): in the world, there exists a true, but unknown, state of the variable x, which we denote by x∗ and refer to as a cause. Based on a nonlinear transformation implemented by the function f, this true, but unknown, cause evokes an observation y in the sensorium of a cognitive agent. Under the free energy principle, the cognitive agent encodes a probabilistic representation p(y, x) of these environmental actualities: high-level cortices represent the prior distribution p(x), whereas low-level early sensory cortices represent the conditional distribution p(y|x). The computational problem of the neural architecture can then be formulated as the problem of inferring on the true, but unknown, cause x∗ by means of the posterior distribution p(x|y). The free energy principle proposes that this inversion of the generative model p(y, x) is performed using fixed-form mean-field variational inference and provides a speculative account of this inversion in terms of neural representations. Note that in the current setting the cause is assumed to be static, i.e., it does not change over time. This may be viewed from two perspectives: either the model is assumed to represent the perception of a static environmental cause (for example, because x∗ represents the static image of an object which is partially occluded by another object, where the nonlinear occlusion transformation is implemented in the function f), or the model is assumed to represent the perception of a dynamic environmental cause at a single point in time. With respect to the speculative neural implementation of the free energy maximization scheme that ensues by applying variational inference to the model formulated in (15.62), the update of the variational covariance parameter derived above plays only a minor role. Instead, the focus of the neurobiological interpretation of the numerical scheme is on the iterative updates of the variational expectation parameter m_x. To reflect this, we will consider the following simplified version of the variational free energy function of eq. (15.71):

F : R^m → R, m_x^{(k)} ↦ F(m_x^{(k)}) := −(λ_y/2) (y − f(m_x^{(k)}))^T (y − f(m_x^{(k)})) − (1/2) (m_x^{(k)} − µ_x)^T Σ_x^{−1} (m_x^{(k)} − µ_x)   (15.98)

In this free energy function, the second term denotes the weighted deviation between the prior expectation parameter µ_x and the kth estimate of the variational expectation parameter, while the first term denotes the weighted deviation between the observed data y and the data prediction of the generative model based on the kth estimate of the variational expectation parameter. Under the free energy principle for perception, these terms are referred to as prediction error terms. We now define the prediction errors

ε_y^{(k)} := λ_y^{1/2} (y − f(m_x^{(k)}))  and  ε_x^{(k)} := Σ_x^{−1/2} (m_x^{(k)} − µ_x).   (15.99)


Figure 15.4. Free energy principle interpretation of fixed-form mean-field variational inference for a nonlinear Gaussian model. Note that we assume that there exists a fixed true, but unknown, cause x∗ for the cognitive agent's sensory representation y. The perceptual act is conceived as inference, i.e., the formation of the posterior distribution over causes in a generative model encoded by the agent. This generative model is assumed to mirror the nonlinear transformation from external cause to sensory representation in the form of its likelihood p(y|x) and to be equipped with prior assumptions about possible external causes in the form of its marginal distribution p(x). Because the function f is assumed to be complicated, variational inference of the posterior distribution is invoked.

With the definitions in (15.99), the simplified variational free energy function (15.98) can be rewritten as

F : R^m → R, m_x^{(k)} ↦ F(m_x^{(k)}) = −(1/2) ε_y^{(k)T} ε_y^{(k)} − (1/2) ε_x^{(k)T} ε_x^{(k)}.   (15.100)

In this formulation, it is readily seen that maximization of the variational free energy for a model of the form (15.62) corresponds to the minimization of the prediction error terms ε_y^{(k)} and ε_x^{(k)}, because these enter (15.100) squared and with negative sign. Moreover, in terms of the probabilistic model, the second term in (15.100) refers to the relation of the current variational expectation parameter estimate m_x^{(k)} to a parameter of the prior distribution p(x). From a Bayesian perspective, one may regard the prior distribution as "hierarchically higher" than the likelihood distribution, because, in ancestral sampling schemes generating data, one may first draw a value x from the prior distribution and then, based on this value, draw a realization y of the data from the distribution p(y|x). This suggests allocating the second term in (15.100) to a higher cortical area than the first term. In the current two-level model, the first term in (15.100) involves the data y, which, in neural terms, is most readily conceived of as being represented in a primary sensory area, such as the primary visual cortex. Based on a review of the anatomical and physiological basis of cortical systems, the free energy principle for perception then suggests a mapping between neural populations in cortical areas and the terms involved in eqs. (15.98), (15.99), and (15.100), as depicted in Figure 15.5.
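The equivalence of (15.98) and (15.100) can be checked numerically. The following Python sketch uses arbitrary, hypothetical toy quantities (not taken from the text) and assumes the prediction error definitions of (15.99); it is an illustration only.

import numpy as np

# hypothetical toy quantities for illustration
rng = np.random.default_rng(0)
y = rng.normal(size=4)                     # observed data
f = lambda m: np.tanh(m).repeat(2)         # some nonlinear prediction, R^2 -> R^4
m_x = np.array([0.3, -0.1])                # current variational expectation estimate
mu_x = np.zeros(2)                         # prior expectation parameter
Sigma_x = 0.5 * np.eye(2)                  # prior covariance (diagonal for simplicity)
lam_y = 2.0                                # noise precision

# simplified free energy, eq. (15.98)
dy = y - f(m_x)
dx = m_x - mu_x
F_98 = -0.5 * lam_y * dy @ dy - 0.5 * dx @ np.linalg.solve(Sigma_x, dx)

# prediction errors, eq. (15.99), and the rewritten form, eq. (15.100)
eps_y = np.sqrt(lam_y) * dy
eps_x = np.diag(1.0 / np.sqrt(np.diag(Sigma_x))) @ dx
F_100 = -0.5 * eps_y @ eps_y - 0.5 * eps_x @ eps_x

print(np.isclose(F_98, F_100))             # True: (15.98) and (15.100) agree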

A toy example

To illustrate the foregoing discussion, we consider the model

x = µ_x + η,   (15.101)
y = f(x) + ε,   (15.102)

where x, µ_x, η ∈ R^2, y, ε ∈ R^{d²}, ε and η are distributed with densities p(ε) = N(ε; 0, λ_y^{−1} I_{d²}) and p(η) = N(η; 0, λ_x^{−1} I_2) with λ_y, λ_x > 0, respectively, and f : R^2 → R^{d²} is the multivariate vector-valued function

f : R^2 → R^{d²}, x ↦ f(x) := ( exp(−(1/(2σ)) (c_i − x)^T (c_i − x)) )_{1≤i≤d²}.   (15.103)

In (15.103), c_i ∈ C for i = 1, . . . , d² denotes a two-dimensional coordinate vector formed by the Cartesian product of an equipartition of an interval [s_min, s_max] ⊂ R with d ∈ N support points with itself:

C := {s_1, s_2, . . . , s_d} × {s_1, s_2, . . . , s_d} ⊂ R^2.   (15.104)

The intuition of this model system is as follows: the true, but unknown, state of the world is a coordinate vector x∗ that specifies the location of an object (e.g., a mouse) in the plane.


[Figure 15.5 schematic: three columns labelled Low Level, Intermediate Level, and High Level. The low level comprises a data representation unit (y) and a prediction error unit (ε_y^{(k)}, cf. (15.99)); the intermediate level comprises a cause representation unit (m_x^{(k)}) and a prediction error unit (ε_x^{(k)}, cf. (15.99)); the high level comprises a cause prior representation unit (µ_x). Predictions f(m_x^{(k)}) and µ_x are passed downwards, and gradient components involving ε_y^{(k)} and ε_x^{(k)} are passed upwards.]

Figure 15.5. Predictive processing as a speculative neural implementation of the free energy principle for perception. The figure depicts the speculative neurobiological implementation of the gradient ascent scheme described in the main text. External information is projected onto data-representing units residing in deep layers (V) of early sensory (low-level) cortices. Neuronal units lying in more superficial layers (III) of low-level cortices are supposed to encode the prediction error signal resulting from the difference between their sensory input from lower layers at the same level and the data predictions formed by the projection of deep intermediate-level units. In turn, the superficial low-level units are assumed to project the gradient component that allows for adjusting the prediction at the deep intermediate-level units. These units, in correspondence to the deep data-representing units in low-level cortices, are assumed to encode the predicted causes of sensory information. Superficial units at this intermediate level are in turn thought to represent the prediction error between higher-order causes, specifically the prior expectations projected down from deep high-level representation units, and the cause representations obtained from the deep units at their own level. The superficial units of the intermediate level in turn are speculated to project a gradient component to the deep units at their level, adjusting the prediction of the intermediate level projected to the lowest level. Note that in the current scheme, the prior predictions at the highest level are fixed.

The cognitive agent (e.g., a cat) does not observe the location of the object directly, but only indirectly, upon transformation by the function f and under the addition of measurement noise. The function f can be thought of as a blanket that covers the object: the object itself is hidden and conveys information about its location x∗ only by means of a bump in the blanket. The task of the cognitive agent is to estimate the unobserved location of the object based on the sensory availability of this bump.

To model the task of the agent, the fixed-form variational inference approach was employed to estimate the variational parameter m_x ∈ R^2 using the simplified variational free energy (15.98). Figure 15.6A depicts the true, but unknown, location of the object x∗ = (1, 1)^T, the prior parameter µ_x = (0, 0)^T corresponding to the initialization of the variational parameter m_x^{(0)}, and an isocontour of the prior distribution over x specified by m_x^{(0)} and λ_x = 70. Figure 15.6B depicts a data realization where s_min = −2.5, s_max = 2.5, d = 30, σ = 0.4, and λ_y = 100. Figure 15.6C and Figure 15.6E depict the negative variational free energy landscape as a function of m_x ∈ R^2. Note that, in accordance with the nonlinear optimization literature, we here consider minimization of the negative variational free energy function rather than maximization of the (positive) variational free energy function. The landscape has a minimum in the region of the location of the true, but unknown, value x∗, and a slight depression around the location of the prior parameter µ_x. Figure 15.6C and Figure 15.6D depict the application of the globalized Newton approach with Hessian modification and the ensuing squared and precision-weighted prediction errors at the low (blue) and high (red) level.
During the first few iterations, the iterands m_x^{(k)} partially leave the depicted free energy surface and alternate between two locations while making little progress towards the minimum. After approximately ten iterations, the iterands enter the minimum's well and quickly converge to the minimum. With a gradient norm convergence criterion of 10^{−3}, convergence is reached after 30 iterations. Figure 15.6E and Figure 15.6F depict the analogous results for the local linearization approach. Here, based on a temporal regularization parameter of τ = 10^{−4}, a local search is performed, which leads the iterands directly into the minimum's well, but convergence with the same gradient norm criterion requires a few more iterations. For both methods, the prediction errors at the low level decrease, while those at the high level increase minimally, owing to the deviation of the variational expectation parameter at convergence from the prior parameter.
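A minimal Python sketch of the toy model (15.101) to (15.104) and of the associated negative simplified free energy is given below. It uses the parameter values quoted above (s_min = −2.5, s_max = 2.5, d = 30, σ = 0.4, λ_y = 100, λ_x = 70), assumes the exponent scaling 1/(2σ) of eq. (15.103), and is an illustration rather than the pmfn 14.m implementation; all names are chosen for illustration.

import numpy as np

# toy example setup, cf. eqs. (15.101)-(15.104) and the parameter values quoted in the text
s_min, s_max, d = -2.5, 2.5, 30
sigma, lam_y, lam_x = 0.4, 100.0, 70.0
s = np.linspace(s_min, s_max, d)                    # equipartition of [s_min, s_max]
C = np.array([(si, sj) for si in s for sj in s])    # Cartesian product, d^2 coordinate vectors

def f(x):
    """Bump function f: R^2 -> R^{d^2}, cf. eq. (15.103)."""
    diff = C - x
    return np.exp(-np.sum(diff ** 2, axis=1) / (2.0 * sigma))

rng = np.random.default_rng(1)
x_true = np.array([1.0, 1.0])                       # true, but unknown, cause x*
mu_x = np.zeros(2)                                  # prior expectation parameter
y = f(x_true) + rng.normal(scale=1.0 / np.sqrt(lam_y), size=d ** 2)   # data realization, eq. (15.102)

def neg_free_energy(m):
    """Negative simplified variational free energy, cf. eq. (15.98) with Sigma_x = lam_x^{-1} I."""
    dy = y - f(m)
    dx = m - mu_x
    return 0.5 * lam_y * dy @ dy + 0.5 * lam_x * dx @ dx

The resulting function neg_free_energy could, for example, be handed to the modified Newton sketch given earlier, with finite-difference gradients and Hessians, to reproduce the qualitative behaviour described above.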


Figure 15.6. Example application of the fixed-form variational inference approach. For a detailed description, please refer to the main text (pmfn 14.m).

Naturally, the nonlinear optimization scheme that lies at the heart of the fixed-form mean-field variational inference method discussed here suffers from the usual problems of (non-global) nonlinear optimization approaches. Figure 15.7 depicts a number of scenarios in which both the globalized Newton method and the local linearization method have difficulties identifying the correct minimum of the negative variational free energy. Of particular importance for the successful application of these nonlinear optimization methods in the current scenario are the absolute values of, and the ratio between, the precision parameters λ_x and λ_y. Figure 15.7A depicts a scenario with low prior precision (the Gaussian isocontour is so large that it falls outside the space depicted in the leftmost panel) and low noise precision. In this case, the data (middle panel) convey little information about the true, but unknown, location of the object, and the free energy surface has a wavy profile with no clear structure. As the algorithm is initialized in a local minimum corresponding to the prior expectation, numerical nonlinear optimization approaches are likely to fail. Figure 15.7B depicts a scenario in which the ratio between λ_x and λ_y renders the negative variational free energy surface virtually flat at the location of the prior expectation. As a consequence, gradient descent schemes are initialized with steps close to zero and usually fail to make sufficient progress towards the local minimum. Finally, Figure 15.7C depicts a scenario in which the prior distribution has relatively high precision, but the negative variational free energy function still has a global minimum at the data-supported location of the true, but unknown, coordinates. In this case, the negative free energy surface has a lot of structure, but an algorithm that is initialized in the local minimum of the prior expectation will usually remain there and fail to identify the global minimum.

15.4 Bibliographical remarks

Introductions to variational inference can be found in all standard machine learning textbooks, such as Bishop (2006), Barber (2012), and Murphy (2012). More recent reviews include Blei et al. (2016) and, in the context of the resurgence of neural network approaches, Doersch (2016). The widespread use of variational inference in functional neuroimaging based on its SPM implementation originates from Friston et al. (2007a). Finally, recent technical reviews of the free energy principle include Bogacz (2017) and Buckley et al. (2017).

15.5 Study questions

1. For a probabilistic model p(y, ϑ), write down the log marginal likelihood decomposition that forms the core of the variational inference approach and discuss its components.


Figure 15.7. Example scenarios in the application of fixed-form variational inference to a nonlinear Gaussian model. For a detailed description, please refer to the main text (pmfn 14.m).

2. Write down the definition of the Kullback-Leibler divergence. What does the Kullback-Leibler divergence measure?

3. Write down the definition of the variational free energy and explain its importance in variational inference.

4. Write down the free-form mean-field variational inference theorem and explain its relevance.

5. What are the commonalities and differences between free-form mean-field and fixed-form mean-field variational inference?

6. What is the central postulate of the free energy principle for perception?

Study questions answers

1. For a probabilistic model p(y, ϑ) with observable random variables y and unobservable random variables ϑ, the log marginal likelihood decomposition

ln p(y) = F(q(ϑ)) + KL(q(ϑ)||p(ϑ|y))   (15.105)

forms the core of the variational inference approach. Here, ln p(y) denotes the logarithm of the marginal likelihood p(y) = ∫ p(y, ϑ) dϑ, F(q(ϑ)) denotes the variational free energy evaluated for the variational distribution q(ϑ) that serves as an approximation of the posterior distribution p(ϑ|y), and KL(q(ϑ)||p(ϑ|y)) denotes the KL divergence between the variational distribution q(ϑ) and the true posterior distribution p(ϑ|y).


2. The Kullback-Leibler divergence of two probability distributions specified in terms of the probability density functions q(x) and p(x) is defined as

KL(q(x)||p(x)) = ∫ q(x) ln ( q(x) / p(x) ) dx.   (15.106)

The Kullback-Leibler divergence measures the dissimilarity of probability distributions.

3. For a probabilistic model p(y, ϑ) comprising observable random variables y and unobservable random variables ϑ and a variational distribution q(ϑ) that serves as an approximation of the posterior density p(ϑ|y), the variational free energy is defined as

F(q(ϑ)) := ∫ q(ϑ) ln ( p(y, ϑ) / q(ϑ) ) dϑ.   (15.107)

Because the variational free energy is a lower bound to the log marginal likelihood ln p(y), maximizing the variational free energy with respect to the variational distribution renders the variational free energy an approximation to the log marginal likelihood and, given the non-negativity of the KL divergence, renders the variational distribution an approximation to the true posterior distribution p(ϑ|y).

4. The free-form mean-field variational inference theorem states that the variational distribution that maximizes the variational free energy with respect to the sth partition of the mean-field representation of q(ϑ) can be determined according to

q(ϑ_s) ∝ exp ( ∫ q(ϑ_{\s}) ln p(y, ϑ) dϑ_{\s} ),   (15.108)

where q(ϑ_{\s}) denotes the variational density over all unobservable random variables not in the sth partition. Partition-wise application of the free-form mean-field variational inference theorem allows for developing coordinate-wise free-form variational free energy maximization algorithms.

5. Both free- and fixed-form mean-field variational inference rest on maximizing the variational free energy with respect to a factorized variational distribution q(ϑ) = ∏_{i=1}^{k} q(ϑ_i) over unobserved random variables ϑ that serves as an approximation to the true posterior distribution p(ϑ|y) of a probabilistic model p(y, ϑ). For both free- and fixed-form mean-field variational inference, the factorization of the variational distribution is referred to as a mean-field approximation. Free-form mean-field variational inference uses a central result from variational calculus to determine the functional form of the variational distributions and their parameters. In contrast, fixed-form mean-field variational inference pre-defines the functional form of the variational distributions, evaluates the variational free energy integral to render it a multivariate real-valued function of the variational distribution parameters, and maximizes the resulting function using standard techniques of nonlinear optimization.

6. The central postulate of the free energy principle for perception is that perception corresponds to variational inference.
