Laplace’s rule of succession in Information Geometry

Yann Ollivier

CNRS & Paris-Saclay University, France

Second International Conference on Geometric Science of Information (GSI 2015), École polytechnique, October 29, 2015


Sequential prediction

Sequential prediction problem: given observations x_1, ..., x_t, build a probabilistic model p^{t+1} for x_{t+1}, iteratively.

Example: given that w women and m men entered this room, what is the probability that the next person who enters is a woman/man?

Common performance criterion for prediction: cumulated log-loss

L_T := -\sum_{t=0}^{T-1} \log p^{t+1}(x_{t+1} \mid x_{1\dots t})

to be minimized. This corresponds to compression cost, and is also equal to square loss for Gaussian models.
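To make the criterion concrete, here is a minimal Python sketch (not from the talk) that accumulates the log-loss of a sequential predictor on a binary sequence; the predictor interface is an assumption made for this illustration.

import math

def cumulated_log_loss(xs, predictor):
    # L_T = -sum_t log p^{t+1}(x_{t+1} | x_{1..t}) for a binary sequence xs.
    # `predictor(past)` must return the predicted probability that the next
    # symbol equals 1, given the list `past` of previous symbols (0/1).
    loss = 0.0
    for t in range(len(xs)):
        p1 = predictor(xs[:t])                  # prediction made before seeing xs[t]
        p = p1 if xs[t] == 1 else 1.0 - p1
        loss -= math.log(p)
    return loss

# A constant predictor p(1) = 0.5 pays log 2 per symbol:
print(cumulated_log_loss([1, 0, 1, 1, 0, 1], lambda past: 0.5))   # 6 * log 2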


Maximum likelihood estimator

Maximum likelihood strategy: Fix a parametric model p_θ(x). At each time, use the best parameter based on past observations:

\theta_t^{ML} := \arg\max_\theta \prod_{s \leq t} p_\theta(x_s) = \arg\min_\theta \left\{ -\sum_{s \leq t} \log p_\theta(x_s) \right\}

Then, predict x_{t+1} using this \theta_t^{ML}:

p^{ML}(x_{t+1} \mid x_{1\dots t}) := p_{\theta_t^{ML}}(x_{t+1})

This is the maximum likelihood or “plug-in” estimator.

Heavily used in machine learning. Argmax often computed via gradient descent.
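A minimal sketch of the plug-in strategy for a Bernoulli model, where the argmax is available in closed form so no gradient descent is needed; the function name and interface are mine, chosen for this example.

def ml_plugin_predictor(past):
    # Fit theta_t^ML = (#ones) / t on the past, then plug it in to predict x_{t+1}.
    t = len(past)
    if t == 0:
        raise ValueError("the plug-in predictor is undefined before the first observation")
    theta_ml = sum(past) / t
    return theta_ml            # predicted probability that the next symbol is 1

# After seeing w = 3 women (coded 0) and m = 1 man (coded 1):
print(ml_plugin_predictor([0, 0, 0, 1]))   # 0.25 = m / (w + m)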



Problems with the ML estimator

Example: Given that there are w women and m men in this room, the next person to enter will be a man with probability m/(w + m) according to the ML estimator.

=⇒ The ML estimator is not satisfying:

▶ How do you predict the first observation?

▶ Zero-frequency problem: If you have seen only women so far, the probability to see a man is estimated to 0.

▶ Often overfits in machine learning.



Laplace’s rule of succession

Laplace suggested a quick fix for these problems: add one to the counts of each possibility. That is, predict according to

p(\text{woman}) = \frac{w + 1}{w + m + 2}, \qquad p(\text{man}) = \frac{m + 1}{w + m + 2}

instead of w/(w+m) and m/(w+m). This is Laplace’s rule of succession.

▶ Solves the zero-frequency problem: After having seen t women and no men, the probability to see a man is estimated to 1/(t + 2).

▶ Generalizes to other discrete data (“additive smoothing”).

▶ May seem arbitrary but has a beautiful Bayesian interpretation.
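A sketch of the rule and its “additive smoothing” generalization to discrete data (names and interface are mine); a pseudocount of 1 gives Laplace’s rule, other values give other smoothers.

from collections import Counter

def additive_smoothing_predictor(past, alphabet, pseudocount=1.0):
    # Add `pseudocount` to the count of every symbol before normalizing.
    # With pseudocount = 1 this is exactly Laplace's rule of succession.
    counts = Counter(past)
    total = len(past) + pseudocount * len(alphabet)
    return {a: (counts[a] + pseudocount) / total for a in alphabet}

# w = 3 women and m = 0 men seen so far:
print(additive_smoothing_predictor(["W"] * 3, ["W", "M"]))
# {'W': 0.8, 'M': 0.2}  -- no zero-frequency problem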


Bayesian predictors

Bayesian predictors start with a parametric model p_θ(x) together with a prior α(θ) on θ. At time t, the next symbol x_{t+1} is predicted by mixing all possible models p_θ with all values of θ,

p^{\alpha\text{-Bayes}}(x_{t+1} \mid x_{1\dots t}) = \int_\theta p_\theta(x_{t+1})\, q_t(\theta)\, d\theta

where q_t(θ) is the Bayesian posterior on θ given data x_{1...t},

q_t(\theta) \propto \alpha(\theta) \prod_{s \leq t} p_\theta(x_s)


Proposition (folklore). For Bernoulli distributions on a binary variable, e.g., {woman, man}, Laplace’s rule coincides with the Bayesian predictor with a uniform prior on the Bernoulli parameter θ ∈ [0; 1].
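A quick numerical check of the proposition in the Bernoulli case (my own sketch, using scipy for the integral over θ): mixing over a uniform prior reproduces (m + 1)/(w + m + 2).

from scipy.integrate import quad

def bayes_predictor_uniform(w, m):
    # p(next person is a man | data), Bernoulli parameter theta = P(man),
    # uniform prior on theta in [0, 1], computed by numerical integration.
    likelihood = lambda th: th**m * (1 - th)**w
    numerator, _ = quad(lambda th: th * likelihood(th), 0.0, 1.0)
    evidence, _ = quad(likelihood, 0.0, 1.0)
    return numerator / evidence

w, m = 5, 2
print(bayes_predictor_uniform(w, m))    # 0.3333...
print((m + 1) / (w + m + 2))            # Laplace's rule: 3/9 = 0.3333...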


Bayesian predictors (2)

Bayesian predictors

▶ solve the zero-frequency problem

▶ have theoretical guarantees

▶ do not overfit

but

▶ are difficult to compute: one must perform an integral over θ, and keep track of the Bayesian posterior, which is an arbitrary function of θ.

Is there a simple way to approximate Bayesian predictors that would generalize Laplace’s rule?



Theorem. Let (p_θ) be an exponential family of probability distributions.

Under suitable regularity conditions, these two predictors coincide at first order in 1/t:

▶ The Bayesian predictor using the non-informative Jeffreys prior α(θ) ∝ √(det I(θ)), with I(θ) the Fisher information matrix.

▶ The average

\frac{1}{2}\, p^{ML} + \frac{1}{2}\, p^{SNML}

of the maximum likelihood predictor and the “sequential normalized maximum likelihood” predictor [Shtarkov 1987, Roos, Rissanen...].

“The predictors p and p′ coincide at first order” means that

p'(x_{t+1} \mid x_{1\dots t}) = p(x_{t+1} \mid x_{1\dots t}) \left(1 + O(1/t^2)\right)

for any sequence (x_t), assuming both are non-zero.


The sequential normalized maximum likelihood predictor

The SNML predictor p^{SNML} is defined as follows. For each possible value y of x_{t+1}, let

\theta^{ML+y} := \arg\max_\theta \left\{ \log p_\theta(y) + \sum_{s \leq t} \log p_\theta(x_s) \right\}

be the value of the ML estimator if this value of x_{t+1} had already been observed.

Define

q(y) := p_{\theta^{ML+y}}(y)

Usually q is not a probability distribution: ∫_y q(y) > 1.

=⇒ Rescale q:

p^{SNML} := \frac{q}{\int q}

and use this for prediction of x_{t+1}.
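A sketch of the SNML construction for the Bernoulli case, where θ^{ML+y} has a closed form; after normalization it is exactly Laplace’s rule, as stated just below. The function name is mine.

def snml_bernoulli(past):
    # SNML predictor for a Bernoulli model on {0, 1}:
    # q(y) = p_{theta^{ML+y}}(y), then rescale q so that it sums to 1.
    t, k = len(past), sum(past)
    q = {}
    for y in (0, 1):
        theta = (k + y) / (t + 1)        # ML estimate if y had already been observed
        q[y] = theta if y == 1 else 1 - theta
    z = q[0] + q[1]                      # > 1 in general
    return {y: q[y] / z for y in q}

# 3 ones out of 5 observations:
print(snml_bernoulli([1, 1, 1, 0, 0]))   # {0: 3/7, 1: 4/7}, i.e. Laplace's rule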


Example: For a Bernoulli distribution, the SNML predictor is the same as Laplace’s rule.

The theorem states that the Bayesian predictor with canonical Jeffreys prior is approximately the average of the ML and SNML predictors.

=⇒ For Bernoulli distributions, we recover the “add-one-half” rule for the Jeffreys prior (Krichevsky–Trofimov estimator).

▶ Relatively easy to compute.

▶ Different estimators usually differ at first order in 1/t (e.g., the ML estimator or Bayesian estimators with different priors). The theorem is precise at first order in 1/t, so it recovers these differences.

▶ Multiplicative error (1 + O(1/t^2)) in the theorem yields at most a bounded difference on cumulated log-loss.
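A small numerical check of the theorem in the Bernoulli case (my own sketch, using the closed forms above): the gap between the Krichevsky–Trofimov predictor and the ML/SNML average decays like 1/t², i.e., gap·t² stays bounded.

def kt_predictor(t, k):
    # Krichevsky-Trofimov ("add-one-half") = Bayes with the Jeffreys prior.
    return (k + 0.5) / (t + 1)

def ml_snml_average(t, k):
    p_ml = k / t
    p_snml = (k + 1) / (t + 2)           # SNML = Laplace's rule for Bernoulli
    return 0.5 * (p_ml + p_snml)

for t in (10, 100, 1000):
    k = t // 3                           # keep the empirical frequency roughly fixed
    gap = abs(kt_predictor(t, k) - ml_snml_average(t, k))
    print(t, gap, gap * t**2)            # gap * t^2 stays bounded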


Corollary: From ML to Bayes

For exponential families, there is an explicit approximate formula to compute the Bayesian predictor with Jeffreys prior if the ML predictor is known:

p^{Jeffreys}(x_{t+1}) \approx p^{ML}(x_{t+1}) \left(1 + \frac{1}{2t}\, \|\partial_\theta \log p_\theta(x_{t+1})\|^2_{\text{Fisher}} - \frac{\dim(\theta)}{2t}\right)

up to O(1/t^2).

Here \|\partial_\theta \log p_\theta(x_{t+1})\|_{\text{Fisher}} is the norm of the gradient of log p_θ(x_{t+1}) in the Riemannian metric given by the Fisher information matrix. (Compare “flattened ML” [Kotłowski–Grünwald–de Rooij 2010].)

Note: valid only when these probabilities are non-zero, so this does not solve the zero-frequency problem.
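A sketch checking the corollary for the Bernoulli family (my computation, not from the slides), using I(θ) = 1/(θ(1−θ)) and dim(θ) = 1: the formula applied to the ML predictor matches the exact Jeffreys (Krichevsky–Trofimov) predictor up to O(1/t²).

def jeffreys_approx_from_ml(t, k, x_next):
    # Corollary formula for a Bernoulli model (valid when 0 < k < t).
    theta = k / t                                    # theta_t^ML
    p_ml = theta if x_next == 1 else 1 - theta
    score = 1 / theta if x_next == 1 else -1 / (1 - theta)   # d/dtheta log p_theta(x_next)
    fisher_norm_sq = score**2 * theta * (1 - theta)  # squared Fisher norm of the score
    dim_theta = 1
    return p_ml * (1 + fisher_norm_sq / (2 * t) - dim_theta / (2 * t))

for t in (10, 100, 1000):
    k = t // 3
    exact = (k + 0.5) / (t + 1)                      # exact Jeffreys (KT) predictor
    approx = jeffreys_approx_from_ml(t, k, x_next=1)
    print(t, abs(exact - approx) * t**2)             # stays bounded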

From ML to Bayes (2)

Proof of the corollary (idea):

\theta^{ML+x_{t+1}} \approx \theta^{ML} + \frac{1}{t}\, \tilde\nabla_\theta \log p_\theta(x_{t+1})

where \tilde\nabla_\theta = I(\theta)^{-1} \frac{\partial}{\partial\theta} is Amari’s natural gradient given by the Fisher matrix.


“When adding a data point, the ML estimator moves by 1/t times the natural gradient of the new point’s log-likelihood.”
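For the Bernoulli family this natural-gradient step can be checked directly (my sketch): with I(θ)⁻¹ = θ(1−θ), the step lands within O(1/t²) of the exact recomputed ML estimator (k + y)/(t + 1).

def natural_gradient_step(t, k, y):
    # One natural-gradient step of size 1/t for a Bernoulli model.
    theta = k / t
    score = y / theta - (1 - y) / (1 - theta)        # d/dtheta log p_theta(y)
    return theta + (theta * (1 - theta) / t) * score # I(theta)^{-1} = theta(1-theta)

for t in (10, 100, 1000):
    k, y = t // 3, 1
    exact = (k + y) / (t + 1)                        # ML estimator after also observing y
    approx = natural_gradient_step(t, k, y)
    print(t, abs(exact - approx) * t**2)             # stays bounded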

What about other priors? Then a similar theorem holds if the definition of the SNML predictor pSNML is modified as


Arbitrary Bayesian priors

Consider a Bayesian prior with density β(θ) with respect to the Jeffreys prior,

\alpha(\theta) = \beta(\theta) \sqrt{\det I(\theta)}

Then a similar theorem holds if the definition of the SNML predictor p^{SNML} is modified as

q(y) := \beta\!\left(\theta^{ML+y}\right)^2 p_{\theta^{ML+y}}(y), \qquad p^{SNML} := \frac{q}{\int q}

for each possible value y of x_{t+1}.
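A numerical sanity check of this modification in the Bernoulli case. This is my own sketch, and it assumes that the “similar theorem” again equates the α-Bayes predictor with the average of p^ML and the modified p^SNML, with the squared β as reconstructed above; taking β(θ) = θ makes the prior α a Beta(3/2, 1/2), whose exact predictive probability is (k + 3/2)/(t + 2).

def snml_beta_bernoulli(t, k, beta=lambda th: th):
    # Modified SNML for a Bernoulli model: q(y) = beta(theta^{ML+y})^2 * p_{theta^{ML+y}}(y).
    q = {}
    for y in (0, 1):
        th = (k + y) / (t + 1)
        q[y] = beta(th)**2 * (th if y == 1 else 1 - th)
    z = q[0] + q[1]
    return {y: q[y] / z for y in q}

for t in (10, 100, 1000):
    k = t // 3
    exact_bayes = (k + 1.5) / (t + 2)                # Bayes predictive for prior Beta(3/2, 1/2)
    avg = 0.5 * (k / t + snml_beta_bernoulli(t, k)[1])
    print(t, abs(exact_bayes - avg) * t**2)          # stays bounded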


Posterior means: a systematic shift between ML and Bayes

Another way to compare ML and Bayesian predictors is to use test functions.

“Is the Bayesian posterior approximately centered around the ML estimator?”

=⇒ No!


Let f(θ) be a smooth test function of θ. Then there is a systematic direction of the difference between f(θ^{ML}) and the Bayesian posterior mean of f. For exponential families and the Jeffreys prior, this difference is approximately

\frac{1}{t}\, \partial_\theta f \cdot V(\theta^{ML})

where V(θ) is an intrinsic vector field on Θ, independent from f:

V(\theta) = \frac{1}{4}\, I(\theta)^{-1} \cdot T(\theta) \cdot I(\theta)^{-1}

with I the Fisher matrix and T the skewness tensor [Amari–Nagaoka],

T_{ijk}(\theta) := \mathbb{E}_{x \sim p_\theta}\!\left[ \frac{\partial \ln p_\theta(x)}{\partial \theta_i}\, \frac{\partial \ln p_\theta(x)}{\partial \theta_j}\, \frac{\partial \ln p_\theta(x)}{\partial \theta_k} \right]

Conclusions

▶ For exponential families, Bayesian predictors can be approximated using modified ML predictors.

▶ The difference between Bayesian and ML predictors can be computed from the Fisher metric.

▶ There is a systematic direction of the shift from ML to Bayesian posterior means.

▶ Extends to non-i.i.d. models if p_θ(x_{t+1} | x_{1...t}) is an exponential family.


Thank you!