Bayesian Interpretations of Regularization

Charlie Frogner

9.520 Class 15

April 1, 2009

The Plan

Tikhonov regularization maps the training set $S = \{(x_i, y_i)\}_{i=1}^n$ to a function that minimizes the regularized loss:

$f_S = \arg\min_{f \in \mathcal{H}} \frac{1}{2}\sum_{i=1}^n (y_i - f(x_i))^2 + \frac{\lambda}{2}\|f\|_{\mathcal{H}}^2$

Can we justify Tikhonov regularization from a probabilistic point of view?


- Bayesian estimation basics
- Bayesian interpretation of ERM
- Bayesian interpretation of linear RLS
- Bayesian interpretation of kernel RLS
- Transductive model
- Infinite dimensions = weird

Some notation

$S = \{(x_i, y_i)\}_{i=1}^n$ is the set of observed input/output pairs in $\mathbb{R}^d \times \mathbb{R}$ (the training set).

X and Y denote the matrices $[x_1, \dots, x_n]^T \in \mathbb{R}^{n \times d}$ and $[y_1, \dots, y_n]^T \in \mathbb{R}^n$, respectively.

$\theta$ is a vector of parameters in $\mathbb{R}^p$.

$p(Y|X, \theta)$ is the joint distribution over outputs Y given inputs X and the parameters.

Estimation

The setup: a model relates observed quantities ($\alpha$) to an unobserved quantity (say $\beta$). We want an estimator: a map from the observed data $\alpha$ back to an estimate of the unobserved $\beta$. Nothing new yet...

Estimator: $\beta \in B$ is unobserved, $\alpha \in A$ is observed. An estimator for $\beta$ is a function $\hat\beta : A \to B$ such that $\hat\beta(\alpha)$ is an estimate of $\beta$ given an observation $\alpha$.


Tikhonov fits in the estimation framework.

$f_S = \arg\min_{f \in \mathcal{H}} \frac{1}{2}\sum_{i=1}^n (y_i - f(x_i))^2 + \frac{\lambda}{2}\|f\|_{\mathcal{H}}^2$

Regression model:

$y_i = f(x_i) + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma_\varepsilon^2)$

Bayesian Estimation

Difference: a Bayesian model specifies the joint $p(\beta, \alpha)$, usually via a measurement model $p(\alpha|\beta)$ and a prior $p(\beta)$.

Bayesian model: $\beta$ is unobserved, $\alpha$ is observed.

$p(\beta, \alpha) = p(\alpha|\beta) \cdot p(\beta)$

ERM as a Maximum Likelihood Estimator

(Linear) empirical risk minimization:

$f_S(x) = x^T \hat\theta_{ERM}(S), \qquad \hat\theta_{ERM}(S) = \arg\min_\theta \frac{1}{2}\sum_{i=1}^n (y_i - x_i^T\theta)^2$

Measurement model:

$Y | X, \theta \sim \mathcal{N}(X\theta, \sigma_\varepsilon^2 I)$

X is fixed/non-random, $\theta$ is unknown.


Under the measurement model $Y|X,\theta \sim \mathcal{N}(X\theta, \sigma_\varepsilon^2 I)$, we want to estimate $\theta$. We can do this without defining a prior on $\theta$: maximize the likelihood, i.e. the probability of the observations.

Likelihood: the likelihood of any fixed parameter vector $\theta$ is

$L(\theta|X) = p(Y|X, \theta)$

Note: we always condition on X.


Likelihood under the measurement model:

$L(\theta|X) = \mathcal{N}(Y;\, X\theta, \sigma_\varepsilon^2 I) \propto \exp\!\left(-\frac{1}{2\sigma_\varepsilon^2}\|Y - X\theta\|^2\right)$

So the maximum likelihood estimator is ERM:

$\arg\min_\theta \frac{1}{2}\|Y - X\theta\|^2$
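As a quick numerical sanity check, here is a minimal sketch (assuming NumPy and synthetic data; the variable names are mine) showing that the least-squares/ERM solution is also the maximizer of the Gaussian log-likelihood: random perturbations of it only decrease the likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from the measurement model Y = X theta + eps, eps ~ N(0, sigma^2 I)
n, d, sigma_eps = 50, 3, 0.5
X = rng.normal(size=(n, d))
theta_true = np.array([1.0, -2.0, 0.5])
Y = X @ theta_true + sigma_eps * rng.normal(size=n)

# ERM / least-squares solution: argmin_theta (1/2) ||Y - X theta||^2
theta_erm, *_ = np.linalg.lstsq(X, Y, rcond=None)

def log_likelihood(theta):
    # Gaussian log-likelihood up to an additive constant
    return -np.sum((Y - X @ theta) ** 2) / (2 * sigma_eps ** 2)

# Any perturbation of the ERM solution has lower (or equal) likelihood
for _ in range(5):
    assert log_likelihood(theta_erm) >= log_likelihood(theta_erm + 0.1 * rng.normal(size=d))
print("theta_ERM:", theta_erm)
```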

Really?

Maximizing $e^{-\frac{1}{2\sigma_\varepsilon^2}\|Y - X\theta\|^2}$ is the same as minimizing $\frac{1}{2}\|Y - X\theta\|^2$.

What about regularization?

Linear regularized least squares:

$f_S(x) = x^T\hat\theta, \qquad \hat\theta_{RLS}(S) = \arg\min_\theta \frac{1}{2}\sum_{i=1}^n (y_i - x_i^T\theta)^2 + \frac{\lambda}{2}\|\theta\|^2$

Is there a model of Y and $\theta$ that yields linear RLS? Yes:

$\exp\!\left(-\frac{1}{2\sigma_\varepsilon^2}\sum_{i=1}^n (y_i - x_i^T\theta)^2 - \frac{\lambda}{2}\|\theta\|^2\right) \;\propto\; p(Y|X,\theta) \cdot p(\theta)$

Factor the exponential:

$e^{-\frac{1}{2\sigma_\varepsilon^2}\sum_{i=1}^n (y_i - x_i^T\theta)^2} \cdot e^{-\frac{\lambda}{2}\|\theta\|^2}$

i.e. a Gaussian likelihood $p(Y|X,\theta)$ times a Gaussian prior $p(\theta)$.

Linear RLS as a MAP estimator

Measurement model:

$Y | X, \theta \sim \mathcal{N}(X\theta, \sigma_\varepsilon^2 I)$

Add a prior: $\theta \sim \mathcal{N}(0, I)$. Matching the RLS objective then requires $\sigma_\varepsilon^2 = \lambda$. How do we estimate $\theta$?

The Bayesian method

Take $p(Y|X,\theta)$ and $p(\theta)$. Apply Bayes' rule to get the posterior:

$p(\theta|X, Y) = \frac{p(Y|X,\theta)\cdot p(\theta)}{p(Y|X)} = \frac{p(Y|X,\theta)\cdot p(\theta)}{\int p(Y|X,\theta)\, p(\theta)\, d\theta}$

Use the posterior to estimate $\theta$.

Estimators that use the posterior

Bayes least squares estimator: the Bayes least squares estimator for $\theta$ given the observed Y is

$\hat\theta_{BLS}(Y|X) = \mathbb{E}_{\theta|X,Y}[\theta],$

i.e. the mean of the posterior.

Maximum a posteriori estimator: the MAP estimator for $\theta$ given the observed Y is

$\hat\theta_{MAP}(Y|X) = \arg\max_\theta p(\theta|X, Y),$

i.e. a mode of the posterior.

Linear RLS as a MAP estimator

Model:

$Y | X, \theta \sim \mathcal{N}(X\theta, \sigma_\varepsilon^2 I), \qquad \theta \sim \mathcal{N}(0, I)$

Joint over Y and $\theta$:

$\begin{pmatrix} Y|X \\ \theta \end{pmatrix} \sim \mathcal{N}\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} XX^T + \sigma_\varepsilon^2 I & X \\ X^T & I \end{pmatrix} \right)$

Condition on Y|X.


Posterior:

$\theta | X, Y \sim \mathcal{N}(\mu_{\theta|X,Y}, \Sigma_{\theta|X,Y})$

where

$\mu_{\theta|X,Y} = X^T (XX^T + \sigma_\varepsilon^2 I)^{-1} Y$
$\Sigma_{\theta|X,Y} = I - X^T (XX^T + \sigma_\varepsilon^2 I)^{-1} X$

This is Gaussian, so

$\hat\theta_{MAP}(Y|X) = \hat\theta_{BLS}(Y|X) = X^T (XX^T + \sigma_\varepsilon^2 I)^{-1} Y$


Recall the linear RLS solution:

$\hat\theta_{RLS}(Y|X) = \arg\min_\theta \frac{1}{2}\sum_{i=1}^n (y_i - x_i^T\theta)^2 + \frac{\lambda}{2}\|\theta\|^2 = X^T (XX^T + \lambda I)^{-1} Y$

which matches $\hat\theta_{MAP}(Y|X) = X^T (XX^T + \sigma_\varepsilon^2 I)^{-1} Y$ with $\lambda = \sigma_\varepsilon^2$. How do we write the estimated function?
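Before moving on, here is a small numerical sketch (assuming NumPy; synthetic data and names are mine) of the identity above: the ridge/RLS minimizer $(X^TX + \lambda I)^{-1}X^TY$ coincides with the posterior mean $X^T(XX^T + \sigma_\varepsilon^2 I)^{-1}Y$ once $\lambda = \sigma_\varepsilon^2$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 40, 5
X = rng.normal(size=(n, d))
Y = X @ rng.normal(size=d) + 0.3 * rng.normal(size=n)

sigma_eps2 = 0.5      # noise variance
lam = sigma_eps2      # prior theta ~ N(0, I) forces lambda = sigma_eps^2

# RLS solution of argmin (1/2)||Y - X theta||^2 + (lam/2)||theta||^2
theta_rls = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

# MAP / posterior mean: X^T (X X^T + sigma^2 I)^{-1} Y
theta_map = X.T @ np.linalg.solve(X @ X.T + sigma_eps2 * np.eye(n), Y)

assert np.allclose(theta_rls, theta_map)
```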

What about Kernel RLS?

We can use basically the same trick to derive kernel RLS:

$f_S = \arg\min_{f \in \mathcal{H}_K} \frac{1}{2}\sum_{i=1}^n (y_i - f(x_i))^2 + \frac{\lambda}{2}\|f\|_{\mathcal{H}_K}^2$

How? Work in the feature space, $f(x) = \langle \theta, \phi(x) \rangle_{\mathcal{F}}$, and consider

$\exp\!\left(-\frac{1}{2}\sum_{i=1}^n (y_i - \phi(x_i)^T\theta)^2 - \frac{\lambda}{2}\theta^T\theta\right)$

The feature space must be finite-dimensional.

Factor the exponential as before:

$e^{-\frac{1}{2}\sum_{i=1}^n (y_i - \phi(x_i)^T\theta)^2} \cdot e^{-\frac{\lambda}{2}\theta^T\theta}$

i.e. a Gaussian likelihood times a Gaussian prior on $\theta$. Again, the feature space must be finite-dimensional.

More notation

$\phi(X) = [\phi(x_1), \dots, \phi(x_n)]^T$

$K(X, X)$ is the kernel matrix: $[K(X, X)]_{ij} = K(x_i, x_j)$

$K(x, X) = [K(x, x_1), \dots, K(x, x_n)]$

$f(X) = [f(x_1), \dots, f(x_n)]^T$

What about Kernel RLS?

Model:

2 Y X,θ φ(X)θ,σε I , θ (0, I) | ∼N   ∼N Then:

θˆ (Y X)= φ(X)T (φ(X)φ(X)T + σ2I)−1Y MAP | ε What is φ(X)φ(X)T ? It’s K (X, X).


Model:

2 Y X,θ φ(X)θ,σε I , θ (0, I) | ∼N   ∼N Then: θˆ (Y X)= φ(X)T (K (X, X)+ σ2I)−1Y MAP | ε Estimated function?

ˆf (x)= φ(x)θˆ (Y X) MAP MAP | T 2 −1 = φ(x)φ(X) (K (X, X)+ σε I) Y λ = K (x, X)(K (X, X)+ I)−1Y 2 ˆ = fRLS(x)
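A minimal sketch of the resulting predictor (assuming NumPy and an RBF kernel of my choosing): the MAP / kernel RLS prediction is $K(x, X)\,(K(X, X) + \sigma_\varepsilon^2 I)^{-1} Y$, with $\sigma_\varepsilon^2$ playing the role of $\lambda$.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # K(a, b) = exp(-gamma ||a - b||^2) for all pairs of rows of A and B
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(30, 1))
Y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)

sigma_eps2 = 0.1                                   # plays the role of lambda
alpha = np.linalg.solve(rbf_kernel(X, X) + sigma_eps2 * np.eye(len(X)), Y)

def f_hat(x_new):
    # Kernel RLS / MAP prediction: K(x, X) (K(X, X) + sigma^2 I)^{-1} Y
    return rbf_kernel(x_new, X) @ alpha

print(f_hat(np.array([[0.0], [1.5]])))
```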

A prior over functions

Model:

2 Y X,θ φ(X)θ,σε I , θ (0, I) | ∼N   ∼N Can we write this as a prior on ? HK


Remember Mercer's theorem:

$K(x_i, x_j) = \sum_k \nu_k \psi_k(x_i)\psi_k(x_j)$

where $\nu_k \psi_k(\cdot) = \int K(\cdot, y)\,\psi_k(y)\,dy$ for all k. The functions $\{\sqrt{\nu_k}\,\psi_k(\cdot)\}$ form an orthonormal basis for $\mathcal{H}_K$.

Let $\phi(\cdot) = [\sqrt{\nu_1}\psi_1(\cdot), \dots, \sqrt{\nu_p}\psi_p(\cdot)]$. Then:

$\mathcal{H}_K = \{\theta^T\phi(\cdot) \mid \theta \in \mathbb{R}^p\}$


We showed: when $\theta \sim \mathcal{N}(0, I)$,

$\hat f_{MAP}(\cdot) = \hat\theta_{MAP}(Y|X)^T \phi(\cdot) = \hat f_{RLS}(\cdot)$

Taking $\phi(\cdot) = [\sqrt{\nu_1}\psi_1, \dots, \sqrt{\nu_p}\psi_p]$, this prior is equivalently a prior on $f(\cdot) = \theta^T\phi(\cdot)$ with $\theta \sim \mathcal{N}(0, I)$, i.e. the functions in $\mathcal{H}_K$ are Gaussian distributed:

$p(f) \propto \exp\!\left(-\frac{1}{2}\|f\|_{\mathcal{H}_K}^2\right) = \exp\!\left(-\frac{1}{2}\theta^T\theta\right)$

Note: again we need $\mathcal{H}_K$ to be finite-dimensional.


So:

$p(f) \propto \exp\!\left(-\frac{1}{2}\|f\|_{\mathcal{H}_K}^2\right) \;\Leftrightarrow\; \theta \sim \mathcal{N}(0, I) \;\Rightarrow\; \hat f_{MAP} = \hat f_{RLS}$

Assuming $p(f) \propto \exp(-\frac{1}{2}\|f\|_{\mathcal{H}_K}^2)$,

$p(f|X, Y) = \frac{p(Y|X, f)\cdot p(f)}{p(Y|X)} \propto \exp\!\left(-\frac{1}{2}\|Y - f(X)\|^2\right)\exp\!\left(-\frac{1}{2}\|f\|_{\mathcal{H}_K}^2\right) = \exp\!\left(-\frac{1}{2}\|Y - f(X)\|^2 - \frac{1}{2}\|f\|_{\mathcal{H}_K}^2\right)$

A quick recap

We wanted to know if RLS has a probabilistic interpretation.

Empirical risk minimization is ML:

$p(Y|X,\theta) \propto e^{-\frac{1}{2}\|Y - X\theta\|^2}$

Linear RLS is MAP:

$p(Y, \theta|X) \propto e^{-\frac{1}{2}\|Y - X\theta\|^2} \cdot e^{-\frac{\lambda}{2}\theta^T\theta}$

Kernel RLS is also MAP:

$p(Y, \theta|X) \propto e^{-\frac{1}{2}\|Y - \phi(X)\theta\|^2} \cdot e^{-\frac{\lambda}{2}\theta^T\theta}$

Equivalent to a Gaussian prior on $\mathcal{H}_K$:

$p(Y, f|X) \propto e^{-\frac{1}{2}\|Y - f(X)\|^2} \cdot e^{-\frac{\lambda}{2}\|f\|_{\mathcal{H}_K}^2}$

But these don't work for infinite-dimensional function spaces...

Transductive setting

We hinted at problems if $\dim \mathcal{H}_K = \infty$.

Idea: forget about estimating $\theta$ (i.e. f). Instead, estimate the predicted outputs

$Y^* = [y_1^*, \dots, y_M^*]^T$

at test inputs

$X^* = [x_1^*, \dots, x_M^*]^T$

We need the joint distribution over $Y^*$ and Y.


Say $Y^*$ and Y are jointly Gaussian:

$\begin{pmatrix} Y \\ Y^* \end{pmatrix} \sim \mathcal{N}\!\left( \begin{pmatrix} \mu_Y \\ \mu_{Y^*} \end{pmatrix}, \begin{pmatrix} \Lambda_Y & \Lambda_{YY^*} \\ \Lambda_{Y^*Y} & \Lambda_{Y^*} \end{pmatrix} \right)$

Want: kernel RLS. General form for the posterior:

$Y^* | X, Y \sim \mathcal{N}(\mu_{Y^*|X,Y}, \Sigma_{Y^*|X,Y})$

where

$\mu_{Y^*|X,Y} = \mu_{Y^*} + \Lambda_{YY^*}^T \Lambda_Y^{-1} (Y - \mu_Y)$
$\Sigma_{Y^*|X,Y} = \Lambda_{Y^*} - \Lambda_{YY^*}^T \Lambda_Y^{-1} \Lambda_{YY^*}$
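These conditioning formulas are easy to code directly. A small sketch (assuming NumPy; the function name is mine) that takes the joint mean and covariance blocks and returns the posterior mean and covariance of $Y^*$ given Y:

```python
import numpy as np

def condition_gaussian(mu_Y, mu_Ystar, Lam_Y, Lam_YYstar, Lam_Ystar, Y):
    # mu_{Y*|Y}    = mu_{Y*} + Lam_{YY*}^T Lam_Y^{-1} (Y - mu_Y)
    # Sigma_{Y*|Y} = Lam_{Y*} - Lam_{YY*}^T Lam_Y^{-1} Lam_{YY*}
    Lam_Y_inv = np.linalg.inv(Lam_Y)
    mu_post = mu_Ystar + Lam_YYstar.T @ Lam_Y_inv @ (Y - mu_Y)
    cov_post = Lam_Ystar - Lam_YYstar.T @ Lam_Y_inv @ Lam_YYstar
    return mu_post, cov_post
```

Plugging kernel matrices into these blocks, as below, recovers the kernel RLS prediction as the posterior mean.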


Set $\Lambda_Y = K(X, X) + \sigma^2 I$, $\Lambda_{YY^*} = K(X, X^*)$, $\Lambda_{Y^*} = K(X^*, X^*)$. Posterior:

$Y^* | X, Y \sim \mathcal{N}(\mu_{Y^*|X,Y}, \Sigma_{Y^*|X,Y})$

where

$\mu_{Y^*|X,Y} = \mu_{Y^*} + K(X^*, X)(K(X, X) + \sigma^2 I)^{-1}(Y - \mu_Y)$
$\Sigma_{Y^*|X,Y} = K(X^*, X^*) - K(X^*, X)(K(X, X) + \sigma^2 I)^{-1}K(X, X^*)$

So: $\hat Y^*_{MAP} = \hat f_{RLS}(X^*)$.


Model:

$\begin{pmatrix} Y \\ Y^* \end{pmatrix} \sim \mathcal{N}\!\left( \begin{pmatrix} \mu_Y \\ \mu_{Y^*} \end{pmatrix}, \begin{pmatrix} K(X, X) + \sigma_\varepsilon^2 I & K(X, X^*) \\ K(X^*, X) & K(X^*, X^*) \end{pmatrix} \right)$

The MAP estimate (posterior mean) equals the RLS function at every point $x^*$, regardless of $\dim \mathcal{H}_K$.

Are the prior and posterior (on points!) consistent with a distribution on $\mathcal{H}_K$?


Strictly speaking, $\theta$ and f don't come into play here at all:

We have $p(Y^*|X, Y)$. We do not have $p(\theta|X, Y)$ or $p(f|X, Y)$.

But, if $\mathcal{H}_K$ is finite-dimensional, the joint over $Y^*$ and Y is consistent with: $Y = f(X) + \varepsilon$, $Y^* = f(X^*)$, and $f \in \mathcal{H}_K$ a random trajectory from a Gaussian process over the domain, with mean $\mu$ and covariance K. (Ergo, people call this "Gaussian process regression.") (Also "Kriging," because of a guy.)

Recap redux

Empirical risk minimization is the maximum likelihood estimator when:

$y = x^T\theta + \varepsilon$

Linear RLS is the MAP estimator when:

$y = x^T\theta + \varepsilon, \qquad \theta \sim \mathcal{N}(0, I)$

Kernel RLS is the MAP estimator when:

$y = \phi(x)^T\theta + \varepsilon, \qquad \theta \sim \mathcal{N}(0, I)$

in finite-dimensional $\mathcal{H}_K$.

Kernel RLS is the MAP estimator at points when:

$\begin{pmatrix} Y \\ Y^* \end{pmatrix} \sim \mathcal{N}\!\left( \begin{pmatrix} \mu_Y \\ \mu_{Y^*} \end{pmatrix}, \begin{pmatrix} K(X, X) + \sigma_\varepsilon^2 I & K(X, X^*) \\ K(X^*, X) & K(X^*, X^*) \end{pmatrix} \right)$

in possibly infinite-dimensional $\mathcal{H}_K$.

Is this useful in practice?

If you want confidence intervals and believe the posteriors are meaningful: yes. Maybe there are other reasons?

[Figure: Gaussian regression with confidence intervals — regression function and observed points, plotted as Y against X.]
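A sketch of how a figure like this could be produced (assuming NumPy and matplotlib, with an RBF kernel and toy data of my choosing): plot the posterior mean together with a ±2 standard deviation band from the posterior covariance.

```python
import numpy as np
import matplotlib.pyplot as plt

def rbf_kernel(A, B, gamma=0.5):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

rng = np.random.default_rng(3)
X = rng.uniform(-5, 5, size=(12, 1))
Y = np.sin(X[:, 0]) + 0.2 * rng.normal(size=12)
Xs = np.linspace(-6, 6, 200)[:, None]               # test inputs X*
sigma2 = 0.04                                       # noise variance

K = rbf_kernel(X, X) + sigma2 * np.eye(len(X))      # K(X, X) + sigma^2 I
Ks = rbf_kernel(Xs, X)                              # K(X*, X)
mu = Ks @ np.linalg.solve(K, Y)                     # posterior mean = RLS prediction
cov = rbf_kernel(Xs, Xs) - Ks @ np.linalg.solve(K, Ks.T)
std = np.sqrt(np.clip(np.diag(cov), 0.0, None))

plt.fill_between(Xs[:, 0], mu - 2 * std, mu + 2 * std, alpha=0.3)
plt.plot(Xs[:, 0], mu, label="Regression function")
plt.scatter(X[:, 0], Y, label="Observed points")
plt.xlabel("X"); plt.ylabel("Y"); plt.legend()
plt.show()
```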

What is going on with infinite-dimensional $\mathcal{H}_K$?

We wrote down statements like $\theta \sim \mathcal{N}(0, I)$ and $f \sim \mathcal{N}(0, I)$. The space $\mathcal{H}_K$ can be written

$\mathcal{H}_K = \{f : \|f\|_{\mathcal{H}_K}^2 < \infty\} = \{\theta^T\phi(\cdot) : \sum_{i=1}^{\infty} \theta_i^2 < \infty\}$

Difference between finite and infinite: not every $\theta$ yields a function $\theta^T\phi(\cdot)$ in $\mathcal{H}_K$.

A hint: $\theta \sim \mathcal{N}(0, I) \Rightarrow \mathbb{E}\|\theta^T\phi(\cdot)\|_{\mathcal{H}_K}^2 = \infty$.

In fact: $\theta \sim \mathcal{N}(0, I) \Rightarrow \theta^T\phi(\cdot) \in \mathcal{H}_K$ with probability zero.

So be careful out there.

A hint that things are amiss

Assume: $\theta \sim \mathcal{N}(0, I)$, with $\{\phi_i\}_{i=1}^{\infty}$ an orthonormal basis in $\mathcal{H}_K$.

$\mathbb{E}\|\theta^T\phi\|_{\mathcal{H}_K}^2 = \mathbb{E}\Big\|\sum_{i=1}^{\infty}\theta_i\phi_i\Big\|_{\mathcal{H}_K}^2 = \sum_{i=1}^{\infty}\mathbb{E}\,\theta_i^2 = \sum_{i=1}^{\infty} 1 = \infty$
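A tiny simulation (NumPy, my construction) makes the divergence concrete: for $\theta \sim \mathcal{N}(0, I_p)$, the partial sums $\sum_{i \le p} \theta_i^2$ grow roughly linearly in p, so the squared norm has no finite limit as $p \to \infty$.

```python
import numpy as np

rng = np.random.default_rng(4)
for p in [10, 100, 1000, 10000]:
    theta = rng.normal(size=p)          # theta ~ N(0, I_p)
    print(p, np.sum(theta**2))          # grows like p: E sum_i theta_i^2 = p
```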
