Bayesian Interpretations of Regularization

Charlie Frogner

9.520 Class 15

April 1, 2009

The Plan

Tikhonov regularization maps the training set $S = \{(x_i, y_i)\}_{i=1}^n$ to a function that minimizes the regularized loss:

$f_S = \arg\min_{f \in \mathcal{H}} \frac{1}{2}\sum_{i=1}^n (y_i - f(x_i))^2 + \frac{\lambda}{2}\|f\|_{\mathcal{H}}^2$

Can we justify Tikhonov regularization from a probabilistic point of view?


- Bayesian estimation basics
- Bayesian interpretation of ERM
- Bayesian interpretation of linear RLS
- Bayesian interpretation of kernel RLS
- Transductive model
- Infinite dimensions = weird

Some notation

$S = \{(x_i, y_i)\}_{i=1}^n$ is the set of observed input/output pairs in $\mathbb{R}^d \times \mathbb{R}$ (the training set).

X and Y denote the matrices $[x_1, \dots, x_n]^T \in \mathbb{R}^{n \times d}$ and $[y_1, \dots, y_n]^T \in \mathbb{R}^n$, respectively.

$\theta$ is a vector of parameters in $\mathbb{R}^p$.

$p(Y|X, \theta)$ is the joint distribution over outputs Y given inputs X and the parameters.

Estimation

The setup: a model relates observed quantities ($\alpha$) to an unobserved quantity (say $\beta$). We want an estimator: a map from the observed data $\alpha$ back to an estimate of the unobserved $\beta$. Nothing new yet...

Estimator: $\beta \in B$ is unobserved, $\alpha \in A$ is observed. An estimator for $\beta$ is a function $\hat\beta : A \to B$ such that $\hat\beta(\alpha)$ is an estimate of $\beta$ given an observation $\alpha$.


Tikhonov fits in the estimation framework.

$f_S = \arg\min_{f \in \mathcal{H}} \frac{1}{2}\sum_{i=1}^n (y_i - f(x_i))^2 + \frac{\lambda}{2}\|f\|_{\mathcal{H}}^2$

Regression model:

$y_i = f(x_i) + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma_\varepsilon^2)$

Bayesian Estimation

Difference: a Bayesian model specifies the joint $p(\beta, \alpha)$, usually via a measurement model $p(\alpha|\beta)$ and a prior $p(\beta)$.

Bayesian model: $\beta$ is unobserved, $\alpha$ is observed.

$p(\beta, \alpha) = p(\alpha|\beta) \cdot p(\beta)$

ERM as a Maximum Likelihood Estimator

(Linear) empirical risk minimization:

$f_S(x) = x^T \hat\theta_{ERM}(S), \qquad \hat\theta_{ERM}(S) = \arg\min_\theta \frac{1}{2}\sum_{i=1}^n (y_i - x_i^T\theta)^2$

Measurement model:

$Y | X, \theta \sim \mathcal{N}(X\theta, \sigma_\varepsilon^2 I)$

X is fixed/non-random, $\theta$ is unknown.


Under the measurement model $Y|X,\theta \sim \mathcal{N}(X\theta, \sigma_\varepsilon^2 I)$, we want to estimate $\theta$. We can do this without defining a prior on $\theta$: maximize the likelihood, i.e. the probability of the observations.

Likelihood: the likelihood of any fixed parameter vector $\theta$ is

$L(\theta|X) = p(Y|X, \theta)$

Note: we always condition on X.


Likelihood under the measurement model:

$L(\theta|X) = \mathcal{N}(Y;\, X\theta, \sigma_\varepsilon^2 I) \propto \exp\!\left(-\frac{1}{2\sigma_\varepsilon^2}\|Y - X\theta\|^2\right)$

So the maximum likelihood estimator is ERM:

$\arg\min_\theta \frac{1}{2}\|Y - X\theta\|^2$
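As a quick numerical sanity check, here is a minimal sketch (assuming NumPy and synthetic data; the variable names are mine) showing that the least-squares/ERM solution is also the maximizer of the Gaussian log-likelihood: random perturbations of it only decrease the likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from the measurement model Y = X theta + eps, eps ~ N(0, sigma^2 I)
n, d, sigma_eps = 50, 3, 0.5
X = rng.normal(size=(n, d))
theta_true = np.array([1.0, -2.0, 0.5])
Y = X @ theta_true + sigma_eps * rng.normal(size=n)

# ERM / least-squares solution: argmin_theta (1/2) ||Y - X theta||^2
theta_erm, *_ = np.linalg.lstsq(X, Y, rcond=None)

def log_likelihood(theta):
    # Gaussian log-likelihood up to an additive constant
    return -np.sum((Y - X @ theta) ** 2) / (2 * sigma_eps ** 2)

# Any perturbation of the ERM solution has lower (or equal) likelihood
for _ in range(5):
    assert log_likelihood(theta_erm) >= log_likelihood(theta_erm + 0.1 * rng.normal(size=d))
print("theta_ERM:", theta_erm)
```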

Really?

Maximizing $e^{-\frac{1}{2\sigma_\varepsilon^2}\|Y - X\theta\|^2}$ is the same as minimizing $\frac{1}{2}\|Y - X\theta\|^2$.

What about regularization?

Linear regularized least squares:

$f_S(x) = x^T\hat\theta, \qquad \hat\theta_{RLS}(S) = \arg\min_\theta \frac{1}{2}\sum_{i=1}^n (y_i - x_i^T\theta)^2 + \frac{\lambda}{2}\|\theta\|^2$

Is there a model of Y and $\theta$ that yields linear RLS? Yes:

$\exp\!\left(-\frac{1}{2\sigma_\varepsilon^2}\sum_{i=1}^n (y_i - x_i^T\theta)^2 - \frac{\lambda}{2}\|\theta\|^2\right) \;\propto\; p(Y|X,\theta) \cdot p(\theta)$

Factor the exponential:

$e^{-\frac{1}{2\sigma_\varepsilon^2}\sum_{i=1}^n (y_i - x_i^T\theta)^2} \cdot e^{-\frac{\lambda}{2}\|\theta\|^2}$

i.e. a Gaussian likelihood $p(Y|X,\theta)$ times a Gaussian prior $p(\theta)$.

Linear RLS as a MAP estimator

Measurement model:

$Y | X, \theta \sim \mathcal{N}(X\theta, \sigma_\varepsilon^2 I)$

Add a prior: $\theta \sim \mathcal{N}(0, I)$. Matching the RLS objective then requires $\sigma_\varepsilon^2 = \lambda$. How do we estimate $\theta$?

The Bayesian method

Take $p(Y|X,\theta)$ and $p(\theta)$. Apply Bayes' rule to get the posterior:

$p(\theta|X, Y) = \frac{p(Y|X,\theta)\cdot p(\theta)}{p(Y|X)} = \frac{p(Y|X,\theta)\cdot p(\theta)}{\int p(Y|X,\theta)\, p(\theta)\, d\theta}$

Use the posterior to estimate $\theta$.

Estimators that use the posterior

Bayes least squares estimator: the Bayes least squares estimator for $\theta$ given the observed Y is

$\hat\theta_{BLS}(Y|X) = \mathbb{E}_{\theta|X,Y}[\theta],$

i.e. the mean of the posterior.

Maximum a posteriori estimator: the MAP estimator for $\theta$ given the observed Y is

$\hat\theta_{MAP}(Y|X) = \arg\max_\theta p(\theta|X, Y),$

i.e. a mode of the posterior.

Linear RLS as a MAP estimator

Model:

$Y | X, \theta \sim \mathcal{N}(X\theta, \sigma_\varepsilon^2 I), \qquad \theta \sim \mathcal{N}(0, I)$

Joint over Y and $\theta$:

$\begin{pmatrix} Y|X \\ \theta \end{pmatrix} \sim \mathcal{N}\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} XX^T + \sigma_\varepsilon^2 I & X \\ X^T & I \end{pmatrix} \right)$

Condition on Y|X.


Posterior:

$\theta | X, Y \sim \mathcal{N}(\mu_{\theta|X,Y}, \Sigma_{\theta|X,Y})$

where

$\mu_{\theta|X,Y} = X^T (XX^T + \sigma_\varepsilon^2 I)^{-1} Y$
$\Sigma_{\theta|X,Y} = I - X^T (XX^T + \sigma_\varepsilon^2 I)^{-1} X$

This is Gaussian, so

$\hat\theta_{MAP}(Y|X) = \hat\theta_{BLS}(Y|X) = X^T (XX^T + \sigma_\varepsilon^2 I)^{-1} Y$


Recall the linear RLS solution:

$\hat\theta_{RLS}(Y|X) = \arg\min_\theta \frac{1}{2}\sum_{i=1}^n (y_i - x_i^T\theta)^2 + \frac{\lambda}{2}\|\theta\|^2 = X^T (XX^T + \lambda I)^{-1} Y$

which matches $\hat\theta_{MAP}(Y|X) = X^T (XX^T + \sigma_\varepsilon^2 I)^{-1} Y$ with $\lambda = \sigma_\varepsilon^2$. How do we write the estimated function?
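Before moving on, here is a small numerical sketch (assuming NumPy; synthetic data and names are mine) of the identity above: the ridge/RLS minimizer $(X^TX + \lambda I)^{-1}X^TY$ coincides with the posterior mean $X^T(XX^T + \sigma_\varepsilon^2 I)^{-1}Y$ once $\lambda = \sigma_\varepsilon^2$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 40, 5
X = rng.normal(size=(n, d))
Y = X @ rng.normal(size=d) + 0.3 * rng.normal(size=n)

sigma_eps2 = 0.5      # noise variance
lam = sigma_eps2      # prior theta ~ N(0, I) forces lambda = sigma_eps^2

# RLS solution of argmin (1/2)||Y - X theta||^2 + (lam/2)||theta||^2
theta_rls = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

# MAP / posterior mean: X^T (X X^T + sigma^2 I)^{-1} Y
theta_map = X.T @ np.linalg.solve(X @ X.T + sigma_eps2 * np.eye(n), Y)

assert np.allclose(theta_rls, theta_map)
```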

What about Kernel RLS?

We can use basically the same trick to derive kernel RLS:

$f_S = \arg\min_{f \in \mathcal{H}_K} \frac{1}{2}\sum_{i=1}^n (y_i - f(x_i))^2 + \frac{\lambda}{2}\|f\|_{\mathcal{H}_K}^2$

How? Work in the feature space, $f(x) = \langle \theta, \phi(x) \rangle_{\mathcal{F}}$, and consider

$\exp\!\left(-\frac{1}{2}\sum_{i=1}^n (y_i - \phi(x_i)^T\theta)^2 - \frac{\lambda}{2}\theta^T\theta\right)$

The feature space must be finite-dimensional.

Factor the exponential as before:

$e^{-\frac{1}{2}\sum_{i=1}^n (y_i - \phi(x_i)^T\theta)^2} \cdot e^{-\frac{\lambda}{2}\theta^T\theta}$

i.e. a Gaussian likelihood times a Gaussian prior on $\theta$. Again, the feature space must be finite-dimensional.

More notation

$\phi(X) = [\phi(x_1), \dots, \phi(x_n)]^T$

$K(X, X)$ is the kernel matrix: $[K(X, X)]_{ij} = K(x_i, x_j)$

$K(x, X) = [K(x, x_1), \dots, K(x, x_n)]$

$f(X) = [f(x_1), \dots, f(x_n)]^T$

What about Kernel RLS?

Model:

2 Y X,θ φ(X)θ,σε I , θ (0, I) | ∼N   ∼N Then:

θˆ (Y X)= φ(X)T (φ(X)φ(X)T + σ2I)−1Y MAP | ε What is φ(X)φ(X)T ? It’s K (X, X).


Model:

2 Y X,θ φ(X)θ,σε I , θ (0, I) | ∼N   ∼N Then: θˆ (Y X)= φ(X)T (K (X, X)+ σ2I)−1Y MAP | ε Estimated function?

ˆf (x)= φ(x)θˆ (Y X) MAP MAP | T 2 −1 = φ(x)φ(X) (K (X, X)+ σε I) Y λ = K (x, X)(K (X, X)+ I)−1Y 2 ˆ = fRLS(x)
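A minimal sketch of the resulting predictor (assuming NumPy and an RBF kernel of my choosing): the MAP / kernel RLS prediction is $K(x, X)\,(K(X, X) + \sigma_\varepsilon^2 I)^{-1} Y$, with $\sigma_\varepsilon^2$ playing the role of $\lambda$.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # K(a, b) = exp(-gamma ||a - b||^2) for all pairs of rows of A and B
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(30, 1))
Y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)

sigma_eps2 = 0.1                                   # plays the role of lambda
alpha = np.linalg.solve(rbf_kernel(X, X) + sigma_eps2 * np.eye(len(X)), Y)

def f_hat(x_new):
    # Kernel RLS / MAP prediction: K(x, X) (K(X, X) + sigma^2 I)^{-1} Y
    return rbf_kernel(x_new, X) @ alpha

print(f_hat(np.array([[0.0], [1.5]])))
```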

A prior over functions

Model:

2 Y X,θ φ(X)θ,σε I , θ (0, I) | ∼N   ∼N Can we write this as a prior on ? HK


Remember Mercer's theorem:

$K(x_i, x_j) = \sum_k \nu_k \psi_k(x_i)\psi_k(x_j)$

where $\nu_k \psi_k(\cdot) = \int K(\cdot, y)\,\psi_k(y)\,dy$ for all k. The functions $\{\sqrt{\nu_k}\,\psi_k(\cdot)\}$ form an orthonormal basis for $\mathcal{H}_K$.

Let $\phi(\cdot) = [\sqrt{\nu_1}\psi_1(\cdot), \dots, \sqrt{\nu_p}\psi_p(\cdot)]$. Then:

$\mathcal{H}_K = \{\theta^T\phi(\cdot) \mid \theta \in \mathbb{R}^p\}$


We showed: when $\theta \sim \mathcal{N}(0, I)$,

$\hat f_{MAP}(\cdot) = \hat\theta_{MAP}(Y|X)^T \phi(\cdot) = \hat f_{RLS}(\cdot)$

Taking $\phi(\cdot) = [\sqrt{\nu_1}\psi_1, \dots, \sqrt{\nu_p}\psi_p]$, this prior is equivalently a prior on $f(\cdot) = \theta^T\phi(\cdot)$ with $\theta \sim \mathcal{N}(0, I)$, i.e. the functions in $\mathcal{H}_K$ are Gaussian distributed:

$p(f) \propto \exp\!\left(-\frac{1}{2}\|f\|_{\mathcal{H}_K}^2\right) = \exp\!\left(-\frac{1}{2}\theta^T\theta\right)$

Note: again we need $\mathcal{H}_K$ to be finite-dimensional.


So:

$p(f) \propto \exp\!\left(-\frac{1}{2}\|f\|_{\mathcal{H}_K}^2\right) \;\Leftrightarrow\; \theta \sim \mathcal{N}(0, I) \;\Rightarrow\; \hat f_{MAP} = \hat f_{RLS}$

Assuming $p(f) \propto \exp(-\frac{1}{2}\|f\|_{\mathcal{H}_K}^2)$,

$p(f|X, Y) = \frac{p(Y|X, f)\cdot p(f)}{p(Y|X)} \propto \exp\!\left(-\frac{1}{2}\|Y - f(X)\|^2\right)\exp\!\left(-\frac{1}{2}\|f\|_{\mathcal{H}_K}^2\right) = \exp\!\left(-\frac{1}{2}\|Y - f(X)\|^2 - \frac{1}{2}\|f\|_{\mathcal{H}_K}^2\right)$

A quick recap

We wanted to know if RLS has a probabilistic interpretation.

Empirical risk minimization is ML:

$p(Y|X,\theta) \propto e^{-\frac{1}{2}\|Y - X\theta\|^2}$

Linear RLS is MAP:

$p(Y, \theta|X) \propto e^{-\frac{1}{2}\|Y - X\theta\|^2} \cdot e^{-\frac{\lambda}{2}\theta^T\theta}$

Kernel RLS is also MAP:

$p(Y, \theta|X) \propto e^{-\frac{1}{2}\|Y - \phi(X)\theta\|^2} \cdot e^{-\frac{\lambda}{2}\theta^T\theta}$

Equivalent to a Gaussian prior on $\mathcal{H}_K$:

$p(Y, f|X) \propto e^{-\frac{1}{2}\|Y - f(X)\|^2} \cdot e^{-\frac{\lambda}{2}\|f\|_{\mathcal{H}_K}^2}$

But these don't work for infinite-dimensional function spaces...

Transductive setting

We hinted at problems if $\dim \mathcal{H}_K = \infty$.

Idea: forget about estimating $\theta$ (i.e. f). Instead, estimate the predicted outputs

$Y^* = [y_1^*, \dots, y_M^*]^T$

at test inputs

$X^* = [x_1^*, \dots, x_M^*]^T$

We need the joint distribution over $Y^*$ and Y.


Say $Y^*$ and Y are jointly Gaussian:

$\begin{pmatrix} Y \\ Y^* \end{pmatrix} \sim \mathcal{N}\!\left( \begin{pmatrix} \mu_Y \\ \mu_{Y^*} \end{pmatrix}, \begin{pmatrix} \Lambda_Y & \Lambda_{YY^*} \\ \Lambda_{Y^*Y} & \Lambda_{Y^*} \end{pmatrix} \right)$

Want: kernel RLS. General form for the posterior:

$Y^* | X, Y \sim \mathcal{N}(\mu_{Y^*|X,Y}, \Sigma_{Y^*|X,Y})$

where

$\mu_{Y^*|X,Y} = \mu_{Y^*} + \Lambda_{YY^*}^T \Lambda_Y^{-1} (Y - \mu_Y)$
$\Sigma_{Y^*|X,Y} = \Lambda_{Y^*} - \Lambda_{YY^*}^T \Lambda_Y^{-1} \Lambda_{YY^*}$
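These conditioning formulas are easy to code directly. A small sketch (assuming NumPy; the function name is mine) that takes the joint mean and covariance blocks and returns the posterior mean and covariance of $Y^*$ given Y:

```python
import numpy as np

def condition_gaussian(mu_Y, mu_Ystar, Lam_Y, Lam_YYstar, Lam_Ystar, Y):
    # mu_{Y*|Y}    = mu_{Y*} + Lam_{YY*}^T Lam_Y^{-1} (Y - mu_Y)
    # Sigma_{Y*|Y} = Lam_{Y*} - Lam_{YY*}^T Lam_Y^{-1} Lam_{YY*}
    Lam_Y_inv = np.linalg.inv(Lam_Y)
    mu_post = mu_Ystar + Lam_YYstar.T @ Lam_Y_inv @ (Y - mu_Y)
    cov_post = Lam_Ystar - Lam_YYstar.T @ Lam_Y_inv @ Lam_YYstar
    return mu_post, cov_post
```

Plugging kernel matrices into these blocks, as below, recovers the kernel RLS prediction as the posterior mean.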


Set $\Lambda_Y = K(X, X) + \sigma^2 I$, $\Lambda_{YY^*} = K(X, X^*)$, $\Lambda_{Y^*} = K(X^*, X^*)$. Posterior:

$Y^* | X, Y \sim \mathcal{N}(\mu_{Y^*|X,Y}, \Sigma_{Y^*|X,Y})$

where

$\mu_{Y^*|X,Y} = \mu_{Y^*} + K(X^*, X)(K(X, X) + \sigma^2 I)^{-1}(Y - \mu_Y)$
$\Sigma_{Y^*|X,Y} = K(X^*, X^*) - K(X^*, X)(K(X, X) + \sigma^2 I)^{-1}K(X, X^*)$

So: $\hat Y^*_{MAP} = \hat f_{RLS}(X^*)$.


Model:

$\begin{pmatrix} Y \\ Y^* \end{pmatrix} \sim \mathcal{N}\!\left( \begin{pmatrix} \mu_Y \\ \mu_{Y^*} \end{pmatrix}, \begin{pmatrix} K(X, X) + \sigma_\varepsilon^2 I & K(X, X^*) \\ K(X^*, X) & K(X^*, X^*) \end{pmatrix} \right)$

The MAP estimate (posterior mean) equals the RLS function at every point $x^*$, regardless of $\dim \mathcal{H}_K$.

Are the prior and posterior (on points!) consistent with a distribution on $\mathcal{H}_K$?


Strictly speaking, $\theta$ and f don't come into play here at all:

We have $p(Y^*|X, Y)$. We do not have $p(\theta|X, Y)$ or $p(f|X, Y)$.

But, if $\mathcal{H}_K$ is finite-dimensional, the joint over $Y^*$ and Y is consistent with: $Y = f(X) + \varepsilon$, $Y^* = f(X^*)$, and $f \in \mathcal{H}_K$ a random trajectory from a Gaussian process over the domain, with mean $\mu$ and covariance K. (Ergo, people call this "Gaussian process regression.") (Also "Kriging," because of a guy.)

Recap redux

Empirical risk minimization is the maximum likelihood estimator when:

$y = x^T\theta + \varepsilon$

Linear RLS is the MAP estimator when:

$y = x^T\theta + \varepsilon, \qquad \theta \sim \mathcal{N}(0, I)$

Kernel RLS is the MAP estimator when:

$y = \phi(x)^T\theta + \varepsilon, \qquad \theta \sim \mathcal{N}(0, I)$

in finite-dimensional $\mathcal{H}_K$.

Kernel RLS is the MAP estimator at points when:

$\begin{pmatrix} Y \\ Y^* \end{pmatrix} \sim \mathcal{N}\!\left( \begin{pmatrix} \mu_Y \\ \mu_{Y^*} \end{pmatrix}, \begin{pmatrix} K(X, X) + \sigma_\varepsilon^2 I & K(X, X^*) \\ K(X^*, X) & K(X^*, X^*) \end{pmatrix} \right)$

in possibly infinite-dimensional $\mathcal{H}_K$.

Is this useful in practice?

If you want confidence intervals and believe the posteriors are meaningful: yes. Maybe there are other reasons?

[Figure: Gaussian regression with confidence intervals — regression function and observed points, plotted as Y against X.]
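A sketch of how a figure like this could be produced (assuming NumPy and matplotlib, with an RBF kernel and toy data of my choosing): plot the posterior mean together with a ±2 standard deviation band from the posterior covariance.

```python
import numpy as np
import matplotlib.pyplot as plt

def rbf_kernel(A, B, gamma=0.5):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

rng = np.random.default_rng(3)
X = rng.uniform(-5, 5, size=(12, 1))
Y = np.sin(X[:, 0]) + 0.2 * rng.normal(size=12)
Xs = np.linspace(-6, 6, 200)[:, None]               # test inputs X*
sigma2 = 0.04                                       # noise variance

K = rbf_kernel(X, X) + sigma2 * np.eye(len(X))      # K(X, X) + sigma^2 I
Ks = rbf_kernel(Xs, X)                              # K(X*, X)
mu = Ks @ np.linalg.solve(K, Y)                     # posterior mean = RLS prediction
cov = rbf_kernel(Xs, Xs) - Ks @ np.linalg.solve(K, Ks.T)
std = np.sqrt(np.clip(np.diag(cov), 0.0, None))

plt.fill_between(Xs[:, 0], mu - 2 * std, mu + 2 * std, alpha=0.3)
plt.plot(Xs[:, 0], mu, label="Regression function")
plt.scatter(X[:, 0], Y, label="Observed points")
plt.xlabel("X"); plt.ylabel("Y"); plt.legend()
plt.show()
```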

What is going on with infinite-dimensional $\mathcal{H}_K$?

We wrote down statements like $\theta \sim \mathcal{N}(0, I)$ and $f \sim \mathcal{N}(0, I)$. The space $\mathcal{H}_K$ can be written

$\mathcal{H}_K = \{f : \|f\|_{\mathcal{H}_K}^2 < \infty\} = \{\theta^T\phi(\cdot) : \sum_{i=1}^{\infty} \theta_i^2 < \infty\}$

Difference between finite and infinite: not every $\theta$ yields a function $\theta^T\phi(\cdot)$ in $\mathcal{H}_K$.

A hint: $\theta \sim \mathcal{N}(0, I) \Rightarrow \mathbb{E}\|\theta^T\phi(\cdot)\|_{\mathcal{H}_K}^2 = \infty$.

In fact: $\theta \sim \mathcal{N}(0, I) \Rightarrow \theta^T\phi(\cdot) \in \mathcal{H}_K$ with probability zero.

So be careful out there.

A hint that things are amiss

Assume: $\theta \sim \mathcal{N}(0, I)$, with $\{\phi_i\}_{i=1}^{\infty}$ an orthonormal basis in $\mathcal{H}_K$.

$\mathbb{E}\|\theta^T\phi\|_{\mathcal{H}_K}^2 = \mathbb{E}\Big\|\sum_{i=1}^{\infty}\theta_i\phi_i\Big\|_{\mathcal{H}_K}^2 = \sum_{i=1}^{\infty}\mathbb{E}\,\theta_i^2 = \sum_{i=1}^{\infty} 1 = \infty$
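A tiny simulation (NumPy, my construction) makes the divergence concrete: for $\theta \sim \mathcal{N}(0, I_p)$, the partial sums $\sum_{i \le p} \theta_i^2$ grow roughly linearly in p, so the squared norm has no finite limit as $p \to \infty$.

```python
import numpy as np

rng = np.random.default_rng(4)
for p in [10, 100, 1000, 10000]:
    theta = rng.normal(size=p)          # theta ~ N(0, I_p)
    print(p, np.sum(theta**2))          # grows like p: E sum_i theta_i^2 = p
```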
