

Bayesian Linear Regression

Sargur Srihari, [email protected]

Topics in Bayesian Regression

• Recall Max Likelihood Linear Regression
• Parameter Distribution
• Predictive Distribution
• Equivalent Kernel

Linear Regression: Model Complexity M

• Polynomial regression:
    y(x,w) = w_0 + w_1 x + w_2 x^2 + ... + w_M x^M = \sum_{j=0}^{M} w_j x^j
  – Red lines are the best fits with M = 0, 1, 3, 9 and N = 10

  (Figure: M = 0 and M = 1 give poor representations of sin(2πx); M = 3 gives the best fit; M = 9 over-fits.)

Max Likelihood Regression

• Input vector x, basis functions {φ_0(x), ..., φ_{M-1}(x)}:
    y(x,w) = \sum_{j=0}^{M-1} w_j φ_j(x) = w^T φ(x)
  – Radial basis functions: φ_j(x) = exp( -(1/2) (x - μ_j)^T Σ^{-1} (x - μ_j) )
• Objective function (max likelihood with N examples {x_1, .., x_N}; equivalent to the squared-error objective):
    E(w) = (1/2) \sum_{n=1}^{N} { t_n - w^T φ(x_n) }^2

• Regularized MSE with N examples (λ is the regularization coefficient):
    E(w) = (1/2) \sum_{n=1}^{N} { t_n - w^T φ(x_n) }^2 + (λ/2) w^T w
• The closed-form ML solution is
    w_ML = (Φ^T Φ)^{-1} Φ^T t
  – where Φ is the N × M design matrix with elements Φ_{nj} = φ_j(x_n), and (Φ^T Φ)^{-1} Φ^T is the Moore-Penrose pseudo-inverse
• The regularized solution is
    w = (λI + Φ^T Φ)^{-1} Φ^T t
• Gradient descent: w^{(τ+1)} = w^{(τ)} - η ∇E, with
    ∇E = - \sum_{n=1}^{N} { t_n - w^{(τ)T} φ(x_n) } φ(x_n)
  Regularized version:
    ∇E = - \sum_{n=1}^{N} { t_n - w^{(τ)T} φ(x_n) } φ(x_n) + λ w^{(τ)}
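As an illustration, here is a minimal NumPy sketch of the closed-form ML and regularized solutions. It is not from the slides; the polynomial basis, the toy sin(2πx) data, and the chosen values of M and λ are assumptions for the example.

```python
import numpy as np

def design_matrix(x, M):
    """Polynomial design matrix: Phi[n, j] = x_n ** j, j = 0..M-1 (illustrative basis choice)."""
    return np.vander(x, M, increasing=True)

def w_ml(Phi, t):
    """Closed-form maximum-likelihood solution w_ML = (Phi^T Phi)^-1 Phi^T t (via pseudo-inverse)."""
    return np.linalg.pinv(Phi) @ t

def w_regularized(Phi, t, lam):
    """Regularized (ridge) solution w = (lam*I + Phi^T Phi)^-1 Phi^T t."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

# Toy data: noisy samples of sin(2*pi*x)
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=10)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=10)
Phi = design_matrix(x, M=10)              # M = 10 coefficients (degree-9 polynomial)
print(w_ml(Phi, t))                       # tends to over-fit for M = 10, N = 10
print(w_regularized(Phi, t, lam=1e-3))    # regularization tames the fit
```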

Shortcomings of MLE

• The M.L.E. of w does not address M (model complexity): how many basis functions should be used?
  – Over-fitting is controlled by the data set size N: more data allows a better fit without over-fitting
  – Regularization also controls over-fitting (λ controls the effect):
      E(w) = E_D(w) + λ E_W(w),  where  E_D(w) = (1/2) \sum_{n=1}^{N} { t_n - w^T φ(x_n) }^2  and  E_W(w) = (1/2) w^T w

• But M and the choice of the φ_j are still important
  – M can be determined by a holdout set, but this is wasteful of data
• Model complexity and over-fitting are better handled using the Bayesian approach

Bayesian Linear Regression

• Using Bayes rule, the posterior is proportional to Likelihood × Prior:
    p(w | t) = p(t | w) p(w) / p(t)

  – where p(t | w) is the likelihood of the observed data
  – and p(w) is the prior distribution over the parameters
• We will look at:
  – A Gaussian prior p(w)
  – A likelihood p(t | w) that is a product of Gaussians based on the noise model

  – And conclude that the posterior is also Gaussian

Gaussian Prior over Parameters

Assume a multivariate Gaussian prior for w (which has components w_0, .., w_{M-1}):

    p(w) = N(w | m_0, S_0)

with mean m_0 and covariance matrix S_0. If we choose S_0 = α^{-1} I, it means that the variances of the weights are all equal to α^{-1} and their covariances are zero.

(Figure: contours of p(w) with zero mean (m_0 = 0) and isotropic covariance over the weights (same variances), plotted in (w_0, w_1) space.)

Likelihood of Data is Gaussian

• Assume noise precision parameter β

    t = y(x,w) + ε,  where ε is Gaussian noise, defined probabilistically as  p(t | x, w, β) = N(t | y(x,w), β^{-1})
  Note that the output t is a scalar.

• Likelihood of t ={t1,..,tN} is then

    p(t | X, w, β) = \prod_{n=1}^{N} N( t_n | w^T φ(x_n), β^{-1} )

  – This is the probability of the target data t given the parameters w and the inputs X = {x_1, .., x_N}
  – Due to the Gaussian noise, the likelihood p(t | w) is also Gaussian

Posterior Distribution is also Gaussian

• Prior: p(w) = N(w | m_0, S_0), i.e., it is Gaussian
• The likelihood comes from Gaussian noise:
    p(t | X, w, β) = \prod_{n=1}^{N} N( t_n | w^T φ(x_n), β^{-1} )
• It follows that the posterior p(w | t) is also Gaussian
• Proof: use a standard result for Gaussians:
  – If the marginal p(w) and the conditional p(t | w) have Gaussian forms, then the marginal p(t) and the conditional p(w | t) are also Gaussian:
  – Let p(w) = N(w | µ, Λ^{-1}) and p(t | w) = N(t | Aw + b, L^{-1})
  – Then the marginal is p(t) = N(t | Aµ + b, L^{-1} + A Λ^{-1} A^T) and the conditional is p(w | t) = N(w | Σ{A^T L(t - b) + Λµ}, Σ), where Σ = (Λ + A^T L A)^{-1}

Exact Form of the Posterior Distribution

• We have p(w) = N(w | m_0, S_0) and p(t | X, w, β) = \prod_{n=1}^{N} N( t_n | w^T φ(x_n), β^{-1} )
• The posterior is also Gaussian, and can be written directly as

    p(w | t) = N(w | m_N, S_N)

  – where m_N is the mean of the posterior, given by  m_N = S_N ( S_0^{-1} m_0 + β Φ^T t )
  – and S_N is the covariance of the posterior, given by  S_N^{-1} = S_0^{-1} + β Φ^T Φ
  – Φ is the N × M design matrix with elements Φ_{nj} = φ_j(x_n)
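A minimal NumPy sketch of these posterior formulas (the straight-line data, the zero-mean isotropic prior, and the parameter values are assumptions for illustration, not part of the slides):

```python
import numpy as np

def posterior(Phi, t, beta, m0, S0):
    """Posterior p(w|t) = N(w | m_N, S_N) for Gaussian prior N(w | m0, S0) and noise precision beta."""
    S0_inv = np.linalg.inv(S0)
    SN_inv = S0_inv + beta * Phi.T @ Phi         # S_N^{-1} = S_0^{-1} + beta * Phi^T Phi
    SN = np.linalg.inv(SN_inv)
    mN = SN @ (S0_inv @ m0 + beta * Phi.T @ t)   # m_N = S_N (S_0^{-1} m_0 + beta * Phi^T t)
    return mN, SN

# Example with a zero-mean isotropic prior S0 = (1/alpha) * I
alpha, beta = 2.0, 25.0
x = np.linspace(-1, 1, 20)
Phi = np.column_stack([np.ones_like(x), x])      # basis {1, x} for a straight-line fit
t = -0.3 + 0.5 * x + np.random.default_rng(1).normal(0, 0.2, size=x.shape)
mN, SN = posterior(Phi, t, beta, m0=np.zeros(2), S0=np.eye(2) / alpha)
print(mN)   # close to the generating parameters (-0.3, 0.5)
```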

(Figure: prior p(w | α) = N(w | 0, α^{-1} I) and posterior in (w_0, w_1) weight space, for a scalar input x and y(x,w) = w_0 + w_1 x.)

Properties of the Posterior

1. Since the posterior p(w | t) = N(w | m_N, S_N) is Gaussian, its mode coincides with its mean

  – Thus the maximum-posterior weight vector is w_MAP = m_N
2. Infinitely broad prior S_0 = α^{-1} I, i.e., precision α → 0:

  – Then the mean m_N reduces to the maximum likelihood value, i.e., the mean is the solution vector  w_ML = (Φ^T Φ)^{-1} Φ^T t
3. If N = 0, the posterior reverts to the prior
4. If data points arrive sequentially, then the posterior at any stage acts as the prior distribution for the subsequent data points
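A minimal NumPy sketch of property 4 (the zero-mean isotropic prior, straight-line model, and parameter values are assumptions borrowed from the example later in this deck); it checks that one-point-at-a-time updates reproduce the batch posterior:

```python
import numpy as np

def posterior_update(phi_n, t_n, beta, m_prev, S_prev):
    """One sequential Bayesian update: the previous posterior N(m_prev, S_prev) acts as the prior."""
    S_prev_inv = np.linalg.inv(S_prev)
    S_new = np.linalg.inv(S_prev_inv + beta * np.outer(phi_n, phi_n))
    m_new = S_new @ (S_prev_inv @ m_prev + beta * phi_n * t_n)
    return m_new, S_new

alpha, beta = 2.0, 25.0
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 20)
t = -0.3 + 0.5 * x + rng.normal(0, 0.2, size=x.shape)

m, S = np.zeros(2), np.eye(2) / alpha            # start from the prior N(0, alpha^{-1} I)
for xn, tn in zip(x, t):
    m, S = posterior_update(np.array([1.0, xn]), tn, beta, m, S)

# Batch posterior for comparison: m_N = beta * S_N * Phi^T t,  S_N^{-1} = alpha*I + beta*Phi^T Phi
Phi = np.column_stack([np.ones_like(x), x])
S_batch = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
m_batch = beta * S_batch @ Phi.T @ t
print(np.allclose(m, m_batch))                    # True: sequential and batch updates agree
```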

Choose a Simple Gaussian Prior p(w)

• Zero-mean (m_0 = 0), isotropic (same variances) Gaussian with a single precision parameter α:
    p(w | α) = N(w | 0, α^{-1} I)
  – illustrated in (w_0, w_1) space for y(x,w) = w_0 + w_1 x
• The corresponding posterior distribution is

    p(w | t) = N(w | m_N, S_N),  where  m_N = β S_N Φ^T t  and  S_N^{-1} = α I + β Φ^T Φ
  Note: β is the noise precision and α is the precision of the parameter prior. (Figure: with infinitely many samples the posterior concentrates on a point estimate of w.)

Equivalence to MLE with Regularization

• Since  p(t | X, w, β) = \prod_{n=1}^{N} N( t_n | w^T φ(x_n), β^{-1} )  and  p(w | α) = N(w | 0, α^{-1} I)
• we have
    p(w | t) ∝ \prod_{n=1}^{N} N( t_n | w^T φ(x_n), β^{-1} ) · N(w | 0, α^{-1} I)
• The log of the posterior is
    ln p(w | t) = - (β/2) \sum_{n=1}^{N} { t_n - w^T φ(x_n) }^2 - (α/2) w^T w + const
• Thus maximization of the posterior is equivalent to minimization of the sum-of-squares error
    E(w) = (1/2) \sum_{n=1}^{N} { t_n - w^T φ(x_n) }^2 + (λ/2) w^T w
  with the addition of the quadratic regularization term w^T w, with λ = α / β
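A hedged numerical sketch of this equivalence (the straight-line data and the values of α and β are assumptions): the posterior mean m_N should coincide with the ridge solution for λ = α/β.

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, beta = 2.0, 25.0
x = rng.uniform(-1, 1, 30)
t = -0.3 + 0.5 * x + rng.normal(0, 0.2, size=x.shape)
Phi = np.column_stack([np.ones_like(x), x])

# Bayesian posterior mean (MAP estimate)
SN = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
m_N = beta * SN @ Phi.T @ t

# Regularized least squares with lambda = alpha / beta
lam = alpha / beta
w_ridge = np.linalg.solve(lam * np.eye(2) + Phi.T @ Phi, Phi.T @ t)

print(np.allclose(m_N, w_ridge))   # True
```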

Bayesian Linear Regression Example (Straight-Line Fit)

• Single input variable x
• Single target variable t
• Goal is to fit

    y(x,w) = w_0 + w_1 x
• The goal of linear regression is to recover

    w = [w_0, w_1] given the samples

Data Generation

• Data t is generated from f(x,w) = w_0 + w_1 x with parameter values

    w_0 = -0.3 and w_1 = 0.5

– First choose xn from U(x|-1,1), then evaluate f(xn,w)

  – Add Gaussian noise with standard deviation 0.2 to get the target t_n
  – Precision parameter β = (1/0.2)^2 = 25
• For the prior over w we choose α = 2:  p(w | α) = N(w | 0, α^{-1} I)
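A minimal NumPy sketch of this data-generating process (the sample size N and the random seed are assumptions; the parameter values follow the slide):

```python
import numpy as np

rng = np.random.default_rng(42)
w_true = np.array([-0.3, 0.5])            # true parameters (w0, w1)
noise_std = 0.2                           # => beta = (1/0.2)**2 = 25

N = 20
x = rng.uniform(-1, 1, N)                 # x_n ~ U(-1, 1)
f = w_true[0] + w_true[1] * x             # f(x, w) = w0 + w1 * x
t = f + rng.normal(0, noise_std, N)       # add Gaussian noise to get targets t_n

alpha, beta = 2.0, 1.0 / noise_std**2     # prior precision alpha = 2, noise precision beta = 25
print(x[:3], t[:3])
```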

p(w) and p(w|t)

• Each sample of w represents a straight line y(x,w) = w_0 + w_1 x in data space; the distribution over lines is modified as examples are observed
  (Figure: left, the distribution in (w_0, w_1) space; right, six sampled lines. Top row: prior p(w), with no examples observed. Bottom row: posterior p(w|t) after two examples.)

• Goal of Bayesian linear regression: determine p(w|t)

Sequential Bayesian Learning

• Since there are only two parameters (w_0, w_1), we can plot the prior and posterior distributions in parameter space
• We look at the sequential update of the posterior as data points (x_n, t_n) are observed one at a time
  (Figure, one row per stage: left, the likelihood p(t|x,w) of the single most recent data point, plotted as a function of w; middle, the prior/posterior p(w|t), with the true parameter value marked by a white cross; right, six regression functions y(x,w) with w drawn from the current posterior, plotted in data space (x, t). Rows: before any data is observed; after the first data point, where the band of sampled lines passes near the observed point (x_1, t_1); after the second data point; after twenty data points. With infinitely many points the posterior becomes a delta function centered at the true parameter values.)

Generalization of the Gaussian Prior

• The Gaussian prior over parameters is
    p(w | α) = N(w | 0, α^{-1} I)
  Maximization of the posterior ln p(w|t) is then equivalent to minimization of the sum-of-squares error
    E(w) = (1/2) \sum_{n=1}^{N} { t_n - w^T φ(x_n) }^2 + (λ/2) w^T w
• Other priors yield variations:
    p(w | α) = [ (q/2) (α/2)^{1/q} / Γ(1/q) ]^M exp( - (α/2) \sum_{j=1}^{M} |w_j|^q )
  – q = 2 corresponds to the Gaussian
  – This corresponds to minimization of the regularized error function
      (1/2) \sum_{n=1}^{N} { t_n - w^T φ(x_n) }^2 + (λ/2) \sum_{j=1}^{M} |w_j|^q

Predictive Distribution

• Usually we are not interested in the value of w itself
• But in predicting t for a new value of x:  p(t | t, X, x), or p(t | t)
  – leaving out the conditioning variables X and x for convenience
• Marginalizing over the parameter variable w is the standard Bayesian approach
  – Sum rule of probability:  p(t) = ∫ p(t, w) dw = ∫ p(t | w) p(w) dw
  – We can now write  p(t | t) = ∫ p(t | w) p(w | t) dw

Predictive Distribution with α, β, x, t

• We can predict t for a new value of x using
    p(t | t) = ∫ p(t | w) p(w | t) dw
  (We have left out the conditioning variables X and x for convenience, and applied the sum rule of probability p(t) = ∫ p(t | w) p(w) dw.)
  – With explicit dependence on the prior parameter α, the noise parameter β, and the training-set targets t:
    p(t | t, α, β) = ∫ p(t | w, β) · p(w | t, α, β) dw

  – The first factor is the conditional of the target t given the weights w:  p(t | x, w, β) = N(t | y(x,w), β^{-1})
  – The second factor is the posterior of the weights w:  p(w | t) = N(w | m_N, S_N), where m_N = β S_N Φ^T t and S_N^{-1} = α I + β Φ^T Φ
• The RHS is a convolution of two Gaussian distributions, whose result is the Gaussian:

    p(t | x, t, α, β) = N( t | m_N^T φ(x), σ_N^2(x) ),  where  σ_N^2(x) = 1/β + φ(x)^T S_N φ(x)
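A minimal NumPy sketch of this predictive formula (the straight-line data, query point, and parameter values are assumptions for illustration):

```python
import numpy as np

def predictive(phi_new, mN, SN, beta):
    """Predictive mean and variance at a new input, given posterior N(mN, SN) and noise precision beta."""
    mean = mN @ phi_new                              # m_N^T phi(x)
    var = 1.0 / beta + phi_new @ SN @ phi_new        # 1/beta + phi(x)^T S_N phi(x)
    return mean, var

alpha, beta = 2.0, 25.0
rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 25)
t = -0.3 + 0.5 * x + rng.normal(0, 0.2, size=x.shape)
Phi = np.column_stack([np.ones_like(x), x])
SN = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
mN = beta * SN @ Phi.T @ t

mean, var = predictive(np.array([1.0, 0.3]), mN, SN, beta)   # predict at x = 0.3
print(mean, np.sqrt(var))
```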

Variance of the Predictive Distribution

• The predictive distribution is a Gaussian:
    p(t | x, t, α, β) = N( t | m_N^T φ(x), σ_N^2(x) ),  where  σ_N^2(x) = 1/β + φ(x)^T S_N φ(x)
  Since the noise process and the distribution of w are independent Gaussians, their variances are additive:

  – the first term, 1/β, represents the noise in the data
  – the second term, φ(x)^T S_N φ(x), represents the uncertainty associated with the parameters w, where S_N (with S_N^{-1} = α I + β Φ^T Φ) is the covariance of the posterior p(w|t)
• Since σ_{N+1}^2(x) ≤ σ_N^2(x), the predictive distribution becomes narrower as the number of samples increases
• As N → ∞, the second term of the variance goes to zero, and the variance of the predictive distribution arises solely from the additive noise governed by the parameter β

Example of the Predictive Distribution

• Data generated from sin(2πx)
• Model: nine Gaussian basis functions
    φ_j(x) = exp( -(x - μ_j)^2 / (2σ^2) ),   y(x,w) = \sum_{j=0}^{8} w_j φ_j(x) = w^T φ(x)
• Predictive distribution
    p(t | x, t, α, β) = N( t | m_N^T φ(x), σ_N^2(x) ),  where  σ_N^2(x) = 1/β + φ(x)^T S_N φ(x)

  (Figure: plot of p(t|x) for one data point, showing the mean (red) and one standard deviation (pink), where m_N = β S_N Φ^T t, S_N^{-1} = α I + β Φ^T Φ, and α and β come from the assumptions p(w|α) = N(w | 0, α^{-1} I) and p(t | x, w, β) = N(t | y(x,w), β^{-1}).)
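A hedged NumPy sketch of this example, building a nine-Gaussian-basis design matrix and the predictive mean and standard-deviation curves for sin(2πx) data (the basis centers, width s, sample size, and α, β values are assumptions, not taken from the slides):

```python
import numpy as np

def gaussian_basis(x, centers, s):
    """Design matrix of Gaussian basis functions: Phi[n, j] = exp(-(x_n - mu_j)^2 / (2 s^2))."""
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))

rng = np.random.default_rng(4)
alpha, beta = 2.0, 25.0
centers = np.linspace(0, 1, 9)            # nine basis-function centers (assumed equally spaced)
s = 0.1                                   # assumed basis width

x = rng.uniform(0, 1, 25)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=x.shape)
Phi = gaussian_basis(x, centers, s)

SN = np.linalg.inv(alpha * np.eye(9) + beta * Phi.T @ Phi)
mN = beta * SN @ Phi.T @ t

x_grid = np.linspace(0, 1, 100)
Phi_grid = gaussian_basis(x_grid, centers, s)
pred_mean = Phi_grid @ mN                                          # m_N^T phi(x)
pred_std = np.sqrt(1 / beta + np.sum(Phi_grid @ SN * Phi_grid, axis=1))
print(pred_mean[:3], pred_std[:3])
```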

Predictive Distribution Variance

• Bayesian predictive distribution:
    p(t | x, t, α, β) = N( t | m_N^T φ(x), σ_N^2(x) ),  where  σ_N^2(x) = 1/β + φ(x)^T S_N φ(x)
  – we have assumed a Gaussian prior over the parameters, p(w | α) = N(w | 0, α^{-1} I)
  – a Gaussian noise model, p(t | x, w, β) = N(t | y(x,w), β^{-1})
  – and S_N^{-1} = α I + β Φ^T Φ, where Φ is the design matrix with elements Φ_{nj} = φ_j(x_n)
  (Figure: mean of the Gaussian predictive distribution (red) with a one-standard-deviation band, using data from sin(2πx), for N = 1, 2, 4, 25 observed points. σ_N(x), the standard deviation of t, is smallest in the neighborhood of the data points.)

• Uncertainty decreases as more data points are observed
• The plot only shows the point-wise predictive variance
  – To show the covariance between predictions at different values of x, draw samples from the posterior distribution over w, p(w|t), and plot the corresponding functions y(x,w)

Plots of the Function y(x,w)

• Draw samples w from the posterior distribution p(w|t),

    p(w | t) = N(w | m_N, S_N),
  and plot the samples of y(x,w) = w^T φ(x)  (shown for N = 1, 2, 4, 25)
• This shows the covariance between predictions at different values of x
  – For a given function and a pair x, x', the values of y, y' are determined by k(x,x'), which in turn is determined by the samples
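A minimal sketch of this sampling procedure, continuing the assumed Gaussian-basis setup from the earlier sketch (all parameter choices are illustrative assumptions):

```python
import numpy as np

def gaussian_basis(x, centers, s):
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))

rng = np.random.default_rng(5)
alpha, beta, centers, s = 2.0, 25.0, np.linspace(0, 1, 9), 0.1   # assumed values
x = rng.uniform(0, 1, 25)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=x.shape)
Phi = gaussian_basis(x, centers, s)
SN = np.linalg.inv(alpha * np.eye(9) + beta * Phi.T @ Phi)
mN = beta * SN @ Phi.T @ t

# Draw six weight vectors from the posterior p(w|t) = N(m_N, S_N) and evaluate y(x,w) = w^T phi(x)
x_grid = np.linspace(0, 1, 100)
Phi_grid = gaussian_basis(x_grid, centers, s)
w_samples = rng.multivariate_normal(mN, SN, size=6)
y_samples = w_samples @ Phi_grid.T         # each row is one sampled regression function

# The spread of y_samples at each x, and its correlation across x values, visualizes the joint uncertainty
print(y_samples.std(axis=0)[:5])
```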

Disadvantage of a Local Basis

• Predictive distribution, assuming a Gaussian prior

  – p(w | α) = N(w | 0, α^{-1} I), and Gaussian noise t = y(x,w) + ε
  – where the noise is defined probabilistically as p(t | x, w, β) = N(t | y(x,w), β^{-1})

    p(t | x, t, α, β) = N( t | m_N^T φ(x), σ_N^2(x) ),  where  σ_N^2(x) = 1/β + φ(x)^T S_N φ(x)  and  S_N^{-1} = α I + β Φ^T Φ
• With localized basis functions, e.g., Gaussians:
  – at regions away from the basis-function centers, the contribution of the second term of the variance σ_N^2(x) goes to zero, leaving only the noise contribution β^{-1}
• The model thus becomes very confident outside of the region occupied by the basis functions
  – This problem is avoided by an alternative Bayesian approach,

  that of Gaussian Processes

Dealing with Unknown β

• If both w and β are treated as unknown, then we can introduce a conjugate prior distribution p(w, β), which is given by a Gaussian-gamma distribution:
    p(µ, λ) = N( µ | µ_0, (βλ)^{-1} ) Gam( λ | a, b )

– In this case the predictive distribution is a Student’s t-distribution

    St(x | µ, λ, ν) = [ Γ(ν/2 + 1/2) / Γ(ν/2) ] ( λ / (πν) )^{1/2} [ 1 + λ(x - µ)^2 / ν ]^{-ν/2 - 1/2}

Mean of p(w|t) has a Kernel Interpretation

• The regression function is:
    y(x,w) = \sum_{j=0}^{M-1} w_j φ_j(x) = w^T φ(x)
• If we take a Bayesian approach with Gaussian prior

    p(w) = N(w | m_0, S_0), then we have:

  – Posterior: p(w|t) = N(w | m_N, S_N), where
      m_N = S_N ( S_0^{-1} m_0 + β Φ^T t )
      S_N^{-1} = S_0^{-1} + β Φ^T Φ
• With the zero-mean isotropic prior p(w | α) = N(w | 0, α^{-1} I):
      m_N = β S_N Φ^T t
      S_N^{-1} = α I + β Φ^T Φ
• The posterior mean β S_N Φ^T t has a kernel interpretation

  – This sets the stage for kernel methods and Gaussian processes

Equivalent Kernel

• The posterior mean of w is m_N = β S_N Φ^T t
  – where S_N^{-1} = S_0^{-1} + β Φ^T Φ,

• S_0 is the covariance matrix of the prior p(w), β is the noise parameter, and Φ is the design matrix that depends on the samples
• Substitute the mean value into the regression function
    y(x,w) = \sum_{j=0}^{M-1} w_j φ_j(x) = w^T φ(x)
• The mean of the predictive distribution at a point x is

    y(x, m_N) = m_N^T φ(x) = β φ(x)^T S_N Φ^T t = \sum_{n=1}^{N} β φ(x)^T S_N φ(x_n) t_n = \sum_{n=1}^{N} k(x, x_n) t_n

  – where k(x,x') = β φ(x)^T S_N φ(x') is the equivalent kernel
• Thus the mean of the predictive distribution is a linear

combination of training set target variables tn

  – Note: the equivalent kernel depends on the input values x_n from the dataset because they appear in S_N

Kernel Function

• Regression functions of the form
    y(x, m_N) = \sum_{n=1}^{N} k(x, x_n) t_n,   with  k(x,x') = β φ(x)^T S_N φ(x')  and  S_N^{-1} = S_0^{-1} + β Φ^T Φ
  that take a linear combination of the training set target values are known as linear smoothers

• They depend on the input values x_n from the dataset since they appear in the definition of S_N
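A hedged NumPy sketch of the equivalent kernel as a linear smoother (Gaussian basis, basis width, and α, β are assumptions; the 200 equally spaced inputs in (-1, 1) follow the example below). It checks that the kernel-weighted combination of targets equals m_N^T φ(x):

```python
import numpy as np

def gaussian_basis(x, centers, s):
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))

rng = np.random.default_rng(6)
alpha, beta, centers, s = 2.0, 25.0, np.linspace(-1, 1, 9), 0.2
x = np.linspace(-1, 1, 200)                       # 200 equally spaced inputs, as in the example below
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=x.shape)
Phi = gaussian_basis(x, centers, s)
SN = np.linalg.inv(alpha * np.eye(9) + beta * Phi.T @ Phi)
mN = beta * SN @ Phi.T @ t

x_star = np.array([0.3])
phi_star = gaussian_basis(x_star, centers, s)[0]
k = beta * phi_star @ SN @ Phi.T                  # k(x*, x_n) for every training input x_n

print(np.allclose(k @ t, mN @ phi_star))          # smoother prediction equals m_N^T phi(x*): True
```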

Example of the Kernel for a Gaussian Basis

• Gaussian basis:  φ_j(x) = exp( -(x - µ_j)^2 / (2 s^2) )
• Equivalent kernel:  k(x, x') = β φ(x)^T S_N φ(x'),  with  S_N^{-1} = S_0^{-1} + β Φ^T Φ  and design matrix elements Φ_{nj} = φ_j(x_n)
• The kernel can be used directly in regression: the mean of the predictive distribution is
    y(x, m_N) = \sum_{n=1}^{N} k(x, x_n) t_n
  obtained by forming a weighted combination of the target values, where data points close to x are given higher weight than points further removed from x
  (Figure: left, the Gaussian basis functions φ(x); right, k(x, x') plotted as a function of x and x', together with slices of k(x, x') for three values of x. The kernels are localized around x, i.e., k peaks when x = x'. The data set used to generate the kernel was 200 values of x equally spaced in (-1, 1).)

• Polynomial basis:  φ_j(x) = x^j

• Equivalent kernel:  k(x, x') = β φ(x)^T S_N φ(x'),  with  S_N^{-1} = S_0^{-1} + β Φ^T Φ
  (Figure: k(x, x') plotted as a function of x' for x = 0. Data points close to x are given higher weight than points further removed from x.)

• It is a localized function of x' even though the corresponding basis function is nonlocal

Equivalent Kernel for a Sigmoidal Basis Function

• Sigmoidal basis:  φ_j(x) = σ( (x - µ_j) / s ),  where  σ(a) = 1 / (1 + exp(-a))

• Equivalent kernel:  k(x, x') = β φ(x)^T S_N φ(x')

Localized function of x’ even though corresponding basis function is nonlocal

Covariance between y(x) and y(x')

An important insight: the value of the kernel function between two points is directly related to the covariance between the corresponding function values:
    cov[ y(x), y(x') ] = cov[ φ(x)^T w, w^T φ(x') ]
                       = φ(x)^T S_N φ(x')
                       = β^{-1} k(x, x')
  where we have used p(w|t) = N(w | m_N, S_N) and k(x, x') = β φ(x)^T S_N φ(x').
From the form of the equivalent kernel k(x, x'), the predictive means at nearby points y(x), y(x') will be highly correlated; for more distant pairs the correlation is smaller.

The kernel captures the covariance
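A hedged numerical check of this relation under the assumed Gaussian-basis setup: the analytic covariance φ(x)^T S_N φ(x') is compared with an empirical covariance over many draws of w from the posterior.

```python
import numpy as np

def gaussian_basis(x, centers, s):
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))

rng = np.random.default_rng(7)
alpha, beta, centers, s = 2.0, 25.0, np.linspace(-1, 1, 9), 0.2
x = np.linspace(-1, 1, 200)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=x.shape)
Phi = gaussian_basis(x, centers, s)
SN = np.linalg.inv(alpha * np.eye(9) + beta * Phi.T @ Phi)
mN = beta * SN @ Phi.T @ t

xa, xb = np.array([0.1]), np.array([0.15])
phi_a = gaussian_basis(xa, centers, s)[0]
phi_b = gaussian_basis(xb, centers, s)[0]

analytic = phi_a @ SN @ phi_b                       # phi(x)^T S_N phi(x') = (1/beta) k(x, x')
w_samples = rng.multivariate_normal(mN, SN, size=200_000)
empirical = np.cov(w_samples @ phi_a, w_samples @ phi_b)[0, 1]
print(analytic, empirical)                          # the two agree up to sampling error
```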

Predictive Plot vs. Posterior Plots

• Predictive distribution
  – allows us to visualize the pointwise uncertainty in the predictions, governed by

    p(t | x, t, α, β) = N( t | m_N^T φ(x), σ_N^2(x) ),  where  σ_N^2(x) = 1/β + φ(x)^T S_N φ(x)
• Drawing samples from the posterior p(w|t)
  – and plotting the corresponding functions y(x,w), we visualize the joint uncertainty in the posterior distribution between the y values at two or more x values, as governed by the kernel

Directly Specifying the Kernel Function

• The formulation of linear regression in terms of a kernel function suggests an alternative approach to regression:
  – Instead of introducing a set of basis functions, which implicitly determines an equivalent kernel,
  – directly define kernel functions and use them to make predictions for a new input x, given the observation set
• This leads to a practical framework for regression (and classification) called Gaussian Processes

Summing Kernel Values over Samples

• The effective kernel defines the weights by which
  – the target values are combined to make a prediction at x
• It can be shown that the weights sum to one, i.e.,
    \sum_{n=1}^{N} k(x, x_n) = 1
  for all values of x, where k(x, x') = β φ(x)^T S_N φ(x') and S_N^{-1} = S_0^{-1} + β Φ^T Φ
• This result can be proven intuitively:
  – Since y(x, m_N) = \sum_{n=1}^{N} k(x, x_n) t_n, the summation is equivalent to considering the predictive mean ŷ(x) for a set of target data

  in which t_n = 1 for all n
• Provided the basis functions are linearly independent, N > M, and one of the basis functions is constant (corresponding to the bias parameter), then we can fit the training data exactly, and hence ŷ(x) = 1
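A hedged numerical check of the summation property. It assumes a nearly flat prior (very small α) so that the all-ones target data can be fit essentially exactly, as the intuitive argument above requires; the polynomial basis and sizes are illustrative assumptions.

```python
import numpy as np

alpha, beta = 1e-8, 25.0
x = np.linspace(-1, 1, 50)                         # N = 50 > M
Phi = np.vander(x, 4, increasing=True)             # M = 4 polynomial basis, first column constant (bias)
SN = np.linalg.inv(alpha * np.eye(4) + beta * Phi.T @ Phi)

x_star = 0.37                                      # arbitrary query point
phi_star = np.vander(np.array([x_star]), 4, increasing=True)[0]
k = beta * phi_star @ SN @ Phi.T                   # k(x*, x_n) for all n
print(k.sum())                                     # approximately 1
```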

Kernel Function Properties

• The equivalent kernel can be positive or negative:
    k(x, x') = β φ(x)^T S_N φ(x')
  – Although it satisfies a summation constraint, the corresponding predictions are not necessarily a convex combination of the training set target variables
• The equivalent kernel satisfies an important property shared by kernel functions in general:
  – It can be expressed in the form of an inner product with respect to a vector ψ(x) of nonlinear functions:
      k(x, z) = ψ(x)^T ψ(z),  where  ψ(x) = β^{1/2} S_N^{1/2} φ(x)
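A hedged numerical check of this inner-product factorization, using a symmetric matrix square root of S_N obtained by eigendecomposition (the Gaussian basis and parameter values are assumptions for illustration):

```python
import numpy as np

def gaussian_basis(x, centers, s):
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))

alpha, beta, centers, s = 2.0, 25.0, np.linspace(-1, 1, 9), 0.2
x = np.linspace(-1, 1, 200)
Phi = gaussian_basis(x, centers, s)
SN = np.linalg.inv(alpha * np.eye(9) + beta * Phi.T @ Phi)

# Symmetric square root S_N^{1/2} via eigendecomposition (S_N is symmetric positive definite)
evals, evecs = np.linalg.eigh(SN)
SN_half = evecs @ np.diag(np.sqrt(evals)) @ evecs.T

def psi(xq):
    return np.sqrt(beta) * SN_half @ gaussian_basis(np.atleast_1d(xq), centers, s)[0]

xa, xb = 0.1, -0.4
k_direct = beta * gaussian_basis(np.array([xa]), centers, s)[0] @ SN @ gaussian_basis(np.array([xb]), centers, s)[0]
print(np.isclose(psi(xa) @ psi(xb), k_direct))     # True: k(x,z) = psi(x)^T psi(z)
```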