Chris Bishop's PRML Ch. 3: Linear Models of Regression

Mathieu Guillaumin & Radu Horaud, October 25, 2007

Chapter content

• An example of regression, polynomial curve fitting, was considered in Ch. 1.
• This chapter studies regression with a linear combination of a fixed set of nonlinear functions, the basis functions.
• Supervised learning: N observations {x_n} with corresponding target values {t_n} are provided. The goal is to predict t for a new value of x.
• Construct a function y(x) whose value is a prediction of t.
• Probabilistic perspective: model the predictive distribution p(t|x).

[Figure 1.16, page 29: the conditional density p(t|x_0, w, β), a Gaussian of width 2σ centred on the regression function y(x_0, w).]

The chapter section by section

3.1 Linear basis function models
  • Maximum likelihood and least squares
  • Geometry of least squares
  • Sequential learning
  • Regularized least squares
3.2 The bias-variance decomposition
3.3 Bayesian linear regression
  • Parameter distribution
  • Predictive distribution
  • Equivalent kernel
3.4 Bayesian model comparison
3.5 The evidence approximation
3.6 Limitations of fixed basis functions

Linear Basis Function Models

$$y(\mathbf{x}, \mathbf{w}) = \sum_{j=0}^{M-1} w_j \phi_j(\mathbf{x}) = \mathbf{w}^\top \boldsymbol\phi(\mathbf{x})$$

where:
• $\mathbf{w} = (w_0, \ldots, w_{M-1})^\top$ and $\boldsymbol\phi = (\phi_0, \ldots, \phi_{M-1})^\top$, with $\phi_0(\mathbf{x}) = 1$ so that $w_0$ acts as a bias parameter.
• In general $\mathbf{x} \in \mathbb{R}^D$, but it will be convenient to treat the case $x \in \mathbb{R}$.
• We observe the set $X = \{x_1, \ldots, x_n, \ldots, x_N\}$ with corresponding target variables $\mathbf{t} = \{t_n\}$.

Basis function choices

• Polynomial: $\phi_j(x) = x^j$
• Gaussian: $\phi_j(x) = \exp\left(-\frac{(x - \mu_j)^2}{2s^2}\right)$
• Sigmoidal: $\phi_j(x) = \sigma\left(\frac{x - \mu_j}{s}\right)$ with $\sigma(a) = \frac{1}{1 + e^{-a}}$
• splines, Fourier, wavelets, etc.

[Figure: polynomial (left), Gaussian (centre) and sigmoidal (right) basis functions plotted over [−1, 1].]

Maximum likelihood and least squares

$$t = \underbrace{y(\mathbf{x}, \mathbf{w})}_{\text{deterministic}} + \underbrace{\epsilon}_{\text{Gaussian noise}}$$

For an i.i.d. data set we have the likelihood function:

$$p(\mathbf{t} \mid X, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}\big(t_n \mid \underbrace{\mathbf{w}^\top \boldsymbol\phi(\mathbf{x}_n)}_{\text{mean}}, \underbrace{\beta^{-1}}_{\text{var}}\big)$$

We can use the machinery of maximum likelihood to estimate the parameters w and the precision β. Maximizing the log-likelihood with respect to w is equivalent to minimizing the sum-of-squares error, which gives

$$\mathbf{w}_{\mathrm{ML}} = (\Phi^\top \Phi)^{-1} \Phi^\top \mathbf{t}$$

where $\Phi$ is the $N \times M$ design matrix with elements $\Phi_{nj} = \phi_j(\mathbf{x}_n)$, and

$$\beta_{\mathrm{ML}}^{-1} = \frac{1}{N} \sum_{n=1}^{N} \left( t_n - \mathbf{w}_{\mathrm{ML}}^\top \boldsymbol\phi(\mathbf{x}_n) \right)^2$$

Geometry of least squares

[Figure: the least-squares solution y is the orthogonal projection of the target vector t onto the subspace S spanned by the basis vectors ϕ_1 and ϕ_2.]

Sequential learning

Apply a technique known as stochastic gradient descent or sequential gradient descent: instead of minimizing the batch error

$$E_D(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \left( t_n - \mathbf{w}^\top \boldsymbol\phi(\mathbf{x}_n) \right)^2$$

over the whole data set at once, update the weights one observation at a time (η is a learning rate parameter):

$$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} + \eta \underbrace{\left( t_n - \mathbf{w}^{(\tau)\top} \boldsymbol\phi(\mathbf{x}_n) \right) \boldsymbol\phi(\mathbf{x}_n)}_{-\nabla E_n} \quad (3.23)$$

Regularized least squares

The total error function

$$\frac{1}{2} \sum_{n=1}^{N} \left( t_n - \mathbf{w}^\top \boldsymbol\phi(\mathbf{x}_n) \right)^2 + \frac{\lambda}{2} \mathbf{w}^\top \mathbf{w}$$

is minimized by

$$\mathbf{w} = (\lambda I + \Phi^\top \Phi)^{-1} \Phi^\top \mathbf{t}$$

Regularization limits the effective model complexity, so the problem of choosing the appropriate number of basis functions is replaced by that of finding a suitable value of the regularization coefficient λ.
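Both closed-form solutions above are short to implement. The following is a minimal NumPy sketch, not taken from the slides: the helper names (design_matrix, fit_ml, fit_regularized) and the choice of a Gaussian basis are my own.

    import numpy as np

    def design_matrix(x, centers, s):
        """N x M design matrix Phi with Phi[n, j] = phi_j(x_n): a constant
        basis phi_0(x) = 1 followed by Gaussians exp(-(x - mu_j)^2 / 2s^2)."""
        gauss = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))
        return np.hstack([np.ones((x.size, 1)), gauss])

    def fit_ml(Phi, t):
        """Maximum likelihood: w_ML = (Phi^T Phi)^-1 Phi^T t, plus the noise
        precision beta_ML obtained from the mean squared residual."""
        w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)  # stable pseudo-inverse
        beta_ml = 1.0 / np.mean((t - Phi @ w_ml) ** 2)
        return w_ml, beta_ml

    def fit_regularized(Phi, t, lam):
        """Regularized least squares: w = (lambda I + Phi^T Phi)^-1 Phi^T t."""
        M = Phi.shape[1]
        return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

Using np.linalg.lstsq rather than forming $(\Phi^\top\Phi)^{-1}$ explicitly is the usual guard against the ill-conditioning that arises when many basis functions overlap; the regularized system is better conditioned, so a direct solve suffices there.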
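The sequential update (3.23) is equally compact. A sketch of one sweep over the data, with the same caveat that this is illustrative code rather than anything from the slides:

    def lms_epoch(w, Phi, t, eta=0.05):
        """One pass of sequential gradient descent (the least-mean-squares
        rule): w <- w + eta * (t_n - w^T phi_n) * phi_n for each pattern n."""
        for phi_n, t_n in zip(Phi, t):
            w = w + eta * (t_n - w @ phi_n) * phi_n
        return w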
The Bias-Variance Decomposition

• Over-fitting occurs whenever the number of basis functions is large and the training data set is of limited size.
• Limiting the number of basis functions limits the flexibility of the model.
• Regularization can control over-fitting, but raises the question of how to determine λ.
• The bias-variance trade-off is a frequentist viewpoint of model complexity.

Back to section 1.5.5

• The regression loss function: $L(t, y(\mathbf{x})) = (y(\mathbf{x}) - t)^2$
• The decision problem: minimize the expected loss
  $$\mathbb{E}[L] = \iint (y(\mathbf{x}) - t)^2 \, p(\mathbf{x}, t) \, d\mathbf{x} \, dt$$
• Solution: $y(\mathbf{x}) = \int t \, p(t|\mathbf{x}) \, dt = \mathbb{E}_t[t|\mathbf{x}]$
  • this is known as the regression function
  • the conditional average of t conditioned on x, e.g., Figure 1.28, page 47
• Another expression for the expected loss:
  $$\mathbb{E}[L] = \int (y(\mathbf{x}) - \mathbb{E}[t|\mathbf{x}])^2 \, p(\mathbf{x}) \, d\mathbf{x} + \iint (\mathbb{E}[t|\mathbf{x}] - t)^2 \, p(\mathbf{x}, t) \, d\mathbf{x} \, dt \quad (1.90)$$

• The optimal prediction is obtained by minimization of the expected squared loss:
  $$h(\mathbf{x}) = \mathbb{E}[t|\mathbf{x}] = \int t \, p(t|\mathbf{x}) \, dt \quad (3.36)$$
• The expected squared loss decomposes into two terms:
  $$\mathbb{E}[L] = \int (y(\mathbf{x}) - h(\mathbf{x}))^2 \, p(\mathbf{x}) \, d\mathbf{x} + \iint (h(\mathbf{x}) - t)^2 \, p(\mathbf{x}, t) \, d\mathbf{x} \, dt \quad (3.37)$$
• The theoretical minimum of the first term is zero for an appropriate choice of the function y(x) (given unlimited data and unlimited computing power).
• The second term arises from noise in the data and represents the minimum achievable value of the expected squared loss.

An ensemble of data sets

• For any given data set D we obtain a prediction function y(x; D).
• The performance of a particular algorithm is assessed by taking the average over all such data sets, namely E_D[L]. This expands into the following terms (worked out below):
  expected loss = (bias)² + variance + noise
• There is a trade-off between bias and variance:
  • flexible models have low bias and high variance
  • rigid models have high bias and low variance
• Although the bias-variance decomposition provides interesting insight into model complexity, it is of limited practical value because it requires an ensemble of data sets.

[Figure: example with L = 100 data sets of N = 25 points each, fitted with M = 25 Gaussian basis functions for several values of the regularization coefficient; left, individual fits; right, their average.]
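The slides state the decomposition without the intermediate step, so here it is (this is the argument of Bishop's Section 3.2, not text from the slides). Adding and subtracting $\mathbb{E}_D[y(\mathbf{x}; D)]$ inside the first term of (3.37) and taking the expectation over data sets makes the cross term vanish, leaving

$$\mathbb{E}_D\!\left[(y(\mathbf{x}; D) - h(\mathbf{x}))^2\right] = \underbrace{\left(\mathbb{E}_D[y(\mathbf{x}; D)] - h(\mathbf{x})\right)^2}_{(\text{bias})^2} + \underbrace{\mathbb{E}_D\!\left[\left(y(\mathbf{x}; D) - \mathbb{E}_D[y(\mathbf{x}; D)]\right)^2\right]}_{\text{variance}}$$

Integrating both terms over $p(\mathbf{x})$ and adding the noise term of (3.37) gives the stated expansion of the expected loss.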
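A small simulation in the spirit of the example above (L = 100 data sets, N = 25, Gaussian basis). It reuses design_matrix and fit_regularized from the earlier sketch; the sinusoidal target, the noise level and the value of λ are assumptions made for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    L, N, lam, s = 100, 25, np.exp(-0.31), 0.1
    centers = np.linspace(0.0, 1.0, 24)        # Gaussian basis centres
    x_test = np.linspace(0.0, 1.0, 200)
    Phi_test = design_matrix(x_test, centers, s)
    h = np.sin(2 * np.pi * x_test)             # true regression function h(x)

    preds = []
    for _ in range(L):                         # one regularized fit per data set D
        x = rng.uniform(0.0, 1.0, N)
        t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, N)
        w = fit_regularized(design_matrix(x, centers, s), t, lam)
        preds.append(Phi_test @ w)
    preds = np.array(preds)                    # shape (L, 200)

    bias2 = np.mean((preds.mean(axis=0) - h) ** 2)  # integrated (bias)^2
    variance = np.mean(preds.var(axis=0))           # integrated variance

Increasing λ shrinks the variance term and inflates the squared bias, which is exactly the trade-off described above.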
Bayesian Linear Regression (1/5)

Assume additive Gaussian noise with known precision β. The likelihood function $p(\mathbf{t}|\mathbf{w})$ is the exponential of a quadratic function of w, so its conjugate prior is Gaussian:

$$p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_0, S_0) \quad (3.48)$$

The posterior is then also Gaussian (2.116):

$$p(\mathbf{w}|\mathbf{t}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, S_N) \propto p(\mathbf{t}|\mathbf{w}) \, p(\mathbf{w}) \quad (3.49)$$

where

$$\mathbf{m}_N = S_N (S_0^{-1}\mathbf{m}_0 + \beta \Phi^\top \mathbf{t}), \qquad S_N^{-1} = S_0^{-1} + \beta \Phi^\top \Phi \quad (3.50/3.51)$$

• Note how this fits a sequential learning framework: the posterior after one batch of observations serves as the prior for the next.
• The maximum of a Gaussian is at its mean, so $\mathbf{w}_{\mathrm{MAP}} = \mathbf{m}_N$.

Bayesian Linear Regression (2/5)

Assume p(w) is governed by a hyperparameter α and is a zero-mean Gaussian with scalar covariance, i.e. $\mathbf{m}_0 = \mathbf{0}$ and $S_0 = \alpha^{-1} I$:

$$p(\mathbf{w}|\alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1} I) \quad (3.52)$$

Then

$$\mathbf{m}_N = \beta S_N \Phi^\top \mathbf{t}, \qquad S_N^{-1} = \alpha I + \beta \Phi^\top \Phi \quad (3.53/3.54)$$

• Note that $\alpha \to 0$ implies $\mathbf{m}_N \to \mathbf{w}_{\mathrm{ML}} = (\Phi^\top\Phi)^{-1}\Phi^\top\mathbf{t}$ (3.35)

The log of the posterior is the sum of the log-likelihood and the log of the prior:

$$\ln p(\mathbf{w}|\mathbf{t}) = -\frac{\beta}{2} \sum_{n=1}^{N} \left( t_n - \mathbf{w}^\top \boldsymbol\phi(\mathbf{x}_n) \right)^2 - \frac{\alpha}{2} \mathbf{w}^\top \mathbf{w} + \text{const} \quad (3.55)$$

so maximizing it is equivalent to least squares with a quadratic regularizer of coefficient α/β.

Bayesian Linear Regression (3/5)

In practice, we want to make predictions of t for new values of x:

$$p(t|\mathbf{t}, \alpha, \beta) = \int p(t|\mathbf{w}, \beta) \, p(\mathbf{w}|\mathbf{t}, \alpha, \beta) \, d\mathbf{w} \quad (3.57)$$

• Conditional distribution: $p(t|\mathbf{w}, \beta) = \mathcal{N}(t \mid y(\mathbf{x}, \mathbf{w}), \beta^{-1})$ (3.8)
• Posterior: $p(\mathbf{w}|\mathbf{t}, \alpha, \beta) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, S_N)$ (3.49)

The convolution is again Gaussian (2.115):

$$p(t|\mathbf{x}, \mathbf{t}, \alpha, \beta) = \mathcal{N}(t \mid \mathbf{m}_N^\top \boldsymbol\phi(\mathbf{x}), \sigma_N^2(\mathbf{x})) \quad (3.58)$$

where

$$\sigma_N^2(\mathbf{x}) = \underbrace{\beta^{-1}}_{\text{noise in data}} + \underbrace{\boldsymbol\phi(\mathbf{x})^\top S_N \, \boldsymbol\phi(\mathbf{x})}_{\text{uncertainty in } \mathbf{w}} \quad (3.59)$$

[Figures: the predictive distribution (mean and one-standard-deviation band) and curves y(x, w) with w sampled from the posterior, shown as the number of observations grows.]
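To close the loop, a minimal NumPy sketch of the posterior (3.53/3.54) and the predictive distribution (3.58/3.59); this is assumed code rather than material from the slides, and it expects design matrices built as with the earlier design_matrix helper.

    import numpy as np

    def posterior(Phi, t, alpha, beta):
        """Posterior N(w | m_N, S_N) under the prior N(w | 0, alpha^-1 I):
        S_N^-1 = alpha I + beta Phi^T Phi and m_N = beta S_N Phi^T t."""
        M = Phi.shape[1]
        S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
        m_N = beta * S_N @ Phi.T @ t
        return m_N, S_N

    def predictive(Phi_new, m_N, S_N, beta):
        """Predictive mean m_N^T phi(x) and variance
        sigma_N^2(x) = 1/beta + phi(x)^T S_N phi(x), for each row of Phi_new."""
        mean = Phi_new @ m_N
        var = 1.0 / beta + np.einsum('ij,jk,ik->i', Phi_new, S_N, Phi_new)
        return mean, var

As more observations arrive, S_N shrinks and the predictive variance approaches the irreducible noise floor 1/β, which is the behaviour illustrated in the figures above.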
