Machine Learning, Lecture 2: Regression and Classification

"It is our firm belief that an understanding of linear models is essential for understanding nonlinear ones."

Thomas Schön
Division of Automatic Control
Linköping University, Linköping, Sweden.
Email: [email protected], Phone: 013 - 281373, Office: House B, Entrance 27.

Outline Lecture 2 (Chapter 3.3 - 4)

1. Summary of Lecture 1
2. Bayesian linear regression
3. Motivation of kernel methods
4. Linear classification
   1. Problem setup
   2. Discriminant functions (mainly least squares)
   3. Probabilistic generative models
   4. Logistic regression (discriminative model)
   5. Bayesian logistic regression


Summary of Lecture 1 (I/VI)

The exponential family of distributions over $x$, parameterized by $\eta$,
\[
p(x \mid \eta) = h(x)\, g(\eta) \exp\left(\eta^T u(x)\right).
\]
One important member is the Gaussian density, which is commonly used as a building block in more sophisticated models. Important basic properties were provided.

The idea underlying maximum likelihood is that the parameters $\theta$ should be chosen in such a way that the measurements $\{x_i\}_{i=1}^N$ are as likely as possible, i.e.,
\[
\hat{\theta} = \arg\max_{\theta}\; p(x_1, \dots, x_N \mid \theta).
\]

Summary of Lecture 1 (II/VI)

The three basic steps of Bayesian modeling (where all variables are modeled as stochastic):

1. Assign prior distributions $p(\theta)$ to all unknown parameters $\theta$.
2. Write down the likelihood $p(x_1, \dots, x_N \mid \theta)$ of the data $x_1, \dots, x_N$ given the parameters $\theta$.
3. Determine the posterior distribution of the parameters given the data,
\[
p(\theta \mid x_1, \dots, x_N) = \frac{p(x_1, \dots, x_N \mid \theta)\, p(\theta)}{p(x_1, \dots, x_N)} \propto p(x_1, \dots, x_N \mid \theta)\, p(\theta).
\]

If the posterior $p(\theta \mid x_1, \dots, x_N)$ and the prior $p(\theta)$ distributions are of the same functional form they are conjugate distributions, and the prior is said to be a conjugate prior for the likelihood.
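As a minimal illustration of conjugacy (not part of the original slides): for a Gaussian likelihood with known variance, a Gaussian prior on the mean is conjugate, so the posterior over the mean is again Gaussian and available in closed form. A sketch, with all numbers made up:

```python
import numpy as np

# Minimal sketch: conjugacy for a Gaussian likelihood with known variance.
# Prior on the mean:      theta ~ N(mu0, tau0^2)
# Likelihood of the data: x_n | theta ~ N(theta, sigma^2), n = 1, ..., N
# The posterior is again Gaussian, N(muN, tauN^2), i.e. the prior is conjugate.
rng = np.random.default_rng(0)
sigma, mu0, tau0 = 1.0, 0.0, 2.0            # known noise std, prior mean and std
theta_true = 1.5
x = rng.normal(theta_true, sigma, size=20)  # synthetic measurements

N = len(x)
tauN2 = 1.0 / (1.0 / tau0**2 + N / sigma**2)        # posterior variance
muN = tauN2 * (mu0 / tau0**2 + x.sum() / sigma**2)  # posterior mean
print(f"posterior: N({muN:.3f}, {np.sqrt(tauN2):.3f}^2)")
```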

Summary of Lecture 1 (III/VI)

Modeling "heavy tails" using the Student's t-distribution,
\[
\mathrm{St}(x \mid \mu, \lambda, \nu) = \int \mathcal{N}\left(x \mid \mu, (\eta\lambda)^{-1}\right) \mathrm{Gam}\left(\eta \mid \nu/2, \nu/2\right) d\eta
= \frac{\Gamma(\nu/2 + 1/2)}{\Gamma(\nu/2)} \left(\frac{\lambda}{\pi\nu}\right)^{1/2} \left(1 + \frac{\lambda(x - \mu)^2}{\nu}\right)^{-\nu/2 - 1/2},
\]
which according to the first expression can be interpreted as an infinite mixture of Gaussians with the same mean, but different precisions.

[Figure: the negative log-density of the Student's t ($-\log$ Student) and of the Gaussian ($-\log$ Gaussian).]

Poor robustness is due to an unrealistic model; the ML estimator is inherently robust, provided we have the correct model.

Summary of Lecture 1 (IV/VI)

Linear regression models the relationship between a continuous target variable $t$ and a possibly nonlinear function $\phi(x)$ of the input variable $x$,
\[
t_n = \underbrace{w^T \phi(x_n)}_{y(x_n, w)} + e_n.
\]
We solved this problem using
1. Maximum likelihood (ML)
2. A Bayesian approach
ML with a Gaussian noise model is equivalent to least squares (LS).
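A minimal sketch (not from the slides, with made-up data and basis functions) of the ML/least-squares solution for the linear regression model above, via the normal equations:

```python
import numpy as np

# Minimal sketch: ML with Gaussian noise = least squares for t_n = w^T phi(x_n) + e_n.
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=50)
t = 0.5 - 2.0 * x + 0.3 * x**2 + rng.normal(0, 0.1, size=50)  # made-up "true" model

Phi = np.column_stack([np.ones_like(x), x, x**2])  # basis functions phi_j(x) = x^j
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)     # solves min_w ||t - Phi w||^2
print(w_ml)
```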


Summary of Lecture 1 (V/VI)

Theorem (Gauss-Markov)
In a linear regression model
\[
T = \Phi w + E,
\]
where $E$ is white noise with zero mean and covariance $R$, the best linear unbiased estimate (BLUE) of $w$ is
\[
\hat{w} = (\Phi^T R^{-1} \Phi)^{-1} \Phi^T R^{-1} T, \qquad \mathrm{Cov}(\hat{w}) = (\Phi^T R^{-1} \Phi)^{-1}.
\]

Interpretation: The least squares estimator has the smallest mean square error (MSE) of all linear estimators with no bias, BUT there may exist a biased estimator with lower MSE.

Summary of Lecture 1 (VI/VI)

Two potentially biased estimators are ridge regression ($p = 2$) and the Lasso ($p = 1$),
\[
\min_{w} \sum_{n=1}^{N} \left(t_n - w^T \phi(x_n)\right)^2 \quad \text{s.t.} \quad \sum_{j=0}^{M-1} |w_j|^p \leq \eta,
\]
which using a Lagrange multiplier $\lambda$ can be stated
\[
\min_{w} \sum_{n=1}^{N} \left(t_n - w^T \phi(x_n)\right)^2 + \lambda \sum_{j=0}^{M-1} |w_j|^p.
\]

Alternative interpretation: The MAP estimate with the Gaussian likelihood $\prod_{n=1}^{N} \mathcal{N}\left(t_n \mid w^T \phi(x_n), \beta^{-1}\right)$ together with a Gaussian prior leads to ridge regression, and together with a Laplacian prior it leads to the Lasso.
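A minimal sketch (made-up data, regularization strength chosen arbitrarily) of ridge regression in closed form; for $p = 2$ the penalized criterion has the explicit minimizer $\hat{w} = (\lambda I + \Phi^T\Phi)^{-1}\Phi^T t$:

```python
import numpy as np

# Minimal sketch: ridge regression (p = 2) in closed form.
# For sum_n (t_n - w^T phi(x_n))^2 + lam * ||w||^2 the minimizer is
#   w_ridge = (lam * I + Phi^T Phi)^{-1} Phi^T t.
rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, size=30)
t = 1.0 + 2.0 * x + rng.normal(0, 0.1, size=30)  # made-up data
Phi = np.column_stack([np.ones_like(x), x])      # phi_0(x) = 1, phi_1(x) = x

lam = 0.1                                        # regularization strength (arbitrary)
M = Phi.shape[1]
w_ridge = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
print(w_ridge)
```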

Bayesian Linear Regression - Example (I/VI)

Consider the problem of fitting a straight line to noisy measurements. Let the model be ($t_n \in \mathbb{R}$, $x_n \in \mathbb{R}$)
\[
t_n = w_0 + w_1 x_n + e_n, \qquad n = 1, \dots, N, \tag{1}
\]
where
\[
e_n \sim \mathcal{N}(0, 0.2^2), \qquad \beta = \frac{1}{0.2^2} = 25.
\]
Furthermore, let the prior be
\[
p(w) = \mathcal{N}\left(w \mid \begin{pmatrix} 0 & 0 \end{pmatrix}^T, \alpha^{-1} I\right),
\]
where $\alpha = 2$.

Bayesian Linear Regression - Example (II/VI)

Let the true values for $w$ be $w^\star = \begin{pmatrix} -0.3 & 0.5 \end{pmatrix}^T$ (plotted using a white circle below). Generate synthetic measurements by
\[
t_n = w_0^\star + w_1^\star x_n + e_n, \qquad e_n \sim \mathcal{N}(0, 0.2^2),
\]
where $x_n \sim \mathcal{U}(-1, 1)$.

According to (1), the following identity basis functions are used,
\[
\phi_0(x_n) = 1, \qquad \phi_1(x_n) = x_n.
\]
Since the example lives in two dimensions, we can plot distributions to illustrate the inference.
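A minimal sketch (seed and $N$ chosen arbitrarily) of how the synthetic data for this example can be generated:

```python
import numpy as np

# Minimal sketch: generate the synthetic data used in the example.
rng = np.random.default_rng(3)
w_star = np.array([-0.3, 0.5])          # true parameter values
alpha, beta = 2.0, 25.0                 # prior precision and noise precision (1/0.2^2)

N = 30                                  # number of measurements (arbitrary choice)
x = rng.uniform(-1, 1, size=N)          # x_n ~ U(-1, 1)
t = w_star[0] + w_star[1] * x + rng.normal(0, 0.2, size=N)  # t_n = w0* + w1*x_n + e_n

Phi = np.column_stack([np.ones_like(x), x])   # phi_0(x) = 1, phi_1(x) = x
```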


Bayesian Linear Regression - Example (III/VI)

Plot of the situation before any data arrives.

[Figure: the prior over $(w_0, w_1)$ and a few realizations from the posterior.]

Prior:
\[
p(w) = \mathcal{N}\left(w \mid \begin{pmatrix} 0 & 0 \end{pmatrix}^T, \tfrac{1}{2} I\right).
\]

Bayesian Linear Regression - Example (IV/VI)

Plot of the situation after one measurement has arrived.

[Figure: the likelihood (plotted as a function of $w$), the posterior/prior, and a few realizations from the posterior together with the first measurement (black circle).]

Likelihood:
\[
p(t_1 \mid w) = \mathcal{N}\left(t_1 \mid w_0 + w_1 x_1, \beta^{-1}\right).
\]
Posterior:
\[
p(w \mid t_1) = \mathcal{N}(w \mid m_1, S_1), \qquad m_1 = \beta S_1 \Phi^T t_1, \qquad S_1 = (\alpha I + \beta \Phi^T \Phi)^{-1}.
\]

Bayesian Linear Regression - Example (V/VI)

Plot of the situation after two measurements have arrived.

[Figure: the likelihood (plotted as a function of $w$), the posterior/prior, and a few realizations from the posterior together with the measurements (black circles).]

Likelihood:
\[
p(t_2 \mid w) = \mathcal{N}\left(t_2 \mid w_0 + w_1 x_2, \beta^{-1}\right).
\]
Posterior:
\[
p(w \mid T) = \mathcal{N}(w \mid m_2, S_2), \qquad m_2 = \beta S_2 \Phi^T T, \qquad S_2 = (\alpha I + \beta \Phi^T \Phi)^{-1}.
\]

Bayesian Linear Regression - Example (VI/VI)

Plot of the situation after 30 measurements have arrived.

[Figure: the likelihood (plotted as a function of $w$), the posterior/prior, and a few realizations from the posterior together with the measurements (black circles).]

Likelihood:
\[
p(t_{30} \mid w) = \mathcal{N}\left(t_{30} \mid w_0 + w_1 x_{30}, \beta^{-1}\right).
\]
Posterior:
\[
p(w \mid T) = \mathcal{N}(w \mid m_{30}, S_{30}), \qquad m_{30} = \beta S_{30} \Phi^T T, \qquad S_{30} = (\alpha I + \beta \Phi^T \Phi)^{-1}.
\]
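A minimal sketch (continuing the data-generation snippet above) of the posterior computation; using only the first $n$ measurements reproduces the sequence of posteriors shown in the plots:

```python
# Minimal sketch: posterior over w after N measurements (continues the snippet above).
#   S_N = (alpha * I + beta * Phi^T Phi)^{-1},   m_N = beta * S_N Phi^T t
S_N = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t
print("posterior mean:", m_N)   # close to w_star = [-0.3, 0.5] for large N

# Using only the first n rows of Phi and t gives the posterior after n measurements,
# i.e. the sequence of plots above (n = 0 reduces to the prior).
```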


Predictive Distribution - Example

Investigating the predictive distribution for the example above.

[Figure: the predictive distribution after N = 2, N = 5 and N = 200 observations. Red line: the true system ($y(x) = -0.3 + 0.5x$) generating the data. Blue line: the mean of the predictive distribution. Gray shaded area: one standard deviation of the predictive distribution; note that this is the point-wise predictive standard deviation as a function of $x$. Black circles: the observations.]

Posterior Distribution

Recall that the posterior distribution is given by
\[
p(w \mid T) = \mathcal{N}(w \mid m_N, S_N),
\]
where
\[
m_N = \beta S_N \Phi^T T, \qquad S_N = (\alpha I + \beta \Phi^T \Phi)^{-1}.
\]
Let us now investigate the posterior mean solution $m_N$ above, which has an interpretation that directly leads to the kernel methods (lecture 4), including Gaussian processes.
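The predictive distribution itself is not written out on these slides; a minimal sketch (continuing the snippets above) using the standard Gaussian result $p(t \mid x, T) = \mathcal{N}\left(t \mid m_N^T \phi(x),\, \beta^{-1} + \phi(x)^T S_N \phi(x)\right)$ for this model:

```python
# Minimal sketch: point-wise predictive mean and standard deviation (continues above).
# Standard result: p(t | x, T) = N(t | m_N^T phi(x), 1/beta + phi(x)^T S_N phi(x)).
x_grid = np.linspace(-1, 1, 101)
Phi_grid = np.column_stack([np.ones_like(x_grid), x_grid])

pred_mean = Phi_grid @ m_N
pred_std = np.sqrt(1.0 / beta + np.einsum('ij,jk,ik->i', Phi_grid, S_N, Phi_grid))
# pred_mean and pred_std correspond to the blue line and the gray shaded band above.
```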

Generative and Discriminative Models

Approaches that model the distributions of both the inputs and the outputs are known as generative models. The reason for the name is the fact that using these models we can generate new samples in the input space.

Approaches that model the posterior probabilities directly are referred to as discriminative models.

ML for Probabilistic Generative Models (I/IV)

Consider the two class case, where the class-conditional densities $p(x \mid \mathcal{C}_k)$ are Gaussian and the training data is given by $\{x_n, t_n\}_{n=1}^N$. Furthermore, assume that $p(\mathcal{C}_1) = \alpha$.

The task is now to find the parameters $\alpha$, $\mu_1$, $\mu_2$, $\Sigma$ by maximizing the likelihood function,
\[
p(T, X \mid \alpha, \mu_1, \mu_2, \Sigma) = \prod_{n=1}^{N} \left(p(x_n, \mathcal{C}_1)\right)^{t_n} \left(p(x_n, \mathcal{C}_2)\right)^{1 - t_n},
\]
where
\[
p(x_n, \mathcal{C}_1) = p(\mathcal{C}_1)\, p(x_n \mid \mathcal{C}_1) = \alpha\, \mathcal{N}(x_n \mid \mu_1, \Sigma),
\]
\[
p(x_n, \mathcal{C}_2) = p(\mathcal{C}_2)\, p(x_n \mid \mathcal{C}_2) = (1 - \alpha)\, \mathcal{N}(x_n \mid \mu_2, \Sigma).
\]


ML for Probabilistic Generative Models (II/IV)

Let us now maximize the logarithm of the likelihood function,
\[
L(\alpha, \mu_1, \mu_2, \Sigma) = \ln \prod_{n=1}^{N} \left(\alpha\, \mathcal{N}(x_n \mid \mu_1, \Sigma)\right)^{t_n} \left((1 - \alpha)\, \mathcal{N}(x_n \mid \mu_2, \Sigma)\right)^{1 - t_n}.
\]
The terms that depend on $\alpha$ are
\[
\sum_{n=1}^{N} \left(t_n \ln \alpha + (1 - t_n) \ln(1 - \alpha)\right),
\]
which is maximized by $\hat{\alpha} = \frac{1}{N} \sum_{n=1}^{N} t_n = \frac{N_1}{N_1 + N_2}$ (as expected). Here $N_k$ denotes the number of data points in class $\mathcal{C}_k$. Straightforwardly we get
\[
\hat{\mu}_1 = \frac{1}{N_1} \sum_{n=1}^{N} t_n x_n, \qquad \hat{\mu}_2 = \frac{1}{N_2} \sum_{n=1}^{N} (1 - t_n) x_n.
\]

ML for Probabilistic Generative Models (III/IV)

The terms that depend on $\Sigma$ are
\[
L(\Sigma) = -\frac{1}{2} \sum_{n=1}^{N} t_n \ln \det \Sigma - \frac{1}{2} \sum_{n=1}^{N} t_n (x_n - \mu_1)^T \Sigma^{-1} (x_n - \mu_1)
- \frac{1}{2} \sum_{n=1}^{N} (1 - t_n) \ln \det \Sigma - \frac{1}{2} \sum_{n=1}^{N} (1 - t_n) (x_n - \mu_2)^T \Sigma^{-1} (x_n - \mu_2).
\]
Using the fact that $x^T A x = \mathrm{Tr}\left(A x x^T\right)$ we have
\[
L(\Sigma) = -\frac{N}{2} \ln \det \Sigma - \frac{N}{2} \mathrm{Tr}\left(\Sigma^{-1} S\right),
\]
where
\[
S = \frac{1}{N} \sum_{n=1}^{N} \left(t_n (x_n - \mu_1)(x_n - \mu_1)^T + (1 - t_n)(x_n - \mu_2)(x_n - \mu_2)^T\right).
\]
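A minimal sketch (with made-up two-dimensional data) of the closed-form ML estimates above; the quantity $S$ is also the ML estimate of $\Sigma$, as shown on the next slide:

```python
import numpy as np

# Minimal sketch: ML estimates for the two-class Gaussian generative model
# with shared covariance, given labels t_n in {0, 1} (t_n = 1 for class C_1).
rng = np.random.default_rng(4)
X1 = rng.multivariate_normal([1.0, 1.0], np.eye(2), size=40)    # made-up class C_1 data
X2 = rng.multivariate_normal([-1.0, 0.0], np.eye(2), size=60)   # made-up class C_2 data
X = np.vstack([X1, X2])
t = np.concatenate([np.ones(len(X1)), np.zeros(len(X2))])

N, N1, N2 = len(t), t.sum(), (1 - t).sum()
alpha_hat = N1 / N                                  # = N1 / (N1 + N2)
mu1_hat = (t[:, None] * X).sum(axis=0) / N1
mu2_hat = ((1 - t)[:, None] * X).sum(axis=0) / N2

D1 = X - mu1_hat
D2 = X - mu2_hat
S = (D1.T @ (t[:, None] * D1) + D2.T @ ((1 - t)[:, None] * D2)) / N   # shared covariance
```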

ML for Probabilistic Generative Models (IV/IV)

Lemma (Useful matrix derivatives)
\[
\frac{\partial}{\partial M} \ln \det M = M^{-T}, \qquad
\frac{\partial}{\partial M} \mathrm{Tr}\left(M^{-1} N\right) = -M^{-T} N^T M^{-T}.
\]

Differentiating $L(\Sigma) = -\frac{N}{2} \ln \det \Sigma - \frac{N}{2} \mathrm{Tr}\left(\Sigma^{-1} S\right)$ results in
\[
\frac{\partial L}{\partial \Sigma} = -\frac{N}{2} \Sigma^{-T} + \frac{N}{2} \Sigma^{-T} S \Sigma^{-T}.
\]
Hence,
\[
\frac{\partial L}{\partial \Sigma} = 0 \quad \Longrightarrow \quad \hat{\Sigma} = S.
\]
More results on matrix derivatives are available in Magnus, J. R. & Neudecker, H. (1999). Matrix Differential Calculus with Applications in Statistics and Econometrics. 2nd Edition, Chichester, UK: Wiley.

Generalized Linear Models for Classification

In linear regression we made use of a linear model
\[
t_n = y(x_n, w) + e_n = w^T \phi(x_n) + e_n.
\]
For classification problems the target variables are discrete or, slightly more generally, posterior probabilities in the interval $(0, 1)$. This is achieved using a so-called activation function $f$ ($f^{-1}$ must exist),
\[
y(x) = f(w^T x + w_0). \tag{2}
\]
Note that the decision surface corresponds to $y(x) = \text{constant}$, implying that $w^T x + w_0 = \text{constant}$. This means that the decision surface is a linear function of $x$, even if $f$ is nonlinear. Hence, the name generalized linear model for (2).


Gradient of L(w) for Logistic Regression (I/II)

The negative log-likelihood is
\[
L(w) = -\sum_{n=1}^{N} \left(t_n \ln y_n + (1 - t_n) \ln(1 - y_n)\right),
\]
where
\[
y_n = \sigma(a_n) = \frac{1}{1 + \exp(-a_n)}, \qquad a_n = w^T \phi_n.
\]
Using the chain rule we have
\[
\frac{\partial L}{\partial w} = \sum_{n=1}^{N} \frac{\partial L}{\partial y_n} \frac{\partial y_n}{\partial a_n} \frac{\partial a_n}{\partial w},
\]
where
\[
\frac{\partial L}{\partial y_n} = \frac{1 - t_n}{1 - y_n} - \frac{t_n}{y_n} = \frac{y_n - t_n}{y_n(1 - y_n)}.
\]

Gradient of L(w) for Logistic Regression (II/II)

Furthermore,
\[
\frac{\partial y_n}{\partial a_n} = \frac{\partial \sigma(a_n)}{\partial a_n} = \dots = \sigma(a_n)\left(1 - \sigma(a_n)\right) = y_n (1 - y_n), \qquad
\frac{\partial a_n}{\partial w} = \phi_n,
\]
which results in the following expression for the gradient
\[
\frac{\partial L}{\partial w} = \sum_{n=1}^{N} (y_n - t_n)\, \phi_n = \Phi^T (Y - T),
\]
where
\[
\Phi = \begin{pmatrix}
\phi_0(x_1) & \phi_1(x_1) & \dots & \phi_{M-1}(x_1) \\
\phi_0(x_2) & \phi_1(x_2) & \dots & \phi_{M-1}(x_2) \\
\vdots & \vdots & \ddots & \vdots \\
\phi_0(x_N) & \phi_1(x_N) & \dots & \phi_{M-1}(x_N)
\end{pmatrix}, \qquad
Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}, \qquad
T = \begin{pmatrix} t_1 \\ t_2 \\ \vdots \\ t_N \end{pmatrix}.
\]
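A minimal sketch (made-up data and basis) of evaluating $L(w)$ and its gradient in the matrix form above:

```python
import numpy as np

# Minimal sketch: negative log-likelihood and its gradient for logistic regression.
def neg_log_lik_and_grad(w, Phi, t):
    y = 1.0 / (1.0 + np.exp(-Phi @ w))                    # y_n = sigma(w^T phi_n)
    L = -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))  # L(w)
    grad = Phi.T @ (y - t)                                # dL/dw = Phi^T (Y - T)
    return L, grad

# Made-up example: one-dimensional input with basis (1, x).
rng = np.random.default_rng(5)
x = rng.uniform(-2, 2, size=100)
t = (x + 0.3 * rng.normal(size=100) > 0).astype(float)    # made-up labels
Phi = np.column_stack([np.ones_like(x), x])

L, grad = neg_log_lik_and_grad(np.zeros(2), Phi, t)
```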

Hessian of L(w) for Logistic Regression

\[
H = \frac{\partial^2 L}{\partial w \partial w^T} = \dots = \sum_{n=1}^{N} y_n (1 - y_n)\, \phi_n \phi_n^T = \Phi^T R \Phi,
\]
where
\[
R = \begin{pmatrix}
y_1(1 - y_1) & 0 & \dots & 0 \\
0 & y_2(1 - y_2) & \dots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \dots & y_N(1 - y_N)
\end{pmatrix}.
\]

Bayesian Logistic Regression

Recall that
\[
p(T \mid w) = \prod_{n=1}^{N} \sigma(w^T \phi_n)^{t_n} \left(1 - \sigma(w^T \phi_n)\right)^{1 - t_n}.
\]
Hence, computing the posterior density
\[
p(w \mid T) = \frac{p(T \mid w)\, p(w)}{p(T)}
\]
is intractable. We are forced to an approximation. Three alternatives:
1. Laplace approximation (this lecture)
2. VB & EP (lecture 5)
3. Monte Carlo methods, e.g., MCMC (lecture 6)
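The gradient and Hessian above also give a Newton (IRLS) iteration for the point estimate of $w$, which is what the Laplace approximation below is centered on. A minimal sketch (continuing the logistic-regression snippet above, unregularized ML for simplicity):

```python
# Minimal sketch: Newton's method (IRLS) for the ML estimate, using
#   gradient = Phi^T (y - t)  and  Hessian = Phi^T R Phi.
w = np.zeros(Phi.shape[1])
for _ in range(20):
    y = 1.0 / (1.0 + np.exp(-Phi @ w))
    grad = Phi.T @ (y - t)
    R = np.diag(y * (1 - y))
    H = Phi.T @ R @ Phi
    w = w - np.linalg.solve(H, grad)   # Newton step
w_ml = w
```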


Laplace Approximation (I/III)

The Laplace approximation is a simple approximation that is obtained by fitting a Gaussian centered around the (MAP) mode of the distribution.

Consider the density function $p(z)$ of a scalar stochastic variable $z$, given by
\[
p(z) = \frac{1}{Z} f(z),
\]
where $Z = \int f(z)\, dz$ is the normalization coefficient.

We start by finding a mode $z_0$ of the density function,
\[
\left. \frac{d f(z)}{d z} \right|_{z = z_0} = 0.
\]

Laplace Approximation (II/III)

Consider a Taylor expansion of $\ln f(z)$ around the mode $z_0$,
\[
\ln f(z) \approx \ln f(z_0) + \underbrace{\left. \frac{d}{dz} \ln f(z) \right|_{z = z_0}}_{= 0} (z - z_0) + \frac{1}{2} \left. \frac{d^2}{dz^2} \ln f(z) \right|_{z = z_0} (z - z_0)^2
= \ln f(z_0) - \frac{A}{2} (z - z_0)^2, \tag{3}
\]
where
\[
A = -\left. \frac{d^2}{dz^2} \ln f(z) \right|_{z = z_0}.
\]
Taking the exponential of both sides in the approximation (3) results in
\[
f(z) \approx f(z_0) \exp\left(-\frac{A}{2}(z - z_0)^2\right).
\]
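A minimal sketch (made-up unnormalized density $f(z)$, mode found by grid search, second derivative by finite differences) of the recipe above in one dimension:

```python
import numpy as np

# Minimal sketch: 1-D Laplace approximation of p(z) = f(z) / Z for a made-up f.
def log_f(z):
    # an unnormalized, non-Gaussian log-density (illustration only)
    return -0.5 * z**2 + np.log(1.0 + np.exp(2.0 * z))

# 1. Find a mode z0 of f (equivalently of ln f), here by a simple grid search.
zs = np.linspace(-10.0, 10.0, 200001)
z0 = zs[np.argmax(log_f(zs))]

# 2. A = -(d^2/dz^2) ln f(z) at z = z0, here via a finite-difference approximation.
h = 1e-4
A = -(log_f(z0 + h) - 2.0 * log_f(z0) + log_f(z0 - h)) / h**2

# 3. The Laplace approximation is the Gaussian q(z) = N(z | z0, 1/A).
print(f"q(z) = N(z | {z0:.3f}, {1.0 / A:.3f})")
```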

Laplace Approximation (III/III)

By normalizing this expression we have now obtained a Gaussian approximation
\[
q(z) = \left(\frac{A}{2\pi}\right)^{1/2} \exp\left(-\frac{A}{2}(z - z_0)^2\right),
\]
where
\[
A = -\left. \frac{d^2}{dz^2} \ln f(z) \right|_{z = z_0}.
\]
The main limitation of the Laplace approximation is that it is a local method that only captures aspects of the true density around a specific value $z_0$.

Bayesian Logistic Regression (I/II)

The posterior is
\[
p(w \mid T) \propto p(T \mid w)\, p(w), \tag{4}
\]
where we assume a Gaussian prior $p(w) = \mathcal{N}(w \mid m_0, S_0)$ and make use of the Laplace approximation. Taking the logarithm on both sides of (4) gives
\[
\ln p(w \mid T) = -\frac{1}{2}(w - m_0)^T S_0^{-1} (w - m_0) + \sum_{n=1}^{N} \left(t_n \ln y_n + (1 - t_n) \ln(1 - y_n)\right) + \text{const.},
\]
where $y_n = \sigma(w^T \phi_n)$.


Bayesian Logistic Regression (II/II)

Using the Laplace approximation we can now obtain a Gaussian approximation
\[
q(w) = \mathcal{N}(w \mid w_{\text{MAP}}, S_N),
\]
where $w_{\text{MAP}}$ is the MAP estimate of $p(w \mid T)$ and the covariance $S_N$ is the inverse of the negative Hessian of $\ln p(w \mid T)$,
\[
S_N^{-1} = -\frac{\partial^2}{\partial w \partial w^T} \ln p(w \mid T) = S_0^{-1} + \sum_{n=1}^{N} y_n (1 - y_n)\, \phi_n \phi_n^T.
\]
Based on this distribution we can now start making predictions for new input data $\phi(x)$, which is typically what we are interested in. Recall that prediction corresponds to marginalization w.r.t. $w$.

A Few Concepts to Summarize Lecture 2

Hyperparameter: A parameter of the prior distribution that controls the distribution of the parameters of the model.

Classification: The goal of classification is to assign an input vector $x$ to one of $K$ classes, $\mathcal{C}_k$, $k = 1, \dots, K$.

Discriminant: A discriminant is a function that takes an input $x$ and assigns it to one of $K$ classes.

Generative models: Approaches that model the distributions of both the inputs and the outputs are known as generative models. In classification this amounts to modelling the class-conditional densities $p(x \mid \mathcal{C}_k)$, as well as the prior densities $p(\mathcal{C}_k)$. The reason for the name is the fact that using these models we can generate new samples in the input space.

Discriminative models: Approaches that model the posterior probability directly are referred to as discriminative models.

Logistic regression: A discriminative model that makes direct use of a generalized linear model in the form of a logistic sigmoid to solve the classification problem.

Laplace approximation: A local approximation method that finds the mode of the posterior distribution and then fits a Gaussian centered at that mode.
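A minimal sketch (continuing the logistic-regression snippets above; the prior $m_0 = 0$, $S_0 = I$ and the new input are chosen arbitrarily) of the Laplace approximation and a Monte Carlo approximation of the predictive probability:

```python
# Minimal sketch: Laplace approximation for Bayesian logistic regression
# (continues the Phi, t defined above; prior chosen as m0 = 0, S0 = I).
m0 = np.zeros(Phi.shape[1])
S0_inv = np.eye(Phi.shape[1])

w = m0.copy()
for _ in range(20):                                      # Newton iterations -> w_MAP
    y = 1.0 / (1.0 + np.exp(-Phi @ w))
    grad = S0_inv @ (w - m0) + Phi.T @ (y - t)           # gradient of -ln p(w | T)
    H = S0_inv + Phi.T @ np.diag(y * (1 - y)) @ Phi      # Hessian of -ln p(w | T)
    w = w - np.linalg.solve(H, grad)
w_map = w

y = 1.0 / (1.0 + np.exp(-Phi @ w_map))
S_N = np.linalg.inv(S0_inv + Phi.T @ np.diag(y * (1 - y)) @ Phi)  # q(w) = N(w_MAP, S_N)

# Predictive probability p(C_1 | phi) by Monte Carlo marginalization over w ~ q(w).
rng = np.random.default_rng(6)
phi_new = np.array([1.0, 0.5])                           # a new input phi(x), made up
ws = rng.multivariate_normal(w_map, S_N, size=2000)
p_c1 = np.mean(1.0 / (1.0 + np.exp(-ws @ phi_new)))
```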
