Machine Learning, Lecture 2: Regression and Classification

"It is our firm belief that an understanding of linear models is essential for understanding nonlinear ones."

Thomas Schön
Division of Automatic Control
Linköping University, Linköping, Sweden.
Email: [email protected], Phone: 013 - 281373, Office: House B, Entrance 27.

Outline Lecture 2 (Chapter 3.3 - 4)

1. Summary of Lecture 1
2. Bayesian linear regression
3. Motivation of kernel methods
4. Linear classification
   1. Problem setup
   2. Discriminant functions (mainly least squares)
   3. Probabilistic generative models
   4. Logistic regression (discriminative model)
   5. Bayesian logistic regression


Summary of Lecture 1 (I/VI)

The exponential family of distributions over $x$, parameterized by $\eta$,
\[
p(x \mid \eta) = h(x)\, g(\eta) \exp\left(\eta^T u(x)\right).
\]
One important member is the Gaussian density, which is commonly used as a building block in more sophisticated models. Important basic properties were provided.

The idea underlying maximum likelihood is that the parameters $\theta$ should be chosen in such a way that the measurements $\{x_i\}_{i=1}^N$ are as likely as possible, i.e.,
\[
\hat{\theta} = \arg\max_{\theta}\; p(x_1, \dots, x_N \mid \theta).
\]

Summary of Lecture 1 (II/VI)

The three basic steps of Bayesian modeling (where all variables are modeled as stochastic):

1. Assign prior distributions $p(\theta)$ to all unknown parameters $\theta$.
2. Write down the likelihood $p(x_1, \dots, x_N \mid \theta)$ of the data $x_1, \dots, x_N$ given the parameters $\theta$.
3. Determine the posterior distribution of the parameters given the data,
\[
p(\theta \mid x_1, \dots, x_N) = \frac{p(x_1, \dots, x_N \mid \theta)\, p(\theta)}{p(x_1, \dots, x_N)} \propto p(x_1, \dots, x_N \mid \theta)\, p(\theta).
\]

If the posterior $p(\theta \mid x_1, \dots, x_N)$ and the prior $p(\theta)$ distributions are of the same functional form they are conjugate distributions, and the prior is said to be a conjugate prior for the likelihood.
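As a minimal illustration of conjugacy (not part of the original slides): for a Gaussian likelihood with known variance, a Gaussian prior on the mean is conjugate, so the posterior over the mean is again Gaussian and available in closed form. A sketch, with all numbers made up:

```python
import numpy as np

# Minimal sketch: conjugacy for a Gaussian likelihood with known variance.
# Prior on the mean:      theta ~ N(mu0, tau0^2)
# Likelihood of the data: x_n | theta ~ N(theta, sigma^2), n = 1, ..., N
# The posterior is again Gaussian, N(muN, tauN^2), i.e. the prior is conjugate.
rng = np.random.default_rng(0)
sigma, mu0, tau0 = 1.0, 0.0, 2.0            # known noise std, prior mean and std
theta_true = 1.5
x = rng.normal(theta_true, sigma, size=20)  # synthetic measurements

N = len(x)
tauN2 = 1.0 / (1.0 / tau0**2 + N / sigma**2)        # posterior variance
muN = tauN2 * (mu0 / tau0**2 + x.sum() / sigma**2)  # posterior mean
print(f"posterior: N({muN:.3f}, {np.sqrt(tauN2):.3f}^2)")
```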

Summary of Lecture 1 (III/VI)

Modeling "heavy tails" using the Student's t-distribution,
\[
\mathrm{St}(x \mid \mu, \lambda, \nu) = \int \mathcal{N}\left(x \mid \mu, (\eta\lambda)^{-1}\right) \mathrm{Gam}\left(\eta \mid \nu/2, \nu/2\right) d\eta
= \frac{\Gamma(\nu/2 + 1/2)}{\Gamma(\nu/2)} \left(\frac{\lambda}{\pi\nu}\right)^{1/2} \left(1 + \frac{\lambda(x - \mu)^2}{\nu}\right)^{-\nu/2 - 1/2},
\]
which according to the first expression can be interpreted as an infinite mixture of Gaussians with the same mean, but different precisions.

[Figure: the negative log-density of the Student's t ($-\log$ Student) and of the Gaussian ($-\log$ Gaussian).]

Poor robustness is due to an unrealistic model; the ML estimator is inherently robust, provided we have the correct model.

Summary of Lecture 1 (IV/VI)

Linear regression models the relationship between a continuous target variable $t$ and a possibly nonlinear function $\phi(x)$ of the input variable $x$,
\[
t_n = \underbrace{w^T \phi(x_n)}_{y(x_n, w)} + e_n.
\]
We solved this problem using
1. Maximum likelihood (ML)
2. A Bayesian approach
ML with a Gaussian noise model is equivalent to least squares (LS).
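A minimal sketch (not from the slides, with made-up data and basis functions) of the ML/least-squares solution for the linear regression model above, via the normal equations:

```python
import numpy as np

# Minimal sketch: ML with Gaussian noise = least squares for t_n = w^T phi(x_n) + e_n.
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=50)
t = 0.5 - 2.0 * x + 0.3 * x**2 + rng.normal(0, 0.1, size=50)  # made-up "true" model

Phi = np.column_stack([np.ones_like(x), x, x**2])  # basis functions phi_j(x) = x^j
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)     # solves min_w ||t - Phi w||^2
print(w_ml)
```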


Summary of Lecture 1 (V/VI)

Theorem (Gauss-Markov)
In a linear regression model
\[
T = \Phi w + E,
\]
where $E$ is white noise with zero mean and covariance $R$, the best linear unbiased estimate (BLUE) of $w$ is
\[
\hat{w} = (\Phi^T R^{-1} \Phi)^{-1} \Phi^T R^{-1} T, \qquad \mathrm{Cov}(\hat{w}) = (\Phi^T R^{-1} \Phi)^{-1}.
\]

Interpretation: The least squares estimator has the smallest mean square error (MSE) of all linear estimators with no bias, BUT there may exist a biased estimator with lower MSE.

Summary of Lecture 1 (VI/VI)

Two potentially biased estimators are ridge regression ($p = 2$) and the Lasso ($p = 1$),
\[
\min_{w} \sum_{n=1}^{N} \left(t_n - w^T \phi(x_n)\right)^2 \quad \text{s.t.} \quad \sum_{j=0}^{M-1} |w_j|^p \leq \eta,
\]
which using a Lagrange multiplier $\lambda$ can be stated
\[
\min_{w} \sum_{n=1}^{N} \left(t_n - w^T \phi(x_n)\right)^2 + \lambda \sum_{j=0}^{M-1} |w_j|^p.
\]

Alternative interpretation: The MAP estimate with the Gaussian likelihood $\prod_{n=1}^{N} \mathcal{N}\left(t_n \mid w^T \phi(x_n), \beta^{-1}\right)$ together with a Gaussian prior leads to ridge regression, and together with a Laplacian prior it leads to the Lasso.
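A minimal sketch (made-up data, regularization strength chosen arbitrarily) of ridge regression in closed form; for $p = 2$ the penalized criterion has the explicit minimizer $\hat{w} = (\lambda I + \Phi^T\Phi)^{-1}\Phi^T t$:

```python
import numpy as np

# Minimal sketch: ridge regression (p = 2) in closed form.
# For sum_n (t_n - w^T phi(x_n))^2 + lam * ||w||^2 the minimizer is
#   w_ridge = (lam * I + Phi^T Phi)^{-1} Phi^T t.
rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, size=30)
t = 1.0 + 2.0 * x + rng.normal(0, 0.1, size=30)  # made-up data
Phi = np.column_stack([np.ones_like(x), x])      # phi_0(x) = 1, phi_1(x) = x

lam = 0.1                                        # regularization strength (arbitrary)
M = Phi.shape[1]
w_ridge = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
print(w_ridge)
```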

Bayesian Linear Regression - Example (I/VI)

Consider the problem of fitting a straight line to noisy measurements. Let the model be ($t_n \in \mathbb{R}$, $x_n \in \mathbb{R}$)
\[
t_n = w_0 + w_1 x_n + e_n, \qquad n = 1, \dots, N, \tag{1}
\]
where
\[
e_n \sim \mathcal{N}(0, 0.2^2), \qquad \beta = \frac{1}{0.2^2} = 25.
\]
Furthermore, let the prior be
\[
p(w) = \mathcal{N}\left(w \mid \begin{pmatrix} 0 & 0 \end{pmatrix}^T, \alpha^{-1} I\right),
\]
where $\alpha = 2$.

Bayesian Linear Regression - Example (II/VI)

Let the true values for $w$ be $w^\star = \begin{pmatrix} -0.3 & 0.5 \end{pmatrix}^T$ (plotted using a white circle below). Generate synthetic measurements by
\[
t_n = w_0^\star + w_1^\star x_n + e_n, \qquad e_n \sim \mathcal{N}(0, 0.2^2),
\]
where $x_n \sim \mathcal{U}(-1, 1)$.

According to (1), the following identity basis functions are used,
\[
\phi_0(x_n) = 1, \qquad \phi_1(x_n) = x_n.
\]
Since the example lives in two dimensions, we can plot distributions to illustrate the inference.
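A minimal sketch (seed and $N$ chosen arbitrarily) of how the synthetic data for this example can be generated:

```python
import numpy as np

# Minimal sketch: generate the synthetic data used in the example.
rng = np.random.default_rng(3)
w_star = np.array([-0.3, 0.5])          # true parameter values
alpha, beta = 2.0, 25.0                 # prior precision and noise precision (1/0.2^2)

N = 30                                  # number of measurements (arbitrary choice)
x = rng.uniform(-1, 1, size=N)          # x_n ~ U(-1, 1)
t = w_star[0] + w_star[1] * x + rng.normal(0, 0.2, size=N)  # t_n = w0* + w1*x_n + e_n

Phi = np.column_stack([np.ones_like(x), x])   # phi_0(x) = 1, phi_1(x) = x
```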


Bayesian Linear Regression - Example (III/VI)

Plot of the situation before any data arrives.

[Figure: the prior over $(w_0, w_1)$ and a few realizations from the posterior.]

Prior:
\[
p(w) = \mathcal{N}\left(w \mid \begin{pmatrix} 0 & 0 \end{pmatrix}^T, \tfrac{1}{2} I\right).
\]

Bayesian Linear Regression - Example (IV/VI)

Plot of the situation after one measurement has arrived.

[Figure: the likelihood (plotted as a function of $w$), the posterior/prior, and a few realizations from the posterior together with the first measurement (black circle).]

Likelihood:
\[
p(t_1 \mid w) = \mathcal{N}\left(t_1 \mid w_0 + w_1 x_1, \beta^{-1}\right).
\]
Posterior:
\[
p(w \mid t_1) = \mathcal{N}(w \mid m_1, S_1), \qquad m_1 = \beta S_1 \Phi^T t_1, \qquad S_1 = (\alpha I + \beta \Phi^T \Phi)^{-1}.
\]

Bayesian Linear Regression - Example (V/VI)

Plot of the situation after two measurements have arrived.

[Figure: the likelihood (plotted as a function of $w$), the posterior/prior, and a few realizations from the posterior together with the measurements (black circles).]

Likelihood:
\[
p(t_2 \mid w) = \mathcal{N}\left(t_2 \mid w_0 + w_1 x_2, \beta^{-1}\right).
\]
Posterior:
\[
p(w \mid T) = \mathcal{N}(w \mid m_2, S_2), \qquad m_2 = \beta S_2 \Phi^T T, \qquad S_2 = (\alpha I + \beta \Phi^T \Phi)^{-1}.
\]

Bayesian Linear Regression - Example (VI/VI)

Plot of the situation after 30 measurements have arrived.

[Figure: the likelihood (plotted as a function of $w$), the posterior/prior, and a few realizations from the posterior together with the measurements (black circles).]

Likelihood:
\[
p(t_{30} \mid w) = \mathcal{N}\left(t_{30} \mid w_0 + w_1 x_{30}, \beta^{-1}\right).
\]
Posterior:
\[
p(w \mid T) = \mathcal{N}(w \mid m_{30}, S_{30}), \qquad m_{30} = \beta S_{30} \Phi^T T, \qquad S_{30} = (\alpha I + \beta \Phi^T \Phi)^{-1}.
\]
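A minimal sketch (continuing the data-generation snippet above) of the posterior computation; using only the first $n$ measurements reproduces the sequence of posteriors shown in the plots:

```python
# Minimal sketch: posterior over w after N measurements (continues the snippet above).
#   S_N = (alpha * I + beta * Phi^T Phi)^{-1},   m_N = beta * S_N Phi^T t
S_N = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t
print("posterior mean:", m_N)   # close to w_star = [-0.3, 0.5] for large N

# Using only the first n rows of Phi and t gives the posterior after n measurements,
# i.e. the sequence of plots above (n = 0 reduces to the prior).
```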


Predictive Distribution - Example

Investigating the predictive distribution for the example above.

[Figure: the predictive distribution after N = 2, N = 5 and N = 200 observations. Red line: the true system ($y(x) = -0.3 + 0.5x$) generating the data. Blue line: the mean of the predictive distribution. Gray shaded area: one standard deviation of the predictive distribution; note that this is the point-wise predictive standard deviation as a function of $x$. Black circles: the observations.]

Posterior Distribution

Recall that the posterior distribution is given by
\[
p(w \mid T) = \mathcal{N}(w \mid m_N, S_N),
\]
where
\[
m_N = \beta S_N \Phi^T T, \qquad S_N = (\alpha I + \beta \Phi^T \Phi)^{-1}.
\]
Let us now investigate the posterior mean solution $m_N$ above, which has an interpretation that directly leads to the kernel methods (lecture 4), including Gaussian processes.
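The predictive distribution itself is not written out on these slides; a minimal sketch (continuing the snippets above) using the standard Gaussian result $p(t \mid x, T) = \mathcal{N}\left(t \mid m_N^T \phi(x),\, \beta^{-1} + \phi(x)^T S_N \phi(x)\right)$ for this model:

```python
# Minimal sketch: point-wise predictive mean and standard deviation (continues above).
# Standard result: p(t | x, T) = N(t | m_N^T phi(x), 1/beta + phi(x)^T S_N phi(x)).
x_grid = np.linspace(-1, 1, 101)
Phi_grid = np.column_stack([np.ones_like(x_grid), x_grid])

pred_mean = Phi_grid @ m_N
pred_std = np.sqrt(1.0 / beta + np.einsum('ij,jk,ik->i', Phi_grid, S_N, Phi_grid))
# pred_mean and pred_std correspond to the blue line and the gray shaded band above.
```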

Generative and Discriminative Models

Approaches that model the distributions of both the inputs and the outputs are known as generative models. The reason for the name is the fact that using these models we can generate new samples in the input space.

Approaches that model the posterior probabilities directly are referred to as discriminative models.

ML for Probabilistic Generative Models (I/IV)

Consider the two class case, where the class-conditional densities $p(x \mid \mathcal{C}_k)$ are Gaussian and the training data is given by $\{x_n, t_n\}_{n=1}^N$. Furthermore, assume that $p(\mathcal{C}_1) = \alpha$.

The task is now to find the parameters $\alpha$, $\mu_1$, $\mu_2$, $\Sigma$ by maximizing the likelihood function,
\[
p(T, X \mid \alpha, \mu_1, \mu_2, \Sigma) = \prod_{n=1}^{N} \left(p(x_n, \mathcal{C}_1)\right)^{t_n} \left(p(x_n, \mathcal{C}_2)\right)^{1 - t_n},
\]
where
\[
p(x_n, \mathcal{C}_1) = p(\mathcal{C}_1)\, p(x_n \mid \mathcal{C}_1) = \alpha\, \mathcal{N}(x_n \mid \mu_1, \Sigma),
\]
\[
p(x_n, \mathcal{C}_2) = p(\mathcal{C}_2)\, p(x_n \mid \mathcal{C}_2) = (1 - \alpha)\, \mathcal{N}(x_n \mid \mu_2, \Sigma).
\]


ML for Probabilistic Generative Models (II/IV)

Let us now maximize the logarithm of the likelihood function,
\[
L(\alpha, \mu_1, \mu_2, \Sigma) = \ln \prod_{n=1}^{N} \left(\alpha\, \mathcal{N}(x_n \mid \mu_1, \Sigma)\right)^{t_n} \left((1 - \alpha)\, \mathcal{N}(x_n \mid \mu_2, \Sigma)\right)^{1 - t_n}.
\]
The terms that depend on $\alpha$ are
\[
\sum_{n=1}^{N} \left(t_n \ln \alpha + (1 - t_n) \ln(1 - \alpha)\right),
\]
which is maximized by $\hat{\alpha} = \frac{1}{N} \sum_{n=1}^{N} t_n = \frac{N_1}{N_1 + N_2}$ (as expected). Here $N_k$ denotes the number of data points in class $\mathcal{C}_k$. Straightforwardly we get
\[
\hat{\mu}_1 = \frac{1}{N_1} \sum_{n=1}^{N} t_n x_n, \qquad \hat{\mu}_2 = \frac{1}{N_2} \sum_{n=1}^{N} (1 - t_n) x_n.
\]

ML for Probabilistic Generative Models (III/IV)

The terms that depend on $\Sigma$ are
\[
L(\Sigma) = -\frac{1}{2} \sum_{n=1}^{N} t_n \ln \det \Sigma - \frac{1}{2} \sum_{n=1}^{N} t_n (x_n - \mu_1)^T \Sigma^{-1} (x_n - \mu_1)
- \frac{1}{2} \sum_{n=1}^{N} (1 - t_n) \ln \det \Sigma - \frac{1}{2} \sum_{n=1}^{N} (1 - t_n) (x_n - \mu_2)^T \Sigma^{-1} (x_n - \mu_2).
\]
Using the fact that $x^T A x = \mathrm{Tr}\left(A x x^T\right)$ we have
\[
L(\Sigma) = -\frac{N}{2} \ln \det \Sigma - \frac{N}{2} \mathrm{Tr}\left(\Sigma^{-1} S\right),
\]
where
\[
S = \frac{1}{N} \sum_{n=1}^{N} \left(t_n (x_n - \mu_1)(x_n - \mu_1)^T + (1 - t_n)(x_n - \mu_2)(x_n - \mu_2)^T\right).
\]
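A minimal sketch (with made-up two-dimensional data) of the closed-form ML estimates above; the quantity $S$ is also the ML estimate of $\Sigma$, as shown on the next slide:

```python
import numpy as np

# Minimal sketch: ML estimates for the two-class Gaussian generative model
# with shared covariance, given labels t_n in {0, 1} (t_n = 1 for class C_1).
rng = np.random.default_rng(4)
X1 = rng.multivariate_normal([1.0, 1.0], np.eye(2), size=40)    # made-up class C_1 data
X2 = rng.multivariate_normal([-1.0, 0.0], np.eye(2), size=60)   # made-up class C_2 data
X = np.vstack([X1, X2])
t = np.concatenate([np.ones(len(X1)), np.zeros(len(X2))])

N, N1, N2 = len(t), t.sum(), (1 - t).sum()
alpha_hat = N1 / N                                  # = N1 / (N1 + N2)
mu1_hat = (t[:, None] * X).sum(axis=0) / N1
mu2_hat = ((1 - t)[:, None] * X).sum(axis=0) / N2

D1 = X - mu1_hat
D2 = X - mu2_hat
S = (D1.T @ (t[:, None] * D1) + D2.T @ ((1 - t)[:, None] * D2)) / N   # shared covariance
```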

ML for Probabilistic Generative Models (IV/IV)

Lemma (Useful matrix derivatives)
\[
\frac{\partial}{\partial M} \ln \det M = M^{-T}, \qquad
\frac{\partial}{\partial M} \mathrm{Tr}\left(M^{-1} N\right) = -M^{-T} N^T M^{-T}.
\]

Differentiating $L(\Sigma) = -\frac{N}{2} \ln \det \Sigma - \frac{N}{2} \mathrm{Tr}\left(\Sigma^{-1} S\right)$ results in
\[
\frac{\partial L}{\partial \Sigma} = -\frac{N}{2} \Sigma^{-T} + \frac{N}{2} \Sigma^{-T} S \Sigma^{-T}.
\]
Hence,
\[
\frac{\partial L}{\partial \Sigma} = 0 \quad \Longrightarrow \quad \hat{\Sigma} = S.
\]
More results on matrix derivatives are available in Magnus, J. R. & Neudecker, H. (1999). Matrix Differential Calculus with Applications in Statistics and Econometrics. 2nd Edition, Chichester, UK: Wiley.

Generalized Linear Models for Classification

In linear regression we made use of a linear model
\[
t_n = y(x_n, w) + e_n = w^T \phi(x_n) + e_n.
\]
For classification problems the target variables are discrete or, slightly more generally, posterior probabilities in the interval $(0, 1)$. This is achieved using a so-called activation function $f$ ($f^{-1}$ must exist),
\[
y(x) = f(w^T x + w_0). \tag{2}
\]
Note that the decision surface corresponds to $y(x) = \text{constant}$, implying that $w^T x + w_0 = \text{constant}$. This means that the decision surface is a linear function of $x$, even if $f$ is nonlinear. Hence, the name generalized linear model for (2).


Gradient of L(w) for Logistic Regression (I/II)

The negative log-likelihood is
\[
L(w) = -\sum_{n=1}^{N} \left(t_n \ln y_n + (1 - t_n) \ln(1 - y_n)\right),
\]
where
\[
y_n = \sigma(a_n) = \frac{1}{1 + \exp(-a_n)}, \qquad a_n = w^T \phi_n.
\]
Using the chain rule we have
\[
\frac{\partial L}{\partial w} = \sum_{n=1}^{N} \frac{\partial L}{\partial y_n} \frac{\partial y_n}{\partial a_n} \frac{\partial a_n}{\partial w},
\]
where
\[
\frac{\partial L}{\partial y_n} = \frac{1 - t_n}{1 - y_n} - \frac{t_n}{y_n} = \frac{y_n - t_n}{y_n(1 - y_n)}.
\]

Gradient of L(w) for Logistic Regression (II/II)

Furthermore,
\[
\frac{\partial y_n}{\partial a_n} = \frac{\partial \sigma(a_n)}{\partial a_n} = \dots = \sigma(a_n)\left(1 - \sigma(a_n)\right) = y_n (1 - y_n), \qquad
\frac{\partial a_n}{\partial w} = \phi_n,
\]
which results in the following expression for the gradient
\[
\frac{\partial L}{\partial w} = \sum_{n=1}^{N} (y_n - t_n)\, \phi_n = \Phi^T (Y - T),
\]
where
\[
\Phi = \begin{pmatrix}
\phi_0(x_1) & \phi_1(x_1) & \dots & \phi_{M-1}(x_1) \\
\phi_0(x_2) & \phi_1(x_2) & \dots & \phi_{M-1}(x_2) \\
\vdots & \vdots & \ddots & \vdots \\
\phi_0(x_N) & \phi_1(x_N) & \dots & \phi_{M-1}(x_N)
\end{pmatrix}, \qquad
Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix}, \qquad
T = \begin{pmatrix} t_1 \\ t_2 \\ \vdots \\ t_N \end{pmatrix}.
\]
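A minimal sketch (made-up data and basis) of evaluating $L(w)$ and its gradient in the matrix form above:

```python
import numpy as np

# Minimal sketch: negative log-likelihood and its gradient for logistic regression.
def neg_log_lik_and_grad(w, Phi, t):
    y = 1.0 / (1.0 + np.exp(-Phi @ w))                    # y_n = sigma(w^T phi_n)
    L = -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))  # L(w)
    grad = Phi.T @ (y - t)                                # dL/dw = Phi^T (Y - T)
    return L, grad

# Made-up example: one-dimensional input with basis (1, x).
rng = np.random.default_rng(5)
x = rng.uniform(-2, 2, size=100)
t = (x + 0.3 * rng.normal(size=100) > 0).astype(float)    # made-up labels
Phi = np.column_stack([np.ones_like(x), x])

L, grad = neg_log_lik_and_grad(np.zeros(2), Phi, t)
```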

Hessian of L(w) for Logistic Regression

\[
H = \frac{\partial^2 L}{\partial w \partial w^T} = \dots = \sum_{n=1}^{N} y_n (1 - y_n)\, \phi_n \phi_n^T = \Phi^T R \Phi,
\]
where
\[
R = \begin{pmatrix}
y_1(1 - y_1) & 0 & \dots & 0 \\
0 & y_2(1 - y_2) & \dots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \dots & y_N(1 - y_N)
\end{pmatrix}.
\]

Bayesian Logistic Regression

Recall that
\[
p(T \mid w) = \prod_{n=1}^{N} \sigma(w^T \phi_n)^{t_n} \left(1 - \sigma(w^T \phi_n)\right)^{1 - t_n}.
\]
Hence, computing the posterior density
\[
p(w \mid T) = \frac{p(T \mid w)\, p(w)}{p(T)}
\]
is intractable. We are forced to an approximation. Three alternatives:
1. Laplace approximation (this lecture)
2. VB & EP (lecture 5)
3. Monte Carlo methods, e.g., MCMC (lecture 6)
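The gradient and Hessian above also give a Newton (IRLS) iteration for the point estimate of $w$, which is what the Laplace approximation below is centered on. A minimal sketch (continuing the logistic-regression snippet above, unregularized ML for simplicity):

```python
# Minimal sketch: Newton's method (IRLS) for the ML estimate, using
#   gradient = Phi^T (y - t)  and  Hessian = Phi^T R Phi.
w = np.zeros(Phi.shape[1])
for _ in range(20):
    y = 1.0 / (1.0 + np.exp(-Phi @ w))
    grad = Phi.T @ (y - t)
    R = np.diag(y * (1 - y))
    H = Phi.T @ R @ Phi
    w = w - np.linalg.solve(H, grad)   # Newton step
w_ml = w
```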


Laplace Approximation (I/III)

The Laplace approximation is a simple approximation that is obtained by fitting a Gaussian centered around the (MAP) mode of the distribution.

Consider the density function $p(z)$ of a scalar stochastic variable $z$, given by
\[
p(z) = \frac{1}{Z} f(z),
\]
where $Z = \int f(z)\, dz$ is the normalization coefficient.

We start by finding a mode $z_0$ of the density function,
\[
\left. \frac{d f(z)}{d z} \right|_{z = z_0} = 0.
\]

Laplace Approximation (II/III)

Consider a Taylor expansion of $\ln f(z)$ around the mode $z_0$,
\[
\ln f(z) \approx \ln f(z_0) + \underbrace{\left. \frac{d}{dz} \ln f(z) \right|_{z = z_0}}_{= 0} (z - z_0) + \frac{1}{2} \left. \frac{d^2}{dz^2} \ln f(z) \right|_{z = z_0} (z - z_0)^2
= \ln f(z_0) - \frac{A}{2} (z - z_0)^2, \tag{3}
\]
where
\[
A = -\left. \frac{d^2}{dz^2} \ln f(z) \right|_{z = z_0}.
\]
Taking the exponential of both sides in the approximation (3) results in
\[
f(z) \approx f(z_0) \exp\left(-\frac{A}{2}(z - z_0)^2\right).
\]
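A minimal sketch (made-up unnormalized density $f(z)$, mode found by grid search, second derivative by finite differences) of the recipe above in one dimension:

```python
import numpy as np

# Minimal sketch: 1-D Laplace approximation of p(z) = f(z) / Z for a made-up f.
def log_f(z):
    # an unnormalized, non-Gaussian log-density (illustration only)
    return -0.5 * z**2 + np.log(1.0 + np.exp(2.0 * z))

# 1. Find a mode z0 of f (equivalently of ln f), here by a simple grid search.
zs = np.linspace(-10.0, 10.0, 200001)
z0 = zs[np.argmax(log_f(zs))]

# 2. A = -(d^2/dz^2) ln f(z) at z = z0, here via a finite-difference approximation.
h = 1e-4
A = -(log_f(z0 + h) - 2.0 * log_f(z0) + log_f(z0 - h)) / h**2

# 3. The Laplace approximation is the Gaussian q(z) = N(z | z0, 1/A).
print(f"q(z) = N(z | {z0:.3f}, {1.0 / A:.3f})")
```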

Laplace Approximation (III/III)

By normalizing this expression we have now obtained a Gaussian approximation
\[
q(z) = \left(\frac{A}{2\pi}\right)^{1/2} \exp\left(-\frac{A}{2}(z - z_0)^2\right),
\]
where
\[
A = -\left. \frac{d^2}{dz^2} \ln f(z) \right|_{z = z_0}.
\]
The main limitation of the Laplace approximation is that it is a local method that only captures aspects of the true density around a specific value $z_0$.

Bayesian Logistic Regression (I/II)

The posterior is
\[
p(w \mid T) \propto p(T \mid w)\, p(w), \tag{4}
\]
where we assume a Gaussian prior $p(w) = \mathcal{N}(w \mid m_0, S_0)$ and make use of the Laplace approximation. Taking the logarithm on both sides of (4) gives
\[
\ln p(w \mid T) = -\frac{1}{2}(w - m_0)^T S_0^{-1} (w - m_0) + \sum_{n=1}^{N} \left(t_n \ln y_n + (1 - t_n) \ln(1 - y_n)\right) + \text{const.},
\]
where $y_n = \sigma(w^T \phi_n)$.


Bayesian Logistic Regression (II/II)

Using the Laplace approximation we can now obtain a Gaussian approximation
\[
q(w) = \mathcal{N}(w \mid w_{\text{MAP}}, S_N),
\]
where $w_{\text{MAP}}$ is the MAP estimate of $p(w \mid T)$ and the covariance $S_N$ is the inverse of the negative Hessian of $\ln p(w \mid T)$,
\[
S_N^{-1} = -\frac{\partial^2}{\partial w \partial w^T} \ln p(w \mid T) = S_0^{-1} + \sum_{n=1}^{N} y_n (1 - y_n)\, \phi_n \phi_n^T.
\]
Based on this distribution we can now start making predictions for new input data $\phi(x)$, which is typically what we are interested in. Recall that prediction corresponds to marginalization w.r.t. $w$.

A Few Concepts to Summarize Lecture 2

Hyperparameter: A parameter of the prior distribution that controls the distribution of the parameters of the model.

Classification: The goal of classification is to assign an input vector $x$ to one of $K$ classes, $\mathcal{C}_k$, $k = 1, \dots, K$.

Discriminant: A discriminant is a function that takes an input $x$ and assigns it to one of $K$ classes.

Generative models: Approaches that model the distributions of both the inputs and the outputs are known as generative models. In classification this amounts to modelling the class-conditional densities $p(x \mid \mathcal{C}_k)$, as well as the prior densities $p(\mathcal{C}_k)$. The reason for the name is the fact that using these models we can generate new samples in the input space.

Discriminative models: Approaches that model the posterior probability directly are referred to as discriminative models.

Logistic regression: A discriminative model that makes direct use of a generalized linear model in the form of a logistic sigmoid to solve the classification problem.

Laplace approximation: A local approximation method that finds the mode of the posterior distribution and then fits a Gaussian centered at that mode.
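A minimal sketch (continuing the logistic-regression snippets above; the prior $m_0 = 0$, $S_0 = I$ and the new input are chosen arbitrarily) of the Laplace approximation and a Monte Carlo approximation of the predictive probability:

```python
# Minimal sketch: Laplace approximation for Bayesian logistic regression
# (continues the Phi, t defined above; prior chosen as m0 = 0, S0 = I).
m0 = np.zeros(Phi.shape[1])
S0_inv = np.eye(Phi.shape[1])

w = m0.copy()
for _ in range(20):                                      # Newton iterations -> w_MAP
    y = 1.0 / (1.0 + np.exp(-Phi @ w))
    grad = S0_inv @ (w - m0) + Phi.T @ (y - t)           # gradient of -ln p(w | T)
    H = S0_inv + Phi.T @ np.diag(y * (1 - y)) @ Phi      # Hessian of -ln p(w | T)
    w = w - np.linalg.solve(H, grad)
w_map = w

y = 1.0 / (1.0 + np.exp(-Phi @ w_map))
S_N = np.linalg.inv(S0_inv + Phi.T @ np.diag(y * (1 - y)) @ Phi)  # q(w) = N(w_MAP, S_N)

# Predictive probability p(C_1 | phi) by Monte Carlo marginalization over w ~ q(w).
rng = np.random.default_rng(6)
phi_new = np.array([1.0, 0.5])                           # a new input phi(x), made up
ws = rng.multivariate_normal(w_map, S_N, size=2000)
p_c1 = np.mean(1.0 / (1.0 + np.exp(-ws @ phi_new)))
```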
