Logistic Regression and Generalized Linear Models

Sridhar Mahadevan
[email protected]
University of Massachusetts
©Sridhar Mahadevan: CMPSCI 689 – p. 1/29

Topics

- Generative vs. discriminative models.
- In many cases, it is difficult to model data using a parametric class-conditional density $P(X \mid \omega, \theta)$.
- Yet in many problems a linear decision boundary is adequate to separate the classes (Gaussian densities with a shared covariance matrix also produce a linear decision boundary).
- Logistic regression: a discriminative model for classification that produces linear decision boundaries.
- The model-fitting problem is solved using maximum likelihood.
- Iterative gradient-based algorithm for solving the nonlinear maximum likelihood equations.
- Recursive weighted least squares regression.
- Logistic regression is an instance of a generalized linear model (GLM), a family that covers a large variety of exponential-family models. GLMs can also be extended to generalized additive models (GAMs).

Discriminative vs. Generative Models

Both generative and discriminative approaches address the problem of modeling the discriminant function $P(y \mid x)$ of output labels (or values) $y$ conditioned on the input $x$.

In generative models, we estimate both the class prior $P(y)$ and the class-conditional density $P(x \mid y)$, and use Bayes' rule to compute the discriminant:

$$P(y \mid x) \propto P(y)\, P(x \mid y)$$

Discriminative approaches model the conditional distribution $P(y \mid x)$ directly and ignore the marginal $P(x)$.

We now turn to several instances of discriminative models, including logistic regression in this class and, later, several other types including support vector machines.

Generalized Linear Models

In linear regression, we model the output $y$ as a linear function of the input variables, plus a noise term that is zero-mean, constant-variance Gaussian:

$$y = g(x) + \epsilon,$$

where the conditional mean is $E(y \mid x) = g(x)$, the noise term is $\epsilon$, and

$$g(x) = \beta^T x$$

(where $\beta_0$ is an offset term).
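As a reminder of how the Gaussian noise assumption connects to squared error, here is a short derivation sketch. It assumes IID noise $\epsilon \sim N(0, \sigma^2)$ and observations $(x^i, y^i)$, $i = 1, \ldots, n$; the symbol $\sigma^2$ for the noise variance is introduced here for illustration and does not appear in the slides.

$$
\log P(Y \mid X, \beta)
= \sum_{i=1}^{n} \log \left[ \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y^i - \beta^T x^i)^2}{2\sigma^2} \right) \right]
= -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y^i - \beta^T x^i)^2
$$

The first term does not depend on $\beta$, and $\sigma^2$ only rescales the second, so maximizing the log-likelihood over $\beta$ is the same as minimizing the sum of squared errors, whatever the value of $\sigma^2$.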
We saw earlier that the maximum likelihood framework justifies the use of a squared error loss function, provided the errors are IID Gaussian (the variance does not matter).

We want to generalize this idea of specifying a model family by specifying the type of error distribution:

- When the output variable $y$ is discrete (e.g., binary or multinomial), the noise term is not Gaussian, but binomial or multinomial.
- For these distributions a change in the mean is coupled with a change in the variance, and we want to be able to couple mean and variance in our model.

Generalized linear models provide a rich family of models based on specifying the error distribution.

Logit Function

For binary classification the output variable $y$ only takes the values 0 and 1, so we need a different way of representing $E(y \mid x)$, one whose range is the interval $(0, 1)$. One convenient form to use is the sigmoid or logistic function.

Let us assume a vector-valued input variable $x = (x_1, \ldots, x_p)$. The logistic function is S-shaped and approaches 0 as $\beta^T x \to -\infty$ and 1 as $\beta^T x \to \infty$:

$$P(y = 1 \mid x, \beta) = \mu(x \mid \beta) = \frac{e^{\beta^T x}}{1 + e^{\beta^T x}} = \frac{1}{1 + e^{-\beta^T x}}$$

$$P(y = 0 \mid x, \beta) = 1 - \mu(x \mid \beta) = \frac{1}{1 + e^{\beta^T x}}$$

We assume an extra input $x_0 = 1$, so that $\beta_0$ is an offset.

We can invert the above transformation to get the logit function:

$$g(x \mid \beta) = \log \frac{\mu(x \mid \beta)}{1 - \mu(x \mid \beta)} = \beta^T x$$

Logistic Regression

[Figure: diagram with inputs X0, X1, X2, weights β0, β1, β2, and output y.]

Example Dataset for Logistic Regression

The data set we are analyzing is on coronary heart disease in South Africa. The chd response (output) variable is binary (yes, no), and there are 9 predictor variables.

There are 462 instances, of which 160 are cases (positive instances) and 302 are controls (negative instances).

The predictor variables are systolic blood pressure (sbp), tobacco, ldl, famhist, obesity, alcohol, age, adiposity, and typea.

Let's focus on a subset of the predictors: sbp, tobacco, ldl, famhist, obesity, alcohol, age.
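The logistic and logit functions from the Logit Function slide can be sketched in Python as follows. This is a minimal illustration, not code from the course: the function names `sigmoid`, `logit`, and `predict_proba` are assumptions chosen here, and the input `x` is assumed to already include the constant $x_0 = 1$.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def logit(mu):
    """Log-odds log(mu / (1 - mu)); the inverse of the logistic function on (0, 1)."""
    return math.log(mu / (1.0 - mu))

def predict_proba(beta, x):
    """P(y = 1 | x, beta), where x already contains the constant x0 = 1."""
    z = sum(b * xi for b, xi in zip(beta, x))
    return sigmoid(z)
```

Applying `logit` to the output of `sigmoid` recovers the linear predictor $\beta^T x$ up to floating-point error, which is the inversion described on the slide.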
We want to fit a model of the following form:

$$P(\mathrm{chd} = 1 \mid x, \beta) = \frac{1}{1 + e^{-\beta^T x}}$$

where

$$\beta^T x = \beta_0 + \beta_1 x_{\mathrm{sbp}} + \beta_2 x_{\mathrm{tobacco}} + \beta_3 x_{\mathrm{ldl}} + \beta_4 x_{\mathrm{famhist}} + \beta_5 x_{\mathrm{age}} + \beta_6 x_{\mathrm{alcohol}} + \beta_7 x_{\mathrm{obesity}}$$

Noise Model for Logistic Regression

Let us try to represent the logistic regression model as

$$y = \mu(x \mid \beta) + \epsilon$$

and ask ourselves what sort of noise model is represented by $\epsilon$.

Since $y$ takes on the value 1 with probability $\mu(x \mid \beta)$, it follows that $\epsilon$ can also take on only two possible values:

- If $y = 1$, then $\epsilon = 1 - \mu(x \mid \beta)$, which happens with probability $\mu(x \mid \beta)$.
- Conversely, if $y = 0$, then $\epsilon = -\mu(x \mid \beta)$, which happens with probability $1 - \mu(x \mid \beta)$.

This analysis shows that the error term in logistic regression is a binomially distributed random variable (a Bernoulli variable shifted to have mean zero). Its moments can be computed readily:

$$E(\epsilon) = \mu(x \mid \beta)\,(1 - \mu(x \mid \beta)) - (1 - \mu(x \mid \beta))\,\mu(x \mid \beta) = 0$$

(the error term has mean 0), and

$$\mathrm{Var}(\epsilon) = E\epsilon^2 - (E\epsilon)^2 = E\epsilon^2 = \mu(x \mid \beta)\,(1 - \mu(x \mid \beta))$$

(show this!).

Maximum Likelihood for LR

Suppose we want to fit a logistic regression model to a dataset of $n$ observations $(x^1, y^1), \ldots, (x^n, y^n)$, with the inputs collected in $X$ and the outputs in $Y$.

We can express the conditional likelihood of a single observation simply as

$$P(y^i \mid x^i, \beta) = \mu(x^i \mid \beta)^{y^i} \, (1 - \mu(x^i \mid \beta))^{1 - y^i}$$

Hence, the conditional likelihood of the entire dataset can be written as

$$P(Y \mid X, \beta) = \prod_{i=1}^{n} \mu(x^i \mid \beta)^{y^i} \, (1 - \mu(x^i \mid \beta))^{1 - y^i}$$

The conditional log-likelihood is then simply

$$l(\beta \mid X, Y) = \sum_{i=1}^{n} \left[ y^i \log \mu(x^i \mid \beta) + (1 - y^i) \log(1 - \mu(x^i \mid \beta)) \right]$$
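The conditional log-likelihood above translates almost directly into code. The following is a minimal sketch under stated assumptions: the names `mu` and `log_likelihood` are chosen here for illustration, each input row is assumed to start with the constant $x_0 = 1$, and the five-point dataset is made up (it is not the heart disease data).

```python
import math

def mu(beta, x):
    """Fitted probability P(y = 1 | x, beta) = 1 / (1 + exp(-beta^T x))."""
    z = sum(b * xi for b, xi in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(beta, X, Y):
    """Sum over observations of y*log(mu) + (1 - y)*log(1 - mu)."""
    total = 0.0
    for x, y in zip(X, Y):
        p = mu(beta, x)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

# Toy data, made up for illustration: each row is [x0 = 1, x1].
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 0.5], [1.0, 1.0], [1.0, 2.0]]
Y = [0, 0, 1, 0, 1]

# With beta = 0 every fitted probability is 0.5; a positive slope fits better.
ll_zero = log_likelihood([0.0, 0.0], X, Y)
ll_slope = log_likelihood([0.0, 1.0], X, Y)
```

On this toy data `ll_slope` is larger than `ll_zero`, since larger values of `x1` tend to go with `y = 1`. Maximum likelihood fitting searches for the `beta` that makes this quantity as large as possible.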
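Maximum Likelihood for LR (continued)

We maximize the conditional log-likelihood by taking gradients:

$$\frac{\partial l(\beta \mid X, Y)}{\partial \beta_k} = \sum_{i=1}^{n} \left[ y^i \, \frac{1}{\mu(x^i \mid \beta)} \frac{\partial \mu(x^i \mid \beta)}{\partial \beta_k} - (1 - y^i) \, \frac{1}{1 - \mu(x^i \mid \beta)} \frac{\partial \mu(x^i \mid \beta)}{\partial \beta_k} \right]$$

Using the fact that

$$\frac{\partial \mu(x^i \mid \beta)}{\partial \beta_k} = \frac{\partial}{\partial \beta_k} \left( \frac{1}{1 + e^{-\beta^T x^i}} \right) = \mu(x^i \mid \beta)\,(1 - \mu(x^i \mid \beta))\, x^i_k,$$

we get

$$\frac{\partial l(\beta \mid X, Y)}{\partial \beta_k} = \sum_{i=1}^{n} x^i_k \, (y^i - \mu(x^i \mid \beta))$$

Setting this to 0, and using $x^i_0 = 1$, the first component of these equations reduces to

$$\sum_{i=1}^{n} y^i = \sum_{i=1}^{n} \mu(x^i \mid \beta)$$

The expected number of instances of each class must match the observed number.

Newton-Raphson Method

Newton's method is a general procedure for finding the roots of an equation $f(\theta) = 0$. Newton's algorithm is based on the recursion

$$\theta_{t+1} = \theta_t - \frac{f(\theta_t)}{f'(\theta_t)}$$

As stated, Newton's method finds a zero of a function $f$. We want to find the maximum of the log-likelihood. But a maximum of a smooth function $f(\theta)$ occurs at a point where its derivative $f'(\theta) = 0$. So, plugging in $f'(\theta)$ for $f(\theta)$ above, we get

$$\theta_{t+1} = \theta_t - \frac{f'(\theta_t)}{f''(\theta_t)}$$

Fisher Scoring

In logistic regression, the parameter $\beta$ is a vector, so we have to use the multivariate form of the Newton-Raphson algorithm:

$$\beta_{t+1} = \beta_t - H^{-1} \nabla_\beta \, l(\beta_t \mid X, Y)$$

Here, $\nabla_\beta \, l(\beta_t \mid X, Y)$ is the vector of partial derivatives of the log-likelihood, and

$$H_{ij} = \frac{\partial^2 l(\beta \mid X, Y)}{\partial \beta_i \, \partial \beta_j}$$

is the Hessian matrix of second-order derivatives.

The use of Newton's method to find the solution to the conditional log-likelihood equation is called Fisher scoring. (Strictly, Fisher scoring replaces the Hessian by its expected value; for logistic regression the Hessian does not depend on $y$, so the two iterations coincide.)

Fisher Scoring for Maximum Likelihood

Taking the second derivative of the likelihood score equations gives us

$$\frac{\partial^2 l(\beta \mid X, Y)}{\partial \beta_k \, \partial \beta_m} = - \sum_{i=1}^{n} x^i_k \, x^i_m \, \mu(x^i \mid \beta)\,(1 - \mu(x^i \mid \beta))$$

We can use matrix notation to write the Newton-Raphson algorithm for logistic regression. Define the $n \times n$ diagonal matrix

$$W = \begin{pmatrix}
\mu(x^1 \mid \beta)(1 - \mu(x^1 \mid \beta)) & 0 & \cdots & 0 \\
0 & \mu(x^2 \mid \beta)(1 - \mu(x^2 \mid \beta)) & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \mu(x^n \mid \beta)(1 - \mu(x^n \mid \beta))
\end{pmatrix}$$

Let $Y$ be an $n \times 1$ column vector of output values, $X$ be the design matrix of size $n \times (p + 1)$ of input values, and $P$ be the column vector of fitted probability values $\mu(x^i \mid \beta)$.

Iterative Weighted Least Squares

The gradient of the log-likelihood can be written in matrix form as

$$\frac{\partial l(\beta \mid X, Y)}{\partial \beta} = \sum_{i=1}^{n} x^i \, (y^i - \mu(x^i \mid \beta)) = X^T (Y - P)$$

The Hessian can be written as

$$\frac{\partial^2 l(\beta \mid X, Y)}{\partial \beta \, \partial \beta^T} = -X^T W X$$

The Newton-Raphson algorithm then becomes

$$\begin{aligned}
\beta^{new} &= \beta^{old} + (X^T W X)^{-1} X^T (Y - P) \\
&= (X^T W X)^{-1} X^T W \left( X \beta^{old} + W^{-1} (Y - P) \right) \\
&= (X^T W X)^{-1} X^T W Z
\end{aligned}$$

where $Z \equiv X \beta^{old} + W^{-1} (Y - P)$.

Weighted Least Squares Regression

Weighted least squares regression finds the best least-squares solution to the equation

$$W A x \approx W b$$

The normal equations for this problem are

$$(W A)^T W A \, \hat{x} = (W A)^T W b,$$

so that

$$\hat{x} = (A^T C A)^{-1} A^T C b, \quad \text{where } C = W^T W.$$

Returning to logistic regression, we now see that

$$\beta^{new} = (X^T W X)^{-1} X^T W Z$$

is a weighted least squares regression, where $X$ plays the role of the matrix $A$ above, the diagonal weight matrix $W$ with entries $\mu(x^i \mid \beta)(1 - \mu(x^i \mid \beta))$ plays the role of $C$, and $Z$ corresponds to the vector $b$. Because $W$ and $Z$ both depend on $\beta^{old}$, the regression has to be re-solved at every iteration.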
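The update $\beta^{new} = (X^T W X)^{-1} X^T W Z$ can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the course's implementation: the names `solve` and `irls`, the fixed iteration count, the zero starting point, and the small made-up dataset are all choices made here. It also assumes the classes are not linearly separable, so that the fitted probabilities stay away from 0 and 1 and the weights stay positive.

```python
import math

def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def irls(X, Y, iterations=25):
    """Newton-Raphson for logistic regression in weighted least squares form."""
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(iterations):
        eta = [sum(b * xi for b, xi in zip(beta, x)) for x in X]     # X beta_old
        P = [1.0 / (1.0 + math.exp(-e)) for e in eta]                # fitted probabilities
        W = [pi * (1.0 - pi) for pi in P]                            # diagonal of W
        Z = [e + (y - pi) / w for e, y, pi, w in zip(eta, Y, P, W)]  # adjusted response Z
        XtWX = [[sum(w * x[j] * x[k] for w, x in zip(W, X)) for k in range(p)]
                for j in range(p)]
        XtWZ = [sum(w * x[j] * z for w, x, z in zip(W, X, Z)) for j in range(p)]
        beta = solve(XtWX, XtWZ)  # beta_new = (X^T W X)^{-1} X^T W Z
    return beta

# Toy data, made up for illustration: each row is [x0 = 1, x1]; not separable.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 0.5], [1.0, 1.0], [1.0, 2.0]]
Y = [0, 0, 1, 0, 1]
beta_hat = irls(X, Y)
```

At the returned `beta_hat` the score equations $X^T (Y - P) = 0$ should hold to numerical precision; in particular, the fitted probabilities sum to the number of positive instances, as derived earlier. The sketch solves the linear system rather than forming $(X^T W X)^{-1}$ explicitly, which is the usual choice for numerical stability.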
