
ECE595 / STAT598: Machine Learning I, Lecture 14: Logistic Regression 1

Spring 2020

Stanley Chan

School of Electrical and Computer Engineering Purdue University

Overview

In linear discriminant analysis (LDA), there are generally two types of approaches:
- Generative approach: estimate the model, then define the classifier.
- Discriminative approach: directly define the classifier.

Outline

Discriminative Approaches
- Lecture 14: Logistic Regression 1
- Lecture 15: Logistic Regression 2

This lecture: Logistic Regression 1
- From Linear to Logistic: Motivation, Loss Function, Why not L2 Loss?
- Interpreting Logistic: Maximum Likelihood, Log-odds
- Convexity: Is the logistic loss convex? Computation

Geometry of Linear Classifiers

- The discriminant function g(x) is linear.
- The hypothesis function h(x) = sign(g(x)) is a unit step.

From Linear to Logistic Regression

Can we replace the hard decision sign(g(x)) with a soft version of sign(g(x))? This gives logistic regression.


The function
$$h(\boldsymbol{x}) = \frac{1}{1+e^{-g(\boldsymbol{x})}} = \frac{1}{1+e^{-(\boldsymbol{w}^T\boldsymbol{x}+w_0)}}$$
is called a sigmoid function. Its 1D form is
$$h(x) = \frac{1}{1+e^{-a(x-x_0)}}, \quad \text{for some } a \text{ and } x_0,$$
where
- a controls the transient speed,
- x_0 controls the cutoff location.
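To make the two parameters concrete, here is a minimal NumPy sketch of the 1D sigmoid; the function name and the sample values of a and x_0 are illustrative choices, not from the slides.

```python
import numpy as np

def sigmoid(x, a=1.0, x0=0.0):
    """1D sigmoid h(x) = 1 / (1 + exp(-a * (x - x0)))."""
    return 1.0 / (1.0 + np.exp(-a * (x - x0)))

# a controls how sharp the transition is; x0 shifts where the cutoff occurs.
x = np.linspace(-5.0, 5.0, 11)
print(sigmoid(x, a=2.0, x0=1.0))
```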

Sigmoid Function

Note that
$$h(x) \to 1 \text{ as } x \to \infty, \qquad h(x) \to 0 \text{ as } x \to -\infty.$$

So h(x) can be regarded as a “probability”.

Sigmoid Function

The derivative is
$$\begin{aligned}
\frac{d}{dx}\left[\frac{1}{1+e^{-a(x-x_0)}}\right]
&= -\left(1+e^{-a(x-x_0)}\right)^{-2} e^{-a(x-x_0)} (-a) \\
&= a\left(\frac{e^{-a(x-x_0)}}{1+e^{-a(x-x_0)}}\right)\left(\frac{1}{1+e^{-a(x-x_0)}}\right) \\
&= a\left(1-\frac{1}{1+e^{-a(x-x_0)}}\right)\left(\frac{1}{1+e^{-a(x-x_0)}}\right) \\
&= a\,[1-h(x)]\,[h(x)].
\end{aligned}$$

Since 0 < h(x) < 1, we have 0 < 1 − h(x) < 1. Therefore the derivative is always positive, so h is an increasing function. Hence h can be regarded as a “CDF”.
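As a quick sanity check of the identity h'(x) = a[1 − h(x)]h(x), the sketch below compares it with a centered finite difference; the particular values of a, x_0, and x are arbitrary.

```python
import numpy as np

def sigmoid(x, a=1.0, x0=0.0):
    return 1.0 / (1.0 + np.exp(-a * (x - x0)))

a, x0, x, eps = 2.0, 1.0, 0.3, 1e-6
h = sigmoid(x, a, x0)
analytic = a * (1.0 - h) * h                      # closed-form derivative
numeric = (sigmoid(x + eps, a, x0) - sigmoid(x - eps, a, x0)) / (2 * eps)
print(analytic, numeric)                          # the two values should match closely
```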

[Figure: plot of the sigmoid function. Image source: http://georgepavlides.info/wp-content/uploads/2018/02/logistic-binary-e1517639495140.jpg]

From Linear to Logistic Regression

Can we replace the hard decision sign(g(x)) with a soft version of sign(g(x))? This gives logistic regression.

Loss Function for Linear Regression

All discriminative algorithms have a training loss function
$$J(\boldsymbol{\theta}) = \frac{1}{N}\sum_{n=1}^{N} L(g(\boldsymbol{x}_n), y_n).$$
In linear regression,
$$\begin{aligned}
J(\boldsymbol{\theta}) &= \frac{1}{N}\sum_{n=1}^{N} \big(g(\boldsymbol{x}_n) - y_n\big)^2
= \frac{1}{N}\sum_{n=1}^{N} \big(\boldsymbol{w}^T\boldsymbol{x}_n + w_0 - y_n\big)^2 \\
&= \frac{1}{N}\left\|
\begin{bmatrix} \boldsymbol{x}_1^T & 1 \\ \vdots & \vdots \\ \boldsymbol{x}_N^T & 1 \end{bmatrix}
\begin{bmatrix} \boldsymbol{w} \\ w_0 \end{bmatrix}
- \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix}
\right\|^2
= \frac{1}{N}\|\boldsymbol{A}\boldsymbol{\theta} - \boldsymbol{y}\|^2.
\end{aligned}$$
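For concreteness, here is a small sketch (with synthetic data, not from the slides) of how the matrix A is assembled and how the least-squares θ = [w; w_0] is obtained.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 50, 2
X = rng.normal(size=(N, d))                       # rows are x_n^T
y = X @ np.array([1.5, -2.0]) + 0.5 + 0.1 * rng.normal(size=N)

A = np.hstack([X, np.ones((N, 1))])               # each row is [x_n^T, 1]
theta, *_ = np.linalg.lstsq(A, y, rcond=None)     # minimizes ||A theta - y||^2
J = np.mean((A @ theta - y) ** 2)                 # training loss (1/N)||A theta - y||^2
print(theta, J)
```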

Training Loss for Logistic Regression

$$\begin{aligned}
J(\boldsymbol{\theta}) &= \sum_{n=1}^{N} L\big(h_{\boldsymbol{\theta}}(\boldsymbol{x}_n), y_n\big) \\
&= -\sum_{n=1}^{N} \Big\{ y_n \log h_{\boldsymbol{\theta}}(\boldsymbol{x}_n) + (1-y_n)\log\big(1-h_{\boldsymbol{\theta}}(\boldsymbol{x}_n)\big) \Big\}
\end{aligned}$$

This loss is also called the cross-entropy loss. Why do we want to choose this cost function? Consider two cases:
$$y_n \log h_{\boldsymbol{\theta}}(\boldsymbol{x}_n) =
\begin{cases} 0, & \text{if } y_n = 1 \text{ and } h_{\boldsymbol{\theta}}(\boldsymbol{x}_n) = 1,\\
-\infty, & \text{if } y_n = 1 \text{ and } h_{\boldsymbol{\theta}}(\boldsymbol{x}_n) = 0,\end{cases}$$
$$(1-y_n) \log\big(1 - h_{\boldsymbol{\theta}}(\boldsymbol{x}_n)\big) =
\begin{cases} 0, & \text{if } y_n = 0 \text{ and } h_{\boldsymbol{\theta}}(\boldsymbol{x}_n) = 0,\\
-\infty, & \text{if } y_n = 0 \text{ and } h_{\boldsymbol{\theta}}(\boldsymbol{x}_n) = 1.\end{cases}$$

So a complete mismatch between the prediction and the label is penalized with an infinite loss; such a mismatch can never be a solution.
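A minimal sketch of the cross-entropy loss follows; the small eps clipping is a standard numerical safeguard I am adding (it is not in the slides), and it turns the −∞ cases above into very large finite penalties.

```python
import numpy as np

def cross_entropy(h, y, eps=1e-12):
    """-sum( y log h + (1 - y) log(1 - h) ), with clipping to avoid log(0)."""
    h = np.clip(h, eps, 1.0 - eps)
    return -np.sum(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))

y = np.array([1.0, 0.0])
print(cross_entropy(np.array([0.99, 0.01]), y))   # good predictions: small loss
print(cross_entropy(np.array([0.01, 0.99]), y))   # complete mismatch: huge loss
```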

Why Not L2 Loss?

Why not use the L2 loss?

$$J(\boldsymbol{\theta}) = \sum_{n=1}^{N} \big(h_{\boldsymbol{\theta}}(\boldsymbol{x}_n) - y_n\big)^2$$
Let's look at the 1D case:
$$J(\theta) = \left(\frac{1}{1+e^{-\theta x}} - y\right)^2.$$
This is NOT convex! How about the logistic loss?

$$J(\theta) = y \log\left(\frac{1}{1+e^{-\theta x}}\right) + (1-y)\log\left(1 - \frac{1}{1+e^{-\theta x}}\right).$$
This is concave, so its negative (the logistic loss) is convex!

Experiment: Set x = 1 and y = 1. Plot J(θ) as a function of θ.

[Figure: J(θ) versus θ for −5 ≤ θ ≤ 5; left panel: L2 loss, right panel: logistic objective.]

So the L2 loss is not convex, but the plotted logistic objective is concave (its negative, the logistic loss, is convex). If you run gradient descent on the L2 loss, you can get trapped at a local minimum.
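The experiment can be reproduced with a few lines; the plotting details below are my own choices.

```python
import numpy as np
import matplotlib.pyplot as plt

theta = np.linspace(-5, 5, 401)
h = 1.0 / (1.0 + np.exp(-theta * 1.0))            # h_theta(x) with x = 1

J_l2 = (h - 1.0) ** 2                             # L2 loss with y = 1: not convex
J_logistic = np.log(h)                            # y log h + (1-y) log(1-h) with y = 1: concave

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.plot(theta, J_l2)
ax1.set_title("L2")
ax2.plot(theta, J_logistic)
ax2.set_title("Logistic")
plt.show()
```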

Outline

Discriminative Approaches
- Lecture 14: Logistic Regression 1
- Lecture 15: Logistic Regression 2

This lecture: Logistic Regression 1
- From Linear to Logistic: Motivation, Loss Function, Why not L2 Loss?
- Interpreting Logistic: Maximum Likelihood, Log-odds
- Convexity: Is the logistic loss convex? Computation

The Maximum-Likelihood Perspective

We can show that

$$\begin{aligned}
\operatorname*{argmin}_{\boldsymbol{\theta}}\; J(\boldsymbol{\theta})
&= \operatorname*{argmin}_{\boldsymbol{\theta}}\; -\sum_{n=1}^{N}\Big\{ y_n \log h_{\boldsymbol{\theta}}(\boldsymbol{x}_n) + (1-y_n)\log\big(1-h_{\boldsymbol{\theta}}(\boldsymbol{x}_n)\big) \Big\} \\
&= \operatorname*{argmin}_{\boldsymbol{\theta}}\; -\log\left( \prod_{n=1}^{N} h_{\boldsymbol{\theta}}(\boldsymbol{x}_n)^{y_n} \big(1-h_{\boldsymbol{\theta}}(\boldsymbol{x}_n)\big)^{1-y_n} \right) \\
&= \operatorname*{argmax}_{\boldsymbol{\theta}}\; \prod_{n=1}^{N} h_{\boldsymbol{\theta}}(\boldsymbol{x}_n)^{y_n} \big(1-h_{\boldsymbol{\theta}}(\boldsymbol{x}_n)\big)^{1-y_n}.
\end{aligned}$$

This is maximum-likelihood for a Bernoulli yn

The underlying probability is hθ(xn)
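A quick numerical check that the cross-entropy loss is exactly the negative log of the Bernoulli likelihood; the values of h and y below are arbitrary illustrations.

```python
import numpy as np

h = np.array([0.9, 0.2, 0.7, 0.4])                # h_theta(x_n)
y = np.array([1.0, 0.0, 1.0, 1.0])                # labels y_n

cross_entropy = -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))
neg_log_likelihood = -np.log(np.prod(h ** y * (1 - h) ** (1 - y)))
print(cross_entropy, neg_log_likelihood)          # identical up to floating-point error
```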

Interpreting h(x_n)

Maximum-likelihood Bernoulli:

$$\boldsymbol{\theta}^* = \operatorname*{argmax}_{\boldsymbol{\theta}} \prod_{n=1}^{N} h_{\boldsymbol{\theta}}(\boldsymbol{x}_n)^{y_n} \big(1-h_{\boldsymbol{\theta}}(\boldsymbol{x}_n)\big)^{1-y_n}.$$

We can interpret hθ(xn) as a probability p. So:

hθ(xn) = p, and 1 − hθ(xn) = 1 − p.

But p is a function of xn. So how about

hθ(xn) = p(xn), and 1 − hθ(xn) = 1 − p(xn).

And this probability is “after” you see xn. So how about

hθ(xn) = p(1 | xn), and 1 − hθ(xn) = 1 − p(1 | xn) = p(0 | xn).

So hθ(xn) is the posterior probability of the class label after observing xn.

Log-Odds

Let us rewrite J as
$$\begin{aligned}
J(\boldsymbol{\theta}) &= -\sum_{n=1}^{N}\Big\{ y_n \log h_{\boldsymbol{\theta}}(\boldsymbol{x}_n) + (1-y_n)\log\big(1-h_{\boldsymbol{\theta}}(\boldsymbol{x}_n)\big) \Big\} \\
&= -\sum_{n=1}^{N}\left\{ y_n \log\left(\frac{h_{\boldsymbol{\theta}}(\boldsymbol{x}_n)}{1-h_{\boldsymbol{\theta}}(\boldsymbol{x}_n)}\right) + \log\big(1-h_{\boldsymbol{\theta}}(\boldsymbol{x}_n)\big) \right\}.
\end{aligned}$$
The term $\log\left(\frac{h_{\boldsymbol{\theta}}(\boldsymbol{x}_n)}{1-h_{\boldsymbol{\theta}}(\boldsymbol{x}_n)}\right)$ is called the log-odds. If we put $h_{\boldsymbol{\theta}}(\boldsymbol{x}) = \frac{1}{1+e^{-\boldsymbol{\theta}^T\boldsymbol{x}}}$, we can show that

 1   x  −θT x  T  hθ( ) 1+e θ x T x log = log  T x  = log e = θ . 1 − hθ(x) e−θ 1+e−θT x Logistic regression is linear in the log-odd. c Stanley Chan 2020. All Rights Reserved. 18 / 25 Outline

Outline

Discriminative Approaches
- Lecture 14: Logistic Regression 1
- Lecture 15: Logistic Regression 2

This lecture: Logistic Regression 1
- From Linear to Logistic: Motivation, Loss Function, Why not L2 Loss?
- Interpreting Logistic: Maximum Likelihood, Log-odds
- Convexity: Is the logistic loss convex? Computation

Convexity of Logistic Training Loss

Recall that
$$J(\boldsymbol{\theta}) = -\sum_{n=1}^{N}\left\{ y_n \log\left(\frac{h_{\boldsymbol{\theta}}(\boldsymbol{x}_n)}{1-h_{\boldsymbol{\theta}}(\boldsymbol{x}_n)}\right) + \log\big(1-h_{\boldsymbol{\theta}}(\boldsymbol{x}_n)\big) \right\}.$$
The first term is linear in θ (it equals $-y_n\boldsymbol{\theta}^T\boldsymbol{x}_n$), so it is convex. For the second term, the gradient is
$$\begin{aligned}
\nabla_{\boldsymbol{\theta}}\big[-\log(1-h_{\boldsymbol{\theta}}(\boldsymbol{x}))\big]
&= -\nabla_{\boldsymbol{\theta}}\log\left(1-\frac{1}{1+e^{-\boldsymbol{\theta}^T\boldsymbol{x}}}\right)
= -\nabla_{\boldsymbol{\theta}}\log\left(\frac{e^{-\boldsymbol{\theta}^T\boldsymbol{x}}}{1+e^{-\boldsymbol{\theta}^T\boldsymbol{x}}}\right) \\
&= -\nabla_{\boldsymbol{\theta}}\Big[\log e^{-\boldsymbol{\theta}^T\boldsymbol{x}} - \log\big(1+e^{-\boldsymbol{\theta}^T\boldsymbol{x}}\big)\Big]
= -\nabla_{\boldsymbol{\theta}}\Big[-\boldsymbol{\theta}^T\boldsymbol{x} - \log\big(1+e^{-\boldsymbol{\theta}^T\boldsymbol{x}}\big)\Big] \\
&= \boldsymbol{x} + \nabla_{\boldsymbol{\theta}}\log\big(1+e^{-\boldsymbol{\theta}^T\boldsymbol{x}}\big)
= \boldsymbol{x} + \frac{-e^{-\boldsymbol{\theta}^T\boldsymbol{x}}}{1+e^{-\boldsymbol{\theta}^T\boldsymbol{x}}}\,\boldsymbol{x}
= h_{\boldsymbol{\theta}}(\boldsymbol{x})\,\boldsymbol{x}.
\end{aligned}$$
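The closed-form gradient can be checked against a finite difference, as in the sketch below (random θ and x, arbitrary step size).

```python
import numpy as np

def second_term(theta, x):
    """-log(1 - h_theta(x)) with h_theta(x) = 1 / (1 + exp(-theta^T x))."""
    return -np.log(1.0 - 1.0 / (1.0 + np.exp(-theta @ x)))

rng = np.random.default_rng(2)
theta, x, eps = rng.normal(size=3), rng.normal(size=3), 1e-6
h = 1.0 / (1.0 + np.exp(-theta @ x))
analytic = h * x                                  # claimed gradient h_theta(x) x
numeric = np.array([
    (second_term(theta + eps * e, x) - second_term(theta - eps * e, x)) / (2 * eps)
    for e in np.eye(3)
])
print(analytic, numeric)                          # should agree to ~1e-9
```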

Convexity of Logistic Training Loss

Gradient of second term is

$$\nabla_{\boldsymbol{\theta}}\big[-\log(1-h_{\boldsymbol{\theta}}(\boldsymbol{x}))\big] = h_{\boldsymbol{\theta}}(\boldsymbol{x})\,\boldsymbol{x}.$$
The Hessian is
$$\begin{aligned}
\nabla^2_{\boldsymbol{\theta}}\big[-\log(1-h_{\boldsymbol{\theta}}(\boldsymbol{x}))\big]
&= \nabla_{\boldsymbol{\theta}}\big[h_{\boldsymbol{\theta}}(\boldsymbol{x})\,\boldsymbol{x}\big]
= \nabla_{\boldsymbol{\theta}}\left[\frac{1}{1+e^{-\boldsymbol{\theta}^T\boldsymbol{x}}}\right]\boldsymbol{x}^T \\
&= \frac{e^{-\boldsymbol{\theta}^T\boldsymbol{x}}}{\big(1+e^{-\boldsymbol{\theta}^T\boldsymbol{x}}\big)^2}\,\boldsymbol{x}\boldsymbol{x}^T
= \left(1-\frac{1}{1+e^{-\boldsymbol{\theta}^T\boldsymbol{x}}}\right)\left(\frac{1}{1+e^{-\boldsymbol{\theta}^T\boldsymbol{x}}}\right)\boldsymbol{x}\boldsymbol{x}^T \\
&= h_{\boldsymbol{\theta}}(\boldsymbol{x})\big[1-h_{\boldsymbol{\theta}}(\boldsymbol{x})\big]\,\boldsymbol{x}\boldsymbol{x}^T.
\end{aligned}$$
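Similarly, the Hessian formula can be verified by differencing the gradient; again the θ and x below are arbitrary.

```python
import numpy as np

def grad_second_term(theta, x):
    """Closed-form gradient of -log(1 - h_theta(x)), namely h_theta(x) * x."""
    return x / (1.0 + np.exp(-theta @ x))

rng = np.random.default_rng(3)
theta, x, eps = rng.normal(size=3), rng.normal(size=3), 1e-6
h = 1.0 / (1.0 + np.exp(-theta @ x))
analytic = h * (1 - h) * np.outer(x, x)           # claimed Hessian h(1-h) x x^T
numeric = np.column_stack([
    (grad_second_term(theta + eps * e, x) - grad_second_term(theta - eps * e, x)) / (2 * eps)
    for e in np.eye(3)
])
print(np.max(np.abs(analytic - numeric)))         # should be tiny
```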

Convexity of Logistic Training Loss

For any $\boldsymbol{v} \in \mathbb{R}^d$, we have
$$\boldsymbol{v}^T \nabla^2_{\boldsymbol{\theta}}\big[-\log(1-h_{\boldsymbol{\theta}}(\boldsymbol{x}))\big]\, \boldsymbol{v}
= h_{\boldsymbol{\theta}}(\boldsymbol{x})\big[1-h_{\boldsymbol{\theta}}(\boldsymbol{x})\big]\, \boldsymbol{v}^T\boldsymbol{x}\boldsymbol{x}^T\boldsymbol{v}
= h_{\boldsymbol{\theta}}(\boldsymbol{x})\big[1-h_{\boldsymbol{\theta}}(\boldsymbol{x})\big]\, (\boldsymbol{v}^T\boldsymbol{x})^2 \ge 0.$$

Therefore the Hessian is positive semi-definite.

So $-\log(1-h_{\boldsymbol{\theta}}(\boldsymbol{x}))$ is convex in θ.

Conclusion: the training loss function
$$J(\boldsymbol{\theta}) = -\sum_{n=1}^{N}\left\{ y_n \log\left(\frac{h_{\boldsymbol{\theta}}(\boldsymbol{x}_n)}{1-h_{\boldsymbol{\theta}}(\boldsymbol{x}_n)}\right) + \log\big(1-h_{\boldsymbol{\theta}}(\boldsymbol{x}_n)\big) \right\}$$
is convex in θ, so we can use convex optimization algorithms to find θ.
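Since the linear first term contributes nothing to the Hessian, the Hessian of J is the sum of the per-sample Hessians h(1 − h)xxᵀ. The sketch below (synthetic data) checks that its eigenvalues are nonnegative.

```python
import numpy as np

rng = np.random.default_rng(4)
N, d = 100, 3
X = rng.normal(size=(N, d))
theta = rng.normal(size=d)

h = 1.0 / (1.0 + np.exp(-X @ theta))              # h_theta(x_n) for every sample
H = (X * (h * (1 - h))[:, None]).T @ X            # sum_n h_n (1 - h_n) x_n x_n^T
print(np.linalg.eigvalsh(H))                      # all eigenvalues >= 0 (up to rounding)
```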

Convex Optimization for Logistic Regression

We can use CVX to solve the logistic regression problem, but it requires some re-organization of the equations:
$$\begin{aligned}
J(\boldsymbol{\theta}) &= -\sum_{n=1}^{N}\Big\{ y_n\boldsymbol{\theta}^T\boldsymbol{x}_n + \log\big(1-h_{\boldsymbol{\theta}}(\boldsymbol{x}_n)\big) \Big\} \\
&= -\sum_{n=1}^{N}\left\{ y_n\boldsymbol{\theta}^T\boldsymbol{x}_n + \log\left(1-\frac{e^{\boldsymbol{\theta}^T\boldsymbol{x}_n}}{1+e^{\boldsymbol{\theta}^T\boldsymbol{x}_n}}\right) \right\} \\
&= -\sum_{n=1}^{N}\Big\{ y_n\boldsymbol{\theta}^T\boldsymbol{x}_n - \log\big(1+e^{\boldsymbol{\theta}^T\boldsymbol{x}_n}\big) \Big\} \\
&= -\left(\sum_{n=1}^{N} y_n\boldsymbol{x}_n\right)^T\boldsymbol{\theta} + \sum_{n=1}^{N}\log\big(1+e^{\boldsymbol{\theta}^T\boldsymbol{x}_n}\big).
\end{aligned}$$

The last term is a sum of log-sum-exp functions: $\log\big(e^{0} + e^{\boldsymbol{\theta}^T\boldsymbol{x}_n}\big)$.
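As a sketch of how this could be solved in software, the following uses the CVXPY Python package (the slides refer to CVX; the synthetic data and the choice of the cp.logistic atom, which evaluates log(1 + e^z), are my own).

```python
import cvxpy as cp
import numpy as np

# Synthetic data for illustration only.
rng = np.random.default_rng(5)
N, d = 200, 2
X = rng.normal(size=(N, d))
theta_true = np.array([2.0, -1.0])
y = (rng.uniform(size=N) < 1.0 / (1.0 + np.exp(-X @ theta_true))).astype(float)

# Reorganized objective: -(sum_n y_n x_n)^T theta + sum_n log(1 + exp(theta^T x_n)).
theta = cp.Variable(d)
objective = -(y @ X) @ theta + cp.sum(cp.logistic(X @ theta))
problem = cp.Problem(cp.Minimize(objective))
problem.solve()
print(theta.value)
```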

[Figure: “Estimated” versus “True” curves plotted over 0 ≤ x ≤ 10, with values between 0 and 1.]

Reading List

Logistic Regression (Machine Learning Perspective)
- Chris Bishop, Pattern Recognition and Machine Learning, Chapter 4.3
- Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning, Chapter 4.4
- Stanford CS 229, Discriminant Algorithms: http://cs229.stanford.edu/notes/cs229-notes1.pdf
- CMU Lecture: https://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch12.pdf
- Stanford, Speech and Language Processing: https://web.stanford.edu/~jurafsky/slp3/ (Chapter 5)

Logistic Regression (Statistics Perspective)
- Duke Lecture: https://www2.stat.duke.edu/courses/Spring13/sta102.001/Lec/Lec20.pdf
- Princeton Lecture: https://data.princeton.edu/wws509/notes/c3.pdf