
ECE595 / STAT598: Machine Learning I, Lecture 14: Logistic Regression 1

Spring 2020

Stanley Chan

School of Electrical and Computer Engineering Purdue University

Overview

In linear discriminant analysis (LDA), there are generally two types of approaches:
- Generative approach: estimate the model, then define the classifier.
- Discriminative approach: directly define the classifier.

Outline

Discriminative Approaches
- Lecture 14: Logistic Regression 1
- Lecture 15: Logistic Regression 2

This lecture: Logistic Regression 1
- From Linear to Logistic: Motivation, Loss Function, Why not L2 Loss?
- Interpreting Logistic: Maximum Likelihood, Log-odds
- Convexity: Is the logistic loss convex? Computation

Geometry of Linear Classifiers

- The discriminant function g(x) is linear.
- The hypothesis function h(x) = sign(g(x)) is a unit step.

From Linear to Logistic Regression

Can we replace the hard decision sign(g(x)) with a soft version of sign(g(x))? This gives logistic regression.


The function
$$h(\boldsymbol{x}) = \frac{1}{1+e^{-g(\boldsymbol{x})}} = \frac{1}{1+e^{-(\boldsymbol{w}^T\boldsymbol{x}+w_0)}}$$
is called a sigmoid function. Its 1D form is
$$h(x) = \frac{1}{1+e^{-a(x-x_0)}}, \quad \text{for some } a \text{ and } x_0,$$
where
- a controls the transient speed,
- x_0 controls the cutoff location.
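To make the two parameters concrete, here is a minimal NumPy sketch of the 1D sigmoid; the function name and the sample values of a and x_0 are illustrative choices, not from the slides.

```python
import numpy as np

def sigmoid(x, a=1.0, x0=0.0):
    """1D sigmoid h(x) = 1 / (1 + exp(-a * (x - x0)))."""
    return 1.0 / (1.0 + np.exp(-a * (x - x0)))

# a controls how sharp the transition is; x0 shifts where the cutoff occurs.
x = np.linspace(-5.0, 5.0, 11)
print(sigmoid(x, a=2.0, x0=1.0))
```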

Sigmoid Function

Note that
$$h(x) \to 1 \text{ as } x \to \infty, \qquad h(x) \to 0 \text{ as } x \to -\infty.$$

So h(x) can be regarded as a “probability”.

Sigmoid Function

The derivative is
$$\begin{aligned}
\frac{d}{dx}\left[\frac{1}{1+e^{-a(x-x_0)}}\right]
&= -\left(1+e^{-a(x-x_0)}\right)^{-2} e^{-a(x-x_0)} (-a) \\
&= a\left(\frac{e^{-a(x-x_0)}}{1+e^{-a(x-x_0)}}\right)\left(\frac{1}{1+e^{-a(x-x_0)}}\right) \\
&= a\left(1-\frac{1}{1+e^{-a(x-x_0)}}\right)\left(\frac{1}{1+e^{-a(x-x_0)}}\right) \\
&= a\,[1-h(x)]\,[h(x)].
\end{aligned}$$

Since 0 < h(x) < 1, we have 0 < 1 − h(x) < 1. Therefore the derivative is always positive, so h is an increasing function. Hence h can be regarded as a “CDF”.
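As a quick sanity check of the identity h'(x) = a[1 − h(x)]h(x), the sketch below compares it with a centered finite difference; the particular values of a, x_0, and x are arbitrary.

```python
import numpy as np

def sigmoid(x, a=1.0, x0=0.0):
    return 1.0 / (1.0 + np.exp(-a * (x - x0)))

a, x0, x, eps = 2.0, 1.0, 0.3, 1e-6
h = sigmoid(x, a, x0)
analytic = a * (1.0 - h) * h                      # closed-form derivative
numeric = (sigmoid(x + eps, a, x0) - sigmoid(x - eps, a, x0)) / (2 * eps)
print(analytic, numeric)                          # the two values should match closely
```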

[Figure: plot of the sigmoid function. Image source: http://georgepavlides.info/wp-content/uploads/2018/02/logistic-binary-e1517639495140.jpg]

From Linear to Logistic Regression

Can we replace the hard decision sign(g(x)) with a soft version of sign(g(x))? This gives logistic regression.

Loss Function for Linear Regression

All discriminative algorithms have a training loss function
$$J(\boldsymbol{\theta}) = \frac{1}{N}\sum_{n=1}^{N} L(g(\boldsymbol{x}_n), y_n).$$
In linear regression,
$$\begin{aligned}
J(\boldsymbol{\theta}) &= \frac{1}{N}\sum_{n=1}^{N} \big(g(\boldsymbol{x}_n) - y_n\big)^2
= \frac{1}{N}\sum_{n=1}^{N} \big(\boldsymbol{w}^T\boldsymbol{x}_n + w_0 - y_n\big)^2 \\
&= \frac{1}{N}\left\|
\begin{bmatrix} \boldsymbol{x}_1^T & 1 \\ \vdots & \vdots \\ \boldsymbol{x}_N^T & 1 \end{bmatrix}
\begin{bmatrix} \boldsymbol{w} \\ w_0 \end{bmatrix}
- \begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix}
\right\|^2
= \frac{1}{N}\|\boldsymbol{A}\boldsymbol{\theta} - \boldsymbol{y}\|^2.
\end{aligned}$$
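For concreteness, here is a small sketch (with synthetic data, not from the slides) of how the matrix A is assembled and how the least-squares θ = [w; w_0] is obtained.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 50, 2
X = rng.normal(size=(N, d))                       # rows are x_n^T
y = X @ np.array([1.5, -2.0]) + 0.5 + 0.1 * rng.normal(size=N)

A = np.hstack([X, np.ones((N, 1))])               # each row is [x_n^T, 1]
theta, *_ = np.linalg.lstsq(A, y, rcond=None)     # minimizes ||A theta - y||^2
J = np.mean((A @ theta - y) ** 2)                 # training loss (1/N)||A theta - y||^2
print(theta, J)
```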

Training Loss for Logistic Regression

$$\begin{aligned}
J(\boldsymbol{\theta}) &= \sum_{n=1}^{N} L\big(h_{\boldsymbol{\theta}}(\boldsymbol{x}_n), y_n\big) \\
&= -\sum_{n=1}^{N} \Big\{ y_n \log h_{\boldsymbol{\theta}}(\boldsymbol{x}_n) + (1-y_n)\log\big(1-h_{\boldsymbol{\theta}}(\boldsymbol{x}_n)\big) \Big\}
\end{aligned}$$

This loss is also called the cross-entropy loss. Why do we want to choose this cost function? Consider two cases:
$$y_n \log h_{\boldsymbol{\theta}}(\boldsymbol{x}_n) =
\begin{cases} 0, & \text{if } y_n = 1 \text{ and } h_{\boldsymbol{\theta}}(\boldsymbol{x}_n) = 1,\\
-\infty, & \text{if } y_n = 1 \text{ and } h_{\boldsymbol{\theta}}(\boldsymbol{x}_n) = 0,\end{cases}$$
$$(1-y_n) \log\big(1 - h_{\boldsymbol{\theta}}(\boldsymbol{x}_n)\big) =
\begin{cases} 0, & \text{if } y_n = 0 \text{ and } h_{\boldsymbol{\theta}}(\boldsymbol{x}_n) = 0,\\
-\infty, & \text{if } y_n = 0 \text{ and } h_{\boldsymbol{\theta}}(\boldsymbol{x}_n) = 1.\end{cases}$$

So a complete mismatch between the prediction and the label is penalized with an infinite loss; such a mismatch can never be a solution.
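A minimal sketch of the cross-entropy loss follows; the small eps clipping is a standard numerical safeguard I am adding (it is not in the slides), and it turns the −∞ cases above into very large finite penalties.

```python
import numpy as np

def cross_entropy(h, y, eps=1e-12):
    """-sum( y log h + (1 - y) log(1 - h) ), with clipping to avoid log(0)."""
    h = np.clip(h, eps, 1.0 - eps)
    return -np.sum(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))

y = np.array([1.0, 0.0])
print(cross_entropy(np.array([0.99, 0.01]), y))   # good predictions: small loss
print(cross_entropy(np.array([0.01, 0.99]), y))   # complete mismatch: huge loss
```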

Why Not L2 Loss?

Why not use the L2 loss?

$$J(\boldsymbol{\theta}) = \sum_{n=1}^{N} \big(h_{\boldsymbol{\theta}}(\boldsymbol{x}_n) - y_n\big)^2$$
Let's look at the 1D case:
$$J(\theta) = \left(\frac{1}{1+e^{-\theta x}} - y\right)^2.$$
This is NOT convex! How about the logistic loss?

$$J(\theta) = y \log\left(\frac{1}{1+e^{-\theta x}}\right) + (1-y)\log\left(1 - \frac{1}{1+e^{-\theta x}}\right).$$
This is concave, so its negative (the logistic loss) is convex!

Experiment: Set x = 1 and y = 1. Plot J(θ) as a function of θ.

[Figure: J(θ) versus θ for −5 ≤ θ ≤ 5; left panel: L2 loss, right panel: logistic objective.]

So the L2 loss is not convex, but the plotted logistic objective is concave (its negative, the logistic loss, is convex). If you run gradient descent on the L2 loss, you can get trapped at a local minimum.
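The experiment can be reproduced with a few lines; the plotting details below are my own choices.

```python
import numpy as np
import matplotlib.pyplot as plt

theta = np.linspace(-5, 5, 401)
h = 1.0 / (1.0 + np.exp(-theta * 1.0))            # h_theta(x) with x = 1

J_l2 = (h - 1.0) ** 2                             # L2 loss with y = 1: not convex
J_logistic = np.log(h)                            # y log h + (1-y) log(1-h) with y = 1: concave

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.plot(theta, J_l2)
ax1.set_title("L2")
ax2.plot(theta, J_logistic)
ax2.set_title("Logistic")
plt.show()
```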

Outline

Discriminative Approaches
- Lecture 14: Logistic Regression 1
- Lecture 15: Logistic Regression 2

This lecture: Logistic Regression 1
- From Linear to Logistic: Motivation, Loss Function, Why not L2 Loss?
- Interpreting Logistic: Maximum Likelihood, Log-odds
- Convexity: Is the logistic loss convex? Computation

The Maximum-Likelihood Perspective

We can show that

$$\begin{aligned}
\operatorname*{argmin}_{\boldsymbol{\theta}}\; J(\boldsymbol{\theta})
&= \operatorname*{argmin}_{\boldsymbol{\theta}}\; -\sum_{n=1}^{N}\Big\{ y_n \log h_{\boldsymbol{\theta}}(\boldsymbol{x}_n) + (1-y_n)\log\big(1-h_{\boldsymbol{\theta}}(\boldsymbol{x}_n)\big) \Big\} \\
&= \operatorname*{argmin}_{\boldsymbol{\theta}}\; -\log\left( \prod_{n=1}^{N} h_{\boldsymbol{\theta}}(\boldsymbol{x}_n)^{y_n} \big(1-h_{\boldsymbol{\theta}}(\boldsymbol{x}_n)\big)^{1-y_n} \right) \\
&= \operatorname*{argmax}_{\boldsymbol{\theta}}\; \prod_{n=1}^{N} h_{\boldsymbol{\theta}}(\boldsymbol{x}_n)^{y_n} \big(1-h_{\boldsymbol{\theta}}(\boldsymbol{x}_n)\big)^{1-y_n}.
\end{aligned}$$

This is maximum-likelihood for a Bernoulli yn

The underlying probability is hθ(xn)
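A quick numerical check that the cross-entropy loss is exactly the negative log of the Bernoulli likelihood; the values of h and y below are arbitrary illustrations.

```python
import numpy as np

h = np.array([0.9, 0.2, 0.7, 0.4])                # h_theta(x_n)
y = np.array([1.0, 0.0, 1.0, 1.0])                # labels y_n

cross_entropy = -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))
neg_log_likelihood = -np.log(np.prod(h ** y * (1 - h) ** (1 - y)))
print(cross_entropy, neg_log_likelihood)          # identical up to floating-point error
```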

Interpreting h(x_n)

Maximum-likelihood Bernoulli:

$$\boldsymbol{\theta}^* = \operatorname*{argmax}_{\boldsymbol{\theta}} \prod_{n=1}^{N} h_{\boldsymbol{\theta}}(\boldsymbol{x}_n)^{y_n} \big(1-h_{\boldsymbol{\theta}}(\boldsymbol{x}_n)\big)^{1-y_n}.$$

We can interpret hθ(xn) as a probability p. So:

hθ(xn) = p, and 1 − hθ(xn) = 1 − p.

But p is a function of xn. So how about

hθ(xn) = p(xn), and 1 − hθ(xn) = 1 − p(xn).

And this probability is “after” you see xn. So how about

hθ(xn) = p(1 | xn), and 1 − hθ(xn) = 1 − p(1 | xn) = p(0 | xn).

So hθ(xn) is the posterior probability of the class label after observing xn.

Log-Odds

Let us rewrite J as
$$\begin{aligned}
J(\boldsymbol{\theta}) &= -\sum_{n=1}^{N}\Big\{ y_n \log h_{\boldsymbol{\theta}}(\boldsymbol{x}_n) + (1-y_n)\log\big(1-h_{\boldsymbol{\theta}}(\boldsymbol{x}_n)\big) \Big\} \\
&= -\sum_{n=1}^{N}\left\{ y_n \log\left(\frac{h_{\boldsymbol{\theta}}(\boldsymbol{x}_n)}{1-h_{\boldsymbol{\theta}}(\boldsymbol{x}_n)}\right) + \log\big(1-h_{\boldsymbol{\theta}}(\boldsymbol{x}_n)\big) \right\}.
\end{aligned}$$
The term $\log\left(\frac{h_{\boldsymbol{\theta}}(\boldsymbol{x}_n)}{1-h_{\boldsymbol{\theta}}(\boldsymbol{x}_n)}\right)$ is called the log-odds. If we put $h_{\boldsymbol{\theta}}(\boldsymbol{x}) = \frac{1}{1+e^{-\boldsymbol{\theta}^T\boldsymbol{x}}}$, we can show that

 1   x  −θT x  T  hθ( ) 1+e θ x T x log = log  T x  = log e = θ . 1 − hθ(x) e−θ 1+e−θT x Logistic regression is linear in the log-odd. c Stanley Chan 2020. All Rights Reserved. 18 / 25 Outline

Outline

Discriminative Approaches
- Lecture 14: Logistic Regression 1
- Lecture 15: Logistic Regression 2

This lecture: Logistic Regression 1
- From Linear to Logistic: Motivation, Loss Function, Why not L2 Loss?
- Interpreting Logistic: Maximum Likelihood, Log-odds
- Convexity: Is the logistic loss convex? Computation

Convexity of Logistic Training Loss

Recall that
$$J(\boldsymbol{\theta}) = -\sum_{n=1}^{N}\left\{ y_n \log\left(\frac{h_{\boldsymbol{\theta}}(\boldsymbol{x}_n)}{1-h_{\boldsymbol{\theta}}(\boldsymbol{x}_n)}\right) + \log\big(1-h_{\boldsymbol{\theta}}(\boldsymbol{x}_n)\big) \right\}.$$
The first term is linear in θ (it equals $-y_n\boldsymbol{\theta}^T\boldsymbol{x}_n$), so it is convex. For the second term, the gradient is
$$\begin{aligned}
\nabla_{\boldsymbol{\theta}}\big[-\log(1-h_{\boldsymbol{\theta}}(\boldsymbol{x}))\big]
&= -\nabla_{\boldsymbol{\theta}}\log\left(1-\frac{1}{1+e^{-\boldsymbol{\theta}^T\boldsymbol{x}}}\right)
= -\nabla_{\boldsymbol{\theta}}\log\left(\frac{e^{-\boldsymbol{\theta}^T\boldsymbol{x}}}{1+e^{-\boldsymbol{\theta}^T\boldsymbol{x}}}\right) \\
&= -\nabla_{\boldsymbol{\theta}}\Big[\log e^{-\boldsymbol{\theta}^T\boldsymbol{x}} - \log\big(1+e^{-\boldsymbol{\theta}^T\boldsymbol{x}}\big)\Big]
= -\nabla_{\boldsymbol{\theta}}\Big[-\boldsymbol{\theta}^T\boldsymbol{x} - \log\big(1+e^{-\boldsymbol{\theta}^T\boldsymbol{x}}\big)\Big] \\
&= \boldsymbol{x} + \nabla_{\boldsymbol{\theta}}\log\big(1+e^{-\boldsymbol{\theta}^T\boldsymbol{x}}\big)
= \boldsymbol{x} + \frac{-e^{-\boldsymbol{\theta}^T\boldsymbol{x}}}{1+e^{-\boldsymbol{\theta}^T\boldsymbol{x}}}\,\boldsymbol{x}
= h_{\boldsymbol{\theta}}(\boldsymbol{x})\,\boldsymbol{x}.
\end{aligned}$$
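The closed-form gradient can be checked against a finite difference, as in the sketch below (random θ and x, arbitrary step size).

```python
import numpy as np

def second_term(theta, x):
    """-log(1 - h_theta(x)) with h_theta(x) = 1 / (1 + exp(-theta^T x))."""
    return -np.log(1.0 - 1.0 / (1.0 + np.exp(-theta @ x)))

rng = np.random.default_rng(2)
theta, x, eps = rng.normal(size=3), rng.normal(size=3), 1e-6
h = 1.0 / (1.0 + np.exp(-theta @ x))
analytic = h * x                                  # claimed gradient h_theta(x) x
numeric = np.array([
    (second_term(theta + eps * e, x) - second_term(theta - eps * e, x)) / (2 * eps)
    for e in np.eye(3)
])
print(analytic, numeric)                          # should agree to ~1e-9
```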

Convexity of Logistic Training Loss

Gradient of second term is

$$\nabla_{\boldsymbol{\theta}}\big[-\log(1-h_{\boldsymbol{\theta}}(\boldsymbol{x}))\big] = h_{\boldsymbol{\theta}}(\boldsymbol{x})\,\boldsymbol{x}.$$
The Hessian is
$$\begin{aligned}
\nabla^2_{\boldsymbol{\theta}}\big[-\log(1-h_{\boldsymbol{\theta}}(\boldsymbol{x}))\big]
&= \nabla_{\boldsymbol{\theta}}\big[h_{\boldsymbol{\theta}}(\boldsymbol{x})\,\boldsymbol{x}\big]
= \nabla_{\boldsymbol{\theta}}\left[\frac{1}{1+e^{-\boldsymbol{\theta}^T\boldsymbol{x}}}\right]\boldsymbol{x}^T \\
&= \frac{e^{-\boldsymbol{\theta}^T\boldsymbol{x}}}{\big(1+e^{-\boldsymbol{\theta}^T\boldsymbol{x}}\big)^2}\,\boldsymbol{x}\boldsymbol{x}^T
= \left(1-\frac{1}{1+e^{-\boldsymbol{\theta}^T\boldsymbol{x}}}\right)\left(\frac{1}{1+e^{-\boldsymbol{\theta}^T\boldsymbol{x}}}\right)\boldsymbol{x}\boldsymbol{x}^T \\
&= h_{\boldsymbol{\theta}}(\boldsymbol{x})\big[1-h_{\boldsymbol{\theta}}(\boldsymbol{x})\big]\,\boldsymbol{x}\boldsymbol{x}^T.
\end{aligned}$$
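Similarly, the Hessian formula can be verified by differencing the gradient; again the θ and x below are arbitrary.

```python
import numpy as np

def grad_second_term(theta, x):
    """Closed-form gradient of -log(1 - h_theta(x)), namely h_theta(x) * x."""
    return x / (1.0 + np.exp(-theta @ x))

rng = np.random.default_rng(3)
theta, x, eps = rng.normal(size=3), rng.normal(size=3), 1e-6
h = 1.0 / (1.0 + np.exp(-theta @ x))
analytic = h * (1 - h) * np.outer(x, x)           # claimed Hessian h(1-h) x x^T
numeric = np.column_stack([
    (grad_second_term(theta + eps * e, x) - grad_second_term(theta - eps * e, x)) / (2 * eps)
    for e in np.eye(3)
])
print(np.max(np.abs(analytic - numeric)))         # should be tiny
```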

Convexity of Logistic Training Loss

For any $\boldsymbol{v} \in \mathbb{R}^d$, we have
$$\boldsymbol{v}^T \nabla^2_{\boldsymbol{\theta}}\big[-\log(1-h_{\boldsymbol{\theta}}(\boldsymbol{x}))\big]\, \boldsymbol{v}
= h_{\boldsymbol{\theta}}(\boldsymbol{x})\big[1-h_{\boldsymbol{\theta}}(\boldsymbol{x})\big]\, \boldsymbol{v}^T\boldsymbol{x}\boldsymbol{x}^T\boldsymbol{v}
= h_{\boldsymbol{\theta}}(\boldsymbol{x})\big[1-h_{\boldsymbol{\theta}}(\boldsymbol{x})\big]\, (\boldsymbol{v}^T\boldsymbol{x})^2 \ge 0.$$

Therefore the Hessian is positive semi-definite.

So $-\log(1-h_{\boldsymbol{\theta}}(\boldsymbol{x}))$ is convex in θ.

Conclusion: the training loss function
$$J(\boldsymbol{\theta}) = -\sum_{n=1}^{N}\left\{ y_n \log\left(\frac{h_{\boldsymbol{\theta}}(\boldsymbol{x}_n)}{1-h_{\boldsymbol{\theta}}(\boldsymbol{x}_n)}\right) + \log\big(1-h_{\boldsymbol{\theta}}(\boldsymbol{x}_n)\big) \right\}$$
is convex in θ, so we can use convex optimization algorithms to find θ.
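Since the linear first term contributes nothing to the Hessian, the Hessian of J is the sum of the per-sample Hessians h(1 − h)xxᵀ. The sketch below (synthetic data) checks that its eigenvalues are nonnegative.

```python
import numpy as np

rng = np.random.default_rng(4)
N, d = 100, 3
X = rng.normal(size=(N, d))
theta = rng.normal(size=d)

h = 1.0 / (1.0 + np.exp(-X @ theta))              # h_theta(x_n) for every sample
H = (X * (h * (1 - h))[:, None]).T @ X            # sum_n h_n (1 - h_n) x_n x_n^T
print(np.linalg.eigvalsh(H))                      # all eigenvalues >= 0 (up to rounding)
```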

Convex Optimization for Logistic Regression

We can use CVX to solve the logistic regression problem, but it requires some re-organization of the equations:
$$\begin{aligned}
J(\boldsymbol{\theta}) &= -\sum_{n=1}^{N}\Big\{ y_n\boldsymbol{\theta}^T\boldsymbol{x}_n + \log\big(1-h_{\boldsymbol{\theta}}(\boldsymbol{x}_n)\big) \Big\} \\
&= -\sum_{n=1}^{N}\left\{ y_n\boldsymbol{\theta}^T\boldsymbol{x}_n + \log\left(1-\frac{e^{\boldsymbol{\theta}^T\boldsymbol{x}_n}}{1+e^{\boldsymbol{\theta}^T\boldsymbol{x}_n}}\right) \right\} \\
&= -\sum_{n=1}^{N}\Big\{ y_n\boldsymbol{\theta}^T\boldsymbol{x}_n - \log\big(1+e^{\boldsymbol{\theta}^T\boldsymbol{x}_n}\big) \Big\} \\
&= -\left(\sum_{n=1}^{N} y_n\boldsymbol{x}_n\right)^T\boldsymbol{\theta} + \sum_{n=1}^{N}\log\big(1+e^{\boldsymbol{\theta}^T\boldsymbol{x}_n}\big).
\end{aligned}$$

The last term is a sum of log-sum-exp functions: $\log\big(e^{0} + e^{\boldsymbol{\theta}^T\boldsymbol{x}_n}\big)$.
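As a sketch of how this could be solved in software, the following uses the CVXPY Python package (the slides refer to CVX; the synthetic data and the choice of the cp.logistic atom, which evaluates log(1 + e^z), are my own).

```python
import cvxpy as cp
import numpy as np

# Synthetic data for illustration only.
rng = np.random.default_rng(5)
N, d = 200, 2
X = rng.normal(size=(N, d))
theta_true = np.array([2.0, -1.0])
y = (rng.uniform(size=N) < 1.0 / (1.0 + np.exp(-X @ theta_true))).astype(float)

# Reorganized objective: -(sum_n y_n x_n)^T theta + sum_n log(1 + exp(theta^T x_n)).
theta = cp.Variable(d)
objective = -(y @ X) @ theta + cp.sum(cp.logistic(X @ theta))
problem = cp.Problem(cp.Minimize(objective))
problem.solve()
print(theta.value)
```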

[Figure: “Estimated” versus “True” curves plotted over 0 ≤ x ≤ 10, with values between 0 and 1.]

Reading List

Logistic Regression (Machine Learning Perspective)
- Chris Bishop, Pattern Recognition and Machine Learning, Chapter 4.3
- Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning, Chapter 4.4
- Stanford CS 229, Discriminant Algorithms: http://cs229.stanford.edu/notes/cs229-notes1.pdf
- CMU Lecture: https://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch12.pdf
- Stanford, Speech and Language Processing: https://web.stanford.edu/~jurafsky/slp3/ (Chapter 5)

Logistic Regression (Statistics Perspective)
- Duke Lecture: https://www2.stat.duke.edu/courses/Spring13/sta102.001/Lec/Lec20.pdf
- Princeton Lecture: https://data.princeton.edu/wws509/notes/c3.pdf