
Introduction to Machine Learning
Lecture 7 (Amo G. Tong)

• Linear Classifiers
• Perceptron
• Sigmoid Unit

• Some materials are courtesy of Vibhav Gogate and Tom Mitchell.
• All pictures belong to their creators.

Linear Classifiers

• Input: a real vector $\mathbf{x} = (x_1, \dots, x_n)$ of features

• A weight vector $\boldsymbol{\omega} = (\omega_1, \dots, \omega_n)$

• A bias term: $\omega_0$

• A constant feature: $x_0 = 1$ for all instances

• Decision rule:

• $f(\mathbf{x}) = \text{positive}$ if $\sum_i \omega_i \cdot x_i > 0$

• $f(\mathbf{x}) = \text{negative}$ if $\sum_i \omega_i \cdot x_i \le 0$
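As a quick illustration, here is a minimal sketch of this decision rule in Python; the function name `linear_classify` and the NumPy dependency are my own choices, not from the slides:

```python
import numpy as np

def linear_classify(x, w):
    """Classify feature vector x with weights w = (w_0, w_1, ..., w_n)."""
    x = np.concatenate(([1.0], x))  # prepend the constant feature x_0 = 1
    return "positive" if np.dot(w, x) > 0 else "negative"

# Example: w_0 = -1, w_1 = 2 labels x_1 > 0.5 as positive.
print(linear_classify(np.array([0.7]), np.array([-1.0, 2.0])))  # -> positive
```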


• Exercise: is the classifier shown in the figure a linear classifier? [figure omitted]


• Example (from Gogate). [figure omitted]

Perceptron

$$o(x_1, \dots, x_n) = \begin{cases} 1 & \text{if } \omega_0 + \omega_1 x_1 + \cdots + \omega_n x_n > 0 \\ -1 & \text{else} \end{cases}$$

Equivalently, $o(x_1, \dots, x_n) = \mathrm{sgn}(\boldsymbol{\omega} \cdot \mathbf{x})$, where $\mathrm{sgn}(y) = \begin{cases} 1 & \text{if } y > 0 \\ -1 & \text{else} \end{cases}$

Decision Surface of Perceptron

[Figure: decision surfaces. In one dimension, the positive region is $\omega_0 + \omega_1 x_1 > 0$; in two dimensions, it is the half-plane $\omega_0 + \omega_1 x_1 + \omega_2 x_2 > 0$.]


• Representation power of the perceptron: all linearly separable data.

• Special cases: Boolean inputs, $x_i = 0$ or $1$.
• A perceptron can represent some Boolean functions, such as AND and OR.
• But not all of them; XOR is not linearly separable (see the sketch below the figure).

[Figure: AND is linearly separable; XOR is not.]
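To make the AND case concrete, the sketch below checks that the hand-picked weights $\omega_0 = -1.5$, $\omega_1 = \omega_2 = 1$ (my own illustration, not from the slides) realize AND over Boolean inputs:

```python
import numpy as np

def perceptron(x, w):
    # o(x) = 1 if w . (1, x) > 0, else -1 (constant feature prepended)
    return 1 if np.dot(w, np.concatenate(([1.0], x))) > 0 else -1

w_and = np.array([-1.5, 1.0, 1.0])  # hand-picked weights realizing AND
for x1 in (0, 1):
    for x2 in (0, 1):
        print((x1, x2), perceptron(np.array([x1, x2]), w_and))
# prints -1 for (0,0), (0,1), (1,0) and +1 for (1,1), matching AND
```

For XOR no such weights exist: the four points require $\omega_0 \le 0$, $\omega_0 + \omega_1 > 0$, $\omega_0 + \omega_2 > 0$, and $\omega_0 + \omega_1 + \omega_2 \le 0$; adding the two strict inequalities gives $2\omega_0 + \omega_1 + \omega_2 > 0$, while the other two give $2\omega_0 + \omega_1 + \omega_2 \le 0$, a contradiction.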

Training a Perceptron

• Two methods to train a perceptron:
• Method 1: the perceptron training rule.
• Method 2: LSM (least squares method) and gradient descent.

$$o(x_1, \dots, x_n) = \begin{cases} 1 & \text{if } \omega_0 + \omega_1 x_1 + \cdots + \omega_n x_n > 0 \\ -1 & \text{else} \end{cases}$$

General scheme:
    Initialize each $\omega_i$ to some small value.
    Update $\omega_i \leftarrow \omega_i + \Delta\omega_i$.
    Repeat until some criterion is met.

Example training data:

    x1  x2  x3  x4  label
    1   2   3   3   +
    2   6   3   3   -
    9   2   4   4   +
    2   7   5   3   -


• Method 1: perceptron training rule


Do until convergence:
    For each $x$ in $D$:
        For each $\omega_i$:
            $\omega_i \leftarrow \omega_i + \Delta\omega_i$, where $\Delta\omega_i = \eta (t - o) x_i$

• $t$: the target value.
• $o$: the current prediction.
• $\eta$: the learning rate, a small positive value.
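A minimal sketch of this rule in Python; the name `perceptron_train` and the fixed epoch count standing in for a real convergence check are my own choices:

```python
import numpy as np

def perceptron_train(X, t, eta=0.1, epochs=100):
    """Perceptron training rule: w_i <- w_i + eta * (t - o) * x_i."""
    X = np.hstack([np.ones((len(X), 1)), X])   # constant feature x_0 = 1
    w = np.zeros(X.shape[1])                   # initialize w_i to small values
    for _ in range(epochs):                    # stand-in for "do until converge"
        for x, target in zip(X, t):
            o = 1 if np.dot(w, x) > 0 else -1  # current prediction
            w += eta * (target - o) * x        # update every w_i at once
    return w
```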


• The perceptron training rule converges if the data is linearly separable and the learning rate is small enough.

• It may not converge if the data is not linearly separable.

• Your training data may not be linearly separable even if the underlying truth is linear, for example because of noise.
• Is there an approach that always converges? LSM and gradient descent.


• Method 2: LSM and Gradient Descent.

• Linear unit: $o_l(\mathbf{x}) = \omega_0 + \omega_1 x_1 + \cdots + \omega_n x_n$
• Define the error: $E_D(\boldsymbol{\omega}) = \frac{1}{2} \sum_{d \in D} (t_d - o_l(d))^2$
• This is the least squares method (LSM).

• Gradient descent:

Do until convergence:
    For each $\omega_i$:
        $\omega_i \leftarrow \omega_i + \Delta\omega_i$

• Gradient: $\frac{\partial E_D}{\partial \omega_i} = \sum_{d} (t_d - o_l(d)) (-x_{d,i})$
• Update: $\Delta\omega_i = -\eta \frac{\partial E_D}{\partial \omega_i}$

Do until convergence:
    For each $\omega_i$:
        $\omega_i \leftarrow \omega_i + \eta \sum_{d} (t_d - o_l(d)) \, x_{d,i}$
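In code, the batch update for the linear unit might look like the sketch below; the epoch count again stands in for a convergence check, and all names are my own:

```python
import numpy as np

def lsm_gradient_descent(X, t, eta=0.01, epochs=500):
    """Gradient descent on E_D(w) = 1/2 * sum_d (t_d - o_l(d))^2."""
    X = np.hstack([np.ones((len(X), 1)), X])  # constant feature x_0 = 1
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        o = X @ w                 # linear unit outputs o_l(d) for all d
        w += eta * X.T @ (t - o)  # w_i += eta * sum_d (t_d - o_l(d)) * x_{d,i}
    return w
```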

• Gradient descent converges to the hypothesis that minimizes the error, if the learning rate is sufficiently small:
• even if the data is not linearly separable;
• even if the data has noise.


• Batch mode: define the error for the whole training set and consider it as a whole:
  $E_D(\boldsymbol{\omega}) = \frac{1}{2} \sum_{d \in D} (t_d - o_l(d))^2$

• Incremental mode (online mode): define the error for each training instance and consider the instances one by one:
  $E_d(\boldsymbol{\omega}) = \frac{1}{2} (t_d - o_l(d))^2$

Batch mode:

Do until convergence:
    For each $\omega_i$, update: $\omega_i \leftarrow \omega_i - \eta \frac{\partial E_D}{\partial \omega_i}$

Incremental mode:

Do until convergence:
    For each $d$ in $D$:
        For each $\omega_i$, update: $\omega_i \leftarrow \omega_i - \eta \frac{\partial E_d}{\partial \omega_i}$
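The incremental variant applies the same gradient step one instance at a time; a sketch under the same assumptions as before:

```python
import numpy as np

def lsm_incremental(X, t, eta=0.001, epochs=500):
    """Online LSM: step on the per-instance error E_d, one instance at a time."""
    X = np.hstack([np.ones((len(X), 1)), X])  # constant feature x_0 = 1
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_d, t_d in zip(X, t):
            o_d = np.dot(w, x_d)          # linear unit output for instance d
            w += eta * (t_d - o_d) * x_d  # w_i -= eta * dE_d/dw_i
    return w
```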

• Batch mode vs incremental (online) mode:
• Incremental mode can approximate batch mode with arbitrarily small error.
• Incremental mode can better avoid local minima: it updates $\boldsymbol{\omega}$ more frequently, it is faster, and it deals well with dynamic datasets. The learning rate is usually smaller in incremental mode.
• Batch mode is more stable, and it is easy to check its convergence.

Two methods compared:

• Perceptron rule: $\Delta\omega_i = \eta (t - o) x_i$; considers the thresholded error output by the perceptron; converges if the data is separable by a perceptron.

• LSM and gradient descent: minimizes $\frac{1}{2} \sum_{d \in D} (t_d - o_l(d))^2$; considers the unthresholded error; always converges to the hypothesis that minimizes the error.

Note: minimizing the squared error does not necessarily minimize the number of misclassified samples.

• Multi-class perceptron: use a linear score $f(\mathbf{x}) = \omega_0 + \omega_1 x_1 + \cdots + \omega_n x_n$ for each class.

• Classes: {red, orange, blue}

• Task 1: red or not red, with score $f_r(\mathbf{x})$.

• Task 2: orange or not orange, with score $f_o(\mathbf{x})$.

• Task 3: blue or not blue, with score $f_b(\mathbf{x})$.

• $\mathrm{Class}(\mathbf{x}) = \arg\max_{r,o,b} \{ f_r(\mathbf{x}), f_o(\mathbf{x}), f_b(\mathbf{x}) \}$
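A sketch of this one-vs-rest scheme; the three weight vectors below are hypothetical placeholders standing in for trained weights, and the scores use $f(\mathbf{x}) = \boldsymbol{\omega} \cdot (1, \mathbf{x})$:

```python
import numpy as np

def multiclass_predict(x, weights_by_class):
    """Return the class whose linear score f_c(x) is largest."""
    x = np.concatenate(([1.0], x))  # constant feature x_0 = 1
    scores = {c: np.dot(w, x) for c, w in weights_by_class.items()}
    return max(scores, key=scores.get)  # argmax over the classes

# Hypothetical weights for the three one-vs-rest tasks:
weights = {"red":    np.array([ 0.1,  1.0, -1.0]),
           "orange": np.array([-0.2,  0.5,  0.5]),
           "blue":   np.array([ 0.0, -1.0,  1.0])}
print(multiclass_predict(np.array([2.0, 0.5]), weights))  # -> red
```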

Sigmoid Unit

• Input: a real vector $(x_1, \dots, x_n)$
• Output: a real value

$$o(x_1, \dots, x_n) = \frac{1}{1 + e^{-\boldsymbol{\omega} \cdot \mathbf{x}}}$$

$o(x_1, \dots, x_n) = \sigma(\boldsymbol{\omega} \cdot \mathbf{x})$, where $\sigma(y) = \frac{1}{1 + e^{-y}}$
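As a sketch, the sigmoid unit's output in NumPy; the function names are my own:

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))  # sigma(y) = 1 / (1 + e^-y)

def sigmoid_unit(x, w):
    """o(x) = sigma(w . x), with the constant feature x_0 = 1 prepended."""
    return sigmoid(np.dot(w, np.concatenate(([1.0], x))))
```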


• Parameters: $(\omega_0, \dots, \omega_n)$

• To train a sigmoid unit: define the error, then calculate the partial derivatives.


Derivative of the sigmoid:

$$\frac{\partial \sigma(y)}{\partial y} = \frac{\partial (1 + e^{-y})^{-1}}{\partial y}
= -(1 + e^{-y})^{-2} \cdot \frac{\partial (1 + e^{-y})}{\partial y}
= -(1 + e^{-y})^{-2} \cdot e^{-y} \cdot \frac{\partial (-y)}{\partial y}
= \frac{e^{-y}}{(1 + e^{-y})^2}
= \frac{e^{-y}}{(1 + e^{-y})(1 + e^{-y})}
= \sigma(y)\,(1 - \sigma(y))$$

By the chain rule:

$$\frac{\partial \sigma(\boldsymbol{\omega} \cdot \mathbf{x})}{\partial \omega_i}
= \frac{\partial \sigma(\boldsymbol{\omega} \cdot \mathbf{x})}{\partial (\boldsymbol{\omega} \cdot \mathbf{x})} \cdot \frac{\partial (\boldsymbol{\omega} \cdot \mathbf{x})}{\partial \omega_i}
= \sigma(\boldsymbol{\omega} \cdot \mathbf{x})\,(1 - \sigma(\boldsymbol{\omega} \cdot \mathbf{x})) \cdot x_i$$

• Define the error: $E_D(\boldsymbol{\omega}) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$, where $o_d = \sigma(\boldsymbol{\omega} \cdot \mathbf{x}_d)$.

• Then:

$$\frac{\partial E_D}{\partial \omega_i} = \sum_{d \in D} -(t_d - o_d) \frac{\partial o_d}{\partial \omega_i} = \sum_{d \in D} -(t_d - o_d) \cdot x_{d,i} \cdot o_d (1 - o_d)$$
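Putting the pieces together, gradient descent for a sigmoid unit might be sketched as below (batch mode, fixed epoch count as a convergence-check stand-in, all names my own):

```python
import numpy as np

def sigmoid_train(X, t, eta=0.5, epochs=1000):
    """Gradient descent on E_D(w) = 1/2 * sum_d (t_d - o_d)^2 for a sigmoid unit."""
    X = np.hstack([np.ones((len(X), 1)), X])    # constant feature x_0 = 1
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        o = 1.0 / (1.0 + np.exp(-(X @ w)))      # o_d = sigma(w . x_d)
        grad = X.T @ (-(t - o) * o * (1 - o))   # dE_D/dw_i from the formula above
        w -= eta * grad                         # step opposite the gradient
    return w
```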

Perceptron and Sigmoid Unit

• Perceptron: $o(x_1, \dots, x_n) = \begin{cases} 1 & \text{if } \boldsymbol{\omega} \cdot \mathbf{x} > 0 \\ -1 & \text{else} \end{cases}$; outputs a class; not differentiable.

• Linear unit: $o(x_1, \dots, x_n) = \boldsymbol{\omega} \cdot \mathbf{x}$; outputs a value; differentiable; unbounded.

• Sigmoid unit: $o(x_1, \dots, x_n) = \frac{1}{1 + e^{-\boldsymbol{\omega} \cdot \mathbf{x}}}$; outputs a value; differentiable; bounded in $(0, 1)$.

Summary

• Perceptron (definition)
• Two methods to train a perceptron
• Batch mode vs online mode

• Sigmoid unit (definition)
• Formulas to train a sigmoid unit
