Machine Learning Basics
Lecture 3: Perceptron
Princeton University COS 495
Instructor: Yingyu Liang

Perceptron Overview
• Previous lectures: (principle for loss function) MLE to derive loss
• Example: linear regression; some linear classification models
• This lecture: (principle for optimization) local improvement
• Example: Perceptron; SGD

Task

[Figure: the decision boundary $(w^*)^T x = 0$ separates Class +1 (where $(w^*)^T x > 0$) from Class -1 (where $(w^*)^T x < 0$)]

Attempt
• Given training data $(x_i, y_i) : 1 \le i \le n$, i.i.d. from distribution $D$
• Hypothesis $f_w(x) = w^T x$
  • $y = +1$ if $w^T x > 0$
  • $y = -1$ if $w^T x < 0$
• Prediction: $y = \text{sign}(f_w(x)) = \text{sign}(w^T x)$
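As a minimal sketch (the function name is my own), the hypothesis and prediction rule are:

```python
import numpy as np

def predict(w, x):
    """Linear classifier: y = sign(w^T x)."""
    return 1 if w @ x > 0 else -1

# A weight vector along the first axis classifies by the sign
# of the first coordinate.
w = np.array([1.0, 0.0])
```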
• Goal: minimize classification error

Perceptron Algorithm
• Assume for simplicity: all $x_i$ have length 1
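The algorithm shown in the figure below is the standard Perceptron update rule; a minimal sketch (function and variable names are my own):

```python
import numpy as np

def perceptron(xs, ys, max_passes=100):
    """Standard Perceptron: start at w = 0; on each mistake, add y * x.
    Returns the final weight vector."""
    w = np.zeros(xs.shape[1])
    for _ in range(max_passes):
        mistakes = 0
        for x, y in zip(xs, ys):
            if y * (w @ x) <= 0:   # mistake (or exactly on the boundary)
                w += y * x         # w + x for positives, w - x for negatives
                mistakes += 1
        if mistakes == 0:          # a full mistake-free pass: done
            break
    return w
```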
[Perceptron: figure from the lecture notes of Nina Balcan]

Intuition: correct the current mistake
• If mistake on a positive example:
  $w_{t+1}^T x = (w_t + x)^T x = w_t^T x + x^T x = w_t^T x + 1$
• If mistake on a negative example:
  $w_{t+1}^T x = (w_t - x)^T x = w_t^T x - x^T x = w_t^T x - 1$

The Perceptron Theorem
• Suppose there exists $w^*$ that correctly classifies $(x_i, y_i)$
• W.L.O.G., all $x_i$ and $w^*$ have length 1, so the minimum distance of any example to the decision boundary is
  $\gamma = \min_i |(w^*)^T x_i|$
• Then Perceptron makes at most $(1/\gamma)^2$ mistakes
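As a numerical sanity check of the bound, here is a sketch on a synthetic separable dataset (the data and all names are my own, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Unit-length points labeled by a unit-length w*, filtered so gamma >= 0.1.
w_star = np.array([0.6, 0.8])                      # ||w*|| = 1
xs = rng.normal(size=(200, 2))
xs /= np.linalg.norm(xs, axis=1, keepdims=True)    # all x_i have length 1
xs = xs[np.abs(xs @ w_star) > 0.1]
ys = np.sign(xs @ w_star)
gamma = np.min(np.abs(xs @ w_star))

# Run Perceptron, counting total mistakes until a full pass is mistake-free.
w, mistakes = np.zeros(2), 0
while True:
    errs = 0
    for x, y in zip(xs, ys):
        if y * (w @ x) <= 0:       # mistake
            w += y * x
            errs += 1
    mistakes += errs
    if errs == 0:
        break

assert mistakes <= (1 / gamma) ** 2   # the Perceptron mistake bound
```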
• Note: the examples need not be i.i.d.!
• Note: the mistake bound does not depend on $n$, the length of the data sequence!

Analysis
• First look at the quantity $w_t^T w^*$
• Claim 1: $w_{t+1}^T w^* \ge w_t^T w^* + \gamma$
• Proof: If mistake on a positive example $x$:
  $w_{t+1}^T w^* = (w_t + x)^T w^* = w_t^T w^* + x^T w^* \ge w_t^T w^* + \gamma$
• If mistake on a negative example:
  $w_{t+1}^T w^* = (w_t - x)^T w^* = w_t^T w^* - x^T w^* \ge w_t^T w^* + \gamma$
  (for a misclassified negative example, $x^T w^* \le -\gamma$)

Analysis
• Next look at the quantity $\|w_t\|$
• Claim 2: $\|w_{t+1}\|^2 \le \|w_t\|^2 + 1$
• Proof: If mistake on a positive example $x$:
  $\|w_{t+1}\|^2 = \|w_t + x\|^2 = \|w_t\|^2 + \|x\|^2 + 2 w_t^T x \le \|w_t\|^2 + 1$
  ($\|x\|^2 = 1$, and $2 w_t^T x$ is negative since we made a mistake on $x$)

Analysis: putting things together
• Claim 1: $w_{t+1}^T w^* \ge w_t^T w^* + \gamma$
• Claim 2: $\|w_{t+1}\|^2 \le \|w_t\|^2 + 1$
After $M$ mistakes:
• $w_{M+1}^T w^* \ge \gamma M$
• $\|w_{M+1}\| \le \sqrt{M}$
• $w_{M+1}^T w^* \le \|w_{M+1}\|$ (since $\|w^*\| = 1$)
So $\gamma M \le \sqrt{M}$, and thus $M \le (1/\gamma)^2$

Intuition
• Claim 1 ($w_{t+1}^T w^* \ge w_t^T w^* + \gamma$): the correlation gets larger. This could be because:
  1. $w_{t+1}$ gets closer to $w^*$
  2. $w_{t+1}$ gets much longer
• Claim 2: $\|w_{t+1}\|^2 \le \|w_t\|^2 + 1$
• Claim 2 rules out the bad case "2. $w_{t+1}$ gets much longer"

Some side notes on Perceptron

History
[Figure from Pattern Recognition and Machine Learning, Bishop]

Note: connectionism vs. symbolism
• Symbolism: AI can be achieved by representing concepts as symbols
• Example: rule-based expert system, formal grammar
• Connectionism: explain intellectual abilities using connections between neurons (i.e., artificial neural networks)
• Example: perceptron, larger scale neural networks

Symbolism example: Credit Risk Analysis
[Example from Machine Learning lecture notes by Tom Mitchell]

Connectionism example: Neuron/perceptron
[Figure from Pattern Recognition and Machine Learning, Bishop]

Note: connectionism vs. symbolism
• "Formal theories of logical reasoning, grammar, and other higher mental faculties compel us to think of the mind as a machine for rule-based manipulation of highly structured arrays of symbols. What we know of the brain compels us to think of human information processing in terms of manipulation of a large unstructured set of numbers, the activity levels of interconnected neurons."
  ---- The Central Paradox of Cognition (Smolensky et al., 1992)

Note: online vs. batch
• Batch: given training data $(x_i, y_i) : 1 \le i \le n$, typically i.i.d.
• Online: data points arrive one by one
  1. The algorithm receives an unlabeled example $x_i$
  2. The algorithm predicts a classification of this example
  3. The algorithm is then told the correct answer $y_i$, and updates its model

Stochastic gradient descent (SGD)

Gradient descent
• Minimize loss $L(\theta)$, where the hypothesis is parametrized by $\theta$
• Gradient descent
  • Initialize $\theta_0$
  • $\theta_{t+1} = \theta_t - \eta_t \nabla L(\theta_t)$

Stochastic gradient descent (SGD)
• Suppose data points arrive one by one
• $L(\theta) = \frac{1}{n} \sum_{t=1}^n l(\theta, x_t, y_t)$, but we only know $l(\theta, x_t, y_t)$ at time $t$
• Idea: simply do what you can based on local information
  • Initialize $\theta_0$
  • $\theta_{t+1} = \theta_t - \eta_t \nabla l(\theta_t, x_t, y_t)$

Example 1: linear regression
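Put into code, SGD for the linear regression example below looks like this (a sketch with synthetic data of my own; the $1/n$ constant is absorbed into the learning rate):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stream: y = w_true^T x plus a little noise.
w_true = np.array([2.0, -1.0])
xs = rng.normal(size=(500, 2))
ys = xs @ w_true + 0.01 * rng.normal(size=500)

# SGD on the squared loss (w^T x_t - y_t)^2, one point per step:
# the gradient is 2 (w^T x_t - y_t) x_t.
w = np.zeros(2)
eta = 0.05
for x, y in zip(xs, ys):
    w -= eta * 2 * (w @ x - y) * x

# After one pass, w should be close to w_true.
```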
• Find $f_w(x) = w^T x$ that minimizes $L(f_w) = \frac{1}{n} \sum_{t=1}^n (w^T x_t - y_t)^2$
• $l(w, x_t, y_t) = \frac{1}{n} (w^T x_t - y_t)^2$
• $w_{t+1} = w_t - \eta_t \nabla l(w_t, x_t, y_t) = w_t - \frac{2\eta_t}{n} (w_t^T x_t - y_t) x_t$

Example 2: logistic regression
• Find $w$ that minimizes
  $L(w) = -\frac{1}{n} \sum_{t: y_t = 1} \log \sigma(w^T x_t) - \frac{1}{n} \sum_{t: y_t = -1} \log[1 - \sigma(w^T x_t)]$
  Equivalently (using $1 - \sigma(a) = \sigma(-a)$):
  $L(w) = -\frac{1}{n} \sum_t \log \sigma(y_t w^T x_t)$
  $l(w, x_t, y_t) = -\frac{1}{n} \log \sigma(y_t w^T x_t)$
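A sketch of the corresponding SGD loop (synthetic data and names are my own; since $\sigma'(a) = \sigma(a)(1-\sigma(a))$, the per-example gradient simplifies to $-(1-\sigma(a))\, y_t x_t$ with $a = y_t w^T x_t$, up to the $1/n$ constant absorbed into $\eta$):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)

# Synthetic labels in {-1, +1} from a ground-truth linear rule.
w_true = np.array([1.5, -2.0])
xs = rng.normal(size=(1000, 2))
ys = np.sign(xs @ w_true)

# SGD on l(w, x_t, y_t) = -log sigma(y_t w^T x_t):
# each step adds eta * (1 - sigma(a)) * y_t x_t, where a = y_t w^T x_t.
w = np.zeros(2)
eta = 0.5
for x, y in zip(xs, ys):
    a = y * (w @ x)
    w += eta * (1.0 - sigmoid(a)) * y * x

# The learned w should point roughly in the direction of w_true.
```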
• Find $w$ that minimizes
  $l(w, x_t, y_t) = -\frac{1}{n} \log \sigma(y_t w^T x_t)$
• $w_{t+1} = w_t - \eta_t \nabla l(w_t, x_t, y_t) = w_t + \frac{\eta_t}{n} \frac{\sigma(a)(1 - \sigma(a))}{\sigma(a)} y_t x_t$
  where $a = y_t w_t^T x_t$

Example 3: Perceptron
• Hypothesis: $y = \text{sign}(w^T x)$
• Define hinge loss:
  $l(w, x_t, y_t) = -y_t w^T x_t \, \mathbb{I}[\text{mistake on } x_t]$
• $L(w) = -\sum_t y_t w^T x_t \, \mathbb{I}[\text{mistake on } x_t]$
• $w_{t+1} = w_t - \eta_t \nabla l(w_t, x_t, y_t) = w_t + \eta_t y_t x_t \, \mathbb{I}[\text{mistake on } x_t]$
• Set $\eta_t = 1$. If mistake on a positive example:
  $w_{t+1} = w_t + y_t x_t = w_t + x$
• If mistake on a negative example:
  $w_{t+1} = w_t + y_t x_t = w_t - x$
• So SGD on this loss with $\eta_t = 1$ recovers the Perceptron update

Pros & Cons
Pros:
• Widely applicable
• Easy to implement in most cases
• Guarantees for many losses
• Good performance: error/running time/memory etc.
Cons:
• No guarantees for non-convex optimization (e.g., the losses in deep learning)
• Hyper-parameters: initialization, learning rate

Mini-batch
• Instead of one data point, work with a small batch of 푏 points
$(x_{tb+1}, y_{tb+1}), \ldots, (x_{tb+b}, y_{tb+b})$
• Update rule:
  $\theta_{t+1} = \theta_t - \eta_t \frac{1}{b} \sum_{1 \le i \le b} \nabla l(\theta_t, x_{tb+i}, y_{tb+i})$
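The update above can be sketched as follows (the linear regression squared loss is used for concreteness; data and names are my own):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-regression stream, consumed b points at a time.
w_true = np.array([1.0, 3.0])
xs = rng.normal(size=(960, 2))
ys = xs @ w_true

b, eta = 32, 0.1
theta = np.zeros(2)
for start in range(0, len(xs), b):
    xb, yb = xs[start:start + b], ys[start:start + b]
    # Average the per-example gradients of (theta^T x - y)^2 over the batch.
    grad = 2 * ((xb @ theta - yb)[:, None] * xb).mean(axis=0)
    theta -= eta * grad

# theta should approach w_true as the batches are consumed.
```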
• Other variants: variance reduction etc.

Homework

Homework 1
• Assignment online
  • Course website: http://www.cs.princeton.edu/courses/archive/spring16/cos495/
  • Piazza: https://piazza.com/princeton/spring2016/cos495
• Due date: Feb 17th (one week)
• Submission
  • Math part: hand-written/print; submit to TA (Office: EE, C319B)
  • Coding part: in Matlab/Python; submit the .m/.py file on Piazza
• Grading policy: every late day reduces the attainable credit for the exercise by 10%.
• Collaboration:
  • Discussion on the problem sets is allowed
  • Students are expected to finish the homework by themselves
  • The people you discussed with on assignments should be clearly listed: before the solution to each question, list everyone you discussed that particular question with.