
Machine Learning Basics Lecture 3: Perceptron
Princeton University COS 495
Instructor: Yingyu Liang

Overview

• Previous lectures: (Principle for loss) MLE to derive the loss
  • Example: some linear classification models

• This lecture: (Principle for optimization) local improvement
  • Example: Perceptron; SGD

Task

[Figure: a linearly separable dataset with separating hyperplane $(w^*)^T x = 0$; the region $(w^*)^T x > 0$ is Class +1 and the region $(w^*)^T x < 0$ is Class -1.]

Attempt

• Given training data $(x_i, y_i): 1 \le i \le n$, i.i.d. from distribution $D$
• Hypothesis $f_w(x) = w^T x$
  • $y = +1$ if $w^T x > 0$
  • $y = -1$ if $w^T x < 0$
• Prediction: $y = \text{sign}(f_w(x)) = \text{sign}(w^T x)$

• Goal: minimize classification error

Perceptron

• Assume for simplicity: all $x_i$ have length 1

[Perceptron algorithm: figure from the lecture notes of Nina Balcan]

Intuition: correct the current mistake

• If mistake on a positive example $x$:
  $w_{t+1}^T x = (w_t + x)^T x = w_t^T x + x^T x = w_t^T x + 1$

• If mistake on a negative example $x$:
  $w_{t+1}^T x = (w_t - x)^T x = w_t^T x - x^T x = w_t^T x - 1$
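To make the update rule concrete, here is a minimal sketch of the mistake-driven Perceptron in Python. It is not code from the lecture: the function name `perceptron`, the NumPy data layout, and the stopping rule (stop after a mistake-free pass) are my own choices; examples are assumed to have unit length as in the slides.

```python
import numpy as np

def perceptron(X, y, max_passes=100):
    """Mistake-driven Perceptron.

    X: (n, d) array of examples (each row assumed to have unit length)
    y: (n,) array of labels in {+1, -1}
    Returns the final weight vector and the total number of mistakes.
    """
    n, d = X.shape
    w = np.zeros(d)                      # start from the all-zeros vector
    mistakes = 0
    for _ in range(max_passes):          # cycle through the data
        made_mistake = False
        for x_t, y_t in zip(X, y):
            if y_t * (w @ x_t) <= 0:     # mistake: prediction disagrees with y_t
                w = w + y_t * x_t        # +x on a positive mistake, -x on a negative one
                mistakes += 1
                made_mistake = True
        if not made_mistake:             # a full pass with no mistakes: done
            break
    return w, mistakes
```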

The Perceptron Theorem

• Suppose there exists $w^*$ that correctly classifies $(x_i, y_i)$
• W.L.O.G., all $x_i$ and $w^*$ have length 1, so the minimum distance of any example to the decision boundary is
  $\gamma = \min_i |(w^*)^T x_i|$

• Then Perceptron makes at most $(1/\gamma)^2$ mistakes
  • The examples need not be i.i.d.!
  • The bound does not depend on $n$, the length of the data sequence!
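As a quick sanity check of the bound (my own addition, not from the slides), the sketch below builds a small linearly separable dataset with unit-length examples, runs the mistake-driven update until a full pass makes no mistakes, and compares the total mistake count with $1/\gamma^2$. The dimension, sample size, margin threshold 0.1, and random seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a separable dataset: unit-length target w*, unit-length examples,
# keeping only points at distance > 0.1 from the decision boundary.
d, n = 5, 200
w_star = rng.normal(size=d)
w_star /= np.linalg.norm(w_star)
X = []
while len(X) < n:
    x = rng.normal(size=d)
    x /= np.linalg.norm(x)
    if abs(w_star @ x) > 0.1:
        X.append(x)
X = np.array(X)
y = np.sign(X @ w_star)

gamma = np.min(np.abs(X @ w_star))       # margin of the target classifier

# Run the mistake-driven Perceptron until a full pass makes no mistakes.
w = np.zeros(d)
mistakes = 0
while True:
    made_mistake = False
    for x_t, y_t in zip(X, y):
        if y_t * (w @ x_t) <= 0:
            w = w + y_t * x_t
            mistakes += 1
            made_mistake = True
    if not made_mistake:
        break

print(f"mistakes = {mistakes}, bound 1/gamma^2 = {1 / gamma**2:.1f}")
```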

Analysis

• First look at the quantity $w_t^T w^*$

• Claim 1: $w_{t+1}^T w^* \ge w_t^T w^* + \gamma$
• Proof: If mistake on a positive example $x$:
  $w_{t+1}^T w^* = (w_t + x)^T w^* = w_t^T w^* + x^T w^* \ge w_t^T w^* + \gamma$

• If mistake on a negative example $x$:
  $w_{t+1}^T w^* = (w_t - x)^T w^* = w_t^T w^* - x^T w^* \ge w_t^T w^* + \gamma$

Analysis

• Next look at the quantity $\|w_t\|$

• Claim 2: $\|w_{t+1}\|^2 \le \|w_t\|^2 + 1$
• Proof: If mistake on a positive example $x$:
  $\|w_{t+1}\|^2 = \|w_t + x\|^2 = \|w_t\|^2 + \|x\|^2 + 2 w_t^T x$
  where $w_t^T x$ is negative since we made a mistake on $x$, and $\|x\|^2 = 1$; the case of a mistake on a negative example is analogous

Analysis: putting things together

• Claim 1: $w_{t+1}^T w^* \ge w_t^T w^* + \gamma$
• Claim 2: $\|w_{t+1}\|^2 \le \|w_t\|^2 + 1$

After $M$ mistakes:
• $w_{M+1}^T w^* \ge \gamma M$

• $\|w_{M+1}\| \le \sqrt{M}$
• $w_{M+1}^T w^* \le \|w_{M+1}\|$ (by Cauchy-Schwarz, since $\|w^*\| = 1$)

So $\gamma M \le \sqrt{M}$, and thus $M \le (1/\gamma)^2$

Intuition

• Claim 1: $w_{t+1}^T w^* \ge w_t^T w^* + \gamma$ : the correlation gets larger, which could be because
  1. $w_{t+1}$ gets closer to $w^*$, or
  2. $w_{t+1}$ gets much longer
• Claim 2: $\|w_{t+1}\|^2 \le \|w_t\|^2 + 1$

  • Rules out the bad case "2. $w_{t+1}$ gets much longer"

Some side notes on Perceptron

History

[Figure from Pattern Recognition and Machine Learning, Bishop]

Note: connectionism vs. symbolism

• Symbolism: AI can be achieved by representing concepts as symbols
  • Example: rule-based systems, formal grammar

• Connectionism: explain intellectual abilities using connections between neurons (i.e., artificial neural networks)
  • Example: perceptron, larger-scale neural networks

Symbolism example: Credit Risk Analysis

[Example from the Machine Learning lecture notes by Tom Mitchell]

Connectionism example: neuron/perceptron

[Figure from Pattern Recognition and Machine Learning, Bishop]

Note: connectionism vs. symbolism

• "Formal theories of logical reasoning, grammar, and other higher mental faculties compel us to think of the mind as a machine for rule-based manipulation of highly structured arrays of symbols. What we know of the brain compels us to think of human information processing in terms of manipulation of a large unstructured set of numbers, the activity levels of interconnected neurons."
  ---- The Central Paradox of Cognition (Smolensky et al., 1992)

Note: online vs. batch

• Batch: given training data $(x_i, y_i): 1 \le i \le n$, typically i.i.d.

• Online: data points arrive one by one
  1. The algorithm receives an unlabeled example $x_i$
  2. The algorithm predicts a classification of this example
  3. The algorithm is then told the correct answer $y_i$, and updates its model
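A minimal sketch of this online protocol (my own illustration, not from the lecture; `predict` and `update` are placeholders for whatever model-specific rules are plugged in, e.g. the Perceptron's prediction and mistake-driven update):

```python
def online_learning(stream, predict, update, model):
    """Run the generic online protocol over a stream of (x, y) pairs."""
    mistakes = 0
    for x_t, y_t in stream:                 # 1. receive an unlabeled example
        y_hat = predict(model, x_t)         # 2. predict its label
        if y_hat != y_t:                    # 3. the true label is revealed...
            mistakes += 1
        model = update(model, x_t, y_t)     #    ...and the model is updated
    return model, mistakes
```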

Stochastic gradient descent (SGD)

• Minimize loss $\hat{L}(\theta)$, where the hypothesis is parametrized by $\theta$

• Gradient descent
  • Initialize $\theta_0$
  • $\theta_{t+1} = \theta_t - \eta_t \nabla \hat{L}(\theta_t)$

Stochastic gradient descent (SGD)

• Suppose data points arrive one by one

• $\hat{L}(\theta) = \frac{1}{n} \sum_{t=1}^{n} l(\theta, x_t, y_t)$, but we only know $l(\theta, x_t, y_t)$ at time $t$

• Idea: simply do what you can based on local information
  • Initialize $\theta_0$
  • $\theta_{t+1} = \theta_t - \eta_t \nabla l(\theta_t, x_t, y_t)$
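A minimal SGD sketch under these assumptions (my own code, not from the lecture): `grad_loss(theta, x, y)` is a placeholder for $\nabla l(\theta, x_t, y_t)$, and a constant step size is used in place of a schedule $\eta_t$.

```python
import numpy as np

def sgd(grad_loss, theta0, data_stream, eta=0.01):
    """Stochastic gradient descent: one update per arriving data point.

    grad_loss(theta, x, y) should return the gradient of the per-example
    loss l(theta, x, y) with respect to theta.
    """
    theta = np.asarray(theta0, dtype=float)
    for x_t, y_t in data_stream:
        theta = theta - eta * grad_loss(theta, x_t, y_t)   # local improvement step
    return theta
```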

Example 1: linear regression

• Find $f_w(x) = w^T x$ that minimizes $\hat{L}(f_w) = \frac{1}{n} \sum_{t=1}^{n} (w^T x_t - y_t)^2$
• $l(w, x_t, y_t) = \frac{1}{n} (w^T x_t - y_t)^2$

• $w_{t+1} = w_t - \eta_t \nabla l(w_t, x_t, y_t) = w_t - \frac{2\eta_t}{n} (w_t^T x_t - y_t) x_t$
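A sketch of this update for least squares (my own code; the per-example gradient $\frac{2}{n}(w_t^T x_t - y_t)x_t$ is taken from the slide, and a constant step size is assumed):

```python
import numpy as np

def sgd_linear_regression(X, y, eta=0.1, n_epochs=10):
    """SGD for least squares: w <- w - (2*eta/n) * (w^T x_t - y_t) * x_t."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_epochs):
        for x_t, y_t in zip(X, y):
            residual = w @ x_t - y_t
            w = w - (2.0 * eta / n) * residual * x_t
    return w
```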

Example 2: logistic regression

• Find $w$ that minimizes
  $\hat{L}(w) = -\frac{1}{n} \sum_{y_t = 1} \log \sigma(w^T x_t) - \frac{1}{n} \sum_{y_t = -1} \log[1 - \sigma(w^T x_t)]$
  $\hat{L}(w) = -\frac{1}{n} \sum_t \log \sigma(y_t w^T x_t)$
• $l(w, x_t, y_t) = -\frac{1}{n} \log \sigma(y_t w^T x_t)$

Example 2: logistic regression

• Find $w$ that minimizes
  $l(w, x_t, y_t) = -\frac{1}{n} \log \sigma(y_t w^T x_t)$

• $w_{t+1} = w_t - \eta_t \nabla l(w_t, x_t, y_t) = w_t + \frac{\eta_t}{n} \frac{\sigma(a)(1 - \sigma(a))}{\sigma(a)} y_t x_t$, where $a = y_t w_t^T x_t$
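A sketch of this update (my own code; labels are assumed to lie in $\{+1, -1\}$, and the factor $\frac{\sigma(a)(1-\sigma(a))}{\sigma(a)}$ is simplified to $1 - \sigma(a)$):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sgd_logistic_regression(X, y, eta=0.5, n_epochs=10):
    """SGD for logistic regression: w <- w + (eta/n)*(1 - sigmoid(y_t w^T x_t))*y_t*x_t."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_epochs):
        for x_t, y_t in zip(X, y):
            a = y_t * (w @ x_t)              # a = y_t * w_t^T x_t as on the slide
            w = w + (eta / n) * (1.0 - sigmoid(a)) * y_t * x_t
    return w
```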

Example 3: Perceptron

• Hypothesis: $y = \text{sign}(w^T x)$
• Define hinge loss
  $l(w, x_t, y_t) = -y_t w^T x_t \, \mathbb{I}[\text{mistake on } x_t]$

  $\hat{L}(w) = -\sum_t y_t w^T x_t \, \mathbb{I}[\text{mistake on } x_t]$

• $w_{t+1} = w_t - \eta_t \nabla l(w_t, x_t, y_t) = w_t + \eta_t y_t x_t \, \mathbb{I}[\text{mistake on } x_t]$

Example 3: Perceptron

• Hypothesis: $y = \text{sign}(w^T x)$

• $w_{t+1} = w_t - \eta_t \nabla l(w_t, x_t, y_t) = w_t + \eta_t y_t x_t \, \mathbb{I}[\text{mistake on } x_t]$

• Set $\eta_t = 1$. If mistake on a positive example:

  $w_{t+1} = w_t + y_t x_t = w_t + x$
• If mistake on a negative example:

  $w_{t+1} = w_t + y_t x_t = w_t - x$
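The short sketch below (my own code) spells out this reduction: the SGD step on the per-example loss above is a no-op when $x_t$ is classified correctly, and with $\eta_t = 1$ it is exactly the mistake-driven update from earlier in the lecture.

```python
import numpy as np

def perceptron_loss_grad(w, x_t, y_t):
    """Gradient of l(w, x_t, y_t) = -y_t * w^T x_t * I[mistake on x_t],
    treating the mistake indicator as fixed (as on the slides)."""
    mistake = y_t * (w @ x_t) <= 0
    return -y_t * x_t if mistake else np.zeros_like(w)

def perceptron_sgd_step(w, x_t, y_t, eta=1.0):
    """One SGD step; with eta = 1 this reduces to w + y_t * x_t on a mistake."""
    return w - eta * perceptron_loss_grad(w, x_t, y_t)
```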

Pros & Cons

Pros:
• Widely applicable
• Easy to implement in most cases
• Guarantees for many losses
• Good performance: error / running time / memory, etc.

Cons:
• No guarantees for non-convex optimization
• Hyper-parameters: initialization, learning rate, etc.

Mini-batch

• Instead of one data point, work with a small batch of $b$ points:

  $(x_{tb+1}, y_{tb+1}), \ldots, (x_{tb+b}, y_{tb+b})$

• Update rule:
  $\theta_{t+1} = \theta_t - \eta_t \nabla \frac{1}{b} \sum_{1 \le i \le b} l(\theta_t, x_{tb+i}, y_{tb+i})$

• Other variants: variance reduction, etc.
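A sketch of the mini-batch update (my own code; `grad_loss` is the same per-example gradient placeholder as in the earlier SGD sketch, and examples left over after the last full batch are simply skipped):

```python
import numpy as np

def minibatch_sgd(grad_loss, theta0, X, y, batch_size=32, eta=0.01, n_epochs=10):
    """theta <- theta - eta * (1/b) * sum of per-example gradients over the batch."""
    theta = np.asarray(theta0, dtype=float)
    n = len(X)
    for _ in range(n_epochs):
        for start in range(0, n - batch_size + 1, batch_size):
            batch = zip(X[start:start + batch_size], y[start:start + batch_size])
            grads = [grad_loss(theta, x_i, y_i) for x_i, y_i in batch]
            theta = theta - eta * np.mean(grads, axis=0)   # average gradient over the batch
    return theta
```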

Homework

Homework 1

• Assignment online
  • Course website: http://www.cs.princeton.edu/courses/archive/spring16/cos495/
  • Piazza: https://piazza.com/princeton/spring2016/cos495
• Due date: Feb 17th (one week)
• Submission
  • Math part: hand-written/printed; submit to the TA (Office: EE, C319B)
  • Coding part: in Matlab/Python; submit the .m/.py file on Piazza

Homework 1

• Grading policy: every late day reduces the attainable credit for the exercise by 10%.

• Collaboration:
  • Discussion on the problem sets is allowed
  • Students are expected to finish the homework by themselves
  • The people you discussed with on assignments should be clearly listed: before the solution to each question, list all the people that you discussed with on that particular question