Machine Learning Basics Lecture 3: Perceptron
Princeton University COS 495
Instructor: Yingyu Liang

Perceptron Overview
• Previous lectures: (principle for loss functions) MLE to derive the loss
  • Example: linear regression; some linear classification models
• This lecture: (principle for optimization) local improvement
  • Example: Perceptron; SGD

Task
• Binary classification with a linear separator: the hyperplane $(w^*)^T x = 0$ has $(w^*)^T x > 0$ for class +1 and $(w^*)^T x < 0$ for class -1

Attempt
• Given training data $(x_i, y_i), 1 \le i \le n$, i.i.d. from distribution $D$
• Hypothesis $f_w(x) = w^T x$
  • $y = +1$ if $w^T x > 0$
  • $y = -1$ if $w^T x < 0$
• Prediction: $y = \mathrm{sign}(f_w(x)) = \mathrm{sign}(w^T x)$
• Goal: minimize classification error

Perceptron Algorithm
• Assume for simplicity: all $x_i$ have length 1
• (Figure: the Perceptron algorithm, from the lecture notes of Nina Balcan)

Intuition: correct the current mistake
• If mistake on a positive example:
  $w_{t+1}^T x = (w_t + x)^T x = w_t^T x + x^T x = w_t^T x + 1$
• If mistake on a negative example:
  $w_{t+1}^T x = (w_t - x)^T x = w_t^T x - x^T x = w_t^T x - 1$

The Perceptron Theorem
• Suppose there exists $w^*$ that correctly classifies $(x_i, y_i)$
• W.L.O.G., all $x_i$ and $w^*$ have length 1, so the minimum distance of any example to the decision boundary is
  $\gamma = \min_i |(w^*)^T x_i|$
• Then Perceptron makes at most $(1/\gamma)^2$ mistakes
• Note: the examples need not be i.i.d.!
• Note: the bound does not depend on $n$, the length of the data sequence!

Analysis
• First look at the quantity $w_t^T w^*$
• Claim 1: $w_{t+1}^T w^* \ge w_t^T w^* + \gamma$
• Proof: if mistake on a positive example $x$,
  $w_{t+1}^T w^* = (w_t + x)^T w^* = w_t^T w^* + x^T w^* \ge w_t^T w^* + \gamma$
• If mistake on a negative example $x$,
  $w_{t+1}^T w^* = (w_t - x)^T w^* = w_t^T w^* - x^T w^* \ge w_t^T w^* + \gamma$

Analysis (continued)
• Next look at the quantity $\|w_t\|$
• Claim 2: $\|w_{t+1}\|^2 \le \|w_t\|^2 + 1$
• Proof: if mistake on a positive example $x$,
  $\|w_{t+1}\|^2 = \|w_t + x\|^2 = \|w_t\|^2 + \|x\|^2 + 2 w_t^T x$,
  where $2 w_t^T x$ is negative since we made a mistake on $x$, and $\|x\|^2 = 1$

Analysis: putting things together
• Claim 1: $w_{t+1}^T w^* \ge w_t^T w^* + \gamma$
• Claim 2: $\|w_{t+1}\|^2 \le \|w_t\|^2 + 1$
• After $M$ mistakes:
  • $w_{M+1}^T w^* \ge \gamma M$
  • $\|w_{M+1}\| \le \sqrt{M}$
  • $w_{M+1}^T w^* \le \|w_{M+1}\|$
• So $\gamma M \le \sqrt{M}$, and thus $M \le (1/\gamma)^2$

Intuition
• Claim 1 ($w_{t+1}^T w^* \ge w_t^T w^* + \gamma$): the correlation gets larger. This could be because 1. $w_{t+1}$ gets closer to $w^*$, or 2. $w_{t+1}$ gets much longer
• Claim 2 ($\|w_{t+1}\|^2 \le \|w_t\|^2 + 1$): rules out the bad case "2. $w_{t+1}$ gets much longer"
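The Perceptron algorithm itself appears above only as a figure (from Nina Balcan's lecture notes). The sketch below is one way to write the mistake-driven procedure in Python, consistent with the update rule just described; the function name, the number of passes, and the stopping rule are illustrative choices, not part of the original slides.

```python
import numpy as np

def perceptron(X, y, n_passes=10):
    """Mistake-driven Perceptron (sketch).

    X: (n, d) array of examples (assumed normalized to unit length,
       as in the lecture); y: (n,) array of labels in {+1, -1}.
    """
    n, d = X.shape
    w = np.zeros(d)                          # start from the zero vector
    for _ in range(n_passes):                # several passes over the data
        mistakes = 0
        for x_t, y_t in zip(X, y):
            if y_t * np.dot(w, x_t) <= 0:    # mistake (or exactly on the boundary)
                w = w + y_t * x_t            # +x on a positive mistake, -x on a negative one
                mistakes += 1
        if mistakes == 0:                    # a full pass with no mistakes: done
            break
    return w
```

On linearly separable data with unit-length examples, the theorem above says this loop makes at most $(1/\gamma)^2$ mistakes in total.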

Some side notes on Perceptron

History
• (Figure: history of the perceptron, from Pattern Recognition and Machine Learning, Bishop)

Note: connectionism vs. symbolism
• Symbolism: AI can be achieved by representing concepts as symbols
  • Example: rule-based expert systems, formal grammar
• Connectionism: explain intellectual abilities using connections between neurons (i.e., artificial neural networks)
  • Example: perceptron, larger scale neural networks

Symbolism example: Credit Risk Analysis
• (Example from the machine learning lecture notes of Tom Mitchell)

Connectionism example
• Neuron/perceptron (figure from Pattern Recognition and Machine Learning, Bishop)

Note: connectionism vs. symbolism
• "Formal theories of logical reasoning, grammar, and other higher mental faculties compel us to think of the mind as a machine for rule-based manipulation of highly structured arrays of symbols. What we know of the brain compels us to think of human information processing in terms of manipulation of a large unstructured set of numbers, the activity levels of interconnected neurons."
  ---- The Central Paradox of Cognition (Smolensky et al., 1992)

Note: online vs. batch
• Batch: given training data $(x_i, y_i), 1 \le i \le n$, typically i.i.d.
• Online: data points arrive one by one
  • 1. The algorithm receives an unlabeled example $x_i$
  • 2. The algorithm predicts a classification of this example
  • 3. The algorithm is then told the correct answer $y_i$, and updates its model

Stochastic gradient descent (SGD)

Gradient descent
• Minimize the loss $\hat{L}(\theta)$, where the hypothesis is parametrized by $\theta$
• Gradient descent:
  • Initialize $\theta_0$
  • $\theta_{t+1} = \theta_t - \eta_t \nabla \hat{L}(\theta_t)$

Stochastic gradient descent (SGD)
• Suppose data points arrive one by one
• $\hat{L}(\theta) = \frac{1}{n} \sum_{t=1}^{n} l(\theta, x_t, y_t)$, but we only know $l(\theta, x_t, y_t)$ at time $t$
• Idea: simply do what you can based on local information
  • Initialize $\theta_0$
  • $\theta_{t+1} = \theta_t - \eta_t \nabla l(\theta_t, x_t, y_t)$

Example 1: linear regression
• Find $f_w(x) = w^T x$ that minimizes $\hat{L}(f_w) = \frac{1}{n} \sum_{t=1}^{n} (w^T x_t - y_t)^2$
• $l(w, x_t, y_t) = \frac{1}{n} (w^T x_t - y_t)^2$
• $w_{t+1} = w_t - \eta_t \nabla l(w_t, x_t, y_t) = w_t - \frac{2\eta_t}{n} (w_t^T x_t - y_t) x_t$
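As an illustration of Example 1, here is a minimal Python sketch of the per-example SGD update for linear regression. It keeps the $1/n$ factor from the lecture's per-example loss; the constant step size eta, the number of passes, and the function name are assumptions made for the sketch, not choices from the slides.

```python
import numpy as np

def sgd_linear_regression(X, y, eta=0.1, n_passes=10):
    """SGD for least-squares linear regression (sketch).

    Per-example loss from the lecture: l(w, x_t, y_t) = (w^T x_t - y_t)^2 / n,
    so the 1/n factor shows up in the per-example gradient.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_passes):
        for x_t, y_t in zip(X, y):
            grad = (2.0 / n) * (np.dot(w, x_t) - y_t) * x_t   # gradient of the per-example loss
            w = w - eta * grad                                 # SGD step with constant step size
        # a decreasing step size eta_t could be used here instead of a constant eta
    return w
```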

Example 2: logistic regression
• Find $w$ that minimizes
  $\hat{L}(w) = -\frac{1}{n} \sum_{y_t = 1} \log \sigma(w^T x_t) - \frac{1}{n} \sum_{y_t = -1} \log[1 - \sigma(w^T x_t)]$,
  which can be written as
  $\hat{L}(w) = -\frac{1}{n} \sum_t \log \sigma(y_t w^T x_t)$
• $l(w, x_t, y_t) = -\frac{1}{n} \log \sigma(y_t w^T x_t)$

Example 2: logistic regression (continued)
• Find $w$ that minimizes $l(w, x_t, y_t) = -\frac{1}{n} \log \sigma(y_t w^T x_t)$
• $w_{t+1} = w_t - \eta_t \nabla l(w_t, x_t, y_t) = w_t + \frac{\eta_t}{n} \frac{\sigma(a)(1 - \sigma(a))}{\sigma(a)} y_t x_t$,
  where $a = y_t w_t^T x_t$

Example 3: Perceptron
• Hypothesis: $y = \mathrm{sign}(w^T x)$
• Define the hinge loss
  $l(w, x_t, y_t) = -y_t w^T x_t \, \mathbb{I}[\text{mistake on } x_t]$
  $\hat{L}(w) = -\sum_t y_t w^T x_t \, \mathbb{I}[\text{mistake on } x_t]$
• $w_{t+1} = w_t - \eta_t \nabla l(w_t, x_t, y_t) = w_t + \eta_t y_t x_t \, \mathbb{I}[\text{mistake on } x_t]$

Example 3: Perceptron (continued)
• Hypothesis: $y = \mathrm{sign}(w^T x)$
• $w_{t+1} = w_t - \eta_t \nabla l(w_t, x_t, y_t) = w_t + \eta_t y_t x_t \, \mathbb{I}[\text{mistake on } x_t]$
• Set $\eta_t = 1$. If mistake on a positive example:
  $w_{t+1} = w_t + y_t x_t = w_t + x$
• If mistake on a negative example:
  $w_{t+1} = w_t + y_t x_t = w_t - x$

Pros & Cons
Pros:
• Widely applicable
• Easy to implement in most cases
• Guarantees for many losses
• Good performance: error / running time / memory, etc.
Cons:
• No guarantees for non-convex optimization (e.g., the problems in deep learning)
• Hyper-parameters: initialization, learning rate

Mini-batch
• Instead of one data point, work with a small batch of $b$ points:
  $(x_{tb+1}, y_{tb+1}), \dots, (x_{tb+b}, y_{tb+b})$
• Update rule:
  $\theta_{t+1} = \theta_t - \eta_t \nabla \frac{1}{b} \sum_{1 \le i \le b} l(\theta_t, x_{tb+i}, y_{tb+i})$
• Other variants: variance reduction, etc.
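To illustrate the mini-batch update rule, here is a rough Python sketch that applies it to the logistic loss of Example 2. For simplicity it drops the $1/n$ factor, takes batches in order rather than by random sampling, and uses a constant step size; these, along with the function and parameter names, are assumptions of the sketch rather than choices made in the lecture.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def minibatch_sgd_logistic(X, y, b=32, eta=0.1, n_epochs=10):
    """Mini-batch SGD for logistic regression (sketch).

    Per-example loss: l(w, x, y) = -log sigma(y * w^T x); its gradient is
    -(1 - sigma(y * w^T x)) * y * x, averaged over the mini-batch.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_epochs):
        for start in range(0, n, b):
            Xb, yb = X[start:start + b], y[start:start + b]          # current mini-batch
            margins = yb * (Xb @ w)                                   # y_i * w^T x_i for the batch
            grad = -((1.0 - sigmoid(margins)) * yb) @ Xb / len(yb)    # average gradient over the batch
            w = w - eta * grad                                        # mini-batch SGD step
    return w
```

Setting b = 1 recovers the plain SGD update, and averaging over a larger batch reduces the variance of each step.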
Homework

Homework 1
• Assignment online
  • Course website: http://www.cs.princeton.edu/courses/archive/spring16/cos495/
  • Piazza: https://piazza.com/princeton/spring2016/cos495
• Due date: Feb 17th (one week)
• Submission
  • Math part: hand-written or printed; submit to the TA (Office: EE, C319B)
  • Coding part: in Matlab/Python; submit the .m/.py file on Piazza

Homework 1 (continued)
• Grading policy: every late day reduces the attainable credit for the exercise by 10%.
• Collaboration:
  • Discussion on the problem sets is allowed
  • Students are expected to finish the homework by themselves
  • The people you discussed with on assignments should be clearly detailed: before the solution to each question, list all people that you discussed with on that particular question.
