On-line Learning and Kernels

Dan Roth [email protected] | http://www.cis.upenn.edu/~danroth/ | 461C, 3401 Walnut

Slides were created by Dan Roth (for CIS 519/419 at Penn or CS 446 at UIUC), or borrowed from other authors who have made their ML slides available.

Are we recording? YES!

Administration (10/5/20) – available on the web site

• Remember that all the lectures are available on the website before the class – Go over it and be prepared – We will add written notes next to each lecture, with some more details, examples and, (when relevant) some code.

• HW 1 is due tonight.
– Covers: SGD, DT, Ensemble Models, & Experimental work
• HW 2 is out tonight
– Covers: Linear classifiers; learning curves; an application to named entity recognition; domain adaptation.
– Start working on it now. Don't wait until the last day (or two) since it could take a lot of your time.

• PollEv: If you haven’t done so, please register https://pollev.com/cis519/register – We still have some problems with missing participation for students • Quizzes: https://canvas.upenn.edu/courses/1546646/quizzes/2475913/statistics

• Go to the recitations and office hours
• Questions? Please ask/comment during class; give us feedback

Comments on HW1

• Learning Algorithms – Decision Trees (Full, limited depth) – Stochastic Gradient Descent (over original and a richer feature space)

• Should we get different results? – Why, why not?

• Results on what?
– Training data?
– Future data (data that is not used in training; a.k.a. test data)?
• Note that the use of cross-validation just allows us to get a better estimate of the performance on future data.

Comments on HW1 (Cont.)

• Hypothesis Spaces:
– DT (full) >> DT(4) ?> SGD << SGD(DT(4))
– (We should really call these "linear" and "linear over a richer feature set".)

• Training data performance: – Directly dictated by the size of the hypothesis space – But, depending on the data, learning with simpler hypothesis spaces can also get 100% on training

• Test data performance
– Learning using hypothesis spaces that are too expressive leads to overfitting: our expressive model learned the dataset.
– Recall an example I gave in the Intro class, using a deep NN for this dataset.
• Data set or the hidden function?
– In principle, it's the hidden function that determines what will happen – is it too expressive/expressive enough for the hypothesis space we are using?
– But, if we don't see a lot of data, our data may not be representative of the hidden function.
• In training, we will actually learn another function, and thus not do well at test time.
• At this point, we are developing an intuition, but we will quantify some of it soon.

A Guide

• Learning Algorithms
– (Stochastic) Gradient Descent (with LMS)
– Decision Trees
• Importance of hypothesis space (representation)
• How are we doing?
– We quantified performance in terms of the cumulative # of mistakes (on test data)
– But our algorithms were driven by a different metric than the one we care about.
– Today, we are going to start discussing quantification.

A Guide

• Versions of Perceptron
– How to deal better with large feature spaces & sparsity?
– Variations of Perceptron
• Dealing with overfitting
– Closing the loop: Back to Gradient Descent
– Dual Representations & Kernels
• Beyond Binary Classification?
– Multi-class classification (and more)
• More general ways to quantify learning performance (PAC)
– New Algorithms (SVM, Boosting)

(Today: take a more general perspective and think more about learning, learning protocols, and quantifying performance. This will motivate some of the ideas we will see next.)

Two Key Goals

• (1) Learning Protocols
– Most of the class – and most of machine learning today – is devoted to supervised learning.
• But there is more to ML than supervised learning
– We'll introduce some ideas that can motivate other protocols, like active learning and teaching (human in the loop)
• (2) We will do it in the context of Linear Learning algorithms
– Why? Don't we only care about neural networks?
– No.
• First, the algorithmic ideas we'll introduce are the same ones used in neural networks.
• More importantly, there are good reasons to want to know about simpler learning algorithms.

Example (Named Entity Recognition)

• (A “standard” dataset will only care about PER, LOC, ORG, MISC)

Linear Models? Consider Performance

• Named Entity Recognition in 22 languages

• CCG: a state-of-the-art linear model (over expressive features)

• fastText, mBERT: state-of-the-art neural models based on bi-LSTM-CRF, with different embeddings

But… Consider Cost Effectiveness

• The linear model is
– more than 50 times faster to train.
– ~20 times faster to evaluate.
• Companies that need to run it over millions of documents will consider the cost (time & money) and would often prefer linear models.

Quantifying Performance

• We want to be able to say something rigorous about the performance of our learning algorithm.

• We will concentrate on discussing the number of examples one needs to see before we can say that our learned hypothesis is good.

Learning Conjunctions
(Recall that these are linear functions, and you can use SGD and many other algorithms to learn them; but the lessons of today are clearer with a simpler representation.)

• There is a hidden (monotone) conjunction the learner (you) is to learn:
  f(x1, x2, …, x100) = x2 ∧ x3 ∧ x4 ∧ x5 ∧ x100
• How many examples are needed to learn it? How?
– Protocol I: The learner proposes instances as queries to the teacher
– Protocol II: The teacher (who knows f) provides training examples
– Protocol III: Some random source (e.g., Nature) provides training examples; the Teacher (Nature) provides the labels (f(x))

Learning Conjunctions (I)

• Protocol I: The learner proposes instances as queries to the teacher
• Note: we know we are after a monotone conjunction.

Learning Conjunctions (I)

• Protocol I: The learner proposes instances as queries to the teacher
• Since we know we are after a monotone conjunction:
– Is x100 in? Query <(1,1,1,…,1,0), ?>. f(x) = 0 (conclusion: Yes)
– Is x99 in? Query <(1,1,…,1,0,1), ?>. f(x) = 1 (conclusion: No)
– Is x1 in? Query <(0,1,…,1,1,1), ?>. f(x) = 1 (conclusion: No)
• A straightforward algorithm requires n = 100 queries, and will produce as a result the hidden conjunction (exactly):
– h(x1, x2, …, x100) = x2 ∧ x3 ∧ x4 ∧ x5 ∧ x100
• (What happens here if the conjunction is not known to be monotone? If we know of a positive example, the same algorithm works.)

Learning Conjunctions (II)

• Protocol II: The teacher (who knows f) provides training examples:
– <(0,1,1,1,1,0,…,0,1), 1>  (we learned a superset of the good variables)
• To show you that all these variables are required…
– <(0,0,1,1,1,0,…,0,1), 0>  need x2
– <(0,1,0,1,1,0,…,0,1), 0>  need x3
– …
– <(0,1,1,1,1,0,…,0,0), 0>  need x100
• (Modeling teaching is tricky.)
• A straightforward algorithm requires k + 1 = 6 examples to produce the hidden conjunction (exactly):
– h(x1, x2, …, x100) = x2 ∧ x3 ∧ x4 ∧ x5 ∧ x100

Learning Conjunctions (III)    f = x2 ∧ x3 ∧ x4 ∧ x5 ∧ x100

• Protocol III: Some random source (e.g., Nature) provides training examples
– Teacher (Nature) provides the labels (f(x))
– <(1,1,1,1,1,1,…,1,1), 1>
– <(1,1,1,0,0,0,…,0,0), 0>
– <(1,1,1,1,1,0,…,0,1,1), 1>
– <(1,0,1,1,1,0,…,0,1,1), 0>
– <(1,1,1,1,1,0,…,0,0,1), 1>
– <(1,0,1,0,0,0,…,0,1,1), 0>
– <(1,1,1,1,1,1,…,0,1), 1>
– <(0,1,0,1,0,0,…,0,1,1), 0>
• How should we learn?

• Algorithm: Elimination
– Start with the set of all literals as candidates
– Eliminate a literal that is not active (0) in a positive example
• Trace, on the data above:
– <(1,1,1,1,1,1,…,1,1), 1>
– <(1,1,1,0,0,0,…,0,0), 0>  learned nothing: h = x1 ∧ x2 ∧ … ∧ x100
– <(1,1,1,1,1,0,…,0,1,1), 1>  h = x1 ∧ x2 ∧ x3 ∧ x4 ∧ x5 ∧ x99 ∧ x100
– <(1,0,1,1,0,0,…,0,0,1), 0>  learned nothing
– <(1,1,1,1,1,0,…,0,0,1), 1>  h = x1 ∧ x2 ∧ x3 ∧ x4 ∧ x5 ∧ x100
– <(1,0,1,0,0,0,…,0,1,1), 0>
– <(1,1,1,1,1,1,…,0,1), 1>
– <(0,1,0,1,0,0,…,0,1,1), 0>
• Final hypothesis: h = x1 ∧ x2 ∧ x3 ∧ x4 ∧ x5 ∧ x100
• Is it good?
– Performance?
– # of examples?
• With the given data, we only learned an "approximation" to the true concept.
• We don't know how many examples we need to see to learn f exactly. (Do we care?)
• But we know that we can make a limited # of mistakes.
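The Elimination algorithm above is simple enough to code directly. Below is a minimal Python sketch (the function name and the toy data are illustrative, not from the course materials): it starts with all variables as candidates and drops any variable that is 0 in a positive example.

```python
def eliminate(examples):
    """Learn a monotone conjunction with the Elimination algorithm.

    examples: list of (x, y) pairs, where x is a tuple of 0/1 values and
    y is the label (1 iff the hidden conjunction is satisfied).
    Returns the set of variable indices kept in the hypothesis.
    """
    n = len(examples[0][0])
    candidates = set(range(n))                  # start with all literals
    for x, y in examples:
        if y == 1:                              # only positive examples are used
            candidates -= {i for i in candidates if x[i] == 0}
    return candidates

# Toy run with n = 5 and hidden conjunction x1 AND x2 (0-indexed: 0, 1)
data = [((1, 1, 1, 0, 1), 1), ((1, 0, 1, 1, 1), 0), ((1, 1, 0, 1, 0), 1)]
print(sorted(eliminate(data)))                  # -> [0, 1]
```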

Two Directions

• Can continue to analyze the probabilistic intuition:
– Never saw x1 = 0 in positive examples; maybe we'll never see it?
– And if we will, it will be with small probability, so the concepts we learn may be pretty good
– Good: in terms of performance on future data – PAC framework
• Mistake Driven Learning algorithms / On-line algorithms
– Now, we can only reason about #(mistakes), not #(examples)
• any relations?
– Update your hypothesis only when you make mistakes
• Not all on-line algorithms are mistake driven, so the performance measure could be different.

On-Line Learning

• New learning algorithms • (all learn a linear function over the feature space) – Perceptron (+ many variations) – General Gradient Descent view

• Issues: – Importance of Representation – Complexity of Learning – Idea of Kernel Based Methods – More about features

Are we recording? YES!

Administration (10/7/20) – available on the web site

• Remember that all the lectures are available on the website before the class – Go over it and be prepared – A new set of written notes will accompany most lectures, with some more details, examples and, (when relevant) some code. The first one is available today for this lecture.

• HW 1: Was due on Monday. – We plan to complete grading it this week, in case you want to consider it before the drop deadline next week. • HW 2 is out. – Covers: Linear classifiers; learning curves; an application to named entity recognition; domain adaptation. – Start working on it now. Don’t wait until the last day (or two) since it could take a lot of your time

• PollEv: If you haven’t done so, please register https://pollev.com/cis519/register – We still have some problems with missing participation for students • Quizzes: https://canvas.upenn.edu/courses/1546646/quizzes/2475913/statistics

• Go to the recitations and office hours
• Questions? Please ask/comment during class; give us feedback

Practical Machine Learning

(The following question came up in my office hour: Why did you talk about the Teaching and Active Learning protocols? After all, this is not what Machine Learning is about, right? WRONG!)
• You work for an E-discovery company that handles documents for large corporations.
• A corporation asks you for a reliable Text Classification program that can identify harassing e-mail (H-mail) messages sent by/to their employees within the organization (they don't want to be sued).
• Problem: The harassing emails constitute a tiny fraction of the 10k/day email traffic.
– How will you develop a reliable classifier?

• Active Learning Protocol:
– Assume that you already have a "reasonable" model that can classify an e-mail message as H-mail/Not.
– Get some experts from the HR office in the company. Get them to use your model.
– The model will output its guesses; hopefully, many of them will indeed be "reasonable". The HR experts will classify them. This will allow you to, relatively quickly, get a good enough collection of H-mails, along with hard negative examples.
– Now you can train a model.
– Questions:
• How will your learning algorithm decide what examples to present to the HR experts? (the key Active Learning question)
• How will you initialize the model so it doesn't present only negative examples (which are the vast majority of examples)?
• Teaching Protocol:
– Ask the HR experts to "dream up" some examples of the rare H-mails, and maybe some hard negative examples.
– They can use real examples they have seen, but can also invent some, given that they understand the concepts.
– This "teaching" process is essential to start the learning process.
– Questions:
• Can we teach in a more efficient way (e.g., via definitions)?
• (The example protocols discussed last time were simplified and idealized, but very realistic in applications.)

(Recap of the Learning Conjunctions (III) and Two Directions slides from last time: the Elimination algorithm, run on Nature's data, produced the final hypothesis h = x1 ∧ x2 ∧ x3 ∧ x4 ∧ x5 ∧ x100.)


Generic Mistake Bound Algorithms

• Is it clear that we can bound the number of mistakes?
• Let C be a finite concept class. Learn f ∈ C.
• CON algorithm:
– In the i-th stage of the algorithm:
• C_i: all concepts in C consistent with all i−1 previously seen examples
– Choose f ∈ C_i randomly and use it to predict the next example
– Clearly, C_{i+1} ⊆ C_i and, if a mistake is made on the i-th example, then |C_{i+1}| < |C_i|, so progress is made.
• The CON algorithm makes at most |C| − 1 mistakes
• Can we do better?
• (The goal of the following discussion is to think about hypothesis spaces, and some "optimal" algorithms, as a way to understand what might be possible. For this reason, here we think about the class of possible target functions and the class of hypotheses as the same.)

The Halving Algorithm

• Let C be a finite concept class. Learn f ∈ C.
• Halving Algorithm:
– In the i-th stage of the algorithm:
• C_i: all concepts in C consistent with all i−1 previously seen examples
– Given an example e_i, consider the value f_j(e_i) for all f_j ∈ C_i and predict by majority.
– Predict 1 iff |{f_j ∈ C_i ; f_j(e_i) = 0}| < |{f_j ∈ C_i ; f_j(e_i) = 1}|
– Clearly, C_{i+1} ⊆ C_i and, if a mistake is made on the i-th example, then |C_{i+1}| < ½|C_i|, so progress is made.
• The Halving algorithm makes at most log(|C|) mistakes
– Of course, this is a theoretical algorithm; can this be achieved with an efficient algorithm?
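To make the Halving algorithm concrete, here is a minimal Python sketch over an explicitly enumerated finite concept class (my own helper names and toy data; this is only practical when |C| is small enough to list):

```python
from itertools import combinations

def halving_predict(candidates, x):
    """Predict by majority vote of the concepts still consistent."""
    votes = sum(1 for c in candidates if c(x) == 1)
    return 1 if votes > len(candidates) - votes else 0

def halving_update(candidates, x, y):
    """Keep only the concepts that agree with the observed label."""
    return [c for c in candidates if c(x) == y]

# Concept class: all monotone conjunctions over n = 3 Boolean variables.
n = 3
def make_conj(vars_):
    return lambda x: int(all(x[i] == 1 for i in vars_))
C = [make_conj(s) for k in range(n + 1) for s in combinations(range(n), k)]

stream = [((1, 1, 0), 1), ((0, 1, 1), 0), ((1, 1, 1), 1)]   # labels from x0 AND x1
mistakes = 0
for x, y in stream:
    if halving_predict(C, x) != y:
        mistakes += 1
    C = halving_update(C, x, y)
print(mistakes, len(C))   # at most log2(8) = 3 mistakes are possible here
```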

Learning Conjunctions – can this bound be achieved? (Can mistakes be bounded in the non-finite case?)

• There is a hidden conjunction the learner is to learn: f(x1, x2, …, x100) = x2 ∧ x3 ∧ x4 ∧ x5 ∧ x100
• The number of (all; not monotone) conjunctions: |C| = 3^n
– log(|C|) = n
• The Elimination algorithm makes n mistakes
• k-conjunctions:
– Assume that only k ≪ n attributes occur in the conjunction
– The number of k-conjunctions: |C| = C(n, k) · 2^k
– log(|C|) = O(k log n)
– Can we learn efficiently with this number of mistakes?
• (Earlier: talked about various learning protocols & algorithms for conjunctions; discussed the performance of the algorithms in terms of bounding the number of mistakes an algorithm makes; gave a "theoretical" algorithm with log|C| mistakes.)

Linear Threshold Functions

f(x) = sgn{wᵀx − θ} = sgn{Σ_{i=1..n} w_i x_i − θ}

• Many functions are linear
– Conjunctions:
• y = x1 ∧ x3 ∧ x5
• y = sgn{1·x1 + 1·x3 + 1·x5 − 3}; w = (1,0,1,0,1), θ = 3
– At least m of n:
• y = at least 2 of {x1, x3, x5}
• y = sgn{1·x1 + 1·x3 + 1·x5 − 2}; w = (1,0,1,0,1), θ = 2
• Many functions are not
– Xor: y = (x1 ∧ x2) ∨ (¬x1 ∧ ¬x2)
– Non-trivial DNF: y = (x1 ∧ x2) ∨ (x3 ∧ x4)
• But they can be made linear
• Note: all the variables above are Boolean variables

Linear Threshold Functions

f(x) = sgn{wᵀx − θ} = sgn{Σ_{i=1..n} w_i x_i − θ}

• Many functions are linear
– Conjunctions:
• y = x1 ∧ x3 ∧ x5
• y = sgn{1·x1 + 1·x3 + 1·x5 − 3}; w = (1,0,1,0,1), θ = 3
• In our Elimination Algorithm we started with w = (1,1,1,…,1) (n ones) and changed w_i from 1 to 0 following a mistake
• In general, when learning linear functions, we will change w_i more carefully
– y = sgn{w1·x1 + w2·x2 + w3·x3 + w4·x4 + w5·x5 − θ}

(Figure: a separating hyperplane; the boundary wᵀx = θ, the normal vector w, and the boundary wᵀx = 0 through the origin, with positive points on one side and negative points on the other.)

Canonical Representation

f(x) = sgn{wᵀx − θ} = sgn{Σ_{i=1..n} w_i x_i − θ}

• Note: sgn{wᵀx − θ} = sgn{w′ᵀx′}
– Where x′ = (x, −1) and w′ = (w, θ)
• We moved from an n-dimensional representation to an (n+1)-dimensional representation, but now we can look for hyperplanes that go through the origin.
• Basically, that means that we learn both w and θ.

(Figure: the same data shown in the original (x0, x1) space with boundary wᵀx = θ, and in the augmented space with boundary w′ᵀx′ = 0 through the origin.)

Perceptron

(Earlier: talked about various learning protocols & algorithms for conjunctions; measured performance of algorithms in terms of bounding the number of mistakes an algorithm makes; showed that in the case of a finite concept class we will stop making mistakes at some point – a "theoretical" algorithm had log|C| mistakes.)

Perceptron learning rule

• On-line, mistake driven algorithm. • Rosenblatt (1959) suggested that when a target output value is provided for a single neuron with fixed input, it can incrementally change weights and learn to produce the output using the Perceptron learning rule • (Perceptron == Linear Threshold Unit)

(Figure: a linear threshold unit – inputs x1…x6 with weights w1…w6 feed a sum, which is compared to a threshold T to produce the output y.)

Perceptron learning rule

• We learn f: X → Y = {−1, +1}, represented as f = sgn{wᵀx}
• Where X = {0,1}ⁿ or X = Rⁿ, and w ∈ Rⁿ
• Given labeled examples: {(x1, y1), (x2, y2), …, (xm, ym)}
1. Initialize w ∈ Rⁿ
2. Cycle through all examples [multiple times]
a. Predict the label of instance x to be y′ = sgn{wᵀx}
b. If y′ ≠ y, update the weight vector: w = w + r y x (r – a constant, the learning rate)
Otherwise, if y′ = y, leave weights unchanged.
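A minimal NumPy sketch of this update rule (variable names and the toy data are mine; the course's colab notebook has its own version):

```python
import numpy as np

def perceptron(X, y, r=1.0, epochs=10):
    """Train a Perceptron: X is (m, n), y in {-1, +1}. Returns the weight vector."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):                  # cycle through all examples multiple times
        for x_i, y_i in zip(X, y):
            if np.sign(w @ x_i) != y_i:      # mistake (sign(0) = 0 also counts as a mistake)
                w += r * y_i * x_i           # mistake-driven additive update
    return w

# Tiny linearly separable example: label = sign(x1 - x2)
X = np.array([[2.0, 1.0], [1.0, 3.0], [3.0, 0.5], [0.5, 2.0]])
y = np.array([1, -1, 1, -1])
w = perceptron(X, y)
print(np.sign(X @ w))   # should match y
```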

Perceptron in action

(Figure, from Bishop 2006: the next item x (with y = +1) to be classified is shown as a vector; the current weight vector w defines the current decision boundary wᵀx = 0; adding x, as a vector, to w gives the new weight vector and the new decision boundary. Positive and negative regions are marked.)

Perceptron in action

• At the top of the new notes you will find a link to a colab notebook.

(Figure, from Bishop 2006: the same picture as above – the next item x to be classified, the current decision boundary wᵀx = 0 and current weight vector w; x, as a vector, is added to w, giving the new weight vector and new decision boundary.)

Perceptron learning rule

• If x is Boolean, only weights of active features are updated
• Why is this important?
1. Initialize w ∈ Rⁿ
2. Cycle through all examples [multiple times]
a. Predict the label of instance x to be y′ = sgn{wᵀx}
b. If y′ ≠ y, update the weight vector: w = w + r y x (r – a constant, the learning rate)
Otherwise, if y′ = y, leave weights unchanged.

• wᵀx > 0 is equivalent to: P(y = +1 | x) = 1 / (1 + e^{−wᵀx}) > 1/2

Perceptron Learnability

• Obviously, it can't learn what it can't represent (???)
– Only linearly separable functions
• Minsky and Papert (1969) wrote an influential book demonstrating Perceptron's representational limitations
– Parity functions can't be learned (XOR)
– In vision, if patterns are represented with local features, it can't represent symmetry, connectivity
• Research on Neural Networks stopped for years
• Rosenblatt himself (1959) asked,
– "What pattern recognition problems can be transformed so as to become linearly separable?"

Perceptron Convergence

– Perceptron Convergence Theorem:
• If there exists a set of weights that is consistent with the data (i.e., the data is linearly separable), the Perceptron learning algorithm will converge.
• How long would it take to converge?
– Perceptron Cycling Theorem:
• If the training data is not linearly separable, the Perceptron learning algorithm will eventually repeat the same set of weights and therefore enter an infinite loop.
• How can we provide robustness and more expressivity?

Are we recording? YES!

Administration (10/12/20) – available on the web site

• Remember that all the lectures are available on the website before the class – Go over it and be prepared – A new set of written notes will accompany most lectures, with some more details, examples and, (when relevant) some code. The first one is available today for this lecture.

• HW 1: Avg. 90.5/100; overall, 96.8 (with EC)

• HW 2: Due date extended to 10/22

• No class on Wednesday (see my Piazza message last night)

• Quizzes: https://canvas.upenn.edu/courses/1546646/quizzes/2480748/statistics

• Mid-term is on 10/28, at the class time. If you have a time zone problem, email me.
• Questions? Please ask/comment during class; give us feedback

Where Are We?

• Discussed On-Line Learning – Learning by updating the hypothesis after each observed example – Specifically: Perceptron • Mistake driven learning: updates are done only when a mistake is made • Today: – Continue with Perceptron and some variations – Understand Perceptron as a special case of SGD – More advanced update rules

Perceptron learning rule (recap)

• We learn f: X → Y = {−1, +1} as f = sgn{wᵀx}; on a mistake (y′ ≠ y), update w = w + r y x (r – the learning rate).
• (Recap of the Perceptron-in-action figure from Bishop 2006: a mistake on x adds x, as a vector, to w, moving the decision boundary.)

Perceptron

Input: a set of examples and their labels Z = {(x1, y1), …, (xm, ym)} ∈ Rⁿ × {−1, 1}, a learning rate η, and θ_init
• Initialize w ← 0 and θ ← θ_init
• For every training epoch:
• for every (x_j, y_j) ∈ Z:
– ŷ ← sign(w · x_j − θ)
– If ŷ ≠ y_j:
• w ← w + η y_j x_j
• θ ← θ − η y_j
• (Just to make sure we understand that we learn both w and θ.)

Perceptron: Mistake Bound Theorem

• Maintains a weight vector w ∈ Rⁿ, w₀ = (0, …, 0).
• Upon receiving an example x ∈ Rⁿ,
• predicts according to the linear threshold function wᵀx ≥ 0.
• Theorem [Novikoff, 1963]:
– Let (x1, y1), …, (xt, yt) be a sequence of labeled examples with x_i ∈ Rⁿ, ||x_i|| ≤ R and y_i ∈ {−1, 1} for all i. Let u ∈ Rⁿ, γ > 0 be such that ||u|| = 1 and y_i uᵀx_i ≥ γ for all i.
– Then Perceptron makes at most R²/γ² mistakes on this example sequence.
• (See the additional notes.)
• (Notation: ||u|| is the L2 norm, (Σ_i u_i²)^½; R is a complexity parameter; γ is the margin.)
• (One goal of the class is for you to be able to read and comprehend math. ML is a fast-evolving field, and using it means reading, understanding, and implementing new things – all presented in mathematical notation.)

Perceptron – Mistake Bound

Assumptions: v₁ = 0; ||u|| = 1; y_i uᵀx_i ≥ γ
1. Note that the bound depends neither on the dimensionality nor on the number of examples.
2. Note that we place weight vectors and examples in the same space.
3. Interpretation of the theorem: the number of mistakes k satisfies k < R²/γ².

Robustness to Noise

• In the case of non-separable data, the extent to which a data point fails to have margin γ via the hyperplane w can be quantified by a slack variable ξ_i = max(0, γ − y_i wᵀx_i).
• Observe that when ξ_i = 0, the example x_i has margin at least γ. Otherwise, it grows linearly with −y_i wᵀx_i.
• Denote: D₂ = [Σ_i ξ_i²]^½
• Theorem:
– The Perceptron is guaranteed to make no more than ((R + D₂)/γ)² mistakes on any sequence of examples satisfying ||x_i|| < R.
• Perceptron is expected to have some robustness to noise.

Perceptron for Boolean Functions

• How many mistakes will the Perceptron algorithm make when learning a k-disjunction?
• Try to figure out the bound.
• Find a sequence of examples that will cause Perceptron to make O(n) mistakes on a k-disjunction on n attributes.
– (Where is n coming from?)
• Recall that Halving suggested the possibility of a better bound – O(k log n)
• This can be achieved by Winnow
– A multiplicative update algorithm [Littlestone '88]
– See HW 2

Practical Issues and Extensions

• There are many extensions that can be made to these basic algorithms.
• Some are necessary for them to perform well
– Regularization (next; will be motivated in the next section, COLT)
• Some are for ease of use and tuning
– Converting the output of a Perceptron to a conditional probability:
  P(y = +1 | x) = 1 / (1 + e^{−A wᵀx})
– The parameter A can be tuned on a development set

A Learning Scenario

• You are running Perceptron on a stream of random examples.
• h1 made a mistake on the first example it saw.
• h2 ran through 9 examples and only made a mistake on the 10th.
• h3 …
• h98 made a mistake only on the 1000th example it saw.
• h99 made a mistake on the first example it saw.
• h100 made a mistake on the 10th example it saw.
• Now, you need to stop training and output the final hypothesis.
• Which one will you output?

Regularization Via Averaged Perceptron

• An Averaged Perceptron Algorithm is motivated by the following considerations: – In real life, we want [more] guarantees from our learning algorithm – In the mistake bound model: • We don’t know when we will make the mistakes. • In a sequential/on-line scenario, which hypothesis will you choose…?? • Being consistent with more examples is better (why?) – Ideally, we want to quantify the expected performance as a function of the number of examples seen and not number of mistakes. (“a global guarantee”; the PAC model does that) – Every Mistake-Bound Algorithm can be converted efficiently to yield such a guarantee.

• To convert a given Mistake Bound algorithm (into a global guarantee algorithm): – Wait for a long stretch w/o mistakes (there must be one) – Use the hypothesis at the end of this stretch. – Its PAC behavior is relative to the length of the stretch.

• Averaged Perceptron returns a weighted average of earlier hypotheses; the weights are a function of the length of no-mistakes stretch.

Regularization Via Averaged Perceptron

• Variables:
– m: number of examples
– k: number of mistakes
– c_i: consistency count for hypothesis v_i
– T: number of epochs
• Input: a labeled training set {(x1, y1), (x2, y2), …, (xm, ym)}
• Output: a list of weighted perceptrons {(v1, c1), …, (vk, ck)}
• Initialize: k = 0, v1 = 0, c1 = 0
• Repeat T times:
– For i = 1, …, m:
– Compute prediction y′ = sgn(v_kᵀ x_i)
– If y′ = y_i, then c_k = c_k + 1
  else: v_{k+1} = v_k + y_i x; c_{k+1} = 1; k = k + 1
• Prediction:
– Given: a list of weighted perceptrons {(v1, c1), …, (vk, ck)}; a new example x
– Predict the label y(x) as follows: y(x) = sgn[ Σ_{i=1..k} c_i (v_iᵀx) ]
• (This can be done on top of any online mistake-driven algorithm. In HW 2 you will run it over three different algorithms. The implementation requires thinking. The averaged version of Perceptron/Winnow is as good as any other linear learning algorithm, if not better.)
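A compact Python sketch of the averaged (voted) Perceptron described above, following the (v_k, c_k) bookkeeping on the slide (function names and toy data are mine):

```python
import numpy as np

def averaged_perceptron(X, y, r=1.0, epochs=10):
    """Averaged Perceptron: returns the list of hypotheses V and their counts C."""
    n = X.shape[1]
    V = [np.zeros(n)]   # hypotheses v_1, v_2, ...
    C = [0]             # consistency counts c_1, c_2, ...
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            if np.sign(V[-1] @ x_i) == y_i:
                C[-1] += 1                        # current hypothesis survives one more example
            else:
                V.append(V[-1] + r * y_i * x_i)   # mistake: start a new hypothesis
                C.append(1)
    return V, C

def predict(V, C, x):
    """Voted prediction: sgn( sum_k c_k * (v_k . x) )."""
    return np.sign(sum(c * (v @ x) for v, c in zip(V, C)))

X = np.array([[2.0, 1.0], [1.0, 3.0], [3.0, 0.5], [0.5, 2.0]])
y = np.array([1, -1, 1, -1])
V, C = averaged_perceptron(X, y)
print([predict(V, C, x) == y_i for x, y_i in zip(X, y)])
```

Note that storing the single vector Σ_k c_k v_k would give the same predictions with less memory; the explicit list is kept here to mirror the pseudocode.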

Perceptron with Margin

• Thick Separator (a.k.a. Perceptron with Margin) (applies both to Perceptron and Winnow)
• Promote weights (on a positive example) if: wᵀx − θ < γ
• Demote weights (on a negative example) if: wᵀx − θ > −γ
(Figure: a thick separator – the hyperplane wᵀx = θ with a band of width γ around it; positive points on one side, negative points on the other.)
• Note: γ is a functional margin. Its effect could disappear as w grows. Nevertheless, this has been shown to be a very effective algorithmic addition. (Grove & Roth 98, 01; Karov et al. 97)

Other Extensions

• Assume you made a mistake on example x.
• You then see example x again; will you make a mistake on it?
• Threshold-relative updating (Aggressive Perceptron):
– w ← w + r x
– r = (θ − wᵀx) / ||x||²

• Equivalent to updating on the same example multiple times
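A quick numeric check (my own toy numbers) that this choice of r pushes the score exactly to the threshold, which is why it is equivalent to updating on the same example multiple times:

```python
import numpy as np

w, x, theta = np.array([1.0, 0.0]), np.array([1.0, 2.0]), 3.0
r = (theta - w @ x) / (x @ x)    # threshold-relative learning rate
w_new = w + r * x
print(w @ x, w_new @ x)          # 1.0 -> 3.0: the updated score hits theta exactly
```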

So Far

• We have presented several on-line linear learning algorithms

– SGD • Remember: this algorithm depended on the loss function chosen. – Perceptron

• What is the difference between the versions we have seen? • Are there any similarities?

General Stochastic Gradient Algorithms

• Given examples z = (x, y) drawn from a distribution over X × Y, we are trying to learn a linear function, parameterized by a weight vector w, so that we minimize the expected risk function
  J(w) = E_z Q(z, w) ≈ (1/m) Σ_{i=1..m} Q(z_i, w)
• In Stochastic Gradient Descent algorithms we approximate this minimization by incrementally updating the weight vector w as follows:
  w_{t+1} = w_t − r_t g_w Q(z_t, w_t) = w_t − r_t g_t
– Where g_t = g_w Q(z_t, w_t) is the gradient with respect to w at time t, and r_t is the learning rate.
• The difference between algorithms now amounts to choosing a different loss function Q(z, w).
• (The loss Q is a function of x, w, and y.)

General Stochastic Gradient Algorithms

w_{t+1} = w_t − r_t g_w Q(x_t, y_t, w_t) = w_t − r_t g_t

– LMS: Q((x, y), w) = ½ (y − wᵀx)²
– Computing the gradient leads to the update rule (also called Widrow's Adaline):
  w_{t+1} = w_t + r (y_t − w_tᵀx_t) x_t
– Here, even though we make binary predictions based on sgn(wᵀx), we do not take the sign of the dot-product into account in the loss.
– Another common loss function is the Hinge loss: Q((x, y), w) = max(0, 1 − y wᵀx)
– Computing the gradient leads to the Perceptron update rule:
  g = 0 if y_i wᵀx_i > 1; otherwise, g = −y x
– If y_i wᵀx_i > 1 (no mistake, by a margin): no update
– Otherwise (mistake, relative to the margin): w_{t+1} = w_t + r_t y_t x_t   (here g = −y x)
– (Good to think about the case of Boolean examples.)
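A sketch of the two SGD updates side by side (my own minimal implementation; the choice of loss is the only difference):

```python
import numpy as np

def sgd_step(w, x, y, r, loss="hinge"):
    """One stochastic gradient step for a linear model w, under two losses."""
    if loss == "lms":                    # Q = 0.5 * (y - w.x)^2  (Widrow's Adaline)
        return w + r * (y - w @ x) * x
    if loss == "hinge":                  # Q = max(0, 1 - y * w.x) (Perceptron-style)
        if y * (w @ x) > 1:
            return w                     # no update: correct with a margin
        return w + r * y * x
    raise ValueError(loss)

w = np.zeros(2)
for x, y in [(np.array([1.0, 0.0]), 1), (np.array([0.0, 1.0]), -1)] * 5:
    w = sgd_step(w, x, y, r=0.5, loss="hinge")
print(w)   # positive weight on the first feature, negative on the second
```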

• w_{t+1} = w_t − r_t g_w Q(x_t, y_t, w_t) = w_t − r_t g_t
• Perceptron (hinge loss): g = 0 if y_i wᵀx_i > 1; otherwise, g = −y x
• Consider the performance of Perceptron on Boolean examples:
– x = <(0, 0, 1, 1, 0), +>
– Assume that you made a mistake on x:
• what will happen to w2?
• what will happen to w3?
– Assume that you made another mistake on x:
• what will happen to w2?
• what will happen to w3?
– And another mistake on x:
• what will happen to w2?
• what will happen to w3?

New Stochastic Gradient Algorithms

w_{t+1} = w_t − r_t g_w Q(z_t, w_t) = w_t − r_t g_t
(Notice that this is a vector; each coordinate (feature) j has its own w_{t,j} and g_{t,j}.)
– So far, we used fixed learning rates r = r_t, but this can change.
– AdaGrad alters the update to adapt based on historical information:
• Frequently occurring features in the gradients get small learning rates and infrequent features get higher ones.
• The idea is to "learn slowly" from frequent features but "pay attention" to rare but informative features.
– Define a "per-feature" learning rate for feature j, as:
  r_{t,j} = r / (G_{t,j})^{1/2}
– where G_{t,j} = Σ_{k=1..t} g_{k,j}² is the sum of squares of gradients at feature j until time t.
– Overall, the update rule for AdaGrad is:
  w_{t+1,j} = w_{t,j} − g_{t,j} · r / (G_{t,j})^{1/2}
– This algorithm is supposed to update weights faster than Perceptron or LMS when needed.
– (Easy to think about the case of Perceptron, and on Boolean examples (x = 0,0,1,…). Check that in Perceptron, if x_{t,j} = 0 then w_{t,j} does not change. Now, what happens when the j-th coordinate of x is 1 for a few consecutive examples?)
– Some additional comments on optimization with adaptive learning rates
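A minimal sketch of the per-feature AdaGrad update on top of the hinge-loss gradient (my own code; eps is a small constant I add to avoid dividing by zero before a feature has seen any gradient):

```python
import numpy as np

def adagrad_hinge(X, y, r=1.0, epochs=10, eps=1e-8):
    """AdaGrad with the hinge-loss (Perceptron-style) gradient."""
    n = X.shape[1]
    w = np.zeros(n)
    G = np.zeros(n)                          # running sum of squared gradients, per feature
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i) <= 1:         # margin mistake: gradient is -y * x
                g = -y_i * x_i
                G += g ** 2
                w -= r * g / np.sqrt(G + eps)   # per-feature learning rate r / sqrt(G_j)
    return w

X = np.array([[1.0, 0.0, 1.0], [1.0, 1.0, 0.0], [1.0, 0.0, 0.0]])
y = np.array([1, -1, -1])
print(adagrad_hinge(X, y))
```

Note that a coordinate with zero gradient (x_j = 0) leaves both G_j and w_j unchanged, matching the Boolean-example observation above.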

Are we recording? YES!

Administration (10/19/20) – available on the web site

• Remember that all the lectures are available on the website before the class – Go over it and be prepared – A new set of written notes will accompany most lectures, with some more details, examples and, (when relevant) some code. The first one is available today for this lecture.

• HW 2: Due date extended to 10/22

• Quizzes: Quiz 5 Statistics

• Mid-term is on 10/28; at the class time. – If you have a time zone problem – email me.

• Questions? Please ask/comment during class; give us feedback

What have we done?

• Perceptron with Extensions

– Perceptron is an SGD Algorithm: • With hinge loss

– Adagrad: Adaptive learning rate

Preventing Overfitting

(Figure: two hypotheses, h1 and h2, fit to the same data, illustrating overfitting.)

Regularization

• The more general formalism adds a regularization term to the risk function, and minimizes:
  J(w) = (1/m) Σ_{i=1..m} Q(z_i, w) + λ R(w)
• Where R is used to enforce "simplicity" of the learned functions.
• LMS case: Q((x, y), w) = (y − wᵀx)²
– R(w) = ||w||₂² gives the optimization problem called Ridge Regression.
– R(w) = ||w||₁ gives a problem called the LASSO problem.
• Hinge loss case: Q((x, y), w) = max(0, 1 − y wᵀx)
– R(w) = ||w||₂² gives the problem called Support Vector Machines.
• Logistic loss case: Q((x, y), w) = log(1 + exp{−y wᵀx})
– R(w) = ||w||₂² gives the problem called Logistic Regression.
• These are convex optimization problems and, in principle, the same gradient descent mechanism can be used in all cases.
• We will see later why it makes sense to use the "size" of w as a way to control "simplicity".
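As a sketch, the SGD step for the regularized objective just adds the gradient of λR(w); for R(w) = ||w||₂² that is an extra 2λw term. This is my own minimal code, shown for the hinge-loss (SVM-style) case:

```python
import numpy as np

def regularized_sgd_step(w, x, y, r=0.1, lam=0.01):
    """One SGD step on Q((x,y),w) = max(0, 1 - y * w.x) + lam * ||w||^2."""
    grad = 2 * lam * w                     # gradient of the L2 regularizer
    if y * (w @ x) <= 1:                   # hinge loss is active
        grad = grad - y * x
    return w - r * grad

w = np.zeros(3)
for x, y in [(np.array([1.0, 0.0, 1.0]), 1), (np.array([0.0, 1.0, 1.0]), -1)] * 20:
    w = regularized_sgd_step(w, x, y)
print(w)
```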

Algorithmic Approaches

• Focus: two families of algorithms (one on-line representative of each)
– Additive update algorithms: Perceptron
• SVM is a close relative of Perceptron
– Multiplicative update algorithms: Winnow
• Close relatives: Boosting, Max entropy/…

Summary of Algorithms

• Examples: x ∈ {0,1}ⁿ or x ∈ Rⁿ (indexed by k); Hypothesis: w ∈ Rⁿ
• Prediction: y ∈ {−1, +1}: predict y = 1 iff w·x > θ
• Update: mistake driven
• Additive weight update algorithm: w ← w + r y_k x_k
– (Perceptron, Rosenblatt, 1958. Variations exist.)
– In the case of Boolean features:
• If Class = 1 but w·x ≤ θ: w_i ← w_i + 1 if x_i = 1 (promotion)
• If Class = 0 but w·x ≥ θ: w_i ← w_i − 1 if x_i = 1 (demotion)
• Multiplicative weight update algorithm: w_i ← α^{y_k x_i} w_i (component-wise update)
– (Winnow, Littlestone, 1988. Variations exist.)
– Boolean features (as an example, α = 2):
• If Class = 1 but w·x ≤ θ: w_i ← 2 w_i if x_i = 1 (promotion)
• If Class = 0 but w·x ≥ θ: w_i ← w_i / 2 if x_i = 1 (demotion)

Which algorithm is better? How to Compare?
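A small Python sketch of the multiplicative (Winnow) update on Boolean features (my own code; θ is fixed, as in the simple version on the slide):

```python
import numpy as np

def winnow(X, y, alpha=2.0, theta=None, epochs=10):
    """Winnow: multiplicative, component-wise updates on Boolean features (labels in {0,1})."""
    n = X.shape[1]
    theta = n / 2 if theta is None else theta
    w = np.ones(n)
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            pred = 1 if w @ x_i > theta else 0
            if y_i == 1 and pred == 0:
                w[x_i == 1] *= alpha          # promotion
            elif y_i == 0 and pred == 1:
                w[x_i == 1] /= alpha          # demotion
    return w

# Target: x0 OR x1 over 20 Boolean attributes (a 2-disjunction)
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(50, 20))
y = (X[:, 0] | X[:, 1]).astype(int)
w = winnow(X, y)
print(np.round(w[:4], 2))   # weights on the relevant features grow; irrelevant ones tend to shrink
```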

• Generalization
– Since we deal with linear learning algorithms, we know (???) that they will all converge eventually to a perfect representation.
• All can represent the data
• So, how do we compare?
– How many examples are needed to get to a given level of accuracy?
– How many mistakes will an algorithm make before getting to a given level of accuracy?
– Efficiency: How long does it take to learn a hypothesis and evaluate it (per example)?
– Robustness (to noise);
– Adaptation to a new domain, ….
• One key issue here turns out to be the characteristics of the data

Sentence Representation

• Assume that our problem is context-sensitive spelling:
  S = I don't know weather [whether] to laugh or cry
• Define a set of features:
– features are relations that hold in the sentence
• Map a sentence to its feature-based representation
– The feature-based representation will give some of the information in the sentence
• Use this feature-based representation as an example for your algorithm

Sentence Representation

S = I don't know whether to laugh or cry

• Define a set of features:
– features are properties that hold in the sentence
• Conceptually, there are two steps in coming up with a feature-based representation:
– What are the information sources available?
• Sensors: words, order of words, properties (?) of words
– What features to construct based on these?
• (Why is this distinction needed? Think about the features you use in the NER problem in HW 2.)

Blow Up Feature Space

(Figure: the "Whether"/"Weather" examples mapped into a richer feature space. The original discriminator x1x2x3 ∨ x1x4x3 ∨ x3x2x5 becomes, over the new features, the functionally simpler discriminator y1 ∨ y4 ∨ y5.)

Domain Characteristics

• The number of potential features (dimensionality) is very large – The instance space is sparse • Decisions depend on a small set of features – the function space is sparse • Want to learn from a number of examples that is small relative to the dimensionality

• A comment on deep neural networks: – Have things changed with DNN? Not really; in many respects we can think about the DNNs as the feature extractors and learning of (relatively simple functions) is done over these feature extractors. We pay in training time, since it takes time to learn these feature extractors. We’ll discuss this more later.

Generalization

• Dominated by the sparseness of the function space – Most features are irrelevant

• # of examples required by multiplicative algorithms depends mostly on # of relevant features
– (Generalization bounds depend on the target: ||u||)
• # of examples required by additive algorithms depends heavily on sparseness of the feature space:
– Advantage to additive. Generalization bounds depend on the input: ||x||
• (Kivinen/Warmuth 95).
• Nevertheless, today most people use additive algorithms.

Which Algorithm to Choose?

• Generalization (in terms of # of mistakes made)
• Norms: the l1 norm ||x||₁ = Σ_i |x_i|; the l2 norm ||x||₂ = (Σ_i x_i²)^{1/2}; the l_p norm ||x||_p = (Σ_i |x_i|^p)^{1/p}; the l_∞ norm ||x||_∞ = max_i |x_i|
– Multiplicative algorithms:
• Bounds depend on ||u||₁, the separating hyperplane (i: example #)
• M_w = 2 ln(n) · ||u||₁² · max_i ||x⁽ⁱ⁾||_∞² / min_i (u · x⁽ⁱ⁾)²
• Do not care much about the data; advantage with a sparse target u
– Additive algorithms:
• Bounds depend on ||x|| (Kivinen / Warmuth, '95)
• M_p = ||u||₂² · max_i ||x⁽ⁱ⁾||₂² / min_i (u · x⁽ⁱ⁾)²
• Advantage with few active features per example

Examples

• Extreme Scenario 1: Assume u has exactly k active features, and the other n − k are 0. That is, only k input features are relevant to the prediction. Then:
– ||u||₂ = k^{1/2}; ||u||₁ = k; max ||x||₂ = n^{1/2}; max ||x||_∞ = 1
– We get: M_p = kn; M_w = 2k² ln n
– Therefore, if k ≪ n, Winnow behaves much better.
• Extreme Scenario 2: Now assume that u = (1, 1, …, 1) and the instances are very sparse – the rows of an n × n unit matrix. Then:
– ||u||₂ = n^{1/2}; ||u||₁ = n; max ||x||₂ = 1; max ||x||_∞ = 1
– We get: M_p = n; M_w = 2n² ln n
– Therefore, Perceptron has a better bound.

′ 𝑢𝑢 2 𝑘𝑘 𝑢𝑢 1 𝑘𝑘 𝑥𝑥 2 𝑛𝑛 𝑥𝑥 ∞ • We get that: = ; = 2 ln • Therefore, if << , Winnow behaves2 much better. 𝑀𝑀𝑝𝑝 𝑘𝑘𝑘𝑘 𝑀𝑀𝑤𝑤 𝑘𝑘 𝑛𝑛 • Extreme Scenario𝑘𝑘 2: Now𝑛𝑛 assume that = (1, 1, … .1) and the instances are very sparse, the rows of an × unit matrix. Then: • , = ; , = ; max , =𝑢𝑢1 ; max , = 1 𝑛𝑛 𝑛𝑛 𝑢𝑢 2 𝑛𝑛 𝑢𝑢 1 𝑛𝑛 𝑥𝑥 2 𝑥𝑥 ∞ • We get that: = ; = 2 ln • Therefore, Perceptron has a better2 bound. 𝑀𝑀𝑝𝑝 𝑛𝑛 𝑀𝑀𝑤𝑤 𝑛𝑛 𝑛𝑛 CIS 419/519 Fall’20 100 CIS 519 - Online Learning 2.ipynb Notice` two types of curves! Mistakes bounds for 10 of 100 of

𝑛𝑛 Function: At least 10 out of fixed 100 variables are active Perceptron,SVMs Dimensionality is

𝑛𝑛

Winnow # of mistakes to# of convergence mistakes

CIS 419/519 Fall’20 : Total # of Variables (Dimensionality) 101

𝑛𝑛 Summary

• Introduced multiple versions of on-line algorithms
• Most turned out to be Stochastic Gradient Algorithms
– For different loss functions
• Some turned out to be mistake driven
• We suggested generic improvements via regularization:
– Adding a term that forces a "simple hypothesis":
  J(w) = (1/m) Σ_{i=1..m} Q(z_i, w) + λ R(w)
  (a term that minimizes error on the training data, plus a term that forces a simple hypothesis)
– Regularization via the Averaged Trick
• "Stability" of a hypothesis is related to its ability to generalize
– An improved, adaptive learning rate (AdaGrad)
• Dependence on function space and the instance space properties.
• Now:
– A way to deal with non-linear target functions (Kernels)
– Beginning of Learning Theory.

Functions Can be Made Linear

• Data are not linearly separable in one dimension
• Not separable if you insist on using a specific class of functions
(Figure: data points on a one-dimensional x-axis.)

Blown Up Feature Space

• Data are separable in the <x, x²> space
(Figure: the same data plotted with x on one axis and x² on the other, where a line separates the two classes.)

Making data linearly separable

f(x) = 1 iff x1² + x2² ≤ 1
(Figure: points in the (x1, x2) plane; the positive region is the unit disk x1² + x2² ≤ 1.)

Making data linearly separable

• In order to deal with this, we introduce two new concepts:
– Dual Representation
– Kernel (& the kernel trick)
• The key question is: how to systematically transform the feature space?
• Transform the data: x = (x1, x2) → x′ = (x1², x2²)
  f(x′) = 1 iff x′1 + x′2 ≤ 1
(Figure: after the transformation, the disk becomes a half-plane in the (x1², x2²) space.)

Dual Representation

Examples x ∈ {0,1}ᴺ; learned hypothesis w ∈ Rᴺ
f(x) = sgn{wᵀx} = sgn{Σ_{i=1..n} w_i x_i}
Perceptron update: if ŷ ≠ y, update w = w + r y x

• Let w be an initial weight vector for Perceptron. Let (x¹, +), (x², +), (x³, −), (x⁴, −) be examples and assume mistakes are made on x¹, x² and x⁴.
• What is the resulting weight vector?

Are we recording? YES!

Administration (10/21/20) – available on the web site

• Remember that all the lectures are available on the website before the class – Go over it and be prepared – A new set of written notes will accompany most lectures, with some more details, examples and, (when relevant) some code. • HW 2: Due date extended to 10/22

• Quizzes: Quiz 5 Statistics

• Mid-term is on 10/28; at the class time. – Example mid-terms are available on the website – If you have a time zone problem – email me.

• Questions? Please ask/comment during class; give us feedback

Kernels

• Dealing with non-linearly-separable data using a linear classifier.
• Method: systematically enrich the feature space.
– How do we avoid the computation cost of working with a much larger feature space?
– By moving to the dual space:
– Instead of predicting by computing a dot product with a weight vector w, the prediction f(x) is done by computing Σ_i g(x_i) xᵀx_i – dot products of the target point x with other examples x_i (observed earlier).

Dual Representation

Examples x ∈ {0,1}ᴺ; learned hypothesis w ∈ Rᴺ
f(x) = sgn{wᵀx} = sgn{Σ_{i=1..n} w_i x_i}
Perceptron update: if ŷ ≠ y, update w = w + r y x

• Let w be an initial weight vector for Perceptron. Let (x¹, +), (x², +), (x³, −), (x⁴, −) be examples and assume mistakes are made on x¹, x² and x⁴.
• What is the resulting weight vector? w = w + x¹ + x² − x⁴ (here r = 1)
• In general, the weight vector w can be written as a linear combination of examples:
  w = Σ_{i=1..m} r α_i y_i x_i
• Where α_i is the number of mistakes made on x_i.
• (Note: we care about the dot product: f(x) = wᵀx = (Σ_{i=1..m} r α_i y_i x_i)ᵀ x = Σ_{i=1..m} r α_i y_i (x_iᵀx).)

Kernel Based Methods

f(x) = sgn{wᵀx} = sgn{Σ_{i=1..n} w_i x_i}

• Representing the model in the dual space will allow us to use Kernels.
• A method to run Perceptron on a very large feature set, without incurring the cost of keeping a very large weight vector.
• Computing the dot product can be done in the original feature space.
– Notice: this pertains only to efficiency. The classifier is identical to the one you get by blowing up the feature space.
– Generalization is still relative to the real dimensionality (or related properties).
• Kernels were popularized by SVMs, but many other algorithms can make use of them (i.e., run in the dual).
– Linear kernels: no kernels; stay in the original space. A lot of applications actually use linear kernels.

Kernel Based Methods

• (Recall the NER problem in HW 2: you added conjunctive features. We will do it systematically now, but we'll add all the conjunctive features, not only conjunctions of small size.)
• Let's transform the space to a more expressive one.
• Let I be the set t1, t2, t3, … of monomials (conjunctions) over the feature space x1, …, xn.
• Then we can write a linear function over this new feature space (new features indexed by I):
  f(x) = sgn{wᵀx} = sgn{Σ_I w_i t_i(x)}
• Example: x1x2x4(11010) = 1; x3x4(11010) = 0; x1x2(11010) = 1

Kernel Based Methods

Examples x ∈ {0,1}ᴺ; learned hypothesis w ∈ Rᴺ
f(x) = sgn{wᵀx} = sgn{Σ_I w_i t_i(x)}
Perceptron update: if ŷ ≠ y, update w = w + r y x

• Great increase in expressivity
• Can run Perceptron (and Winnow), but the convergence bound may suffer exponential growth.
• An exponential number of monomials are true in each example.
• Also, we will have to keep many weights.

Embedding

(Figure, repeated from earlier: the "Whether"/"Weather" examples embedded in a richer feature space; the discriminator x1x2x3 ∨ x1x4x3 ∨ x3x2x5 becomes the functionally simpler y1 ∨ y4 ∨ y5 over the new features.)

The Kernel Trick (1)

• Let Φ(x) be the representation of x in the new feature space. We already showed that:
  f(Φ(x)) = wᵀΦ(x) = (Σ_{i=1..m} r α_i y_i Φ(x_i))ᵀ Φ(x) = Σ_{i=1..m} r α_i y_i (Φ(x_i)ᵀΦ(x))
• We will now show it in a more explicit way.

Examples x ∈ {0,1}ᴺ; learned hypothesis w ∈ Rᴺ
f(x) = sgn{wᵀx} = sgn{Σ_I w_i t_i(x)}   (new features: the monomials t_i, indexed by I)
Perceptron update: if ŷ ≠ y, update w = w + r y x

(We called this a "dual representation".)

Example: Polynomial Kernel

• Prediction with respect to a separating hyperplane w (produced by Perceptron, SVM) can be computed, instead, as a function of dot products of the feature-based representations of (some of) the examples.
• We want to define a dot product in a high-dimensional space.
• Given two examples x = (x1, x2, …, xn) and y = (y1, y2, …, yn), we want to map them to a high-dimensional space [example: quadratic]:
  Φ(x1, x2, …, xn) = (1, x1, …, xn, x1², …, xn², x1x2, …, x_{n−1}x_n)
  Φ(y1, y2, …, yn) = (1, y1, …, yn, y1², …, yn², y1y2, …, y_{n−1}y_n)
  and compute the dot product A = Φ(x)ᵀΦ(y) [higher dimensionality; takes O(n²) time]
• Instead, in the original space, compute
  B = k(x, y) = [1 + (x1, x2, …, xn)ᵀ(y1, y2, …, yn)]²
• Theorem: A = B (coefficients do not really matter)
• Exercise: Compute A and B and prove the theorem.
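The exercise is easy to sanity-check numerically. Below is my own small sketch comparing the explicit quadratic feature map with the kernel computed in the original space; with √2 coefficients on the linear and cross terms (presumably what the slide's "Sq(2)" annotation refers to), the two dot products agree exactly, and a linear classifier can absorb such fixed coefficients into its weights – which is why they "do not really matter".

```python
import numpy as np
from itertools import combinations

def phi(x):
    """Quadratic feature map with sqrt(2) coefficients, so that phi(x).phi(y) = (1 + x.y)^2."""
    n = len(x)
    feats = [1.0]
    feats += [np.sqrt(2) * x[i] for i in range(n)]                                # linear terms
    feats += [x[i] ** 2 for i in range(n)]                                        # squares
    feats += [np.sqrt(2) * x[i] * x[j] for i, j in combinations(range(n), 2)]     # cross terms
    return np.array(feats)

x = np.array([1.0, 2.0, 0.5])
y = np.array([0.0, 1.0, 3.0])
A = phi(x) @ phi(y)            # dot product in the blown-up space
B = (1 + x @ y) ** 2           # kernel computed in the original space
print(A, B)                    # both print the same value (20.25)
```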

• Consider the value𝐼𝐼𝐼𝐼 𝑦𝑦of� 𝑦𝑦 𝑢𝑢used𝑢𝑢𝑢𝑢𝑢𝑢𝑢𝑢𝑢𝑢 in the𝒘𝒘 prediction.𝒘𝒘 𝑟𝑟𝑟𝑟 𝒙𝒙 • Each previous mistake on example , makes an additive contribution of +/ 1𝒘𝒘to some of the coordinates of . – Note: examples are Boolean, so only𝑧𝑧 coordinates of that correspond to ON terms in the example− ( ( ) = 1) are being updated.𝒘𝒘 • The value of is determined by the number and𝒘𝒘 type of mistakes. 𝑧𝑧 𝑡𝑡𝑖𝑖 𝑧𝑧 CIS 419/519 Fall’20 𝒘𝒘 121 The Kernel Trick(2)

• P – the set of examples on which we Promoted
• D – the set of examples on which we Demoted
• M = P ∪ D
  f(x) = sgn( Σ_I w_i t_i(x) ) = sgn{ Σ_I [ Σ_{z∈P, t_i(z)=1} 1 − Σ_{z∈D, t_i(z)=1} 1 ] t_i(x) }
       = sgn{ Σ_I [ Σ_{z∈M} S(z) t_i(z) ] t_i(x) }

The Kernel Trick (3)

• Where S(z) = 1 if z ∈ P and S(z) = −1 if z ∈ D.
• Reordering: f(x) = sgn{ Σ_{z∈M} S(z) Σ_I t_i(z) t_i(x) }

The Kernel Trick (4)

$f(\mathbf{x}) = \mathrm{sgn}(\sum_{i\in I} w_i t_i(\mathbf{x}))$

• $S(y) = 1$ if $y \in P$ and $S(y) = -1$ if $y \in D$:
  $f(\mathbf{x}) = \mathrm{sgn}(\sum_{z\in M} S(z) \sum_{i\in I} t_i(z)\, t_i(\mathbf{x}))$

• A mistake on $z$ contributes the value $\pm 1$ to all the monomials satisfied by $z$. The total contribution of $z$ to the sum is therefore equal to the number of monomials that are satisfied by both $\mathbf{x}$ and $z$.
• Define a dot product in the t-space:
  $K(\mathbf{x}, z) = \sum_{i\in I} t_i(z)\, t_i(\mathbf{x})$

• We get the standard notation:
  $f(\mathbf{x}) = \mathrm{sgn}(\sum_{z\in M} S(z)\, K(\mathbf{x}, z))$

CIS 419/519 Fall’20 124

Kernel Based Methods

$f(\mathbf{x}) = \mathrm{sgn}(\sum_{z\in M} S(z)\, K(\mathbf{x}, z))$

• What does this representation give us?

$K(\mathbf{x}, z) = \sum_{i\in I} t_i(z)\, t_i(\mathbf{x})$

• We can view this kernel as the distance between $\mathbf{x}$ and $z$ in the t-space.
• So far, all we did is algebra; as written above, $I$ is huge.
• But $K(\mathbf{x}, z)$ can be computed in the original space, without explicitly writing down the t-representation of $\mathbf{x}$ and $z$.

CIS 419/519 Fall’20 125

Kernel Trick

$f(\mathbf{x}) = \mathrm{sgn}(\sum_{z\in M} S(z)\, K(\mathbf{x}, z))$,   $K(\mathbf{x}, z) = \sum_{i\in I} t_i(z)\, t_i(\mathbf{x})$

• Consider the space of all $3^n$ monomials (allowing both positive and negative literals). Then:
• Claim:  $K(\mathbf{x}, z) = \sum_{i\in I} t_i(z)\, t_i(\mathbf{x}) = 2^{\mathrm{same}(\mathbf{x}, z)}$
  – where $\mathrm{same}(\mathbf{x}, z)$ is the number of features that have the same value for both $\mathbf{x}$ and $z$.
• We get:  $f(\mathbf{x}) = \mathrm{sgn}(\sum_{z\in M} S(z)\, 2^{\mathrm{same}(\mathbf{x}, z)})$
• Example: take $n = 3$; $\mathbf{x} = (001)$, $z = (011)$; the features are monomials of size 0, 1, 2, 3.
  – Here $\bar{x}_1(001) = \bar{x}_1(011) = 1$, $x_3(001) = x_3(011) = 1$, and $\bar{x}_1 x_3(001) = \bar{x}_1 x_3(011) = 1$; if any other variable appears in the monomial, its evaluation on $\mathbf{x}$ and $z$ will differ.
• Proof: let $k = \mathrm{same}(\mathbf{x}, z)$; construct a "surviving" monomial by (1) choosing to include one of these $k$ literals, with the right polarity, in the monomial, or (2) choosing not to include it at all. Monomials with literals outside this set disappear.

CIS 419/519 Fall’20 126

Example

$f(\mathbf{x}) = \mathrm{sgn}(\sum_{z\in M} S(z)\, K(\mathbf{x}, z))$,   $K(\mathbf{x}, z) = \sum_{i\in I} t_i(z)\, t_i(\mathbf{x})$

• Take $X = \{x_1, x_2, x_3, x_4\}$.
• $I$ = the space of all $3^n$ monomials; $|I| = 81$.
• Consider $\mathbf{x} = (1100)$, $z = (1101)$.
• Write down $I(\mathbf{x})$ and $I(z)$, the representations of $\mathbf{x}$ and $z$ in the $I$-space.
• Compute $I(\mathbf{x}) \cdot I(z)$.
• Show that
  – $K(\mathbf{x}, z) = I(\mathbf{x}) \cdot I(z) = \sum_i t_i(z)\, t_i(\mathbf{x}) = 2^{\mathrm{same}(\mathbf{x}, z)} = 8$
• Try to develop another kernel, e.g., one where $I$ is the space of all conjunctions of size 3 (exactly).
(A brute-force check of this exercise is sketched below.)

CIS 419/519 Fall’20 127
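A brute-force check of the exercise, added as a sketch; the helper names `monomial_value` and `monomial_kernel` are mine, and each monomial is encoded as a vector of roles (absent, positive literal, negated literal).

```python
from itertools import product

def monomial_value(mono, x):
    # mono[i] in {0, 1, 2}: 0 = variable i absent, 1 = literal x_i, 2 = literal not-x_i.
    for i, role in enumerate(mono):
        if role == 1 and x[i] != 1:
            return 0
        if role == 2 and x[i] != 0:
            return 0
    return 1

def monomial_kernel(x, z):
    # K(x, z) = sum over all 3^n monomials t of t(x) * t(z).
    return sum(monomial_value(m, x) * monomial_value(m, z)
               for m in product((0, 1, 2), repeat=len(x)))

x, z = (1, 1, 0, 0), (1, 1, 0, 1)
same = sum(xi == zi for xi, zi in zip(x, z))   # 3 coordinates agree
print(3 ** len(x))                             # |I| = 81 monomials
print(monomial_kernel(x, z), 2 ** same)        # both are 8, as claimed
```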

Implementation: Dual Perceptron

$f(\mathbf{x}) = \mathrm{sgn}(\sum_{z\in M} S(z)\, K(\mathbf{x}, z))$,   $K(\mathbf{x}, z) = \sum_{i\in I} t_i(z)\, t_i(\mathbf{x})$

• Simply run Perceptron in an on-line mode, but keep track of the set $M$.
• Keeping the set $M$ allows us to keep track of $S(z)$.
• Rather than remembering the weight vector $\mathbf{w}$, remember the set $M$ ($P$ and $D$) – all those examples on which we made mistakes.
• Dual representation (a sketch in code follows).

CIS 419/519 Fall’20 128
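A minimal sketch of the dual (kernelized) Perceptron, assuming a kernel function is supplied. Instead of a weight vector it stores a per-example mistake count `alpha[j]`; the product `alpha[j] * y[j]` plays the role of the signed mistake record $S(z)$ (with multiplicity). Function and variable names are illustrative, not from the slides.

```python
import numpy as np

def kernel_perceptron(X, y, kernel, max_epochs=100):
    # Dual Perceptron: remember the examples we made mistakes on (via alpha),
    # not the weight vector in the blown-up feature space.
    m = len(X)
    alpha = np.zeros(m)
    K = np.array([[kernel(a, b) for b in X] for a in X])   # Gram matrix
    for _ in range(max_epochs):
        mistakes = 0
        for j in range(m):
            # The prediction only needs kernel values against past mistakes.
            score = float(np.sum(alpha * y * K[:, j]))
            pred = 1.0 if score >= 0 else -1.0
            if pred != y[j]:
                alpha[j] += 1.0        # record a mistake on example j
                mistakes += 1
        if mistakes == 0:              # no mistakes in a full pass: stop
            break
    return lambda x: 1.0 if sum(a * yi * kernel(xi, x)
                                for a, yi, xi in zip(alpha, y, X)) >= 0 else -1.0

# Usage: XOR-like data, separable with a quadratic kernel but not with a linear one.
quad = lambda a, b: (1.0 + float(np.dot(a, b))) ** 2
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1., 1., 1., -1.])
f = kernel_perceptron(X, y, quad)
print([f(x) for x in X])               # recovers the labels [-1, 1, 1, -1]
```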

Example: Polynomial Kernel
(We called this a "dual" representation.)

$f(\mathbf{x}) = \mathrm{sgn}(\sum_{z\in M} S(z)\, K(\mathbf{x}, z))$

• Prediction with respect to a separating hyperplane $\mathbf{w}$ (produced by Perceptron, SVM) can be computed, instead, as a function of dot products of the feature-based representation of (some of) the examples.
• As before: map $\mathbf{x}$ and $\mathbf{y}$ to the quadratic feature space and compute $A = \Phi(\mathbf{x})^T \Phi(\mathbf{y})$ [takes time $O(n^2)$]; or, in the original space, compute $B = k(\mathbf{x}, \mathbf{y}) = [1 + \mathbf{x}^T \mathbf{y}]^2$.
• Theorem: $A = B$  (the coefficients do not really matter).

CIS 419/519 Fall’20 129

Kernels – General Conditions
Definition: when is $K(\mathbf{x}, \mathbf{z})$ a kernel?

• Kernel trick: you want to work with degree-2 polynomial features, $\Phi(\mathbf{x})$. Then your dot product lives in a space of dimensionality $n(n+1)/2$. The kernel trick allows you to save time and compute the dot products in an $n$-dimensional space.

$f(\mathbf{x}) = \mathrm{sgn}(\sum_{z\in M} S(z)\, K(\mathbf{x}, z))$,   $K(\mathbf{x}, z) = \sum_{i\in I} t_i(z)\, t_i(\mathbf{x})$

• Can we use any $K(\cdot, \cdot)$?
  – A function $K(\mathbf{x}, \mathbf{z})$ is a valid kernel if it corresponds to a dot (an inner) product in some (perhaps infinite-dimensional) feature space.
• Take the quadratic function: $k(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^T \mathbf{z})^2$. Is it a kernel?
• Example: direct construction (2-dimensional, for simplicity):
  – $K(\mathbf{x}, \mathbf{z}) = (x_1 z_1 + x_2 z_2)^2 = x_1^2 z_1^2 + 2 x_1 z_1 x_2 z_2 + x_2^2 z_2^2$
    $= (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)^T (z_1^2, \sqrt{2}\, z_1 z_2, z_2^2) = \Phi(\mathbf{x})^T \Phi(\mathbf{z})$
  – A dot product in an expanded space; consequently, a kernel. (We proved that $K$ is a valid kernel by explicitly showing that it corresponds to a dot product.)
• Comment: in general, it is not necessary to explicitly exhibit the feature function $\Phi$.
• General condition: construct the kernel matrix $\{k(\mathbf{x}_i, \mathbf{x}_j)\}$ (indices range over the examples) and check that it is positive semi-definite (see the sketch below).

CIS 419/519 Fall’20 130
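The "general condition" can be checked numerically. The sketch below (helper names are mine, not from the slides) builds the Gram matrix of a candidate function on a random sample and tests whether all eigenvalues are non-negative.

```python
import numpy as np

def gram_is_psd(kernel, points, tol=1e-9):
    # Build the kernel (Gram) matrix on the sample and check positive semi-definiteness.
    K = np.array([[kernel(a, b) for b in points] for a in points])
    eigvals = np.linalg.eigvalsh((K + K.T) / 2)   # symmetrize against round-off
    return bool(np.all(eigvals >= -tol))

pts = np.random.default_rng(1).standard_normal((20, 3))

quadratic = lambda a, b: float(np.dot(a, b)) ** 2     # (x^T z)^2: a valid kernel
shifted   = lambda a, b: float(np.dot(a, b)) - 1.0    # not a kernel: k(x, x) can be negative

print(gram_is_psd(quadratic, pts))   # True
print(gram_is_psd(shifted, pts))     # False on such a sample
```

Passing the test on one sample does not prove a function is a kernel, but failing it on any sample disproves it.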

Polynomial kernels

• Linear kernel:  $k(\mathbf{x}, \mathbf{z}) = \mathbf{x} \cdot \mathbf{z}$
• Polynomial kernel of degree $d$:  $k(\mathbf{x}, \mathbf{z}) = (\mathbf{x} \cdot \mathbf{z})^d$   (only $d$-th-order interactions)
• Polynomial kernel up to degree $d$:  $k(\mathbf{x}, \mathbf{z}) = (\mathbf{x} \cdot \mathbf{z} + c)^d$, with $c > 0$   (all interactions of order $d$ or lower)
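In code, these three kernels are one-liners (a sketch; the names are mine):

```python
import numpy as np

linear     = lambda x, z: float(np.dot(x, z))                          # k(x, z) = x . z
poly_exact = lambda x, z, d=3: float(np.dot(x, z)) ** d                # only d-th-order interactions
poly_upto  = lambda x, z, d=3, c=1.0: (float(np.dot(x, z)) + c) ** d   # all orders up to d (c > 0)
```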

CIS 419/519 Fall’20 135 Constructing New Kernels

• You can construct new kernels $k'(\mathbf{x}, \mathbf{x}')$ from existing ones:
  – Multiplying $k(\mathbf{x}, \mathbf{x}')$ by a constant $c$:  $k'(\mathbf{x}, \mathbf{x}') = c\, k(\mathbf{x}, \mathbf{x}')$
  – Multiplying $k(\mathbf{x}, \mathbf{x}')$ by a function $f$ applied to $\mathbf{x}$ and $\mathbf{x}'$:  $k'(\mathbf{x}, \mathbf{x}') = f(\mathbf{x})\, k(\mathbf{x}, \mathbf{x}')\, f(\mathbf{x}')$
  – Applying a polynomial (with non-negative coefficients) to $k(\mathbf{x}, \mathbf{x}')$:  $k'(\mathbf{x}, \mathbf{x}') = P(k(\mathbf{x}, \mathbf{x}'))$, with $P(z) = \sum_i a_i z^i$ and $a_i \geq 0$
  – Exponentiating $k(\mathbf{x}, \mathbf{x}')$:  $k'(\mathbf{x}, \mathbf{x}') = \exp(k(\mathbf{x}, \mathbf{x}'))$

CIS 419/519 Fall’20 136 Constructing New Kernels (2)

• You can construct $k'(\mathbf{x}, \mathbf{x}')$ from $k_1(\mathbf{x}, \mathbf{x}')$ and $k_2(\mathbf{x}, \mathbf{x}')$ by:
  – Adding them:  $k'(\mathbf{x}, \mathbf{x}') = k_1(\mathbf{x}, \mathbf{x}') + k_2(\mathbf{x}, \mathbf{x}')$
  – Multiplying them:  $k'(\mathbf{x}, \mathbf{x}') = k_1(\mathbf{x}, \mathbf{x}')\, k_2(\mathbf{x}, \mathbf{x}')$
• Also:
  – If $\phi(\mathbf{x}) \in \mathbf{R}^m$ and $k_m(\mathbf{z}, \mathbf{z}')$ is a valid kernel in $\mathbf{R}^m$, then $k(\mathbf{x}, \mathbf{x}') = k_m(\phi(\mathbf{x}), \phi(\mathbf{x}'))$ is also a valid kernel.
  – If $A$ is a symmetric positive semi-definite matrix, $k(\mathbf{x}, \mathbf{x}') = \mathbf{x} A \mathbf{x}'$ is also a valid kernel.
• In all cases, it is easy to prove these directly by construction (a quick numerical sanity check follows).
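A quick numerical sanity check of the first two closure rules, added as a sketch (nothing here is from the slides): sums and element-wise products of kernel Gram matrices should remain positive semi-definite.

```python
import numpy as np

X = np.random.default_rng(3).standard_normal((15, 4))
G = X @ X.T                         # Gram matrix of the linear kernel

K1 = G ** 2                         # Gram matrix of the quadratic kernel (x . z)^2
K2 = np.exp(G)                      # Gram matrix of exp(x . z)

min_eig = lambda K: float(np.linalg.eigvalsh((K + K.T) / 2).min())

# Both should be >= 0 up to floating-point error, so k1 + k2 and k1 * k2 are kernels too.
print(min_eig(K1 + K2), min_eig(K1 * K2))
```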

CIS 419/519 Fall’20 137 Gaussian Kernel (Radial Basis Function kernel)

• $k(\mathbf{x}, \mathbf{z}) = \exp(-\|\mathbf{x} - \mathbf{z}\|^2 / c)$
  – $\|\mathbf{x} - \mathbf{z}\|^2$: the squared Euclidean distance between $\mathbf{x}$ and $\mathbf{z}$
  – $c = 2\sigma^2$: a free parameter
  – very small $c$: $K \approx$ the identity matrix (every item looks different from every other item)
  – very large $c$: $K \approx$ the all-ones matrix (all items look the same)
  – $k(\mathbf{x}, \mathbf{z}) \approx 1$ when $\mathbf{x}$, $\mathbf{z}$ are close
  – $k(\mathbf{x}, \mathbf{z}) \approx 0$ when $\mathbf{x}$, $\mathbf{z}$ are dissimilar
  (both extremes are illustrated in the sketch below)
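A short sketch (variable names are mine) that builds the Gaussian kernel matrix and shows the two extremes of the free parameter $c$:

```python
import numpy as np

def rbf_gram(X, c):
    # K[i, j] = exp(-||x_i - x_j||^2 / c)
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / c)

X = np.random.default_rng(2).standard_normal((5, 3))
print(np.round(rbf_gram(X, c=1e-3), 2))   # close to the identity matrix: every item looks different
print(np.round(rbf_gram(X, c=1e3), 2))    # close to the all-ones matrix: all items look the same
```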

CIS 419/519 Fall’20 138 Gaussian Kernel

• $k(\mathbf{x}, \mathbf{z}) = \exp(-\|\mathbf{x} - \mathbf{z}\|^2 / 2\sigma^2)$
  – Is this a kernel?
• $k(\mathbf{x}, \mathbf{z}) = \exp(-\|\mathbf{x} - \mathbf{z}\|^2 / 2\sigma^2)$
  $= \exp(-(\mathbf{x}\cdot\mathbf{x} + \mathbf{z}\cdot\mathbf{z} - 2\,\mathbf{x}\cdot\mathbf{z}) / 2\sigma^2)$
  $= \exp(-\mathbf{x}\cdot\mathbf{x}/2\sigma^2)\, \exp(\mathbf{x}\cdot\mathbf{z}/\sigma^2)\, \exp(-\mathbf{z}\cdot\mathbf{z}/2\sigma^2)$
  $= f(\mathbf{x})\, \exp(\mathbf{x}\cdot\mathbf{z}/\sigma^2)\, f(\mathbf{z})$
• $\exp(\mathbf{x}\cdot\mathbf{z}/\sigma^2)$ is a valid kernel:
  – $\mathbf{x}\cdot\mathbf{z}$ is the linear kernel;
  – we can multiply kernels by constants ($1/\sigma^2$);
  – we can exponentiate kernels.
• Unlike the discrete kernels discussed earlier, here you cannot easily blow up the feature space explicitly to get an identical representation. (The factorization above is verified numerically below.)
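The factorization is easy to check numerically; this sketch uses arbitrary numbers of my choosing:

```python
import numpy as np

sigma2 = 0.7
x, z = np.array([1.0, -0.5]), np.array([0.3, 2.0])

lhs = np.exp(-np.sum((x - z) ** 2) / (2 * sigma2))      # the Gaussian kernel
f   = lambda u: np.exp(-np.dot(u, u) / (2 * sigma2))    # the f(.) factor from the derivation
rhs = f(x) * np.exp(np.dot(x, z) / sigma2) * f(z)
print(lhs, rhs)   # the two values agree: k(x, z) = f(x) * exp(x . z / sigma^2) * f(z)
```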

CIS 419/519 Fall’20 139 Summary – Kernel Based Methods

$f(\mathbf{x}) = \mathrm{sgn}(\sum_{z\in M} S(z)\, K(\mathbf{x}, z))$

• A method to run Perceptron on a very large feature set, without incurring the cost of keeping a very large weight vector.
• Computing the weight vector can be done in the original feature space.
• Notice: this pertains only to efficiency; the classifier is identical to the one you would get by blowing up the feature space.
• Generalization is still relative to the real dimensionality (or related properties).
• Kernels were popularized by SVMs, but they apply to a range of models: Perceptron, Gaussian models, PCA, etc.

CIS 419/519 Fall’20 141 Explicit & Implicit Kernels: Complexity

• Is it always worthwhile to define kernels and work in the dual space?
• Computationally:
  – $m$: the number of examples
  – $t_1$: the dimensionality of the original space, and therefore the cost of a dot product in the dual space
  – $t_2$: the dimensionality of the primal space (the blown-up feature space: $O(n^2)$, or $3^n$, in the earlier examples)
  – (That is, $t_1$ and $t_2$ are the costs of a dot product in the respective spaces.)
  – Then the computational cost is: dual space – $t_1 m^2$, vs. primal space – $t_2 m$ (think about it).
  – Typically $t_1 \ll t_2$, so it boils down to the number of examples one needs to consider relative to the growth in dimensionality.
• Rule of thumb (a small back-of-the-envelope comparison follows):
  – A lot of examples: use the primal space.
  – High dimensionality and not so many examples: use the dual space.
• Most applications today: people use explicit kernels; that is, they blow up the feature space explicitly.
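A back-of-the-envelope comparison for the quadratic example, where $t_1 = n$ and $t_2 = O(n^2)$; the numbers below are illustrative only:

```python
n = 100                          # original dimensionality
t1 = n                           # cost of a dot product in the dual (original) space
t2 = n * (n + 1) // 2            # cost of a dot product in the primal (blown-up) space

for m in (10, 100, 10_000):      # number of examples
    dual, primal = t1 * m * m, t2 * m
    print(f"m={m:>6}  dual ~ {dual:>12,}  primal ~ {primal:>12,}")
# With few examples the dual space is cheaper; once m grows past roughly t2 / t1,
# the primal space wins, matching the rule of thumb above.
```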

CIS 419/519 Fall’20 143 Kernels: Generalization

• Do we want to use the most expressive kernels we can?
  – (e.g., when you want to add quadratic terms, do you really want to add all of them?)
• No; this is equivalent to working in a larger feature space, and will lead to overfitting.
• It is possible to give simple arguments showing that simply adding irrelevant features does not help.

CIS 419/519 Fall’20 144 Conclusion – Kernels

• The use of kernels to learn in the dual space is an important idea.
  – Different kernels may expand or restrict the hypothesis space in useful ways.
  – You need to know the benefits and the hazards.
• To justify these methods we must embed in a space much larger than the training set size.
  – This can affect generalization.
• Expressive structures in the input data can give rise to specific kernels, designed to exploit those structures.
  – E.g., people have developed kernels over parse trees, corresponding to features that are sub-trees.
  – It is always possible to trade these for explicitly generated features, but kernels might help one's thinking about appropriate features.

CIS 419/519 Fall’20 146 CIS 419/519 Fall’20 147 Functions Can be Made Linear

• The data are not linearly separable in one dimension.
• Not separable if you insist on using a specific class of functions.

[Figure: one-dimensional data plotted along the $x$ axis.]

CIS 419/519 Fall’20 148 Blown Up Feature Space

• The data are separable in the $\langle x, x^2 \rangle$ space (a small illustration in code follows).

[Figure: the same data re-plotted in the $(x, x^2)$ plane, where a line separates the two classes.]
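A tiny illustration with toy numbers of my choosing: four points on a line that no threshold on $x$ separates become linearly separable once each point is mapped to $(x, x^2)$.

```python
import numpy as np

x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([ 1.0, -1.0, -1.0, 1.0])   # +, -, -, +: no single threshold on x works

phi = np.stack([x, x ** 2], axis=1)     # blow the data up into the <x, x^2> space
pred = np.sign(phi[:, 1] - 2.5)         # the line x^2 = 2.5 now separates the classes
print(pred.tolist())                    # [1.0, -1.0, -1.0, 1.0] == y
```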

CIS 419/519 Fall’20 149

End

CIS 419/519 Fall’20 150 Multi-Layer Neural Network

• Multi-layer networks were designed to overcome the computational (expressivity) limitation of a single threshold element.
• The idea is to stack several layers of threshold elements, each layer using the output of the previous layer as input.
• Multi-layer networks can represent arbitrary functions, but building effective learning methods for such networks was [thought to be] difficult.

[Figure: a feed-forward network with Input, Hidden, and Output (activation) layers.]

CIS 419/519 Fall’20 151 Basic Units

• Linear units: $o_j = \mathbf{w} \cdot \mathbf{x}$. Multiple layers of linear functions still produce linear functions, and we want to represent nonlinear functions.
  – We need to do it in a way that facilitates learning.
• Threshold units: $o_j = \mathrm{sgn}(\mathbf{w} \cdot \mathbf{x})$ are not differentiable, and hence unsuitable for gradient descent.
• The key idea was to notice that the discontinuity of the threshold element can be represented by a smooth non-linear approximation:
  $o_j = [1 + \exp(-\mathbf{w} \cdot \mathbf{x})]^{-1}$
  (Rumelhart, Hinton, Williams, 1986), (Linnainmaa, 1970); see: http://people.idsia.ch/~juergen/who-invented-.html

[Figure: a network with Input, Hidden, and Output (activation) layers, with weights $w^1_{ij}$ and $w^2_{ij}$ between consecutive layers.]

CIS 419/519 Fall’20 152

Model Neuron (Logistic)

• Use a non-linear, differentiable output function, such as the sigmoid (logistic) function.

[Figure: a single unit $j$ with inputs $x_1, \dots, x_7$, weights $w_{17}, \dots, w_{67}$, a summation node, a threshold $T_j$, and output $o_j$.]

• The net input to a unit is defined as:  $net_j = \sum_i w_{ij}\, x_i$
• The output of a unit is defined as:  $o_j = \dfrac{1}{1 + e^{-(net_j - T_j)}}$
  (a small sketch in code follows)

CIS 419/519 Fall’20 153
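A direct transcription of these two formulas, added as a sketch; the input and weight values are arbitrary:

```python
import numpy as np

def logistic_unit(x, w, T):
    # net_j = sum_i w_ij * x_i ;  o_j = 1 / (1 + exp(-(net_j - T_j)))
    net = float(np.dot(w, x))
    return 1.0 / (1.0 + np.exp(-(net - T)))

x = np.array([1.0, 0.0, 1.0])       # inputs to unit j
w = np.array([0.5, -0.3, 0.8])      # weights w_ij
print(logistic_unit(x, w, T=0.2))   # a smooth, differentiable output in (0, 1)
```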