
Lecture 6. Notes on Linear Algebra. Perceptron
COMP90051 Statistical Machine Learning, Semester 2, 2017
Lecturer: Andrey Kan
Copyright: University of Melbourne

This lecture
• Notes on linear algebra
  ∗ Vectors and dot products
  ∗ Hyperplanes and vector normals
• Perceptron
  ∗ Introduction to artificial neural networks
  ∗ The perceptron model
  ∗ Stochastic gradient descent

Notes on Linear Algebra
Link between the geometric and algebraic interpretations of ML methods

What are vectors?
Suppose $u = [u_1, u_2]'$. What does $u$ really represent?
• An ordered set of numbers $\{u_1, u_2\}$
• Cartesian coordinates of a point
• A direction
[art: OpenClipartVectors at pixabay.com (CC0)]

Dot product: algebraic definition
• Given two $m$-dimensional vectors $u$ and $v$, their dot product is $u \cdot v \equiv u'v \equiv \sum_{i=1}^{m} u_i v_i$
  ∗ E.g., a weighted sum of terms is a dot product $x'w$
• Verify that if $k$ is a constant and $a$, $b$, $c$ are vectors of the same size, then $(ka)'b = k(a'b) = a'(kb)$ and $a'(b + c) = a'b + a'c$

Dot product: geometric definition
• Given two $m$-dimensional vectors $u$ and $v$, their dot product is $u \cdot v \equiv u'v \equiv \|u\|\|v\|\cos\theta$
  ∗ Here $\|u\|$ and $\|v\|$ are the L2 norms (i.e., Euclidean lengths) of vectors $u$ and $v$
  ∗ $\theta$ is the angle between the vectors
• The scalar projection of $u$ onto $v$ is given by $u_v = \|u\|\cos\theta$. Thus the dot product is $u'v = u_v\|v\| = v_u\|u\|$

Equivalence of definitions
• Lemma: the algebraic and geometric definitions are identical
• Proof sketch:
  ∗ Express the vectors using the standard vector basis $e_1, \dots, e_m$ in $\mathbb{R}^m$: $u = \sum_{i=1}^{m} u_i e_i$ and $v = \sum_{i=1}^{m} v_i e_i$
  ∗ Vectors $e_1, \dots, e_m$ form an orthonormal basis: they have unit length and are orthogonal to each other, so by the geometric definition $e_i'e_i = 1$ and $e_i'e_j = 0$ for $i \neq j$
  ∗ Hence $u'v = \left(\sum_{i=1}^{m} u_i e_i\right)'\left(\sum_{j=1}^{m} v_j e_j\right) = \sum_{i=1}^{m}\sum_{j=1}^{m} u_i v_j\, e_i'e_j = \sum_{i=1}^{m} u_i v_i$

Geometric properties of the dot product
• If the two vectors are orthogonal, then $u'v = 0$
• If the two vectors are parallel, then $u'v = \|u\|\|v\|$; if they are anti-parallel, then $u'v = -\|u\|\|v\|$
• $u'u = \|u\|^2$, so $\|u\| = \sqrt{u_1^2 + \dots + u_m^2}$ defines the Euclidean vector length

Hyperplanes and normal vectors
• A hyperplane defined by parameters $w$ and $b$ is the set of points $x$ that satisfy $x'w + b = 0$
• In 2D, a hyperplane is a line: a line is the set of points that satisfy $w_1 x_1 + w_2 x_2 + b = 0$
• A normal vector for a hyperplane is a vector perpendicular to that hyperplane

Hyperplanes and normal vectors
• Consider a hyperplane defined by parameters $w$ and $b$. Note that $w$ is itself a vector
• Lemma: vector $w$ is a normal vector to the hyperplane
• Proof sketch:
  ∗ Choose any two points $u$ and $v$ on the hyperplane. Note that the vector $u - v$ lies on the hyperplane
  ∗ Consider the dot product $(u - v)'w = u'w - v'w = (u'w + b) - (v'w + b) = 0$
  ∗ Thus $u - v$ lies on the hyperplane but is perpendicular to $w$, and so $w$ is a normal vector

Example in 2D
• Consider a line defined by $w_1$, $w_2$ and $b$
• Vector $w = [w_1, w_2]'$ is a normal vector
• The line can be written as $x_2 = -\frac{w_1}{w_2}x_1 - \frac{b}{w_2}$
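The identities above are easy to check numerically. Below is a minimal sketch (not part of the original slides; the vectors and the parameters $w$, $b$ are arbitrary values chosen for illustration) comparing the algebraic and geometric dot-product definitions, and verifying that $w$ is perpendicular to any vector lying on the hyperplane $x'w + b = 0$.

```python
import numpy as np

# Two example vectors in R^2 (arbitrary values chosen for illustration)
u = np.array([3.0, 1.0])
v = np.array([1.0, 2.0])

# Algebraic definition: sum of element-wise products
algebraic = np.sum(u * v)            # same as u @ v

# Geometric definition: ||u|| ||v|| cos(theta)
cos_theta = (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
geometric = np.linalg.norm(u) * np.linalg.norm(v) * cos_theta

print(algebraic, geometric)          # both give 5.0

# Hyperplane x'w + b = 0 in 2D, with w as its normal vector
w, b = np.array([1.0, 2.0]), -4.0

def point_on_line(x1):
    """Return a point on the line x2 = -(w1/w2) x1 - b/w2."""
    return np.array([x1, -(w[0] / w[1]) * x1 - b / w[1]])

p, q = point_on_line(0.0), point_on_line(3.0)
print(np.isclose((p - q) @ w, 0.0))  # True: w is perpendicular to the line
```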
Is logistic regression a linear method?
[figure: the logistic function, mapping reals (horizontal axis, -10 to 10) to probabilities (vertical axis, 0.0 to 1.0)]

Logistic regression is a linear classifier
• Logistic regression model: $P(Y = 1|x) = \frac{1}{1 + \exp(-x'w)}$
• Classification rule: if $P(Y = 1|x) > \frac{1}{2}$ then class "1", else class "0"
• Decision boundary: $\frac{1}{1 + \exp(-x'w)} = \frac{1}{2}$, i.e., $\exp(-x'w) = 1$, i.e., $x'w = 0$

The Perceptron Model
A building block for artificial neural networks, yet another linear classifier

Biological inspiration
• Humans perform well at many tasks that matter
• Originally, neural networks were an attempt to mimic the human brain
[photo: Alvesgaspar, Wikimedia Commons, CC3]

Artificial neural network
• As a crude approximation, the human brain can be thought of as a mesh of interconnected processing nodes (neurons) that relay electrical signals
• An artificial neural network is a network of such processing elements
• Each element converts inputs $x_1, \dots, x_m$ to an output
• The output is a function (called the activation function) of a weighted sum of the inputs

Outline
• In order to use an ANN we need (a) to design the network topology and (b) to adjust the weights to the given data
  ∗ In this course, we will exclusively focus on task (b) for a particular class of networks called feed-forward networks
• Training an ANN means adjusting weights for training data given a pre-defined network topology
• We will come back to ANNs and discuss ANN training in the next lecture
• Right now we will turn our attention to an individual network element, because it is an interesting model in itself

Perceptron model
[figure: inputs $x_1$, $x_2$ and a constant 1 are multiplied by weights $w_1$, $w_2$, $w_0$, summed by $\Sigma$ into $s$, and passed through the activation $f(s)$]
• $x_1$, $x_2$ – inputs
• $w_1$, $w_2$ – synaptic weights
• $w_0$ – bias weight
• $f$ – activation function
• Compare this model to logistic regression

Perceptron is a linear binary classifier
• Perceptron is a binary classifier:
  ∗ Predict class A if $s \geq 0$
  ∗ Predict class B if $s < 0$
  ∗ where $s = \sum_{i=0}^{m} x_i w_i$
• Perceptron is a linear classifier: $s$ is a linear function of the inputs, and the decision boundary is linear
[figure: example for 2D data, showing the plane of $s$ values and the linear decision boundary through the data points]

Exercise: find weights of a perceptron capable of perfect classification of the following dataset
  $x_1$  $x_2$  class
   0      0     B
   0      1     B
   1      0     B
   1      1     A
[art: OpenClipartVectors at pixabay.com (CC0)]

Loss function for perceptron
• Recall that "training" means finding weights that minimise some loss. Therefore, we proceed by considering the loss function for the perceptron
• Our task is binary classification. Let's arbitrarily encode one class as $+1$ and the other as $-1$. So each training example is now $\{x, y\}$, where $y$ is either $+1$ or $-1$
• Recall that, in a perceptron, $s = \sum_{i=0}^{m} x_i w_i$, and the sign of $s$ determines the predicted class: $+1$ if $s > 0$, and $-1$ if $s < 0$
• Consider a single training example. If $y$ and $s$ have the same sign, then the example is classified correctly. If $y$ and $s$ have different signs, the example is misclassified
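As a quick check on the prediction rule $s = \sum_{i=0}^{m} x_i w_i$ and on the exercise above, here is a minimal sketch (not from the slides). The weight values are an assumed choice, one of many possible answers that classify the exercise dataset perfectly.

```python
import numpy as np

def perceptron_predict(x, w, w0):
    """Return 'A' if s >= 0 and 'B' if s < 0, where s = x'w + w0."""
    s = x @ w + w0
    return "A" if s >= 0 else "B"

# Exercise dataset: class A only for x = (1, 1)
data = [((0, 0), "B"), ((0, 1), "B"), ((1, 0), "B"), ((1, 1), "A")]

# One possible choice of weights (assumed, not given in the slides):
# s = x1 + x2 - 1.5 is negative unless both inputs are 1
w, w0 = np.array([1.0, 1.0]), -1.5

for x, label in data:
    print(x, label, perceptron_predict(np.array(x, dtype=float), w, w0))
```

Any weights for which the line $x_1 w_1 + x_2 w_2 + w_0 = 0$ separates the point $(1, 1)$ from the other three points work equally well; the solution is not unique.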
• The perceptron uses a loss function in which there is no penalty for correctly classified examples, while the penalty (loss) is equal to $|s|$ for misclassified examples*
[figure: the perceptron loss $L(s, y)$ plotted against $s$, zero on the correctly classified side]
• Formally:
  ∗ $L(s, y) = 0$ if $s$ and $y$ have the same sign
  ∗ $L(s, y) = |s|$ if $s$ and $y$ have different signs
• This can be re-written as $L(s, y) = \max(0, -sy)$
* This is similar, but not identical, to another widely used loss function called the hinge loss

Stochastic gradient descent
• Split all training examples into $B$ batches
• Choose initial $\theta^{(1)}$
• For $i$ from 1 to $T$ (iterations over the entire dataset are called epochs)
  ∗ For $j$ from 1 to $B$
    ∗ Do a gradient descent update using the data from batch $j$
• Advantage of such an approach: computational feasibility for large datasets

Perceptron training algorithm
Choose initial guess $w^{(0)}$, $k = 0$
For $i$ from 1 to $T$ (epochs)
  For $j$ from 1 to $N$ (training examples)
    Consider example $\{x_j, y_j\}$
    Update*: $w^{(k+1)} = w^{(k)} - \eta\nabla L(w^{(k)})$, where $L(w) = \max(0, -sy)$, $s = \sum_{i=0}^{m} x_i w_i$, and $\eta$ is the learning rate
* There is no derivative when $s = 0$, but this case is handled explicitly in the algorithm; see the next slides

Perceptron training rule
• We have $\frac{\partial L}{\partial w_i} = 0$ when $sy > 0$
  ∗ We don't need to do an update when an example is correctly classified
• We have $\frac{\partial L}{\partial w_i} = -x_i$ when $y = 1$ and $s < 0$
• We have $\frac{\partial L}{\partial w_i} = x_i$ when $y = -1$ and $s > 0$
• Here $s = \sum_{i=0}^{m} x_i w_i$

Perceptron training algorithm
• When classified correctly, weights are unchanged
• When misclassified: $w^{(k+1)} = w^{(k)} - \eta(\pm x)$ (where $\eta > 0$ is called the learning rate)
  ∗ If $y = 1$ but $s < 0$: $w_i \leftarrow w_i + \eta x_i$ and $w_0 \leftarrow w_0 + \eta$
  ∗ If $y = -1$ but $s \geq 0$: $w_i \leftarrow w_i - \eta x_i$ and $w_0 \leftarrow w_0 - \eta$
• Convergence Theorem: if the training data is linearly separable, the algorithm is guaranteed to converge to a solution. That is, there exists a finite $k$ such that $L = 0$

Perceptron convergence theorem
• Assumptions
  ∗ Linear separability: there exists $w^*$ such that $y_i (w^*)'x_i \geq \gamma$ for all training data $i = 1, \dots, N$ and some positive $\gamma$
  ∗ Bounded data: $\|x_i\| \leq R$ for $i = 1, \dots, N$ and some finite $R$
• Proof sketch
  ∗ Establish that $(w^*)'w^{(k)} \geq (w^*)'w^{(k-1)} + \gamma$
  ∗ It then follows that $(w^*)'w^{(k)} \geq k\gamma$
  ∗ Establish that $\|w^{(k)}\|^2 \leq kR^2$
  ∗ Note that $\cos\left(w^*, w^{(k)}\right) = \frac{(w^*)'w^{(k)}}{\|w^*\|\|w^{(k)}\|}$
  ∗ Use the fact that $\cos(\cdot) \leq 1$
  ∗ Arrive at $k \leq \frac{R^2\|w^*\|^2}{\gamma^2}$

Pros and cons of perceptron learning
• If the data is linearly separable, the perceptron training algorithm will converge to a correct solution
  ∗ There is a formal proof – good!
  ∗ It will converge to some solution (separating boundary), one of infinitely many possible – bad!
• However, if the data is not linearly separable, the training will fail completely rather than give some approximate solution
  ∗ Ugly!

Perceptron Learning Example
Basic setup: $s = x_1 w_1 + x_2 w_2 + w_0$, with activation $f(s) = -1$ if $s < 0$ and $1$ if $s \geq 0$ (learning
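The training algorithm above can be sketched in a few lines of code. This is a minimal illustration, not the lecturer's code; the toy dataset, learning rate, and epoch limit are assumptions chosen for the example.

```python
import numpy as np

def train_perceptron(X, y, eta=0.1, epochs=100):
    """Perceptron training as described on the slides: visit examples one at a
    time, and update the weights only on mistakes (y * s <= 0 is treated as a
    mistake, which also covers the s = 0 case) by w <- w + eta * y * x."""
    n, m = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])   # prepend x_0 = 1 for the bias weight w_0
    w = np.zeros(m + 1)                    # initial guess w^(0)
    for _ in range(epochs):                # T epochs
        mistakes = 0
        for xi, yi in zip(Xb, y):          # N training examples
            s = xi @ w
            if yi * s <= 0:                # misclassified
                w += eta * yi * xi         # w <- w - eta * grad L
                mistakes += 1
        if mistakes == 0:                  # loss is 0 on every example: converged
            break
    return w

# Assumed toy data (linearly separable), labels encoded as +1 / -1
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([-1, -1, -1, 1])
w = train_perceptron(X, y)
print(w, np.sign(np.hstack([np.ones((4, 1)), X]) @ w))
```

Because the weights start at zero, any positive learning rate produces the same sequence of mistakes (the weights simply scale), and the convergence theorem above guarantees that the loop terminates whenever the data is linearly separable.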