Lecture 6. Notes on Linear Algebra. Perceptron


COMP90051 Statistical Machine Learning, Semester 2, 2017
Lecturer: Andrey Kan
Copyright: University of Melbourne

This lecture
• Notes on linear algebra
  ∗ Vectors and dot products
  ∗ Hyperplanes and vector normals
• Perceptron
  ∗ Introduction to artificial neural networks
  ∗ The perceptron model
  ∗ Stochastic gradient descent

Notes on Linear Algebra
Link between the geometric and algebraic interpretations of ML methods

What are vectors?
Suppose $u = [u_1, u_2]'$. What does $u$ really represent?
• An ordered set of numbers $\{u_1, u_2\}$
• Cartesian coordinates of a point
• A direction

Dot product: algebraic definition
• Given two $m$-dimensional vectors $u$ and $v$, their dot product is $u \cdot v \equiv u'v \equiv \sum_{i=1}^{m} u_i v_i$
  ∗ E.g., a weighted sum of terms is a dot product
• Verify that if $k$ is a constant and $a$, $b$ and $c$ are vectors of the same size, then
  $(ka)'b = k(a'b) = a'(kb)$ and $a'(b + c) = a'b + a'c$

Dot product: geometric definition
• Given two $m$-dimensional vectors $u$ and $v$, their dot product is $u \cdot v \equiv \|u\| \|v\| \cos\theta$
  ∗ Here $\|u\|$ and $\|v\|$ are the L2 norms (i.e., Euclidean lengths) of vectors $u$ and $v$
  ∗ and $\theta$ is the angle between the vectors
• The scalar projection of $u$ onto $v$ is given by $u_v = \|u\| \cos\theta$. Thus the dot product is $u'v = u_v \|v\| = v_u \|u\|$

Equivalence of definitions
• Lemma: the algebraic and geometric definitions are identical
• Proof sketch:
  ∗ Express the vectors using the standard vector basis $e_1, \dots, e_m$ in $\mathbb{R}^m$: $u = \sum_{i=1}^{m} u_i e_i$ and $v = \sum_{i=1}^{m} v_i e_i$
  ∗ The $e_i$ form an orthonormal basis: they have unit length and are orthogonal to each other, so $e_i'e_i = 1$ and $e_i'e_j = 0$ for $i \neq j$
  ∗ Hence $u'v = \sum_{i=1}^{m} \sum_{j=1}^{m} u_i v_j \, e_i'e_j = \sum_{i=1}^{m} u_i v_i$; evaluating each $e_i'e_j$ with the geometric definition ($\|e_i\| \|e_j\| \cos\theta_{ij}$) gives the same result, so the two definitions coincide

Geometric properties of the dot product
• If the two vectors are orthogonal then $u'v = 0$
• If the two vectors are parallel then $u'v = \|u\| \|v\|$; if they are anti-parallel then $u'v = -\|u\| \|v\|$
• $u'u = \|u\|^2$, so $\|u\| = \sqrt{u_1^2 + \dots + u_m^2}$ defines the Euclidean vector length

Hyperplanes and normal vectors
• A hyperplane defined by parameters $w$ and $b$ is the set of points $x$ that satisfy $x'w + b = 0$
• In 2D, a hyperplane is a line: a set of points that satisfy $w_1 x_1 + w_2 x_2 + b = 0$
• A normal vector for a hyperplane is a vector perpendicular to that hyperplane

Hyperplanes and normal vectors (continued)
• Consider a hyperplane defined by parameters $w$ and $b$. Note that $w$ is itself a vector
• Lemma: vector $w$ is a normal vector to the hyperplane
• Proof sketch:
  ∗ Choose any two points $u$ and $v$ on the hyperplane. The vector $(u - v)$ lies on the hyperplane
  ∗ Consider the dot product $(u - v)'w = u'w - v'w = (u'w + b) - (v'w + b) = 0$
  ∗ Thus $(u - v)$ lies on the hyperplane but is perpendicular to $w$, and so $w$ is a normal vector

Example in 2D
• Consider a line defined by $w_1$, $w_2$ and $b$
• Vector $w = [w_1, w_2]'$ is a normal vector to the line
• Rearranging the line equation gives $x_2 = -\frac{w_1}{w_2} x_1 - \frac{b}{w_2}$
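The two dot-product definitions and the normal-vector lemma are easy to check numerically. The following is a minimal NumPy sketch (not part of the original slides; the vectors, weights and bias are made-up values chosen for illustration):

```python
import numpy as np

u = np.array([3.0, 1.0])
v = np.array([1.0, 2.0])

# Algebraic definition: u'v = sum_i u_i * v_i
dot_alg = np.sum(u * v)

# Geometric definition: ||u|| ||v|| cos(theta), with theta measured independently
theta = np.arctan2(v[1], v[0]) - np.arctan2(u[1], u[0])
dot_geo = np.linalg.norm(u) * np.linalg.norm(v) * np.cos(theta)
print(dot_alg, dot_geo)    # both are 5.0 (up to floating-point error)

# Normal-vector lemma: for any two points p, q on the hyperplane x'w + b = 0,
# the difference (p - q) lies on the hyperplane and is perpendicular to w.
w, b = np.array([2.0, -1.0]), 3.0
p = np.array([0.0, 3.0])   # satisfies p'w + b = 0
q = np.array([1.0, 5.0])   # satisfies q'w + b = 0
print(np.dot(p - q, w))    # 0.0
```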
Is logistic regression a linear method?
[Figure: the logistic function, mapping reals to probabilities between 0 and 1]

Logistic regression is a linear classifier
• Logistic regression model: $P(Y = 1 | x) = \frac{1}{1 + \exp(-x'w)}$
• Classification rule: if $P(Y = 1 | x) > \frac{1}{2}$ then class "1", else class "0"
• Decision boundary: $\frac{1}{1 + \exp(-x'w)} = \frac{1}{2}$, i.e., $\exp(-x'w) = 1$, i.e., $x'w = 0$

The Perceptron Model
A building block for artificial neural networks, yet another linear classifier

Biological inspiration
• Humans perform well at many tasks that matter
• Originally, neural networks were an attempt to mimic the human brain

Artificial neural network
• As a crude approximation, the human brain can be thought of as a mesh of interconnected processing nodes (neurons) that relay electrical signals
• An artificial neural network is a network of processing elements
• Each element converts inputs $x_1, \dots, x_m$ to an output
• The output is a function (called the activation function) of a weighted sum of the inputs

Outline
• In order to use an ANN we need (a) to design the network topology and (b) to adjust the weights to the given data
  ∗ In this course, we will exclusively focus on task (b) for a particular class of networks called feed-forward networks
• Training an ANN means adjusting weights for training data given a pre-defined network topology
• We will come back to ANNs and discuss ANN training in the next lecture
• Right now we will turn our attention to an individual network element, because it is an interesting model in itself

Perceptron model
[Figure: each input $x_i$ is multiplied by a weight $w_i$, a constant input 1 is multiplied by the bias weight $w_0$, the products are summed ($\Sigma$) to give $s$, and the activation function gives the output $f(s)$]
• $x_1, x_2$ – inputs
• $w_1, w_2$ – synaptic weights
• $w_0$ – bias weight
• $f$ – activation function
• Compare this model to logistic regression

Perceptron is a linear binary classifier
• Perceptron is a binary classifier:
  ∗ Predict class A if $s \geq 0$
  ∗ Predict class B if $s < 0$
  where $s = \sum_{i=0}^{m} x_i w_i$ (with constant input $x_0 = 1$ for the bias)
• Perceptron is a linear classifier: $s$ is a linear function of the inputs, and the decision boundary is linear
[Figure: example for 2D data, showing a linear decision boundary separating the data points in the $(x_1, x_2)$ plane]

Exercise: find the weights of a perceptron capable of perfect classification of the following dataset (one possible set of weights is sketched in the code example below)
  $x_1$  $x_2$  class
    0      0      B
    0      1      B
    1      0      B
    1      1      A

Loss function for perceptron
• Recall that "training" means finding weights that minimise some loss. Therefore, we proceed by considering the loss function for the perceptron
• Our task is binary classification. Let's arbitrarily encode one class as $+1$ and the other as $-1$. So each training example is now $\{x, y\}$, where $y$ is either $+1$ or $-1$
• Recall that, in a perceptron, $s = \sum_{i=0}^{m} x_i w_i$, and the sign of $s$ determines the predicted class: $+1$ if $s > 0$, and $-1$ if $s < 0$
• Consider a single training example. If $y$ and $s$ have the same sign then the example is classified correctly. If $y$ and $s$ have different signs, the example is misclassified
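As a concrete illustration of the decision rule, and one possible answer to the exercise above, here is a small Python sketch (not from the slides; the weights shown are just one of infinitely many valid choices, and the function name is illustrative):

```python
def perceptron_predict(x, w, w0):
    """Perceptron decision rule: class A (+1) if s >= 0, class B (-1) if s < 0."""
    s = w0 + sum(w_i * x_i for w_i, x_i in zip(w, x))
    return 1 if s >= 0 else -1

# One possible weight choice (weights are not unique): predict class A only
# when both inputs are 1, by thresholding x_1 + x_2 against 1.5.
w, w0 = [1.0, 1.0], -1.5
dataset = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]  # B as -1, A as +1
for x, y in dataset:
    assert perceptron_predict(x, w, w0) == y   # all four examples classified correctly
```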
Loss function for perceptron (continued)
• The perceptron uses a loss function in which there is no penalty for correctly classified examples, while the penalty (loss) is equal to $|s|$ for misclassified examples*
• Formally:
  ∗ $L(s, y) = 0$ if $s$ and $y$ have the same sign
  ∗ $L(s, y) = |s|$ if $s$ and $y$ have different signs
• This can be re-written as $L(s, y) = \max(0, -sy)$
* This is similar, but not identical, to another widely used loss function called the hinge loss

Stochastic gradient descent
• Split all training examples into $B$ batches
• Choose initial $\theta^{(1)}$
• For $i$ from 1 to $T$ (iterations over the entire dataset are called epochs)
  • For $j$ from 1 to $B$
    • Do a gradient descent update using the data from batch $j$
• Advantage of such an approach: computational feasibility for large datasets

Perceptron training algorithm
• Choose initial guess $w^{(0)}$, set $k = 0$
• For $i$ from 1 to $T$ (epochs)
  • For $j$ from 1 to $N$ (training examples)
    • Consider example $\{x_j, y_j\}$
    • Update*: $w^{(k+1)} = w^{(k)} - \eta \nabla L(w^{(k)})$, where $L(w) = \max(0, -sy)$, $s = \sum_{i=0}^{m} x_i w_i$, and $\eta$ is the learning rate
* There is no derivative when $s = 0$, but this case is handled explicitly in the algorithm; see the next slides

Perceptron training rule
• We have $\frac{\partial L}{\partial w_i} = 0$ when $sy > 0$
  ∗ We don't need to do an update when an example is correctly classified
• We have $\frac{\partial L}{\partial w_i} = -x_i$ when $y = 1$ and $s < 0$
• We have $\frac{\partial L}{\partial w_i} = x_i$ when $y = -1$ and $s > 0$

Perceptron training algorithm (update rules)
• When classified correctly, the weights are unchanged
• When misclassified: $w^{(k+1)} = w^{(k)} \pm \eta x$ ($\eta > 0$ is called the learning rate)
  ∗ If $y = 1$ but $s < 0$: $w \leftarrow w + \eta x$ and $w_0 \leftarrow w_0 + \eta$
  ∗ If $y = -1$ but $s \geq 0$: $w \leftarrow w - \eta x$ and $w_0 \leftarrow w_0 - \eta$
• Convergence Theorem: if the training data is linearly separable, the algorithm is guaranteed to converge to a solution. That is, there exists a finite $k$ such that $L = 0$

Perceptron convergence theorem
• Assumptions
  ∗ Linear separability: there exists $w^*$ such that $y_i (w^*)' x_i \geq \gamma$ for all training data $i = 1, \dots, N$ and some positive $\gamma$
  ∗ Bounded data: $\|x_i\| \leq R$ for $i = 1, \dots, N$ and some finite $R$
• Proof sketch
  ∗ Establish that $(w^*)' w^{(k)} \geq (w^*)' w^{(k-1)} + \gamma$
  ∗ It then follows that $(w^*)' w^{(k)} \geq k\gamma$
  ∗ Establish that $\|w^{(k)}\|^2 \leq k R^2$
  ∗ Note that $\cos(w^*, w^{(k)}) = \frac{(w^*)' w^{(k)}}{\|w^*\| \|w^{(k)}\|}$
  ∗ Use the fact that $\cos(\cdot) \leq 1$
  ∗ Arrive at $k \leq \frac{R^2 \|w^*\|^2}{\gamma^2}$

Pros and cons of perceptron learning
• If the data is linearly separable, the perceptron training algorithm will converge to a correct solution
  ∗ There is a formal proof – good!
  ∗ It will converge to some solution (a separating boundary), one of infinitely many possible – bad!
• However, if the data is not linearly separable, the training will fail completely rather than give some approximate solution
  ∗ Ugly!

Perceptron Learning Example
• Basic setup: $s = w_1 x_1 + w_2 x_2 + w_0$; activation $f(s) = 1$ if $s \geq 0$ and $f(s) = -1$ if $s < 0$; a fixed learning rate $\eta$
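The training algorithm and update rules above can be summarised in a short sketch. The code below is an illustrative implementation, not the original lecture code; it uses the ±1 class encoding, per-example (stochastic) updates, and handles the $s = 0$ case explicitly as the slides describe. Function and variable names are assumptions for illustration.

```python
import numpy as np

def train_perceptron(X, y, eta=0.1, max_epochs=100):
    """Train a perceptron with per-example updates on the loss max(0, -s*y).

    X: (N, m) array of inputs; y: (N,) array of labels in {-1, +1}.
    Returns a weight vector of length m + 1 whose first entry is the bias weight w_0.
    """
    N, m = X.shape
    Xb = np.hstack([np.ones((N, 1)), X])   # prepend constant input x_0 = 1 for the bias
    w = np.zeros(m + 1)                    # initial guess w^(0)
    for epoch in range(max_epochs):        # passes over the whole dataset (epochs)
        updates = 0
        for xi, yi in zip(Xb, y):          # consider one training example at a time
            s = xi @ w
            # Update only on mistakes, handling s = 0 explicitly:
            #   if y = +1 but s < 0:  w <- w + eta * x
            #   if y = -1 but s >= 0: w <- w - eta * x
            if (yi == 1 and s < 0) or (yi == -1 and s >= 0):
                w += eta * yi * xi
                updates += 1
        if updates == 0:                   # no mistakes in a full epoch: training loss is 0
            break
    return w

# Example: the linearly separable dataset from the earlier exercise
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, -1, -1, 1])
w = train_perceptron(X, y)
preds = [1 if w[0] + xi @ w[1:] >= 0 else -1 for xi in X]
print(w, preds)   # the learned weights classify all four points: [-1, -1, -1, 1]
```

On linearly separable data such as this, the loop stops once an entire epoch passes without an update, which is exactly the convergence guarantee stated above; on non-separable data it would simply run until max_epochs without ever reaching zero training loss.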