Multilayer Neural Networks (Sections 6.1-6.3)

Chapter 6: Multilayer Neural Networks (Sections 6.1-6.3) • Introduction • Feedforward Operation and Classification • Backpropagation Algorithm Pattern Recognition Two main challenges • Representation • Matching Jain CSE 802, Spring 2013 Representation and Matching How Good is Face Representation? Courtesy: Pete Langenfeld, MSP Driver License Information Gallery: 34 million (30M DMV photos, 4M mugshots) 2009 driver license photo How Good is Face Representation? Smile makes a difference! Top-10 retrievals 1 2 3 4 5 6 7 8 9 10 Gallery: 34 million (30M DMV photos, 4M mugshots) Courtesy: Pete Langenfeld, MSP State of the Art in FR: Verification FRGC v2.0 (2006) MBGC (2010) LFW (2007) IJB-A (2015) LFW Standard Protocol 99.77% (Accuracy) 3,000 genuine & 3,000 imposter pairs; 10-fold CV LFW BLUFR Protocol 88% TAR @ 0.1% FAR 156,915 genuine, ~46M imposter pairs; 10-fold CV D. Wang, C. Otto and A. K. Jain, "Face Search at Scale: 80 Million Gallery", arXiv, July 28, 2015 Neural Networks n Massive parallelism is essential for complex recognition tasks (speech & image recognition) n Humans take only ~200ms for most cognitive tasks; this suggests parallel computation in human brain n Biological networks achieve excellent recognition performance via dense interconnection of simple computational elements (neurons) 10 12 n Number of neurons » 10 – 10 3 4 n Number of interconnections/neuron » 10 – 10 14 n Total number of interconnections » 10 n Damage to a few neurons or synapse (links) does not appear to impair performance (robustness) Neuron n Nodes are nonlinear, typically analog x 1 w1 x2 Y (output) xd wd where is internal threshold or offset Neural Networks n Feed-forward networks with one or more layers (hidden) between input & output nodes n How many nodes & hidden layers? . c outputs d inputs First hidden layer Second hidden layer NH1 input units NH2 input units n Network training? Form of the Discriminant Function • Linear: Hyperplane decision boundaries • Non-Linear: Arbitrary decision boundaries • Adopt a model and then use the resulting decision boundary • Specify the desired decision boundary Linear Discriminant Function • For a 2-class problem, discriminant function that is a linear combination of input features can be written as Sign of the function value gives Bias or Threshold the class Weight weight label vector Quadratic Discriminant Function • Quadratic Discriminant Function • Obtained by adding pair-wise products of features Linear Part Quadratic part, d(d+1)/2 (d+1) parameters additional parameters • g(x) positive implies class 1; g(x) negative implies class 2 • g(x) = 0 represents a hyperquadric, as opposed to hyperplanes in linear discriminant case • Adding more terms such as wijkxixjxk results in polynomial discriminant functions Generalized Discriminant Function • A generalized linear discriminant function, where y= f(x) can be written as Dimensionality of the Setting yi(x) to be augmented feature monomials results in space. polynomial discriminant functions Weights in the augmented feature space. Note that the function is linear in a. • Equivalently, a [a , a ,..., a ]t y = [y (x), y (x),..., y (x)]t = 1 2 dˆ 1 2 dˆ also called the augmented feature vector. 13 Perceptron • Perceptron is a linear classifier; it makes predictions based on a linear predictor function combining a set of weights with feature vector • The perceptron algorithm was invented by Rosenblatt in the late 1950s; its first implementation, in custom hardware, was one of the first artificial neural networks to be produced • The algorithm allows for online learning; it processes training samples one at a time Two-category Linearly Separable Case • Let y1,y2,…,yn be n training samples in augmented feature space, which are linearly separable • We need to find a weight vector a such that • aty > 0 for examples from the positive class • aty < 0 for examples from the negative class • “Normalizing” the input examples by multiplying them with their class label (replace all samples from class 2 by their negatives), find a weight vector a such that • aty > 0 for all the examples (here y is multiplied with class label) • The resulting weight vector is called a separating vector or a solution vector The Perceptron Criterion Function • Goal: Find weight vector a such that aty > 0 for all the samples (assuming it exists) • Mathematically, this can be expressed as finding a weight vector a that minimizes the no. of samples misclassified • Function is piecewise constant (discontinuous, and hence non- differentiable) and is difficult to optimize • Perceptron Criterion Function: Find a that minimizes this criterion The criterion is proportional to the sum of distances from the misclassified samples to the decision boundary Now, the minimization is mathematically tractable, and hence it is a better criterion fn. than no. of misclassifications. Fixed-increment Single Sample Perceptron • Also called perceptron learning in an online setting • For large datasets, this is more efficient compared to batch mode n = no. of training samples; a = weight vector; k = iteration # Chapter 5, page 230 Perceptron Convergence Theorem If training samples are linearly separable, then the sequence of weight vectors given by Fixed- increment single-sample Perceptron will terminate at a solution vector What happens if the patterns are non-linearly separable? 18 Multilayer Perceptron Can we learn the nonlinearity at the same time as the linear discriminant? This is the goal of multilayer neural networks or multilayer Perceptrons Pattern Classification, Chapter 6 19 Pattern Classification, Chapter 6 20 Feedforward Operation and Classification • A three-layer neural network consists of an input layer, a hidden layer and an output layer interconnected by modifiable (learned) weights represented by links between layers • Multilayer neural network implements linear discriminants, but in a space where the inputs have been mapped nonlinearly • Figure 6.1 shows a simple three-layer network Pattern Classification, Chapter 6 21 NNo training here No training involved here, since we are implementing a known input/output mapping 22 Pattern Classification, Chapter 6 23 • A single “bias unit” is connected to each unit in addition to the input units d d t • Net activation: net j = å xi w ji + w j0 = å xi w ji º w j .x, i=1 i=0 where the subscript i indexes units in the input layer, j in the hidden layer; wji denotes the input-to-hidden layer weights at the hidden unit j. (In neurobiology, such weights or connections are called “synapses”) • Each hidden unit emits an output that is a nonlinear function of its activation, that is: yj = f(netj) Pattern Classification, Chapter 6 24 Figure 6.1 shows a simple threshold function ì1 if net ³ 0 f ( net ) = sgn( net ) º í î- 1 if net < 0 • The function f(.) is also called the activation function or “nonlinearity” of a unit. There are more general activation functions with desirables properties • Each output unit similarly computes its net activation based on the hidden unit signals as: nH nH t netk = å y j wkj + wk0 = å y j wkj = wk .y, j=1 j=0 where the subscript k indexes units in the ouput layer and nH denotes the number of hidden units Pattern Classification, Chapter 6 25 • The output units are referred as zk. An output unit computes the nonlinear function of its net input, emitting zk = f(netk) • In the case of c outputs (classes), we can view the network as computing c discriminant functions zk = gk(x); the input x is classified according to the largest discriminant function gk(x) " k = 1, …,c • The three-layer network with the weights listed in fig. 6.1 solves the XOR problem Pattern Classification, Chapter 6 26 • The hidden unit y1 computes the boundary: ³ 0 Þ y1 = +1 x1 + x2 + 0.5 = 0 < 0 Þ y1 = -1 • The hidden unit y2 computes the boundary: £ 0 Þ y2 = +1 x1 + x2 -1.5 = 0 < 0 Þ y2 = -1 • Output unit emits z1 = +1 if and only if y1 = +1 and y2 = +1 Using the terminology of computer logic, the units are behaving like gates, where the first hidden unit is an OR gate, the second hidden unit is an AND gate, and the output unit implements zk = y1 AND NOT y2 = (x1 OR x2) and NOT(x1 AND x2) = x1 XOR x2 which provides the nonlinear decision of fig. 6.1Pattern Classification, Chapter 6 27 • General Feedforward Operation – case of c output units æ nH æ d ö ö g ( x ) º z = f ç w f ç w x + w ÷ + w ÷ (1) k k çå kj çå ji i j0 ÷ k0 ÷ è j=1 è i=1 ø ø (k = 1,...,c) • Hidden units enable us to express more complicated nonlinear functions and extend classification capability • Activation function does not have to be a sign function, it is often required to be continuous and differentiable • We can allow the activation in the output layer to be different from the activation function in the hidden layer or have different activation for each individual unit • Assume for now that all activation functions are identical Pattern Classification, Chapter 6 28 • Expressive Power of multi-layer Networks Question: Can every decision boundary be implemented by a three-layer network? Answer: Yes (due to A. kolmogorov) “Any continuous function from input to output can be implemented in a three-layer net, given sufficient number of hidden units nH, proper nonlinearities, and weights.” Any continuous function g(x) defined on the unit cube can be represented in the following form 2n+1 n g( x ) = åd j (Sbij ( xi )) "x Î I ( I = [0,1];n ³ 2 ) j=1 for properly chosen functions dj and bij Pattern Classification, Chapter 6 29 Pattern Classification, Chapter 6 30 • Network has two modes of operation:

Multilayer Neural Networks (Sections 6.1-6.3)

Warm Start for Parameter Selection of Linear Classifiers

Neural Networks and Backpropagation

Lecture 2: Linear Classifiers

Hyperplane Based Classification: Perceptron and (Intro

2 Linear Classifiers and Perceptrons

Machine Learning Empirical Risk Minimization

Using Linear Classifier Probes

Chapter 2 Classification

Linear Classification Overview Logistic Regression Model I Logistic

Differentially Private Empirical Risk Minimization

Tutorial on Support Vector Machine (SVM) Vikramaditya Jakkula, School of EECS, Washington State University, Pullman 99164

2DI70 - Statistical Learning Theory Lecture Notes