15-780: Machine Learning
OUTLINE

1 Machine learning with neural networks
2 Training neural networks
3 Popularity and applications

SUPERVISED MACHINE LEARNING: CLASSIFICATION SETUP

Input features x^{(i)} ∈ R^n
Outputs y^{(i)} ∈ Y (e.g. R, {−1, +1}, {1, ..., p})
Model parameters θ ∈ R^k
Hypothesis function h_θ : R^n → R
Loss function ℓ : R × Y → R_+

Machine learning optimization problem:

    minimize_θ  Σ_{i=1}^m ℓ(h_θ(x^{(i)}), y^{(i)})

LINEAR CLASSIFIERS

Linear hypothesis class:

    h_θ(x^{(i)}) = θ^T φ(x^{(i)})

where the input can be any set of non-linear features φ : R^n → R^k.

The generic function φ represents some (possibly hand-) selected way to generate non-linear features out of the available ones, for instance:

    x^{(i)} = [temperature for day i]
    φ(x^{(i)}) = [1, x^{(i)}, (x^{(i)})^2, ...]^T

GENERAL REPRESENTATION OF LINEAR CLASSIFICATION

[Figure: general representation of linear classification]

CHALLENGES WITH LINEAR MODELS

Linear models crucially depend on choosing "good" features.

Some "standard" choices: polynomial features, radial basis functions, random features (surprisingly effective).

But many specialized domains required highly engineered, special features. E.g., computer vision tasks used Haar features, SIFT features, etc.; every 10 years or so someone would engineer a new set of features.

Key question 1: should we stick with linear hypothesis functions? What about using non-linear combinations of the inputs? → Feed-forward neural networks (perceptrons)

Key question 2: can we come up with algorithms that will automatically learn the features themselves? → Feed-forward neural networks with multiple (> 2!) hidden layers (deep networks)

FEATURE LEARNING: USE TWO CLASSIFIERS IN CASCADE

Instead of a simple linear classifier, let's consider a two-stage hypothesis class where one linear function creates the features, and another models the classifier, taking as input the features created by the first one:

    h_w(x) = W_2 φ(x) + b_2 = W_2 (W_1 x + b_1) + b_2

where

    w = {W_1 ∈ R^{k×n}, b_1 ∈ R^k, W_2 ∈ R^{1×k}, b_2 ∈ R}

Note that in this notation we're explicitly separating the parameters on the "constant feature" into the b terms.

[Figure: graphical depiction of the obtained function — inputs x_1, ..., x_n mapped by (W_1, b_1) to hidden units z_1, ..., z_k, then by (W_2, b_2) to the output y]

But there is a problem:

    h_w(x) = W_2 (W_1 x + b_1) + b_2 = W̃ x + b̃     (1)

In other words, we are still just using a normal linear classifier: the apparent added complexity obtained by concatenating multiple linear stages is not giving us any additional representational power; we can only discriminate linearly separable classes.
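To make this collapse concrete, here is a minimal numerical sketch in Python with NumPy; the dimensions n = 3, k = 4 and the random parameters are arbitrary choices for illustration, not part of the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 3, 4                                            # arbitrary input / hidden dimensions

W1, b1 = rng.normal(size=(k, n)), rng.normal(size=k)   # first linear stage
W2, b2 = rng.normal(size=(1, k)), rng.normal(size=1)   # second linear stage

x = rng.normal(size=n)

# Two-stage cascade: W2 (W1 x + b1) + b2
two_stage = W2 @ (W1 @ x + b1) + b2

# Equivalent single linear map: W~ = W2 W1, b~ = W2 b1 + b2
W_tilde = W2 @ W1
b_tilde = W2 @ b1 + b2
one_stage = W_tilde @ x + b_tilde

print(np.allclose(two_stage, one_stage))   # True: no extra representational power
```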
ARTIFICIAL NEURAL NETWORKS (ANN)

Neural networks provide a way to obtain complexity by:
1 Using non-linear transformations of the inputs
2 Propagating the information among layers of processing units to realize multi-staged computation
3 In deep networks, the number of stages is relatively large, allowing to automatically learn hierarchical representations of the data features

[Figure: a deep network with layers z_1 = x, z_2, z_3, z_4, z_5 = h_θ(x), connected by parameters (W_1, b_1), ..., (W_4, b_4)]

TYPICAL NON-LINEAR ACTIVATION FUNCTIONS

Using non-linear activation functions at each node, the two-layer network of the previous example becomes equivalent to the following hypothesis function:

    h_θ(x) = f_2(W_2 f_1(W_1 x + b_1) + b_2)

where f_1, f_2 : R → R are some non-linear functions (applied elementwise to vectors).

Common choices for f_i are the hyperbolic tangent tanh(x) = (e^{2x} − 1)/(e^{2x} + 1), the sigmoid/logistic σ(x) = 1/(1 + e^{−x}), or the rectified linear unit (ReLU) f(x) = max{0, x}.

[Figure: plots of the tanh, sigmoid, and ReLU activation functions]

HIDDEN LAYERS AND LEARNED FEATURES

We draw these the same as before (non-linear functions are virtually always implied in the neural network setting).

[Figure: inputs x_1, ..., x_n mapped by (W_1, b_1) to hidden units z_1, ..., z_k, then by (W_2, b_2) to the output y]

The middle layer z is referred to as the hidden layer or activations.

These are the learned features: nothing in the data prescribes what values these should take; it is left up to the algorithm to decide.

To have meaningful feature learning we need multiple hidden layers in cascade (deep networks).

TYPES OF NETWORKS

[Figure: types of networks]

PROPERTIES OF NEURAL NETWORKS

It turns out that a neural network with a single hidden layer (and a suitably large number of hidden units) is a universal function approximator: it can approximate any function over the input arguments (but this is actually not very useful in practice, cf. polynomials fitting any set of points for high enough degree).

The hypothesis class h_θ is not a convex function of the parameters θ = {W_i, b_i}.

The number of parameters (weights and biases), layers (depth), topology (connectivity), and activation functions all affect the performance and capacity of the network.

DEEP LEARNING

"Deep" neural networks refer to networks with multiple hidden layers.

[Figure: the same deep network as above, with layers z_1 = x, z_2, z_3, z_4, z_5 = h_θ(x) and parameters (W_1, b_1), ..., (W_4, b_4)]

Mathematically, a k-layer network has the hypothesis function

    z_1 = x
    z_{i+1} = f_i(W_i z_i + b_i),   i = 1, ..., k − 1
    h_w(x) = z_k

where the z_i terms now indicate vectors of output features.
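As a minimal sketch of this k-layer hypothesis in Python with NumPy; the layer sizes, the ReLU/identity activations, and the random weights are arbitrary assumptions for illustration:

```python
import numpy as np

def relu(v):
    """Rectified linear unit, applied elementwise."""
    return np.maximum(0.0, v)

def forward(x, weights, biases, activations):
    """Compute z_{i+1} = f_i(W_i z_i + b_i) for i = 1, ..., k-1 and return z_k = h_w(x)."""
    z = x
    for W, b, f in zip(weights, biases, activations):
        z = f(W @ z + b)
    return z

# Example: a k = 4 layer network with z_1 = x in R^4, z_2 in R^5, z_3 in R^3, z_4 = h_w(x) in R
rng = np.random.default_rng(0)
sizes = [4, 5, 3, 1]
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=m) for m in sizes[1:]]
activations = [relu, relu, lambda v: v]   # identity on the output layer

x = rng.normal(size=sizes[0])
print(forward(x, weights, biases, activations))
```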
WHY USE DEEP NETWORKS?

A deep architecture trades space for time (or breadth for depth): more layers (more sequential computation), but less hardware (less parallel computation).

Many functions can be represented more compactly using deep networks than using one-hidden-layer networks (e.g. the parity function would require O(2^n) hidden units in a 3-layer network, but only O(n) units in an O(log n)-layer network).

Motivation from neurobiology: the brain appears to use multiple levels of interconnected neurons to process information (but be careful: neurons in the brain are not just non-linear functions).

In practice: works better for many domains.

Allows for automatic hierarchical feature extraction from the data.

HIERARCHICAL FEATURE REPRESENTATION

[Figure: hierarchical feature representation]

EXAMPLES OF HIERARCHICAL FEATURE REPRESENTATION

[Figure: examples of hierarchical feature representations]

EFFECT OF INCREASING NUMBER OF HIDDEN LAYERS

[Figure: speech recognition task — performance as the number of hidden layers increases]

OUTLINE

1 Machine learning with neural networks
2 Training neural networks
3 Popularity and applications

OPTIMIZING NEURAL NETWORK PARAMETERS

How do we optimize the parameters for the machine learning loss minimization problem with a neural network,

    minimize_θ  Σ_{i=1}^m ℓ(h_θ(x^{(i)}), y^{(i)})

now that this problem is non-convex?

Just do exactly what we did before: initialize with random weights and run stochastic gradient descent.

We now have the possibility of local optima, and the function can be harder to optimize, but we won't worry about all that, because the resulting models still often perform better than linear models.

STOCHASTIC GRADIENT DESCENT FOR NEURAL NETWORKS

Recall that stochastic gradient descent computes gradients with respect to the loss on each example, updating the parameters as it goes.

    function SGD({(x^{(i)}, y^{(i)})}, h_θ, ℓ, α):
        Initialize: W_j, b_j ← Random, j = 1, ..., k − 1
        Repeat until convergence:
            For i = 1, ..., m:
                Compute ∇_{W_j, b_j} ℓ(h_θ(x^{(i)}), y^{(i)}), j = 1, ..., k − 1
                Take gradient steps in all directions:
                    W_j ← W_j − α ∇_{W_j} ℓ(h_θ(x^{(i)}), y^{(i)}), j = 1, ..., k − 1
                    b_j ← b_j − α ∇_{b_j} ℓ(h_θ(x^{(i)}), y^{(i)}), j = 1, ..., k − 1
        return {W_j, b_j}

How do we compute the gradients ∇_{W_j, b_j} ℓ(h_θ(x^{(i)}), y^{(i)})?

BACK-PROPAGATION

Back-propagation is a method for computing all the necessary gradients using one forward pass (just computing all the activation values at the layers) and one backward pass (computing gradients backwards through the network).

The equations sometimes look complex, but it's just an application of the chain rule of calculus and the use of Jacobians.

JACOBIANS AND CHAIN RULE

For a multivariate, vector-valued function f : R^n → R^m, the Jacobian is the m × n matrix

    ∂f(x)/∂x = [ ∂f_1(x)/∂x_1  ∂f_1(x)/∂x_2  ...  ∂f_1(x)/∂x_n ]
               [ ∂f_2(x)/∂x_1  ∂f_2(x)/∂x_2  ...  ∂f_2(x)/∂x_n ]
               [      ...            ...      ...       ...     ]
               [ ∂f_m(x)/∂x_1  ∂f_m(x)/∂x_2  ...  ∂f_m(x)/∂x_n ]   ∈ R^{m×n}

For a scalar-valued function f : R^n → R, the Jacobian is the transpose of the gradient:

    ∂f(x)/∂x = ∇_x f(x)^T

For a vector-valued function, row i of the Jacobian corresponds to the gradient of the component f_i of the output vector, i = 1, ..., m. It tells how the variation of each input variable affects the variation of that output component.

Column j of the Jacobian is the impact of the variation of the j-th input variable, j = 1, ..., n, on each one of the m components of the output.

Chain rule for the derivative of a composite function:

    ∂f(g(x))/∂x = (∂f(g(x))/∂g(x)) · (∂g(x)/∂x)

MULTI-LAYER / MULTI-MODULE FF

Multi-layered feed-forward architecture (cascade): module/layer i gets as input the feature vector x_{i−1} output from module i − 1, applies the transformation function F_i(x_{i−1}, W_i), and produces the output x_i.

Each layer i contains (N_x)_i parallel nodes (x_i)_j, j = 1, ..., (N_x)_i. Each node gets its input from the previous layer's nodes through (N_x)_{i−1} ≡ (N_W)_i connections with weights (W_i)_j, j = 1, ..., (N_W)_i.

Each layer learns its own weight vector W_i.
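As a minimal sketch of how the chain rule, applied module by module from the output back to the input, yields all of these gradients, here is a Python/NumPy example; the two-layer ReLU network, the squared-error loss, the step size, and the single SGD update are illustrative assumptions rather than the exact setup used in the rest of the course:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 3, 4                                   # input and hidden dimensions (arbitrary)
W1, b1 = rng.normal(size=(k, n)), rng.normal(size=k)
W2, b2 = rng.normal(size=(1, k)), rng.normal(size=1)
x, y = rng.normal(size=n), np.array([1.0])    # one training example

# Forward pass: compute and store all intermediate activations.
a1 = W1 @ x + b1                      # pre-activation of the hidden layer
z1 = np.maximum(0.0, a1)              # hidden activations (ReLU)
h = W2 @ z1 + b2                      # network output
loss = 0.5 * np.sum((h - y) ** 2)     # squared-error loss (illustrative choice)

# Backward pass: chain rule applied layer by layer.
g_h = h - y                           # dloss/dh
g_W2 = np.outer(g_h, z1)              # dloss/dW2
g_b2 = g_h                            # dloss/db2
g_z1 = W2.T @ g_h                     # dloss/dz1 (Jacobian of the linear layer, transposed)
g_a1 = g_z1 * (a1 > 0)                # dloss/da1 (elementwise ReLU derivative)
g_W1 = np.outer(g_a1, x)              # dloss/dW1
g_b1 = g_a1                           # dloss/db1

# One stochastic gradient descent step with step size alpha.
alpha = 0.1
W1 -= alpha * g_W1; b1 -= alpha * g_b1
W2 -= alpha * g_W2; b2 -= alpha * g_b2
print(loss)
```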