OUTLINE
1 Machine learning with neural networks ...... 3
2 Training neural networks ...... 20
3 Popularity and applications ...... 37
3 SUPERVISED MACHINE LEARNING CLASSIFICATION SETUP
Input features x^(i) ∈ R^n
Outputs y^(i) ∈ Y (e.g. R, {−1, +1}, {1, . . . , p})
Model parameters θ ∈ R^k
Hypothesis function h_θ : R^n → R
Loss function ℓ : R × Y → R_+
Machine learning optimization problem:

minimize_θ Σ_{i=1}^m ℓ(h_θ(x^(i)), y^(i))
4 LINEAR CLASSIFIERS
Linear hypothesis class:

h_θ(x^(i)) = θ^T φ(x^(i))

where the input can be any set of non-linear features φ : R^n → R^k.
The generic function φ represents some (possibly hand-) selected way to generate non-linear features out of the available ones. For instance, with x^(i) = [temperature for day i], a polynomial feature map:

φ(x^(i)) = [1, x^(i), (x^(i))^2, . . . ]^T
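A minimal sketch of this idea in NumPy; the cubic degree and the parameter values θ are illustrative choices, not taken from the slides:

```python
import numpy as np

def phi(x, degree=3):
    """Polynomial feature map: scalar x -> [1, x, x^2, ..., x^degree]."""
    return np.array([x**d for d in range(degree + 1)])

# Linear hypothesis on the non-linear features: h_theta(x) = theta^T phi(x)
theta = np.array([0.5, -1.0, 0.25, 0.0])  # hypothetical parameters
x = 2.0
h = theta @ phi(x)  # 0.5 - 2.0 + 1.0 + 0.0 = -0.5
```

The model stays linear in θ even though it is non-linear in the raw input x.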
5 GENERAL REPRESENTATION OF LINEAR CLASSIFICATION
6 CHALLENGES WITH LINEAR MODELS
Linear models crucially depend on choosing “good” features
Some “standard” choices: polynomial features, radial basis functions, random features (surprisingly effective)
But many specialized domains required highly engineered, special-purpose features
E.g., computer vision tasks used Haar features, SIFT features, every 10 years or so someone would engineer a new set of features
Key question 1: Should we stick with linear hypothesis functions? What about using non-linear combinations of the inputs? → Feed-forward neural networks (Perceptrons)
Key question 2: can we come up with algorithms that will automatically learn the features themselves? → Feed-forward neural networks with multiple (> 2!) hidden layers (Deep Networks)
7 FEATURE LEARNING: USE TWO CLASSIFIERS IN CASCADE
Instead of a simple linear classifier, let's consider a two-stage hypothesis class: one linear function creates the features, and a second linear function models the classifier, taking as input the features created by the first:

h_w(x) = W_2 φ(x) + b_2 = W_2 (W_1 x + b_1) + b_2

where

w = {W_1 ∈ R^{k×n}, b_1 ∈ R^k, W_2 ∈ R^{1×k}, b_2 ∈ R}

Note that in this notation we are explicitly separating the parameters on the "constant feature" into the b terms.
8 FEATURE LEARNING: USE TWO CLASSIFIERS IN CASCADE
[Diagram: input nodes x_1, . . . , x_n feed through (W_1, b_1) into feature nodes z_1, . . . , z_k, which feed through (W_2, b_2) into the output y.]
But there is a problem:

h_w(x) = W_2 (W_1 x + b_1) + b_2 = W̃ x + b̃,  with W̃ = W_2 W_1 and b̃ = W_2 b_1 + b_2

In other words, we are still just using a normal linear classifier: the apparent added complexity of cascading multiple linear stages gives no additional representational power; we can still only discriminate linearly separable classes.
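A quick numerical check of this collapse; the dimensions and random weights are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 4, 3
W1, b1 = rng.normal(size=(k, n)), rng.normal(size=k)
W2, b2 = rng.normal(size=(1, k)), rng.normal(size=1)

# Collapse the two-stage cascade into a single linear map
W_tilde = W2 @ W1        # shape (1, n)
b_tilde = W2 @ b1 + b2   # shape (1,)

x = rng.normal(size=n)
h_cascade = W2 @ (W1 @ x + b1) + b2
h_linear = W_tilde @ x + b_tilde
assert np.allclose(h_cascade, h_linear)  # identical: still a linear classifier
```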
9 ARTIFICIAL NEURAL NETWORKS (ANN)
Neural networks provide a way to obtain complexity by:
1 Using non-linear transformations of the inputs
2 Propagating the information among layers of processing units to realize multi-staged computation
3 In deep networks, the number of stages is relatively large, allowing the network to automatically learn hierarchical representations of the data features

[Diagram: a shallow network (inputs x_1, . . . , x_n → hidden nodes z_1, . . . , z_k → output y, via (W_1, b_1) and (W_2, b_2)) next to a deep network with layers z_1 = x, z_2, z_3, z_4, z_5 = h_θ(x) connected by (W_1, b_1), . . . , (W_4, b_4).]
10 TYPICAL NON-LINEAR ACTIVATION FUNCTIONS
Using non-linear activation functions at each node, the two-layer network of the previous example becomes equivalent to the following hypothesis function:

h_θ(x) = f_2(W_2 f_1(W_1 x + b_1) + b_2)

where f_1, f_2 : R → R are some non-linear functions (applied elementwise to vectors).

Common choices for f_i are the hyperbolic tangent tanh(x) = (e^{2x} − 1)/(e^{2x} + 1), the sigmoid/logistic σ(x) = 1/(1 + e^{−x}), and the rectified linear unit (ReLU) f(x) = max{0, x}.
[Plots of tanh, sigmoid, and ReLU over x ∈ [−4, 4].]
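The three activation functions are one-liners in NumPy; this sketch just mirrors the definitions above:

```python
import numpy as np

def tanh(x):
    # hyperbolic tangent: (e^{2x} - 1) / (e^{2x} + 1)
    return (np.exp(2 * x) - 1) / (np.exp(2 * x) + 1)

def sigmoid(x):
    # logistic function: 1 / (1 + e^{-x})
    return 1 / (1 + np.exp(-x))

def relu(x):
    # rectified linear unit: max{0, x}
    return np.maximum(0, x)

x = np.linspace(-4, 4, 9)
assert np.allclose(tanh(x), np.tanh(x))          # matches the library tanh
assert np.all((sigmoid(x) > 0) & (sigmoid(x) < 1))
assert np.all(relu(x) >= 0)
```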
11 HIDDEN LAYERS AND LEARNED FEATURES
We draw these networks the same as before (non-linear activation functions are virtually always implied in the neural network setting).

[Diagram: inputs x_1, . . . , x_n feed through (W_1, b_1) into hidden nodes z_1, . . . , z_k, which feed through (W_2, b_2) into the output y.]

The middle layer z is referred to as the hidden layer, and its values as activations. These are the learned features: nothing in the data prescribes what values they should take; it is left up to the algorithm to decide. To have meaningful feature learning we need multiple hidden layers in cascade.
12 TYPES OF NETWORKS
13 PROPERTIES OF NEURAL NETWORKS
It turns out that a neural network with a single hidden layer (and a suitably large number of hidden units) is a universal function approximator: it can approximate any function of the input arguments (but this is actually not very useful in practice; cf. polynomials, which can fit any set of points for high enough degree)
The hypothesis class hθ is not a convex function of the parameters θ = {Wi, bi}
The number of parameters (weights and biases), layers (depth), topology (connectivity), activation functions, all affect the performance and capacity of the network
“Deep” neural networks refer to networks with multiple hidden layers
[Diagram: a deep network with layers z_1 = x, z_2, z_3, z_4, z_5 = h_θ(x) connected by (W_1, b_1), . . . , (W_4, b_4).]

Mathematically, a k-layer network has the hypothesis function

z_{i+1} = f_i(W_i z_i + b_i), i = 1, . . . , k − 1,  z_1 = x
h_w(x) = z_k

where the z_i terms now indicate vectors of output features.
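The k-layer recurrence above can be sketched in a few lines of NumPy; the layer sizes, tanh activation, and random weights are illustrative assumptions:

```python
import numpy as np

def forward(x, weights, biases, f=np.tanh):
    """Compute h(x) = z_k via z_{i+1} = f(W_i z_i + b_i), z_1 = x."""
    z = x
    for W, b in zip(weights, biases):
        z = f(W @ z + b)
    return z

rng = np.random.default_rng(1)
sizes = [5, 8, 8, 1]  # input, two hidden layers, scalar output (hypothetical)
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.normal(size=m) for m in sizes[1:]]

h = forward(rng.normal(size=5), weights, biases)
```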
15 WHY USE DEEP NETWORKS?
A deep architecture trades space for time (or breadth for depth): more layers (more sequential computation), but less hardware (less parallel computation).
Many functions can be represented more compactly using deep networks than one-hidden-layer networks (e.g. the parity function would require O(2^n) hidden units in a 3-layer network, but only O(n) units in an O(log n)-layer network)
Motivation from neurobiology: brain appears to use multiple levels of interconnected neurons to process information (but careful, neurons in brain are not just non-linear functions)
In practice: works better for many domains
Allow for automatic hierarchical feature extraction from the data
16 HIERARCHICAL FEATURE REPRESENTATION
17 EXAMPLES OF HIERARCHICAL FEATURE REPRESENTATION
18 EFFECT OF INCREASING NUMBER OF HIDDEN LAYERS
Speech recognition task
20 OPTIMIZING NEURAL NETWORK PARAMETERS
How do we optimize the parameters for the machine learning loss minimization problem with a neural network,

minimize_θ Σ_{i=1}^m ℓ(h_θ(x^(i)), y^(i)),

now that this problem is non-convex?
Just do exactly what we did before: initialize with random weights and run stochastic gradient descent
Now have the possibility of local optima, and function can be harder to optimize, but we won’t worry about all that because the resulting models still often perform better than linear models
21 STOCHASTIC GRADIENT DESCENT FOR NEURAL NETWORKS
Recall that stochastic gradient descent computes gradients with respect to loss on each example, updating parameters as it goes
function SGD({(x^(i), y^(i))}, h_θ, ℓ, α):
    Initialize: W_j, b_j ← random, j = 1, . . . , k
    Repeat until convergence:
        For i = 1, . . . , m:
            Compute ∇_{W_j, b_j} ℓ(h_θ(x^(i)), y^(i)), j = 1, . . . , k
            Take gradient steps in all directions:
                W_j ← W_j − α ∇_{W_j} ℓ(h_θ(x^(i)), y^(i)), j = 1, . . . , k
                b_j ← b_j − α ∇_{b_j} ℓ(h_θ(x^(i)), y^(i)), j = 1, . . . , k
    return {W_j, b_j}

How do we compute the gradients ∇_{W_j, b_j} ℓ(h_θ(x^(i)), y^(i))?
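The update loop itself is short; as a sanity check, here is a generic SGD sketch run on a toy convex problem (fitting h(x) = θx with squared loss, a hypothetical example not from the slides):

```python
import numpy as np

def sgd(data, grad_fn, params, alpha=0.05, epochs=200):
    """Stochastic gradient descent: one gradient step per training example.
    grad_fn(params, x, y) returns the gradient list matching params."""
    for _ in range(epochs):
        for x, y in data:
            grads = grad_fn(params, x, y)
            params = [p - alpha * g for p, g in zip(params, grads)]
    return params

# Toy problem: data drawn from y = 2x; squared loss (theta*x - y)^2
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
grad_fn = lambda params, x, y: [2 * (params[0] * x - y) * x]
theta, = sgd(data, grad_fn, [0.0])
# theta converges close to 2.0
```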
22 BACK-PROPAGATION
Back-propagation is a method for computing all the necessary gradients using one forward pass (just computing all the activation values at layers), and one backward pass (computing gradients backwards in the network)
The equations sometimes look complex, but it’s just an application of the chain rule of calculus and the use of Jacobians
23 JACOBIANS AND CHAIN RULE
For a multivariate, vector-valued function f : R^n → R^m, the Jacobian is an m × n matrix:

∂f(x)/∂x = [ ∂f_1(x)/∂x_1  ∂f_1(x)/∂x_2  · · ·  ∂f_1(x)/∂x_n
             ∂f_2(x)/∂x_1  ∂f_2(x)/∂x_2  · · ·  ∂f_2(x)/∂x_n
             . . .
             ∂f_m(x)/∂x_1  ∂f_m(x)/∂x_2  · · ·  ∂f_m(x)/∂x_n ] ∈ R^{m×n}

For a scalar-valued function f : R^n → R, the Jacobian is the transpose of the gradient: ∂f(x)/∂x = ∇_x f(x)^T.

For a vector-valued function, row i of the Jacobian corresponds to the gradient of the component f_i of the output vector, i = 1, . . . , m. It tells how the variation of each input variable affects the variation of that output component.

Column j of the Jacobian is the impact of the variation of the j-th input variable, j = 1, . . . , n, on each one of the m components of the output.
Chain rule for the derivative of a composite function:

∂f(g(x))/∂x = ∂f(g(x))/∂g(x) · ∂g(x)/∂x
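The chain rule can be verified numerically with finite-difference Jacobians; the two composed functions below are arbitrary examples chosen for illustration:

```python
import numpy as np

def jacobian(f, x, eps=1e-6):
    """Finite-difference Jacobian of f: R^n -> R^m, returned as (m, n)."""
    fx = f(x)
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = eps
        J[:, j] = (f(x + e) - fx) / eps
    return J

g = lambda x: np.array([x[0] * x[1], x[0] + x[1]])   # R^2 -> R^2
f = lambda u: np.array([np.sin(u[0]) + u[1]**2])     # R^2 -> R^1

x = np.array([0.5, -1.0])
lhs = jacobian(lambda t: f(g(t)), x)        # d f(g(x)) / dx, shape (1, 2)
rhs = jacobian(f, g(x)) @ jacobian(g, x)    # chain rule: (1, 2) @ ... wait, (1, 2) = (1, 2) x (2, 2)
assert np.allclose(lhs, rhs, atol=1e-4)
```

Note how the Jacobian shapes compose: a (1 × 2) Jacobian of f times a (2 × 2) Jacobian of g yields the (1 × 2) Jacobian of the composition.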
24 MULTI-LAYER / MULTI-MODULE FF
Multi-layered feed-forward architecture (cascade): module/layer i gets as input the feature vector x_{i−1} output from module i − 1, applies the transformation function F_i(x_{i−1}, W_i), and produces the output x_i.

Each layer i contains (N_x)_i parallel nodes (x_i)_j, j = 1, . . . , (N_x)_i. Each node gets its input from the previous layer's nodes through (N_x)_{i−1} ≡ (N_W)_i connections with weights (W_i)_j, j = 1, . . . , (N_W)_i.

Each layer learns its own weight vector W_i.

F_i is a vector function. At each node j of layer i, the transformation function is (F_i)_j = f((W_i)_j^T x_{i−1}), where f is the activation function (e.g., a sigmoid), which can safely be considered the same for all nodes.

In the following, the notation is simplified by dropping the second indices and reasoning at the aggregate vector level of each layer.
25 FORWARD PROPAGATION
Forward propagation: following the presentation of the training input x_0, the output vectors x_i resulting from the transformation functions F_i at all layers i = 1, . . . , n are computed in sequence, starting from x_1, and are stored.

The loss ℓ (to be minimized) results from the forward propagation at the output layer, by computing the deviation of x_n with respect to the target Y:

ℓ(Y, x, W) = C(x_n, Y)
26 COMPUTING GRADIENTS
At each iteration of SGD, the gradients with respect to all the parameters of the system need to be computed (i.e., the weights Wi, that could be split in weights for input and weights for bias, but hereafter we just use the general form Wi)
After the Forward pass, let’s start setting up the relations for the Backward pass
Let’s consider the generic layer i: from the Forward propagation, its output value is available and is xi = Fi(xi−1, Wi)
In addition, let's assume that we already know ∂ℓ/∂x_i, that is, for each component of the vector x_i, the variation of ℓ in response to a variation of that component. We can assume that ∂ℓ/∂x_i is known because we will proceed backward.
27 COMPUTING GRADIENTS

Since we assume ∂ℓ/∂x_i is known, and we have computed x_i = F_i(x_{i−1}, W_i), we can use the chain rule to compute ∂ℓ/∂W_i, which is the quantity of interest, and which tells us the variation in ℓ as a response to a variation in the weights W_i:

∂ℓ/∂W_i = ∂ℓ/∂x_i · ∂F_i(x_{i−1}, W_i)/∂W_i

where x_i is a substitute for F_i(x_{i−1}, W_i). Dimensionally, the previous equation is as follows:

[1 × N_W] = [1 × N_x] · [N_x × N_W]

∂F_i(x_{i−1}, W_i)/∂W_i is the Jacobian matrix of F_i with respect to W_i. The element (k, l) of the Jacobian quantifies the variation in the k-th output when a variation is exerted on the l-th weight:

[∂F_i(x_{i−1}, W_i)/∂W_i]_{kl} = ∂[F_i(x_{i−1}, W_i)]_k / ∂[W_i]_l

28 COMPUTING GRADIENTS
Let's keep assuming that we know ∂ℓ/∂x_i, and let's use it this time to compute ∂ℓ/∂x_{i−1}. Applying the chain rule:

∂ℓ/∂x_{i−1} = ∂ℓ/∂x_i · ∂F_i(x_{i−1}, W_i)/∂x_{i−1}

Dimensionally, the previous equation is as follows:

[1 × N_x] = [1 × N_x] · [N_x × N_x]

∂F_i(x_{i−1}, W_i)/∂x_{i−1} is the Jacobian matrix of F_i with respect to x_{i−1}. The element (k, l) of the Jacobian quantifies the variation in the k-th output when a variation is exerted on the l-th input.
The equation above is a recurrence equation!
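The forward pass plus this recurrence can be sketched in NumPy; this is an illustrative implementation assuming tanh activations and a squared-error cost C(x_n, Y) = ||x_n − Y||^2, neither of which is fixed by the slides:

```python
import numpy as np

def backprop(x, y, Ws, bs):
    """Back-propagation for x_i = tanh(W_i x_{i-1} + b_i),
    with cost C(x_n, y) = ||x_n - y||^2 (hypothetical choices)."""
    # Forward pass: compute and store every layer output x_i
    xs = [x]
    for W, b in zip(Ws, bs):
        xs.append(np.tanh(W @ xs[-1] + b))
    # Backward pass: start from dl/dx_n, then recur
    # dl/dx_{i-1} = dl/dx_i * dF_i/dx_{i-1}
    dl_dx = 2 * (xs[-1] - y)                  # dl/dx_n, a [1 x Nx] row
    grads_W, grads_b = [], []
    for W, x_prev, x_out in zip(Ws[::-1], xs[-2::-1], xs[:0:-1]):
        df = dl_dx * (1 - x_out**2)           # through tanh'(u) = 1 - tanh(u)^2
        grads_W.append(np.outer(df, x_prev))  # dl/dW_i
        grads_b.append(df)                    # dl/db_i
        dl_dx = df @ W                        # the recurrence: dl/dx_{i-1}
    return grads_W[::-1], grads_b[::-1]

rng = np.random.default_rng(0)
Ws = [rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]
bs = [rng.normal(size=3), rng.normal(size=1)]
x, y = rng.normal(size=2), np.array([0.5])
gW, gb = backprop(x, y, Ws, bs)
```

One forward sweep stores all the x_i; one backward sweep reuses them, so every gradient is obtained at roughly the cost of two passes through the network.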
29 BACK-PROPAGATION (BP)

To sequentially compute all the gradients needed by SGD, a backward sweep is applied, called the back-propagation algorithm, which precisely makes use of the recurrence equation for ∂ℓ/∂x_i:

1. ∂ℓ/∂x_n = ∂C(x_n, Y)/∂x_n
2. ∂ℓ/∂x_{n−1} = ∂ℓ/∂x_n · ∂F_n(x_{n−1}, W_n)/∂x_{n−1}
3. ∂ℓ/∂W_n = ∂ℓ/∂x_n · ∂F_n(x_{n−1}, W_n)/∂W_n
4. ∂ℓ/∂x_{n−2} = ∂ℓ/∂x_{n−1} · ∂F_{n−1}(x_{n−2}, W_{n−1})/∂x_{n−2}
5. ∂ℓ/∂W_{n−1} = ∂ℓ/∂x_{n−1} · ∂F_{n−1}(x_{n−2}, W_{n−1})/∂W_{n−1}
6. . . . until we reach the first, input layer
7. → all the gradients ∂ℓ/∂W_i, ∀i = 1, . . . , n, have been computed!

30 ACTIVATION FUNCTIONS EXAMPLES