Lecture 6: Multilayer Neural Networks

Topics of this lecture:
Multilayer perceptron (MLP)
◦ MLP-based inference (decision making)
◦ Back propagation algorithm
Autoencoder
◦ Training of autoencoder
◦ Autoencoder for "internal representation"
◦ Training with l2-norm regularization
◦ Training with sparsity regularization
◦ Training a deep neural network

One of the most popular neural network models is the multilayer feedforward neural network, commonly known as the multilayer perceptron (MLP). In an MLP, neurons are arranged in layers: there is one input layer, one output layer, and one or several (or many) hidden layers. A three-layer MLP is known to be a universal approximator (capable, in principle, of solving any problem), but using more layers can solve the same problem more efficiently (using fewer neurons) and more effectively (with higher accuracy).

[Figure: a three-layer MLP]

MLP-based inference

Suppose that $x^{(i)}$ is the output of the $i$-th layer, $i = 0, 1, \dots, K$:
◦ $x^{(0)} = x$: the external input
◦ $x^{(K)} = y$: the final output

The output of the MLP can be found layer by layer (starting from $i = 1$) as follows:

$u^{(i)} = W^{(i)} x^{(i-1)} - \theta^{(i)}$   (4)
$x^{(i)} = g(u^{(i)})$   (5)

[Figure: layer 0 ($x^{(0)} = x \in R^{N_0}$) up to layer K ($y = x^{(K)} \in R^{N_K}$)]

$u_j^{(i)}$ is called the effective input of the $j$-th neuron in the $i$-th layer.
◦ $N_i$: dimension (number of neurons) of the $i$-th layer.
◦ $N_0$: number of inputs.
◦ $N_K$: number of classes.
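As a concrete illustration of Eqs. (4)-(5), the forward pass can be sketched in a few lines of Matlab. The 2-3-2 network size, the random weights, and the logistic activation g are illustrative assumptions, not values taken from the slides.

g = @(u) 1 ./ (1 + exp(-u));          % activation function (example choice)
W     = {randn(3,2), randn(2,3)};     % W{i}: N_i-by-N_{i-1} weight matrix of layer i
theta = {zeros(3,1), zeros(2,1)};     % theta{i}: threshold vector of layer i
x = [0.5; -1.0];                      % x^(0): the external input

K  = numel(W);                        % number of layers (excluding the input layer)
xi = x;
for i = 1:K
    u  = W{i} * xi - theta{i};        % Eq. (4): effective input of layer i
    xi = g(u);                        % Eq. (5): output of layer i
end
y = xi;                               % x^(K): the final output of the MLP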

Physical meaning of the outputs

The activation functions of the hidden layers are usually the same. For the output layer, the following softmax function is often used for pattern classification.

$y_n = x_n^{(K)} = \dfrac{\exp(u_n^{(K)})}{\sum_{m=1}^{N_K} \exp(u_m^{(K)})}, \quad n = 1, 2, \dots, N_K$   (6)

This output can be considered the conditional probability $p(C_n \mid x)$. That is, through training, the MLP can approximate the posterior probabilities, and the trained MLP can be used as a set of discriminant functions for making decisions.
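A minimal Matlab sketch of Eq. (6) is given below. The shift by max(uK) is a standard numerical-stability trick (it cancels in the ratio); it is an addition for illustration, not part of the slide.

softmax = @(uK) exp(uK - max(uK)) ./ sum(exp(uK - max(uK)));   % Eq. (6)

uK = [2.0; 1.0; 0.1];   % example effective inputs u^(K) for N_K = 3 classes
y  = softmax(uK);       % outputs sum to 1 and can be read as p(C_n | x)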

Objective function for training

Suppose that we have a training set given by

$\Omega = \{ (x_1, d_1), \dots, (x_N, d_N) \}$

Neural network training is an optimization problem, and the variables to optimize are the weights (including the biases). For regression, the following squared error function is often used as the objective function:

$E(W) = \frac{1}{2} \sum_{n=1}^{N} \| d_n - y(x_n; W) \|^2$   (7)

For classification, the following cross-entropy is often used:

$E(W) = - \sum_{n=1}^{N} \sum_{m=1}^{N_K} d_{nm} \log y_{nm}(x_n, W)$   (8)

$d_{nm}$: desired output $\in \{0, 1\}$;  $y_{nm}$: the $m$-th output for the $n$-th input.
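A minimal sketch of Eq. (8), assuming the targets and outputs are stored column-wise in matrices D and Y (these names, and the small eps added to guard against log(0), are illustrative assumptions):

cross_entropy = @(D, Y) -sum(D(:) .* log(Y(:) + eps));   % Eq. (8)

D = [1 0; 0 1; 0 0];              % two examples, three classes (one-hot targets)
Y = [0.7 0.2; 0.2 0.7; 0.1 0.1];  % corresponding softmax outputs of the network
E = cross_entropy(D, Y)           % objective value E(W) on these two examples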

The algorithm

To find the weights based on the given training data, we usually use the following learning rule to update the weights iteratively:

$W_{new} = W_{old} - \epsilon \nabla E$   (9)

where $\nabla E$ is the gradient of $E$ with respect to each element of $W$, and $\epsilon$ is the learning rate. Finding the gradient of $E$ is not easy because the weights of the hidden neurons are "embedded" in a complex composite function. To solve this problem, we usually use the well-known back propagation (BP) algorithm.
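Before turning to BP, the update rule (9) itself can be illustrated on a toy objective whose gradient is known in closed form. The quadratic objective, the step size and the number of iterations are assumptions chosen only for illustration; for an MLP, the gradient comes from BP.

A = [1 2; 3 4];  W = zeros(2,2);  epsilon = 0.1;   % toy objective E(W) = ||W - A||^2
for t = 1:200
    gradE = 2 * (W - A);          % gradient of E with respect to W
    W = W - epsilon * gradE;      % Eq. (9): W_new = W_old - epsilon * gradE
end
disp(norm(W - A))                 % close to 0: the iteration has converged to A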

BP for squared error optimization

For the weights in the output layer, we have

$\dfrac{\partial E}{\partial w_{ji}^{(K)}} = \dfrac{\partial E}{\partial y_j} \dfrac{\partial y_j}{\partial u_j} \dfrac{\partial u_j}{\partial w_{ji}^{(K)}} = \left( y_j(x) - d_j \right) g'(u_j)\, x_i^{(K-1)} = \delta_j^{(K)} x_i^{(K-1)}$   (10)

For the weights in the $k$-th hidden layer, we have

$\dfrac{\partial E}{\partial w_{ji}^{(k)}} = \dfrac{\partial E}{\partial u_j^{(k)}} \dfrac{\partial u_j^{(k)}}{\partial w_{ji}^{(k)}} = \delta_j^{(k)} x_i^{(k-1)}$   (11)

$\delta_j^{(k)} = \dfrac{\partial E}{\partial u_j^{(k)}} = \sum_{n=1}^{N_{k+1}} \delta_n^{(k+1)} w_{nj}^{(k+1)} g'(u_j^{(k)})$   (12)

Using the gradients found above, we can update the weights layer by layer using Eq. (9). A small sketch of Eqs. (10)-(12) is given below.
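A minimal sketch of Eqs. (10)-(12) on a tiny 2-3-2 network for one training pair. The weights, the data and the logistic activation are illustrative assumptions; only the error signals and gradients are computed here, and the weight update itself follows Eq. (9).

g  = @(u) 1 ./ (1 + exp(-u));
gp = @(u) g(u) .* (1 - g(u));                            % derivative g'(u)

W1 = [0.1 -0.2; 0.4 0.3; -0.5 0.2];  th1 = zeros(3,1);   % hidden layer (k = 1)
W2 = [0.2 -0.1 0.3; -0.3 0.4 0.1];   th2 = zeros(2,1);   % output layer (K = 2)
x  = [1; -1];  d = [1; 0];                               % one training pair

u1 = W1*x  - th1;   x1 = g(u1);       % forward pass, Eqs. (4)-(5)
u2 = W2*x1 - th2;   y  = g(u2);

delta2 = (y - d) .* gp(u2);           % Eq. (10): output-layer error signal
delta1 = (W2' * delta2) .* gp(u1);    % Eq. (12): error signal of the hidden layer

gradW2 = delta2 * x1';                % Eq. (10): dE/dW^(2) = delta^(2) (x^(1))'
gradW1 = delta1 * x';                 % Eq. (11): dE/dW^(1) = delta^(1) (x^(0))'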

BP for cross-entropy optimization

If the outputs are calculated using the softmax function, the cross-entropy for any input $x$ is

$E = - \sum_{n=1}^{N_K} d_n \log y_n = - \sum_{n=1}^{N_K} d_n \log \dfrac{\exp(u_n^{(K)})}{\sum_{m=1}^{N_K} \exp(u_m^{(K)})}$   (13)

For the output weights, we have

$\delta_j^{(K)} = \dfrac{\partial E}{\partial u_j^{(K)}} = - \sum_{n=1}^{N_K} d_n \dfrac{1}{y_n} \dfrac{\partial y_n}{\partial u_j^{(K)}} = - d_j (1 - y_j) - \sum_{n \neq j} d_n \dfrac{1}{y_n} (- y_j y_n)$

$\qquad = - d_j + y_j \sum_{n=1}^{N_K} d_n = y_j - d_j$   (14)

In the last step, we have used the property that the desired outputs (the one-hot teacher signal) sum to 1.
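The simplification in Eq. (14), namely that the output-layer error signal reduces to y - d when the softmax is paired with the cross-entropy, is easy to verify numerically. The example values and the finite-difference step are assumptions for illustration.

softmax = @(u) exp(u - max(u)) ./ sum(exp(u - max(u)));
E       = @(u, d) -sum(d .* log(softmax(u)));        % Eq. (13) for one input

u = [0.8; -0.3; 1.2];  d = [0; 1; 0];  h = 1e-6;
delta_analytic = softmax(u) - d;                     % Eq. (14)
delta_numeric  = zeros(size(u));
for j = 1:numel(u)                                   % central-difference gradient
    e = zeros(size(u));  e(j) = h;
    delta_numeric(j) = (E(u + e, d) - E(u - e, d)) / (2*h);
end
disp(max(abs(delta_analytic - delta_numeric)))       % should be at round-off level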

BP for cross-entropy optimization

Based on Eq. (14) we can find the error signal of the output layer when the objective function is defined as the cross-entropy, and the error signals for the hidden layers can be obtained in the same way using Eqs. (11) and (12). Therefore, an MLP for classification can be trained in the same way using the BP algorithm.

For a classification problem, the desired outputs (the teacher signal) are usually given as follows:
◦ The number of outputs equals the number of classes.
◦ Only the output corresponding to the correct class is one, and all others are zeros.

Summary of MLP learning

Step 1: Initialize the weights.
Step 2: Reset the total error.
Step 3: Forward propagation
◦ Get a training example from the training set;
◦ Calculate the outputs of all layers; and
◦ Update the total error (for all data).
Step 4: Back propagation
◦ Calculate the "delta" for each layer, starting from the output layer;
◦ Calculate the gradient using Eq. (10) and Eq. (11); and
◦ Update the weights using Eq. (9).
Step 5: See if all data have been used. If NOT, return to Step 3.
Step 6: See if the total error is smaller than a desired value. If NOT, reset the total error and return to Step 2; otherwise, terminate.

(Because the weights are updated from these layer-wise "deltas", BP is also called the extended delta learning rule.) A sketch of the whole loop is given below.
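The loop in Steps 1-6 can be sketched end to end in Matlab. The toy two-class data, the 2-4-2 network, the learning rate and the stopping threshold are assumptions chosen only so that the sketch runs on its own; it reuses the forward and backward computations sketched above.

g = @(u) 1 ./ (1 + exp(-u));   gp = @(u) g(u) .* (1 - g(u));

X = [0 0 1 1; 0 1 0 1];   D = [1 0 0 1; 0 1 1 0];                % toy two-class data
W = {randn(4,2), randn(2,4)};   th = {zeros(4,1), zeros(2,1)};   % Step 1
epsilon = 0.5;   target = 0.05;   K = numel(W);

for epoch = 1:20000
    Etotal = 0;                                              % Step 2
    for n = 1:size(X, 2)                                     % Step 5: loop over the data
        xs = {X(:,n)};   us = cell(K,1);
        for i = 1:K                                          % Step 3: forward pass
            us{i} = W{i}*xs{i} - th{i};   xs{i+1} = g(us{i});
        end
        Etotal = Etotal + 0.5*sum((D(:,n) - xs{K+1}).^2);
        delta = cell(K,1);                                   % Step 4: backward pass
        delta{K} = (xs{K+1} - D(:,n)) .* gp(us{K});
        for k = K-1:-1:1
            delta{k} = (W{k+1}'*delta{k+1}) .* gp(us{k});
        end
        for k = 1:K                                          % update, Eq. (9)
            W{k}  = W{k}  - epsilon*delta{k}*xs{k}';
            th{k} = th{k} + epsilon*delta{k};                % u = Wx - theta
        end
    end
    if Etotal < target, break; end                           % Step 6
end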

Meaning of error back propagation

Starting from the output layer, we can find the delta directly from the teacher signal and the network output. We can then find the delta of each hidden layer, from top to bottom.

This delta is also called the error signal in the literature. This is why the BP algorithm is also called the extended delta learning rule.

[Figure: the error signals $\delta^{(K)}, \dots, \delta^{(2)}, \delta^{(1)}$ propagate backwards from layer K down to layer 1]

Example - 1

In this example, we try the "Classify Patterns with a Shallow Neural Network" example in Matlab. The dataset used is the cancer dataset.
◦ There are 699 data, and each datum has 9 inputs and 2 outputs (1 or 0).
We train a three-layer MLP with 10 hidden neurons. To train the MLP, we simply use the following lines:
◦ load cancer_dataset
◦ net = feedforwardnet(10);
◦ net = train(net, cancerInputs, cancerTargets);
◦ view(net) % to show the structure of the network

Example - 1

The trained network can be evaluated by using it as a function:
◦ outputs = net(cancerInputs);
◦ errors = gsubtract(cancerTargets, outputs);
◦ performance = perform(net, cancerTargets, outputs)
◦ performance = 0.0249

To find a proper network size (number of hidden neurons), we can use grid search and cross-validation:
1) Grid search: n_hidden = 1, 2, ..., 100.
2) 10-fold cross-validation to ensure the reliability of the selected value.
A sketch of this procedure is given below.
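A minimal sketch of the grid search with 10-fold cross-validation. The reduced candidate grid, the manual fold assignment and the suppression of the training window are assumptions made to keep the sketch short; in practice the full grid 1..100 would be searched.

load cancer_dataset                            % gives cancerInputs and cancerTargets
X = cancerInputs;   T = cancerTargets;   N = size(X, 2);
folds = 10;
fold_id = mod(randperm(N), folds) + 1;         % assign each datum to one of 10 folds
candidates = [2 5 10 20 50];                   % reduced grid of hidden-layer sizes
cv_err = zeros(size(candidates));

for c = 1:numel(candidates)
    for f = 1:folds
        tr = (fold_id ~= f);   te = ~tr;       % training / validation split
        net = feedforwardnet(candidates(c));
        net.trainParam.showWindow = false;     % suppress the training GUI
        net = train(net, X(:,tr), T(:,tr));
        Y = net(X(:,te));
        cv_err(c) = cv_err(c) + perform(net, T(:,te), Y) / folds;
    end
end
[~, best] = min(cv_err);
fprintf('Selected number of hidden neurons: %d\n', candidates(best));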

Example - 1

[Figure: the learning curves]

Example - 1

[Figure: the error histogram and the confusion matrix]

Example - 1

[Figures: the trained network (upper) and the training status (right)]

Autoencoder

What is an autoencoder?

An autoencoder is a special MLP: there is one input layer, one output layer, and one or more hidden layers. The teacher signal equals the input.

[Figure: an MLP mapping the input $x$ through the hidden layer(s) to the output $\hat{x}$]

The main purpose of using an autoencoder is to find a new (internal or latent) representation for the given feature space, with the hope of obtaining the true factors that control the distribution of the input data.

What is an autoencoder?

Using principal component analysis (PCA), we can obtain a linear autoencoder. Each hidden unit corresponds to an eigenvector of the covariance matrix, and each datum is a linear combination of these eigenvectors.

Using a non-linear activation function in the hidden layer, we can obtain a non-linear autoencoder. That is, each datum can be represented using a set of non-linear basis functions.

Training of autoencoder

By implementing an autoencoder as an MLP, we can find a more compact representation of the given problem space. Compared with a classification MLP, an autoencoder can be trained with unlabelled data. However, training of an autoencoder is still "supervised" in nature because the input itself is the teacher signal. The BP algorithm can therefore also be used for training.

Training of autoencoder

The $n$-th input is denoted by $x_n$, and the corresponding output is denoted by $\hat{x}_n$. Normally, the objective (loss) function used for training an autoencoder is defined as follows:

$E(w) = \sum_{n=1}^{N} \| x_n - \hat{x}_n \|^2$   (1)

where $w$ is a vector containing all weights and biases of the network. If the input data are binary, we can also define the objective function as the cross-entropy. For an MLP with a single hidden layer, the hidden layer is called the encoder, and the output layer is called the decoder.

Example 2

Train an autoencoder for the well-known IRIS database using Matlab. The hidden layer size is 5. We can also specify other parameters; for details, visit the web page given below.

[Matlab manual] https://www.mathworks.com/help/nnet/ref/trainautoencoder.html

Example 2

X = iris_dataset;
Nh = 5;
enc = trainAutoencoder(X, Nh);
view(enc)

Internal representation of data

X = digitTrainCellArrayData;
Nh = 36;
enc = trainAutoencoder(X, Nh);
plotWeights(enc);

• An autoencoder is trained for the dataset containing 5,000 handwritten digits.
• We used 500 iterations for training, and a hidden layer size of 36.
• The figure produced by plotWeights shows the weights of the hidden layer.
• These are similar to the Eigenfaces: the "weight image" of each hidden neuron serves as a basis image. Each datum is mapped onto these basis images, and the results are then combined to reconstruct the original image.

Internal representation of data

XTrain = digitTrainCellArrayData;
Nh = 36;   % we can get better results using a larger Nh
enc = trainAutoencoder(XTrain, Nh);
XTest = digitTestCellArrayData;
xReconstructed = predict(enc, XTest);

Reconstruction error: $\| X - \hat{X} \| = 0.0184$

Training with l2-norm regularization

The main purpose of an autoencoder is to reconstruct the input space using a smaller number of basis functions. If we use the objective function given by (1), the results may not generalize well for test data. To improve the generalization ability, a common practice is to introduce a penalty term into the objective function as follows:

$E(w) = \sum_{n=1}^{N} \| x_n - \hat{x}_n \|^2 + \lambda \| w \|^2$   (2)

$\| w \|^2 = \sum_{k=1}^{L} \sum_{j=1}^{N_k} \sum_{i=1}^{N_{k-1}} \left( w_{ji}^{(k)} \right)^2$   (3)

The effect of introducing this l2-norm penalty is to make the solution "smoother". In Matlab, this penalty is controlled by a parameter of trainAutoencoder, as sketched below.
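A minimal sketch of training with the l2-norm penalty of Eq. (2). The weight 0.004 is only an example value (the same value is used on the sparsity slide later); larger values give smoother solutions.

XTrain = digitTrainCellArrayData;
Nh = 36;
enc_l2 = trainAutoencoder(XTrain, Nh, ...
    'L2WeightRegularization', 0.004);          % lambda in Eq. (2)
xReconstructed = predict(enc_l2, digitTestCellArrayData);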

Example 3

Reconstruction error: $\| X - \hat{X} \| = 0.0261$

For this example, we cannot see the positive effect clearly. Generally speaking, however, if the inputs are noisy, regularization can give better results.

Training with sparsity regularization

In nearest-neighbour-based approximation, each datum is approximated by one of the already observed data (i.e. the nearest one). In PCA, each datum is approximated by a point in a linear space spanned by the basis vectors (eigenvectors). Using an autoencoder, each datum is approximated by a linear combination of the hidden neuron outputs. Usually, the basis functions are global in the sense that ANY given datum can be approximated well by using the same set of basis functions.

Usually, the number $N_b$ of basis functions equals the rank $r$ of the linear space. For PCA, $N_b \ll r$ because we use only the "principal" basis functions.

[Lv and ZHAO, 2007] https://www.emeraldinsight.com/doi/abs/10.1108/17427370710847327

Training with sparsity regularization

Using an autoencoder, however, we can make the number $N_h$ of hidden neurons larger than the rank $r$. Instead, we can approximate each datum using a much smaller number of hidden neurons. This way, we can encode each data point using a small number of parameters as follows:

$x = \sum_{n=1}^{m} w_{k_n} y_{k_n}$   (4)

where $m$ is the number of hidden neurons used for approximating $x$, $y_{k_n}$ is the output of the $k_n$-th hidden neuron, and $k_1 < k_2 < \dots < k_m \in [1, N_h]$.

Training with sparsity regularization

For sparse representation, we introduce another penalty term into the objective function as follows:

$E(w) = \sum_{n=1}^{N} \| x_n - \hat{x}_n \|^2 + \lambda \| w \|^2 + \beta \cdot F_{sparsity}$   (5)

To define the sparsity term $F_{sparsity}$, we need the average output activation value of a neuron, given by

$\hat{\rho}_j = \frac{1}{N} \sum_{i=1}^{N} g(u_j(x_i))$   (6)

where $N$ is the number of training data, and $u_j(x_i)$ is the effective input of the $j$-th hidden neuron for the $i$-th datum.

A neuron is very "active" if $\hat{\rho}_j$ is high. To obtain a sparse neural network, it is necessary to make the neurons less active. This way, we can reconstruct any given datum using a smaller number of hidden neurons (basis functions).

Training with sparsity regularization

Based on the average output activation value, we can define the sparsity term using the Kullback–Leibler divergence as follows:

$F_{sparsity} = \sum_{j=1}^{N_h} KL(\rho \parallel \hat{\rho}_j) = \sum_{j=1}^{N_h} \left[ \rho \log \frac{\rho}{\hat{\rho}_j} + (1 - \rho) \log \frac{1 - \rho}{1 - \hat{\rho}_j} \right]$   (7)

where $\rho$ is a sparsity parameter to be specified by the user. The smaller $\rho$ is, the sparser the network. A sketch of how Eqs. (6) and (7) can be computed is given below.
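A minimal sketch of Eqs. (6) and (7) for a trained autoencoder. The hidden size, the choice rho = 0.10, and the use of encode to obtain the hidden activations g(u_j(x_i)) are assumptions for illustration.

XTrain = digitTrainCellArrayData;
enc = trainAutoencoder(XTrain, 36);

H = encode(enc, XTrain);                % Nh-by-N matrix of hidden activations
rhoHat = mean(H, 2);                    % Eq. (6): average activation of each neuron
rho = 0.10;                             % desired sparsity proportion

KL = rho*log(rho./rhoHat) + (1-rho)*log((1-rho)./(1-rhoHat));
Fsparsity = sum(KL)                     % Eq. (7): the penalty term added in Eq. (5)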

Training with sparsity regularization

enc = trainAutoencoder(XTrain, Nh, ...
    'L2WeightRegularization', 0.004, ...
    'SparsityRegularization', 4, ...
    'SparsityProportion', 0.10);

$E(w) = \sum_{n=1}^{N} \| x_n - \hat{x}_n \|^2 + \lambda \| w \|^2 + \beta \cdot F_{sparsity}$   (5')

• By specifying a small sparsity proportion $\rho$, we can get an autoencoder with fewer active hidden neurons.
• If we use a proper norm of $w$, we can also reduce the number of non-zero weights of the hidden neurons, and make the network even sparser.

Example 4

Reconstruction error: $\| X - \hat{X} \| = 0.0268$

In this example, the reconstructed images are not better. However, with fewer active hidden neurons, it is possible to "interpret" the neural network more conveniently.

Training of deep network

Using an autoencoder, we can extract useful features for representing the input data without using "labels". The extracted features can then be used by another MLP for making the final decision. Shown on the next page is a Matlab example. In this example:
◦ The first two hidden layers are found by autoencoder training;
◦ The last layer is a soft-max layer; and
◦ The three layers are "stacked" to form a deep network, which can be re-trained with the given data using the BP algorithm.

Training of deep network

• autoenc1 = trainAutoencoder(X,Nh1,'DecoderTransferFunction','purelin');
• features1 = encode(autoenc1,X);
• autoenc2 = trainAutoencoder(features1,Nh2,'DecoderTransferFunction','purelin','ScaleData',false);
• features2 = encode(autoenc2,features1);
• softnet = trainSoftmaxLayer(features2,T,'LossFunction','crossentropy');
• deepnet = stack(autoenc1,autoenc2,softnet);
• deepnet = train(deepnet,X,T);
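The slides do not show how the stacked network is evaluated. A plausible continuation of the code above, assuming the wine data used in Example 5 on the next slide (the hidden sizes and the plotconfusion call are also assumptions):

[X, T] = wine_dataset;                 % 13 inputs, 3 classes, as in Example 5
Nh1 = 10;  Nh2 = 10;                   % assumed hidden sizes for autoenc1 / autoenc2
% ... run the seven lines above to obtain deepnet ...
Y = deepnet(X);                        % use the deep network as a function
plotconfusion(T, Y);                   % confusion matrix, as shown in Example 5
view(deepnet)                          % structure of the deep network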

Example 5

• Shown here is an example using the wine dataset.
• The figure on the right is the confusion matrix of the deep network, and
• the figure at the bottom is the structure of the deep network.

Training of deep network

To summarize, we can design a deep neural network (with K layers, not including the input layer) as follows:
◦ Step 1: i=1; X(i)=X(0); % X(0) is the given data
◦ Step 2: Train an autoencoder A(i) based on X(i);
◦ Step 3: X(i+1)=encoder(X(i));
◦ Step 4: If i < K-1, set i=i+1 and return to Step 2; otherwise, train a softmax layer on X(K), stack the K-1 encoders and the softmax layer to form the deep network, and fine-tune the whole network with BP.
A sketch of this procedure is given below.
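A minimal Matlab sketch of this design loop for a classification network. The wine data and the hidden-layer sizes are assumptions; the loop simply generalizes the two-autoencoder example shown earlier to an arbitrary number of layers.

[X, T] = wine_dataset;
Nh = [20 10];                                % sizes of the K-1 hidden layers (assumed)
feat = X;   enc = cell(1, numel(Nh));
for i = 1:numel(Nh)                          % Steps 2-4: one autoencoder per layer
    enc{i} = trainAutoencoder(feat, Nh(i), ...
        'DecoderTransferFunction', 'purelin', 'ScaleData', false);
    feat = encode(enc{i}, feat);             % Step 3: X(i+1) = encoder(X(i))
end
softnet = trainSoftmaxLayer(feat, T, 'LossFunction', 'crossentropy');
deepnet = stack(enc{:}, softnet);            % stack the encoders and the softmax layer
deepnet = train(deepnet, X, T);              % fine-tune the whole network with BP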

Training of deep network

We can also train a deep autoencoder by modifying the algorithm slightly, as follows:
◦ Step 1: i=1; X(i)=X(0); % X(0) is the given data
◦ Step 2: Train an autoencoder A(i) based on X(i);
◦ Step 3: X(i+1)=encoder(X(i));
◦ Step 4: If i < K, set i=i+1 and return to Step 2; otherwise, stack the encoders (and the corresponding decoders, in reverse order) to form the deep autoencoder, and fine-tune the whole network with BP, using the input itself as the teacher signal.

Homework

Try the Matlab program for digit reconstruction given on the following page: https://www.mathworks.com/help/nnet/ref/trainautoencoder.html
◦ See what happens if we change the parameter for l2-norm regularization; and
◦ See what happens if we change the parameter for sparsity regularization.
You may plot:
◦ The weights of the hidden neurons (as images);
◦ The outputs of the hidden neurons; or
◦ The reconstructed data.
