Lecture 10 Handwritten Digit (MNIST) Recognition Using Deep Neural Networks ELEC801, Fall 2018, KNU. Instructor: Gil-Jin Jang

DNNs for MNIST 1 Optical Character Recognition

• Convert text into machine-processable data

• Printed character recognition • Fixed font characteristics • Accuracy affected by noise

• Handwritten or printed characters • Characters not consistent • Likely to contain noise

DNNs for MNIST 2 Contents

• MNIST handwritten digit database • Neural Networks • Autoencoder • SoftMax Regression • Convolutional Neural Networks for MNIST

DNNs for MNIST 3 MNIST Database

• Modified National Institute of Standards and Technology • Large handwritten digit classification database

• Re-mix of NIST digit databases • 60k training images from American Census Bureau employees • 10k testing images from American high school students

DNNs for MNIST 4 MNIST Database (cont.)

• Format • Input: 32 x 32 grayscale images (dimension 1024) or 28 x 28 grayscale images (dimension 784) • Output: 10 labels (0-9) • Centered on center of mass
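For concreteness, a minimal sketch of loading the 28x28 version of the database with tf.keras follows (one of several ways to obtain MNIST; the package choice is an assumption about the reader's environment). The shapes match the format described above.

```python
# Minimal sketch: load MNIST and check the format described above.
# Assumes TensorFlow is installed; tf.keras ships the 28x28 version of the database.
import numpy as np
from tensorflow.keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
print(x_train.shape)        # (60000, 28, 28) -- training images
print(x_test.shape)         # (10000, 28, 28) -- testing images
print(np.unique(y_train))   # [0 1 ... 9]     -- the 10 output labels

# Flatten each image to a 784-dimensional vector and scale pixels to [0, 1]
x_train = x_train.reshape(-1, 28 * 28).astype("float32") / 255.0
x_test = x_test.reshape(-1, 28 * 28).astype("float32") / 255.0
```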

DNNs for MNIST 5 MNIST Recognition Using Stacked Autoencoders Slide credits: Yahia Saeed, Jiwoong Kim, Lewis Westfall, and Ning Yang Seidenberg School of CSIS; Pace University, New York

DNNs for MNIST 6 Types of Neural Networks

General MLP (multi-layer perceptron): Layer 1 – input layer, Layer 2 – undercomplete hidden layer, Layer 3 – output layer

Auto-Encoders

• Learning target = Input • A type of unsupervised learning which tries to discover generic features of the data • Learn the identity function by learning important sub-features (not by just passing through the data) • Compression, etc. • Can use just the new features as the new training set, or concatenate them with the original features

DNNs for MNIST 8 Autoencoder Neural Network

• Unsupervised • Output intended to match input • Features captured may not be intuitive • Undercomplete constraint used
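As a concrete illustration, a minimal undercomplete autoencoder can be written in a few lines of Keras; the 784 → 100 → 784 sizes anticipate the MNIST experiment later in this lecture, and the activation/optimizer choices are illustrative assumptions.

```python
# Minimal undercomplete autoencoder sketch: the learning target is the input itself.
# Sizes (784 -> 100 -> 784) match the MNIST setup used later; activations/optimizer are illustrative.
import tensorflow as tf
from tensorflow.keras import layers, models

inputs = tf.keras.Input(shape=(784,))
code = layers.Dense(100, activation="sigmoid", name="encoder")(inputs)   # undercomplete hidden layer
outputs = layers.Dense(784, activation="sigmoid", name="decoder")(code)  # reconstruct the input
autoencoder = models.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")

# Unsupervised training: the same array is both input and target
# autoencoder.fit(x_train, x_train, epochs=10, batch_size=128)

# The encoder alone maps 784-dimensional images to 100-dimensional features
encoder = models.Model(inputs, code)
```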

DNNs for MNIST 9 Stacked Auto-Encoders • 1) Stack many (sparse) auto-encoders in succession and train them using greedy layer-wise training

10 Stacked Auto-Encoders • 2) Supervised training on the last layer using the final features (target = class labels) • 3) Then supervised training on the entire network to fine-tune all weights

11 1st Autoencoder

DNNs for MNIST 12 2nd Autoencoder

DNNs for MNIST 13 3rd Softmax Classifier

Supervised learning - Classifies the results of the autoencoders' processing of the original inputs - Goal: match the output to the correct digit label

DNNs for MNIST 14 Stacked Autoencoder for MNIST Classification • Output of hidden layer of one autoencoder input to the next autoencoder

DNNs for MNIST 15 Previous Work LeCun, Cortes, and Burges "The MNIST Database of Handwritten Digits" http://yann.lecun.com/exdb/mnist

Method                   Accuracy
Linear Classifier        92.4%
K-Nearest Neighbors      99.3%
Boosted Stumps           99.1%
Non-Linear Classifier    96.4%
SVM                      99.4%
Neural Net               99.6%
Convolutional Net        99.7%

DNNs for MNIST 16 Training

• 10k MNIST images • 1st autoencoder • 784 features / image • Encode undercomplete to 100 features / image • Decode to 784 features / image • 400 epochs • Sparsity parameter of 0.15

DNNs for MNIST 17 Training (cont.)

• 2nd autoencoder • 100 features / image • Encode undercomplete to 50 features / image • Decode to 100 features / image • 100 epochs • Sparsity parameter of 0.10 • SoftMax Classifier • 50 features / image to 1 of 10 classes / image • 400 epochs
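Putting the two training slides together, the pipeline can be sketched as below. The sparsity parameters (0.15 and 0.10) are approximated here by an L1 activity regularizer, which is an assumption (the exact sparsity penalty of the original experiment is not specified); epoch counts follow the slides, other hyper-parameters are illustrative.

```python
# Sketch of the stacked-autoencoder pipeline described above:
# 784 -> 100 (1st AE), 100 -> 50 (2nd AE), then a 10-way softmax on the 50-dim codes.
import tensorflow as tf
from tensorflow.keras import layers, models, regularizers
from tensorflow.keras.datasets import mnist

(x_all, y_all), _ = mnist.load_data()
x = x_all[:10000].reshape(-1, 784).astype("float32") / 255.0   # 10k MNIST images, as on the slide
y = y_all[:10000]

def train_autoencoder(data, in_dim, code_dim, epochs):
    inp = tf.keras.Input(shape=(in_dim,))
    code = layers.Dense(code_dim, activation="sigmoid",
                        activity_regularizer=regularizers.l1(1e-4))(inp)  # stand-in for sparsity
    out = layers.Dense(in_dim, activation="sigmoid")(code)
    ae = models.Model(inp, out)
    ae.compile(optimizer="adam", loss="mse")
    ae.fit(data, data, epochs=epochs, batch_size=128, verbose=0)   # target = input
    return models.Model(inp, code)                                  # keep only the encoder

enc1 = train_autoencoder(x, 784, 100, epochs=400)    # 1st autoencoder: 784 -> 100
h1 = enc1.predict(x)
enc2 = train_autoencoder(h1, 100, 50, epochs=100)    # 2nd autoencoder: 100 -> 50
h2 = enc2.predict(h1)

softmax = models.Sequential([layers.Dense(10, activation="softmax", input_shape=(50,))])
softmax.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
softmax.fit(h2, y, epochs=400, batch_size=128, verbose=0)           # SoftMax classifier
```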

DNNs for MNIST 18 Testing

• First results - 79.7% accuracy

• Conducted retraining with 10 labels

• Final results – 99.7% accuracy

DNNs for MNIST 19 Output of 1st Autoencoder

DNNs for MNIST 20 Output of 2nd Autoencoder

DNNs for MNIST 21 1st Confusion Matrix

DNNs for MNIST 22 Final Confusion Matrix

DNNs for MNIST 23 Characteristic of Autoencoder

• Undercomplete • 784 features -> 100 features (12.75%) • 100 features -> 50 features (6.37%)

• Sparse network • Can be compressed

DNNs for MNIST 24 Training Stacked (Deep) Autoencoders

Acknowledgments: Ruslan Salakhutdinov, Yoshua Bengio

DNNs for MNIST 25 Restricted Boltzmann Machines

• Boltzmann machine • Binary stochastic neurons are symmetrically connected (Hinton & Sejnowski, 1983)

• Restricted Boltzmann machines • Only one layer of hidden units for easier learning (Smolensky, 1986) • No connections within the same layer: hidden units are conditionally independent given the visible states

DNNs for MNIST 26 Advantage of Symmetric Connection in RBM
• Prior and posterior probabilities are inferred back and forth
• Energy function E(v,h) at the current configurations of v and h:
  $E(v,h) = -\sum_{i,j} v_i h_j w_{ij}$
• The joint distribution is the Boltzmann distribution over the energies:
  $p(v,h) = \frac{e^{-E(v,h)}}{\sum_{u,g} e^{-E(u,g)}}$
• By marginalization, the output at the hidden neuron can be described by a pdf, so the network output has a meaning:
  $p(v) = \frac{\sum_h e^{-E(v,h)}}{\sum_{u,g} e^{-E(u,g)}}, \quad p(h) = \frac{\sum_v e^{-E(v,h)}}{\sum_{u,g} e^{-E(u,g)}}$

DNNs for MNIST 27 Stacking RBM to obtain a DBM

• If we build a multi-layered network with random initialization, learning the network is very hard
  • Time-consuming
  • Falls into a local maximum
• DBM architecture
  • Learn each layer independently using a greedy learning algorithm (Hinton)
  • In the hidden layers no data is available, but a pdf is → use Gibbs sampling to obtain data samples
  • Run the back-propagation algorithm after DBM greedy learning
• Greedy learning: minimize the error between the data and the reconstructed data
  $\Delta w_{ij} \propto \langle v_i h_j \rangle^0 - \langle v_i h_j \rangle^1$

DNNs for MNIST 28 Deep Autoencoders based on DBM

• Architecture for MNIST dataset
  • Output layer: 10 binary neurons for the 10 digit classes (0~9)
  • Encoding layer: 30 neurons
  • Hidden layers: 3 hidden layers of 1000, 500, 250 neurons (decreasing)
  • Input layer: 28x28 (784) binary (0 or 1) pixels for digit image patches
  • Sparse representation: 784 pixels → 30
• Learning the deep network directly is very hard!
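For reference, the architecture above can be written down directly; as the slide notes, training it from random weights is hard, which is why the following steps pre-train one layer at a time. Activations here are illustrative assumptions, and the 10-class output branch is only added later (Step 8).

```python
# Sketch of the deep autoencoder architecture described above:
# encoder 784 -> 1000 -> 500 -> 250 -> 30, decoder mirrored back to 784.
import tensorflow as tf
from tensorflow.keras import layers, models

inp = tf.keras.Input(shape=(784,))
h = inp
for units in (1000, 500, 250):                 # hidden layers, decreasing
    h = layers.Dense(units, activation="sigmoid")(h)
code = layers.Dense(30, name="encoding")(h)    # 30-neuron encoding layer
h = code
for units in (250, 500, 1000):                 # mirrored (un-rolled) decoder
    h = layers.Dense(units, activation="sigmoid")(h)
recon = layers.Dense(784, activation="sigmoid")(h)

deep_ae = models.Model(inp, recon)
deep_ae.summary()   # roughly 1.4M weights in the encoder and the same again in the decoder
```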

DNNs for MNIST 29 Step 1: pre-training
• Learn the weights from the input layer to the 1st hidden layer
• Unsupervised learning: RBM (restricted Boltzmann machine)
• The values of the 1000 neurons are updated as well in an unsupervised manner
• Update: W1 and the 1000 neurons
• The detailed algorithms are in the appendix

DNNs for MNIST 30 RBM Training
• Prior and posterior probabilities are inferred back and forth
• Repeat altering the values of the hidden units to maximize the probabilities
• Energy function E(v,h) at the current configurations of v and h:
  $E(v,h) = -\sum_{i,j} v_i h_j w_{ij}$
• The output at the hidden neuron can be described by a pdf, so the network output has a probabilistic meaning:
  $p(v,h) = \frac{e^{-E(v,h)}}{\sum_{u,g} e^{-E(u,g)}}, \quad p(v) = \frac{\sum_h e^{-E(v,h)}}{\sum_{u,g} e^{-E(u,g)}}, \quad p(h) = \frac{\sum_v e^{-E(v,h)}}{\sum_{u,g} e^{-E(u,g)}}$

DNNs for MNIST 31 Step 2: pre-training
• Learn the weights from the 1st hidden layer to the 2nd hidden layer
• Unsupervised learning: RBM
• Update: W2 and 500 neurons
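A minimal numpy sketch of the CD-1 (contrastive divergence) update that implements the $\Delta w_{ij} \propto \langle v_i h_j \rangle^0 - \langle v_i h_j \rangle^1$ rule from the greedy-learning slide; biases, momentum, and mini-batch scheduling are omitted, and the sizes follow layer 1 (W1) of the architecture above.

```python
# One CD-1 update for a binary RBM, implementing dW_ij ∝ <v_i h_j>^0 - <v_i h_j>^1.
# Biases and momentum are omitted for brevity; sizes follow W1 (784 visible, 1000 hidden).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, lr=0.1, rng=np.random.default_rng(0)):
    # v0: (batch, n_visible) binary data; W: (n_visible, n_hidden)
    p_h0 = sigmoid(v0 @ W)                                # p(h = 1 | v0)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)    # Gibbs sample of the hidden units
    p_v1 = sigmoid(h0 @ W.T)                              # reconstruction p(v = 1 | h0)
    v1 = (rng.random(p_v1.shape) < p_v1).astype(float)
    p_h1 = sigmoid(v1 @ W)                                # hidden probabilities at the reconstruction
    dW = (v0.T @ p_h0 - v1.T @ p_h1) / len(v0)            # <v h>^0 - <v h>^1
    return W + lr * dW

W1 = 0.01 * np.random.default_rng(0).standard_normal((784, 1000))
# W1 = cd1_step(binary_image_batch, W1)   # repeat over mini-batches; then train W2 on the hidden codes
```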

DNNs for MNIST 32 Step 3: pre-training nd • Learn the weights from 2 Number of rd output hidden layer to 3 layer classes 10 • Unsupervised learning: RBM (binary) • Update: W3 and 250 neurons 30

W4 250 neurons

W3 Fix 500 neurons

W2 1000 neurons

W1 28x28

DNNs for MNIST 33 Step 4: pre-training rd • Learn the weights from 3 Number of output hidden layer to the encoding classes 10 layer (binary) • Unsupervised learning: RBM 30 • Update: W4 and 30 neurons W4 Fix 250 neurons

W3 500 neurons

W2 1000 neurons

W1 28x28

DNNs for MNIST 34 Step 5: un-roll
• Replicate the learned layers by reflecting the activations and weights
• Copy the input image (28x28) and the activations (values) in the neurons (1000, 500, 250)
• Assign the pseudo-inverse matrices of the lower layers to the higher layers
  • Because the directions are reversed
  • By orthonormalizing the weight matrices during training, inversion becomes transpose ($W^{-1} = W^T$)

DNNs for MNIST 35 Step 6: fine-tuning

• Train the whole network using the standard backpropagation algorithm
• Target: 28x28 original images
• Input: 28x28 original images
• Characteristics
  • Supervised learning (note that the previous pre-trainings are all unsupervised)
  • Layered architecture → Stacked
  • Learning target: each input image itself → Autoencoder

DNNs for MNIST 36 Step 7: applications
• As a result of autoencoder training, the input 28x28 (784) image patches can be represented by the 30 neurons in the encoding layer
  • 96% reduction
• Applications
  • Denoising / decoding: use the last layer's output
  • Compression / encoding: use the encoding layer's output
  • Feature extraction: use the encoding layer's output as features

DNNs for MNIST 37 Step 8: recognition

• Once the training is done, use the outputs in the encoding layer as features
• Hinton's output branch
  • Add a "branch layer" whose targets are the digit labels (in binary)
  • Train by standard backpropagation
  • Obtained 99.7% accuracy

DNNs for MNIST 38 How to learn the weights

[Figure: the ten class units (1 2 3 4 5 6 7 8 9 0) connected to the pixels of the input image]

Show the network an image and increment the weights from active pixels to the correct class. Then decrement the weights from active pixels to whatever class the network guesses.
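This simple rule can be sketched in a few lines of numpy (a perceptron-style learner over ten class units); the binary-pixel assumption and learning rate are illustrative.

```python
# Sketch of the rule above: increment weights from active pixels to the correct class,
# decrement weights from active pixels to the class the network currently guesses.
import numpy as np

n_pixels, n_classes = 784, 10
W = np.zeros((n_pixels, n_classes))

def update(W, image, label, lr=1.0):
    # image: binary 784-vector of active pixels; label: correct digit (0-9)
    guess = np.argmax(image @ W)     # class the network guesses right now
    W[:, label] += lr * image        # strengthen connections to the correct class
    W[:, guess] -= lr * image        # weaken connections to the guessed class
    return W
```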

DNNs for MNIST 39-43 [Figure sequence: the weights from the image pixels to the ten class units, updated after each successive training image]

DNNs for MNIST 44 The learned weights
[Figure: the learned weights from the image pixels to the ten class units]
The details of the learning algorithm will be explained in future lectures.

DNNs for MNIST 45 A comparison of methods for compressing digit images to 30 real numbers.

[Figure rows: real data / 30-D deep autoencoder / 30-D logistic PCA / 30-D PCA]

DNNs for MNIST 46 Slide credit: Geoffrey Hinton MNIST Recognition using Autoencoders http://www.cs.toronto.edu/~hinton/adi/index.htm

DNNs for MNIST 47 Deep autoencoder applications: Document retrieval (Geoffrey Hinton); Image classification using deep autoencoders (Krizhevsky et al.)

DNNs for MNIST 48 Document Retrieval

• Problem: "How to find documents that are similar to a query document?"
• Convert each document into a "bag of words"
  • Vector of word counts ignoring order
  • Ignore stop words (like "the" or "over")
• Use the word counts of the query document as the input to an autoencoder
[Figure: example word-count vector over a small vocabulary (fish, cheese, vector, count, school, query, reduce, bag, pulpit, iraq, word)]
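A small sketch of the bag-of-words conversion described above; the vocabulary and stop-word list here are illustrative placeholders, not the 2000-word vocabulary of the actual experiment.

```python
# Sketch: convert a document into a word-count vector, ignoring order and stop words.
from collections import Counter

STOP_WORDS = {"the", "over", "a", "of", "and", "to"}
VOCAB = ["fish", "cheese", "vector", "count", "school", "query", "reduce", "bag", "pulpit", "iraq", "word"]

def bag_of_words(text):
    tokens = [w for w in text.lower().split() if w not in STOP_WORDS]
    counts = Counter(tokens)
    return [counts[w] for w in VOCAB]   # fixed-length count vector, the autoencoder input

print(bag_of_words("reduce the word counts of the query document to a bag of words"))
```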

Slide credit: Geoffrey Hinton, Coursera lecture DNNs for MNIST 49 How to compress the count vector

• We train the neural network to reproduce its input vector as its output
• This forces it to compress as much information as possible into the 10 numbers in the central bottleneck
• These 10 numbers are then a good way to compare documents
[Figure: architecture — input 2000 word counts → 500 → 250 → 10 (bottleneck) → 250 → 500 → output 2000 reconstructed counts]
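A sketch of the 2000-500-250-10-250-500-2000 compressor above, with the 10-number bottleneck exposed so documents can be compared by distance between their codes; activations and optimizer are illustrative assumptions.

```python
# Sketch of the 2000-500-250-10-250-500-2000 document autoencoder described above.
# Train with target = input count vectors; compare documents by their 10-number codes.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

inp = tf.keras.Input(shape=(2000,))                                   # 2000 word counts
h = layers.Dense(500, activation="relu")(inp)
h = layers.Dense(250, activation="relu")(h)
code = layers.Dense(10, name="bottleneck")(h)                         # central 10-number code
h = layers.Dense(250, activation="relu")(code)
h = layers.Dense(500, activation="relu")(h)
out = layers.Dense(2000, activation="relu")(h)                        # reconstructed counts

doc_ae = models.Model(inp, out)
doc_ae.compile(optimizer="adam", loss="mse")
# doc_ae.fit(count_vectors, count_vectors, epochs=50, batch_size=256)

encoder = models.Model(inp, code)
# codes = encoder.predict(count_vectors)
# nearest = np.argsort(np.linalg.norm(codes - query_code, axis=1))[:10]
```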

Slide credit: Geoffrey Hinton, Coursera lecture DNNs for MNIST 50 Retrieval performance on 400,000 Reuters business news stories

LSA is a version of PCA

Slide credit: Geoffrey Hinton, Coursera lecture DNNs for MNIST 51 Clustering by PCA: First compress all documents to 2 numbers using PCA on log(1+count). Then use different colors for different document categories.

Slide credit: Geoffrey Hinton, Coursera lecture DNNs for MNIST 52 Clustering by Autoencoders: First compress all documents to 2 numbers using a deep autoencoder. Then use different colors for different document categories.

Slide credit: Geoffrey Hinton, Coursera lecture DNNs for MNIST 53 Image Retrieval: Krizhevsky's deep autoencoder

• The encoder has about 67,000,000 parameters
• There is no theory to justify this architecture
• It takes a few days on a GTX 285 GPU to train on two million images
[Figure: encoder layer sizes — 1024, 1024, 1024 input units → 8192 → 4096 → 2048 → 1024 → 512 → 256-bit binary code]
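The retrieval idea shown on the next slides, ranking database images by Hamming distance between their 256-bit codes, can be sketched with numpy bit operations; the codes below are random placeholders standing in for an encoder's binary outputs.

```python
# Sketch: retrieval with 256-bit binary codes, ranking images by Hamming distance to the query.
import numpy as np

rng = np.random.default_rng(0)
db_codes = rng.integers(0, 2, size=(100_000, 256), dtype=np.uint8)   # database codes (placeholder)
query = rng.integers(0, 2, size=256, dtype=np.uint8)                 # query code (placeholder)

packed_db = np.packbits(db_codes, axis=1)        # 256 bits -> 32 bytes per image
packed_q = np.packbits(query)

# Hamming distance = number of differing bits = popcount of the XOR
diff_bits = np.unpackbits(np.bitwise_xor(packed_db, packed_q), axis=1)
hamming = diff_bits.sum(axis=1)
nearest = np.argsort(hamming)[:10]               # indices of the 10 closest database images
```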

Slide credit: Geoffrey Hinton, lecture notes in Coursera.org. DNNs for MNIST 54 Image Retrieval Results: retrieved using 256-bit codes

retrieved using Euclidean distance in pixel intensity space

Slide credit: Geoffrey Hinton, lecture notes in Coursera.org. DNNs for MNIST 55 Retrieved using 256-bit codes

retrieved using Euclidean distance in pixel intensity space

Slide credit: Geoffrey Hinton, lecture notes in Coursera.org. DNNs for MNIST 56 Summary

• Stacked deep autoencoder • Layer-wise pretraining for stable learning of weights • Applications • Reconstruction, noise removal, recognition, and retrieval • Restrictions • Sensitive to translations, scales, rotations • Computationally expensive

DNNs for MNIST 57 Convolutional neural networks

Convolution operation, LeNet architecture

Acknowledgments: Geoffrey Hinton, Yann LeCun, Antonio Torralba, Boris Ginzburg

DNNs for MNIST 58 CNN (Convolutional Neural Network) • Neural network architecture for image detection and recognition • First proposed by Yann LeCun, 1998 • http://yann.lecun.com – with codes, tutorials • Many hidden layers • Resembles human visual system • Sliding window = scanning images by sliding windows • Enables detecting objects with invariance to changes in location, rotation, scale, etc. • Kernel weights shared over different locations • Pooling • Provides abstraction as well as dimension reduction

DNNs for MNIST 59 Why CNN?

Higher layer responses: class-specific features

2nd layer: structured edges, corners, blobs, etc.

1st layer: primitive, localized edge filters

Picture credit: Honglak Lee et al. in Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks. Communications of the ACM. DNNs for MNIST 60 Convolutive Windowing (Scanning)

• 2-dimensional convolution • Computes the correlation of the given kernel window with every location of the input, with the kernel reversed (flipped) • Input: 32x32 • Kernel (window) size: 5x5 • Output response: 28x28 • Result of the convolution architecture • Learning kernels: extracting the local characteristics of the input images • Output images: responses to the learned kernels
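A minimal numpy/scipy sketch of the sizes above; note that scipy's convolve2d reverses (flips) the kernel, which is exactly the "reverse direction" distinction from plain correlation mentioned in the slide.

```python
# Sketch: 2-D convolution of a 32x32 input with a 5x5 kernel gives a 28x28 response map.
import numpy as np
from scipy.signal import convolve2d

image = np.random.default_rng(0).random((32, 32))   # placeholder 32x32 input
kernel = np.random.default_rng(1).random((5, 5))    # 5x5 kernel window

response = convolve2d(image, kernel, mode="valid")  # only positions where the kernel fully fits
print(response.shape)                               # (28, 28)
```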

DNNs for MNIST 61 Lenet-5 CNN Architecture for MNIST

C1: 1st layer – CONVOLUTION - 6 kernel windows of size 5x5 - 32 x 32 pixels → 28x28 responses by a single window - Total 6 output maps of size 28 x 28 (6@28x28) are generated from a single image

Number of parameters to train - 6 x 5 x 5 + 6 (bias) = 156 (much less than using full connection)

DNNs for MNIST 62 Picture credit: Christopher Mitchell Lenet-5: Pooling (subsampling)

S2: 2nd layer – POOLING - No convolution - 1 out of 2x2 pixels is sampled - Method: max (referred to as Max Pooling, most popular), mean (Average Pooling), Fractional Pooling [Graham, CVPR 2015], etc. - Total 6 output maps of size 14 x 14 (6@14x14) are generated from the C1 layer

Number of parameters to train - None. DNNs for MNIST 63 LeNet-5: C3

C3: 3rd layer – CONVOLUTION - 16 kernel windows of size 5x5 - 6 maps of 14x14 pixels → 6@10x10 responses by a single window - 6@10x10 → 1@10x10 image by averaging (AvgOut) or taking the max (MaxOut) [Goodfellow, JMLR 2013] - Total 16 windows → 16@10x10 are generated from a single image. Number of parameters to train - 16 x 5 x 5 + 16 = 416 (still much less than using full connection)

DNNs for MNIST 64 Lenet-5: S4 and C5

S4: 4th layer – POOLING - 1 out of 2x2 outputs is sampled - 16@5x5 outputs are generated from the C3 layer. C5: 5th layer – CONVOLUTION - 120 kernel windows of size 5x5 are added - 5x5 pixels → 1x1 response per window - 120 outputs, finally. Number of parameters to train - 120 x 5 x 5 + 120 = 3,120 (still much less than using full connection)

DNNs for MNIST 65 LeNet-5: Full Connection Layers

F6: 6th layer – FULL connection - fully connected layers of 120 → 84 → 10 units

Output: - 10 binary labels (targets, 0~9) DNNs for MNIST 66 Training CNN

• Standard backpropagation • Activation functions: Sigmoid (vanishing gradient problem) or ReLU (rectified linear unit) • Over-fitting prevention: DropOut • Requires huge computation → GPU implementation • Performance on MNIST • The error rate of LeNet-5 [1998] is 0.8% - 0.95% • Deep autoencoder 784-500-500-2000-30 [2007] + nearest neighbor: 1.0% • Results of other classifiers are also available at http://yann.lecun.com/exdb/mnist/
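A LeNet-style model with the layer sizes from the preceding slides can be sketched in Keras as below; this is an illustrative simplification (plain convolutions, max pooling, ReLU) rather than the original LeNet-5 with its connection table and scaled tanh units.

```python
# LeNet-style sketch following the sizes above:
# C1: 6@5x5, S2: 2x2 pooling, C3: 16@5x5, S4: 2x2 pooling, C5: 120@5x5, F6: 84, output: 10.
import tensorflow as tf
from tensorflow.keras import layers, models

lenet = models.Sequential([
    layers.Conv2D(6, 5, activation="relu", input_shape=(32, 32, 1)),   # C1 -> 6@28x28
    layers.MaxPooling2D(2),                                            # S2 -> 6@14x14
    layers.Conv2D(16, 5, activation="relu"),                           # C3 -> 16@10x10
    layers.MaxPooling2D(2),                                            # S4 -> 16@5x5
    layers.Conv2D(120, 5, activation="relu"),                          # C5 -> 120@1x1
    layers.Flatten(),
    layers.Dense(84, activation="relu"),                               # F6
    layers.Dense(10, activation="softmax"),                            # 10 digit classes
])
lenet.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# lenet.fit(x_train_32, y_train, epochs=10)   # 28x28 MNIST images zero-padded to 32x32
```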

DNNs for MNIST 67 Learning a Compositional Hierarchy of Object Structure

Parts model

The architecture

Learned parts. Fidler & Leonardis, CVPR'07; Fidler, Boben & Leonardis, CVPR 2008; copied from Torralba's slides. DNNs for MNIST 68 From Common Basic Features to Class-Specific Features

Picture credit: Honglak Lee et al. in Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks. Communications of the ACM. DNNs for MNIST 69 Learning a Compositional Hierarchy of Object Structure

Fidler & Leonardis, CVPR'07; Fidler, Boben & Leonardis, CVPR 2008; copied from Torralba's slides. DNNs for MNIST 70 Conclusion

• CNN implementation for MNIST

• Next • Implementation

DNNs for MNIST 71