Lecture 10: Handwritten Digit (MNIST) Recognition Using Deep Neural Networks
ELEC801 Pattern Recognition, Fall 2018, KNU
Instructor: Gil-Jin Jang
DNNs for MNIST 1 Optical Character Recognition
• Convert text into machine processable data
• Printed character recognition • Fixed font characteristics • Accuracy affected by noise
• Handwritten or printed characters • Characters not consistent • Likely to contain noise
Contents
• MNIST handwritten digit database • Neural Networks • Autoencoder • SoftMax Regression • Convolutional Neural Networks for MNIST
MNIST Database
• Modified National Institute of Standards and Technology • Large handwritten digit classification database
• Re-mix of NIST digit databases • 60k training images and 10k testing images • Writers are American Census Bureau employees and American high school students, mixed in both sets
MNIST Database (cont.)
• Format • Input: 32 x 32 grayscale images (dimension 1024) or 28 x 28 grayscale images (dimension 784) • Output: 10 labels (0-9) • Centered on center of mass
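The input and output formats above can be sketched in a few lines. This is a minimal illustration, assuming NumPy; the helper names `flatten_image` and `one_hot` are hypothetical, and the all-zero arrays are placeholders rather than real MNIST images.

```python
import numpy as np

def flatten_image(image):
    """Flatten a 2-D grayscale image into a 1-D feature vector."""
    return np.asarray(image, dtype=np.float64).reshape(-1)

def one_hot(label, num_classes=10):
    """Encode a digit label 0-9 as a 10-dimensional target vector."""
    target = np.zeros(num_classes)
    target[label] = 1.0
    return target

# Placeholder images (all zeros), used only to check the dimensions.
x28 = flatten_image(np.zeros((28, 28)))   # 28 x 28 = 784 features
x32 = flatten_image(np.zeros((32, 32)))   # 32 x 32 = 1024 features
t = one_hot(3)                            # target vector for the digit 3
```

The flattened vector is what the fully connected networks in this lecture take as input; the convolutional networks later keep the 2-D layout instead.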
MNIST Recognition Using Stacked Autoencoders
Slide credits: Yahia Saeed, Jiwoong Kim, Lewis Westfall, and Ning Yang, Seidenberg School of CSIS, Pace University, New York
Types of Neural Networks
General MLP (multi-layer perceptron)
[Figure: three-layer network. Layer 1 – Input layer; Layer 2 – Undercomplete hidden layer; Layer 3 – Output layer]
Auto-Encoders
• Learning target = Input • A type of unsupervised learning which tries to discover generic features of the data • Learn identity function by learning important sub-features (not by just passing through data) • Compression, etc. • Can use just new features in the new training set or concatenate both
Autoencoder Neural Network
• Unsupervised • Backpropagation • Output intended to match input • Features captured may not be intuitive • Undercomplete constraint used
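A minimal sketch of an undercomplete autoencoder trained by backpropagation, with the input itself as the target. It assumes NumPy, sigmoid units, a squared-error objective, and plain batch gradient descent; the function name `train_autoencoder`, the layer sizes, and the toy data are illustrative assumptions, not the setup used in the experiments on the following slides (no sparsity term is included).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, n_hidden, epochs=200, lr=0.5, seed=0):
    """Train a one-hidden-layer autoencoder with batch backpropagation.

    Returns (W_enc, b_enc, W_dec, b_dec, losses), where losses holds the
    mean squared reconstruction error measured before each update.
    """
    rng = np.random.default_rng(seed)
    n_samples, n_in = X.shape
    W_enc = rng.normal(0.0, 0.1, (n_in, n_hidden))
    b_enc = np.zeros(n_hidden)
    W_dec = rng.normal(0.0, 0.1, (n_hidden, n_in))
    b_dec = np.zeros(n_in)
    losses = []
    for _ in range(epochs):
        H = sigmoid(X @ W_enc + b_enc)      # encode: n_in -> n_hidden
        Y = sigmoid(H @ W_dec + b_dec)      # decode: n_hidden -> n_in
        err = Y - X                         # the target is the input itself
        losses.append(float(np.mean(err ** 2)))
        # Gradient of 0.5 * (sum of squared errors) / n_samples.
        dY = err * Y * (1.0 - Y) / n_samples
        dH = (dY @ W_dec.T) * H * (1.0 - H)
        W_dec -= lr * (H.T @ dY)
        b_dec -= lr * dY.sum(axis=0)
        W_enc -= lr * (X.T @ dH)
        b_enc -= lr * dH.sum(axis=0)
    return W_enc, b_enc, W_dec, b_dec, losses

# Toy data: 40 samples drawn from 4 binary prototype patterns (not MNIST).
rng = np.random.default_rng(1)
prototypes = (rng.random((4, 16)) > 0.5).astype(float)
X = prototypes[rng.integers(0, 4, size=40)]
W_enc, b_enc, W_dec, b_dec, losses = train_autoencoder(X, n_hidden=4)
codes = sigmoid(X @ W_enc + b_enc)   # 4 learned features per 16-dimensional input
```

The hidden layer is narrower than the input (4 < 16), which is the undercomplete constraint: the network cannot simply copy its input through.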
Stacked Auto-Encoders
• 1) Stack many (sparse) auto-encoders in succession and train them using greedy layer-wise training
• 2) Supervised training on the last layer using final features (target = class label)
• 3) Then supervised training on the entire network to fine-tune all weights
1st Autoencoder
2nd Autoencoder
3rd Softmax Classifier
Supervised learning - Classifies results of autoencoder processing of original inputs - Goal to match output to the class label of the original input
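The softmax classifier stage can be sketched as multinomial logistic regression on top of the autoencoder features. This is a minimal sketch assuming NumPy and a cross-entropy loss; `softmax_regression_step` is a hypothetical helper, and the random features stand in for real autoencoder outputs.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, with max-subtraction for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def softmax_regression_step(W, b, features, labels, lr=0.1):
    """One gradient step on the mean cross-entropy loss.

    Updates W and b in place and returns the loss measured before the step.
    """
    n = features.shape[0]
    probs = softmax(features @ W + b)
    loss = -np.mean(np.log(probs[np.arange(n), labels]))
    grad = probs.copy()
    grad[np.arange(n), labels] -= 1.0     # dL/dlogits = probs - one_hot(labels)
    grad /= n
    W -= lr * (features.T @ grad)
    b -= lr * grad.sum(axis=0)
    return loss

rng = np.random.default_rng(0)
features = rng.normal(size=(20, 50))      # stand-in for 50 features per image
labels = rng.integers(0, 10, size=20)     # stand-in digit labels 0-9
W = np.zeros((50, 10))
b = np.zeros(10)
losses = [softmax_regression_step(W, b, features, labels) for _ in range(50)]
```

With zero initial weights every class gets probability 0.1, so the first loss equals ln(10); training then lowers it.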
Stacked Autoencoder for MNIST Classification
• The hidden-layer output of one autoencoder is the input to the next autoencoder
Previous Work
LeCun, Cortes, and Burges, "The MNIST Database of Handwritten Digits", http://yann.lecun.com/exdb/mnist
Method Accuracy
Linear Classifier 92.4%
K Nearest Neighbor 99.3%
Boosted Stumps 99.1%
Non-Linear Classifier 96.4%
SVM 99.4%
Neural Net 99.6%
Convolutional Net 99.7%
Training
• 10k MNIST images • 1st autoencoder • 784 features / image • Encode undercomplete to 100 features / image • Decode to 784 features / image • 400 epochs • Sparsity parameter of 0.15
Training (cont.)
• 2nd autoencoder • 100 features / image • Encode undercomplete to 50 features / image • Decode to 100 features / image • 100 epochs • Sparsity parameter of 0.10
• SoftMax Classifier • 50 features / image to 1 of 10 classes / image • 400 epochs
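Once the two encoders and the softmax classifier are trained, they form a 784-100-50-10 network. The sketch below only checks the shapes and counts the weights of that stack; it assumes NumPy, sigmoid encoders, and that the decoder halves are discarded after pre-training. The weights are random placeholders and `init_stack` and `forward` are hypothetical helpers.

```python
import numpy as np

LAYER_SIZES = [784, 100, 50, 10]   # input, 1st encoder, 2nd encoder, softmax classes

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def init_stack(sizes, seed=0):
    """Random stand-ins for the trained (weights, bias) pair of each layer."""
    rng = np.random.default_rng(seed)
    return [(rng.normal(0.0, 0.01, (n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def forward(stack, X):
    """Encoders use sigmoid units; the last layer is the softmax classifier."""
    a = X
    for W, b in stack[:-1]:
        a = sigmoid(a @ W + b)
    W, b = stack[-1]
    return softmax(a @ W + b)

stack = init_stack(LAYER_SIZES)
probs = forward(stack, np.zeros((5, 784)))            # 5 placeholder images
n_params = sum(W.size + b.size for W, b in stack)     # weights plus biases
```

Under these assumptions the stack has 784*100 + 100 + 100*50 + 50 + 50*10 + 10 = 84,060 trainable parameters, all of which are adjusted during the final fine-tuning pass.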
Testing
• First results - 79.7% accuracy
• Conducted retraining with 10 labels
• Final results – 99.7% accuracy
[Figure slides: Output of 1st Autoencoder; Output of 2nd Autoencoder; 1st Confusion Matrix; Final Confusion Matrix]
Characteristic of Autoencoder
• Undercomplete • 784 features -> 100 features (12.76% of 784) • 100 features -> 50 features (6.38% of 784)
• Sparse network • Can be compressed
Training Stacked (Deep) Autoencoders
Acknowledgments: Geoffrey Hinton, Ruslan Salakhutdinov, Yoshua Bengio
Restricted Boltzmann Machines
• Boltzmann machine • Binary stochastic neurons are symmetrically connected (Hinton & Sejnowski, 1983)
• Restricted Boltzmann machines • Only one layer of hidden units for easier learning (Smolensky, 1986) • No connection in the same layer: hidden units are conditionally independent given the visible states
[Figure: RBM with hidden units h1, h2, h3 and visible units v1, v2, v3; edges labeled w11, w12, w13]
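The conditional-independence claim can be checked by brute force on an RBM as small as the one pictured. The sketch assumes NumPy, binary units, no bias terms, and the energy E(v,h) = -sum_ij v_i h_j w_ij with p(v,h) proportional to exp(-E(v,h)); the random weights and the helper names are illustrative only. The factorized form p(h_j = 1 | v) = sigmoid(sum_i v_i w_ij) follows from that energy.

```python
import itertools
import numpy as np

def energy(v, h, W):
    """E(v, h) = -sum_ij v_i * h_j * w_ij (bias terms omitted)."""
    return -float(v @ W @ h)

def joint_distribution(W):
    """Enumerate p(v, h) over every binary configuration of a tiny RBM."""
    n_v, n_h = W.shape
    states_v = [np.array(s, dtype=float) for s in itertools.product([0, 1], repeat=n_v)]
    states_h = [np.array(s, dtype=float) for s in itertools.product([0, 1], repeat=n_h)]
    unnorm = np.array([[np.exp(-energy(v, h, W)) for h in states_h] for v in states_v])
    return states_v, states_h, unnorm / unnorm.sum()

rng = np.random.default_rng(0)
W = rng.normal(0.0, 1.0, (3, 3))           # 3 visible x 3 hidden, random weights
states_v, states_h, p_vh = joint_distribution(W)

# p(h | v) for one visible state, by normalizing a row of the joint table.
v_index = 5                                 # visible state (1, 0, 1)
p_h_given_v = p_vh[v_index] / p_vh[v_index].sum()

# Factorized form: each hidden unit is an independent Bernoulli variable.
p_on = 1.0 / (1.0 + np.exp(-(states_v[v_index] @ W)))
factorized = np.array([np.prod(np.where(h == 1, p_on, 1.0 - p_on)) for h in states_h])
```

The two arrays agree, which is why all hidden units of an RBM can be sampled in one parallel step given the visible layer.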
Advantage of Symmetric Connection in RBM
• Prior and posterior probabilities are inferred back and forth • By marginalization
• Energy function E(v,h) at the current configurations of v and h: $E(v,h) = -\sum_{i,j} v_i h_j w_{ij}$
• The output at the hidden neuron can be described by a pdf: $p(v,h) \propto e^{-E(v,h)}$
• The network output has a meaning: $p(v,h) = \frac{e^{-E(v,h)}}{\sum_{u,g} e^{-E(u,g)}}$, $p(v) = \frac{\sum_{h} e^{-E(v,h)}}{\sum_{u,g} e^{-E(u,g)}}$, $p(h) = \frac{\sum_{v} e^{-E(v,h)}}{\sum_{u,g} e^{-E(u,g)}}$
[Figure: RBM with hidden units h1, h2, h3 and visible units v1, v2, v3]
Stacking RBM to obtain a DBM
• If we build a multi-layered network with random initialization, learning the network is very hard • Time-consuming • Falls into local optima
• DBM architecture • Learn the layers independently using greedy learning algorithm (Hinton) • In hidden layers, no data is available, only a pdf; use Gibbs sampling to obtain data samples • Run Back-Propagation algorithm after DBM greedy learning
• Greedy learning: minimize the error of the data and the reconstructed data: $\Delta w_{ij} \propto \langle v_i h_j \rangle^0 - \langle v_i h_j \rangle^1$
[Figure: layer stack V (data) -W1-> H1 -W2-> H2 -W3-> H3 (output); V is visible, H1 to H3 are hidden]
Deep Autoencoders based on DBM
• Architecture for MNIST dataset • Output layer • 10 binary neurons for 10 digit classes (0~9) • Encoding layer • 30 neurons • Hidden layers • 3 hidden layers of 1000, 500, 250 neurons (decreasing) • Input layer • 28x28 (784) binary (0 or 1) pixels for digit image patches
• Sparse representation: 784 pixels -> 30
• Learning the deep network directly is very hard!!
[Figure: network stack, bottom to top: 28x28 -W1-> 1000 neurons -W2-> 500 neurons -W3-> 250 neurons -W4-> 30 -> 10 output classes (binary)]
Step 1: pre-training
• Learn the weights from input layer to the 1st hidden layer • Unsupervised learning: RBM (restricted Boltzmann machine) • The values for 1000 neurons are updated as well in an unsupervised manner • Update: W1 and 1000 neurons • The detailed algorithms are in the appendix
[Figure: network stack, bottom to top: 28x28 (input, Fix) -W1-> 1000 neurons -W2-> 500 neurons -W3-> 250 neurons -W4-> 30 -> 10 output classes (binary)]
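A pre-training step of this kind is commonly implemented with one-step contrastive divergence (CD-1). The sketch below is one possible reading, assuming NumPy, binary units, no bias terms, and probabilities (rather than samples) for the reconstruction; `cd1_update`, the layer sizes, and the toy data are hypothetical and much smaller than the 784-1000 layer on the slide.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(W, v0, rng, lr=0.1):
    """One CD-1 step for a bias-free binary RBM.

    v0 is a batch of binary visible vectors, shape (n_samples, n_visible).
    Returns the updated weights and the mean squared reconstruction error.
    """
    p_h0 = sigmoid(v0 @ W)                                # hidden probabilities given the data
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)    # one Gibbs sample of the hidden states
    p_v1 = sigmoid(h0 @ W.T)                              # reconstruction of the visible layer
    p_h1 = sigmoid(p_v1 @ W)                              # hidden probabilities given the reconstruction
    positive = v0.T @ p_h0                                # <v_i h_j> measured on the data
    negative = p_v1.T @ p_h1                              # <v_i h_j> measured on the reconstruction
    W_new = W + lr * (positive - negative) / v0.shape[0]
    recon_error = float(np.mean((v0 - p_v1) ** 2))
    return W_new, recon_error

# Toy data: 40 samples drawn from 4 binary prototype patterns (not MNIST).
rng = np.random.default_rng(0)
prototypes = (rng.random((4, 16)) > 0.5).astype(float)
data = prototypes[rng.integers(0, 4, size=40)]
W = rng.normal(0.0, 0.01, (16, 8))                        # 16 visible x 8 hidden
errors = []
for _ in range(100):
    W, err = cd1_update(W, data, rng)
    errors.append(err)
```

After training, `sigmoid(data @ W)` gives the hidden activations that would serve as the "data" for the next RBM in the stack.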
RBM Training
• Prior and posterior probabilities are inferred back and forth • Repeatedly alter the values of the hidden units to maximize the probabilities
• Energy function E(v,h) at the current configurations of v and h: $E(v,h) = -\sum_{i,j} v_i h_j w_{ij}$
• The output at the hidden neuron can be described by a pdf: $p(v,h) \propto e^{-E(v,h)}$
• The network output has a meaning: $p(v,h) = \frac{e^{-E(v,h)}}{\sum_{u,g} e^{-E(u,g)}}$, $p(v) = \frac{\sum_{h} e^{-E(v,h)}}{\sum_{u,g} e^{-E(u,g)}}$, $p(h) = \frac{\sum_{v} e^{-E(v,h)}}{\sum_{u,g} e^{-E(u,g)}}$
[Figure: RBM with hidden units h1, h2, h3 and visible units v1, v2, v3]
Step 2: pre-training
• Learn the weights from 1st hidden layer to 2nd hidden layer • Unsupervised learning: RBM • Update: W2 and 500 neurons
[Figure: network stack, bottom to top: 28x28 -W1-> 1000 neurons (Fix) -W2-> 500 neurons -W3-> 250 neurons -W4-> 30 -> 10 output classes (binary)]
Step 3: pre-training
• Learn the weights from 2nd hidden layer to 3rd layer • Unsupervised learning: RBM • Update: W3 and 250 neurons
[Figure: network stack, bottom to top: 28x28 -W1-> 1000 neurons -W2-> 500 neurons (Fix) -W3-> 250 neurons -W4-> 30 -> 10 output classes (binary)]
Step 4: pre-training
• Learn the weights from 3rd hidden layer to the encoding layer • Unsupervised learning: RBM • Update: W4 and 30 neurons
[Figure: network stack, bottom to top: 28x28 -W1-> 1000 neurons -W2-> 500 neurons -W3-> 250 neurons (Fix) -W4-> 30 -> 10 output classes (binary)]
Step 5: un-roll
• Replicate the learned layers by reflecting the activations and weights • Copy the input image (28x28), activations (values) in the neurons (1000, 500, 250)
• Assign pseudo-inverse matrix of the lower layers for the higher layers • Because the directions are reversed • By orthonormalizing the weight matrices during training, inversion becomes transpose ($W^{-1} = W^T$)
[Figure: unrolled stack, bottom to top: 28x28 -W1-> 1000 neurons -W2-> 500 neurons -W3-> 250 neurons -W4-> 30 (Fix) -W4^T-> 250 neurons -W3^T-> 500 neurons -W2^T-> 1000 neurons -W1^T-> 28x28]
Step 6: fine-tuning
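The un-rolled network that fine-tuning starts from can be sketched as follows. The sketch assumes NumPy, sigmoid units in every layer, and no bias terms; the random matrices stand in for the pre-trained W1 to W4, and `init_encoder`, `unroll`, and `forward` are hypothetical helpers.

```python
import numpy as np

ENCODER_SIZES = [784, 1000, 500, 250, 30]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_encoder(sizes, seed=0):
    """Stand-in for the pre-trained weights W1..W4 (random here, not trained)."""
    rng = np.random.default_rng(seed)
    return [rng.normal(0.0, 0.01, (n_in, n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def unroll(encoder_weights):
    """Return 8 matrices: W1..W4 followed by their transposes in reverse order.

    The transposes are copied so that fine-tuning can later change the
    decoder weights independently of the encoder weights.
    """
    decoder_weights = [W.T.copy() for W in reversed(encoder_weights)]
    return encoder_weights + decoder_weights

def forward(weights, X):
    a = X
    for W in weights:
        a = sigmoid(a @ W)
    return a

weights = unroll(init_encoder(ENCODER_SIZES))
shapes = [W.shape for W in weights]
reconstruction = forward(weights, np.zeros((2, 784)))   # 2 placeholder images
```

Fine-tuning then treats the eight matrices as separate parameters and trains them with backpropagation, using each input image as its own target.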
• Train the whole network using standard backpropagation algorithm • Target: 28x28 original images • Input: 28x28 original images
• Characteristics • Supervised learning • Note that the previous pre-trainings are all unsupervised • Layered architecture • Learning target: each input image itself
[Figure: stacked autoencoder, bottom to top: 28x28 (Fix) -W1-> 1000 neurons -W2-> 500 neurons -W3-> 250 neurons -W4-> 30 -W5-> 250 neurons -W6-> 500 neurons -W7-> 1000 neurons -W8-> 28x28 (fixed target)]
Step 7: applications
• As a result of autoencoder training, the input 28x28 (784) image patches can be represented by 30 neurons in the encoding layer • 96% reduction
• Applications • Denoising / decoding • Use the last layer's output • Compression / encoding • Use the encoding layer's output • Feature extraction • Use the encoding layer's output as features
[Figure: the same stacked autoencoder with weights W1 to W8]
Step 8: recognition
• Once the training is done • use the outputs in the encoding layer as features
• Hinton's "branch layer" • Add a "branch layer" whose targets are digit labels (in binary) • Train by a standard backpropagation • Obtained 99.7% accuracy
[Figure: stacked autoencoder W1 to W8 with a branch (W branch) from the 30-neuron encoding layer to 10 output classes (binary)]
How to learn the weights
[Figure: ten class units labeled 1 2 3 4 5 6 7 8 9 0 above the input image]
Show the network an image and increment the weights from active pixels to the correct class. Then decrement the weights from active pixels to whatever class the network guesses.
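The update rule just described fits in a few lines. This is a minimal sketch assuming NumPy, one weight per (class, pixel) pair, and a binary pixel vector; the function name `update_weights` and the 6-pixel toy image are illustrative, and class index d stands for digit d.

```python
import numpy as np

def update_weights(W, pixels, correct_class):
    """Apply one update and return the class the network guessed.

    W has shape (n_classes, n_pixels); pixels is a binary vector in which
    1 marks an active pixel.
    """
    guess = int(np.argmax(W @ pixels))   # class with the largest summed weight
    W[correct_class] += pixels           # increment weights from active pixels to the correct class
    W[guess] -= pixels                   # decrement weights from active pixels to the guessed class
    return guess

W = np.zeros((10, 6), dtype=int)         # 10 classes, 6-pixel toy "image"
pixels = np.array([1, 1, 0, 0, 1, 0])
first_guess = update_weights(W, pixels, correct_class=3)
second_guess = update_weights(W, pixels, correct_class=3)
```

When the guess is already correct, the increment and the decrement cancel, so the weights only change on mistakes.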
[Figure: five further frames of the same display, showing the class units 1 2 3 4 5 6 7 8 9 0 and the image as the weights are updated]
The learned weights
[Figure: learned weights for the class units 1 2 3 4 5 6 7 8 9 0, shown above the input image]
The details of the learning algorithm will be explained in future lectures.
A comparison of methods for compressing digit images to 30 real numbers.
[Figure rows, top to bottom: real data; 30-D deep auto; 30-D logistic PCA; 30-D PCA]
Slide credit: Geoffrey Hinton
MNIST Recognition using Autoencoders
http://www.cs.toronto.edu/~hinton/adi/index.htm
Deep autoencoder applications
Document retrieval, Geoffrey Hinton
Image classification using deep autoencoders, Krizhevsky et al.
Document Retrieval
• Problem: "How to find documents that are similar to a query document?"
• Convert each document into a "bag of words" • Vector of word counts ignoring order • Ignore stop words (like "the" or "over")
• Use the word counts of the query document • For the input to an autoencoder
[Figure: example count vector: fish 0, cheese 0, vector 2, count 2, school 0, query 2, reduce 1, bag 1, pulpit 0, iraq 0, word 2]
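The bag-of-words conversion can be sketched with the standard library alone. The vocabulary below is the one shown in the example count vector; the stop-word list and the sample sentence are made up for illustration, and `count_vector` is a hypothetical helper with no stemming.

```python
from collections import Counter

# Tiny illustrative stop-word list; a real system would use a longer one.
STOP_WORDS = {"the", "a", "of", "to", "over", "and", "is", "in"}

VOCABULARY = ["fish", "cheese", "vector", "count", "school", "query",
              "reduce", "bag", "pulpit", "iraq", "word"]

def count_vector(text, vocabulary):
    """Return word counts in vocabulary order, ignoring word order and stop words."""
    words = [w.strip('.,;:!?"()').lower() for w in text.split()]
    counts = Counter(w for w in words if w and w not in STOP_WORDS)
    return [counts[word] for word in vocabulary]

# Made-up sentence chosen so that its counts match the example vector.
document = ("Reduce the query to a count vector. "
            "The query vector is a bag of word count over word order.")
vector = count_vector(document, VOCABULARY)
```

Words outside the vocabulary (here "order") are simply dropped, so every document maps to a vector of the same length regardless of its size.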
Slide credit: Geoffrey Hinton, Coursera lecture
How to compress the count vector
• We train the neural network to reproduce its input vector as its output
• This forces it to compress as much information as possible into the 10 numbers in the central bottleneck.
• These 10 numbers are then a good way to compare documents.
[Figure: input 2000 word counts vector -> 500 neurons -> 250 neurons -> 10 -> 250 neurons -> 500 neurons -> output 2000 reconstructed counts vector]
Slide credit: Geoffrey Hinton, Coursera lecture
Retrieval performance on 400,000 Reuters business news stories
LSA is a version of PCA
Slide credit: Geoffrey Hinton, Coursera lecture
Clustering by PCA
First compress all documents to 2 numbers using PCA on log(1+count). Then use different colors for different categories.
Slide credit: Geoffrey Hinton, Coursera lecture
Clustering by Autoencoders
First compress all documents to 2 numbers using a deep autoencoder. Then use different colors for different document categories.
Slide credit: Geoffrey Hinton, Coursera lecture
Image Retrieval
Krizhevsky's deep autoencoder
The encoder has about 67,000,000 parameters. There is no theory to justify this architecture. It takes a few days on a GTX 285 GPU to train on two million images.
[Figure: encoder layer sizes from input to code: three 1024-unit input blocks, 8192, 4096, 2048, 1024, 512, 256-bit binary code]
Slide credit: Geoffrey Hinton, lecture notes in Coursera.org
Image Retrieval Results
retrieved using 256 bit codes
retrieved using Euclidean distance in pixel intensity space
Slide credit: Geoffrey Hinton, lecture notes in Coursera.org
retrieved using 256 bit codes
retrieved using Euclidean distance in pixel intensity space
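Retrieval with binary codes can be sketched as a nearest-neighbor search over bit strings. The slides do not state the distance used on the codes; Hamming distance is assumed here as the natural choice for binary vectors. The codes are random stand-ins for learned 256-bit codes, and `hamming_distances` and `retrieve` are hypothetical helpers built on NumPy.

```python
import numpy as np

def hamming_distances(query_code, database_codes):
    """Number of differing bits between one binary code and each database code."""
    return np.count_nonzero(database_codes != query_code, axis=1)

def retrieve(query_code, database_codes, k=3):
    """Indices of the k database codes closest to the query in Hamming distance."""
    distances = hamming_distances(query_code, database_codes)
    return np.argsort(distances, kind="stable")[:k]

rng = np.random.default_rng(0)
codes = rng.integers(0, 2, size=(1000, 256), dtype=np.uint8)   # random stand-in codes
query = codes[42].copy()
query[:5] ^= 1                      # flip 5 bits: the query is a near-duplicate of item 42
nearest = retrieve(query, codes)
distance_to_42 = int(hamming_distances(query, codes)[42])
```

Comparing 256-bit codes touches far fewer numbers per image than a Euclidean distance over all pixel intensities, which is the practical appeal of short binary codes.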
Slide credit: Geoffrey Hinton, lecture notes in Coursera.org
Summary
• Stacked deep autoencoder • Layer-wise pretraining for stable learning of weights
• Applications • Reconstruction, noise removal, recognition, and retrieval
• Restrictions • Sensitive to translations, scales, rotations • Computationally expensive
Convolutional neural networks
Convolution operation
LeNet Architecture
Acknowledgments: Geoffrey Hinton, Yann LeCun, Antonio Torralba, Boris Ginzburg
CNN (Convolutional Neural Network)
• Neural network architecture for image detection and recognition • Introduced by Yann LeCun et al. (1989); LeNet-5 published in 1998 • http://yann.lecun.com – with codes, tutorials
• Many hidden layers • Resembles human visual system
• Sliding window (Convolution) = scanning images by sliding windows • Enables detecting objects with invariances to changes in locations, rotations, scales, etc. • Kernel weights are shared over different locations
• Pooling • Provides abstraction as well as dimension reduction
Why CNN?
Higher layer responses: class-specific features
2nd layer: structured edges, corners, blobs, etc.
1st layer: primitive, localized edge filters
Picture credit: Honglak Lee et al. in Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks. Communications of the ACM.
Convolutive Windowing (Scanning)
• 2-dimensional convolution • Computing correlation of all locations to the given kernel window in REVERSE direction • Input: 32x32 • Kernel (window) size: 5x5 • Output response: 28x28 (32 - 5 + 1 = 28 per side)
• Result of convolution architecture • Learning kernels: extracting the local characteristics of the input images • Output images: responses to the learned kernels
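The windowing operation above can be written as a direct double loop. This is a minimal sketch assuming NumPy and no padding or stride; `conv2d_valid` is a hypothetical helper, written for clarity rather than speed. "REVERSE direction" is implemented by flipping the kernel before correlating.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """2-D convolution over every position where the kernel fits inside the image."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    flipped = kernel[::-1, ::-1]      # convolution = correlation with the reversed kernel
    out = np.empty((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * flipped)
    return out

image = np.arange(32 * 32, dtype=float).reshape(32, 32)   # placeholder 32x32 input
kernel = np.zeros((5, 5))
kernel[0, 0] = 1.0                                        # single non-zero tap
response = conv2d_valid(image, kernel)                    # 28x28 response map
```

With a single tap at the kernel's top-left corner, the flip moves it to the bottom-right, so each output pixel copies the input pixel four rows down and four columns right. Many deep learning libraries skip the flip and compute plain correlation, which makes no difference when the kernel is learned.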
LeNet-5 CNN Architecture for MNIST
C1: 1st layer – CONVOLUTION - 6 kernel windows of size 5x5 - 32 x 32 pixels -> 28x28 responses by a single window - Total 6 output maps of size 28 x 28 (6@28x28) are generated from a single image
Number of parameters to train - 6 x 5 x 5 + 6 (bias) = 156 (much less than using full connection)
Picture credit: Christopher Mitchell
LeNet-5: Pooling (subsampling)
S2: 2nd layer – POOLING - No convolution - 1 out of 2x2 pixels is sampled - Method: max (referred to as Max Pooling, most popular), mean (Average Pooling), Fractional Pooling [Graham, CVPR 2015], etc. - Total 6 output maps of size 14 x 14 (6@14x14) are generated from C1 layer
Number of parameters to train - None.
LeNet-5: C3
C3: 3rd layer – CONVOLUTION - 16 kernel windows of size 5x5 - 6 maps of 14x14 pixels -> 6@10x10 responses by a single window - 6@10x10 -> 1@10x10 image by averaging (AvgOut) or taking max (MaxOut) [Goodfellow, JMLR 2013] - Total 16 windows -> 16@10x10 are generated from a single image
Number of parameters to train - 16 x 5 x 5 + 16 = 416 (still much less than using full connection)
LeNet-5: S4 and C5
S4: 4th layer – POOLING - 1 out of 2x2 outputs is sampled - 16@5x5 outputs are generated from C3 layer
C5: 5th layer – CONVOLUTION - 120 kernel windows of size 5x5 are added - 5x5 pixels -> 1x1 response per window - 120 outputs, finally
Number of parameters to train - 120 x 5 x 5 + 120 = 3,120 (still much less than using full connection)
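The map sizes and parameter counts for C1 to C5 can be reproduced with a little arithmetic. The helper names are hypothetical. `shared_kernel_params` follows the counting used on these slides (one 5x5 window and one bias per output map, shared across input maps); `full_depth_params` shows the more common convention, in which every kernel has separate weights for each input map, for comparison.

```python
def conv_output_size(input_size, kernel_size):
    """Side length after a convolution without padding or stride."""
    return input_size - kernel_size + 1

def pool_output_size(input_size, window=2):
    """Side length after non-overlapping pooling."""
    return input_size // window

def shared_kernel_params(n_kernels, kernel_size):
    """Slide counting: one kernel and one bias per output map."""
    return n_kernels * kernel_size * kernel_size + n_kernels

def full_depth_params(n_inputs, n_kernels, kernel_size):
    """Alternative counting: separate kernel weights for every input map."""
    return n_kernels * n_inputs * kernel_size * kernel_size + n_kernels

# Trace the side length of the maps through C1, S2, C3, S4, C5.
size = 32
trace = [size]
for layer in ["C1", "S2", "C3", "S4", "C5"]:
    size = conv_output_size(size, 5) if layer.startswith("C") else pool_output_size(size)
    trace.append(size)

slide_counts = {
    "C1": shared_kernel_params(6, 5),
    "C3": shared_kernel_params(16, 5),
    "C5": shared_kernel_params(120, 5),
}
full_depth_counts = {
    "C3": full_depth_params(6, 16, 5),
    "C5": full_depth_params(16, 120, 5),
}
```

The gap between the two conventions (416 versus 2,416 for C3, and 3,120 versus 48,120 for C5) comes entirely from whether kernels are shared across input maps.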
LeNet-5: Full Connection Layers
F6: 6th layer – FULL Connection - 120 inputs -> 84 hidden units -> 10 outputs (120 x 84 + 84 x 10 weights)
Output: - 10 binary labels (targets, 0~9)
Training CNN
• Standard Backpropagation with • Activation functions • Sigmoid (vanishing gradient problem) • ReLU (rectified linear unit) • Over-fitting prevention • DropOut
• Requires huge computation • GPU Implementation
• Performance on MNIST • The error rate of LeNet-5 [1998] is 0.8% - 0.95% • Deep autoencoder: 784-500-500-2000-30 [2007] + nearest neighbor - 1.0% • Results of other classifiers are also available • http://yann.lecun.com/exdb/mnist/
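Two of the training ingredients above can be illustrated numerically: why sigmoid gradients vanish with depth while ReLU gradients do not, and what dropout does to a layer's activations. The sketch assumes NumPy; the 10-layer depth is an arbitrary example, and the dropout shown is the "inverted" variant that rescales surviving units, which is one common formulation rather than necessarily the one intended on the slide.

```python
import numpy as np

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)              # never larger than 0.25

def relu_grad(z):
    return (z > 0).astype(float)      # exactly 1 for every active unit

def dropout(activations, keep_prob, rng):
    """Inverted dropout: zero units at random and rescale the survivors
    so that the expected activation is unchanged."""
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

# Product of activation derivatives along a 10-layer chain (best case for sigmoid).
depth = 10
sigmoid_chain = float(sigmoid_grad(0.0)) ** depth
relu_chain = float(relu_grad(np.array(1.0))) ** depth

rng = np.random.default_rng(0)
dropped = dropout(np.ones(1000), keep_prob=0.5, rng=rng)
```

Even at its maximum the sigmoid derivative is 0.25, so ten layers shrink the gradient by a factor of 0.25 to the power 10, while active ReLU units pass it through unchanged.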
Learning a Compositional Hierarchy of Object Structure
Parts model
The architecture
Learned parts
Fidler & Leonardis, CVPR'07; Fidler, Boben & Leonardis, CVPR 2008, copied from Torralba's
From Common Basic Features to Class-Specific Features
Picture credit: Honglak Lee et al. in Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks. Communications of the ACM.
Learning a Compositional Hierarchy of Object Structure
Fidler & Leonardis, CVPR'07; Fidler, Boben & Leonardis, CVPR 2008, copied from Torralba's
Conclusion
• CNN implementation for MNIST
• Next • Implementation