Lecture 10 Handwritten Digit (MNIST) Recognition Using Deep Neural Networks ELEC801, Fall 2018, KNU. Instructor: Gil-Jin Jang

DNNs for MNIST 1 Optical Character Recognition

• Convert text into machine-processable data

• Printed character recognition • Fixed font characteristics • Accuracy affected by noise

• Handwritten or printed characters • Characters not consistent • Likely to contain noise

DNNs for MNIST 2 Contents

• MNIST handwritten digit database • Neural Networks • Autoencoder • SoftMax Regression • Convolutional Neural Networks for MNIST

DNNs for MNIST 3 MNIST Database

• Modified National Institute of Standards and Technology • Large handwritten digit classification database

• Re-mix of NIST digit databases • 60k training images from American Census Bureau employees • 10k testing images from American high school students

DNNs for MNIST 4 MNIST Database (cont.)

• Format • Input: 32 x 32 grayscale images (dimension 1024) or 28 x 28 grayscale images (dimension 784) • Output: 10 labels (0-9) • Centered on center of mass
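For concreteness, a minimal sketch of loading the 28x28 version of the database with tf.keras follows (one of several ways to obtain MNIST; the package choice is an assumption about the reader's environment). The shapes match the format described above.

```python
# Minimal sketch: load MNIST and check the format described above.
# Assumes TensorFlow is installed; tf.keras ships the 28x28 version of the database.
import numpy as np
from tensorflow.keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
print(x_train.shape)        # (60000, 28, 28) -- training images
print(x_test.shape)         # (10000, 28, 28) -- testing images
print(np.unique(y_train))   # [0 1 ... 9]     -- the 10 output labels

# Flatten each image to a 784-dimensional vector and scale pixels to [0, 1]
x_train = x_train.reshape(-1, 28 * 28).astype("float32") / 255.0
x_test = x_test.reshape(-1, 28 * 28).astype("float32") / 255.0
```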

DNNs for MNIST 5 MNIST Recognition Using Stacked Autoencoders Slide credits: Yahia Saeed, Jiwoong Kim, Lewis Westfall, and Ning Yang Seidenberg School of CSIS; Pace University, New York

DNNs for MNIST 6 Types of Neural Networks

General MLP (multi-layer perceptron): Layer 1 – input layer, Layer 2 – undercomplete hidden layer, Layer 3 – output layer

Auto-Encoders

• Learning target = Input • A type of unsupervised learning which tries to discover generic features of the data • Learn the identity function by learning important sub-features (not by just passing through the data) • Compression, etc. • Can use just the new features as the new training set, or concatenate them with the original features

DNNs for MNIST 8 Autoencoder Neural Network

• Unsupervised • Output intended to match input • Features captured may not be intuitive • Undercomplete constraint used
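As a concrete illustration, a minimal undercomplete autoencoder can be written in a few lines of Keras; the 784 → 100 → 784 sizes anticipate the MNIST experiment later in this lecture, and the activation/optimizer choices are illustrative assumptions.

```python
# Minimal undercomplete autoencoder sketch: the learning target is the input itself.
# Sizes (784 -> 100 -> 784) match the MNIST setup used later; activations/optimizer are illustrative.
import tensorflow as tf
from tensorflow.keras import layers, models

inputs = tf.keras.Input(shape=(784,))
code = layers.Dense(100, activation="sigmoid", name="encoder")(inputs)   # undercomplete hidden layer
outputs = layers.Dense(784, activation="sigmoid", name="decoder")(code)  # reconstruct the input
autoencoder = models.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")

# Unsupervised training: the same array is both input and target
# autoencoder.fit(x_train, x_train, epochs=10, batch_size=128)

# The encoder alone maps 784-dimensional images to 100-dimensional features
encoder = models.Model(inputs, code)
```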

DNNs for MNIST 9 Stacked Auto-Encoders • 1) Stack many (sparse) auto-encoders in succession and train them using greedy layer-wise training

10 Stacked Auto-Encoders • 2) Supervised training on the last layer using the final features (target = class labels) • 3) Then supervised training on the entire network to fine-tune all weights

11 1st Autoencoder

DNNs for MNIST 12 2nd Autoencoder

DNNs for MNIST 13 3rd Softmax Classifier

Supervised learning - Classifies the results of the autoencoders' processing of the original inputs - Goal: match the output to the correct digit label

DNNs for MNIST 14 Stacked Autoencoder for MNIST Classification • Output of hidden layer of one autoencoder input to the next autoencoder

DNNs for MNIST 15 Previous Work LeCun, Cortes, and Burges "The MNIST Database of Handwritten Digits" http://yann.lecun.com/exdb/mnist

Method                   Accuracy
Linear Classifier        92.4%
K-Nearest Neighbors      99.3%
Boosted Stumps           99.1%
Non-Linear Classifier    96.4%
SVM                      99.4%
Neural Net               99.6%
Convolutional Net        99.7%

DNNs for MNIST 16 Training

• 10k MNIST images • 1st autoencoder • 784 features / image • Encode undercomplete to 100 features / image • Decode to 784 features / image • 400 epochs • Sparsity parameter of 0.15

DNNs for MNIST 17 Training (cont.)

• 2nd autoencoder • 100 features / image • Encode undercomplete to 50 features / image • Decode to 100 features / image • 100 epochs • Sparsity parameter of 0.10 • SoftMax Classifier • 50 features / image to 1 of 10 classes / image • 400 epochs
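Putting the two training slides together, the pipeline can be sketched as below. The sparsity parameters (0.15 and 0.10) are approximated here by an L1 activity regularizer, which is an assumption (the exact sparsity penalty of the original experiment is not specified); epoch counts follow the slides, other hyper-parameters are illustrative.

```python
# Sketch of the stacked-autoencoder pipeline described above:
# 784 -> 100 (1st AE), 100 -> 50 (2nd AE), then a 10-way softmax on the 50-dim codes.
import tensorflow as tf
from tensorflow.keras import layers, models, regularizers
from tensorflow.keras.datasets import mnist

(x_all, y_all), _ = mnist.load_data()
x = x_all[:10000].reshape(-1, 784).astype("float32") / 255.0   # 10k MNIST images, as on the slide
y = y_all[:10000]

def train_autoencoder(data, in_dim, code_dim, epochs):
    inp = tf.keras.Input(shape=(in_dim,))
    code = layers.Dense(code_dim, activation="sigmoid",
                        activity_regularizer=regularizers.l1(1e-4))(inp)  # stand-in for sparsity
    out = layers.Dense(in_dim, activation="sigmoid")(code)
    ae = models.Model(inp, out)
    ae.compile(optimizer="adam", loss="mse")
    ae.fit(data, data, epochs=epochs, batch_size=128, verbose=0)   # target = input
    return models.Model(inp, code)                                  # keep only the encoder

enc1 = train_autoencoder(x, 784, 100, epochs=400)    # 1st autoencoder: 784 -> 100
h1 = enc1.predict(x)
enc2 = train_autoencoder(h1, 100, 50, epochs=100)    # 2nd autoencoder: 100 -> 50
h2 = enc2.predict(h1)

softmax = models.Sequential([layers.Dense(10, activation="softmax", input_shape=(50,))])
softmax.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
softmax.fit(h2, y, epochs=400, batch_size=128, verbose=0)           # SoftMax classifier
```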

DNNs for MNIST 18 Testing

• First results - 79.7% accuracy

• Conducted retraining with 10 labels

• Final results – 99.7% accuracy

DNNs for MNIST 19 Output of 1st Autoencoder

DNNs for MNIST 20 Output of 2nd Autoencoder

DNNs for MNIST 21 1st Confusion Matrix

DNNs for MNIST 22 Final Confusion Matrix

DNNs for MNIST 23 Characteristic of Autoencoder

• Undercomplete • 784 features -> 100 features (12.75%) • 100 features -> 50 features (6.37%)

• Sparse network • Can be compressed

DNNs for MNIST 24 Training Stacked (Deep) Autoencoders

Acknowledgments: Ruslan Salakhutdinov, Yoshua Bengio

DNNs for MNIST 25 Restricted Boltzmann Machines

• Boltzmann machine • Binary stochastic neurons are symmetrically connected (Hinton & Sejnowski, 1983)

• Restricted Boltzmann machines • Only one layer of hidden units for easier learning (Smolensky, 1986) • No connections within the same layer: hidden units are conditionally independent given the visible states

DNNs for MNIST 26 Advantage of Symmetric Connection in RBM
• Prior and posterior probabilities are inferred back and forth
• Energy function E(v,h) at the current configurations of v and h:
  $E(v,h) = -\sum_{i,j} v_i h_j w_{ij}$
• The joint distribution is the Boltzmann distribution over the energies:
  $p(v,h) = \frac{e^{-E(v,h)}}{\sum_{u,g} e^{-E(u,g)}}$
• By marginalization, the output at the hidden neuron can be described by a pdf, so the network output has a meaning:
  $p(v) = \frac{\sum_h e^{-E(v,h)}}{\sum_{u,g} e^{-E(u,g)}}, \quad p(h) = \frac{\sum_v e^{-E(v,h)}}{\sum_{u,g} e^{-E(u,g)}}$

DNNs for MNIST 27 Stacking RBM to obtain a DBM

• If we build a multi-layered network with random initialization, learning the network is very hard
  • Time-consuming
  • Falls into a local maximum
• DBM architecture
  • Learn each layer independently using a greedy learning algorithm (Hinton)
  • In the hidden layers no data is available, but a pdf is → use Gibbs sampling to obtain data samples
  • Run the back-propagation algorithm after DBM greedy learning
• Greedy learning: minimize the error between the data and the reconstructed data
  $\Delta w_{ij} \propto \langle v_i h_j \rangle^0 - \langle v_i h_j \rangle^1$

DNNs for MNIST 28 Deep Autoencoders based on DBM

• Architecture for MNIST dataset
  • Output layer: 10 binary neurons for the 10 digit classes (0~9)
  • Encoding layer: 30 neurons
  • Hidden layers: 3 hidden layers of 1000, 500, 250 neurons (decreasing)
  • Input layer: 28x28 (784) binary (0 or 1) pixels for digit image patches
  • Sparse representation: 784 pixels → 30
• Learning the deep network directly is very hard!
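For reference, the architecture above can be written down directly; as the slide notes, training it from random weights is hard, which is why the following steps pre-train one layer at a time. Activations here are illustrative assumptions, and the 10-class output branch is only added later (Step 8).

```python
# Sketch of the deep autoencoder architecture described above:
# encoder 784 -> 1000 -> 500 -> 250 -> 30, decoder mirrored back to 784.
import tensorflow as tf
from tensorflow.keras import layers, models

inp = tf.keras.Input(shape=(784,))
h = inp
for units in (1000, 500, 250):                 # hidden layers, decreasing
    h = layers.Dense(units, activation="sigmoid")(h)
code = layers.Dense(30, name="encoding")(h)    # 30-neuron encoding layer
h = code
for units in (250, 500, 1000):                 # mirrored (un-rolled) decoder
    h = layers.Dense(units, activation="sigmoid")(h)
recon = layers.Dense(784, activation="sigmoid")(h)

deep_ae = models.Model(inp, recon)
deep_ae.summary()   # roughly 1.4M weights in the encoder and the same again in the decoder
```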

DNNs for MNIST 29 Step 1: pre-training
• Learn the weights from the input layer to the 1st hidden layer
• Unsupervised learning: RBM (restricted Boltzmann machine)
• The values of the 1000 neurons are updated as well in an unsupervised manner
• Update: W1 and the 1000 neurons
• The detailed algorithms are in the appendix

DNNs for MNIST 30 RBM Training
• Prior and posterior probabilities are inferred back and forth
• Repeat altering the values of the hidden units to maximize the probabilities
• Energy function E(v,h) at the current configurations of v and h:
  $E(v,h) = -\sum_{i,j} v_i h_j w_{ij}$
• The output at the hidden neuron can be described by a pdf, so the network output has a probabilistic meaning:
  $p(v,h) = \frac{e^{-E(v,h)}}{\sum_{u,g} e^{-E(u,g)}}, \quad p(v) = \frac{\sum_h e^{-E(v,h)}}{\sum_{u,g} e^{-E(u,g)}}, \quad p(h) = \frac{\sum_v e^{-E(v,h)}}{\sum_{u,g} e^{-E(u,g)}}$

DNNs for MNIST 31 Step 2: pre-training
• Learn the weights from the 1st hidden layer to the 2nd hidden layer
• Unsupervised learning: RBM
• Update: W2 and 500 neurons
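A minimal numpy sketch of the CD-1 (contrastive divergence) update that implements the $\Delta w_{ij} \propto \langle v_i h_j \rangle^0 - \langle v_i h_j \rangle^1$ rule from the greedy-learning slide; biases, momentum, and mini-batch scheduling are omitted, and the sizes follow layer 1 (W1) of the architecture above.

```python
# One CD-1 update for a binary RBM, implementing dW_ij ∝ <v_i h_j>^0 - <v_i h_j>^1.
# Biases and momentum are omitted for brevity; sizes follow W1 (784 visible, 1000 hidden).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, lr=0.1, rng=np.random.default_rng(0)):
    # v0: (batch, n_visible) binary data; W: (n_visible, n_hidden)
    p_h0 = sigmoid(v0 @ W)                                # p(h = 1 | v0)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)    # Gibbs sample of the hidden units
    p_v1 = sigmoid(h0 @ W.T)                              # reconstruction p(v = 1 | h0)
    v1 = (rng.random(p_v1.shape) < p_v1).astype(float)
    p_h1 = sigmoid(v1 @ W)                                # hidden probabilities at the reconstruction
    dW = (v0.T @ p_h0 - v1.T @ p_h1) / len(v0)            # <v h>^0 - <v h>^1
    return W + lr * dW

W1 = 0.01 * np.random.default_rng(0).standard_normal((784, 1000))
# W1 = cd1_step(binary_image_batch, W1)   # repeat over mini-batches; then train W2 on the hidden codes
```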

DNNs for MNIST 32 Step 3: pre-training nd • Learn the weights from 2 Number of rd output hidden layer to 3 layer classes 10 • Unsupervised learning: RBM (binary) • Update: W3 and 250 neurons 30

W4 250 neurons

W3 Fix 500 neurons

W2 1000 neurons

W1 28x28

DNNs for MNIST 33 Step 4: pre-training rd • Learn the weights from 3 Number of output hidden layer to the encoding classes 10 layer (binary) • Unsupervised learning: RBM 30 • Update: W4 and 30 neurons W4 Fix 250 neurons

W3 500 neurons

W2 1000 neurons

W1 28x28

DNNs for MNIST 34 Step 5: un-roll
• Replicate the learned layers by reflecting the activations and weights
• Copy the input image (28x28) and the activations (values) in the neurons (1000, 500, 250)
• Assign the pseudo-inverse matrices of the lower layers to the higher layers
  • Because the directions are reversed
  • By orthonormalizing the weight matrices during training, inversion becomes transpose ($W^{-1} = W^T$)

DNNs for MNIST 35 Step 6: fine-tuning

• Train the whole network using the standard backpropagation algorithm
• Target: 28x28 original images
• Input: 28x28 original images
• Characteristics
  • Supervised learning (note that the previous pre-trainings are all unsupervised)
  • Layered architecture → Stacked
  • Learning target: each input image itself → Autoencoder

DNNs for MNIST 36 Step 7: applications
• As a result of autoencoder training, the input 28x28 (784) image patches can be represented by the 30 neurons in the encoding layer
  • 96% reduction
• Applications
  • Denoising / decoding: use the last layer's output
  • Compression / encoding: use the encoding layer's output
  • Feature extraction: use the encoding layer's output as features

DNNs for MNIST 37 Step 8: recognition

• Once the training is done, use the outputs in the encoding layer as features
• Hinton's output branch
  • Add a "branch layer" whose targets are the digit labels (in binary)
  • Train by standard backpropagation
  • Obtained 99.7% accuracy

DNNs for MNIST 38 How to learn the weights

[Figure: the ten class units (1 2 3 4 5 6 7 8 9 0) connected to the pixels of the input image]

Show the network an image and increment the weights from active pixels to the correct class. Then decrement the weights from active pixels to whatever class the network guesses.
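This simple rule can be sketched in a few lines of numpy (a perceptron-style learner over ten class units); the binary-pixel assumption and learning rate are illustrative.

```python
# Sketch of the rule above: increment weights from active pixels to the correct class,
# decrement weights from active pixels to the class the network currently guesses.
import numpy as np

n_pixels, n_classes = 784, 10
W = np.zeros((n_pixels, n_classes))

def update(W, image, label, lr=1.0):
    # image: binary 784-vector of active pixels; label: correct digit (0-9)
    guess = np.argmax(image @ W)     # class the network guesses right now
    W[:, label] += lr * image        # strengthen connections to the correct class
    W[:, guess] -= lr * image        # weaken connections to the guessed class
    return W
```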

DNNs for MNIST 39-43 [Figure sequence: the weights from the image pixels to the ten class units, updated after each successive training image]

DNNs for MNIST 44 The learned weights
[Figure: the learned weights from the image pixels to the ten class units]
The details of the learning algorithm will be explained in future lectures.

DNNs for MNIST 45 A comparison of methods for compressing digit images to 30 real numbers.

[Figure rows: real data / 30-D deep autoencoder / 30-D logistic PCA / 30-D PCA]

DNNs for MNIST 46 Slide credit: Geoffrey Hinton MNIST Recognition using Autoencoders http://www.cs.toronto.edu/~hinton/adi/index.htm

DNNs for MNIST 47 Deep autoencoder applications: Document retrieval (Geoffrey Hinton); Image classification using deep autoencoders (Krizhevsky et al.)

DNNs for MNIST 48 Document Retrieval

• Problem: "How to find documents that are similar to a query document?"
• Convert each document into a "bag of words"
  • Vector of word counts ignoring order
  • Ignore stop words (like "the" or "over")
• Use the word counts of the query document as the input to an autoencoder
[Figure: example word-count vector over a small vocabulary (fish, cheese, vector, count, school, query, reduce, bag, pulpit, iraq, word)]
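A small sketch of the bag-of-words conversion described above; the vocabulary and stop-word list here are illustrative placeholders, not the 2000-word vocabulary of the actual experiment.

```python
# Sketch: convert a document into a word-count vector, ignoring order and stop words.
from collections import Counter

STOP_WORDS = {"the", "over", "a", "of", "and", "to"}
VOCAB = ["fish", "cheese", "vector", "count", "school", "query", "reduce", "bag", "pulpit", "iraq", "word"]

def bag_of_words(text):
    tokens = [w for w in text.lower().split() if w not in STOP_WORDS]
    counts = Counter(tokens)
    return [counts[w] for w in VOCAB]   # fixed-length count vector, the autoencoder input

print(bag_of_words("reduce the word counts of the query document to a bag of words"))
```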

Slide credit: Geoffrey Hinton, Coursera lecture DNNs for MNIST 49 How to compress the count vector

• We train the neural network to reproduce its input vector as its output
• This forces it to compress as much information as possible into the 10 numbers in the central bottleneck
• These 10 numbers are then a good way to compare documents
[Figure: architecture — input 2000 word counts → 500 → 250 → 10 (bottleneck) → 250 → 500 → output 2000 reconstructed counts]
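A sketch of the 2000-500-250-10-250-500-2000 compressor above, with the 10-number bottleneck exposed so documents can be compared by distance between their codes; activations and optimizer are illustrative assumptions.

```python
# Sketch of the 2000-500-250-10-250-500-2000 document autoencoder described above.
# Train with target = input count vectors; compare documents by their 10-number codes.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

inp = tf.keras.Input(shape=(2000,))                                   # 2000 word counts
h = layers.Dense(500, activation="relu")(inp)
h = layers.Dense(250, activation="relu")(h)
code = layers.Dense(10, name="bottleneck")(h)                         # central 10-number code
h = layers.Dense(250, activation="relu")(code)
h = layers.Dense(500, activation="relu")(h)
out = layers.Dense(2000, activation="relu")(h)                        # reconstructed counts

doc_ae = models.Model(inp, out)
doc_ae.compile(optimizer="adam", loss="mse")
# doc_ae.fit(count_vectors, count_vectors, epochs=50, batch_size=256)

encoder = models.Model(inp, code)
# codes = encoder.predict(count_vectors)
# nearest = np.argsort(np.linalg.norm(codes - query_code, axis=1))[:10]
```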

Slide credit: Geoffrey Hinton, Coursera lecture DNNs for MNIST 50 Retrieval performance on 400,000 Reuters business news stories

LSA is a version of PCA

Slide credit: Geoffrey Hinton, Coursera lecture DNNs for MNIST 51 Clustering by PCA: First compress all documents to 2 numbers using PCA on log(1+count). Then use different colors for different document categories.

Slide credit: Geoffrey Hinton, Coursera lecture DNNs for MNIST 52 Clustering by Autoencoders: First compress all documents to 2 numbers using a deep autoencoder. Then use different colors for different document categories.

Slide credit: Geoffrey Hinton, Coursera lecture DNNs for MNIST 53 Image Retrieval: Krizhevsky's deep autoencoder

• The encoder has about 67,000,000 parameters
• There is no theory to justify this architecture
• It takes a few days on a GTX 285 GPU to train on two million images
[Figure: encoder layer sizes — 1024, 1024, 1024 input units → 8192 → 4096 → 2048 → 1024 → 512 → 256-bit binary code]
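The retrieval idea shown on the next slides, ranking database images by Hamming distance between their 256-bit codes, can be sketched with numpy bit operations; the codes below are random placeholders standing in for an encoder's binary outputs.

```python
# Sketch: retrieval with 256-bit binary codes, ranking images by Hamming distance to the query.
import numpy as np

rng = np.random.default_rng(0)
db_codes = rng.integers(0, 2, size=(100_000, 256), dtype=np.uint8)   # database codes (placeholder)
query = rng.integers(0, 2, size=256, dtype=np.uint8)                 # query code (placeholder)

packed_db = np.packbits(db_codes, axis=1)        # 256 bits -> 32 bytes per image
packed_q = np.packbits(query)

# Hamming distance = number of differing bits = popcount of the XOR
diff_bits = np.unpackbits(np.bitwise_xor(packed_db, packed_q), axis=1)
hamming = diff_bits.sum(axis=1)
nearest = np.argsort(hamming)[:10]               # indices of the 10 closest database images
```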

Slide credit: Geoffrey Hinton, lecture notes in Coursera.org. DNNs for MNIST 54 Image Retrieval Results: retrieved using 256-bit codes

retrieved using Euclidean distance in pixel intensity space

Slide credit: Geoffrey Hinton, lecture notes in Coursera.org. DNNs for MNIST 55 Retrieved using 256-bit codes

retrieved using Euclidean distance in pixel intensity space

Slide credit: Geoffrey Hinton, lecture notes in Coursera.org. DNNs for MNIST 56 Summary

• Stacked deep autoencoder • Layer-wise pretraining for stable learning of weights • Applications • Reconstruction, noise removal, recognition, and retrieval • Restrictions • Sensitive to translations, scales, rotations • Computationally expensive

DNNs for MNIST 57 Convolutional neural networks

Convolution operation, LeNet architecture

Acknowledgments: Geoffrey Hinton, Yann LeCun, Antonio Torralba, Boris Ginzburg

DNNs for MNIST 58 CNN (Convolutional Neural Network) • Neural network architecture for image detection and recognition • First proposed by Yann LeCun, 1998 • http://yann.lecun.com – with codes, tutorials • Many hidden layers • Resembles human visual system • Sliding window = scanning images by sliding windows • Enables detecting objects with invariance to changes in location, rotation, scale, etc. • Kernel weights shared over different locations • Pooling • Provides abstraction as well as dimension reduction

DNNs for MNIST 59 Why CNN?

Higher layer responses: class-specific features

2nd layer: structured edges, corners, blobs, etc.

1st layer: primitive, localized edge filters

Picture credit: Honglak Lee et al. in Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks. Communications of the ACM. DNNs for MNIST 60 Convolutive Windowing (Scanning)

• 2-dimensional convolution • Computes the correlation of the given kernel window with every location of the input, with the kernel reversed (flipped) • Input: 32x32 • Kernel (window) size: 5x5 • Output response: 28x28 • Result of the convolution architecture • Learning kernels: extracting the local characteristics of the input images • Output images: responses to the learned kernels
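A minimal numpy/scipy sketch of the sizes above; note that scipy's convolve2d reverses (flips) the kernel, which is exactly the "reverse direction" distinction from plain correlation mentioned in the slide.

```python
# Sketch: 2-D convolution of a 32x32 input with a 5x5 kernel gives a 28x28 response map.
import numpy as np
from scipy.signal import convolve2d

image = np.random.default_rng(0).random((32, 32))   # placeholder 32x32 input
kernel = np.random.default_rng(1).random((5, 5))    # 5x5 kernel window

response = convolve2d(image, kernel, mode="valid")  # only positions where the kernel fully fits
print(response.shape)                               # (28, 28)
```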

DNNs for MNIST 61 Lenet-5 CNN Architecture for MNIST

C1: 1st layer – CONVOLUTION - 6 kernel windows of size 5x5 - 32 x 32 pixels → 28x28 responses by a single window - Total 6 output maps of size 28 x 28 (6@28x28) are generated from a single image

Number of parameters to train - 6 x 5 x 5 + 6 (bias) = 156 (much less than using full connection)

DNNs for MNIST 62 Picture credit: Christopher Mitchell Lenet-5: Pooling (subsampling)

S2: 2nd layer – POOLING - No convolution - 1 out of 2x2 pixels is sampled - Method: max (referred to as Max Pooling, most popular), mean (Average Pooling), Fractional Pooling [Graham, CVPR 2015], etc. - Total 6 output maps of size 14 x 14 (6@14x14) are generated from the C1 layer

Number of parameters to train - None. DNNs for MNIST 63 LeNet-5: C3

C3: 3rd layer – CONVOLUTION - 16 kernel windows of size 5x5 - 6 maps of 14x14 pixels → 6@10x10 responses by a single window - 6@10x10 → 1@10x10 image by averaging (AvgOut) or taking the max (MaxOut) [Goodfellow, JMLR 2013] - Total 16 windows → 16@10x10 are generated from a single image. Number of parameters to train - 16 x 5 x 5 + 16 = 416 (still much less than using full connection)

DNNs for MNIST 64 Lenet-5: S4 and C5

S4: 4th layer – POOLING - 1 out of 2x2 outputs is sampled - 16@5x5 outputs are generated from the C3 layer. C5: 5th layer – CONVOLUTION - 120 kernel windows of size 5x5 are added - 5x5 pixels → 1x1 response per window - 120 outputs, finally. Number of parameters to train - 120 x 5 x 5 + 120 = 3,120 (still much less than using full connection)

DNNs for MNIST 65 LeNet-5: Full Connection Layers

F6: 6th layer – FULL connection - fully connected layers of 120 → 84 → 10 units

Output: - 10 binary labels (targets, 0~9) DNNs for MNIST 66 Training CNN

• Standard backpropagation • Activation functions: Sigmoid (vanishing gradient problem) or ReLU (rectified linear unit) • Over-fitting prevention: DropOut • Requires huge computation → GPU implementation • Performance on MNIST • The error rate of LeNet-5 [1998] is 0.8% - 0.95% • Deep autoencoder 784-500-500-2000-30 [2007] + nearest neighbor: 1.0% • Results of other classifiers are also available at http://yann.lecun.com/exdb/mnist/
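A LeNet-style model with the layer sizes from the preceding slides can be sketched in Keras as below; this is an illustrative simplification (plain convolutions, max pooling, ReLU) rather than the original LeNet-5 with its connection table and scaled tanh units.

```python
# LeNet-style sketch following the sizes above:
# C1: 6@5x5, S2: 2x2 pooling, C3: 16@5x5, S4: 2x2 pooling, C5: 120@5x5, F6: 84, output: 10.
import tensorflow as tf
from tensorflow.keras import layers, models

lenet = models.Sequential([
    layers.Conv2D(6, 5, activation="relu", input_shape=(32, 32, 1)),   # C1 -> 6@28x28
    layers.MaxPooling2D(2),                                            # S2 -> 6@14x14
    layers.Conv2D(16, 5, activation="relu"),                           # C3 -> 16@10x10
    layers.MaxPooling2D(2),                                            # S4 -> 16@5x5
    layers.Conv2D(120, 5, activation="relu"),                          # C5 -> 120@1x1
    layers.Flatten(),
    layers.Dense(84, activation="relu"),                               # F6
    layers.Dense(10, activation="softmax"),                            # 10 digit classes
])
lenet.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# lenet.fit(x_train_32, y_train, epochs=10)   # 28x28 MNIST images zero-padded to 32x32
```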

DNNs for MNIST 67 Learning a Compositional Hierarchy of Object Structure

Parts model

The architecture

Learned parts. Fidler & Leonardis, CVPR'07; Fidler, Boben & Leonardis, CVPR 2008; copied from Torralba's slides. DNNs for MNIST 68 From Common Basic Features to Class-Specific Features

Picture credit: Honglak Lee et al. in Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks. Communications of the ACM. DNNs for MNIST 69 Learning a Compositional Hierarchy of Object Structure

Fidler & Leonardis, CVPR'07; Fidler, Boben & Leonardis, CVPR 2008; copied from Torralba's slides. DNNs for MNIST 70 Conclusion

• CNN implementation for MNIST

• Next • Implementation

DNNs for MNIST 71