[best of THE WEB]
Li Deng

The MNIST Database of Handwritten Digit Images for Machine Learning Research

In this issue, “Best of the Web” presents the modified National Institute of Standards and Technology (MNIST) resources, consisting of a collection of handwritten digit images used extensively in optical character recognition and machine learning research.

Handwritten digit recognition is an important problem in optical character recognition, and it has been used as a test case for theories of machine learning algorithms for many years. Historically, to promote machine learning and pattern recognition research, several standard databases have emerged in which the handwritten digits are preprocessed, including segmentation and normalization, so that researchers can compare recognition results of their techniques on a common basis. The freely available MNIST database of handwritten digits has become a standard for fast-testing machine learning algorithms for this purpose. The simplicity of this task is analogous to that of the TIDigit task (a speech database created by Texas Instruments) in speech recognition. Just as there is a long list of more complex speech recognition tasks, there are many more difficult and challenging tasks for image recognition and computer vision, which will not be addressed in this column.

The MNIST database was constructed out of the original NIST database; hence, modified NIST or MNIST. There are 60,000 training images (some of these training images can also be used for cross-validation purposes) and 10,000 test images, both drawn from the same distribution. All these black and white digits are size normalized and centered in a fixed-size image of 28 × 28 pixels, where the center of gravity of the intensity lies at the center of the image. Thus, the dimensionality of each image sample vector is 28 × 28 = 784, where each element is binary. This is a relatively simple database for people who want to try machine learning techniques and pattern recognition methods on real-world data while spending minimal effort on preprocessing and formatting. Using the references provided on the Web site, students and educators of machine learning can also benefit from a rather comprehensive set of machine learning literature with performance comparisons readily available.

DATA
General site for the MNIST database: http://yann.lecun.com/exdb/mnist
Code to read the MNIST database: http://www.mathworks.com/matlabcentral/fileexchange/27675-read-digits-and-labels-from-mnist-database
Details of logistic regression evaluated on MNIST: http://deeplearning.net/tutorial/logreg.html

EVALUATION OF MACHINE LEARNING ALGORITHMS USING MNIST
General evaluation results on MNIST: http://yann.lecun.com/exdb/mnist/
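The MATLAB code linked above reads the raw database files; readers working in Python can parse the IDX format in which MNIST is distributed just as easily. The following is a minimal sketch, assuming the four files from the general MNIST site have been downloaded and gunzipped into the working directory; the helper function names are our own and purely illustrative.

  import struct
  import numpy as np

  def read_mnist_images(path):
      # Image file: 16-byte header (magic number 2051, image count, rows, cols,
      # all big-endian 32-b integers), followed by one unsigned byte per pixel.
      with open(path, "rb") as f:
          magic, num, rows, cols = struct.unpack(">IIII", f.read(16))
          assert magic == 2051, "not an MNIST image file"
          pixels = np.frombuffer(f.read(), dtype=np.uint8)
      return pixels.reshape(num, rows * cols)      # one 784-dim vector per image

  def read_mnist_labels(path):
      # Label file: 8-byte header (magic number 2049, label count), followed by
      # one byte per label (the digit 0-9).
      with open(path, "rb") as f:
          magic, num = struct.unpack(">II", f.read(8))
          assert magic == 2049, "not an MNIST label file"
          return np.frombuffer(f.read(), dtype=np.uint8)

  X_train = read_mnist_images("train-images-idx3-ubyte")   # (60000, 784)
  y_train = read_mnist_labels("train-labels-idx1-ubyte")   # (60000,)
  X_test = read_mnist_images("t10k-images-idx3-ubyte")     # (10000, 784)
  y_test = read_mnist_labels("t10k-labels-idx1-ubyte")     # (10000,)

Reshaping any 784-dimensional row back to a 28 × 28 array recovers the corresponding image for display or for the convolutional models discussed later in this column.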
BRIEF ANALYSIS OF THE MACHINE LEARNING ALGORITHMS EVALUATED ON MNIST
Many well-known machine learning algorithms have been run on the MNIST database, so it is easy to assess the relative performance of a novel algorithm. The Web site http://yann.lecun.com/exdb/mnist/ was updated in December of 2011 to list all major classification techniques and their results obtained using the MNIST database. In most experiments, the existing training data from the database were used in learning the classifiers, where “none” is entered in the “Preprocessing” column of the table on the Web site. In some experiments, the training set was augmented with artificially distorted versions of the original training samples. The distortions include random combinations of jittering, shifts, scaling, deskewing, deslanting, blurring, and compression. The type(s) of these and other distortions are specified in the “Preprocessing” column of the table as well.

A total of 68 classifiers are provided in the comparison table on the Web site, where “Test Error Rates (%)” and links to the corresponding reference(s) are given. These 68 machine learning techniques are organized into seven broad categories:
■ linear classifiers
■ k-nearest neighbors
■ boosted stumps
■ nonlinear classifiers
■ support vector machines (SVMs)
■ neural nets (with no convolutional structure)
■ convolutional nets.
Each category contains up to 21 entries, with a very brief description of each in the “Classifier” column of the table. Many of the early techniques published in [1] are listed in the table.

Comparing all the 68 classifiers listed on the MNIST Web site, we can make a brief analysis of the effectiveness of the various techniques and of the preprocessing methods.

Neural net classifiers tend to perform significantly better than the other types of classifiers. Specifically, the convolution structure in neural nets accounts for excellent classification performance. In fact, the record performance, about a 0.27% error rate, or 27 errors on the full 10,000-image test set, is achieved by a committee of convolutional nets (with elastic distortion used to augment the training set) [2]. Without the “committee,” a single very large and deep convolutional neural net also gives a very low error rate of 0.35% [3].

The use of distortions, especially elastic distortion [4], to augment the training data is important for achieving very low error rates. Without such distortion, the error rate of a single large convolutional neural net increases from 0.35% to 0.53% [5].

The depth of neural nets also accounts for low error rates. With both convolution structure and distortions, deep versus shallow nets give error rates of 0.35% [3] and 0.40–0.60% [4], respectively.

Without convolution structure, distortions, or other types of special preprocessing, the lowest error rate in the literature, 0.83%, is achieved using the deep stacking/convex neural net [6]. The error rate increases to 1.10% [7] when a corresponding shallow net is used.

Behind the neural net techniques, k-nearest neighbor methods also produce low error rates, followed by virtual SVMs. Note that preprocessing is needed in both cases for this success.
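To make the ingredients of these convolutional nets concrete, the sketch below defines a small LeNet-style network in Python using the PyTorch library; this is one possible tool and is not the software used in the cited works. The layer sizes, learning rate, and other settings are illustrative choices only, and the model is far more modest than the committee behind the 0.27% record [2].

  import torch
  import torch.nn as nn

  # A small convolutional net for 28 x 28 digit images: two convolution/pooling
  # stages followed by two fully connected layers and a ten-way output.
  model = nn.Sequential(
      nn.Conv2d(1, 16, kernel_size=5, padding=2),   # 1 x 28 x 28 -> 16 x 28 x 28
      nn.ReLU(),
      nn.MaxPool2d(2),                              # -> 16 x 14 x 14
      nn.Conv2d(16, 32, kernel_size=5, padding=2),  # -> 32 x 14 x 14
      nn.ReLU(),
      nn.MaxPool2d(2),                              # -> 32 x 7 x 7
      nn.Flatten(),
      nn.Linear(32 * 7 * 7, 128),
      nn.ReLU(),
      nn.Linear(128, 10),                           # one output per digit class
  )
  loss_fn = nn.CrossEntropyLoss()
  optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

  def train_step(images, labels):
      # One gradient step on a mini-batch; images is an N x 1 x 28 x 28 float
      # tensor scaled to [0, 1], labels is a length-N tensor of digit classes.
      optimizer.zero_grad()
      loss = loss_fn(model(images), labels)
      loss.backward()
      optimizer.step()
      return loss.item()

Deeper stacks of such convolution/pooling stages, wider layers, and training on distorted copies of the 60,000 training images are the main ingredients separating this toy model from the nets that reach error rates below 0.5% [2], [3].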

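The elastic distortion used for augmentation in [4] and in the committee models [2] can be sketched in a few lines: draw a random displacement field, smooth it with a Gaussian filter, scale it, and resample the image along the displaced coordinates. In the sketch below, the function name and the parameter values alpha and sigma are illustrative and are not taken from the cited papers.

  import numpy as np
  from scipy.ndimage import gaussian_filter, map_coordinates

  def elastic_distort(image, alpha=8.0, sigma=4.0, rng=np.random):
      # image: a 28 x 28 array. A random displacement field is smoothed with a
      # Gaussian of width sigma, scaled by alpha, and used to resample the
      # image with bilinear interpolation.
      dx = gaussian_filter(rng.uniform(-1, 1, image.shape), sigma) * alpha
      dy = gaussian_filter(rng.uniform(-1, 1, image.shape), sigma) * alpha
      rows, cols = np.meshgrid(np.arange(image.shape[0]),
                               np.arange(image.shape[1]), indexing="ij")
      coords = np.vstack([(rows + dy).ravel(), (cols + dx).ravel()])
      return map_coordinates(image, coords, order=1).reshape(image.shape)

  # Example: add one distorted copy of every training image (X_train as read
  # in the earlier sketch, one 784-dim vector per row).
  # distorted = np.array([elastic_distort(x.reshape(28, 28)).ravel() for x in X_train])

Shifts, scaling, deskewing, and the other distortions listed in the “Preprocessing” column can be implemented in the same resampling framework.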

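As an illustration of how entries in the comparison table are measured, the following sketch runs a k-nearest neighbor baseline with scikit-learn and reports the test error rate on the official 10,000-image test set. The choice k = 3 and the raw-pixel representation are illustrative; as noted above, the best k-NN entries on the Web site rely on preprocessing and will beat this baseline.

  import numpy as np
  from sklearn.neighbors import KNeighborsClassifier

  # X_train, y_train, X_test, y_test as read in the earlier sketch:
  # 784-dimensional pixel vectors and their digit labels.
  knn = KNeighborsClassifier(n_neighbors=3)
  knn.fit(X_train, y_train)
  errors = np.sum(knn.predict(X_test) != y_test)
  print("test error rate: %.2f%%" % (100.0 * errors / len(y_test)))

The same fit, predict, and error-counting pattern applies to the other classifier families in the table.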
SUMMARY
The MNIST database gives a relatively simple static classification task for researchers and students to explore machine learning and pattern recognition techniques, saving unnecessary efforts on data preprocessing and formatting. This is analogous to the TIMIT database (a speech database created by Texas Instruments and Massachusetts Institute of Technology) familiar to most speech processing researchers in the signal processing community.

Just like the TIMIT phone classification and recognition tasks that have been productively used as a test bed for developing and testing speech recognition algorithms [7], MNIST has been used in a similar way for image and more general classification tasks. The Web site we introduce in this column provides the most comprehensive collection of resources for MNIST. In addition, this “Best of the Web” column also provides an analysis on a wide range of effective machine learning techniques evaluated on the MNIST task.

[“The first entry to the handwritten image database.” Cartoon by Tayfun Akgul ([email protected]).]

AUTHOR
Li Deng ([email protected]) is a principal researcher at Microsoft Research, Redmond, Washington.

REFERENCES
[1] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
[2] D. Ciresan, U. Meier, L. M. Gambardella, and J. Schmidhuber, “Convolutional neural network committees for handwritten character classification,” in Proc. ICDAR, 2011.
[3] D. Ciresan, U. Meier, J. Masci, L. M. Gambardella, and J. Schmidhuber, “Flexible, high performance convolutional neural networks for image classification,” in Proc. IJCAI, 2011.
[4] P. Simard, D. Steinkraus, and J. Platt, “Best practices for convolutional neural networks applied to visual document analysis,” in Proc. ICDAR, 2003.
[5] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun, “What is the best multi-stage architecture for object recognition?” in Proc. IEEE Int. Conf. Computer Vision (ICCV), 2009.
[6] L. Deng and D. Yu, “Deep convex network: A scalable architecture for speech pattern classification,” in Proc. Interspeech, Aug. 2011.
[7] L. Deng, D. Yu, and J. Platt, “Scalable stacking and learning for building deep architectures,” in Proc. ICASSP, Mar. 2012.
