Tiny Imagenet Challenge

Jiayu Wu, Qixiang Zhang, Guoxi Xu
Stanford University

Abstract

We present image classification systems using Residual Network (ResNet), Inception-ResNet and Very Deep Convolutional Network (VGGNet) architectures. We apply data augmentation, dropout and other regularization techniques to prevent over-fitting of our models. We present an error analysis based on per-class accuracy, and we explore the impact of initialization methods, weight decay and network depth on system performance. We also show visualizations of intermediate outputs and convolutional filters. In addition, we build an object localization system based on a combination of Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) units. Our best classification model achieves a top-1 test error rate of 43.10% on the Tiny ImageNet dataset, and our best localization model can localize more than one object with high accuracy, given training images with one object labeled.

1. Introduction

The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) started in 2010 and has become the standard benchmark of image recognition. Tiny ImageNet Challenge is a similar challenge with a smaller dataset and fewer image classes. It contains 200 image classes, a training dataset of 100,000 images, a validation dataset of 10,000 images, and a test dataset of 10,000 images. All images are of size 64×64.

The goal of our project is to do as well as possible on the image classification problem in Tiny ImageNet Challenge. To overcome the problem of a small training dataset, we applied data augmentation methods to training images, hoping to artificially create variations that help our models generalize better. We built our models based on the ideas of VGGNet [13], ResNet [6], and Inception-ResNet [17]. All our image classification models were trained from scratch. We tried a large number of different settings, including update rules, regularization methods, network depth, number of filters, strength of weight decay, etc.

Our best model, a fine-tuned Inception-ResNet, achieves a top-1 error rate of 43.10% on the test dataset. Moreover, we implemented an object localization network based on an RNN with LSTM [7] cells, which achieves precise results.

In the Experiments and Evaluations section, we will present a thorough analysis of the results, including per-class error analysis, intermediate output distribution, the impact of initialization, etc.

2. Related Work

Deep convolutional neural networks have enabled the field of image recognition to advance at an unprecedented pace over the past decade.

[10] introduces AlexNet, which has 60 million parameters and 650,000 neurons. The model consists of five convolutional layers, some of which are followed by max-pooling layers. To fight overfitting, [10] proposes data augmentation methods and includes the technique of dropout [14] in the model. The model achieves a top-5 error rate of 18.9% in the ILSVRC-2010 contest. A technique called local response normalization is employed to help generalization, which is similar to the idea of batch normalization [8].

[13] evaluates a much deeper convolutional neural network with smaller filters of size 3×3. As the model goes deeper, the number of filters increases while the feature map size decreases by max-pooling. The best VGGNet model achieves a top-5 test error of 6.8% in ILSVRC-2014. [13] shows that a deeper CNN with small filters can achieve better results than AlexNet-like networks.

However, as a CNN-based model goes deeper, the degradation problem appears. More specifically, a deeper CNN is supposed to perform at least as well as a shallower CNN, but in practice a shallower CNN converges to a lower error rate than a deeper CNN. To solve this problem, [6] proposes ResNet. Residual networks are designed to overcome the difficulty of training a deep convolutional neural network. Suppose the mapping we want to approximate is $H : \mathbb{R}^n \to \mathbb{R}^m$. Instead of approximating $H$ directly, we approximate a mapping $F$ such that $H(x) = F(x) + x$. [6] shows empirically that the mapping $F$ is easier to approximate through training. [6] reports that their best ResNet achieves a top-5 test error of 3.57% in ILSVRC-2015.

[17] combines inception structures and residual connections. The resulting network is able to achieve better performance than traditional residual networks with fewer parameters. The inception module was introduced by [18]. The basic idea is to apply different filters and also max-pooling to the same volume and then concatenate all the outputs, so that each layer can choose the best method during learning.

For object localization, [16] gives a model based on decoding an image into a set of people detections. The approach is related to the OverFeat model [12], with some improvements. The model relies on LSTM cells to generate variable-length outputs, and the loss function encourages the model to make predictions in order of descending confidence.

3. Approach

3.1. Data Augmentation

Preventing overfitting is an essential problem to overcome, especially for Tiny ImageNet Challenge, because we only have 500 training images per class.

First, during training, each time an image is fed to the model, a 56×56 crop randomly generated from the image is used instead. During validation and testing, we use the center crop.

Following [10], we then augment data by horizontal flipping, translation and rotation. We also used random contrast correction as an augmentation method. We set the scaling factor to random([0.9, 1.08]) at every batch, and clip pixel values to the range [0, 255] after the correction process to guarantee valid augmented images. Another data augmentation method we used is random Gamma correction [11] for luminance adjustment. After some experiments, we used the correction coefficient γ = random([0.9, 1.08]), which ensures both a significant luminance change and recognizable augmented images. Fig. 1 gives concrete instances of the above methods.

To speed up the training process, we apply these methods in a random fashion. When an image is fed to the model, every augmentation method is applied randomly to this image. In this way, the total number of training examples stays the same, but our models see slightly different but highly recognizable images at each epoch.

3.2. Modified Residual Network

Fig. 2 shows the architecture of our modified ResNet. It consists of a series of convolutional layers with different numbers of filters, an average pooling layer, and finally a fully-connected affine scoring layer. A batch normalization layer is added between a convolutional layer and its ReLU activations. As [6] suggests, batch normalization layers can serve as a source of regularization. Although [6] uses 34 or more convolutional layers, we hypothesize that we need fewer layers than the models in [6], since the random crops generated by our data augmentation methods are of size 56×56 whereas [6] uses 224×224 image inputs.

Unlike [6], we do not use a max pooling layer immediately after the first 7×7 convolutional layer, because our input images are already smaller than the 224×224 inputs in [6]. Furthermore, our first 7×7 convolutional layer does not use a stride of 2 as in [6]. Each building block of our modified ResNet includes 2n convolutional layers with the same number of 3×3 filters, where we can adjust the depth of the network by varying n. In this project, we have tried n = 1 and n = 2. Notice that down-sampling is performed at the first layer of each building block by using a stride of 2.

Combining two volumes with the same dimensions is straightforward. For two volumes with different depths and different feature map sizes, [6] suggests two ways to create a shortcut: (A) identity mapping with zero-padding; and (B) a convolutional layer with 1×1 filters and a stride of 2. We use option (B) for our residual networks. Furthermore, when a 1×1 convolution is applied, batch normalization is applied to each of the two incoming volumes before they are merged into one volume. Finally, ReLU is applied to the merged volume.

3.3. Modified Inception-ResNet

We reorganized the state-of-the-art Inception-ResNet v2 model [17]. The architecture of our modified Inception-ResNet model is shown in Fig. 3. The input images are first down-sampled in a stem module. The stem module has parallel convolutional blocks, whose outputs are later concatenated. Moreover, one convolutional layer of spatial filter size 1×7 and one of size 7×1 are combined to replace a 7×7 layer, which significantly reduces the number of parameters while maintaining the same receptive field. The data then flows through n Inception-ResNet-A modules, in which the residual part is an inception module. A 1×1 convolutional layer is applied at the end of each inception module to keep the output depth the same as the input's, thus enabling the final addition operation. Then, after another down-sampling in a reduction module, the feature maps pass through 2n Inception-ResNet-B modules and finally reach a fully connected layer.

Fig. 3 also shows several modifications we made to fit the Inception-ResNet model to the Tiny ImageNet challenge. Firstly, we used 2 Inception-ResNet module types and 1 reduction module, while [17] uses 3 Inception-ResNet module types and 2 reduction modules.
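The parameter saving from replacing a 7×7 layer with a 1×7 layer followed by a 7×1 layer, as described in Section 3.3, can be checked with simple arithmetic. The sketch below is illustrative only: the helper `conv_weights` and the channel width of 64 are assumptions made for the example, not values taken from our architecture, and biases are ignored.

```python
def conv_weights(kernel_h: int, kernel_w: int, c_in: int, c_out: int) -> int:
    """Number of weights in one convolutional layer (biases ignored)."""
    return kernel_h * kernel_w * c_in * c_out


# Hypothetical channel width, chosen only for illustration.
channels = 64

# A single 7x7 layer.
full_7x7 = conv_weights(7, 7, channels, channels)

# A 1x7 layer followed by a 7x1 layer, keeping the channel width fixed.
factorized = (conv_weights(1, 7, channels, channels)
              + conv_weights(7, 1, channels, channels))

print(full_7x7, factorized, full_7x7 / factorized)
```

With equal input and output depth C, the 7×7 layer has 49C² weights and the factorized pair has 14C², a 3.5-fold reduction, while the stacked 1×7 and 7×1 filters still cover a 7×7 receptive field.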
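Returning to the option (B) shortcut of Section 3.2, a short calculation shows why a 1×1 convolution with a stride of 2 yields a volume that can be added to the output of a down-sampling building block. This is a minimal sketch under stated assumptions: the padding values (3 for the 7×7 layer, 1 for 3×3 layers, 0 for 1×1 layers) and the count of three down-sampling blocks are chosen for illustration and are not specified in the text.

```python
def conv_output_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


size = 56  # side length of the random crop from Section 3.1

# First 7x7 convolution with stride 1; padding of 3 is an assumption.
size = conv_output_size(size, kernel=7, stride=1, padding=3)

sizes = [size]
for _ in range(3):  # three down-sampling blocks: an assumed count
    # Main path: first 3x3 layer of the block, stride 2, assumed padding 1.
    main = conv_output_size(size, kernel=3, stride=2, padding=1)
    # Shortcut path, option (B): 1x1 convolution with stride 2.
    shortcut = conv_output_size(size, kernel=1, stride=2, padding=0)
    assert main == shortcut  # both paths must agree before the addition
    size = main
    sizes.append(size)

print(sizes)
```

Under these assumptions both paths halve the feature map at every block, so the two volumes agree in spatial size; the number of 1×1 filters is then chosen to match the depth of the main path.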
