HIGH SPEED DEEP NETWORKS BASED ON DISCRETE COSINE TRANSFORMATION

Xiaoyi Zou, Xiangmin Xu, Chunmei Qing, Xiaofen Xing

School of Electronic and Information Engineering, South China University of Technology, Guangzhou, China [email protected], {xmxu, qchm, xfxing}@scut.edu.cn

Xiangmin Xu is the corresponding author.

ABSTRACT

The traditional deep networks take raw pixels of data as input and automatically learn features using unsupervised learning algorithms. In this configuration, in order to learn good features, the networks usually have multiple layers and many hidden units, which leads to extremely high training time costs. As a widely used algorithm, the Discrete Cosine Transformation (DCT) is utilized to reduce image information redundancy, because only a limited number of the DCT coefficients can preserve the most important image information. In this paper, a novel framework combining DCT and deep networks is proposed for a high speed object recognition system: a small subset of the DCT coefficients of the data is fed into 2-layer sparse auto-encoders instead of raw pixels. Because of the excellent decorrelation and energy compaction properties of DCT, this approach is shown experimentally to be not only efficient but also computationally attractive for processing high-resolution images in a deep architecture.

Index Terms— Discrete Cosine Transformation, deep networks, high speed, object recognition

1. INTRODUCTION

Recently, many researchers have focused on deep learning because of its ability to automatically learn good representations of the underlying distribution of the data. Successful deep learning algorithms and their applications can be found in recent literature using a variety of approaches, achieving a remarkable string of empirical successes both in academia and in industry [1].

The traditional deep architectures learn a complex function mapping from input to output directly from raw pixels of data. It is well known that the dimensionality of raw input data is large. In this case, in order to learn the expected mapping function, more hidden layers or hidden units must be added to increase the capacity of the model. For example, in [2], Quoc V. Le et al. trained 9-layer locally connected sparse auto-encoders with 1 billion connections on 10 million 200x200 pixel images for a face detector. This network takes three days to train using parallelism techniques on a cluster of 1,000 machines (16,000 cores). However, on the one hand, large amounts of training data are needed to learn the millions or billions of parameters of such a network while preventing overfitting. On the other hand, it costs much more training time and a greater quantity of memory resources, which unfortunately makes the deep learning method nontrivial for general users.

The motivation to solve the above problem is to reduce the dimensionality of the input data. A well-known and widely used dimensionality reduction technique is Principal Component Analysis (PCA). However, PCA is data dependent, and obtaining the principal components is a nontrivial task. The Discrete Cosine Transform (DCT) [3] is investigated in this paper since its basis functions are input independent and its compact representation ability closely approximates the optimal PCA [4]. Furthermore, computationally efficient and fast algorithms exist to compute the 2D DCT [5]. The DCT transforms the input data into a linear combination of different frequency components. Most of the components are typically very small in magnitude, because most of the visually significant information about the image concentrates into just a few coefficients. Hence, a limited number of DCT components is sufficient to preserve the most important image information, such as the regularity, complexity and some texture features. By feeding them into a deep network for object recognition, high accuracy can be achieved, while the training speed of this system is dramatically faster than comparable approaches which use raw pixels as input.

This paper presents a new approach for object recognition systems by introducing DCT into deep networks. The main idea of the approach is that, instead of learning the set of filters at each layer of the deep network, the first layer uses a fixed set of filters based on the DCT. The first application is the use of the DCT to reduce information redundancy and select only a limited number of the coefficients to feed into a deep network instead of raw pixels. Then the deep network is trained to learn good representations of the data in the frequency domain, layer-wise, in unsupervised fashion. Specifically, sparse auto-encoders are used here, and this method is a general framework which can be easily extended to some other deep

978-1-4799-5751-4/14/$31.00 ©2014 IEEE 5921 ICIP 2014

learning approaches because of the generality of the DCT.

The rest of the paper is organized as follows. Section 2 describes the details of the proposed approach. Section 3 discusses the results. The conclusions and some future work are given in Section 4.

2. PROPOSED DCT BASED DEEP NETWORKS

2.1. Discrete Cosine Transform

The DCT transforms a finite signal into a sum of sinusoids of varying magnitudes and frequencies. The DCT of an N x M image f(x, y) is defined by:

F(u, v) = \frac{2}{\sqrt{MN}} \alpha(u) \alpha(v) \sum_{x=0}^{N-1} \sum_{y=0}^{M-1} f(x, y) \cos\!\left[\frac{(2x+1)u\pi}{2N}\right] \cos\!\left[\frac{(2y+1)v\pi}{2M}\right]    (1)

Here,

\alpha(n) = \begin{cases} 1/\sqrt{2}, & n = 0 \\ 1, & n > 0 \end{cases}

When the DCT is applied to the entire image f(x, y), a 2D frequency coefficient matrix F(u, v) with the same dimensions is obtained.

Fig. 1. (a) displays the two-dimensional DCT basis functions (for 8-by-8 matrices). Neutral gray represents zero, white represents positive amplitudes, and black represents negative amplitudes [6]. (b) displays the original image, and its DCT coefficient matrix is shown in (c). The zig-zag pattern shown in (d) is adopted to select a limited number of DCT coefficients.

The DCT transforms the input data into a linear combination of basis functions which stand for different frequency components. For 8-by-8 matrices, the 64 basis functions (filters) are shown in Fig. 1(a); it can be seen that the basis functions exhibit a progressive increase in frequency in both the vertical and horizontal directions [7]. From Fig. 1(c), it can be observed that the DCT coefficients with large magnitudes are mainly located in the upper-left corner of the DCT matrix. Accordingly, as illustrated in Fig. 1(d), one can scan the DCT coefficient matrix in a zig-zag manner starting from the upper-left corner and convert it to a one-dimensional DCT coefficient vector; the coefficients at higher frequency bands can then be removed in order to reduce the amount of data needed to describe the image without sacrificing too much image quality.

2.2. DCT Based Sparse Auto-encoders

In this paper, sparse auto-encoders [8][9] are adopted to learn high-level frequency features in unsupervised fashion, using a limited number of DCT coefficients as input.

As shown in Fig. 2, an auto-encoder belongs to the encoder-decoder paradigm. The encoder transforms an input vector x into a code vector h, and the decoder produces a reconstruction y. The optimal code vector is obtained by minimizing the reconstruction error. In fact, this simple auto-encoder often ends up learning a low-dimensional representation very similar to PCA [1]. It has been proved that the compact representation ability of the DCT closely approximates the optimal PCA [4]. In this paper, it is proposed to use the DCT as a hard-wired linear auto-encoder layer located in the first layer of a deep architecture, as shown in Fig. 2(a). The advantage is that approximately the same performance can be obtained without the need to train a linear auto-encoder or to solve PCA. Furthermore, only a limited number of the output codes produced by the DCT is retained to feed into the subsequent non-linear stacked auto-encoders instead of raw pixels, which greatly reduces training time. The parameters of this model are optimized to minimize the following cost function:

J(W, b) = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{2} \left\| x_{DCT}^{(i)} - y^{(i)} \right\|^{2} + \frac{\lambda}{2} \sum_{l=1}^{n_l - 1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \left( W_{ji}^{(l)} \right)^{2} + \beta \sum_{j=1}^{s_{l+1}} KL(\rho \,\|\, \hat{\rho}_j)    (2)

Here, N is the number of training samples and n_l denotes the number of layers. s_l denotes the number of units in layer l, and W_{ji}^{(l)} denotes the weight associated with the connection between unit i in layer l and unit j in layer l + 1. There are three terms in this cost function. The first term is an average sum-of-squares error between the input and its reconstruction, and the second is a regularization term for preventing overfitting. The last is a sparsity penalty term with Kullback-Leibler (KL) divergence, which enforces the average activation of hidden unit j, denoted as \hat{\rho}_j, to be approximately equal to a small value ρ close to zero. λ and β control the relative importance of these three terms.
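The 2D DCT of Eq. (1) and the zig-zag coefficient selection of Fig. 1(d) can be sketched as follows. This is a minimal numpy illustration with hypothetical names (`dct2`, `zigzag_select`), not the paper's Matlab implementation; the basis matrix mirrors Eq. (1) term by term.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis: C[u, x] = sqrt(2/n) * alpha(u) * cos((2x+1)u*pi/(2n))."""
    u = np.arange(n)[:, None]
    x = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos((2 * x + 1) * u * np.pi / (2 * n))
    c[0, :] /= np.sqrt(2.0)  # alpha(0) = 1/sqrt(2), alpha(n > 0) = 1
    return c

def dct2(img):
    """2D DCT of an N x M image computed as F = C_N f C_M^T, matching Eq. (1)."""
    n, m = img.shape
    return dct_matrix(n) @ img @ dct_matrix(m).T

def zigzag_select(coeffs, k):
    """Scan the coefficient matrix in zig-zag order from the upper-left
    (low-frequency) corner and keep only the first k coefficients."""
    n, m = coeffs.shape
    order = sorted(((u, v) for u in range(n) for v in range(m)),
                   key=lambda uv: (uv[0] + uv[1],                  # anti-diagonal
                                   uv[1] if (uv[0] + uv[1]) % 2    # alternate scan
                                   else uv[0]))                    # direction
    return np.array([coeffs[u, v] for u, v in order[:k]])

img = np.random.rand(28, 28)          # stand-in for a 28x28 MNIST digit
x_dct = zigzag_select(dct2(img), 70)  # 70 network inputs instead of 784 raw pixels
```

With only 70 of 784 coefficients retained, the input layer of the first auto-encoder shrinks by an order of magnitude, which is the source of the training speed-up reported in Section 3.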

Fig. 2. The architecture of the DCT based stacked sparse auto-encoders and its typical four training stages. A module in a gray block means its parameters cannot be adjusted during that training stage.
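The cost in Eq. (2) can be written out term by term. The following numpy sketch uses hypothetical names (`x_dct`, `weights`, `rho_hat`) and assumes the mean hidden activations have already been computed; it is not the authors' code.

```python
import numpy as np

def kl(rho, rho_hat):
    """KL divergence between target sparsity rho and mean activations rho_hat."""
    return np.sum(rho * np.log(rho / rho_hat)
                  + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

def cost(x_dct, y, weights, rho_hat, lam=0.2, beta=1.0, rho=0.05):
    """J(W, b) of Eq. (2): reconstruction error + weight decay + sparsity penalty."""
    n = x_dct.shape[0]                                          # number of samples N
    recon = np.sum((x_dct - y) ** 2) / (2.0 * n)                # first term
    decay = (lam / 2.0) * sum(np.sum(W ** 2) for W in weights)  # second term
    sparsity = beta * kl(rho, rho_hat)                          # third term
    return recon + decay + sparsity

x = np.random.rand(100, 70)                     # 70 DCT coefficients per sample
y = x + 0.01 * np.random.randn(*x.shape)        # toy reconstruction
W = [0.01 * np.random.randn(70, 200),           # encoder weights (illustrative)
     0.01 * np.random.randn(200, 70)]           # decoder weights
rho_hat = np.full(200, 0.05) + 0.01             # mean hidden activations
j = cost(x, y, W, rho_hat)
```

The default values lam=0.2, beta=1.0 and rho=0.05 match the experimental settings in Section 3.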

This optimization is carried out by the backpropagation algorithm [10]. During training, dropout [11] is added to prevent overfitting. Dropout is an auxiliary training technique which has proved efficient at reducing overfitting by randomly setting the activation of each hidden unit to zero with probability p within each layer. The whole training procedure can be divided into four stages, shown intuitively in Fig. 2. In Stage 1, after applying the DCT, the selected DCT coefficients of the training set are fed directly into a sparse auto-encoder, and the parameters of its encoder and decoder are optimized. In Stage 2, the encoder from Stage 1 is fixed and the second layer is trained. In Stage 3, a general classifier (e.g., softmax) is trained on top of the code h2 in a supervised manner. Finally, a supervised fine-tuning procedure is applied to the entire architecture, denoted as Stage 4.

3. EXPERIMENTAL RESULTS AND DISCUSSIONS

In order to evaluate the proposed approach, experiments are performed on the standard MNIST [12] and MNIST variation [13] datasets. The MNIST dataset consists of 28 x 28 digit images (60,000 for training and 10,000 for testing). The MNIST variation dataset contains different variations of the MNIST digit dataset, with added factors of variation such as rotation (rot), addition of a background composed of random pixels (bg-rand) or made from patches extracted from a set of images (bg-img), or combinations of these factors (rot-bg-img). Each problem is divided into a training, validation, and test set (10,000, 2,000, and 50,000 examples respectively).

For all experiments, two-hidden-layer sparse auto-encoders with λ = 0.2, β = 1.0 and ρ = 0.05 are used. The probability used in dropout is set to p = 0.2. Classification is based on softmax. The training sets are divided into mini-batches of 100 cases, and 200 training epochs are adopted for all the training stages. The algorithm is implemented in Matlab on a laptop with an Intel Core i5-3230M 2.6GHz CPU and 4GB RAM.

3.1. Evaluations on the Number of DCT Coefficients

In these experiments, the original deep network using raw pixels is compared with the proposed deep framework using different numbers of retained DCT coefficients of the input data; the number of hidden units in each of the two layers is set to 200.

Fig. 3. Accuracy of five datasets for different numbers of DCT coefficients retained and raw pixels.

The experimental results are shown in Fig. 3. It can be observed that as the number of DCT coefficients increases, there is only a slight fluctuation in recognition accuracy; 40 to 100 DCT coefficients are enough to achieve comparable or even better accuracy than raw pixels. Because of the paper length limitation, only the recognition performance on "bg-rand" is shown in Table 1, with the best accuracy in bold. It is impressive that "bg-rand" obtains 88.54% accuracy by retaining 100 DCT components, which is 14.59% higher than raw pixels. In addition, the "rot" dataset obtains 83.27% accuracy using only 60 DCT coefficients, compared to 76.50% using raw pixels.
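The dropout regularization used throughout these experiments (each hidden activation zeroed with probability p = 0.2) can be sketched as below. This is an illustrative numpy version using the "inverted" rescaling variant, under the assumption that rescaling is applied at training time; the paper does not specify its rescaling scheme.

```python
import numpy as np

def dropout(h, p=0.2, rng=None):
    """Zero each hidden activation with probability p; rescale the survivors
    by 1/(1-p) so the expected activation is unchanged (inverted dropout)."""
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(h.shape) >= p  # True (keep) with probability 1 - p
    return h * mask / (1.0 - p)

h = np.ones((5, 200))                             # a batch of hidden activations
h_drop = dropout(h, rng=np.random.default_rng(0)) # mean stays close to 1.0
```

At test time no units are dropped; with inverted dropout no extra scaling is needed then.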

Table 1. Recognition performance on the bg-rand dataset for different numbers of DCT coefficients retained.

num of selected DCT coefficients   hidden size   accuracy   training time/min
30                                 200x200       73.70%     2.40
40                                 200x200       79.72%     2.54
50                                 200x200       76.98%     2.67
60                                 200x200       86.36%     2.82
70                                 200x200       87.02%     2.81
80                                 200x200       88.21%     2.92
100                                200x200       88.54%     2.92
raw pixel=784                      200x200       73.95%     11.34

Table 2. Recognition performance on the MNIST dataset for different numbers of hidden units in the sparse auto-encoders trained on 70 DCT coefficients.

num of selected DCT coefficients   hidden size   accuracy   training time/min
70                                 30x30         97.10%     4.74
70                                 60x60         97.60%     6.45
70                                 90x90         97.78%     7.96
70                                 120x120       97.75%     10.13
70                                 160x160       98.04%     10.86
70                                 200x200       98.39%     15.90
70                                 300x300       97.96%     26.73

Another remarkable property is that training such a deep architecture using only a small subset of DCT coefficients is four times faster than using raw pixels for very small images (28x28), as shown in Fig. 4. When the size of the image or dataset becomes larger, the reduction in computation cost relative to raw pixels becomes even more significant. For example, for 200x200 images, 100 DCT coefficients are generally sufficient to represent the most visually significant information of the original image. In this situation, the input dimension of the deep network is reduced from 40,000 to 100 when using the truncated DCT coefficients instead of raw pixels, and training will be hundreds of times faster with little loss of accuracy.

Fig. 4. Training time of five datasets for different numbers of DCT coefficients retained and raw pixels.

3.2. Evaluations on the Number of Hidden Units

In order to confirm whether the proposed approach can achieve competitive recognition accuracy with different numbers of hidden units, the number of DCT coefficients is fixed to 70, and sparse auto-encoders with varying numbers of hidden units are trained. From Table 2, it can be seen that even a small number of hidden units achieves competitive recognition accuracy: the recognition accuracy on MNIST reaches 97.10% using a small network, namely [70x30x30]. Moreover, it takes only 4.74 minutes to train such a network.

4. CONCLUSIONS & FUTURE WORK

This paper presents a novel framework for a high speed deep network system using the DCT. The selected DCT coefficients give a good compromise between information packing ability and computational complexity. Experiments on MNIST and its variation datasets with 2-layer sparse auto-encoders demonstrate that for very small images (28x28) the proposed approach can achieve competitive performance with much less computation time (four times less) compared to raw pixels. The contributions can be summarized as follows: (i) introduction of the DCT into a deep architecture by using a fixed set of filters based on the DCT in the first layer instead of learning them from raw pixels; (ii) the DCT coefficients can be computed very efficiently (much more so than the responses of a general set of filters), and hence this is a computationally attractive approach for processing full-size images, as is typically required at the first layer of a deep architecture; (iii) the proposed method takes less training time and fewer computing resources, which provides a new solution for researchers with ordinary computers to access deep learning, especially for recognition systems with high-resolution images. Based on the above analysis, future work will focus on extending the proposed framework to other deep learning networks such as Restricted Boltzmann Machines [14] and Sparse Coding [15], to name a few.

Acknowledgment

This work is supported by the National Natural Science Foundation of China (No.61171142), the Science and Technology Planning Project of Guangdong Province of China (No.2011A010801005, 2010A080402015), and the Fundamental Research Funds for the Central Universities (No.2013ZM0081).

5. REFERENCES

[1] Yoshua Bengio, Aaron Courville, and Pascal Vincent, "Representation learning: A review and new perspectives," 2013.

[2] Quoc V. Le, Marc'Aurelio Ranzato, Rajat Monga, Matthieu Devin, Kai Chen, Greg S. Corrado, Jeff Dean, and Andrew Y. Ng, "Building high-level features using large scale unsupervised learning," arXiv preprint arXiv:1112.6209, 2011.

[3] Nasir Ahmed, T. Natarajan, and Kamisetty R. Rao, "Discrete cosine transform," IEEE Transactions on Computers, vol. C-23, no. 1, pp. 90–93, 1974.

[4] Rafael C. Gonzalez and Richard E. Woods, "Digital image processing: Introduction," 2002.

[5] Charilaos A. Christopoulos, Jan G. Bormans, Athanasios N. Skodras, and Jan P. Cornelis, "Efficient computation of the two-dimensional fast cosine transform," in SPIE's International Symposium on Optical Engineering and Photonics in Aerospace Sensing. International Society for Optics and Photonics, 1994, pp. 229–237.

[6] William B. Pennebaker, JPEG: Still Image Data Compression Standard, Springer, 1992.

[7] Syed Ali Khayam, "The discrete cosine transform (DCT): theory and application," Michigan State University, 2003.

[8] Y-Lan Boureau, Yann LeCun, et al., "Sparse feature learning for deep belief networks," in Advances in Neural Information Processing Systems, 2007, pp. 1185–1192.

[9] Geoffrey Hinton, "A practical guide to training restricted Boltzmann machines," Momentum, vol. 9, no. 1, 2010.

[10] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams, "Learning representations by back-propagating errors," Cognitive Modeling, vol. 1, pp. 213, 2002.

[11] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov, "Improving neural networks by preventing co-adaptation of feature detectors," arXiv preprint arXiv:1207.0580, 2012.

[12] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, pp. 2278–2324, 1998.

[13] Hugo Larochelle, Dumitru Erhan, Aaron Courville, James Bergstra, and Yoshua Bengio, "An empirical evaluation of deep architectures on problems with many factors of variation," in Proceedings of the 24th International Conference on Machine Learning. ACM, 2007, pp. 473–480.

[14] Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.

[15] Honglak Lee, Alexis Battle, Rajat Raina, and Andrew Y. Ng, "Efficient sparse coding algorithms," in Advances in Neural Information Processing Systems, 2006, pp. 801–808.
