High Speed Deep Networks Based on Discrete Cosine Transformation

Xiaoyi Zou, Xiangmin Xu*, Chunmei Qing, Xiaofen Xing
School of Electronic and Information Engineering, South China University of Technology, Guangzhou, China
[email protected], {xmxu, qchm, xfxing}[email protected]
* Xiangmin Xu is the corresponding author.

ABSTRACT

Traditional deep networks take raw pixels of data as input and automatically learn features using unsupervised learning algorithms. In this configuration, in order to learn good features, the networks usually have many layers and many hidden units, which leads to extremely high training time costs. As a widely used image compression algorithm, the Discrete Cosine Transformation (DCT) is utilized to reduce image information redundancy, because only a limited number of DCT coefficients is needed to preserve the most important image information. This paper proposes a novel framework that combines DCT and deep networks for a high-speed object recognition system: a small subset of the DCT coefficients of the data is fed into 2-layer stacked sparse auto-encoders instead of raw pixels. Because of the excellent decorrelation and energy compaction properties of DCT, this approach is shown experimentally to be not only efficient but also computationally attractive for processing high-resolution images in a deep architecture.

Index Terms— Discrete Cosine Transformation, deep networks, high speed, object recognition

1. INTRODUCTION

Recently, many researchers have focused on deep learning because of its ability to automatically learn good representations of the underlying distribution of the data. Successful deep learning algorithms and their applications can be found in the recent literature, using a variety of approaches, and have achieved a remarkable string of empirical successes both in academia and in industry [1].

Traditional deep architectures learn a complex mapping from input to output directly from raw pixels of data. It is well known that the dimensionality of raw input data is large. In this case, in order to learn the expected mapping function, more hidden layers or hidden units must be added to increase the capacity of the model. For example, in [2], Quoc V. Le et al. trained a 9-layer locally connected sparse auto-encoder with 1 billion connections on 10 million 200x200-pixel images for a face detector. This network took three days to train using parallelism techniques on a cluster of 1,000 machines (16,000 cores). On the one hand, large amounts of training data are needed to learn the millions or billions of parameters of such a network in order to prevent overfitting. On the other hand, training costs much more time and a greater quantity of memory resources, which unfortunately makes the deep learning method nontrivial for general users.

The motivation to solve the above problem is to reduce the dimensionality of the input data. A well-known and widely used dimensionality reduction technique is Principal Component Analysis (PCA). However, PCA is data dependent, and obtaining the principal components is a nontrivial task. The Discrete Cosine Transform (DCT) [3] is investigated in this paper, since its basis functions are input independent and its compact representation ability closely approximates that of the optimal PCA [4]. Furthermore, computationally efficient and fast algorithms exist to compute the 2D DCT [5]. The DCT transforms the input data into a linear combination of different frequency components. Most of the components are typically very small in magnitude, because most of the visually significant information about the image is concentrated in just a few coefficients. Hence, a limited number of DCT components is sufficient to preserve the most important image information, such as regularity, complexity and some texture features. By feeding these components into a deep network for object recognition, high accuracy can be achieved, while the training speed of the system is dramatically faster than that of comparable approaches which use raw pixels as input.
This paper presents a new approach to object recognition systems by introducing DCT into deep networks. The main idea is that, instead of learning the set of filters at each layer of the deep network, the first layer uses a fixed set of filters based on DCT. The DCT is first applied to reduce the information redundancy, and only a limited number of the coefficients is selected to feed into the deep network instead of raw pixels. The deep network is then trained layer-wise, in an unsupervised fashion, to learn good representations of the data in the frequency domain. Specifically, sparse auto-encoders are used here, but the method is a general framework which can easily be extended to other deep learning approaches because of the generality of DCT.

The rest of the paper is organized as follows. Section 2 describes the details of the proposed approach. Section 3 discusses the results. The conclusions and some future work are given in Section 4.

2. PROPOSED DCT BASED DEEP NETWORKS

2.1. Discrete Cosine Transform

The DCT transforms a finite signal into a sum of sinusoids of varying magnitudes and frequencies. The DCT of an N x M image f(x, y) is defined by:

F(u, v) = \frac{2}{\sqrt{MN}} \alpha(u) \alpha(v) \sum_{x=0}^{N-1} \sum_{y=0}^{M-1} f(x, y) \cos\frac{(2x+1)u\pi}{2N} \cos\frac{(2y+1)v\pi}{2M}    (1)

Here,

\alpha(n) = \begin{cases} \frac{1}{\sqrt{2}}, & n = 0 \\ 1, & n > 0 \end{cases}

When the DCT is applied to the entire image f(x, y), a 2D frequency coefficient matrix F(u, v) with the same dimensions is obtained.

The DCT transforms the input data into a linear combination of basis functions which stand for different frequency components. For 8-by-8 matrices, the 64 basis functions (filters) are shown in Fig. 1(a); it can be seen that the basis functions exhibit a progressive increase in frequency in both the vertical and horizontal directions [7]. From Fig. 1(c), it can be observed that the DCT coefficients with large magnitudes are mainly located in the upper-left corner of the DCT matrix. Accordingly, as illustrated in Fig. 1(d), the DCT coefficient matrix can be scanned in a zig-zag manner starting from the upper-left corner and converted into a one-dimensional DCT coefficient vector, and the coefficients in the higher frequency bands can be removed in order to reduce the amount of data needed to describe the image without sacrificing too much image quality.

Fig. 1. (a) displays the two-dimensional DCT basis functions (for 8-by-8 matrices); neutral gray represents zero, white represents positive amplitudes, and black represents negative amplitudes [6]. (b) displays the original image, and its DCT coefficient matrix is shown in (c). The zig-zag pattern shown in (d) is adopted to select a limited number of DCT coefficients.
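For concreteness, the following Python sketch (not the authors' code) shows how the 2D DCT of Eq. (1) can be computed with an off-the-shelf routine and how the coefficient matrix can be scanned in zig-zag order so that only the first K low-frequency coefficients are kept as the input vector for the deep network. The image size (32x32) and K = 200 are illustrative assumptions.

    # Minimal sketch of the DCT front end of Section 2.1: 2D DCT, zig-zag scan,
    # and retention of the first K low-frequency coefficients.
    import numpy as np
    from scipy.fftpack import dct

    def dct2(image):
        """Orthonormal 2D DCT-II, applied separably along rows and columns."""
        return dct(dct(image, axis=0, norm='ortho'), axis=1, norm='ortho')

    def zigzag_indices(n_rows, n_cols):
        """(row, col) pairs in zig-zag order, starting at the upper-left corner."""
        return sorted(
            ((r, c) for r in range(n_rows) for c in range(n_cols)),
            key=lambda rc: (rc[0] + rc[1],                              # anti-diagonal index
                            rc[0] if (rc[0] + rc[1]) % 2 else rc[1]))   # alternate direction

    def dct_features(image, k):
        """Keep the first k zig-zag-ordered DCT coefficients as a 1-D feature vector."""
        coeffs = dct2(image.astype(np.float64))
        order = zigzag_indices(*coeffs.shape)[:k]
        return np.array([coeffs[r, c] for r, c in order])

    # Example: a random 32x32 "image" reduced to 200 low-frequency coefficients.
    img = np.random.rand(32, 32)
    features = dct_features(img, k=200)
    print(features.shape)  # (200,)

The higher-frequency coefficients that the zig-zag scan discards carry little visually significant energy, which is why the truncated vector can replace raw pixels as the network input.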
2.2. DCT Based Sparse Auto-encoders

In this paper, sparse auto-encoders [8][9] are adopted to learn high-level frequency features in an unsupervised fashion, using a limited number of DCT coefficients as input.

As shown in Fig. 2, an auto-encoder belongs to the encoder-decoder paradigm. The encoder transforms an input vector x into a code vector h, and the decoder produces a reconstruction y. The optimal code vector is obtained by minimizing the reconstruction error. In fact, this simple auto-encoder often ends up learning a low-dimensional representation very similar to PCA [1], and it has been proved that the compact representation ability of DCT closely approximates that of the optimal PCA [4]. In this paper, it is therefore proposed to use DCT as a hard-wired linear auto-encoder layer located at the first layer of a deep architecture, as shown in Fig. 2(a). The advantage is that approximately the same performance can be obtained without the need to train a linear auto-encoder or to solve PCA. Furthermore, only a limited number of the output codes produced by the DCT is retained to feed into the subsequent non-linear stacked auto-encoders, which greatly reduces training time compared with using raw pixels. The parameters of this model are optimized to minimize the following cost function:

J(W, b) = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{2} \left\| x_{DCT}^{(i)} - y^{(i)} \right\|^2 + \frac{\lambda}{2} \sum_{l=1}^{n_l - 1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \left( W_{ji}^{(l)} \right)^2 + \beta \sum_{j=1}^{s_{l+1}} KL(\rho \| \hat{\rho}_j)    (2)

Here, N is the number of training samples and n_l denotes the number of layers. s_l denotes the number of units in layer l, and W_{ji}^{(l)} denotes the weight associated with the connection between unit i in layer l and unit j in layer l + 1. There are three terms in this cost function. The first term is an average sum-of-squares error between the input and its reconstruction, and the second is a regularization term for preventing overfitting. The last is a sparsity penalty term based on the Kullback-Leibler (KL) divergence, which enforces the average activation of hidden unit j, denoted \hat{\rho}_j, to be approximately equal to a small value \rho close to zero. \lambda and \beta control the relative importance of these three terms.

Fig. 2. The architecture of DCT-based stacked sparse auto-encoders and its typical four training stages. The module in the gray block means its parameters cannot be adjusted during this training stage.
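As a rough illustration of Eq. (2), the numpy sketch below computes the three terms of the cost for a single sigmoid encoder-decoder pair fed with retained DCT coefficients. The variable names (W1, b1, W2, b2, rho, beta, lam) and the sizes in the example are assumptions for illustration only; the paper's multi-layer weight-decay sum reduces here to the two weight matrices of one auto-encoder.

    # Sketch of the sparse auto-encoder cost of Eq. (2) for one hidden layer.
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sparse_autoencoder_cost(X_dct, W1, b1, W2, b2, rho=0.05, beta=3.0, lam=1e-4):
        """X_dct: (N, d) matrix of retained DCT coefficients, one sample per row."""
        N = X_dct.shape[0]

        # Forward pass: encode to hidden activations, decode to a reconstruction.
        H = sigmoid(X_dct @ W1 + b1)          # (N, hidden)
        Y = sigmoid(H @ W2 + b2)              # (N, d)

        # 1) Average sum-of-squares reconstruction error.
        recon = 0.5 * np.sum((X_dct - Y) ** 2) / N

        # 2) Weight-decay (L2) regularization over all weight matrices.
        decay = 0.5 * lam * (np.sum(W1 ** 2) + np.sum(W2 ** 2))

        # 3) KL-divergence sparsity penalty pushing mean activations toward rho.
        rho_hat = H.mean(axis=0)
        kl = np.sum(rho * np.log(rho / rho_hat)
                    + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

        return recon + decay + beta * kl

    # Example with hypothetical sizes: 200 retained DCT coefficients, 100 hidden units.
    rng = np.random.default_rng(0)
    X = rng.random((64, 200))
    W1 = rng.standard_normal((200, 100)) * 0.01
    b1 = np.zeros(100)
    W2 = rng.standard_normal((100, 200)) * 0.01
    b2 = np.zeros(200)
    print(sparse_autoencoder_cost(X, W1, b1, W2, b2))

In the stacked setting of Fig. 2, a cost of this form is minimized for one auto-encoder at a time while the fixed DCT layer and any previously trained layers are held constant, in keeping with the greedy layer-wise training described above.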
