Fast Image Recognition with Gabor Filter and Pseudoinverse Learning AutoEncoders

Xiaodan Deng, Sibo Feng, Ping Guo*, and Qian Yin*

Image Processing and Pattern Recognition Laboratory, Beijing Normal University, Beijing, China

Abstract. Deep neural networks have been successfully used in various fields and have achieved significant results in some typical tasks, especially in computer vision. However, deep neural networks are usually trained with gradient descent based algorithms, which suffer from the vanishing gradient and exploding gradient problems. In addition, expert-level professional knowledge is required to design the structure of a deep neural network and to find the optimal hyper-parameters for a given task. Consequently, training a deep neural network becomes a very time-consuming problem. To overcome these shortcomings, we present a model that combines Gabor filters and pseudoinverse learning autoencoders. The model is optimized with a non-gradient-descent algorithm. Besides, we present an empirical formula to set the number of hidden neurons and the number of hidden layers during the training process. The experimental results show that our model trains faster than existing benchmark methods while achieving comparable recognition accuracy.

Keywords: Pseudoinverse learning autoencoder, Gabor filter, Image recognition, Handcraft feature.

1 Introduction

Neural networks have attracted many researchers and have been used successfully in many fields. Currently, the most widely used model for image recognition is the convolutional neural network (CNN). In 1998, LeCun et al. published a paper on applying neural networks to handwriting recognition, optimized with the back propagation algorithm; the LeNet5 model presented in [1] is considered the beginning of CNNs. Its network structure includes the convolutional layer, the pooling layer and the fully connected layer, which are the basic components of modern CNNs. In 2012, Krizhevsky et al. used AlexNet [2] in the ImageNet contest to set a new record for image classification and established the position of CNNs in computer vision. AlexNet uses five convolutional layers and three fully connected layers for classification. Subsequently, many other successful CNN models appeared, which became deeper and more complex. In 2015, He et al. proposed the ResNet [3] model, which reached 152 layers.
Recent successful CNN models usually have complex structures and require many hyper-parameters to be set. These hyper-parameters affect the performance of the CNN models and are difficult to tune. Many research groups have presented their results, but these are difficult to reproduce. Moreover, because there are so many hyper-parameters, training a CNN model is a time-consuming process. Most deep neural networks are trained by gradient descent (GD) based algorithms and their variations [1,4]. It has also been found that gradient descent based algorithms in deep neural networks have an inherent instability, which blocks the learning process of the earlier or later layers. In addition, the gradient descent method is prone to the vanishing gradient problem. Although CNNs perform well, they require much professional knowledge to use and take a lot of time to train.
In order to reduce the training time and improve the generalization ability of neural networks, we present a model that combines the Gabor filter [5] and pseudoinverse learning autoencoders (PILAE) [6] to deal with the image recognition problem. The Gabor transformation is a windowed Fourier transformation, and the Gabor function can extract relevant characteristics at different scales and directions in the frequency domain. In addition, the Gabor function is similar to the biological function of human eyes, so it is often used for texture recognition and has achieved good results. The main advantage of CNN based deep nets is that the features are learned from images, while the advantage of traditional handcraft features is the speed of feature extraction. Therefore, extracting features with a Gabor filter (GF) is much easier than with a CNN. In Feng et al.'s work [7], the histogram of oriented gradient (HOG) is used to extract features. However, generating the HOG descriptor is tedious, resulting in slow speed and poor real-time performance. Besides, due to the nature of the gradient, the descriptor is quite sensitive to noise. Hence, we choose the Gabor filter to extract features first. Then, a PILAE based feed forward neural net is adopted to extract independent feature vectors and perform image recognition.
Our proposed GF+PILAE model does not need gradient descent based algorithms for optimization. The learning procedure of our model is forward propagating, and the whole structure of the network, including the depth of the network and the number of neurons in each hidden layer, is determined with a given strategy during propagation. It is a quasi-automatic learning procedure, so even users without professional knowledge can easily use it. This is our effort to promote the development of democratized artificial intelligence.

2 Related Work

2.1 Gabor Filter

The class of Gabor functions was presented by Gabor [8]. The basic idea of the Gabor function is to add a small window to the signal. The Fourier transform of the signal is mainly concentrated in the small window, so it can reflect the local characteristics of the signal. Daugman [9] extended the Gabor function to the two-dimensional case. The Gabor wavelet function has been regarded as the best model for simulating visual sensory cells in the cerebral cortex [10]. Each visual cell can be viewed as a Gabor filter with a certain direction and scale. When an external stimulus such as an image signal is input to the visual cells, their output response is the convolution of the image with the Gabor filter, and the output signal is further processed by the brain to form the final cognitive impression. This model helps explain the tolerance of human vision to changes in scale and direction. The two-dimensional Gabor kernel function is defined as follows [11]:

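$$G_{\lambda,\theta,\varphi,\sigma,\gamma}(x, y) = \exp\left(-\frac{x'^2 + \gamma^2 y'^2}{2\sigma^2}\right)\cos\left(2\pi\frac{x'}{\lambda} + \varphi\right), \tag{1}$$

$$x' = (x - x_0)\cos\theta + (y - y_0)\sin\theta,$$

$$y' = -(x - x_0)\sin\theta + (y - y_0)\cos\theta.$$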

Eq. (1) is the product of a Gaussian function and a cosine function. The arguments x and y specify the position of a light impulse, where $(x_0, y_0)$ is the center of the receptive field in the spatial domain. θ is the orientation of the parallel bands in the Gabor kernel, and its valid values are real numbers from 0 to 360 degrees. φ is the phase parameter of the cosine function, and its valid values range from -180 to 180 degrees. γ is the spatial aspect ratio, which represents the ellipticity of the Gabor filter. λ is the wavelength of the cosine function. σ is the standard deviation of the Gaussian function; it determines the size of the receptive area of the Gabor filter. Its value is related to the bandwidth b and to λ. The bandwidth b indicates the difference between the high and low frequencies. Eq. (2) presents the relationship among b, σ and λ:

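$$b = \log_2\frac{\frac{\sigma}{\lambda}\pi + \sqrt{\frac{\ln 2}{2}}}{\frac{\sigma}{\lambda}\pi - \sqrt{\frac{\ln 2}{2}}}, \qquad \frac{\sigma}{\lambda} = \frac{1}{\pi}\sqrt{\frac{\ln 2}{2}}\cdot\frac{2^b + 1}{2^b - 1}. \tag{2}$$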
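As a quick numerical check of Eq. (2), the following minimal Python sketch converts between the bandwidth b and σ for a given wavelength λ. The function names and the example values (b = 1, λ = 4) are illustrative assumptions, not settings taken from the experiments.

```python
import math

# Constant sqrt(ln 2 / 2) that appears in both forms of Eq. (2).
C = math.sqrt(math.log(2) / 2)

def sigma_from_bandwidth(b, lam):
    """Right-hand form of Eq. (2): sigma for bandwidth b and wavelength lam."""
    return lam / math.pi * C * (2**b + 1) / (2**b - 1)

def bandwidth_from_sigma(sigma, lam):
    """Left-hand form of Eq. (2): bandwidth b for a given sigma and lam."""
    r = sigma / lam * math.pi
    return math.log2((r + C) / (r - C))

# Illustrative values only (assumption): b = 1, lam = 4.
sigma = sigma_from_bandwidth(1.0, 4.0)
b_back = bandwidth_from_sigma(sigma, 4.0)
print(sigma / 4.0, b_back)
```

The two functions are inverses of each other, so converting b to σ and back should return the original bandwidth up to rounding.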

Fig. 1. Gabor filter bank. These filters are in different scales and orientations [12]

Usually we use Gabor filters with 8 directions and 5 scales, and these parameters can be adjusted. Fig. 1 is a sample Gabor filter bank with forty different Gabor filters.
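A bank of this kind can be sketched in a few lines of NumPy by sampling Eq. (1) on a square grid. This is only an illustration under stated assumptions: the kernel size (31), the five wavelengths used as scales, and the choice σ = 0.56λ are hypothetical values, and the receptive field center is fixed at $(x_0, y_0) = (0, 0)$.

```python
import numpy as np

def gabor_kernel(size, lam, theta, phi=0.0, sigma=1.0, gamma=0.5):
    """Sample Eq. (1) on a size x size grid (size odd), centred at (0, 0)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    # Rotated coordinates x' and y' from Eq. (1).
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr**2 + gamma**2 * yr**2) / (2 * sigma**2))
    carrier = np.cos(2 * np.pi * xr / lam + phi)
    return envelope * carrier

# Hypothetical scales (wavelengths in pixels) and 8 orientations spaced by pi/8.
wavelengths = [4.0, 5.7, 8.0, 11.3, 16.0]
thetas = [k * np.pi / 8 for k in range(8)]
bank = [gabor_kernel(31, lam, th, sigma=0.56 * lam)
        for lam in wavelengths for th in thetas]
print(len(bank), bank[0].shape)
```

With 5 scales and 8 orientations the list holds forty kernels, matching the size of the bank described above.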

Feature extraction is performed using Gabor filter, as shown in Eq. (3).

$$\mathbf{I}_G = \mathbf{I} \oplus \mathbf{G}. \tag{3}$$

where I is the grayscale distribution of the image, $\mathbf{I}_G$ is the feature extracted from I, "⊕" stands for the 2D convolution operator, and G is the defined Gabor filter. Eq. (3) can be efficiently computed by the fast Fourier transform, $\mathbf{I}_G = F^{-1}(F(\mathbf{I})F(\mathbf{G}))$, where F is the Fourier transform and $F^{-1}$ is the inverse Fourier transform. Gabor filters are sensitive to the edge information of images and can adapt to environments with different lighting. Studies have found that the Gabor wavelet transformation is very suitable for texture expression and separation. Compared with other methods, the Gabor filter needs less data and can meet real-time requirements. In addition, it can tolerate a certain degree of image rotation and deformation.
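The FFT route for Eq. (3) can be illustrated with a short NumPy sketch. It zero-pads both arrays to the full output size so that the product of spectra gives a linear (not circular) convolution, and compares the result against a direct summation. The function names and the random test data are assumptions made for the example.

```python
import numpy as np

def fft_convolve2d(image, kernel):
    """Full linear 2D convolution via F^{-1}(F(I) F(G)) with zero padding."""
    h = image.shape[0] + kernel.shape[0] - 1
    w = image.shape[1] + kernel.shape[1] - 1
    spectrum = np.fft.rfft2(image, s=(h, w)) * np.fft.rfft2(kernel, s=(h, w))
    return np.fft.irfft2(spectrum, s=(h, w))

def direct_convolve2d(image, kernel):
    """Reference implementation: shift-and-add over every kernel entry."""
    H, W = image.shape
    out = np.zeros((H + kernel.shape[0] - 1, W + kernel.shape[1] - 1))
    for i in range(kernel.shape[0]):
        for j in range(kernel.shape[1]):
            out[i:i + H, j:j + W] += kernel[i, j] * image
    return out

rng = np.random.default_rng(0)
image = rng.standard_normal((28, 28))   # stand-in for a grayscale image
kernel = rng.standard_normal((7, 7))    # stand-in for a Gabor kernel G
fast = fft_convolve2d(image, kernel)
slow = direct_convolve2d(image, kernel)
print(np.allclose(fast, slow))
```

Cropping the central region of the full result gives an output of the same size as the input image, if that is what the following stage expects.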

2.2 Autoencoders

An autoencoder [13] was first proposed by Rumelhart et al. in 1986. The autoencoding neural network is an unsupervised learning scheme, which uses the back propagation algorithm and tries to encode input vectors into hidden vectors and decode hidden vectors back into input vectors. Autoencoders are usually used for dimension reduction and feature learning tasks. The autoencoder consists of two parts: an encoder and a decoder. The encoder compresses the input into a latent space representation, and the decoder reconstructs the input from the latent space representation. The loss function of the autoencoder can be defined as the reconstruction error function in Eq. (4),

$$E = \sum_{i=1}^{N} \left\| \left(W_d\, f(W_e x^i + b_e) + b_d\right) - x^i \right\|^2, \tag{4}$$

where $W_e$ and $W_d$ are the weights of the encoder and decoder, respectively, and $b_e$ and $b_d$ are the biases of the encoder and decoder, respectively.
A stacked autoencoder is a feed forward neural network in which the outputs of each encoder layer are the inputs of the successive layer. A way to obtain good parameters for a stacked autoencoder is greedy layer-wise training [14]. This method trains the weight parameters of each layer individually while freezing the parameters of the rest of the model. To produce better results, after this phase of training is complete, fine-tuning with the backpropagation algorithm can be used to tune the parameters of all layers at the same time.
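The reconstruction error of Eq. (4) can be evaluated for a whole data set at once. The sketch below is a minimal illustration that stores one sample per column; the tanh activation, the layer sizes and the random weights are assumptions for the example, not values from this paper.

```python
import numpy as np

def reconstruction_error(X, W_e, b_e, W_d, b_d, f=np.tanh):
    """Eq. (4) for X of shape (d, N), one sample x^i per column."""
    H = f(W_e @ X + b_e[:, None])        # encoder output for every sample
    X_hat = W_d @ H + b_d[:, None]       # decoder reconstruction
    return float(np.sum((X_hat - X) ** 2))

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))          # d = 5 features, N = 8 samples
W_e, b_e = rng.standard_normal((3, 5)), np.zeros(3)
W_d, b_d = rng.standard_normal((5, 3)), np.zeros(5)
E = reconstruction_error(X, W_e, b_e, W_d, b_d)
print(E)
```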

2.3 Pseudoinverse Learning Autoencoder

Pseudoinverse learning algorithm (PIL) [15,16,17] was originally proposed by Guo et al. as a fast algorithm for training feedforward neural networks. The whole training process of PIL needs only simple matrix inner product and pseudoinverse operations. It improves the learning accuracy by adding layers, without the iterative optimization used by gradient descent based algorithms. Moreover, it is convenient to use and does not require users to set various hyper-parameters. The depth parameter is adjusted automatically during training.
For a classification task, suppose that the data set $D = \{X^i, O^i\}_{i=1}^{N}$ contains N samples, where $X^i = (x_1, x_2, \cdots, x_d) \in R^d$ and $O^i = (o_1, o_2, \cdots, o_m) \in R^m$ denote the i-th input sample and the corresponding expected output, respectively. For a single hidden layer feedforward network, the most widely used sum-of-squares objective function is as follows,

$$E = \frac{1}{2N}\sum_{i=1}^{N}\sum_{j=1}^{m}\left\| g_j(x^i, \hat{\theta}) - o_j^i \right\|^2, \tag{5}$$

where $g_j(x^i, \hat{\theta})$ is the j-th output neuron, which maps the input value to the predicted value, and it is defined as follows,

$$g_j(x, \hat{\theta}) = \sum_{i=1}^{p} w^{1}_{i,j}\, f\left(\sum_{k=1}^{d} w^{0}_{k,i}\, x_k + b\right). \tag{6}$$

For simplicity, we can represent the map in matrix form. The hidden layer is defined as follows,

$$H = f(XW_0 + b), \quad X \in R^{N \times d}, \quad W_0 \in R^{d \times p}, \tag{7}$$

where H is a matrix representing the output of the hidden layer, X is the input matrix consisting of N vectors of dimension d, $W_0$ is the weight matrix between the input and hidden layers with d rows and p columns, b is a bias parameter in the input layer, and f(·) is an activation function. Details of PIL can be found in Refs. [15,16,17].
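To make the matrix form concrete, the following sketch computes the hidden layer of Eq. (7) and then obtains output weights in one step with the Moore-Penrose pseudoinverse, which gives a least-squares minimizer of Eq. (5). It is a simplified illustration rather than the full PIL procedure of Refs. [15,16,17]: the random input weights `W0`, the tanh activation, the zero bias and the synthetic data are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, p, m = 200, 10, 50, 3               # samples, input dim, hidden units, classes

X = rng.standard_normal((N, d))           # synthetic inputs (assumption)
labels = rng.integers(0, m, size=N)
O = np.eye(m)[labels]                     # one-hot expected outputs, shape (N, m)

W0 = rng.standard_normal((d, p)) / np.sqrt(d)   # assumption: random input weights
b = 0.0
H = np.tanh(X @ W0 + b)                   # Eq. (7), shape (N, p)

W1 = np.linalg.pinv(H) @ O                # least-squares output weights, no iteration
E = np.sum((H @ W1 - O) ** 2) / (2 * N)   # objective of Eq. (5)
print(W1.shape, E)
```

No learning rate or epoch count appears anywhere in this computation, which is the property the pseudoinverse approach relies on.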

3 Proposed Methodology

3.1 Proposed Classification Model

We propose a classification model combining the Gabor filter and pseudoinverse learning autoencoders (PILAE), which is a feedforward network that needs no iterative optimization by gradient descent. The structure of the model is shown in Fig. 2. The input image is first filtered by the Gabor filter bank to obtain feature maps. The Gabor feature map is a kind of handcraft feature, and it is easy to obtain. The feature maps are fused into a vector as the input of PILAE. PILAE is used to further extract features and perform classification.
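The fusion step is not specified in detail here, so the sketch below shows only one plausible reading: each feature map is flattened and the results are concatenated into a single vector. The helper name `fuse_feature_maps` and the stand-in random maps are hypothetical.

```python
import numpy as np

def fuse_feature_maps(feature_maps):
    """Flatten each 2D feature map and concatenate them into one 1D vector."""
    return np.concatenate([np.asarray(fm, dtype=float).ravel()
                           for fm in feature_maps])

rng = np.random.default_rng(0)
# Stand-ins for four Gabor feature maps of a 28 x 28 image (assumption).
maps = [rng.standard_normal((28, 28)) for _ in range(4)]
vector = fuse_feature_maps(maps)
print(vector.shape)
```

Under this reading the input dimension of PILAE grows linearly with the number of feature maps.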

3.2 Feature Extraction

In our model, feature extraction consists of two parts. First, image features are extracted by Gabor filters. With different scales and orientations, Gabor filters can extract different features from the original input image. Moreover, the biological function of human eyes is similar to the Gabor function, so the Gabor filter performs well in extracting features. Second, PILAE can extract features from the input vector. PILAE consists of several layers with different numbers of neurons. If the number of neurons in a hidden layer is less than the dimension of the input, the layer is equivalent to extracting features from the input vectors.

Fig. 2. The proposed method combining Gabor filter and PILAE. Gabor filters extract features from the input image to form feature maps, then the PILAE extracts features further and performs classification.

3.3 Training Model

The Gabor filter part can be considered as data preprocessing, which transforms the input data into a first feature space, while the training focuses on the PILAE part. For a single autoencoder trained with the pseudoinverse algorithm, the encoder part can be represented as follows,

$$H = f(W_e X), \tag{8}$$

where X is the input, which can be regarded as the incoming feature data, and H is the hidden output, which can be regarded as the feature data passed on to the succeeding layer. f(·) is the activation function and $W_e$ is the weight between the input and hidden layers. The decoder part can be represented as follows,

$$G = g(W_d H), \tag{9}$$

where G is a vector mapped from the vector H, g(·) is the activation function, and $W_d$ is the weight between the hidden layer and the output layer. Our objective function is as follows,

$$J = \frac{1}{2}\|W_d H - X\|^2 + \frac{\hat{\lambda}}{2}\|W_d\|^2, \tag{10}$$

where $\hat{\lambda}$ is a regularization parameter, which can be selected with a formula in Ref. [19]. This is an error function with weight decay. The goal of training the autoencoder is to find the weight parameters that minimize the error function. Using the pseudoinverse learning algorithm, the weight of the decoder is obtained as follows,

$$W_d = XH^T(HH^T + \hat{\lambda}I)^{-1}. \tag{11}$$
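Eq. (11) follows from setting the gradient of Eq. (10) with respect to $W_d$ to zero, that is, $(W_d H - X)H^T + \hat{\lambda}W_d = 0$, where the regularization term is read as $\hat{\lambda}$ times the identity matrix. The NumPy sketch below checks this numerically on synthetic data; the shapes, the value of `lam` and the function name are assumptions for illustration.

```python
import numpy as np

def decoder_weights(X, H, lam):
    """Closed-form decoder of Eq. (11); X is (d, N), H is (p, N)."""
    p = H.shape[0]
    A = H @ H.T + lam * np.eye(p)        # symmetric positive definite for lam > 0
    # Solve W_d A = X H^T without forming an explicit inverse:
    # since A is symmetric, this is A W_d^T = H X^T.
    return np.linalg.solve(A, H @ X.T).T

rng = np.random.default_rng(0)
X = rng.standard_normal((12, 100))       # synthetic input, d = 12, N = 100
H = np.tanh(rng.standard_normal((6, 12)) @ X)   # synthetic hidden output, p = 6
lam = 1e-2                               # hypothetical regularization value

W_d = decoder_weights(X, H, lam)
grad = (W_d @ H - X) @ H.T + lam * W_d   # gradient of Eq. (10) at the solution
print(W_d.shape, np.abs(grad).max())
```

Using a linear solve instead of an explicit matrix inverse is a numerical choice made in this sketch; it computes the same quantity as Eq. (11).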

The weights $W_e$ and $W_d$ can be set as $W_d = W_e^T$, which is termed tied weights. A basic low-rank approximation is adopted to avoid the identity mapping [18]. To obtain better results, we can add layers to the model, which is called a stacked autoencoder in the literature. The number of neurons in the first hidden layer is set equal to the rank of the input matrix. For each succeeding hidden layer, the number of neurons is set to p = βDim(x), β ∈ (0, 1], where Dim(x) is the dimension of the input vectors. Because the autoencoder is trained with the PIL algorithm, we name it PILAE [18]. When the autoencoder training is finished, we discard the decoder parts and cascade the encoders to form a stacked autoencoder network. For classification or regression tasks, a classifier at the end of the network is used to produce the final result. The classifier can be a PIL or one of its variants [20], a support vector machine (SVM), a multilayer neural network (MLP), or a radial basis function network, and so on. In this work, a Softmax classifier is used to obtain the final output.
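The layer-wise procedure described above can be sketched as follows. This is a simplified illustration under several assumptions that are not taken from this paper: the initial encoder is built from the top-p left singular vectors of the input (one possible low-rank projection), the activation is tanh, the number of layers is fixed in advance rather than determined automatically, and `beta` and `lam` are hypothetical values.

```python
import numpy as np

def train_pilae_layer(X, p, lam=1e-3, f=np.tanh):
    """Train one autoencoder layer without gradient descent; X is (d, N)."""
    # Assumption: low-rank initial encoder from the top-p left singular vectors.
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    W_e = U[:, :p].T                                  # shape (p, d)
    H = f(W_e @ X)                                    # Eq. (8)
    A = H @ H.T + lam * np.eye(p)
    W_d = np.linalg.solve(A, H @ X.T).T               # Eq. (11), shape (d, p)
    W_e = W_d.T                                       # tied weights: W_d = W_e^T
    return W_e, f(W_e @ X)

def train_stacked_pilae(X, n_layers=3, beta=0.9, lam=1e-3):
    """Cascade encoders; the decoders are discarded after each layer."""
    encoders = []
    H = X
    p = int(np.linalg.matrix_rank(X))                 # first layer: rank of the input
    for _ in range(n_layers):
        W_e, H = train_pilae_layer(H, p, lam)
        encoders.append(W_e)
        p = max(1, int(round(beta * H.shape[0])))     # p = beta * Dim(x)
    return encoders, H

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 100))                    # synthetic data, d = 20, N = 100
encoders, features = train_stacked_pilae(X)
print([W.shape for W in encoders], features.shape)
```

The returned `features` matrix would then be passed to the final classifier, for example a Softmax layer.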

4 Performance Evaluation

To evaluate the performance of the proposed method, we conduct experiments comparing it with other methods on benchmark data sets. In the experiments, the parameters of the Gabor filter are set as follows: the scale of the Gabor filter is 2, 4 and 6, respectively; the wavelength λ is π/2; the orientation θ takes 8 different values from 0 to 7π/8, with a difference of π/8 between two adjacent orientations; the standard deviation of the Gaussian function σ is 1.0; the spatial aspect ratio γ is 0.5; and the phase parameter φ is 0. We conduct experiments generating different numbers of Gabor feature maps, which are fused into a feature vector that is input to the following network layer. The MNIST and CIFAR-10 datasets are used in the experiments. All experiments are conducted on the same computer with a Core i7 3.20 GHz processor.

4.1 MNIST Dataset

In deep learning and pattern recognition, MNIST is one of the most widely used databases. MNIST is a handwritten digit recognition data set comprising 70,000 images of the handwritten digits 0-9, of which 60,000 are used as training samples and the remaining 10,000 as test samples. Each image in the dataset has 28 × 28 = 784 pixels.
Table 1 compares our method with other benchmark methods on the MNIST dataset. In our method, we use four Gabor feature maps and three layers of PILAE. The structure of PILAE is 705 - 635 - 571 in GF+PILAE. 5 and 10 convolution kernels are used in different layers of LeNet5. MLP uses one hidden layer with 300 hidden neurons. PILAE has one encoder layer in this experiment. We use 4 HOG (histogram of oriented gradient) feature maps in the HOG+PILAE method. From Table 1, we observe that our method obtains accuracy comparable to the other baseline methods, while training faster than LeNet5, MLP, SVM and HOG+PILAE.

Table 1. Performance comparison on the MNIST dataset.

Model        Training accuracy (%)   Testing accuracy (%)   Training time (s)
GF+PILAE     98.86                   98.42                  103.25
LeNet5       98.51                   98.49                  1270.8
PILAE        93.88                   93.78                  33.54
MLP          97.87                   97.80                  411.68
SVM          98.72                   96.46                  2593.28
HOG+PILAE    98.36                   98.02                  112.58
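The training-time ratios implied by Table 1 can be computed directly from its last column. The short sketch below only restates the table; the dictionary name is an assumption.

```python
# Training times in seconds, copied from Table 1.
train_time = {
    "GF+PILAE": 103.25,
    "LeNet5": 1270.8,
    "PILAE": 33.54,
    "MLP": 411.68,
    "SVM": 2593.28,
    "HOG+PILAE": 112.58,
}

# Ratio of each model's training time to that of GF+PILAE.
ratio = {model: t / train_time["GF+PILAE"] for model, t in train_time.items()}
for model, r in ratio.items():
    print(f"{model}: {r:.2f}x the training time of GF+PILAE")
```

A ratio above 1 means the model trains more slowly than GF+PILAE; the single-layer PILAE is the only entry below 1, at the cost of the lowest accuracy in the table.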

Fig. 3. (a) The accuracy with different numbers of Gabor feature maps for MNIST. (b) The training time with different numbers of Gabor feature maps for MNIST.

Fig. 3 (a) shows that the highest accuracy is obtained when the number of Gabor feature maps is four, and the accuracy decreases as the number continues to increase. A possible reason is that the more feature maps are used, the more noise is learned. From Fig. 3 (b) we observe that the training time increases with the number of Gabor feature maps: as the number of feature maps increases, the dimension of the network's input increases, so training takes longer. In the future, we will continue to study how to fuse the features to get better performance.

4.2 CIFAR-10 Dataset

The CIFAR-10 image database includes 60,000 32×32 color images. The images are divided into 10 categories, with 6,000 images in each category. The whole database is divided into five training batches and one test batch. Each batch contains 10,000 images, so there are 50,000 training images and 10,000 test images in total. We use different numbers of Gabor filters in this experiment, and the structure of PILAE is 3012 - 2955 - 2524.

Fig. 4. The accuracy with different numbers of Gabor feature maps for CIFAR-10.

Table 2. Performance comparison on the CIFAR-10 dataset.

Model        Training accuracy (%)   Testing accuracy (%)   Training time (s)
GF+PILAE     48.34                   47.02                  388.23
LeNet5       64.31                   63.02                  6743.82
PILAE        45.16                   44.08                  151.47
MLP          38.98                   38.32                  765.68
HOG+PILAE    46.57                   46.05                  436.58

Fig. 4 shows that the best test accuracy is obtained with 6 feature maps. As the number of feature maps increases from 1 to 6, the training accuracy and the test accuracy increase. When the number of feature maps is greater than 6, the accuracy decreases, but the change is small. We also ran experiments with other methods: MLP with one hidden layer of 2000 neurons, PILAE with 3 layers, HOG+PILAE with 4 HOG features, and LeNet5 with 20 and 50 kernels in different layers. The results are shown in Table 2. The results indicate that, for a comparable level of accuracy, our method is faster than the other methods. However, the accuracy of our method on CIFAR-10 is not satisfactory. The reason may be that color information is lost when the images are filtered by the Gabor filter. In the future, we will pay more attention to processing color images.

4.3 Discussion

From the experimental results presented in Tables 1 and 2, we can see that the training of our model is fast compared with the other models. In the proposed model, the Gabor filter is used to extract features, and the extracted feature is a kind of handcraft feature. Compared with features learned by a CNN, the Gabor feature is easier to obtain and less time-consuming. With Gabor features, the model can meet real-time processing requirements better than feature learning methods. On the other hand, the training process of PILAE is a completely forward propagation without iteration. The connecting weights of PILAE are computed directly with the pseudoinverse learning algorithm. In addition, the depth of the network increases dynamically, and the number of hidden layers in PILAE is data dependent. That is, a relatively simple data set generates a relatively simple network structure, while a complex data set requires a network architecture with more hidden layers to learn a better data representation and consequently reach good performance on the given task. In the training process, unlike gradient descent based learning algorithms, we need not set optimization-related hyper-parameters such as the learning rate, momentum, and number of epochs. Those hyper-parameters are difficult to tune without rich experience and professional knowledge.
Transfer learning based on a convolutional neural network (CNN) may be fast in the inference stage, but it cannot be faster in the training stage. As is well known, a CNN based network such as AlexNet is trained by a gradient descent based algorithm, which is iterative and also requires the learning hyper-parameters to be adjusted. So it is time-consuming in the training stage.
Our method performs well on the MNIST data set; however, it does not obtain good test accuracy on the CIFAR-10 data set. The reason is that we use only one color channel in the experiments, which loses the color information of the RGB images in CIFAR-10. In the future, we will design a more complicated network architecture to improve the classification accuracy on color images.

5 Conclusions

In this paper, we proposed a fast image recognition model that combines the Gabor filter and the pseudoinverse learning autoencoder. This model integrates the advantages of both the Gabor filter and PILAE. Gabor filters extract features from the input at different scales and orientations, and then the feature maps are sent to PILAE. The training of our model is fast, because it does not need back propagation or iterative optimization. Moreover, the number of layers is determined automatically, and we give a method to set the number of hidden neurons. We evaluate the performance of our network on benchmark datasets, namely MNIST and CIFAR-10. The results show that on classification tasks, our model performs better than other models, especially in learning speed. Because our model has no empirical parameters, it is easy to use even for people without professional knowledge. This is our effort to promote the development of automatic machine learning, and we expect it to help democratize artificial intelligence.

Acknowledgements

The research work described in this paper was fully supported by the grants from the National Natural Science Foundation of China (Project No. 61472043), the Joint Research Fund in Astronomy (U1531242) under cooperative agreement between the NSFC and CAS, and Natural Science Foundation of Shandong (ZR2015FL006). Prof. Ping Guo and Qian Yin are the authors to whom all correspondence should be addressed.

References

1. LeCun Y, Bottou L, Bengio Y, Haffner P: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), 2278–2324 (1998).
2. Krizhevsky A, Sutskever I, Hinton G. E: Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012).
3. He K, Zhang X, Ren S, Sun J: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016).
4. Simonyan K, Zisserman A: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
5. Lee T. S: Image Representation using 2D Gabor Wavelets. IEEE Trans. Pattern Analysis & Machine Intelligence 18(10), 959–971 (1996).
6. Wang K, Guo P, Yin Q, et al.: A pseudoinverse incremental algorithm for fast training deep neural networks with application to spectra pattern recognition. In: 2016 International Joint Conference on Neural Networks (IJCNN), pp. 3453–3460. IEEE (2016).
7. Feng S, Li S, Guo P, Yin Q: Image Recognition with Histogram of Oriented Gradient Feature and Pseudoinverse Learning AutoEncoders. In: 24th International Conference on Neural Information Processing (ICONIP 2017), pp. 740–749. Springer, Cham (2017).
8. Gabor D: Theory of communication. Journal of the Institution of Electrical Engineers - Part I: General 93(26), 429–441 (1946).
9. Daugman J. G: Two dimensional spectral analysis of cortical receptive field profiles. Vision Res 20(10), 847–856 (1980).
10. Jones J, Palmer L: An evaluation of the two-dimensional Gabor filter model of simple receptive fields in cat striate cortex. Journal of Neurophysiology 58(6), 1233–1258 (1987).
11. Kruizinga P, Petkov N: Nonlinear operator for oriented texture. IEEE Trans Image Process 8(10), 1395–1407 (1999).
12. Fazli S, Afrouzian R, Seyedarabi H: High-Performance Facial Expression Recognition Using Gabor Filter and Probabilistic Neural Network. In: 2009 IEEE International Conference on Intelligent Computing and Intelligent Systems, pp. 93–96 (2009).
13. Rumelhart D. E, Hinton G. E, Williams R. J: Learning representations by back-propagating errors. Nature 323(6088), 533–536 (1986).
14. Hinton G. E, Osindero S, Teh Y. W: A fast learning algorithm for deep belief nets. Neural Computation 18(7), 1527–1554 (2006).
15. Guo P, Chen P. C. L, Sun Y: An Exact Supervised Learning for a Three-Layer Supervised Neural Network. In: ICONIP'95, pp. 1041–1044 (1995).
16. Guo P, Lyu M. R, Mastorakis N. E: Pseudoinverse learning algorithm for feedforward neural networks. In: Advances in Neural Networks and Applications, pp. 321–326 (2001).
17. Guo P, Lyu M. R: A pseudoinverse learning algorithm for feedforward neural networks with stacked generalization applications to software reliability growth data. Neurocomputing 56, 101–121 (2004).
18. Wang K, Guo P, Xin X, Ye Z: Autoencoder, low rank approximation and pseudoinverse learning algorithm. In: 2017 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pp. 948–953. IEEE (2017).
19. Guo P, Lyu M, Chen P: Regularization parameter estimation for feedforward neural networks. IEEE Trans. System, Man and Cybernetics (B) 33(1), 35–44 (2003).
20. Guo P: A VEST of the Pseudoinverse Learning Algorithm. Preprint, arXiv:1805.07828 (2018).