<<

Multi-Layer Feature Fusion and Selection from Convolutional Neural Networks for Texture Classification

Hajer Fradi 1 Anis Fradi 2,3 and Jean-Luc Dugelay 1 1Digital Security Department, EURECOM, Sophia Antipolis, France 2University of Clermont Auvergne, CNRS LIMOS, Clermont-Ferrand, France 3 University of Monastir, Faculty of Sciences of Monastir, Tunisia {hajer.fradi, jean-luc.dugelay}@eurecom.fr, [email protected]

Keywords: Texture Classification, , Feature Selection, Discriminative Features, Convolutional Layers.

Abstract: Deep feature representation in Convolutional Neural Networks (CNN) can act as a set of feature extractors. However, since CNN architectures embed different representations at different abstraction levels, it is not trivial to choose the most relevant layers for a given classification task. For instance, for texture classification, low-level patterns and fine details from intermediate layers could be more relevant than high-level semantic information from top layers (commonly used for generic classification). In this paper, we address this problem by aggregating CNN activations from different convolutional layers and encoding them into a single feature vector after applying a pooling operation. The proposed approach also involves a feature selection step. This process is favorable for the classification accuracy since the influence of irrelevant features is minimized and the final dimension is reduced. The extracted and selected features from multiple layers can be further manageable by a classifier. The proposed approach is evaluated on three challenging datasets, and the results demonstrate the effectiveness of selecting and fusing multi-layer features for texture classification problem. Furthermore, by means of comparisons to other existing methods, we demonstrate that the proposed approach outperforms the state-of-the-art methods with a significant margin.

1 Introduction such as noise, scale, illumination, rotation, and trans- lation. Due to these variations, the texture of the same Deep networks have promoted the research in many objects appears visually different. Likewise, for dif- applications, specifically a great suc- ferent texture patterns, the difference might be subtle cess has recently been achieved in the field of im- and fine. All these factors besides the large number age classification using models. In this of texture classes make the problem of texture classi- context, extensive study has been conducted to ad- fication challenging (Bu et al., 2019; Almakady et al., dress the problem of large-scale and generic classi- 2020; Alkhatib and Hafiane, 2019). fication that spans a large number of classes and im- One of the key aspects of texture analysis is the ages (such as the case of ImageNet Large Scale Vi- features extraction step which aims at building a pow- sual Recognition Challenge) (Qu et al., 2016). Partic- erful texture representation that is useful for the clas- ularly, the problem of texture classification has been sification task. Texture, by definition, is a visual cue a long-standing research topic due to both its signifi- that provides useful information to recognize regions cant role in understanding the texture recognition pro- and objects of interest. It refers to the appearance, the cess and its importance in a wide range of applica- structure and the arrangement or the spatial organi- tions including medical imaging, document analysis, zation of a set of basic elements or primitives within object recognition, fingerprint recognition, and mate- the image. Since the extraction of powerful features rial classification (Liu et al., 2019). to encode the underlying texture structure is of great Even though texture classification is similar to interest to the success of the classification task, many other classification problems, it presents some distinct research works in this field focus on the texture repre- challenges since it has to deal with potential intraclass sentation (Lin and Maji, 2015; Liu et al., 2019). variations such as randomness and periodicity, in ad- This topic has been extensively studied through dition to external class variations in real-world images different texture features such as filter bank texton (Varma and Zisserman, 2005), Local Binary Pattern bine the extracted features from different layers and (LBP) (Wang et al., 2017; Alkhatib and Hafiane, to highlight the most relevant ones for the classifica- 2019), Bag-of-Words (BoWs) (Quan et al., 2014) and tion task. The contribution of this paper is three-fold: their variants e.g. Fisher Vector (FV) (Cimpoi et al., First, a novel approach for extracting and fusing fea- 2014; Yang Song et al., 2015) and VLAD (Jegou´ tures from multiple layers into a joint feature repre- et al., 2010). Early attempts to handle this prob- sentation after applying a pooling operation is pro- lem generally substitute these hand-engineered fea- posed. This feature representation is adopted to in- tures by deep learning models particularly based on corporate the full capability of the network in the fea- Convolutional Neural Network (CNN). According to ture learning process instead of truncating the CNN a recently published survey (Liu et al., 2019), CNN- pipeline at one specific layer (usually chosen empir- based methods for texture representation can be cate- ically). Second, since the fused features do not per- gorized into three categories: pre-trained CNN mod- form equally well and some of them are more rele- els, fine-tuned CNN models, and hand-crafted deep vant than others, the proposed approach incorporates convolutional networks. Such architectures mostly re- a feature selection step in order to enhance the dis- quire large-scale datasets because of the huge number criminative power of the feature representation. By of parameters that have to be trained. Since they can- alleviating the effect of high dimensional feature vec- not be conveniently trained on small datasets, transfer tor and by discarding the impact of irrelevant features, learning (Zheng et al., 2016) can be instead used as the overall classification rate is significantly improved an effective method to handle that. Typically, a CNN using three challenging datasets for texture classifi- model previously trained on a large external dataset cation. Third, by means of comparisons to existing for a given classification task can be fine-tuned for methods, the effectiveness of our proposed approach another classification task. Such pre-trained deep fea- is proven. The obtained results outperform the cur- tures take advantage of the large-scale training data. rent state-of-the-art methods, which proves the gener- Thus, they have shown good performance in many ap- alization ability and the representative capacity of the plications (Yang et al., 2015). learned and selected features from multiple layers. In both transfer and non-transfer classification The remainder of the paper is organized as fol- tasks, CNNs have been proven to be effective in lows: Related work to texture classification using automatically learning powerful feature representa- deep learning models is presented in Section 2. In tion that achieves superior results compared to hand- Section 3, our proposed approach based on feature crafted features (Schonberger et al., 2017). Although fusion and selection from multiple convolutional lay- this approach has shown promising results for differ- ers is presented. The proposed approach is evaluated ent classification tasks, some problems remain unad- using three challenging datasets, commonly used for dressed. For instance, features from the fully con- texture classification and experimental results are an- nected (FC) layers are usually used for classifica- alyzed in Section 4. Finally, we conclude and present tion. However, these FC features discard local in- some potential future works in Section 5. formation which is of significant interest in the clas- sification since it aims at finding local discrimina- tive cues. Major attempts to handle this problem 2 Related Work alternatively substitute FC with convolutional fea- tures, mostly extracted from one specific convolu- Deep learning models have recently shown good per- tional layer. But the exact choice of the convolutional formances in different applications, essentially for layer remains particularly unclear. Usually, high- image classification by means of CNNs. Particularly, level features from the last convolutional layer are ex- different architectures have been proposed for texture tracted since they are less dependent on the dataset classification (Liu et al., 2019). The existing methods compared to those extracted from lower layers. How- based on FC layers mainly include FC-CNN (Cim- ever, it has been shown in other applications such as poi et al., 2014), which is the top output from the last image retrieval (Jun et al., 2019; Tzelepi and Tefas, FC layer of VGG-16 architecture. Given the limita- 2018; Kordopatis-Zilos et al., 2017) that intermediate tions of FC features (discussed in the introduction), layers mostly perform better than last layers. major attempts in this field instead made use of con- Following the same strategy, we intend in this cur- volutional features. For instance, in (Cimpoi et al., rent paper to first investigate the representative power 2015; Cimpoi et al., 2016), the authors introduced FV- of different convolutional layers for exhibiting rel- CNN descriptor which is obtained using Fisher Vec- evant texture features. Then, fusion and selection tor to aggregate the convolutional features extracted methods are adopted in order to automatically com- from VGG-16. This descriptor is suitable for texture classification since it is orderless pooled. FV-CNN tures have been proposed so far, composed of many substantially improved the state-of-the-art in texture layers organized in a hierarchical way, where each and materials by obtaining better performances com- layer adds certain abstraction level to the overall fea- pared to FC-CNN descriptor, but it is not trained in an ture representation (Nanni et al., 2017). end-to-end manner. Deep feature representations in these networks Lin et al. in (Lin and Maji, 2015; Lin et al., 2015) can act as a set of feature extractors which have the proposed the bilinear CNN (B-CNN), which consists potential of being representative and generic enough in feeding the convolutional features to bilinear pool- with the increasing depth of layers. They encode from ing. It multiplies the outputs of two feature extrac- bottom to top layers, low-level to more semantic fea- tors using outer product at each location of the im- tures. Thus, it is not trivial to choose the most rel- age. This model is close to Fisher vectors but has the evant layers for a given classification problem. For advantage of the bilinear form that simplifies gradi- example, in generic classification tasks, features from ent computation. It also allows an end-to-end training top layers are commonly used because they capture of both networks using image labels only. Both pre- semantic information which is more relevant for this vious related works based on FV-CNN and B-CNN kind of applications. However, for texture classifica- made use of pre-trained VGG models as the base net- tion, low-level patterns and fine details from bottom work, but they further apply different encoding tech- or intermediate layers could be more important than niques which are FV and bilinear encoding. These high-level semantic information. descriptors are computed by the respective encoding For the aforementioned reasons, instead of em- technique from the last convolutional layer of the pre- pirically choosing one specific layer for the classifi- trained model, which shows better performances than cation task, we first attempt to learn different visual the results from the penultimate fully connected layer. representations by fusing convolutional features from Zhang et al. in (Zhang et al., 2017) proposed a multiple layers for complementary aspect, see section Deep Texture Encoding Network (Deep-TEN) with a 3.1. Then, we apply a feature selection process to novel Encoding Layer integrated on the top of con- encode multi-layer features into a single feature vec- volutional layers, which ports the entire dictionary tor by finding low-dimensional representation that en- learning and encoding pipeline into a single model. hances the overall classification performance, see sec- Different from other methods built from distinct com- tion 3.2. An overview of the proposed approach based ponents such as pre-trained CNN features, Deep-TEN on fusion and selection steps from multiple layers is provides an end-to-end learning framework, where shown in Fig. 1. the inherent visual vocabularies are directly learned from the loss function. Following the same scheme, 3.1 Multi-Layer Convolutional Feature Locally-transferred Fisher Vectors (LFV) which in- volve a multi-layer neural network model, have been Fusion proposed in (Song et al., 2017). It contains locally connected layers to transform the input FV descrip- Given a CNN model C of L convolutional layers tors with filters of locally shared weights. Finally, {L1,L2,...,LL}, an input image I is forward prop- in (Bu et al., 2019), convolutional layer activations agated through the network C, which results differ- are employed to constitute a powerful descriptor for ent numbers of feature maps at each layer. The gen- texture classification under an end-to-end learning erated feature maps are denoted by M1,M2,...,ML, n xm xc framework. Different from the previous works, a Mi ∈ i i i , where ni x mi is the dimension of locality-aware coding layer is designed with the local- channels and ci refers to the number of filters in Li ity constraint, where the dictionary and the encoding convolutional layer. These feature maps from differ- representation are learned simultaneously. ent layers have different dimensions, that are usually high. Since high dimensional feature vector increases the computation time of classification, pooling layers that follow convolutional layers are preferred to re- 3 Feature Extraction for Texture duce the dimensionality. Practically, from each layer Classification Li, the feature vector Fi is extracted by applying aver- age or maximum pooling operation on every channel Deep learning has reformed field of feature maps Mi to get a single value. Fi is defined and has brought a revolution to the computer vision for both cases (average and maximum pooling) as: community in recent years. Particularly, good perfor- ni mi mance has been achieved in image classification by 1 Fi(k) = ∑ ∑ Mi(p,q,k),k = 1,...,ci (1) means of CNNs. In this context, different architec- nixmi p=1 q=1 2 x Inception Module C 3 x Inception Module A 4 x Inception Module B

x3 x4

Convolution x2 AvgPool Feature maps MaxPool n i M M M ci 1, 2,….., L Concat mi Dropout Fully connected Softmax Average/Max Pooling

Feature vectors

F1, F2,….., FL

Convolutional Feature Fusion

Convolutional Feature Selection

Figure 1: The proposed approach for texture feature extraction based on convolutional features fusion and selection using Inception-v3 architecture for illustrative purpose, which can readily be replaced by other CNN architectures. Note that feature maps and feature vectors visualized in this figure are of different sizes.

Fi(k) = maxMi(:,:,k),k = 1,...,ci (2) Inception-v3 is a 48-layers deep convolutional ar- chitecture that first consists of few regular convo- From each convolutional layer Li, the resulting fea- lutional and max-pooling layers. The network fol- ture vector Fi is of size ci. As a result, a set of fea- lows by three blocks of Inception layers separated by ture vectors {F1,...,FL} is obtained to encode the in- two grid reduction modules. After that, the output put image I. These features from different layers are of the last Inception block is aggregated via global normalized using L2-norm before concatenating them average-pooling followed by a FC layer. Based on into a single feature vector. this architecture features can be extracted from multi- Our approach is applicable to various convolu- ple layers, precisely we choose them from ‘mixed0’ tional neural network architectures. Note that by in- to ‘mixed10’ layers, as shown in Fig. 1. Obvi- creasing the depth of layers, the network usually get ously, other CNN architectures can be readily em- better feature representation, but its complexity is in- ployed such as Inception-v4 and Inception-ResNet-v2 creased. For this reason, we choose to experiment (Szegedy et al., 2017), but their computational cost is Inception-v3 architecture (Szegedy et al., 2016) as a higher (55.8M). For shallower architectures, VGG-19 good compromise between the depth of layers and the (Simonyan and Zisserman, 2015) can be selected. complexity of network (23.8M parameters). For more details, Inception architecture was initially introduced 3.2 Multi-Layer Convolutional Feature by Szegedy et al. in (Szegedy et al., 2015). It is es- sentially based on using variable filter sizes to capture Selection different sizes of visual patterns. Specifically, incep- tion architecture includes 1 x 1 convolutions which Instead of using raw features from multiple layers to are placed before 3 x 3 and 5 x 5 convolutions to encode texture information, we propose to learn a dis- act as dimension reduction modules, which enables criminant subspace of the concatenated feature vector, increasing the depth of CNN without increasing the where samples of different texture patterns are opti- computational complexity. Following the same strat- mally separated. For instance, using Inception-v3 for egy, the representation size of Inception-v3 (Szegedy convolutional feature extractors and after applying av- et al., 2016) is slightly decreased from inputs to out- erage or maximum pooling operation, the final repre- puts by balancing the number of filters per layer and sentation from multiple layers is of size 10048. By the depth of the network without much loss in the rep- directly classifying such relatively high-dimensional resentation power. feature vector, apart the fact that the computation time increases, another problem that might incur is ran et al., 2010), Describable Textures Dataset (DTD) that the feature vector generally contains some com- (Cimpoi et al., 2014) and KTH-TIPS2 (Mallikarjuna ponents irrelevant to texture. Thus, the use of the et al., 2006). FMD is a well known dataset for texture whole feature vector without any feature selection classification that includes 10 different materiel cate- process could lead to unsatisfactory classification per- gories with 100 images in each class. It focuses on formance (Fradi et al., 2018). identifying materials such as plastic, wood and glass. For these reasons, after stacking features from During experiments, half of images are randomly se- multiple layers and before feeding them into a classi- lected for training and the other half for testing. KTH- fier, we propose the combination of Principle Compo- TIPS2 is a database of materials, and is an extension nent Analysis (PCA) and Linear Discriminant Analy- of KTH-TIPS database where TIPS refers to Textures sis (LDA) to find a low dimensional discriminative under varying Illumination, Pose and Scale. It con- subspace in which same-texture-pattern samples are tains images of 11 material classes, with 432 images projected close to each other while different texture- for each class. The images in each class are divided pattern samples are projected further apart. This pro- into four samples of different scales (each sample cess is favorable for the later texture classification step contains 108 images). Following the standard pro- since the influence of irrelevant feature components is tocol, one sample is used for training and the three minimized. It is important to mention that this com- remaining ones are used for testing. bination of PCA followed by LDA has been mainly DTD is another widely used dataset to assess tex- used in face recognition domain and is commonly re- ture classification algorithms, that contains 5640 im- ferred as “Fisherface”. But, to the best of our knowl- ages belonging to 47 texture classes, with 120 images edge, such feature selection method is applied for the for each class. This dataset is considered as the most first time on multi-layer convolutional features in or- challenging one since it has varying size of images der to enhance their discriminative power. and some of them are natural images. Following the Linear Discriminant Analysis is an efficient ap- evaluation protocol published with this dataset, 2/3 of proach of widely used in the images are used for training and the remaining 1/3 various classification problems. It basically aims at for testing. In Fig. 2, we show some sample images finding an optimized projection Wopt that projects of different texture classes from the three aforemen- D dimensional data vectors U into a d dimensional tioned datasets. space by: V = W U, in which the intra-class scatter opt The proposed approach presented in Section 3 (S ) is minimized while the inter-class scatter (S ) is W B is evaluated for texture classification. This multi- maximized. W is obtained according to the objec- opt classification problem is a 10-class, 47-class and 11- tive function: class for FMD, DTD and KTH-TIPS2 datasets, re- T W SBW spectively. For tests, the proposed feature represen- Wopt = argmax T = [w1,...,wg] (3) W W SWW tation from multiple layers is identified as one of the classes by multi-class SVM classifier following one- There are at most N − 1 non-zero generalized eigen- vs-one strategy using linear kernel. Following the values (N is the number of classes), thus g is upper- evaluation protocols published with the datasets, the bounded by N − 1. Since S is often singular, it is W experiments are repeated and the average top-1 iden- common to first apply PCA (Jolliffe, 2002) to reduce tification accuracy is reported. Also, cross-validation the dimension of the original vector. Once the di- to optimize PCA parameters within the training set is mensionality reduction process of PCA followed by adopted. LDA is applied on the concatenated feature vector from multiple convolutional layers, texture classifica- The obtained results of the proposed approach are tion is performed by adopting multi-class SVM fol- compared to the baseline method (Inception-v3) in lowing one-vs-one strategy. order to demonstrate the representative and the dis- criminative power of the proposed feature represen- tation. For feature selection, the proposed method is compared to two other methods: PCA and Neigh- 4 Experimental Results bourhood Components Analysis (NCA) (Yang et al., 2012). Furthermore, extensive comparative study is 4.1 Datasets and Experiments conducted in order to highlight the effectiveness of the proposed approach regarding the state-of-the-art The proposed approach is evaluated within three chal- methods for texture classification, namely, FC-CNN lenging datasets widely used for texture classifica- (Cimpoi et al., 2014) , FV-CNN (Cimpoi et al., 2015), tion, namely Flickr Material Database (FMD) (Sha- B-CNN (Lin and Maji, 2015), Deep-TEN (Zhang oterslso h aeieIcpinv architec- Inception-v3 baseline the of results results the obtained the to compare first we pro- approach, the jus- from posed To component each datasets. of effectiveness three the the tify on approach proposed our improve performance. to effective overall be the could paper this in lay- adopted auto- different ers from the strategy that selection justify and com- and fusion matic proposal, observations main These our with ply layers. FC those exceed from many layers obtained in convolutional 3, from Fig. results in com- shown cases By those to datasets. results three these the paring for 74.47% lay- and 81.80%, FC 70.03%, are for accuracies classification Likewise obtained the FMD, ers, respectively. for KTH, layers and re- ‘mixed5’ DTD best from and obtained The 79.46% ‘mixed7’, and layer. ‘mixed9’, 72.52%, last 82.8%, lay- the are intermediate than sults that clear better is perform it ers cases, always many not In is last performance true. of best assumption the give the would that layers shown results is obtained it the surprisingly, retained from Not is experiments. operator next the pooling in average reason datasets. this three For the on pooling maximum outperforms 3. Fig. in reported three the are on datasets ‘mixed10’) convolu- to ‘mixed0’ each (from layer for tional obtained accuracies The convo-classification ef- architecture. of Inception-v3 the performance using investigate the features on lutional to layers intend different first of fect we section, this In Analysis and Results 4.2 (Bu and 2017), al., 2019). et al., et (Song LFV-CNN each 2017), within al., variations the et show to datasets. class) KTH-TIPS2 each and cases. for DTD examples some image FMD, in (3 variations bottom: dataset class to each external for top low classes the From of and datasets: examples class 3 experimented right: three to the left from From images Sample 2: Figure KTH DTD FMD nFg ,w eottecasfiainrslsof results classification the report we 4, Fig. In significantly pooling average that notice we First, lmnu olCotton Wool Aliminium otdGi Blotchy Grid Dotted ea lsi Stone Plastic Metal hysgicnl upromhn-rfe methods that classification. texture 2019) hand-crafted for al., outperform et significantly (Liu survey recent they published demonstrated only recently been comparison a has in this it in since Ta- methods consider in We CNN-based datasets 2019) same 1. al., the ble 2017), et on (Bu classification al., and texture et 2017), for al., (Zhang et Deep-TEN (Song (Lin LFV-CNN 2015), B-CNN 2015), Maji, al., and et (Cimpoi al., FV-CNN et , (Cimpoi 2014) FC-CNN methods state-of-the-art the step. selection feature the applying by features enhanced of power is discriminative the abstrac- Also, different levels. at tion features capture better ca- to generalization pability higher three a has that the approach is for proposed behind the 5.55% reason and main The 3.9% base- respectively. 3%, datasets, the of regarding is approach enhance- method proposed line overall our The of NCA). ment and meth- selection PCA feature (namely other of ods results selec- the feature to without and results tion the to perfor- compared best mance the the achieves on features step convolutional method. selection fused baseline feature the proposed to the Moreover, compared datasets has on three accuracies layers the higher convolutional obtaining by multiple demonstrated been from results ing Multi-Layer to we refer figure, to this Features. MLCF Convolutional in abbreviation that the Note use NCA. and PCA methods, namely selection sub- feature we other by Besides PCA+LDA MLCF). stitute as (called selection performing the feature without to features results convolutional the fused compare raw we approach, our usefulness pro- in PCA+LDA the posed using process prove selection to feature the Also, of Softmax). (using ture ial,w ittecasfiainacrce of accuracies classification the list we Finally, fus- of effectiveness the figure, the in depicted As 100 100 100

90 90 90

80 80 80

70 70 70

60 60 60

Accuracy (%) Accuracy (%) Accuracy (%) Accuracy 50 50 50

40 Max pooling 40 Max pooling 40 Max pooling Avr pooling Avr pooling Avr pooling 30 30 30 1 2 3 4 5 6 7 8 9 10 11 1 2 3 4 5 6 7 8 9 10 11 1 2 3 4 5 6 7 8 9 10 11 Layers Layers Layers (a) FMD dataset (b) DTD dataset (c) KTH-TIPS2 dataset Figure 3: Results of different convolutional layers indexed from 1 (refers to ‘mixed0’) to 11 (refers to ‘mixed10’) from Inception-v3 model on FMD, DTD and KTH-TIPS2 datasets in terms of classification accuracy. Comparisons between maximum and average pooling operators are shown as well.

100 important (about 6% in the classification accuracy). Baseline (Inception-v3) MLCF One reason behind that could be the split of train/test 90 MLCF+PCA MLCF+NCA in this dataset in which only 25% of data is dedicated Ours (MLCF+PCA+LDA) for training. 80

70

Accuracy (\%) Accuracy 5 Conclusion

60 In this paper, we presented our proposed approach

50 for texture classification based on encoding feature FMD DTD KTH-TIPS2 vectors from multiple convolutional layers and fus- Figure 4: Comparisons of the proposed approach to: the ing them into a joint feature representation after ap- baseline architecture (Inception-v3 using Softmax), the original fused multi-layer convolutional features (MLCF), plying a feature selection step. From the obtained and other feature selection methods (PCA and NCA) on the results using Inception-v3 as baseline architecture, it three experimented datasets in terms of classification accu- has been demonstrated that the classification accu- racy. MLCF stands for Multi-Layer Convolutional Features. racy is significantly improved on three challenging datasets. Furthermore, the results demonstrate the rel- Method FMD DTD KTH-TIPS2 evance of the discriminant feature selection process. FC-CNN (Cimpoi et al., 2014) 69.3 59.5 71.1 Also, by means of comparisons to the state-of-the-art FV-CNN (Cimpoi et al., 2015) 80.8 73.6 77.9 methods better results are obtained. As perspectives, B-CNN (Lin and Maji, 2015) 81.6 72.9 77.9 the obtained results can be further enhanced using Deep-TEN (Zhang et al., 2017) 80.2 - 82.0 more performing networks. Also, other applications LFV-CNN (Song et al., 2017) 82.1 73.8 82.6 such as fine-grained classification and image retrieval (Bu et al., 2019) 82.4 71.1 76.9 could be investigated using our proposed multi-layer Ours 84.8 74.41 81.17 convolutional feature fusion and selection. Table 1: Comparisons of the proposed method to the state- of-the-art methods on the three datasets in terms of accu- racy. REFERENCES From these comparisons, our proposed approach has been experimentally validated showing more ac- Alkhatib, M. and Hafiane, A. (2019). Robust adaptive curate results against the recent state-of-the-art meth- median binary pattern for noisy texture classification ods for texture classification. Slightly lower re- and retrieval. IEEE Trans Image Process, (11):5407– 5418. sults are noticed on KTH-TIPS2 dataset compared to (Song et al., 2017; Zhang et al., 2017). This could Almakady, Y., Mahmoodi, S., Conway, J., and Bennett, M. (2020). Rotation invariant features based on three di- be explained by the fact that initial result from the mensional gaussian markov random fields for volu- baseline architecture is relatively uncompetitive (only texture classification. Computer Vision and Im- 75.62%), however the achieved enhancement is quite age Understanding, 194. Bu, X., Wu, Y., Gao, Z., and Jia, Y. (2019). Deep convo- 2014 IEEE Conference on Computer Vision and Pat- lutional network with locality and sparsity constraints tern Recognition, pages 160–167. for texture classification. , 91. Schonberger, J. L., Hardmeier, H., Sattler, T., and Pollefeys, Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., and M. (2017). Comparative evaluation of hand-crafted Vedaldi, A. (2014). Describing textures in the wild. and learned local features. In 2017 IEEE Conference In Proceedings of the 2014 IEEE Conference on Com- on Computer Vision and Pattern Recognition (CVPR), puter Vision and Pattern Recognition, pages 3606– pages 6959–6968. 3613. Sharan, L., Rosenholtz, R., and Adelson, E. H. (2010). Ma- Cimpoi, M., Maji, S., Kokkinos, I., and Vedaldi, A. terial perception: What can you see in a brief glance? (2016). Deep filter banks for texture recognition, de- volume 14. scription, and segmentation. Int. J. Comput. Vision, Simonyan, K. and Zisserman, A. (2015). Very deep convo- 118(1):6594. lutional networks for large-scale image recognition. in Cimpoi, M., Maji, S., and Vedaldi, A. (2015). Deep filter Proc. Int. Conf. Learn. Representations. banks for texture recognition and segmentation. In Song, Y., Zhang, F., Li, Q., Huang, H., ODonnell, L. J., The IEEE Conference on Computer Vision and Pattern and Cai, W. (2017). Locally-transferred fisher vectors Recognition (CVPR). for texture classification. In 2017 IEEE International Fradi, A., Samir, C., and Yao, A. (2018). Manifold-based Conference on Computer Vision (ICCV), pages 4922– inference for a supervised gaussian process classifier. 4930. In 2018 IEEE International Conference on Acoustics, Szegedy, C., Ioffe, S., Vanhoucke, V., and Alemi, A. A. Speech and Signal Processing (ICASSP), pages 4239– (2017). Inception-v4, inception-resnet and the impact 4243. of residual connections on learning. In Proceedings of Jegou,´ H., Douze, M., Schmid, C., and Perez,´ P. (2010). Ag- the Thirty-First AAAI Conference on Artificial Intelli- gregating local descriptors into a compact image rep- gence. AAAI Press. resentation. In CVPR 2010 - 23rd IEEE Conference on Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Computer Vision & Pattern Recognition, pages 3304– Anguelov, D., Erhan, D., Vanhoucke, V., and Rabi- 3311. novich, A. (2015). Going deeper with convolutions. Jolliffe, I. T. (2002). Principal component analysis. 2nd ed. In Computer Vision and Pattern Recognition (CVPR). New-York: Springer-Verlag. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Jun, H., Ko, B., Kim, Y., Kim, I., and Kim, J. (2019). Com- Z. (2016). Rethinking the inception architecture for bination of multiple global descriptors for image re- computer vision. 2016 IEEE Conference on Computer trieval. arXiv preprint arXiv:1903.10663. Vision and Pattern Recognition (CVPR), pages 2818– Kordopatis-Zilos, G., Papadopoulos, S., Patras, I., and 2826. Kompatsiaris, Y. (2017). Near-duplicate video re- Tzelepi, M. and Tefas, A. (2018). Deep convolutional learn- trieval by aggregating intermediate cnn layers. In Mul- ing for content based image retrieval. Neurocomput- tiMedia Modeling, pages 251–263. ing, 275:2467 – 2478. Lin, T.-Y. and Maji, S. (2015). Visualizing and understand- Varma, M. and Zisserman, A. (2005). A statistical approach ing deep texture representations. 2016 IEEE Con- to texture classification from single images. Interna- ference on Computer Vision and Pattern Recognition tional Journal of Computer Vision, 62(1–2):61–81. (CVPR), pages 2791–2799. Wang, K., Bichot, C.-E., Li, Y., and Li, B. (2017). Local Lin, T.-Y., RoyChowdhury, A., and Maji, S. (2015). Bilin- binary circumferential and radial derivative pattern for ear cnn models for fine-grained visual recognition. In texture classification. Pattern Recogn., 67(C). Proceedings of the 2015 IEEE International Confer- Yang, B., Yan, J., Lei, Z., and Li, S. Z. (2015). Convolu- ence on Computer Vision (ICCV). tional channel features for pedestrian, face and edge Liu, L., Chen, J., Fieguth, P. W., Zhao, G., Chellappa, R., detection. ICCV, abs/1504.07339. and Pietikainen, M. (2019). From bow to cnn: Two Yang, W., Wang, K., and Zuo, W. (2012). Neighborhood decades of texture representation for texture classi- component feature selection for high-dimensional fication. International Journal of Computer Vision, data. JCP, 7:161–168. 127(1):74–109. Yang Song, Weidong Cai, Qing Li, Fan Zhang, Feng, D. D., Mallikarjuna, P., Targhi, A. T., Fritz, M., Hayman, E., Ca- and Huang, H. (2015). Fusing subcategory probabil- puto, B., and Eklundh, J.-O. (2006). The kth-tips 2 ities for texture classification. In 2015 IEEE Con- database. ference on Computer Vision and Pattern Recognition Nanni, L., Ghidoni, S., and Brahnam, S. (2017). Hand- (CVPR), pages 4409–4417. crafted vs. non-handcrafted features for computer vi- Zhang, H., Xue, J., and Dana, K. (2017). Deep ten: Tex- sion classification. Pattern Recognition, 71:158 – 172. ture encoding network. In The IEEE Conference on Qu, Y., Lin, L., Shen, F., Lu, C. B., Wu, Y., Xie, Y., and Computer Vision and Pattern Recognition (CVPR). Tao, D. (2016). Joint hierarchical category structure Zheng, L., Zhao, Y., Wang, S., Wang, J., and Tian, Q. learning and large-scale image classification. IEEE (2016). Good practice in CNN feature transfer. CoRR, Transactions on Image Processing, 26:4331–4346. abs/1604.00133. Quan, Y., Xu, Y., Sun, Y., and Luo, Y. (2014). Lacunarity analysis on image patterns for texture classification. In