Deep Spatial Pyramid Ensemble for Cultural Event Recognition

Xiu-Shen Wei, Bin-Bin Gao, Jianxin Wu∗
National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
{weixs,gaobb}@lamda.nju.edu.cn, [email protected]

Abstract

Semantic event recognition based only on image cues is a challenging problem in computer vision. In order to capture rich information and exploit important cues such as human poses, human garments and scene categories, we propose the Deep Spatial Pyramid Ensemble framework, which is mainly based on our previous work, the Deep Spatial Pyramid (DSP). DSP builds universal and powerful image representations from CNN models. Specifically, we employ five deep networks trained on different data sources to extract five corresponding DSP representations for event recognition images. To combine the complementary information of the different DSP representations, we ensemble these features by both "early fusion" and "late fusion". Finally, based on the proposed framework, we present a solution for the Cultural Event Recognition track at the ChaLearn Looking at People (LAP) challenge in association with ICCV 2015. Our framework achieved one of the best cultural event recognition performances in this challenge.

1. Introduction

Event recognition is one of the key tasks in computer vision. There has been much research on video-based event recognition and action recognition [13, 15, 17, 18]. However, event recognition from still images has received little attention in the past, and it is also a more challenging problem than video-based event recognition, because videos provide richer and more useful information (e.g., motions and trajectories) for understanding events, while images of events merely contain static appearance information.

Moreover, cultural event recognition is an important problem in event understanding. The goal of cultural event recognition is not only to find images with similar content, but further to find images that are semantically related to a particular type of event. Specifically, as shown in Fig. 1, images with very different visual appearances can indicate the same cultural event, while images containing the same object might come from different cultural events. In addition, it is crucial for cultural event recognition to exploit several important cues such as garments, human poses, objects and background at the same time.

In this paper, we propose the Deep Spatial Pyramid Ensemble framework for cultural event recognition, which is mainly based on our previous work, the Deep Spatial Pyramid (DSP) method [5]. This method builds universal image representations from CNN models, while adapting the universal representation to different image domains in different applications. DSP first extracts the deep convolutional activations of an input image of arbitrary resolution with a pre-trained CNN. These deep activations are then encoded into a new high-dimensional feature representation by overlaying a spatial pyramid partition. Additionally, in order to capture the important cues (e.g., human poses, objects and background) of cultural event recognition images, we employ two types of deep networks, i.e., VGG Nets [14] trained on ImageNet [12] and Place-CNN [23] trained on the Places database [23]. Meanwhile, we also fine-tune the VGG Nets on cultural event images [3]. After that, we utilize these deep networks trained on different data sources to extract different DSP representations for cultural event images. Finally, we ensemble the information from multiple deep networks via "early fusion" and "late fusion" to boost recognition performance.

In consequence, based on the proposed framework, we present a solution that ensembles five DSP deep convolutional networks for the Cultural Event Recognition track at the ChaLearn Looking at People (LAP) challenge in association with ICCV 2015. Our proposed framework achieved one of the best cultural event recognition performances in the Final Evaluation phase.

The rest of this paper is organized as follows. In Sec. 2, we present the proposed framework, and in particular introduce the key method, DSP. Implementation details and experimental results for the cultural event recognition competition are described in Sec. 3. Finally, we conclude and discuss future work in Sec. 4.

∗ This work was supported by the National Natural Science Foundation of China under Grant 61422203 and the Collaborative Innovation Center of Novel Software Technology and Industrialization.

Figure 1. Images randomly sampled from the 99 categories of the cultural event recognition dataset [3]. The dataset contains 99 important cultural events from all around the globe, including Gion matsuri (Japan), Harbin Ice and Snow Festival (China), Oktoberfest (Germany), Tapati rapa Nui (Chile) and so on.

2. The proposed framework

In this section, we introduce the proposed Deep Spatial Pyramid Ensemble framework, and especially the main approach used in this paper, the Deep Spatial Pyramid (DSP) [5].

Recently, thanks to the rich semantic information extracted by the convolutional layers of CNNs, deep convolutional-layer descriptors have demonstrated their value and been used successfully in [10, 2, 20]. Moreover, these deep descriptors contain more spatial information than the activations of the fully connected layers: e.g., the top-left cell's d-dimensional deep descriptor is generated using only the top-left part of the input image, ignoring other pixels. In addition, the fully connected layers have a large computational cost, because they contain roughly 90% of all the parameters of the whole CNN model. Thus, here we use fully convolutional networks, obtained by removing the fully connected layers, as feature extractors.

In the proposed framework, the first step feeds an input image of arbitrary resolution into a pre-trained CNN model to extract deep activations. Then, a visual dictionary with K dictionary items is trained on the deep descriptors from training images. The third step overlays a spatial pyramid partition on the deep activations of an image, producing m blocks in N pyramid levels. Each spatial block is represented as a vector using the improved Fisher Vector; thus, m blocks correspond to m FVs. In the fourth and fifth steps, we concatenate the m FVs to form a 2mdK-dimensional feature vector as the final image-level representation. These steps are shown as the key parts of our framework in Fig. 2. In addition, since cultural event recognition is highly related to two high-level computer vision problems, i.e., object recognition and scene recognition, we employ multiple pre-trained CNNs (e.g., VGG Nets [14] and Place-CNN [23]) to extract the DSP representations for each image in this competition, and then ensemble the complementary information from the multiple CNNs.

In the following, we first present some detailed factors of DSP, then introduce the deep spatial pyramid itself, and finally describe the ensemble strategy used in our framework for the cultural event recognition competition.

2.1. The ℓ2 matrix normalization in DSP

Let X = [x_1, ..., x_t, ..., x_T]^T (X ∈ R^{T×d}) be the matrix of d-dimensional deep descriptors extracted from an image I via a pre-trained CNN model. X is usually processed by dimensionality reduction methods such as PCA before the descriptors are pooled into a single vector using VLAD or FV [6, 21]. PCA is usually applied to SIFT features or fully connected layer activations, since it is empirically shown to improve the overall recognition performance. However, as studied in [5], PCA significantly hurts recognition when applied to the fully convolutional activations. Thus, it is not applied to the fully convolutional deep descriptors in this paper.

In addition, multiple types of deep descriptor normalization have been evaluated, and the ℓ2 matrix normalization applied before FV encoding is found to be important for better performance, cf. Table 2 in [5]. Therefore, we employ the ℓ2 matrix normalization for the cultural event recognition competition as follows:

    x_t ← x_t / ||X||_2 ,    (1)

where ||X||_2 is the matrix spectral norm, i.e., the largest singular value of X. This normalization has the benefit that it normalizes x_t using information from the entire image X, which makes it more robust to changes such as illumination and scale.
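As an illustration, the spectral-norm normalization of Eq. (1) takes only a few lines; the sketch below is a minimal example, assuming the T×d descriptor matrix has already been extracted from the last convolutional layer (the function and variable names are our own, not from the original implementation).

```python
import numpy as np

def l2_matrix_normalize(X):
    """Normalize every deep descriptor x_t by the spectral norm of X (Eq. 1).

    X: (T, d) array with one d-dimensional deep descriptor per row.
    Returns the normalized (T, d) matrix.
    """
    # Spectral norm = largest singular value of X.
    spectral_norm = np.linalg.norm(X, ord=2)
    return X / spectral_norm

# Toy usage: 49 descriptors (e.g., a 7x7 conv map) of dimension 512.
X = np.random.rand(49, 512)
X_normalized = l2_matrix_normalize(X)
```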

Figure 2. The classification framework. DSP feeds an input image of arbitrary resolution into a pre-trained CNN model to extract deep activations. A GMM visual dictionary is trained on the deep descriptors from training images. Then, a spatial pyramid partitions the deep activations of the image into m blocks in N pyramid levels. In this way, the activations of each block are represented as a single vector by the improved Fisher Vector. Finally, we concatenate the m single vectors to form a 2mdK-dimensional feature vector as the final image representation.

2.2. Encoding deep descriptors by FV

Because input images have arbitrary sizes, the size of the last pooling layer's activation map (and hence the number of deep descriptors) varies from image to image. However, the classifiers (e.g., SVM or soft-max) require fixed-length vectors. Thus, all the deep descriptors of an image must be pooled to form a single vector. Here, similarly to DSP, we also use the Fisher Vector (FV) to encode the deep descriptors.

We denote the parameters of the GMM with K Gaussian components by λ = {ω_k, μ_k, σ_k}, where ω_k, μ_k and σ_k are the mixture weight, mean vector and covariance matrix of the k-th Gaussian component, respectively. The covariance matrices are diagonal and σ_k^2 are the variance vectors. Let γ_t(k) be the soft-assignment weight of x_t with respect to the k-th Gaussian. The FV representations corresponding to μ_k and σ_k are given as follows [11]:

    f_{μ_k}(X) = 1/(T√ω_k) Σ_{t=1}^{T} γ_t(k) (x_t − μ_k)/σ_k ,    (2)

    f_{σ_k}(X) = 1/(T√(2ω_k)) Σ_{t=1}^{T} γ_t(k) [(x_t − μ_k)^2/σ_k^2 − 1] .    (3)

Note that f_{μ_k}(X) and f_{σ_k}(X) are both d-dimensional vectors. The final Fisher Vector f_λ(X) is the concatenation of the gradients f_{μ_k} and f_{σ_k} for all K Gaussian components. Thus, FV represents the set of deep descriptors X with a 2dK-dimensional vector. In addition, the Fisher Vector is improved by power-normalization with a factor of 0.5, followed by the ℓ2 vector normalization [11].

Moreover, as discussed in [5], a surprisingly small K value (e.g., 2, 3 or 4) achieves higher accuracy in the Fisher Vector than the normally used large K values. In our experiments on cultural event recognition, we fix the K value as 2.
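For concreteness, a minimal sketch of the FV encoding in Eqs. (2) and (3) is given below, assuming a diagonal-covariance GMM has already been fitted to training descriptors; the names and the pure-numpy implementation are our own illustrative choices, not the original code.

```python
import numpy as np

def fisher_vector(X, weights, means, variances):
    """Encode descriptors X (T, d) with a K-component diagonal GMM (Eqs. 2-3).

    weights: (K,), means: (K, d), variances: (K, d) -- GMM parameters lambda.
    Returns a 2*d*K-dimensional Fisher Vector (before power/l2 normalization).
    """
    T, d = X.shape
    K = weights.shape[0]
    sigma = np.sqrt(variances)                           # (K, d) std deviations

    # Soft-assignment weights gamma_t(k) from the Gaussian posteriors.
    diff = X[:, None, :] - means[None, :, :]             # (T, K, d)
    log_prob = -0.5 * np.sum(diff ** 2 / variances, axis=2) \
               - 0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)
    log_w = np.log(weights) + log_prob                   # (T, K)
    log_w -= log_w.max(axis=1, keepdims=True)            # numerical stability
    gamma = np.exp(log_w)
    gamma /= gamma.sum(axis=1, keepdims=True)            # (T, K)

    fv = []
    for k in range(K):
        g = gamma[:, k][:, None]                          # (T, 1)
        u = diff[:, k, :] / sigma[k]                      # (x_t - mu_k) / sigma_k
        f_mu = (g * u).sum(axis=0) / (T * np.sqrt(weights[k]))                    # Eq. (2)
        f_sigma = (g * (u ** 2 - 1)).sum(axis=0) / (T * np.sqrt(2 * weights[k]))  # Eq. (3)
        fv.extend([f_mu, f_sigma])
    return np.concatenate(fv)

def improve_fv(fv):
    """Improved FV: power normalization (factor 0.5) followed by l2 normalization."""
    fv = np.sign(fv) * np.sqrt(np.abs(fv))
    return fv / (np.linalg.norm(fv) + 1e-12)
```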

2.3. Deep spatial pyramid

The key part of DSP is adding spatial pyramid information in a much more natural and simple way. Adding a spatial pyramid [9] has been shown to significantly improve image recognition performance when using dense SIFT features. In SPP-net [7], a spatial pyramid pooling layer is added to deep nets, which has also improved recognition performance. However, in DSP, a more intuitive and natural way exists.

As previously discussed, one single cell (deep descriptor) in the last convolutional layer corresponds to one local image patch in the input image, and all cells of the last convolutional layer form a regular grid of image patches of the input image. This is a direct analogy to the dense SIFT feature extraction framework: instead of a regular grid of SIFT vectors extracted from 16 × 16 local image patches, a grid of deep descriptors is extracted from larger image patches by a CNN.

Thus, DSP can easily form a natural deep spatial pyramid by partitioning an image into local sub-regions and computing features inside each sub-region. In practice, we just need to spatially partition the cells of the last convolutional layer, and then pool the deep descriptors in each region separately using FV. The operation of DSP is illustrated in Fig. 3.

Figure 3. Illustration of the level 0 and level 1 deep spatial pyramid.
Level 0 simply aggregates all cells using FV. Level 1, however, splits the cells into 5 regions according to their spatial locations: the 4 quadrants and 1 centerpiece. Then, 5 FVs are generated from the activations inside each spatial region. Note that the level 1 spatial pyramid used in DSP is different from the classic one in [9]: it follows Wu and Rehg [19] in using an additional spatial region in the center of the image. A DSP using two levels then concatenates all 6 FVs from level 0 and level 1 to form the final image representation.

This DSP method is summarized in Algorithm 1.

Algorithm 1 The DSP pipeline
1: Input:
2:   An input image I
3:   A pre-trained CNN model
4: Procedure:
5:   Extract deep descriptors X from I using the pre-defined model, X = [x_1, ..., x_t, ..., x_T]^T
6:   For each activation vector x_t, perform ℓ2 matrix normalization x_t ← x_t / ||X||_2
7:   Estimate a GMM λ = {ω_k, μ_k, σ_k} using the training set
8:   Generate a spatial pyramid {X_1, ..., X_m} for X
9:   for all 1 ≤ i ≤ m
10:    f_λ(X_i) ← [f_{μ_1}(X_i), f_{σ_1}(X_i), ..., f_{μ_K}(X_i), f_{σ_K}(X_i)]
11:    f_λ(X_i) ← sign(f_λ(X_i)) · sqrt(|f_λ(X_i)|)
12:    f_λ(X_i) ← f_λ(X_i) / ||f_λ(X_i)||_2
13:  end for
14:  Concatenate f_λ(X_i), 1 ≤ i ≤ m, to form the final spatial pyramid representation f(X)
15:  f(X) ← f(X) / ||f(X)||_2
16: Output: f(X).
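To make the partition in Algorithm 1 concrete, the sketch below pools a conv-layer activation map with the two-level pyramid (level 0: whole map; level 1: four quadrants plus a center region). It is a minimal illustration under our own assumptions (region boundaries at the midpoints, a center block of half the height and width), reusing the l2_matrix_normalize, fisher_vector and improve_fv helpers sketched above.

```python
import numpy as np

def dsp_representation(act_map, gmm):
    """Deep Spatial Pyramid over a conv activation map (sketch of Algorithm 1).

    act_map: (H, W, d) activations of the last convolutional layer.
    gmm: (weights, means, variances) of a K-component diagonal GMM.
    Returns the 2*m*d*K-dimensional DSP feature (m = 6 regions here).
    """
    H, W, d = act_map.shape
    X = act_map.reshape(-1, d)
    norm_map = l2_matrix_normalize(X).reshape(H, W, d)   # Eq. (1), Sec. 2.1

    h2, w2 = H // 2, W // 2
    h4, w4 = H // 4, W // 4
    regions = [
        (0, H, 0, W),                                    # level 0: all cells
        (0, h2, 0, w2), (0, h2, w2, W),                  # level 1: 4 quadrants
        (h2, H, 0, w2), (h2, H, w2, W),
        (h4, h4 + h2, w4, w4 + w2),                      # level 1: center region [19]
    ]

    fvs = []
    for (r0, r1, c0, c1) in regions:
        Xi = norm_map[r0:r1, c0:c1, :].reshape(-1, d)
        fv = fisher_vector(Xi, *gmm)                     # Eqs. (2)-(3)
        fvs.append(improve_fv(fv))                       # power + l2 normalization
    f = np.concatenate(fvs)
    return f / (np.linalg.norm(f) + 1e-12)               # final l2 normalization
```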
2.4. Multi-scale DSP

In order to capture variations of the activations caused by variations of the objects in an image, we generate a multi-scale pyramid, extracted from S differently rescaled versions of the original input image. We feed the images of all different scales into a pre-trained CNN model and extract deep activations. At each scale, the corresponding rescaled image is encoded into a 2mdK-dimensional vector by DSP. Therefore, we have S vectors of 2mdK dimensions, and they are merged into a single vector by average pooling:

    f_m = (1/S) Σ_{s=1}^{S} f_s ,    (4)

where f_s is the DSP representation extracted from scale level s. Finally, ℓ2 normalization is applied to f_m. Note that each vector f_s is already ℓ2 normalized, as shown in Algorithm 1.

The multi-scale DSP is related to MPP proposed by Yoo et al. [21]. A key difference between our method and MPP is that f_s encodes spatial information while MPP does not. During the cultural event recognition competition, we found that a larger scale achieves better performance. Thus, we employ four scales, i.e., 1.4, 1.2, 1.0 and 0.8; the experimental results are shown in Sec. 3.3.

2.5. Ensemble of multiple DSPs

In the past several years, many successful deep CNN architectures have been shown to further improve CNN performance, characterized by deeper and wider architectures and smaller convolutional filters when compared to traditional CNNs such as [8, 22]. Examples of deeper nets include GoogLeNet [16], VGG Net-D and VGG Net-E [14]. Specifically, in order to exploit different types of information from cultural event images, we choose VGG Net-D and VGG Net-E for object recognition, and utilize the Place-CNN net [23] as the pre-trained deep network for scene recognition. VGG Net-D and VGG Net-E have similar architectures and parameters of convolutional and pooling filters; more details of these two deep networks can be found in [14]. In addition, to boost recognition performance, we also fine-tune VGG Net-D and VGG Net-E on the training and validation images/crops of the competition.

Therefore, for one image/crop, we can obtain five DSP representations extracted from the aforementioned five CNN models. Because these CNN models are trained on different types of images (i.e., object-centric images, scene-centric images and event-centric images), we ensemble the complementary information of the multiple CNN models by treating these DSP representations as multi-view data.

We denote the multi-scale DSP representation extracted from the i-th CNN model by f_m^i. After extracting these DSP representations, we concatenate all the features and apply ℓ2 normalization as follows:

    f_final ← [f_m^1, f_m^2, f_m^3, f_m^4, f_m^5] ,    (5)
    f_final ← f_final / ||f_final||_2 ,    (6)

which is called "early fusion" in this paper. Note that the dimensionality of the deep descriptors in the last convolutional layer is 512 for the VGG Nets and 256 for Place-CNN. Thus, under the aforementioned experimental settings, the DSP representations of the VGG Nets and Place-CNN are of 12,288 and 6,144 dimensions, respectively, and the final DSP representation of each image is a 55,296-dimensional vector.
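The multi-scale pooling of Eq. (4) and the early fusion of Eqs. (5)-(6) amount to simple averaging and concatenation. A minimal sketch is given below, where the scale factors follow the paper and dsp_representation is the helper sketched above; the cnn_forward callable (which is expected to rescale the image and return the conv activations) is our own placeholder for whichever CNN toolkit is used.

```python
import numpy as np

SCALES = [1.4, 1.2, 1.0, 0.8]   # rescaling factors used in the paper

def multi_scale_dsp(image, cnn_forward, gmm, scales=SCALES):
    """Average the DSP vectors of S rescaled versions of an image (Eq. 4)."""
    fs = []
    for s in scales:
        # cnn_forward is assumed to rescale `image` by factor s and return
        # the (H, W, d) activations of the last convolutional layer.
        act_map = cnn_forward(image, scale=s)
        fs.append(dsp_representation(act_map, gmm))
    f_m = np.mean(fs, axis=0)
    return f_m / (np.linalg.norm(f_m) + 1e-12)

def early_fusion(per_model_features):
    """Concatenate the multi-scale DSP features of all CNN models (Eqs. 5-6)."""
    f_final = np.concatenate(per_model_features)
    return f_final / (np.linalg.norm(f_final) + 1e-12)

# Usage sketch: five models -> five multi-scale DSP features -> one 55,296-dim vector.
# features = [multi_scale_dsp(img, fwd, gmm) for fwd, gmm in models]
# x = early_fusion(features)
```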

3. Experiments

In this section, we first describe the dataset of the cultural event recognition track at the ICCV ChaLearn LAP 2015 competition [3]. Then we give a detailed description of the implementation of the proposed framework. Finally, we present and analyze the experimental results of the proposed framework on the competition dataset.

3.1. Datasets and evaluation criteria

The cultural event recognition track at the ICCV ChaLearn LAP 2015 competition [3] is the second round of this track. Compared with the previous round, the task of ICCV ChaLearn LAP 2015 significantly increases the number of images and classes, adding a new "no-event" class. As a result, for this track more than 28,000 images are labeled to perform automatic cultural event recognition over 100 categories in still images. These images are collected from two image search engines (Google and Bing) and belong to 99 different cultural events plus one non-event class. This is the first dataset on cultural events from all around the globe. From these images, we see that several cues such as garments, human poses, objects and background can be exploited for recognizing the cultural events.

Figure 4. Distributions of the number of training images in Development and Final Evaluation. (a) and (c) are the original distributions of training images per class in Development and Final Evaluation, respectively. (b) and (d) are the distributions of training images after crops.

The dataset is divided into three parts: the training set (14,332 images), the validation set (5,704 images) and the evaluation set (8,669 images). During the Development phase, we train our model on the training set and verify its performance on the validation set. For the Final Evaluation phase, we merge the training and validation sets into a single set and re-train our model. The principal quantitative measure is the average precision (AP), which is calculated by numerical integration.
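As a side note, the average precision used for evaluation can be obtained by numerically integrating the precision-recall curve; the short sketch below shows one common way to do this (trapezoidal integration over recall, with a conventional anchor at recall 0), which is our own illustrative choice rather than the organizers' exact evaluation script.

```python
import numpy as np

def average_precision(scores, labels):
    """AP for one class via numerical integration of the precision-recall curve.

    scores: (N,) predicted confidences; labels: (N,) binary ground truth.
    """
    order = np.argsort(-scores)
    labels = np.asarray(labels)[order]
    tp = np.cumsum(labels)
    precision = tp / np.arange(1, len(labels) + 1)
    recall = tp / max(labels.sum(), 1)
    # Anchor the curve at (recall=0, precision=1) by convention,
    # then integrate precision over recall with the trapezoidal rule.
    recall = np.concatenate(([0.0], recall))
    precision = np.concatenate(([1.0], precision))
    return float(np.trapz(precision, recall))

# Example: AP over a handful of ranked predictions.
ap = average_precision(np.array([0.9, 0.8, 0.7, 0.6]), np.array([1, 0, 1, 0]))
```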

3.2. Implementation details

Before extracting the DSP representations, we examine the original distributions of the number of training images in both Development and Final Evaluation, shown in Fig. 4(a) and Fig. 4(c), respectively. From these figures, we can see that the "no-event" class is of large quantity and the original dataset is clearly class-imbalanced. To fix this problem, for each image of the other 99 cultural event classes, we extract three 384 × 384 crops, as illustrated in Fig. 5. Moreover, in order to keep the original semantic meaning of each image, we fix the location of each corresponding crop. In addition, we also take the horizontal reflection of the 99 cultural event classes' images. Therefore, the number of cultural event images/crops becomes 5 times the original one, which on one hand supplies diverse data sources, and on the other hand alleviates the class-imbalance problem. During the testing phase, because we do not know the classes of the testing images, all testing images are augmented by the aforementioned process.

Figure 5. Crops of the original images. Different from the random crops used in other deep networks (e.g., [8]), we fix the locations of these crops, which keeps the original semantic meaning of the cultural event images. If we took random crops, for example from the second original image of Carnevale di Viareggio, we might obtain a crop that only contains sky, which would hurt the cultural event recognition performance. These figures are best viewed in color.

After data augmentation, as mentioned above, we employ three popular deep CNNs as pre-trained models, including VGG Net-D, VGG Net-E [14] and Place-CNN [23]. In addition, we also fine-tune VGG Net-D and VGG Net-E on the images/crops of the competition. In consequence, we obtain five deep networks (i.e., VGG Net-D, VGG Net-E, fine-tuned VGG Net-D, fine-tuned VGG Net-E and Place-CNN) and use them to extract the corresponding DSP representations for each image/crop. Thus, each image of both training (except for the "no-event" class) and testing is represented by five DSP features/instances. As described in Sec. 2.5, we concatenate these DSP features and apply ℓ2 normalization to obtain the final representation for each image/crop. Finally, we feed these feature vectors into logistic regression [4] to build a classifier and use the softmax outputs as the prediction scores of the images/crops. The final scores of the testing images are then obtained by averaging the scores across their corresponding crops and horizontal reflections, which is called "late fusion", corresponding to the "early fusion" described in Sec. 2.5.
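The "late fusion" step above is a straightforward score average over an image's crops and reflections; a minimal sketch (our own naming, assuming a per-crop classifier already returns class-probability vectors) is shown below.

```python
import numpy as np

def late_fusion(crop_scores):
    """Average per-crop class scores into a single prediction for one test image.

    crop_scores: (C, num_classes) array of softmax scores, one row per
    crop/reflection of the same image.
    """
    return np.mean(crop_scores, axis=0)

# Usage sketch: the original image, its fixed crops and their reflections are
# each scored by the logistic-regression classifier, then averaged.
# final_scores = late_fusion(np.stack([clf.predict_proba(f)[0] for f in crop_feats]))
```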

3.3. Experimental results

In this section, we first present the experimental results of the Development phase and analyze the proposed framework. Then we show the Final Evaluation results of this cultural event recognition competition.

3.3.1 Development

In Table 1, we present the main results of the Development phase. As discussed in Sec. 2.4, the multiple scales (MS) strategy captures variation information, which boosts performance by about 1% mAP on VGG Net-D (0.761 → 0.770) and VGG Net-E (0.762 → 0.773). In addition, the late fusion approach is also effective: as the table shows, it improves mAP by more than 1% on the pre-trained VGG nets, and by 2% when the deep networks are fine-tuned on the cultural event images/crops of the competition. Because these deep networks are trained on different image sources, i.e., ImageNet [12], Places [23] and Cultural Event Recognition [3], they supply complementary information for each image of this competition. Thus, we perform "early fusion" by concatenating the DSP representations extracted from the five deep networks, and then obtain the final prediction score of each testing image in Development via "late fusion". The ensemble performance (0.841) significantly outperforms the single-network results.

In order to further investigate this complementarity, we visualize the feature maps of these five deep networks in Fig. 6. As shown in those figures, the strongest responses in the corresponding feature maps of these deep networks are quite different from each other, especially that of Place-CNN, i.e., Fig. 6(f). Apparently, different pre-trained deep networks trained on different data sources extract complementary information for each image in cultural event recognition.

3.3.2 Final evaluation

As mentioned above, in the Final Evaluation phase, we merge the training and validation sets into a single set and perform the same processes, i.e., data augmentation, fine-tuning, "early fusion" and "late fusion", etc. The final challenge results are shown in Table 2. Our final result (0.851) is only slightly lower (by 0.3%) than that of the team ranked 1st. To further improve the recognition performance of the proposed framework, a very simple and straightforward way is to apply the "bagging" approach [1] to the concatenated DSP representations of each image/crop, obtain the corresponding prediction scores for the testing images/crops, and, after several bagging rounds, average the results of the multiple baggings to obtain the final prediction scores. Moreover, more advanced ensemble methods can also easily be applied in our framework to achieve better performance.

Table 2. Comparison of our proposed framework with the top five teams in the Final Evaluation phase.

    Rank  Team          Score
    1     VIPL-ICT-CAS  0.854
    2     FV (Ours)     0.851
    3     MMLAB         0.847
    4     NU&C          0.824
    5     CVL ETHZ      0.798

4. Conclusion

Event recognition from still images is one of the challenging problems in computer vision. In order to exploit and capture important cues such as human poses, human garments and other context, this paper has proposed the Deep Spatial Pyramid Ensemble framework. Based on the proposed framework, we employ five deep CNN networks trained on different data sources and ensemble their complementary information. We utilized the proposed framework for the cultural event recognition track [3] at the ChaLearn LAP challenge in association with ICCV 2015, and achieved one of the best recognition performances in the Final Evaluation phase. In the future, we will introduce more advanced ensemble methods into our framework and incorporate more visual cues for event understanding.

Table 1. Recognition mAP comparisons in the Development phase. Note that "FT" stands for the fine-tuned deep networks, "SS" for single scale, and "MS" for multiple scales.

                 VGG Net-D  VGG Net-E  FT VGG Net-D  FT VGG Net-E  Place-CNN
    SS           0.761      0.762      –             –             –
    MS           0.770      0.773      0.779         0.769         0.640
    Late fusion  0.782      0.784      0.802         0.791         0.649
    Ensemble     0.841

Figure 6. Feature maps of an example image. (a) is the original image. For each feature map, we sum the response values over all depths of the final pooling layer of the corresponding deep network. (b) and (d) are the feature maps of the fine-tuned VGG Net-D and fine-tuned VGG Net-E, respectively. (c) and (e) are those of the pre-trained VGG nets. (f) is the feature map of Place-CNN. These figures are best viewed in color.

References

[1] L. Breiman. Bagging predictors. MLJ, 24:123–140, 1996.
[2] M. Cimpoi, S. Maji, and A. Vedaldi. Deep convolutional filter banks for texture recognition and segmentation. In CVPR, pages 3828–3836, 2015.
[3] S. Escalera, J. Fabian, P. Pardo, X. Baró, J. Gonzàlez, H. J. Escalante, and I. Guyon. ChaLearn 2015 apparent age and cultural event recognition: Datasets and results. In ICCV ChaLearn Looking at People workshop, 2015.
[4] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. JMLR, 9:1871–1874, 2008.
[5] B.-B. Gao, X.-S. Wei, J. Wu, and W. Lin. Deep spatial pyramid: The devil is once again in the details. arXiv:1504.05277, 2015.
[6] Y. Gong, L. Wang, R. Guo, and S. Lazebnik. Multi-scale orderless pooling of deep convolutional activation features. In ECCV, pages 392–407, 2014.
[7] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV, pages 346–361, 2014.
[8] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.
[9] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, pages 2169–2178, 2006.
[10] L. Liu, C. Shen, and A. van den Hengel. The treasure beneath convolutional layers: Cross-convolutional-layer pooling for image classification. In CVPR, pages 4749–4757, 2015.
[11] F. Perronnin, J. Sánchez, and T. Mensink. Improving the Fisher kernel for large-scale image classification. In ECCV, pages 143–156, 2010.
[12] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. IJCV, 2015.
[13] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, pages 568–576, 2014.
[14] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[15] C. Sun and R. Nevatia. ACTIVE: Activity concept transitions in video event classification. In ICCV, pages 913–920, 2013.
[16] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. arXiv:1409.4842, 2014.
[17] H. Wang and C. Schmid. Action recognition with improved trajectories. In ICCV, pages 3551–3558, 2013.
[18] L. Wang, Y. Qiao, and X. Tang. Action recognition with trajectory-pooled deep-convolutional descriptors. In CVPR, pages 1–10, 2015.
[19] J. Wu and J. M. Rehg. CENTRIST: A visual descriptor for scene categorization. IEEE TPAMI, 33(8):1489–1501, 2011.
[20] Z. Xu, Y. Yang, and A. G. Hauptmann. A discriminative CNN video representation for event detection. In CVPR, pages 1798–1807, 2015.
[21] D. Yoo, S. Park, J.-Y. Lee, and I. S. Kweon. Fisher kernel for deep neural activations. arXiv:1412.1628, 2014.
[22] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In ECCV, pages 818–833, 2014.
[23] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using places database. In NIPS, pages 487–495, 2014.
