Evaluating Two-Stream CNN for Video Classification
Hao Ye, Zuxuan Wu, Rui-Wei Zhao, Xi Wang, Yu-Gang Jiang,∗ Xiangyang Xue
School of Computer Science, Shanghai Key Lab of Intelligent Information Processing, Fudan University, Shanghai, China
{haoye10, zxwu, rwzhao14, xwang10, ygj, xyxue}@fudan.edu.cn
∗Corresponding author.

ABSTRACT

Videos contain very rich semantic information. Traditional hand-crafted features are known to be inadequate in analyzing complex video semantics. Inspired by the huge success of deep learning methods in analyzing image, audio and text data, significant efforts have recently been devoted to the design of deep nets for video analytics. Among the many practical needs, classifying videos (or video clips) based on their major semantic categories (e.g., "skiing") is useful in many applications. In this paper, we conduct an in-depth study to investigate important implementation options that may affect the performance of deep nets on video classification. Our evaluations are conducted on top of a recent two-stream convolutional neural network (CNN) pipeline, which uses both static frames and motion optical flows, and which has demonstrated competitive performance against the state-of-the-art methods. In order to gain insights and to arrive at a practical guideline, many important options are studied, including network architectures, model fusion, learning parameters and the final prediction methods. Based on the evaluations, very competitive results are attained on two popular video classification benchmarks. We hope that the discussions and conclusions from this work can help researchers in related fields to quickly set up a good basis for further investigations along this very promising direction.

Categories and Subject Descriptors

I.5.2 [Pattern Recognition]: Design Methodology; H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing

General Terms

Algorithms, Measurement, Experimentation.

Keywords

Video Classification, Deep Learning, CNN, Evaluation.

1. INTRODUCTION

With the popularity of video recording devices and content sharing activities, there is a strong need for techniques that can automatically analyze video data at a huge scale. Video classification serves as a fundamental and essential step in analyzing video contents. For example, it would be extremely helpful if the massive number of consumer videos on the Web could be automatically classified into predefined categories. Learning semantics from complicated video contents is never an easy task, and methods based on traditional hand-crafted features and prediction models are known to be inadequate [1].

In recent years, deep learning based models have proved to be more competitive than traditional methods at solving complex learning problems in various domains. For example, deep neural networks (DNNs) have been successfully used for acoustic modeling in large vocabulary speech recognition [2]. Deep learning based methods have also been shown to be extremely powerful in the image domain. In 2012, Krizhevsky et al. were the first to use a completely end-to-end deep convolutional neural network (CNN) model to win the famous ImageNet Large Scale Visual Recognition Challenge (ILSVRC), outperforming all the traditional methods by a large margin [13]. In the 2014 edition of the ILSVRC, Szegedy et al. developed an improved and deeper version of the CNN, which further reduced the top-5 label error rate to just 6.7% over one thousand categories [30]. In the text domain, deep models have also been successfully used for sentence parsing, sentiment prediction and language translation [27, 18, 11, 29].

On video data, however, deep learning has often demonstrated worse results than the traditional techniques [8, 12]. This is largely due to the difficulty of modeling the unique characteristics of videos. On one hand, their spatial-temporal nature demands more complex network structures and perhaps also advanced learning methods. On the other hand, only a very limited amount of manually annotated training data is so far available in the video domain, which limits the progress of developing new methods, as neural networks normally require extensive training. Very recently, Simonyan et al. proposed the two-stream CNN, an effective approach that trains two CNNs on static frames and temporal motion separately [24]. The temporal motion is converted into successive optical flow images so that a conventional CNN designed for images can be directly deployed.
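To make the temporal stream concrete, the sketch below shows one way such a stacked optical flow input can be assembled. It is a minimal illustration rather than the exact procedure of [24]: the helper name `flow_stack`, the use of OpenCV's Farneback flow, and the stack length L = 10 are our own assumptions.

```python
# Minimal sketch (not the exact procedure of [24]): build the temporal-stream
# input by stacking horizontal and vertical optical flow over L frame pairs.
import cv2
import numpy as np

def flow_stack(gray_frames, L=10):
    """gray_frames: list of >= L+1 grayscale frames (H x W uint8 arrays).
    Returns a (2*L, H, W) float32 array: x- and y-flow per frame pair."""
    channels = []
    for prev, nxt in zip(gray_frames[:L], gray_frames[1:L + 1]):
        # Dense optical flow between consecutive frames; Farneback is used
        # here for simplicity (the flow method used in [24] differs).
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        channels.append(flow[..., 0])  # horizontal displacement field
        channels.append(flow[..., 1])  # vertical displacement field
    return np.stack(channels, axis=0).astype(np.float32)
```

The resulting 2L-channel array can then be fed to a standard image CNN whose first convolutional layer is configured to accept 2L input channels instead of 3.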
Although promising results were observed in [24], we underline that the performance of deep learning in video classification is subject to many implementation options, and there have been no in-depth and systematic investigations of these options in the field. In this paper, we conduct extensive experiments on two popular benchmarks to evaluate several important options, including not only network structures and learning parameters, but also model fusion, which combines results from different networks, and prediction strategies, which map network outputs to classification labels. The two-stream CNN approach is adopted as the basic pipeline for the evaluations in this work. By evaluating the implementation options, we intend to answer the question of which network settings and implementation options are likely to produce good video classification results. As implementing a deep learning based system for video classification is a very difficult task, we hope that the discussions and conclusions from this work are helpful for researchers in this field and can stimulate future studies.

The rest of this paper is organized as follows. Section 2 reviews related works. In Section 3, we briefly introduce the two-stream CNN approach. Section 4 discusses the evaluated implementation options and Section 5 reports and analyzes experimental results. Finally, Section 6 summarizes the findings of this work.

2. RELATED WORKS

Extensive studies have been conducted on video classification in the multimedia and computer vision communities. State-of-the-art video classification systems are usually built on top of multiple discriminative feature representations. To achieve better performance, various features have been developed. For instance, Laptev et al. extended the traditional SIFT features to obtain Space-Time Interest Points (STIP) by finding representative tubes in 3D space [15]. Wang et al. proposed the dense trajectory features, which densely sample local patches from each frame at different scales and then track them in a dense optical flow field over time [32]. Beyond the feature descriptors themselves, one can obtain further improvements by adopting advantageous feature encoding strategies like the Fisher Vector [19] or by utilizing fusion techniques [26, 34, 33] to integrate information from different features.

The aforementioned hand-crafted features like the dense trajectories have demonstrated state-of-the-art performance on many video classification tasks. However, these hand-crafted features are known to be inadequate in analyzing complex video semantics [1].

Most CNN-based approaches rely on the neural networks themselves to perform the final class label prediction, normally using a softmax layer [12, 24] or a linear layer [8]. Instead of direct prediction by the network, Jain et al. conducted action recognition using support vector machines (SVMs) with features extracted from off-the-shelf CNN models [6]. Their impressive results in the THUMOS action recognition challenge [9] indicate that CNN features are very powerful. In addition, a few works have attempted to combine CNN representations with Recurrent Neural Network (RNN) models to capture the temporal information in videos and perform classification within the same network. Donahue et al. leveraged the Long Short-Term Memory (LSTM) RNN model for action recognition [3], and Venugopalan et al. proposed to translate videos directly into sentences with the LSTM model by transferring knowledge from image description tasks [31]. The RNN shares the same motivation as the temporal pathway in the two-stream framework [24].
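The contrast between these two prediction routes, an end-to-end softmax versus SVMs trained on CNN features, can be made concrete with a short sketch. Here `net` is a hypothetical trained CNN assumed to return both softmax scores and penultimate-layer features as NumPy arrays, and the use of scikit-learn's LinearSVC is our choice for illustration, not necessarily that of the cited works.

```python
# Sketch of the two prediction strategies discussed above. `net` is a
# hypothetical trained CNN returning (softmax_scores, features) per batch.
import numpy as np
from sklearn.svm import LinearSVC

def predict_end_to_end(net, clips):
    # Option 1: let the network's softmax layer predict directly [12, 24].
    scores, _ = net(clips)
    return np.argmax(scores, axis=1)

def train_svm_on_cnn_features(net, train_clips, train_labels):
    # Option 2: treat the CNN as a fixed feature extractor and train an
    # SVM on its features, in the spirit of [6].
    _, feats = net(train_clips)
    svm = LinearSVC(C=1.0)
    svm.fit(feats, train_labels)
    return svm
```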
In this paper, we provide an in-depth study of various implementation choices for deep learning based video classification. On top of the two-stream CNN pipeline [24], we examine the performance of the spatial and temporal streams, separately and jointly, with different network architectures under various parameter settings. In addition, we examine the effect of different prediction options, including directly using a CNN with a softmax layer for end-to-end classification and adopting SVMs on the features learned by the CNNs. The RNN models are not considered, since they have not been fully explored and have only demonstrated limited performance gains in the context of video classification. This paper fills a gap in the existing work on deep learning for video classification, where most efforts have focused on designing a new classification pipeline or a deeper network structure without systematically evaluating and discussing the implementation options.

3. TWO-STREAM CNN

According to findings in neuroscience, the human visual system processes what we see through two kinds of streams, namely the ventral pathway and the dorsal pathway. The ventral pathway is responsible for processing spatial information, such as shape and color, while the dorsal pathway is responsible for processing motion information [14, 5].
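Mirroring these two pathways, the prediction path of the two-stream framework can be sketched as follows. This is a schematic illustration and not the exact formulation of [24]: `spatial_net` and `temporal_net` stand for hypothetical trained models mapping their inputs to per-class scores, and the fusion weight `w` is a free parameter (w = 0.5 corresponds to simple averaging).

```python
# Schematic two-stream prediction with late score fusion (illustrative only).
import numpy as np

def two_stream_predict(spatial_net, temporal_net, rgb_frame, flow_input, w=0.5):
    s_spatial = spatial_net(rgb_frame)      # class scores from a static frame
    s_temporal = temporal_net(flow_input)   # class scores from stacked flow
    fused = w * s_spatial + (1.0 - w) * s_temporal  # weighted late fusion
    return int(np.argmax(fused))
```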