
VideoLT: Large-scale Long-tailed Video Recognition

Xing Zhang1∗, Zuxuan Wu2,3∗, Zejia Weng2, Huazhu Fu4, Jingjing Chen2,3, Yu-Gang Jiang2,3†, Larry Davis5
1Academy for Engineering and Technology, Fudan University; 2Shanghai Key Lab of Intel. Info. Processing, School of Computer Science, Fudan University; 3Shanghai Collaborative Center on Intelligent Visual Computing; 4Inception Institute of Artificial Intelligence; 5University of Maryland
∗Equal contribution. †Corresponding author.

Abstract

Label distributions in the real world are oftentimes long-tailed and imbalanced, resulting in models biased towards dominant labels. While long-tailed recognition has been extensively studied for image classification tasks, limited effort has been made for the video domain. In this paper, we introduce VideoLT, a large-scale long-tailed video recognition dataset, as a step toward real-world video recognition. VideoLT contains 256,218 untrimmed videos, annotated into 1,004 classes with a long-tailed distribution. Through extensive studies, we demonstrate that state-of-the-art methods used for long-tailed image recognition do not perform well in the video domain due to the additional temporal dimension in videos. This motivates us to propose FrameStack, a simple yet effective method for long-tailed video recognition. In particular, FrameStack performs sampling at the frame level in order to balance class distributions, and the sampling ratio is dynamically determined using knowledge derived from the network during training. Experimental results demonstrate that FrameStack can improve classification performance without sacrificing the overall accuracy. Code and dataset are available at: https://github.com/17Skye17/VideoLT.

1. Introduction

Deep neural networks have achieved astounding success in a wide range of computer vision tasks like image classification [18, 19, 39, 41], object detection [14, 28, 34, 35], etc. Training these networks requires carefully curated datasets like ImageNet and COCO, where object classes are uniformly distributed. However, real-world data often have a long tail of categories with very few training samples, posing significant challenges for network training. This results in biased models that perform exceptionally well on head classes (categories with a large number of training samples) but poorly on tail classes that contain a limited number of samples (see Figure 1).

Figure 1. Long-tailed video recognition. General video recognition methods are overfitted on head classes, while long-tailed video recognition focuses on the performance of both head and tail classes, especially on tail classes. (Blue box is the head class region, red box is the region of medium and tail classes.)

Recently, there is a growing interest in learning from long-tailed data for image tasks [4, 10, 22, 29, 40, 42, 47, 49, 53]. Two popular directions to balance class distributions are re-sampling and re-weighting. Re-sampling methods [8, 11, 16, 22, 53] up-sample tail classes and down-sample head classes to acquire a balanced data distribution from the original data. On the other hand, re-weighting methods [4, 10, 27, 40, 47, 52] focus on designing weights to balance the loss functions of head and tail classes. While extensive studies have been done for long-tailed image classification tasks, limited effort has been made for video classification.

While it is appealing to directly generalize these methods from images to videos, it is also challenging since for classification tasks, videos are usually weakly labeled—only a single label is provided for a video sequence and only a small number of frames correspond to that label.

This makes it difficult to apply off-the-shelf re-weighting and re-sampling techniques since not all snippets¹ contain informative clues—some snippets directly relate to the target class while others might consist of background frames. As a result, using a fixed weight/sampling strategy for all snippets to balance label distributions is problematic. For long-tailed video recognition, we argue that balancing the distribution between head and tail classes should be performed at the frame level rather than at the video (sample) level—more frames in videos from tail classes should be sampled for training and vice versa. More importantly, frame sampling should be dynamic, based on the confidence of neural networks for different categories during the course of training. This helps prevent overfitting for head classes and underfitting for tail classes.

To this end, we introduce FrameStack, a simple yet effective approach for long-tailed video classification. FrameStack operates on video features and can be plugged into state-of-the-art video recognition models with minimal surgery. More specifically, given a top-notch classification model which preserves the time dimension of input snippets, we first compute a sequence of features as inputs of FrameStack, i.e., for an input video with T frames, we obtain T feature representations. To mitigate the long-tail problem, we define a temporal sampling ratio to select different numbers of frames from each video, conditioned on the recognition performance of the model for target classes. If the network is able to offer decent performance for the category to be classified, we then use fewer frames for videos in this class. On the contrary, we select more frames from a video if the network is uncertain about its class of interest. We instantiate the ratio using running average precision (AP) of each category computed on training data. The intuition is that AP is a dataset-wise metric, providing valuable information about the performance of the model on each category, and it is dynamic during training as a direct indicator of the progress achieved so far. Consequently, we can adaptively under-sample classes with high AP to prevent over-fitting and up-sample those with low AP.

However, this results in samples with different time dimensions, and such variable-length inputs are not parallel-friendly for current training pipelines. Motivated by recent data-augmentation techniques which blend two samples [43, 50] as virtual inputs, FrameStack performs temporal sampling on a pair of input videos and then concatenates the re-sampled frame features to form a new feature representation, which has the same temporal dimension as its inputs. The resulting features can then be readily used for final recognition. We also adjust the corresponding labels conditioned on the temporal sampling ratio.

Moreover, we also collect a large-scale long-tailed video recognition dataset, VideoLT, which consists of 256,218 videos with an average duration of 192 seconds. These videos are manually labeled into 1,004 classes to cover a wide range of daily activities. VideoLT has 47 head classes (#videos > 500), 617 medium classes (100 < #videos <= 500) and 340 tail classes (#videos <= 100), and thus naturally has a long tail of categories.

Our contributions are summarized as follows:

• We collect a new large-scale long-tailed video recognition dataset, VideoLT, which contains 256,218 videos that are manually annotated into 1,004 classes. To the best of our knowledge, this is the first "untrimmed" video recognition dataset which contains more than 1,000 manually defined classes.

• We propose FrameStack, a simple yet effective method for long-tailed video recognition. FrameStack uses a temporal sampling ratio derived from knowledge learned by networks to dynamically determine how many frames should be sampled.

• We conduct extensive experiments using popular long-tailed methods that are designed for image classification tasks, including re-weighting, re-sampling and data augmentation. We demonstrate that the existing long-tailed image methods are not suitable for long-tailed video recognition. By contrast, our FrameStack combines a pair of videos for classification, and achieves better performance compared to alternative methods. The dataset, code and results can be found at: https://github.com/17Skye17/VideoLT.

¹We use "snippet" to denote a stack of frames sampled from a video clip, which are typically used as inputs for video networks.
2. Related Work

2.1. Long-Tailed Image Recognition

Long-tailed image recognition has been extensively studied, and there are two popular strands of methods: re-weighting and re-sampling.

Re-Weighting. A straightforward idea of re-weighting is to use the inverse class frequency to weight loss functions in order to re-balance the contributions of each class to the final loss. However, the inverse class frequency usually results in poor performance on real-world data [10]. To mitigate this issue, Cui et al. [10] use carefully selected samples for each class to re-weight the loss. Cao et al. [4] propose a theoretically-principled label-distribution-aware margin loss and a new training schedule, DRW, that defers re-weighting during training. In contrast to these methods, EQL [40] demonstrates that tail classes receive more discouraging gradients during training, and ignoring these gradients will prevent the model from being influenced by them.
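To make the inverse-class-frequency idea concrete, the following is a minimal sketch of ours (not code from any of the cited methods), assuming logits and multi-hot targets of shape [B, C]; the function names are illustrative.

    import torch
    import torch.nn.functional as F

    def inverse_frequency_weights(class_counts):
        # class_counts: [C] number of training samples per class.
        counts = torch.as_tensor(class_counts, dtype=torch.float)
        # Inverse frequency, normalized so that weights average to 1 on a balanced set.
        return counts.sum() / (len(counts) * counts)

    def reweighted_bce(logits, targets, weights):
        # logits, targets: [B, C]; weights: [C] per-class weights broadcast over the batch.
        loss = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
        return (loss * weights).mean()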


For videos, re-weighting the loss is sub-optimal as the snippets used for training contain different amounts of informative clues related to the class of interest—assigning a large weight to a background snippet from tail classes will likely bring noise into training.

Re-Sampling. There are two popular types of re-sampling: over-sampling and under-sampling. Over-sampling [8, 16] typically repeats samples from tail classes and under-sampling [11] abandons samples from head classes. Recently, class frequency has been used for class-balanced sampling [22, 30, 36, 53]. BBN [53] points out that training a model in an end-to-end manner on long-tailed data can improve the discriminative power of classifiers but damages the representation learning of networks. Kang et al. [22] show that it is possible to achieve strong long-tailed recognition performance by only training the classifier. Motivated by these observations [22, 53], we decouple feature representation and classification for long-tailed video recognition. But unlike these standard re-sampling methods, we re-sample videos by concatenating frames from different video clips.

Mixup. Mixup [51] is a popular data augmentation method that linearly interpolates two samples at the pixel level, and their targets as well. There are several recent methods improving mixup from different perspectives. For example, Manifold Mixup [43] extends mixup from the input space to the feature space. CutMix [50] cuts out a salient image region and pastes it into another image. PuzzleMix [24] uses salient signals without removing the local properties of inputs. Remix [9] designs disentangled mixing factors to handle imbalanced distributions and improve the performance on minority classes. Recently, several studies show that mixup is also powerful when dealing with the long-tail problem [9, 53], because it brings higher robustness and smoother decision boundaries to models and it can reduce overfitting to head classes. Our approach is similar to mixup in that we also combine two videos and mix their labels. However, FrameStack operates on frame features along the temporal dimension, and more importantly, the mixing ratio in FrameStack is dynamic based on knowledge from the network model.

2.2. General & Long-tailed Video Recognition

Extensive studies have been made on video recognition with deep neural networks [7, 12, 13, 20, 26, 33, 45] or training methods [46] for video recognition. These approaches focus on learning better features for temporal modeling, by developing plugin modules [26, 45] or carefully designing end-to-end network structures [12, 13]. State-of-the-art video recognition models mainly experiment with general video recognition datasets to demonstrate their capacity in modeling long-term temporal relationships [20, 45] or capturing short-term motion dynamics [7, 12, 26, 33]. However, limited effort has been made for long-tailed video recognition due to the lack of proper benchmarks. Zhu and Yang [54] propose Inflated Episodic Memory to address long-tailed visual recognition in both images and videos. However, the use of memory banks in [54] is resource-demanding. FrameStack is computationally more efficient, using in-vitro knowledge from the network for resampling, without the need to attend to memory slots. Moreover, FrameStack is a plug-in data augmentation strategy that can be easily applied to model training.
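For reference, a minimal sketch of the feature-space mixup baseline discussed above and used for comparison in Sec. 5 (our own illustration, not code from [51]); it assumes per-frame feature tensors of shape [T, D] and multi-hot label vectors.

    import torch

    def mixup_features(feat_i, label_i, feat_j, label_j, alpha=0.2):
        # feat_*: [T, D] frame features; label_*: [C] multi-hot labels.
        # Interpolation coefficient drawn from a Beta distribution, as in mixup [51].
        lam = torch.distributions.Beta(alpha, alpha).sample().item()
        mixed_feat = lam * feat_i + (1.0 - lam) * feat_j      # frame-by-frame interpolation
        mixed_label = lam * label_i + (1.0 - lam) * label_j   # soft multi-label target
        return mixed_feat, mixed_label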
3. VideoLT Dataset

We now introduce VideoLT in detail, a large-scale benchmark designed for long-tailed video recognition.

In contrast to existing video datasets that focus on actions or activities [3, 7, 23, 25, 38], VideoLT is designed to be general and to cover a wide range of daily activities. We manually define a hierarchy with 13 top-level categories including: Animal, Art, Beauty and Fashion, Cooking, DIY, Education and Tech, Everyday life, Housework, Leisure and Tricks, , Nature, Sports and Travel. See Figure 2 for details. For each top-level class, we used ConceptNet to find sub-level categories. Finally, we selected 1,004 classes for annotation. To obtain a more diverse video dataset, we not only use the defined categories in the taxonomy, but also expand tags with the same semantics. Then we use these tags to search and crawl videos from YouTube. For each category, duplicate videos and some very long videos are removed, and the number of videos for every category is larger than 80. In order to ensure annotation quality, each video is labelled by three annotators and a majority vote is used to determine the final labels. See the supplemental materials for more details.

Figure 2. The taxonomy structure of VideoLT. There are 13 top-level entities and 48 sub-level entities; the children of sub-level entities are sampled. The full taxonomy structure can be found in the Supplementary materials.

VideoLT is split into a training set, a validation set and a test set using 70%, 10% and 20% of videos, respectively. To better evaluate approaches for long-tailed recognition, we define 47 head classes (#videos > 500), 617 medium classes (100 < #videos <= 500) and 340 tail classes (#videos <= 100). See the Supplementary for details.

Comparisons with existing video datasets. We visualize in Figure 3 the class frequency distribution of the training and validation sets from ActivityNet v1.3 [3], Charades [37], Kinetics-400 [7], Kinetics-600 [5], Kinetics-700 [6], FCVID [21], Something-something v1 [15], AVA [31] and VideoLT. VideoLT has superior linearity in the logarithmic coordinate system, which means the class frequency distribution of VideoLT is close to a long-tailed distribution. It is worth noting that YouTube-8M is a large-scale dataset with 3,862 classes and 6.8 million videos [1]. With so many categories, the dataset naturally has a long-tailed class distribution like ours. However, the classes in YouTube-8M are inferred by algorithms automatically rather than manually defined. Each video class has at least 200 samples for training, which is two times higher than ours. In addition, it does not provide head, medium and tail classes for better evaluation of long-tailed approaches.

Figure 3. Class frequency distribution of existing video datasets (ActivityNet, Charades, Kinetics-400/600/700, FCVID, Something-something, AVA) and VideoLT, plotted with the number of videos/frames on a logarithmic axis against the label index. VideoLT has superior linearity in the logarithmic coordinate system, which means the class frequency distribution of VideoLT is close to a long-tailed distribution.
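The head/medium/tail partition above can be derived directly from per-class video counts; a minimal illustrative sketch of ours (not the dataset toolkit), where the input is a flattened list of class ids over all (video, label) pairs:

    from collections import Counter

    def split_head_medium_tail(label_occurrences):
        # label_occurrences: list of class ids, one entry per (video, label) pair.
        counts = Counter(label_occurrences)
        head = [c for c, n in counts.items() if n > 500]
        medium = [c for c, n in counts.items() if 100 < n <= 500]
        tail = [c for c, n in counts.items() if n <= 100]
        return head, medium, tail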
4. FrameStack

We now introduce FrameStack, a simple yet effective approach for long-tailed video recognition. In image recognition tasks, input samples always correspond to their labels. However, for video recognition, snippets which might not contain informative clues, due to the weakly-labeled nature of video data, could also be sampled from video sequences for training. Popular techniques with a fixed re-sampling/re-weighting strategy for long-tailed image recognition are thus not applicable, since they will amplify noise in background snippets when calculating losses.

To mitigate imbalanced class distributions for video tasks, FrameStack re-samples training data at the frame level and adopts a dynamic sampling strategy based on knowledge learned by the network itself. The rationale behind FrameStack is to dynamically sample more frames from videos in tail classes and fewer frames from those in head classes. Instead of directly sampling raw RGB frames to balance label distributions, we operate in the feature space by using state-of-the-art models that are able to preserve the temporal dimension in videos [13, 26]². This allows FrameStack to be readily used as a plugin module for popular models to address the long-tail problem in video datasets without retraining the entire network.

²Most top-notch recognition models do not perform temporal downsampling till the end of networks.

More formally, we represent a video sequence with L frames as V = {f_1, f_2, ..., f_L}, and its labels as y. We then use a top-notch model (more details will be described in the experiment section) to compute features for V, and the resulting representations are denoted as V = {v_1, v_2, ..., v_L}. To determine how many frames should be selected from V for training a classifier, we compute a running Average Precision (rAP) during training to evaluate the performance of the network for each category on the entire dataset. For each mini-batch of training samples, we record their predictions and the groundtruth. After an epoch, we calculate the AP for each class on the training set. We refer to this metric as running AP since the parameters of the model are changing every mini-batch. While it is not as accurate as standard average precision, it provides a relative measurement of how the model performs for different classes. If the model is very confident for certain categories as suggested by rAP, we simply use fewer frames, and vice versa.
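A minimal sketch of the running-AP bookkeeping described above (our own illustration, not the released code); it assumes sigmoid scores and multi-hot targets accumulated as NumPy arrays, and uses scikit-learn's average_precision_score.

    import numpy as np
    from sklearn.metrics import average_precision_score

    class RunningAP:
        """Accumulates per-batch scores/targets and computes per-class AP once per epoch."""
        def __init__(self, num_classes):
            self.num_classes = num_classes
            self.scores, self.targets = [], []
            self.rap = np.zeros(num_classes)  # rAP is zero before the first epoch finishes

        def update(self, batch_scores, batch_targets):
            # batch_scores, batch_targets: [B, num_classes] arrays (e.g., detached sigmoid outputs).
            self.scores.append(batch_scores)
            self.targets.append(batch_targets)

        def end_epoch(self):
            scores = np.concatenate(self.scores)
            targets = np.concatenate(self.targets)
            for c in range(self.num_classes):
                if targets[:, c].sum() > 0:  # AP is undefined for classes unseen this epoch
                    self.rap[c] = average_precision_score(targets[:, c], scores[:, c])
            self.scores, self.targets = [], []
            return self.rap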
However, this creates variable-length inputs for different samples in a batch, which is not parallel-friendly for current GPU architectures. In addition, it is difficult to directly translate rAP into the number of frames to be used. To address this issue, FrameStack operates on a pair of video samples, (V_i, y_i) and (V_j, y_j), which are randomly selected in a batch. Based on their ground-truth labels, we can obtain the corresponding rAPs, rAP_i and rAP_j, for classes y_i and y_j, respectively. We then define a temporal sampling ratio as:

    β = rAP_i / (rAP_i + rAP_j),    (1)

where β indicates the relative performance of classes y_i and y_j by the network so far. Then the numbers of frames that are sampled from V_i and V_j are L_i and L_j, respectively:

    L_i = ⌊(1 − β) × L⌋,  L_j = ⌊β × L⌋.    (2)

We then produce two new snippets V̂_i and V̂_j with lengths L_i and L_j, derived from V_i and V_j respectively through uniform sampling. By concatenating V̂_i and V̂_j, we obtain a new sample Ṽ whose length is L:

    Ṽ = Concat([V̂_i ; V̂_j]).    (3)

Now Ṽ becomes a multi-label snippet containing the categories of y_i and y_j. We associate Ṽ with a multi-label vector scaled by β:

    ỹ = (1 − β) × y_i + β × y_j.    (4)

Ṽ and ỹ can then be used by temporal aggregation modules for classification. Note that at the beginning of the training process, recognition accuracies are pretty low for all categories, and thus β is not accurate. To remedy this, when (rAP_i + rAP_j) < 1e−5, we set β to 0.5 to sample half of the frames from each of the two videos. Algorithm 1 summarizes the overall training process.

Algorithm 1: Pseudo code of FrameStack.
    Input: Dataset D = {(V_i, y_i)}_{i=1}^n. Model f_θ
    Result: Updated rAP list. Updated model f_θ
    Initialize rAP = 0, ε = 1e−5
    /* M: videos in a mini-batch */
    for e ∈ MaxEpoch do
        for (V, y) ∈ M do
            (V_i, y_i), (V_j, y_j) ← Sampler(D, M)
            if (rAP_i + rAP_j) < ε then
                β = 0.5
            else
                β = rAP_i / (rAP_i + rAP_j)
            L_i ← ⌊(1 − β) × L⌋
            L_j ← ⌊β × L⌋
            /* V̂_i, V̂_j: uniformly sample L_i, L_j frames from V_i, V_j */
            V̂_i ← Uniform(V_i[L_i])
            V̂_j ← Uniform(V_j[L_j])
            Ṽ ← Concat[V̂_i, V̂_j]
            ỹ ← (1 − β) × y_i + β × y_j
        end
        L(f_θ) ← (1/|M|) Σ_{(Ṽ, ỹ)∈M} L((Ṽ, ỹ); f_θ)
        f_θ ← f_θ − δ ∇_θ L(f_θ)
        /* rAP: a list of running average precision for each class */
        rAP ← APCalculator
    end
    return rAP

It is worth pointing out that FrameStack shares a similar spirit with mixup [51], which interpolates two samples linearly as a data augmentation to regularize network training. Here, instead of mixing frames, we concatenate sampled video clips with different time steps to address the long-tailed video recognition problem. As will be shown in the experiments, FrameStack outperforms mixup by clear margins in the context of video classification. FrameStack can be regarded as a class-level re-balancing strategy based on the average precision of each class; we also use focal loss [27], which adjusts binary cross-entropy based on sample predictions.
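Putting Eqs. (1)–(4) together, here is a minimal PyTorch-style sketch of the pairwise combination (our own illustration; the function and variable names are ours, and the released code may differ). It assumes [L, D] feature tensors, multi-hot labels, and scalar running APs per sample.

    import torch

    def frame_stack_pair(feat_i, label_i, rap_i, feat_j, label_j, rap_j, eps=1e-5):
        # feat_*: [L, D] frame features; label_*: [C] multi-hot labels; rap_*: scalar running APs.
        L = feat_i.shape[0]
        beta = 0.5 if (rap_i + rap_j) < eps else rap_i / (rap_i + rap_j)   # Eq. (1)
        L_i = int((1.0 - beta) * L)                                        # Eq. (2)
        L_j = L - L_i   # keeps the concatenated length exactly L (the paper floors both terms)
        # Uniformly sample L_i / L_j frame indices from each clip.
        idx_i = torch.linspace(0, feat_i.shape[0] - 1, steps=L_i).long() if L_i > 0 else torch.empty(0, dtype=torch.long)
        idx_j = torch.linspace(0, feat_j.shape[0] - 1, steps=L_j).long() if L_j > 0 else torch.empty(0, dtype=torch.long)
        mixed_feat = torch.cat([feat_i[idx_i], feat_j[idx_j]], dim=0)      # Eq. (3)
        mixed_label = (1.0 - beta) * label_i + beta * label_j              # Eq. (4)
        return mixed_feat, mixed_label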
5. Experiments

5.1. Settings

Implementation Details. During training, we set the initial learning rate of the Adam optimizer to 0.0001 and decrease it every 30 epochs; we train for a maximum of 100 epochs by randomly sampling 60 frames as inputs, and the batch size is set to 128. At test time, 150 frames are uniformly sampled from the raw features. For FrameStack, we use a mix ratio η to control how many samples are mixed in a mini-batch, and set η to 0.5. In addition, the clip length L of FrameStack is set to 60.

Backbone networks. To validate the generalization of our method for long-tailed video recognition, we follow the experimental settings of Decoupling [22]. We use two popular backbones to extract features: ResNet-101 [18] pretrained on ImageNet and ResNet-50 [18] pretrained on ImageNet. We also experiment with TSM [26] using a ResNet-50 [26] as its backbone, with the model pretrained on Kinetics-400. We take the features from the penultimate layer of the network, resulting in features with a dimension of 2048.

We decode all videos at 1 fps, then resize frames to the size of 256 and crop their centers as inputs; these frames are uniformly sampled to construct a sequence with a length of 150.

Note that we do not finetune the networks on VideoLT due to computational limitations. Additionally, we hope FrameStack can serve as a plugin module to existing backbones with minimal surgery. Further, given features from videos, we mainly use a non-linear classifier with two fully-connected layers to aggregate them temporally. To demonstrate the effectiveness of the features, we also experiment with NetVLAD for feature encoding [2], with 64 clusters and a hidden size of 1024.

Evaluation Metrics. To better understand the performance of different methods under a long-tailed distribution, we calculate mean average precision for head, medium and tail classes in addition to dataset-wise mAP, Acc@1 and Acc@5. Long-tailed video recognition requires algorithms to obtain good performance on tail classes without sacrificing overall performance, which is a new challenge for existing video recognition models.
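The head/medium/tail mAP splits can be computed directly from per-class APs once the groups are fixed; a sketch of ours (names are illustrative), using the thresholds defined in Sec. 3:

    import numpy as np

    def grouped_map(per_class_ap, class_counts):
        # per_class_ap: [C] average precision per class; class_counts: [C] training videos per class.
        per_class_ap = np.asarray(per_class_ap)
        class_counts = np.asarray(class_counts)
        head = class_counts > 500
        medium = (class_counts > 100) & (class_counts <= 500)
        tail = class_counts <= 100
        return {
            "overall": per_class_ap.mean(),
            "head": per_class_ap[head].mean(),
            "medium": per_class_ap[medium].mean(),
            "tail": per_class_ap[tail].mean(),
        }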

Table 1. Results and comparisons of different approaches for long-tailed recognition using features extracted from ResNet-50 and ResNet-101 (aggregated with a non-linear model); our method FrameStack outperforms other long-tailed methods designed for image classification by clear margins. Columns per backbone: Overall mAP / Head [500,+∞) / Medium [100,500) / Tail [0,100) / Acc@1 / Acc@5.

LT Methods | ResNet-50: Overall / Head / Medium / Tail / Acc@1 / Acc@5 | ResNet-101: Overall / Head / Medium / Tail / Acc@1 / Acc@5
baseline | 0.499 / 0.675 / 0.553 / 0.376 / 0.650 / 0.828 | 0.516 / 0.687 / 0.568 / 0.396 / 0.663 / 0.837
LDAM + DRW | 0.502 / 0.680 / 0.557 / 0.378 / 0.656 / 0.811 | 0.518 / 0.687 / 0.572 / 0.397 / 0.664 / 0.820
EQL | 0.502 / 0.679 / 0.557 / 0.378 / 0.653 / 0.829 | 0.518 / 0.690 / 0.571 / 0.398 / 0.664 / 0.838
CBS | 0.491 / 0.649 / 0.545 / 0.371 / 0.640 / 0.820 | 0.507 / 0.660 / 0.559 / 0.390 / 0.652 / 0.828
CB Loss | 0.495 / 0.653 / 0.546 / 0.381 / 0.643 / 0.823 | 0.511 / 0.665 / 0.561 / 0.398 / 0.656 / 0.832
mixup | 0.484 / 0.649 / 0.535 / 0.368 / 0.633 / 0.818 | 0.495 / 0.660 / 0.546 / 0.381 / 0.641 / 0.824
Ours | 0.516 / 0.683 / 0.569 / 0.397 / 0.658 / 0.834 | 0.532 / 0.695 / 0.584 / 0.417 / 0.667 / 0.843

5.2. Results

Comparisons with SOTA methods. We compare FrameStack with three kinds of long-tailed methods that are widely used for image recognition tasks (see details and other extensions, including square-root sampling and two-stage methods, in the Supplementary material):

• Re-sampling: We implement class-balanced sampling (CBS) [22, 36], which uses equalized sampling strategies for data from different classes. In a mini-batch, it takes a random class and then randomly samples a video, so videos from head and tail classes share the same probability of being chosen (a minimal sampler sketch is given below).

• Re-weighting: It takes the sampling frequency of each class into consideration to calculate weights for cross-entropy or binary cross-entropy. We conduct experiments with Class-balanced Loss [10], LDAM Loss [4] and EQL [40].

• Data augmentation: We use the popular method Mixup [51] for comparisons. For fair comparison, mixup is performed in the feature space, as FrameStack is. In particular, mixup mixes features from two videos in a mini-batch by summing their features frame-by-frame.

Table 1 summarizes the results and comparisons of different approaches on VideoLT. We observe that the performance on tail classes is significantly worse than that on head classes using both ResNet-50 and ResNet-101 for all methods. This highlights the challenge for long-tailed video recognition algorithms. In addition, we can see that popular long-tailed recognition algorithms for image classification tasks are not suitable for video recognition. Class-balanced sampling and class-balanced losses result in a slight performance drop compared to the baseline model without any re-weighting or re-sampling strategies; LDAM+DRW and EQL achieve comparable performance on overall classes and tail classes. For mixup, the performance is even worse than the baseline model, possibly because mixing features makes training difficult. Instead, our approach achieves better results compared with these methods. In particular, when using ResNet-101 features, FrameStack achieves an overall mAP of 53.2%, which is 1.6% and 1.4% better than the baseline and the best-performing image-based methods (i.e., LDAM+DRW and EQL). Furthermore, we can observe that although CB Loss brings slightly better performance on tail classes, this comes at the cost of a performance drop on overall classes. Compared to CB Loss, FrameStack significantly improves the performance by 2.1% for tail classes without sacrificing the overall results.
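For concreteness, the class-balanced sampling baseline in the first bullet can be sketched as follows (our own illustration, assuming a single class id per video for simplicity):

    import random
    from collections import defaultdict

    def class_balanced_batch(dataset, batch_size):
        # dataset: list of (video_id, class_id) pairs; names are illustrative.
        by_class = defaultdict(list)
        for video_id, class_id in dataset:
            by_class[class_id].append(video_id)
        classes = list(by_class.keys())
        batch = []
        for _ in range(batch_size):
            c = random.choice(classes)                 # every class is equally likely...
            batch.append(random.choice(by_class[c]))   # ...then pick a random video of that class
        return batch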
Extensions with more powerful backbones. We also experiment with a TSM model using a ResNet-50 as its backbone to demonstrate the compatibility of our approach with more powerful networks designed for video recognition. In addition, we use two feature aggregation methods to derive a unified representation for classification. The results are summarized in Table 2. We observe similar trends as in Table 1 using a nonlinear model—FrameStack outperforms image-based long-tailed algorithms by 1.5% and 2.3% for overall and tail classes, respectively. In addition, we can see that features from the TSM model pretrained on Kinetics are better than image-pretrained features (58.0% vs. 53.2%). Furthermore, our approach is also compatible with more advanced feature aggregation strategies like NetVLAD. More specifically, with NetVLAD, FrameStack outperforms the baseline approach by 0.7% and 1.2% for overall classes and tail classes, respectively.

Table 2. Results and comparisons using TSM (ResNet-50). Top: features aggregated using a non-linear model; Bottom: features aggregated using NetVLAD. Columns: Overall / Head [500,+∞) / Medium [100,500) / Tail [0,100).

Nonlinear Model
LT Methods | Overall | Head | Medium | Tail
baseline | 0.565 | 0.757 | 0.620 | 0.436
LDAM + DRW | 0.565 | 0.750 | 0.620 | 0.439
EQL | 0.567 | 0.757 | 0.623 | 0.439
CBS | 0.558 | 0.733 | 0.612 | 0.435
CB Loss | 0.563 | 0.744 | 0.616 | 0.440
Mixup | 0.548 | 0.736 | 0.602 | 0.425
Ours | 0.580 | 0.759 | 0.632 | 0.459

NetVLAD Model
LT Methods | Overall | Head | Medium | Tail
baseline | 0.660 | 0.803 | 0.708 | 0.554
LDAM + DRW | 0.627 | 0.779 | 0.675 | 0.519
EQL | 0.665 | 0.808 | 0.713 | 0.557
CBS | 0.662 | 0.806 | 0.708 | 0.558
CB Loss | 0.666 | 0.801 | 0.712 | 0.566
Mixup | 0.659 | 0.800 | 0.706 | 0.556
Ours | 0.667 | 0.806 | 0.713 | 0.566

5.3. Discussion

We now conduct a set of studies to justify the contributions of different components in our framework and provide the corresponding discussion.

Effectiveness of AP. Throughout the experiments, we mainly use AP as the metric to adjust the number of frames used in FrameStack. To test the effectiveness of AP, we also experiment with a constant β and with class frequency. The results are summarized in Table 3. For a constant β, we test a variant with β = 0.5 throughout training, which takes the same number of frames from all classes. Results show that the overall mAP of the Nonlinear and NetVLAD models decreases by 10.2% and 2.0%, respectively, compared to the baseline, which suggests that sampling more frames for tail classes and fewer frames for head classes is a more practical strategy in the long-tailed video recognition scenario.

Table 3. Results and comparisons of using running AP and other variants of β to determine how many frames should be used from video clips. Columns: Overall / Head [500,+∞) / Medium [100,500) / Tail [0,100).

Model | CB Strategy | Overall | Head | Medium | Tail
Nonlinear | - | 0.516 | 0.687 | 0.568 | 0.396
Nonlinear | β = 0.5 | 0.414 | 0.589 | 0.460 | 0.308
Nonlinear | w/ CF | 0.520 | 0.680 | 0.571 | 0.405
Nonlinear | w/ rAP | 0.532 | 0.695 | 0.584 | 0.417
NetVLAD | - | 0.668 | 0.775 | 0.707 | 0.584
NetVLAD | β = 0.5 | 0.648 | 0.758 | 0.684 | 0.567
NetVLAD | w/ CF | 0.663 | 0.767 | 0.699 | 0.584
NetVLAD | w/ rAP | 0.670 | 0.780 | 0.707 | 0.590

Class frequency is another popular metric widely used for image-based long-tailed recognition [22, 30, 36, 53]. In particular, we take the inverse frequency of each class to compute the number of frames sampled for head and tail classes, and then concatenate two clips as in FrameStack. We observe that resampling videos with class frequency results in a 0.5% performance drop on the NetVLAD model. In contrast, using running average precision is a better way to resample frames since it is dynamically derived from the knowledge learned by the network so far. As a result, it changes sampling rates based on the performance on particular classes, which prevents overfitting for top-performing classes and avoids under-fitting for under-performing categories at the same time. As aforementioned, treating weakly labeled videos as images and then resampling/reweighting them using class frequency might be problematic because some snippets might consist of background frames.

Effectiveness of loss functions. As mentioned above, our approach resamples data from different classes and is trained with focal loss. We now investigate the performance of our approach with different loss functions; the results are summarized in Table 4. We observe that FrameStack is compatible with both loss functions, outperforming the baseline model without any resampling/reweighting strategies. For the nonlinear model, FrameStack with focal loss achieves better performance, while for NetVLAD, FrameStack with binary cross-entropy loss (BCE) is slightly better.
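For reference, a minimal sketch of a multi-label focal loss of the kind referred to above (our own illustration of [27] adapted to BCE; the exact form and hyper-parameters used in our experiments may differ):

    import torch
    import torch.nn.functional as F

    def binary_focal_loss(logits, targets, gamma=2.0, alpha=0.25):
        # logits, targets: [B, C]; targets are multi-hot.
        # Down-weights well-classified terms via the (1 - p_t)^gamma modulating factor.
        bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
        p = torch.sigmoid(logits)
        p_t = p * targets + (1 - p) * (1 - targets)
        alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
        return (alpha_t * (1 - p_t) ** gamma * bce).mean()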
Effectiveness of mixing ratio. We also investigate how the number of samples that are mixed in a mini-batch, determined by the mixing ratio η, influences performance. From Table 5 we find that as η increases, the performance on overall and tail classes first increases and then decreases—η = 0.5 reaches the highest performance and is adopted in our FrameStack. This suggests that mixing all data in an epoch makes training more difficult.
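A minimal sketch of how η could gate which samples in a mini-batch are replaced by mixed pairs (our own illustration; the exact batching logic in the released code may differ). The mix_fn argument stands in for a pair-combination function such as the FrameStack sketch in Sec. 4, e.g., a closure over the running-AP table.

    import torch

    def mix_fraction_of_batch(features, labels, mix_fn, eta=0.5):
        # features: [B, L, D]; labels: [B, C]; eta controls how many samples are mixed.
        B = features.shape[0]
        num_mixed = int(eta * B)
        order = torch.randperm(B)
        sel = order[:num_mixed]                      # samples to be replaced by mixed samples
        partners = sel[torch.randperm(num_mixed)]    # random partner (may occasionally be itself)
        out_feat, out_label = features.clone(), labels.clone()
        for s, p in zip(sel.tolist(), partners.tolist()):
            out_feat[s], out_label[s] = mix_fn(features[s], labels[s], features[p], labels[p])
        return out_feat, out_label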

Table 4. Results of our approach using different loss functions and comparisons with baselines. FrameStack is complemented by focal loss on the Nonlinear model, reported for TSM (ResNet-50) and ResNet-101 features. Columns per backbone: Overall / Head [500,+∞) / Medium [100,500) / Tail [0,100) / Acc@1 / Acc@5.

Nonlinear
Method | TSM (ResNet-50): Overall / Head / Medium / Tail / Acc@1 / Acc@5 | ResNet-101: Overall / Head / Medium / Tail / Acc@1 / Acc@5
baseline | 0.565 / 0.757 / 0.620 / 0.436 / 0.680 / 0.851 | 0.516 / 0.687 / 0.568 / 0.396 / 0.663 / 0.837
FrameStack-BCE | 0.568 / 0.751 / 0.622 / 0.445 / 0.679 / 0.855 | 0.521 / 0.684 / 0.571 / 0.406 / 0.660 / 0.839
FrameStack-FL | 0.580 / 0.759 / 0.632 / 0.459 / 0.686 / 0.859 | 0.532 / 0.695 / 0.584 / 0.417 / 0.667 / 0.843

NetVLAD
Method | TSM (ResNet-50): Overall / Head / Medium / Tail / Acc@1 / Acc@5 | ResNet-101: Overall / Head / Medium / Tail / Acc@1 / Acc@5
baseline | 0.660 / 0.803 / 0.708 / 0.554 / 0.695 / 0.870 | 0.668 / 0.775 / 0.707 / 0.584 / 0.700 / 0.864
FrameStack-BCE | 0.669 / 0.807 / 0.715 / 0.568 / 0.711 / 0.872 | 0.671 / 0.781 / 0.707 / 0.589 / 0.709 / 0.858
FrameStack-FL | 0.667 / 0.806 / 0.713 / 0.566 / 0.708 / 0.866 | 0.670 / 0.780 / 0.707 / 0.590 / 0.710 / 0.858

Table 5. Effectiveness of the mixing ratio η; test results are based on ResNet-101 features and the NetVLAD model. Columns: Overall / Head [500,+∞) / Medium [100,500) / Tail [0,100) / Acc@1 / Acc@5.

η | Overall | Head | Medium | Tail | Acc@1 | Acc@5
0 | 0.668 | 0.775 | 0.707 | 0.584 | 0.700 | 0.864
0.3 | 0.667 | 0.780 | 0.707 | 0.586 | 0.710 | 0.860
0.5 | 0.670 | 0.780 | 0.707 | 0.590 | 0.710 | 0.858
0.7 | 0.669 | 0.780 | 0.707 | 0.585 | 0.709 | 0.860
0.9 | 0.668 | 0.774 | 0.704 | 0.588 | 0.706 | 0.859

FrameStack vs. Mixup. We compare the performance of FrameStack and Mixup on the overall and tail classes. Specifically, we compute the difference of average precision for each class between FrameStack and Mixup. In Figure 4, we visualize the top 10 classes among the 1,004 classes on which FrameStack outperforms Mixup and find that 40% of them are action classes. When comparing Figure 4 and Figure 5 together, we observe that 80% of the top 10 classes are tail classes, which shows that FrameStack is more effective than Mixup especially in recognizing tail classes.

Figure 4. Top 10 among 1,004 classes on which FrameStack surpasses Mixup (rocket, head-mountedDisplay, roadConstruction, bat, pouringConcrete, wolf, theLeaningTowerOfPisa, showingFashionableHats, energySavingBulbs, makingADreamCatcher; average precision of Mixup vs. Ours). 40% of the classes are action classes.

Figure 5. Top 10 among 340 tail classes on which FrameStack surpasses Mixup (rocket, head-mountedDisplay, roadConstruction, pouringConcrete, wolf, showingFashionableHats, energySavingBulbs, makingADreamCatcher, seaStar, archaeologist; average precision of Mixup vs. Ours). Comparing with Figure 4, we see FrameStack achieves better performance mainly on tail classes.

6. Conclusion

This paper introduced a large-scale long-tailed video dataset—VideoLT—with the aim of advancing research in long-tailed video recognition. Long-tailed video recognition is a challenging task because videos are usually weakly labeled. Experimental results show that existing long-tailed methods that achieve impressive performance on image tasks are not suitable for videos. In our work, we presented FrameStack, which performs sampling at the frame level by using running AP as a dynamic measurement. FrameStack adaptively selects different numbers of frames from different classes. Extensive experiments on different backbones and aggregation models show that FrameStack outperforms all competitors and brings clear performance gains on both overall and tail classes. Future directions include leveraging weakly-supervised learning [32, 44] and self-supervised learning [17, 48] methods to solve long-tailed video recognition.

Acknowledgement

This work was supported in part by the National Natural Science Foundation of China (#62032006).

References

[1] Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. Youtube-8m: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675, 2016.
[2] Relja Arandjelovic, Petr Gronat, Akihiko Torii, Tomas Pajdla, and Josef Sivic. Netvlad: Cnn architecture for weakly supervised place recognition. In CVPR, 2016.
[3] Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A large-scale video benchmark for human activity understanding. In CVPR, 2015.
[4] Kaidi Cao, Colin Wei, Adrien Gaidon, Nikos Arechiga, and Tengyu Ma. Learning imbalanced datasets with label-distribution-aware margin loss. In NIPS, 2019.
[5] Joao Carreira, Eric Noland, Andras Banki-Horvath, Chloe Hillier, and Andrew Zisserman. A short note about kinetics-600. arXiv preprint arXiv:1808.01340, 2018.
[6] Joao Carreira, Eric Noland, Chloe Hillier, and Andrew Zisserman. A short note on the kinetics-700 human action dataset. arXiv preprint arXiv:1907.06987, 2019.
[7] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, 2017.
[8] Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer. Smote: synthetic minority over-sampling technique. JAIR, 2002.
[9] Hsin-Ping Chou, Shih-Chieh Chang, Jia-Yu Pan, Wei Wei, and Da-Cheng Juan. Remix: Rebalanced mixup. In ECCV, 2020.
[10] Yin Cui, Menglin Jia, Tsung-Yi Lin, Yang Song, and Serge Belongie. Class-balanced loss based on effective number of samples. In CVPR, 2019.
[11] Chris Drumnond. Class imbalance and cost sensitivity: Why undersampling beats oversampling. In ICML-KDD Workshop, 2003.
[12] Christoph Feichtenhofer. X3d: Expanding architectures for efficient video recognition. In CVPR, 2020.
[13] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. Slowfast networks for video recognition. In ICCV, 2019.
[14] Ross Girshick. Fast r-cnn. In ICCV, 2015.
[15] Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In ICCV, 2017.
[16] Hui Han, Wen-Yuan Wang, and Bing-Huan Mao. Borderline-smote: a new over-sampling method in imbalanced data sets learning. In ICIC, 2005.
[17] Tengda Han, Weidi Xie, and Andrew Zisserman. Self-supervised co-training for video representation learning. In NIPS, 2020.
[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[19] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In CVPR, 2018.
[20] Noureldien Hussein, Efstratios Gavves, and Arnold WM Smeulders. Timeception for complex action recognition. In CVPR, 2019.
[21] Yu-Gang Jiang, Zuxuan Wu, Jun Wang, Xiangyang Xue, and Shih-Fu Chang. Exploiting feature and class relationships in video categorization with regularized deep neural networks. IEEE TPAMI, 2017.
[22] Bingyi Kang, Saining Xie, Marcus Rohrbach, Zhicheng Yan, Albert Gordo, Jiashi Feng, and Yannis Kalantidis. Decoupling representation and classifier for long-tailed recognition. In ICLR, 2020.
[23] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
[24] Jang-Hyun Kim, Wonho Choo, and Hyun Oh Song. Puzzle mix: Exploiting saliency and local statistics for optimal mixup. In ICML, 2020.
[25] Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso Poggio, and Thomas Serre. Hmdb: a large video database for human motion recognition. In ICCV, 2011.
[26] Ji Lin, Chuang Gan, and Song Han. Tsm: Temporal shift module for efficient video understanding. In ICCV, 2019.
[27] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In ICCV, 2017.
[28] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In ECCV, 2016.
[29] Ziwei Liu, Zhongqi Miao, Xiaohang Zhan, Jiayun Wang, Boqing Gong, and Stella X Yu. Large-scale long-tailed recognition in an open world. In CVPR, 2019.
[30] Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens Van Der Maaten. Exploring the limits of weakly supervised pretraining. In ECCV, 2018.
[31] Naila Murray, Luca Marchesotti, and Florent Perronnin. Ava: A large-scale database for aesthetic visual analysis. In CVPR, 2012.
[32] Phuc Nguyen, Ting Liu, Gautam Prasad, and Bohyung Han. Weakly supervised action localization by sparse temporal pooling network. In CVPR, 2018.
[33] Zhaofan Qiu, Ting Yao, and Tao Mei. Learning spatio-temporal representation with pseudo-3d residual networks. In ICCV, 2017.
[34] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In CVPR, 2016.
[35] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS, 2015.
[36] Li Shen, Zhouchen Lin, and Qingming Huang. Relay backpropagation for effective learning of deep convolutional neural networks. In ECCV, 2016.
[37] Gunnar A Sigurdsson, Gül Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev, and Abhinav Gupta. Hollywood in homes: Crowdsourcing data collection for activity understanding. In ECCV, 2016.
[38] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
[39] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, 2015.
[40] Jingru Tan, Changbao Wang, Buyu Li, Quanquan Li, Wanli Ouyang, Changqing Yin, and Junjie Yan. Equalization loss for long-tailed object recognition. In CVPR, 2020.
[41] Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In ICML, 2019.
[42] Kaihua Tang, Jianqiang Huang, and Hanwang Zhang. Long-tailed classification by keeping the good and removing the bad momentum causal effect. In NIPS, 2020.
[43] Vikas Verma, Alex Lamb, Christopher Beckham, Amir Najafi, Ioannis Mitliagkas, David Lopez-Paz, and Yoshua Bengio. Manifold mixup: Better representations by interpolating hidden states. In ICML, 2019.
[44] Limin Wang, Yuanjun Xiong, Dahua Lin, and Luc Van Gool. Untrimmednets for weakly supervised action recognition and detection. In CVPR, 2017.
[45] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In CVPR, 2018.
[46] Chao-Yuan Wu, Ross Girshick, Kaiming He, Christoph Feichtenhofer, and Philipp Krahenbuhl. A multigrid method for efficiently training video models. In CVPR, 2020.
[47] Tong Wu, Qingqiu Huang, Ziwei Liu, Yu Wang, and Dahua Lin. Distribution-balanced loss for multi-label classification in long-tailed datasets. In ECCV, 2020.
[48] Dejing Xu, Jun Xiao, Zhou Zhao, Jian Shao, Di Xie, and Yueting Zhuang. Self-supervised spatiotemporal learning via video clip order prediction. In CVPR, 2019.
[49] Yuzhe Yang and Zhi Xu. Rethinking the value of labels for improving class-imbalanced learning. In NIPS, 2020.
[50] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. Cutmix: Regularization strategy to train strong classifiers with localizable features. In ICCV, 2019.
[51] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In ICLR, 2018.
[52] Xiao Zhang, Zhiyuan Fang, Yandong Wen, Zhifeng Li, and Yu Qiao. Range loss for deep face recognition with long-tailed training data. In ICCV, 2017.
[53] Boyan Zhou, Quan Cui, Xiu-Shen Wei, and Zhao-Min Chen. Bbn: Bilateral-branch network with cumulative learning for long-tailed visual recognition. In CVPR, 2020.
[54] Linchao Zhu and Yi Yang. Inflated episodic memory with region self-attention for long-tailed visual recognition. In CVPR, 2020.