Learning and Using the Arrow of Time
International Journal of Computer Vision manuscript No. (will be inserted by the editor)

Learning and Using the Arrow of Time

Donglai Wei1 · Joseph Lim2 · Andrew Zisserman3 · William T. Freeman4,5

Received: date / Accepted: date

Abstract We seek to understand the arrow of time in videos: what makes videos look like they are playing forwards or backwards? Can we visualize the cues? Can the arrow of time be a supervisory signal useful for activity analysis? To this end, we build three large-scale video datasets and apply a learning-based approach to these tasks.

To learn the arrow of time efficiently and reliably, we design a ConvNet suitable for extended temporal footprints and for class activation visualization, and we study the effect of artificial cues, such as cinematographic conventions, on learning. Our trained model achieves state-of-the-art performance on large-scale real-world video datasets. Through cluster analysis and localization of the regions important to the prediction, we examine the learned visual cues that are consistent across many samples and show when and where they occur. Lastly, we use the trained ConvNet for two applications: self-supervision for action recognition, and video forensics, i.e., determining whether Hollywood film clips have been deliberately reversed in time, as is often done for special effects.

Donglai Wei E-mail: [email protected]
Joseph Lim E-mail: [email protected]
Andrew Zisserman E-mail: [email protected]
William T. Freeman E-mail: [email protected]
1 Harvard University
2 University of Southern California
3 University of Oxford
4 Massachusetts Institute of Technology
5 Google Research

1 Introduction

We seek to learn to see the arrow of time, i.e., to tell whether a video sequence is playing forwards or backwards. At a small scale the world is reversible: the fundamental physics equations are symmetric in time. Yet at a macroscopic scale, time is often irreversible, and we can use certain motion patterns (e.g., water flows downward) to tell the direction of time. The task can nevertheless be challenging: some motion patterns are too subtle for humans to determine whether they are playing forwards or backwards, as illustrated in Figure 1. For example, the train in Figure 1d could plausibly be moving in either direction, accelerating or decelerating.

Fig. 1: Seeing these ordered frames from videos, can you tell whether each video is playing forward or backward? (answer below1). Depending on the video, solving the task may require (a) low-level understanding (e.g. physics), (b) high-level reasoning (e.g. semantics), or (c) familiarity with very subtle effects or (d) camera conventions. In this work, we learn and exploit several types of knowledge to predict the arrow of time automatically with neural network models trained on large-scale video datasets.

1 Forwards: (b), (c); backwards: (a), (d). Though in (d) the train can move in either direction.

Furthermore, we are interested in how the arrow of time manifests itself visually. We ask: first, can we train a reliable arrow-of-time classifier from large-scale natural videos while avoiding artificial cues (i.e., cues introduced during video production rather than by the visual world)? Second, what does the model learn about the visual world in order to solve this task? And last, can we apply such learned commonsense knowledge to other video analysis tasks?

Regarding the first question, on arrow-of-time classification, we go beyond the previous work (Pickup et al 2014): we train a ConvNet on thousands of hours of online videos and let the data determine which cues to use. Such cues can come from high-level events (e.g., riding a horse) or low-level physics (e.g., gravity). But, as discovered in previous self-supervision work (Doersch et al 2015), a ConvNet may learn artificial cues from still images (e.g., chromatic aberration) instead of a useful visual representation. Videos, as collections of images, carry additional artificial cues introduced during creation (e.g. camera motion), compression (e.g. inter-frame codecs), or editing (e.g. black framing), any of which may indicate the video's temporal direction. Thus, we design controlled experiments to understand the effect of artificial cues from videos on arrow-of-time classification.

Regarding the second question, on the interpretation of the learned features, we highlight the observation of Zhou et al (2014): in order to achieve a task (scene classification in their case), a network implicitly learns what is necessary (object detectors in their case). We expect that the network will learn a useful representation of the visual world, involving both low-level physics and high-level semantics, in order to detect the forward direction of time.

Regarding the third question, on applications, we use the arrow-of-time classifier for two tasks: video representation learning and video forensics. For representation learning, recent works have used temporal ordering for self-supervised training of an image ConvNet (Misra et al 2016; Fernando et al 2017). Instead, we focus on the motion cues in videos and use the arrow of time to pre-train action recognition models. For video forensics, we detect clips that are played backwards in Hollywood films. This may be done as a special effect, or to make an otherwise dangerous scene safe to film. We show good performance on a newly collected dataset of films containing time-reversed clips, and we visualize the cues that the network uses to make the classification. More generally, this application illustrates that the trained network can detect videos that have been tampered with in this way. In both applications we exceed the respective state of the art.

In the following, we first describe our ConvNet model (Section 2), incorporating recent developments in human action recognition and network interpretation. We then identify and address three potential confounds to learning the arrow of time discovered by the ConvNet (Section 4), for example, its exploitation of the prototypical camera motions used by directors. With the properly pre-processed data, we train our model using two large video datasets (Section 5): a 147k-clip subset of the Flickr100M dataset (Thomee et al 2016) and a 58k-clip subset of the Kinetics dataset (Kay et al 2017). We evaluate test performance and visualize the representations learned to solve the arrow-of-time task. Lastly, we demonstrate the usefulness of our ConvNet arrow-of-time detector for self-supervised pre-training in action recognition and for identifying clips from Hollywood films made using the reverse-motion film technique (Section 7).

1.1 Related Work

Several recent papers have explored the use of the temporal ordering of images. Dekel et al (2014) consider the task of photo-sequencing: determining the temporal order of a collection of images taken by different cameras. Others have used the temporal ordering of frames as a supervisory signal for learning an embedding (Ramanathan et al 2015), for self-supervised training of a ConvNet (Misra et al 2016; Fernando et al 2017), and for constructing a representation for action recognition (Fernando et al 2015). However, none of these previous works addresses the task of detecting the direction of time.

Pickup et al (2014) explore three representations for determining time's arrow in videos: asymmetry in temporal behaviour (using hand-crafted SIFT-like features), evidence for causality, and an auto-regressive model that determines whether a cause influences future events. While their methods work on a small dataset collected for its known, strong arrow-of-time signal, it is unclear whether they would work on generic large-scale video datasets containing different artificial signals.

The study of the arrow of time is a special case of causal inference, which has been connected to machine learning topics such as transfer learning and covariate shift adaptation (Schölkopf et al 2012). Recently, Xie et al (2017) discovered that the I3D action recognition model (Carreira

Fig. 2: Illustration of our Temporal Class-Activation-Map Network (T-CAM) for arrow-of-time classification. Starting from the traditional VGG-16 architecture (Simonyan and Zisserman 2014b) for image recognition, (a) we first concatenate the conv5 features from the shared convolutional layers, and (b) then replace the fully-connected layers with three convolution layers and a global average pooling (GAP) layer (Lin et al 2013; Springenberg et al 2014; Szegedy et al 2015; Zhou et al 2016) for better activation localization.

Fig. 3: The 3-cushion billiard dataset. (a) Original frame from a 3-cushion video; (b) the frame warped (with a homography transformation) to an overhead view of the billiard table; and (c) a simulated frame matching the real one in terms of the size and number of balls.

For the temporal feature fusion stage (Figure 2a), we first modify the VGG-16 network to accept a number of frames (e.g. 10) of optical flow as input by expanding the number of channels of the conv1 filters (Wang et al 2016). We use T such temporal chunks, with a temporal stride of t, and concatenate the conv5 features from each chunk. Then, for the classification stage (Figure 2b), we follow the CAM model design, replacing the fully-connected layers with three convolution layers and global average pooling (GAP) before the binary logistic regression. Batch-Normalization layers (Ioffe and Szegedy 2015) are added after each convolution layer.
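The GAP-plus-logistic tail of this design can be sketched in a few lines. The following is a pure-Python toy (all shapes, weights, and function names are illustrative, not the paper's implementation); it also shows how the same logistic weights yield a class-activation map that localizes the regions driving the prediction:

```python
import math

# Toy sketch of a CAM-style classification tail: conv features -> global
# average pooling (GAP) -> binary logistic unit. Feature maps are C x H x W
# nested lists; sizes and weights below are made up for illustration.

def global_average_pool(feat):
    """Reduce a C x H x W feature map to C per-channel means."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feat]

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def arrow_of_time_score(feat, weights, bias):
    """P(clip plays forwards) from GAP-pooled channel activations."""
    z = sum(w * p for w, p in zip(weights, global_average_pool(feat))) + bias
    return logistic(z)

def class_activation_map(feat, weights):
    """Per-location evidence: the same weights applied before pooling."""
    h, w = len(feat[0]), len(feat[0][0])
    return [[sum(wt * ch[i][j] for wt, ch in zip(weights, feat))
             for j in range(w)] for i in range(h)]

# Toy example: 2 channels of 2x2 activations.
feat = [[[1.0, 3.0], [5.0, 7.0]],   # channel 0: mean 4.0
        [[0.0, 0.0], [0.0, 4.0]]]   # channel 1: mean 1.0
score = arrow_of_time_score(feat, weights=[0.5, -1.0], bias=0.0)
# z = 0.5*4.0 - 1.0*1.0 = 1.0, so score = sigmoid(1.0) ≈ 0.731
cam = class_activation_map(feat, [0.5, -1.0])
```

Because the logistic weights act on channel averages, re-applying them per spatial location (as in `class_activation_map`) highlights the evidence without any retraining; this is the property the CAM-style design exploits for activation visualization.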
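The supervisory signal for this classifier costs nothing to produce: a clip is labeled by whether it was left forwards or reversed. A minimal sketch of such a sampler (the function name and raw-frame representation are hypothetical; the model described above consumes optical-flow stacks rather than raw frames):

```python
import random

# Hypothetical sampler for arrow-of-time self-supervision: each clip yields
# one training example whose binary label is its playback direction.
# "frames" are placeholders for whatever per-timestep input the model uses.

def make_arrow_of_time_sample(frames, rng=random):
    """Randomly keep or reverse a clip; the label is the arrow of time."""
    if rng.random() < 0.5:
        return list(frames), 1          # played forwards
    return list(reversed(frames)), 0    # played backwards
```

A balanced stream of such pairs is enough to train the binary classifier end to end, with no human annotation.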