Seeing the Arrow of Time
Lyndsey C. Pickup[1], Zheng Pan[2], Donglai Wei[3], YiChang Shih[3], Changshui Zhang[2], Andrew Zisserman[1], Bernhard Schölkopf[4], William T. Freeman[3]

[1] Visual Geometry Group, Department of Engineering Science, University of Oxford. [2] Department of Automation, Tsinghua University; State Key Lab of Intelligent Technologies and Systems, Tsinghua National Laboratory for Information Science and Technology (TNList). [3] Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology. [4] Max Planck Institute for Intelligent Systems, Tübingen.

Abstract

We explore whether we can observe Time's Arrow in a temporal sequence – is it possible to tell whether a video is running forwards or backwards? We investigate this somewhat philosophical question using computer vision and machine learning techniques.

We explore three methods by which we might detect Time's Arrow in video sequences, based on distinct ways in which motion in video sequences might be asymmetric in time. We demonstrate good video forwards/backwards classification results on a selection of YouTube video clips, and on natively-captured sequences (with no temporally-dependent video compression), and examine what motions the models have learned that help discriminate forwards from backwards time.

1. Introduction

How much of what we see on a daily basis could be time-reversed without us noticing that something is amiss? Videos of certain motions can be time-reversed without looking wrong, for instance the opening and closing of automatic doors, flexing of the digits of a human hand, and elastic bouncing or harmonic motions for which the damping is small enough as to be undetectable in a short video. The same is not true, however, of a cow eating grass, or a dog shaking water from its coat. In these two scenarios, we don't automatically accept that a complicated system (chewed food, or the spatter pattern of water drops) will re-compose itself into a simpler one.

There has been much study of how the underlying physics, and the second law of thermodynamics, create a direction of time [14, 16]. While physics addresses whether the world is the same forwards and backwards, our goal will be to assay whether and how the direction of time manifests itself visually. This is a fundamental property of the visual world that we should understand. Thus we ask the question: can we see the Arrow of Time? – can we distinguish a video playing forward from one playing backward?

Here, we seek to use low-level visual information – closer to the underlying physics – to see Time's Arrow, not object-level visual cues. We are not interested in memorizing that cars tend to drive forward, but rather in studying the common temporal structure of images. We expect that such regularities will make the problem amenable to a learning-based approach. Some sequences will be difficult or impossible; others may be straightforward. We want to know: how strong is the signal indicating Time's Arrow? Where do we see it, and when is it reliable? We study these questions using several different learning-based approaches.

Asking such fundamental questions about the statistical structure of images has helped computer vision researchers to formulate priors and to develop algorithms exploiting such priors. For example, asking "are images invariant over scale?" led researchers to develop the now widely-used multi-resolution spatial pyramid architectures [3]. Human vision researchers have learned properties of human visual processing by asking "can we recognize faces upside down?" [19], or about tonescale processing by asking "can we recognize faces if the tonescale is inverted?" [9]. Both symmetries and asymmetries of priors provide useful information for visual inference. The prior assumption that light comes from above is asymmetric and helps us disambiguate convex and concave shapes [15]. Spatial translation invariance, on the other hand, allows us to train object detection or recognition systems without requiring training data at all possible spatial locations.
Learning and representing such statistical structure has provided the prior probabilities needed for a variety of computer vision tasks for processing or synthesizing images and image sequences, including, for example: noise removal [17], inpainting over spatial or temporal gaps [1, 24], image enhancement [25], and texture synthesis [13].

In contrast to quite extensive research on the statistical properties of the spatial structure in images [6], there has been far less investigation of temporal structure (one of the exceptions to this rule is [7]). However, knowledge of the temporal structure of images/videos can be used in a variety of tasks, including video enhancement or analysis, filling in missing data, and the prediction of what will be seen next, an important feedback signal for robotics applications. Thus, we can similarly ask: are video temporal statistics symmetric in time?

We expect that the answer to this question can guide us as we develop priors and subsequent algorithms. For example, an understanding of the precise asymmetries of temporal visual priors may have implications for video denoising, video decompression, optical flow estimation, or the ordering of images [4]. In addition, causal inference, of which seeing Time's Arrow is a special case, has been connected to machine learning topics relevant to computer vision, such as semi-supervised learning, transfer learning and covariate shift adaptation [18]. Our question relates to this issue of causality, but seen through the lens of vision.

In the sequel, we probe how Time's Arrow can be determined from video sequences in three distinct ways: first, in Section 3, by using features that represent local motion patterns and learning from these whether there is an asymmetry in temporal behaviour that can be used to determine the video direction; second, in Section 4, by explicitly representing evidence of causality; and third, in Section 5, by using a simple auto-regressive model to distinguish whether a cause at a time point influences future events or not (since it cannot influence the past).
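To make the third idea concrete, the following is a minimal sketch of the general auto-regressive reasoning on a 1-D signal (our illustration, not the paper's implementation): fit an AR model to the signal and to its time-reversed copy, and prefer the direction in which the residuals look more like independent noise. The function names, the AR order, and the crude second-moment dependence statistic below are all illustrative choices; a proper implementation would use a robust independence test.

    import numpy as np

    def ar_residuals(x, order=2):
        # Fit an AR(order) model by least squares and return the
        # one-step-ahead prediction residuals.
        X = np.column_stack([x[i:len(x) - order + i] for i in range(order)])
        y = x[order:]
        coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
        return y - X @ coeffs

    def dependence(sig, order=2):
        # OLS residuals are uncorrelated with the regressors by construction,
        # so plain correlation is uninformative; as a crude stand-in for a
        # proper independence test (e.g. HSIC), correlate second moments.
        r = ar_residuals(sig, order)
        past = sig[:len(r)]
        return abs(np.corrcoef(r ** 2, past ** 2)[0, 1])

    def direction_score(x, order=2):
        # Positive score: residuals depend less on the past in the forward
        # direction, suggesting x is already playing forwards.
        return dependence(x[::-1], order) - dependence(x, order)

    # Toy usage: a causal filter driven by non-Gaussian noise is, in theory,
    # statistically irreversible, so its direction is identifiable in principle.
    rng = np.random.default_rng(0)
    noise = rng.standard_normal(2000) ** 3
    kernel = np.exp(-np.arange(30) / 5.0)
    signal = np.convolve(noise, kernel)[:2000]
    print("direction score:", direction_score(signal))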
2. Dataset and baseline method

In this section we describe the two video datasets and a baseline method for measuring Time's Arrow.

2.1. YouTube dataset

We have assembled a set of videos that act as a common test bed for the methods investigated in this paper. Our criterion when selecting videos was that the information on Time's Arrow should not be 'semantic' but should be available principally from the 'physics'. Thus we do not include strong semantic cues, for example from the coupling of the asymmetries of animals and vehicles with their directions of motion (e.g. eyes at the front of the head tending to move forwards), or from the order of sequences of actions (such as traffic lights or clock hands). What remains is the 'physics': gravity, friction, entropy, etc. [21].

The dataset consists of 180 video clips from YouTube. It was assembled manually using more than 50 keywords, aimed at retrieving a diverse set of videos from which we might learn low-level motion-type cues that indicate the direction of time. Keywords included "dance", "steam train", and "demolition", amongst other terms. The dataset is available at http://www.robots.ox.ac.uk/data/arrow/.

Selection criteria: Video clips were selected to be 6–10 seconds long, giving at least 100 frames on which to run the computations. This meant discarding clips from many professionally-produced videos in which each shot's duration was too short, since each clip was required to be a single shot. We restricted the selections to HD videos, allowing us to subsample extensively to avoid block artifacts and to minimize the interaction with any given compression method.

Videos with poor focus or lighting were discarded, as were videos with excessive camera motion or motion blur due to hand shake. Videos with special effects and computer-generated graphics were avoided, since the underlying physics describing these may not match the real-world physics underpinning our exploration of Time's Arrow. Similarly, split-screen views, cartoons and computer games were all discarded. Finally, videos were required to be in landscape format with correct aspect ratios and minimal interlacing artefacts. In a few cases, the dataset still contains frames in which small amounts of text or television channel badges have been overlaid, though we tried to minimize this effect as well.

Finally, the dataset contains a few clips from "backwards" videos on YouTube; these are videos where the authors or uploaders performed the time-flip before submitting the video to YouTube, so any time-related artefacts introduced by YouTube's subsequent compression will also be reversed in these cases relative to the forward portion of our data. Because relatively few clips meeting all of our criteria exist, multiple shots of different scenes were taken from the same source video in a few of these cases. In all other cases, only one short clip from each YouTube video was used.

In total, there are 155 forward videos and 25 reverse videos in the dataset. Two frames from each of six example videos are shown in Figure 1. The first five are forwards-time examples (horse in water, car crash, mine blasting, baseball game, water spray), and the right-most pair of frames is from a backwards-time example in which pillows are "un-thrown".

Train/val/test split and evaluation procedure. For evaluating the methods described below, the dataset was divided into 60 testing videos and 120 training videos, in three different ways such that each video appeared as a testing video exactly once and as a training video exactly twice. Where methods required it, the 120 videos of the training set were further subdivided into 70 training and 50 validation videos.
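The paper does not spell out how the three splits were generated; a split with the required property (each clip tested exactly once, trained on twice) is simply a random three-fold partition of the 180 clips, as in the minimal sketch below. The clip identifiers are hypothetical.

    import random

    # Hypothetical clip identifiers standing in for the 180 dataset videos.
    video_ids = ["clip_%03d" % i for i in range(180)]
    random.Random(0).shuffle(video_ids)

    # Three disjoint folds of 60; using each fold as the test set once gives
    # every video one appearance as test data and two as training data.
    folds = [video_ids[i::3] for i in range(3)]
    for k in range(3):
        test = folds[k]
        train = [v for j, fold in enumerate(folds) if j != k for v in fold]
        subtrain, val = train[:70], train[70:]  # further 70/50 split if needed
        assert len(test) == 60 and len(subtrain) == 70 and len(val) == 50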