Instance-Level Future Motion Estimation in a Single Image Based on Ordinal Regression
Total Page:16
File Type:pdf, Size:1020Kb
Instance-Level Future Motion Estimation in a Single Image Based on Ordinal Regression Kyung-Rae Kim1 Whan Choi1 Yeong Jun Koh2 Seong-Gyun Jeong3 Chang-Su Kim1 1Korea University 2Chungnam National University 3CODE42.ai Abstract A novel algorithm to estimate instance-level future mo- tion in a single image is proposed in this paper. We first represent the future motion of an instance with its direction, speed, and action classes. Then, we develop a deep neu- ral network that exploits different levels of semantic infor- mation to perform the future motion estimation. For effec- tive future motion classification, we adopt ordinal regres- sion. Especially, we develop the cyclic ordinal regression scheme using binary classifiers. Experiments demonstrate Figure 1. Above annotations are automatically generated from the that the proposed algorithm provides reliable performance single image by the proposed algorithm. and thus can be used effectively for vision applications, in- cluding single and multi object tracking. Furthermore, we jectories in a scene. Given starting and end points in a scene, release the future motion (FM) dataset, collected from di- they estimate trajectories connecting those points. Mottaghi verse sources and annotated manually, as a benchmark for et al.[40] estimate motion scenarios by classifying objects single-image future motion estimation. in images into one of pre-defined scenarios as in Figure 2. They focus on the scene understanding task rather than on object FMs, such as motion directions and magnitudes. Gao 1. Introduction et al.’s algorithm [16] is the most similar to ours, but it es- Motion understanding is critical in various vision tasks, timates pixel-level optical flow and works only for highly such as optical flow [23, 38, 46], action recognition [55], similar action scenes to the ones in the training data. future frame prediction [36], and video compression [50]. Even unaware of the exact physics, humans can predict Most prior arts estimate motions by analyzing the differ- next motions of instances based on their experience. Based ences between consecutive frames. In contrast, a human be- on this observation, we use a deep neural network [22] ing is often capable of anticipating motions precisely even to implement such perceptual capabilities regarding FM. from a single still image, as illustrated in Figure 1. Such in- Deep neural networks can perform high-level understand- nate perceptual capabilities enable us to take desired actions ing of images, if reliable examples are provided sufficiently and avoid dangerous situations. If computer vision attains a [17,20,24,29–31,53]. Hence, another objective of this work similar level of motion understanding, we would be able to is to construct a reliable dataset for single-image FM es- build more secure artificial intelligent systems, e.g. robots timation. First, we focus on pedestrian instances, which and self-driving cars. are objects of the most interest in many applications. We We propose a pioneering algorithm to estimate instance- construct the FM dataset by collecting images, containing level FM in a single image. The proposed algorithm at- pedestrians. After detecting bounding boxes for pedestri- tempts to challenge humans in single-image motion under- ans, we manually assign three attributes of FM (i.e. direc- standing. Even though there are some methods [16, 26, 37, tion, speed, and action) to each pedestrian. Later, we extend 40, 52] for predicting FM, they need additional information the FM dataset to include car and animal instances. or are effective in limited environments only. As shown in When we see a moving object, we perceive visual infor- Figure 2, Yagi et al.[52] require a cumulative trajectory of mation of the object and its surroundings simultaneously. an instance from past frames, while the proposed algorithm Similarly, to exploit scene contexts for FM estimation, we uses only a present frame. Also, [26,37] estimate object tra- propose the multi-context pooling (MCP) layer that inte- 1273 Yagi et al.[52] Kitani et al.[26] Ma et al.[37] Mottaghi et al.[40] Gao et al.[16] Proposed Figure 2. Comparison of previous FM estimation methods [16,26,37,40,52] and the proposed algorithm: the proposed algorithm estimates FM in a single image more reliably in more diverse scenes and environments. Please also see the supplemental video. grates both object and global features. We incorporate the Also, some methods predict long-term trajectories of ob- MCP layer into DenseNet-121 [22] to learn a unified model jects in a still image [26, 37, 47]. Using a Markov decision for estimating the future direction, speed, and action of an process, Kitani et al.[26] predict trajectories of an object instance. Moreover, for effective FM estimation, we adopt based on its current and goal states. Ma et al.[37] extract ordinal regression. Especially, we propose the cyclic ordi- visual attributes of pedestrians via deep networks to forecast nal regression (COR) scheme for the future direction, whose pedestrian dynamics. Walker et al.[47] exploit mid-level classes have a cyclic order. patches for future trajectory prediction. However, since they The proposed algorithm provides promising FM estima- focus on scene information rather than on object informa- tion results despite variations in camera viewpoints, source tion, these conventional techniques are highly dependent on types, and environments. We assess the efficacy of the scene types and camera angles. In contrast, we attempt to proposed algorithm in important applications of single and design a domain-adaptive FM estimation algorithm, which multi object tracking. It is demonstrated that the proposed can be reliably used in various applications. algorithm makes conventional object trackers [25,41]more efficient by reducing search regions. Moreover, we extend 2.2. Pixel-Level Future Motion Estimation the proposed algorithm to process other kinds of instances, Some algorithms [16, 34, 43, 48, 49] estimate dense mo- including cars, cats, dogs, and horses. tion, i.e. optical flow, from a single image. Pintea et al.[43] This work has the following main contributions: predict continuous motion vectors using structured random forests with regression. Walker et al.[48] quantize opti- • We develop a single-image FM estimation algorithm, cal flow vectors into 40 clusters and then classify each pixel by incorporating the MCP layer into DenseNet-121 into one of the clusters using convolutional neural networks. and developing the COR scheme for future direction Using a conditional variational autoencoder (VAE), Walker classification. et al.[49] characterize various distributions of future flow to • Unlike the previous attempts [16, 26, 37, 47, 52], the predict multiple motion directions and magnitudes at each proposed algorithm estimates FM reliably in diverse pixel. Gao et al.[16] develop an encoder-decoder network scenes and environments. It is strongly recommended with a loss network to hallucinate future flow and demon- to see the supplemental video. strate the efficacy of the hallucinated flow for action recog- nition. Li et al.[34] devise a spatio-temporal conditional • We demonstrate the efficacy of the proposed algorithm VAE to predict future flow maps in multiple time steps. quantitatively in single and multi object tracking. With the predicted maps, they perform full frame synthe- • We release the FM dataset to serve as a benchmark for sis and achieve video prediction. the interesting research topic of subsequent behaviour 2.3. Ordinal Regression estimation in a single still image. Ordinal regression is a learning task for predicting a la- 2. Related Work bel (or rank) of an object, where the set of labels has a lin- 2.1. Instance-Level Future Motion Estimation ear order [21], e.g. the set of integers. Attempts have been made for ordinal regression [19], for example, using sup- FM estimation can be performed at either instance level port vector machines [6], Gaussian processes [5], and per- or pixel level. Let us first review instance-level algorithms. ceptron learning [8]. Frank and Hall proposed the ordinal Mottaghi et al.[40] forecast FMs of objects in a single im- binary decomposition [15]. They constructed independent age. They pre-define 66 scenarios based on physical move- binary classifiers to decide whether the rank of an object ments of an object, and classify an object into one of the is greater than k. Using decision trees, they combined the scenarios. Chao et al.[4] forecast human poses from static binary outputs to estimate the rank. Li and Lin [33]pro- images. They design recurrent neural networks to yield a posed a reduction scheme from ordinal regression to binary pose sequence from an initial pose. However, it works for classification. A set of binary classifiers was learned jointly limited sports scenes only. and combined with existing techniques, such as SVM. This 274 (a) Direction (b) Speed (c) Action Figure 3. Statistics of the FM dataset. The instances are sampled from YouTube [1], CityPersons [54], and CPDB [13], and then split into training (top) and test (bottom) sets. reduction scheme has been adopted in various applications, including the age estimators in [3,42]. Also, the decomposi- tion for the case of a cyclic order was considered in [11]. In this work, we formulate the FM classification as an ordinal regression problem. Stop Slow Fast Sidewalk Crosswalk Jaywalk Figure 4. Three speed classes and three action classes. 3. Single-Image Future Motion Estimation the future direction into one of the four cardinal directions We represent the FM of an instance with its direction, (N, E, S, W) and the four intermediate ones (NE, SE, SW, speed, and action classes. We then develop classifiers, based NW) in the image coordinates. We use the eight quantized on ordinal regression, for the FM estimation. directions, which are sufficient in many applications. Finer quantization makes the annotation difficult and unreliable.