Arxiv:2001.06891V3 [Cs.CV] 24 Mar 2020 Grounding Dataset Vidstg Based on Video Relation Dataset Given an Untrimmed Video and a Declarative/Interrogative Vidor
Total Page:16
File Type:pdf, Size:1020Kb
Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences ∗ Zhu Zhang1, Zhou Zhao1, Yang Zhao1, Qi Wang2, Huasheng Liu2, Lianli Gao3 1College of Computer Science, Zhejiang University 2Alibaba Group, 3University of Electronic Science and Technology of China fzhangzhu, zhaozhou, [email protected], fwq140362, [email protected] Abstract In this paper, we consider a novel task, Spatio-Temporal 25.12s 27.53s Video Grounding for Multi-Form Sentences (STVG). Given Declarative Sentence: A little boy with a Christmas hat is catching a yellow toy. an untrimmed video and a declarative/interrogative sen- Interrogative Sentence: What is caught by the squatting boy on the floor? tence depicting an object, STVG aims to localize the spatio- Figure 1. An example of STVG for multi-form sentences. temporal tube of the queried object. STVG has two chal- lenging settings: (1) We need to localize spatio-temporal object tubes from untrimmed videos, where the object may age, which has attracted much attention and made great only exist in a very small segment of the video; (2) We progress [14, 25, 6, 42, 39]. Recently, researchers begin deal with multi-form sentences, including the declarative to explore video grounding, including temporal grounding sentences with explicit objects and interrogative sentences and spatio-temporal grounding. Temporal sentence ground- with unknown objects. Existing methods cannot tackle the ing [8, 11, 48, 37] determines the temporal boundaries of STVG task due to the ineffective tube pre-generation and events corresponding to the given sentence, but does not the lack of object relationship modeling. Thus, we then localize the spatio-temporal tube (i.e., a sequence of bound- propose a novel Spatio-Temporal Graph Reasoning Net- ing boxes) of the described object. Further, spatio-temporal work (STGRN) for this task. First, we build a spatio- grounding is to retrieve the object tubes according to textual temporal region graph to capture the region relationships descriptions, but existing strategies [50, 1, 38, 4] can only with temporal object dynamics, which involves the implicit be applied to restricted scenarios, e.g. grounding in a frame and explicit spatial subgraphs in each frame and the tem- of the video [50, 1] or grounding in trimmed videos [38, 4]. poral dynamic subgraph across frames. We then incorpo- Moreover, due to the lack of bounding box annotations, re- rate textual clues into the graph and develop the multi-step searchers [50, 4] can only adopt a weakly-supervised set- cross-modal graph reasoning. Next, we introduce a spatio- ting, leading to suboptimal performance. temporal localizer with a dynamic selection method to di- To break through above restrictions, we propose a novel rectly retrieve the spatio-temporal tubes without tube pre- task, Spatio-Temporal Video Grounding for Multi-Form generation. Moreover, we contribute a large-scale video Sentences (STVG). Concretely, as illustrated in Figure 1, arXiv:2001.06891v3 [cs.CV] 24 Mar 2020 grounding dataset VidSTG based on video relation dataset given an untrimmed video and a declarative/interrogative VidOR. The extensive experiments demonstrate the effec- sentence depicting an object, STVG aims to localize the tiveness of our method. The VidSTG dataset is available spatio-temporal tube of the queried object. at https://github.com/Guaranteer/VidSTG-Dataset. Compared with previous video grounding [50, 1, 38, 4], STVG has two novel and challenging points. First, we lo- calize spatio-temporal object tubes from untrimmed videos. 1. Introduction The objects may exist in a very small segment of the video and be hard to distinguish. And the sentences may only Grounding natural language in visual contents is a fun- describe a short-term state of the queried object, e.g. the damental and vital task in the visual-language understand- action ”catching a toy” of the boy in Figure 1. So it is cru- ing field. Visual grounding aims to localize the object cial to determine the temporal boundaries of object tubes described by the given referring expression in an im- by sufficient cross-modal understanding. Secondly, STVG ∗Zhou Zhao is the corresponding author. deals with multi-form sentences, that it, not only grounds 1 the conventional declarative sentences with explicit objects, determine the temporal boundaries of the tube, and then ap- but also localizes the interrogative sentences with unknown ply a spatial localizer with a dynamic selection method to objects, for example, the sentence ”What is caught by the ground the object in each frame and generate a smooth tube. squatting boy on the floor?” in Figure 1. Due to the lack of To facilitate this STVG task, we contribute a large-scale the explicit characteristics of objects (e.g. the class ”toy” video grounding dataset VidSTG by adding the multi-form and visual appearance ”yellow”), grounding for interroga- sentence annotations into video relation dataset VidOR. tive sentences can only depend on relationships between the Our main contributions can be summarized as follows: unknown object and other objects (e.g. the action relation ”caught by the squatting boy” and spatial relation ”on the • We propose a novel task STVG to explore the spatio- floor”). Thus, sufficient relationship construction and cross- temporal video grounding for multi-form sentences. modal relation reasoning are crucial for the STVG task. • We develop a novel STGRN to tackle this STVG task, Existing video grounding methods [38, 4] often extract a which builds a spatio-temporal graph to capture the re- set of spatio-temporal tubes from trimmed videos and then gion relationships with temporal object dynamics, and identify the target tube that matches the sentence. However, employs a spatio-temporal localizer to directly retrieve this framework may be unsuitable for STVG. On the one the spatio-temporal tubes without tube pre-generation. hand, the performance of this framework is heavily depen- dent on the quality of tube candidates. But it is difficult to • We contribute a large-scale video grounding dataset pre-generate high-quality tubes without textual clues, since VidSTG as the benchmark of the STVG task. the sentences may describe a short-term state of objects in • The extensive experiments show the effectiveness of a very small segment, but the existing tube pre-generation our proposed STGRN method. framework [38, 4] can only produce the complete object tubes from trimmed videos. On the other hand, these meth- ods only consider single tube modeling and ignore the rela- 2. Related Work tionships between objects. However, object relations are vi- tal clues for the STVG task, especially for interrogative sen- 2.1. Temporal Localization via Natural Language tences that may only offer the interactions of the unknown Temporal natural language localization is to detect objects with other objects. Thus, these approaches cannot the video clip depicting the given sentence. Early ap- deal with the complicated grounding of STVG. proaches [11, 8, 12, 23, 24] are mainly based on the slid- To tackle above problems, we propose a novel Spatio- ing window framework, which first samples abundant can- Temporal Graph Reasoning Network (STGRN) to capture didate clips and then ranks them. Recently, people be- region relationships with temporal object dynamics and di- gin to develop holistic and cross-modal methods [46, 2, 3, rectly localize the spatio-temporal tubes without tube pre- 35, 48, 37] to solve this problem. Chen, Zhang and Xu generation. Specifically, we first parse the video into a et al. [2, 3, 48, 22, 37] build frame-by-word interactions spatio-temporal region graph. Existing visual graph mod- from visual and textual contents to aggregate the matching eling approaches [41, 19] often build the spatial graph in clues. Zhang et al. [46] deal with the structural and semantic an image, which cannot utilize the temporal dynamics in- misalignment challenges by an explicitly structured graph. formation in videos to distinguish the subtle differences of Wang et al. [35] propose a reinforcement learning frame- object actions, e.g. distinguish ”open the door” and ”close work to adaptively observe frame sequences and associate the door”. Different from them, our spatio-temporal region video contents with sentences. Further, Mithun and Lin et graph not only involves the implicit and explicit spatial sub- al. [26, 21] devise the weakly-supervised temporal local- graphs in each frame, but also includes a temporal dynamic ization methods requiring only coarse video-level annota- subgraph across frames. The spatial subgraphs can capture tions for training. And besides the natural language query, the region-level relationships by implicit or explicit seman- Zhang et al. [49] attempt to localize the unseen temporal tic interactions, and the temporal subgraph can incorporate clip according to an image query. Although these methods the dynamics and transformation of objects across frames to have achieved promising performance, they still remain in further improve the relation understanding. Next, we fuse the temporal grounding. We further explore spatio-temporal the textual clues into this spatio-temporal graph as the guid- grounding in this paper. ance, and develop the multi-step cross-modal graph reason- 2.2. Object Grounding in Images/Videos ing. The multi-step process can support the multi-order re- lation modeling like ”a man hug a baby wearing a red hat”. Visual grounding [14, 25, 36, 43, 27, 44, 6, 51, 13, After it, we introduce a spatio-temporal