Arxiv:2001.06891V3 [Cs.CV] 24 Mar 2020 Grounding Dataset Vidstg Based on Video Relation Dataset Given an Untrimmed Video and a Declarative/Interrogative Vidor
Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences ∗ Zhu Zhang1, Zhou Zhao1, Yang Zhao1, Qi Wang2, Huasheng Liu2, Lianli Gao3 1College of Computer Science, Zhejiang University 2Alibaba Group, 3University of Electronic Science and Technology of China fzhangzhu, zhaozhou, zhaoyangg@zju.edu.cn, fwq140362, fangkong.lhsg@alibaba-inc.com Abstract In this paper, we consider a novel task, Spatio-Temporal 25.12s 27.53s Video Grounding for Multi-Form Sentences (STVG). Given Declarative Sentence: A little boy with a Christmas hat is catching a yellow toy. an untrimmed video and a declarative/interrogative sen- Interrogative Sentence: What is caught by the squatting boy on the floor? tence depicting an object, STVG aims to localize the spatio- Figure 1. An example of STVG for multi-form sentences. temporal tube of the queried object. STVG has two chal- lenging settings: (1) We need to localize spatio-temporal object tubes from untrimmed videos, where the object may age, which has attracted much attention and made great only exist in a very small segment of the video; (2) We progress [14, 25, 6, 42, 39]. Recently, researchers begin deal with multi-form sentences, including the declarative to explore video grounding, including temporal grounding sentences with explicit objects and interrogative sentences and spatio-temporal grounding. Temporal sentence ground- with unknown objects. Existing methods cannot tackle the ing [8, 11, 48, 37] determines the temporal boundaries of STVG task due to the ineffective tube pre-generation and events corresponding to the given sentence, but does not the lack of object relationship modeling. Thus, we then localize the spatio-temporal tube (i.e., a sequence of bound- propose a novel Spatio-Temporal Graph Reasoning Net- ing boxes) of the described object.
[Show full text]