FFNet: Video Fast-Forwarding via Reinforcement Learning

Shuyue Lan 1, Rameswar Panda 2, Qi Zhu 1, Amit K. Roy-Chowdhury 2
1 Northwestern University. {slan@u., qzhu@}northwestern.edu
2 University of California, Riverside. {rpand002@, amitrc@ece.}ucr.edu

Abstract

For many applications with limited computation, communication, storage and energy resources, there is an imperative need of vision methods that could select an informative subset of the input video for efficient processing at or near real time. In the literature, there are two relevant groups of approaches: generating a "trailer" for a video, or fast-forwarding while watching/processing the video. The first group is supported by video summarization techniques, which require processing of the entire video to select an important subset for showing to users. In the second group, current fast-forwarding methods depend on either manual control or automatic adaptation of playback speed, which often do not present an accurate representation and may still require processing of every frame. In this paper, we introduce FastForwardNet (FFNet), a reinforcement learning agent that gets inspiration from video summarization and does fast-forwarding differently. It is an online framework that automatically fast-forwards a video and presents a representative subset of frames to users on the fly. It does not require processing the entire video, but just the portion that is selected by the fast-forward agent, which makes the process very computationally efficient. The online nature of our proposed method also enables users to begin fast-forwarding at any point of the video. Experiments on two real-world datasets demonstrate that our method can provide better representation of the input video (about 6%-20% improvement in coverage of important frames) with much less processing requirement (more than 80% reduction in the number of frames processed).

Figure 1. Overview of Our Proposed Method. Given a video stream, our FFNet decides which frame to process next and presents it to users while skipping the irrelevant frames in an online manner. The top row shows representative frames in the normal playing portion and the bottom row shows irrelevant frames in the fast-forwarding portion. Best viewed in color.

1. Introduction

Leveraging video input has become increasingly important in many intelligent Internet-of-Things (IoT) applications, such as environment monitoring, search and rescue, smart surveillance, and wearable devices. In these systems, a large amount of video needs to be collected and processed by users (human operators or autonomous agents), either locally or remotely through network transmission (or a combination of both). For better system performance, the processing often needs to be done at or near real time. On the other hand, the local nodes/devices typically have limited computation and storage capability and often run on batteries, while the communication network is constrained by bandwidth, speed and reliability [1, 19, 40]. Such discrepancy presents an urgent need for new vision methods that can automatically select an informative subset of the input video for processing, to reduce computation, communication and storage requirements and to conserve energy.

In the relevant literature, with the ever-expanding volume of video data, there is significant interest in video summarization techniques, which compute a short and informative subset of the original video for human consumption or further processing [4, 9, 26, 49, 50, 51]. However, these techniques require the processing of the entire video and often take a long time to generate the subset. There are also video fast-forwarding techniques where the playback speed of the video is adjusted to meet the needs of users [3, 10, 13, 31, 32, 34, 38], but they often do not present an accurate representation and may still require processing of the entire video. Neither type of approach is suitable for the resource-limited and time-critical systems discussed above. To address this problem, we started by asking the following question: Is it possible to develop a method for fast-forwarding through a video that is computationally efficient, causal, online, and results in informative segments which can be validated through statistical evaluation and user experience?

In this paper, we introduce FastForwardNet (FFNet), a reinforcement learning agent that gets inspiration from video summarization and does fast-forwarding differently from the state-of-the-art methods. It has an online framework that automatically fast-forwards a video and presents a selected subset of frames to users on the fly (see Fig. 1 for an example). The fast-forward agent does not require the processing of the entire video. This makes the process very computationally efficient, and communicationally efficient if the video (subset) needs to be transmitted over the network for remote processing. The online nature of our proposed FFNet enables users to begin fast-forwarding at any point when watching/processing videos. The causal nature of our FFNet ensures that it can work even as the video subset is being generated.

To summarize, the key advantage of our approach is that it automatically selects the most important frames without processing or even obtaining the entire video. Such capability can significantly reduce resource requirements and lower energy consumption, and is particularly important for resource-constrained and time-critical systems. The main technical contributions of this paper are as follows.

(1) We formulate video fast-forwarding as a Markov decision process (MDP), and propose FFNet for fast-forwarding a potentially very long video while presenting its important and interesting content on the fly.

(2) We propose an online framework to deal with incremental observations without requiring to store and process the entire video. At any point of the video, our approach can jump to potentially important future frames based on analysis of the past frames that had been selected.

(3) We demonstrate the effectiveness of our proposed FFNet on two standard challenging video summarization datasets, Tour20 [27] and TVSum [41], achieving real-time speed on all tested videos.

2. Related Work

Our work relates to three major research directions: video fast-forwarding, video summarization and reinforcement learning. Here, we focus on some representative methods that are closely related to our work.

Video Fast-Forward. Video fast-forward methods are typically used when users are losing patience to watch the entire video. Most commercial video players offer manual control of the playback speed; e.g., Apple QuickTime Player offers 2x, 5x and 10x multi-speed fast-forward. Current automatic fast-forward approaches mostly focus on adapting the playback speed based on either the similarity of each candidate clip to the query clip [31] or the motion activity patterns present in a video [3, 29, 30]. Some recent works use mutual information between frames to describe the fast-forward policy [11, 12], or use the shortest path over a graph that is constructed with semantic information extracted from frames [34, 38]. This family of methods is most relevant to our goal. A similar family of work (hyperlapse) [32, 10, 13] aiming at speed-up and smoothing has also been developed for creating fast-forwarded videos. In contrast to these prior works, we develop a deep reinforcement learning strategy for the fast-forward policy. Our proposed framework (FFNet) is an online and causal system that does not need the entire video to get the fast-forward policy, making it very efficient in terms of computation, communication and storage needs.

Video Summarization. The goal of video summarization is to produce a compact summary that contains the most important parts of a video. Much progress has been made to summarize a video using either supervised learning based on video-summary pairs [6, 9, 49, 50, 33, 26] or unsupervised approaches based on low-level visual indices [5, 8, 20, 7] (see reviews [24, 43]). Leveraging crawled web images or videos is another recent trend for video summarization [14, 15, 41, 28]. Closely related to video summarization, the authors in [37] develop a framework for creating storylines from photo albums.

Most relevant to our approach is the work on online video summarization, which compiles the most salient and informative portion of a video by automatically scanning through the video stream, in an online fashion, to remove repetitive and uninteresting content. Various strategies have been studied, including Gaussian mixture models [25], online dictionary learning [51], and submodular optimization [4]. Our approach significantly differs from these methods in that it only processes a subset of frames instead of the entire video. To the best of our knowledge, this is the first work to address video fast-forwarding in generating an informative summary from a video.

Reinforcement Learning. Apart from the recent success in playing Go and Atari games [23, 39], deep reinforcement learning (DRL) has also achieved promising performance in several vision tasks, such as object detection [21], visual tracking [48], pose estimation [17] and image captioning [35]. [47] employs a computationally intensive reinforcement learning strategy for action detection in short video clips. In contrast to [47], our framework is an online and causal system that enables users to begin fast-forwarding at any point while watching videos (online) and can work even as the summary is being generated (causal). The Markov Decision Process (MDP) has been widely used for several vision tasks. For example, in [42], the authors formulate policy learning as an MDP for activity recognition. Q-learning (a reinforcement learning method) is one way to solve MDP problems [45]. In the proposed FFNet, we use a multi-layer neural network to represent the Q-value function, similar to [36, 46]. We are not aware of any prior work in reinforcement learning that deals with fast-forwarding while summarizing long-duration videos.

3. Methodology

In this section, we provide the details of FFNet. We start with an overview of our approach in Sec. 3.1, present detailed formulations in Sec. 3.2 and Sec. 3.3, and then explain the training algorithm in Sec. 3.4.

3.1. Solution Overview

Our goal is to fast-forward a long video sequence by skipping unimportant frames in an online and real-time fashion (see Fig. 1). Given the current frame being processed, the goal of FFNet is to decide the number of frames to skip next. The frames within the skipping interval will not be processed. The video frames we present to users then include the frames processed by FFNet and their neighboring windows (which are not processed).

We formulate the above fast-forwarding problem using a Markov decision process (MDP) and develop our FFNet as a reinforcement learning agent, i.e., a Q-learning agent that learns a policy to skip unimportant frames and present the important ones to users. During test time, given a raw video, fast-forwarding is a sequential process. At each step k = 1, ..., K of an episode, we process the current frame, decide how many future frames to skip, and jump to the frame after the skipped ones for the next processing. We present the processed frames and their neighboring ones (with processed frames as window centroids) to users as important subsets of the video.

3.2. MDP Formulation for FFNet

We consider fast-forwarding as a control problem that can be formulated as an MDP with the following elements.

State: A state s_k describes the current environment at the k-th step of the episode. Given a video sequence, we consider a single frame as a state, defined in terms of the extracted feature vector of the current frame.

Action: An action a_k is performed at step k by the system and leads to an update of the state. We define a discrete set of possible actions A = {a^1, a^2, ..., a^M}, which represents the possible numbers of frames to skip.

Reward: An immediate reward r_k = r(s_k, a_k, s_{k+1}) is received by the system when it transits from one state s_k to another state s_{k+1} after taking action a_k (Sec. 3.3). The accumulated reward is then defined as

R = \sum_k \gamma^{k-1} r_k = \sum_k \gamma^{k-1} r(s_k, a_k, s_{k+1})   (1)

where γ ∈ [0, 1] denotes the discount factor for the rewards in the future.

Policy: The policy π determines the action to be chosen in every state visited by the system, i.e., it selects the action that maximizes the expected accumulated reward for current and future actions as

\pi(s_k) = \arg\max_a E[R \mid s_k, a, \pi]   (2)

In this case, the policy in FFNet decides how many frames to skip when the system is at a certain frame (state).

3.3. Design of the Immediate Reward

In this part, we introduce the definition of the immediate reward r_k for a_k in state s_k. For a raw video available in the training set, we assume each frame i has a binary label l(i): l(i) = 1 indicates that frame i is an important frame, and l(i) = 0 means it is an unimportant one.

Given a video and its labels, we define the immediate reward as follows:

r_k = -SP_k + HR_k   (3)

The immediate reward consists of two parts that model the "skip" penalty (SP) and the "hit" reward (HR), as explained below.

First, SP_k in Eqn. (3) defines the penalty for skipping the interval t_k in step k:

SP_k = \frac{\sum_{i \in t_k} 1(l(i)=1)}{T} - \beta \frac{\sum_{i \in t_k} 1(l(i)=0)}{T}   (4)

where 1(·) is an indicator function that equals 1 if the condition holds. T is the largest number of frames we may skip, used as a normalization term. β ∈ [0, 1] is a trade-off factor between the penalty for skipping important frames and the reward for skipping unimportant frames.

Then, the second term HR_k in Eqn. (3) defines the reward for jumping to an important frame or a position near an important frame. To model this reward, we first transfer the one-frame label to a Gaussian distribution in a time window. More specifically, a frame i will have a reward effect on the positions in its nearby window, defined as

f_i(t) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(t-i)^2}{2\sigma^2}\right), \quad t \in [i-w, i+w]   (5)

where w controls the window size of the Gaussian distribution. In the experiment section, we set σ = 1, w = 4. The reason for this transfer is that the reward should be given if the agent jumps to a position that is close to an important frame; to some extent, it jumps to a potentially important area. Assume that in time step k the agent jumps to the z-th frame in the original video. Based on the above definition, HR_k is computed as

HR_k = \sum_{i=z-w}^{z+w} 1(l(i)=1) \cdot f_i(z)   (6)
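To make Eqns. (3)-(6) concrete, the following is a minimal sketch of the immediate reward for a single step, assuming `labels` is the per-frame binary importance vector l(i), `skipped` holds the frame indices in the skipped interval t_k, and `z` is the frame the agent lands on. The function name and the defaults (T = 25, β = 0.8, w = 4, σ = 1, matching the settings reported in the experiments) are illustrative rather than taken from the authors' code.

```python
import numpy as np

def immediate_reward(labels, skipped, z, T=25, beta=0.8, w=4, sigma=1.0):
    """Sketch of the FFNet immediate reward r_k = -SP_k + HR_k (Eqns. 3-6)."""
    labels = np.asarray(labels)
    skipped = np.asarray(skipped, dtype=int)

    # Skip penalty (Eqn. 4): penalize skipped important frames,
    # reward (weighted by beta) skipped unimportant ones; T normalizes.
    n_important = int(np.sum(labels[skipped] == 1)) if len(skipped) else 0
    n_unimportant = int(np.sum(labels[skipped] == 0)) if len(skipped) else 0
    sp = n_important / T - beta * n_unimportant / T

    # Hit reward (Eqns. 5-6): each important frame i spreads a Gaussian
    # bump over [i-w, i+w]; sum the bumps evaluated at the landing frame z.
    hr = 0.0
    lo, hi = max(0, z - w), min(len(labels) - 1, z + w)
    for i in range(lo, hi + 1):
        if labels[i] == 1:
            hr += np.exp(-(z - i) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

    return -sp + hr
```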


Figure 2. Model of our FFNet. We learn a strategy for fast-forwarding videos. At each time step k, we use the Q network to select the action a_k, i.e., the number of frames to skip next. The state s_{k+1} is updated with the frame it jumps to. Then, the reward r_k for action a_k in state s_k is computed with the interval annotation g_k. A transition in the quadruple form (s_k, a_k, s_{k+1}, r_k) is used to update the Q network.
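As a rough illustration of the loop in Fig. 2 at test time, the snippet below steps through a video stream with a trained Q network, when only action selection (no reward or update) is needed. Here `extract_feature` and `q_net` are placeholder callables for the frame feature extractor and the learned Q-value MLP, and mapping the argmax index i to a skip of i+1 frames is an assumption about how the discrete actions are ordered, not a detail taken from the paper.

```python
import numpy as np

def fast_forward(frames, extract_feature, q_net):
    """Sketch of online fast-forwarding with a trained Q network.

    frames: sequence of raw video frames (available as a stream).
    extract_feature: frame -> feature vector (the state s_k).
    q_net: feature vector -> array of Q-values, one per skip action.
    Returns the indices of processed frames; presented frames would also
    include a small neighboring window around each processed frame.
    """
    k = 0
    processed = []
    while k < len(frames):
        processed.append(k)
        state = extract_feature(frames[k])       # state s_k
        q_values = q_net(state)                  # one Q-value per action
        skip = int(np.argmax(q_values)) + 1      # action a_k: frames to skip
        k += skip + 1                            # jump to the next frame to process
    return processed
```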

3.4. Learning the Fast-Forwarding Policy

During the operation of FFNet, we want to maximize the accumulated reward R in Eqn. (1). Our goal is to find an optimal policy π* that maps the state to the corresponding action to fulfill this requirement. With Q-learning, we evaluate the value of an action, E[R | s, a, π], as Q(s, a). In the classical Q-learning method, the Q-value is updated by

Q_{k+1}(s_k, a_k) = (1-\alpha) Q_k(s_k, a_k) + \alpha \left( r_k + \gamma \max_{a_{k+1}} Q_k(s_{k+1}, a_{k+1}) \right)   (7)

where α ∈ (0, 1] represents the learning rate during the training process.

In this problem, we have finite actions but infinite states. No direct assignment of Q-values can be made, thus we use a neural network to approximate the Q-value. The Q-function in this work is modeled by a multilayer perceptron (MLP) structure similar to [46]. The input is the current state vector, and the output is a vector of the estimated Q-value for each action given the state. The optimal value of the accumulated reward in time step k is achieved by taking action a_k and represented as Q*(s_k, a_k), which can be calculated by the Bellman equation in a recursive fashion:

Q^*(s_k, a_k) = r_k + \gamma \max_{a_{k+1}} Q^*(s_{k+1}, a_{k+1})   (8)

where γ is the same discount factor as in the definition of the accumulated reward in Eqn. (1). Note that when using gradient descent, Eqn. (8) is consistent with the Q-learning update equation Eqn. (7).

With the above update equation, we use the mean squared error between the target Q-value and the output of the MLP as the loss function. During the training process, we apply an ε-greedy strategy to better explore the state space, which picks a random action with probability ε and the action with the maximum Q*(s, a) with probability 1-ε.

The model of our FFNet is shown in Fig. 2. Given a video, the fast-forward agent starts from the first frame. The FFNet Q is initialized with random parameters. For the current frame in time step k, we first extract the feature vector to get the state s_k. Based on the current Q network and the state s_k, one action a_k is chosen using the ε-greedy strategy, and the agent jumps to a new frame based on the action. Then the current state transits to s_{k+1}, i.e., the feature extracted from the new current frame. With the interval labels g_k of the video, we compute the immediate reward r_k for performing this action. The transition (s_k, a_k, s_{k+1}, r_k) is then used to update the Q network. More details about our training algorithm are presented in Algorithm 1.

Algorithm 1 Training Algorithm for FFNet
1: Input: a set of videos {V} and annotations {G}
2: Output: Q-value neural network Q
3: Init MLP() -> Q
4: Initialize: memory M = [empty], explore rate ε = 1
5: for i = 1 to N do
6:   Training Video Selection(V, G) -> v_i, g_i
7:   frame_curr = 0
8:   Process(frame_curr) -> s_curr
9:   while frame_curr < length(v_i) do
10:     Select action a_curr from Q(s_curr) with ε-greedy
11:     frame_next = frame reached by taking a_curr from frame_curr
12:     Process(frame_next) -> s_next
13:     Reward(g_i, a_curr, frame_next) -> r_curr (Eqn. 3)
14:     Store transition (s_curr, a_curr, s_next, r_curr) in M
15:     frame_curr = frame_next, s_curr = s_next
16:     if |M| > batchsize then
17:       Training(M, Q) -> Q
18:       ε = max(ε - Δε, ε_min)
19:       Empty(M)
20:     end if
21:   end while
22: end for
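The following is a minimal sketch of the Q network and one batch update consistent with Eqns. (7)-(8): a 4-layer MLP with 25 outputs (one Q-value per skip action), an ε-greedy action selector, and a mean-squared-error loss against the Bellman target, roughly playing the role of the Training(M, Q) step above. The hidden width, optimizer choice and helper names are assumptions for illustration, not the authors' released implementation.

```python
import numpy as np
import tensorflow as tf

NUM_ACTIONS = 25   # skip 1..25 frames
GAMMA = 0.8        # discount factor, as in the implementation details

def build_q_net(feature_dim, hidden=128):
    # 4-layer MLP: state feature in, one Q-value per skip action out.
    # The hidden width is an assumption; the paper only states "4-layer".
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(feature_dim,)),
        tf.keras.layers.Dense(hidden, activation="relu"),
        tf.keras.layers.Dense(hidden, activation="relu"),
        tf.keras.layers.Dense(hidden, activation="relu"),
        tf.keras.layers.Dense(NUM_ACTIONS, activation="linear"),
    ])

def epsilon_greedy(q_net, state, epsilon):
    # Random action with probability epsilon, greedy action otherwise.
    if np.random.rand() < epsilon:
        return np.random.randint(NUM_ACTIONS)
    q = q_net(state[np.newaxis, :].astype(np.float32)).numpy()[0]
    return int(np.argmax(q))

def train_on_transitions(q_net, optimizer, batch):
    # batch: list of tuples (s_k, a_k, s_k+1, r_k) collected by the agent.
    states = np.stack([t[0] for t in batch]).astype(np.float32)
    actions = np.array([t[1] for t in batch])
    next_states = np.stack([t[2] for t in batch]).astype(np.float32)
    rewards = np.array([t[3] for t in batch], dtype=np.float32)

    # Bellman target (Eqn. 8): r_k + gamma * max_a Q(s_k+1, a).
    next_q = q_net(next_states).numpy()
    targets = q_net(states).numpy()
    targets[np.arange(len(batch)), actions] = rewards + GAMMA * next_q.max(axis=1)

    # Minimize the MSE between predicted Q-values and the targets
    # (Eqn. 7 realized via gradient descent).
    with tf.GradientTape() as tape:
        pred = q_net(states)
        loss = tf.reduce_mean(tf.square(pred - targets))
    grads = tape.gradient(loss, q_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, q_net.trainable_variables))
    return float(loss)
```

In a setup like Algorithm 1, transitions would be accumulated in a replay memory (128 transitions in the implementation details) and `train_on_transitions` would be called once the memory exceeds the batch size, for example with `tf.keras.optimizers.Adam` as the optimizer.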

4. Experiments

In this section, we present extensive experiments and comparisons to demonstrate the effectiveness and efficiency of our proposed framework for fast-forwarding videos.

4.1. Experimental Setup

Datasets. We conduct experiments on two publicly available summarization datasets, namely Tour20 [27] and TVSum [41]. Both datasets are very diverse. Tour20 consists of 140 videos of about 20 tourist attractions selected from the Tripadvisor travelers' choice landmarks 2015 list. TVSum contains 50 videos downloaded from YouTube in 10 categories, as defined in the TRECVid Multimedia Event Detection task. To the best of our knowledge, Tour20 is the largest publicly available summarization dataset, with 140 videos totaling about 7 hours. Both datasets provide multiple user-annotated summaries for each video. For the Tour20 dataset, we combine all three user summaries as the human-created summary (labels for training). For the TVSum dataset, we first average the frame-level importance scores to compute shot-level scores, and then select the top 20% of shots for each video as the human-created summary.

Implementation Details. Our FFNet is implemented using the TensorFlow library on a Tesla K80 GPU. We use a 4-layer neural network to approximate the Q function. The discount factor γ for the rewards in the future is set as 0.8. The exploration rate for Q-learning decays from 1 and stops at 0.1, with a rate of 0.00001. The memory size is set as 128 transitions. We train the Q network up to 1000 epochs for Tour20 and 800 epochs for TVSum.

Performance Measures. Similar to [9], we report a coverage metric at the video segment level, which measures how well the results of fast-forward methods cover the important frames in the ground truth obtained from human labeling. More specifically, a segment selected by a method is considered as a true positive if the number of important frames it covers (based on the ground truth labels) exceeds a certain threshold called the hit number. We evaluate on different hit numbers ranging from 1 to 20 throughout our experiments. It is important to note that for the intelligent applications we target (e.g., smart surveillance, search and rescue), when measuring system performance, covering the important frames is more critical than skipping the unimportant ones, since such coverage determines the system's capability to identify important events and possibly react to them.

Compared Methods. We compare our approach with online k-means (OK) [2], spectral clustering (SC) [44], Microsoft Hyperlapse (MH) [13], LiveLight (LL) [51], and sparse modeling representative selection (SMRS) [5]. We run MH at a 4x speed-up rate. For online k-means (OK) and spectral clustering (SC), we set the number of clusters to 20, as in [28]. In LiveLight (LL), the dictionary is initialized as the first 10% of segments in a video. For FFNet, we use an action space of 25, i.e., skipping from 1 to 25 frames, and the trade-off factor β is set to 0.8 throughout the experiments. For each dataset, we randomly select 80% of the videos for training and use the remaining 20% for testing. We run 5 rounds of experiments and report the average performance.
4.2. Coverage Evaluation

Fig. 3 and Fig. 4 show the mean segment-level coverage achieved by different methods on the Tour20 and TVSum datasets, respectively. Each point in these figures represents the segment-level coverage achieved by an algorithm given a certain hit number. For example, in Fig. 3, our proposed FFNet achieves a segment-level coverage of about 90% for a hit number of 10. This means that for about 90% of the segments selected in the ground truth (i.e., the important segments), at least 10 frames in each of them are selected/covered by FFNet. When the hit number is smaller than or equal to 7, the coverage of FFNet is 100%, i.e., every important segment has at least 7 frames selected by FFNet.

Figure 3. Segment-level coverage on the Tour20 dataset with different hit number thresholds. Our FFNet (red line on top) outperforms all other methods by a significant margin.

Figure 4. Segment-level coverage on the TVSum dataset under different hit number thresholds. Our FFNet (red line on top) outperforms all other methods by a significant margin.

When comparing FFNet with other methods in Fig. 3, we have the following observations:

• For smaller hit numbers (say, under 10), our approach achieves excellent coverage (90% or above) and significantly outperforms all other methods (about 10%-20% better). This shows that the subset selected by FFNet is able to provide more complete coverage of the important information throughout the video stream.

• As expected, the coverage of any method goes down with the increase of the hit number requirement. Nevertheless, for larger hit numbers (say, 10-20), our FFNet still outperforms all other methods. This shows its consistency in providing better coverage performance.

Similar results can be seen in Fig. 4 for the TVSum dataset. Notice that for all methods (including ours), performance on Tour20 is not as good as on TVSum. We believe the difference is due to the fact that the Tour20 dataset contains some videos capturing static objects and taken from a fixed camera. In this case, the state at each time step in our MDP is the same, which may confuse the Q-learning agent.

Comparison with State-of-the-Art Summarization Methods: We additionally compare our FFNet with state-of-the-art video summarization methods [9, 28, 50] and one supervised learning baseline (Sup, implemented as regression) without reinforcement learning. Limited by space, we only present coverage at a hit number of 10 in Table 1. Note that we are only able to compare with [50] on the TVSum dataset, as the pre-trained model is made publicly available by the authors. Our approach outperforms all the baselines by a significant margin, showing that the summary selected by FFNet is able to provide more complete coverage of the important information throughout the video stream. The improvement over the Sup baseline shows the advantage of the longer time horizon of the reinforcement learning policy in fast-forwarding videos.

Dataset   [9]     [28]    [50]    Sup     FFNet
Tour20    0.754   0.826   -       0.685   0.893
TVSum     0.738   0.877   0.553   0.526   0.941

Table 1. Coverage achieved by different methods at hit number 10. "Sup" represents the supervised learning baseline with supervision being the number of frames to jump to the next informative frame as per the ground truth. Our approach performs the best.

Qualitative Results. Fig. 5 demonstrates a qualitative example from the Tour20 dataset (see the supplementary file for more such examples). It clearly demonstrates that our FFNet is able to fast-forward through the unimportant parts and find the most important/relevant parts of a video, and is close to the ground truth (human-created summaries). At the top of Fig. 5 are the representative video segments selected by our approach. The second row is the ground truth (GT). The remaining rows represent the segments selected by the other methods for the same video. At the beginning, our policy takes larger steps to skip frames that show only clouds without any interesting events. Once the roadside scenes (e.g., shopping area, walking tourists) start, the model begins to take small steps and presents most of the original segments. To summarize, we observe the following.

• For most of the important parts, our FFNet chooses not to skip and presents most of the original segments.

• For unimportant parts, FFNet takes larger jumping steps and smoothly skips frames.

There are also some limitations of our model. Fig. 6 shows a failure case of FFNet. This video records a water fountain scene near Burj Khalifa, captured by a nearly static camera. From beginning to end, the frames are almost the same, except for the change of the water fountain shape and some moving pedestrians. Our FFNet is able to emphasize several segments, but they do not match well with the ground truth. We believe this is because the Q agent gets a similar state after each transition, which makes it confused about the pattern of the fast-forwarding policy for this particular video. We expect our approach could be made more robust to handle such videos by explicitly using semantic analysis [22], and it could also benefit from domain adaptation techniques [18] for more challenging datasets.
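To make the segment-level coverage numbers in Figs. 3-4 and Table 1 concrete, here is a small sketch of the metric under the assumption that the ground truth is a list of important segments given as frame-index ranges and a method's output is the set of frame indices it keeps; a segment counts as covered once at least `hit_number` of its frames are selected. The function name and data layout are illustrative, not the authors' evaluation code.

```python
def segment_coverage(important_segments, selected_frames, hit_number):
    """Fraction of ground-truth important segments covered by a selection.

    important_segments: list of (start, end) frame-index ranges (inclusive).
    selected_frames: frame indices kept by a fast-forwarding method
                     (processed frames plus their neighboring windows).
    hit_number: minimum number of selected frames a segment must contain
                to count as covered (evaluated for thresholds 1..20 here).
    """
    selected = set(selected_frames)
    covered = 0
    for start, end in important_segments:
        hits = sum(1 for f in range(start, end + 1) if f in selected)
        if hits >= hit_number:
            covered += 1
    return covered / len(important_segments) if important_segments else 0.0
```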

Figure 5. Exemplar summaries generated while fast-forwarding a video of Machu Picchu from the Tour20 dataset. The frames on top represent segments in our FFNet fast-forwarding result. The rows below illustrate the selected portions using different methods. The X-axis is the frame index over time. Notice that the segments selected by FFNet contain most of the important content labeled in the ground truth, including the roadside scene at the starting point, shopping area, walking tourists, different angles of the natural environment near the attraction, and the main citadel with zoom-in and zoom-out views. Figure is best viewed in color.

Figure 6. A failure case of FFNet. Six frames represent the six important segments in the ground truth. The ground truth selection is illustrated in the top row in red, and the selection from FFNet is illustrated below in green (figure is best viewed in color).

Effect of Window Size: We test our approach on the TVSum dataset with 3 cases of the window size w in HR_k, set to 2, 3, and 4. Fig. 7 shows that the window size has little effect on the performance, indicating that our method is robust to the change in window size.

Figure 7. Effect of window size in the reward. As can be seen, it has little effect on the performance. Best viewed in color.

4.3. User Study

In addition to the above quantitative analysis, we performed a subjective evaluation study involving four human subjects to assess the quality of the selected video frames from different methods. We choose a random subset of videos from each dataset, and run every method on them. All participants are asked to rate the overall quality of each selected subset of video frames by assigning a rating from 1 to 10, where 1 corresponds to "The selected frames are not at all informative in covering the important content from the original video" and 10 corresponds to "The selected frames are extremely informative in covering the important content from the original video". For each video, the human rating is computed as the averaged rating from all participants (see the supplementary material for more details).

Dataset   OK     SC     MH     LL     SMRS   FFNet
Tour20    7.96   8.18   8.49   5.28   4.18   8.70
TVSum     7.30   7.01   8.10   4.56   3.10   8.95

Table 2. Human ratings for selected video frames from different methods. The rating for each method is generated by averaging the ratings from all participants. Higher scores indicate better coverage of the important content. Our FFNet achieves the highest rating on both the Tour20 and TVSum datasets.

Table 2 shows the average ratings for both the Tour20 and TVSum datasets. For both datasets, consistent with the quantitative analysis results, our FFNet outperforms all other methods in covering the important content.

4.4. Processing Efficiency

All prior methods (OK, MH, LL, SC, SMRS) require processing the entire video (100%). In contrast, our FFNet does not process the frames it skips over. On average, it only processes 18.67% of the video frames, which could greatly improve computation efficiency, reduce resource requirements, and lower energy consumption. Note that the requirements on storage and communication are also reduced, but not as much. This is because the neighboring windows of the processed frames are also considered as important for users, and should be stored and transmitted (if needed).

In Fig. 8, we take the video MC10 in the Tour20 dataset as an example to illustrate the processing percentage over time for different methods. Microsoft Hyperlapse (MH), spectral clustering (SC) and SMRS are offline methods that process the entire video. Online k-means (OK) takes frames up to the current time, and its processing percentage is linear with respect to the frame number. LiveLight (LL) updates every 50 frames, therefore its processing percentage has a step-wise shape over time. Our FFNet processes the frames at a dynamic speed based on the video content, and eventually only needs to process about 26% of the total video frames.

Figure 8. Example of frame processing percentage over time for different methods. We take the video MC10 (867 frames) in the Tour20 dataset as an example. Offline methods MH, SC and SMRS need to process the entire video before generating the subset. For online methods OK and LL, the processing percentage increases with time and reaches 100% in the end. Our FFNet processes the frames based on the video content and eventually only needs to process about 26% of the total frames.

On a Tesla K80 GPU, the average processing time per frame of FFNet is 8.9357 × 10^-3 s, which indicates an average frame rate of 112 fps. Most of the processing time is devoted to feature extraction; the fast-forward process itself only takes 11.28% of the time. On less-capable embedded processors, we may not achieve such a high frame rate, but the low processing percentage should still help improve computation efficiency and achieve near real-time speed.

We also analyze the average running time of different methods and observe that our approach is significantly faster than the compared baselines. For example, on a random subset of videos from the TVSum dataset, our proposed FFNet takes only 0.71s to achieve 97% coverage (at hit number 10), while the second fastest baseline SC takes 3.99s to achieve 58% coverage and the third fastest baseline OK takes 11.58s with 71.27% coverage.

Supplementary Material. Additional results and discussions along with qualitative summaries are included in the supplementary material. We also provide details on the datasets and user study in the supplementary material.

5. Conclusion

In this paper, we present a supervised framework (FFNet) for fast-forwarding videos in an online fashion, by modeling the fast-forwarding operation as a Markov decision process and solving it with a Q-learning method. Quantitative and qualitative results demonstrate that FFNet outperforms multiple baseline methods in both performance and efficiency. It provides an informative subset of video frames that has better coverage of the important content in the original video. At the same time, it only processes a small percentage of video frames, which improves computation efficiency and reduces requirements on various resources. In the future, we plan to work on integrating this method with practical system constraints like energy and available bandwidth. It would also be interesting to extend our approach by introducing memory in the form of LSTMs; we leave this as part of the future work.

Acknowledgements: This work was partially supported by NSF grants CNS-1544969, IIS-1724341 and CCF-1553757. The work was primarily carried out while authors Lan and Zhu were at UC Riverside.

References

[1] I. F. Akyildiz, T. Melodia, and K. R. Chowdhury. A survey on wireless multimedia sensor networks. Computer Networks, 51(4):921-960, 2007.
[2] D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, 2007.
[3] K.-Y. Cheng, S.-J. Luo, B.-Y. Chen, and H.-H. Chu. Smartplayer: user-centric video fast-forwarding. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 789-798, 2009.
[4] E. Elhamifar and M. C. D. P. Kaluza. Online summarization via submodular and convex optimization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[5] E. Elhamifar, G. Sapiro, and R. Vidal. See all by looking at a few: Sparse modeling for finding representative objects. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
[6] B. Gong, W. Chao, K. Grauman, and F. Sha. Diverse sequential subset selection for supervised video summarization. In Advances in Neural Information Processing Systems (NIPS), 2014.
[7] G. Guan, Z. Wang, S. Mei, M. Ott, M. He, and D. D. Feng. A top-down approach for video summarization. ACM Transactions on Multimedia Computing, Communications, and Applications, 11(1):4, 2014.
[8] M. Gygli, H. Grabner, H. Riemenschneider, and L. Van Gool. Creating summaries from user videos. In European Conference on Computer Vision (ECCV), 2014.
[9] M. Gygli, H. Grabner, and L. Van Gool. Video summarization by learning submodular mixtures of objectives. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[10] T. Halperin, Y. Poleg, C. Arora, and S. Peleg. Egosampling: Wide view hyperlapse from egocentric videos. IEEE Transactions on Circuits and Systems for Video Technology, 2017.
[11] J. Jiang and X.-P. Zhang. A new player-enabled rapid video navigation method using temporal quantization and repeated weighted boosting search. In Computer Vision and Pattern Recognition Workshops (CVPRW), IEEE Computer Society Conference on, 2010.
[12] J. Jiang and X.-P. Zhang. A smart video player with content-based fast-forward playback. In Proceedings of the 19th ACM International Conference on Multimedia, 2011.
[13] N. Joshi, W. Kienzle, M. Toelle, M. Uyttendaele, and M. F. Cohen. Real-time hyperlapse creation via optimal frame selection. ACM Transactions on Graphics, 34(4):63, 2015.
[14] A. Khosla, R. Hamid, C.-J. Lin, and N. Sundaresan. Large-scale video summarization using web-image priors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
[15] G. Kim, L. Sigal, and E. P. Xing. Joint summarization of large-scale collections of web images and videos for storyline reconstruction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[16] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.
[17] A. Krull, E. Brachmann, S. Nowozin, F. Michel, J. Shotton, and C. Rother. Poseagent: Budget-constrained 6d object pose estimation via reinforcement learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[18] B. Kulis, K. Saenko, and T. Darrell. What you saw is not what you get: Domain adaptation using asymmetric kernel transforms. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011.
[19] T. Ma, M. Hempel, D. Peng, and H. Sharif. A survey of energy-efficient compression and communication techniques for multimedia in resource constrained systems. IEEE Communications Surveys & Tutorials, 15(3):963-972, 2013.
[20] Y. F. Ma, X. S. Hua, and H. J. Zhang. A generic framework of user attention model and its application in video summarization. IEEE Transactions on Multimedia, 2005.
[21] S. Mathe, A. Pirinen, and C. Sminchisescu. Reinforcement learning for visual object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[22] T. Mei, L.-X. Tang, J. Tang, and X.-S. Hua. Near-lossless semantic video summarization and its applications to video analysis. ACM Transactions on Multimedia Computing, Communications, and Applications, 9(3):16, 2013.
[23] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529-533, 2015.
[24] A. G. Money and H. Agius. Video summarisation: A conceptual framework and survey of the state of the art. Journal of Visual Communication and Image Representation, 19(2):121-143, 2008.
[25] S.-H. Ou, C.-H. Lee, V. S. Somayazulu, Y.-K. Chen, and S.-Y. Chien. Low complexity on-line video summarization with gaussian mixture model based clustering. In Acoustics, Speech and Signal Processing (ICASSP), IEEE International Conference on, 2014.
[26] R. Panda, A. Das, Z. Wu, J. Ernst, and A. K. Roy-Chowdhury. Weakly supervised summarization of web videos. In IEEE International Conference on Computer Vision (ICCV), 2017.
[27] R. Panda, N. C. Mithun, and A. Roy-Chowdhury. Diversity-aware multi-video summarization. IEEE Transactions on Image Processing, 26(10):4712-4724, 2017.
[28] R. Panda and A. K. Roy-Chowdhury. Collaborative summarization of topic-related videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[29] K. A. Peker, A. Divakaran, et al. An extended framework for adaptive playback-based video summarization. In Internet Multimedia Management Systems IV, 2003.

[30] K. A. Peker, A. Divakaran, and H. Sun. Constant pace skimming and temporal sub-sampling of video using motion activity. In IEEE International Conference on Image Processing (ICIP), 2001.
[31] N. Petrovic, N. Jojic, and T. S. Huang. Adaptive video fast forward. Multimedia Tools and Applications, 26(3):327-344, 2005.
[32] Y. Poleg, T. Halperin, C. Arora, and S. Peleg. Egosampling: Fast-forward and stereo for egocentric videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[33] D. Potapov, M. Douze, Z. Harchaoui, and C. Schmid. Category-specific video summarization. In European Conference on Computer Vision (ECCV), 2014.
[34] W. L. Ramos, M. M. Silva, M. F. Campos, and E. R. Nascimento. Fast-forward video based on semantic extraction. In IEEE International Conference on Image Processing (ICIP), 2016.
[35] Z. Ren, X. Wang, N. Zhang, X. Lv, and L.-J. Li. Deep reinforcement learning-based image captioning with embedding reward. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[36] M. Riedmiller. Neural fitted Q iteration - first experiences with a data efficient neural reinforcement learning method. In European Conference on Machine Learning (ECML), 2005.
[37] G. A. Sigurdsson, X. Chen, and A. Gupta. Learning visual storylines with skipping recurrent neural networks. In European Conference on Computer Vision (ECCV), 2016.
[38] M. M. Silva, W. L. S. Ramos, J. P. K. Ferreira, M. F. M. Campos, and E. R. Nascimento. Towards semantic fast-forward and stabilized egocentric videos. In European Conference on Computer Vision (ECCV), 2016.
[39] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484-489, 2016.
[40] D. Singh, G. Tripathi, and A. J. Jara. A survey of internet-of-things: Future vision, architecture, challenges and services. In Internet of Things (WF-IoT), 2014 IEEE World Forum on, 2014.
[41] Y. Song, J. Vallmitjana, A. Stent, and A. Jaimes. Tvsum: Summarizing web videos using titles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[42] Y.-C. Su and K. Grauman. Leaving some stones unturned: dynamic feature prioritization for activity detection in streaming video. In European Conference on Computer Vision (ECCV), 2016.
[43] B. T. Truong and S. Venkatesh. Video abstraction: A systematic review and classification. ACM Transactions on Multimedia Computing, Communications, and Applications, 3(1):3, 2007.
[44] U. Von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395-416, 2007.
[45] C. J. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279-292, 1992.
[46] T. Wei, Y. Wang, and Q. Zhu. Deep reinforcement learning for building HVAC control. In Proceedings of the 54th Annual Design Automation Conference, 2017.
[47] S. Yeung, O. Russakovsky, G. Mori, and L. Fei-Fei. End-to-end learning of action detection from frame glimpses in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[48] S. Yun, J. Choi, Y. Yoo, K. Yun, and J. Young Choi. Action-decision networks for visual tracking with deep reinforcement learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[49] K. Zhang, W.-L. Chao, F. Sha, and K. Grauman. Summary transfer: Exemplar-based subset selection for video summarization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[50] K. Zhang, W.-L. Chao, F. Sha, and K. Grauman. Video summarization with long short-term memory. In European Conference on Computer Vision (ECCV), 2016.
[51] B. Zhao and E. P. Xing. Quasi real-time summarization for consumer videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
