Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17)

Event Video Mashup: From Hundreds of Videos to Minutes of Skeleton

Lianli Gao,1 Peng Wang,2 Jingkuan Song,3 Zi Huang,2 Jie Shao,1 Heng Tao Shen1
1University of Electronic Science and Technology of China, Chengdu 611731, China.
2The University of Queensland, QLD 4072, Australia.
3Columbia University, NY 10027, USA.
{lianli.gao, shaojie}@uestc.edu.cn, {p.wang6, huang}@itee.uq.edu.au, [email protected], [email protected]

Abstract

The explosive growth of video content on the Web has been revolutionizing the way people share, exchange and perceive information, such as events. While an individual video usually concerns a specific aspect of an event, videos uploaded by different users at different locations and times can carry different emphases and complement each other in describing the event. Combining these videos from different sources can unveil a more complete picture of the event. Simply concatenating the videos is an intuitive solution, but it degrades the user experience, since viewing such highly redundant, noisy and disorganized content is time-consuming and tedious. Therefore, we develop a novel approach, termed event video mashup (EVM), to automatically generate a unified short video from a collection of Web videos that describes the storyline of an event. We propose a submodular content selection model that embodies both importance and diversity, so as to depict the event from comprehensive aspects in an efficient way. Importantly, the video content is organized temporally and semantically, conforming to the evolution of the event. We evaluate our approach on a real-world YouTube event dataset collected by ourselves, and extensive experimental results demonstrate the effectiveness of the proposed framework.

[Figure 1: Overview of Event Video Mashup]

Introduction

In recent years, we have witnessed the explosive growth of Web video content on the Internet, owing to the tremendous advances in video-sharing platforms, digital cameras, fast networks and massive storage. Users are often overwhelmed by the unstructured videos and in danger of getting lost in the video data world. It is therefore increasingly important to automatically summarize a large collection of Web videos in an efficient yet comprehensive way.

Inspired by text and image summarization (Mason et al. 2016; Liu, Yu, and Deng 2015; Singla, Tschiatschek, and Krause 2016; Lin and Bilmes 2011), video summarization (Truong and Venkatesh 2007) was proposed as an early attempt to solve this problem. Based on some designed criteria, video summarization typically selects a subset of frames and/or subshots from a long-running video to form a short summary. Traditional summarization methods usually rely on low-level cues to determine the importance of video segments (Ma et al. 2005; Ngo, Ma, and Zhang 2005), or formulate summarization as a reconstruction problem (Zhao and Xing 2014) in which sparse coding is employed to remove redundant content from a single video. More recently, based on the assumption that Web images tend to capture subjects from canonical viewpoints, (Khosla et al. 2013; Kim, Sigal, and Xing 2014; Song et al. 2015) utilize Web image priors to help extract meaningful subsets of user videos, and Meng et al. (2016) select representative objects to form the summary. Apart from these unsupervised methods, the works in (Potapov et al. 2014; Gong et al. 2014; Zhang et al. 2016) formulate video summarization as a supervised content selection problem and use prior knowledge to train a system that determines the relevance of video content. However, all of the above methods focus on summarizing a single video.

There are some recent multiple-video summarization works (Chu, Song, and Jaimes 2015; Yeo, Han, and Han 2016; Hong et al. 2011; Tan, Tan, and Ngo 2010; Wang et al. 2012). However, the video co-summarization methods (Chu, Song, and Jaimes 2015; Yeo, Han, and Han 2016) focus only on extracting visual co-occurrences across multiple videos, a criterion that may fail to capture important information not shared by multiple videos. On the other hand, (Hong et al. 2011; Tan, Tan, and Ngo 2010; Wang et al. 2012) extract typical frames/shots and then assign them semantic annotations, which is similar to our method; but during the temporal alignment step, (Hong et al. 2011; Wang et al. 2012) simply assume that each tag is given a time stamp, and (Tan, Tan, and Ngo 2010) relies heavily on Google Trends. These are therefore not standalone multiple-video summarization approaches for practical applications.

To handle the aforementioned problems, in this paper we propose a novel framework named event video mashup (EVM) to automatically generate a short video from a collection of related videos to describe an event. Different from video summarization techniques, which extract important frames/shots from a single video, video mashup identifies important frames/shots and temporally aligns them by simultaneously considering multiple videos from different sources with varying qualities, resolutions, lengths, etc. Figure 1 depicts the general idea of event video mashup, where a short video is constructed from the most important content selected from the relevant videos along the time line.

Our contributions are summarized as follows: 1) As far as we know, this is the first standalone framework to automatically generate a temporally aligned summarization from a collection of Web videos that are unstructured, unprofessional and possibly edited arbitrarily. 2) To generate the EVM, we first temporally align the shot cliques by utilizing temporal information and a semantic smoothing strategy, and then devise an effective submodular shot clique selection strategy to ensure the significance and coverage of the selected content. 3) To enable progress on this problem, we construct a new dataset with 1,600 videos describing 8 events covering different topics. Extensive empirical results on this dataset show that our proposed method significantly outperforms state-of-the-art approaches.

Event Video Mashup

Event video mashup is a technique that mashes together many videos related to an event from different sources and aggregates the most important content into a single new video, allowing users to understand the event efficiently and comprehensively. There are two main problems involved: content selection and content composition. Content selection concerns how to select a subset of video content from the different raw videos so as to cover representative and comprehensive information about the event in its evolving order. The objective of content composition is to use the selected content to compose a single mashup video that users can view smoothly.

To address these problems, we propose a new framework that can automatically generate a short mashup video from a set of Web videos related to an event. Figure 2 gives an overview of the proposed mashup system. Given a collection of videos, the shots of the videos are first grouped into shot cliques using a graph-based near-duplicate detection approach. The semantics of the identified shot cliques are then inferred to unveil their relevance to key semantic facets of the event. Next, the shot cliques are temporally aligned, followed by a semantic smoothing step that refines the alignment and discovers the key subevents of the event. Taking both importance and diversity into account, a shot clique selection step is then performed, and from each top-ranked shot clique a shot is chosen to represent that clique. All of the above steps can be generalized as content selection. Finally, the selected shots, in their previously determined order, compose the final mashup video, with transition smoothness taken into consideration. The length of the mashup video can be adaptively adjusted by varying the number of top-ranked shot cliques. Next, we describe these components in detail.

[Figure 2: The flowchart of the proposed event video mashup approach. Stages recoverable from the figure: Web Videos and Metadata → Shot Clique Identification → Semantic Inference → Temporal Alignment → Semantic Smoothing → Shot Clique Selection → Shot Selection → Content Composition → Mashup video (a shot sequence organized by subevents along a timeline).]
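The shot clique selection step is only named in this overview, and its submodular objective is defined later in the paper. Purely to make the importance-versus-diversity trade-off concrete, the sketch below greedily ranks cliques; everything in it is a hypothetical stand-in. The clique schema (a feature vector plus a size used as an importance proxy), the cosine-similarity redundancy term and the weight alpha are illustrative assumptions, not the paper's formulation.

```python
import numpy as np

def greedy_select_cliques(cliques, k, alpha=0.5):
    """Greedily pick k shot cliques, trading importance off against
    redundancy with already selected cliques. A generic stand-in for
    the paper's submodular objective, not its exact form."""
    feats = np.array([c["feature"] for c in cliques], dtype=float)
    feats /= np.linalg.norm(feats, axis=1, keepdims=True)
    sim = feats @ feats.T                      # pairwise cosine similarity
    selected, remaining = [], set(range(len(cliques)))

    def marginal_gain(i):
        importance = cliques[i]["size"]        # proxy: #shots in the clique
        redundancy = max((sim[i, j] for j in selected), default=0.0)
        return importance * (1.0 - alpha * redundancy)

    while remaining and len(selected) < k:
        best = max(remaining, key=marginal_gain)   # largest marginal gain
        selected.append(best)
        remaining.remove(best)
    return [cliques[i] for i in selected]

# Hypothetical usage:
# top = greedy_select_cliques([{"feature": [1, 0], "size": 5},
#                              {"feature": [0.9, 0.1], "size": 4},
#                              {"feature": [0, 1], "size": 2}], k=2)
```

Greedy maximization is the standard heuristic for monotone submodular objectives, which is why a sketch of this shape is a natural placeholder for the selection strategy described later.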
Shot Clique Identification

Despite the severe diversity among Web videos related to the same event, such videos often overlap partially in terms of shots, and the repeatedly occurring shots often carry important information about the event. For instance, in videos related to "Academy Awards", shots depicting "Red Carpet" or "Best Actor" appear many times across different videos. In light of this observation, it is necessary to group the near-duplicate shots together into shot cliques.

Many works have been proposed for visual near-duplicate detection (Shen et al. 2007; Zhao and Ngo 2009; Zhu et al. 2008; Huang et al. 2010; Song et al. 2013). In this subsection, we apply the concept of the maximally cohesive subgraph (MCS) (Huang et al. 2010) to detect and group near-duplicate shots. We first construct an undirected graph where each vertex represents the feature vector of a keyframe of a shot.
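The MCS construction itself is not detailed in this excerpt. As a rough, hypothetical sketch of graph-based near-duplicate grouping, the code below replaces maximally cohesive subgraphs with a simpler stand-in: connected components of a thresholded cosine-similarity graph. The 0.9 threshold is an arbitrary illustrative choice.

```python
import numpy as np

def group_near_duplicate_shots(keyframe_feats, threshold=0.9):
    """Group shots whose keyframes are near-duplicates. The paper uses
    maximally cohesive subgraphs (MCS, Huang et al. 2010); this sketch
    settles for connected components of a thresholded cosine-similarity
    graph, which merges more aggressively than MCS."""
    feats = np.asarray(keyframe_feats, dtype=float)
    feats /= np.linalg.norm(feats, axis=1, keepdims=True)
    sim = feats @ feats.T                       # pairwise cosine similarity
    n = len(feats)
    parent = list(range(n))                     # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]       # path halving
            x = parent[x]
        return x

    for i in range(n):                          # add an edge for each
        for j in range(i + 1, n):               # near-duplicate pair
            if sim[i, j] >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    # a shot clique needs at least two near-duplicate shots
    return [g for g in groups.values() if len(g) > 1]
```

Note that connected components can over-merge shots through chains of pairwise matches; that looseness is precisely what cohesive-subgraph formulations such as MCS are designed to avoid.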
Semantic Inference

The titles of the Web videos related to an event can provide sufficient keywords to cover the key semantic facets of the event. To mine the keywords from the titles, we first parse the sentences into words and phrases and eliminate the common stop words. We then calculate the term frequency of all the remaining words and remove the words that appear with low frequency, which may not cover …
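To make this title-mining step concrete, here is a minimal sketch under stated assumptions: the stop-word list is abridged, the tokenizer is deliberately crude, and the frequency cutoff min_freq is an illustrative parameter, not a value from the paper.

```python
import re
from collections import Counter

# Abridged stop-word list purely for illustration.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "to", "and", "at", "for", "is"}

def mine_keywords(titles, min_freq=3):
    """Mine candidate event-facet keywords from video titles by term
    frequency, following the steps described above."""
    counts = Counter()
    for title in titles:
        words = re.findall(r"[a-z0-9']+", title.lower())  # crude tokenization
        counts.update(w for w in words if w not in STOP_WORDS)
    # drop low-frequency words, which are unlikely to describe key facets
    return {w: c for w, c in counts.items() if c >= min_freq}

# Hypothetical usage with "Academy Awards" video titles:
# mine_keywords(["Oscars Red Carpet arrivals", "Best Actor speech at the Oscars",
#                "Oscars Red Carpet highlights"], min_freq=2)
```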