Video Highlights Detection and Summarization with Lag-Calibration based on Concept-Emotion Mapping of Crowd-sourced Time-Sync Comments
Qing Ping and Chaomei Chen College of Computing & Informatics Drexel University {qp27, cc345}@drexel.edu
Abstract

With the prevalence of video sharing, there are increasing demands for automatic video digestion such as highlight detection. Recently, platforms with crowdsourced time-sync video comments have emerged worldwide, providing a good opportunity for highlight detection. However, this task is non-trivial: (1) time-sync comments often lag behind their corresponding shot; (2) time-sync comments are semantically sparse and noisy; (3) determining which shots are highlights is highly subjective. The present paper aims to tackle these challenges by proposing a framework that (1) uses concept-mapped lexical chains for lag-calibration; (2) models video highlights based on comment intensity and the combination of emotion and concept concentration in each shot; (3) summarizes each detected highlight using improved SumBasic with emotion and concept mapping. Experiments on large real-world datasets show that our highlight detection and summarization methods both outperform other benchmarks with considerable margins.

1 Introduction

Every day, people watch billions of hours of video on YouTube, with half of the views on mobile devices1. With the prevalence of video sharing, there is increasing demand for fast video digestion. Imagine a scenario in which a user wants to quickly grasp a long video without repeatedly dragging the progress bar to skip shots unappealing to them. With automatically generated highlights, users could digest an entire video in minutes before deciding whether to watch the full video later. Moreover, automatic video highlight detection and summarization could benefit video indexing, video search and video recommendation.

However, finding highlights in a video is not a trivial task. First, what counts as a "highlight" can be very subjective. Second, a highlight may not always be captured by analyzing low-level features of image, audio and motion. The lack of abstract semantic information has become a bottleneck for highlight detection in traditional video processing.

Recently, crowdsourced time-sync video comments, or "bullet-screen comments", have emerged: comments generated in real time fly over or beside the screen, synchronized with the video frame by frame. They have gained popularity worldwide, for example on niconico in Japan, Bilibili and Acfun in China, and YouTube Live and Twitch Live in the USA. The popularity of time-sync comments suggests new opportunities for video highlight detection based on natural language processing.

Nevertheless, it is still challenging to detect and label highlights using time-sync comments. First, there is an almost inevitable lag between comments and their corresponding shot. As shown in Figure 1, ongoing discussion about one shot may extend over the next few shots; highlight detection and labeling without lag-calibration may therefore produce inaccurate results. Second,
1 https://www.youtube.com/yt/press/statistics.html
time-sync comments are semantically sparse, both in the number of comments per shot and the number of tokens per comment. Traditional bag-of-words statistical models may work poorly on such data. Third, there is much uncertainty in highlight detection in an unsupervised setting without any prior knowledge. The characteristics of highlights must be explicitly defined, captured and modeled.

Figure 1. Lag Effect of Time-Sync Comments Shot by Shot.

To the best of our knowledge, little work has concentrated on highlight detection and labeling based on time-sync comments in an unsupervised way. The most relevant work proposed to detect highlights based on the topic concentration of semantic vectors of bullet-comments, and to label each highlight with a pre-trained classifier based on pre-defined tags (Lv, Xu, Chen, Liu, & Zheng, 2016). Nevertheless, we argue that emotion concentration is more important for highlight detection than general topic concentration. Another work proposed to extract highlights based on the frame-by-frame similarity of emotion distributions (Xian, Li, Zhang, & Liao, 2015). However, neither work tackles lag-calibration, the balance between emotion and topic concentration, and unsupervised highlight labeling simultaneously.

To solve these problems, the present study proposes the following: (1) word-to-concept and word-to-emotion mapping based on global word-embedding, from which lexical chains are constructed for bullet-comment lag-calibration; (2) highlight detection based on the emotional and conceptual concentration and intensity of lag-calibrated bullet-comments; (3) highlight summarization with a modified SumBasic algorithm that treats emotions and concepts as the basic units of a bullet-comment.

The main contributions of the present paper are as follows: (1) We propose an entirely unsupervised framework for video highlight detection and summarization based on time-sync comments; (2) We develop a lag-calibration technique based on concept-mapped lexical chains; (3) We construct large datasets for bullet-comment word-embedding, a bullet-comment emotion lexicon, and ground-truth for highlight-detection and labeling evaluation based on bullet-comments.

2 Related Work

2.1 Highlight detection by video processing

First, following the definition in previous work (M. Xu, Jin, Luo, & Duan, 2008), we define highlights as the most memorable shots in a video, with high emotion intensity. Note that highlight detection is different from video summarization, which focuses on a condensed storyline representation of a video rather than on extracting affective content (K.-S. Lin, Lee, Yang, Lee, & Chen, 2013).

For highlight detection, some researchers propose to represent emotions in a video by a curve on the arousal-valence plane, using low-level features such as motion, vocal effects, shot length and audio pitch (Hanjalic & Xu, 2005), color (Ngo, Ma, & Zhang, 2005), or mid-level features such as laughing and subtitles (M. Xu, Luo, Jin, & Park, 2009). Nevertheless, due to the semantic gap between low-level features and high-level semantics, the accuracy of highlight detection based on video processing is limited (K.-S. Lin et al., 2013).

2.2 Temporal text summarization

Work on temporal text summarization is relevant to the present study, but also differs from it. Some works formulate temporal text summarization as a constrained multi-objective optimization problem (Sipos, Swaminathan, Shivaswamy, & Joachims, 2012; Yan, Kong, et al., 2011; Yan, Wan, et al., 2011), as a graph optimization problem (C. Lin et al., 2012), as a supervised learning-to-rank problem (Tran, Niederée, Kanhabua, Gadiraju, & Anand, 2015), or as an online clustering problem (Shou, Wang, Chen, & Chen, 2013).

The present study models highlight detection as a simple two-objective optimization problem with constraints. However, the features chosen to evaluate the "highlightness" of a shot differ from those of the above studies. Because a highlight shot is observed to be correlated with high emotional intensity and topic concentration, coverage and non-redundancy are no longer optimization goals, as they are in temporal text summarization. Instead, we focus on modeling emotional and topic concentration.
2.3 Crowdsourced time-sync comment mining

Several works have focused on tagging videos shot-by-shot with crowdsourced time-sync comments, by manual labeling and supervised training (Ikeda, Kobayashi, Sakaji, & Masuyama, 2015), by temporal and personalized topic modeling (Wu, Zhong, Tan, Horner, & Yang, 2014), or by tagging the video as a whole (Sakaji, Kohana, Kobayashi, & Sakai, 2016). One work proposes to generate a summarization of each shot by data reconstruction jointly on the textual and topic levels (L. Xu & Zhang, 2017). One work proposed a centroid-diffusion algorithm to detect highlights (Xian et al., 2015), where shots are represented by latent topics from LDA. Another work proposed to use pre-trained semantic vectors of comments to cluster comments into topics and to find highlights based on topic concentration (Lv et al., 2016); moreover, they use pre-defined labels to train a classifier for highlight labeling. The present study differs from these two studies in several aspects. First, before highlight detection, we perform lag-calibration to minimize inaccuracy due to comment lags. Second, we propose to represent each scene by the combination of topic and emotion concentration. Third, we perform both highlight detection and highlight labeling in an unsupervised way.

2.4 Lexical chain

Lexical chains are sequences of words in a cohesive relationship spanning a range of sentences. Early work constructs lexical chains based on syntactic relations of words using Roget's Thesaurus, without word sense disambiguation (Morris & Hirst, 1991). Later work expands lexical chains by WordNet relations with word sense disambiguation (Barzilay & Elhadad, 1999; Hirst & St-Onge, 1998). Lexical chains have also been constructed from word-embedding relations for the disambiguation of multi-words (Ehren, 2017). The present study constructs lexical chains for lag-calibration based on global word-embedding.

3 Problem Formulation

The problem in the present paper can be formulated as follows. The input is a set of time-sync comments C = {c_1, c_2, ..., c_N} with a set of timestamps T = {t_1, t_2, ..., t_N} of a video V, a compression ratio ρ_h for the number of highlights to be generated, and a compression ratio ρ_c for the number of comments in each highlight summary. Our task is to (1) generate a set of highlight shots H(V) = {h_1, h_2, ..., h_m} and (2) highlight summaries A(V) = {a_1, a_2, ..., a_m} that are as close to the ground truth as possible. Each highlight summary a_i comprises a subset of all the comments in that shot. The number of highlight shots m and the number of comments per summary are determined by ρ_h and ρ_c respectively.

4 Video Highlight Detection

In this section, we introduce our framework for highlight detection. Two preliminary tasks are also described, namely the construction of a global time-sync comment word-embedding and of an emotion lexicon.

4.1 Preliminaries

Word-Embedding of Time-Sync Comments

As pointed out earlier, one challenge in analyzing time-sync comments is semantic sparseness, since the number of comments and the comment length are both very limited. Two semantically related words may not appear related if they do not co-occur frequently in one video. To compensate, we construct a global word-embedding on a large collection of time-sync comments. The word-embedding dictionary can be represented as D = {(w_1 : v_1), (w_2 : v_2), ..., (w_|W| : v_|W|)}, where w_i is a word, v_i is the corresponding word-vector, and W is the vocabulary of the corpus.

Emotion Lexicon Construction

As emphasized earlier, it is crucial to extract the emotions in time-sync comments for highlight detection. However, traditional emotion lexicons cannot be used here, since there are too many Internet slang terms born specifically on this type of platform. For example, "23333" means "ha ha ha", and "6666" means "really awesome". Therefore, we construct an emotion lexicon tailored to time-sync comments from the word-embedding dictionary trained in the last step. First, we manually label words of the five basic emotional categories (happy, anger, sad, fear and surprise) as seeds (Ekman, 1992), drawn from the top frequent words in the corpus. Here the sixth emotion category "disgust" is omitted because it is relatively rare in the dataset; it could readily be incorporated for other datasets. Then we expand the emotion lexicon by searching the top k neighbors of each seed word
in the word-embedding space, and adding a neighbor to the seeds if it meets at least a percentage of overlap θ_overlap with all the seeds, with a minimum similarity of sim_min. The neighbors are searched based on cosine similarity in the word-embedding space.

4.2 Lag-Calibration

In this section, we introduce our method for lag-calibration, following the steps of concept mapping, word-embedded lexical chain construction and lag-calibration.

Concept Mapping

To tackle the issue of semantic sparseness in time-sync comments, and to construct lexical chains of semantically related words, words with similar meanings should first be mapped to the same concept. Given the set of comments C of video V, we propose a mapping F from the vocabulary W of the comments C to a set of concepts Z, namely F : W → Z (|W| ≥ |Z|). More specifically, F maps each word w into a concept z = F(w):

  F(w) = z,      if ∃z ∈ Z such that |{u ∈ top_k(w) : F(u) = z}| / k ≥ θ_concept
  F(w) = z_new,  otherwise                                                    (1)

where top_k(w) returns the top k neighbors of word w based on cosine similarity. For every word w in a comment c, we check the percentage of its neighbors already mapped to some concept z. If the percentage exceeds the threshold θ_concept, then w together with its neighbors is mapped to z; otherwise, they are mapped to a new concept z_new.

Lexical Chain Construction

The next step is to construct all lexical chains in the current time-sync comments of video V, so that lagged comments can be calibrated based on these chains. A lexical chain l comprises a set of triples (w, c, t), where w is the actually mentioned word of concept z in comment c, and t is the timestamp of comment c. A lexical chain dictionary L for the time-sync comments C of video V is L = {z_1 : (l_11, l_12, ...), z_2 : (l_21, l_22, ...), ..., z_|Z| : (l_|Z|1, l_|Z|2, ...)}, where z_i ∈ Z is a concept and l_ij is the j-th lexical chain of concept z_i. The algorithm for lexical chain construction is described in Algorithm 1. Specifically, each comment in C is either appended to an existing lexical chain or starts a new, empty lexical chain, based on its temporal distance to existing chains, controlled by the maximum silence t_sil.

Note that word senses in the lexical chains constructed here are not disambiguated, as most traditional algorithms do. Nevertheless, we argue that the lexical chains are still useful, since our concept mapping is constructed from time-sync comments in their natural order, a progressive semantic continuity that naturally reinforces similar word senses for temporally close comments. This semantic continuity, together with the global word-embedding, ensures that our concept mapping is valid in most cases.

Algorithm 1 Lexical Chain Construction
Input: time-sync comments C; word-to-concept mapping F; maximum silence t_sil.
Output: a dictionary of lexical chains L.
  Initialize L ← {}
  for each comment c in C do
    for each word w in c do
      z ← F(w)
      if z in L then
        chains ← L[z]
        if some chain l in chains ends at time t with t_c − t ≤ t_sil then
          l ← l ∪ {(w, c, t_c)}
        else
          chains ← chains ∪ {{(w, c, t_c)}}
        end if
      else
        L[z] ← {{(w, c, t_c)}}
      end if
    end for
  end for
  return L

Table 1. Lexical Chain Construction.

Comment Lag-Calibration

Now, given the constructed lexical chain dictionary L, we can calibrate the comments in C based on their lexical chains. From our observation, the first comment about a shot usually occurs within the shot, while the rest may not. Therefore, we calibrate the timestamp of each comment to the timestamp of the first element of the lexical chain it belongs to. Among all the lexical chains (concepts) a comment belongs to, we pick the one with the highest score score(l, c), computed as the sum frequency of each word in the chain weighted by its logarithmic global frequency log(g(w)). Therefore, each comment is assigned to its most semantically important lexical chain (concept) for calibration.
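A minimal sketch of this chain bookkeeping and calibration, under our reading of the description above (the names `build_chains`, `calibrate`, `max_silence`, and the toy inputs are our own assumptions, not the authors' released code):

```python
import math
from collections import defaultdict

def build_chains(comments, concept_of, max_silence):
    """comments: list of (timestamp, [words]) sorted by timestamp.
    Returns {concept: [chains]}, each chain a list of (word, timestamp)."""
    chains = defaultdict(list)
    for ts, words in comments:
        for w in words:
            concept = concept_of.get(w)
            if concept is None:
                continue
            cs = chains[concept]
            # grow the latest chain of this concept if the silence gap is small
            if cs and ts - cs[-1][-1][1] <= max_silence:
                cs[-1].append((w, ts))
            else:
                cs.append([(w, ts)])
    return chains

def chain_score(chain, global_freq):
    """Word count weighted by inverse log global frequency."""
    return sum(1.0 / math.log(global_freq[w]) for w, _ in chain)

def calibrate(comments, chains, concept_of, global_freq):
    """Snap each comment's timestamp to the start of its best-scoring chain."""
    out = []
    for ts, words in comments:
        best_ts, best_score = ts, 0.0
        for w in words:
            concept = concept_of.get(w)
            if concept is None:
                continue
            for chain in chains.get(concept, []):
                if (w, ts) in chain:
                    s = chain_score(chain, global_freq)
                    if s > best_score:
                        best_ts, best_score = chain[0][1], s
        out.append((best_ts, words))
    return out
```

For instance, with comments at t = 0, 1 and 9 all mentioning a word mapped to the same concept and max_silence = 5, the first two comments form one chain (so the t = 1 comment is calibrated back to t = 0), while the comment at t = 9 starts a fresh chain and keeps its own timestamp.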
The algorithm for the calibration is described in Algorithm 2.

Algorithm 2 Lag-Calibration of Time-Sync Comments
Input: time-sync comments C; word-to-concept mapping F; lexical chain dictionary L; word-embedding dictionary D.
Output: lag-calibrated time-sync comments C′.
  Initialize C′ ← C
  for each comment c in C′ do
    chain_best ← {}; score_best ← 0
    for each word w in c do
      z ← F(w)
      l ← the chain of z in L that contains (w, c, t_c)
      score ← 0
      for (w′, c′, t′) in l do
        g(w′) ← global count of w′ in D
        score ← score + 1 / log(g(w′))
      end for
      if score > score_best then
        chain_best ← l; score_best ← score
      end if
    end for
    t_c ← timestamp of the first element of chain_best
  end for
  return C′

Table 2. Lag-Calibration of Time-Sync Comments.

Note that if there are multiple consecutive shots {s_1, s_2, ..., s_k} with comments of similar content, our lag-calibration method may calibrate many comments in shots s_2, ..., s_k to the timestamp of the first shot s_1, if these comments are connected via lexical chains from shot s_1. This is not necessarily a bad thing, since we hope to avoid selecting redundant consecutive highlight shots and to leave opportunity for other candidate highlights, given a fixed compression ratio.

Shot Importance Scoring

In this section, we first segment the comments into shots of equal temporal length t_shot, and then model shot importance. Highlights can then be detected based on shot importance.

A shot's importance is modeled as being affected by two factors: comment concentration and commenting intensity. For comment concentration, as mentioned earlier, both concept and emotional concentration may contribute to highlight detection. For example, a group of concept-concentrated comments like "the background music/bgm/soundtrack of this shot is classic/inspiring/the best" may indicate a highlight related to memorable background music. Meanwhile, comments such as "this plot is so funny/hilarious/lmao/lol/2333" may suggest a single-emotion-concentrated highlight. Therefore, we combine these two concentrations in our model.

First, we define the emotional concentration E(s) of shot s, based on its time-sync comments C_s and the emotion lexicon Lex, as the inverse of the entropy of the probabilities of the five emotions within the shot:

  E(s) = 1 / ( − Σ_{i=1..5} p_i · log p_i )                                   (2)

  p_i = [ Σ_{w ∈ C_s ∧ w ∈ Lex_i} f_s(w) / log g(w) ] / [ Σ_{j=1..5} Σ_{w ∈ C_s ∧ w ∈ Lex_j} f_s(w) / log g(w) ]   (3)

where f_s(w) is the frequency of word w in shot s and g(w) is its global frequency. Then we define the topical concentration T(s) as the inverse of the entropy of all concepts within the shot:

  T(s) = 1 / ( − Σ_z p_z · log p_z )                                          (4)

  p_z = [ Σ_{w ∈ C_s ∧ F(w)=z ∧ w ∉ Lex} f_s(w) / log g(w) ] / [ Σ_{w ∈ C_s ∧ w ∉ Lex} f_s(w) / log g(w) ]        (5)

The probability of each concept z is thus determined by the summed frequencies of its mentioned words, weighted by their global frequencies and divided by the corresponding values of all words in the shot.

Now the comment importance I(s, C_s) of shot s can be defined as:

  I(s, C_s) = λ · E(s) + (1 − λ) · T(s)                                       (6)

where λ is a hyper-parameter controlling the balance between emotion and concept concentration.

Finally, we define the overall importance of a shot as:

  I′(s, C_s) = I(s, C_s) · log |C_s|                                          (7)

where |C_s| is the number of time-sync comments in shot s, a straightforward yet effective indicator of comment intensity per shot. The problem of highlight detection can now be modeled as a maximization problem:

  maximize   Σ_s I′(s, C_s) · x_s                                             (8)
  subject to Σ_s x_s ≤ ρ_h · |S|,  x_s ∈ {0, 1}

where |S| is the total number of shots.

5 Video Highlight Summarization

Given a set of detected highlight shots H(V) = {h_1, h_2, ..., h_m} of video V, each with all the lag-calibrated comments C_h of that shot, we attempt to generate summaries A(V) = {a_1, a_2, ..., a_m} such that a_i ⊂ C_{h_i}, with compression ratio ρ_c, and such that each a_i is as close to the ground truth as possible.

We propose a simple but very effective summarization model, an improvement over SumBasic (Nenkova & Vanderwende, 2005) with emotion and concept mapping and a two-level updating mechanism. In the modified SumBasic, instead of only down-sampling the probabilities of the words in a selected sentence to prevent redundancy, we down-sample the probabilities of both the words and their mapped concepts when re-weighting each comment. This two-level updating mechanism can (1) penalize the selection of sentences with semantically similar words; and (2) still select a sentence whose words already appear in the summary if those words occur much more frequently. In addition, we use an emotion-bias parameter b to weight words and concepts when computing their probabilities, so that the frequencies of emotional words and concepts are increased by b relative to non-emotional words and concepts.

6 Experiment

In this section, we conduct experiments on large real-world datasets for highlight detection and summarization. We describe the data collection process, evaluation metrics, benchmarks and experiment results.

6.1 Data

In this section, we describe the datasets collected and constructed for our experiments. All datasets and code will be made publicly available on Github2.

Crowdsourced Time-Sync Comment Corpus

To train the word-embedding described in 4.1, we collected a large corpus of time-sync comments from Bilibili3, a content-sharing website in China with time-sync comments. The corpus contains 2,108,746 comments with 15,179,132 tokens and 91,745 unique tokens, from 6,368 long videos. Each comment has 7.20 tokens on average. Before training, each comment is tokenized using the Chinese word tokenization package Jieba4. Repeating characters in words such as "233333", "66666" and "哈哈哈哈" ("hahahaha") are replaced with two of the same character.

The word-embedding is trained using word2vec (Goldberg & Levy, 2014) with the skip-gram model. The number of embedding dimensions is 300, the window size is 7, the down-sampling rate is 1e-3, and words occurring fewer than 3 times are discarded.

Emotion Lexicon Construction

After the word-embedding is trained, we manually select emotional words belonging to the five basic categories from the 500 most frequent words in the word-embedding. Then we expand the emotion seeds iteratively using the expansion procedure of Section 4.1. After each expansion iteration, we also manually examine the expanded lexicon and remove inaccurate words to prevent the concept-drift effect, and use the filtered expanded seeds for the next round of expansion. The minimum overlap θ_overlap is set to 0.05 and the minimum similarity sim_min to 0.6, both selected by grid search over the range (0, 1). The numbers of words for each emotion, initially and after the final expansion, are listed in Table 3.

        Happy   Sad   Fear   Anger   Surprise
Seeds   17      13    21     14      19
All     157     235   258    284     226

Table 3. Number of Initial and Expanded Emotion Words.

Video Highlights Data

To evaluate our highlight-detection algorithm, we constructed a ground-truth dataset. It takes advantage of user-uploaded mix-clips about a specific video on Bilibili; mix-clips are collages of video highlights reflecting the uploader's own preferences. We then take the most-voted highlights as the ground truth for a video. The dataset contains 11 videos of 1333 minutes in total length, with 75,653 time-sync comments. For each video, 3~4 mix-clips about the video are collected from Bilibili. Shots that occur in at least 2 of the mix-clips are considered ground-truth highlights. All ground-truth highlights are mapped to the original video timeline, and the start and end times of each highlight are recorded as ground truth.
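As an illustration of this voting scheme, the following sketch (our own simplification, not the released code; the name `ground_truth_highlights` and the second-level granularity are assumptions) keeps the timeline seconds covered by at least two mix-clips and merges them back into intervals:

```python
def ground_truth_highlights(clip_segments, min_votes=2):
    """clip_segments: one list per mix-clip of (start, end) second
    intervals already mapped onto the original video timeline.
    Returns merged (start, end) intervals covered by >= min_votes clips."""
    votes = {}
    for segments in clip_segments:
        covered = set()
        for start, end in segments:
            covered.update(range(start, end))  # half-open, second-level
        for sec in covered:
            votes[sec] = votes.get(sec, 0) + 1
    keep = sorted(sec for sec, v in votes.items() if v >= min_votes)
    # merge consecutive seconds back into (start, end) intervals
    intervals = []
    for sec in keep:
        if intervals and sec == intervals[-1][1]:
            intervals[-1][1] = sec + 1
        else:
            intervals.append([sec, sec + 1])
    return [tuple(iv) for iv in intervals]
```

For example, with three mix-clips covering (10, 20) and (50, 55), (12, 18), and (53, 60), the function returns [(12, 18), (53, 55)]: only the spans seen by at least two clips survive.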
The mix-clips are selected based on the following heuristics: (1) The mix-clips are searched on Bilibili using the keywords

2 https://github.com/ChanningPing/VideoHighlightDetection
3 https://www.bilibili.com/
4 https://github.com/fxsjy/jieba