Video Highlights Detection and Summarization with Lag-Calibration based on Concept-Emotion Mapping of Crowd-sourced Time-Sync Comments
Qing Ping and Chaomei Chen College of Computing & Informatics Drexel University {qp27, cc345}@drexel.edu
Abstract

With the prevalence of video sharing, there are increasing demands for automatic video digestion such as highlight detection. Recently, platforms with crowdsourced time-sync video comments have emerged worldwide, providing a good opportunity for highlight detection. However, this task is non-trivial: (1) time-sync comments often lag behind their corresponding shot; (2) time-sync comments are semantically sparse and noisy; (3) determining which shots are highlights is highly subjective. The present paper aims to tackle these challenges by proposing a framework that (1) uses concept-mapped lexical chains for lag-calibration; (2) models video highlights based on comment intensity and the combination of emotion and concept concentration in each shot; (3) summarizes each detected highlight using improved SumBasic with emotion and concept mapping. Experiments on large real-world datasets show that our highlight detection and summarization methods both outperform other benchmarks with considerable margins.

1 Introduction

Every day, people watch billions of hours of video on YouTube, with half of the views on mobile devices1. With the prevalence of video sharing, there is increasing demand for fast video digestion. Imagine a scenario in which a user wants to quickly grasp a long video without repeatedly dragging the progress bar to skip shots unappealing to them. With automatically generated highlights, users could digest an entire video in minutes before deciding whether to watch the full video later. Moreover, automatic video highlight detection and summarization could benefit video indexing, video search and video recommendation.

However, finding highlights in a video is not a trivial task. First, what counts as a "highlight" can be very subjective. Second, a highlight may not always be captured by analyzing low-level features of image, audio and motion. The lack of abstract semantic information has become a bottleneck for highlight detection in traditional video processing.

Recently, crowdsourced time-sync video comments, or "bullet-screen comments", have emerged: comments generated in real time fly over or beside the screen, synchronized with the video frame by frame. They have gained popularity worldwide, for example on niconico in Japan, Bilibili and Acfun in China, and YouTube Live and Twitch Live in the USA. The popularity of time-sync comments suggests new opportunities for video highlight detection based on natural language processing.

Nevertheless, it is still challenging to detect and label highlights using time-sync comments. First, there is an almost inevitable lag between comments and their corresponding shot. As shown in Figure 1, ongoing discussion about one shot may extend over the next few shots; highlight detection and labeling without lag-calibration may therefore produce inaccurate results. Second,
1 https://www.youtube.com/yt/press/statistics.html
time-sync comments are semantically sparse, both in the number of comments per shot and the number of tokens per comment. Traditional bag-of-words statistical models may work poorly on such data. Third, there is much uncertainty in highlight detection in an unsupervised setting without any prior knowledge. The characteristics of highlights must be explicitly defined, captured and modeled.

Figure 1. Lag Effect of Time-Sync Comments Shot by Shot.

To the best of our knowledge, little work has concentrated on highlight detection and labeling based on time-sync comments in an unsupervised way. The most relevant work proposed to detect highlights based on the topic concentration of semantic vectors of bullet-comments, and to label each highlight with a pre-trained classifier based on pre-defined tags (Lv, Xu, Chen, Liu, & Zheng, 2016). Nevertheless, we argue that emotion concentration is more important for highlight detection than general topic concentration. Another work proposed to extract highlights based on the frame-by-frame similarity of emotion distributions (Xian, Li, Zhang, & Liao, 2015). However, neither work tackles lag-calibration, the balance between emotion and topic concentration, and unsupervised highlight labeling simultaneously.

To solve these problems, the present study proposes the following: (1) word-to-concept and word-to-emotion mapping based on global word-embedding, from which lexical chains are constructed for bullet-comment lag-calibration; (2) highlight detection based on the emotional and conceptual concentration and intensity of lag-calibrated bullet-comments; (3) highlight summarization with a modified SumBasic algorithm that treats emotions and concepts as the basic units of a bullet-comment.

The main contributions of the present paper are as follows: (1) We propose an entirely unsupervised framework for video highlight detection and summarization based on time-sync comments; (2) We develop a lag-calibration technique based on concept-mapped lexical chains; (3) We construct large datasets for bullet-comment word-embedding, a bullet-comment emotion lexicon, and ground-truth for highlight-detection and labeling evaluation based on bullet-comments.

2 Related Work

2.1 Highlight detection by video processing

First, following the definition in previous work (M. Xu, Jin, Luo, & Duan, 2008), we define highlights as the most memorable shots in a video, with high emotion intensity. Note that highlight detection is different from video summarization, which focuses on a condensed storyline representation of a video rather than on extracting affective content (K.-S. Lin, Lee, Yang, Lee, & Chen, 2013).

For highlight detection, some researchers propose to represent emotions in a video by a curve on the arousal-valence plane, using low-level features such as motion, vocal effects, shot length and audio pitch (Hanjalic & Xu, 2005), color (Ngo, Ma, & Zhang, 2005), or mid-level features such as laughing and subtitles (M. Xu, Luo, Jin, & Park, 2009). Nevertheless, due to the semantic gap between low-level features and high-level semantics, the accuracy of highlight detection based on video processing is limited (K.-S. Lin et al., 2013).

2.2 Temporal text summarization

Work on temporal text summarization is relevant to the present study, but also differs from it. Some works formulate temporal text summarization as a constrained multi-objective optimization problem (Sipos, Swaminathan, Shivaswamy, & Joachims, 2012; Yan, Kong, et al., 2011; Yan, Wan, et al., 2011), as a graph optimization problem (C. Lin et al., 2012), as a supervised learning-to-rank problem (Tran, Niederée, Kanhabua, Gadiraju, & Anand, 2015), or as an online clustering problem (Shou, Wang, Chen, & Chen, 2013).

The present study models highlight detection as a simple two-objective optimization problem with constraints. However, the features chosen to evaluate the "highlightness" of a shot differ from those of the above studies. Because a highlight shot is observed to be correlated with high emotional intensity and topic concentration, coverage and non-redundancy are no longer optimization goals, as they are in temporal text summarization. Instead, we focus on modeling emotional and topic concentration.
2.3 Crowdsourced time-sync comment mining

Several works have focused on tagging videos shot-by-shot with crowdsourced time-sync comments, by manual labeling and supervised training (Ikeda, Kobayashi, Sakaji, & Masuyama, 2015), by temporal and personalized topic modeling (Wu, Zhong, Tan, Horner, & Yang, 2014), or by tagging the video as a whole (Sakaji, Kohana, Kobayashi, & Sakai, 2016). One work proposes to generate a summarization of each shot by data reconstruction jointly on the textual and topic levels (L. Xu & Zhang, 2017). One work proposed a centroid-diffusion algorithm to detect highlights (Xian et al., 2015), where shots are represented by latent topics from LDA. Another work proposed to use pre-trained semantic vectors of comments to cluster comments into topics and to find highlights based on topic concentration (Lv et al., 2016); moreover, they use pre-defined labels to train a classifier for highlight labeling. The present study differs from these two studies in several aspects. First, before highlight detection, we perform lag-calibration to minimize inaccuracy due to comment lags. Second, we propose to represent each scene by the combination of topic and emotion concentration. Third, we perform both highlight detection and highlight labeling in an unsupervised way.

2.4 Lexical chain

Lexical chains are sequences of words in a cohesive relationship spanning a range of sentences. Early work constructs lexical chains based on syntactic relations of words using Roget's Thesaurus, without word sense disambiguation (Morris & Hirst, 1991). Later work expands lexical chains by WordNet relations with word sense disambiguation (Barzilay & Elhadad, 1999; Hirst & St-Onge, 1998). Lexical chains have also been constructed from word-embedding relations for the disambiguation of multi-words (Ehren, 2017). The present study constructs lexical chains for lag-calibration based on global word-embedding.

3 Problem Formulation

The problem in the present paper can be formulated as follows. The input is a set of time-sync comments C = {c_1, c_2, ..., c_N} with a set of timestamps T = {t_1, t_2, ..., t_N} of a video V, a compression ratio ρ_h for the number of highlights to be generated, and a compression ratio ρ_c for the number of comments in each highlight summary. Our task is to (1) generate a set of highlight shots H(V) = {h_1, h_2, ..., h_m} and (2) highlight summaries A(V) = {a_1, a_2, ..., a_m} that are as close to the ground truth as possible. Each highlight summary a_i comprises a subset of all the comments in that shot. The number of highlight shots m and the number of comments per summary are determined by ρ_h and ρ_c respectively.

4 Video Highlight Detection

In this section, we introduce our framework for highlight detection. Two preliminary tasks are also described, namely the construction of a global time-sync comment word-embedding and of an emotion lexicon.

4.1 Preliminaries

Word-Embedding of Time-Sync Comments

As pointed out earlier, one challenge in analyzing time-sync comments is semantic sparseness, since the number of comments and the comment length are both very limited. Two semantically related words may not appear related if they do not co-occur frequently in one video. To compensate, we construct a global word-embedding on a large collection of time-sync comments. The word-embedding dictionary can be represented as D = {(w_1 : v_1), (w_2 : v_2), ..., (w_|W| : v_|W|)}, where w_i is a word, v_i is the corresponding word-vector, and W is the vocabulary of the corpus.

Emotion Lexicon Construction

As emphasized earlier, it is crucial to extract the emotions in time-sync comments for highlight detection. However, traditional emotion lexicons cannot be used here, since there are too many Internet slang terms born specifically on this type of platform. For example, "23333" means "ha ha ha", and "6666" means "really awesome". Therefore, we construct an emotion lexicon tailored to time-sync comments from the word-embedding dictionary trained in the last step. First, we manually label words of the five basic emotional categories (happy, anger, sad, fear and surprise) as seeds (Ekman, 1992), drawn from the top frequent words in the corpus. Here the sixth emotion category "disgust" is omitted because it is relatively rare in the dataset; it could readily be incorporated for other datasets. Then we expand the emotion lexicon by searching the top k neighbors of each seed word
in the word-embedding space, and adding a neighbor to the seeds if it meets at least a percentage of overlap θ_overlap with all the seeds, with a minimum similarity of sim_min. The neighbors are searched based on cosine similarity in the word-embedding space.

4.2 Lag-Calibration

In this section, we introduce our method for lag-calibration, following the steps of concept mapping, word-embedded lexical chain construction and lag-calibration.

Concept Mapping

To tackle the issue of semantic sparseness in time-sync comments, and to construct lexical chains of semantically related words, words with similar meanings should first be mapped to the same concept. Given the set of comments C of video V, we propose a mapping F from the vocabulary W of the comments C to a set of concepts Z, namely F : W → Z (|W| ≥ |Z|). More specifically, F maps each word w into a concept z = F(w):

  F(w) = z,      if ∃z ∈ Z such that |{u ∈ top_k(w) : F(u) = z}| / k ≥ θ_concept
  F(w) = z_new,  otherwise                                                    (1)

where top_k(w) returns the top k neighbors of word w based on cosine similarity. For every word w in a comment c, we check the percentage of its neighbors already mapped to some concept z. If the percentage exceeds the threshold θ_concept, then w together with its neighbors is mapped to z; otherwise, they are mapped to a new concept z_new.

Lexical Chain Construction

The next step is to construct all lexical chains in the current time-sync comments of video V, so that lagged comments can be calibrated based on these chains. A lexical chain l comprises a set of triples (w, c, t), where w is the actually mentioned word of concept z in comment c, and t is the timestamp of comment c. A lexical chain dictionary L for the time-sync comments C of video V is L = {z_1 : (l_11, l_12, ...), z_2 : (l_21, l_22, ...), ..., z_|Z| : (l_|Z|1, l_|Z|2, ...)}, where z_i ∈ Z is a concept and l_ij is the j-th lexical chain of concept z_i. The algorithm for lexical chain construction is described in Algorithm 1. Specifically, each comment in C is either appended to an existing lexical chain or starts a new, empty lexical chain, based on its temporal distance to existing chains, controlled by the maximum silence t_sil.

Note that word senses in the lexical chains constructed here are not disambiguated, as most traditional algorithms do. Nevertheless, we argue that the lexical chains are still useful, since our concept mapping is constructed from time-sync comments in their natural order, a progressive semantic continuity that naturally reinforces similar word senses for temporally close comments. This semantic continuity, together with the global word-embedding, ensures that our concept mapping is valid in most cases.

Algorithm 1 Lexical Chain Construction
Input: time-sync comments C; word-to-concept mapping F; maximum silence t_sil.
Output: a dictionary of lexical chains L.
  Initialize L ← {}
  for each comment c in C do
    for each word w in c do
      z ← F(w)
      if z in L then
        chains ← L[z]
        if some chain l in chains ends at time t with t_c − t ≤ t_sil then
          l ← l ∪ {(w, c, t_c)}
        else
          chains ← chains ∪ {{(w, c, t_c)}}
        end if
      else
        L[z] ← {{(w, c, t_c)}}
      end if
    end for
  end for
  return L

Table 1. Lexical Chain Construction.

Comment Lag-Calibration

Now, given the constructed lexical chain dictionary L, we can calibrate the comments in C based on their lexical chains. From our observation, the first comment about a shot usually occurs within the shot, while the rest may not. Therefore, we calibrate the timestamp of each comment to the timestamp of the first element of the lexical chain it belongs to. Among all the lexical chains (concepts) a comment belongs to, we pick the one with the highest score score(l, c), computed as the sum frequency of each word in the chain weighted by its logarithmic global frequency log(g(w)). Therefore, each comment is assigned to its most semantically important lexical chain (concept) for calibration.
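A minimal sketch of this chain bookkeeping and calibration, under our reading of the description above (the names `build_chains`, `calibrate`, `max_silence`, and the toy inputs are our own assumptions, not the authors' released code):

```python
import math
from collections import defaultdict

def build_chains(comments, concept_of, max_silence):
    """comments: list of (timestamp, [words]) sorted by timestamp.
    Returns {concept: [chains]}, each chain a list of (word, timestamp)."""
    chains = defaultdict(list)
    for ts, words in comments:
        for w in words:
            concept = concept_of.get(w)
            if concept is None:
                continue
            cs = chains[concept]
            # grow the latest chain of this concept if the silence gap is small
            if cs and ts - cs[-1][-1][1] <= max_silence:
                cs[-1].append((w, ts))
            else:
                cs.append([(w, ts)])
    return chains

def chain_score(chain, global_freq):
    """Word count weighted by inverse log global frequency."""
    return sum(1.0 / math.log(global_freq[w]) for w, _ in chain)

def calibrate(comments, chains, concept_of, global_freq):
    """Snap each comment's timestamp to the start of its best-scoring chain."""
    out = []
    for ts, words in comments:
        best_ts, best_score = ts, 0.0
        for w in words:
            concept = concept_of.get(w)
            if concept is None:
                continue
            for chain in chains.get(concept, []):
                if (w, ts) in chain:
                    s = chain_score(chain, global_freq)
                    if s > best_score:
                        best_ts, best_score = chain[0][1], s
        out.append((best_ts, words))
    return out
```

For instance, with comments at t = 0, 1 and 9 all mentioning a word mapped to the same concept and max_silence = 5, the first two comments form one chain (so the t = 1 comment is calibrated back to t = 0), while the comment at t = 9 starts a fresh chain and keeps its own timestamp.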
The algorithm for the calibration is described in Algorithm 2.

Algorithm 2 Lag-Calibration of Time-Sync Comments
Input: time-sync comments C; word-to-concept mapping F; lexical chain dictionary L; word-embedding dictionary D.
Output: lag-calibrated time-sync comments C′.
  Initialize C′ ← C
  for each comment c in C′ do
    chain_best ← {}; score_best ← 0
    for each word w in c do
      z ← F(w)
      l ← the chain of z in L that contains (w, c, t_c)
      score ← 0
      for (w′, c′, t′) in l do
        g(w′) ← global count of w′ in D
        score ← score + 1 / log(g(w′))
      end for
      if score > score_best then
        chain_best ← l; score_best ← score
      end if
    end for
    t_c ← timestamp of the first element of chain_best
  end for
  return C′

Table 2. Lag-Calibration of Time-Sync Comments.

Note that if there are multiple consecutive shots {s_1, s_2, ..., s_k} with comments of similar content, our lag-calibration method may calibrate many comments in shots s_2, ..., s_k to the timestamp of the first shot s_1, if these comments are connected via lexical chains from shot s_1. This is not necessarily a bad thing, since we hope to avoid selecting redundant consecutive highlight shots and to leave opportunity for other candidate highlights, given a fixed compression ratio.

Shot Importance Scoring

In this section, we first segment the comments into shots of equal temporal length t_shot, and then model shot importance. Highlights can then be detected based on shot importance.

A shot's importance is modeled as being affected by two factors: comment concentration and commenting intensity. For comment concentration, as mentioned earlier, both concept and emotional concentration may contribute to highlight detection. For example, a group of concept-concentrated comments like "the background music/bgm/soundtrack of this shot is classic/inspiring/the best" may indicate a highlight related to memorable background music. Meanwhile, comments such as "this plot is so funny/hilarious/lmao/lol/2333" may suggest a single-emotion-concentrated highlight. Therefore, we combine these two concentrations in our model.

First, we define the emotional concentration E(s) of shot s, based on its time-sync comments C_s and the emotion lexicon Lex, as the inverse of the entropy of the probabilities of the five emotions within the shot:

  E(s) = 1 / ( − Σ_{i=1..5} p_i · log p_i )                                   (2)

  p_i = [ Σ_{w ∈ C_s ∧ w ∈ Lex_i} f_s(w) / log g(w) ] / [ Σ_{j=1..5} Σ_{w ∈ C_s ∧ w ∈ Lex_j} f_s(w) / log g(w) ]   (3)

where f_s(w) is the frequency of word w in shot s and g(w) is its global frequency. Then we define the topical concentration T(s) as the inverse of the entropy of all concepts within the shot:

  T(s) = 1 / ( − Σ_z p_z · log p_z )                                          (4)

  p_z = [ Σ_{w ∈ C_s ∧ F(w)=z ∧ w ∉ Lex} f_s(w) / log g(w) ] / [ Σ_{w ∈ C_s ∧ w ∉ Lex} f_s(w) / log g(w) ]        (5)

The probability of each concept z is thus determined by the summed frequencies of its mentioned words, weighted by their global frequencies and divided by the corresponding values of all words in the shot.

Now the comment importance I(s, C_s) of shot s can be defined as:

  I(s, C_s) = λ · E(s) + (1 − λ) · T(s)                                       (6)

where λ is a hyper-parameter controlling the balance between emotion and concept concentration.

Finally, we define the overall importance of a shot as:

  I′(s, C_s) = I(s, C_s) · log |C_s|                                          (7)

where |C_s| is the number of time-sync comments in shot s, a straightforward yet effective indicator of comment intensity per shot. The problem of highlight detection can now be modeled as a maximization problem:

  maximize   Σ_s I′(s, C_s) · x_s                                             (8)
  subject to Σ_s x_s ≤ ρ_h · |S|,  x_s ∈ {0, 1}

where |S| is the total number of shots.

5 Video Highlight Summarization

Given a set of detected highlight shots H(V) = {h_1, h_2, ..., h_m} of video V, each with all the lag-calibrated comments C_h of that shot, we attempt to generate summaries A(V) = {a_1, a_2, ..., a_m} such that a_i ⊂ C_{h_i}, with compression ratio ρ_c, and such that each a_i is as close to the ground truth as possible.

We propose a simple but very effective summarization model, an improvement over SumBasic (Nenkova & Vanderwende, 2005) with emotion and concept mapping and a two-level updating mechanism. In the modified SumBasic, instead of only down-sampling the probabilities of the words in a selected sentence to prevent redundancy, we down-sample the probabilities of both the words and their mapped concepts when re-weighting each comment. This two-level updating mechanism can (1) penalize the selection of sentences with semantically similar words; and (2) still select a sentence whose words already appear in the summary if those words occur much more frequently. In addition, we use an emotion-bias parameter b to weight words and concepts when computing their probabilities, so that the frequencies of emotional words and concepts are increased by b relative to non-emotional words and concepts.

6 Experiment

In this section, we conduct experiments on large real-world datasets for highlight detection and summarization. We describe the data collection process, evaluation metrics, benchmarks and experiment results.

6.1 Data

In this section, we describe the datasets collected and constructed for our experiments. All datasets and code will be made publicly available on Github2.

Crowdsourced Time-Sync Comment Corpus

To train the word-embedding described in 4.1, we collected a large corpus of time-sync comments from Bilibili3, a content-sharing website in China with time-sync comments. The corpus contains 2,108,746 comments with 15,179,132 tokens and 91,745 unique tokens, from 6,368 long videos. Each comment has 7.20 tokens on average. Before training, each comment is tokenized using the Chinese word tokenization package Jieba4. Repeating characters in words such as "233333", "66666" and "哈哈哈哈" ("hahahaha") are replaced with two of the same character.

The word-embedding is trained using word2vec (Goldberg & Levy, 2014) with the skip-gram model. The number of embedding dimensions is 300, the window size is 7, the down-sampling rate is 1e-3, and words occurring fewer than 3 times are discarded.

Emotion Lexicon Construction

After the word-embedding is trained, we manually select emotional words belonging to the five basic categories from the 500 most frequent words in the word-embedding. Then we expand the emotion seeds iteratively using the expansion procedure of Section 4.1. After each expansion iteration, we also manually examine the expanded lexicon and remove inaccurate words to prevent the concept-drift effect, and use the filtered expanded seeds for the next round of expansion. The minimum overlap θ_overlap is set to 0.05 and the minimum similarity sim_min to 0.6, both selected by grid search over the range (0, 1). The numbers of words for each emotion, initially and after the final expansion, are listed in Table 3.

        Happy   Sad   Fear   Anger   Surprise
Seeds   17      13    21     14      19
All     157     235   258    284     226

Table 3. Number of Initial and Expanded Emotion Words.

Video Highlights Data

To evaluate our highlight-detection algorithm, we constructed a ground-truth dataset. It takes advantage of user-uploaded mix-clips about a specific video on Bilibili; mix-clips are collages of video highlights reflecting the uploader's own preferences. We then take the most-voted highlights as the ground truth for a video. The dataset contains 11 videos of 1333 minutes in total length, with 75,653 time-sync comments. For each video, 3~4 mix-clips about the video are collected from Bilibili. Shots that occur in at least 2 of the mix-clips are considered ground-truth highlights. All ground-truth highlights are mapped to the original video timeline, and the start and end times of each highlight are recorded as ground truth.
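As an illustration of this voting scheme, the following sketch (our own simplification, not the released code; the name `ground_truth_highlights` and the second-level granularity are assumptions) keeps the timeline seconds covered by at least two mix-clips and merges them back into intervals:

```python
def ground_truth_highlights(clip_segments, min_votes=2):
    """clip_segments: one list per mix-clip of (start, end) second
    intervals already mapped onto the original video timeline.
    Returns merged (start, end) intervals covered by >= min_votes clips."""
    votes = {}
    for segments in clip_segments:
        covered = set()
        for start, end in segments:
            covered.update(range(start, end))  # half-open, second-level
        for sec in covered:
            votes[sec] = votes.get(sec, 0) + 1
    keep = sorted(sec for sec, v in votes.items() if v >= min_votes)
    # merge consecutive seconds back into (start, end) intervals
    intervals = []
    for sec in keep:
        if intervals and sec == intervals[-1][1]:
            intervals[-1][1] = sec + 1
        else:
            intervals.append([sec, sec + 1])
    return [tuple(iv) for iv in intervals]
```

For example, with three mix-clips covering (10, 20) and (50, 55), (12, 18), and (53, 60), the function returns [(12, 18), (53, 55)]: only the spans seen by at least two clips survive.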
The mix-clips are selected based on the following heuristics: (1) The mix-clips are searched on Bilibili using the keywords

2 https://github.com/ChanningPing/VideoHighlightDetection
3 https://www.bilibili.com/
4 https://github.com/fxsjy/jieba