
Unsupervised Text Recap Extraction for TV Series

Hongliang Yu, Shikun Zhang, and Louis-Philippe Morency
Language Technologies Institute, Carnegie Mellon University
Pittsburgh, PA 15213, USA
{yuhongliang, shikunz, morency}@cs.cmu.edu

Abstract

Sequences found at the beginning of TV shows help the audience absorb the essence of previous episodes, and grab their attention with upcoming plots. In this paper, we propose a novel task: text recap extraction. Compared with conventional summarization, text recap extraction captures the duality of summarization and plot contingency between adjacent episodes. We present a new dataset, TVRecap, for text recap extraction on TV shows. We propose an unsupervised model that identifies text recaps based on plot descriptions. We introduce two contingency factors, concept coverage and sparse reconstruction, that encourage recaps to prompt the upcoming story development. We also propose a multi-view extension of our model which can incorporate dialogues and synopses. We conduct extensive experiments on TVRecap, and conclude that our model outperforms summarization approaches.

1 Introduction

According to a study by FX Networks, the total number of ongoing scripted TV series in the U.S. hit a new high of 409 on broadcast, cable, and streaming in 2015 (http://tinyurl.com/jugyyu2). Such a large number indicates that there are more shows than anyone can realistically watch. To attract prospective audiences as well as help current viewers recall the key plot when airing new episodes, some TV shows add a clip montage, called a recap sequence, at the beginning of new episodes or seasons. Recaps not only help the audience absorb the essence of previous episodes, but also grab people's attention with upcoming plots. However, creating those recaps for every newly aired episode is labor-intensive and time-consuming. To our advantage, there are many textual scripts freely available online which describe the events and actions happening during TV show episodes (e.g., http://www.simplyscripts.com/tv_all.html). These textual scripts contain plot descriptions of the events, dialogues of the actors, and sometimes also a synopsis summarizing the whole episode.

These abundant textual resources enable us to study a novel, yet challenging task: automatic text recap extraction, illustrated in Figure 1. The goal of text recap extraction is to identify segments from scripts which both summarize the current episode and prompt the story development of the next episode. This unique task brings new technical challenges, as it goes beyond summarizing prior TV episodes by introducing the concept of plot contingency with the upcoming TV episode. It differs from conventional summarization techniques, which do not consider the interconnectivity between neighboring episodes. Text recaps should capture the duality of summarization and plot contingency between neighboring episodes. To our knowledge, no dataset exists to study this research topic.

In this paper, we present an unsupervised model to automatically extract text recaps of TV shows from plot descriptions. Since we assume recaps should cover the main plot of the current episode and also prompt the story development of the next episode, our model jointly optimizes these two objectives.


[Figure 1 appears here: three panels labeled "Current Episode", "Next Episode", and "Text Recap", showing description sentences from two adjacent episodes and the recap sentences extracted from the current episode.]

Figure 1: Illustration of text recap extraction. The system extracts sentences from the current episode. The text recap sentences in black summarize the current episode, while the colored sentences motivate the next episode.

To summarize the current episode, our model exploits coverage-based summarization techniques. To connect to the next episode, we devise two types of plot contingency factors between adjacent episodes. These factors implement the coverage and reconstruction assumptions with respect to the next episode. We also show how our model can be extended to integrate dialogues and synopses when available.

We introduce a new dataset named TVRecap (available at http://multicomp.cs.cmu.edu) for text recap extraction, which consists of TV series with textual scripts, including descriptions, dialogues and synopses. The dataset enables us to study whether contingency-based methods, which exploit relationships between adjacent episodes, can improve on summarization-based methods.

The rest of this paper is organized as follows. In Section 2, we discuss related work and the motivation for our work. In Section 3, we introduce our new dataset for text recap extraction. Section 4 explains our proposed model for text recap extraction, and Section 5 extends the model by incorporating synopses and dialogues. In Sections 6 and 7, we present our experimental results and analyses, and we conclude our work in Section 8.

2 Related Work

In this section, we discuss three related research topics. Text summarization is a relevant task that aims to create a summary that retains the most important points of the original document. We then discuss applications of text summarization, and finally video description, which is complementary to our work.

Generic Text Summarization Algorithms
Text summarization is widely explored in the news domain (Hong and Nenkova, 2014; McKeown, 2005). Generally, there are two approaches: extractive and abstractive summarization.

Extractive summarization forms a summary by choosing the most representative sentences from the original corpus. The early system LEAD (Wasson, 1998) was pioneering work: it selected the leading text of a document as the summary, and was applied in news search to help online customers focus their queries on the beginning of news documents. He et al. (2012) assumed that a summary should consist of the sentences that best reconstruct the original document, and modeled the relationships among sentences as an optimization problem. Moreover, Sipos et al. (2012) and Lin and Bilmes (2010) studied multi-document summarization using coverage-based methods; among them, Lin and Bilmes (2010) proposed to approximate the optimal solution of a class of functions by exploiting submodularity.

Abstractive summarization automatically creates new sentences. For example, compared with the sentence-level analysis in extractive summarization, Bing et al. (2015) explored fine-grained syntactic units, i.e., noun/verb phrases, to represent concepts in input documents. The informative phrases were then used to generate sentences.

In this paper, we generalize the idea of text summarization to text recap extraction. Instead of summarizing a given document or collection, our model emphasizes plot contingency with the next episode.

Summarization Applications
Summarization techniques are not restricted to informative resources (e.g., news); applications in broader areas are gaining attention (Aparício et al., 2016). With the prevalence of online forums, Misra et al. (2015) developed tools to recognize arguments from opinionated conversations and group them across discussions. In the entertainment industry, Sang and Xu (2010) proposed a character-based movie summarization approach by incorporating scripts into movie analysis. Moreover, recent applications include multimedia artifact generation (Figueiredo et al., 2015), music summarization (Raposo et al., 2015) and customer satisfaction analysis (Roy et al., 2016).

Generating Video Descriptions
Video description is a task that studies the automatic generation of natural language describing events happening in video clips. Most work uses sequential learning for encoding temporal information and language generation (Guadarrama et al., 2013; Rohrbach et al., 2013, 2015; Donahue et al., 2015). Our work is complementary to video description: the large number of unlabeled videos can be utilized to train an end-to-end recap extraction system once video description models can properly output textual descriptions.

Contributions of This Paper
In contrast with prior work, the main contributions of this paper are:
(1) We propose a novel problem, text recap extraction for TV series. Our task aims to identify segments from scripts which both summarize the current episode and prompt the story development of the upcoming episode;
(2) We propose an unsupervised model for text recap extraction from descriptions. It models the episode contingency through two factors, next-episode summarization and sparse reconstruction;
(3) We introduce a new dataset for TV show recap extraction, where descriptions, dialogues and synopses are provided.

3 The TVRecap Dataset

We collected a new dataset, called TVRecap, for text recap extraction on TV series. We gathered and processed scripts, subtitles and synopses from websites (http://lostpedia.wikia.com/ and https://www.wikipedia.org/) as components to build our model upon. We also established ground truth to help future research on this challenging topic. TVRecap includes all seasons of the widely-known show "Lost", with a total of 106 episodes. Statistics of our dataset are shown in Table 1.

              # sent.   avg. # sent.   # words   avg. # w./s.
description    14,686       138.5      140,684       9.57
dialogue       37,714       355.8      284,514       7.54
synopsis          453         4.27       7,868      17.36
recap             619        17.19       5,892       9.52

Table 1: Statistics of TVRecap.

This section describes how textual scripts and synopses are processed, and how we automatically define the ground truth of text recap annotations.

Descriptions, Dialogues and Synopses
A script for one TV series episode is a sequence of dialogues interleaved with descriptions (marked by square brackets). We automatically split the script into descriptions and dialogues. For each episode, we also downloaded the synopsis, a human-written paragraph summarizing the main plot of the episode. Figure 2 shows examples of a script and a synopsis from our TVRecap dataset.

LOCKE: Two players. Two sides. One is light... one is dark. Walt, do you want to know a secret?
[Claire writing in her diary. Jin approaches and offers her some urchin. She shakes her head, but then gives in and takes some.]
CLAIRE: No. Thank you. No, it's okay. [Jin keeps insisting] No, really. Okay. Thanks.
(a) Script: containing descriptions and dialogues.

Boone steals the decreasing water supply in a misguided attempt to help everyone, but the survivors turn on him. A sleep-deprived Jack chases after what appears to be his deceased father in the forests and eventually discovers caves with fresh water. Jack comes to terms with his role as leader. In flashbacks, Jack goes to Australia to retrieve his deceased father.
(b) Synopsis.

Figure 2: Example of a script (including descriptions and dialogues) and a synopsis.
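As an illustration, the description/dialogue split can be implemented with a single bracket-matching pass. The function below is our own simplified sketch, assuming the bracketed-description and SPEAKER: conventions shown in Figure 2; it is not the paper's released preprocessing code.

```python
import re

def split_script(script_text):
    """Split a raw episode script into description and dialogue sentences.

    Descriptions are the bracketed stage directions (e.g. "[Claire writing
    in her diary. ...]"); the remaining "SPEAKER: utterance" lines are
    treated as spoken dialogue. A simplified, illustrative sketch.
    """
    descriptions = re.findall(r"\[([^\]]+)\]", script_text)
    # Remove the bracketed spans, leaving speaker turns such as "CLAIRE: ...".
    dialogue_text = re.sub(r"\[[^\]]+\]", " ", script_text)
    dialogues = [
        line.split(":", 1)[1].strip()
        for line in dialogue_text.splitlines()
        if ":" in line and line.split(":", 1)[0].strip().isupper()
    ]
    return descriptions, dialogues
```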

All plot descriptions and dialogues are time-aligned automatically using the subtitle files (from http://www.tvsubtitles.net/). We first aligned the dialogue sentences from the script with the subtitle files, which contain time-stamps (in milliseconds) for the spoken dialogues. We then estimated the time-stamps of description sentences using the surrounding dialogues.
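The exact estimation scheme is not spelled out above; one simple, assumption-laden possibility is to give each description the midpoint of the nearest time-stamped dialogues around it, as in this sketch (the event layout is hypothetical):

```python
def estimate_description_times(events):
    """Assign time-stamps to descriptions from surrounding dialogues.

    `events` is the script as an ordered list of dicts such as
    {"kind": "dialogue", "time": 123450} or {"kind": "description"},
    where dialogue times come from the subtitle alignment. Each
    description gets the midpoint of its neighboring dialogue times.
    """
    times = [e.get("time") for e in events]
    for i, e in enumerate(events):
        if e["kind"] != "description":
            continue
        prev_t = next((times[j] for j in range(i - 1, -1, -1)
                       if times[j] is not None), None)
        next_t = next((times[j] for j in range(i + 1, len(events))
                       if times[j] is not None), None)
        if prev_t is not None and next_t is not None:
            e["time"] = (prev_t + next_t) / 2
        else:
            e["time"] = prev_t if prev_t is not None else next_t
    return events
```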
Since descriptions sometimes contain words that are not relevant to the event, we manually post-processed all descriptions and recap sentences as follows: (1) remove trivial sentences such as "music on"; (2) remove introductory terms like "Shot of"; (3) complete missing grammatical components (such as omitted subjects) of sentences where possible.

Text Recap Annotations
The goal of our ground-truth annotation is to identify the text descriptions associated with the TV show recap. We performed this annotation task in three steps. First, we automatically extracted the recap sequence, a montage of important scenes from previous episodes that informs viewers of what has happened in the show, from the TV show video. These recap sequences, if available, are always shown at the beginning of TV episodes; we automatically separated them from the full-length video files by detecting a lengthy appearance of black frames in the first several minutes of the episode. Second, we located the frames of the recap sequence in the videos of previous episodes and recorded their time-stamps. Finally, the recap annotations are automatically identified by comparing the video time-stamps with the text description time-stamps. A description is annotated as part of the recap if at least 4 frames from the video recap are present during this description.
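The final matching step reduces to an interval-overlap test; a minimal sketch, assuming hypothetical data structures (a description span in milliseconds and a list of recap frame time-stamps):

```python
def is_recap_description(desc_start_ms, desc_end_ms, recap_frame_times_ms,
                         min_frames=4):
    """Label a description as part of the recap if at least `min_frames`
    recap frames fall inside the description's estimated time span.

    Sketch of the third annotation step; time-stamps are assumed to come
    from the subtitle alignment described above.
    """
    hits = sum(desc_start_ms <= t <= desc_end_ms
               for t in recap_frame_times_ms)
    return hits >= min_frames
```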

4 Our Text Recap Extraction Model

In our Text Recap Extraction Model (TREM), we assume a good text recap should have two characteristics: (a) it covers the main plot of the current episode, and (b) it holds plot contingency with the next episode. Under the first assumption, the text recap can be seen as a summary that retains the most important plots. Under assumption (b), the text recap should capture the connections between two consecutive episodes.

Formally, the system is given $E$ episodes from a specific TV show, where each episode contains textual descriptions. We define these descriptions as $D = \{D^1, \dots, D^E\}$, where $D^i$ is the set of descriptions of episode $i$. $D^i$ is composed of descriptive sentences, $D^i = \{d^i_1, \dots, d^i_{|D^i|}\}$, where $d^i_j$ is the $j$-th sentence. The goal of our task is to find text recaps $R = \{R^1, \dots, R^{E-1}\}$, where the components of $R^i$ are selected from $D^i$ subject to a length budget (a constraint on the number of sentences) $|R^i| \le K$. In our TREM model, the text recap $R^i$ of the $i$-th episode is optimized by:

$$\max_{R^i \subset D^i} \; \mathcal{F}(R^i) = \mathcal{S}(R^i, D^i) + \mathcal{M}(R^i, D^{i+1}) \quad \text{s.t.} \quad |R^i| \le K, \qquad (1)$$

where $\mathcal{S}(R^i, D^i)$ measures how well $R^i$ summarizes $D^i$, and $\mathcal{M}(R^i, D^{i+1})$ quantifies the level of connectivity between the text recap of the current episode and the plot description of the next episode. By using $\mathcal{M}(\cdot, \cdot)$, we expect to produce text recaps with better plot contingency with the upcoming story.

In the following sections, we present in detail: (1) the definition of the summarization function $\mathcal{S}(\cdot, \cdot)$; (2) the two factors that derive the contingency function $\mathcal{M}(\cdot, \cdot)$ based on different hypotheses.

4.1 Plot Summarization

In this section, we discuss the summarization component of our model's objective function. Our model is inspired by coverage-based summarization (Lin and Bilmes, 2010), whose key idea is to find a proxy that approximates the information overlap between the summary and the original document. In this work, any text is assumed to be represented by a set of "concepts", with weights that distinguish their importance. To be more specific, a concept is defined as a noun/verb/adjective or a noun/verb phrase. In terms of concepts, we define the summarization term $\mathcal{S}(R^i, D^i)$ as follows:

$$\mathcal{S}(R^i, D^i) = \sum_{c \in \mathcal{C}(D^i)} z(c, D^i) \, \max_{r \in R^i} w(c, r), \qquad (2)$$

where $\mathcal{C}(D^i) = \{c' \mid c' \in d^i_j, \forall d^i_j \in D^i\}$ is the concept set of $D^i$, and $z(c, D^i)$ measures the importance of $c$ in $D^i$. We use Term Frequency Inverse Document Frequency (TF-IDF) (Salton and Buckley, 1988) to calculate $z(c, D^i)$. Finally, $w(c, r)$ denotes the relatedness of a concept $c$ to a sentence $r$. We use Word2Vec (Mikolov et al., 2013) vectors as the semantic representation of concepts, and define $w(c, r)$ as:

$$w(c, r) = |c| \cdot \max_{c' \in r} \cos(\mathbf{c}, \mathbf{c'}), \qquad (3)$$

where the bold notations are the Word2Vec representations of $c$ and $c'$. Note that if $c$ is a phrase, $\mathbf{c}$ is mean-pooled from the embeddings of its component words, and $|c|$ is the number of words in $c$.
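As a concrete illustration of Equations 2 and 3, the sketch below scores a candidate recap against an episode's concept set. It is a minimal sketch: the POS-based concept extraction, the TF-IDF weights, and the Word2Vec lookup are stood in for by plain Python dicts and NumPy arrays rather than the models used in the paper.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def relatedness(concept_words, sentence_words, embed):
    """w(c, r) in Eq. 3: |c| times the best cosine match between the
    mean-pooled concept embedding and any word embedding in sentence r."""
    c_vec = np.mean([embed[w] for w in concept_words], axis=0)
    best = max(cosine(c_vec, embed[w]) for w in sentence_words)
    return len(concept_words) * best

def summarization_score(recap, concept_weights, embed):
    """S(R, D) in Eq. 2: TF-IDF-weighted coverage of episode concepts.

    `recap` is a list of tokenized sentences; `concept_weights` maps each
    concept (a tuple of words) to its TF-IDF weight z(c, D)."""
    return sum(
        z * max(relatedness(c, r, embed) for r in recap)
        for c, z in concept_weights.items()
    )
```

Maximizing this score over sentence subsets under the budget $K$ is exactly the submodular selection problem addressed in Section 4.3.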
4.2 Plot Contingency

We model plot contingency on the concept level as well as on the sentence level. Therefore, the component $\mathcal{M}(\cdot, \cdot)$ is decomposed into two factors:

$$\mathcal{M}(R^i, D^{i+1}) = \lambda_s \mathcal{M}_s(R^i, D^{i+1}) + \lambda_r \mathcal{M}_r(R^i, D^{i+1}), \qquad (4)$$

where $\mathcal{M}_s(R^i, D^{i+1})$ measures how well $R^i$ can summarize the next episode $D^{i+1}$, and $\mathcal{M}_r(R^i, D^{i+1})$ is the factor that quantifies the ability of $R^i$ to reconstruct $D^{i+1}$. $\lambda_s, \lambda_r \ge 0$ are the coefficients for $\mathcal{M}_s(\cdot, \cdot)$ and $\mathcal{M}_r(\cdot, \cdot)$ respectively. In the following sections, we define and explain these two factors in detail.

4.2.1 Concept Coverage

Following the coverage assumption of Section 4.1, we argue that the text recap should also cover important concepts from the next episode. Therefore, the first contingency factor can be defined in the same form as the summarization component, where the $D^i$'s in Equation 2 are replaced by $D^{i+1}$'s:

$$\mathcal{M}_s(R^i, D^{i+1}) = \sum_{c \in \mathcal{C}(D^{i+1})} z(c, D^{i+1}) \, \max_{r \in R^i} w(c, r). \qquad (5)$$

4.2.2 Sparse Reconstruction

As events happening in the current episode can have an impact on the next episode, there exist hidden connections between the descriptive sentences in $D^i$ and $D^{i+1}$. To be more specific, assuming descriptive sentences from $D^{i+1}$ are dependent on a few sentences in $D^i$, we aim to infer such hidden contingency. Here we assume that sentence $d^{i+1}_j$ is related to a small number of sentences in $D^i$.

Let $\alpha^{i+1}_j \in \mathbb{R}^{|D^i|}$ be the indicator that determines which sentences in $D^i$ prompt $d^{i+1}_j$, and let $W$ be the matrix that transforms these contingent sentences to the embedding space of $d^{i+1}_j$. Intuitively, our model learns $W$ by assuming each sentence in $D^{i+1}$ can be reconstructed from contingent sentences in $D^i$:

$$\mathbf{d}^{i+1}_j \approx W \mathbf{D}^i \alpha^{i+1}_j. \qquad (6)$$

In this equation, we first convert every description sentence to its distributed representation using the pre-trained skip-thought model proposed by Kiros et al. (2015). The sentence embedding is denoted in bold (e.g., $\mathbf{d}^i_j$ for sentence $d^i_j$). $\mathbf{D}^i = [\mathbf{d}^i_1; \dots; \mathbf{d}^i_{|D^i|}]$ stacks the vector representations of all sentences in $D^i$, and $\alpha^{i+1}_j$ linearly combines the contingent sentences.

We propose to jointly optimize $\alpha^{i+1}_j$ and $W$ by:

$$\min_{\{\alpha^{i+1}\}_{i=1}^{E-1},\, W} \; \sum_{i,j} \Big( \|W \mathbf{D}^i \alpha^{i+1}_j - \mathbf{d}^{i+1}_j\|_2^2 + \gamma \|\alpha^{i+1}_j\|_1 \Big) + \theta \|W\|_F^2, \qquad (7)$$

where we denote $\alpha^{i+1} = [\alpha^{i+1}_1; \dots; \alpha^{i+1}_{|D^{i+1}|}]$. We impose a sparsity constraint on $\alpha^{i+1}_j$ with the $L_1$ norm such that only a small fraction of the sentences in $D^i$ will be linked to $d^{i+1}_j$. $\gamma$ and $\theta$ are the coefficients of the regularization terms.

Given the optimal $W^*$ from Equation 7, our main objective is to identify the subset of descriptions in $D^i$ that best captures the contingency between $D^i$ and $D^{i+1}$. The reconstruction contingency factor can be defined as:

$$\mathcal{M}_r(R^i, D^{i+1}) = \sum_{d \in D^{i+1}} \max_{r \in R^i} \mathbf{r}^\top W^* \mathbf{d}. \qquad (8)$$
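Equation 7 can be solved by alternating between the two variable blocks, which Algorithm 2 in Section 4.3 below makes precise: a ridge-regression update for $W$ (Equation 9) and a per-sentence sparse-coding step for each $\alpha$. The sketch below is one illustrative implementation, using scikit-learn's Lasso in place of the online dictionary learning of Mairal et al. (2009); the shapes, hyper-parameter values, update order, and fixed iteration count are our own assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso

def fit_reconstruction(D_list, n_iter=20, gamma=0.1, theta=1.0):
    """Alternating solver for Eq. 7 (illustrative sketch of Algorithm 2).

    D_list[i] is a (dim, n_i) matrix of skip-thought embeddings for the
    sentences of episode i. Returns the transformation matrix W.
    """
    dim = D_list[0].shape[0]
    W = np.eye(dim)  # initialization as in Algorithm 2
    alphas = [np.zeros((D_list[i].shape[1], D_list[i + 1].shape[1]))
              for i in range(len(D_list) - 1)]
    for _ in range(n_iter):
        # Sparse-coding step: with W fixed, each alpha_j is a Lasso problem
        # regressing d_j^{i+1} on the columns of W @ D^i.
        for i in range(len(D_list) - 1):
            basis = W @ D_list[i]                       # (dim, n_i)
            for j in range(D_list[i + 1].shape[1]):
                lasso = Lasso(alpha=gamma, fit_intercept=False, max_iter=2000)
                lasso.fit(basis, D_list[i + 1][:, j])
                alphas[i][:, j] = lasso.coef_
        # Ridge step (Eq. 9): W maps the combinations x_j = D^i alpha_j
        # onto the targets d_j^{i+1}.
        X = np.hstack([D_list[i] @ alphas[i] for i in range(len(D_list) - 1)])
        Y = np.hstack([D_list[i + 1] for i in range(len(D_list) - 1)])
        W = Y @ X.T @ np.linalg.inv(X @ X.T + theta * np.eye(dim))
    return W
```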
4.3 Optimization

In this section, we describe our approach to optimizing the main objective function expressed in Equations 1 and 7.

Finding an efficient algorithm to optimize a set function like Equation 1 is often challenging. However, it can easily be shown that the objective function of Equation 1 is submodular, since all of its components $\mathcal{S}(R^i, D^i)$, $\mathcal{M}_s(R^i, D^{i+1})$ and $\mathcal{M}_r(R^i, D^{i+1})$ are submodular with respect to $R^i$. According to Lin and Bilmes (2011), there exists a simple greedy algorithm for monotone submodular function maximization whose solution is guaranteed to be close to the true optimum. Specifically, if we denote $R^i_{\text{greedy}}$ as the approximation obtained by the greedy algorithm and $R^i_*$ as the best possible solution, then $\mathcal{F}(R^i_{\text{greedy}}) \ge (1 - \frac{1}{e}) \cdot \mathcal{F}(R^i_*)$, where $\mathcal{F}(\cdot)$ is the objective function of Equation 1 and $e \approx 2.718$ is the natural constant. The greedy approach is shown in Algorithm 1.

Algorithm 1: Text Recap Extraction
Input: vectorized sentence representations $\{D^i\}_{i=1}^{E}$; parameters $\lambda_s$, $\lambda_r$, $\theta$, $\gamma$; budget $K$; optimal $W^*$ for Equation 7.
Output: text recaps $\{R^i\}_{i=1}^{E-1}$.
1: for $i = 1, \dots, E-1$
2:   initialize $R^i \leftarrow \emptyset$;
3:   repeat
4:     $r^* \leftarrow \arg\max_r \mathcal{F}(R^i \cup \{r\})$;
5:     $R^i \leftarrow R^i \cup \{r^*\}$;
6:   until $|R^i| \ge K$
7: end for

Algorithm 1 requires the optimal $W^*$ learned from the adjacent episode pairs in Equation 7. We utilize an algorithm that iteratively updates $W$ and $\{\alpha\}$ given the current solution. At each iteration, each variable ($W$ or $\alpha^{i+1}$) is updated while the other is fixed. At the $t$-th iteration, $W^{(t)}$ is computed as the solution of ridge regression (Hoerl and Kennard, 1970):

$$W^{(t)} = \mathbf{D} \mathbf{X}^\top (\mathbf{X} \mathbf{X}^\top + \theta I)^{-1}, \qquad (9)$$

where $\mathbf{D}$ and $\mathbf{X}$ stack all $\mathbf{d}^{i+1}_j$ and $\mathbf{x}^{i+1}_j \triangleq \mathbf{D}^i \alpha^{i+1}_j$, for $i = 1, \dots, E-1$ and $j = 1, \dots, |D^{i+1}|$. Fixing $W$, each $\alpha^{i+1}_j$ can be solved separately by general sparse coding algorithms, as stated in Mairal et al. (2009). Algorithm 2 shows the optimization process for Equation 7.

Algorithm 2: Reconstruction Matrix Optimization
Input: vectorized sentence representations $\{D^i\}_{i=1}^{E}$; $\theta$ and $\gamma$.
Output: contingency matrix $W$.
1: initialize $W^{(0)} \leftarrow I$ and $\alpha^{i+1,(0)}_j \leftarrow \mathbf{0}$, $\forall i, j$;
2: initialize iteration step $t \leftarrow 0$;
3: repeat
4:   $t \leftarrow t + 1$;
5:   update $W^{(t)}$ according to Equation 9;
6:   $\forall i, j$: $\alpha^{i+1,(t)}_j \leftarrow \mathrm{sparse\_coding}(W^{(t)})$;
7: until $\|W^{(t)} - W^{(t-1)}\|_F^2 \le \epsilon$
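In code, the greedy selection of Algorithm 1 is a short loop. This sketch treats the objective $\mathcal{F}$ of Equation 1 as a black-box set function; a lazy-evaluation variant would be faster, but the plain version below matches the pseudocode.

```python
def greedy_recap(candidates, objective, budget):
    """Greedy maximization of the submodular objective F (Algorithm 1).

    `candidates` is the sentence set D^i, `objective` evaluates F on a set
    of selected sentences, and `budget` is the sentence budget K. The
    (1 - 1/e) guarantee holds for monotone submodular objectives.
    """
    recap = []
    while len(recap) < budget:
        remaining = [s for s in candidates if s not in recap]
        if not remaining:
            break
        best = max(remaining, key=lambda s: objective(recap + [s]))
        recap.append(best)
    return recap
```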

5 Multi-View Recap Extraction

In addition to plot descriptions, there are also dialogues and plot synopses available for TV shows. Descriptions, dialogues and synopses can be seen as three different views of the same TV show episode. Previously, we built TREM using plot descriptions; in this section, we expand our TREM model to incorporate plot synopses and dialogues. We define text synopses and dialogues as $S = \{S^1, \dots, S^E\}$ and $T = \{T^1, \dots, T^E\}$, where $S^i$ and $T^i$ are the sets of sentences from the synopses and dialogues of the $i$-th episode.

Dialogues
In TV shows, a lot of useful information is presented via the actors' dialogues, which motivates us to extend our TREM model to include dialogues. Both views can be used to identify recap segments, which are assumed to be summative and contingent. Denoting the neighboring dialogues of $R^i$ as $N(R^i) = \{t \in T^i \mid \exists r \in R^i \text{ s.t. } |\mathrm{time}(t) - \mathrm{time}(r)| < \delta\}$, we extend the optimization objective (Equation 1) into:

$$\mathcal{F}(R^i) = \mathcal{S}(R^i, D^i) + \mathcal{S}(N(R^i), T^i) + \mathcal{M}(R^i, D^{i+1}) + \mathcal{M}(N(R^i), T^{i+1}). \qquad (10)$$

Synopses
Since a synopsis is a concise summary of each episode, we can treat plot summarization as text alignment, where $R^i$ is assumed to match the content of $S^i$. Therefore, the summarization term can be redefined by substituting $D^i$ with $S^i$:

$$\mathcal{S}(R^i, S^i) = \sum_{c \in \mathcal{C}(S^i)} z(c, S^i) \, \max_{r \in R^i} w(c, r). \qquad (11)$$

Similarly, the contingency component can be modified to include connections from synopses to detailed descriptions. For Equation 8, we substitute $D^{i+1}$ with $S^{i+1}$, so that our model only focuses on the high-level storyline:

$$\mathcal{M}_r(R^i, S^{i+1}) = \sum_{s \in S^{i+1}} \max_{r \in R^i} \mathbf{r}^\top W^* \mathbf{s}. \qquad (12)$$
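For concreteness, the dialogue neighborhood $N(R^i)$ of Equation 10 is a simple time-window filter. In the sketch below, the window `delta_ms` and the (timestamp, sentence) layout are illustrative assumptions, not values from the paper.

```python
def neighboring_dialogues(recap, dialogues, delta_ms=5000):
    """N(R) in Eq. 10: dialogue sentences whose time-stamp falls within
    delta of some recap sentence. Both arguments are lists of
    (timestamp_ms, sentence) pairs from the subtitle alignment."""
    return [
        (t_d, sent) for (t_d, sent) in dialogues
        if any(abs(t_d - t_r) < delta_ms for (t_r, _) in recap)
    ]
```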

6 Experimental Setup

We designed our experiments to evaluate whether our TREM model, by considering contingency between adjacent episodes, can achieve better results than summarization techniques. Furthermore, we want to examine how each contingency factor proposed in Section 4.2 contributes to system performance. As our model can integrate multiple views, we also want to dissect the effects of using different combinations of the three views.

6.1 Comparison Models

To answer the research questions presented above, we compare the following methods in our experiments.

ILP-Ext and ILP-Abs (Banerjee et al., 2015): This summarizer generates sentences by optimizing an integer linear programming problem in which the information content and linguistic quality are defined. Both the extractive and abstractive implementations are used in our experiments.

TREM: Our TREM model proposed in Section 4, which extracts sentences that can both summarize the current episode and prompt the next episode with two contingency factors.

TREM w/o SR: The TREM model without the sparse reconstruction factor proposed in Section 4.2.2.

TREM w/o CC: The TREM model without the concept coverage factor proposed in Section 4.2.1.

TREM w/o SR&CC: The summarization-only TREM model without contingency factors. In the rest of the paper, we also call it TREM-Summ.

Multi-view TREM: The augmented TREM model with descriptions, dialogues and synopses, as proposed in Section 5. Different views and combinations are tested in our experiments.

6.2 Methodology

Using TVRecap, we measure the quality of generated sentences following the standard metrics in the summarization community, ROUGE (Lin and Hovy, 2003).

For the purpose of evaluation, we defined a development and a test set by randomly selecting 18 adjacent pairs of episodes from all seasons. These episodes were selected to have at least two recap description sentences. The remaining 70 episodes were only used during the learning process of W. After tuning hyper-parameters on the development set, we report the comparison results on the test set.

7 Results and Discussion

7.1 Overall Results

Table 2 shows our experimental results comparing TREM and baseline models using descriptions.

                                     ROUGE-1   ROUGE-2   ROUGE-SU4
ILP-Ext (Banerjee et al., 2015)       0.308     0.112      0.091
ILP-Abs (Banerjee et al., 2015)       0.361     0.158      0.120
TREM (our approach)                   0.405     0.207      0.148
  w/o SR                              0.393     0.189      0.144
  w/o CC                              0.383     0.171      0.132
  w/o SR&CC (summarization only)      0.374     0.168      0.129

Table 2: Experimental results of different methods using descriptions. Contingency-based methods generally outperform summarization-based methods.

In general, the contingency-based methods (TREM, TREM w/o SR and TREM w/o CC) outperform the summarization-based methods. Our contingency assumptions are verified, as adding CC and SR each improves over TREM with the summarization component only. Moreover, the best result is achieved by the complete TREM model with both contingency factors, which suggests that these two factors, modeling word-level summarization and sentence-level reconstruction, are complementary.

Among the summarization-based methods, our TREM-Summ obtains higher ROUGE scores than the two ILP approaches.

Additionally, we note that the performance of ILP-Ext is poor. This is because ILP-Ext tends to output short sentences, while ROUGE is a recall-oriented measure.

7.2 Multi-view Comparison

As shown in Table 4, our second study examines the effect of different views in both types of methods built on the TREM model.

Model        Current   Next        R-1     R-2     R-SU4
TREM-Summ    des       -           0.374   0.168   0.129
             syn       -           0.369   0.163   0.121
             dial      -           0.354   0.138   0.115
             des+syn   -           0.384   0.172   0.132
             des+dial  -           0.386   0.168   0.135
TREM         des       des         0.405   0.207   0.148
             des       syn         0.411   0.219   0.154
             des       dial        0.375   0.158   0.127
             des       des+syn     0.409   0.210   0.154
             des       des+dial    0.395   0.177   0.142

Table 4: Comparison of views in summarization-only TREM and full TREM with contingency factors. "des", "syn", and "dial" abbreviate descriptions, synopses and dialogues.

In single-view summarization, TREM-Summ with descriptions outperforms the methods based on the other two views. With hybrid views, only ROUGE-1 is significantly improved, while ROUGE-2 and ROUGE-SU4, which focus more on semantic consistency, show little improvement.

In the contingency-based methods, we keep the current episode represented as descriptions, which obtained the best performance in single-view summarization, and change the views of the next episode. Comparing the model using descriptions with the one fusing descriptions and synopses, we can see that simply adding views does not guarantee higher ROUGE scores. In both TREM-Summ and full TREM, dialogue is inferior to the other views, possibly because dialogues contain too many trivial sentences. Synopses, however, are relatively short but provide the key plots of the story, and hence achieve the best ROUGE scores.

7.3 Qualitative Study on Sparse Reconstruction

In this section, we give some examples to illustrate the process of sparse reconstruction. Equation 7 assumes that each descriptive sentence can be reconstructed by a few sentences from the previous episode. Table 3 shows three examples of sentences with their top-3 reconstructive sentences, as defined by the values in the indicator vector $\alpha^{i+1}_j$.

Target sentence (next episode): We see three bottles of water.
Highest reconstruction values (current episode): Kate is putting water bottles in a pack. / They go into a room with a body bag on a gurney. / Kate is going through clothes, as Claire approaches.

Target sentence (next episode): Boone is coming up to camp and sees Locke sitting by a fire.
Highest reconstruction values (current episode): Locke is with his knife case, holding a pencil, sitting by a fire. / Locke throws a knife into the ground, just out of Boone's reach. / Boone quickly cuts through the ropes and starts running.

Target sentence (next episode): In another part of the temple grounds, Miles and Hurley are playing Tic-Tac-Toe by placing leaves in a grid of sticks on the ground.
Highest reconstruction values (current episode): John contemplates the fabric swatches he is holding. / On the beach, Frank covers Locke's body with a tarp. / Helen closes the door and brings the case inside to the kitchen.

Table 3: A case study on sparse reconstruction as proposed in Section 4.2.2. Sentences in the first column are reconstructed by sentences in the second column. The first two examples successfully capture related sentences, while the third example fails.
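The top-3 lists in Table 3 come directly from the learned sparse codes; a minimal sketch of that lookup, assuming `alpha` holds the $|D^i|$-dimensional NumPy vector $\alpha^{i+1}_j$ for one target sentence:

```python
import numpy as np

def top_reconstructive_sentences(alpha, current_episode_sents, k=3):
    """Return the k current-episode sentences with the largest weights
    in the sparse indicator vector alpha (Eqs. 6-7)."""
    top_idx = np.argsort(-np.abs(alpha))[:k]
    return [current_episode_sents[i] for i in top_idx]
```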

7.4 Limitations and Future Work

TREM restricts the contingency to adjacent episodes. However, storylines sometimes proceed through multiple episodes. With additional connectivity terms $\mathcal{M}(R^i, D^j)$ where $i < j$, our model could be generalized into a system with longer dependencies.

While our model and dataset are appropriate for text recap extraction and algorithm comparison, this task can be further applied to multimedia settings, where visual or acoustic information can be included. Therefore, in future work, we plan to expand our work to broader applications where the interconnectivity between consecutive instances is crucial, such as educational lectures, news series and book chapters. Specifically, TREM can be integrated with video description results to obtain an end-to-end system that produces video recaps.

8 Conclusion

In this paper, we explore a new problem: text recap extraction for TV shows. We propose an unsupervised model that identifies recap segments from multiple views of textual scripts. To facilitate the study of this new research topic, we create a dataset called TVRecap, on which we test our approach. From the experimental results, we conclude that contingency-based methods improve over summarization-based methods on ROUGE measures by exploiting the plot connections between adjacent episodes.

Acknowledgement

This material is based in part upon work partially supported by the National Science Foundation (IIS-1523162). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

References

Marta Aparício, Paulo Figueiredo, Francisco Raposo, David Martins de Matos, Ricardo Ribeiro, and Luís Marujo. 2016. Summarization of films and documentaries based on subtitles and scripts. Pattern Recognition Letters 73:7-12.

Siddhartha Banerjee, Prasenjit Mitra, and Kazunari Sugiyama. 2015. Multi-document abstractive summarization using ILP based multi-sentence compression. In 24th International Joint Conference on Artificial Intelligence (IJCAI). Buenos Aires, Argentina: AAAI Press.

Lidong Bing, Piji Li, Yi Liao, Wai Lam, Weiwei Guo, and Rebecca J. Passonneau. 2015. Abstractive multi-document summarization via phrase selection and merging. arXiv preprint arXiv:1506.01597.

Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. 2015. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2625-2634.

Paulo Figueiredo, Marta Aparício, David Martins de Matos, and Ricardo Ribeiro. 2015. Generation of multimedia artifacts: An extractive summarization-based approach. arXiv preprint arXiv:1508.03170.

Sergio Guadarrama, Niveda Krishnamoorthy, Girish Malkarnenkar, Subhashini Venugopalan, Raymond Mooney, Trevor Darrell, and Kate Saenko. 2013. YouTube2Text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In Proceedings of the IEEE International Conference on Computer Vision, pages 2712-2719.

Zhanying He, Chun Chen, Jiajun Bu, Can Wang, Lijun Zhang, Deng Cai, and Xiaofei He. 2012. Document summarization based on data reconstruction. In AAAI.

Arthur E. Hoerl and Robert W. Kennard. 1970. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 12(1):55-67.

Kai Hong and Ani Nenkova. 2014. Improving the estimation of word importance for news multi-document summarization. In EACL, pages 712-721.

Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. 2015. Skip-thought vectors. arXiv preprint arXiv:1506.06726.

Chin-Yew Lin and Eduard Hovy. 2003. Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, pages 71-78.

Hui Lin and Jeff Bilmes. 2010. Multi-document summarization via budgeted maximization of submodular functions. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 912-920.

Hui Lin and Jeff Bilmes. 2011. A class of submodular functions for document summarization. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1, pages 510-520.

Julien Mairal, Francis Bach, Jean Ponce, and Guillermo Sapiro. 2009. Online dictionary learning for sparse coding. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 689-696.

Kathleen McKeown. 2005. Text summarization: News and beyond. In Proceedings of the Australasian Language Technology Workshop, pages 4-4.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Amita Misra, Pranav Anand, Jean E. Fox Tree, and Marilyn A. Walker. 2015. Using summarization to discover argument facets in online idealogical dialog. In NAACL HLT, pages 430-440.

Francisco Raposo, Ricardo Ribeiro, and David Martins de Matos. 2015. On the application of generic summarization algorithms to music. IEEE Signal Processing Letters 22(1):26-30.

Anna Rohrbach, Marcus Rohrbach, and Bernt Schiele. 2015. The long-short story of movie description. In Pattern Recognition, Springer, pages 209-221.

Marcus Rohrbach, Wei Qiu, Ivan Titov, Stefan Thater, Manfred Pinkal, and Bernt Schiele. 2013. Translating video content to natural language descriptions. In Proceedings of the IEEE International Conference on Computer Vision, pages 433-440.

Shourya Roy, Ragunathan Mariappan, Sandipan Dandapat, Saurabh Srivastava, Sainyam Galhotra, and Balaji Peddamuthu. 2016. QART: A system for real-time holistic quality assurance for contact center dialogues. In Thirtieth AAAI Conference on Artificial Intelligence.

Gerard Salton and Christopher Buckley. 1988. Term-weighting approaches in automatic text retrieval. Information Processing & Management 24(5):513-523.

Jitao Sang and Changsheng Xu. 2010. Character-based movie summarization. In Proceedings of the 18th ACM International Conference on Multimedia, pages 855-858.

Ruben Sipos, Adith Swaminathan, Pannaga Shivaswamy, and Thorsten Joachims. 2012. Temporal corpus summarization using submodular word coverage. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management, pages 754-763.

Mark Wasson. 1998. Using leading text for news summaries: Evaluation results and implications for commercial summarization applications. In Proceedings of the 17th International Conference on Computational Linguistics, Volume 2, pages 1364-1368.
