Movie Plot Analysis via Turning Point Identification

Pinelopi Papalampidi, Frank Keller, Mirella Lapata
Institute for Language, Cognition and Computation
School of Informatics, University of Edinburgh
10 Crichton Street, Edinburgh EH8 9AB
[email protected], {keller,mlap}@inf.ed.ac.uk

Abstract

According to screenwriting theory, turning points (e.g., change of plans, major setback, climax) are crucial narrative moments within a screenplay: they define the plot structure, determine its progression and thematic units (e.g., setup, complications, aftermath). We propose the task of turning point identification in movies as a means of analyzing their narrative structure. We argue that turning points and the segmentation they provide can facilitate processing long, complex narratives, such as screenplays, for summarization and question answering. We introduce a dataset consisting of screenplays and plot synopses annotated with turning points and present an end-to-end neural network model that identifies turning points in plot synopses and projects them onto scenes in screenplays. Our model outperforms strong baselines based on state-of-the-art sentence representations and the expected position of turning points.

Turning Point            Description
1. Opportunity           Introductory event that occurs after the presentation of the setting and the background of the main characters.
2. Change of Plans       Event where the main goal of the story is defined. From this point on, the action begins to increase.
3. Point of No Return    Event that pushes the main character(s) to fully commit to their goal.
4. Major Setback         Event where everything falls apart (temporarily or permanently).
5. Climax                Final event of the main story, moment of resolution and the "biggest spoiler".

Table 1: Turning points and their definitions.

1 Introduction

Computational literary analysis works at the intersection of natural language processing and literary studies, aiming to evaluate various theories of storytelling (e.g., by examining a collection of works within a single genre, by an author, or on a topic) and to develop tools which aid in searching, visualizing, or summarizing literary content.

Within natural language processing, computational literary analysis has mostly targeted works of fiction such as novels, plays, and screenplays. Examples include analyzing characters, their relationships, and emotional trajectories (Chaturvedi et al., 2017; Iyyer et al., 2016; Elsner, 2012), identifying enemies and allies (Nalisnick and Baird, 2013), villains or heroes (Bamman et al., 2014, 2013), measuring the memorability of quotes (Danescu-Niculescu-Mizil et al., 2012), characterizing gender representation in dialogue (Agarwal et al., 2015; Ramakrishna et al., 2015; Sap et al., 2017), identifying perpetrators in crime series (Frermann et al., 2018), summarizing screenplays (Gorinski and Lapata, 2018), and answering questions about long and complex narratives (Kočiský et al., 2018).

In this paper we are interested in the automatic analysis of narrative structure in screenplays. Narrative structure, also referred to as a storyline or plotline, describes the framework of how one tells a story and has its origins in Aristotle, who defined the basic triangle-shaped plot structure representing the beginning (protasis), middle (epitasis), and end (catastrophe) of a story (Pavis, 1998). The German novelist and playwright Gustav Freytag modified Aristotle's structure by transforming the triangle into a pyramid (Freytag, 1896). In his scheme, there are five acts (introduction, rising movement, climax, return, and catastrophe). Several variations of Freytag's pyramid are used today in film analysis and screenwriting (Cutting, 2016). In this work, we adopt a variant commonly employed by screenwriters as a practical guide for producing successful screenplays (Hague, 2017). According to this scheme, there are six stages (acts) in a film, namely the setup, the new situation, progress, complications and higher stakes, the final push, and the aftermath, separated by five turning points (TPs). TPs are narrative moments from which the plot goes in a different direction (Thompson, 1999), and by definition they occur at the junctions of acts. Aside from changing narrative direction, TPs define the movie's structure, tighten the pace, and prevent the narrative from drifting. The five TPs and their definitions are given in Table 1.

We propose the task of turning point identification in movies as a means of analyzing their narrative structure. TP identification provides a sequence of key events in the story and segments the screenplay into thematic units. Common approaches to summarization and QA of long or multiple documents (Chen et al., 2017; Yang et al., 2018; Kratzwald and Feuerriegel, 2018; Elgohary et al., 2018) include a retrieval system as the first step, which selects a subset of relevant passages for further processing. However, Kočiský et al. (2018) demonstrate that these approaches do not perform equally well for extended narratives, since individual passages are very similar and the same entities are referred to throughout the story. We argue that this challenge can be addressed by TP identification, which finds the most important events and segments the narrative into thematic units. Downstream processing for summarization or question answering can then focus on those segments that are relevant to the task.

Problematically for modeling purposes, TPs are latent in screenplays: there are no scriptwriting conventions (like character cues or scene headings) to denote where TPs occur, and their exact manifestation varies across movies (depending on genre and length), although there are some rules of thumb indicating where to expect a TP (e.g., the Opportunity occurs after the first 10% of a screenplay, the Change of Plans is approximately 25% in). To enable automatic TP identification, we develop a new dataset which consists of screenplays, plot synopses, and turning point annotations. To save annotation time and render the labeling task feasible, we collect TP annotations at the plot synopsis level (synopses are a few paragraphs long compared to screenplays, which are on average 120 pages long). An example is given in Figure 1. We then project the TP annotations via distant supervision onto screenplays and propose an end-to-end neural network model which identifies TPs in full length screenplays.

[Figure 1: Example of turning point annotations (TP1, TP2, TP3, TP4, TP5, respectively) for the synopsis of the movie "Panic Room"; the figure shows the Wikipedia plot synopsis with the five annotated TP sentences highlighted and the induced segmentation marked.]

Our contributions can be summarized as follows: (a) we introduce TP identification as a new task for the computational analysis of screenplays that can benefit applications such as QA and summarization; (b) we create and make publicly available the TuRnIng POint Dataset (TRIPOD),[1] which contains 99 movies (3,329 synopsis sentences and 13,403 screenplay scenes) annotated with TPs; and (c) we present an end-to-end neural network model that identifies turning points in plot synopses and projects them onto scenes in screenplays, outperforming strong baselines based on state-of-the-art sentence representations and the expected position of TPs.

[1] https://github.com/ppapalampidi/TRIPOD

2 Related Work

Recent years have seen increased interest in the automatic analysis of long and complex narratives. Specifically, Machine Reading Comprehension (MRC) and Question Answering (QA) tasks are transitioning from investigating single short and clean articles or queries (Rajpurkar et al., 2016; Nguyen et al., 2016; Trischler et al., 2016) to large scale datasets that consist of complex stories (Tapaswi et al., 2016; Frermann et al., 2018; Kočiský et al., 2018; Joshi et al., 2017) or require reasoning across multiple documents (Welbl et al., 2018; Wang et al., 2018; Dua et al., 2019; Yang et al., 2018). Tapaswi et al. (2016) introduce a multi-modal dataset consisting of questions over 140 movies, while Frermann et al. (2018) attempt to answer a single question, namely who is the perpetrator in 39 episodes of the well-known crime series CSI, again based on multi-modal information. Finally, Kočiský et al. (2018) recently introduced a dataset consisting of question-answer pairs over 1,572 movie screenplays and books.

Previous approaches have focused on fine-grained story analysis, such as inducing character types (Bamman et al., 2013, 2014) or understanding relationships between characters (Iyyer et al., 2016; Chaturvedi et al., 2017). Various approaches have also attempted to analyze the goal and structure of narratives. Black and Wilensky (1979) evaluate the functionality of story grammars in story understanding, Elson and McKeown (2009) develop a platform for representing and reasoning over narratives, and Chambers and Jurafsky (2009) learn fine-grained chains of events. In the context of movie summarization, Gorinski and Lapata (2018) automatically generate an overview of the movie's genre, mood, and artistic style based on screenplay analysis. Gorinski and Lapata (2015) summarize full length screenplays by extracting an optimal chain of scenes via a graph-based approach centered around the characters of the movie. A similar approach has also been adopted by Vicol et al. (2018), who introduce the MovieGraphs dataset consisting of 51 movies and describe video clips with character-centered graphs. Other work creates animated story-boards using the action descriptions of screenplays (Ye and Baldwin, 2008), extracts social networks from screenplays (Agarwal et al., 2014a), or creates xkcd movie narrative charts (Agarwal et al., 2014b).

Our work also aims to analyze the narrative structure of movies, but we adopt a high-level approach. We advocate TP identification as a precursor to more fine-grained analysis that unveils character attributes and their relationships. Our approach identifies key narrative events and segments the screenplay accordingly; we argue that this type of preprocessing is useful for applications which might perform question answering and summarization over screenplays. Although our experiments focus solely on the textual modality, turning point analysis is also relevant for multimodal tasks such as trailer generation and video summarization.

3 The TRIPOD Dataset

The TRIPOD dataset contains 99 screenplays, accompanied with cast information (according to IMDb), and Wikipedia plot synopses annotated with turning points. The movies were selected from the Scriptbase dataset (Gorinski and Lapata, 2015) based on the following criteria: (a) maintaining variation across different movie genres (e.g., action, romance, comedy, drama) and narrative types (e.g., flashbacks, time shifts); and (b) including screenplays that are faithful to the released movies and their synopses as much as possible. In Table 2, we present various statistics of the dataset.

                           Train            Test
movies                        84              15
turning points               420              75
synopsis sentences         2,821             508
screenplay scenes         11,320           2,083
synopsis vocabulary         7.9k            2.8k
screenplay vocabulary      37.8k           16.8k
per synopsis
  tokens            729.8 (165.5)   698.4 (187.4)
  sentences            35.4 (8.4)      33.9 (9.9)
  sentence tokens      20.6 (9.5)      20.6 (9.3)
per screenplay
  tokens              23.0k (6.6)     20.9k (4.5)
  sentences            3.0k (0.9)      2.8k (0.6)
  scenes             133.0 (61.1)    138.9 (50.7)
per scene
  tokens            173.0 (235.0)   150.5 (198.3)
  sentences           22.2 (31.5)     19.9 (26.9)
  sentence tokens       7.8 (6.0)       7.6 (6.4)

Table 2: Statistics of the TRIPOD dataset; all means are shown with standard deviation in brackets.

Our motivation for obtaining TP annotations at the synopsis level (coarse-grained), instead of at the screenplay level (fine-grained), was twofold. Firstly, on account of being relatively short, synopses are easier to annotate than full-length screenplays, allowing us to scale the dataset in the future. Secondly, we would expect synopsis-level annotations to be more reliable and the degree of inter-annotator agreement higher; asking annotators to identify precisely where a turning point occurs might seem like looking for a needle in a haystack. An example of a synopsis with TP annotations is shown in Figure 1 for the movie "Panic Room".
Each TP is colored differently, and both the chain of key events (colored text) and the resulting segmentation (§) are illustrated.

In an initial pilot study, the three authors acted as annotators for identifying TPs in movie synopses. They selected exactly one sentence per TP, under the assumption that all TPs are present. Based on the pilot, annotation instructions were devised and an annotation tool was created which allows labeling synopses with TPs sentence-by-sentence. After piloting the annotation scheme on 30 movies, two new annotators were trained using our instructions and, in a second study, they doubly annotated five movies. The remaining movies in the dataset were then single annotated by the new annotators.

We computed inter-annotator agreement using two different metrics: (a) total agreement (TA), i.e., the percentage of TPs that two annotators agree upon by selecting the exact same sentence; and (b) annotation distance, i.e., the distance d[p_i, tp_i] between two annotations for a given TP, normalized by synopsis length:

    d[p_i, tp_i] = (1/N) |p_i - tp_i|    (1)

where N is the number of synopsis sentences, and tp_i and p_i are the indices of the sentences labeled with TP i by the two annotators. The mean annotation distance D is then computed by averaging distances d[p_i, tp_i] across all annotated TPs.

The TA between the two annotators in our second study was 64.00% and the mean annotation distance was 4.30% (StDev 3.43%). The annotation distance per TP is presented in Table 5 (last line), where it is compared with the automatic TP identification results (to be explained later).

We also asked our annotators to annotate the screenplays (rather than synopses) for a subset of 15 movies. This subset serves as our goldstandard test set. Annotators were given synopses annotated with TPs and were instructed to indicate for each TP which scenes in the screenplay correspond to it. Six of the 15 movies were doubly annotated, so that we could measure agreement. Since annotators were allowed to choose a variable number of scenes for each TP, this changes our agreement metrics slightly.

Total Agreement (TA) now is the percentage of TP scenes the annotators agree on:

    TA = (1/(T·L)) Σ_{i=1}^{T·L} |S_i ∩ G_i| / |S_i ∪ G_i|    (2)

where T, L are the TPs identified per annotator in a screenplay, and S_i and G_i are the indices of the scenes selected for TP i by the two annotators. Partial Agreement (PA) is the percentage of TPs where there is an overlap of at least one scene:

    PA = (1/(T·L)) Σ_{i=1}^{T·L} [S_i ∩ G_i ≠ ∅]    (3)

And the annotation distance D becomes the mean of the distances d[S_i, G_i] between the two annotators,[2] normalized by M, the length of the screenplay:

    d[S_i, G_i] = (1/M) min_{s ∈ S_i, g ∈ G_i} |s - g|    (4)

[2] We compute the minimum distance between the two sets of scenes, since non-sequential scenes may be included in the same set. Hence, considering the center of the sets is not always representative of the TP scenes.

The TA and PA between the two annotators were 35.48% and 56.67%, respectively. The mean annotation distance was 1.48% (StDev 2.93%). The TA shows that the annotators rarely indicate the same scenes, even when they are asked to annotate an event in the screenplay that is described by a specific synopsis sentence. However, they identify scenes which are in close proximity in the screenplay, as PA and the annotation distance reveal. This analysis validates our assumption that annotating the synopses first limits the degree of overall disagreement.
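For concreteness, the screenplay-level agreement metrics of Equations (2)-(4) can be computed as in the following sketch; this is our own illustration rather than the released code, and the input format (a mapping from TP index to the set of scene indices chosen by each annotator) is an assumption.

```python
# Illustrative sketch: scene-level agreement between two annotators for one
# screenplay, following Equations (2)-(4). `ann_a` and `ann_b` map each TP
# index to the set of scene indices an annotator selected; `num_scenes` is M.
def screenplay_agreement(ann_a, ann_b, num_scenes):
    tps = sorted(ann_a)                              # e.g., [1, 2, 3, 4, 5]
    total, partial, distances = 0.0, 0.0, []
    for tp in tps:
        s, g = set(ann_a[tp]), set(ann_b[tp])
        total += len(s & g) / len(s | g)             # Eq. (2): exact-overlap ratio
        partial += 1.0 if s & g else 0.0             # Eq. (3): at least one shared scene
        d = min(abs(i - j) for i in s for j in g)    # Eq. (4): closest pair of scenes
        distances.append(d / num_scenes)
    n = len(tps)
    return total / n, partial / n, sum(distances) / n

# Example call for a screenplay with 120 scenes (annotations are made up).
ta, pa, d = screenplay_agreement(
    {1: {4, 5}, 2: {30}, 3: {61, 62}, 4: {90}, 5: {115}},
    {1: {5}, 2: {33}, 3: {60}, 4: {90, 91}, 5: {116}},
    num_scenes=120,
)
```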
4 Turning Point Prediction Models

In this work, we aim to detect text segments which act as TPs. We first identify which sentences in plot synopses are TPs (Section 4.1); next, we identify which scenes in screenplays act as TPs via projection of goldstandard TP labels (Section 4.2); finally, we build an end-to-end system which identifies TPs in screenplays based on predicted TP synopsis labels (Section 4.3).

All models we propose in this paper have the same basic structure; they take text segments i (sentences or scenes) as input and predict whether these act as TPs or not. Since the sequence, number, and labels of TPs are fixed (see Table 1), we treat TP identification as a binary classification problem (where 1 indicates that the text is a TP and 0 otherwise). Each segment is encoded into a multi-dimensional feature space x_i which serves as input to a fully-connected layer with a single neuron representing the probability that i acts as a TP. In the following, we describe several models which vary in the way input segments are encoded.

4.1 Identifying Turning Points in Synopses

Context-Aware Model (CAM) A simple baseline model would compute the semantic representation of each sentence in the synopsis using a pre-trained sentence encoder. However, classifying segments in isolation, without considering the context in which they appear, might yield inferior semantic representations. We therefore obtain richer representations for sentences by modeling their surrounding context. We encode the synopsis with a Bidirectional Long Short-Term Memory (BiLSTM; Hochreiter and Schmidhuber 1997) network and obtain the contextualized representation cp_i for sentence x_i by concatenating the hidden layers of the forward and backward LSTM, respectively: cp_i = h_i = [h_i^→; h_i^←] (for a more detailed description, see the Appendix). Representation cp_i is the input feature vector for our binary classifier. The model is illustrated in Figure 2a.
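A CAM-style model can be sketched in PyTorch roughly as follows; this is a simplified illustration under our own assumptions (512-dimensional pre-computed sentence embeddings and a 32-unit LSTM), not the released implementation.

```python
import torch
import torch.nn as nn

class ContextAwareModel(nn.Module):
    """Sketch of a CAM-style model: pre-computed synopsis sentence embeddings x_i
    are contextualized with a BiLSTM and a single-neuron layer outputs the
    probability that each sentence acts as a TP."""
    def __init__(self, emb_dim=512, hidden_dim=32):
        super().__init__()
        self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, 1)   # one neuron per sentence

    def forward(self, sentences):                # sentences: (batch, N, emb_dim)
        cp, _ = self.encoder(sentences)          # cp_i = [h_i->; h_i<-], (batch, N, 2*hidden)
        return torch.sigmoid(self.classifier(cp)).squeeze(-1)   # (batch, N) TP probabilities

# Example with USE-sized (512-d) embeddings for a 35-sentence synopsis.
model = ContextAwareModel()
probs = model(torch.randn(1, 35, 512))
```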

[Figure 2: Model overview for TP identification in synopses. On the left (a, Topic-Aware Model), sentence representations x_i are contextualized via a synopsis encoder (BiLSTM layer) and, after interacting with the left and right windows in the context interaction layer, the final sentence representation y_i is computed. On the right (b, Multi-view TAM), five different synopsis encoders are utilized, one per TP, and these different views of a synopsis sentence x_i are combined in the merging layer.]

Topic-Aware Model (TAM) TPs by definition act as boundaries between different thematic units in a movie. Furthermore, long documents are usually comprised of topically coherent text segments, each of which contains a number of text passages such as sentences or paragraphs (Salton et al., 1996). Inspired by text segmentation approaches (Hearst, 1997) which measure the semantic similarity between sequential context windows in order to determine topic boundaries, we enhance our representations with a context interaction layer. The objective of this layer is to measure the similarity of the current sentence with its preceding and following context, thereby encoding whether it functions as a boundary between thematic sections. The enriched model with the context interaction layer is illustrated in Figure 2a.

After calculating contextualized sentence representations cp_i, we compute the representations of the left lc_i and right rc_i contexts of sentence i (see Figure 2a, right-hand side). We select windows of fixed length l and calculate lc_i and rc_i by averaging the sentence representations within each window. Next, we compute the semantic similarity of the current sentence with each context representation. Specifically, we consider the element-wise product b_i, cosine similarity c_i, and pairwise distance u_i as similarity metrics:

    b_i = cp_i ⊙ lc_i,    c_i = (cp_i · lc_i) / (‖cp_i‖ ‖lc_i‖)    (5)

    u_i = (cp_i · lc_i) / max(‖cp_i‖_2 · ‖lc_i‖_2)    (6)

The interaction representation of sentence cp_i with its left context is the concatenation of cp_i, lc_i, and the above similarity values (i.e., b_i, c_i, u_i):

    fl_i = [cp_i; lc_i; b_i; c_i; u_i]    (7)

The interaction representation fr_i for the right context rc_i is computed analogously. We obtain the final representation of sentence i by concatenating fl_i and fr_i: y_i = [fl_i; fr_i; cp_i].

TP-Specific Information Another variation of our model is to use TP-specific encoders instead of a single one (see Figure 2b). In this case, we employ five different encoders for calculating five different representations of the current synopsis sentence x_i, each one with respect to a specific TP. These representations can be considered multiple views of the same sentence. We calculate the interaction of each view with the left and right context window, as previously, via the context interaction layer. Finally, we compute the sentence representation y_i by concatenating its individual context-enriched TP representations.

Entity-Specific Information We also enrich our model with information about entities. We first apply co-reference resolution to the plot synopses using the Stanford CoreNLP toolkit (Manning et al., 2014) and substitute mentions of named entities whenever these are included in the IMDb cast list. We then obtain entity-specific sentence representations as follows. Our encoder uses a word embedding layer initialized with pre-trained entity embeddings and a BiLSTM for contextualizing word representations. We add an attention mechanism on top of the LSTM, which assigns a weight to each word representation. We compute the entity-specific representation e_i for synopsis sentence i as the weighted sum of its word representations (for more details, see the Appendix). Finally, entity enriched sentence representations x'_i are obtained by concatenating generic vectors x_i with entity-specific ones e_i: x'_i = [x_i; e_i].
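The context interaction layer of Equations (5)-(7) can be illustrated with the following sketch. It is our own simplification, and PyTorch's pairwise distance is used here as a stand-in for the third similarity metric u_i.

```python
import torch
import torch.nn.functional as F

def context_interaction(cp, window=2):
    """Sketch of the context interaction layer: each contextualized sentence cp_i
    is compared with the mean of its left and right context windows, and the
    similarity features are concatenated (cf. Equations (5)-(7)).
    cp: tensor of shape (N, d) for one synopsis."""
    n, d = cp.shape
    outputs = []
    for i in range(n):
        left = cp[max(0, i - window):i]
        right = cp[i + 1:i + 1 + window]
        lc = left.mean(dim=0) if len(left) else torch.zeros(d)
        rc = right.mean(dim=0) if len(right) else torch.zeros(d)
        feats = []
        for ctx in (lc, rc):
            b = cp[i] * ctx                                       # element-wise product
            c = F.cosine_similarity(cp[i], ctx, dim=0, eps=1e-8)  # cosine similarity
            u = F.pairwise_distance(cp[i].unsqueeze(0), ctx.unsqueeze(0)).squeeze(0)
            feats.append(torch.cat([cp[i], ctx, b, c.view(1), u.view(1)]))
        outputs.append(torch.cat(feats + [cp[i]]))                # y_i = [fl_i; fr_i; cp_i]
    return torch.stack(outputs)
```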

4.2 Identifying Turning Points in Screenplays

Identifying TPs in synopses serves as a testbed for validating some of the assumptions put forward in this work, namely that turning points mark narrative progression and can be identified automatically based on their lexical makeup. Nevertheless, we are mainly interested in the real-world scenario where TPs are detected in longer documents such as screenplays. Screenplays are naturally segmented into scenes, which often describe a self-contained event that takes place in one location and revolves around a few characters. We therefore assume that scenes are suitable textual segments for signaling TPs in screenplays.

[Figure 3: TAM overview for TP identification in screenplays. The synopsis and screenplay encoders contextualize synopsis sentences x_i and screenplay scenes s_i, respectively. TPs are selected from contextualized synopsis sentences y_i and a richer representation sc_i is computed for s_i via the context interaction layer. The similarity between sentence tp_i and scene z_i is computed by the TP-scene interaction layer.]

Unfortunately, we do not have any goldstandard information about TPs in screenplays. We provide distant supervision by constructing noisy labels based on goldstandard TP annotations in synopses (see the description below). Given sentences labeled as TPs in a synopsis, we identify scenes in the corresponding screenplay which are semantically similar to them. We formulate this task as a binary classification problem, where a sentence-scene pair is deemed either "relevant" or "irrelevant" for a given TP.

Distant Supervision Based on the screenwriting scheme of Hague (2017), TPs are expected to occur in specific parts of a screenplay (e.g., the Climax is likely to occur towards the end). We exploit this knowledge as a form of distant supervision. We estimate the mean position for each TP using the goldstandard annotation of the plot synopses in our training set (normalized by the synopsis length). The results are shown in Table 3, together with the TP positions postulated by screenwriting theory. We observe that our estimates agree well with the theoretical predictions, but also that some TPs (e.g., TP2 and TP3) are more variable in their position than others (e.g., TP1 and TP5). This leads us to the following hypothesis: each TP is situated within a specific window in a screenplay. Scenes that lie within the window are semantically related to the TP, whereas all other scenes are unrelated. In experiments we calculate a window µ ± σ based on our data (see Table 3).

              TP1     TP2     TP3     TP4     TP5
theory      10.00   25.00   50.00   75.00   94.50
tp scene µ  11.39   31.86   50.65   74.15   89.43
         σ   6.72   11.26   12.15    8.40    4.74

Table 3: Expected TP position (%) based on screenwriting theory; mean position µ and standard deviation σ in goldstandard synopses of our training set.

We compute scene representations based on the sequence of sentences that comprise it using a BiLSTM equipped with an attention mechanism (see Section 4.1). The final scene representation s is the weighted sum of the representations of the scene sentences. Next, the TP-scene interaction layer enriches scene representations with similarity values with each marked TP synopsis sentence tp, as shown in Equations (5)-(7).
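The window-based noisy labels can be illustrated as follows, using the empirical means and standard deviations of Table 3; this is a sketch of the idea rather than the authors' labeling code.

```python
import numpy as np

# Expected TP positions (% of screenplay length), taken from Table 3.
TP_MEAN = [11.39, 31.86, 50.65, 74.15, 89.43]
TP_STD = [6.72, 11.26, 12.15, 8.40, 4.74]

def noisy_scene_labels(num_scenes):
    """Return a (5, num_scenes) binary matrix: labels[t, s] = 1 if scene s lies
    inside the window mu +/- sigma of turning point t+1, and 0 otherwise."""
    positions = 100.0 * np.arange(num_scenes) / num_scenes   # scene positions in %
    labels = np.zeros((5, num_scenes), dtype=int)
    for t, (mu, sigma) in enumerate(zip(TP_MEAN, TP_STD)):
        labels[t] = ((positions >= mu - sigma) & (positions <= mu + sigma)).astype(int)
    return labels

labels = noisy_scene_labels(num_scenes=133)   # e.g., an average-length screenplay
```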
We again augment the above-described base model with contextualized sentence and scene representations using a synopsis and a screenplay encoder. The synopsis encoder is the same one used for our sentence-level TP prediction task (see Section 4.1). The screenplay encoder works in a similar fashion over scene representations.

Topic-Aware Model (TAM) TAM enhances our screenplay encoder with information about topic boundaries. Specifically, we compute the representations of the left lc_i and right rc_i context windows of the ith scene in the screenplay as described in Section 4.1. Next, we compute the final representation z_i of scene sc_i by concatenating the representations of the context windows lc_i and rc_i and the current scene sc_i: z_i = [lc_i; sc_i; rc_i]. There is no need to compute the similarity between scenes and context windows here, as we now have goldstandard TP representations in the synopsis and employ the TP-scene interaction layer for the computation of the similarity between TPs and enriched scene representations z_i. Hence, we directly calculate in this layer a scene-level feature vector that encodes information about the scene, its similarity to TP sentences, and whether these function as boundaries between topics in the screenplay.

Entity-Specific Information We can also employ an entity-specific encoder (see Section 4.1) for representing the synopsis and scene sentences. Again, generic and entity-specific representations are combined via concatenation.

4.3 End-to-end TP Identification

Our ultimate goal is to identify TPs in screenplays without assuming any goldstandard information about their position in the synopsis. We address this with an end-to-end model which first predicts the sentences that act as TPs in the synopsis (e.g., TAM in Section 4.1) and then feeds these predictions to a model which identifies the corresponding TP scenes (e.g., TAM in Section 4.2).
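The end-to-end setting can be summarized by the following sketch, where synopsis_model, screenplay_model, and select_tp_sentences are hypothetical wrappers around the components of Sections 4.1, 4.2, and 5; it illustrates the data flow only.

```python
def end_to_end_turning_points(synopsis_sents, screenplay_scenes,
                              synopsis_model, screenplay_model, select_tp_sentences):
    """Sketch of the end-to-end pipeline (Section 4.3): synopsis TPs are first
    predicted (instead of using goldstandard labels) and then passed to the
    screenplay model, which scores every scene against each predicted TP sentence.
    All function names are our own placeholders, not part of the released code."""
    sent_probs = synopsis_model(synopsis_sents)            # TP probability per synopsis sentence
    tp_indices = select_tp_sentences(sent_probs)           # one sentence index per TP (see Section 5)
    tp_reprs = [synopsis_sents[i] for i in tp_indices]
    scene_probs = screenplay_model(screenplay_scenes, tp_reprs)   # shape: (5, num_scenes)
    return tp_indices, scene_probs
```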
5 Experimental Setup

Training We used the Universal Sentence Encoder (USE; Cer et al. 2018) as a pre-trained sentence encoder for all models and tasks; its performance was superior to BERT (Devlin et al., 2018) and other related pre-trained encoders (for more details, see the Appendix). Since the binary labels in both prediction tasks are imbalanced, we apply class weights to the loss function of our models. We weight each class by its inverse frequency in the training set (for more implementation details, see the Appendix).

Inference During inference in our first task (i.e., identification of TPs in synopses), we select one sentence per TP. Specifically, we want to track the five sentences with the highest posterior probability of being TPs and sequentially assign them TP labels based on their position. However, it is possible to have a cluster of neighboring sentences with high probability, even though they all belong to the same TP. We therefore constrain the sentence selection for each TP within the window of its expected position, as calculated in the distribution baseline (Section 4.2).

For models which predict TPs in screenplays, we obtain a probability distribution over all scenes in a screenplay indicating how relevant each is to the TPs of the corresponding plot synopsis. We find the peak of each distribution and select a neighborhood of scenes around this peak as TP-relevant ones. Based on the goldstandard annotation, each TP corresponds to 1.77 relevant scenes on average (StDev 1.23). We therefore consider a neighborhood of three relevant scenes per TP.
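The two inference procedures described above can be sketched as follows; this is an illustration under our own assumptions about input shapes, not the released code.

```python
import numpy as np

def select_synopsis_tps(sent_probs, windows):
    """Pick the highest-probability sentence for each TP, restricted to that TP's
    expected window (a list of (start, end) sentence index ranges derived from the
    position distributions of Section 4.2)."""
    picks = []
    for start, end in windows:
        segment = sent_probs[start:end]
        picks.append(start + int(np.argmax(segment)))
    return picks                                     # one sentence index per TP

def select_screenplay_tps(scene_probs, neighborhood=3):
    """For screenplays, take the peak of each TP's distribution over scenes and a
    small neighborhood around it (three scenes, following Section 5)."""
    selections = []
    for probs in scene_probs:                        # one distribution per TP
        peak = int(np.argmax(probs))
        half = neighborhood // 2
        lo, hi = max(0, peak - half), min(len(probs), peak + half + 1)
        selections.append(list(range(lo, hi)))
    return selections
```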
6 Results

TP Identification in Synopses Table 4a reports our results on the development set (we extracted 20 movies from the original training set), which aim at comparing various model instantiations for the TP identification task. Specifically, we report the performance of a baseline model which is neither context-aware nor utilizes topic boundary information against CAM and TAM. We also show two variants of TAM enhanced with TP-specific encoders (+ TP views) and entity-specific information (+ entities). Model performance is measured using the evaluation metrics of Total Agreement (TA) and annotation distance (D), normalized by synopsis length (Equation (1)).

                           TA        D
Baseline                 31.00    9.65 (4.41)
CAM                      33.00    7.44 (8.09)
TAM                      36.00    7.11 (7.98)
  + TP views             39.00    6.52 (7.72)
  + entities             38.00    6.91 (7.65)
(a) Development set

                           TA        D
Random                    2.00   37.79 (25.33)
Theory baseline          22.00    7.47 (6.75)
Distribution baseline    28.00    7.28 (6.23)
TAM                      34.67    6.80 (5.19)
  + TP views             38.57    7.47 (7.48)
  + entities             41.33    7.30 (7.21)
Human agreement          64.00    4.30 (3.43)
(b) Test set

Table 4: Identification of TPs in plot synopses; results are shown in percent (TA: mean Total Agreement; D: annotation distance; standard deviation in brackets).

The baseline model presents the lowest performance among all variants, which suggests that state-of-the-art sentence representations on their own are not suitable for our task. Indeed, when contextualizing the synopsis sentences via a BiLSTM layer we observe an absolute increase of 4.00% in terms of TA. Moreover, the addition of a context interaction layer (see TAM row in Table 4a) yields an absolute TA improvement of 4.00% compared to CAM. Combining different TP views further improves by 3.00%, reaching a TA of 39.00% and reducing D to 6.52%.

Table 4b shows our results on the test set. We compare TAM, our best performing model, against two strong baselines. The first one selects sentences that lie on the expected positions of TPs according to screenwriting theory, while the second one selects sentences that lie on the peaks of the empirical TP distributions in the training set (Section 4.2). As we can see, TAM (+ TP views) achieves a TA of 38.57% compared to 22.00% for the distribution baseline. And although entity-specific information does not have much impact on the development set, it yields a 2.76% improvement on the test set. A detailed breakdown of results per TP is given in Table 5. Interestingly, our model resembles human behavior (see row Human agreement): TPs 1, 4, and 5 are easiest to distinguish, whereas TPs 2 and 3 are hardest and frequently placed at different points in the synopsis.

TAM                TP1    TP2    TP3    TP4    TP5
  + TP views       6.09   9.45  10.72   6.91   4.26
  + entities       7.18   9.35   9.86   5.23   3.48
Human agreement    3.33   5.00  10.58   1.07   1.53

Table 5: Mean annotation distance D (test set); results are shown per TP on the synopsis identification task.

We also conducted a human evaluation experiment on Amazon Mechanical Turk (AMT). AMT workers were presented with a synopsis and "highlights", i.e., five sentences corresponding to TPs. We obtained highlights from goldstandard annotations, the distribution baseline, and TAM (+ TP views). AMT workers were asked to read the synopsis and rank the highlights from best to worst according to the following criteria: (1) the quality of the plotline that they form; (2) whether they include the most important events and plot twists of the movie; and (3) whether they provide some description of the events in the beginning and end of the movie. In Figure 4 we show, proportionally, how often our participants ranked each model 1st, 2nd, and so on. Perhaps unsurprisingly, goldstandard TPs were considered best (and ranked 1st 42% of the time). TAM is ranked best 30% of the time, followed by the distribution baseline which was only ranked first 26% of the time. Overall, the average ranking positions for the goldstandard, TAM, and the baseline are 1.87, 1.98, and 2.16, respectively. Human evaluation therefore validates that our model outperforms the position-based baselines.

[Figure 4: Rankings (shown as proportions of cases) of synopsis highlights produced by aggregating goldstandard TP annotations, those predicted by the distribution baseline, and our model (TAM + TP views); y-axis: % of cases, x-axis: rank (1st, 2nd, 3rd).]

TP Identification in Screenplays Our results are summarized in Table 6. For this task, we performed five-fold cross-validation over our original goldstandard set to obtain a test-development split (recall we do not have goldstandard annotations for training). We report Total Agreement (TA), Partial Agreement (PA), and annotation distance D, normalized by screenplay length (Equations (2)-(4)).

Aside from the theory and distribution-based baselines, we also experimented[3] with a common IR baseline which considers TP synopsis sentences as queries and retrieves a neighborhood of semantically similar scenes from the screenplay using tf*idf similarity. Specifically, we compute the maximum tf*idf similarity over all sentences included in the respective scene. We empirically observed that tf*idf's behavior can be erratic, selecting scenes in completely different sections of the screenplay, and therefore constrain it by selecting scenes only within the windows determined by the position distributions (µ±σ) for each TP. As far as our own models are concerned, we report results with goldstandard TP labels for CAM and TAM on their own and enriched with entity information. We also built an end-to-end system based on TP predictions from TAM.

[3] Common segmentation approaches such as TextTiling (Hearst, 1997) perform poorly on our task and we do not report them due to space constraints.
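The IR baseline can be sketched as follows; scikit-learn is used here for illustration only, since the paper does not state which implementation was used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_scene_score(tp_sentence, scene_sentences, vectorizer):
    """Sketch of the IR baseline: the TP synopsis sentence is the query and each
    scene is scored by the maximum tf*idf similarity over its sentences."""
    query = vectorizer.transform([tp_sentence])
    sents = vectorizer.transform(scene_sentences)
    return float(cosine_similarity(query, sents).max())

# The vectorizer would be fitted beforehand on the screenplay sentences, e.g.:
# vectorizer = TfidfVectorizer().fit(all_screenplay_sentences)
# scores = [tfidf_scene_score(tp_sent, scene, vectorizer) for scene in scenes]
```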


                         TA      PA        D
Theory baseline         8.66   10.67   10.45 (9.14)
Distribution baseline   6.67    9.33   10.84 (8.94)
tf*idf similarity       0.74    1.33   53.07 (31.83)
tf*idf + distribution   4.44    6.67   13.33 (11.51)
CAM                    11.11   16.00   10.23 (11.23)
  + entities           14.18   17.33   12.77 (12.61)
TAM                    10.63   13.33    8.94 (9.39)
  + entities           10.63   13.33   10.15 (10.56)
TAM End2end             7.87    9.33   10.16 (10.74)
Human agreement        35.48   56.67    1.48 (2.93)

Table 6: Identification of TPs in screenplays; results are shown in percent using five-fold cross validation (TA: mean Total Agreement; PA: Partial Agreement; D: annotation distance; standard deviation in brackets).

As can be seen in Table 6, tf*idf approaches perform worse than the position-related baselines. Overall, similar vocabulary across scenes and mentions of the same entities throughout the screenplay make tf*idf approaches insufficient for our tasks. The best performing model is TAM, confirming our hypothesis that TPs are not just isolated key events, but also mark boundaries between thematic units, and, therefore, segmentation-inspired approaches can be beneficial for the task. Results for entities are somewhat mixed; for CAM, the entity-specific information improves TA and PA but increases D, while it does not seem to make much difference for TAM. The performance of the end-to-end TAM model drops slightly compared to the same model using goldstandard TP annotations. However, it still remains competitive against the baselines, indicating that tracking TPs in screenplays fully automatically is feasible.

In Figure 5, we visualize the posterior distribution of various models over the scenes of the screenplay for the movie "Juno". The first panel shows the distribution baseline alongside goldstandard TP scenes (vertical lines). We observe that the distribution baseline provides a good approximation of relevant TP positions (which validates its use in the construction of noisy labels, Section 4.2), even though it is not always accurate. For example, TPs 1 and 3 lie outside the expected window in "Juno". The second panel presents the TP predictions according to tf*idf similarity. We observe that scenes located in entirely different parts of the screenplay present high similarity scores with respect to a given TP, due to vocabulary uniformity and mentions of the same entities throughout the screenplay. In the next panel we present the predictions of TAM. Adding synopsis and screenplay encoders yields smoother distributions, increasing the probability of selecting TP scenes inside distinct regions of the screenplay, with sharper peaks and higher confidence.

[Figure 5: Probability distributions over the scenes of the screenplay for the movie "Juno"; panels (left to right): distribution baseline, tf*idf similarity, TAM; x-axis: scene indices, y-axis: probability that the scene is relevant to a specific TP. Vertical dashed lines are goldstandard TP scenes.]

7 Conclusions

We proposed the task of turning point identification in screenplays as a means of analyzing their narrative structure. We demonstrated that automatically identifying a sequence of key events and segmenting the screenplay into thematic units is feasible via an end-to-end neural network model. In future work, we will investigate the usefulness of TPs for summarization and question answering. We will also scale the TRIPOD dataset and move to a multi-modal setting where TPs are identified directly in video data.

Acknowledgments

We thank the anonymous reviewers for their feedback.
We gratefully acknowledge the support of the European Research Council (Lapata; award 681760, "Translating Multiple Modalities into Text") and of the Leverhulme Trust (Keller; award IAF-2017-019).

References

Apoorv Agarwal, Sriramkumar Balasubramanian, Jiehan Zheng, and Sarthak Dash. 2014a. Parsing screenplays for extracting social networks from movies. In Proceedings of the 3rd Workshop on Computational Linguistics for Literature, pages 50–58, Gothenburg, Sweden.

Apoorv Agarwal, Sarthak Dash, Sriramkumar Balasubramanian, and Jiehan Zheng. 2014b. Using determinantal point processes for clustering with application to automatically generating and drawing xkcd movie narrative charts. In Proceedings of the 2nd Academy of Science and Engineering International Conference on Big Data Science and Computing, Stanford, California.

Apoorv Agarwal, Jiehan Zheng, Shruti Kamath, Sriramkumar Balasubramanian, and Shirin Ann Dey. 2015. Key female characters in film have more to talk about besides men: Automating the Bechdel test. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 830–840, Denver, Colorado.

David Bamman, Brendan O'Connor, and Noah A. Smith. 2013. Learning latent personas of film characters. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 352–361, Sofia, Bulgaria.

David Bamman, Ted Underwood, and Noah A. Smith. 2014. A Bayesian mixed effects model of literary character. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 370–379.

John B. Black and Robert Wilensky. 1979. An evaluation of story grammars. Cognitive Science, 3(3):213–229.

Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, et al. 2018. Universal sentence encoder. arXiv preprint arXiv:1803.11175.

Nathanael Chambers and Dan Jurafsky. 2009. Unsupervised learning of narrative schemas and their participants. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 602–610. Association for Computational Linguistics.

Snigdha Chaturvedi, Mohit Iyyer, and Hal Daumé III. 2017. Unsupervised learning of evolving relationships between literary characters. In Thirty-First AAAI Conference on Artificial Intelligence.

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. arXiv preprint arXiv:1704.00051.

James E. Cutting. 2016. Narrative theory and the dynamics of popular movies. Psychonomic Bulletin & Review, 23(6):1713–1743.

Cristian Danescu-Niculescu-Mizil, Justin Cheng, Jon Kleinberg, and Lillian Lee. 2012. You had me at hello: How phrasing affects memorability. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers–Volume 1, pages 892–901.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. arXiv preprint arXiv:1903.00161.

Ahmed Elgohary, Chen Zhao, and Jordan Boyd-Graber. 2018. A dataset and baselines for sequential open-domain question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1077–1083.

Micha Elsner. 2012. Character-based kernels for novelistic plot structure. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 634–644, Avignon, France.

David Elson and Kathleen McKeown. 2009. Extending and evaluating a platform for story understanding. In Proceedings of the AAAI 2009 Spring Symposium on Intelligent Narrative Technologies II.

Lea Frermann, Shay B. Cohen, and Mirella Lapata. 2018. Whodunnit? Crime drama as a case for natural language understanding. Transactions of the Association for Computational Linguistics, 6:1–15.

Gustav Freytag. 1896. Freytag's Technique of the Drama: An Exposition of Dramatic Composition and Art. Scholarly Press.

Philip John Gorinski and Mirella Lapata. 2015. Movie script summarization as graph-based scene extraction. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1066–1076.

Philip John Gorinski and Mirella Lapata. 2018. What's this movie about? A joint neural network architecture for movie content analysis. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1770–1781.

Michael Hague. 2017. Storytelling Made Easy: Persuade and Transform Your Audiences, Buyers, and Clients – Simply, Quickly, and Profitably. Indie Books International.

Marti A. Hearst. 1997. TextTiling: Segmenting text into multi-paragraph subtopic passages. Computational Linguistics, 23(1):33–64.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Mohit Iyyer, Anupam Guha, Snigdha Chaturvedi, Jordan Boyd-Graber, and Hal Daumé III. 2016. Feuding families and former friends: Unsupervised learning for dynamic fictional relationships. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1534–1544.

Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Tomáš Kočiský, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, and Edward Grefenstette. 2018. The NarrativeQA reading comprehension challenge. Transactions of the Association for Computational Linguistics, 6:317–328.

Bernhard Kratzwald and Stefan Feuerriegel. 2018. Adaptive document retrieval for deep question answering. arXiv preprint arXiv:1808.06528.

Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60.

Eric T. Nalisnick and Henry S. Baird. 2013. Character-to-character sentiment analysis in Shakespeare's plays. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 479–483, Sofia, Bulgaria.

Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch.

Patrice Pavis. 1998. Dictionary of the Theatre: Terms, Concepts, and Analysis. University of Toronto Press.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.

Anil Ramakrishna, Nikolaos Malandrakis, Elizabeth Staruk, and Shrikanth Narayanan. 2015. A quantitative analysis of gender differences in movies using psycholinguistic normatives. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1996–2001, Lisbon, Portugal.

Gerard Salton, Amit Singhal, Chris Buckley, and Mandar Mitra. 1996. Automatic text decomposition using text segments and text themes. In Proceedings of the Seventh ACM Conference on Hypertext, pages 53–65, Bethesda, Maryland.

Maarten Sap, Marcella Cindy Prasettio, Ari Holtzman, Hannah Rashkin, and Yejin Choi. 2017. Connotation frames of power and agency in modern films. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2329–2334, Copenhagen, Denmark.

Makarand Tapaswi, Yukun Zhu, Rainer Stiefelhagen, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. 2016. MovieQA: Understanding stories in movies through question-answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4631–4640.

Kristin Thompson. 1999. Storytelling in the New Hollywood: Understanding Classical Narrative Technique. Harvard University Press.

Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. 2016. NewsQA: A machine comprehension dataset. arXiv preprint arXiv:1611.09830.

Paul Vicol, Makarand Tapaswi, Lluis Castrejon, and Sanja Fidler. 2018. MovieGraphs: Towards understanding human-centric situations from videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8581–8590.

Yizhong Wang, Kai Liu, Jing Liu, Wei He, Yajuan Lyu, Hua Wu, Sujian Li, and Haifeng Wang. 2018. Multi-passage machine reading comprehension with cross-passage answer verification. arXiv preprint arXiv:1805.02220.

Johannes Welbl, Pontus Stenetorp, and Sebastian Riedel. 2018. Constructing datasets for multi-hop reading comprehension across documents. Transactions of the Association for Computational Linguistics, 6:287–302.

Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, and Yoshiyasu Takefuji. 2018. Wikipedia2Vec: An optimized tool for learning embeddings of words and entities from Wikipedia. arXiv preprint arXiv:1812.06280.
Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600.

Patrick Ye and Timothy Baldwin. 2008. Towards automatic animated storyboarding. In Proceedings of the 23rd AAAI Conference on Artificial Intelligence, pages 578–583, Chicago, Illinois.

A Model Details

Synopsis Encoder In all tasks we use a synopsis encoder in order to contextualize the sentences in the synopsis. We employ an LSTM network as the synopsis encoder, which produces sentence representations h_1, h_2, ..., h_T, where h_i is the hidden state at time-step i, summarizing all the information of the synopsis up to the i-th sentence. We use a Bidirectional LSTM (BiLSTM) in order to get sentence representations that summarize the information from both directions. A BiLSTM consists of a forward LSTM f^→ that reads the synopsis from p_1 to p_N and a backward LSTM f^← that reads it from p_N to p_1. We obtain the final representation cp_i for a given synopsis sentence p_i by concatenating the representations from both directions, cp_i = h_i = [h_i^→; h_i^←], h_i ∈ R^{2S}, where S denotes the size of each LSTM.

Entity-Specific Encoder This encoder is used to evaluate the contribution of entity-specific information to the performance of our models. We use a word embedding layer to project the words w_1, w_2, ..., w_T of the i-th synopsis sentence p_i to a continuous vector space R^E, where E is the size of the embedding layer. This layer is initialized with pre-trained entity embeddings. Next, we use a BiLSTM as described in the case of the synopsis encoder. On top of the LSTM, we add an attention mechanism, which assigns a weight a_j to each word representation h_j. We compute the entity-specific representation pe_i of the i-th plot sentence as the weighted sum of its word representations:

    e_j = tanh(W_h h_j + b_h),  e_j ∈ [-1, 1]    (8)

    a_j = exp(e_j) / Σ_{t=1}^{T} exp(e_t),  Σ_{j=1}^{T} a_j = 1    (9)

    pe_i = Σ_{j=1}^{T} a_j h_j,  pe_i ∈ R^{2S}    (10)

where W_h and b_h are the attention layer's weights.
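For illustration, the attention-based encoder of Equations (8)-(10) can be sketched in PyTorch as follows; this is a simplification with assumed dimensions, not the released implementation.

```python
import torch
import torch.nn as nn

class AttentionEncoder(nn.Module):
    """Sketch of the entity-specific encoder: word embeddings are contextualized
    with a BiLSTM and summed with attention weights computed as in Eqs. (8)-(10)."""
    def __init__(self, emb_dim=300, hidden_dim=32):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.attention = nn.Linear(2 * hidden_dim, 1)    # W_h and b_h of Equation (8)

    def forward(self, word_embs):                         # (batch, T, emb_dim)
        h, _ = self.lstm(word_embs)                       # (batch, T, 2*hidden)
        e = torch.tanh(self.attention(h))                 # Eq. (8): scores in [-1, 1]
        a = torch.softmax(e, dim=1)                       # Eq. (9): weights sum to 1
        return (a * h).sum(dim=1)                         # Eq. (10): weighted sum of word states
```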
B Implementation Details

Pre-trained Sentence Encoder The performance of our models depends on the initial sentence representations. We experimented with using the large BERT model (Devlin et al., 2018) and the Universal Sentence Encoder (USE) (Cer et al., 2018) as pre-trained sentence encoders in all tasks. Intuitively, we expect USE to be more suitable, since it was trained on textual similarity tasks which are more relevant to ours. Experiments on the development set confirmed our intuition. Specifically, on the screenplay TP prediction task, the annotation distance D dropped from 17.00% to 10.04% when employing USE instead of the BERT embeddings in the CAM version of our architecture.

Hyper-parameters We used the Adam algorithm (Kingma and Ba, 2014) for optimizing our networks. After experimentation, we chose an LSTM for the synopsis encoder in the first task and one with 64 neurons for the encoder in the second task. For the context interaction layer, the window l was set to two sentences for the first task and 20% of the screenplay length for the second task. For the entity encoder, an embedding layer of size 300 was initialized with the Wikipedia2Vec pre-trained word embeddings (Yamada et al., 2018) and remained frozen during training. The LSTM of the encoder had 32 and 64 neurons for the first and second tasks, respectively. Finally, we also added a dropout of 0.2. For developing our models we used PyTorch (Paszke et al., 2017).

Data Augmentation We used multiple annotations for training for movies where these were available and considered reliable. The reasons for this are twofold. Firstly, this allowed us to take into account the subjective nature of the task during training; and secondly, it increased the size of our dataset, which contains a limited number of movies. Specifically, we added triplicate annotations for 17 movies and duplicate annotations for 5 movies.

C Example Output: TP Identification in Synopses

As mentioned in Section 6, we also conducted a human evaluation experiment, where highlights were extracted by combining the five sentences labeled as TPs in the synopsis. In Tables 7, 8, and 9, we present the highlights shown to the AMT workers for the movies "Juno", "Panic Room", and "", respectively. For each movie we show the goldstandard annotations alongside the predicted TPs for TAM (+ TP views) and the distribution baseline, which is the strongest performing baseline with respect to the automatic evaluation results.

Goldstandard
• Sixteen-year-old Minnesota high-schooler Juno MacGuff discovers she is pregnant with a child fathered by her friend and longtime admirer, Paulie Bleeker.
• All of this decides her against abortion, and she decides to give the baby up for adoption.
• With Mac, Juno meets the couple, Mark and Vanessa Loring (Jason Bateman and Jennifer Garner), in their expensive home and agrees to a closed adoption.
• Juno watches the Loring marriage fall apart, then drives away and breaks down in tears by the side of the road.
• Vanessa comes to the hospital where she joyfully claims the newborn boy as a single adoptive mother.

TAM (+ TP views)
• Going to a local clinic run by a women's group, she encounters outside a school mate who is holding a rather pathetic one-person Pro-Life vigil.
• With Mac, Juno meets the couple, Mark and Vanessa Loring (Jason Bateman and Jennifer Garner), in their expensive home and agrees to a closed adoption.
• Juno and Leah happen to see Vanessa in a shopping mall being completely at ease with a child, and Juno encourages Vanessa to talk to her baby in the womb, where it obligingly kicks for her.
• Juno watches the Loring marriage fall apart, then drives away and breaks down in tears by the side of the road.
• The film ends in the summertime with Juno and Paulie playing guitar and singing together, followed by a kiss.

Distribution baseline
• Once inside, however, Juno is alienated by the clinic staff's authoritarian and bureaucratic attitudes.
• Juno visits Mark a few times, with whom she shares tastes in punk rock and horror films.
• Not long before her baby is due, Juno is again visiting Mark when their interaction becomes emotional.
• Juno then tells Paulie she loves him, and Paulie's actions make it clear her feelings are very much reciprocated.
• Vanessa comes to the hospital where she joyfully claims the newborn boy as a single adoptive mother.

Table 7: Highlights for the movie "Juno": goldstandard annotations and predicted TPs for TAM (+ TP views) and the distribution baseline.
B Implementation Details

Pre-trained Sentence Encoder  The performance of our models depends on the initial sentence representations. We experimented with the large BERT model (Devlin et al., 2018) and the Universal Sentence Encoder (USE) (Cer et al., 2018) as pre-trained sentence encoders in all tasks. Intuitively, we expected USE to be more suitable, since it was trained on textual similarity tasks, which are more relevant to ours. Experiments on the development set confirmed this intuition: on the screenplay TP prediction task, the annotation distance D dropped from 17.00% to 10.04% when employing USE instead of the BERT embeddings in the CAM version of our architecture.

Hyper-parameters  We used the Adam algorithm (Kingma and Ba, 2014) for optimizing our networks. After experimentation, we chose an LSTM with 32 neurons (64 for the BiLSTM) for the synopsis encoder in the first task and one with 64 neurons for the encoder in the second task. For the context interaction layer, the window l was set to two sentences for the first task and to 20% of the screenplay length for the second task. For the entity encoder, an embedding layer of size 300 was initialized with the Wikipedia2Vec pre-trained word embeddings (Yamada et al., 2018) and remained frozen during training. The LSTM of this encoder had 32 and 64 neurons for the first and second tasks, respectively. Finally, we also added a dropout of 0.2. For developing our models we used PyTorch (Paszke et al., 2017).

Data Augmentation  For movies where multiple annotations were available and considered reliable, we used all of them for training. The reasons for this are twofold: firstly, it allowed us to take into account the subjective nature of the task during training; and secondly, it increased the size of our dataset, which contains a limited number of movies. Specifically, we added triplicate annotations for 17 movies and duplicate annotations for 5 movies.
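For convenience, the hyper-parameters reported above can be gathered into a single configuration object, e.g., for experiment scripts. The sketch below merely restates the reported values; the field names and the dataclass itself are our own illustrative choices, not part of the released code.

from dataclasses import dataclass

@dataclass
class TPConfig:
    """Hyper-parameters reported above, collected in one place (field names are illustrative)."""
    optimizer: str = "adam"                # Kingma and Ba (2014)
    synopsis_lstm_size_task1: int = 32     # 64 when run bidirectionally
    synopsis_lstm_size_task2: int = 64
    context_window_task1: int = 2          # sentences, context interaction layer
    context_window_task2: float = 0.20     # fraction of screenplay length
    entity_embedding_dim: int = 300        # frozen Wikipedia2Vec vectors
    entity_lstm_size_task1: int = 32
    entity_lstm_size_task2: int = 64
    dropout: float = 0.2

if __name__ == "__main__":
    print(TPConfig())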
C Example Output: TP Identification in Synopses

As mentioned in Section 6, we also conducted a human evaluation experiment, where highlights were extracted by combining the five sentences labeled as TPs in the synopsis. In Tables 7, 8, and 9, we present the highlights shown to the AMT workers for the movies "Juno", "Panic Room", and "The Shining", respectively. For each movie we show the goldstandard annotations alongside the predicted TPs for TAM (+ TP views) and the distribution baseline, which is the strongest performing baseline with respect to the automatic evaluation results.

Overall, we observe that goldstandard highlights describe the plotline of the movie: they contain a first introductory sentence, some major and intense events, and a last sentence that describes the ending of the story.

The distribution baseline is able to predict a few goldstandard TPs by only considering the relative position of the sentences in the synopsis. This observation validates the screenwriting theory: TPs, or more generally important events that determine the progression of the plot, are consistently located in specific parts of a movie. However, when the distribution baseline cannot predict the exact TP sentence, it may select one that describes irrelevant events of minor importance (e.g., TP4 for "Panic Room" is a detail about a secondary character instead of a major setback and highly intense event in the movie).

Finally, our own model seems able to predict some goldstandard TP sentences, as demonstrated in the automatic evaluation. However, we also observe that even when it does not select the goldstandard TPs, the predicted ones describe important events in the movie that have some desired characteristics. In particular, for the movie "Juno" the climax (TP5) is the moment of resolution, where Vanessa decides to adopt the baby after all the setbacks and obstacles. Even though our model does not predict this sentence, it does select one that reveals information about the ending of the movie. Another such example is the movie "Panic Room", where the point of no return (TP3) is not predicted correctly, but the selected sentence refers to the same event.

Goldstandard
• Sixteen-year-old Minnesota high-schooler Juno MacGuff discovers she is pregnant with a child fathered by her friend and longtime admirer, Paulie Bleeker.
• All of this decides her against abortion, and she decides to give the baby up for adoption.
• With Mac, Juno meets the couple, Mark and Vanessa Loring (Jason Bateman and Jennifer Garner), in their expensive home and agrees to a closed adoption.
• Juno watches the Loring marriage fall apart, then drives away and breaks down in tears by the side of the road.
• Vanessa comes to the hospital where she joyfully claims the newborn boy as a single adoptive mother.

TAM (+ TP views)
• Going to a local clinic run by a women's group, she encounters outside a schoolmate who is holding a rather pathetic one-person Pro-Life vigil.
• With Mac, Juno meets the couple, Mark and Vanessa Loring (Jason Bateman and Jennifer Garner), in their expensive home and agrees to a closed adoption.
• Juno and Leah happen to see Vanessa in a shopping mall being completely at ease with a child, and Juno encourages Vanessa to talk to her baby in the womb, where it obligingly kicks for her.
• Juno watches the Loring marriage fall apart, then drives away and breaks down in tears by the side of the road.
• The film ends in the summertime with Juno and Paulie playing guitar and singing together, followed by a kiss.

Distribution baseline
• Once inside, however, Juno is alienated by the clinic staff's authoritarian and bureaucratic attitudes.
• Juno visits Mark a few times, with whom she shares tastes in punk rock and horror films.
• Not long before her baby is due, Juno is again visiting Mark when their interaction becomes emotional.
• Juno then tells Paulie she loves him, and Paulie's actions make it clear her feelings are very much reciprocated.
• Vanessa comes to the hospital where she joyfully claims the newborn boy as a single adoptive mother.

Table 7: Highlights for the movie "Juno": goldstandard annotations and predicted TPs for TAM (+ TP views) and distribution baseline.

Goldstandard
• On the night the two move into the home, it is broken into by Junior, the previous owner's grandson; Burnham, an employee of the residence's security company; and Raoul, a ski mask-wearing gunman recruited by Junior.
• Before the three can reach them, Meg and Sarah run into the panic room and close the door behind them, only to find that the burglars have disabled the telephone.
• To make matters worse, Sarah, who has diabetes, suffers a seizure.
• Sensing the potential danger to her daughter, Meg lies to the officers and they leave.
• After a badly injured Stephen shoots at Raoul and misses, Raoul disables him and prepares to kill Meg with the sledgehammer, but Burnham, upon hearing Sarah's screams of pain, returns to the house and shoots Raoul dead, stating, "You'll be okay now", to Meg and her daughter before leaving.

TAM (+ TP views)
• On the night the two move into the home, it is broken into by Junior, the previous owner's grandson; Burnham, an employee of the residence's security company; and Raoul, a ski mask-wearing gunman recruited by Junior.
• Before the three can reach them, Meg and Sarah run into the panic room and close the door behind them, only to find that the burglars have disabled the telephone.
• Her emergency glucagon syringe is in a refrigerator outside the panic room.
• As Meg throws the syringe into the panic room, Burnham frantically locks himself, Raoul, and Sarah inside, crushing Raoul's hand in the sliding steel door.
• After a badly injured Stephen shoots at Raoul and misses, Raoul disables him and prepares to kill Meg with the sledgehammer, but Burnham, upon hearing Sarah's screams of pain, returns to the house and shoots Raoul dead, stating, "You'll be okay now", to Meg and her daughter before leaving.

Distribution baseline
• On the night the two move into the home, it is broken into by Junior, the previous owner's grandson; Burnham, an employee of the residence's security company; and Raoul, a ski mask-wearing gunman recruited by Junior.
• Unable to seal the vents, Meg ignites the gas while she and Sarah cover themselves with fireproof blankets, causing an explosion which vents into the room outside and causes a fire, injuring Junior.
• To make matters worse, Sarah, who has diabetes, suffers a seizure.
• While doing so, he tells Sarah he did not want this, and the only reason he agreed to participate was to give his own child a better life.
• As the robbers attempt to leave, using Sarah as a hostage, Meg hits Raoul with a sledgehammer and Burnham flees.

Table 8: Highlights for the movie "Panic Room": goldstandard annotations and the predicted TPs for TAM (+ TP views) and distribution baseline.

Goldstandard
• Manager Stuart Ullman warns him that a previous caretaker developed cabin fever and killed his family and himself.
• Hallorann tells Danny that the hotel itself has a "shine" to it along with many memories, not all of which are good.
• After she awakens him, he says he dreamed that he had killed her and Danny.
• Jack begins to chop through the door leading to his family's living quarters with a fire axe.
• Wendy and Danny escape in Hallorann's snowcat, while Jack freezes to death in the hedge maze.

TAM (+ TP views)
• Jack's wife, Wendy, tells a visiting doctor that Danny has an imaginary friend named Tony, and that Jack has given up drinking because he had hurt Danny's arm following a binge.
• Hallorann tells Danny that the hotel itself has a "shine" to it along with many memories, not all of which are good.
• Danny starts calling out "redrum" frantically and goes into a trance, now referring to himself as "Tony".
• When Wendy sees this in the bedroom mirror, the letters spell out "MURDER".
• Wendy and Danny escape in Hallorann's snowcat, while Jack freezes to death in the hedge maze.

Distribution baseline
• Jack's wife, Wendy, tells a visiting doctor that Danny has an imaginary friend named Tony, and that Jack has given up drinking because he had hurt Danny's arm following a binge.
• Jack, increasingly frustrated, starts acting strangely and becomes prone to violent outbursts.
• Jack investigates Room 237, where he encounters the ghost of a dead woman, but tells Wendy he saw nothing.
• When Wendy sees this in the bedroom mirror, the letters spell out "MURDER".
• He kills Hallorann in the lobby and pursues Danny into the hedge maze.

Table 9: Highlights for the movie "The Shining": goldstandard annotations and the predicted TPs for TAM (+ TP views) and distribution baseline.