
Parsing Screenplays for Extracting Social Networks from Movies

Apoorv Agarwal†, Sriramkumar Balasubramanian†, Jiehan Zheng‡, Sarthak Dash†
†Dept. of Computer Science, Columbia University, New York, NY, USA
‡Peddie School, Hightstown, NJ, USA
[email protected] [email protected]

Abstract

In this paper, we present a formalization of the task of parsing movie screenplays. While researchers have previously motivated the need for parsing movie screenplays, to the best of our knowledge, there is no work that has presented an evaluation for the task. Moreover, all the approaches in the literature thus far have been regular expression based. In this paper, we present an NLP and ML based approach to the task, and show that this approach outperforms the regular expression based approach by a large and statistically significant margin. One of the main challenges we faced early on was the absence of training and test data. We propose a methodology for using well structured screenplays to create training data for anticipated anomalies in the structure of screenplays.

1 Introduction

Social network extraction from unstructured text has recently gained much attention (Agarwal and Rambow, 2010; Elson et al., 2010; Agarwal et al., 2013a; Agarwal et al., 2013b; He et al., 2013). Using Natural Language Processing (NLP) and Machine Learning (ML) techniques, researchers are now able to gain access to networks that are not associated with any meta-data (such as email links and self-declared friendship links). Movies, which can be seen as visual approximations of unstructured literary works, contain rich social networks formed by interactions between characters.
There has been some effort in the past to extract social networks from movies (Weng et al., 2006; Weng et al., 2007; Weng et al., 2009; Gil et al., 2011). However, these approaches are primarily regular expression based, with no evaluation of how well they work.

In this paper we introduce a formalization of the task of parsing screenplays and present an NLP and ML based approach to the task. By parsing a screenplay, we mean assigning each line of the screenplay one of the following five tags: "S" for scene boundary, "N" for scene description, "C" for character name, "D" for dialogue, and "M" for meta-data. We expect screenplays to conform to a strict grammar, but they often do not (Gil et al., 2011). This disconnect gives rise to the need for developing a methodology that is able to handle anomalies in the structure of screenplays. Though the methodology proposed in this paper is in the context of movie screenplays, we believe it is general and applicable to parsing other kinds of noisy documents.

One of the earliest challenges we faced was the absence of training and test data. Screenplays, on average, have 7,000 lines of text, which limits the amount of annotated data we can obtain from humans. We propose a methodology for using well structured screenplays to create training data for anticipated anomalies in the structure of screenplays. For different types of anomalies, we train separate classifiers, and combine them using ensemble learning. We show that our ensemble outperforms a regular-expression baseline by a large and statistically significant margin on an unseen test set (0.69 versus 0.96 macro-F1 measure for the five classes). Apart from performing an intrinsic evaluation, we also present an extrinsic evaluation: we show that the social network extracted from the screenplay tagged by our ensemble is closer to the network extracted from a screenplay tagged by a human than the network extracted from a screenplay tagged by the baseline.

The rest of the paper is structured as follows: in section 2, we present common terminology used to describe screenplays. We survey existing literature in section 3. Section 4 presents details of our data collection methodology, along with the data distribution. Section 5 gives details of our regular-expression based system, which we use as a baseline for evaluation purposes. In section 6, we present our machine learning approach. In section 7, we give details of the features we use for machine learning. In section 8, we present our experiments and results. We conclude and give future directions of research in section 9.

2 Terminology

Turetsky and Dimitrova (2004) describe the structure of a movie screenplay as follows: a screenplay describes a story, characters, action, setting and dialogue of a film. Additionally, they report that the structure of a screenplay follows a (semi) regular format. Figure 1 shows a snippet of a screenplay from the film The Silence of the Lambs. A scene (tag "S") starts with what is called the slug line (or scene boundary). The slug line indicates whether the scene is to take place inside or outside (INT, EXT), the name of the location ("FBI ACADEMY GROUNDS, QUANTICO, VIRGINIA"), and can potentially specify the time of day (e.g. DAY or NIGHT). Following the scene boundary is a scene description. A scene description is followed by a character name (tag "C"), which is followed by dialogues (tag "D"). Character names are capitalized, with an optional (V.O.) for "Voice Over" or (O.S.) for "Off-screen." Dialogues, like scene descriptions, are not associated with any explicit indicators (such as INT or V.O.), but are indented at a unique level (i.e. nothing else in the screenplay is indented at this level). Screenplays may also have other elements, such as "CUT TO:", which are directions for the camera, and text describing the intended mood of the speaker, which is found within parentheses in the dialogue. For lack of a name for these elements, we call them "Meta-data" (tag "M").
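The surface cues described in this section are enough to sketch a rough line tagger. The following is a minimal Python sketch under our own assumptions: the regular expressions and the 8-space dialogue indentation are illustrative guesses keyed to the Figure 1 example, not the baseline system of section 5.

    import re

    # Illustrative patterns only -- assumptions for this sketch, not the
    # regular expressions of the paper's baseline system.
    SLUG_RE = re.compile(r"^\s*(INT|EXT)\.?\s+[A-Z0-9 ,.']+(\s*-\s*(DAY|NIGHT))?\s*$")
    CHAR_RE = re.compile(r"^\s*[A-Z][A-Z0-9 .'-]*(\s*\((V\.O\.|O\.S\.)\))?\s*$")

    def cue_tag(line: str) -> str:
        """Guess one of the five tags for a line from surface cues alone."""
        if not line.strip():
            return "M"              # lump blank lines with meta-data in this sketch
        if SLUG_RE.match(line):
            return "S"              # slug line, i.e. scene boundary
        if CHAR_RE.match(line):
            return "C"              # capitalized character cue
        if line.strip().startswith("(") or line.strip().endswith("TO:"):
            return "M"              # mood text, camera directions
        # Dialogue vs. scene description is only separable by indentation;
        # the 8-space threshold matches the Figure 1 example, not all films.
        return "D" if line.startswith(" " * 8) else "N"

Even on well formatted scripts, a tagger of this kind confuses, for example, all-caps scene descriptions with character names; this is the sort of brittleness that motivates the learned approach evaluated later in the paper.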
3 Literature Survey

One of the earliest works motivating the need for screenplay parsing is that of Turetsky and Dimitrova (2004), who proposed a system to automatically align written screenplays with their videos. One of the crucial steps, they noted, is to parse a screenplay into its different elements: scene boundaries, scene descriptions, character names, and dialogues. They proposed a grammar to parse screenplays and showed results for aligning one screenplay with its video. Weng et al. (2009) motivated the need for screenplay parsing from a social network analysis perspective. They proposed a set of operations on social networks extracted from movies and television shows in order to find what they called hidden semantic information. They proposed techniques for identifying lead roles in bilateral movies (movies with two main characters), for performing community analysis, and for automating the task of story segmentation. Gil et al. (2011) extracted character interaction networks from plays and movies. They were interested in automatically classifying plays and movies into different genres by making use of social network analysis metrics. They acknowledged that the scripts found on the internet are not in consistent formats, and proposed a regular expression based system to identify scene boundaries and character names.

While there is motivation in the literature to parse screenplays, none of the aforementioned work addresses the task formally. In this paper, we formalize the task and propose a machine learning based approach that is significantly more effective and tolerant of anomalous structure than the baseline. We evaluate our models not only on their ability to identify scene boundaries and character names, but also on their ability to identify other important elements of a screenplay, such as scene descriptions and dialogues.

4 Data

We crawled the Internet Movie Script Database (IMSDB) website[1] to collect movie screenplays. We crawled a total of 674 movies. Movies that are well structured have the property that scene boundaries and scene descriptions, character names, and dialogues are all at different but fixed levels of indentation.[2] For example, in the movie in Figure 1, all scene boundaries and scene descriptions are at the same level of indentation, equal to five spaces. All character names are at a different but fixed level of indentation, equal to 20 spaces. Dialogues are at an indentation level of eight spaces. These indentation levels may vary from one screenplay to the other, but are consistent within a well formatted screenplay. Moreover, the indentation level of character names is strictly greater than the indentation level of dialogues, which is strictly greater than the indentation level of scene boundaries and scene descriptions. For each crawled screenplay, we found the frequency of unique indentation levels in that screenplay. If [...] we were able to tag 222 movies that pass the sanity check.

[Figure 1: Example screenplay: first column shows the tags we assign to each line in the screenplay. M stands for "Meta-data", S stands for "Scene boundary", N stands for "Scene description", C stands for "Character name", and D stands for "Dialogue." We also show the lines that are at context -2 and +3 for the line "CRAWFORD."]

Data    #S     #N      #C      #D      #M
TRAIN   2,445  21,619  11,464  23,814  3,339
DEV1    714    7,495   4,431   9,378   467
DEV2    413    5,431   2,126   4,755   762
TEST    164    845     1,582   3,221   308

Table 1: Data distribution

Table 1 gives the distribution of our training, development and test sets.

[1] http://www.imsdb.com
[2] By level of indentation we mean the number of spaces from the start of the line to the first non-space character.
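The indentation profile just described lends itself to a simple structural sanity check. Below is a minimal Python sketch of our reading of it; since the exact acceptance criterion is truncated above ("If [...]"), the top_k and coverage parameters, and the function names, are our assumptions rather than the authors' procedure.

    from collections import Counter

    def indentation_histogram(lines):
        """Count lines at each indentation level, where the level is the
        number of spaces from the start of the line to the first
        non-space character (blank lines are skipped)."""
        hist = Counter()
        for line in lines:
            stripped = line.lstrip(" ")
            if stripped:
                hist[len(line) - len(stripped)] += 1
        return hist

    def passes_sanity_check(hist, top_k=3, coverage=0.9):
        """Assumed check: a well structured screenplay concentrates
        almost all of its lines at a few fixed indentation levels."""
        if len(hist) < top_k:
            return False
        covered = sum(count for _, count in hist.most_common(top_k))
        return covered / sum(hist.values()) >= coverage

    def levels_to_tags(hist):
        """Map the three dominant levels to tags using the ordering noted
        above: character indent > dialogue indent > scene indent."""
        scene_lvl, dialogue_lvl, character_lvl = sorted(
            lvl for lvl, _ in hist.most_common(3))
        return {scene_lvl: "S/N", dialogue_lvl: "D", character_lvl: "C"}

On the Figure 1 screenplay, this mapping would send level 5 to scene boundaries and descriptions, level 8 to dialogue, and level 20 to character names.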