Extracting Social Networks from Literary Fiction

Extracting Social Networks from Literary Fiction David K. Elson Nicholas Dames Kathleen R. McKeown Dept. of Computer Science English Department Dept. of Computer Science Columbia University Columbia University Columbia University [email protected] [email protected] [email protected] Abstract We present a method to automatically construct a network based on dialogue interactions between We present a method for extracting so- characters in a novel. Our approach includes com- cial networks from literature, namely, ponents for finding instances of quoted speech, nineteenth-century British novels and se- attributing each quote to a character, and iden- rials. We derive the networks from di- tifying when certain characters are in conversa- alogue interactions, and thus our method tion. We then construct a network where char- depends on the ability to determine when acters are vertices and edges signify an amount two characters are in conversation. Our of bilateral conversation between those charac- approach involves character name chunk- ters, with edge weights corresponding to the fre- ing, quoted speech attribution and conver- quency and length of their exchanges. In contrast sation detection given the set of quotes. to previous approaches to social network construc- We extract features from the social net- tion, ours relies on a novel combination of pattern- works and examine their correlation with based detection, statistical methods, and adapta- one another, as well as with metadata such tion of standard natural language tools for the liter- as the novel’s setting. Our results provide ary genre. We carried out this work on a corpus of evidence that the majority of novels in this 60 nineteenth-century novels and serials, includ- time period do not fit two characterizations ing 31 authors such as Dickens, Austen and Conan provided by literacy scholars. Instead, our Doyle. results suggest an alternative explanation In order to evaluate the literary claims in ques- for differences in social networks. tion, we compute various characteristics of the dialogue-based social network and stratify these 1 Introduction results by categories such as the novel’s setting. Literary studies about the nineteenth-century For example, the density of the network provides British novel are often concerned with the nature evidence about the cohesion of a large or small of the community that surrounds the protagonist. community, and cliques may indicate a social frag- Some theorists have suggested a relationship be- mentation. Our results surprisingly provide evi- tween the size of a community and the amount of dence that the majority of novels in this time pe- dialogue that occurs, positing that “face to face riod do not fit the suggestions provided by liter- time” diminishes as the number of characters in ary scholars, and we suggest an alternative expla- the novel grows. Others suggest that as the social nation for our observations of differences across setting becomes more urbanized, the quality of di- novels. alogue also changes, with more interactions occur- In the following sections, we survey related ring in rural communities than urban communities. work on social networks as well as computational Such claims have typically been made, however, studies of literature. We then present the literary on the basis of a few novels that are studied in hypotheses in more detail. We describe the meth- depth. In this paper, we aim to determine whether ods we use to extract dialogue and construct conan automated study of a much larger sample of versational networks, along with our approach to nineteenth century novels supports these claims. analyzing their characteristics. After we present The research presented here is concerned with the statistical results, we analyze their significance the extraction of social networks from literature. from a literary perspective. 138 Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 138–147, Uppsala, Sweden, 11-16 July 2010. c 2010 Association for Computational Linguistics 2 Related Work 3 Hypotheses Computer-assisted literary analysis has typically It is commonly held that the novel is a literary occurred at the word level. This level of granular- form which tries to produce an accurate represen- ity lends itself to studies of authorial style based tation of the social world. Within literary stud- on patterns of word use (Burrows, 2004), and re- ies, the recurring problem is how that represen- searchers have successfully “outed” the writers of tation is achieved. Theories about the relation anonymous texts by comparing their style to that between novelistic form (the workings of plot, of a corpus of known authors (Mostellar and Wal- characters, and dialogue, to take the most basic lace, 1984). Determining instances of “text reuse,” categories) and changes to real-world social mi- a type of paraphrasing, is also a form of analysis lieux abound. Many of these theories center on at the lexical level, and it has recently been used to nineteenth-century European fiction; innovations validate theories about the lineage of ancient texts in novelistic form during this period, as well as the (Lee, 2007). rapid social changes brought about by revolution, Analysis of literature using more semantically- industrialization, and transport development, have oriented techniques has been rare, most likely be- traditionally been linked. These theories, however, cause of the difficulty in automatically determin- have used only a select few representative novels ing meaningful interpretations. Some exceptions as proof. By using statistical methods of analy- include recent work on learning common event se- sis, it is possible to move beyond this small corpus quences in news stories (Chambers and Jurafsky, of proof texts. We believe these methods are es- 2008), an approach based on statistical methods, sential to testing the validity of some core theories and the development of an event calculus for char- about social interaction and their representation in acterizing stories written by children (Halpin et al., literary genres like the novel. 2004), a knowledge-based strategy. On the other Major versions of the theories about the social hand, literary theorists, linguists and others have worlds of nineteenth-century fiction tend to cen- long developed symbolic but non-computational ter on characters, in two specific ways: how many models for novels. For example, Moretti (2005) characters novels tend to have, and how those has graphically mapped out texts according to ge- characters interact with one another. These two ography, social connections and other variables. “formal” facts about novels are usually explained While researchers have not attempted the auto- with reference to a novel’s setting. From the influ- matic construction of social networks represent- ential work of the Russian critic Mikhail Bakhtin ing connections between characters in a corpus to the present, a consensus emerged that as nov- of novels, the ACE program has involved entity els are increasingly set in urban areas, the num- and relation extraction in unstructured text (Dod- ber of characters and the quality of their interac- dington et al., 2004). Other recent work in so- tion change to suit the setting. Bakhtin’s term for cial network construction has explored the use of this causal relationship was chronotope: the “in- structured data such as email headers (McCallum trinsic interconnectedness of temporal and spatial et al., 2007) and U.S. Senate bill cosponsorship relationships that are artistically expressed in liter- (Cho and Fowler, 2010). In an analysis of discus- ature,” in which “space becomes charged and re- sion forums, Gruzd and Haythornthwaite (2008) sponsive to movements of time, plot, and history” explored the use of message text as well as posting (Bakhtin, 1981, 84). In Bakhtin’s analysis, dif- data to infer who is talking to whom. In this pa- ferent spaces have different social and emotional per, we also explore how to build a network based potentialities, which in turn affect the most basic on conversational interaction, but we analyze the aspects of a novel’s aesthetic technique. reported dialogue found in novels to determine the After Bakhtin’s invention of the chronotope, links. The kinds of language that is used to signal much literary criticism and theory devoted itself such information is quite different in the two me- to filling in, or describing, the qualities of spe- dia. In discussion forums, people tend to use ad- cific chronotopes, particularly those of the village dresses such as “Hi Tom,” while in novels, a sys- or rural environment and the city or urban en- tem must determine both the speaker of a quota- vironment. Following a suggestion of Bakhtin’s tion and then the intended recipient of the dialogue that the population of village or rural fictions is act. This is a significantly different problem. modeled on the world of the family, made up of 139 Author/Title/Year Persp. Setting Author/Title/Year Persp. Setting Ainsworth, Jack Sheppard (1839) 3rd urban Gaskell, North and South (1854) 3rd urban Austen, Emma (1815) 3rd rural Gissing, In the Year of Jubilee (1894) 3rd urban Austen, Mansfield Park (1814) 3rd rural Gissing, New Grub Street (1891) 3rd urban Austen, Persuasion (1817) 3rd rural Hardy, Jude the Obscure (1894) 3rd mixed Austen, Pride and Prejudice (1813) 3rd rural Hardy, The Return of the Native (1878) 3rd rural Braddon, Lady Audley’s Secret (1862) 3rd mixed Hardy, Tess

Extracting Social Networks from Literary Fiction

Part One: 'Science Fiction Versus Mundane Culture', 'The Overlap Between Science Fiction and Other Genres' and 'Horror Motifs' Transcript

Fictionality and the Empirical Study of Literature

Popular Fiction: Some Theoretical Perspectives

ELEMENTS of FICTION – NARRATOR / NARRATIVE VOICE Fundamental Literary Terms That Indentify Components of Narratives “Fiction

Science Fiction Gerry Canavan Marquette University, [email protected]

Social Science Fiction from the Sixties to the Present Instructor: Joseph Shack [email protected] Office Hours: to Be Announced

Golden Fantasy: an Examination of Generic & Literary Fantasy in Popular Writing Zechariah James Morrison Seattle Pacific Nu Iversity

Literary Fiction

Literature in the 21St Century

25 CHAPTER 1 Breaking the Canon? Critical Reflections on “Other”

Dante's 'Strangeness': the Commedia and the Late Twentieth- Century Debate on the Literary Canon

Reading Literary Fiction and Theory of Mind