Casting Light on Invisible Cities: Computationally Engaging with Literary Criticism

Shufan Wang, Mohit Iyyer
University of Massachusetts, Amherst

arXiv:1904.08386v1 [cs.LG] 17 Apr 2019

Abstract

Literary critics often attempt to uncover meaning in a single work of literature through careful reading and analysis. Applying natural language processing methods to aid in such literary analyses remains a challenge in digital humanities. While most previous work focuses on “distant reading” by algorithmically discovering high-level patterns from large collections of literary works, here we sharpen the focus of our methods to a single literary theory about Italo Calvino’s postmodern novel Invisible Cities, which consists of 55 short descriptions of imaginary cities. Calvino has provided a classification of these cities into eleven thematic groups, but literary scholars disagree as to how trustworthy his categorization is. Due to the unique structure of this novel, we can computationally weigh in on this debate: we leverage pretrained contextualized representations to embed each city’s description and use unsupervised methods to cluster these embeddings. Additionally, we compare results of our computational approach to similarity judgments generated by human readers. Our work is a first step towards incorporating natural language processing into literary criticism.

[Figure 1 shows four city excerpts.]
Top row:
Adelma (cities & the dead): “… An old man was loading a basket of sea urchins on a cart; I thought I recognized him … he looked like a fisherman who, already old when I was a child, could no longer be among the living… Adelma is the city where you arrive dying and where each finds again…”
Eusapia (cities & the dead): “… And to make the leap from life to death less abrupt, the inhabitants have constructed an identical copy of their city, underground… They say that every time they go below they find something changed in the lower Eusapia; the dead make innovations in their city; not many, but surely…”
Bottom row:
Zobeide (cities & desire): “… men of various nations had an identical dream. They saw a woman running at night through an unknown city… They dreamed of pursuing her… decided to build a city like the one in the dream… they settled, waiting for that scene to be repeated…”
Isidora (cities & memory): “When a man rides a long time through wild regions he feels the desire for a city… seashells… perfect telescopes and violins… He was thinking of all these things when he desired a city. Isidora, therefore, is the city of his dreams…”

Figure 1: Calvino labels the thematically-similar cities in the top row as cities & the dead. However, although the bottom two cities share a theme of desire, he assigns them to different groups.

1 Introduction

Literary critics form interpretations of meaning in works of literature. Building computational models that can help form and test these interpretations is a fundamental goal of digital humanities research (Benzon and Hays, 1976). Within natural language processing, most previous work that engages with literature relies on “distant reading” (Jockers, 2013), which involves discovering high-level patterns from large collections of stories (Bamman et al., 2014; Chaturvedi et al., 2018). We depart from this trend by showing that computational techniques can also engage with literary criticism at a closer distance: concretely, we use recent advances in text representation learning to test a single literary theory about the novel Invisible Cities by Italo Calvino.

Framed as a dialogue between the traveler Marco Polo and the emperor Kublai Khan, Invisible Cities consists of 55 prose poems, each of which describes an imaginary city. Calvino categorizes these cities into eleven thematic groups that deal with human emotions (e.g., desires, memories), general objects (eyes, sky, signs), and unusual properties (continuous, hidden, thin). Many critics argue that Calvino’s labels are not meaningful, while others believe that there is a distinct thematic separation between the groups, including the author himself (Calvino, 2004). The unique structure of this novel, in which each city’s description is short and self-contained (Figure 1), allows us to computationally examine this debate.

As the book is too small to train any models, we leverage recent advances in large-scale language model-based representations (Peters et al., 2018a; Devlin et al., 2018) to compute a representation of each city. We feed these representations into a clustering algorithm that produces exactly eleven clusters of five cities each and evaluate them against both Calvino’s original labels and crowdsourced human judgments. While the overall correlation with Calvino’s labels is low, both computers and humans can reliably identify some thematic groups associated with concrete objects.

While prior work has computationally analyzed a single book (Eve, 2019), our work goes beyond simple word frequency or n-gram counts by leveraging the power of pretrained language models to engage with literary criticism. Admittedly, our approach and evaluations are specific to Invisible Cities, but we believe that similar analyses of more conventionally-structured novels could become possible as text representation methods improve.

[Figure 2 diagram: the text of each city (e.g., “Travelers return from the city of Zirma…”) is encoded with ELMo, the token vectors are mean-pooled into one vector per city (Zoe, Dorothea, Zirma, Fedora), and a clustering algorithm groups the city vectors into learned clusters.]

Figure 2: We first embed each city by averaging token representations derived from a pretrained model such as ELMo. Then, we feed the city embeddings to a clustering algorithm and analyze the learned clusters.
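The embedding step summarized in the Figure 2 caption (and detailed in Section 3.1 as a TF-IDF weighted mean of token vectors) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names `idf_weights` and `embed_city` are hypothetical, the smoothed IDF formula is one common variant (the paper does not state which it uses), and the 2-dimensional toy vectors stand in for real ELMo, BERT, or GloVe outputs. The PCA reduction to 40 dimensions is omitted.

```python
import math
from collections import Counter


def idf_weights(docs):
    """Smoothed inverse document frequency for every token type in the corpus."""
    n_docs = len(docs)
    # Document frequency: in how many descriptions does each token type occur?
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log((1 + n_docs) / (1 + df)) + 1.0 for tok, df in doc_freq.items()}


def embed_city(tokens, token_vectors, idf):
    """TF-IDF weighted element-wise mean of one city's token vectors.

    Weighting every token occurrence by its IDF means a token type contributes
    in proportion to (term count x IDF), i.e. its unnormalised TF-IDF score.
    """
    dim = len(token_vectors[0])
    total = sum(idf[tok] for tok in tokens)
    return [
        sum(idf[tok] * vec[k] for tok, vec in zip(tokens, token_vectors)) / total
        for k in range(dim)
    ]


# Toy corpus (hypothetical): two "cities", each token paired with a 2-d vector.
docs = [["city", "wells"], ["city", "lake"]]
vectors = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.5, 0.5]]]
idf = idf_weights(docs)
city_embeddings = [embed_city(d, v, idf) for d, v in zip(docs, vectors)]
```

In this toy corpus the token "wells" occurs in only one description, so it receives a larger weight than "city" and pulls the first embedding toward its own vector.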
We also highlight two challenges of applying computational methods to literary criticism: (1) text representation methods are imperfect, especially when given writing as complex as Calvino’s; and (2) evaluation is difficult because there is no consensus among literary critics on a single “correct” interpretation.

2 Literary analyses of Invisible Cities

Before describing our method and results, we first review critical opinions on both sides of whether Calvino’s thematic groups meaningfully characterize his city descriptions.

The groups are meaningful: Some scholars believe that the thematic grouping imposed by Calvino reflects properties of the cities he describes; Vrbani (2012), for example, argues that Calvino’s structures are “ontologically grounded in different ways”. Buitendijk (2018) further provides examples of cities with the same label that are clearly thematically similar, pointing at the “cities of desire” as “informed by 20th century theories of desires associated with Sigmund Freud”. Calvino (2004) himself claims that he created most categorizations of cities with clear labels in mind, especially the cities of memory and desire, which he deemed “fundamental cornerstones” of the novel. However, many critics argue that authorial intent is irrelevant when analyzing literature (Wimsatt and Beardsley, 1946; Barthes, 1994).

The groups are arbitrary: On the other hand, a large body of criticism focuses on the apparent mismatch between a city’s assigned thematic group and the content of its descriptions. Bloom (2002) claims that the “cities are totally interchangeable”; Springer (1985) agrees, stating that “even the categories themselves seem both chosen and assigned arbitrarily”. Teichert (1985) contends that “the catalogue is superimposed on, but does not cover, the elusive, fluid mass of an unwritten world”.

While out of scope for our computational analysis, many possible theories exist regarding why the groupings appear largely incoherent. For instance, Boeck (2004) posits that the structural incoherence exists because all of the cities actually describe different facets of Marco Polo’s hometown of Venice. Breiner (1988) argues instead that Calvino’s labels “may refer only to a projection of the Khan’s occupational thirst for order, unrelated to the structure of the text”, while Knowles (2015) hypothesizes that the mismatch is one of many obstacles that readers need to “untangle” to understand the central substance of the novel.

3 A Computational Analysis

We focus on measuring to what extent computers can recover Calvino’s thematic groupings when given just the raw text of the city descriptions. At a high level, our approach (Figure 2) involves (1) computing a vector representation for every city and (2) performing unsupervised clustering of these representations. The rest of this section describes both of these steps in more detail.

3.1 Embedding city descriptions

While each of the city descriptions is relatively short, Calvino’s writing is filled with rare words, complex syntactic structures, and figurative language.¹ Capturing the essential components of each city in a single vector is thus not as simple as it is with more standard forms of text. Nevertheless, we hope that representations from language models trained over billions of words of text can extract some meaningful semantics from these descriptions. We experiment with three different pretrained representations: ELMo (Peters et al., 2018a), BERT (Devlin et al., 2018), and GloVe (Pennington et al., 2014). To produce a single city embedding, we compute the TF-IDF weighted element-wise mean of the token-level representations.² For all pretrained methods, we additionally reduce the dimensionality of the city embeddings to 40 using PCA for increased compatibility with our clustering algorithm.

3.2 Clustering city representations

Given 55 city representations, how do we group them into eleven clusters of five cities each? Initially, we experimented with a graph-based community detection algorithm that maximizes cluster modularity (Newman, 2006), but we found no simple way to constrain this method to produce a specific number of equally-sized clusters. The brute force approach of enumerating all possible cluster assignments is intractable given the large search space ($\frac{55!}{(5!)^{11}}$ possible assignments). We devise a simple clustering algorithm to approximate this process. First, we initialize with random cluster assignments and define “cluster strength” to be the relative difference between “intra-group” Euclidean distance and “inter-group” Euclidean distance.³ Then, we iteratively propose random exchanges of memberships, only accepting these proposals when the cluster strength increases, until convergence. To evaluate the quality of the computationally-derived clusters against those of Calvino, we measure cluster purity (Manning et al., 2008):⁴ given a set of predicted clusters M and ground-truth clusters D that both partition a set of N data points,

$$\mathrm{purity} = \frac{1}{N} \sum_{m \in M} \max_{d \in D} |m \cap d|.$$

Method   Purity   Accuracy
Random   0.32     33.3
GloVe    0.35     35.9
BERT     0.40     39.3
ELMo     0.42     44.6
Human    -        48.8

Table 1: Results from cluster purity and accuracy on the “odd-one-out” task suggest that Calvino’s thematic groups are not completely arbitrary.

4 Evaluating clustering assignments

While the results from the above section allow us to compare our three computational methods against each other, we additionally collect human judgments to further ground our results. In this section, we first describe our human experiment before quantitatively analyzing our results.

Human clustering: We conduct a crowdsourced experiment to measure how well humans can disambiguate thematically different cities. Filling in the entire 55 × 55 adjacency matrix with human similarity judgments is expensive and time-consuming. Thus, we instead design a proxy “odd-one-out” task for collecting human judgments: given three city descriptions, two of which come from the same ground-truth thematic group and the other from a different group, workers are asked to identify the intruder city. We use the Figure Eight crowdsourcing platform⁵ to collect three annotations each for 100 different city triples. Our interface initially displays only the first and last sentences of each city’s description; workers can optionally click to reveal the full description. As workers are likely unfamiliar with Invisible Cities and its different thematic groups, this crowdsourced task provides a fair comparison to our computational approaches.
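The swap-based clustering procedure and the purity measure from Section 3.2 can be sketched as follows. This is an illustrative reading, not the authors' implementation: the paper does not give an exact formula for "cluster strength", so the sketch assumes (mean inter-group distance − mean intra-group distance) / mean inter-group distance, and it treats "convergence" as a fixed number of consecutive rejected proposals. The names `cluster_strength`, `swap_cluster`, `purity`, and `max_stale` are hypothetical.

```python
import math
import random
from collections import Counter


def cluster_strength(points, labels):
    """Assumed definition: relative gap between mean inter- and intra-cluster distance.

    Requires every cluster to hold at least two points so that both lists are non-empty.
    """
    intra, inter = [], []
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = math.dist(points[i], points[j])
            (intra if labels[i] == labels[j] else inter).append(d)
    mean_intra = sum(intra) / len(intra)
    mean_inter = sum(inter) / len(inter)
    return (mean_inter - mean_intra) / mean_inter


def swap_cluster(points, n_clusters, max_stale=2000, seed=0):
    """Random equal-sized assignment, then accept only swaps that raise cluster strength."""
    rng = random.Random(seed)
    labels = [i % n_clusters for i in range(len(points))]  # equal cluster sizes
    rng.shuffle(labels)
    best = cluster_strength(points, labels)
    stale = 0  # consecutive proposals that did not improve the score
    while stale < max_stale:
        i, j = rng.sample(range(len(points)), 2)
        if labels[i] == labels[j]:
            stale += 1
            continue
        labels[i], labels[j] = labels[j], labels[i]  # propose exchanging memberships
        score = cluster_strength(points, labels)
        if score > best:
            best, stale = score, 0
        else:
            labels[i], labels[j] = labels[j], labels[i]  # reject: undo the swap
            stale += 1
    return labels


def purity(pred, gold):
    """(1/N) * sum over predicted clusters of the size of their largest gold overlap."""
    total = 0
    for m in set(pred):
        members = [g for p, g in zip(pred, gold) if p == m]
        total += Counter(members).most_common(1)[0][1]
    return total / len(pred)


# Toy data (hypothetical): three well-separated pairs of 2-d points.
points = [(0, 0), (0, 1), (10, 0), (10, 1), (20, 0), (20, 1)]
gold = [0, 0, 1, 1, 2, 2]
labels = swap_cluster(points, n_clusters=3)
```

Because swaps exchange one member for another, cluster sizes never change, which is how the equal-size constraint (eleven clusters of five) is preserved throughout the search. With 55 cities the same functions would be called with `n_clusters=11`.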

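For the computational side of the odd-one-out task, the paper picks the city with the maximum Euclidean distance from the other two. A minimal sketch of that rule is given below; it assumes "distance from the other two" means the sum of the two pairwise distances (the paper does not say whether a sum, mean, or other aggregate is used), and the names `find_intruder` and `odd_one_out_accuracy` are hypothetical.

```python
import math


def find_intruder(triple):
    """Index (0, 1, or 2) of the vector with the largest summed distance to the other two."""
    def distance_to_others(i):
        return sum(math.dist(triple[i], triple[j]) for j in range(3) if j != i)
    return max(range(3), key=distance_to_others)


def odd_one_out_accuracy(triples, gold_intruders):
    """Fraction of triples for which the predicted intruder matches the gold intruder."""
    hits = sum(find_intruder(t) == g for t, g in zip(triples, gold_intruders))
    return hits / len(triples)


# Toy triples (hypothetical 2-d city embeddings); the gold intruder is the far-away point.
triples = [
    ([0.0, 0.0], [0.0, 1.0], [5.0, 5.0]),
    ([9.0, 9.0], [0.0, 0.0], [1.0, 0.0]),
]
accuracy = odd_one_out_accuracy(triples, gold_intruders=[2, 0])
```

Scoring the embeddings on the same triples shown to crowd workers is what makes the Accuracy column of Table 1 comparable across computational methods and humans.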
¹ The book contains a vocabulary of 5,372 word types, and the average length of a city description is 380 tokens.
² Using other composition functions such as the span representation of Peters et al. (2018b) had little impact on the learned clusters.
³ The choice of distance metric (e.g., cosine, word mover) did not meaningfully impact our results.
⁴ Purity ranges between 0 and 1, and a larger purity indicates a higher degree of agreement.
⁵ Workers were restricted to English-speaking countries and paid $0.30 per judgment.

4.1 Quantitative comparison

We compare clusters computed on different representations using community purity; additionally, we compare these computational methods to humans by their accuracy on the odd-one-out task.

Purity of learned clusters: City representations computed using language model-based representations (ELMo and BERT) achieve significantly higher purity than a clustering induced from random representations, indicating that there is at least some meaningful coherence to Calvino’s thematic groups (first row of Table 1). ELMo representations yield the highest purity among the three methods, which is surprising as BERT is a bigger model trained on data from books (among other domains). Both ELMo and BERT outperform GloVe, which intuitively makes sense because the latter does not model the order or structure of the words in each description.

Comparison to humans: While the purity of our methods is higher than that of a random clustering, it is still far below 1. To provide additional context to these results, we now switch to our “odd-one-out” task and compare directly to human performance. For each triplet of cities, we identify the intruder as the city with the maximum Euclidean distance from the other two. Interestingly, crowd workers achieve only slightly higher accuracy than ELMo city representations; their inter-annotator agreement is also low,⁶ which indicates that close reading to analyze literary coherence between multiple texts is a difficult task, even for human annotators. Overall, results from both computational and human approaches suggest that the author-assigned labels are not entirely arbitrary, as we can reliably recover some of the thematic groups.

⁶ Fleiss κ = 0.14, indicating slight agreement, and two or more workers agreed on the intruder only 64% of the time.

5 Examining the learned clusters

Our quantitative results suggest that while vector-based city representations capture some thematic similarities, there is much room for improvement. In this section, we first investigate whether the learned clusters provide evidence for any arguments put forth by literary critics on the novel. Then, we explore possible reasons that the learned clusters deviate from Calvino’s.

Do learned clusters support existing analyses? The argument that cities of desire constitute a particularly coherent thematic group (Buitendijk, 2018) is partially supported by our clustering results. Three of the five cities of desire are grouped into the same cluster using BERT (two for ELMo), which makes it one of the most “internally coherent” groups. Similarly, some literary critics along with Calvino himself (Calvino, 2004) describe the thin cities as a fairly arbitrary group, which is supported by our results: when using BERT, no two thin cities are grouped into the same cluster. However, Calvino also suggests that the cities of memory group is a “fundamental substance” of the book and therefore should be highly coherent. Our computational methods cannot pick up this theme, instead scattering all cities of memory into different clusters.

Why do computers disagree with Calvino? In cases where the learned clusters deviate from the opinions of Calvino or literary critics, identifying the cause of the discrepancy is difficult: our computational methods are flawed, but there is also no one “correct” literary interpretation. Here we qualitatively analyze some of the learned clusters in an attempt to understand why the algorithm arrived at a particular assignment. First, we examine two cities from different thematic groups, Beersheba from “cities and the sky” and Valdrada from “cities and eyes”, that belong to the same learned cluster (and are each other’s nearest neighbors). The first two paragraphs of Beersheba describe a noble city “suspended in the heavens” with an identical but immoral “fecal” city underground, while the remaining paragraphs focus on the heavenly city. The description of Valdrada, which is built on a lake, shares this theme of twin cities: arriving travelers see “two cities: one erect above the lake, and the other reflected, upside down”. While Calvino likely classified Beersheba based on its location in the sky, the two cities share undeniable thematic similarities. Rerunning the clustering algorithm after removing the first two paragraphs of Beersheba results in each city being assigned to a different cluster, which supports our hypothesis.

Another interesting case is the previously-mentioned “thin cities”, supposedly bound together by airy and ambiguous themes (Knowles, 2015), which Calvino (2004) states were written after all of the other cities and are more incoherent than the other groups. While BERT does not group any thin cities together, ELMo categorizes Isaura and Armilla into the same learned cluster. The two cities appear largely dissimilar: Isaura is a city with a thousand wells dug by its inhabitants, while Armilla is an “unfinished” city without walls, ceilings, or floors. However, both cities’ descriptions mention supernatural beings living underground. In Isaura, some people believe “gods live in the depths” and “in the black lake that feeds the underground streams”, while the last paragraph of Armilla’s description conjectures that it is “in the possession of nymphs and naiads” who “travel along underground veins”. Removing these descriptions of underground gods and nymphs and rerunning our clustering algorithm yields a new assignment in which these two cities belong to different clusters.

When do humans and computers agree? Our computational approach yields generally comparable accuracies and more consistent results than human annotators in the “odd-one-out” task. On cities with concrete themes such as sky and trading, our approach with BERT and ELMo obtains accuracies of 0.44 and 0.45, respectively (0.47 and 0.48 for humans). ELMo also performs on par with humans in some cases: for example, humans achieve an accuracy of 42% on “cities and eyes”, compared to ELMo’s 43%. On groups where the theme word frequently occurs in the passage, such as “eyes”, our approach even slightly outperforms the human readers. However, human readers are better at recognizing abstract intangible topics, such as memory.

6 Related work

Most previous work within the NLP community applies distant reading (Jockers, 2013) to large collections of books, focusing on modeling different aspects of narratives such as plots and event sequences (Chambers and Jurafsky, 2009; McIntyre and Lapata, 2010; Goyal et al., 2010; Eisenberg and Finlayson, 2017), characters (Bamman et al., 2014; Iyyer et al., 2016; Chaturvedi et al., 2016, 2017), and narrative similarity (Chaturvedi et al., 2018). In the same vein, researchers in computational literary analysis have combined statistical techniques and linguistic theories to perform quantitative analysis on large narrative texts (Michel et al., 2011; Franzosi, 2010; Underwood, 2016; Jockers and Kirilloff, 2016; Long and So, 2016), but these attempts largely rely on techniques such as word counting, topic modeling, and naive Bayes classifiers and are therefore not able to capture the meaning of sentences or paragraphs (Da, 2019). While these works discover general patterns from multiple literary works, we are the first to use cutting-edge NLP techniques to engage with specific literary criticism about a single narrative.

There has been other computational work that focuses on just a single book or a small number of books, much of it focused on network analysis: Agarwal et al. (2013) extract character social networks from Alice in Wonderland, while Elson et al. (2010) recover social networks from 19th-century British novels. Wallace (2012) disentangles multiple narrative threads within the novel Infinite Jest, while Eve (2019) provides several automated statistical methods for close reading and tests them on the award-winning novel Cloud Atlas (2004). Compared to this work, we push further on modeling the content of the narrative by leveraging pretrained language models.

7 Conclusion

Our work takes a first step towards computationally engaging with literary criticism on a single book using state-of-the-art text representation methods. While we demonstrate that NLP techniques can be used to support literary analyses and obtain new insights, they also have clear limitations (e.g., in understanding abstract themes). As text representation methods become more powerful, we hope that (1) computational tools will become useful for analyzing novels with more conventional structures, and (2) literary criticism will be used as a testbed for evaluating representations.

Acknowledgement

We thank the anonymous reviewers for their insightful comments. Additionally, we thank Nader Akoury, Garrett Bernstein, Chenghao Lv, Ari Kobren, Kalpesh Krishna, Saumya Lal, Tu Vu, Zhichao Yang, Mengxue Zhang and the UMass NLP group for suggestions that improved the paper’s clarity, coverage of related work, and analysis experiments.

References

A. Agarwal, A. Kotalwar, and O. Rambow. 2013. Automatic extraction of social networks from literary text: A case study on alice in wonderland. In Proceedings of the Sixth International Joint Conference on Natural Language Processing.

David Bamman, Ted Underwood, and Noah A. Smith. 2014. A bayesian mixed effects model of literary character. In ACL.

Roland Barthes. 1994. 11 the death of the author. Media Texts, Authors and Readers: A Reader, page 166.

William Benzon and David G. Hays. 1976. Computational linguistics and the humanist. Computers and the Humanities, 10(5):265–274.

Harold Bloom. 2002. Bloom’s Major Short Story Writers Italo Calvino. Chelsea House Publishers.

Filip De Boeck. 2004. Kinshasa: Tales of the Invisible City. Leuven University Press.

Laurence Breiner. 1988. Italo calvino: The place of the emperor in “invisible cities”. Modern Fiction Studies, 34(4):559–573.

Tomas Buitendijk. 2018. Port cities and desire in the work of italo calvino. Port Towns and Urban Cultures, (Article).

Italo Calvino. 2004. On “invisible cities”. Columbia: A Journal of Literature and Art, (40):177–182.

Nathanael Chambers and Dan Jurafsky. 2009. Unsupervised learning of narrative schemas and their participants. In Proceedings of the Joint Conference of ACL and AFNLP.

Snigdha Chaturvedi, Mohit Iyyer, and Hal Daume III. 2017. Unsupervised learning of evolving relationships between literary characters. In Proceedings of the Thirty First AAAI Conference on Artificial Intelligence.

Snigdha Chaturvedi, Shashank Srivastava, Hal Daume III, and Chris Dyer. 2016. Modeling evolving relationships between characters in literary novels. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence.

Snigdha Chaturvedi, Shashank Srivastava, and Dan Roth. 2018. Where have i heard this story before?: Identifying narrative similarity in movie remakes. In NAACL-HLT.

Nan Z. Da. 2019. The computational case against computational literary studies. Critical Inquiry 45, 23567.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

J.D. Eisenberg and M. A. Finlayson. 2017. A simpler and more generalizable story detector using verb and character features. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing.

D. Elson, N. Dames, and K. McKeown. 2010. Extracting social networks from literary fiction. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics.

Martin Paul Eve. 2019. Close Reading with Computers: Textual Scholarship, Computational Formalism, and David Mitchell’s Cloud Atlas. Library of Congress Cataloging-in-Publication Data. 9781503609365.

Robert Franzosi. 2010. Quantitative Narrative Analysis. Library of Congress Cataloging-in-Publication Data. SAGE Publication.

Amit Goyal, Ellen Riloff, and Hal Daume III. 2010. Automatically producing plot unit representations for narrative text. In Proceedings of Empirical Methods in Natural Language Processing.

Mohit Iyyer, Anupam Guha, Snigdha Chaturvedi, Jordan L. Boyd-Graber, and Hal Daume III. 2016. Feuding families and former friends: Unsupervised learning for dynamic fictional relationships. In NAACL-HLT.

Matthew Jockers and Gabi Kirilloff. 2016. Understanding gender and character agency in the nineteenth-century novel. Journal of Cultural Analytics, culturalanalytics.org/2016/12/understanding-gender-and-character-agency-in-the-19th-century-novel/.

Matthew L. Jockers. 2013. Macroanalysis: Digital Methods and Literary History. Topics in the Digital Humanities. University of Illinois Press.

Dominick Knowles. 2015. A redemption of meaning in three novels by italo calvino. Digital Commons at Ursinus College, English Honor Papers(2).

Hoyt Long and Richard Jean So. 2016. Literary pattern recognition: Modernism between close reading and machine learning. Critical Inquiry 42, 23567.

Christopher Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. An Introduction to Information Retrieval. Cambridge University Press.

Neil McIntyre and Mirella Lapata. 2010. Plot induction and evolutionary search for story generation. In ACL.

Jean-Baptiste Michel, Yuan Kui Shen, Aviva P. Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden. 2011. Quantitative analysis of culture using millions of digitized books. Science (New York, N.Y.), 331(6014), 176–182. doi:10.1126/science.1199644.

Mark Newman. 2006. Modularity and community structure in networks. PNAS.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Empirical Methods in Natural Language Processing.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018a. Deep contextualized word representations. In North American Association for Computational Linguistics.

Matthew E. Peters, Mark Neumann, Luke S. Zettlemoyer, and Wen tau Yih. 2018b. Dissecting contextual word embeddings: Architecture and representation. In Proceedings of Empirical Methods in Natural Language Processing.

Carolyn Springer. 1985. Textual geography: The role of the reader in “invisible cities”. Modern Languages, 15(4):289–299.

Evelyne Teichert. 1985. Words about nothing: Writing the ineffable in calvino and ma yuan. Comparative Literature PhD Thesis, University of British Columbia.

Ted Underwood. 2016. The life cycle of genres. Journal of Cultural Analytics, culturalanalytics.org/2016/05/the-life-cycles-of-genres/.

Mario Vrbani. 2012. A dream of the perfect map calvinos invisible cities. The Zone and Zones - Radical Spatiality in our Times, (2).

Byron C Wallace. 2012. Multiple narrative disentanglement: Unraveling infinite jest. In North American Association for Computational Linguistics.

William K Wimsatt and Monroe C Beardsley. 1946. The intentional fallacy. The Sewanee Review, 54(3).