
NAACL HLT 2015

The 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Proceedings of the Fourth Workshop on Computational Linguistics for Literature

June 4, 2015, Denver, Colorado, USA

© 2015 The Association for Computational Linguistics

Order print-on-demand copies from:

Curran Associates
57 Morehouse Lane
Red Hook, New York 12571 USA
Tel: +1-845-758-0400
Fax: +1-845-758-2633
[email protected]

ISBN 978-1-941643-36-5

Preface

Welcome to the 4th edition of the Workshop on Computational Linguistics for Literature. After the rounds in Montréal, Atlanta and Göteborg, we are pleased to see both the familiar and the new faces in Denver.

We are eager to hear what our invited speakers will tell us. Nick Montfort, a poet and a pioneer of digital arts and poetry, will open the day with a talk on the use of programming to foster exploration and fresh insights in the humanities. He suggests a new paradigm useful for people with little or no programming experience.

Matthew Jockers’s work on macro-analysis of literature is well known and widely cited. He has published extensively on using digital analysis to view literature diachronically. Matthew will talk about his recent work on modelling the shape of stories via sentiment analysis.

This year’s workshop will feature six regular talks and eight posters. If our past experience is any indication, we can expect a lively poster session. The topics of the 14 accepted papers are diverse and exciting.

Once again, there is a lot of interest in the computational analysis of poetry. Rodolfo Delmonte will present and demo SPARSAR, a system which analyzes and visualizes poems. Borja Navarro-Colorado will talk about his work on analyzing shape and meaning in the 16th and 17th century Spanish sonnets. Nina McCurdy, Vivek Srikumar & Miriah Meyer propose a formalism for analyzing sonic devices in poetry and describe an open-source implementation.

This year’s workshop will witness a lot of work on parallel texts and on machine translation of literary data. Laurent Besacier & Lane Schwartz describe preliminary experiments with MT for the translation of literature. In a similar vein, Antonio Toral & Andy Way explore MT on literary data but between related languages. Fabienne Cap, Ina Rösiger & Jonas Kuhn explore how parallel editions of the same work can be used for literary analysis. Olga Scrivner & Sandra Kübler also look at parallel editions – in dealing with historical texts.

Several other papers cover various aspects of literary analysis through computation. Prashant Jayannavar, Apoorv Agarwal, Melody Ju & Owen Rambow consider social network analysis for the validation of literary theories. Andreas van Cranenburgh & Corina Koolen investigate what distinguishes literary novels from less literary ones. Dimitrios Kokkinakis, Ann Ighe & Mats Malm use computational analysis and leverage literature as a historical corpus in order to study typical vocations of women in 19th-century Sweden. Markus Krug, Frank Puppe, Fotis Jannidis, Luisa Macharowsky, Isabella Reger & Lukas Weimar describe a coreference resolution system designed specifically with fiction in mind. Stefan Evert, Thomas Proisl, Thorsten Vitt, Christof Schöch, Fotis Jannidis & Steffen Pielström explain the success of Burrows’s Delta in literary authorship attribution.

Last but not least, there are papers which do not fit into any other bucket. Marie Dubremetz & Joakim Nivre will tell us about automatic detection of a rare but elegant rhetorical device called chiasmus. Julian Brooke, Adam Hammond & Graeme Hirst describe a tool much needed in the community: GutenTag, a system for accessing Project Gutenberg as a corpus.

To be sure, there will be much to listen to, learn from and discuss for everybody with the slightest interest in either NLP or literature. We cannot wait for June 4 (-:).

This workshop would not have been possible without the hard work of our program committee. Many people on the PC have been with us from the beginning. Everyone offers in-depth, knowledgeable advice to both the authors and the organizers. Many thanks to you all! We would also like to acknowledge the generous support of the National Science Foundation (grant No. 1523285), which has allowed us to invite such interesting speakers.

We look forward to seeing you in Denver!

Anna F., Anna K., Stan and Corina

Nick Montfort (http://nickm.com/)

Short bio

Nick Montfort develops computational art and poetry, often collaboratively. He is on the faculty at MIT in CMS/Writing and is the principal of the naming firm Nomnym. Montfort wrote the books of poems #! and Riddle & Bind, co-wrote 2002: A Palindrome Story, and developed more than forty digital projects including the collaborations The Deletionist and Sea and Spar Between. The MIT Press has published four of his collaborative and individual books: The New Media Reader, Twisty Little Passages, Racing the Beam, and 10 PRINT CHR$(205.5+RND(1)); : GOTO 10, with Exploratory Programming for the Arts and Humanities coming soon.

Invited talk: Exploratory Programming for Literary Work

Abstract

We are fortunate to be at a stage when formal research projects, including substantial ones on a large scale, are bringing computation to bear on literary questions. While I participate in this style of research, in this talk I plan to discuss some different but valuable approaches to literature that use computation, approaches that are complementary. Specifically, I will discuss how smaller-scale and even ad hoc explorations can lead to new insights and suggest possibilities for more structured and larger-scale research. In doing so, I will explain my concept of exploratory programming, a style of programming that I find particularly valuable in my own practice and research and that I have worked to teach to students in the humanities, most of whom have no programming background. I am completing a book, Exploratory Programming for the Arts and Humanities, to be published by the MIT Press next year, which I hope will foster this type of programming. In my talk, I will provide some examples of how both generative approaches (developing systems that produce literary language) and analytical approaches can be undertaken using exploratory programming, and will explain how these can inform one another. While some of this work has been presented in literary studies and computer science contexts, my examples will also include work presented in art and poetry contexts.
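A minimal sketch of the generative side of this idea, written in Python for illustration: a few lines suffice to produce combinatorial verse in an exploratory, try-and-see style. The line template and word lists below are invented for this example and are not taken from the talk or from Montfort's own works.

    import random

    # Toy combinatorial poem generator; vocabulary and template are invented
    # for illustration only.
    ADJECTIVES = ["quiet", "burning", "hollow", "silver"]
    NOUNS = ["harbor", "engine", "letter", "meadow"]
    VERBS = ["drifts", "fractures", "remembers", "dissolves"]

    def stanza(n_lines=4, seed=None):
        """Assemble a short stanza by sampling one word from each list per line."""
        rng = random.Random(seed)
        lines = [
            "the {} {} {}".format(rng.choice(ADJECTIVES), rng.choice(NOUNS), rng.choice(VERBS))
            for _ in range(n_lines)
        ]
        return "\n".join(lines)

    if __name__ == "__main__":
        print(stanza(seed=7))  # fixing the seed makes an experiment repeatable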

Matthew Jockers (http://www.matthewjockers.net/)

Short bio

Matthew L. Jockers is Associate Professor of English at the University of Nebraska, Faculty Fellow in the Center for Digital Research in the Humanities, Faculty Fellow in the Center for Great Plains Studies, and Director of the Nebraska Literary Lab. His books include Macroanalysis: Digital Methods and Literary History (University of Illinois, 2013) and Text Analysis Using R for Students of Literature (Springer, 2014). He has written articles on computational text analysis, authorship attribution, Irish and Irish-American literature, and he has co-authored several amicus briefs defending the fair and transformative use of digital text.

Invited talk: The (not so) Simple Shape of Stories

Abstract

In a now famous lecture, Kurt Vonnegut described what he called the “simple shapes of stories.” His thesis was that we could understand the plot of novels and stories by tracking fluctuations in sentiment. He illustrated his thesis by drawing a grid in which the y-axis represented “good fortune” at the top and “ill fortune” at the bottom. The x-axis represented narrative time and moved from “beginning” at the left to “end” at the right. Using this grid, Vonnegut traced the shapes of several stories including what he called the “man in hole” and the “boy meets girl”. At one point in the lecture, Vonnegut wonders why computers cannot be trained to reveal the simple shapes of stories. In this lecture, Matthew Jockers will describe his attempt to model the simple shapes of stories using methods from sentiment analysis and signal processing.
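The underlying recipe can be sketched in a few lines of Python: score each sentence with a sentiment lexicon, then smooth the resulting series (here with a simple moving average) so the large-scale trajectory of the story becomes visible. The toy lexicon, window size, and example sentences are invented for illustration and are not Jockers's actual data or pipeline.

    # Sketch of a sentiment-trajectory "shape of a story"; purely illustrative.
    TOY_LEXICON = {"good": 1, "joy": 1, "love": 1, "ill": -1, "grief": -1, "lost": -1}

    def sentence_scores(sentences, lexicon=TOY_LEXICON):
        """Sum lexicon values over the words of each sentence."""
        scores = []
        for sentence in sentences:
            words = [w.strip(".,;:!?").lower() for w in sentence.split()]
            scores.append(sum(lexicon.get(w, 0) for w in words))
        return scores

    def moving_average(values, window=3):
        """Smooth the raw scores so the overall trajectory stands out."""
        smoothed = []
        for i in range(len(values)):
            lo, hi = max(0, i - window // 2), min(len(values), i + window // 2 + 1)
            smoothed.append(sum(values[lo:hi]) / (hi - lo))
        return smoothed

    story = [
        "A man falls into a hole; ill fortune and grief.",
        "He struggles, lost, but a friend arrives with love.",
        "He climbs out to joy and good fortune.",
    ]
    print(moving_average(sentence_scores(story)))  # rising curve: "man in hole"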

Program Committee

Apoorv Agarwal (Columbia University)
Cecilia Ovesdotter Alm (Rochester Institute of Technology)
David Bamman (Carnegie Mellon University)
Peter Boot (Huygens Institute for Netherlands History)
Julian Brooke (University of Toronto)
Hal Daumé III (University of Maryland)
David Elson (Google)
Micha Elsner (Ohio State University)
Mark Finlayson (MIT)
Pablo Gervás (Universidad Complutense de Madrid)
Roxana Girju (University of Illinois at Urbana-Champaign)
Amit Goyal (University of Maryland)
Catherine Havasi (MIT Media Lab)
Justine Kao (Stanford University)
David Mimno (Cornell University)
Saif Mohammad (National Research Council Canada)
Rebecca Passonneau (Columbia University)
Livia Polanyi (LDM Associates)
Owen Rambow (Columbia University)
Michaela Regneri (Saarland University)
Reid Swanson (University of California, Santa Cruz)
Rob Voigt (Stanford University)
Marilyn Walker (University of California, Santa Cruz)
Janice Wiebe (University of Pittsburgh)
Bei Yu (Syracuse University)

Invited Speakers

Matthew Jockers (University of Nebraska)
Nick Montfort (MIT)

Organizers

Anna Feldman (Montclair State University)
Anna Kazantseva (University of Ottawa)
Stan Szpakowicz (University of Ottawa)
Corina Koolen (Universiteit van Amsterdam)


Table of Contents

Exploratory Programming for Literary Work Nick Montfort ...... v

The (not so) Simple Shape of Stories Matthew Jockers ...... vi

Tools for Digital Humanities: Enabling Access to the Old Occitan Romance of Flamenca Olga Scrivner and Sandra Kübler ...... 1

RhymeDesign: A Tool for Analyzing Sonic Devices in Poetry Nina McCurdy, Vivek Srikumar and Miriah Meyer ...... 12

Rhetorical Figure Detection: the Case of Chiasmus Marie Dubremetz and Joakim Nivre ...... 23

Validating Literary Theories Using Automatic Social Network Extraction Prashant Jayannavar, Apoorv Agarwal, Melody Ju and Owen Rambow...... 32

GutenTag: an NLP-driven Tool for Digital Humanities Research in the Project Gutenberg Corpus Julian Brooke, Adam Hammond and Graeme Hirst...... 42

A Pilot Experiment on Exploiting Translations for Literary Studies on Kafka’s "Verwandlung" Fabienne Cap, Ina Rösiger and Jonas Kuhn...... 48

Identifying Literary Texts with Bigrams Andreas van Cranenburgh and Corina Koolen ...... 58

Visualizing Poetry with SPARSAR – Visual Maps from Poetic Content Rodolfo Delmonte ...... 68

Towards a better understanding of Burrows’s Delta in literary authorship attribution Stefan Evert, Thomas Proisl, Thorsten Vitt, Christof Schöch, Fotis Jannidis and Steffen Pielström 79

Gender-Based Vocation Identification in Swedish 19th Century Prose Fiction using Linguistic Patterns, NER and CRF Learning Dimitrios Kokkinakis, Ann Ighe and Mats Malm...... 89

Rule-based Coreference Resolution in German Historic Novels Markus Krug, Frank Puppe, Fotis Jannidis, Luisa Macharowsky, Isabella Reger and Lukas Weimar 98

A computational linguistic approach to Spanish Golden Age Sonnets: metrical and semantic aspects Borja Navarro ...... 105

Automated Translation of a Literary Work: A Pilot Study Laurent Besacier and Lane Schwartz ...... 114

Translating Literary Text between Related Languages using SMT Antonio Toral and Andy Way ...... 123

Conference Program

Thursday, June 4, 2015

Session 1

8:57–9:00 Welcome

9:00–10:00 Exploratory Programming for Literary Work (invited talk) Nick Montfort

10:00–10:30 Tools for Digital Humanities: Enabling Access to the Old Occitan Romance of Flamenca Olga Scrivner and Sandra Kübler

Coffee break

Session 2

11:00–11:30 RhymeDesign: A Tool for Analyzing Sonic Devices in Poetry Nina McCurdy, Vivek Srikumar and Miriah Meyer

11:30–12:00 Rhetorical Figure Detection: the Case of Chiasmus Marie Dubremetz and Joakim Nivre

12:00–12:30 Validating Literary Theories Using Automatic Social Network Extraction Prashant Jayannavar, Apoorv Agarwal, Melody Ju and Owen Rambow

Thursday, June 4, 2015 (continued)

Lunch break

Session 3

14:00–15:00 The (not so) Simple Shape of Stories (invited talk) Matthew Jockers

15:00–15:30 Poster teaser talks

Coffee break

Session 4

16:00–16:30 Poster session

GutenTag: an NLP-driven Tool for Digital Humanities Research in the Project Gutenberg Corpus Julian Brooke, Adam Hammond and Graeme Hirst

A Pilot Experiment on Exploiting Translations for Literary Studies on Kafka’s "Verwandlung" Fabienne Cap, Ina Rösiger and Jonas Kuhn

Identifying Literary Texts with Bigrams Andreas van Cranenburgh and Corina Koolen

Visualizing Poetry with SPARSAR – Visual Maps from Poetic Content Rodolfo Delmonte

Towards a better understanding of Burrows’s Delta in literary authorship attribution Stefan Evert, Thomas Proisl, Thorsten Vitt, Christof Schöch, Fotis Jannidis and Steffen Pielström

Gender-Based Vocation Identification in Swedish 19th Century Prose Fiction using Linguistic Patterns, NER and CRF Learning Dimitrios Kokkinakis, Ann Ighe and Mats Malm

Thursday, June 4, 2015 (continued)

Rule-based Coreference Resolution in German Historic Novels Markus Krug, Frank Puppe, Fotis Jannidis, Luisa Macharowsky, Isabella Reger and Lukas Weimar

A computational linguistic approach to Spanish Golden Age Sonnets: metrical and semantic aspects Borja Navarro

Session 5

16:30–17:00 Automated Translation of a Literary Work: A Pilot Study Laurent Besacier and Lane Schwartz

17:00–17:30 Translating Literary Text between Related Languages using SMT Antonio Toral and Andy Way

17:30–17:33 Farewell


Tools for Digital Humanities: Enabling Access to the Old Occitan Romance of Flamenca

Olga Scrivner, Indiana University, Bloomington, IN, USA, [email protected]
Sandra Kübler, Indiana University, Bloomington, IN, USA, [email protected]

Abstract

Accessing historical texts is often a challenge because readers either do not know the historical language, or they are challenged by the technological hurdle when such texts are available digitally. Merging corpus linguistic methods and digital technology can provide novel ways of representing historical texts digitally and providing a simpler access. In this paper, we describe a multi-dimensional parallel Old Occitan-English corpus, in which word alignment serves as the basis for search capabilities as well as for the transfer of annotations. We show how parallel alignment can help overcome some challenges of historical manuscripts. Furthermore, we apply a resource-light method of building an emotion annotation via parallel alignment, thus showing that such annotations are possible without speaking the historical language. Finally, using visualization tools, such as ANNIS and GoogleViz, we demonstrate how the emotion analysis can be queried and visualized dynamically in our parallel corpus, thus showing that such information can be made accessible with low technological barriers.

1 Introduction

In the past, historical documents and manuscripts were studied exclusively by a manual, paper-based approach. This limited the access to such documents to scholars who, on the one hand, know the historical variety of the language and, on the other hand, had access to the manuscripts. However, recent achievements in corpus linguistics have introduced new methods and tools for digitization and text-processing. Similarly, the progress in digital technology has created novel ways of data visualization and data interpretation. Finally, “by accessing linguistic annotation, we can extend the range of phenomena that can be found” (Kübler and Zinsmeister, 2014). That is, a digital corpus enriched with linguistic annotation, such as syntactic, pragmatic, and semantic annotation, can amplify our understanding of the literary or historical work. With such a corpus, a researcher can overcome many of the challenges provided by the historical nature of the manuscripts as well as by the technological barrier provided by many available query tools. For example, one of the challenges in working with historical texts consists in the variation in spelling or in lexical variation. In such cases, instead of performing a direct lexical search, researchers can access this information via linguistic annotation if lemma information is available.

However, there are challenges for which standard linguistic annotations are not useful. First, searching in text is usually restricted to known phenomena or explicit occurrences of data. For example, some phenomena may involve a variation between explicit and implicit tokens, e.g., null subjects or zero relative clauses. While some corpora allow for a query of null occurrences, such as in the MCVF (Martineau et al., 2010), most of the monolingual corpora do not provide such an annotation. Furthermore, it is essential for the researcher to be aware of all possible forms and contexts for queries, which is a difficult task in monolingual corpora. That is, there is always a possibility that “some relevant form will be overlooked because it has never been studied”



(Enrique-Arias, 2013, 107). For example, consider for the project. the study of Medieval Spanish discourse markers, e.g., he, evas´ ’behold’. Enrique-Arias (2013) shows 2 Using Corpora in Historical Linguistics that only by using a parallel corpus is he able to ob- serve unexpected linguistic structures conveying the Parallel corpora are collections of two or more texts same discourse function, e.g., sepas que ‘know that’ consisting of an original text (source) and its trans- and cata que ‘see that’. Finally, monolingual cor- lation(s). Generally, such parallel corpora are an- pora are usually accessible only to an audience with notated for word alignment between the source text prior knowledge of a given historical language, leav- and its translation. Word alignment can be carried ing a large public aside. out automatically via tools such as GIZA++ (Och We propose to address these challenges by us- and Ney, 2000). ing a parallel corpus. In our case, the two doc- Monolingual historical corpora are undoubtedly uments consist of the original, historical text and valuable linguistic resources. It is not uncommon, its translation into modern English. Until recently, however, to encounter different spellings and other parallel corpora have been almost exclusively used lexical variations in historical texts. Not knowing in the fields of machine translation, bilingual lexi- an exact spelling or just a simple language barrier cography, translator training, and the study of lan- may hinder data collection. Parallel corpora can as- guage specific translational phenomena (McEnery sist in such situations through Historic Document and Xiao, 2007). With the increased availability of Retrieval (Koolen et al., 2006), which allows re- historical parallel corpora, we have seen the emer- searchers to query via the modern translation rather gence of their use in historical linguistics (Koolen than via the older language. Given the common as- et al., 2006; Zeldes, 2007; Petrova and Solf, 2009; sumption that “translation equivalents are likely to Enrique-Arias, 2012; Enrique-Arias, 2013). be inserted in the same or very similar, syntactic, In this paper, we introduce a parallel annotated semantic and pragmatic contexts” (Enrique-Arias, Old Occitan-English corpus. We show how the 2013, 114), we can assess not only lexical, but also alignment with modern English makes this historical morphological variations. That is, it is possible to corpus more accessible and how the word alignment a) identify forms that have never been studied and can facilitate the cross-language transfer of emotion b) find occurrences based on their textual or stylis- annotations from a resource-rich modern language tic conventions. For example, using the parallel into a resource-poor language, such as Old Occitan. Bible corpus of Old Spanish and its English transla- Finally, we demonstrate how emotion visualization tion, Sanchez´ Lopez´ (To Appear) is able to identify techniques can contribute to a richer understanding a new form, salvante, that has never been reported. of the literary text for technically less inclined read- Similarly, Enrique-Arias (2012) examines dis- ers. course markers, possessive structures and clitics in The remainder of this paper is organized as fol- the Latin Bible and its medieval Spanish translation. 
lows: Section 2 reviews the use of corpora in histor- In addition, the author is able to observe stylistic ical studies, and section 3 describes work on trans- variation in choices made by translators depending ferring annotations via alignment. Section 4 de- on Bible sub-genres such as narrative or poetry. scribes the textual basis of our corpus, the 13th- In recent years, there has also been increasing century Old Occitan Romance of Flamenca. In sec- interest in the correspondence between translation tion 5, we explain the compilation of the parallel and language change. In this view, translation is corpus. Section 6 describes the emotion annotation, seen as a “means of tracking change” (Beeching, which was carried out for English and then trans- 2013). Various studies have demonstrated the feasi- ferred to Old Occitan. Section 7 introduces emo- bility of parallel corpora in studies of semantic and tion queries via ANNIS, a freely available on-line (morpho-)syntactic change. For example, Beech- search platform, and visualization methods with mo- ing (2013) examines the evolution of the French ex- tion charts in GoogleViz. Section 8 draws general pression quand-memeˆ using monolingual and paral- conclusions and provides an outlook on future steps lel corpora. Similarly, Zeldes (2007) looks at the

2 parallel corpus of Bible translations in two differ- tool not only for historians and historical linguists ent stages of the same language, Old and Modern but also for a lay audience since the translation pro- Polish. The author is able to detect various changes vides access to the meaning without introducing ad- in nominal affixes. Another interesting approach is ditional hurdles such as by a semantic annotation. suggested by Petrova and Solf (2009), who investi- In section 4, we will present our corpus of choice, gate the influence of Latin in historical documents. the Romance of Flamenca, an Old Occitan poem, When we deal with historical documents that are along with the parallel version that includes a mod- translations of Latin or other languages, it is hard to ern English translation. Then, we will present an assess which phenomena are target language specific approach to annotate this corpus for emotions, using or introduced via the translation from the source lan- non-experts working on the English part and then guage. Petrova and Solf (2009) show that this is- transferring the annotation to the source language. sue can be resolved with a parallel diachronic cor- pus: They analyze a change in word order in the Old 3 Using Alignment Methods for High German translation of the Bible and its original Annotating Resource-Poor Languages Latin version. Given the assumption that any word order deviation in the translation can be viewed as Linguistic annotation often involves a great amount evidence for Old High German syntax, their inves- of manual labor, which is often not feasible for tigation is restricted to cases where word order in low-resourced languages. Instead, we can use the translation differs from its original. The results a method from computational linguistics, namely reveal that in contrast to Latin, preverbal objects cross-language transfer, as proposed by Yarowsky in subordinate clauses in Old High German convey and Ngai (2001). This method does not involve given information (explicitly mentioned in the previ- any resources in the target language, neither train- ous context), whereas postverbal objects carry new ing data, a large lexicon, nor time-consuming man- information. ual annotation. Finally, linguists and computational linguistics Cross-language transfer has been previously ap- can benefit from parallel corpora in studies of im- plied to languages with parallel corpora and bilin- plicit constituents, e.g. zero anaphora. Not all cor- gual lexica (Yarowsky and Ngai, 2001; Hwa et al., pora are annotated for implicit occurrences or the 2005). This approach uses parallel text and word omission of certain elements in a sentence. This alignment to transfer the annotation from one lan- is a common challenge in the fields, where zero guage to the next. Yarowsky and Ngai (2001) show anaphora resolution is necessary, e.g., automatic the transfer of a coarse grained POS tagset and base summarization, machine translation, and studies of noun phrases from English to French. Yarowsky syntactic variation. With a parallel corpus, it is et al. (2001) extend the approach to Spanish and possible to search for the explicit form in a trans- Czech, and they include named entity tagging. Hwa lated text and then observe the use or omission of et al. (2005) use a similar approach to transfer that form in the original text. For instance, in his dependency annotations from English to Chinese. 
study of Biblia Medieval, a parallel corpus of Old Snyder and Barzilay (2008) extend the approach to Spanish Bible translations, Enrique-Arias (2013) an- unsupervised annotation of morphology in Semitic alyzes discourse markers and observes instances of languages via a hierarchical Bayesian network. And zero-marking in Old Spanish by searching for ex- Snyder et al. (2009) extend the framework to include plicit Hebrew markers. Furthermore, such corpora multiple source languages. can be used to investigate the convergence universal In previous work (Scrivner and Kubler,¨ 2012), we in machine translation, a correlation between zero used another cross-language transfer method, based anaphora and the degree of text explicitation (Rello on the work by Feldman and Hana (2010), to an- and Ilisei, 2009). notate the Flamenca corpus with POS and syntactic We argue that the addition of a translation into a annotations. This method does not require parallel modern language is a simple and intuitive way of texts, it rather uses resources such as lexical or POS giving access to historical texts. This is a useful taggers from closely related languages. Our anno-

3 tations were based on resources from Old French in 1834. Since then, the manuscript has been edited (Martineau et al., 2010) and modern Catalan (Civit multiple times (Meyer, 1865, 1901; Gschwind 1976; et al., 2006). Huchet, 1982). While Gschwind’s edition (1976) In machine translation, the transfer of sentiment remains “the most useful edition”, which provides analysis is common in machine translation.Kim et a more accurate interpretation (Blodgett, 1995; Car- al. (2010) use a machine translation system to map bonero, 2010), we have chosen the second edition by subjectivity lexica from English to other languages. Meyer for two reasons: a) The edition has no copy- In word sense disambiguation, word alignment has right restriction and b) this edition is available in a been used as a bridge (Tufis, 2007) based on the as- scanned image format provided by Google1. sumption that the translated word shares the same sense with the original word. A similar method was 5 Parallel Occitan-English Corpus used for sentiment transfer from a resource-rich lan- The compilation and architecture of the monolingual guage to a resource-poor language (Mihalcea et al., Old Occitan corpus Romance of Flamenca along 2007). with the morpho-syntactic and syntactic annotations has been described by Scrivner and Kubler¨ (2012) 4 Romance of Flamenca and Scrivner et al. (2013). In this paper, we augment the monolingual version into the Romance of Fla- Medieval Provenc¸al literature is well known for its menca corpus with a parallel Old Occitan-English lyric poetry of troubadours. There remains, how- level. As described in section 2, we regard mono- ever, a small number of non-lyric provenc¸al texts, lingual and parallel corpora as complementary re- such as Romance of Flamenca, Girart de Rossilho sources. That is, our new corpus conforms to the tra- and Daurel et Beto, that have not received much ditions of a conventional multi-layered monolingual attention. In this project we focus on Romance corpus, and at the same time, it offers the research of Flamenca, which can be faithfully described as advantages of a parallel bilingual corpus. “the one universally acknowledged masterpiece of In our task, we have made various methodolog- Old Occitan narrative” (Fleischmann, 1995). This ical decisions related to translation and alignment. anonymous romance, written in the 13th century, First, in the selection of a translation of the source, presents an artistic amalgam of fabliau, courtly ro- it was important to find the most faithful transla- mance, troubadours lyrics, and narrative genre. The tion to the original poem. While free translations uniqueness of this “first modern novel” is also seen have their own merits, they pose a great challenge to in its use of setting, adventures, and character por- the alignment task. Our choice fell to the work by trayal (Blodgett, 1995). The narrator virtuously de- Blodgett (1995) for several reasons. Blodgett “en- picts Archambaut’s transmutation from a chivalrous deavored, so far as possible, to respect the loose and knight to an unbearably jealous husband who locks often convoluted syntax of the original” (Blodgett, his beautiful wife Flamenca in a tower, as well 1995). In addition, the author was able to add lines as Guilhem’s ingenious conspiracy to liberate Fla- from the manuscript that were missing in the pre- menca. Furthermore, this prose in verse played an vious editions. 
Finally, Blodgett (1995) followed a influential role in the development of French litera- “conservative approach” and omitted lines that were ture. The potential value of this historical resource, suggested earlier to replace lacunae in the original. however, is limited by the lack of an accessible dig- This conservative approach is necessary for ensuring ital format and linguistic annotation. the accurate line alignment of verses. There are no known records of Romance of Fla- In a second step, we provided word alignment. menca before the late 18th century, when it was This is a challenging task, partly because of the non- seized from a private collection during the French standardized spelling in the Old Occitan source, but Revolution and placed later in the library of Carcas- also because the amount of aligned text is rather sonne (Blodgett, 1995). The manuscript came in 139 small for standard unsupervised approaches. An ad- folios with 8095 octosyllabic verses but missing the beginning and end. It was first edited by Raynouard 1https://archive.org/details/leromandeflamen00meyegoog

4 Figure 1: Word alignment: sample of results for the null pronoun in Old Occitan. ditional challenge results from the verse structure of tence with the word index as well as the English the poem, which necessitates deviations in the trans- translation. The GIZA++ output before correction is lation. This genre is prone to various stylistic word illustrated in (2), and the corrected version is shown orders, as compared to political or historical narra- in (3). In both versions, the numbers in parentheses tives. In addition, sentence boundaries in Occitan do behind a word indicate which word in the original is not always correspond to those in the English trans- aligned with this word in the translation. lation. (1) index: 1 2 3 4 5 6 If we followed common practice in automatic OO: poissas lur dis tot en apert alignment and chose sentences as the basic text units ME: then he said to them openly for the automatic alignment, this very likely would result in many mis-alignments. As a result, we de- (2) then (1) he ( ) said (3) to (4) them (2) openly cided to split data by lines, instead of sentences. We (6) 2 performed the line alignment by means of NATools . (3) then (1) he (3) said (3) to (2) them (2) openly NATools is a Perl package for processing parallel (6) corpora. This package helps with the alignment of two files on a sentence level and with extracting a This example shows that in (2), the subject pro- probability lexicon. noun he is not aligned with any word in Old Occ- For word alignment, after some experimenta- itan. This is to be expected since Old Occitan is a tion, we decided to use a fully unsupervised ap- pro-drop language, and the pronoun is not expressed proach since there do not exist any automatic align- overtly. However, during our manual correction, we ers for the Old Occitan-English pair. We chose align the pronoun he with the verb dis, a standard GIZA++ (Och and Ney, 2000), a freely available treatment for null subject pronouns. In a final step, automatic aligner, which allows for one-to-one and we combine lines with corrected word alignment to one-to-many word alignment. In addition, when form a sentence in order to merge this parallel align- using GIZA++, we can make use of our extracted ment with our monolingual Old Occitan corpus, as probability dictionary. The output of the automatic illustrated in Figure 1. alignment was then corrected manually. Below, we This example also shows that such an alignment show an example. In (1), we show the original sen- can be used to search the Old Occitan corpus via the modern English annotated translation. For example, 2http://linguateca.di.uminho.pt/natools/ the query with an explicit English pronoun (PRP)

5 Storm & Storm Shaver et al. Plutchik has been shown that human annotators often assign a (1987) (1987) (1980) different label for the same emotional context (Alm sadness sadness sadness and Sproat, 2005, 670). Finally, available annotated anger anger anger resources are domain specific, e.g., movie reviews fear fear fear and poll opinions, which makes it difficult to adapt happiness happiness joy to a narrative genre. As Francisco et al. (2011) point love affection disgust out, “the complexity of the emotional information disgust trust involved in narrative texts is much higher than in anxiety surprise those domains”. contentment anticipation In recent years, however, with the increasing ac- hostility cess to digitized books, such as the Google Books liking Corpus and Project Gutenberg, there has been grow- pride ing interest in applying emotion annotation for nar- shame rative stories. For example, Alm and Sproat (2005) Table 1: Basic emotion models. annotate 22 Grimms’ fairy tales and demonstrate the importance of story sequences for emotional story evaluation and Francisco et al. (2011) create a cor- aligned to the Occitan verb (VJ) allows to find null pus of 18 English folk tales. Both corpora are built occurrences of subject pronouns in Old Occitan. using a manual annotation. In contrast, Moham- At present, our parallel corpus contains 14 100 mad (2012) applies a lexicon-based method to the tokens and 1 000 aligned verse lines. Our corpus is emotion analysis of Google Books. He creates an further converted to PAULA XML (Dipper, 2005) emotion association lexicon with 14 200 word types, and imported into the ANNIS search engine (Zeldes which determines the ratio of emotional terms in et al., 2009), which makes this corpus accessible on- 3 text. In addition, Mohammad shows effective visu- line . alization methods that trace emotions and compare emotional content in a large collections of texts. 6 Emotion Annotation Transfer As we have seen, the emotion annotation can be While emotion analysis constitutes an important a valuable resource in linguistics and literary stud- component in literary analysis, narrative corpora an- ies. However, annotated corpora and emotion lex- notated for emotional content are not very common. ica exist mainly for resource-rich languages, such as In contrast, there is a large body of work on emotion English. Annotating verses in Old Occitan manually and sentiment analysis of non-literary resources, for emotional content is a tedious task and requires such as blog posts, news and tweets (see (Liu, 2012; an expert in the language. Thus neither a manual an- Pang and Lee, 2008) for overviews). However, de- notation of the text nor the creation of an emotion spite the advances in the automatic annotation, the lexicon in the source language is viable. However, manual annotation of emotions remains a difficult we can use a resource-poor approach from NLP, task. On the one hand, the definition of emotion re- namely cross-language transfer (see section 3, which mains a controversial issue as there is still no clear allows us to take advantage of English resources in distinction between emotions, attitude, personality, combination with the word alignment. I.e., we can and mood. Various models of emotion clusters have annotate emotions in English, which is easy enough been proposed, as illustrated in Table 1, but no clear to do given a lexicon and an undergraduate student. standard has emerged so far. 
In a second step, we transfer the annotation via the On the other hand, the assignment of emotion is word alignments to the source language. often a subjective decision. While many emotions Below, we will describe our emotion annotation can be identified through contextual and linguistic transfer method. First, we compiled a word list from cues, e.g., lexical, semantic, or discourse cues, it the English version Flamenca and removed common 3www.oldoccitancorpus.org function words. We then used the NRC emotion lex-

6 icon4, which consists of words and their associations who show the relevance of textual sequencing for with 8 emotions as well as positive or negative sen- emotional analysis, we have segmented the corpus timent. Mohammad and Yang (2011) created this into 10 logical event sequences, namely wedding an- lexicon by using frequent words from the 1911 Ro- nouncement, preparation for wedding, arrival of Ar- get Thesaurus5, which were annotated by 5 human chambaut, marriage, departure, King’s arrival, and annotators. For our application, we focus on the 8 Queen’s jealousy. The corpus consists of different emotion associations: anger, anticipation, disgust, layers of annotations: a) (morpho-)syntactic layer fear, joy, sadness, surprise and trust. Since the emo- (part-of-speech and constituency annotation for Oc- tion annotation is not neutral with regard to context, citan and part-of-speech for English), b) lemmas, c) several emotions can be assigned to the same token discourse layer (speakers classification, e.g., king, in the NRC lexicon, as shown in (4). queen, Flamenca), d) temporal sequencing (events), e) word alignment (Occitan English), and f) emo- abandon fear → abandon sadness tion layer (joy, trust, fear, surprise, sadness, dis- (4) lose anger gust, anger and anticipation). For visualization, we lose disgust use the search engine ANNIS (Zeldes et al., 2009), lose fear which is designed for displaying and querying multi- layered corpora. One advantage of ANNIS con- As we can see, the word abandon has two associ- sists of its graphical query interface, which allows ations, namely with fear and sadness, and the word for user-friendly, graphical queries, instead of tradi- lose has three possible associations, namely anger, tional textual queries, for which one needs to know disgust and fear. During the initial label transfer the query language. To illustrate the visual query from the NCR lexicon to the Flamenca lexicon, we tool, we present a query in Figure 2. In this query, kept multiple labels. The tokens with multiple la- we search for any aligned token that expresses joy bels were further manually checked, and only one and is spoken by Father. label that would best fit the context to our knowl- One example of a query result is illustrated in Fig- edge was retained. For example, in case of ‘aban- ure 3. This example shows the Occitan token honor don’, we selected the emotion sadness, as the con- and the aligned English token honors that are anno- text describes Famenca’s father before her marriage. tated with joy and spoken by Father. Also the evaluation of 100 randomly selected labels Another way of analyzing emotions more gener- revealed that some associations did not fit our text ally is to look at the overall emotional content, by due to the difference in genre. For example, ‘court’ querying for any words that have emotion; in other in our narrative genre represents a different seman- words, they are not annotated as “None” for emo- tic entity (king’s court), whereas in the NCR lexicon, tion (query: emotion ! = ”None”). We can then ‘court’ (a criminal case) is associated with anger or perform a frequency analysis of the results. The fre- fear. We decided to leave these cases unannotated. quency distribution is shown in Figure 4, where we Finally, the emotion annotation was transferred see that trust and joy are the most common emo- from the English translated words to their aligned tions. words in Old Occitan. 
The emotion annotation layer In recent years, visual and dynamic applications was further added to the main corpus and converted in corpus and literature studies have become more to the ANNIS format. important, thus showing a focus on non-technical users. For example, (Moretti, 2005) advocates the 7 Corpus Query and Visualization use of maps, trees, and graphs in the literary anal- In this section, we will focus on how our parallel ysis. Oelke et al. (2012) suggest person network corpus can be queried for emotional content. Fol- representation, showing relations between charac- lowing the approach from Alm and Sproat (2005), ters. In addition, they also use fingerprint plots, 4http://saifmohammad.com/WebPages/lexicons.html which allow to compare the mentioning of vari- 5The words must occur more than 120 000 times in the ous categories, e.g. male, female, across several Google n-gram corpus. novels. Hilpert (2011) introduces dynamic motion

7 Figure 2: Search for joy with the speaker Father.

Figure 3: Resulting visualization for the query in Figure 2. charts as a visualization methodology to the liter- type, color and size across time sequencing. Thus, ary and linguistic studies. These charts are common the user can access this information in an interactive in the socio-economic statistic field and are capa- way without having to extract any information.For ble of visualizing in motion the development of a the future, we plan on adding discourse and word phenomenon in question across time. Hilpert (2011, information. 436) stresses that “the main purpose of producing linguistic motion charts is to give the analyst an in- 8 Conclusion tuitive understanding of complex linguistic develop- ment”. Following Hilpert’s methodology, we con- We have presented an approach to providing access verted our data into the R data frame format and to an Old Occitan text via parallel word alignment produced a motion chart, as shown in Figure 5. At with a modern language and cross-language trans- present, our emotion analysis can be assessed as a fer. We regard this project as a case study, showing dynamic motion chart, by using GoogleViz6, inter- that we can provide access to many types of infor- nally interfaced via R (R Development Core Team, mation without the user having to learn cumbersome 2007). The chart allows for displaying emotion by query languages or programming. We have also shown that the use of methods from computational 6http://cran.r-project.org/web/packages/googleVis/index.html linguistics, namely cross-language transfer, can pro-

8 Figure 5: Dynamic emotion analysis with GoogleViz.

this parallel corpus can be used as a training cor- pus in machine translation and for parallel dictio- nary and emotion lexicon building in resource-poor languages, such as Old Occitan.

References Cecilia Ovesdotter Alm and Richard Sproat. 2005. Emo- tional sequencing and development in fairy tales. In Jianhua Tao, Tieniu Tan, and Rosalind Picard, edi- tors, Affective Computing and Intelligent Interaction, volume 3784 of Lecture Notes in Computer Science, Figure 4: ANNIS frequency analysis. pages 668–674. Springer. Kate Beeching. 2013. A parallel corpus approach to investigating semantic change. In Karin Aijmer and vide tools for annotating the corpus without having Bengt Altenberg, editors, Advances in Corpus-Based access to an expert in the historical language. Us- Contrastive Linguistics: Studies in Honour of Stig Jo- ing ANNIS, we showed how a multi-layered corpus hansson, pages 103–126. John Benjamins. E.D. Blodgett. 1995. The Romance of Flamenca. Gar- can be queried via a user-friendly visual query in- land, New York. terface. Finally, we presented a motion chart, which Monserat Civit, Antonia` Mart´ı, and Nuria Buf´ı. 2006. allows the user analyze and trace emotions dynami- Cat3LB and Cast3LB: From constituents to dependen- cally without any technical requirements. cies. In Advances in Natural Language Processing, This work is part of our on-going project to fully pages 141–153. Springer. annotate the Romance of Flamenca. Our goal is to Stefanie Dipper. 2005. XML-based stand-off represen- provide users with necessary tools allowing for text- tation and exploitation of multi-level linguistic anno- tation. In Proceedings of Berliner XML Tage 2005 mining and visualization of this romance. Given a (BXML 2005), pages 39–50, Berlin, Germany. search query in ANNIS, we plan to develop an R Andres´ Enrique-Arias. 2012. Parallel texts in diachronic package that will enable users to visualize their indi- investigations: Insights from a parallel corpus of Span- vidual results exported from ANNIS and processed ish medieval Bible translations. In Exploring Ancient as motions charts and other statistical plots. Finally, Languages through Corpora (EALC), Oslo, Norway.

9 Andres´ Enrique-Arias. 2013. On the usefulness of us- Meeting of the ACL, pages 976–983, Prague, Czech ing parallel texts in diachronic investigations: Insights Republic. from a parallel corpus of Spanish medieval Bible trans- Saif Mohammad and Tony Yang. 2011. Tracking senti- lations. In Paul Durrell, Martin Scheible, Silke Whitt, ment in mail: How genders differ on emotional axes. and Richard J. Bennett, editors, New Methods in His- In Proceedings of the ACL Workshop on Computa- torical Corpora, pages 105–116. Narr. tional Approaches to Subjectivity and Sentiment Anal- Anna Feldman and Jirka Hana. 2010. A Resource-Light ysis (WASSA), pages 70–79, Portland, OR. Approach to Morpho-Syntactic Tagging. Rodopi. Saif M. Mohammad. 2012. From once upon a time Suzanne Fleischmann. 1995. The non-lyric texts. In to happily ever after: Tracking emotions in mail and F.R.P. Akehurst and Judith M. Davis, editors, A Hand- books. Decision Support Systems, 53:730–741. book of the Troubadours, pages 176–184. University Franco Moretti. 2005. Graphs, Maps, Trees: Abstract of California Press. Models for a Literary History. Verso. Virginia Francisco, Raquel Hervas,´ Federico Peinado, Franz Josef Och and Hermann Ney. 2000. A comparison and Pablo Gervas.´ 2011. Emotales: Creating a corpus of alignment models for statistical machine translation. of folk tales with emotional annotations. Language In Proceedings of the 18th Conference on Compu- Resources and Evaluation, 46:341–381. tational Linguistics, pages 1086–1090, Saarbrucken,¨ Martin Hilpert. 2011. Dynamic visualizations of lan- Germany. guage change: Motion charts on the basis of bivari- Daniela Oelke, Dimitrios Kokkinakis, and Mats Malm. ate and multivariate data from diachronic corpora. In- 2012. Advanced visual analytics methods for litera- ternational Journal of Corpus Linguistics, 16(4):435– ture analysis. Proceedings of the 6th EACL Workshop 461. on Language Technology for Cultural Heritage, Social Rebecca Hwa, Philip Resnik, Amy Weinberg, Clara Sciences, and Humanities, pages 35–44. Cabezas, and Okan Kolak. 2005. Bootstrapping Bo Pang and Lillian Lee. 2008. Opinion mining and parsers via syntactic projection across parallel texts. sentiment analysis. Foundations and Trends in Infor- Natural Language Engineering, 11(3):311–325. mation Retrieval, 2(1-2):1–135. Jungi Kim, Jin-Ji Li, and Jong-Hyeok Lee. 2010. Svetlana Petrova and Michael Solf. 2009. On the Evaluating multilanguage-comparability of subjectiv- methods of information-structuralanalysis in historical ity analysis systems. In Proceedings of the 48th An- texts: A case study on Old High German. In Roland nual Meeting of the ACL, pages 595–603, Uppsala, Hinterholzl¨ and Svetlana Petrova, editors, Informa- Sweden. tion structure and language change. New approaches Marijn Koolen, Frans Adriaans, Jaap Kamps, and to word order variation in Germanic, pages 121–160. Maarten de Rijke. 2006. A cross-language ap- Walter de Gruyter. proach to historic document retrieval. In M. Lalmas and et al., editors, Advances in Information Retrieval, Shaver Philip, Schwartz Judith, Kirson Donald, and pages 407–419. Springer. O’Connor Cary. 1987. Emotion knowledge: Further Sandra Kubler¨ and Heike Zinsmeister. 2014. Cor- exploration of a prototype approach. Journal of Per- pus Linguistics and Linguistically Annotated Corpora. sonality and Social Psychology, 52(6):1061–1086. Bloomsbury Academic. Robert Plutchik. 1980. A general psychoevolutionary Bing Liu. 2012. Sentiment Analysis and Opinion Min- theory of emotion. 
Emotion: Theory, research, and ing. Synthesis Lectures on Human Language Tech- experience, 1(3):3–33. nologies. Morgan & Claypool Publishers. R Development Core Team. 2007. A language and en- France Martineau, Paul Hirschbuhler,¨ Anthony Kroch, vironment for statistical computing. R Foundation for and Yves Charles Morin. 2010. Corpus MCVF Statistical Computing. (parsed corpus), modeliser´ le changement: les voies Luz Rello and Iustina Ilisei. 2009. Approach to the du franc¸ais. Department´ de Franc¸ais, University of Ot- identification of Spanish zero pronouns. In Student tawa. Research Workshop, RANLP, pages 60–65, Borovets, Anthony McEnery and Zhonghu Xiao. 2007. Paral- Bulgaria. lel and comparable corpora: What is happening? In Cristina Sanchez´ Lopez,´ To Appear. Sintaxis historica´ G. James and G. Anderman, editors, Incorporating de la lengua espanola.˜ Tercera parte, chapter Preposi- Corpora: Translation and the Linguist, Translating ciones, conjunciones y adverbios derivados de partici- Europe, pages 18–31. Multilingual Matters. pios. Fondo de Cultura Economica,´ Mexico.´ Rada Mihalcea, Carmen Banea, and Janyce Wiebe. 2007. Olga Scrivner and Sandra Kubler.¨ 2012. Building an Old Learning multilingual subjective language via cross- Occitan corpus via cross-language transfer. In Pro- lingual projections. In Proceedings of the 45th Annual ceedings of the First International Workshop on Lan-

10 guage Technology for Historical Text(s), pages 392– 400, Vienna, Austria. Olga Scrivner, Sandra Kubler,¨ Barbara Vance, and Eric Beuerlein. 2013. Le Roman de Flamenca : An an- notated corpus of old occitan. In Proceedings of the Third Workshop on Annotation of Corpora for Re- search in Humanities, pages 85–96, Sofia, Bulgaria. Benjamin Snyder and Regina Barzilay. 2008. Unsuper- vised multilingual learning for morphological segmen- tation. In Proceedings of ACL: HTL, Columbus, OH. Benjamin Snyder, Tahira Naseem, and Regina Barzilay. 2009. Unsupervised multilingual grammar induction. In Proceedings of the Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing of the Asian Federa- tion of Natural Language Processing (ACL-IJCNLP), Suntec, Singapore. To appear. Christine Storm and Tom Storm. 1987. A taxonomic study of the vocabulary of emotions. Journal of Per- sonality and Social Psychology, 53(4):805–816. Dan Tufis. 2007. Exploiting aligned parallel corpora in multilingual studies and applications. In Proceed- ings of the 1st International Conference on Intercul- tural Collaboration, pages 103–117, Kyoto, Japan. David Yarowsky and Grace Ngai. 2001. Inducing multi- lingual POS taggers and NP bracketers via robust pro- jection across aligned corpora. In Proceedings of the 2nd Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL), Pittsburgh, PA. David Yarowsky, Grace Ngai, and Richard Wicentowski. 2001. Inducing multilingual text analysis tools via ro- bust projection across aligned corpora. In Proceedings of HLT 2001, First International Conference on Hu- man Language Technology Research, pages 161–168, San Diego, CA. Amir Zeldes, J. Ritz, Anke Ludeling,¨ and Christian Chiarcos. 2009. ANNIS: A search tool for multi-layer annotated corpora. In Proceedings of Corpus Linguis- tics, Liverpool, UK. Amir Zeldes. 2007. Machine translation between lan- guage stages: Extracting historical grammar from a parallel diachronic corpus of Polish. In Proceedings of the Corpus Linguistics Conference (CL), Birming- ham, UK.

RhymeDesign: A Tool for Analyzing Sonic Devices in Poetry

Nina McCurdy, Vivek Srikumar, Miriah Meyer
School of Computing, University of Utah
{nina, svivek, miriah}@cs.utah.edu

Abstract

The analysis of sound and sonic devices in poetry is the focus of much poetic scholarship, and poetry scholars are becoming increasingly interested in the role that computation might play in their research. Since the nature of such sonic analysis is unique, the associated tasks are not supported by standard text analysis techniques. We introduce a formalism for analyzing sonic devices in poetry. In addition, we present RhymeDesign, an open-source implementation of our formalism, through which poets and poetry scholars can explore their individual notion of rhyme.

1 Introduction

While the digital humanities have experienced tremendous growth over the last decade (Gold, 2012), the true value of computation to poets and poetry scholars is still very much in question. The reasons for this are complex and multifaceted. We believe that common techniques for reasoning about text, such as topic modeling and analysis of word frequencies, are not directly applicable to poetry, partly due to the unique nature of poetry analysis. For example, sound is a great source of experimentation and creativity in writing and reading poetry. A poet may exploit a homograph to encode ambiguous meaning, or play with words that look like they should rhyme, but don't, in order to intentionally trip up or excite the reader. Discovering these sonic features is an integral part of a close reading, which is the deep and sustained analysis of a poem. While existing tools allow querying for sounds or text, close reading requires analyzing both the lexical and acoustic properties.

To investigate the influence that technology can have on the close reading of a poem we collaborated with several poetry scholars over the course of two years. This investigation focused on the computational analysis of complex sonic devices, the literary devices involving sound that are used to convey meaning or to influence the close reading experience. Our poetry collaborators identified a broad range of interesting sonic devices, many of which can be characterized as a type of traditional rhyme. To fully capture the range of sonic devices our collaborators described, we adopted a broader definition of rhyme. We found that this broader definition was not only able to capture known instances of sonic devices, but it also uncovered previously unknown instances in poems, providing rich, novel insights for our poetry collaborators.

In this paper, we present two contributions from our work on analyzing sound in poetry. The first is a formalism for analyzing a broad range of sonic devices in poetry. As part of the formalism we identify a language, built on top of regular expressions, for specifying these devices. This language is both highly expressive and designed for use by poets. The second contribution is an open-source implementation of this formalism, called RhymeDesign. RhymeDesign provides both a platform to test and extend the formalism, and a tool through which poets and poetry scholars can explore a broad range of complex sonic devices within a poem.

2 Background

Poetry may be written as formal or free verse: formal verse follows conventional patterns of end rhyme, meter, or some combination thereof, while free verse allows poets more flexibility to experiment with structural features, including variable line and stanza lengths. In poetry analysis, rhyming structure generally focuses on end rhymes, represented as AA BB CC, ABAB CDCD, and so on. Metrical poetry may or may not also incorporate rhyme; blank verse, for example, refers to unrhymed


Proceedings of the NAACL-HLT Fourth Workshop on Computational Linguistics for Literature, pages 12–22, Denver, Colorado, June 4, 2015. © 2015 Association for Computational Linguistics

iambic pentameter. In contrast to such established structures, the more open form of free verse places greater emphasis on sounds and rhythms of speech.

Whether working with sonnets or experimental avant-garde, our poetry collaborators consider a broad and expansive definition of rhyme. To them, the term rhyme encompasses all sound patterns and sound-related patterns. We classify these sonic patterns, which we define as instances of sonic devices, into four distinct types of rhyme: sonic rhyme involves the pronunciations of words; phonetic rhyme associates the articulatory properties of speech sound production, such as the location of the tongue in relation to the lips; visual rhyme relates words that look similar, such as cough and bough, whether or not they sound alike; and structural rhyme links words through their sequence of consonants and vowels. We describe these types of rhyme in more detail in Section 4. For the remainder of this paper, we use the term rhyme in reference to this broader definition.

3 Related Work

Rhyme has been a subject for literary criticism and especially the focus of attention by poets for hundreds of years, and relates to the broader tradition of analyzing and evaluating sound in poetry (Sidney, 1583; Shelley, 1821; Aristotle, 1961; Wesling, 1980; Howe, 1985). More recent literary criticism has tended to focus its attention elsewhere, leaving the discussion of rhyme in literary circles largely to poetry handbooks. Notable exceptions occur in relation to hip-hop poetry and nursery rhymes — perhaps a reflection of the tendency in high-literary circles to treat rhyme as a more simple device than our collaborators see it as being — although other writers share our interest in rhyme's complexities (McGuire, 1987; Stewart, 2009; Caplan, 2014).

Computational research analyzing sound in text stems from multiple fields, from digital humanities to computational linguistics. Our research is grounded in two sources of inquiry: sonic analysis specific to poetry and literature, and formalisms for describing sound. The latter problem of recognizing phonetic units of words is a well studied one; we refer the reader to (Jurafsky and Martin, 2008) for an overview.

A significant body of research, stemming from multiple fields, has been devoted to analyzing poetry. A number of tools and algorithms have been designed for teaching (Tucker, n.d.), analyzing (Plamondon, 2006; Kao and Jurafsky, 2012; Meneses et al., 2013), translating (Byrd and Chodorow, 1985; Genzel et al., 2010; Greene et al., 2010; Reddy and Knight, 2011), and generating (Manurung et al., 2000; Jiang and Zhou, 2010; Greene et al., 2010) poetry, all of which attend, to some degree, to sound and rhyme. While this work inspires our current research, it considers a much more limited, traditional definition of rhyme. As a result, these tools and algorithms disregard many of the sound-related patterns that we seek to reveal.

The growing body of research analyzing rhyme in hip hop and rap lyrics (Kawahara, 2007; Hirjee and Brown, 2009; Hirjee and Brown, 2010; Buda, 2004; Addanki and Wu, 2013; Wu et al., 2013b) considers a broader and more flexible definition of rhyme. Because these lyrics are meant primarily to be heard, the emphasis is placed on rhymes that occur in close proximity, as opposed to rhymes in poetry that can occur anywhere across a poem. Furthermore, rhyme analysis in hip hop and rap is purely sonic, and thus does not include visual rhyme.

Several visualization tools that support the close reading of poetry allow users to interactively explore individual sounds and sonic patterns within text, and consider a broader range of sonic devices (Smolinsky and Sokoloff, 2006; Chaturvedi et al., 2012; Clement, 2012; Abdul-Rahman et al., 2013). For example, PoemViewer (Abdul-Rahman et al., 2013) visualizes various types of sound patterning such as end rhyme, internal rhyme, assonance, consonance and alliteration, and also provides phonetic information across a poem. Enabling a somewhat deeper exploration, ProseVis (Clement, 2012) provides the complete information about the pronunciation of each word within a text and allows users to browse through visual encodings of different patterns related to the pronunciation information. While these tools capture and visualize low-level details about sound, our research goes a step further, building on the sonic information in a poem to detect and query complex sonic patterns.


Figure 1: (left) The rhyming object data structure, which decomposes a word into several levels of sonic attributes. The subtree to the right captures the various phonetic attributes of a phoneme. (right) Decomposition of the word troubling into a rhyming object. The phonetic transcription is encoded using the ARPABET.

Our work is most closely related to Pattern-Finder (Smolinsky and Sokoloff, 2006), which allows users to query patterns involving specific sounds (characterized by one or multiple sonic attributes) within a text. While our work supports this kind of single sound patterning, it further allows users to query complex combinations of both sounds and characters in specified contexts.

4 A Formalism for Analyzing Rhyme

Our formalism for detecting and querying rhyme within a poem is composed of three components: a representation of the sonic and textual structure of a poem; a mechanism for querying complex rhyme; and a query notation designed for poets. We expand on each of these below.

4.1 Rhyming Object Representation

To enable the detection of rhyme, we decompose each word in a poem into its constituent sonic and structural components. We call this decomposition a rhyming object, which includes two subrepresentations. The first is a phoneme transcription that captures one or more pronunciations of the word, and the second is a surface form defined by the word's string of characters. We illustrate the rhyming object representation in Figure 1, along with the decomposition of the word troubling into its phoneme transcription and surface form. The phoneme transcription is encoded using the ARPABET¹, one of several ASCII phonetic transcription codes. Our rhyme specification strategy, described in Section 4.3, exploits every level of the rhyming object.

¹The ARPABET was developed by the Advanced Research Projects Agency (ARPA). More information may be found at http://en.wikipedia.org/wiki/Arpabet (accessed 2/28/2015)

Each phoneme transcription is parsed into a sequence of syllables. Syllables are the basic organization of speech sounds and they play a critical role in defining rhyme. An important attribute of the syllable is its articulatory stress. In Figure 1 (right), the stress of each syllable, indicated as either 1 (stressed) or 0 (unstressed), is highlighted with a bounding box. Each syllable is also decomposed into its constituent onset, nucleus, and coda: the leading consonant sound(s), vowel sound, and trailing consonant sound(s), respectively. It is important to note that a syllable will always contain a nucleus, whereas the onset and coda are optional — the troubling example in Figure 1 illustrates these variations. The onset, nucleus, and coda are further decomposed into one or multiple phonemes. Phonemes are the basic linguistic units of speech sound and carry with them a number of attributes describing their physiological production (University of Iowa, 2011).

[Figure 2: The phoneme transcription of picky and tricky.]

4.2 The Algebra of Rhyme

A broad range of rhymes can be expressed as combinations of rhyming object components. Take for example the rhyme picky/tricky. Figure 2 shows the phoneme transcription of each word — we denote onset with O, nucleus with N, and coda with C. This phoneme transcription elucidates that the rhyming segment, icky, contains the stressed nucleus of the first syllable, combined with the onset and unstressed nucleus of the second syllable. We can mathematically express this rhyme as:

    N^{stressed}_{syll 1} + (O + N^{unstressed})_{syll 2}    (1)

Picky/tricky is an instance of a perfect feminine rhyme, which is a rhyme defined by an exact match in sound beginning at a stressed nucleus in the penultimate syllable. Equation 1 can also describe other perfect feminine rhymes like scuba/tuba. Neither of these examples, however, includes the optional coda in its syllables. If we generalize Equation 1 to include these codas, and specifically only consider the last two syllables of a word, we can describe all instances of perfect feminine rhyme, including complex, multisyllabic rhymes like synesthesia/amnesia/freesia:

    (N^{stressed} + C)_{penultimate syll} + (O + N^{unstressed} + C)_{last syll}    (2)

We use expressions like Equation 2 as a rule for defining and detecting instances of a specific rhyme type, in this case perfect feminine rhyme. Such rules, which we call rhyming templates, are akin to templates of regular expressions where each template denotes a set of regular expressions.

4.3 ASCII Notation

Table 1 presents our ASCII notation for specifying rhyming templates. Section A lists the general notation applicable to both sonic and textual rhymes. Note that the bracket [] is the fundamental notation for the templates and allows users to specify the rhyming segment as well as the context in which it appears. Section B lists the notation specific to sonic rhymes, including the symbol indicating a syllable break -, as well as support for phonetic rhymes. Section C lists the notation specific to visual and structural rhymes.

In designing this notation we attempted to balance the competing needs of expressivity versus usability. In particular, to make the notation usable by poets we: limit the symbols to as few as possible; borrow symbols from existing phonetic transcription alphabets, namely the International Phonetic Alphabet (IPA) (IPA, 1999) and the ARPABET; and avoid using symbols which may be overloaded within poetry scholarship. While we appreciate that our notation may cause confusion for regex users, we emphasize that our target users are poets.

Table 2 presents a list of predefined rhyme types deemed interesting by our poetry collaborators, transcribed into our notation. This table serves both as a reference for template building and as an illustration of the expressivity of our notation. Note the transcription of Equation 2 describing perfect feminine rhyme written more succinctly as ...-O[NC'-ONC].

We observed our poetry collaborators taking two different approaches when building new rhyming templates. In the first approach they would build a new template based on a generalized instance of a rhyme, analogous to our perfect feminine rhyme example in Section 4.2. The second approach we observed is more exploratory, where the poets would modify and expand a template based on iterative results. Our collaborators told us this approach felt more natural to them as it is similar to practices involved in close reading. We describe one poet's experience with this second approach in Section 6.2.
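To make the formalism concrete, the rhyming-object decomposition of Section 4.1 and a rule such as Equation 2 could be sketched in code roughly as follows. This is a minimal illustration under assumed names (Syllable, RhymingObject and perfect_feminine are ours, not RhymeDesign's), and real queries are written in the ASCII notation of Section 4.3 rather than hard-coded.

from dataclasses import dataclass
from typing import List, Tuple

# Minimal sketch of a rhyming object and one rhyming template
# (perfect feminine rhyme, Equation 2). Names are illustrative only.

@dataclass
class Syllable:
    onset: Tuple[str, ...]    # leading consonant phonemes (may be empty)
    nucleus: str              # vowel phoneme, e.g. "IH"
    coda: Tuple[str, ...]     # trailing consonant phonemes (may be empty)
    stressed: bool            # primary stress on this syllable

@dataclass
class RhymingObject:
    surface: str              # the word as written
    syllables: List[Syllable] # one pronunciation, already syllabified

def perfect_feminine(a: RhymingObject, b: RhymingObject) -> bool:
    """Equation 2: match on (stressed N + C) of the penultimate syllable
    plus the whole (O + N + C) of the final syllable."""
    if len(a.syllables) < 2 or len(b.syllables) < 2:
        return False
    pa, la = a.syllables[-2], a.syllables[-1]
    pb, lb = b.syllables[-2], b.syllables[-1]
    return (pa.stressed and pb.stressed
            and (pa.nucleus, pa.coda) == (pb.nucleus, pb.coda)
            and (la.onset, la.nucleus, la.coda) == (lb.onset, lb.nucleus, lb.coda))

# picky / tricky: P-IH1 / K-IY0 vs TR-IH1 / K-IY0 share the segment "icky"
picky  = RhymingObject("picky",  [Syllable(("P",),      "IH", (), True),
                                  Syllable(("K",),      "IY", (), False)])
tricky = RhymingObject("tricky", [Syllable(("T", "R"),  "IH", (), True),
                                  Syllable(("K",),      "IY", (), False)])
print(perfect_feminine(picky, tricky))   # True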
5 RhymeDesign

We implemented our formalism for analyzing rhyme in an open-source tool called RhymeDesign. RhymeDesign allows users to query for a broad range of rhyme types in a poem of their choosing by selecting from a set of prebuilt rhyming templates, or by building new templates using the ASCII rhyming notation. In this section we describe the major features of RhymeDesign, namely the decomposition of text into rhyming objects, the use of rhyming templates, and the user interface. RhymeDesign is freely available at RhymeDesign.org.

A. General Rhyme Notation
  [ ]       brackets indicate the matching portion of the rhyming pair (the rhyming segment)
  ...       indicates that additional syllables/characters may or may not exist
  &         distinguishes between the rhyming pair words (e.g. word1/word2)
  |         indicates the occurrence of "one or both"
  :         indicates word break (e.g. for cross-word rhymes)
  !         indicates no match (must be placed at beginning of rule)

B. Sonic and Phonetic Rhyme Notation
  O         Onset (leading consonant phonemes)
  N         Nucleus (vowel phoneme)
  C         Coda (ending consonant phonemes)
  C‘        Required coda
  O‘        Required onset
  -         Syllable break
  '         Primary stress
  ˆ         Stressed or unstressed
  O{mvp}    Match on onset manner/voice/place
  C{mvp}    Match on coda manner/voice/place
  N{p}      Match on nucleus place

C. Visual and Structural Rhyme Notation
  A         Vowel
  B         Consonant
  Y         Vowel or Consonant
  *         Mixed character clusters, e.g. "est/ets"
  char      (lowercase) specific character
  A'        First vowel
  B'        First consonant
  {s}       Match in structure, e.g. A{s} : A/O (vowel/vowel match)

Table 1: The ASCII rhyme notation: (A) general rhyme notation applicable to both sonic and visual rhymes; (B) notation specific to sonic and phonetic rhymes; and (C) notation specific to visual and structural rhymes.

Rhyme Type                  Transcribed Rule                                 Example
Identical Rhyme             [... - O N Cˆ - ...]                             spruce/spruce; bass/bass; pair/pare/pear
Perfect Masculine           ... - O [N C]'                                   rhyme/sublime
Perfect Feminine            ... - O [N C' - O N C]                           picky/tricky
Perfect Dactylic            ... - O [N C' - O N C - O N C]                   gravity/depravity
Semirhyme                   ... - O [N C]' & ... - O [N C]' - O N C          end/bending; end/defending
Syllabic Rhyme              ... - O [N C]' & ... - O [N C]                   wing/caring
Consonant Slant Rhyme       ... - O N [C]' - ...                             years/yours; ant/bent
Vowel Slant Rhyme           ... - O [N] C' - ...                             eyes/light
Pararhyme                   ... - [O‘] N [C‘]' - ...                         tell/tail/tall
Syllabic 2 Rhyme            O [N C]' - ONC - ...                             restless/westward
Alliteration                ... - [O‘] N C' - ...                            languid/lazy/line/along
Assonance                   ... - O [N] Cˆ - ...                             blue/estuaries
Consonance                  ... - [O‘] | [C‘]ˆ - ...                         shell/chiffon; shell/wash
Eye rhyme                   !O[NCˆ-...] and ...[A'...]                       cough/bough; daughter/laughter
Forced rhyme                ... - O[N C‘{mv}]' - ...                         one/thumb; shot/top/sock
Mixed 3-character cluster   ...[YYY]*...                                     restless/inlets
Structural rhyme            [B{s} A{s} B{s} B{s}]                            fend/last

Table 2: A range of example rhyme types represented using the ASCII rhyme notation.
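As an illustration of how the prebuilt templates of Table 2 might be exposed by a tool, they could simply be stored as named strings in the notation above. The dictionary below copies a few rules directly from the table; the storage scheme and function name are hypothetical, not RhymeDesign's internals.

# A few predefined rhyme types from Table 2, kept as template strings
# in the ASCII notation. Only the rules themselves come from the table.
PREBUILT_TEMPLATES = {
    "perfect masculine": "... - O [N C]'",
    "perfect feminine":  "... - O [N C' - O N C]",
    "pararhyme":         "... - [O‘] N [C‘]' - ...",
    "assonance":         "... - O [N] C^ - ...",
    "eye rhyme":         "!O[NC^-...] and ...[A'...]",
    "structural rhyme":  "[B{s} A{s} B{s} B{s}]",
}

def lookup(name: str) -> str:
    """Return the template string for a named rhyme type."""
    return PREBUILT_TEMPLATES[name.lower()]

print(lookup("Perfect Feminine"))   # ... - O [N C' - O N C]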

Figure 3: The RhymeDesign interface comprises two browser windows, the main RhymeDesign interface and a notation guide that provides a quick reference for rhyme specification. In the main interface, a user can upload a poem of his/her choosing, generate custom rhyming templates or choose from existing ones, and extract sets of rhyming words based on chosen templates. An option to Combine Templates allows users to query rhymes combining patterns in sounds and characters.

5.1 Text Decomposition

Obtaining the surface form of a word is straightforward, while producing the phoneme transcription is a more complicated task. Within the literature there are three approaches to completing this task: integrating external knowledge using a pronunciation dictionary, using natural language processing tools, or using some combination of the two. In RhymeDesign we use the hybrid approach.

For external knowledge we rely on the Carnegie Mellon University (CMU) pronunciation dictionary (CMU, 1998). The CMU dictionary provides the phoneme transcriptions of over 125K words in the North American English lexicon. Words are mapped to anywhere from one to three transcriptions, taking into account differing pronunciations as well as instances of homographs. Syllable boundaries are not provided in the original CMU transcriptions; however, the syllabified CMU dictionary (Bartlett et al., 2009) addresses this problem by training a classifier to identify syllable boundaries.

When relying on any dictionary there is a high likelihood, particularly in the domain of poetry, that a given text will have one or multiple out-of-dictionary words. To address this, we've integrated existing letter-to-sound (LTS) rules (Black et al., 1998) and syllable segmentation algorithms (Bartlett et al., 2009) to predict the phoneme transcriptions of out-of-dictionary words. Adapted from CMU's Festvox voice building tools (Black, n.d.), the LTS system was trained on the CMU dictionary in order to generate the most consistent results.

It is important to note that each of these phoneme transcription methods introduces a different element of uncertainty into our analysis. For in-dictionary words there is a possibility that the CMU dictionary will return the wrong homograph, while for out-of-dictionary words there is a chance the LTS system will simply predict a mispronunciation. Our poetry collaborators find this uncertainty incredibly interesting as it reveals possible sonic devices that experienced readers have been hard-wired to neglect. We therefore expose this uncertainty in RhymeDesign and allow users to address it as they wish.

To summarize the decomposition process, raw text is split into words by tokenizing on whitespace. For each word, we first check to see if it exists in our pronunciation dictionary. If it does, it is tagged as an in-dictionary word and its phoneme transcription(s) are retrieved directly. If not, it is tagged as an out-of-dictionary word and its phoneme transcription is predicted using our LTS methods. Transcriptions are then parsed down to the phoneme level, and a look-up table is used to retrieve phonetic properties.
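A minimal sketch of the dictionary-plus-LTS lookup described above, assuming NLTK and its copy of the CMU pronouncing dictionary are installed; the letter-to-sound fallback is a stub standing in for the Festvox-derived LTS rules, and syllabification is omitted.

# Sketch of the hybrid decomposition step: dictionary first, LTS fallback.
from nltk.corpus import cmudict

PRONUNCIATIONS = cmudict.dict()   # word -> list of ARPABET phoneme lists

def predict_with_lts(word):
    """Placeholder: a real system applies trained letter-to-sound rules here."""
    raise NotImplementedError

def transcribe(word):
    """Return (tag, phoneme transcriptions) for one token."""
    key = word.lower()
    if key in PRONUNCIATIONS:
        return "in-dictionary", PRONUNCIATIONS[key]        # 1-3 variants
    return "out-of-dictionary", [predict_with_lts(key)]    # LTS fallback

def decompose(text):
    """Tokenize on whitespace and transcribe every word."""
    return {tok: transcribe(tok) for tok in text.split()}

# e.g. decompose("Each throat was parched")["throat"]
#  -> ('in-dictionary', [['TH', 'R', 'OW1', 'T']])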

5.2 Rhyme Detection

Given a rhyming template, our detection engine iterates through every pair of words in the poem, extracting all possible rhyme segments for each word, and comparing them to find all instances of rhyme. Each new instance of rhyme is then either added to an existing rhyme set or initiates a new rhyme set.

Our detection engine is similar to a typical regex engine, but with a few important differences. First, our engine performs on a pairwise basis and attempts to establish a match based on a generic template. The process of extracting rhyme segments is also similar to that of regex engines, in which the engine marches through both the expression and the subject string, advancing only if a match is found on the current token. However, for expressions with multiple permutations, rather than only returning the leftmost match, as is the case for regex engines, our engine returns all matches to ensure that all existing patterns are revealed.
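The pairwise detection loop just described might be sketched as follows; the match function here is a toy stand-in for the real template engine, which compares extracted rhyme segments rather than raw characters.

# Sketch of pairwise detection: matching pairs are merged into rhyme sets.
from itertools import combinations

def detect(words, match):
    """`match(a, b)` plays the role of one rhyming template."""
    sets = []
    for a, b in combinations(words, 2):
        if not match(a, b):
            continue
        home = next((s for s in sets if a in s or b in s), None)
        if home is None:
            sets.append({a, b})          # a new instance starts a new rhyme set
        else:
            home.update((a, b))          # otherwise it joins an existing set
    return sets

# Toy template: shared final two letters (a crude stand-in for ...-O[NC]').
ends_alike = lambda a, b: a[-2:] == b[-2:]
print(detect(["picky", "tricky", "heart", "art"], ends_alike))
# -> [{'picky', 'tricky'}, {'heart', 'art'}]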
5.3 User Interface

As shown in Figure 3, RhymeDesign is composed of two browser windows, the main RhymeDesign interface and a notation guide that provides a quick reference for rhyme specification. In the main interface a user can upload a poem of his/her choosing, generate novel rhyming templates or choose from existing ones, and extract sets of rhyming words based on chosen templates. An option to Combine Templates allows users to query rhymes combining patterns in sounds and characters. The resulting sets of words are specified in a results file, organized by rhyming template. This file also includes alternative pronunciation options for in-dictionary words, and the predicted pronunciation for out-of-dictionary words. Pronunciation modifications are made in an uncertainty file, where users may specify alternative pronunciations for in-dictionary words, or enter custom pronunciations for both in- and out-of-dictionary words. For more details on the use of RhymeDesign and the formats of the resulting files, please see the user documentation at RhymeDesign.org.

6 Validation

Our validation takes two forms: the first is an experiment that tests our formalism and the expressivity of our rhyming language; the second is a qualitative evaluation of RhymeDesign which includes a description of how two of our collaborators used RhymeDesign in a close reading.

6.1 Formalism

To validate our formalism we designed an experiment to test the expressivity of the rhyming language. For this experiment we requested examples of interesting rhyme types from our collaborators, along with a brief description of the rhyming characteristics. For each example we asked for two different instances of the same rhyme type. One member of the research team then manually composed a rhyming template for each of the examples based on the example's description and first instance (the prototype). The rhyming templates were then run against the second instances (the test prototypes). This allowed us to check that the new rhyming templates were in fact detecting their associated second instances, and that any other instances detected were indeed suitable.

Gathering the examples turned out to be more challenging than we anticipated. We soon realized that coming up with new rhyme types was a very involved research task for our poetry collaborators. The end result was a smaller set of 17 examples than our initial goal of 20-30. Some of the examples we received were incomplete, requiring us to iterate with our collaborators on the descriptions; generate second instances ourselves (by someone other than the template builder); or in some cases to proceed with only one instance. Our results from this experiment are summarized in Table 3, however we highlight our favorite example here, the cross-word anagram Britney Spears/Presbyterians, which can be represented using the rhyming template [...Y... : ...Y...]* & [...Y...].

While we were able to express the majority (14/17) of examples, we found at least one opportunity to improve our notation in a way that allowed us to express certain rhyme types more succinctly.

     Prototype                        Description                                        Template                                                        T.P.
 1.  blessed/blast                    pararhyme                                          ...-[O‘]N[C‘]'-...                                              Y
 2.  orchard/tortured                 ear rhyme                                          ...-O[NC'-...] and !...[A'...]                                  Y
 3.  bard/drab/brad                   anagram                                            [...Y...]*                                                      Y
 4.  Britney Spears/presbyterians     cross word anagram                                 [...Y... : ...Y...]* & [...Y...]*                               Y
 5.  paws/players                     character pararhyme                                [Y]...[Y]                                                       Y
 6.  span/his pan                     cross word rhyme                                   [...Y...] & ...[Y]:[...Y...]                                    Y
 7.  ethereal/blow                    flipped vowel+L/L+vowel                            ...[Al]*...                                                     Y
 8.  brittle/fumble                   match on final coda and character cluster          ...-ON[C]ˆ and ...[..YY]                                        Y
 9.  soul/on                          near assonance (shared place on nucleus)           ...-O[N{p}]C'-... and !...-O[N]C'-...                           Y
10.  dress/wantonness                 perfect rhyme, mono/polysyllable words             O[NC]' & ...-ONCˆ-O[NC]ˆ                                        Y
11.  stone/home/unknown               matching vowel, final consonants differ by         ...-O[NC{mv}]ˆ                                                  N
                                      one phonetic attribute
12.  paws/still                       last character of first word matches with          ...[Y]&[Y]...                                                   Y
                                      first character of second word
13.  blushing crow/crushing blow      spoonerism                                         [O‘]NCˆ-...:ONCˆ-... & ONCˆ-...:[O‘]NCˆ-... and                 Y (1:2)
                                                                                         ONCˆ-...:[O‘]NCˆ-... & [O‘]NCˆ-...:ONCˆ-... and
                                                                                         O[NCˆ-...]:O[NCˆ-...]
14.  nevermore/raven                  reversed consonant sounds                          [B]ABAB... & BABA[B]... and BABA[B]... & [B]ABAB...             NA
                                                                                         and BA[B]AB...
15.  separation/kevin bacon           match in all vowel sounds
16.  crystal text/tristal crest       chiastic rhyme
17.  whack-a-mole/guacamole           hybrid ear-eye rhyme

Table 3: The results of our expressivity experiment. Examples 1-14 could be expressed using our language. Of the 3 that we failed to express (15-17), 2 of them (16, 17) could not be precisely defined by our collaborators. Y/N in the rightmost column indicates whether test prototypes were detected. We note that the test prototype for example 11 did not follow the same pattern as the original example prototype. We have since modified our notation to express examples 13 and 14 more succinctly.

Of the 3 examples that we failed to express using our language (items 15-17 in Table 3), 2 of them (16, 17) could not be precisely defined by our collaborators. Incidentally, a few other instances of indefinable rhyme were encountered earlier in the example collection process. The question of how to capture patterns that cannot be precisely defined is something that we are very interested in exploring in our future work.

Of the examples that we were able to express, all but two of the test prototypes were detected. One example provided two test prototypes, of which only one was detected, and the other test prototype that we failed to detect did not follow the same pattern as the original example prototype.

6.2 RhymeDesign

Validation for RhymeDesign came in the form of informal user feedback. We conducted interviews with two of our poetry collaborators, each of whom was asked to bring a poem with interesting sonic patterns. Interviews began with an introduction to the formalism, followed by a tour of the RhymeDesign interface. Each poet was then given time to experiment with querying different kinds of rhyme.

We conducted the first interview with a poet who brought the poem "Night" by Louise Bogan. Taking an exploratory approach to template building, she began by querying rhymes involving a vowel followed by two consonants ...[ABB].... This turned up several different rhyming sets, one of which connected the words partial and heart via the art character cluster. Shifting her attention from art to ear in heart, she then queried rhymes involving mixed ear clusters ...[ear]*.... This revealed a new pattern connecting heart with breathes and clear. This is illustrated in Figure 4. She told us that she suspected she would have connected clear and heart on her own, but she said she thought it unlikely she would have noticed the words' shared link with breathes.

This connection, and especially the fact that it was found by way of the ear cluster, was thrilling to this poet, as it prompted her to reconsider roles and interrelations of the ear, heart, and breath in the context of writing poetry as set forth in Charles Olson's seminal 1950 poetics essay, "Projective Verse" — she is pursuing the ramifications of this potential theoretical reframing in ongoing research. In reflection, she commented that this exploratory approach was very similar to how close readings were conducted, and that RhymeDesign naturally facilitated it.

[Figure 4: Excerpt from the poem "Night" by Louise Bogan, with the detected ...[ear]*... rhyming set shown in bold. Ellipses indicate skipped lines.]

We conducted the second interview with a poet who brought "Platin" by Peter Inman, which is a poem composed almost entirely of non-words. Using RhymeDesign she was able to query a range of patterns involving character clusters as well as different kinds of structural rhymes. Exploring sonic patterns proved to be very interesting as well. Given a non-word, it is the natural tendency of a reader to predict its pronunciation. This is similar to the prediction made by the automated LTS system. Comparing the predicted pronunciations of the reader with that of the computer revealed new paths of exploration and potential sources of experimentation on the part of Peter Inman. This poet commented that using RhymeDesign was the perfect way to research language poetry, and that it was a great way to gain entrance into a complicated and possibly intimidating text. Furthermore, she noted that "using RhymeDesign has had the delightful side effect of deepening my understanding of language structures, sound, rhyme, rhythm, and overall facture of words both written and spoken...which can only make me a better, more sophisticated poet". Finally, she said that RhymeDesign affirmed previous observations made in her own close readings of the same poem.

7 Conclusions & Future Work

This paper presents two contributions. The first is a formalism for analyzing sonic devices in poetry. As part of this formalism we identify a language for specifying new types of complex rhymes, and we design a corresponding ASCII notation for poets. While both our language and notation may not be complete, we present a first iteration which we will continue to develop and improve based on extensive user feedback. Our second contribution is an open-source implementation of the formalism called RhymeDesign. We validated both the formalism and RhymeDesign with our poetry collaborators.

One of the biggest challenges that we encounter in this research stems from our grapheme-to-phoneme (g2p) conversion. Our use of the CMU pronunciation dictionary restricts our analysis to one very specific lexicon and dialect. This restriction is problematic for poetry, where pronunciations span a vast number of dialects, both geographically and temporally, and where reaching across dialects can be a sonic device within itself. While automated g2p techniques have come a long way, we suspect that even an ideal g2p converter would fail to support the complex tasks outlined in this paper. One interesting approach that we would like to explore would be to integrate speech and phone recognition software, thereby allowing a user to perform the sonic analysis based on his/her own reading of the poem.

Other future work we plan to explore includes the automation of rhyme template generation using machine learning techniques. This could allow users to select words sharing some sonic resemblance, and extract additional sets of words connected through similar sonic themes. We will also work towards building a visualization tool on top of RhymeDesign that would permit users to explore the interaction of sonic devices in the space of the poem.

Acknowledgments

We are deeply grateful for the participation of our poetry collaborators, Professor Katherine Coles and Dr. Julie Lein, throughout this project. Thanks to Jules Penham who helped us test the RhymeDesign interface and gathered test data for the evaluation. This work was funded in part by NSF grant IIS-1350896 and NEH grant HD-229002.

20 References Dmitriy Genzel and Jakob Uszkoreit and Franz Och. 2010. “Poetic” Statistical Machine Translation: Alfie Abdul-Rahman, Julie Lein, Katharine Coles, Ea- Rhyme and Meter. Proceedings of the 2010 Confer- monn Maguire, Miriah Meyer, and Martin Wynne, ence on Empirical Methods in Natural Language Pro- Chris Johnson, Anne E. Trefethen, Min Chen. 2013. cessing, pages 158-166. Rule-based Visual Mappings - with a Case Study on Poetry Visualization. In Computer Graphics Forum, Matthew K. Gold, ed. 2012 Debates in the Digital Hu- 32(3):381-390. manities. Minneapolis, MN, USA: University of Min- Aristotle 1961 Poetics. Trans. S. H. Butcher. Hill and nesota Press. Retrieved from http://www.ebrary.com Wang. pages 95-104. Erica Greene , Tugba Bodrumlu and Kevin Knight. 2010. Karteek Addanki and Dekai Wu. 2013. Unsupervised Automatic Analysis of Rhythmic Poetry with Applica- Rhyme Scheme Identification in Hip Hop Lyrics us- tions to Generation and Translation. Proceedings of ing Hidden Markov Models. Proceedings of the 1st the 2010 Conference on Empirical Methods in Natural International Conference on Statistical Language and Language Processing, pages 524533 Speech Processing (SLSP - 2013), Tarragona, Spain. Hussein Hirjee and Daniel Brown. 2010. Using auto- Susan Bartlett, Grzegorz Kondrak, and Colin Cherry. mated rhyme detection to characterize rhyming style 2009. On the syllabification of phonemes. In Pro- in rap music. Empirical Musicology Review. ceedings of Human Language Technologies: The 2009 Hussein Hirjee and Daniel Brown. 2009. Automatic Annual Conference of the North American Chap- Detection of Internal and Imperfect Rhymes in Rap ter of the Association for Computational Linguistics Lyrics. In Proceedings of the 10th International Soci- (NAACL ’09). Association for Computational Lin- ety for Music Information Retrieval Conference. pages guistics, Stroudsburg, PA, USA, 308316. 711-716. Alan W. Black, Kevin Lenzo and Vincent Pagel. 1998. Susan Howe. 1985. My Emily Dickinson. New Direc- Issues in building general letter to sound rules. Inter- tions. national Speech Communication Association. ESCA Workshop in Speech Synthesis, Australia. pages 7780. International Phonetic Association. 1999. Handbook of Bradley Buda. 2004. A System for the Automatic Identifi- the International Phonetic Association: A Guide to cation of Rhymes in English Text. University of Michi- the Use of the International Phonetic Alphabet. Cam- gan. bridge University Press. Roy J. Byrd and Martin S. Chodorow. 1985. Us- Long Jiang and Ming Zhou. 2010. Generating Chinese ing an on-line dictionary to find rhyming words Couplets Using a Statistical MT Approach. Proceed- and pronunciations for unknown words. In Pro- ings of the 22nd International Conference on Compu- ceedings of the 23rd annual meeting on Associa- tational Linguistics, COLING 2008, vol. 1, pages 377- tion for Computational Linguistics (ACL ’85). As- 384. sociation for Computational Linguistics, Stroudsburg, Daniel Jurafsky and James H. Martin. 2008 Speech PA, USA, 277-283. DOI=10.3115/981210.981244 and language processing: An introduction to speech http://dx.doi.org/10.3115/981210.981244 recognition. Computational Linguistics and Natural Manish Chaturvedi, Gerald Gannod, Laura Mandell, He- Language Processing. 2nd Edn. Prentice Hall, ISBN, len Armstrong, Eric Hodgson. 2012. Rhyme’s Chal- 10(0131873210), 794-800. lenge: Hip Hop, Poetry, and Contemporary Rhyming Justine Kao and Dan Jurafsky. 2012. A computational Culture. 
Oxford University Press, 2014 - Literary Crit- analysis of style, affect, and imagery in contemporary icism - 178 pages. poetry. Proceedings of NAACL 2012 Workshop on Manish Chaturvedi, Gerald Gannod, Laura Mandell, He- Computational Linguistics for Literature. len Armstrong, Eric Hodgson. 2012. Myopia: A Vi- Shigeto Kawahara. 2007. Half rhymes in Japanese rap sualization Tool in Support of Close Reading. Digital lyrics and knowledge of similarity Journal of East Humanities 2012. Asian Linguistics, 16(2), pages 113-144. Tanya Clement. 2012. Distant Listening or Playing Visu- alizations Pleasantly with the Eyes and Ears. Digital Hisar M Manurung, Graeme Ritchie, and Henry Thomp- Studies / Le champ numrique. 3.2. son. 2000. Towards A Computational Model of Poetry CMU. 1998. Carnegie Mellon Pronouncing Dictionary. Generation. In Proceedings of AISB Symposium on Carnegie MellonUniversity: http://www. speech. cs. Creative and Cultural Aspects and Applications of AI cmu. edu/cgi-bin/cmudict. and Cognitive Science. pages 7986. Alan W. Black n.d. Carnegie Mellon Pronounc- Philip C. McGuire. 2006 “Shakespeare’s Non- ing Dictionary. [computer software] available from Shakespearean Sonnets.” Shakespeare Quarterly. http://www.festvox.org 38:3, pages 304-319.

21 Luis Meneses, Richard Furuta, Laura Mandell. 2006. Ambiances: A Framework to Write and Visu- alize Poetry. Digital Humanities 2013: URL: http://dh2013.unl.edu/abstracts/ab-365.html Marc R. Plamondon. 2006 Virtual verse analysis: Analysing patterns in poetry. Literary and Linguis- tic Computing 21, suppl 1 (2006), 127–141. 2 Percy Bysshe Shelley. 1821. A Defence of Poetry. The Poetry Foundation. URL: http://www.poetryfoundation.org/learning/poetics- essay/237844 Sravana Reddy and Kevin Knight. 2011. Unsupervised Discovery of Rhyme Schemes. Proceedings of the 49th Annual Meeting of the Association for Compu- tational Linguistics: Human Language Technologies. pages 7782. Sir Philip Sidney. 1583. The Defence of Poesy. The Poetry Foundation. URL: http://www.poetryfoundation.org/learning/poetics- essay/237818 Stephanie Smolinsky and Constantine Sokoloff. 2006. Introducing the Pattern-Finder Conference abstract: Digital Humanities 2006. Susan Stewart. 2009. “Rhyme and Freedom.” The Sound of Poetry / The Poetry of Sound. Ed. Marjorie Perloff and Craig Dworkin. University of Chicago Press, pages 29-48. Herbert Tucker. n.d. For Better For Verse University of Virginia, Department of English. The University of Iowa. 2011. Phonetics: The Sounds of English and Spanish - The University of Iowa.” Phonetics: The Sounds of English and Spanish. The University of Iowa. N.p., n.d. Web. 22 Nov. 2013. http://www.uiowa.edu/ acadtech/phonetics/# Donald Wesling 1980. The Chances of Rhyme: Device and Modernity. Berkeley: University of California Press, c1980 1980. http://ark.cdlib.org/ark:/13030/ft0f59n71x/. Dekai Wu and Karteek Addanki. 2013. Modeling hip hop challenge-response lyrics as machine translation. 4th Machine Translation Summit (MT Summit XIV).

22 Rhetorical Figure Detection: the Case of Chiasmus

Marie Dubremetz Joakim Nivre Uppsala University Uppsala University Dept. of Linguistics and Philology Dept. of Linguistics and Philology Uppsala, Sweden Uppsala, Sweden [email protected] [email protected]

Abstract cluding stylistics, pragmatics, and senti- ment). While computational approaches We propose an approach to detecting the to language have occasionally deployed rhetorical figure called chiasmus, which in- volves the repetition of a pair of words in re- the word ‘rhetoric’, even in quite cen- verse order, as in “all for one, one for all”. tral ways (such as Mann and Thompsons Although repetitions of words are common in Rhetorical Structure Theory (1988)), the natural language, true instances of chiasmus deep resources of the millenia-long re- are rare, and the question is therefore whether search tradition of rhetoric have only been a computer can effectively distinguish a chias- tapped to a vanishingly small degree. mus from a random criss-cross pattern. We argue that chiasmus should be treated as a Even though the situation has improved slightly dur- graded phenomenon, which leads to the de- ing the last years (see Section 2), the gap underlined sign of an engine that extracts all criss-cross by Harris and DiMarco in the treatment of tradi- patterns and ranks them on a scale from pro- tional rhetoric is still important. This study is a con- totypical chiasmi to less and less likely in- stances. Using an evaluation inspired by in- tribution aimed at filling this gap. We will focus on formation retrieval, we demonstrate that our the task of automatically identifying a rhetorical de- system achieves an average precision of 61%. vice already studied in the first century before Christ As a by-product of the evaluation we also con- by Quintilian (Greene, 2012, art. antimetabole), but struct the first annotated corpus of chiasmi. rarely studied in computational linguistics: the chi- asmus. 1 Introduction Chiasmi are a family of figures that consist in re- peating linguistic elements in reverse order. It is Natural language processing (NLP) automates dif- named by the classics after the Greek letter χ be- ferent tasks: translation, information retrieval, genre cause of the cross this letter symbolises (see Fig- classification. Today, these technologies definitely ure 1). If the name ‘chiasmus’ seems specific to provide valuable assistance for humans even if they rhetorical studies, the figure in itself is known to ev- are not perfect. But the automatic tools become in- erybody through proverbs like (1) or quotations like appropriate to use when we need to generate, trans- (2). late or evaluate texts with stylistic quality, such as great political discourse, novels, or pleadings. In- (1) Live not to eat, but eat to live. deed, one is reluctant to trust computer assistance when it comes to judging the rhetoric of a text. As (2) Ask not what your country can do for you; ask expressed by Harris and DiMarco (2009): what you can do for your country. Too much attention has been placed on One can talk about chiasmus of letters, sounds, semantics at the expense of rhetoric (in- concepts or even syntactic structures as in (3).

23

Proceedings of NAACL-HLT Fourth Workshop on Computational Linguistics for Literature, pages 23–31, Denver, Colorado, June 4, 2015. c 2015 Association for Computational Linguistics

Indeed, despite the fact that this book comes from a politician recognized for his high rhetorical abil- ities and despite the length of the text (more than one hundred thousand words) we could find only one chiasmus:

(8) Ambition stirs imagination nearly as much as imagination excites ambition. Figure 1: Schema of a chiasmus. Such rareness is a challenge for our discipline. NLP is accustomed to treating common linguistic phe- (3) Each throat was parched, and glazed each nomena (multiword expressions, anaphora, named eye. entities), for which statistical models work well.We will see that chiasmus is a needle in the haystack Strictly speaking, we will only be concerned with problem. For the case of chiasmus, we have a one type of chiasmus: the chiasmus of identical double-fold challenge: we must not only perform words (also called antimetabole) or of identical lem- well at classifying the majority of wrong instances mas, as exemplified in (4) and (5), respectively. but above all perform well in finding the infrequent phenomenon. (4) A comedian does funny things, This paper will promote a new approach to chias- a good comedian does things funny. mus detection that takes chiasmus as a graded phe- nomenon. According to our evaluation, inspired (5) A wit with dunces and a dunce with wits from information retrieval methodology, our system From now on, for the sake of simplicity, we will re- gets up to 61% average precision. At the end of this strict the term ‘chiasmus’ to exclusively chiasmus of paper the reader will have a list of precise features words that share identity of lemma. that can be used in order to rank chiasmus. As a We see different reasons why NLP should pay at- side effect we provide a partially annotated tuning tention to chiasmi. First, it may be useful for a lit- set with 1200 extracts manually annotated as true or erary analysis: tools for supporting studies of liter- false chiasmus instances. ature exist but mostly belong to textometry. Thus, they mainly identify word frequency and some syn- 2 Related Work tactic patterns, not figures of speech. Second, the The identification of chiasmus is not very common chiasmus has interesting linguistic properties: it ex- in computational linguistics, although it has some- presses semantic inversion thanks to syntax inver- times been included in the task of detecting figure of sion, as in (2) above. Horvei (1985, p.49) points out repetition (Gawryjolek, 2009; Harris and DiMarco, that chiasmus can be used to emphasize antagonism 2009; Hromada, 2011). Gawryjolek (2009) pro- as in (6) as well as reciprocity as in (7). poses to extract every pair of words repeated in re- verse order in the text, but this method quickly be- (6) Portugal has no rivers that flow into Spain, but comes impractical with big corpora. When we try it Spain does have rivers that flow into Portugal. on a book (River War by Churchill, 130,000 words) (7) Hippocrates said that food should be our it outputs 66,000 inversions: as we shall see later, we medicine and medicine our food. have strong reason to believe that only one of them is a real chiasmus. To see whether we can detect such interplay of se- At the opposite end of the spectrum, Hro- mantics and syntax is an interesting test case for mada (2011) makes a highly precise detection by de- computational linguistics. tecting three pairs of words repeated in reverse order Chiasmus is extremely rare. To see this, one can without any variation in the intervening material as read a book like River War by Winston Churchill. illustrated in (9), but thus he limits the variety of chi-

24 asmus pattern that can be found and limits the recall are obvious for every reader, some are not. For in- (Dubremetz, 2013). stance, it is easy to say that there is chiasmus when the figure constitutes all the sentence and shows a (9) You don’t need [intelligence] [to have] [luck], perfect symmetry in the syntax: but you do need [luck] [to have] [intelligence]. (12) Chuck Norris does not fear death, death fears In our example, River War, the only chiasmus of Chuck Norris. the book (Sentence (8)) does not follow this pat- But how about cases where the chiasmus is just one tern of three identical words. Thus, with Hromada’s clause in a sentence or is not as symmetric as our algorithm we obtain no false positive but no true canonical examples: positive either: we have got an empty output. Fi- nally, Dubremetz (2013) proposes an algorithm in (13) We are all European citizens and we are all between, based on a stopword filter and the occur- citizens of a European Union which is under- rence of punctuation, but this algorithm still has low pinned. precision. With this method we found one true in- The existence of borderline cases indicates the stance in River War but only after the manual anno- need for a detector that does not only eliminate the tation of 300 instances. Thus the question is: can most flagrant false positives, but above all estab- we build a system for chiasmus detection that has a lishes a ranking from prototypical chiasmi to less better trade-off between recall and precision? and less likely instances. In this way, the tool be- comes a real help for the human because it selects 3 A New Approach to Chiasmus Detection interesting cases of repetitions and leaves it to the To make progress, we need to move beyond the bi- human to evaluate unclear cases. nary definition of chiasmus as simply a pair of in- A serious bottleneck in the creation of a system verted words, which is oversimplistic. The repeti- for ranking chiasmus candidates is the lack of an- tion of words is an extremely common phenomenon. notated data. Chiasmus is not a frequent rhetorical Defining a figure of speech by just the position of figure. Such rareness is the first reason why there word repetitions is not enough (Gawryjolek, 2009; is no huge corpus of annotated chiasmi. It would re- Dubremetz, 2013). To become a real rhetorical quire annotators to read millions of words in order to device, the repetition of words must be “a use of arrive at a decent sample. Another difficulty comes language that creates a literary effect”.1 This ele- from the requirement for qualified annotators when ment of the definition requires us to distinguish be- it comes to literature-specific devices. Hammond et tween the false positives, or accidental inversions al. (2013), for example, do not rely on crowd sourc- of words, and the (true) chiasmi, that is, when the ing when it comes to annotating changing voices. inversion of words explicitly provokes a figure of They use first year literature students. Annotating speech. Sentence (10) is an example of false posi- chiasmi is likely to require the same kind of annota- tive (here with ‘the’ and ‘application’). It contrasts tors which are not the most easy to hire on crowd- with Sentence (11) which is a true positive. sourcing platforms. 
If we want to create annotations we have to face the following problem: most of the (10) My government respects the application of inversions made in natural language are not chiasmi the European directive and the application of (see the example of River War, Section 2). Thus, the the 35-hour law. annotation of every inversion in a corpus would be a massive, repetitive and expensive task for human (11) One for all, all for one. annotators. This also explains why there is no large scale eval- However, the distinction between chiasmus and ac- uation of the performance of the chiasmus detectors cidental inversion is not trivial to draw. Some cases through literature. To overcome this problem we 1Definition of ‘rhetorical device’ given by Princeton word- can seek inspiration from another field of compu- net: https://wordnet.princeton.edu/ tational linguistics: information retrieval targeted at

the world wide web, because the web cannot be fully annotated and a very small percentage of the web pages is relevant to a given request. As described already fifteen years ago by Chowdhury (1999, p.213), in such a situation calculating the absolute recall is impossible. However, we can get a rough estimate of the recall by comparing different search engines. For instance Clarke and Willett (1997, p.186), working with Altavista, Lycos and Excite, made a pool of relevant documents for a particular query by merging the outputs of the three engines. We will base our evaluation system on the same principle: through our experiments our different "chiasmus retrieval engines" will return different hits. We annotate manually the top hundreds of those hits and obtain a pool of relevant (and irrelevant) inversions. In this way both precision and recall can be estimated without a massive annotation effort. In addition, we will produce a corpus of annotated (true and false) instances that can later be used as training data.

The definition of chiasmus (any pair of inversion that provokes literary effect) is rather floating (Rabatel, 2008, p.21). No clear discriminative test has been stated by linguists. This forces us to base our annotation on human intuition. However, we keep transparency of our annotation and evaluation by publishing all the chiasmi we consider positive as well as samples of false and borderline cases (see Appendix).

4 A Ranking Model for Chiasmus

A mathematical model should be defined for our ranking system. The way we compute the chiasmus score is the following. We define a standard linear model by the function:

    f(r) = Σ_{i=1}^{n} x_i · w_i

where r is a string containing a pair of inverted words, x_i is a set of feature values, and w_i is the weight associated with each feature. Given two inversions r1 and r2, f(r1) > f(r2) means that the inversion r1 is more likely to be a chiasmus than r2.

4.1 Features

We experiment with four different categories of features that can be easily encoded. Rabatel (2008), Horvei (1985), García-Page (1991), Nordahl (1971), and Diderot and D'Alembert (1782) deliver examples of canonical chiasmi as well as useful observations. Our features are inspired by them.

1. Basic features: stopwords and punctuation
2. Size-related features
3. N-gram features
4. Lexical clues

We group in a first category what has been tested in previous research (Dubremetz, 2013). They are indeed expected and not hard to motivate. For instance, following the Zipf law, we can expect that most of the false positives are caused by the grammatical words ('a', 'the', etc.), which justifies the use of a stopword list. As well, even if nothing forbids an author to make chiasmi that cross sentences, we hypothesize that punctuation, parentheses, and quotation marks are definitely a clue that characterises some false positives, for instance in the false positive (14).

(14) They could use the format: 'Such-and-such assurance, reliable', so that the citizens will know that the assurance undertaking uses its funds well.

We see in (14) that the inversion 'use/assurance' is interrupted by quotation marks, commas and colon.

The second feature type is the one related to size or the number of words. We can expect that a too long distance between main terms or a huge size difference between clauses is an easy-to-compute false positive characteristic as in this false positive:

(15) It is strange that other committees can apparently acquire secretariats and well-equipped secretariats with many staff, but that this is not possible for the Women's Committee.

In (15), indeed, the too long distance between 'secretariats' and 'Committee' in the second clause breaks the axial symmetry prescribed by Morier (1961, p.113).

The third category of features follows the intuition of Hromada (2011) when he looks for three pairs of

26 inverted words instead of two: we will not just check 5.2 Implementation if there are two pairs of similar words but as well if Our program3 takes as an input a text file and gives some other words are repeated. As we see in (16), as output the chiasmus candidates with a score that similar contexts are a characteristic pattern of chi- allows us to rank them: higher scores at the top, asmi (Fontanier, 1827, p.381). We will evaluate the lower scores at the bottom. To do so, it tokenises similarity by looking at the overlapping N-grams. and lemmatises the text with TreeTagger (Schmid, 1994). Once this basic processing is done it extracts (16) In peace, sons bury their fathers; in war, fa- every pair of repeated and inverted lemmas within a thers bury their sons. window of 30 tokens and passes them to the scoring function. The last group consists of the lexical clues. These In our experiments we implemented twenty fea- are language specific and, in contrast to stopwords in tures. They are grouped into the four categories de- basic features, this is a category of positive features. scribed in Section 4.1. We present all the features We can expect from this category to encode features and weights in Table 1 using the notation from Fig- like conjunction (Rabatel, 2008, p.28) because they ure 2. underline the axial symmetry (Morier, 1961) as in (17). 5.3 Evaluation

(17) Evidence of absence or absence of evidence? In order to evaluate our features we perform four ex- periments. We start with only the basic features. We capture negations as well. Indeed, Fontanier Then, for each new experiment, we add the next (1827, p.160) stresses that prototypical chiasmi are group of features in the same order as in Table 1 used to underline opposition. As well Horvei (1985) (size, similarity and lexical clues). Thus the fourth concludes that chiasmi often appear in general atmo- and last experiment (called ‘+Lexical clues’) accu- sphere of antagonism. We can observe such nega- mulates all the features. Each time we run an experi- tions in (1), (6), (9), and (12). ment on the test set, we manually annotate as True or False the top 200 hits given by the machine. Thanks 4.2 Setting the Weights to this manual annotation, we present the evaluation in a precision-recall graph (Figure 3). As observed in Section 3, there is no existing set of annotated data. This excludes traditional supervised machine learning to set the weights. So, we pro-

ceeded empirically, by observation of our successive results on our tuning set. We make available on the web the results of our trainings.

5 Experiments

5.1 Corpus

We perform experiments on English. We choose a corpus of politics often used in NLP: Europarl.² From this corpus, we take two extracts of two million words each. One is used as a tuning corpus to test our hypotheses on weights and features. The other one is the test corpus which is only used for the final evaluation. In Section 5.3 we only present results based on this test corpus.

[Figure 3: Precision-Recall graphs for the top two hundred candidates in each experiment. (The 'basic' experiment is not included because of too low precision.)]
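The precision-recall curves of Figure 3 and the average precision reported below are standard IR-style measures computed over the manually annotated ranked output. A minimal sketch of precision at k and average precision over such a list of True/False judgments (not the authors' evaluation script) is:

# Precision@k and average precision over a ranked list of True/False
# relevance judgments (True = annotated as a real chiasmus).
def precision_at(judgments, k):
    top = judgments[:k]
    return sum(top) / len(top)

def average_precision(judgments, total_relevant=None):
    """Mean of precision@k at every rank k where a true chiasmus appears,
    divided by the number of relevant items."""
    relevant = total_relevant or sum(judgments)
    hits, ap = 0, 0.0
    for k, is_true in enumerate(judgments, start=1):
        if is_true:
            hits += 1
            ap += hits / k
    return ap / relevant if relevant else 0.0

ranked = [True, True, False, True, False]     # toy ranked output
print(precision_at(ranked, 3))                # 0.666...
print(average_precision(ranked))              # (1/1 + 2/2 + 3/4) / 3 = 0.9166...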

² Europarl English to Swedish corpus 01/1997-11/2011
³ Available at http://stp.lingfil.uu.se/~marie/chiasme.htm.

[Figure 2: Schematic representation of chiasmus; C stands for context, W for word. The example "In prehistoric times women resembled men, and men resembled women." is segmented as C_Left W_a C_ab W_b C_bb W_b' C_ba W_a'.]

Feature          Description                                                                    Weight
Basic
#punct           Number of hard punctuation marks and parentheses in C_ab and C_ba                -10
#softPunct       Number of commas in C_ab and C_ba                                                -10
#centralPunct    Number of hard punctuation marks and parentheses in C_bb                          -5
isInStopListA    W_a is a stopword                                                                -10
isInStopListB    W_b is a stopword                                                                -10
#mainRep         Number of additional repetitions of W_a or W_b                                    -5
Size
#diffSize        Difference in number of tokens between C_ab and C_ba                              -1
#toksInBC        Position of W_a' minus position of W_b                                            -1
Similarity
exactMatch       True if C_ab and C_ba are identical                                                5
#sameTok         Number of identical lemmatized tokens in C_ab and in C_ba                          1
simScore         #sameTok but normalised                                                           10
#sameBigram      Number of bigrams that are identical in C_ab and C_ba                              2
#sameTrigram     Number of trigrams that are identical in C_ab and C_ba                             4
#sameCont        Number of tokens that are identical in C_Left and C_bb                             1
Lexical clues
hasConj          True if C_bb contains one of the conjunctions 'and', 'as', 'because',              2
                 'for', 'yet', 'nor', 'so', 'or', 'but'
hasNeg           True if the chiasmus candidate contains one of the negative words                  2
                 'no', 'not', 'never', 'nothing'
hasTo            True if the expression "from . . . to" appears in the chiasmus candidate           2
                 or 'to' or 'into' are repeated in C_ab and C_ba

Table 1: The four groups of features used to rank chiasmus candidates
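As an illustration of how the weights in Table 1 are meant to be used, the following sketch scores a candidate as a weighted sum of its feature values and ranks candidates by that score. Only a handful of the twenty features appear, the feature values are toy numbers, and the dictionary keys are our own; this is not the released implementation.

```python
# Minimal linear ranking model over a few of the Table 1 features.
WEIGHTS = {
    "num_punct": -10,      # hard punctuation in C_ab and C_ba
    "is_stopword_a": -10,  # W_a is a stopword
    "same_trigram": 4,     # identical trigrams in C_ab and C_ba
    "exact_match": 5,      # C_ab and C_ba identical
    "has_neg": 2,          # candidate contains 'no', 'not', 'never', 'nothing'
}

def score(features):
    """Weighted sum of feature values; higher means more chiasmus-like."""
    return sum(WEIGHTS[name] * value for name, value in features.items())

def rank(candidates):
    """Sort candidate feature dictionaries by descending score."""
    return sorted(candidates, key=score, reverse=True)

example = [
    {"num_punct": 0, "is_stopword_a": 0, "same_trigram": 1, "exact_match": 0, "has_neg": 1},
    {"num_punct": 2, "is_stopword_a": 1, "same_trigram": 0, "exact_match": 0, "has_neg": 0},
]
for cand in rank(example):
    print(score(cand), cand)
```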

Precision at candidate   +Size   +Ngram   +Lex. clues
10                         40      70        70
50                         18      24        28
100                        12      17        17
200                         7       9         9
Ave. P.                    34      52        61

Table 2: Average precision, and precision at a given top rank, for each experiment.

In Figure 3, recall is based on 19 true positives. They were found through the annotation of the 200 top hits of our 4 different experiments. On this graph the curves end when position 200 is reached. For instance, the curve of the '+Size' experiment stops at 7% precision because at candidate number 200 only 14 chiasmi were found. The 'basic' experiment is not present because of too low precision. Indeed, 16 out of the 19 true positives were ranked by the 'basic' features as first, but within a tie of 1180 other criss-cross patterns (less than 2% precision). When we add size features, the algorithm outputs the majority of the chiasmi within 200 hits (14 out of 19, or a recall of 74%), but the average precision is below 35% (Table 2). The recall can get significantly better: we notice, indeed, a significant progression in the number of chiasmi if we use N-gram similarity features (17 out of 19 chiasmi).
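The precision-at-rank and average-precision figures of Table 2 can be recomputed from a ranked list annotated as True/False; the short sketch below is our own illustration (the paper does not publish its evaluation script, and the exact average-precision convention used there may differ).

```python
# Precision at rank k and average precision over a manually annotated ranking.
def precision_at(k, labels):
    """labels: booleans for the ranked candidates, True = real chiasmus."""
    return sum(labels[:k]) / float(k)

def average_precision(labels):
    """Mean of precision@i over the ranks i at which a true positive occurs."""
    hits, acc = 0, 0.0
    for i, is_true in enumerate(labels, start=1):
        if is_true:
            hits += 1
            acc += hits / float(i)
    return acc / hits if hits else 0.0

labels = [True, True, False, True] + [False] * 196   # toy top-200 annotation
print(precision_at(10, labels), average_precision(labels))
```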

Finally, the lexical clues do not permit us to find more chiasmi (the maximum recall is still 90% for both the third and the fourth experiment) but the precision improves slightly (plus 9% for the '+Lexical clues' experiment, Table 2).

We never reach 100% recall. This means that 2 of the 19 chiasmi we found were not identifiable by the best algorithm ('+Lexical clues'). They are ranked more highly by the non-optimal algorithms. It can be that our features are too shallow, but it can be as well that the current weights are not optimal. Since our tuning is manual, we have not tried every combination of weights possible.

Chiasmus is a graded phenomenon: our manual annotation ranks three levels of chiasmi: true, borderline, and false cases. Borderline cases are by definition controversial, thus we do not count them in our table of results.4 Duplicates are not counted either.5

Comparing our model to previous research is not straightforward. Our output is ranked, unlike Gawryjolek (2009) and Hromada (2011). We know already that Gawryjolek (2009) extracts every criss-cross pattern with no exception and thus obtains 100% recall but for a precision close to 0% (see Section 2). We run the experiment of Hromada (2011) on our test set.6 It outputs 6 true positives for a total of only 9 candidates. In order to give a fair comparison with Hromada (2011), the 3 systems will be compared only for the nine first candidates (Table 3).

             +Lex. clues   Gawryjolek (2009)   Hromada (2011)
Precision        78               0                  67
Recall           37               0                  32
F1-Score         59               0                  43

Table 3: Precision, recall, and F-measure at candidate number 9. Comparison with previous works.

Finally, the efficiency: our algorithm takes less than three and a half minutes for one million words (214 seconds). It is three times more than Hromada (2011) (78 seconds per million words) but still reasonable.

6 Conclusion

The aim of this research is to detect a rare linguistic phenomenon called chiasmus. For that task, we have no annotated corpus and thus no possibility of supervised machine learning, and no possibility to evaluate the absolute recall of our system. Our first contribution is to redefine the task itself. Based on linguistic observations, we propose to rank the chiasmi instead of classifying them as true or false. Our second contribution was to offer an evaluation scheme and carry it out. Our third and main contribution was to propose a basic linear model with four categories of features that can solve the problem. At the moment, because of the small amount of positive examples in our data set, only tentative conclusions can be drawn. The results might not be statistically significant. Nevertheless, on this data set, this model gives the best F-score ever obtained. We kept track of all annotations and thus our fourth contribution is a set of 1200 criss-cross patterns manually annotated as true, false and borderline cases, which can be used for training or evaluation or both.

Future work could aim at gathering more data. This would allow the use of machine learning techniques in order to set the weights. There are still linguistic choices to make as well. Syntactic patterns seem to emerge in our examples. Identifying those patterns would allow the addition of structural features in our algorithm such as the symmetrical swap of syntactic roles.

A Chiasmi Annotated as True

1. Hippocrates said that food should be our medicine and medicine our food.

2. It is much better to bring work to people than to take people to work.

3. The annual reports and the promised follow-up report in January 2004 may well prove that this report, this agreement, is not the beginning of the end but the end of the beginning.

4 We invite our reader to read them in Appendix B and at http://stp.lingfil.uu.se/~marie/chiasme.htm
5 For example, if the machine extracts both "All for one, one for all" and "All for one, one for all", we take into account only the second case even if both extracts belong to a chiasmus.
6 Program furnished by D. Hromada in the email of the 10th of February. We provide this regex at http://stp.lingfil.uu.se/~marie/chiasme.htm.

4. The basic problem is and remains: social State or liberal State? More State and less market or more market and less State?

5. She wants to feed chickens to pigs and pigs to chickens.

6. There is no doubt that doping is changing sport and that sport will be changed due to doping, which means it will become absolutely ludicrous if we do nothing to prevent it.

7. Portugal and Spain are in quite unequal positions, given that we are a downstream country, in other words, Portugal has no rivers that flow into Spain, but Spain does have rivers that flow into Portugal.

8. Mr President, some Eurosceptics are in favour of the Treaty of Nice because federalists protest against it; federalists are in favour of it because Eurosceptics are opposed to it.

9. Companies must be enabled to find a solution to sending their products from the company to the railway and, once the destination is reached, from the railway to the company.

10. That is the first difficulty in this exercise, since each of the camps expects first of all that we side with it against its enemy, following the old adage that my enemy's enemy is my friend, but my enemy's friend is my enemy.

11. It is high time that animals were no longer adapted to suit their environment, but the environment to suit the animals.

12. That is, it must not be a matter of Europe following the Member States or of Member States following Europe.

13. I would like to say once again that the European Research Area is not simply the European Framework Programme. The European Framework Programme is an aspect of the European Research Area.

14. We also need to clarify another point, i.e. that we are calling for an international solution because this is an international problem and international problems require international solutions.

15. Finally, I would like to discuss this commitment from all in the sense that, as Bertrand Russell said, in order to be happy we need three things: the courage to accept the things that we cannot change, enough determination to change the things that we can change and the wisdom to know the difference between the two.

16. Perhaps we should adapt the Community policies to these regions instead of expecting the regions to adapt to an elitist European policy.

17. What is to prevent national parliamentarians appearing more regularly in the European Parliament and European parliamentarians more regularly in the national parliaments?

18. In my opinion, it would be appropriate if the European political parties took part in national elections, rather than having national political parties take part in European elections.

19. The directive entrenches a situation in which protected companies can take over unprotected ones, yet unprotected companies cannot take over protected ones.

B Chiasmi Annotated as Borderline Cases

Random sample out of a total of 10 borderline cases.

1. also ensure that beef and veal are safer than ever for consumers and that consumer confidence is restored in beef and veal.

2. If all this is not helped by a fund, the fund is no help at all.

3. Both men and women can be victims, but the main victims are women [...].

4. The Commission should monitor the agency and we should monitor the Commission.

5. The more harmless questions have been answered, but the evasive or inadequate answers to other questions are unacceptable.

C Chiasmi Annotated as False

Random sample out of 390 negative instances.

1. the charging of environmental and resource costs associated with water use are aimed at those Member States that still make excessive use of and pollute their water resources, and therefore not

2. at 3 p.m.) Council priorities for the meeting of the United Nations Human Rights Commission in Geneva The next item is the Council and Commission statements on the Council priorities for the meeting of

3. President, the two basic issues are whether we intend to harmonise social policy and whether the power of the Commission will be extended to cover social policy issues. I will start with

4. of us within the Union agree on every issue, but on this issue we are agreed. When we are

5. appear that the reference to regulation 2082/90 may be a departure from the directive and that the directive and the regulation could be considered to

References

Gobinda Chowdhury. 1999. The internet and information retrieval research: a brief review. Journal of Documentation, 55(2):209–225.

Sarah J. Clarke and Peter Willett. 1997. Estimating the recall performance of Web search engines. Proceedings of Aslib, 49(7):184–189.

Denis Diderot and Jean le Rond D'Alembert. 1782. Encyclopédie méthodique: ou par ordre de matières, volume 66. Panckoucke.

Marie Dubremetz. 2013. Vers une identification automatique du chiasme de mots. In Actes de la 15e Rencontres des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RECITAL'2013), pages 150–163, Les Sables d'Olonne, France.

Pierre Fontanier. 1827. Les Figures du discours. Flammarion, 1977 edition.

Mario García-Page. 1991. El "retruécano léxico" y sus límites. Archivum: Revista de la Facultad de Filología de Oviedo, 41-42:173–203.

Jakub J. Gawryjolek. 2009. Automated Annotation and Visualization of Rhetorical Figures. Master thesis, University of Waterloo.

Roland Greene. 2012. The Princeton Encyclopedia of Poetry and Poetics: Fourth Edition. Princeton University Press.

Adam Hammond, Julian Brooke, and Graeme Hirst. 2013. A Tale of Two Cultures: Bringing Literary Analysis and Computational Linguistics Together. In Proceedings of the Workshop on Computational Linguistics for Literature, pages 1–8, Atlanta, Georgia, June. Association for Computational Linguistics.

Randy Harris and Chrysanne DiMarco. 2009. Constructing a Rhetorical Figuration Ontology. In Persuasive Technology and Digital Behaviour Intervention Symposium, pages 47–52, Edinburgh, Scotland.

Harald Horvei. 1985. The Changing Fortunes of a Rhetorical Term: The History of the Chiasmus. The Author.

Daniel Devatman Hromada. 2011. Initial Experiments with Multilingual Extraction of Rhetoric Figures by means of PERL-compatible Regular Expressions. In Proceedings of the Second Student Research Workshop associated with RANLP 2011, pages 85–90, Hissar, Bulgaria.

William C. Mann and Sandra A. Thompson. 1988. Rhetorical structure theory: Toward a functional theory of text organization. Text, 8(3):243–281.

Henri Morier. 1961. Dictionnaire de poétique et de rhétorique. Presses Universitaires de France.

Helge Nordahl. 1971. Variantes chiasmiques. Essai de description formelle. Revue Romane, 6:219–232.

Alain Rabatel. 2008. Points de vue en confrontation dans les antimétaboles PLUS et MOINS. Langue française, 160(4):21–36.

Helmut Schmid. 1994. Probabilistic Part-of-Speech Tagging Using Decision Trees. In Proceedings of the International Conference on New Methods in Language Processing, pages 44–49, Manchester, Great Britain.

Validating Literary Theories Using Automatic Social Network Extraction

Prashant Arun Jayannavar, Department of Computer Science, Columbia University, NY, USA ([email protected])
Apoorv Agarwal, Department of Computer Science, Columbia University, NY, USA ([email protected])
Melody Ju, Columbia College, Columbia University, NY, USA ([email protected])
Owen Rambow, Center for Computational Learning Systems, Columbia University, NY, USA ([email protected])

Abstract

In this paper, we investigate whether long-standing literary theories about nineteenth-century British novels can be verified using computational techniques. Elson et al. (2010) previously introduced the task of computationally validating such theories, extracting conversational networks from literary texts. Revisiting their work, we conduct a closer reading of the theories themselves, present a revised and expanded set of hypotheses based on a divergent interpretation of the theories, and widen the scope of networks for validating this expanded set of hypotheses.

1 Introduction

In his book Graphs, Maps, Trees: Abstract Models for Literary History, literary scholar Franco Moretti proposes a radical transformation in the study of literature (Moretti, 2005). Advocating a shift from the close reading of individual texts in a traditionally selective literary canon, to the construction of abstract models charting the aesthetic form of entire genres, Moretti imports quantitative tools to the humanities in order to inform what he calls "a more rational literary history." While Moretti's work has inspired both support and controversy, this reimagined mode of reading opens a fresh direction from which to approach literary analysis and historiography.

By enabling the "distant reading" of texts on significantly larger scales, advances in Natural Language Processing and applied Machine Learning can be employed to empirically evaluate existing claims or make new observations over vast bodies of literature. In a seminal example of this undertaking, Elson et al. (2010) attempted to validate an assumption of structural difference between the social worlds of rural and urban novels using social networks extracted from nineteenth-century British novels. Extrapolating from the work of various literary theorists, Elson et al. (2010) hypothesized that nineteenth-century British novels set in urban environments feature numerous characters who share little conversation, while rural novels have fewer characters with more conversations. Using quoted speech attribution, the authors extracted conversation networks from 60 novels, which had been manually classified by a scholar of literature as either rural or urban. Elson et al. (2010) concluded that the results of their analysis of conversation networks, which indicated no difference between the social networks of rural and urban novels, invalidated the literary hypotheses. However, we believe that Elson et al. (2010) misrepresented the original theories, and that their results actually support rather than contradict the original theories in question.

In this paper, we revisit the work of Elson et al. (2010), presenting a nuanced interpretation of their results through a closer reading of the original theories cited. We propose that Elson et al. (2010)'s results actually align with these theories. We then employ a more powerful tool for extracting social networks from texts, which allows us to examine a wider set of hypotheses and thus provide deeper insights into the original theories. Our findings confirm that the setting (rural versus urban) of a novel in Elson et al. (2010)'s corpus has no effect on its social structure, even when one goes beyond conversations to more general and different notions of interactions. Specifically, we extend the work of Elson et al. (2010) in four significant ways: (1) we extract interaction networks, a conceptual generalization of conversation networks; (2) we extract observation networks, a new type of network with directed links; (3) we consider unweighted networks in addition to weighted networks; and (4) we investigate the number and size of communities in the extracted networks.

For extracting interaction and observation networks, we use our existing system called SINNET (Agarwal et al., 2010; Agarwal et al., 2013b; Agarwal et al., 2014). In addition to validating a richer set of hypotheses using SINNET, we present an evaluation of the system on the task of automatic social network extraction from literary texts. Our results show that SINNET is effective in extracting interaction networks from a genre quite different from the genre it was trained on, namely news articles.

The paper is organized as follows. In Section 2 we revisit the theories postulated by various literary theorists and critics. In Section 3, we present an expanded set of literary hypotheses. Section 4 presents the methodology used by Elson et al. (2010) for validating their literary hypothesis. We use the same methodology for validating our expanded set of literary hypotheses. In Section 5, we give details on the difference between conversation, observation, and interaction networks. We then evaluate SINNET on the data set provided by Elson et al. (2010) (Section 6). We test our hypotheses against the data in Section 7 and conclude with future directions of research in Section 8.

2 Literary Theories

In section 3 of their paper, Elson et al. (2010) present a synthesis of quotations from literary theorists Mikhail Bakhtin (Bakhtin, 1937), Raymond Williams (Williams, 1975), Franco Moretti (Moretti, 1999; Moretti, 2005) and Terry Eagleton (Eagleton, 1996; Eagleton, 2013). Elson et al. (2010) simplify the quotations to derive the following hypotheses:

EDM1: There is an inverse correlation between the number of dialogues and the number of characters.

EDM2: In novels set in urban environments, numerous characters share little conversational interaction. Rural novels, on the other hand, have fewer characters with more conversations.

We argue that the theories themselves are misconstrued and that the results of Elson et al. (2010)'s experiments actually support what theorists imply about the distinction between rural and urban novels as sub-genres of 19th century realist fiction. For instance, Elson et al. (2010) quote Williams (1975) as follows:

    Raymond Williams used the term "knowable communities" to describe this [rural] world, in which face-to-face relations of a restricted set of characters are the primary mode of social interaction (Williams, 1975, 166). By contrast, the urban world, in this traditional account, is both larger and more complex.

On re-visiting this quotation in a larger and original context, we note that Williams (1975) actually applies the term "knowable communities" to novels in general, not to settings, and specifically not – as Elson et al. (2010) presume – to any particular setting (rural in this case). Williams (1975) states that "most novels are in some sense knowable communities", meaning that the novelist "offers to show people and their relationships in essentially knowable and communicable ways." However, the need or desire to portray some setting in a realistic ("knowable") way does not automatically entail the ability to do so: evolutions in real-world social milieu may occur independently of the evolutions in novelistic technique that specifically allow such evolutions to be captured in literature.

In the same vein, Robert Alter asserts that "there may [at any point in social history] be inherent limits on the access of the novelistic imagination to objective, collective realities" (Alter, 2008, p. x). And Moretti's central point is that a shortage of linguistic resources for reproducing the experience of an urban community persisted as literature shifted its focus toward the portrayal of urban realities in the nineteenth century.

Moretti asks, "given the over-complication of the nineteenth-century urban setting - how did novels 'read' cities? By what narrative mechanisms did they make them 'legible', and turn urban noise into information?" (Moretti, 1999, p. 79). To answer this question, Moretti points to the reductive rendering techniques of the urban genre's first wave; these novels "don't show 'London', only a small, monochrome portion of it" (Moretti, 1999, p. 79). In order to make London legible, nineteenth century British novelists, including Austen and Dickens, reduce its complexity and its randomness, thereby amputating the richer, more unpredictable interactions that could occur in a more complex city (Moretti, 1999, p. 86). Moretti compares Dickens's London with Balzac's Paris; unlike Dickens, Balzac allows the complications of his urban subject to flourish and inform narrative possibility. The following quote presented by Elson et al. (2010) is actually used by Moretti to describe Balzac's Paris specifically, not urban settings in general, and specifically not Dickens's London:

    As the number of characters increases, Moretti argues (following Bakhtin in his logic), social interactions of different kinds and durations multiply, displacing the family-centered and conversational logic of village or rural fictions. "The narrative system becomes complicated, unstable: the city turns into a gigantic roulette table, where helpers and antagonists mix in unpredictable combinations" (Moretti, 1999).

In summary, the simple fact that a novel is set in an urban environment (and the evocation of the urban setting by name or choice of props) does not equate with the creation of a truly urban space. The latter is the key that renders possible an urban story with an urban social world; "without a certain kind of space," Moretti declares, "a certain kind of story is simply impossible" (Moretti, 1999, p. 100).

Moretti exposes another reductive rendering technique used by Dickens: the narrative crux of the family romance. This technique, he asserts, "is a further instance of the tentative, contradictory path followed by urban novels: as London's random and unrelated enclaves increase the 'noise', the 'dissonance', the complexity of the plot – the family romance tries to reduce it, turning London into a coherent whole" (Moretti, 1999, p. 130). Alter agrees, arguing that in Dickens' London, "representation of human solidarity characteristically sequesters it in protected little enclaves within the larger urban scene" (Alter, 2008, p. 55) and that "in these elaborately plotted books of Dickens's, almost no character is allowed to go to waste; each somehow is linked with the others as the writer deftly brings all the strands together in the complication and resolution of the story" (Alter, 2008, p. 67). In terms of the "perception of the fundamental categories of time and space, the boundaries of the self, and the autonomy of the individual" (Alter, 2008, p. xi), Dickens essentially writes a rural fiction, but in an urban setting.

To summarize these arguments: when novelists – like Dickens – employ narrative techniques not originally evolved for the portrayal of urban areas in novels with an urban setting, they fail to create in the novel the type of urban space in which an urban story with an urban social world is possible. Setting is sociological (it exists outside of novels), but space is literary (it exists only in novels): it is only the development of new practices in writing that are able to create truly urban spaces, those which reflect the fundamental transformations in the nature of human experience by the city: "Urban crowds and urban dwellings may reinforce a sense of isolation in individuals which, pushed to the extreme, becomes an incipient solipsism or paranoia. This feeling of being cut off from meaningful human connections finds a congenial medium in modes of narration – pioneered by Flaubert – that are rigorously centered in the consciousness of the character" (Alter, 2008, p. 107).

We now turn to Elson et al. (2010)'s presentation of the literary theories. They muddy the difference between setting and space, a serious flaw in interpreting Bakhtin. Urban setting does not equal urban space, and space – not setting – is what concerns Bakhtin's chronotope. The nature of the space of a novel, not its explicit setting, defines what can happen in it, including its social relationships. Each text in Elson et al. (2010)'s corpus of 60 novels is classified as either rural or urban by the following definitions.

They define urban to mean set in a metropolitan zone, characterized by multiple forms of labor (not just agricultural), where social relations are largely financial or commercial in character. Conversely, rural is defined to mean set in a country or village zone, where agriculture is the primary activity, and where land-owning, non-productive, rent-collecting gentry are socially predominant. Thus, the distinction between rural and urban for Elson et al. (2010) is clearly one of setting, not one of space. Hypothesis EDM2 of Elson et al. (2010) is therefore not a correct representation of the theories they cite.

Interestingly, Elson et al. (2010) cannot validate their own hypothesis EDM2: their results suggest that the "urban" novels within the corpus do not belong to a fundamentally separate class of novels, insofar as basic frameworks of time and space inform the essential experience of the characters. They conclude:

    We would propose that this suggests that the form of a given novel – the standpoint of the narrative voice, whether the voice is "omniscient" or not – is far more determinative of the kind of social network described in the novel than where it is set or even the number of characters involved.

Put differently, differences in novels' social networks are more related to literary differences (space) than to non-literary differences (setting).

3 Expanded Set of Literary Hypotheses

In light of the analysis in the previous section, we propose that Elson et al. (2010)'s results, though they invalidate hypotheses EDM1 and EDM2, actually align with the parent theories from which they are derived. In direct opposition to EDM1 and EDM2, we expect our analysis to confirm the absence of correlation between setting and social network within our corpus of novels. While Elson et al. (2010)'s approach is constricted to examining social networks from the perspective of conversation, we obtain deeper insight into the novels by exploring an expanded set of hypotheses which takes general interaction and observation into account. If a comprehensive look at the social networks in our corpus confirms a lack of structural difference between the social worlds of rural- and urban-set novels, we confirm the need to look beyond setting in order to pinpoint facets of novelistic form that do determine social networks.

Similar to the approach of Elson et al. (2010), our hypotheses concern (a) the implications of an increase in the number of characters, and (b) the implications of the dichotomy between rural and urban settings. However, unlike Elson et al. (2010), we do not claim any hypothesized relation between the increase in number of characters and the social structure. We formulate our own hypotheses (H1.1, H1.2, H3.1, H3.2, H5) concerning the increase in number of characters and study them out of curiosity and as an exploratory exercise. Furthermore, unlike Elson et al. (2010), we claim that literary theorists did not posit a relation between setting and social structure. Following is the set of hypotheses we validate in this paper:

H0: As setting changes from rural to urban, there is no change in the number of characters. The number of characters is given by formula 1 in Table 1.

H1.1: There is a positive correlation between the number of interactions and the number of characters. The number of interactions is given by formula 3 in Table 1.

H1.2: There is a negative correlation between the number of characters and the number of other characters a character interacts with (unweighted version of H1.1, formula 4).

H2.1: As setting changes from rural to urban, there is no change in the total number of interactions that occur. The number of interactions is given by formula 3 in Table 1.

H2.2: As setting changes from rural to urban, there is no change in the average number of characters each character interacts with.

H3.1: There is a positive correlation between the number of observations and the number of characters. The number of observations is given by formula 3 in Table 1.

H3.2: There is a negative correlation between the number of characters a character observes (formula 4) and the number of characters.

H4.1: As setting changes from rural to urban, there is no change in the total number of observations that occur. The number of observations is given by formula 3 in Table 1.

H4.2: As setting changes from rural to urban, there is no change in the average number of observations performed by each character. (This hypothesis is the unweighted version of H4.1, formula 4, and the OBS counterpart of H2.2.)

H5: As the number of characters increases, the number of communities increases, but the average size of communities decreases.

H6: As setting changes from rural to urban, there is no change in the number nor the average size of communities.

4 Methodology for Validating Hypotheses

Elson et al. (2010) provide evidence to invalidate EDM1. They report a positive Pearson's correlation coefficient (PCC) between the number of characters and the number of dialogues to show that the two quantities are not inversely correlated. We use the same methodology to examine our hypotheses related to the number of characters.

Elson et al. (2010) provide evidence to invalidate EDM2. They extract various features from the social networks of rural and urban novels and show that these features are not statistically significantly different. They use the homoscedastic t-test to measure statistical significance (with p < .05 implying statistical significance). We employ the same methodology to examine our hypotheses related to the rural/urban dichotomy.

The features that Elson et al. (2010) use to invalidate EDM2 are as follows: (a) average degree, (b) rate of cliques, (c) density, and (d) rate of characters' mentions of other characters. EDM2 posits that the numerous characters in urban settings share less conversation as compared to the rural settings. The average degree (the count of the number of conversations normalized by the number of characters, see formula 4 in Table 1) seems to be the metric that is relevant for (in)validating EDM2. It is unclear why Elson et al. (2010) report the correlation between other features to invalidate EDM2. We therefore validate our formulation of the theory (similar to EDM2) using only the average degree metric.

5 Types of Networks

This section provides definitions for the three different types of networks we consider in our study.

5.1 Conversation Network

Elson et al. (2010) defined a conversation network as a network in which nodes are characters and links are conversations. The authors defined a conversation as a continuous span of narrative time in which a set of characters exchange dialogues. Since dialogues are denoted in text by quotation marks, Elson et al. (2010) used simple regular expressions for dialogue detection. However, associating dialogues with their speakers (a task known as quoted speech attribution) turned out to be non-trivial (Elson and McKeown, 2010; He et al., 2013). In a separate work, Elson and McKeown (2010) presented a feature-based, supervised machine learning system for performing quoted speech attribution. Using this system, Elson et al. (2010) successfully extracted conversation networks from the novels in their corpus. We refer to the system as EDM2010 throughout this paper.

5.2 Observation and Interaction Networks

In our past work (Agarwal et al., 2010), we defined a social network as a network in which nodes are characters and links are social events. We defined two broad categories of social events: observations (OBS) and interactions (INR). Observations are defined as unidirectional social events in which only one entity is cognitively aware of the other. Interactions are defined as bidirectional social events in which both entities are cognitively aware of each other and of their mutual awareness.

In Example 1, Mr. Woodhouse is talking about Emma. He is therefore cognitively aware of Emma. However, there is no evidence that Emma is also aware of Mr. Woodhouse. Since only one character is aware of the other, this is an observation event directed from Mr. Woodhouse to Emma.
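A small sketch of how such social events translate into network edges: observations become directed edges and interactions undirected ones, with weights counting event frequency. NetworkX is an assumption here; the paper does not say which graph library, if any, SINNET uses, and the helper name is our own.

```python
import networkx as nx

G_obs = nx.DiGraph()  # observation (OBS) network: directed edges
G_inr = nx.Graph()    # interaction (INR) network: undirected edges

def add_event(graph, source, target):
    """Record one social event, incrementing the edge weight if the pair is already linked."""
    if graph.has_edge(source, target):
        graph[source][target]["weight"] += 1
    else:
        graph.add_edge(source, target, weight=1)

# The observation event discussed above: Mr. Woodhouse talks about Emma.
add_event(G_obs, "Mr. Woodhouse", "Emma")
# An interaction event (see Example 2 below): Mr. Micawber goes home with Uriah.
add_event(G_inr, "Mr. Micawber", "Uriah")

print(list(G_obs.edges(data=True)), list(G_inr.edges(data=True)))
```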

No  Name                      Formula                                                 Weighted?   Remark
1   # of characters           |V|
2   # of interaction pairs    |E|                                                     unw.
3   # of interactions         W := sum_{i=1..|E|} w_i                                 weighted
4   average degree            (sum_{v in V} |E_v|) / |V| = 2|E| / |V|                 unw.        number of characters a character interacts with on average
5   average weighted degree   (sum_{u in V} sum_{v in E_u} w_{u,v}) / |V| = 2W / |V|  weighted    number of interactions a character has on average

Table 1: Table connecting the social network terminology to the natural language interpretation, along with the formulae. Interactions may be replaced with observations to obtain the corresponding formula.
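The formulae in Table 1 are straightforward to compute over such a graph; a sketch using NetworkX (again an assumed library, with toy characters and weights) follows.

```python
import networkx as nx

G = nx.Graph()  # a toy weighted interaction (INR) network
G.add_edge("Emma", "Mr. Woodhouse", weight=12)
G.add_edge("Emma", "Mr. Knightley", weight=7)
G.add_edge("Mr. Knightley", "Mr. Woodhouse", weight=2)

num_characters = G.number_of_nodes()                # formula 1: |V|
num_pairs = G.number_of_edges()                     # formula 2: |E|
W = G.size(weight="weight")                         # formula 3: sum of edge weights
avg_degree = 2.0 * num_pairs / num_characters       # formula 4: 2|E| / |V|
avg_weighted_degree = 2.0 * W / num_characters      # formula 5: 2W / |V|

print(num_characters, num_pairs, W, avg_degree, avg_weighted_degree)
```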

(1) "[Emma] never thinks of herself, if she can do good to others," rejoined [Mr. Woodhouse]. {OBS}

In Example 2, Mr. Micawber is talking about going home with Uriah. Since Mr. Micawber is talking about Uriah, there is a directed OBS link from Mr. Micawber to Uriah. Since he went home with Uriah, they must both have been aware of each other and of their mutual awareness. Thus, there is a bidirectional INR link between the two characters.

(2) [Mr. Micawber] said, that [he] had gone home with [Uriah]. {OBS and INR}

In Example 3, the author (Jane Austen) states a fact about three characters (Elton, Mr. Knightley, and Mr. Weston). However, the author does not tell us about the characters' cognitive states, and thus there is no social event between the characters.

(3) [Elton]'s manners are superior to [Mr. Knightley]'s or [Mr. Weston]'s. {NoEvent}

As these examples demonstrate, the definition of a social event is quite broad. While quoted speech (detected by Elson et al. (2010)) represents only a strict sub-set of interactions, social events may be linguistically expressed using other types of speech as well, such as reported speech.

In our subsequent work (Agarwal and Rambow, 2010; Agarwal et al., 2013b; Agarwal et al., 2014), we leveraged and extended ideas from the relation extraction literature (Zelenko et al., 2003; Kambhatla, 2004; Zhao and Grishman, 2005; GuoDong et al., 2005; Harabagiu et al., 2005; Nguyen et al., 2009) to build a tree kernel-based supervised system for automatically detecting and classifying social events. We used this system for extracting observation and interaction networks from novels. We will refer to it as SINNET throughout this paper.

5.3 Terminology Regarding Networks

A network (or graph), G = (V, E), is a set of vertices (V) and edges (E). The set of edges incident on vertex v is written E_v. In weighted networks, each edge between nodes u and v is associated with a weight, denoted by w_{u,v}. In the networks we consider, weight represents the frequency with which two people interact or observe one another. An edge may be directed or undirected. Interactions (INR) are undirected edges and observations (OBS) are directed edges. Table 1 presents the name and the mathematical formula for the social network analysis metrics we use to validate the theories.

Edges in a network are typed. We consider four types of networks in this work: networks with undirected interaction edges (INR), with directed observation edges (OBS), with a combination of interaction and observation edges (INR + OBS), and with a combination of interaction, observation, and undirected conversational edges (CON). We denote these networks by G_INR = (V, E_INR), G_OBS, G_INR+OBS, and G_INR+OBS+CON respectively. Each of these networks may be weighted or unweighted.

6 Evaluation of SINNET

In our previous work, we showed that SINNET adeptly extracts the social network from one work of fiction, Alice in Wonderland (Agarwal et al., 2013b).

Novel Excerpt         CONV-GOLD                  INT-GOLD
                      EDM2010    SINNET          EDM2010                 SINNET
                      R          R               P      R      F1        P      R      F1
Emma                  0.40       0.70            1.0    0.13   0.22      0.86   0.48   0.61
Study in Scarlet      0.50       0.50            1.0    0.18   0.31      0.69   0.41   0.51
David Copperfield     0.70       0.80            1.0    0.22   0.36      0.80   0.63   0.70
Portrait of a Lady    0.66       0.66            1.0    0.22   0.36      0.73   0.44   0.55
Micro-Average         0.56       0.68            1.0    0.18   0.30      0.79   0.50   0.61

Table 3: Performance of the two systems on the two gold standards.

In this paper, we determine the effectiveness of SINNET on an expanded collection of literary texts. Elson et al. (2010) presented a gold standard for measuring the performance of EDM2010, which we call CONV-GOLD. This gold standard is not suitable for measuring the performance of SINNET because SINNET extracts a larger set of interactions beyond conversations. We therefore created another gold standard more suitable for evaluating SINNET, and refer to it as INT-GOLD.

6.1 Gold standards: CONV-GOLD and INT-GOLD

Elson et al. (2010) created their gold standard for evaluating the performance of EDM2010 using excerpts from four novels: Austen's Emma, Conan Doyle's A Study in Scarlet, Dickens' David Copperfield, and James' The Portrait of a Lady. The authors enumerated all pairs of characters for each novel excerpt. If a novel features n characters, its corresponding list contains n*(n-1)/2 elements. For each pair of characters, annotators were asked to mark "1" if the characters converse (defined in Section 5) and "0" otherwise. Annotators were asked to identify conversations framed with both direct (quoted) and indirect (unquoted) speech.

As explained in previous sections, conversations are a strict subset of general interactions. Since SINNET aims to extract the entire set of observations and interactions, the gold standard we created records all forms of observation and interaction between characters. For each pair of characters, annotators were asked to mark "1" if the characters observe or interact and "0" otherwise.

Novel Excerpt         # of char. pairs   # of links (CG)   # of links (IG)
Emma                  91                 10                40
Study in Scarlet      55                  8                22
David Copperfield     120                10                32
Portrait of a Lady    55                  6                18

Table 2: A comparison of the number of links in the two gold standards; CG is CONV-GOLD and IG is INT-GOLD.

Table 2 presents the number of character pairs in each novel excerpt, the number of character pairs that converse according to CONV-GOLD and the number of character pairs that observe or interact according to INT-GOLD. The difference in the number of links between CONV-GOLD and INT-GOLD suggests that the observation and interaction of many more pairs of characters is expressed through reported speech in comparison to conversational speech. For example, the number of conversational links identified in the excerpt from Emma by Jane Austen was 10, while the number of interaction links identified was 40.

6.2 Evaluation and Results

Table 3 presents the results for the performance of EDM2010 and SINNET on the two gold standards (CONV-GOLD and INT-GOLD). The recall of SINNET is significantly better than that of EDM2010 on CONV-GOLD (columns 2 and 3), suggesting that most of the links expressed as quoted conversations are also expressed as interactions via reported speech. Note that, because SINNET extracts a larger set of interactions, we do not report the precision and F1-measure of SINNET on CONV-GOLD.
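Since both gold standards are simply sets of character pairs marked "1", system scores like those in Table 3 reduce to set comparisons over unordered pairs. The helper below is our own illustration with invented pairs, not the evaluation code used in the paper.

```python
# Precision, recall and F1 over unordered character pairs.
def prf(predicted_pairs, gold_pairs):
    predicted = {frozenset(p) for p in predicted_pairs}
    gold = {frozenset(p) for p in gold_pairs}
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

gold = [("Emma", "Mr. Woodhouse"), ("Emma", "Mr. Knightley"), ("Emma", "Harriet")]
pred = [("Mr. Woodhouse", "Emma"), ("Emma", "Mr. Elton")]
print(prf(pred, gold))   # (0.5, 0.333..., 0.4)
```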

Hypothesis                                      As # of characters increases ...   As setting goes from rural to urban ...
                                                PCC        Valid?                  t-test      Valid?
[H0]   ... # of characters (no change)                                             p > 0.05    yes
[H1.1] ... # of interactions (increase)          0.83      yes
[H1.2] ... # of characters interacted with      -0.36      yes
[H2.1] ... # of interactions (no change)                                           p > 0.05    yes
[H2.2] ... # of characters interacted with                                         p > 0.05    yes
[H3.1] ... # of observations (increase)          0.77      yes
[H3.2] ... # of characters observed             -0.36      yes
[H4.1] ... # of observations (no change)                                           p > 0.05    yes
[H4.2] ... # of characters observed                                                p > 0.05    yes
[H5]   ... # of communities (increase)           0.98      yes
[H5]   ... average size of communities          -0.26      yes
[H6]   ... # of communities (no change)                                            p > 0.05    yes
[H6]   ... average size of communities                                             p > 0.05    yes

Table 4: Hypotheses and results. All correlations are statistically significant; p > 0.05 denotes no significant change. As an example, hypothesis H0 may be read as: as setting goes from rural to urban, the number of characters does not change significantly.
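The two statistics behind Table 4 follow the methodology of Section 4: Pearson's correlation coefficient for the character-count hypotheses and a homoscedastic two-sample t-test (significance at p < .05) for the rural/urban hypotheses. The sketch below uses SciPy and made-up per-novel values; the paper does not specify its statistics software.

```python
from scipy.stats import pearsonr, ttest_ind

# One value per novel (illustrative numbers only).
num_characters = [12, 25, 40, 33, 18, 52]
num_interactions = [30, 90, 160, 120, 55, 210]

r, p_corr = pearsonr(num_characters, num_interactions)
print("PCC between #characters and #interactions: %.2f (p=%.3f)" % (r, p_corr))

avg_degree_rural = [3.1, 2.8, 3.5, 2.9]
avg_degree_urban = [3.0, 3.3, 2.7, 3.2]
t, p = ttest_ind(avg_degree_rural, avg_degree_urban, equal_var=True)  # homoscedastic t-test
print("rural vs. urban average degree: t=%.2f, p=%.3f" % (t, p))
print("no significant difference" if p > 0.05 else "significant difference")
```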

By definition, SINNET will predict links between characters that may not be linked in CONV-GOLD; therefore the precision (and thus F1-measure) of SINNET will be low (and uninterpretable) on CONV-GOLD.

Table 3 additionally presents the performance of the two systems on INT-GOLD (the last six columns). These results show that EDM2010 achieves perfect precision, but significantly lower recall than SINNET (0.18 versus 0.50). This is expected, as EDM2010 was not trained (or designed) to extract any interactions besides conversation.

6.3 Discussion of Results

If there are any conversational links that EDM2010 detects but SINNET misses, then the two systems should be treated as complementary. To determine whether or not this is the case, we counted the number of links in all four excerpts that are detected by EDM2010 and missed by SINNET. For Austen's Emma, SINNET missed two links that EDM2010 detected (with respect to INT-GOLD). For the other three novels, the counts were two, zero, and one, respectively. In total, the number of links that SINNET missed and EDM2010 detected is five out of 112. Since the precision of EDM2010 is perfect, it seems advantageous to combine the output of the two systems.

7 Results for Testing Literary Hypotheses

Table 4 presents the results for all hypotheses (H0-H6) formulated in this paper. There are two broad categories of hypotheses: (1) ones that comment on social network analysis metrics (the rows) based on the increase in the number of characters (columns 2 and 3), and (2) ones that comment on the social network analysis metrics based on the type of setting (rural versus urban, columns 4 and 5).

The results show that as settings change from rural to urban, there is no significant change in the number of characters (row H0, column t-test). Furthermore, as the number of characters increases, the number of interactions also increases with a high Pearson correlation coefficient of 0.83 (row H1.1, column PCC). Similarly, for all other hypotheses, the relation between the number of characters and the setting of novels behaves as expected in terms of various types of networks and social network analysis metrics. Our results thus provide support for the cogency of the original theories.

These results highlight one of the critical findings of this paper: while network metrics are significantly correlated with the number of characters, there is no correlation at all between setting and number of characters within our corpus (hypothesis H0 is valid). If H0 were invalid, then all hypotheses concerning the effects of setting would be false.

However, since H0 is true, we may conclude that setting (as defined by our rural/urban classification) has no predictive effect on any of the aspects of social networks that we investigate.

We also consider whether examining different network types (interaction, observation, and combination) in conjunction produces the same results as examining each individually. The results indeed align with those in Table 4, but with slightly different correlation numbers. We give one example: we find that the correlation between number of characters and number of interactions (hypothesis H1.1) increases from 0.83 for G_INR alone (as shown in Table 4) to 0.85 for G_INR+OBS and also 0.85 for G_INR+OBS+CONV. This pattern is observed for all hypotheses.

8 Conclusion and Future Work

In this paper, we investigated whether social network extraction confirms long-standing assumptions about the social worlds of nineteenth-century British novels. Namely, we set out to verify whether the social networks of novels explicitly located in urban settings exhibit structural differences from those of rural novels. Elson et al. (2010) had previously proposed a hypothesis of difference as an interpretation of several literary theories, and provided evidence to invalidate this hypothesis on the basis of conversational networks. Following a closer reading of the theories cited by Elson et al. (2010), we suggested that their results, far from invalidating the theories themselves, actually support their cogency. To extend Elson et al. (2010)'s findings with a more comprehensive look at social interactions, we explored the application of another methodology for extracting social networks from text (called SINNET) which had previously not been applied to fiction. Using this methodology, we were able to extract a rich set of observation and interaction relations from novels, enabling us to build meaningfully on previous work. We found that the rural/urban distinction proposed by Elson et al. (2010) indeed has no effect on the structure of the social networks, while the number of characters does.

As our findings support our literary hypothesis that the urban novels within Elson et al. (2010)'s original corpus do not belong to a fundamentally separate class of novels, insofar as the essential experience of the characters is concerned, possible directions for future research include expanding our corpus in order to identify novelistic features that do determine social worlds. We are particularly interested in studying novels which exhibit innovations in narrative technique, or which occur historically in and around periods of technological innovation. Lastly, we would like to add a temporal dimension to our social network extraction, in order to capture information about how networks transform throughout different novels.

Acknowledgments

We would like to thank the anonymous reviewers for their insightful comments. We would also like to thank David Elson for providing the gold standard and the data set used in their previous work. This paper is based upon work supported in part by the DARPA DEFT Program. The views expressed are those of the authors and do not reflect the official policy or position of the Department of Defense or the U.S. Government.

References

Apoorv Agarwal and Owen Rambow. 2010. Automatic detection and classification of social events. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1024–1034, Cambridge, MA, October. Association for Computational Linguistics.

Apoorv Agarwal, Owen C. Rambow, and Rebecca J. Passonneau. 2010. Annotation scheme for social network extraction from text. In Proceedings of the Fourth Linguistic Annotation Workshop.

Apoorv Agarwal, Anup Kotalwar, and Owen Rambow. 2013a. Automatic extraction of social networks from literary text: A case study on Alice in Wonderland. In Proceedings of the 6th International Joint Conference on Natural Language Processing (IJCNLP 2013).

Apoorv Agarwal, Anup Kotalwar, Jiehan Zheng, and Owen Rambow. 2013b. SINNET: Social interaction network extractor from text. In Sixth International Joint Conference on Natural Language Processing, page 33.

Apoorv Agarwal, Sriramkumar Balasubramanian, Anup Kotalwar, Jiehan Zheng, and Owen Rambow. 2014. Frame semantic tree kernels for social network extraction from text. In 14th Conference of the European Chapter of the Association for Computational Linguistics.

Robert Alter. 2008. Imagined cities: urban experience and the language of the novel. Yale University Press.

Mikhail M. Bakhtin. 1937. Forms of time and of the chronotope in the novel: Notes toward a historical poetics. Narrative dynamics: Essays on time, plot, closure, and frames, pages 15–24.

Terry Eagleton. 1996. Literary theory: An introduction. U of Minnesota Press.

Terry Eagleton. 2013. The English novel: an introduction. John Wiley & Sons.

David K. Elson and Kathleen McKeown. 2010. Automatic attribution of quoted speech in literary narrative. In AAAI.

David K. Elson, Nicholas Dames, and Kathleen R. McKeown. 2010. Extracting social networks from literary fiction. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 138–147.

Zhou GuoDong, Su Jian, Zhang Jie, and Zhang Min. 2005. Exploring various knowledge in relation extraction. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics.

Sanda Harabagiu, Cosmin Adrian Bejan, and Paul Morarescu. 2005. Shallow semantics for relation extraction. In International Joint Conference on Artificial Intelligence.

Hua He, Denilson Barbosa, and Grzegorz Kondrak. 2013. Identification of speakers in novels. In ACL (1), pages 1312–1320.

Nanda Kambhatla. 2004. Combining lexical, syntactic, and semantic features with maximum entropy models for extracting relations. In Proceedings of the ACL 2004 Interactive Poster and Demonstration Sessions, page 22. Association for Computational Linguistics.

Franco Moretti. 1999. Atlas of the European novel, 1800-1900. Verso.

Franco Moretti. 2005. Graphs, Maps, Trees: Abstract Models for a Literary History. Verso.

Truc-Vien T. Nguyen, Alessandro Moschitti, and Giuseppe Riccardi. 2009. Convolution kernels on constituent, dependency and sequential structures for relation extraction. In Conference on Empirical Methods in Natural Language Processing.

Raymond Williams. 1975. The country and the city. Oxford University Press.

Dmitry Zelenko, Chinatsu Aone, and Anthony Richardella. 2003. Kernel methods for relation extraction. The Journal of Machine Learning Research, 3:1083–1106.

Shubin Zhao and Ralph Grishman. 2005. Extracting relations with integrated information using kernel methods. In Proceedings of the 43rd Meeting of the ACL.

GutenTag: An NLP-driven Tool for Digital Humanities Research in the Project Gutenberg Corpus

Julian Brooke, Dept of Computer Science, University of Toronto ([email protected])
Adam Hammond, School of English and Theatre, University of Guelph ([email protected])
Graeme Hirst, Dept of Computer Science, University of Toronto ([email protected])

Abstract

This paper introduces a software tool, GutenTag, which is aimed at giving literary researchers direct access to NLP techniques for the analysis of texts in the Project Gutenberg corpus. We discuss several facets of the tool, including the handling of formatting and structure, the use and expansion of metadata which is used to identify relevant subcorpora of interest, and a general tagging framework which is intended to cover a wide variety of future NLP modules. Our hope is that the shared ground created by this tool will help create new kinds of interaction between the computational linguistics and digital humanities communities, to the benefit of both.

1 Introduction

The emerging field of digital literary studies has embraced not only statistical analysis of literary texts in the corpus linguistics tradition, but even more complex methods such as principal components analysis (Burrows, 1987), clustering (Rybicki, 2006), and topic modeling (Goldstone and Underwood, 2012; Jockers, 2013). At the same time, there is sustained interest in computational linguistics in tackling problems that are specific to literature, as evidenced by an annual dedicated workshop as well as various papers at major conferences (Elson et al., 2010; Wallace, 2012; He et al., 2013; Bamman et al., 2014). Though some work in the shared ground between these two fields is explicitly cross-disciplinary, this is still fairly atypical, reflecting a deep cultural barrier (Hammond et al., 2013): in most cases, digital humanists are using off-the-shelf statistical tools with little or no interaction with computer scientists, and computational linguists are developing literature-specific techniques which are unavailable or unknown to the digital humanist community. The high-level goal of the project proposed here is to create an on-going two-way flow of resources between these groups, allowing computational linguists to identify pressing problems in the large-scale analysis of literary texts, and to give digital humanists access to a wider variety of NLP tools for exploring literary phenomena. The context for this exchange of ideas and resources is a tool, GutenTag1, aimed at facilitating literary analysis of the Project Gutenberg (PG) corpus, a large collection of plain-text, publicly-available literature.

At its simplest level, GutenTag is a corpus reader; given the various eccentricities of the texts in Project Gutenberg (which reflects the diversity of the source texts and the rather haphazard nature of their collection), this application alone serves to justify its existence. A second facet of the tool is a corpus filter: it uses the information contained explicitly within the PG database and/or derived automatically from other sources to allow researchers to build subcorpora of interest reflecting their exact analytic needs. Another feature gives GutenTag its name: the tool has access to tagging models which represent the intersection of literary analysis needs and existing NLP methods. The output of GutenTag is either an XML corpus with tags (at both text and meta-textual levels) based on the TEI-encoding standard; or, if desired, direct statistical analysis of the distribution of tags across different subcorpora. None of the features of GutenTag mentioned above are intended to be static: GutenTag is a tool that will grow and improve with feedback from the digital humanities community and new methods from the computational linguistics community.

1 GutenTag is available at http://www.cs.toronto.edu/~jbrooke/gutentag/


2 Project Gutenberg

Project Gutenberg is a web-based collection of texts (mostly literary fiction such as novels, plays, and collections of poetry and short stories, but also non-fiction titles such as biographies, histories, cookbooks, reference works, and periodicals) which have fallen out of copyright in the United States. There are versions of Project Gutenberg in various countries around the world, but the development of GutenTag has been based on the US version.2 The entire contents of the current archive is almost fifty thousand documents, though the work here is based on the most recently released (2010) DVD image, which has 29,557 documents. Nearly all major canonical works of English literature (and many from other languages) published before 1923 (the limit of US copyright) are included in the collection. The English portion of the corpus consists of approximately 1.7 billion tokens. Although it is orders of magnitude smaller than other public domain collections such as HathiTrust, the Internet Archive, and Google Books, PG has some obvious advantages over those collections: all major modern digitization efforts use OCR technology, but the texts in Project Gutenberg have also been at least proofread by a human (some are hand-typed), and the entire corpus remains sufficiently small that it can be conveniently downloaded as a single package;3 this last is an important property relative to our interests here, since the tool assumes a complete copy of the PG corpus is present.

3 Reader

The standard format for texts in the PG corpus is plain text, most commonly the Latin-1 character set though some are UTF-8 Unicode. Generally speaking, the actual content is bookended by information about creation of the corpus and the copyright. The first challenge is removing this information, not a trivial task, given that the exact formatting is extremely inconsistent across texts in the corpus. GutenTag employs a fairly complex heuristic involving regular expressions; this handles some of the more troublesome cases by making sure that large sections of the text are not being tossed out. Other common extra-textual elements that we remove during this stage include references to illustrations and notes that are clearly not part of the text (e.g. transcriber's notes).

Most texts are structured to some degree, and this structure is reflected inconsistently in the raw Gutenberg texts by implicit indicators such as extra spacing, capitalized headings, and indentations. The structure depends on the type of literature, which may or may not be indicated in the datafile (see Section 4). Most books contain at least a title and chapter/section/part headings (which may be represented by a number, a phrase, or both); other common elements include tables of contents, introductions, prefaces, dedications, or initial quotations. Plays have an almost entirely different set of elements, including character lists, act/scene breaks, stage directions, and speaker tags. GutenTag attempts to identify common elements when they appear; these can be removed from the text under analysis if desired and/or used to provide structure to the text in the final output (as special tags, see Section 5). Note that this step generally has to occur before tokenization, since many markers of structure are destroyed in the tokenization process.

GutenTag is written in Python and built on top of the Natural Language Toolkit (Bird et al., 2009): for sentence and word tokenization, we use the NLTK regex tokenizer, with several pre- and post-processing tweaks to deal with specific properties of the corpus and to prevent sentence breaks after common English abbreviations. We are careful to preserve within-word hyphenation, contractions, and the direction of quotation marks.

2 http://www.gutenberg.org
3 http://www.gutenberg.org/wiki/Gutenberg:The_CD_and_DVD_Project
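The header and footer problem described above can be illustrated with a small regular-expression heuristic. The sketch below is not GutenTag's own (considerably more involved) heuristic; the marker patterns cover only the most common modern Project Gutenberg conventions and would miss many of the older, inconsistent variants the text mentions.

```python
import re

# Common (but by no means universal) markers around the actual book text.
START_RE = re.compile(r"\*\*\*\s*START OF (THIS|THE) PROJECT GUTENBERG EBOOK.*?\*\*\*",
                      re.IGNORECASE)
END_RE = re.compile(r"\*\*\*\s*END OF (THIS|THE) PROJECT GUTENBERG EBOOK",
                    re.IGNORECASE)

def strip_pg_boilerplate(raw):
    """Return the text between the start and end markers, or the whole text
    if no markers are found: a deliberately conservative fallback so that
    large sections of text are never tossed out by mistake."""
    start = START_RE.search(raw)
    end = END_RE.search(raw)
    begin = start.end() if start else 0
    stop = end.start() if end else len(raw)
    if stop <= begin:          # markers in an unexpected order: keep everything
        return raw
    return raw[begin:stop].strip()
```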

4 Subcorpus Filter

Taken as a whole, the Gutenberg corpus is generally too diverse to be of use to researchers in particular fields. Relevant digital humanities projects are far more likely to target particular subsections of the corpus, e.g. English female novelists of the late 19th century. Fortunately, in addition to the raw texts, each document in the PG corpus has a corresponding XML data file which provides a bibliographic record, including the title, the name of the author, the years of the author's birth and death, the language in which the text is written, the Library of Congress classification (sometimes multiple), and the subject (often multiple). GutenTag provides a complete list of each of the non-numerical tags for reference and allows the user to perform an exact or partial string match to narrow down subcorpora of interest or to combine lists of independently defined subcorpora into a single subcorpus.

Although they are extremely useful, there are numerous problems with the built-in PG annotations. While Library of Congress classification is generally reliable for distinguishing literature from other books, for instance, it does not reliably distinguish between genres of literature. Therefore, GutenTag distinguishes prose fiction from drama and poetry by (at present) simple classification based on the typical properties of these genres. For drama, it looks to see if there are significant numbers of speaker tags (which unfortunately appear in numerous distinct forms in the corpus); to distinguish poetry from prose fiction, it uses line numbers and/or the location of punctuation (in poetry, punctuation often appears at the end of lines of verse); collections of short stories can often be distinguished from novels by their titles (e.g. and other stories). We make these automatic annotations available as a "genre" tag to help users create a more-exact subcorpus definition.
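As a rough illustration of the kind of heuristic just described, the sketch below classifies a cleaned text by counting speaker-tag-like lines, line lengths, and end-of-line punctuation. This is not GutenTag's classifier; the patterns and thresholds are invented for readability and would need tuning on real PG texts.

```python
import re

SPEAKER_TAG = re.compile(r"^[A-Z][A-Z .']+\.")   # e.g. "HAMLET." at the start of a line
END_PUNCT = re.compile(r"[,;.!?]$")

def guess_genre(lines):
    """Very rough three-way genre guess for a cleaned Project Gutenberg text."""
    nonempty = [l.strip() for l in lines if l.strip()]
    if not nonempty:
        return "unknown"
    # Drama: many lines that look like capitalized speaker tags.
    speaker_ratio = sum(bool(SPEAKER_TAG.match(l)) for l in nonempty) / len(nonempty)
    if speaker_ratio > 0.05:
        return "drama"
    # Poetry: mostly short lines whose punctuation tends to fall at line ends.
    short_ratio = sum(len(l) < 45 for l in nonempty) / len(nonempty)
    end_punct_ratio = sum(bool(END_PUNCT.search(l)) for l in nonempty) / len(nonempty)
    if short_ratio > 0.7 and end_punct_ratio > 0.3:
        return "poetry"
    return "prose fiction"

def looks_like_story_collection(title):
    """Title cue mentioned in the text, e.g. '... and Other Stories'."""
    return bool(re.search(r"and other stories", title, re.IGNORECASE))
```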
Other useful information missing from the PG database includes the text's publication date and place and information about the author such as their gender, nationality, place of birth, education, marital status, and membership in particular literary schools. When possible, we collect additional information about texts and their authors from other structured resources such as Open Library, which has most of the same texts but with additional publication information and metadata, and Wikipedia, which only references a small subset of titles/authors, but usually in more detail. A more speculative idea for future work is to derive information about less-popular texts and authors from unstructured text.

We did not carry out a full independent evaluation of the (non-trivial) subcorpus filtering and reader features of GutenTag, but we nevertheless took steps to ensure basic functionality: after developing some initial heuristics, we sampled 30 prose texts, 10 poetry texts, and 10 plays randomly from the PG corpus based on our automatic classification, resampling and improving our classification heuristics until we reached perfect performance. Then, using those 50 correctly-classified texts, we improved our heuristics for removing non-textual elements and identifying basic text structure until we had perfect performance in all 50 texts (as judged by one of the authors). Needless to say, we avoided including heuristics that had no possibility of generalization across multiple texts (for instance, hand-coding the titles of books). We also used these texts to confirm that sub-corpus filtering was working as expected. GutenTag comes with a list of the texts that were focused on during development, with the idea that they could be pulled out using sub-corpus filtering and used as training or testing examples for more-sophisticated statistical techniques.

5 Tagging

Once a text has been tokenized, a tag can be defined as a string identifier, possibly with nominal or numerical attributes, which is associated with a span of tokens. Tags of the same type can be counted together, and their attributes can be counted (for nominal attributes) or summed or averaged (for numerical attributes) across a text, or across a subcorpus of texts. The particular tags desired in a run of GutenTag are specified by the user in advance. The simplest tag for each token is the token itself, or a lemmatized version of the token. Another tag of length one is the part-of-speech tag, which, in GutenTag, is currently provided by the NLTK part-of-speech tagger.

GutenTag also supports simple named entity recognition to identify the major characters in a literary work, by looking at repeated proper names which appear in contexts which indicate personhood. Any collection of words or phrases can be grouped under a single tag using user-defined lexicons, which can be nominal or numerical; as examples of this, GutenTag includes word properties from the MRC psycholinguistic database (Coltheart, 1980), the General Inquirer Dictionary (Stone et al., 1966) and high-coverage stylistic and polarity lexicons (Brooke and Hirst, 2013; Brooke and Hirst, 2014) which were built automatically using the variation within the PG corpus itself.

Tags above the word level include, most prominently, structural elements such as chapters identified in the corpus-reader step. Another tag supported in GutenTag is the TEI "said" tag, which is used to identify quoted speech and assign it to a specific character. The current version of "said" identification first detects the quotation convention being used in the text (i.e. single or double quotes), matches right and left quotes to create quote spans, and then looks in the immediate vicinity around the quotes to identify a character (as identified using the character module) to whom to assign the quotation. Though currently functional, this is the first module in line to be upgraded to a fully statistical approach, for instance based on the work of He et al. (2013).

As far as GutenTag is concerned, a tagger is simply a function which takes in a tokenized text and (optionally) other tags which have been identified earlier in the pipeline, and then outputs a new set of tags. Even complex statistical models are often complex only in the process of training, and classification is often a matter of simple linear combinations of features; adding new tagging modules should therefore be simple and seamless from both a user's and a developer's perspective. To conclude this section, we will discuss some of the ideas for kinds of tagging that might be useful from a digital humanities perspective as well as interesting for computational linguists. Some have been addressed already, and some have not. The following is intended not as an exhaustive list but rather as a starting point for further discussion.

At the simpler end of the spectrum, we can imagine taggers which identify some of the classic poetic elements such as rhyme scheme, meter, anaphora, alliteration, onomatopoeia, and the use of foreign languages (along with identification of the specific language being used). Metaphor detection is of growing interest in NLP (Tsvetkov et al., 2014), and would undoubtedly be useful for literary analysis (as would simile detection, a somewhat simpler task). Another challenging but important task is the identification of literary allusions: we envision not only the identification of allusions, but also the establishment of direct connections between alluding and alluded works with the PG corpus, which we could then employ to derive metrics of influence and canonicity within the corpus. We are also interested, where appropriate, in identifying features relevant to narratives: when analyzing a novel, for example, it would be interesting to be able to tag entire scenes with a physical location, a time of day, and a list of participants; for an entire narrative, it would be useful to identify particular points in the plot structure such as climax and dénouement, and other kinds of variation such as topic (Kazantseva and Szpakowicz, 2014) and narrator viewpoint (Wiebe, 1994).

6 Interfaces

GutenTag is intended for users with no programming background. The potential options are sufficiently complex that a run of GutenTag is defined within a single configuration file, including any number of defined subcorpora, the desired tag sets (including various built-in tagging options and user-defined lexicons), and options for output. We also offer a web interface for small-scale, limited analysis for those who do not want to download the entire corpus.

Given our interest in serving the digital humanities community, it is important that the output options reflect their needs. For those looking only for a tagged corpus, the Text Encoding Initiative (TEI) XML standard4 is the obvious choice for corpus output format. The only potential incompatibility is with overlapping but non-nested tags (which are supported by our tag schema but not by XML), which are handled by splitting up the tags over the smaller span and linking them using an identifier and "next" and "prev" attributes. Numerical attributes for lexicons are handled using a "value" attribute. Again, users can choose whether they want to include structural elements that are not part of the main text, and whether they want to include these as part of the text, or as XML tags, or both.

For those who want numerical output, the default option is a count of all the desired tags for all the defined subcorpora. The counts can be normalized by token counts and/or divided up into scores for individual texts. We also allow counts of tags occurring only inside other tags, so that, for instance, different sub-genres within the same texts can be compared. GutenTag is not intended to provide more-sophisticated statistical analysis, but we can include it in the form of interfaces to the Numpy/Scipy Python modules if there is interest. We will include support for direct passing of subcorpora and the portions of subcorpora with a particular tag to MALLET (McCallum, 2002) for the purposes of building topic models, given the growing interest in their use among digital humanists.

4 http://www.tei-c.org/Guidelines/
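The numerical output just described boils down to counting tags and normalizing by token counts, per subcorpus. The sketch below shows that bookkeeping in isolation; the data layout (token lists paired with lists of tag names per text, texts grouped into named subcorpora) is an assumption for illustration, not GutenTag's internal representation.

```python
from collections import Counter

def tag_frequencies(subcorpora, per_thousand=True):
    """subcorpora: {name: [(tokens, tag_names), ...]} for each text.
    Returns {subcorpus name: {tag name: count, or rate per 1000 tokens}}."""
    results = {}
    for name, texts in subcorpora.items():
        counts, n_tokens = Counter(), 0
        for tokens, tag_names in texts:
            n_tokens += len(tokens)
            counts.update(tag_names)
        if per_thousand and n_tokens:
            results[name] = {t: 1000.0 * c / n_tokens for t, c in counts.items()}
        else:
            results[name] = dict(counts)
    return results
```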

7 Comparison with other tools

GutenTag differs from existing digital humanities text-analysis offerings in its focus on large-scale analysis using NLP techniques. Popular text-analysis suites such as Voyant5 and TAPoR6 present numerous useful and user-friendly options for literary scholars, but their focus on individual texts or small groups of texts as well as output which consists mostly of simple statistical measures or visualizations of surface phenomena means that they are unable to take advantage of the new insights that larger corpora and modern NLP methods can (potentially) provide. As digital humanists become increasingly interested in statistical approaches, the limiting factor is not so much the availability of accessible statistical software packages for doing analysis but rather the ability to identify interesting subsets of the data (including text spans within texts) on which to run these tools; GutenTag supplements these tools with the goal of producing more diverse, meaningful, and generalizable results.

GutenTag also has some overlap in functionality with literary corpus tools such as PhiloLogic7, but such tools are generally based on manual annotations of structure and again offer only surface treatment of linguistic phenomena (e.g. identification of keywords) for text retrieval. We also note that there is a simple Python corpus reader for Gutenberg available8, but it is intended for individual text retrieval via the web, and the only obvious overlap with GutenTag is the deletion of copyright headers and footers; in this regard GutenTag is noticeably more advanced, since the existing reader relies only on the presence or absence of keyphrases in the offending spans.

There are of course many software toolkits that offer off-the-shelf solutions to a variety of general computational linguistics tasks: GutenTag makes direct use of NLTK (Bird et al., 2009), but there are numerous other popular options; our choice of NLTK mostly reflects our preference for using Python, which we believe will allow for quicker and more flexible development in the long run. What is more important is that GutenTag is intended to make only modest use of off-the-shelf techniques, because we strongly believe that using NLP for literary analysis will require building literature-specific modules, even for tasks that are otherwise well-addressed in the field. In numerous ways, literary texts are simply too different from the newswire and web texts that have been the subject of the vast majority of work in the field, and there are many tasks fundamental to literary study that would be only a footnote in other contexts. Our intent is that GutenTag will become a growing repository for NLP solutions to tasks relevant to literary analysis, and as such we hope those working in digital humanities or computational linguistics will bring to our attention new modules for us to include. It is this inherently cross-disciplinary focus that is the clearest difference between GutenTag and other tools.

8 Conclusion

In the context of computational analysis of literature, digital humanists and computational linguists are natural symbionts: while increasing numbers of literary scholars are becoming interested in the insights that large-scale computational analysis can provide, they are often limited by their lack of technical expertise. GutenTag meets the needs of such scholars by providing an accessible tool for building large, highly customizable literary subcorpora from the PG corpus and for performing pertinent advanced NLP tasks in a user-appropriate manner. By thus drawing increasing numbers of literary scholars into the realm of computational linguistics, GutenTag promises to enrich the latter field by supplying it with new problems, new questions, and new applications. As the overlap between the spheres of digital humanities and computational linguistics grows larger, both fields stand to benefit.

5 http://voyant-tools.org/
6 http://tapor.ca/
7 https://sites.google.com/site/philologic3/
8 https://pypi.python.org/pypi/Gutenberg/0.4.0

Acknowledgements

This work was supported by the Natural Sciences and Engineering Research Council of Canada, the MITACS Elevate program, and the University of Guelph.

References

David Bamman, Ted Underwood, and Noah A. Smith. 2014. A Bayesian mixed effects model of literary character. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL '14).

Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python. O'Reilly Media Inc.

Julian Brooke and Graeme Hirst. 2013. Hybrid models for lexical acquisition of correlated styles. In Proceedings of the 6th International Joint Conference on Natural Language Processing (IJCNLP '13).

Julian Brooke and Graeme Hirst. 2014. Supervised ranking of co-occurrence profiles for acquisition of continuous lexical attributes. In Proceedings of the 25th International Conference on Computational Linguistics (COLING 2014).

John F. Burrows. 1987. Computation into Criticism: A Study of Jane Austen's Novels and an Experiment in Method. Clarendon Press, Oxford.

Max Coltheart. 1980. MRC Psycholinguistic Database User Manual: Version 1. Birkbeck College.

David K. Elson, Nicholas Dames, and Kathleen R. McKeown. 2010. Extracting social networks from literary fiction. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL '10).

Andrew Goldstone and Ted Underwood. 2012. What can topic models of PMLA teach us about the history of literary scholarship? Journal of Digital Humanities, 2.

Adam Hammond, Julian Brooke, and Graeme Hirst. 2013. A tale of two cultures: Bringing literary analysis and computational linguistics together. In Proceedings of the 2nd Workshop on Computational Linguistics for Literature (CLFL '13).

Hua He, Denilson Barbosa, and Grzegorz Kondrak. 2013. Identification of speakers in novels. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL '13).

Matthew Jockers. 2013. Macroanalysis: Digital Methods and Literary History. University of Illinois Press, Champaign, IL.

Anna Kazantseva and Stan Szpakowicz. 2014. Hierarchical topical segmentation with affinity propagation. In Proceedings of the 25th International Conference on Computational Linguistics (COLING 2014).

Andrew Kachites McCallum. 2002. Mallet: A machine learning for language toolkit. http://mallet.cs.umass.edu.

Jan Rybicki. 2006. Burrowing into translation: Character idiolects in Henryk Sienkiewicz's trilogy and its two English translations. Literary and Linguistic Computing, 21:91–103.

Philip J. Stone, Dexter C. Dunphy, Marshall S. Smith, and Daniel M. Ogilvie. 1966. The General Inquirer: A Computer Approach to Content Analysis. MIT Press.

Yulia Tsvetkov, Leonid Boytsov, Anatole Gershman, Eric Nyberg, and Chris Dyer. 2014. Metaphor detection with cross-lingual model transfer. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL '14).

Byron C. Wallace. 2012. Multiple narrative disentanglement: Unraveling Infinite Jest. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT '12).

Janyce M. Wiebe. 1994. Tracking point of view in narrative. Computational Linguistics, 20(2):233–287, June.

A Pilot Experiment on Exploiting Translations for Literary Studies on Kafka's "Verwandlung"

Fabienne Cap, Ina Rösiger and Jonas Kuhn
Institute for Natural Language Processing
University of Stuttgart, Germany
[cap|roesigia|kuhn]@ims.uni-stuttgart.de

Abstract

We present a manually annotated word alignment of Franz Kafka's "Verwandlung" and use this as a controlled test case to assess the principled usefulness of word alignment as an additional information source for the (monolingually motivated) identification of literary characters, focusing on the technically well-explored task of co-reference resolution. This pilot set-up allows us to illustrate a number of methodological components interacting in a modular architecture. In general, co-reference resolution is a relatively hard task, but the availability of word-aligned translations can provide additional indications, as there is a tendency for translations to explicate underspecified or vague passages.

1 Introduction

We present a pilot study for a methodological approach starting out with combinations of fairly canonical computational linguistics models but aiming to bootstrap a modular architecture of text-analytical tools that is more and more adequate from the point of view of literary studies. The core modeling component around which our pilot study is arranged is word-by-word alignment of a text and its translation, in our case Franz Kafka's "Verwandlung" (= Metamorphosis) and its translation to English. As research in the field of statistical machine translation has shown, word alignments of surprisingly good quality can be induced automatically exploiting co-occurrence statistics, given a sufficient amount of parallel text (from a reasonably homogeneous corpus of translated texts). In this pilot study, we present a manually annotated reference word alignment and use this to assess the principled usefulness of word alignment as an additional information source for the (monolingually motivated) task of identifying mentions of the same literary character in texts (Bamman et al., 2014). This is a very important analytical sub-step for further analysis (e.g., network analysis, event recognition for narratological analysis, stylistic analysis of character speech etc.). Literary character identification is related to, but not identical to, named entity recognition, as is pointed out in Jannidis et al. (2015). In addition, co-reference resolution is required to map pronominal and other anaphoric references to full mentions of the character, using his or her name or other characteristic descriptions. In our present study we focus on the technically well-explored task of co-reference resolution.1 This pilot set-up allows us to illustrate a number of methodological components interacting in a modular architecture.

2 Background

Whenever computational linguists exchange thoughts with scholars from literary studies who are open-minded towards "the Digital Humanities", the feeling arises that the machinery that computational linguistics (CL)/Natural Language Processing (NLP) has in their toolbox should in principle open up a variety of analytical approaches to literary studies – if only the tools and models were

1 In their large-scale analysis of >15,000 English novels, Bamman et al. (2014) adopt a simpler co-reference resolution strategy for character clustering. Our work explores the potential for a more fine-grained analysis.


Proceedings of NAACL-HLT Fourth Workshop on Computational Linguistics for Literature, pages 48–57, Denver, Colorado, June 4, 2015. c 2015 Association for Computational Linguistics appropriately adjusted to the underlying research adopting to meta-level analysis tools for visualiza- questions and characteristics of the targeted texts. tion and stepwise evaluation. Ideally, it should also At a technical level, important analytical steps for make it clear that the literary scholars’ insights into such studies are often closely related to “classical” the texts and the higher-level questions are instru- tasks in CL and NLP. Possible applications include mental in improving the system at hand, moving to the identification of characters in narrative texts, more appropriate complex models (taking a view on extraction of interactions among them as they are modeling in the sense of McCarty (2005) as an it- depicted in the text, and possibly more advanced erative cycle of constructing, testing, analyzing, and analytical categories such as the focalization of reconstructing intermediate-stage models, viewed as characters in the narrative perspective. Based on tools to advance our understanding of the modeled preprocessing steps that extract important feature object).2 information from formal properties of the text, algorithmic models that are both theoretically 2.2 Pilot Character of Word-aligned Text informed and empirically “trained” using reference This paper tries to make a contribution in this data, should lead to automatic predictions that will spirit, addressing a methodological modular “build- at least support exploration of larger collections ing block” which (i) has received enormous atten- of literary texts and ideally also the testing of tion in technically oriented NLP work, and (ii) in- quantitative hypotheses. tuitively seems to bear considerable potential as an Of course, the qualifying remark that the tools and analytical tool for literary studies – both for in-depth models would first need to be adjusted to the higher- analysis of a small selection of texts and for scal- level questions and the characteristics of the texts able analysis tasks on large corpora – and for which under consideration weighs quite heavily: despite (iii) a realistic assessment leads one to expect the the structural similarity with established NLP anal- need for some systematic adjustments in the meth- ysis chains, the actual analytical questions are dif- ods and machinery to make the established NLP ap- ferent, and NLP tools optimized for the standard do- proach fruitful in literary studies. We are talking mains (mostly newspaper text and parliamentary de- about the use of translations of literary works into bates) may require considerable adjustment to yield other languages and techniques from NLP research satisfactory performance on literary texts. on statistical machine translation and related fields. 2.1 Methodological Considerations Literature translations exist in abundance (relatively speaking), often in multiple variants for a given tar- The large-scale solution to these challenges would get language, and multiple target languages can be be to overhaul the entire chain of standard NLP pro- put side by side. 
Such translation corpora (“paral- cessing steps, adjusting it to characteristics of liter- lel corpora”) are a natural resource for translational ary texts and then add new components that address studies, but research in various subfields of NLP has analytical questions beyond the canonical NLP of- shown that beyond this, the interpretive work that a ferings. This would however presuppose a master translator performs as a by-product of translating a plan, which presumably requires insights that can text, can often be exploited for analytical questions only come about during a process of stepwise ad- that are not per se taking a cross-linguistic perspec- justments. So, a more realistic approach is to boot- tive: Tentative word-by-word correspondences in a strap a more adaquate (modular) system architec- parallel corpus can be induced automatically, taking ture, starting from some initial set-up building on ex- advantage of co-occurrence statistics from a large isting machinery combined in a pragmatic fashion. collection, with surprising reliability. These “word Everyone working with such an approach should be alignment” links can then be used to map concepts aware of the limited adequacy of many components that are sufficiently invariant across languages. The used – but this may be a “healthy” methodological so-called paradigm of “annotation projection” (pio- training: no analytical model chain should be relied on without critical reflection; so an imperfect initial 2For a more extensive discussion of our methodological con- architecture may increase this awareness and help siderations, see also Kuhn and Reiter (2015 to appear).

49 neered by Yarowsky et al. (2001)) has been enor- deal with. Here, the translation avoids the long rel- mously successful, even for concepts that one would ative clause (die . . . brachte) from the original after not consider particularly invariant (such as gram- the initial subject, at the cost of using a completely matical categories of words): here, strong statistical different sentence structure. As a side effect, an ad- tendencies can be exploited. ditional referential chain (Mrs. Grubach’s cook – Since the statistical induction of word alignments she) is introduced in the English translation. So, it requires no knowledge-based preprocessing (the re- is an open question how effective it is in practice to quired sentence alignment can also be calculated au- use translational information in co-reference resolu- tomatically), it can in principle be applied to any col- tion of the original.4 lection of parallel texts. Hence, it is possible to test For the purposes of this paper, which are pre- quite easily for which analytical categories that are dominantly methodological, aiming to exemplify of interest to literary scholars the translator’s inter- the modular bootstrapping scenario we addressed pretive guidance could be exploited. above, the combination of word-alignment and co- As pointed out in the introduction, we pick reference is a convenient playing ground: we can literary character identification as an analytical readily make use of existing NLP tool chains to target category that is very central to literary studies reach analytical goals that are structurally not far of narrative text, focusing on the task of building from categories of serious interest. At the same time, chains of co-referent mentions of the same charac- the off-the-shelf machinery is clearly in a stage of ter. Co-reference resolution is a relatively hard task, limited adequacy, so we can point out the need for but the availability of word-aligned translations careful critical reflection of the system results. can provide additional indications: surprisingly often, translators tend to use a different type of 2.3 Reference Data and Task-based Evaluation referential phrase in a particular position: pronouns To go beyond an impressionistic subjective assess- are translated as more descriptive phrases, and ment of some analytical tool’s performance (which vice versa. A hypothesis that is broadly adapted in can be highly misleading since there are just too translational studies states there is a tendency for many factors that will escape attention), it is crucial translations to explicate underspecified or vague to rely on a controlled scenario for assessing some passages (Blum-Kulka, 1986). An example of modular component. It is indeed not too hard to ar- “explication” that affects character mentions is rive at a manually annotated reference dataset, and found in the second sentence of the English trans- this can be very helpful to verify some of the work- lation of Kafka’s “Der Prozess”:3 the apposition ing assumptions. In our case, working with trans- seiner Zimmervermieterin (“his landlady”), whose lated literary texts, we made various assumptions: attachment is structurally ambiguous, is translated automatic word alignment will be helpful for ac- with a parenthetical that makes the attachment to cessing larger amounts of text; standard techniques Mrs. 
Grubach explicit: from machine translation are applicable to this; word alignment can be used to “project” invariant analyt- (DE) Die Kochin¨ der Frau Grubach, seiner Zim- ical categories for text segments etc. mervermieterin, die ihm jeden Tag gegen acht A manual word alignment gives us a reference Uhr fruh¨ das Fruhst¨ uck¨ brachte, kam diesmal nicht. dataset that can give us a clearer picture about many (EN) Every day at eight in the morning he was of these assumptions: we can compare an automatic brought his breakfast by Mrs. Grubach’s cook alignment obtained by various techniques with the – Mrs. Grubach was his landlady – but today reference alignment; we can check how “projection” she didn’t come. works in the ideal case that the alignment is correct

As the example also illustrates, potential referen- 4An additional advantage of literary texts that we are not tial ambiguity is however only one aspect translators going into here is that often multiple translations (to the same or different languages) are available, which a robust automatic 3www.farkastranslations.com/books/Kafka_ approach could exploit, hence multiplying the chance to find a Franz-Prozess-en-de.html disambiguating translation (see also (Zhekova et al., 2014)).

50 etc. With Kafka’s “Verwandlung” we chose a liter- ary text that has some convenient properties at a su- perficial level. At the same time, Kafka’s extremely internalized narrative monoperspective makes this text highly marked. So, in a future study that goes deeper into narrative perspectives, this reference text may be complemented with other examples.

3 Manual Word Alignment The basis for a good word alignment is a reliable sentence alignment. However, the latter is a chal- lenging task on its own – especially when it comes to literature translations – and is thus beyond the scope of this paper. We start from a freely avail- able version of Franz Kafka’s “Verwandlung” which has been carefully sentence-aligned and provided for download5. We selected “Verwandlung” for two reasons: the first has to do with the original lan- guage of the work, here: German. Usually, hu- man translators tend to explicate ambiguities in their translations. We thus assume that the word align- ment will be useful for German co-reference reso- lution. The other has to do with the limitation of this work, both in terms of quantity (the parallel text to be manually aligned consisted of only 675 lines), and in terms of the low number of characters in- volved. It is fairly simple to resolve ambiguities oc- curring in this limited set of characters for a human annotator, but that does not neccessarily apply to an automatic co-reference resolver.For this pilot study, manual word alignments were established by a Ger- man native speaker with a computational linguistics background. The annotator was asked to mark trans- lational correspondences. In lack of suitable corre- spondences, words are to be left unaligned.
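Section 2.2 notes that word alignments can be induced automatically from co-occurrence statistics, and the conclusion announces plans to replace the manual alignment with an automatic one. As an illustration of that idea only, the following is a minimal, self-contained EM sketch in the spirit of IBM Model 1; it is not the alignment setup used in this paper, and it omits refinements such as a NULL word.

```python
from collections import defaultdict

def ibm1_translation_table(bitext, iterations=10):
    """bitext: list of (source_tokens, target_tokens) sentence pairs.
    Returns t[(source_word, target_word)], a crude lexical translation table."""
    target_vocab = {w for _, tgt in bitext for w in tgt}
    t = defaultdict(lambda: 1.0 / len(target_vocab))   # uniform initialisation

    for _ in range(iterations):
        count = defaultdict(float)   # expected counts c(s, w)
        total = defaultdict(float)   # expected counts c(s)
        for src, tgt in bitext:
            for w in tgt:
                norm = sum(t[(s, w)] for s in src)
                for s in src:
                    frac = t[(s, w)] / norm
                    count[(s, w)] += frac
                    total[s] += frac
        for (s, w), c in count.items():
            t[(s, w)] = c / total[s]
    return t

def align(src, tgt, t):
    """Link every target word to its most probable source word."""
    return [(max(src, key=lambda s: t[(s, w)]), w) for w in tgt]
```

Run over the sentence-aligned German and English editions of "Verwandlung", such a table could then be checked against the manual reference alignment described in this section.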

3.1 Results In the following, we quantitatively summarize the alignments we established for “Verwandlung”, fo- cussing on personal pronouns and their respective translations. Let us first consider the translations for the pronouns of the original, German, language in Table 1 (a): it can be seen that indeed there are a few cases where a German pronoun is translated Figure 1: Screenshot of manual word alignment tool. It into a more specified person in English (see high- can be seen how the alignment from the second he to der lighted markup of the respective table cells). An ex- Vater helps disambiguate the co-reference. 5http://opus.lingfil.uu.se/Books.php

51 (a) German pronouns together with their English translations German English translation all he his himself him Gregor’s father er 315 12 9 2 1 339 him it he his ihn 50 17 9 2 78 him he his it ihm 42 11 4 2 59 his he the him Gregor’s sein/seine[nmrs] 142 5 5 4 1 157 she her his sister sie (sg) 107 24 1 132 her she the his sister’s ihr/ihre[nmrs] (sg) 57 10 4 1 72 they their/them the food the pain Gregor’s father and mother sie (pl) 52 8 1 1 1 63 their they them ihr/ihre[nmrs] (pl) 25 1 1 15 (b) English pronouns together with their alignment to the original German text English German original text all er ihm ihn Gregor man sein/sein[er] der Vater he 349 12 9 10 5 5 2 390 ihn ihm Gregor er sein/sein[emrs] sich Vater him 51 43 15 11 7 6 1 134 die sein/sein[emrs] de[nmrs] ihm Gregors eine[nr] des Vaters his 140 140 111 4 4 4 2 380 sie ihr/ihre[rm] die die Schwester die Bedienerin die Mutter Grete she 115 11 2 2 2 1 1 134 ihr/ihre[nmr] de[mnr] sie die (die) Mutter die Schwester das her 51 24 22 17 2 2 2 120 Table 1: Overview of how German (a) and English (b) pronouns have been translated. The translations are obtained through manual alignment of the parallel German and English editions of the work. Highlighting indicates pronouns where the translations might actually help co-reference resolution. ample is “er” (= “he”), which in one case is trans- translation direction. While “he” has been trans- lated as “Gregor’s father”. In case of the plural pro- lated 10 times from “Gregor” and two times from noun “sie” (= “they”) we can see that it is trans- “der Vater” (= “the father”), we find a more diverse lated as “Gregor’s father and mother” in one case distribution when looking at the female counterpart and into the more abstract entities “the food” and “she”, which has amongst others been translated “the pain”. Even though not denoting personal enti- from “die Schwester” (= “the sister”), “die Bedi- ties, the latter can still help resolving the other pro- enerin” (= “the charwoman”), “die Mutter” (= “the nouns that might occur in the close context. In Ta- mother”) or “Grete”. In the next section, we will run ble 1 (b), we give results for the opposite direction, a state-of-the-art co-reference resolver on both the namely we show the German original words from German original and the English translation of the which the English pronouns have been translated. novel. For a subset of pronouns, we will then man- Comparing these two tables, we can see that in gen- ually compare the outcome of the resolver with the eral, more English pronouns are used than German translation to see in which of the cases highlighted ones (cf. last column “all” indicating the frequency in Table 1 (where access to a translation might help with which the pronoun has occurred overall in the co-reference resolution), the access to the translation text). Be it a consequence thereof or not, we also actually can improve co-reference resolution. find more resolved co-referential ambiguities in this

52 4 Automatic Coreference Resolution (2) He had thought that nothing at all remained from his father’s business, at least he had never told Noun phrase coreference resolution is the task of him anything different, and Gregor had never determining which noun phrases (NPs) in a text asked him about it anyway. or dialogue refer to the same real-world entities For an automatic system, it is easy to confuse the (Ng, 2010). Coreference resolution has been two male persons present in the sentence, as they are extensively addressed in NLP research, e.g. in both singular and masculine. Interestingly, humans the CoNLL shared task 2011 and 2012 (Pradhan can easily resolve these cases. et al., 2011; Pradhan et al., 2012)6 or in the Se- Due to the above mentioned reasons, it is particu- mEval shared task 2010 (Recasens et al., 2010)7. larly important to exploit given resources in a certain State-of-the-art tools that take into account a variety domain. In the literature area, translations into many of linguistic rules and features, most of the time languages are typically available. In the following, in a machine learning setting, achieve promising we will explain the benefits of using such transla- results. However, when applying co-reference tions by showing examples of how manual word resolution tools on out-of-domain texts, i.e. texts alignment can help co-reference resolution, here in that the system has not been trained on, performance the case of “Verwandlung”. We will also talk about typically decreases. Thus, co-reference resolution the prospects of automatic word alignment. on literature text is even more challenging as most state-of-the-art co-reference resolver are trained on 4.1 Experimental Setup newspaper text. For a system that has been trained For English, we perform our experiments using on newspaper articles, it is difficult to resolve the the IMS HotCoref co-reference resolver (Bjorkelund¨ longer literary texts that typically revolve around and Kuhn, 2014) as a state-of-the-art co-reference a fixed set of characters. In our experiments, we resolution system that obtains the best results pub- for example observe a tendency of the resolver to lished to date on the English CoNLL 2012 Shared create several co-reference chains for one character. Task data8. It models co-reference as a directed Domain-adaptation is time-consuming, as it often rooted tree, making it possible to use non-local fea- requires manually designed gold data to increase tures that go beyond pair-based information. We performance. Moreover, recall that Kafka’s “Ver- use the English non-local model9 that comprises the wandlung” has been written 100 years ago and that training and development datasets from the CoNLL German language has been changing in this time 2012 shared task. These datasets, as common in period. This might lead to additional sparsities. most NLP tasks, mainly consists of news text and Apart from that, there are even some more general dialogue. difficulties an automatic co-reference resolver has to For German, our experiments are also based on the deal with: First, it is difficult for a system to resolve IMS HotCoref system, but as German is not a lan- an NP that has more than one potential antecedent guage that is featured in the standard resolver, we candidates that match morpho-syntactically, i.e. that first had to adapt it to it. These adaptations in- agree, for example, in number and gender. 
Second, clude gender and number agreement, lemma-based often background or world knowledge is required (sub)string match and a feature that addresses Ger- to find the right antecedent. Consider the following man compounds, to name only a few. Again, the examples taken from “Verwandlung” showing gold training data consists of newspaper texts as we base co-reference annotations through different colour our experiments on the Tueba-D/Z SemEval data10 markup: Gregor and the father. (version 8). (1) And so ran up to , stopped when he his father 8www.ims.uni-stuttgart.de/forschung/ his father stopped, scurried forwards again ressourcen/werkzeuge/HOTCoref.en.html when he moved, even slightly. 9www2.ims.uni-stuttgart.de/cgi-bin/ anders/dl?hotcorefEngNonLocalModel 6http://conll.cemantix.org/2012 10http://www.sfs.uni-tuebingen.de/de/ 7http://stel.ub.edu/semeval2010-coref/ ascl/ressourcen/corpora/tueba-dz.html

53 4.2 Using Alignments above. For Example (1), Figure 2 (a) shows the pro- In order to assess the usefulness of word alignments posed co-reference annotations by the DE and EN in co-reference resolution, we ran IMS-HotCoref on co-reference resolver on the right hand side while both the German original text and its English trans- gold annotations are shown on the left hand side. lation. Then, we had a closer look at the sentences The German sentence is much easier to process in which having access to the translation of a pro- for an automatic system, as it contains fewer pro- noun presumably helps its resolution. In Table 2, nouns and many identical definite descriptions, and we show how the co-reference resolver performs for so unsurprisingly, the output of the German tool each of the German translations being highlighted is correct. The English system, however, wrongly in Table 1(a). It can be seen that in 3 out of 7 cases, links the second pronoun he (marked with a box) to the word alignment would indeed improve the reso- Gregor (=the first he), as there are two potential an- lution. tecedents in the sentence that both agree in number and gender with the anaphor. German English IMS-HotCoref correct? If we have word alignments, as shown in Figure 1, er Gregor’s father der NO we can see that the second he is aligned with der seiner Gregor’s der Zug NO Vater (again, marked with a box), and therefore we sie (sg) his sister Schwester YES now know that the tool’s assignment was wrong. ihre his sister’s Schwester YES the food Herren NO For Example (2), the word alignment again helps the pain Schmerzen YES predict the right co-reference links. The output of sie (pl) Gregor’s mother the tool and the right gold annotation is shown in Eltern YES and father Figure 2 (b). The English co-reference resolver Table 2: Results of the German co-reference resolver. wrongly puts the second he (marked by a box) in the co-reference chain describing Gregor, but the word For the opposite direction, where the German orig- alignment (not shown for Example (2)) tells us that inal text is assumed to help resolve coreferential this is not the case: it actually refers to the father. ambiguities, Table 3 contains the results of the co- We also experimented with a second EN co- reference resolver for all crucial occurrences of the reference resolver, the Stanford co-reference sys- pronoun “he”, as highlighted in Table 1(b). tem as part of the Stanford Core NLP tools11, but the results were similar to the IMS HotCoref sys- English German IMS-HotCoref correct? tem. When comparing two system outputs or gold Gregor one of the trainees NO annotations with the output predicted by the sys- Gregor Gregor YES tem, the ICARUS Coreference Explorer12 (Gartner¨ Gregor Gregor YES et al., 2014) is a useful tool to browse and search co- Gregor (one after) another NO reference-annotated data. It can display co-reference Gregor Gregor YES Gregor Gregor YES annotations as a tree, as an entity grid, or in a stan- he Gregor father NO dard text-based display mode. Particularly useful in Gregor Gregor YES our case is the fact that the tool can compare two Gregor Gregor YES different annotations on the same document. In the Gregor Gregor YES differential view, the tool analyses discrepancies be- der Vater Gregor NO tween the predicted and the gold annotation (or two der Vater Gregor NO predicted annotations, respectively) and marks dif- Table 3: Results of the English co-reference resolver. ferent types of errors with different colors. 
Figure 3 shows an exemplary differential view between the And even here, we find evidence in 5 of 12 cases Stanford and the HotCoref system for Kafka’s “Ver- that the alignment to the original language would wandlung”. help co-reference resolution. Thereof, the latter two 11nlp.stanford.edu/software/corenlp.shtml cases are particularly challenging. In fact, we have 12www.ims.uni-stuttgart.de/forschung/ already introduced them as Examples (1) and (2) ressourcen/werkzeuge/icarus.html

54 gold co-reference annotations output of IMS-HotCoref And so he ran up to his father, stopped when And so he ran up to his father, stopped when EN his father stopped, scurried forwards again his father stopped, scurried forwards again when he moved, even slightly. when he moved, even slightly. Und so lief er vor dem Vater her, stockte, wenn Und so lief er vor dem Vater her, stockte, wenn DE der Vater stehen blieb, und eilte schon wieder der Vater stehen blieb, und eilte schon wieder vorwarts,¨ wenn sich der Vater nur ruhrte.¨ vorwarts,¨ wenn sich der Vater nur ruhrte.¨ (a) Example (1) gold co-reference annotations output of IMS-HotCoref He had thought that nothing at all remained He had thought that nothing at all remained from his father’s business, at least he had from his father’s business, at least he had EN never told him anything different, and Gregor never told him anything different, and Gregor had never asked him about it anyway. had never asked him about it anyway. Er war der Meinung gewesen , daß dem Vater Er war der Meinung gewesen , daß dem Vater von jenem Geschaft¨ her nicht das Geringste von jenem Geschaft¨ her nicht das Geringste DE ubriggeblieben¨ war , zumindest hatte ihm der ubriggeblieben¨ war , zumindest hatte ihm der Vater nichts Gegenteiliges gesagt , und Gregor Vater nichts Gegenteiliges gesagt , und Gregor allerdings hatte ihn auch nicht darum gefragt . allerdings hatte ihn auch nicht darum gefragt . (b) Example (2) Figure 2: Illustration of gold co-reference annotations and tool outputs for Examples (1)+(2).

5 Related Work the literature domain, in contrast to general language texts. In lack of a reasonable amount of parallel As mentioned earlier, the “annotation projection” training data to train an automatic word alignment paradigm was first described by Yarowsky et al. of the language pair Russian to German, they de- (2001) in order to improve POS-tagging. How- veloped an alignment tool which facilitates manual ever, it has proven useful for a number of other ap- alignment. In contrast to previous works, they used plications, e.g. for multilingual co-reference res- not only one translation but different German trans- olution. Most approaches aim at projecting co- lations of a Russian novel. Mikhaylova (2014) ap- references which are available for one (usually well- plied automatic word alignment to the same Rus- resourced) language to another (less-resourced) lan- sian novel and its German translations, trained on guage for which no tools or not even annotated the novel itself. In order to improve word align- training data are available. The degree of auto- ment performance on such a comparably small train- matic processing ranges from using manually an- ing corpus, she made use of simple generalisation notated co-references and hand-crafted translations techniques (e.g. lemmatisation). Most previous (e.g. Harabagiu and Maiorano (2000)) to automati- works on multilingual co-reference resolution fo- cally obtained word alignments and combined with cus on using co-referential annotations in one lan- a manual post-editing of the obtained co-references guage to obtain the same kind of annotations in an- (e.g. (Postolache et al., 2006)). Finally, some ap- other language. In our work, we want to improve proaches make use of automatic word alignment, co-reference resolution in one (sufficiently well re- but instead of manual post-editing, they access the sourced) language by using translations from or into quality of the projection through training an own another language. This underlying concept has al- co-reference resolver for the under-resourced lan- ready been described by Harabagiu and Maiorano guage based on the projected data (e.g. de Souza (2000). However, they rely on manual projections and Orasan˘ (2011) and Rahman and Ng (2012)). instead of automatic word alignment and moreover, Zhekova et al. (2014) were to our knowledge the they apply their approach to general language text first ones who projected co-reference annotations in and not to literature.

55 Figure 3: The ICARUS differential error analysis explorer illustrates the differences between the Stanford and the IMS HotCoref system using colour markup.

6 Conclusion and Future Work we will use a POS-tagger to identify proper nouns and then either replace them with names used in the We presented a pilot study in which we show literature or leave them underspecified. The man- how computational linguistic tools and resources ual gold standard alignment we presented for “Ver- can help to improve the identification of charac- wandlung” is useful in at least two respects for fu- ter/persona references in literary texts. Our test case ture works: on the one hand it serves us as an up- is fairly controlled, which enables us to assess the per bound for annotation projection beyond standard modular components on their own. Based on a man- co-reference resolution (e.g. distinguishing canoni- ual gold standard word alignment of Franz Kafka’s cal stative present tense usage and historical/scenic “Verwandlung” we show numerous examples for present tense usage), on the other hand we can use which the translation of a pronominal referent can it to approximate the quality of different automatic help to resolve coreferential ambiguites which oth- word alignment approaches on literary texts. In the erwise may not be resolved accordingly. Having future, we will adopt our automatic co-reference re- shown that, we will in the near future focus on sub- solver to use the word alignment of pronominal ref- stituting the manual alignment with an automatic erents, which should lead to an improved perfor- word alignment approach. Due to the limited train- mance. While our pilot study is on literary texts, ing data, we will examine different possibilities to word alignments can be used for co-reference reso- improve automatic word alignment of literature text. lution even for texts from other genres, as long as As German is a morphologically rich language, lem- parallel translations are available. matisation should definitely be considered. More- over, the training text for automatic word alignment Acknowledgements might be enriched with general language data. In This work was supported by the Deutsche order to enhance the positive matching of personal Forschungsgemeinschaft (DFG) grants for the pronouns to names used in the underlying novel, projects D2 and A6 of the SFB 732.

56 References Sameer Pradhan, Lance Ramshaw, Mitchell Marcus, Martha Palmer, Ralph Weischedel, and Nianwen Xue. David Bamman, Ted Underwood, and Noah A. Smith. 2011. CoNLL-2011 Shared Task: Modeling Unre- 2014. A Bayesian Mixed Effects Model of Literary stricted Coreference in OntoNotes. In CoNLL’11: Character. In Proceedings of the 52nd Annual Meet- Proceedings of the 15th Conference on Computational ing of the Association for Computational Linguistics, Natural Language Learning: Shared Task, pages 1– pages 370–379, Baltimore, Maryland. 27. Anders Bjorkelund¨ and Jonas Kuhn. 2014. Learn- ing Structured Perceptrons for Coreference Resolution Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, with Latent Antecedents and Non-local Features. In Olga Uryupina, and Yuchen Zhang. 2012. CoNLL- ACL’14: Proceedings of the 52nd Annual Meeting of 2012 Shared Task: Modeling Multilingual Unre- Proceedings of the Association for Computational Linguistics, Balti- stricted Coreference in OntoNotes. In the Joint Conference on Empirical Methods in Natu- more, Maryland. Shoshana Blum-Kulka. 1986. Shifts of Cohesion and ral Language Processing (EMNLP) and the Confer- Coherence in Translation. In J. House and S. Blum- ence on Computational Natural Language Learning Kulka, editors, Interlingual and Intercultural Commu- (CoNLL): Shared Task, pages 1–40. nication, pages 17–35. Gunter Narr Verlag, Tubingen,¨ Altaf Rahman and Vincent Ng. 2012. Translation- Germany. based Projection for Multilingual Coreference Reso- Jose´ Guilherme Camargo de Souza and Constantin lution. In Proceedings of the 2012 Conference of the Orasan.˘ 2011. Can Projected Chains in Parallel Cor- North American Chapter of the Association for Com- pora Help Coreference Resolution? In Anaphora Pro- putational Linguistics: Human Language Technolo- cessing and Applications, pages 59–69. Springer. gies, pages 720–730. Markus Gartner,¨ Gregor Thiele, Wolfgang Seeker, An- Marta Recasens, Llu´ıs Marquez,` Emili Sapena, ders Bjorkelund,¨ and Jonas Kuhn. 2014. ICARUS M Antonia` Mart´ı, Mariona Taule,´ Veronique´ – An Extensible Graphical Search Tool for Depen- Hoste, Massimo Poesio, and Yannick Versley. 2010. dency Treebanks. In ACL’13: Proceedings of the 51nd Semeval-2010 Task 1: Coreference Resolution in Annual Meeting of the Association for Computational Multiple Languages. In Semeval’10: Proceedings of Linguistics: Systems Demonstrations., Sofia, Bulgaria. the 5th International Workshop on Semantic Eval- Sanda M Harabagiu and Steven J Maiorano. 2000. Mul- uation, pages 1–8. Association for Computational tilingual Coreference Resolution. In Proceedings of Linguistics. the 6th Conference on Applied Natural Language Pro- David Yarowsky, Grace Ngai, and Richard Wicentowski. cessing, pages 142–149. 2001. Inducing Multilingual Text Analysis Tools Fotis Jannidis, Markus Krug, Isabella Reger, Martin via Robust Projection across Aligned Corpora. In Toepfer, Lukas Weimer, and Frank Puppe. 2015. Au- HLT’01: Proceedings of the 1st International Confer- tomatische Erkennung von Figuren in deutschsprachi- ence on Human Language Technology research, pages gen Romanen. Conference Presentation at ”Digital 1–8. Association for Computational Linguistics. Humanities im deutschsprachigen Raum”. Desislava Zhekova, Robert Zangenfeind, Alena Jonas Kuhn and Nils Reiter. 2015, to appear. A Plea for Mikhaylova, and Tetiana Nikolaienko. 2014. a Method-Driven Agenda in the Digital Humanities. 
Alignment of Multiple Translations for Linguistic In Proceedings of Digital Humanities 2015: Global Analysis. In Proceedings of the The 3rd Annual Digital Humanities, Sydney. International Conference on Language, Literature Willard McCarty. 2005. Humanities Computing. Pal- and Linguistics (L3), pages 9–10. grave Macmillan. Alena Mikhaylova. 2014. Koreferenzresolution in mehreren Sprachen. Master’s thesis, Ludwig Maxi- milians Universitat¨ Munchen.¨ Vincent Ng. 2010. Supervised Noun Phrase Coreference Research: The First Fifteen Years. In ACL’10: Pro- ceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1396–1411. Oana Postolache, Dan Cristea, and Constantin Orasan. 2006. Transferring Coreference Chains Through Word Alignment. In LREC’06: Proceedings of the 5th In- ternational Conference on Language Ressources and Evaluation.

Identifying Literary Texts with Bigrams

Andreas van Cranenburgh*†    Corina Koolen†
*Huygens ING, Royal Dutch Academy of Sciences, The Hague, The Netherlands
†Institute for Logic, Language and Computation, University of Amsterdam, The Netherlands
[email protected]    [email protected]

Abstract or awe they experience, they will choose textual el- ements to refer to. In our project,1 we try to find We study perceptions of literariness in a set of whether novels that are seen as literary have certain contemporary Dutch novels. Experiments with textual characteristics in common, and if so, what machine learning models show that it is pos- meaning we can assign to such commonalities. In sible to automatically distinguish novels that other words, we try to answer the following question: are seen as highly literary from those that are are there particular textual conventions in literary seen as less literary, using surprisingly simple textual features. The most discriminating fea- novels that contribute to readers judging them to be tures of our classification model indicate that literary? genre might be a confounding factor, but a re- In this paper, we show that there are indeed textual gression model shows that we can also explain characteristics that contribute to perceived literari- variation between highly literary novels from ness. We use data from a large survey conducted in less literary ones within genre. the Netherlands in 2013, in which readers were asked to rate novels that they had read on a scale of literari- ness and of general quality (cf. section 2). We show 1 Introduction that using only simple bigram features (cf. section 3), models based on Support Vector Machines can suc- The prose, plot, [the] characters, the se- cessfully separate novels that are seen as highly liter- quence of the events, the thoughts that run ary from less literary ones (cf. section 4). This works in Tony Websters mind, big revelation in with both content and style related features of the text. the end . . . They are all part of the big beau- Interestingly, general quality proves harder to predict. tiful ensemble that delivers an exception- Interpretation of features shows that genre plays a ally nice written novella. — (from a review role in literariness (cf. section 5), but results from on Goodreads of Julian Barnes, A Sense of regression models indicate that the textual features an Ending) also explain differences within genres.

2 Survey Data and Novels

During the summer of 2013, the Dutch reading public was asked to give their opinion on 401 novels published between 2007 and 2012 that were most often sold or borrowed between 2009 and 2012. This list was chosen to gather as many ratings as possible



                     Original   Translated
Thrillers                   0           31
Literary thrillers         26           29
Literary fiction           27           33

Table 1: The number of books in each category. These categories were assigned by the publishers.

Figure 1: A histogram of the mean literary ratings.

(less popular novels might receive too few ratings for empirical analysis), and to ensure that readers were not influenced too much by common knowledge of their canonisation (this is less likely for more recent books). About 13,000 people participated in the survey. Participation was open to anyone. Participants were asked, among other things, to select novels that they had read and to rate them on two scales from 1–7: literariness (not very literary–very literary) and general quality (bad–good). These two were distinguished because a book that is not literary can still be considered to be a good book, because it is suspenseful or funny, for instance; conversely, a novel that is seen as literary can still be considered to be bad (for instance if a reader does not find it engaging), although we found no examples of this in our results. No definition was given for either of the two dimensions, in order not to influence the intuitive judgments of participants. The notion of literariness in this work is therefore a pretheoretical one, directly reflecting the perceptions of the participants. In this work we use the mean of the ratings of each book.

The dataset used in this paper contains a selection of 146 books from the 401 included in the survey; see Tables 1 and 2. Both translated and original (Dutch) novels are included. It contains three genres, as indicated by the publisher: literary novels, literary thrillers and thrillers. There are no Dutch thrillers in the corpus. Note that these labels are ones that the publishers have assigned to the novels. We will not be using these labels in our experiments—save for one where we interpret genre differences—but base ourselves on reader judgements. In other words: when we talk about highly literary texts, they (in theory) could be part of any of these genres, as long as readers judged them to be highly literary.

3 Experimental setup

Three aspects of a machine learning model can be distinguished: the target of its predictions, the data which the predictions are based on, and the kind of model and predictions it produces.

3.1 Machine Learning Tasks

We consider two tasks:

1. Literariness

2. Bad/good (general quality)

The target of the classification model is a binary classification of whether a book is within the 25 % judged to be the most literary, or good. Figure 1 shows a histogram of the literary judgments. This cutoff divides the two peaks in the histogram, while ensuring that the number of literary novels is not too small. A more difficult task is to try to predict the average rating for literariness of each book. This not only involves the large differences between thrillers and literary novels, but also smaller differences within these genres.

3.2 Textual Features

The features used to train the classifier are based on a bag-of-words model with relative frequencies. Instead of single words we use word bigrams. Bigrams are occurrences of two consecutive words observed in the texts. The bigrams are restricted to those that occur in between 60 % and 90 % of the texts used in the model, to avoid the sparsity of rare bigrams on the one hand, and the most frequent function bigrams on the other.
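As a concrete illustration of this feature extraction, the sketch below builds the bigram vocabulary and the relative-frequency vectors. It is a minimal Python example under our own assumptions (each novel is already tokenized into a list of words; the function names are ours), not the authors' implementation.

```python
from collections import Counter

def bigrams(tokens):
    # Word bigrams: pairs of consecutive tokens.
    return list(zip(tokens, tokens[1:]))

def bigram_features(texts, min_df=0.60, max_df=0.90):
    """texts: one token list per novel.
    Returns the bigram vocabulary and per-novel relative frequencies,
    keeping only bigrams that occur in 60-90 % of the novels."""
    counts, doc_freq = [], Counter()
    for tokens in texts:
        c = Counter(bigrams(tokens))
        counts.append(c)
        doc_freq.update(c.keys())
    n = len(texts)
    vocab = sorted(b for b, df in doc_freq.items() if min_df <= df / n <= max_df)
    index = {b: i for i, b in enumerate(vocab)}
    vectors = []
    for tokens, c in zip(texts, counts):
        total = max(len(tokens) - 1, 1)       # number of bigram tokens in the novel
        vec = [0.0] * len(vocab)
        for b, freq in c.items():
            if b in index:
                vec[index[b]] = freq / total  # relative frequency
        vectors.append(vec)
    return vocab, vectors
```

The same routine can be applied to content bigrams (plain tokens) or to style bigrams (lemmas of function words, punctuation and POS tags), as defined in the next paragraphs.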

literature, original: Bernlef, Dis, Dorrestein, Durlacher, Enquist, Galen, Giphart, Hart, Heijden, Japin, Kluun, Koch, Kroonenberg, Launspach, Moor, Mortier, Rosenboom, Scholten, Siebelink, Verhulst, Winter.
literature, translated: Auel, Avallone, Baldacci, Binet, Blum, Cronin, Donoghue, Evans, Fragoso, George, Gilbert, Giordano, Harbach, Hill, Hodgkinson, Hosseini, Irving, James, Krauss, Lewinsky, Mastras, McCoy, Pick, Picoult, Rosnay, Ruiz Zafón, Sansom, Yalom.
literary thrillers, original: Appel, Dijkzeul, Janssen, Noort, Pauw, Terlouw, Verhoef, Vermeer, Visser, Vlugt.
literary thrillers, translated: Coben, Forbes, French, Gudenkauf, Hannah, Haynes, Kepler, Koryta, Lackberg, Larsson, Läckberg, Nesbo, Patterson, Robotham, Rosenfeldt, Slaughter, Stevens, Trussoni, Watson.
thrillers, original: (none)
thrillers, translated: Baldacci, Clancy, Cussler, Forsyth, Gerritsen, Hannah, Hoag, Lapidus, McFadyen, McNab, Patterson, Roberts, Rose.

Table 2: Authors in the dataset

No limit is placed on the total number of bigram features. We consider two feature sets:

content bigrams: Content words contribute meaning to a sentence and are thus topic related; they consist of nouns, verbs, adjectives, and adverbs. Content bigrams are extracted from the original tokenized text, without further preprocessing.

style bigrams: Style bigrams consist of function words, punctuation, and part-of-speech tags of content words (similar to Bergsma et al. 2012). In contrast with content words, function words determine the structure of sentences (determiners, conjunctions, prepositions) or express relationships (pronouns, demonstratives). Function words are identified by a selection of part-of-speech tags and a stop word list. Function words are represented with lemmas, e.g., auxiliary verbs appear in uninflected form. Lemmas and part-of-speech tags were automatically assigned by the Alpino parser.2

2 Cf. http://www.let.rug.nl/vannoord/alp/Alpino/

3.3 Models

All machine learning experiments are performed with scikit-learn (Pedregosa et al., 2011). The classifier is a linear Support Vector Machine (SVM) with regularization tuned on the training set. The cross-validation is 10-fold and stratified (each fold has a distribution of the target class that is similar to that of the whole data set).

For regression the same setup of texts and features is used as for the classification experiments, but the machine learning model is a linear Support Vector Regression model.
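The setup of Section 3.3 can be sketched with a few lines of scikit-learn. This is a hedged illustration rather than the authors' code: it uses the current scikit-learn API instead of the 2011 version cited, the grid of C values and all variable names are our own assumptions, and the exact tuning procedure may differ from the one used in the paper.

```python
import numpy as np
from sklearn.svm import LinearSVC, LinearSVR
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# X: matrix of relative bigram frequencies (novels x bigrams), built as sketched above
# y_binary: 1 if a novel is among the 25 % rated most literary (or best), else 0
# y_mean: mean literariness rating per novel on the 1-7 scale

def classification_accuracy(X, y_binary):
    # Linear SVM with the regularization parameter C tuned on the training folds.
    svm = GridSearchCV(LinearSVC(), {"C": [0.01, 0.1, 1, 10]}, cv=3)
    folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    return cross_val_score(svm, X, y_binary, cv=folds, scoring="accuracy").mean()

def regression_r2(X, y_mean):
    # Same features, but a linear Support Vector Regression model, scored with R^2.
    svr = GridSearchCV(LinearSVR(), {"C": [0.01, 0.1, 1, 10]}, cv=3)
    return cross_val_score(svr, X, y_mean, cv=10, scoring="r2").mean()

def top_bigrams(X, y_binary, vocab, k=20):
    # The most discriminating bigrams (as in Table 5 below) can be read off the
    # weights of a linear model; the sign separates literary from non-literary.
    clf = LinearSVC(C=1).fit(X, y_binary)
    order = np.argsort(clf.coef_[0])
    return [vocab[i] for i in order[-k:]], [vocab[i] for i in order[:k]]
```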

4 Results

Before we train machine learning models, we consider a dimensionality reduction of the data. Figure 2 shows a non-negative matrix factorization of the style bigrams. In other words, this is a visualization of a decomposition of the bigram counts, without taking into account whether novels are literary or not (i.e., an unsupervised model). Notice that most of the non-literary novels (red) cluster together in one corner, while the literary books (blue) show more variation. When content bigrams are used, a similar cluster of non-literary books emerges, but interestingly, this cluster only consists of translated works. With style bigrams this does not occur.

This result seems to suggest that non-literary books are easier to recognize than literary books, since the literary novels show more variation. However, note that this decomposition presents just one way to summarize and visualize the data.

Figure 2: Non-negative matrix factorization based on style bigrams (literary novels are the blue triangles).
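The unsupervised view in Figure 2 can be approximated with scikit-learn's NMF applied to the style-bigram matrix. The settings below (two components, a fixed random seed) are our own assumptions; the paper does not state which parameters were used.

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import NMF

def plot_nmf(X_style, literary):
    """X_style: non-negative matrix of relative style-bigram frequencies (novels x bigrams).
    literary: boolean numpy array, True for the 25 % most literary novels
    (used for colouring only; the decomposition itself is unsupervised)."""
    W = NMF(n_components=2, random_state=0).fit_transform(X_style)
    plt.scatter(W[~literary, 0], W[~literary, 1], c="red", marker="o", label="rest")
    plt.scatter(W[literary, 0], W[literary, 1], c="blue", marker="^", label="top 25% literary")
    plt.legend()
    plt.show()
```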

Features            Literary   Bad/good
Content bigrams        90.4       63.7
Style bigrams          89.0       63.0

Table 3: Classification accuracy (percentage correct).

Features            Literary        Bad/good
Content bigrams     61.3 (0.65)     33.5 (0.49)
Style bigrams       57.0 (0.67)     22.2 (0.52)

Table 4: Evaluation of the regression models: R² scores (percentage of variation explained), with the root mean squared error in parentheses (on the 1–7 rating scale).

The classification model, when trained specifically to recognize literary and non-literary texts, can still identify particular discriminating features.

4.1 Classification

Table 3 shows the evaluation of the classification models. The content bigrams perform better than the style bigrams. The top-ranked bigram features of the model for literary classification are shown in Table 5. If we look only at the top 20 bigrams that are most predictive of literary texts according to our model and plot how often they occur in each genre as specified by the publishers, we see that these bigrams occur significantly more often in literary texts; cf. the plot in Figure 4. This indicates that there are features specific to literary texts, despite the variance among literary texts shown in Figure 2.

When trained on the bad/good dimension, the classification accuracy is around 60 %, compared to around 90 % for literariness, regardless of whether the features are content or style bigrams. This means that the bad/good judgments are more difficult to predict from these textual features. This is not due to the variance in the survey responses themselves. If literariness were a more clearly defined concept for the survey participants than general quality, we would expect there to be less consensus and thus more variance on the latter dimension. But this is not what we find; in fact, the mean of the standard deviations of the bad/good responses is lower than that of the literariness responses (1.08 vs. 1.33). Rather, it is likely that the bad/good dimension depends on higher-level, plot-related characteristics, or on text-extrinsic social factors.

4.2 Regression

The regression results cannot be evaluated with a simple 'percentage correct' accuracy metric, because it is not feasible to predict a continuous variable exactly.

Figure 3: Regression results for predicting literary judgments with content bigrams (left, R² = 0.61) and style bigrams (right, R² = 0.57); the axes show actual vs. predicted reader judgments.
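The two numbers reported per cell in Table 4, R² and the root mean squared error, can be computed directly from the actual and predicted mean ratings. A minimal sketch with our own variable names:

```python
import numpy as np

def r2_and_rmse(actual, predicted):
    """actual, predicted: arrays of mean literariness ratings on the 1-7 scale."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    residuals = actual - predicted
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)   # error of the mean-only null model
    r2 = 1.0 - ss_res / ss_tot                       # share of variation explained
    rmse = np.sqrt(np.mean(residuals ** 2))          # same scale as the ratings
    return r2, rmse
```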

Figure 4: A barplot of the number of occurrences of the top 20 most important literary features (cf. Table 5) across the genres given by the publisher (error bars show 95 % confidence intervals).

Figure 5: Regression results for predicting literary judgments with content bigrams, with data points distinguished by the publisher-assigned genre.

Instead, we report the coefficient of determination (R²). This metric captures the percentage of variation in the data that the model explains, by contrasting the errors of the model predictions with those of the null model, which always predicts the mean of the data. R² can be contrasted with the root mean squared error, also known as the standard error of the estimate or the norm of residuals, which measures how close the predictions are to the target on average. In contrast with R², this metric has the same scale as the original data, and lower values are better.

The regression scores are shown in Table 4. Predicting the bad/good scores is again more difficult. The regression results for the literariness predictions are visualized in Figure 3. Each data point represents a single book. The x-axes show the literariness ratings from survey participants, while the y-axes show the predictions from the model. The diagonal line shows

62 what the perfect prediction would be, and the further The book, a book, a letter, and to write are also the data points (novels) are from this line, the greater part of the most important features, as well as the the error. On the sides of the graphs the histograms bar, a cigarette, and the store. This suggests a cer- show the distribution of the literariness scores. No- tain pre-digital situatedness, as well as a reflection on tice that the model based on content bigrams mirrors the writing process. Interestingly enough, in contrast the bimodal nature of the literariness ratings, while to the book and letter that are most discriminating, the histogram of predicted literariness scores based negative indicators contain words related to modern on style bigrams shows only a single peak. technology: mobile phone and the computer. Inspec- Figure 5 shows the same regression results with tion of the novels shows that the literary novels are the publisher-assigned genres highlighted. The graph not necessarily set in the pre-digital age, but that they shows that predicting the literariness of thrillers is have fewer markers of recent technology. This might more difficult than predicting the literariness of the be tied to the adage in literary writing that good writ- more literary rated novels. Most thrillers have ratings ing should be ‘timeless’—which in practice means between 3.4 and 4.3, while the model predicts a wider that at the very least a novel should not be too obvi- range of ratings between 3.3 and 5.0; i.e., the model ous in relating its settings to the current day. It could predicts more variation than actually occurs. For the also show a hint of nostalgia, perhaps connected to a literary novels both the predicted and actual judg- romantic image of the writer. ments show a wide range between 4.5 and 6.5. The In the negative features, we find another time- actual judgments of the literary novels are about 0.5 related tendency. The first is indications of time— points higher than the predictions. However, there little after, and in Dutch ‘tot nu’ and ‘nu toe’, which are novels at both ends of this range for which the are part of the phrase ‘tot nu toe’ (so far or up until ratings are well predicted. Judging by the dispersion now), minutes after and ten minutes; another indi- of actual and predicted ratings of the literary nov- cator that awareness of time, albeit in a different els compared to the thrillers, the model accounts for sense, is not part of the ‘literary’ discourse. Indica- more of the variance within the ratings of literary tors of location are the building, the garage/car park, novels. and the location, showing a different type of setting It should be noted that while in theory 100 % is than the one described above. We also see indica- the perfect score, the practical ceiling is much lower tors of homicide: the murder, and the investigation. due to the fact that the model is trying to predict Some markers of colloquial speech are also found in an average rating—and because part of the variation the negative markers: for god’s sake and thank you, in literariness will only be explainable with richer which aligns with a finding of Jautze et al (2013), features, text-extrinsic sociological influences, or ran- where indicators of colloquial language were found dom variation. in low-brow literature. It is possible to argue, that genre is a more im- 5 Interpretation portant factor in this classification than literary style. 
However, we state that this is not particular to this As the experiments show, there are textual elements research, and in fact unavoidable. The discussion that allow a machine learning model to distinguish of how tight genre and literariness are connected, between works that are perceived as highly literary as has been held for a long time in literary theory and opposed to less literary ones—at least for this dataset will probably continue for years to come. Although and survey. We now take a closer look at the features it is not impossible for so called ‘genre novels’ to and predictions of the literary classification task to gain literary status (cf. Margaret Atwood’s scifi(-like) interpret its success. work for instance—although she objects to such a classification; Hoby 2013), it is the case that certain 5.1 Content topics and genres are considered to be less literary When we look at the forty bigrams that perform best than others. The fact that the literary novels are ap- and worst for the literary novels (cf. Table 5), we can parently not recognised by proxy, but on an internal identify a few tendencies. coherence (cf. section 4), does make an interesting

weight  literary features, content          weight  non-literary features, content
 12.1   de oorlog (the war)                  -6.1   de moeder (the mother)
  8.1   het bos (the forest)                 -5.1   keek op (looked up)
  8.1   de winter (the winter)               -4.9   mijn hoofd (my head)
  6.6   de dokter (the doctor)               -4.9   haar moeder (her mother)
  5.8   zo veel (so much)                    -4.7   mijn ogen (my eyes)
  4.8   nog altijd (yet still)               -4.7   ze keek (she looked)
  4.5   de meisjes (the girls)               -4.5   mobiele telefoon (mobile telephone)
  4.3   zijn vader (his father)              -4.2   de moord (the murder)
  4.0   mijn dochter (my daughter)           -4.0   even later (a while later)
  3.9   het boek (the book)                  -3.8   nu toe ((until) now)
  3.8   de trein (the train)                 -3.5   zag ze (she saw)
  3.7   hij hem (he him)                     -3.4   ik voel (I feel)
  3.7   naar mij (at me)                     -3.3   mijn man (my husband)
  3.5   zegt dat (says that)                 -3.2   tot haar (to her)
  3.5   het land (the land)                  -3.2   het gebouw (the building)
  3.5   een sigaret (a cigarette)            -3.2   liep naar (walked to)
  3.4   haar vader (her father)              -3.1   we weten (we know)
  3.4   een boek (a book)                    -3.1   enige wat (only thing)
  3.2   de winkel (the shop)                 -3.1   en dus (and so)
  3.1   elke keer (each time)                -3.0   in godsnaam (in god's name)

weight  literary features, style             weight  non-literary features, style
 21.8   ! WW (! VERB)                        -13.8   nu toe (until now)
 20.5   u , (you (FORMAL) ,)                 -13.4   en dus (and so)
 18.0   haar haar (her her)                  -13.4   achter me (behind me)
 16.5   SPEC : (NAME :)                      -13.2   terwijl ik (while I)
 15.4   worden ik (become I)                 -13.1   tot nu (until now)

Table 5: The top 20 most important content features and the top 5 most important style features of literary (left) and non-literary texts (right), respectively.

case for the literary novel to be a genre on its own. Computational research into genre differences has proven that there are certain markers that allow for a computer to make an automated distinction between them, but it also shows that interpretation is often complex (Moretti, 2005; Allison et al., 2011; Jautze et al., 2013). Topic modelling might give some more insight into our findings.

5.2 Style

A stronger case against genre determining the classification is the success of the function words in the task. Function words are not directly related to themes or topics, but reflect writing style in a more general sense. Still, the results do not rule out the existence of particular conventions of writing style in genres, but in this case the distinction between literariness and genre becomes more subtle. Function words are hard to interpret manually, but we do see in the top 20 (Table 5 shows the top 5) that the most discriminating features of less literary texts contain more question marks (and thus questions), and more numerals ('TW')—which can possibly be linked to the discriminative qualities of time-indications in the content words. Some features in the less-literary set appear to show more colloquial language again, such as ik mezelf ('I myself'), door naar ('through/on to'; an example can be found in the sentence 'Heleen liep door naar de keuken.', which translates to 'Heleen walked on to the kitchen', a sound grammatical construction in Dutch, but perhaps not a very aesthet-

64 ically pleasing one). A future close reading of the war. This subject, and the plain writing style original texts will give more information on this intu- makes it stand out from the other novels. ition. In future work, more kinds of features should be Erwin Mortier - While the Gods Were Sleeping applied to the classification of literature to get more The most highly rated (6.6) literary novel in insight. Many aspects could be studied, such as read- the dataset, with a high prediction (5.7). A ability, syntax, semantics, discourse relations, and striking feature of this novel is that it consists topic coherence. Given a larger data set, the factors of short paragraphs and short, often single genre and translation/original can be controlled for. line sentences. It features a lot of metaphors, The general question which needs to be answered analogies, and generally a poetic writing style. is whether a literary interpretation of a computational This novel also deals with war, but the writing model is even possible. The material to work with style contrasts with Lewinsky, which may (the features), consist of concise sets of words or explain why the model’s prediction is not as even part-of-speech tags, which are not easy to in- close for this novel. terpret manually; and they paint only a small part of the picture. The workings of the machine learning 6 Related Work model remain largely hidden to the interpreter. This is an instance of the more general problem of the Previous work on classification of literature has fo- interpretability of results in computational humani- cused on authorship attribution (e.g., Hoover, 2003; ties (Bod, 2013). In the specific case of literature, we van Cranenburgh, 2012) and popularity (Ashok et al., can observe that readers of literature follow a similar 2013). The model of Ashok et al. (2013) classifies pattern: literature can be recognized and appreciated, novels from Project Gutenberg as being successful but it is hard to explain what makes texts literary, let or not using stylometric features, where success is alone to compose a highly literary work. based on their download counts. Since many of the most downloaded novels are classics, their results 5.3 Good and bad predictions indirectly relate to literariness. However, in our data In Figure 5, we can see both outliers and novels that set all texts are among the most popular books in are well predicted by the regression model. Here we a fixed time span (cf. section 2), whereas the less discuss a few and suggest why the model does or successful novels in their data set differ much more does not account for their perceived literariness. in popularity from the successful novels. To the best of our knowledge, our work is the first to directly Emma Donoghue - Room A literary novel that is predict the literariness of texts in a computational rated as highly literary (5.5), but with a lower model. prediction (3.8). This may be because this novel There is also work on the classification of the qual- is written from the perspective of a child, with a ity of non-fiction texts. Bergsma et al. (2012) work correspondingly limited vocabulary. on scientific articles with a similar approach to ours, but including syntactic features in addition to bag-of- Elizabeth Gilbert - Eat, Pray Love A novel with a words features. Louis and Nenkova (2013) present re- low literariness rating (3.5), but a high predic- sults on science journalism by modelling what makes tion (5.2) by the model. 
This novel may be rated articles interesting and well-written. lower due to the perception that it is a novel for women, dealing with new age themes, giving it Salganik et al. (2006) present an experimental a more specific audience than the other novels study on the popularity of music. They created an in the dataset. artificial “music market” to study the relationship be- tween quality and success of music, with or without Charles Lewinsky - Melnitz A novel that is both social influence as a factor. They found that social rated (5.7) and predicted (5.7) as highly literary. influence increases the unpredictability of popularity This novel chronicles the history of a Jewish in relation to quality. A similar effect likely plays a family including the events of the second world role in the reader judgments of the survey.

7 Conclusion

Our experiments have shown that literary novels share significant commonalities, as evidenced by the performance of the machine learning models. It is still a challenge to understand what these literary commonalities consist of, since a large number of word features interact in our models. General quality is harder to predict than literariness.

Features related to genre (e.g., the war in literary novels and the homicide in thrillers) indicate that genre is a possible confounding factor in the classification, but we find evidence against the notion that the results are solely due to genre. One aspect that stood out in our analysis of content features, which is not necessarily restricted to genre (or which might indicate that the literary novel is a genre in and of itself), is that settings of space and time rank high among the discriminating features. This might be indicative of a 'timeless quality' that is expected of highly literary works (where words such as book and letter are discriminative)—as opposed to more contemporary settings in less literary novels (computer and mobile phone). Further study is needed to get more insight into these themes and into the extent to which they are related to genre differences or to a literary writing style.

The good performance of the style features shows the importance of writing style and indicates that the classification is not purely based on topics and themes. Although genres may also have particular writing styles and thus associated style features, the fact that good results are obtained with two complementary feature sets suggests that the relation between literariness and text features is robust.

Finally, the regression on content and function words shows that the model accounts for more than just genre distinctions. The predictions within genres are good enough to show that it is possible to distinguish highly literary works from less literary works. This is a result that merits further investigation.
Acknowledgments

We are grateful to Karina van Dalen-Oskam, Rens Bod, and Kim Jautze for commenting on drafts, and to the anonymous reviewers for useful feedback. This work is part of The Riddle of Literary Quality, a project supported by the Royal Netherlands Academy of Arts and Sciences through the Computational Humanities Program.

References

Sarah Danielle Allison, Ryan Heuser, Matthew Lee Jockers, Franco Moretti, and Michael Witmore. 2011. Quantitative formalism: an experiment. Stanford Literary Lab pamphlet. http://litlab.stanford.edu/LiteraryLabPamphlet1.pdf.

Vikas Ashok, Song Feng, and Yejin Choi. 2013. Success with style: Using writing style to predict the success of novels. In Proceedings of EMNLP, pages 1753–1764. http://aclweb.org/anthology/D13-1181.

Shane Bergsma, Matt Post, and David Yarowsky. 2012. Stylometric analysis of scientific articles. In Proceedings of NAACL, pages 327–337. http://aclweb.org/anthology/N12-1033.

Rens Bod. 2013. Who's afraid of patterns?: The particular versus the universal and the meaning of humanities 3.0. BMGN – Low Countries Historical Review, 128(4). http://www.bmgn-lchr.nl/index.php/bmgn/article/view/9351/9785.

Pierre Bourdieu. 1996. The rules of art: Genesis and structure of the literary field. Stanford University Press.

Hermione Hoby. 2013. Margaret Atwood: interview. The Telegraph, Aug 18. http://www.telegraph.co.uk/culture/books/10246937/Margaret-Atwood-interview.html.

David L. Hoover. 2003. Frequent collocations and authorial style. Literary and Linguistic Computing, 18(3):261–286. http://llc.oxfordjournals.org/content/18/3/261.abstract.

Kim Jautze, Corina Koolen, Andreas van Cranenburgh, and Hayco de Jong. 2013. From high heels to weed attics: a syntactic investigation of chick lit and literature. In Proc. of workshop Computational Linguistics for Literature, pages 72–81. http://aclweb.org/anthology/W13-1410.

Annie Louis and Ani Nenkova. 2013. What makes writing great? First experiments on article quality prediction in the science journalism domain. Transactions of the Association for Computational Linguistics, 1:341–352. http://aclweb.org/anthology/Q13-1028.

Franco Moretti. 2005. Graphs, maps, trees: abstract models for a literary history. Verso.

Fabian Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Matthew J. Salganik, Peter Sheridan Dodds, and Duncan J. Watts. 2006. Experimental study of inequality and unpredictability in an artificial cultural market. Science, 311(5762):854–856.

Andreas van Cranenburgh. 2012. Literary authorship attribution with phrase-structure fragments. In Proceedings of CLFL, pages 59–63. Revised version: http://andreasvc.github.io/clfl2012.pdf.

Visualizing Poetry with SPARSAR – Visual Maps from Poetic Content

Rodolfo Delmonte Department of Language Studies & Department of Computer Science Ca’ Foscari University - 30123, Venezia, Italy [email protected]

Abstract Most of a poem's content can be captured considering three basic views on the poem itself: In this paper we present a specific application of one that covers what can be called the overall SPARSAR, a system for poetry analysis and sound pattern of the poem - and this is related to TextToSpeech “expressive reading”. We will the phonetics and the phonology of the words focus on the graphical output organized at three contained in the poem - Phonetic Relational View. macro levels, a Phonetic Relational View where Another view is the one that captures the main phonetic and phonological features are poetic devices related to rhythm, that is the rhyme highlighted; a Poetic Relational View that accounts for a poem rhyming and metrical structure and the metrical structure - this view will structure; and a Semantic Relational View that be called Poetic Relational View. Finally, the shows semantic and pragmatic relations in the semantic and pragmatic contents of the poem poem. We will also discuss how colours may be which are related to relations entertained by used appropriately to account for the overall predicates and arguments expressed in the poem, underlying attitude expressed in the poem, relations at lexical semantic level, relations at whether directed to sadness or to happiness. This metaphorical and anaphoric level - this view will is done following traditional approaches which be called Semantic Relational View. assume that the underlying feeling of a poem is In this paper we will concentrate on the three strictly related to the sounds conveyed by the views above, which are visualized by the graphical words besides their meaning. This will be shown using part of Shakespeare’s Sonnets. output of the system and has been implemented by extracting the various properties and features of the 1 Introduction poem and are analyzed in ten separate poetic maps. These maps are organized as follows: The contents of a poem cover many different fields v A General Description map including seven from a sensorial point of view to a mental and a Macro Indices with a statistical evaluation of auditory linguistic one. A poem may please our such descriptors as: Semantic Density hearing for its rhythm and rhyme structure, or Evaluation; General Poetic Devices; General simply for the network of alliterations it evokes be Rhetoric Devices etc., Prosodic Distribution; they consonances or assonances; it may attract our Rhyming Schemes; Metrical Structure. This attention for its structure of meaning, organized on map is discussed and presented in previous a coherent lattice of anaphoric and coreferential publications, so I will not show it here; links or suggested and extracted from inferential v Phonetic Relational Views: five maps, and metaphorical links to symbolic meanings o Assonances, i.e. all vowels contained obtained by a variety of rhetorical devices. Most if in stressed vowel nuclei which have not all of these facets of a poem are derived from been repeated in the poem within a the analysis of SPARSAR, the system for poetry certain interval – not just in adjacency; analysis which has been presented to a number of o Consonances, i.e. all consonant onsets international conferences (Delmonte 2013a; 2013b; of stressed syllables again repeated in 2014) - and to Demo sessions in its TextToSpeech the poem within a certain interval; “expressive reading” version (Delmonte & Bacalu, o All word repetitions, be it stressed or 2012; Delmonte & Prati, 2014; Delmonte, 2015). unstressed;



o one for the Unvoiced/Voiced Jakobson & Waugh, 1978:188; lately Mazzeo, opposition as documented in syllable 2004). As Tsur (1992) notes, Fónagy in 1961 wrote onset of stressed words (stress an article in which he connected explicitly the use demotion counts as unstressed); of certain types of consonant sounds associated to o another for a subdivision of all certain moods: unvoiced and obstruent consonants consonant syllable onsets, including are associated with aggressive mood; sonorants consonant cluster onsets, and with tender moods. Fónagy mentioned the work of organized in three main phonological M.Macdermott (1940) who in her study identified classes: a specific quality associated to “dark” vowels, i.e. o Continuants (only fricatives); back vowels, that of being linked with dark o Obstruents (Plosives and Affricates); colours, mystic obscurity, hatred and struggle. For o Sonorants (Liquids, Vibrants, this reason, we then decided to evaluate all Approximants; Glides; Nasals). information made available by SPARSAR at the v Poetic Relation Views: three macro levels in order to check these findings o Metrical Structure, Rhyming Structure about the association of mood and sound. This will and Expected Acoustic Length all in be discussed in a final section of the paper devoted one single map. to correlations in Shakespeare’s sonnets. v Semantic Relational View: four maps, As a result, we will also be using darker colours - A map including polarity marked words for highlighting back and front vowels as opposed (Positive vs Negative) and words to low and middle vowels, these latter with light belonging to Abstract vs Concrete colours. The same will apply to representing semantic class1; unvoiced and obstruent consonants as opposed to - A map including polarity marked words voiced and sonorants. But as Tsur (1992:15) notes, (Positive vs Negative) and words this sound-colour association with mood or attitude belonging to Eventive vs State semantic has no real significance without a link to class; semantics. In the Semantic Relational View, we - A map including Main Topic words; will be using dark colours for Concrete referents vs Anaphorically linked words; Inferentially Abstract ones with lighter colours; dark colours linked words; Metaphorically linked words also for Negatively marked words as opposed to i.e. words linked explicitly by “like” or Positively marked ones with lighter colours. The “as”, words linked by recurring symbolic same strategy will apply to other poetic maps: this meanings (woman/serpent or technique has certainly the good quality of woman/moon or woman/rose); highlighting opposing differences at some level of - A map showing predicate argument abstraction2. relations intervening between words, The usefulness of this visualization is marked at core argument words only, intuitively related to various potential users and for indicating predicate and semantic role; different purposes. First of all, translators of poetry eventive anaphora between verbs. would certainly benefit from the decomposition of Graphical maps highlight differences using the poem and the fine-grained analysis, in view of colours. The use of colours associated to sound in the need to preserve as much as possible of the poetry has a long tradition. Rimbaud composed a original qualities of the source poem in the target poem devoted to “Vowels” where colours where language. 
Other possible users are literary critics specifically associated to each of the main five and literature teachers at various levels. Graphical vowels. Roman Jakobson wrote extensively about output is essentially produced to allow immediate sound and colour in a number of papers (1976; and direct comparison between different poems

1 see in particular Brysbaert et al. 2014 that has a database of 2 our approach is not comparable to work by Saif Mohammad 40K entries. We are also using a manually annotated lexicon (2011a;2011b), where colours are associated with words on of 10K entries and WordNet supersenses or broad semantic the basis of what their mental image may suggest to the mind classes. We are not using MRCDatabase which only has some of annotators hired via Mechanical Turk. The resource only 8,000 concrete + some 9,000 imagery classified entries contains word-colour association for some 12,000 entries over because it is difficult to adapt and integrated into our system. the 27K items listed.

69 and different poets. In order to show the usefulness iambic pentameter. And they use the audio and power of these visualization, I have chosen transcripts - made by just one person - to create the two different English poets in different time syllable-based word-stress gold standard corpus for periods: Shakespeare with Sonnet 1 and 60; and testing, made of some 70 lines taken from Sylvia Plath, with Edge. Shakespeare's sonnets. Audio transcripts without The paper is organized as follows: a short state supporting acoustic analysis3 are not always the of the art in the following section; then the views best manner to deal with stress assignment in of three poems accompanied by comments; some syllable positions which might or might not conclusion. conform to a strict sequence of iambs. There is no indication of what kind of criteria have been used, 2. Related Work and it must be noted that the three acoustic cues may well not be congruent (see Tsur, 2014). So Computational work on poetry addresses a number eventually results obtained are rather difficult to of subfields which are however strongly related. evaluate. As the authors note, spoken recordings They include automated annotation, analysis, or may contain lexical stress reversals and archaic translation of poetry, as well as poetry generation, pronunciations 4 . Their conclusion is that "this that we comment here below. Other common useful information is not available in typical subfields regard automatic grapheme-to-phoneme pronunciation dictionaries". Further on, (p. 531) translation for out of vocabulary words as they comment "the probability of stressing 'at' is discussed in (Reddy & Goldsmith, 2010). Genzel 40% in general, but this increases to 91% when the et al. (2010) use CMU pronunciation dictionary to next word is 'the'. " We assume that demoting or derive stress and rhyming information, and promoting word stress requires information which incorporate constraints on meter and rhyme into a is context and syntactically dependent. Proper use machine translation system. There has also been of one-syllable words remains tricky. In our some work on computational approaches to opinion, machine learning would need much characterizing rhymes (Byrd and Chodorow, 1985) bigger training data than the ones used by the and global properties of the rhyme network authors for their experiment. (Sonderegger, 2011) in English. There's a large number of papers on poetry Green et al. (2010) use a finite state transducer generation starting from work documented in a to infer the syllable-stress assignments in lines of number of publications by P. Gérvas (2001;2010) poetry under metrical constraints. They contribute who makes use of Case Based Reasoning to induce variations similar to the schemes below, by the best line structure. Other interesting attempts allowing an optional inversion of stress in the are by Toivonen et al.(2012) who use a corpus- iambic foot. This variation is however only based approach to generate poetry in Finnish. Their motivated by heuristics, noting that "poets often idea is to contribute the knowledge needed in use the word 'mother' (S* S) at the beginnings and content and form by two separate corpora, one ends of lines, where it theoretically should not providing semantic content, and another for appear." So eventually, there is no control of the grammatical and poetic structure. 
Morphological internal syntactic or semantic structure of the analysis and synthesis is used together with text- newly obtained sequence of feet: the optional mining methods. Basque poetry generation is the change is only positionally motivated. They employ statistical methods to analyze, generate, 3 and translate rhythmic poetry. They first apply One questions could be "Has the person transcribing stress pattern been using pitch as main acoustic correlate for stress unsupervised learning to reveal word-stress position, or loudness (intensity or energy) or else durational patterns in a corpus of raw poetry. They then use patterns?". The choice of one or the other acoustic correlated these word-stress patterns, in addition to rhyme might change significantly the final outcome. 4 and discourse models, to generate English love At p.528 they present a table where they list a number of words - partly function and partly content words - associated poetry. Finally, they translate Italian poetry into to probability values indicating their higher or lower English, choosing target realizations that conform propensity to receive word stress. They comment that to desired rhythmic patterns. They, however, "Function words and possessives tend to be unstressed, while concentrate on only one type of poetic meter, the content words tend to be stressed, though many words are used both ways".

70 topic of Agirrezabal et al. 2013 paper which uses is the strict one of perfect rhyme: two words rhyme POS-tags to induce the linear ordering and if their final stressed vowels and all following WordNet to select best semantic choice in context. phonemes are identical. So no half rhymes are Manurung et al., 2000 have explored the considered. Rhyming lines are checked from problem of poetry generation under some CELEX phonological database. constraints using machine learning techniques. With their work the authors intended to fill the gap 3. Three Views via Poetic Graphical Maps in the generation paradigm, and "to shed some light on what often seems to be the most enigmatic The basic idea underlying poetic graphical maps is and mysterious forms of artistic expression". The that of making available to the user an insight of conclusion they reach is that "...despite our the poem which is hardly realized even if the analysis is carried out manually by an expert implementation being at a very early stage, the literary critic. This is also due to the fact that the sample output succeeds in showing how the expertise required for the production of all the stochastic hillclimbing search model manages to maps ranges from acoustic phonetics to semantics produce text that satisfies these constraints." and pragmatics, a knowledge that is not usually However, when we come to the evaluation of possessed by a single person. All the graphical metre we discover that they base their approach on representations associated to the poems are disputable premises. The authors quote the first produced by SWI Prolog, inside the system which line of what could be a normal limerick but totally is freely downloadable from its website, at misinterpret the metrical structure. In limericks, sparsar.wordpress.com. For lack of space, we will what we are dealing with are not dactyls - TAtata - show maps related to two of Shakespeare's but anapests, tataTA, that is a sequence of two Sonnets, Sonnet 1 and Sonnet 60 and compare unstressed plus a closing stressed syllable. This is a them to Sylvia Plath’s Edge, to highlight well known characteristic feature of limericks and similarities and to show that the system can handle the typical rhythm is usually preceded and totally different poems still allowing comparisons introduced by a iamb "there ONCE", and followed to be made neatly. All Phonetic Views are shown by two anapests, "was a MAN", "from maDRAS". in Arpabet, i.e. the computer based phonetic Here in particular it is the syntactic-semantic alphabet . phrase that determines the choice of foot, and not the scansion provided by the authors5. Reddy & Knight (2011) produce an unsupervised machine learning algorithm for finding rhyme schemes which is intended to be language-independent. It works on the intuition that "a collection of rhyming poetry inevitably contains repetition of rhyming pairs. ... This is partly due to sparsity of rhymes – many words that have no rhymes at all, and many others have only a handful, forcing poets to reuse rhyming pairs." The authors harness this repetition to build an unsupervised algorithm to infer rhyme schemes, based on a model of stanza generation. "We test the algorithm on rhyming poetry in English and French.” The definition of rhyme the authors used

5 "For instance, the line 'There /once was a /man from Ma/dras', has a stress pattern of (w,s,w,w,s,w,w,s). This can be divided into feet as (w),(s,w,w),(s,w,w),(s). In other words, this line consists of a single upbeat (the weak syllable before the first strong syllable), followed by 2 dactyls (a classical poetry unit consisting of a strong syllable followed by two weak ones), and ended with a strong beat."(ibid.7)

We will start by commenting on the Phonetic Relational View and its related maps. The first map is concerned with Assonances. Here sounds are grouped into Vowel Areas, as said above, which also include diphthongs: in the area choice, what we have considered is the onset vowel. We have disregarded the offset glide, which is less persistent and might also not reach its target articulation. We have also combined front high vowels, which can express suffering and pain, with back dark vowels: both can be treated as marked vowels, compared to middle and low vowels.6 Assonances and Consonances are derived from the syllable structure of stressed positions, i.e. from repeated sounds within a certain line span: in particular, Consonances are derived from syllable onsets while Assonances come from syllable nuclei in stressed position. The Voiced/Unvoiced View is taken from all consonant onsets of stressed words. As can be noticed from the maps above, warm colours are used for CONTINUANT (yellow), VOICED (orange), SONORANT (green), the Centre/Low Vowel Area (gold) and the Middle Vowel Area (green), and cold colours for UNVOICED (blue) and the Back Vowel Area (brown).

Red is used for OBSTRUENT and for the Front High Vowel Area, to indicate the suffering and surprise associated with the speech-signal interruption in obstruents.

6 Area low: ae aa ah ao aw ay; area high front: iy ih ia y; area high back: uw uh ow ua w; area middle: er ea ax oh eh ey oy. Voiced consonants: l m n ng r z zh dh d b g v jh; unvoiced consonants: p s sh th h hh f ch t k. Obstruents: t d b p g k jh ch; continuants: f v z zh s sh th dh h hh; sonorants: l m n ng r.
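The vowel areas and consonant classes of footnote 6 translate directly into lookup tables. The sketch below is our own Python illustration, not part of SPARSAR; it only encodes the class membership listed in the footnote.

```python
# Phonological classes as listed in footnote 6 (Arpabet-style symbols, lowercased).
VOWEL_AREAS = {
    "low":        {"ae", "aa", "ah", "ao", "aw", "ay"},
    "high_front": {"iy", "ih", "ia", "y"},
    "high_back":  {"uw", "uh", "ow", "ua", "w"},
    "middle":     {"er", "ea", "ax", "oh", "eh", "ey", "oy"},
}
CONSONANT_CLASSES = {
    "voiced":     set("l m n ng r z zh dh d b g v jh".split()),
    "unvoiced":   set("p s sh th h hh f ch t k".split()),
    "obstruent":  set("t d b p g k jh ch".split()),
    "continuant": set("f v z zh s sh th dh h hh".split()),
    "sonorant":   set("l m n ng r".split()),
}

def classify_phone(phone):
    """Return the set of class labels that apply to one phone, e.g. 'AO1' or 'k'."""
    p = phone.lower().rstrip("012")   # drop stress digits
    labels = {area for area, members in VOWEL_AREAS.items() if p in members}
    labels |= {cls for cls, members in CONSONANT_CLASSES.items() if p in members}
    return labels

def is_marked_vowel(phone):
    # 'Marked' vowels in the sense used above: high front and back (dark) vowels.
    return bool(classify_phone(phone) & {"high_front", "high_back"})
```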

The second set of views is the Poetic Relations View. It is obtained by a single graphical map which however condenses five different levels of

analysis. The Rhyming Structure is obtained by matching line endings in their phonetic form. The result is a capital letter associated with each line on the left. This is accompanied by a metrical measure indicating the number of syllables contained in the line. Then comes the text of the poem and, underneath it, the phonetic translation at syllable level. Finally, another annotation is added, mapping syllable type onto sequences of 0/1. This should serve a metrical analysis which can be accomplished by the user – completed by comments on poetry type which will be presented at the conference. The additional important layer of analysis that this view makes available is an acoustic phonetic image of each line, represented by a coloured streak computed on the basis of the average syllable length in msec derived from our database of syllables of British English.

Finally, the third set of views, the Semantic Relational View, is produced by the modules of the system derived from VENSES (Delmonte et al., 2005). This view is organized around four separate graphical poetic maps: a map which highlights Event and State words in the poem; a map which highlights Concrete vs Abstract words (both of these maps address nouns and adjectives, and they also indicate Affective and Sentiment analysis (Delmonte & Pallotta, 2011; Delmonte 2014), an evaluation related to nouns and adjectives which, however, will be given a separate view once the Appraisal-based dictionary is completed at the conference); a map which contains the main Topics, Anaphoric and Metaphoric relations; and a final map with Predicate-argument relations.
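The rhyme letters described above can be assigned by comparing phonetic line endings. SPARSAR's exact matching criterion is not spelled out here, so this sketch falls back on the strict perfect-rhyme definition quoted in the Related Work section (the final stressed vowel and everything after it must be identical); the function names are ours.

```python
def rhyme_key(phones):
    """Phonetic ending of a line: the last stressed vowel and all following phones.
    phones: Arpabet symbols with stress digits, e.g. ['DH', 'AH0', 'D', 'EY1']."""
    last_stressed = max((i for i, p in enumerate(phones) if p[-1] in "12"), default=0)
    return tuple(p.rstrip("012") for p in phones[last_stressed:])

def rhyme_scheme(lines_phones):
    # Assign the same capital letter to lines whose phonetic endings match.
    scheme, letters = [], {}
    for phones in lines_phones:
        key = rhyme_key(phones)
        if key not in letters:
            letters[key] = chr(ord("A") + len(letters))
        scheme.append(letters[key])
    return scheme
```

On a Shakespearean sonnet this would ideally yield the pattern A B A B C D C D E F E F G G, though half rhymes and archaic pronunciations will break the strict criterion.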

We will now compare a love sonnet like Sonnet 1 to Sonnet 60, which like many other similar sonnets depicts the condition of man condemned to succumb to the scythe of Time – the poet, though, will survive through his verse. Here we will restrict ourselves to showing only part of the maps and omit less relevant ones. In the Phonetic Relations Views the choice of words is strongly related to the main theme, and the result is a gloomier, harsher overall sound quality of the poem: the number of unvoiced consonants is close to that of voiced consonants; in particular, the number of obstruents is higher than the sum of sonorants and continuants. As to Assonances, we see that even

73 though A and E sounds - that is open and middle go from 29/43 that is dark sounds are a little bit vowels - constitute the majority of sounds, there is more than half light ones in Sonnet 1, to 32/40 in a remarkable presence of back and high front Sonnet 60, i.e. they are almost the same amount. vowels. Proportions are very different from what Eventually, the information coming from affective we found in Sonnet 1: 18/49, i.e. dark are one third analysis confirms our previous findings: in Sonnet of light, compared to 21/46, almost half the 1 we see a majority of positive words/propositions, amount. Also consider number of Obstruents 18/15; the opposite applies to Sonnet 60, where the which is higher than number of Sonorants and ratio is reversed 10/21. Continuants together: in Sonnet 1 it was identical to number of Continuants.

This interpretation of the data is expected also for other poets and is proven by Sylvia Plath’s Edge, a poem the author wrote some week before her suicidal death. It’s a terrible and beautiful poem at the same time: images of death are evoked and explicitly mentioned in the poem, together with images of resurrection and nativity. The poem starts with an oxymoron: “perfected” is joined with Similar remarks can be made on the map of “dead body” and both are predicated of the Unvoiced/Voiced opposition, where we see that we “woman”. We won’t be able to show all the maps

74 for lack of space, but the overall sound pattern is There's also a wealth of anaphoric relations strongly reminiscent of a death toll. In the expressed by personal and possessive pronouns Consonances map, there’s a clear majority of which depend on WOMAN. In addition, he system obstruent sounds and the balance between has found metaphoric referential links with such voiced/unvoiced consonants is in favour of the images as MOON GARDEN and SERPENT. In latter. In the Assonances map we see that dark particular the Moon is represented as human - "has vowel sounds are almost in same amount of light nothing to be sad about". sounds 28/30.

These images are all possible embodiment of the The same applies to the Voiced/Unvoiced WOMAN, either directly - the Moon is feminine distinction which is again in favour of the latter. If (she) - or indirectly, when the CHILD that the we look at the topics and the coherence through woman FOLDS the children in her BODY, and the anaphora, we find that the main topic is constituted children are in turn assimilated to WHITE by concepts WOMAN, BODY and CHILD. SERPENTS.

75 4. Computing Mood from the Sonnets relating the theme and mood of the sonnet to the sound intended to be produced while reading it. In this final section we will show data produced by Shakespeare’s search for the appropriate word is a SPARSAR relatively to the relation holding well-known and established fact and a statistics of between Mood, Sound and Meaning in half of his corpus speak of some 29,000 types, a lot more William Shakespeare’s Sonnets. This is done to than any English poet whose corpus has been confirm data presented in the sections above. As quantitatively analyzed so far (see Delmonte, will be made clear from Table 1. below, choice of 2013a). words by Shakespeare has been carefully done in

Table 1. Comparing Polarity with Sound Properties of Shakespeare's Sonnets: Blue Line = Ratio of Unvoiced/Voiced Consonants; Red Line = Ratio of Obstruents/Continuants+Sonorants; Green Line = Ratio of Marked Vowels/Unmarked Vowels; Violet Line = Ratio of Negative/Positive Polarity Words/Propositions.
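The four per-sonnet ratios plotted in Table 1 can be computed from the same class tables as in the sketch of Section 3 (classify_phone above). How SPARSAR counts the underlying sounds and polarity items is not described in detail here, so the counting below is our own assumption.

```python
def sound_ratios(stressed_phones, negative_count, positive_count):
    """stressed_phones: Arpabet phones from stressed syllables of one sonnet.
    negative_count, positive_count: numbers of negatively/positively marked words."""
    def count(label):
        return sum(1 for p in stressed_phones if label in classify_phone(p))
    marked = sum(1 for p in stressed_phones
                 if classify_phone(p) & {"high_front", "high_back"})
    unmarked = sum(1 for p in stressed_phones
                   if classify_phone(p) & {"low", "middle"})
    return {
        "unvoiced/voiced": count("unvoiced") / max(count("voiced"), 1),
        "obstruents/(continuants+sonorants)":
            count("obstruent") / max(count("continuant") + count("sonorant"), 1),
        "marked/unmarked vowels": marked / max(unmarked, 1),
        "negative/positive": negative_count / max(positive_count, 1),
    }
```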

1. sonnets with an overall happy mood; 2. sonnets 2. POSITIVE-CONTRAST (6): sonnet 22, sonnet about love with a contrasted mood – the lover have 24, sonnet 31, sonnet 49, sonnet 55, sonnet 59 betrayed the poet but he still loves him/her, or the 3. NEGATIVE-CONTRAST (1): sonnet 52 poet is doubtful about his friend’s love; 3. sonnets Overall, the system has addressed 33 sonnets out about the ravages of time, the sadness of human of 75 with the appropriate mood selection, 44%. condition (but the poet will survive through his The remaining 42 sonnets have been projected in verse); 4 sonnets with an overall negative mood. the intermediate zone from high peaks to low dips. We will look at peaks and dips in the Table 1. and try to connect them to the four possible Conclusion interpretations of the sonnets. 1. POSITIVE peaks (11): sonnet 6, sonnet 7, We presented a visualization algorithm that works sonnet 10, sonnet 16, sonnet 18, sonnet 25, sonnet on two XML files, the output of SPARSAR system 26, sonnet 36, sonnet 43, sonnet 116, sonnet 130 for poetry analysis. The algorithm decomposes the 4. NEGATIVE dips (15): sonnet 5, sonnet 8, content of the two XML files into 10 graphical sonnet 12, sonnet 14, sonnet 17, sonnet 19, sonnet maps whose content can in turn be organized into 28, sonnet 33, sonnet 41, sonnet 48, sonnet 58, three macro views that encompass most of a sonnet 60, sonnet 63, sonnet 65, sonnet 105 poem’s poetic content. In a final section we also verified (successfully) the hypothesis regarding the

76 existence of an implicit association between sound and Sentiment in Social and Expressive Media: and meaning carried by the words making up a approaches and perspectives from AI (ESSEM poem, by a study of 75 Shakespeare’s sonnets. 2013), CEUR Workshop Proceedings, Torino, 148- More work needs to be done to improve the 155, http://ceur-ws.org/Vol-1096/. Polarity analysis which we intend to project onto Delmonte R. & A.M. Prati. 2014. SPARSAR: An Expressive Poetry Reader, Proceedings of the the “Appraisal Theory” of meaning. A complete Demonstrations at the 14th Conference of the analysis of Shakespeare’s sonnets is also under EACL, Gotheborg, 73–76. way and will be presented at the conference, Delmonte R. 2014. A Computational Approach to together with a comparison with the work of more Poetic Structure, Rhythm and Rhyme, in R. Basili, recent poets. A. Lenci, B. Magnini (eds), Proceedings of CLiC-it - The First Italian Conference on Computational Linguistics, Pisa University Press, Vol.1, 144-150. References Delmonte R. 2014. ITGETARUNS A Linguistic Rule- Based System for Pragmatic Text Processing, in C. Agirrezabal Manex, Bertol Arrieta, Aitzol Astigarraga, Bosco, P. Cosi, F. Dell’Orletta, M. Falcone, S. Mans Hulden, 2013. POS-tag based poetry Montemagni, Maria Simi (eds.), Proceedings of generation with WordNet, Proceedings of the 14th Fourth International Workshop EVALITA, Pisa European Workshop on Natural Language University Press, Vol. 2, 64-69. Generation, pages 162–166. Delmonte R., 2015. SPARSAR - Expressivity in TTS Baayen R. H., R. Piepenbrock, and L. Gulikers. 1995. and its Relations to Semantics, Invited Talk at AISV The CELEX Lexical Database (CD-ROM). 2015, Bologna. Linguistic Data Consortium. Genzel Dmitriy, J. Uszkoreit, and F. Och. 2010. Bacalu C., Delmonte R. 1999. Prosodic Modeling for “Poetic” statistical machine translation: Rhyme and Syllable Structures from the VESD - Venice English meter. In Proceedings of EMNLP. Syllable Database, in Atti 9° Convegno GFS-AIA, Fónagy, Iván (1971) "The Functions of Vocal Style", in Venezia. Seymour Chatman (ed.), Literary Style: A Bacalu C., Delmonte R. 1999. Prosodic Modeling for Symposium. London: Oxford UP, 159-174. Speech Recognition, in Atti del Workshop AI*IA - Gérvas, P. (2001). An expert system for the composition "Elaborazione del Linguaggio e Riconoscimento del of formal Spanish poetry. Knowledge-Based Parlato", IRST Trento, pp.45-55. Systems,14(3):181–188. Brysbaert, M., Warriner, A.B., & Kuperman, V. 2014. Gérvas, P. (2010). Engineering linguistic creativity: Concreteness ratings for 40 thousand generally Bird flight and jet planes. In Proceedings of the known English word lemmas. Behavior Research NAACL HLT 2010 Second Workshop on Methods, 46, 904-911. Computational Approaches to Linguistic Creativity, Byrd Roy J. and M. S. Chodorow. 1985. Using an pages 23–30. online dictionary to find rhyming words and Greene E., T. Bodrumlu, K. Knight. 2010. Automatic pronunciations for unknown words. In Proceedings Analysis of Rhythmic Poetry with Applications to of the 23rd Annual Meeting of ACL, 277–283. Generation and Translation, in Proceedings of the Delmonte R., 2013a. Transposing Meaning into 2010 Conference on EMNLP, 524–533. Immanence: The Poetry of Francis Webb, in Rivista Jakobson, R. 1978. Six lectures on sound and meaning di Studi Italiani, Vol. XXX1, n° 1, 835-892. (Trans.: J. Mepham). Cambridge: MIT Press Delmonte R., et al. 2005. VENSES – a Linguistically- (Original work published in 1976). 
Based System for Semantic Evaluation, in J. Jakobson, R., & Waugh, L. 1978. The sound shape of Quiñonero-Candela et al.(eds.), Machine Learning language. Bloomington: Indiana University Press. Challenges. LNCS, Springer, Berlin, 344-371. Kao Justine and Dan Jurafsky. 2012. "A Computational Delmonte R. and V. Pallotta, 2011. Opinion Mining and Analysis of Style, Affect, and Imagery in Sentiment Analysis Need Text Understanding, in Contemporary Poetry". in Proc. NAACL Workshop "Advances in Distributed Agent-based Retrieval on Computational Linguistics for Literature. Tools", Springer, 81-96. Keppel-Jones David. 2001. The Strict Metrical Delmonte R. & C. Bacalu. 2013. SPARSAR: a System Tradition: Variations in the Literary Iambic for Poetry Automatic Rhythm and Style AnalyzeR, Pentameter from Sidney and Spenser to Matthew SLATE 2013 - Demonstration Track, Grenoble. Arnold, Mcgill Queens Univ. Pr., 280. Delmonte R. 2013b. Computing Poetry Style, in C. Manurung Hisar Maruli, G. Ritchie, and H. Thompson. Battaglino, C. Bosco, E. Cambria, R. Damiano, V. 2000. Towards a computational model of poetry Patti, P. Rosso (eds.), Proceeding ESSEM - Emotion generation. In Proceedings of AISB Symposium on

77 Creative and Cultural Aspects and Applications of AI and Cognitive Science, 17-20. Manurung M.H., G. Ritchie, H. Thompson. 2000. A Flexible Integrated Architecture For Generating Poetic Texts. in Proceedings of the Fourth Symposium on Natural Language Processing (SNLP 2000), Chiang Mai, Thailand, 7-22. Macdermott M.M. 1940. Vowel Sounds in Poetry: Their Music and Tone Colour, Psyche Monographs, No.13, London: Kegan Paul, 148 pp. Mazzeo, M. 2004. Les voyelles colorées: Saussure et la synesthésie. Cahiers Ferdinand de Saussure, 57,129– 143. Mohammad Saif, Colourful Language: Measuring Word-Colour Associations, 2011a. In Proceedings of the ACL 2011 Workshop on Cognitive Modeling and Computational Linguistics (CMCL), June 2011, Portland, OR. Mohammad Saif, Even the Abstract have Colour: Consensus in Word Colour Associations, 2011b. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, June 2011, Portland, OR. Sonderegger Morgan. 2011. Applications of graph theory to an English rhyming corpus. Computer Speech and Language, 25:655–678. Sravana Reddy & Kevin Knight. 2011. Unsupervised Discovery of Rhyme Schemes, in Proceedings of the 49th Annual Meeting of ACL: shortpapers, 77-82. Toivanen Jukka, Hannu Toivonen, Alessandro Valitutti, & Oskar Gross, 2012. Corpus-based generation of content and form in poetry. In Proceedings of the Third International Conference on Computational Creativity. Tsur, Reuven. 1992. What Makes Sound Patterns Expressive: The Poetic Mode of Speech-Perception. Durham N. C.: Duke UP. Tsur Reuven. 1997a. "Poetic Rhythm: Performance Patterns and their Acoustic Correlates". Versification: An Electronic Journal Devoted to Literary Prosody. (http://sizcol1.u-shizuoka- ken.ac.jp/versif/Versification.html) Tsur Reuven. 2012. Poetic Rhythm: Structure and Performance: An Empirical Study in Cognitive Poetics, Sussex Academic Press, 472.

78 Towards a better understanding of Burrows’s Delta in literary authorship attribution

Stefan Evert and Thomas Proisl Fotis Jannidis, Steffen Pielstrom,¨ FAU Erlangen-Nurnberg¨ Christof Schoch¨ and Thorsten Vitt Bismarckstr. 6 Universitat¨ Wurzburg¨ 91054 Erlangen, Germany Am Hubland [email protected] 97074 Wurzburg,¨ Germany [email protected] [email protected]

Abstract and other linguistic features. This, in turn, allows using the relative similarity of the texts to each other Burrows’s Delta is the most established mea- in clustering or classification tasks and to attribute a sure for stylometric difference in literary au- thorship attribution. Several improvements on text of unknown authorship to the most similar of a the original Delta have been proposed. How- (usually closed) set of candidate authors. ever, a recent empirical study showed that One of the most crucial elements in quantitative none of the proposed variants constitute a ma- authorship attribution methods is the distance mea- jor improvement in terms of authorship attri- sure used to quantify the degree of similarity be- bution performance. With this paper, we try tween texts. A major advance in this area has been to improve our understanding of how and why these text distance measures work for author- Delta, as proposed by Burrows (2002), which has ship attribution. We evaluate the effects of proven to be a very robust measure in different gen- standardization and vector normalization on res and languages (Hoover, 2004b; Eder and Ry- the statistical distributions of features and the bicki, 2013). Since 2002, a number of variants resulting text clustering quality. Furthermore, of Burrows’s Delta have been proposed (Hoover, we explore supervised selection of discrimi- 2004a; Argamon, 2008; Smith and Aldridge, 2011; nant words as a procedure for further improv- Eder et al., 2013). In a recent publication, empir- ing authorship attribution. ical tests of authorship attribution performance for Delta as well as 13 precursors and/or variants of 1 Introduction it have been reported (Jannidis et al., 2015). That Authorship Attribution is a research area in quantita- study, using three test corpora in English, German tive text analysis concerned with attributing texts of and French, has shown that Burrows’s Delta remains unknown or disputed authorship to their actual au- a strong contender, but is outperformed quite clearly thor based on quantitatively measured linguistic evi- by Cosine Delta as proposed by Smith and Aldridge dence (Juola, 2006; Stamatatos, 2009; Koppel et al., (2011). The study has also shown that some of the 2008). Authorship attribution has applications e.g. theoretical arguments by Argamon (2008) do not in literary studies, history, and forensics, and uses find empirical confirmation. This means that, in- methods from Natural Language Processing, Text triguingly, there is still no clear theoretical model Mining, and Corpus Stylistics. The fundamental as- which is able to explain why these various distance sumption in authorship attribution is that individuals measures yield varying performance; we don’t have have idiosyncratic habits of language use, leading a clear understanding why Burrows’s Delta and Co- to a stylistic similarity of texts written by the same sine Delta are so robust and reliable. person. Many of these stylistic habits can be mea- In the absence of compelling theoretical ar- sured by assessing the relative frequencies of func- guments, systematic empirical testing becomes tion words or parts of speech, vocabulary richness, paramount, and this paper proposes to continue such

79

Proceedings of NAACL-HLT Fourth Workshop on Computational Linguistics for Literature, pages 79–88, Denver, Colorado, June 4, 2015. c 2015 Association for Computational Linguistics investigations. Previous work has focused on feature transformation, in order to adapt the weight selection either in the sense of deciding what type given to each of the mfw. The most com- of feature (e.g. character, word or part-of-speech n- mon choice is to standardize features using a grams) has the best discriminatory power for author- z-transformation ship attribution (Forsyth and Holmes, 1996; Rogati fi(D) µi and Yang, 2002), or in the sense of deciding which zi(D) = − σi part of the list of most frequent words yields the best results (Rybicki and Eder, 2011). Other pub- where µi is the mean of the distribution of fi across the collection and σ its standard devi- lications explored strategies of deliberately picking D i a very small numbers of particularly disciminative ation (s.d.). After the transformation, each fea- features (Cartright and Bendersky, 2008; Marsden ture zi has mean µ = 0 and s.d. σ = 1. Dissimilarities between the scaled feature vec- et al., 2013). Our strategy builds on such approaches • but differs from them in that we focus on word un- tors are computed according to some distance igrams only and examine how the treatment of the metric. Optionally, feature vectors may first input feature vector (i.e., the list of word tokens be normalized to have length 1 under the same used and their frequencies) interacts with the per- metric. formance of distance measures. Each distance mea- Different choices of a distance metric lead to vari- sure implements a specific combination of standard- ous well-known variants of Delta. The original Bur- ization and/or normalization of the feature vector. rows’s Delta ∆B (Burrows, 2002) corresponds to the In addition, the feature vector can be preprocessed Manhattan distance between feature vectors: in several ways before submitting it to the distance ∆ (D,D0) = z(D) z(D0) B k − k1 measure. nw In the following, we report on a series of exper- = zi(D) zi(D0) iments which assess the effects of standardization | − | Xi=1 and normalization, as well as of feature vector ma- Quadratic Delta ∆Q (Argamon, 2008) corre- nipulation, on the performance of distance measures sponds to the squared Euclidean distance: for authorship attribution. 2 Although we use attribution success as our per- ∆Q(D,D0) = z(D) z(D0) k − k2 formance indicator, our ultimate goal is not so much nw 2 = (z (D) z (D0)) to optimize the results, but rather to gain a deeper i − i understanding of the mechanisms behind distance Xi=1 measures. We hope that a deeper theoretical under- and is fully equivalent to Euclidean distance ∆Q. standing will help choose the right parameters in au- Cosine Delta ∆ (Smith and Aldridge, 2011) ∠ p thorship attribution cases. measures the angle α between two profile vectors

2 Notation ∆∠(D,D0) = α All measures in the Delta family share the same ba- which can be computed from the cosine similarity of sic procedure for measuring dissimilarities between x = z(D) and y = z(D0): 1 the text documents D in a collection of size n . T D D x y Each text D is represented by a profile of cos α = • ∈ D x 2 y 2 the relative frequencies fi(D) of the nw most k k · k k T nw frequent words (mfw) w1, w2, . . . wnw . where x y = i=1 xiyi is the dot product and The complete profile of D is given by the fea- nw 2 • x 2 = i=1Pxi denotes the length of the vector ture vector f(D) = (f (D), . . . , f (D)). k k 1 nw x accordingq to the Euclidean norm. All three vari- Features are re-scaled, usually with a linear P • ants of Delta agree in using standardized frequencies 1The notation introduced here follows Argamon (2008) and of the most frequent words as their underlying fea- Jannidis et al. (2015). tures.

80 German (z−scores) 3 Understanding the parameters of Delta 100

Different versions of Delta can be obtained by set- 80 ting the parameters of the general procedure outlined 60 in Sec. 2, in particular: 40

Burrows Delta adjusted Rand index (%) adjusted Rand index nw, i.e. the number of words used as features in 20 Quadratic Delta • Cosine Delta the frequency profiles; 0 10 20 50 100 200 500 1000 2000 5000

how these words are selected (e.g. taking the 10000 • # features most frequent words, choosing words based on (a) the number df of texts they occur in, etc.); English (z−scores)

how frequency profiles f(D) are scaled to fea- 100 •

ture vectors z(D) 80

whether feature vectors are normalized to unit 60 •

length z(D) = 1 (and according to which 40 k k Burrows Delta

norm); and (%) adjusted Rand index 20 Quadratic Delta Cosine Delta

which distance metric is used to measure dis- 0 • 10 20 50 100 200 500

similarities between feature vectors. 1000 2000 5000 10000 We focus here on three key variants of the Delta # features measure: (b) French (z−scores) (i) the original Burrows’s Delta ∆B because it is 100 consistently one of the best-performing Delta vari- 80 ants despite its simplicity and lack of a convincing 60 mathematical motivation (Argamon, 2008); 40

(ii) Quadratic Delta ∆Q because it can be derived Burrows Delta adjusted Rand index (%) adjusted Rand index 20 Quadratic Delta from a probabilistic interpretation of the standard- Cosine Delta 0

ized frequency profiles (Argamon, 2008); and 10 20 50 100 200 500 1000 2000 5000 # features 10000 (iii) Cosine Delta ∆∠ because it achieved the best results in the evaluation study of Jannidis et al. (c) (2015). Figure 1: Clustering quality for German, English and All three variants use some number nw of mfw French texts in our replication of Jannidis et al. (2015) as features and scale them by standardization (z- transformation). At first sight, they appear to differ only with respect to the distance metric used: Man- As a result, ∆Q and ∆∠ are equivalent for nor- hattan distance (∆B), Euclidean distance (∆Q), or malized feature vectors. The difference between angular distance (∆∠). Quadratic and Cosine Delta is a matter of the nor- There is a close connection between angular dis- malization parameter at heart; they are not based on tance and Euclidean distance because the (squared) genuinely different distance metrics. Euclidean norm can be expressed as a dot product x 2 = xT x. Therefore, 3.1 The number of features k k2 As a first step, we replicate the findings of Jannidis x y 2 = (x y)T (x y) k − k2 − − et al. (2015). Their data set is composed of three = xT x + yT y 2xT y collections of novels written in English, French and − 2 2 German. Each collection contains 3 novels each = x 2 + y 2 2 x 2 y 2 cos α k k k k − k k k k from 25 different authors, i.e. a total of 75 texts. If the profile vectors are normalized wrt. the Eu- The collection of British novels contains texts pub- clidean norm, i.e. x = y = 1, Euclidean k k2 k k2 lished between 1838 and 1921 coming from Project distance is a monotonic function of the angle α: Gutenberg.2 The collection of French novels con-

x y 2 = 2 2 cos α 2www.gutenberg.org k − k2 −

81 English tains texts published between 1827 and 1934 origi- ●

1. ● 2. 3. 4. 5. 6. 7. 8. 9. 3 10. ● 101. 102. 103. 104. 105. 901. 902. 903. 904. 905. nating mainly from Ebooks libres et gratuits. The 0.06

● collection of German novels consists of texts from ● ● ● ● ● 0.04 ● ● the 19th and the first half of the 20th Century which ● ●

● 4 ● ● ● relative frequency relative come from the TextGrid collection. 0.02 ● ● ● ● ● ● ● ● ● Our experiments extend the previous study in ● ● ● ● ● ● 0.00 I a in to of he the sky and that was gold think such down three respects: might made couldn honest Robert 1. We use a different clustering algorithm, partitioning around medoids (Kaufman and Figure 2: Distribution of relative frequencies for selected Rousseeuw, 1990), which has proven to be very English words (numbers at the top show mfw rank) robust especially on linguistic data (Lapesa and Evert, 2014). The number of clusters is set to the collections are correctly grouped by author. 25, corresponding to the number of different It is obvious from these evaluation graphs that an ap- authors in each of the collections. propriate choice of nw plays a crucial role for ∆Q, 2. We evaluate clustering quality using a well- and to a somewhat lesser extent also for ∆B. In all established criterion, the chance-adjusted Rand three languages, clustering quality is substantially index (Hubert and Arabie, 1985), rather than diminished for nw > 5 000. Since there are already cluster purity. This improves comparability noticeable differences between the three collections, with other evaluation studies. it has to be assumed that the optimal nw depends 3. Jannidis et al. (2015) consider only three arbi- on many factors – language, text type, length of the trarily chosen values n = 100, 1000, 5 000. w texts, quality and preprocessing of the text files (e.g. Since clustering quality does not always im- spelling normalization), etc. – and cannot be known prove if a larger number of mfw is used, this a priori. It would be desirable either to re-scale the approach draws an incomplete picture and does relative frequencies in a way that gives less weight not show whether there is a clearly defined op- to “noisy” features, or to re-rank the most frequent timal value n or whether the Delta measures w words by a different criterion for which a clear cut- are robust wrt. the choice of n . Our evaluation w off point can be determined. systematically varies n from 10 to 10 000. w Alternatively, more robust variants of Delta such Fig. 1(a) shows evaluation results for the German as ∆∠ might be used, although there is still a gradual texts; Fig. 1(b) and 1(c) show the corresponding re- decline, especially for the French data in Fig. 1(c). sults on English and French data. Our experiments Since ∆∠ differs from the least robust measure ∆Q confirm the observations of Jannidis et al. (2015): only in its implicit normalization of the feature vec- For a small number of mfw as features (roughly • tors, vector normalization appears to be the key to n 500), ∆ and ∆ achieve the same clus- w ≤ B Q robust authorship attribution. tering quality. However, ∆Q proves less robust if the number of features is further increased 3.2 Feature scaling (nw > 500), despite the convincing probabilis- Burrows applied a z-transformation to the frequency tic motivation given by Argamon (2008). profiles with the explicit intention to “treat all of ∆ consistently outperforms the other Delta • ∠ these words as markers of potentially equal power” measures, regardless of the choice of nw. It is (Burrows, 2002, p. 271). Fig. 2 and 3 illustrate this robust for values up to nw = 10 000, degrading intuition for some of the most frequent words in the much more slowly than ∆B and ∆Q. English collection. The clustering quality achieved by ∆ is very • ∠ impressive. 
With an adjusted Rand index above Without standardization, words with mfw ranks above 100 make a negligible contribution to the fre- 90% for a wide range of nw, most of the texts in quency profiles (Fig. 2). The evaluation graph in 3www.ebooksgratuits.com Fig. 4 confirms that Delta measures are hardly af- 4www.textgrid.de/Digitale-Bibliothek fected at all by words above mfw rank 100 if no z-

82 English German

● 8 0.6 ● 1. 2. 3. 4. 5. 6. 7. 8. 9. ●●● ● ● ●● ●●● ● ●●● ●●●●●●● ● 10. ●●●●●●● ●● ●● ●●●●● ●●●● = ●●● ●●●●● ●●●●● ●●●●●● ● ● df 75 101. 102. 103. 104. 105. 901. 902. 903. 904. 905. ●● ●● ● ●● ● ●● ●●●●● ●●● ● ● ●●●●● ● ●●●●●●● ●●● ● ●● ●●● ● ●● ●● ● ● ●● ●●● ● ●● ● ● ●● ●●● ●● ●● = ● ● ● ●●● ●●●●●●● ●●● ● ●● ● df 50 ● ●● ● ●●●●●●●●● ● ●● 6 ● ● ● ● ●●● ● ● ● ●●●●● ● ●●● ● ●●●●● ● ●●● ● ● 0.5 ● ● ● ● ● ● ● ● ● ● ●●●● ●● ●●● ●●● ●● ● ● = ● ●● ●●● ● ●●●●● ●● ●●● ● ●● ● ● ● ●● ● df 25 ● ● ●● ● ●● ● ● ● ●●●●●● ● ●● ● ● ● ●●● ● ●● ●● ● ● ● ●●●●● ●●● ●● ● ● ● ●● ●●●● ● ● ● ●● ●●●●● ●●● ●● ● ● ● ● ● ● ● ●●●●●●● ● ●● ● ● ● ● ● ● ●● ●●●●● ● ●● ●●● ● ● ● ● = ● ●●● ● ● ● ● ●● ●● ● ● ●●● ● ● ● ● df 1 ● ● ●●●●● ●●● ●● ●● ● ● ● ●● ● ● ● ●● ● ● ●●●● ● ● ●●●● ● ● ●●●● ●●● ●● ● ● ● ● ● ● ● ● ●● ● ● ●●● ●● ● ●● ●● ● ● ●● ● ● ● ● ● ● ● ● ● ●●● ●● ●● ●●●●● ●● ● ● 4 ● ●● ●●● ● ●● ● ● ● ●●● ● ● ●●●●●● ● ● ● ● ● ●● ● ●●●● ● ●● ● ●● ●●● ●● ● ● ● ● ●●●● ● ●● ● ● ● ● ● ● ● ● ● ●●● ●● ●● ●● ●●● ●●●● ●●● ●●● ● ● ● ● ●●●● ●● ● ●● ●●●●● ●● ● ● ● ●● ● ● ● ● ● 0.4 ● ●● ● ● ●● ● ● ●● ●● ● ● ● ● ●●●●●● ● ● ●● ●●● ● ● ●●●●● ● ● ●●●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●●● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●●● ●● ●● ●● ● ● ● ● ●●●●● ●● ● ● ● ● ● ● ● ●● ●● ● ●● ● ● ●● ● ●●●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●●● ●●● ●●● ● ●● ● ●●● ● ●●● ● ●● ●● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ●● ●●● ●● ●●●●●● ●●●● ●●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●●●●● ●● ● ● ● ●●●●●● ●●● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ●●●●●● ● ●● ●●● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●●● ●●●● ● ●●●● ●●●● ●●● ●● ● ● ●●●●● ●● ●● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ● ● ●● ●●● ● ●● ●● ●●● ●●● ●●● ●●● ● ●●●●● ●● ● ● 2 ● ● ● ●● ● ● ●● ● ● ● ●● ●●● ●● ● ● ●●●●● ●●● ● ●●● ● ●●● ●●● ● ● ● ● ● ● ● ● ●● ● ●● ●● ● ●●● ● ● ● ●●● ●● ●● ●●● ● ●●●●●●●● ●●●● ●●● ●● ●●●●●●● ●● ● ● ● ● ● ● ● ● ● ●● ●● ● ●● ● ● ●●● ●●● ●●●●●● ●●●●● ●●● ●● ●●●●●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ●● ● ● ● ●●●●●● ●●●● ● ●●● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ●●●●● ● ● ●●● ●● ● ●●●●●● ●●●● ●● ●●● ● ●●● ● ●● ● ●●● ● ● ●● ● ●● ● ● ● ● ●● ● ● ●● ● ●●●●●●●● ● ●● ●●● ●● ●● ● ●● ● ●●●●● ● ● ● ●● ● ● ● 0.3 ● ● ● ● ● ● ●●●● ●● ● ● ● ● ● ● ● ● ●●●●●● ●●● ● ●● ●●●● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ● ● ●●●●●●● ●●●● ● ●●● ● ●● ●●● ●●● ●● ●●● ●● ● ● ● ● ●● ● ● ●● ● ●● ● ●●● ●●●●●● ●● ● ●●●● ●●●●● ●● ●●●● ● ●● ●●●●●●●● ● ● ●● ●●● ● ●●●● ●● ● ●● ● ● ● ● ● ● ● ●● ●●●● ●●● ● ● ● ●● ●● ●●●●● ●● ● ● ●●●●●●●●●●●●●●●●●● ● ● ● ●● ● ● ● ● ●● ● ● ● ● ● ● ●● ●● ●● ● ●●● ●● ●● ●● ●● ●● ●●● ● ●●● ●●●●●●●● ● ●● ●●● ●● ●● ●● ● ● ● ● ● ● ●● ● ● ●● ●● ● ● ●● ● ●●●●● ●●● ● ●● ●●●● ● ●●●●● ●●● ● ● ● ● ● ● ● ● ● ●● ● ● ●●●● ● ●● ● ●●●●●● ●●●●●● ●●● ● ●●● ● ● ● ●●●●●●●●●●●●●●●●● ● ●● ● ●● ●● ● ● ●● ● ● ●● ● ● ● ●●●● ● ● ●● ● ●●●●● ●●● ● ● ●● ●●● ● ● ●●●●●● ●●●●●●● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ●●● ●● ●● ●● ● ●●● ● ●● ●●● ●● ● ●●●●● ● 0 ● ● ● ● ● ● ● ● ● ●●● ●● ● ● ● ●● ● ●● ●● ●● ●● ● ●●●●●●●● standardized rel. freq. 
standardized ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ●● ●● ●●● ●●● ● ● ● ● ● ● ● ●● ● ● ●● ● ●● ●● ●● ●● ● ●●●●●● ●●●●●●● ●● ●● ●● ●●● ● ● ● ● ● ● ● ● ●● ● ●● ● ● ●● ● ●●●●●● ●●●●●● ● ● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ●●● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ●● ●●● ●●● ● ●● ●● ●● ● ● ● ●●●● ● ● ● ● ● ●● ● ●● ● ● ● ●●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ●●●● ●●● ●●●●● ●●● 0.2 ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●●●● ● ● ●● ● ● ● ● ● ● ● ●● ●● ● ●● ● ●● ● ● ● ● ●● ●● ● ● ● ● ● ● ● ●●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ●● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ● ●● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

−2 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ●● ●● ● ● ●● ●● ●● ●●●● ●●● ●● ● ●●●●●● ● ●●● ●●● ●● ●●●●● ●● ●● ●● ●● ●●● ●●●●●●●●● ●●● ●● ●●● ●●●●●● ●●●● ●● ●●●● ●●●● ●●●●●●● ●●● ●●● ●● ●●● ● ●●●●●● ●● ● ●●●●●● 0.1 I a in to of he the sky and that mean contribution to Manhattan distances mean contribution was gold think such down might made couldn honest Robert 0.0 0 10000 20000 30000 40000 50000 Figure 3: Distribution of standardized z-scores most frequent words

German (relative frequencies) Figure 5: Average contribution of features to pairwise ∆B distances; colour indicates document frequency (df) 100 80

60 z-transformed feature vectors z(D) and z(D0). It

40 shows that less frequent words have a moderately

Burrows Delta adjusted Rand index (%) adjusted Rand index 20 Quadratic Delta smaller weight than the mfw up to rank 5 000. Cosine Delta

0 Words that occur just in a small number of texts 10 20 50 100 200 500 1000 2000 5000

10000 (their document frequency df, indicated by point # features colour in Fig. 5) carry a low weight regardless of Figure 4: Clustering quality of German texts based on their overall frequency. unscaled relative frequencies (without standardization) Our conclusion is that ∆B appears to be more ro- bust than ∆Q precisely because it gives less weight to the standardized frequencies of “noisy” words transformation is applied. While results are robust above mfw rank 5 000 in contrast to the claim made with respect to nw, the few mfw that make a no- by Burrows. Moreover, it strongly demotes words ticeable contribution are not sufficient to achieve a that concentrate in a small number of texts, which reasonable clustering quality. After standardization, are likely idiosyncratic expressions from a particular the z-scores show a similar distribution for all fea- novel (e.g. character names) or a narrow sub-genre. tures (Fig. 3). It is plausible that such words are of little use for the Argamon (2008) argued that standardization is purpose of authorship attribution. only meaningful if the relative frequencies roughly Surprisingly, ranking the mfw by their contribu- follow a Gaussian distribution across the texts in tion to ∆B (so that e.g. words with df < 10 are a collection , which is indeed the case for high- D never included as features) is less effective than frequency words (Jannidis et al., 2015). With ranking by overall frequency (not shown for space some further assumptions, Argamon showed that reasons). We also experimented with a number of ∆Q(D,D0) can be used as a test statistic for author- alternative scaling methods – including the scaling 2 ship attribution, with an asymptotic χnw distribution suggested by Argamon (2008) for a probabilistic in- under the null hypothesis that both texts are from the terpretation of ∆B – obtaining consistently worse same author. It can also be shown that standardiza- clustering quality than with standardization. tion gives all features equal weight in ∆Q in a strict sense, i.e. each feature makes exactly the same aver- 3.3 Vector normalization age contribution to the pairwise squared Euclidean As shown at the beginning of Sec. 3, the main differ- distances (and analogously for ∆ ). ∠ ence between ∆∠ (the best and most robust measure This strict interpretation of equal weight does not in our evaluation) and ∆Q (the worst and least robust hold for ∆B, so Burrows’s original intention has measure) lies in the normalization of feature vectors. not been fully realized. Fig. 5 displays the actual This observation suggests that other Delta measures contribution made to each feature to ∆B(D,D0), such as ∆B might also benefit from vector normal- i.e. to the pairwise Manhattan distances between ization. We test this hypothesis with the evaluation

83 hw nFg 6. Fig. in shown norm) Manhattan = L1 norm, Euclidean clustering = on (L2 normalization quality vector of effect The 6: Figure aedge naltxsb ie uhr leading values the author, of given magnitude average a the the by in differences texts to to all expressed in not degree is same pattern collection. characteristic text the This across frequency average the negativei.e. and positive of pattern deviations the by reflected primar- ily is style authorial that conjecture We follows. is for (L1 normalization used appropriate an whether difference to n equal and quality red clustering ∆ the in resulting with curve ones), blue black the (contrast ization ∆ Co- for ( curves the Delta to sine identical fact in are malization adjusted Rand index (%) adjusted Rand index (%) adjusted Rand index (%) w ∠ B h ult uvsfor curves quality The u ettv xlnto o hs nig sas is findings these for explanation tentative Our 0 20 40 60 80 100 0 20 40 60 80 100 0 20 40 60 80 100 although , > sas mrvdsbtnilyb etrnormal- vector by substantially improved also is 10 10 10 000 5

20 20 20 z ∆ i neetnl,i em omk little make to seems it Interestingly, . ∆ fwr rqece rmte“norm”, the from frequencies word of ∠ ∆ B n r o hw eaaeyhere. separately shown not are and ) 50 50 50 B n 2for L2 and ih esihl esrbs for robust less slightly be might 100 100 100 German (z−scores) English (z−scores) French (z−scores)

200 200 200 # features # features # features ∆ ∆ Q Q

ihEciennor- Euclidean with 500 500 500 rnt(ieversa). (vice not or )

1000 1000 1000

2000 2000 2000 Quadratic Delta/L1 Burrows Delta/L2 Cosine Delta Burrows Delta/L1 Burrows Delta Quadratic Delta/L1 Burrows Delta/L2 Cosine Delta Burrows Delta/L1 Burrows Delta Quadratic Delta/L1 Burrows Delta/L2 Cosine Delta Burrows Delta/L1 Burrows Delta

5000 5000 5000 84

10000 10000 10000 ain eutn nmn lseigerrors. clustering many in resulting uation, on author influence same stronger the a from have texts between length vector stylis- in Differences their similar. and increasingly become diagonal profiles the tic below are panel right Man- the and for case norm the hattan not is of this (but the contribution values outweighs negative norm the Euclidean the 3), to values Fig. positive (see words of frequency distribution skewed the of Because of results evaluation poor quality. clustering improves thus and the author from texts same normal- between cases, distances such the the reduces In among ization as novels. well French as and norm English (L1) observed Manhattan be can the patterns for Similar panel). right the two, Spielhagen in and other (Freytag much the length Euclidean than a larger norm shows i.e. the texts from the deviation of larger one however, other authors, For deviations. negative the vs. positive and of length authors balance vector some by well panels, quite characterized both are In deviations. negative deviations positive larger) (or z more diag- have dashed the thus below onal Texts pattern. stylistic indicator its rough of a as frequency) below-average with ( features negative and quency) ulda egh h nua oiino text features a positive of of ( contribution position relative angular the The shows same length. the of Euclidean vectors feature to correspond thus circle niae t ulda L)norm (L2) Euclidean its indicates tor normalized than worse panel right the by ( depicted situation the in whereas mag- average the the of out equalizes nitude stand it author because each clearly of more normalization pattern vector stylistic case, the the makes indeed is this If tors. z oteaeaevco length vector average the to panel left the by depicted au- situation German the ( different In by written thors. texts for vectors z n n i i et bv h ignlhv oe(rlarger) (or more have diagonal the above texts , i w w i.7as eel luil xlnto o the for explanation plausible a reveals also 7 Fig. ahpiti h lt ersnstefauevec- feature the represents plots the in point Each i.7vsaie h ulda egho feature of length Euclidean the visualizes 7 Fig. n ec h length the hence and z > 150 = ( 000 5 = D 0 ) ..wrsue ihaoeaeaefre- above-average with used words i.e. , foetx.Tedsac rmteorigin the from distance The text. one of omlzto a osbtnileffect, substantial no has normalization ) z i unnormalized ) . ∆ B .Teeoe l onsi the in points all Therefore, ). ∆ k ∠ z ( ∆ c.Fg 1(a)). Fig. (cf. D ∆ √ Q ) Q k ∆ n z as itne nti sit- this in distances w i ftefauevec- feature the of Q k l onso a on points All . < n z efrsmuch performs w ( 0 D od used words , z sincreased. is ) i k o lower- for 2 relative , German (L2, 150 mfw) German (L2, 5000 mfw) 1.0 1.0 ● Arnim Jean Paul ● Arnim Jean Paul ● Dohm Kafka ● ● Dohm Kafka ● Ebner Eschenbach Keller ● Ebner Eschenbach Keller ● ● Fischer Lewald ● Fischer Lewald ● Fontane Marlitt ● Fontane Marlitt ● Fouqué May Fouqué May

0.8 François Raabe 0.8 François Raabe Freytag Schopenhauer Freytag Schopenhauer Goethe Spielhagen ● ● Goethe Spielhagen Gutzkow Tieck ● Gutzkow Tieck ● Hahn Hahn Wassermann Hahn Hahn Wassermann Hauff Wieland Hauff Wieland ● ● ● Huber ● Huber ● ● 0.6 ● 0.6 ●● ● ● ● ● ● ● ● ● ● ● ● ● ● mean negative score mean negative score mean negative

● 0.4 0.4 0.2 0.2 0.0 0.2 0.4 0.6 0.8 1.0 1.2 0.0 0.2 0.4 0.6 0.8 1.0 1.2

mean positive score mean positive score

Figure 7: Euclidean length of unnormalized vectors for texts in the German corpus. Coordinates of the points indicate the average contribution of positive and negative features to the total length of the vector.

English French German nr. of features 246 381 234 SVC accuracy 0.99 ( 0.04) 1.00 ( 0.00) 1.00 ( 0.00) ± ± ± MaxEnt accuracy 1.00 ( 0.00) 1.00 ( 0.00) 1.00 ( 0.00) ± ± ± Cosine Delta ARI 0.966 1.000 1.000

Table 1: Results for cross-validation and clustering experiments

4 Feature selection as a contribution to ing training, an SVC assigns weights to the individ- explanation ual features, with greater absolute weights indicat- ing more important features. We can use the weight In this section, we explore another strategy for ob- magnitude as ranking criterion and perform recur- taining an optimal set of features. Instead of us- sive feature elimination by repeating the following ing a threshold on (document) frequencies of words three steps: for feature selection, we systematically identify a 1. Train the classifier, i. e. assign weights to the set of discriminant words by using the method of features. recursive feature elimination. The resulting fea- 2. Rank the features according to the absolute ture set is much smaller and not only works well value of their weights. in a machine learning setting, but also outperforms 3. Prune the n lowest ranking features. the most-frequent-words approach when clustering Since it is more efficient to remove several features a test corpus of mainly unseen authors. at a time (with the possible disadvantage of intro- 4.1 Recursive feature elimination ducing a slight degradation in classification perfor- mance) and we are starting with a few hundreds of Recursive feature elimination is a greedy algorithm thousands of features and are aiming for a much that relies on a ranking of features and on each step smaller set, we first reduce the number of features to selects only the top ranked features, pruning the re- 500 in three stages. First we reduce the number of maining features. For our feature elimination ex- features to 50 000 by recursively pruning the 10 000 periments we rely on a Support Vector Classifier lowest ranking features, then we reduce those 50 000 (SVC) with linear kernel for feature ranking.5 Dur- features to 5 000 features by pruning 1 000 features 5The scikit-learn implementation of Support Vector Ma- at a time and finally we reduce the number of fea- chines we used is based on libsvm and supports multiclass clas- sification via a one-vs.-one scheme. We used it with default parameter settings.

85 (a) English (b) French (c) German

Figure 8: Distributions of document frequencies for selected features

unscaled full fs rescaled full fs selected fs SVC accuracy 0.91 ( 0.03) 0.57 ( 0.13) 0.84 ( 0.14) ± ± ± MaxEnt accuracy 0.95 ( 0.03) 0.95 ( 0.03) 0.90 ( 0.08) ± ± ± Cosine Delta ARI 0.835 0.835 0.871

Table 2: Evaluation results for the selected features on the second additional test set compared to all features (for Cosine Delta clustering, 2 000 mfw are used in this case) tures to 500 by pruning the 100 lowest ranking fea- ument frequencies of those features. For all corpora tures at a time. there are some highly specific features that occur Once we have reduced the number of features to only in a fraction of the texts, but most selected fea- 500, we try to find an optimal number of features tures have a rather high document frequency. by pruning only one feature at a time, doing a strati- The features which turn out to be maximally dis- fied threefold cross-validation on the data after each tinctive of authors show a number of interesting pat- pruning step to test classification accuracy. terns. For example, they are not limited to function Since Support Vector Machines are not scale- words. This is a relevant finding, because it is of- invariant, we rescaled each feature to [0, 1] dur- ten assumed that function words are the best indica- ing preprocessing. Simply rescaling the data should tors of authorship. However, content words may be work better than standardization because it preserves more prone to overfitting than function words. Also, sparsity (the standardized feature matrix is dense, re- in the English and French collections, a small num- placing every zero by a small negative value). As an ber of roman numerals are included (such as “XL” or additional preprocessing step we removed all words “XXXVVII”), which may be characteristic of novels with a document frequency of 1. with an unusually high number of chapters. This, in turn, may in fact be characteristic of certain authors. 4.2 Examination and validation Finally, in the German collection, a certain number The recursive feature elimination process can of words show historical orthographic variants (such choose from an abundance of features, and therefore as “Heimath” or “giebt”). These are most likely ar- it is no surprise that it is able to find a subset of fea- tifacts of the corpus rather than actual stylistic char- tures that yields perfect results for both classification acteristics of certain authors. (accuracy was determined using stratified threefold Perfect cross-validation and clustering results cross-validation and is given as the mean plus/minus suggest that there may be severe overfitting. In order 6 two standard deviations) and clustering using ∆∠. to verify how well the set of selected features per- Cf. Table 1 for an overview of the results. forms on unseen data, we used two additional eval- Figures 8(a)-8(c) show the distribution of the doc- uation data sets: 6We used agglomerative clustering with complete linkage. 1. An unbalanced set of 71 additional unseen nov-

86 els by 19 authors from the German collection; authors. Nevertheless, the difference in clustering 2. An unbalanced set of 155 unseen novels by 34 accuracy between the first and the second test set in- authors, with at least 3 novels per author (6 au- dicates that these features are author-dependent to thors are also in the original collection). some extent. For the first test set, we trained an SVC and a Max- imum Entropy classifier (MaxEnt) on the original 5 Conclusion German corpus using the set of 234 selected fea- The results presented here shed some light on the tures and evaluated classifier accuracy on the test properties of Burrows’s Delta and related text dis- set. Both SVC and MaxEnt achieved 0.97 accuracy tance measures, as well as the contributions of the on that test set, indicating that the selected features underlying features and their statistical distribution. are not overfit to the specific novels in the training Using the most frequent words as features and stan- corpus but generalize very well to other works from dardizing them with a z-transformation (Burrows, the same authors. Since this test set includes single- 2002) proves to be better than many alternative tons (a single novel by an author), cross-validation strategies (Sec. 3.2). However, the number nw of and clustering experiments cannot sensibly be con- features remains a critical factor (Sec. 3.1) for which ducted here. no good strategy is available. Vector normalization For the second test set, we evaluated classification is revealed as the key factor behind the success of accuracy with stratified threefold cross-validation Cosine Delta. It also improves Burrows’s Delta and using only the set of 234 selected features. We also makes all measures robust wrt. the choice of nw clustered the texts using ∆∠ based on the same fea- (Sec. 3.3). In Sec. 4 we showed that supervised fea- tures. To have a point of reference, we furthermore ture selection may be a viable approach to further evaluated classification and clustering using the full improve authorship attribution and determine a suit- feature set, once using relative frequencies and once able value for nw in a principled manner. using rescaled relative frequencies. For the cluster- Although we are still not able to explain in full ing we used the 2 000 mfw as features, which our ex- how Burrows’s Delta and its variants are able to dis- periments in Section 3.1 showed to be a robust and tinguish so well between texts of different author- nearly optimal number. The results are summarized ship, why the choices made by Burrows (use of mfw, 7 in Table 2. z-scores, and Manhattan distance) are better than Comparing evaluation results for the 234 se- many alternatives with better mathematical justifi- lected features from the original corpus with the full cation, and why vector normalization yields excel- rescaled feature set, we see a considerable increase lent and robust authorship attribution regardless of in SVC accuracy (due to the smaller number of fea- the distance metric used, the present results consti- 8 tures), a small decrease in MaxEnt accuracy and tute an important step towards answering these ques- an increase in clustering quality, indicating that the tions. selected features are not overfitted to the training data and generalize fairly well to texts from other References 7Clustering results on the unscaled and rescaled full feature sets are identical because of the z-transformation involved in Shlomo Argamon. 2008. 
Interpreting Burrows’s Delta: cosine delta. Geometric and Probabilistic Foundations. Literary 8While Support Vector Machines are supposed to be effec- and Linguistic Computing, 23(2):131 –147, June. tive for data sets with more features than training samples, they John Burrows. 2002. Delta: a Measure of Stylistic Dif- don’t deal very well with huge feature sets that exceed the train- ference and a Guide to Likely Authorship. Literary ing samples by several orders of magnitude. For this reason, and Linguistic Computing, 17(3):267 –287. the poor performance of SVC on the full features set was to Marc-Allen Cartright and Michael Bendersky. 2008. be expected. It is surprising, however, that SVC performs much Towards scalable data-driven authorship attribution. better on unscaled features. We believe this to be a lucky coinci- dence: The SVC optimization criterion prefers high-frequency Center for Intelligent Information Retrieval. words that require smaller feature weights; they also happen to Maciej Eder and Jan Rybicki. 2013. Do birds of a feather be the most informative and robust features in this case. really flock together, or how to choose training sam-

87 ples for authorship attribution. Literary and Linguistic Efstathios Stamatatos. 2009. A survey of modern author- Computing, 28(2):229–236, June. ship attribution methods. J. Am. Soc. Inf. Sci. Technol., Maciej Eder, Mike Kestemont, and Jan Rybicki. 2013. 60(3):538–556, March. Stylometry with R: a suite of tools. In Digital Hu- manities 2013: Conference Abstracts, pages 487–489, Lincoln. University of Nebraska. R. S. Forsyth and D. I. Holmes. 1996. Feature-finding for text classification. Literary and Linguistic Com- puting, 11(4):163–174, December. David L. Hoover. 2004a. Delta Prime? Literary and Linguistic Computing, 19(4):477 –495, November. David L. Hoover. 2004b. Testing Burrows’s Delta. Literary and Linguistic Computing, 19(4):453 –475, November. Lawrence Hubert and Phipps Arabie. 1985. Comparing partitions. Journal of Classification, 2:193–218. Fotis Jannidis, Steffen Pielstrom,¨ Christof Schoch,¨ and Thorsten Vitt. 2015. Improving Burrows’ Delta - An empirical evaluation of text distance measures. In Dig- ital Humanities Conference 2015, Sydney. Patrick Juola. 2006. Authorship attribution. Founda- tions and Trends in Information Retrieval, 1(3):233– 334. Leonard Kaufman and Peter J. Rousseeuw. 1990. Find- ing Groups in Data: An Introduction to Cluster Anal- ysis. John Wiley and Sons, New York. M. Koppel, J. Schler, and S. Argamon. 2008. Computa- tional methods in authorship attribution. Journal of the American Society for information Science and Tech- nology, 60(1):9–26. Gabriella Lapesa and Stefan Evert. 2014. A large scale evaluation of distributional semantic models: Parame- ters, interactions and model selection. Transactions of the Association for Computational Linguistics, 2:531– 545. John Marsden, David Budden, Hugh Craig, and Pablo Moscato. 2013. Language individuation and marker words: Shakespeare and his maxwell’s demon. PloS one, 8(6):e66813. Monica Rogati and Yiming Yang. 2002. High- performing feature selection for text classification. In Proceedings of the eleventh international conference on Information and knowledge management, CIKM ’02, pages 659–661, New York, NY, USA. ACM. Jan Rybicki and Maciej Eder. 2011. Deeper delta across genres and languages: do we really need the most fre- quent words? Literary and Linguistic Computing, 26(3):315–321, July. Peter W. H. Smith and W. Aldridge. 2011. Im- proving Authorship Attribution: Optimizing Burrows’ Delta Method*. Journal of Quantitative Linguistics, 18(1):63–88, February.

88 Gender-Based Vocation Identification in Swedish 19th Century Prose Fiction using Linguistic Patterns, NER and CRF Learning

Dimitrios Kokkinakis Ann Ighe Mats Malm Department of Swedish, Economic History, School of Department of Literature, Språkbanken Business Economics and Law History of Ideas and Religion University of Gothenburg University of Gothenburg University of Gothenburg Sweden Sweden Sweden [email protected] [email protected] [email protected]

capture the (professional) activities with which Abstract one occupies oneself, such as employment or other, wider, forms of productive occupations not This paper investigates how literature could necessarily paid. Therefore vocation is used here be used as a means to expand our understand- in a rather broad sense since we do not want to ing of history. By applying macroanalytic disallow word candidates that might not fit in a techniques we are aiming to investigate how strict definition of the term. Apart from vocation women enter literature and particularly which identification, the described system recognizes functions do they assume, their working pat- terns and if we can spot differences in how and assigns gender, i.e. male, female or un- often male and female characters are men- known, to the vocations by using various Natural tioned with various types of occupational ti- Language Processing (NLP) technologies. tles (vocation) in Swedish literary texts. The purpose of this work is to use literature Modern historiography, and especially femi- as means to expand our understanding of history nist and women’s history has emphasized a (Pasco, 2014) by applying macroanalytic tech- relative invisibility of women’s work and niques (Moretti, 2013; Jockers, 2013) in order to women workers. The reasons to this are man- start exploring how women enter literature as ifold, and the extent, the margin of error in characters, which functions do they assume and terms of women’s work activities is of course their working patterns. The research questions hard to assess. Therefore, vocation identifica- tion can be used as an indicator for such ex- themselves are not new, but in fact central to the ploration and we present a hybrid system for field of gender studies and to a certain extent, automatic annotation of vocational signals in economic history. From a historical point of 19th century Swedish prose fiction. Beside view, the 19th century in Sweden, and several vocations, the system also assigns gender other western countries, is a period with a dra- (male, female or unknown) to the vocation matic restructuring of gender relations in formal words, a prerequisite for the goals of the institutions such as the civil law, and also a peri- study and future in-depth explorations of the od where the separation of home and workplace corpora. came to redefine the spatial arenas for human interaction. Singular works of fiction can be ana- 1 Introduction lyzed and interpreted in historical research and Can we use literary text as a (valid) source for current development in digital humanities cer- historical research? Evidence shows that the an- tainly opens new possibilities in this direction. swer is probably yes and in this paper we inves- Therefore, vocation identification can be used as tigate how literature can be used as a means to one such indicator for achieving some of the expand our understanding of history; (Rutner & above stated goals. The starting point of this Schonfeld, 2012). This paper presents a system study has been to create an infrastructure of suit- for the automatic annotation of vocational signals able lexical resources and computational tools in Swedish text, namely 19th century prose fic- for empirical NLP in the cultural heritage do- tion. Vocation in this context is defined as a sin- main and digital humanities area. Supporting gle word or a multi word expression intended to future users of digitized literature collections

89

Proceedings of NAACL-HLT Fourth Workshop on Computational Linguistics for Literature, pages 89–97, Denver, Colorado, June 4, 2015. c 2015 Association for Computational Linguistics with tools that enable the discovery and explora- after about 1885 or that agricultural professions tion of text patterns and relationships, or even first declined in importance but then become to allow them to semantically search and browse rise around the turn of the century, after which (Don et al., 2007; Vuillemot et al., 2009; Oelke they rapidly sank again. However, even closer to et al., 2013), using computer-assisted literary our goals is the research by Pettersson & Nivre analysis with more semantically oriented tech- (2011); Fiebranz et al. (2011) and Pettersson et niques, can lay the foundations to more distant al. (2012; 2014), who in cooperation with histo- reading or macroanalysis of large corpora in rians, study what men and women did for a liv- novel ways, impossible to investigate using tradi- ing in the early modern Swedish society (“The tional qualitative methods or close reading. Gender and Work project”, GaW) between 1550- 1800. In the context of GaW’s verb-orientated 2 Background studies, historians are building a database with relevant information, by identifying working ac- Digital humanities is an umbrella term used for tivities often described in the form of verb any humanities research with potential for real phrases, such as chop wood or sell fish. Automat- interdisciplinary exchange between various fields ically extracted verb phrases from historical texts and can be seen as an amalgamation of method- are presented to historians as a list of candidate ologies from traditional humanities disciplines phrases for revision and database inclusion. (such as literature and art, corpus linguistics), Since Swedish NLP tools for that period are very and social sciences, with computational ap- scarce, the historical texts are first normalized to proaches and tools provided by computer science a more modern spelling, before tagging and pars- (such as text and data mining) and digital pub- ing is applied – the techniques applied in GaW lishing. During the last couple of decades there are different and thus complementary to the ones has been a lot of research on applying automatic we apply in the research described in this paper. text analytic tools to annotate, enrich, explore and mine historical or other digital collections in 3 Material various languages and for several reasons (Pen- nacchiotti & Zanzotto, 2008; Mueller, 2009; The textual data we use in this work is the con- Manning, 2011; Piotrowski, 2012; Jockers, 2013; tent of an 18-19th century Swedish Prose Fiction McEnery & Baker, 2014). The focus of such re- database – Spf 1 . Spf is comprised by ca 300 search is to reduce the time consuming, manual prose works that were originally published as work that is often carried out e.g. by historians or books during the years 1800, 1820, 1840, 1860, other literature scholars, in order to identify val- 1880 and 1900. The material is representational id, useful and meaningful results such as seman- of its times in ways that the canonized literatures tic associations, gender patterns and features of are not, in the sense that it contains not only can- human networks (Agarwal et al., 2012). Also, onized literature but mainstream as well as mar- recently, a small number of studies have been ginalized treatments of 19th century society. 
The published where gender and other biographical database makes it possible to examine a particu- characteristics are explored (Hota et al., 2006; lar year span and compare it to the material of Argamon et al., 2007; Garera & Yarowsky, 2009; other years in order to obtain a comprehensive Bullard & Oveesdotter Alm, 2014). These meth- view of societal development across an entire ods apply various types of classifiers with good century. However, the main part of this work performance results. deals with the construction and adaptation of Boes (2014) discusses the content of the “Vo- several lexical and semantic resources developed cations of the Novel Project” which consists of a for modern Swedish to the language of Spf and database of roughly 13,000 German-language algorithmic resources that use those for automat- prose works, published between 1750-1950, and ic labeling, and we have left as future work the in which each entry in this database is tagged fine-grained comparison between different time with vocational metadata identifying occupations spans. that receive extended narrative treatment. Fifteen Furthermore, there are several classificatory occupational clusters, such as agricultural pro- systems available where occupations are orga- fessions, health and nautical professions, are nized into clearly defined sets of groups accord- used for estimating the proportional distribution ing to the tasks and duties undertaken in the job; of those with the database content, showing for instance that members of the clergy diminished 1 .

90 such as the International Standard Classification and compound lexical units for health-related of Occupations 2 or the National Occupational occupations) was encoded using both Vocation Classification3 which is the nationally accepted and Health, e.g. patolog ‘pathologist’ or sjukskö- reference on occupations in Canada. Such classi- terska ‘nurse’. Similarly other combinations of fications are basically structured by skill type, FrameNet-originating labels were extracted and e.g. Agricultural, Animal Husbandry and Forest- encoded in a similar manner. Since we adopted a ry Workers, Fishermen and Hunters which can broad definition of the term Vocation we allow further increase the usability of such resources. such words to be included in the knowledge However, in this study we did not have the hu- base, but not all were used for the study de- man resources to structure the collected occupa- scribed here if there were not directly vocation- tions in a finer-grained manner, apart from some related. For instance, tjuvpojke ‘thief boy’ is very basic and coarse encoding (see further Sec- coded as Morality-negative; here Morality- tion 3.1); therefore this task is left as a future negative is of course not a vocation but rather a work. A large number of vocation and other re- general human quality and not used in the study. lated terms from three lexically-oriented re- Also words with the label Person, which is the sources were collected and structured. As our most generic category, including mentions such starting point we used ca 3,500 lexical units as baldrottning ‘prom queen’; äventyrerska ‘ad- found in relevant frames, such as Medi- venturess’ or söndagsskoleelev ‘Sunday school cal_professionals and People_by_origin, from student’, are not used in the study presented in the Swedish FrameNet4. Moreover, several other this paper. lexical units in related frames, that are slightly Gender assignment is based on reliable or- more general but still relevant, were used, i.e. thographic features of the lexicon entry in ques- entries that belong to other types of both generic tion (if available), these include: and more specific frames that indicate person activities, relationships or qualities of various  typical gender bearing morphological suf- kinds, such as Kinship or Performer. Secondly, fixes, such as –inna [female] värdinna we used several hundred of vocation names from ‘hostess’, prestinna ‘priestess’or –rska the Swedish dictionary Svensk Ordbok5 ‘Swedish [female] ångfartygskokerska ‘steamboat’s Dictionary’ (SO) and finally, several thousand cook’, städerska ‘cleaning woman’. vocation names from the Alfabetiskt yrkesregis-  gender bearing head noun words that un- ter ‘Alphabetically list of professional designa- ambiguously can assign gender in com- 6 tors’ published by the Statistics Sweden . pound word forms; for instance hustru ‘wife’ [female]: soldathustru ‘soldier’s 3.1 Vocation Lexicon Structure wife’, bagarehustru ‘baker’s wife’; or Semi-automatically, all lexicon entries were as- dräng [male] ‘farmhand’: stalledräng signed two features. The first one was gender ‘stable farmhand’, fiskardräng ‘fisherman (i.e. Male, Female or Unknown) and the second farmhand’. one Vocation. 
Depending on the content and structure of the three resources we used to extract After consulting international efforts in the area, occupations and similar entries from, we tried to we automatically normalized and attempted to keep and encode any kind of relevant to our group together and harmonize the labels of all goals descriptive information for these entries in available lexical units (these labels are primarily tab-separated fields. For our study we only use encoded in the Swedish FrameNet) to the follow- Vocation and combinations with Vocation and ing 14 single types and 4 complex ones, without other labels. For instance, in FrameNet, vocation putting too much effort to introduce finer-grained related frames such as Medical_professionals (a types for practical reasons (see the previous sec- label that was transformed to a more generic and tion for discussion on this issue). Thus, the final shorter one, Health, and which consists of single set of categories we applied are: Age (yngling ‘youth’), Expertise-Neg (okunnige ‘ignorant’), 2 . Expertise-Pos (specialist), Jurisdiction (invånare 3 . ter’), Morality-Neg (tjuv ‘thief’), Morality-Pos 4 . 5 (nationalhjälte ‘national hero’), Origin (jugoslav . 6 . (socialdemokrat ‘social-democrat’), Religion

91 (metodistpredikant ‘methodist preacher’), Resi- is found, the new term gets the vocation annota- dence (granne ‘neighbour’), Vocation (bussföra- tion of the matched head. The length of six char- re ‘bus driver’), Vocation+Health, Voca- acters is determined after testing with other low- tion+Military (kavallerilöjtnant ‘cavalry lieuten- er values and six is the lowest number that can be ant’), Vocation+Performer (dragspelare ‘accor- safely used and which minimizes the number of dionist’) and Person+Disease (autistiker ‘autistic false positives returned to a minimum. Suppose person’). The resulting lexicon consists of 19,500 prestinna ‘priestess’ is in the lexicon with fea- terms, of which over 77% (15,000) are distinct tures Vocation and Female and a new word, over occupational titles (vocations), and used in vari- six characters long, e.g. öfverprestinna ‘head ous ways by the system, mainly as the core lexi- priestess’, is found in a text, öfverprestinna will con for rule based pattern matching and as a fea- be then decomposed to öfver+prestinna and the ture for supervised machine learning (see Section head prestinna will match an existing lexicon 4.4). Moreover, 75% (or 11,000) of all these vo- entry with the same form; consequently öfver- cations in the lexical resources have been as- prestinna will inherit the annotation Vocation signed Unknown gender as a default since no and Female. Alternative surface text forms, e.g. classificatory orthographic features, as previous- with hyphenation (solo-sångare ‘solo singer’ ly described, could be applied for that purpose. versus solosångare) are treated in a similar man- ner, breaking the compound at the hyphen. This 4 Methods processing allows a set of potential new terms to be efficiently recognized, while the results using Figure 1 provides a general outline of the meth- the lexicon based pattern matching approach ods applied in the study. show nearly perfect precision scores. The recog-

nition step is made using case insensitive match- ing against the lexicon’s content. Furthermore, we also experimented with con- tinuous space word representations (Mikolov et al., 2013) in order to extract, manually review and incorporate lists of near synonyms to voca- tion-specified words. For instance, the top-10 closest words for the word soldat ‘soldier’ were: bonde ‘farmer’, officer, simple* ‘simple’, knekt Figure 1. The major steps used to extract gender- ‘foot soldier’, tapper* ‘brave’, sjöman ‘seaman’, bearing vocations from raw text. adelsman ‘nobleman’, munk ‘monk’, duktig* ‘capable’ and matros ‘sailor’; only three of these 4.1 Linguistic Patterns: Morphology, Com- words, marked with ‘*’, are not directly associat- pounding and Continuous Space Word ed with vocations. This experiment resulted into Representations roughly a hundred of new vocation words inte- The previously outlined lexicon7 is used for pat- grated in the knowledge base. tern matching based on a number of manually 4.2 Person Entity Recognition coded rules that explore regularities in the sur- rounding context of potential vocation words. During processing we also use named entity Inflectional morphology is determined program- recognition (NER) (Borin et al., 2007), but only matically, while the lexicon’s content is also a component that deals with person entity identi- used for discovering “new” terms (not in the stat- fication. Since gender is a prominent attribute for ic lexicon) using compound segmentation and a very large number of first names, we apply a matching of the head of a candidate vocation to NER component that uses a first name gazetteer the content of the knowledge base. with 21,000 first names, in which each name has For instance, a potential new term candi- been pre-assigned gender and thus used to assign date, that is a word over 6 characters long not in gender to recognized person first names. For in- the lexicon, is de-compounded and its head is stance the NER processing of the sentence: Jan matched against the lexicon’s content. If a match och Johan skulle just gå in i stugan, då Maja ropade dem tillbaka ‘Jan and John were about to

7 go into the cottage, when Maja called them back’ All lexicon entries annotated with the Vocation label are will recognize three first names (Jan, Johan and available from: . Maja) and assign male gender to the first two,

92 Jan and Johan and female to the last one, Maja.  historical forms of Swedish adjectives Thereafter, the results obtained during the appli- used to be gender bearing; e.g. the majori- cation of the method described in Section 4.1 and ty of adjectives ending in -e designate the results from the NER will be merged, and a male gender. For example, fattige bonden post-NER pattern matching script will try to as- ‘the poor farmer’, here bonden is identi- sign gender to vocation words for which gender fied as a vocation with unknown gender is marked as Unknown by the process described which will be assigned male since the ad- in 4.1 and there is a first name annotation close jective fattige is indicating a male noun by. This is accomplished under the condition that  similarly to the process described in Sec- the NER has assigned a gender to a first name in tion 4.1, we also here take advantage of the near context of a “genderless” vocation. For the fact that many noun suffixes or head instance, a vocation word, for which its surface words of compounds are also gender bear- characteristics does not reveal any gender affilia- ing; e.g. suffixes -erska or -inna designate tion according to the vocation lexicon, can be female gender; e.g. tvätterska ‘laundress’ assigned appropriate gender if a recognized first or värdinna ‘hostess’ are assigned female name appears in its near context; e.g. bonden gender because of their gender bearing Petter ‘(the) farmer Petter’ or Gusten är en suffixes. Gender bearing head words are fiskare ‘Gusten is a fisherman’. In these exam- also used for gender assignment; e.g. ples bonden and fiskare are coded in the compounds ending in -fru ‘wife’ such as knowledge base as vocation words with un- bondefru [bonde+fru] ‘peasant wife’ will known gender and the process outlined in 4.1 be assigned female gender; while com- recognizes it as such. Petter and Gusten, on the pounds ending in e.g. karl ‘man’ such as other hand, are recognized by the NER as human besättningskarl [besättning+s+karl] ‘crew with male gender. The gender attribute will be man’ will be assigned male gender then propagated to the vocation words bonden  local context is also used to merge two or and fiskare which will get the same gender as its more consecutive vocation and related an- appositive Petter in the first case and the person notations into one; typically when a geni- entity’s gender close by in the second case. tive form of a noun precedes another noun. For instance, the text snippet: ryttmästarns 4.3 Local Context Regularities betjent ‘[the] rittmeister's servant’ will ini- Since not all vocation annotations get gender tially receive two vocation annotations assignment during recognition, we use hand- (with unknown gender) that will be coded rules based on various lexical patterns for merged into one [ryttmästarns+betjent] that purpose. The heuristics applied to these rules which unknown gender; while the snippet include four major types of reliable information: drottningens hofmästarinna ‘(the) queen’s personal pronouns, gender-bearing adjectives, hofmeisteress’ will initially receive two gender-bearing suffixes and certain forms of lo- vocation annotations (with female gender) cal context: that will be merged into one [drottning- ens+hofmästarinna] with female gender.  personal pronouns, e.g. 
the Swedish hans ‘his’ and hennes ‘her’, are used for gender 4.4 Statistical Modeling assignment if they appear in a context very Finally, we also use a complementary statistical close to a vocation (1 to 5 tokens); e.g. in modelling method, conditional random fields the text fragment …fiskaren och hans (CRF) for learning gender-assigned vocations in barn ‘… the fisherman and his children’, combination with the results of the rule-based fiskaren is identified as a vocation but with system and the NER (the vocation words togeth- unknown gender which at this stage will er with basic features such as n-grams and word be assigned male since the pronoun hans is shape were used as features for training the male and refers to the fiskaren ‘fisher- learner). For that purpose we use the Stanford man’; while in the text fragment …hon var CRF off-the-shelf software (Finkel et al., 2005). en agitator ‘… she was an agitator’, agita- The purpose of the CRF is to identify vocations tor is identified as a vocation but with un- and (possibly) correct gender not captured by the known gender which at this stage will be previous techniques, in order to increase recall. assigned female since the pronoun hon is Training and testing is based on a pre-annotated female referring to agitator and manually inspected sample (by the first au-

93 thor). This sample was randomly selected from 5.1 Evaluation of the CRF Spf and it was first automatically annotated by The results of the classifier’s evaluation8 are giv- the rule based and NER components, and then en in tables 2 and 3. Low recall scores can be sentences with at least one annotation were se- possibly attributed to two facts; one is the lected, manually inspected, corrected and used amount of test and training texts used for training for training (390,000) and testing (50,000). the classifier and second the use of default fea- 5 Results, Evaluation and Analysis tures for training. Addition of new features, such as part-of-speech, syntactic and/or co-reference The fact that a large number of vocations in the links could have possibly being beneficial, in- lexical resources have been assigned Unknown cluding larger training corpora. Moreover, vari- gender implies that the computational processing ous types of errors could be identified during all requires to heavily relying on a (wider) context stages of processing. With respect to the CRF to assign proper gender to these words. This is a component evaluation, most of the errors had to serious drawback since there is not always relia- do with the occurrences of e.g. male designated ble near context, e.g. at the sentence level, that adjectives, such as svenske ‘Swedish’ or danske can be used. This fact is mirrored on the results ‘Danish’. A number of last names and also of e.g. the CRF classifier. Different complemen- common words that were homographic to voca- tary techniques have been tested, but still, a large tions were also annotated erroneously as such; number of vocations remains with unknown gen- for instance the last names Skytte ‘Shooter’ (e.g. der. More elaborated ways are probably required in the context Malin Skytte) and Snickare ‘car- to identify gender, perhaps using discourse in- penter’ (e.g. in the context Gorius Snickare); and formation, e.g. CRF is used with default features, common nouns such as simmaren ‘(the) swim- new features might be necessary to test. mer’ or vakt ‘guard’ in idiomatic contexts such as hålla vakt ‘be on one's guard’. Female Male Unknown hustru [wife] kung [king] (976) präst [priest] Precision Recall f-score (2016) (1341) 9 majorska [ma- prost [provost] kapten [captain] B 96.33% 58.22% 72.57% joress] (1054) (614) (532) I 89.09% 72.06% 79.67% grefvinna [coun- brukspatron [iron- löjtnant [lieuten- tess] (805) master] (582) ant] (487) jungfru [maid] patron [land ten- major (382) Table 2. Precision, recall and f-scores for the (698) ure] (430) CRF learner. grevinna [coun- kyrkoherde [vicar] bonde [farmer] tess] (593) (325) (343) Precision Recall f-score överstinna [colo- konung [king] *don (322) nel's wife] (308) (258) F 97.03% 43.75% 60.31% prostinna [minis- biskop [bishop] tjänare [servant] M 94.52% 55.44% 69.89% ter’s wife] (304) (242) (283) U 53.25% 45.00% 48.78% drottning [queen] grefve [count] överste [colonel] (274) (227) (241) hushållerska baron (215) tiggare [beggar] Table 3. Precision, recall and f-scores for the [housekeeper] (236) CRF learner on gender (M=male; F=female; (260) prästfru [minister’s kejsare [emperor] husbonde [master] U=unknown). wife] (218) (208) (217) Another type of important omission has been the Table 1. The top-10 most frequent lemmatized fact that a large number of vocations are assigned vocations (and their occurrences) in the Selma Lagerlof Archive (‘*’: error). 8 For the evaluation we used the conlleval script, vers. 
2004- 01-26 by Erik Tjong Kim Sang. Table 1 above shows the top-20 occurrences of 9 We use the IOB tags for file representation: I (inside), O three automatically extracted lists of male, fe- (outside), or B (begin). A token is tagged as B if it marks male and unknown gender vocations from the the beginning of a chunk. Subsequent tokens within the chunk are tagged I. All other tokens are tagged O. E.g. Selma Lagerlof Archive (a collection of the au- the context Fru majorskan skrek till majorn… ‘Mrs. ma- thor’s works of fiction published as books during jor's wife shouted to the major ...’ is represented as: her lifetime), a completely new corpus, not used Fru B-Female for the development of the resources in the study majorskan I-Female with 3.341.714 tokens. skrek O till O majorn B-Male ...
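The IOB encoding shown in the footnote can be produced mechanically from annotated spans. The following sketch is illustrative only (it is not the actual pipeline that feeds the Stanford CRF) and shows one way such training material could be derived:

```python
# Illustrative sketch: converting gender-annotated spans into the IOB token
# format of the footnote, as one might prepare CRF training data.
def to_iob(tokens, spans):
    """tokens: list of words; spans: list of (start, end, label), end exclusive."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return list(zip(tokens, tags))

sentence = ["Fru", "majorskan", "skrek", "till", "majorn"]
annotations = [(0, 2, "Female"), (4, 5, "Male")]
for token, tag in to_iob(sentence, annotations):
    print(token, tag)
# Fru B-Female / majorskan I-Female / skrek O / till O / majorn B-Male
```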

94 unknown gender since there is no reliable con- A total of 341 of vocation identifiers could be text (at the sentence level) that could be used. manually recognised and confirmed the assump- Moreover, we have been restrictive to the gender tion that best results are produced by using all assignment of certain vocations in the resources, available resources at hand. Moreover, the preci- although, in principle, and considering the nature sion of the rule-based system (i.e. the lexicon and publication time of the texts, we could by lookup) is very high. Out of the 341 possible vo- default assign gender to a large number of these cations, the baseline, i.e. the rule-based system vocations. For instance, a large number of mili- without any sort of disambiguation, compound tary-related vocations, such as löjtnant ‘lieuten- analysis etc., identified 329 vocations (46% with ant’ or generalmajor ‘major general’ are as- the correct gender and the rest with unknown signed unknown gender, although these, predom- gender); 12 (most of them compounds such as inantly, refer to males in the novels. Moreover, klädessömmerska ‘clothing dressmaker’) and a identical singular and plural forms of vocation few others such as penitentiarius ‘confessor’, terms are yet another difficult problem, e.g. poli- could not be found and 15 tokens were annotated tiker ‘politician’ or ‘politicians’ or spritfabrikar- as vocations but were wrong. These 15 wrong betare ‘distillery worker’ or ‘distillery workers’. ones originate from (possibly) inappropriate lex- Some sort of linguistic pre-processing, such as icon vocation entries, entries that shouldn’t have idiom identification and part of speech annota- entered the lexicon as vocations, such Sachsare tion, could probably exclude word tokens in plu- ‘a person from Saxony’ befrämjare ‘promoter’ (a ral form (or verbs in that matter, but such cases borderline case) and homographs with proper were extremely rare in our data), nevertheless names, such as Jarl (which is a title given to part of speech tagging is not used at the moment. members of medieval royal families before their Also, more elaborative models could be used to accession to the throne, but also used as a last first determine who the personal pronouns refer name). to before an attempt could be made to assign the The combination of all available tools and re- pronoun’s gender to a vocation word with un- sources improved these figures; marginally on known one. the gender but substantially on the recognition of the compounds. All vocation compounds were 5.2 Evaluation of the Knowledge-based recognized (10) and also four more that were Components wrong, such as Nekropolis ‘Nekro+polis’ (since Besides the evaluation of the CRF learner, we polis ‘police’ is in the lexicon) and Notre-Dame also conducted an analysis on a small random ‘Notre-+Dame’ (since dame ‘female equivalent sample of similar text from different, but compa- of the honour of knighthood’ is in the lexicon as rable, corpora, in order to investigate the contri- well). These compounds could be identified be- bution of the different components for the voca- cause of the compound decomposition step and tion identification. A selection of a randomized matching of the compounds’ heads to the lexicon subset of 1000 sentences from two sources was content. Compared to the baseline results the conducted from the August Strindberg's Collect- percentage of vocations with correct gender ed Works (“August Strindbergs Samlade Verk”) raised to 49.6%. 
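A minimal sketch of the compound head-matching step evaluated above (not the authors' implementation; the two lexicon entries are just examples) also makes clear how false positives such as Nekropolis can arise:

```python
# Sketch of compound head matching: an unseen word longer than six characters
# inherits the annotation of the longest lexicon entry matching its end.
LEXICON = {
    "prestinna": ("Vocation", "Female"),
    "polis": ("Vocation", "Unknown"),
}

def match_compound_head(word: str):
    w = word.lower().replace("-", "")   # drop hyphens (solo-sångare ~ solosångare)
    if len(w) <= 6 or w in LEXICON:
        return LEXICON.get(w)
    for i in range(1, len(w)):          # try the longest possible head first
        head = w[i:]
        if head in LEXICON:
            return LEXICON[head]
    return None

print(match_compound_head("öfverprestinna"))  # ('Vocation', 'Female')
print(match_compound_head("Nekropolis"))      # ('Vocation', 'Unknown') - false positive
```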
and the Selma Lagerlof Archive (“Selma Lager- löf-arkivet”), both parts of the Swedish Litera- 6 Conclusions 10 ture Bank . These 1000 sentences were automat- Women’s history has emphasized a relative in- ically annotated by: i) the rule-based system visibility of women’s work and women workers. without any sort of disambiguation or other pro- The reasons to this are manifold, and the extent, cessing only lexicon look-up; this can be consid- the margin of error in terms of women’s work ered as a baseline system where only inflectional activities is of course hard to assess, e.g. work morphology is considered; and ii) all the rest wasn’t a tax base also work wasn’t the point of without the CRF. That included the rule-based departure for a collective political interest (i.e. system with the use of lexical patterns for disam- labour) as it later developed into; while the polit- biguation, compound segmentation and the ical organisation also excluded women from named entity recognition. some certain areas and particularly from formal authority. This means that women were to a less- er extent mentioned, authorised, appointed or 10 Information about the Literature Bank is here: nominated in formal sources. This is obviously .

95 the case for many but not all traditional sources, population registers, cameral, and fiscal sources. However, we still don’t have good reasons to believe that this means either that women didn’t work or that other types of material couldn’t be more rewarding. And the suggestion of this paper is that prose fiction is still possible to utilise fur- ther and that the methodological developments in digital humanities should be tried in this endeav- our, e.g. by investigating women’s work and economic activities represented longitudinally in prose fiction and the differences between the texts of female, male (and unknown) authors, in all those aspects. In this work we have applied automatic text analytic techniques in order to identify vocation signals in 19th century Swedish prose fiction. Our goal has been to reduce the time consuming, manual work that is usually carried out by historians, literature scholars etc., Figure 2. Comparison male and female authors in order to e.g. identify and extract semantically (based on absolute frequencies) of the Vocation meaningful information such as gender patterns and Kinship categories during the period 1840- and semantic associations. Literature is a com- 1860-1880-1900. prehensive source for data on employment and occupation, economy and society, and to e.g. an As Fig. 2 shows, other types of investigations are economic historian or gender researcher such possible and the analysis can provide information data can be of immense value, particularly for about the kind of vocabulary used by various the period between 1800-1900 since gender rela- authors during different periods; e.g. do male tions in and through work is a long-standing authors use more vocation than kinship labels problem due to repeated underestimation of and of which type? Kinship (section 3.1) is one women’s work attributed to among other com- of the categories already encoded in the pelling reasons, the systematic under-reporting of knowledge base and can be easily used to com- women’s work in the used sources (Humphries pare the style of authors diachronically. & Sarasua, 2012; Ighe & Wiechel, 2012). Prose fiction does not necessarily have the same limita- Acknowledgements tions and can be utilized as a fruitful point of de- parture. This work is partially supported by the Swedish For future work we would like to explore Research Council’s framework grant “Towards a even in more detail the variation of both the per- knowledge-based culturomics” dnr 2012-5738. formance of the processing steps and also com- pare the results across time periods and authors’ References gender. Deeper analysis could provide interesting Apoorv Agarwal, Augusto Corvalan, Jacob Jensen insights on the nature of which types of person and Owen Rambow. 2012. Social Network Analy- activities are used by different authors or com- sis of Alice in Wonderland. Workshop on Compu- pare and explore other types of collections11 from tational Linguistics for Literature. Pp 88–96, the same period, and thus confirm or reject estab- Montréal, Canada. lished hypotheses about the kind of vocabulary used; e.g. do male authors use more vocation or Shlomo Argamon, Russell Horton, Mark Olsen and kinship labels? Sterling Stuart Stein. 2007. Gender, Race, and Na- tionality in Black Drama, 1850-2000: Mining Dif- ferences in Language Use in Authors and their Characters. Digital Hum. Pp. 8-10. U. of Illinois. 11 Such collections could be the Dramawebben (), i.e. digital versions of Tobias Boes. 2014. 
The Vocations of the Novel: Dis- over 500 plays of Swedish drama from the 1600s to tant Reading Occupational Change in 19th Centu- modern times; or the Digidaily, i.e. digitized Swedish ry German Literature. Distant Readings – Topolo- newspapers from the 18-19th century ().

96 gies of German Culture in the Long 19th Century. Tony McEnery and Helen Baker. 2014. The Corpus Erlin&Tatlock (eds). Pp. 259-283. Camden House. as Social History - Prostitution in the 17th Centu- ry. Exploring Historical Sources with LT: Results / Lars Borin, Dimitrios Kokkinakis and Leif-Jöran Ols- Perspectives CLARIN Workshop. Den Haag, The son. 2007. Naming the past: Named entity and Netherlands. (Visited 20150222) erature. 1st LaTeCh. Pp. 1-8. Prague. Tomas Mikolov, Wen-tau Yih and Geoffrey Zweig. Joseph Bullard and Cecilia Oveesdotter Alm. 2014. 2013. Linguistic Regularities in Continuous Space Computational analysis to explore authors’ depic- Word Representations. Proc. of NAACL HLT. Pp. tion of characters. Proc. of the 3rd Workshop on 746–751. Atlanta, USA. Computational Linguistics for Literature. Pp. 11– 16. Gothenburg, Sweden. Franco Moretti. 2013. Distant Reading. Verso.

Anthony Don et al. 2007. Discovering interesting us- Martin Mueller. 2009. Digital Shakespeare, or to- age patterns in text collections: Integrating text wards a literary informatics. Shakespeare. 4:3, mining with visualization. 16th ACM Conf. on info 284-301, Routledge. & knowledge management (CIKM). Pp. 213-222. Daniela Oelke, Dimitrios Kokkinakis and Daniel Rosemarie Fiebranz, Erik Lindberg, Jonas Lindström Keim. 2013. Visual Literature Analysis: Uncover- and Maria Ågren. 2011. Making verbs count: the ing the dynamics of social networks in prose liter- research project ‘Gender and Work’ and its meth- ature. 15th Eurographics Conf. on Visualization odology. Scan Econ Hist Rev. 59:3, pp. 273-293. (EuroVis). Pp. 371-380. Leipzig, Germany.

Jenny Rose Finkel, Trond Grenager and Christopher Allan H. Pasco. 2004. Literature as Historical Ar- Manning. 2005. Incorporating Non-local Infor- chive. New Literary History. Vol. 35:3, pp 373- mation into Systems by 394. John Hopkins University Press. Gibbs Sampling. 43rd ACL. Pp. 363-370. Marco Pennacchiotti and Fabio M. Zanzotto. 2008. Nikesh Garera and David Yarowsky. 2009. Modeling Natural Language Processing across time: an em- latent biographic attributes in conversational gen- pirical investigation on Italian. Proc. of GoTAL. res. 47th Annual Meeting of the ACL and 4th LNAI. Vol. 5221. Pp 371-382. Springer. IJCNLP. Pp. 719–718. Singapore. Eva Pettersson and Joakim Nivre. 2011. Automatic Sobhan Raj Hota, Shlomo Argamon and Rebecca Verb Extraction from Historical Swedish Texts. Chung. 2006. Gender in Shakespeare: Automatic 5th LaTeCH. Pp 87-95. Oregon, USA. stylistics gender character classification using syn- tactic, lexical and lemma features. Chicago Col- Eva Pettersson, Beáta Megyesi and Joakim Nivre. loq. on Digital Humanities and Computer Science 2012. Parsing the Past – Identification of Verb (DHCS). Pp. 100-106. U. of Chicago, USA. Constructions in Historical Text. 6th LaTeCH. Pp 65-74. Avignon, France. Jane Humphries and Carmen Sarasua. 2012. Off the Record: Reconstructing Women's Labor Force Eva Pettersson, Beáta Megyesi and Joakim Nivre. Participation in the European Past. Feminist Eco- 2014. Verb Phrase Extraction in a Historical Con- nomics. Vol. 18:4. Taylor and Francis. text. Proc. of the first Swedish national SWE- CLARIN workshop. Uppsala, Sweden. Ann Ighe and Anna-Helena Wiechel. 2012. Without title. The dynamics of status, gender and occupa- Michael Piotrowski. 2012. Natural Language Pro- tion in Gothenburg, Sweden, 1815-45. Eur. Social cessing for Historical Texts. Synthesis Lectures on Science History Conf.. Glasgow, Scotland. HLT. 5(2):1. Morgan & Claypool Publ.

Matthew L. Jockers. 2013. Macroanalysis - Digital Jennifer Rutner and Roger C. Schonfeld. 2012. Sup- Methods and Literary History. Topics in the Digi- porting the Changing Research Practices of His- tal Humanities. University of Illinois Press. torians. National Endowment for the Humanities. Ithaca S+R. New York. Christopher Manning. 2011. Natural Language Pro- cessing Tools for the Digital Humanities. Availa- Romain Vuillemot, Tanya Clement, Catherine ble at (Visited 20150105). said near “Martha”? Exploring NEs in Literary Text Collections. VAST. Pp. 107-114. NJ, USA.

Rule-based Coreference Resolution in German Historic Novels

Markus Krug, Frank Puppe
Wuerzburg University, Institute for Computer Science
Am Hubland, D-97074 Würzburg, Germany
markus.krug|[email protected]

Fotis Jannidis, Luisa Macharowsky, Isabella Reger, Lukas Weimer
Wuerzburg University, Institute for German Studies
Am Hubland, D-97074 Würzburg, Germany
[email protected]

Abstract

Coreference resolution (CR) is a key task in the automated analysis of characters in stories. Standard CR systems usually trained on newspaper texts have difficulties with literary texts, even with novels; a comparison with newspaper texts showed that average sentence length is greater in novels and the number of pronouns, as well as the percentage of direct speech, is higher. We report promising evaluation results for a rule-based system similar to [Lee et al. 2011], but tailored to the domain, which recognizes coreference chains in novels much better than CR systems like CorZu. Rule-based systems performed best on the CoNLL 2011 challenge [Pradhan et al. 2011]. Recent work in machine learning showed similar results as rule-based systems [Durett et al. 2013]. The latter has the advantage that its explanation component facilitates a fine-grained error analysis for incremental refinement of the rules.

1 Introduction

The overall goal of our research is the identification of characters in German novels from the 19th century and an analysis of their attributes. The main steps are named entity recognition (NER) of the persons, coreference resolution (CR), attribution of persons and character description with focus on sentiment analysis. While NER in novels is discussed in [Jannidis et al. 2015], we report on work in progress on rule-based coreference resolution in novels. Tests with existing rule-based or machine learning NLP tools on our novels had unsatisfying results. In contrast to newspaper texts, novels not only exhibit different topics and wording, but also show a heavy use of pronouns (in our corpus 70% of all NEs) and relatively few large clusters with long coreference chains opposed to many small clusters (see baseline analysis in table 1). Another important difference is the number and lengths of passages containing direct speech [Iosif, Mishra 2014].

We decided on a rule-based approach because:

• A key aspect in coreference resolution is feature and constraint detection. Features and constraints for ruling in or out candidates for coreference with a high precision can be combined to achieve a high recall. If such features and constraints are represented by rules, the explanation component of rule-based systems is very valuable in understanding the errors and thus enabling rapid rule refinement.

• We do not have a large corpus with annotated German novels to learn from. As mentioned above, there are substantial differences between e.g. newspapers and novels, so that machine learning approaches with domain adaptation (e.g. [Yang et al. 2012]) are difficult.

• We intend to use rule-based CR to semi-automatically create a large corpus of annotated novels for experimenting with machine learning CR approaches.

We present a state-of-the-art rule-based system tailored for CR in novels. In comparison to CorZu (see section 4) which recognizes CR well in newspapers, we achieve better results in novels (MUC



F1: 85.5% vs. 65.9%, B3 F1: 56.0% vs. 33.6%). 78.6% and 80.5% showing the great influence of Our explanation component facilitates a fine the data for the final results. In section 4 we there- grained error analysis for incremental rule refine- fore perform two separate experiments on two ment. different datasets to manifest the reliability of our approach for the domain of literary novels.

2 Related Work 3 Methods and data Coreference Resolution itself is an old, but un- solved task on which a huge amount of effort was Coreference resolution is based on a NLP-pipeline spent during the last 40 years. Large conferences including tokenization, sentence splitting, part of like ConLL 2011 [Pradhan et al. 2011], CoNLL speech tagging, lemmatization, named entity 2012 [Pradhan et al. 2012] and SemEval 2010 recognition and dependency parsing. In our pipe- [Recasens et al. 2010]) have offered challenges for line we use the TreeTagger of Stuttgart university the topic not only with English text (ConLL 2011), [Schmitt 1995] for tokenization and POS-tagging, but also for Chinese and Arabic (CoNLL 2012) OpenNLP1 with the according German model for and German, Dutch, Italian, Catalan and Spanish sentence splitting and the MATE-Toolkit [Bohnet (SemEval 2010). Most approaches are based on 2010] for dependency parsing. Due to our overall machine learning algorithms. A large part of ma- goal, the identification and attribution of characters chine learning approaches use two phases: first in German novels, we restrict coreference resolu- classification of pairs of NEs (nouns, persons, etc.) tion to the resolution of persons, excluding e.g. followed by a clustering or ranking of the results geographic locations. (so called mention-pair model [Aone 1995, Soon et The data used for this development consists of al. 2001]). Since the mention-pair model suffers roughly 80 segments, each sampled from a differ- from serious problems [Ng 2010], newer ap- ent novel. The sampling process determined a ran- proaches try to match named entities directly to dom (syntactic) sentence in the text and used all clusters of mentions (entity-mention model). A following 129 sentences, therefore forming a con- multitude of approaches was developed under the nected chunk of 130 sentences. This sampling pro- focus to model the affiliation of a mention to a cess ignored the beginning of a chapter, which specific entity. One goal of such an approach is to bears an even greater challenge for the human an- avoid problems of the mention-pair approach like notators and the algorithms, because now some (A = B, B = C, A ≠ C, e.g. A = "Mr. Clinton", B = segments can even start with uninformative men- "Clinton", C = "she"). Aside of mention-ranking tions such as pronouns. With the long-term goal to approaches [Denis, Baldridge 2008], [Rahman get a detailed attribution of entities in the novels 2009] the system developed by Durett, Hall and we developed our own annotation tool based on Klein [Durett et al. 2013] shows that task-specific eclipse-rcp2. Since former studies showed that the graphical models perform well following the enti- coreference task does not exhibit many ambiguities ty-mention approach. Since rules can model hard for humans our data is only annotated by one anno- and soft constraints directly, e.g. for pronoun reso- tator. lution, rule-based approaches like the multi pass Our corpus used in the first evaluation comprises sieve for CR [Lee et al. 2011] deliver promising 48 different novels. Thus, the first test corpus results, e.g. the best result in the challenge of contains 143 000 tokens with ca. 19 000 refer- CoNLL 2011 for English documents. 
Although the ences including proper names, personal pronouns problem of CR has been a topic for forty years etc., while for our second experiment we used 30 [Hobbs 1976], even good results are between 60% additional fragments with about 11 600 NEs and and 70%: [Durett et al. 2013] report a MUC-Score 104 000 tokens. In comparison to the German of 63.7% and a B³ Score of 67.8% for the CoNLL TIGER corpus [Brants et al. 2004] which consists blind test set, with the system of Stanford perform- of newspaper articles, we have on average longer ing comparably with 61.5% and 69.2% respective- sentences with 24.2 tokens compared to 16.3 to- ly – much worse than e.g. for NER. In [Lee et al. 2011] the Stanford system got better results with 1 https://opennlp.apache.org/ 2 https://wiki.eclipse.org/index.php/Rich_Client_Platform

99 kens. On average, one sentence contains three ref- NE and dependency parsing is used for determin- erences to the same character or other characters. ing the subject of a sentence. An evaluation of the Each character appeared 10 times on average with- number attribute which we had annotated aside of in the small novel fragments of 130 sentences, the coreferences and NEs resulted in an accuracy compared to ca. 4 times in the ACE2004-nwire of approximately 93% in the used test data. We corpus used in [Lee et al. 2011]. The majority of split our documents into a small training set with references are pronouns (~ 70%). In German, pro- just 5 documents, a first test set of 48 documents noun resolution is more ambiguous than in Eng- that we used to compare our performance with the lish, e.g. the German "sie" has three possible system CorZu, and into another test set consisting meanings: "she", "they", and “you” in direct of 30 completely unseen documents where we speech. Only for unambiguous pronouns like "er" evaluated the robustness of our system. [he] and "ihm" [him] we can use static features like in [Lee et al. 2011]. For pronoun resolution, our Our system has a similar rule organization as [Lee rules use features of NEs like gender (male, fe- et al. 2011] with passes, i.e. rule sets, which build male, neuter, unknown), number (singular, plural, on the results of former passes. While [Lee et al. unknown), person (for pronouns: first, second, 2011] uses 7 passes, we extend this by using 11 third person) and whether the NE is the subject of passes: the sentence. In general, a substantial part of a novel is direct speech, so we segment the novels in 1. Pass: exact match: All identical non-pronouns narrative and direct speech parts in a preprocessing are marked as coreferent. They are also consid- step. In order to detect the speaker of a given di- ered as coreferent if their cases differ (“An- rect speech annotation we use the following rules: nette” vs “ANNETTE”).  Explicit speaker detection. 2. Pass: Nameflexion: We designed a distance  Speaker propagation for longer dialogues. metric that allows us to detect derivations (or  Pseudo speaker propagation for situations nicknames) from names and mark them as co- where in longer dialogues two persons referent. (“Lydia”, “Lyden”,”Lydchen”). talking in turn can be recognized, but 3. Pass: Attributes: We use all modifiers (derived speaker detection failed. from the output of the parser) and match them against the strings of the NEs of the other clus- In order to be able to determine features like gen- ter. Coreference occurs if there is an “equals- der from non-pronoun references we use various ignore-case”-match. (“Die alte Getrud” ,…”die resources like lists of male and female first names Alte”) [“the elderly Gertrud”, “the elderly” ] from CorZu [Klenner 2011], the morphological 4. Pass: precise constructs: Appositions, relative analysis tool of Stuttgart University SMOR and reflexive pronouns are assigned to the pre- [Fitschen et al. 2004], the morphological tagger ceding NE. In addition, these pronouns get the from the above mentioned Mate-toolkit [Bohnet gender and number of the NE in order to sup- 2010] and a self-trained classifier, a maximum port subsequent resolution of other pronouns. entropy model trained on the TIGER corpus. After 5-7. Pass: 5. strict head match, 6. relaxed head applying these tools in a precision based manner match and 7. 
title match: These 3 passes rec- (lists, own system, mate, SMOR), where the sub- ognize coreferent NEs, where an NE consists sequent system is only used if the previous system of several words. The first rule, named strict detects “unknown”, we apply a set of correction head match, removes all titles from the given rules in order to guarantee consistency among the mentions and then compares the remaining NEs. The heuristic rules try to infer the gender of a words. Two NEs are said to be coreferent if NE by using context clues, e.g. a subsequent refer- there is at least one word that appears in both ence within the same “subsentence” (that is a sen- mentions and they agree in number and gender tence up to the next comma) of "his" or "her" and (“Baron Landsfeld” , “Herr Landsfeld”) [“Bar- the propagation of a recognized gender of a NE on Landsfeld”, “Mister Landsfeld”]. The re- along a local chain of unambiguous NEs (e.g. for laxed head match only requires that one word of old fashioned first names like "Günderode"). Other one NE is contained in a word of the other NE. rules exist for determination of the number of an Since titles were removed in the previous two

100 rules we added another rule specifically for ti- tles and match those to the most recent NE which contains the given title. 8. Semantic pass: For this semantic pass we use the synonyms in the German resource Ger- maNet3, again, an agreement in gender is re- quired. This matches for example “Gatte” and “Gemahl” [“spouse”, “consort”]. 9. Pass: pronoun resolution: pronouns are re- solved to the most recent, suitable precedent NE. To respect salience we sorted our previous NEs in a manner that preferred the last subject of earlier sentences over the closest NEs in those sentences. A suitable precedent is a one that doesn’t conflict with a given constraint. In the current implementation we respect the fol- lowing constraints:  Compliance in gender/number and person.

 Compliance in its context (are they both Fig. 1: Explanation and annotation editor for rule-based coref- part of a direct speech or both part of a erence resolution. All named entities (NE) are already marked. narrative segment). Identical numbers above the NE denote that they are corefer-  The candidate and the actual pronoun do ent to the same entity. For explanation, the user clicks on one NE (here: "Du" [you]) and the system shows in a pop up menu not harm a given constraint of binding the last five NE tokens with some key attributes: Geschlecht theory. ([gender]: male, female, neuter, unknown), number (singular, plural, unknown), Person (for pronouns: first, second third . 10. Pass: Detection of the addressed person in persons and for nouns the type of NE like real name, appella- direct speech: For each direct speech annota- tive name or pseudo-person) and whether the NE is the subject of the sentence. The annotator can classify errors like the tion we try to find the addressed NE. We do this missing coreference of "DU" to "Geliebter" [lover] with cate- by using several handcrafted lexico-syntactic gories (here: "angesprochener falsch" [wrong reference]) with patterns (matching against expressions such as a drop down menu in the right frame. If there is direct speech, “Alexander, you are great”). Based on the re- it is highlighted with a background color and the speaker (here sults of speaker detection we use a propagation "sie" [she]) is marked with a red circle. of the addressed persons in dialogues. 4 Evaluation and error analysis

11. Pass: pronouns in direct speech: We then We evaluated the coreference resolution algorithm resolve all instances of to the speaker and in two experiments. The first one uses the test cor- all instances of to the person the speaker pus of 48 novel fragments with about 19.000 man- talks to (if known). If the speaker of two subse- ually annotated character references in total. The quent direct speech annotations doesn’t change, following common evaluation metrics are used but the addressed person differs, we assume (see [Luo 2005]): that the speaker only uses a different naming for the person he is talking to and therefore set  The MUC-Score. It is based on the idea to these NEs as coreferent. count the minimum amount of links which need to be added to derive the true set of Fig. 1 shows the explanation, annotation and error entities from the predicted set of entities or analysis component of our rule-based tool. vice versa, divided by the amount of links in the spanning tree of the true partition. The MUC-Score itself is the harmonic mean out of both numbers that you get

3 http://www.sfs.uni-tuebingen.de/GermaNet/

101 when you switch the true partition with the 3. For reference, we added the results from Lee et gold partition. al. [Lee et al. 2011], the results of the system of  The B3-Score. The MUC Score cannot Stanford, evaluated on an ACE newspaper corpus. measure the influence of singleton clusters, It shows that pronoun resolution is much more that’s why an additional evaluation metric important in novels than in newspapers, while ex- is needed. The B³-Score scales the overlap act string matches and head matches already result of predicted clusters and true clusters, in rather high scores on the ACE newspaper cor- based on how many markables were cor- pus. rectly assigned to a given cluster. Scores in our system evaluated Stanfords Sieve evaluat- % with the novel corpus ed with ACE newspaper The effect of the different evaluation measures on corpus a newspaper corpus with rather short coreference Passes MUC F1 B³ F1 MUC F1 B³ F1 chains and on a novel corpus with long chains is 1 27.5 24.6 47.8 69.4 1-4 37.7 28.1 59.9 73.3 shown in the baseline analysis in table 1. While for 1-8 38.9 28.9 67.1 76.9 newspapers the baseline with n clusters for n NEs 1-9 83.3 52.6 78.6 80.5 is very good, for novels the baseline with just one 1-11 85.5 56.0 cluster for all NEs performs well. This can be ex- Table 3: Evaluation and comparison of the effects of the different passes of the rule-based algorithm. plained using the structure of the underlying enti- ties. While in newspaper texts many different entities with only a few mentions appear, our do- We finally evaluated our system on our second test main shows relatively few entities that tend to set, comprising 30 completely unseen fragments show up frequently. and achieved an F1-score of 86% MUC-F1 and a B³-F1 of 55.5%. It is almost identical to the result of the first test set. Rule-based systems with an explanation component allow a fine-grained error analysis. Table 4 shows an error analysis for 5 randomly selected novel fragments from the 30 Table 1: Baseline analysis for a typical newspaper and novel novels, drawn from the second test set that we used corpus with assigning all n NEs to either just one cluster or to for evaluation: n different clusters.
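For readers unfamiliar with the link-based MUC metric summarized above, the following compact sketch (assuming every mention belongs to exactly one cluster per partition; this is not the scorer used for the results reported here) computes MUC precision, recall and F1:

```python
# Compact sketch of the link-based MUC score described above.
def _muc_recall(gold, pred):
    """gold, pred: lists of sets of mention ids."""
    numerator = denominator = 0
    for cluster in gold:
        # partitions of this gold cluster induced by the predicted clusters
        parts = {frozenset(cluster & p) for p in pred if cluster & p}
        missing = cluster - set().union(*pred) if pred else cluster
        n_parts = len(parts) + len(missing)   # unmatched mentions count as singletons
        numerator += len(cluster) - n_parts
        denominator += len(cluster) - 1
    return numerator / denominator if denominator else 0.0

def muc_f1(gold, pred):
    r = _muc_recall(gold, pred)
    p = _muc_recall(pred, gold)               # precision = recall with roles swapped
    return 2 * p * r / (p + r) if p + r else 0.0

gold = [{1, 2, 3}, {4, 5}]
pred = [{1, 2}, {3, 4, 5}]
print(round(muc_f1(gold, pred), 3))           # 0.667
```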

We compared our system with the free coreference resolution software CorZu, using ParZu4 [Sennrich 2009] as its parser from the university of Zurich, which was developed using a newspaper corpus. CorZu was given the same annotated named enti- ties in the same novel fragments, so that the detect- ed chains were comparable. Table 2 shows the results. Our system is about 20 percent points bet- ter than CorZu for both evaluation scores MUC F1 3 and B F1. Table 4: Number of named entities, clusters, evaluation met- rics and error types for a sample of 5 novel fragments, drawn Scores in MUC MUC MUC B³ B³ B³ randomly from our second test set comprising 30 fragments. % precision recall F1 prec. recall F1 The category "Wrong g|n|p" refers to the sum of mistakes the our system 89.1 83.2 85.5 70.5 83.2 56.0 algorithm made that were caused by a wrong assignment of CorZu 77.0 57.7 65.9 69.5 22.7 33.6 gender, number or person. The category "Ds related", contains Table 2: evaluation results of our system and CorZu on 48 novel fragments with about 19 000 named entities. all errors related to direct speech, e.g. by assigning a wrong speaker or the wrong detection of the addressed person to a given direct speech annotation. The effect of the passes (see section 3) in our sys- tem evaluated on the novel corpus is given in table Table 4 shows that even though we combined 4 different morphological resources the recognition 4 http://www.cl.uzh.ch/research/coreferenceresolution_en.html

102 of wrong number, gender and person still makes up References a fraction of about 14% of the total amount of er- rors in the analyzed documents. Another part with Aone, C. and Bennett, S. 1995. Evaluating Automated 14% of the mistakes is the category that describes and Manual Acquisition of Anaphora Resolution all errors related to direct speech, e.g. wrong Strategies. Proceedings of the 33rd annual meeting speaker detection, missed detection of “Sie” [you] on Association for Computational Linguistics (ACL in the role of “du” or wrong detection of an ad- '95). Association for Computational Linguistics, dressed person. We intend to find some additional 122-129. Bohnet, B. 2010. Very High Accuracy and Fast De- constraints to further reduce the errors made in pendency Parsing is not a Contradiction. The 23rd these categories. The next category with 35% error Int. Conf. on Computational Linguistics (COLING contribution, labeled as heuristics, sums up all the 2010), Beijing, China. errors which happened due to a wrong assumption Brants, S., Dipper, S., Eisenberg, P., Hansen, S., König, of salience, parser errors or errors that were in- E., Lezius, W., Rohrer, C., Smith, G. and Uszkoreit, duced by former misclassified NEs. Still the big- H. 2004. TIGER: Linguistic Interpretation of a gest share of mistakes (37%) and probably also the German Corpus. Journal of Language and Computa- ones that are most difficult to fix is the class of tion 2 (4), 597-620. semantic errors. Most of these misclassifications Brunner, A. 2015. Automatische Erkennung von Redew- can only be resolved with additional knowledge iedergabe: Ein Beitrag zur Quantitativen Narratolo- gie (Narratologia). [Automatic Recognition of about the world or the entities in the novel itself Recorded Speech: a Contribution to Quantitative (“a widow is a woman who lost her husband; “his Narratologogia] Walter De Gruyter Inc. profession is forester”; …). Apart from these, there Durrett G., Hall D., and Klein D. 2013. Decentralized are other mistakes related to an unmodeled context, Entity-Level Modeling for Coreference Resolution. such as thoughts, songs or letters that appear In Proceedings of the 51st Annual Meeting of the throughout the text. We plan to integrate the work Association for Computational Linguistics (Volume of Brunner [Brunner 2015] to detect those instanc- 1), 114-124. es and thereby improve the quality of our system. Ferrucci, D. and Lally, A. 2004. UIMA: An Architectur- al Approach to Unstructured Information Pro- cessing in the Corporate Research Environment. 5 Conclusion Nat. Lang. Eng. 10, 327-348. http://dx.doi.org/10.1017/S1351324904003523 CR for NE can be viewed as a task in which candi- Fitschen, A., Schmid, H. and Heid, U. 2004. SMOR: A dates for coreference are filtered out by constraints German computational morphology covering deri- until only one candidate remains. Rule-based vation, composition, and inflection. Proceedings of knowledge representation is well suited for this the IVth International Conference on Language Re- task. The more constraints can be modeled, the sources and Evaluation (LREC 2004), 1263-1266. Haghighi, A. and Klein, D. 2009. Simple Coreference better the performance of the system. Our error Resolution with Rich Syntactic and Semantic Fea- analysis shows that we can cut our current error tures. In Proceedings of the 2009 Conference on rate by roughly 28% with more precise grammati- Empirical Methods in Natural Language Processing. 
cal constraints (14% for errors related to direct Association for Computational Linguistics, Singa- speech and 14% for gender, number and person pore, 1152-1161. related errors). However, we also plan the integra- Hinrichs, E., Kübler, S. and Naumann, K. 2005. A uni- tion of semantic constraints and information, simi- fied representation for morphological, syntactic, se- lar to Haghighi and Klein [Haghighi, Klein 2009]. mantic, and referential annotations. Proceedings of A promising way is to collect information about the ACL Workshop on Frontiers in Corpus Anota- the persons in the text, which is also the next step tion II: Pie in the Sky, pages 13–20, Ann Arbor, MI. Hobbs, J.1976. Pronoun Resolution. Technical report, in our overall goal, the automated character analy- Dept. of Computer Science, CUNY, Technical Re- sis: determining all attributes assigned in a novel to port TR761. a character. Iosif, E. and Mishra, T. 2014. From Speaker Identifica- tion to Affective Analysis: A Multi-Step System for Analyzing Children`s Stories. Proceedings of the 3rd Workshop on Computational Linguistics for Litera- ture. Gothenburg, Sweden, 40-49.

103 Jannidis, F., Krug, M., Reger, I. Toepfer, M. Weimer, L, Pradhan, S., Moschitti, A., Xue, N., Uryupina, O. and Puppe, F. 2015. Automatische Erkennung von Figu- Zhang, Y. 2012. CoNLL-2012 Shared Task: Model- ren in deutschsprachigen Romanen. [Automatic ing Multilingual Unrestricted Coreference in Onto- recognition of Characters in German novels] Digital Notes. In Proceedings of EMNLP and CoNLL-2012: Humanities im deutschsprachigen Raum (Dhd Shared Task, 1-40. 2015), Graz, Austria, 2015. Pradhan, S., Ramshaw, L., Marcus, M., Palmer, M., Klenner, M; Tuggener, D. 2011. An Incremental Entity- Weischedel, R., and Xue, N. 2011. CoNLL-2011 mention Model for Coreference Resolution with Re- Shared Task: Modeling Unrestricted Coreference in strictive Antecedent Accessibility. Recent Advances OntoNotes. Proceedings of the Fifteenth Conference in Natural Language Processing (RANLP 2011), on Computational Natural Language Learning: Hissar, Bulgaria, 178-185. Shared Task (CONLL Shared Task '11). Association Lee, H., Peirsman, Y., Chang, A., Chambers, N., for Computational Linguistics, 1-27. Surdeanu, M. and Jurafsky, D. 2011. Stanford's Mul- Rahman, A and Ng, V. 2009. Supervised Models for ti-pass Sieve Coreference Resolution System at the Coreference Resolution. Proceedings of the 2009 CoNLL-2011 Shared Task. Proc. of the 15th Confer- Conference on Empirical Methods in Natural Lan- ence on Computational Natural Language Learning: guage Processing, 968-977. Shared Task (CONLL Shared Task '11). Association Recasens, M., Martí, T., Taulé, M., Màrquez, L., and for Computational Linguistics, 28-34. Sapena, E. 2009. Coreference Resolution in Multiple Luo, X. 2005. On Coreference Resolution Performance Languages. Proceedings of the Workshop on Se- Metrics. Proc. of the conference on Human Lan- mantic Evaluations: Recent Achievements and Fu- guage Technology and Empirical Methods in Natu- ture Directions (SEW-2009), 70–75. ral Language Processing (HLT '05). Association for Schmid, H. 1995. Improvements in Part-of-Speech Tag- Computational Linguistics, 25-32. ging with an Application to German. Proceedings of Ng, V. 2010. Supervised Noun Phrase Coreference the ACL SIGDAT-Workshop. Dublin, Ireland. Research: The First Fifteen Years. Proceedings of Sennrich, R., Schneider, G., Volk, M., Warin, M. 2009. the 48th Annual Meeting of the Association for A New Hybrid Dependency Parser for German. Pro- Computational Linguistics (ACL '10). Association ceedings of GSCL Conference, Potsdam. for Computational Linguistics, 1396-1411. Soon, W, Ng, H. and Lim, D. 2001. A Machine Learn- Pascal, D. and Baldridge, J. 2008. Specialized models ing Approach to Coreference Resolution of Noun and ranking for coreference resolution. In Proceed- Phrases. Comput. Linguist. 27 (4), 521-544. ings of the Conference on Empirical Methods in Yang, J., Mao, Q., Xiang, Q., Tsang, I., Chai, K., Chieu, Natural Language Processing (EMNLP '08). Associ- H. 2012. Domain Adaptation for Coreference Reso- ation for Computational Linguistics, Stroudsburg, lution: An Adaptive Ensemble Approach. Proceed- PA, USA, 660-669. ings of the 2012 Joint Conference on Empirical Methods in NLP and CNLL, 744-753.

A computational linguistic approach to Spanish Golden Age Sonnets: metrical and semantic aspects

Borja Navarro-Colorado
Natural Language Processing Group
University of Alicante
Alicante, Spain
[email protected]

Abstract

Several computational linguistics techniques are applied to analyze a large corpus of Spanish sonnets from the 16th and 17th centuries. The analysis is focused on metrical and semantic aspects. First, we are developing a hybrid scansion system in order to extract and analyze rhythmical or metrical patterns. The possible metrical patterns of each verse are extracted with language-based rules; then statistical rules are used to resolve ambiguities. Second, we are applying distributional semantic models in order to, on one hand, extract semantic regularities from sonnets, and on the other hand group together sonnets and poets according to these semantic regularities. Besides these techniques, in this position paper we will show the objectives of the project and partial results.

1 Introduction

16th- and 17th-century Spanish poetry is judged as one of the best periods of the History of Spanish Literature (Rico, 1980–2000; Terry, 1993; Mainer, 2010). It was the time of great, famous and "canonical" Spanish poets such as Miguel de Cervantes, Lope de Vega, Garcilaso de la Vega or Calderón de la Barca, among others. Due to the importance given to this period, it has been deeply studied by scholars from the 19th century to the present. We are persuaded that new approaches based on a "distant reading" (Moretti, 2007; Moretti, 2013) or "macroanalysis" (Jockers, 2013) framework could shed new light on this period.

We have two general objectives: first, we will try to extract regular patterns from the overall period; and second, in order to analyze each author inside the broad literary context in which they wrote (García Berrio, 2000), we will look for chains of relationships between them.

Nowadays both objectives are focused on metrical and semantic aspects of Spanish Golden Age sonnets. In this position paper we will present the computational linguistic techniques used to achieve these objectives.

The next section shows how a large corpus of Spanish Golden-Age sonnets has been compiled and annotated; Section 3 describes a hybrid scansion system developed to extract metrical patterns; Section 4 presents how we use distributional semantic models to extract semantic patterns from the corpus; finally, Section 5 shows some preliminary conclusions.

2 Corpus compilation and XML annotation

A corpus of 5078 sonnets has been compiled. It includes all the main poets of the Spanish Golden Age: Juan Boscán, Garcilaso de la Vega, Fray Luis de León, Lope de Vega, Francisco de Quevedo, Calderón de la Barca, Sor Juana Inés de la Cruz, etc. Our objective is to include all the authors of this period who wrote a significant number of sonnets. Authors who wrote only a few sonnets (fewer than ten) have been rejected. Most sonnets have been obtained from the Miguel de Cervantes Virtual Library1.

1 http://www.cervantesvirtual.com/
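As a preview of the annotation scheme described in the remainder of this section, the following sketch builds a hypothetical TEI-style encoding of one verse line. The "lg", "l" and "met" names follow the description given below, while the "n" and "type" attributes and their values are illustrative assumptions, not necessarily the project's exact schema:

```python
# Hypothetical sketch of a TEI-style sonnet line; attribute names "n" and
# "type" (and the value "quatrain") are assumptions for illustration only.
import xml.etree.ElementTree as ET

lg = ET.Element("lg", type="quatrain")                    # one stanza
line = ET.SubElement(lg, "l", n="1", met="---+---+-+-")   # stressed syllables 4, 8, 10
line.text = "Cuando me paro a contemplar mi estado"
print(ET.tostring(lg, encoding="unicode"))
# <lg type="quatrain"><l n="1" met="---+---+-+-">Cuando me paro a contemplar mi estado</l></lg>
```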



It must be pointed out that sonnet quality is not taken into account. Following Moretti's Distant Reading framework (Moretti, 2007), we want a representative corpus of the kind of sonnets written in that period, not only the canonical sonnets. In other words, the corpus must represent the literary context of any Golden Age poet.

Each sonnet has been marked up with standard TEI-XML (www.tei-c.org/). We have followed the TEI standard in order to ensure the re-usability of the corpus in further research. The main metadata annotated in the TEI header are:

• Project title and project manager (Title and Publication Statement),
• Title and author of each sonnet,
• Source publication and editor (Source Description).

Regarding the sonnet structure, we have annotated:

• Quatrains,
• Tercets,
• Verse line and number of line,
• Some extra lines included in the poem (called "estrambote").

The markup includes a representation of the metrical structure of each verse line. It will be explained in the next section.

Nowadays we have a first version of the corpus. We plan to publish it on the internet during 2016.

3 Metrical annotation and analysis

In order to extract the metrical pattern of each verse line, we have created a scansion system for Spanish based on computational linguistics techniques. In this section we show how the metrical information is represented and what the main aspects of the scansion system are.

3.1 Metrical representation

Spanish poetry measures poetic lines by syllables. The Spanish sonnet is an adaptation of the Italian sonnet, where each line has eleven syllables (hendecasyllable). The metrical pattern of each verse line is formed by a combination of stressed and unstressed syllables. There must be a stressed syllable in the tenth position. The rest of the stressed syllables can be combined in different ways, generating different rhythms or metrical patterns: Heroic (stressed syllables at positions 2, 6 and 10), Melodic (at 3, 6, 10), Sapphic (at 4, 8, 10), etc. (Quilis, 1984; Varelo-Merino et al., 2005).

For each verse line, we represent its metrical pattern by a combination of the symbols "+" (stressed syllable) and "-" (unstressed syllable). For example:

Cuando me paro a contemplar mi estado

The "lg" tag represents the stanza (a quatrain in this case), and the "l" tag the line. Each line has the "met" attribute with the metrical pattern of the verse. This verse from Garcilaso de la Vega has thirteen linguistic syllables, but only eleven metrical syllables. As we will show in the next section, "-ro a" (in "paro a") and "mi es-" (in "mi estado") form a single syllable each due to the synaloepha phenomenon. Therefore this line is a hendecasyllable with stressed syllables in positions 4, 8 and 10 (Sapphic).
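To make the representation concrete, here is a small illustrative sketch (ours, not the project's code) that checks the obligatory stress on the tenth syllable and labels a pattern string with one of the named rhythms listed above; patterns with other stress combinations are simply reported as "other".

```python
# Illustrative sketch (not the authors' implementation) of the "+"/"-" pattern
# representation described above.

# Named hendecasyllable rhythms mentioned in the text, keyed by stressed positions.
NAMED_PATTERNS = {
    (2, 6, 10): "Heroic",
    (3, 6, 10): "Melodic",
    (4, 8, 10): "Sapphic",
}

def stressed_positions(pattern):
    """Return the 1-based positions of stressed syllables ("+") in a pattern."""
    return tuple(i + 1 for i, symbol in enumerate(pattern) if symbol == "+")

def name_pattern(pattern):
    """Label an 11-syllable pattern; a valid hendecasyllable stresses position 10."""
    positions = stressed_positions(pattern)
    if len(pattern) != 11 or 10 not in positions:
        return "invalid"
    return NAMED_PATTERNS.get(positions, "other")

# "Cuando me paro a contemplar mi estado": stressed syllables at 4, 8 and 10.
print(name_pattern("---+---+-+-"))  # -> Sapphic
```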

3.2 Scansion system

Metrical pattern extraction does not consist of a simple detection of syllables and accents. Due to the fact that there is not a direct relationship between linguistic syllables and metrical syllables, some ambiguity problems appear that must be solved with computational linguistics techniques. The main scansion problems are the following:

• The total amount of syllables could change according to the position of the last stressed syllable. If the last stressed syllable is the last one (oxytone), the line should have ten syllables and an extra syllable must be added. On the contrary, if the last stressed syllable is the antepenultimate (proparoxytone), the line should have twelve syllables and the last syllable must be removed. This is a fixed phenomenon and can be solved with rules.

• Not every word with a linguistic accent has a metrical accent. It depends on the part of speech. Words like nouns, verbs, adjectives or adverbs always have a metrical accent; but prepositions, conjunctions and some pronouns have no metrical accent.

• A vocalic sound at the end of a syllable and at the beginning of the next one tends to be blended into one single syllable (syneresis if the syllables belong to the same word and synaloepha if they belong to different words). This phenomenon is not always carried out: it depends on several factors, mainly the intention during declamation.

• The opposite is possible too: one single syllable with two vowels (normally a semivowel like "i" or "u") that can be pronounced as two separate syllables (dieresis).

These phenomena could change the extracted metrical pattern in two different ways: the amount of syllables and the type of each one of them (stressed or unstressed). The main problem is posed by those verses from which it is possible to extract two or more different patterns, all of them correct.

For example, for a verse with twelve syllables and paroxytone final stress it is necessary to blend at least two syllables into one through a phenomenon of synaloepha or syneresis. The problem appears when there are two possible synaloephas or synereses: which of them must be carried out? The final metrical pattern will be completely different. Take the next verse line:

cuando el padre Hebrero nos enseña

It has 12 syllables. It is necessary to blend two syllables into one through synaloepha. However, there are two possible synaloephas: "cuando+el" and "padre+Hebrero". A different metrical pattern is generated for each synaloepha:

--+--+---+-
---+-+---+-

A ranking of natural and artificial synaloephas has been defined by traditional metrical studies. For example, it is more natural to join two unstressed vowels than two stressed vowels (Quilis, 1984). From our point of view, this is a "deliberate" ambiguity (Hammond et al., 2013): both metrical patterns are correct, and choosing one depends on how the verse line is pronounced.

An automatic metrical scansion system must resolve this ambiguity, or at least select the most appropriate pattern, even if it could detect and represent alternative patterns. There are several computational approaches to metrical scansion for different languages (Greene et al., 2010; Agirrezabal et al., 2013; Hammond, 2014). For Spanish, P. Gervás (2000) proposes a rule-based approach. It applies logic programming to detect stressed and unstressed syllables, and it has a specific module to detect and resolve synaloephas that is applied recursively up to the end of the verse. However, this system is not able to detect ambiguities: if there are two possible synaloephas, it always chooses the first one. Therefore, it does not detect other possible metrical patterns.

We follow a hybrid approach to metrical scansion. First, rules are applied in order to separate words into syllables (hyphenation module), detect metrical syllables with a part-of-speech tagger, and finally blend or segment syllables according to synaloephas, dieresis or syneresis. We use FreeLing as PoS tagger (http://nlp.lsi.upc.edu/freeling/; Padró and Stanilovsky, 2012); for each word, the scansion system selects the most general PoS tag (noun, verb, etc.), and only in a few cases is a deeper analysis necessary, for example to distinguish between personal pronouns (stressed) and clitic pronouns (unstressed).

Before the application of synaloepha or syneresis rules, the system counts the number of syllables. If the line has eleven syllables, these rules are not applied. If there are more than eleven syllables, the system counts how many synaloephas or synereses must be resolved. If resolving all of them brings the syllable count to eleven, the system applies them all. If resolving all of them brings the count below eleven, the verse is ambiguous: the system must select which rules must be applied and which must not.
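A minimal sketch of that counting logic follows. It is our own simplification (the real system works on hyphenated, PoS-tagged syllables rather than on pre-computed counts), and the function name and return values are purely illustrative.

```python
# Schematic sketch (our simplification, not the authors' code) of the decision
# logic described above for synaloepha/syneresis application.

def classify_line(n_syllables, candidate_blends, target=11):
    """Decide whether a line is metrically resolved or ambiguous.

    n_syllables: linguistic syllable count before any blending.
    candidate_blends: number of possible synaloephas/synereses in the line.
    Returns a status and the number of blends that must actually be applied.
    """
    if n_syllables == target:
        return "resolved", 0          # already a hendecasyllable
    surplus = n_syllables - target    # how many blends are actually needed
    if surplus == candidate_blends:
        return "resolved", surplus    # applying every candidate yields 11 syllables
    if surplus < candidate_blends:
        return "ambiguous", surplus   # only some candidates may be applied
    return "unresolved", surplus      # blending alone cannot reach 11 syllables

# "cuando el padre Hebrero nos enseña": 12 syllables, 2 candidate synaloephas.
print(classify_line(12, 2))  # -> ('ambiguous', 1)
```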

For these ambiguous verses (with two or more possible metrical patterns) we follow a statistical approach. First, the system calculates metrical pattern frequencies from non-ambiguous lines. These patterns are extracted from lines in which it has not been necessary to apply the rules for synaloephas, or from lines in which applying all possible rules for synaloephas yields a unique pattern of eleven syllables. Each time the system analyzes one of these lines, the frequency of its pattern is increased by one.

From a total amount of 82,593 verses (this count includes the authors with less than ten sonnets that were rejected for the final version of the corpus), 6,338 are ambiguous and 76,255 non-ambiguous. Therefore, only 7.67% of lines are ambiguous. In these cases, from the possible patterns that can be applied to a specific line, the system selects the most frequent one: the pattern that has been used most frequently in non-ambiguous verses.

Our approach tends to select common patterns and reject unusual ones. It must be noted that we do not claim that the metrical pattern selected in ambiguous lines is the correct one; we claim that it is the most frequent one. As we said before, this is a "deliberate" ambiguity (Hammond et al., 2013) in which there are no correct or incorrect solutions.

Table 1 shows the most frequent patterns extracted from the corpus and their frequencies.

Metrical Pattern        Name      Frequency
- + - - - + - - - + -   Heroic    6457
- + - + - - - + - + -   Sapphic   6161
- - + - - + - - - + -   Melodic   5982
- + - + - + - - - + -   Heroic    5015
- - - + - + - - - + -   Sapphic   3947
- + - - - + - + - + -   Heroic    3549
- + - + - + - + - + -   Heroic    3310
+ - - + - - - + - + -   Sapphic   3164
+ - - + - + - - - + -   Sapphic   3150
- - - + - - - + - + -   Sapphic   3105
- - + - - + - + - + -   Melodic   2940

Table 1: Most frequent metrical patterns.

Therefore, the previous example is annotated with the first of its two candidate patterns (Melodic):

cuando el padre Hebrero nos enseña
--+--+---+-

Nowadays we are manually reviewing the automatic annotation in order to correct errors, set up a gold standard and evaluate the system.
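The frequency-based selection described above can be reduced to a few lines; this is our own reconstruction of the idea, with toy counts rather than the real corpus frequencies.

```python
# Minimal reconstruction (not the authors' code) of the statistical
# disambiguation step: pick the candidate pattern most frequent among
# the non-ambiguous lines of the corpus.
from collections import Counter

def collect_frequencies(unambiguous_patterns):
    """Count how often each metrical pattern occurs in non-ambiguous lines."""
    return Counter(unambiguous_patterns)

def disambiguate(candidate_patterns, frequencies):
    """Select the candidate pattern with the highest corpus frequency."""
    return max(candidate_patterns, key=lambda p: frequencies.get(p, 0))

# Toy counts: the Melodic pattern is more frequent, so it is selected,
# as in the annotated example above.
freqs = collect_frequencies(["--+--+---+-"] * 3 + ["---+-+---+-"] * 2)
print(disambiguate(["--+--+---+-", "---+-+---+-"], freqs))  # -> --+--+---+-
```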

4 Semantic analysis

In order to develop a broad semantic analysis of Spanish Golden Age sonnets, we are applying distributional semantic models (Turney and Pantel, 2010; Mitchell and Lapata, 2010). These models are based on the distributional hypothesis (Harris, 1951): words that occur in similar contexts have similar meanings. These models use vector space models to represent the context in which a word appears and, thereby, the meaning of the word.

Computational distributional models are able to establish similarities between words according to the similarity of their contexts. Therefore, applying these distributional models to corpora of sonnets can extract semantic similarities between words, texts and authors. A standard approach is based on a word-text matrix. Applying well-known distance metrics such as cosine similarity or Euclidean distance, it is possible to find out the similarities between words or poems. In light of these similarities we can then establish the (distributional) semantic relations between authors.

We are applying two specific distributional semantic models: Latent Dirichlet Allocation (LDA) topic modeling (Blei et al., 2003) on one hand, and distributional-compositional semantic models (Mitchell and Lapata, 2010) on the other hand.

4.1 LDA Topic Modeling

During the last years several papers have proposed applying LDA topic modeling to literary corpora (Tangherlini and Leonard, 2013; Jockers and Mimno, 2013; Jockers, 2013; Kokkinakis and Malm, 2013), among others. Jockers and Mimno (2013), for example, use topic modeling to extract relevant themes from a corpus of 19th-century novels. They present a classification of topics according to genre, showing that, in 19th-century English novels, males and females tended to write about the same things but to very different degrees. For example, males preferred to write about guns and battles, while females preferred to write about education and children. From a computational point of view, this paper concludes that topic modeling must be applied with care to literary texts, and it proves the need for statistical tests that can measure confidence in the results.

Rhody (2012) analyzes the application of topic modeling to poetry. The result is different from the application of topic modeling to non-figurative texts: when it is applied to figurative texts, some "opaque" topics (topics formed by words with apparently no semantic relation between them) really show symbolic and metaphoric relations. More than "topics", these topics represent symbolic meanings. She concludes that, in order to understand them, a close reading of the poems is necessary.

We have run LDA topic modeling over our corpus of sonnets using MALLET (http://mallet.cs.umass.edu/; McCallum, 2002). Using different configurations (10, 20, 50, 100 and 1000 topics), we are developing several analyses. In the next sections we present these analyses together with some preliminary results and comments.
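As an illustration of this step only (the study itself used MALLET, and we do not reproduce its exact configuration here), an equivalent topic model can be trained with the gensim library over tokenized sonnets; the variable names and the stop-list are assumptions of the sketch.

```python
# Illustrative gensim sketch; the authors' experiments used MALLET.
from gensim import corpora, models

def train_lda(tokenized_sonnets, num_topics=20, stoplist=frozenset()):
    """Train an LDA topic model over sonnets given as lists of word tokens."""
    texts = [[w for w in sonnet if w not in stoplist] for sonnet in tokenized_sonnets]
    dictionary = corpora.Dictionary(texts)
    bows = [dictionary.doc2bow(text) for text in texts]
    lda = models.LdaModel(bows, num_topics=num_topics, id2word=dictionary, passes=10)
    return lda, dictionary, bows

# Typical inspection: the top words of a few topics, and the topic mixture of
# one sonnet (used later to decide for which authors a topic is "relevant").
# lda, dictionary, bows = train_lda(sonnet_tokens, num_topics=20)
# print(lda.print_topics(num_topics=5, num_words=8))
# print(lda.get_document_topics(bows[0]))
```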

4.1.1 Common and regular topics

First, we have extracted the most common and regular topics from the overall corpus. We are analyzing them using as a reference framework the themes and topics established manually by scholars following a close reading approach (García Berrio, 1978; Rivers, 1993).

At this moment we have found four types of topics:

• Topics clearly related to classical themes. Table 2 shows some examples.

• Topics showing rhyme relations: words that tend to appear in the same sonnet because they rhyme with each other. For example, "boca loca toca poca provoca" (Topic 14 of 100).

• Topics showing figurative and symbolic relations: words semantically related only in a symbolic framework. For example, topic 70 relates the words "río fuente agua" (river, fountain, water) with "cristal" (glass). This topic shows the presence of the Petrarchan tradition of "rivers of glass" ("et già son quasi di cristallo i fiumi", Petrarca, Canzoniere LXVI) in Spanish poetry.

• Noise topics.

Topic Model                                        Traditional Theme
amor fuerza desdén arco niño cruel ciego           Unrequited Love
flecha fuego ingrato sospecha
hoy yace sepulcro fénix mármol polvo ceniza        Funeral
ayer guarda muerta piedad cadáver
españa rey sangre roma imperio grande baña         Decline of Spanish Empire
valor extraña reino carlos hazaña engaña
saña bárbaro

Table 2: Topic models related to classical themes.

Once we detect an interesting topic, we analyze the sonnets in which this topic is relevant. For example, Topic 2 in Table 2 clearly represents the funeral theme: sonnets composed upon the gravestone of a dead person. According to LDA, this topic is relevant in Francisco de Quevedo (10 poems), Góngora (6 poems), Lope de Vega (6 poems), Juan de Tassis y Peralta (6 poems), Trillo y Figueroa (3 poems), López de Zárate (3 poems), Bocángel y Unzueta (3 poems), Polo de Medina (2 poems), Pantaleón de Ribera (2 poems), etc. We can reach interesting conclusions from a close reading of these poems. For example:

• All these authors belong to the 17th century, the Baroque period. This topic is related to the "brevity of life" theme, a typical Baroque topic. Topic modeling is thus confirming traditional studies.

• Most of these sonnets are really funeral sonnets, but not all of them: there are some love and satirical sonnets too. However, these off-topic sonnets use words related to sepulchers, tombs, graves and death. In these cases, topic modeling is not showing topics but stylistic and figurative aspects of the poem. Francisco de Quevedo is an example of this: he wrote quite a lot of funeral sonnets and, at the same time, he used words related to death in satirical and mainly love sonnets. It is what Terry (1993) calls "ceniza amante" (loving ash), a specific characteristic of Quevedo's sonnets.

Therefore, we take advantage of the relations between sonnets and authors established by LDA topic modeling, and then we follow a close reading approach in order to (i) reject noise and random relations, (ii) confirm relations detected by manual analysis, and (iii) detect non-evident relations. This last situation is our main objective.

4.1.2 Clusters of sonnets and poets

Second, we are automatically clustering sonnets and authors that share the same topics. At this moment we have run a k-means clustering over an author-topic matrix, using the cluster algorithm implemented in Pycluster (https://pypi.python.org/pypi/Pycluster). Each author is represented by all the sonnets that they wrote. The matrix is formed by the weight that each topic has in the overall sonnets of each author (only a stop-list filter has been used to pre-process the corpus). Then k-means clustering has been run using Euclidean distance and different numbers of clusters.

Some preliminary analysis shows that with 20 topics and clustering authors into only two groups, 16th-century authors (Renaissance period) and 17th-century authors (Baroque period) are grouped together. Only one poet (of 52) is misclassified. This shows that topic models are able to represent distinctive characteristics of each period. Therefore, we can assume some coherence in finer clusters.

With 20 topics but clustering authors into ten groups, we have obtained coherent groups too. All poets grouped together wrote during the same period of time. The most relevant aspects of this automatic classification are the following:

• Íñigo López de Mendoza, Marqués de Santillana, was a pre-Renaissance poet. He was the first Spanish author who wrote sonnets. He appears isolated in a specific cluster. Topic modeling has clearly detected that this is a special poet.

• The first generation of Renaissance poets are grouped together in the same cluster: Hernando de Acuña, Juan de Timoneda, Juan Boscán, Garcilaso de la Vega, Gutierre de Cetina and Diego Hurtado de Mendoza.

• There is another cluster that groups together poets of the second Renaissance generation: authors who wrote during the second half of the 16th century, such as Miguel de Cervantes, Fray Luis de León, Francisco de Figueroa, Francisco de la Torre, Diego Ramírez Pagán, Francisco de Aldana and Juan de Almeida.

• One of the poets of this generation, Fernando de Herrera, appears in isolation in a specific cluster.

• Baroque poets (who wrote between 1580 and 1650) are grouped together in various clusters. There are two main groups: the first one includes poets born between 1560 and 1590 (Lope de Vega, b. 1562; Juan de Arguijo, b. 1567; Francisco de Medrano, b. 1570; Tirso de Molina, b. 1579; Francisco de Quevedo, b. 1580; Francisco de Borja y Aragón, b. 1581; Juan de Jáuregui, b. 1583; Pedro Soto de Rojas, b. 1584; Luis Carrillo y Sotomayor, b. 1585; Antonio Hurtado de Mendoza, b. 1586; etc.), and the second one poets born from 1600 onwards (Jerónimo de Cáncer y Velasco, c. 1599; Pantaleón de Ribera, b. 1600; Enríquez Gómez, b. 1602; Bocángel y Unzueta, b. 1603; Polo de Medina, b. 1603; Agustín de Salazar y Torres, b. 1642; Sor Juana Inés de la Cruz, b. 1651; José de Litala y Castelví, b. 1672). Only Francisco López de Zárate (b. 1580) is misclassified.

This temporal coherence in the clusters, which appears in other clusters too, shows us, on one hand, that topic modeling could be a reliable approach to the analysis of corpora of poetry, and on the other hand, that there is some relation between topic models and the generations of poets during these centuries. Nowadays we are analyzing the relations between the poets grouped in the same clusters in order to understand the reasons for this homogeneity. We plan to run other kinds of clusterings in order to analyze other possibilities.
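To illustrate the clustering step (the paper used the Pycluster package; scikit-learn is substituted here purely for the sketch, and the author-by-topic matrix in the usage comment is invented):

```python
# Illustrative sketch of k-means over an author-topic weight matrix
# (rows = authors, columns = topic weights). Not the authors' Pycluster setup.
import numpy as np
from sklearn.cluster import KMeans

def cluster_authors(author_names, author_topic_matrix, n_clusters=10, seed=0):
    """Group authors whose sonnets show similar topic distributions."""
    X = np.asarray(author_topic_matrix, dtype=float)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)
    clusters = {}
    for name, label in zip(author_names, labels):
        clusters.setdefault(int(label), []).append(name)
    return clusters

# Toy usage: the first two authors share topic weights and end up together.
# print(cluster_authors(["A", "B", "C"], [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]], n_clusters=2))
```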

4.1.3 Topic timeline

Taking into account the authors' timeline, we are analyzing how trendy topics change during the period. We want to know about the evolution of the main topics during the period, which authors introduce new topics, to what extent these topics are followed by other poets, etc. At this moment we have no preliminary results to illustrate this aspect.

4.1.4 Relations between metrical patterns and semantic topics

We are analyzing possible relations between metrical patterns and topics. Our hypothesis is that for specific topics, poets use specific metrical patterns and rhythms. At this moment this is an open question.

As a preliminary analysis, we have run a clustering of sonnets based on their metrical patterns. First, we have set up the most relevant metrical patterns of each author by applying LDA to the metrical patterns: instead of its words, each sonnet is represented only by its metrical patterns. Then we have run the k-means clustering algorithm with Euclidean distance and 10 clusters.

From these clusters we have some preliminary considerations:

• Íñigo López de Mendoza, Marqués de Santillana, appears again in isolation. As we said before, his sonnets were written in a pre-Renaissance period: their meters and rhythms are very different from the others. The clustering is correctly detecting this special case.

• Another cluster is formed mainly by Renaissance poets, from Garcilaso de la Vega to Fray Luis de León. Even though there are two Baroque poets in this cluster, it seems that Renaissance meters are quite stable and uniform.

• The other two clusters assemble Baroque poets together. At this moment we have not detected any literary criteria that justify these clusters. It is noteworthy that one cluster includes Miguel de Cervantes and Lope de Vega, who tend to use more classical rhythms, and the other Góngora and Quevedo, who tend to use more Baroque rhythms.

These clusters based on metrical patterns are similar to the previous clusters based on the distribution of words. Many poets appear together in both experiments: it seems that they share the same distributional topics and metrical patterns. This suggests, albeit in a very speculative way, that there must be some kind of regularity between topics and meters.

In a nutshell, as we have shown in this section, by applying LDA topic modeling and, in general, distributional models to our corpus of sonnets it is possible to extract non-evident (latent) but reliable relations between words (especially figurative language), sonnets and poets. In any case, a final close reading is necessary in order to validate or reject the relations extracted automatically and to justify them according to previous studies. These computational methods draw attention to possible latent relations, but these must always be manually validated.

4.2 Compositional-distributional semantic models

Recently a new model of computational semantics has been proposed: the compositional-distributional model (Baroni, 2013). The main idea of this model is to introduce the principle of compositionality into a distributional framework.

Distributional models are based on single words. Standard vector space models of semantics are based on a term-document or word-context matrix (Turney and Pantel, 2010). Therefore, as we have shown in the previous section, they are useful models to calculate similarity between single words, but they cannot represent the meaning of complex expressions such as phrases or sentences.

Following Frege's principle of compositionality (Montague, 1974), the meaning of these complex expressions is formed by the meaning of their single units and the relations between them. To represent compositional meaning in a distributional framework, it is necessary to combine word vectors. How semantic vectors must be combined to represent compositional meaning is an open question in computational linguistics. Some proposals are vector addition, tensor product, convolution, etc. (Mitchell and Lapata, 2010; Clarke, 2011; Blacoe and Lapata, 2012; Socher et al., 2012; Baroni, 2013; Baroni et al., 2014; Hermann and Blunsom, 2014).

From our point of view, compositional-distributional models are useful to detect semantic relations between sonnets based on stylistic features. These models are able to detect semantic similarity according not only to the words used in a poem, but to how the author combines these words. The combination of words in a poem is the basis of its literary style.

We plan to calculate semantic similarity according to specific phrases. For example, the way an author uses adjectives is very specific to that author. Compositional-distributional models allow us to extract adjective-noun patterns from sonnets and to calculate the similarities between these patterns. If two poets tend to use similar adjective-noun patterns, then it is possible to establish an influence chain between them. We are working with standard tools such as DISSECT (Dinu et al., 2013). Unfortunately, at this moment we have no results to show.
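As a sketch of the kind of composition involved (our illustration, not DISSECT itself), an adjective-noun phrase can be represented by the sum of its word vectors and phrases compared by cosine similarity; the word vectors in the usage comment are assumed to come from a distributional model built over the corpus.

```python
# Illustration only (not DISSECT): additive composition of adjective-noun
# phrases and cosine comparison between the composed phrase vectors.
import numpy as np

def compose_additive(adj_vec, noun_vec):
    """Additive composition: phrase vector = adjective vector + noun vector."""
    return np.asarray(adj_vec, dtype=float) + np.asarray(noun_vec, dtype=float)

def cosine(u, v):
    """Cosine similarity between two vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical usage, given a dict `vectors` of word vectors from the corpus:
# phrase_a = compose_additive(vectors["dulce"], vectors["fuego"])
# phrase_b = compose_additive(vectors["helado"], vectors["fuego"])
# print(cosine(phrase_a, phrase_b))
```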
5 Conclusions

In this paper we have presented the computational linguistics techniques applied to the study of a large corpus of Spanish sonnets. Our objective is to establish chains of relations between sonnets and authors and, then, to analyze each author in a global literary context. Once a representative corpus has been compiled and annotated, we have focused on two aspects: metrical patterns and semantic patterns.

Metrical patterns are extracted with a scansion system developed in the project. It follows a hybrid approach that combines hand-made and statistical rules. With all these metrical patterns we plan, on one hand, to analyze the most relevant metrical patterns of the period, as well as the most relevant patterns of each author. On the other hand, we plan to cluster sonnets and authors according to the relevant metrical patterns they use, and to establish metrical relational chains.

Semantic patterns are extracted following a distributional semantic framework. First, we are using LDA topic modeling to detect the most relevant topics of the period and the most relevant topics of each author. Then we plan to group together authors and sonnets according to the topics they share. Finally, we will establish influence chains based on these topics.

We plan to combine both approaches in order to analyze the hypothesis that poets tend to use similar metrical patterns with similar topics. At this moment it is only a hypothesis, to be evaluated during the development of the project.

Finally, we want to go one step beyond topic modeling and try to relate authors not by what words they use, but by how they combine the words in sonnets. We plan to apply compositional-distributional models to cluster sonnets and authors with similar stylistic features.

As a position paper, we have presented only partial results of our project. Our idea is to establish a global computational linguistic approach to literary analysis based on the combination of metrical and semantic aspects; a global approach that could be applied to other corpora of poetry.

References

Manex Agirrezabal, Bertol Arrieta, Aitzol Astigarraga, and Mans Hulden. 2013. ZeuScansion: a tool for scansion of English poetry. In Proceedings of the 11th International Conference on Finite State Methods and Natural Language Processing, pages 18–24, St Andrews, Scotland. ACL.

Marco Baroni, Raffaela Bernardi, and Roberto Zamparelli. 2014. Frege in Space: A Program for Compositional Distributional Semantics. Linguistic Issues in Language Technology, 9(6):5–110.

Marco Baroni. 2013. Composition in Distributional Semantics. Language and Linguistics Compass, 7(10):511–522.

William Blacoe and Mirella Lapata. 2012. A Comparison of Vector-based Representations for Semantic Composition. In Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 546–556.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993–1022.

Daoud Clarke. 2011. A context-theoretic framework for compositionality in distributional semantics. Computational Linguistics, 38(1):41–71.

Georgiana Dinu, Nghia The Pham, and Marco Baroni. 2013. DISSECT: DIStributional SEmantics Composition Toolkit. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations.

Antonio García Berrio. 1978. Lingüística del texto y tipología lírica (La tradición textual como contexto). Revista española de lingüística, 8(1).

Antonio García Berrio. 2000. Retórica figural. Esquemas argumentativos en los sonetos de Garcilaso. Edad de Oro, (19).

Pablo Gervás. 2000. A Logic Programming Application for the Analysis of Spanish Verse. In Computational Logic. Springer, Berlin Heidelberg.

Erica Greene, Tugba Bodrumlu, and Kevin Knight. 2010. Automatic Analysis of Rhythmic Poetry with Applications to Generation and Translation. In Empirical Methods in Natural Language Processing, pages 524–533, Massachusetts. ACL.

Adam Hammond, Julian Brooke, and Graeme Hirst. 2013. A Tale of Two Cultures: Bringing Literary Analysis and Computational Linguistics Together. In Workshop on Computational Linguistics for Literature, Atlanta, Georgia.

Michael Hammond. 2014. Calculating syllable count automatically from fixed-meter poetry in English and Welsh. Literary and Linguistic Computing, 29(2).

Zellig Harris. 1951. Structural Linguistics. University of Chicago Press, Chicago.

Karl Moritz Hermann and Phil Blunsom. 2014. Multilingual Models for Compositional Distributed Semantics. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 58–68.

Matthew L. Jockers and David Mimno. 2013. Significant Themes in 19th-Century Literature. Poetics, 41.

Matthew L. Jockers. 2013. Macroanalysis: Digital Methods and Literary History. University of Illinois Press, Illinois.

Dimitrios Kokkinakis and Mats Malm. 2013. A Macroanalytic View of Swedish Literature using Topic Modeling. In Corpus Linguistics Conference, Lancaster.

José Carlos Mainer. 2010. Historia de la literatura española. Crítica, Barcelona.

Andrew Kachites McCallum. 2002. MALLET: A Machine Learning for Language Toolkit.

Jeff Mitchell and Mirella Lapata. 2010. Composition in Distributional Models of Semantics. Cognitive Science, 34:1388–1429.

Richard Montague. 1974. English as a formal language. In R. Montague, editor, Formal Philosophy, pages 188–221. Yale University Press.

Franco Moretti. 2007. Graphs, Maps, Trees: Abstract Models for a Literary History. Verso.

Franco Moretti. 2013. Distant Reading. Verso.

Lluís Padró and Evgeny Stanilovsky. 2012. FreeLing 3.0: Towards Wider Multilinguality. In Language Resources and Evaluation Conference (LREC 2012), Istanbul.

Antonio Quilis. 1984. Métrica española. Ariel, Barcelona.

Lisa M. Rhody. 2012. Topic Modeling and Figurative Language. Journal of Digital Humanities, 2(1).

Francisco Rico, editor. 1980-2000. Historia y crítica de la literatura española. Crítica, Barcelona.

Elias L. Rivers. 1993. El soneto español en el siglo de oro. Akal, Madrid.

Richard Socher, Brody Huval, Christopher D. Manning, and Andrew Y. Ng. 2012. Semantic Compositionality through Recursive Matrix-Vector Spaces. In Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1201–1211.

Timothy R. Tangherlini and Peter Leonard. 2013. Trawling in the Sea of the Great Unread: Sub-corpus topic modeling and Humanities research. Poetics, 41:725–749.

Arthur Terry. 1993. Seventeenth-Century Spanish Poetry. Cambridge University Press.

Peter D. Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37:141–188.

Elena Varelo-Merino, Pablo Moíno-Sánchez, and Pablo Jauralde-Pou. 2005. Manual de métrica española. Castalia, Madrid.

Automated translation of a literary work: a pilot study

Laurent Besacier, LIG, Univ. Grenoble - Alpes, UJF - BP 53, 38041 Grenoble Cedex 9, France, [email protected]
Lane Schwartz, Department of Linguistics, University of Illinois, Urbana, IL 61801, USA, [email protected]

Abstract

Current machine translation (MT) techniques are continuously improving. In specific areas, post-editing (PE) can enable the production of high-quality translations relatively quickly. But is it feasible to translate a literary work (fiction, short story, etc.) using such an MT+PE pipeline? This paper offers an initial response to this question. An essay by the American writer Richard Powers, currently not available in French, is automatically translated, post-edited and then revised by non-professional translators. In addition to presenting experimental evaluation results of the MT+PE pipeline (MT system used, automatic evaluation), we also discuss the quality of the translation output from the perspective of a panel of readers (who read the translated short story in French and answered a survey afterwards). Finally, some remarks by the official French translator of R. Powers, requested on this occasion, are given at the end of this article.

1 Introduction

The task of post-editing consists of editing some text (generally produced by a machine, such as a machine translation, optical character recognition, or automatic transcription system) in order to improve it. When using machine translation in the field of document translation, the following process is generally used: the MT system produces raw translations, which are manually post-edited by trained professional translators (post-editors) who correct translation errors. Several studies have shown the benefits of the combined use of machine translation and manual post-editing (MT+PE) for a document translation task. For example, Garcia (2011) showed that even though post-editing raw translations does not always lead to significant increases in productivity, this process can result in higher quality translations (when compared to translating from scratch); the work of Garcia (2011) is somewhat controversial, however, because the manual translation without post-editing appears to have been done without allowing the translator to use any form of digital assistance, such as an electronic dictionary. Autodesk also carried out an experiment to test whether the use of MT would improve the productivity of translators. Results from that experiment (Zhechev, 2012) show that post-editing machine translation output significantly increases productivity when compared to translating a document from scratch. This result held regardless of the language pair, the experience level of the translator, and the translator's stated preference for post-editing or translating from scratch.

These results from academia (Garcia, 2011) and industry (Zhechev, 2012) regarding translation in specialized areas lead us to ask the following questions:

• What would be the value of such a process (MT + PE) applied to the translation of a literary work?
• How long does it take to translate a literary document of ten thousand words?
• Is the resulting translation acceptable to readers?
• What would the official translator (of the considered author) think of it?
• Is "low cost" translation produced by communities of fans (as is the case for TV series) feasible for novels or short stories?

This work attempts to provide preliminary answers to these questions. In addition to our experimental results, we also present a new translation (into French) of an English-language essay (The Book of Me by Richard Powers).


This paper is organized as follows. We begin in §2 by surveying related work in machine translation in the literary domain. In §3, we present our experimental methodology, including the choice of literary work to be translated and the machine translation, domain adaptation, and post-editing frameworks used. In §4, we present our experimental results (our translations and collected data are available at https://github.com/powersmachinetranslation/DATA), including an assessment of translation quality using automated machine translation metrics. In §5, we attempt to assess machine translation quality beyond automated metrics, through a human assessment of the final translation; this assessment was performed by a panel of readers and by the official French translator of Richard Powers.

2 Related Work

While the idea of post-editing machine translations of scientific and technical works is nearly as old as machine translation (see, for example, (Oettinger, 1954)), very little scholarship to date has examined the use of machine translation or post-editing for literary documents. The most closely related work (Voigt and Jurafsky, 2012) that we were able to identify was presented at the ACL workshop on Computational Linguistics for Literature (https://sites.google.com/site/clfl2014a); since 2012, that workshop has examined the use of NLP in the literary field. Voigt and Jurafsky (2012) examine how referential cohesion is expressed in literary and non-literary texts and how this cohesion affects translation (experiments on Chinese literature and news). The present paper, however, tries to investigate whether computer-assisted translation of a complete (and initially un-translated) short story is feasible or not.

For the purposes of this paper, we now define what constitutes a literary text. We include in this category (our definition is undoubtedly too restrictive) all fiction or autobiographical writing in the form of novels, short stories or essays. In such texts, the author expresses his vision of the world of his time and of life in general, while using literary devices and a writing technique (form) that allows him to create effects using the language and to express meanings (explicit or implied).

3 Methodology

For this study, we follow a variant of the post-editing methodology established by Potet et al. (2012). In that work, 12,000 post-edited segments (equivalent to a book of about 500 pages) in the news domain were collected through crowdsourcing, resulting in one of the largest freely available corpora of post-edited machine translations (http://www-clips.imag.fr/geod/User/marion.potet/index.php?page=download). It is, for example, three times larger than that collected by Specia et al. (2010), a well known benchmark in the field.

Following Potet et al. (2012), we divide the document to be translated into three equal parts. A translation/post-editing/adaptation loop was applied to the three blocks of text according to the following process (a schematic sketch of this loop is given after the list):

• The first third of the document was translated from English to French using Moses (Hoang et al., 2007), a state-of-the-art phrase-based machine translation system. This machine translation output was then post-edited.

• The post-edited data from this first third of the document was used to train an updated domain-adapted English-French MT system. Given the small amount of post-edited data, adaptation at this point consisted only in adapting the weights of the log-linear SMT model (by using the corrected first third as a development corpus). A similar method is suggested by Pecina et al. (2012) for domain adaptation with a limited quantity of data (we are aware that other, more advanced domain adaptation techniques could have been used, but this was not the central theme of our contribution).

• Then, the second third of the text was translated with the adapted MT system, the results were post-edited, and a second adapted MT system was obtained from the new data. This second system was used to translate the third and last part of the text.

Our methodology differs in two important ways from Potet et al. (2012). First, our study makes use of only one post-editor, and does not use crowdsourcing to collect data. Second, once the post-editing was completed, the final text was revised: first by the post-editor and then by another reviewer. The reviewer was a native French speaker with a good knowledge of the English language. The times taken to post-edit and revise were recorded.

3.1 Choice of literary document

To test the feasibility of using machine translation and post-editing to translate a literary work, we began by selecting an essay written in English which had not yet been translated into French. The choice of text was guided by the following factors:

(a) we had a contact with the French translator of the American author Richard Powers (http://en.wikipedia.org/wiki/Richard_Powers), author of the novel The Echo Maker, which won the National Book Award and was a finalist for the Pulitzer Prize;

(b) in his writing, Powers often explores the effects of modern science and technology, and in some ways his writings contain commonalities with scientific and technical texts. We hypothesized that this characteristic may somewhat reduce the gap between translation of scientific and literary texts.

Via his French translator (Jean-Yves Pellegrin), Powers was informed by e-mail of our approach, and he gave his consent as well as his feeling on this project (Richard Powers: ".../... this automated translation project sounds fascinating. I know that the field has taken a big jump in recent years, but each jump just furthers the sense of how overwhelming the basic task is. I would be delighted to let him do a text of mine. Such figurative writing would be a good test, to be sure. .../... 'The Book of Me' would be fine, too.").

We selected an essay by Powers, entitled The Book of Me, originally published in GQ magazine (http://www.gq.com/news-politics/big-issues/200810/richard-powers-genome-sequence). The essay is a first-person narrative set in 2008, in which Powers describes the process by which he became the ninth person in the world to see his genome fully sequenced. Although the topic is genetics, and in spite of the simple, clinical style used by the author, The Book of Me is truly a work of literature in which the author, who teaches narrative technique at the university level, never puts aside his poetic ambition, his humour and his fascination for the impact of science and technology on society.

3.2 MT system used

Our machine translation system is a phrase-based system using the Moses toolkit (Hoang et al., 2007). It is trained using the data provided in the IWSLT machine translation evaluation campaign (Federico et al., 2012), representing a cumulative total of about 25M sentences:

• news-c: version 7 of the News-Commentary corpus,

• europarl: version 7 of the Europarl corpus (http://www.statmt.org/europarl/) (Koehn, 2005),

• un: the United Nations corpus (http://www.euromatrixplus.net/multi-un/),

• eu-const: a freely available corpus (Tiedemann, 2009),

• dgt-tm: the DGT Multilingual Translation Memory of the Acquis Communautaire (Steinberger et al., 2012),

• pct: a corpus of Parallel Patent Applications (http://www.wipo.int/patentscope/en/data/pdf/wipo-coppatechnicalDocumentation.pdf),

• gigaword: 5M sentences extracted from the Gigaword corpus. After cleaning, the whole Gigaword corpus was sorted at sentence level according to the sum of the perplexities of the source (English) and target (French) sides, based on two pretrained English and French language models. The 5M subset was then obtained by filtering the whole Gigaword corpus with a cut-off limit of 300 (ppl). This leads to a subset of 5M aligned sentences; a rough sketch of this filtering step is given below.
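The Gigaword filtering step can be sketched as follows; this is our own rough illustration and assumes two pretrained ARPA language models loadable with the kenlm Python bindings (the paper does not state which LM toolkit was used for this particular step).

```python
# Rough illustration (not the authors' pipeline) of perplexity-based filtering
# of Gigaword sentence pairs: keep a pair when the sum of its English-side and
# French-side language-model perplexities is below a cut-off.
import kenlm  # assumed available, with two pretrained ARPA models on disk

def filter_bitext(pairs, en_lm_path, fr_lm_path, cutoff=300.0):
    """Keep (english, french) sentence pairs whose summed perplexity is under cutoff."""
    en_lm = kenlm.Model(en_lm_path)
    fr_lm = kenlm.Model(fr_lm_path)
    kept = []
    for en, fr in pairs:
        if en_lm.perplexity(en) + fr_lm.perplexity(fr) < cutoff:
            kept.append((en, fr))
    return kept
```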

Prior to training the translation and language models, various pre-processing steps are performed on the training data. We begin by filtering out badly aligned sentences (using several heuristics), empty sentences, and sentences having more than 50 words. Punctuation is normalized, and we tokenize the training data, applying specific grammar-based rules for the French tokenization. Spelling correction is applied to both source and target side, and certain words (such as coeur) are normalized. Abbreviations and clitics are disambiguated. Various additional cleaning steps (as described in the list above) were applied to the Gigaword corpus. Many heuristics (rules) were used in order to keep only good quality bi-texts.

From this data, we train three distinct translation models on various subsets of the parallel data (ted; news-c+europarl+un+eu-const+dgt-tm+pct; gigaword5M). The French part of the same corpus is used for language model training, with the addition of the news-shuffle corpus provided as part of the WMT 2012 campaign (Callison-Burch et al., 2012). A 5-gram language model with modified Kneser-Ney smoothing is learned separately for each corpus using the SRILM toolkit (Stolcke, 2002); these models are then interpolated by optimizing perplexity on the IWSLT dev2010 corpus. The weights of the final machine translation system are optimized using the data from the English-French MT task of IWSLT 2012. The system obtains BLEU (Papineni et al., 2002) scores of 36.88 and 37.58 on the IWSLT tst2011 and tst2012 corpora, respectively (BLEU evaluated with case and punctuation).

The training data used is out-of-domain for the task of literary translation, and as such is clearly not ideal for translating literary texts. In future work, it would be desirable to at least collect literary texts in French to adapt the target language model, and if possible gain access to other works and translations of the same author. Additionally, in future work we may examine the use of real-time translation model adaptation, such as Denkowski et al. (2014).

3.3 Post-editing

We use the SECTra_w.1 post-editing interface of Huynh et al. (2008). This tool also forms the foundation that gave rise to the interactive Multilingual Access Gateway (iMAG) framework for enabling multilingual website access, with incremental improvement and quality control of the translations. It has been used for many projects (Wang and Boitet, 2013), including translation of the EOLLS encyclopedia, as well as multilingual access to dozens of websites (80 demonstrations, 4 industrial contracts, 10 target languages, 820k post-edited segments).

Figure 1 shows the post-editing interface in advanced mode. In advanced mode, multiple automatic translations of each source segment (for example, from Google, Moses, etc.) can be displayed and corrected. For this experiment, the output of our Moses system was prioritized when displaying segment translations.

Post-editing was done by a (non-English-native) student in translation studies at Université Grenoble Alpes.

4 Experimental results

4.1 Corpus and post-editing statistics

The test data, The Book of Me (see §3.1), is made up of 545 segments comprising 10,731 words. This data was divided into three equal blocks. We apply machine translation and post-editing to the data, as described in §3.2 and §3.3.

Table 1 summarizes the number of source and target (MT or PE) words in the data. Not surprisingly, a ratio greater than 1.2 is observed between French target (MT) and English source words. However, this ratio tends to decrease after post-editing of the French output. The post-editing results reported in Table 1 are obtained after each iteration of the process; the last stage of revision is thus not taken into account at this stage.

4.2 Performance of the MT system

Table 2 summarizes machine translation performance, as measured by BLEU (Papineni et al., 2002), calculated on the full corpus with the systems resulting from each iteration. Post-editing time required for each block is also shown. The BLEU scores, which are directly comparable (because evaluated on the full corpus), show no real improvement of the system. It therefore appears that adaptation of the weights alone (which resulted in improvements in (Pecina et al., 2012)) is ineffective in our case. However, post-editing time decreases slightly with each iteration (but again, the differences are small and it is unclear whether the decrease in post-editing time is due to the adaptation of the MT system or to increasing productivity as the post-editor adapts to the task). In the end, the total PE time is estimated at about 15 hours.

Figure 1: Post-editing interface in advanced mode

Iteration (no. seg)    English (no. words)    French MT (no. words)    French PE (no. words)
Iteration 1 (184)      3593                   4295                     4013
Iteration 2 (185)      3729                   4593                     4202
Iteration 3 (176)      3409                   4429                     3912
Total (545)            10731                  13317                    12127

Table 1: Number of words in each block of the English source corpus, French machine translation (MT), and French post-edited machine translation (PE).

4.3 Analyzing the revised text

Reading the translated work at this stage (after PE) is unsatisfactory. Indeed, the post-editing is done segment by segment, without the context of the full corpus. This results in a very embarrassing lack of homogeneity for a literary text. For this reason, two revisions of the translated text are also conducted: one by the original post-editor (4 hours) and one by a second French-English bilingual serving as a reviewer (6 hours). The final version of the translated work (which has been obtained after 15+4+6=25 hours of work) provides the basis for the more qualitative assessments presented in the next section.

The difference between the rough post-edited version (PE, 15 hours of work) and the revised version (REV, 25 hours of work) is analyzed in Table 3. It is interesting to see that while the revision takes 40% of the total time, the revised text remains very similar to the post-edited text. This can be observed by computing BLEU between the post-edited text before and after revising; the result is a BLEU score of 79.92, indicating very high similarity between the two versions. So, post-editing and revising are very different tasks. This is illustrated by the numbers in Table 3: MT and PE are highly dissimilar (the post-editor corrects a lot of MT errors), while PE and REV are similar (revision probably focuses more on details important for readability and style). A more qualitative analysis of the revised text and its comparison with the post-edited text is part of future work (any reader interested in doing so can download our data; see §6.1).
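For illustration, the PE-vs-REV comparison can be computed by treating one version as the hypothesis and the other as the reference; the paper does not say which BLEU implementation was used, so sacrebleu is assumed here.

```python
# Sketch of the Table 3 comparison; sacrebleu is an assumption of this example.
import sacrebleu

def version_similarity(hypothesis_segments, reference_segments):
    """Corpus BLEU of one version of the text against another (e.g. PE vs REV)."""
    return sacrebleu.corpus_bleu(hypothesis_segments, [reference_segments]).score

# A high PE-vs-REV score (around 80 BLEU in the paper) indicates that revision
# changed the post-edited text only locally.
# print(version_similarity(pe_segments, rev_segments))
```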

MT system used                       BLEU score (full corpus)    PE time (block it.)
Iteration 1 (not adapted)            34.79                       5h 37mn
Iteration 2 (tuning on Block 1)      33.13                       4h 45mn
Iteration 3 (tuning on Blocks 1+2)   34.01                       4h 35mn

Table 2: BLEU after tokenization and case removal on full corpus, and time measurements for each iteration

Comparison    BLEU score
MT vs PE      34.01
MT vs REV     30.37
PE vs REV     79.92

Table 3: Automatic evaluation (BLEU) on the full corpus between unedited machine translation (MT), post-edited machine translation (PE), and revised post-edited machine translation (REV).

5 Human evaluation of post-edited MT

5.1 The views of readers on the post-edited translation

Nine French readers agreed to read the final translated work and answer an online questionnaire. The full survey is available on fluidsurveys.com (https://fluidsurveys.com/surveys/manuela-cristina/un-livre-sur-moi-qualite-de-la-traduction/?TEST_DATA=). A PDF version of the test results and a spreadsheet file containing the results of the survey are also made available on GitHub (see §6.1).

After three questions to better understand the profile of the participant (How old are you? Do you read frequently? If yes, what is your favorite genre?), the first portion asks readers five questions about the readability and quality of the translated literary text:

• What do you think about text readability?
• Is the text easy to understand?
• Does the language sound natural?
• Do you think the sentences are correct (syntactically)?
• Did you notice obvious errors in the text?

The second portion (7 questions) verifies that certain subtleties of the text were understood:

• What is the text about?
• Who is the main character of the story?
• Who is funding the genome sequencing?
• Chronologically sort the sequence of steps involved in genome sequencing.
• How many base pairs are in the genome?
• When the story was written, how many people had already had their genome fully sequenced?
• Which genetic variant is associated with a high risk of Alzheimer's disease?

The text is considered to be overall readable (5 Very Good and 3 Good), comprehensible (8 yes, 1 no) and containing few errors (8 seldom, 1 often). The easiest comprehension questions (4 questions) were well handled by the readers, who all responded correctly. However, three questions led to different answers from the readers:

• 2 readers responded incorrectly to a seemingly simple question (Who funded the genome sequencing of Powers?);

• the question At the time the story was written, how many people's genomes had been sequenced? was ambiguous, since the answer could be 8 or 9 (depending on whether Powers is counted), giving rise to different responses from readers;

• only 4 of 9 readers were able to give the correct sequence of steps in the process of genome sequencing; the translated text is not unclear on this point (the errors are on the part of the readers), and this mixed result may indicate a lack of interest by some readers in the most technical aspects of the text.

In short, we can say that this survey, while very limited, nevertheless demonstrates that the text (produced according to our methodology) was considered acceptable and rather readable by our readers (of whom 3 indicated that they read very often, 4 rather often, and 2 seldom). We also include some remarks made in the free comments:

119 “I have noticed some mistakes, some neolo- “A third defect is due to not taking into account • • gisms (I considered them to be neologisms and certain cultural references .../... For example, not mistranslations because they made sense)” Powers made several references to the topog- raphy of Boston that give rise to inaccuracies “Very fluid text and very easy reading despite in the translation: ‘Charles River’ for exam- • precise scientific terms” ple (p. 12) is not ‘une riviere’ but ‘un fleuve’; that is why we translate by ‘la Charles River’ “I found the text a little difficult because it con- • or simply ‘la Charles’” tains complex words and it deals with an area I do not know at all.” The errors mentioned above are considered as not acceptable by a professional translator of literary 5.2 The views of R. Powers’s French translator text. These are hard problems for computer assisted translation (move away from the syntactic calque, To conclude this pilot study, the views of a tenth better handle idioms and multi-word expressions, reader were solicited: the author’s French transla- take into account cultural references). tor, Jean-Yves Pellegrin, research professor at Paris- Sorbonne University. His comments are summa- Could this text serve as a starting point for a “Instinctively, I rized here in the form of questions and answers. professional literary translator? am tempted to say no for now, because from his first Readability? “The text you have successfully re- cast the translator has reflexes that allow him to pro- produces faithfully the content of the article by Pow- duce a cleaner text than the one you produced .../.... ers. The readability bet is won and certain parts (in however, this translator would spend more than 25 particular those which relate to the scientific aspects hours to produce the 42 pages of 1500 characters of the described experiment) are very convincing.” that comprise Power’s text. At a rate of 7 pages So the MT+PE pipeline seems also efficient for per day on average, it would take 6 eight-hour days. obtaining quickly readable literary texts, as it is the If, however, I could work only from your text (while case for other domain specific data types. completely forgetting Powers’s) and I could be guar- anteed that your translation contains no errors or Imperfections? “There are, of course, imperfec- omissions from the original, but just that it needs to tions, clumsy expressions, and specific errors which be improved, made more fluid, more authentically require correction” French, things would be different and the time saved Top mistakes? would be probably huge.” As expected, the professional translator of liter- “The most frequent defect, which affects the • ature wants to control the whole translation pro- work of any novice translator, is the syntactic cess. But the last part of his comment is interest- calque, where French structures the phrase dif- ing: if the meaning were guaranteed, he could con- ferently .../... One understands, but it does not centrate on the form and limit going back and forth sound very French” between source and target text. Thus, working more “Another fairly common error is the loss of id- on quality assesment of MT and confidence estima- • iomatic French in favor of Anglicisms.” Some- tion seems to be a promising way for future work times these Anglicisms can be more disturbing on literary text translation. 
Based on the translation when flirting with Franglais,11 such as translat- speed rates provided by Pellegrin, we can estimate ing ‘actionable knowledge’ as ‘connaissances the time savings of our technique. Our computer- actionnables’ (p. 18) instead of ‘connaissances assisted methodology can be said to have accelerated pratiques / utilisables’.” the translation factor by a factor of 2 — our process took roughly 25 hours, compared to the 50 hours es- 11Frenglish timated for a professional literary translation.

6 Conclusion

6.1 Collected Data Available Online

The data in this article are available at https://github.com/powersmachinetranslation/DATA. There one can find:

• The 545 English source and French target (MT, PE) segments mentioned in Table 1
• The translated and revised work (REV in Table 3), in French, that was read by a panel of 9 readers
• The results of the survey (9 readers) compiled in a spreadsheet (in French)

6.2 Comments and open questions

We presented an initial experiment of machine translation of a literary work (an English text of about twenty pages). The results of an MT+PE pipeline were presented and, going beyond that, the opinions of a panel of readers and a translator were solicited. The translated text, obtained after 25 hours of human labor (a professional translator told us that he would have needed at least twice that much time), is acceptable to readers, but the opinion of a professional translator is mixed. This approach suggests a methodology for rapid "low cost" translation, similar to the translation of TV series subtitles found on the web. For the author of the literary text, this presents the possibility of having his work translated into more languages (several dozen instead of a handful; this short story by Richard Powers has also been translated into Romanian using this same methodology).

But would the author be willing to sacrifice the quality of translation (and control over it) to enable wider dissemination of his works? For a reader who cannot read an author in the source language, this provides the ability to have faster access to an (admittedly imperfect) translation of their favorite author. For a non-native reader of the source language, this provides a mechanism for assistance on the parts he or she has trouble understanding. One last thing: the title of the work, The Book of Me, has remained unchanged in the French version because no satisfactory translation was found to illustrate that the author refers both to a book and to his DNA; this paradox is a good illustration of the difficulty of translating a literary work!

Thanks

Thanks to Manuela Barcan, who handled the first phase of post-editing machine translations into French and Romanian during the summer of 2013. Thanks to Jean-Yves Pellegrin, French translator of Richard Powers, for his help and open-mindedness. Thanks to Richard Powers, who allowed us to conduct this experiment using one of his works.

References

Chris Callison-Burch, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut, and Lucia Specia. 2012. Findings of the 2012 Workshop on Statistical Machine Translation. In Proceedings of the Seventh Workshop on Statistical Machine Translation, pages 10–51, Montréal, Canada, June. Association for Computational Linguistics.

Michael Denkowski, Alon Lavie, Isabel Lacruz, and Chris Dyer. 2014. Real time adaptive machine translation for post-editing with cdec and TransCenter. In Proceedings of the EACL 2014 Workshop on Humans and Computer-assisted Translation, pages 72–77, Gothenburg, Sweden, April. Association for Computational Linguistics.

Marcello Federico, Mauro Cettolo, Luisa Bentivogli, Michael Paul, and Sebastian Stüker. 2012. Overview of the IWSLT 2012 evaluation campaign. In Proceedings of the 9th International Workshop on Spoken Language Translation (IWSLT), December.

Ignacio Garcia. 2011. Translating by post-editing: is it the way forward? Journal of Machine Translation, 25(3):217–237.

Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Richard Zens, Rwth Aachen, Alexandra Constantin, Marcello Federico, Nicola Bertoldi, Chris Dyer, Brooke Cowan, Wade Shen, Christine Moran, and Ondřej Bojar. 2007. Moses: Open source toolkit for statistical machine translation. In ACL'07, Annual Meeting of the Association for Computational Linguistics, pages 177–180, Prague, Czech Republic.

Cong-Phap Huynh, Christian Boitet, and Hervé Blanchon. 2008. SECTra w.1: an online collaborative system for evaluating, post-editing and presenting MT translation corpora. In LREC'08, Sixth International Conference on Language Resources and Evaluation, pages 28–30, Marrakech, Morocco.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the Tenth Machine Translation Summit, Phuket, Thailand, September.

Anthony Oettinger. 1954. A Study for the Design of an Automatic Dictionary. Ph.D. thesis, Harvard University.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In ACL'02, 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, PA, USA.

Pavel Pecina, Antonio Toral, and Josef van Genabith. 2012. Simple and effective parameter tuning for domain adaptation of statistical machine translation. In Proceedings of the 24th International Conference on Computational Linguistics, pages 2209–2224, Mumbai, India.

Marion Potet, Emmanuelle Esperança-Rodier, Laurent Besacier, and Hervé Blanchon. 2012. Collection of a large database of French-English SMT output corrections. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey, May.

Lucia Specia, Nicola Cancedda, and Marc Dymetman. 2010. A dataset for assessing machine translation evaluation metrics. In 7th Conference on International Language Resources and Evaluation (LREC 2010), pages 3375–3378, Valletta, Malta.

Ralf Steinberger, Andreas Eisele, Szymon Klocek, Spyridon Pilos, and Patrick Schlüter. 2012. DGT-TM: A freely available translation memory in 22 languages. In LREC 2012, Istanbul, Turkey.

Andreas Stolcke. 2002. SRILM: An extensible language modeling toolkit. In ICSLP'02, 7th International Conference on Spoken Language Processing, pages 901–904, Denver, USA.

Jörg Tiedemann. 2009. News from OPUS - a collection of multilingual parallel corpora with tools and interfaces. In Proceedings of Recent Advances in Natural Language Processing.

Rob Voigt and Dan Jurafsky. 2012. Towards a literary machine translation: the role of referential cohesion. In Computational Linguistics for Literature, Workshop at NAACL-HLT 2012, Montreal, Canada.

Lingxiao Wang and Christian Boitet. 2013. Online production of HQ parallel corpora and permanent task-based evaluation of multiple MT systems. In Proceedings of the MT Summit XIV Workshop on Post-editing Technology and Practice.

Ventsislav Zhechev. 2012. Machine translation infrastructure and post-editing performance at Autodesk. In AMTA'12, Conference of the Association for Machine Translation in the Americas, San Diego, USA.

Translating Literary Text between Related Languages using SMT

Antonio Toral
ADAPT Centre, School of Computing
Dublin City University, Dublin, Ireland
[email protected]

Andy Way
ADAPT Centre, School of Computing
Dublin City University, Dublin, Ireland
[email protected]

Abstract

We explore the feasibility of applying machine translation (MT) to the translation of literary texts. To that end, we measure the translatability of literary texts by analysing parallel corpora and measuring the degree of freedom of the translations and the narrowness of the domain. We then explore the use of domain adaptation to translate a novel between two related languages, Spanish and Catalan. This is the first time that specific MT systems are built to translate novels. Our best system outperforms a strong baseline by 4.61 absolute points (9.38% relative) in terms of BLEU and is corroborated by other automatic evaluation metrics. We provide evidence that MT can be useful to assist with the translation of novels between closely-related languages, namely (i) the translations produced by our best system are equal to the ones produced by a professional human translator in almost 20% of cases, with an additional 10% requiring at most 5 character edits, and (ii) a complementary human evaluation shows that over 60% of the translations are perceived to be of the same (or even higher) quality by native speakers.

1 Introduction

The field of Machine Translation (MT) has evolved very rapidly since the emergence of statistical approaches almost three decades ago (Brown et al., 1988; Brown et al., 1990). MT is nowadays a growing reality throughout the industry, which continues to adopt this technology as it results in demonstrable improvements in translation productivity, at least for technical domains (Zhechev, 2012). Meanwhile, the performance of MT systems in research continues to improve. In this regard, a recent study looked at the best-performing systems of the WMT shared task for seven language pairs during the period between 2007 and 2012, and estimated the improvement in translation quality during this period to be around 10% absolute, in terms of both adequacy and fluency (Graham et al., 2014).

Having reached this level of research maturity and industrial adoption, in this paper we explore the feasibility of applying current state-of-the-art MT technology to literary texts, what might be considered to be the last bastion of human translation. The perceived wisdom is that MT is of no use for the translation of literature. We challenge that view, despite the fact that, to the best of our knowledge, the applicability of MT to literature has to date been only partially studied from an empirical point of view.

In this paper we aim to measure the translatability of literary text. Our empirical methodology relies on the fact that the applicability of MT to a given type of text can be assessed by analysing parallel corpora of that particular type and measuring (i) the degree of freedom of the translations (how literal the translations are), and (ii) the narrowness of the domain (how specific or general that text is). Hence, we tackle the problem of measuring the translatability of literary text by comparing the degree of freedom of translation and domain narrowness for such texts to documents in two other domains which have been widely studied in the area of MT: technical documentation and news.


Proceedings of NAACL-HLT Fourth Workshop on Computational Linguistics for Literature, pages 123–132, Denver, Colorado, June 4, 2015. © 2015 Association for Computational Linguistics

Furthermore, we assess the usefulness of MT in translating a novel between two closely-related languages. We build an MT system using state-of-the-art domain-adaptation techniques and evaluate its performance against the professional human translation, using both automatic metrics and manual evaluation. To the best of our knowledge, this is the first time that a specific MT system is built to translate novels.

The rest of the paper is organised as follows. Section 2 gives an overview of the current state of the art in applying MT to literary texts. In Section 3 we measure the translatability of literary texts. In Section 4 we explore the use of MT to translate a novel between two related languages. Finally, in Section 5 we present our conclusions and outline avenues of future work.

2 Background

To date, there have been only a few works on applying MT to literature, for which we provide an overview here.

Genzel et al. (2010) explored constraining statistical MT (SMT) systems for poetry to produce translations that obey particular length, meter and rhyming rules. Form is preserved at the price of producing a worse translation in terms of the BLEU automatic metric, which decreases from 0.3533 to 0.1728, a drop of around 50% in real terms. Their system was trained and evaluated with WMT-09 data [1] for French–English.

Greene et al. (2010) also translated poetry, choosing target realisations that conform to the desired rhythmic patterns. Specifically, they translated Dante's Divine Comedy from Italian sonnets into English iambic pentameter. Instead of constraining the SMT system, they passed its output lattice through an FST that maps words to sequences of stressed and unstressed syllables. These sequences are finally filtered with an iambic pentameter acceptor. Their output translations are evaluated qualitatively only.

Voigt and Jurafsky (2012) examined how referential cohesion is expressed in literary and non-literary texts, and how this cohesion affects translation. They found that literary texts have more dense reference chains and conclude that incorporating discourse features beyond the level of the sentence is an important direction for applying MT to literary texts.

Jones and Irvine (2013) used existing MT systems to translate samples of French literature (prose and poetry) into English. They then used qualitative analysis grounded in translation theory on the MT output to assess the potential of MT in literary translation and to address what makes literary translation particularly difficult; e.g. one objective in literary translation, in contrast to other domains, is to preserve the experience of reading a text when moving to the target language.

Very recently, Besacier (2014) presented a pilot study where MT followed by post-editing is applied to translate a short story from English into French. In Besacier's work, post-editing is performed by non-professional translators, and the author concludes that such a workflow can be a useful low-cost alternative for translating literary works, albeit at the expense of sacrificing translation quality. According to the opinion of a professional translator, the main errors had to do with using English syntactic structures and expressions instead of their French equivalents and not taking into account certain cultural references.

Finally, there are some works that use MT techniques on literary text, but for generation rather than translation. He et al. (2012) used SMT to generate poems in Chinese given a set of keywords. Jiang and Zhou (2008) used SMT to generate the second line of Chinese couplets given the first line. In a similar fashion, Wu et al. (2013) used transduction grammars to generate rhyming responses in hip-hop given the original challenges.

This paper contributes to the current state of the art in two dimensions. On the one hand, we conduct a comparative analysis on the translatability of literary text according to narrowness of the domain and freedom of translation. This can be seen as a more general and complementary analysis to the one conducted by Voigt and Jurafsky (2012). On the other hand, and related to Besacier (2014), we evaluate MT output for literary text. There are two differences though: first, they translated a short story, while we do so for a longer type of literary text, namely a novel; second, their MT systems were evaluated against a post-edited reference produced by non-professional translators, while we evaluate our systems against the translation produced by a professional translator.

[1] http://www.statmt.org/wmt09/translation-task.html

3 Translatability

The applicability of SMT to translate a certain text type for a given pair of languages can be studied by analysing two properties of the relevant parallel data.

• Degree of freedom of the translation. While literal translations can be learnt reasonably well by the word alignment component of SMT, free translations may result in problematic alignments.

• Narrowness of the domain. Constrained domains lead to good SMT results. This is due to the fact that in narrow domains lexical selection is much less of an issue and relevant terms occur frequently, which allows the SMT model to learn their translations with good accuracy.

We could say, then, that the narrower the domain and the smaller the degree of freedom of the translation, the more applicable SMT is. This is, we assert, why SMT performs well on technical documentation while results are substantially worse for more open and unpredictable domains such as news (cf. the WMT translation task series) [2].

We propose to study the applicability of SMT to literary text by comparing the degree of freedom and narrowness of parallel corpora for literature to other domains widely studied in the area of MT (technical documentation and news). Such a corpus study can be carried out by using a set of automatic measures. The perplexity of the word alignment can be used as a proxy to measure the degree of freedom of the translation. The narrowness of the domain can be assessed by measuring perplexity with respect to a language model (LM) (Ruiz and Federico, 2014).

Therefore, in order to assess the translatability of literary text with MT, we contextualise the problem by comparing it to the translatability of other widely studied types of text. Instead of considering the translatability of literature as a whole, we root the study along two axes:

• Relatedness of the language pair: from pairs of languages that belong to the same family (e.g. Romance languages), through languages that belong to the same group (e.g. Romance and Germanic languages of the Indo-European group), to unrelated languages (e.g. Romance and Finno-Ugric languages).

• Literary genre: novels, poetry, etc.

We hypothesise that the degree of applicability of SMT to literature depends on these two axes. Between related languages, translations should be more literal and complex phenomena (e.g. metaphors) might simply transfer to the target language, while they are more likely to require complex translations between unrelated languages. Regarding literary genres, in poetry the preservation of form might be considered relevant, while in novels it may be a lesser constraint.

The following sections detail the experimental datasets and the experiments conducted regarding narrowness of the domain and degree of translation freedom.

3.1 Experimental Setup

In order to carry out our experiment on the translatability of literary texts, we use monolingual datasets for Spanish and parallel datasets for two language pairs with varying levels of relatedness: Spanish–Catalan and Spanish–English.

Regarding the different types of corpora, we consider datasets that fall in the following four groups: novels, news, technical documentation and Europarl (EP).

We use two sources for novels: two novels from Carlos Ruiz Zafón, The Shadow of the Wind (published originally in Spanish in 2001) and The Angel's Game (2008), for Spanish–Catalan and Spanish–English, referred to as novel1; and two novels from Gabriel García Márquez, One Hundred Years of Solitude (1967) and Love in the Time of Cholera (1985), for Spanish–English, referred to as novel2.

We use two sources of news data: a corpus made of articles from the newspaper El Periódico [3] (referred to as news1) for Spanish–Catalan, and news-commentary v8 [4] (referred to as news2) for Spanish–English.

For technical documentation we use four datasets: DOGC [5], a corpus from the official journal of the Catalan Government, for Spanish–Catalan; EMEA [6], a corpus from the European Medicines Agency, for Spanish–English; JRC-Acquis (henceforth referred to as JRC) (Steinberger et al., 2006), made of legislative text of the European Union, for Spanish–English; and KDE4 [7], a corpus of localisation files of the KDE desktop environment, for the two language pairs.

Finally, we consider the Europarl corpus v7 (Koehn, 2005), given that it is widely used in the MT community, for Spanish–English.

All the datasets are pre-processed as follows. First they are tokenised and truecased with Moses' (Koehn et al., 2007) scripts. Truecasing is carried out with a model trained on the caWaC corpus for Catalan (Ljubešić and Toral, 2014) and News Crawl 2012 [8] for both English and Spanish.

Parallel datasets not available in a sentence-split format (novel1 and novel2) are sentence-split using Freeling (Padró and Stanilovsky, 2012). All parallel datasets are then sentence-aligned. We use Hunalign (Varga et al., 2005) and keep only one-to-one alignments. The dictionaries used for Spanish–Catalan and Spanish–English are extracted from Apertium bilingual dictionaries for those language pairs [9,10]. Only sentence pairs for which the confidence score of the alignment is >= 0.4 are kept [11]. Although most of the parallel datasets are provided in sentence-aligned form, we realign them to ensure that the data used to calculate word alignment perplexity are properly aligned at sentence level. This is to avoid having high word alignment perplexities due, not to high degrees of translation freedom, but to the presence of misaligned parallel data.

3.2 Narrowness of the Domain

As previously mentioned, we use LM perplexity as a proxy to measure the narrowness of the domain. We take two random samples without replacement from the Spanish side of each dataset, to be used for training (200,000 tokens) and testing (20,000 tokens). We train an LM of order 3 with improved Kneser-Ney smoothing (Chen and Goodman, 1996) using IRSTLM (Federico et al., 2008).

For each LM we report the perplexity on the test set built from the same dataset in Figure 1. The two novels considered (perplexities in the range [230.61, 254.49]) fall somewhere between news ([359.73, 560.62]) and the technical domain ([127.30, 228.38]). Our intuition is that novels cover a narrow domain, like technical texts, but the vocabulary and language used in novels is richer, thus leading to higher perplexity than technical texts. News, on the contrary, covers a large variety of topics. Hence, despite novels possibly using more complex linguistic constructions, news articles are less predictable.

Figure 1: LM perplexity results

3.3 Degree of Translation Freedom

We use word alignment perplexity, as in Equation 1, as a proxy to measure the degree of translation freedom. Word alignment perplexity gives an indication of how well the model fits the data.

    log2 PP = − Σ_s log2 p(e_s | f_s)    (1)

The assumption is that the freer the translations are for a given parallel corpus, the higher the perplexity of the word alignment model learnt from such a dataset, as the word alignment algorithms would have more difficulty finding suitable alignments.

[2] http://www.statmt.org/wmt14/translation-task.html
[3] http://www.elperiodico.com/
[4] http://www.statmt.org/wmt13/training-parallel-nc-v8.tgz
[5] http://opus.lingfil.uu.se/DOGC.php
[6] http://opus.lingfil.uu.se/EMEA.php
[7] http://opus.lingfil.uu.se/KDE4.php
[8] http://www.statmt.org/wmt13/translation-task.html
[9] http://sourceforge.net/projects/apertium/files/apertium-es-ca/1.2.1/
[10] http://sourceforge.net/projects/apertium/files/apertium-en-es/0.8.0/
[11] Manual evaluation for English, French and Greek concluded that 0.4 was an adequate threshold for Hunalign's confidence score (Pecina et al., 2012).
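To make Equation 1 concrete, the toy sketch below accumulates hypothetical per-sentence-pair probabilities p(e_s | f_s), of the kind an IBM-model word aligner such as GIZA++ reports after training, and turns them into a perplexity value; the probability values and the per-sentence-pair normalisation are assumptions made purely for illustration, not the toolkit's exact bookkeeping.

    import math

    # Hypothetical p(e_s | f_s) values for three sentence pairs.
    sentence_pair_probs = [1e-12, 3e-9, 5e-7]

    # Equation 1: log2 PP = - sum_s log2 p(e_s | f_s)
    log2_pp = -sum(math.log2(p) for p in sentence_pair_probs)

    # Turn the summed log score into a perplexity, normalised per sentence pair.
    perplexity = 2 ** (log2_pp / len(sentence_pair_probs))

    print("log2 PP = %.2f, perplexity = %.2f" % (log2_pp, perplexity))

Under this reading, a corpus whose translations the alignment model finds hard to explain (low probabilities) yields a higher perplexity, which is exactly the signal used here as a proxy for translation freedom.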

For each parallel dataset, we randomly select a set of sentence pairs whose overall size accounts for 500,000 tokens. We then run word alignment with GIZA++ (Och and Ney, 2003) in both directions, with the default parameters used in Moses.

For each dataset and language pair, we report in Figure 2 the perplexity of the word alignment after the last iteration for each direction. The most important discriminating variable appears to be the level of relatedness of the languages involved, i.e. all the perplexities for Spanish–Catalan are below 10 while all the perplexities for Spanish–English are well above this number.

Figure 2: Word alignment perplexity results

4 MT for Literature between Related Languages

Encouraged by the results obtained for the translatability of novels (cf. Figures 1 and 2), we decided to carry out an experiment to assess the feasibility of using MT to assist with the translation of novels between closely-related languages. In this experiment we translate a novel, The Prisoner of Heaven (2011) by Carlos Ruiz Zafón, from Spanish into Catalan. This language pair is chosen because of the maturity of applied MT technology, e.g. MT is used alongside post-editing to translate the newspaper La Vanguardia (around 70,000 tokens) from Spanish into Catalan on a daily basis (Martín and Serra, 2014). We expect the results to be similar for other languages with similar degrees of similarity to Spanish, e.g. Portuguese and Italian.

Table 1: Datasets used for MT

    Type   Dataset   # sentences    Avg. length
    TM     News      629,375        22.45 / 21.49
    TM     Novel     21,626         16.95 / 15.11
    LM     News1     631,257        22.66
    LM     caWaC     16,516,799     29.48
    LM     Novel     22,170         17.14
    Dev    News      1,000          22.31 / 21.36
    Dev    Novel     1,000          16.92 / 15.35
    Test   Novel     1,000          17.91 / 15.93

The translation model (TM) of our baseline system is trained on the news1 dataset, while the LM is trained on the concatenation of news1 and caWaC. The baseline system is tuned on news. On top of this baseline we then build our domain-adapted systems. The domain adaptation is carried out by using two previous novels from the same author that were translated by the same translator (cf. the dataset novel1 in Section 3.1). We explore their use for tuning (+inDev), the LM (concatenated +inLM and interpolated +IinLM) and the TM (concatenated +inTM and interpolated +IinTM). The test set is made of a set of randomly selected sentence pairs from The Prisoner of Heaven. Table 1 provides an overview of the datasets used for MT.

We train phrase-based SMT systems with Moses v2.1 using default parameters. Tuning is carried out with MERT (Och, 2003). LMs are linearly interpolated with SRILM (Stolcke et al., 2011) by means of perplexity minimisation on the development set from the novel1 dataset. Similarly, TMs are linearly interpolated, also by means of perplexity minimisation (Sennrich, 2012).
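The perplexity-minimising interpolation just described (following Sennrich, 2012) can be pictured as a one-dimensional search for the mixture weight that makes the in-domain development set most probable. The sketch below uses invented per-token probabilities for two component LMs and a simple grid search; the real toolkits estimate the weights more efficiently, so this is only an illustration of the objective.

    import math

    # Hypothetical per-token probabilities assigned to the in-domain dev set
    # by two component LMs (e.g. a large news LM and a small novel LM).
    p_news  = [0.0200, 0.0004, 0.0100, 0.0008, 0.0300]
    p_novel = [0.0050, 0.0060, 0.0020, 0.0090, 0.0100]

    def dev_perplexity(weight):
        # Perplexity of the mixture  weight * p_news + (1 - weight) * p_novel.
        log_sum = sum(math.log2(weight * a + (1 - weight) * b)
                      for a, b in zip(p_news, p_novel))
        return 2 ** (-log_sum / len(p_news))

    best = min((w / 100 for w in range(1, 100)), key=dev_perplexity)
    print("best news weight = %.2f, dev perplexity = %.2f" % (best, dev_perplexity(best)))

The same idea carries over to translation model interpolation, where the mixture is taken over phrase translation probabilities rather than LM probabilities.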

4.1 Automatic Evaluation

Our systems are evaluated with a set of state-of-the-art automatic metrics: BLEU (Papineni et al., 2002), TER (Snover et al., 2006) and METEOR 1.5 (Denkowski and Lavie, 2014).

Table 2: Automatic evaluation scores for the MT systems built

    System                  BLEU    diff     TER     diff     METEOR  diff
    baseline                0.4915           0.3658           0.3612
    +inDev                  0.4939  0.49%    0.3641  -0.47%   0.3628  0.46%
    +inDev+inLM             0.4948  0.67%    0.3643  -0.41%   0.3633  0.59%
    +inDev+IinLM            0.5045  2.64%    0.3615  -1.18%   0.3669  1.59%
    +inDev+inTM             0.5238  6.57%    0.3481  -4.85%   0.3779  4.61%
    +inDev+IinTM            0.5258  6.98%    0.3510  -4.04%   0.3795  5.06%
    +inDev+inLM+inTM        0.5297  7.77%    0.3433  -6.17%   0.3811  5.51%
    +inDev+IinLM+IinTM      0.5376  9.38%    0.3405  -6.92%   0.3847  6.50%
    inDev+inTM+inLM         0.4823  -1.87%   0.3777  3.24%    0.3594  -0.49%

Table 2 shows the results obtained by each of the systems built. For each domain-adapted system we show its relative improvement over the baseline (columns diff). The use of in-domain data to adapt each of the components of the pipeline, tuning (+inDev), LM (+inLM and +IinLM) and TM (+inTM and +IinTM), results in gains across all the metrics. Additional gains are achieved when combining the different in-domain components. Interpolation, both for the LM and the TM, results in gains when compared to the systems that use the same data in a concatenated manner (e.g. +IinLM vs +inLM), except for the TM in terms of TER. The best system, with in-domain data used for all the components and interpolated TM and LM (+inDev+IinLM+IinTM), yields a relative improvement over the baseline of 9.38% for BLEU, 6.92% for TER and 6.5% for METEOR. Finally, we show the scores obtained by a system that uses solely in-domain data (inTM+inLM+inDev). While its results are slightly below those of the baseline, it should be noted that both the TM and LM of this system are trained with very limited amounts of data: 21,626 sentence pairs and 22,170 sentences, respectively (cf. Table 1).

We decided to compare our system also to widely-used on-line third-party systems, as these are the ones that a translator could easily have access to. We consider the following three systems: Google Translate [12], Apertium (Forcada et al., 2011) [13] and Lucy [14]. These three systems follow different approaches; while the first is statistical, the second and the third are rule-based, classified respectively as shallow and deep formalisms.

Table 3: Automatic evaluation scores for third-party MT systems

    System     BLEU    diff     TER     diff      METEOR  diff
    Google     0.4652  15.56%   0.4021  -15.31%   0.3498  9.98%
    Apertium   0.4543  18.34%   0.3925  -13.25%   0.3447  11.60%
    Lucy       0.4821  11.51%   0.3758  -9.40%    0.3550  8.35%

Table 3 shows the results of the third-party systems and compares their scores with our best domain-adapted system in terms of relative improvement (columns diff). The results of the third-party systems are similar, albeit slightly lower, compared to our baseline (cf. Table 2).

We conducted statistical significance tests for BLEU between our best domain-adapted system, the baseline and the three third-party systems, using paired bootstrap resampling (Koehn, 2004) with 1,000 iterations and p = 0.01. In all cases the improvement brought by our best system is found to be significant.

Finally, we report on the percentage of translations that are equal in the MT output and in the reference. These account for 15.3% of the sentences for the baseline and 19.7% for the best domain-adapted system. It should be noted, though, that these tend to be short sentences, so if we consider their percentage in terms of words, they account for 4.97% and 7.15% of the data, respectively. If we also consider the translations that can reach the reference in at most five character editing steps (Volk, 2009), then the percentage of equal and near-equal translations produced by our best domain-adapted system reaches 29.5% of the sentences.

[12] https://translate.google.com
[13] http://apertium.org/
[14] http://www.lucysoftware.com/english/machine-translation/
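The significance test mentioned above, paired bootstrap resampling (Koehn, 2004), repeatedly resamples the test set with replacement and counts how often one system outscores the other on the resampled sets. A minimal sketch is given below; the score argument stands for any corpus-level metric (such as the BLEU call sketched earlier), and the names are placeholders rather than the authors' actual code.

    import random

    def paired_bootstrap(hyps_a, hyps_b, refs, score, n_iter=1000, seed=1):
        # Fraction of resampled test sets on which system A beats system B.
        rng = random.Random(seed)
        ids = list(range(len(refs)))
        wins_a = 0
        for _ in range(n_iter):
            sample = [rng.choice(ids) for _ in ids]    # resample with replacement
            a = score([hyps_a[i] for i in sample], [refs[i] for i in sample])
            b = score([hyps_b[i] for i in sample], [refs[i] for i in sample])
            if a > b:
                wins_a += 1
        return wins_a / n_iter

    # With n_iter = 1000, a win rate of at least 0.99 corresponds to significance
    # at p = 0.01, the threshold used in the paper.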

4.2 Manual Evaluation

To gain further insight on the results, we conducted a manual evaluation. A common procedure (e.g. as conducted in the MT shared task at WMT) consists of ranking MT translations. Given the source and target sides of the reference (human) translations, and two or more outputs from MT systems, these outputs are ranked according to their quality, i.e. how close they are to the reference, e.g. in terms of adequacy and/or fluency.

In our experiment, we are of course not interested in comparing two MT systems, but rather one MT system (the best one according to the automatic metrics) and the professional human translation. We therefore conducted the rank-based manual evaluation in a slightly modified setting: we do not provide the target of the reference translation as reference, but as one of the MT systems to be ranked. The evaluator thus is given the source side of the reference and two translations, one being the human translation and the other the translation produced by an MT system. The evaluator of course does not know which is which.

Given two translations A and B, the evaluator can rank them as A > B (if A is better than B), A < B (if A is worse than B) and A = B (if both are of the same quality). Here we reproduce the exact guidelines provided to the evaluators:
- Rank A higher than B (A>B) if the output of system A is better than the output of system B
- Rank A lower than B (A<B) if the output of system A is worse than the output of system B
- Rank both systems equally (A=B) if the outputs of both systems are of an equivalent level of quality

Two bilingual speakers in Spanish and Catalan carried out the evaluation; each of them judged the same 101 sentences of the test set, which yields 202 judgements in total. The inter-annotator agreement, in terms of Fleiss' Kappa (Fleiss, 1971), is 0.49, which falls in the band of moderate agreement [0.41, 0.60] (Landis and Koch, 1977).

Considering the aggregated 202 judgements, we have the breakdown shown in Table 4. In most cases (41.58% of the judgements), both the human translation (HT) and the MT are considered to be of equal quality. The HT is considered better than the MT in 39.11% of the cases. Perhaps surprisingly, the evaluators ranked MT higher than HT in almost 20% of their judgements.

Table 4: Manual Evaluation. Breakdown of ranks (overall)

    Judgement   Times   Percentage
    HT=MT       84      41.58%
    HT>MT       79      39.11%
    HT<MT       39      19.31%

We now delve deeper into the results and show in Table 5 the breakdown of judgements by evaluator. For around two thirds of the sentences, both evaluators agreed in their judgement: in 28.71% of the sentences both for HT=MT and for HT>MT, and in 9.9% of the sentences for HT<MT. Most of the disagreements are between HT=MT and HT>MT (13.86%) and between HT=MT and HT<MT.

Table 5: Manual Evaluation. Breakdown of ranks (per evaluator)

    Judgements          Times   Percentage
    HT=MT, HT=MT        29      28.71%
    HT>MT, HT>MT        29      28.71%
    HT<MT, HT<MT        10      9.90%
    Total agreement     68      67.33%
    HT=MT, HT>MT        14      13.86%

In what follows, we look more closely at the sentences in each of the three ranks. First, we report on their average sentence length in tokens, as shown in Table 6.
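The agreement figure above is Fleiss' kappa computed over the two annotators and the three possible judgements (HT<MT, HT=MT, HT>MT). The sketch below spells the computation out on an invented toy matrix, one row per sentence with per-category annotator counts; it is meant only to show the formula, not to reproduce the reported 0.49.

    def fleiss_kappa(counts):
        # counts[i][j]: number of annotators that put item i into category j.
        n_items = len(counts)
        n_raters = sum(counts[0])
        # Mean observed agreement over items.
        p_bar = sum((sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
                    for row in counts) / n_items
        # Expected agreement from the overall category distribution.
        totals = [sum(row[j] for row in counts) for j in range(len(counts[0]))]
        p_exp = sum((t / (n_items * n_raters)) ** 2 for t in totals)
        return (p_bar - p_exp) / (1 - p_exp)

    # Toy data: 4 sentences, 2 annotators, categories (HT<MT, HT=MT, HT>MT).
    toy = [[0, 2, 0], [0, 1, 1], [2, 0, 0], [0, 0, 2]]
    print("kappa = %.2f" % fleiss_kappa(toy))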

Table 7: Manual Evaluation. Examples of translations ranked as HT<MT

    Source      La habitación tenía un pequeño balcón que daba a la plaza.
    Gloss       The room had a small balcony facing the square.
    HT          La cofurna tenia un balconet que donava a la plaça.
    MT          L'habitació tenia un petit balcó que donava a la plaça.
    Discussion  Habitació (room) is the translation of habitación. Cofurna (hovel) has a slightly different meaning.

    Source      — ¿Adónde vas?
    Gloss       — Where are you going?
    HT          — ¿On vas? — hi vaig afegir.
    MT          — ¿On vas?
    Discussion  The snippet "hi vaig afegir" (I added) is not in the original.

Table 6: Manual Evaluation. Average sentence length per rank

    Mode    Sentences   Tokens   Tokens/sent.
    HT>MT   29          657      22.66
    whole   101         1,671    16.71

We can conclude that MT results in translations of good quality for shorter sentences than the average, while HT remains the best translation for longer sentences.

We now look at each of these sets of sentences and carry out a qualitative analysis, aiming at finding out what types of sentences and translation errors are predominant.

For most of the HT=MT cases (22), both translations are exactly the same. In the remaining 7 cases, up to a few words are different, with both translations being accurate.

In most of the 10 sentences ranked as HT<MT, the human translation departs slightly from the source text; Table 7 shows two examples. In the sentences ranked as HT>MT, the translation produced by the MT system has some errors, in most cases affecting just one or a few words. The most common errors are related to:

• OOVs, mainly for verbs that contain a pronoun as a suffix in Spanish, e.g. "escrutándola" (scrutinising her).
• Pronouns translated wrongly, e.g. "lo" (him) wrongly translated as "ho" (that) instead of "el".
• Wording that does not sound natural, or whose meaning is closely related to the original but is not exactly the same.

5 Conclusions and Future Work

This paper has explored the feasibility of applying MT to the translation of literary texts. To that end, we measured the translatability of literary texts and compared it to that of other datasets commonly used in MT by measuring the degree of freedom of the translations (using word alignment perplexity) and the narrowness of the domain (via LM perplexity). Our results show that novels are less predictable than texts in the technical domain but more predictable than news articles. Regarding translation freedom, the main variable is not related to the type of data but to the relatedness of the language pair involved.

We then explored the use of domain adaptation to translate a novel between two closely-related languages, Spanish and Catalan. Our best domain-adapted system outperforms a strong baseline by 4.61 absolute points (9.38% relative) in terms of BLEU. We provided evidence that MT can be useful to assist with the translation of novels between closely-related languages, namely (i) the translations produced by our best system are equal to the ones produced by a professional human translator in almost 20% of cases, with an additional 10% requiring at most 5 character edits, and (ii) over 60% of the translations are perceived to be of the same (or even higher) quality by native speakers.

As this is the first work where a specific MT system has been built to translate novels, a plethora of research lines remain to be explored. In this work we have adapted an MT system by learning from previous novels from the same author. A further step would be to learn from translators while they are translating the current novel, using incremental retraining techniques. We would like to experiment with less related language pairs (e.g. Spanish–English) to assess whether the current setup remains useful. As pointed out by Voigt and Jurafsky (2012), and corroborated by our manual evaluation (some of the MT errors are due to mistranslation of pronouns), we would like to explore using discourse features. Finally, as the ultimate goal of this work is to integrate MT in the translation workflow to assist with the translation of literature, we would like to study which is the best way of doing so, e.g. by means of post-editing, interactive MT, etc., with real customers.

Acknowledgments

This research is supported by the European Union Seventh Framework Programme FP7/2007-2013 under grant agreement PIAP-GA-2012-324414 (AbuMaTran) and by Science Foundation Ireland through the CNGL Programme (Grant 12/CE/I2267) in the ADAPT Centre (www.adaptcentre.ie) at Dublin City University.

References

Laurent Besacier. 2014. Traduction automatisée d'une oeuvre littéraire : une étude pilote. In Traitement Automatique du Langage Naturel (TALN), Marseille, France.

P. Brown, J. Cocke, S. Della Pietra, V. Della Pietra, F. Jelinek, R. Mercer, and P. Roossin. 1988. A statistical approach to language translation. In Proceedings of the 12th Conference on Computational Linguistics - Volume 1, COLING '88, pages 71–76, Stroudsburg, PA, USA. Association for Computational Linguistics.

Peter F. Brown, John Cocke, Stephen A. Della Pietra, Vincent J. Della Pietra, Fredrick Jelinek, John D. Lafferty, Robert L. Mercer, and Paul S. Roossin. 1990. A statistical approach to machine translation. Computational Linguistics, 16(2):79–85, June.

Stanley F. Chen and Joshua Goodman. 1996. An empirical study of smoothing techniques for language modeling. In Proceedings of the 34th Annual Meeting on Association for Computational Linguistics, ACL '96, pages 310–318, Stroudsburg, PA, USA. Association for Computational Linguistics.

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.

Marcello Federico, Nicola Bertoldi, and Mauro Cettolo. 2008. IRSTLM: an open source toolkit for handling large scale language models. In INTERSPEECH, pages 1618–1621.

Christian Federmann. 2012. Appraise: An open-source toolkit for manual evaluation of machine translation output. The Prague Bulletin of Mathematical Linguistics, 98:25–35, September.

Joseph L. Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378–382.

Mikel L. Forcada, Mireia Ginestí-Rosell, Jacob Nordfalk, Jim O'Regan, Sergio Ortiz-Rojas, Juan Antonio Pérez-Ortiz, Felipe Sánchez-Martínez, Gema Ramírez-Sánchez, and Francis M. Tyers. 2011. Apertium: A free/open-source platform for rule-based machine translation. Machine Translation, 25(2):127–144, June.

Dmitriy Genzel, Jakob Uszkoreit, and Franz Och. 2010. "Poetic" Statistical Machine Translation: Rhyme and Meter. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, EMNLP '10, pages 158–166.

Yvette Graham, Timothy Baldwin, Alistair Moffat, and Justin Zobel. 2014. Is machine translation getting better over time? In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 443–451, Gothenburg, Sweden.

Erica Greene, Tugba Bodrumlu, and Kevin Knight. 2010. Automatic analysis of rhythmic poetry with applications to generation and translation. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, EMNLP '10, pages 524–533, Stroudsburg, PA, USA. Association for Computational Linguistics.

Jing He, Ming Zhou, and Long Jiang. 2012. Generating Chinese classical poems with statistical machine translation models.

Long Jiang and Ming Zhou. 2008. Generating Chinese Couplets Using a Statistical MT Approach. In Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1, pages 377–384.

Ruth Jones and Ann Irvine. 2013. The (Un)faithful Machine Translator. In Proceedings of the 7th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, pages 96–101. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, pages 177–180.

Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, EMNLP 2004, A meeting of SIGDAT, a Special Interest Group of the ACL, held in conjunction with ACL 2004, 25-26 July 2004, Barcelona, Spain, pages 388–395. ACL.

Philipp Koehn. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. In Conference Proceedings: the Tenth Machine Translation Summit, pages 79–86, Phuket, Thailand. AAMT.

R. J. Landis and G. G. Koch. 1977. The measurement of observer agreement for categorical data. Biometrics, 33(1):159–174.

Nikola Ljubešić and Antonio Toral. 2014. caWaC - a Web Corpus of Catalan and its Application to Language Modeling and Machine Translation. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), Reykjavik, Iceland.

Juan Alberto Alonso Martín and Anna Civil Serra. 2014. Integration of a machine translation system into the editorial process flow of a daily newspaper. Procesamiento del Lenguaje Natural, 53(0):193–196.

Franz Josef Och and Hermann Ney. 2003. A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29(1):19–51.

Franz Josef Och. 2003. Minimum Error Rate Training in Statistical Machine Translation. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics, pages 160–167.

Lluís Padró and Evgeny Stanilovsky. 2012. FreeLing 3.0: Towards wider multilinguality. In Proceedings of the Language Resources and Evaluation Conference (LREC 2012), Istanbul, Turkey, May. ELRA.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318.

Pavel Pecina, Antonio Toral, Vassilis Papavassiliou, Prokopis Prokopidis, and Josef van Genabith. 2012. Domain adaptation of statistical machine translation using web-crawled resources: a case study. In Proceedings of the 16th Annual Conference of the European Association for Machine Translation, pages 145–152.

Nicholas Ruiz and Marcello Federico. 2014. Complexity of spoken versus written language for machine translation. In 17th Annual Conference of the European Association for Machine Translation, EAMT, pages 173–180, Dubrovnik, Croatia.

Rico Sennrich. 2012. Perplexity minimization for translation model domain adaptation in statistical machine translation. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 539–549.

Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and Ralph Weischedel. 2006. A Study of Translation Error Rate with Targeted Human Annotation. In Proceedings of the Association for Machine Translation in the Americas.

Ralf Steinberger, Bruno Pouliquen, Anna Widiger, Camelia Ignat, Tomaž Erjavec, Dan Tufis, and Dániel Varga. 2006. The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. CoRR, abs/cs/0609058.

Andreas Stolcke, Jing Zheng, Wen Wang, and Victor Abrash. 2011. SRILM at Sixteen: Update and Outlook. In Proceedings of ASRU.

Dániel Varga, László Németh, Péter Halácsy, András Kornai, Viktor Trón, and Viktor Nagy. 2005. Parallel corpora for medium density languages. In Recent Advances in Natural Language Processing, pages 590–596, Borovets, Bulgaria.

Rob Voigt and Dan Jurafsky. 2012. Towards a Literary Machine Translation: The Role of Referential Cohesion. In Proceedings of the NAACL-HLT 2012 Workshop on Computational Linguistics for Literature, pages 18–25.

Martin Volk. 2009. The automatic translation of film subtitles. A machine translation success story? JLCL, 24(3):115–128.

Dekai Wu, Karteek Addanki, and Markus Saers. 2013. Modelling hip hop challenge-response lyrics as machine translation. In Machine Translation Summit XIV, pages 109–116, Nice, France.

Ventsislav Zhechev. 2012. Machine Translation Infrastructure and Post-editing Performance at Autodesk. In AMTA 2012 Workshop on Post-Editing Technology and Practice (WPTP 2012), pages 87–96, San Diego, USA.

Author Index

Agarwal, Apoorv, 32
Besacier, Laurent, 114
Brooke, Julian, 42
Cap, Fabienne, 48
Delmonte, Rodolfo, 68
Dubremetz, Marie, 23
Evert, Stefan, 79
Hammond, Adam, 42
Hirst, Graeme, 42
Ighe, Ann, 89
Jannidis, Fotis, 79, 98
Jayannavar, Prashant, 32
Ju, Melody, 32
Kokkinakis, Dimitrios, 89
Koolen, Corina, 58
Krug, Markus, 98
Kübler, Sandra, 1
Kuhn, Jonas, 48
Macharowsky, Luisa, 98
Malm, Mats, 89
McCurdy, Nina, 12
Meyer, Miriah, 12
Navarro, Borja, 105
Nivre, Joakim, 23
Pielström, Steffen, 79
Proisl, Thomas, 79
Puppe, Frank, 98
Rambow, Owen, 32
Reger, Isabella, 98
Rösiger, Ina, 48
Schöch, Christof, 79
Schwartz, Lane, 114
Scrivner, Olga, 1
Srikumar, Vivek, 12
Toral, Antonio, 123
van Cranenburgh, Andreas, 58
Vitt, Thorsten, 79
Way, Andy, 123
Weimar, Lukas, 98
