Tools for Digital Humanities: Enabling Access to the Romance of Flamenca

Olga Scrivner and Sandra Kübler
Indiana University, Bloomington, IN, USA

Abstract

Accessing historical texts is often a challenge because readers either do not know the historical language, or they are challenged by the technological hurdle when such texts are available digitally. Merging corpus linguistic methods and digital technology can provide novel ways of representing historical texts digitally and providing a simpler access. In this paper, we describe a multi-dimensional parallel Old Occitan-English corpus, in which word alignment serves as the basis for search capabilities as well as for the transfer of annotations. We show how parallel alignment can help overcome some challenges of historical manuscripts. Furthermore, we apply a resource-light method of building an emotion annotation via parallel alignment, thus showing that such annotations are possible without speaking the historical language. Finally, using visualization tools, such as ANNIS and GoogleViz, we demonstrate how the emotion analysis can be queried and visualized dynamically in our parallel corpus, thus showing that such information can be made accessible with low technological barriers.

1 Introduction

In the past, historical documents and manuscripts were studied exclusively by a manual, paper-based approach. This limited the access to such documents to scholars who, on the one hand, know the historical variety of the language and, on the other hand, had access to the manuscripts. However, recent achievements in corpus linguistics have introduced new methods and tools for digitization and text-processing. Similarly, the progress in digital technology has created novel ways of data visualization and data interpretation. Finally, “by accessing linguistic annotation, we can extend the range of phenomena that can be found” (Kübler and Zinsmeister, 2014). That is, a digital corpus enriched with linguistic annotation, such as syntactic, pragmatic, and semantic annotation, can amplify our understanding of the literary or historical work. With such a corpus, a researcher can overcome many of the challenges provided by the historical nature of the manuscripts as well as by the technological barrier provided by many available query tools. For example, one of the challenges in working with historical texts consists in the variation in spelling or in lexical variation. In such cases, instead of performing a direct lexical search, researchers can access this information via linguistic annotation if lemma information is available.

However, there are challenges for which standard linguistic annotations are not useful. First, searching in text is usually restricted to known phenomena or explicit occurrences of data. For example, some phenomena may involve a variation between explicit and implicit tokens, e.g., null subjects or zero relative clauses. While some corpora allow for a query of null occurrences, such as in the MCVF (Martineau et al., 2010), most of the monolingual corpora do not provide such an annotation. Furthermore, it is essential for the researcher to be aware of all possible forms and contexts for queries, which is a difficult task in monolingual corpora. That is, there is always a possibility that “some relevant form will be overlooked because it has never been studied”


Proceedings of NAACL-HLT Fourth Workshop on Computational Linguistics for Literature, pages 1–11, Denver, Colorado, June 4, 2015. © 2015 Association for Computational Linguistics

(Enrique-Arias, 2013, 107). For example, consider the study of Medieval Spanish discourse markers, e.g., he, evás ‘behold’. Enrique-Arias (2013) shows that only by using a parallel corpus is he able to observe unexpected linguistic structures conveying the same discourse function, e.g., sepas que ‘know that’ and cata que ‘see that’. Finally, monolingual corpora are usually accessible only to an audience with prior knowledge of a given historical language, leaving a large public aside.

We propose to address these challenges by using a parallel corpus. In our case, the two documents consist of the original, historical text and its translation into modern English. Until recently, parallel corpora have been almost exclusively used in the fields of machine translation, bilingual lexicography, translator training, and the study of language specific translational phenomena (McEnery and Xiao, 2007). With the increased availability of historical parallel corpora, we have seen the emergence of their use in historical linguistics (Koolen et al., 2006; Zeldes, 2007; Petrova and Solf, 2009; Enrique-Arias, 2012; Enrique-Arias, 2013).

In this paper, we introduce a parallel annotated Old Occitan-English corpus. We show how the alignment with modern English makes this historical corpus more accessible and how the word alignment can facilitate the cross-language transfer of emotion annotations from a resource-rich modern language into a resource-poor language, such as Old Occitan. Finally, we demonstrate how emotion visualization techniques can contribute to a richer understanding of the literary text for technically less inclined readers.

The remainder of this paper is organized as follows: Section 2 reviews the use of corpora in historical studies, and section 3 describes work on transferring annotations via alignment. Section 4 describes the textual basis of our corpus, the 13th-century Old Occitan Romance of Flamenca. In section 5, we explain the compilation of the parallel corpus. Section 6 describes the emotion annotation, which was carried out for English and then transferred to Old Occitan. Section 7 introduces emotion queries via ANNIS, a freely available on-line search platform, and visualization methods with motion charts in GoogleViz. Section 8 draws general conclusions and provides an outlook on future steps for the project.

2 Using Corpora in Historical Linguistics

Parallel corpora are collections of two or more texts consisting of an original text (source) and its translation(s). Generally, such parallel corpora are annotated for word alignment between the source text and its translation. Word alignment can be carried out automatically via tools such as GIZA++ (Och and Ney, 2000).

Monolingual historical corpora are undoubtedly valuable linguistic resources. It is not uncommon, however, to encounter different spellings and other lexical variations in historical texts. Not knowing an exact spelling or just a simple language barrier may hinder data collection. Parallel corpora can assist in such situations through Historic Document Retrieval (Koolen et al., 2006), which allows researchers to query via the modern translation rather than via the older language. Given the common assumption that “translation equivalents are likely to be inserted in the same or very similar, syntactic, semantic and pragmatic contexts” (Enrique-Arias, 2013, 114), we can assess not only lexical, but also morphological variations. That is, it is possible to a) identify forms that have never been studied and b) find occurrences based on their textual or stylistic conventions. For example, using the parallel Bible corpus of […] and its English translation, Sánchez López (To Appear) is able to identify a new form, salvante, that has never been reported. Similarly, Enrique-Arias (2012) examines discourse markers, possessive structures and clitics in the Bible and its medieval Spanish translation. In addition, the author is able to observe stylistic variation in choices made by translators depending on Bible sub-genres such as narrative or poetry.

In recent years, there has also been increasing interest in the correspondence between translation and language change. In this view, translation is seen as a “means of tracking change” (Beeching, 2013). Various studies have demonstrated the feasibility of parallel corpora in studies of semantic and (morpho-)syntactic change. For example, Beeching (2013) examines the evolution of the French expression quand-même using monolingual and parallel corpora. Similarly, Zeldes (2007) looks at the

parallel corpus of Bible translations in two different stages of the same language, Old and Modern Polish. The author is able to detect various changes in nominal affixes. Another interesting approach is suggested by Petrova and Solf (2009), who investigate the influence of Latin in historical documents. When we deal with historical documents that are translations of Latin or other languages, it is hard to assess which phenomena are target language specific or introduced via the translation from the source language. Petrova and Solf (2009) show that this issue can be resolved with a parallel diachronic corpus: They analyze a change in word order in the Old High German translation of the Bible and its original Latin version. Given the assumption that any word order deviation in the translation can be viewed as evidence for Old High German syntax, their investigation is restricted to cases where word order in the translation differs from its original. The results reveal that in contrast to Latin, preverbal objects in subordinate clauses in Old High German convey given information (explicitly mentioned in the previous context), whereas postverbal objects carry new information.

Finally, linguists and computational linguists can benefit from parallel corpora in studies of implicit constituents, e.g. zero anaphora. Not all corpora are annotated for implicit occurrences or the omission of certain elements in a sentence. This is a common challenge in fields where zero anaphora resolution is necessary, e.g., automatic summarization, machine translation, and studies of syntactic variation. With a parallel corpus, it is possible to search for the explicit form in a translated text and then observe the use or omission of that form in the original text. For instance, in his study of Biblia Medieval, a parallel corpus of Old Spanish Bible translations, Enrique-Arias (2013) analyzes discourse markers and observes instances of zero-marking in Old Spanish by searching for explicit Hebrew markers. Furthermore, such corpora can be used to investigate the convergence universal in machine translation, a correlation between zero anaphora and the degree of text explicitation (Rello and Ilisei, 2009).

We argue that the addition of a translation into a modern language is a simple and intuitive way of giving access to historical texts. This is a useful tool not only for historians and historical linguists but also for a lay audience since the translation provides access to the meaning without introducing additional hurdles such as a semantic annotation. In section 4, we will present our corpus of choice, the Romance of Flamenca, an Old Occitan poem, along with the parallel version that includes a modern English translation. Then, we will present an approach to annotate this corpus for emotions, using non-experts working on the English part and then transferring the annotation to the source language.

3 Using Alignment Methods for Annotating Resource-Poor Languages

Linguistic annotation often involves a great amount of manual labor, which is often not feasible for low-resourced languages. Instead, we can use a method from computational linguistics, namely cross-language transfer, as proposed by Yarowsky and Ngai (2001). This method does not involve any resources in the target language, neither training data, a large lexicon, nor time-consuming manual annotation.

Cross-language transfer has been previously applied to languages with parallel corpora and bilingual lexica (Yarowsky and Ngai, 2001; Hwa et al., 2005). This approach uses parallel text and word alignment to transfer the annotation from one language to the next. Yarowsky and Ngai (2001) show the transfer of a coarse grained POS tagset and base noun phrases from English to French. Yarowsky et al. (2001) extend the approach to Spanish and Czech, and they include named entity tagging. Hwa et al. (2005) use a similar approach to transfer dependency annotations from English to Chinese. Snyder and Barzilay (2008) extend the approach to unsupervised annotation of morphology in Semitic languages via a hierarchical Bayesian network. And Snyder et al. (2009) extend the framework to include multiple source languages.

In previous work (Scrivner and Kübler, 2012), we used another cross-language transfer method, based on the work by Feldman and Hana (2010), to annotate the Flamenca corpus with POS and syntactic annotations. This method does not require parallel texts; rather, it uses resources such as lexica or POS taggers from closely related languages. Our annotations were based on resources from […] (Martineau et al., 2010) and modern Catalan (Civit et al., 2006).

The transfer of sentiment analysis is common in machine translation. Kim et al. (2010) use a machine translation system to map subjectivity lexica from English to other languages. In word sense disambiguation, word alignment has been used as a bridge (Tufis, 2007) based on the assumption that the translated word shares the same sense with the original word. A similar method was used for sentiment transfer from a resource-rich language to a resource-poor language (Mihalcea et al., 2007).

4 Romance of Flamenca

Medieval Provençal literature is well known for its lyric poetry of […]. There remains, however, a small number of non-lyric Provençal texts, such as Romance of Flamenca, Girart de Rossilho and Daurel et Beto, that have not received much attention. In this project we focus on Romance of Flamenca, which can be faithfully described as “the one universally acknowledged masterpiece of Old Occitan narrative” (Fleischmann, 1995). This anonymous romance, written in the 13th century, presents an artistic amalgam of fabliau, courtly romance, troubadour lyrics, and narrative genre. The uniqueness of this “first modern novel” is also seen in its use of setting, adventures, and character portrayal (Blodgett, 1995). The narrator virtuously depicts Archambaut’s transmutation from a chivalrous knight to an unbearably jealous husband who locks his beautiful wife Flamenca in a tower, as well as Guilhem’s ingenious conspiracy to liberate Flamenca. Furthermore, this prose in verse played an influential role in the development of French literature. The potential value of this historical resource, however, is limited by the lack of an accessible digital format and linguistic annotation.

There are no known records of Romance of Flamenca before the late 18th century, when it was seized from a private collection during the French Revolution and placed later in the library of Carcassonne (Blodgett, 1995). The manuscript came in 139 folios with 8095 octosyllabic verses but missing the beginning and end. It was first edited by Raynouard in 1834. Since then, the manuscript has been edited multiple times (Meyer, 1865, 1901; Gschwind 1976; Huchet, 1982). While Gschwind’s edition (1976) remains “the most useful edition”, which provides a more accurate interpretation (Blodgett, 1995; Carbonero, 2010), we have chosen the second edition by Meyer for two reasons: a) The edition has no copyright restriction and b) this edition is available in a scanned image format provided by Google¹.

¹ https://archive.org/details/leromandeflamen00meyegoog

5 Parallel Occitan-English Corpus

The compilation and architecture of the monolingual Old Occitan corpus Romance of Flamenca along with the morpho-syntactic and syntactic annotations has been described by Scrivner and Kübler (2012) and Scrivner et al. (2013). In this paper, we augment the monolingual version of the Romance of Flamenca corpus with a parallel Old Occitan-English level. As described in section 2, we regard monolingual and parallel corpora as complementary resources. That is, our new corpus conforms to the traditions of a conventional multi-layered monolingual corpus, and at the same time, it offers the research advantages of a parallel bilingual corpus.

In our task, we have made various methodological decisions related to translation and alignment. First, in the selection of a translation of the source, it was important to find the most faithful translation to the original poem. While free translations have their own merits, they pose a great challenge to the alignment task. Our choice fell on the work by Blodgett (1995) for several reasons. Blodgett “endeavored, so far as possible, to respect the loose and often convoluted syntax of the original” (Blodgett, 1995). In addition, the author was able to add lines from the manuscript that were missing in the previous editions. Finally, Blodgett (1995) followed a “conservative approach” and omitted lines that were suggested earlier to replace lacunae in the original. This conservative approach is necessary for ensuring the accurate line alignment of verses.

In a second step, we provided word alignment. This is a challenging task, partly because of the non-standardized spelling in the Old Occitan source, but also because the amount of aligned text is rather small for standard unsupervised approaches. An additional challenge results from the verse structure of the poem, which necessitates deviations in the translation. This genre is prone to various stylistic word orders, as compared to political or historical narratives. In addition, sentence boundaries in Occitan do not always correspond to those in the English translation.

If we followed common practice in automatic alignment and chose sentences as the basic text units for the automatic alignment, this very likely would result in many mis-alignments. As a result, we decided to split data by lines, instead of sentences. We performed the line alignment by means of NATools². NATools is a Perl package for processing parallel corpora. This package helps with the alignment of two files on a sentence level and with extracting a probability lexicon.

² http://linguateca.di.uminho.pt/natools/

For word alignment, after some experimentation, we decided to use a fully unsupervised approach since there do not exist any automatic aligners for the Old Occitan-English pair. We chose GIZA++ (Och and Ney, 2000), a freely available automatic aligner, which allows for one-to-one and one-to-many word alignment. In addition, when using GIZA++, we can make use of our extracted probability dictionary. The output of the automatic alignment was then corrected manually. Below, we show an example. In (1), we show the original sentence with the word index as well as the English translation. The GIZA++ output before correction is illustrated in (2), and the corrected version is shown in (3). In both versions, the numbers in parentheses behind a word indicate which word in the original is aligned with this word in the translation.

(1) index: 1       2    3    4   5    6
    OO:    poissas lur  dis  tot en   apert
    ME:    then    he   said to  them openly

(2) then (1) he ( ) said (3) to (4) them (2) openly (6)

(3) then (1) he (3) said (3) to (2) them (2) openly (6)

This example shows that in (2), the subject pronoun he is not aligned with any word in Old Occitan. This is to be expected since Old Occitan is a pro-drop language, and the pronoun is not expressed overtly. However, during our manual correction, we align the pronoun he with the verb dis, a standard treatment for null subject pronouns. In a final step, we combine lines with corrected word alignment to form a sentence in order to merge this parallel alignment with our monolingual Old Occitan corpus, as illustrated in Figure 1.

Figure 1: Word alignment: sample of results for the null pronoun in Old Occitan.

This example also shows that such an alignment can be used to search the Old Occitan corpus via the modern English annotated translation. For example, the query with an explicit English pronoun (PRP)

aligned to the Occitan verb (VJ) allows us to find null occurrences of subject pronouns in Old Occitan.

At present, our parallel corpus contains 14 100 tokens and 1 000 aligned verse lines. Our corpus is further converted to PAULA XML (Dipper, 2005) and imported into the ANNIS search engine (Zeldes et al., 2009), which makes this corpus accessible on-line³.

³ www.oldoccitancorpus.org

6 Emotion Annotation Transfer

While emotion analysis constitutes an important component in literary analysis, narrative corpora annotated for emotional content are not very common. In contrast, there is a large body of work on emotion and sentiment analysis of non-literary resources, such as blog posts, news and tweets (see (Liu, 2012; Pang and Lee, 2008) for overviews). However, despite the advances in the automatic annotation, the manual annotation of emotions remains a difficult task. On the one hand, the definition of emotion remains a controversial issue as there is still no clear distinction between emotions, attitude, personality, and mood. Various models of emotion clusters have been proposed, as illustrated in Table 1, but no clear standard has emerged so far.

Storm & Storm (1987)   Shaver et al. (1987)   Plutchik (1980)
sadness                sadness                sadness
anger                  anger                  anger
fear                   fear                   fear
happiness              happiness              joy
love                   affection              disgust
disgust                                       trust
anxiety                                       surprise
contentment                                   anticipation
hostility
liking
pride
shame

Table 1: Basic emotion models.

On the other hand, the assignment of emotion is often a subjective decision. While many emotions can be identified through contextual and linguistic cues, e.g., lexical, semantic, or discourse cues, it has been shown that human annotators often assign a different label for the same emotional context (Alm and Sproat, 2005, 670). Finally, available annotated resources are domain specific, e.g., movie reviews and poll opinions, which makes it difficult to adapt to a narrative genre. As Francisco et al. (2011) point out, “the complexity of the emotional information involved in narrative texts is much higher than in those domains”.

In recent years, however, with the increasing access to digitized books, such as the Google Books Corpus and Project Gutenberg, there has been growing interest in applying emotion annotation for narrative stories. For example, Alm and Sproat (2005) annotate 22 Grimms’ fairy tales and demonstrate the importance of story sequences for emotional story evaluation and Francisco et al. (2011) create a corpus of 18 English folk tales. Both corpora are built using a manual annotation. In contrast, Mohammad (2012) applies a lexicon-based method to the emotion analysis of Google Books. He creates an emotion association lexicon with 14 200 word types, which determines the ratio of emotional terms in text. In addition, Mohammad shows effective visualization methods that trace emotions and compare emotional content in large collections of texts.

As we have seen, the emotion annotation can be a valuable resource in linguistics and literary studies. However, annotated corpora and emotion lexica exist mainly for resource-rich languages, such as English. Annotating verses in Old Occitan manually for emotional content is a tedious task and requires an expert in the language. Thus neither a manual annotation of the text nor the creation of an emotion lexicon in the source language is viable. However, we can use a resource-poor approach from NLP, namely cross-language transfer (see section 3), which allows us to take advantage of English resources in combination with the word alignment. I.e., we can annotate emotions in English, which is easy enough to do given a lexicon and an undergraduate student. In a second step, we transfer the annotation via the word alignments to the source language.

Below, we will describe our emotion annotation transfer method. First, we compiled a word list from the English version of Flamenca and removed common function words. We then used the NRC emotion lexicon⁴, which consists of words and their associations with 8 emotions as well as positive or negative sentiment. Mohammad and Yang (2011) created this lexicon by using frequent words from the 1911 Roget Thesaurus⁵, which were annotated by 5 human annotators. For our application, we focus on the 8 emotion associations: anger, anticipation, disgust, fear, joy, sadness, surprise and trust. Since the emotion annotation is not neutral with regard to context, several emotions can be assigned to the same token in the NRC lexicon, as shown in (4).

⁴ http://saifmohammad.com/WebPages/lexicons.html
⁵ The words must occur more than 120 000 times in the Google n-gram corpus.

(4) abandon fear
    abandon sadness
    lose    anger
    lose    disgust
    lose    fear

As we can see, the word abandon has two associations, namely with fear and sadness, and the word lose has three possible associations, namely anger, disgust and fear. During the initial label transfer from the NRC lexicon to the Flamenca lexicon, we kept multiple labels. The tokens with multiple labels were further manually checked, and only one label that would best fit the context to our knowledge was retained. For example, in case of ‘abandon’, we selected the emotion sadness, as the context describes Flamenca’s father before her marriage. Also, the evaluation of 100 randomly selected labels revealed that some associations did not fit our text due to the difference in genre. For example, ‘court’ in our narrative genre represents a different semantic entity (king’s court), whereas in the NRC lexicon, ‘court’ (a criminal case) is associated with anger or fear. We decided to leave these cases unannotated.

Finally, the emotion annotation was transferred from the English translated words to their aligned words in Old Occitan. The emotion annotation layer was further added to the main corpus and converted to the ANNIS format.

7 Corpus Query and Visualization

In this section, we will focus on how our parallel corpus can be queried for emotional content. Following the approach from Alm and Sproat (2005), who show the relevance of textual sequencing for emotional analysis, we have segmented the corpus into 10 logical event sequences, including wedding announcement, preparation for wedding, arrival of Archambaut, marriage, departure, King’s arrival, and Queen’s jealousy. The corpus consists of different layers of annotations: a) (morpho-)syntactic layer (part-of-speech and constituency annotation for Occitan and part-of-speech for English), b) lemmas, c) discourse layer (speakers classification, e.g., king, queen, Flamenca), d) temporal sequencing (events), e) word alignment (Occitan → English), and f) emotion layer (joy, trust, fear, surprise, sadness, disgust, anger and anticipation). For visualization, we use the search engine ANNIS (Zeldes et al., 2009), which is designed for displaying and querying multi-layered corpora. One advantage of ANNIS consists of its graphical query interface, which allows for user-friendly, graphical queries, instead of traditional textual queries, for which one needs to know the query language. To illustrate the visual query tool, we present a query in Figure 2. In this query, we search for any aligned token that expresses joy and is spoken by Father.

One example of a query result is illustrated in Figure 3. This example shows the Occitan token honor and the aligned English token honors that are annotated with joy and spoken by Father.

Another way of analyzing emotions more generally is to look at the overall emotional content, by querying for any words that have emotion; in other words, they are not annotated as “None” for emotion (query: emotion != "None"). We can then perform a frequency analysis of the results. The frequency distribution is shown in Figure 4, where we see that trust and joy are the most common emotions.

In recent years, visual and dynamic applications in corpus and literature studies have become more important, thus showing a focus on non-technical users. For example, Moretti (2005) advocates the use of maps, trees, and graphs in the literary analysis. Oelke et al. (2012) suggest person network representation, showing relations between characters. In addition, they also use fingerprint plots, which make it possible to compare the mentioning of various categories, e.g. male, female, across several novels. Hilpert (2011) introduces dynamic motion

7 Figure 2: Search for joy with the speaker Father.

Figure 3: Resulting visualization for the query in Figure 2. charts as a visualization methodology to the liter- type, color and size across time sequencing. Thus, ary and linguistic studies. These charts are common the user can access this information in an interactive in the socio-economic statistic field and are capa- way without having to extract any information.For ble of visualizing in motion the development of a the future, we plan on adding discourse and word phenomenon in question across time. Hilpert (2011, information. 436) stresses that “the main purpose of producing linguistic motion charts is to give the analyst an in- 8 Conclusion tuitive understanding of complex linguistic develop- ment”. Following Hilpert’s methodology, we con- We have presented an approach to providing access verted our data into the R data frame format and to an Old Occitan text via parallel word alignment produced a motion chart, as shown in Figure 5. At with a modern language and cross-language trans- present, our emotion analysis can be assessed as a fer. We regard this project as a case study, showing dynamic motion chart, by using GoogleViz6, inter- that we can provide access to many types of infor- nally interfaced via R (R Development Core Team, mation without the user having to learn cumbersome 2007). The chart allows for displaying emotion by query languages or programming. We have also shown that the use of methods from computational 6http://cran.r-project.org/web/packages/googleVis/index.html linguistics, namely cross-language transfer, can pro-

8 Figure 5: Dynamic emotion analysis with GoogleViz.

vide tools for annotating the corpus without having access to an expert in the historical language. Using ANNIS, we showed how a multi-layered corpus can be queried via a user-friendly visual query interface. Finally, we presented a motion chart, which allows the user to analyze and trace emotions dynamically without any technical requirements.

Figure 4: ANNIS frequency analysis.

This work is part of our ongoing project to fully annotate the Romance of Flamenca. Our goal is to provide users with the tools necessary for text mining and visualization of this romance. We plan to develop an R package that will enable users to take the results of an individual search query exported from ANNIS and visualize them as motion charts and other statistical plots. Finally, this parallel corpus can be used as a training corpus in machine translation and for building parallel dictionaries and emotion lexicons in resource-poor languages, such as Old Occitan.
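As a rough illustration of the lexicon-building use mentioned above, the following Python sketch projects emotion labels from English words onto the Old Occitan words aligned with them and keeps the most frequent label for each word. The function name, the input format (a list of word-alignment links and an English word-to-emotion dictionary), and the toy data are assumptions made for illustration only; they do not reproduce the procedure or the data used in this work.

```python
from collections import Counter, defaultdict


def project_emotions(aligned_pairs, english_lexicon):
    """Project emotion labels across word alignments (hypothetical helper).

    aligned_pairs: iterable of (occitan_word, english_word) alignment links.
    english_lexicon: dict mapping an English word to an emotion label.
    Returns a dict mapping each Occitan word to its most frequent projected label.
    """
    votes = defaultdict(Counter)
    for occitan, english in aligned_pairs:
        label = english_lexicon.get(english.lower())
        if label is not None:
            # Each alignment link to a labelled English word counts as one vote.
            votes[occitan.lower()][label] += 1
    # Keep the label with the most votes for every Occitan word.
    return {word: counts.most_common(1)[0][0] for word, counts in votes.items()}


if __name__ == "__main__":
    # Toy alignment links and lexicon, invented for this example.
    links = [("joi", "joy"), ("joi", "joy"), ("joi", "delight"),
             ("paor", "fear"), ("e", "and")]
    lexicon = {"joy": "joy", "delight": "joy", "fear": "fear"}
    print(project_emotions(links, lexicon))
```

Words aligned only with unlabelled English words (such as the function word in the toy data) receive no entry, so the resulting lexicon covers only the vocabulary for which the alignment provides evidence.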
