A Comparison of Feature-Based and Neural Scansion of Poetry
Manex Agirrezabal¹ and Iñaki Alegria¹ and Mans Hulden²

¹ IXA NLP Group, Department of Computer Science, Univ. of the Basque Country (UPV/EHU)
² Department of Linguistics, University of Colorado

Abstract

Automatic analysis of poetic rhythm is a challenging task that involves linguistics, literature, and computer science. When the language to be analyzed is known, rule-based systems or data-driven methods can be used. In this paper, we analyze poetic rhythm in English and Spanish. We show that the representations of data learned from character-based neural models are more informative than the ones from hand-crafted features, and that a Bi-LSTM+CRF model produces state-of-the-art accuracy on scansion of poetry in two languages. Results also show that information about whole word structure, and not just independent syllables, is highly informative for performing scansion.

1 Introduction

I don't like to brag and I don't like to boast¹
Questi non ciberà terra né peltro,²
Мой дядя самых честных правил,³

The above are examples of metered poetry in English, Italian and Russian. If the English example is read out loud, it is probably rendered in a continuous deh-deh-dum pattern. In the second example, the line consists of eleven beats where some syllables (in fixed positions) are more prominent than others.⁴ The Russian example is part of a poem written completely in iambic meter (using a recurring deh-dum sound pattern).⁵ A person able to read texts in Russian would most likely produce this recurring pattern when reciting the poem. A far more interesting question is whether this rhythmic structure of the poem can be discovered without possessing complete understanding of the language, or whether we could even analyze it without any knowledge of the language in question. These are difficult challenges for NLP that involve knowledge about linguistics, literature and computer science.

To understand the underlying prosodic structure of a poem independently of the language, a necessary core piece of knowledge concerns the typological relationship between different poetic traditions. This work represents the first steps towards an understanding of how to incorporate such knowledge into practical systems. To this end, we scan⁶ the rhythm of poems using data-driven techniques with two languages.⁷ In our previous work we tested basic techniques on English poetry (Agirrezabal et al., 2016a); in this research we improve the results using deep learning and extend the experiments to include Spanish poetry. Analyzing the results and adapting our models to perform fully unsupervised, language-independent poetry analysis is our current challenge.

¹ Dr. Seuss' Scrambled Eggs Super!
² Dante Alighieri's The Divine Comedy (Canto I, Inferno).
³ Alexander Pushkin's Eugene Onegin.
⁴ As a rule of thumb, the 10th beat is always stressed.
⁵ A complete reading of each poem should convince the reader that this pattern is present throughout.
⁶ The common term for annotating poetry with stress levels.
⁷ The repository with the data and techniques: https://github.com/manexagirrezabal/herascansion/

2 Scansion

Performing scansion of a line of poetry involves marking the rhythmic structure of that line, along with feet (groups of syllables) and rhyme patterns across lines (Corn, 1997; Fabb, 1997; Steele, 1999). In this work, however, we address only the task of inferring the stress sequence for each verse (a sequence of words or syllables).

2.1 English

Poems in English contain repeating patterns of syllable stress groupings, better known as feet, and according to the type of foot used, i.e. the number of syllables in each, several meters can be employed. The most common ones are iambic feet (bal-loon), trochaic (jun-gle), dactylic (ac-ci-dent) or anapestic (com-pre-hend).

The length of a metrical line is expressed by the number of feet found in regular lines. Thus a dimeter has two feet, a trimeter three, a tetrameter four, and so on (pentameter, hexameter, heptameter, ...). The most common meter in English is iambic pentameter, e.g.

O change | thy thought, | that I | may change | my mind,

Although poems show an overall regularity throughout lines, poets tend to vary some parts of verse slightly, with various artistic motives for doing so, as in

Grant, if thou wilt, thou art beloved of many

This differs from the previous example by its prominent dum-deh-deh-dum pattern early on (grant, if thou wilt), known in the literature as a 'trochaic variation'. Another variation is that, since the poem is iambic overall, the final syllable in the line should be stressed, but it instead ends with an unstressed ny-syllable. Appending an unstressed syllable at the end of an iambic line is a common departure from a set form in English poetry called feminine ending. An automated scansion system must be aware of, or learn, such common variants and be able to apply them consistently.

2.2 Spanish

In the Spanish poetic tradition, several metrical structures have been popular over time (Quilis, 1984; Tomás, 1995; Caparrós, 1999). In this work, because of corpus availability, we have focused only on a specific time period, the Golden Age. In this period the main meter of poetry was the hendecasyllable, in which each line of verse consists of eleven syllables. The stress sequence is quite regular and usually the 10th syllable is stressed. Other syllable positions are also stressed, and the specifics of the pattern lead to a rich categorization of hendecasyllabic lines, which is outside our current scope of work.

One of the challenges in analyzing Spanish poetry is the use of syllable contractions, also known as synalephas, to force verses with more than eleven syllables into hendecasyllabic structures. Because of this, when scansion is performed, not all the syllables receive a stress value. As one of our intentions was to reproduce the experiments and methods of previous work, we have created a heuristic to assign a stress value to each syllable (by adding unstressed syllables and maintaining lexical stresses when possible).

2.3 Automated scansion

Automated scansion is a vibrant topic of research. Recent work often casts this as a prediction problem: given a sequence of words in a poem as input, we must predict the stress pattern for each of them. This prediction is often approached in one of two ways: either following expert-designed rules that guide the marking, or learning from patterns in labeled data. Rule-based work includes Logan (1988); Hartman (2005); Plamondon (2006); McAleese (2007); Gervas (2000); Navarro-Colorado (2015); Agirrezabal et al. (2016b). Currently, data-driven techniques are becoming more popular due to the availability of tagged data. Some works that employ data and get information from it are Hayward (1996); Greene et al. (2010); Hayes et al. (2012); Agirrezabal et al. (2016a); Estes and Hench (2016).

3 Corpora

As the gold standard material for training the English metrical tagger, we used a corpus of scanned poetry, For Better For Verse (4B4V), from the University of Virginia (Tucker, 2011).⁸ The entire collection consists of 78 poems, approximately 1,100 lines in total. Sometimes several analyses are given as correct, as there is some natural ambiguity when performing scansion; about 10% of the lines are ambiguous, with two or more plausible analyses given.

For the Spanish language portion we make use of a corpus of Spanish Golden-Age Sonnets (Navarro-Colorado et al., 2016) available on GitHub.⁹ This is a manually checked collection of poems from the 16th and 17th centuries, containing approximately 135 sonnets and almost 2,000 lines. These poems were written by seven different well-known authors.

⁸ http://prosody.lib.virginia.edu/
⁹ https://github.com/bncolorado/CorpusSonetosSigloDeOro

English
    The jaws that bite, the claws that catch!
    Eight segments, four strong beats
Spanish
    su fábrica en tus ruinas adelanta,
    Eleven segments, three strong beats

4 Methods

We follow the intuitions outlined in Agirrezabal et al. (2016a) and we use the same set of linguistically motivated features. The feature templates include current and surrounding words, syllables, POS-tags and lexical stresses, among other simpler ones. This paper extends that work: more recent methods, neural network models in particular, and a new language are explored.

The earlier feature-based systems require manual extraction of features where for each syllable [...]

[...] a character-based RNN with LSTM, which produces two vectors. The forward vector holds a representation of the character sequence read from left to right; the backward vector holds the same in reversed order. Our insight is that this character-based LSTM captures the phonological structure of the word from its graphemes/characters. These two vectors are concatenated with the whole word's embedding (the embeddings could be pre-trained from larger corpora or trained jointly for the task). The vector of these three elements represents each word in the sequence. A word-level LSTM then produces an output for each word that incorporates its left and right context information. Finally, this output goes through a CRF layer to obtain the optimal output. For details, we refer the reader to Lample et al. (2016).

We performed several experiments. In some cases, the models were designed to learn a direct mapping from syllables to stresses (S2S¹⁶).
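The role of the CRF layer in a syllable-to-stress setting can be illustrated with a minimal sketch of Viterbi decoding over a linear chain of stress tags. This is not the authors' implementation: the tag set ("-" for unstressed, "+" for stressed), the names `viterbi`, `EMISSIONS` and `TRANSITIONS`, and all scores are assumptions invented for illustration. In the actual model the per-position scores would come from the word-level LSTM and the transition scores would be learned. The example line is the English one quoted above, with the alternating target pattern assumed from its "four strong beats".

```python
from typing import Dict, List, Tuple

TAGS = ["-", "+"]  # assumed tag set: "-" = unstressed, "+" = stressed


def viterbi(emissions: List[Dict[str, float]],
            transitions: Dict[Tuple[str, str], float]) -> List[str]:
    """Return the highest-scoring tag sequence of a linear-chain model."""
    if not emissions:
        return []
    # best[t][tag]: score of the best path that ends in `tag` at position t
    best = [{tag: emissions[0][tag] for tag in TAGS}]
    back: List[Dict[str, str]] = []  # back[t-1][tag]: best previous tag
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for cur in TAGS:
            prev = max(TAGS, key=lambda p: best[t - 1][p] + transitions[(p, cur)])
            scores[cur] = best[t - 1][prev] + transitions[(prev, cur)] + emissions[t][cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag.
    tag = max(TAGS, key=lambda x: best[-1][x])
    path = [tag]
    for pointers in reversed(back):
        tag = pointers[tag]
        path.append(tag)
    path.reverse()
    return path


# Hypothetical scores for "The jaws that bite, the claws that catch!"
SYLLABLES = ["The", "jaws", "that", "bite", "the", "claws", "that", "catch"]
EMISSIONS = [
    {"-": 1.0, "+": 0.0},  # The
    {"-": 0.0, "+": 1.0},  # jaws
    {"-": 0.4, "+": 0.6},  # that (locally, the invented scores prefer "+")
    {"-": 0.0, "+": 1.0},  # bite
    {"-": 1.0, "+": 0.0},  # the
    {"-": 0.0, "+": 1.0},  # claws
    {"-": 1.0, "+": 0.0},  # that
    {"-": 0.0, "+": 1.0},  # catch
]
# Invented transition scores that reward alternating stress.
TRANSITIONS = {("-", "+"): 1.0, ("+", "-"): 1.0,
               ("-", "-"): -1.0, ("+", "+"): -1.0}

greedy = [max(TAGS, key=e.get) for e in EMISSIONS]  # per-syllable argmax
decoded = viterbi(EMISSIONS, TRANSITIONS)           # sequence-level decoding
print("greedy :", "".join(greedy))
print("viterbi:", "".join(decoded))
```

With these invented scores, the per-syllable argmax marks the first "that" as stressed, while the sequence-level decoding recovers the alternating pattern, which is the kind of interaction between neighboring labels that a CRF layer is meant to capture.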