Metrical Tagging in the Wild: Building and Annotating Poetry Corpora with Rhythmic Features

Thomas Haider
Department of Language and Literature, Max Planck Institute for Empirical Aesthetics, Frankfurt am Main
Institute for Natural Language Processing (IMS), University of Stuttgart
[email protected]

Abstract

A prerequisite for the computational study of literature is the availability of properly digitized texts, ideally with reliable meta-data and ground-truth annotation. Poetry corpora do exist for a number of languages, but larger collections lack consistency and are encoded in various standards, while annotated corpora are typically constrained to a particular genre and/or were designed for the analysis of certain linguistic features (like rhyme). In this work, we provide large poetry corpora for English and German, and annotate prosodic features in smaller corpora to train corpus driven neural models that enable robust large scale analysis. We show that BiLSTM-CRF models with syllable embeddings outperform a CRF baseline and different BERT-based approaches.

However, for projects that work with larger text corpora, close reading and extensive manual annotation are neither practical nor affordable. While the speech processing community explores end-to-end methods to detect and control the overall personal and emotional aspects of speech, including fine-grained features like pitch, tone, speech rate, cadence, and accent (Valle et al., 2020), applied linguists and digital humanists still rely on rule-based tools (Plecháč, 2020; Anttila and Heuser, 2016; Kraxenberger and Menninghaus, 2016), some with limited generality (Navarro-Colorado, 2018; Navarro et al., 2016), or without proper evaluation (Bobenhausen, 2011). Other approaches to computational prosody make use of lexical resources with stress annotation, such as