Stylometric Classification of Ancient Greek Literary Texts by Genre
Total Page:16
File Type:pdf, Size:1020Kb
Stylometric Classification of Ancient Greek Literary Texts by Genre Efthimios Tim Gianitsos Thomas J. Bolt Pramit Chaudhuri Department of Computer Science Department of Classics Department of Classics University of Texas at Austin University of Texas at Austin University of Texas at Austin Joseph P. Dexter Neukom Institute for Computational Science Dartmouth College Abstract analyses of literary genre have been reported, us- ing both English and non-English corpora such Classification of texts by genre is an impor- as classical Malay poetry, German novels, and tant application of natural language process- Arabic religious texts (Tizhoosh et al., 2008; Ku- ing to literary corpora but remains understud- mar and Minz, 2014; Jamal et al., 2012; Hettinger ied for premodern and non-English traditions. et al., 2015; Al-Yahya, 2018). However, computa- We develop a stylometric feature set for an- cient Greek that enables identification of texts tional prediction of even relatively coarse generic as prose or verse. The set contains over 20 distinctions (such as between prose and poetry) re- primarily syntactic features, which are calcu- mains unexplored for classical Greek literature. lated according to custom, language-specific Encompassing the epic poems of Homer, the heuristics. Using these features, we classify tragedies of Aeschylus, Sophocles, and Euripides, almost all surviving classical Greek literature the historical writings of Herodotus, and the phi- as prose or verse with >97% accuracy and F1 losophy of Plato and Aristotle, the surviving lit- score, and further classify a selection of the verse texts into the traditional genres of epic erature of ancient Greece is foundational for the and drama. Western literary tradition. Here we report a com- putational analysis of genre involving the whole of 1 Introduction the classical Greek literary tradition. Using a cus- tom set of language-specific stylometric features, Classification of large corpora of documents into we classify texts as prose or verse and, for the coherent groups is an important application of nat- verse texts, as epic or drama with >97% accuracy. ural language processing. Research on document An important advantage of our approach is that all organization has led to a variety of successful of the features can be computed without syntactic methods for automatic genre classification (Sta- parsing, which remains in an early phase of de- matatos et al., 2000; Santini, 2007). Computa- velopment for ancient Greek. As such, our work tional analysis of genre has most often involved illustrates how computational modeling of liter- material from a single source (e.g., a newspaper ary texts, where research has concentrated over- corpus, for which the goal is to distinguish be- whelmingly on modern English literature (Elson tween news articles and opinion pieces) or from et al., 2010; Elsner, 2012; Bamman et al., 2014; standard, well-curated test corpora that contain Chaturvedi et al., 2016; Wilkens, 2016), can be ex- primarily non-literary texts (e.g., the Brown cor- tended to premodern, non-Anglophone traditions. pus or equivalents in other languages) (Kessler et al., 1997; Petrenz and Webber, 2011; Amasyali 2 Stylometric feature set for ancient and Diri, 2006). Greek Notions of genre are also of substantial impor- tance to the study of literature. For instance, ex- The feature set is composed of 23 features cov- amination of the distinctive characteristics of var- ering four broad grammatical and syntactical cat- ious forms of poetry dates to classical Greece and egories. The majority of the features are func- Rome (for instance, by Aristotle and Quintilian) tion or non-content words, such as pronouns and and remains an active area of humanistic research syntactical markers; a minority concern rhetorical today (Frow, 2015). A number of computational functions, such as questions and uses of superla- 52 Proc. of the 3rd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, pp. 52–60 Minneapolis, MN, USA, June 7, 2019. c 2019 Association for Computational Linguistics Feature Feature Genre Precision Recall Pronouns and non-content adjectives 4 verse 0.96 0.96 1 ἄλλος 4 prose 0.97 1 2 ἀυτός 10 verse 1 0.93 3 demonstrative pronouns 10 prose 1 1 4 selected indefinite pronouns 14 verse 0.97 0.96 5 personal pronouns 14 prose 1 1 6 reflexive pronouns 19 verse 1 0.89 Conjunctions and particles 19 prose 1 1 7 conjunctions 20 verse 1 0.85 8 mèn 20 prose 1 1 9 particles Subordinate clauses Table 2: Error analysis of non-exact features. The fea- 10 circumstantial markers tures are numbered as in Table 1. 11 conditional markers 12 ἵna 13 ὅπως 14 sentences with relative pronouns in standard studies of English stylistics, such as 15 temporal and causal markers pronouns, others are specific to ancient Greek. At- 16 ¹ste not preceded by ¢ tention to language-specific features enhances sty- 17 mean length of relative clauses lometric methods developed for the English lan- Miscellaneous guage and not directly transferable to languages 18 interrogative sentences possessing a different structure (Rybicki and Eder, 19 superlatives 2011; Kestemont, 2014). Greek particles, for ex- 20 sentences with  exclamations ample, are uninflected adverbs used for a wide 21 ±c range of logical and emotional expressions; in En- 22 mean sentence length glish their equivalent meaning is often expressed 23 variance of sentence length by a phrase or, in speech, tone. In order to avoid significant problems arising from dialectical vari- Table 1: Full set of ancient Greek stylometric features. ation, including a large increase in homonyms, we restrict features to the Attic dialect, in which the majority of classical Greek texts were composed. tive adjectives and adverbs. Function words are Many features are computed by counting all in- standard features in stylometric research on En- flected forms of the appropriate word(s), which glish (Stamatatos, 2009; Hughes et al., 2012) and can be found in any standard ancient Greek text- have also been used in studies of ancient Greek lit- book or grammar such as Smyth (1956). A de- erature (Gorman and Gorman, 2016). Our feature tailed description of the methods for computing selection is not drawn from a prior source but has the features is given in Appendix A. been devised based on three criteria: amenabil- ity to exact or approximate calculation without Calculation of five features relies on heuris- use of syntactic parsing, substantial applicability tics to disambiguate between words of similar to the corpus, and diversity of function. The fea- morphology. (All other features can be calcu- ture set is listed in Table1. The first restric- lated exactly.) To assess the effectiveness of tion is necessary because a general-purpose syn- these heuristics, we hand-annotate the five features tactic parser remains to be developed for classi- in a representative sub-corpus containing three cal Greek (notwithstanding promising early-stage verse (Homer’s Odyssey 6, Quintus of Smyrna’s research through the open-source Classical Lan- Posthomerica 12, and Euripides’ Cyclops) and two guage Toolkit and other projects). All features are prose (Lysias 7 and Plutarch’s Caius Gracchus) per-character frequencies with the exception of a texts. Table2 lists the precision and recall of each handful that are normalized by sentence (indicated feature on the aggregated verse and prose texts. In in the table by “sentences with...”). every instance, the precision is > 0:95 and the re- Although some features overlap with those used call is > 0:85. 53 3 Experimental setup Accuracy (%) Weighted F1 (%) Fold 1 98.0 98.0 3.1 Dataset Fold 2 100 100 We use a corpus of ancient Greek text files, which Fold 3 99.3 99.3 was assembled by the Perseus Digital Library Fold 4 98.7 98.7 and further processed by Tesserae Project (Crane, Fold 5 100 100 1996; Coffee et al., 2012). A full list of texts is Mean 99.2 99.2 provided in Appendix B. Each file typically con- S.D. 1.9 1.9 tains either an entire work of literature (e.g., a play Overall 98.9 98.9 or a short philosophical treatise) or one book of a S.D. 0.8 0.8 longer work (e.g., Book 1 of Homer’s Iliad). 29 files are composites of multiple books included Table 3: Performance of prose vs. verse classifier for elsewhere in the Tesserae corpus and are omitted ancient Greek literary texts. from our analysis, leaving 751 files. In total, this corpus contains essentially all surviving classical Feature Gini S.D. Greek literature and spans from the 8th century ἀυτός 0.209 0.074 BCE to the 6th century CE. conjunctions 0.159 0.062 For our first experiment, we hand-annotate the demonstrative pronouns 0.121 0.057 full set of texts as prose (610 files) or verse reflexive pronouns 0.118 0.049 (141 files) according to standard conventions (Ap- mèn 0.0623 0.029 pendix B). For the second experiment, we hand- annotate the verse texts as epic (82 files) and Table 4: Feature rankings for prose vs. verse classifier. drama (45 files), setting aside 14 files that contain poems of other genres (Appendix C). 4 Results 3.2 Feature extraction All text processing is done using Python 3.6.5. 4.1 Prose vs. verse classification We first tokenize the files from the Tesserae cor- pus into either words or sentences using the Nat- Using the workflow described in Section 3.3, we ural Language Toolkit (NLTK; v. 3.3.0) (Bird classify each of the literary texts in the corpus et al., 2009). For sentence tokenization, we as prose or verse. Table3 lists the accuracy and use the PunktSentenceTokenizer class of NLTK weighted F1 score for a sample cross-validation Greek (Kiss and Strunk, 2006).