D. S. Nikolaev Ab M. V. Shumilin Ac
Total Page:16
File Type:pdf, Size:1020Kb
Шаги / Steps. Т. 7. № 1. 2021 Статьи D. S. Nikolaev ab ORCID: 0000-0002-3034-9794 ✉ [email protected] M. V. Shumilin ac ORCID: 0000-0002-4348-3909 ✉ [email protected] a Российская академия народного хозяйства и государственной службы при Президенте РФ (Россия, Москва) b Стокгольмский университет (Швеция, Стокгольм) c Институт мировой литературы им. А. М. Горького РАН (Россия, Москва) identifying latin autHors tHrougH maximum-likeliHood diricHlet inference: a contribution to model-based stylometry Аннотация. В статье предлагается новый алгоритм для опре- деления авторов латинских прозаических текстов, основанный на Дельте Берроуза и распределении Дирихле. Для демонстра- ции эффективности алгоритма проводится анализ фрагментов текстов 36 авторов классического и средневекового периода. Наш алгоритм показывает результаты, сопоставимые с результатами, полученными за счет применения Random Forest, одного из са- мых мощных универсальных классификационных алгоритмов. Преимущество нашего алгоритма заключается в том, что он тре- бует очень мало времени и вычислительных ресурсов для обуче- ния, его легко имплементировать на любом языке программи- рования общего назначения и его тривиально параллелизовать. Кроме того, поскольку алгоритм основан на эксплицитной модели порождения текста, параметры натренированной модели подда- ются интерпретации: т о ч н о с т ь распределения (сумма его па- раметров) прямо соответствует стилистической гомогенности тек- стов соответствующего автора. Статья подготовлена в рамках выполнения научно-исследо- вательской работы государственного задания РАНХиГС. Ключевые слова: стилометрия, латинская литература, распре- деление Дирихле, Дельта Берроуза, Random Forest, атрибуция текстов, стилистический анализ, машинное обучение Для цитирования: Nikolaev D. S., Shumilin M. V. Identifying Latin authors through maximum-likelihood Dirichlet inference: A contribution to model-based sty- lometry // Шаги/Steps. Т. 7. № 1. 2021. С. 183–198. https://doi.org/10.22394/2412- 9410-2021-7-1-183-198. Статья поступила в редакцию 9 апреля 2020 г. Принято к печати 3 августа 2020 г. © D. S. NIKOLAEV, M. V. SHUMILIN 183 Shagi / Steps. Vol. 7. No. 1. 2021 Articles D. S. Nikolaev ab ORCID: 0000-0002-3034-9794 ✉ [email protected] M. V. Shumilin ac ORCID: 0000-0002-4348-3909 ✉ [email protected] a The Russian Presidential Academy of National Economy and Public Administration (Russia, Moscow) b Stockholm University (Sweden, Stockholm) c A. M. Gorky Institute of World Literature of the Russian Academy of Sciences (Russia, Moscow) identifying latin autHors tHrougH maximum-likeliHood diricHlet inference: a contribution to model-based stylometry Abstract. The last two decades saw a dramatic increase in the num- ber of papers published on the subject of stylometry, which is often narrowly understood as the task of identification of the author of a particular text fragment based on its stylistic properties. We present a new lightweight algorithm for stylometric identification of authors of Latin prose texts based on Burrows’s Delta, computed over relative frequencies of 244 manually selected genre and topic neutral words, and the Dirichlet distribution, whose parameters we estimate using an iterative maximum-likelihood algorithm. In order to demonstrate the effectiveness of the method, we present a case study of 3000- word fragments of texts by 36 classical and medieval authors and show that our method performs on par with Random Forest, a pow- erful general-purpose classification algorithm. We provide summa- ry statistics of our algorithm’s performance together with confusion matrices demonstrating pairwise discriminability of texts by differ- ent authors. The advantages of our method are that it is very simple to implement, very quick to train and do inference with, and that it is very interpretable since it is a model-based algorithm: precision of the fitted Dirichlet distributions directly corresponds to the stylistic homogeneity of the texts by different authors. This makes it possible to use the algorithm as a general research tool in Latin stylistics. The article was written on the basis of the RANEPA state as- signment research programme. Keywords: stylometry, Latin literature, Dirichlet distribution, Burrows’s Delta, Random Forest, text attribution, stylistic analy- sis, machine learning © D. S. NIKOLAEV, M. V. SHUMILIN 184 D. S. Nikolaev, M. V. Shumilin. Identifying Latin authors through maximum-likelihood Dirichlet inference: A contribution to model-based stylometry To cite this article: Nikolaev, D. S., & Shumilin, M. V. (2021). Identifying Latin authors through maximum-likelihood Dirichlet inference: A contribution to model- based stylometry. Shagi/Steps, 7(1), 183–198. https://doi.org/10.22394/2412-9410- 2021-7-1-183-198. Received April 9, 2020 Accepted August 3, 2020 1. Introduction The last two decades saw a dramatic increase in the number of papers published on the subject of stylometry, which is often narrowly understood as the task of iden- tification of the author of a particular text fragment based on its stylistic properties. Stylometry was launched as a computational discipline in the 1960s with the cel- ebrated analysis of the Federalist Papers in [Mosteller, Wallace 1964]; see [Holmes 1998; Holmes, Kardos 2003] for an overview of the early developments. The statistical-learning boom of the early 21st century, following the seminal early work, such as [Vapnik 1999; Breiman 2001], gave rise to a plethora of new algorithms and approaches to the analysis of textual data. Many of those algorithms have a natu- ral application in stylometry, while others were created specifically for this purpose. Contemporary approaches can be roughly divided in two groups: 1. F e a t u r e - b a s e d approaches rely on features extracted from texts to as- certain their authorship. Counts of function words, most-frequent words, or word or character N-grams, and different combinations thereof are usually employed as features. Texts are then clustered or directly compared based on some distance mea- sure, such as Burrows’s Delta [Burrows 2002]. Function words have been tradi- tionally considered as good features for stylometric analysis because they are not tied to particular genres and it is hard for authors to deliberately manipulate their frequencies. 2. M o d e l - b a s e d approaches aim to directly model the distribution of fea- tures in texts by different authors. A text-generating model for an author is created, which makes it possible to directly estimate the posterior probability (in a Bayes- ian setting) or likelihood (in a frequentist one) that the fragment of interest was composed by this author. Decisions about authorship are then made based on these estimates. Work on stylometric attribution of Latin texts was mostly done using the dis- criminative approach. A notable, albeit controversial example, is the attempt by Jus- tin Stover to prove that the Latin fragment preserved in the 13th-century manuscript MS Vat. Reg. lat. 1572 is the long-lost Book 3 of the treaty De dogmate Platonis by Apuleius. In his edition of the text [Stover 2015] and several co-authored ar- ticles [Stover et al. 2016; Stover, Kestemont 2016a], Stover employed PCA, Bur- rows’s Delta-based Bootstrap-Consensus Trees [Eder et al. 2016], and the Impostor Method [Koppel, Winter 2014] in order to substantiate his claim. Stover and his co- authors also tried to apply a similar methodology to the problem of the authorship of other texts, including the Corpus Caesarianum [Kestemont et al. 2016]. Related work was reported in [Kabala 2020], who applied a distance-based classification (“let text segment A be attributed correctly if the next closest text segment in its cor- 185 Шаги / Steps. Т. 7. № 1. 2021 pus belongs to the same author class as A”) and logistic regression to the problem of the authorship of the twelfth-century Latin works Translatio s. Nicolai and Gesta principium polonorum.1 Feature-based approaches are sometimes performant, but they are nearly always uninterpretable and therefore inflexible. Both sides of the problem—how exactly a given author produces texts of a particular type and how exactly do we ascertain that a given text was written by her—remain totally obscure even if the attribution is suc- cessful. More worryingly, there is often no clear way to estimate the degree of our un- certainty in the attribution and to check how well it is actually supported by the data. Model-based approaches offer a way to overcome these limitations. These ap- proaches regard text fragments by different authors as draws from a probability distribution characterizing this author’s style. If we know the parameters of this distribution, we can directly estimate how likely it is that a given text was composed by this author. In an ideal scenario, we should be able to estimate the probability that a given text was written by this author. It is easy to see, however, why this estimation is infeasible: in order for it to work, we need to construct a probability distribution over all possible author-text combinations, and we usually do not have access to the complete set of authors in any given language or genre. Moreover, even if we had restricted the possible authors to some finite set, we still would not have had access to the joint probability of authors and textual frag- ments because they were not sampled in any meaningful sense. As a consequence of this issue, the current model-based approaches tend also to be partly feature based: the parameters of the model are estimated from the data, and their parameter vectors are then used as features in subsequent analysis. The most important decision to be made here is what probability distribution to use for modelling feature distributions. The current instrument of choice is the multi- nomial distribution. This distribution models the probability of the event that after n trials each of which may result in K different outcomes the result will be equal to (p1, p2, …, pk), where pi is the number of trials ending in the outcome and all pi’s sum to n.