arXiv:1610.05670v2 [cs.CL] 3 Aug 2017
Stylometric Analysis of Early Modern Period English Plays

Mark Eisen1, Santiago Segarra2, Gabriel Egan3, and Alejandro Ribeiro1

1Dept. of Electrical and Systems Engineering, University of Pennsylvania, Philadelphia, USA
2Inst. for Data, Systems, and Society, Massachusetts Institute of Technology, Cambridge, USA
3School of Humanities, De Montfort University, Leicester, UK

Abstract

Function word adjacency networks (WANs) are used to study the authorship of plays from the Early Modern English period. In these networks, nodes are function words and directed edges between two nodes represent the relative frequency of directed co-appearance of the two words. For every analyzed play, a WAN is constructed and these are aggregated to generate author profile networks. We first study the similarity of writing styles between Early English playwrights by comparing the profile WANs. The accuracy of using WANs for authorship attribution is then demonstrated by attributing known plays among six popular playwrights. Moreover, the WAN method is shown to outperform other frequency-based methods on attributing Early English plays. In addition, WANs are shown to be reliable classifiers even when attributing collaborative plays. For several plays of disputed co-authorship, a deeper analysis is performed by attributing every act and scene separately, in which we both corroborate existing breakdowns and provide evidence of new assignments.

1 Introduction

Stylometry involves the quantitative analysis of a text’s linguistic features in order to gain further insight into its underlying elements, such as authorship or genre.
Along with common uses in digital forensics (De Vel et al., 2001; Stamatatos, 2009) and plagiarism detection (Meuschke and Gipp, 2013), stylometry has also become the primary method for evaluating authorship disputes in historical texts, such as the Federalist Papers (Mosteller and Wallace, 1964; Holmes and Forsyth, 1995) and the Mormon scripture (Holmes, 1992), in a field called authorship attribution. Such disputes exist regarding the collection of dramatic works produced in England during the Early Modern era, covering the 16th through mid-17th century. Due to factors such as inaccurate publication information on title pages and undocumented collaborations, the precise authorship of many of these plays, including works by William Shakespeare and John Fletcher, remains highly contested.

Stylometric analysis of work from this period dates back to the nineteenth century, to F. G. Fleay’s analysis of verse features in Shakespeare’s plays (Fleay, 1878). Similar analyses based on the manual counting of linguistic features continued throughout the twentieth century (Timberlake, 1931; Oras, 1960; Tarlinskaja et al., 1987). Computer-based techniques for counting the frequency of various stylistic features, such as rare words or phrases, have become very common over the past few decades. Recent work on authorship in Early Modern drama includes that of MacDonald P. Jackson (Jackson, 2003, 2006), Brian Vickers (Vickers, 2002), and Hugh Craig and Arthur Kinney (Craig and Kinney, 2009), each of whom studied the works of Shakespeare and his contemporaries extensively using computational stylometry techniques.

The techniques used in modern authorship attribution date back almost a century, to the examination of sentence lengths in texts as a marker of authorship (Yule, 1939).
Mosteller and Wallace (1964) were the first to consider function words as important stylistic markers in stylometric analysis, producing unprecedented results. Function words have since remained common in analysis techniques (Argamon and Levitan, 2005; Juola, 2006) because they are context independent and occur at high rates in English-language texts. These methods rely mainly on the frequency of usage of function words. Numerous other stylistic features have since been used in authorship attribution studies, including vocabulary richness (Holmes, 1991; Hoover, 2003) and parts-of-speech (Cutting et al., 1992).

Our method for attributing texts, developed in Segarra et al. (2015), also measures function word usage to distinguish author styles. Rather than only considering word frequencies, however, we consider a more complex relational structure in an author’s usage of function words. We construct word adjacency networks (WANs) with function words as nodes, and edges containing information regarding the use of two function words within a certain distance (measured in intervening words) from one another. We interpret each WAN as a Markov chain that assigns transition probabilities to the appearance of two function words in succession, derived from their actual occurrences in succession at varying distances within the securely attributed texts. Thus, these probabilities stand for the author’s expressed preference for following one particular word with another. We can then quantify similarity between WANs by using a measure of relative entropy. Markov chains have previously been used for authorship attribution by Khmelev and Tweedie (2001) and Sanderson and Guenter (2006), though neither considers function words. Results in Segarra et al. (2015) show an increase in attribution accuracy compared to frequency-based methods for general texts of English literature.
In this work we perform further validation of the method’s performance specifically on plays from the Early Modern period and compare this performance to that of word frequency-based methods previously used in Shakespeare attributional studies. We then employ this new technique to comment on authorship disputes concerning Early Modern English dramatic works.

We first present an overview of the construction and comparison of WANs in Section 2. We discuss in Section 3 the main playwrights used in our analysis as well as the construction of their profile networks, and in Section 4 we present a measure of similarity between profiles. As a validation of the method, in Section 5 we perform a stylometric analysis of the complete undisputed works of our six primary playwrights, followed by a comparison with existing methods in Section 5.1. We are able to demonstrate high attribution accuracy in discriminating between six candidate authors. We then examine the use of WANs in determining authorship of plays known to be written by multiple authors in collaboration. This is first done by analyzing entire plays in Section 6 and then through extensive interplay analysis of a set of particularly controversial plays in Section 7. Our results largely corroborate existing theories regarding these plays and, in some cases, propose new divisions of labor.

2 Word Adjacency Networks

When doing authorship attribution, we are given a set of candidate authors $A = \{a_1, a_2, \ldots, a_n\}$ and a set of known texts written by each of these authors, and the objective is to correctly attribute a collection of texts of unknown authorship among the authors. In Segarra et al. (2013, 2015), we propose an authorship attribution method based on function word adjacency networks. For each text, we can construct a word adjacency network (WAN) of function words.
These include prepositions, conjunctions, pronouns, auxiliary verbs, and articles that convey only grammatical relationships between the so-called lexical words that carry meaning. Formally, from a given text $t$ we construct the network $W_t = (F, Q_t)$, where $F = \{f_1, f_2, \ldots, f_f\}$ is the set of nodes composed by a collection of function words and $Q_t$ is a similarity measure between ordered pairs of function words.

The similarity function $Q_t$ measures the directed co-appearance of two function words. Once we encounter a particular function word, $Q_t$ indicates the likelihood of encountering another one in the few words following the first one. More precisely, to compute $Q_t$ we first divide the text $t$ into units of consecutive words $s_t^h$ (e.g. sentences, speeches), where $h$ ranges from 1 to the total number of units. We denote by $s_t^h(e)$ the word in the $e$-th position within unit $h$ of text $t$. Moreover, we consider that two words in the same unit are related if they are at most $D \in \mathbb{N}$ positions apart and the relation between words decays with their position difference according to a discount factor $\alpha \in (0, 1)$. In this way, with $\mathbb{I}\{\cdot\}$ denoting the indicator function, we define

$$Q_t(f_i, f_j) = \sum_{h,e} \mathbb{I}\{s_t^h(e) = f_i\} \sum_{d=1}^{D} \alpha^{d-1} \, \mathbb{I}\{s_t^h(e+d) = f_j\}, \tag{1}$$

for all $f_i, f_j \in F$. The selection of the decay parameter $\alpha$, the window size $D$, and the delimiting units $s_t^h$, in general, may vary based on the texts and authors being considered. In this work, we select $\alpha = 0.75$ and $D = 10$, determined in Segarra et al. (2015) to be generally optimal and robust parameter choices. However, because punctuation marks were often added by publishers rather than the authors themselves (Howard, 1930), and because dramatic characters do not necessarily speak in sentences, when applying our method to Early Modern plays (rather than novels) we use individual speeches (rather than clauses or sentences) as the units into which we break our texts.
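The computation in (1) can be illustrated with a short sketch. This is only an illustration under stated assumptions, not the implementation used in the paper: the function name `build_wan`, the dictionary representation of $Q_t$, and the toy speeches and function word list are all hypothetical, and each speech is assumed to be already tokenized into lowercase words.

```python
from collections import defaultdict


def build_wan(units, function_words, alpha=0.75, D=10):
    """Sketch of Eq. (1): discounted directed co-appearance counts.

    units: list of units (e.g. speeches), each a list of tokens.
    Returns a dict mapping (f_i, f_j) to Q_t(f_i, f_j).
    Hypothetical helper; names and data layout are assumptions.
    """
    fw = set(function_words)
    Q = defaultdict(float)
    for unit in units:                      # sum over units h
        for e, word in enumerate(unit):     # sum over positions e
            if word not in fw:
                continue
            for d in range(1, D + 1):       # sum over distances d = 1..D
                if e + d >= len(unit):      # never look past the unit's end
                    break
                follower = unit[e + d]
                if follower in fw:
                    Q[(word, follower)] += alpha ** (d - 1)
    return dict(Q)


# Toy example with two made-up "speeches" and an assumed function word list.
speeches = [["the", "cat", "and", "the", "dog"], ["to", "the"]]
wan = build_wan(speeches, ["the", "and", "to"])
```

In this toy example the pair ("and", "the") is adjacent and receives weight $\alpha^0 = 1$, whereas ("the", "and") is two positions apart and receives $\alpha^1 = 0.75$. No weight is assigned to ("the", "to"), because the two words fall in different units.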
We then generate a profile network $W_c = (F, Q_c)$ for every author $a_c$ using the WANs from those texts known to have been written by the corresponding author $a_c$. Formally, if we denote by $T^{(c)}$ the set of texts written by author $a_c$, then the similarity function $Q_c$ of the profile is computed as

$$Q_c = \sum_{t \in T^{(c)}} Q_t. \tag{2}$$

The similarity function $Q_c$ depends on the number and length of the texts written by author $a_c$. This is a problem since we aim to compare profiles of different authors whose canons will be of differing sizes. Thus, we apply the following normalization to the similarity measures

$$\hat{Q}_c(f_i, f_j) = \frac{Q_c(f_i, f_j)}{\sum_j Q_c(f_i, f_j)}, \tag{3}$$

for all $f_i, f_j \in F$. In (3) we assume that the combined length of the texts written by author $a_c$ is long enough to guarantee a non-zero denominator for a given number of function words $|F|$.
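The aggregation in (2) and the row normalization in (3) can be sketched as follows. As before, this is a minimal illustration under assumptions rather than the authors' code: `build_profile` and `normalize_profile` are hypothetical names, each WAN is assumed to be stored as a dictionary from ordered word pairs to weights, and the numbers in the example are invented.

```python
from collections import defaultdict


def build_profile(wans):
    """Sketch of Eq. (2): sum the per-text similarity functions Q_t."""
    Qc = defaultdict(float)
    for Qt in wans:
        for pair, weight in Qt.items():
            Qc[pair] += weight
    return dict(Qc)


def normalize_profile(Qc):
    """Sketch of Eq. (3): divide each entry by the sum of its row.

    Rows with no outgoing weight are simply absent from the dict, which
    sidesteps the zero-denominator case the text assumes does not occur.
    """
    row_sums = defaultdict(float)
    for (fi, _), weight in Qc.items():
        row_sums[fi] += weight
    return {(fi, fj): weight / row_sums[fi] for (fi, fj), weight in Qc.items()}


# Two invented per-text WANs for one hypothetical author.
text_wans = [
    {("the", "and"): 1.0, ("the", "of"): 1.0},
    {("the", "and"): 2.0, ("of", "the"): 3.0},
]
profile = build_profile(text_wans)
profile_hat = normalize_profile(profile)
```

After normalization, the weights leaving each function word sum to one, so each row can be read as a set of transition probabilities, consistent with the Markov chain interpretation of WANs given in the introduction.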