A Fuzzy-Logic-Based Approach to Shakespearian Authorship Attribution
Total Page:16
File Type:pdf, Size:1020Kb
Doctoral Dissertation Title: From n-grams to n-sets: A Fuzzy-Logic-Based Approach to Shakespearian Authorship Attribution. A doctoral thesis submitted to the Faculty of Arts, Design and Humanities of De Montfort University in partial fulfilment for the degree of Doctor of Philosophy. By Foukis Dimitrios, October 2019 Leicester, United Kingdom. Supervisor: Professor Gabriel Egan. This page intentionally left blank. 1 Acknowledgements I would like to express my gratitude to my supervisor Professor Egan for his full support and scientific criticism throughout the three years of this PhD thesis. I would like also to thank De Montfort University for the full bursary it offered me during my studies. Furthermore, I would like to acknowledge the valuable help of Professor Hugh Craig, the director of the Centre for Literary and Linguistic Computing (CLLC) of the University of Newcastle in Australia, for the authorisation to use texts (versions of plays) that were regularised by him and his collaborators. Finally, I dedicate this thesis to my father, Apostolos Foukis, whose chronic illness made me understand two things: the importance of caring for others and that scientific reasoning is the only tool for resolving complex problems and minimising human errors that are unavoidable however hard we try. Declaration of Authorship I hereby declare that this thesis is the product of my own work and has been generated as the result of my own original research. Wherever contributions of others are involved, every effort is made to indicate this explicitly, with due reference to the literature, and acknowledgement of the relevant resource. In addition, no part of this thesis has been published anywhere. 2 Abstract This thesis surveys the principles of Fuzzy Logic as they have been applied in the last three decades in the micro-electronic field and, in the context of resolving problems of authorship verification and attribution shows how these principles can assist with the detection of stylistic similarities or dissimilarities of an anonymous, disputed play to an author‘s general or patterns-based known style. The main stylistic markers are the counts of semantic sets of 100 individual words-tokens and an index of counts of these words‘ frequencies (a cosine index), as found in the first extract of approximately 10,000 words of each of 27 well-attributed Shakespearian plays. Based on these markers, their geometrical representation, fuzzy modelling and on thee ground of Set Theory and Boolean Algebra, in the core part of this thesis three Mamdani (Type-1) genre-based Fuzzy Expert Systems were built for the detection of degrees (measured on a scale from 0 to 1) of Shakespearianness of disputed and, probably, co-authored plays of the early modern English period. Each of these three expert systems is composed of seven input and two output variables that are associated through a set of approximately 30 to 40 rules. There is a detailed description of the properties of the three expert systems‘ inference mechanisms and the various experimentation phases. There is also an indicative graphical analysis of the phases of the experimentation and a thorough explanation of terms, such as partial truths membership, approximate reasoning and output centroids on an X-axis of a two-dimensional space. Throughout the thesis there is an extensive demonstration of various Fuzzy Logic techniques, including Sugeno-ANFIS (adaptive neuro-fuzzy inference system), with which the style of Shakespeare can be modelled in order to compare it with well-attributed plays of other authors or plays that are not included in the strict Shakespearian canon of the selected 27 well-attributed, sole-authored plays. In addition, other relevant issues of stylometric concern are discussed, such as the investigation and classification of known ‗problem‘ and disputed plays through holistic classifiers (irrespective of genre). The results of the experimentation advocate the use of this novel, automated and computer simulation-based method of classification in the stylometric field for various purposes. In fact, the three models have succeeded in detecting the low Shakespearianness of non-Shakespearian plays and the results they provided for anonymous, disputed plays are in 3 conformance with the general evidence of historical scholarship. Therefore, the original contribution of this thesis is to define fully functional automated fuzzy classifiers of Shakespearianness. The result of this discovery is that we now know that the principles of fuzzy modelling can be applied for the creation of Fuzzy Expert Stylistic Classifiers and the concomitant detection of degrees of similarity of a play under scrutiny with the general or patterns-based known style of a specific author (in our case, Shakespeare). Furthermore, this thesis shows that, given certain premises, counts of words‘ frequencies and counts of semantic sets of words can be employed satisfactorily for stylistic discrimination. Keywords: Fuzzy Logic, Set Theory and Boolean Algebra, Fuzzy Expert Stylistic Classifiers, Fuzzy Model, Authorship Attribution and Verification, Cosine Index of Similarity, Average-‗Ideal‘ Document (-Vector of Counts of Words‘ Frequencies), ‗Problem Plays‘, Sugeno-ANFIS (Adaptive Neuro-Fuzzy Inference System) 4 Abbreviation Full Form ANFIS Adaptive Neuro-Fuzzy Inference System COA Centroid of Area FIS Fuzzy Inference System LDA Linear Discriminant Analysis Max Maximum Value or Union Min Minimum Value or Intersection Mf/mf Membership Function PCA Principal Components Analysis RSD Relative Standard Deviation SD (Sample) Standard Deviation SiS Stylistic index of Similarity SVM Support Vector Machines WAN Word Adjacency Networks 5 Note regarding the length of the thesis The length of the thesis without the Appendices (Section 9) is approximately 95,000 words. The section of Appendices is about 20,000 words. The necessity for exceeding the general limit of 80,000 words for the core part of the thesis was due to the nature of the experimentation, the variety of technical terms the application of Fuzzy-Logic entailed and the need to follow a repetitive structure for the generation of rules and the building of an inference mechanism. A shorter version would not permit me to illustrate adequately how by analogy to the engineering field I applied in concreto in digital humanities the principles of Fuzzy-Logic through a (Matlab-based) simulated computerised environment. Note regarding numbers Numbers from 0 to 10 are written in words, unless they are used in formulae or as determiners (for example Validation Stage 2 or Rule 9). Numbers more than 10 are written in the numerical form. Supplementary files The Matlab scripts and other relevant files are stored and can be accessed from the repository: https://github.com/dimitrios-lab/fuzzy.git 6 Table of Contents Abstract .................................................................................................................................... 3 1 Chapter 1: Introduction & Literature Review................................................. 19 1.1 Introduction ............................................................................................................... 19 1.1.1 Review of the Literature .................................................................................... 22 1.1.2 Definitional Context of and Scope of the Topic ................................................ 23 1.1.3 Outline of the Current Situation--Novel Method of Fuzzy Logic in Stylometry 25 1.1.4 Bridging the Gap--Importance and Advantages of Novel Methodology ........... 26 1.1.5 Research Problem, Objectives and Aim ............................................................ 27 1.1.6 An Outline of the Order of Information in the Thesis. ...................................... 28 1.2 Authorship Attribution and Style Discrimination: The 18th, 19th, and 20th Centuries. .............................................................................................................................. 30 1.2.1 Discriminators of Verses, Lines Ending and ‗Peculiar Inaccuracies‘: Edmond Malone and James Boswell, William Spalding. (Circa 1750-1850) ................................ 30 1.2.2 Middle and Late nineteenth Century (circa 1850-1895): Proportionality in Co- authorship Attribution & The Debates of the New Shakspere Society: James Spedding, Frederick James Furnivall. ............................................................................................... 33 1.2.3 Emergence of a Mechanic Stylometric Approach: Mendenhall and his Curves- Based Discrimination of Authorial Styles. (Circa 1895-1905) ........................................ 34 1.2.4 Theoretical Foundations of the Beginning of the Twentieth Century (circa 1905-1931) and External (Henslowe‘s Diary) and Internal Stylometric Evidence: W.W.Greg, H. Dugdale Sykes and Charles A. Langworthy. ........................................... 35 1.2.5 Zipf‘s First Systematic Rank-Correlation/Frequency Law (1932) and Its Modern Application. ......................................................................................................... 37 1.2.6 Collaboration and Further Theoretical Stylometric Clues by Muriel St. Claire Byrne (1932). .................................................................................................................... 39 1.2.7 Yule‘s Theorems (1944): Words at Risk, Ratios of Vocabularies and Size of Texts. 40 7 1.2.8 Probabilities-Based Emerging Stylometric Technicalities (1963): Investigating the Federalist