Quantitative Authorship Attribution: A History and an Evaluation of Techniques
QUANTITATIVE AUTHORSHIP ATTRIBUTION: A HISTORY AND AN EVALUATION OF TECHNIQUES

A thesis submitted in partial fulfillment of the requirements for the degree of MASTER OF ARTS IN THE DEPARTMENT OF LINGUISTICS

© Jack Grieve 2005
SIMON FRASER UNIVERSITY
Summer 2005

All rights reserved. This work may not be reproduced in whole or in part, by photocopy or other means, without the permission of the author.

APPROVAL

Name: Jack William Grieve
Degree: Master of Arts
Title of Thesis: Quantitative Authorship Attribution: A History and an Evaluation of Techniques

Examining Committee:
    Dr. Zita McRobbie, Chair
    Associate Professor, Department of Linguistics

    Dr. Paul McFetridge, Senior Supervisor
    Associate Professor, Department of Linguistics

    Dr. Maria Teresa Taboada, Supervisor
    Assistant Professor, Department of Linguistics

    Dr. Fred Popowich, External Examiner
    Professor, School of Computing Science

Date Defended:

SIMON FRASER UNIVERSITY PARTIAL COPYRIGHT LICENCE

The author, whose copyright is declared on the title page of this work, has granted to Simon Fraser University the right to lend this thesis, project or extended essay to users of the Simon Fraser University Library, and to make partial or single copies only for such users or in response to a request from the library of any other university, or other educational institution, on its own behalf or for one of its users.

The author has further granted permission to Simon Fraser University to keep or make a digital copy for use in its circulating collection.

The author has further agreed that permission for multiple copying of this work for scholarly purposes may be granted by either the author or the Dean of Graduate Studies.

It is understood that copying or publication of this work for financial gain shall not be allowed without the author's written permission.

Permission for public performance, or limited permission for private scholarly use, of any multimedia materials forming part of this work, may have been granted by the author. This information may be found on the separately catalogued multimedia material and in the signed Partial Copyright Licence.

The original Partial Copyright Licence attesting to these terms, and signed by this author, may be found in the original bound copy of this work, retained in the Simon Fraser University Archive.

W. A. C. Bennett Library
Simon Fraser University
Burnaby, BC, Canada

ABSTRACT

I here present a history of the field of quantitative authorship attribution and an evaluation of its techniques. The basic assumption of quantitative authorship attribution is that the author of a text can be selected from a set of possible authors by comparing the values of textual measurements in that text to their corresponding values in each author's writing sample. Over the centuries, many measurements have been proposed, but never before have the majority of these measurements been tested on the same dataset. Until now, investigators of authorship have not known which measurements are the best indicators of authorship. Such information is crucial if our current techniques are to be used effectively and if new, more powerful techniques are to be developed. Based on the results of this study, I propose that the best approach to quantitative authorship attribution involves the analysis of many different types of textual measurements.

ACKNOWLEDGEMENTS

I would like to thank Paul McFetridge, Maite Taboada, E. W. Roberts, Joseph Rudman, Graeme Trousdale, Ivan Zubov, Fred Popowich and most especially Tom Grieve and Paula Chmilar for their help and support.

CONTENTS

Approval
Abstract
Acknowledgements
Contents
Tables
Figures
1 Introduction
2 History
    Introduction
    Meter & Rhyme
    Word-Length
    Sentence-Length
    Punctuation
    Contractions
    Vocabulary Richness
    Graphemes
    Etymology
    Errors
    Words
    Word Position
    N-Grams
    Syntax
    Summary
3 Attribution Algorithms
    3.1 Introduction
    3.2 Input
    3.3 Textual Preparation
    3.4 Textual Measurements
        3.4.1 Introduction
        3.4.2 Word-Length
        3.4.3 Sentence-Length
        3.4.4 Vocabulary Richness
        3.4.5 Graphemes
        3.4.6 Words
        3.4.7 Punctuation
        3.4.8 Word Positions
        3.4.9 Collocations
        3.4.10 N-Grams
    3.5 Comparison & Output
    3.6 Summary
4 Experimental Design
    4.1 Introduction
    4.2 The Corpus of Possible Authors
        4.2.1 Corpus Compilation
        4.2.2 Author-Based Corpus Compilation
        4.2.3 Corpus of Possible Authors Compilation
    4.3 Attribution Algorithm Evaluation
    4.4 Summary
5 Results & Discussion
    Introduction
    On the Presentation and Interpretation of the Results
    Word- & Sentence-Length