
Authorship Attribution Using Lexical Attraction

by

Corey M. Gerritsen

Submitted to the Department of Electrical Engineering and Computer Science in Partial Fulfillment of the Requirements for the Degrees of Bachelor of Science in Computer Science and Engineering and Master of Engineering in Electrical Engineering and Computer Science at the Massachusetts Institute of Technology

May 21st, 2003

Copyright 2003 Corey M. Gerritsen. All rights reserved.

The author hereby grants to M.I.T. permission to reproduce and distribute publicly paper and electronic copies of this thesis and to grant others the right to do so.

[Library stamp: Massachusetts Institute of Technology Libraries, JUL 30 2003]

Author: Department of Electrical Engineering and Computer Science, May 21st, 2003
Certified by: Patrick Henry Winston, Ford Professor of Artificial Intelligence and Computer Science, Thesis Supervisor
Accepted by: Arthur C. Smith, Chairman, Department Committee on Graduate Theses

ABSTRACT

Authorship attribution determines who wrote a text when its authorship is unclear. Examples include cases in which two or more people claim to have written something, or in which no one is willing (or able) to say that he or she wrote the piece. In order to further the tools available for authorship attribution, I introduced lexical attraction as a way to distinguish authors. I implemented a program called StyleChooser, based on Yuret's lexical attraction parser, that determines the author of a text. StyleChooser, once trained on a set of authors, determines how much information is redundant under each author model.
Dividing by the number of words in the test text and by the log of the number of words used to train the model gives a metric used to rank the known authors in order of likelihood that they wrote the text in question. I then tested StyleChooser and analyzed the results. When tested with knowledge of 62 authors on 369 texts by those authors, my program had an accuracy of 75%, while the right author ranked in the top three authors 86% of the time. The closeness of a few authors shows that StyleChooser does a better job of differentiating between styles in a broader sense than between authors. A program that differentiates between styles could be used for style differentiation, style-based searching, and even better human/computer interaction.

Thesis Supervisor: Patrick Winston
Title: Ford Professor of Artificial Intelligence and Computer Science

TABLE OF CONTENTS

Table of Contents
List of Figures
List of Tables
Acknowledgments
1. Introduction
    1.1 The Importance of Authorship Attribution
    1.2 Lexical Attraction
2. Background
    2.1 Stylometry
    2.2 Lexical Attraction Model of Language
3. Methods
    3.1 Creating Author Models
    3.2 A Metric for Determining Authorship
    3.3 Tests for Success
4. Results
5. Discussion
    5.1 Text Lengths
    5.2 Accuracy
    5.3 Inaccuracy
    5.4 Future Work
6. Contributions
Appendix A: Texts from Project Gutenberg Used for Training and Testing
Appendix B: Full Results from Tests with StyleChooser
Bibliography

LIST OF FIGURES

Figure 1: Link-Sentence Pseudocode
Figure 2: Effect of varying the training set size
Figure 3: Effect of varying the test set size
Figure 4: Full graph of the effects of changing the training and testing lengths. Trained on Great Expectations and tested on Mansfield Park.
Figure 5: Full graph of the effects of changing the training and testing lengths. Trained on Emma and tested on Mansfield Park.
Figure 6: A closer look at the graph in Figure 3. The X-axis has been stretched out to get a better look at the variation in the MI close to x = 0.

LIST OF TABLES

Table 1: Mutual Information per word for Hamilton and Madison for each of the disputed Federalist Papers
Table 2: Number of texts that had their correct author ranked at each rank when training with Set A (one text per author) and testing with Set B (the rest of the texts/variable number per author)
Table 3: Number of texts that had their correct author ranked at each rank when training with Set A and testing on Set B, with the metric of MI/(test set size)/log(training set size)
Table 4: Number of texts that had their correct author ranked at each rank when training with Set A and testing with the first 15000 words of each text in Set B
Table 5: Number of texts that had their correct author ranked at each rank when training with Set A, testing with the first 15000 words of each text in Set B, and using the MI/(words in test set)/log(words in training set) metric
Table 6: Texts from Project Gutenberg (Set A - one text per author)
Table 7: Texts from Project Gutenberg (Set B - multiple texts per author)
Table 8: Ranking of correct author for each text after training on Set A, with original metric of MI/words in test set
Table 9: Ranking of correct author for each text after training on Set A, with new metric of MI/words in test set/log(words in training set)

ACKNOWLEDGMENTS

The author wishes to thank Deniz Yuret, for inspiring him; Patrick Winston, for the ideas that kept him interested; his parents, for giving him the tools to get where he is today; and Sara Elice, just for existing.

1. Introduction

Authorship attribution empowers governments and institutions to give credit where credit is due, be it for scholarly works or for terrorist manifestos. Literary scholars have been and continue to be the foremost experts on authorship, but stylometry, the statistical analysis of style in literature, has expanded as a field in the past century. Stylometry attempts to capture an author's style using quantitative measurements of various features in the text, such as word length or vocabulary distributions. Many stylometric studies have measured word dependencies as a feature of an author's style, using language models that restrict what words a given word can depend upon. A language model based on lexical attraction is a probabilistic language model in which each word can depend upon any other word within
