Stylistic Structures
A Computational Approach to Text Classification

by Richard S. Forsyth

[Thesis submitted in accordance with the regulations governing the degree of Doctor of Philosophy in the Faculty of Science at the University of Nottingham, October 1995.]

Copyright (c) 1995 by Richard S. Forsyth. Richard S. Forsyth asserts the right to be identified as the Author of this Work in accordance with the Copyright, Designs and Patents Act 1988.

To Mei Lynn

Acknowledgements

Firstly I would like to thank the `Three Daves' for help both extensive and intensive: my supervisors David Clarke and David Elliman, and David Holmes (my guide into the world of authorship studies). They have each provided assistance and intellectual stimulation over and above the call of duty.

Many other people have helped in a variety of ways in the work leading up to this thesis. They include: Nigel Allinson, Helen Forsyth, James Forsyth, Alex Gammerman, Paul Goodwin, Jim Gowers, Ken Jukes, Colin Martindale, Dean McKenzie, Chris Naylor, Richard Spencer-Smith, Alison Truelove, Fiona Tweedie and Richard Wright.

Cordial thanks are also due to the inter-library-loan staff at the Bolland Library of the University of the West of England (U.W.E.). In addition, this project has benefited from access to two public-domain text repositories, Project Gutenberg and the Oxford Text Archive, as well as from computing support provided by the Faculty of Computer Science and Mathematics at U.W.E.

Last but not least, I would also like to thank my wife, Mei Lynn, for support, suggestions and encouragement during the long period that this thesis took up; my daughter, Frances, for advice that altered the course of this research; and my son, Edward, for putting up with a good deal of thesis-induced absence and absent-mindedness. Naturally, none of those named above should be held responsible for any blunders or blemishes in what follows.

Table of Contents

Acknowledgements
Contents List
Abstract
1. Introduction
   1.1 Aims of this Research
   1.2 An Approach to Text Classification
   1.3 What this Thesis Isn't
   1.4 Preview of Subsequent Chapters
2. Background
   2.1 The Case of the Federalist Papers
   2.2 Other Stylometric Studies
   2.3 Other Relevant Research
   2.4 Review
3. Text Classification by Cross-Codebook Compression
   3.1 Why Data Compression?
   3.2 The Algorithm Used
   3.3 Some Results
   3.4 Critique
4. Text Classification with Diagnostic Digrams
   4.1 Towards an Empirical Framework
   4.2 Selection of Test Problems
   4.3 Preprocessing and Other Preliminaries
   4.4 Establishing a Baseline
   4.5 Trials on Word-Coded Data
   4.6 Results Using Word Transitions
   4.7 Discussion
5. Instance-Based Text Classification
   5.1 Text Classification by Similarity
   5.2 IOGA: An Instance-Oriented Genetic Algorithm
   5.3 From Prototypes to Profiles
   5.4 Concluding Comments
6. Text Classification by Rule Induction
   6.1 A Robust Bayesian Text Classifier
   6.2 Text Classification with Discrimination Trees
   6.3 The Characterization of Character Strings
   6.4 Review and Summary
7. Extending Textual Fragments to their `Natural' Lengths
   7.1 How Long is a Piece of Substring?
   7.2 Performance of Extended Text Fragments
8. An Evolutionary Approach to Text Classification
   8.1 TOGA: A Text-Oriented Genetic Algorithm
   8.2 Preliminary Test on Numeric Data
   8.3 Testing Various Fitness Functions
   8.4 An Experiment on Rule Complexity and Inference Mode
   8.5 Does Stricter Selection of Features Help?
   8.6 Evaluative Review
9. The Federalist Revisited (Again): a Case Study
   9.1 A Break from Benchmarking
   9.2 A Stepwise Removal of Restrictions
   9.3 Though this be Madison, yet there's Method in it
   9.4 Summing Up
10. Concluding Discussion
   10.1 Achievements
   10.2 Findings
   10.3 Failings and Future Prospects
   10.4 Closing Remarks
Reference List
Appendix A -- Specimen Spitbol Program 1: A Monte-Carlo Feature Finder
Appendix B -- Specimen Spitbol Program 2: Basic Bayesian Text Classifier
Appendix C -- Specimen C Code: GA Routines and Heap-Tree Functions
Appendix D -- Details of Data Sources

Abstract

The problem of authorship attribution has received attention both in the academic world (e.g. did Shakespeare or Marlowe write Edward III?) and outside it (e.g. is this confession really the words of the accused, or was it made up by someone else?). Previous studies by statisticians and literary scholars have sought "verbal habits" that characterize particular authors consistently. By and large, this has meant looking for distinctive rates of usage of specific marker words -- as in the classic study by Mosteller and Wallace of the Federalist Papers.

The present study is based on the premiss that authorship attribution is just one type of text classification and that advances in this area can be made by applying and adapting techniques from the field of machine learning. Five different trainable text-classification systems are described, which differ from current stylometric practice in a number of ways, in particular by using a wider variety of marker patterns than is customary and by seeking such markers automatically, without being told what to look for. A comparison of the strengths and weaknesses of these systems, when tested on a representative range of text-classification problems, confirms the importance of paying more attention than usual to alternative methods of representing distinctive differences between types of text. The thesis concludes with suggestions on how to make further progress towards the goal of a fully automatic, trainable text-classification system.

Chapter 1
INTRODUCTION

"As we near the end of the twentieth century, the printed book also appears to be drawing near the end of its five-century career." -- Philip Davies Roberts (1986).

The world is awash with text. When Saint Alcuin (c. 732-804 A.D.) was librarian at York Minster he presided over the revival of learning known as the "Northumbrian Renaissance" (Hutchinson & Palliser, 1980). Before being destroyed by the Vikings in 866, Alcuin's library held a collection of about 300 manuscript volumes, estimated as almost half the books in Europe at that time (Allinson, 1994). In modern terms, allowing 60,000 words per book and 6 bytes as a rough estimate of the storage needed for each word in computer memory, that comes to 108 megabytes. Were he alive today, Alcuin could equip himself with a portable laptop computer that could hold three or four copies of his entire library -- without using any form of data compression.
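As a quick sanity check, the arithmetic behind this back-of-envelope figure can be reproduced in a few lines of C; the 300 volumes, 60,000 words per volume and 6 bytes per word are the text's own rough assumptions, not measured values.

    /* A minimal sketch reproducing the storage estimate above.
       All figures are the text's own rough assumptions. */
    #include <stdio.h>

    int main(void)
    {
        const long volumes        = 300;    /* Alcuin's library at York  */
        const long words_per_book = 60000;  /* assumed words per volume  */
        const long bytes_per_word = 6;      /* rough plain-text estimate */

        long total = volumes * words_per_book * bytes_per_word;
        printf("%ld bytes = %ld megabytes\n", total, total / 1000000);
        /* Output: 108000000 bytes = 108 megabytes */
        return 0;
    }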
Alcuin flourished during the so-called Dark Ages, notorious as a low point in European scholarship; but even the fabled library of Alexandria, the greatest treasury of literature assembled at the height of classical civilization, is said to have contained no more than half a million papyrus scrolls (Sagan, 1981). As a scroll typically held about 10,000 words, that comes to around 5 billion words, or 30 gigabytes of storage -- too much for you or me or Alcuin to carry around in a briefcase, but still not huge by today's standards. For comparison, Leech (1987) states that the text databank of the Mead Data Corporation contained 5 billion words available for on-line retrieval in the mid-1980s. It was already the size of the great library of Alexandria, and has undoubtedly grown since then.

There is, therefore, plenty of text in our industrialized society, much of it stored on computers. Indeed, there is so much text being stored on, and transmitted between, computers (as electronic mail, faxes, word-processed documents and so on) that we suffer from text overload. Individuals, and indeed organizations, cannot really cope with such vast amounts of textual information. In other words, our ability to analyze text lags well behind our ability to save, retrieve and send it. This mismatch between the ability to store texts and the ability to analyze them has motivated responses of several kinds -- including, for example, the development of programs that filter out `junk email'. The work reported in this thesis can also be seen as a response to this predicament.

1.1 Aims of this Research

The most important objective of this thesis (and the author's excuse for adding to the mountain of text described above) is to make a modest contribution to the field of text analysis. By `text analysis' is meant any method of summarizing or extracting information from data that is encoded solely or chiefly in the form of character strings. There are no departments of Text Analysis, so relevant work is currently carried out under several headings, including computer science, linguistics, psychology and statistics. In the English-speaking world the term text analysis does not even designate a well-defined area, although in France it does seem to be emerging as a coherent field (see, for example, Lebart et al., 1991; Lebart & Salem, 1994). It is hoped that, after reading this thesis, the reader will share the author's view that text analysis deserves to be considered a worthwhile field of study in its own right.

To be more specific, the present work concentrates on methods of text classification, where the main aim is to find ways of using patterns within texts to sort them as accurately as possible into classes or groups. In fact, much of what follows will be concerned with the subproblem of authorship attribution, i.e.