Using Machine Learning Techniques for Stylometry

Ramyaa, Congzhou He, Khaled Rasheed
Artificial Intelligence Center, The University of Georgia, Athens, GA 30602 USA

Abstract

In this paper we describe our work, which attempts to recognize different authors based on their style of writing (without help from genre or period). Fraud detection, email classification, deciding the authorship of famous documents like The Federalist Papers¹, attributing authors to pieces of text in collaborative writing, and software forensics are some of the many uses of author attribution. In this project, we train decision trees and neural networks to "learn" the writing styles of five Victorian authors and to distinguish between them based on certain features of their writing which define their style. The texts chosen were of the same genre and of the same period to ensure that the success of the learners would entail that texts can be classified on the style, or the "textual fingerprint", of authors alone. We achieved 82.4% accuracy on the test set using decision trees and 88.2% accuracy on the test set using neural networks.

1. Introduction

Stylometry – the measure of style – is a burgeoning interdisciplinary research area that integrates literary stylistics, statistics and computer science in the study of the "style" or the "feel" of a document. The style of a document is typically based on many parameters – its genre or topic (which is used in text categorization), its content, and its authors, to name a few. Stylometry assumes – in the context of author attribution – that there is an unconscious aspect to an author's style that cannot be consciously manipulated but which possesses quantifiable and distinctive features. These characteristic features of an author should be salient, frequent, easily quantifiable and relatively immune to conscious control. In addition, these features need to be able to distinguish authors writing in the same genre, on similar topics and in the same period, so that the classifier is not helped by differences in genre or by the changes in English style that come with time.

The origins of stylometry can be traced back to the mid 19th century, when the English logician Augustus de Morgan suggested word length could be an indicator of authorship. The real impact did not come until 1964, when two American statisticians, Mosteller and Wallace, decided to use word frequencies to investigate the mystery of the authorship of The Federalist Papers. Their conclusion from statistical discrimination methods agreed with historical scholarship, which gave stylometry much needed credence [7]. Today, with the help of many other significant events in stylometry, its techniques are being widely applied in various areas, such as disease detection and court trials. Examples of the use of stylometric techniques as a forensic tool include the appeal in London of Tommy McCrossen in July 1991 and the pardon for Nicky Kelly from the Irish government in April 1992, among others [7].

¹ During 1787 and 1788, 77 articles were anonymously published to persuade New Yorkers to support ratification of the new US constitution. Most of the articles were later attributed credibly to their authors, except for 12 of them which were claimed to be written by both General Alexander Hamilton and President James Madison.

1.1 Features

There is no consensus in the discipline as to what characteristic features to use or what methodology or techniques to apply in standard research, which is precisely the greatest problem confronting stylometry. Most of the experiments in stylometry are directed at different authors with different techniques; there has not been a large-scale comparison of results as to which features are generally more representative or which methods are typically more effective. Rudman (1998) [11] claims that there is no clear agreement on which style markers are significant, and notes that particular words may be used for a specific classification (like The Federalist Papers) but cannot be counted on for style analysis in general. Many different kinds of tests have been proposed for use in author identification. Angela Glover (1996) [1] gives a comprehensive table of the features used in the tests:

"The following are some (tests) grouped by the degree of linguistic analysis of the data that is required for the test to be carried out.

1) Tagged text
   a. Type/token ratio
   b. Distribution of word classes (parts of speech)
   c. Distribution of verb forms (tense, aspect, etc.)
   d. Frequency of word parallelism
   e. Distribution of word-class patterns
   f. Distribution of nominal forms (e.g., gerunds)
   g. Richness of vocabulary

2) Parsed text
   a. Frequency of clause types
   b. Distribution of direction of branching
   c. Frequency of syntactic parallelism
   d. Distribution of genitive forms (of and 's)
   e. Distribution of phrase structures
   f. Frequency of types of sentences
   g. Frequency of topicalization
   h. Ratio of main to subordinate clauses
   i. Distribution of case frames
   j. Frequency of passive voice

3) Interpreted text
   a. Frequency of negation
   b. Frequency of deixis
   c. Frequency of hedges and markers of uncertainty
   d. Frequency of semantic parallelism
   e. Degree of alternative word use"

It is reasonable that stylometers would not agree on the characteristic features to be analyzed in stylometry. Human languages are subtle, with many unquantifiable yet salient qualities. Idiolects, moreover, complicate the picture by highlighting particular features that are not shared by the whole language community. Individual studies on specific authors, understandably, rely on completely different measures to identify an author.

1.2 Classifiers

Efstathios Stamatatos (2000) [3] states that there is, as of today, no computational system that can distinguish the texts of a randomly chosen group of authors without the assistance of a human in the selection of both the most appropriate set of style markers and the most accurate disambiguation procedure.

Although there is a large variety in the methodology of stylometry, the techniques may be roughly divided into two classes: statistical methods and automated pattern recognition methods. The statistical group of methods normally features the application of Bayes' Rule in various ways to predict the probability of authorship; non-Bayesian statistical analyses have also been done by setting weights on different features. Statistical grouping techniques such as cluster analysis have proven useful by seeking to form clusters of individuals such that individuals within a cluster are more similar in some sense than individuals from other clusters [8]. The most widely used among these is Principal Components Analysis (PCA), which arranges the observed variables in decreasing order of importance and plots the data in the space of just the first two components so that clustering is clearly visible [2, 6]. N-gram analysis is another commonly used technique. Text categorization, a related field, uses statistical methods like the bag of words to classify documents. Similar methods have also been tried for author attribution.

Automated pattern recognition has also been used, though not as much [9, 12, 13]. This involves training a neural network to learn the style of the texts. Neural networks have been used to learn the general style (and the genre) of a text and in text categorization. They have also been used to learn one particular author's style – to distinguish that one author from all others or from another particular author.

2 Components of the Problem

With automatic parsing algorithms getting more sophisticated, we believe that the role of artificial intelligence techniques in stylometry has great potential. An automatic text reader to parse the documents and extract the data needed for the experiments was written in Prolog.

2.1 Authors

Author attribution of documents – the most popular use of author recognition using stylometry – usually involves attributing an author to a document when there is some doubt about its authorship. This usually involves proving (or disproving) that the document was written by a particular author to whom it has previously been attributed [10]. At times, it also involves deciding between two authors. So, usually, author recognition is done to learn the style of one (or two) author(s). This is inherently different from learning the styles of a number of authors (or "learning style" in general), because in the case of one (or two) authors, the list of features chosen may simply represent those authors' particular idiosyncrasies. These may not in general be good features that can be taken to be representative of authors' styles. For example, usage of the words "no matter" may be a good feature that separates author X from author Y, but it may not be a feature that can be used to distinguish authors in general. In this paper we try to learn the styles of five authors, in the hope that the learners will learn the general "style" of authors rather than the idiosyncrasies of a particular author.

Also, documents can have a distinctive style without it being the style of the author. For instance, texts from the Victorian era (written by any author) would seem to have the same style to a reader of this era. Thus, some care was needed in the selection of the texts so that they would not be easily distinguishable by their genre or period. Also, the authors should not have very different styles – or learning to differentiate between them would be trivial.

Five popular Victorian authors were chosen for the experiments: Jane Austen, Charles Dickens, William Thackeray, Emily Brontë and Charlotte Brontë.² The ASCII files of their works are freely downloadable from the Project Gutenberg website.

² In fact, much previous research included Austen and the Brontë sisters because of their similarity.

2.2 Features Used

Hanlein's empirical research (1999) has yielded a set of individual-style features, from which the 21 style indicators in the present study are derived:

1. type-token ratio: The type-token ratio indicates the richness of an author's vocabulary. The higher the ratio, the more varied the vocabulary. It also reflects an author's tendency to repeat words.
2. mean word length: Longer words are traditionally associated with more pedantic and formal styles, whereas shorter words are a typical feature of informal spoken language.
3. mean sentence length: Longer sentences are often the indicator of carefully planned writing, while shorter sentences are more characteristic of spoken language.³
4. standard deviation of sentence length: The standard deviation indicates the variation of sentence length, which is an important marker of style.
5. mean paragraph length: The paragraph length is much influenced by the occurrence of dialogue.
6. chapter length: The length of the sample chapter.
7. number of commas per thousand tokens: Commas signal the ongoing flow of ideas within a sentence.
8. number of semicolons per thousand tokens: Semicolons indicate the reluctance of an author to stop a sentence where (s)he could.
9. number of quotation marks per thousand tokens: Frequent use of quotations is considered a typical involvement-feature [5].
10. number of exclamation marks per thousand tokens: Exclamations signal strong emotions.
11. number of hyphens per thousand tokens: Some authors use hyphenated words more than others.
12. number of ands per thousand tokens: Ands are markers of coordination, which, as opposed to subordination, is more frequent in spoken production.
13. number of buts per thousand tokens: The contrastive linking buts are markers of coordination too.
14. number of howevers per thousand tokens: The conjunct "however" is meant to form a contrastive pair with "but".
15. number of ifs per thousand tokens: If-clauses are samples of subordination.
16. number of thats per thousand tokens: Most thats are used for subordination, while a few are used as demonstratives.
17. number of mores per thousand tokens: "More" is an indicator of an author's preference for comparative structure.
18. number of musts per thousand tokens: Modal verbs are potential candidates for expressing tentativeness [5]. Musts are more often used non-epistemically.
19. number of mights per thousand tokens: Mights are more often used epistemically.
20. number of thiss per thousand tokens: Thiss are typically used for anaphoric reference.
21. number of verys per thousand tokens: Verys are stylistically significant for their emphasis on their modifiees.

³ Mean sentence length has been considered by some as unreliable, and the frequency distributions of the logarithms of sentence length have been used as well.

These attributes include some very unconventional markers. It is also unconventional to use as few as twenty-one attributes in stylistic studies; usually, a huge number of features (typically function words) are used.

In the following two sections, we discuss the learning methods used and describe the experiments and the results.

3. Decision Trees

Decision trees are rule-based, explicit learners. Two decision trees were made to learn the styles of the authors. Text categorization uses decision trees extensively, but they have not been used in author identification. Due to their rule-based nature, it is easy to read and understand a decision tree; thus, it is possible to "see the style" in the texts with these trees. Also, the results indicate that the unconventional features we used are quite useful in classifying the authors.
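The 21 style markers of Section 2.2 reduce to simple counts over a tokenized chapter. As an illustration, the following Python sketch computes a handful of them; the regex tokenizer and the helper names are our own simplifications for exposition, not the authors' Prolog-based text reader, so exact counts will differ from theirs.

```python
import re

def style_markers(text):
    """Compute a few of the 21 style markers from a chapter of raw text.

    Illustrative only: a crude regex tokenizer stands in for the
    authors' Prolog text reader.
    """
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n = len(tokens)
    per_thousand = lambda count: 1000.0 * count / n

    return {
        # 1. type-token ratio: richness of vocabulary
        "type_token_ratio": len(set(tokens)) / n,
        # 2. mean word length in characters
        "mean_word_length": sum(len(t) for t in tokens) / n,
        # 3. mean sentence length in tokens
        "mean_sentence_length": n / len(sentences),
        # 7-8. punctuation rates per thousand tokens
        "comma": per_thousand(text.count(",")),
        "semicolon": per_thousand(text.count(";")),
        # 12-13. coordination markers per thousand tokens
        "and": per_thousand(tokens.count("and")),
        "but": per_thousand(tokens.count("but")),
    }

sample = "It was the best of times; it was the worst of times."
print(style_markers(sample)["semicolon"])
```

Normalizing the counts per thousand tokens, as the paper does, keeps chapters of different lengths comparable.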
3.1 Experiments and Results

The See5 package by Quinlan, which extends Quinlan's basic ID3 algorithm, was used in this experiment. It infers decision trees by growing them from the root downward, greedily selecting the next best attribute for each new decision branch added to the tree. Decision tree learning thus differs from other machine learning techniques, such as neural networks, in that attributes are considered individually rather than in connection with one another. The feature with the greatest information gain is given priority in classification. Therefore, decision trees should work very well if there are some salient features that distinguish one author from the others.

Since the attributes tested are continuous, all the decision trees were constructed using the fuzzy threshold parameter, so that the knife-edge behavior of decision trees is softened by constructing an interval close to the threshold. The decision tree constructed without taking any other options results in an error rate of 3.3% on the training set and an error rate of 23.5% on the test set (averaged from cross validation), which is far better than random guessing (20% accuracy) but still not satisfactory. To improve the performance of the classifier, two different measures were taken: winnowing and boosting.

3.1.1 WINNOWING

When the number of attributes is large, it becomes harder to distinguish predictive information from chance coincidences. Winnowing overcomes this problem by investigating the usefulness of all attributes before any classifier is constructed. Attributes found to be irrelevant or harmful to predictive performance are disregarded ("winnowed") and only the remaining attributes are used to construct the trees. As the relevance of features is one of the things we experiment on, the package's winnowing option was used; this also makes building the trees a lot faster.

The result of this part of the experiment is as follows:

Decision tree:

semicolon <= 6.72646 (8.723425): charles (18/2)
semicolon >= 8.97227 (8.723425):
:...semicolon >= 16.2095 (14.64195): charlotte (11/1)
   semicolon <= 14.6293 (14.64195):
   :...but >= 4.71113 (4.69338): jane (15/1)
      but <= 4.67563 (4.69338):
      :...quotation mark <= 12.285 (14.7294): emily (10)
         quotation mark >= 17.1738 (14.7294): william (7)

Evaluation on the training data (61 cases): tree size 5, 4 errors (6.6%). Per author, 14 of 15 Austen chapters, 16 of 16 Dickens chapters, 7 of 10 Thackeray chapters, 10 of 10 Emily Brontë chapters and 10 of 10 Charlotte Brontë chapters are classified correctly.

Evaluation on the test data (17 cases): tree size 5, 3 errors (17.6%). Per author, 4 of 5 Austen chapters, 5 of 6 Dickens chapters, 2 of 2 Thackeray chapters, 1 of 2 Emily Brontë chapters and 2 of 2 Charlotte Brontë chapters are classified correctly.

As shown, the decision tree has an accuracy of 82.4% on the 17 patterns in the test set. The tree also shows that the semicolon and quotation marks were relevant attributes that define style (as they survived the winnowing).

3.1.2 BOOSTING

Boosting is a wrapping technique for generating and combining classifiers to give improved predictive accuracy. Boosting takes a generic learning algorithm and adjusts the distribution given to it (by removing some training data) based on the algorithm's behavior. The basic idea is that, as learning progresses, the booster samples the input distribution so as to keep the accuracy of the learner's current hypothesis near that of random guessing; as a result, the learning process focuses on the currently hard data. The boosting option in the package causes a number of classifiers to be constructed; when a case is classified, all the classifiers are consulted before a decision is made. Boosting often gives higher predictive accuracy at the expense of increased classifier construction time. In this case, it yields the same error rate as winnowing on the test set, but has 100% accuracy on the training set.

The result of this part of the experiment is as follows:

Decision tree that yields the best result:

semicolon <= 8.47458 (8.723425):
:...might >= 498339 (249170.4): william (2.5)
:  might <= 1.86846 (249170.4):
:  :...sen len stdev <= 23.1772 (24.8575): charles (14.9)
:     sen len stdev >= 26.5378 (24.8575): william (2.2)
semicolon >= 8.97227 (8.723425):
:...semicolon >= 14.6546 (14.64195):
:  :...sen len stdev <= 10.2292 (12.68945): jane (1.9)
:     sen len stdev >= 15.1497 (12.68945): charlotte (10.4)
   semicolon <= 14.6293 (14.64195):
   :...but <= 4.67563 (4.69338):
      :...mean sen len <= 23.2097 (24.9242): emily (8.2)
      :  mean sen len >= 26.6387 (24.9242): william (6.4)
      but >= 4.71113 (4.69338):
      :...this <= 2.46609 (3.4445): jane (12.6)
         this >= 4.42291 (3.4445): william (1.9)

Evaluation on the training data (61 cases), by boosting trial:

Trial   Tree size   Errors
0       7           2 (3.3%)
1       6           5 (8.2%)
2       6           15 (24.6%)
3       6           10 (16.4%)
4       8           4 (6.6%)
5       7           8 (13.1%)
6       6           6 (9.8%)
7       6           10 (16.4%)
8       9           0 (0.0%)
boost               0 (0.0%)  <<

With boosting, all 61 training cases (15 Austen, 16 Dickens, 10 Thackeray, 10 Emily Brontë and 10 Charlotte Brontë chapters) are classified correctly.

Evaluation on the test data (17 cases), by boosting trial:

Trial   Tree size   Errors
0       7           4 (23.5%)
1       6           3 (17.6%)
2       6           11 (64.7%)
3       6           12 (70.6%)
4       8           2 (11.8%)
5       7           5 (29.4%)
6       6           3 (17.6%)
7       6           10 (58.8%)
8       9           4 (23.5%)
boost               3 (17.6%)  <<

On the test set, the boosted classifier makes 3 errors: 4 of 5 Austen chapters, 5 of 6 Dickens chapters, 2 of 2 Thackeray chapters, 1 of 2 Emily Brontë chapters and 2 of 2 Charlotte Brontë chapters are classified correctly. As shown, the boosted decision tree also has an accuracy of 82.4% on the 17 patterns in the test set.

Noticeably, the two decision trees are constructed using different features and result in different classifications: in particular, Chapter 1 of A Tale of Two Cities is classified as written by Emily Brontë in the first case and as written by William Thackeray in the second. Many of the conventional input attributes do not help in the classification; yet the frequency of certain punctuation marks appears crucial in both decision trees, which supports our hypothesis that a very important aspect of unconscious writing has been unduly ignored by stylometers.

4. Neural Networks

Neural networks are powerful pattern matching tools. They are essentially complex non-linear modeling equations, and they are especially good in situations where the "concept" to be learned is difficult to express as a well-defined, simple formulation and is instead a complex, highly interrelated function of the inputs that is not easily transparent. This makes them ideal for learning a concept like "style", which is inherently intangible. The network takes all input attributes into consideration simultaneously, though some are weighted more heavily than others. This differs from decision tree construction in that the style of an author is taken to be the joint product of many different features. Artificial neural networks have the ability to invent new features that are not explicit in the input, yet they also have the drawback that their inductive rules are inaccessible to humans.

4.1 Experiments and Results

NeuroShell, the commercial software package created by the Ward Systems Group, Inc. that creates and runs various types of neural networks, was used in this experiment.

Many structures of the multilayer network were experimented with before we arrived at our best network. Kohonen Self-Organizing Maps, probability networks, and networks based on statistical models were some of the architectures tried. Backpropagation feed-forward networks yielded the best results with the following architecture: 21 input nodes, 15 nodes on the first hidden layer, 11 nodes on the second hidden layer, and 10 output nodes (to act as error correcting codes). Two output nodes are allotted to a single author. (This increases the Hamming distance between the classifications – the bit string that is output, with each bit corresponding to one author – of any two authors, thus decreasing the possibility of misclassification.) 30% of the 61 training samples were used as the validation set, which determines whether overfitting has occurred and when to stop training. The remaining 17 samples were used for testing. This testing was also used to decide the architecture of the network.
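The role of the ten output nodes as an error-correcting code can be sketched in a few lines of Python. This is an illustrative reconstruction of the coding scheme described above, not NeuroShell's actual mechanism; the function and variable names are our own.

```python
# Each author gets a 2-bit slot in a 10-bit code word, so any two code
# words differ in 4 bit positions (Hamming distance 4); a single flipped
# output bit therefore cannot turn one author's pattern into another's.
AUTHORS = ["jane", "charles", "william", "emily", "charlotte"]

def codeword(author):
    """10-bit target vector: only the author's two output nodes are 1."""
    bits = [0] * (2 * len(AUTHORS))
    i = AUTHORS.index(author)
    bits[2 * i] = bits[2 * i + 1] = 1
    return bits

def decode(outputs, threshold=0.5):
    """Threshold the 10 real-valued network outputs, then pick the
    author whose code word is nearest in Hamming distance."""
    bits = [1 if o > threshold else 0 for o in outputs]
    hamming = lambda cw: sum(b != c for b, c in zip(bits, cw))
    return min(AUTHORS, key=lambda a: hamming(codeword(a)))

# A noisy output whose first two nodes fire decodes to the first author.
print(decode([0.9, 0.8, 0.1, 0.6, 0.0, 0.2, 0.1, 0.0, 0.3, 0.1]))  # -> jane
```

Decoding by nearest code word is what makes the redundant second node per author tolerant of one unreliable output.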

On the 61 cases of the training and validation sets, the network misclassifies only one pattern; on the 17 test cases, it makes two misclassifications.

After that, cross validation was done on the chosen network using different test sets. There was not much variation in the results obtained: on average, the accuracy (measured on the test set) is 88.2%. There are two persistent misclassifications: Pride and Prejudice is misclassified as written by Charlotte Brontë, and A Tale of Two Cities is misclassified as written by William Thackeray. These misclassifications were present in all the partitions used for cross validation, irrespective of whether the patterns were in the test set or in the training/validation set.

The error obtained on one of the test sets is plotted in the following graph. It was observed that the network learned to represent each author by its two output nodes, i.e. the two output nodes representing the same author never differed significantly in their values; so the graph shows only one output node per author. It plots the error in all the output nodes for every pattern. Any deviation from zero is an imperfect classification: spikes indicate deviations from perfect classification. Positive spikes indicate that non-authors are classified as authors, and negative spikes indicate that authors are classified as non-authors. If a spike goes above 0.5 (or below -0.5), a misclassification occurs.

[Figure 1: error graph for the neural networks. The y-axis plots the error (from -1 to 1) in each author's output node; the x-axis plots the patterns.]

As can be seen, in the test set (the last quarter), there are two misclassifications, and there is one misclassification in the training set. It is interesting to note that every positive spike is accompanied by a negative spike. This means that every pattern is classified as belonging to one author and only one author. This behavior is not built in: the network is free to predict 0 for all authors, or 1 for more than one author, both meaning that it is undecided, and this was not explicitly avoided. Nevertheless, the network always committed to a single author (whether correctly or not).

Also, the patterns that the decision trees misclassify are not the same as those misclassified by the neural network, although the misclassifications are persistent within each learning technique: for more than one architecture, the same patterns are misclassified by the neural networks, and decision trees with different parameters also have trouble with a single set of patterns (different from the set that troubles the neural networks). The neural network is evidently looking at different attributes from the ones the decision trees use.

5 Future Works

The fact that neural networks look at different features from those considered by decision trees could be put to use: a meta-learner could be trained to use both neural networks and decision trees to get near-perfect classifications.

Different sets of features may be tried to see if there exists a set of features that makes different learning techniques give the same results. Also, feature extraction should be done with some care so as to train these learners on the most relevant features. Automatic feature selection algorithms (like winnowing) can be used to select the initial set of features.

The project could be extended to any number of authors. This project can classify five authors; the next step could be to generate abstract, general inductive rules that can be used in all author identification problems, instead of learning the ways to separate a group of n chosen authors.

Some clustering algorithms should be tried on the data to see if they perform better. Self-Organizing Maps were tried, but they did not yield good results. Other statistical methods may provide better results.

6 Conclusions

This paper attempts to recognize different authors based on their style of writing (without help from genre or time). Both decision trees and artificial neural networks yield a significantly higher accuracy rate than random guessing, which shows that the assumption that there is a quantifiable unconscious aspect to an author's style is well justified. We achieved 82.4% accuracy on the test set using decision trees and 88.2% accuracy on the test set using neural networks. Although the neural networks yielded a better numerical performance, and are considered inherently suitable for capturing an intangible concept like style, the decision trees are human readable, making it possible to define style. Also, the results obtained by the two methods are comparable.

Most of the previous studies in stylometry input a large number of attributes, such as the frequencies of fifty-nine function words in some recent Federalist Papers research. While more features could produce additional discriminatory material, the present study shows that artificial intelligence provides stylometry with excellent classifiers that require fewer input variables than traditional statistics. The unconventional features used as style markers proved to be effective. The demonstrated efficiency of the automatic classifiers marks them as exciting tools for the future in stylometry's continuing evolution (Holmes 1998: 115). We believe that the combination of stylometry and AI will result in a useful discipline with many practical applications.

Acknowledgements

We would like to thank Dr. Covington, Associate Director of the Artificial Intelligence Center, for his advice and guidance. We are grateful to Dr. Potter, graduate coordinator of the Artificial Intelligence Center, UGA, for his help.

References

[1] Glover, A. and Hirst, G. "Detecting stylistic inconsistencies in collaborative writing." In The new writing environment: Writers at work in a world of technology, edited by Mike Sharples and Thea van der Geest. London: Springer-Verlag, 1996.
[2] Binongo, J. N. G. and Smith, M. W. A. "The application of principal component analysis to stylometry." Literary and Linguistic Computing, Vol. 14, No. 4, pp. 445-465, 1999.
[3] Stamatatos, E., Fakotakis, N. and Kokkinakis, G. "Automatic Text Categorization in Terms of Genre and Author." Computational Linguistics, 26(4), pp. 471-495, December 2000.
[4] Forsyth, R. S. "Cicero, Sigonio, and Burrows: investigating the authenticity of the Consolatio." Literary and Linguistic Computing, Vol. 14, No. 3, pp. 311-332, 1999.
[5] Hanlein, H. Studies in Authorship Recognition: a Corpus-based Approach. Peter Lang, 1999.
[6] Baayen, H., van Halteren, H., Neijt, A. and Tweedie, F. "An experiment in authorship attribution." JADT 2002: 6th International Conference on the Statistical Analysis of Textual Data, 2002.
[7] Holmes, D. I. "The evolution of stylometry in humanities scholarship." Literary and Linguistic Computing, Vol. 13, No. 3, pp. 111-117, 1998.
[8] Holmes, D. I. and Forsyth, R. S. "The Federalist Revisited: New Directions in Authorship Attribution." Literary and Linguistic Computing, Vol. 10, No. 2, pp. 111-127, 1995.
[9] Hoorn, J. F. "Neural network identification of poets using letter sequences." Literary and Linguistic Computing, Vol. 14, No. 3, pp. 311-332, 1999.
[10] Merriam, T. "Heterogeneous authorship in early Shakespeare and the problem of Henry V." Literary and Linguistic Computing, Vol. 13, No. 1, pp. 15-27, 1998.
[11] Rudman, J. "The state of authorship attribution studies: some problems and solutions." Computers and the Humanities, 31:351-365, 1998.
[12] Tweedie, F., Singh, S. and Holmes, D. I. "Neural Network Applications in Stylometry: The Federalist Papers." Computers and the Humanities, Vol. 30, No. 1, pp. 1-10, 1996.
[13] Waugh, S. "Computational stylistics using artificial neural networks." Literary and Linguistic Computing, Vol. 15, No. 2, pp. 187-197, 2000.