Literary authorship attribution with phrase-structure fragments

Andreas van Cranenburgh
Huygens ING, Royal Netherlands Academy of Arts and Sciences
P.O. box 90754, 2509 LT The Hague, the Netherlands
[email protected]

Abstract

We present a method of authorship attribution and stylometry that exploits hierarchical information in phrase-structures. Contrary to much previous work in stylometry, we focus on content words rather than function words. Texts are parsed to obtain phrase-structures, and compared with texts to be analyzed. An efficient tree kernel method identifies common tree fragments among data of known authors and unknown texts. These fragments are then used to identify authors and characterize their styles. Our experiments show that the structural information from fragments provides complementary information to the baseline trigram model.

[Figure 1: A phrase-structure tree produced by the Stanford parser, for the sentence "Happy families are all alike; every unhappy family is unhappy in its own way."]

1 Introduction

The task of authorship attribution (for an overview cf. Stamatatos, 2009) is typically performed with superficial features of texts such as sentence length, word frequencies, and use of punctuation & vocabulary. While such methods attain high accuracies (e.g., Grieve, 2007), the models make purely statistical decisions that are difficult to interpret. To overcome this we could turn to higher-level patterns of texts, such as their syntactic structure.

Syntactic stylometry was first attempted by Baayen et al. (1996), who looked at the distribution of frequencies of grammar productions.[1] More recently, Raghavan et al. (2010) identified authors by deriving a probabilistic grammar for each author and picking the author grammar that can parse the unidentified text with the highest probability. There is also work that looks at syntax on a more shallow level, such as Hirst and Feiguina (2007), who work with partial parses; Wiersma et al. (2011) looked at n-grams of part-of-speech (POS) tags, and Menon and Choi (2011) focussed on particular word frequencies such as those of 'stop words,' attaining accuracies well above 90% even in cross-domain tasks.

[1] A grammar production is a rewrite rule that generates a constituent.

In this work we also aim to perform syntactic stylometry, but we analyze syntactic parse trees directly, instead of summarizing the data as a set of grammar productions or a probability measure. The unit of comparison is tree fragments. Our hypothesis is that the use of fragments can provide a more interpretable model compared to one that uses fine-grained surface features such as word tokens.

2 Method

We investigate a corpus consisting of a selection of novels from a handful of authors. The corpus was selected to contain works from different time periods from authors with a putatively distinctive style. In order to analyze the syntactic structure of the corpus we use hierarchical phrase-structures, which divide sentences into a series of constituents that are represented in a tree-structure; cf. figure 1 for an example.

Author (sentences)           Works (year of first publication)
Conrad, Joseph (25,889)      Heart of Darkness (1899), Lord Jim (1900),
                             Nostromo (1904), The Secret Agent (1907)
Hemingway, Ernest (40,818)   A Farewell To Arms (1929), For Whom the Bell Tolls (1940),
                             The Garden of Eden (1986), The Sun Also Rises (1926)
Huxley, Aldous (23,954)      Ape and Essence (1948), Brave New World (1932),
                             Brave New World Revisited (1958), Crome Yellow (1921),
                             Island (1962), The Doors of Perception (1954),
                             The Gioconda Smile (1922)
Salinger, J.D. (26,006)      Franny & Zooey (1961), Nine Stories (1953),
                             The Catcher in the Rye (1951), Short stories (1940–1965)
Tolstoy, Leo (66,237)        Anna Karenina (1877), transl. Constance Garnett;
                             Resurrection (1899), transl. Louise Maude;
                             The Kreutzer Sonata and Other Stories (1889),
                             transl. Benjamin R. Tucker; War and Peace (1869),
                             transl. Aylmer Maude & Louise Maude

Table 1: Works in the corpus. Note that the works by Tolstoy are English translations from project Gutenberg; the translations are contemporaneous with the works of Conrad.

[Figure 2: A phrase-structure fragment from the tree in figure 1, covering "are all alike" with an unexpanded NP node.]
We analyze phrase-structures using the notion of tree fragments (referred to as subset trees by Collins and Duffy, 2002). This notion is taken from the framework of Data-Oriented Parsing (Scha, 1990), which hypothesizes that language production and comprehension exploit an inventory of fragments from previous language experience that are used as building blocks for novel sentences. In our case we can surmise that literary authors might make use of a specific inventory in writing their works, which characterizes their style. Fragments can be characterized as follows:

Definition. A fragment f of a tree T is a connected subset of nodes from T, with |f| ≥ 2, such that each node of f has either all or none of the children of the corresponding node in T.

When a node of a fragment has no children, it is called a frontier node; in a parsing algorithm such nodes function as substitution sites where the fragment can be combined with other fragments. Cf. figure 2 for an example of a fragment. An important consideration is that fragments can be of arbitrary size. The notion of fragments captures anything from a single context-free production such as

(1) S → NP VP

. . . to complete stock phrases such as

(2) Come with me if you want to live.

In other words, instead of making assumptions about grain size, we let the data decide. This is in contrast to n-gram models, where n is an a priori defined sliding window size which must be kept low because of data-sparsity considerations.
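For illustration, the definition above can be made concrete with a small self-contained Python sketch; this is an illustrative reimplementation rather than the code used in the experiments, and the Node class and helper names are only for exposition. It encodes the all-or-none condition on children and identifies frontier nodes, using the fragment of figure 2 as an example:

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass(eq=False)  # identity-based hashing, so nodes can be put in sets
    class Node:
        """A phrase-structure tree node; terminals are nodes without children."""
        label: str
        children: List["Node"] = field(default_factory=list)
        parent: Optional["Node"] = None

        def __post_init__(self):
            for child in self.children:
                child.parent = self

    def is_fragment(nodes):
        """The definition: a connected subset of >= 2 nodes such that each node
        keeps either all or none of the children it has in the full tree."""
        if len(nodes) < 2:
            return False
        # Connectedness: exactly one node has its parent outside the subset.
        if sum(1 for n in nodes if n.parent not in nodes) != 1:
            return False
        # All-or-none condition on children.
        for n in nodes:
            kept = [c for c in n.children if c in nodes]
            if kept and len(kept) != len(n.children):
                return False
        return True

    def frontier_nodes(nodes):
        """Nodes that keep none of their children: the substitution sites."""
        return [n for n in nodes
                if n.children and not any(c in nodes for c in n.children)]

    # First clause of the tree in figure 1: "Happy families are all alike"
    np = Node("NP", [Node("JJ", [Node("Happy")]), Node("NNS", [Node("families")])])
    are = Node("VBP", [Node("are")])
    all_ = Node("RB", [Node("all")])
    alike = Node("RB", [Node("alike")])
    adjp = Node("ADJP", [all_, alike])
    vp = Node("VP", [are, adjp])
    s = Node("S", [np, vp])

    # The fragment of figure 2: S and VP are fully expanded, NP is left unexpanded.
    frag = {s, np, vp, are, adjp, all_, alike,
            are.children[0], all_.children[0], alike.children[0]}
    print(is_fragment(frag))                        # True
    print([n.label for n in frontier_nodes(frag)])  # ['NP']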
To obtain phrase-structures of the corpus we employ the Stanford parser (Klein and Manning, 2003), which is a treebank parser trained on the Wall Street Journal (WSJ) section of the Penn treebank (Marcus et al., 1993). This unlexicalized parser attains an accuracy of 85.7% on the WSJ benchmark (|w| ≤ 100). Performance is probably much worse when parsing text from a different domain, such as literature; for example, dialogue and questions are not well represented in the news domain on which the parser is trained. Despite these issues we expect that useful information can be extracted from the latent hierarchical structure that is revealed in parse trees, specifically in how patterns in this structure recur across different texts.

We pre-process all texts manually to strip away dedications, epigraphs, prefaces, tables of contents, and other such material. We also verified that no occurrences of the author names remained.[2] Sentence and word-level tokenization is done by the Stanford parser. Finally, the parser assigns the most likely parse tree for each sentence in the corpus. No further training is performed; as our method is memory-based, all computation is done during classification.

[2] Exception: War and Peace contains a character with the same name as its author. However, since this occurs in only one of the works, it cannot affect the results.

In the testing phase the author texts from the training sections are compared with the parse trees of texts to be identified. To do this we modified the fragment extraction algorithm of Sangati et al. (2010) to identify the common fragments among two different sets of parse trees.[3] This is a tree kernel method (Collins and Duffy, 2002) which uses dynamic programming to efficiently extract the maximal fragments that two trees have in common. We use the variant reported by Moschitti (2006), which runs in average linear time in the number of nodes in the trees.

[3] The code used in the experiments is available at http://github.com/andreasvc/authident.

To identify the author of an unknown text we collect the fragments which it has in common with each known author. In order to avoid biases due to different sizes of each author corpus, we use the first 15,000 sentences from each training section. From these results all fragments which were found in more than one author corpus are removed. The remaining fragments, which are unique to each author, are used to compute a similarity score.

We have explored different variations of similarity scores, such as the number of nodes, the average number of nodes, or the fragment frequencies. A simple method which appears to work well is to count the total number of content words.[4] Given the parse trees of a known author A and those of an unknown author B, with their unique common fragments denoted as A ⊓ B, the resulting similarity is defined as:

    f(A, B) = \sum_{x \in A \sqcap B} \mathrm{contentwords}(x)

[4] Content words consist of nouns, verbs, adjectives, and adverbs. They are identified by the part-of-speech tags that are part of the parse trees.

However, while the number of sentences in the training sets has been fixed, they still diverge in the average number of words per sentence, which is reflected in the number of nodes per tree as well. This causes a bias because, statistically, there is a higher chance that some fragment in a larger tree will match with another. Therefore we also normalize for the average number of nodes, where |t| denotes the number of nodes in tree t. The author can now be guessed as:

    \arg\max_{A \in \mathrm{Authors}} \frac{f(A, B)}{\frac{1}{|A|} \sum_{t \in A} |t|}
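As a minimal sketch of this decision rule (illustrative rather than the experimental code), assume the fragment extractor has already produced, for each author, the fragments uniquely shared with the unknown text, each represented as a list of (POS tag, word) leaves, together with the average number of nodes per training tree; these data structures and helper names are assumptions made for the example:

    CONTENT_TAGS = ("NN", "VB", "JJ", "RB")  # nouns, verbs, adjectives, adverbs (footnote 4)

    def content_words(leaves):
        """Number of lexicalized leaves whose POS tag marks a content word."""
        return sum(1 for pos, word in leaves
                   if word is not None and pos.startswith(CONTENT_TAGS))

    def similarity(fragments):
        """f(A, B): total content words over the unique common fragments A ⊓ B."""
        return sum(content_words(frag) for frag in fragments)

    def guess_author(unique_common, avg_nodes):
        """arg max over authors of f(A, B) divided by the average tree size of A."""
        return max(unique_common,
                   key=lambda author: similarity(unique_common[author]) / avg_nodes[author])

    # Toy input with made-up values; frontier leaves carry no word (None).
    unique_common = {
        "Conrad":    [[("DT", None), ("NN", "sort"), ("IN", "of"), ("JJ", None)]],
        "Hemingway": [[("VB", "have"), ("DT", "a"), ("NN", "drink")]],
    }
    avg_nodes = {"Conrad": 55.0, "Hemingway": 38.0}
    print(guess_author(unique_common, avg_nodes))  # 'Hemingway' on this toy input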
Note that working with content words does not mean that the model reduces to an n-gram model, because fragments can be discontiguous; e.g., "he said X but Y." Furthermore, the fragments contain hierarchical structure, while n-grams do not. To verify this contention, we also evaluate our model with trigrams instead of fragments. For this we use trigrams of word & part-of-speech pairs, with words stemmed using Porter's algorithm. With trigrams we simply count the number of trigrams that one text shares with another. Raghavan et al. (2010) have observed that the lexical information in n-grams and the structural information from a PCFG perform a complementary role, achieving the highest performance when both are combined. We therefore also evaluate with a combination of the two.

3 Evaluation & Discussion

Our data consist of a collection of novels from five authors; see table 1 for a specification. We perform cross-validation on 4 works per author. We evaluate on two different test sizes: 20 and 100 sentences. We test with a total of 500 sentences per work, which gives 25 and 5 datapoints per work given these sizes. As training sets, only the works that are not tested on are presented to the model. The training sets consist of 15,000 sentences taken from the remaining works. Evaluating the model on these test sets took about half an hour on a machine with 16 cores, employing less than 100 MB of memory per process. The similarity functions were explored on a development set; the results reported here are from a separate test set.

The authorship attribution results are in table 2. It is interesting to note that even with three different translators, the work of Tolstoy can be successfully identified; i.e., the style of the author is modelled, not the translator's.

20 sentences     trigrams   fragments   combined
Conrad              83.00       87.00      94.00
Hemingway           77.00       52.00      81.00
Huxley              86.32       75.79      86.32
Salinger            93.00       86.00      94.00
Tolstoy             77.00       80.00      90.00
average             83.23       76.16      89.09

100 sentences    trigrams   fragments   combined
Conrad             100.00      100.00     100.00
Hemingway          100.00      100.00     100.00
Huxley              89.47       78.95      89.47
Salinger           100.00      100.00     100.00
Tolstoy             95.00      100.00     100.00
average             96.97       95.96      97.98

Table 2: Accuracy in % for authorship attribution with test texts of 20 or 100 sentences.
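The trigram baseline behind the first column of table 2 admits an equally small illustrative sketch. The version below assumes the input is already tagged and counts shared trigram types, which is one possible reading of "the number of trigrams that one text shares with another"; the function names and the choice of types rather than token occurrences are assumptions:

    from nltk.stem import PorterStemmer  # Porter's algorithm, as used for the baseline

    stemmer = PorterStemmer()

    def trigrams(tagged_sentences):
        """Word & part-of-speech trigrams of a text given as (word, POS) sequences,
        with words lowercased and Porter-stemmed."""
        result = set()
        for sentence in tagged_sentences:
            pairs = [(stemmer.stem(word.lower()), pos) for word, pos in sentence]
            for i in range(len(pairs) - 2):
                result.add(tuple(pairs[i:i + 3]))
        return result

    def trigram_similarity(text_a, text_b):
        """Number of trigram types shared by the two texts."""
        return len(trigrams(text_a) & trigrams(text_b))

    # Toy usage with two tiny pre-tagged 'texts':
    a = [[("Happy", "JJ"), ("families", "NNS"), ("are", "VBP"), ("all", "RB"), ("alike", "RB")]]
    b = [[("Unhappy", "JJ"), ("families", "NNS"), ("are", "VBP"), ("all", "RB"), ("different", "JJ")]]
    print(trigram_similarity(a, b))  # 1: the stemmed trigram (families, are, all) is shared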

Gamon (2004) also classifies chunks of 20 sentences, but note that in his methodology data for training and testing includes sentences from the same work. Recognizing the same work is easier because of recurring topics and character names.

Grieve (2007) uses opinion columns of 500–2,000 words, which amounts to 25–100 sentences, assuming an average sentence length of 20 words. Most of the individual algorithms in Grieve (2007) score much lower than our method, when classifying among 5 possible authors like we do, while the accuracies are similar when many algorithms are combined into an ensemble. Although the corpus of Grieve is carefully controlled to contain comparable texts written for the same audience, our task is not necessarily easier, because large differences within the works of an author can make classifying that author more challenging.

Table 3 shows a confusion matrix when working with 20 sentences. It is striking that the errors are relatively asymmetric: if A is often confused with B, it does not imply that B is often confused with A. This appears to indicate that the similarity metric has a bias towards certain categories which could be removed with a more principled model.

             Conrad   Hemingway   Huxley   Salinger   Tolstoy
Conrad           94           1        2          3
Hemingway         3          81       11          5
Huxley            5           2       82          1         5
Salinger          1           2        3         94
Tolstoy                       8        2                    90

Table 3: Confusion matrix when looking at 20 sentences with trigrams and fragments combined. The rows are the true authors, the columns the predictions of the model.

Here are some examples of sentence-level and productive fragments that were found:

(3) Conrad: [PP [IN ] [NP [NP [DT ] [NN sort ] ] [PP [IN of ] [NP [JJ ] [NN ] ] ] ] ]

(4) Hemingway: [VP [VB have ] [NP [DT a ] [NN drink ] ] ]

(5) Salinger: [NP [DT a ] [NN ] [CC or ] [NN something ] ]

(6) Salinger: [ROOT [S [NP [PRP I ] ] [VP [VBP mean ] [SBAR ] ] [. .] ] ]

(7) Tolstoy: [ROOT [SINV [“ “ ] [S ] [, ,] [” ”] [VP [VBD said ] ] [NP ] [, ,] [S [VP [VBG shrugging ] [NP [PRP$ his ] [NNS shoulders ] ] ] ] [. .] ] ]
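The bracketed notation in (3)–(7) can also be inspected mechanically. The following illustrative sketch (again not the experimental code) parses it into nested tuples, treating an empty pair of brackets as a frontier node, and reports the number of nodes and of content words per fragment, the two quantities that figure in the similarity scores above; the parser and the tag list are assumptions made for this example:

    import re

    TOKEN = re.compile(r"\[|\]|[^\s\[\]]+")
    CONTENT_TAGS = ("NN", "VB", "JJ", "RB")  # nouns, verbs, adjectives, adverbs

    def parse(text):
        """Parse the square-bracket notation into (label, children) tuples;
        children are either sub-fragments (tuples) or terminal words (strings)."""
        tokens = TOKEN.findall(text)
        pos = 0

        def node():
            nonlocal pos
            pos += 1                      # skip '['
            label = tokens[pos]
            pos += 1
            children = []
            while tokens[pos] != "]":
                if tokens[pos] == "[":
                    children.append(node())
                else:
                    children.append(tokens[pos])
                    pos += 1
            pos += 1                      # skip ']'
            return label, children

        return node()

    def count_nodes(tree):
        """Number of labelled nodes (phrasal and POS labels, not terminals)."""
        label, children = tree
        return 1 + sum(count_nodes(c) for c in children if isinstance(c, tuple))

    def count_content_words(tree):
        """Terminals governed by a content-word POS tag."""
        label, children = tree
        own = sum(1 for c in children if isinstance(c, str)) if label.startswith(CONTENT_TAGS) else 0
        return own + sum(count_content_words(c) for c in children if isinstance(c, tuple))

    frag = parse("[VP [VB have ] [NP [DT a ] [NN drink ] ] ]")   # example (4)
    print(count_nodes(frag), count_content_words(frag))          # 5 2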

It is likely that more sophisticated statistics, for example methods used for collocation detection, or general machine-learning methods to select features, such as support vector machines, would make it possible to select only the most characteristic fragments.

On the task of attributing the disputed and co-authored Federalist papers, the fragment model attributes 14 out of 15 papers to Madison (in line with the commonly held view); the 19th paper is (incorrectly) attributed to Jay. To counter the imbalance in training data lengths, we normalized on the number of nodes in the fragments common to training & test texts.

4 Conclusion

We have presented a method of syntactic stylometry that is conceptually simple—we do not resort to sophisticated machine learning or an ensemble of algorithms—and takes sentence-level hierarchical phenomena into account. Contrary to much previous work in stylometry, we worked with content words rather than just function words. We have demonstrated the feasibility of analyzing literary syntax through fragments; the next step will be to use these techniques to address other literary questions.

References

Harold Baayen, H. Van Halteren, and F. Tweedie. 1996. Outside the cave of shadows: Using syntactic annotation to enhance authorship attribution. Literary and Linguistic Computing, pages 121–132.

Michael Collins and Nigel Duffy. 2002. New ranking algorithms for parsing and tagging: Kernels over discrete structures, and the voted perceptron. In Proceedings of ACL.

Michael Gamon. 2004. Linguistic correlates of style: authorship classification with deep linguistic analysis features. In Proceedings of COLING.

Jack Grieve. 2007. Quantitative authorship attribution: An evaluation of techniques. Literary and Linguistic Computing, 22(3):251–270. URL http://llc.oxfordjournals.org/content/22/3/251.abstract.

Graeme Hirst and Olga Feiguina. 2007. Bigrams of syntactic labels for authorship discrimination of short texts. Literary and Linguistic Computing, 22(4):405–417. URL http://llc.oxfordjournals.org/content/22/4/405.abstract.

Dan Klein and Christopher D. Manning. 2003. Accurate unlexicalized parsing. In Proc. of ACL, volume 1, pages 423–430.

Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.

Rohith K. Menon and Yejin Choi. 2011. Domain independent authorship attribution without domain adaptation. In Proceedings of Recent Advances in Natural Language Processing, pages 309–315.

Alessandro Moschitti. 2006. Making tree kernels practical for natural language learning. In Proceedings of EACL, pages 113–120. URL http://acl.ldc.upenn.edu/E/E06/E06-1015.pdf.

Sindhu Raghavan, Adriana Kovashka, and Raymond Mooney. 2010. Authorship attribution using probabilistic context-free grammars. In Proceedings of ACL, pages 38–42.

Federico Sangati, Willem Zuidema, and Rens Bod. 2010. Efficiently extract recurring tree fragments from large treebanks. In Proceedings of LREC, pages 219–226. URL http://dare.uva.nl/record/371504.

Remko Scha. 1990. Language theory and language technology; competence and performance. In Q.A.M. de Kort and G.L.J. Leerdam, editors, Computertoepassingen in de Neerlandistiek, pages 7–22. LVVN, Almere, the Netherlands. Original title: Taaltheorie en taaltechnologie; competence en performance. Translation available at http://iaaa.nl/rs/LeerdamE.html.

Efstathios Stamatatos. 2009. A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3):538–556. URL http://dx.doi.org/10.1002/asi.21001.

Wybo Wiersma, John Nerbonne, and Timo Lauttamus. 2011. Automatically extracting typical syntactic differences from corpora. Literary and Linguistic Computing, 26(1):107–124. URL http://llc.oxfordjournals.org/content/26/1/107.abstract.