
Literary authorship attribution with phrase-structure fragments

Andreas van Cranenburgh
Huygens ING, Royal Netherlands Academy of Arts and Sciences
P.O. box 90754, 2509 LT The Hague, the Netherlands
[email protected]

Abstract

We present a method of authorship attribution and stylometry that exploits hierarchical information in phrase-structures. Contrary to much previous work in stylometry, we focus on content words rather than function words. Texts are parsed to obtain phrase-structures, and compared with texts to be analyzed. An efficient tree kernel method identifies common tree fragments among data of known authors and unknown texts. These fragments are then used to identify authors and characterize their styles. Our experiments show that the structural information from fragments provides complementary information to the baseline trigram model.

Figure 1: A phrase-structure tree produced by the Stanford parser (for the sentence "Happy families are all alike; every unhappy family is unhappy in its own way").

1 Introduction

The task of authorship attribution (for an overview cf. Stamatatos, 2009) is typically performed with superficial features of texts such as sentence length, word frequencies, and use of punctuation & vocabulary. While such methods attain high accuracies (e.g., Grieve, 2007), the models make purely statistical decisions that are difficult to interpret. To overcome this we could turn to higher-level patterns of texts, such as their syntactic structure.

Syntactic stylometry was first attempted by Baayen et al. (1996), who looked at the distribution of frequencies of grammar productions.¹ More recently, Raghavan et al. (2010) identified authors by deriving a probabilistic grammar for each author and picking the author grammar that can parse the unidentified text with the highest probability. There is also work that looks at syntax on a more shallow level, such as Hirst and Feiguina (2007), who work with partial parses; Wiersma et al. (2011) looked at n-grams of part-of-speech (POS) tags, and Menon and Choi (2011) focussed on particular word frequencies such as those of 'stop words,' attaining accuracies well above 90% even in cross-domain tasks.

In this work we also aim to perform syntactic stylometry, but we analyze syntactic parse trees directly, instead of summarizing the data as a set of grammar productions or a probability measure. The unit of comparison is tree fragments. Our hypothesis is that the use of fragments can provide a more interpretable model compared to one that uses fine-grained surface features such as word tokens.

¹ A grammar production is a rewrite rule that generates a constituent.

2 Method

We investigate a corpus consisting of a selection of novels from a handful of authors. The corpus was selected to contain works from different time periods from authors with a putatively distinctive style.

Conrad, Joseph (25,889 sentences): Heart of Darkness (1899), Lord Jim (1900), Nostromo (1904), The Secret Agent (1907)
Hemingway, Ernest (40,818 sentences): A Farewell To Arms (1929), For Whom the Bell Tolls (1940), The Garden of Eden (1986), The Sun Also Rises (1926)
Huxley, Aldous (23,954 sentences): Ape and Essence (1948), Brave New World (1932), Brave New World Revisited (1958), Crome Yellow (1921), Island (1962), The Doors of Perception (1954), The Gioconda Smile (1922)
Salinger, J.D. (26,006 sentences): Franny & Zooey (1961), Nine Stories (1953), The Catcher in the Rye (1951), Short stories (1940–1965)
Tolstoy, Leo (66,237 sentences): Anna Karenina (1877; transl. Constance Garnett), Resurrection (1899; transl. Louise Maude), The Kreutzer Sonata and Other Stories (1889; transl. Benjamin R. Tucker), War and Peace (1869; transl. Aylmer Maude & Louise Maude)

Table 1: Works in the corpus, with years of first publication and the number of sentences per author in parentheses. Note that the works by Tolstoy are English translations from project Gutenberg; the translations are contemporaneous with the works of Conrad.

In order to analyze the syntactic structure of the corpus we use hierarchical phrase-structures, which divide sentences into a series of constituents that are represented in a tree-structure; cf. figure 1 for an example. We analyze phrase-structures using the notion of tree fragments (referred to as subset trees by Collins and Duffy, 2002). This notion is taken from the framework of Data-Oriented Parsing (Scha, 1990), which hypothesizes that language production and comprehension exploit an inventory of fragments from previous language experience that are used as building blocks for novel sentences. In our case we can surmise that literary authors might make use of a specific inventory in writing their works, which characterizes their style. Fragments can be characterized as follows:

Definition. A fragment f of a tree T is a connected subset of nodes from T, with |f| ≥ 2, such that each node of f has either all or none of the children of the corresponding node in T.

When a node of a fragment has no children, it is called a frontier node; in a parsing algorithm such nodes function as substitution sites where the fragment can be combined with other fragments. Cf. figure 2 for an example of a fragment. An important consideration is that fragments can be of arbitrary size. The notion of fragments captures anything from a single context-free production such as

(1) S → NP VP

to complete stock phrases such as

(2) Come with me if you want to live.

In other words, instead of making assumptions about grain size, we let the data decide. This is in contrast to n-gram models where n is an a priori defined sliding window size, which must be kept low because of data-sparsity considerations.

Figure 2: A phrase-structure fragment from the tree in figure 1 (covering the words "are all alike").
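To make the definition concrete, the following is a minimal sketch (ours, not the author's implementation) that enumerates the fragments of a small tree using NLTK's Tree class, which we assume is available. Every node in a fragment keeps either all or none of its children; the enumeration is exponential, so it is meant only to illustrate the definition on toy examples.

```python
from itertools import product
from nltk import Tree  # assumes NLTK is installed

def fragments_rooted_at(node):
    """All fragments rooted at `node`: the node keeps all of its children,
    and each child either becomes a frontier node (no children) or is
    expanded into one of its own fragments."""
    options = []
    for child in node:
        if isinstance(child, Tree):
            frontier = Tree(child.label(), [])
            options.append([frontier] + fragments_rooted_at(child))
        else:
            options.append([child])  # terminal symbols are copied as-is
    return [Tree(node.label(), list(choice)) for choice in product(*options)]

def all_fragments(tree):
    """Every fragment of `tree` according to the definition (a node plus at
    least its children). Exponential; only suitable for toy examples."""
    return [frag for node in tree.subtrees() for frag in fragments_rooted_at(node)]

# A pruned version of the tree in figure 1; the fragment of figure 2
# ("are all alike", with NP as a frontier node) is among the output.
tree = Tree.fromstring(
    '(S (NP (JJ Happy) (NNS families))'
    ' (VP (VBP are) (RB all) (ADJP (RB alike))))')
for fragment in all_fragments(tree)[:5]:
    print(fragment)
```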
To obtain phrase-structures of the corpus we employ the Stanford parser (Klein and Manning, 2003), which is a treebank parser trained on the Wall Street Journal (WSJ) section of the Penn treebank (Marcus et al., 1993). This unlexicalized parser attains an accuracy of 85.7% on the WSJ benchmark (|w| ≤ 100). Performance is probably much worse when parsing text from a different domain, such as literature; for example, dialogue and questions are not well represented in the news domain on which the parser is trained. Despite these issues we expect that useful information can be extracted from the latent hierarchical structure that is revealed in parse trees, specifically in how patterns in this structure recur across different texts.

We pre-process all texts manually to strip away dedications, epigraphs, prefaces, tables of contents, and other such material. We also verified that no occurrences of the author names remained.² Sentence and word-level tokenization is done by the Stanford parser. Finally, the parser assigns the most likely parse tree for each sentence in the corpus. No further training is performed; as our method is memory-based, all computation is done during classification.

In the testing phase the author texts from the training sections are compared with the parse trees of texts to be identified. To do this we modified the fragment extraction algorithm of Sangati et al. (2010) to identify the common fragments among two different sets of parse trees.³ This is a tree kernel method (Collins and Duffy, 2002) which uses dynamic programming to efficiently extract the maximal fragments that two trees have in common. We use the variant reported by Moschitti (2006) which runs in average linear time in the number of nodes in the trees.
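The modified extraction algorithm of Sangati et al. (2010) is not reproduced in the paper; the following is a simplified quadratic sketch of the underlying idea, again assuming NLTK Tree objects: for every pair of nodes with the same production, the largest shared fragment rooted at that pair is built recursively, and diverging subtrees are cut off as frontier nodes. The Moschitti (2006) variant obtains average linear time roughly by restricting attention to node pairs with matching productions rather than all pairs, which the naive loop below does not attempt.

```python
from nltk import Tree  # assumes NLTK is installed

def production(node):
    """The production at `node`: its label plus its children's labels/terminals."""
    return (node.label(),
            tuple(child.label() if isinstance(child, Tree) else child
                  for child in node))

def shared_fragment(a, b):
    """Largest fragment shared by subtrees `a` and `b`, or None if their
    productions differ; children whose subtrees diverge become frontier nodes."""
    if production(a) != production(b):
        return None
    children = []
    for child_a, child_b in zip(a, b):
        if isinstance(child_a, Tree):
            sub = shared_fragment(child_a, child_b)
            children.append(sub if sub is not None
                            else Tree(child_a.label(), []))  # frontier node
        else:
            children.append(child_a)  # identical terminal (productions matched)
    return Tree(a.label(), children)

def common_fragments(tree1, tree2):
    """Maximal fragments that two trees have in common (naive O(n * m) loop)."""
    found = set()
    for a in tree1.subtrees():
        for b in tree2.subtrees():
            fragment = shared_fragment(a, b)
            if fragment is not None:
                found.add(str(fragment))
    return found

t1 = Tree.fromstring('(S (NP (PRP I)) (VP (VBD slept)))')
t2 = Tree.fromstring('(S (NP (PRP I)) (VP (VBD left)))')
print(common_fragments(t1, t2))
# roughly: {'(S (NP (PRP I)) (VP (VBD )))', '(NP (PRP I))', '(PRP I)', '(VP (VBD ))'}
```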
To identify the author of an unknown text we collect the fragments which it has in common with each known author. In order to avoid biases due to different sizes of each author corpus, we use the first 15,000 sentences from each training section. From these results all fragments which were found in more than one author corpus are removed. The remaining fragments which are unique to each author are used to compute a similarity score.

We have explored different variations of similarity scores, such as the number of nodes, the average number of nodes, […]. Although the number of sentences in the training sets has been fixed, they still diverge in the average number of words per sentence, which is reflected in the number of nodes per tree as well. This causes a bias because statistically, there is a higher chance that some fragment in a larger tree will match with another. Therefore we also normalize for the average number of nodes. The author can now be guessed as:

    \arg\max_{A \in \mathrm{Authors}} \frac{f(A, B)}{\frac{1}{|A|} \sum_{t \in A} |t|}

where f(A, B) is the similarity score computed from the fragments that the unknown text B shares exclusively with author A, and the denominator is the average number of nodes |t| per tree t in A's training section.
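To illustrate the decision rule above, here is a minimal sketch under our own assumptions: the shared fragments per author have already been collected (e.g., with a routine like common_fragments above), fragments occurring in more than one author corpus have been removed, and the similarity f(A, B) is simplified to the number of remaining shared fragments. The function names and data layout are illustrative, not taken from the paper.

```python
def average_nodes_per_tree(trees):
    """Average number of nodes |t| per parse tree, counting internal nodes
    (including preterminals) as well as leaf tokens."""
    return sum(len(list(t.subtrees())) + len(t.leaves()) for t in trees) / len(trees)

def guess_author(shared_fragments_by_author, trees_by_author):
    """Pick the author A maximizing f(A, B) / (average number of nodes per tree in A).

    shared_fragments_by_author: dict mapping an author to the set of fragments
        that the unknown text B shares with that author only (fragments found
        in more than one author corpus having been removed).
    trees_by_author: dict mapping an author to the parse trees of the first
        15,000 sentences of that author's training section.
    """
    def score(author):
        f_ab = len(shared_fragments_by_author[author])  # simplification: a plain count
        return f_ab / average_nodes_per_tree(trees_by_author[author])

    return max(shared_fragments_by_author, key=score)
```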
Note that working with content words does not mean that the model reduces to an n-gram model, because fragments can be discontiguous; e.g., "he said X but Y." Furthermore the fragments contain hierarchical structure while n-grams do not. To verify this contention, we also evaluate our model with trigrams instead of fragments. For this we use trigrams of word & part-of-speech pairs, with words stemmed using Porter's algorithm. With trigrams we simply count the number of trigrams that one text shares with another. Raghavan et al. (2010) have observed that the lexical information in n-grams and the structural information from a PCFG perform a complementary role, achieving the highest performance when both are combined. We therefore also evaluate with a combination of the two.
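The trigram baseline can be sketched as follows, assuming POS-tagged, tokenized sentences as input and NLTK's Porter stemmer; counting shared trigram types via sets and the case-folding are our simplifications, not details given in the paper.

```python
from nltk.stem import PorterStemmer
from nltk.util import ngrams

stemmer = PorterStemmer()

def word_pos_trigrams(tagged_sentences):
    """Trigrams of (stemmed word, POS tag) pairs for one text.

    tagged_sentences: iterable of sentences, each a list of (word, tag) pairs.
    """
    trigrams = set()
    for sentence in tagged_sentences:
        pairs = [(stemmer.stem(word.lower()), tag) for word, tag in sentence]
        trigrams.update(ngrams(pairs, 3))
    return trigrams

def trigram_overlap(text_a, text_b):
    """Number of word & part-of-speech trigrams that two texts share."""
    return len(word_pos_trigrams(text_a) & word_pos_trigrams(text_b))
```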
3 Evaluation & Discussion

Our data consist of a collection of novels from five authors. See table 1 for a specification.