ANALYSIS OF STYLOMETRY AS A COGNITIVE BIOMETRIC TRAIT

By KALAIVANI SUNDARARAJAN

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2018

© 2018 Kalaivani Sundararajan

I dedicate this dissertation to my Mom and Dad, sister Gowri and my husband Raghav, who have been incredibly supportive, understanding and encouraging throughout this entire journey.

ACKNOWLEDGMENTS

I extend my heartfelt thanks to the mentors and friends who have made this journey possible. First and foremost, I thank my advisor, Dr. Damon Woodard, who took me under his wing and provided constant support and motivation throughout this journey. His calm and composed demeanor has always been reassuring, even during periods of self-doubt. He gave me complete freedom to explore and grow into an independent researcher. I would also like to thank my committee members for their valuable feedback and comments. Many mentors laid the foundation for this journey before I started my Ph.D. My first job at Soliton Technologies, India, seeded the desire to pursue a career in research. I am forever grateful to Ms. Mekhala Devaraj and Dr. Ganesh Devaraj, who provided me that opportunity and have been of immense support since then. I also thank Dr. Stan Birchfield, who laid my foundation for academic research at Clemson University and has always wished me well. I would also like to thank all my mentors in industry and academia who have shaped my career. This long and eventful journey was made memorable by my dear friends. I would like to thank Poorna Sudhakar, Gayatri Lakshmanan, Rommel Jalasutram, Ahalya Srikanth, Renuka Jagadish, Anand Raju, Dhananjay Joshi, Vishnu Prasad and Ezhilselvi Karunakaran for all their love and support. Finally, I would like to thank my graduate coordinators Adrienne Cook, Tamira Carter and Lesly Galiana for their support.

TABLE OF CONTENTS

ACKNOWLEDGMENTS ...... 4
LIST OF TABLES ...... 8
LIST OF FIGURES ...... 10
ABSTRACT ...... 12

CHAPTER

1 INTRODUCTION ...... 14
    1.1 Motivation ...... 15
    1.2 Thesis Statement and Dissertation Topics ...... 15
        1.2.1 Style Representations ...... 16
        1.2.2 Stylometry as a Biometric Trait ...... 17
        1.2.3 Soft-biometric Classification or Author Profiling ...... 18
    1.3 Contributions ...... 18
    1.4 Chapter Guide ...... 19

2 LITERATURE REVIEW ...... 21
    2.1 Historical Background ...... 21
    2.2 Stylometry ...... 22
        2.2.1 Stylometric Features ...... 22
        2.2.2 Attribution Approaches ...... 25
    2.3 Challenges ...... 28

3 PERFORMANCE ANALYSIS OF STATE-OF-THE-ART APPROACHES ...... 30
    3.1 Dataset ...... 31
    3.2 Authorship Identification ...... 31
        3.2.1 Methodology ...... 32
        3.2.2 Results ...... 37
        3.2.3 Discussion ...... 40
    3.3 Verification ...... 41
        3.3.1 Methodology ...... 41
        3.3.2 Results ...... 44
        3.3.3 Discussion ...... 46
    3.4 Feature Analysis ...... 46
        3.4.1 Methodology ...... 47
        3.4.2 Results ...... 48
        3.4.3 Discussion ...... 49
    3.5 Clustering ...... 51
        3.5.1 Methodology ...... 52

        3.5.2 Preliminary Results ...... 53
        3.5.3 Discussion ...... 55
    3.6 Summary ...... 56

4 WHAT CONSTITUTES STYLE? ...... 58
    4.1 Datasets ...... 59
    4.2 Role of Structural representations ...... 60
        4.2.1 Methodology ...... 61
        4.2.2 Results ...... 63
        4.2.3 Discussion ...... 65
    4.3 Role of Lexical representations ...... 66
        4.3.1 Methodology ...... 67
        4.3.2 Results ...... 68
        4.3.3 Discussion ...... 69
    4.4 Learned representations ...... 72
        4.4.1 Methodology ...... 73
        4.4.2 Results ...... 75
        4.4.3 Discussion ...... 76
    4.5 Summary ...... 77

5 STYLOMETRY AS A BIOMETRIC TRAIT ...... 79
    5.1 Biometric Characteristics ...... 79
        5.1.1 Analysis of Uniqueness Using Biometric Menagerie ...... 79
        5.1.2 Analysis of Permanence in Linguistic Style ...... 87
    5.2 Multimodal Biometric Authentication ...... 96
        5.2.1 Methodology ...... 97
        5.2.2 Dataset ...... 99
        5.2.3 Results ...... 99
        5.2.4 Discussion ...... 102
    5.3 Summary ...... 105

6 ROLE OF PSYCHOLOGICAL AND SOCIAL ASPECTS ON STYLOMETRY ...... 106
    6.1 Role of psychological aspects ...... 106
        6.1.1 Methodology ...... 107
        6.1.2 Datasets ...... 107
        6.1.3 Results ...... 110
        6.1.4 Discussion ...... 115
    6.2 Profiling author social groups ...... 116
        6.2.1 Methodology ...... 116
        6.2.2 Results ...... 118
        6.2.3 Discussion ...... 119
    6.3 Summary ...... 120

7 SUMMARY ...... 125
    7.1 Insights and Conclusions ...... 125
        7.1.1 Style Representations ...... 125
        7.1.2 Stylometry as a Biometric Trait ...... 126
        7.1.3 Soft-biometric Classification or Author Profiling ...... 126
    7.2 Limitations ...... 126
    7.3 Future work ...... 127

APPENDIX: AUTHOR PROFILING ANALYSIS ...... 128
REFERENCES ...... 132
BIOGRAPHICAL SKETCH ...... 142

LIST OF TABLES

2-1 Stylistic features ...... 26
3-1 CASIS Dataset partitions ...... 31
3-2 Character level features ...... 43
3-3 Lexical features ...... 44
3-4 Syntactic features ...... 44
3-5 Salient features across different feature categories ...... 48
3-6 Comparison of salient features ...... 49
3-7 Effect of varying sentence lengths ...... 49
4-1 Single-domain and Cross-domain datasets ...... 60
4-2 Shallow syntactic model performance ...... 64
4-3 Deep syntactic model performance ...... 65
4-4 Effect of lexical POS on single-domain authorship attribution using IMDB1M ...... 68
4-5 Effect of lexical POS on UF cross-topic authorship attribution ...... 69
4-6 Effect of lexical POS on UF cross-genre authorship attribution ...... 69
4-7 Effect of topic words on IMDB1M single-domain authorship attribution ...... 70
4-8 Effect of topic words on UF cross-topic authorship attribution ...... 70
4-9 Effect of topic words on UF cross-genre authorship attribution ...... 70
4-10 Performance of learned representations ...... 76
5-1 BoW features ...... 81
5-2 Dataset statistics ...... 90
5-3 Permanence of various span-subsets in Blogs and Yelp datasets ...... 90
5-4 Salient and consistent features ...... 95
5-5 Verification performance for multimodal authentication ...... 101
6-1 LIWC categories ...... 108
6-2 PAN2015 author profiling dataset - English ...... 109

6-3 PAN2016 author profiling dataset - English ...... 109
6-4 PAN2015 author profiling results ...... 119
6-5 PAN2016 author profiling results ...... 119
A-1 PAN2015 - LIWC analysis ...... 128
A-2 PAN2015 - LIWC analysis (cont.) ...... 129
A-3 PAN2015 - LIWC analysis (cont.) ...... 130
A-4 PAN2016 - LIWC analysis ...... 131

LIST OF FIGURES

2-1 Stylistic features ...... 23
3-1 Recognition accuracy of various implementations ...... 39
3-2 Recognition accuracy of various implementations across CASIS subsets ...... 42
3-3 ROC curves for CASIS subsets ...... 45
3-4 Score distributions for CASIS subsets ...... 46
3-5 Clustering results ...... 54
3-6 Markov clustering on CASIS-5 with three similarity measures ...... 54
3-7 Markov clustering on CASIS-2 with features of different dimensions ...... 55
4-1 Parse tree of a sample sentence and rewrite rules ...... 62
4-2 Recurrent and recursive neural networks ...... 74
4-3 Hierarchical attention model ...... 75
5-1 Biometric menagerie ...... 82
5-2 Biometric menagerie plots for IMDB1M dataset ...... 84
5-3 Biometric menagerie plots for UF cross-topic dataset ...... 85
5-4 Biometric menagerie plots for UF Blogs-Forums cross-genre dataset ...... 86
5-5 Biometric menagerie plots for UF Forums-Blogs dataset ...... 87
5-6 ROC curves for various span-subsets in Blogs and Yelp datasets ...... 91
5-7 Trend of consistent users over time ...... 93
5-8 Success scenario ...... 94
5-9 Failure scenarios ...... 95
5-10 ROC curves for stylometry, keystroke and their fusion ...... 101
5-11 Score distribution curves for stylometry, keystroke and fusion ...... 102
5-12 Text samples for success and failure cases ...... 104
6-1 LIWC category distributions for PAN2015 and PAN2016 datasets ...... 110
6-2 Correlation of personality scores in PAN2015 ...... 113

6-3 Correlation of personality traits with LIWC categories ...... 115
6-4 Self-attentive embedding ...... 117
6-5 Salient cues for gender prediction on PAN2015 ...... 121
6-6 Salient cues for age prediction (18-34) on PAN2015 ...... 122
6-7 Salient cues for age prediction (35-xx) on PAN2015 ...... 123

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

ANALYSIS OF STYLOMETRY AS A COGNITIVE BIOMETRIC TRAIT

By Kalaivani Sundararajan

December 2018

Chair: Damon L. Woodard
Major: Engineering

Perpetrators of cybercrime exploit the increase in text-based communication platforms and the associated anonymity. However, whenever a person is online, they inadvertently leave a digital footprint. In this dissertation, we explore the potential for using stylometry, i.e., linguistic style, as a cognitive biometric trait to identify a person. To use stylometry as a biometric trait, the representation of style needs to be independent of genre or topic and purely indicative of stylistic traits. We investigate different aspects of language that can be used to represent style. Specifically, we explore the role of structural and lexical information in representing style using cross-domain data. To the best of our knowledge, this is the first large-scale cross-domain stylometric analysis in the literature. From our experiments, we observed that low-level lexical features are more effective in representing style compared to syntactic features, specifically for data where text lengths are short. However, large-scale identification is still challenging with all representations. We further investigate whether stylometry exhibits essential biometric characteristics. First, we explore the uniqueness of stylometry using the Biometric menagerie, which gives insights into using stylometry as a biometric trait in large-scale settings. We observed that not all users exhibit a unique linguistic style and that some users can easily be impersonated or impersonate others. Next, we investigate the permanence of linguistic style using data which spans over 24 months. We observed that only a few users exhibit a consistent style for over 24 months, while most other users drift beyond six months. Next, we investigate the potential for using stylometry along with another co-occurring trait, keystroke dynamics. We observed that multimodal authentication using both traits improved performance but was limited by users with too few samples. Finally, we investigate the role of psychological and social aspects of language use that can help infer attributes like gender, age and personality traits of a person, which is useful when large-scale identification is challenging. We observed that many psycholinguistic words correlated with these attributes and were used as salient features for estimating them.

CHAPTER 1
INTRODUCTION

“Language is a set of choices, and speakers and writers tend to fall into habitual choices.”
—Patrick Juola, Language Log

Behavioral or socio-cultural influences might uniquely define a person’s linguistic usage. Hence, linguistic expressions of their thoughts and feelings might be tied to their identity. This notion forms the basis of the “human stylome hypothesis” [1], and research in stylometry has evolved based on this hypothesis. Various computational representations of stylistic cues and matching techniques have been proposed in the literature. Early research focused on few authors and long text samples like books. Convincing results under such scenarios further bolstered the stylome hypothesis. However, with the advent of sophisticated computational resources, researchers were able to evaluate the scalability of these approaches for a large number of authors and shorter text samples [2]. The failure of conventional approaches to scale to a large number of authors and short samples sparked a slew of feature representations and techniques for such scenarios. With the advent of the Internet and social media, the situation looks more challenging with the increasing number of users and shorter text samples. In today’s connected world, most of our cyber-interactions are text-based, on various platforms like messages, emails, tweets and so on. Though multimedia is gaining traction, text-based expressions and online interactions are here to stay. Text-based exchanges in the cyber world provide a degree of anonymity which benefits both whistle-blowers and perpetrators of crime, hate speech and cyberbullying. This dissertation focuses on the security aspects concerning the latter. Every time a person is online, they leave a digital footprint that can be traced back to them. Physical aspects (IP addresses, location) or behavioral information help represent such digital footprints. In this dissertation, we investigate one such behavioral trait, viz. stylometry, for security purposes.

1.1 Motivation

A significant portion of the stylometry literature has been demonstrated using a small set of authors and long text samples. However, some research efforts have shown that stylometry struggles to scale to Internet-scale data. Since we are investigating the potential for stylometry to be used for security purposes under such extreme scenarios, we need to understand the nature of different style feature representations. Further, such features need to be evaluated on large-scale datasets, specifically cross-domain datasets. This evaluation will ensure that the features are truly representative of style and do not suffer from topic/genre interference. Further, given such internet data with few samples per person and short text samples, we need to investigate whether each person exhibits a unique linguistic style and whether it remains stable over time. Such properties are essential for reliable biometric authentication. With limited data samples, it might be difficult to model a person’s linguistic style. However, one can still infer other attributes of a person’s identity like gender, age and personality traits. Hence, we need to study the role of psychological and social aspects in linguistic style and understand what cues are useful for predicting these social groups. To sum up, in this dissertation, we delve deeper into the linguistic aspects that are claimed to be stylistic cues to understand and explain their role in stylometry. Further, we evaluate the potential for stylometry to be used as a biometric trait. Since stylometry is claimed to be a cognitive biometric trait, we also investigate the role of psychological and social aspects in linguistic style.

1.2 Thesis Statement and Dissertation Topics

In this dissertation, we explore the following thesis statement: A person’s linguistic usage exhibits distinctive behavioral traits that allow linguistic style to be used as a cognitive biometric trait. In order to validate this statement, we analyze different representations of style, biometric characteristics like uniqueness and permanence, multi-modal authentication and soft-biometric classification/author profiling.

1.2.1 Style Representations

People might express themselves in written communication on many platforms like blogs and emails. If their linguistic usage exhibits distinctive behavioral traits, we hypothesize that these traits will manifest themselves irrespective of the platform. However, linguistic style is an abstract concept, and computationally representing it forms the basis of validating our thesis statement. Any representation of linguistic style must capture only behavioral traits and disregard any content information, since an impostor can easily mimic the latter. In this dissertation topic, we investigate various structural and lexical representations since sentence construction/grammar and choice of words might reveal useful cues about a person’s identity. We represent style using shallow and deep syntactic models to encode structural information. We use character and word-based models to capture lexical information. Function words (articles, auxiliary verbs) have already been proposed as viable features for author identification since they do not have any meaning by themselves but serve only a grammatical purpose. Hence, we analyze specific word categories that are traditionally considered to be content-related to investigate whether some words belonging to these categories can be useful for representing style owing to their repeated use. These word categories are called lexical parts-of-speech (POS) and include common nouns, proper nouns, verbs (excluding auxiliary verbs), adjectives and adverbs. We further utilize topic modeling to tease apart content information from behavioral cues (a minimal sketch of this masking step follows the research questions below). The different representations of linguistic style are investigated using a large-scale (>1200 authors) cross-domain dataset since it would validate the claim that certain behavioral traits manifest themselves irrespective of the platform/domain. We analyze the efficacy of these representations with the following research questions:

• Do structural representations help with stylometric attribution in cross-domain settings?

• What are the effects of masking words corresponding to different lexical POS on cross-domain attribution?

• Does masking topic words of various lexical POS help with stylometric attribution in cross-domain settings?
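For illustration, the following is a minimal sketch of the POS-masking step described above, written with NLTK's off-the-shelf tokenizer and Penn Treebank tagger; the tool choice, tag prefixes, and placeholder scheme are assumptions for exposition rather than the exact pipeline used in this dissertation.

```python
# Minimal sketch of lexical POS masking (assumes NLTK with the 'punkt'
# and 'averaged_perceptron_tagger' models downloaded).
import nltk

# Penn Treebank prefixes for the lexical POS categories named above:
# nouns (NN*), verbs (VB*), adjectives (JJ*), adverbs (RB*).
LEXICAL_POS = ("NN", "VB", "JJ", "RB")

def mask_lexical_pos(text, categories=LEXICAL_POS):
    """Replace words in the given lexical POS categories with their tag,
    keeping function words and punctuation intact. Note: a faithful
    implementation would exempt auxiliary verbs, which this crude
    prefix match does not."""
    tokens = nltk.word_tokenize(text)
    return " ".join(tag if tag.startswith(categories) else word
                    for word, tag in nltk.pos_tag(tokens))

print(mask_lexical_pos("The quick brown fox jumps over the lazy dog."))
# -> e.g. "The JJ JJ NN VBZ over the JJ NN ."
```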

1.2.2 Stylometry as a Biometric Trait

For stylometry to be useful as a biometric trait for security or forensic purposes, it needs to exhibit specific properties like uniqueness, permanence, collectability, universality, performance, acceptability and circumvention [3]. Of these, we investigate two important biometric properties - uniqueness and permanence. Uniqueness refers to the trait being sufficiently distinctive to tell two people apart. We evaluate the uniqueness of stylometry using the Biometric menagerie [4], which involves classifying each person into one of four categories (sheep, wolf, lamb, goat), where each category reflects a person’s contribution to False Accept or False Reject rates (a toy labeling sketch follows the research questions below). People who contribute the least to both error rates are categorized as sheep, and a biometric trait that categorizes most people as sheep is considered distinctive or unique. People who contribute to either or both error rates are classified as wolf, lamb or goat and reduce biometric system performance. Permanence refers to the trait being sufficiently invariant over long periods of time. This is important for physiological biometrics since a person enrolls with their biometric data only a few times and it needs to be robust to aging. We investigate the permanence of linguistic style even though it is considered behavioral, in order to validate whether linguistic style remains invariant and, if not, to understand how it varies over time. We investigate permanence using data that spans over 24 months and track the change in performance over time. Stylometry, like any other behavioral trait, may not be as effective as physiological traits. However, it is readily available and suitable for scenarios like authentication in the cyber world. To further bolster stylometry as a biometric trait, we investigate multi-modal authentication with another co-occurring trait, i.e., keystroke dynamics. Keystroke dynamics are more representative of muscle memory than cognitive complexity, while stylometry is more representative of cognitive complexity. Hence, these traits can be complementary to each other and help improve biometric authentication. Overall, we investigate the feasibility of using linguistic style as a behavioral biometric trait with the following research questions:

• Is linguistic style sufficiently unique to be considered as a biometric trait?

• Does linguistic style remain permanent over long periods of time?

• Can stylometry be used with other biometric traits for authentication?
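To make the menagerie idea from Section 1.2.2 concrete, the sketch below labels users from their genuine and impostor score statistics; the percentile cutoffs and exact criteria here are simplifying assumptions, not the formal menagerie definitions of [4].

```python
# Rough sketch of menagerie-style user labeling from match scores.
import numpy as np

def menagerie(genuine, imp_as_target, imp_as_attacker, pct=25):
    """genuine[u]: u's genuine match scores; imp_as_target[u]: impostor
    scores when others claim u's identity; imp_as_attacker[u]: impostor
    scores when u is matched against other identities."""
    users = list(genuine)
    g = {u: np.mean(genuine[u]) for u in users}
    t = {u: np.mean(imp_as_target[u]) for u in users}
    a = {u: np.mean(imp_as_attacker[u]) for u in users}
    goat_cut = np.percentile(list(g.values()), pct)        # weak genuine scores
    lamb_cut = np.percentile(list(t.values()), 100 - pct)  # easy to impersonate
    wolf_cut = np.percentile(list(a.values()), 100 - pct)  # impersonates others
    labels = {}
    for u in users:
        if g[u] <= goat_cut:
            labels[u] = "goat"   # drives False Rejects
        elif t[u] >= lamb_cut:
            labels[u] = "lamb"   # drives False Accepts as the target
        elif a[u] >= wolf_cut:
            labels[u] = "wolf"   # drives False Accepts as the attacker
        else:
            labels[u] = "sheep"  # distinctive, well-behaved user
    return labels
```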

1.2.3 Soft-biometric Classification or Author Profiling

If linguistic style is not able to uniquely identify a person, it might still be possible to infer other information about a person’s identity like age, gender and native language. Identifying such social groups using biometric traits is called soft-biometric classification, or, in this case, author profiling. Soft-biometric information helps to narrow down candidates, especially in large-scale identification scenarios. We investigate the feasibility of estimating a person’s gender, age and personality traits by using their online communications. We further analyze the role of different psycholinguistic word categories in various social groups and investigate whether they are useful in author profiling. Specifically, we attempt to answer the following research question:

• What are the linguistic aspects that help profile an author’s attributes like age, gender and personality?

1.3 Contributions

In this dissertation, our main contributions include:

• Performance analysis of state-of-the-art approaches in stylometry using a large-scale single-domain dataset. This analysis highlights the relative merits and demerits of various feature representations and classification approaches in stylometry. This analysis sets the stage for our dissertation topics.

• First detailed analysis into different components that represent style using a large-scale cross-domain dataset. We study the role of structural and lexical information in representing a distinctive linguistic style using both single-domain and cross-domain datasets. To the best of our knowledge, this is the first time cross-domain stylometry has been studied at this scale for both cross-topic and cross-genre attribution.

• Evaluation of stylometry from a biometric perspective. Specifically, we investigate the uniqueness of stylistic traits using Biometric menagerie and the permanence of linguistic style.

• Investigation of the effect of psychological aspects and social aspects that help profile a person’s attributes from their written communications. 1.4 Chapter Guide

In this chapter, we provided a brief description of the context and motivation for this dissertation. We also outlined the research objectives and contributions of this dissertation. In Chapter 2, we provide a brief overview of the historical background and state-of-the-art approaches in stylometry. We describe various feature representations used to reflect style and matching techniques for authentication. Finally, we discuss the challenges associated with existing approaches and list some open research questions in the field. Equipped with the state-of-the-art approaches described in Chapter 2, we perform a performance analysis of fourteen state-of-the-art algorithms using a large-scale, private, single-domain dataset in Chapter 3. We compare and contrast the relative merits and demerits of the feature representations and matching techniques used in these approaches. We also perform large-scale verification, feature analysis, and clustering to highlight the challenges associated with stylometry when used under difficult constraints such as those found in real-world settings. Based on the insights obtained from Chapter 3, we delve deeper into the different components that can be representative of style in Chapter 4. We do so with a large-scale cross-domain dataset that consists of both cross-genre and cross-topic data. This ensures that the inference is purely based on style and mitigates topic or genre interference.

This chapter provides insights into the rationale behind the success and failure of different style feature representations. In Chapter 5, we look into the biometric aspects of using stylometry for authentication. For a trait to be considered a biometric modality, it needs to meet a few characteristics. In this chapter, we look into the uniqueness and permanence properties of stylometry. We explore uniqueness using the Biometric menagerie to understand the stylistic traits of each person. We also evaluate the permanence of stylistic traits and study drift in linguistic style. Since stylometry is considered a behavioral biometric, it is affected by various factors which make standalone biometric authentication challenging. Hence, we explore multimodal authentication using stylometry and another co-occurring trait, i.e., keystroke dynamics. Stylometry is considered a cognitive biometric that is affected by a person’s thoughts, mood, and feelings. In Chapter 6, we look into the psychological and social aspects that manifest in linguistic usage and investigate their role in profiling a person’s attributes like age, gender and personality traits. In the concluding Chapter 7, we summarize our research conclusions concerning the use of stylometry as a cognitive biometric modality. We also provide a brief outline of future research perspectives in this domain.

CHAPTER 2
LITERATURE REVIEW

In this chapter, we will review the evolution of stylometry research over the past 100 years. We will describe some common feature representations and matching techniques that have been used in previous research related to various tasks associated with stylometry, viz. authorship attribution, authorship verification, and authorship profiling. We conclude the chapter with known challenges in existing approaches and open research questions.

2.1 Historical Background

Holmes [5] provides a complete picture of the historical evolution of stylometry. The concept of stylometry dates back to 1851, when Augustus de Morgan suggested that authorship disputes could be resolved using the frequency of word lengths. This hypothesis was first quantitatively analyzed in the late 1880s, when Thomas C. Mendenhall compared the works of Bacon, Marlowe, and Shakespeare using word length distributions in an attempt to attribute Shakespeare’s plays. This approach, however, failed to effectively distinguish these authors, though it revealed close similarities between the works of Marlowe and Shakespeare. These findings held researchers back from pursuing stylometry any further until alternative measures of style were proposed 30 years later by George Zipf and G. Udny Yule. Zipf [6] discovered that there exists a log-linear relationship between the rank and frequency of words, popularly known as Zipf’s law. Yule [7] utilized these word frequencies to come up with a measure of vocabulary richness called Yule’s characteristic. However, these approaches also proved to be insufficient for discriminating the linguistic styles of different authors. It wasn’t until the late 1960s that Mosteller et al. [8–10] made a breakthrough in computational stylometry. They investigated the authorship of the Federalist papers, twelve of which are claimed by both Alexander Hamilton and James Madison. They used function words like prepositions, conjunctions, and articles to attribute all twelve papers to Madison, which seemed to corroborate historical information. In the following decades, stylometry research has evolved significantly owing to the growth in computational resources, the internet, machine learning, and natural language processing tools.

2.2 Stylometry

There exist multiple detailed surveys on authorship attribution approaches [11–13]. In this section, we outline research techniques used for various stylometric tasks that are also relevant for biometric authentication, viz. authorship attribution, authorship verification, and authorship profiling. Authorship attribution [11, 14, 15] involves author identification/recognition and determines the probability that a particular author wrote a document. Besides stylistic traits, this attribution may or may not depend on topic/content. Authorship verification [16–18] validates whether the same person wrote two documents or not. When author identification is difficult, one can narrow down potential candidates using authorship profiling [19–22], which provides insights into a person’s attributes like gender, age, native language and personality. These three stylometric tasks depend on proper stylistic feature representation and matching techniques. Various feature representations and matching approaches used by stylometry researchers are explained in the following sections.

Stylometric tasks have used various features spanning different linguistic levels (Figure 2-1). This section describes these features (Table 2-1) and their relative merits and demerits. 2.2.1.1 Character features

Figure 2-1. Stylistic features in order of feature complexity

These are the lowest, surface-level features, which view the text as a sequence of characters. These features include frequencies of alphabets, digits, punctuations, and lowercase and uppercase alphabets [23, 24]. A significant advantage of these features is that they do not require complex tokenization and are applicable even for languages (e.g. Chinese, Japanese) where tokenization is challenging. However, since they view text only as a sequence of characters, they do not distinguish between stylistic traits and topic/content. Character n-grams represent a more sophisticated class of character features and use frequencies of n consecutive characters. These features have been widely used in authorship attribution [25–28]. Typically, the most common character n-grams are used for stylometry. These n-grams seem to partially capture information across all linguistic levels, i.e. lexical information when they represent words, contextual information when they span two words, syntactic information when they represent function words, and punctuation usage and capitalization. Owing to these merits, character n-grams are considered the most successful feature representation in stylometry [12]. However, Kestemont [29] suggests that their success could be from a purely quantitative perspective since they have more observations per sample compared to words or syntactic features.
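For concreteness, a character n-gram profile of the kind described above can be computed in a few lines; the choice of n and the profile length L below are arbitrary.

```python
# Sketch: normalized frequencies of the L most common character n-grams.
from collections import Counter

def char_ngram_profile(text, n=3, L=500):
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values()) or 1
    return {g: c / total for g, c in grams.most_common(L)}

print(char_ngram_profile("the cat sat on the mat", n=3, L=5))
```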

2.2.1.2 Lexical features

These features view the text as a sequence of tokens, where tokens can be words, digits or punctuations. Earlier approaches used frequencies of word lengths and sentence lengths [30]. Later, vocabulary richness measures [7, 31] were proposed to quantify diversity in vocabulary usage. However, these approaches were unreliable as standalone features. A vector of word frequencies, or bag-of-words representation, has also been proposed. Typically, in most semantic NLP tasks, infrequent words are given more importance than the most frequent words like articles, pronouns and prepositions. However, in stylometry, the most frequent words are given higher importance. These words are typically independent of content and serve a grammatical purpose. Hence, they are called function words, and they are straightforward indicators of syntactic structures too. Function words have been used successfully in various authorship attribution approaches [32, 33]. Since the bag-of-words representation ignores word order, word n-grams were proposed to capture context to a certain degree [34–36]. However, they did not provide any advantage compared to the bag-of-words representation. Further, the dimensionality of these features increases with n, and the features are sparse, especially for short texts, because most word sequences do not occur.

These features use syntactic information and intuitively seem more reliable than lexical information since they are independent of the topic. These techniques can be categorized into shallow and deep syntactic approaches. Deep syntactic approaches use detailed syntactic analysis for feature representation. Baayen et al. [37] used parse trees to compute rewrite rule frequencies as syntactic features. Stamatatos et al. [38] used chunking to compute frequencies of noun phrases and verb phrases. Raghavan et al. [39] used probabilistic context-free grammars computed using the production rules of parse trees. However, these approaches require accurate NLP tools for parsing and chunking and are computationally intensive. As an alternative, shallow syntactic features based on parts-of-speech (POS) were proposed. This includes using POS frequencies and POS n-gram frequencies to represent style [40–43]. However, they do not capture complete syntactic information like their deep counterparts.
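As a sketch of such shallow syntactic features, the snippet below computes POS n-gram frequencies; NLTK's tagger is an assumed stand-in for whatever NLP toolchain a given study used.

```python
# Sketch: POS n-gram frequencies as shallow syntactic style features
# (assumes NLTK's 'punkt' and 'averaged_perceptron_tagger' models).
from collections import Counter
import nltk

def pos_ngram_features(text, n=2):
    tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(text))]
    grams = Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))
    total = sum(grams.values()) or 1
    return {g: c / total for g, c in grams.items()}
```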

2.2.1.4 Semantic features

These features capture semantic dependencies. Gamon [42] used semantic dependency graphs to extract semantic features and their relationships. McCarthy et al. [44] used WordNet [45] to obtain synonyms and hypernyms of words to detect semantic similarities. Argamon et al. [46] used functional features that associate specific words with semantic content, e.g. Conjunction, Extension and Elaboration. However, the accuracy of these features is only reported when they are used with other types of features, and their standalone accuracy is not available.

Stylometric attribution approaches can be broadly classified into profile-based and instance-based approaches. With profile-based approaches, all training samples of an author are concatenated to obtain an aggregated feature representation or model for that author. With instance-based approaches, on the other hand, a feature representation is obtained for each training sample of an author and used with inter-textual distances or machine learning techniques. In the following sections, we describe a few influential authorship attribution approaches which fall into either of these two categories.

These profile-based approaches compute some similarity measure between two pieces of text to determine if the same author wrote them. A few popular inter-textual distances from the stylometry literature are listed below (minimal code sketches of some of these distances follow the list):

Table 2-1. Stylistic features

Character - Surface-level features using characters or sequences of characters. Examples: frequency of alphabets, digits, punctuations; character n-grams. Advantages/Disadvantages: easy to tokenize and applicable to any language; does not distinguish between stylistic traits and content.

Lexical - Token/word-level features. Examples: frequency of function words, word n-grams, vocabulary richness. Advantages/Disadvantages: difficult to tokenize in some languages; non-function words are less frequent and vocabulary-based measures are unreliable.

Syntactic - Uses consistent syntactic patterns. Examples: parts-of-speech (POS), sentence structures, rewrite rule frequencies. Advantages/Disadvantages: more reliable than lexical information as they are topic-independent; requires accurate NLP tools to extract syntactic information.

Semantic - Captures meaning of words and sentences. Examples: synonyms, semantic dependencies. Advantages/Disadvantages: requires reliable tools for semantic analysis.

Others - Custom representations. Examples: greeting/farewell, idiosyncratic features.

• Burrows’ Delta - Burrows [47] measures differences in the z-scores of the relative frequencies, $f_i$, of the $n$ most frequent words in a piece of text. This measure, Delta, for two documents $D$ and $D'$ is computed as

$$\text{Delta} = \frac{1}{n} \sum_{i=1}^{n} \left| z(f_i(D)) - z(f_i(D')) \right| \qquad (2\text{-}1)$$

• Chi-square distance - The chi-square distance, $\chi^2$, is a statistical measure of goodness-of-fit that determines whether a set of frequencies was drawn from the same population. If $O_k$ and $E_k$ are the observed and expected frequencies of $n$ features, then

$$\chi^2 = \sum_{k=1}^{n} \frac{(O_k - E_k)^2}{E_k} \qquad (2\text{-}2)$$

This measure has been used to perform authorship attribution using frequencies of lexical features, where the sample population consists of the training samples from an author [48].

• Kullback-Leibler divergence - The Kullback-Leibler (KL) divergence measures the difference between two probability distributions, $P$ and $Q$, as

$$KL(P \,\|\, Q) = \sum_i P_i \log \frac{P_i}{Q_i} \qquad (2\text{-}3)$$

Arun et al. [49] used KL divergence to measure the similarity between stopword graphs constructed using training and test samples. The test sample is attributed to the author with the least KL divergence.

• Stamatatos distances - Stamatatos [26] proposed an inter-textual distance method using text profiles, $P$, comprised of pairs of the $L$ most frequent n-grams and their normalized frequencies. These profiles can be word- or character-based. Given an author profile $P(T_a)$ and a test sample profile $P(x)$, the inter-textual distance is

$$d_1 = \sum_{g \in P(x)} \left( \frac{2\,(f_x(g) - f_{T_a}(g))}{f_x(g) + f_{T_a}(g)} \right)^2 \qquad (2\text{-}4)$$

where $f_x(g)$ represents the frequency of n-gram $g$ in $P(x)$. A normalized distance was also proposed by concatenating all training samples, whose corresponding profile is represented by $P(N)$. The normalized distance is given by

$$d_2 = \sum_{g \in P(x)} \left( \frac{2\,(f_x(g) - f_{T_a}(g))}{f_x(g) + f_{T_a}(g)} \right)^2 \cdot \left( \frac{2\,(f_x(g) - f_N(g))}{f_x(g) + f_N(g)} \right)^2 \qquad (2\text{-}5)$$
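For concreteness, below are minimal sketches of three of the distances above (Eqs. 2-1, 2-3, and 2-4); the feature preparation (word counts, z-score statistics, profiles) is assumed to happen elsewhere, and the epsilon in the KL sketch is an assumption to avoid division by zero.

```python
# Sketches of Burrows' Delta, KL divergence, and Stamatatos' d1.
import numpy as np

def burrows_delta(f_d, f_d2, mu, sigma):
    """Eq. 2-1: f_d, f_d2 are relative frequencies of the same n most
    frequent words; mu, sigma are their corpus-wide means/std-devs."""
    z1 = (np.asarray(f_d) - mu) / sigma
    z2 = (np.asarray(f_d2) - mu) / sigma
    return float(np.mean(np.abs(z1 - z2)))

def kl_divergence(p, q, eps=1e-12):
    """Eq. 2-3, with smoothing so zero frequencies do not blow up."""
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    return float(np.sum(p * np.log(p / q)))

def stamatatos_d1(profile_x, profile_author):
    """Eq. 2-4: profiles map n-grams to normalized frequencies."""
    dist = 0.0
    for g, fx in profile_x.items():
        fa = profile_author.get(g, 0.0)
        if fx + fa:
            dist += (2.0 * (fx - fa) / (fx + fa)) ** 2
    return dist
```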

2.2.2.2 Probabilistic models

These profile-based approaches [10, 36, 50] attempt to maximize the conditional probability $P(x|a)$, given a test sample $x$ and known author $a$, using Bayes’ rule and assuming independence of features. Peng et al. [34] used local Markov chain dependencies with standard Bayes classification to capture contextual information. This approach can be further enhanced by appropriate smoothing techniques and is applicable at both the word and character level.

2.2.2.3 Compression models

Compression-based approaches are profile-based methods that are easy to apply using off-the-shelf compression algorithms. Further, they require little or no preprocessing. Given a known piece of text $x_a$, let its length after compression be $C(x_a)$. Now, given an unknown piece of text $x$, it is concatenated with $x_a$ and compressed to yield a text length of $C(x_a + x)$. $x$ is then attributed to the author of the $x_a$ for which the distance, $d(x_a, x) = C(x_a + x) - C(x_a)$, is the least. Benedetto et al. [51] used the off-the-shelf LZ77 compression algorithm for authorship attribution. Teahan et al. [52] used compression based on Prediction by Partial Matching [53] to determine authorship. However, these approaches are slow owing to the joint compression of training and test data at the time of inference.
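A toy version of this scheme is easy to write with a standard-library compressor; zlib (DEFLATE, which is LZ77-based) stands in here for the LZ77 and PPM compressors cited above.

```python
# Sketch: compression-based attribution via d(x_a, x) = C(x_a + x) - C(x_a).
import zlib

def clen(text):
    """Compressed length C(.) in bytes."""
    return len(zlib.compress(text.encode("utf-8"), 9))

def attribute_by_compression(x, known):
    """known: {author: concatenated training text}. Attribute x to the
    author whose known text compresses x best."""
    return min(known, key=lambda a: clen(known[a] + x) - clen(known[a]))
```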

2.2.2.4 Vector space models

These instance-based approaches represent each training and test sample as a vector of suitable features, which is then used with different machine learning techniques including SVMs [23, 36, 41], decision trees [24, 50], ensemble methods [54], neural networks [24, 55], discriminant analysis [56, 57] and memory-based learners [58]. However, these approaches require many samples per author, each of sufficient length, for reliable classification.

2.3 Challenges

Based on the stylometric techniques discussed above, we observe the following challenges:

1. Feature representation - Since stylometry relies on cues that are consistently used by a person across all or most of their samples, the salient features constitute the most frequently occurring features, like function words. However, such features are prevalent amongst all authors as well, and their discriminative capabilities decrease as the number of candidate authors increases.

2. Highly correlated factors - A piece of text encompasses various factors like topic, genre, style, and emotion which are highly correlated. Minimal efforts have been made to understand their dependencies to obtain stylistic features that are invariant to other correlated factors.

3. Limited data - Authorship attribution, especially in forensic applications, suffers from limited data. It is not yet known how long text samples should be or how many samples per author are required to capture stylistic cues reliably. Most attribution approaches in literature are demonstrated using a small set of candidate authors with large amounts of samples per author.

4. Benchmark datasets - There is a lack of gold standard benchmark datasets that have texts spanning different topics, genres, time periods and languages. Such datasets would allow the construction of reliable feature representations and fair performance comparisons. Daelemans [1] succinctly summarizes these challenges: “We are looking not just for a system that reaches a certain target accuracy in a task, but for explanations, and for systems that are scalable, and that generalize over different genres and topics in their author identification and profiling results. It seems clear that a systematic study of the components and concepts of style will only be possible by collecting a large balanced dataset for each language of a type that doesn’t yet exist in current benchmark efforts.” We attempt to address some of these challenges in this dissertation. We perform a systematic study of different representations of style using a large-scale cross-domain dataset to identify stylistic cues that are consistently used across topics or genres. We also use the weights learned by machine learning models to unearth cues that are considered salient by the models for author identification or profiling. This approach allows one to explain why a text sample was attributed to a specific individual or social group. This explainability is imperative in security or forensic linguistic applications since the evidence needs to stand up in a court of law.

CHAPTER 3
PERFORMANCE ANALYSIS OF STATE-OF-THE-ART APPROACHES

From the previous chapter, we know that stylometry encompasses a broad range of document analysis techniques and tools that have evolved over the past two centuries. However, stylometric research has yet to establish state-of-the-art performance under demanding circumstances. In an ideal scenario, authorship attribution is a considerably easier problem when the set of candidate authors is small (i.e. ≤ 20) and each author has training samples of at least 1,000 words [35, 37–39]. However, this is not a realistic setting. Growth in internet media and online communication has produced datasets that far exceed the constraints mentioned above. While a few approaches attempt to address this problem specifically [59–62], these research efforts confirm that performance degrades as the number of training samples shrinks and the number of candidate authors grows, thereby limiting authorship attribution techniques to small, and in some cases, unrealistic datasets. Therefore, a thorough evaluation of authorship attribution methods should carefully analyze scaling abilities subject to a large number of candidate authors, variable document lengths, and a variable number of training samples. This chapter presents such an analysis to provide insight into what has been termed the needle-in-a-haystack problem, where the goal is to accurately identify the author of a document among a large set of potential authors, where each author has a small number of training samples [63]. In this chapter, a performance analysis using state-of-the-art feature representations and classification approaches is presented. Essential biometric tasks such as identification and verification are performed using a single-domain dataset. Biometric identification involves answering “Who is this person?” by comparing a test/probe sample against a set of training/gallery samples. Identification prevents a person from using multiple identities. Biometric verification involves answering “Is this person whom he/she claims to be?” by comparing a test/probe sample with the claimed training/gallery samples. Verification prevents multiple people from using the same identity. Preliminary feature analysis and clustering are also performed to showcase the difficulty of the problem at hand. Most of these results have also been discussed in our paper [13]. Insights derived from these analyses helped shape our research objectives discussed in the upcoming chapters.

3.1 Dataset

A private corpus provided by the Center for Advanced Studies in Identity Science (CASIS) is used for all experiments in this chapter. The CASIS dataset contains 4,000 samples from 1,000 authors (4 samples per author). Each sample has an average of 1,634 characters, 304 words, and 13 sentences. To study the effect of sentence lengths, author posts were divided such that author samples have an equal number of sentences. This process yielded subsets with 2, 5, 10, and 20 sentences per post in addition to the original posts (a small partitioning sketch follows Table 3-1). As the number of sentences per sample increases, the number of candidate authors decreases due to insufficient amounts of text per author, as shown in Table 3-1. These sentence subsets are hereafter referred to as CASIS-2, CASIS-5, CASIS-10, CASIS-20, and CASIS-orig.

Table 3-1. CASIS dataset partitions

Sentence subset    Authors    Samples
CASIS-orig         1000       4000
CASIS-20             19        229
CASIS-10            103       1597
CASIS-5             378       6691
CASIS-2             984      27662
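The partitioning above can be sketched as follows; nltk.sent_tokenize is an assumed stand-in for whatever sentence segmenter was actually used, and leftover sentences are simply dropped.

```python
# Sketch: split a post into chunks of exactly k sentences (CASIS-k style).
import nltk  # assumes the 'punkt' sentence tokenizer is installed

def sentence_chunks(post, k):
    sents = nltk.sent_tokenize(post)
    return [" ".join(sents[i:i + k])
            for i in range(0, len(sents) - k + 1, k)]
```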

3.2 Authorship Identification

In this section, we describe various feature representations and matching techniques that have been used in the literature to perform authorship attribution. We also elaborate on our approach for evaluating the scalability of these approaches under challenging scenarios like a large number of authors and short text samples.

3.2.1 Methodology

In order to study the scalability of state-of-the-art authorship identification approaches, fourteen open-source algorithms were used for performance comparison based on the work in [64]. The authorship identification algorithms used for performance comparison encompass a variety of techniques, including compression-based models, probabilistic models, and machine learning algorithms. Further, these techniques employ various feature representations such as n-grams and frequencies of common words. The algorithm descriptions used for performance analysis are as follows:

Stopword graphs and authorship attribution in text corpora. Arun et al. [49] used stopword graphs to classify authors by measuring the similarities between training and test stopword graphs. Stopwords are used as nodes, and edges indicate distances from a stopword $w_s$ to recent occurrences of other stopwords $w_1, w_2, \ldots, w_n$. After constructing graphs for known and unknown texts, Kullback-Leibler (KL) divergence distances are computed for each test graph against every training graph. The test sample is attributed to the candidate author whose training graph yields the least KL divergence. This algorithm is hereafter referred to as arun09.

Language trees and zipping. Benedetto et al. [51] identify authors using relative entropy by measuring distances from unknown texts based on data compression techniques. The LZ77 algorithm is used to compress data by detecting duplicates. For an unknown text $X$, the author, $A$, of the text $A_i$ that minimizes the difference $L_{A_i + X} - L_{A_i}$ is chosen, where $L_{A_i}$ represents the length of text $A_i$ in bits after compression, and $A_i + X$ indicates the new sequence formed by appending $X$ to $A_i$. This algorithm is hereafter referred to as benedetto02.

‘Delta’: A measure of stylistic difference and a guide to likely authorship. Burrows [47] provides a relatively simple method to measure the difference in relative frequencies of the most frequent words among texts, which he termed Delta. Mean frequencies of the 150 most common words in the dataset are compared against the test set. Z-scores are used instead of absolute differences to avoid drastic drops in frequencies from words that only appear once. The candidate author with the smallest Delta score is likely the author of the unknown text. This algorithm is hereafter referred to as burrows02.

Mining e-mail content for author identification forensics. De Vel et al. [23] propose a method using support vector machines (SVMs) on multi-topic e-mails. Features included 170 style markers (e.g., average sentence length) and 21 structural attributes (e.g., a greeting acknowledgment). Various kernel functions are explored for the SVM classifier, such as linear, 3-degree polynomial, radial basis, and sigmoid functions. This algorithm is hereafter referred to as devel01.

Local histograms of character N-grams for authorship attribution. Escalante et al. [28] proposed an improved bag-of-words (BOW) approach for capturing sequential information. The locally-weighted bag-of-words (LOWBOW) framework computes a set of local histograms, smoothed by kernels, on different parts of the document for preservation of lexical usage and positional information. The authors suspect that sequential clues will be helpful, as different authors are expected to use certain words or characters in certain locations. BOW represents a document as a single histogram over its vocabulary, normally indicating the frequency and weight of all terms. LOWBOW, however, computes several histograms across the document. The kernel parameters specify each histogram’s centering position, $\mu$, and scale, $\sigma$, where $\mu$ aids in understanding the distribution of terms at the specified position and $\sigma$ determines the influence of terms near $\mu$. This algorithm is hereafter referred to as jairescalante11.
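The local-histogram idea can be sketched as below, with a Gaussian kernel playing the role of the smoothing kernel; the center positions and scale are illustrative assumptions rather than the parameters used in the original work.

```python
# Sketch of LOWBOW-style local histograms: one position-weighted term
# histogram per kernel center mu, with scale sigma.
import numpy as np

def lowbow(tokens, vocab, centers=(0.25, 0.5, 0.75), sigma=0.2):
    idx = {w: i for i, w in enumerate(vocab)}
    pos = np.linspace(0.0, 1.0, num=len(tokens))  # normalized positions
    hists = []
    for mu in centers:
        weights = np.exp(-((pos - mu) ** 2) / (2 * sigma ** 2))
        h = np.zeros(len(vocab))
        for token, w in zip(tokens, weights):
            if token in idx:
                h[idx[token]] += w  # weight each term by proximity to mu
        hists.append(h / (h.sum() or 1.0))
    return hists
```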

N-gram-based author profiles for authorship attribution. Kešelj et al. [25] used similarity measurements from byte-level n-grams of limited size. For training, the $L$ most common n-grams are found for each candidate author; normalized frequencies are then computed for each n-gram. Thus, a profile is built as a set of $L$ pairs, and the frequencies from an unknown text are compared to the profiles to determine the similarity between the two. The author whose profile has the shortest distance to the unknown text is predicted as the author. This algorithm is hereafter referred to as keselj03.

A repetition based measure for verification of text collections and for text categorization. Khmelev et al. [65] introduced an R-measure for determining the repeatedness of a document, intended as an intuitive assessment of a dataset’s characteristics, and applied it to author identification. The R-measure is defined as the character-based normalized sum of all substring lengths that are repeated across multiple documents, computed via the construction of a suffix tree from the concatenation of all documents. Fully repeated substrings are detected when R values reach 1.0. For authorship attribution, the R-measure determines the correct class for a document $T$ among $m$ candidate authors represented by texts $S_1, \ldots, S_m$ using the estimate $\hat{\Theta}(T) = \arg\max_i R(T|S_i)$. This algorithm is hereafter referred to as khmelev03.

Measuring differentiability: Unmasking pseudonymous authors. Koppel et al. [66] present ‘unmasking’ for understanding the depth of difference in style in author verification. The intuition behind this technique lies in the importance of only a small number of features that work to distinguish between two authors. Consider texts $A_i$ and $X$, where $A_i$ is written by author $A$ and the author of $X$ is unknown. The proposed technique performs 10-fold cross-validation using an SVM with a linear kernel. For each fold, the $k$ most strongly weighted positive and negative features are eliminated. This process is repeated iteratively, and the performance degradation curves are recorded. The authors find that the depth of difference for same-author pairs is shallow compared to different-author curves, i.e. performance drops faster when the same author writes the known and unknown text, but slowly degrades otherwise. This result confirms that as the most reliable features are removed, features specific to the legitimate author are removed from same-author feature sets simultaneously. This algorithm is hereafter referred to as koppel07.

Authorship attribution in the wild. Koppel et al. [27] propose a similarity-based algorithm to attribute authors in datasets where the number of candidates is large and open, and document lengths are limited. 250,000 unique space-free character 4-gram features are extracted to capture both document content and style. For each unknown snippet, the authors randomly choose some fraction $k_2$ of the feature set, find the best-match candidate using cosine similarity, and repeat with different subsets $k_1$ times. Scores are then calculated for each candidate author to indicate the proportion of times the candidate is considered the best match. Finally, the candidate with the highest score is assigned to the unknown snippet if that score is greater than some threshold, $\sigma$; otherwise, the unknown snippet is classified as Don’t Know. Hence, this algorithm checks whether a given candidate is consistently similar to a test snippet across many sets of randomly selected features. The Don’t Know case helps to deal with the open-set problem. This algorithm is hereafter referred to as koppel11.
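A compact sketch of this repeated random-subspace voting follows; k1, k2, and the acceptance threshold sigma mirror the notation above, while the cosine helper and default values are assumptions.

```python
# Sketch of koppel11-style attribution: vote over random feature subsets
# and answer "Don't Know" unless one candidate wins consistently.
import numpy as np

def cosine(u, v):
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    return float(u @ v / (nu * nv)) if nu and nv else 0.0

def attribute_in_the_wild(x, candidates, k1=100, k2=0.5, sigma=0.9, seed=0):
    """x: feature vector; candidates: {author: feature vector}."""
    rng = np.random.default_rng(seed)
    votes = {a: 0 for a in candidates}
    for _ in range(k1):
        sub = rng.choice(len(x), size=int(k2 * len(x)), replace=False)
        best = max(candidates,
                   key=lambda a: cosine(x[sub], candidates[a][sub]))
        votes[best] += 1
    winner = max(votes, key=votes.get)
    return winner if votes[winner] / k1 >= sigma else "Don't Know"
```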

Augmenting Naïve Bayes classifiers with statistical language models. Peng et al. [34] proposed the Chain Augmented Naïve Bayes (CAN) model to combine naïve Bayes and statistical n-gram methods. The CAN model uses Markov dependencies to relax the Bayesian assumption of independence amongst words. For each unknown document, $d$, the candidate, $c^*$, with the largest probability, $P(c|d)$, amongst all candidates is chosen as the legitimate author. This algorithm is hereafter referred to as peng04.
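The flavor of the CAN model can be conveyed with per-author character bigram language models and add-one smoothing, as sketched below; a uniform prior over candidates is assumed, so maximizing P(c|d) reduces to maximizing the chain likelihood P(d|c).

```python
# Sketch: chain-augmented naive Bayes at the character level, using
# per-author bigram models with add-one (Laplace) smoothing.
import math
from collections import Counter

class CharBigramModel:
    def __init__(self, text):
        self.bigrams = Counter(zip(text, text[1:]))
        self.unigrams = Counter(text)
        self.vocab = len(set(text)) or 1

    def logprob(self, text):
        return sum(math.log((self.bigrams[(a, b)] + 1) /
                            (self.unigrams[a] + self.vocab))
                   for a, b in zip(text, text[1:]))

def attribute_can(doc, training):
    """training: {author: concatenated training text}."""
    models = {a: CharBigramModel(t) for a, t in training.items()}
    return max(models, key=lambda a: models[a].logprob(doc))
```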

Syntactic N-grams as machine learning features for Natural Language Processing. Sidorov et al. [43] introduced syntactic n-grams (sn-grams) to preserve syntactical relationships between words. The authors argue that traditional n-grams provide surface representations that ignore dependency and constituency relationships, where dependency parsing identifies word relationships and constituency parsing identifies sub-phrases within sentences. Additionally, sn-grams are claimed to be more stable and less arbitrary, and to offer a better linguistic interpretation. Sn-grams are n-grams constructed by following paths in syntactic trees, as opposed to the arbitrary extraction of consecutive terms as they appear in the text. This algorithm is hereafter referred to as sidorov14.

Authorship attribution based on Feature Set Subspacing Ensembles. Stamatatos [54] presents a classifier ensemble based on a feature set subspacing method for authorship attribution, using the 1,000 most frequent words of the training corpus. After feature extraction, subsets of the feature set are selected using two approaches. The first approach, the k-random-classifier (kRC), randomly selects feature subsets of size m, k times. The second approach uses exhaustive disjoint subspacing (EDS) to randomly divide the feature set into equally-sized disjoint subsets of size m. k resulting classifiers are generated from the k subsets using linear discriminant analysis as the base learning algorithm, and a predefined combination method is utilized to ensemble the base classifiers. Stamatatos uses the mp combination method to find the average posterior probabilities. The candidate with the largest posterior probability is chosen as the author of the unknown text. This algorithm is hereafter referred to as stamatatos06.

Author identification using imbalanced and limited training texts. Stamatatos [26] presents new distance measures based on character n-grams (in this paper, 3-grams are used). These metrics make use of author profiles, where a profile, $P$, is a set of pairs (n-gram, normalized frequency) of the $L$ most frequent n-grams in a text sample. The first metric measures the distance between an unknown text profile and a training author profile:

$$d_1(P(x), P(T_a)) = \sum_{g \in P(x)} \left( \frac{2\,(f_x(g) - f_{T_a}(g))}{f_x(g) + f_{T_a}(g)} \right)^2$$

where $P(x)$ is the profile for the unknown text, $P(T_a)$ is the candidate author’s profile, and $f_x(g)$ represents the frequency of n-gram $g$ in $P(x)$. The second metric, $d_2$, concatenates all training samples as a normalization step:

$$d_2(P(x), P(T_a), P(N)) = \sum_{g \in P(x)} \left( \frac{2\,(f_x(g) - f_{T_a}(g))}{f_x(g) + f_{T_a}(g)} \right)^2 \cdot \left( \frac{2\,(f_x(g) - f_N(g))}{f_x(g) + f_N(g)} \right)^2$$

where $N$ is the concatenated text. The author with the minimum score is selected as the legitimate author of the unknown text for both methods. This algorithm is hereafter referred to as stamatatos07.

Using compression-based language models for text categorization. Teahan et al. [52] apply the Prediction by Partial Matching (PPM) text compression scheme to text categorization. For an unknown document, its cross-entropy under the various categories is computed; the lower the cross-entropy, the more likely the document belongs to the category. The category and feature cutoffs are defined such that precision and recall are maximized. This algorithm is hereafter referred to as teahan03.

Overall, we performed 336 experiments for our performance analysis. For each algorithm, 4-fold cross-validation was used on CASIS-orig to observe identification accuracy. To study the effect of text lengths, 5-fold cross-validation was performed using CASIS-2, CASIS-5, CASIS-10 and CASIS-20. Identification accuracy was recorded for the fourteen algorithms on all subsets. Identification accuracy is computed as

$$\text{Accuracy} = \frac{TP}{N} \qquad (3\text{-}1)$$

where $TP$ is the number of true positives and $N$ is the total number of samples.

3.2.2.1 Algorithms with poor performance

As suspected, not all algorithms can perform under such demanding circumstances. In particular, we observed the lowest accuracy among arun09, khmelev03, koppel07, peng04, and sidorov14 (Figure 3-1A):

• arun09 uses stopword usage as an indication of style, but it requires documents of considerable length to compute stopword graphs with reliable edge information. Also, since KL divergence is used to compute similarity, long text lengths are required for robust computation of training and test stopword distributions.

• khmelev03 introduces the R-measure for measuring repeatability. It is suspected that either the CASIS dataset contains little repeatability in author samples, or the algorithm cannot accurately distinguish between a large number of authors due to high R values from groups of different authors.

• koppel07 presents an unmasking technique that iteratively removes strong features to determine same-author pairs. However, repeatedly removing strong features presumes an extensive feature set that can withstand the iterative nature of the algorithm. For smaller author samples, the algorithm may not be able to run enough iterations to achieve the performance the authors found when testing on books. It is suspected that the smaller samples in the CASIS partitions are not suitable for an iterative algorithm.

• peng04 employs the CAN method to join Bayesian principles with n-grams. In terms of computation time, peng04 took considerably long to finish on CASIS. Several factors could have contributed to the low accuracy on CASIS: language dependencies, inability to scale to larger datasets, and the limited availability of word-level trigrams.

• sidorov14 relies on syntactic dependencies for extracting sn-grams and requires parsing to find syntactic paths. It is possible that syntactic dependencies are harder to find as samples become smaller, limiting the accuracy of this algorithm.

3.2.2.2 Algorithms with fair performance

Some algorithms performed poorly on certain sentence partitions but moderately on others. These include burrows02, devel01 and stamatatos06 (Figure 3-1B):

• burrows02 presents Delta, a measure based on the most frequent words. 47% accuracy is reported when 150 words are used; accuracy drops to 30% when fewer (40) words are used in the feature set. In CASIS-orig, each sample has an average of 304 words, so further dividing this dataset makes samples smaller. It may be the case that this algorithm needs a large enough sample to extract as many common words as possible for decent performance.

• devel01 uses topic information for authorship attribution. However, in small samples, such as those in the CASIS partitions, there may not be enough semantic or lexical information to infer topics.

• stamatatos06 uses random sampling to train k classifiers on k feature subspaces drawn from a 1,000-dimensional feature space. Poor performance on some CASIS subsets could be due to the inability to properly train many classifiers on the information available in smaller samples.

3.2.2.3 Algorithms with good performance

Six algorithms performed well, and further exploration of these methods could reveal important insights regarding the needle-in-a-haystack problem. These include benedetto02, jairescalante11, keselj03, koppel11, stamatatos07 and teahan03 (Figure 3-1C):

Figure 3-1. Recognition accuracy of various implementations. A) Poorly performing algorithms. B) Moderately performing algorithms. C) Best performing algorithms.

• benedetto02 and teahan03 both use compression-based methods; given their similar performance, text compression appears to be a robust technique for large-scale authorship attribution. However, teahan03 outperforms benedetto02; this may be due to the PPM approach used by teahan03, which allows back-off to shorter lengths when substring matches are not found. It should also be mentioned that teahan03 was the best performing algorithm for all partitions except CASIS-orig.

• jairescalante11 proposes locally-weighted BOW features for preserving sequential clues in documents. The high accuracy of this method could be attributed to the use of multiple histograms across samples at different locations which, in essence, expands the training set. Hence, where other algorithms suffer due to small and limited samples, this method creates more training samples per author.

• keselj03 and stamatatos07 both pair n-grams with their frequencies. The main differences are the distance measures introduced in the latter and the byte-level features of the former. However, on several sentence subsets, the performance of these algorithms is very similar. Therefore, frequency information at very low-level representations seems to be useful.

• koppel11 was developed specifically for the needle-in-a-haystack problem, which appears to contribute to its performance on CASIS. Furthermore, koppel11's feature space sampling for random subspaces is similar to stamatatos06; however, koppel11 is based on character-level features, whereas stamatatos06 relies on word-level features. This difference may be a significant factor in the observed performance improvement.

3.2.3 Discussion

Overall, it appears that lower-level representations, such as byte and character-level features, outperform higher-level features, such as syntactic and word-level representations. This could be due to the limited amount of text available in the author samples, where high-level representations are sparse. Further, when algorithms are specifically designed for large-scale problems and are tested with representative datasets, this is reflected in their performance on other datasets. On the other hand, for approaches where the technique is emphasized more than scalability, performance is not consistent. Finally, both compression-based methods performed well on the CASIS dataset; this could be an indication that compression-based algorithms are beneficial in large-scale authorship attribution.

To further analyze these fourteen implementations, performance is evaluated across the corpus subsets (Figure 3-2). Performance drops across all algorithms when text lengths decrease from 20 sentences per post (CASIS-20) to 2 sentences per post (CASIS-2). This drop could be attributed to shorter texts per post or to the increase in the number of authors from CASIS-20 to CASIS-2, as shown in Table 3-1. Furthermore, CASIS-2 and CASIS-orig have a similar number of authors; however, CASIS-2 has shorter texts and more samples than CASIS-orig. Hence, the performance of algorithms on CASIS-2 and CASIS-orig sheds light on the effect of text length. The general trend is that performance on CASIS-orig is better than on CASIS-2 for all approaches except stamatatos07, which implies that text length does affect authorship attribution for all of these approaches. Further, keselj03 and teahan03 seem to perform best on both CASIS-2 and CASIS-orig. Both approaches encode neighboring context using n-grams or partial matching, which might help them scale well with the number of authors, the number of samples per author, and shorter text lengths.

3.3 Verification

In this section, we explain our approach to authorship verification using the CASIS dataset and discuss the results.

3.3.1 Methodology

In order to perform authorship verification, we use a vector representation of various stylistic cues and compare its verification performance with that of the best-performing algorithms as described in Section 3.2.2.3.

Feature Representation. Several bag-of-word (BOW) features are extracted from the CASIS corpus for performance evaluation under verification and identification protocols and feature and clustering analysis. Feature vectors of 2,016 character, lexical, and syntactical features are extracted from text samples. Character features consider documents as a sequence of characters and attempt to extract statistics

Figure 3-2. Recognition accuracy of various implementations across CASIS subsets. A) Recognition accuracy based on CASIS-2. B) Recognition accuracy based on CASIS-5. C) Recognition accuracy based on CASIS-10. D) Recognition accuracy based on CASIS-20. E) Recognition accuracy based on CASIS-orig.

regarding the most fundamental representation of the text. Character-level features are language-independent and require no language processing. Lexical features provide statistics at the word level and therefore require word boundaries in the respective language. Syntactic features capture sentence structure for the representation of syntax. These features are inspired by existing literature [62] and are commonly used for stylometric analysis (Tables 3-2, 3-3 and 3-4).

Table 3-2. Character level features - 213 dimensions
No. of dimensions   Description
1    #characters
1    #alphabets
1    #uppercase alphabets
1    #digits
1    #whitespaces
1    #tabspaces
26   Frequency of alphabets
21   Frequency of special characters
10   Frequency of digits
50   Frequency of character bigrams
50   Frequency of character trigrams
50   Frequency of character 4-grams

Similarity Measures. Five-fold cross-validation is used on CASIS-orig, CASIS-20, CASIS-10, and CASIS-5 to study authorship attribution under the verification protocol. In each fold, similarity scores are computed between every test sample and every training sample using the cosine and maxmin similarity metrics. The cosine similarity of two vectors is the cosine of the angle between them; if two vectors are similar, the angle between them is close to zero and the cosine similarity is close to one. The maxmin similarity, on the other hand, has been found to work well in practical applications. Let x and y represent two N-dimensional vectors. Cosine similarity is defined as ∑_{i=1}^{N} x_i y_i / (√(∑_{i=1}^{N} x_i²) √(∑_{i=1}^{N} y_i²)), and maxmin similarity is defined as ∑_{i=1}^{N} min(x_i, y_i)/max(x_i, y_i). Instead of computing the similarity measure over all 2,016 dimensions, the technique of Koppel et al. [27] is followed: 500 dimensions are randomly chosen, and the similarity is computed using those dimensions (Tables 3-3 and 3-4 list the remaining feature categories). This procedure is repeated for ten iterations, and the average similarity measure is computed. A minimal sketch of this subspace-sampled similarity follows.
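The sketch below assumes dense numpy feature vectors and follows the per-dimension maxmin form reconstructed above; parameter values mirror the procedure described in the text.

```python
import numpy as np

def cosine(x, y):
    # Cosine of the angle between x and y (epsilon guards all-zero subvectors).
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-12))

def maxmin(x, y):
    # Per-dimension min/max ratios, summed; dimensions where both values are
    # zero are skipped (0/0 is undefined).
    num, den = np.minimum(x, y), np.maximum(x, y)
    valid = den > 0
    return float(np.sum(num[valid] / den[valid]))

def subspace_similarity(x, y, sim=cosine, dims=500, iters=10, seed=0):
    # Average the similarity over `iters` random subsets of `dims` dimensions.
    rng = np.random.default_rng(seed)
    scores = [sim(x[idx], y[idx])
              for idx in (rng.choice(x.size, size=dims, replace=False)
                          for _ in range(iters))]
    return float(np.mean(scores))
```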

Table 3-3. Lexical features - 145 dimensions
No. of dimensions   Description
1    #words
1    Fraction of short words
1    Average word length
1    Average sentence length in characters
1    Average sentence length in words
1    #Unique words
10   Hapax legomenon
20   Frequency of words of different word lengths
1    Fraction of capitalized words
1    Fraction of all uppercase words
1    Fraction of all lowercase words
1    Fraction of all camelcase words
1    Fraction of all othercase words
1    Yule's I-measure
1    Sichel's S-measure
1    Brunet's W-measure
1    Honore's R-measure
50   Frequency of word bigrams
50   Frequency of word trigrams

Table 3-4. Syntactic features - 1658 dimensions
No. of dimensions   Description
11     Frequency of punctuation (',', '.', '?', '!', ':', ';', ''', '"', '-', '(', ')')
512    Frequency of function words
50     Frequency of POS bigrams
50     Frequency of POS trigrams
1035   Frequency of syntactic pairs

3.3.2 Results

With the similarity matrices for all five folds, the false accept and false reject rates (FAR and FRR, respectively) across all folds are computed at different thresholds. The equal error rate (EER) is defined as the error rate at which FAR and FRR are equal. Receiver operating characteristic (ROC) curves for all the data subsets are shown in Figure 3-3. From these plots, it can be seen that cosine similarity performs better than maxmin similarity across all datasets. Further, a trend is noticed where the EER increases

as the number of sentences per post decreases. This trend could be attributed either to the increase in the number of samples as the number of sentences per post decreases or to the noisy nature of frequency-based feature representations for shorter text lengths.


Figure 3-3. ROC curves for CASIS subsets. A) ROC with cosine similarity. B) ROC with maxmin similarity.

Further, the d-prime measure is evaluated. d-prime is a statistical measure of the discrimination capability of a biometric system; it represents the degree of separation between the scores of genuine and impostor matches. d-prime is defined as d′ = (µ_gen − µ_imp) / √((σ²_gen + σ²_imp)/2), where µ_gen and σ²_gen are the mean and variance of genuine match scores, respectively, and µ_imp and σ²_imp are the mean and variance of impostor match scores, respectively. The larger the value of d-prime, the better the discrimination capability of the system. The score distributions and corresponding d-prime values for all data subsets are shown in Figure 3-4. From these plots, one can infer the difficulty of the authorship verification problem from the amount of overlap between the score distributions of genuine and impostor matches. Cosine similarity again outperforms maxmin similarity, as measured by d-prime, across all data subsets. Again, the d-prime value decreases as the number of sentences per post decreases.
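A short sketch computing d-prime from arrays of genuine and impostor scores, following the definition reconstructed above:

```python
import numpy as np

def d_prime(genuine, impostor):
    # Separation between genuine and impostor score distributions.
    g, i = np.asarray(genuine, float), np.asarray(impostor, float)
    return (g.mean() - i.mean()) / np.sqrt((g.var() + i.var()) / 2)
```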


Figure 3-4. Score distributions for CASIS subsets. A) CASIS-orig - cosine similarity. B) CASIS-orig - maxmin similarity. C) CASIS-20 - cosine similarity. D) CASIS-20 - maxmin similarity. E) CASIS-10 - cosine similarity. F) CASIS-10 - maxmin similarity. G) CASIS-5 - cosine similarity. H) CASIS-5 - maxmin similarity

3.3.3 Discussion

Based on the results of the verification experiments, cosine similarity provides better performance than maxmin similarity for author verification. This inference is substantiated by the score distributions, where cosine similarity yields larger d-prime values than maxmin similarity. Furthermore, the ROC curves and EERs confirm that performance drops as the number of sentences per post decreases, which is corroborated by the decreasing d-prime values. This result shows the difficulty of authorship verification for short texts using BOW features, which could be attributed to the increased noise in frequency-based features of short texts.

3.4 Feature Analysis

Bag-of-word features are commonly used due to their simplicity and effectiveness [11]. Hence, this section analyzes the 2,016 character, lexical, and syntactic features described

in Section 3.3.1 for representation of authorial style to answer the following research questions:

• Are all features important for authorship attribution?

• Is style represented by a salient feature set for all authors?

• Is each author better represented by a specific unique set of features?

3.4.1 Methodology

Information gain and the F-statistic are used to determine whether all dimensions are required for representing and distinguishing between different authorial styles. Information gain is the reduction in entropy of a random variable X when the state of another random variable A is known. Given entropy H and the state a of A, information gain is given by IG(X, a) = H(X) − H(X|a). In our experiments, H(X) represents the entropy before knowing a for each feature dimension A; higher values of information gain indicate feature saliency. We vary the value of a in 20 steps, measure the information gain each time, and retain the maximum information gain for every feature dimension A. The F-statistic is commonly used in ANOVA tests to identify whether the means of two groups are significantly different. It takes into account both between-group variance and

within-group variance. The F-statistic is defined as F = (between-group variance) / (within-group variance). It is preferable to have small within-group variance and large between-group variance to distinguish different groups reliably; therefore, high values of the F-statistic indicate feature saliency. After information gain and F-statistic are computed for all 2,016 dimensions, these measures are ranked according to various criteria. In order to keep the criterion threshold adaptable to the data, these measures are sorted and a threshold is selected at the elbow of the curve. For dimensions whose information gain and F-statistic are greater than the chosen threshold, four criteria for each measure (eight criteria in total) are used to pick the salient features. Feature dimensions which satisfy at least four of these eight criteria form the final salient feature subset. The criteria, followed by a sketch of the scoring step, are:

1. Each measure is ranked for every data subset separately, and the salient feature subset is chosen using a threshold. The common feature dimensions amongst salient features of all data subsets are chosen as the salient feature subset.

2. The mean information gain/F-statistic is computed for each dimension across all sentence subsets, and the mean measures are ranked. The top-ranking dimensions based on a threshold are chosen as the salient feature subset.

3. For each feature dimension, the Coefficient of Variation (CoV) is computed, where CoV = σ/µ, σ is the standard deviation of values of that feature dimension for a particular data subset, and µ is the corresponding mean value. The mean CoV is computed for each dimension across all sentence subsets, and the CoVs are ranked. The top-ranking dimensions with low CoV are used as the salient feature subset.

4. For each measure, the relative rankings of feature dimensions are computed for each data subset, with higher values of information gain/F-statistic ranked closer to 1. Ideally, the relative rankings of features should vary minimally across data subsets, so this criterion uses the mean coefficient of variation of the relative rankings of features across all data subsets. The top-ranking dimensions are retained as the salient feature subset.
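A hedged sketch of the two saliency measures using scikit-learn, with mutual information standing in for the information gain computation and a simple rank combination standing in for the full elbow-threshold, four-of-eight criteria voting:

```python
import numpy as np
from sklearn.feature_selection import f_classif, mutual_info_classif

def salient_dimensions(X, y, keep=334):
    f_stat, _ = f_classif(X, y)            # between/within group variance ratio
    info_gain = mutual_info_classif(X, y)  # entropy reduction, akin to IG(X, a)
    f_stat = np.nan_to_num(f_stat)         # invalid (constant) dims give NaN
    # Rank dimensions by each measure (rank 0 = most salient) and keep those
    # ranking high under both -- a crude proxy for the criteria above.
    f_rank = np.argsort(np.argsort(-f_stat))
    ig_rank = np.argsort(np.argsort(-info_gain))
    return np.argsort(f_rank + ig_rank)[:keep]
```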

3.4.2 Results

Based on these criteria, 334 dimensions that rank high in at least four of the eight criteria are retained. The salient features across different feature categories are shown in Table 3-5. Invalid dimensions are those with zero values across all data subsets.

Table 3-5. Salient features across different feature categories
Feature category   No. of dimensions   Salient dimensions (#dim / %dim)   Invalid dimensions (#dim / %dim)
Character          213                 59 / 27.70                         20 / 9.39
Lexical            145                 40 / 27.59                         14 / 9.66
Syntactic          1658                235 / 14.17                        721 / 43.49

Authorship attribution is carried out on all CASIS subsets with both the full and reduced (i.e., salient) feature sets using 5-fold cross-validation and Random Forests to study the effect of salient features on authorship attribution. The results are shown in Table 3-6. It can be observed that the reduced feature set performs significantly better than the full set of 2,016 feature dimensions.

Table 3-6. Comparison of salient features against the full feature set for authorship attribution. (TN = True Negative)
Data subset   Accuracy(%) Full dim   Accuracy(%) Reduced dim   Accuracy with TN(%) Full dim   Accuracy with TN(%) Reduced dim
CASIS-orig    21.22                  26.35                     99.84                          99.85
CASIS-20      80.33                  84.15                     97.92                          98.65
CASIS-10      38.40                  46.65                     98.80                          98.96
CASIS-5       15.30                  18.80                     99.55                          99.57
CASIS-2        4.56                   5.86                     99.81                          99.81

Five-fold cross-validation is performed on author subsets to study the effect of varying sentence lengths. In each data subset, the authors remain the same while the number of sentences per post varies. The results are shown in Table 3-7.

Table 3-7. Effect of varying sentence lengths. (Acc(%) given as Full dim. / Red. dim.)
Data subset   19 authors        103 authors       378 authors
CASIS-orig    73.68 / 75.00     44.90 / 51.70     30.22 / 36.77
CASIS-20      80.33 / 84.15     - / -             - / -
CASIS-10      61.03 / 68.93     37.55 / 45.95     - / -
CASIS-5       40.48 / 46.03     25.27 / 28.97     15.30 / 18.80
CASIS-2       30.61 / 29.73     13.09 / 15.62      6.93 /  8.75

3.4.3 Discussion

It was observed that more character and lexical features are salient for authorship attribution, with fewer invalid dimensions. Although the syntactic category accounts for a significant portion of the 2,016 dimensions, many of these features are invalid, and a smaller percentage of syntactic features are salient for distinguishing authors. This observation suggests that while a frequency-based representation of character and lexical features may be suitable for authorship attribution, a frequency representation of syntactic features might not be useful for characterizing style. Syntax represents structure in a sentence, and authorship attribution may benefit from a representation that captures this structure effectively. When analyzing the effect of salient features on authorship attribution, performance on data subsets with fewer sentences per post decreases due to the increase

in the number of samples and authors, along with shorter sample lengths. When text lengths are too short, frequency-based features become noisier and may contribute to performance degradation. While overall performance is not on par with some of the state-of-the-art baseline methods, these experiments aim to investigate which feature categories contribute to style representation given a common baseline. When analyzing the effect of varying text lengths, performance using 10 to 20 sentences per post is close to that of the original posts. This could be because the average length in the CASIS dataset is 13 ± 11 sentences per post. However, as the number of sentences per post decreases, performance drops. This reiterates that since most of these features are frequency-based, shorter texts render the features noisy and cause attribution performance to decrease. We further perform a preliminary analysis to investigate whether all authors consistently use the salient features, and whether each author exhibits a different subset of features, a.k.a. author signatures, in which they are most distinguishable.

It is further investigated whether these features are consistent for every author using CoV as a measure of consistency with a set of salient features. Low values of CoV indicate high consistency. The mean consistency of the 334 salient dimensions is computed for every author across all sentence subsets in CASIS-orig, CASIS-20, CASIS-10, CASIS-5, and CASIS-2. The dimensions which have low CoV across all data subsets are consistent features for an author. It was observed that not all of the salient features are consistent for an author. Some of these features showed high CoV as the number of sentences per post decreases. This observation suggests that though the salient feature set is suitable for identifying an author, it is not consistent for all dimensions, especially as text lengths get short.

3.4.3.2 Author Signatures

We also investigate whether every author has a specific set of features in which they are consistent. Such a set could serve as an author signature, a unique feature set which encapsulates the author's style. The number of features that make up an author's signature may differ from one author to another. Experiments were conducted on all data subsets with author signatures chosen from all 2,016 feature dimensions. It was observed that no two signatures were the same. Based on our analysis of various feature dimensions for authorship attribution, we observe that a reduced feature subset of 334 dimensions, chosen using information gain and F-statistic, performs better across all sentence subsets. This observation suggests that frequency-based features require texts of a certain length for reliable feature extraction. The reduced feature subset, however, does not seem to be consistent across all sentence lengths. Each author exhibits consistency in a different set of features, leading to the hypothesis that every author has a unique author signature.

3.5 Clustering

The literature has used hand-engineered features based on domain knowledge to represent style computationally. These feature representations are typically frequencies of different style markers, such as function words, punctuation and n-grams. However, it is unclear whether such features help to identify an author uniquely. Clustering such features in an unsupervised fashion provides some insight into their effectiveness: if authors can be uniquely identified with such feature representations, one should find well-separated clusters, each containing samples from a single author; i.e., the cluster purities will be very high. In this section, we report preliminary results of experiments that evaluate clustering using all features and the identified salient features. Clustering a large number of high-dimensional samples is challenging. Hence, we use a graph-based clustering algorithm, Markov clustering, that has been optimized for scalability and speed. In order to perform clustering, we need to represent

samples as a graph network, with every sample as a node and the similarity between two samples as edge weights. Only the closest K neighbors of every node are retained in the graph network. To ensure the nearest-neighbor search is scalable for large datasets, we use Locality-Sensitive Hashing (LSH) to speed it up.

3.5.1 Methodology

In this section, we describe our approach to perform unsupervised clustering of style feature representations and the ensuing results.

Markov Clustering. Markov clustering is a fast and scalable algorithm for unsupervised clustering of graphs [67]. Natural clusters in graphs tend to be highly connected, while nodes belonging to different clusters have far fewer edges between them. Further, a random walk on the graph will visit nodes within a cluster frequently, while the probability of moving between nodes of different clusters is far lower. These properties allow Markov clustering to perform unsupervised clustering with no prior information. In order to perform Markov clustering, a stochastic matrix (Markov matrix, M) is computed such that each column corresponds to a node j, and each row entry i of column j represents the probability of going from node j to node i. Clustering is performed by alternating between two operators on the Markov matrix: expansion and inflation. Expansion corresponds to random walks of greater length between nodes. Since nodes within a cluster are expected to be highly connected, there will be more than one path between two nodes in a cluster; hence, expansion increases the probabilities of random walks between nodes within a cluster. Expansion is achieved by squaring the Markov matrix. Inflation further increases the probabilities of intra-cluster walks while reducing those of inter-cluster walks. Given an inflation coefficient r > 1, the elements of column j after the inflation operation Γ_r() are given by Γ_r(M_ij) = M_ij^r / ∑_i M_ij^r, where ∑_i M_ij^r is the sum of all elements of column j raised to the power r. The repeated operations of expansion and inflation result in increased

probabilities of random walks between nodes in a cluster, while probabilities of random walks between nodes of different clusters are reduced. This process leads to a natural separation of clusters over a few iterations. The presented experiments use the open-source implementation presented in [68]. A compact sketch of the expansion/inflation loop follows.
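The numpy sketch below is only illustrative; the experiments use the optimized implementation of [68]. It assumes a symmetric non-negative adjacency matrix, such as the weighted K-nearest-neighbor graph described above.

```python
import numpy as np

def markov_cluster(adjacency, r=2.0, iters=50, tol=1e-8):
    # Column-stochastic Markov matrix with self-loops added.
    M = adjacency + np.eye(adjacency.shape[0])
    M = M / M.sum(axis=0, keepdims=True)
    for _ in range(iters):
        prev = M
        M = M @ M                               # expansion: longer random walks
        M = M ** r                              # inflation: boost intra-cluster walks
        M = M / M.sum(axis=0, keepdims=True)    # renormalize columns
        if np.abs(M - prev).max() < tol:
            break
    # Nodes sharing the same "attractor" row belong to the same cluster.
    return M.argmax(axis=0)
```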

Locality-Sensitive Hashing. Locality-sensitive hashing (LSH) is a fast approach to approximate nearest-neighbor search in high-dimensional spaces [69]. It extends the concept of hashing such that similar items end up in the same “buckets” with high probability. An LSH function h is chosen uniformly at random from a family of functions F such that similar elements in a metric space M are mapped to the same bucket s. Given a distance metric d on M, a threshold R > 0 and an approximation factor c > 1, two elements p, q ∈ M are mapped as follows:

• h(p) = h(q) with probability P1 if d(p,q) ≤ R.

• h(p) = h(q) with probability P2 if d(p,q) ≥ cR, where P1 > P2. In our experiments, we use the LSH implementation FALCONN [70]. A toy illustration of the bucketing idea appears below.
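A toy random-hyperplane LSH sketch for cosine similarity; the experiments use FALCONN, so this only illustrates how similar vectors collide in the same buckets.

```python
import numpy as np
from collections import defaultdict

def lsh_buckets(X, n_bits=16, seed=0):
    """Hash each row of X to an n_bits sign signature; similar vectors
    fall into the same bucket with high probability."""
    rng = np.random.default_rng(seed)
    hyperplanes = rng.standard_normal((X.shape[1], n_bits))
    signs = (X @ hyperplanes) > 0
    buckets = defaultdict(list)
    for i, bits in enumerate(signs):
        buckets[bits.tobytes()].append(i)   # bucket key = bit signature
    return buckets
```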

3.5.2 Preliminary Results

A graph network was computed using LSH for 50 nearest neighbors and three different similarity measures, and was then provided as input to the Markov clustering algorithm. The clustering of CASIS-orig and CASIS-2 samples is shown in Figure 3-5. The visualizations were made using BioLayout Express 3D [71]. Every circle denotes a graph node, or text sample; nodes of the same color indicate a cluster. Results show that there are no separate connected blobs, indicating the difficulty of authorship attribution. However, each cluster may represent a meta-class which characterizes a writing style. Since the number of clusters is far smaller than the number of authors, it can be inferred that multiple authors are likely grouped together. Further, it is also possible that an author can be characterized by multiple styles and can belong to more than one cluster.

Effect of similarity measures. To study the effect of various similarity measures, we performed clustering on the CASIS-5 dataset with three similarity measures while

keeping clustering parameters the same, as shown in Figure 3-6. It can be observed that cosine similarity and maxmin similarity yield contiguous clusters, while inverse Euclidean distance does not yield coherent clusters but multiple small split clusters.

Figure 3-5. Clustering results. A) Clustering on CASIS-orig: 1000 authors, 4000 nodes, 196,000 edges resulted in 16 clusters. B) Clustering on CASIS-2: 984 authors, 27662 nodes, 1,355,438 edges resulted in 96 clusters.


Figure 3-6. Markov clustering on CASIS-5 with three similarity measures using 378 authors, 6691 nodes, 327,859 edges. A) Maxmin similarity - 22 clusters. B) Inverse Euclidean distance - 142 clusters. C) Cosine similarity - 20 clusters.

Effect of feature dimensions. To study clustering on the salient feature dimensions, we performed clustering on CASIS-2 using the same parameters on both the 2,016-dimensional and 334-dimensional feature vectors. The results are shown in Figure 3-7. It can be seen that both yield a similar number of clusters and the clusters remain contiguous even when the reduced feature set is used. This observation indicates that the reduced feature set could be sufficient for authorship attribution without trading off performance.


Figure 3-7. Markov clustering on CASIS-2 with features of different dimensions using 984 authors, 27662 nodes, 1,355,438 edges and maxmin similarity. A) Clustering with 2016 dimensions - 91 clusters. B) Clustering with 334 dimensions - 96 clusters.

3.5.3 Discussion

Based on the results of clustering BOW features, it is observed that clustering reveals no separate connected components. This observation sheds light on the difficulty of the authorship attribution problem. Further, the number of clusters is far smaller than the number of authors, implying that these clusters could represent meta-classes with a similar style. Each cluster could contain samples from multiple authors, and each author

could be a member of multiple clusters. Moreover, cosine and maxmin similarity yield contiguous clusters, suggesting that these measures are more representative of feature similarity in the original dimensions. Finally, clustering using salient features yields a similar number of clusters as clustering using all dimensions, suggesting that the reduced feature set could be sufficient for authorship attribution without compromising performance.

3.6 Summary

This chapter analyzed the performance of author identification and verification on a large single-domain dataset with 1,000 authors. Further, it performed a preliminary analysis of salient features among a set of character, lexical and syntactic cues, and preliminary clustering analysis to highlight the challenges of authorship attribution using the feature representations described in Section 3.3.1. For author identification, fourteen state-of-the-art algorithms were evaluated, six of which were found to be suitable for large-scale authorship attribution. It also appears that lower-level features are better representations for limited author samples, and compression-based methods also yield good performance under challenging scenarios. For author verification, cosine similarity performed better than maxmin similarity, and performance degraded as sample length decreased. When analyzing salient features for authorship attribution, a feature set of 2,016 dimensions spanning character, lexical and syntactic categories was used; only a fraction of these proved essential. Interestingly, for large-scale authorship attribution, the frequency representation of syntactic information appears less salient. This observation suggests that bag-of-words representations using character and word-level features may be more suitable for unconstrained problems. Preliminary feature analysis also shows that salient features are not consistent across all authors and

each author has a unique signature, a different feature subset in which they exhibit consistency. Clustering analysis also revealed no separated components, highlighting the difficulty of large-scale authorship attribution. Further, the number of clusters was far fewer than the number of authors, indicating that clusters under this representation might represent meta-classes with similar style rather than individual authors. While authorship attribution is robust with a small set of candidate authors, the performance analysis presented in this chapter shows the difficulty of correctly identifying the true author under challenging circumstances. Such challenges are common when attempting to use stylometry as a biometric trait to identify or profile authors from their online communications, which are typically characterized by short text lengths. This performance analysis and the corresponding insights set the stage for the research questions we address in the following chapters.

CHAPTER 4
WHAT CONSTITUTES STYLE?

In the previous chapter, we performed an analysis of state-of-the-art authorship attribution approaches on a single-domain dataset. We observed that many of the best-performing approaches used low-level lexical features. However, such features capture both content and style [63]. They leverage content information to provide an apparent boost in performance on single-domain datasets, since training and test data come from the same domain and may contain similar content. However, content can be easily mimicked and is not behavioral. Hence, it is imperative to disentangle style and content and focus only on cues that are independent of content. In this chapter, we take a more in-depth look at the different components that might constitute style and are independent of content. Specifically, we study the role of various structural and lexical feature representations to identify stylistic cues that are independent of content. We study this using a large cross-domain dataset that consists of samples written by the same author on different topics in the same genre/platform and in different genres/platforms. By performing this analysis on a cross-domain dataset, we hypothesize that content information will be disregarded to a large extent and attribution will be performed using cues that are behavioral/stylistic. To the best of our knowledge, this is the first large-scale cross-domain analysis in stylometry. To study the role of structural representations, we perform cross-domain attribution using both shallow and deep syntactic models. Neither syntactic representation contains any lexical information, and hence we hypothesize that these representations will be independent of content. For shallow syntactic models, we use vector representations of the most common parts-of-speech (POS) n-grams, while for deep syntactic models, we use a purely syntactic language model. The syntactic model is constructed using the probabilistic context-free grammar (PCFG) obtained from an author's training data.

While the approach is similar to that of Raghavan et al. [39], we keep the model purely syntactic by removing any lexical information. We also add context to the rewrite rules through vertical and horizontal Markovization. To handle previously unseen rules, we incorporate smoothing so the language model can back off to lower-order models. To study the role of lexical representations, we perform attribution by masking out all words, or specific topic words, corresponding to different lexical parts-of-speech (POS), which include nouns (common and proper nouns), verbs (except auxiliary verbs), adjectives and adverbs. We do so because character n-gram based approaches have largely outperformed function word based approaches [29], indicating that lexical words may also help with authorship attribution. This analysis helps us understand which of these lexical words may help represent style. Stamatatos [72] also masks out infrequent words (mostly lexical words), but we perform an in-depth analysis of which of these lexical words are independent of content. Overall, we attempt to answer the following questions in this chapter:

• Do structural representations help with stylometric attribution in cross-domain settings?

• What are the effects of masking words corresponding to different lexical POS on cross-domain attribution?

• Does masking topic words corresponding to various lexical POS help with stylometric attribution in cross-domain settings?

In the following sections, we explain the approaches taken to answer the above questions about stylometric attribution. This analysis could help us devise better style representations that work well in cross-domain scenarios.

4.1 Datasets

We use two datasets, a single-domain, and a cross-domain dataset, to understand the role of different feature representations in representing the style. They are:

1. IMDB1M [73] - This is a single-domain dataset and consists of 204,809 posts and 66,816 reviews written by 22,116 users. For our experiments, we use only a subset of

59 435 users with 30 posts per person. The chosen data subset contains 12617 samples across 435 users.

2. UF Cross-domain dataset - There are very few publicly available cross-domain datasets [74] in the research community. Further, these are not large-scale datasets since it is difficult to find text samples written by the same person across different topics or genres. Hence, we collected and prepared a large-scale cross-domain dataset from the StackExchange network. StackExchange network consists of 176 question and answer forums with over 3 million users. We extracted both cross-topic and cross-genre data from StackExchange. The dataset details are as follows:

• UF Cross-topic dataset - StackExchange network consists of question and answer forums dedicated to over 176 topics like StackOverflow, Mathematics etc. We extract cross-topic data from these forums by collecting posts made by users who contributed to two or more topics. This cross-topic dataset consists of data from 293,415 users who contributed to two or more StackExchange topic forums. For our experiments, we use a subset of 1237 users with 150 samples per person. The chosen data subset contains 188077 samples across 1237 users. • UF Cross-genre dataset - StackExchange user profiles also include their website URLs which are mostly blogs. We used Blogspot and Wordpress APIs to extract data from posts of users who had provided their blog URLs. Hence, this cross-genre dataset consists of posts from both StackExchange forums and blogs of 19,559 users. We use this data to create two cross-genre datasets - Blogs-Forums and Forums-Blogs. Blogs-Forums dataset consists of blog posts of users in the training data and their corresponding StackExchange forum posts in the test data. Forums-Blogs dataset consists of StackExchange forum posts of users in the training data and their corresponding blog posts in the test data. The details of these datasets are reported in Table 4-1.

Table 4-1. Single-domain and cross-domain datasets
Dataset           #authors   #samples   char/sample   words/sample   sent/sample
IMDB1M            435        12617      649           136            7
UF Cross-topic    1237       188077     2331          482            20
UF Blogs-Forums   530        45285      2300          459            20
UF Forums-Blogs   479        40872      2244          461            20

4.2 Role of Structural representations

In this section, we study the role of sentence structure or syntax in stylometry. The method by which a person puts together words to form a sentence could be behavioral

and hence studying structural representations might reveal some insights. We study these structural representations using shallow and deep syntactic models. Shallow syntactic models use parts-of-speech (POS) information to represent syntax, while deep syntactic models use complete parse tree information to encode the syntax of a sentence.

4.2.1 Methodology

We represent sentence structure using two models: shallow and deep syntactic models. Shallow syntactic models consist of vector representations of the most common POS n-grams used with a discriminative classifier. On the other hand, deep syntactic models consist of a generative model which creates a probabilistic context-free grammar (PCFG) for each author using the constituency parse trees of sentences written by them. Hence, deep syntactic models capture syntactic information at a deeper level.

4.2.1.1 Shallow syntactic models

Shallow syntactic models use vector representations of the most common POS n-grams as features. We perform part-of-speech tagging on the training data of each author and obtain the set of POS n-grams most commonly used by that author. The union of the most common POS n-grams of all authors is used as the feature representation with a classifier. We use POS bigrams and POS trigrams, i.e., pairs or triples of POS tags, in our experiments. We use a one-vs-rest linear SVM for authorship attribution, since the learned weights might provide insights into which POS n-grams are salient for attributing each author. A minimal sketch of this pipeline follows.
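The sketch below assumes NLTK's default tagger (whose pretrained models must be downloaded) and scikit-learn, and uses a plain count vectorizer in place of the per-author most-common-n-gram selection described above.

```python
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

def to_pos_sequence(text):
    # Keep only the POS tag sequence; the words themselves are discarded.
    return " ".join(tag for _, tag in nltk.pos_tag(nltk.word_tokenize(text)))

def train_shallow_syntactic(train_texts, train_authors):
    pos_docs = [to_pos_sequence(t) for t in train_texts]
    # POS bigrams; note the default tokenizer drops punctuation-only tags.
    vectorizer = CountVectorizer(ngram_range=(2, 2), lowercase=False)
    X = vectorizer.fit_transform(pos_docs)
    clf = LinearSVC().fit(X, train_authors)  # one-vs-rest by default
    return vectorizer, clf
```

4.2.1.2 Deep Syntactic Models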

We use a purely syntactic language model to study the role of sentence structure or syntax in cross-domain stylometric attribution. A syntactic language model is obtained by constructing the probabilistic context-free grammar (PCFG) for each author using the constituency parse trees of sentences in their training samples. However, unlike the approach of [39], which uses the rewrite rules directly to construct PCFGs, we apply both vertical and horizontal Markovization [75] to the parse trees before constructing

PCFGs. This Markovization incorporates some context into the rewrite rules and improves parsing accuracy. To keep the approach purely syntactic, we remove the leaf nodes, which contain the sentence words [76, 77]. A test sample is attributed to the author whose PCFG yields the highest likelihood score. To account for unseen rules at test time, we incorporate smoothing that allows the model to back off to lower-order syntactic language models. The rewrite rules of a sample sentence under the different Markovization orders used in our model are shown in Figure 4-1, followed by a sketch of the grammar construction.

Figure 4-1. Parse tree of a sample sentence and rewrite rules under different orders of vertical (v) and horizontal (h) Markovization. The leaf nodes are excluded from the rewrite rules.
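A hedged sketch of the per-author PCFG construction using NLTK trees: simple parent annotation stands in for the full vertical/horizontal Markovization, and a fixed floor probability stands in for the back-off smoothing; the model described above is richer than this.

```python
import math
from collections import Counter
from nltk import Tree

def annotated_rules(tree, parent="TOP"):
    """Yield (LHS^parent, RHS) rewrite rules, ignoring lexical leaf nodes."""
    if isinstance(tree, str):                      # a leaf word: skip
        return
    rhs = tuple(child.label() for child in tree if isinstance(child, Tree))
    if rhs:                                        # preterminals yield no rule
        yield (f"{tree.label()}^{parent}", rhs)
    for child in tree:
        yield from annotated_rules(child, parent=tree.label())

def build_pcfg(trees):
    counts, lhs_totals = Counter(), Counter()
    for t in trees:
        for lhs, rhs in annotated_rules(t):
            counts[(lhs, rhs)] += 1
            lhs_totals[lhs] += 1
    return {rule: c / lhs_totals[rule[0]] for rule, c in counts.items()}

def log_likelihood(tree, pcfg, floor=1e-6):
    # Unseen rules fall back to a small floor probability (crude smoothing).
    return sum(math.log(pcfg.get(rule, floor)) for rule in annotated_rules(tree))
```

A test sample's trees are scored against each author's grammar, and the author with the highest total log-likelihood is selected.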

In order to compare the syntactic language model with a purely character language model, we use the approach of [52]. This approach uses a lossless text compression method called Prediction by Partial Matching (PPM) [78]. With PPM, individual characters are encoded using the context provided by the preceding characters thus representing each author A using a separate language model pA. To attribute an unknown sample u of

length L, we compute its cross-entropy with respect to an author model p_A as

H(p_A, u) = −(1/L) log₂ p_A(u) = −(1/L) ∑_{i=1}^{L} log₂ p_A(x_i | context_i),

where context_i of character x_i is x_1, x_2, ..., x_{i−1}. The unknown sample is attributed to the author whose model yields the least cross-entropy. For computational efficiency, the context is

typically truncated to the preceding n−1 characters, i.e., x_{i−n}, ..., x_{i−1}. A small character language model sketch in this spirit follows.
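The sketch below uses per-author character n-gram counts with naive interpolated back-off; it is a stand-in for PPM, not the implementation of [52].

```python
import math
from collections import Counter

class CharLM:
    """Character n-gram model with simple interpolation toward lower orders."""
    def __init__(self, text, n=5):
        self.n = n
        self.total = max(len(text), 1)
        self.grams = {k: Counter(text[i:i + k] for i in range(len(text) - k + 1))
                      for k in range(1, n + 1)}

    def prob(self, context, ch, lam=0.4):
        p = 1.0 / 256                              # base case: uniform over bytes
        for k in range(1, self.n + 1):             # blend in longer contexts
            ctx = context[-(k - 1):] if k > 1 else ""
            den = self.grams[k - 1][ctx] if k > 1 else self.total
            if den:
                p = (1 - lam) * p + lam * (self.grams[k][ctx + ch] / den)
        return p

    def cross_entropy(self, text):
        # -(1/L) * sum of log2 P(x_i | context_i), as in the equation above.
        return -sum(math.log2(self.prob(text[:i], text[i]))
                    for i in range(len(text))) / len(text)

# Attribution: build one CharLM per author; the unknown sample goes to the
# author whose model yields the lowest cross_entropy.
```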

4.2.2 Results

We perform 4-fold cross-validation on all datasets and report the average performance metrics here. Performance is evaluated using F-score and Accuracy.

4.2.2.1 Shallow syntactic models

For each fold of the dataset, we compute the POS n-grams from the training data of all authors as described in Section 4.2.1.1. The number of features is data dependent and ranges from 1200 to 1300 for POS bigrams and 5700 to 6610 for POS trigrams. As baselines, we compare against the following feature representations: 917-dim stop words (function words and punctuation), 673-dim summary features (the subset of features described in Tables 3-2, 3-3 and 3-4, excluding character n-grams, word n-grams and POS n-grams) and variable-dimensional character n-grams. Results of the shallow syntactic approaches are reported in Table 4-2. For the single-domain IMDB1M dataset, POS bigrams perform comparably to stop words, and POS trigrams perform comparably to summary features. However, both POS n-gram representations perform drastically worse than character n-gram features. For the cross-domain datasets, POS n-grams perform poorly compared to all other feature representations in both cross-topic and cross-genre settings, with character n-gram features performing best in both. These results suggest that a purely syntactic representation using shallow syntactic models is not discriminative enough to identify users uniquely.

Table 4-2. Shallow syntactic model performance on single-domain and cross-domain datasets
Dataset           Scenarios      Prec(%)   Rec(%)   F-score(%)   Acc(%)   Top10 Acc(%)
IMDB1M            Stop words     11.44     16.95    12.15        16.95    36.04
                  Summary feat   22.46     23.71    21.87        23.71    48.56
                  Char n-grams   47.64     45.32    43.87        45.32    64.46
                  POS bigrams    17.33     22.46    17.62        22.46    40.65
                  POS trigrams   23.04     25.85    22.36        25.85    39.57
UF Cross-topic    Stop words     14.86     18.59    15.13        18.59    40.27
                  Summary feat   32.11     28.97    28.71        28.97    53.98
                  Char n-grams   38.81     32.76    32.57        32.76    62.19
                  POS bigrams    10.19     10.75     8.94        10.75    31.16
                  POS trigrams   10.66     10.71     9.11        10.71    31.44
UF Blogs-Forums   Stop words     15.85     16.09    13.94        16.09    40.01
                  Summary feat   23.01     17.60    16.84        17.60    41.34
                  Char n-grams   35.75     27.75    27.44        27.75    56.21
                  POS bigrams    10.19     10.75     8.94        10.75    31.16
                  POS trigrams   10.66     10.71     9.11        10.71    31.44
UF Forums-Blogs   Stop words     16.81     16.26    14.75        16.26    40.29
                  Summary feat   23.94     18.89    18.63        18.89    43.90
                  Char n-grams   40.83     31.83    32.20        31.83    61.73
                  POS bigrams    14.49     14.42    12.44        14.42    37.97
                  POS trigrams   12.84     12.83    10.94        12.83    35.90

4.2.2.2 Deep syntactic models

We use the BLLIP parser [79], trained on the Wall Street Journal (WSJ) corpus, to treebank each author's training data. We use vertical and horizontal Markovization of order two (h=2, v=2), since higher orders did not seem to improve performance. The leaf nodes are removed from the parse trees to keep the analysis purely syntactic. To test stylometric attribution, we evaluate the syntactic language model (Syn LM) on the single-domain IMDB1M dataset and the UF cross-domain (cross-topic and cross-genre) datasets. We repeat these experiments using the syntactic language model with lexical information in its leaf nodes (Syn LM+Lex) and the character language model (Char LM) [52]. The single-domain and cross-domain attribution performances are reported in Table 4-3. For all datasets, the purely syntactic language model (Syn LM) performs poorly compared to the model with

lexical information (Syn LM+Lex) and the character language model (Char LM); Syn LM also performs worse than the shallow syntactic models (POS n-grams). The addition of lexical information from the leaf nodes of the parse trees (Syn LM+Lex) boosts performance severalfold compared to Syn LM and POS n-grams. However, character language models (Char LM) still perform better than the Syn LM and Syn LM+Lex representations.

Table 4-3. Deep syntactic model performance on single-domain and cross-domain datasets
Domain         Scenarios    Prec(%)   Rec(%)   F-score(%)   Acc(%)   Top10 Acc(%)
IMDB1M         Syn LM       14.70     10.14    10.37        10.14    24.04
               Syn LM+Lex   45.27     32.61    33.12        32.61    53.40
               Char LM      53.61     40.06    41.84        40.06    62.21
Cross-topic    Syn LM        2.63      2.09     1.91         2.09     8.57
               Syn LM+Lex   24.03     19.41    18.65        19.41    40.46
               Char LM      23.95     20.67    19.67        20.67    41.45
Blogs-Forums   Syn LM        4.24      3.54     2.81         3.54    13.80
               Syn LM+Lex   33.19     25.56    24.02        25.56    47.58
               Char LM      37.65     32.33    30.52        32.33    56.02
Forums-Blogs   Syn LM        4.94      4.78     3.80         4.78    17.39
               Syn LM+Lex   36.38     27.28    26.37        27.28    51.90
               Char LM      40.99     33.17    32.87        33.17    58.40

4.2.3 Discussion

We studied the role of sentence structure, or syntax, in cross-domain authorship attribution using both shallow and deep syntactic models. A purely syntactic representation (POS bigrams, POS trigrams, and Syn LM) is not discriminative enough to identify a person uniquely; even baseline representations like stop words and summary features yield much better performance. However, the addition of lexical information to a purely syntactic model (Syn LM+Lex) boosts performance drastically, showing that lexical representations contain more discriminative information than purely syntactic representations, especially for large-scale authorship attribution. Further, purely syntactic approaches perform poorly compared to character n-gram based vector and language model representations, as is evident from Tables 4-2

and 4-3. Character n-gram based approaches have been highly successful in authorship attribution because they represent character, lexical and syntactic information to varying extents [2, 29] and are robust to morphological variations in language use [80]. From a purely quantitative perspective, their success can also be attributed to the fact that character n-grams provide many more data points than function words or syntactic rules, yielding superior performance [29]. Character n-grams span the same word or word combination multiple times, depending on the n-gram order, and hence amplify the presence of possibly unique word/phrase choices by an author. Shallow and deep syntactic language models do not have this advantage, as the set of all possible POS n-grams or syntactic rewrite rules is much smaller than the set of all possible character n-grams. However, the low performance of purely syntactic models does not necessarily imply that they are not useful for representing style; one could use them in combination with lexical information to improve cross-domain performance.

4.3 Role of Lexical representations

Besides syntax, the choice of words used by a person also plays an essential role in stylometric attribution. A word's primary purpose may be lexical or grammatical. Lexical words carry meaning by themselves and are mostly content-specific; they typically include nouns, verbs, adjectives, and adverbs. Grammatical words, on the other hand, carry no meaning by themselves and specify the relationships between lexical words; they are independent of content and typically include determiners, prepositions, and pronouns. Conventionally, grammatical words, especially function words, have been proposed for stylometric authorship attribution since they are independent of content. However, character n-gram based approaches have largely outperformed function word based approaches [29], indicating that some lexical words may also help with authorship attribution. In order to understand which lexical words may help with stylometric attribution, we study the effects of masking

all lexical words and certain topic words corresponding to different POS on authorship attribution.

4.3.1 Methodology

In the following sections, we describe our approaches to study the effects of masking all words, or only topic words, corresponding to lexical POS, in order to mitigate content information in our feature representations.

4.3.1.1 Role of Lexical POS

To analyze the effects of different lexical POS on stylometric attribution, we preprocess the samples to replace words corresponding to different POS with a predefined placeholder string. We use the Penn Treebank POS tag set in our experiments. The following settings are analyzed using a character language model [52] for both single-domain and cross-domain datasets (a sketch of the masking preprocessor follows the list):

• orig: This utilizes the original samples and is used for benchmarking.

• no_NNP: This utilizes samples in which all proper nouns (NNP, NNPS) are replaced by the placeholder.

• no_NN: This utilizes samples in which all common nouns (NN, NNS) are replaced by the placeholder.

• no_VB: This utilizes samples in which all verbs (VB, VBD, VBG, VBN, VBP, VBZ) except auxiliary verbs are replaced by the placeholder.

• no_ADJ: This utilizes samples in which all adjectives (JJ, JJR, JJS) are replaced by the placeholder.

• no_ADV: This utilizes samples in which all adverbs (RB, RBR, RBS) are replaced by the placeholder.
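A brief sketch of the masking preprocessor, assuming NLTK's Penn Treebank tagger; "MASK" is a hypothetical stand-in for the predefined placeholder string, and the whitelisting of auxiliary verbs for no_VB is omitted for brevity.

```python
import nltk

POS_GROUPS = {
    "NNP": {"NNP", "NNPS"},
    "NN":  {"NN", "NNS"},
    "VB":  {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"},  # auxiliaries not excluded here
    "ADJ": {"JJ", "JJR", "JJS"},
    "ADV": {"RB", "RBR", "RBS"},
}

def mask_pos(text, group, placeholder="MASK"):
    """Replace every word whose tag falls in the chosen group (e.g. 'NN')."""
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    return " ".join(placeholder if tag in POS_GROUPS[group] else word
                    for word, tag in tagged)
```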

4.3.1.2 Role of Topic Words

The approach in Section 4.3.1.1 assumes that all words corresponding to a lexical POS may be content-specific. This may not be the case, as some of these words may help with stylometric attribution. Hence, in this section, we analyze the effects on attribution of masking only the topic words corresponding to different lexical POS. We use Latent

Dirichlet Allocation (LDA) [81] to obtain the topic words for each lexical POS. Since each author may focus on different topics, we perform LDA for each author using their training samples. The topic words of all authors are then collated and used for masking. We analyze the same settings as in Section 4.3.1.1: orig, no_topic_NNP, no_topic_NN, no_topic_VB, no_topic_ADJ and no_topic_ADV. A sketch of the per-author topic-word extraction follows.
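A hedged sketch of per-author topic-word extraction with gensim's LDA (t topics, w words per topic); input documents are assumed to be token lists already filtered to one lexical POS.

```python
from gensim import corpora, models

def author_topic_words(samples, t=10, w=10):
    """samples: one author's training documents as lists of tokens
    containing only words of the lexical POS under study."""
    dictionary = corpora.Dictionary(samples)
    corpus = [dictionary.doc2bow(doc) for doc in samples]
    lda = models.LdaModel(corpus, num_topics=t, id2word=dictionary)
    topic_words = set()
    for topic_id in range(t):
        # Collect the top-w words of each topic for this author.
        topic_words.update(word for word, _ in lda.show_topic(topic_id, topn=w))
    return topic_words
```

The collated topic words across all authors are then masked before training and testing, as in Section 4.3.1.1.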

4.3.2 Results

In this section, we report the results that reflect the role of lexical POS and topic words in cross-domain authorship attribution.

4.3.2.1 Role of Lexical POS

To study the effects of lexical words on authorship attribution, we mask all words corresponding to specific lexical POS as described in Section 4.3.1.1, using the NLTK toolkit [82] for POS tagging. We perform author identification on both single-domain and cross-domain datasets using the character n-gram representation and the character language model [52]. We perform 4-fold cross-validation and report the average precision, recall, F-score and accuracy across all folds. Performance on the original data (orig), without masking any words, is considered the baseline. For the single-domain IMDB1M dataset, the performance measures reflecting the effects of masking different lexical POS are reported in Table 4-4. Similarly, the performance measures for cross-topic and cross-genre experiments on the UF cross-domain dataset are reported in Tables 4-5 and 4-6.

Table 4-4. Effect of lexical POS on single-domain authorship attribution using IMDB1M
Scenarios   Prec(%)   Rec(%)   F-score(%)   Acc(%)   Top10 Acc(%)
orig        47.64     45.32    43.87        45.32    64.46
no_NN       27.33     30.00    26.49        30.00    45.56
no_NNP      30.18     31.33    28.49        31.33    47.96
no_VB       35.40     36.37    33.67        36.37    53.66
no_ADJ      36.55     37.14    34.54        37.14    53.56
no_ADV      38.63     38.24    35.87        38.24    55.08

Table 4-5. Effect of lexical POS on UF cross-topic authorship attribution
Scenarios   Prec(%)   Rec(%)   F-score(%)   Acc(%)   Top10 Acc(%)
orig        38.81     32.76    32.57        32.76    62.19
no_NN       14.57     11.32    11.44        11.32    28.15
no_NNP      16.26     12.64    12.80        12.64    30.99
no_VB       15.43     13.02    12.76        13.02    30.30
no_ADJ      16.10     13.28    13.17        13.28    31.58
no_ADV      16.20     13.37    13.22        13.37    32.00

Table 4-6. Effect of lexical POS on UF cross-genre authorship attribution
Domain         Scenarios   Prec(%)   Rec(%)   F-score(%)   Acc(%)   Top10 Acc(%)
Blogs-Forums   orig        35.75     27.75    27.44        27.75    56.21
               no_NN       23.21     15.39    16.52        15.39    37.19
               no_NNP      22.89     18.13    17.98        18.13    41.83
               no_VB       26.60     21.56    21.22        21.56    43.75
               no_ADJ      26.78     20.23    20.54        20.23    44.97
               no_ADV      27.31     21.87    21.55        21.87    45.16
Forums-Blogs   orig        40.83     31.83    32.20        31.83    61.73
               no_NN       24.39     18.24    18.97        18.24    39.32
               no_NNP      25.04     19.22    19.67        19.22    41.94
               no_VB       28.68     23.85    23.61        23.85    45.80
               no_ADJ      28.10     22.63    22.74        22.63    46.02
               no_ADV      29.53     24.03    24.06        24.03    46.72

4.3.2.2 Role of Topic Words

Topic words are chosen using the LDA implementation in the gensim Python toolkit. Only words corresponding to a specific lexical POS are input to LDA. We experiment by varying the number of topics (t = 2, 10, 50) and the number of words per topic (w = 10, 100); increasing either beyond these values does not seem to provide any performance improvement. The topic words obtained from each author's training data are collated and used for masking. We perform 4-fold cross-validation experiments on both single-domain and cross-domain datasets. The best performance was obtained with t = 10, w = 10; the corresponding results are reported in Tables 4-7, 4-8 and 4-9.

4.3.3 Discussion

4.3.3 Discussion

In this section, we discuss the inferences made in Section 4.3.2.

Table 4-7. Effect of topic words in lexical POS on IMDB1M single-domain authorship attribution
Scenarios    Prec(%)  Rec(%)  F-score(%)  Acc(%)  Top10 Acc(%)
orig         47.64    45.32   43.87       45.32   64.46
notopic_ALL  50.32    47.33   46.10       47.33   67.12
notopic_NN   50.51    47.59   46.37       47.59   67.23
notopic_NNP  49.31    46.80   45.38       46.80   66.94
notopic_VB   51.22    47.92   46.77       47.92   67.66
notopic_ADJ  51.77    48.39   47.34       48.39   67.96
notopic_ADV  52.28    48.74   47.71       48.74   68.19

Table 4-8. Effect of topic words in lexical POS on UF cross-topic authorship attribution
Scenarios    Prec(%)  Rec(%)  F-score(%)  Acc(%)  Top10 Acc(%)
orig         38.81    32.76   32.57       32.76   62.19
notopic_ALL  47.00    40.74   40.64       40.74   68.48
notopic_NN   47.32    40.37   40.51       40.37   68.69
notopic_NNP  46.75    39.11   39.62       39.11   68.14
notopic_VB   45.53    38.66   38.70       38.66   67.51
notopic_ADJ  46.54    39.22   39.44       39.22   68.13
notopic_ADV  44.90    37.17   37.50       37.17   67.48

Table 4-9. Effect of topic words in lexical POS on UF cross-genre authorship attribution
Domain        Scenarios    Prec(%)  Rec(%)  F-score(%)  Acc(%)  Top10 Acc(%)
Blogs-Forums  orig         35.75    27.75   27.44       27.75   56.21
              notopic_ALL  39.45    31.80   31.48       31.80   60.47
              notopic_NN   41.02    32.28   32.18       32.28   61.25
              notopic_NNP  39.73    31.08   31.22       31.08   60.45
              notopic_VB   40.38    32.49   31.79       32.49   60.31
              notopic_ADJ  41.26    32.20   32.08       32.20   61.26
              notopic_ADV  40.13    32.10   31.54       32.10   59.96
Forums-Blogs  orig         40.83    31.83   32.2        31.83   61.73
              notopic_ALL  38.74    33.70   32.45       33.70   61.43
              notopic_NN   41.14    33.96   33.75       33.96   62.30
              notopic_NNP  40.19    32.62   32.48       32.62   61.35
              notopic_VB   40.96    33.22   33.07       33.22   62.40
              notopic_ADJ  41.72    33.12   33.41       33.12   62.55
              notopic_ADV  41.41    31.97   32.64       31.97   62.09

4.3.3.1 Role of lexical POS

It is known that function words are independent of content and are useful for representing style. However, the success of character n-gram approaches in authorship attribution indicates that some lexical words may also be useful for authorship attribution. Moreover, character n-gram approaches do not necessarily decouple style and content, and using them as-is may degrade attribution in cross-domain settings. Hence, we studied the effect of masking all words corresponding to different lexical POS on both single-domain and cross-domain attribution, as explained in Section 4.3.1.1. For the single-domain IMDB1M dataset, it can be observed from Table 4-4 that excluding common nouns (no_NN) and proper nouns (no_NNP) degrades attribution performance drastically. Masking all words corresponding to other lexical POS like verbs, adjectives or adverbs impacts the performance to a lesser extent. This suggests the heavy influence of the topic, conveyed through common nouns and proper nouns, on single-domain attribution. In contrast, for the cross-topic experiments on the UF cross-domain dataset, it can be observed from Table 4-5 that excluding any lexical POS has a significant impact, suggesting that at least some words corresponding to each lexical POS category might be repeatedly used by a person and thus be useful for stylometric attribution. Results from the cross-genre experiments, as reported in Table 4-6, suggest a higher dominance of common nouns and proper nouns in cross-genre attribution compared to verbs, adjectives, and adverbs.

4.3.3.2 Role of topic words

It is possible that not all words corresponding to a specific lexical POS are content-dependent. Some of these may help with representing the style. Hence, we performed attribution experiments by masking out certain topic words corresponding to different lexical POS, as detailed in Section 4.3.1.2. For the IMDB1M dataset, it was observed that removing topic words corresponding to all lexical POS improves attribution performance as compared to completely masking them. For the UF cross-topic dataset, it was observed that masking only the topic words corresponding to common nouns (notopic_NN) and to all lexical POS (notopic_ALL) improves performance the most as compared to completely masking them. Masking topic words corresponding to other lexical POS also improves attribution performance, but to a slightly lesser extent. Overall, masking topic information improves attribution performance by 5-8% for the cross-topic dataset, as against 2-4% for the single-domain IMDB1M dataset. Similarly, for the cross-genre experiments, it can be seen that masking topic words corresponding to lexical POS improves attribution performance by 4-5% for UF Blogs-Forums and by 0-2% for UF Forums-Blogs. This observation validates our hypothesis that some words of each lexical POS may be content-dependent while the rest may be used repetitively and hence can be useful for representing style.

One of the most pressing problems in this line of research is the dearth of publicly available large-scale cross-domain datasets. Authorship attribution approaches in the literature mostly demonstrate their results using either small datasets or large-scale single-domain datasets. This does not provide a clear picture of the factors that contribute to style and are robust in cross-domain settings. The role of structural or lexical representations or other aspects in representing style can be better ascertained with experiments on large-scale cross-domain datasets. Hence, we have collected a large cross-domain dataset and studied the efficacy of different feature representations for encoding stylistic cues.

4.4 Learned representations

The representations evaluated so far were engineered to leverage different linguistic aspects like character-level or syntactic features. However, many of these representations are low-level bag-of-words representations that do not account for the sequential or structural nature of the text. Hence, we experimented with deep learning architectures which capture such high-level sequential or syntactic aspects.

4.4.1 Methodology

We evaluated three different architectures which can capture high-level linguistic aspects of text: Recurrent Neural Networks (RNN), which account for the sequential nature of text and can capture long-range dependencies; Recursive Neural Networks (RecNN), which encode the structural or syntactic information of text; and Hierarchical Attention Networks (HAN), which capture the hierarchical nature of text. These architectures are described in detail in the following sections.

4.4.1.1 Recurrent Neural Networks

This architecture uses a single-layer bidirectional Long Short-Term Memory (LSTM) network with a sequence of words as input, one word per timestep. This allows the model to capture contextual information and long-range dependencies in the text. The hidden state from the final time-step of the model was used with a softmax layer to predict the author, as shown in Figure 4-2A. The input embedding layer was initialized with 100-dimensional word2vec [83] representations of words. The model was evaluated using the CASIS dataset described in Section 3.1, which consists of blog posts written by 1000 non-native English speakers with four posts per person. Since we have only 4000 samples across all authors, the RNN was trained on individual sentences from these posts, and each post was attributed to the author with the maximum number of correctly attributed sentences. The number of time-steps was limited to the mean sentence length of 30 words to enable processing a batch of sentences. Shorter sentences were padded, and longer sentences were truncated.
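For concreteness, a minimal PyTorch sketch of such a model is shown below. The dissertation does not prescribe a framework, so the class name, hidden size handling and training loop omitted here are assumptions:

```python
# Sketch of the sentence-level attribution model: one bidirectional LSTM
# over word embeddings, with a linear layer feeding a softmax over authors.
import torch
import torch.nn as nn

class BiLSTMAttributor(nn.Module):
    def __init__(self, vocab_size, num_authors, embed_dim=100, hidden=300):
        super().__init__()
        # In practice the embedding weights are initialized from word2vec.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.out = nn.Linear(2 * hidden, num_authors)  # softmax via the loss

    def forward(self, token_ids):              # (batch, 30), padded/truncated
        h, _ = self.lstm(self.embed(token_ids))
        return self.out(h[:, -1, :])            # final time-step hidden state

model = BiLSTMAttributor(vocab_size=20000, num_authors=1000)
logits = model(torch.randint(0, 20000, (8, 30)))  # 8 sentences of 30 tokens
print(logits.shape)                                # torch.Size([8, 1000])
```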

4.4.1.2 Recursive Neural Networks

We used a Recursive Neural Network [84] represented using a Tree-LSTM, as shown in Figure 4-2B, to encode structural or syntactic information. Tree-structured LSTMs are generalizations of LSTMs to tree-structured topologies. One can incorporate sentence syntax into the learned feature representations by using parse-tree information in Tree-LSTMs. Feature representations of words are obtained by hierarchically fusing child nodes based on parse-tree information. We used 100-dim GloVe [85] word vectors as input to the Recursive Neural Network. We used this architecture to evaluate the significance of syntax on stylometry using the CASIS dataset. However, since parse-tree information is obtained per sentence, we performed sentence-level prediction for these posts and attributed each post to the author with the maximum number of correctly attributed sentences.


Figure 4-2. Recurrent and recursive neural networks. A) Recurrent language model. B) Recursive Neural Networks [84].

4.4.1.3 Hierarchical Attention Networks

Using this representation, we attempt to leverage the hierarchical nature of documents/text samples using a hierarchical attention model (HAN) [86], as shown in Figure 4-3. A document consists of sentences, and a sentence consists of words. Hence, this model consists of a lower-level RNN to learn sentence representations from the words in a sentence and a higher-level RNN to learn document representations from sentence representations. The model also has an attention layer at both the sentence and document level, which is useful for obtaining insights into salient words/sentences for stylometry. We used 100-dim word2vec representations as input to the network. Since this model requires documents as input, we used a larger subset of the IMDB1M dataset [73], which consists of movie reviews from 3597 individuals and 201,685 reviews across all authors.

Figure 4-3. Hierarchical attention model for text classification [86]

4.4.2 Results

We evaluated these architectures only on single-domain datasets to test our hypothesis. The performance of the different architectures on their respective datasets is reported in Table 4-10. The bag-of-words (BoW) vector representation described in Section 3.3.1 is used as a baseline. It can be observed from the reported results that even though learned representations perform better than BoW representations, they still suffer from the inherent issues that plague stylometry, i.e. poor scalability to a large number of authors or short text samples.

Table 4-10. Performance of learned representations on single-domain datasets
Model      Dataset  # RNN layers  # Hidden units  Accuracy (%)
BoW        CASIS    -             -               25.96
LSTM       CASIS    1             300             32.68
Tree-LSTM  CASIS    -             300             30.23
HAN        IMDB1M   2             300             19.8

4.4.3 Discussion

The CASIS dataset is a classic challenging scenario in stylometry where the number of classes is large, and the number of samples per class is few. On the surface, deep learning approaches (LSTM and Tree-LSTM) seem to perform better than hand-engineered features. However, in order to combat overfitting, we treated sentences as samples for the deep learning approaches. These sentence-level predictions may be too fine-grained since not all sentences may be representative of author style. Further, the identification accuracy using both learned representations (LSTM and Tree-LSTM) is still too low to use linguistic style as a biometric trait. This highlights the challenging nature of stylometry as an identity trait in itself, as well as the lack of data needed for deep learning approaches to scale well. Generative models for text data have also not panned out as well as their image counterparts, making synthetic text generation for data augmentation challenging as well.

Stylometry using the IMDB1M dataset is a typical scenario which highlights scalability challenges when identifying a large number of people. From our experiments, we observed that identification performance was poor irrespective of the hyperparameters. Here, the challenges could be due either to using stylometry as an identity trait, which may not be unique, or to our choice of text representation. For all three of our learned representations, we used word vector representations of text as input. However, even the best-performing approaches in the literature [87] are character-based. Shrestha et al. used character-based Convolutional Neural Networks (CNN) to achieve 36.5% accuracy for 1000 authors using single-domain data, compared to 12.7% accuracy for a word-based CNN. This could be due to sparsity when using word or syntax level encoding, since our text samples are short. However, character-based representations have been quite useful for short text samples, as seen by the success of character n-gram approaches. Further, word or syntax level information does not seem to capture morphological variations, which have been shown to be useful for stylometric attribution besides function words and punctuations.

4.5 Summary

Authorship attribution approaches in the literature focus mostly on single-domain attribution where content and style are highly entangled. This does not provide a clear picture of the linguistic aspects that are representative of style and robust in cross-domain settings. In this chapter, we studied the role of structural and lexical information in representing style. We evaluated the role of syntax using shallow and deep syntactic models and showed that purely syntactic representations (POS n-grams and Syn LM) are not discriminative by themselves and need to be used in conjunction with lexical information. We evaluated the role of lexical information by masking off all words or certain topic words corresponding to different lexical POS. Common nouns and proper nouns are heavily influenced by topic, and cross-topic attribution may benefit from completely masking topic words corresponding to these POS categories. Further, for all datasets, masking off certain topic words corresponding to all lexical POS yields better performance in both cross-domain and single-domain scenarios, suggesting that the remaining words may be behavioral and hence might help represent style. We also experimented with learned representations using deep learning architectures to capture higher-level sequential and syntactic information. We used three deep learning architectures with word embeddings as input. However, we observed that these deep learning approaches still suffer from the scalability challenges faced by stylometry, which could be attributed to sparsity when using higher-level representations for short texts.

Using the best-performing style representations and insights from this chapter and the previous chapter, we investigate the biometric capabilities of stylometry in the following chapter. This will allow us to understand the merits and demerits of deploying stylometry as a standalone biometric trait for authentication or identification.

CHAPTER 5
STYLOMETRY AS A BIOMETRIC TRAIT

Having investigated the contribution of various linguistic aspects that are representative of style in the previous chapter, we delve deeper into the investigation of stylometry as a biometric trait in this chapter. Specifically, we investigate stylometry for salient biometric characteristics such as uniqueness and permanence. Further, as stylometry is a behavioral trait, we expect its performance to lag behind that of traditional physiological biometrics. Hence, we investigate multi-modal authentication using stylometry along with another co-occurring trait, i.e. keystroke dynamics. Overall, we attempt to answer the following research questions in this chapter:

• Is linguistic style sufficiently unique to be considered as a biometric trait?

• Does linguistic style remain permanent over long periods of time?

• Can stylometry be used with other biometric traits for authentication?

5.1 Biometric Characteristics

A biometric trait is expected to possess specific characteristics [3] to be used in biometric applications, viz. universality, uniqueness, permanence, measurability, performance, acceptability, and circumvention. In this section, we investigate stylometry for two biometric characteristics - uniqueness and permanence.

5.1.1 Analysis of Uniqueness Using Biometric Menagerie

In this section, we investigate the challenges of using a person's writing style or linguistic usage as a cognitive biometric modality by determining the uniqueness of linguistic style using the various feature representations discussed in Chapter 4. We apply Doddington's idea of the biometric menagerie [4] and Yager and Dunstone's biometric menagerie [88] to writing style. The biometric menagerie highlights the inherent differences between individuals that make biometric recognition challenging. Biometric menagerie approaches categorize individuals into different animals based on how well they match against themselves and other individuals. A good biometric system should consist mostly of individuals who match very well against themselves and very poorly against others. In this section, we categorize individuals using two different biometric menagerie frameworks to understand the uniqueness of different stylometric representations.

5.1.1.1 Methodology

In this section, we describe the feature representations and Biometric menagerie approaches to evaluate the uniqueness of stylometry as a biometric trait.

Features. We use eight linguistic feature representations for our analysis including the ones described in Chapter 4. They are as follows:

• Stop words - We use 917 stop words (auxiliary verbs, prepositions, pronouns, articles, determiners) and punctuations for representing the style. This representation consists of the normalized frequency of these 917 features.

• Summary features - We use 767-dim Bag-of-Word (BoW) features that span character, lexical and syntactic features as shown in Table 5-1. Character features are not language-specific and require minimal processing. Lexical features capture linguistic style at word-level and include vocabulary richness and word n-grams. Syntactic features capture a much higher level of linguistic usage like grammatical syntax using function words, parts of speech (POS) and parsing information. The 767-dim BoW feature representation consists of 63-dim character features, 45-dim lexical features, and 659-dim syntactic features.

• Character n-gram features - Character n-grams have been shown to be effective for authorship attribution [25–27]. These capture linguistic usage at character level but retain a certain amount of context, e.g. character n-grams capture combinations of punctuations and characters and combinations of words.

• Character language model - This consists of character language model using the approach of Prediction by Partial Matching (PPM) [52].

• POS bigrams - This consists of the most common POS bigrams representing shallow syntactic information as explained in Section 4.2.1.1.

• POS trigrams - This consists of the most common POS trigrams representing shallow syntactic information as explained in Section 4.2.1.1.

• Syntactic language model - This consists of purely syntactic language model as explained in Section 4.2.1.2.

• Syntactic + Lexical language model - This consists of the syntactic language model with lexical information as explained in Section 4.2.1.2.

Table 5-1. BoW features - 767 dimensions (63 character features, 45 lexical features and 659 syntactic features)

Character features                            #dimensions
No. of characters                             1
No. of alphabets                              1
No. of uppercase alphabets                    1
No. of digits                                 1
No. of white spaces                           1
No. of tab spaces                             1
Frequency of alphabets                        26
Frequency of special characters               21
Frequency of digits                           10

Lexical features                              #dimensions
No. of words                                  1
Fraction of short words                       1
Average word length                           1
Average sentence length in characters         1
Average sentence length in words              1
No. of unique words                           1
Hapax legomenon                               10
Frequency of words of different word lengths  20
Fraction of capitalized words                 1
Fraction of all uppercase words               1
Fraction of all lowercase words               1
Fraction of all camelcase words               1
Fraction of all othercase words               1
Yule's I-measure [89]                         1
Sichel's S-measure [90]                       1
Brunet's W-measure [91]                       1
Honore's R-measure [31]                       1

Syntactic features                            #dimensions
Frequency of punctuations (, . ? ! : ; ' " - ( ))  11
Frequency of function words                   512

Biometric menagerie. We use two different Biometric menagerie paradigms as shown in Figure 5-1. They are the approaches proposed by Doddington [4] and Yager and Dunstone [88]. Doddington’s Biometric menagerie [4] categorizes individuals based on their genuine or impostor match scores as follows:


Figure 5-1. Biometric menagerie approaches. A) Doddington. B) Yager & Dunstone.

• Sheep: These are individuals who can be easily identified and have high genuine match scores.

• Goats: These individuals are the most difficult to identify and match poorly even against themselves with very low genuine match scores. Hence, these individuals contribute significantly to the false rejection rate (FRR) of a biometric system.

• Wolves: These individuals can easily imitate others with very high impostor scores. Hence, these individuals contribute significantly to the false acceptance rates (FAR).

• Lambs: These individuals can be imitated easily, with very high impostor scores. Hence, these individuals also contribute significantly to the false acceptance rate (FAR).

Doddington's approach classifies individuals using only genuine match scores or only impostor match scores. Yager and Dunstone [88] instead categorize individuals using the relationship between both genuine and impostor match scores. The genuine and impostor match scores are partitioned at the bottom 25th and top 75th percentiles. The individuals are categorized as follows:

• Doves: This category constitutes individuals in the top 75th percentile of genuine match scores and bottom 25th percentile of impostor match scores. Hence, these individuals contribute to low FAR and FRR.

• Chameleons: This category constitutes individuals in the top 75th percentile of genuine match scores and top 75th percentile of impostor match scores. Hence, these individuals contribute to high FAR and low FRR.

• Worms: This category constitutes individuals in the bottom 25th percentile of genuine match scores and top 75th percentile of impostor match scores. Hence, these individuals contribute to high FAR and high FRR.

• Phantoms: This category constitutes individuals in the bottom 25th percentile of genuine match scores and bottom 25th percentile of impostor match scores. Hence, these individuals contribute to high FRR and low FAR.

In order to study the challenges of a biometric system based on the writing style of individuals, we investigate the system for the different biometric menagerie animals. For every person, we compute their corresponding menagerie category using eight different feature representations. Sheep and goats are categorized using only a person's genuine match score, i.e. the top 75th and bottom 25th percentile of genuine match scores. Wolves are individuals whose probe samples constitute the top 75th percentile of impostor match scores. Lambs are individuals whose gallery samples constitute the top 75th percentile of impostor match scores. Yager and Dunstone categorization is performed using a combination of the bottom 25th and top 75th percentile of the genuine and impostor match scores, as sketched below.
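The quartile-based categorization reduces to a few comparisons; a compact numpy sketch (the threshold handling and naming are assumptions) is:

```python
# Yager & Dunstone menagerie from per-user mean genuine/impostor scores.
import numpy as np

def menagerie(genuine, impostor):
    """genuine, impostor: parallel 1-D arrays of per-user mean scores."""
    g_lo, g_hi = np.percentile(genuine, [25, 75])
    i_lo, i_hi = np.percentile(impostor, [25, 75])
    labels = []
    for g, i in zip(genuine, impostor):
        if g >= g_hi and i <= i_lo:
            labels.append("dove")         # low FRR, low FAR
        elif g >= g_hi and i >= i_hi:
            labels.append("chameleon")    # low FRR, high FAR
        elif g <= g_lo and i >= i_hi:
            labels.append("worm")         # high FRR, high FAR
        elif g <= g_lo and i <= i_lo:
            labels.append("phantom")      # high FRR, low FAR
        else:
            labels.append("uncategorized")  # middle of both distributions
    return labels
```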

5.1.1.2 Datasets

We use both the single-domain and cross-domain datasets explained in Section 4.1. The single-domain IMDB1M dataset consists of a subset of 435 users with 30 posts per person. The UF cross-domain dataset consists of both cross-topic and cross-genre datasets.

We follow the approach of [4, 88] to perform the biometric menagerie analysis for eight feature representations. We obtain the genuine and impostor match scores using the trained machine learning models or language models of each person as the gallery and the test samples as probes. The percentage of individuals categorized as different biometric animals for the single-domain IMDB1M dataset is shown in Figure 5-2. Similarly, the percentages of individuals categorized as different biometric animals for the UF cross-topic and cross-genre datasets are shown in Figures 5-3, 5-4 and 5-5.


Figure 5-2. Biometric menagerie plots for IMDB1M dataset. A) Doddington’s menagerie. B) Yager & Dunstone’s menagerie.

5.1.1.4 Discussion

For both single-domain and cross-domain datasets, we observe that only a small fraction of individuals are categorized as sheep while most individuals are categorized as goats, lambs or wolves. Character representations (character n-grams and character language models) and summary features are more unique, with a higher number of individuals categorized as sheep. As expected, the deep syntactic language model representation seems to be the least unique, with fewer individuals categorized as sheep. For the Yager & Dunstone approach, not all individuals are categorized into one of the four categories. However, amongst those who are categorized, we find that 12-15% of individuals in all datasets are categorized as doves, followed by worms, while fewer people are categorized as phantoms or chameleons. Similarly, character representations yield a higher number of doves while purely syntactic language models yield the least number of doves. These observations corroborate the effectiveness of character-based representations as seen in Chapter 4. Even though we use various feature representations to model people's linguistic style, we find that character n-gram representations are more discriminative and unique compared to other representations.

Figure 5-3. Biometric menagerie plots for UF cross-topic dataset. A) Doddington's menagerie. B) Yager & Dunstone's menagerie.

Figure 5-4. Biometric menagerie plots for UF Blogs-Forums dataset. A) Doddington's menagerie. B) Yager & Dunstone's menagerie.

Figure 5-5. Biometric menagerie plots for UF Forums-Blogs dataset. A) Doddington's menagerie. B) Yager & Dunstone's menagerie.

5.1.2 Analysis of Permanence in Linguistic Style

For a trait to be considered a reliable biometric modality, it needs to undergo minimal change over long intervals of time. This property is called permanence. This section studies the permanence aspects of stylometry using samples acquired over 24 months. To the best of our knowledge, this is the first analysis that systematically studies the permanence of stylometry for cybersecurity purposes. We do so with two large-scale, publicly available datasets - Blogs and Yelp. Our experiments reveal that users show permanence in linguistic style for a time span of up to 6 months, beyond which it decreases. We also perform a detailed analysis of success and failure scenarios to obtain further insights. Verification over long intervals of time seems to naturally minimize topic interference and places more emphasis on cues that are genuinely representative of a person's linguistic style. We observe that a small fraction of users exhibit permanence over 18 months due to their consistent use of distinctive stylistic cues. These cues encompass distinctive combinations of punctuations, grammatical morphemes, emojis and internet slang. We also observe that some users exhibit no permanence since they fail to use distinctive stylistic cues consistently over time. Such users cause high error rates, thereby exhibiting goat/wolf/lamb behavior.

In order to study the permanence of stylometry over time, we need to acquire samples written by the same author over various time intervals. We use two publicly available datasets and acquire subsets of authors who have samples spanning time intervals of 12 months, 18 months and 24 months. We call these span-subsets. The number of authors in each of these span-subsets differs, since the number of authors with samples spanning a given interval reduces with longer intervals. To study the possible drift in stylistic features over time, we further subdivide these span-subsets such that they contain train and test samples that are less than a month apart, one month apart, three months apart, and so on up until the total span of the subset. For example, we choose intervals, ∆T, spanning <1, 1, 3, 6, 12, 18 and 24 months for the subset spanning 24 months. We denote these as drift-subsets. We use four sets of non-overlapping train and test samples for each of these drift-subsets for 4-fold cross-validation. We preprocess the text to remove topic words using Latent Dirichlet Allocation [81] in order to minimize topic interference during authentication. We use the 3000 most common character 3-grams computed from the training data of each fold as our feature representation. Feature vectors comprise normalized frequencies of these features to account for varying text lengths. We perform verification experiments using 4-fold cross-validation for each drift-subset in each span-subset. For each subset, we then compute the False Accept Rate (FAR), False Reject Rate (FRR) and Equal Error Rate (EER). To compute permanence quantitatively, we use the measure proposed by Harvey et al. [93]. Biometric permanence, PM(∆T), for elapsed time ∆T is given by

PM(∆T, FAR_op) = (1 − FRR_∆T) / (1 − FRR_0)    (5-1)

where FRR_∆T is computed using match scores from data acquired at ∆T intervals and FRR_0 is a baseline measure computed using match scores from data acquired during the same visit or within a short interval (∆T ≈ 0). Both FRR_∆T and FRR_0 are FRRs corresponding to a suitable operating point FAR_op. According to this formulation,

• PM → 1 as FRR_∆T → FRR_0, i.e. if FRR does not increase over time, permanence is high.

• PM decreases as FRR_∆T → 1, i.e. as FRR increases over time, permanence reduces.

Thus, a biometric modality is said to exhibit permanence if PM is close to 1 over time.
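Equation 5-1 reduces to a one-line computation; the following sketch, with made-up FRR values that are not from our experiments, illustrates how PM behaves as FRR drifts:

```python
# Biometric permanence (Equation 5-1) from FRR at the operating FAR.
def permanence(frr_dt, frr_0):
    """PM(dT, FAR_op) = (1 - FRR_dT) / (1 - FRR_0)."""
    return (1.0 - frr_dt) / (1.0 - frr_0)

# Illustrative (made-up) FRR drift over time:
frr_0 = 0.10
for months, frr in [(1, 0.12), (6, 0.14), (12, 0.25), (24, 0.40)]:
    print(months, round(permanence(frr, frr_0), 2))
# PM stays near 1.0 while FRR is stable and drops as FRR grows.
```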

In our experiments, we use EER_0 computed at ∆T ≈ 0 as FAR_op. We use the drift-subset acquired at the <1 month interval to compute FRR_0 and FAR_op. For each span-subset, we compute PM at ∆T = {1, 3, 6, 12, 18, 24} months up until the maximum span of the subset.

5.1.2.2 Datasets

We use the following two publicly available datasets for our experiments:

• Blogs [19]: This dataset consists of 678,161 blog posts written by 19,320 bloggers. It includes information about each blogger’s gender, age and occupation. From this dataset, we use a subset of authors who have posts spanning time intervals of 12, 18 and 24 months. Statistics of these span-subsets are given in Table 5-2.

• Yelp [94]: This dataset consists of 5.2M reviews of various businesses by 1.2M users. Similar to the Blogs dataset, we use a subset of authors who have posts spanning time intervals of 12, 18 and 24 months from this dataset. Statistics of these span-subsets are given in Table 5-2.

Table 5-2. Dataset statistics
           Blogs                Yelp
Span       #authors  #samples  #authors  #samples
12 months  84        5005      148       8464
18 months  37        2152      101       5777
24 months  15        927       68        4030

5.1.2.3 Results

We perform 4-fold cross-validation experiments on both datasets for each drift-subset in all span-subsets. Verification is carried out using a Linear SVM in a one-vs-rest manner, where samples of a user constitute positive examples and samples of all other users are used as negative examples. The probability scores of the classifier are used as match scores for verification. Using the performance measures obtained for ∆T = <1 month as the baseline, we compute the permanence measures for the other drift-subsets of longer time intervals. These permanence measures are reported in Table 5-3. The corresponding ROC curves for the various span-subsets are shown in Figure 5-6. With both datasets, it can be observed that permanence is close to 1.0 for up to 6 months. However, it starts to drop off from 12 months onwards for both datasets. This effect seems to be more pronounced for the Blogs dataset.
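For reference, the EER used throughout this section can be approximated from pooled genuine and impostor match scores with a simple threshold sweep; this is a rough sketch, and the sweep granularity is an implementation assumption:

```python
# Approximate Equal Error Rate from genuine/impostor score arrays.
import numpy as np

def eer(genuine, impostor):
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    best = (1.0, 0.0)                      # (FAR, FRR) with worst gap
    for t in thresholds:
        frr = np.mean(genuine < t)         # genuine scores rejected
        far = np.mean(impostor >= t)       # impostor scores accepted
        if abs(far - frr) < abs(best[0] - best[1]):
            best = (far, frr)
    return (best[0] + best[1]) / 2.0       # EER at the crossover point
```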

Table 5-3. Permanence of various span-subsets in Blogs and Yelp datasets (columns give the permanence of drift-subsets at each interval)
Datasets  Span       #authors  <1m  1m    3m    6m    12m   18m   24m
Blogs     12 months  84        1.0  0.97  0.89  0.90  0.80  -     -
Blogs     18 months  37        1.0  0.91  0.89  0.85  0.79  0.69  -
Blogs     24 months  15        1.0  0.98  0.92  0.92  0.77  0.70  0.63
Yelp      12 months  148       1.0  0.97  0.95  0.93  0.84  -     -
Yelp      18 months  101       1.0  0.95  0.93  0.95  0.87  0.82  -
Yelp      24 months  68        1.0  0.92  0.88  0.91  0.85  0.81  0.81


Figure 5-6. ROC curves for various span-subsets in Blogs and Yelp datasets. A) Blogs: 84 authors, 12 months. B) Yelp: 148 authors, 12 months. C) Blogs: 37 authors, 18 months. D) Yelp: 101 authors, 18 months. E) Blogs: 15 authors, 24 months. F) Yelp: 68 authors, 24 months.

5.1.2.4 Discussion

To analyze further, we attempt to answer the following questions:

• Are there users who perform well consistently over long intervals of time?

• If such users exist, which features make them consistently succeed irrespective of time?

• What are the potential causes of users who fail consistently?

(1) Are there users who perform well consistently over long intervals of time? To answer this, we group users in each drift-subset into two extreme categories: 1) High error: These users exhibit high FAR, high FRR or both across the cross-validation folds. 2) Low error: These users exhibit both low FAR and low FRR across the cross-validation folds. The thresholds for this categorization are computed using the mean FAR and FRR of individual users in each drift-subset. We then determine users who consistently succeed or fail across the drift-subsets of a given span-subset. Figure 5-7 shows the trend of users who consistently succeed as time progresses. As can be seen from these graphs, the number of users who are consistently authenticated reduces over time. For the Yelp dataset, this trend stabilizes over six months, with approximately 5% of users remaining consistent beyond six months to over 24 months. For the Blogs dataset, the number of consistent users continues to drop off beyond six months, resulting in no consistent users spanning 24 months. This explains the permanence deterioration of Blogs in Table 5-3.

(2) If such users exist, which features make them consistently succeed irrespective of time? To investigate this, we compute the feature importances returned by the classifier of each successful user in a drift-subset and subsequently compute the cumulative feature importance of that user across all drift-subsets of a given span-subset. Thus, features that are consistently deemed important at various intervals over time will have high cumulative feature importance values. If no feature is consistently deemed important across different time intervals, the cumulative feature importance will be low for most or all features of that user.


Figure 5-7. Trend of consistent users over time. A) Blogs. B) Yelp.

Using these feature importance values, we visualize samples of a user written at different time intervals. An example is shown in Figure 5-8, which displays samples of a successful user acquired at different time intervals. The samples are shaded using the cumulative feature importances of that user's features. Darker shades indicate high cumulative feature importance over time, and lighter shades indicate low feature importance. In this example, the user tends to use words like My husband, here, where and there very frequently irrespective of the span. Such cues allow this user to be consistently authenticated successfully over time. Consistent features of a few other successful users in both datasets are listed in Table 5-4. From this table, it can be observed that stylistic cues depend mostly on unique combinations of punctuations, affixes corresponding to grammatical morphemes, and stop words. Some users also seem to be identified by their frequent use of emojis, internet slang or theme (religion, hobby). Verification over long intervals of time seems to inherently minimize topic interference and places more emphasis on stylistic cues like punctuations and grammatical morphemes, as predicted in the literature.

(3) What are the potential causes of users who fail consistently?


Figure 5-8. Success scenario: Snippets showing important features of a user who is consistently authenticated correctly for a span of over 24 months. Some consistent features captured here are My husband, here, where, there

It is possible that some users do not have any features that they tend to use consistently over time, or that their features are not sufficiently distinctive. This causes high FAR or high FRR for such users, leading them to exhibit goat/wolf/lamb behavior. Examples of a couple of failure scenarios are shown in Figure 5-9. In both scenarios, there are no consistently important features over time for these users, as indicated by low cumulative feature importance. Those features that are somewhat important also tend to be common stop words or punctuations. Relying only on these features for authentication could cause high FAR/FRR, explaining their consistent failure.

Table 5-4. Important features that are consistent over time for some users who are authenticated consistently

Yelp
UserID  | Consistent features
5lq4    | with, -ly, quotes (''), perhaps, -ed, -ing
ikm0    | :-0, :-), bullets (*), &, today, including, -nd
kGgA    | contractions ('s), ..., return, overall
ppkK    | e !, !, d !
Xjfk    | were, where, here, there, clear, year, place, 's, t -, e -, my husband

Blogs
UserID  | Consistent features
78196   | imma, my, awf da, cuz, cat, Yall
152151  | like, its, ! i, t !
942828  | of, pray, God, n of, His, catholic
1000866 | 'm, -ly, . I, -lly
1516660 | chem, !, We, I am, end
1679249 | photograph, comments, location
1713442 | -ation, presiden-, information, students, librarians


Figure 5-9. Failure scenarios using samples from users who consistently fail over time. A) Failure scenario 1: No consistently important features over time. Partially consistent features also tend to be mostly stop words. B) Failure scenario 2: Consistently important features over time are mostly punctuations.

5.2 Multimodal Biometric Authentication

Behavioral biometrics are highly sought after in cybersecurity since they are non-intrusive and have near-zero sensor costs. Since most of the information that we generate or exchange online is text-based, two behavioral biometrics lend themselves naturally to authentication purposes - keystroke dynamics and stylometry. Keystroke dynamics [95–97] involves authentication using a person's typing patterns, which are believed to be unique to the person and difficult to imitate. Keystroke dynamics uses statistics of hold and transition timings of key presses and releases to perform authentication. Keystroke-based authentication has been demonstrated using short fixed inputs [98–100] and long free-text inputs [101–103]. Keystroke authentication with free-text inputs has also been researched for continuous authentication [104–106]. Since keystroke dynamics and stylometry are both behavioral traits, they have certain drawbacks when used independently. Keystroke dynamics is prone to be affected by hardware changes, dynamic changes in typing patterns, and injury. Stylometry, on the other hand, has been shown to be difficult for short text lengths [2]. Most of the textual content we generate in the cyber world is typically only a few sentences long, and we tend to use multiple personal gadgets to engage online. Hence, these traits could be complementary when used together for online authentication.

The joint use of keystroke dynamics and stylometric traits has been pursued by Monaco et al. [107, 108] for online course assessment and authentication. They used 239 keystroke features that include hierarchical transition and duration timings between various key groups. Style features involved 228 features at the character, word and syntax level. Their experiments with 30 users in the Stewart dataset showed that verification using stylometry performed poorly compared to that using keystroke dynamics. Hence, they did not pursue any fusion experiments. Locklear et al. [109] used many higher-level cognitive traits (beyond keystroke and stylometry) exhibited by a user during text production for continuous authentication. They used 123 such behavioral traits from discrete cognitive units to perform authentication of 486 subjects.

We revisit the idea of jointly using keystroke and stylometric traits as proposed by Monaco et al. [107, 108] since it could have far broader applications in cybersecurity. However, in our approach, we use better stylometric feature representations, i.e. character n-grams, which have been demonstrated to be stable in short-text scenarios. Analogous to n-grams, we use n-graph features for representing the keystroke dynamics of free-text input. These low-level feature representations for both traits seem to perform well compared to the complex feature representations used by Monaco et al. since they are statistically stable and less sparse even with short text lengths. Our verification experiments on two publicly available datasets, Villani and Stewart, show that both traits have the potential for further exploration either independently or when fused.

5.2.1 Methodology

In this section, we describe the feature representations for stylometry and keystroke dynamics and our authentication approaches.

5.2.1.1 Features

In this section, we describe the features chosen for both keystroke and stylometric traits. The biometric verification protocol using these traits is also explained subsequently.

Stylometry. Character n-grams have been shown to be quite effective for stylometric attribution [27, 63] and are analogous to the n-graphs used in keystroke dynamics. They are hypothesized to work well since they encompass stylistic attributes at the character, lexical and syntactic levels to various degrees. Hence, in our experiments we use character n-grams in two paradigms:

• Instance-based approach: In this approach, for each sample in the training and test data, we compute normalized frequencies of the most common L n-grams, which are chosen using the training data. These constitute the feature vectors that represent stylistic attributes and are subsequently used for verification (see the sketch after this list). We preprocess the text to remove topic words using Latent Dirichlet Allocation [81] to minimize topic interference during authentication. The feature vectors are normalized to account for varying text lengths. For our experiments, we use n = {2,3,4} and L = {50,100,500,1000,2000,3000}.

• Profile-based approach: In this approach, we create a character n-gram model for each user using all their training data. Character n-grams extracted from the test data are then compared with the user models for verification using suitable similarity measures. In our experiments, we use the profile-based approach proposed by Teahan et al. [52], which computes a character language model for each user with backoff techniques and computes the likelihood of the test data being generated by a user's language model.
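The instance-based representation can be sketched in a few lines of scikit-learn; the sample texts below are placeholders, and topic-word removal via LDA is assumed to have been applied beforehand:

```python
# Normalized frequencies of the L most common character n-grams,
# with the vocabulary chosen from training data only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

train_texts = ["I'd say it's fine, really...", "Well -- that was great!"]
test_texts = ["It's fine, I'd say."]

vec = CountVectorizer(analyzer="char", ngram_range=(3, 3), max_features=3000)
X_train = normalize(vec.fit_transform(train_texts), norm="l1")  # per-sample
X_test = normalize(vec.transform(test_texts), norm="l1")        # frequencies
```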

Keystroke. We focus on free-text keystroke dynamics here since we intend to use keystroke dynamics in conjunction with stylometric traits. Amongst other keystroke features, n-graphs have been shown to be useful for free-text keystroke authentication [110, 111]. For our experiments, we use the key hold and transition times of the most common L n-graphs. Hold time denotes the duration from when a key is pressed to when it is released. Transition time denotes the duration from when a key is released to when the next key is pressed. For each n-graph, we compute the mean key hold times for each key and the mean transition times between consecutive keys. This yields (2n − 1) features for each n-graph. Hence, for the L most common n-graphs, each sample yields a K-dimensional feature vector where K = L ∗ (2n − 1). The mean key hold and transition times are computed after outlier rejection. The outlier rejection is done for three iterations, where outliers beyond a 2σ interval are rejected. For our experiments, we use n = {2,3,4} and L = {50,100,500,1000,2000,3000}.
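A hedged sketch of the digraph (n = 2) case follows: two hold times and one transition time per digraph, i.e. 2n − 1 = 3 features, with iterative 2σ outlier rejection. The event format is an assumption, not the datasets' native format:

```python
# Digraph timing features with iterative 2-sigma outlier rejection.
import numpy as np

def trimmed_mean(values, n_iter=3, k=2.0):
    """Mean after iteratively discarding samples beyond k*sigma."""
    v = np.asarray(values, dtype=float)
    for _ in range(n_iter):
        mu, sigma = v.mean(), v.std()
        keep = np.abs(v - mu) <= k * sigma
        if keep.all() or sigma == 0:
            break
        v = v[keep]
    return v.mean()

def digraph_features(events, top_digraphs):
    """events: time-ordered list of (key, press_time, release_time).
    Returns 3 features per digraph: two hold times and one transition."""
    timings = {dg: ([], [], []) for dg in top_digraphs}
    for (k1, p1, r1), (k2, p2, r2) in zip(events, events[1:]):
        dg = k1 + k2
        if dg in timings:
            hold1, hold2, trans = timings[dg]
            hold1.append(r1 - p1)          # hold of first key
            hold2.append(r2 - p2)          # hold of second key
            trans.append(p2 - r1)          # release of k1 to press of k2
    feats = []
    for dg in top_digraphs:
        for series in timings[dg]:
            feats.append(trimmed_mean(series) if series else 0.0)
    return np.array(feats)                 # length = L * (2n - 1), n = 2
```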

5.2.1.2 Verification

For keystroke features and instance-based stylometry features, we build a binary classifier for each user since the accept/reject decision boundaries for each user may be different. We use a Linear SVM as the binary classifier, where training samples corresponding to that user constitute positive samples and those of other users constitute negative samples. The probability scores of test data corresponding to each user are used to compute the score matrices. For the profile-based approach of Teahan et al. [52], the match scores are provided using cross-entropy measures.
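A minimal sketch of the per-user verifier follows; the linear SVM is written here as scikit-learn's SVC with a linear kernel so that probability outputs are available, which is an implementation assumption:

```python
# One-vs-rest binary SVM per user; probabilities serve as match scores.
import numpy as np
from sklearn.svm import SVC

def user_match_scores(X_train, y_train, X_probe, user_id):
    """y_train: numpy array of user labels. Returns probe match scores."""
    y_binary = (np.asarray(y_train) == user_id).astype(int)
    clf = SVC(kernel="linear", probability=True).fit(X_train, y_binary)
    return clf.predict_proba(X_probe)[:, 1]   # P(genuine) as match score
```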

98 The keystroke and stylometric traits are fused using two approaches:

• Feature-level fusion for keystroke features and instance-based stylometric features. For most common L n-graphs and L n-grams, P-dimensional feature vectors are used where P = L + L ∗ (2n − 1).

• Score-level fusion for keystroke features and both instance-based and profile-based stylometric approaches. The score matrices from both approaches are normalized and fused using the sum rule before verification, as sketched below.
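The sum-rule fusion itself is a one-liner once scores are normalized; the sketch below uses min-max normalization, which is one common reading of the normalization step and an assumption here:

```python
# Score-level fusion: normalize each score matrix, then apply the sum rule.
import numpy as np

def minmax(scores):
    s = np.asarray(scores, dtype=float)
    rng = s.max() - s.min()
    return (s - s.min()) / rng if rng > 0 else np.zeros_like(s)

def fuse(keystroke_scores, stylometry_scores):
    """Sum rule over normalized match-score matrices of equal shape."""
    return minmax(keystroke_scores) + minmax(stylometry_scores)
```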

5.2.2 Dataset

We used two publicly available keystroke datasets for our experiments.

• Villani dataset [112, 113]: This dataset consists of 144 users who provided at least five samples of free-text or fixed-text input while answering essay questions or copying fables, respectively. However, for our experiments, we used only the data obtained with free-text input since we evaluate both keystroke and stylometric traits. Hence, we chose a subset of 124 users who had at least four samples of free-text input. This subset consists of 9 samples/person with 650 characters/sample, 138 words/sample and eight sentences/sample on average. The keystroke data consists of the key press and release times for each key along with the corresponding key code. The text input for stylometry was extracted from the keycode sequences while accounting for keycodes corresponding to backspace and delete.

• Stewart dataset [107, 108]: This dataset consists of 40 students who answered four online short-answer tests with ten questions each. It consists of 27 samples/student with four sentences/sample on average. In our experiments we use the data from all 40 students, unlike Monaco et al. [108] who use the data from only the 30 students who had completed all four tests or answered a sufficient number of questions per test. Since these samples are too short for reliable stylometric authentication, we follow the approach of Monaco et al. [108] and concatenate every two samples per student, thereby halving the number of samples. The resulting dataset used in our experiments consists of 13 samples/student with 1105 characters/sample, 208 words/sample and eight sentences/sample on average.

5.2.3 Results

The keystroke and stylometry features were extracted for both datasets as explained in Section 5.2.1.1. The most common features are chosen based only on the training data of each cross-validation fold. The top 20 keystroke features and top 20 stylometry features primarily consist of stop words like and, the and their associated key sequences, since we chose the most common n-grams/n-graphs as our features. They also contain affixes that are used to change the grammatical notion, like -tion and -ing. Verification experiments were performed using 4-fold cross-validation for stylometry and keystroke traits for various orders of n-grams/n-graphs (n = {2, 3, 4}) and for different lengths (L = {50, 100, 500, 1000, 2000, 3000}). For both datasets, keystroke traits performed best when using the 50 most common digraphs, i.e. n = 2, L = 50. For stylometry, the best performance for the Villani dataset was obtained with n = 3, L = 3000, and for the Stewart dataset, the best performance was obtained with n = 4, L = 2000. From our experiments, it was observed that for stylometric traits EER decreased as L increased, with better performance for higher-order n-grams, i.e. n = {3, 4}. This observation intuitively makes sense because the linguistic choices of a person can be varied and require more features of a reasonable span to be represented reliably. For keystroke dynamics, it was observed that EER increases as L increases, and digraphs perform better than higher-order n-graphs. This observation indicates that keystroke timings are more consistent for frequently used words, i.e. stop words, since they tend to be typed more frequently and without much thought. In the following discussion, we only report results obtained with the best-performing parameters for both stylometry and keystroke dynamics. Verification performance for both traits was computed using Equal Error Rate (EER) and Area Under the Curve (AUC). The results for keystroke dynamics, instance-based and profile-based stylometric approaches are reported in Table 5-5 for both datasets. The corresponding ROC curves and genuine/impostor score distributions of both traits are shown in Figures 5-10 and 5-11A, 5-11B, 5-11E and 5-11F respectively. Our results with stylometric approaches on the Stewart dataset are better than those of Monaco et al. [108] by 9-11%, even though we use a much larger subset of the dataset compared to their approach. Subsequently, we performed a feature-level fusion of keystroke features and instance-based stylometric features and obtained verification performance.

Table 5-5. Verification performance (EER(%) and AUC(%)) for stylometry, keystroke and their fusion. FF: feature-level fusion, SF: score-level fusion. * uses a smaller subset of the dataset
Dataset                           Villani          Stewart
Modality                          EER(%)  AUC(%)   EER(%)  AUC(%)
Instance-based stylometry         7.33    97.79    12.92   94.21
Profile-based Teahan              10.7    95.76    31.93   77.69
Stylometry - Monaco et al. [108]  -       -        22.00*  -
Keystroke                         4.4     98.41    2.59    99.28
Keystroke - Monaco et al. [108]   -       -        0.00*   -
FF: Key+Ins-sty                   5.1     98.59    5.2     99.19
SF: Key+Ins-sty                   3.55    98.73    3.44    99.46
SF: Key+Teahan                    3.2     99.12    4.15    98.67


Figure 5-10. ROC curves for stylometry, keystroke and their fusion. (FF: Feature-level fusion, SF: Score-level fusion). A) ROC curves for Villani dataset. B) ROC curves for Stewart dataset.

Similarly, verification experiments were performed on score-level fusion of keystroke match scores with instance-based stylometry match scores, and of keystroke match scores with profile-based stylometry match scores. The score matrices were normalized and summed to obtain the fused score matrix. The verification performance of the fused traits is also reported in Table 5-5. The corresponding ROC curves and genuine/impostor score distributions of fused traits are shown in Figures 5-10 and 5-11C, 5-11D, 5-11G and 5-11H respectively.


Figure 5-11. Score distribution curves for stylometry, keystroke and fusion for the best-performing L most frequent n-graphs/n-grams (n). A) Villani: Instance-based stylometry (n=3, L=3000). B) Villani: Keystroke (n=2, L=50). C) Villani: Feature fusion - Stylometry + Keystroke. D) Villani: Score fusion - Stylometry + Keystroke. E) Stewart: Instance-based stylometry (n=4, L=2000). F) Stewart: Keystroke (n=2, L=50). G) Stewart: Feature fusion - Stylometry + Keystroke. H) Stewart: Score fusion - Stylometry + Keystroke.

5.2.4 Discussion

It can be observed from the verification performance of both instance-based and profile-based stylometry approaches that instance-based approaches perform much better than the profile-based approach for both datasets. This result could be because topic words were excluded from the feature representation in the instance-based approach, as compared to the profile-based approach which uses the entire text. This causes the latter to suffer heavily from topic interference. The effect seems more pronounced for the Stewart dataset, where all students answer a fixed set of technical questions. Hence, they all tend to use similar topic words, resulting in higher error rates. By removing topic words in the instance-based approach, the performance is greatly improved for the Stewart dataset.

Analysis of success and failure cases of stylometry for both datasets was performed by obtaining the most important features for each author across all cross-validation folds. Each author's samples were then visualized by highlighting n-grams using their feature importance, as shown in Figure 5-12. This analysis revealed the following insights: i) Users who consistently performed well across all folds tend to use distinct words (that are not stop words) or phrases across their samples, as shown in Figure 5-12A, e.g. possible, when you, that you. Usage of such words or phrases consistently across all samples is indicative of their linguistic choices. ii) Users who consistently failed across all folds did so mostly due to very few training samples. iii) A few other users who failed consistently despite sufficient training samples used no distinctive words or phrases across their samples. The essential features chosen for such authors were mostly stop words like the, would, because, as shown in Figure 5-12B. Since all users commonly use stop words, this results in consistently high error rates for such users.

The standalone verification performance of keystroke dynamics is much better compared to that of stylometry. Though both are behavioral traits, this could be because linguistic choices happen at a much higher cognitive level while typing happens at a lower muscle-memory level [108]. Error analysis of failure cases when using keystroke dynamics revealed two potential reasons: i) some users had very few samples for training, typically 4 to 5 samples; ii) some users had randomly typed characters instead of answering the given questions. These reasons could have skewed the keystroke timings of these users. When both these traits are fused, score-level fusion performs better than feature-level fusion for both datasets. The verification performance of score-level fusion is better than or on par with that of keystroke dynamics. Error analysis of failure cases in fusion scenarios points mostly to users with very few training samples. These users consistently fail when using both traits independently or in fusion and hence limit any further performance improvement when fusing both traits.


Figure 5-12. Text samples for success and failure cases. Samples are highlighted by features considered important for that author. Darker shades of red indicate features of higher importance and lighter shades indicate lesser importance. A) Three samples of an author who consistently performed well. Important features across samples are phrases like when you, that you, possible. B) Three samples of an author who consistently failed. Important features are common stop words like the, would, because.

5.3 Summary

In this chapter, we investigated the suitability of stylometry as a biometric trait. We evaluated stylometry for two essential biometric characteristics, viz. uniqueness and permanence. The biometric menagerie was used to investigate uniqueness. It was observed that not all individuals exhibit unique stylistic traits, and there is evidence that some users exhibit goat/lamb/wolf behavior. Permanence was evaluated using samples spanning over 24 months. It was observed that only a small fraction of users exhibit a consistent style over 24 months. For most users, the linguistic style seems stable for up to 6 months and tends to drift beyond that. It was also observed that stylometric verification over long intervals of time seems to inherently control for topic interference. Since stylometry is a behavioral trait, we expect its performance to improve when used with another co-occurring trait, i.e. keystroke dynamics. Hence, we evaluated multi-modal authentication using stylometry and keystroke dynamics. It was observed that keystroke dynamics outperforms stylometry, possibly because keystroke dynamics depends on physical motor control while stylometry requires much higher cognitive processing. For both traits, low-level features such as n-grams and n-graphs seem robust, especially in short-text scenarios.

Overall, this investigation into the biometric capabilities of stylometry reveals that, with challenging data like online communications, not all users exhibit uniqueness or permanence. This would hamper large-scale identification under such demanding settings. This drawback can be compensated to some extent by keystroke dynamics when available. If not, one can still infer some information about a person's identity using their linguistic usage. This approach to author profiling or soft-biometric classification is discussed in the upcoming chapter.

CHAPTER 6
ROLE OF PSYCHOLOGICAL AND SOCIAL ASPECTS ON STYLOMETRY

This chapter aims to understand the role of psychological and social aspects governing language usage. These aspects might help infer specific attributes indicative of a person's identity, like gender, age, personality and native language. This approach of inferring author attributes is called author profiling and finds applications in security, forensics, and marketing. We observe from previous chapters that not all individuals exhibit a unique linguistic style with limited data. In such scenarios, it might be useful if we can infer other attributes like personality or social groups. In this chapter, we attempt to estimate author attributes like age, gender and personality traits and analyze the role of psychological aspects of language (psycholinguistic variables) in estimating the social groups to which an author might belong. Both engineered and learned features are used to identify social groups like age, gender and personality traits. Analysis of salient features of the learned model is performed to understand the linguistic aspects that highlight the differences in linguistic usage by different social groups. Specifically, we investigate the role of various psycholinguistic variables in estimating author attributes. In this chapter, we attempt to answer the following question:

• What are the linguistic aspects that help profile an author’s attributes like age, gender and personality?

6.1 Role of psychological aspects

Sociolinguistic researchers [114, 115] have posited that different social groups use language differently and that a subset of words, called psycholinguistic words, can help identify the differences between these social groups. In this section, we analyze various psycholinguistic word categories on two datasets to identify salient categories that indicate differences in age, gender and five personality traits.

6.1.1 Methodology

For our analysis, we use the psycholinguistic categories proposed in Linguistic Inquiry and Word Count (LIWC2015) [114]. The LIWC2015 lexicon consists of 6400 words and word stems spanning 21 linguistic dimensions (e.g., prepositions, pronouns), 41 psychological constructs (e.g., affect, cognition, biological processes, drives), six personal concern groups (e.g., work, home, leisure), five informal language dimensions and 12 punctuation categories. These features have been widely used to study various psychological processes [115] like attentional focus, social relationships, honesty/deception, status/dominance and social coordination. The different psycholinguistic categories used in our experiments are listed in Table 6-1.

To study the role of various psycholinguistic word categories in author profiling, we perform statistical analyses of these variables for the different classes of each attribute. This analysis allows us to identify psycholinguistic word categories that differentiate between different classes of social groups. The psycholinguistic variables for both datasets are measured using the LIWC tool and are continuous variables. For gender analysis, we use a t-test to identify psycholinguistic variables that are statistically different (p < 0.05) between the two classes. For age analysis, we use one-way ANOVA to identify psycholinguistic variables that are statistically different (p < 0.05) for any of the classes. For personality traits, we binarize each trait to indicate the presence or absence of a trait, i.e., True if trait > 0 and False if trait < 0. Then, we use a t-test to identify psycholinguistic variables that are statistically different (p < 0.05) between the presence and absence of the personality traits. Inferences from these statistical analyses are then used to deduce the role of various psycholinguistic variables in author profiling.
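To make the procedure concrete, the sketch below shows how these tests could be run with standard scientific Python tooling. It is a minimal illustration only: the DataFrame columns, the file name, and the use of extroversion for the binarized t-test are hypothetical placeholders, not the code used in our experiments.

```python
# Sketch of the statistical tests described above, assuming a table with one
# row per user, LIWC category scores as columns, and attribute labels.
# All column/file names are illustrative.
import pandas as pd
from scipy import stats

df = pd.read_csv("liwc_scores.csv")  # hypothetical per-user LIWC output
liwc_cols = [c for c in df.columns if c not in ("gender", "age_group", "extroversion")]

significant = {}
for col in liwc_cols:
    # Gender: two-sample t-test between the two classes
    _, p_gender = stats.ttest_ind(df.loc[df.gender == "F", col],
                                  df.loc[df.gender == "M", col])
    # Age: one-way ANOVA across the age-group classes
    groups = [g[col].values for _, g in df.groupby("age_group")]
    _, p_age = stats.f_oneway(*groups)
    # Personality: binarize the trait score (> 0 = present) and run a t-test
    present = df.extroversion > 0
    _, p_trait = stats.ttest_ind(df.loc[present, col], df.loc[~present, col])
    significant[col] = {"gender": p_gender < 0.05,
                        "age": p_age < 0.05,
                        "extroversion": p_trait < 0.05}
```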

6.1.2 Datasets

To analyze the linguistic aspects that help profile an author, we use the following two datasets from the PAN author profiling tasks:

Table 6-1. LIWC categories
| Category | Subcategories | Description | Examples |
| Summary | Analytic, Clout, Authentic, Tone | Summary variables to reflect analytical thinking, clout, authenticity and emotional tone | - |
| Function words (function) | pronoun, article, prep, auxverb, adverb, conj, negate | Function words like pronouns & auxiliary verbs | I, a, in, has, very, and, not |
| Pronouns | pronoun, ppron, I, we, you, shehe, they, ipron | Personal and impersonal pronouns | I, we, you, she, it, those |
| Other grammar | verb, adj, compare, interrog, number, quant | Other grammatical terms like verbs, adjectives, comparisons, interrogatives, quantifiers | eat, free, best, how, second, many |
| Affect | posemo, negemo, anx, anger, sad | Affective words for positive & negative emotions | nice, ugly, worried, hate, grief |
| Social | family, friend, female, male | Social words like friends & family | talk, buddy, dad, boy, girl |
| Cognitive process (cogproc) | insight, cause, discrep, tentat, certain, differ, excl | Insight, causation, discrepancy, tentative, certainty, differentiation, exclusives | think, because, should, maybe, always, else, but |
| Perceptual process (percept) | see, hear, feel | Perceptual processes like seeing, hearing & feeling | view, listen, touch |
| Biological process (bio) | body, health, sexual, ingest | Biological processes related to body, health, sex and ingestion | cheek, clinic, incest, pizza |
| Drives | affiliation, achieve, power, reward, risk | Conative words describing needs, motives & drives | ally, win, superior, prize, danger |
| Time orientation (timeorient) | focuspast, focuspresent, focusfuture | Time orientation words referring to past, present & future | ago, today, soon |
| Relativity (relativ) | motion, space, time | Relativity words referring to motion, space & time | arrive, down, season |
| Personal concerns (persconc) | work, leisure, home, money, relig, death | Personal concerns | job, movie, kitchen, audit, church, kill |
| Informal words | swear, netspeak, assent, nonflu, filler | Informal words like non-fluencies, assent & swear words | damn, lol, agree, hm, youknow |

1. PAN2015 [116]: This dataset consists of tweets in four different languages, of which we use only English in our experiments. The training data consists of tweets from 152 users (93 tweets/user on average), and the test data consists of tweets from 142 users (93 tweets/user on average). The corpus is labeled with gender, age and personality information. Gender and age were self-reported by users, while personality information (Big-5 traits: Extroversion, Emotional Stability, Agreeableness, Conscientiousness, Openness to experience) was obtained using an online personality test. Age information is divided into four classes: 18-24, 25-34, 35-49 and 50+. The personality scores for each trait are normalized to the range -0.5 to +0.5. The training and test data distributions for age and gender, along with the mean personality scores, are shown in Table 6-2; note that the gender classes are balanced while the age classes are highly imbalanced.

2. PAN2016 [117]: This dataset consists of cross-genre data for three languages, of which we use only English in our experiments. The training data consists of tweets from 428 users (600 tweets/user on average), and the test data consists of blog posts from 78 users (14 posts/user on average). Gender and age information is provided, where age is divided into five classes: 18-24, 25-34, 35-49, 50-64 and 65+. The distributions of the training and test data are shown in Table 6-3. As with the PAN2015 dataset, the gender classes are balanced while the age classes are highly imbalanced.

Table 6-2. PAN2015 author profiling dataset - English (age classes; Gender: F/M; mean personality scores E/S/A/C/O)
| Partitions | #Users | 18-24 | 25-34 | 35-49 | 50+ | F | M | E | S | A | C | O |
| Train | 152 | 58 | 60 | 22 | 12 | 76 | 76 | 0.16 | 0.14 | 0.12 | 0.17 | 0.24 |
| Test | 142 | 56 | 58 | 20 | 8 | 71 | 71 | 0.17 | 0.13 | 0.14 | 0.17 | 0.26 |

Table 6-3. PAN2016 author profiling dataset - English (age classes; Gender: F/M)
| Partitions | #Users | 18-24 | 25-34 | 35-49 | 50-64 | 65+ | F | M |
| Train | 428 | 26 | 136 | 182 | 78 | 6 | 214 | 214 |
| Test | 78 | 10 | 24 | 32 | 10 | 2 | 39 | 39 |

The words in both datasets are categorized using the LIWC text analysis tool. It should be noted that the LIWC lexicon corresponding to the categories mentioned above is predefined by the LIWC dictionary used by the tool. The proportions of various LIWC categories in the PAN2015 and PAN2016 datasets are shown in Figure 6-1. It can be observed from these figures that nearly half the data consists of psycholinguistic words and function words while the other half consists of content words or domain-specific tokens like retweets, hashtags, and URLs.

We are more interested in the role of psycholinguistic or stylistic cues in author profiling since they might be difficult to mimic.
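Computing category proportions like those in Figure 6-1 amounts to counting lexicon hits per category; a minimal sketch follows. The three-category lexicon is a toy stand-in, since the actual LIWC2015 dictionary is licensed and far larger.

```python
# Toy computation of LIWC-style category proportions over a token stream.
# The lexicon here is a stand-in for the licensed LIWC2015 dictionary.
from collections import Counter

lexicon = {
    "posemo": {"nice", "love", "happy"},
    "negemo": {"hate", "ugly", "worried"},
    "function": {"i", "a", "the", "and", "not"},
}

def category_proportions(tokens):
    counts = Counter()
    for tok in tokens:
        for cat, words in lexicon.items():
            if tok.lower() in words:
                counts[cat] += 1
    return {cat: n / len(tokens) for cat, n in counts.items()}

print(category_proportions("I love the nice weather and not the ugly rain".split()))
# -> roughly half the tokens fall into function/affect categories
```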

Figure 6-1. LIWC category distributions for PAN2015 and PAN2016 datasets. Categories with negligible contributions have been suppressed. A) PAN2015-train. B) PAN2015-test. C) PAN2016-train. D) PAN2016-test.

6.1.3 Results

The statistical analysis results of different social groups for PAN2015 and PAN2016 are reported in the following sections.

6.1.3.1 PAN2015

Results of the analysis of gender, age and personality traits on the PAN2015 dataset are reported in Tables A-1, A-2 and A-3. Only statistically significant results (p < 0.05) corresponding to different psycholinguistic categories are reported. From the gender analysis, it can be observed that women use significantly more function words (auxiliary verbs and adverbs), pronouns (first-person singular and plural), verbs and adjectives, affect words (positive emotions, anxiety, and sadness), social words (family, home), perceptual processes (hear and feel), biological processes (health and ingestion), affiliation words, words focused on the present and future, and other time-related words. Hence, the linguistic style of women seems more authentic due to the use of more affect words and first-person singular words (not distancing self), moderately descriptive (adjectives), with fewer words related to cognitive complexity (prepositions, articles, exclusions). Also, women use significantly more positive emotions, which makes women’s writing upbeat and positive. Men, on the other hand, use significantly more long words (six or more letters), articles, numbers and informal words (internet lingo). Hence, the linguistic style of men seems significantly more analytic than that of women.

The different age groups in the PAN2015 dataset are not balanced, and this is reflected in the age analysis. The age group 18-24 uses significantly more or less of many word categories compared to the other groups. Ages 18-24 use significantly more function words (auxiliary verbs, adverbs, conjunctions, and negation), pronouns (first-person singular and impersonal pronouns), verbs, affect words (negative emotions such as anger, anxiety, sadness), words corresponding to cognitive processes (discrepancy, tentative, differentiation), perceptual processes (feel), biological processes (body and sexual), relativity (motion and time) and informal words (swear and filler words). Hence, the linguistic style of people aged 18-24 seems significantly less analytic (less use of prepositions and articles) and exhibits less clout (more use of tentative and anxious words).

Ages 25-34, 35-49 and 50+ have fewer word categories which they use significantly more or less. Ages 25-34 use significantly more perceptual processes (see), space-related words, leisure words, and netspeak. They use significantly fewer third-person pronouns, interrogatives, family-related words, tentative words, reward words, and assent words. Ages 35-49 use significantly more insight and causation words, health words, achievement words, space, work and leisure words, and netspeak. Ages 50+ use significantly less informal speech. Impostors may mimic some of the word categories like work and leisure used by ages 25-49 since content words can also represent these word categories. To summarize, the authenticity of writing style seems to decrease with age, as younger people use more affect words and first-person singular words (not distancing self) and fewer words related to cognitive complexity (prepositions, articles, exclusions, six-letter words, word count). The tone of writing style seems to get more upbeat and positive as age increases.

For personality traits, the correlation of the Big-5 personality scores is shown in Figure 6-2. It can be observed that there exists a positive correlation between extroversion-stability and stability-agreeableness. Openness and conscientiousness exhibit high negative correlations with the other three traits. For statistical analysis, the scores for each of the five traits are binarized, e.g., an extroversion score > 0 indicates an extroverted nature while a score < 0 indicates an introverted nature. However, it must be noted that the binarized personality traits are imbalanced. Statistical analysis reveals that extroverted people use significantly more of very few word categories (friend, health, netspeak and assent). Introverted people use significantly more adjectives, comparisons, drive words (achievement, power, reward), and work and money-related words. This observation suggests that extroverted people use social words and uninhibited informal speech while introverted people exhibit higher cognitive complexity and drives.

Emotionally stable people use significantly more long words, articles, second-person pronouns, perceptual words (see), work-related words and netspeak.

However, emotionally unstable people use more function words (auxiliary verbs, adverbs, conjunctions, negation), first-person singular pronouns, impersonal pronouns, verbs, adjectives, affect words (negative emotions like anxiety and sadness), discrepancy words and words focused on the present. This inference conforms with neurotic people expressing the negative emotions they experience; hence their style seems significantly more authentic while that of emotionally stable people seems analytic and confident.

Figure 6-2. Correlation of personality scores in PAN2015

People with an agreeable nature use more second-person pronouns, leisure words and informal speech (netspeak). However, people who do not get along with others use more function words (auxiliary verbs, adverbs, conjunctions, negation), first-person singular and impersonal pronouns, negative emotion words (anxiety and anger), discrepancies, tentative words, differentiation words, feel words and risk words. To summarize, the writing style of agreeable people seems more analytic and confident while that of disagreeable people tends to be more authentic.

Conscientious people use more long words, prepositions, second-person pronouns, achievement words, and work-related words. People with lower conscientiousness use more negations, first-person singular pronouns, impersonal pronouns, interrogatives and negative emotions like anger. Hence, the writing style of conscientious people seems more analytic, confident and positive. People with openness to experience frequently use pronouns, positive affect words, family words, affiliation and reward words, leisure words and informal speech (swear, non-fluencies, filler words). People with less openness use more articles, differentiation words and work-related words. To summarize, the writing style of people who exhibit openness seems more authentic while that of people with less openness seems more analytic.

The correlations of different LIWC word categories with the Big-5 personality traits are shown in Figure 6-3. Extroversion seems positively correlated with clout, second-person pronouns, informal words (netspeak, assent) and punctuation, and negatively correlated with negative emotions (anxiety, sadness), cognitive complexity (quantifiers, comparisons, conjunctions) and words focusing on the future. Emotional stability seems positively correlated with analytic style (long words, numbers) and netspeak, and negatively correlated with first-person singular pronouns, auxiliary verbs, adverbs, negative emotions (anxiety) and discrepancies. Agreeableness is negatively correlated with negative emotions (anger), swear words and words which focus on the present and future. Conscientiousness is positively correlated with a positive tone and negatively correlated with negations and negative emotions (anger). Openness to experience is positively correlated with positive emotions and social words.

6.1.3.2 PAN2016

Results of the analysis of age and gender on the PAN2016 dataset are reported in Table A-4. Only statistically significant results (p < 0.05) corresponding to different psycholinguistic categories are reported.

Figure 6-3. Correlation of personality traits with LIWC categories

For gender analysis, women tend to use significantly more personal pronouns, positive emotions, social words (family, friends), biological processes, words related to home and affiliation, and informal speech (netspeak, assent). Men tend to use more long words, articles, negative emotions (anger), quantifiers, cognitive processes (causation, tentative, differentiation, exclusions), work and risk-related words, and swear words. For age analysis, the use of first-person singular pronouns, negative emotions, words focusing on the past, present or future, and non-fluencies decreases with age. The use of second-person pronouns, achievement words and netspeak increases with age. Overall, the style appears more analytic and confident as age increases while authenticity decreases with age.

6.1.4 Discussion

An analysis of the role of psycholinguistic categories for different social groups on the PAN2015 and PAN2016 datasets reveals the following inferences:

• Women tend to use more function words, pronouns, positive emotions, social words (family, home) and affiliation words while men tend to use more long words, articles and informal speech (netspeak and swear).

• The usage of first-person singular pronouns, negative emotions, and informal speech decreases with age while cognitive complexity (articles, prepositions, long words) increases with age.

• Extroverted people use more social words and informal speech while introverted people exhibit higher cognitive complexity and drives.

• Emotionally stable people frequently use cognitive complexity (articles, long words) while neurotic people frequently use first-person singular pronouns, negative emotions and words focused on the present.

• Agreeable people frequently use second-person pronouns, leisure words and informal speech while people who do not get along with others frequently use first-person singular pronouns, negative emotions, discrepancies and risk words.

• Conscientious people use more cognitive complexity (long words, prepositions), achievement and work-related words while those with lower conscientiousness use more negations, first-person singular pronouns, interrogatives, and negative emotions.

• People with openness use more pronouns, positive emotions, family, leisure and reward words while people with less openness use more articles, differentiation, and work-related words.

In the following section, we will use both engineered and learned features to estimate gender, age and personality traits. We will analyze the cues considered salient by the machine learning model and validate them against the above inferences.

6.2 Profiling author social groups

In this section, we describe approaches to estimate a person’s social groups like age, gender and personality traits. In the literature [118–120], author profiling has been performed using bag-of-words feature representations of linguistic cues with a classifier or regressor. In this chapter, we use a self-attentive sentence embedding [121] based on a bidirectional LSTM. The LSTM layer captures the sequential structure of the text while the attention layer captures feature saliency and might provide insights into the cues that are useful for estimating different social attributes.

6.2.1 Methodology

To estimate author attributes, we use the self-attentive sentence embedding approach [121] to encode a text sample. The architecture of the self-attentive sentence embedding approach is shown in Figure 6-4. Word embeddings of the n words in the sample are fed to a bidirectional LSTM with one or more layers. Let H be the n × 2u matrix of hidden layer outputs of the final LSTM layer, where u is the hidden layer size.

The hidden layer outputs are then fed to an attention layer comprising two weight matrices: W_{s1} of size d_a × 2u and W_{s2} of size r × d_a. The attention weights A are computed as

A = softmax(W_{s2} tanh(W_{s1} H^T))    (6-1)

The final embedding is given by M = AH, which is then fed to a final classification or regression layer. We use a softmax layer for gender and age estimation and a regression layer for personality traits.

Figure 6-4. Self-attentive sentence embedding architecture

The model is trained to minimize classification or regression loss for age/gender and personality traits respectively. Additionally, a penalization term is minimized for the attention layer. The penalization term ensures minimum overlap between each of the r hops of the attention layer and is given by

P = ||AA^T − I||_F^2    (6-2)

Hence, the r × n attention weight matrix A captures r different saliency weightings across the sample. Analyzing the salient features learned by this model and by other approaches in the literature will help us understand the different linguistic aspects that are useful for determining an author’s attributes.
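A minimal PyTorch sketch of this architecture is given below. It follows Equations 6-1 and 6-2 and the layer sizes reported in Section 6.2.2, but the class and variable names, and the omitted training loop, are illustrative assumptions rather than the exact implementation used here.

```python
# Sketch of the self-attentive sentence embedding (Lin et al. [121]);
# sizes follow Section 6.2.2, names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentiveEmbedding(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, hidden=300, d_a=350, r=30, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)  # initialized with GloVe in practice
        self.lstm = nn.LSTM(emb_dim, hidden, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.W_s1 = nn.Linear(2 * hidden, d_a, bias=False)  # W_s1: d_a x 2u
        self.W_s2 = nn.Linear(d_a, r, bias=False)           # W_s2: r x d_a
        self.fc = nn.Linear(r * 2 * hidden, 2000)
        self.out = nn.Linear(2000, n_classes)

    def forward(self, tokens):
        H, _ = self.lstm(self.embed(tokens))  # H: (batch, n, 2u)
        # Eq. 6-1: A = softmax(W_s2 tanh(W_s1 H^T)); A: (batch, r, n)
        A = F.softmax(self.W_s2(torch.tanh(self.W_s1(H))).transpose(1, 2), dim=2)
        M = A @ H  # final embedding M = AH: (batch, r, 2u)
        logits = self.out(torch.relu(self.fc(M.flatten(1))))
        # Eq. 6-2: penalization P = ||AA^T - I||_F^2 keeps the r hops diverse
        I = torch.eye(A.size(1), device=A.device)
        P = ((A @ A.transpose(1, 2) - I) ** 2).sum(dim=(1, 2)).mean()
        return logits, P
```

The total training objective is then the task loss plus a weighted copy of P, e.g. loss = criterion(logits, labels) + coef * P.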

6.2.2 Results

We used the PAN2015 and PAN2016 datasets described in Section 6.1.2 for author profiling. The word embedding layer of the self-attentive sentence embedding architecture is initialized with pretrained 100-dim GloVe vectors [85]. We use a 2-layer bidirectional LSTM with a hidden layer size of 300. For the attention layer, we use d_a = 350 and r = 30. The final fully connected layer has 2000 hidden units. The network is trained using Adam optimization with a learning rate of 0.001 for a maximum of 100 epochs. A softmax layer is used for gender and age classification. For personality traits, we use regression. Since the age classes are imbalanced, the classification loss is weighted such that larger classes have less importance and vice versa. Performance is recorded using classification accuracy for gender (balanced classes) and F-score for age (imbalanced classes). The regression performance for personality traits is recorded using MSE (mean squared error).

For baseline comparison, we use the best-performing approaches from the PAN2015 and PAN2016 author profiling tasks with available source code to allow a fair comparison, especially for the imbalanced age classes. For the PAN2015 dataset, we use the approach of Grivas et al. [118] (https://github.com/pan-webis-de/pangram), who used both stylometric and structural features for gender, age, and personality prediction. The stylometric features include TF-IDF representations of character trigrams, the frequency of words of different lengths and the frequency of capitalized words. The structural features include Twitter-dependent features like the frequency of @mentions and URLs. The features are then used with an SVM classifier for age and gender prediction and with Support Vector Regression (SVR) for personality prediction. For PAN2016, we use the approaches of Modaresi et al. [119] (https://github.com/pan-webis-de/magic) and Busger et al. [120] (https://github.com/sixhobbits/ngram). Modaresi et al. [119] used a combination of word unigrams, word bigrams, character 4-grams, average spelling error, and punctuation features. These features are used with logistic regression for cross-genre age and gender prediction.

Busger et al. [120] used n-grams of characters, words and POS categories, the frequency of capitalized words and punctuation as features. The second-order representations of these features are used with an SVM classifier for age and gender classification.

The author profiling results for the PAN2015 and PAN2016 datasets are reported in Tables 6-4 and 6-5 respectively. For the PAN2015 dataset, it can be observed from Table 6-4 that the self-attentive approach performs better than that of Grivas et al. [118] for age and personality traits, specifically stability. For gender classification, simple bag-of-words representations outperform the self-attentive embedding approach. For the PAN2016 dataset, it can be observed from Table 6-5 that the self-attentive approach performs better than the other approaches [119, 120] for both age and gender classification. For both datasets, it can be observed that the self-attentive approach handles the imbalance in the age classes better than the baseline approaches.
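The class weighting mentioned above can be realized, for instance, by passing inverse-frequency weights to the classification loss; the weighting scheme in the sketch below is an assumption consistent with the description, not the verified original code.

```python
# Inverse-frequency class weights so that larger age classes contribute
# less to the loss; the weighting formula is illustrative.
import torch
import torch.nn as nn

class_counts = torch.tensor([58., 60., 22., 12.])  # PAN2015 train: 18-24, 25-34, 35-49, 50+
weights = class_counts.sum() / (len(class_counts) * class_counts)
criterion = nn.CrossEntropyLoss(weight=weights)

# During training (model as sketched in Section 6.2.1):
#   logits, P = model(batch_tokens)
#   loss = criterion(logits, batch_labels) + penalty_coef * P
```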

Table 6-4. PAN2015 author profiling results (Age: F-score(%); Gender: Acc(%); Personality: RMSE, mean and per trait)
| Method | Age | Gender | RMSE (mean) | E | S | A | C | O |
| Grivas et al. [118] | 56.31 | 83.09 | 0.1751 | 0.1598 | 0.2384 | 0.1570 | 0.1515 | 0.1690 |
| Self-attentive | 62.11 | 80.09 | 0.1625 | 0.1537 | 0.2046 | 0.1481 | 0.1459 | 0.1527 |

Table 6-5. PAN2016 author profiling results
| Method | Age F-score(%) | Gender accuracy(%) |
| Modaresi et al. [119] | 19.38 | 69.23 |
| Busger et al. [120] | 33.54 | 61.54 |
| Self-attentive embedding | 34.55 | 73.08 |

6.2.3 Discussion

From the above results, we see that gender prediction is better than age prediction for both datasets using all approaches. However, both gender and age prediction on PAN2016 are poor compared to those on PAN2015, indicating the difficulty of cross-genre author profiling, even though PAN2016 contains more samples than PAN2015. To understand the salient cues picked up by the self-attentive embedding approach, we use the attention weights to create a heat map of the text such that words with higher weights appear dark red and those with lower weights appear white.

Words with higher weights represent salient cues used by the model for predicting the various social groups. Figure 6-5 shows snippets of text for which the gender was predicted correctly in the PAN2015 dataset. Some of the salient cues for text written by men and women are highlighted. It can be seen that the self-attentive embedding approach picks up positive emotions and function words as salient for women, and articles, prepositions, numbers and informal words as salient for men. This conforms with the inferences made in Section 6.1.4. Figures 6-6 and 6-7 show snippets of text for which the age group was predicted correctly in the PAN2015 dataset. Some of the salient cues for text written by people of the four age groups (18-24, 25-34, 35-49, 50+) are highlighted. For ages 18-24, the self-attentive embedding approach picks up first-person singular pronouns, negative emotions, swear words and negations, while for ages 50+ it picks up positive emotions, cognitive processes, and future focus. Cues for ages 25-34 include leisure-related words, netspeak, and perceptual processes, while those for ages 35-49 include work and achievement words and cognitive processes. This observation conforms with the inferences made in Section 6.1.4 that the use of first-person singular pronouns and negative emotions decreases with age while cognitive complexity increases with age.
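Rendering such heat maps is straightforward once per-word attention weights are extracted; below is a small sketch that emits HTML with red shading proportional to weight (the function and variable names are illustrative, not from the original code).

```python
# Render tokens as an HTML heat map: darker red = higher attention weight.
def attention_heatmap(tokens, weights):
    w_max = max(weights) or 1.0  # avoid division by zero for all-zero weights
    spans = []
    for tok, w in zip(tokens, weights):
        alpha = w / w_max  # normalize weight to [0, 1] for the red channel
        spans.append(f'<span style="background: rgba(255,0,0,{alpha:.2f})">{tok}</span>')
    return " ".join(spans)

html = attention_heatmap(["when", "you", "arrive"], [0.7, 0.9, 0.1])
```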

6.3 Summary

This chapter studied the role of psychological and social aspects in writing style. Sociolinguistic analysis indicates that language is used differently by different social groups and can be used to identify them. Hence, we analyzed the differences in language use across genders, age groups and personality traits using statistical analysis of two datasets (PAN2015 and PAN2016). Various psycholinguistic categories were analyzed, and it was observed that these social groups do indeed vary in their usage of these categories. We further used both engineered and learned features to predict gender, age and personality traits. A self-attentive embedding approach was used to predict the various social groups.

Figure 6-5. Some salient cues for gender prediction on PAN2015. A) Women use more positive emotions and function words. B) Men use more articles, prepositions, numbers and informal words.

Figure 6-6. Salient cues for age prediction (18-34) on PAN2015. A) Ages 18-24 use more first-person singular pronouns, function words, negation, and negative emotions/swear words. B) Ages 25-34 use more words related to leisure, perception (see) and netspeak.

Figure 6-7. Salient cues for age prediction (35+) on PAN2015. A) Ages 35-49 use more words related to work, achievement and cognitive processes. B) Ages 50+ use more words related to positive emotions, cognitive processes and future focus.

The attention weights learned by the model were used to determine saliency for gender and age prediction. It was observed that the cues picked up as salient for both gender and age prediction conformed with the results of the statistical analysis, thus validating the role of different psycholinguistic categories in estimating age and gender.

CHAPTER 7
SUMMARY

This dissertation explored the potential for stylometry to be used as a cognitive biometric trait. We started with the following thesis statement: A person’s linguistic usage exhibits distinctive behavioral traits that allow linguistic style to be used as a cognitive biometric trait. To validate this statement, we investigated different components that are representative of style in a large-scale cross-domain setting. We also evaluated whether stylometry exhibits essential biometric characteristics, followed by author profiling using linguistic cues. Our research insights and conclusions are summarized in the next section.

7.1 Insights and Conclusions

We observed the following insights when attempting to answer our research questions listed in Chapter 1. They are described in the following sections:

7.1.1 Style Representations

• Do structural representations help with stylometric attribution in cross-domain settings? Shallow and deep syntactic models are not sufficient by themselves to effectively distinguish between authors in both single-domain and cross-domain settings. This could be due to data sparsity with shorter texts or to the inherently generic nature of syntactic cues, since most people might follow similar grammar rules. However, the addition of lexical information boosts performance drastically, suggesting that lexical information provides most of the discriminative power.

• What are the effects of masking words corresponding to different lexical POS on cross-domain attribution? It was observed that masking all lexical POS drastically hampers authorship attribution performance for both single-domain and cross-domain datasets. This observation suggests that at least some of the words corresponding to lexical POS contribute to stylometry besides grammatical function words.

• Does masking topic words corresponding to various lexical POS help with stylometric attribution in cross-domain settings? Masking topic words corresponding to lexical POS does improve attribution compared to completely masking them. This effect is pronounced for cross-topic attribution. This observation further validates the conclusion from the question above that some commonly used lexical POS also help with attribution.

7.1.2 Stylometry as a Biometric Trait

• Is linguistic style sufficiently unique to be considered a biometric trait? Using the biometric menagerie, it was observed that not all users exhibited unique linguistic traits. Some users exhibited very generic traits, allowing them to be easily impersonated or to impersonate others.

• Does linguistic style remain permanent over long periods of time? When evaluated for 24 months, it was observed that very few users showed stylistic consistency over 24 months. The rest exhibited consistent style for up to 6 months and drifted beyond that.

• Can stylometry be used with other biometric traits for authentication? Stylometry was used in conjunction with keystroke dynamics, and it was observed that multimodal authentication did outperform standalone stylometric authentication. Score-level fusion of stylometric and keystroke-based authentication improved authentication performance and was limited only by individuals with very few samples.

7.1.3 Soft-biometric Classification or Author Profiling

• What are the linguistic aspects that help profile an author’s attributes like age, gender and personality? It was observed that various psycholinguistic word categories were correlated with author attributes, and feature analysis of the author profiling model reveals that some of these psycholinguistic words are chosen as salient features in profiling various attributes of a person.

To summarize, large-scale biometric identification using stylometry may be challenging with limited data like online communications. However, biometric verification and soft-biometric classification are feasible using stylometry as a standalone biometric trait. Being a behavioral trait, one might benefit from deploying stylometry in a multimodal authentication setting with complementary traits like keystroke dynamics.

7.2 Limitations

Throughout this dissertation, we observed various limitations, such as:

• Representation - Linguistic style is an abstract concept, and we hypothesized various engineered and learned representations for it. However, we observed that only character-based representations were discriminative or unique. This makes one wonder whether the uniqueness in the writing styles of individuals is limited to morphological affixes, stop words and punctuation usage, and does not extend to a higher level spanning multiple sentences or paragraphs.

We are yet to answer whether higher-level representations are inherently non-unique or whether they merely suffer from sparsity. Also, a single representation might not be effective for all authors, or even for all genres written by the same author. In such scenarios, representing linguistic style becomes more complicated.

• Scalability - Irrespective of the feature representation used, we observed that identification accuracy decreased with an increasing number of authors, which suggests that a large fraction of users exhibit a very generic linguistic style in terms of their vocabulary or syntactic structure. It is then possible that multiple people exhibit similar linguistic styles, and it would be effective to understand the linguistic styles exhibited by these groups of people.

• Large-scale data - Ideally, we can test our hypothesis reliably if we have volumes of data written by the same person across various genres and topics for many people. However, we do not have access to that amount of data for every individual which seriously hampers stylometric research since we know that stylometric attribution is poor with short texts or a small number of samples per person.

• Computational representations of language are, at this point, still primitive, and ongoing research in natural language understanding might prove useful for encoding linguistic style in the coming years.

7.3 Future work

There is scope for much work to extend this dissertation. Given the limited amount of data we might encounter in cybersecurity scenarios and the decrease in identification accuracy with an increasing number of authors, two avenues of research warrant further pursuit. They are:

• Generative approaches - Understanding the text generation process for each author might allow us to identify how the text generation process of various authors differs. Generative approaches might also be useful for data augmentation in scenarios with limited data and for providing adversarial examples to study privacy breaches.

• Style profiles - In our experiments, we observed that a large fraction of individuals exhibited very generic features that were not unique. This observation raises the question of whether one can model the linguistic style of groups of people even if it is not unique to each person. It is possible that each person might exhibit more than one style and that many people may exhibit similar styles; creating a style profile for an author or a group of authors might help to define the linguistic style of individuals or groups.

APPENDIX
AUTHOR PROFILING ANALYSIS

Table A-1. PAN2015 - LIWC analysis - only significant results (p < 0.05) are reported. Gender: M-Male, F-Female; Personality: E-Extroversion, S-Stable, A-Agreeableness, C-Conscientiousness, O-Openness; T-True, F-False

| Category | Gender | Age | E | S | A | C | O |
Summary
| Analytic | M>F | 18-24 < rest | - | T>F | T>F | T>F | F>T |
| Clout | - | 18-24 < rest | - | T>F | T>F | T>F | - |
| Authentic | F>M | decreases | - | F>T | F>T | - | T>F |
| Tone | F>M | increases | - | - | - | T>F | - |
| Sixltr | M>F | increases | - | T>F | T>F | T>F | F>T |
| Word count | - | increases | F>T | - | - | T>F | - |
Function words
| Function words | F>M | 18-24 > rest | - | F>T | F>T | - | T>F |
| Article | M>F | 18-24 < rest | - | T>F | - | - | F>T |
| Preposition | - | 18-24 < rest | - | - | - | T>F | - |
| Aux. verbs | F>M | 18-24 > rest | - | F>T | F>T | F>T | - |
| Adverbs | F>M | 18-24 > rest | - | F>T | F>T | F>T | - |
| Conjunction | - | 18-24 > rest | - | F>T | F>T | - | - |
| Negation | - | 18-24 > rest | - | F>T | F>T | F>T | - |
Pronouns
| Pronouns | F>M | 18-24 > rest | - | F>T | F>T | - | T>F |
| Personal pronouns | F>M | decreases | - | - | - | - | T>F |
| I | F>M | 18-24 > rest | - | F>T | F>T | F>T | T>F |
| we | F>M | - | - | - | - | - | T>F |
| you | - | 18-24 < rest | - | T>F | T>F | T>F | - |
| shehe | - | 25-49 < rest | - | F>T | F>T | - | T>F |
| they | - | 25-49 < rest | - | - | - | - | - |
| Impersonal pronouns | - | 18-24 > rest | - | F>T | F>T | F>T | T>F |
Other grammar
| Verbs | F>M | 18-24 > rest | - | F>T | F>T | F>T | T>F |
| Adjectives | F>M | - | F>T | F>T | - | - | - |
| Comparisons | - | - | F>T | - | - | - | - |
| Interrogatives | - | 25-34 < rest | - | - | - | F>T | T>F |
| Numbers | M>F | - | - | - | - | - | - |

Table A-2. PAN2015 - LIWC analysis (cont.) - only significant results (p < 0.05) are reported. Gender: M-Male, F-Female; Personality: E-Extroversion, S-Stable, A-Agreeableness, C-Conscientiousness, O-Openness; T-True, F-False

| Category | Gender | Age | E | S | A | C | O |
Affect
| Affect | F>M | 18-24 > rest | - | F>T | F>T | F>T | T>F |
| Positive emotion | F>M | 25-34 < rest | - | F>T | - | - | T>F |
| Negative emotion | - | 18-24 > rest | - | F>T | F>T | F>T | - |
| Anxiety | F>M | 18-24 > rest | - | F>T | F>T | - | - |
| Anger | - | 18-24 > rest | - | - | F>T | F>T | T>F |
| Sad | F>M | 18-24 > rest | - | F>T | - | - | - |
Social
| Social | F>M | 25-34 < rest | - | F>T | - | - | - |
| Family | F>M | 25-34 < rest | - | - | - | F>T | T>F |
| Friend | - | 18-24 > rest | T>F | - | - | - | - |
| Female | - | 18-24 > rest | - | - | F>T | - | T>F |
| Male | - | 25-49 < rest | - | F>T | F>T | - | T>F |
Cognitive processes
| Cognitive process | F>M | 18-24 > rest | - | F>T | - | - | - |
| Insight | - | 35-49 > rest | - | - | F>T | - | - |
| Causation | - | 35-49 > rest | - | - | - | - | - |
| Discrepancy | F>M | 18-24 > rest | - | F>T | F>T | - | - |
| Tentative | - | 25-49 < rest | - | - | F>T | - | - |
| Certainty | F>M | 18-24 > rest | - | - | - | - | - |
| Differentiation | - | 18-24 > rest | - | - | F>T | - | F>T |
Perceptual processes
| Perceptual process | F>M | 35+ < rest | - | - | - | - | T>F |
| See | - | 25-34 > rest | - | T>F | - | - | T>F |
| Hear | F>M | - | - | - | - | - | - |
| Feel | F>M | 18-24 > rest | - | - | F>T | - | - |
Biological processes
| Biological process | F>M | 18-24 > rest | - | - | F>T | - | - |
| Body | - | 18-24 > rest | - | - | F>T | F>T | - |
| Health | F>M | 35-49 > rest | T>F | - | - | - | T>F |
| Sexual | - | 18-24 > rest | - | - | - | - | T>F |
| Ingestion | F>M | - | - | - | - | - | - |

Table A-3. PAN2015 - LIWC analysis (cont.) - only significant results (p < 0.05) are reported. Gender: M-Male, F-Female; Personality: E-Extroversion, S-Stable, A-Agreeableness, C-Conscientiousness, O-Openness; T-True, F-False

| Category | Gender | Age | E | S | A | C | O |
Drives
| Drives | - | 25-34 < rest | F>T | - | - | - | - |
| Affiliation | F>M | - | - | F>T | - | - | T>F |
| Achievement | - | 35-49 > rest | F>T | - | - | T>F | - |
| Power | - | 18-34 < rest | F>T | - | - | - | - |
| Reward | - | 25-34 < rest | F>T | - | - | - | T>F |
| Risk | - | - | - | - | F>T | - | - |
Time orientation
| FocusPast | - | 18-24 > rest | - | - | - | - | T>F |
| FocusPresent | F>M | 18-24 > rest | - | F>T | F>T | F>T | T>F |
| FocusFuture | F>M | 18-24 > rest | - | - | F>T | F>T | - |
Relativity
| Relativity | F>M | - | - | - | - | T>F | - |
| Motion | - | 18-24 > rest | - | - | - | - | T>F |
| Time | F>M | 18-24 > rest | - | - | - | - | - |
| Space | - | 25-49 > rest | - | - | - | T>F | - |
Personal concerns
| Work | - | 35-49 > rest | F>T | T>F | - | T>F | F>T |
| Leisure | - | 25-49 > rest | - | - | T>F | - | T>F |
| Home | F>M | - | - | - | - | - | - |
| Money | - | 18-24 < rest | F>T | - | - | - | - |
| Death | - | - | - | - | - | F>T | - |
Informal language
| Informal | M>F | 50+ < rest | T>F | - | T>F | - | - |
| Swear | - | 18-24 > rest | - | - | - | F>T | T>F |
| Netspeak | M>F | 25-49 > rest | T>F | T>F | T>F | T>F | - |
| Assent | - | 25-34 < rest | T>F | - | - | - | - |
| Non-fluencies | - | - | - | - | - | - | T>F |
| Filler words | - | 18-24 > rest | - | - | - | - | T>F |

Table A-4. PAN2016 - LIWC analysis - only significant results (p < 0.05) are reported. Gender: M-Male, F-Female; Age: increases/decreases - usage increases/decreases with age

| Category | Gender | Age |
Summary
| Analytic | - | increases |
| Clout | F>M | increases |
| Authentic | - | decreases |
| Tone | F>M | - |
| Sixltr | M>F | increases |
| Word Count | M>F | - |
Function words
| Articles | M>F | - |
| Aux. verbs | M>F | decreases |
| Adverbs | M>F | decreases |
Pronouns
| Pronouns | F>M | - |
| Personal Pronouns | F>M | - |
| I | - | decreases |
| you | F>M | increases |
| Impersonal pronouns | - | decreases |
Other grammar
| Verbs | - | decreases |
| Quantifiers | M>F | decreases |
Affect
| Affect | F>M | decreases |
| Positive emotion | F>M | 18-34 > rest |
| Negative emotion | M>F | decreases |
| Anger | M>F | 65+ > rest |
Cognitive processes
| Cognitive process | M>F | - |
| Causation | M>F | - |
| Tentative | M>F | decreases |
| Certainty | - | decreases |
| Differentiation | M>F | decreases |
| Exclusives | M>F | decreases |
Social
| Social | F>M | - |
| Family | F>M | - |
| Female | F>M | - |
Perceptual processes
| Perceptual process | - | 35-64 < rest |
| Feel | - | decreases |
Biological processes
| Biological process | F>M | 35-64 < rest |
| Body | - | decreases |
Drives
| Affiliation | F>M | - |
| Achievement | - | increases |
| Power | - | 18-34 < rest |
| Risk | M>F | - |
Time orientation
| FocusPast | M>F | decreases |
| FocusPresent | - | decreases |
| FocusFuture | - | decreases |
Relativity
| Time | - | 35-64 < rest |
Personal concerns
| Work | M>F | 18-34 < rest |
| Leisure | - | 65+ > rest |
| Home | F>M | - |
Informal language
| Informal | - | 18-24 < rest |
| Swear | M>F | 18-34 > rest |
| Netspeak | F>M | increases |
| Assent | F>M | - |
| Non-fluencies | - | decreases |

REFERENCES

[1] W. Daelemans, “Explanation in computational stylometry,” in International Conference on Intelligent Text Processing and Computational Linguistics. Springer, 2013, pp. 451–462.
[2] K. Luyckx, Scalability issues in authorship attribution. ASP/VUBPRESS/UPA, 2011.
[3] A. K. Jain, P. Flynn, and A. A. Ross, Handbook of biometrics. Springer Science & Business Media, 2007.
[4] G. Doddington, W. Liggett, A. Martin, M. Przybocki, and D. Reynolds, “Sheep, goats, lambs and wolves: A statistical analysis of speaker performance in the NIST 1998 speaker recognition evaluation,” National Inst of Standards and Technology Gaithersburg MD, Tech. Rep., 1998.
[5] D. I. Holmes, “The evolution of stylometry in humanities scholarship,” Literary and Linguistic Computing, vol. 13, no. 3, pp. 111–117, 1998.
[6] G. K. Zipf, “The psychology of language,” NY Houghton-Mifflin, 1935.
[7] G. U. Yule, “On sentence-length as a statistical characteristic of style in prose: With application to two cases of disputed authorship,” Biometrika, vol. 30, no. 3/4, pp. 363–390, 1939.
[8] F. Mosteller and D. L. Wallace, “Notes on an authorship problem,” in Proceedings of a Harvard Symposium on Digital Computers and Their Applications, 1962, pp. 163–197.
[9] ——, “Inference in an authorship problem: A comparative study of discrimination methods applied to the authorship of the disputed federalist papers,” Journal of the American Statistical Association, vol. 58, no. 302, pp. 275–309, 1963.
[10] F. Mosteller and D. Wallace, “Inference and disputed authorship: The federalist,” 1964.
[11] P. Juola et al., “Authorship attribution,” Foundations and Trends® in Information Retrieval, vol. 1, no. 3, pp. 233–334, 2008.
[12] E. Stamatatos, “A survey of modern authorship attribution methods,” Journal of the Association for Information Science and Technology, vol. 60, no. 3, pp. 538–556, 2009.
[13] T. Neal, K. Sundararajan, A. Fatima, Y. Yan, Y. Xiang, and D. Woodard, “Surveying stylometry techniques and applications,” ACM Comput. Surv., vol. 50, no. 6, pp. 86:1–86:36, Nov. 2017.

[14] M. L. Jockers and D. M. Witten, “A comparative study of machine learning methods for authorship attribution,” Literary and Linguistic Computing, 2010.
[15] M. Koppel, J. Schler, S. Argamon, and Y. Winter, “The “fundamental problem” of authorship attribution,” English Studies, vol. 93, no. 3, pp. 284–291, 2012.
[16] M. Koppel and J. Schler, “Authorship verification as a one-class classification problem,” in Proceedings of the Twenty-First International Conference on Machine Learning. ACM, 2004, p. 62.
[17] M. Koppel and Y. Winter, “Determining if two documents are written by the same author,” Journal of the Association for Information Science and Technology, vol. 65, no. 1, pp. 178–187, 2014.
[18] O. Halvani, C. Winter, and A. Pflug, “Authorship verification for different languages, genres and topics,” Digital Investigation, vol. 16, pp. S33–S43, 2016.
[19] J. Schler, M. Koppel, S. Argamon, and J. W. Pennebaker, “Effects of age and gender on blogging,” in AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs, vol. 6, 2006, pp. 199–205.
[20] S. Argamon, M. Koppel, J. W. Pennebaker, and J. Schler, “Automatically profiling the author of an anonymous text,” Communications of the ACM, vol. 52, no. 2, pp. 119–123, 2009.
[21] B. Verhoeven and W. Daelemans, “CLiPS stylometry investigation (CSI) corpus: A Dutch corpus for the detection of age, gender, personality, sentiment and deception in text,” in LREC 2014 - Ninth International Conference on Language Resources and Evaluation, 2014, pp. 3081–3085.
[22] T. R. Reddy, B. V. Vardhan, and P. V. Reddy, “A survey on authorship profiling techniques,” International Journal of Applied Engineering Research, vol. 11, no. 5, pp. 3092–3102, 2016.
[23] O. De Vel, A. Anderson, M. Corney, and G. Mohay, “Mining e-mail content for author identification forensics,” ACM Sigmod Record, vol. 30, no. 4, pp. 55–64, 2001.
[24] R. Zheng, J. Li, H. Chen, and Z. Huang, “A framework for authorship identification of online messages: Writing-style features and classification techniques,” Journal of the Association for Information Science and Technology, vol. 57, no. 3, pp. 378–393, 2006.
[25] V. Kešelj, F. Peng, N. Cercone, and C. Thomas, “N-gram-based author profiles for authorship attribution,” in Proceedings of the Conference Pacific Association for Computational Linguistics, PACLING, vol. 3, 2003, pp. 255–264.
[26] E. Stamatatos, “Author identification using imbalanced and limited training texts,” in Database and Expert Systems Applications, 2007. DEXA’07. 18th International Workshop on. IEEE, 2007, pp. 237–241.

[27] M. Koppel, J. Schler, and S. Argamon, “Authorship attribution in the wild,” Language Resources and Evaluation, vol. 45, no. 1, pp. 83–94, 2011.
[28] H. J. Escalante, T. Solorio, and M. Montes-y Gómez, “Local histograms of character n-grams for authorship attribution,” in Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1. Association for Computational Linguistics, 2011, pp. 288–298.
[29] M. Kestemont, “Function words in authorship attribution. From black magic to theory?” in Proceedings of the 3rd Workshop on Computational Linguistics for Literature (CLFL), 2014, pp. 59–66.
[30] T. C. Mendenhall, “The characteristic curves of composition,” Science, vol. 9, no. 214, pp. 237–249, 1887.
[31] A. Honoré, “Some simple measures of richness of vocabulary,” Association for Literary and Linguistic Computing Bulletin, vol. 7, no. 2, pp. 172–177, 1979.
[32] J. F. Burrows, “Word-patterns and story-shapes: The statistical analysis of narrative style,” Literary & Linguistic Computing, vol. 2, no. 2, pp. 61–70, 1987.
[33] S. Argamon and S. Levitan, “Measuring the usefulness of function words for authorship attribution,” in Proceedings of the 2005 ACH/ALLC Conference, 2005.
[34] F. Peng, D. Schuurmans, and S. Wang, “Augmenting naive Bayes classifiers with statistical language models,” Information Retrieval, vol. 7, no. 3-4, pp. 317–345, 2004.
[35] R. M. Coyotl-Morales, L. Villaseñor-Pineda, M. Montes-y Gómez, and P. Rosso, “Authorship attribution using word sequences,” in Iberoamerican Congress on Pattern Recognition. Springer, 2006, pp. 844–853.
[36] C. Sanderson and S. Guenter, “Short text authorship attribution via sequence kernels, Markov chains and author unmasking: An investigation,” in Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2006, pp. 482–491.
[37] H. Baayen, H. Van Halteren, and F. Tweedie, “Outside the cave of shadows: Using syntactic annotation to enhance authorship attribution,” Literary and Linguistic Computing, vol. 11, no. 3, pp. 121–132, 1996.
[38] E. Stamatatos, N. Fakotakis, and G. Kokkinakis, “Computer-based authorship attribution without lexical measures,” Computers and the Humanities, vol. 35, no. 2, pp. 193–214, 2001.
[39] S. Raghavan, A. Kovashka, and R. Mooney, “Authorship attribution using probabilistic context-free grammars,” in Proceedings of the ACL 2010 Conference Short Papers. Association for Computational Linguistics, 2010, pp. 38–42.

[40] S. Argamon, M. Koppel, and G. Avneri, “Routing documents according to style,” in First International Workshop on Innovative Information Systems, 1998, pp. 85–92.
[41] J. Diederich, J. Kindermann, E. Leopold, and G. Paass, “Authorship attribution with support vector machines,” Applied Intelligence, vol. 19, no. 1-2, pp. 109–123, 2003.
[42] M. Gamon, “Linguistic correlates of style: authorship classification with deep linguistic analysis features,” in Proceedings of the 20th International Conference on Computational Linguistics. Association for Computational Linguistics, 2004, p. 611.
[43] G. Sidorov, F. Velasquez, E. Stamatatos, A. Gelbukh, and L. Chanona-Hernández, “Syntactic dependency-based n-grams as classification features,” in Mexican International Conference on Artificial Intelligence. Springer, 2012, pp. 1–11.
[44] P. M. McCarthy, G. A. Lewis, D. F. Dufty, and D. S. McNamara, “Analyzing writing styles with Coh-Metrix,” in FLAIRS Conference, 2006, pp. 764–769.
[45] G. A. Miller, “WordNet: a lexical database for English,” Communications of the ACM, vol. 38, no. 11, pp. 39–41, 1995.
[46] S. Argamon, C. Whitelaw, P. Chase, S. R. Hota, N. Garg, and S. Levitan, “Stylistic text classification using functional lexical features,” Journal of the Association for Information Science and Technology, vol. 58, no. 6, pp. 802–822, 2007.
[47] J. Burrows, “‘Delta’: a measure of stylistic difference and a guide to likely authorship,” Literary and Linguistic Computing, vol. 17, no. 3, pp. 267–287, 2002.
[48] J. Savoy, “Comparative evaluation of term selection functions for authorship attribution,” Digital Scholarship in the Humanities, vol. 30, no. 2, pp. 246–261, 2013.
[49] R. Arun, V. Suresh, and C. V. Madhavan, “Stopword graphs and authorship attribution in text corpora,” in Semantic Computing, 2009. ICSC’09. IEEE International Conference on. IEEE, 2009, pp. 192–196.
[50] Y. Zhao and J. Zobel, “Effective and scalable authorship attribution using function words,” in Asia Information Retrieval Symposium. Springer, 2005, pp. 174–189.
[51] D. Benedetto, E. Caglioti, and V. Loreto, “Language trees and zipping,” Physical Review Letters, vol. 88, no. 4, p. 048702, 2002.
[52] W. J. Teahan and D. J. Harper, “Using compression-based language models for text categorization,” in Language Modeling for Information Retrieval. Springer, 2003, pp. 141–165.
[53] J. G. Cleary and W. J. Teahan, “Unbounded length contexts for PPM,” The Computer Journal, vol. 40, no. 2 and 3, pp. 67–75, 1997.

[54] E. Stamatatos, “Authorship attribution based on feature set subspacing ensembles,” International Journal on Artificial Intelligence Tools, vol. 15, no. 05, pp. 823–838, 2006.
[55] F. J. Tweedie, S. Singh, and D. I. Holmes, “Neural network applications in stylometry: The federalist papers,” Computers and the Humanities, vol. 30, no. 1, pp. 1–10, 1996.
[56] C. E. Chaski, “Who’s at the keyboard? Authorship attribution in digital evidence investigations,” International Journal of Digital Evidence, vol. 4, no. 1, pp. 1–13, 2005.
[57] E. Stamatatos, N. Fakotakis, and G. Kokkinakis, “Automatic text categorization in terms of genre and author,” Computational Linguistics, vol. 26, no. 4, pp. 471–495, 2000.
[58] K. Luyckx and W. Daelemans, “Shallow text analysis and machine learning for authorship attribution,” LOT Occasional Series, vol. 4, pp. 149–160, 2005.
[59] ——, “Authorship attribution and verification with many authors and limited data,” in Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1. Association for Computational Linguistics, 2008, pp. 513–520.
[60] D. Madigan, A. Genkin, D. D. Lewis, S. Argamon, D. Fradkin, and L. Ye, “Author identification on the large scale,” in Proc. of the Meeting of the Classification Society of North America, vol. 13, 2005.
[61] K. Luyckx and W. Daelemans, “The effect of author set size and data size in authorship attribution,” Literary and Linguistic Computing, vol. 26, no. 1, pp. 35–55, 2011.
[62] A. Narayanan, H. Paskov, N. Z. Gong, J. Bethencourt, E. Stefanov, E. C. R. Shin, and D. Song, “On the feasibility of internet-scale author identification,” in Security and Privacy (SP), 2012 IEEE Symposium on. IEEE, 2012, pp. 300–314.
[63] M. Koppel, J. Schler, and S. Argamon, “Computational methods in authorship attribution,” Journal of the Association for Information Science and Technology, vol. 60, no. 1, pp. 9–26, 2009.
[64] M. Potthast, S. Braun, T. Buz, F. Duffhauss, F. Friedrich, J. M. Gülzow, J. Köhler, W. Lötzsch, F. Müller, M. E. Müller et al., “Who wrote the web? Revisiting influential author identification research applicable to information retrieval,” in European Conference on Information Retrieval. Springer, 2016, pp. 393–407.
[65] D. V. Khmelev and W. J. Teahan, “A repetition based measure for verification of text collections and for text categorization,” in Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2003, pp. 104–110.

[66] M. Koppel, J. Schler, and E. Bonchek-Dokow, “Measuring differentiability: Unmasking pseudonymous authors,” Journal of Machine Learning Research, vol. 8, no. Jun, pp. 1261–1276, 2007.
[67] S. M. Van Dongen, “Graph clustering by flow simulation,” 2001.
[68] “MCL - an open source implementation,” http://micans.org/mcl/.
[69] A. Andoni, P. Indyk, T. Laarhoven, I. Razenshteyn, and L. Schmidt, “Practical and optimal LSH for angular distance,” in Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds. Curran Associates, Inc., 2015, pp. 1225–1233. [Online]. Available: http://papers.nips.cc/paper/5893-practical-and-optimal-lsh-for-angular-distance.pdf
[70] “FALCONN - fast lookups of cosine and other nearest neighbors,” https://github.com/FALCONN-LIB/FALCONN.
[71] “Biolayout express 3d,” http://www.biolayout.org/.
[72] E. Stamatatos, “Masking topic-related information to enhance authorship attribution,” Journal of the Association for Information Science and Technology, vol. 69, no. 3, pp. 461–473, 2018.
[73] Y. Seroussi, F. Bohnert, and I. Zukerman, “Personalised rating prediction for new users using latent factor models,” in Proceedings of the 22nd ACM Conference on Hypertext and Hypermedia. ACM, 2011, pp. 47–56.
[74] E. Stamatatos, “On the robustness of authorship attribution based on character n-gram features,” JL & Pol’y, vol. 21, p. 421, 2012.
[75] D. Klein and C. D. Manning, “Accurate unlexicalized parsing,” in Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, 2003.
[76] S. Fuller, P. Maguire, and P. Moser, “A deep context grammatical model for authorship attribution,” 2014.
[77] J. Björklund and N. Zechner, “Syntactic methods for topic-independent authorship attribution,” Natural Language Engineering, vol. 23, no. 5, pp. 789–806, 2017.
[78] J. Cleary and I. Witten, “Data compression using adaptive coding and partial string matching,” IEEE Transactions on Communications, vol. 32, no. 4, pp. 396–402, 1984.
[79] E. Charniak and M. Johnson, “Coarse-to-fine n-best parsing and maxent discriminative reranking,” in Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2005, pp. 173–180.
[80] U. Sapkota, S. Bethard, M. Montes, and T. Solorio, “Not all character n-grams are created equal: A study in authorship attribution,” in Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2015, pp. 93–102.
[81] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent dirichlet allocation,” Journal of Machine Learning Research, vol. 3, no. Jan, pp. 993–1022, 2003.
[82] S. Bird and E. Loper, “NLTK: the natural language toolkit,” in Proceedings of the ACL 2004 on Interactive Poster and Demonstration Sessions. Association for Computational Linguistics, 2004, p. 31.
[83] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in Neural Information Processing Systems, 2013, pp. 3111–3119.
[84] K. S. Tai, R. Socher, and C. D. Manning, “Improved semantic representations from tree-structured long short-term memory networks,” arXiv preprint arXiv:1503.00075, 2015.
[85] J. Pennington, R. Socher, and C. Manning, “GloVe: Global vectors for word representation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.
[86] Z. Yang, D. Yang, C. Dyer, X. He, A. J. Smola, and E. H. Hovy, “Hierarchical attention networks for document classification,” in HLT-NAACL, 2016, pp. 1480–1489.
[87] P. Shrestha, S. Sierra, F. Gonzalez, M. Montes, P. Rosso, and T. Solorio, “Convolutional neural networks for authorship attribution of short texts,” in Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, vol. 2, 2017, pp. 669–674.
[88] N. Yager and T. Dunstone, “The biometric menagerie,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 2, pp. 220–230, 2010.
[89] G. U. Yule, The statistical study of literary vocabulary. Cambridge University Press, 2014.
[90] H. S. Sichel, “On a distribution law for word frequencies,” Journal of the American Statistical Association, vol. 70, no. 351a, pp. 542–547, 1975.
[91] E. Brunet, Le vocabulaire de Jean Giraudoux. Structure et évolution. Slatkine, 1978, no. 1.
[92] M. Wittman, P. Davis, and P. J. Flynn, “Empirical studies of the existence of the biometric menagerie in the FRGC 2.0 color image corpus,” in Computer Vision and Pattern Recognition Workshop, 2006. CVPRW’06. Conference on. IEEE, 2006, pp. 33–33.

138 [93] J. Harvey, J. Campbell, S. Elliott, M. Brockly, and A. Adler, “Biometric permanence: Definition and robust calculation,” in Systems Conference (SysCon), 2017 Annual IEEE International. IEEE, 2017, pp. 1–7. [94] “Yelp dataset challenge,” https://www.yelp.com/dataset/challenge. [95] S. P. Banerjee and D. L. Woodard, “Biometric authentication and identification using keystroke dynamics: A survey,” Journal of Pattern Recognition Research, vol. 7, no. 1, pp. 116–139, 2012. [96] A. Alsultan and K. Warwick, “Keystroke dynamics authentication: a survey of free-text methods,” International Journal of Computer Science Issues, vol. 10, no. 4, pp. 1–10, 2013. [97] P. S. Teh, A. B. J. Teoh, and S. Yue, “A survey of keystroke dynamics biometrics,” The Scientific World Journal, vol. 2013, 2013. [98] F. Monrose, M. K. Reiter, and S. Wetzel, “Password hardening based on keystroke dynamics,” International Journal of Information Security, vol. 1, no. 2, pp. 69–83, 2002. [99] R. N. Rodrigues, G. F. Yared, C. R. d. N. Costa, J. B. Yabu-Uti, F. Violaro, and L. L. Ling, “Biometric access control through numerical keyboards based on keystroke dynamics,” in International Conference on Biometrics. Springer, 2006, pp. 640–646. [100] R. Giot, M. El-Abed, and C. Rosenberger, “Keystroke dynamics with low constraints svm based passphrase enrollment,” in Biometrics: Theory, Applications, and Systems, 2009. BTAS’09. IEEE 3rd International Conference on. IEEE, 2009, pp. 1–6. [101] J. Leggett, G. Williams, M. Usnick, and M. Longnecker, “Dynamic identity verification via keystroke characteristics,” International Journal of Man-Machine Studies, vol. 35, no. 6, pp. 859–870, 1991. [102] J. V. Monaco, N. Bakelman, S.-H. Cha, and C. C. Tappert, “Recent advances in the development of a long-text-input keystroke biometric authentication system for arbitrary text input,” in Intelligence and Security Informatics Conference (EISIC), 2013 European. IEEE, 2013, pp. 60–66. [103] C. C. Tappert, S.-H. Cha, M. Villani, and R. S. Zack, “A keystroke biometric system for long-text input,” Optimizing Information Security and Advancing Privacy Assurance: New Technologies: New Technologies, p. 32, 2012. [104] J. Ferreira and H. Santos, “Keystroke dynamics for continuous access control enforcement,” in Cyber-Enabled Distributed Computing and Knowledge Discovery (CyberC), 2012 International Conference on. IEEE, 2012, pp. 216–223. [105] A. Messerman, T. Mustafić, S. A. Camtepe, and S. Albayrak, “Continuous and non-intrusive identity verification in real-time environments based on free-text

[106] T. Shimshon, R. Moskovitch, L. Rokach, and Y. Elovici, “Continuous verification using keystroke dynamics,” in Computational Intelligence and Security (CIS), 2010 International Conference on. IEEE, 2010, pp. 411–415.
[107] J. C. Stewart, J. V. Monaco, S.-H. Cha, and C. C. Tappert, “An investigation of keystroke and stylometry traits for authenticating online test takers,” in Biometrics (IJCB), 2011 International Joint Conference on. IEEE, 2011, pp. 1–7.
[108] J. V. Monaco, J. C. Stewart, S.-H. Cha, and C. C. Tappert, “Behavioral biometric verification of student identity in online course assessment and authentication of authors in literary works,” in Biometrics: Theory, Applications and Systems (BTAS), 2013 IEEE Sixth International Conference on. IEEE, 2013, pp. 1–8.
[109] H. Locklear, S. Govindarajan, Z. Sitová, A. Goodkind, D. G. Brizan, A. Rosenberg, V. V. Phoha, P. Gasti, and K. S. Balagani, “Continuous authentication with cognition-centric text production and revision features,” in Biometrics (IJCB), 2014 IEEE International Joint Conference on. IEEE, 2014, pp. 1–8.
[110] D. Gunetti and C. Picardi, “Keystroke analysis of free text,” ACM Transactions on Information and System Security (TISSEC), vol. 8, no. 3, pp. 312–347, 2005.
[111] T. Sim and R. Janakiraman, “Are digraphs good for free-text keystroke dynamics?” in Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on. IEEE, 2007, pp. 1–6.
[112] C. C. Tappert, M. Villani, and S. Cha, “Keystroke biometric identification and authentication on long-text input,” Behavioral Biometrics for Human Identification: Intelligent Applications, pp. 342–367, 2009.
[113] J. V. Monaco, N. Bakelman, S.-H. Cha, and C. C. Tappert, “Developing a keystroke biometric system for continual authentication of computer users,” in Intelligence and Security Informatics Conference (EISIC), 2012 European. IEEE, 2012, pp. 210–216.
[114] J. W. Pennebaker, R. L. Boyd, K. Jordan, and K. Blackburn, “The development and psychometric properties of LIWC2015,” Tech. Rep., 2015.
[115] Y. R. Tausczik and J. W. Pennebaker, “The psychological meaning of words: LIWC and computerized text analysis methods,” Journal of Language and Social Psychology, vol. 29, no. 1, pp. 24–54, 2010.
[116] F. M. Rangel Pardo, F. Celli, P. Rosso, M. Potthast, B. Stein, and W. Daelemans, “Overview of the 3rd author profiling task at PAN 2015,” in CLEF 2015 Evaluation Labs and Workshop Working Notes Papers, 2015, pp. 1–8.
[117] F. Rangel, P. Rosso, B. Verhoeven, W. Daelemans, M. Potthast, and B. Stein, “Overview of the 4th author profiling task at PAN 2016: Cross-genre evaluations,” in Working Notes Papers of the CLEF 2016 Evaluation Labs. CEUR Workshop Proceedings, 2016, pp. 750–784.

[118] A. Grivas, A. Krithara, and G. Giannakopoulos, “Author profiling using stylometric and structural feature groupings,” in CLEF (Working Notes), 2015.
[119] P. Modaresi, M. Liebeck, and S. Conrad, “Exploring the effects of cross-genre machine learning for author profiling in PAN 2016,” in CLEF (Working Notes), 2016, pp. 970–977.
[120] M. B. op Vollenbroek, T. Carlotto, T. Kreutz, M. Medvedeva, C. Pool, J. Bjerva, H. Haagsma, and M. Nissim, “GronUP: Groningen user profiling,” Notebook for PAN at CLEF, 2016.
[121] Z. Lin, M. Feng, C. N. dos Santos, M. Yu, B. Xiang, B. Zhou, and Y. Bengio, “A structured self-attentive sentence embedding,” arXiv preprint arXiv:1703.03130, 2017.

BIOGRAPHICAL SKETCH
Kalaivani Sundararajan hails from Chennai, India. Her research focuses on natural language processing, computer vision, and machine learning applied to identity sciences. She received her Bachelor of Technology in instrumentation from Anna University, India, and her Master of Science in electrical engineering from Clemson University, USA. She received her Ph.D. in computer engineering from the University of Florida in December 2018.
