Learning Stylometric Representations for Authorship Analysis

0 Learning Stylometric Representations for Authorship Analysis STEVEN H. H. DING, School of Information Studies, McGill University, Canada BENJAMIN C. M. FUNG, School of Information Studies, McGill University, Canada FARKHUND IQBAL, College of Technological Innovation, Zayed University, UAE WILLIAM K. CHEUNG, Department of Computer Science, Hong Kong Baptist University, Hong Kong Authorship analysis (AA) is the study of unveiling the hidden properties of authors from a body of expo- nentially exploding textual data. It extracts an author’s identity and sociolinguistic characteristics based on the reflected writing styles in the text. It is an essential process for various areas, such as cybercrime investigation, psycholinguistics, political socialization, etc. However, most of the previous techniques criti- cally depend on the manual feature engineering process. Consequently, the choice of feature set has been shown to be scenario- or dataset-dependent. In this paper, to mimic the human sentence composition process using a neural network approach, we propose to incorporate different categories of linguistic features into distributed representation of words in order to learn simultaneously the writing style representations based on unlabeled texts for authorship analysis. In particular, the proposed models allow topical, lexical, syn- tactical, and character-level feature vectors of each document to be extracted as stylometrics. We evaluate the performance of our approach on the problems of authorship characterization and authorship verification with the Twitter, novel, and essay datasets. The experiments suggest that our proposed text representation outperforms the bag-of-lexical-n-grams, Latent Dirichlet Allocation, Latent Semantic Analysis, PVDM, PVDBOW, and word2vec representations. Categories and Subject Descriptors: K.4.1 [Computers and Society]: Public Policy Issues—Abuse and crime involving computers; I.7.5 [Document and Text Processing]: Document Capture—Document analysis; I.2.7 [Natural Language Processing]: Text analysis General Terms: Design, Algorithms, Experimentation Additional Key Words and Phrases: Authorship analysis, computational linguistics, feature learning, text mining 1. INTRODUCTION The prevalence of the computer information system, personal computational devices, and the globalizing Internet have fundamentally transformed our daily lives and re- shaped the way we generate and digest information. Countless pieces of textual snip- pets and documents are generated every millisecond: This is the era of infobesity. Au- thorship analysis (AA) is one of the critical approaches to turn the burden of a vast amount of data into practical, useful knowledge. By looking into the reflected linguistic trails, AA is a study to unveil an underlying author’s identity and sociolinguistic characteristics. The advancement of authorship analysis backed up by stylometric techniques has a fundamental impact on various areas: — Cybercrime investigation. The distributed nature of cyber space provides an ideal anonymous channel for computer-mediated malicious activities, e.g., phishing scams, spamming, ransom messages, harassment, money laundering, illegal material distri- arXiv:1606.01219v1 [cs.CL] 3 Jun 2016 bution, etc., because the network-based origins such as IP address can be easily re- pudiated. Several authorship identification techniques have been developed for the purpose of cyber investigation on SMS text-messaging slips [Ragel et al. 2014], personal e-mails [Iqbal et al. 2013; Ding et al. 2015], and social blogs [Yang and Chow 2014; Stolerman et al. 2014]. Stylometric techniques have been used as evidence in the form of expert knowledge in the courts of the UK, the US, and Australia [Juola 2006; Brennan et al. 2012]. In a well-known case in the UK, linguistic experts showed Authors’ addresses: Benjamin C. M. Fung (corresponding author), School of Information Studies, McGill University, Montreal, QC, Canada H3A 1X1; email: [email protected] 0:2 that a series of text messages sent from Danielle’s phone after she had disappeared were actually not written by her but by her uncle, Stuart Campbell [Grant 2008]. For more recent information and definitions about linguistic evidence, please refer to the home page of the Institute for Linguistic Evidence (ILE)1. — Marketing, political socialization, and social network analysis. By inferring the hidden attributes of social network users, businesses and political parties can get a deeper understanding of their audiences and the related discussions. AA techniques have been applied to understand audiences’ political preferences [Golbeck and Hansen 2014; Pennacchiotti and Popescu 2011; Golbeck and Hansen 2011], networks of political communication [Conover et al. 2011b], and political learnings of prominent tweet- ers and media [Wong et al. 2013]. Understanding the audiences on social networks helps in planning political strategy [Conover et al. 2011a] and corrects bias for pre- dicting election results [Wong et al. 2013]. From the perspective of marketing, it is profitable to identify potential customers by characterizing them and understanding their reviews [Weren et al. 2014]. Pennacchiotti and Popescu [2011] present an AA approach to distinguish whether or not a Twitter user is a fan of Starbucks, which in general can be applied to other types of businesses. Moreover, most of social network analyses are based on some attributes of the user inside the network, and AA techniques can provide more hidden labels for these analyses [Cohen and Ruths 2013]. — Literary science and education. Authorship analysis also has a significant impact in the fields related to literary science. Many AA techniques have been developed to infer the disputed authorship of historical documents such as Civil War letters [Klar- reich 2003], Shakespeare’s plays [Klarreich 2003], the Federalist Papers [Oakes 2004; Tschuggnall and Specht 2014; Nasir et al. 2014], and the classic French literary mys- tery: “Le Roman de Violette” [Boukhaled and Ganascia 2014]. AA techniques are also used to quantify the performance of literary translators, since the best translators will not have their own writing style reflected in the translated works [Almishari et al. 2014]. Moreover, AA techniques have been used to understand the personality of stu- dents [Almishari et al. 2014], first languages [Almishari et al. 2014; Torney et al. 2012], and self-reported names [Liu and Ruths 2013]. Furthermore, AA techniques are fundamental to the detection of plagiarism in academic works [Cerra et al. 2014; Hansen et al. 2014]. Studies of authorship analysis backed up by computational stylometric techniques can be dated back to the 19th century. Many customized approaches focusing on different sub-problems and scenarios have been proposed [Stamatatos 2009]. It has been a successful line of research [Brennan et al. 2012]. Research problems in authorship analysis can be broadly categorized into three types: authorship identification (i.e., identify the most plausible author given a set of candidates [Iqbal et al. 2013; Ding et al. 2015]), authorship verification (i.e., verify whether or not a given candidate is the actual author of the given text [Stamatatos et al. 2014]), and authorship characterization (i.e., infer the sociolinguistic characteristics of the author of the given text [Rangel et al. 2014]). Both the problems of authorship identification and authorship characterization can be formulated as a one-class text classification problem. For the authorship attribution problem, the classification label is the identity of the anonymous text snippet; for the authorship attribution problem, the label can be the hidden properties of the anonymous author, such as age and gender. There exists other variances of the authorship attribution problems, such as the open-set problem, the closed-set problem, and the attribution-with-probability-output problem [Stamatatos 2009; Iqbal et al. 2013; Ding et al. 2015]. 1Institute for Linguistic Evidence (ILE): http://linguisticevidence.org/ Learning Stylometric Representations for Authorship Analysis 0:3 Fig. 1. Overview of the traditional solution and the proposed solution for authorship analysis. Regardless of the studied authorship problems, the existing solutions in previous AA studies typically consist of three major processes, as shown in the upper flowchart of Figure 1): the feature engineering process, the solution design process, and the experi- mental evaluation process. In the first, a set of features are manually chosen by the re- searchers to represent each unit of textual data as a numeric vector. In the second process, a classification model is carefully adopted or designed. At the end, the entire solution is evaluated based on the specific datasets. Representative solutions are Burger et al. [2011], Nirkhi and Dharaskar [2014], and Cavalcante et al. [2014]. Exceptions are few recent applications of the topic models that actually combine these two process into one [Pratanwanich and Lio` 2014; Savoy 2013a; Seroussi et al. 2014]. Still, the two-processes-based studies on authorship analysis problems dominate [Rangel et al. 2014]. During the feature engineering process, given the available dataset and applica- tion scenario, authorship analysts manually select a broad set of features based on the hypotheses or educated guesses, and then refine the selection based on the exper- imental feedback. As demonstrated by previous research [Savoy 2012; Zamani et al. 2014a; Savoy

Learning Stylometric Representations for Authorship Analysis

Quantitative Authorship Attribution: a History and an Evaluation of Techniques

Stylometric Analysis of Parliamentary Speeches: Gender Dimension

Detection of Translator Stylometry Using Pair-Wise Comparative Classiﬁcation and Network Motif Mining

Identifying Idiolect in Forensic Authorship Attribution: an N-Gram Textbite Approach Alison Johnson & David Wright University of Leeds

A Comparison of Classifiers and Features for Authorship Authentication of Social Networking Messages

Profiling a Set of Personality Traits of Text Author: What Our Words Reveal About Us*

On the Feasibility of Internet-Scale Author Identiﬁcation

Adversarial Stylometry: Circumventing Authorship Recognition to Preserve

Linguistic Authentication and Reliability 1,1. Authorship in An

Forensic Authorship Analysis of Microblogging Texts Using N-Grams and Stylometric Features

Stylometric Techniques for Multiple Author Clustering Shakespeare‘S Authorship in the Passionate Pilgrim

Automatic IQ Estimation Using Stylometry Methods