ANALYSIS of STYLOMETRY AS a COGNITIVE BIOMETRIC TRAIT by Kalaivani Sundararajan December 2018 Chair: Damon L


ANALYSIS OF STYLOMETRY AS A COGNITIVE BIOMETRIC TRAIT

By KALAIVANI SUNDARARAJAN

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA 2018

© 2018 Kalaivani Sundararajan

I dedicate this dissertation to my Mom and Dad, my sister Gowri, and my husband Raghav, who have been incredibly supportive, understanding and encouraging throughout this entire journey.

ACKNOWLEDGMENTS

I extend my heartfelt thanks to the mentors and friends who have made this journey possible. First and foremost, I thank my advisor, Dr. Damon Woodard, who took me under his wing and provided constant support and motivation throughout this journey. His calm and composed demeanor has always been reassuring, even during periods of self-doubt. He gave me complete freedom to explore and grow into an independent researcher. I would also like to thank my committee members for their valuable feedback and comments.

Many mentors laid the foundation for this journey before I started my Ph.D. My first job at Soliton Technologies, India, seeded the desire to pursue a career in research. I am forever grateful to Ms. Mekhala Devaraj and Dr. Ganesh Devaraj, who provided me that opportunity and have been of immense support since then. I also thank Dr. Stan Birchfield, who laid my foundation for academic research at Clemson University and has always wished me well. I would also like to thank all my mentors in industry and academia who have shaped my career.

This long and eventful journey was made memorable by my dear friends. I would like to thank Poorna Sudhakar, Gayatri Lakshmanan, Rommel Jalasutram, Ahalya Srikanth, Renuka Jagadish, Anand Raju, Dhananjay Joshi, Vishnu Prasad and Ezhilselvi Karunakaran for all their love and support.
Finally, I would like to thank my graduate coordinators Adrienne Cook, Tamira Carter and Lesly Galiana for their support.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
  1.1 Motivation
  1.2 Thesis Statement and Dissertation Topics
    1.2.1 Style Representations
    1.2.2 Stylometry as a Biometric Trait
    1.2.3 Soft-biometric Classification or Author Profiling
  1.3 Contributions
  1.4 Chapter Guide

2 LITERATURE REVIEW
  2.1 Historical Background
  2.2 Stylometry
    2.2.1 Stylometric Features
    2.2.2 Attribution Approaches
  2.3 Challenges

3 PERFORMANCE ANALYSIS OF STATE-OF-THE-ART APPROACHES
  3.1 Dataset
  3.2 Authorship Identification
    3.2.1 Methodology
    3.2.2 Results
    3.2.3 Discussion
  3.3 Verification
    3.3.1 Methodology
    3.3.2 Results
    3.3.3 Discussion
  3.4 Feature Analysis
    3.4.1 Methodology
    3.4.2 Results
    3.4.3 Discussion
  3.5 Clustering
    3.5.1 Methodology
    3.5.2 Preliminary Results
    3.5.3 Discussion
  3.6 Summary

4 WHAT CONSTITUTES STYLE?
  4.1 Datasets
  4.2 Role of Structural representations
    4.2.1 Methodology
    4.2.2 Results
    4.2.3 Discussion
  4.3 Role of Lexical representations
    4.3.1 Methodology
    4.3.2 Results
    4.3.3 Discussion
  4.4 Learned representations
    4.4.1 Methodology
    4.4.2 Results
    4.4.3 Discussion
  4.5 Summary

5 STYLOMETRY AS A BIOMETRIC TRAIT
  5.1 Biometric Characteristics
    5.1.1 Analysis of Uniqueness Using Biometric Menagerie
    5.1.2 Analysis of Permanence in Linguistic Style
  5.2 Multimodal Biometric Authentication
    5.2.1 Methodology
    5.2.2 Dataset
    5.2.3 Results
    5.2.4 Discussion
  5.3 Summary

6 ROLE OF PSYCHOLOGICAL AND SOCIAL ASPECTS ON STYLOMETRY
  6.1 Role of psychological aspects
    6.1.1 Methodology
    6.1.2 Datasets
    6.1.3 Results
    6.1.4 Discussion
  6.2 Profiling author social groups
    6.2.1 Methodology
    6.2.2 Results
    6.2.3 Discussion
  6.3 Summary

7 SUMMARY
  7.1 Insights and Conclusions
    7.1.1 Style Representations
    7.1.2 Stylometry as a Biometric Trait
    7.1.3 Soft-biometric Classification or Author Profiling
  7.2 Limitations
  7.3 Future work

APPENDIX: AUTHOR PROFILING ANALYSIS
REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

2-1 Stylistic features
3-1 CASIS Dataset partitions
3-2 Character level features
3-3 Lexical features
3-4 Syntactic features
3-5 Salient features across different feature categories
3-6 Comparison of salient features
3-7 Effect of varying sentence lengths
4-1 Single-domain and Cross-domain datasets
4-2 Shallow syntactic model performance
4-3 Deep syntactic model performance
4-4 Effect of lexical POS on single-domain authorship attribution using IMDB1M
4-5 Effect of lexical POS on UF cross-topic authorship attribution
4-6 Effect of lexical POS on UF cross-genre authorship attribution
4-7 Effect of topic words on IMDB1M single-domain authorship attribution
4-8 Effect of topic words on UF cross-topic authorship attribution
4-9 Effect of topic words on UF cross-genre authorship attribution
4-10 Performance of learned representations
5-1 BoW features
5-2 Dataset statistics
5-3 Permanence of various span-subsets in Blogs and Yelp datasets
5-4 Salient and consistent features
5-5 Verification performance for multimodal authentication
6-1 LIWC categories
6-2 PAN2015 author profiling dataset - English
6-3 PAN2016 author profiling dataset - English
6-4 PAN2015 author profiling results
6-5 PAN2016 author profiling results
A-1 PAN2015 - LIWC analysis
A-2 PAN2015 - LIWC analysis (cont.)
A-3 PAN2015 - LIWC analysis (cont.)
A-4 PAN2016 - LIWC analysis

LIST OF FIGURES

2-1 Stylistic features
3-1 Recognition accuracy of various implementations
3-2 Recognition accuracy of various implementations across CASIS subsets
3-3 ROC curves for CASIS subsets
3-4 Score distributions for CASIS subsets
3-5 Clustering results
3-6 Markov clustering on CASIS-5 with three similarity measures
3-7 Markov clustering on CASIS-2 with features of different dimensions
4-1 Parse tree of a sample sentence and rewrite rules
4-2 Recurrent
Recommended publications
  • Quantitative Authorship Attribution: a History and an Evaluation of Techniques
    QUANTITATIVE AUTHORSHIP ATTRIBUTION: A HISTORY AND AN EVALUATION OF TECHNIQUES A thesis submitted in partial fulfillment of the requirements for the degree of IN THE DEPARTMENT OF LINGUISTICS © Jack Grieve 2005 SIMON FRASER UNIVERSITY Summer 2005 All rights reserved. This work may not be reproduced in whole or in part, by photocopy or other means, without the permission of the author. Name: Jack William Grieve Degree: Master of Arts Title of Thesis: Quantitative Authorship Attribution: A History and an Evaluation of Techniques Examining Committee: Dr. Zita McRobbie, Chair, Associate Professor, Department of Linguistics; Dr. Paul McFetridge, Senior Supervisor, Associate Professor, Department of Linguistics; Dr. Maria Teresa Taboada, Supervisor, Assistant Professor, Department of Linguistics; Dr. Fred Popowich, External Examiner, Professor, School of Computing Science. Date Defended: SIMON FRASER UNIVERSITY PARTIAL COPYRIGHT LICENCE The author, whose copyright is declared on the title page of this work, has granted to Simon Fraser University the right to lend this thesis, project or extended essay to users of the Simon Fraser University Library, and to make partial or single copies only for such users or in response to a request from the library of any other university, or other educational institution, on its own behalf or for one of its users. The author has further granted permission to Simon Fraser University to keep or make a digital copy for use in its circulating collection. The author has further agreed that permission for multiple copying of this work for scholarly purposes may be granted by either the author or the Dean of Graduate Studies. It is understood that copying or publication of this work for financial gain shall not be allowed without the author's written permission.
  • Stylometric Analysis of Parliamentary Speeches: Gender Dimension
    Stylometric Analysis of Parliamentary Speeches: Gender Dimension. Justina Mandravickaitė (Vilnius University, Lithuania; Baltic Institute of Advanced Technology, Lithuania; [email protected]), Tomas Krilavičius (Vytautas Magnus University, Lithuania; Baltic Institute of Advanced Technology, Lithuania; [email protected])
    Abstract: Relation between gender and language has been studied by many authors, however, there is still some uncertainty left regarding gender influence on language usage in the professional environment. Often, the studied data sets are too small or texts of individual authors are too short in order to capture differences of language usage wrt gender successfully. This study draws from a larger corpus of speeches transcripts of the Lithuanian Parliament (1990–2013) to explore language differences of political debates by gender via stylometric analysis. Experimental set up consists of stylistic features that indicate lexical style and do not require external linguistic tools, namely the most frequent words, in combination with unsupervised machine learning algorithms.
    to capture the differences in the language due to the gender (Newman et al., 2008; Herring and Martinson, 2004). Some results show that gender differences in language depend on the context, e.g., people assume male language in a formal setting and female in an informal environment (Pennebaker, 2011). We investigate gender impact to the language use in a professional setting, i.e., transcripts of speeches of the Lithuanian Parliament debates. We study language wrt style, i.e., male and female style of the language usage by applying computational stylistics or stylometry. Stylometry is based on the two hypotheses: (1) human stylome hypothesis, i.e., each individual has a unique style (Van Halteren et al., 2005); (2) unique style of individual can be measured (Stamatatos, 2009), stylometry allows gaining meta-knowledge (Daelemans, 2013), i.e., what can be learned from the text about the author - gender (Luyckx et al., 2006; Argamon et al.,
  • PAN 2017: Author Profiling
    PAN 2017: Author Profiling - Gender and Language Variety Prediction. Notebook for PAN at CLEF 2017. Matej Martinc (1,2), Iza Škrjanec (2), Katja Zupan (1,2), and Senja Pollak (1). (1) Jožef Stefan Institute, Jamova 39, 1000 Ljubljana, Slovenia; (2) Jožef Stefan International Postgraduate School, Jamova 39, 1000 Ljubljana, Slovenia. [email protected], [email protected], [email protected], [email protected]
    Abstract: We present the results of gender and language variety identification performed on the tweet corpus prepared for the PAN 2017 Author profiling shared task. Our approach consists of tweet preprocessing, feature construction, feature weighting and classification model construction. We propose a Logistic regression classifier, where the main features are different types of character and word n-grams. Additional features include POS n-grams, emoji and document sentiment information, character flooding and language variety word lists. Our model achieved the best results on the Portuguese test set in both prediction tasks (gender and language variety), with the obtained accuracy of 0.8600 and 0.9838, respectively. The worst accuracy was achieved on the Arabic test set. Keywords: author profiling, gender, language variety, Twitter
    1 Introduction: Recent trends in natural language processing (NLP) have shown a great interest in learning about the demographics, psychological characteristics and (mental) health of a person based on the text she or he produced. This field, generally known as author profiling (AP), has various applications in marketing, security (forensics), research in social psychology, and medical diagnosis. A thriving subfield of AP is computational stylometry, which is concerned with how the content and genre of a document contribute to its style [4].
  • Detection of Translator Stylometry Using Pair-Wise Comparative Classification and Network Motif Mining
    Detection of Translator Stylometry using Pair-wise Comparative Classification and Network Motif Mining. Heba Zaki Mohamed Abdallah El-Fiqi, M.Sc. (Computer Sci.) Cairo University, Egypt; B.Sc. (Computer Sci.) Zagazig University, Egypt. SCIENTIA MANU ET MENTE. A thesis submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy at the School of Engineering and Information Technology, University of New South Wales, Australian Defence Force Academy. © Copyright 2013 by Heba El-Fiqi
    Abstract: Stylometry is the study of the unique linguistic styles and writing behaviours of individuals. The identification of translator stylometry has many contributions in fields such as intellectual-property, education, and forensic linguistics. Despite the research proliferation on the wider research field of authorship attribution using computational linguistics techniques, the translator stylometry problem is more challenging and there is no sufficient machine learning literature on the topic. Some authors even claimed that detecting who translated a piece of text is a problem with no solution; a claim we will challenge in this thesis. In this thesis, we evaluated the use of existing lexical measures for the translator stylometry problem. It was found that vocabulary richness could not identify translator stylometry. This encouraged us to look for non-traditional representations to discover new features to unfold translator stylometry. Network motifs are small sub-graphs that aim at capturing the local structure of a real network. We designed an approach that transforms the text into a network then identifies the distinctive patterns of a translator by employing network motif mining.
  • Author Profiling: Predicting Age and Gender from Blogs
    Author Profiling: Predicting Age and Gender from Blogs. Notebook for PAN at CLEF 2013. K Santosh, Romil Bansal, Mihir Shekhar, and Vasudeva Varma, International Institute of Information Technology, Hyderabad. {santosh.kosgi, romil.bansal, mihir.shekhar}@research.iiit.ac.in, [email protected]
    Abstract: Author profiling is the task of determining age, gender, native language or personality type of author by studying their sociolect aspect, that is, how language is shared by people. In this paper, we propose a Machine Learning approach to determine unknown author’s age and gender. The approach uses three types of features: content based, style based and topic based. We were able to achieve an accuracy of 64.08%, 64.30% for age and 56.53%, 64.73% for gender in English and Spanish respectively. Keywords: Author Profiling, Topic Modelling, Text Categorization, Natural Language Processing
    1 Introduction: The problem of identifying the user’s profile from the text is always of importance as it helps in various fields like forensics and marketing. For example, in marketing, a manager might want to find the gender and age group of people who like or dislike their products from the public reviews. The increasing accessibility of public blogs offers an unprecedented opportunity to harvest information from texts authored by hundreds of thousands of different authors. In this paper, we tried to exploit these public blogs to find the relations between the author’s profile and the language style used by them. The main idea behind this task is to analyse how everyday languages reflects basic social and personality traits.
  • Automatic Author Profiling Based on Linguistic and Stylistic Features Notebook for PAN at CLEF 2013
    Automatic Author Profiling Based on Linguistic and Stylistic Features. Notebook for PAN at CLEF 2013. Braja Gopal Patra (1), Somnath Banerjee (1), Dipankar Das (2), Tanik Saikh (1), Sivaji Bandyopadhyay (1). (1) Department of Computer Science & Engineering, Jadavpur University, India; (2) Department of Computer Science & Engineering, NIT Meghalaya, India. {brajagopal.cse, s.banerjee1980, dipankar.dipnil2005, tanik4u}@gmail.com, [email protected]
    Abstract. The rapid expansion of blog and electronic data in Web 2.0 is abounding and thus it is becoming important to identify the author's profile also. The problems of automatic identification of author's gender and age based on linguistic and stylistic pattern have been a subject of increasingly research interest in the recent years. The research methodologies are also helpful for several other applications like criminal detection, security and author detection etc. We have used lexical, syntactic and structural features for identifying the gender and age group of the authors. We have employed the Decision tree classifier for classifying the author profile. We have achieved the accuracies of 56.83% and 28.95% for gender and age group classification, respectively. Key words: Automatic author profiling, gender identification, age group identification, word class, decision tree.
    1 Introduction: In the 21st century, popularity of the Internet is increasing abundantly. Internet media like emails, blogs/internet forum and websites have been identified as the ideal communication platform for the people. In the recent years, the researchers have been paid increasingly interest to analyze of authorship of emails, electronic messages [5] and plagiarism detection [7] etc. Thus, analyzing the web content has become more important to the intelligence and security agencies that monitor the author information as much as possible [1].
  • Identifying Idiolect in Forensic Authorship Attribution: an N-Gram Textbite Approach Alison Johnson & David Wright University of Leeds
    Identifying idiolect in forensic authorship attribution: an n-gram textbite approach. Alison Johnson & David Wright, University of Leeds
    Abstract. Forensic authorship attribution is concerned with identifying authors of disputed or anonymous documents, which are potentially evidential in legal cases, through the analysis of linguistic clues left behind by writers. The forensic linguist “approaches this problem of questioned authorship from the theoretical position that every native speaker has their own distinct and individual version of the language [...], their own idiolect” (Coulthard, 2004: 31). However, given the difficulty in empirically substantiating a theory of idiolect, there is growing concern in the field that it remains too abstract to be of practical use (Kredens, 2002; Grant, 2010; Turell, 2010). Stylistic, corpus, and computational approaches to text, however, are able to identify repeated collocational patterns, or n-grams, two to six word chunks of language, similar to the popular notion of soundbites: small segments of no more than a few seconds of speech that journalists are able to recognise as having news value and which characterise the important moments of talk. The soundbite offers an intriguing parallel for authorship attribution studies, with the following question arising: looking at any set of texts by any author, is it possible to identify ‘n-gram textbites’, small textual segments that characterise that author’s writing, providing DNA-like chunks of identifying material? Drawing on a corpus of 63,000 emails and 2.5 million words written by 176 employees of the former American energy corporation Enron, a case study approach is adopted, first showing through stylistic analysis that one Enron employee repeatedly produces the same stylistic patterns of politely encoded directives in a way that may be considered habitual.
  • Arabic Twitter User Profiling: Application to Cyber-Security
    Arabic Twitter User Profiling: Application to Cyber-security. Rahma Basti (1), Salma Jamoussi (2), Anis Charfi (3) and Abdelmajid Ben Hamadou (4). (1) Multimedia InfoRmation Systems and Advanced Computing Laboratory (MIRACL), University of Sfax, Tunis, Tunisia; (2) Higher Institute of Computer Science and Multimedia of Sfax, 1173 Sfax 3038, Tunisia; (3) Carnegie Mellon University Qatar, Qatar; (4) Digital Research Center of Sfax (DRCS), Tunisia
    Keywords: Author Profiling, Arabic Text Processing, Age and Gender Prediction, Dangerous Profiles, Stylometric Features.
    Abstract: In recent years, we witnessed a rapid growth of social media networking and micro-blogging sites such as Twitter. In these sites, users provide a variety of data such as their personal data, interests, and opinions. However, this data shared is not always true. Often, social media users hide behind a fake profile and may use it to spread rumors or threaten others. To address that, different methods and techniques were proposed for user profiling. In this article, we use machine learning for user profiling in order to predict the age and gender of a user’s profile and we assess whether it is a dangerous profile using the users’ tweets and features. Our approach uses several stylistic features such as characters based, words based and syntax based. Moreover, the topics of interest of a user are included in the profiling task. We obtained the best accuracy levels with SVM and these were respectively 73.49% for age, 83.7% for gender, and 88.7% for the dangerous profile detection.
    1 INTRODUCTION: Social media networks allow users to share information, opinions and communicate with each other.
    researchers to use it including those who focus on user profiling (Feldman and Sanger, 2006). Research in psychology (Frank and Witten, 1998) has revealed that the words used by an individual can
  • Author Profiling for English and Spanish Text
    Author Profiling for English and Spanish Text. Notebook for PAN at CLEF 2013. Upendra Sapkota (1), Thamar Solorio (1), Manuel Montes-y-Gómez (2), and Gabriela Ramírez-de-la-Rosa (1). (1) University of Alabama at Birmingham, Birmingham, AL 35294, USA, {upendra,solorio,gabyrr}@cis.uab.edu; (2) Instituto Nacional de Astrofísica, Óptica y Electrónica, Puebla, Mexico, [email protected]
    Abstract: This paper describes an approach for the author profiling task of the PAN 2013 challenge. This work is based on the idea of linguistic modality that has been successfully used in other classification tasks such as authorship attribution. We consider three different modalities: syntactic, stylistic, and semantic, each representing a different aspect of text. For each modality, we extract informative meta features by computing the similarity relations between the feature vectors in the test files and the centroids of modality specific clusters. Since we were provided texts in both Spanish and English, we build a language independent framework for author profiling. For both English and Spanish documents, our system performed well for the age identification task. For gender prediction, although our system could not perform as expected for English, it yielded good results on Spanish. Keywords: author profiling, linguistic modality, similarity relations
    1 Introduction: With the proliferation of social media (blogs, chats), it has been possible to access large online texts. Such large texts can be utilized to understand how the writing style among users of different age groups, as well as between male and female vary [1,2]. Unlike the authorship attribution problem where the task is to identify the true author of a given piece of text, the author profiling task tries to learn as much information as possible (demographics, personality) about the unknown writer of the given text.
  • Author Profiling from Facebook Corpora
    Author Profiling from Facebook Corpora. Fernando Chiu Hsieh, Rafael Felipe Sandroni Dias, Ivandré Paraboni, University of São Paulo, School of Arts, Sciences and Humanities, São Paulo, Brazil. [email protected], [email protected], [email protected]
    Abstract: Author profiling - the computational task of prediction author’s demographics from text - has been a popular research topic in the NLP field, and also the focus of a number of prominent shared tasks. Author profiling is a problem of growing importance, with applications in forensics, security, sales and many others. In recent years, text available from social networks has become a primary source for computational models of author profiling, but existing studies are still largely focused on age and gender prediction, and are in many cases limited to the use of English text. Other languages, and other author profiling tasks, remain somewhat less popular. As a means to further this issue, in this work we present initial results of a number of author profiling tasks from a Facebook corpus in the Brazilian Portuguese language. As in previous studies, our own work will focus on both standard gender and age prediction tasks but, in addition to these, we will also address two less usual author profiling tasks, namely, predicting an author’s degree of religiosity and IT background status. The tasks are modelled by making use of different knowledge sources, and results of alternative approaches are discussed. Keywords: Author profiling, Document Classification, Corpus-based approaches
    1. Introduction
    existing author profiling models are reviewed in Section 2. Our own approach and corpus are described in Section 3.
  • A Study of Arabic Social Media Users—Posting Behavior and Author's
    Cognitive Computation (2019) 11:71–86, https://doi.org/10.1007/s12559-018-9592-7. A Study of Arabic Social Media Users—Posting Behavior and Author’s Gender Prediction. Abdulrahman I. Al-Ghadir, Aqil M. Azmi. Received: 16 August 2016 / Accepted: 4 September 2018 / Published online: 24 September 2018. © Springer Science+Business Media, LLC, part of Springer Nature 2018
    Abstract: Social media opens up numerous possibilities to study human interaction and collective behavior in an unprecedented scale. It opened a whole new venue for research under the name “social computing”. Researchers are interested in profiling individuals (e.g., gender, age group), groups, community, and networking. We are interested in studying the collective behavior of Arabic social media users. Most studies covering Arabic social media has focused on the sentiment analysis of, say tweets. This study, however, looks into who and when users interact with the Arabic social media. Specifically, there are two objectives of this work. First, studying the demographic posting behavior of social media users from two different perspectives: gender and educational level. Second, author profiling. Identifying author’s gender of a social media post. We use Saudi Arabia, a very prolific country when it comes to social media in general, as a backdrop for this study. The results in this study are based on mining huge amount of metadata of a popular local social media forum covering the period 2011–14 inclusive. The extracted features (normalized list of k highest scoring words, and likewise for stems) from the posts were used to train classifiers to identify the author’s gender.
  • Learning Stylometric Representations for Authorship Analysis
    Learning Stylometric Representations for Authorship Analysis. STEVEN H. H. DING, School of Information Studies, McGill University, Canada; BENJAMIN C. M. FUNG, School of Information Studies, McGill University, Canada; FARKHUND IQBAL, College of Technological Innovation, Zayed University, UAE; WILLIAM K. CHEUNG, Department of Computer Science, Hong Kong Baptist University, Hong Kong
    Authorship analysis (AA) is the study of unveiling the hidden properties of authors from a body of exponentially exploding textual data. It extracts an author’s identity and sociolinguistic characteristics based on the reflected writing styles in the text. It is an essential process for various areas, such as cybercrime investigation, psycholinguistics, political socialization, etc. However, most of the previous techniques critically depend on the manual feature engineering process. Consequently, the choice of feature set has been shown to be scenario- or dataset-dependent. In this paper, to mimic the human sentence composition process using a neural network approach, we propose to incorporate different categories of linguistic features into distributed representation of words in order to learn simultaneously the writing style representations based on unlabeled texts for authorship analysis. In particular, the proposed models allow topical, lexical, syntactical, and character-level feature vectors of each document to be extracted as stylometrics. We evaluate the performance of our approach on the problems of authorship characterization and authorship verification
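Several of the abstracts listed above rely on n-gram features: the Enron study looks for repeated two-to-six-word chunks ("n-gram textbites"), and the PAN 2017 entry uses character and word n-grams. The counting step behind such features might be sketched as follows. This is only an illustration, not the method of any listed paper: the function names `word_ngrams` and `char_ngrams`, the whitespace tokenization, and the sample sentence are all assumptions made here.

```python
from collections import Counter


def word_ngrams(text, n_min=2, n_max=6):
    """Count word n-grams of length n_min..n_max (hypothetical helper)."""
    # Naive lowercase whitespace tokenization; an assumption for this sketch.
    tokens = text.lower().split()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def char_ngrams(text, n=3):
    """Count overlapping character n-grams, spaces included (hypothetical helper)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


# Invented sample text: chunks that recur are candidate "textbites".
sample = "please let me know if you have any questions please let me know"
repeated = {gram: c for gram, c in word_ngrams(sample).items() if c > 1}
print(sorted(repeated))
```

In practice the counts would be computed per author and compared across authors; the filtering threshold and the choice of n range are design decisions that depend on corpus size.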