ANALYSIS of STYLOMETRY AS a COGNITIVE BIOMETRIC TRAIT by Kalaivani Sundararajan December 2018 Chair: Damon L

ANALYSIS OF STYLOMETRY AS A COGNITIVE BIOMETRIC TRAIT By KALAIVANI SUNDARARAJAN A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY UNIVERSITY OF FLORIDA 2018 © 2018 Kalaivani Sundararajan I dedicate this dissertation to my Mom and Dad, sister Gowri and my husband Raghav who have been incredibly supportive, understanding and encouraging throughout this entire journey. ACKNOWLEDGMENTS I extend my heartfelt thanks to mentors and friends who have made this journey possible. First and foremost, I thank my advisor, Dr.Damon Woodard, who took me under his wing and provided constant support and motivation throughout this journey. His calm and composed demeanor has always been reassuring even during periods of self-doubt. He gave me complete freedom to explore and grow into an independent researcher. I would also like to thank my committee members for their valuable feedback and comments. Many mentors laid the foundation for this journey before I started Ph.D. My first job at Soliton Technologies, India, seeded the desire to pursue a career in research. I am forever grateful to Ms.Mekhala Devaraj and Dr.Ganesh Devaraj who provided me that opportunity and have been of immense support since then. I also thank Dr.Stan Birchfield who laid my foundation for academic research at Clemson University and has always wished me well. I would also like to thank all my mentors in industry and academia who have shaped my career. This long and eventful journey was made memorable by my dear friends. I would like to thank Poorna Sudhakar, Gayatri Lakshmanan, Rommel Jalasutram, Ahalya Srikanth, Renuka Jagadish, Anand Raju, Dhananjay Joshi, Vishnu Prasad and Ezhilselvi Karunakaran for all their love and support. Finally, I would like to thank my graduate coordinators Adrienne Cook, Tamira Carter and Lesly Galiana for their support. 4 TABLE OF CONTENTS page ACKNOWLEDGMENTS ................................. 4 LIST OF TABLES ..................................... 8 LIST OF FIGURES .................................... 10 ABSTRACT ........................................ 12 CHAPTER 1 INTRODUCTION .................................. 14 1.1 Motivation ................................... 15 1.2 Thesis Statement and Dissertation Topics .................. 15 1.2.1 Style Representations .......................... 16 1.2.2 Stylometry as a Biometric Trait .................... 17 1.2.3 Soft-biometric Classification or Author Profiling ........... 18 1.3 Contributions .................................. 18 1.4 Chapter Guide ................................. 19 2 LITERATURE REVIEW .............................. 21 2.1 Historical Background ............................. 21 2.2 Stylometry ................................... 22 2.2.1 Stylometric Features .......................... 22 2.2.2 Attribution Approaches ........................ 25 2.3 Challenges .................................... 28 3 PERFORMANCE ANALYSIS OF STATE-OF-THE-ART APPROACHES ... 30 3.1 Dataset ..................................... 31 3.2 Authorship Identification ............................ 31 3.2.1 Methodology .............................. 32 3.2.2 Results .................................. 37 3.2.3 Discussion ................................ 40 3.3 Verification ................................... 41 3.3.1 Methodology .............................. 41 3.3.2 Results .................................. 44 3.3.3 Discussion ................................ 46 3.4 Feature Analysis ................................ 46 3.4.1 Methodology .............................. 47 3.4.2 Results .................................. 48 3.4.3 Discussion ................................ 49 3.5 Clustering .................................... 51 3.5.1 Methodology .............................. 52 5 3.5.2 Preliminary Results ........................... 53 3.5.3 Discussion ................................ 55 3.6 Summary .................................... 56 4 WHAT CONSTITUTES STYLE? .......................... 58 4.1 Datasets ..................................... 59 4.2 Role of Structural representations ....................... 60 4.2.1 Methodology .............................. 61 4.2.2 Results .................................. 63 4.2.3 Discussion ................................ 65 4.3 Role of Lexical representations ........................ 66 4.3.1 Methodology .............................. 67 4.3.2 Results .................................. 68 4.3.3 Discussion ................................ 69 4.4 Learned representations ............................ 72 4.4.1 Methodology .............................. 73 4.4.2 Results .................................. 75 4.4.3 Discussion ................................ 76 4.5 Summary .................................... 77 5 STYLOMETRY AS A BIOMETRIC TRAIT ................... 79 5.1 Biometric Characteristics ........................... 79 5.1.1 Analysis of Uniqueness Using Biometric Menagerie ......... 79 5.1.2 Analysis of Permanence in Linguistic Style .............. 87 5.2 Multimodal Biometric Authentication ..................... 96 5.2.1 Methodology .............................. 97 5.2.2 Dataset ................................. 99 5.2.3 Results .................................. 99 5.2.4 Discussion ................................ 102 5.3 Summary .................................... 105 6 ROLE OF PSYCHOLOGICAL AND SOCIAL ASPECTS ON STYLOMETRY 106 6.1 Role of psychological aspects .......................... 106 6.1.1 Methodology .............................. 107 6.1.2 Datasets ................................. 107 6.1.3 Results .................................. 110 6.1.4 Discussion ................................ 115 6.2 Profiling author social groups ......................... 116 6.2.1 Methodology .............................. 116 6.2.2 Results .................................. 118 6.2.3 Discussion ................................ 119 6.3 Summary .................................... 120 6 7 SUMMARY ...................................... 125 7.1 Insights and Conclusions ............................ 125 7.1.1 Style Representations .......................... 125 7.1.2 Stylometry as a Biometric Trait .................... 126 7.1.3 Soft-biometric Classification or Author Profiling ........... 126 7.2 Limitations ................................... 126 7.3 Future work ................................... 127 APPENDIX: AUTHOR PROFILING ANALYSIS .................... 128 REFERENCES ....................................... 132 BIOGRAPHICAL SKETCH ................................ 142 7 LIST OF TABLES Table page 2-1 Stylistic features ................................... 26 3-1 CASIS Dataset partitions .............................. 31 3-2 Character level features ............................... 43 3-3 Lexical features .................................... 44 3-4 Syntactic features ................................... 44 3-5 Salient features across different feature categories ................. 48 3-6 Comparison of salient features ............................ 49 3-7 Effect of varying sentence lengths. .......................... 49 4-1 Single-domain and Cross-domain datasets ..................... 60 4-2 Shallow syntactic model performance ........................ 64 4-3 Deep syntactic model performance ......................... 65 4-4 Effect of lexical POS on single-domain authorship attribution using IMDB1M . 68 4-5 Effect of lexical POS on UF cross-topic authorship attribution .......... 69 4-6 Effect of lexical POS on UF cross-genre authorship attribution .......... 69 4-7 Effect of topic words on IMDB1M single-domain authorship attribution ..... 70 4-8 Effect of topic words on UF cross-topic authorship attribution .......... 70 4-9 Effect of topic words on UF cross-genre authorship attribution .......... 70 4-10 Performance of learned representations ....................... 76 5-1 BoW features ..................................... 81 5-2 Dataset statistics ................................... 90 5-3 Permanence of various span-subsets in Blogs and Yelp datasets ......... 90 5-4 Salient and consistent features ............................ 95 5-5 Verification performance for multimodal authentication .............. 101 6-1 LIWC categories ................................... 108 6-2 PAN2015 author profiling dataset - English .................... 109 8 6-3 PAN2016 author profiling dataset - English .................... 109 6-4 PAN2015 author profiling results .......................... 119 6-5 PAN2016 author profiling results .......................... 119 A-1 PAN2015 - LIWC analysis .............................. 128 A-2 PAN2015 - LIWC analysis(cont.) .......................... 129 A-3 PAN2015 - LIWC analysis(cont.) .......................... 130 A-4 PAN2016 - LIWC analysis .............................. 131 9 LIST OF FIGURES Figure page 2-1 Stylistic features ................................... 23 3-1 Recognition accuracy of various implementations ................. 39 3-2 Recognition accuracy of various implementations across CASIS subsets ..... 42 3-3 ROC curves for CASIS subsets ........................... 45 3-4 Score distributions for CASIS subsets ........................ 46 3-5 Clustering results ................................... 54 3-6 Markov clustering on CASIS-5 with three similarity measures .......... 54 3-7 Markov clustering on CASIS-2 with features of different dimensions ....... 55 4-1 Parse tree of a sample sentence and rewrite rules ................. 62 4-2 Recurrent

ANALYSIS of STYLOMETRY AS a COGNITIVE BIOMETRIC TRAIT by Kalaivani Sundararajan December 2018 Chair: Damon L

Quantitative Authorship Attribution: a History and an Evaluation of Techniques

Stylometric Analysis of Parliamentary Speeches: Gender Dimension

PAN 2017: Author Profiling

Detection of Translator Stylometry Using Pair-Wise Comparative Classiﬁcation and Network Motif Mining

Author Profiling: Predicting Age and Gender from Blogs

Automatic Author Profiling Based on Linguistic and Stylistic Features Notebook for PAN at CLEF 2013

Identifying Idiolect in Forensic Authorship Attribution: an N-Gram Textbite Approach Alison Johnson & David Wright University of Leeds

Arabic Twitter User Profiling: Application to Cyber-Security

Author Profiling for English and Spanish Text

Author Profiling from Facebook Corpora

A Study of Arabic Social Media Users—Posting Behavior and Author's

Learning Stylometric Representations for Authorship Analysis