MACHINE LEARNING METHOD FOR AUTHORSHIP ATTRIBUTION

By

Xianfeng Hu

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Applied Mathematics—Doctor of Philosophy

2015

ABSTRACT

MACHINE LEARNING METHOD FOR AUTHORSHIP ATTRIBUTION

By

Xianfeng Hu

Broadly speaking, the authorship identification or authorship attribution problem is to determine the authorship of a given sample such as a text, a painting and so on. Our main work is to develop an effective and mathematically sound approach for analyzing the authorship of disputed books.

Inspired by various authorship attribution problems in the history of literature and by the application of machine learning to literary stylometry, we develop a rigorous new method for the mathematical analysis of authorship by testing for a so-called chrono-divide in writing styles. Our method incorporates some of the latest advances in the study of authorship attribution, particularly techniques from support vector machines. By introducing the notion of relative frequency of words and phrases as a feature ranking metric, our method proves to be highly effective and robust.

Applying our method to the classical Chinese novel Dream of the Red Chamber has led to convincing if not irrefutable evidence that the first 80 chapters and the last 40 chapters of the book were written by two different authors.

Applying our method to the English novel Micro, we are able to confirm the existence of a chrono-divide and identify its location, so that we can differentiate the contributions of Michael Crichton and Richard Preston, the authors of the novel.

We have also tested our method on the other three Great Classical Novels of Chinese literature. As expected, no chrono-divides have been found in these novels. This provides further evidence of the robustness of our method.

We also propose a new approach to authorship identification for the open class problem, where the candidate pool is nonexistent or very large. It scales reliably from a new method we have developed for the closed class problem, in which the pool of candidate authors is small. This is attained by using support vector machines and by analyzing the relative frequencies of common words in the function words dictionary and of the most frequently used words. The method scales to the open class problem through a novel author randomization technique, in which the author in question is compared repeatedly against randomly selected authors. The author randomization technique proves to be highly robust and effective. Using our approaches we have found answers to three well known authorship controversies: (1) Did Robert Galbraith write Cuckoo’s Calling? (2) Did Harper Lee write To Kill a Mockingbird, or did her friend Truman Capote write it? (3) Did Bill Ayers write Obama’s autobiography Dreams From My Father?

ACKNOWLEDGMENTS

First and foremost I want to thank my advisor Yang Wang, the smartest person I know. It has been an honor to be his Ph.D. student. I thank him in particular for helping me shape and guide my research direction. His support and insightful discussions about research made my pursuit of a Ph.D. possible. I appreciate all his contributions of time, ideas and funding, which made my Ph.D. productive and stimulating over the past five years. His enthusiasm for research was contagious and motivational for me, even during the tough times of the Ph.D. pursuit. Beyond his scientific guidance, his advice on navigating an academic career has been invaluable.

I will forever be thankful to my advisor Min Wu at South China University of Technology. She was, and remains, my finest example of a successful woman mathematician, mentor and teacher. She is the reason I decided to pursue a career in research, and I am grateful for her support and encouragement to pursue a Ph.D.

I thank Mark Iwen for sharing his GFFT code. It was he who led me to use C++ and showed me debugging tricks. I really appreciate his patience and time during our discussions, and I thank him very much for helping me prepare my job application materials and for his suggestions during the application process. I also want to thank Qiang Wu for helping me start my first project and learn MATLAB.

I thank him for his inspiring discussions and wonderful suggestions about life. I also want to thank Aditya Viswanathan for his wonderful suggestions of software toolboxes during our discussions. I would like to thank Zhengfang Zhou, Andrew Christlieb and Yingda Cheng for serving on my defense committee.

I would like to thank my fellow math students for their friendship over the past five years and for sharing a wonderful life with me, Zixuan Wang and Yuqi Hong in particular. The nights spent playing tennis, sharing dinner, or laughing over dessert made the experience of graduate school more rewarding. I would also like to thank my friends Liping Chen and Liangmin Zhou for helping me during my studies and teaching. My parents Junlin and Genshui, my brother Mubiao and my sisters Nian, Xue and Jie have been incredibly supportive throughout my student life. I thank them for giving me a warm family and a happy childhood, and I dedicate this dissertation to them.

TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

Chapter 1  Introduction
    1.1 Authorship Attribution
    1.2 Related Works
    1.3 Organization of the Thesis

Chapter 2  Introduction to Machine Learning Techniques
    2.1 Introduction
    2.2 Linear Support Vector Machine
        2.2.1 Separable case
        2.2.2 Non-separable case
    2.3 Nonlinear Support Vector Machine
    2.4 Multiclass Classification
    2.5 Cross Validation

Chapter 3  Stylometry Analysis: Detecting Chrono-Divide in Writing Styles
    3.1 Chrono-divide
    3.2 Methodology
        3.2.1 Initial stylometric feature extraction
        3.2.2 Data preparation
        3.2.3 Feature subset selection
        3.2.4 Data analysis
        3.2.5 The algorithm
    3.3 Case Study: Analysis of Dream of the Red Chamber
        3.3.1 Background
        3.3.2 Separability of the chapters by Cao and Gao
        3.3.3 Non-separability of the first 80 chapters
        3.3.4 Analysis of chapters 81-120: style change over time
        3.3.5 Comparison with Continued Dream of the Red Chamber
    3.4 Case Study: Analysis of the other three Great Classical Novels
    3.5 Case Study: Chrono-divide of Micro
    3.6 Conclusion

Chapter 4  Open Class Authorship Identification
    4.1 Introduction
    4.2 Close Class Problems
        4.2.1 Methodology
        4.2.2 Case studies
    4.3 Open Class Problems
        4.3.1 Database and data preparation
        4.3.2 Methodology
        4.3.3 Case studies
    4.4 Conclusion

BIBLIOGRAPHY

LIST OF TABLES

Table 3.1: The features and validation errors of the classifiers obtained from two randomly selected modeling subsets.

Table 3.2: Relative frequencies of the top ranked 8 features in each of the four Great Classical Novels.

Table 4.1: Cuckoo’s Calling: Classification result by the classifier obtained using one book for each of the two authors as training samples.

Table 4.2: Cuckoo’s Calling: Classification result by the classifier obtained using one book for each of the four authors as training samples, group I.

Table 4.3: Cuckoo’s Calling: Classification result by the classifier obtained using one book for each of the four authors as training samples, group II.

Table 4.4: Cuckoo’s Calling: Classification result by the classifier obtained using two books for each of the four authors as training samples.

Table 4.5: To Kill A Mockingbird: Classification result by the classifier obtained using one book for each of the five authors as training samples.

Table 4.6: Dreams From My Father: Classification result by the classifier obtained using one book for each of the four authors as training samples.

Table 4.7: Matching rate for multi-class classifications trained by randomly chosen books by each author.

Table 4.8: Matching rate for multi-class classifications trained by randomly chosen samples by each author.

Table 4.9: Matching rate for multi-class classifications by randomly chosen books by each author.

Table 4.10: Matching rate for multi-class classifications with randomly chosen samples by each author.

Table 4.11: Cuckoo’s Calling: master suspected authors and the average matching rate.

Table 4.12: To Kill a Mockingbird: the average matching rate.

Table 4.13: Dreams From My Father: the average matching rate.

LIST OF FIGURES

Figure 2.1: Hyperplanes

Figure 2.2: Optimal hyperplane

Figure 2.3: Linearly non-separable data set

Figure 2.4: Linear SVM in new space

Figure 2.5: 3-class classification: (a) one-vs-one, Hij separates class i and class j; (b) one-vs-rest, Hi separates class i from other classes.

Figure 2.6: The k-fold cross validation

Figure 3.1: Experiment 1: (a) Mean cross validation error rate; (b) Values of SVM classifier on chapters 60-90.

Figure 3.2: Experiment 2: (a) Mean cross validation error rate; (b) Values of SVM classifier on chapters 31-50. Note there is no chrono-divide.

Figure 3.3: Experiment 3: (a) Mean cross validation error rate; (b) Values of SVM classifier on chapters 96-105, which correspond to the samples 31-50 in all 80 samples. Note two samples come from one chapter in this experiment.

Figure 3.4: Distances between the first 80 chapters of the Cheng-Gao version, the last 40 chapters of the Cheng-Gao version, and 30 chapters of Continued Dream of the Red Chamber.

Figure 3.5: Classification results on the testing samples of the other three classical novels: (a) Romance of the Three Kingdoms; (b) Water Margin; (c) Journey to the West.

Figure 3.6: Micro: average values of the 7 classifiers.

Figure 3.7: Micro: Classification result by the classifier obtained using the first 22 parts and the last 22 parts of the book as training samples.

Chapter 1

Introduction

1.1 Authorship Attribution

Did Francis Bacon or Christopher Marlowe write some of the plays that were attributed to William Shakespeare? Did Gao E write the last forty chapters of Dream of the Red Chamber, one of the greatest masterpieces in the history of Chinese literature? Who wrote the Federalist Papers? Did Bill Ayers write Obama’s autobiography Dreams From My Father? There has been no shortage of authorship controversies throughout history, and these are just a few of them. Such questions have often generated fierce debates among historians and literary scholars. Historically, connoisseurship, disciplinary knowledge and background history have been central in such debates. In recent years, however, the “unemotional approach” using quantitative analysis based on mathematics, statistics and computation has gained prominence in the study of authorship attribution.

Authorship attribution is to determine the authorship of works, such as books, visual arts and so on, whose authors are unknown. In general there are two classes of problems in authorship attribution [21]. The most common one is the closed class problem, where it is known that the sample text is by one of the authors from a given set, usually a very small set with two to three authors. For example, the Federalist Papers problem mentioned earlier is a typical closed class problem, where the authors are confined to Alexander Hamilton, James Madison and John Jay. Far more challenging is the open class problem, where the sample text may come from a far larger set of authors, or there may be no suspect author at all. An example of an open class problem is to determine the authorship of books written anonymously or under pseudonyms, e.g. Primary Colors or Imperial Hubris: Why the West is Losing the War on Terror. The study of authorship attribution has seen rapid growth in recent years. Still, there remain many challenges even for the closed class problem. Accuracy is one of the most prominent concerns. For the open class problem the challenge is far greater, and so far there have been few attempts; none of the known techniques can be reliably scaled to handle the open class problem.

We study both open class and closed class problems in authorship attribution. The first part of this thesis focuses on an important closed class authorship question, which we call the chrono-divide problem, namely to determine whether a body of text was written by a single author. Such a study is valuable in a number of ways. For example, it is widely speculated among Shakespearean scholars that some of the Shakespeare plays were in fact collaboratively written by Shakespeare and others, e.g. Middleton and Fletcher. It is of scholarly interest not only to confirm this but also to find out exactly which parts of the plays were written by playwrights other than Shakespeare. There have also been historical controversies involving suspected fraud, as in the case of Dream of the Red Chamber, where a particular book or sequel attributed to a certain well known author might in fact have been perpetrated by someone else. A related practice was for a well known and prolific author to write only the first few chapters of a book and then pass it on to a ghost writer. There are clear ethical and scholarly benefits to detecting frauds and ghost writings. Modern times have seen an explosion of coauthored books and articles, and it would be interesting to detect stylistic inconsistencies among parts of such books. As we shall see, mathematics can play a central role in the study of authorship, leading to the rapid growth of the field of stylometry.

In this part we will introduce our method for detecting a chrono-divide in writing styles within a given body of text, use two case studies to show how it can be done, and illustrate the effectiveness and robustness of our method by comparing against cases known to be by a single author. The first case study is the classical Chinese novel Dream of the Red Chamber. The controversy surrounding the book is well known and has been intensely debated in Chinese literary circles for over 250 years. The second case study is the book Micro, written by Michael Crichton and Richard Preston. The single-author cases will be China’s other three classical novels, Romance of the Three Kingdoms, Journey to the West and Water Margin. The research for the first case study has been done in our work [39], and much of what we present here is reproduced from [39]. All materials in the second case study are new.

The second part of this thesis studies the open class problem in authorship attribution, which is to identify the author of a body of text whose unknown true author lies within a large set of candidate authors. In many cases there is not even a suspect, so the set of candidate authors is the entire database. To solve an open class identification problem, we first propose a new method for solving the closed class problem, where the number of candidate authors of a text is small (e.g. ≤ 8). We then reliably scale it to solve the open class problem through the author randomization technique. As long as the true author is in our database we can reliably identify the author. Furthermore, if the true author is not in the database we can also reliably rule out every author in the database.

We will describe our techniques and the algorithm of our approach, and demonstrate their effectiveness through three case studies: (1) Cuckoo’s Calling by Robert Galbraith, which turned out to be a pseudonym for J. K. Rowling; (2) To Kill a Mockingbird by Harper Lee, published in 1960, which some speculated to have been written by her long time friend Truman Capote; (3) Dreams From My Father by President Obama, whose authorship has been in doubt ever since Bill Ayers “confessed” that he was the true author.

1.2 Related Works

The problem of style quantification and authorship attribution in the literature goes back at least to 1854, when the English mathematician Augustus De Morgan [11], in a letter to a clergyman on the subject of Gospel authorship, suggested that the lengths of words might be used to differentiate authors. In 1897 the term stylometry was coined by the historian of philosophy Wincenty Lutosławski as a catch-all for a collection of statistical techniques applied to questions of authorship and the evolution of style in the literary arts (see e.g. [29]). Today, literary stylometry is a well developed and highly interdisciplinary research area that draws extensively from a number of disciplines such as mathematics and statistics, literature and linguistics, computer science, information theory and others. It is a central area of research in statistical learning (see e.g. [18]).

The main assumption of authorship attribution is that every person has his or her own style of speaking, writing and painting. These styles can be quantified using so-called “authorial fingerprints” [21], or features. Over the history of authorship attribution many different kinds of features have been proposed for authorship identification. The goal has always been to find a feature or features that will differentiate one author from another. Yule [41] suggested sentence length as a feature of writing style and applied it to two cases of disputed authorship. A popular classic technique for stylometric analysis of authorship involves comparing frequencies of the so-called function words, a class of words that in general have little content meaning, but instead serve to express grammatical relationships with other words within a sentence. The different percentages of nouns, verbs, adjectives, adverbs and other parts of speech can also be useful characteristics of writing style; see [1, 5, 6]. Ellegard [14] determined the authorship of the Junius Letters using function word frequencies and synonym pairs (e.g. the ratio of “big” vs “large”). Mosteller and Wallace [28] focused on the distribution of 30 function words in the various Federalist papers and analyzed their authorship. Popular earlier features also include average word length, cumulative sums (Qsums) [3, 4, 15, 27], and n-grams. In [22] the authors presented a method for computer-assisted authorship attribution based on character-level n-gram author profiles; see [21] and [33] for a comprehensive review. While some of these features are effective, many have been shown to have less discriminating power. We shall not discuss the advantages of different features here. Instead we refer interested readers to the excellent survey articles by Juola [21] and Stamatatos [33] for a comprehensive summary of the pros and cons of different features in stylometry analysis. It has been noted, however, that frequencies of content independent function words are among the most effective features [7].

The field of literary stylometry has seen impressive advances in recent years, with more and more new and sophisticated mathematical techniques as well as software being developed. Machine learning has proven to be a powerful classification tool in the last decade, and researchers have started to use it in authorship attribution. Koppel [26] used the exponential gradient (EG) algorithm of Kivinen and Warmuth to find a linear separator between male-authored and female-authored documents. Burrows [7, 8] applied principal components analysis (PCA) on word frequencies to authorship attribution. Kjell [24, 23, 25] analyzed authorship using neural network classifiers, Bayesian classifiers and nearest neighbor classifiers. The authors of [12] applied support vector machines to identify the authors of German newspaper articles. Jockers [20] even compared the performance of five methods (Delta, k-nearest neighbors, the support vector machine, nearest shrunken centroids, regularized discriminant analysis) on the classic authorship attribution problem involving the Federalist Papers.

1.3 Organization of the Thesis

The rest of the thesis is organized as follows. We first present an overview of machine learning; in particular, we introduce the classic support vector machine and its generalization to multi-class classification in Chapter 2. Chapter 3 presents our study of stylometry analysis, which focuses on detecting chrono-divides in writing styles; we describe our algorithm in detail and analyze experimental results. In Chapter 4, we present our method for authorship attribution problems where the true author is known to be in a small group, and extend this method to open class problems where the set of candidate authors of a text is large or may not exist at all. We report the accuracy of our method on closed class problems and show the robustness of our method on open class problems through extensive experiments and the results of three case studies in the last part of that chapter.

Chapter 2

Introduction to Machine Learning Techniques

2.1 Introduction

The main objective of machine learning, a subfield of computer science, is to learn the patterns of given data, which are represented by so-called features or descriptors. The underlying principle of machine learning is that real-life data are not generated randomly; there always exist patterns in them, although we do not know exactly what they are. Machine learning can uncover the patterns of the given training data and make predictions on new data sets. It is widely used in all kinds of areas such as face recognition, page ranking and speech recognition. There are three types of problems in machine learning in general:

• Supervised learning, where the output of each training sample is given, and the goal is to predict the outputs of new samples;

• Unsupervised learning, where no label is given for the training samples, and the goal is to find the structure of the training data;

• Reinforcement learning, where the goal is given and the training data is dynamic.

Generally speaking, machine learning is a process of optimization. To date, many optimization algorithms have been proposed for different types of problems in machine learning, such as Naive Bayes, nearest neighbors, the support vector machine (SVM) and so on. In this thesis I will focus on the SVM, which was proposed by Vapnik [35] in the 1970s and has since been among the most popular machine learning algorithms. It is a supervised learning method that is used in both feature selection and pattern recognition. The SVM has been highly developed in both theory and algorithms, and it is famous for its capability of dealing with large data sets.

2.2 Linear Support Vector Machine

2.2.1 Separable case

In the binary classification setting, given the training data set

$$\{(x_1, y_1), (x_2, y_2), \ldots, (x_l, y_l)\}, \quad x_i \in \mathbb{R}^n,\ \ y_i \in \{1, -1\},\ \ i = 1, \ldots, l,$$

we denote

$$I = \{x_i \mid y_i = 1\}, \qquad II = \{x_i \mid y_i = -1\}.$$

We say that I and II are linearly separable if there is a vector w and a constant b such that

$$x_i^\top w + b > 1 \quad \forall\, x_i \in I, \qquad x_i^\top w + b < -1 \quad \forall\, x_i \in II, \tag{2.1}$$

i.e. the data can be separated by the hyperplane

$$x^\top w + b = 0.$$

[Figure 2.1: Hyperplanes]

[Figure 2.2: Optimal hyperplane]

As we can see in figure 2.1, there are infinitely many hyperplanes that can separate the two groups. Which one should we choose? To best separate the two groups, we need to maximize the margin ρ (equation (2.2)) as in figure 2.2, which is the distance between the hyperplanes b11 and b12. The hyperplane that separates the two groups of the training data set with the maximal margin is called the Optimal Hyperplane, and it is unique (see [36]):

$$\rho = \frac{2}{\|w\|_2}. \tag{2.2}$$

Hence, the optimal hyperplane can be constructed by solving the following primal optimization problem:

$$\min_{w \in H}\ \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(x_i^\top w + b) \ge 1 \quad \forall\, i \in \{1, \ldots, l\}. \tag{2.3}$$

To solve this quadratic optimization problem, we need to find the saddle point of the Lagrange function

$$L(w, b, \lambda) = \frac{1}{2}\, w^\top w - \sum_{i=1}^{l} \lambda_i \big[\, y_i (x_i^\top w + b) - 1 \,\big], \tag{2.4}$$

where λ_i ≥ 0 are the Lagrange multipliers.

By Fermat's theorem, the minimizer of this function has to satisfy the following conditions:

$$\frac{\partial L(w, b, \lambda)}{\partial w} = w - \sum_{i=1}^{l} \lambda_i y_i x_i = 0, \qquad \frac{\partial L(w, b, \lambda)}{\partial b} = \sum_{i=1}^{l} \lambda_i y_i = 0. \tag{2.5}$$

From equation (2.5) we get

$$w = \sum_{i=1}^{l} \lambda_i y_i x_i, \qquad \sum_{i=1}^{l} \lambda_i y_i = 0. \tag{2.6}$$

Plugging equation (2.6) into equation (2.4) transforms the primal problem into its dual problem:

$$\max_{\lambda}\ W(\lambda) = \sum_{i} \lambda_i - \frac{1}{2}\sum_{i,j} \lambda_i \lambda_j y_i y_j\, x_i^\top x_j \quad \text{s.t.} \quad \sum_{i=1}^{l} \lambda_i y_i = 0, \quad \lambda_i \ge 0 \ \ \forall\, i \in \{1, \ldots, l\}. \tag{2.7}$$

Thus, by solving equation (2.7) we obtain the λ_i. We then get w from equation (2.6) and

$$b = -\frac{1}{2}\Big( \min_{x_i \in I} x_i^\top w + \max_{x_i \in II} x_i^\top w \Big).$$

The instances with λ_i > 0 are called support vectors; they are the points closest to the hyperplanes b11 and b12. Finally, we obtain the decision function

$$f(x) = w^\top x + b,$$

and the sign of f(x) is the predicted label of the instance x.
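To make the preceding derivation concrete, here is a minimal sketch of the separable case in Python with scikit-learn (our illustrative choice; the thesis experiments were run in MATLAB). A very large C approximates the hard-margin problem (2.3); coef_ and intercept_ give w and b, and support_ indexes the support vectors, i.e. the instances with λ_i > 0.

    import numpy as np
    from sklearn.svm import SVC

    # Two linearly separable point clouds standing in for classes I and II.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(loc=+2.0, size=(20, 2)),
                   rng.normal(loc=-2.0, size=(20, 2))])
    y = np.array([1] * 20 + [-1] * 20)

    # A very large penalty C approximates the hard-margin primal problem (2.3).
    clf = SVC(kernel='linear', C=1e6).fit(X, y)

    w, b = clf.coef_[0], clf.intercept_[0]          # hyperplane x^T w + b = 0
    print("margin rho =", 2.0 / np.linalg.norm(w))  # rho = 2 / ||w||_2
    print("support vectors:", clf.support_)         # instances with lambda_i > 0
    print("predicted labels:", np.sign(X @ w + b))  # sign of f(x)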

2.2.2 Non-separable case

[Figure 2.3: Linearly non-separable data set]

In the previous part we showed how to find the optimal hyperplane for linearly separable data, but in real life most data sets are either not linearly separable (e.g. figure 2.3), or only separable with very small margins because of the presence of outliers. In this part we generalize the concept of the optimal hyperplane to the non-separable case by relaxing the constraints (2.1). By introducing nonnegative slack variables ξ_1, ..., ξ_l into the constraints (2.1), we can construct the hyperplane with the smallest errors. The optimization problem becomes:

$$\min_{w \in H,\ \xi \in \mathbb{R}^l}\ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} \xi_i \quad \text{s.t.} \quad y_i(x_i^\top w + b) \ge 1 - \xi_i, \quad \xi_i \ge 0 \ \ \forall\, i \in \{1, \ldots, l\}, \tag{2.8}$$

where C > 0 is the parameter that specifies the penalty of misclassification. When we set C to be infinitely large, the above problem reduces to the separable case. As in the separable case, finding the saddle point of the corresponding Lagrangian leads to the dual problem:

$$\max_{\lambda}\ W(\lambda) = \sum_{i} \lambda_i - \frac{1}{2}\sum_{i,j} \lambda_i \lambda_j y_i y_j\, x_i^\top x_j \quad \text{s.t.} \quad \sum_{i=1}^{l} \lambda_i y_i = 0, \quad 0 \le \lambda_i \le C \ \ \forall\, i \in \{1, \ldots, l\}. \tag{2.9}$$

This formula is similar to that of the separable case except for the upper bound on λ_i, and it can be solved in the same way.

2.3 Nonlinear Support Vector Machine

Until now, we have only discussed how to use SVMs to deal with linearly separable data sets. However, for data such as that in figure 2.4, which is not linearly separable, we can extend the linear SVM to more general forms using “kernel” functions and separate the data with a linear SVM in a new space. The basic idea of the nonlinear SVM is to map the input vectors {x_i} into a high dimensional space through a nonlinear mapping φ. In the new space we can construct the optimal hyperplane as in the previous section; the resulting classifier is nonlinear in the original input space.

[Figure 2.4: Linear SVM in new space]

Instead of computing the inner product ⟨x_i, x_j⟩, we now calculate K(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩, which is called a kernel function. Thus the optimization problem is:

$$\max_{\lambda}\ W(\lambda) = \sum_{i} \lambda_i - \frac{1}{2}\sum_{i,j} \lambda_i \lambda_j y_i y_j\, K(x_i, x_j) \quad \text{s.t.} \quad \sum_{i=1}^{l} \lambda_i y_i = 0, \quad 0 \le \lambda_i \le C \ \ \forall\, i \in \{1, \ldots, l\}. \tag{2.10}$$

Similarly, the decision function will be

$$f(x) = w^\top \phi(x) + b = \sum_{i=1}^{l} \lambda_i y_i K(x, x_i) + b, \tag{2.11}$$

and the sign of f(x) is the label of x.

Remark 1. A kernel function K needs to satisfy the conditions of Mercer's theorem, which require that the matrix $K = (K(x_i, x_j))_{i,j=1}^{l}$ be symmetric and positive semi-definite. The following three types are typical:

1. Linear kernel: $K(x, y) = \langle x, y \rangle$;

2. Polynomial kernel: $K(x, y) = (\langle x, y \rangle)^d$;

3. RBF kernel: $K(x, y) = \exp\!\Big(-\frac{\|x - y\|^2}{2\sigma^2}\Big)$.

Different kernels are used to deal with different types of classification problems. For a given training data set without any knowledge of its geometric structure, it is better to try different models by choosing different kernels and their parameters carefully. We usually divide the training data into two parts, one of which is used to train the model and the other (called the validation data set) to test the model. We choose the stable model that yields the highest validation accuracy. For the data we use in our study of authorship identification, the linear kernel proves to be the simplest as well as the best performing of the three types of kernels.
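In practice this kernel-and-parameter search is short to write down. The sketch below is a Python rendering of the workflow just described (and only that; it is not the thesis code): it trains the three kernels from Remark 1 on a training split and keeps the one with the highest accuracy on a held-out validation split.

    from sklearn.datasets import make_circles
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    # A toy data set that is not linearly separable in the input space.
    X, y = make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3,
                                                      random_state=0)

    candidates = [
        SVC(kernel='linear', C=1.0),          # K(x, y) = <x, y>
        SVC(kernel='poly', degree=3, C=1.0),  # scikit-learn's variant of <x, y>^d
        SVC(kernel='rbf', gamma=1.0, C=1.0),  # exp(-gamma ||x - y||^2)
    ]
    # Fit each candidate and keep the one with the highest validation accuracy.
    best = max(candidates,
               key=lambda m: m.fit(X_train, y_train).score(X_val, y_val))
    print("selected kernel:", best.kernel)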

2.4 Multiclass Classification

In the previous section we discussed the classical SVM, which solves binary classification problems. But in most scenarios there are multiple categories in the training set, i.e.

$$\{(x_1, y_1), (x_2, y_2), \ldots, (x_l, y_l)\}, \quad x_i \in \mathbb{R}^n,\ \ y_i \in \{1, 2, \ldots, N\},\ \ i = 1, \ldots, l.$$

How do we extend the binary SVM to the multi-class classification problem?

The basic and most straightforward idea is to decompose the problem into a series of binary classification problems, and then combine the results to obtain the multiclass classification. Many techniques have been developed so far, such as one-vs-rest, one-vs-one, the decision directed acyclic graph (DAG) in [34], error correcting codes in [13], the single machine formulation in [38] and so on. Crammer [10] and Vapnik [36] even consider all the classes together in one optimization formulation, but this is computationally expensive and not widely used. Meanwhile, one-vs-rest and one-vs-one are so simple and effective that they have always been the most popular techniques for multiclass classification problems. Here we will discuss one-vs-rest and one-vs-one in detail.

[Figure 2.5: 3-class classification: (a) one-vs-one, Hij separates class i and class j; (b) one-vs-rest, Hi separates class i from other classes.]

One-vs-rest classification builds N different binary classifiers $\{f_i\}_{i=1}^{N}$, each of which distinguishes one class from the remaining N − 1 classes viewed together as a single class. To construct the i-th binary classifier, we let the samples from the i-th class be the positive samples and the samples from the other N − 1 classes be the negative samples. A test sample x is attributed to the class $j_0 = \operatorname{argmax}_i f_i(x)$. The advantage of this method is that we only need to solve N optimization problems to solve an N-class classification problem. But there are three main disadvantages to this method:

• there is no bound on the generalization error;

• the scale of the decision values may be quite different across classifiers;

• the data is usually extremely imbalanced, with the negative samples dominating each classifier.

One-vs-one classification builds one classifier for each pair of classes, which yields N(N − 1)/2 classifiers $\{f_{ij}\}_{i,j=1}^{N}$ (note $f_{ij} = -f_{ji}$) in total. Each classifier $f_{ij}$ distinguishes classes i and j, and its value can be viewed as the score of a “match” between class i and class j. A sample x is attributed to class $i_0$ through some voting scheme: for example we may choose $i_0 = \operatorname{argmax}_i \sum_j f_{ij}(x)$, or let $i_0$ be the class that has won the most “matches”, $i_0 = \operatorname{argmax}_i \sum_j \operatorname{sign}(f_{ij}(x))$. Although this method has the advantage of more balanced training sample sizes for each classifier, it also has some disadvantages:

• there is no bound on the generalization error;

• it is computationally expensive because it requires N(N − 1)/2 classifiers;

• it can easily cause overfitting.

Both strategies have their advantages and disadvantages, so it is hard to conclude that one is always better. The performance of one-vs-rest and one-vs-one has been observed to be comparable in general [19], so we choose the one-vs-rest strategy in view of its lower computational cost. Because the objective is the overall accuracy, we choose one single parameter set that maximizes overall accuracy rather than selecting a different parameter set for each classifier; the latter is prone to over-fitting, in which case the training accuracy is high but the testing accuracy is usually low.
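Both decomposition schemes reduce to a few lines of code. The following is a schematic Python sketch with scikit-learn (not the thesis implementation), assuming X_train, y_train and X_test are NumPy arrays; decision_function plays the role of f_i in one-vs-rest and of f_ij in one-vs-one.

    import numpy as np
    from itertools import combinations
    from sklearn.svm import SVC

    def one_vs_rest_predict(X_train, y_train, X_test, classes):
        classes = np.asarray(classes)
        # f_i separates class i (positive) from the other N - 1 classes.
        scores = np.column_stack([
            SVC(kernel='linear')
                .fit(X_train, np.where(y_train == c, 1, -1))
                .decision_function(X_test)
            for c in classes])
        return classes[np.argmax(scores, axis=1)]   # j0 = argmax_i f_i(x)

    def one_vs_one_predict(X_train, y_train, X_test, classes):
        classes = np.asarray(classes)
        votes = np.zeros((len(X_test), len(classes)))
        for a, b in combinations(range(len(classes)), 2):
            mask = np.isin(y_train, classes[[a, b]])
            f_ab = SVC(kernel='linear').fit(
                X_train[mask], np.where(y_train[mask] == classes[a], 1, -1))
            s = f_ab.decision_function(X_test)
            votes[:, a] += (s > 0)                  # class a wins the "match"
            votes[:, b] += (s < 0)
        return classes[np.argmax(votes, axis=1)]    # most "matches" won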

2.5 Cross Validation

Cross validation is a strategy to assess the accuracy of a model when a test set is not available. The idea of one round of cross validation is to separate the training data into two parts, one of which is used to train the model, while the remaining part (called the validation data) is used to estimate the performance of the model. In k-fold cross validation (see figure 2.6), the training set is partitioned into k parts, of which k − 1 parts are used to train the model and the remaining part is used to test how accurate the model is. Each part serves as the test set once and only once, so k models are constructed in k-fold cross validation.

[Figure 2.6: The k-fold cross validation]

The advantage of k-fold cross validation is that every sample has one and only one chance to test the model. When the training data set is small or the set of parameters is very large, a single model can easily over-fit. Cross validation efficiently avoids this problem and helps select a better and more stable model, although it usually incurs additional computational cost. In this thesis we use 5-fold cross validation in the training process and apply a plurality voting scheme to label the test samples.
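For concreteness, here is a minimal sketch of the k-fold procedure in Python with scikit-learn (again only an illustration of the scheme described above, not the thesis code):

    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.svm import SVC

    def cross_validation_error(X, y, k=5):
        """Mean validation error over k folds; each fold is held out exactly once."""
        errors = []
        for train_idx, val_idx in KFold(n_splits=k, shuffle=True,
                                        random_state=0).split(X):
            clf = SVC(kernel='linear').fit(X[train_idx], y[train_idx])
            errors.append(1.0 - clf.score(X[val_idx], y[val_idx]))
        return np.mean(errors)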

Chapter 3

Stylometry Analysis: Detecting Chrono-Divide in Writing Styles

3.1 Chrono-divide

The main idea behind statistically or computationally supported authorship attribution is that by measuring some textual features we can distinguish between texts written by different authors. Nearly a thousand different measures, including sentence length, word length, word frequencies, character frequencies, and vocabulary richness functions, have been proposed over the years [32]. Some of these measures, such as frequencies of function words, have proven effective, while others, such as word length, have proven less effective [21]. The field of literary stylometry has seen impressive advances over the years, and has become an increasingly important research field in the digital age with the explosion of texts online.

This part focuses on a particular class of authorship controversies, in which there is a suspected change of authorship at some point in a book. In other words, one suspects that the first X chapters of a book were written by one author while the remaining Y chapters were written by another. Clearly, the authorship controversy for Dream of the Red Chamber falls into this category. Since no two authors have exactly the same writing style, no matter how similar they might be, a book written in such a fashion would have a stylistic discontinuity going from Chapter X to Chapter X + 1. If we can quantify the styles of the two authors by a stylometric function S(n) (a classifier), where n denotes the chapters, or chronologically ordered samples, of the book in question, this stylistic discontinuity will appear as a dividing point in the stylometric function S(n) going from n = X to n = X + 1. Because the samples are ordered by time, we shall call this divide in the stylometric function S(n) a chrono-divide in style, or simply a chrono-divide.

The underlying principle of our study is that if a book is in fact written by two authors A and B, then there should exist a group of features that characterize the difference between their respective styles. These features will lead to a stylometric function that separates the book into two different classes. In the rest of this part we shall use the more conventional term classifier for such a stylometric function. The foundational principle of literary stylometry is built around finding such classifiers. Suppose that a chrono-divide in style exists. Then an effective classifier will show a break point somewhere in the middle of the book, before and after which the classifier gives positive values and negative values, respectively. Thus in analyzing a book suspected to be written by two authors with a chrono-divide, one can look for a classifier that gives rise to such a break point. The existence of such a classifier provides strong support for the two-author hypothesis. Conversely, if such a classifier cannot be found, then we can confidently reject the hypothesis of two authors with a chrono-divide.
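Once the classifier values S(1), ..., S(n) have been computed on the chronologically ordered samples, searching for a break point is mechanical. The Python sketch below is one simple way to do it; the strict all-positive/all-negative condition and the hypothetical min_side guard are our simplifications of the visual inspection used in the case studies later in this chapter.

    import numpy as np

    def find_chrono_divide(scores, min_side=5):
        """Return X if S(n) > 0 for n <= X and S(n) < 0 afterwards, else None."""
        scores = np.asarray(scores)
        for X in range(min_side, len(scores) - min_side):
            if np.all(scores[:X] > 0) and np.all(scores[X:] < 0):
                return X        # chrono-divide between samples X and X + 1
        return None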

We use function characters and words to build and select a group of stylometric features with the highest discriminative power, from which we construct our classifier. We detail our method in the following subsections.

3.2 Methodology

3.2.1 Initial stylometric feature extraction

Suppose the book in question is suspected to be written by two authors. For simplicity we shall call the part written by author A Part A and the part written by author B Part B. In many cases, such as with Dream of the Red Chamber, both Part A and Part B are known. In other cases, such as Micro, they are not precisely known. However, for books suspected to have a chrono-divide from a change of authorship, there is usually a good estimate of where the divide is. Typically the first few chapters can be confidently attributed to A and the last few chapters to B.

We begin by choosing a feature set consisting of the kinds of features that might be used consistently by a single author over a variety of writings. Typically these features include the frequencies of words (or characters for books in Chinese) and phrases, the mean and variance of sentence length, the frequencies of direct speeches and exclamations, and others. In our analysis, to get a better understanding of an author's writing style, we first find the most frequently used characters and words in the book; e.g. we would find the 500 most frequently used characters in the whole book, from which we pick out only, say, n function characters. We choose m words (combinations of characters) among the 300 most frequently used words in the same way. An important point is that by selecting only function characters and words we obtain a selection of characters and words that are content independent. This leads to an initial set of features consisting of the frequencies of the n characters and the m words, plus the mean and variance of sentence length as well as the frequencies of direct speeches and exclamations. These features are computed over given sample texts of the book (e.g. chapters). We normalize each sample text in the following way: the median of each feature (the mean and variation of sentence length and the frequencies of direct speeches, exclamations, the n characters and the m words) over the works of A and B is set to 1. For each sample we now have n + m + 4 features.
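In code, extracting the raw feature vector of one sample text might look as follows. This is a Python sketch under simplifying assumptions: function_chars and function_words are placeholders for the selected content-independent characters and words, and the sentence, direct-speech and exclamation heuristics here are illustrative, not the exact rules used in the thesis.

    import re
    import numpy as np

    def extract_features(text, function_chars, function_words):
        """Compute the n + m + 4 raw features of one sample text."""
        total = max(len(text), 1)
        char_freqs = [text.count(c) / total for c in function_chars]  # n features
        word_freqs = [text.count(w) / total for w in function_words]  # m features

        # Sentence lengths: split on common end-of-sentence punctuation.
        sentences = [s for s in re.split(r'[。！？.!?]', text) if s.strip()]
        lengths = [len(s) for s in sentences] or [0]

        direct_speech = text.count('“') / total    # proxy: opening quotation marks
        exclamations = (text.count('！') + text.count('!')) / total

        return np.array(char_freqs + word_freqs +
                        [np.mean(lengths), np.var(lengths),
                         direct_speech, exclamations])

Each coordinate is then divided by its median over the works of A and B, per the normalization described above.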

3.2.2 Data preparation

Having constructed the appropriate feature vectors, we build a distinguishing model through a machine learning algorithm. Doing so requires careful data preparation. Since we usually have in hand only limited samples while the number of features is very large, building a model directly on the entire book can easily lead to over-fitting. To overcome the over-fitting problem, we use the standard technique of separating the whole data set into samples consisting of training data and test data. Our model is established based only on the training data, while its performance is tested on the independent test data. If we already know Part A and Part B, then a subset of each can be designated as training data. For books suspected to have a chrono-divide in style, the training data will consist of the first few chapters and the last few chapters. The rest of the book is used as test data.

In order to obtain more training and testing sets we chunk the book in question into smaller pieces of sample texts of relatively uniform size and style. In all the books we have studied, we have kept the sample texts at least 1000 characters long. In the case of Dream of the Red Chamber each sample text is a chapter.
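The chunking step is equally simple to write down; in the Python sketch below, chunk_size is a hypothetical parameter, with min_size enforcing the 1000-character floor used in our studies.

    def chunk_text(text, chunk_size=1500, min_size=1000):
        """Split a book into consecutive sample texts of roughly uniform size."""
        chunks = [text[i:i + chunk_size]
                  for i in range(0, len(text), chunk_size)]
        # Fold a too-short trailing piece into the previous sample so that
        # every sample is at least min_size characters long.
        if len(chunks) > 1 and len(chunks[-1]) < min_size:
            chunks[-2] += chunks[-1]
            chunks.pop()
        return chunks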

3.2.3 Feature subset selection

When we build the authorship analysis model using the training data only, we do not use all n + m + 4 features. Instead we start out with all of them, but eventually select a subset of features that achieves the highest discriminative power. Feature subset selection is well understood for high dimensional data analysis in the machine learning context. First, the number of discriminative features may be small, because the number of features an author uses in a consistently different way from others is usually not very big. Moreover, the classifier can perform very poorly if too many irrelevant features are included in the model. In this thesis we use Support Vector Machine Recursive Feature Elimination (SVM-RFE), introduced in [17], to realize feature selection.

SVM-RFE is a feature ranking method. Given a set of samples we can use a linear SVM to build a linear classifier, which ranks the importance of the features according to their weights. As mentioned above, because of the large feature size and small sample size, the classifier might not be robust. In addition, high correlation between features may result in small weights for relevant features. Thus the ranking given directly by the SVM classifier may be inaccurate. In order to refine the ranking, the least important feature is removed and the linear SVM classifier is retrained. This new classifier provides a refined ranking for the remaining features. The process is then repeated until the rankings of all features are refined. This is the SVM-RFE method introduced in [17]. The idea underlying SVM-RFE is that in each repeat, although the overall ranking may be poor, the least important feature is very unlikely to be a relevant one. By iteratively eliminating the least important features, the new classifiers become more and more reliable and hence provide better and better rankings. In the application of gene expression data analysis, SVM-RFE has been proven to be substantially superior to direct SVM ranking without RFE.
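The elimination loop itself is short. Below is a straightforward Python sketch of SVM-RFE with a linear SVM, ranking by squared weight (scikit-learn also ships a ready-made version as sklearn.feature_selection.RFE); it is an illustration of the method of [17], not the thesis code.

    import numpy as np
    from sklearn.svm import SVC

    def svm_rfe_ranking(X, y):
        """Rank features by recursive elimination, most important first."""
        remaining = list(range(X.shape[1]))
        eliminated = []
        while len(remaining) > 1:
            w = SVC(kernel='linear').fit(X[:, remaining], y).coef_[0]
            worst = int(np.argmin(w ** 2))     # least important remaining feature
            eliminated.append(remaining.pop(worst))
        eliminated.append(remaining.pop())
        return eliminated[::-1]                # last eliminated = most important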

However, in general SVM-RFE is not stable under perturbation of the samples. A small change in the samples may result in a very different feature ranking. There are two possible reasons. One is that highly correlated variables are too sensitive and may be ranked in different orders by different classifiers. The other is that, due to randomness, some subsets of samples might be singular in the sense that they are less representative of the whole data structure. As a result the SVM classifiers over-fit and the feature ranking by SVM-RFE is therefore unreliable. The first situation is less harmful for classification performance while the second is vital. To overcome this phenomenon and guarantee the stability of the ranking, we use a pseudo-aggregation technique. We randomly choose a subset of training samples on which to run SVM-RFE and select the top important features. This process is repeated tens or hundreds of times, and only those features that appear important very frequently are deemed truly important. This removes the randomness and results in a much more reliable ranking.

With this ranking of features, we can conclude which statistics are useful for quantifying writing style. We use cross validation to select the number of features included in the final classification model. This group of features is a stable and highly discriminative subset of features. A final classifier is built to classify the test data.

3.2.4 Data analysis

The classifier we have built is used to analyze the authorship question. We first examine the discriminative power of the classifier on the training data. If it cannot even reliably classify the training data, we can convincingly reject the two-author hypothesis. Even if it can, the telling story is whether it can classify, or detect a chrono-divide in, the test data. If it fails, then again we should reject the two-author hypothesis. On the other hand, if the classifier classifies the training data, and it can also classify the test data accurately or detect a clear chrono-divide, then we can convincingly conclude that the book contains two different writing styles and can therefore be confidently attributed to two different authors. Moreover, the feature subset and the classifier describe the difference between the two authors' writing styles.

3.2.5 The algorithm

In the following we summarize our algorithm:

1. Initialize the data (the book), which contains parts A and B suspected to be written by two different authors.

2. Split part A and part B into many sections and extract the features of each section as described in section 3.2.1. This forms the whole data set D, containing DA and DB.

3. Choose a portion (e.g. 20%-30%) of DA and DB respectively to form the test data set, and leave the remainder as the training data set. The test data will not be used until the final model is built.

4. Randomly choose a subset of the training data as modeling data and the rest (again 20%-30%) as validation data. Run SVM-RFE on the modeling data and use the validation data to determine all the parameters used. This provides a ranking of all the n + m + 4 features extracted in step 2.

5. For d ranging from 1 to n + m + 4, build a classifier using only the top d features and evaluate its performance on the validation data. The best model is the one with minimal validation error and the minimal number of top features. The feature subset of this best model is recorded.

6. Repeat steps 4 and 5 T times to obtain T best models and T subsets of corresponding important features. We recommend T to be larger than 50. Rank all the features in these subsets according to their appearance frequency. Denote by N the total number of features included.

7. For d = 1, ..., N, use cross validation to select the number of features that should be included in the final classifier. Denote it by d*. Note that we require both the cross validation error and the number of features to be as small as possible.

8. Retrain the model using the whole training set based on the top d* important features.

9. Use the classifier to classify the test data. Draw the conclusion according to its performance.

Since our ranking process involves the aggregation of a large number of models that are trained using SVM-RFE on different subsets of the same data source, we refer to our approach as the pseudo-aggregation SVM-RFE method.

3.3 Case Study: Analysis of Dream of the Red Chamber

3.3.1 Background

Dream of the Red Chamber by Cao Xueqin is one of China's Four Great Classical Novels. For more than one and a half centuries it has been widely acknowledged as the greatest literary masterpiece in the history of Chinese literature. The novel is remarkable for its vividly detailed descriptions of life in 18th century China during the Qing Dynasty and of the psychological affairs of its large cast of characters. There is a vast literature in Redology, a term devoted exclusively to the study of Dream of the Red Chamber, that touches upon virtually all aspects of the book one can imagine, from the analysis of even minor characters to in-depth literary study of the book. Much of the scope of Redology is outside the focus of this thesis.

The original manuscript of Dream of the Red Chamber began to circulate in the year 1759. The problems concerning the text and authorship of the novel are extremely complex and remain very controversial even today; they are an important part of Redology studies. Cao, who died in 1763-4, did not live to publish his novel. Only hand-copied manuscripts – some 80 chapters – had been circulating. It was not until 1791 that the first printed version was published, which was put together by Cheng Weiyuan and Gao E and was known as the Cheng-Gao version. The Cheng-Gao version has 120 chapters, 40 more than the various hand-copied versions that were circulating at that time. Cheng and Gao claimed that this "complete version" was based on previously unknown working papers of Cao, which they obtained through different channels. It was these last 40 chapters that became the subject of intense debate and scrutiny. Most modern scholars believe that these 40 chapters were not written by Cao. Many view those late additions as the work of Gao E. Some critics, such as the renowned scholar Hu Shi, called them forgeries perpetrated by Gao, while others believe that Gao was duped into taking someone else's forgery as an original work. There is, however, a minority of critics who view the last 40 chapters as genuine.

The analysis of the authenticity of the last 40 chapters has largely been based on the examination of plots and prose style by Redology scholars and connoisseurs. For example, many scholars consider the plotting and prose of the last 40 chapters to be inferior to those of the first 80 chapters. Others have argued that the fates of many characters at the end are inconsistent with what the earlier chapters had been foreshadowing. A natural question is whether a mathematical stylometry analysis of the book can shed some light on this authenticity debate.

Although there is a vast Redology literature going back over 100 years, the number of studies of the book based on mathematical and statistical techniques is surprisingly small, particularly in view of the fact that such techniques have been used widely in the West for settling authorship questions. Among the notable efforts, Cao [30] meticulously broke down a number of function characters and words into classes according to their functions. By analyzing their frequencies in the first 40 chapters, the middle 40 chapters and the last 40 chapters, Cao concluded that the first 80 chapters and the last 40 chapters were written by different authors. Zhang & Liu [37] examined the occurrence of 240 characters in the book that are outside the GB2312 encoding system, and found that 210 of them appear exclusively in the first 80 chapters while only 20 of them appear exclusively in the last 40 chapters. This led to the same conclusion as the other authors. Yu [31] studied the authorship by combining both historical knowledge and statistical tools, testing two hypotheses: that the last 40 chapters were not written by the same author, or that they were written by the same author. His study focused on the frequencies of 5 particular function characters, the proportion of text to poems in each chapter, and a few other stylometric peculiarities such as the number of sentences ending with the character "Ma". Using various statistical techniques, the comparisons led the paper to conclude that it is unlikely that the first 80 chapters and the last 40 chapters were written by the same author. At the same time, using historical knowledge about the book and the original author Cao Xueqin, the paper also speculated that it was not likely that the last 40 chapters were created entirely by a single different author such as Gao E. In the opposite direction, the studies of Chan [9] and Li & Li [16] concluded that the entire book was likely written by a single author. The study [16] focused on the usage of function characters while [9] examined the usage of some eighty thousand characters. Both studies tabulated the frequencies of the selected characters, which led to a frequency vector for each of the first 40 chapters, the middle 40 chapters and the last 40 chapters. The correlations of these frequency vectors were computed. In [16] the correlations were found to be large enough for the authors to conclude that the entire 120 chapters of the book were written by the same author. In [9] a fourth frequency vector using parts of the book The Gallant Ones was added for comparison. The author found significantly higher correlations among the first three frequency vectors from chapters of Dream of the Red Chamber than between the fourth frequency vector and the first three. This fact formed the basis of the author's conclusion that all 120 chapters were written by a single author. A different conclusion was reached by Li [40]. By analyzing the frequencies of 47 function characters and applying several statistical tests, the author conjectured that the last 40 chapters were put together by Gao E using unedited and unfinished manuscripts by Cao Xueqin and his family members.

Although some of these aforementioned studies are impressive in their scope, missing conspicuously from the Redology literature are studies based on the latest advances in literary stylometry, particularly some of the new and powerful methods from machine learning theory. While comparing the frequencies of function characters and words is clearly a viable way to analyze the authorship question, care needs to be taken to account for random fluctuations of these frequencies, especially when some of the function characters and words used for comparison have limited occurrences overall in the book, and sometimes none at all in some chapters. None of the aforementioned studies employed cross validation to address random fluctuations. We have substantial reservations about drawing conclusions from correlations alone, as in the studies of Chan [9] and Li & Li [16], because the differentiating power of any single variable such as correlation is rather limited. It would be interesting to see a more comprehensive study of correlations on a large corpus of texts in Chinese to determine their effectiveness as a metric for authorship attribution, something the authors failed to do in both studies. The use of the book The Gallant Ones in [9] for benchmark comparison is particularly curious to us, especially considering that the author did not restrict to function characters. The two books are of two different genres and differ in their respective background settings. Considering these differences and the fact that The Gallant Ones is known not to have been written by Cao Xueqin, it would be a shock if the correlation between the last 40 chapters of Dream of the Red Chamber and the first 80 chapters were not higher than the correlation between the last 40 chapters and The Gallant Ones. It is possible that the correlation computed in [9] says more about the genre than about the authorship of the books. Again, without extensive evidence that, using the same technique, the correlation between two bodies of texts written by different authors is generally low even when the plots are closely related, the argument made in [9] is unconvincing at best.

Having established a rigorous protocol for finding chrono-divides, we are now in a position to apply it to investigate the authorship controversy of the Cheng-Gao version of Dream of the Red Chamber. In particular, we investigate the existence of a chrono-divide at Chapter 80.

3.3.2 Separability of the chapters by Cao and Gao

The book is first divided into samples. To balance the number of samples, we generate one sample for each of the first 80 chapters, while following the conventional practice of duplicating each of the last 40 chapters into two to obtain 80 samples. From these samples we extract the features by computing the statistics proposed in subsection 3.2.1. These features are then normalized for fair comparison. In total we have 196 variables: the 144 characters and 48 words, the normalized mean and variation of sentence length, and the frequencies of direct speeches and exclamations.

To investigate the authorship controversy we perform three separate tests. First we build a classifier for the whole book and look for the existence of a chrono-divide at Chapter 80. For added robustness and reliability we also perform the same tests on only the first 80 chapters and on the last 40 chapters.

In the first experiment we apply our method to the whole Cheng-Gao version of Dream of the Red Chamber. Samples from the first 60 chapters are designated as training samples for one class, while samples from the last 30 chapters are designated as training samples for the other class. The remaining samples, from Chapters 61 to 90, are held out as testing samples. The training samples are further randomly split into modeling data of 80 samples and validation data of 40 samples. The SVM-RFE is repeated 100 times and d* is chosen using 50 cross validation runs. We have the following observations.

Instability of SVM-RFE. The randomness of the modeling set has resulted in very substantial fluctuations in the number of features selected as well as feature rankings. The resulted classifier may also perform quite differently. Table 3.1 lists the features selected using two different modeling data sets. One selects 11 features and the other selects only 4, with only one feature in common. The classifiers also perform differently. The experiments clearly establish the instability of SVM-REF.

Given such instability one cannot reliably draw any conclusions from any single run. For example, if a modeling data set separates the training data well it might be due to overfitting. Conversely, if it separates poorly it might be due to underfitting. This problem is overcome with our pseudo aggregate SVM-RFE method.

Modeling set    Features selected                                  Validation error
1               qu, de, jiu, hui, zhi, dao, shi, ne, bie, zuo      5/40
2               hui, fang, mei, haoxie                             1/40

Table 3.1: The features and validation errors of the classifiers obtained from two randomly selected modeling subsets.

Stability of Pseudo Aggregate SVM-RFE. Our pseudo aggregate SVM-RFE approach repeats SVM-RFE 100 times using randomized data sets. The data set from each repeat is used to select a set of features, from which a classifier is built. For simplicity we shall refer to the data set, features and resulting classifier from a repeat together as a model.
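Under the same assumptions as the previous snippet, the repeat loop can be sketched as follows; each repeat contributes one model, i.e. a pair of selected features and validation accuracy, to be ranked by relative frequency below.

    def pseudo_aggregate_rfe(X, y, M=100):
        # Collect M models from SVM-RFE over randomized modeling sets,
        # using the hypothetical one_rfe_repeat helper sketched earlier.
        models = []
        for j in range(M):
            selected, accuracy = one_rfe_repeat(X, y, seed=j)
            models.append((selected, accuracy))
        return models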

To counter random fluctuations we consider important features to be those that appear frequently among the 100 classifiers. This reduces the instability caused by randomness. In fact, our belief is as follows: if the two classes are well separated, there should exist a set of features that help to build a good classifier. Most modeling subsets should be able to select out these features, and only a limited number of modeling sets might be atypical and miss them. Conversely, if the two classes cannot be well separated, no consistently discriminative features exist; different modeling sets may lead to totally different feature subsets. As a result, no feature appears with high frequency across all 100 models. This philosophy, however, is only partially true. When the two classes cannot be separated, the modeling process can sometimes overfit the data by selecting many variables, which results in high absolute frequencies for some less important or irrelevant features. Such a phenomenon is usually accompanied by a large number of variables and low validation accuracy. To improve the process we propose a more appropriate metric, which we call relative frequency. In relative frequency we weight the frequency by two criteria. Under the first criterion, a variable appearing in short models is

weighted more than a variable appearing in long models. This leads to a weight of h(n_j) for a variable in the jth model, with n_j being the number of variables in the jth model. Under the second criterion, a variable in a model with high predictive accuracy is weighted more than a variable in a model with poor predictive accuracy. This provides another weight g(A_j) for a variable in the jth model, where A_j denotes the accuracy of the jth model computed from the validation process. Mathematically the relative frequency for a variable x_i in a test run of M repeats is defined as

    rf(x_i) = (1/M) Σ_{j=1}^{M} g(A_j) h(n_j) 1(x_i appears in model j).    (3.1)

In our study we always set M = 100. Also, we set g(A_j) = exp((A_j − 1)/[2A_j − 1]_+), where [t]_+ = max{0, t}, and h(n_j) = [1 − c n_j]_+ for some constant c. For g(A_j) the idea is that the weight should decay fast if the accuracy is close to 50% or less, because such an accuracy indicates that the classifier is simply not effective. For h(n_j) we put in a penalty on the number of variables used in a model. In our experiments we have chosen c = 1/30, which seems to work well.
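The relative frequency of equation (3.1) can then be computed directly from the collected models. Below is a minimal sketch, assuming models is the list of (selected features, validation accuracy) pairs produced above.

    import math
    from collections import defaultdict

    def relative_frequency(models, c=1/30):
        M = len(models)
        rf = defaultdict(float)
        for selected, A_j in models:
            h = max(0.0, 1.0 - c * len(selected))    # h(n_j) = [1 - c n_j]_+
            denom = max(0.0, 2.0 * A_j - 1.0)        # [2 A_j - 1]_+
            # g(A_j) decays fast as the accuracy approaches 50% or less.
            g = math.exp((A_j - 1.0) / denom) if denom > 0 else 0.0
            for i in selected:
                rf[i] += g * h / M
        return rf   # feature index -> relative frequency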

Our experiments show that features yielded from relative frequency rankings are very stable and consistent. We have performed runs of 100 repeats using different random seeds in MATLAB, and the results are always similar. An additional benefit of using relative frequency instead of absolute frequency is that the existence of an effective classifier is typically accompanied by high relative frequencies for the top features, while low relative frequencies for the top features usually imply poor separability. Hence we can use relative frequency as a simple guide on the separability of the samples. We will show some examples in the next section.

Results and Conclusion. In Experiment 1 we have performed a run of 100 repeats on the entire Cheng-Gao version of Dream of the Red Chamber. Altogether 70 features appeared in at least one model. However, only a small number of them appeared with high enough frequency to be viewed as important. We apply cross validation to select the number of features, and the mean cross validation error rate against the number of features is plotted in Figure 3.1 (a). The figure tells us that 10 to 50 features are enough to tell apart the styles of the two parts. Using fewer characters and words is insufficient, while using more also degrades the performance by bringing in too much noise. The small cross validation error rate is encouraging, and it already hints at a strong possibility that the two training sample sets have stylistic differences significant enough to support the two-author hypothesis.

To settle the two-author hypothesis more definitively we apply our classifier to the test data, which until now has never been used in the feature selection and classifier modeling process. In particular we investigate the existence of a chrono-divide in the values produced by the classifier. Figure 3.1 (b), which plots these values, clearly shows a chrono-divide at

Chapter 80: for Chapters 81-90 the classifier yields all negative values, while for Chapters

61-80 the classifier yields all positive values with the exception of Chapter 67. Allowing for some statistical aberrations, our results provide extremely convincing if not irrefutable evidence that there exist clear stylometric differences between the writings of the

first 80 chapters and the last 40 chapters. This difference strongly supports the two-author hypothesis for Dream of the Red Chamber. We also note that our investigation did not need to assume that the stylistic change should occur at Chapter 80. The fact that the chrono-divide we detected is indeed at Chapter 80 lends even stronger support to the two-author hypothesis.


Figure 3.1: Experiment 1: (a) Mean cross validation error rate; (b) Values of SVM classifier on chapters 60-90.

Interestingly, the fact that Chapter 67 appeared as an “outlier” in our classification serves as further evidence of the validity of our analysis. It was only after the tests that we realized that the authorship of Chapter 67 is itself one of the controversies in Redology. Unlike the main controversy over the authorship of the first 80 chapters and the last 40 chapters, experts are less unified in their positions here. Again, our results strongly suggest that Chapter 67 is stylistically different from the rest of the first 80 chapters, and it may not have been written by

Cao. Our finding is consistent with the conclusion of [2].

3.3.3 Non-separability of the first 80 chapters

To further validate our method we apply the same tests to the first 80 chapters of Dream of the Red Chamber to see whether we can find a chrono-divide (Experiment 2). We use the

first 30 and last 30 chapters as the training data and leave chapters 31-50 as the test data.

Figure 3.2 shows the mean cross validation error and the values of the SVM classifier on the test data, Chapters 31-50. The experiment shows that many more features were selected over the 100 repeats, implying the difficulty of finding a consistent subset of discriminative features.

The large errors on the training data also indicate the difficulty of separation. When the classifier is applied to the test data, there is clearly no chrono-divide. Thus our method yields a conclusion that is completely consistent with what is known.


Figure 3.2: Experiment 2: (a) Mean cross validation error rate; (b) Values of SVM classifier on chapters 31-50. Note there is no chrono-divide.

3.3.4 Analysis of chapters 81-120: style change over time

We next apply our method to the last 40 chapters (Experiment 3). Our first experiment has already confirmed that they are unlikely to have been written by Cao. However, there is still debate over whether these chapters were written entirely by one author (most likely Gao himself) or by more than one author. Our mathematical analysis may offer some insight here.

We split the 40 chapters into two subsets as before. The training data include Chapters

81-95 as one class and Chapters 106-120 as another. The test data are the middle 10 chapters.

Because of the relatively small number of samples we subdivided each chapter into 2 sections to increase the sample size. As a result we now have 60 samples in the training data and 20 in the test data, with 2 samples corresponding to each chapter. The mean cross validation error of the final classifier and its classification values on the test samples are shown in Figures 3.3 (a) and (b) respectively.

In this experiment we observe that the performance, in terms of both the classifier and the feature ranking, is noticeably worse than in Experiment 1 but substantially better than in Experiment 2. Furthermore, unlike the results from the first two experiments, the values from the classifier show an interesting trend. Compared with Figure 3.2 (b), where the values appeared to lack any order, the values here exhibit a clear gradual downward shift. On the other hand, compared to Figure 3.1 (b), the values plotted in Figure 3.3 (b) do not show a sharp chrono-divide, even though they change gradually from positive to negative. What this tells us is that the writing style of the last 40 chapters underwent a gradual change, but this change is unlikely to be due to a change of authorship.

Our results here could be subject to several interpretations. One plausible interpretation is that Gao might indeed have obtained an incomplete set of manuscripts by Cao and tried to complete the novel based on what he had obtained. The style change is then a result of the dwindling supply of genuine material by Cao as the story developed. A more plausible interpretation is that the last 40 chapters were written by someone such as Gao trying to imitate Cao’s style, and that over time the author became sloppier and returned more and more to his own style.


Figure 3.3: Experiment 3: (a) Mean cross validation error rate; (b) Values of SVM classifier on chapters 96-105, which correspond to samples 31-50 of the 80 samples. Note that two samples come from each chapter in this experiment.

3.3.5 Comparison with Continued Dream of the Red Chamber

It is worth mentioning that there are several other attempts to complete Dream of the Red

Chamber from its first 80 chapters, among them Continued Dream of the Red Chamber by Qi Zichen. Using the same features as for the classifier in Experiment 1, we can compute the pairwise Euclidean distances between all chapters, including the chapters of Continued Dream of the Red Chamber; see Figure 3.4. Surprisingly, although these features were selected to emphasize the differences between Cao and Cheng-Gao, they lead to even larger distances between the first 80 chapters and the chapters of Continued Dream of the Red Chamber. This clearly implies that the style of the 40 chapters by Cheng-Gao is more similar to that of the 80 chapters by Cao than is the style of Continued Dream of the Red Chamber. Perhaps this is why the Cheng-Gao version is more popular than the other versions.


Figure 3.4: Distances between the first 80 chapters of the Cheng-Gao version, the last 40 chapters of the Cheng-Gao version, and 30 chapters of Continued Dream of the Red Chamber.

3.4 Case Study: Analysis of the other three Great Classical Novels

To further bolster the credibility of our approach we test our method on the other three Great

Classical Novels in Chinese literature, Romance of the Three Kingdoms, Water Margin, and

Journey to the West. Unlike Dream of the Red Chamber, there is no authorship controversy for these three novels. Thus if our method is indeed robust we should expect negative answers to the two-author hypotheses for all of them, with no chrono-divides found.

As with Dream of the Red Chamber, we split each of the three novels into training samples and test samples. Both Romance of the Three Kingdoms and Water Margin have

120 chapters. In both cases we designate the first 30 chapters and the last 30 chapters as the two classes of training data, and the middle 60 chapters as test data. For Journey to the

West the two classes of training data are the first and last 25 chapters respectively, with the middle 50 chapters as test data.

We use the same procedure to test for chrono-divides in the three novels. Compared to Dream of the Red Chamber, the selected features show much lower relative frequencies, indicating difficulty in differentiating between the writing styles. Table 3.2 shows the relative frequencies (with c = 1/30) of the top 8 features for each of the four Great Classical Novels.

Also of note is that in the case of Water Margin, 51 features are used to build a classifier from the 60 training samples, which is clearly another strong indication of the difficulty.

Novel                            Relative frequencies of top 8 features
Dream of the Red Chamber         0.57  0.46  0.43  0.36  0.31  0.30  0.29  0.19
Romance of the Three Kingdoms    0.31  0.27  0.26  0.25  0.23  0.22  0.17  0.15
Water Margin                     0.18  0.17  0.16  0.16  0.14  0.11  0.11  0.10
Journey to the West              0.03  0.03  0.02  0.02  0.02  0.02  0.02  0.02

Table 3.2: Relative frequencies of the top ranked 8 features in each of the four Great Classical Novels.

Figure 3.5 plots the values from the classifiers for all three novels. In all cases the values

fluctuate in such a way that it is quite clear that no chrono-divides exist, as expected.

This analysis shows that our approach can reliably reject the two-author hypothesis when it is false, lending further support to the effectiveness and robustness of our method.

Figure 3.5: Classification results on the testing samples of the other three classical novels: (a) Romance of the Three Kingdoms; (b) Water Margin; (c) Journey to the West.

3.5 Case Study: Chrono-divide of Micro

Micro, a techno-thriller published posthumously in 2011, is Michael Crichton’s final novel.

It was found on his computer upon his death in 2008 as an unfinished manuscript. Harper

Collins commissioned science writer Richard Preston to complete the novel from Crichton’s notes and research. Although Dream of the Red Chamber is in Chinese, the principle of our

method should apply to books in other languages. Micro thus provides us with a good test example. In this case study we will use our approach to confirm and locate the chrono-divide of Micro. We will also perform a different, new test using classifiers built directly from other books written by Crichton and Preston. The new test serves both as a validation of our method and as a point of comparison. Note that this second option is not available for Dream of the Red Chamber.

In the direct classifiers test we use the other books written by Crichton and Preston to generate the training data. A total of 17 books by Crichton and 2 books by Preston were used for training. The initial features consist of the frequencies of the 241 most frequently used words in these books. To build classifiers, each book was divided into multiple pieces, with each piece containing approximately 2000 words. The frequencies of the 241 selected words in each piece form a sample point. Overall, 782 data points for Crichton and 104 data points for Preston were generated. To overcome the imbalance between the sample points for Crichton and Preston, we used only 728 samples for Crichton and split them into 7 subsets. Each subset is combined with the samples for Preston to form a training data set, from which we build a linear classifier. In total, 7 classifiers were constructed. To apply the classifiers to detect the chrono-divide in Micro, we chunk the book into 56 parts, each

containing about 2000 words. Each part provides a testing sample point. We applied the 7 classifiers to this testing data. The average values of the 7 classifiers are plotted in Figure 3.6. The result shows a break point at around the 15th-16th sample points.
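A compact sketch of this ensemble step is given below, with assumed array names: crichton, preston and micro hold the word-frequency feature rows described above. This is illustrative only, not our original implementation.

    import numpy as np
    from sklearn.svm import LinearSVC

    def ensemble_scores(crichton, preston, micro, n_subsets=7, seed=0):
        # Balance the classes: keep 728 of the 782 Crichton samples and
        # split them into 7 subsets of 104, matching the 104 Preston samples.
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(crichton))[:len(preston) * n_subsets]
        scores = np.zeros(len(micro))
        for subset in np.array_split(idx, n_subsets):
            X = np.vstack([crichton[subset], preston])
            y = np.r_[np.ones(len(subset)), -np.ones(len(preston))]
            clf = LinearSVC(max_iter=10000).fit(X, y)
            scores += clf.decision_function(micro) / n_subsets
        return scores   # a sign change marks the candidate chrono-divide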


Figure 3.6: Micro: average values of the 7 classifiers.

We can now compare the above method to the earlier method used for Dream of the Red Chamber.

We assume that, compared with the overall style of an author across multiple books, the style of the author within a single book is more consistent. Accordingly, we divided Micro into 112 parts of approximately 1000 words each. The 265 most frequently used content-independent words from the book were used as the initial features. We use the first 22 sample points and the last 22 sample points as training and validation data and the middle

68 sample points as test data. The classification results are shown in Figure 3.7. A clear break point can be seen around the 29th-30th samples.

The two experiments confirm the existence of a chrono-divide in Micro and provide further evidence of the validity of our original approach for discovering and locating chrono-divides. As a by-product, our results show that the change of authorship in Micro occurred between 1/4 and 1/3 of the way through the book. This is consistent with what Richard Preston has indicated in several interviews about the book.


Figure 3.7: Micro: Classification result by the classifier obtained using the first 22 parts and the last 22 parts of the book as training samples.

3.6 Conclusion

Inspired by the authorship controversy of Dream of the Red Chamber and the application of

SVM in the study of literary stylometry, we have developed a mathematically rigorous new method for the analysis of authorship by testing for a chrono-divide in writing styles. We have shown that the method is highly effective and robust.

Applying our method to the Cheng-Gao version of Dream of the Red Chamber has led to convincing if not irrefutable evidence that the first 80 chapters and the last 40 chapters of the book were written by two different authors. Furthermore, our analysis has unexpectedly provided strong support to the hypothesis that Chapter 67 was not the work of Cao Xueqin either.

Applying our method to Micro, we are able to confirm the existence of a chrono-divide and identify its location. This provides strong evidence for attributing the first 1/4 of the work to Michael Crichton and the remaining 3/4 to Richard Preston.

The robustness of our approach is also evidenced by its ability to reject the multiple author hypothesis when there is no chrono-divide, as we have done for the other three classical Chinese novels.

Chapter 4

Open Class Authorship Identification

4.1 Introduction

Who wrote Fighting for Our Lives, who wrote Recipes for Disaster and who wrote

The Animated Skeleton? There are numerous books that were published anonymously or under pseudonyms. Although authorship attribution has a long history, most studies are confined to close authorship identification cases in which the author is limited to a small set of potential candidates. For example, the Federalist papers were known to be written by three authors (Hamilton, Madison and Jay), and it is only a matter of deciding which paper was written by which of the three. In contrast, the aforementioned books have no reliable set of “suspects”, and the identification of their authorship becomes an open case. The close case identification problem is generally far less challenging.

With a small set of potential authors it is often viable to simply look for the closest match in some features, and this is in fact how most of these close cases have been studied. However, this approach cannot be applied when the number of candidate authors is very large (in some cases it may be the entire database). Thus for open case identification problems we need completely new approaches, ones that allow us to reliably pick out the true author among a large number of authors. So far, there has not been a reliable method for handling open class authorship identification.

Authorship attribution is one of the most prominent problems in the broad research area

of literary stylometry, which has seen rather explosive growth in recent years. The basic idea behind authorship attribution is that one can always find measurements that distinguish the styles of different authors. Although the style of an author may vary from book to book due to differences in genre, it should not differ too much; with proper measurements, the difference in style between authors should be much larger than the variation within a single author. Thus, to find the author of a book whose set of candidate authors is known, we can use some books from each candidate author as training samples and build classifiers between the different authors. The purpose of the first part of this chapter is to solve close class problems, where the set of candidate authors is small. In other words, if a book is suspected to have been written by author A_1, A_2, or A_3, we can always find some features that discriminate the styles of these authors and attribute the book to one of the candidates. This forms the basis of close case authorship identification.

One of the fundamental requirements for conducting open case identification is access to a large database of authors and their work. Fortunately, there are many databases available either in the public domain (e.g. Project Gutenberg) or through library subscription (e.g. the Hathi Trust collection). But even with a large database, finding the author of a book such as Fighting for Our Lives when we have no clue about its authorship is challenging. Authorship in an open class problem is much harder to study, and very little work has been done so far. Harder still is to do it efficiently within a large database. In this chapter, we propose a new method for open class problems in authorship attribution, using the idea of randomization together with our method for close class problems, by which we can detect the true author of a text within a large database of authors and do so efficiently. We can attribute the text to its true author if he or she is in the database, and otherwise we can conclude that the true author is not in the database. We will describe and

demonstrate the reliability of our algorithm in the next sections. Our method is based on a new method for close class identification and an author randomization technique. In the

final part of the chapter we conduct three case studies by analyzing three books: Cuckoo’s

Calling, To Kill a Mockingbird and Dreams from My Father. All three books are involved in controversies, some of which are still ongoing. We show that our method can almost unequivocally settle the controversies.

4.2 Close Class Problems

4.2.1 Methodology

As we mentioned earlier, the objective of close case identification is to construct classifiers for the candidate authors and find the best match within the group using certain metrics. Here we build SVM classifiers using word frequencies as features. Instead of using only function words, we use the words that appear most commonly in the samples we are analyzing. This turns out to be more effective in our testing, as is also pointed out in [7].

Let A_1, A_2, ..., A_L be the authors in a group within which we would like to find the closest match for a certain text in question. We break all the training and testing text into samples. Each sample is a segment of text consisting of K words. In our study we have set K = 2000. This proves to work very well, and smaller K such as K = 1000 still works well, which is important in situations where the availability of text by a certain author is limited. A typical book is broken down into between 25 and 100 samples. The number of samples that can be attributed definitively to a given author thus depends on how prolific the author is. On average an author will have a corpus of between two hundred and a thousand samples.
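As an illustration, a small helper (hypothetical, not from our actual codebase) that breaks a book into K-word samples with word-frequency features over a fixed vocabulary might look as follows.

    import numpy as np
    from collections import Counter

    def book_to_samples(text, vocab, K=2000):
        # Split the text into consecutive K-word segments and compute the
        # frequency of each vocabulary word within every segment.
        words = text.lower().split()
        samples = []
        for start in range(0, len(words) - K + 1, K):
            counts = Counter(words[start:start + K])
            samples.append([counts[w] / K for w in vocab])
        return np.array(samples)   # shape: (num_samples, len(vocab))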

For L > 2, there are two popular ways to perform this multi-class classification task. One way would be to build a classifier for each pair and in essence conduct a series of head-to-head contests; the closest match would be the author with the most wins. One potential problem is that as L increases, the number of pairwise classifiers grows on the order of L^2, making it computationally less efficient. Here we employ the one-vs-rest method, in which we build L classifiers f_1, f_2, ..., f_L. The classifier f_j is built using author A_j as one class and all other authors grouped into a single class. This is known to perform as well as the pairwise approach, and in our testing it actually performs slightly better.

Assume that the text in question is broken down into N samples X_1, X_2, ..., X_N. We now apply the classifiers f_j to each sample X_n, where f_j is the sign function of the decision function (2.11). The text is matched with author j_0 (maximum win) if

    j_0 = argmax_j Σ_{n=1}^{N} f_j(X_n).

In our implementation we also gain some improvement by running several folds of classifiers and taking their average to minimize the chance of outliers, with matching decided by a voting scheme such as maximum win or Borda count. We shall call q_j/N the matching rate for author A_j, where q_j is the number of samples among {X_n} that are matched with author A_j. The author with the highest matching rate is declared to be the closest match within the group.
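A sketch of this matching step is given below, assuming train maps author indices to their training feature matrices and X_test holds the N samples of the text in question; the names are hypothetical and the single-fold version is shown for brevity.

    import numpy as np
    from sklearn.svm import LinearSVC

    def closest_match(train, X_test):
        # One-vs-rest: f_j separates author A_j from all other authors.
        N = len(X_test)
        q = {}
        for j, X_j in train.items():
            rest = np.vstack([X for k, X in train.items() if k != j])
            X = np.vstack([X_j, rest])
            y = np.r_[np.ones(len(X_j)), -np.ones(len(rest))]
            clf = LinearSVC(max_iter=10000).fit(X, y)
            # q_j: number of test samples falling on author A_j's side.
            q[j] = int(np.sum(clf.decision_function(X_test) > 0))
        j0 = max(q, key=q.get)       # maximum-win author
        return j0, q[j0] / N         # closest match and its matching rate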

4.2.2 Case studies

Case study 1 – Cuckoo’s Calling. A controversy receiving a great deal of media attention in the summer of 2013 was the rumor alleging that Robert Galbraith, the author of the crime novel Cuckoo’s Calling, was actually a pseudonym for J. K. Rowling. In an effort to “out” Rowling, as a story in National Geographic detailed, the Sunday Times sent the book along with books by Rowling and three other crime fiction authors to Patrick Juola and Peter Millican, two prominent experts in authorship attribution, for comparison. Both researchers were able to identify Rowling as the closest match among the 4 authors. But to highlight the general difficulty of true authorship identification, the story also mentioned that

Juola “wasn’t totally confident in the result. After all, he had no way of knowing whether the real author was somebody who wasn’t in the comparison set of books who happened to write like Rowling does.”

We have tested Cuckoo’s Calling within groups of L candidates, where L can be 2, 3 or more, with one or two books from each author used in constructing the classifiers. To make our results comparable, the books we chose are similar to the ones that Juola and

Millican had used in their training process. To show the accuracy of our classifiers, we also tested other books by all the authors; their matching rates are listed together with those of Cuckoo’s Calling. We have conducted several groups of experiments and obtained the following results (see Tables 4.1-4.4). As one can see, although there are some

fluctuations when different training samples are used, the tested books from each author matched their own author’s work with very high matching rates. Furthermore, all the tables indicate that Cuckoo’s Calling is the closest match to the work of J. K. Rowling. We can easily draw the same conclusion as Juola and Millican did, namely that if Robert Galbraith is

47 among the authors in the training list, then it is the pseudonym for J. K. Rowling.

Testing \ Training    The Casual Vacancy (J.K. Rowling)    The Private Patient (P.D. James)
Cuckoo’s Calling      72/72                                0/72
Deathly-JKR           92/100                               8/100
Death-PDJ             1/45                                 44/45

Table 4.1: Cuckoo’s Calling: Classification result by the classifier obtained using one book for each of the two authors as training samples.

Testing \ Training    The Casual Vacancy    The Private Patient    No More Dying    Dead Beat
                      J.K. Rowling          P.D. James             R. Rendell       V. McDermid
Cuckoo’s Calling      65/72                 0/72                   1/72             6/72
Deathly-JKR           64/100                1/100                  8/100            27/100
Death-PDJ             0/45                  44/45                  0/45             0/45
Some Lie-RR           2/28                  0/28                   25/28            1/28
Kick Back-VM          0/37                  0/37                   0/37             37/37

Table 4.2: Cuckoo’s Calling: Classification result by the classifier obtained using one book for each of the four authors as training samples, group I.

Testing \ Training    Half-Blood      Cold Blood       Unbearable Lightness    Murder of Roger
                      J.K. Rowling    Truman Capote    Alexander McCall        Agatha Christie
Cuckoo’s Calling      57/72           11/72            3/72                    1/72
Deathly-JKR           92/100          8/100            0/100                   0/100
Prayer-CT             0/22            22/22            0/22                    0/22
Lady-AM               1/32            0/32             31/32                   0/32
Appoint-AC            5/26            0/26             4/26                    17/26

Table 4.3: Cuckoo’s Calling: Classification result by the classifier obtained using one book for each of the four authors as training samples, group II.

Testing \ Training    J.K. Rowling    P.D. James    Ruth Rendell    Val McDermid
Cuckoo’s Calling      72/72           0/72          0/72            0/72
Chamb-JKR             42/42           0/42          0/42            0/42
Light-PDJ             1/60            59/60         0/60            0/60
Crack-VM              0/37            0/37          1/37            36/37

Table 4.4: Cuckoo’s Calling: Classification result by the classifier obtained using two books for each of the four authors as training samples.

Case study 2 – To Kill a Mockingbird. To Kill a Mockingbird is a Southern drama by Harper Lee published in 1960. A year later it won the Pulitzer Prize and quickly became one of the classics of American literature. Ever since its first publication, the authorship of the book has been a subject of controversy, because it was the one and only published work by Harper Lee until 2013. Some found it hard to believe that an unknown author would write a single great novel and then stop writing. Rumors persisted that the true author was Truman Capote, a childhood friend of Lee and the author of


In Cold Blood, despite the denials of both Lee and Capote. Pearl Bell, a famous literary critic and editor in Cambridge, Massachusetts, revealed that Capote implied to her that he penned or heavily edited the book. In 2001 Jim Gilbert, an Alabama writer, added to the rumor by claiming that To Kill a Mockingbird was the work of Capote after comparing it with In Cold Blood in terms of literary structure. But Harper Lee has her own defenders as well. Dr. Wayne Flynt, a retired professor of history from Auburn University, claimed that Harper Lee is the true author of the book after analyzing the voices of the characters and finding their styles to be quite different from the style of Capote.

To solve this authorship mystery, we have compared To Kill A Mockingbird with the work of Truman Capote using our proposed method for close class problems. We trained the classifiers using one book by each of five candidate authors (see Table 4.5). We then tested To Kill A Mockingbird together with other books by the selected candidate authors; the matching rates are shown in Table 4.5. The results show that the tested works by the candidate authors matched their true authors at high rates, while the matches for To Kill A Mockingbird are distributed almost randomly among three of the five authors. Thus, we conclude that it is unlikely that To Kill A Mockingbird was written by Truman Capote.

Testing \ Training       A Darker Domain    The Executioner’s Song    In Cold Blood    The Monster of Florence    Midnight Garden
                         Val McDermid       Norman Mailer             Truman Capote    Douglas Preston            John Berendt
To Kill a Mockingbird    13/49              3/49                      15/49            1/49                       17/49
Bleeding-VM              63/66              2/66                      0/66             0/66                       1/66
Castle-MN                9/75               53/75                     3/75             10/75                      0/75
Voices-CT                1/27               0/27                      24/27            0/27                       2/27
Prayer-CT                1/22               0/22                      21/22            0/22                       0/22

Table 4.5: To Kill A Mockingbird: Classification result by the classifier obtained using one book for each of the five authors as training samples.

Case study 3 – Dreams From My Father. Dreams from My Father is a memoir

by Barack Obama published in 1995. In 2008 William Ayers stirred up a controversy by


“confessing” that he was the true author of Obama’s memoir. Some of Obama’s critics were

eager to point out that Obama didn’t possess the writing skill to write this best-selling book

because he couldn’t even write a 30 second speech (his 2012 acceptance speech was written

by Bill Clinton). His defenders, on the other hand, were firm believers that Obama himself

was the true author. After reading the book, Jack Cashill argued that Dreams from My

Father was thematically and semantically too close to Ayers’s earlier memoir, Fugitive Days.

Patrick Juola could not reach a conclusion using his tools, saying: “The accuracy simply isn’t there.”

To investigate Ayers’ “confession”, we constructed classifiers within a small group,

which included the book Fugitive Days, Audacity of Hope by Obama, and two books

from other authors, and tested the matching for Dreams From My Father and other books

known to be by the chosen authors. As we can see in Table 4.6, we matched all the books to

the true authors with high accuracy, while the best match for Dreams From My Father was

with Obama’s Audacity of Hope. This result leads us to conclude that Ayers likely lied in

his “confession”.

Testing \ Training    Audacity of Hope    Fugitive Days    A Distant Mirror    The Story of My Life
                      Barack Obama        Bill Ayers       Barbara Tuchman     Clarence Darrow
Dreams-Obama          54/74               10/74            3/74                7/74
Confession-Ayers      1/42                40/42            0/42                1/42
Folly-Tuchman         2/80                0/80             78/80               0/80

Table 4.6: Dreams From My Father: Classification result by the classifier obtained using one book for each of the four authors as training samples.

4.3 Open Class Problems

In open class problems we must check whether the true author is in the training author pool, and if so, determine who it is. A large candidate group and the need for efficient algorithms that handle big data are the two main challenges in solving open class problems. We will move one step further in authorship attribution and reliably detect whether the author is in the training author pool. Our approach is based on our method for solving the close class problem together with the idea of randomization.

The idea is that if the true author is in a small group, we will always get a high matching rate for that author when we construct classifiers within the group. When we randomly separate the candidate authors into small groups, the true author will consistently obtain a high matching rate.

By recording how often each author obtains the highest matching rate, we will observe a high frequency for the true author. If we fix an author A and construct classifiers by randomly choosing other authors from the training pool many times, the average matching rate for A should be much higher if A is the true author. If the average matching rates of all the candidate authors are not high enough, then we can conclude that the true author is not in the candidate set.

The open class authorship identification problem is divided into several components.

Here we break down these components and discuss their details.

4.3.1 Database and data preparation

Naturally, to perform open class authorship identification, one would need a sizable digi- tal database of authors. Fortunately this is not a problem today. The publicly available

Project Gutenberg has over 45,000 titles as of May 2014. More impressive is the Hathi Trust collection, which has over 10 million titles and is available from several university libraries.

The Hathi Trust contains both public domain and copyrighted titles. It is an ideal database around which to build a large scale authorship stylometry analysis project. For each author in the training pool we need enough samples to train the classifiers, so we do not use all the authors in the aforementioned databases. We choose only authors who have at least 3 books and more than 100 samples (of length 2000) from these books. In the end we obtained a database containing 200 authors, each with at least 3 books comprising at least 100 samples.
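For illustration, the selection criterion can be expressed as a simple filter over an assumed corpus structure mapping each author to a list of book texts (the function name is hypothetical).

    def eligible_authors(corpus, K=2000, min_books=3, min_samples=100):
        # Keep authors with at least 3 books and at least 100 samples of
        # length K across those books.
        keep = {}
        for author, books in corpus.items():
            n_samples = sum(len(book.split()) // K for book in books)
            if len(books) >= min_books and n_samples >= min_samples:
                keep[author] = books
        return keep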

Since our method is based on supervised machine learning, training samples from the database are used to train the classifiers. In theory we can use all available samples in the training process. In practice, we noticed that results do not improve significantly beyond 100 samples per author. In fact we obtain good results even with only 30-60 training samples per author. By randomly fixing one book as the test and constructing the classifier between the true author and the others, we obtain the average matching rate for the true author by repeating this process 50 times. Theoretically, the writing style of an author is better reflected when the training samples are chosen from several different books by the author rather than from one or two whole books. Tables 4.7 and

4.8 show that when the sample size increases from 30 to 90, the average true matching rate gets higher. They also verify that when the training samples come from several different books rather than from one or two books by the author, the true matching rate is better. Thus, all our case studies in this part use 90 training samples from different books by each author.

Samples per author    L=2     L=3     L=4     L=5
30                    0.93    0.87    0.86    0.84
60                    0.94    0.93    0.89    0.88
90                    0.94    0.93    0.90    0.90

Table 4.7: Matching rate for multi-class classifications trained with randomly chosen books by each author.

Samples per author    L=2     L=3     L=4     L=5
30                    0.95    0.92    0.91    0.88
60                    0.96    0.95    0.93    0.92
90                    0.97    0.96    0.94    0.93

Table 4.8: Matching rate for multi-class classifications trained with randomly chosen samples from each author.

4.3.2 Methodology

Improved Reliability and Robustness through Author Randomization. Good results can often be attained if there is certainty that the author of the text in question is

indeed in the group (such as in the case of the Federalist papers). The problem is that a closest match can always be found even when the real author of the text in question is in fact not a member of the group. So in general one can often only claim to have found the closest match within the group without truly identifying the authorship.

Our SVM approach is not immune to this concern. In fact, it can happen that within a small group a matching rate of over 90% is obtained by someone other than the true author of the text in question. This usually occurs when the true author is not in the group. So how can one effectively minimize such false matching?

A key ingredient of our method is author randomization, which proves to be a highly effective way to identify a false matching. Here, for each potential candidate author, we perform a series of small group classifications against randomly chosen authors from a database.

More precisely, let A_1 be the author we wish to test for matching against the text samples X_1, X_2, ..., X_N. We now randomly select authors A_2, ..., A_L from a database and perform the classification. This is done over several trials. In the end we obtain an average matching rate for author A_1. The outline of our algorithm for open class problems is as follows (a sketch of the randomization in steps 6 and 7 is given after the list):

1. Initialize the data, which contain at least 3 books by each author in the database.

2. Split each book into parts of K words (usually we set K = 2000), and extract the features for each part. We randomly choose one book by each author in the database as the test book.

3. Randomly divide the authors into groups of L_1 authors, and construct a classifier for each group. Record the authors who receive the highest matching rates.

4. Repeat step 3 T_1 times and record the authors having the highest matching rates.

5. Remove the authors whose average matching rate is below the threshold S_1. The authors who remain are deemed potential true authors, or "suspect authors".

6. Fix each suspect author from step 5, randomly choose L − 1 authors from the database, and construct a classifier. Record the matching rates for the suspect authors.

7. Repeat step 6 T_2 times to obtain the average matching rate for each suspect author.

8. If the maximum average matching rate from step 7 is above some threshold S_2, we identify that author as the true author. Otherwise the true author is not in the database.
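The following sketch illustrates the randomization in steps 6 and 7: fix a suspect author, repeatedly draw L − 1 random co-candidates from the database, and average the suspect's matching rate over T_2 trials. It reuses the hypothetical feature-matrix representation from Section 4.2, and database is assumed to map author names to their training feature matrices.

    import random
    import numpy as np
    from sklearn.svm import LinearSVC

    def suspect_rate(suspect, group, X_test):
        # One-vs-rest matching rate of the suspect within a small group.
        X_s = group[suspect]
        rest = np.vstack([X for a, X in group.items() if a != suspect])
        X = np.vstack([X_s, rest])
        y = np.r_[np.ones(len(X_s)), -np.ones(len(rest))]
        clf = LinearSVC(max_iter=10000).fit(X, y)
        return float(np.mean(clf.decision_function(X_test) > 0))

    def average_matching_rate(suspect, database, X_test, L=5, T2=50, seed=0):
        rng = random.Random(seed)
        others = [a for a in database if a != suspect]
        total = 0.0
        for _ in range(T2):
            group = {suspect: database[suspect]}
            for a in rng.sample(others, L - 1):
                group[a] = database[a]
            total += suspect_rate(suspect, group, X_test)
        return total / T2   # compare against the acceptance threshold S_2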

We have done extensive tests of this approach, and Table 4.9 shows the matching rates for both authors and non-authors. The last row gives the maximal average matching rate for a non-author in our tests. As one can see, the average matching rates of the true authors are above 90% even for 5-class classification, while the average matching rate of a non-author is always around the chance level of 1/L. What is more, even the maximal average matching rate for a non-author is not close to the average matching rate for the true author. This is perhaps the most important property that allows us to identify authorship using our approach.

                  L=2     L=3     L=4     L=5
Author            0.94    0.93    0.90    0.90
Non-author        0.48    0.32    0.24    0.20
Max non-author    0.70    0.62    0.59    0.51

Table 4.9: Matching rate for multi-class classifications by randomly chosen books by each author.

                  L=2     L=3     L=4     L=5
Author            0.97    0.96    0.94    0.93
Non-author        0.51    0.34    0.25    0.20
Max non-author    0.66    0.59    0.52    0.49

Table 4.10: Matching rate for multi-class classifications with randomly chosen samples by each author.

The importance of the above results is that they provide an extremely robust way to tell not only which author is the best match for a given body of text, but also how good the match is. The gap between the matching rates of true authors and non-authors is so significant that the method can be scaled up into a legitimate open class authorship identification method, which we discuss next.

Open Class Authorship Identification. To scale our method for close class problems up to open class authorship identification, we divide our database of authors into small groups G_1, G_2, ..., G_M of L authors (we choose L to be 4 or 5; the method works for L as large as 10). Within each group G_m we build classifiers, which can have several folds as mentioned earlier. For more robustness we may also establish several such groupings from random assignments. Note that this is perhaps the most computationally demanding part with a large database, but it can be done entirely offline as a one-time task.

Given a body of text in question and its samples X_1, ..., X_N, we feed the samples into the classifiers for each group. Each group will have an author with the highest matching rate, but even the highest matching rate can be low. We select only those authors whose matching rates exceed a threshold (e.g. 50%) as candidates for the next round of testing. In theory, the true author should always get the highest matching rate in all groups, while other authors will not consistently obtain high matching rates. Assume that after this initial round we are left with a group of potential authors B_1, B_2, ..., B_q for the text in question. If q is still rather large, as is possible given the size of the database we hope to build, we treat B_1, B_2, ..., B_q as a separate database of authors and iterate this process until a small number of candidate authors C_1, ..., C_p remain.

Now the author randomization method shows its power. For each remaining candidate author C_j, we randomly add L − 1 authors from the entire database to form a group G of L authors. Within G a matching rate for C_j is obtained. We repeat this process for C_j many times to obtain an average matching rate for C_j. We identify the author among the C_j with the highest average matching rate as the author of the text in question, provided that this average matching rate is higher than a certain threshold. If even the highest average matching rate is low, we conclude that the text was written by an author outside our database.

In the following case studies we set L_1 = 4 or 5, T_1 = T_2 = 50, and S_1 = 0.5 or 0.6. We also set S_2 = 0.8.

4.3.3 Case studies

We have built a test database consisting of 200 authors, each with more than 100 samples. This database, although small, already allows us to test our method and study

some of the real-world controversies. We are working with the Michigan State Library to significantly augment the size of the database. It is conceivable that using the Hathi Trust we can build a database consisting of several hundred thousand authors.

Case study: Cuckoo’s Calling. As a test case we ran our open class method on Cuckoo’s Calling using the database we had built. The result is quite striking even though our database is small. In the first stage we divide the authors into groups of 5, so there are 40 groups, and we record the author in each group with the highest matching rate. Of the 40 group leaders only 7 have a matching rate over 60%; if we lower the threshold to 50% then 16 candidates remain. However, if we repeat this 50 times, only 3 candidates achieve a matching rate of 50% or higher in at least 30 trials (not surprisingly, Rowling passed the threshold in all trials). We also tried a more robust approach in the first stage, in which we group the authors repeatedly and keep the frequent leaders as potential authors

(or suspect authors). The next stage is author randomization, where we fix each suspect author, randomly choose L − 1 authors from among the non-suspect authors, and build the classifiers. We obtained the average matching rate for each suspect by repeating this randomization process T_2 = 50 times. Table 4.11 shows that the matching rate for each suspect author is higher than average. Still, the matching rate over the T_2 = 50 repeats for Rowling was much higher, and we can easily and almost definitively identify Rowling as the true author.

Suspect author    Frequency    L=2     L=3     L=4     L=5
80                50/50        0.96    0.93    0.91    0.91
163               34/50        0.64    0.58    0.50    0.41
46                33/50        0.65    0.57    0.44    0.33
83                28/50        0.63    0.56    0.42    0.31
174               28/50        0.62    0.53    0.40    0.32

Table 4.11: Cuckoo’s Calling: the main suspect authors (by database ID) and their average matching rates.

Case study 2 – To Kill a Mockingbird. As another case study, we have applied our method of author randomization to study the authorship of To Kill A Mockingbird. We test the book for matching against Capote and L − 1 other randomly selected authors from our database. Additionally we have also tested Answered Prayers by Capote in parallel. The trial is repeated 50 times. As we can see in Table 4.12, the average matching rate of To Kill a Mockingbird with Capote is lower than 50%. In contrast, the matching rate for Answered Prayers is always above 80%. The result shows virtually irrefutably that Capote did not write To Kill a Mockingbird.

Test book \ matching rate    L=2     L=3     L=4     L=5
To Kill a Mockingbird        0.49    0.40    0.39    0.36
Voices-CT                    0.89    0.84    0.85    0.81
Prayer-CT                    0.98    0.96    0.95    0.96

Table 4.12: To Kill a Mockingbird: the average matching rate.

Case study 3 – Dreams From My Father. To strengthen our conclusion, we have again applied our method by matching Dreams From My Father against William Ayers and L − 1 randomly selected authors. This is repeated 50 times. The testing results in Table 4.13 are the average matching rates over the 50 trials. The average matching rate for Ayers is lower than 25%. In contrast, in the same test but with Public Enemy: Confessions of an American Dissident by Ayers in place of Dreams from My Father, we obtain an average matching rate of over 98%! Constructing the classifiers using other books by Ayers yields very similarly high matching rates, always above 95%. Changing the training book for Ayers gives almost the same results, with the matching rate for Dreams from My Father against Ayers always staying below 50%. This is virtually irrefutable evidence that Ayers lied in his “confession” about writing Obama’s autobiography.

Test book \ matching rate    L=2     L=3     L=4     L=5
Dreams-Obama                 0.24    0.25    0.23    0.23
Confession-Ayers             1.0     0.98    0.99    0.99

Table 4.13: Dreams From My Father: the average matching rates.

As we can see from the results of the three experiments, our method for open class problems


can always detect the true author successfully, although the matching rates of the true authors vary across experiments. When the true author is not in the candidate set, as with To Kill a Mockingbird, the results from our method lead us to conclude that the true author is not in the candidate set.

4.4 Conclusion

We studied the close class authorship attribution problem using machine learning techniques and investigated the authorship problems of Cuckoo’s Calling, To Kill a Mockingbird and Dreams From My Father. All the experiments showed high matching rates for the true authors.

We proposed a new algorithm to solve open class authorship attribution problems by extending the method for close class problems and randomizing authors within a big author pool. By comparing the matching rates of true authors and non-authors, we showed that this algorithm is very reliable and stable. To further verify our results, we studied the same three cases as in the close class setting.

Applying our methods to Cuckoo’s Calling, we can confirm with strong evidence that it was written by J. K. Rowling and published under her pseudonym Robert Galbraith. Studying the authorship of To Kill a Mockingbird with our methods, we concluded that its author is not Truman Capote, although we cannot conclude that it is by Harper Lee because we have no training samples from her. Furthermore, by comparing the matching rates between books by Obama and Ayers, we can tell that Ayers lied to the media when he claimed that it was he who wrote Dreams From My Father.

BIBLIOGRAPHY

[1] Friederike Antosch. The diagnosis of literary style with the verb-adjective ratio. New York, 1969.

[2] Yan Anzheng. One piece of evidence that chapters 64 and 67 are not the original version. Journal of Xianyang Normal University, 24:3, 2009.

[3] Ronald E. Bee. Statistical methods in the study of the Masoretic text of the Old Testament. Journal of the Royal Statistical Society, Series A (General), pages 611–622, 1971.

[4] Ronald E. Bee. A statistical study of the Sinai pericope. Journal of the Royal Statistical Society, Series A (General), pages 406–421, 1972.

[5] Barron Brainerd. On the distinction between a novel and a romance: a discriminant analysis. Computers and the Humanities, 7(5):259–270, 1973.

[6] Barron Brainerd. Weighing evidence in language and literature: A statistical approach. AMC, 10:12, 1974.

[7] John F Burrows. Word-patterns and story-shapes: The statistical analysis of narrative style. Literary and linguistic Computing, 2(2):61–70, 1987.

[8] John F. Burrows. ‘An ocean where each kind...’: statistical analysis and some major determinants of literary style. Computers and the Humanities, 23(4-5):309–321, 1989.

[9] B. C. Chan. Authorship of the Dream of the Red Chamber: a computerized statistical study of its vocabulary. Dissertation Abstracts International, Part A: Humanities and Social Sciences, 42(2), 1981.

[10] Koby Crammer and Yoram Singer. On the learnability and design of output codes for multiclass problems. In Proceedings of the Thirteenth Annual Conference on Computational Learning Theory, COLT ’00, pages 35–46, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc.

[11] A. De Morgan. Letter to Rev. Heald, 18/08/1851. In S. Elizabeth and D. Morgan, editors, Memoirs of Augustus De Morgan by his wife Sophia Elizabeth de Morgan with Selections from his Letters. London: Longman’s Green and Co, 1851/1882.

[12] Joachim Diederich, Jörg Kindermann, Edda Leopold, and Gerhard Paass. Authorship attribution with support vector machines. Applied Intelligence, 19(1-2):109–123, 2003.

[13] Thomas G. Dietterich and Ghulum Bakiri. Solving multiclass learning problems via error-correcting output codes. arXiv preprint cs/9501101, 1995.

[14] Alvar Ellegård. A Statistical Method for Determining Authorship: the Junius Letters, 1769-1772, volume 13. Göteborg: Acta Universitatis Gothoburgensis, 1962.

[15] Jillian M Farringdon, Andrew Queen Morton, Michael G Farringdon, and M David Baker. Analysing for Authorship: A Guide to the Cusum Technique. University of Wales Press Cardiff, 1996.

[16] Li Guoqiang and Li Rui-fang. Study based on statistics of word frequency: research on the sole author of Dream of the Red Chamber. Journal of Shenyang Institute of Chemical Technology, 4, 2006.

[17] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik. Gene selection for cancer classification using support vector machines. Machine learning, 46(1):389–422, 2002.

[18] D. I. Holmes and J. Kardos. Who was the author? An introduction to stylometry. Chance, 16(2):5–8, 2003.

[19] Chih-Wei Hsu and Chih-Jen Lin. A comparison of methods for multiclass support vector machines. Neural Networks, IEEE Transactions on, 13(2):415–425, 2002.

[20] Matthew L. Jockers and Daniela M. Witten. A comparative study of machine learning methods for authorship attribution. Literary and Linguistic Computing, fqq001, 2010.

[21] Patrick Juola. Authorship attribution. Foundations and Trends in Information Retrieval, 1(3):233–334, 2006.

[22] Vlado Kešelj, Fuchun Peng, Nick Cercone, and Calvin Thomas. N-gram-based author profiles for authorship attribution. In Proceedings of the Conference of the Pacific Association for Computational Linguistics (PACLING), volume 3, pages 255–264, 2003.

[23] Bradley Kjell. Authorship attribution of text samples using neural networks and Bayesian classifiers. In Systems, Man, and Cybernetics, 1994. Humans, Information and Technology., 1994 IEEE International Conference on, volume 2, pages 1660–1664. IEEE, 1994.

[24] Bradley Kjell. Authorship determination using letter pair frequency features with neural network classifiers. Literary and Linguistic Computing, 9(2):119–124, 1994.

[25] Bradley Kjell, W. Addison Woods, and Ophir Frieder. Information retrieval using letter tuples with neural network and nearest neighbor classifiers. In Systems, Man and Cybernetics, 1995. Intelligent Systems for the 21st Century., IEEE International Conference on, volume 2, pages 1222–1226. IEEE, 1995.

[26] Moshe Koppel, Shlomo Argamon, and Anat Rachel Shimoni. Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 17(4):401–412, 2002.

[27] Andrew Queen Morton. Literary detection: How to prove authorship and fraud in literature and documents. Bowker London, 1978.

[28] Frederick Mosteller and David Wallace. Inference and Disputed Authorship: The Federalist. 1964.

[29] A. Pawlowski. Wincenty Lutoslawski, a forgotten father of stylometry. Glottometrics, 8:83–89, 2004.

[30] Cao Qingfu. The last 40 chapters of Dream of the Red Chamber were not written by Cao Xueqin: a comparison of function words, phrases and chapter titles in the first 80 chapters and the last 40 chapters. A Dream of Red Mansions, 01:288–319, 1985.

[31] Yu Qingxiang. Application of statistics in Dream of the Red Chamber. Journal of University of Politics, 76:303–327, 1998.

[32] J. Rudman. The state of authorship attribution studies: Some problems and solutions. Computers and the Humanities, 31(4):351–365, 1997.

[33] Efstathios Stamatatos. A survey of modern authorship attribution methods. J. Am. Soc. Inf. Sci. Technol., 60(3):538–556, March 2009.

[34] Fumitake Takahashi and Shigeo Abe. Optimizing directed acyclic graph support vector machines. Artificial Neural Networks in Pattern Recognition (ANNPR), pages 166–173, 2003.

[35] V. Vapnik. Estimation of Dependencies Based on Empirical Data. 1979.

[36] Vladimir Naumovich Vapnik. Statistical Learning Theory, volume 2. Wiley, New York, 1998.

[37] Zhang Wei-dong and Liu Li-chuan. Investigation of the linguistic styles of the first 80 chapters and the last 40 chapters of Dream of the Red Chamber. Journal of Shenzhen University, 1, 1986.

[38] Jason Weston, Chris Watkins, et al. Support vector machines for multi-class pattern recognition. In ESANN, volume 99, pages 219–224, 1999.

[39] Xianfeng Hu, Yang Wang, and Qiang Wu. Multiple authors detection: a quantitative analysis of Dream of the Red Chamber. Advances in Adaptive Data Analysis, 6(4):Article ID 1450012, 18 pages, 2014.

[40] Li Xianping. A new hypothesis on the writing and publication of Dream of the Red Chamber. Fudan Journal of Social Science Edition, 5, 1987.

[41] G Udny Yule. On sentence-length as a statistical characteristic of style in prose: With application to two cases of disputed authorship. Biometrika, pages 363–390, 1939.
