Analysing E-mail Text Authorship for Forensic Purposes

by

Malcolm Walter Corney

B.App.Sc. (App.Chem.), QIT (1981); Grad.Dip.Comp.Sci., QUT (1992)

Submitted to the School of Software Engineering and Data Communications in partial fulfilment of the requirements for the degree of

Master of Information Technology

at the

QUEENSLAND UNIVERSITY OF TECHNOLOGY

March 2003

© Malcolm Corney, 2003

The author hereby grants to QUT permission to reproduce and to distribute copies of this thesis document in whole or in part.

Keywords: e-mail; forensics; authorship attribution; authorship characterisation; stylistics; support vector machine

Analysing E-mail Text Authorship for Forensic Purposes

by Malcolm Walter Corney

Abstract

E-mail has become the most popular Internet application and with its rise in use has come an inevitable increase in the use of e-mail for criminal purposes. It is possible for an e-mail message to be sent anonymously or through spoofed servers. Computer forensics analysts need a tool that can be used to identify the author of such e-mail messages. This thesis describes the development of such a tool using techniques from the fields of stylometry and machine learning.

An author's style can be reduced to a pattern by making measurements of various stylometric features from the text. E-mail messages also contain macro-structural features that can be measured. Together, these features can be used with the Support Vector Machine learning algorithm to classify or attribute authorship of e-mail messages to an author, provided a suitable sample of messages is available for comparison.

In an investigation, the set of authors may need to be reduced from an initial large list of possible suspects. This research has trialled authorship characterisation based on sociolinguistic cohorts, such as gender and language background, as a technique for profiling the anonymous message so that the suspect list can be reduced.

Publications Resulting from the Research

The following publications have resulted from the body of work carried out in this thesis.

Principal Author

Refereed Journal Paper
M. Corney, A. Anderson, G. Mohay and O. de Vel, "Identifying the Authors of Suspect E-mail", submitted for publication in Computers and Security Journal, 2002.

Refereed Conference Paper
M. Corney, O. de Vel, A. Anderson and G. Mohay, "Gender-Preferential Text Mining of E-mail Discourse for Computer Forensics", presented at the 18th Annual Computer Security Applications Conference (ACSAC 2002), Las Vegas, NV, USA, 2002.

Other Author

Book Chapter
O. de Vel, A. Anderson, M. Corney and G. Mohay, "E-mail Authorship Attribution for Computer Forensics", in Applications of Data Mining in Computer Security, edited by Daniel Barbara and Sushil Jajodia, Kluwer Academic Publishers, Boston, MA, USA, 2002.

Refereed Journal Paper
O. de Vel, A. Anderson, M. Corney and G. Mohay, "Mining E-mail Content for Author Identification Forensics", SIGMOD Record Web Edition, 30(4), 2001.

Workshop Papers
O. de Vel, A. Anderson, M. Corney and G. Mohay, "Multi-Topic E-mail Authorship Attribution Forensics", ACM Conference on Computer Security - Workshop on Data Mining for Security Applications, November 8, 2001, Philadelphia, PA, USA.

O. de Vel, M. Corney, A. Anderson and G. Mohay, "Language and Gender Author Cohort Analysis of E-mail for Computer Forensics", Digital Forensic Research Workshop, August 7-9, 2002, Syracuse, NY, USA.

Contents

1 Overview of the Thesis and Research
  1.1 Problem Definition
    1.1.1 E-mail Usage and the Internet
    1.1.2 Computer Forensics
  1.2 Overview of the Project
    1.2.1 Aims of the Research
    1.2.2 Methodology
    1.2.3 Summary of the Results
  1.3 Overview of the Following Chapters
  1.4 Chapter Summary

2 Review of Related Research
  2.1 Stylometry and Authorship Attribution
    2.1.1 A Brief History
      2.1.1.1 Stylochronometry
      2.1.1.2 Literary Fraud and Stylometry
    2.1.2 Probabilistic and Statistical Approaches
    2.1.3 Computational Approaches
    2.1.4 Machine Learning Approaches
    2.1.5 Forensic Linguistics
  2.2 E-mail and Related Media
    2.2.1 E-mail as a Form of Communication
    2.2.2 E-mail Classification
    2.2.3 E-mail Authorship Attribution
    2.2.4 Software Forensics
    2.2.5 Text Classification
  2.3 Sociolinguistics
    2.3.1 Gender Differences
    2.3.2 Differences Between Native and Non-Native Language Writers
  2.4 Machine Learning Techniques
    2.4.1 Support Vector Machines
  2.5 Chapter Summary

3 Authorship Analysis and Characterisation
  3.1 Machine Learning and Classification
    3.1.1 Classification Tools
    3.1.2 Classification Method
    3.1.3 Measures of Classification Performance
    3.1.4 Measuring Classification Performance with Small Data Sets
  3.2 Feature Selection
  3.3 Baseline Testing
    3.3.1 Feature Selection
    3.3.2 Effect of Number of Data Points and Size of Text on Classification
  3.4 Application to E-mail Messages
    3.4.1 E-mail Structural Features
    3.4.2 HTML Based Features
    3.4.3 Document Based Features
    3.4.4 Effect of Topic
  3.5 Profiling the Author - Reducing the List of Suspects
    3.5.1 Identifying Cohorts
    3.5.2 Cohort Preparation
    3.5.3 Cohort Testing - Gender
      3.5.3.1 Effect of Number of Words per E-mail Message
      3.5.3.2 The Effect of Number of Messages per Gender Cohort
      3.5.3.3 Effect of Feature Sets on Gender Classification
    3.5.4 Cohort Testing - Experience with the English Language
  3.6 Data Sources
  3.7 Chapter Summary

4 Baseline Experiments
  4.1 Baseline Experiments
  4.2 Tuning SVM Performance Parameters
    4.2.1 Scaling
    4.2.2 Kernel Functions
  4.3 Feature Selection
    4.3.1 Experiments with the book Data Set
    4.3.2 Experiments with the thesis Data Set
    4.3.3 Collocations as Features
    4.3.4 Successful Feature Sets
  4.4 Calibrating the Experimental Parameters
    4.4.1 The Effect of the Number of Words per Text Chunk on Classification
    4.4.2 The Effect of the Number of Data Points per Authorship Class on Classification
  4.5 SVMlight Optimisation
    4.5.1 Kernel Function
    4.5.2 Effect of the Cost Parameter on Classification
  4.6 Chapter Summary

5 Attribution and Profiling of E-mail
  5.1 Experiments with E-mail Messages
    5.1.1 E-mail Specific Features
    5.1.2 'Chunking' the E-mail Data
  5.2 In Search of Improved Classification
    5.2.1 Function Word Experiments
    5.2.2 Effect of Function Word Part of Speech on Classification
    5.2.3 Effect of SVM Kernel Function Parameters
  5.3 The Effect of Topic
  5.4 Authorship Characterisation
    5.4.1 Gender Experiments
    5.4.2 Language Background Experiments
  5.5 Chapter Summary

6 Conclusions and Further Work
  6.1 Conclusions
  6.2 Implications for Further Work

Glossary

A Feature Sets
  A.1 Document Based Features
  A.2 Word Based Features
  A.3 Character Based Features
  A.4 Function Word Frequency Distribution
  A.5 Word Length Frequency Distribution
  A.6 E-mail Structural Features
  A.7 HTML Tag Features
  A.8 Gender Specific Features
  A.9 Collocation List

List of Figures

1-1 Schema Showing How a Large List of Suspect Authors Could be Reduced to One Suspect Author

2-1 Subproblems in the Field of Authorship Analysis
2-2 An Example of an Optimal Hyperplane for a Linear SVM Classifier

3-1 Example of Input or Training Data Vectors for SVMlight
3-2 Example of Output Data from SVMlight
3-3 'One Against All' Learning for a 4 Class Problem
3-4 'One Against One' Learning for a 4 Class Problem
3-5 Construction of the Two-Way Confusion Matrix
3-6 An Example of the Random Distribution of Stratified k-fold Data
3-7 Cross Validation with Stratified 3-fold Data
3-8 Example of an E-mail Message
3-9 E-mail
3-10 Reducing a Large Group of Suspects to a Small Group Iteratively
3-11 Production of Successively Smaller Cohorts by Sub-sampling

4-1 Effect of Chunk Size for Different Feature Sets
4-2 Effect of Number of Data Points

5-1 Effect of Cohort Size on Gender
5-2 Effect of Cohort Size on Language

List of Tables

3.1 Word Based Feature Set
3.2 Character Based Feature Set
3.3 Possible Combinations of Original and Requoted Text in E-mail Messages
3.4 List of E-mail Structural Features
3.5 List of HTML Tag Features
3.6 Document Based Feature Set
3.7 Gender Specific Features
3.8 Details of the Books Used in the book Data Set
3.9 Details of the PhD Theses Used in the thesis Data Set
3.10 Details of the email4 Data Set
3.11 Distribution of E-mail Messages for Each Author and Discussion Topic
3.12 Number of E-mail Messages in each Gender Cohort with the Specified Minimum Number of Words
3.13 Number of E-mail Messages in each Language Cohort with the Specified Minimum Number of Words

4.1 List of Baseline Experiments
4.2 Test Results for Various Feature Sets on 1000 Word Text Chunks
4.3 Error Rates for a Second Book by Austen Tested Against Classifiers Learnt from Five Other Books
4.4 The Effect of Feature Sets on Authorship Classification
4.5 Effect of Chunk Size for Different Feature Sets
4.6 Effect of Number of Data Points
4.7 Effect of Kernel Function with Default Parameters
4.8 Effect of Degree of Polynomial Kernel Function for the thesis Data Set
4.9 Effect of Gamma on Radial Basis Kernel Function for thesis Data
4.10 Effect of C Parameter in SVMlight on Classification Performance

5.1 List of Experiments Conducted Using E-mail Message Data
5.2 Classification Results for E-mail Data Using Stylistic and E-mail Specific Features
5.3 Comparison of Results for Chunked and Non-chunked E-mail Messages
5.4 Comparison of Results for Original and Large Function Word Sets for the thesis Data Set
5.5 Comparison of Results for Original and Large Function Word Sets for the email4 Data Set
5.6 Comparative Results for Different Function Word Sets for the thesis Data Set
5.7 Comparative Results for Different Function Word Sets for the email4 Data Set
5.8 Effect of Degree on Polynomial Kernel Function for the email4 Data Set
5.9 Classification Results for the discussion Data Set
5.10 Classification Results for the movies Topic from the discussion Data Set
5.11 Classification Results for the food and travel Topics from the discussion Data Set Using the movies Topic Classifier Models
5.12 Effect of Cohort Size on Gender
5.13 Effect of Feature Sets on Classification of Gender
5.14 Effect of Cohort Size on Language

Abbreviations Used in this Thesis

The following abbreviations are used throughout this thesis.

Acronyms
E(M)   Weighted Macro-averaged Error Rate
CMC    Computer Mediated Communication
ENL    English as a Native Language
ESL    English as a Second Language
F1     F1 Combined Measure
F1(M)  Weighted Macro-averaged F1 Combined Measure
HTML   Hyper-Text Markup Language
SVM    Support Vector Machine
UA     User Agent

Feature Sets
C  Character based feature set
D  Document based feature set
E  E-mail structural feature set
F  Function word feature set
G  Gender preferential feature set
H  HTML tag feature set
L  Word length frequency distribution feature set
W  Word based feature set

Variables Used in Feature Calculations
C  Total number of characters in a document
H  Total number of HTML tags in a document
N  Total number of words (tokens) in a document
V  Total number of word types in a document

Statement of Original Authorship

The work contained in this thesis has not been previously submitted for a degree or diploma at any other higher education institution. To the best of my knowledge and belief, the thesis contains no material previously published or written by another person except where due reference is made.

Signed ......

Date ......

Acknowledgments

I would like to thank the following people, without whom this work would not have been possible.

Firstly, thanks to my supervisors for this project. My principal supervisor, Dr. Alison Anderson, gave much support throughout the project, remained enthusiastic throughout and really helped to kick this thesis into shape. Alison commented many times that this was a 'fun' project and I must agree. I would also like to thank my associate supervisor, Adjunct Professor George Mohay, for his continual feedback on the project and on the thesis during its preparation. Thanks also to George for offering me this project in the first place.

I must thank Olivier de Vel from DSTO, Edinburgh, SA, for initiating this project with a research grant and also for his collaboration on the publications that resulted from the project.

Finally, I thank my children, Tomas and Nyssa, for their patience on recent weekends, and I must thank my wife, Diane, for her encouragement, her support and her patience, especially during the last few months of the preparation of this thesis.

Malcolm Corney
March 2003

Chapter 1

Overview of the Thesis and Research

This chapter outlines the problem addressed by this research and the approach used to solve it. Section 1.1 discusses why forensic tools are needed to identify the authorship of anonymous e-mail messages, noting the increased usage of e-mail in recent years and the consequent increase in the usage of e-mail for criminal purposes. As criminal activity increases, so must law enforcement and investigative activity, to prevent or analyse such crimes. Computer forensics is a field which has grown over recent years, necessitated by the increase in computer related crime (see for example Mohay et al., 2003). A discussion of the general approach to solving the problem follows in Section 1.2. Section 1.3 outlines the structure of the thesis and the conclusions of the chapter are given in Section 1.4.

1.1 Problem Definition

1.1.1 E-mail Usage and the Internet

Many companies and institutions have come to rely on the Internet for transacting business, and as individuals have embraced the Internet for personal use, the amount of e-mail traffic has increased markedly, particularly since the inception of the World Wide Web. Lyman and Varian (2000) estimated that in the year 2000 somewhere between 500 and 600 billion e-mail messages would be sent, with a further estimate of more than 2 trillion e-mail messages to be sent per year by 2003. In the GVU's[1] 8th WWW User Survey (Pitkow et al., 1997), 84% of respondents said that e-mail was indispensable.

[1] GVU is the Graphic, Visualisation and Usability Center, College of Computing, Georgia Institute of Technology, Atlanta, GA.

With this increase in e-mail traffic comes an undesirable increase in the use of e-mail for illegitimate reasons. Examples of misuse include: sending spam or unsolicited commercial e-mail (UCE), which is the widespread distribution of junk e-mail; sending threats; sending hoaxes; and the distribution of computer viruses and worms. Furthermore, criminal activities such as trafficking in drugs or child pornography can easily be aided and abetted by sending simple communications in e-mail messages.

There is a large amount of work carried out on the prevention and avoidance of spam e-mail by organisations such as the Coalition Against Unsolicited Commercial E-mail (CAUCE), which is lobbying for a legislative solution to the problem of spam e-mail. E-mail by its nature is very easy to send, and this is where the problem lies. Someone with a large list of e-mail addresses can send an e-mail message to the entire list. It is not the sender who pays for the distribution of the message: the Internet Service Providers whose mail servers process the distribution list pay with CPU time and bandwidth usage, and the recipients of the spam messages pay for the right to receive these unwanted messages. Spammers typically forge the 'From' address header field, so it is difficult to determine who the real author of a spam e-mail message is.

Threats and hoaxes can also be easily sent using an e-mail message. As with spam messages, the 'From' address header field can be easily forged. In the United States of America, convictions leading to prison sentences have been achieved against people who sent e-mail death threats (e.g. Masters, 1998). An example of an e-mail hoax is sending a false computer virus warning with the request to send the warning on to all people known to the recipient, thus wasting mail server time and bandwidth.

Computer viruses and worms are now commonly distributed by e-mail, by making use of loose security features in some e-mail programs. These worms copy themselves to all of the addresses in the recipient's address book. Examples of worms causing problems recently include Code Red (CERT, 2001a), Nimda (CERT, 2001c), Sircam (CERT, 2001b), and ILOVEYOU (CERT, 2000).

The common thread running through these criminal activities is that not all e-mail messages arrive at their destination with the real identity of the author of the message, even though each message carries with it a wrapper or envelope containing the sender's details and the path along which the message has travelled. These details can be easily forged or anonymised, and the original messages can be routed through anonymous e-mail servers, thereby hiding the identity of the original sender. This means that only the message text and the structure of the e-mail message may be available for analysis and subsequent identification of authorship. The metadata available from the e-mail header, however, should not be totally disregarded in any investigation into the identification of the author of an e-mail message. The technical format of e-mail as a text messaging format is discussed in Crocker (1982).

Along with the increase in illegitimate e-mail usage, there has been a parallel increase in the use of the computer for criminal activities. Distributed Denial of Service attacks, viruses and worms are just a few of the different attacks generated by computers using electronic networks. This increase in computer related crime has seen the development of computer forensics techniques to detect and protect evidence in such cases. Such techniques, discussed in the next section, are generally used after attacks have taken place.

1.1.2 Computer Forensics

Computer forensics can be thought of as the investigation of computer based evidence of criminal activity, using scientifically developed methods that attempt to discover and reconstruct event sequences from such activity. The practice of computer forensics also includes storage of such evidence in a way that preserves its chain of custody, and the development and presentation of prosecutorial cases against the perpetrators of computer based crimes.

Yasinsac and Manzano (2001) suggest that any enterprise that uses computers and networks should have concern for both security and forensic capabilities. They suggest that forensic tools should be developed to continually scan computers and networks within an enterprise for illegal activities. When misuse is detected, these tools should record sequences of events and store relevant data for further investigation.

It would be useful, therefore, to have a computer forensics technique that can be used to identify the source of illegitimate e-mail that has been anonymised. Such a technique would be of benefit to both computer forensics professionals and law enforcement agencies. The technique should be able to predict, with some level of certainty, the authorship of a suspicious or anonymous e-mail message from a list of suspected authors, where that list has been generated by some other means, e.g. by the conduct of a criminal investigation.

If the list of suspects is large, it would also be useful to have a technique to create hypotheses concerning certain profiling attributes of the author, such as his or her gender, age, level of education and whether or not English was the author's native language. This profiling technique could then reduce the size of the list of possible suspects so that the author of the e-mail message could be more easily identified. Figure 1-1 shows a schema of how the suggested techniques could work.

Figure 1-1: Schema Showing How a Large List of Suspect Authors Could be Reduced to One Suspect Author

1.2 Overview of the Project

1.2.1 Aims of the Research

This research set out to determine if the authorship of e-mail messages could be determined from the text and structural features contained within the messages themselves, rather than relying on the metadata contained in the messages. The reason for attempting this was to establish tools for computer forensics investigations where anonymous e-mail messages form part of the evidence. The aim was to use techniques from the fields of authorship attribution and stylometry to determine a pattern of authorship for each individual suspect author in an investigation. A message under investigation could then be compared to a group of authorship patterns using a machine learning technique.

Stylometric studies have used many features of linguistic style and comparison techniques over the many years that these studies have been undertaken. Because these studies used only some of the many available features at any one time, and the comparison techniques used were unable to take many features into account, an optimal solution has not been found. The number of words investigated for each author in these studies was quite large when compared to the typical length of an e-mail message. Most studies (see Chapter 2) suggested that a minimum of 1000 words is required to determine such a pattern. A further aim of this research was to determine if authorship analysis could be undertaken with e-mail messages containing 100 to 200 words or less.

In a forensic investigation it is quite possible that there may not be a large number of e-mail messages that can be unquestionably attributed to a suspect in the investigation. Any tool that was to be developed would need to be able to extract the authorship pattern from only a small number of example messages. This of course could lead to problems with the ability of the machine learning technique being used to predict the authorship of a questioned e-mail message. The research, therefore, also had to answer the question of how many example e-mail messages are required to form the pattern of authorship.

A further aim was to determine a method to reduce the number of possible suspected authors so that the best matching suspected author could be found using the tool mentioned above. This research has attempted to:

• determine if there are objective differences between e-mail messages originating from different authors, based only on the text contained in the message and on the structure of the message

• determine if an author's style is consistent within their own texts

• determine some method to automate the process of authorship identification

• determine if there is some inherent difference between the way people with similar social attributes, such as gender, age, level of education or language background, construct e-mail messages.

By applying techniques from the fields of computational linguistics, stylistics and machine learning, this body of research has attempted to create authorship analysis tools for computer forensics investigations.

1.2.2 Methodology

After reviewing the related literature, a range of stylometric features was compiled. These features included character based features, word based features including measures of lexical richness, function word frequencies, the word length frequency distribution of a document, the use of letter 2-grams, and collocation frequencies.

The Support Vector Machine (SVM) was selected as the machine learning algorithm most likely to classify authorship successfully based on a large number of features. The reason for selecting the SVM was its performance in the area of text classification, where many text based features are used as the basis for classifying documents based on content (Joachims, 1998).

Baseline experiments were undertaken with plain text chunks of equal size sourced from fiction books and PhD theses. Investigations were carried out to identify the best sets of stylometric features and to determine the minimum number of words in each document or data point, and also the minimum number of data points, for reliable classification of authorship of e-mail messages. The basic parameters of the SVM implementation used, i.e. SVMlight (Joachims, 1999), were also investigated and their performance was tuned.

The findings from the baseline experiments were used as initial parameters when e-mail messages were first tested. Further features specific to e-mail messages were added to the stylometric feature sets previously used. Stepwise improvements were made to maximise the classification performance of the technique. The effect of topic was investigated to ensure that the topic of the e-mail messages being investigated did not positively bias the classification performance.

To produce a means of reducing the list of possible authors, sociolinguistic models of authorship were constructed. Two sociolinguistic facets were investigated: the gender of the authors and their language background, i.e. English as a native language and English as a second language. The number of e-mail messages and the number of words in each message were investigated as parameters that had an effect on the production of the models.

This research was not aimed at advancing the field of machine learning, but it did use machine learning techniques so that the forensic technique developed for the attribution of authorship could be automated by generating predictive models of authorship. These models were used to distinguish between the styles of various authors. Once a suite of machine learning models was produced, unseen data could be classified by analysing that data with the models.
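As a concrete illustration of this pipeline (a minimal sketch, not code from the thesis), the following shows how a handful of simple stylometric measurements might be extracted from a message body and written out in the sparse `<label> <index>:<value>` format that SVMlight consumes. The particular features and the FUNCTION_WORDS list are illustrative assumptions; the thesis's actual feature sets are given in Appendix A.

```python
import re

# Illustrative subset of a function word list; the thesis's actual
# feature sets (Appendix A) are far larger.
FUNCTION_WORDS = ["the", "of", "and", "a", "in", "to", "is", "was"]

def stylometric_features(text):
    """Compute a few simple stylometric measurements from raw text."""
    words = re.findall(r"[a-z']+", text.lower())
    n_words = len(words) or 1
    n_chars = len(text) or 1
    features = []
    # Character based: proportions of white space and digits.
    features.append(sum(c.isspace() for c in text) / n_chars)
    features.append(sum(c.isdigit() for c in text) / n_chars)
    # Word based: average word length and type-token ratio.
    features.append(sum(len(w) for w in words) / n_words)
    features.append(len(set(words)) / n_words)
    # Function word frequencies, normalised by message length.
    features.extend(words.count(fw) / n_words for fw in FUNCTION_WORDS)
    # Word length frequency distribution (lengths 1 to 10).
    features.extend(sum(len(w) == n for w in words) / n_words
                    for n in range(1, 11))
    return features

def to_svmlight_line(label, features):
    """Render one data point in SVMlight's sparse input format."""
    pairs = " ".join(f"{i}:{v:.6f}" for i, v in enumerate(features, 1) if v)
    return f"{label} {pairs}"

message = "The meeting is at ten. Send the agenda to everyone in the team."
print(to_svmlight_line("+1", stylometric_features(message)))
```

Each author's known messages become labelled vectors of this kind; SVMlight then learns a model per author that can score an unseen message.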

1.2.3 Summary of the Results

• The Support Vector Machine learning algorithm was found to be suitable for classification of authorship of both plain text and e-mail message text.

• The approach taken to group features into sets and to determine each feature set's impact on the classification of authorship was successful. Character based features, word based features, document based features, function word frequencies, word length frequency distributions, e-mail structural features and HTML tag features proved useful, and each feature set contributed to the discrimination between authorship classes. Bi-gram features, while successful with plain text classification, were thought to be detecting the topic or content of the text rather than authorship. The frequencies of collocations of words were not successful discriminators, possibly because they were too noisy given the short text length of the data when these features were tested.

• Baseline testing with plain text chunks sourced from fiction books and PhD theses indicated that approximately 20 data points (e-mail messages) containing 100 to 200 words per e-mail message were required for each author in order to generate satisfactory authorship classification results.

• When the authorship of e-mail messages was investigated, the topic of the e-mail messages was found not to have an impact on classification of authorship.

• Sociolinguistic filters were developed for cohorts of gender and language background, i.e. English as a native language versus English as a second language.

1.3 Overview of the Following Chapters

Chapter 1 has described why forensic tools for the identification of the authorship of e-mail messages are required, and presented an overview of the work. Chapter 2 describes the background to the problem of authorship attribution of e-mail messages and the strategies that have been used to date.

The details of the way that the experiments for this body of research were conducted are discussed in Chapter 3. This includes a description of why machine learning is helpful in this instance and which machine learning techniques were used. The sources of the data used for experimental work are also described.

The results of the experimental work are presented in Chapters 4 and 5. Chapter 4 presents the results of a set of baseline tests that were used in Chapter 5 to determine if stylistics could be applied to e-mail messages for attribution of authorship. This chapter determined some of the basic parameters for the research. Chapter 5 shows the results of the experimental work carried out on e-mail messages and also includes the results of authorship characterisation experiments where some sociolinguistic characteristics are determined about the authors of e-mail messages.

Chapter 6 contains a discussion of the major outcomes from this body of research and outlines the impact this work may have on future work in the area. Finally, a glossary of terms, a set of appendices and a bibliography are included.

1.4 Chapter Summary

This chapter has discussed how e-mail is being abused more frequently for activities such as sending spam e-mail messages, sending e-mail hoaxes and e-mail threats, and distributing computer viruses or worms via e-mail messages. These e-mail messages can be easily forged or routed through anonymous e-mail servers, highlighting the need for techniques to determine the authorship of any text within them. A discussion of the major concepts of the thesis and the approach taken to enable the identification of the authorship of anonymous e-mail messages has been outlined.

The next chapter discusses research that has been carried out into authorship attribution of literary texts through the use of stylistics, and the classification techniques that have been used for such attributions. The implications of the related work for this body of research are discussed.

Chapter 2

Review of Related Research

Chapter 1 outlined the need in the field of computer forensics for tools to assist with the identification of the authorship of e-mail messages that have been sent anonymously or deliberately forged.

This chapter draws upon the results of research carried out in the fields of computational linguistics, stylistics and non-traditional authorship attribution[1] to develop a possible framework for the attribution of e-mail text authorship. Other research fields such as text classification, software forensics, forensic linguistics, sociolinguistics and machine learning also impact on the current study. Although much work has been done on text classification and authorship attribution related to prose, little work has been conducted in the specific area of authorship attribution of e-mail messages for forensic purposes.

[1] Non-traditional authorship attribution employs computational linguistic techniques rather than relying on external evidence, such as handwriting and signatures, obtained from original manuscripts.

In the authorship attribution literature it is thought that there are three kinds of evidence that can be used to establish authorship: external, linguistic and interpretive (Crain, 1998). External evidence includes the author's handwriting or a signed manuscript. Interpretive evidence is the study of what the author meant when a document was written and how that can be compared to other works by the same author. Linguistic evidence is centred on the actual words and the patterns of words that are used in a document. The main focus of this research will be on linguistic evidence and stylistics, as this approach lends itself to the automated analysis of computer mediated forms of communication such as e-mail.

Most work to date in this latter area has used chunks of text that have significantly more words than most e-mail messages (Johnson, 1997, Craig, 1999). A question that this research must answer is, therefore: how can linguistics and stylistics be adapted to identify the authorship of e-mail messages? There are sub-problems to be investigated, such as how long an e-mail message must be and how the similarities between a particular author's e-mail messages are to be measured.

2.1 Stylometry and Authorship Attribution

The field of stylometry is a development of literary stylistics and can be defined as the statistical analysis of literary style (Holmes, 1998). It makes the basic assumption that an author has distinctive writing habits that are displayed in features such as the author's core vocabulary usage, sentence complexity and the punctuation that is used. A further assumption is that these habits are unconscious and deeply ingrained, meaning that even if one were to make a conscious effort to disguise one's style, this would be difficult to achieve. Stylometry attempts to define the features of an author's style and to determine statistical methods to measure these features so that the similarity between two or more pieces of text can be analysed. These assumptions are accepted as core tenets for the research conducted in this thesis.

Authorship analysis can be broken into a number of more specific yet distinct problems such as authorship attribution, authorship characterisation and plagiarism detection. The relationship between these problems is shown in Figure 2-1.

Figure 2-1: Subproblems in the Field of Authorship Analysis

Authorship attribution can be defined as the task of determining the author of a piece of text. It relies on some sort of evidence to prove that a piece of text was written by that author. Such evidence would be other text samples produced by the same author. Authorship characterisation attempts to determine the sociolinguistic profile, that is, the characteristics of the author who wrote a piece of text. Examples of characteristics that define a sociolinguistic profile include gender, educational and cultural background and language familiarity (Thomson and Murachver, 2001). Plagiarism detection is used to calculate the degree of similarity between two or more pieces of text, without necessarily determining the authors, for the purpose of determining if a piece of text has been plagiarised. Authorship attribution and authorship characterisation are quite distinct problems from plagiarism detection.

Authorship analysis has been used in a number of application areas, such as identifying authors in literature, in program code and in forensic analysis for criminal cases. The most widely studied application of authorship analysis is in attributing authorship of works of literature and of published articles. Well known studies include the attribution of disputed Shakespeare works, e.g. Efron and Thisted (1976), Elliott and Valenza (1991a), Lowe and Matthews (1995) and Merriam (1996), and the attribution of the Federalist papers (Mosteller and Wallace, 1964, Holmes and Forsyth, 1995, Tweedie et al., 1996).

2.1.1 A Brief History

The earliest studies into authorship attribution include those by Mendenhall (1887), Yule (1938, 1944) and Zipf (1932). Mendenhall (1887) studied the authorship of Bacon, Marlowe and Shakespeare by comparing word spectra or characteristic curves, which were graphic representations of the arrangement of word lengths and the relative frequency of their occurrence. He suggested that if the curves remained constant and were particular to the author, this would be a good method for authorship discrimination.

Zipf (1932) focussed his work on the frequencies of the different words in an author's documents. He determined that there was a logarithmic relationship, which became known as Zipf's Law, between the number of words appearing exactly r times in a text (r = 1, 2, 3, ...) and r itself.

Yule (1938) initially used sentence length as a method for differentiating authors but concluded that this was not completely reliable. He later created a measure using Zipf's findings based on word frequencies, which has become known as Yule's characteristic K. He found that a word's use is probabilistic and can be approximated with the Poisson distribution.

Research in the field continued throughout the 1900s with mainly statistical approaches being used on one or a small number of distinguishing features. In his review of the analysis of literary style, Holmes (1985) lists a number of possible sources for features and techniques for the analysis of authorship. These include:

• word-length frequency distributions

• average syllables per word and distribution of syllables per word

• average sentence length

• distribution of parts of speech

• function word frequencies

• vocabulary or lexical richness measures, such as the Type-Token ratio, Simpson's Index (D), Yule's Characteristic (K) and entropy[2] (formal definitions of these measures are sketched after this list)

• vocabulary distributions, including the number of hapax legomena[3] and hapax dislegomena[4]

• word frequency distributions

Many of the studies utilising single features make use of the Chi squared statistic for discrimination between different authors. Multivariate techniques, such as factor analysis, discriminant analysis and cluster analysis, have also been used.

[2] These terms are defined in the Glossary.
[3] Hapax legomena are words that are used once only in any text.
[4] Hapax dislegomena are words used twice in a text.
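These richness measures can be stated compactly. The following are the standard formulations from the stylometry literature (not equations reproduced from this thesis), using the thesis's own symbols N for the total number of word tokens and V for the number of word types, with V_r denoting the number of types occurring exactly r times:

```latex
\begin{align*}
\text{Type-Token ratio} &= \frac{V}{N}\\
\text{Simpson's } D &= \sum_{r} V_r \,\frac{r}{N}\cdot\frac{r-1}{N-1}\\
\text{Yule's } K &= 10^4 \,\frac{\sum_{r} r^2 V_r - N}{N^2}\\
\text{Entropy } H &= -\sum_{i=1}^{V} p_i \log p_i, \qquad p_i = \frac{\text{count of type } i}{N}
\end{align*}
```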

Since the early 1990s, Foster (1996c, 1999) has had differences of opinion with Elliott and Valenza (1991b, 1996, 1998, 2002) over the techniques used by the latter for the attribution of Shakespearean play and poem authorship. Foster (1999) claims that the tests used by Elliott and Valenza are "deeply flawed, both in their design and execution". Elliott and Valenza (2002) have countered these claims, have corrected the small errors in their technique, and claim that after two years of intense scrutiny their methods stand up for the attribution of Shakespearean authorship. Foster (1996a) in the meantime has also claimed that a text containing a poem titled 'A Funeral Elegy' was the work of Shakespeare, while studies by other researchers did not arrive at similar conclusions. Foster compared the text of this poem with the canonical works of Shakespeare[5] by studying his diction, grammatical accidence[6], and use of rare words.

In other attribution studies, Shakespeare has been compared with Edward de Vere, the Earl of Oxford (Elliott and Valenza, 1991b), John Fletcher (Lowe and Matthews, 1995) and Christopher Marlowe (Merriam, 1996). Elliott and Valenza used incidences of badge words[7], fluke words[8], rare words, new words, prefixes, suffixes, contractions and a number of other tests to build a Shakespeare profile for comparison with other authors. Lowe and Matthews used frequencies of five function words and a neural network analyser, while Merriam used some function words and principal component analysis.

The Federalist papers are a series of articles written in 1787 and 1788 to persuade the citizens of New York to adopt the Constitution of the United States of America.

[5] Shakespeare's canon includes those poems and plays that fit the accepted productive time line of his life.
[6] Grammatical accidence is the study of changes in the form of words by internal modification for the expression of tense, person, case, number etc.
[7] Badge words are words that are preferred by a particular author relative to other authors.
[8] Fluke words are words that are not preferred by a particular author relative to other authors.

There are 85 articles in total, with agreement by the authors and historians that 51 were written by Alexander Hamilton and 14 were written by James Madison. Of the remaining articles, five were written by John Jay, three were jointly written by Hamilton and Madison, and 12 are of disputed authorship between Hamilton and Madison.

This authorship attribution problem has been revisited numerous times since the original study of Mosteller and Wallace (1964), with a number of different techniques employed. Using four different techniques to compare the texts under examination, the original study compared frequencies of a set of function words selected for their ability to discriminate between the two authors. The techniques used by Mosteller and Wallace included a Bayesian analysis, the use of a linear discrimination function, a hand calculated robust Bayesian analysis and a simplified word usage rate study. Mosteller and Wallace came to the conclusion that the twelve disputed papers were written by Madison. Other studies (Tweedie et al., 1996, Holmes, 1998, Khmelev and Tweedie, 2002) on the Federalist papers have also been conducted using various techniques. Further details of these studies are given in Sections 2.1.3 and 2.1.4. In nearly all cases, these techniques came to the same conclusion as Mosteller and Wallace.

Foster (1996b) used text analysis to identify the author of the novel Primary Colors, a satire of the rise of President Clinton, which was originally published anonymously. He identified linguistic habits such as spelling, diction, grammatical accidence, syntax, badge words, rare words and other markers of an author's style to narrow a list of suspected authors of the book to Joe Klein, a former advisor to the President. Foster (2000) also contributed to the search for the 'Unabomber', Theodore Kaczynski, by using his text analysis techniques to compare the 'Unabomb Manifesto' with other writings by Kaczynski given to the FBI by Kaczynski's brother. Much of Foster's work, however, appears to be quite subjective, as he does not give enough detail for others to validate his technique.

In the studies mentioned above, many different features, such as frequencies of certain words, habits of hyphenation and letter 2-grams, have been used to discriminate authorship. There is no consensus of opinion among the many research groups studying the problem as to which features should be used, or in fact which are the best features for discrimination of authorship. According to Rudman (1998), "at least 1000 'style markers' exist in stylometric research". There is also no consensus as to the best technique for discriminating among authors using the chosen features. This continued debate between various proponents in the literature exposes the disagreement within stylometry research over the choice of discriminatory features and over the statistical or other classification techniques used to calculate the differences between authors' styles. It would appear that combinations of style markers should be more discriminatory than single markers, but the classification techniques used to date have not been sophisticated enough to be able to employ many features. It is suggested here that an author's style can be captured from a number of distinctive features that can be measured from the author's text and that these features will form a unique pattern of authorship.

Forsyth (1997) compiled a benchmark suite of stylometry test problems known as Tbench96 to provide a broader variety of test problems than those being used by other researchers in stylometry and related fields. This suite of text includes prose and poetry for authorship problems, poems for the study of stylochronometry, and magazine and newspaper articles for analysis of content. Few researchers in the area select more than one problem for testing their techniques. Forsyth suggests that any technique should be tested on more than one problem so that overfitting of the data can manifest itself.

2.1.1.1 Stylochronometry

While stylometry assumes that each author has their own particular style, stylochronometry further assumes that this style can change over a period of time. Stylochronometry concerns itself with the issue of assigning a chronological order to a corpus of an author's works.

Forsyth (1999) argues that stylochronometric studies should be proven on works where the dating is well documented. He studied the verse of W. B. Yeats by counting distinctive substrings. He successfully conducted a number of tests, including the assignment of ten poems absent from the training set to their correct period, and the detection of differences between two poems written when Yeats was in his twenties and revised when he was in his fifties. Smith and Kelly (2002) used measures of lexical richness such as hapax legomena and vocabulary richness such as Yule's Characteristic (K) and Zipf's Law to order the works of three authors from classical antiquity chronologically.

The results of these various studies seem to indicate that an author's style can and does change over a period of time. In the cases studied, the period of time in question was more than ten years. These results should be kept in mind for any forensic investigation, and the known writings of any particular investigated author should be sampled from a period of time which is relatively short in this context, such as one or two years.

2.1.1.2 Literary Fraud and Stylometry

The Prix Goncourt is France's highest literary award and as such may only be awarded to an individual author once. Romain Gary, however, won the award a second time by writing under the pseudonym Émile Ajar (Tirvengadum, 1998). Gary admitted this in a book published after his suicide. Tirvengadum used vocabulary distributions as style discriminants, particularly high frequency words and synonyms, to study the works of Gary and Ajar. Student's t test, the Pearson correlation and the Chi squared test were used as the statistical methods for discrimination of the books. The Gary books correlated well, as did most of the Ajar books. Correlations between the Gary and Ajar books were also high. However, the second Prix Goncourt winner, written under the Ajar pseudonym, was significantly different from the others and from the Gary books. Tirvengadum concluded that "Gary consciously changed his style so as to avoid detection as Ajar."

If style can be disguised, it remains to be seen whether or not disguise can be implemented in short documents as well as long ones. For lay persons who may be unaware of what style encompasses, it may well be beyond their skill level to disguise that style.

2.1.2 Probabilistic and Statistical Approaches

The number of words used once, twice, etc. in the Shakespearean canon was analysed probabilistically in a study performed by Efron and Thisted (1976). They concluded that if a new volume of Shakespeare were discovered containing a certain number of words, it would contain a certain quantity of words that had never been used by Shakespeare in any of his previous works. This approach was based on a method used by the statistician Sir Ronald Fisher in 1943 to predict how many new species of butterfly might be discovered if butterfly hunters were to return to Malaysia to re-establish a trapping programme. A new poem that begins 'Shall I die', thought to have been written by Shakespeare, was found about ten years after the original study. The predicted numbers of words never used, used once, twice, etc., given the number of words in the poem, fit the profile quite well (Thisted and Efron, 1987). For example, if the poem was written by Shakespeare, they calculated it should contain seven previously unseen words. When checked, the new poem contained nine words that had not been used previously by Shakespeare.

Smith (1983) used average word-length and average sentence-length, collocations[9] and measures of words in certain positions in sentences with the Chi squared statistic for detection of differences between Shakespeare and Marlowe. He concluded that word-length often produces incorrect results, as does sentence-length. He also suggested that the Chi squared statistic has been subject to misinterpretation and was misused by some proponents in the field.

[9] A collocation is a combination of two words together or separated by a certain number of words. Examples of collocations include 'as the', 'in the', 'of the', 'to the' etc.

The Cusum technique is described in detail by Farringdon et al. (1996). It is based on a technique from statistical quality control and relies on the assumption that the usage of short words, i.e. 2 or 3 letter words, and words beginning with a vowel, is habitual and can discriminate between authors. The technique plots the cumulative sum of the differences between observed small word counts in a sentence and the average of all small word counts in the entire document. It is supposed to be able to detect multiple authorship in a document. The appeal of this technique is the claim that a text sample as small as five sentences can be tested against material of known authorship. Furthermore, the Cusum technique has been put forward as forensic linguistic evidence in court on more than one occasion, both in the United Kingdom and in Australia (e.g. Lohrey, 1991, Storey, 1993). There is further discussion of the use of stylistics for forensic purposes in Section 2.1.5.

The work of Farringdon et al. has been criticised by De-Haan (1998) in a review outlining the shortcomings of the Cusum technique. De-Haan reports tests demonstrating its unreliability. Hardcastle (1993, 1997) also questions the validity of the technique, presenting examples of its failures. He summarises the findings of other researchers and concludes that Cusum results should not be accepted as reliable evidence of authorship.

Supporters of the Cusum technique (Canter and Chester, 1997) suggested an improvement by weighting the calculations of the Cusum values. When this was tested it was found that the technique could not reliably discriminate between documents that had been authored by a single author and those from multiple authors. Sanford et al. (1994) investigated some of the basic assumptions that the Cusum technique is based upon, such as the assumption that individuals demonstrate "habit" indicators and that the habits are the same whether the material is written or spoken. The authors concluded that the technique is based on assumptions that are of limited reliability and are most likely false.
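Although the criticisms above give good reason for caution, the mechanics of the basic Cusum plot are simple to state. The following is a minimal sketch of the published description (not code from Farringdon et al. or from this thesis): per-sentence counts of one habit feature, here 2 and 3 letter words, are accumulated as deviations from the document-wide average.

```python
import re

def cusum_series(text):
    """Cumulative sum of deviations of per-sentence 'habit word' counts
    (here, 2 and 3 letter words) from the document-wide average."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    counts = [sum(1 for w in re.findall(r"[A-Za-z']+", s) if len(w) in (2, 3))
              for s in sentences]
    mean = sum(counts) / len(counts)
    series, running = [], 0.0
    for c in counts:
        running += c - mean  # deviation of this sentence from the habit average
        series.append(running)
    return series

# A sustained departure from zero in the plotted series is what Cusum
# proponents read as evidence of a change of author.
print(cusum_series("I am here. We go now. The cat sat on an old mat. It is wet."))
```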

2.1.3 Computational Approaches

The development of stylistics was ongoing during the period of the Cusum debate. The research carried out attempted to define the 'best' features and to apply more sensitive classification techniques than simple count statistics. Some of the leaders in the field of stylistics during this period were Burrows, Baayen and co-authors, and Holmes and co-authors. A discussion of some of their work follows.

Burrows (1992) carried out authorship attribution by analysing the frequency patterns of the words that appeared most often in the texts being examined, correlating each word with all others using the Pearson product-moment method. He then used principal component analysis to transform the original variables into a set of new uncorrelated variables, arranged in descending order of importance. Typically, the new data was plotted as a graph of the first component against the second, displaying the values for each data point so that a visual separation could be effected. This essentially reduced the dimensionality of the multivariate problem to two or three dimensions. While this is a good technique for visualisation of the differences in authorship, it remains a qualitative tool.

Baayen et al. (1996) conducted experiments with a syntactically annotated corpus. The syntactic annotation was in the form of 'rewrite rules' generated by parsing the text of the corpus. Each 'rewrite rule' contained part of speech and phrasal information, and for the purposes of the experiments each 'rewrite rule' was considered to be a pseudo-word. Baseline tests were conducted using measures of lexical richness at the word level and by identifying the fifty most frequent words in 2000 word document chunks. The attributions produced resulted in some errors. Similar tests were conducted on the pseudo-words, which resulted in an improvement in the classification efficiency.

Holmes et al. (2001) used traditional and non-traditional methods of authorship attribution to identify seventeen previously unknown articles published in the New York Tribune between 1889 and 1892 as the work of Stephen Crane[10]. Samples of 3000 words of text were analysed for the frequencies of 50 common words proposed by Burrows (1992). Principal component analysis was used as the method of discrimination.

[10] Stephen Crane was a nineteenth century American writer, and is best known for The Red Badge of Courage. He also worked as a journalist for the New York Tribune.
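As an illustration of this Burrows-style analysis, the sketch below projects relative frequencies of the most frequent words onto the first two principal components. It is a hedged reconstruction, not the authors' code; the use of scikit-learn and the tiny example corpus are assumptions of convenience.

```python
import re
from collections import Counter

import numpy as np
from sklearn.decomposition import PCA

def common_word_profile(texts, top_n=50):
    """Relative frequencies of the corpus-wide top_n words, one row per text."""
    tokenised = [re.findall(r"[a-z']+", t.lower()) for t in texts]
    corpus_counts = Counter(w for toks in tokenised for w in toks)
    vocab = [w for w, _ in corpus_counts.most_common(top_n)]
    rows = []
    for toks in tokenised:
        n = len(toks) or 1
        counts = Counter(toks)
        rows.append([counts[w] / n for w in vocab])
    return np.array(rows)

# One entry per text sample (for Holmes et al., a 3000 word chunk each).
texts = ["the cat sat on the mat and the dog slept by the door",
         "a storm came in over the hills and a river rose in the night",
         "the dog and the cat ran to the gate and back again"]
coords = PCA(n_components=2).fit_transform(common_word_profile(texts))
print(coords)  # plot column 0 against column 1 and look for clusters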

Khmelev and Tweedie (2002) used Markov chains of two letters, also known as 2-grams, with a probabilistic approach to analyse the authorship of 387 texts from 45 authors collected from the Project Gutenberg (n.d.) archives, with the classification being successful 73% of the time. They also used this technique successfully for analysis of the Federalist papers.

Benedetto et al. (2002a) applied compression techniques and entropy measures to characterise the language being used and also for classification of authorship. In this approach, a document to be classified is concatenated to a known document, and the relative compression levels of the two documents give an indication of the source of the text. If the same author wrote both pieces of text, the compression rates for the original and concatenated documents should be similar. This paper was criticised by Goodman (2002) for not being related to physics yet being published in a physics journal, for not being novel work, and also for having results which were slower to produce and less accurate than those produced by a baseline technique, the Naïve Bayes classifier. Benedetto et al. (2002b) correctly point out in their response that Goodman conducted his comparison experiments by classifying document topic rather than language and authorship, which were the focus of the original paper.

The sheer variety of features and analytic methods described above indicates that there has been some success but no consensus as to the best features and methods for authorship attribution. Perhaps one of these methods may prove best for authorship attribution, if the variables can be controlled.
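The concatenation-and-compression idea described above can be sketched in a few lines. This is an illustration of the approach as described, not Benedetto et al.'s implementation, and the candidate corpora are invented:

```python
import zlib

def extra_bytes(known: bytes, unknown: bytes) -> int:
    """Extra compressed bytes the unknown text costs when appended to a
    known author's text; smaller means stylistically more similar."""
    base = len(zlib.compress(known, 9))
    return len(zlib.compress(known + unknown, 9)) - base

def attribute(unknown: str, candidates: dict) -> str:
    """Return the candidate whose corpus best 'absorbs' the unknown text."""
    scores = {name: extra_bytes(corpus.encode(), unknown.encode())
              for name, corpus in candidates.items()}
    return min(scores, key=scores.get)

# Invented corpora for two hypothetical suspects.
candidates = {
    "alice": "reckon the server fell over again same as last week mate",
    "bob": "per my previous correspondence the outage recurred kindly advise",
}
print(attribute("reckon it fell over again", candidates))
```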

2.1.4 Machine Learning Approaches

A number of machine learning approaches have been applied in the field of stylometry in recent times. Matthews and Merriam (1993) and Merriam and Matthews (1994) were the first to employ neural network classifiers for stylometry. These initial studies compared Shakespeare to both Fletcher and Marlowe.

Kjell (1994b) used relative frequencies of the 50 or 100 most significant n-grams[11] as his feature set for attribution of the Federalist papers. He varied n from 2 to 5 and found that the best value for n was 2. He also compared the use of a neural network with a Naïve Bayesian classifier (Kjell, 1994a) and with a Nearest Neighbour classifier (Kjell et al., 1995).

Hoorn et al. (1999) used neural network, k-nearest neighbour and Naïve Bayes classifiers with 2-grams and 3-grams as the feature set for authorship analysis of three Dutch poets. The classification accuracy in these experiments ranged from 56.1% to 83.6%. The neural network classifier gave the best results of the three classifiers used, while the Naïve Bayes classifier had the poorest results. These results indicate that the stylometric techniques discussed here are applicable not only to English but also to other languages.

Lowe and Matthews (1995) describe the use of a Radial Basis Function (RBF) neural network for the classification of plays written by Shakespeare and Fletcher. The features used were five function words[12] ('are', 'in', 'no', 'of' and 'the'). Fifty samples of text from each author were used for this analysis. They report that the RBF correctly classified 99 of the 100 training examples, and when used on disputed works, it produced results in general agreement with contemporary opinions generated by other means. The RBF performed more accurately than the benchmark methods also reported in their paper.

11 An n-gram is a sequence of n letters from a piece of text. From the word hello, the 2-grams he, el, ll and lo can be formed. A significant n-gram is one found to produce the best discrimination between the two authors in question.

12 Function words are words unrelated to the content of a piece of text, e.g. conjunctions, prepositions, pronouns, determiners, contractions and some verbs and adverbs.
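As a concrete rendering of the definition in footnote 11, letter 2-grams can be extracted and counted in a few lines. This generic sketch is added for illustration and is not Kjell's code.

    from collections import Counter

    def letter_ngrams(text: str, n: int = 2) -> Counter:
        """Count the letter n-grams in a text, ignoring non-alphabetic characters."""
        letters = [c.lower() for c in text if c.isalpha()]
        return Counter("".join(letters[i:i + n]) for i in range(len(letters) - n + 1))

    print(letter_ngrams("hello"))  # Counter({'he': 1, 'el': 1, 'll': 1, 'lo': 1})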

The Federalist papers were also studied by Tweedie et al. (1996) using neural network classification, with eleven function words as features. The results from this study concurred with those generated by Mosteller and Wallace (1964). Holmes and Forsyth (1995) used a rule based learner for their study of the Federalist papers, once again agreeing with previous researchers on the attribution. Holmes and Forsyth (1995) and Holmes (1998) compared the effects of vocabulary richness and word frequency analysis with a genetic rule based learner, BEAGLE, on the problem of attributing the Federalist papers. Both approaches were successful. They concluded that these more sophisticated machine learning techniques have a place in the field of authorship attribution.

Waugh et al. (2000) used up to 100 features in their study, in which they attempted to produce the smallest network possible that could solve the authorship problem. To this end, they built their neural network with the Cascade-Correlation algorithm13 and restricted the number of features by removing 25%, 50% or 75% of them at random. They found that a large number of features was not strictly necessary to achieve high levels of classification.

The approaches described here using neural networks have used more features than many of the other studies listed in previous sections, but the number of features has in most cases been constrained by the difficulties of training a neural network with a large input dimension. These difficulties arise from the computational complexity of the training task and from the possibility of overfitting the training data, leading to a loss of generalisation in the resulting network. The number of training examples for a neural network is also important. With traditional back propagation neural network training, good generalisation usually occurs only if the number of training examples is large relative to the number of training weights, or trainable parameters14.

13 This algorithm works by adding nodes gradually to a network to improve classification efficiency.

In the research work reviewed here, much of the work has involved the study of only one problem, e.g. the Federalist papers or some novels chosen from Project Gutenberg. The work has in some cases compared classification techniques for analysis of the problem being investigated, and in other cases has compared the analysis of different data sets using one classification technique. There is little discussion of the results of tuning the parameters of the classifiers or of how the best features were chosen. In most cases, raw success and error frequencies have been reported. To date, the most complex machine learning methods used in stylometry have been neural network classifiers. These networks have been produced with a limited number of features, and the networks produced may have been overly complex, resulting in poor generalisation performance. As research continues in this area, a more scientific approach to the use of machine learning techniques is becoming apparent, as the classifiers themselves are being examined rather than just the features required for classification.

2.1.5 Forensic Linguistics

Forensic linguistics is a field of study in which linguistic techniques such as stylometry and authorship attribution are applied for forensic purposes, i.e. for the collection of evidence for use in a court of law. Totty et al. (1987) discuss the difficulties of applying linguistic techniques in the forensic setting, often due to the small size of the texts being examined. These techniques have been used for the comparison of witness statements taken down by police officers with the original verbal utterances; in some cases the written statements were found to be in conflict.

14 Although, as shown by Bartlett (1997), good generalisation performance may well be achieved if the magnitude of the training weights is suitably controlled.

Storey (1993) gives some examples of cases in England and in Australia where forensic analysis of text has been undertaken for both trial and appeal cases. Various aspects of sociolinguistics were studied for these cases. She suggests that while incontrovertible proof is not yet possible, forensic linguistics provides a powerful investigative tool for profiling and identification.

Chaski (1997) points out that forensic linguistics has not been accepted as a recognised technique in a court of law because, up to now, it fails the so-called Daubert criteria for admitting scientific and technical evidence in US courts of law. The Daubert criteria (Brodsky, 1999) now applicable in the U.S.A. require that a method demonstrate reproducibility, empirical validation, peer review and known error rates.

Chaski (2001) presents results of empirical testing of three groups of techniques for authorship attribution. The first group contains syntactically classified punctuation and syntactic analysis of phrase structure. The second group contains techniques such as sentential complexity, vocabulary richness, readability and content analysis; and the third group contains forensic stylistics techniques such as spelling errors, punctuation errors, word form errors and grammatical errors. Chaski (2001) selected a group of four authors who were sociolinguistically matched on features such as age, sex, race, dialect and education level, and employed the chi-squared statistic to compare one author's measures with another's. The outcomes of the tests showed that only some of the techniques in the first group produced a suitable outcome. Each basic measurement was compared one at a time and assessed as being a discriminatory technique or not. No attempt was made to examine the effect of the features in conjunction with one another.
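The single-feature comparison Chaski describes can be sketched with a standard chi-squared test of independence. The counts below are invented, and this is a generic illustration rather than Chaski's actual protocol.

    from scipy.stats import chi2_contingency

    # Hypothetical counts of one marker (e.g. a class of punctuation)
    # versus all other tokens, for two authors.
    table = [[42, 958],   # author A: marker, other tokens
             [17, 983]]   # author B: marker, other tokens

    chi2, p, dof, expected = chi2_contingency(table)
    print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
    # A small p-value suggests the two authors differ on this marker.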

Chaski claims that if the technique works for this set of authors it should work for any set of authors, whether or not they have a similar sociolinguistic background. This last claim seems counterintuitive: if the technique works for a distinct sociolinguistic subset of authors, then one could not be sure, if authors from a different background were introduced, that one was not measuring the sociolinguistic difference rather than the authorial difference. There are also concerns that Chaski has pinned authorship attribution for forensic purposes to a narrow set of unsophisticated techniques.

Grant and Baker (2001) respond to Chaski (2001), raising concerns over the techniques that were used. Grant and Baker point out that Chaski ignores some fifty years of research into authorship attribution, focusing on a few techniques rather than on what has proven reliable in the past, e.g. function words and the work of Mosteller and Wallace on the Federalist papers. Grant and Baker discuss the concepts of validity and reliability of authorship attribution in detail. They suggest that it is not enough simply to identify markers of authorship: a theoretical understanding of why a marker discriminates authorship should be sought. This point is reinforced by McMenamin's (2001, p. 94) further comments that "CC says junk science must be eliminated from the courtroom (p. 2), then proceeds to demonstrate, in her own way, that all approaches other than her own 'violate theoretical principles of modern linguistics' . . . The specter of the straw man is so omnipresent in all this pernicious junk science as to make us shudder."

Forensic linguistics is thus a long way from becoming a science. However, it is encouraging for the purposes of this study that this research sets out to use verifiable, repeatable, empirical methods with known error rates, in the spirit of the Daubert criteria.

2.2 E-mail and Related Media

2.2.1 E-mail as a Form of Communication

As a genre of computer mediated communication, e-mail has been situated between the spoken word and letter writing. It has some properties of both genres (Baron, 1998), and it seems plausible that different messaging media affect the way that messages are constructed. The most recent example of this is SMS messaging from mobile phones, where the limitations of the devices' keypads have affected the way words are written, e.g. 'u' replaces 'you' and 'r' replaces 'are'. Messages can be constructed in much less time if these replacements are made.

Gains (1999) attempted to determine some of the features of e-mail messages and how they were written by comparing academic and commercial authors and the purposes for which their e-mails were written. He investigated the distribution mode, i.e. to a single recipient or multiply distributed, and the function, i.e. as information, as a request or as a directive, with further classification into whether the message was an initial message or a response. With only a small number of e-mail messages studied (n = 116), he concluded that e-mail written for commercial purposes appeared to be an established form of everyday internal communication, offering the benefits of being instantaneous and easily distributable. Academic e-mail, on the other hand, appeared to be more diverse in its usage, being used for dissemination of information, requests for information, making and maintaining contact with others, and chat.

Sallis and Kassabova (2000) suggested that e-mail authors tend to neglect grammar, spelling and good vocabulary, instead writing with shortened words and incomplete sentences. These tendencies can also lead to the depersonalisation of communication, as there is no face-to-face conversation with its own social cues. Misunderstandings can also occur, as there is no vocal intonation in the transmission of the message.

Emoticons15 can only go so far to alleviate these problems. Sallis and Kassabova used messages from Usenet newsgroups to study various stylometric characteristics, including sentence and paragraph lengths, number of unique words and readability scores, their results indicating that e-mail is an informal means of communication.

E-mail communication in the workplace, however, is not necessarily private (Sipior and Ward, 1995). While many employees believe that e-mail messages may be regarded as confidential between the sender and receiver, some employers view the monitoring of e-mail messages as a right, if not a necessity, to prevent abuse of company resources.

Most authorship attribution studies to date have focussed on determining the authorship of written anonymous texts such as poems, essays and books. The research carried out for this thesis makes the assumption that e-mail text is more similar in style to written text than to the spoken word, although elements of both genres will appear in e-mail messages, as they constitute a genre in their own right.

2.2.2 E-mail Classification

The classification problem of whether or not an e-mail message is spam has been investigated extensively (e.g. Cohen, 1996, Sahami et al., 1998, Drucker et al., 1999, Androutsopolous et al., 2000a). Machine learning tools such as Naïve Bayesian classifiers (Androutsopolous et al., 2000b) and Support Vector Machines (Drucker et al., 1999) have been used for this task.

Another e-mail classification problem has been that of learning to filter e-mail messages into relevant subject folders based on the content of the message. The Ripper learning algorithm (Cohen, 1996) has been used for this task to induce rules for classification.

15 Combinations of punctuation characters that show some form of the tone of the message to the reader, e.g. :-) for happy or :-( for sad.

Crawford et al. (2001) described the "i-ems" project, which aimed to automatically induce appropriate categories or subjects for sorting e-mail messages. None of the above work, however, has attempted to classify the authorship of the e-mail messages in question.

2.2.3 E-mail Authorship Attribution

De Vel (2000) attempted to identify and attribute authorship of e-mail messages using Support Vector Machines (discussed in Section 2.4.1), with a collection of typical stylistic features such as the frequencies of short words and function words. Promising results were achieved, but the approach was limited and far from optimised. The main features were presented as raw word frequencies that had not been normalised to take into account the differing lengths of the e-mail messages in the corpus being investigated. De Vel also introduced the idea that some structural features can be exploited, including the 'reply status' of the e-mail message, i.e. whether it is an original message or a reply.

Tsuboi (2002) studied authorship attribution of e-mail messages and World Wide Web documents written in Japanese. He used the 'bag of words' approach, popular in text classification, for his feature set, and the SVM as the machine learning classifier. He also used sequential word patterns, or word n-grams with n = 2 and 3, from each sentence in the documents. As word order is relatively free in Japanese, word segments could be an important indicator in that language. The reported accuracy of classification for e-mail documents is greater than 95%. Although this study contributes to authorship attribution in Japanese, the technique may not be transferable to English.
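The normalisation point can be made concrete: dividing raw counts by message length puts messages of different sizes on a comparable footing. A minimal sketch follows, with an invented five-word function-word list standing in for a realistic one.

    FUNCTION_WORDS = ["the", "of", "and", "to", "in"]  # illustrative subset only

    def normalised_frequencies(message: str) -> list:
        """Relative frequency of each function word, normalised by the total
        number of tokens so that messages of different lengths are comparable."""
        tokens = message.lower().split()
        total = len(tokens) or 1
        return [tokens.count(w) / total for w in FUNCTION_WORDS]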

2.2.4 Software Forensics

Research in the field of software forensics has been carried out to identify the author of a computer program (Spafford and Weeber, 1993, Krsul and Spafford, 1997). Various objective metrics, such as the proportion of blank lines, the proportion of comments, the average length of identifiers, and statistics based on those metrics, have been proposed to characterise the author of a program, with some success. Kilgour et al. (1997) proposed the use of other variable measures, such as the presence or absence of a feature, in an attempt to better identify the authorship of a program. In these cases the variables were assigned a discrete value, for example 0 for absent and 1 for present, instead of a value calculated from some frequency distribution.

Some structural features may be of use in the classification of e-mail message authorship too. E-mail messages have macro-structural features, such as the presence or absence of greetings, farewells, signatures and attachments, that can be analysed along with the micro-features of the text contained within them, although such structural features are readily falsified.
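By analogy with the discrete software metrics, macro-structural e-mail features can be encoded as 0/1 values. The patterns below are illustrative guesses, not the feature definitions used later in this thesis.

    import re

    def structural_features(message: str) -> dict:
        """Presence (1) or absence (0) of some macro-structural e-mail features."""
        lines = [l.strip() for l in message.strip().splitlines()]
        first = lines[0].lower() if lines else ""
        tail = [l.lower() for l in lines[-4:]]
        return {
            "has_greeting": int(bool(re.match(r"(hi|hello|dear|g'day)\b", first))),
            "has_farewell": int(any(re.match(r"(regards|cheers|thanks|yours)\b", l) for l in tail)),
            "has_signature": int(any(l.startswith("--") for l in tail)),
        }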

2.2.5 Text Classification

Text classification has become a widely researched problem with the advent of the Internet and of search engines for information retrieval from the World Wide Web. Newly crawled web pages have to be assigned a subject so that relevant documents can be retrieved when a search is conducted. Text classification attempts to categorise a set of text documents based on their contents or topics. Many methods have been proposed for text classification, and most of these use the "bag of words" approach, as it has been found that the ordering of the words in a document is of minor importance for determining its content or subject. In this approach, each word in each document of a corpus is assigned to be a distinct feature. For each document to be classified, a word vector feature representation is constructed based on the frequencies of the words.

To dampen the effect of widely variant frequencies of words in individual documents and in the entire corpus, the frequencies are often weighted using the Term Frequency Inverse Document Frequency (TFIDF) approach (Joachims, 1997). This word weighting scheme holds that if a word occurs frequently in a document, i.e. it has a high term frequency, that word is an important indexing term. On the other hand, if a word occurs in many documents, it will be less important as an indexing term and will have a low inverse document frequency.

A number of classification techniques have been used, including decision trees (Apte et al., 1998), Bayesian probabilistic approaches (Yang, 1999) and Support Vector Machines (Joachims, 1998). Joachims (1998) compared the SVM with four conventional machine learning approaches: the Naïve Bayes classifier, the Rocchio algorithm, an instance based k-nearest neighbour classifier and the C4.5 decision tree/rule learner. The experiments were conducted on two different data sets. The k-nearest neighbour classifier performed best among the four conventional methods on both data sets. The SVM, using the polynomial kernel function16 and the radial basis kernel function, outperformed the k-nearest neighbour classifier. Furthermore, the four conventional methods required the "best" features to be selected from the feature set. The SVM performed better than the conventional methods even when all features from the feature set were used, indicating that there were few irrelevant features in the problem domain and/or that the SVM was still able to discriminate with the irrelevant features included in the feature vector.
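A minimal sketch of the weighting just described, using the common tf × log(N/df) form; published TFIDF variants differ in detail, so this is indicative rather than Joachims' exact formulation.

    import math
    from collections import Counter

    def tfidf_vectors(docs: list) -> list:
        """Weight each term in each tokenised document by term frequency
        times inverse document frequency."""
        n_docs = len(docs)
        df = Counter(term for doc in docs for term in set(doc))  # document frequency
        vectors = []
        for doc in docs:
            tf = Counter(doc)  # term frequency within this document
            vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
        return vectors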

16 Given two points in an input set, and an embedding Hilbert space, the function that returns the inner product between their images in that embedding space is known as the kernel function (Lanckriet et al., 2002).

For this study, the lesson from recent text classification research is that sophisticated machine learning techniques such as the Support Vector Machine provide the most promising avenue to pursue. How the SVM allows learning with many features is further discussed in Section 2.4. As mentioned in Section 2.1.4, previous studies in authorship attribution either chose to use a small number of features, or were constrained to a small number because the classification techniques could not handle a feature space of high dimensionality. Both Rudman (1998) and Holmes (1998) suggest that combinations of features successful in previous stylometric research should be used to produce a better method of discriminating authorship. Using the SVM as the classification tool could be the solution to the problem of including many features in authorship attribution. The Support Vector Machine was selected as the machine learning tool for this project due to its reported ability to generalise with many features.

2.3 Sociolinguistics

Sociolinguistics is the study of language as it is actually used in social contexts. There are differences in the way that different social classes speak and write. Such distinctions may be found between:

• people of different gender (Thomson and Murachver, 2001)

• people of different age groups (Rayson et al., 1997)

• people with different educational backgrounds (Baayen et al., 2000)

• people with different ethnic backgrounds or levels of familiarity with the language under investigation, e.g. English as a native language vs. English as a second language

It may be possible to determine various sociolinguistic classes, such as gender or age, for characterising or profiling the authors of e-mail messages. The approach used for authorship attribution should be applicable to this problem also, i.e. evaluation of features from the e-mail messages followed by machine learning to produce sociolinguistic models. An e-mail message with unidentified authorship could be analysed using the various sociolinguistic models to reduce the number of suspect authors in the set to be further analysed for authorship identification.

Rayson et al. (1997) studied gender differences, age differences (under and over 35 years of age) and social group differences based on occupationally graded social class categories commonly used in market research. The social groups were aggregated into two larger groups, i.e. upper and lower class groups. The gender differences are discussed further in the next section. Different tendencies in word usage were found for both the age and social groups.

Baayen et al. (2000) conducted an experiment in which students at different educational levels were asked to write texts of around 1000 words in their native Dutch. The texts written by the students were in three different genres. No authorial structure was discerned between the texts, but a difference was noted between the different genres. Some difference was detected in measures of vocabulary richness between the students at different levels of their education.

2.3.1 Gender Differences

It has been established by Ojemann (1983) that different parts of the brain are activated by men and women for some language tasks. Empirical evidence suggests that men and women converse differently even though they speak the same language. These gender differences exist in written communication, face-to-face interaction and computer mediated communication (CMC). Many studies have been undertaken on the issue of gender and language use, and gender preferential studies in CMC have been undertaken recently. Very few studies involving e-mail have been done, and no studies attempting to automate the classification of gender in e-mail messages have been found in the literature.

Singh (2001) studied transcripts of spoken communication and developed a number of measures of lexical richness. He compared male and female writings using discriminant analysis. He found from the small sample of subjects in the study (n = 30) that male speech was lexically richer and tended to use longer phrases. Female speech used more verbs, shorter sentence structures, used lexical items more repetitively and used nouns and pronouns interchangeably.

Rayson et al. (1997) analysed a conversational corpus and found that men use more swear words and number words, e.g. hundred, three, etc., while women used the feminine pronouns she, her, hers and the first person pronouns I, me, my, mine with higher frequencies. In general, a higher preference for common nouns was found for males, while females preferred proper nouns, verbs and personal pronouns.

In a study undertaken to predict gender from electronic discourse, Thomson and Murachver (2001) suggested that gender preferential language patterns from spoken communication transferred in part to CMC. The many studies of gender have discovered that, in general, women are more likely than men to refer to emotions, use intensive adverbs, make compliments, use personal pronouns, ask questions and use more words associated with politeness. Men are more likely to use terms related to quantity, use insults, make grammatical errors and provide more opinions. Thomson and Murachver found that not all people of the same gender use the same set of gender preferential features; there does not appear to be a pattern to which all men or women conform.

Sussman and Tyson (2000) studied message postings to Usenet newsgroups that were designated as male, female or gender neutral topics. Men's postings were generally longer than those of women, but women initiated new topics more often. It is also believed that people modify their communication style depending on the type of communication being undertaken, so that the style is relevant to the situation. Differences exist between formal and informal communication and between male and female communication. Gender-preferential features have been found to be more common in same gender communication than in mixed gender communication.

In a study based on CMC from Usenet postings, Savicki et al. (1996) suggested that when a larger proportion of men take part in discussion groups, men use more statements of fact, speak in action based terms, are more argumentative and use more coarse and abusive language. They also postulated that when there is a larger proportion of women in a discussion group, women use language that is more self disclosing, apologise more and ask more questions. The study showed that only some of the variables measured were significant in distinguishing gender. For males, these were the use of facts and action words, while for females the significant variables were the use of personal pronouns and 'we' pronouns.

Herring (1993) conducted a gender preferential communication study based on CMC. The documents studied were postings to Usenet newsgroups. Like the above studies, this study showed differences in the topics that men and women discussed, as well as in the way they discussed them. The messages studied showed that women were more likely to express doubt, apologise, ask questions and suggest ideas rather than make assertions. Men's postings were more likely to show self promotion, make insults, use sarcasm and make strong assertions. It was suggested that women use a "rapport" style of communication while men use a "report" style.

A study was undertaken by Hills (2000) to determine whether males and females could convey a false gender identity in CMC. She studied various aspects of how the participants' language changed when they attempted to do this. It was found that when trying to establish a false gender identity, the participants exaggerated what they thought were the gender preferential features at the word and clause level. They did not manipulate many of the features, nor did they do so particularly well. In general, the participants manipulated the topic of their writings to try to affect their gender identity.

2.3.2 Differences Between Native and Non-Native Language Writers

The use of the English language has spread throughout the world by two main mechanisms (Bhatt, 2001). The first of these was the transplantation of the language, initially to Wales, Scotland and parts of Ireland, and then by the movement of English speaking people to North America, Australia and New Zealand. These countries have adopted English as their native language. The second mechanism took English to non-English sociocultural contexts such as South Asia, Africa and Latin America. By this second mechanism, English has come into contact with unrelated languages, and varieties of English have formed in countries such as India, Malaysia, Singapore and Nigeria. This has led to different ways of teaching English in such countries. The spread of the English language and its diversification to form "World Englishes" by contact with other cultures should leave measurable features in all forms of communication.

One clue for this project is that using a second language makes it more difficult to translate slang or idiomatic expressions from one language to another. Due to the different grammatical rules of different languages, non-native writers often translate phrases or sentences literally. They may also make more, and possibly characteristic, spelling mistakes and grammatical errors.

Some studies into language differences between speakers of different languages have been limited to distinguishing differences in a localised geographic area. Johnstone (1999) studied the differences between speakers from African American, Hispanic and local backgrounds in Texas, USA. Johnstone found that all speakers studied were idiosyncratic and that class, sex, age and region all had an impact on the way that they spoke. An investigation of the use of English, Spanish and Creole in the Caribbean was undertaken by LePage and Tabouret-Keller (1985). They found that the way people spoke in a certain situation was the result of a choice of identity with one group or another. These studies imply that the stylistic effects of language background, as measured from the text and structure of e-mail, are unknown and could profitably be investigated.

2.4 Machine Learning Techniques

Stylometry measures the discriminatory features proposed for authorship attribution, reducing the style of a particular author to a pattern. Machine learning is particularly suited to pattern matching problems and was used as a tool in this research for the classification of authorship patterns. Machine learning techniques have the ability to predict a classification for an unseen test point, i.e. to generalise about unseen data. The previously discussed work from the humanities has not taken advantage of the improvements in the field of machine learning. As shown above, the machine learning technique favoured in previous work has been the neural network.

According to Witten and Frank (2000), "Things learn when they change their behaviour in a way that makes them perform better in the future." A machine learning algorithm attempts to learn from a set of example data in order to generalise about unseen data. We can train the algorithm by optimising the learning process via manipulation of the variables of the algorithm itself and of the problem domain. The algorithm must produce some type of model representing the knowledge it has learned, and we must measure its performance, i.e. its ability to classify unknown examples, to determine how good the model is. There are a number of different types of machine learning algorithms, and some of these are discussed briefly here.

Rule Based Learners Rule based learners attempt to make rules from the feature values in the training data. For each feature in the data, the algorithm determines the frequency of the feature values, or of discretised bands of feature values, and determines the class of instances to which the most common value belongs. A rule assigning a class from a feature value or range of values is created for each feature, and each rule is then tested. The rules with the lowest error rates are chosen to classify unseen data.

Decision Trees Decision trees can be induced based on information gain17 (Quinlan, 1986) using a top-down approach. At each level of the tree, starting at the root node, the feature providing the maximum information gain ratio from its classes is selected. This produces a decision tree with minimal structure. Quinlan produced the C4.5 classifier using this approach and optimised it to perform discretisation of data when the feature values are purely numeric and to cope with missing feature values.

Instance Based Learners An instance based learning algorithm uses a distance function to determine which member of the training set an unknown test instance is closest to. This method is particularly suitable for numeric data, as the distance function is easily calculated in such cases. The k-nearest neighbour classifier uses the Euclidean distance function, and this method has become widely used for pattern recognition problems.

17 Information gain is the difference between the information value (also known as entropy) of the data before the decision tree is split and the information value of the data after the split.

Neural Networks Neural networks are examples of nonparametric methods, meaning that they can construct a representation of a problem from data where an explicit model of the problem domain is difficult to calculate or is unknown. Values for data features are fed to the input nodes of the neural network and are manipulated by transfer functions at each node. The input data is passed through one or more hidden layers of nodes and finally on to a set of output nodes. The input nodes are fully connected to each node in the hidden layer and the hidden layer is similarly connected to each node in the set of output nodes. The transfer functions may be non-linear in nature. The neural network must be trained by adjusting the weights of the connections between the nodes to minimise the error rate of the output nodes with the training data. Unseen test data can be fed into the trained neural network and the output class will be predicted.

Support Vector Machines Support Vector Machines (SVMs) extend the concept of classification with linear models. The input data vectors, which are limited to being from two classes and may not be linearly separable, are transformed from Euclidean space into a new higher dimensional Hilbert space. A model, the maximum margin hyperplane, is then constructed in this new space. The maximum margin hyperplane is the model with the greatest separation between the two classes. It is defined by the data vectors that are closest to the hyperplane; these data vectors are termed "support vectors". Further discussion of the SVM algorithm is given in Section 2.4.1. Test data vectors are similarly transformed into the new space and are classified by determining which side of the maximum margin hyperplane they are situated on. The model defined by the maximum margin hyperplane is based upon a kernel function. This can be a linear kernel function, a polynomial kernel function, or another non-linear function such as a radial basis function kernel or a sigmoid kernel. The best kernel function is generally determined experimentally.
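For reference, these kernels take the following standard forms, where $\vec{x}$ and $\vec{z}$ are input vectors and $c$, $d$, $\gamma$ and $\kappa$ are tunable parameters (the notation is added here and is not drawn from the thesis):

$$K_{\mathrm{linear}}(\vec{x},\vec{z}) = \vec{x} \cdot \vec{z}$$

$$K_{\mathrm{poly}}(\vec{x},\vec{z}) = (\vec{x} \cdot \vec{z} + c)^d$$

$$K_{\mathrm{RBF}}(\vec{x},\vec{z}) = \exp\left(-\gamma \lVert \vec{x}-\vec{z} \rVert^2\right)$$

$$K_{\mathrm{sigmoid}}(\vec{x},\vec{z}) = \tanh(\kappa\,\vec{x} \cdot \vec{z} + c)$$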

Rule based learners and decision trees require that the data values for each feature be limited to a set of values, or be discretised into a set of ranges for each numeric feature. The discretisation of the data can be critical to the success of these techniques. Instance based learners such as the k-nearest neighbour technique can be computationally intensive, as each test instance must be compared with each training instance to find the best matching output class. For these reasons, these machine learning techniques were not considered for use in this research. This leaves the Naïve Bayes classifier, the neural network and the SVM as techniques that may be suitable for use in authorship attribution.

It was the aim of this research to use a large number of different features to define the authorship profile or pattern. Neural networks can, depending on the problem domain, be limited to the use of a small number of features. Two problems arise when training a neural network with a large number of features. First, because a gradient descent algorithm is used for training, a high input dimensionality makes training difficult. Second, it is necessary to limit the number of trainable parameters for fear of overfitting the data by producing a network that is too complex. Overfitting is problematic because it can lead to a loss of generalisation ability in the classifier.

2.4.1 Support Vector Machines

The fundamental concepts of Support Vector Machines were developed by Vapnik (1995). The SVM is based on the idea of structural risk minimisation (SRM). SRM attempts to minimise the generalisation error, i.e. the true error on unseen examples, which is bounded by the sum of the training set error, i.e. the empirical risk, and a term which depends on the Vapnik-Chervonenkis (VC) dimension of the classifier and on the number of training examples. The VC dimension is a measure of the capacity of the classifier model. The classifier must try to minimise error on the training set to achieve low empirical risk, and it must limit the capacity of the model, to avoid overfitting the training data, to achieve low structural risk. When both are kept low, the generalisation error is limited. The SVM algorithm performs this balancing act even when a large number of features is measured in the problem domain, i.e. when the dimensionality is high.

SVMs do not suffer from overfitting to the same extent as neural networks, as only the training data vectors that are needed to maximise the separation of the classes are used to define the decision boundary. These vectors are termed the support vectors. Figure 2-2 shows an example of a linear hyperplane with the decision boundary separating the positive and negative classes with maximum margin. While there are many possible hyperplanes that could separate the vectors in a separable problem, the optimal hyperplane is the one that separates the vectors with maximum margin. If data vectors are added to or removed from the training data, only those that are support vectors affect the decision boundary of the separating hyperplane.

The use of a structural risk minimisation performance measure is in contrast with the empirical risk minimisation approach used by conventional classifiers. Conventional classifiers attempt to minimise the training set error, which does not necessarily achieve a minimum generalisation error. SVMs therefore have, in theory, a greater ability to generalise.

The text classification research discussed in Section 2.2.5 has shown that the SVM appears to be a better machine learning classifier for the text classification problem domain than those discussed above, including the Naïve Bayes classifier, the C4.5 classifier, the k-nearest neighbour classifier and the neural network. Text classification uses word count frequencies, often weighted in some manner, as the feature set. Many of the features used in this research were based on frequencies of some type, and it is believed that the SVM should be a suitable classifier for the attribution of authorship of e-mail documents.

Figure 2-2: An Example of an Optimal Hyperplane for a Linear SVM Classifier

2.5 Chapter Summary

A central claim of stylistics is that there are objectively described patterns which make authorship attribution possible, or at least permit the formation of hypotheses that can be pursued by other means (Goutsos, 1995). Authorship attribution has been undertaken on various literary works by finding traits or features that are unique to a particular author. Authorship attribution is a research field of long standing in the humanities but has only recently begun to be applied to CMC.

As this is a study in forensics, the set of possible authors for an e-mail message cannot always be immediately confined to a small number of suspected authors. Initially the data set may be quite large, and it will typically be necessary to select a smaller number of authors, out of which the best match can be determined. To accomplish this, other linguistic techniques, such as authorship characterisation using sociolinguistic filters, might have to be applied to limit the suspected author set.

Many authorship attribution studies to date have used only one or a small number of features. All of these features are worth including in this project's initial phase. However, it is expected that more than a few measures will be required to discriminate between authors. By using as many features as can be identified from a number of areas of linguistics, and by using the Support Vector Machine as a classifier for pattern matching, this research should improve the prospects of finding the pattern that distinguishes between authors. The major questions that this research has addressed are as follows:

• Can the work that has been carried out in stylometry research based on literary works be applied to the text contained in e-mail messages?

• What are the best features for authorship attribution for text from e-mail messages?

• How many words are required in e-mail messages to make authorship attribution successful?

• Is there some way of reducing a large list of suspected authors, e.g. 50, to a smaller list of authors, e.g. 5, from which the one author most likely to have written an e-mail message can be selected?

It is necessary to define some level of ‘success’ for the experiments that will be undertaken. This level should be significantly better than the level of chance.

The weighted macro-averaged F1 value will be used as the ultimate measure for the experiments performed (see Section 3.1.3 for an explanation of F1). Where possible, the same data set will be used when testing the various parameters of the experimental sequence. It will then be possible to compare weighted macro-averaged F1 values between experiments to see whether the parameters had a positive or negative influence on the classification of authorship. Ultimately, it is hoped to achieve a weighted macro-averaged F1 value of 85% or more from the technique.

This research will not attempt to define any new stylometric features. It will instead be limited to finding the best features from those used previously, and to using those features with the Support Vector Machine learning algorithm to attribute authorship of e-mail messages from the text within those messages. The SVM has been shown to be a good tool for text classification, and on this evidence it is thought to be the most promising tool for authorship attribution. Comparisons with other machine learning algorithms will not be undertaken in this thesis.

The language of the data used for any experiments will be limited to English. Chapter 3 will outline the approach that will be taken to determine whether authorship attribution will be possible with e-mail message text.

Chapter 3

Authorship Analysis and Characterisation

Chapter 2 demonstrated that it is possible to attribute authorship of text using stylometry, even though a consensus on the best measures of an author's style and the best method for discrimination has not been reached. One of the questions that this work must answer is whether or not stylometry is applicable to the attribution of authorship of e-mail messages. Even though the length of an e-mail message is usually less than the suggested minimum word count required for stylometry (1000 words), it is likely that elements of style can be measured in e-mail. With a sophisticated discrimination technique, attribution of authorship of e-mail messages should be achievable.

Much research carried out into non-traditional authorship attribution has come to rely on stylistics and stylometry, applying numerical techniques such as principal component analysis and computational techniques such as neural networks. Most of this work has been performed with techniques that can handle only a limited number of features, exposing the problem that one can never be sure whether or not the best features have been selected. Section 2.4 showed that the Support Vector Machine (SVM) is a good classifier for the analysis of text when attempting to classify its content, because it can be used with many features with less chance of overfitting the data than neural networks.

SVMs have been successfully used in text classification experiments with 10000 or more features (Joachims, 1998), indicating that the technique is applicable to problem domains which use many features. Previous research in authorship attribution has proposed many different types of features to discriminate between authors, but each separate study has relied on only a small number of features or features of one type. This project will use the Support Vector Machine as its classification technique and a combination of many features for discriminating between authorship classes. E-mail messages contain not only text but also structural formatting features under the author’s control. The text based features proposed for this analysis are culled from previous stylometric work. Candidate structural features were sourced from the work done on software forensics, as discussed in Section 2.2.4, where many of the metrics are related to the layout of the source code under investigation. Other possible structural features were derived from the various choices that individuals make when writing a message, e.g. whether or not they include a greeting or format a word or sentence in some particular way. The issues addressed in this chapter include:

• setting a framework for the conduct of classification experiments using the Support Vector Machine

• choosing candidate stylistic features for solving the problem of classifying the authorship of an e-mail message

• determining an experimental sequence for testing whether or not classification of authorship of e-mail messages will be successful

This chapter introduces the experimental work which was undertaken in this research. Section 3.1 discusses how machine learning experiments were conducted and outlines the relevant performance measures for classification experiments in general. Section 3.2 gives an account of the stylometric features drawn from the literature and of how the best features for classification and attribution of authorship of text were determined. Section 3.3 discusses how a baseline of authorship analysis using the SVM was established, and how this transferred to authorship analysis of e-mail messages. The baseline was established using plain text, i.e. not e-mail text, in an attempt to determine the applicability of the techniques being developed to e-mail messages. Section 3.4 discusses how features other than the stylometric ones used in the baseline experiments can be drawn from e-mail messages and how they could impact on classification efficiency. Features possibly characterising the sociolinguistic classes of an author are discussed in Section 3.5. The data sources and collection methods are discussed in Section 3.6, and Section 3.7 summarises the chapter's contents.

3.1 Machine Learning and Classification

3.1.1 Classification Tools

The basic tool for classification of authorship selected for this project was the Support Vector Machine (Vapnik, 1995). As discussed in Section 2.4, the SVM is recognised as a good tool for text classification and is suitable for classification tasks where there is a small number of data points and a large number of features. The SVM technique is not affected by the sparseness of the feature vectors. The SVM used in this research was SVMlight version 3.5, implemented by Thorsten Joachims, which is freely available for research purposes (Joachims, 1999).

The SVM requires a set of training data as the basis for learning. It generates a model, which is used to generalise about or classify other data not included in the training set. The input used to train an SVM is a set of training vectors $\{\vec{v}\}$, where $\vec{v} = (\vec{x}, y)$, consisting of a feature vector $\vec{x} \in \Re^N$ and a class label $y \in \{-1, +1\}$ indicating the feature vector's classification as a positive or negative example. An example of some input training data is shown in Figure 3-1.

Figure 3-1: Example of Input or Training Data Vectors for SVMlight
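Since the figure itself is not reproduced here, the lines below indicate the general shape of SVMlight training input as documented with the software: a target class followed by sparse feature:value pairs in ascending feature order. The values are invented for illustration.

    +1 1:0.43 3:0.12 284:0.20
    -1 2:0.98 7:0.41 284:0.05
    +1 1:0.27 9:0.33 511:0.10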

The input data is fed into the learning module of SVMlight, where the decision boundary between the two classes is determined and the support vectors defining the boundary are identified and written to a classification model file. Unseen test data is analysed for the same features as the training data. Test data vectors are constructed in the same manner as the input data, except that the class number is not required. Each test data vector is then classified using the learnt model.

The output from the classifier is a single number for each data vector classified. The sign of the output number indicates the class into which the data vector has been classified, positive or negative. The magnitude of the output number is an indication of the confidence of the decision made by the classifier (T. Joachims, personal communication, October 20, 2000). An example of some output data is shown in Figure 3-2.

Figure 3-2: Example of Output Data from SVMlight

3.1.2 Classification Method

The SVM provides the ability to perform classification of binary class problems only. Forensic attribution of authorship would not, in most cases, be restricted to a selection between only two authors. There are methods, however, for performing multi-class classification using binary classifiers: 'one against all' classification and 'one against one' classification.

In the 'one against all' method, one of the classes is made the positive class and all others are joined to make the negative class. For n classes, n classifier models are learned, with each class in turn being made the positive class for its corresponding classifier. Figure 3-3 shows this process when n = 4. When this approach is used, unseen test data is classified with all n classifier models.

In an ideal classification, a test data vector will be classified correctly and the output from each classifier will indicate the class to which it belongs. If it belongs to one of the authors in the classifier scheme, this should be indicated by one positive result and n − 1 negative results. If the data vector does not belong to one of the authors in the classifier scheme, this should be indicated by n negative results.

Figure 3-3: ‘One Against All’ Learning for a 4 Class Problem

In the 'one against one' classification method, only two of the n classes of data are used to produce each classifier. Each class is paired with every other class to produce n(n − 1)/2 classifier models. Figure 3-4 shows this process when n = 4. With this classification method, each test example is classified with all classifier models. Each classifier model predicts that the example belongs to one of the two classes in that particular model. This prediction is made even if the test example belongs to neither of the classes in the classifier model being used. 'One against one' classification requires a voting technique to determine the class to which the test data vector belongs. The class which receives the most votes is selected as the class to which the test example belongs.

Figure 3-4: ‘One Against One’ Learning for a 4 Class Problem

The complexity of 'one against all' classification is O(n) with respect to the number of authorship classes, for both the learning and classifying phases. In comparison, the 'one against one' classification method is O(n²). When classifying test data, an authorship class will be assigned regardless of the true class of the data. With either multi-class approach, if a classifier model is learned from authors A and B and the test data belongs to neither of these two classes, a false positive result will be generated no matter which class the classifier selects. When using the 'one against all' approach, there are fewer decisions to be made about which authorship class the test data belongs to, even if it does not belong to any of the authorship classes in the classifier model scheme. To avoid the additional computational overhead, the approach used in this project was 'one against all'.
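A minimal sketch of 'one against all' prediction follows. The Model type and its decision method are hypothetical stand-ins for running SVMlight's classifier, not its actual API.

    def one_against_all_predict(models: dict, vector) -> str:
        """Classify with n binary models, one per authorship class. Each model
        returns a signed decision value whose magnitude indicates confidence;
        assign the class whose model gives the largest output."""
        return max(models, key=lambda author: models[author].decision(vector))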

3.1.3 Measures of Classification Performance

There are various measures of classification performance (Witten and Frank, 2000) that can be calculated for each classification experiment. These familiar machine learning measures include:

• Error Rate (E)

• Precision (P)

• Recall (R)

• F1 combined performance measure

These measures are explained in more detail below. To calculate each of these measures, it is necessary to assign the result of each classification to one of the following four types of result:

• a true positive result (TP) - the classifier has identified a positive class data point as positive

• a false negative result (FN) - the classifier has identified a positive class data point as negative

• a false positive result (FP) - the classifier has identified a negative class data point as positive

• a true negative result (TN) - the classifier has identified a negative class data point as negative

From the frequencies of these results, a two-way confusion matrix can be constructed, as shown in Figure 3-5.

                           Predicted Class
                           Yes               No
    Actual Class    Yes    true positive     false negative
                    No     false positive    true negative

Figure 3-5: Construction of the Two-Way Confusion Matrix

The error rate is defined as follows:

$$\mathrm{Error\ Rate}\ (E) = \frac{FP + FN}{TP + FP + FN + TN}$$

Precision and recall measures are taken from the area of information retrieval and are defined as follows:

$$\mathrm{Precision}\ (P) = \frac{TP}{TP + FP}$$

$$\mathrm{Recall}\ (R) = \frac{TP}{TP + FN}$$

It can be seen from these formulae that, as the number of false positives tends to 0, the precision approaches 1, and as the number of false negatives tends to 0, the recall approaches 1. Obviously, the fewer errors made, whether false positive or false negative, the better. The F1 value combines the precision and recall values into a single value by calculating their harmonic mean as follows:

$$F_1 = \frac{2 \times R \times P}{R + P}$$
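These definitions translate directly into code; a minimal sketch:

    def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple:
        """Precision, recall and F1 from confusion matrix counts, following
        the formulae above (returning 0.0 where a denominator is zero)."""
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * recall * precision / (recall + precision) if recall + precision else 0.0
        return precision, recall, f1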

Each of these measures can be calculated for each authorship class. To get an indication of the overall success of a classification experiment, the macro-averaged error rate and F1 measure (Yang, 1999) can be calculated across the authorship classes using the following formulae:

$$E^{(M)} = \frac{\sum_{i=1}^{n} E_{AC_i}}{n}$$

and

$$F_1^{(M)} = \frac{\sum_{i=1}^{n} F_{1,AC_i}}{n}$$

where $AC_i$ is the authorship class ($i = 1, 2, \ldots, n$) and $n$ is the number of authorship classes. To compensate for document frequency, the statistics for each authorship class are inversely weighted by the number of data points in each class (de Vel, 2000). The weighted macro-averaged error rate ($\bar{E}^{(M)}$) and weighted macro-averaged F1 ($\bar{F}_1^{(M)}$) can then be calculated using the formulae:

$$\bar{E}^{(M)} = \frac{\sum_{i=1}^{n} (1 - w_{AC_i})\, E_{AC_i}}{n - 1}$$

and

$$\bar{F}_1^{(M)} = \frac{\sum_{i=1}^{n} (1 - w_{AC_i})\, F_{1,AC_i}}{n - 1}$$

where $w_{AC_i}$ is the document frequency weight, which is calculated using the following formula:

$$w_{AC_i} = \frac{N_{AC_i}}{\sum_{i=1}^{n} N_{AC_i}}$$

and $N_{AC_i}$ is the number of documents in authorship class $AC_i$, $i = 1, 2, \ldots, n$, and $n$ is the number of authorship classes.

The weighted macro-averaged error rate and weighted macro-averaged F1 measure were used to make comparisons between experiments, so that increases or decreases in classification efficiency could be measured as the variables in a sequence of experiments were altered.
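A minimal sketch of the weighted macro-averaged F1 computation defined above:

    def weighted_macro_f1(f1_per_class: list, docs_per_class: list) -> float:
        """Weighted macro-averaged F1: each class's F1 is weighted by
        (1 - w_i), where w_i is that class's share of all documents, and the
        sum is divided by n - 1 (more than one class is assumed)."""
        n = len(f1_per_class)
        total = sum(docs_per_class)
        weights = [d / total for d in docs_per_class]
        return sum((1 - w) * f for w, f in zip(weights, f1_per_class)) / (n - 1)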

3.1.4 Measuring Classification Performance with Small Data Sets

When the classification problem has only a small set of data to work with, it can be difficult to provide enough data for disjoint training and testing sets. This is likely to be the case in a forensic study, where there may be only a small amount of data from which to produce a model of authorship.

In such cases it is possible to use a technique known as k-fold cross validation (Stone, 1974; Geisser, 1975) to provide a more meaningful result by using all of the data in the data set as both training and test data. In this technique, the data is split into k folds of as nearly equal size as possible. A set of classifiers is then learnt from k − 1 folds of data, with the remaining fold being used as the test set. This procedure is repeated so that each fold is held out for testing. The results of the classifications from the k tests are combined to calculate the overall results for the data set. Most commonly, k is set to 10.

When the folds are being created, random sampling is performed, so that some classes may be under-represented or over-represented in some folds. The folds can be stratified by sampling from within each class in turn until that class is exhausted. This process produces randomly sampled folds that have a distribution that more precisely mirrors the distribution of classes in the whole data set. This latter process is known as stratified 10-fold cross validation (Kohavi, 1995). Stratification and cross validation can be applied to binary class and multi-class classifications. Figure 3-6 shows an example of how data could be randomly distributed among k folds with k = 3 for the simplicity of the demonstration. In this example Class A has 5 feature vectors, Class B has 6, and Class C has 8. It is not necessary for each class to contain the same number of feature vectors. The feature vectors from Class A are randomly sampled without replacement and distributed to the three folds in turn, i.e. Fold 1, Fold 2, Fold 3, Fold 1, . . . . When Class A is exhausted, Class B and then Class C are distributed. This process ensures that each fold has a similar representation of feature vectors from the classes involved, and that this distribution is also similar to the data set as a whole. When k-fold stratification is combined with ‘one against all’ classification for a multiclass problem, the number of classifier models produced is k n where k is the × number of folds and n is the number of classes. An illustration of holding out test folds for the above example where k = 3 and n = 3 is shown in Figure 3-7. In this example, three training sets are constructed by combining the two folds not used for testing. As each training set contains three classes, three models will be learnt for each training set. The data in the test set, i.e. the fold not involved in training, will be classified using each of the three models from the corresponding training set and 3.1. MACHINE LEARNING AND CLASSIFICATION 63

Figure 3-6: An Example of the Random Distribution of Stratified k-fold Data

This technique can be used to provide measures of classification efficiency for a set of data points. In a forensic investigation, these measures could be used to tune the models with known data before attempting to attribute unknown test examples.

Figure 3-7: Cross Validation with Stratified 3-fold Data

3.2 Feature Selection

The attribution of authorship of an e-mail message is only possible if a set of discriminatory features can be identified. The features identified from previous authorship attribution and stylometric research, as outlined in Chapter 2, have been grouped into sets of features with a similar level of granularity, e.g. document, word and character levels. This was a novel approach to the selection of features for text or authorship classification, and it fits well with the SVM classifier, as the latter is able to handle large numbers of features. Methods for measuring the effectiveness of feature sets with respect to each other, and for determining which features were used for the classification of e-mail, are discussed in detail in Section 3.1.3. To determine the best set of features to use, initial tests were based on data that was not e-mail and investigated the effectiveness of each of the feature sets listed here, individually and in combination with one another. The sets of features include, but are not limited to, the following:

Document based features The candidate document based features are the average sentence length and the proportion of blank lines.

Word-based features The number of different words (types) and the total number of words (tokens) can be counted to calculate a type:token ratio. The number of words used once (hapax legomena) or twice (hapax dislegomena) can be counted. A set of metrics (Tweedie and Baayen, 1998) based on the values of types and tokens in a document were also used as word-based features. This feature set is displayed in Table 3.1; a full definition of each of the lexical measures is included in Appendix A. In Table 3.1, N is the total number of words in the document and V is the total number of different word types in the document.

Character-based features These features will include the frequencies or proportions of white space, digits, upper case characters and punctuation. The character based feature set is shown in Table 3.2. In this table, C is the total number of characters in the document.

Function word ratios The frequency of each of a set of function words can be calculated. A list of 120 common function words sourced from Craig (1999) will be used. The list of function words is shown in Appendix A.

Word length frequency distribution The frequency distribution of word lengths across a document can be determined and the relative frequency of each word length can be used as an individual feature. Word lengths between 1 and 30 will be considered.

Collocation Frequencies The frequencies of various collocations of words can be calculated and the ratio of these frequencies to the total number of occurrences of the first or base word in the collocation can be used as feature values.

Letter 2-grams The frequency of each letter 2-gram can be calculated and expressed as a ratio of the total number of 2-grams in a text. It was noted in Section 2.1.4 that Kjell (1994b) suggested that the 50 or 100 most significant 2-grams should be used for classification. In this project, all 676 possible letter 2-grams were used as features, as the SVM is able to work with a large number of features.

A problem for some of the feature sets is that there is great variety in the text size or word count between different e-mail messages. This would mean that if simple frequencies were counted, there would be no way of finding similarities between data points from the same authorship class when one e-mail might have significantly more words in it than another. For these feature sets it is necessary to normalise the chosen features in some manner. Features that rely on counting words should be normalised against the total number of words N in an e-mail message. Character-based features should be normalised against the total number of characters C in the text of the e-mail message. This is also necessary for text chunks sampled from plain text documents, even when the number of words in each text chunk is constant.
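The normalisation can be sketched as follows (illustrative only; the feature names below are hypothetical stand-ins for entries in Tables 3.1 and 3.2):

```python
def normalised_features(text):
    """Normalise word-level counts by the word total N and character-level
    counts by the character total C, so e-mails of different lengths yield
    comparable feature values."""
    words = text.split()
    N = len(words) or 1       # guard against empty message bodies
    C = len(text) or 1
    return {
        "short_words_per_N": sum(len(w) <= 3 for w in words) / N,
        "digits_per_C": sum(ch.isdigit() for ch in text) / C,
        "upper_case_per_C": sum(ch.isupper() for ch in text) / C,
        "spaces_per_C": text.count(" ") / C,
    }

print(normalised_features("Meet me at 9 AM tomorrow."))
```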

Feature Number    Feature Description
W1     Average word length
W2     Vocabulary richness, i.e. V/N
W3     Total number of function words/N
W4     Total number of short words/N (word length ≤ 3)
W5     Count of hapax legomena/N
W6     Count of hapax legomena/V
W7     Guiraud's R
W8     Herdan's C
W9     Herdan's V
W10    Rubet's K
W11    Maas' A
W12    Dugast's U
W13    Luk'janenkov and Neistoj's measure
W14    Brunet's W
W15    Honore's H
W16    Sichel's S
W17    Yule's K
W18    Simpson's D
W19    Entropy measure

Table 3.1: Word Based Feature Set

Feature Number    Feature Description
C1     Number of characters in words/C
C2     Number of alphabetic characters/C
C3     Number of upper-case characters in words/C
C4     Number of digit characters in words/C
C5     Number of white-space characters/C
C6     Number of spaces/C
C7     Number of spaces/Number of white-space characters
C8     Number of tab spaces/C
C9     Number of tab spaces/Number of white-space characters
C10    Number of punctuation characters/C

Table 3.2: Character Based Feature Set

3.3 Baseline Testing

In order to determine a good set of stylometric features from those identified by previous workers in stylometrics and authorship analysis, and to evaluate and tune SVMlight for optimum performance, it was decided to perform a series of baseline tests on data that was not e-mail. Two sources of data were considered. Project Gutenberg (n.d.) provides plain text versions of books that are not protected by copyright, and a corpus of novels from different authors was collected from it for these initial tests. A corpus of PhD theses on Information Technology topics was also used. These sources were chosen as each is representative of a particular genre, the novel or the thesis. Details of the data are given in Section 3.6.

3.3.1 Feature Selection

For the baseline testing of plain text chunks, only the character based, word based, function word, word length frequency distribution and 2-gram feature sets, outlined in Section 3.2 and detailed in Appendix A, were used. This was done to limit the amount of testing performed in the initial stages of the experimental work. These purely stylistic feature sets were tested individually and then in combinations of two or more sets to determine the best combination of sets of features for attribution of authorship on the baseline data. If the addition of a set of features to the mix caused a reduction in classification efficiency, it was considered that this feature set was not adding to the overall classification of authorship. When the best combination of feature sets was found, this was used as a starting point for experimentation with e-mail data and the addition of e-mail specific features.

The literature on authorship classification, as outlined in Section 2.1, suggests that a minimum of 1000 words should be used for discrimination of authorship so that the effect of the feature sets on classification reliability can be critically evaluated. The text data to be analysed in these experiments, therefore, was split into chunks of 1000 words for the evaluation of the best features.
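The chunking step can be sketched as follows (a simplified illustration: the thesis sampled non-overlapping chunks randomly from each document, whereas this sketch takes them consecutively):

```python
def word_chunks(text, chunk_size=1000):
    """Split a plain text document into consecutive, non-overlapping chunks
    of exactly chunk_size words; any short trailing remainder is discarded."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words) - chunk_size + 1, chunk_size)]

# Each chunk then becomes one data point for feature extraction.
```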

3.3.2 Effect of Number of Data Points and Size of Text on Classification

With machine learning classifiers it is not possible to create a model with only one or a few data points, in this case samples of text. Since e-mail data might be sparse, it was necessary, therefore, to determine the minimum number of data points, in the form of feature value vectors, required. As noted in the previous section, the literature suggests that a minimum of 1000 words for each chunk of text be used for classification. E-mail messages rarely contain this number of words. It was necessary, therefore, to study the authorship classification efficiency of smaller chunks of text in comparison to the baseline level of 1000 words, to find how small a chunk can be used while still obtaining a good overall result.

Using the outcomes from the experiments in Section 3.3.1, the most reliable set of features was used. Text chunk sizes of 500, 200, 100 and 50 words were compared to the baseline level of 1000 words. The number of document chunks was also a variable in this experiment, and the number of chunks or data points was varied between 10 and 100. Chunking was limited by the size of the documents being used, as the text chunks had to be sampled independently. We expected to find that as the text chunk size was reduced and the number of text chunks lowered, the level of correctness of classification would also be reduced. The results of the experiments carried out in this baseline evaluation are reported in Chapter 4.

3.4 Application to E-mail Messages

Following the baseline tests on text, the project moved on to e-mail data. It was necessary to determine if SVM classification of authorship could be successful when applied to e-mail messages, given that they contain a smaller amount of text and that each message contains a variable number of words.

It was possible to apply the feature evaluation and machine learning classification techniques in two ways to e-mail data. Most intuitively, each e-mail message could be considered to be a data point in its own right in any classification experiment. Values for the selected set of features could be calculated by analysing each e-mail message separately. Alternatively, the text from a collection of an author's e-mail messages could be concatenated and then sliced into chunks of text with an equal number of words. This approach limits the analysis of authorship to only the text within the messages.

Features that relate to the e-mail message's structure could not be included in such analyses. Both approaches were trialled in the experiments on e-mail messages. The ultimate goal was to determine if the authorship of an anonymous e-mail message can be attributed to an author for whom sufficient existing data, i.e. some minimum number of e-mail messages each containing a sufficient number of words, has been collected for the purpose of building a model of that author's overall authorship style.

3.4.1 E-mail Structural Features

E-mail is produced as a result of a user's interaction with a program which creates an e-mail message. These programs, or User Agents (UA), produce e-mail messages in a format that must adhere to the de facto standard outlined in RFC822 (Crocker, 1982). While the format is not consistent between different UAs, there are certain pieces of information which are included in all e-mail messages. These include a set of header attributes and header values. The headers are separated from the text of the e-mail message by a single blank line and can be considered to be metadata. The metadata is not considered in this study as it is mostly generated by the UA and not by the author of the e-mail message.

An e-mail message may be an original message, a reply message or a forwarded message. This can be evaluated simply and used as a message type feature by giving a discrete feature value of 0, 1 or 2 for the three possible options listed. These possibilities are captured in feature E1 in the following. An example of an e-mail message is shown in Figure 3-8, which shows that an e-mail message may include greeting or salutation text, farewell text, a signature and one or more attachments.

Figure 3-8: Example of an E-mail Message

The presence or absence of the text items above and the count of the number of attachments can be used as e-mail structural features (E2 . . . E5). If the e-mail message is a reply it may contain portions of text requoted from the message being replied to. There are different ways that an author may choose to include this requoted text. Some authors place all of the requoted text at the beginning or the end of the message, whilst others intersperse their reply among the requoted text. The position of requoted text was identified as a feature of interest by de Vel (2000). Here, this concept has been extended, as seven possible combinations of original text (O) and requoted text (R) have been identified. A discrete value has been assigned to each possible combination and this value is used in the feature vector as feature E6, on the assumption that an author may have a preferred method of replying to an e-mail message. Table 3.3 outlines the different ways that an e-mail message can be composed from original and requoted text. Forwarded messages generally contain no original text from the author and cannot be analysed further.

Combination    Assigned Value    Explanation
no text        0                 no text
O              1                 original text only
R              2                 requoted text only
OR             3                 original text followed by requoted text
RO             4                 requoted text followed by original text
ORO . . .      5                 original and requoted text interspersed, original first
ROR . . .      6                 requoted and original text interspersed, requoted first

Table 3.3: Possible Combinations of Original and Requoted Text in E-mail Messages

Different UAs use different methods to indicate that portions of an e-mail message are requoted. Requoted text in a plain text e-mail message is usually marked with a '>' symbol at the start of each line of the requoted portion of text. Some UAs use a different character, such as a ':' or '|', to indicate requoted text.

Feature Number    Feature Description
E1    Reply status
E2    Has a greeting acknowledgement
E3    Uses a farewell acknowledgement
E4    Contains signature text
E5    Number of attachments
E6    Position of re-quoted text within e-mail body

Table 3.4: List of E-mail Structural Features

The latest generation of UAs employ HTML formatting elements to format e-mail message text, and the metadata indicates whether or not the e-mail message contains HTML formatting. These UAs use an HTML tag pair:

<BLOCKQUOTE> ... </BLOCKQUOTE>

to indicate requoted text. When parsing an e-mail message for analysis of the value of its features, it will be simple enough to ensure that only original text is analysed, and also to determine which combination of original and requoted text is being used. A set of e-mail structural features was compiled and is shown in Table 3.4. To aid in the parsing of various structural elements from each e-mail message an e-mail grammar was proposed. This grammar is shown in Figure 3-9.
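A sketch of how the E6 value of Table 3.3 might be derived from a plain text message body is shown below (illustrative only; real UA conventions vary more widely than this simple marker test):

```python
import re

REQUOTE_MARKER = re.compile(r"^\s*[>|:]")   # '>', '|' or ':' at line start

def requote_combination(body_lines):
    """Collapse the body into runs of original (O) and requoted (R) text
    and map the run pattern to the discrete values of Table 3.3."""
    runs = []
    for line in body_lines:
        if not line.strip():
            continue
        tag = "R" if REQUOTE_MARKER.match(line) else "O"
        if not runs or runs[-1] != tag:
            runs.append(tag)
    pattern = "".join(runs)
    simple = {"": 0, "O": 1, "R": 2, "OR": 3, "RO": 4}
    if pattern in simple:
        return simple[pattern]
    return 5 if pattern.startswith("O") else 6   # interspersed text

# 'ORO' pattern: original and requoted text interspersed, original first.
print(requote_combination(["Hi Bob,", "> When do we meet?", "Friday suits."]))
```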

3.4.2 HTML Based Features

As mentioned above, many e-mail user agents compose e-mail messages as HTML documents and embed HTML tags for text formatting. The HTML tags had to be removed from the text before the text could be stylistically analysed.

Email = Header [Body].
Header = HeaderLine {HeaderLine}.
HeaderLine = FieldName ":" FieldContent.
Body = Content {Attachment}.
Content = {NewLine} [Greeting] Message [Farewell] [Signature].
Message = [OriginalText] {RequoteText OriginalText} [RequoteText].
RequoteText = {RequoteCharacter string} | RequoteOpeningTag string RequoteClosingTag.
RequoteCharacter = ">" | "|" | ":".
RequoteOpeningTag = "<BLOCKQUOTE>".
RequoteClosingTag = "</BLOCKQUOTE>".
NewLine = cr | cr lf | lf.
FieldName = string.
FieldContent = string.
OriginalText = string.
Attachment = string.
Greeting = string.
Farewell = string.
Signature = string.

Figure 3-9: E-mail Grammar

If authors have a hard-wired style, it is plausible that text formatting and layout are also a part of this style. Some authors may use certain formatting elements preferentially over others, and this could be used in conjunction with other text based features to discriminate between the authorship of e-mail messages. HTML tag pairs related to formatting the text and under the control of the author, e.g. bold, italics, colour, font, font size etc., can be counted, normalised and used as a feature set. The list of HTML tag features proposed for use in the attribution of e-mail message authorship is shown in Table 3.5. In this table, H is the total number of HTML tags in the document.
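A sketch of how such a feature set might be computed is shown below; the specific tags counted here are illustrative assumptions, not the exact entries of Table 3.5:

```python
import re
from collections import Counter

# Hypothetical author-controlled formatting tags (bold, italics, font, etc.).
FORMAT_TAGS = ["b", "strong", "i", "em", "u", "font"]

def html_tag_features(html_body):
    """Count occurrences of each formatting tag (opening and closing) and
    normalise by H, the total number of HTML tags in the document."""
    tags = re.findall(r"<\s*/?\s*([a-zA-Z][a-zA-Z0-9]*)", html_body)
    H = len(tags) or 1
    counts = Counter(t.lower() for t in tags)
    return {tag: counts[tag] / H for tag in FORMAT_TAGS}

print(html_tag_features("<p>Hello <b>world</b>, what a <i>nice</i> day</p>"))
```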

3.4.3 Document Based Features

Two document based features were combined to form the document based feature set: the average sentence length and the ratio of blank lines to the total number of lines in the body of a message. The definition of these features is shown in Table 3.6.

Feature Number    Feature Description
H1    Frequency of … / H
H2    Frequency of … or … / H
H3    Frequency of … / H
H4    Frequency of … / H
H5    Frequency of … / H
H6    Frequency of … or … / H
H7    Frequency of … or … / H

Table 3.5: List of HTML Tag Features

Feature Number    Feature Description
D1    Average sentence length (number of words)
D2    Number of blank lines / total number of lines

Table 3.6: Document Based Feature Set

3.4.4 Effect of Topic

If the classification of authorship of e-mail messages is to be successful, it will be necessary to investigate whether this classification is affected by the topic of the message. It would be of little practical forensic use to limit training data to messages on the same topic. It is possible that the topic of the messages could affect some features, as topic words will belong to the author's regular vocabulary, or could be badge words.

To show that classification is not affected by topic, it was necessary to obtain a corpus of e-mail messages from a small group of authors writing e-mail messages on a limited set of disparate topics. This corpus had to contain some minimum number of e-mail messages from each author, and each e-mail message had to contain a sufficient number of words, as determined by the baseline experiments discussed in Section 3.3.2, to make the classification meaningful.

To measure the independent effect of topic, classifier models were learnt for a limited group of authors, using only the messages from one of the topics. Messages from the other topics were then used as the test data for the learnt classifier models from the original topic. Generalisation performance of these classifiers was measured using the same series of performance indicators as before, except that there was no need to perform k-fold stratification on the data set, as the test messages were not used for training.
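A minimal sketch of this cross-topic test, using scikit-learn's SVC as a stand-in for SVMlight (the feature matrices and author labels here are hypothetical):

```python
from sklearn.svm import SVC

def cross_topic_accuracy(X_train_topic, y_train_topic, X_other, y_other):
    """Learn authorship models from one topic's messages only, then measure
    generalisation on messages drawn from the other topics."""
    model = SVC(kernel="poly", degree=2)   # second order polynomial kernel
    model.fit(X_train_topic, y_train_topic)
    return model.score(X_other, y_other)
```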

3.5 Profiling the Author - Reducing the List of Suspects

In a forensic investigation of authorship, the possible list of suspects may be quite numerous. The experiments suggested in Section 3.4 are aimed at selecting one author from a small group of authors, where the size of the small group could range from two to possibly as many as ten. There is a clear forensic need for some way of reducing a large list of suspect authors to a manageable number. Experiments are proposed here to investigate the possibility of filtering or profiling the suspect list to produce a smaller number of authors, from which an anonymous e-mail or e-mails can be classified. This falls under the heading of authorship characterisation, as discussed in Section 2.1.

An alternative approach is to perform authorship analysis as suggested in Section 3.4 on arbitrarily selected small groups, narrowing down the number of possible suspect authors by keeping in the suspect list each author identified as a possible positive match, whether it is a true positive or a false positive identification. With this approach there would need to be multiple iterations through the suspect list until a small enough group of suspects is formed to be able to identify the best single suspect. This approach is shown diagrammatically in Figure 3-10.

Figure 3-10: Reducing a Large Group of Suspects to a Small Group Iteratively

3.5.1 Identifying Cohorts

As discussed in Section 2.3, some common sociolinguistic groups can be identified as follows:

• Gender

• Age group

• Education level

• Language background

In most cases these characteristics can be classified into a small number of cohorts, e.g. male and female for the gender characteristic, and English as a native or second language for the language background characteristic. These simple groupings lend themselves to machine learning with SVM, which performs binary classifications. This raises the prospect of later experiments, if this simple approach is successful, of breaking up authors who have English as a second language into cohorts that have a similar language background. For example, European languages are quite different from Asian languages, and it is to be expected that authors with a European heritage will write English differently to those with an Asian heritage.

In this body of research, we examined gender classification first, as there was some indication in the literature (Thomson and Murachver, 2001) of gender specific features, as discussed in Section 2.3.1. As the available e-mail data provided clues to the language background of the authors, it was decided to also trial the technique on the language background characteristic as a second sociolinguistic classifier.

3.5.2 Cohort Preparation

A corpus of e-mail messages was prepared by collecting personal e-mail over an eighteen month period. This gave the data set referred to here as inbox. Individual messages in the data set were then designated as suitable or unsuitable for use by manually checking each e-mail message. Messages designated as unsuitable included requoted jokes or stories that masqueraded as original text messages, i.e. they did not contain requoted text markers. Similarly, where messages were notifications of seminars, they included abstracts that would not have been written by the author of the message. Cleaning the inbox data set was a lengthy but worthwhile endeavour, as it added confidence to the results generated from the experiments performed.

The gender and language background of the more than 800 authors contributing to the inbox data set were investigated using public domain knowledge. Assigning the correct gender to the authors was relatively simple, with many authors' first names being obviously male or female. Where any doubt remained as to the gender of an author, that author's e-mail messages were removed from the data set. Language background was assigned in many cases on personal acquaintance with the author. In other cases, the author's e-mail address, showing e.g. a European country's domain and name, was used as an indicator of whether English was a first or second language for the author. There were many cases where, even though the author had a typically non-English name, he/she was a resident of Australia and an assignment of language background simply could not be made. Where doubt existed, the language background of the author was not assigned and that author's e-mails were not used for language background tests.

The classification of these sociolinguistic features resulted in the production of a data set referred to in the following as cleaned inbox. Details of the number of authors and the profiles of the e-mail messages are given in Section 3.6.

3.5.3 Cohort Testing - Gender

As outlined in Section 2.3, Thomson and Murachver (2001) suggest that there are various gender-indicating features which can be measured in electronic discourse. For women, these include the use of adverbs and adjectives, the use of personal pronouns and a tendency to apologise. For men, gender-indicating features were references to quantity and grammatical errors. The proposed set of gender specific features for cohort testing is shown in Table 3.7.

Feature Number    Feature Description
G1     Number of words ending with able /N
G2     Number of words ending with al /N
G3     Number of words ending with ful /N
G4     Number of words ending with ible /N
G5     Number of words ending with ic /N
G6     Number of words ending with ive /N
G7     Number of words ending with less /N
G8     Number of words ending with ly /N
G9     Number of words ending with ous /N
G10    Number of sorry words /N
G11    Number of words starting with apolog /N

Table 3.7: Gender Specific Features
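The features of Table 3.7 reduce to simple suffix and keyword counts normalised by N, as in this sketch (illustrative only; the feature names are hypothetical):

```python
SUFFIXES = ["able", "al", "ful", "ible", "ic", "ive", "less", "ly", "ous"]

def gender_features(text):
    """Count words with each gender-preferential ending, plus 'sorry' words
    and words starting with 'apolog', all normalised by the word count N."""
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    N = len(words) or 1
    feats = {f"G_{s}": sum(w.endswith(s) for w in words) / N for s in SUFFIXES}
    feats["G_sorry"] = sum(w == "sorry" for w in words) / N
    feats["G_apolog"] = sum(w.startswith("apolog") for w in words) / N
    return feats

print(gender_features("Sorry, I am terribly late; it was completely unavoidable."))
```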

The experiments on gender cohorts addressed the following questions:

• Is there a minimum number of words that an e-mail message must contain so that gender can be classified?

• How many e-mail messages are required per cohort to classify gender?

• Which stylistic and e-mail structural features contribute to classification of gender?

• Do the gender specific features indicated in Table 3.7 improve classification performance?

3.5.3.1 Effect of Number of Words per E-mail Message

A number of e-mail messages in the corpus have very few words and there are some with no words at all. These messages will have an impact on the values of the measured features. Due to the low number of words in some of the e-mail messages, many features will have zero values or, at best, meaningless values that are not indicative of an author or a cohort. A lower limit of 50 words per message was applied to these experiments. The aim of this experiment was to extract cohorts that have a set minimum number of words to determine the effect on training and ultimately on generalisation performance. Gender cohorts were produced with minimum word count limits of 50, 100, 150 and 200. The cohorts produced were randomly sampled to produce gender cohorts containing 250 and 500 e-mail messages. Stratified 10-fold cross validation experiments were conducted on these cohorts to determine the effect of minimum word count.
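The cohort construction can be sketched as follows (illustrative names; messages are plain strings here):

```python
import random

def build_cohort(messages, min_words, cohort_size, seed=0):
    """Keep messages with at least min_words words, then randomly sample
    cohort_size of them to form a cohort."""
    eligible = [m for m in messages if len(m.split()) >= min_words]
    return random.Random(seed).sample(eligible, min(cohort_size, len(eligible)))
```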

3.5.3.2 The Effect of Number of Messages per Gender Cohort

It is unlikely that a small number of e-mail messages or a small number of authors could define male and female gender characteristics. It was necessary to determine how many e-mail messages were required in each gender cohort to produce gender models. Gender cohorts with minimum word counts of 50, 100, 150 and 200 words were produced. The number of e-mail messages in each cohort is shown in Section 3.6.

Figure 3-11: Production of Successively Smaller Cohorts by Sub-sampling

As can be seen, there are fewer e-mail messages in each cohort as the minimum word count increases. To ensure that the e-mail messages used in the experiment were not biased by any particular author, the cohorts were randomly sampled from the cohort of all messages which had the specified minimum number of words. Each successively smaller cohort in these experiments was subsampled from the larger parent sample. The male and female data sets were produced with an equal number of e-mail messages, down to a minimum of 50 messages per data set. Figure 3-11 displays the sampling process diagrammatically.

3.5.3.3 Effect of Feature Sets on Gender Classification

It would be interesting to determine the subset of the full set of features that produces the best discrimination between the gender cohorts. This would require an exhaustive feature set sub-selection experiment, an approach that was rejected because of time constraints. Instead, as a broad-brush approach to feature set sub-selection, experiments were performed in which each feature set was removed, one at a time, from the full set of features to determine the effect (positive or negative) on classification results.
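This broad-brush, leave-one-set-out procedure can be sketched as follows (evaluate() is a hypothetical stand-in for a full cross-validated classification run):

```python
def leave_one_set_out(feature_sets, evaluate):
    """feature_sets maps a set name to its feature columns; evaluate() takes
    a list of columns and returns a classification score. A score rise after
    removing a set suggests that set was hurting classification."""
    all_columns = [c for cols in feature_sets.values() for c in cols]
    baseline = evaluate(all_columns)
    effects = {}
    for name in feature_sets:
        kept = [c for n, cols in feature_sets.items() if n != name
                for c in cols]
        effects[name] = evaluate(kept) - baseline
    return effects
```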

3.5.4 Cohort Testing - Experience with the English Language

A similar set of experiments to that outlined in Section 3.5.3 for determining gender was run to determine if language background could be discriminated. It was necessary to add more authors to the cleaned inbox data set to produce a sufficient quantity of e-mail messages written by authors with English as a second language (ESL). This produced the language data set, details of which are included in Section 3.6. The experiments undertaken were based on determining the number of data points and the minimum number of words required to develop a characterisation model of language background.

3.6 Data Sources

The data sets used for the research reported in this thesis are outlined below.

Data Set Name: book

A small selection of books was obtained from Project Gutenberg (n.d.). The books and authors are shown in Table 3.8. The text from the books was randomly sampled from each source document in such a way that non-overlapping chunks of text with a constant number of words were prepared. The number of chunks of text and the number of words per chunk were variables used in the sampling of text from the source documents.

Author                    Title                              Year    Word Count
Jane Austen               Pride and Prejudice                1813    123,249
Jane Austen               Sense and Sensibility              1811    120,743
Sir Arthur Conan Doyle    The Memoirs of Sherlock Holmes     1894    88,882
Joseph Conrad             Nostromo                           1917    170,726
Charles Dickens           Great Expectations                 1861    188,844
Robert Louis Stevenson    Treasure Island                    1883    70,161

Table 3.8: Details of the Books Used in the book Data Set

Data Set Name: thesis

Three PhD theses were obtained for use as a different source of plain text documents. All tables, figures, source code sections and any other parts of the documents that were not plain text were removed from the documents before they were used. As for the book data set, the text from the theses was randomly sampled from each source document. The total number of words in each thesis is shown in Table 3.9.

Authorship Class    Word Count
A                   29,391
B                   24,880
C                   33,984

Table 3.9: Details of the PhD Theses Used in the thesis Data Set

Data Set Name: inbox

This data set consists of the author's e-mail messages collected over a period of 15 months. The messages are multi-topic with no minimum word count restrictions. There are over 10,800 messages from over 800 authors in this data set. The authors come from a range of backgrounds: academics, students, general staff and personal acquaintances. To qualify for this data set, each author had to have 10 or more messages to contribute. Many messages were unsuitable for use in tests as they contained non-original text that was not marked as requoted in the message. These e-mail messages were removed, as discussed in Section 3.5.2, to produce the separate cleaned inbox data set (see below).

Data Set Name: email4

The email4 data set was a subset of the inbox data set. This data set was created from four authors who had a significant number of e-mail messages in the inbox data set. Details of the number of e-mail messages in each authorship class are shown in Table 3.10.

Authorship Class    Number of E-mail Messages
A                   36
B                   65
C                   86
D                   66

Table 3.10: Details of the email4 Data Set

                    Authorship Class
Discussion Topic    A     B     C     Topic Total
movies              15    21    23    59
food                12    21    25    58
travel              3     20    15    38
Author Total        30    62    63    155

Table 3.11: Distribution of E-mail Messages for Each Author and Discussion Topic

Data Set Name: discussion

It was necessary to obtain data for the experiments relating to the effect of topic on authorship attribution: e-mail or newsgroup messages from a small number of authors who discussed a series of disparate topics. A possible source of such messages is the Usenet newsgroups; however, it was not possible to find a set of authors who frequently contribute to the same set of newsgroups. It therefore became necessary to solicit e-mail from a group of authors on three separate topics, to each of which the authors felt they could contribute messages. The e-mail was produced by three authors in three topics of discussion: food, movies and travel. E-mails had a minimum word count of approximately 100 words. This data set contained 155 e-mail messages. Details of the number of messages contributed by each author in each topic are shown in Table 3.11. The assumption was that each topic would contain a common vocabulary, e.g. restaurant, cook, recipe etc. for the food topic.

Data set name: cleaned inbox

This data set was created specifically for authorship characterisation from the inbox data set, which was cleaned by removing any unsuitable e-mail messages as discussed in Section 3.5.2.

Minimum Number of Words    Male Cohort    Female Cohort
0                          3479           4514
50                         2071           2298
100                        1257           1072
150                        842            585
200                        564            384

Table 3.12: Number of E-mail Messages in each Gender Cohort with the Specified Minimum Number of Words

Each message in this data set contained original text from the author of the message. This data set contains 8820 messages from 342 authors.

Data Set Name: gender

This data set consists of e-mail messages selected from the cleaned inbox data set. To qualify for this data set, the gender of a message's author had to be determined from public domain or personal knowledge, as discussed in Section 3.5.2. Details of the number of e-mail messages containing a minimum number of words are shown in Table 3.12. It should be noted, and is further discussed in Section 5.4.1, that the gender data present problems with message size.

Data Set Name: language

This data set was created from the cleaned inbox data set with extra ESL authored e-mail messages added to it. The language background of each author in the data set was determined as discussed in Section 3.5.2. This data set contains over 9700 messages from 827 authors. There is no minimum limit on the number of e-mail messages an author has to have written to be included in this data set. Details of the number of e-mail messages containing a certain minimum number of words are shown in Table 3.13.

Minimum Number of Words    ENL Cohort    ESL Cohort
0                          7158          1743
50                         3926          706
100                        2128          357
150                        1311          231
200                        878           161

Table 3.13: Number of E-mail Messages in each Language Cohort with the Specified Minimum Number of Words

It should be noted, as for the gender data set, that minimum message size presents some problems.

3.7 Chapter Summary

This chapter detailed how the analysis of e-mail message authorship would be carried out. It described the necessity for specific sequences of experiments establishing baseline parameter values, and also described the preparation of suitable data sources.

The experimental plan for testing aspects of anonymous e-mail authorship began with baseline testing on data that was not e-mail. Experiments designed to test the effect of the features discussed in Section 3.2 on classification performance were examined. The effect of the number of words per message and the number of messages on classification performance was also included in the baseline experimental plan. The results of these experiments and a discussion of their impact are described in Chapter 4.

Once the baseline parameters were established, e-mail message data was tested and the question of whether or not the structural features available in e-mail messages could assist in classification of authorship was addressed. It was also planned to test the effect of topic on classification using solicited e-mail message data.

The determination of e-mail author gender and an author's language background may be useful in a forensic investigation for characterising an author's sociolinguistic background. Section 3.5 discussed how author characterisation would be undertaken. Various factors, such as the number of e-mail messages in a cohort and the minimum number of words a message requires to show cohort discrimination, would be investigated to determine the effectiveness of author characterisation.

The results of the experiments undertaken using e-mail messages as the data source and a discussion of these results form the basis of Chapter 5.

Chapter 4

Calibration of Experimental Parameters and Baseline Experiments

Chapter 3 discussed the plan and scope of the experimental process that was used to determine the feasibility or otherwise of analysing authorship features and identifying authorship of e-mail messages. The list of the features to be evaluated was outlined and justification was given as to why the Support Vector Machine (SVM) algorithm would be used for this work.

The framework for machine learning experimental conduct was discussed in Section 3.1. It was decided that the 'one against all' approach for multi-class classification experiments would be used, and that the error rate, precision, recall and F1 values would be calculated for each test to determine the classification performance for a given set of inputs. It was further decided that the weighted macro-averaged error rate and weighted macro-averaged F1 values would be used to compare tests where a parameter of the test is being studied. Due to the small number of data points in the corpus, it was decided in Section 3.1.4 to use stratified 10-fold cross validation to extend the evaluation of the data set for the experiments.

The previous chapter also discussed how it would be useful to have a tool that could be used for characterising an author's sociolinguistic profile, i.e. assigning various cohort attributes such as gender, language background, age and education level to an author. It was decided, due to the difficulty in obtaining data, that cohort profiling would be examined on the gender and language background cohorts only at this stage. If the testing showed some discrimination between these sociolinguistic classes it would be worth attempting analysis of further cohort profiles in the future.

This chapter details the experiments that were undertaken to develop a systematic basis for classifying the authorship of e-mail messages for forensic purposes and reports their results. This was done by first conducting a series of experiments designed to reveal baseline values for successful SVM authorship attribution of plain text chunks (not e-mail), thereby setting the constraints on feature sets, text size and number of messages. These baseline experiments set the framework for the core of this project, the task of identifying useful features contained in e-mail text. Results of these core experiments are discussed in Chapter 5. A list of the experiments reported in this chapter is shown in Table 4.1.

4.1 Baseline Experiments

A random collection of e-mail messages would not be ideal for evaluating various basic authorship attribution parameters, as the messages would not have a constant length. The usual length of e-mail messages is less than the minimum requirement of 1000 words suggested in the stylometry literature (see Section 2.1). It was decided, therefore, to conduct initial authorship attribution experiments on data similar to that of previous stylistics studies, viz. free text or novels. By using data that is not e-mail, the effect of variable text size between data points can be eliminated.

The aim of these experiments was to determine whether or not the stylistic features identified in the literature review and outlined in Section 3.2 were suitable for authorship discrimination on plain text documents using an SVM.

Experiment Number    Reported in Section    Experimental Details
B1     4.3.1    Effect of different feature sets, 1000 words per text chunk, book data set
B2     4.3.1    Effect of different feature sets, 1000 vs 100 words per text chunk, book data set
B3     4.3.2    Effect of different feature sets, 1000 words per text chunk, thesis data set
B4     4.3.3    Effect of collocations, 200 words per text chunk, thesis data set
B5     4.4.1    Effect of number of words per text chunk, thesis data set
B6     4.4.2    Effect of number of data points per class, 200 and 500 words per text chunk, thesis data set
B7     4.5.1    Effect of SVM kernel function, 200 words per text chunk, thesis data set
B8     4.5.1    Effect of degree on polynomial kernel function, 200 words per text chunk, thesis data set
B9     4.5.1    Effect of gamma value on radial basis function kernel, 200 words per text chunk, thesis data set
B10    4.5.2    Effect of SVM cost parameter, 200 words per text chunk, thesis data set

Table 4.1: List of Baseline Experiments

The book data set and the thesis data set described in Section 3.6 were used in these tests to identify, independently of the e-mail context:

• the most effective features from those identified for authorship classification

• the minimum size of text that results in good classification

• the minimum number of data points per authorship class required for good classification

As foreshadowed in Section 2.5, it is the aim of this research to achieve correct classification of e-mail data at a level of approximately 85% using the weighted macro-averaged F1 measure.

The tests reported below were carried out using the 'one against all' method for multi-class classification and stratified 10-fold cross validation, except where noted. All reported results are the average of ten repeated tests. The weighted macro-averaged error rate ($E^{(M)}$) and the weighted macro-averaged F1 value ($F_1^{(M)}$) for each test are reported in each case. For the tests reported in Sections 4.3 and 4.4 the default SVMlight parameters for a polynomial kernel function of degree two were used in the learning phase. The "LOQO" optimiser was used for maximising the margin.

4.2 Tuning SVM Performance Parameters

Some initial experiments were undertaken to determine how SVMlight behaved and to gauge the effect of some of the arguments to the program.

4.2.1 Scaling

As discussed in Section 3.1.1, the input used to train an SVM is a set of training vectors, each consisting of a feature vector $\vec{x} \in \Re^N$ and a label $y \in \{-1, +1\}$. An important finding from these initial experiments was that all feature vectors must be scaled across each feature $x_i$ in $\vec{x}$. It is possible to have values for different features in the same vector which differ by 3 to 6 orders of magnitude. If these features are left unscaled, the kernel function used in the SVM has difficulty in converging on a separating hyperplane. Training without scaled data is also costly on a time basis and often results in a large number of training errors.

The data is scaled between a lower bound lb and an upper bound ub. The lower bound is normally set at -1.0 or 0.0 and the upper bound at 1.0. Features with a value of 0.0 are ignored by the SVMlight classifier, so the lower bound for all experiments in this research was set at 0.0 to reduce the number of active features per input vector.

Scaling is performed by calculating a scale factor $s_i$ and a threshold $t_i$ for each feature in the current data set, where:

$$s_i = \frac{ub - lb}{x_{i,max} - x_{i,min}}$$

and

$$t_i = \frac{lb}{s_i} - x_{i,min}$$

The scaled feature values $x_{i,scaled}$ are calculated from $s_i$ and $t_i$ using the following formula:

$$x_{i,scaled} = (x_i + t_i) \times s_i$$

When a model is learned for a particular classifier, any data that is to be classified with that model must be scaled in the same manner using the same scale factor and threshold for each feature.
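The scaling step and its reuse at classification time can be sketched as follows (illustrative code; SVMlight itself is driven by pre-scaled input files):

```python
def fit_scaling(column, lb=0.0, ub=1.0):
    """Derive the scale factor s and threshold t for one feature column."""
    x_min, x_max = min(column), max(column)
    s = (ub - lb) / (x_max - x_min) if x_max > x_min else 1.0
    t = lb / s - x_min
    return s, t

def apply_scaling(column, s, t):
    # The same s and t learnt from the training data must be reused when
    # scaling any data that is to be classified with the trained model.
    return [(x + t) * s for x in column]

s, t = fit_scaling([2.0, 5.0, 10.0])
print(apply_scaling([2.0, 5.0, 10.0], s, t))   # [0.0, 0.375, 1.0]
```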

4.2.2 Kernel Functions

The kernel function selected for the SVM can have a considerable impact on the learning phase of the classification of the data. SVMlight provides four standard kernel functions: a linear function, a polynomial function in which the order of the polynomial can be varied, a radial basis function and a sigmoid tanh function. Initial investigations into the effect of the default kernel functions on classification performance showed that the second order polynomial kernel function produced the best results. A more thorough investigation of the effect of the kernel function is reported in Section 4.5.

4.3 Feature Selection

A range of experiments was conducted to determine the best set of features to use in initial trials with e-mail data. As Section 3.2 explained, features with a similar level of granularity were grouped together. Each of the feature sets was tested individually to see its relative effectiveness with the same data set. Tests were conducted on the book data set and on the thesis data set, details of which were outlined in Section 3.6. The results of these tests are discussed separately below.

4.3.1 Experiments with the book Data Set

This data set was initially tested using stratified 10-fold cross validation on one book each from five authors to see the effect of various feature sets. The books were sampled to create 50 chunks of text containing 1000 words each. The books were tested using the character-based (C), word-based (W), word length frequency distribution (L), function word (F) and 2-gram feature sets, and some combinations of these. The results are shown in Table 4.2. No training errors were encountered in these tests and the proportion of support vectors was approximately 30 to 40%. The results show that the 2-gram feature set is the best individual feature set of those used and that the best results are obtained when all feature sets are added together.

Feature Set              E(M) (%)    F1(M) (%)
Character Based (C)      7.7         79.6
Word Based (W)           6.2         82.9
Function Word (F)        2.5         93.0
2-grams                  1.2         97.1
C+W                      3.1         92.2
C+F                      1.6         95.9
W+F                      1.6         95.4
C+W+F                    1.2         96.9
2-grams+C+W+F            0.7         98.3

Data Set: book. Feature Set: Various.
Table 4.2: Test Results for Various Feature Sets on 1000 Word Text Chunks

To determine how well the learned classifiers could generalise to new test data, all 50 data points for each book were used to create one classifier for each author, and the sixth book in the data set, which had the same author as one of the five, was used as a test set. The results of testing for various feature sets are shown in Table 4.3. It can be seen from these results that Austen's second book is correctly not predicted as being by any of the other four authors, but some errors are made when Austen's second book is classified using the classifier learnt from Austen's first book. The error rate is greatest when 2-grams alone were used as the feature set for classification.

This experiment was repeated with 50 chunks of text, each containing 100 words, to see the effect of chunk size on attribution. The results for this test are also shown in Table 4.3. There is a marked decrease in classification performance when the number of words is reduced from 1000 to 100. The results for the 2-gram feature set are much worse when fewer words are used per data point. This is possibly due to the frequencies of the 2-grams being much more variable over 100 words than over 1000 words. E-mail messages rarely have an average length of 1000 words, so this feature set may not be useful in the classification of authorship of e-mail messages.

Error Rate (%)
Chunk Size    Feature Set          Dickens    Conan Doyle    Conrad    Austen    Stevenson
1000 Words    2-grams              0.0        0.0            0.0       14.0      0.0
              C+F+W+L              0.0        0.0            0.0       4.0       0.0
              2-grams+C+F+W+L      0.0        0.0            0.0       2.0       0.0
100 Words     2-grams              0.0        0.0            0.0       56.0      0.0
              C+F+W+L              8.0        0.0            6.0       36.0      4.0
              2-grams+C+F+W+L      6.0        0.0            0.0       40.0      0.0

Data Set: book. Feature Set: Various.
Table 4.3: Error Rates for a Second Book by Austen Tested Against Classifiers Learnt from Five Other Books

One reason for the higher error rates in the classification of Austen versus Austen, especially with 100 word chunks, may be that the two Austen novels have a different group of characters taking part in the story. The frequencies of the 2-grams in the character names will be inflated and there will be a different set of character name 2-grams in each book. This might indicate that 2-grams could be useful features for detecting subject or content but may not be useful for classifying authorship with small chunks of data.

4.3.2 Experiments with the thesis Data Set

The text data from the thesis data set was split into 1000-word chunks, as for the book data set experiments reported above. The number of chunks was not kept constant between authorship classes. The feature sets were tested individually and in various combinations, in a similar fashion to the experiments performed on the book data set. The results are reported in Table 4.4.

Feature Set                                E(M) (%)    F1(M) (%)
Character Based (C)                        8.5         86.7
Word Based (W)                             6.9         89.4
Function Words (F)                         2.7         95.6
Word Length Frequency Distribution (L)     12.7        81.4
2-grams                                    0.8         98.8
C+W                                        3.5         93.9
C+F                                        0.8         98.8
C+L                                        5.2         92.1
F+W                                        0.8         98.8
F+L                                        0.3         99.6
L+W                                        4.1         93.8
C+W+F                                      1.2         98.0
C+W+L                                      4.1         93.8
C+F+L                                      0.3         99.6
F+L+W                                      0.3         99.6
C+F+L+W                                    0.3         99.6

Data Set: thesis. Feature Set: Various.
Table 4.4: The Effect of Feature Sets on Authorship Classification

Of the individual feature sets, 2-grams gave the best results, with a weighted macro-averaged error rate ($E^{(M)}$) of 0.8% and weighted macro-averaged F1 value ($F_1^{(M)}$) of 98.8%. The next best set was function words, with $E^{(M)}$ of 2.7% and $F_1^{(M)}$ of 95.6%. As experienced with the book data set, when feature sets were added together the results improved, giving an indication that the more features used, the easier it is to discriminate between authorship classes. In many machine learning classifiers, this would lead to overfitting of the model to the training data but, as Section 2.4 discussed, this is not the case with Support Vector Machines.

4.3.3 Collocations as Features

The literature indicated that some success in authorship attribution had been achieved when word collocations were used as the features for authorship identification (Smith, 1983). A set of collocations based on function word pairs was compiled and used for a classification experiment; the collocations used are listed in Appendix A. Experiments were undertaken on the thesis data set using 200 word chunks of text. The addition of the collocation features to the combination of character based, word based, word length frequency distribution and function word feature sets caused a reduction in classification efficiency for this data. This was most probably due to the low number of words in the text chunks introducing noisy feature values, which led to difficulties in determining the hyperplane decision surface during the training phase. Further experimentation with collocations was suspended because of these unpromising results.

4.3.4 Successful Feature Sets

The previous sections have indicated that, in general, using SVM and the stylistic features compiled from the literature has been successful for the attribution of authorship of plain text. When the character based, word based, word length frequency distribution and function word feature sets were combined, the classification efficiency of the attributions was better than when individual feature sets were used.

The use of 2-grams resulted in effective classification on the book data set, but the large increase in error rate as chunk size declined from 1000 words to 100 (Table 4.3) leads to the suspicion that these features might be affected by the topic of the text being analysed. It is also possible that for chunks of text of approximately 100 words, i.e. similar in size to an e-mail message, the 2-gram frequencies may not be consistent enough to be useful features. The use of collocations as a feature set also resulted in poorer classification when they were added to the stylistic feature sets found successful previously. Collocations as features were not pursued any further.

Of the successful individual feature sets, the function word feature set was the most successful of those found to be unaffected by text content. In all of these experiments, no training errors were recorded, indicating that the features being used were successfully separating the authorship classes. Since function words had been found to be important markers for authorship attribution by Mosteller and Wallace (1964) in their seminal study of the Federalist Papers, this confirms the SVM as a suitable classification method for this problem domain and function word based features as a reliable style marker. These features are independent of the content or topic in a piece of text.

4.4 Calibrating the Experimental Parameters

4.4.1 The Effect of the Number of Words per Text Chunk on Classification

As Section 2.1 showed, the literature on stylistics and authorship attribution suggests that the minimum text size for authorship attribution is 1000 words. E-mail messages typically do not contain so many words. It is necessary then, to try to determine the minimum number of words that will result in reliable attribution of authorship for e-mail. Using the best sets of features found in Section 4.3, experiments were conducted on the thesis data set to determine the effect of the number of words in each text chunk on classification performance. Chunk sizes of 1000, 500, 200 and 100 words were used in these experiments and a number of different feature sets were tested.

Using thesis data chunks, Table 4.5 shows the effect of the number of words per chunk on the weighted macro-averaged error rate and F1 results for the various feature sets listed in Table 4.4. Figure 4-1 shows these results graphically. Only the combinations involving function words were tested, as the function words were found to be an important set of features in Section 4.3.

As more features were used, the error rates decreased and the F1 values increased. This supports the evidence already gathered, that the more features that were used, the better the discrimination between classes became.

Weighted macro-averaged error rates of less than 10% and F1 values of greater than 90% were achieved with the feature sets used on 200 word segments of data. This was an encouraging result, as many e-mails contain at least that many words. As the number of words per text chunk decreased, training errors were still not evident, indicating that the authorship problem is a separable one when these features are used.

It is interesting to note in Table 4.5 that there was little variation in the results of classification with chunk size when 2-grams were used as the feature set. In the book experiment for identifying a previously unseen Austen text (Table 4.3) it was clear that 2-grams failed badly as a discriminator. However, within topic they appear very successful. This supports the idea that 2-grams may be good markers for indicating the subject matter of a document. The three documents in the thesis data set are about different topics, although all three address information technology.

Number of    Character Based (C)    Word Based (W)       Function Words (F)    Word Length Frequency (L)
Words        E(M)%    F1(M)%        E(M)%    F1(M)%      E(M)%    F1(M)%       E(M)%    F1(M)%
100          14.8     75.6          19.4     69.9        15.8     75.0         -        -
200          12.6     80.2          14.0     78.8        10.2     84.0         22.3     67.4
500          12.2     81.9          11.1     83.5        4.6      93.1         15.8     76.6
1000         8.5      86.7          6.9      89.4        2.7      95.6         12.7     81.4

Number of    2-grams                C+F                  F+L                   F+W
Words        E(M)%    F1(M)%        E(M)%    F1(M)%      E(M)%    F1(M)%       E(M)%    F1(M)%
100          1.5      97.7          10.6     83.3        10.9     83.2         10.9     83.2
200          1.1      98.3          5.9      90.7        5.6      91.4         6.5      90.1
500          1.5      97.7          2.0      96.9        2.2      96.8         3.9      94.1
1000         0.8      98.8          0.8      98.8        0.3      99.6         0.8      98.8

Number of    C+F+L                  C+F+W                F+L+W                 C+F+L+W
Words        E(M)%    F1(M)%        E(M)%    F1(M)%      E(M)%    F1(M)%       E(M)%    F1(M)%
100          8.3      87.2          8.5      86.9        9.4      85.5         8.1      87.4
200          4.9      92.5          5.2      91.9        4.9      92.6         4.3      93.5
500          1.6      97.7          2.2      96.6        2.4      96.6         1.8      97.3
1000         0.3      99.6          1.2      98.0        0.3      99.6         0.3      99.6

Data Set: thesis. Feature Set: Various.
F = Function Words, C = Character based, W = Word based, L = Word Length Frequency Distribution
Table 4.5: Effect of Chunk Size for Different Feature Sets

Data Set: thesis. Feature Set: Various.
Figure 4-1: Effect of Chunk Size for Different Feature Sets

Due to the nature of the results obtained with 2-grams, and the hypothesis that 2-grams are good markers for content but not helpful otherwise, the 2-gram feature set was removed from all further experimentation.

4.4.2 The Effect of the Number of Data Points per Authorship Class on Classification

It may be difficult in a forensic investigation to obtain many, say 100 or more, e-mail messages for building an authorship model. Any successful technique must be reliable with as few e-mail messages as possible, and each e-mail message should ideally be treated as a separate data point for analysis. In these experiments, the thesis documents were sampled by splitting them into an equal number of sections or chunks with a constant number of words. Tests were performed with 200 and 500 word chunks. The features used in these tests were a combination of the character based, word based, function word and word length frequency distribution feature sets. The results are shown in Table 4.6 and in Figure 4-2. The results of the experiments show that there is a levelling-off effect after the number of document chunks reaches twenty. This is once again an encouraging result for the analysis of e-mail authorship in a forensic context, as it shows that 20 data points may be sufficient for effective classification.
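The chunking step described here can be sketched as follows. This is a minimal illustration only; the file name and the decision to discard any trailing partial chunk are assumptions, not details taken from the thesis.

    def chunk_words(text, chunk_size):
        """Split a document into consecutive chunks of chunk_size words,
        discarding any trailing partial chunk."""
        words = text.split()
        return [" ".join(words[i:i + chunk_size])
                for i in range(0, len(words) - chunk_size + 1, chunk_size)]

    # e.g. 200 word data points for one authorship class (path is hypothetical):
    # chunks = chunk_words(open("thesis_author1.txt").read(), 200)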

Number of     200 Word Chunks      500 Word Chunks
Data Points   E(M)%    F1(M)%      E(M)%    F1(M)%
10            8.9      86.6        3.3      94.9
20            5.6      91.4        0.6      99.1
30            6.3      90.2        1.1      98.3
40            3.1      95.4        1.1      98.4
50            6.2      90.5        -        -
60            5.1      91.7        -        -
80            5.1      92.1        -        -
100           5.3      91.9        -        -

Data Set: thesis Feature Set: C+F+L+W
Table 4.6: Effect of Number of Data Points

Data Set: thesis Feature Set: C+F+L+W

Figure 4-2: Effect of Number of Data Points

4.5 SVMlight Optimisation

While the stylistic and structural features and other parameters such as number of words per e-mail message and number of e-mail messages define the problem domain, it may be possible to make further improvements in performance if the classification tool is optimised. The SVMlight classifier has a number of parameters that can be tuned. The kernel function can be altered, and each kernel function in turn has a number of parameters that may affect the performance of the classifier.

4.5.1 Kernel Function

The SVMlight implementation has four standard kernel functions: a linear function, a polynomial function, a radial basis function and a sigmoid tanh function. Each of these kernel functions was tested with its default parameters to determine the effect on classification efficiency. Table 4.7 shows the results of this experiment, in which the thesis data set was used. The polynomial kernel function with default parameters was found to be the best performer, although the linear kernel function performed nearly as well.

Kernel Function          E(M)%    F1(M)%
Linear                   4.4      93.2
Polynomial               3.7      94.4
Radial basis function    18.4     59.2
Sigmoid tanh             33.1     0.5

Data Set: thesis Feature Set: C+F+L+W
Table 4.7: Effect of Kernel Function with Default Parameters
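A comparison of this kind can be sketched with scikit-learn's SVC in place of SVMlight; this is an illustrative assumption, since the thesis used SVMlight itself, and the synthetic data below merely stands in for the stylometric feature matrix.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    # Synthetic stand-in for the feature matrix: one row per 200 word
    # chunk, one column per feature, one label per author.
    X, y = make_classification(n_samples=200, n_features=170, n_informative=30,
                               n_classes=4, n_clusters_per_class=1, random_state=0)

    # The four standard kernels, each with its default parameters.
    for kernel in ("linear", "poly", "rbf", "sigmoid"):
        scores = cross_val_score(SVC(kernel=kernel), X, y, cv=10,
                                 scoring="f1_macro")
        print(f"{kernel}: mean F1 = {scores.mean():.3f}")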

For the polynomial kernel function, the degree of the polynomial and the values of the scalar and the constant can be altered. A set of tests was run using plain text data in which the degree of the polynomial was altered. The results for the thesis data set are reported in Table 4.8. A Student’s t-test (n = 10, α = 0.05) showed that there was no statistically significant difference between the 3rd, 4th, 5th, 6th and 7th order polynomial kernel functions. Higher order polynomials performed worse, and training errors were recorded for these classifiers. The other parameters of the polynomial kernel, the scalar and the constant, had little effect on classification performance when tested.

Degree of Polynomial   E(M)%    F1(M)%
1                      4.4      93.2
2                      4.0      93.2
3                      3.7      94.4
4                      3.6      94.5
5                      3.3      95.0
6                      3.2      95.1
7                      3.5      94.7
8                      23.2     69.5
9                      46.0     22.0
10                     46.0     22.2

Data Set: thesis Feature Set: C+F+L+W
Table 4.8: Effect of Degree of Polynomial Kernel Function for the thesis Data Set

The radial basis function (RBF) kernel has a default gamma value of 1.0. When this value was reduced by a factor of five, the optimal classification efficiency was found. Results of experiments with the gamma value are shown in Table 4.9. There was, however, no significant improvement over the 3rd order polynomial kernel function, and the RBF kernel was pursued no further in this work. The tests that were undertaken indicate that the 3rd order polynomial with default scalar and constant parameters should be suitable for authorship classification of text data. While this result cannot be applied universally to all analyses of text authorship, it shows that there was little to be gained in using a more complex kernel function for plain text documents containing 200 words. All further experiments with plain text documents, and the initial experiments with e-mail, were conducted using the default 3rd order polynomial kernel function. The effect of the degree of the polynomial kernel function on e-mail data was also investigated and is reported in Section 5.2.3.

Gamma Value   E(M)%    F1(M)%
0.01          4.1      93.7
0.02          4.0      93.9
0.05          3.4      94.8
0.1           3.5      94.7
0.2           3.2      95.1
0.5           5.1      91.6
1.0           18.4     59.2
2.0           33.0     0.4

Data Set: thesis Feature Set: C+F+L+W
Table 4.9: Effect of Gamma on Radial Basis Kernel Function for thesis Data

4.5.2 Effect of the Cost Parameter on Classification

Many machine learning classifiers provide a cost parameter, which allows the user to set a threshold up to which the cost of training errors in the learning phase is acceptable. If the cost parameter is set to a high value, the classifier will attempt to ensure that as few training errors as possible are made, at the expense of training time. Obviously, if training errors are made, it is likely that a higher number of errors will be made when the classifier model is applied to test data. Experiments with the cost parameter, C, were performed on the thesis data set using 1000 word chunks of data. The results are shown in Table 4.10. These results showed that a very low value of C was required before classifier performance was affected. No training errors were made by the SVM learner in any of the experiments until C was reduced to 0.00005. A value of C = 1 was used in all subsequent experiments.

Cost (C)   E(M)%    F1(M)%
0.00001    33.1     0.0
0.00002    9.6      74.2
0.00005    1.5      97.0
0.0001     0.4      99.3
0.001      0.4      99.3
0.1        0.4      99.3
1          0.4      99.3
10         0.4      99.3
100        0.4      99.3

Data Set: thesis Feature Set: C+F+L+W

Table 4.10: Effect of C Parameter in SVMlight on Classification Performance
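The same sweep over C can be sketched for a fixed 3rd order polynomial kernel. As before, scikit-learn and the synthetic data are assumptions made for illustration rather than the SVMlight scripts used in the thesis.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=300, n_features=170, n_informative=30,
                               random_state=0)

    # Very small C values tolerate training errors and underfit; beyond a
    # modest value the results plateau, as in Table 4.10.
    for C in (1e-5, 1e-4, 1e-3, 0.1, 1.0, 10.0, 100.0):
        scores = cross_val_score(SVC(kernel="poly", degree=3, C=C), X, y, cv=10)
        print(f"C = {C}: mean accuracy = {scores.mean():.3f}")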

4.6 Chapter Summary

This chapter has reported the results of experiments undertaken to determine parameters for successful authorship attribution on plain text, with a view to optimising SVM performance when e-mail text replaces chunks of plain text. The data used for these experiments was chunks of text from some books and PhD theses. The approach taken was to start with large chunks of text to identify discriminatory features, using an SVM for classification. Once the most successful set of features was identified, the number of words per text chunk or data point and the number of data points required to attain reliable classification were investigated. The following findings summarise the results of this chapter:

• Authorship attribution of text documents can be successfully performed using a Support Vector Machine.

• The baseline tests have shown that function words were consistently the best individual feature set independent of topic. The results also showed that better results were obtained by adding extra feature sets to the function word feature set. The 2-gram feature set could identify a previously learned author accurately but could not identify unseen text by the same author with any accuracy. It is hypothesised that this set is biased toward discrimination of content.

• The baseline tests have also shown that it may be possible to attribute text authorship using the chosen feature sets on chunks containing as few as 100 to 200 words, with possibly only 20 data points per authorship class. This is encouraging for tests that will be based on e-mail, as many e-mail messages contain at least 200 words.

• The SVMlight tool works quite effectively when the default parameters are used, though slight gains in classification performance can be made if the parameters of the classifier are optimised for the problem domain being used. Experimentation with these parameters found that the 3rd order polynomial kernel function with default scalar and constant values was suitable for the current problem domain.

• Scaling each feature to between 0.0 and 1.0 improves performance. Test data must be scaled with the same threshold and scale factors as those calculated from the training data for each feature.
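A minimal sketch of this scaling rule, using scikit-learn's MinMaxScaler as an illustrative stand-in for whatever scripts produced the thesis results:

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    X_train = np.array([[10.0, 0.20], [50.0, 0.80], [30.0, 0.50]])
    X_test = np.array([[20.0, 0.40], [60.0, 0.90]])

    # Thresholds and scale factors are calculated from the training data only...
    scaler = MinMaxScaler(feature_range=(0.0, 1.0)).fit(X_train)
    # ...and the same factors are re-applied to the test data, which may
    # therefore fall slightly outside [0.0, 1.0].
    X_train_scaled = scaler.transform(X_train)
    X_test_scaled = scaler.transform(X_test)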

These results confirmed that the approach used to date could serve as the basis for further research with e-mail data. Chapter 5 discusses the results of the experiments conducted on e-mail message data. Those experiments utilised the findings of the baseline experiments reported in this chapter for setting various parameters.

Chapter 5

Authorship Attribution and Profiling of E-mail Messages

Chapter 4 discussed results for calibration of the parameters for various baseline experiments, where these were conducted on data sources that were not e-mail. This approach was taken so that the chunk size of the text being analysed could be kept constant while other experimental parameters were established. The selection of the best sets of stylistic features was also undertaken. It was found that, of the stylistic feature sets identified and studied, the function word set was the individual feature set that provided the most power for discrimination of authorship. Importantly, though, as suggested by Rudman (1998), when more features were used in combination with one another, the discriminatory potential was increased. The baseline experiments conducted and discussed in Chapter 4 provided a necessary framework for the conduct of subsequent e-mail authorship attribution experiments. This chapter reports results for the experiments that were undertaken to develop a systematic means for classifying the authorship of e-mail messages for forensic purposes. Section 5.1 discusses the initial experiments undertaken on e-mail data. E-mail data was used for the experiments discussed in this chapter, and because of this it was

possible to test the impact of e-mail specific features for the first time. Section 5.2 outlines how improvements in the results were obtained. The effect of the topic of discussion in the e-mail messages is established in Section 5.3. In Section 5.4, experimental results for profiling sociolinguistic cohorts are shown. As stated in Chapter 2, it was the aim of this research to achieve correct classification of e-mail data at a level of approximately 85% using the weighted macro-averaged F1 measure. The results reported in this chapter indicate that this goal was achieved after the addition of the structural features of e-mail messages. A list of the experiments reported in this chapter is shown in Table 5.1.

5.1 Experiments with E-mail Messages

5.1.1 E-mail Specific Features

Using the knowledge gained from the baseline experiments reported in Chapter 4, attention was turned to the classification of e-mail message authorship. Experiments reported in Section 4.4 established that the combined character based, word based, word length frequency distribution and function word feature sets were the best combination of features to use. Initial experiments in this e-mail phase used this best combination of stylistic text feature sets, the e-mail structural features (E), and the HTML tag feature set (H) defined in Section 3.4.1. In these tests, the e-mail specific feature set and the HTML tag feature set were sequentially added to the stylistic feature sets. The results for these experiments are shown in Table 5.2. When e-mail data was analysed using the stylistic feature sets only, the F1(M) result was 64.9%. This is lower than the results achieved in the baseline experiments on text chunks containing 200 words. This is likely to be due to the e-mail messages having variable length, leading to more variability in the feature values from e-mail to e-mail.

Experiment   Reported in   Experimental Details
Number       Section

E1           5.1.1         Effect of addition of e-mail specific feature sets, email4 data set
E2           5.1.2         Effect of chunking, ‘200 word e-mail chunks’ vs ‘individual data points’, email4 data set
E3           5.2.1         Effect of extra function word features, 100-1000 word chunks of text, thesis data set
E4           5.2.1         Effect of extra function word features, email4 data set
E5           5.2.2         Effect of part of speech of function words, thesis data set
E6           5.2.2         Effect of part of speech of function words, email4 data set
E7           5.2.3         Effect of degree of polynomial kernel function, email4 data set
E8           5.3           Effect of topic, discussion data set
E9           5.3           Effect of topic on generalisation capability, classification of food and travel topic sets using movies topic models
E10          5.4.1         Effect of number of data points and minimum number of words on gender cohort classification, gender data set
E11          5.4.1         Effect of feature sets on gender cohort classification, gender data set
E12          5.4.2         Effect of number of data points and minimum number of words on language background cohort classification, language data set

Table 5.1: List of Experiments Conducted Using E-mail Message Data

It was not possible to build a data set containing sufficient e-mail messages from a group of authors where the text length was held constant at 200 ± 10% words, as there were not enough messages that matched the criteria in the inbox data set. Building such

a data set and repeating the baseline work where the number of words per message is held close to constant, c.f. Section 4.4.1, should be considered for future work. When the e-mail structural feature set and the HTML tag feature set were added separately to the stylistic feature sets, no improvements in classification efficiency were achieved. However, when both were combined with the stylistic feature sets there was a marked improvement in classification efficiency. This indicated that neither feature set by itself was sufficient to improve discrimination of e-mail authorship; when acting in concert, there was evidently some interplay between the two sets of features for this group of authors. In all further experiments using e-mail data, both the e-mail structural feature set and the HTML tag feature set were combined with the stylistic feature sets.

As foreshadowed in Section 3.4.3, two document based features (D) were added to the feature sets for tests involving e-mail data. Testing of the document based features was performed using the email4 data set; Table 5.2 also shows the results of this experiment. The addition of these features was also successful in improving classification efficiency, so they were included in all further experiments where e-mail data was analysed. It is not surprising that these document based features add discriminatory power to the separation of authorship classes. Average sentence length was one of the first stylometric features used in studies of authorship (Mendenhall, 1887). Even though other studies since then have discounted average sentence length as an authorship discriminator, when used in conjunction with other features it should aid classifier performance. The proportion of blank lines is a feature related to a person’s sense of formality when using e-mail.
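The two document based features can be computed directly from the message text. A minimal sketch follows; the sentence splitting heuristic (on '.', '!' and '?') is an assumption made for illustration.

    import re

    def document_features(text):
        """D1: proportion of blank lines; D2: average sentence length in words."""
        lines = text.splitlines()
        blank_ratio = sum(1 for ln in lines if not ln.strip()) / max(len(lines), 1)
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        avg_sentence_len = (sum(len(s.split()) for s in sentences)
                            / max(len(sentences), 1))
        return blank_ratio, avg_sentence_len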

Feature Sets        E(M)%    F1(M)%
C+F+L+W             16.5     64.9
C+F+L+W+E           15.5     66.6
C+F+L+W+H           17.0     64.5
C+F+L+W+E+H         8.2      82.4
C+F+L+W+E+H+D       6.9      85.2

Data Set: email4 Feature Set: Various
Table 5.2: Classification Results for E-mail Data Using Stylistic and E-mail Specific Features

5.1.2 ‘Chunking’ the E-mail Data

An alternative approach to the analysis of authorship attribution of e-mail messages, especially when the messages contain fewer than the suggested 200 words (see Section 4.4.1), is to concatenate the text from a series of e-mail messages and then split this into text chunks containing an equal number of words. If this approach is used, it will not be possible to use the e-mail specific features identified in Section 3.4.1, especially those features that indicate the presence or absence of some e-mail structure. The email4 data set was stripped of all headers, requoted text, greetings, farewells, signatures and attachments, and the e-mail messages for each author were joined together. The combined messages were then split into chunks of 200 words to produce the data set chunked email. The email4 data set was used as a non-chunked benchmark for comparison of the results, and these two data sets were tested in parallel. Function words only and a combination of feature sets, as outlined in Table 5.3, were tested. Training errors were recorded at a rate of approximately 4% during the learning phase of these experiments. The results show that the weighted macro-averaged F1 values for both sets of data were similar, indicating that chunking of the data did not contribute any improvement. The weighted macro-averaged error rates for the chunked data, however, were significantly lower.
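The preprocessing described above might be approximated with simple heuristics such as the following sketch; the '>' quote prefix and the '--' signature delimiter are common e-mail conventions, not rules taken from the thesis.

    def strip_email_body(body):
        """Drop requoted lines and everything after a signature delimiter."""
        kept = []
        for line in body.splitlines():
            if line.lstrip().startswith(">"):   # requoted text
                continue
            if line.rstrip() == "--":           # conventional signature delimiter
                break
            kept.append(line)
        return "\n".join(kept)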

The F1 value is, as discussed in Section 3.1.1, the harmonic mean of the precision and recall values. If one of these values for a particular class is much lower than the other, especially if it falls below 50%, the F1 value falls significantly. When the individual class precision and recall measures for this experiment were inspected, the F1 results were found to be affected by low recall results, indicating that a large number of false negatives had been assigned during the testing of the data. This may be due to ‘unbalancing’ the authorship model through highly different numbers of data points in the positive and negative classes in the ‘one against all’ approach. When these results are compared with the results for the full combination of stylistic and e-mail structural features (Table 5.3) for e-mail messages considered individually, it is clear that even though the chunked e-mail messages have a lower weighted macro-averaged error rate, the results for chunked e-mail messages do not come close to the F1(M) value of 85.2% that was achieved. All further experimentation was carried out using e-mail messages as individual data points.

5.2 In Search of Improved Classification

While the classification of plain text documents in the baseline experiments reported in Section 4.4 met the aim of a weighted macro-averaged F1 result of 85% or better, the results from the experiments conducted in the previous section using e-mail data only just reached this level. Further investigation was warranted to try to find marginal improvement.

                    Separate E-mail      200 Word E-mail
                    Messages             Chunks
Feature Sets        E(M)%    F1(M)%      E(M)%    F1(M)%
F                   19.8     57.1        12.5     58.9
C+D+F+L+W           17.3     62.9        10.0     64.4
C+D+F+L+W+E+H       6.9      85.2        -        -

Data Sets: email4 and chunked email Feature Set: Various
Table 5.3: Comparison of Results for Chunked and Non-chunked E-mail Messages

More and better features could be one approach. Although the function words were consistently the best set of context free features throughout the research, there was no guarantee that the set of function words being used was the optimal set. Further experiments with function words were warranted and are reported below. Another approach could be to tune the parameters of the SVM.

5.2.1 Function Word Experiments

The experimental results have shown that the function word feature set has been the best individual context free feature set for short (approximately 200 word) messages. It was postulated that a larger set of function words might further improve classification performance. A larger set of function words compiled by Higgins (n.d.) was used for testing. The thesis data set and the email4 data set were used to determine the effect of the larger set of function words, comparing the effect of these extra features on the baseline results and on the e-mail data. When the tests were run on the thesis data set with:

i. function words only; and

ii. function words plus other stylistic features

- as reported in Table 5.4, there was an improvement in classification efficiency for all chunk sizes. The same effect, however, was not seen when the email4 data set was used. Table 5.5 shows that there was a reduction in classification efficiency when this larger set of function words was used in isolation, and no significant improvement when it was combined with all other feature sets.

Feature      Chunk   Original Set         Large Set
Sets         Size    E(M)%    F1(M)%      E(M)%    F1(M)%
Function     1000    2.3      96.4        0.8      98.7
Words        500     4.7      92.6        2.5      95.8
Only         200     9.6      84.6        7.2      88.5
             100     15.1     76.1        11.4     81.8
Function     1000    0.4      99.3        0.4      99.3
Words +      500     2.3      96.3        1.6      97.4
C+L+W        200     3.6      94.5        4.0      93.8
             100     7.6      88.1        6.4      90.0

Data Set: thesis Feature Set: Various
Table 5.4: Comparison of Results for Original and Large Function Word Sets for the thesis Data Set
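Measuring the function word feature set amounts to counting relative frequencies over a fixed word list. A minimal sketch follows; the list shown is a tiny excerpt, with the full original list given in Appendix A.

    FUNCTION_WORDS = ["a", "about", "after", "all", "and", "the", "to"]  # excerpt only

    def function_word_features(text):
        """One relative-frequency feature per function word (count / N)."""
        tokens = text.lower().split()
        n = max(len(tokens), 1)
        return [tokens.count(word) / n for word in FUNCTION_WORDS]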

5.2.2 Effect of Function Word Part of Speech on Classification

In order to see whether some of the function words had more discriminatory potential than others, it was decided to split the function words from the new larger set into subsets based on their part of speech - pronouns, conjunctions and so on. These subsets were tested as individual feature sets and in combination with other feature sets using the thesis data set and the email4 data set.

Feature Sets         Original Set         Large Set
                     E(M)%    F1(M)%      E(M)%    F1(M)%
F                    19.8     57.1        20.2     55.0
C+W+L+F              16.5     64.9        17.3     60.5
F+D+E+H              9.7      79.7        9.6      79.1
C+F+L+W+D+E+H        7.9      82.4        8.2      82.0

Data Set: email4 Feature Set: Various
Table 5.5: Comparison of Results for Original and Large Function Word Sets for the email4 Data Set

The results on plain text data from the thesis data set are shown in Table 5.6, and the results for the email4 data set are shown in Table 5.7. The results from both forms of data showed that, when function words alone were used as the features, the best performed function word subsets were the prepositions and pronouns. When other stylistic and structural features were added, the auxiliary verb set was one of the best performed feature sets. No single part of speech subset of function words performed as well as the original set of function words, indicating that a mixture of parts of speech is required for reliable classification. The original set of function words is a mixture of different parts of speech, which is probably why it has performed so well. All subsequent testing of e-mail data for this body of research was performed with the original function word set, as these tests showed no compelling evidence that the extended set of function words would improve classification efficiency.

Function Word Set   Function Words Only    Function Words + C+L+W
                    E(M)%    F1(M)%        E(M)%    F1(M)%
Adverbs             32.9     51.0          7.4      88.9
Auxiliary Verbs     34.3     49.9          6.6      90.1
Prepositions        23.7     64.1          6.8      89.9
Pronouns            20.9     68.0          8.5      87.0
Numbers             34.9     27.0          7.7      88.3
Original Set        9.6      84.6          7.2      88.5
Large Set           3.6      94.5          4.0      93.8

Data Set: thesis Feature Set: Function Words
Table 5.6: Comparative Results for Different Function Word Sets for the thesis Data Set

5.2.3 Effect of SVM Kernel Function Parameters

The SVM parameters were investigated as part of the baseline experiments conducted on plain text document chunks, as reported in Section 4.5. Similar investigations were carried out for e-mail data using the email4 data set. The baseline experiments showed that the polynomial kernel function with degree equal to three gave satisfactory results. As the polynomial kernel was used for all further experiments with e-mail data, only the parameters of this kernel were investigated. The results of the tests investigating the effect of polynomial degree on classification are reported in Table 5.8. These results show that the best classification was achieved when the degree was set to three. This result is again not necessarily universal for all data sets (c.f. Section 4.5). The degree of the polynomial kernel was set to three for all further experiments with e-mail data.

Function Word Set   Function Words Only    Function Words + E-mail Features
                    E(M)%    F1(M)%        E(M)%    F1(M)%
Adverbs             29.6     31.6          16.4     65.2
Auxiliary Verbs     28.7     41.2          15.4     68.5
Prepositions        27.9     43.9          14.6     69.4
Pronouns            27.9     45.1          15.2     70.2
Numbers             25.4     12.4          14.9     69.4
Original Set        19.8     57.1          9.6      79.7
Large Set           20.2     55.0          9.7      79.1

Function Word Set   Function Words + E-mail Features + Stylistic Features
                    E(M)%    F1(M)%
Adverbs             10.6     78.6
Auxiliary Verbs     8.8      81.5
Prepositions        10.0     78.4
Pronouns            9.8      79.3
Numbers             11.8     75.8
Original Set        6.9      85.2
Large Set           7.9      82.0

Data Set: email4 Feature Set: Function Words
Table 5.7: Comparative Results for Different Function Word Sets for the email4 Data Set

No training errors were encountered in these tests and, as this was similar behaviour to that observed in the baseline tests on plain text data, the effect of the cost parameter, C, on classification was not investigated.

Degree of Polynomial   E(M)%    F1(M)%
1                      8.4      82.0
2                      7.8      83.2
3                      6.9      85.2
4                      7.1      85.0
5                      7.4      84.5

Data Set: email4 Feature Set: C+D+F+L+W+E+H
Table 5.8: Effect of Degree on Polynomial Kernel Function for the email4 Data Set

5.3 The Effect of Topic on E-mail Authorship Classification

An important aspect of authorship attribution is that the features and the technique being used should be immune to the effect of the topic or subject of the document being classified. A series of experiments was undertaken to ascertain whether the topic of text in e-mail messages had an effect on the classification of authorship by SVM. Details of the discussion data set used for the topic experiments are contained in Section 3.6. An initial test was carried out using merged data from all three discussion topics to determine the baseline classification result for the three authors. These tests were performed with all of the features from the stylistic and e-mail structural feature sets, using stratified 10-fold cross validation. The baseline classification results for the three authors are shown in Table 5.9. The largest of the three topic subsets was the movies topic, and a baseline result was also generated for the three authors on this subset in a similar fashion. The results of this classification are shown in Table 5.10. The classification results for the whole data set are similar to those of the movies topic subset, indicating that topical words do not seem to be affecting classification performance.

Measure           Authorship Class
                  1        2        3
Data Points       30       62       63
Error Rate (%)    7.1      7.7      7.1
Precision (%)     100.0    83.8     93.8
Recall (%)        63.3     98.3     89.6
F1 (%)            77.6     90.5     91.6

E(M) (%) = 7.3    F1(M) (%) = 85.6

Data Set: discussion Feature Set: C+D+F+L+W+E+H
Table 5.9: Classification Results for the discussion Data Set

Measure           Authorship Class
                  1        2        3
Data Points       15       21       23
Error Rate (%)    16.9     11.9     6.8
Precision (%)     100.0    79.2     88.0
Recall (%)        33.3     90.5     95.7
F1 (%)            50.0     84.4     91.7

E(M) (%) = 12.2   F1(M) (%) = 73.8

Data Set: discussion - movies topic Feature Set: C+D+F+L+W+E+H
Table 5.10: Classification Results for the movies Topic from the discussion Data Set

In order to further test that topic does not affect the generalisation ability of the models, the movies topic was used as the training data and the resulting models were tested with the e-mail messages from the other topic sets, food and travel. If topic does not affect classification performance with the same features as in Tables 5.9 and 5.10, then generalisation performance for each author should be similar to that obtained when all data from each author was used. A single classification model was learned for each of the three authors using just the movies topic as the training data set, and these models were used to predict authorship classes for the e-mail messages from the food and travel subsets. The results of this test are shown in Table 5.11. This inter-topic result shows that the classification of authorship is still approximately 85% successful when e-mail messages from different topics are used, another indication that the topic of the e-mail messages does not affect the classification of authorship. It will be noted from this table that the results are poor for Author 1. The number of e-mail messages from Author 1 in this data set was less than that for the other two authors, as shown in Table 3.11: the authorship model for this author was learned from only 15 movies-topic e-mail messages, fewer than the recommended minimum of 20 ascertained by experimentation in Section 4.4.2.
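The inter-topic test amounts to training on one topic and predicting on the others. The sketch below uses scikit-learn and synthetic stand-ins for the per-topic feature matrices, both assumptions made for illustration; in the real experiment each row would be one e-mail message and each label one of the three authors.

    from sklearn.datasets import make_classification
    from sklearn.metrics import f1_score
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=180, n_features=50, n_informative=10,
                               n_classes=3, n_clusters_per_class=1, random_state=0)
    X_movies, y_movies = X[:60], y[:60]      # training topic
    X_food, y_food = X[60:120], y[60:120]    # held-out topic
    X_travel, y_travel = X[120:], y[120:]    # held-out topic

    model = SVC(kernel="poly", degree=3).fit(X_movies, y_movies)
    for name, Xt, yt in (("food", X_food, y_food), ("travel", X_travel, y_travel)):
        print(name, f1_score(yt, model.predict(Xt), average="macro"))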

5.4 Profiling the Author: Authorship Characterisation

As discussed in Section 3.5, a method to reduce the number of suspects in a forensic investigation involving e-mail authorship attribution is desirable. One approach is to build sociolinguistic profiles for authors. These profiles could then be used to identify the e-mail message author’s gender, language background, age group and education level.

         Authorship Class F1 (%)
Topic    1        2        3
food     28.6     87.5     88.5
travel   50.0     95.2     100.0

Data Set: discussion Feature Set: C+D+F+L+W+E+H
Table 5.11: Classification Results for the food and travel Topics from the discussion Data Set Using the movies Topic Classifier Models

Cohort profiles would have to be learned from a large number of authors to ensure that discriminatory features were cohort based rather than author based.

5.4.1 Gender Experiments

The gender cohort is an ideal starting place for investigations of authorship characterisation. The cohort contains only two classes, and some previous research has empirically identified features which distinguish male and female writings. As in the authorship attribution problem, the number of words in the e-mail messages contributing to the cohort profile will have an effect on the classification results. E-mail messages containing no words cannot be discriminated, as there is no stylistic evidence contained in them. Another parameter that will impact on the performance of authorship characterisation is the number of e-mail messages in the cohort. A small number of messages is unlikely to produce a general model of authorship gender.

Since, as Section 3.6 foreshadowed, very few gender-identifiable messages above 200 words were available, experiments were undertaken where the minimum word count per e-mail message was varied between 50, 100, 150 and 200 words, and the number of e-mail messages in the cohorts was varied between 50 and the maximum number of available e-mail messages. The results of testing these variables are shown in Table 5.12 and are graphically represented in Figure 5-1. Training errors of approximately 5% were recorded during learning. The proportion of support vectors in these tests was between 60% and 80%, indicating that these models were more fitted to the training data than those from the authorship experiments on both plain text chunks and e-mail messages. It can be seen from these results that there is a general trend towards better results when both the number of messages per cohort and the minimum word count for each message in the cohorts are increased. This increase in the amount of raw data in the cohorts, as expected, leads to a better model of gender. A cohort size of at least 500 e-mail messages seems to be required for gender classification, although better performance was indicated when a larger cohort of 2000 e-mail messages was used.

A study of the impact of features on gender classification was undertaken by measuring the result for all feature sets and then removing one set at a time to measure the impact. The results of these investigations into the discriminatory power of the various feature sets are shown in Table 5.13. These results show that when the e-mail structure features, function words, HTML tags or word length frequency distribution are removed from the full feature set, there is a statistically significant decrease in classification performance. There was no statistically significant gain from the addition of the set of gender based features. This is a similar result to that from the authorship attribution experiments, where the same combination of feature sets gave the best classification results. There may be features within the feature sets that do not contribute to the discrimination, but these would have to be determined from an exhaustive feature set sub-selection experiment. The gender preferential feature set did not improve the classification performance over the gender cohort. These results also showed that the most significant decrease in classification performance came from the removal of the function word features, suggesting that some of the function words in the list being used may be more powerful discriminators of gender than those identified in the literature to date.
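This leave-one-feature-set-out study can be expressed as a loop over column groups. The sketch below again substitutes scikit-learn and synthetic data for the thesis's SVMlight experiments, and the column ranges assigned to each feature set are hypothetical.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=300, n_features=60, n_informative=20,
                               random_state=0)
    # Hypothetical column ranges for each feature set within X.
    groups = {"C": range(0, 10), "D": range(10, 12), "E": range(12, 18),
              "F": range(18, 50), "L": range(50, 60)}

    baseline = cross_val_score(SVC(kernel="poly", degree=3), X, y, cv=10,
                               scoring="f1_macro").mean()
    for name, cols in groups.items():
        keep = [j for j in range(X.shape[1]) if j not in cols]
        score = cross_val_score(SVC(kernel="poly", degree=3), X[:, keep], y,
                                cv=10, scoring="f1_macro").mean()
        print(f"without {name}: {score:.3f} (all features: {baseline:.3f})")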

Messages      Minimum Word Count
per Cohort    50                 100                150                200
              E(M)%   F1(M)%     E(M)%   F1(M)%     E(M)%   F1(M)%     E(M)%   F1(M)%
50            35.5    64.4       37.7    62.2       42.9    57.1       40.2    59.8
100           31.6    68.4       36.0    64.0       43.1    56.8       35.0    65.0
150           33.7    66.3       39.2    60.8       38.9    61.1       36.6    63.3
200           35.2    64.8       38.5    61.5       37.8    62.2       36.1    63.8
250           32.7    67.3       36.8    63.2       34.8    65.2       34.3    65.7
300           33.6    66.4       32.4    67.6       33.4    66.6       32.7    67.3
400           32.5    67.5       31.3    68.7       29.8    70.2       -       -
500           32.7    67.3       29.9    70.1       29.7    70.3       -       -
600           33.3    66.7       29.9    70.1       -       -          -       -
750           30.8    69.2       29.2    70.8       -       -          -       -
1000          30.5    69.4       29.0    71.1       -       -          -       -
1250          31.0    68.8       -       -          -       -          -       -
1500          30.2    69.8       -       -          -       -          -       -
2000          27.9    72.1       -       -          -       -          -       -

Data Set: gender Feature Set: C+D+E+F+G+H+L+W
Table 5.12: Effect of Cohort Size on Gender

Figure 5-1: Effect of Cohort Size on Gender

Feature Sets                          E(M)%    F1(M)%
All                                   29.9     70.1
Character features removed            30.0     70.0
Document features removed             30.2     69.8
E-mail Structure features removed     31.9     68.1
Function words removed                36.0     64.0
HTML tags removed                     30.6     69.4
Word length distribution removed      32.6     67.4
Word based features removed           30.4     69.6
Gender based features added           29.8     70.2

Data Set: gender Feature Set: Various
Table 5.13: Effect of Feature Sets on Classification of Gender

5.4.2 Language Background Experiments

Similar experiments to those conducted for the gender cohorts were conducted for messages authored by writers with English as a native language (ENL) versus English as a second language (ESL). The variables studied were the minimum number of words per message and the number of messages in the cohort. It was difficult to obtain as many ESL authored e-mail messages as ENL messages, which restricted the cohort sizes that could be used in the testing. The results of these experiments are shown in Table 5.14 and graphically in Figure 5-2.

Messages      Minimum Word Count
per Cohort    50                 100                150                200
              E(M)%   F1(M)%     E(M)%   F1(M)%     E(M)%   F1(M)%     E(M)%   F1(M)%
50            40.1    59.7       36.4    63.3       26.0    74.0       28.6    71.0
100           34.4    65.6       29.6    70.4       24.0    76.0       28.3    77.3
150           29.7    70.3       30.4    69.5       26.3    73.6       19.7    80.3
200           29.4    70.6       29.7    70.3       25.3    74.7       19.9    80.8
250           30.3    69.7       29.3    70.7       24.7    75.1       -       -
300           27.5    72.5       28.7    71.2       -       -          -       -
400           29.1    70.9       26.9    73.0       -       -          -       -
500           27.5    72.5       -       -          -       -          -       -
600           26.7    73.3       -       -          -       -          -       -
700           25.4    74.6       -       -          -       -          -       -

Data Set: language Feature Set: C+D+E+F+H+L+W
Table 5.14: Effect of Cohort Size on Language

Similar results to those for the gender cohort tests were observed. An improvement in classification efficiency is seen as the number of messages in the cohort increases and as the minimum word count per e-mail message increases.

Figure 5-2: Effect of Cohort Size on Language

The classification of language background appears to be more effective than that of gender with the feature sets being used. This is possibly due to the influence of the structure of the native language with which an ESL author is familiar. It may be that the author's native language is more regimented, with fewer deviations from formality, than English.

5.5 Chapter Summary

This chapter has discussed the experimental sequence used to arrive at the optimal approach to authorship analysis of e-mail messages, and the possibility of authorship characterisation, or cohort profiling, for the gender and language background cohorts.

The major findings from this work include:

• While stylistic features are effective for plain text, as shown in Chapter 4, they are not sufficient for e-mail messages. However, the addition of structural features from e-mail messages aids the classification of authorship.

• The effect of topic on authorship attribution was investigated and none was found, indicating that attribution can rely on the style in which authors write and the structure which they add to their messages.

• Authorship characterisation has been attempted and showed promising results for gender and language background cohorts. The classification performance improved as more data points containing higher numbers of words were used. It is interesting to note that it is the features used for authorship analysis, rather than the extra gender specific features identified from the literature, that provide the discriminatory power for these cohorts.

The final chapter of the thesis provides a summary of the major conclusions and discoveries made by this research and discusses the impact of this work on related fields of study. It also suggests some possible extensions of the work for the future.

Chapter 6

Conclusions and Further Work

This project set out with the objective of showing whether literary stylistics could be applied successfully in a very different domain: identifying the authors of anonymous e-mail. This chapter provides a summary of the major conclusions and discoveries from this research. A discussion of the future implications of this body of research is also given.

6.1 Conclusions

At the end of Chapter 2, the thesis posed these research questions.

• Can the work that has been carried out in stylometry research based on literary works be applied to the text contained in e-mail messages?

• What are the best features for authorship attribution of e-mail messages?

• How many words are required in e-mail messages to make authorship attribution successful?

• Is there some way of reducing a large list of suspected authors, e.g. 50, to a smaller list of authors, e.g. 5, for selection of one author most likely to have written an e-mail message?

We can conclude the following from the results presented.

On the applicability of stylometry research, the project has demonstrated that authorship attribution of text documents can be successfully performed using a Support Vector Machine, with a minimum weighted macro-averaged F1 as large as 85%.

The research has also clarified the best features for authorship attribution of e-mail messages. The baseline tests showed that function words were consistently the best individual feature set independent of topic for the analysis of authorship. The results also showed that better results were obtained by adding extra feature sets to the function word feature set. The 2-gram feature set showed good discrimination, but the experimental results indicated that this set is biased toward discrimination of content. The use of word collocations was unsuccessful; it is thought that this is due to the small number of words in a typical e-mail message, leading to noisy feature values. While stylistic features are effective for plain text, they are not sufficient to reach the 85% F1 target for e-mail messages, but the addition of structural features from e-mail messages aided the classification of authorship.

The project has also determined from the baseline tests that it was possible to attribute the authorship of text with the chosen feature sets on chunks containing as few as 200 to 250 words, with possibly only 20 data points per authorship class. This 200 word limit is feasible for many e-mail messages.

In an attempt to reduce the number of authors to be compared, authorship characterisation by sociolinguistic cohort was trialled for gender and language background cohorts. Although these experiments were hampered by unsuitably small e-mail texts (less than 200 words), enough positive results were gained to make authorship characterisation a promising field for further research. The chosen gender preferential features did not improve classification over the markers already demonstrating success in individual authorship identification. However, many of the latter clearly have embedded gender bias, most notably among the function words, and need further study.

Much useful experience was gained with the optimisation of SVMlight. The SVMlight implementation works quite effectively when the default parameters are used, though slight gains in classification performance can be made if the parameters of the classifier are optimised for the problem domain being used. Experimentation with these parameters found that the 3rd order polynomial kernel function with default scalar and constant values was best in the current problem domain.

The effect of topic on authorship attribution was investigated and, as formulated in these experiments, no effect was found. Attribution can still successfully be done using markers based on the style in which authors write and the structure which they add to their messages.

6.2 Implications for Further Work

It is clear that the results of this research will be useful only in informal situations requiring authorship identification. Without significant improvements, evidence based on SVM-derived analysis would not reach court room standards. However, the project has laid a firm groundwork for these improvements. Most research into authorship attribution has used text chunks with at least 1000 words. The work conducted here has significantly lowered this text size. Better classification results can be obtained when more words are used, but reliable results have been achieved on plain text documents containing between 200 and 250 words. The suggestion by Holmes (1998), that the more features used to define an authorship pattern the better, has been validated through use of the Support Vector Machine learning algorithm. SVMs have not been widely used in authorship attribution studies and could prove to be a better tool for attribution studies than the other machine learning algorithms used to date.

This research has not selected the optimal set of features for building authorship patterns, but a feature set subselection approach as discussed by Witten and Frank (2000) could be used to discover the most significant features for authorship attribution in general or for specific authors. It is felt that to further improve the results obtained from this research, some extension must be made to the frequencies of character and word based features used here. A possible approach for the future would be to syntactically mark up the text using an automated natural language processing (NLP) tagger. Part of speech tags could be counted and used as features, or a model of an author's preferred grammatical sentence structure could be discovered, e.g. by approximating this structure with Hidden Markov Models.

While the basis of a technique for authorship characterisation has been demonstrated, further improvements must be made if this is to become more accurate. Further investigations into gender specific or language background specific features are warranted. Grammatical analysis using NLP may be useful for this work as well. While a large number of authors have been used in the generation of cohorts, an even larger number of authors may be needed for the generation of more reliable models. We could expect that a data set of large e-mail messages would give more reliable results. Further sociolinguistic cohorts could be investigated. Rayson et al. (1997) have discovered differences in the writings of people from different age groups and social class backgrounds. Other cohorts such as education level could also be examined. The investigation of English as a native language versus second language could be extended to a more refined level based on the country or continent from which the ESL authors come. There is an indication in the literature that there are many "Englishes" spoken around the world (Bhatt, 2001). With sufficient data, models could be generated for these. This opens the exciting prospect that a randomly selected text of more than 100 to 200 words could identify the writer as, e.g., male, at least 40 years old, of an English-speaking background and never educated beyond primary school.

While the work conducted for this body of research has focussed on e-mail messages as the data source, it should be possible to apply the technique to other areas of text analysis. Text plagiarism detection should benefit from a similar SVM based approach. A second example could be the identification of anonymous writers in Internet chat rooms. This is a very different genre of writing from e-mail's, but stylometric authorship patterns would persist here, especially the use of function words. In 1998 the stylistics researcher David Holmes said that "although as yet, no definitive methodology or technique has emerged, statisticians are coming closer to stylometry's 'holy grail', the fully automated identifier!" (Holmes, 1998). It is hoped that the research work reported in this thesis has brought the stylometric 'holy grail' slightly nearer.

Glossary

Accuracy Accuracy is calculated from the number of correct classifications made by the classifier.

Authorship Analysis Any study of the authorship of a piece of text. The text may be literature, prose, poetry or communication.

Authorship Attribution Authorship attribution involves the identification of the author of a piece of text. This can be done traditionally using handwriting comparisons or non-traditionally using stylometry.

Authorship Characterisation Authorship characterisation involves determining some sociolinguistic characteristic of an author, such as gender, age, education level etc.

Badge Word A word that is preferred by a particular author relative to other authors.

Collocation A collocation is a combination of two words together or separated by a certain number of words.

Entropy E The entropy of a piece of text is defined as

E = -\sum_{i=1}^{N} V_i \frac{i}{N} \log_2 \frac{i}{N}

where V_i is the number of types that occur i times in the text and N is the number of tokens.

Error Rate The error rate is the number of misclassifications made on a Test Set. The error rate can be represented as a value between 0.0 and 1.0 or as a percentage.

F1 Measure The F1 measure is the Fβ measure with β set to 1, the break even point between Precision and Recall. It is calculated as the harmonic mean of the recall and precision.

Fβ Measure The Fβ measure is used to calculate a combined measure from the Precision and Recall values. Fβ is defined as

F_\beta = \frac{(1 + \beta^2) R P}{R + \beta^2 P}

where R = recall and P = precision.

Fluke Word A word that is not preferred by a particular author relative to other authors.

Grammatical Accidence The study of changes in the form of words by internal modification for the expression of tense, person, case, number etc.

Hapax Dislegomena Hapax dislegomena are words that are used twice in any text.

Hapax Legomena Hapax legomena are words that are used once only in any text.

HTML Hyper-Text Markup Language is a language consisting of formatting elements that are used to format World Wide Web pages and e-mail messages.

k-fold Cross Validation k-fold cross validation is used to test the effectiveness of a learnt classifier on a small data set. The data is randomly split across the k folds, ensuring that each fold has the same or almost the same number of data points in it. One fold is used as a test set while the remaining k − 1 folds are used as a training set to learn a classifier. This allows k classifiers to be learnt from the data set. Each fold is used once as a test set. Typically k is set to 10 and 10-fold cross validation is performed.

Machine Learning Machine learning is the field of study that investigates the ability of a learning algorithm to take input and represent it as knowledge, allowing the algorithm to generalise about unseen data.

One Against All A method of producing classifiers from multi-class data sets. For a data set containing n classes, n data sets are produced, with the ith class (i = 1 . . . n) being made the positive class and all others part of the negative class. n classifier models are then learnt from the training sets. Test data can then be classified by each model to determine which class the test data belongs to.

One Against One A method of producing classifiers from multi-class data sets. For a data set containing n classes, each class in the data set is combined with one other class from the data set in turn, producing n(n − 1) training sets from which classifier models are learnt. Test data can then be classified by each learnt classifier model to determine the class to which the test data belongs.

Precision Precision is a measure of classifier performance. It measures the impact of false positive assignments on the classifier’s performance.

Recall Recall is a measure of classifier performance. It measures the impact of false negative assignments on the classifier’s performance.

Simpson’s Index D Simpson’s D is defined as

D = \sum_{i=1}^{V} V_i \frac{i}{N} \cdot \frac{i - 1}{N - 1}

where V is the number of types, Vi is the number of types that occur i times in the text and N is the number of tokens.

Stratification Stratification can be performed on the data set when performing k-fold Cross Validation. The data from each class in the set are randomly split evenly or as evenly as possible among the k folds. This ensures that each class in the data set is represented in a similar ratio in the Test Set and Training Set.

Stylometry The statistical analysis of literary style. It makes the assumption that there is an unconscious aspect to one’s style of writing which cannot be manipulated and possesses features that are quantifiable and may be distinctive.

Support Vector Machine A type of classifier used in machine learning. SVMs were developed by Vapnik (1995) to perform a classification of data that is either in a positive class or not in that positive class. From a Training Set, a classifier model can be learnt that can then be used to classify new data as belonging to the positive class or not. SVMs are suited to sparse data sets with many attributes or features, and they do not suffer from overtraining. SVMs find the hyperplane which separates the positive and negative training examples with maximum margin. The data points closest to the hyperplane are called support vectors. Originally the kernel function used to separate the data was linear, but any function can be used; polynomial or radial basis function kernels are typically used. The data is mapped into some higher dimensional space to effect the separation.

Test Set The test set is the set of data that is used to test the Accuracy of a classifier that has been learnt on the training set. The test set and the training set must be disjoint.

Training Set The training set is the set of data that a machine learning classifier is learnt on. The training set and the test set must be disjoint.

Type-Token Ratio R As we progress through a text, N increases from 1 to the total number of word tokens in the text, i.e. N is the number of word tokens. A word token is an instance of a word type. The total number of word types, or the vocabulary size, is signified by V. The type-token ratio R is defined as

R = \frac{V}{N}

User Agent An application used to construct, edit, send and receive e-mail messages. Common User Agents include Eudora, Microsoft Outlook, Netscape Messenger and Pine.

Yule’s Characteristic K Yule’s K is defined as

K = 10^4 \left[ -\frac{1}{N} + \sum_{i=1}^{V} V_i \left(\frac{i}{N}\right)^2 \right]

where V is the number of types, V_i is the number of types that occur i times in the text and N is the number of tokens.

Appendix A

Feature Sets

Definitions

N = total number of tokens (i.e., words)
V = total number of types (i.e., distinct words)
C = total number of characters
H = total number of HTML tags in the e-mail body

The count of hapax legomena is defined as the number of types that occur only once in the text.

Features W7 to W19 are defined in Tweedie and Baayen (1998).

A.1 Document Based Features

Feature Number   Feature Description

D1   Number of blank lines/total number of lines
D2   Average sentence length (number of words)


A.2 Word Based Features

Feature Number   Feature Description

W1    Average word length
W2    Vocabulary richness, i.e., V/N
W3    Total number of function words/N
W4    Total number of short words/N (word length ≤ 3)
W5    Count of hapax legomena/N
W6    Count of hapax legomena/V
W7    Guirad’s R
W8    Herdan’s C
W9    Herdan’s V
W10   Rubet’s K
W11   Maas’ A
W12   Dugast’s U
W13   Luk”janenkov and Neistoj’s measure
W14   Brunet’s W
W15   Honore’s H
W16   Sichel’s S
W17   Yule’s K
W18   Simpson’s D
W19   Entropy measure

Guirad’s R = \frac{V}{\sqrt{N}}

Herdan’s C = \frac{\log_{10} V}{\log_{10} N}

Herdan’s V = \sum_{i=1}^{V} V_i \frac{i^2}{N^2}

Rubet’s K = \frac{\log_{10} V}{\log_{10}(\log_{10} N)}

Maas’ A = \sqrt{\frac{\log_{10} N - \log_{10} V}{(\log_{10} N)^2}}

Dugast’s U = \frac{(\log_{10} N)^2}{\log_{10} N - \log_{10} V}

Luk”janenkov and Neistoj’s measure = \frac{1 - V^2}{V^2 \log_{10} N}

Brunet’s W = N^{V^{-0.172}}

Honore’s H = \frac{100 \log_{10} N}{1 - \frac{\text{count of hapax legomena}}{V}}

Sichel’s S = \frac{\text{count of hapax dislegomena}}{V}

Yule’s K = 10^4 \left[ -\frac{1}{N} + \sum_{i=1}^{V} V_i \left(\frac{i}{N}\right)^2 \right]

Simpson’s D = \sum_{i=1}^{V} V_i \frac{i}{N} \cdot \frac{i - 1}{N - 1}

Entropy = -\sum_{i=1}^{N} V_i \frac{i}{N} \log_{10} \frac{i}{N}
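All of these measures derive from the frequency spectrum V_i. A minimal sketch computing three of them from raw text follows, assuming whitespace tokenisation and a text of more than one token; these assumptions are for illustration only.

    import math
    from collections import Counter

    def richness_measures(text):
        """Compute a few vocabulary richness measures from the frequency
        spectrum V_i (assumes whitespace tokenisation and N > 1)."""
        tokens = text.lower().split()
        N = len(tokens)
        type_counts = Counter(tokens)              # type -> token count
        V = len(type_counts)
        spectrum = Counter(type_counts.values())   # i -> V_i
        guirad_r = V / math.sqrt(N)
        yule_k = 1e4 * (-1.0 / N +
                        sum(V_i * (i / N) ** 2 for i, V_i in spectrum.items()))
        simpson_d = sum(V_i * (i / N) * ((i - 1) / (N - 1))
                        for i, V_i in spectrum.items())
        return {"Guirad R": guirad_r, "Yule K": yule_k, "Simpson D": simpson_d}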

A.3 Character Based Features

Feature Number   Feature Description

C1    Number of characters in words/C
C2    Number of alphabetic characters/C
C3    Number of upper-case characters in words/C
C4    Number of digit characters in words/C
C5    Number of white-space characters/C
C6    Number of spaces/C
C7    Number of spaces/Number of white-space characters
C8    Number of tab spaces/C
C9    Number of tab spaces/Number of white-space characters
C10   Number of punctuation characters/C

A.4 Function Word Frequency Distribution

Feature Number   Feature Description

F1 . . . F122 Function word frequency / N

Original Function Word List

This list of function words was sourced from Craig (1999).

a about after all am also an and any are as at be been before best better both but by can cannot come comes could did do does done first for from give go had has have he her here him his how i if in into is it know let like may me men might mine much must my need no none not nothing now of on once one or our out put see shall she should since so stay still such take tell than that the their theirs them then there these they this those though time to too up upon us very was we well were what when which who whose why will with yes yet you your yours

Extended Function Word List

This list of function words was sourced from Higgins (n.d.).

Adverbs

again ago almost already also always anywhere back else even ever everywhere far hence here hither how however near nearby nearly never not now nowhere often only quite rather sometimes somewhere soon still then thence there therefore thither thus today tomorrow too underneath very when whence where whither why yes yesterday yet

Auxiliary Verbs and Contractions

am are aren’t be been being can can’t could couldn’t did didn’t do does doesn’t doing done don’t get gets getting got had hadn’t has hasn’t have haven’t having he’d he’ll he’s i’d i’ll i’m is i’ve isn’t it’s may might must mustn’t ought oughtn’t shall shan’t she’d she’ll she’s should shouldn’t that’s they’d they’ll they’re was wasn’t we’d we’ll were we’re weren’t we’ve will won’t would wouldn’t you’d you’ll you’re you’ve

Prepositions and Conjunctions

about above after along although among and around as at before below beneath beside between beyond but by down during except for from if in into near nor of off on or out over round since so than that though through till to towards under unless until up whereas while with within without

Determiners and Pronouns

a all an another any anybody anything both each either enough every everybody everyone everything few fewer he her hers herself him himself his i its itself less many me mine more most much my myself neither no nobody none noone nothing other others our ours ourselves she some somebody someone something such that the their theirs them themselves these they this those us we what which who whom whose you yours yourself yourselves

Numbers

billion billionth eight eighteen eighteenth eighth eightieth eighty eleven eleventh fifteen fifteenth fifth fiftieth fifty first five fortieth forty four fourteen fourteenth fourth hundred hundredth last million millionth next nine nineteenth ninetieth ninety ninth once one second seven seventeen seventeenth seventh seventieth seventy six sixteen sixteenth sixth sixtieth sixty ten tenth third thirteen thirteenth thirtieth thirty thousand thousandth three thrice twelfth twelve twentieth twenty twice two
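Each function word contributes one relative-frequency feature. A minimal sketch; the word list here is truncated to the first few entries of the Craig (1999) list above.

```python
from collections import Counter

# Truncated for brevity; extend with the full Craig (1999) list above.
FUNCTION_WORDS = ["a", "about", "after", "all", "am", "also", "an", "and"]

def function_word_features(tokens, words=FUNCTION_WORDS):
    """Features F1..F122: relative frequency of each function word."""
    N = len(tokens)
    counts = Counter(tokens)
    return [counts[w] / N for w in words]
```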

A.5 Word Length Frequency Distribution

Feature Number   Feature Description

L1 . . . L30     Word length frequency distribution / N
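A sketch of the word-length histogram; ignoring tokens longer than 30 characters is an assumption the thesis does not state.

```python
def word_length_features(tokens, max_len=30):
    """Features L1..L30: proportion of tokens of each length 1..30."""
    N = len(tokens)
    dist = [0.0] * max_len
    for t in tokens:
        if 1 <= len(t) <= max_len:
            dist[len(t) - 1] += 1.0 / N
    return dist
```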

A.6 E-mail Structural Features

Feature Number   Feature Description

E1               Reply status
E2               Has a greeting acknowledgement
E3               Uses a farewell acknowledgement
E4               Contains signature text
E5               Number of attachments
E6               Position of re-quoted text within e-mail body

A.7 HTML Tag Features

Feature Number   Feature Description

H1               Frequency of … / H
H2               Frequency of … or … / H
H3               Frequency of … / H
H4               Frequency of … / H
H5               Frequency of … / H
H6               Frequency of … or … / H
H7               Frequency of … or … / H

A.8 Gender Specific Features

Feature Number   Feature Description

G1               Number of words ending with able / N
G2               Number of words ending with al / N
G3               Number of words ending with ful / N
G4               Number of words ending with ible / N
G5               Number of words ending with ic / N
G6               Number of words ending with ive / N
G7               Number of words ending with less / N
G8               Number of words ending with ly / N
G9               Number of words ending with ous / N
G10              Number of sorry words / N
G11              Number of words starting with apolog / N
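The suffix-based features reduce to simple counts over lower-cased tokens. A sketch; G10 is omitted because the thesis's list of "sorry words" is not reproduced here.

```python
SUFFIXES = ["able", "al", "ful", "ible", "ic", "ive", "less", "ly", "ous"]

def gender_features(tokens):
    """Features G1-G9 (suffix counts / N) and G11 (apolog* words / N)."""
    N = len(tokens)
    feats = [sum(t.endswith(s) for t in tokens) / N for s in SUFFIXES]
    feats.append(sum(t.startswith("apolog") for t in tokens) / N)  # G11
    return feats
```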

A.9 Collocation List

and all, and of, and the, and then, all of, are a, are all, are also, are as, are by, are in, are no, are not, are now, are of, are some, are the, are to, as a, as if, as the, as though, as well, at a, at last, at the, be in, be of, be on, be the, be to, can also, can be, can do, can have, can no, can not, can only, can the, could also, could be, could do, could have, could no, could not, could only, could the, did not, did the, did this, did with, do not, do the, do this, do with, for a, for example, for the, get a, get an, get the, had a, had an, had been, had no, had not, had the, had to, has a, has an, has been, has no, has not, has the, has to, have a, have an, have been, have no, have not, have the, have to, in a, in it, in the, in to, is a, is the, it is, may also, may be, may do, may have, may not, may only, might also, might be, might do, might have, might not, might only, must also, must be, must do, must have, must not, must only, of a, of the, on a, on to, on the, shall also, shall be, shall do, shall have, shall no, shall not, shall only, shall the, should also, should be, should do, should have, should no, should not, should only, should the, that is, that it, that the, to a, to be, to go, to the, was a, was an, was as, was in, was not, was of, was on, was the, was to, were a, were an, were as, were in, were not, were of, were on, were the, were to, will also, will be, will do, will have, will no, will not, will only, will the, would also, would be, would do, would have, would no, would not, would only, would the
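Collocation features count adjacent word pairs. A sketch with a truncated pair list; normalising by the number of adjacent token pairs is one plausible reading, not a definition the thesis gives here.

```python
from collections import Counter

# Truncated for brevity; extend with the full collocation list above.
COLLOCATIONS = [("and", "the"), ("of", "the"), ("to", "be"), ("it", "is")]

def collocation_features(tokens):
    """Relative frequency of each listed two-word collocation."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = max(len(tokens) - 1, 1)
    return [bigrams[pair] / total for pair in COLLOCATIONS]
```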
Bibliography
I. Androutsopoulos, J. Koutsias, K. V. Chandrinos, and C. D. Spyropoulos. An experimental comparison of Naive Bayesian and keyword-based anti-spam filtering with personal e-mail messages. In N. J. Belkin, P. Ingwersen, and M.-K. Leong, editors, 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 160–167, Athens, Greece, 2000a.

I. Androutsopoulos, G. Paliouras, V. Karkaletsis, G. Sakkis, C. D. Spyropoulos, and P. Stamatopoulos. Learning to filter spam e-mail: A comparison of Naive Bayesian and a memory-based approach. In H. Zaragoza, P. Gallinari, and M. Rajman, editors, Workshop on Machine Learning and Textual Information Access at the 4th European Conference on Principles and Practice of Knowledge Discovery in Databases, pages 1–13, Lyon, France, 2000b.

C. Apte, F. Damerau, and S. Weiss. Text mining with decision rules and decision trees. In Workshop on Learning from Text and the Web, Conference on Automated Learning and Discovery, 1998.

R. H. Baayen, F. J. Tweedie, A. Neijt, and L. Krebbers. Back to the cave of shadows: Stylistic fingerprints in authorship attribution. The ALLC/ACH 2000 Conference, University of Glasgow, 2000. WWW Document, URL http://www2.arts.gla.ac.uk/allcach2k/Programme/session2.html#233, accessed August 3, 2000.

R. H. Baayen, H. Van Halteren, and F. J. Tweedie. Outside the cave of shadows: Using syntactic annotation to enhance authorship attribution. Literary and Linguistic Computing, 11(3):121–131, 1996.

N. Baron. Letters by phone or speech by other means: The linguistics of email. Language and Communication, 18:133–170, 1998.

P. Bartlett. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. In Advances in Neural Information Processing Systems 9, pages 134–140, 1997.

D. Benedetto, E. Caglioti, and V. Loreto. Language trees and zipping. Physical Review Letters, 88(4):048702–1–4, 2002a.

D. Benedetto, E. Caglioti, and V. Loreto. On J. Goodman’s comment to “Language trees and zipping”, 2002b. WWW Document, URL http://arXiv.org/abs/cond-mat/0203275, accessed March 12, 2003.

R. M. Bhatt. World Englishes. Annual Review of Anthropology, 30:527–550, 2001.

S. L. Brodsky. The Expert Expert Witness: More Maxims and Guidelines for Testifying in Court. American Psychological Association, Washington, DC, 1999.

J. F. Burrows. Computers and the study of literature. In C. Butler, editor, Computers and Written Text, Applied Language Studies, pages 167–204. Blackwell, Oxford, 1992.

D. Canter and J. Chester. Investigation into the claim of weighted Cusum in authorship attribution studies. Forensic Linguistics, 4(2):252–261, 1997.

CERT. CERT® Advisory CA-2000-04 Love Letter Worm. Carnegie Mellon University, 2000. WWW Document, URL http://www.cert.org/advisories/CA-2000-04.html, accessed March 12, 2003.

CERT. CERT® Advisory CA-2001-19 “Code Red” Worm Exploiting Buffer Overflow In IIS Indexing Service DLL, 2001a. WWW Document, URL http://www.cert.org/advisories/CA-2001-19.html, accessed March 12, 2003.

CERT. CERT® Advisory CA-2001-22 W32/Sircam Malicious Code, 2001b. WWW Document, URL http://www.cert.org/advisories/CA-2001-22.html, accessed March 12, 2003.

CERT. CERT® Advisory CA-2001-26 Nimda Worm, 2001c. WWW Document, URL http://www.cert.org/advisories/CA-2001-26.html, accessed March 12, 2003.

C. E. Chaski. Who wrote it?: Steps toward a science of authorship identification. National Institute of Justice Journal, 233(September 1997):15–22, 1997.

C. E. Chaski. Empirical evaluations of language-based author identification tech- niques. Forensic Linguistics, 8(1):1–65, 2001.

W. W. Cohen. Learning rules that classify e-mail. In 1996 AAAI Spring Symposium on Machine Learning in Information Access, 1996.

H. Craig. Authorial attribution and computational stylistics: If you can tell authors apart, have you learned anything about them? Literary and Linguistic Computing, 14(1):103–113, 1999.

C. Crain. The Bard’s fingerprints. Lingua Franca, 4:29–39, 1998.

E. Crawford, J. Kay, and E. McCreath. Automatic induction of rules for e-mail classification. In Sixth Australasian Document Computing Symposium, Coffs Harbour, Australia, 2001.

D. H. Crocker. RFC822 - Standard for the format of ARPA Internet text messages, 1982. WWW Document, URL http://www.faqs.org/rfcs/rfc822.html, accessed 26 October, 2000.

P. De-Haan. Analysing for authorship: A guide to the Cusum technique. Forensic Linguistics, 5(1):69–76, 1998.

O. de Vel. Mining e-mail authorship. In Workshop on Text Mining, ACM International Conference on Knowledge Discovery and Data Mining, Boston, MA, USA, 2000.

H. Drucker, D. Wu, and V. N. Vapnik. Support Vector Machines for spam categoriza- tion. IEEE Transactions on Neural Networks, 10(5):1048–1054, 1999.

B. Efron and R. Thisted. Estimating the number of unseen species: How many words did Shakespeare know? Biometrika, 63(3):435–447, 1976.

W. E. Y. Elliott and R. J. Valenza. A touchstone for the Bard. Computers and the Humanities, 25(4):199–209, 1991a.

W. E. Y. Elliott and R. J. Valenza. Was the Earl of Oxford the true Shakespeare? A computer aided analysis. Notes and Queries, 236:501–506, 1991b.

W. E. Y. Elliott and R. J. Valenza. And then there were none: Winnowing the Shakespeare claimants. Computers and the Humanities, 30:191–245, 1996.

W. E. Y. Elliott and R. J. Valenza. The Professor doth protest too much, methinks: Problems with the Foster “response”. Computers and the Humanities, 32(6):425–488, 1998.

W. E. Y. Elliott and R. J. Valenza. So many hardballs, so few over the plate. Computers and the Humanities, 36(4):455–460, 2002.

J. M. Farringdon, A. Q. Morton, and M. G. Farringdon. Analysing for Authorship: A Guide to the Cusum Technique. University of Wales Press, Cardiff, 1996.

R. Forsyth. Towards a text benchmark suite. In The Joint International Conference for Computing in the Humanities and the Association for Literary and Linguistic Computing, Ontario, Canada, 1997.

R. Forsyth. Stylochronometry with substrings, or: A poet young and old. Literary and Linguistic Computing, 14(4):467–478, 1999.

D. Foster. A Funeral Elegy: W[illiam] S[hakespeare]’s “best-speaking witnesses”. Publications of the Modern Language Association of America, 111(5):1080, 1996a.

D. Foster. Primary culprit: An analysis of a novel of politics - who is anonymous? New York, 26 February, 1996b.

D. Foster. Response to Elliott and Valenza, “And Then There Were None”. Computers and the Humanities, 30:247–255, 1996c.

D. Foster. The Claremont Shakespeare authorship clinic: How severe are the problems? Computers and the Humanities, 32(6):491–510, 1999.

D. Foster. Author Unknown: On the Trail of Anonymous. Henry Holt and Company, New York, NY, 2000.

J. Gains. Electronic mail - a new style of communication or just a new medium?: An investigation into the text features of e-mail. English for Specific Purposes, 18(1): 81–101, 1999.

S. Geisser. The predictive sample reuse method with applications. Journal of the American Statistical Association, 70(350):320–328, 1975.

J. Goodman. Extended comment on language trees and zipping, 2002. WWW Document, URL http://arXiv.org/abs/cond-mat/0202383, accessed March 12, 2003.

D. Goutsos. Review article: Forensic stylistics. Forensic Linguistics, 2(1):99–113, 1995.

T. Grant and K. Baker. Identifying reliable, valid markers of authorship: A response to Chaski. Forensic Linguistics, 8(1):66–79, 2001.

R. A. Hardcastle. Forensic linguistics: An assessment of the Cusum method for the determination of authorship. Journal of the Forensic Science Society, 33(2):95–106, 1993.

R. A. Hardcastle. Cusum: A credible method for the determination of authorship? Science and Justice, 37(2):129–138, 1997.

S. C. Herring. Gender and democracy in computer-mediated communication. Electronic Journal of Communication, 3(2), 1993.

J. Higgins. Function words in English, n.d. WWW Document, URL http://www.marlodge.supanet.com/museum/funcword.html, accessed January 15, 2001.

M. Hills. You Are What You Type: Language and Gender Deception on the Internet. Bachelor of Arts with Honours Thesis, University of Otago, 2000.

D. I. Holmes. The analysis of literary style: A review. The Journal of the Royal Statistical Society (Series A), 148(4):328–341, 1985.

D. I. Holmes. The evolution of stylometry in humanities scholarship. Literary and Linguistic Computing, 13(3):111–117, 1998.

D. I. Holmes and R. Forsyth. The Federalist revisited: New directions in authorship attribution. Literary and Linguistic Computing, 10(2):111–127, 1995.

D. I. Holmes, M. Robertson, and R. Paez. Stephen Crane and the New-York Tribune: A case study in traditional and non-traditional authorship attribution. Computers and the Humanities, 35(3):315–331, 2001.

J. Hoorn, S. Frank, W. Kowalczyk, and F. van der Ham. Neural network identification of poets using letter sequences. Literary and Linguistic Computing, 14(3):311–338, 1999.

T. Joachims. A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. In International Conference on Machine Learning, 1997.

T. Joachims. Text categorization with Support Vector Machines: Learning with many relevant features. Technical Report LS-8 Report 23, University of Dortmund, 19 April, 1998. WWW Document, URL http://www.cs.cornell.edu/People/tj/publications/joachims_97b.pdf, accessed October 15, 2000.

T. Joachims. Making large-scale SVM learning practical. In B. Scholkopf, C. J. C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning. MIT Press, 1999.

A. Johnson. Textual kidnapping - a case of plagiarism among three student texts? Forensic Linguistics, 4(2):210–225, 1997.

B. Johnstone. Lingual biography and linguistic variation. Language Sciences, 21:313–321, 1999.

D. V. Khmelev and F. J. Tweedie. Using Markov chains for identification of writers. Literary and Linguistic Computing, 16(4):299–307, 2002.

R. I. Kilgour, A. R. Gray, P. J. Sallis, and S. G. MacDonell. A fuzzy logic approach to computer software source code authorship analysis. In Proceedings of the 1997 International Conference on Neural Information Processing and Intelligent Information Systems, pages 865–868. Springer-Verlag, Singapore, 1997.

B. Kjell. Authorship attribution of text samples using neural networks and Bayesian classifiers. In IEEE International Conference on Systems, Man and Cybernetics, San Antonio, TX, 1994a.

B. Kjell. Authorship determination using letter pair frequencies with neural network classifiers. Literary and Linguistic Computing, 9(2):119–124, 1994b.

B. Kjell, W. A. Woods, and O. Frieder. Information retrieval using letter tuples with neural network and nearest neighbor classifiers. In IEEE International Conference on Systems, Man and Cybernetics, volume 2, pages 1222–1225, Vancouver, BC, 1995.

J. Klein. Primary Colors: Anonymous. Warner Books, 1996.

R. Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model selection. In International Joint Conference on Artificial Intelligence, 1995.

I. Krsul and E. H. Spafford. Authorship analysis: Identifying the author of a program. Computers and Security, 16(3):233–257, 1997.

G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. I. Jordan. Learning the kernel matrix with semi-definite programming. In International Conference on Machine Learning, Sydney, Australia, 2002.

R. B. LePage and A. Tabouret-Keller. Acts of Identity: Creole-based Approaches to Language and Ethnicity. Cambridge University Press, Cambridge, 1985.

A. Lohrey. Linguistics and the law. Polemic, 2(2):74–76, 1991.

D. Lowe and R. Matthews. Shakespeare vs. Fletcher: A stylometric analysis by Radial Basis Functions. Computers and the Humanities, 29:449–461, 1995.

P. Lyman and H. R. Varian. How much information? 2000. WWW Document, URL http://www.sims.berkeley.edu/how-much-info/, accessed December 5, 2002.

B. A. Masters. Cracking down on e-mail harassment, 1998. WWW Document, URL http://www.washingtonpost.com/wp-srv/local/frompost/nov98/email01.htm, accessed March 12, 2003.

R. Matthews and T. Merriam. Neural computation in stylometry I: An application to the works of Shakespeare and Fletcher. Literary and Linguistic Computing, 8: 203–209, 1993.

G. R. McMenamin. Style markers in authorship studies. Forensic Linguistics, 8(2): 93–97, 2001.

T. C. Mendenhall. The characteristic curves of composition. Science, 9:237–249, 1887.

T. Merriam. Marlowe’s hand in Edward III revisited. Literary and Linguistic Computing, 11(1):19–22, 1996.

T. Merriam and R. Matthews. Neural computation in stylometry II: An application to the works of Shakespeare and Marlowe. Literary and Linguistic Computing, 9:1–6, 1994.

G. M. Mohay, B. Collie, A. Anderson, O. de Vel, and R. McKemmish. Computer and Intrusion Forensics. Artech House Incorporated, Norwood, MA, USA, 2003.

F. Mosteller and D. L. Wallace. Inference and Disputed Authorship: The Federalist. Addison-Wesley Publishing Company, Inc., Reading, MA, 1964.

G. Ojemann. Brain organization for language from the perspective of electrical stimulation mapping. Behavioral and Brain Sciences, 6:189–230, 1983.

J. Pitkow, C. Kehoe, K. Morton, L. Zou, W. Read, and J. Rossignac. GVU’s 8th WWW user survey, 1997. WWW Document, URL http://www.gvu.gatech.edu/user_surveys/survey-1997-10/, accessed April 14, 2002.

Project Gutenberg, n.d. WWW Document, URL http://promo.net/pg/, accessed September 29, 2000.

J. R. Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, 1986.

P. Rayson, G. Leech, and M. Hodges. Social differentiation in the use of English vocabulary: Some analysis of the conversational component of the British National Corpus. International Journal of Corpus Linguistics, 2(1):133–152, 1997.

J. Rudman. The state of authorship attribution studies: Some problems and solutions. Computers and the Humanities, 31:351–365, 1998.

M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz. A Bayesian approach to filtering junk e-mail. In AAAI 1998 Workshop on Learning for Text Categorization, Madison, Wisconsin, 1998.

P. J. Sallis and D. Kassabova. Computer-mediated communication: Experiments with e-mail readability. Information Sciences, 123:45–53, 2000.

A. J. Sanford, J. P. Aked, L. M. Moxey, and J. Mullin. A critical examination of assumptions underlying the Cusum technique of forensic linguistics. Forensic Linguistics, pages 151–167, 1994.

V. Savicki, D. Lingenfelter, and M. Kelley. Gender language style and group composition in Internet discussion groups. Journal of Computer Mediated Communication, 2(3), 1996.

S. Singh. A pilot study on gender differences in conversational speech on lexical richness measures. Literary and Linguistic Computing, 16(3):251–264, 2001.

J. C. Sipior and B. T. Ward. The ethical and legal quandary of email privacy. Communications of the ACM, 38(12):48–54, 1995.

J. A. Smith and C. Kelly. Stylistic constancy and change across literary corpora: Using measures of lexical richness to date works. Computers and the Humanities, 36:411–430, 2002.

M. W. A. Smith. Recent experience and new developments of methods for the determination of authorship. ALLC Bulletin, 11:73–82, 1983.

E. H. Spafford and S. A. Weeber. Software forensics: Can we track code to its authors? Computers and Security, 12(6):585–595, 1993.

M. Stone. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society, Ser. B, 36(2):111–147, 1974.

K. Storey. Forensic text analysis. Law Institute Journal, 67(2):1176–1178, 1993.

N. M. Sussman and D. H. Tyson. Sex and power: Gender differences in computer-mediated interactions. Computers in Human Behavior, 16:381–394, 2000.

R. Thisted and B. Efron. Did Shakespeare write a newly-discovered poem? Biometrika, 74(3):445–455, 1987.

R. Thomson and T. Murachver. Predicting gender from electronic discourse. British Journal of Social Psychology, 40(2):193–208, 2001.

V. Tirvengadum. Linguistic fingerprints and literary fraud. Computing in the Humanities Working Papers, 1998. WWW Document, URL http://www.chass.utoronto.ca/epc/chwp/tirven/, accessed August 14, 2000.

R. N. Totty, R. A. Hardcastle, and J. Pearson. Forensic linguistics: The determination of authorship from habits of style. Journal of the Forensic Science Society, 27(1): 13–28, 1987.

Y. Tsuboi. Authorship Identification for Heterogeneous Documents. Master’s thesis, Nara Institute of Science and Technology, 2002.

F. J. Tweedie and R. H. Baayen. How variable may a constant be? Measures of lexical richness in perspective. Computers and the Humanities, 32(5):323–352, 1998.

F. J. Tweedie, S. Singh, and D. I. Holmes. Neural network applications in stylometry: The Federalist papers. Computers and the Humanities, 30(1):1–10, 1996.

V. N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995.

S. Waugh, A. Adams, and F. J. Tweedie. Computational stylistics using Artificial Neural Networks. Literary and Linguistic Computing, 15(2):187–198, 2000.

I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. The Morgan Kaufmann Series in Data Management Systems. Morgan Kaufmann Publishers, San Francisco, California, USA, 2000.

Y. Yang. An evaluation of statistical approaches to text categorization. Journal of Information Retrieval, 1(1):67–88, 1999.

A. Yasinsac and Y. Manzano. Policies to enhance computer and network forensics. In 2001 IEEE Workshop on Information Assurance and Security, pages 289–295, United States Military Academy, West Point, NY, 2001.

G. U. Yule. On sentence-length as a statistical characteristic of style in prose, with applications to two cases of disputed authorship. Biometrika, 30:363–390, 1938.

G. U. Yule. The Statistical Study of Literary Vocabulary. Cambridge University Press, Cambridge, 1944.

G. K. Zipf. Selected Studies of the Principle of Relative Frequency in Language. Harvard University Press, Cambridge, MA, 1932.