Investigating the Use of Forensic Stylistic and Stylometric Techniques in the Analyses of Authorship on a Publicly Accessible Social Networking Site (Facebook)

INVESTIGATING THE USE OF FORENSIC STYLISTIC AND STYLOMETRIC TECHNIQUES IN THE ANALYSES OF AUTHORSHIP ON A PUBLICLY ACCESSIBLE SOCIAL NETWORKING SITE (FACEBOOK) by COLIN SIMON MICHELL submitted in accordance with the requirements for the degree of MASTER OF ARTS in the subject LINGUISTICS at the UNIVERSITY OF SOUTH AFRICA SUPERVISOR: PROF EH HUBBARD JULY 2013 I Declaration I declare that “Investigating the use of forensic stylistic and stylometric techniques in the analysis of authorship on a publicly accessible social networking site (Facebook)” is my own work and that all sources that I have used or quoted have been indicated and acknowledged by means of complete references. Signature: Colin Simon Michell [Student No: 4322-607-8] II Summary This research study examines the forensic application of a selection of stylistic and stylometric techniques in a simulated authorship attribution case involving texts on the social networking site, Facebook. Eight participants each submitted 2,000 words of self- authored text from their personal Facebook messages, and one of them submitted an extra 2,000 words to act as the ‘disputed text’. The texts were analysed in terms of the first 1,000 words received and then at the 2,000-word level to determine what effect text length has on the effectiveness of the chosen style markers (keywords, function words, most frequently occurring words, punctuation, use of digitally mediated communication features and spelling). It was found that despite accurately identifying the author of the disputed text at the 1,000-word level, the results were not entirely conclusive but at the 2,000-word level the results were more promising, with certain style markers being particularly effective. Key words Facebook; authorship attribution; style markers; idiolect; forensic linguistics; forensic stylistics; stylometrics; WordSmith Tools III Acknowledgements I would like to express my thanks to the following: - All the researchers and experts whose work I consulted. - Professor Hilton Hubbard for being such a supportive and tolerant supervisor, who occasionally allowed me to trip over my own feet until I understood what had to be done. - The participants who graciously sent me writings from their Facebook inboxes and allowed me a privileged look into their private worlds. - My wife Nicola and daughter Roshyin and son Kiyan for all their moral support. - My previous employers at International House Johannesburg who gave me time during my work day to visit Pretoria for discussions with Prof. Hubbard, and to my current employers at the Higher Colleges of Technology - Fujairah for allowing me the use of their facilities and photocopiers. Finally, this research is dedicated to my late father John Howard Michell, who always supported my dreams to better myself, and Roshyin’s twin sister Skylar, who was with us for such a short time. IV Table of Contents 1. Chapter 1 – Aims and Rationale 1.1 Introduction 1 1.2 Research problem 2 1.3 Research Aims 4 1.4 Rationale for the study 5 1.4.1 Introduction to social networking sites 6 1.4.2 Criminal activity on Facebook 15 1.4.2.1 Paedophile activity on Facebook 15 1.4.2.2 Identity theft 16 1.4.2.3 Murder, rape and kidnapping cases 18 1.4.2.4 Cyberbullying 19 1.4.3 Linguistic conventions on social networking sites 20 1.4.3.1 What is digitally mediated communication (DMC)? 20 1.4.3.2 Linguistic implications of digitally mediated communication 21 1.4.3.3 Online anonymity and cryptolects 23 1.4.4 Concluding remarks on the rationale 25 1.5 Research method 26 1.6 Conclusion 27 1.7 Structure of the study 27 2. Chapter 2 – Literature Review 2.1 Introduction 30 2.2 A brief history of forensic linguistics 31 2.3 Stylistics and stylometry within the field of forensic linguistics 33 2.4 The idiolect debate 35 2.4.1 Idiolect and authorship attribution 35 2.4.2 Idiolect and the Unabomber 37 2.4.3 Objections to the importance of idiolect in authorship attribution 37 V 2.5 Language variation 39 2.5.1 Reasons for language variation 39 2.5.2 Language variation in authorship attribution 40 2.6 Language style 41 2.6.1 Spoken and written language 41 2.6.2 Language use in DMC 42 2.6.3 Linguistic norms 43 2.7 Style markers 45 2.7.1 What constitutes a suitable style marker 45 2.7.2 Cautions regarding style markers 46 2.7.3 Some style markers related to the study 47 2.7.3.1 Keywords 48 2.7.3.2 Function words 50 2.7.3.3 Punctuation and spelling 52 2.7.3.4 Most frequently occurring words 56 2.8 Conclusion 57 3. Chapter 3 – Research method 3.1 Introduction 59 3.2 Ethical considerations 60 3.3 The participants in the study 60 3.4 Data collection 61 3.5 Research methods 62 3.6 Qualitative analysis 64 3.6.1 Aspects of a qualitative analysis 64 3.6.2 Markedness in qualitative analysis 66 3.6.3 Mistakes and errors 66 3.7 Stylistic assessment of the participants’ writing 67 3.7.1 Categorisation of stylistic features 67 3.7.2 Descriptors for forensic document analysis 69 VI 3.8 Quantitative analysis 71 3.8.1 Importance of quantitative analysis 71 3.8.2 WordSmith Tools 72 3.8.2.1 Keywords 72 3.8.2.2 Concord 74 3.8.2.3 Wordlist 74 3.8.3 Statistical methods and the Chi-square test 75 3.9 Stylometric analysis of the participants’ writing 77 3.9.1 Test 1: Keywords 78 3.9.2 Test 2: Function words 79 3.9.3 Test 3: Most frequently occurring words 80 3.9.4 Test 4: Punctuation 81 3.10 Conclusion 84 4. Chapter 4 – Findings 4.1 Introduction 86 4.2 Qualitative analysis 86 4.2.1 Writer X 87 4.2.2 Writer A/Writer X 88 4.2.3 Writer B/Writer X 90 4.2.4 Writer C/Writer X 92 4.2.5 Writer D/Writer X 94 4.2.6 Writer E/Writer X 96 4.2.7 Writer F/Writer X 98 4.2.8 Writer G/Writer X 100 4.2.9 Writer H/Writer X 102 4.2.10 Summary of qualitative findings 103 4.3 Quantitative analysis 104 4.3.1 Keywords 104 4.3.1.1 Keywords (1,000-word level) 106 4.3.1.2 Keywords (2,000-word level) 110 VII 4.3.1.3 Keywords conclusion 114 4.3.2 Function words 116 4.3.2.1 Function words (1,000-word level) 116 4.3.2. 2 Function words (2,000-word level) 118 4.3.3 Most frequently occurring words 119 4.3.3.1 Most frequently occurring words (1,000-word level) 119 4.3.3.2 Most frequently occurring words (2,000-word level) 121 4.3.4 Punctuation 123 4.3.4.1 Punctuation (1,000-word level) 123 4.3.4.2 Punctuation (2,000-word level) 125 4.4 Summary of results 127 4.4.1 Qualitative summary (1,000-word level) 127 4.4.2 Qualitative summary (2,000-word level) 128 4.4.3 Quantitative summary (1,000-word level) 129 4.4.4 Quantitative summary (2,000-word level) 130 4.5 Conclusion 130 5. Chapter 5 – Conclusion 5.1 Introduction 135 5.2 Overview of the study 135 5.3 Contributions of the study 139 5.3.1 Facebook language and authorship analysis 139 5.3.2 Stylistic analysis 140 5.3.3 Stylometric analysis 141 5.4 Limitations of the study 144 5.4.1 Style markers 144 5.4.2 Data collected and comparisons made 145 5.5 Suggestions for further research 146 5.5.1 Text analysis of male authors 146 5.5.2 Second language speakers 146 5.5.3 Facebook updates and group threads 147 VIII 5.6 Conclusion 148 6. References 149 7. Appendices Appendix 1: Permission letter 162 Appendix 2: Request letter 163 Appendix 3: Participants’ texts 1. Writer X 164 2. Writer A 169 3. Writer B 174 4. Writer C 180 5. Writer D 186 6. Writer E 192 7. Writer F 197 8. Writer G 203 9. Writer H 209 Appendix 4: Keyword analysis (1000-word level) 215 Appendix 5: Keyword analysis (2000-word level) 218 IX List of figures Chapter 1 – Aims and rationale Figure 1.1 A standard discussion post resulting from a status update 8 Figure 1.2 Facebook sign-in page 9 Figure 1.3 Example of a friends list 11 Figure 1.4 An example of a Facebook message system 14 Figure 1.5 Typical message left on a user’s 14 Figure 1.6 Typical Instant Message thread 15 Figure 1.7 Part of a chat dialogue between Mr Rutberg, a Facebook friend 17 and the scammer Chapter 2 – Literature review Figure 2.1 Google search of “I asked her if I could carry her bags” 36 Chapter 3 – Research method Figure 3.1 Screenshot from the KeyWord Tool 73 Figure 3.2 Screenshot from the Concord Tool 74 Figure 3.3 Screenshot from the WordList Tool showing the frequency list 75 Figure 3.4 Example of a keyword test 78 Figure 3.5 Example of a wordlist 79 Figure 3.6 Character profiler utility 82 Figure 3.7 Microsoft word search function 83 Chapter 4 – Findings Figure 4.1 Concordance for ‘cause 101 Figure 4.2 Writer A Keywords 105 Chapter 5 – Conclusion Figure 5.1 Facebook group thread 147 X List of Tables Chapter 2 – Literature review Table 2.1 Examples of linguistic norms 44 Table 2.2 Comparison of keyness between texts by different authors 49 Table 2.3 Comparison of keyness between the core chronicles and 50 anonymous chronicle (same author) Table 2.4 Distribution of the and a/an across a corpus of news articles 51 and e-mail Chapter 3 – Research method Table 3.1 Stylistic analysis of Writer C 68 Table 3.2 Criteria for conclusions on authorship studies (SWGDOC) 70 Table 3.3 Chi-squaretest grid for function words between the disputed 77 text and Writer A at the 1,000-word level Table 3.4 Extract of the table used to analyse punctuation 80 Table 3.5 Extract of the table used to analyse most frequently occurring words 80 Table 3.6 Extract of the table used to analyse punctuation 84 Chapter 4 – Findings Table 4.1 Writer X: highlighted features 87 Table 4.2 Writer A/Writer X 88 Table 4.3 Writer B/Writer X 90 Table 4.4 Writer C/Writer X 92 Table 4.5 Writer D/Writer X 94 Table 4.6 Writer E/Writer X 96 Table 4.7 Writer F/Writer X 98 Table 4.8 Writer G/Writer X 100 Table 4.9 Writer H/Writer

Investigating the Use of Forensic Stylistic and Stylometric Techniques in the Analyses of Authorship on a Publicly Accessible Social Networking Site (Facebook)

Systemic Functional Features in Stylistic Text Classification

The Empirical Base of Linguistics: Grammaticality Judgments and Linguistic Methodology

O Crystal, D., Internet Linguistics: a Student Guide

Internet Linguistics (Ling 004B) Syllabus

For Reel & Real

Ehrensperger Report

Towards a Philosophy of Language Management David Crystal

Preprint Corpus Analysis in Forensic Linguistics

Actus Reus and Mens Rea of Murder  Understand Coke’S Definition of Murder  Explain How the Definition of Murder Has Changed and Evolved

Quantitative Authorship Attribution: a History and an Evaluation of Techniques

Stylometric Analysis of Parliamentary Speeches: Gender Dimension

Out of Style: Reanimating Stylistic Study in Composition and Rhetoric