Stylistics Versus Statistics: a Corpus Linguistic Approach to Combining Techniques in Forensic Authorship Analysis Using Enron E
Total Page:16
File Type:pdf, Size:1020Kb
Stylistics versus Statistics: A corpus linguistic approach to combining techniques in forensic authorship analysis using Enron emails David Wright Submitted in accordance with the requirements for the degree of Doctor of Philosophy The University of Leeds School of English September 2014 The candidate confirms that the work submitted is his/her own, except where work which has formed part of jointly authored publications has been included. The contribution of the candidate and the other authors to this work has been explicitly indicated below. The candidate confirms that appropriate credit has been given within the thesis where reference has been made to the work of others. The work in Chapter 6 of the thesis is an expansion of the experiment performed and reported in: Johnson, Alison and David Wright. 2014. Identifying idiolect in forensic authorship attribution: an n-gram textbite approach. Language and Law (Linguagem e Direito) 1(1), 37–69. In the above article, I was responsible for the experiment and the following sections are directly attributable to me: ‘Computational tools’ (p. 44) ‘Jaccard’s similarity coefficient’ (p. 44–5) ‘Experimental set up’ (p. 45–6) ‘Attribution task’ (p. 56–7) ‘Results’ (p. 57–60) In the above article, the following sections are directly attributable to Dr Alison Johnson: ‘James Derrick’ (p. 43–4) ‘Case study of James Derrick’ (p. 46) ‘Identifying Derrick’s professional authorial style’ (p. 46–54) ‘Conclusion’ (p. 62–63) The remaining sections of the above article are co-authored: ‘Introduction’ (p. 38–40) ‘N-grams, idiolect and email’ (p. 40–42) ‘The Enron Email Corpus’ (p. 42–43) ‘Summary of findings’ (p. 54–56) ‘Derrick’s discriminating n-grams’ (p. 60–62) This copy has been supplied on the understanding that it is copyright material and that no quotation from the thesis may be published without proper acknowledgement. © 2014 The University of Leeds and David Wright The right of David Wright to be identified as Author of this work has been asserted by him in accordance with the Copyright, Designs and Patents Act 1988. Acknowledgements First and foremost, my deepest thanks go to my wonderful parents Janet and Joseph William Wright, for their unwavering support throughout my PhD and beyond. Their selflessness is such that when the idea of me doing a PhD first arose, they offered to sell the family home to finance it. Thankfully, that was not necessary (I would never have allowed them to sell it, anyway). They have had to maintain me during the times when I worked from home in the North East, though. I hope that they know that I love them very much. I would also like to thank the rest of my close family and friends, who have been consistently supportive during my seven years at University. Special mention must go to someone I have known and loved for the last three of those years: my girlfriend Rachael Groenendaal. She has known me only as a PhD student, so I am looking forward to sharing with her my less anxious, nervous and restless side. The patience and understanding she has is immeasurable, and every day I am reminded that I am lucky to have such an extraordinary person in my life. This PhD would not have been possible without the exceptional supervision of Dr Alison Johnson. Alison has the unique ability to calm me when I am worried, and inspire me when I am doubtful. It is that ability, combined with her remarkable trust in me, which has guided me through this work. I hope that one day, I can mentor someone in the way Alison has mentored me. Special thanks also go to David Woolls for enduring countless ridiculous questions and requests from me. David has responded every time without complaint, and his software and insightful comment have been invaluable to this research. It was him, along with Alison, who first introduced me to the Enron corpus. Finally, I would like to extend my thanks to past and present lecturing staff in the School of English at the University of Leeds, in particular: Clive Upton, Anthea Fraser Gupta and Fiona Douglas. Without their inspirational teaching and critical feedback during my time as an undergraduate and postgraduate, I would not be pursuing an academic career. This study is funded by an Arts and Humanities Research Council Doctoral Studentship (Grant number AH/J502686/1). Abstract This thesis empirically investigates how a corpus linguistic approach can address the main theoretical and methodological challenges facing the field of forensic authorship analysis. Linguists approach the problem of questioned authorship from the theoretical position that each person has their own distinctive idiolect (Coulthard 2004: 431). However, the notion of idiolect has come under scrutiny in forensic linguistics over recent years for being too abstract to be of practical use (Grant 2010; Turell 2010). At the same time, two competing methodologies have developed in authorship analysis. On the one hand, there are qualitative stylistic approaches, and on the other there are statistical ‘stylometric’ techniques. This study uses a corpus of over 60,000 emails and 2.5 million words written by 176 employees of the former American company Enron to tackle these issues in the contexts of both authorship attribution (identifying authors using linguistic evidence) and author profiling (predicting authors’ social characteristics using linguistic evidence). Analyses reveal that even in shared communicative contexts, and when using very common lexical items, individual Enron employees produce distinctive collocation patterns and lexical co-selections. In turn, these idiolectal elements of linguistic output can be captured and quantified by word n-grams (strings of n words). An attribution experiment is performed using word n-grams to identify the authors of anonymised email samples. Results of the experiment are encouraging, and it is argued that the approach developed here offers a means by which stylistic and statistical techniques can complement each other. Finally, quantitative and qualitative analyses are combined in the sociolinguistic profiling of Enron employees by gender and occupation. Current author profiling research is exclusively statistical in nature. However, the findings here demonstrate that when statistical results are augmented by qualitative evidence, the complex relationship between language use and author identity can be more accurately observed. Table of Contents List of Tables .......................................................................................................................... i List of Figures ....................................................................................................................... ii 1 Introduction .................................................................................................................. 1 1.1 Thesis outline .......................................................................................................... 4 2 Theoretical and methodological issues in authorship attribution and profiling .... 7 2.1 The usefulness and utility of a theory of idiolect .................................................... 7 2.2 Approaches to identifying authorship ................................................................... 11 2.2.1 Lexis in stylistic approaches to authorship attribution ................................. 12 2.2.2 Lexis in stylometric approaches to authorship attribution ............................ 14 2.2.3 Benefits and drawbacks of stylistic and stylometric approaches .................. 19 2.2.4 Combining the stylistic and the stylometric.................................................. 23 2.3 Collocation, idiolect and word n-grams ................................................................ 26 2.4 The emergence of author profiling ....................................................................... 34 2.4.1 Linguistic profiling and criminal profiling ................................................... 37 2.5 Chapter conclusion: addressing the issues ............................................................ 41 3 The Enron Email Corpus .......................................................................................... 43 3.1 Introducing Enron and the corpus ......................................................................... 43 3.2 Extracting and preparing the data ......................................................................... 47 3.3 Ethical considerations ........................................................................................... 51 3.4 Breakdown of the corpus ...................................................................................... 52 3.4.1 The full Enron Email Corpus (EEC) ............................................................. 53 3.4.2 The Pilot Corpus ........................................................................................... 58 3.4.3 The twelve-author sample (EEC12) ............................................................. 60 3.4.4 The eighty-author sample (EEC80) .............................................................. 63 3.5 Chapter conclusion ............................................................................................... 66 4 Methodology: tools and techniques .......................................................................... 67 4.1 Wordsmith Tools ................................................................................................... 67 4.1.1 ‘Keyword’ analysis ......................................................................................