Forest for the Trees for Adversarial Code Stylometry
Total Page:16
File Type:pdf, Size:1020Kb
StyleCounsel: Seeing the (Random) Forest for the Trees in Adversarial Code Stylometry by Christopher McKnight A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree of Master of Mathematics in Computer Science Waterloo, Ontario, Canada, 2018 c Christopher McKnight 2018 I hereby declare that I am the sole author of this thesis. This is a true copy of the thesis, including any required final revisions, as accepted by my examiners. I understand that my thesis may be made electronically available to the public. ii Abstract Authorship attribution has piqued the interest of scholars for centuries, but had historically remained a matter of subjective opinion, based upon examination of handwriting and the physical document. Midway through the 20th Century, a technique known as stylometry was developed, in which the content of a document is analyzed to extract the author’s grammar use, preferred vocabulary, and other elements of compositional style. In parallel to this, programmers, and particularly those involved in education, were writing and testing systems designed to automate the analysis of good coding style and best practice, in order to assist with grading assignments. In the aftermath of the Morris Worm incident in 1988, researchers began to consider whether this automated analysis of program style could be combined with stylometry techniques and applied to source code, to identify the author of a program. The results of recent experiments have suggested this code stylometry can successfully iden- tify the author of short programs from among hundreds of candidates with up to 98% precision. This potential ability to discern the programmer of a sample of code from a large group of possible authors could have concerning consequences for the open-source community at large, particularly those contributors that may wish to remain anonymous. Recent international events have suggested the developers of certain anti-censorship and anti-surveillance tools are being targeted by their governments and forced to delete their repositories or face prosecution. In light of this threat to the freedom and privacy of individual programmers around the world, and due to a dearth of published research into practical code stylometry at scale and its feasibility, we carried out a number of investigations looking into the difficulties of applying this technique in the real world, and how one might effect a robust defence against it. To this end, we devised a system to aid programmers in obfuscating their inherent style and imitating another, overt, author’s style in order to protect their anonymity from this forensic technique. Our system utilizes the implicit rules encoded in the decision points of a random forest ensemble in order to derive a set of recommendations to present to the user detailing how to achieve this obfuscation and mimicry attack. In order to best test this system, and simultaneously assess the difficulties of performing practical stylometry at scale, we also gathered a large corpus of real open-source software and devised our own feature set including both novel attributes and those inspired or borrowed from other sources. Our results indicate that attempting a mass analysis of publicly available source code is fraught with difficulties in ensuring the integrity of the data. Furthermore, we found ours and most other published feature sets do not sufficiently capture an author’s style independently of the content to be very effective at scale, although its accuracy is significantly greater than a random guess. Evaluations of our tool indicate it can successfully extract a set of changes that would iii result in a misclassification as another user if implemented. More importantly, this extraction was independent of the specifics of the feature set, and therefore would still work even with a more accurate model of style. We ran a limited user study to assess the usability of the tool, and found overall it was beneficial to our participants, and could be even more beneficial if the valuable feedback we received were implemented in future work. iv Acknowledgements This work benefitted from the use of the CrySP RIPPLE Facility at the University of Water- loo. I would like to thank my supervisor Ian Goldberg for his guidance, encouragement, and res- olute attention to detail. Our weekly meetings were a great help in nudging me toward the finish line, while the many opportunities for personal development offered throughout the duration of my programme were priceless. Truly my eyes have been opened to a world beyond that with which I was familiar, and I shall never look back. To the members of my committee, Mike Godfrey and Yaoliang Yu, I am very grateful for your time, expertise and valuable feedback. Being able to use the private study rooms at Waterloo Public Library while writing this thesis was priceless. Finally, I would like to thank Radio X and all its DJs for keeping my sanity during long hours of coding and writing, particularly Johnny Vaughan and his “4til7 Thang” (even Little Si). v Dedication For my loving and supportive wife Katya, whose tireless drive and work ethic was a constant inspiration, and our son James, whose companionship on many contemplative early morning walks helped get me though even the darkest hours of this arduous journey. vi Table of Contents List of Tablesx List of Figures xi 1 Introduction1 2 Motivation4 3 Background8 3.1 The Federalist Papers................................8 3.2 The Morris/Internet Worm............................. 10 4 Literature Review 12 4.1 Authorship Attribution of Natural Language.................... 12 4.1.1 Early Work................................. 12 4.1.2 Computer-Assisted Studies......................... 14 4.1.3 Internet-Scale Authorship Attribution................... 20 4.2 Plagiarism Detection................................ 27 4.2.1 Intrinsic Plagiarism Detection....................... 28 4.2.2 Authorship Verification........................... 29 4.3 Authorship Attribution of Software......................... 29 4.3.1 Source Code Attribution.......................... 30 vii 4.3.2 Executable Code Attribution........................ 39 4.4 Defences Against Authorship Attribution..................... 42 5 Implementation and Methodology 51 5.1 Contributions.................................... 51 5.2 Background..................................... 53 5.2.1 Choice of Programming Language..................... 53 5.2.2 Eclipse IDE................................. 54 5.2.3 Weka Machine Learning.......................... 55 5.2.4 Random Forest............................... 55 5.2.5 GitHub................................... 58 5.3 Obtaining Data................................... 58 5.3.1 The GitHub Data API........................... 59 5.3.2 Rate Limiting................................ 60 5.3.3 Ensuring Sole and True Authorship.................... 63 5.3.4 Removing Duplicate and Common Files.................. 64 5.3.5 Summary of Data Collection........................ 65 5.4 Feature Extraction.................................. 66 5.4.1 Node Frequencies.............................. 68 5.4.2 Node Attributes............................... 71 5.4.3 Identifiers.................................. 71 5.4.4 Comments................................. 73 5.4.5 Other.................................... 74 5.5 Training and Making Predictions.......................... 75 5.6 Making Recommendations............................. 75 5.6.1 Requirements................................ 76 5.6.2 Parsing the Random Forest......................... 79 5.6.3 Analyzing the Split Points......................... 81 viii 5.6.4 Presenting to the User........................... 92 5.7 Using the Plugin.................................. 93 5.8 Pilot User Study................................... 98 5.8.1 Study Details................................ 98 6 Results 100 6.1 Conducting Source Code Authorship Attribution at the Internet Scale...... 100 6.2 Extracting a Class of Feature Vectors That Can Systematically Effect a Classifi- cation as Any Given Target............................. 107 6.3 Pilot User Study................................... 113 6.3.1 Results................................... 113 6.3.2 Experiences with Manual Task....................... 114 6.3.3 Experiences with Assisted Task...................... 115 6.3.4 Summary of User Study.......................... 117 7 Conclusions 119 7.1 Future Work..................................... 122 7.2 Final Remarks.................................... 124 References 127 APPENDICES 141 A User Study Questionnaire 142 ix List of Tables 5.1 Repositories per author............................... 66 5.2 Node class hierarchy counts............................ 70 5.3 Node attributes................................... 71 6.1 Investigation into repositories that completely failed................ 106 x List of Figures 5.1 Node class hierarchy................................ 69 5.2 Plugin menu..................................... 94 5.3 Resource selection dialog for training....................... 95 5.4 Resource selection dialog for evaluation...................... 96 5.5 Individual file output................................ 97 5.6 Aggregate output.................................. 97 5.7 Overall recommendations view........................... 97 5.8 Sample recommendations—statements....................... 98 5.9 Sample recommendations—unary expressions................... 98 6.1 Identifying the author of each file.........................