Feasibility of Deception in Code Attribution
Total Page:16
File Type:pdf, Size:1020Kb
Feasibility of Deception in Code Attribution by Alina Matyukhina Master of Mathematics, DonNU, 2017 Bachelor of Mathematics with Honors, DonNU, 2015 A DISSERTATION SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy In the Graduate Academic Unit of Faculty of Computer Science Supervisor(s): Natalia Stakhanova, Ph.D., Computer Science Examining Board: Rongxing Lu, Ph.D., Computer Science, Suprio Ray, Ph.D., Computer Science, Martin Wielemaker, Ph.D., Business Administration External Examiner: Mohammad Zulkernine, Ph.D., School of Computing, Queen's University This dissertation is accepted Dean of Graduate Studies THE UNIVERSITY OF NEW BRUNSWICK August, 2019 c Alina Matyukhina, 2019 Abstract Code authorship attribution is the process used to identify the probable author of given code, based on unique characteristics that reflect an author's program- ming style. Inspired by social studies in the attribution of literary works, in the past two decades researchers have examined the effectiveness of code attribution in the computer software domain, including computer security. Authorship attribution techniques have found a broad application in code plagiarism detection, biometric re- search, forensics, and malware analysis. Studies show that analysis of software might effectively unveil the digital identity of a programmer, reflected through variables and structures, programming language, employed development tools, their settings and, more importantly, how and what these tools are being used to do. Authorship attribution has been a prosperous area of research when an assumption can be made that the author of an unknown program has been honest in their writing style and does not try to modify it. In this thesis, we investigate the feasibility of deception of source code attribution techniques. We begin by exploring how data characteristics and feature selection influence both the accuracy and performance of attribution methods. Within this context, it is necessary to understand whether the results obtained by previous studies depend on the data source, quality, and context or the type of features used. It gives us the opportunity to dive deeper into the process of code authorship attribution to be able to understand its potential weaknesses. ii To evaluate current code attribution systems, we present an adversarial model de- fined by the adversary's goals, knowledge, and capabilities; for each group, we cate- gorize them by the possible variations. Modeling the role of attackers figures promi- nently in enhancing the cybersecurity defense. We believe that having a solid under- standing of the possible attacks can help in the research and deployment of reliable code authorship attribution systems. We present an author imitation attack that deceives current authorship attribution systems by imitating a coding style of a targeted developer. We investigate the attack's feasibility on open-source software repositories. To subvert an author imitation attack and to help in protecting the developer's pri- vacy, we introduce an author obfuscation method and novel coding style transforma- tions. The idea of author obfuscation is to allow authors to preserve the readability of their source code while removing identifying stylistic features that can be leveraged for code attribution. Code obfuscation, common in software development, typically aims to disguise the appearance of the code making it difficult to understand and reverse engineer. In contrast, the proposed author obfuscation hides the original author's style by leaving the source code visible, readable and understandable. In summary, this thesis presents original research work that not only advances the knowledge in code authorship attribution field but also contributes to the overall safety of our digital world by providing author obfuscation methods to protect the privacy of the developers. iii Dedication To My Mother My greatest teacher, who teaches me compassion, love, and fearlessness My Father My mentor, example, and guide; who always encourages me and cheers for me My Sister My best friend, my soulmate; who knows my thoughts without a word My Love My greatest gift, who has a quiet strength that makes my heart sing iv Acknowledgements I would like to thank my parents, my sister, my boyfriend and his family for sup- porting me during my PhD and every moment in my life. I would like to thank my supervisor, Canada Research Chair in Security and Privacy, Dr. Natalia Stakhanova for providing her guidance and suggestions. I am also grateful to Dr. Mila Dalla Preda for her help. I am very thankful to my examining committee members for their precious time, patience, and support. I also thank all my colleagues from the Canadian Institute for Cybersecurity for their coordination and cooperation. I would like to thank all the academic and administrative staff of the University of New Brunswick for their kindness and help. v Table of Contents Abstract ii Dedication iv Acknowledgments v Table of Contents viii List of Tables ix List of Figures xi 1 Introduction 1 1.1 Problem Statement and Motivation . 2 1.2 Contributions . 4 1.3 Thesis Organization . 5 2 Literature Review 7 2.1 Code authorship attribution . 7 2.1.1 Reviewed studies . 9 2.1.2 Datasets . 16 2.1.3 Features . 20 2.2 Summary . 23 3 Adversarial code attribution 25 vi 3.1 Author imitation attack . 26 3.2 Author obfuscation . 29 3.3 Coding style transformations . 30 3.3.1 Layout transformations . 30 3.3.2 Lexical transformations . 35 3.3.3 Syntactic transformations . 36 3.3.4 Control-flow transformations . 37 3.3.5 Data-flow transformations . 41 3.4 Summary . 44 4 Assessing effects of dataset characteristics on attribution perfor- mance 45 4.1 Replication study . 47 4.1.1 Experimental setup . 49 4.1.2 Evaluation . 53 4.1.3 Dataset analysis . 53 4.1.4 Correlation analysis of features from previous studies . 59 4.2 Impact of dataset characteristics on the performance . 61 4.2.1 How long should code samples be? . 62 4.2.2 How many samples per author are needed? . 64 4.2.3 How many authors are needed? . 67 4.2.4 What features are the best? . 70 4.3 Summary . 78 5 Evaluation 81 5.1 Evaluation of authorship attribution methods . 81 5.2 Adversarial model for code attribution . 85 5.3 Author imitation evaluation . 87 vii 5.4 Evaluation of coding style transformations based on the metrics . 90 5.4.1 Complexity . 91 5.4.2 Readability . 96 5.4.3 Resilience . 98 5.4.4 Cost . 99 5.4.5 Detection . 100 5.4.6 Results . 100 5.5 Author obfuscation techniques evaluation . 110 5.6 Threats to validity . 118 5.7 Summary . 122 6 Conclusion and Future work 123 Bibliography 138 A Our improved features 139 A.1 List of our improved features for datasets with 1 to 50 LOC . 139 A.2 List of our improved features for datasets with 51 to 100 LOC . 140 A.3 List of our improved features for datasets with 101 to 200 LOC . 143 A.4 List of our improved features for datasets with 201 to 300 LOC . 147 A.5 List of our improved features for datasets with 301 to 400 LOC . 149 A.6 List of our improved features for datasets with 401 plus LOC . 151 Vita viii List of Tables 2.1 Previous studies in source code authorship attribution. 10 4.1 The results of replication study . 54 4.2 Accuracy of attribution after removing common code . 57 4.3 Correlation coefficient analysis for Caliskan features . 60 4.4 Accuracy with uncorrelated features (401 plus 75 authors dataset) . 61 4.5 The datasets grouped by LOC . 63 5.1 The details of our datasets and feature sets employed by previous studies. 83 5.2 Percentage of successfully imitated authors . 90 5.3 Applied transformations and their evaluation . 109 5.4 Effect of information gain feature selection using Caliskan et al. [44] approach . 113 5.5 Results of author obfuscation methods on all datasets . 117 ix List of Figures 2.1 Timeline of attribution studies and the corresponding sources of the experimental datasets. 18 2.2 Dependence between number of authors and accuracy in the existing studies. The insert gives statistics for 3 outliers. 19 2.3 Feature selection levels in code authorship attribution . 21 3.1 An overview of author imitation attack . 27 3.2 Curly brackets styles . 32 3.3 Control-flow flattening transformations (with "switch" and normal dispatch) . 38 3.4 Opaque predicate transformations . 40 4.1 Presence of common code in datasets. 56 4.2 The changes in accuracy with respect to the amount of common code 59 4.3 Example of source code and its AST . 61 4.4 Accuracy vs LOC . 63 4.5 The impact of varying number of samples per author . 66 4.6 The impact of varying number of authors on accuracy . 66 4.7 Attribution accuracy with varying number of samples and number of authors on the example of Ding feature set . 70 4.8 Attribution accuracy of Caliskan features, TF unigram, and AST bi- grams on different LOC datasets . 71 x 4.9 Attribution accuracy with byte and token n-gram features on 1 to 50 LOC dataset after information gain feature selection . 73 4.10 Number of n-gram features before and after information gain feature selection on 1 to 50 LOC dataset . 74 4.11 Performance of byte and token n-gram features on datasets with dif- ferent code size after information gain feature selection . 75 4.12 Performance of byte n-gram features on a datasets with different num- ber of authors . 76 4.13 Performance of token n-gram features with different number of authors 77 4.14 Attribution accuracy with the improved feature set on 75 authors dataset with 10 samples per author .