Feasibility of Deception in Code Attribution

Total Page:16

File Type:pdf, Size:1020Kb

Feasibility of Deception in Code Attribution Feasibility of Deception in Code Attribution by Alina Matyukhina Master of Mathematics, DonNU, 2017 Bachelor of Mathematics with Honors, DonNU, 2015 A DISSERTATION SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy In the Graduate Academic Unit of Faculty of Computer Science Supervisor(s): Natalia Stakhanova, Ph.D., Computer Science Examining Board: Rongxing Lu, Ph.D., Computer Science, Suprio Ray, Ph.D., Computer Science, Martin Wielemaker, Ph.D., Business Administration External Examiner: Mohammad Zulkernine, Ph.D., School of Computing, Queen's University This dissertation is accepted Dean of Graduate Studies THE UNIVERSITY OF NEW BRUNSWICK August, 2019 c Alina Matyukhina, 2019 Abstract Code authorship attribution is the process used to identify the probable author of given code, based on unique characteristics that reflect an author's program- ming style. Inspired by social studies in the attribution of literary works, in the past two decades researchers have examined the effectiveness of code attribution in the computer software domain, including computer security. Authorship attribution techniques have found a broad application in code plagiarism detection, biometric re- search, forensics, and malware analysis. Studies show that analysis of software might effectively unveil the digital identity of a programmer, reflected through variables and structures, programming language, employed development tools, their settings and, more importantly, how and what these tools are being used to do. Authorship attribution has been a prosperous area of research when an assumption can be made that the author of an unknown program has been honest in their writing style and does not try to modify it. In this thesis, we investigate the feasibility of deception of source code attribution techniques. We begin by exploring how data characteristics and feature selection influence both the accuracy and performance of attribution methods. Within this context, it is necessary to understand whether the results obtained by previous studies depend on the data source, quality, and context or the type of features used. It gives us the opportunity to dive deeper into the process of code authorship attribution to be able to understand its potential weaknesses. ii To evaluate current code attribution systems, we present an adversarial model de- fined by the adversary's goals, knowledge, and capabilities; for each group, we cate- gorize them by the possible variations. Modeling the role of attackers figures promi- nently in enhancing the cybersecurity defense. We believe that having a solid under- standing of the possible attacks can help in the research and deployment of reliable code authorship attribution systems. We present an author imitation attack that deceives current authorship attribution systems by imitating a coding style of a targeted developer. We investigate the attack's feasibility on open-source software repositories. To subvert an author imitation attack and to help in protecting the developer's pri- vacy, we introduce an author obfuscation method and novel coding style transforma- tions. The idea of author obfuscation is to allow authors to preserve the readability of their source code while removing identifying stylistic features that can be leveraged for code attribution. Code obfuscation, common in software development, typically aims to disguise the appearance of the code making it difficult to understand and reverse engineer. In contrast, the proposed author obfuscation hides the original author's style by leaving the source code visible, readable and understandable. In summary, this thesis presents original research work that not only advances the knowledge in code authorship attribution field but also contributes to the overall safety of our digital world by providing author obfuscation methods to protect the privacy of the developers. iii Dedication To My Mother My greatest teacher, who teaches me compassion, love, and fearlessness My Father My mentor, example, and guide; who always encourages me and cheers for me My Sister My best friend, my soulmate; who knows my thoughts without a word My Love My greatest gift, who has a quiet strength that makes my heart sing iv Acknowledgements I would like to thank my parents, my sister, my boyfriend and his family for sup- porting me during my PhD and every moment in my life. I would like to thank my supervisor, Canada Research Chair in Security and Privacy, Dr. Natalia Stakhanova for providing her guidance and suggestions. I am also grateful to Dr. Mila Dalla Preda for her help. I am very thankful to my examining committee members for their precious time, patience, and support. I also thank all my colleagues from the Canadian Institute for Cybersecurity for their coordination and cooperation. I would like to thank all the academic and administrative staff of the University of New Brunswick for their kindness and help. v Table of Contents Abstract ii Dedication iv Acknowledgments v Table of Contents viii List of Tables ix List of Figures xi 1 Introduction 1 1.1 Problem Statement and Motivation . 2 1.2 Contributions . 4 1.3 Thesis Organization . 5 2 Literature Review 7 2.1 Code authorship attribution . 7 2.1.1 Reviewed studies . 9 2.1.2 Datasets . 16 2.1.3 Features . 20 2.2 Summary . 23 3 Adversarial code attribution 25 vi 3.1 Author imitation attack . 26 3.2 Author obfuscation . 29 3.3 Coding style transformations . 30 3.3.1 Layout transformations . 30 3.3.2 Lexical transformations . 35 3.3.3 Syntactic transformations . 36 3.3.4 Control-flow transformations . 37 3.3.5 Data-flow transformations . 41 3.4 Summary . 44 4 Assessing effects of dataset characteristics on attribution perfor- mance 45 4.1 Replication study . 47 4.1.1 Experimental setup . 49 4.1.2 Evaluation . 53 4.1.3 Dataset analysis . 53 4.1.4 Correlation analysis of features from previous studies . 59 4.2 Impact of dataset characteristics on the performance . 61 4.2.1 How long should code samples be? . 62 4.2.2 How many samples per author are needed? . 64 4.2.3 How many authors are needed? . 67 4.2.4 What features are the best? . 70 4.3 Summary . 78 5 Evaluation 81 5.1 Evaluation of authorship attribution methods . 81 5.2 Adversarial model for code attribution . 85 5.3 Author imitation evaluation . 87 vii 5.4 Evaluation of coding style transformations based on the metrics . 90 5.4.1 Complexity . 91 5.4.2 Readability . 96 5.4.3 Resilience . 98 5.4.4 Cost . 99 5.4.5 Detection . 100 5.4.6 Results . 100 5.5 Author obfuscation techniques evaluation . 110 5.6 Threats to validity . 118 5.7 Summary . 122 6 Conclusion and Future work 123 Bibliography 138 A Our improved features 139 A.1 List of our improved features for datasets with 1 to 50 LOC . 139 A.2 List of our improved features for datasets with 51 to 100 LOC . 140 A.3 List of our improved features for datasets with 101 to 200 LOC . 143 A.4 List of our improved features for datasets with 201 to 300 LOC . 147 A.5 List of our improved features for datasets with 301 to 400 LOC . 149 A.6 List of our improved features for datasets with 401 plus LOC . 151 Vita viii List of Tables 2.1 Previous studies in source code authorship attribution. 10 4.1 The results of replication study . 54 4.2 Accuracy of attribution after removing common code . 57 4.3 Correlation coefficient analysis for Caliskan features . 60 4.4 Accuracy with uncorrelated features (401 plus 75 authors dataset) . 61 4.5 The datasets grouped by LOC . 63 5.1 The details of our datasets and feature sets employed by previous studies. 83 5.2 Percentage of successfully imitated authors . 90 5.3 Applied transformations and their evaluation . 109 5.4 Effect of information gain feature selection using Caliskan et al. [44] approach . 113 5.5 Results of author obfuscation methods on all datasets . 117 ix List of Figures 2.1 Timeline of attribution studies and the corresponding sources of the experimental datasets. 18 2.2 Dependence between number of authors and accuracy in the existing studies. The insert gives statistics for 3 outliers. 19 2.3 Feature selection levels in code authorship attribution . 21 3.1 An overview of author imitation attack . 27 3.2 Curly brackets styles . 32 3.3 Control-flow flattening transformations (with "switch" and normal dispatch) . 38 3.4 Opaque predicate transformations . 40 4.1 Presence of common code in datasets. 56 4.2 The changes in accuracy with respect to the amount of common code 59 4.3 Example of source code and its AST . 61 4.4 Accuracy vs LOC . 63 4.5 The impact of varying number of samples per author . 66 4.6 The impact of varying number of authors on accuracy . 66 4.7 Attribution accuracy with varying number of samples and number of authors on the example of Ding feature set . 70 4.8 Attribution accuracy of Caliskan features, TF unigram, and AST bi- grams on different LOC datasets . 71 x 4.9 Attribution accuracy with byte and token n-gram features on 1 to 50 LOC dataset after information gain feature selection . 73 4.10 Number of n-gram features before and after information gain feature selection on 1 to 50 LOC dataset . 74 4.11 Performance of byte and token n-gram features on datasets with dif- ferent code size after information gain feature selection . 75 4.12 Performance of byte n-gram features on a datasets with different num- ber of authors . 76 4.13 Performance of token n-gram features with different number of authors 77 4.14 Attribution accuracy with the improved feature set on 75 authors dataset with 10 samples per author .
Recommended publications
  • C Programmer's Manual
    C PROGRAMMER'S MANUAL CGC 7900 SERIES COLOR GRAPHICS COMPUTERS Whitesmiths, Ltd. C PROGRAMMERS' MANUAL Release: 2.1 Date: Ma rch 1982 The C language was developed at Bell Laboratories by Dennis Ritchie; Whitesmiths, Ltd. has endeavored to remain as faithful as possible to his language specification. The external specifications of the Idris operating system, and of most of its utilities, are based heavily on those of UNIX, which was also developed at Bell Labora­ tories by Dennis Ritchie" and Ken Thompson. Whi tesmi ths, Ltd. grate­ fully acknowledges the parentage of many of the concepts we have commercialized, and we thank Western Electric Co. for waiving patent licensing fees for use of the UNIX protection mechanism. The successful implementation of Whi tesmi ths r compilers, operating systems, and utilities, however, is entirely the work of our pro­ gramming staff and allied consultants. For the record, UNIX is a trademark of Bell Laboratories; lAS, PDP-11, RSTS/E, RSX-11M, RT-11, VAX, VMS, and nearly every other term with an 11 in it all are trademarks of Digital Equipment Cor­ poration; CP/M is a trademark of Digital Research Co.; MC68000 and VERSAdos are trademarks of Motorola Inc.; ISIS is a trademark of In­ tel Corporation; A-Natural and Idris are trademarks of Whitesmiths, Ltd. C is not. Copyright (c) 1978, 1979, 1980, 1981 by Whitesmiths, Ltd. C PROGRAMMERS' MANUAL SECTIONS I. The C Languag-e II. Portable C Runtime Library III. C System Interface Library IV. C Machine Interface Library SCOPE This manual describes the C programming language, as implemented by Whitesmiths, Ltd., and the various library routines that make up the machine independent C environment.
    [Show full text]
  • COSMIC C Cross Compiler for Motorola 68HC11 Family
    COSMIC C Cross Compiler for Motorola 68HC11 Family COSMIC’s C cross compiler, cx6811 for the Motorola 68HC11 family of microcontrollers, incorporates over twenty years of innovative design and development effort. In the field since 1986 and previously sold under the Whitesmiths brand name, cx6811 is reliable, field-tested and incorporates many features to help ensure your embedded 68HC11 design meets and exceeds performance specifications. The C Compiler package for Windows includes: COSMIC integrated development environment (IDEA), optimizing C cross compiler, macro assembler, linker, librarian, object inspector, hex file generator, object format converters, debugging support utilities, run-time libraries and a compiler command driver. The PC compiler package runs under Windows 95/98/ME/NT4/2000 and XP. Complexity of a more generic compiler. You also get header Key Features file support for many of the popular 68HC11 peripherals, so Supports All 68HC11 Family Microcontrollers you can access their memory mapped objects by name either ANSI C Implementation at the C or assembly language levels. Extensions to ANSI for Embedded Systems ANSI / ISO Standard C Global and Processor-Specific Optimizations This implementation conforms with the ANSI and ISO Optimized Function Calling Standard C specifications which helps you protect your C support for Internal EEPROM software investment by aiding code portability and reliability. C support for Direct Page Data C Runtime Support C support for Code Bank Switching C runtime support consists of a subset of the standard ANSI C support for Interrupt Handlers library, and is provided in C source form with the binary Three In-Line Assembly Methods package so you are free to modify library routines to match User-defined Code/Data Program Sections your needs.
    [Show full text]
  • The User's Column, December 1980, BYTE Magazine
    BASIC, Computer Languages, and Computer Adventures Jerry Pournelle c/ o BYTE Publications 70 Main St Peterborough NH 03458 It's a typical Sunday afternoon here at Chaos Manor. new users become familiar with the way computers In one room a dozen kids are playing games on the Radio think, became studded with features. Every time you Shack TRS-80, while here in the office I've been playing turned around there was a new BASIC interpreter, each about with the C programming language after adding a one larger than the last, and almost none of them com­ check-writer to my accounting programs. My wife, the patible with each other. Whatever portability BASIC had only practical member of the family, gently reminds me enjoyed vanished in a myriad of disk operations, func­ of my deadlines: galley proofs of a new novel, King tions, WHILE statements, new input formats, etc, etc, David's Spaceship (Simon and Schuster); two chapters of and, at the same time, the "free" memory left over after the latest Niven/ Pournelie collaboration, Oath of Fealty loading BASIC got so small that you couldn't handle (Simon and Schuster, Real Soon Now); plus three col­ much data. umns; a speech to a librarians' convention; and inputs for The logical end of that process is Microsoft's newest a NASA study on America's fifty-year space plan. Some BASIC-BO. Understand, it's an excellent BASIC. It has business people worry about cash flow; for authors it's features that, not long ago, the most advanced languages work flow-work comes in bunches, like bananas, and didn't have.
    [Show full text]
  • Comparing UNIX with Other Systems
    Comparing UNIX with other systems Timothy DO' Chase Corporate Computer systems, Inc. 33 West Main Street Holmdel, New Jersey he original concept behind this article was to make a grand comparison between UNIX Tand several other well known systems. This was to beall encompassing and packed with vital information summarized in neat charts, tables and graphs. As the work began, the realization settled in that this was not only difficult to do, but would result in a work so boring as to beincomprehensible. The reader, faced with such awealth ofinformation would be lost at best. Conclusions would be difficult to draw and, in short, the result would be worthless. Mtertearfully filling my waste basket with the initial efforts, I regrouped and began by as­ king myself why would anyone be interested in comparing UNIX with another operating system? There appears to be only two answers. First, one might hope to learn something about UNIX by analogy. IfI understand the file system on MPEN and someone tells me that UNIX is like that except for such and so, then I might be more quickly able to under­ stand UNIX. This I felt was an unlikely motivation. After all, there are much simpler ways to learn UNIX. Instead, the motivation for comparing UNIX to other systems must come from a need to evaluate UNIX. Ifwe are aware of the features or short comings ofother systems, then we can benefit by evaluating UNIX relative to those systems. Choosing an operating system or computer is a major decision which we can benefit from or be stuck with for a long time.
    [Show full text]
  • Why Pascal Is Not My Favorite Programming Language
    AT&T Bell Laboratories Murray Hill, New Jersey 07974 Computing Science Technical Report No. 100 Why Pascal is Not My Favorite Programming Language Brian W. Kernighan April 2, 1981 Why Pascal is Not My Favorite Programming Language Brian W. Kernighan AT&T Bell Laboratories Murray Hill, New Jersey 07974 ABSTRACT The programming language Pascal has become the dominant language of instruction in computer science education. It has also strongly influenced lan- guages developed subsequently, in particular Ada. Pascal was originally intended primarily as a teaching language, but it has been more and more often recommended as a language for serious programming as well, for example, for system programming tasks and even operating systems. Pascal, at least in its standard form, is just plain not suitable for serious pro- gramming. This paper discusses my personal discovery of some of the reasons why. April 2, 1981 Why Pascal is Not My Favorite Programming Language Brian W. Kernighan AT&T Bell Laboratories Murray Hill, New Jersey 07974 1. Genesis This paper has its origins in two events — a spate of papers that compare C and Pas- cal1, 2, 3, 4 and a personal attempt to rewrite Software Tools5 in Pascal. Comparing C and Pascal is rather like comparing a Learjet to a Piper Cub — one is meant for getting something done while the other is meant for learning — so such comparisons tend to be somewhat farfetched. But the revision of Software Tools seems a more relevant comparison. The programs therein were originally written in Ratfor, a ‘‘structured’’ dialect of Fortran imple- mented by a preprocessor.
    [Show full text]
  • FOREWORD the Attached Paper Appeared As Part of a Workshop Session of Tutorial Lectures Held at the PRO- TEXT I Conference in Late 1984
    FOREWORD The attached paper appeared as part of a Workshop session of tutorial lectures held at the PRO- TEXT I Conference in late 1984. The lecture was an informal presentation telling the story of how the University of Nottingham equipped itself — just prior to the era of PostScript and “desk- top publishing” — to typeset all of its examination papers ‘in house’. On a personal level, my vol- unteering to lead this project triggered an interest in computerised typesetting which later devel- oped into a new direction in my computer science research interests. This ‘new direction’ has now lasted for more than 20 years and has grown to encompass almost all aspects of a subject now called ‘document engineering’. The paper is of some historical interest, if only to note that the equipment we used cost us the staggering sum of £80,000 in 1982. Nowadays comparable quality, and much greater speed, could be obtained from a computer and laser-printer combination costing about one twentieth of that amount. The key factors in the success of our project, in those very early days, were the willingness of the University not only to buy the necessary typesetters and computers but also to purchase a source code licence for the device-independent version of troff. This latter item, at £4000, was not cheap but it has done sterling service and is still in use, some 20 years later, in the School of Computer Science. Above all, the device independent troff, (unlike the original version of troff which was for the second-generation GSI CAT typesetter only) proved to be readily adaptable to a variety of typesetters and laser-printers.
    [Show full text]
  • LIFE with UNIX a Guide for Everyone
    LIFE WITH UNIX LIFE WITH UNIX A Guide For Everyone Don Libes & Sandy Ressler PRENTICE HALL, Englewood Cliffs, New Jersey 07632 Library of Congress Cataloging in Publication Data Life with UNIX, A Guide For Everyone UNIX is a registered trademark of AT&T. Production: Sophie Papanikolaou Cover production: Eloise Starkweather Cover design: Lundgren Graphics, Ltd. Cover artwork: Sandy Ressler Marketing: Mary Franz Life With UNIX was edited and composed with Frame Maker on a Sun Microsystems work- station running UNIX. Camera-ready copy was prepared on a Linotronic 100P by Profession- al Fast-Print Corporation using PostScript files generated by Frame Maker. 1989 by Prentice-Hall, Inc. A division of Simon & Schuster Englewood Cliffs, New Jersey 07632 All rights reserved. No part of this book may be reproduced, in any form or by any means, without written permission from the publisher. Printed in the United States of America 10 9 8 7 6 5 4 3 2 1 Prentice-Hall International (UK) Limited, London Prentice-Hall of Australia Pty. Limited, Sydney Prentice-Hall Canada Inc., Toronto Prentice-Hall Hispanoamericana, S.A., Mexico Prentice-Hall of India Priviate Limited, New Delhi Prentice-Hall of Japan, Inc., Tokyo Simon & Schuster Asia Pte. Ltd., Singapore Editora Prentice-Hall do Brasil, Ltda., Rio de Janeiro To our loving families Contents Preface .................................................................................................................. xiii How To Read This Book ......................................................................................xvii
    [Show full text]
  • Apply Principles of Creating Computer Software by Developing a Complete Programme to Meet Given Business Specifications 115392
    APPLY PRINCIPLES OF CREATING COMPUTER SOFTWARE BY DEVELOPING A COMPLETE PROGRAMME TO MEET GIVEN BUSINESS SPECIFICATIONS 115392 PURPOSE OF THE UNIT STANDARD This unit standard is intended: To provide a expert knowledge of the areas covered For those working in, or entering the workplace in the area of Systems Development To demonstrate an understanding of how to create (in the computer language of choice) a complete computer program that will solve a given business problem, showing all the steps involved in creating computer software People credited with this unit standard are able to: Interpret a given specification to plan a computer program solution Design a computer program to meet a business requirement Create a computer program that implements the design Test a computer program against the business requirements Implement the program to meet business requirements Document the program according to industry standards The performance of all elements is to a standard that allows for further learning in this area. LEARNING ASSUMED TO BE IN PLACE AND RECOGNITION OF PRIOR LEARNING The credit value of this unit is based on a person having the prior knowledge and skills to: Apply the principles of Computer Programming Design, develop and test computer program segments to given specifications UNIT STANDARD RANGE N/A INDEX NC: IT: SYSTEMS DEVELOPMENT AUTHOR: LEARNER MANUAL REL DATE: 27/01/2020 REV DATE: 01/01/2023 DOC REF: 48872 LM MOD 3 V-1 PAGE 260 Competence Requirements Page Unit Standard 115392 alignment index Here you will find the different outcomes explained which you need to be proved 261 competent in, in order to complete the Unit Standard 115392.
    [Show full text]
  • Writing Portable C Code: the SAS Institute Experience
    Writing Portable C Code: The SAS Institute Experience Richard D. Langston SAS Institute Inc. This paper illustrates how SAS Institute Inc. went the VM family, so the emphasis was placed on per­ about converting a large software system from PL/I formance within OS systems instead of portability. and IBM® 370 assembly language into a new ver­ Of course, the user community quickly recognized sion written almost entirely in C. The paper briefly the usefulness of the VM family and began an alle­ discusses the history of so~tware development at giance to those operating systems that precipitated the Institute, then explains the rationale behind the the Institute's need to convert the SAS System to company's decision to convert to a new language. run under VM. From there, the Institute's experience in writing portable software is discussed. In addition, the pa­ This effort was accomplished while maintaining per touches on the development process the Institute the aforementioned hybrid of languages. Very little used, including the adaptation of software coding PL/I code was added to support VM; most of the standards. system modifications were to the supervisor. There­ fore, most modifications were m_ade in assembly lan­ Let us begin by tracing the history of program­ guage. The conversion goal was to change as small ming language usage at the Institute. SAS® soft­ amount of procedure code as possible and allow the ware was originally implemented for OS-type oper­ supervisor to recognize the environment and deal ating systems (MVT, MFT, MVS, VSl). The SAS with its eccentricities, relieving the procedure writ­ System was implemented using a combination of ers to once again concentrate on their specialties.
    [Show full text]
  • The History of Solaris
    32 2593 AppC 9/19/01 12:55 PM Page 895 APPENDIX C The History of Solaris UNIX is plural. It is not one operating system but, HOW IT ALL BEGAN many implementations of an idea that originated in 1965. As a system administrator, you need to under- It all began at Bell Labs, the research lab of AT&T, one stand the history of the UNIX operating system— of the largest and most powerful companies of our where it came from, how it was built, and where it is time. Ironically, AT&T was not interested in develop- now. Understanding the various versions of UNIX and ing and selling computers or operating systems. In fact, its origins makes it clear why UNIX became known as the U.S. Department of Justice did not allow AT&T to a somewhat hostile operating system. For example, sell software. However, AT&T’s existing systems, made UNIX was not developed by a single company with a up of people and paper, were in danger of being over- large marketing organization driving the user interface. whelmed in the boom of the 1960s. By the 1970s, the (In other words, it did not follow the development phone business was in jeopardy. Out of desperation path of, say, Microsoft Windows.) On the other hand, and need, Ken Thompson of AT&T set out to develop UNIX was not invented by hackers who were fooling what no computer company was ready to provide—a around; it grew out of strong academic roots. The pri- multiuser, multiprocessing operating system to be used mary contributors to UNIX were highly educated in-house for its own information processing depart- mathematicians and computer scientists employed by ment.
    [Show full text]
  • Rationale for International Standard - Programming Language - C
    Rationale for International Standard - Programming Language - C UNIX is a trademark of X/Open Co., Ltd.. DEC and PDP-11 are trademarks of Digital Equipment Corporation. POSIX is a trademark of IEEE. CONTENTS 0. Introduction..........................................................................................................................1 0.1 Organization of the document .........................................................................................5 1. Scope ...................................................................................................................................6 2. Normative References...........................................................................................................6 3. Terms and definitions............................................................................................................6 4. Conformance........................................................................................................................7 5. Environment.........................................................................................................................8 5.1 Conceptual models..........................................................................................................8 5.1.1 Translation environment...........................................................................................9 5.1.2 Execution environments .........................................................................................10 5.2 Environmental considerations........................................................................................12
    [Show full text]
  • Rationale for International Standard— Programming Languages— C
    Rationale for International Standard— Programming Languages— C Revision 5.10 April-2003 UNIX is a trademark of The Open Group. POSIX is a trademark of IEEE. Contents 0. Introduction ............................................................................................................................1 0.1 Organization of the document ......................................................................................5 1. Scope ......................................................................................................................................7 2. Normative References............................................................................................................9 5 3. Terms and definitions...........................................................................................................11 4. Conformance ........................................................................................................................13 5. Environment.........................................................................................................................15 5.1 Conceptual models......................................................................................................15 5.1.1 Translation environment.................................................................................15 10 5.1.2 Execution environments .................................................................................17 5.2 Environmental considerations....................................................................................19
    [Show full text]