Detecting Software Theft Via System Call Based Birthmarks
Total Page:16
File Type:pdf, Size:1020Kb
Detecting Software Theft via System Call Based Birthmarks Xinran Wang, Yoon-Chan Jhi, Sencun Zhu Peng Liu Department of Computer Science and Engineering College of Information Sciences and Technology Pennsylvania State University Pennsylvania State University University Park, PA 16802 University Park, PA 16802 Email: {xinrwang, jhi, szhu}@cse.psu.edu Email: [email protected] Abstract—Along with the burst of open source projects, program inherently possesses, which can be used to identify software theft (or plagiarism) has become a very serious threat the program. Though some initial research has been done to the healthiness of software industry. Software birthmark, on software birthmarks, existing schemes are still limited which represents the unique characteristic of a program, can be used for software theft detection. We propose two system in meeting the following highly desired requirements: (R1) call based software birthmarks: SCSSB (System Call Short Resiliency to semantics-preserving obfuscation techniques Sequence Birthmark) and IDSCSB (Input Dependant System [7]; (R2) Capability to detect theft of components, which Call Subsequence Birthmark), and examine how well they may be only a small part of the original program; (R3) reflect unique behavioral characteristics of a program. To our Scalability to detect large-scale commercial or open source knowledge, our detection system based on SCSSB and IDSCSB is the first one that is capable of software component theft software theft; (R4) Applicability to binary executables, detection where only partial code is stolen. We demonstrate the because the source code of a suspected software product strength of our birthmarks against various evasion techniques, often cannot be obtained until some strong evidences are including those based on different compilers and different collected. To see the limitations of the existing techniques compiler optimization levels as well as those based on very with respect to above requirements, let us break them down powerful obfuscation techniques supported by SandMark. Unlike the existing work that were evaluated through small or into four classes: (C1) static source code based birthmark toy software, we also evaluate our birthmarks on a set of large [3]; (C2) static executable code based birthmark [8]; (C3) software (web browsers). Our results show that system call dynamic whole program path(WPP) based birthmark [5]; based birthmarks are very practical and effective in detecting (C4) dynamic API based birthmark [4], [6]. We briefly software theft that even adopts advanced evasion techniques. summarize their limitations as follows: C1, C2 and C3 Keywords-detection; plagiarism; birthmark; techniques cannot satisfy requirement R1 because they are vulnerable to the techniques such as outlining and ordering I. INTRODUCTION transformation; C2, C3 and C4 detect only whole program Software theft is an act of reusing someone else’s code, theft thus cannot satisfy R2; C1 cannot meet R4 because in whole or in part, into one’s own program in a way it requires source code; none of the existing techniques has violating the terms of original license. Along with the rapid been evaluated on large-scale programs. developing software industry and the burst of open source In this paper, we propose behavior based birthmarks for projects (e.g., in SourceForge.net there were over 180,000 meeting these key requirements. Behavior characteristic has registered open source projects as of Aug.2008), software been widely used to identify malware from benign programs theft has become a very serious concern to honest soft- [9], [10]. While two software independently developed for ware companies and open source communities. To protect the same purpose share many common behaviors, one usu- software from theft, Collberg and Thoborson [1] proposed ally contains unique behaviors compared to the other due software watermark techniques. Software watermark is a to the difference in features and implementation details. For unique identifier inserted into the protected software, which example, the Gecko HTML rendering engine [11] supports is hard to remove but easy to verify. However, theoretically, MathML, while KHTML [12] does not; Gecko implements any watermark can be removed by sufficiently determined RDF (resource description framework) to manage resources, attackers [2]. As such, a new kind of software protection while KHTML implements its own framework. The unique techniques called software birthmark were recently proposed behaviors can be used as birthmarks for software theft [3]–[6]. A software birthmark is a unique characteristic that a detection. Note that we aim to protect large-scale software. The programs or components that are too small to bear The work of Wang and Zhu was supported by CAREER NSF-0643906. unique behaviors are out of our scope. The work of Jhi and Liu was supported in part by AFOSR MURI grant A system call sequence is a good candidate for behavior FA9550-07-1-0527, ARO MURI: Computer-aided Human Centric Cyber Situation Awareness, and NSF CNS-0905131. This work was also supported based birthmarks because it shows the interaction between by AFRL award FA8750-08-C-0137. a program and the operating system, where the interaction is an essential behavioral characteristic of the program [9], [10]. Although a code stealer may apply compiler op- timization techniques or sophisticated semantic-preserving transformation on a program to hide original code, these techniques usually do not change the sequence of system ! ! ! ! ! calls. It is also difficult to avoid system calls, because system calls are the only way for a user mode program to request " kernel services in modern operating systems. For example, in operating systems such as Unix/Linux, there is no way to go through the file access control enforcement other than Figure 1. (a) An example program. S0,...,S6 denote system calls. (b) The system call trace generated by the execution of the example program with invoking open()/read()/write() system calls. input i = 1. (c) 4-long system call sequence set of the system call trace. We develop two system call based dynamic birthmarks SCSSB (System Call Short Sequence Birthmark) and ID- SCSB (Input Dependant System Call Subsequence Birth- our evaluation shows the proposed birthmarks are practical. mark) to meet aforementioned key requirements. To extract II. PROBLEM FORMALIZATION SCSSB, short subsequences of system calls are collected from the whole system call sequence during the execution A. Software Birthmarks of a program with a given input. Observing that some A software birthmark is an inherent characteristic of a system call short sequences in the set are commonly found program, which can uniquely identify the program. Before in many other programs and hence they do not represent we formally define software birthmarks, we first define copy. the unique behavior characteristic of the program, we estab- Program q is a copy of program p, if q is exactly the same lish a database of the common short sequences of system as p. q is still considered as a copy of p after a sophisticated calls from various programs. SCSSB is then extracted by software thief applies semantic preserving transformation removing the commonly found short sequences from the such as obfuscation techniques and compiler optimization. system call short sequences of the program. To address Tamada et al. [3] and Myles et al. [5] define software noise injection attacks, we further propose IDSCSB, which birthmark and dynamic software birthmark as follows: involves slightly higher performance overhead than SCSSB. Definition 1: (Software Birthmark) Let p, q be programs In IDSCSB, with two different inputs, we first extract two or program components. Let f(p) be a set of characteristics whole system call sequences that only include the system extracted from p. We say f(p) is a birthmark of component calls dependent to the individual input. The IDSCSB is p, only if both of the following conditions are satisfied: generated from each system call sequence by excluding the 1) f(p) is obtained only from p itself. system calls which appear in both system call sequences in 2) program q is a copy of p ⇒ f(p) = f(q). common. In this way, we can remove noisy system calls that Software birthmarks can be classified into static birth- are intentionally injected. Our contributions are threefold: marks and dynamic birthmarks. A static birthmark relies on • We proposed a novel type of birthmarks, which exploits syntactic structure of a program. Existing static birthmarks short sequences or input dependent subsequences of system are vulnerable to simple semantic-preserving transforma- calls to represent unique behaviors of a program. Without tions [5]. On the other hand, dynamic birthmarks rely on requiring any source code from the suspect, the system the runtime behavior of a program, which is more difficult call birthmark detection is a practical solution for reducing to be altered through the code obfuscation techniques. In plaintiff’s risks of false accusation before filing an intel- this paper, we propose two dynamic birthmarks. lectual property lawsuit. Definition 2: (Dynamic Software Birthmark) Let p, q be • As one of the most fundamental runtime indicators of pro- programs or program components. Let I be an input to p gram behaviors, our system call birthmarks are resilient to and q. Let f(p, I) be a set of characteristics extracted from p various advanced obfuscation techniques. Our experiment by executing p with input I. f(p, I) is a dynamic birthmark results indicate that it not only is resilient to simple evasion of p, only if both of the following conditions are satisfied: techniques such as different compilers and different opti- 1) f(p, I) is obtained only from p executed with input I mization levels, but also successfully discriminates code 2) q is a copy of q ⇒ f(p, I) = f(q, I) obfuscated by SandMark [13], a state-of-the-art obfuscator. B. System Call Birthmarks • To our best knowledge, SCSSB and IDSCSB are the first birthmarks that are proposed to detect software compo- Definition 3: (System Call Trace) Let p be a program or nent theft.