Behavior Based Software Theft Detection

Behavior Based Software Theft Detection 1Xinran Wang 1Yoon-Chan Jhi 1,2Sencun Zhu 2Peng Liu 1Department of Computer Science and Engineering 2College of Information Sciences and Technology Pennsylvania State University, University Park, PA 16802 {xinrwang, szhu, jhi}@cse.psu.edu; [email protected] ABSTRACT (e.g., in SourceForge.net there were over 230,000 registered Along with the burst of open source projects, software open source projects as of Feb.2009), software theft has be- theft (or plagiarism) has become a very serious threat to the come a very serious concern to honest software companies healthiness of software industry. Software birthmark, which and open source communities. As one example, in 2005 it represents the unique characteristics of a program, can be was determined in a federal court trial that IBM should pay used for software theft detection. We propose a system call an independent software vendor Compuware $140 million dependence graph based software birthmark called SCDG to license its software and $260 million to purchase its ser- birthmark, and examine how well it reflects unique behav- vices [1] because it was discovered that certain IBM products ioral characteristics of a program. To our knowledge, our contained code from Compuware. detection system based on SCDG birthmark is the first one To protect software from theft, Collberg and Thoborson that is capable of detecting software component theft where [10] proposed software watermark techniques. Software wa- only partial code is stolen. We demonstrate the strength of termark is a unique identifier embedded in the protected our birthmark against various evasion techniques, including software, which is hard to remove but easy to verify. How- those based on different compilers and different compiler op- ever, most of commercial and open source software do not timization levels as well as two state-of-the-art obfuscation have software watermarks embedded. On the other hand, tools. Unlike the existing work that were evaluated through “a sufficiently determined attackers will eventually be able small or toy software, we also evaluate our birthmark on a to defeat any watermark.” [9]. As such, a new kind of soft- set of large software. Our results show that SCDG birth- ware protection techniques called software birthmark were mark is very practical and effective in detecting software recently proposed [21, 26–28]. A software birthmark is a theft that even adopts advanced evasion techniques. unique characteristic that a program possesses and can be used to identify the program. Software birthmarks can be classified as static birthmarks and dynamic birthmarks. Categories and Subject Descriptors Though some initial research has been done on software K.4.1 [COMPUTERS AND SOCIETY ]: Public Pol- birthmarks, existing schemes are still limited to meet the icy Issues—Intellectual property rights following five highly desired requirements: (R1) Resiliency to semantics-preserving obfuscation techniques [11]; (R2) General Terms Capability to detect theft of components, which may be only a small part of the original program; (R3) Scalabil- Security ity to detect large-scale commercial or open source software theft; (R4) Applicability to binary executables, because the Keywords source code of a suspected software product often cannot Software Birthmark, Software Plagiarism, Software Theft, be obtained until some strong evidences are collected; (R5) Dynamic Analysis Independence to platforms such as operating systems and program languages. To see the limitations of the existing 1. INTRODUCTION detection schemes with respect to these five requirements, let us break them down into four classes: (C1) static source Software theft is an act of reusing someone else’s code, in code based birthmark [27]; (C2) static executable code based whole or in part, into one’s own program in a way violating birthmark [22]; (C3) dynamic WPP based birthmark [21]; the terms of original license. Along with the rapid develop- (C4) dynamic API based birthmark [26,28]. We may briefly ing software industry and the burst of open source projects summarize their limitations as follows. First, Classes C1, C2 and C3 cannot satisfy the requirement R1 because they are vulnerable to simple semantics-preserving obfuscation tech- Permission to make digital or hard copies of all or part of this work for niques such as outlining and ordering transformation. Sec- personal or classroom use is granted without fee provided that copies are ond, C2, C3 and C4 detect only whole program theft and not made or distributed for profit or commercial advantage and that copies thus cannot satisfy R2. Third, C1 cannot meet R4 because bear this notice and the full citation on the first page. To copy otherwise, to it has to access source code. Fourth, existing C3 and C4 republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. schemes cannot satisfy R5 because they rely on specific fea- CCS’09, November 9–13, 2009, Chicago, Illinois, USA. tures of Windows OS or Java platform. Finally, none of the Copyright 2009 ACM 978-1-60558-352-5/09/11 ...$5.00. four class schemes has evaluated their schemes on large-scale isomorphism testing on SCDGs is tractable and efficient in programs. practice. In this paper, we propose behavior based birthmarks to This paper makes the following contributions: meet these five key requirements. Behavior characteristics have been widely used to identify and separate malware from • We proposed a new type of birthmarks, which exploit benign programs [8,16]. While two independently developed SCDGs to represent unique behaviors of a program. With- software for the same purpose share many common behav- out requiring any source code from the suspect, SCDG iors, one usually contains unique behaviors compared to the birthmark based detection is a practical solution for re- other due to different features or different implementations. ducing plaintiff’s risks of false accusation before filing a For example, HTML layout engine Gecko engine [3] supports lawsuit related to intellectual property. MathML, while another engine KHTML [4] does not; Gecko • As one of the most fundamental runtime indicators of pro- engine implements RDF (resource description framework) to gram behaviors, the proposed system call birthmarks are manage resources, while KHTML engine implements its own resilient to various obfuscation techniques. Our experi- framework. The unique behaviors can be used as birthmarks ment results indicate that they not only are resilient to for software theft detection. Note that we aim to protect evasion techniques based on different compilers or differ- large-scale software. Small programs or components, which ent compiler optimization levels, but also successfully dis- may not contain unique behaviors, are out of the scope of criminates code obfuscated by two state-of-the-art obfus- this paper. cators. A system call dependence graph (SCDG), a graph rep- • We design and implement Hawk, a dynamic analysis tool resentation of the behaviors of a program, is a good can- for generating system call traces and SCDGs. Hawk po- didate for behavior based birthmarks. In a SCDG, system tentially has many other applications such as behavior calls are represented by vertices, and data and control de- based malware analysis. Detailed design and implemen- pendences between system calls by edges. A SCDG shows tation of Hawk are present in this paper. the interaction between a program and its operating system • To our knowledge, SCDG birthmarks are the first birth- and the interaction is an essential behavior characteristic marks which are proposed to detect software component of the program [8, 16]. Although a code stealer may apply theft. Moreover, unlike existing works that are evaluated compiler optimization techniques or sophisticated semantic- with small or toy software, we evaluate our birthmark on preserving transformation on a program to disguise original a set of large software, for example web browsers. Our code, these techniques usually do not change the SCDGs. evaluation shows the SCDG birthmark is a practical and It is also difficult to avoid system calls, because a system effective birthmark. call is the only way for a user mode program to request The rest of this paper is organized as follows. Section 2 kernel services in modern operating systems. For exam- introduces concepts and measurements about software birth- ple, in operating systems such as Unix/Linux, there is no marks. In Section 3, we propose the design of SCDG birth- way to go through the file access control enforcement other marks based software theft detection system. Section 4 than invoking open()/read()/write() system calls. Although presents evaluation results. We discuss limitation and fu- an exceptionally sedulous and creative plagiarist may cor- ture work in Section 5. Finally, we summarize related work rectly overhaul the SCDGs, the cost is probably higher than in Section 6 and draw our conclusion in Section 7. rewriting his own code, which conflicts with the intention of software theft. After all, software theft aims at code reuse 2. PROBLEM FORMALIZATION with disguises, which requires much less effort than writing one’s own code. This section first presents the definitions related to soft- We develop system call dependence graph (SCDG) birth- ware birthmark and SCDG birthmark, and then introduces marks for meeting these five key requirements. To extract a metric to compare two SCDG birthmarks. SCDG birthmarks, automated dynamic analysis is performed 2.1 Software Birthmarks on both plaintiff and suspect programs to record system call traces and dependence relation between system calls. Since A software birthmark is a unique characteristic that a pro- system calls are low level implementation of interactions be- gram possesses and that can be used to identify the program. tween a program and an OS, it is possible that two different Before we formally define software birthmarks, we first de- system call traces represent the same behavior. Thus, we fil- fine the meaning of copy.

Behavior Based Software Theft Detection

Dockerdocker

Build It with Nitrogen the Fast-Oﬀ-The-Block Erlang Web Framework

Resurrect Your Old PC

LYX Frequently Asked Questions with Answers

Easy Slackware

18 Free Ways to Download Any Video Off the Internet Posted on October 2, 2007 by Aseem Kishore Ads by Google

UNIVERSITY of CALIFORNIA SANTA CRUZ UNDERSTANDING and SIMULATING SOFTWARE EVOLUTION a Dissertation Submitted in Partial Satisfac

Behavior Based Software Theft Detection, CCS 2009

Release 3.5.3

Fira Code: Monospaced Font with Programming Ligatures

\", 22 \', 22 \(, 31, 32 \), 31, 32 \-, 26 \., 22 \=, 22 \[, 32 \], 32 \ˆ, 21, 22 \`, 22

An User & Developer Perspective on Immutable Oses