VaPD: A Value-Based Approach to Obfuscation-Resilient Software Plagiarism Detection

Yoon-Chan Jhi1 Xinran Wang2 Xiaoqi Jia3 Sencun Zhu4 Peng Liu5 {1jhi, 2xinrwang, 4szhu}@cse.psu.edu {3xjia, 5pliu}@ist.psu.edu 1,2,4,5The Pennsylvania State University, University Park, PA 16802 3Chinese Academy of Sciences, Beijing, China

Abstract

Along with the burst of open source projects, software plagiarism has become a very serious threat to the health of the software industry. Although computer-aided plagiarism detection tools can reduce the time, effort, expense, and risk in software plagiarism lawsuits, automated plagiarism detection is facing increasing difficulty introduced by various code obfuscation techniques. We present a value sequence based plagiarism detection system (VaPD) leveraging a novel and obfuscation-resilient runtime semantic indicator: the value sequence, a sequence of the outputs of certain machine instructions in an execution path. Even if a program is obfuscated, some critical runtime values (forming a sequence) should be the same as those generated by the original program given the same input. Thus, significant overlap in the value sequences of two programs given the same input indicates potential plagiarism. To eliminate noisy values introduced by the execution environment and/or the obfuscators, we use dynamic taint analysis and other value sequence refinement techniques. To our knowledge, VaPD is the first value-based software plagiarism detection method. We evaluate VaPD through a set of real world programs, and all the plagiarisms obfuscated by the SandMark tool are successfully discriminated. VaPD does not require any source code of the suspect program, and has broad applicability in software analysis.

1 Introduction

Software plagiarism (or theft) is the act of reusing someone else's code, in whole or in part, in one's own program in a way that violates the terms of the original license. Along with the rapidly developing software industry and the burst of open source projects (e.g., SourceForge.net has over 171,000 registered open source projects as of March 2008), software plagiarism has become a very serious concern to honest software companies and open source communities. As an example, in 2005 it was determined in a federal court trial that IBM should pay the independent software vendor Compuware $140 million to license its software and $260 million to purchase its services [2] because it was discovered that certain IBM products contained code from Compuware. Since the source code of a suspicious program is typically not available, filing a software plagiarism lawsuit often depends on human effort to prove significant similarities in the appearance (e.g., user interface, input/output format, etc.) of two programs [3]. Clearly, computer-aided tools that analyze the internals of suspect programs can reduce the time, effort, expense, and risk that have been inevitable in lawsuits dealing with software plagiarism cases. Automated plagiarism detection, however, is not an easy job. Besides often having no access to the source code of suspected programs (or software products), automated plagiarism detection also faces increasing difficulty as a result of emerging code obfuscation techniques. Code obfuscation is a technique that transforms a sequence of code into a different sequence that preserves the semantics but is much more difficult to understand. On the positive side, code obfuscation is a viable anti-reverse-engineering measure for protecting intellectual property. For example, it protects programs written in Java or C#-like languages that may be easily decompiled into the original source code. On the negative side, however, code obfuscation has been a useful tool for plagiarists and malware writers to hide their code from detection. Software plagiarism detection is extremely difficult when advanced and automated obfuscation tools [10, 4, 1] are abused.

Although some initial research has been done to tackle the software plagiarism problem, existing schemes are still quite limited in meeting all of the following three highly desired requirements: (R1) resiliency to semantics-preserving obfuscation techniques [11]; (R2) ability to directly work on binary executables of suspected programs (although the source code of the plaintiff's program is usually available, the source code of a suspected software product often cannot be obtained until some strong evidence is collected); (R3) platform independence, i.e., independence from operating systems and programming languages.
To see the limitations of the existing detection schemes with respect to these three requirements, let us break them down into four classes: (C1) static source code comparison methods [23, 30, 35, 19, 33, 27, 28, 15]; (C2) static code comparison methods [25]; (C3) dynamic control flow based methods [24]; (C4) dynamic API based methods [29, 31, 32]. First, classes C1, C2 and C3 cannot satisfy requirement R1 because they are vulnerable to semantics-preserving obfuscation techniques such as outlining and ordering transformations. Second, C1 cannot meet R2 because it has to access source code. Third, existing C3 and C4 schemes cannot satisfy R3 because they rely on the Windows OS or the Java platform. We aim at an effective plagiarism detection method that meets all three key requirements.
We notice that although many program characteristics can be "washed off" or removed by obfuscation tools, semantics is the only characteristic guaranteed to survive obfuscation, because code obfuscation must preserve the semantics of the program being transformed. Hence, extracting the semantics is the key to countering these obfuscation techniques. However, automatic semantic extraction is still an open problem because the question "what is the true semantics of a program?" has never been clearly answered. Existing approaches try to capture the semantics based on string patterns [7], abstract syntax [8, 21], tokens [18, 27, 28], program dependence graphs [23, 22], or certain birthmarks [30, 29, 31], but these semantics are weak in the sense that they are vulnerable to code obfuscation by existing tools [10, 4, 1].
To address the aforementioned challenges, we introduce a software plagiarism detection technique exploiting a novel semantic indicator: the value sequence. A value sequence is a sequence of outputs from certain machine instructions (the standard mathematical operations, the logical operators, bit shifts, and rotates) in an execution path. Our intuition is that even if a piece of code (e.g., the code a plagiarist wants to steal) is obfuscated, during its execution some critical values (forming a sequence) should be the same as those generated by the original code given the same input. To remove the noisy values introduced by the execution environment or the obfuscators, we use dynamic taint analysis and value sequence refinement techniques. Refinement is crucial, because otherwise the value sequences could contain too much noise to accurately detect plagiarism.
We implemented the value-sequence based plagiarism detection method (VaPD) atop a generic processor emulator (QEMU 0.9.1). We implemented a specific dynamic taint analyzer and made it part of VaPD. In addition, VaPD uses popular longest-common-subsequence based similarity measuring algorithms to compare two value sequences.
Because processor emulators are in general independent of the OS and programming languages, VaPD satisfies requirement R3. VaPD also satisfies R2 because processor emulators are designed to directly work on binary executables. Regarding requirement R1, we evaluated VaPD through a set of real world programs. Our experimental results indicate that VaPD successfully discriminates 36 plagiarisms obfuscated by the SandMark tool (39 obfuscators in total, but 3 of them failed to obfuscate any of our test subject programs).1

Our Contributions: (1) We present a novel semantic indicator. To our knowledge, VaPD is the first method that uses value sequences to detect plagiarism. (2) By exploiting one of the most fundamental runtime indicators of program semantics, VaPD is resilient to various advanced obfuscation techniques. (3) VaPD does not require access to the source code of the suspected program, so it could greatly reduce the plaintiff's risk by providing strong evidence before filing a lawsuit related to intellectual property. (4) VaPD is a fairly practical method. It can directly work on binary executables of the suspected program, and its effectiveness is not restricted to whole-program plagiarism. In fact, as we will show in Sections 4.2 and 6.4, VaPD can also be used to solve certain categories of core-part plagiarism detection problems (which could be more common in lawsuits), although core-part plagiarism detection is still an open problem that requires much future research.

1 Since SandMark works on Java bytecode, we use GCJ, the GNU ahead-of-time Compiler for the Java Programming Language, to convert obfuscated programs to x86 native executables. GCJ's optimization features are deliberately disabled.

(5) VaPD would serve as a useful tool promoting a healthier and more trustworthy sharing environment for the open source community. (6) As a generic runtime characterization technique, VaPD's potential is not limited to plagiarism detection. For example, VaPD is also applicable to identifying malcode (worm and virus) variants that are obfuscated through metamorphism and/or polymorphism.

Scope of Our Work: We consider two types of software plagiarism in the presence of automated obfuscators: whole-program plagiarism, where the plagiarizer copies the whole or the majority of the plaintiff program and wraps it in a (modified) interface, and partial (or core-part) plagiarism, where the plagiarizer copies only a part of the plaintiff program and merges it into the suspect program. One purpose of VaPD is to provide a practical solution to real-world whole-program software plagiarism detection problems, in which no source code of the suspect program is available and various obfuscation techniques have been applied. VaPD can also be a useful tool in many partial plagiarism cases where the plaintiff can provide knowledge about which part of his program is likely to be plagiarized. We note that if the plagiarized code is very small or functionally trivial, VaPD would not be an appropriate tool.

2 State of the Art

We roughly group the literature into three categories: code obfuscation techniques, static analysis based plagiarism detection, and dynamic analysis based plagiarism detection.
Code Obfuscation Techniques: Code obfuscation is a semantics-preserving transformation that hinders recovering the original form of the resulting code. A generic code obfuscation technique is not as simple as adding one before a computation and subtracting one after it. In [11], Collberg et al. provided a nice and extensive discussion of automated code obfuscation techniques. They classify code obfuscation techniques into the following categories depending on the feature that each technique targets: data obfuscation, control obfuscation, layout obfuscation, and preventive transformations. Collberg et al. also introduced opaque predicates [12] to thwart static disassembly, a common reverse engineering technique. Most of the 26 techniques discussed in that work are included in the SandMark tool [10]. Other techniques such as Indirect Branches, Control-flow Flattening, and Function-pointer Aliasing were also introduced [34].
Several tools are available for code obfuscation. SandMark [10] is one such tool, implementing 39 obfuscators applicable to Java bytecode. Array representation and orientation, functions, in-memory representation of variables, order of instructions, and control and data dependence are just a small part of the features that SandMark can alter. Another Java obfuscator is Zelix KlassMaster [4]. It implements comprehensive flow obfuscation technology, making it a heavy-duty Java bytecode obfuscator. Code Virtualizer [1] is a non-Java obfuscator that translates the original code into a virtual instruction set coupled with an interpreter. Semantics is the only characteristic guaranteed to be preserved across such obfuscations.
Static Analysis Based Plagiarism Detection: Existing static analysis techniques, except birthmark-based techniques, are closely related to clone detection [7, 8, 21, 22, 18, 14, 17, 16]. While sharing common interests with clone detection, plagiarism detection is different in that (1) we must deal with code obfuscation techniques which are often employed with malicious intention; (2) source code analysis of the suspicious program is impossible in most cases. Static analysis techniques for software plagiarism detection can be classified into five categories: String-based [7], AST-based [35, 19, 33], Token-based [27, 28, 15], PDG-based [23], and Birthmark-based [25, 30]. String-based: Each line of source code is considered as a string and the whole program is considered as a sequence of strings. A code fragment is labeled as plagiarism if its sequence of strings is the same as that of a code fragment from the original program. AST-based: Abstract syntax trees (ASTs) are extracted from programs by analyzing their syntax, and the ASTs are then directly compared. If there are common subtrees, plagiarism may exist. Token-based: A program is first parsed into a sequence of tokens. The sequences of tokens are then compared to find plagiarism. PDG-based: A program dependence graph (PDG) represents the control flow and data flow relations between the statements in a program procedure. To find plagiarism, two PDGs are extracted from two programs (by some static analysis tools) and compared to find a relaxed subgraph isomorphism. Birthmark-based: A software birthmark is a unique characteristic possessed by a program that can be used to identify the program.
Two birthmarks are extracted from two programs and compared.

program sum_prod:
  x = 0 ; y = 1 ; i = 0 ;
  while (i < 5) {
    i ++ ;
    x = x + i ;
    y = y * i ;
  }

program var_enc (sum_prod after the Variable Encoding obfuscator is applied to y = y * i):
  x = 0 ; y = 1 ; i = 0 ;
  while (i < 5) {
    i ++ ;
    x = x + i ;
    y = (y + i - x) * i + x ;
  }
  y = y - x ;

Value Sequence of sum_prod = {1, 2, 6, 24, 120}
Value Sequence of var_enc = {2, 1, 1, 2, 4, 1, 2, 5, 8, 2, 6, 12, 16, 6, 24, 34, 39, 24, 120, 135, 120}

Figure 1: Value sequences of the original program (sum_prod) and the obfuscated program (var_enc) overlap significantly.

None of the above techniques is resilient to code obfuscation. String-based schemes are vulnerable even to simple identifier renaming. AST-based schemes are resilient to identifier renaming, but weak against statement reordering and control replacement. Token-based schemes are resilient to identifier renaming, but weak against junk code insertion and statement reordering. Because PDGs contain semantic information of programs, PDG-based schemes are more robust than the other three types of existing schemes. However, PDG-based schemes are still vulnerable to many semantics-preserving transformations such as function inlining and outlining and opaque predicates. Birthmark-based schemes are vulnerable either to the obfuscation techniques mentioned in [25] or to some well-known obfuscations such as statement reordering and junk instruction insertion. Moreover, all existing techniques except [25] need access to source code.
Dynamic Analysis Based Plagiarism Detection: Myles and Collberg [24] proposed a whole program path (WPP) based dynamic birthmark. WPP was originally used to represent the dynamic control flow of a program. WPP birthmarks are robust to some control flow obfuscations such as opaque predicate insertion, but are still vulnerable to many semantics-preserving transformations such as flattening and loop unwinding. Tamada et al. [31, 32] also introduced two types of dynamic birthmarks for Windows applications: Sequence of API Function Calls Birthmark (EXESEQ) and Frequency of API Function Calls Birthmark (EXEFREQ). In EXESEQ, the sequence of Windows API calls is recorded during the execution of a program. These sequences are directly compared to find similarity. In EXEFREQ, the frequency of each Windows API call is recorded during the execution of a program, and the frequency distribution is used as the birthmark. Schuler et al. [29] proposed a dynamic birthmark for Java. The call sequences to the Java standard API are recorded, and short call sequences at the object level are used as a birthmark. Their experiments showed that their API birthmarks are more robust to obfuscation than WPP birthmarks. These naive birthmarks, however, can identify only the same source code compiled by different compilers with different options, and their performance against real obfuscation techniques is highly questionable. For example, attackers may simply embed some API implementations into their program so that fewer API calls will be observed.

3 Value Sequence, a Novel Semantic Indicator

When analyzing the existing obfuscation techniques, we noticed that the output of each machine instruction could not be obfuscated as easily as other semantic indicators could be. We call a sequence of the outputs of selected machine instructions in an execution path a value sequence. Details of value sequence construction will be provided shortly in the following subsections. Before going into the technical details, we briefly illustrate the obfuscation resiliency of the value sequence with a quick example. Figure 1 shows one of a number of program variants that we have tested. In this example, a C-like program sum_prod computes the sum (in x) and the product (in y) of the integers from 1 to 5. If we taint y in the beginning, our analyzer will extract the value sequence of y, which is {1, 2, 6, 24, 120}. To see how the value sequence changes after obfuscation, let us transform the original sum_prod through the variable encoding obfuscator [11] to get var_enc. Variable encoding is a technique that transforms a variable v to α ∗ v + β. In this example, we further transform v to F(v, w) = α ∗ v + β ∗ w to add a data dependence between v and w, which should deceive some dependence-based plagiarism detectors. The value sequence of var_enc extracted by our analyzer is {2, 1, 1, 2, 4, 1, 2, 5, 8, 2, 6, 12, 16, 6, 24, 34, 39, 24, 120, 135, 120}. In spite of variable encoding, we observe that the value sequence of the original program (sum_prod) is a complete subsequence of the value sequence of the obfuscated program (var_enc). Although some obfuscation techniques can alter the representation of values in some cases, we believe the obfuscated values would be decoded into the original representation at certain points in time unless the value representation system, including all operations defined on the values, is redefined. Inspired by this observation, we treat value sequences as a key semantic indicator. Note that a specific value could appear several times in different places of a value sequence.
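To make the observation concrete, the short C program below checks that the sum_prod sequence is a complete subsequence of the var_enc sequence from Figure 1. This is only an illustrative sketch: the helper name and the hard-coded arrays are assumptions made for this example, not part of VaPD.

#include <stdio.h>

/* Returns 1 if 'sub' (length m) is a subsequence of 'seq' (length n). */
static int is_subsequence(const long *sub, int m, const long *seq, int n) {
    int i = 0, j = 0;
    while (i < m && j < n) {
        if (sub[i] == seq[j]) i++;   /* matched the next element of sub */
        j++;
    }
    return i == m;
}

int main(void) {
    long v_orig[] = {1, 2, 6, 24, 120};                        /* sum_prod */
    long v_obf[]  = {2, 1, 1, 2, 4, 1, 2, 5, 8, 2, 6, 12, 16,
                     6, 24, 34, 39, 24, 120, 135, 120};        /* var_enc  */
    printf("%s\n", is_subsequence(v_orig, 5, v_obf, 21)
                       ? "subsequence" : "not a subsequence"); /* prints: subsequence */
    return 0;
}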

3.1 Initial Construction of Value Sequence

Since not all values associated with the execution of a program are semantic indicators, it is important to limit the types of values to be included in a value sequence. We establish the following requirements for the output of a machine instruction to be added to a value sequence:

• The value should be the output of a value-updating instruction. • The value should be closely related to the program's semantics.

In the following, we discuss the rationale for introducing these requirements.
Informally, a computer is a state machine that makes state transitions based on its input and a sequence of machine instructions. After every execution of a machine instruction, the state is updated with the outcome of the instruction. Because how the states change reflects how the program computes, the sequence of state-updating values can be a semantic indicator. As such, in value-based characterization, we are interested only in the state transitions made by value-updating instructions. More formally, we can conceptualize the state update as the change of data stored in devices such as RAM and registers after each instruction, and we call the changed data a state-updating value. We further define a value-updating instruction as a machine instruction that does not always preserve its input in its output. For example, add is a value-updating instruction, but mov is not. Being an output of a value-updating instruction is a sufficient condition for being a state-updating value. Therefore, we exclude the outputs of non-value-updating instructions from a value sequence. In our x86 implementation, the value-updating instructions are the standard mathematical operations (add, sub, mul, div, etc.), the logical operators (and, or, xor, etc.), arithmetic and logical bit shifts (shl, shr, etc.), and rotates (ror, rcl, etc.).
The above technique helps dramatically reduce the size of a value sequence (in our experiment on JFlex, the size is reduced by a factor of 30); however, in practice it is still challenging to analyze all the values produced by all the value-updating instructions. For example, our very first analyzer extracted 1,822,481 state-updating values for JLex and 2,009,036 values for JFlex. Clearly, we must apply further restrictions to refine the value sequence.
There are two classes of values computed by value-updating instructions: Class-1 includes those derived from the input of the program, and Class-2 consists of those that are not. For example, when program P is processing input I in environment E, some instructions take values (derived) from input I as their input, but some others take input from environment E (e.g., the program load location, the stack pointer, the size of a stack frame, etc.). Since the semantics is a formal representation of the way a program processes its input, the values in Class-1 are obviously more closely related to the semantics of the program, so we include only Class-1 values in a value sequence. To identify the values in Class-1, we perform dynamic taint analysis [26]. We start by tainting the name of the input file, and a system-call monitor then propagates the taint to the buffers where contents read from the tainted file are stored: every byte read from the input file is tainted. Thereafter, registers and memory cells appearing in the destination operands of all the value-updating instructions that take input from tainted registers or tainted memory locations are also tainted, and their output values are appended to the value sequence. In the example of JLex and JFlex, the value sequences contain fewer than 7,000 values after applying taint analysis, which is more than 250 times shorter than the original sequences.
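As a concrete illustration of this construction step, the following is a minimal sketch assuming a hypothetical per-instruction trace record (Insn) and a register-only taint map; it is not QEMU's or VaPD's actual interface, and the toy trace is invented for the example.

#include <stdio.h>
#include <string.h>

#define NREGS 16

/* Hypothetical, simplified per-instruction record; a real implementation
 * would hook the emulator's decoder instead of replaying a fixed trace.  */
typedef struct {
    char op[8];      /* mnemonic, e.g., "add", "mov"        */
    int  src1, src2; /* source register ids (-1 if unused)  */
    int  dst;        /* destination register id             */
    long out;        /* value the instruction writes to dst */
} Insn;

/* Value-updating instructions: arithmetic, logic, shifts, rotates. */
static int is_value_updating(const char *op) {
    static const char *ops[] = {"add","sub","mul","div","and","or","xor",
                                "shl","shr","rol","ror", NULL};
    for (int i = 0; ops[i]; i++)
        if (strcmp(op, ops[i]) == 0) return 1;
    return 0;   /* e.g., "mov" only copies its input */
}

/* Propagate taint; log the output if the instruction is value-updating and
 * its output is tainted, i.e., a Class-1 value derived from the input.     */
static void step(const Insn *in, int taint[NREGS], long *seq, int *len) {
    int src_tainted = (in->src1 >= 0 && taint[in->src1]) ||
                      (in->src2 >= 0 && taint[in->src2]);
    taint[in->dst] = src_tainted;
    if (src_tainted && is_value_updating(in->op))
        seq[(*len)++] = in->out;
}

int main(void) {
    int taint[NREGS] = {0};
    long seq[64];
    int len = 0;
    taint[0] = 1;                      /* register r0 holds tainted input */
    Insn trace[] = {
        {"mov", 0, -1, 1,  5},         /* r1 = r0: taint spreads, not logged */
        {"add", 1,  1, 1, 10},         /* r1 = r1 + r1: tainted, logged (10) */
        {"shl", 1, -1, 1, 20},         /* r1 <<= 1:     tainted, logged (20) */
        {"add", 2,  3, 4,  7},         /* untainted operands: not logged     */
    };
    for (int i = 0; i < 4; i++)
        step(&trace[i], taint, seq, &len);
    for (int i = 0; i < len; i++)
        printf("%ld ", seq[i]);        /* prints: 10 20 */
    printf("\n");
    return 0;
}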

Figure 2: Value sequence refinement examples.
(a) Sequential reduction (variable a is initially tainted). Source: a = 1; a = (a + 1) * 11; compiled and executed with eax = edx = 0x1:
  001: shl $0x2,%eax   // output 4 in eax
  002: add %edx,%eax   // output 5 in eax
  003: add %eax,%eax   // output 10 in eax
  004: add %edx,%eax   // output 11 in eax
  005: add $0xb,%eax   // output 22 in eax
  Value sequence = {4, 5, 10, 11, 22}; after sequential reduction = {22}
(b) Anomaly-based refinement (from JFlex obfuscated with Opaque Predicate Insertion): an x86 instruction trace (PC address, instruction, operands, mark) in which tainted lines are marked with 'T', pointer-carrying operands are marked (ptr), and the destinations of meaningless pointer operations are marked abnormal.

3.2 Refining Value Sequences

In this section, we discuss techniques to refine value sequences. An initial value sequence constructed through the dynamic taint analysis may still contain a number of less important values produced by intermediate or insubstantial computational steps. These steps may be the result of compiler optimization and/or code obfuscation. We need to eliminate those values to make the value sequence capable of characterizing larger programs. We believe a number of heuristics (e.g., control/data flow dependence analysis, abnormal code pattern detection, etc.) can be adopted to achieve this goal, and below we introduce some of them. For clarity of presentation in the later sections, let us denote a value sequence by S[P, I, E], where P is a program to run, I is the input to P, and E is the environment where P runs.
Sequential Reduction: Code optimization techniques often generate intermediate instructions such as those shown in Figure 2(a). When variable a is initially tainted, our taint analysis extracts the value sequence s = {4, 5, 10, 11, 22}. Note that the subsequence s_{1:4} = {4, 5, 10, 11} of s is generated by an intermediate code block optimizing the computation of 'a ← (a + 1) × 11'. All the values in s_{1:4} are updated and contained in register eax without affecting any memory locations until eax is finally overwritten at line 005. Since instructions after line 005 would never read (or be affected by) the values in s_{1:4}, we can remove s_{1:4} from s and retain only {22}. We formally generalize this heuristic as the sequential reduction rule.
Sequential Reduction Rule: Let i_{i,j} denote the i-th instruction updating destination j. Then we can skip logging the output of i_{i,j} if j is never read within the range (i_{i,j}, i_{i+1,j}). Repeat the same process until the first instruction that reads j and updates k (k ≠ j) is executed.
Note that sequential reduction is applied only to original programs. It cannot be applied to suspect programs, because otherwise an intelligent plagiarizer could easily prevent the original values from being recorded.
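To make the rule concrete, the sketch below applies it as a post-processing pass over a hypothetical trace format; the Rec structure, the register ids, and the demo trace (transcribed from Figure 2(a)) are illustrative assumptions, not VaPD's actual data structures.

#include <stdio.h>

/* Hypothetical trace record: a value-updating instruction writing 'val'
 * to destination register 'dst' while reading the registers in 'reads'. */
typedef struct {
    int  dst;
    long val;
    int  nreads;
    int  reads[2];
} Rec;

/* Sequential reduction: the value written to j by record i is kept only if
 * j later escapes to some other destination (an instruction reads j and
 * updates k != j) before being silently overwritten again.                */
static int reduce(const Rec *rec, int n, long *out) {
    int m = 0;
    for (int i = 0; i < n; i++) {
        int j = rec[i].dst, keep = 0, decided = 0;
        for (int k = i + 1; k < n && !decided; k++) {
            int reads_j = 0;
            for (int r = 0; r < rec[k].nreads; r++)
                if (rec[k].reads[r] == j) reads_j = 1;
            if (reads_j && rec[k].dst != j) { keep = 1; decided = 1; } /* value escapes j */
            else if (rec[k].dst == j)       { decided = 1; }          /* overwritten: skip */
        }
        if (!decided) keep = 1;   /* last live value of j is retained */
        if (keep) out[m++] = rec[i].val;
    }
    return m;
}

int main(void) {
    enum { EAX = 0, EDX = 2 };
    Rec trace[] = {                     /* Figure 2(a), eax = edx = 0x1 */
        {EAX,  4, 1, {EAX}},            /* shl $0x2,%eax */
        {EAX,  5, 2, {EDX, EAX}},       /* add %edx,%eax */
        {EAX, 10, 1, {EAX}},            /* add %eax,%eax */
        {EAX, 11, 2, {EDX, EAX}},       /* add %edx,%eax */
        {EAX, 22, 1, {EAX}},            /* add $0xb,%eax */
    };
    long out[8];
    int m = reduce(trace, 5, out);
    for (int i = 0; i < m; i++)
        printf("%ld ", out[i]);         /* prints: 22 */
    printf("\n");
    return 0;
}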

Figure 3: VaPD design overview. (a) Plagiarism detection process; (b) inside the VM-based Analyzer.

Comparison-Based Refinement: Insert Opaque Predicates is an obfuscation technique in which a conditional branch 'if(a)' is rewritten into 'if(a || F(x))' or 'if(a && T(x))', where F(x) and T(x) always yield logical false and true, respectively. Through our preliminary experiments on existing automated obfuscators, we have noticed that Insert Opaque Predicates, once applied, is the most significant source of noise in the value sequences of suspect programs. Let us denote by p_x a code block that serves as an opaque predicate taking input x. Since existing obfuscators [10] insert p_x that always produces the same output regardless of x, the values produced by p_x can be identified by locating instructions that always generate identical output for any input. To remove opaque predicates inserted in a suspect program P taking input I, we set up an alternative input I′ by slightly and deliberately modifying I. For example, in our JLex and JFlex experiments, we set up I′ by replacing all the tokens and non-terminal symbols appearing in I while leaving the overall structure of I untouched, because we want the execution paths taken by I and I′ to be similar for an ideal result. Through comparative analysis of the value sequences S[P, I, E] and S[P, I′, E], with a little help from dynamic disassembly information, we can identify instructions generating consistently identical values. We refine S[P, I, E] by eliminating the values produced by such instructions. In our experiments, dynamic taint analysis originally extracted value sequences containing 20,002 and 29,986 values from JLex and JFlex obfuscated by Insert Opaque Predicates, respectively. Comparison-based refinement was able to reduce these numbers to 9,041 and 12,826, respectively.
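The comparative analysis can be sketched as follows, assuming a hypothetical per-value log that records the address (pc) of the instruction producing each value; the Entry format and the toy logs are illustrative assumptions.

#include <stdio.h>

/* Hypothetical per-value log entry: the instruction address (pc) that
 * produced the value.  Real logs would come from the VM-based Analyzer. */
typedef struct { unsigned pc; long val; } Entry;

/* Returns 1 if instruction 'pc' produced exactly the same sequence of
 * values in both runs, i.e., its output did not depend on the input.   */
static int identical_for_pc(unsigned pc, const Entry *a, int na,
                                         const Entry *b, int nb) {
    int i = 0, j = 0;
    for (;;) {
        while (i < na && a[i].pc != pc) i++;
        while (j < nb && b[j].pc != pc) j++;
        if (i == na || j == nb) return i == na && j == nb;
        if (a[i].val != b[j].val) return 0;
        i++; j++;
    }
}

/* Drop from run 'a' (input I) every value produced by an instruction whose
 * outputs are identical in runs 'a' and 'b' (input I').                    */
static int refine(const Entry *a, int na, const Entry *b, int nb, long *out) {
    int m = 0;
    for (int i = 0; i < na; i++)
        if (!identical_for_pc(a[i].pc, a, na, b, nb))
            out[m++] = a[i].val;
    return m;
}

int main(void) {
    /* pc 0x10 behaves like an opaque predicate (same outputs in both runs);
     * pc 0x20 depends on the input and is kept.                            */
    Entry runI[]  = { {0x10, 7}, {0x20, 3}, {0x10, 7}, {0x20, 9} };
    Entry runI2[] = { {0x10, 7}, {0x20, 5}, {0x10, 7}, {0x20, 1} };
    long out[8];
    int m = refine(runI, 4, runI2, 4, out);
    for (int i = 0; i < m; i++) printf("%ld ", out[i]);   /* prints: 3 9 */
    printf("\n");
    return 0;
}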

Anomaly-Based Refinement: Some operations are known to be meaningless or erroneous when they deal with certain types of data. For example, adding, multiplying, or dividing two pointer variables is of no meaning at all. Such anomalous operations are seldom performed by carefully coded normal programs, whereas suspect programs may contain them as the result of obfuscation. We observe that code generated by existing obfuscators tends to take input from values stored in arbitrary registers and from pointers to random objects. Since registers often carry pointers too, we can leverage the anomaly of pointer operations to remove values possibly inserted by obfuscators. We mark as containing pointers the registers or memory locations that (1) store the output of a lea instruction, or (2) have been used as branch or call destinations. Then any of the following operations is meaningless, so we mark its destination operand as abnormal: (1) mul pointer type, any type (and vice versa); (2) add pointer type, pointer type; (3) div pointer type, any type (and vice versa). The output of an instruction taking at least one operand marked as abnormal is also marked as abnormal. A value marked as abnormal is not added to the value sequence even if it is tainted by the input of the program.
Figure 2(b) illustrates a code fragment from JFlex obfuscated by Opaque Predicate Insertion. All the tainted values that are outputs of the lines marked as abnormal are removed from the value sequence. In the full dump of the code, we could see two groups of opaque predicates dealing with tainted values that generate a huge amount of noise in the value sequence. We tested the above technique by applying it to the value sequence and eliminated 29,016 tainted values produced by opaque predicates out of the 29,986 values in the sequence.
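The marking logic can be sketched as below; the mark enumeration, the helper names, and the two-source model are illustrative simplifications of the rules above.

#include <stdio.h>
#include <string.h>

/* Hypothetical operand marks used by the anomaly-based refinement sketch. */
enum mark { M_NONE = 0, M_PTR, M_ABNORMAL };

/* Decide the mark of a destination operand from the instruction and the
 * marks of its two sources (M_NONE if a source is absent or unmarked).   */
static enum mark dest_mark(const char *op, enum mark s1, enum mark s2) {
    if (s1 == M_ABNORMAL || s2 == M_ABNORMAL)
        return M_ABNORMAL;                      /* abnormality propagates   */
    if (strcmp(op, "lea") == 0)
        return M_PTR;                           /* lea output is a pointer  */
    int ptr = (s1 == M_PTR) + (s2 == M_PTR);
    if ((strcmp(op, "mul") == 0 || strcmp(op, "div") == 0) && ptr > 0)
        return M_ABNORMAL;                      /* mul/div with a pointer   */
    if (strcmp(op, "add") == 0 && ptr == 2)
        return M_ABNORMAL;                      /* pointer + pointer        */
    return M_NONE;
}

/* A tainted output is appended to the value sequence only if its mark is
 * not M_ABNORMAL.                                                         */
static int should_record(int tainted, int value_updating, enum mark m) {
    return tainted && value_updating && m != M_ABNORMAL;
}

int main(void) {
    enum mark p = dest_mark("lea", M_NONE, M_NONE);     /* pointer  */
    enum mark a = dest_mark("mul", p, M_NONE);          /* abnormal */
    printf("%d %d %d\n", p, a, should_record(1, 1, a)); /* prints: 1 2 0 */
    return 0;
}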

4 Putting It Together

4.1 Design Overview

Figure 3(a) shows the overall design of VaPD. Provided with executable files of plaintiff program P and suspect program S, and a common test input I, the VM-based Analyzer extracts value sequences of P and S. After refining the value sequences, the Similarity Detector computes the similarity score of the two value sequences. VaPD repeats this process with a number of different inputs, and claims plagiarism if the average of the scores indicates a significant similarity.
By default, VaPD uses value sequences V_P and V_S extracted through the entire execution of P and S, respectively. However, when it deals with core-part plagiarism (i.e., only part of P, e.g., a module denoted by P_M, is reused in S), VaPD compares the partial value sequence V_PM (extracted from P_M) to V_S through a window sliding over V_S. Details on core-part plagiarism detection are discussed in Section 4.2. To extract a partial value sequence, we insert special system calls into the source code of P (note that we do not assume access to the source code of S) to notify the VM-based Analyzer when to start (or resume) and when to stop (or pause) extracting the value sequence. Provided by the plaintiff with the knowledge about which part of his program is likely to be plagiarized, we can annotate the plaintiff's source code and capture V_PM from the part that is believed to be stolen. Details are described below.

VM-Based Analyzer: The VM-based Analyzer is a virtual machine that executes a given program instruction by instruction, as shown in Figure 3(b). We implement two operation modes in the VM-based Analyzer: normal mode and partial extraction mode. In the normal mode, the VM-based Analyzer operates as follows. After fetching an instruction, the Taint Analyzer taints the destination operands if any of the source operands are tainted. After the instruction is executed by the VM execution module, the VM-based Analyzer checks whether the instruction is a value-updating instruction and its output is tainted; if so, the output of the instruction is added to the value sequence. The analyzer then fetches and decodes the next instruction and repeats the same process until the program finishes. When the program terminates, the analyzer stops extracting values and passes the completed value sequence to VaPD. Note that the VM-based Analyzer does not do any sequence refinement, which is done at a later stage. In the partial extraction mode, the VM-based Analyzer intercepts two special system calls, START_EXTRACT() and STOP_EXTRACT() (system call numbers 0xFFFFFFFF and 0xFFFFFFF0, respectively), requested by the test program. The VM-based Analyzer does not record values into the value sequence when it starts. It starts (or resumes) recording values when the test program issues the START_EXTRACT() system call, and it stops (or pauses) recording values when the program calls the STOP_EXTRACT() system call. Using the partial extraction mode, we can extract value sequences from a part (e.g., a module) of a plaintiff program, which is then used in core-part plagiarism detection. Note that the partial extraction mode is meant to extract partial value sequences of plaintiff programs; malicious plagiarizers cannot exploit this mode to exclude the plagiarized part from value sequence extraction. To reduce the number of values added to the value sequence, the VM-based Analyzer does not extract values from dynamically linked libraries (or shared libraries) by default. However, if necessary, we can enable the analyzer to include some important shared libraries in the value sequence extraction, because the virtual machine knows which libraries are loaded and where they are.
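For illustration, the source annotation could look like the sketch below, which wraps the special system call numbers given above in macros. The macro bodies, the function shown, and the 32-bit x86 'int 0x80' convention are assumptions for this sketch, not VaPD's published annotation interface.

/* Hypothetical annotation macros a plaintiff could add to the part of the
 * source believed to be plagiarized.  0xFFFFFFFF and 0xFFFFFFF0 are the
 * START_EXTRACT()/STOP_EXTRACT() system call numbers intercepted by the
 * VM-based Analyzer; on real hardware they are simply invalid system
 * calls (32-bit x86 'int 0x80' convention assumed).                       */
#define START_EXTRACT() \
    __asm__ volatile ("int $0x80" : : "a"(0xFFFFFFFFu) : "memory")
#define STOP_EXTRACT() \
    __asm__ volatile ("int $0x80" : : "a"(0xFFFFFFF0u) : "memory")

/* Hypothetical plaintiff module believed to be plagiarized. */
static void layout_engine_entry(void)
{
    START_EXTRACT();    /* begin (or resume) recording the partial value sequence */
    /* ... module code ... */
    STOP_EXTRACT();     /* pause recording */
}

int main(void)
{
    layout_engine_entry();
    return 0;
}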

Similarity Detection: We perform similarity detection after value sequence refinement. In the literature, there are many metrics for measuring the similarity of two sequences. In our prototype, we define similarity based on the longest common subsequence (LCS). It should be noted that the definition of LCS does not require a subsequence to be a continuous segment of the mother sequence. For example, both {1, 6, 120} and {2, 24} are valid subsequences of the value sequence of sum_prod (Figure 1), which is {1, 2, 6, 24, 120}. More formally, let |LCS[s1, s2]| denote the length of the longest common subsequence of sequences s1 and s2. The similarity score of two value sequences s1 and s2 is defined as follows:

    Sim[s1, s2] = |LCS[s1, s2]| / min(|s1|, |s2|)
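The score can be computed with the standard LCS dynamic program, as in the minimal sketch below; the MAXN bound is an illustrative assumption (a real implementation would size the table dynamically), and the test data are the sequences from Figure 1.

#include <stdio.h>
#include <string.h>

#define MAXN 1024   /* illustrative bound on sequence length */

/* Length of the longest common subsequence of a[0..n) and b[0..m). */
static int lcs_len(const long *a, int n, const long *b, int m) {
    static int dp[MAXN + 1][MAXN + 1];
    memset(dp, 0, sizeof dp);
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= m; j++)
            dp[i][j] = (a[i-1] == b[j-1]) ? dp[i-1][j-1] + 1
                     : (dp[i-1][j] > dp[i][j-1] ? dp[i-1][j] : dp[i][j-1]);
    return dp[n][m];
}

/* Sim[s1, s2] = |LCS[s1, s2]| / min(|s1|, |s2|) */
static double sim(const long *a, int n, const long *b, int m) {
    int min = n < m ? n : m;
    return min ? (double)lcs_len(a, n, b, m) / min : 0.0;
}

int main(void) {
    long s1[] = {1, 2, 6, 24, 120};                               /* sum_prod */
    long s2[] = {2, 1, 1, 2, 4, 1, 2, 5, 8, 2, 6, 12, 16,
                 6, 24, 34, 39, 24, 120, 135, 120};               /* var_enc  */
    printf("Sim = %.3f\n", sim(s1, 5, s2, 21));                   /* prints: Sim = 1.000 */
    return 0;
}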

4.2 Core-Part Plagiarism Detection

In this paper, we also perform preliminary experiments to examine VaPD's capability of detecting core-part plagiarism. Let I_PM and I_SM denote the input to the plagiarized module and the suspect module, respectively. For core-part plagiarism, we distinguish two cases: [Case-1] where the plagiarism is homogeneous (i.e., where I_S = I_P), and [Case-2] where the plagiarism is heterogeneous (where I_S ≠ I_P). Case-1 often corresponds to situations where two software products compete in the same market (i.e., they serve the same purpose), while Case-2 can happen between disparate products. Case-2 is a very difficult problem, and we do not know if a general solution exists at all. As such, in this stage of our research, we only consider the cases of homogeneous core-part plagiarism.
In homogeneous core-part plagiarism cases, we can directly search V_S for V_PM, the value sequence of the plaintiff module, to check whether the module has been plagiarized and where it is located in the suspect program. For example, in the case of web browser layout engine plagiarism, given an input URL I, we can first obtain V_PM, the value sequence of the plaintiff layout engine module; then, using the same input I, we can obtain V_S, the value sequence of the suspect program. If the plaintiff program and the suspect program use the same layout engine, we would expect V_PM and a subsequence of V_S (i.e., V_SM) to bear significantly similar patterns. Therefore, we can search for V_PM through a sliding window over V_S. If a match is found, the Similarity Detector will conclude that the suspect program contains a copy of the plaintiff module.
In the sliding window based search, one issue is to select an appropriate sliding step size. With the finest sliding step (i.e., one value), we would be able to identify every possible match; however, this requires |V_S| - |V_PM| computations of LCS-based similarity scores. Thus, to reduce the computational overhead, we apply the following two-stage scanning strategy. First, we quickly compute similarity scores using a larger step size (e.g., the same length as |V_PM|), and, depending on the result, we zoom into the portions showing similarity scores higher than a certain level (say Θ) by applying a finer step size (e.g., a quarter of |V_PM|). In practice, we expect |V_SM| ≥ |V_PM| because of possible noise injected into V_S. Therefore, if the similarity score under the finest step is µ, it is easy to see that Θ is at least 0.5µ, which is the worst-case similarity score when a core-part plagiarism exists. As our experiments will show, the similarity score of VaPD for a real plagiarism is normally close to 1, so it would be safe to let Θ = 0.4 in most cases.
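The two-stage scan can be sketched as follows; the window bound WMAX, the zoom-in radius of one coarse step, and the toy sequences are illustrative assumptions rather than VaPD's exact parameters.

#include <stdio.h>
#include <string.h>

#define WMAX 64   /* illustrative bound on the window length |V_PM| */

/* LCS-based similarity of two windows (the metric from Section 4.1). */
static double sim(const long *a, int n, const long *b, int m) {
    static int dp[WMAX + 1][WMAX + 1];
    memset(dp, 0, sizeof dp);
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= m; j++)
            dp[i][j] = (a[i-1] == b[j-1]) ? dp[i-1][j-1] + 1
                     : (dp[i-1][j] > dp[i][j-1] ? dp[i-1][j] : dp[i][j-1]);
    return (double)dp[n][m] / (n < m ? n : m);
}

/* Stage 1: coarse step = |V_PM|.  Stage 2: rescan around any coarse window
 * whose score reaches the zoom-in level theta with a finer step (|V_PM|/4). */
static double scan(const long *vpm, int np, const long *vs, int ns, double theta) {
    double best = 0.0;
    int fine = np / 4 > 0 ? np / 4 : 1;
    for (int off = 0; off + np <= ns; off += np) {
        if (sim(vpm, np, vs + off, np) < theta) continue;
        int lo = off >= np ? off - np : 0;
        for (int o = lo; o + np <= ns && o <= off + np; o += fine) {
            double s = sim(vpm, np, vs + o, np);
            if (s > best) best = s;
        }
    }
    return best;   /* a score close to 1 suggests the module is embedded in V_S */
}

int main(void) {
    long vpm[] = {3, 1, 4, 1, 5, 9, 2, 6};                       /* plaintiff module */
    long vs[]  = {8, 8, 7, 0, 3, 1, 4, 1, 5, 9, 2, 6,
                  0, 7, 7, 8, 8, 9, 9, 0, 0, 1, 1, 2};           /* suspect program  */
    printf("best window score = %.3f\n", scan(vpm, 8, vs, 24, 0.4)); /* prints 1.000 */
    return 0;
}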

5 Implementation

We implemented our VM-based Analyzer mainly inside the decoder module of QEMU 0.9.1. QEMU runs in two modes, whole system emulation and user-space emulation [9]. In the whole system emulation mode, QEMU operates just like other ordinary CPU emulators. In the user-space emulation mode, QEMU runs a guest process within its user-mode address space and serves as a bridge between the guest process and the host operating system. For example, when a guest process issues 'int 0x80', QEMU repacks it and issues the request to the host operating system as if QEMU itself were invoking the system call. When returning from the system call, QEMU passes the result returned by the host operating system to the guest process. Signals are also handled in a similar way.
We implement our test bed in the user-space emulation mode. In the whole system emulation mode, obtaining value sequences of a particular process requires much more effort; QEMU's user-space emulation removes such hassles.

Implementation-specific limitations: Although the user-space emulation mode greatly simplifies the overall implementation, it also introduces a limitation. The current implementation of user-space emulation in QEMU (version 0.9.1) is not thread-safe and does not implement certain system call handlers (e.g., futex, exec, etc.) correctly. We hope these issues will be fixed in future releases of QEMU.

Performance issues: QEMU improves execution speed by translating a code block on the fly into machine code that can be directly executed by the host machine. A cache is used so that QEMU does not retranslate the same code many times. Since our VM-based Analyzer needs to analyze a program instruction by instruction, we had to force QEMU (1) to execute a single instruction at a time, not a block; and (2) to decode every instruction without retrieving the result from the cache. The easiest way to do this is to disable the block execution and translation block cache features. As a result, VaPD sacrifices performance, as shown in Table 1, although in practice this is not a big concern for off-line analysis. VaPD is by nature an off-line plagiarism detection tool, so in the real world VaPD will be given plenty of time (e.g., days, weeks) to do its job. To improve performance, we can enable the translation block cache by embedding source operands, destination operands, and taint flow information in each cache entry. Although taint analysis also degrades the performance by 24 times, it is the cost we pay for nearly 1,000 times smaller storage used by the dump files containing the value sequence and other runtime information.

Table 1: Performance of JLex processing sample input.lex (Pentium 3.0 GHz / 1 GB RAM)

  VM                        Elapsed Time     Output Size
  QEMU                      0.171s           N/A
  VaPD w/o taint analysis   4m 50.255s       104 MB
  VaPD                      117m 14.941s     104 KB

Figure 4: Similarity scores of original JLex to obfuscated ones.

Figure 5: Similarity scores of original JFlex to obfuscated ones.

6 Experiment Results

In our evaluation, we want to answer two questions. First, how resilient is VaPD to obfuscation techniques? Second, how likely is it to make a false accusation? We thoroughly test the obfuscation resiliency of VaPD using two programs obfuscated with 34 existing obfuscators and 20 inputs. By deliberately using programs that are similar to, but disparate from, each other, we also examine VaPD's ability to recognize non-plagiarism. In addition, we present preliminary results on detecting a part (a core component) of a program using five different C compilers and three versions of the same compiler. Experiments are done on a Linux (Fedora Core 7) machine equipped with a Pentium D 3.00 GHz CPU and 1 GB RAM.

6.1 The SandMark Tool

SandMark [10] is a tool developed at the University of Arizona for analyzing and processing Java byte code. We use SandMark to evaluate the obfuscation resiliency of VaPD because SandMark is the only freely available software with a comprehensive list of fully functioning2 code obfuscation algorithms. Note that VaPD analyzes x86 machine code, so we convert Java byte code to an x86 executable using GCJ 4.1.2, the GNU ahead-of-time Compiler for the Java language. The code obfuscation techniques in the latest version (3.4.0) include 15 application obfuscations, 7 class obfuscations, and 17 method obfuscations. SandMark also implements one dynamic birthmarking algorithm, the Whole Program Paths (WPP) birthmark [24]; however, SandMark could not successfully compute WPP birthmarks for any of our test programs in our specific environment. In addition, SandMark provides features for watermarking, comparing, optimizing, visualizing, decompiling, examining, and measuring Java byte code.

6.2 Case Study I: JLex and JFlex

We selected JLex [6] and JFlex [5] as the subjects of our tests. JLex and JFlex, both written in Java, are two individual programs written for the same purpose. They understand the same input syntax and generate very similar lexical analyzers. The authors of each program claim that the two projects do not share any code. To verify this claim, we manually compared both programs using the code comparison features of SandMark [10]. Our manual comparison confirmed that the two programs do not share any significant code fragment.
To evaluate the obfuscation resiliency of VaPD, we set up two cases of single-obfuscation experiments, where only one obfuscation technique is applied at a time, and one multiple-obfuscation experiment, where multiple obfuscators are applied to one program in a row. As a dynamic analysis based solution, VaPD may not reliably justify (non-)plagiarism based on a single high similarity score. Hence, in this experiment, we use 20 different inputs and compute the average similarity scores.

2 The trial version of KlassMaster flow-obfuscates only one or two methods in each class.

Figure 6: Similarity scores of JLex to JFlex (obfuscated and original), and other programs.

Figure 7: Similarity scores of JFlex to JLex (obfuscated and original), and other programs.

6.2.1 Impact of Single Obfuscation

In the single-obfuscation experiments, we set up two cases: [Case-1] same-program comparison, where each of JLex and JFlex is compared to obfuscated versions of itself; and [Case-2] different-program comparison, where each of JLex and JFlex is compared to the other's obfuscated versions. In addition, we compare JLex and JFlex to five programs (bzip2, cksum, gzip, openssl computing MD5, and zip) that are entirely different from JLex and JFlex, while processing the same input file. The results are shown in Figures 4, 5, 6, and 7, where the x-axis shows the obfuscation techniques3 and the y-axis shows the similarity score.
For Case-1, Figure 4 shows the similarity score of original JLex versus each of the 34 obfuscation techniques applied to JLex, and Figure 5 shows the similarity score of original JFlex versus each of the 30 obfuscation techniques applied to JFlex. We can observe that in all but one case of JFlex, the similarity scores are close to 1.0. The similarity score in the exceptional case is around 0.84, on which we provide more discussion in Section 7.
For Case-2, Figure 6 shows the similarity score of original JLex to JFlex and its obfuscated versions plus the five additional programs, and Figure 7 shows that of original JFlex to JLex and its obfuscated versions plus the five additional programs. The similarity scores of JLex and JFlex to their counterparts are about 0.4 and 0.5, respectively. The similarity scores of JLex and JFlex to the five programs are below 0.118. Moreover, we cross compare the 34 obfuscated versions of JLex to the 30 obfuscated versions of JFlex, performing 1,020 individual comparisons. The similarity scores are between 0.2591 and 0.6279. Considering that the similarity score for a real plagiarism is close to 1, the scores of the cross comparison are relatively low. Therefore, the results from Case-1 and Case-2 provide clear answers to the questions we raised earlier: how resilient is VaPD to obfuscation techniques, and how likely is it to make a false accusation? In all the cases, VaPD identifies identical programs with no false accusation regardless of obfuscation techniques (with an appropriate threshold of 0.80).

6.2.2 Impact of Multiple Obfuscation

We also note that a plagiarist may attempt to hide plagiarism by heavily transforming a plagiarized program through a series of obfuscators. Therefore, evaluating the resiliency of VaPD against multiple obfuscation techniques applied to a single program is necessary.
Although it is theoretically possible for a series of multiple obfuscators to transform a program, applying many obfuscators to a single program raises practical issues of correctness of the target program and efficiency. For example, we attempted to apply all 39 obfuscation techniques of SandMark to each of JLex and JFlex, but after trying several obfuscation orders, only some of them could be successfully applied. To address this problem, we selected obfuscation techniques from two groups following the classification of Collberg et al. [11]: data obfuscation and control obfuscation. We created four test programs by transforming JLex and JFlex through the two groups of obfuscators. We could apply eight control obfuscators and seven data obfuscators to JLex, and seven control obfuscators and five data obfuscators to JFlex (see Table 3 in the appendix for the list of obfuscators). We compared the four multi-obfuscated JLex and JFlex programs to their originals. The similarity score of JLex to control-obfuscated JLex is 0.830, and the score of JLex to data-obfuscated JLex is 0.834. The similarity score of both control- and data-obfuscated JFlex compared to original JFlex is 1.000. This experiment shows that VaPD is effective in detecting heavily obfuscated plagiarisms as well as single-obfuscated plagiarisms.

3 We could not test all 39 obfuscators because some of them failed to transform JLex and JFlex.

Table 2: Similarity scores of five XML parsers cross compared to each other.

             RXP     Expat   Libxml2  Xerces-C  Parsifal
  RXP        1       0.116   0.032    0.022     0.118
  Expat      0.116   1       0.034    0.011     0.06
  Libxml2    0.032   0.034   1        0.06      0.093
  Xerces-C   0.022   0.011   0.06     1         0.307
  Parsifal   0.118   0.06    0.093    0.307     1

Figure 8: Compiler front-ends searched within V_G43 through a sliding window. The x-axis is the position in V_G43 from (begin) to (end), annotated with the cc1 and assembler/linker regions; the y-axis is the similarity score. Legend: V_cc1-43 (GCC 4.3), V_cc1-42 (GCC 4.2), V_cc1-41 (GCC 4.1), V_cc1-34 (GCC 3.4), V_ack (Amsterdam Compiler Kit), V_ccom (Portable C Compiler), V_wcc386 (Open Watcom), V_tcc (Tiny C Compiler).

6.3 Case Study II: XML Parsers

We also cross compare the value sequences of five individual XML parsers: RXP, which is used by the LT XML toolkit and the Festival speech synthesis system; the Expat XML parser, which is the underlying XML parser for the open source Mozilla project and Perl's XML::Parser; Libxml2, the XML C parser and toolkit of Gnome; Xerces-C++, supported by the Apache XML project; and the Parsifal XML parser C library. For each of the five XML parsers, we wrote a simple test program that parses a test input and prints the parser's internal information to the terminal. We cross compare the value sequences that VaPD extracts from the five test programs, performing 10 distinct comparisons. Since they are all individually developed projects, it would be a false accusation if VaPD computed a high similarity score (say, greater than 0.8) for any of them. Table 2 shows the similarity scores of the five XML parsers. In all cases except one, we observe similarity scores lower than 0.12. The one remaining comparison shows a similarity score of 0.307, which is still very low. Therefore, it is safe to say that VaPD makes no false accusation in this case study.

6.4 Case Study III: Comparisons between Compiler Components

We have done a pilot experiment to examine VaPD's capability of solving subproblems of core-part plagiarism, using the core parts extracted from eight compiler packages including the GNU C Compiler (GCC). GCC consists of the following three parts: front end, middle end, and back end. The C front end includes the preprocessor and the C compiler. The middle end and the back end deal with optimization, assembly, and code generation. Program cc1 is the front end of GCC, which compiles a C source file and generates an assembly file. Other compilers have a similar structure, and we want to see if GCC 4.3.1 uses any of the front ends of the four other compilers. Through source code annotation, we extract partial value sequences of the parts corresponding to the front end of GCC from the Amsterdam Compiler Kit (V_ack), Portable C Compiler (V_ccom), Open Watcom C Compiler (V_wcc386), and Tiny C Compiler (V_tcc), and search for each of them within V_G43, the value sequence of GCC 4.3.1, through a sliding window. Because the source code of the Open Watcom C Compiler is not available, we use its 'compile only' option to stop value sequence extraction before an executable is produced. To see the result when core-part plagiarism really exists, we also extract partial value sequences of cc1 from four different versions of GCC (V_cc1-34, V_cc1-41, V_cc1-42, and V_cc1-43, respectively from GCC 3.4, 4.1.2, 4.2.4, and 4.3.1) and search V_G43 for each of them.
The result is shown in Figure 8. V_cc1-43 marks the highest similarity score, followed by V_cc1-42 and V_cc1-41, within the part where cc1 is operational, while showing very low scores in the middle end and the back end. In contrast, V_cc1-34 generates very low similarity scores, because the old bison-based parser was entirely rewritten as a handwritten recursive-descent parser when GCC 4.1.0 was released. The rest of the compilers mark very low similarity scores over the entire range of the sliding window. The result indicates that VaPD accurately detects core components shared by different versions of GCC while raising no false alarms for GCC 3.4 and the other compilers.

7 Discussion

In this section, we discuss the findings and limitations discovered while evaluating VaPD. We have tried to show VaPD's potential by examining it with various real-world programs and all the obfuscators implemented in SandMark.

7.1 Multiple Obfuscations

During the multiple obfuscation experiments (Section 6.2.2), we tried to apply as many obfuscation techniques as possible. However, when multiple obfuscators are serially applied to a single program, the output of obfuscation can be affected by the order in which the obfuscators are applied. For example, when we tried to apply all of the 39 obfuscators to one program in different orders, many of them either crashed or failed to generate correctly functioning programs. We believe that in practice, applying multiple obfuscators to a large program would be difficult due to the increasing chance of conflicts between obfuscation techniques. Since such conflicts could ruin the semantics of the original program, even if the plagiarist has smoothly applied a sequence of obfuscators without a crash, he is still uncertain whether the semantics he wants to steal is preserved in the output program. For example, in our experiments shown in Section 6.2.2, we had to find a working sequence of the eight control obfuscators for JLex by trial and error. Moreover, applying the working sequence to JLex took more than 6 hours and 50 minutes in our experimental environment.
This discussion may provide an answer to the following discrepancy observed in the JFlex multi-obfuscation experiments. As more obfuscation techniques are applied, a program may look more different from the original, but as we discuss below, this does not necessarily make the obfuscated program harder to detect. Recall that the similarity score of the original JFlex and the JFlex obfuscated by Merge Local Integers is around 0.85 (Figure 5). Because Merge Local Integers is also used in the multiple obfuscation experiments (Section 6.2.2), one might find it interesting to observe similarity scores of 1.0 in the multiple obfuscation experiments of JFlex. Here, we can think of three possible reasons. First, as we mentioned previously, the output of multi-obfuscation may change depending on the order in which the obfuscators are applied. For example, if Insert Opaque Predicates is applied after Opaque Branch Insertion has produced random conditional branches within blocks, several times more opaque predicates will be generated than when Insert Opaque Predicates is applied alone. Second, some obfuscators like Opaque Branch Insertion transform the original program based on random factors. Third, we observe that, in some cases, we have more noise in the value sequences when more obfuscators are applied. In sum, all three reasons may generate additional noise in value sequence extraction: the more noise, the higher the chance of a high similarity score becomes. In fact, such additional noise will not help the plagiarist evade VaPD. This indicates that there is probably insufficient incentive for the plagiarist to perform brute-force multi-obfuscation.

7.2 Further Refinement of Value Sequences In spite of the dramatic reduction attributed to taint analysis, the size of a value sequence could still reach thousands or tens of thousands for a program with relatively long execution paths. To keep the value sequence concise while retaining all the important value-changing information, we can try to remove some less important intermediate values from the value sequence. For example, we may ignore the counter values generated by loops. A loop counter variable is a variable used to control the iterations of a loop. Since loop counters typically reflect little of a program's semantics, preventing such variables from being included in value sequence extraction can improve the accuracy of VaPD. In practice, our dynamic taint analysis removes most loop variables. However, in some cases loop variables can be tainted by input and survive our taint analysis. In such cases, we can eliminate counter values either by developing heuristics that can be automated or by applying a simple manual optimization to the value sequence. For example, we can remove the exceptional case (where the obfuscator Merge Local Integers is applied) in Figure 5 by applying a simple manual optimization to the value sequence of the original program. We analyzed the value sequences of the original JFlex and found, at the end of the value sequences, a series of values generated by a loop containing nothing tainted but an instruction increasing ebx by 1 (add $0x01, %ebx). We identified ebx as a loop counter variable, and after excluding the code increasing ebx, the average similarity score became 0.988. These new value sequences also affected the results shown in Figure 7, lowering the similarity score to near 0.35. Since dynamic taint analysis and value sequence refinement techniques have already reduced the size of value sequences greatly, performing the above manual optimization is not tedious work. We think this particular case shows the potential of value sequence refinement techniques to improve VaPD's accuracy. Finally, since the real meaning of each value and the logical connections among them have not been investigated in the current work, a better understanding of the values will enable us to further remove system noise or semantically irrelevant values.
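As an illustration of how such a heuristic could be automated, the sketch below filters out values emitted by instructions that behave like loop counters. The representation of the trace as (instruction address, value) pairs and the "consecutive increments only" rule are our own simplifying assumptions, not the refinement step actually implemented in VaPD.

    # Hypothetical refinement heuristic: drop values emitted by instructions
    # that behave like loop counters, i.e. every value the instruction produces
    # is exactly one more than the previous value it produced.

    from collections import defaultdict

    def drop_loop_counter_values(trace):
        """trace: list of (instr_address, value) pairs in execution order.
        Returns the value sequence with suspected loop-counter values removed."""
        history = defaultdict(list)
        for addr, val in trace:
            history[addr].append(val)

        def looks_like_counter(vals):
            # require a few samples, each one greater than the last by exactly 1
            # (e.g. the values produced by add $0x01, %ebx in a counting loop)
            return len(vals) >= 3 and all(b - a == 1 for a, b in zip(vals, vals[1:]))

        counters = {addr for addr, vals in history.items() if looks_like_counter(vals)}
        return [val for addr, val in trace if addr not in counters]

    # Example: the values from address 0x8049a10 form 1,2,3,4 and are filtered out.
    trace = [(0x8048f00, 42), (0x8049a10, 1), (0x8049a10, 2),
             (0x8049a10, 3), (0x8049a10, 4), (0x8048f20, 777)]
    print(drop_loop_counter_values(trace))   # [42, 777]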

7.3 Counterattacks Next we discuss several major security challenges. One counterattack that a plagiarist may come up with is rewriting a loop in a reverse order. However, automatic loop reversal is very difficult because it may result in a semantically different program, and so far we are not aware of such tools. Although theoretically some specific types of loops that are not tightly bound to their loop counters could be reversed, reversing only the loop counter variable will not affect the whole value sequence, because we can eliminate values produced by loop counters (thanks to dynamic taint analysis). Finally, one might manually reverse a loop, if at all possible, but its impact would be very limited for a large program. Another possible counterattack is noise injection. We note that automated noise injection is hard because if the injected noise is not tainted, it will be filtered out by dynamic taint analysis. However, if injected successfully, noise could dramatically increase the size of an extracted value sequence, thus slowing down the similarity score computation and consuming more memory. We will consider sliding over a stream of values so that we keep only a small portion of a value sequence in memory at runtime. We also note that under the LCS metric, injection of an extremely large amount of noise might increase the similarity score. If an innocent program happens to generate many noisy values, this raises the chance of false accusation; however, for a plagiarist who wants to hide the plagiarism, intentionally injected noisy values result in a higher chance of being accused. Therefore, if a plagiarist comes to know the mechanism of VaPD, he will not try to evade VaPD by injecting random noise. Also, a plagiarist may aim to thwart dynamic taint analysis. Because taint analysis considers only data flow dependence, it loses taint if an attacker launders the data flow dependence through control flow dependence. When x is tainted, if(x==a) y=a can transfer x to y without tainting y, but only if a is statically predictable. Because this is not easily done by automated obfuscators, we consider the more general case of a loop such as while (y < x) y++, which copies the value of x into y through control flow dependence alone; detecting this type of laundering would require tracking control dependence in addition to data flow dependence.
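The following toy sketch illustrates why data-flow-only taint propagation misses this kind of laundering: the loop assigns y only untainted constants, yet y ends up equal to the tainted x. The explicit "tainted" dictionary is our own simplified model, not the taint engine used in VaPD.

    # Toy illustration (not VaPD's taint engine): with data-flow-only
    # propagation, copying x into y through a counting loop leaves y untainted
    # even though y ends up equal to the tainted input x.

    def direct_copy(x_val):
        # y = x: a data flow dependence, so taint propagates to y
        y = x_val
        return y, {"y": True}

    def laundered_copy(x_val):
        # while (y < x) y++;  y only ever receives the untainted value y + 1,
        # so a policy tracking data flow dependence alone never marks y tainted.
        y = 0
        tainted = {"x": True, "y": False}
        while y < x_val:          # the control dependence on tainted x is ignored
            y = y + 1             # both operands of this assignment are untainted
        return y, tainted

    x = 1234                       # pretend x came from tainted program input
    print(direct_copy(x))          # (1234, {'y': True})
    print(laundered_copy(x))       # (1234, {'x': True, 'y': False})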

7.4 Limitations Our LCS based metric might have a fundamental problem: it is likely to cause false positives when a short value sequence is compared to a very long value sequence. However, we observed only relatively low similarity scores (ranging from 0.000 to 0.562) while comparing the original JFlex (whose value sequence has length only 176) to 21 programs with longer value sequences (ranging from 224 to 67,886; 15,073.05 on average) chosen from the /bin and /usr/bin directories of a Fedora Core 7 Linux system. Besides testing the current metric with more programs, we will study other metrics such as n-gram and time sequence analysis. In addition, even though no existing obfuscator is capable of this, our current comparison based value sequence refinement technique cannot identify opaque predicates that produce varying values satisfying certain conditions. We hope software analysis techniques such as symbolic execution [20] will help us identify such conditions and develop new comparison based value sequence refinement techniques.
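As a sketch of one of the alternative metrics mentioned above, an n-gram based score could look like the following. The choice of n = 4 and the containment-style normalization are assumptions made purely for illustration; this metric has not been evaluated in this paper.

    # Hypothetical n-gram metric (not evaluated here): treat each value sequence
    # as a set of overlapping n-grams and compare the sets, which is less
    # sensitive to length imbalance than an LCS-based score.

    def ngrams(seq, n=4):
        return {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}

    def ngram_containment(short, long, n=4):
        """Fraction of the shorter sequence's n-grams that also occur in the longer one."""
        a, b = ngrams(short, n), ngrams(long, n)
        return len(a & b) / len(a) if a else 0.0

    # Example: a short sequence embedded in a much longer one still scores high,
    # while an unrelated short sequence scores 0.0.
    core = [9, 8, 7, 6, 5, 4, 3, 2]
    long_seq = list(range(100)) + core + list(range(100, 200))
    print(ngram_containment(core, long_seq))                # 1.0
    print(ngram_containment([1, 1, 2, 2, 3, 3], long_seq))  # 0.0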

7.5 Applicability We can extend VaPD in the following ways. We note that our value-based idea is not restricted to detecting plagiarism of code written in compiler-based programming languages such as C/C++. Indeed, we can port the value-based idea to detecting plagiarism of code written in non-compiler-based languages such as PHP, Perl, JavaScript, VBS, etc., as long as we have a way to capture and analyze the values. Also, VaPD does not depend on a particular processor or operating system, since QEMU is a generic processor emulator; by extending the underlying VM, our framework can easily be extended to support other platforms. Moreover, we can implement the VM-based analyzer on other virtual machines, such as the Java VM, as well. We will also explore other applications of value sequences. For example, in defense against malware such as Internet worms, the time to react to an outbreak is crucial to containing the epidemic effectively. By providing an easy way to quickly match an unknown worm code sample against a large number of known samples in a database, our value-based technique may help identify a variant of a known worm.

8 Conclusion

In this paper, we propose a value-based approach, a novel technique to detect software plagiarism. The proposed approach uses a VM-based dynamic analyzer that examines x86 Linux executable files; therefore we do not need source code to prove substantial similarity of any two programs. We analyzed not-too-small real world programs, including five XML parsers, Gzip, Zip, Bzip2, JLex, and JFlex, along with a comprehensive set of obfuscation techniques of SandMark. We also provide preliminary experimental results on detecting plagiarisms heavily transformed through multiple obfuscation techniques. Our value-based detection successfully identified all the programs that are not identical in form but are actually the same, with no false accusations. While our study indicates that VaPD has great potential in detecting software plagiarism, we still have important issues and challenges to address through further research. Our future work includes developing value sequence refinement techniques, understanding the meanings of values, and exploring the broad application areas of the value-based approach.

References

[1] Code virtualizer. http://www.oreans.com/codevirtualizer.php.
[2] Compuware, IBM settle lawsuit. http://news.zdnet.com/2100-3513_22-5629876.html.
[3] A conversation with intellectual property attorney G. Gervaise Davis. http://www.gihyo.co.jp/magazine/SD/pacific/SD_0309.html.
[4] Java obfuscator - Zelix KlassMaster. http://www.zelix.com/klassmaster/features.html.
[5] JFlex - the fast scanner generator for Java. http://jflex.de/.
[6] JLex: A lexical analyzer generator for Java(TM). http://www.cs.princeton.edu/~appel/modern/java/JLex/.
[7] B. S. Baker. On finding duplication and near-duplication in large software systems. In WCRE '95: Proceedings of the Second Working Conference on Reverse Engineering, page 86, Washington, DC, USA, 1995. IEEE Computer Society.
[8] I. D. Baxter, A. Yahin, L. Moura, M. Sant'Anna, and L. Bier. Clone detection using abstract syntax trees. In Int. Conf. on Software Maintenance, 1998.
[9] F. Bellard. QEMU, a fast and portable dynamic translator. In ATEC '05: Proceedings of the Annual Conference on USENIX Annual Technical Conference, pages 41–41, Berkeley, CA, USA, 2005. USENIX Association.
[10] C. Collberg, G. Myles, and A. Huntwork. Sandmark–a tool for software protection research. IEEE Security and Privacy, 1(4):40–49, 2003.
[11] C. Collberg, C. Thomborson, and D. Low. A taxonomy of obfuscating transformations. Technical Report 148, The University of Auckland, July 1997.
[12] C. Collberg, C. Thomborson, and D. Low. Manufacturing cheap, resilient, and stealthy opaque constructs. In POPL '98: Proceedings of the 25th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pages 184–196, New York, NY, USA, 1998. ACM.
[13] M. Egele, C. Kruegel, E. Kirda, H. Yin, and D. Song. Dynamic spyware analysis. In USENIX Annual Technical Conference, pages 233–246. USENIX, 2007.
[14] M. Gabel, L. Jiang, and Z. Su. Scalable detection of semantic clones. In ICSE '08: Proceedings of the 30th International Conference on Software Engineering, pages 321–330, New York, NY, USA, 2008. ACM.
[15] J.-H. Ji, G. Woo, and H.-G. Cho. A source code linearization technique for detecting plagiarized programs. In ITiCSE '07: Proceedings of the 12th Annual SIGCSE Conference on Innovation and Technology in Computer Science Education, pages 73–77, New York, NY, USA, 2007. ACM.
[16] L. Jiang, G. Misherghi, Z. Su, and S. Glondu. Deckard: Scalable and accurate tree-based detection of code clones. In ICSE '07: Proceedings of the 29th International Conference on Software Engineering, pages 96–105, Washington, DC, USA, 2007. IEEE Computer Society.
[17] L. Jiang, Z. Su, and E. Chiu. Context-based detection of clone-related bugs. In ESEC-FSE '07: Proceedings of the 6th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, pages 55–64, New York, NY, USA, 2007. ACM.
[18] T. Kamiya, S. Kusumoto, and K. Inoue. CCFinder: A multilinguistic token-based code clone detection system for large scale source code. IEEE Trans. Softw. Eng., 28(7), 2002.
[19] Y.-C. Kim and J. Choi. A program plagiarism evaluation system. In Information and Communication Technology Education Workshop, 2005.
[20] J. C. King. Symbolic execution and program testing. Commun. ACM, 19(7):385–394, 1976.
[21] K. Kontogiannis, M. Galler, and R. DeMori. Detecting code similarity using patterns. In Working Notes of the 3rd Workshop on AI and Software Engineering, 1995.
[22] J. Krinke. Identifying similar code with program dependence graphs. In Proc. of the 8th Working Conference on Reverse Engineering, 2001.
[23] C. Liu, C. Chen, J. Han, and P. S. Yu. GPLAG: Detection of software plagiarism by program dependence graph analysis. In KDD '06: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 872–881, New York, NY, USA, 2006. ACM.
[24] G. Myles and C. S. Collberg. Detecting software theft via whole program path birthmarks. In K. Zhang and Y. Zheng, editors, ISC, volume 3225 of Lecture Notes in Computer Science, pages 404–415. Springer, 2004.
[25] G. Myles and C. S. Collberg. K-gram based software birthmarks. In SAC, 2005.
[26] J. Newsome and D. Song. Dynamic taint analysis for automatic detection, analysis, and signature generation of exploits on commodity software. In Proceedings of the Network and Distributed System Security Symposium (NDSS 2005), 2005.
[27] L. Prechelt, G. Malpohl, and M. Philippsen. Finding plagiarisms among a set of programs with JPlag. Universal Computer Science, 2000.
[28] S. Schleimer, D. S. Wilkerson, and A. Aiken. Winnowing: Local algorithms for document fingerprinting. In Proc. of the ACM SIGMOD Int. Conf. on Management of Data, 2003.
[29] D. Schuler, V. Dallmeier, and C. Lindig. A dynamic birthmark for Java. In ASE '07: Proceedings of the 22nd IEEE/ACM International Conference on Automated Software Engineering, pages 274–283, New York, NY, USA, 2007. ACM.
[30] H. Tamada, M. Nakamura, A. Monden, and K. Matsumoto. Design and evaluation of birthmarks for detecting theft of Java programs. In Proc. IASTED International Conference on Software Engineering (IASTED SE 2004), pages 569–575, Innsbruck, Austria, February 2004.
[31] H. Tamada, K. Okamoto, M. Nakamura, and A. Monden. Dynamic software birthmarks to detect the theft of Windows applications. In International Symposium on Future Software Technology 2004 (ISFST 2004), Xi'an, China, October 2004.
[32] H. Tamada, K. Okamoto, M. Nakamura, A. Monden, and K. Matsumoto. Design and evaluation of dynamic software birthmarks based on calls. Information Science Technical Report NAIST-IS-TR2007011, ISSN 0919-9527, Graduate School of Information Science, Nara Institute of Science and Technology, May 2007.
[33] N. Truong, P. Roe, and P. Bancroft. Static analysis of students' Java programs. In ACE '04: Proceedings of the Sixth Conference on Australasian Computing Education, pages 317–325, Darlinghurst, Australia, 2004. Australian Computer Society, Inc.
[34] C. Wang. A security architecture for survivability mechanisms. PhD thesis, Charlottesville, VA, USA, 2001. Adviser: John Knight.
[35] W. Yang. Identifying syntactic differences between two programs. Softw. Pract. Exper., 21(7):739–755, 1991.

Table 3: Multiple obfuscation techniques applied to JLex and JFlex

Control Obfuscation: Transparent Branch Insertion, Simple Opaque Predicates, Inliner, Insert Opaque Predicates, Dynamic Inliner, Interleave Methods, Method Merger, Opaque Branch Insertion, Reorder Instructions

Data Obfuscation: Array Splitter, Array Folder, Integer Array Splitter, String Encoder, Promote Primitive Registers, Variable Reassigner, Promote Primitive Types, Duplicate Registers, Boolean Splitter, Merge Local Integers
