Plagiarism Detection for Multithreaded Software Based on Thread-Aware Software Birthmarks

Plagiarism Detection for Multithreaded Software Based on Thread-Aware Software Birthmarks Zhenzhou Tian1, Qinghua Zheng1, Ting Liu1∗, Ming Fan1, Xiaodong Zhang1, Zijiang Yang2, 3 1 MOEKLINNS, Department of Computer Science and Technology, Xi’an Jiaotong University, Xi’an 710049, China 2 Department of Computer Science, Western Michigan University, Kalamazoo, MI 49008, USA 3 College of Computer and Technology, Xi’an University of Technology, 710048, China {zztian,fanming.911025,oijiaoda}@stu.xjtu.edu.cn; {qhzheng,tingliu}@mail.xjtu.edu.cn; [email protected] ABSTRACT for distributing Busybox in its FIOS wireless routers [1], and the The availability of inexpensive multicore hardware presents a crisis of Skype’s VOIP service for the violation of licensing terms turning point in software development. In order to benefit from of Joltid. Unfortunately software plagiarism is easy to implement the continued exponential throughput advances in new processors, but very difficult to detect. The unavailability of source code and the software applications must be multithreaded programs. As the existence of powerful automated semantic-preserving code multithreaded programs become increasingly popular, plagiarism obfuscation tools [8] are a few reasons that make software of multithreaded programs starts to plague the software industry. plagiarism a daunting task. Nevertheless, researchers welcomed Although there has been tremendous progress on software this challenge and developed effective methods. Software plagiarism detection technology, existing dynamic approaches watermarking is one of the earliest and most widely adopted remain optimized for sequential programs and cannot be applied techniques. A watermark is a unique identifier embedded in a to multithreaded programs without significant redesign. This program before its distribution. Being hard to remove but easy to paper fills the gap by presenting two dynamic birthmark based verify, watermarks can serve as a strong evidence for occurrences approaches. The first approach extracts key instructions while the of software plagiarism. However, watermarks in a program may second approach extracts system calls. Both approaches consider be eliminated by code obfuscations. It is also believed that a the effect of thread scheduling on computing software birthmarks. sufficiently determined attacker will eventually be able to defeat We have implemented a prototype based on the Pin any watermark [7]. In order to address the problem, the concept of instrumentation framework. Our empirical study shows that the software birthmark was proposed. A birthmark is a set of proposed approaches can effectively detect plagiarism of characteristics extracted from a program that reflect the program’s multithread programs and exhibit strong resilience to various intrinsic properties and can be used to uniquely identify the semantic-preserving code obfuscations. program. As illustrated in [17], with proper algorithms birthmarks may identify software theft even after code obfuscations. Categories and Subject Descriptors Despite the tremendous progress in software plagiarism detection K.5.1 [Legal Aspects of Computing]: Hardware/Software technology, a new trend in software development greatly threatens Protection—Copyrights, Licensing; K.4.1 [Computer and its effectiveness. In recent years, from smartphones to servers, Society]: Public Policy Issues—Intellectual property rights multicore processors are now ubiquitous. The availability of inexpensive multicore hardware presents a turning point in General Terms software development. In order for software applications to Experimentation, Security, Legal Aspects benefit from the continued exponential throughput advances in new processors, the applications must be multithreaded programs. Keywords The trend towards multithreaded programs is creating a gap Software Birthmark, Plagiarism Detection, Multithreaded between the current software development practice and the Program software plagiarism detection technology as the existing dynamic approaches remain optimized for sequential programs and cannot 1. INTRODUCTION be applied to multithreaded without significant redesign. Software plagiarism is becoming a serious threat to the healthy Figure 1 shows a multithreaded program that is taken from a test development of the software industry. The recent incidents case used in the WET [25] project with slight modifications. We include the lawsuit against Verizon by Free Software Foundation apply two widely used software plagiarism detection approaches based on software birthmarks: Dynamic Key Instruction Sequence *Corresponding Author Birthmark (DKISB) [22] and System Call Short Sequence Birthmark (SCSSB) [24]. We execute the program multiple times Permission to make digital or hard copies of all or part of this work for under the same inputs. For each run we use DKISB or SCSSB to Permission to make digital or hard copies of all or part of this work for personal or personal or classroom use is granted without fee provided that copies are extract a software birthmark and then compare the similarity classroom not made use or is granteddistributed without for fee profit provided or thatcommercial copies are a notdvantage made or and distributed that for profit or commercial advantage and that copies bear this notice and the full citation between the birthmarks across different runs. The similarity is on copies the first bear page. this Copyrights notice and for components the full citation of this workon the owned first by page. others To than copy ACM mustotherwise, be honored. or Abstractingrepublish, withto post credit on is permitted.servers or To copyto redistribute otherwise, or to republish, lists, computed using four different metrics, including Cosine distance, to post on servers or to redistribute to lists, requires prior specific permission and/or a Jaccard index, Dice coefficient and Containment [22, 20, 6, 24], fee.requires Request prior permissions specific from permission [email protected]. and/or a fee. that are widely used in birthmark based plagiarism detection ICPC'14, June 2–3, 2014, Hyderabad, India. ICPC’14Copyright, June 2014 2–3, ACM 2014, 978 Hyderabad,-1-4503-2879 India-1/14/06… $15.00. literature. According to its definition, a birthmark can uniquely Copyright 2014 ACM 978-1-4503-2879-1/14/06...$15.00 http://dx.doi.org/10.1145/2597008.2597143 304 #include <stdio.h> In this paper, we present thread-aware algorithms that effectively #include <unistd.h> detect plagiarism of multithreaded programs at the binary level. #include <pthread.h> Unlike many existing approaches [14, 19, 11] that require source #include<stdlib.h> code, our approach uses binary because source code is usually #define N 8 unavailable when birthmark techniques are used to obtain the pthread_t mThread[N]; initial evidence of software plagiarism. We name our two void *run(void *data){ approaches TW-DKISB (Thread Aware Dynamic Key Instruction int tid; Sequence Birthmark) and TW-SCSSB (Thread Aware System tid =(int) data; Call Short Sequence Birthmark) that amend the existing printf("hello world from thread %d\n",tid); approaches of DKISB and SCSSB, respectively. We exploit two return NULL; } models to abstract the thread information during birthmark int main(int argc, char *argv[]){ extraction. The similarity of birthmarks is computed using two int rc, i; matching algorithms on the four metrics, i.e. Cosine Distance, int count; Jaccard Index, Dice Coefficient and Containment [22, 20, 6, 24]. printf("input a number please: \n"); scanf("%d",&i); We have implemented a prototype and conducted experiments on for(i;i<N; i++){ 134 versions of 24 multithreaded programs. The preliminary rc = pthread_create(&mThread[i], NULL, run, (void *) i); results show that our approach is effective for multithreaded if (rc) software plagiarism detection. In addition, our approach exhibits printf("create thread failed. error code = %d\n", rc);} strong resilience to both weak obfuscations obtained by various for(i=0;i<N; i++) compiler optimizations, and strong obfuscations supported by pthread_join(mThread[i], NULL); obfuscators such as SandMark [8] and Allatori [3]. printf("main thread finished\n"); The remainder of the paper is organized as follows. Section 2 return 0; } introduces necessary concepts and describes our methods to Figure 1. A simple multithreaded program extracting and comparing birthmarks. A prototype overview is identify the program from which the birthmark is extracted. also briefly described at the end of this section. Section 3 presents Therefore, we expect highly similar birthmarks as we are the empirical study, followed by the related works in Section 4. executing the same program under the same inputs. That is, we Finally we conclude the paper in Section 5. expect current approaches to claim plagiarism in this experiment. 2. THREAD AWARE BIRTHMARKS However, as shown in Table 1, the data contradict what we have expected. For DKISB, the similarity scores are between 0.55 and BASED PLAGIARISM DETECTION 0.85. As for SCSSB, no score is greater than 0.55. In most literature, a similarity score above 0.8 usually means definite 2.1 Software Birthmarks plagiarism and a score below 0.2 usually means definite A software birthmark is a set of characteristics extracted from a independent programs. Therefore, the widely used birthmark- program that reflects intrinsic properties of the program. based software plagiarism detection techniques fail to declare Depending on whether its extraction

Load more