Deducing Similarities in Java Sources from Bytecodes

The following paper was originally published in the Proceedings of the USENIX Annual Technical Conference (NO 98), New Orleans, Louisiana, June 1998.

Brenda S. Baker
Bell Laboratories, Lucent Technologies
700 Mountain Avenue, Murray Hill, NJ 07974
bsb@bell-labs.com, http://cm.bell-labs.com/~bsb/

Udi Manber
Department of Computer Science, University of Arizona
Tucson, AZ 85721
http://glimpse.cs.arizona.edu/udi.html

Abstract

Several techniques for detecting similarities of Java programs from bytecode files, without access to the source, are introduced in this paper. These techniques can be used to compare two files, to find similarities among thousands of files, or to compare one new file to an index of many old ones. Experimental results indicate that these techniques can be very effective. Even changes of 30% to the source file will usually result in bytecode that can be associated with the original. Several applications are discussed.

1 Introduction

Java bytecode, particularly in the form of applets, is geared to become the common way to execute programs through the web, and not only by traditional computers. Network computers, even appliances, are expected to be controlled by Java bytecode, which will arrive through the net transparently and, for the most part, automatically. There is obviously a great need to be able to control all these programs. There will be problems of security, management of updates, portability, handling preferences, deletions, and so on.

This paper introduces several techniques to help with one aspect of these problems. We show that it is possible, given a large set of bytecode files, to identify most of those that originate from similar source code. In addition, an index can be built from any number of bytecode files, such that given a new bytecode file one can very quickly find all the old ones that are similar to it. Our tools can also detail in a convenient way just which sections of the files are similar.

Our techniques make it possible to identify bytecode files that came from the same source without knowing the actual source, even if significant parts of the source files are different. The files do not even have to be of similar sizes. For example, the original source code files may be different versions of the same program or different programs containing similar sections.

Finding similarities from compiled code is much more difficult than finding similarities in source code or text, because a small change in the source code can produce completely different binaries when they are viewed just as sequences of bytes.

To accomplish these goals, we adapt three tools designed to find similarity in source code and text to work with bytecode files. Siff¹ [24] is designed to analyze a large number of text files to find pairs that contain a significant number of common blocks ("fingerprints"), where a block corresponds to a non-trivial segment of code, usually a few lines. Dup [5] searches sets of source files to look for sufficiently long sections that match except for systematic transformations of names, such as variable names, and can deal efficiently with a few million lines of source code. The UNIX utility diff (described in [21]) uses dynamic programming to identify line-by-line changes (insertions or deletions) from one file to another, and is useful for detailed comparison of two files and for automated distribution of patches.

¹ The original name was SIF, but at some unknown point during the development an extra f was added to make it more natural.

None of the programs mentioned above was designed to deal with binaries. The only work that we know of that may have applied one of these to binaries is .RTPatch [27], a tool for creating patches (differences between files) for updating files. This tool is claimed to work for arbitrary files, even binaries; while no technical description is given of their methods, it seems likely that it applies the diff algorithm to arbitrary bytes rather than to hashes of lines of text. We are not familiar with any work that can find similarities in large numbers of binaries.

Our adaptations of siff and diff do not work directly on bytecode files, or even on disassembled bytecode files. Bytecode files contain many indices into tables, and values of most indices can be changed by a slight (even one-character) change to a Java source file. Therefore, we encode disassembled bytecode files in a "normal form" which takes into account positional values that are less affected by small changes than absolute values of indices. This encoding, based on a technique first used in dup, enables the programs to find the hidden similarity in bytecode files. Since this technique is already incorporated into dup, dup can work on disassembled bytecode files; however, additional simple preprocessing improves its performance.

There are other tools for finding similarities in text or source code, as well. However, tools based on style metrics, such as [7, 19, 25], or data flow graphs [17] would require decompilation of bytecode files in order to be applied. Some other tools based on fingerprints, such as [16, 20, 10], chunks of text [9, 30, 34], or visualization via a graphical user interface [11] may be adaptable to bytecode files using the same techniques that we use for siff, dup, and diff.

To search for similar files in a large set of bytecode files, we run siff on the encoded disassembled bytecode files and dup on the preprocessed disassembled bytecode files. We show that combining the output from siff and dup is more effective than either individually at finding similar files while keeping false positives low. We then use dup and diff on the similar pairs to examine the similarities in more detail.

Our programs are fast and can analyze thousands of bytecode files in minutes. A new file can be compared to an index of thousands of existing files in a second or two. The number of false positives is kept very small. Our programs are written in C and run under UNIX. Detailed experimental results are given later in the paper.

We foresee several applications for our tool.

Program management: When people have numerous Java classes, and get many more on a regular basis, there will be a great need to organize them, often not according to the original "plan." Being able to tell which classes are similar can be very helpful. Sometimes a similarity will identify the source. Knowing that a very similar class is already installed on one's disk may be helpful in deciding what to do with the new class. It can also help with version control, etc. In some cases, especially for programs that perform mostly arithmetic operations, it may be possible to identify programs that implement the same algorithm, even when they are written by independent writers. For example, when we ran experiments on thousands of arbitrary class files, we discovered two MD5 programs with 78% similarity in their bytecode files.

Plagiarism detection: Our approach will identify stolen code if only minor changes are made to it (including any amount of syntactic change such as changing variable names). However, the advance in sophistication of obfuscators [13, 12], especially for Java, would allow someone to hide any code segment pretty well, and our methods (and very possibly any other method) will not be able to identify it. On the other hand, one may be able to identify that obfuscators were used, which may be good enough (e.g., in a classroom).

Program reuse and reengineering: It will obviously be useful to know which classes are similar when trying to reuse them. Identifying versions is a good example of that.

Uninstallers: If you finally obtain the exact code you wanted, then you probably want to get rid of other code that is very similar.

Security: Detecting that a class is similar to a known bad class can be very helpful.

Compression: It may be possible to store only differences among classes that are very similar. But first one must detect all such similarities. This could be especially useful for low-storage devices.

Clustering: Small similarities are quite typical among programs written by the same person.

... modes: all-against-all and one-against-all. The first mode finds all groups of similar files in a large file system and gives a rough indication of the similarity. The running time is essentially linear in the total size of all files, which makes siff scalable. A sort, which is not a linear-time routine, is required, but it is performed only on fingerprints, and therefore does not dominate the running time unless the total ...
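The two ideas outlined above, a positional encoding of table indices followed by an all-against-all comparison that sorts only fingerprints, can be illustrated with a small sketch. This is not the authors' C implementation. The function names (`encode_positional`, `fingerprints`, `similar_pairs`), the `#n` operand syntax, the fixed window of three lines, the use of SHA-1, and the similarity score are all assumptions made for illustration. In particular, replacing an index by the distance back to its previous use is only one plausible reading of "positional values".

```python
import hashlib
from collections import defaultdict
from itertools import combinations, groupby


def encode_positional(lines):
    """Replace each absolute index operand (assumed to look like '#7') by the
    distance, in lines, back to the previous use of the same index.
    The first use of an index becomes '#new'."""
    last_seen = {}
    encoded = []
    for pos, line in enumerate(lines):
        tokens = []
        for tok in line.split():
            if tok.startswith("#"):
                prev = last_seen.get(tok)
                tokens.append("#new" if prev is None else f"#{pos - prev}")
                last_seen[tok] = pos
            else:
                tokens.append(tok)
        encoded.append(" ".join(tokens))
    return encoded


def fingerprints(lines, window=3):
    """Hash every run of `window` consecutive lines (a hypothetical block size)."""
    count = max(len(lines) - window + 1, 1)
    return {
        hashlib.sha1("\n".join(lines[i:i + window]).encode()).hexdigest()
        for i in range(count)
    }


def similar_pairs(files, window=3, threshold=0.5):
    """All-against-all: sort (fingerprint, file) records so that equal
    fingerprints become adjacent, then count shared fingerprints per pair."""
    prints = {name: fingerprints(encode_positional(body), window)
              for name, body in files.items()}
    records = sorted((fp, name) for name, fps in prints.items() for fp in fps)
    shared = defaultdict(int)
    for _, group in groupby(records, key=lambda rec: rec[0]):
        owners = [name for _, name in group]
        for a, b in combinations(owners, 2):
            shared[(a, b)] += 1
    result = {}
    for (a, b), count in shared.items():
        score = count / min(len(prints[a]), len(prints[b]))
        if score >= threshold:
            result[(a, b)] = score
    return result


# Toy disassembly: v2 is v1 after its table indices shifted (#7 -> #8, #9 -> #11).
v1 = ["aload_0", "getfield #7", "iconst_1", "iadd",
      "putfield #7", "aload_0", "invokevirtual #9", "return"]
v2 = ["aload_0", "getfield #8", "iconst_1", "iadd",
      "putfield #8", "aload_0", "invokevirtual #11", "return"]
other = ["iconst_0", "istore_1", "iload_1", "bipush 10",
         "if_icmpge 20", "iinc 1 1", "goto 2", "return"]

pairs = similar_pairs({"A.class": v1, "B.class": v2, "C.class": other})
print(pairs)
```

In the toy data, the raw listings `v1` and `v2` differ on three lines, yet their encoded forms are identical, so every fingerprint is shared. The only superlinear step is the sort over fingerprint records, which is the property the text attributes to the all-against-all mode.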
