The following paper was originally published in the Proceedings of the USENIX Annual Technical Conference (NO 98) New Orleans, Louisiana, June 1998
Deducing Similarities in Java Sources from Bytecodes
Brenda S. Baker Bell Laboratories, Lucent Technologies Udi Manber University of Arizona
For more information about USENIX Association contact: 1. Phone: 510 528-8649 2. FAX: 510 548-5738 3. Email: [email protected] 4. WWW URL:http://www.usenix.org/
Deducing Similarities in Java Sources from Byteco des
Brenda S. Baker
Bel l Laboratories
Lucent Technologies
700 Mountain Avenue
Murray Hil l, NJ 07974
bsb@b ell-labs.com, http://cm.bell-labs.com/~bsb/
Udi Manber
Department of Computer Science
University of Arizona
Tucson, AZ 85721
[email protected], http://glimpse.cs.arizona.edu/udi.html
sible, given a large set of byteco de les, to iden- Abstract
tify most of those that originate from similar source
co de. In addition, an index can b e built from any
number of byteco de les, such that given a new
Several techniques for detecting similarities of Java
byteco de one can very quickly nd all the old ones
programs from byteco de les, without access to the
that are similar to it. Our to ols also can detail in
source, are intro duced in this pap er. These tech-
a convenientway just which sections of the les are
niques can be used to compare two les, to nd
similar.
similarities among thousands of les, or to compare
one new le to an index of many old ones. Exp eri-
Our techniques make it p ossible to identify byte-
mental results indicate that these techniques can b e
co de les that came from the same source without
very e ective. Even changes of 30 to the source
knowing the actual source even if signi cant parts
le will usually result in byteco de that can be as-
of the source les are di erent. The les do not even
so ciated with the original. Several applications are
havetobe of similar sizes. For example, the orig-
discussed.
inal source co de les may be di erent versions of
the same program or di erent programs containing
similar sections.
1 Intro duction
Finding similarities from compiled co de is much
more dicult than nding similarities in source co de
or text, b ecause a small change in the source co de
Javabyteco de, particularly in the form of applets,
can pro duce completely di erent binaries when they
is geared to b ecome the common way to execute
are viewed just as sequences of bytes.
programs through the web, and not only by tradi-
tional computers. Network computers, even appli-
To accomplish these goals, we adapt three to ols de-
ances, are exp ected to be controlled byJavabyte-
signed to nd similarity in source co de and text to
co de, which will arrive through the net transpar-
1
work with byteco de les. Si [24] is designed
ently and for the most part, automatically. There
to analyze large number of text les to nd pairs
is obviously a great need to be able to control all
that contain a signi cantnumb er of common blo cks
these programs. There will b e problems of security,
\ ngerprints", where a blo ck corresp onds to a
management of up dates, p ortability, handling pref-
non-trivial segment of co de, usually a few lines.
erences, deletions, and so on.
1
The original name was SIF, but at some unknown p oint
This pap er intro duces several techniques to help one
during the development an extra f was added to make it more
asp ect of these problems. We show that it is p os- natural.
vidually at nding similar les while keeping false Dup [5], searches sets of source les to lo ok for
p ositives low. We then use dup and di on the sim- suciently long sections that match except for sys-
ilar pairs to examine the similarities in more detail. tematic transformations of names, suchasvariable
names, and can deal eciently with a few mil-
Our programs are fast and can analyze thousands lion lines of source co de. The UNIX utility di
of byteco de les in minutes. A new le can b e com- describ ed in [21] uses dynamic programming to
pared to an index of thousands of existing les in identify line-by-line changes insertions or deletions
a second or two. The number of false p ositives is from one le to another, and is useful for detailed
kept to avery small minimum. Our programs are comparison of two les and for automated distribu-
written in C and run under UNIX. Detailed exp eri- tion of patches.
mental results are given later in the pap er.
None of the programs mentioned ab ove was de-
We foresee several applications for our to ol. signed to deal with binaries. The only work that
we knowof that mayhave applied one of these to
program management: When p eople havenumerous binaries is .RTPatch [27], a to ol for creating patches
Java classes, and get many more on a regular basis, di erences between les for up dating les. This
there will be a great need to organize them, often to ol is claimed to work for arbitrary les, even bina-
not according to the original \plan." Being able ries; while no technical description is given of their
to tell which classes are similar can be very help- metho ds, it seems likely that it applies the di al-
ful. Sometimes a similarity will identify the source. gorithm to arbitrary bytes rather than to hashes of
Knowing that a very similar class is already installed lines of text. We are not familiar with anywork that
on one's disk may b e helpful in deciding what to do can nd similarities in large numb er of binaries.
with the new class. It can also help with version
control, etc. In some cases, esp ecially for programs Our adaptations of si and di do not work directly
that p erform mostly arithmetic op erations, it may on byteco de les, or even on disassembled byteco de
b e p ossible to identify programs that implement the les. Byteco de les contain many indices into ta-
same algorithm, even when they are written by in- bles, and values of most indices can b e changed bya
dep endent writers. For example, when we ran ex- slight even one character change to a Java source
p eriments on thousands of arbitrary class les, we le. Therefore, we enco de disassembled byteco de
discovered two MD5 programs with 78 similarity les in a \normal form" which takes into accountpo-
in their byteco de les. sitional values that are less a ected by small changes
than absolute values of indices. This enco ding,
Plagiarism detection: Our approach will identify based on a technique rst used in dup, enables the
stolen co de if only minor changes are made to it programs to nd the hidden similarityinbyteco de
including any amountofsyntactic changes suchas les. Since this technique is already incorp orated
changing variable names. However, the advance in into dup, dup can work on disassembled byteco de
sophistication of obfuscators [13,12], esp ecially for les; however, additional simple prepro cessing im-
Java, would allow someone to hide any co de segment proves its p erformance.
prettywell, and our metho ds and very p ossibly any
other metho d will not b e able to identify it. One There are other to ols for nding similarities in text
may b e able to identify that obfuscators were used, or source co de, as well. However, to ols based on
on the other hand, whichmay b e go o d enough e.g., style metrics, suchas[7,19,25], or data ow graphs
in a classro om. [17] would require decompilation of byteco de les
in order to b e applied. Some other to ols based on
Program reuse and reengineering: It will obviously ngerprints, such as [16,20,10], chunks of text [9,
be useful to know which classes are similar when 30,34], or visualization via a graphical user interface
trying to reuse them. Identifying versions is a go o d [11] may be adaptable to byte co de les using the
example of that. same techniques that we use for si , dup, and di .
Uninstal lers: If you nally obtain the exact co de To search for similar les in a large set of byteco de
you wanted, then you probably want to get rid of les, we run si on the enco ded disassembled byte-
other co de that is very similar. co de les and dup on the prepro cessed disassembled
byteco de les. We show that combining the output
Security: Detecting that a class is similar to a known from si and dup is more e ective than either indi-
bad class can b e very helpful. mo des: all-against-all and one-against-all. The rst
mo de nds all groups of similar les in a large le
Compression: It may b e p ossible to store only di er- system and gives a rough indication of the similarity.
ences among classes that are very similar. But rst The running time is essentially linear in the total
one must detect all such similarities. This could b e size of all les, which makes si scalable. A sort,
esp ecially useful for low-storage devices. which is not a linear-time routine, is required, but
it is p erformed only on ngerprints, and therefore
Clustering: Small similarities are quite typical do es not dominate the running time unless the total
among programs written by the same p erson. Our size of all les is many GBytes. The second mo de
to ols cannot by themselves cluster everything, but compares a given le to a prepro cessed approximate
they can help a clustering program identify many index of all other les, and determines very quickly
candidates. all les that are similar to the given le. In b oth
cases, similarity can b e detected even if the similar
Miscel laneous: There was recently discussion in the p ortions constitute as little as 25 of the size of the
press ab out Sun Microsystems allegedly using a smaller le for smaller p ercentages, the probability
copy of a p opular b enchmark program directly in its of false p ositives is non trivial, so although si will
Java compiler. We b elieve that our program would work, its output may b e to o large.
have discovered that easily and could be used to
continuously monitor for cases like that.
2.2 Dup
The rest of the pap er is organized as follows. The
next section describ es the three similarity to ols that
we adapt to Javabyteco de les. Section 3 describ es
Dup [4, 5, 6] lo oks for similarity of source co des
howwe pro cess Java class les to make them suit-
based on nding suciently long sections of co de
able for comparisons by dup, si , and di . In Sec-
that almost match. Dup's notion of almost-
tion 4 we present exp erimental results. These in-
matching is the parameterized match p-match: two
clude exp eriments with si alone, si and dup to-
sections of co de are a p-match if they are the same
gether, and di alone. Section 5 describ es related
textually except p ossibly for a systematic change of
work. We end with conclusions and future research.
variable names; e.g. if all o ccurrences of \count"
and \fun" in one section are changed to \num" and
\fo o", resp ectively, in the other. For a threshold
length in lines sp eci ed by the user, dup rep orts all
2 To ols for nding similarity
longest p-matches over the threshold length within
a list of les or between two lists of les. It can
compute the p ercentage of duplication b etween two
This section describ es the three to ols di , si , and
les or the p ercentage of duplication for the cross-
dup that we adapt to nding similarity in Java
pro duct of two lists of les. It can also generate a
byteco de les. All three were designed originally
pro le that shows the matches involving each line
for source or text les.
in the input and a plot showing where matches o c-
cur. Dup has b een found useful for identifying unde-
sirable duplication within a large software system,
2.1 Si
lo oking for plagiarism b etween systems, and for an-
alyzing the divergence of two systems of common
origin.
Si [24] is a program to nd similarities in text les.
It uses a sp ecial kind of random sampling, such that Using dup, one maycho ose to base a notion of sim-
if a sample taken from one le app ears in another ilarity on the existence of matching sections over a
it is guaranteed to be taken there to o. A set of threshold length, on the p ercentage of common co de
such samples, which are called in si approximate resulting from these matching sections, or on some
ngerprints, provides a compact representation of combination of the two.
a le. With high probability, sets of ngerprints
of two similar les will have large intersection, and The key idea that enables dup to identify p-matches
the ngerprints of two non-similar les will havea is to replace tokens such as identi ers by o -
very small intersection. Si works in two di erent sets to remove the identity of the identi er while
preserving the distinctness of di erent identi ers. tion of the three is more p owerful than any of them
In particular, the rst o ccurrence of an identi- individually.
er is replaced by a 0, and each later o ccurrence
of the same identi er is replaced by the number For searching large numb ers of les, si and dup
of tokens since the previous one. For example, can be used together either to broaden the notion
ux,y=x>y?x:y; and vw,z=w>z?w:z; are of similarity by taking the union of pairs of les
b oth enco ded as 00,0=6>6?5:5; b ecause u and found by si and dup or to make it more stringent
v o ccur in the same p ositions, as do the pair x by taking the intersection of the pairs rep orted as
and w and the pair y and z; the numb ers 6 and similar by dup and si .
5 represent the number of tokens single symb ols,
in this case b etween successive o ccurrences of the Si nds similarities that are not found by dup b e-
same identi er token. More generally, if max, arg1, cause o ccasional di erences may disrupt the longer
and arg2 are tokens, the same enco ding represents matches lo oked for by dup; on the other hand, dup
maxarg1,arg2=arg1>arg2?arg1:arg2;. nds some les that have long matches but aren't
rep orted by si b ecause the overall p ercentage of
Dup computes longest p-matches via a data struc- similarityin the les is to o low. For example, if a
ture called a parameterized sux tree p-sux tree. blo ck of co de is added to an otherwise unmo di ed
The p-sux tree is a compacted trie that represents le, the p ercentage of similarity might fall b elow
o set enco dings of suxes of the token string, but the p ercentage of similarity selected for si to re-
only uses linear space b ecause only the o set enco d- p ort, but dup would nd that the rest of the le
ing of the entire input is stored. At each access to was unchanged.
an o set, the previous context is used dynamically
to determine whether this is the rst use of this For more detailed analysis of les found to b e sim-
parameter within the sux b eing pro cessed. The ilar, dup, di , and gdi are useful. Dup provides
algorithms for building a p-sux tree and searching a list of matching sections of co de, a pro le show-
it for duplication are describ ed in [5, 6]. ing for each line what other lines match based on
matches over threshold length, and scatter plots for
visualization. Di and gdi provide an alignmentof
2.3 Di
the two les that enable lo oking at the di erences
line-by-line, and are esp ecially e ective when the
numb er of di erences is very small.
Di is the oldest to ol for nding commonality be-
tween les. Given two text les, di rep orts a min-
imal length edit script for the di erences between
the two les, where edit op erations include inser-
3 Adapting the to ols to Java byte-
tions and deletions. Many algorithms for minimal
co de les
edit scripts have app eared in the literature. A p opu-
lar version of di to day is the GNU implementation
based on an algorithm of Myers [26]. Because of the
This section describ es howwe adapt dup, si , and
quadratic worst-case running time, di can b e slow
di to handle Java class les. A class le, usually
for comparing large amounts of co de with many dif-
obtained by compiling a Java le, is a stream of
ferences.
bytes representing a single class in a form suitable
for the Java Virtual Machine [22]. We use the terms
Graphical user interfaces such as gdi which runs
\class les" and \byteco de les" interchangeably.
on SGI workstations under UNIX make it conve-
nient to lo ok at output from di by aligning iden-
tical sections of two les, side by side. Variants of
3.1 Information in Java class les
gdi exist for other platforms; a list is given in [15].
The rst items enco ded in a class le are a magic
2.4 Combining the to ols
numb er that identi es the le as a class le, the mi-
nor version numb er, and the ma jor version numb er.
The three to ols use three very di erent notions of AJava Virtual Machine of a particular ma jor and
similarity. In this pap er, we show that a combina- minor version number may run co de of the same
ma jor version number and the same or lower mi- `o', `c', `v', `i', `u', and `j', resp ectively. Jump o sets
nor version numb er. The version handled currently are translated from numb ers of bytes in the class le
by our system is ma jor version 45, minor version 3, to numb ers of lines in the disassembled le.
describ ed in [22].
o182 invokevirtual
c106
The remainder of a class le is mainly tables of
o153 ifeq
structures. Of these, the \constant p o ol" and the
j+17
metho d table are imp ortant to the design of our
o025 aload
to ols.
v4
o180 getfield
The constant pool contains information ab out all
c253
the constants used in the class, e.g. strings, integers,
o025 aload
elds, and classes. This table is very imp ortantas
v5
many other parts of the class le including co de
o182 invokevirtual
for metho ds contain indices into the constant p o ol.
c102
For example, a reference to a di erent class includes
o025 aload
the index for the string that names that class.
v4
o180 getfield
For each metho d, the metho d table contains the
c253
co de and other information suchasthe number of
o025 aload
lo cal variables, exception-handlers, the maximum
v4
stack length, indices into the constant pool repre-
o182 invokevirtual
senting the metho d name and a string describing
c260
the metho d typ e, and optional tables aimed at de-
o025 aload
buggers, such as line numb er relationships b etween
v5
byteco de and source and information relating lo cal
o180 getfield
variable names to the co de.
c253
3.2 Disassembly of Java class les Figure 1: Disassembled byteco de
Dup can b e run on disassembled byteco de if it is pro-
The rst step in pro cessing a Java class le is to vided with an appropriate lexical analyzer, though
disassemble it. Disassembling a class le requires p erformance is improved by undoing the jump o -
keeping trackat each p ointof how far the parsing sets b efore running dup. See the next section.
has gotten in the conceptual hierarchy of tables and
structures, based on tables in the disassembler de- Running si or di on the disassembled le without
rived from [22]. We wrote our own disassembler, further prepro cessing do es not pro duce useful infor-
although others have b een implemented previously; mation. For example, changing a 4 to a 5 in two
a list app ears in [32]. places in a 182-line Java le resulted in over 1100
lines of di output on the disassembled byteco de
After disassembling the le, we extract the co de of le, and less than 1 similarity rep orted by si .
each metho d for further analysis. The disassem-
bled co de contains op co des and arguments for a se- The reason that si and di fail is that indices into
quence of assembly-language-level instructions. Fig- the constant table or lo cal variable table maychange
ure 1 shows an example of a section from the mid- with slightchanges to the Java le, due to additions,
dle of a disassembled byteco de le, with comments deletions, or reorderings of the constant p o ol and/or
added to identify the numerical op co des. Since op- lo cal variable table. Such indices typically o ccur fre-
co des can havevariable numb ers of arguments, we quently in the co de, as in the example of Figure 1.
put one op co de or argument p er line. Acharacter When the les mentioned in the previous paragraph
at the start of the line identi es the typ e of item on are prepro cessed as describ ed shortly, di rep orts
the line. Op co des, indices into the constant p o ol, only changes in two lines containing op co des refer-
indices into the lo cal variable table, signed integers, ring to a constant of 5 rather than 4 and si nds
unsigned integers, and jump o sets are identi ed by 98 similarity.
3.3 Prepro cessing for dup enco ded as the negative of the numb er of lines since
the previous use if any of the same index. O sets
for indices into tables are negative to b e consistent
with jump o sets, which are negative for a jump to
With an appropriate lexical analyzer, dup can be
a preceding line and p ositive for a jump to a later
run on disassembled byteco de les. However, be-
line. The o sets are calculated indep endently for
fore running dup, it is preferable to undo the jump
the constant table and the lo cal variable table. The
o sets already present in the disassembled co de by
example of Figure 1 is shown in Figure 2 after cal-
changing jumps into a \goto lab el" form. Dup will
culating o sets for the entire le from which this
compute the o sets itself for the lab els. The dy-
section was extracted.
namic way in which dup calculates o sets relative
to suxes means that when two otherwise identical
sections of co de contain jumps to earlier p oints and
the jumps cross insertions or deletions, these jumps
will not cause mismatches, as would happ en with a
o182
precomputed xed enco ding. Thus, this prepro cess-
c-26
ing enables dup to nd longer p-matches.
o153
j+17
Our lexical analyzer for the disassembled byteco de
o025
les without jump o sets breaks up the input into
v-10
two classes of tokens: parameter tokens and non-
o180
parameter tokens. O sets are computed for pa-
c-10
rameter tokens but not for non-parameter tokens.
o025
Parameter tokens include indices into the constant
v-10
p o ol, indices into the lo cal variable table, lab els for
o182
jumps, and signed and unsigned integers; the vari-
c-26
ous typ es of parameter tokens are distinguished so
o025
that the o sets are computed separately and tokens
v-8
of di erenttyp es will not b e matched to each other
o180
in parameterized matches. The non-parameter to-
c-8
kens include op co des.
o025
v-4
o182
3.4 Further prepro cessing for si and
c-26
di via o sets
o025
v-12
o180
Even though absolute values of table indices in byte-
c-8
co de les maychange with slightchanges to the Java
source, there are still hidden similarities in the byte-
Figure 2: Disassembled byteco de after calculating
co de les. In particular, the corresp onding uses of
o sets
indices maintain the same positional relationship.
Consequently,we use the same o set enco ding that
is used in dup. In the context of disassembled byte-
co de, what corresp onds to the \identi ers" are the
indices into the constant pool or lo cal variable ta- This enco ding decreases reliance on the absolute
ble. Jumps are already enco ded as o sets in the value of the indices but preserves the information
byteco des. Si and di are then run on the o set- as to whether indices for di erent instructions are
enco ded les. the same or di erent. For example, if two les are
the same except that indices in one le are always
We treat each index into the constant pool or lo- one larger than indices in the other, the enco dings
cal variable table as a parameter to b e replaced by of the two les will be identical. The next section
an o set. The rst o ccurrence of each index is en- describ es exp eriments demonstrating that this en-
co ded as 0, and thereafter each use of an index is co ding enables si and di to work e ectively.
4 Exp eriments of these similar pairs representinteresting relation-
ships.
4.1 Exp eriment 1: Random changes
For the same 2056 Javabyteco de les, dup rep orted
92 ordered pairs of les to have at least one common
co de section of 200 lines or more, in comparison to
In the rst exp eriment, we to ok one Java program
the 634 ordered pairs rep orted by si . Table 2 shows
and made many di erent random changes to it.
the breakdown as to how many ordered pairs were
These changes included addition of statements e.g.,
rep orted by si alone, by b oth si and dup, and by
\newvariable = 43;" in random places, and substi-
dup alone.
tution/deletion of statements e.g., changing com-
plex conditions in \if " and \while" statements to
The goal of the exp erimentwas to lo ok for similarity
\i < 1". We varied the number of changes and
in les from di erent collections out of the 38 col-
the ratio b etween additions and deletions. We ran
lections wedownloaded. The second line of the ta-
these tests on two di erent Java programs. For
ble gives the breakdown with resp ect to similarities
each run, we measured the similarities of the source
between les from di erent collections. The initial
co de using the original si and the similarities of
analysis was in terms of ordered pairs, since similar-
the byteco de les. The results, shown in Table 1,
ities can b e asymmetric, but for ease of discussion,
consistently show that the byteco de similarities are
the third line shows the corresp onding number of
close to the source co de similarities. For each of the
unordered pairs.
two programs we partitioned the tests into three
groups according to source similarities: 90-100,
Of the 9 di erent-collection unordered pairs re-
80-89, and 70-79. The results are averages for
p orted by b oth si and dup, we b elieve that 8 pairs
each group, showing the number of trials, the av-
are originally from the same source, based on the
erage similarity for source and byteco de, and the
pairs of les having the same name. Of the 8 pairs,
maximal di erence b etween them in all the trials.
four pairs are identical les, and four are in the
range of 57-100 similar according to si and 45
Furthermore, we also ran si on all the variants of
to 97 similar according to dup based on just the
the two programs together, and found no false p os-
co de sections of at least 200 lines that match.
itives.
The remaining di erent-collection unordered pair
of les rep orted by b oth si and dup are a pair of
4.2 Exp eriment 2: a large set of byte-
programs from di erent sources cryptiX, develop ed
co de les
byWolfgang Platzer, and part of JavaFaces, devel-
op ed by John Thomas to compute MD5. Si found
them to have 78 similarity and 86 in the other
In the second set of exp eriments we to ok 2056 Java
direction, and dup found them to have a single
byteco de les from 38 collections of les from many
common co de segment of 1336 lines. The common
di erent sources and ran tests on all of them at
co de segment corresp onded to 60 identical lines in
the same time allowing for at least 50 similarity.
the Java les. There was no similarity otherwise in
The goal was to lo ok for similar les from di erent
the Java les, but si found additional similarities in
collections.
the byteco de les. The 60 identical lines are di er-
ent from the corresp onding lines of the MD5 RFC
Si rep orted 634 ordered pairs of les with similari-
[29], but semantically equivalent. Interestingly, a
ties of at least 50. We de ne similarityoftwo les
search of the WWW turned up a third Java program
as the p ercentage of one le that is contained in the
with the same 60 lines. Possibly two of these pro-
other. As a result, similarity is an asymmetric rela-
grams b orrowed from the third, or all three copied
tion: for example, if a le A is contained in another
a description of MD5 other than the RFC.
le B twice its size, then A is 100 similar to B, but
B is 50 similar to A. We use ordered pairs here for
The di erent-collection unordered pair rep orted
this reason.
only by dup lo oks at rst glance as if it surely must
come from a common source, b ecause dup rep orted
Of the 634 pairs, 591 were b etween les in the same
a common co de section of 1521 lines, representing
collection, and 43 were b etween les in di erent col-
85 of one le and 28 of the other. We didn't have
lections. Next, we use dup to aid in analyzing which
range of average average max
program source similarity number of source similarity byteco de similarity di erence
p ercent trials p ercent p ercent p ercent
90-100 59 93.2 88.7 9
Program 1 80-89 76 84.4 78.7 9
70-79 35 75.9 71.9 7
90-100 17 92.9 91 9
Program 2 80-89 25 83.4 80.2 8
70-79 6 76.2 74.2 5
Table 1: Similarity found by si for corresp onding Java les and byteco de les
si only b oth dup only
ordered pairs 552 82 10
ordered pairs, di erent collections 25 18 2
unordered pairs, di erent collections 23 9 1
Table 2: Similarities rep orted by si and dup for 2056 byteco de les
number of false p ositives was very small: at most the java source to compare. Up on insp ection of the
18 out of more than 2 million pairs. byteco de les, however, the common co de turned
out to b e the initialization of an array of size 256.
Since the stored values retrievable from the con-
stant p o ol of the byteco de les app eared unrelated,
4.3 Exp eriment 3: False negatives
the similarity is probably coincidental, merely an ar-
tifact of how the compiler generates co de for a series
of 256 array assignments.
The rst two exp eriments indicate that our to ols
can e ectively discover similarities while minimizing
For the pairs rep orted only by si , the Java source
false p ositives. But the question of false negatives {
was not available. However, almost all result from
that is, how many similar pairs we missed { remains
comparing a very small le 500 bytes or less in most
unresolved. There were none in the rst exp eriment,
cases to a large le; the small le was found to
but it was to o limited.
be similar to the large one, but not the other way
around. In addition, the names of these pairs in-
Since there are no other to ols to compare to, there is
dicate that the purp oses of the les are unrelated.
no guaranteed way for us to measure false negatives.
We conclude that these are false p ositives. The ex-
Nevertheless, we b elieve that the following "blind"
ceptions were matches whose names included H or
exp eriment gives a reasonable indication.
V apparently for horizontal and vertical; the H/H
and V/V pairs were rep orted by b oth si and dup,
We asked a friend who was familiar with the goals
but the H/V and V/H pairs only by si . The names
of our work, but not with the techniques we use, to
of another two pairs of matching les related to but-
randomly pick 10 programs from a set of 765 java
tons and checkb oxes, but the remaining pairs of les
programs a subset of the 2056 programs for which
app ear to b e totally unrelated based on the sub ject
we had the source, make random changes to them,
matter indicated by the names.
compile them p ossibly under a di erentversion of
the compiler and give the set of byteco de les in a
To summarize, we found many similarities and very
random order. We then ran si and dup to see how
few false p ositives by combining the information
many of his changes we can detect.
from dup and si . The di erent-collection pairs re-
p orted by b oth dup and si all app ear to b e valid
To make the test even more blind, our tester actu-
instances of similarity. The one rep orted only by
ally made changes to 12 programs, and added to our
dup was not; some of the ones rep orted only by
original test some additional programs as well as re-
si were not, esp ecially for small les. Overall, the
moved some programs. si running with a thresh-
for si , we also ran si on the disassembled co de old of 65 similarity discovered 9 of the 12. Dup,
without applying the o set enco ding. With a 25 with a threshold of 100 common lines, discovered
threshold of similarity, si discovered only three of 8. Together, they discovered 10. All 12 were found
the 12 pairs 4, 5, and 10, a much worse p erfor- by si when run with a 25 similarity threshold
mance than that describ ed ab ove for the o set en- and by dup when run with a 50-line threshold. On
co ding. the other hand, decreasing the threshold to 25 in-
creases the numb er of ordered pairs si nds among
the 2056 original les from 634 to 1430. Still reason-
ably small compared to the total of over 4 million
4.4 Running times and scalability
p ossible pairs, but clearly the numb er of false p osi-
tives grows when the threshold is decreased.
For the 2056 les, si used 41 seconds of elapsed
Out of the two programs that were missed by b oth
time and 4 seconds of user time while dup used 1
si and dup, one is particularly interesting. It in-
minute 55 seconds of elapsed time and 1 minute 7
volved relatively few changes to the source, but they
seconds of user time running under IRIX 5.3 and
made the byteco de le very di erent. On close in-
using one of 12 150 MHZ IP19 Pro cessors, with
sp ection, we found that the main culprit wasamove
data cache size 16 Kbytes, instruction cache size
of a segment of co de, which resulted in a byteco de
16 Kbytes, secondary uni ed instruction/data cache
le with many jumps around the relo cated co de. For
size 1 Mbyte, and main memory size 1280 Mbytes.
si , the o sets of all these jumps were a ected by the
Enco ding the 2056 byteco de les to ok 7 minutes
relo cation, resulting in di erent ngerprints. For
of elapsed time for si and 8 minutes indep en-
dup, the relo cated co de broke up long matches in
dently for dup. Most of the prepro cessing could b e
b oth places, which mattered since the byteco de les
shared. Si also enables the creation of an index so
were small - only four times the threshold length.
that new les can b e compared with an index of les
pro cessed earlier. Comparing one 50K byteco de to
Tovalidate the usefulness of the o set enco ding for
all the 2056 les in the index within 50 similar-
di , we ran di on the disassembled co de with and
ity takes a couple of seconds for enco ding the new
without the o set enco ding. The results are shown
le and thereafter the pro cessing by si is essentially
in Table 3. The pairs varied in le size and how
instantaneous 0.2 user time and b etween 0:00 and
manychanges were made. To get a relatively size-
0:01 elapsed time. The size of the index dep ends
invariant measure, we use the length of the di out-
on the amount of sampling done and the precision
put in lines divided by the sum of the le sizes,
of results; for the exp eriments used here the index
for each typ e of le. Pair 6 is sp ecial: di erent
o ccupied ab out 5 of the total size of all les.
versions of the JDK compiler were used to generate
two class les from the same java le. Consequently,
Dup is much greedier in space than si , and conse-
no value is given for this measure for the java le.
quently will not scale to pro cessing huge numb ers
Our second measure is the average length of blo cks
of byteco de les at one time. If the goal were to
of identical lines rep orted by di . The last column
pro cess an enormous number of byteco de les, say
contains the sum of the sizes of the o set-enco ded
to provide a registry service for the World Wide
disassembled les, which is the same as the sum of
Web as has b een prop osed for text les [9], then si
the le sizes without the o set-enco ding.
should b e used to screen for similarities that should
be checked subsequently by dup.
For most of the pairs, the data in Table 3 showa
signi cant improvement from using o set enco dings
for di . In a few instances, the values are ab out
the same with and without o sets. Pair 11 is the
5 Related Work
only instance where di found signi cantly less sim-
ilarity with o set enco dings, but just for the second
measure. Note that pairs 3, 10, and 12 were the
Javabyteco de les are relatively easy to decompile
ones not discovered by si with a 50 threshold, as
into Java. Programs can b e rewritten in structured
discussed ab ove, and pairs 9-12 were the ones not
form by analyzing the ow of control, and the names
discovered by dup with a 100-line threshold.
of classes, elds, and metho ds are available from the
class le. In fact, at least four byteco de decompilers
Tovalidate our approachof using o set enco dings
have b een implemented: Mo cha by Hanp eter Van
di -size/total-lines ave. ident. blo ck size sum of le
pair as in lines sizes in lines
java o set no-o set o set no-o set o set = no-o set
1 13 6 57 185 3 2740
2 14 17 68 18 2 1040
3 30 24 61 8 2 2098
4 14 14 21 173 20 1603
5 59 40 42 10 10 988
6 - 2 34 198 5 6041
7 19 13 32 22 6 8789
8 23 24 59 36 12 941
9 14 8 45 61 4 265
10 24 32 31 7 8 662
11 50 55 56 7 24 428
12 18 41 67 6 2 1037
Table 3: Results of running di with and without the o set enco ding
may b e p ossible to detect that they are used. Vliet, now deceased, and according to [31], taken
over as part of Borland's JBuilder [8], WingDis [33],
Another technique for enabling detection of stolen DejaVu [18], and Krakatoa [28]. A discussion of
co de is steganography, the art of hiding information these app ears in [14]. These pro duce quite readable
such that the information can later b e detected [2]. co de except for variable names, and even those may
For ob ject co de, insertions of extra no-op sequences be available p erhaps inadvertently from the class
of instructions would slowitdown, but Intel rep orts le if the compiler generated optional information
that it has successfully replaced co de fragments by aimed at debuggers.
equivalent ones to customize securitycode[3]. How-
ever, this is not common practice. Because class les are relatively easy to decompile
readably, Java applications are p otentially vulner-
For Javabyteco de les, the single Java Virtual Ma- able to theft, even if they are distributed only as
chine means that the variety of platforms is not an byteco de les. An unscrupulous p erson could use
issue. Multiplicity of compilers can b e dealt with by one of the decompilers mentioned ab ove to decom-
having the owner of the co de compile the source les pile the byteco de les of an application, mo dify the
with each of the available Java compilers and com- resulting source, recompile into byteco de les, and
pare the resulting byteco de les with the byteco de distribute the byteco de les.
les that were susp ected to b e stolen. Currently, the
numb er of di erent compilers is not large, so this is In fact, a numb er of \obfuscators" have b een writ-
manageable. Co de compiled by di erent releases of ten to make decompilation less successful. These in-
the same compiler will probably still b e very similar, clude Crema byVan Vliet, and likeMocha, also re-
although not necessarily identical. In our exp eri- p ortedly taken over by Borland's JBuilder[8], Hose-
ments we found that, given the same source les, jdk Mo cha, by Mark LaDue, HashJava now renamed
1.0.2 and jdk 1.1.x generated class les that rarely SourceGuard [1], by K.B. Sriram, and Job e, by
di ered in any signi cantway. Eron Jokipii; for discussion of these, see [23, 32].
Early obfuscators either changed symb ol names or
added no-ops to byteco de les in such a way that
particular decompilers crashed. New obfuscators
e.g., [13,12] employmuch more sophisticated tech-
6 Conclusions and Future Work
niques, some of cryptographic strength, to change
the app earance of co de. These techniques not only
make it harder to decompile, but they also makeit
Our exp eriments have validated our approach by
p ossible to change the app earance of byteco de les
showing that our to ols are able to deduce similar-
that come from the same source. Our metho ds will
ity in Java source from similarity in the byteco de
not be able to defeat such techniques, although it
les. We can make the rate of false p ositives very
low while keeping false negatives reasonably mini- References
mal. Our p ositional enco ding proves to be a very
powerful technique. Since the three to ols si , dup,
[1] 4thpass Software Corp oration. Protect your
and di are based on di erent techniques, one may
Javainvestment. http://www.4thpass.com/,
detect similarity when another one misses it. Lo cal
Apr. 7, 1998.
mo di cations within sections of co de will reduce the
[2] Ross Anderson and Fabien Petitcolas. On the
p ercentage of similarity found by si , but dup will
limits of steganography. In IEEE J-SAC, to
still nd long matches in the unchanged sections.
app ear.
Name changes of variables would not b e a problem
since the compiler turns variable references into ta-
[3] D. Aucsmith. Tamp er resistant software: An
ble indices, which we have shown our metho ds to
implementation. In Information Hiding, vol-
handle despite change in absolute values. Reorder-
ume 1174 of Lecture Notes in Computer Sci-
ing ma jor sections of co de, e.g. by changing the
ence, pages 317{333. Springer-Verlag, 1996.
order of metho ds, will have little e ect on si and
[4] Brenda S. Baker. On nding duplication and
dup, though it will greatly a ect di .
near-duplication in large software systems. In
Second Working Conference on Reverse Engi-
A ma jor advantage of lo oking at Java class les
neering, pages 86{95, 1995.
rather than at binaries for other programming lan-
guages is that there is just one version of the Java
[5] Brenda S. Baker. Parameterized pattern
Virtual Machine for all platforms or at least that is
matching: Algorithms and applications. J.
Sun's intention, while binaries of other languages
Comput. Syst. Sci., 521:28{42, Feb. 1996.
are di erent for di erent platforms. Extending our
[6] Brenda S. Baker. Parameterized duplication
techniques to binaries is a natural step to take, al-
though binaries present several additional problems: in strings: Algorithms and an application to
software maintenance. SIAM J. Computing,
They are strongly tied to the architecture, there
265:1343{1362, Oct. 1997.
are many compiler and even more optimization and
co de restructuring programs, and the information in
[7] H.L. Berghel and D.L. Sallach. Measurements
binaries as not as well organized as in byteco de les.
of program similarity in identical task environ-
Nevertheless, we b elieve that it will b e p ossible, at
ments. SIGPLAN Notices, 98:65{76, August
least in some degree, to identify similar binaries.
1984.
It would b e helpful to have a graphical user interface
[8] Borland. Jbuilder. http://www.borland.com/
that links the output of si and dup to a decompiler,
jbuilder/, Oct. 23, 1997.
allowing a develop er to see the evolution of two sim-
[9] S. Brin, J. Davis, and H. Garcia-Molina. Copy
ilar programs clearly.
detection mechanisms for digital do cuments. In
Proceedings of the ACM Special Interest Group
on Management of Data SIGMOD, 1995.
[10] Andrei Bro der, Steve Glassman, Mark Man-
7 Acknowledgements
asse, and Geo rey Zweig. Syntactic clustering
of the web. In Proceedings of the Sixth Inter-
national World Wide Web Conference, pages
Peter Bigot was our blind tester, and we thank him
391{404, April 1997.
for a thankless job well done.
[11] Kenneth Ward Church and Jonathan Isaac
Helfman. Dotplot: A program for exploring
self-similarity in millions of lines of text and
co de. Journal of Computational and Graphical
Statistics, 22:153{174, June 1993.
8 Availability
[12] Christian Collb erg, Clark Thomb orson, and
Douglas Low. Breaking abstractions and un-
Contact Brenda Baker, bsb@b ell-labs.com, for in- structuring data structures. In IEEE Inter-
formation ab out licensing the software from Lucent national Conference on Computer Languages
Technologies. 1998, Chicago, IL, May 1998.
[25] T.J. McCab e. Reverse engineering, reusability, [13] Christian Collb erg, Clark Thomb orson, and
redundancy: the connection. American Pro- Douglas Low. Manufacturing cheap, resilient,
grammer, 310:8{13, Oct. 1990. and stealthy opaque constructs. In Principles
of Programming Languages 1998, POPL'98,
[26] Eugene W. Myers. An OND di erence algo-
San Diego, CA, Jan. 1998.
rithm and its variations. Algorithmica, 1:251{
266, 1986. [14] Dave Dyer. Java decompilers compared.
JavaWorld, July 16, 1997. http://www.
[27] PocketSoft. .RTPatch Professional, Feb.
javaworld.com/javaworld/jw-07-1997/jw-
23, 1998. http://www.pocketsoft.com/
07-decompilers.html.
products.html.
[15] Luis Fernandes. Are there any apps
[28] Todd Pro ebsting and Scott A. Watterson.
that can display les in parallel, high-
Krakatoa: Decompilation in java do es byte-
lighting in color the di erences between
co de reveal source?. In USENIX Conference
them? http://www.ee.ryerson.ca:8080/~
on Object-oriented Technologies and Systems,
elf/xapps/Q-XI.html, April 3, 1997.
June 1997.
[16] Nevin Heintze. Scalable do cument ngerprint-
[29] Ronald Rivest. The MD5 message digest
ing. In Proceedings of the Second USENIX
algorithm. RFC 1321, Apr. 1992. http://
Workshop on Electronic Commerce, Nov. 18-
info.internet.isi.edu:80/in-notes/rfc/
21, 1996.
files/rfc1321.txt.
[17] Susan Horwitz. Identifying the semantic and
[30] Narayanan Shivakumar and Hector Garcia-
textual di erences between two versions of a
Molina. Building a scalable and accurate copy
program. In Proceedings of the ACM SIGPLAN
detection mechanism. In Proceedings of 1st
Conference on Programming Language Design
ACM International Conference on Digital Li-
and Implementation PLDI, pages 234{245,
braries DL'96, March 1996.
June 1990.
[31] Eric Smith. Mo cha, the Java decom-
[18] Innovative Software. OEW for Java. http://
piler. http://www.brouhaha.com/~eric/
www.isg.de/OEW/Java/, Oct. 2, 1997.
computers/mocha.html, Dec. 28, 1997.
[19] H.T. Jankowitz. Detecting plagiarism in stu-
[32] Thomas Trab er. To ols for working with byte-
dent PASCAL programs. Computer Journal,
co de: Byteco de assemblers, disassemblers and
311:1{8, 1988.
dumps. http://www.oasis.leo.org/java/
development/bytecode/00-index.html.Part
[20] J. Howard Johnson. Substring matching for
of Java Oasis, http://www.oasis.leo.org/
clone detection and change tracking. In Proc.
java/00-oasis.html.
International Conf. on Software Maintenance,
pages 120{126, 1994.
[33] Wingsoft. Wingsoft Pro ducts Information.
http://www.wingsoft.com/products.shtml.
[21] Brian W. Kernighan and Rob Pike. The UNIX
Programming Environment. Prentice-Hall, En-
[34] Tak Yan and Hector Garcia-Molina. Duplicate
glewo o d Cli s, New Jersey, 1984.
removal in information dissemination. Techni-
cal rep ort, Stanford Computer Science Dept.,
[22] Tim Lindholm and Frank Yellin. The Java Vir-
1995. submitted for publication.
tual Machine Speci cation. Addison-Wesley,
Reading, Massachusetts, 1997.
[23] Qusay H. Mahmoud. Java tip 22: Pro-
tect your byteco des from reverse engineer-
ing/decompilation. JavaWorld, Jan. 2, 1997.
http://www.javaworld.com/javatips/jw-
javatip22i.html.
[24] Udi Manb er. Finding similar les in a large le
system. In Proc. 1994 Winter Usenix Technical
Conference, pages 1{10, Jan 1994.