The following paper was originally published in the Proceedings of the USENIX Annual Technical Conference (NO 98) New Orleans, Louisiana, June 1998

Deducing Similarities in Sources from

Brenda S. Baker Bell Laboratories, Lucent Technologies Udi Manber University of Arizona

For more information about USENIX Association contact: 1. Phone: 510 528-8649 2. FAX: 510 548-5738 3. Email: [email protected] 4. WWW URL:http://www.usenix.org/

Deducing Similarities in Java Sources from Byteco des

Brenda S. Baker

Bel l Laboratories

Lucent Technologies

700 Mountain Avenue

Murray Hil l, NJ 07974

bsb@b ell-labs.com, http://cm.bell-labs.com/~bsb/

Udi Manber

Department of Computer Science

University of Arizona

Tucson, AZ 85721

[email protected], http://glimpse.cs.arizona.edu/udi.html

sible, given a large set of byteco de les, to iden- Abstract

tify most of those that originate from similar source

co de. In addition, an index can b e built from any

number of byteco de les, such that given a new

Several techniques for detecting similarities of Java

byteco de one can very quickly nd all the old ones

programs from byteco de les, without access to the

that are similar to it. Our to ols also can detail in

source, are intro duced in this pap er. These tech-

a convenientway just which sections of the les are

niques can be used to compare two les, to nd

similar.

similarities among thousands of les, or to compare

one new le to an index of many old ones. Exp eri-

Our techniques it p ossible to identify -

mental results indicate that these techniques can b e

co de les that came from the same source without

very e ective. Even changes of 30 to the source

knowing the actual source even if signi cant parts

le will usually result in byteco de that can be as-

of the source les are di erent. The les do not even

so ciated with the original. Several applications are

havetobe of similar sizes. For example, the orig-

discussed.

inal source co de les may be di erent versions of

the same program or di erent programs containing

similar sections.

1 Intro duction

Finding similarities from compiled co de is much

more dicult than nding similarities in source co de

or text, b ecause a small change in the source co de

Javabyteco de, particularly in the form of applets,

can pro duce completely di erent binaries when they

is geared to b ecome the common way to execute

are viewed just as sequences of .

programs through the web, and not only by tradi-

tional computers. Network computers, even appli-

To accomplish these goals, we adapt three to ols de-

ances, are exp ected to be controlled byJavabyte-

signed to nd similarity in source co de and text to

co de, which will arrive through the net transpar-

1

work with byteco de les. Si [24] is designed

ently and for the most part, automatically. There

to analyze large number of text les to nd pairs

is obviously a great need to be able to control all

that contain a signi cantnumb er of common blo cks

these programs. There will b e problems of security,

\ ngerprints", where a blo ck corresp onds to a

management of up dates, p ortability, handling pref-

non-trivial segment of co de, usually a few lines.

erences, deletions, and so on.

1

The original name was SIF, but at some unknown p oint

This pap er intro duces several techniques to help one

during the development an extra f was added to make it more

asp ect of these problems. We show that it is p os- natural.

vidually at nding similar les while keeping false Dup [5], searches sets of source les to lo ok for

p ositives low. We then use dup and di on the sim- suciently long sections that match except for sys-

ilar pairs to examine the similarities in more detail. tematic transformations of names, suchasvariable

names, and can deal eciently with a few mil-

Our programs are fast and can analyze thousands lion lines of source co de. The UNIX utility di

of byteco de les in minutes. A new le can b e com- describ ed in [21] uses dynamic programming to

pared to an index of thousands of existing les in identify line-by-line changes insertions or deletions

a second or two. The number of false p ositives is from one le to another, and is useful for detailed

kept to avery small minimum. Our programs are comparison of two les and for automated distribu-

written in and run under UNIX. Detailed exp eri- tion of patches.

mental results are given later in the pap er.

None of the programs mentioned ab ove was de-

We foresee several applications for our to ol. signed to deal with binaries. The only work that

we knowof that mayhave applied one of these to

program management: When p eople havenumerous binaries is .RTPatch [27], a to ol for creating patches

Java classes, and get many more on a regular basis, di erences between les for up dating les. This

there will be a great need to organize them, often to ol is claimed to work for arbitrary les, even bina-

not according to the original \plan." Being able ries; while no technical description is given of their

to tell which classes are similar can be very help- metho ds, it seems likely that it applies the di al-

ful. Sometimes a similarity will identify the source. gorithm to arbitrary bytes rather than to hashes of

Knowing that a very similar class is already installed lines of text. We are not familiar with anywork that

on one's disk may b e helpful in deciding what to do can nd similarities in large numb er of binaries.

with the new class. It can also help with version

control, etc. In some cases, esp ecially for programs Our adaptations of si and di do not work directly

that p erform mostly arithmetic op erations, it may on byteco de les, or even on disassembled byteco de

b e p ossible to identify programs that implement the les. Byteco de les contain many indices into ta-

same , even when they are written by in- bles, and values of most indices can b e changed bya

dep endent writers. For example, when we ran ex- slight even one character change to a Java source

p eriments on thousands of arbitrary class les, we le. Therefore, we enco de disassembled byteco de

discovered two MD5 programs with 78 similarity les in a \normal form" which takes into accountpo-

in their byteco de les. sitional values that are a ected by small changes

than absolute values of indices. This enco ding,

Plagiarism detection: Our approach will identify based on a technique rst used in dup, enables the

stolen co de if only minor changes are made to it programs to nd the hidden similarityinbyteco de

including any amountofsyntactic changes suchas les. Since this technique is already incorp orated

changing variable names. However, the advance in into dup, dup can work on disassembled byteco de

sophistication of obfuscators [13,12], esp ecially for les; however, additional simple prepro cessing im-

Java, would allow someone to hide any co de segment proves its p erformance.

prettywell, and our metho ds and very p ossibly any

other metho d will not b e able to identify it. One There are other to ols for nding similarities in text

may b e able to identify that obfuscators were used, or source co de, as well. However, to ols based on

on the other hand, whichmay b e go o d enough e.g., style metrics, suchas[7,19,25], or data ow graphs

in a classro om. [17] would require decompilation of byteco de les

in order to b e applied. Some other to ols based on

Program reuse and reengineering: It will obviously ngerprints, such as [16,20,10], chunks of text [9,

be useful to know which classes are similar when 30,34], or visualization via a graphical user interface

trying to reuse them. Identifying versions is a go o d [11] may be adaptable to byte co de les using the

example of that. same techniques that we use for si , dup, and di .

Uninstal lers: If you nally obtain the exact co de To search for similar les in a large set of byteco de

you wanted, then you probably want to get rid of les, we run si on the enco ded disassembled byte-

other co de that is very similar. co de les and dup on the prepro cessed disassembled

byteco de les. We show that combining the output

Security: Detecting that a class is similar to a known from si and dup is more e ective than either indi-

bad class can b e very helpful. mo des: all-against-all and one-against-all. The rst

mo de nds all groups of similar les in a large le

Compression: It may b e p ossible to store only di er- system and gives a rough indication of the similarity.

ences among classes that are very similar. But rst The running time is essentially linear in the total

one must detect all such similarities. This could b e size of all les, which makes si scalable. A sort,

esp ecially useful for low-storage devices. which is not a linear-time routine, is required, but

it is p erformed only on ngerprints, and therefore

Clustering: Small similarities are quite typical do es not dominate the running time unless the total

among programs written by the same p erson. Our size of all les is many GBytes. The second mo de

to ols cannot by themselves cluster everything, but compares a given le to a prepro cessed approximate

they can help a clustering program identify many index of all other les, and determines very quickly

candidates. all les that are similar to the given le. In b oth

cases, similarity can b e detected even if the similar

Miscel laneous: There was recently discussion in the p ortions constitute as little as 25 of the size of the

press ab out allegedly using a smaller le for smaller p ercentages, the probability

copy of a p opular b enchmark program directly in its of false p ositives is non trivial, so although si will

Java . We b elieve that our program would work, its output may b e to o large.

have discovered that easily and could be used to

continuously monitor for cases like that.

2.2 Dup

The rest of the pap er is organized as follows. The

next section describ es the three similarity to ols that

we adapt to Javabyteco de les. Section 3 describ es

Dup [4, 5, 6] lo oks for similarity of source co des

howwe pro cess Java class les to make them suit-

based on nding suciently long sections of co de

able for comparisons by dup, si , and di . In Sec-

that almost match. Dup's notion of almost-

tion 4 we present exp erimental results. These in-

matching is the parameterized match p-match: two

clude exp eriments with si alone, si and dup to-

sections of co de are a p-match if they are the same

gether, and di alone. Section 5 describ es related

textually except p ossibly for a systematic change of

work. We end with conclusions and future research.

variable names; e.g. if all o ccurrences of \count"

and \fun" in one section are changed to \num" and

\fo o", resp ectively, in the other. For a threshold

length in lines sp eci ed by the user, dup rep orts all

2 To ols for nding similarity

longest p-matches over the threshold length within

a list of les or between two lists of les. It can

compute the p ercentage of duplication b etween two

This section describ es the three to ols di , si , and

les or the p ercentage of duplication for the cross-

dup that we adapt to nding similarity in Java

pro duct of two lists of les. It can also generate a

byteco de les. All three were designed originally

pro le that shows the matches involving each line

for source or text les.

in the input and a plot showing where matches o c-

cur. Dup has b een found useful for identifying unde-

sirable duplication within a large software system,

2.1 Si

lo oking for plagiarism b etween systems, and for an-

alyzing the divergence of two systems of common

origin.

Si [24] is a program to nd similarities in text les.

It uses a sp ecial kind of random sampling, such that Using dup, one maycho ose to base a notion of sim-

if a sample taken from one le app ears in another ilarity on the existence of matching sections over a

it is guaranteed to be taken there to o. A set of threshold length, on the p ercentage of common co de

such samples, which are called in si approximate resulting from these matching sections, or on some

ngerprints, provides a compact representation of combination of the two.

a le. With high probability, sets of ngerprints

of two similar les will have large intersection, and The key idea that enables dup to identify p-matches

the ngerprints of two non-similar les will havea is to replace tokens such as identi ers by o -

very small intersection. Si works in two di erent sets to remove the identity of the identi er while

preserving the distinctness of di erent identi ers. tion of the three is more p owerful than any of them

In particular, the rst o ccurrence of an identi- individually.

er is replaced by a 0, and each later o ccurrence

of the same identi er is replaced by the number For searching large numb ers of les, si and dup

of tokens since the previous one. For example, can be used together either to broaden the notion

ux,y=x>y?x:y; and vw,z=w>z?w:z; are of similarity by taking the union of pairs of les

b oth enco ded as 00,0=6>6?5:5; b ecause u and found by si and dup or to make it more stringent

v o ccur in the same p ositions, as do the pair x by taking the intersection of the pairs rep orted as

and w and the pair y and z; the numb ers 6 and similar by dup and si .

5 represent the number of tokens single symb ols,

in this case b etween successive o ccurrences of the Si nds similarities that are not found by dup b e-

same identi er token. More generally, if max, arg1, cause o ccasional di erences may disrupt the longer

and arg2 are tokens, the same enco ding represents matches lo oked for by dup; on the other hand, dup

maxarg1,arg2=arg1>arg2?arg1:arg2;. nds some les that have long matches but aren't

rep orted by si b ecause the overall p ercentage of

Dup computes longest p-matches via a data struc- similarityin the les is to o low. For example, if a

ture called a parameterized sux tree p-sux tree. blo ck of co de is added to an otherwise unmo di ed

The p-sux tree is a compacted trie that represents le, the p ercentage of similarity might fall b elow

o set enco dings of suxes of the token string, but the p ercentage of similarity selected for si to re-

only uses linear space b ecause only the o set enco d- p ort, but dup would nd that the rest of the le

ing of the entire input is stored. At each access to was unchanged.

an o set, the previous context is used dynamically

to determine whether this is the rst use of this For more detailed analysis of les found to b e sim-

parameter within the sux b eing pro cessed. The ilar, dup, di , and gdi are useful. Dup provides

for building a p-sux tree and searching a list of matching sections of co de, a pro le show-

it for duplication are describ ed in [5, 6]. ing for each line what other lines match based on

matches over threshold length, and scatter plots for

visualization. Di and gdi provide an alignmentof

2.3 Di

the two les that enable lo oking at the di erences

line-by-line, and are esp ecially e ective when the

numb er of di erences is very small.

Di is the oldest to ol for nding commonality be-

tween les. Given two text les, di rep orts a min-

imal length edit script for the di erences between

the two les, where edit op erations include inser-

3 Adapting the to ols to Java byte-

tions and deletions. Many algorithms for minimal

co de les

edit scripts have app eared in the literature. A p opu-

lar version of di to day is the GNU implementation

based on an algorithm of Myers [26]. Because of the

This section describ es howwe adapt dup, si , and

quadratic worst-case running time, di can b e slow

di to handle Java class les. A class le, usually

for comparing large amounts of co de with many dif-

obtained by compiling a Java le, is a stream of

ferences.

bytes representing a single class in a form suitable

for the Java [22]. We use the terms

Graphical user interfaces such as gdi which runs

\class les" and \byteco de les" interchangeably.

on SGI workstations under UNIX make it conve-

nient to lo ok at output from di by aligning iden-

tical sections of two les, side by side. Variants of

3.1 Information in Java class les

gdi exist for other platforms; a list is given in [15].

The rst items enco ded in a class le are a magic

2.4 Combining the to ols

numb er that identi es the le as a class le, the mi-

nor version numb er, and the ma jor version numb er.

The three to ols use three very di erent notions of AJava Virtual Machine of a particular ma jor and

similarity. In this pap er, we show that a combina- minor version number may run co de of the same

ma jor version number and the same or lower mi- `o', `c', `v', `i', `u', and `j', resp ectively. Jump o sets

nor version numb er. The version handled currently are translated from numb ers of bytes in the class le

by our system is ma jor version 45, minor version 3, to numb ers of lines in the disassembled le.

describ ed in [22].

o182 invokevirtual

c106

The remainder of a class le is mainly tables of

o153 ifeq

structures. Of these, the \constant p o ol" and the

j+17

metho d table are imp ortant to the design of our

o025 aload

to ols.

v4

o180 getfield

The constant pool contains information ab out all

c253

the constants used in the class, e.g. strings, integers,

o025 aload

elds, and classes. This table is very imp ortantas

v5

many other parts of the class le including co de

o182 invokevirtual

for metho ds contain indices into the constant p o ol.

c102

For example, a reference to a di erent class includes

o025 aload

the index for the string that names that class.

v4

o180 getfield

For each metho d, the metho d table contains the

c253

co de and other information suchasthe number of

o025 aload

lo cal variables, exception-handlers, the maximum

v4

stack length, indices into the constant pool repre-

o182 invokevirtual

senting the metho d name and a string describing

c260

the metho d typ e, and optional tables aimed at de-

o025 aload

buggers, such as line numb er relationships b etween

v5

byteco de and source and information relating lo cal

o180 getfield

variable names to the co de.

c253

3.2 Disassembly of Java class les Figure 1: Disassembled byteco de

Dup can b e run on disassembled byteco de if it is pro-

The rst step in pro cessing a Java class le is to vided with an appropriate lexical analyzer, though

disassemble it. Disassembling a class le requires p erformance is improved by undoing the jump o -

keeping trackat each p ointof how far the sets b efore running dup. See the next section.

has gotten in the conceptual hierarchy of tables and

structures, based on tables in the de- Running si or di on the disassembled le without

rived from [22]. We wrote our own disassembler, further prepro cessing do es not pro duce useful infor-

although others have b een implemented previously; mation. For example, changing a 4 to a 5 in two

a list app ears in [32]. places in a 182-line Java le resulted in over 1100

lines of di output on the disassembled byteco de

After disassembling the le, we extract the co de of le, and less than 1 similarity rep orted by si .

each metho d for further analysis. The disassem-

bled co de contains op co des and arguments for a se- The reason that si and di fail is that indices into

quence of -language-level instructions. Fig- the constant table or lo cal variable table maychange

ure 1 shows an example of a section from the mid- with slightchanges to the Java le, due to additions,

dle of a disassembled byteco de le, with comments deletions, or reorderings of the constant p o ol and/or

added to identify the numerical op co des. Since op- lo cal variable table. Such indices typically o ccur fre-

co des can havevariable numb ers of arguments, we quently in the co de, as in the example of Figure 1.

put one op co de or argument p er line. Acharacter When the les mentioned in the previous paragraph

at the start of the line identi es the typ e of item on are prepro cessed as describ ed shortly, di rep orts

the line. Op co des, indices into the constant p o ol, only changes in two lines containing op co des refer-

indices into the lo cal variable table, signed integers, ring to a constant of 5 rather than 4 and si nds

unsigned integers, and jump o sets are identi ed by 98 similarity.

3.3 Prepro cessing for dup enco ded as the negative of the numb er of lines since

the previous use if any of the same index. O sets

for indices into tables are negative to b e consistent

with jump o sets, which are negative for a jump to

With an appropriate lexical analyzer, dup can be

a preceding line and p ositive for a jump to a later

run on disassembled byteco de les. However, be-

line. The o sets are calculated indep endently for

fore running dup, it is preferable to undo the jump

the constant table and the lo cal variable table. The

o sets already present in the disassembled co de by

example of Figure 1 is shown in Figure 2 after cal-

changing jumps into a \goto lab el" form. Dup will

culating o sets for the entire le from which this

compute the o sets itself for the lab els. The dy-

section was extracted.

namic way in which dup calculates o sets relative

to suxes means that when two otherwise identical

sections of co de contain jumps to earlier p oints and

the jumps cross insertions or deletions, these jumps

will not cause mismatches, as would happ en with a

o182

precomputed xed enco ding. Thus, this prepro cess-

c-26

ing enables dup to nd longer p-matches.

o153

j+17

Our lexical analyzer for the disassembled byteco de

o025

les without jump o sets breaks up the input into

v-10

two classes of tokens: parameter tokens and non-

o180

parameter tokens. O sets are computed for pa-

c-10

rameter tokens but not for non-parameter tokens.

o025

Parameter tokens include indices into the constant

v-10

p o ol, indices into the lo cal variable table, lab els for

o182

jumps, and signed and unsigned integers; the vari-

c-26

ous typ es of parameter tokens are distinguished so

o025

that the o sets are computed separately and tokens

v-8

of di erenttyp es will not b e matched to each other

o180

in parameterized matches. The non-parameter to-

c-8

kens include op co des.

o025

v-4

o182

3.4 Further prepro cessing for si and

c-26

di via o sets

o025

v-12

o180

Even though absolute values of table indices in byte-

c-8

co de les maychange with slightchanges to the Java

source, there are still hidden similarities in the byte-

Figure 2: Disassembled byteco de after calculating

co de les. In particular, the corresp onding uses of

o sets

indices maintain the same positional relationship.

Consequently,we use the same o set enco ding that

is used in dup. In the context of disassembled byte-

co de, what corresp onds to the \identi ers" are the

indices into the constant pool or lo cal variable ta- This enco ding decreases reliance on the absolute

ble. Jumps are already enco ded as o sets in the value of the indices but preserves the information

byteco des. Si and di are then run on the o set- as to whether indices for di erent instructions are

enco ded les. the same or di erent. For example, if two les are

the same except that indices in one le are always

We treat each index into the constant pool or lo- one larger than indices in the other, the enco dings

cal variable table as a parameter to b e replaced by of the two les will be identical. The next section

an o set. The rst o ccurrence of each index is en- describ es exp eriments demonstrating that this en-

co ded as 0, and thereafter each use of an index is co ding enables si and di to work e ectively.

4 Exp eriments of these similar pairs representinteresting relation-

ships.

4.1 Exp eriment 1: Random changes

For the same 2056 Javabyteco de les, dup rep orted

92 ordered pairs of les to have at least one common

co de section of 200 lines or more, in comparison to

In the rst exp eriment, we to ok one Java program

the 634 ordered pairs rep orted by si . Table 2 shows

and made many di erent random changes to it.

the breakdown as to how many ordered pairs were

These changes included addition of statements e.g.,

rep orted by si alone, by b oth si and dup, and by

\newvariable = 43;" in random places, and substi-

dup alone.

tution/deletion of statements e.g., changing com-

plex conditions in \if " and \while" statements to

The goal of the exp erimentwas to lo ok for similarity

\i < 1". We varied the number of changes and

in les from di erent collections out of the 38 col-

the ratio b etween additions and deletions. We ran

lections wedownloaded. The second line of the ta-

these tests on two di erent Java programs. For

ble gives the breakdown with resp ect to similarities

each run, we measured the similarities of the source

between les from di erent collections. The initial

co de using the original si  and the similarities of

analysis was in terms of ordered pairs, since similar-

the byteco de les. The results, shown in Table 1,

ities can b e asymmetric, but for ease of discussion,

consistently show that the byteco de similarities are

the third line shows the corresp onding number of

close to the source co de similarities. For each of the

unordered pairs.

two programs we partitioned the tests into three

groups according to source similarities: 90-100,

Of the 9 di erent-collection unordered pairs re-

80-89, and 70-79. The results are averages for

p orted by b oth si and dup, we b elieve that 8 pairs

each group, showing the number of trials, the av-

are originally from the same source, based on the

erage similarity for source and byteco de, and the

pairs of les having the same name. Of the 8 pairs,

maximal di erence b etween them in all the trials.

four pairs are identical les, and four are in the

range of 57-100 similar according to si and 45

Furthermore, we also ran si on all the variants of

to 97 similar according to dup based on just the

the two programs together, and found no false p os-

co de sections of at least 200 lines that match.

itives.

The remaining di erent-collection unordered pair

of les rep orted by b oth si and dup are a pair of

4.2 Exp eriment 2: a large set of byte-

programs from di erent sources cryptiX, develop ed

co de les

byWolfgang Platzer, and part of JavaFaces, devel-

op ed by John Thomas to compute MD5. Si found

them to have 78 similarity and 86 in the other

In the second set of exp eriments we to ok 2056 Java

direction, and dup found them to have a single

byteco de les from 38 collections of les from many

common co de segment of 1336 lines. The common

di erent sources and ran tests on all of them at

co de segment corresp onded to 60 identical lines in

the same time allowing for at least 50 similarity.

the Java les. There was no similarity otherwise in

The goal was to lo ok for similar les from di erent

the Java les, but si found additional similarities in

collections.

the byteco de les. The 60 identical lines are di er-

ent from the corresp onding lines of the MD5 RFC

Si rep orted 634 ordered pairs of les with similari-

[29], but semantically equivalent. Interestingly, a

ties of at least 50. We de ne similarityoftwo les

search of the WWW turned up a third Java program

as the p ercentage of one le that is contained in the

with the same 60 lines. Possibly two of these pro-

other. As a result, similarity is an asymmetric rela-

grams b orrowed from the third, or all three copied

tion: for example, if a le A is contained in another

a description of MD5 other than the RFC.

le B twice its size, then A is 100 similar to B, but

B is 50 similar to A. We use ordered pairs here for

The di erent-collection unordered pair rep orted

this reason.

only by dup lo oks at rst glance as if it surely must

come from a common source, b ecause dup rep orted

Of the 634 pairs, 591 were b etween les in the same

a common co de section of 1521 lines, representing

collection, and 43 were b etween les in di erent col-

85 of one le and 28 of the other. We didn't have

lections. Next, we use dup to aid in analyzing which

range of average average max

program source similarity number of source similarity byteco de similarity di erence

p ercent trials p ercent p ercent p ercent

90-100 59 93.2 88.7 9

Program 1 80-89 76 84.4 78.7 9

70-79 35 75.9 71.9 7

90-100 17 92.9 91 9

Program 2 80-89 25 83.4 80.2 8

70-79 6 76.2 74.2 5

Table 1: Similarity found by si for corresp onding Java les and byteco de les

si only b oth dup only

ordered pairs 552 82 10

ordered pairs, di erent collections 25 18 2

unordered pairs, di erent collections 23 9 1

Table 2: Similarities rep orted by si and dup for 2056 byteco de les

number of false p ositives was very small: at most the java source to compare. Up on insp ection of the

18 out of more than 2 million pairs. byteco de les, however, the common co de turned

out to b e the initialization of an array of size 256.

Since the stored values retrievable from the con-

stant p o ol of the byteco de les app eared unrelated,

4.3 Exp eriment 3: False negatives

the similarity is probably coincidental, merely an ar-

tifact of how the compiler generates co de for a series

of 256 array assignments.

The rst two exp eriments indicate that our to ols

can e ectively discover similarities while minimizing

For the pairs rep orted only by si , the Java source

false p ositives. But the question of false negatives {

was not available. However, almost all result from

that is, how many similar pairs we missed { remains

comparing a very small le 500 bytes or less in most

unresolved. There were none in the rst exp eriment,

cases to a large le; the small le was found to

but it was to o limited.

be similar to the large one, but not the other way

around. In addition, the names of these pairs in-

Since there are no other to ols to compare to, there is

dicate that the purp oses of the les are unrelated.

no guaranteed way for us to measure false negatives.

We conclude that these are false p ositives. The ex-

Nevertheless, we b elieve that the following "blind"

ceptions were matches whose names included H or

exp eriment gives a reasonable indication.

V apparently for horizontal and vertical; the H/H

and V/V pairs were rep orted by b oth si and dup,

We asked a friend who was familiar with the goals

but the H/V and V/H pairs only by si . The names

of our work, but not with the techniques we use, to

of another two pairs of matching les related to but-

randomly pick 10 programs from a set of 765 java

tons and checkb oxes, but the remaining pairs of les

programs a subset of the 2056 programs for which

app ear to b e totally unrelated based on the sub ject

we had the source, make random changes to them,

matter indicated by the names.

compile them p ossibly under a di erentversion of

the compiler and give the set of byteco de les in a

To summarize, we found many similarities and very

random order. We then ran si and dup to see how

few false p ositives by combining the information

many of his changes we can detect.

from dup and si . The di erent-collection pairs re-

p orted by b oth dup and si all app ear to b e valid

To make the test even more blind, our tester actu-

instances of similarity. The one rep orted only by

ally made changes to 12 programs, and added to our

dup was not; some of the ones rep orted only by

original test some additional programs as well as re-

si were not, esp ecially for small les. Overall, the

moved some programs. si running with a thresh-

for si , we also ran si on the disassembled co de old of 65 similarity discovered 9 of the 12. Dup,

without applying the o set enco ding. With a 25 with a threshold of 100 common lines, discovered

threshold of similarity, si discovered only three of 8. Together, they discovered 10. All 12 were found

the 12 pairs 4, 5, and 10, a much worse p erfor- by si when run with a 25 similarity threshold

mance than that describ ed ab ove for the o set en- and by dup when run with a 50-line threshold. On

co ding. the other hand, decreasing the threshold to 25 in-

creases the numb er of ordered pairs si nds among

the 2056 original les from 634 to 1430. Still reason-

ably small compared to the total of over 4 million

4.4 Running times and scalability

p ossible pairs, but clearly the numb er of false p osi-

tives grows when the threshold is decreased.

For the 2056 les, si used 41 seconds of elapsed

Out of the two programs that were missed by b oth

time and 4 seconds of user time while dup used 1

si and dup, one is particularly interesting. It in-

minute 55 seconds of elapsed time and 1 minute 7

volved relatively few changes to the source, but they

seconds of user time running under IRIX 5.3 and

made the byteco de le very di erent. On close in-

using one of 12 150 MHZ IP19 Pro cessors, with

sp ection, we found that the main culprit wasamove

data cache size 16 Kbytes, instruction cache size

of a segment of co de, which resulted in a byteco de

16 Kbytes, secondary uni ed instruction/data cache

le with many jumps around the relo cated co de. For

size 1 Mbyte, and main memory size 1280 Mbytes.

si , the o sets of all these jumps were a ected by the

Enco ding the 2056 byteco de les to ok 7 minutes

relo cation, resulting in di erent ngerprints. For

of elapsed time for si and 8 minutes indep en-

dup, the relo cated co de broke up long matches in

dently for dup. Most of the prepro cessing could b e

b oth places, which mattered since the byteco de les

shared. Si also enables the creation of an index so

were small - only four times the threshold length.

that new les can b e compared with an index of les

pro cessed earlier. Comparing one 50K byteco de to

Tovalidate the usefulness of the o set enco ding for

all the 2056 les in the index within 50 similar-

di , we ran di on the disassembled co de with and

ity takes a couple of seconds for enco ding the new

without the o set enco ding. The results are shown

le and thereafter the pro cessing by si is essentially

in Table 3. The pairs varied in le size and how

instantaneous 0.2 user time and b etween 0:00 and

manychanges were made. To get a relatively size-

0:01 elapsed time. The size of the index dep ends

invariant measure, we use the length of the di out-

on the amount of sampling done and the precision

put in lines divided by the sum of the le sizes,

of results; for the exp eriments used here the index

for each typ e of le. Pair 6 is sp ecial: di erent

o ccupied ab out 5 of the total size of all les.

versions of the JDK compiler were used to generate

two class les from the same java le. Consequently,

Dup is much greedier in space than si , and conse-

no value is given for this measure for the java le.

quently will not scale to pro cessing huge numb ers

Our second measure is the average length of blo cks

of byteco de les at one time. If the goal were to

of identical lines rep orted by di . The last column

pro cess an enormous number of byteco de les, say

contains the sum of the sizes of the o set-enco ded

to provide a registry service for the World Wide

disassembled les, which is the same as the sum of

Web as has b een prop osed for text les [9], then si

the le sizes without the o set-enco ding.

should b e used to screen for similarities that should

be checked subsequently by dup.

For most of the pairs, the data in Table 3 showa

signi cant improvement from using o set enco dings

for di . In a few instances, the values are ab out

the same with and without o sets. Pair 11 is the

5 Related Work

only instance where di found signi cantly less sim-

ilarity with o set enco dings, but just for the second

measure. Note that pairs 3, 10, and 12 were the

Javabyteco de les are relatively easy to decompile

ones not discovered by si with a 50 threshold, as

into Java. Programs can b e rewritten in structured

discussed ab ove, and pairs 9-12 were the ones not

form by analyzing the ow of control, and the names

discovered by dup with a 100-line threshold.

of classes, elds, and metho ds are available from the

class le. In fact, at least four byteco de

Tovalidate our approachof using o set enco dings

have b een implemented: Mo cha by Hanp eter Van

di -size/total-lines ave. ident. blo ck size sum of le

pair as  in lines sizes in lines

java o set no-o set o set no-o set o set = no-o set

1 13 6 57 185 3 2740

2 14 17 68 18 2 1040

3 30 24 61 8 2 2098

4 14 14 21 173 20 1603

5 59 40 42 10 10 988

6 - 2 34 198 5 6041

7 19 13 32 22 6 8789

8 23 24 59 36 12 941

9 14 8 45 61 4 265

10 24 32 31 7 8 662

11 50 55 56 7 24 428

12 18 41 67 6 2 1037

Table 3: Results of running di with and without the o set enco ding

may b e p ossible to detect that they are used. Vliet, now deceased, and according to [31], taken

over as part of Borland's JBuilder [8], WingDis [33],

Another technique for enabling detection of stolen DejaVu [18], and Krakatoa [28]. A discussion of

co de is steganography, the art of hiding information these app ears in [14]. These pro duce quite readable

such that the information can later b e detected [2]. co de except for variable names, and even those may

For ob ject co de, insertions of extra no-op sequences be available p erhaps inadvertently from the class

of instructions would slowitdown, but Intel rep orts le if the compiler generated optional information

that it has successfully replaced co de fragments by aimed at debuggers.

equivalent ones to customize securitycode[3]. How-

ever, this is not common practice. Because class les are relatively easy to decompile

readably, Java applications are p otentially vulner-

For Javabyteco de les, the single Java Virtual Ma- able to theft, even if they are distributed only as

chine means that the variety of platforms is not an byteco de les. An unscrupulous p erson could use

issue. Multiplicity of can b e dealt with by one of the decompilers mentioned ab ove to decom-

having the owner of the co de compile the source les pile the byteco de les of an application, mo dify the

with each of the available Java compilers and com- resulting source, recompile into byteco de les, and

pare the resulting byteco de les with the byteco de distribute the byteco de les.

les that were susp ected to b e stolen. Currently, the

numb er of di erent compilers is not large, so this is In fact, a numb er of \obfuscators" have b een writ-

manageable. Co de compiled by di erent releases of ten to make decompilation less successful. These in-

the same compiler will probably still b e very similar, clude Crema byVan Vliet, and likeMocha, also re-

although not necessarily identical. In our exp eri- p ortedly taken over by Borland's JBuilder[8], Hose-

ments we found that, given the same source les, jdk Mo cha, by Mark LaDue, HashJava now renamed

1.0.2 and jdk 1.1.x generated class les that rarely SourceGuard [1], by K.B. Sriram, and Job e, by

di ered in any signi cantway. Eron Jokipii; for discussion of these, see [23, 32].

Early obfuscators either changed symb ol names or

added no-ops to byteco de les in such a way that

particular decompilers crashed. New obfuscators

e.g., [13,12] employmuch more sophisticated tech-

6 Conclusions and Future Work

niques, some of cryptographic strength, to change

the app earance of co de. These techniques not only

make it harder to decompile, but they also makeit

Our exp eriments have validated our approach by

p ossible to change the app earance of byteco de les

showing that our to ols are able to deduce similar-

that come from the same source. Our metho ds will

ity in Java source from similarity in the byteco de

not be able to defeat such techniques, although it

les. We can make the rate of false p ositives very

low while keeping false negatives reasonably mini- References

mal. Our p ositional enco ding proves to be a very

powerful technique. Since the three to ols si , dup,

[1] 4thpass Software Corp oration. Protect your

and di are based on di erent techniques, one may

Javainvestment. http://www.4thpass.com/,

detect similarity when another one misses it. Lo cal

Apr. 7, 1998.

mo di cations within sections of co de will reduce the

[2] Ross Anderson and Fabien Petitcolas. On the

p ercentage of similarity found by si , but dup will

limits of steganography. In IEEE J-SAC, to

still nd long matches in the unchanged sections.

app ear.

Name changes of variables would not b e a problem

since the compiler turns variable references into ta-

[3] D. Aucsmith. Tamp er resistant software: An

ble indices, which we have shown our metho ds to

implementation. In Information Hiding, vol-

handle despite change in absolute values. Reorder-

ume 1174 of Lecture Notes in Computer Sci-

ing ma jor sections of co de, e.g. by changing the

ence, pages 317{333. Springer-Verlag, 1996.

order of metho ds, will have little e ect on si and

[4] Brenda S. Baker. On nding duplication and

dup, though it will greatly a ect di .

near-duplication in large software systems. In

Second Working Conference on Reverse Engi-

A ma jor advantage of lo oking at Java class les

neering, pages 86{95, 1995.

rather than at binaries for other programming lan-

guages is that there is just one version of the Java

[5] Brenda S. Baker. Parameterized pattern

Virtual Machine for all platforms or at least that is

matching: Algorithms and applications. J.

Sun's intention, while binaries of other languages

Comput. Syst. Sci., 521:28{42, Feb. 1996.

are di erent for di erent platforms. Extending our

[6] Brenda S. Baker. Parameterized duplication

techniques to binaries is a natural step to take, al-

though binaries present several additional problems: in strings: Algorithms and an application to

software maintenance. SIAM J. ,

They are strongly tied to the architecture, there

265:1343{1362, Oct. 1997.

are many compiler and even more optimization and

co de restructuring programs, and the information in

[7] H.L. Berghel and D.L. Sallach. Measurements

binaries as not as well organized as in byteco de les.

of program similarity in identical task environ-

Nevertheless, we b elieve that it will b e p ossible, at

ments. SIGPLAN Notices, 98:65{76, August

least in some degree, to identify similar binaries.

1984.

It would b e helpful to have a graphical user interface

[8] Borland. Jbuilder. http://www.borland.com/

that links the output of si and dup to a ,

jbuilder/, Oct. 23, 1997.

allowing a develop er to see the evolution of two sim-

[9] S. Brin, J. Davis, and H. Garcia-Molina. Copy

ilar programs clearly.

detection mechanisms for digital do cuments. In

Proceedings of the ACM Special Interest Group

on Management of Data SIGMOD, 1995.

[10] Andrei Bro der, Steve Glassman, Mark Man-

7 Acknowledgements

asse, and Geo rey Zweig. Syntactic clustering

of the web. In Proceedings of the Sixth Inter-

national World Wide Web Conference, pages

Peter Bigot was our blind tester, and we thank him

391{404, April 1997.

for a thankless job well done.

[11] Kenneth Ward Church and Jonathan Isaac

Helfman. Dotplot: A program for exploring

self-similarity in millions of lines of text and

co de. Journal of Computational and Graphical

Statistics, 22:153{174, June 1993.

8 Availability

[12] Christian Collb erg, Clark Thomb orson, and

Douglas Low. Breaking abstractions and un-

Contact Brenda Baker, bsb@b ell-labs.com, for in- structuring data structures. In IEEE Inter-

formation ab out licensing the software from Lucent national Conference on Computer Languages

Technologies. 1998, , IL, May 1998.

[25] T.J. McCab e. , reusability, [13] Christian Collb erg, Clark Thomb orson, and

redundancy: the connection. American Pro- Douglas Low. Manufacturing cheap, resilient,

grammer, 310:8{13, Oct. 1990. and stealthy opaque constructs. In Principles

of Programming Languages 1998, POPL'98,

[26] Eugene W. Myers. An OND di erence algo-

San Diego, CA, Jan. 1998.

rithm and its variations. Algorithmica, 1:251{

266, 1986. [14] Dave Dyer. Java decompilers compared.

JavaWorld, July 16, 1997. http://www.

[27] PocketSoft. .RTPatch Professional, Feb.

javaworld.com/javaworld/jw-07-1997/jw-

23, 1998. http://www.pocketsoft.com/

07-decompilers.html.

products.html.

[15] Luis Fernandes. Are there any apps

[28] Todd Pro ebsting and Scott A. Watterson.

that can display les in parallel, high-

Krakatoa: Decompilation in java do es byte-

lighting in color the di erences between

co de reveal source?. In USENIX Conference

them? http://www.ee.ryerson.ca:8080/~

on Object-oriented Technologies and Systems,

elf/xapps/Q-XI.html, April 3, 1997.

June 1997.

[16] Nevin Heintze. Scalable do cument ngerprint-

[29] Ronald Rivest. The MD5 message digest

ing. In Proceedings of the Second USENIX

algorithm. RFC 1321, Apr. 1992. http://

Workshop on Electronic Commerce, Nov. 18-

info.internet.isi.edu:80/in-notes/rfc/

21, 1996.

files/rfc1321.txt.

[17] Susan Horwitz. Identifying the semantic and

[30] Narayanan Shivakumar and Hector Garcia-

textual di erences between two versions of a

Molina. Building a scalable and accurate copy

program. In Proceedings of the ACM SIGPLAN

detection mechanism. In Proceedings of 1st

Conference on Design

ACM International Conference on Digital Li-

and Implementation PLDI, pages 234{245,

braries DL'96, March 1996.

June 1990.

[31] Smith. Mo cha, the Java decom-

[18] Innovative Software. OEW for Java. http://

piler. http://www.brouhaha.com/~eric/

www.isg.de/OEW/Java/, Oct. 2, 1997.

computers/mocha.html, Dec. 28, 1997.

[19] H.T. Jankowitz. Detecting plagiarism in stu-

[32] Thomas Trab er. To ols for working with byte-

dent PASCAL programs. Computer Journal,

co de: Byteco de assemblers, and

311:1{8, 1988.

dumps. http://www.oasis.leo.org/java/

development/bytecode/00-index.html.Part

[20] J. Howard Johnson. Substring matching for

of Java Oasis, http://www.oasis.leo.org/

clone detection and change tracking. In Proc.

java/00-oasis.html.

International Conf. on Software Maintenance,

pages 120{126, 1994.

[33] Wingsoft. Wingsoft Pro ducts Information.

http://www.wingsoft.com/products.shtml.

[21] Brian W. Kernighan and Rob Pike. The UNIX

Programming Environment. Prentice-Hall, En-

[34] Tak Yan and Hector Garcia-Molina. Duplicate

glewo o d Cli s, New Jersey, 1984.

removal in information dissemination. Techni-

cal rep ort, Stanford Computer Science Dept.,

[22] Tim Lindholm and Frank Yellin. The Java Vir-

1995. submitted for publication.

tual Machine Speci cation. Addison-Wesley,

Reading, Massachusetts, 1997.

[23] Qusay H. Mahmoud. Java tip 22: Pro-

tect your byteco des from reverse engineer-

ing/decompilation. JavaWorld, Jan. 2, 1997.

http://www.javaworld.com/javatips/jw-

javatip22i.html.

[24] Udi Manb er. Finding similar les in a large le

system. In Proc. 1994 Winter Usenix Technical

Conference, pages 1{10, Jan 1994.