The Journal of Systems and Software 71 (2004) 1–10 www.elsevier.com/locate/jss

Advanced obfuscation techniques for bytecode

Jien-Tsai Chan *, Wuu Yang

Department of Computer and Information Science, National Chiao Tung University, Hsinchu, Taiwan 300, ROC

Received 17 May 2002; received in revised form 30 July 2002; accepted 2 August 2002

Abstract

There exist several obfuscation tools for preventing Java bytecode from being decompiled. Most of these tools simply scramble the names of the identifiers stored in the bytecode by substituting meaningless names for them. However, the scrambling technique cannot deter a determined cracker for very long. We propose several advanced obfuscation techniques that make Java bytecode impossible to recompile or make the decompiled program difficult to understand and to recompile. The crux of our approach is to overuse an identifier. That is, an identifier can denote several entities, such as types, fields, and methods, simultaneously. An additional benefit is that the size of the bytecode is reduced because fewer and shorter identifier names are used. Furthermore, we also propose several techniques to intentionally introduce syntactic and semantic errors into the decompiled program while preserving the original behavior of the bytecode. Thus, the decompiled program would have to be debugged manually. Although our approach is to scramble the identifiers in Java bytecode, the scrambled bytecode produced with our techniques is much harder to crack than that produced with other identifier scrambling techniques. Furthermore, the run-time efficiency of the obfuscated bytecode is also improved because the size of the bytecode becomes smaller after obfuscation.

© 2002 Elsevier Inc. All rights reserved.

Keywords: Program protection; Bytecode obfuscation; Java programming language

1. Introduction

Traditionally, a program is compiled to native code (or machine code). Most of the symbolic information is stripped off when the program is compiled. The identifiers that denote variables and functions in the source program become addresses in the compiled program. Decompiling such a program, though difficult, is still possible. Because no method can absolutely protect a program from decompilation attacks by experienced crackers, we usually consider a protection method successful if it makes the cracking work costly in terms of time and effort. Cracking becomes valueless when its cost exceeds that of rewriting the program. Therefore, one of the basic rules is to prevent decompilation from being done automatically with tools (i.e., decompilers).

The Java programming language has become more and more popular since its first release in 1994 (Gosling et al., 2000). One of the major benefits of Java is portability: the compiled program can run on most platforms. A Java program is compiled to platform-independent bytecode. In order to achieve platform independence, Java uses symbolic references, instead of the traditional memory addresses, to link entities from different libraries (including the standard and proprietary libraries). Therefore, the names of types, fields, and methods are stored in a constant pool within a bytecode file (Engel, 1999; Lindholm and Yellin, 1999; Meyer and Downing, 1997; Venners, 1998). These names and the simple stack-machine instructions facilitate the decompilation of the bytecode file.

There are many free or commercial Java decompilers (D & C, 2001; Hoeniche, 2001; Kouznetsov, 2001; Kumar, 2001; Mayon, 2001; PsychoticSoftware, 2001; Vliet, 1996). The decompiled program is almost identical to the original source program. These decompilers have become the lethal weapon of intellectual property piracy.

* Corresponding author. Tel.: +886-9-3330-3945; fax: +886-3-572-1490. E-mail addresses: [email protected] (J.-T. Chan), wuu-[email protected] (W. Yang).

0164-1212/$ - see front matter © 2002 Elsevier Inc. All rights reserved. doi:10.1016/S0164-1212(02)00066-3

Obfuscation tools are one of the major defenses against the decompilers. Obfuscation transforms clear bytecode into more obscure bytecode. The goal of obfuscation is to make the decompiled program much harder to understand so that a cracker has to spend more time and effort on the obfuscated bytecode. Most of the existing obfuscation tools simply scramble the symbolic information (identifiers) in the constant pool (Dr. Java, 2001; Eastridge, 2000; Hoeniche, 2001; Plumb, 2001; Retrologic, 2000). Usually, a meaningful name is replaced by a meaningless name.

In this paper, we propose a new obfuscation approach that achieves better identifier scrambling. Based on the approach, several techniques are introduced to make the bytecode much harder to understand and, sometimes, to make the decompiled program impossible to recompile. The basic approach is to endow an identifier with as much information as possible. An identifier can denote several types, several fields, and several methods at the same time in the obfuscated bytecode. The cracker is confused because an identifier is identified not only by its name but also by the context in which it exists. An additional benefit is that the size of the bytecode is reduced because long, meaningful names are replaced by shorter, meaningless names. We also propose several techniques to purposely introduce certain hidden compilation errors into the obfuscated bytecode so that the decompiled program cannot be compiled again. Therefore, a cracker has to spend a lot of time debugging the decompiled program manually. The basic approach and these techniques make a Java bytecode file harder to crack. Furthermore, the run-time efficiency of an obfuscated program is also improved.

2. Obfuscation scope

In Java, an application consists of one or more packages. A programmer may divide his own application into packages. He may also use the packages in the standard library and proprietary libraries. Usually, only the part of an application that is developed by the programmer is distributed. The proprietary libraries are not distributed because of the copyright restrictions. The part of a program that will be obfuscated by the obfuscation techniques is called the obfuscation scope. Generally, only the programmer-developed part of an application is protected. The packages that serve as utilities in the standard and proprietary libraries are not obfuscated. However, the obfuscation scope is not necessarily limited to the packages written by the programmer. When an application is not big enough to confuse the cracker, the standard and proprietary libraries could be included in the obfuscation scope. However, the redistribution of the obfuscated proprietary libraries may violate the copyright.

3. The candidates for identifier scrambling

According to the Java specification (Gosling et al., 2000), an identifier in a Java program may denote

• a package
• a top-level type (either a class or an interface)
• a nested type (either a class or an interface)
• a field
• a method
• a parameter (of a method, a constructor, or an exception handler)
• a local variable

However, not all of them are kept in the bytecode file after compilation. Only the identifiers that denote the first five items in the above list are stored in the bytecode. By default, parameters and local variables are stripped off from the bytecode and become the memory addresses of the local variable array in the corresponding stack frame (see Section 3.6 of Lindholm and Yellin (1999) and Section 3.7 of Engel (1999)). If the debug-info option of the compiler is enabled, the names of parameters and local variables will be stored in the LocalVariableTable in the bytecode. The LocalVariableTable can be removed by disabling the option (which is the default setting of the Java compiler). If the LocalVariableTable is not available, Java decompilers usually generate names sequentially for parameters and local variables. Though it is possible to rename the variables in the LocalVariableTable to make the decompilation process more difficult, a smarter decompiler may simply ignore these modified names and generate new variable names instead. Since we cannot prevent the decompilers from generating names for parameters and local variables, names in the LocalVariableTable are not candidates for obfuscation. The candidates for obfuscation are the first five items.

On the other hand, not all of the candidates can be obfuscated. When an application runs, the Java virtual machine (JVM) dynamically loads and links the referenced types into the runtime environment. The bytecode file that stores the referenced type is located by a symbolic reference: the fully qualified name of a class or an interface. These symbolic references cannot be changed. Hence, only the candidates that reference entities in the obfuscation scope will be obfuscated. The candidates that reference entities outside the obfuscation scope (which generally denote entities in the standard library or the proprietary libraries) should not be obfuscated.

The identifiers that denote entities in the obfuscation scope need further investigation. The following four groups of identifiers should not be obfuscated (these groups are called the Exception groups):

Exception group 1: The instance method that implements an abstract method of a superclass (or a superinterface) that is outside the obfuscation scope.
Exception group 2: The instance method that overrides an inherited method of a superclass that is outside the obfuscation scope.
Exception group 3: The entities that are explicitly designated by the programmer to remain unchanged.
Exception group 4: The instance method that serves as a callback function.

Java supports polymorphism. An instance method is dynamically dispatched at run time based on the signature of the method. The signature of a method consists of the name of the method and the number and the types of the formal parameters. Note that the return type and the throws clause are not part of the signature of a method in Java. Because the name of a method M outside the obfuscation scope is retained, the name of the method that is in the obfuscation scope and overrides the method M should be retained as well. Otherwise, the JVM cannot find the overriding methods based on the signature of M. These retained methods belong to Exception groups 1 and 2.

When a package is in the obfuscation scope, it is sometimes necessary to keep some parts of the package outside the scope. For example, the main method is the entry point of an application. Therefore, the name of the main method should be retained. Furthermore, a proprietary library may export certain types and certain methods as the interface of the library. The names of these exported types and methods should be retained as well. These retained entities are said to belong to Exception group 3.

The callback mechanism is heavily used in the event model of the graphical-user-interface (GUI) library of Java. When the caller of an instance method N that serves as a callback function is outside the obfuscation scope, the name of N should not be obfuscated. Otherwise, the caller cannot find method N at run time. On the other hand, if the caller is also in the obfuscation scope, the symbolic reference can be changed to the new, obfuscated name of N. In this case, the name of N can be obfuscated.

Determining whether a method is a callback function is a complex task. We first have to construct a call-graph by examining the whole application and all the referenced libraries. Through the call-graph, callback methods can be identified. However, this construction would take a lot of time. We made a safe assumption here. Generally, the class that contains callback methods implements specific interfaces or extends a specific class. The caller of the callback method takes a parameter whose type is the superinterface or the superclass. The actual object that contains the callback method is passed as a parameter and a callback method is invoked through the polymorphism mechanism. Based on this assumption, all the callback methods whose names should be retained will belong to Exception group 1 or 2.

Fields, static methods, and nested types are statically resolved by the Java compiler. Once the bytecode is generated, the JVM will not change the resolution. Therefore, the names of fields, static methods, and nested types that are in the obfuscation scope may be changed arbitrarily.

In summary, the targets of obfuscation include the names of the entities in the obfuscation scope except those in Exception groups 1, 2, and 3 (under the assumption above, Exception group 4 is already covered by groups 1 and 2).

4. Basic approach

Our basic obfuscation approach is to reuse an identifier as often as possible. This approach makes an identifier heavily overloaded and hence confusing to a cracker. An identifier can denote several types, fields, and methods at the same time after obfuscation. When the obfuscated bytecode is decompiled, the meaning of an identifier is determined not only by its name but also by the context in which it exists. A cracker is confused because he has to identify that context for every occurrence of the identifier. An additional benefit is that the size of the bytecode is reduced because fewer and shorter names are used.

There are two hierarchical structures in a Java application. The first is the package structure. An application consists of one or more packages. A package may contain zero or more subpackages and top-level types (i.e., classes and interfaces). The subpackages and the top-level types in a package cannot have the same name. However, a subpackage or a top-level type may have the same name as the enclosing package. Suppose that sequentially generated identifiers are used in an obfuscation tool. The generation of identifiers may be restarted for every package.

The second structure is the inheritance structure. Every class, except the Object class, has a direct superclass. A class may implement zero or more interfaces. An interface can inherit zero or more interfaces. An interface and a class implicitly inherit the Object class if they do not inherit any other interfaces and other classes, respectively. The depth of the inheritance hierarchy is unlimited. Through the inheritance structure, we can identify instance methods that belong to Exception groups 1 and 2; these instance methods will not be renamed. Furthermore, the instance methods that have an overriding relationship and are in the obfuscation scope must be renamed consistently, if they are ever renamed.

Fig. 1 shows an example in which an application consists of several packages and types. Notice that the Object class is implicitly included in every Java application. Suppose that the packages p and p.q are in the obfuscation scope. The corresponding package structure and inheritance structure are shown in Figs. 2 and 3, respectively. The shadowed area (surrounded by a dotted curve) denotes the obfuscation scope.

Fig. 1. All the types of an application.
Fig. 2. Package structure.
Fig. 3. Inheritance structure.

The basic obfuscation approach consists of the following five steps:

(1) Analyze all the bytecode files that are in the obfuscation scope and construct the package structure and the inheritance structure.
(2) Traverse the package structure from root to leaf. During the traversal, use sequentially generated names to substitute the package names and top-level type names. The generation of names is restarted for each package node. For example, suppose that the generating sequence of names is a, b, c... After this step, the package structure in Fig. 2 will become the one in Fig. 4 and the inheritance structure in Fig. 3 will become the one in Fig. 5.
(3) Traverse the inheritance structure from root to leaf. During the traversal, perform the following steps for each type T.
    (3a) Restart the generation of names. Replace the names of all the fields with the sequentially generated new names.
    (3b) Restart the generation of names. Replace the names of all the nested types with the sequentially generated new names. A type in this paper means a class or an interface. A type that is declared within another type is called a nested type (it is also called a member type). There is no limit on the depth of type nesting. According to the Java language specification (Gosling et al., 2000), a nested type cannot have the same name as any of its enclosing types. Therefore, the names used by the enclosing types must be avoided.
    (3c) Restart the generation of names. Replace the name of each method M with a sequentially generated new name. When M is an instance method, check whether M belongs to Exception group 1 or 2. If so, the original name of M remains unchanged. Otherwise, every supertype S of T that declares a method overridden by M must also be in the obfuscation scope. When there is an instance method N in S with the same original signature as M, use the new name of N for M. Otherwise, use a newly generated name for M. Notice that the newly generated name must not be the same as the name of a method in S, because that may result in a new overriding relationship. An inherited instance method cannot be overridden arbitrarily; otherwise, the invoked method may be changed unexpectedly. Recursively repeat steps 3a to 3c for each nested type.
(4) Update all the symbolic references in the obfuscation scope with the substituted names.
(5) Save the obfuscated code to each bytecode file in the obfuscation scope.

Fig. 4. Obfuscated package structure.
Fig. 5. Obfuscated inheritance structure.

Notice that in Java source programs, a field may be shadowed (see Section 6.3.1 of Gosling et al. (2000)) by the fields in the subtypes or the nested types. A nested type may be shadowed by a nested type in the subtypes or other more deeply nested types. Arbitrarily changing the name of a field, a static method, or a nested type may cause unexpected shadowing or obscurity among the names. However, such shadowing and obscurity do not change the behavior of the application because those static entities are determined statically at compile time. Even if they have the same name, the JVM knows which type should be checked. Renaming will not change the entities to be used. Therefore, steps 3a and 3b are simplified because we do not have to worry about the obscurity problem, which has already been resolved by the compiler.

4.1. Flattening the package structure

Usually, types (classes and interfaces) that provide related functions are grouped in a package. This is the purpose of introducing the package structure in Java. Packages help programmers to organize their programs. However, packages also help a cracker to analyze the bytecode. The functions of the types in a package are easier to understand after some of them have been understood.

Another purpose of the package structure is to control the accessibility of the members (i.e., fields, methods, and nested types) in a type. The members declared as protected in a type T can be accessed by the types in the same package that T belongs to and by the subtypes of T. The members that are declared as default-access (that is, no access modifier is specified) in a class T can be accessed by the types in the same package that T belongs to.

Flattening the package structure is to put all the types that comprise an application into a single package. A cracker then cannot make use of the package structure to crack an application. Flattening the package structure does extend the accessible range of the protected and the default-access members to the whole application. However, extending the accessible range in the bytecode, not the source code, will not change the behavior of an application. The Java compiler has already checked the accessibility of the members. Although the Java virtual machine will check the accessibility again when the application runs, program behavior will not be affected.

4.2. Dynamic loading problem

Java supports dynamic loading and reflection. Through the method Class.forName("MyClassName"), the JVM can load the named class at run time. The name of the class could be determined at run time. Therefore, we cannot determine through static analysis at obfuscation time whether to keep the name of the type or to change the value of the parameter. Consequently, the names of the types that will be dynamically loaded should be retained and be indicated manually.

The members of a dynamically loaded type need further investigation. Most of the methods in the class Class return the public members of the dynamically loaded type. Therefore, the public members of a dynamically loaded type should not be renamed.

There might be chances to use the protected and the default-access members of a dynamically loaded type through reflection. This situation rarely happens and should be considered as problematic. Nevertheless, the protected and the default-access members of a dynamically loaded type should not be renamed. Only the private members of a dynamically loaded type can be renamed arbitrarily.

4.3. Overloading unrelated methods

In Java, methods are determined by their signatures. This means that two methods are considered different if they have the same name but different numbers or types of the formal parameters. Such methods are called overloaded methods.

To obfuscate the bytecode further, we can use the same name for all the methods that have different names and different numbers or types of the formal parameters. A program is then further obscured because the methods can only be differentiated by the number and the types of their formal parameters.

There are some important issues when unrelated methods are overloaded. First, the methods in a subclass should preserve the overriding relationship among the superclasses and the subclass. Because the overriding relationship affects the method to be invoked at run time, the overriding relationship should not be modified when the methods are renamed.

Notice that the overloading relationship among the methods of the superclasses and the subclass need not be preserved. Furthermore, we can make use of the relationships between static methods and instance methods of the superclasses and the subclass to provide another layer of protection. See Section 5.4 for details.
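As a source-level illustration of the first issue, the following sketch shows what a small hierarchy might look like after unrelated methods are given one shared name while the overriding pair is renamed consistently. It is a hand-written example under assumed original names (describe, reset, log), not output of the authors' obfuscator, which works on bytecode; the classes Base, Derived, and OverloadDemo are hypothetical.

```java
class Base {
    // Originally: String describe(int code)
    String a(int code) { return "Base.describe"; }

    // Originally: String reset(), unrelated to describe; it now shares the name a.
    String a() { return "Base.reset"; }
}

class Derived extends Base {
    // Originally overrode describe(int); it must receive the same new name
    // as Base.a(int) so that the overriding relationship is preserved.
    @Override
    String a(int code) { return "Derived.describe"; }

    // Originally: String log(String msg), unrelated; overloading it with a is safe
    // because no method in Base takes a single String parameter.
    String a(String msg) { return "Derived.log"; }
}

class OverloadDemo {
    // Calls each method once; dynamic dispatch still selects the overriding method.
    static String[] run() {
        Base b = new Derived();
        return new String[] { b.a(7), b.a(), ((Derived) b).a("x") };
    }

    public static void main(String[] args) {
        for (String s : run()) {
            System.out.println(s);
        }
    }
}
```

All three calls use the single name a, yet each reaches the method it reached before renaming; only the parameter lists tell them apart.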

Second, widening conversions (i.e., coercion) can result in surprising benefits. In Java, there are two kinds of widening conversions. A widening primitive conversion is a conversion from a primitive type to another primitive type. During a widening primitive conversion, no information about the overall magnitude of the numeric value is lost (although converting an int or a long to a float may lose precision). A widening reference conversion is a conversion between non-primitive types. A widening reference conversion (i.e., from a subclass to its superclass) must be proved correct at compile time. Method invocation allows both widening conversions. The widening conversions are performed automatically and implicitly by a compiler. Consider the following example.

    class X {
        void m (float a) {...}
        void p (long b) {...}
        void f() {
            int s = 1;
            m(s);
        }
    }

The method invocation m(s) in the method f invokes the method m. When the bytecode is obfuscated by overloading unrelated methods, the decompiled program (produced by the Jad decompiler (Kouznetsov, 2001)) becomes

    class X {
        void g (float a) {...}
        void g (long b) {...}
        void g() {
            int i = 1;
            g(i);
        }
    }

When the decompiled program is compiled to bytecode again, the method invocation g(i) will invoke the method g(long) (the original p(long)) because an integer value is converted to a long value rather than a float value by a widening conversion. In contrast, the obfuscated bytecode still invokes g(float) (the original m(float)). The behavior of the decompiled program is silently changed by the obfuscation. This kind of silent semantic change provides better protection for bytecode.

The technique of overloading unrelated methods cleverly introduces semantic changes in the decompiled program. All the methods already exist; no bogus methods are introduced. This semantic change is almost impossible to discover, even by an expert. However, not all the decompilers we examined are fooled by this technique. Some decompilers, such as Jode (Hoeniche, 2001) and JReverse Pro (Kumar, 2001), add an explicit casting conversion to each parameter when necessary. The semantic change will not occur due to the explicit casting conversion.

4.4. An interesting example

The example in Fig. 6a is a common situation in many applications. Class C contains two fields t and u and inherits the fields x and y from the superclasses A and B, respectively. The values of x and y are accessed in class C. After the basic obfuscation approach discussed in the previous section is applied to the bytecode of the example, the obfuscated bytecode is decompiled with the Jad decompiler (Kouznetsov, 2001). The result is shown in Fig. 6b. The obfuscated bytecode still functions correctly. However, the decompiled result cannot be compiled successfully again.

Fig. 6. (a) An example program and (b) the decompiled result by the JAD decompiler.

This example shows the possibility of introducing changes to the bytecode so that the decompiled program of the obfuscated bytecode cannot be compiled successfully again. The details will be discussed in the next section.

5. Making the decompiled program uncompilable

The techniques discussed in this section modify the bytecode so that the decompiled program of the obfuscated bytecode contains obscure compilation errors while the obfuscated bytecode still functions correctly. Therefore, a cracker has to debug the decompiled program manually and, hopefully, painfully.

In a Java program, an identifier may denote a type, a field, a method, a parameter, or a local variable. The Java compiler could be confused about which entity is intended when an identifier denotes more than one entity. Therefore, several rules are defined in the Java language specification (Gosling et al., 2000) to clarify the confusion. A Java compiler should obey these rules strictly during compilation. However, once the bytecode is produced, the Java virtual machine will ignore these rules. This gap between a Java compiler and a Java virtual machine offers opportunities to intentionally violate these rules, not in the Java source program, but in the bytecode.

Therefore, the crux of the techniques presented in this section is to intentionally violate some of these rules in the bytecode in order to prevent the bytecode from being decompiled. Note that these obfuscations cannot be easily undone by automatic tools.

The rules and the techniques of violating them are discussed in the following subsections. Several techniques may be applied together to enforce even better protection. We will also discuss the limitations of each technique.

5.1. Illegal identifiers

The Java language specification (Gosling et al., 2000) states that an identifier should be a letter followed by letters or digits. An identifier cannot be identical to a keyword, a boolean literal, or the null literal. These rules help the lexical analyzer and the parser to analyze a program. However, these rules need not be obeyed in the bytecode. Note that when the JVM loads the bytecode, it does not verify whether the names in the constant pool comply with the definition of an identifier. Therefore, the names in the constant pool of the bytecode could be changed to use illegal characters, keywords, boolean literals, or the null literal. When the obfuscated bytecode is decompiled and then compiled, these illegal identifiers will result in compilation errors.

For example, we can rename an identifier in the bytecode as the boolean literal "false" or the symbol "<>?!#". The modified bytecode still runs as before. However, decompilers will have trouble with this change. We tested this technique with several available decompilers. Many decompilers use Jad as the decompilation engine and add their own user interfaces. Only the decompilers based on different decompilation engines are chosen for the experiment (D & C, 2001; Hoeniche, 2001; Kouznetsov, 2001; Kumar, 2001; Mayon, 2001; PsychoticSoftware, 2001; Vliet, 1996). The result is shown in Table 1. Jad and jAscii are smarter than others when handling keywords as identifiers. They change the keyword identifier to an ordinary identifier automatically. Other decompilers use the keyword as the name of an identifier directly in the decompiled program. On the other hand, all the tested decompilers are fooled by the illegal symbol. JReverse Pro and Jode even fail to decompile the modified bytecode.

Table 1
The decompiled results of using illegal characters in the name of an identifier

Decompiler      Use keyword as identifiers    Use illegal characters in identifiers
Jad             "_fldfalse"                   "<_3E_3F_21_23_"
jAscii          "_fldfalse"                   "< >?!#"
Mocha           "false"                       "< >?!#"
deClassify      "false"                       "< >?!#"
JReverse Pro    "false"                       ERROR
ClassSpy        "false"                       "< >?!#"
Jode            "false"                       ERROR

Note that not all the symbols and characters can be used as or in an identifier in the constant pool. Some symbols and characters have special meanings to the JVM and to the host system. The symbols and characters that are unfit as or in an identifier are listed in Table 2.

Table 2
The symbols and characters that are unfit as or in an identifier in JVM

Symbol and character    Note
<init>                  The constructor name of a class
<clinit>                The static initializer of a class
/                       Path separator in UNIX system
\                       Path separator in Windows system
:                       Path separator in Mac system or drive character in Windows system
$                       Nested type separator
.                       The separator in fully qualified names; should not be used in the name of a type

All the constructors of a type are named "<init>" in the bytecode. Therefore, the name "<init>" should be avoided in obfuscation. Otherwise, the JVM might be confused when calling a constructor. The symbol "<clinit>" is the name of the static initializer of a type. The JVM will invoke it to initialize the static members of a type.

The characters "/", "\", and ":" should not be used in the name of a type. The three characters are used as path separators in the host file systems on different platforms. Currently, most implementations of the Java runtime system use the file system of the host to store the bytecode files. The only exception we know is IBM's VisualAge for Java (IBM, 2001), which uses a database system (called ENVY) to manage the bytecode files. If the three characters are used in the names of types and the types are stored in the host file system, these separators will cause the JVM to misinterpret the types.

The character "$" is used as the separator of a type and its nested types. Arbitrarily using "$" in the name of an identifier may cause some unexpected results. However, it can be used cleverly to introduce another kind of protection. See Section 5.3 for details.

5.2. Some interesting examples

We could use a few characters that have specific meanings in a Java source program to rename the identifiers in the bytecode. The characters include ".", "(", ")", ";", and the space character. The following code is the original code of the example in this subsection.

    class A {
        int foo = 1;
    }

The name foo in the bytecode is changed to a name made up of the above characters. The decompilation results by different decompilers are shown in Table 3.

Table 3
The decompiled results of some interesting variable names (␣ denotes a space character)

Decompiler      foo → a.b          foo → 1.2                     foo → a()           foo → ␣␣␣ (3 spaces)    foo → a;b
JAD             int a.b; b = 1;    int _cls1_fld2; _fld2 = 1;    int f_28_29 = 1;    int _20_20_20 = 1;      int a_3B_b = 1;
jAscii          int a.b = 1;       int _fld1.2 = 1;              int a() = 1;        int ␣␣␣ = 1;            int a;b = 1;
Mocha           int a.b = 1;       int 1.2 = 1;                  int a() = 1;        int ␣␣␣ = 1;            int a;b = 1;
ClassCracker    int a.b = 1;       int 1.2 = 1;                  int a() = 1;        int ␣␣␣ = 1;            int a;b = 1;
JReverse Pro    int a.b = 1;       int 1.2 = 1;                  int a() = 1;        int ␣␣␣ = 1;            int a;b = 1;
ClassSpy        int a.b = 1;       int spy_1.2 = 1;              int a() = 1;        int ␣␣␣ = 1;            int a;b = 1;
Jode            int a.b = 1;       int 1.2 = 1;                  int a() = 1;        int ␣␣␣ = 1;            int a;b = 1;
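The names used in Tables 1 and 3 break recompilation because they are not legal identifiers in Java source code. As a rough illustration, and not part of the paper's tool, the following sketch checks candidate names against the source-level identifier rules using the standard javax.lang.model.SourceVersion utility; the class IdentifierCheck and the method isLegalSourceIdentifier are hypothetical names introduced here.

```java
import javax.lang.model.SourceVersion;

class IdentifierCheck {
    // A name is usable as a simple identifier in Java source if it is built from
    // identifier characters and is not a keyword, a boolean literal, or the null literal.
    static boolean isLegalSourceIdentifier(String name) {
        return SourceVersion.isIdentifier(name) && !SourceVersion.isKeyword(name);
    }

    public static void main(String[] args) {
        // "foo" is the original field name; the others are the renamings discussed in the text.
        String[] candidates = { "foo", "false", "<>?!#", "a.b", "1.2", "a()", "   ", "a;b" };
        for (String c : candidates) {
            System.out.println("\"" + c + "\" legal in source: " + isLegalSourceIdentifier(c));
        }
    }
}
```

Only foo passes the check; every other candidate is rejected, which is the property an obfuscator would want in a replacement name, subject to the restrictions listed in Table 2.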

The character "." is the separator in a fully qualified name and the decimal point in a floating-point number. It is also the separator between a reference and its members. After we change foo to "a.b", the Java compiler will consider "a" as an object or a type and "b" as a member of "a". This name results in a compilation error. All the decompilers are fooled by this renaming. The Jad decompiler even uses a wrong variable name in the constructor.

Changing foo to a floating-point number "1.2" also fools all the tested decompilers. This time, Jad, jAscii and ClassSpy try to correct the illegal identifier, but all of them fail.

However, the character "." should not be used in the name of a type in the bytecode. The substring before "." will be treated as the name of a package or a type by the JVM. Consequently, a runtime error happens.

The characters "(" and ")" are used as a pair and are appended to the end of the name of a field. In this case, the field name will be treated as a method name by the Java compiler. Jad can correct this illegal name automatically while other decompilers cannot.

Notice that the two characters "(" and ")" can also be used in the name of a method. In the bytecode, the return type and the types of the parameters of a method are encoded as a string, called a descriptor, which is stored separately from the method name. Therefore, using "(" and ")" in the name of a method does not affect how the JVM determines the signature of the method.

The space character and the tab character are the separators of tokens in a source program. After foo is changed to three spaces, the variable becomes invisible! Among the tested decompilers, only Jad can correct this illegal name; other decompilers cannot.

A semicolon ";" is the end mark of a statement or a declaration in a Java program. In the decompiled program, the name "a;b" is divided into two names. Consequently, a runtime error happens. Among the tested decompilers, only Jad can correct the illegal name; other decompilers cannot.

5.3. Nested type names

According to the Java language specification (Gosling et al., 2000), a nested type cannot have the same name as any of its enclosing types. If a nested type could have the same name as one of its enclosing types, the Java compiler would be confused in certain situations. For example, consider the following program.

    class M {
        class M {}
        void f() {
            M n;          // which M ??
            n = new M();  // which M() ??
        }
    }

The above example cannot be compiled because the Java compiler cannot determine which M is intended in the declaration of the local variable n.

After compilation, the name of a nested type N enclosed in type M becomes M$N. The Java compiler uses "$" as the separator between an enclosing type and a nested type while using "." as the separator among a package, its subpackages, and its top-level types. Suppose that M is a top-level type in the package p.q. In the bytecode, the fully qualified name of N is p.q.M$N. Furthermore, N is compiled to an independent bytecode file named "M$N.class".

In the bytecode, the simple name of a nested type can be changed to be the same as that of its enclosing type. For example, a nested type named M$N can be renamed as M$M.

After a nested type is renamed, the bytecode file and the corresponding symbolic references also have to be modified accordingly. Otherwise, the JVM cannot find the bytecode file when trying to load the nested type. The decompiled program of the obfuscated bytecode cannot be successfully compiled again.

5.4. Static methods vs. instance methods

According to the specification of the Java language (Gosling et al., 2000), an inherited static method cannot be overridden by an instance method with the same signature in a subclass. Similarly, an inherited instance method cannot be overridden by a static method with the same signature in a subclass.
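
The rule above is stated in terms of a method's signature, that is, its name and parameter types. At the bytecode level, as noted in Section 5.2, the parameter types and the return type are recorded together in a descriptor string that is kept apart from the name. The sketch below prints the descriptors of two method shapes that take the same parameters but return different types, using the standard `java.lang.invoke.MethodType` API. The class name `DescriptorDemo`, its helper methods, and the chosen method shapes are assumptions made only for illustration.

```java
import java.lang.invoke.MethodType;

final class DescriptorDemo {
    /** Builds the JVM method descriptor for the given return and parameter types. */
    static String descriptor(Class<?> returnType, Class<?>... parameterTypes) {
        return MethodType.methodType(returnType, parameterTypes).toMethodDescriptorString();
    }

    /** Keeps only the parenthesized parameter list of a descriptor. */
    static String parameterPart(String descriptor) {
        return descriptor.substring(0, descriptor.indexOf(')') + 1);
    }

    public static void main(String[] args) {
        // Hypothetical shapes: int f(int, String) and boolean g(int, String).
        String first = descriptor(int.class, int.class, String.class);
        String second = descriptor(boolean.class, int.class, String.class);
        System.out.println(first);
        System.out.println(second);
        // The parameter lists agree although the full descriptors differ.
        System.out.println(parameterPart(first).equals(parameterPart(second)));
    }
}
```

The two descriptors share the parameter part and differ only in the trailing return-type code. Two methods of this shape with the same name would therefore have the same signature in the source-level sense while remaining distinguishable in the bytecode.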

There are two processes to intentionally violate the above rule. They both require that the superclass and the subclass in which the methods exist are in the obfuscation scope. The first process is that we can add a bogus static method in the superclass (or subclass) for an instance method and add a bogus instance method in the superclass (or subclass) for a static method. Suppose that there is an instance method m in class Y and that Y inherits class X. We can make a bogus static method m′ in X. Method m′ has the same signature as method m. To make the bogus one look like a real method, the body of m′ could be identical to that of m. To make things even more complicated, we can make m′ slightly different from m. It would be difficult to determine which method is the correct one when semantic changes are introduced into the bogus version.

The second process is to override inherited methods that have different names but have the same number and types of parameters. Consider the following example.

    class X {
        static int m(int a, String b)
                throws EOFException {...}
    }
    class Y extends X {
        boolean n(int c, String d)
                throws FileNotFoundException {...}
    }

The names m and n can be changed to be the same, such as p. The new program becomes

    class X {
        static int p(int a, String b)
                throws EOFException {...}
    }
    class Y extends X {
        boolean p(int c, String d)
                throws FileNotFoundException {...}
    }

The new program violates the compiler rule in the Java specification (Gosling et al., 2000) which says that an inherited static method cannot be overridden by an instance method.

Note that the return type and the throws clause of a method are not part of the signature. If both m and n were instance methods, renaming them to p would make n override m. This overriding relationship violates yet another rule in the Java specification (Gosling et al., 2000) which says that an instance method and the overridden inherited instance method must have the same return types and compatible throws clauses. This technique provides a layer of protection. But overriding may change the method that is invoked when the JVM dynamically dispatches an instance method. Therefore, arbitrarily overriding instance methods is not allowed in our obfuscation method. This technique requires that one of the two methods is a static method and the other is an instance method.

The JVM uses four instructions (invokestatic, invokeinterface, invokevirtual, and invokespecial) for invoking methods. A static method is invoked with the instruction invokestatic. An instance method is invoked with the instruction invokevirtual (or invokeinterface) if the declared type of the reference is a class (or an interface, respectively). An instance method can also be invoked with the instruction invokespecial, which invokes an instance method of a superclass, a private method, or the instance initialization method. The differences between the four instructions lie in the lookup procedure for resolving the method to be invoked. Because static methods and instance methods are invoked with different instructions, renaming m and n to p in the bytecode in the above example will not change the behavior of the obfuscated bytecode. However, the decompiled program contains a subtle bug that is difficult to discover.

6. Related works

Obfuscation is a very useful tool for protecting bytecode. Although there are many commercial or free products available (Dr. Java, 2001; Eastridge, 2000; Hoeniche, 2001; Plumb, 2001; Retrologic, 2000), few research efforts focus on this topic. LaDue provides a tool named HoseMocha. The tool adds several extra instructions at illegal positions in the bytecode (e.g. after the return instruction of a method) to foul the decompilers (LaDue, 1997). Low discusses the concept of obfuscation (Low, 1998b). Low's master's thesis concentrates on the obfuscation of control flow (Low, 1998a). Collberg et al. study obfuscation extensively, including a survey of obfuscation (Collberg and Thomborson, 2000; Collberg et al., 1997), data obfuscation (Collberg et al., 1998a), and control obfuscation (Collberg et al., 1998b).

Data obfuscation (Collberg et al., 1998a) uses techniques such as splitting variables and merging scalar variables to transform a simple expression into an equivalent, but complex one. It also alters the inheritance structure of types, restructures arrays, and changes procedural abstraction.

Control obfuscation (Collberg et al., 1998b) adds extra predicates in programs to confuse the cracker. These extra predicates, called opaque predicates, have a fixed value even though they are computed with complex expressions. The extra predicates obscure the program.

Both data obfuscation and control obfuscation change the bodies of types and methods. To make things complicated, the two obfuscation techniques introduce additional computations into the bytecode. These additional computations will increase the size of the bytecode. Furthermore, the additional computations will reduce the run-time efficiency of the program.

In contrast, our proposed methods concentrate on eliminating the symbolic information in the constant pool of the bytecode. Several techniques discussed in this paper even introduce syntactic or semantic errors in the decompiled program while preserving the behavior of the bytecode. Most of the proposed techniques do not produce additional code. Furthermore, they usually reduce the size of the bytecode.

The obfuscation methods proposed by Low and Collberg (Collberg and Thomborson, 2000; Collberg et al., 1998a; Collberg et al., 1998b; Collberg et al., 1997; Low, 1998a; Low, 1998b) can cooperate with our proposed approach to construct an even more versatile and stronger obfuscation tool.

7. Conclusion

A good obfuscation tool should

• preserve the semantics of the bytecode,
• deter the cracker as long as possible,
• be difficult to be overcome by a cracking tool, and
• improve the run-time efficiency and reduce the bytecode size.

The techniques proposed in this paper satisfy all the above requirements. Preserving the semantics of the bytecode is the most important criterion of obfuscation. Many techniques make the decompiled program uncompilable. The obfuscation effects cannot be easily undone by other cracking tools. A cracker has to spend a lot of time debugging the buggy decompiled program manually. The shorter names reduce the size of a bytecode file. Overloaded names also contribute to the compression of the bytecode and the jar file. Consequently, the storage space and the loading time are reduced.

The main objective of the obfuscation techniques proposed in this paper is to scramble the symbolic names and the symbolic references in the bytecode. Although the technique of identifier scrambling appeared several years ago and several commercial or free products are based on similar ideas, our techniques provide stronger protection for bytecode than other existing techniques.

It is possible to extend the proposed obfuscation techniques to other languages. The prerequisite of the obfuscation techniques is that the information of the identifiers is stored in the bytecode and the decompilers rely on the information during decompilation. For those languages that use a similar mechanism, i.e., symbolic linking, it is possible to apply the proposed techniques to them. For example, the runtime model of the C# language (Liberty, 2001), the .NET Common Language Runtime (CLR) (Gough, 2001), is similar to Java's. Therefore, the C# language could be a candidate for these obfuscation techniques.

References

Collberg, C., Thomborson, C., 2000. Watermarking, Tamper-Proofing, and Obfuscation-Tools for Software Protection, Tech Report, Department of Computer Science, The University of Auckland, New Zealand.

Collberg, C., Thomborson, C., Low, D., 1998a. Breaking Abstractions and Unstructuring Data Structures. Proceedings of the IEEE International Conference on Computer Languages. pp. 28–38.

Collberg, C., Thomborson, C., Low, D., 1998b. Manufacturing Cheap, Resilient, and Stealthy Opaque Constructs. Conference Record of the Annual ACM Symposium on Principles of Programming Languages. pp. 184–196.

Collberg, C., Thomborson, C., Low, D., 1997. A Taxonomy of Obfuscating Transformations, Tech Report, Department of Computer Science, University of Auckland, New Zealand.

D&C Software Solutions Inc., 2001. jAscii. ver. 1.0.17. http://www.jascii.com/.

Dr. Java, 2001. Marvin Obfuscator. ver. 1.2. http://www.drjava.de/obfuscator/.

Eastridge Technology, 2000. Jshrink. ver. 1.19. http://www.e-t.com/jshrink.html.

Engel, J., 1999. Programming for the Java Virtual Machine. Addison-Wesley, Reading, Mass.

Gosling, J., Joy, B., Steele, G., Bracha, G., 2000. The Java Language Specification, second ed. Addison-Wesley, MA.

Gough, J., 2001. Compiling for the Net Common Language Runtime. Prentice Hall.

Hoeniche, J., 2001. Java Optimize and Decompile Environment (Jode). ver. 1.1.1. http://jode.sourceforge.net/.

IBM, 2001. Visualage for Java. ver. 4.0. http://www.ibm.com/software/ad/vajava/.

Kouznetsov, P., 2001. Jad - the Fast Java Decompiler. ver. 1.58e. http://www.geocities.com/SiliconValley/Bridge/8617/jad.html.

Kumar, K., 2001. JReverse Pro. ver. 1.2. http://www.geocities.com/akarthikkumar/JReverse Pro/.

LaDue, M.D., 1997. HoseMocha. ver. 1.0. http://www.cigital.com/hostile-pplets/HoseMocha.java.

Liberty, J., 2001. Programming C#. O'Reilly.

Lindholm, T., Yellin, F., 1999. The Java Virtual Machine Specification, second ed. Addison-Wesley, Reading, MA.

Low, D., 1998a. Java Control Flow Obfuscation, Master Thesis, University of Auckland, New Zealand.

Low, D., 1998b. Protecting Java Code Via Code Obfuscation. ACM Crossroads 4 (3), 21–23.

Mayon Software Research, 2001. Classcracker. ver. 2.02. http://www.pcug.org.au/~mayon/.

Meyer, J., Downing, T., 1997. Java Virtual Machine. O'Reilly, Cambridge, Mass.

Plumb Design, Inc., 2001. Condensity Professional Edition. ver. 2.0. http://www.condensity.com/index.html.

PsychoticSoftware Inc., 2001. Classspy. ver. 2.0.3. http://www.psychoticsoftware.com/Products/ClassSpy/index.jsp.

Retrologic Inc., 2000. Retroguard Bytecode Obfuscator. ver. 1.1. http://www.retrologic.com/.

Venners, B., 1998. Inside the Java Virtual Machine. McGraw-Hill, New York.

Vliet, H.v., 1996. Mocha. ver. beta 1. http://www.brouhaha.com/~eric/computers/mocha.html.