Krakatoa: Decompilation in Java (Does Bytecode Reveal Source?)

The following paper was originally published in the Proceedings of the Third USENIX Conference on Object-Oriented Technologies and Systems Portland, Oregon, June 1997 Krakatoa: Decompilation in Java (Does Bytecode Reveal Source?) Todd A. Proebsting, Scott A. Watterson The University of Arizona For more information about USENIX Association contact: 1. Phone: 510 528-8649 2. FAX: 510 548-5738 3. Email: [email protected] 4. WWW URL: http://www.usenix.org Krakatoa: Decompilation in Java Does Bytecode Reveal Source? To dd A. Pro ebsting Scott A. Watterson The University of Arizona Abstract well known. Presently,we fo cus our researchon two subproblems: recovering source-level expres- This pap er presents our technique for automati- sions and synthesizing high-level control constructs cally decompiling Javabyteco de into Java source. from goto-like primitives. Our technique reconstructs source-level expres- Krakatoa uses a stack-simulation technique to re- sions from byteco de, and reconstructs readable, cover expressions and p erform typ e inference. Ex- high-level control statements from primitive goto- pression recovery creates source-level assignments like branches. Fewer than a dozen simple co de- and comparisons from primitivebyteco de op era- rewriting rules reconstruct the high-level state- tions. We extend Ramshaw's goto-elimination al- ments. gorithm to structure and create source for arbitrary reducible control ow graphs. This technique pro duces source co de with lo ops and multi- 1 Intro duction level break's. Subsequent techniques recover more intuitive constructs e.g., if statements via appli- Decompilation transforms a low-level language into cation of simple co de rewrite rules. a high-level language. The Java Virtual Machine JVM sp eci es a low-level byteco de language for a Traditional decompilation systems use graph stack-based machine [LY97]. This language de nes transformations to recover high-level control con- 203 op erators, with most of the control ow sp eci- structs. These systems require the author of the ed by simple explicit transfers and lab els. Compil- decompiler to anticipate all high-level control id- ing a Java class yields a class le that contains typ e ioms. When faced with an unexp ected language information and byteco de. The JVM requires a sig- idiom, these systems either ab ort, or pro duce gotos ni cant amountoftyp e information from the class illegal in Java. Krakatoa represents a di erent ap- les for ob ject linking. Furthermore, the byteco de proach. Krakatoa rst pro duces legal Java source must b e veri ably wel l-behaved in order to ensure given legal Javabyteco de with arbitrary reducible safe execution. Decompilation systems can exploit control ow, and then recovers intuitive high-level this typ e information and well-b ehaved prop ertyto constructs from this source. recover Java source co de from the class le. We present a technique for transforming low-level Figure 1 gives the ve steps of decompilation Javabyteco de into legal Java source co de. Our sys- p erformed by Krakatoa. First, the expression 1 tem, Krakatoa, p erforms typ e inference to issue builder reads byteco de, recovers expressions and lo cal variable declarations. The veri er do es the typ e information, and pro duces a control ow same typeoftyp e inference, and the techniques are graph CFG. Next, the sequencer orders the CFG no des for Ramshaw's goto-elimination technique. Address: Department of Computer Science, Uni- Ramshaw's algorithm pro duces a convoluted|yet versity of Arizona, Tucson, AZ 85721; Email: fto dd, [email protected]. legal|Java abstract syntax tree AST. Our sys- 1 Krakatoa is a volcano lo cated in the Sunda Strait b e- tem then transforms this AST into a less convo- tween Java and Sumatra. Its 1983 eruption threw ve cubic luted AST using a set of simple rewrites. The nal miles of debris into the air and was heard 2200 miles away in Australia. phase pro duces Java source by traversing the AST. class foo { int sam; Java Bytecodes int barint a, int b { if sam > a { Expression Builder b = a*2; } Flow graph with expressions and conditional gotos return b; } Node Sequencer } Augmented Flow graph Compiled from foo.java class foo extends java.lang.Object { int sam; Goto Eliminator (Ramshaw's Algorithm) int barint,int; Java AST Method int barint,int 0 aload_0 Code Simplifier 1 getfield 3 <Field foo.sam I> 4 iload_1 Restructured Java AST 5 if_icmple 12 8 iload_1 Final Java printer 9 iconst_2 10 imul Java Source 11 istore_2 12 iload_2 13 ireturn } Figure 1: Java Byteco de Decompilation System Figure 2: Simple Metho d and Byteco de Disassem- bly via javap -c. 2 Expression Recovery instance, iload_1, which loads the value of the Javabyteco des b ear a very close corresp ondence rst lo cal variable|with typ e int|could b e rep- to Java source. As a result, recovering expres- resented on the stackas\i1". Similarly,if i1 and sions from Javabyteco de is often simple|much 2 were the top two elements of the symb olic stack, simpler than recovering expressions from machine and the next byteco de were iadd integer addition, language. Java class les include information that those elements would b e p opp ed o the stack and makes recovering high-level op erations like eld ref- replaced with \i1+2". The symb olic execution erences easy. The fact that the byteco de must b e of some expressions, like assignment, requires emit- well-b ehaved i.e., veri able also simpli es analy- ting Java source. sis. Figure 2 gives a sample program and its abbre- Our algorithm recovers expressions one basic viated disassembly. Note the level of typ e informa- blo ck at a time. Some basic blo cks such as those tion in the disassembly pro duced by Sun's javap pro duced by the conditional expression op eration, utility. A?B:C do not b egin with empty stacks, so some Symb olic execution of the byteco de creates the information is required to prop ogate from prede- corresp onding Java source expressions. It also cre- cessors. Also, basic blo cks that b egin exception- ates conditional and unconditional goto's, which handling blo cks|which are easily identi ed|b egin will b e removed by subsequent decompilation steps. with the raised exception on the stack. Symb olic execution simulates the Java Virtual Ma- Figure 3 provides the step-by-step decompilation chine's evaluation stack with strings that represent of the byteco de in Figure 2. The initial aload_0 the source-level expressions b eing computed. For instruction pushes a Java reference onto the stack. program with exactly the same control owasthe In virtual functions, the \0'th" lo cal variable, a0, goto-only program. always refers to this. The getfield instruction Ramshaw's algorithm requires two inputs: the references a named eld, \sam", of the current top control ow graph, and an instruction ordering. His of stack. Therefore, the \this" is p opp ed and algorithm enco des this order into the ow graph replaced with \this.sam". iload_1 pushes \i1" using augmenting edges, such that every instruc- onto the stack. The ifcmple compares the top two tion has an augmenting edge to the next instruc- stack elements and branches to the appropriate in- tion in sequence. These augmenting edges o ccur struction if the lower is less than or equal to the between every pair of physically adjacent instruc- top element. Symb olically executing the ifcmple tions even if actual control owbetween them is requires p opping the top two elements and emit- imp ossible. He proves that if this augmented graph ting the appropriate conditional branch. Translat- is reducible, then a structural ly equivalent [PKT73] ing the remaining instructions is similar. program can b e created without goto's. How- ever, Ramshaw provides no algorithm for nding Most of the byteco de instructions are equally a reducible augmented ow graph from a given re- simple to symb olically execute. Unfortunately,a ducible ow graph. few require more information. Some of the stack The control- ow graphs of Java programs are manipulation routines e.g., pop2, dup2, etc. de- reducible. Therefore, the compiled byteco de will p end on byte o sets from the stack top. For in- likely form a reducible control- ow graph. Unfor- stance, pop2 removes the top 8 bytes from the tunately, simple optimizations likeloopinversion stack, whether those 8 bytes represent one 8-byte create irreducible augmented ow graphs. The ow double value, or two 4-byte scalar values. To cor- graph of the program in Figure 8 has this problem rectly simulate these instructions the symb olic ex- b ecause the augmenting edge b etween the rst two ecution keeps track of the size and typ e of each statements creates a \jump" into the b o dy of the stack element. lo op formed by the next seven statements. To utilize Ramshaw's algorithm, we develop ed an 3 Instruction Ordering algorithm that orders a reducible graph's instructions such that the resulting augmented graph is After recovering expressions, conditional and un- also reducible. conditional goto's along with implicit fal l through b ehavior determine control ow. Java, however, 3.1 Augmenting the Flow Graph has no goto statement, so its control owmust b e expressed with structured statements. Creating a reducible augmented ow graph re- Ramshaw presented an algorithm for eliminating quires that no augmenting edge enters a lo op any- goto's from Pascal programs while preserving the where other than at its header.

Load more