Applying Genetic Programming to Bytecode and Assembly

Eric C. Collom
University of Minnesota, Morris

Scholarly Horizons: University of Minnesota, Morris Undergraduate Journal, Volume 1, Issue 2, Article 4, August 2014
UMM CSci Senior Seminar Conference, April 2014, Morris, MN.

Recommended Citation: Collom, Eric (2014) "Applying Genetic Programming to Bytecode and Assembly," Scholarly Horizons: University of Minnesota, Morris Undergraduate Journal: Vol. 1 : Iss. 2 , Article 4. Available at: https://digitalcommons.morris.umn.edu/horizons/vol1/iss2/4

This work is licensed under the Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/us/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.

ABSTRACT

Traditional genetic programming (GP) is typically not used to perform unrestricted evolution on entire programs at the source code level. Instead, only small sections within programs are usually evolved. Not being able to evolve whole programs is an issue since it limits the flexibility of what can be evolved. Evolving programs in either bytecode or assembly language is a method that has been used to perform unrestricted evolution. This paper provides an overview of applying genetic programming to Java bytecode and x86 assembly. Two examples of how this method has been implemented will be explored. We will also discuss experimental results that include evolving recursive functions and automated bug repair.

Keywords

evolutionary computation, x86 assembly code, Java bytecode, FINCH, automated bug repair

1. INTRODUCTION

GP is a set of techniques used to automate computer problem solving. This is done by evolving programs with an evolutionary algorithm (EA) that imitates natural selection in order to find a solution. Traditional GP has commonly been used to evolve specific parts of programs instead of full-fledged programs. This is because it is very hard to deal with semantic constraints in source code. Source code is purely syntactical and does not represent the semantic constraints of the language. However, if we could deal with these semantic constraints, it would allow for unrestricted evolution and more flexibility in GP. Evolving bytecode and assembly, instead of source code, is a method that allows for this flexibility. This is possible because bytecode and assembly languages are less restrictive syntactically than source code. We discuss this issue further in Section 3.

Orlov and Sipper [5] propose a method for applying GP to full-fledged programs that requires a program to be in Java bytecode. Schulte, et al., [7] apply a similar method with both Java bytecode and x86 assembly. Orlov and Sipper [6] focus on evolving simple programs as a whole while Schulte, et al., [7] focus on automated bug repair in programs.

This paper is organized as follows: Section 2 covers the background needed for understanding the application of GP to bytecode and assembly. It contains information on Evolutionary Computation (EC), Java bytecode, and the Java Virtual Machine. Section 3 describes the benefits of evolving assembly and bytecode. Section 4 illustrates how Orlov and Sipper [6] evolved Java bytecode. Section 5 explores how Schulte, et al., [7] evolved both x86 assembly and Java bytecode. Section 6 summarizes some of the experimental results from [6] and [7]. Section 7 discusses possible future work and ideas.

2. BACKGROUND

The components of EC and GP will be explored in this section. Java bytecode and various details of the Java Virtual Machine (JVM) will also be discussed. We will only address Java bytecode since knowledge of x86 assembly is not required in order to understand the research done by Schulte, et al., [7]. Also, throughout the rest of this paper, the term instruction-level code will be used when referring to either Java bytecode or x86 assembly.

2.1 Evolutionary Computation

EC is a field of computer science and artificial intelligence that is loosely based on evolutionary biology. EC imitates evolution through continuous optimization in order to solve problems. Optimization in EC is driven by the selection of more successful individuals. The representation of an individual depends on the problem being solved. For example, it can be a string of bits, a parse tree, or a structured object. In this paper, the individuals will be programs in Java bytecode and x86 assembly.

The EC process is summarized in Figure 1. An initial population of individuals is generated and a selection process is used to choose the most fit individuals. The selection process gives a fitness value, usually numerical, to each individual, which indicates how well it solves the specific problem. In many cases a higher fitness indicates a fitter individual. However, in some cases a fitness closer to zero indicates a fitter individual. For example, if the fitness is the total error of a possible solution, the closer the fitness is to zero the better. The most fit individuals are taken and modified using genetic operators that imitate mutation and sexual reproduction. A check is then done to see if any individuals from the new population solve the desired problem. If not, the process of evolution is repeated until a solution is found or until a predetermined number of generations is reached. If the limit of generations is reached then the most fit individual is returned.

[Figure 1: The process of Evolutionary Computation. Flowchart: Initial Population, Selection, Individuals Selected For Reproduction, Genetic Operators, Offspring Population, Problem Solved? (No: the cycle repeats; Yes: Return Solution).]

GP is a tool that uses the EC process to evolve programs. One way in which GP selects individuals for sexual reproduction is tournament selection. A certain number of individuals are chosen for a simulated competition and the individual with the highest fitness wins. That individual is then selected to be a parent.

One way sexual reproduction is simulated in GP is through the genetic operator called crossover. Crossover is the process of taking two parent individuals, extracting a section of code from one, and replacing it with a section from the other program to form an offspring. Mutation is another genetic operator that is used to produce offspring. Mutation performs random changes on a randomly chosen section of a program.

2.2 Java Bytecode and the JVM

Java bytecode is an "intermediate, platform-independent representation" [5] of Java source code. However, "implementors of other languages can turn to the Java Virtual Machine as a delivery vehicle for their languages" [2]. A few examples of languages with which this has been done are Scala, Groovy, Jython, Kawa, JavaFX Script, Clojure, Erjang, and JRuby. All these languages can be compiled to Java bytecode and be evolved at the instruction-level.

The JVM executes bytecode through a stack-based architecture. Each method has a frame that contains an array [...]

[...] prefix is followed by the operation to be executed.

The opcodes used in this paper are: iconst_n, istore_n, iload_n, iadd, isub, and ireturn. iconst_n pushes an integer value of n on the stack. istore_n pops the stack and stores that value at index n in the local variable array on the frame. iload_n takes the value of the element at index n in the array of local variables and pushes it to the stack. iadd pops two elements from the stack, adds them, and pushes the result to the stack. Similarly, isub pops two elements, subtracts the element that was on top of the stack from the element beneath it, and pushes the result to the stack. ireturn simply pops the stack and returns that value.

[Figure 2: Bytecode: iload_1, iload_2, iadd, ireturn. Frame: indices 0, 1, 2, with the values 1 and 3 at indices 1 and 2. Operand stack: (empty) before iload_1; 1 after iload_1; 3, 1 after iload_2; 4 after iadd; (empty) after ireturn. In this example we are assuming that the frame already contains the local variables 1 and 3 to retain simplicity. When iload_1 is executed, it takes the element from the frame at index 1 and pushes it onto the stack. iload_2 does the same thing but with index 2. iadd pops two elements from the stack, which both must be of integer type, adds them and then pushes the result to the stack. ireturn simply pops the stack and returns that element.]

3. CONSTRAINTS AND BENEFITS

While it would be useful, it is difficult to evolve an entire program in source code due to semantic constraints. This section will explore why it is difficult to deal with semantic constraints in source code and why it is easier at the instruction-level. We will discuss some of the benefits that arise from being able to deal with these constraints and evolving at the instruction-level.

3.1 Source Code Constraints

There is a high risk of producing a non-compilable program when evolving programs at the source code level.
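
As a point of comparison with source code, the opcode descriptions in Section 2.2 are simple enough to be sketched as a toy interpreter. The following is a minimal illustration only, not a real JVM and not code from [6] or [7]: the class name MiniStackMachine, the method run, and the representation of instructions as text strings are all assumptions made for this example. It handles only the six opcodes listed in Section 2.2 and replays the program from Figure 2.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy interpreter for the six integer opcodes of Section 2.2 (hypothetical, not a real JVM). */
final class MiniStackMachine {

    /**
     * Runs a straight-line program against a frame's local variable array.
     * Instructions are plain strings such as "iload_1" or "iadd" (an assumption
     * of this sketch; real bytecode is a binary format).
     */
    static int run(String[] program, int[] locals) {
        Deque<Integer> stack = new ArrayDeque<>(); // operand stack, top at the head
        for (String insn : program) {
            int sep = insn.indexOf('_');
            String op = sep < 0 ? insn : insn.substring(0, sep);
            int n = sep < 0 ? 0 : Integer.parseInt(insn.substring(sep + 1));
            switch (op) {
                case "iconst": // push the constant n
                    stack.push(n);
                    break;
                case "istore": // pop the stack into local variable n
                    locals[n] = stack.pop();
                    break;
                case "iload": // push local variable n
                    stack.push(locals[n]);
                    break;
                case "iadd": {
                    int value2 = stack.pop(); // element on top of the stack
                    int value1 = stack.pop(); // element beneath it
                    stack.push(value1 + value2);
                    break;
                }
                case "isub": {
                    int value2 = stack.pop(); // element on top of the stack
                    int value1 = stack.pop(); // element beneath it
                    stack.push(value1 - value2);
                    break;
                }
                case "ireturn": // pop the stack and return that value
                    return stack.pop();
                default:
                    throw new IllegalArgumentException("unsupported opcode: " + insn);
            }
        }
        throw new IllegalStateException("program ended without ireturn");
    }

    public static void main(String[] args) {
        int[] locals = {0, 1, 3}; // frame from Figure 2: indices 1 and 2 hold 1 and 3
        String[] program = {"iload_1", "iload_2", "iadd", "ireturn"};
        System.out.println(run(program, locals)); // prints 4
    }
}
```

Because every instruction here is a flat token acting on the stack and the local variable array, a program is just a sequence of such tokens, which is the property that makes instruction-level code convenient to cut and recombine.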
