Using Recurrent Neural Networks for Decompilation

Deborah S. Katz
Computer Science Department
Carnegie Mellon University
Pittsburgh, PA, USA
[email protected]

Jason Ruchti and Eric Schulte
GrammaTech, Inc.
Ithaca, NY, USA
{jruchti, eschulte}@grammatech.com

Abstract—Decompilation, recovering source code from binary, is useful in many situations where it is necessary to analyze or understand software for which source code is not available. Source code is much easier for humans to read than binary code, and there are many tools available to analyze source code. Existing decompilation techniques often generate source code that is difficult for humans to understand because the generated code often does not use the coding idioms that programmers use. Differences from human-written code also reduce the effectiveness of analysis tools on the decompiled source code.

To address the problem of differences between decompiled code and human-written code, we present a novel technique for decompiling binary code snippets using a model based on Recurrent Neural Networks. The model learns properties and patterns that occur in source code and uses them to produce decompilation output. We train and evaluate our technique on snippets of binary machine code compiled from C source code. The general approach we outline in this paper is not language-specific and requires little or no domain knowledge of a language and its properties or how a compiler operates, making the approach easily extensible to new languages and constructs. Furthermore, the technique can be extended and applied in situations to which traditional decompilers are not targeted, such as decompilation of isolated binary snippets; fast, on-demand decompilation; domain-specific learned decompilation; optimizing for readability of decompilation; and recovering control flow constructs, comments, and variable or function names. We show that the translations produced by this technique are often accurate or close and can provide a useful picture of the snippet's behavior.

Index Terms—decompilation; recurrent neural networks; translation; deep learning

I. INTRODUCTION

Decompilation is the process of translating binary machine code into code at a higher level of abstraction, such as C source code or LLVM intermediate representation (IR). Decompilation is useful for analyzing or understanding a program in many situations in which source code is not available. Many developers and system administrators rely on third-party libraries and commercial off-the-shelf binaries with undistributed source code. In systems with otherwise-stringent security requirements, it is necessary to perform security audits of these binaries. Similarly, malware is distributed without source code, but it is vital for security researchers to understand its effects [1], [2]. While new techniques for direct analysis of binary executables have shown promise [3], [4], analysis is generally easier on source code than on binary machine code. It is even often useful for a human to see a small portion of the binary of a program translated to a form that is easier for a human to understand, such as in tandem with an analysis tool such as IDA1 or GrammaTech's CodeSurfer for binaries.2

Unfortunately, useful and natural decompilation is hard. Existing decompilers often produce code that does not conform to standard idioms or even parse, leading to confusion for both human and automated analysis. For example, existing decompilers can produce source code that contains elements that are uncommon or particularly difficult to understand, such as GOTO statements or explicit, high-level representations of what are usually under-the-hood memory loads and stores [5]. These elements reduce the usefulness of the decompiled source code because they do not line up with how a developer would reasonably structure or think about a program.

Recurrent neural networks (RNNs) [6] are a useful tool for various deep learning applications, and they are particularly useful in translating sequences. That is, they have been employed in various situations where a sequence of tokens (e.g., words, characters, or symbols) in one vocabulary (e.g., English or French) can be translated to another [7]. They are useful in translating between natural (human) languages, such as in translating from English to French, and in translating from a natural language to other forms of sequences [8].

In recent years, the technology around RNNs and using them for language analysis has advanced greatly, on both the hardware and the software side. Increasingly powerful GPUs [9] and even purpose-built hardware [10] have enabled practical use of meaningfully large RNNs [9]. Various generally-available frameworks such as TensorFlow,3 Keras,4 and Caffe5 have enabled researchers to apply deep-learning technologies (such as RNNs) to various problem domains.

Decompilation can be seen as a translation problem. The decompiler needs to translate a sequence of tokens in binary machine code into a sequence of tokens in a higher-level language, such as C source code.

1 https://www.hex-rays.com/products/ida/overview.shtml
2 https://www.grammatech.com/products/codesurfer
3 https://www.tensorflow.org/
4 https://keras.io/
5 http://caffe.berkeleyvision.org/


Previous work has investigated decompilation using other techniques that were originally used for translating natural languages, such as statistical machine translation [11]. A translation system based on RNNs needs a corpus of training data. The very large amount of publicly-available, open-source code provides a good basis for building a corpus of parallel binary machine code and higher-level source code for training and testing an RNN-based decompiler. Furthermore, additional work using various forms of deep learning, including RNNs, to analyze various aspects of code and programming languages shows that this area has promise [12]–[15].

We propose a novel technique using a model based on RNNs for decompilation. We create a corpus pairing short snippets of higher-level code (C source code) with the corresponding bytes of binary code in a compiled version of the program. We train an RNN-based encoder-decoder translation system on a portion of the corpus. The trained model can then accept previously-unseen pieces of binary code and output a predicted decompilation of the corresponding higher-level code. We then test the trained model on snippets of binary that had not been seen in training, evaluating the predicted decompilations based on their fidelity to the original source code.

A major advantage of our approach is that it is general and easily retargeted to different languages. While most existing tools that turn binary machine code into a more human-understandable form are designed specifically for a particular language, our approach is designed to be general and to extend to any language for which there is a sufficient corpus of training data. While some domain knowledge of the specific target language is useful in determining the best approach to preprocessing the input data and postprocessing the output, there is nothing in the technique that inherently relies on knowledge of how either the language or the compiler operates.

Another advantage of our approach is that it is well-suited to decompiling small snippets of binary machine code. As mentioned above, a user of a reverse engineering tool such as IDA or GrammaTech's CodeSurfer may want to see a local translation of a series of bytes into a source language, to give the reverse engineer a more complete understanding of the code. Our approach is ideally suited to working with analysis tools in this way.

The main contributions of this work are as follows:
• Adaptation of encoder-decoder RNN models originally designed for natural language translation to translation between programming language representations, including compiled machine code.
• An evaluation of the utility and speed of RNN-based models for decompilation. The evaluation shows that our technique often provides reasonably quick and understandable snippet decompilation.
• Post-processing techniques that improve the effectiveness of RNNs for reverse engineering.
• Identification of useful tokenization schemes to support reverse engineering.

This paper proceeds as follows. Section II explains the general architecture of our system and our procedure for using it. Section III discusses our experimental setup, focusing on parts of the general approach that are specific to our experiments. Section IV presents our results. In Section V, we present several threats to validity. Section VI discusses some of the other work that uses techniques similar to ours or has similar goals. In Section VII we conclude.

Fig. 1. Encoder-Decoder Architecture. (The encoder cells ENC-1 through ENC-3 consume the binary tokenization Byte-1 through Byte-3 and produce a summary state C; the decoder cells DEC-1 through DEC-3 emit the source tokenization Source-1 through Source-3.)

II. APPROACH

The following sections introduce our approach to using an RNN-based technique for decompilation and IR lifting:
• An overview of the general procedure (Section II-A)
• Creating a corpus of data (Section II-B)
• Preprocessing the data for use in the RNN-based model (Section II-C)
• Training the model using a subset of the preprocessed data (Section II-D)
• Testing the accuracy of the trained model, using a different subset of data (Section II-E)
These sections set out the general approach, while Section III sets out details of modifications specific to our experiments.

A. Overview

We have developed a technique for using a model based on a recurrent neural network (RNN) as a decompiler. Here we provide a brief primer on recurrent neural networks and design choices specific to the problem of decompilation. In general, we use an existing library to train and validate our models and provide this overview as background.

Unlike traditional neural networks, recurrent neural networks have feedback loops, allowing information to persist, which in turn permits reasoning about previous inputs [6]. An unrolled recurrent neural network has a natural sequence-like structure which makes it well suited for translation tasks.
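To make the unrolled-recurrence idea concrete, the following minimal NumPy sketch (our illustration, not the implementation used in this work) steps a plain recurrent cell over a sequence of token ids; the hidden state h is the feedback loop that lets each output depend on every earlier input:

    import numpy as np

    def rnn_unroll(token_ids, embed, W_xh, W_hh, W_hy):
        """Run a plain recurrent cell over a sequence of integer token ids.

        embed : (vocab, d_in) embedding matrix
        W_xh  : (d_in, d_h)   input-to-hidden weights
        W_hh  : (d_h, d_h)    hidden-to-hidden weights (the feedback loop)
        W_hy  : (d_h, vocab)  hidden-to-output weights
        """
        h = np.zeros(W_hh.shape[0])              # hidden state persists across steps
        outputs = []
        for t in token_ids:
            x = embed[t]                         # look up the token embedding
            h = np.tanh(x @ W_xh + h @ W_hh)     # new state depends on input and old state
            logits = h @ W_hy
            probs = np.exp(logits - logits.max())
            outputs.append(probs / probs.sum())  # softmax over the output vocabulary
        return outputs, h                        # final h summarizes the whole sequence

In a trained network the weight matrices come from gradient descent on the training pairs; here they are simply assumed to exist.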

There are several variations on RNNs; of relevance is the Long Short-Term Memory (LSTM) model, which consists of a standard recurrent neural network with memory cells [16], which are well-suited for learning long-term dependencies. Prior applications of LSTM-based models have included learning natural language grammars and extracting patterns from noisy data [16].

Our models are based on existing sequence-to-sequence models that make use of RNNs. These and related models are well-suited to tasks that have sequences as inputs and outputs, such as translations, and have been used for many such tasks, from translating natural languages to developing automated question-answering systems [17]. A sequence-to-sequence model is composed of two recurrent neural networks: an encoder, which processes the input sequence, and a decoder, which creates outputs [7]. Further, the sequence-to-sequence model contains a hidden state, C, which summarizes the entire input sequence. As shown in Figure 1, to generate the first source token, the decoder cell relies on the summary, C, of the entire input sequence. Subsequent states are generated based on the previously-generated output tokens, the input sequence summary C, the previous decoder state, and an activation function yielding valid probabilities (in our case, softmax).

Overall, using RNNs for a task like decompilation involves training the model using input data with associated known-good answers and then validating its correctness. Both tasks involve design decisions related to the structure and handling of the data in question. We address some of these decisions in Sections II-B through II-E and Section III.

Fig. 2. Overview of Architecture of RNN-Decompilation System.

Figure 2 provides a simplified overview of the architecture of our RNN-based decompiler. We begin with training input that is drawn from open-source software packages in the RPM Package Manager (RPM) format that contain C source code (Section II-B). We compile those RPMs using a modified compiler based on Clang,6 which decomposes the source into small snippets that correspond to subtrees of an abstract syntax tree (AST) and then produces a pair for each produced snippet. Each pair consists of strings representing the snippet of source code and the corresponding binary output.

We preprocess the snippets by tokenizing each string (Section II-C) and translating the resulting token list to a list of integers. Preprocessing produces two outputs: (1) a list of integers for each string in the set of pairs and (2) a dictionary of which integers correspond to which tokens in both the higher-level and binary snippets.

We place each pair of lists of integers into one of four buckets, chosen based on the lengths of the lists. Bucketing is an approach for efficiently handling token sequences of different lengths. The reasoning behind using bucketing in this type of model is explained in other sources [18]. For training, we provide the pairs of lists for each bucket to its corresponding encoder-decoder (Section II-D). During training, the encoder-decoder modifies its internal state and does not provide additional output, although we can access some internal state and evaluation metrics used for training.

When we have trained models, we can then use those models to translate binary snippets to the relevant higher-level language (i.e., C source code); see Section II-E. Using the same process of tokenization used in preprocessing and the dictionary generated in the preprocessing step, we turn the binary snippets to be translated into lists of integers. We then provide each list of integers (representing a binary snippet) to the trained encoder-decoder model. The decoder layer outputs a probability distribution for each of the lists of integers. We then turn the probability distribution into a list of integers that represents the predicted corresponding higher-level code. We use the dictionary for the higher-level language, generated in the preprocessing step, to turn the list of integers into a predicted higher-level translation.

6 https://clang.llvm.org/
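As a concrete picture of this translate-time flow, the sketch below is our illustration only: the hex-string tokenizer and the trained_model.decode call are stand-ins for the byte-by-byte tokenization and the bucket-specific encoder-decoder described above, not an actual API of this system.

    def decompile_snippet(binary_bytes, bin_vocab, src_rev_vocab, trained_model):
        # 1. Tokenize the binary byte-by-byte (here, as hex strings) and map each
        #    token to an integer with the dictionary built during preprocessing.
        tokens = [format(b, "02x") for b in binary_bytes]
        ids = [bin_vocab.get(tok, bin_vocab["_UNK"]) for tok in tokens]

        # 2. The trained encoder-decoder returns one probability distribution
        #    over the C vocabulary per output step (hypothetical interface).
        distributions = trained_model.decode(ids)

        # 3. Take the most likely integer at each step; stop at end-of-sequence.
        out_ids = []
        for dist in distributions:
            best = max(range(len(dist)), key=dist.__getitem__)
            if src_rev_vocab[best] == "_EOS":
                break
            out_ids.append(best)

        # 4. Map integers back to C tokens with the reverse dictionary.
        return " ".join(src_rev_vocab[i] for i in out_ids)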

B. Corpus Creation

Using an encoder-decoder RNN for translation requires a set of examples of correct translations. In a natural language translation application, the pairs might be pairs of sentences that have the same meaning – one in the original language, such as English, and one in the destination language, such as French. For our application, translating binary data to a higher-level representation, our pairs are snippets of compiled binary code paired with the corresponding higher-level code, such as C source code.

To obtain these pairs of snippets, we create a database by compiling programs using a customized compiler based on Clang and LLVM; this compiler pairs ASTs and subtrees of source code with the binary machine code into which those trees and subtrees compile. From the database of snippet pairs, we used pairs that had 112 or fewer binary tokens and 88 or fewer source code tokens. An explanation of the tokenization of these snippets can be found in Section II-C. The programs are drawn from open-source RPMs for Fedora containing C source code. Details of the corpus can be found in Section III-A.

C. Preprocessing

The RNN infrastructure we use is designed to operate on pairs of lists of integers, so we preprocess our snippets to turn them from strings into meaningful lists of integers that the RNN can understand.

Tokenization: The first step, tokenization, turns the snippets, which are represented as strings, into lists of smaller units. We use some domain knowledge in choosing a tokenization scheme, as we must choose one that makes sense for the language. A part of this choice is choosing an appropriate level of granularity. For example, in tokenizing binary code, we have several options. A naive choice would be to tokenize the binary using only two tokens, as a sequence of ones and zeros. However, this choice would not be the most efficient: it would require keeping large amounts of data; it would require large RNN structures; and it would not take advantage of the system's ability to deal with a large vocabulary of tokens. As another choice, we could tokenize several bytes into each token. However, given the variety of sequences of bytes that appear in binary code, this tokenization scheme would make it difficult for the RNN to learn which patterns occur frequently at a lower level.

We tried several tokenization schemes depending on what kind of code we were tokenizing. For source code, we tested tokenizing the text character-by-character, on whitespace, and using a lexer for C source code. For binary, we tried tokenizing bit-by-bit and byte-by-byte. As a result of an informal parameter sweep, we found the best results tokenizing binary byte-by-byte and tokenizing C source code using a Python-based C lexer.7 For each token, the lexer provides a string that categorizes the token as, for example, an identifier, a language keyword, a string literal, or a float constant. We use the category strings provided to identify tokens that represent variable identifiers, function identifiers, and string literals. For string literals, we replace the contents of the string with the word STRING, followed by any format specifiers that appear in the original string. We do this because we anticipate that string literals would be more difficult for the model to get right because they do not follow a regularized pattern; in addition, they are often more easily recoverable by other means. We add the format specifiers because having the correct number and types of format specifiers is essential for compilation.

For function identifiers, we keep the 20 most frequently-used function identifiers and replace the others with the word function. For variables, we keep the 100 most frequently-used variable identifiers, and for the others, we replace each unique identifier used within a snippet with var_XXX, where XXX is an integer counter. Other tokens we take as they are. By replacing infrequently used variables, functions, and strings with a canonical placeholder value, we minimize the vocabulary size and allow the RNN to fit more closely to AST sequences instead of overfitting to variable names.

From Sequences of Tokens to Sequences of Integers: We use a standard methodology for turning the training snippet pairs into an input vocabulary the RNN can accept. Looking separately at each set of tokens (i.e., repeating the process separately for the binary tokens and the C source tokens), we rank each unique token according to how frequently it occurs in the training data. For each snippet, we replace each token with an integer corresponding to the popularity rank, up to a ceiling. For less popular tokens, we replace the token with an integer representing "unknown".

Three additional integer tokens represent unique values in the RNN infrastructure: _PAD, _GO, and _EOS. The integer representing the _GO symbol is prepended to each integer list representing a target sequence (here, the C code snippets), while _EOS is appended to the end.

The pairs of integer lists are then categorized into buckets based on length. Buckets allow pairs of similar lengths to be grouped together, for efficiency of the RNN structure. For each bucket, there is a maximum length for binary sequences and a maximum length for target higher-level sequences. Each pair of lists is placed in the smallest bucket for which both the binary and target token lists fit. The integer representing the _PAD symbol is appended to the end of the lists of integers so that all binary snippets in a bucket are the same length and all higher-level snippets in a bucket are the same length. We choose bucket sizes based on the lengths of the pairs of snippets in our training set, allowing for four buckets that contain roughly equal numbers of snippets.

Readying Sequences for the Model: Because of technical limitations relating to the size of memory and our RNN design, we discard some snippets that are too long. The threshold is determined dynamically, based on the composition of the training set and the bucket sizes.

In a final preprocessing step, we reverse the sequence of tokens corresponding to the binary snippet, a technique used by Sutskever et al. in natural language translation [17].

The pairs of lists of integers representing pairs of snippets are then used to train and evaluate an RNN model. Our training subset consists of a pre-defined number of pairs of binary and source tokens corresponding to snippets from the corpus. This subset is selected pseudorandomly.

7 https://github.com/eliben/pycparser/tree/master/pycparser
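The preprocessing steps above (identifier normalization, a frequency-ranked integer vocabulary with an unknown token, the _PAD/_GO/_EOS markers, input reversal, and bucketing) could fit together roughly as in the following sketch. This is our illustration under the assumption that a C lexer such as pycparser has already produced tokens and their categories; the bucket boundaries are those of Table II and the other constants are illustrative.

    from collections import Counter

    _PAD, _GO, _EOS, _UNK = 0, 1, 2, 3                 # reserved ids
    BUCKETS = [(11, 5), (22, 9), (47, 17), (112, 88)]  # (max bin. len., max C len.)

    def normalize_identifiers(tokens, kinds, common_vars, common_funcs):
        """Replace uncommon variable/function names and string literals
        with canonical placeholders (var_0, var_1, ..., function, STRING)."""
        mapping, out = {}, []
        for tok, kind in zip(tokens, kinds):
            if kind == "function" and tok not in common_funcs:
                out.append("function")
            elif kind == "variable" and tok not in common_vars:
                mapping.setdefault(tok, "var_%d" % len(mapping))
                out.append(mapping[tok])
            elif kind == "string":
                out.append("STRING")        # format specifiers omitted for brevity
            else:
                out.append(tok)
        return out

    def build_vocab(token_lists, ceiling=40000):
        """Rank tokens by frequency; anything past the ceiling maps to _UNK."""
        counts = Counter(t for toks in token_lists for t in toks)
        vocab = {"_PAD": _PAD, "_GO": _GO, "_EOS": _EOS, "_UNK": _UNK}
        for tok, _ in counts.most_common(ceiling - len(vocab)):
            vocab[tok] = len(vocab)
        return vocab

    def to_bucketed_pair(bin_ids, src_ids):
        """Pick the smallest bucket both sequences fit in; reverse the input,
        add _GO/_EOS around the target, and pad both to the bucket lengths."""
        src = [_GO] + src_ids + [_EOS]
        for max_bin, max_src in BUCKETS:
            if len(bin_ids) <= max_bin and len(src) <= max_src:
                bin_rev = list(reversed(bin_ids))        # input reversal, as in [17]
                bin_rev += [_PAD] * (max_bin - len(bin_rev))
                return bin_rev, src + [_PAD] * (max_src - len(src))
        return None                                      # too long; discarded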

D. Training

Here, we present a simplified overview of training our system. Conceptually, the encoder-decoder RNN has three parts: an encoder layer, a decoder layer, and a layer or layers in between.

To train the RNN model, we draw a batch of training examples from a given bucket, including both the tokens corresponding to the binary (input) and the tokens corresponding to the higher-level code (C source code). We feed the batch of binary input tokens to the encoder layer of the RNN corresponding to that bucket, while we feed the corresponding batch of higher-level tokens to the decoder layer. The encoder layer embeds the inputs into fixed-size vectors to provide to the decoder layer. The system uses the inputs and outputs to alter the internal state and parameters for all layers. It is this internal state and these parameters that the model later uses to perform translations.

A subset of the training examples is held out during training for the model to use for self-evaluation. At intervals during training, the model evaluates itself on these held-out examples and adjusts its parameters and internal state based on performance.

E. Evaluation

Using a separate data set, we evaluate the performance of the trained RNN by feeding the model sequences of binary code. The RNN determines to which bucket the binary input belongs and provides the sequence of tokens as input to the encoder layer of the corresponding trained RNN. When testing the trained RNN, we do not provide inputs to the decoder layer, instead setting a parameter so that the RNN generates outputs from the decoder layer. Conceptually, and simplifying the details, the encoder layer, using its saved parameters and internal state, turns the binary input tokens into a fixed-length embedding, which it provides to the layer or layers in between the encoder and decoder. The layer or layers in between translate the vector to a different vector to provide to the decoder layer. The decoder layer then generates a series of tokens, corresponding to predicted output in the destination language, C source code.

We compare the tokenized output C source code predicted by the RNN against tokenized known, ground-truth values, taking the Levenshtein distance between the two.

III. EXPERIMENTAL SETUP

In this section we provide details relating to how we implement each step described in Section II.

A. Corpus Creation

We leverage the Clang compiler toolchain and debug information to identify binary code associated with individual abstract syntax tree (AST) subtrees at the operator, expression, statement, and function levels. The code that is represented in snippets at a smaller AST granularity also appears in the appropriate snippets at larger granularity. Using this tool, we create a database of pairs of C source code snippets and corresponding binary snippets. Note that, although we attempt to limit the corpus to C code, there may be a small subset of C++ snippets.

The full corpus contains 1,151,013 pairs, of which 73 are literals, 249,331 are operators, 203,305 are expressions, 649,842 are statements, 47,585 are functions, and 972 belong to other categories.

The corpus of binary and C source code snippet pairs is drawn from the Fedora selections for the MUSE project, a DARPA project to develop software engineering techniques that draw on large amounts of available code.8 From those RPMs, we select programs that use C source code and compile with minimal modification. We use our custom compiler to associate the AST-appropriate binary snippets to corresponding C source code snippets.

Although larger snippets are in our corpus, because of technical limitations, we limit the length of the snippets we use in training and testing to a maximum number of tokens, the number chosen dynamically. Further details are in the discussion of bucketing in Section II-C.

We compile the corpus on a Fedora 19 virtual machine with Clang 3.7.1, targeting x86.

B. Training the RNN

We train and test the RNN model on two machines running Ubuntu 14.04.5 and TensorFlow 0.8, each with an Nvidia GeForce GTX 980 Ti GPU. Additionally, both have 12 virtual (6 physical) Intel Xeon E5-2620 CPUs @ 2.00 GHz. The machine used for the C corpus has 32 GB RAM.

We modify an existing encoder-decoder model, based on the translate.py file that makes use of the seq2seq model distributed with TensorFlow version 0.8.9

Any RNN-based system has many parameters and hyperparameters that can be adjusted. We conduct an informal parameter sweep and find the best performance with the parameters shown in Table I. Note that for some parameters, performance was nearly equivalent for different values of the parameter, in which case we choose for convenience.

We train our models on 758,706 training pairs. We pick bucket sizes dynamically, based on the technical restrictions of the system and the lengths of the tokenized snippets in the training corpus. The bucket sizes and numbers of snippets per bucket are shown in Table II.

C. Evaluation

We use two evaluation metrics to assess the quality of the C source code generated by our technique. The first metric, perplexity, is based on the RNN's internal sampled softmax loss function, as described in Jean et al. [19], and is used during RNN training.

8 http://corpus.museprogram.org
9 TensorFlow can be found at https://www.tensorflow.org/. The files on which we based our model can be found at https://github.com/tensorflow/tensorflow/tree/v0.8.0rc0/tensorflow/models/rnn/
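For reference, perplexity can be derived directly from the average per-token cross-entropy loss that the training loop reports. The following sketch is our illustration of that relationship; the overflow guard is a common convention rather than anything specified in this paper.

    import math

    def perplexity(avg_cross_entropy_loss):
        """Perplexity is the exponential of the average per-token cross-entropy;
        lower values mean the model assigns higher probability to the ground truth."""
        if avg_cross_entropy_loss > 300:       # avoid overflow for untrained models
            return float("inf")
        return math.exp(avg_cross_entropy_loss)

    # Example: an average loss of 1.2 nats per token gives a perplexity of about 3.32.
    print(perplexity(1.2))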

TABLE I
PARAMETER AND HYPERPARAMETER VALUES.

Parameter                       Value
learning rate                   0.8
learning rate decay             0.9
number of layers                4
size                            768
max gradient norm               5
batch size                      64
higher-level vocab size limit   40000
binary vocab size limit         40000

TABLE II
BUCKET SIZES AND NUMBERS OF SNIPPETS PER BUCKET FOR THE LARGEST TRAINING CORPUS. "MAX. BIN. LEN." IS THE MAXIMUM NUMBER OF TOKENS IN A BINARY SNIPPET IN THE BUCKET. "MAX. C SOURCE LEN." IS THE MAXIMUM NUMBER OF TOKENS IN A C SOURCE CODE SNIPPET IN THAT BUCKET.

Max. Bin. Len.   Max. C Source Len.   # Training Pairs
11               5                    79,747
22               9                    240,490
47               17                   274,412
112              88                   311,016

Perplexity is a common measurement in work relating to RNNs, especially in work relating to models of languages. It is a measure of the accuracy of a probabilistic model in predicting a sample [20]. We use the perplexity measurement as a proxy for RNN performance to determine when to end training.

The second metric, an evaluation of the usefulness of the translations external to the measures used by the RNN, is based on the Levenshtein edit distance between the sequence of tokens in the ground truth source code associated with a given binary sequence and the sequence of tokens output by the RNN. We recognize that this evaluation metric is an imperfect proxy for the usefulness of the decompilation output; we leave a user study to future work.

We have also developed several 'compilation fixer' transformations that we run as a post-processing step on the predicted source snippets. While the RNN often gives a prediction close to the structure of the original source code, the prediction does not always observe syntax rules. Our transformations balance brackets, parentheses, and braces; add missing commas; and delete extra semicolons. These simple transformations can be incorporated into any implementation of an RNN decompiler. However, although these transformations can improve human readability, we have not found them to have a large effect on edit distances. We believe that more sophisticated compilation-fixing techniques may have the potential for a greater effect, but we leave those to future work.

We use two variations on edit distance to evaluate our trained model. First, we look at straight edit distance between the tokens in the prediction and those in the ground truth snippet. In addition, to ensure that differences in identifier names and constants do not influence our results, we lex both the output translations and the ground truth snippets to obtain sequences of token types. We perform the edit distance calculation between corresponding sequences of token types.

We report the average edit distance for the exact tokens predicted; the average edit distance for the types of tokens predicted; and the percent of predicted token sequences that match the ground truth token sequences perfectly, without post-processing. We report these numbers separately for each bucket and in aggregate for all buckets.

We are aware that other decompilation work uses different metrics, such as BLEU, edit distance ratio, and syntactic correctness ratio [11]. However, we believe that our metrics are more appropriate to our work.

IV. RESULTS

RQ1 How does the duration of training and the number of examples trained on influence the effectiveness of the RNN for decompilation?
RQ2 How accurate is the RNN for translating binary data to C code?

A. Effects of Different Amounts of Training

RQ1: How does the duration of training and the number of examples trained on influence the effectiveness of the RNN for decompilation?

Recall that our overall goal is to determine whether an RNN-based setup is useful for decompilation. This question evaluates the subgoal of determining how much training is needed or useful to train an RNN-based model to be effective for decompilation. To answer this question, we take the encoder-decoder-based system described in Section III and run it for binary-to-C-source translation. We train the model with a given number of pairs, after which we pause the training and evaluate the effectiveness of the trained model. Then we continue the training with additional sets of pairs, evaluating after each set.

For this evaluation, we measure accuracy by edit distance between the post-processed predictions and the original ground-truth higher-level snippets, as described in Section III-C. Lower is better and means that the predicted snippet is closer to the original source code snippet.

For this experiment, we train 9 models, each on pairs of snippets drawn from pools of 25,000, 50,000, 75,000, 100,000, 200,000, 400,000, 600,000, 800,000, and 1,000,000. We evaluate each trained model on a pre-defined set of test snippet pairs, using the same set for each. We feed each binary snippet to the RNN and record the RNN's prediction for the C source code.

Figure 3 shows results for translating binary code to C source code. The X axis represents the number of pairs of snippets given to the model for training. The Y axis represents the average token edit distance between the C snippets output by the RNN and the ground-truth high-level snippets that correspond to the binary fed to the RNN. Results are shown both with and without post-processing and for token values as well as token types.
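The average token edit distance plotted in Figure 3 (and reported for RQ2 in Section IV-B) is a Levenshtein distance computed over token sequences. The sketch below is our illustration of that metric; the normalization by the longer sequence's length is an assumption, since the paper does not spell out its scaling, and token_type stands for a caller-supplied function returning a token's lexer-assigned category.

    def token_edit_distance(pred_tokens, truth_tokens, normalize=True):
        """Levenshtein distance between two token sequences; with normalize=True,
        0.0 is a perfect match and 1.0 means nothing lines up."""
        m, n = len(pred_tokens), len(truth_tokens)
        dist = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            dist[i][0] = i
        for j in range(n + 1):
            dist[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if pred_tokens[i - 1] == truth_tokens[j - 1] else 1
                dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                                 dist[i][j - 1] + 1,         # insertion
                                 dist[i - 1][j - 1] + cost)  # substitution
        d = dist[m][n]
        return d / max(m, n, 1) if normalize else d

    def type_edit_distance(pred_tokens, truth_tokens, token_type):
        """The token-type variant: compare lexer categories (e.g., ID, EQUALS,
        LPAREN) instead of the literal token values."""
        return token_edit_distance([token_type(t) for t in pred_tokens],
                                   [token_type(t) for t in truth_tokens])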

Fig. 3. Accuracy of Translations, for C Source, Given Different Numbers of Training Pairs. (Y axis: edit distance from 0 to 1, lower is better; X axis: number of training pairs, from 25,000 to 1,000,000; series: C Token Types, C Token Values, C Token Types Post-Processed, C Token Values Post-Processed.)

Note that the graphs show diminishing returns after 200,000 snippet pairs, and even worsening accuracy after 600,000 snippet pairs. The results produced after this smaller amount of training on 200,000 snippet pairs, taking approximately 65 minutes, may be good enough to use in many applications, avoiding the necessity of training the model through a much larger corpus, as suggested by the literature [7], [17].

Also note that the graph shows that the model is significantly more likely to produce output sequences of the correct token types than the exact correct tokens.

B. Translating Binary to C

RQ2: How accurate is the RNN for translating binary data to C code?

Recall that our overall goal is to determine whether an RNN-based setup is useful for decompilation. This question evaluates the effectiveness of the RNN-based setup for decompiling binary to C source code, specifically. To answer this question, we take the encoder-decoder-based system described in Sections II and III and run it on the corpus of preprocessed binary and C source code snippet pairs. We train on 758,706 snippet pairs, of which 152,016 are held out to be used to improve training, taking 243.33 minutes (14599.67 seconds), and test on a pre-defined set of 400 snippet pairs not contained in the training set.

In this evaluation, we measure accuracy based on the Levenshtein edit distance between the token sequences of the post-processed predictions that our system produces and the tokens in the original ground-truth C source code snippets, as explained in Section III-C. Recall that lower results are better and correspond to translations that more closely resemble the ground truth.

We also measure the percent of snippets that are predicted to have the exact same token sequences as the ground truth. In addition, we report the time taken to train the model and translate the snippets.

Table III summarizes the results. It shows that the edit distance between the predicted token sequence and the ground truth token sequence averages 0.70 in general. Our model performs somewhat better for shorter token sequences than for longer token sequences, as shown by the per-bucket performance. The snippets in Bucket 0 are shorter than those in Bucket 1, and so on, as shown in Table II.

The table also shows edit distances for sequences of token types, rather than the tokens themselves. That is, we lex the ground truth token sequence and the translated output token sequence and create a corresponding sequence of token types for each. The table shows significantly improved performance when we use token types. This result suggests that our technique is better at recovering syntactic structure than variable and function names. Although we perform some post-processing on the translated output, we do not present edit distances for those sequences here, because they are not significantly different from the non-post-processed edit distances.

Edit distance can be a non-intuitive measurement, and it does not precisely capture the usefulness of predicted translation outputs. We did not conduct a user study to quantify the usefulness of the results. For these reasons, we present Table IV, which shows several examples of ground-truth snippets and their corresponding post-processed outputs from the trained model, along with the evaluation metrics for each translation. We picked these examples because they illustrate some of the strengths of the technique that are not necessarily captured in the edit distance metric.
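The post-processing applied to these predictions (Section III-C: balancing brackets, parentheses, and braces, and deleting extra semicolons) can be approximated by a sketch such as the following. This is our illustration of the idea rather than the authors' transformations, and a fuller fixer would also need the comma-insertion step mentioned in the text.

    PAIRS = {"(": ")", "[": "]", "{": "}"}

    def fix_snippet(tokens):
        """Balance (), [], {} and drop immediately repeated semicolons."""
        fixed, stack = [], []
        for tok in tokens:
            if tok == ";" and fixed and fixed[-1] == ";":
                continue                      # delete an extra semicolon
            if tok in PAIRS:
                stack.append(tok)
            elif tok in PAIRS.values():
                if stack and PAIRS[stack[-1]] == tok:
                    stack.pop()
                else:
                    continue                  # drop an unmatched closer
            fixed.append(tok)
        fixed.extend(PAIRS[opener] for opener in reversed(stack))  # close what is open
        return fixed

    # Example: the unbalanced prediction "if ( var_0 ) { function ( var_1 )"
    # becomes "if ( var_0 ) { function ( var_1 ) }".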

TABLE III
SUMMARY OF STATISTICS ON ONE TRAINED MODEL. A LOWER AVERAGE EDIT DISTANCE IS BETTER. FOR EXAMPLES THAT ILLUSTRATE WHAT THESE NUMBERS REPRESENT, SEE TABLE IV.

            avg. edit   avg. edit dist.   perc.     training      translation   translation time
            dist.       (token types)     perfect   time (secs)   time (secs)   per snippet (secs)
C Source    0.70        0.52              3.8%      14599.67      202.55        0.51
Bucket 0    0.65        0.56              11.0%
Bucket 1    0.67        0.45              3.0%
Bucket 2    0.72        0.52              0.0%
Bucket 3    0.75        0.55              1.0%

Note that our technique can successfully recover various aspects of the original source code, as shown in the examples in Table IV:
• In Example 1, the technique correctly predicts that the binary represents a function call, the name of the function called, the structure inside the parentheses that a variable is not equal to another variable, and the variable name, NULL.
• Example 2 recovers the general structure of the statement.
• Example 3 shows the recovery of the assignment of the result of a function call to a variable, although it does not capture that the two variables should be the same. In this example, the technique recovers the sequence of token types exactly (i.e., variable identifier, equals sign, function identifier, open parenthesis, variable identifier, closed parenthesis, semicolon).
• In Example 4, the technique reproduces an if statement conditioned on a single variable with a function call in its body, although the details of the body are not precisely correct.
• Example 5 shows the recovery of a for loop. Although the syntax of the for loop is not exactly correct, the model does pick up on some of the semantics. If you consider var_0 to represent the variable node, the translation correctly initializes node to another variable, tests whether node is null, and sets the increment to node->next, even though a portion of the increment happens in the body of the for loop in the ground truth.

These examples show that our evaluation metrics do not fully capture the amount of information conveyed in snippets decompiled using our technique and that our technique shows promise in human-facing scenarios.

V. THREATS TO VALIDITY

A. The Length of the Snippets

Because of technical limitations, our training and testing snippets are limited in length. Therefore, while our technique performs well at recovering source code from small binary snippets, we cannot handle longer ones. We believe, however, that decompiling short snippets is useful in itself and in combination with other techniques to decompile larger portions of the program.

B. Training with Only AST-Appropriate Snippets

Our training corpus and testing examples consist of snippets that are entire subtrees of an abstract syntax tree (AST). While we can control the makeup of our training corpus, if we would like to use our technique on binary that we know nothing about, we cannot necessarily determine where the boundaries of subtrees are in the binary. We do not know how well our technique will perform on snippets that represent incomplete subtrees or portions of more than one incomplete subtree.

C. Redundancy in Snippets

Because we use snippets at multiple levels of an AST, there is a chance that a snippet in the training set may be a subset of a snippet in the test set, and vice versa. This threat is mitigated by the use of bucketing, which means that snippets that are significantly different in length are unlikely to be in the same bucket. Because separate RNN models are used for each bucket, it is less likely that the inclusion of one of the snippets in the training set will affect the evaluation of the corresponding snippet.

There is also a risk from redundancy in that we did not ensure that the corpus was free from identical snippets, which means that a snippet that has been part of the training set may be identical to a snippet included in the test set. However, since our corpus was drawn from real programs, any advantage conveyed by identical snippets may also exist in a real situation in which this approach may be used.

D. Compiler, Machine, and Options Limitations

We obtain our corpus using one compiler on one virtual machine, using one set of compiler options. Specifically, we compile at optimization level zero (-O0). We do not know whether our work will extend to other settings and optimization levels. We also do not know how a model trained on a corpus generated on one machine or with one set of options will perform on binary generated differently.

E. Limits of Experimental Investigation

We make claims that the source code that our technique produces is useful to humans. Although we believe this to be true from our own inspection, we do not conduct a user study to verify these claims, as others have recently done [2].

Studies of the applicability of a technique are naturally limited by the ingenuity and level-of-effort of the performer. The authors are confident that most decisions made in this study are reasonable and that obvious optimizations and extensions were tried. However, there remain additional untested techniques which may significantly improve the applicability of RNNs to decompilation, specifically the use of structured generation of source code and context.

TABLE IV
EXAMPLE TRANSLATIONS FROM BINARY MACHINE CODE TO C SOURCE CODE GENERATED BY AN RNN MODEL. WE SELECTED THESE EXAMPLES TO SHOW INTERESTING PROPERTIES.

Ex. 1 (176 binary tokens, edit dist. 0.29)
  Ground Truth: g_return_if_fail(screen_info != NULL);
  Translation:  g_return_if_fail( var_0 != var_NULL );

Ex. 2 (53 binary tokens, edit dist. 0.64)
  Ground Truth: itr->e = h->table[i];
  Translation:  var_0->var_1 = var_2->var_3;

Ex. 3 (41 binary tokens, edit dist. 0.43)
  Ground Truth: aucBuffer = xfree(aucBuffer);
  Translation:  var_0 = function( var_1 );

Ex. 4 (158 binary tokens, edit dist. 0.79)
  Ground Truth: if (ts) {
                  adjusted_timespec[0] = timespec[0];
                  adjusted_timespec[1] = timespec[1];
                  adjustment_needed = validate_timespec(ts);
                }
  Translation:  if ( var_0 ) {
                  function( var_1 , var_0->var_2 );
                }

Ex. 5 (164 binary tokens, edit dist. 0.66)
  Ground Truth: for (node = tree->head; node; node = next) {
                  next = node->next;
                  avl_free_node(tree, node);
                }
  Translation:  for ( var_0 = var_1 ) var_0 != var_NULL ; var_0 = var_0->var_2 {
                  function(var_0->var_3);}

Constraining the generation of output from the RNN such that only structurally valid, parseable source may be emitted has been shown to significantly improve the ability of RNNs to generate source code [13]. We do not evaluate this technique.

Similarly, the use of context in natural language processing has been shown to significantly improve RNN translation performance [21]. In many decompilation tasks, snippets may appear within a wide context of surrounding machine code. Intuitively, this context provides a great deal of information relevant to the decompilation of a machine code snippet, including the types of variables held in registers and surrounding control-flow constructs. The integration of context into RNN decompilation may also provide better results than those reported in this work.

Choosing an evaluation metric not directly comparable to those used for other decompilation work prevents us from claiming that our decompilation technique outperforms others.

VI. RELATED WORK

A. Traditional Decompilation

Traditionally, decompilers operate in several stages: (1) parse and disassemble the binary into a machine-neutral intermediate representation, attempting to replace compiler idioms with higher-level operations; (2) use data-flow and type analysis to transform the intermediate representation into control-flow structures; (3) translate the control-flow structures into source code. Ďurfina et al. provide an outline of the history of decompilation [22].

In recent years, research has focused on the control-flow structuring stage of decompilation [1], [5], [22]–[25]. Yakdan et al. [5] introduce pattern-independent control-flow structuring, a technique capable of performing semantics-preserving transformations which produce structured source code with reduced numbers of GOTOs, which can appear in traditionally-decompiled code.

Because our technique leverages human-written source code for training, we are able to generate code which is often less idiosyncratic than code created by traditional decompilers. In addition, by focusing at a lower level, our technique may be used in conjunction with other decompilation techniques that generate structure; the other techniques can suggest a high-level structure, while our technique can fill in lower-level pieces with more useful code.

Other recent decompilation work has focused on modeling complex instruction set architecture semantics by leveraging compiler knowledge. Hasabnis and Sekar's approach achieves great success at lifting binary code to their intermediate representation [26].

B. Related Techniques for Source Code

A wide variety of software engineering problems are being attacked with "Big Code" and machine learning. JSNice and JSNaughty use a large corpus of publicly-available JavaScript code to predict identifier names and type annotations in obfuscated JavaScript programs with high accuracy [27], [28].

The research of Caliskan-Islam et al. into identifying authorship of compiled binaries based on coding style indicates that there is a strong relationship between human-written source code style and even highly-optimized compiled code. This work indicates that using machine learning on the binary machine code in conjunction with its corresponding source code may provide machine-useable insights on how to write code like a human [29].

Recent work by Levy and Wolf focuses on using neural networks to align snippets of source code to the corresponding compiled code, which can also be used in conjunction with our technique [30].

Maddison and Tarlow focus on generative models of natural source code. Their work has applications for source code autocompletion, mining and understanding API usage patterns, and enforcement of coding conventions. Their trained model has demonstrated success in completing partial programs and observing proper syntax rules [13]. By leveraging their result, we could constrain the RNN to only generate parseable sequences of tokens; this should have a very significant impact on the quality of the generated output.

Mou et al. propose a method for building vector representations of programs as a basis for using deep learning for program analysis [31]. Van Nguyen et al. propose using vector representations of programs for retrieval of API examples [15]. Some of their techniques may be applicable in conjunction with our work.

Others have used statistical machine translation, another technique used for translation of human languages, for decompilation [11], [32], [33].

C. Related Techniques in Natural Language Translation

While neural machine translation demonstrates promise, Bahdanau et al. note that, because a traditional encoder-decoder neural network must compress all the necessary information for the source sentence into a fixed-length vector, performance may suffer, especially for longer sentences [34]. Bahdanau et al. propose an extension which learns to align and translate jointly. We would like to leverage this technique for decompilation because, especially with compiler transformations enabled, source code may be elided or relocated in the compiled binary.

VII. DISCUSSION AND CONCLUSIONS

We have demonstrated success in decompilation through use of an RNN-based technique. In our experiments, we have shown that an encoder-decoder RNN structure can be useful in recovering the structure of source code snippets from binary machine code snippets. We believe that this approach is useful both in itself and in combination with other decompilation techniques.

In addition, our model can be extended to any other language or platform for which there exists a sufficient corpus of paired binary machine code snippets and higher-level language snippets. We believe expansions to other languages, such as LLVM intermediate representation, and to other types of binary code, such as MIPS binaries and other types of firmware binaries, would be successful.

We performed some initial experiments translating binary machine code snippets to LLVM intermediate representation (IR). Although our initial LLVM IR results were not as good as the results for translation to C source, the translations showed promise. We believe there may be several reasons for the difference in performance. We used a very simple tokenization scheme for the LLVM IR data, while we used a much more sophisticated tokenization scheme for the C data. For example, we did not do the equivalent in LLVM IR of normalizing variable and function names as in the C data. Anecdotally, we observed that the normalization improves results in the C data.

We would like to see further work in using an RNN-based technique for decompilation at a larger level. Our work is limited to small snippets of code because of the technical limitations of our setup and hardware, but we believe that similar techniques can be applied to larger snippets. Furthermore, we believe it is worth investigating whether a modified version of the approach can lead to recovery of higher-level program structure from binary. This higher-level approach, used together with the low-level approach discussed in this paper, may allow decompilation of much larger program units.

ACKNOWLEDGMENTS

This material is based upon work supported by the SPAWAR Systems Center Pacific Office under Contract No. N66001-13-C-4046. This research was developed with funding from the Defense Advanced Research Projects Agency (DARPA). The views, opinions, and/or findings contained in this article are those of the author(s) and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.

REFERENCES

[1] L. Ďurfina, J. Křoustek, and P. Zemek, "Psybot malware: A step-by-step decompilation case study," in Working Conference on Reverse Engineering, ser. WCRE '13. IEEE Computer Society, 2013, pp. 449–456.
[2] K. Yakdan, S. Dechand, E. Gerhards-Padilla, and M. Smith, "Helping Johnny to analyze malware: A usability-optimized decompiler and malware analysis user study," in Security and Privacy, ser. SP '16, 2016, pp. 158–177.
[3] D. Song, D. Brumley, H. Yin, J. Caballero, I. Jager, M. Kang, Z. Liang, J. Newsome, P. Poosankam, and P. Saxena, "BitBlaze: A new approach to computer security via binary analysis," International Conference on Information Systems Security, pp. 1–25, 2008.
[4] D. D. Chen, M. Woo, D. Brumley, and M. Egele, "Towards automated dynamic analysis for Linux-based embedded firmware," in NDSS, 2016.
[5] K. Yakdan, S. Eschweiler, E. Gerhards-Padilla, and M. Smith, "No more gotos: Decompilation using pattern-independent control-flow structuring and semantics-preserving transformations," in Network and Distributed System Security, ser. NDSS '15, 2015.

[6] H. T. Siegelmann, "Foundations of recurrent neural networks," Ph.D. dissertation, Rutgers University, New Brunswick, NJ, USA, 1993, UMI Order No. GAX94-12680.
[7] K. Cho, B. van Merriënboer, Ç. Gülçehre, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," Computing Research Repository, 2014. [Online]. Available: http://arxiv.org/abs/1406.1078
[8] O. Vinyals, L. Kaiser, T. Koo, S. Petrov, I. Sutskever, and G. E. Hinton, "Grammar as a foreign language," Computing Research Repository, 2014. [Online]. Available: http://arxiv.org/abs/1412.7449
[9] J. Schmidhuber, "Deep learning in neural networks: An overview," CoRR, vol. abs/1404.7828, 2014. [Online]. Available: http://arxiv.org/abs/1404.7828
[10] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., "TensorFlow: A system for large-scale machine learning," in Operating Systems Design and Implementation, ser. OSDI '16, 2016, pp. 265–283.
[11] A. T. Nguyen, T. T. Nguyen, and T. N. Nguyen, "Lexical statistical machine translation for language migration," in Joint Meeting of the European Software Engineering Conference and the Symposium on the Foundations of Software Engineering, ser. ESEC/FSE '13, 2013, pp. 651–654.
[12] W. Ling, E. Grefenstette, K. M. Hermann, T. Kočiský, A. Senior, F. Wang, and P. Blunsom, "Latent predictor networks for code generation," Computing Research Repository, 2016. [Online]. Available: http://arxiv.org/abs/1603.06744
[13] C. J. Maddison and D. Tarlow, "Structured generative models of natural source code," in International Conference on Machine Learning, ser. ICML '14, 2014, pp. II-649–II-657.
[14] H. K. Dam, T. Tran, and T. Pham, "A deep language model for software code," Computing Research Repository, 2016. [Online]. Available: http://arxiv.org/abs/1608.02715
[15] T. V. Nguyen, A. T. Nguyen, H. D. Phan, T. D. Nguyen, and T. N. Nguyen, "Combining word2vec with revised vector space model for better code retrieval," in International Conference on Software Engineering Companion, ser. ICSE-C '17, 2017, pp. 183–185.
[16] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[17] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Advances in Neural Information Processing Systems, ser. NIPS '14, 2014, pp. 3104–3112.
[18] "Tutorial: Sequence-to-sequence models." [Online]. Available: https://www.tensorflow.org/tutorials/seq2seq
[19] S. Jean, K. Cho, R. Memisevic, and Y. Bengio, "On using very large target vocabulary for neural machine translation," Computing Research Repository, 2014. [Online]. Available: http://arxiv.org/abs/1412.2007
[20] C. E. Shannon, "A mathematical theory of communication," Bell System Technical Journal, vol. 27, no. 3, pp. 379–423, 1948.
[21] T. Mikolov and G. Zweig, "Context dependent recurrent neural network language model," in SLT, 2012, pp. 234–239.
[22] L. Ďurfina, J. Křoustek, P. Zemek, D. Kolář, T. Hruška, K. Masařík, and A. Meduna, "Design of a retargetable decompiler for a static platform-independent malware analysis," in Information Security and Assurance, ser. ISA '11, 2011, pp. 72–86.
[23] ——, "Design of an automatically generated retargetable decompiler," in Circuits, Systems, Communications & Computers, ser. CSCC '11, 2011, pp. 199–204.
[24] G. Chen, Z. Qi, S. Huang, K. Ni, Y. Zheng, W. Binder, and H. Guan, "A refined decompiler to generate C code with high readability," Software: Practice and Experience, vol. 43, no. 11, pp. 1337–1358, 2013.
[25] E. J. Schwartz, J. Lee, M. Woo, and D. Brumley, "Native x86 decompilation using semantics-preserving structural analysis and iterative control-flow structuring," in USENIX Conference on Security, ser. SEC '13, 2013, pp. 353–368.
[26] N. Hasabnis and R. Sekar, "Lifting assembly to intermediate representation: A novel approach leveraging compilers," in Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS '16, 2016, pp. 311–324.
[27] V. Raychev, M. Vechev, and A. Krause, "Predicting program properties from "Big Code"," in Principles of Programming Languages, ser. POPL '15, 2015, pp. 111–124.
[28] B. Vasilescu, C. Casalnuovo, and P. Devanbu, "Recovering clear, natural identifiers from obfuscated JS names," in Joint Meeting of the European Software Engineering Conference and the Symposium on the Foundations of Software Engineering, ser. ESEC/FSE '17, 2017, pp. 683–693.
[29] A. Caliskan-Islam, F. Yamaguchi, E. Dauber, R. Harang, K. Rieck, R. Greenstadt, and A. Narayanan, "When coding style survives compilation: De-anonymizing programmers from executable binaries," Computing Research Repository, 2015. [Online]. Available: http://arxiv.org/abs/1512.08546
[30] D. Levy and L. Wolf, "Learning to align the source code to the compiled object code," in International Conference on Machine Learning, ser. ICML '17, 2017, pp. 2043–2051.
[31] H. Peng, L. Mou, G. Li, Y. Liu, L. Zhang, and Z. Jin, "Building program vector representations for deep learning," in Knowledge Science, Engineering and Management, ser. KSEM '15, 2015, pp. 547–553.
[32] Y. Oda, H. Fudaba, G. Neubig, H. Hata, S. Sakti, T. Toda, and S. Nakamura, "Learning to generate pseudo-code from source code using statistical machine translation (T)," in Automated Software Engineering, ser. ASE '15, 2015, pp. 574–584.
[33] S. Karaivanov, V. Raychev, and M. Vechev, "Phrase-based statistical translation of programming languages," in New Ideas, New Paradigms, and Reflections on Programming & Software, ser. Onward! 2014, 2014, pp. 173–184.
[34] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," Computing Research Repository, 2014. [Online]. Available: http://arxiv.org/abs/1409.0473
