PalmTree: Learning an Assembly Language Model for Instruction Embedding

Xuezixiang Li, Yu Qu, Heng Yin
University of California Riverside, Riverside, CA 92521, USA

ABSTRACT

Deep learning has demonstrated its strengths in numerous binary analysis tasks, including function boundary detection, binary code search, function prototype inference, value set analysis, etc. When applying deep learning to binary analysis tasks, we need to decide what input should be fed into the neural network model. More specifically, we need to answer how to represent an instruction in a fixed-length vector. The idea of automatically learning instruction representations is intriguing, but the existing schemes fail to capture the unique characteristics of disassembly. These schemes ignore the complex intra-instruction structures and mainly rely on control flow, in which the contextual information is noisy and can be influenced by compiler optimizations.

In this paper, we propose to pre-train an assembly language model called PalmTree for generating general-purpose instruction embeddings by conducting self-supervised training on large-scale unlabeled binary corpora. PalmTree utilizes three pre-training tasks to capture various characteristics of assembly language. These training tasks overcome the problems in existing schemes and thus help generate high-quality representations. We conduct both intrinsic and extrinsic evaluations, and compare PalmTree with other instruction embedding schemes. PalmTree has the best performance on the intrinsic metrics, and outperforms the other instruction embedding schemes on all downstream tasks.

CCS CONCEPTS

• Security and privacy → Software reverse engineering; • Theory of computation → Program analysis; • Computing methodologies → Knowledge representation and reasoning.

KEYWORDS

Deep Learning, Binary Analysis, Language Model, Representation Learning

1 INTRODUCTION

Recently, we have witnessed a surge of research efforts that leverage deep learning to tackle various binary analysis tasks, including function boundary identification [37], binary code similarity detection [23, 31, 40, 42, 43], function prototype inference [5], value set analysis [14], malware classification [35], etc. Deep learning has shown noticeably better performance than traditional program analysis and machine learning methods.

When applying deep learning to these binary analysis tasks, the first design choice to be made is: what kind of input should be fed into the neural network model? Generally speaking, there are three choices: we can directly feed raw bytes into a neural network (e.g., the work by Shin et al. [37], αDiff [23], DeepVSA [14], and MalConv [35]), feed manually designed features (e.g., Gemini [40] and Instruction2Vec [41]), or automatically learn to generate a vector representation for each instruction using a representation learning model such as word2vec (e.g., InnerEye [43] and EKLAVYA [5]) and then feed the representations (embeddings) into the downstream models.

Compared to the first two choices, automatically learning instruction-level representations is more attractive for two reasons: (1) it avoids manual feature-design effort, which requires expert knowledge and may be tedious and error-prone; and (2) it can learn higher-level features rather than purely syntactic features and thus provide better support for downstream tasks. To learn instruction-level representations, researchers adopt algorithms from the Natural Language Processing (NLP) domain (e.g., word2vec [28] and PV-DM [20]), treating binary assembly code as natural language documents.
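To make this learning-based option concrete, here is a minimal sketch — our own illustration, not code from InnerEye or EKLAVYA — of the whole-instruction-as-word scheme: each disassembled basic block is treated as a "sentence" of instruction "words," and a skip-gram word2vec model is trained over the corpus. The toy instruction strings and hyperparameters are assumptions for the example, and the gensim ≥ 4.0 API is assumed.

```python
# Sketch of learning-based instruction embedding with word2vec, treating
# every instruction as one "word" and every basic block as one "sentence".
# Toy corpus; assumes gensim >= 4.0.
from gensim.models import Word2Vec

corpus = [
    ["push rbp", "mov rbp,rsp", "sub rsp,0x10", "mov [rbp-0x8],edi"],
    ["mov eax,0x0", "leave", "ret"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=128,  # length of each instruction embedding
    window=2,         # context = neighboring instructions in the block
    min_count=1,
    sg=1,             # skip-gram
)

vec = model.wv["mov eax,0x0"]  # fixed-length vector for one instruction
print(vec.shape)               # (128,)
```

The resulting per-instruction vectors are what the downstream models consume.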
Although recent progress in instruction representation learning (instruction embedding) is encouraging, there are still some unsolved problems which may greatly influence the quality of instruction embeddings and limit the quality of downstream models:

First, existing approaches ignore the complex internal formats of instructions. For instance, in x86 assembly code, the number of operands can vary from zero to three; an operand could be a CPU register, an expression for a memory location, an immediate constant, or a string symbol; and some instructions even have implicit operands. Existing approaches either ignore this structural information by treating an entire instruction as one word (e.g., InnerEye [43] and EKLAVYA [5]) or consider only a simple instruction format (e.g., Asm2Vec [10]), as the snippet below illustrates.
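The following listing, with x86 instructions of our own choosing rather than examples from the paper, shows the internal structure that is discarded when each instruction is treated as a single opaque word:

```python
# Our own x86 examples showing how operand count and operand kinds vary;
# a whole-instruction-as-word scheme sees each line as one indivisible
# token and discards all of this structure.
instructions = [
    "ret",                           # zero operands
    "push rbp",                      # one operand: a CPU register
    "mov dword ptr [rbp-0x8], 0x0",  # two operands: a memory expression
                                     # (base register + displacement)
                                     # and an immediate constant
    "imul eax, ebx, 0x10",           # three operands
    "cdq",                           # implicit operands: reads EAX and
                                     # writes EDX without naming either
]

# A naive split already reveals the varying structure (sufficient for
# these examples; a real tokenizer must handle far more syntax).
for ins in instructions:
    opcode, _, rest = ins.partition(" ")
    operands = [op.strip() for op in rest.split(",")] if rest else []
    print(f"{opcode:<5} -> {len(operands)} explicit operand(s): {operands}")
```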
Second, existing approaches use the Control Flow Graph (CFG) to capture contextual information between instructions (e.g., Asm2Vec [10], InnerEye [43], and the work by Yu et al. [42]). However, the contextual information on the control flow can be noisy due to compiler optimizations, and cannot reflect the actual dependency relations between instructions.

Moreover, in recent years, pre-trained deep learning models [33] have been attracting increasing attention in fields such as Computer Vision (CV) and Natural Language Processing (NLP). The intuition behind pre-training is that, with the development of deep learning, the number of model parameters is growing rapidly, and a much larger dataset is needed to fully train the model parameters and to prevent overfitting. Thus, pre-trained models (PTMs) using large-scale unlabeled corpora and self-supervised training tasks have become very popular in fields such as NLP. Representative deep pre-trained language models in NLP include BERT [9], GPT [34], RoBERTa [24], ALBERT [19], etc. Considering the naturalness of programming languages [1, 16], including assembly language, there is great potential in pre-training an assembly language model for different binary analysis tasks.

To solve the existing problems in instruction representation learning and capture the underlying characteristics of instructions, in this paper we propose a pre-trained assembly language model called PalmTree for general-purpose instruction representation learning. PalmTree is based on the BERT [9] model but is pre-trained with newly designed training tasks that exploit the inherent characteristics of assembly language.

We are not the first to utilize the BERT model in binary analysis. For instance, Yu et al. [42] proposed to take a CFG as input and use BERT to pre-train the token embeddings and block embeddings for the purpose of binary code similarity detection. Trex [31] uses one of BERT's pre-training tasks, Masked Language Model (MLM), to learn program execution semantics from functions' micro-traces (a form of under-constrained dynamic traces) for binary code similarity detection.

In contrast to these existing approaches, our goal is to develop a pre-trained assembly language model for general-purpose instruction representation learning. Instead of only applying MLM on control flow, PalmTree uses three training tasks to exploit special characteristics of assembly language, such as instruction reordering introduced by compiler optimizations and long-range data dependencies. The three training tasks work at different granularity levels to effectively train PalmTree to capture the internal formats, contextual control flow dependencies, and data flow dependencies of instructions.

Experimental results show that PalmTree can provide high-quality general-purpose instruction embeddings. Downstream applications can directly use the generated embeddings in their models. A static embedding lookup table can be generated in advance for common instructions. Such a pre-trained, general-purpose language model scheme is especially useful when computing resources are limited, such as on lower-end or embedded devices. We design a set of intrinsic and extrinsic evaluations to systematically evaluate PalmTree.

In summary, we make the following contributions:

• We propose to use three pre-training tasks for PalmTree, embodying characteristics of assembly language such as reordering and long-range data dependency.
• We conduct extensive empirical evaluations and demonstrate that PalmTree outperforms the other instruction embedding models and also significantly improves the accuracy of downstream binary analysis tasks.
• We plan to release the source code of PalmTree, the pre-trained model, and the evaluation framework to facilitate follow-up research in this area.

To facilitate further research, we have made the source code and pre-trained PalmTree model publicly available at https://github.com/palmtreemodel/PalmTree.

2 BACKGROUND

In this section, we first summarize existing approaches and background knowledge of instruction embedding. Then we discuss some unsolved problems of the existing approaches. Based on these discussions, we summarize the representative techniques in this field.

2.1 Existing Approaches

Based on the embedding generation process, existing approaches can be classified into three categories: raw-byte encoding, manually designed encoding, and learning-based encoding.

2.1.1 Raw-byte Encoding. The most basic approach is to apply a simple encoding to the raw bytes of each instruction and then feed the encoded instructions into a deep neural network. One such encoding is "one-hot encoding," which converts each byte into a 256-dimensional vector in which one dimension is 1 and the others are all 0. MalConv [35] and DeepVSA [14] take this approach to classify malware and perform coarse-grained value set analysis, respectively. One instruction may be several bytes long.
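As a minimal sketch of this raw-byte scheme — our own illustration, not code from MalConv or DeepVSA — the following turns the bytes of one instruction into a sequence of 256-dimensional one-hot vectors:

```python
# One-hot raw-byte encoding as described above: each byte becomes a
# 256-dimensional vector with a single 1, so an n-byte instruction
# becomes an (n, 256) matrix fed to the network.
import numpy as np

def one_hot_encode(raw: bytes) -> np.ndarray:
    out = np.zeros((len(raw), 256), dtype=np.float32)
    out[np.arange(len(raw)), list(raw)] = 1.0
    return out

# "mov eax, 0" assembles to the five bytes b8 00 00 00 00.
encoded = one_hot_encode(bytes.fromhex("b800000000"))
print(encoded.shape)        # (5, 256)
print(encoded[0].argmax())  # 184 == 0xb8, the opcode byte
```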