
SeqTrans: Automatic Vulnerability Fix via Sequence to Sequence Learning

Jianlei Chi, Yu Qu, Ting Liu, Member, IEEE, Qinghua Zheng, Member, IEEE, Heng Yin, Member, IEEE

Abstract—Software vulnerabilities are now reported at an unprecedented speed due to the recent development of automated vulnerability hunting tools. However, fixing vulnerabilities still mainly depends on programmers' manual efforts: developers need to deeply understand the vulnerability and try to affect the system's functionality as little as possible. In this paper, building on advances in Neural Machine Translation (NMT) techniques, we provide a novel approach called SeqTrans that exploits historical vulnerability fixes to provide suggestions and automatically fix source code. To capture the contextual information around the vulnerable code, we propose to leverage data flow dependencies to construct code sequences and feed them into the state-of-the-art transformer model. A fine-tuning strategy is introduced to overcome the small sample size problem. We evaluate SeqTrans on a dataset containing 1,282 commits that fix 624 vulnerabilities in 205 Java projects. Results show that SeqTrans outperforms the latest techniques and achieves an accuracy of 23.3% in statement-level fixes and 25.3% in CVE-level fixes. In the meantime, we look deep into the results and observe that the NMT model performs very well on certain kinds of vulnerabilities like CWE-287 (Improper Authentication) and CWE-863 (Incorrect Authorization).

Index Terms—Software engineering, vulnerability fix, neural machine translation, machine learning

1 INTRODUCTION

SOFTWARE evolves quite frequently for numerous reasons such as deprecating old features, adding new features, refactoring, bug fixing, etc. Debugging is one of the most time-consuming and painful processes in the entire software development life cycle (SDLC). A recent study indicates that the debugging component can account for up to 50% of the overall software development overhead, and the majority of the debugging costs come from manually checking and fixing bugs [1], [2], [3], [4]. This has led to a growing number of researchers working on teaching machines to automatically modify and fix programs, which is called automated program repair [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14].

Software vulnerabilities are one kind of bug that can be exploited by an attacker to cross authorization boundaries. Vulnerabilities like HeartBleed [15], Spectre [16] and Meltdown [17] introduced significant threats to millions of users. However, some subtle differences make identifying and fixing vulnerabilities more difficult than ordinary bugs [18], [19], [20]. Firstly, vulnerabilities are fewer in number than bugs, which makes it more difficult to learn enough knowledge from historical data; in other words, we usually have only a relatively small database. Secondly, labeling and identifying vulnerabilities requires the mindset of an attacker, which may not be available to developers [21]. Thirdly, vulnerabilities are reported at an unprecedented speed due to the recent development of automated vulnerability hunting tools like AFL [22], AFLGo [23] and AFLFast [24]. Nevertheless, fixing vulnerabilities still heavily depends on manually generating repair templates and defining repair rules, which is tedious and error-prone [25]. Automatically learning to generate vulnerability fixes is therefore urgently needed and would greatly improve the efficiency of software development and maintenance.

There is a great deal of work on automated program repair, also called code migration, in both industry and academia [5]. Some approaches focus on automatically generating fix templates, also called fix patterns [26], [27], [28], [29], [30]. Some focus on mining similar code changes from historical repair records, such as CapGen [31] and FixMiner [32]. Other approaches utilize static and dynamic analysis with constraint solving to accomplish patch generation [7], [33]. IDEs also provide specific kinds of automatic changes [34], for example refactoring, generating getters and setters, adding override/implement methods, or other template code. Recently, introducing Machine Learning (ML) techniques into program repair has also attracted a lot of interest and become a trend [35], [36], [37], [38]; these works build generic models that capture statistical characteristics from previous code changes and automatically fix the given code.

However, although some promising results have been achieved, current automated program repair works face a list of limitations, especially for fixing vulnerabilities. Firstly, most of them heavily rely on domain-specific knowledge or predefined change templates, which leads to limited scalability [5]. Tufano's dataset [39] contains 2 million sentence pairs of historical bug fix records, but vulnerability
fix datasets such as Ponta's dataset [40] and the AOSP dataset [41] contain only 624 and 1,380 publicly disclosed vulnerabilities, respectively, while the total number of confirmed CVE records is nearly 150K [42]. This means we need to train and learn from a small dataset of vulnerabilities. Secondly, traditional techniques that leverage search spaces or statistical analysis to rank similar repair records need to define numerous features, which can be time-consuming and not accurate enough. ML models can alleviate these problems, but, as mentioned above, because of the small sample size only a few works have focused on vulnerability fixing.

In this paper, we focus on the two issues raised above and rely entirely on machine learning to capture grammatical and structural information as common change patterns. In order to solve the small sample size problem, we use the fine-tuning method [43]. Fine-tuning means that if our specialized-domain dataset is similar to the general-domain dataset, we can take the weights of a trained neural network and use them as the initialization for a new model trained on data from the same domain. It has been widely utilized for speeding up training and overcoming small sample sizes. Using this method, we can combine two related tasks: vulnerability fixing and bug repair. We first pre-train the model on the large and diverse dataset of bug repair records to capture universal features. Then, we fine-tune the model on our small vulnerability fixing dataset, freezing or optimizing some of the pre-trained weights to make the model fit our small dataset.

We choose the general approach of Neural Machine Translation (NMT) to learn rules from historical records and apply them to future edits.
NMT is widely utilized in the Natural Language Processing (NLP) domain, for example to translate one language (e.g., English) into another (e.g., Swedish). The NMT model can generalize numerous sequence pairs between two languages, learn the probability distribution of changes, and assign higher weights to appropriate editing operations. Previous works such as Tufano et al. [37] and Chen et al. [38] have shown initial success in using the NMT model to predict code changes. However, both of them only focus on simple scenarios such as short sequences and single-line cases. In fact, since the NMT model was originally designed for natural language processing, we should think about the gap between natural language and programming language [44]. Firstly, programming languages fall under the category of context-sensitive languages: dependencies in one statement may come from the entire function or even the entire class, whereas in natural language, token dependencies are almost always confined to the same sentence or neighboring sentences. Secondly, the vocabulary of natural languages is filled with conceptual terms, while the vocabulary of programming languages consists mostly of grammar words plus various custom-named things like variables and functions. Thirdly, programming languages are unambiguous, while natural languages are often highly ambiguous and require interpretation in context to be fully understood.

In order to solve the dependency problem across the entire class, we construct the define-use (def-use) [45] chain, which represents the data flow dependencies, to capture important context around the vulnerable statement. It extracts all variable definitions related to the vulnerable statements. We use the state-of-the-art transformer model [46] to reduce the performance degradation caused by long statements. This enables us to process long statements and capture a broader range of dependencies.

We call our approach SeqTrans, and it works as follows. Firstly, we collect historical bug and vulnerability fixing records from two previous open datasets, which contain 2 million and 3K sentence pairs of confirmed fix records, respectively. Secondly, we train a transformer model with a self-attention mechanism [46] for bug repairing on the big dataset. Then, we fine-tune the model on the small dataset to match the target of our work, vulnerability fixing. Thirdly, if a new vulnerable object is given to the trained model, beam search [47] is first used to obtain a list of candidate predictions; a syntax checker is then used to filter the candidate list and select the most suitable prediction. In order to evaluate our approach, we calculate the accuracy at the statement level and across CVEs on Ponta's dataset [40]. The experimental results show that SeqTrans reaches a promising single-line prediction accuracy of 23.3% when Beam=50, outperforming the state-of-the-art model SequenceR [38] by 5% and substantially surpassing Tufano et al. [37] and other NMT models. For predictions of full CVEs, our approach achieves an accuracy of 25.3% when Beam=50, which is also better than the other approaches. We believe these promising results confirm that SeqTrans is a competitive approach that achieves good performance on the task of vulnerability fixing.

In the meantime, we also perform ablation studies and observe internally what types of vulnerability fixes can be well predicted by SeqTrans. An interesting observation is that our model gives results that vary for different types of CWEs. Our model performs quite well on specific types of CWEs like CWE-287 (Improper Authentication) and CWE-863 (Incorrect Authorization), but cannot make any prediction for certain CWEs like CWE-918 (Server-Side Request Forgery). Our conclusion is that training a general model to fix vulnerabilities automatically is too ambitious to cover all cases; but if we focus on specific types of them, the NMT model can produce very promising results to help developers. SeqTrans can actually cover about 25% of the types of CWEs in the dataset.

The paper makes the following contributions:

1) We use the NMT transformer model to learn and generalize common patterns from data for vulnerability fixing.
2) We propose to leverage data flow dependencies to construct vulnerable sequences and maintain the vital context around them.
3) Fine-tuning is introduced to overcome the small sample size problem.
4) We implement our approach SeqTrans and evaluate it on real publicly disclosed vulnerabilities in open-source Java projects. SeqTrans outperforms other program repair techniques and achieves an accuracy of 23.3% in statement-level validation and 25.3% in CVE-level validation.
5) We make an internal observation of the prediction results on different CWEs and find some interesting CWE fixing operations captured by our model. Our model can predict specific types of CWEs pretty well.

• J. Chi, T. Liu and Q. Zheng are with the Ministry of Education Key Lab For Intelligent Networks and Network Security (MOEKLINNS), Department of Computer Science and Technology, Xian Jiaotong University, Xian 710049, China. Email: [email protected], tliu, [email protected].
• Y. Qu and H. Yin are with the Department of Computer Science and Engineering, UC Riverside, California, USA. Email: yuq, [email protected]
2 MOTIVATING EXAMPLE

Figure 1 shows a motivating example of our approach. In Figure 1, there are two vulnerability fixes, for CVE-2017-1000390 and CVE-2017-1000388, respectively. These two CVEs belong to the same CWE, CWE-732, named "Incorrect Permission Assignment for Critical Resource." CWE-732 emphasizes that "the product specifies permissions for a security-critical resource in a way that allows that resource to be read or modified by unintended actors," which means that when using a critical resource such as a configuration file, the program should carefully check whether the resource has insecure permissions.

In Figure 1 (a), before the function getIconFileName returns the IconFileName, it should check whether the user has the corresponding permission. A similar vulnerability is shown in Figure 1 (b): before the function EdgeOperation accesses the two JobName resources, it should first confirm whether the user has the permission; otherwise the access exceeds the user's permissions, which can lead to the leakage of sensitive data such as private information. Although these two CVEs belong to different projects, their repair processes are very similar. This inspired us that it might be possible to learn common patterns from historical vulnerability fixes that correspond to the same or similar CWEs.

Figure 2 is a more extreme situation, containing two identical CVE modifications, CVE-2014-0075 and CVE-2014-0099. These two CVEs belong to the same CWE-189, named "Numeric Errors".
This CWE is easy to understand: weaknesses in this category are related to improper calculation or conversion of numbers. These two CVEs contain a series of modifications for overflow evasion, and the modifications are identical, so we can directly copy the experience learned in one project to another project.

In this paper, we propose a novel method that exploits historical vulnerability fix records to provide suggestions and automatically fix the source code. If a function with a similar structure requests access to a critical resource, our deep learning model can learn to check permissions before allowing access, eliminating the tedious process for developers of searching for the vulnerability and recapitulating repair patterns.

3 METHODS

We use the neural machine translation method to guide automatic vulnerability fixing, which aims at learning common change patterns from historical records and applying them to new input files. In order to overcome the small sample size problem, we introduce the fine-tuning technique. Data flow dependencies have also been introduced to maintain and capture more important information around the diff context. SeqTrans can work together with vulnerability detection tools such as Eclipse Steady [48], which can provide vulnerability location information at the method level.

3.1 Overview

The overview of our approach is given in Figure 3, which contains three stages: preprocessing, pre-training and fine-tuning, and prediction and patching.

Preprocessing: In this step, we extract diff contexts from two datasets: the bug repair and the vulnerability fixing datasets. Then, we perform normalization and abstraction based on data flow dependencies to extract the def-use chains. We believe def-use chains are suitable for deep learning models to capture syntax and structure information around the vulnerabilities with less noise. These def-use chains can be fed into the transformer model.

Pre-training and fine-tuning: The training process starts on the bug repair dataset because it is easy to collect a training set large enough for machine learning, and vulnerability fixing and bug repair are similar task domains. We can learn and capture part of the general features and hyperparameters from the general task domain dataset, i.e., the bug repair dataset. After the pre-training, we fine-tune the transformer model on the vulnerability fixing dataset. This dataset is much smaller than the first one because it is hard to confirm and collect a large enough corpus for training. Based on the first model, we refine or freeze some of the weights to make the model more suitable for the task of vulnerability fixing. This has been shown to achieve better results on small datasets and to speed up the training process [49], [50].

Prediction and patching: If a vulnerable file is given as input, we need to locate the suspicious code and make a prediction based on the trained model. In this paper, we do not pay much attention to the vulnerability location part; it can be accomplished by existing vulnerability location tools or with the help of a human security specialist. SeqTrans provides multiple candidates for users to select the most suitable prediction. The syntax checker FindBugs [51] is exploited to check for errors and filter out predictions that contain syntax errors in advance. After that, we refill the abstractions and generate patches. We discuss the details of each part in the remainder of this section.

3.2 Code Change Mining

The two datasets we utilize are Tufano's [39] and Ponta's [40]. Tufano's dataset provides raw source code pairs extracted from bug-fixing commits, which is easy to use. However, Ponta's dataset only provides a CSV table that contains the vulnerability fixing records, so we need a crawler to fetch the projects we want. The table of vulnerability fixing records has the following form:

(vulnerability id; repository url; commit id)

where vulnerability id is the identifier of a vulnerability that is fixed by commit id in the open source code repository at repository url. Each line in the dataset represents a commit that contributes to fixing a vulnerability. We utilize a crawler to collect the program repositories mentioned in the dataset. Pull Request (PR) data is extracted based on the commit id. After that, in each PR we need to find the Java file changes involved, because our approach SeqTrans currently only supports Java files. With the help of the Git version control library JGit [52], we can retrieve the versions of the Java files before and after the code changes implemented in the PR. We call these Java file pairs ChangePairs (CPs); each CP contains a list of code diffs.
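As a concrete illustration of this retrieval step, the following is a minimal JGit sketch for reading the pre- and post-fix versions of one changed Java file, given a (repository url, commit id) entry. The class name, helper method, and file-path argument are illustrative assumptions, not part of the SeqTrans tooling.

    import java.io.File;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import org.eclipse.jgit.api.Git;
    import org.eclipse.jgit.lib.Repository;
    import org.eclipse.jgit.revwalk.RevCommit;
    import org.eclipse.jgit.revwalk.RevWalk;
    import org.eclipse.jgit.treewalk.TreeWalk;

    public class ChangePairFetcher {
        // Returns {before, after} contents of one Java file touched by the fix commit.
        static String[] fetch(File repoDir, String commitId, String javaPath) throws IOException {
            try (Git git = Git.open(repoDir); RevWalk walk = new RevWalk(git.getRepository())) {
                Repository repo = git.getRepository();
                RevCommit fix = walk.parseCommit(repo.resolve(commitId));
                RevCommit parent = walk.parseCommit(fix.getParent(0)); // pre-fix version
                return new String[] { read(repo, parent, javaPath), read(repo, fix, javaPath) };
            }
        }

        private static String read(Repository repo, RevCommit commit, String path) throws IOException {
            try (TreeWalk tw = TreeWalk.forPath(repo, path, commit.getTree())) {
                if (tw == null) return null; // file absent in this revision
                return new String(repo.open(tw.getObjectId(0)).getBytes(), StandardCharsets.UTF_8);
            }
        }
    }

Pairing the two strings returned for each changed file yields the raw material of a ChangePair before diff extraction.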

(a) CVE-2017-1000390, jenkinsci/tikal-multijob-plugin, 2424cec7a099fe4392f052a754fadc28de9f8d86

(b) CVE-2017-1000388, jenkinsci/tikal-multijob-plugin, d442ff671965c279770b28e37dc63a6ab73c0f0e

Fig. 1: Two similar vulnerability fixes belonging to CWE-732

(a) CVE-2014-0075, apache/tomcat, f646a5acd5e32d6f5a2d9bf1d94ca66b65477675 (b) CVE-2014-0099, apache/tomcat70, 184cdc0d3f03f5737e12d21fff246d7285034597

Fig. 2: Two identical vulnerability fixes belonging to CWE-189

3.3 Code Diff Extraction

After we obtain CPs from the PRs, we need to locate the diff context. Although we could exploit the "git diff" command provided by Git to search for line-level code diffs, it does not fulfill our needs: superficial code structure changes such as a new line or added whitespace are not relevant for us. For this reason, we choose to search for code diffs using Abstract Syntax Trees (ASTs). The state-of-the-art diff searching tool GumTree [53] is utilized to search for fine-grained AST node mappings. GumTree uses the parsing tool srcML [54] to parse the source code and build the AST. It is worth noting that GumTree only provides a fine-grained mapping between AST nodes, so we modified the code of GumTree and combined it with another tool, Understand [55], to extract the precise diffs. In the meantime, we found some bugs in GumTree that lead to incorrect mismatches and reported them to the author. The algorithm of GumTree is inspired by the way developers manually look at changes between files. It traverses the AST pairs and computes the mappings in two successive phases:

1) A greedy top-down algorithm finds isomorphic sub-trees of decreasing height. Mappings are established between the nodes of these isomorphic subtrees; they are called anchor mappings.
2) A bottom-up algorithm matches two nodes (a container mapping) if their descendants (children of the nodes, their children, and so on) include a large number of common anchors. When two nodes match, an optimal algorithm searches for additional mappings (called recovery mappings) among their descendants.

After that, each CP is represented as a list of code diffs:

CP = (st_src, st_dst)_1, ..., (st_src, st_dst)_n

where (st_src, st_dst) represents statements from the source file and the destination file.

Then, we extract data flow dependencies around the code diffs to construct our def-use chains. A def-use chain records the assignment of some value to a variable and contains all variable definitions reaching the vulnerable statement. The reasons why we use data flow dependencies are as follows: 1) the context around the vulnerable statements is valuable for understanding the risky behavior and capturing structural relationships, but it is too heavy to maintain the full class-level context with lots of unrelated code; 2) data flow dependencies provide enough context for the transformation: if one statement needs to be modified, there is a high probability that its definition statements need to be co-changed simultaneously; 3) control flow dependencies often contain branches, which makes them too long to be tokenized. One example is given in Figure 4. Assume that the method "foo" contains one vulnerability; we maintain the method and the vulnerable statement. All global variables are preserved, and all statements that have data dependencies on the vulnerable statement are retained, too. Statements located after the vulnerable statement within the same method are removed.

The definition and use (def-use) dependencies can be extracted from the ASTs as follows. Firstly, we traverse the whole AST and label each variable name; these variable names are distributed over the leaf nodes of the AST. This step is done in the first phase of the modified GumTree algorithm. Then, we traverse upwards from each leaf node to the location where the variable is defined.
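To make the chain construction concrete, the sketch below assumes a hypothetical per-method index that maps each variable to the statement defining it, and simply collects the (transitive) definitions reachable from the vulnerable statement; it illustrates the idea of the def-use chain described above, not the actual GumTree/Understand-based implementation.

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.LinkedHashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    public class DefUseChainBuilder {
        /** A statement, the variables it uses, and the variable it defines (if any). */
        record Stmt(String code, Set<String> uses, String defines) {}

        // Collects all transitive definition statements, then appends the vulnerable statement.
        static List<String> build(Stmt vulnerable, Map<String, Stmt> defOf) {
            LinkedHashSet<String> defs = new LinkedHashSet<>();
            Deque<String> work = new ArrayDeque<>(vulnerable.uses());
            while (!work.isEmpty()) {
                Stmt def = defOf.get(work.pop());
                if (def != null && defs.add(def.code())) {
                    work.addAll(def.uses()); // a definition may itself use other variables
                }
            }
            List<String> chain = new ArrayList<>(defs);
            chain.add(vulnerable.code());
            return chain;
        }
    }

The resulting list of definition statements plus the vulnerable statement is what gets normalized and tokenized in the next step.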

[Figure 3 is a pipeline diagram; only its panel labels are recoverable: preprocessing of bug repair commits and vulnerability fixes (diff matching, diff context, abstraction and normalization), pre-training and fine-tuning of transformer encoder-decoder stacks, and prediction and recovering on vulnerable source code (input sequences, trained model, detecting illegal code structure, refilling literals and variable names, fixed source code as output).]
Fig. 3: Overview of our SeqTrans for automatic vulnerability fixing

Test.java: source
    class Foo {
        int i;
        int k;
        String test;
        public void clear(String test){ test = ""; }
        private String foo(int i, int k) {
            if(i == k) return i-k;
        }
    }

Test.java: buggy body
    int i;
    int k;
    String test;
    private String foo(int i, int k) {
        if(i == k) return i-k;
    }

Fig. 4: One example of the buggy body

With the help of the modified GumTree and Understand, SeqTrans represents each CP as follows:

CP = ((def_1, ..., def_n, st_src), (def_1, ..., def_n, st_dst))_1, ..., ((def_1, ..., def_n, st_src), (def_1, ..., def_n, st_dst))_n

In this paper, we ignore code changes that involve the addition or deletion of entire methods or files.

3.4 Normalization & Tokenization

The training process of an NMT model has a couple of drawbacks. Because NMT models output a probability distribution over words, they can become very slow with a large number of possible words. We need to impose an artificial limit on how many of the most common words we want our model to handle; this limit is the vocabulary size. In order to reduce the vocabulary size, we need to preserve the semantic information of the source code while abstracting the context.

The normalization process is shown in Figure 5. We replace variable names with "var1", ..., "varn"; each literal and string is likewise replaced with "num1", ..., "numn" and "str". The reasons for doing this are: 1) to reduce the vocabulary size and the frequency of specific tokens; 2) to reduce the redundancy of the data and improve its consistency. We maintain a dictionary to store the mappings between the original labels and the substitutes so that they can be refilled after prediction. Through this optimization, we can control the vocabulary size and make the NMT model concentrate on learning common patterns from different code changes.

Test.java: source
    private String foo(int i, int k) {
        if(i == 0) return "Foo!";
        if(k == 1) return 0;
    }

Test.java: normalized source
    private String foo(int var1, int var2) {
        if(var1 == num1) return "str";
        if(var2 == num2) return num1;
    }

Fig. 5: Normalize the source code

Subsequently, we split each abstracted CP into a series of tokens. In this work, we use Byte Pair Encoding (BPE) to tokenize statements [56]. BPE is a subword segmentation algorithm that encodes rare and unknown words as sequences of subword units; it is also called digram coding. The intuition is that various word classes are translatable via smaller units than words, for instance names (via character copying or transliteration), compounds (via compositional translation), and cognates and loanwords (via phonological and morphological transformations). BPE has been widely applied in the Transformer (trained on the standard WMT 2014 English-German dataset) and in the GPT-3 model.

We use the example provided by Wikipedia to illustrate BPE. The original data is "aaabdaaabac"; the algorithm searches for the most frequently occurring byte pair. After the first merge, the data and the replacement table are:

    ZabdZabac
    Z = aa          (1)

We then iterate the above step and place the next most frequently occurring byte pair in the table:

    ZYdZYac
    Y = ab
    Z = aa          (2)

    XdXac
    X = ZY
    Y = ab
    Z = aa          (3)

The algorithm stops when no pair of bytes occurs more than once. To decompress the data, the replacements are performed in reverse order.

It is worth mentioning that the seq2seq model used in previous works suffers severe performance degradation when processing long sequences; for example, Tufano et al. [37] limited the token number to 50-100. By utilizing the transformer model together with BPE, we can better handle long sequences. In our approach, we limit a CP to 1,500 tokens. We discuss the details in the following subsection.
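The merge loop on the toy string above can be replayed with a few lines of Java. This is a plain illustration of the BPE idea, not the subword tokenizer actually used in our toolchain, and ties between equally frequent pairs are broken arbitrarily here.

    import java.util.HashMap;
    import java.util.Map;

    public class BpeDemo {
        public static void main(String[] args) {
            String data = "aaabdaaabac";
            char nextSymbol = 'Z';                      // fresh symbols Z, Y, X, ...
            Map<Character, String> table = new HashMap<>();
            while (true) {
                // count adjacent pairs
                Map<String, Integer> counts = new HashMap<>();
                for (int i = 0; i + 1 < data.length(); i++) {
                    counts.merge(data.substring(i, i + 2), 1, Integer::sum);
                }
                // pick a most frequent pair; stop when no pair occurs more than once
                String best = counts.entrySet().stream()
                        .max(Map.Entry.comparingByValue()).map(Map.Entry::getKey).orElse(null);
                if (best == null || counts.get(best) < 2) break;
                table.put(nextSymbol, best);
                data = data.replace(best, String.valueOf(nextSymbol));
                nextSymbol--;                           // Z -> Y -> X ...
                System.out.println(data + "   " + table);
            }
            // first merge prints ZabdZabac with Z=aa; later merges compress ab and ZY
            // (the order of equally frequent merges may differ from the worked example)
        }
    }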
3.5 Neural Machine Translation Network

In this phase, we train SeqTrans to learn how to transform the vulnerable code and to generate multiple prediction candidates. The training process is divided into two phases: pre-training and fine-tuning.

3.5.1 Pre-training

In the pre-training process, we utilize a generalized domain corpus for bug repairing to perform the first training. Vulnerability fixing can be considered a subset of bug repairing. We believe that by pre-training on generic data, we can learn a large number of generic fixing experiences and features that can be applied to the task of vulnerability fixing. A list of CPs_general is extracted using the approach discussed in Section 3.3. These CPs_general, which contain the vulnerable-version and fixed-version diff contexts, are given to the network; we discuss the network in detail in the following subsection. The pre-training model is trained for 300K steps, and we select the model with the highest accuracy on the validation dataset as the starting model for the subsequent fine-tuning process.

3.5.2 Fine-tuning

After the first training phase, the best performing model is used for fine-tuning. Fine-tuning, also called transfer learning, means that we take the weights of a trained neural network and use them as the initialization for a new model trained on data from a similar domain. Why do we need to fine-tune? The reasons are as follows:

1) Overcoming the small sample size: it is impractical to train a large neural network on a small dataset, and overfitting cannot be avoided. If we still want to use the strong feature extraction ability of large neural networks, we can only rely on fine-tuning already trained models.
2) Low training costs in the later stages: fine-tuning reduces training costs and speeds up training.
3) No need to reinvent the wheel: a model trained with great effort by previous work is, with high probability, stronger than a model built from scratch.

Using this method, we can combine two related tasks: vulnerability fixing and bug repair. One issue, however, is that although fine-tuning is widely used in the natural language (NL) field, where large numbers of pre-trained models are available, there are very few such pre-trained models in the programming language (PL) field. That is why we need to train the generic-domain model ourselves. The model that performs best in the pre-training process is then fine-tuned on the small vulnerability fixing dataset, so that the knowledge learned during bug repair training can be transferred to the vulnerability fixing task.

It is worth noting that, based on work such as Gururangan et al. [57] and the OpenNMT documentation [58], some sequences are translated badly (e.g., with unidiomatic structure or UNKs) by the retrained model while they are translated better by the base model, a phenomenon called "catastrophic forgetting". In order to alleviate catastrophic forgetting, the retraining should use a combination of in-domain and generic data. In this work, we mix part of the general-domain data into the specific-domain data to generate such a combination.
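A minimal sketch of how such a mixed retraining corpus could be assembled is shown below. The shuffling, the seed, and the configurable ratio are assumptions of the sketch rather than our exact procedure; a ratio of 1.0 would correspond to mixing in as many general-domain pairs as there are specific-domain pairs, as in the Group 9 ablation reported later.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Random;

    public class MixedFineTuneCorpus {
        // All in-domain pairs plus a random slice of general-domain pairs, interleaved.
        static List<String> mix(List<String> specificPairs, List<String> generalPairs,
                                double generalRatio, long seed) {
            List<String> sampled = new ArrayList<>(generalPairs);
            Collections.shuffle(sampled, new Random(seed));
            int take = (int) Math.min(sampled.size(), Math.round(specificPairs.size() * generalRatio));
            List<String> corpus = new ArrayList<>(specificPairs);
            corpus.addAll(sampled.subList(0, take));
            Collections.shuffle(corpus, new Random(seed)); // interleave the two domains
            return corpus;
        }
    }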
3.5.3 The Transformer Model

In this work, we choose the transformer model [46] to solve the performance degradation problem of the seq2seq model on long sequences. It has been widely utilized by OpenAI and DeepMind in their language models. Our implementation of the transformer comes from the open-source neural machine translation framework OpenNMT [59], which is designed to be research-friendly for trying out new ideas in translation, summarization, morphology, and many other domains; several companies have proven the code to be production-ready.

Unlike Recurrent Neural Network (RNN) [60] or Long Short-Term Memory (LSTM) [61] models, the transformer relies entirely on the self-attention mechanism to draw global dependencies between input and output data. This model is more parallel and achieves better translation results. The transformer consists of two main components: a set of encoders chained together and a set of decoders chained together. The encoder-decoder structure is widely used in NMT models: the encoder maps an input sequence of symbol representations (x_1, ..., x_n) to an embedding representation z = (z_1, ..., z_n), which contains information about the parts of the input that are relevant to each other. Given z, the decoder then exploits this incorporated contextual information to generate an output sequence (y_1, ..., y_m) of symbols one element at a time. At each step, the model consumes the previously generated symbols as additional input when generating the next [62]. The transformer follows this overall architecture, using stacked self-attention and point-wise, fully connected layers for both the encoder and the decoder. Each encoder and decoder makes use of an attention mechanism to weigh the connections between every input and refers to that information to generate the output [46].

As for the parameter selection, we explored a variety of settings for SeqTrans. The primary parameters were chosen from OpenNMT's recommendations to reproduce the performance reported on the original dataset; most of the major parameters are verified in the ablation study experiments in RQ2. The pre-training model is trained with a batch size of 4,096 for 300K steps, and the fine-tuning model is trained with a batch size of 4,096 for an extra 30K steps. In order to prevent overfitting, we use a dropout of 0.1. In relation to the components shown in RQ2, the primary parameters are as follows:

• Word vector size: 512
• Attention layers: 6
• Size of hidden transformer feed-forward: 2048
• Dropout: 0.1
• Batch size: 4096
• Train steps: 300000
• Learning rate decay: 0.5
• Optimizer: Adam

3.5.4 Encoder

The encoder is composed of a stack of 6 identical layers. Each layer consists of two sub-layers: a multi-head self-attention mechanism and a feed-forward neural network. Residual connections [63] and layer normalization [64] are employed around each sub-layer, so that the output of a sub-layer can be written as:

sub_layer_output = LayerNorm(x + SubLayer(x))

where SubLayer(x) is the function implemented by the sub-layer itself. The self-attention mechanism takes in a set of input encodings from the previous encoder and weighs their relevance to each other to generate a set of output encodings. The feed-forward neural network then further processes each output encoding individually. These output encodings are finally passed to the next encoder as its input. A padding mask is used to ensure that the encoder does not pay any attention to padding tokens. All sub-layers, as well as the embedding layers, produce outputs of dimension d_model = 512.

3.5.5 Decoder

The decoder also contains a stack of 6 identical layers. However, each layer consists of three sub-layers: an attention sub-layer is added to perform multi-head attention over the encodings generated by the encoders. A masking mechanism, consisting of a padding mask and a sequence mask, is used to prevent positions from attending to subsequent positions and to ensure that the predictions for position i can depend only on the known outputs at positions less than i [46]. The other parts are the same as in the encoder.

3.5.6 Attention Mechanism

The purpose of an attention mechanism is to use a set of encodings to incorporate context into a sequence. For each token, the attention mechanism requires a query vector Q of dimension d_k, a key vector K of dimension d_k, and a value vector V of dimension d_v. These vectors are created by multiplying the embedding by three matrices learned during training. Self-attention refers to the situation where the queries, keys, and values are all created using encodings of the same sequence. The output Z of this attention mechanism is:

Z = Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V

The multi-head attention used in the transformer runs several such attention mechanisms in parallel and then combines the resulting encodings.
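For reference, the standard multi-head formulation from [46] that the last sentence summarizes is (the number of heads h and the projection matrices below are part of the generic transformer definition, not values reported in this paper):

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O,  where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

with learned projections W_i^Q, W_i^K of size d_model x d_k, W_i^V of size d_model x d_v, W^O of size (h * d_v) x d_model, and typically d_k = d_v = d_model / h.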
3.6 Prediction and Patch Generation

The original output (or list of outputs) is far from a version that can be successfully compiled: because it contains abstraction and normalization, it may even contain grammatical errors after prediction. Our patch generation therefore consists of two steps: abstraction refill and syntax check. We use an example from the open-source project activemq to illustrate the process of patch inference and generation.

Figure 6 shows a CVE repair record in activemq, which contains three single-line fixes. It is worth noting that in this work we do not address the detection of vulnerabilities. The reason why we assume perfect vulnerability localization is that different works may choose different fault localization algorithms, implementations, and granularities, such as method level or statement level; Liu et al. have pointed out that it is hard to compare different repair techniques because of their different assumptions about fault localization [65]. The vulnerable code can come from a classifier, a vulnerability detection tool, suspicious code reports, etc. Firstly, as mentioned in Figure 3, the input code needs to be abstracted and normalized. We decompose it into sequences following a process similar to the one depicted in Figure 7, where every abstracted variable is marked in blue, every constant in yellow, and every literal in green. Each sequence maintains a dictionary for future recovery, and the location of the sequence is also recorded for subsequent backfill. Then, these sequences are fed into the transformer model, and beam search [37] is used to generate multiple predictions for the same vulnerable line. The output of the network is also an abstracted sequence, as in Figure 7; it contains the predicted statement and the context around it, but all we need is the predicted statement, on which we perform backfill operations. Thirdly, when a prediction is selected, we first apply a syntax check and then backfill all the abstractions it contains.

Fig. 6: CVE-2015-5254, activemq, 73a0caf758f9e4916783a205c7e422b4db27905c

[Figure 7 is a multi-panel figure; only its textual content is recoverable: the vulnerable activemq source and its abstracted def-use sequence, the beam search prediction candidates (e.g., "return var1 . length == num1 && var1 [ num2 ] . equals ( "liter1" )", some of which contain syntax errors such as unbalanced brackets), the abstraction dictionary mapping placeholders back to the originals (var1 -> FALLBACK_CLASS_LOADER, var2 -> serializablePackages, num1 -> 1, num2 -> 0, liter1 -> *), and the recovered concrete statement "return serializablePackages.length == 1 && serializablePackages[0].equals("*");".]
Fig. 7: CVE-2015-5254, activemq, 73a0caf758f9e4916783a205c7e422b4db27905c
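A compact sketch of the bookkeeping illustrated in Figure 7 is given below: identifiers, numbers, and string literals are replaced by placeholder tokens while a per-sequence dictionary remembers the originals for later backfill. The token classification here is deliberately simplistic and assumed; SeqTrans derives this information from the parsed AST.

    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    public class Normalizer {
        static class Result {
            final List<String> tokens = new ArrayList<>();
            final Map<String, String> dict = new LinkedHashMap<>(); // placeholder -> original, kept for refill
        }

        static Result normalize(List<String> tokens, Set<String> identifiers) {
            Result r = new Result();
            Map<String, String> seen = new LinkedHashMap<>();       // original -> placeholder
            int vars = 0, nums = 0, lits = 0;
            for (String tok : tokens) {
                String out = tok;
                if (seen.containsKey(tok)) {
                    out = seen.get(tok);
                } else if (tok.matches("\\d+")) {
                    out = "num" + (++nums);
                } else if (tok.startsWith("\"")) {
                    out = "liter" + (++lits);
                } else if (identifiers.contains(tok)) {
                    out = "var" + (++vars);
                }
                if (!out.equals(tok)) {
                    seen.put(tok, out);
                    r.dict.put(out, tok);
                }
                r.tokens.add(out);
            }
            return r;
        }
    }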

3.6.1 Beam Search

In many cases, developers have certain domain-specific knowledge, so we can generate a list of prediction results and let them pick the most suitable one. Instead of greedily choosing the most likely next step as the sequence is constructed, beam search [66], [67] expands all possible next steps and keeps the k most likely, where k is a user-specified parameter that controls the number of beams, or parallel searches, through the sequence of probabilities. Beam search maintains the n best sequences up to the configured beam size.

As depicted in Figure 7, each vulnerable statement generates five prediction candidates. Usually, the highest-ranked prediction is chosen and used. In some cases, the prediction results contain syntax errors; we use syntax checking tools to detect these errors, as discussed in detail in the following subsections. The k candidates are provided as suggestions to developers, who select the best result.

3.6.2 Abstraction Refill

As shown in Figure 7, we maintain a dictionary that stores the information necessary for restoration before abstraction. After prediction, the output is concretized and all abstractions contained in the dictionary are refilled; the code is automatically indented in this process. It should be noted that all comments are deleted and are not refilled. One shortcoming of SeqTrans is that the mappings included in the dictionary come from the source files: if new or unseen variable names, constants, or literals are introduced in the fixed code, it is hard for SeqTrans to understand and infer them. All we can do is resolve the corresponding abstraction according to the dictionary; if a predicted abstraction cannot find a mapping in the dictionary, we copy the original abstraction content to the current location.
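The backfill step itself reduces to a dictionary lookup with a copy-through fallback, as sketched below (the names are illustrative); with the dictionary of Figure 7, the placeholders of an accepted candidate are mapped back to the original identifiers and literals before the patch is emitted.

    import java.util.Map;

    public class AbstractionRefill {
        // Replaces placeholders (var1, num2, liter1, ...) with their originals from the per-sequence dictionary.
        // Placeholders without a mapping are copied through unchanged, mirroring the fallback described above.
        static String refill(String predictedTokens, Map<String, String> dict) {
            StringBuilder out = new StringBuilder();
            for (String token : predictedTokens.split("\\s+")) {
                out.append(dict.getOrDefault(token, token)).append(' ');
            }
            return out.toString().trim();
        }
    }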
3.6.3 Syntax Check

We combine beam search with a grammar checking tool to analyze the syntax and grammatical errors contained in the predictions. The static analysis tool FindBugs [51] (version 3.0.1) is exploited to identify different types of potential errors in Java programs. The goal is to prioritize and filter out candidates that contain obvious syntax errors before providing suggestions for changes or generating patches. This tool could be replaced by a compiler or parser. In SeqTrans, if a candidate prediction in the top 5 cannot pass the FindBugs check, we walk down the candidate list provided by beam search and test the next candidate until one passes the check, and we finally output 5 candidates. It should be noted that FindBugs may trigger warnings even on the pre-commit version, so we only check the warning messages that are added after the prediction. For example, in Figure 7 the second and the third candidates contain a syntax error and cannot pass the FindBugs check; we remove these two candidates and push the sixth and seventh candidates for checking, until we obtain five candidates. In other words, we use FindBugs to check the candidates and ensure that the five candidates we recommend introduce as few new bugs as possible.

Finally, we generate the newly patched file and provide it to developers. We give developers flexible choices on whether to enable this feature or to judge by their own domain-specific knowledge; developers can also freely choose the predictions they need based on their own domain experience and on our five recommended candidates. In addition, we believe that with the continuous improvement of model training, these grammatical errors will become rarer and rarer, so that eventually we will no longer need to rely on third-party grammar checking tools.

4 EMPIRICAL STUDY & EVALUATION

In this section, we conduct our experiment on a public dataset [40] of vulnerability fixes and evaluate our method SeqTrans by investigating three research questions.

4.1 Research Questions

We explore the following research questions:

• RQ1: How much effectiveness can SeqTrans provide for vulnerable code prediction? RQ1 aims to show that NMT is a feasible approach to learn code transformations and outperforms other state-of-the-art techniques.
• RQ2: What are the characteristics of the ML model used that can impact the performance of SeqTrans? RQ2 evaluates the impact of the main components of SeqTrans on performance, such as the data structure and the transformer model.
• RQ3: How does SeqTrans perform in predicting specific types of CWEs? RQ3 explores in depth the prediction results and the source code of the dataset to observe whether our method performs inconsistently when predicting different kinds of code transformations.

Fig. 8: CWE distribution of Ponta's dataset

4.2 Experimental Design

In this section, we discuss our experimental design for RQ1, RQ2, and RQ3. All experiments were run on a server with an Intel Xeon E5 processor, four Nvidia 3090 GPUs, and 1 TB of RAM.

Dataset: Our evaluation is based on two public datasets: Tufano's [39] (1) and Ponta's [40] (2). Tufano's dataset contains 780,000 bug-fix commits and nearly 2 million sentence pairs of historical bug-fix records. For each bug-fixing commit, they extracted the source code before and after the bug fix using the GitHub Compare API [68]. Each bug-fixing record contains the buggy (pre-commit) and the fixed (post-commit) code. They discarded commits related to non-Java files, as well as files created in the bug-fixing commit, since there would be no buggy version to learn from. Moreover, they discarded commits impacting more than five Java files, since the aim is to learn focused bug fixes that are not spread across the system.

Ponta's dataset was obtained both from the National Vulnerability Database (NVD) and from project-specific Web resources that the authors monitor on a continuous basis. From that data, they extracted a dataset that maps 624 publicly disclosed vulnerabilities affecting 205 distinct open-source Java projects, used in SAP products or internal tools, onto the 1,282 commits that fix them. The CVEs range from 2008 through 2019. Out of the 624 vulnerabilities, 29 do not have a CVE identifier at all, and 46, which do have a CVE identifier assigned by a numbering authority, are not yet available in the NVD. These vulnerabilities have been removed from the dataset; the final number of non-repetitive CVEs is 549, with 1,068 related commits. In total, the processed Ponta's dataset contains 1,068 distinct vulnerability-fixing commits with 5K diff contexts across 205 projects, classified into 77 CWEs from 2008 to 2019. Figure 8 shows the CWE distribution in descending order of frequency, with the yellow cumulative line on the secondary axis identifying the percentage of the total number. The IDs and type explanations of all CWEs in Ponta's dataset are listed in the appendix.

The datasets are released under an open-source license, together with supporting scripts that allow researchers to automatically retrieve the actual content of the commits from the corresponding repositories and to augment the attributes available for each instance. These scripts also allow complementing the dataset with additional instances that are not security fixes (useful, for example, in machine learning applications).

1. https://sites.google.com/view/learning-fixes/data
2. https://github.com/SAP/vulnerability-assessment-kb

Validation: We use two methods (two test sets) to validate the performance of the experiments.

The first one, Tcross, is 10-fold cross-validation. Cross-validation evaluates predictive models by partitioning the original sample into a training set to train the model and a test set to evaluate it. In 10-fold cross-validation, the original sample is randomly partitioned into 10 equal-size subsamples. Of the 10 subsamples, a single subsample is retained as the validation data for testing the model, and the remaining 9 subsamples are used as training data. The process is then repeated 10 times (the folds), with each of the 10 subsamples used exactly once as the validation data. If the predicted statement equals the statement in the test set, it counts as a correct prediction. The 10 results from the folds are then averaged to produce a single estimate. The advantage of this method is that all observations are used for both training and validation, and each observation is used for validation exactly once.

The second one, Tcwe, is based on the chronological relationship of the CVE repair records, to simulate the actual development process of using historical vulnerability fix records to fix subsequent suspicious code. We sorted the CVE samples in Ponta's dataset by time and used the CVE fix records from 2008 to 2017 as the training set (708 CPs); the CVE fix records from 2018 and 2019 were used as the validation (136 CPs) and test sets (150 CPs). We do not assess the compilability of the patches, because that would require downloading and recompiling a snapshot of each Git project, i.e., thousands of repositories. Therefore, if one CP has been fully and correctly predicted, we regard it as one successful fix. The distribution of the 42 CWEs in the test set is shown in Figure 9.

Fig. 9: CWE distribution of the test set

Fig. 10: Label distribution for each dataset ((a) Tufano's dataset, (b) Ponta's dataset)

4.2.1 RQ1 Setup:

The experimental part of RQ1 is divided into two components.

Firstly, we show and analyze the joint-training and independent-training results of the two datasets. Since SeqTrans uses two datasets and a fine-tuning approach to overcome the problem of small samples, independent and joint analyses for both datasets are necessary. For the general-domain bug repair dataset, we train on Gtrain and validate on Gval. Gval is separated from the bug repair dataset and is not contained in Gtrain. Likewise, we separate the specific-domain vulnerability dataset into Strain, Sval, and Stest. Stest is used to validate the performance of both joint training and independent training. Sequences in each set are mutually exclusive. This experiment is designed to verify whether fine-tuning can help small samples overcome the dataset size problem, learn from general-domain tasks, and transfer that knowledge to specific-domain tasks.

Secondly, we compare SeqTrans with state-of-the-art techniques such as Tufano et al. [37], [69] and SequenceR [38]. Tufano et al. investigated the feasibility of using neural machine translation to learn code changes in the wild; the disadvantage of the method is that only sentences with fewer than 100 tokens are analyzed. SequenceR presents an end-to-end approach to program repair based on sequence-to-sequence learning; it utilizes the copy mechanism to overcome the unlimited-vocabulary problem and, to the best of our knowledge, reports the best result on such a task. However, its abstract data structure retains too much useless context, and it does not use normalization either. We have also added a model that uses the same data structure as ours but with the seq2seq model. The seq2seq model is an RNN encoder-decoder model widely used in the NMT domain; previous works such as SequenceR [38] and Tufano et al. [37] are also based on it. We calculate the prediction accuracy for each technique using 10-fold cross-validation.
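For clarity, the Tcross fold construction amounts to the following index partitioning; the shuffle seed and the representation of samples as indices are assumptions of this sketch.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Random;

    public class TenFoldSplit {
        // Partitions sample indices into 10 folds; fold f serves once as the test set, the rest as training data.
        static List<List<Integer>> folds(int sampleCount, long seed) {
            List<Integer> idx = new ArrayList<>();
            for (int i = 0; i < sampleCount; i++) idx.add(i);
            Collections.shuffle(idx, new Random(seed));
            List<List<Integer>> folds = new ArrayList<>();
            for (int f = 0; f < 10; f++) folds.add(new ArrayList<>());
            for (int i = 0; i < idx.size(); i++) folds.get(i % 10).add(idx.get(i));
            return folds;
        }
    }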

We then divide the number of correct predictions by the total number to obtain the accuracy.

Fig. 11: Token distribution for each dataset ((a) Tufano's dataset, (b) Ponta's dataset; x-axis: tokens, y-axis: density)

Figure 10 shows the label distribution of each dataset; the frequency distributions of labels in the two datasets are very dissimilar. Figure 11 shows the token distribution of the abstracted vulnerable context in each dataset. Note that token lengths greater than 2,000 have been ignored in Tufano's dataset and token lengths greater than 800 have been ignored in Ponta's dataset. The majority of tokens in Tufano's dataset fall between 0 and 1,500, while the majority of tokens in Ponta's dataset fall between 0 and 400.

4.2.2 RQ2 Setup:

In this part, we discuss the impacts of the main factors that affect the performance of SeqTrans. The process is as follows: first, we select a list of parameters that may affect the performance of our model; then we change one parameter at a time and repeat the experiment on the same dataset. For each parameter, we run cross-validation 10 times and take the mean value as the final precision. The final parameter selection of SeqTrans produces the highest acceptance rates among the alternative configurations and data formats we tested.

4.2.3 RQ3 Setup:

In this part, we discuss the observations made when looking deep inside the prediction results. We only manually analyzed the prediction results generated by SeqTrans; other models are not considered. We calculate the prediction accuracy for each CWE and each category of code transformation. We look deep inside some well-predicted CWEs to explore why SeqTrans performs better on them, and we also analyze the reasons for CWEs with very poor prediction performance.

4.3 Experimental Results

4.3.1 RQ1: How much effectiveness can SeqTrans provide for vulnerable code prediction?

In RQ1, our goal is to compare the performance of SeqTrans with other techniques on the task of vulnerability fixing. As mentioned before, RQ1 is divided into two components. Firstly, we analyze the joint-training and independent-training results of the two datasets.
Table 1 shows the prediction accuracy of models trained only on the general-domain dataset (only Tufano's dataset), only on the specific-domain dataset (only Ponta's dataset), or trained jointly (the fine-tuning strategy). The first column is the training approach of the three models. The second column is the beam search size; for example, with Beam=10 we generate 10 prediction candidates for each vulnerable sequence, and if one of these ten candidates contains the correct prediction, the prediction accuracy for that sequence is 1, otherwise 0. The third column is the total prediction accuracy. Recall that we use 10-fold cross-validation to calculate the accuracy of the model: if the predicted statement equals the statement in the test set, it is a correct prediction.

From Table 1, we can observe that SeqTrans with the fine-tuning strategy achieves the best performance: 14.1% when Beam=1 and 23.3% when Beam=50. Next comes training only on the specific-domain dataset, with 11.3% when Beam=1 and 22.1% when Beam=50. The worst prediction performance comes from using only the general-domain dataset, which achieves just 4.7% when Beam=1 and 6.9% when Beam=50. Detailed beam search results are shown in Figure 12 as the beam size increases from 1 to 50; the x-axis represents the beam size and the y-axis the prediction accuracy.

The results show that transferring knowledge from the general domain of bug repairing to the specific domain of vulnerability fixing via fine-tuning indeed improves the prediction performance of SeqTrans and achieves better results than training on the two datasets separately. Fine-tuning helps to alleviate and overcome the small-sample-size problem.
In the following this buggy context keeps a lot of statements that have no experiments, the fine-tuning strategy will become one of the relationship with the buggy line. The whole data structure default configurations in SeqTrans. is too long and contains a large number of declaration Secondly, we will compare SeqTrans with some state- statements that are not related to the buggy line, which of-the-art techniques. Table 2 shows the accuracy results of performs not well in our public vulnerable dataset. Another JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, MAY 2021 12
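The beam-based counting used above can be stated compactly as an evaluation routine: a vulnerable statement counts as fixed if any of the k candidates returned by beam search is identical to the developer's statement. The following is a minimal sketch of that bookkeeping in Python, not the released SeqTrans code; the whitespace canonicalization is our own simplifying assumption.

def beam_accuracy(candidates_per_sample, references):
    """Count a sample as correct if ANY of its beam candidates exactly
    matches the ground-truth statement (the Beam=k counting above)."""
    def canon(stmt):
        # collapse whitespace so purely lexical spacing differences do not matter
        return " ".join(stmt.split())

    correct = 0
    for candidates, reference in zip(candidates_per_sample, references):
        if any(canon(c) == canon(reference) for c in candidates):
            correct += 1
    return correct, len(references), correct / len(references)

# Example: with Beam=10 each vulnerable sequence contributes up to 10 candidates.
# hits, total, acc = beam_accuracy(model_candidates, test_statements)

With candidates produced at a given beam size, this reproduces the "one hit among k candidates" counting described above.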

TABLE 1: Prediction results of the three training strategies

  Approach                               Beam   Accuracy
  Only on general domain Gtrain             1   100/2130 (4.7%)
                                            10   121/2130 (5.7%)
                                            50   146/2130 (6.9%)
  Only on specific domain Strain             1   242/2130 (11.3%)
                                            10   338/2130 (15.5%)
                                            50   473/2130 (22.1%)
  Joint training on Gtrain and Strain        1   301/2130 (14.1%)
                                            10   411/2130 (19.3%)
                                            50   497/2130 (23.3%)

Fig. 12: Performance of the three training strategies (prediction accuracy versus beam size for the general domain model, the specific domain model, and SeqTrans).

TABLE 2: Performance of different techniques

  Approach        Beam   Tcross              Tcwe
  SeqTrans           1   301/2130 (14.1%)    35/150 (23.3%)
                    10   411/2130 (19.3%)    38/150 (25.3%)
                    50   497/2130 (23.3%)    38/150 (25.3%)
  SequenceR          1   252/3661 (6.9%)     24/150 (16.0%)
                    10   418/3661 (11.4%)    26/150 (17.3%)
                    50   725/3661 (19.8%)    27/150 (18.0%)
  Seq2seq            1   121/2130 (7.5%)     20/150 (13.3%)
                    10   242/2130 (11.3%)    23/150 (15.3%)
                    50   390/2130 (18.3%)    23/150 (15.3%)
  Tufano et al.      1   37/883 (4.2%)       5/150 (3.3%)
                    10   59/883 (6.7%)       7/150 (4.6%)
                    50   63/883 (7.1%)       7/150 (4.6%)

Fig. 13: Performance of different techniques (prediction accuracy versus beam size on Tcross for SeqTrans, SequenceR, Seq2seq, and Tufano et al.).
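The joint-training rows of Table 1 correspond to a two-stage schedule: first train a general model on the bug-fix corpus Gtrain, then continue training from that checkpoint on the vulnerability corpus Strain. The sketch below illustrates such a schedule using OpenNMT-py's train_from option; the exact flag names differ between OpenNMT-py versions, and all paths, checkpoint names, and step counts are placeholders rather than the paper's actual artifacts.

import subprocess

def run(cmd):
    """Run one OpenNMT-py training stage and fail loudly on errors."""
    print(">>", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Stage 1: general-domain model on the bug-fix corpus (Gtrain).
run([
    "onmt_train",
    "-data", "data/general",            # preprocessed Gtrain shards (placeholder)
    "-save_model", "models/general",
    "-train_steps", "300000",
])

# Stage 2: fine-tune the general checkpoint on the vulnerability corpus (Strain).
run([
    "onmt_train",
    "-data", "data/vuln",               # preprocessed Strain shards (placeholder)
    "-save_model", "models/seqtrans",
    "-train_from", "models/general_step_300000.pt",  # continue from stage 1
    "-train_steps", "330000",           # i.e. roughly 30K additional fine-tuning steps
])

Because train_steps is an absolute counter in OpenNMT-py, the second stage above adds about 30K fine-tuning steps on top of the general model, which matches the training-step discussion in RQ2 below.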


Another disadvantage is that SequenceR only supports single-line prediction, whereas vulnerability fixes often involve both line deletion and addition.

In SeqTrans, we only maintain the data dependencies before the vulnerable statement. Meanwhile, we normalize the data and replace variable names with "var1, var2, ..., vark"; literal and numerical values are also replaced by constants and recorded in a dictionary for later recovery. The poor performance of Tufano et al. may be due to the small number of data samples: we strictly follow their method and only select sequences with fewer than 100 tokens. On the other hand, the fine-tuning method we use to learn from the general domain yields a performance improvement. Overall, SeqTrans leverages def-use chains and the fine-tuning strategy to maintain data dependencies and overcome the small data size issue, which helps the NMT model reach higher accuracy.

Answer to RQ1: In summary, NMT models are able to learn meaningful code changes from historical code repair records and generate predicted code like a developer. Our approach SeqTrans, based on the transformer model, outperforms other NMT models on the task of vulnerability fixing, and even outperforms the state-of-the-art approach SequenceR on our public vulnerability fix dataset.

4.3.2 RQ2: What are the characteristics of the ML model used that can impact the performance of SeqTrans?

In RQ2, we discuss the data formats and the configuration exploration process that led to the default SeqTrans model. Table 3 and Figure 14 show an ablation study for SeqTrans: Table 3 compares the prediction result of our default SeqTrans against the results of single changes to the model, which we explain one by one. These ablation results will help future researchers understand which configurations are most likely to improve their own models. Due to the random nature of the learning process, we use 10-fold cross-validation on Tcross to train each control group 10 times and take the mean value as the final result.

Fig. 14: Factor analysis with selected parameters — (a) precision under different layer configurations, (b) prediction results under different numbers of training steps.

TABLE 3: Factor impact analysis with selected parameters

  Group   Description                        Precision   Impact
  -       Default SeqTrans model             23.3%       -
  1       Word Size (256 vs 512)             22.4%       -4%
          Word Size (512 vs 1024)            22.1%       -5%
  2       Training Steps (30K vs 100K)       23.5%       +1%
  3       Layers (5 vs 6)                    21.9%       -6%
          Layers (6 vs 7)                    22.4%       -4%
  4       Batch Size (2048 vs 4096)          22.6%       -3%
  5       Hidden State Size (256 vs 512)     22.8%       -2%
  6       Without Def-use Chains             20.9%       -10%
  7       Without Code Normalization         21.9%       -6%
  8       Without BPE                        23.3%       0%
  9       Without Mixed Fine-tuning          22.1%       -5%
  10      Without Fine-tuning Strategy       20.2%       -13%
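Groups 6 and 7 of Table 3 ablate the two preprocessing steps described above: keeping only statements on which the vulnerable statement has a data dependency, and normalizing identifiers and literals with a dictionary for later recovery. The sketch below is a deliberately simplified illustration of both steps over already-split Java statements; the one-token def/use approximation and the regex tokenizer are our own assumptions and stand in for the AST-based analysis actually used.

import re

TOKEN = re.compile(r"[A-Za-z_][A-Za-z_0-9]*|\d+|\S")
JAVA_KEYWORDS = {"if", "else", "for", "while", "return", "new", "int", "long",
                 "boolean", "void", "public", "private", "static", "final",
                 "throw", "throws", "try", "catch", "this", "super",
                 "true", "false", "null"}

def defs_and_uses(tokens):
    """Crude approximation: the identifier left of '=' counts as a definition,
    every other non-keyword identifier counts as a use."""
    idents = [t for t in tokens
              if (t[0].isalpha() or t[0] == "_") and t not in JAVA_KEYWORDS]
    defs = set(idents[:1]) if "=" in tokens else set()
    return defs, set(idents) - defs

def slice_by_data_dependency(statements, vuln_index):
    """Keep the vulnerable statement plus the earlier statements that define
    a variable it (transitively) uses -- the def-use chain context."""
    tokenized = [TOKEN.findall(s) for s in statements]
    needed = defs_and_uses(tokenized[vuln_index])[1]
    kept = {vuln_index}
    for i in range(vuln_index - 1, -1, -1):
        defs, uses = defs_and_uses(tokenized[i])
        if defs & needed:
            kept.add(i)
            needed |= uses
    return [statements[i] for i in sorted(kept)]

def normalize(statement, var_map, lit_map):
    """Rename identifiers to var1..varK and numeric literals to num1..numN,
    remembering the originals so a predicted fix can be mapped back."""
    out = []
    for tok in TOKEN.findall(statement):
        if tok.isdigit():
            lit_map.setdefault(tok, "num%d" % (len(lit_map) + 1))
            out.append(lit_map[tok])
        elif (tok[0].isalpha() or tok[0] == "_") and tok not in JAVA_KEYWORDS:
            var_map.setdefault(tok, "var%d" % (len(var_map) + 1))
            out.append(var_map[tok])
        else:
            out.append(tok)
    return " ".join(out)

The var_map and lit_map dictionaries are kept per method so that a normalized prediction such as "return var1 || ! var2 ( ) . var3 ( )" can be rewritten back into concrete source code.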
The first row of Table 3 gives the performance of the default SeqTrans model as a reference. Group 1, in the second and third rows, explores the effect of word size on the performance of our model. Both a smaller and a larger word size perform worse than the configuration we chose. We think the reason is that smaller word sizes over-compress the features and lose some valid information, while larger word sizes may not be appropriate for the size of our dataset.

In Group 2 and Figure 14b we examine whether more training steps significantly improve performance. The results indicate that the difference between 30K and 100K training steps is very small: the growth in prediction performance begins to converge after 30K training steps, and we do not consider the large time overhead of 100K steps worthwhile. It is worth noting that the training steps here refer to the steps used when fine-tuning on the vulnerability fixing dataset in the specific domain; the general domain model is kept the same.

Group 3, in the fifth and sixth rows and Figure 14a, tests the number of model layers. We tried different settings, and the conclusion is that 6 layers is a suitable choice. Note that the encoder and decoder parts of the transformer model need the same number of layers, so we use the same number on both sides. Prediction performance rises with the number of layers until it reaches 6, and 7 layers is not better than 6, so we settle on 6 layers.
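Putting Groups 1-5 together, the default configuration implied by Table 3 can be summarized in a single options block. The option names below follow OpenNMT-py conventions and are assumptions on our part; only the values (512-dimensional embeddings and hidden states, 6 encoder and decoder layers, batch size 4096, about 30K fine-tuning steps) are taken from the ablation.

# A compact view of the default SeqTrans configuration implied by Table 3.
# Option names follow OpenNMT-py / transformer conventions (assumed);
# the values are the ones the ablation converges on.
DEFAULT_CONFIG = {
    "encoder_type": "transformer",
    "decoder_type": "transformer",
    "enc_layers": 6,            # Group 3: 6 layers on both encoder and decoder
    "dec_layers": 6,
    "word_vec_size": 512,       # Group 1: 256 and 1024 both perform worse
    "rnn_size": 512,            # Group 5: hidden state size 512
    "batch_size": 4096,         # Group 4: halving to 2048 costs about 3%
    "train_steps": 30000,       # Group 2: gains beyond 30K steps are marginal
    "position_encoding": True,
}

def as_cli_args(config):
    """Flatten the dictionary into OpenNMT-style command-line flags."""
    args = []
    for key, value in config.items():
        if isinstance(value, bool):
            if value:
                args.append(f"-{key}")
        else:
            args.extend([f"-{key}", str(value)])
    return args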
Group 4 and Group 5 test different batch sizes and hidden state sizes. The experimental results show a similar conclusion: decreasing either size leads to a decrease in performance.

In Groups 6, 7 and 8, we examine the impact of the data structure and preprocessing on performance. The results show a 10% improvement in model performance when comparing our data structure to the original single vulnerable line, and normalization in data preprocessing leads to a 6% increase in performance. An interesting phenomenon is that whether BPE is enabled or not has only a minimal performance impact. The reason, we think, is that the main purpose of BPE is to compress the data and handle out-of-vocabulary words, and our vocabulary is already able to cover the majority of words. However, when preparing the first general model, not using BPE to compress the sequences would cause a huge vocabulary and overflow the GPU memory.

Group 9 explores whether mixing some general domain training data into the small specific domain dataset can alleviate the problem of catastrophic forgetting. We mixed in the same number of randomly selected Gtrain training samples as there are in Strain. The result shows that training without this mixing indeed degrades performance.

The last group, Group 10, shows the performance change before and after using the fine-tuning strategy explained in the previous experiments. SeqTrans gains a 13% performance improvement, which indicates that the fine-tuning strategy is very beneficial for training on small-scale data and helps us migrate knowledge from similar domains.

Answer to RQ2: The ablation study shows which parameter selections give SeqTrans the highest accuracy among the configurations we tested. These results can help future researchers understand which configurations are most likely to improve their own models.

4.3.3 RQ3: How does SeqTrans perform in predicting specific types of CWEs?

We now look at which types of vulnerability fixes our model can identify well and generate predictions for. The purpose of this experiment is to verify whether SeqTrans performs better for specific types of CWE, for example a CWE with a high number of repair cases in the dataset or a CWE distributed over the dataset with a balanced time series. Table 4 shows the prediction accuracy for each CWE on Tcross and Tcwe when Beam=50. The Common Weakness Enumeration (CWE) is a category system for software weaknesses and vulnerabilities; every CWE contains a list of CVEs. Because there are too many kinds of CWE, we only list the top 20 with the highest accuracy, which contain the vast majority of correct predictions. Note that the totals may be higher than the results in Table 2: a CVE may belong to multiple CWEs and is then counted once for each of them.

Next we explain Table 4. On Tcross, the highest accuracy is reached by CWE-444 with 60%; if only the number of correct predictions is considered, the highest is CWE-502 with 311 correct predictions. On Tcwe, the highest accuracy is reached by CWE-306, with a surprising 100%; if only the number of correct predictions is considered, the highest is CWE-22 with 10 correct predictions. Detailed results are given in Table 4: CWE No. indicates the CWE number, the first Accu column gives the number of correct predictions over the total number of predictions, and the second gives the prediction accuracy. Most of the top CWE predictions in the two test sets are the same; CWEs with large differences are labeled. Tcwe contains fewer CWE categories than Tcross, which may have contributed to the greater concentration of top CWEs. In the following, we compare the differences between these two test sets and analyze in detail why the model performs well on certain specific CWEs.

TABLE 4: Prediction results per CWE in the two test sets

  Tcross                                      Tcwe
  CWE No.      Correct/Total   Accuracy       CWE No.      Correct/Total   Accuracy
  CWE-444      3/5             0.60           CWE-306      1/1             1.00
  CWE-287      45/84           0.54           CWE-287      2/3             0.67
  CWE-306      1/2             0.50           CWE-20       8/14            0.57
  CWE-362      5/11            0.45           CWE-522      2/4             0.50
  CWE-22       13/30           0.43           CWE-22       10/21           0.48
  CWE-361      3/7             0.43           CWE-295      1/3             0.33
  CWE-863      7/17            0.41           CWE-269      1/3             0.33
  CWE-284      3/8             0.38           CWE-863      3/10            0.30
  CWE-522      24/67           0.36           CWE-502      5/12            0.42
  CWE-20       31/97           0.32           CWE-611      3/13            0.23
  CWE-502      311/1013        0.31           CWE-200      2/11            0.18
  CWE-78       7/23            0.30           CWE-noinfo   2/13            0.15
  CWE-74       4/14            0.29           CWE-78       0/5             0
  CWE-310      41/147          0.28           CWE-35       0/3             0
  CWE-269      8/29            0.28           CWE-601      0/2             0
  CWE-264      14/60           0.23           CWE-74       0/2             0
  CWE-611      11/52           0.21           CWE-362      0/1             0
  CWE-noinfo   7/54            0.13           CWE-521      0/1             0
  CWE-200      3/28            0.11           CWE-50       0/1             0
  CWE-19       5/56            0.09           CWE-89       0/1             0
  All          563/2130        26.4%          All          40/150          26.7%
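The Accu columns of Table 4 are plain tallies over the test predictions, with the caveat stated above that a CVE belonging to several CWE categories is counted once per category. A minimal sketch of that bookkeeping follows; the record fields cwe_ids and correct are hypothetical names used only for illustration.

from collections import defaultdict

def per_cwe_accuracy(predictions):
    """predictions: iterable of records with
         record["cwe_ids"] -> list of CWE labels the CVE belongs to
         record["correct"] -> True if the predicted statement matched exactly
       Returns {cwe: (hits, total, accuracy)}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for record in predictions:
        for cwe in record["cwe_ids"]:        # a CVE is counted once per CWE
            totals[cwe] += 1
            if record["correct"]:
                hits[cwe] += 1
    return {cwe: (hits[cwe], totals[cwe], hits[cwe] / totals[cwe])
            for cwe in totals}

# Example record: {"cwe_ids": ["CWE-502"], "correct": True}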
In the following, we discuss some of the CWEs in Table 4 that perform differently in the two test sets or even reach 0 accuracy in one of them. First of all, the CWEs marked in blue are absent on the right side simply because they are not included in Tcwe; these are not the focus of our attention.

Case Study: CWE-306. CWE-306 means "Missing Authentication for Critical Function": the software does not perform any authentication for functionality that requires a provable user identity or consumes a significant amount of resources. It is special because it has a very small sample but still yields a correct prediction. The commit contains two code changes, shown in Figure 15. The first (second line) adds the annotation "@SuppressWarnings("resource")" before the method declaration. The second modifies two parameters in the put method; in addition, a new parameter is inserted into the second position.

> public static JMXConnectorServer createJMXServer (int port, boolean local) throws IOException
= @SuppressWarnings ( "resource" ) public static JMXConnectorServer createJMXServer (int port, boolean local) throws IOException
< @SuppressWarnings ( "resource" ) public static JMXConnectorServer createJMXServer (int port, boolean local) throws IOException

> env.put(RMIExporter.EXPORTER_ATTRIBUTE, new Exporter())
= env.put(jmx.remote.x.daemon, true)
< env.put(jmx.remote.x.daemon, true)

Fig. 15: Case: right prediction of CWE-306

These two changes have been correctly captured and predicted by SeqTrans. The other two incorrect predictions for this CWE involve variable definition changes, for which the model does not produce the correct prediction.

Case Study: CWE-362. CWE-362 means "Concurrent Execution using Shared Resource with Improper Synchronization": the program contains a code sequence that can run concurrently with other code, and the code sequence requires temporary, exclusive access to a shared resource, but a timing window exists in which the shared resource can be modified by another code sequence that is operating concurrently. The related fixes contain condition operator changes and parallelism-related modifications.

> private boolean closed = false
= private volatile boolean closed = false
< private static boolean closed = false

> return !getSocket().isOpen()
= return closed || !getSocket().isOpen()
< return closed || !getSocket().isOpen()

Fig. 16: Case: wrong prediction of CWE-362

In Figure 16, the developers added one keyword and changed the return condition. The condition modification of the second statement is correctly predicted by both models, but the addition of the volatile keyword is not successfully predicted by Tcwe's model. We think the reason is that Tcross's model has learned from other records about adding the volatile keyword.

Case Study: CWE-502. CWE-502 means "Deserialization of Untrusted Data": the application deserializes untrusted data without sufficiently verifying that the resulting data will be valid. CWE-502-related code transformations account for half of the entire training set and contain large numbers of repetitive code transformations, such as deleting a throw exception and adding a return statement, or changing parameter orders. We list some typical code changes that are well captured and handled by SeqTrans.

> throw data.instantiationException(_valueClass, ClassUtil.getRootCause(cause))
= return data.handleInstantiationProblem(_valueClass, root, ClassUtil.getRootCause(cause))
< return data.handleInstantiationProblem(_valueClass, root, ClassUtil.getRootCause(cause))

Fig. 17: Case: right prediction of CWE-502

In Figure 17, the developers delete the throw keyword and add a return keyword to transfer the instantiation problem; in addition, a new parameter is inserted into the second position. This code transformation is well captured by SeqTrans.

> if (type.isAssignableFrom(raw))
= if (raw.getParameterCount( ) == 1)
< if (raw.getParameterCount( ) == 1)

Fig. 18: Case: right prediction of CWE-502

In Figure 18, the developers first change the target of the method call, then replace the method call "isAssignableFrom" with "getParameterCount", and finally add the conditional expression "== 1". This change consists of three single code transformations but is also well captured by SeqTrans. In general, SeqTrans performs stably and outstandingly on vulnerability fixes like CWE-502 that contain a lot of repetitive code transformations.

Case Study: CWE-78 and CWE-74. These two CWEs face the same problem, so we explain them together. CWE-78 means "Improper Neutralization of Special Elements used in an OS Command": the software constructs all or part of an OS command using externally-influenced input from an upstream component, but does not neutralize, or incorrectly neutralizes, special elements that could modify the intended OS command when it is sent to a downstream component. CWE-74 means "Improper Neutralization of Special Elements in Output Used by a Downstream Component": the software constructs all or part of a command, data structure, or record using externally-influenced input from an upstream component, but does not neutralize, or incorrectly neutralizes, special elements that could modify how it is parsed or interpreted when it is sent to a downstream component. Our explanation for the 0% accuracy of these two CWEs is that Tcwe contains none of them in the training set, while all of them appear in the test set; we believe this is the cause of the low accuracy.

The conclusion is that, for CWEs that contain duplicate vulnerability fixes or that can be learned from historical repair records, SeqTrans performs very well. Another hypothesis is that training a general model to fix vulnerabilities automatically is too ambitious to cover all cases; if we focus on specific types, the NMT model can produce very promising results to help developers.
Finding 1: SeqTrans performs well in predicting specific kinds of vulnerability fixes like CWE-287 and CWE-362. It also performs well on a timing test set that simulates learning from historical modification records. The prediction range will become wider and wider as the historical repair records increase.

On the other hand, to analyze these specific CWEs more deeply, we derived Table 5, which classifies the code transformations obtained by manually analyzing the prediction results and source code. We made a change type classification for every code change, covering not only the correct predictions but also the wrong ones. The criterion used for checking semantic correctness is the same as mentioned above: we only consider prediction results that are strictly consistent with the true modifications as correct predictions, so the actual accuracy should be higher than under this strict matching. The first column is the name of the code transformation type. We roughly divided the code transformations into 17 categories. Note that a single prediction can include multiple types of code changes, in which case it is classified into each of those types; for this reason, the sum of the classified changes does not equal the numbers in Table 4.

TABLE 5: Types of code transformation learned by SeqTrans

  Code Transformation               Accu (Tcross)       Accu (Tcwe)
  Change Parameter                  126/495 (25.5%)     17/49 (34.7%)
  Change Throw Exception            98/227 (43.1%)      5/15 (33.3%)
  Change Variable Definition        63/265 (23.8%)      11/33 (33.3%)
  Change Method Call                41/194 (21.1%)      4/11 (36.4%)
  Change Target                     19/123 (15.4%)      2/13 (15.4%)
  Change Annotation                 79/178 (44.4%)      12/21 (57.1%)
  Change Method Declaration         47/197 (23.9%)      3/13 (23.1%)
  Change Class Declaration          1/57 (1.8%)         0/3 (0%)
  Change If Condition               28/167 (16.8%)      2/7 (28.6%)
  Change Switch Block               3/31 (9.7%)         0/2 (0%)
  Change Loop Condition             2/38 (5.3%)         0/2 (0%)
  Change Return Statement           31/180 (17.2%)      4/14 (28.6%)
  Change Keywords "this/super"      7/18 (38.3%)        1/5 (20.0%)
  Add Try Block                     2/17 (11.8%)        1/3 (33.3%)
  Change Catch Exception            1/13 (7.7%)         0/1 (0%)
  Refactoring                       4/85 (4.7%)         0/1 (0%)
  Other                             7/22 (31.8%)        1/6 (16.7%)

The detailed definitions of the transformation types are as follows:

• Change Parameter: Add or delete a parameter, or change the parameter order.
• Change Throw Exception: Add, delete or replace a throw-exception block; add or delete exception keywords in the method declaration.
• Change Variable Definition: Change a variable's type or value.
• Change Method Call: Add or delete a method call, or replace one method call with another.
• Change Target: Keep the same method call but change the target of the call.
• Change Annotation: Add, delete or replace an annotation.
• Change Method Declaration: Add, delete or replace the method name and its qualifiers.
• Change Class Declaration: Modify the declaration of a class.
• Change If Condition: Add, delete or replace operands and operators in an if condition.
• Change Switch Block: Add, delete or replace a "case" statement.
• Change Loop Condition: Modify the loop condition.
• Change Return Statement: Change the return type or value; add or delete the "return" keyword.
• Change Keywords "this/super": Add or delete these keywords.
• Add Try Block: Put statements into a try block.
• Change Catch Exception: Add, delete or replace a catch-exception block.
• Refactoring: Rewrite the code without changing functionality.
• Other: Transformations that are hard to categorize or occur infrequently.

We can draw some conclusions from Table 5. On Tcross, SeqTrans performs well in predicting throw exception, annotation, and keyword changes, all of which are substantially above the average accuracy. When predicting parameter changes, method declaration changes, and variable definition changes, SeqTrans also performs better than the average. On Tcwe, SeqTrans performs consistently with Tcross; only class declaration, switch block, loop condition, and catch exception changes, as well as refactorings, show lower accuracy than the others. We believe this gap can be explained by two factors: the sophistication of the code change and its relevance. There are clear templates for code changes such as annotations and throw exceptions, so SeqTrans can more easily learn how to make such changes from historical data. But some categories involve sophisticated code changes, while others may simply suffer from insufficient samples, so the model does not learn them well. On the other hand, code changes such as refactorings and switch structure changes are difficult to accomplish through independent statement changes because the code is strongly interconnected, which also lowers the model's prediction accuracy.

Finding 2: SeqTrans performs well in handling throw exception changes, annotation changes and keyword changes in both datasets. Simple code transformations are easier for the model to learn, even in unseen situations. Sophisticated code and strongly correlated code transformations are not easily modified.

Overall, SeqTrans performs well above average on specific kinds of CWE and specific kinds of code transformations. As the model iterates in the hands of developers and the size of the data increases, we believe SeqTrans has much room for improvement.

5 DISCUSSION

5.1 Internal Threats

The performance of the NMT model can be significantly influenced by the hyperparameters we adopted, and the transformer model is particularly sensitive to them. In order to mimic their setup, we adopted a set of options suggested by OpenNMT [58] to reproduce their result. However, there are gaps between source code and natural language, so we also modified and tested part of the hyperparameters and chose the combination that achieves the best performance.

We manually analyzed the prediction results and the source code and classified them into 17 types. This number of categories is based on our experience during the experiments, which may not be complete enough to cover all code transformations.
More refined classification might lead to more discoveries. However, during our analysis we found that most code changes can be categorized into one of these specific code transformations or a combination of them; only a few code changes cannot be identified or classified, and even those should partly be attributed to mismatches produced by Gumtree [53].

5.2 External Validity

During the experiments, we found that Gumtree [53] can introduce mismatches, which affects the quality of the training set. Other researchers have also mentioned that GumTree occasionally cannot appropriately detect move and update actions between two ASTs [70], [71]. In fact, we found two problems with Gumtree. One is related to an IO issue: the IO streams Gumtree uses can cause blockages. This has been confirmed and fixed by Gumtree's author. The other problem lies in the bottom-up algorithm of Gumtree; this question did not receive a response from the author, and we did not run further experiments to evaluate the false-positive rate, since verification is very difficult and we had trouble collecting a suitable ground truth. We also modified Gumtree to support statement-level code matching and def-use chain collection. We believe that through these measures we have minimized the impact of Gumtree.

5.3 Limitations

The main limitation of SeqTrans is that it currently only supports single-line prediction: we always assume that the vulnerable statements are independent of each other when making predictions for a full CVE. We plan to abstract and tokenize the vulnerable function at the function level, but the data format we currently use cannot handle sequences of that length well.

5.4 Applications

We believe SeqTrans can help programmers reduce repetitive work and give reasonable recommendations for fixing vulnerable statements. As SeqTrans receives more and more modification records from developers, we believe there is still much room for improvement in its performance.

On the other hand, training a generic model on large-scale data is very expensive, and it takes a long time to tune the hyperparameters. Providing a general model that subsequent researchers can refine directly would therefore be meaningful work, and we will provide open-source code for replication of the study to motivate future work.

The SeqTrans approach can also be applied to areas outside vulnerability fixing, such as fine-grained code refactoring: historical knowledge can be used to learn how to refactor target code, for example attribute extraction, parameter merging, or variable inlining. This is also part of our future work. Moreover, our experiments are currently based on the Java language, but we believe there is common logic between programming languages, and the rules and features learned by the model can be applied to other languages.

6 RELATED WORKS

In recent years, Deep Learning (DL) has become a powerful tool for solving Software Engineering (SE) problems, since features can be captured and discovered by the DL model rather than derived manually. In this work, we apply the Neural Machine Translation (NMT) model to the program repair field to learn from historical vulnerability repair records and summarize common patterns that can be applied to subsequent vulnerability fixes. In the following, we introduce works focusing on program repair and compare our work with related research.

Automated Program Repair. Traditional program repair techniques can be categorized into three main classes: heuristic-based [72], constraint-based [72], and template-based APR approaches [8]. We list some traditional techniques to illustrate these three types of approaches.

Heuristic-based APR approaches construct and traverse a search space of syntactic program modifications [72]. ARJA-e [73] proposes a new evolutionary repair system for Java code that aims to address challenges related to the search space. SimFix [74] utilizes both existing patches and similar code: it mines an abstract search space from existing patches and obtains a concrete search space by differencing with similar code snippets. Getafix [75] is based on a novel hierarchical clustering algorithm that summarizes fix patterns into a hierarchy ranging from general to specific patterns. GenProg [6] and RSRepair [13] are two similar approaches; both try to repair faulty programs with the same mutation operations in a search space, but RSRepair uses random search rather than genetic programming to guide the patch generation process. Meditor [26] provides a novel algorithm that flexibly locates and groups migration-related (MR) code changes in commits; for edit application, Meditor matches a given program with the inferred edits to decide which edit is applicable and produces a migrated version for developers. AppEvolve [28] can automatically perform app updates for API changes based on examples of how other developers evolved their apps for the same changes. This technique is able to update 85% of the API changes considered, but it is quite time-consuming and not scalable enough.

Constraint-based APR approaches usually focus on fixing a conditional expression, which is more prone to defects than other types of program elements. Elixir [76] uses method-call-related templates from PAR, together with local variables, fields, or constants, to construct more expressive repair expressions that go into synthesizing patches. ACS [77] focuses on fine-grained ranking criteria for condition synthesis, combining three heuristic ranking techniques that exploit the structure of the buggy program, the documentation of the buggy program, and the conditional expressions in existing projects.

Template-based APR approaches can also be called history-based repair approaches; they mine and learn fixing patterns from prior bug fixes. It should be noted that the boundaries between these three classes are blurry, and many techniques use more than one of them simultaneously. FixMiner [32], SimFix [74], ssFix [78], CapGen [31] and HDRepair [79] are based on frequently occurring code change operations extracted from the patches in code change histories; the main difference between them is the object from which the data is extracted and the way in which the data is processed. AVATAR [33] exploits fix patterns of static analysis violations as ingredients for patch generation.

Among learning-based approaches, DeepFix [36] can completely repair 1,881 (27%) and partially repair another 1,338 (19%) of a collection of 6,971 incorrect C programs written by students for 93 programming tasks. HERCULES [81] presents an APR technique that generalizes single-hunk repair to an important class of multi-hunk bugs, namely bugs that require applying a substantially similar patch at a number of locations; its limitation is that it addresses only this specific class of multi-hunk repairs and the evaluation is carried out only on the Defects4J dataset. TRACER [82] is another work very similar to DeepFix for fixing compiler errors, and its accuracy exceeds that of DeepFix. Tufano et al. [37], [69] investigated the feasibility of using neural machine translation for learning code changes in the wild; the disadvantage of their method is that only sentences with fewer than 100 tokens are analyzed. In addition, this work is limited to bugs contained in a single sequence within a single method.

SequenceR [38] presents an end-to-end approach to program repair based on sequence-to-sequence learning. It utilizes the copy mechanism to overcome the unlimited-vocabulary problem and, to the best of our knowledge, achieves the best result reported on such a task. However, the abstract data structure of this method retains too much useless context, and it does not use a normalization method either.

Vulnerability Repair. Fixing vulnerabilities is critical to protect users from security compromises and to prevent vendors from losing user confidence. Traditional tools such as Angelix [83], SemFix [7] and ClearView [84] heavily rely on a set of positive/negative example inputs to find a patch that makes the program behave correctly on those examples. SENX [85] proposes a different, "property-based" approach that relies on program-independent, vulnerability-specific, human-specified safety properties. Another trending direction is the application of neural network models to vulnerability repair. Harer et al. [86] apply Generative Adversarial Networks (GAN) to the automated repair of software vulnerabilities; they address the setting with no labeled vulnerable examples and achieve performance close to seq2seq approaches that require labeled pairs. Chen et al. [87] apply a plain seq2seq model to vulnerability repair, but the performance is not very promising. Ratchet [88] also utilizes an NMT model to fix vulnerabilities, but it only stores single statements without any context around them. None of these approaches considers multiple statements either.
patterns of static analysis violations as ingredients for patch Transformer and Tree Structure Another popular direc- generation. SOFix [80] has a novel approach to digging up tion is utilizing a transformer model or treat source code bug fix records from Stack Overflow responses. as a syntax tree to maintain richer information. TranS3 [89] These works are still based on statistical ranking or proposes a transformer-based framework to integrate code strict context matching. However, more and more works summarization with code search. Tree-based neural network are beginning to exploit machine learning to rank similar such as TreeLSTM [90], [91], ASTNN [92] or TreeNet [93] code transformations and automatically generate code rec- are also being applied on program analysis. Shiv et al. [94] ommendations. propose a method to extend transformers to tree-structured Learning-based APR approaches is actually part of data. This approach abstracts the sinusoidal positional en- template-based APR approaches that are enhanced by ma- codings of the transformer, using a novel positional en- chine learning techniques. We have separated them as an coding scheme to represent node positions within trees. It independent category. DeepFix [36] is a program repair tool achieves a 22% absolute increase in accuracy on a JavaScript using a multi-layered sequence-to-sequence neural network to CoffeeScript [95] translation dataset. TreeCaps [96] pro- with attention for fixing common programming errors. In a poses a tree-based capsule network for processing program collection of 6,971 incorrect C language programs written by code in an automated way that encodes code syntactical JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, MAY 2021 18 structures and captures code dependencies more accurately. [11] T. Ackling, B. Alexander, and I. Grunert, “Evolving patches for CODIT[97] and DLFix[98] has begun to apply tree structure software repair,” in Proceedings of the 13th annual conference on Genetic and evolutionary computation, 2011, pp. 1427–1434. into program repair and achieve some progress. They focus [12] F. Long and M. Rinard, “Staged program repair with condition on the single line predictions and never consider multiple- synthesis,” in Proceedings of the 2015 10th Joint Meeting on Founda- statement. However, this situation is more challenging than tions of Software Engineering, 2015, pp. 166–178. translate one language to another language. Converting the [13] Y. Qi, X. Mao, Y. Lei, Z. Dai, and C. Wang, “The strength of random search on automated program repair,” in Proceedings of the 36th generated prediction tree into readable code also faces chal- International Conference on Software Engineering, 2014, pp. 254–265. lenges. Overall, we believe that using a tree-based neural [14] E. Dinella, H. Dai, Z. Li, M. Naik, L. Song, and K. Wang, “Hop- network or even combining it with a transformer structure pity: Learning graph transformations to detect and fix bugs in will become our future work. programs,” in International Conference on Learning Representations (ICLR), 2020. [15] Z. Durumeric, F. Li, J. Kasten, J. Amann, J. Beekman, M. Payer, N. Weaver, D. Adrian, V. Paxson, M. Bailey et al., “The matter 7 CONCLUSION of heartbleed,” in Proceedings of the 2014 conference on internet measurement conference, 2014, pp. 475–488. In this paper, we design the automatic vulnerability fix tool [16] P. Kocher, J. Horn, A. Fogh, D. Genkin, D. Gruss, W. Haas, SeqTrans that is based on the NMT technique to learn from M. Hamburg, M. 
Lipp, S. Mangard, T. Prescher et al., “Spectre at- tacks: Exploiting speculative execution,” in 2019 IEEE Symposium historical vulnerability fixes. It can provide suggestions on Security and Privacy (SP). IEEE, 2019, pp. 1–19. and automatically fix the source code for developers. Fine- [17] M. Lipp, M. Schwarz, D. Gruss, T. Prescher, W. Haas, S. Mangard, tuning strategy is used to overcome the small sample size P. Kocher, D. Genkin, Y. Yarom, and M. Hamburg, “Meltdown,” problem. We conduct our study on real-world vulnerability arXiv preprint arXiv:1801.01207, 2018. [18] B. Potter and G. McGraw, “Software security testing,” IEEE Secu- fix records and compare our SeqTrans with three kinds rity & Privacy, vol. 2, no. 5, pp. 81–85, 2004. of other NMT techniques. We investigated two research [19] R. Scandariato, J. Walden, A. Hovsepyan, and W. Joosen, “Pre- questions based on these collected data. Experiment results dicting vulnerable software components via text mining,” IEEE show that our technique outperforms the state-of-the-art Transactions on Software Engineering, vol. 40, no. 10, pp. 993–1006, 2014. NMT model and achieves an accuracy rate of 23.3% in [20] Z. Shen and S. Chen, “A survey of automatic software vulnerabil- statement-level prediction and 25.3% in CVE-level predic- ity detection, program repair, and defect prediction techniques,” tion. The SeqTrans-based approach indeed helps solve the Security and Communication Networks, vol. 2020, 2020. [21] P. Morrison, K. Herzig, B. Murphy, and L. Williams, “Challenges scalability and small data set problems of existing methods with applying vulnerability prediction models,” in Proceedings of on the task of vulnerability fixing. We also look deeply the 2015 Symposium and Bootcamp on the Science of Security, 2015, into the model and manually analyze the prediction result pp. 1–9. and the source code. Our observation finds that SeqTrans [22] M. Zalewski, “American fuzzy lop: a security-oriented fuzzer,” URl: http://lcamtuf.coredump.cx/afl/(visited on 06/21/2017), 2010. performs quite well in specific kinds of CWEs like CWE-287 [23] K. Serebryany and M. Bohme,¨ “Aflgo: Directing afl to reach (Improper Authentication) and CWE-863 (Incorrect Autho- specific target locations,” 2017. rization). The prediction range will become wider and wider [24]M.B ohme,¨ V.-T. Pham, and A. Roychoudhury, “Coverage-based as the historical repair records increases. greybox as markov chain,” IEEE Transactions on Software Engineering, vol. 45, no. 5, pp. 489–506, 2017. [25] S. Ma, F. Thung, D. Lo, C. Sun, and R. H. Deng, “Vurle: Automatic vulnerability detection and repair by learning from examples,” in REFERENCES European Symposium on Research in Computer Security. Springer, 2017, pp. 229–246. [1] T. Britton, L. Jeng, G. Carver, and P. Cheak, “Reversible debug- [26] S. Xu, Z. Dong, and N. Meng, “Meditor: inference and application ging software quantify the time and cost saved using reversible of api migration edits,” in 2019 IEEE/ACM 27th International Con- debuggers,” 2013. ference on Program Comprehension (ICPC). IEEE, 2019, pp. 335–346. [2] B. Hailpern and P. Santhanam, “Software debugging, testing, and [27] H. A. Nguyen, T. T. Nguyen, G. Wilson Jr, A. T. Nguyen, M. Kim, verification,” IBM Systems Journal, vol. 41, no. 1, pp. 4–12, 2002. and T. N. Nguyen, “A graph-based approach to api usage adapta- [3] A. Zeller, Why programs fail: a guide to systematic debugging. Else- tion,” ACM Sigplan Notices, vol. 45, no. 10, pp. 302–321, 2010. vier, 2009. [28] M. 
Fazzini, Q. Xin, and A. Orso, “Automated api-usage update [4] L. Gazzola, D. Micucci, and L. Mariani, “Automatic software for android apps,” in Proceedings of the 28th ACM SIGSOFT In- repair: A survey,” IEEE Transactions on Software Engineering, vol. 45, ternational Symposium on Software Testing and Analysis, 2019, pp. no. 1, pp. 34–67, 2017. 204–215. [5] M. Monperrus, “Automatic software repair: a bibliography,” ACM [29] H. D. Phan, A. T. Nguyen, T. D. Nguyen, and T. N. Nguyen, Computing Surveys (CSUR), vol. 51, no. 1, pp. 1–24, 2018. “Statistical migration of api usages,” in 2017 IEEE/ACM 39th [6] W. Weimer, T. Nguyen, C. Le Goues, and S. Forrest, “Automati- International Conference on Software Engineering Companion (ICSE- cally finding patches using genetic programming,” in 2009 IEEE C). IEEE, 2017, pp. 47–50. 31st International Conference on Software Engineering. IEEE, 2009, [30] M. Lamothe, W. Shang, and T.-H. Chen, “A4: Automatically assist- pp. 364–374. ing android api migrations using code examples,” arXiv preprint [7] H. D. T. Nguyen, D. Qi, A. Roychoudhury, and S. Chandra, arXiv:1812.04894, 2018. “Semfix: Program repair via semantic analysis,” in 2013 35th [31] M. Wen, J. Chen, R. Wu, D. Hao, and S.-C. Cheung, “Context- International Conference on Software Engineering (ICSE). IEEE, 2013, aware patch generation for better automated program repair,” in pp. 772–781. 2018 IEEE/ACM 40th International Conference on Software Engineer- [8] D. Kim, J. Nam, J. Song, and S. Kim, “Automatic patch generation ing (ICSE). IEEE, 2018, pp. 1–11. learned from human-written patches,” in 2013 35th International [32] A. Koyuncu, K. Liu, T. F. Bissyande,´ D. Kim, J. Klein, M. Mon- Conference on Software Engineering (ICSE). IEEE, 2013, pp. 802– perrus, and Y. Le Traon, “Fixminer: Mining relevant fix patterns 811. for automated program repair,” Empirical Software Engineering, pp. [9] V. Dallmeier, A. Zeller, and B. Meyer, “Generating fixes from object 1–45, 2020. behavior anomalies,” in 2009 IEEE/ACM International Conference on [33] K. Liu, A. Koyuncu, D. Kim, and T. F. Bissyande,´ “Avatar: Fixing Automated Software Engineering. IEEE, 2009, pp. 550–554. semantic bugs with fix patterns of static analysis violations,” [10] S. R. L. Marcote and M. Monperrus, “Automatic repair of infinite in 2019 IEEE 26th International Conference on Software Analysis, loops,” arXiv preprint arXiv:1504.05078, 2015. Evolution and Reengineering (SANER). IEEE, 2019, pp. 1–12. JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, MAY 2021 19

[34] I. Eclipse, “Eclipse ide,” Website www. eclipse. org Last visited: July, [58] G. Klein, Y. Kim, Y. Deng, J. Senellart, and A. M. Rush, 2009. “OpenNMT: Open-source toolkit for neural machine translation,” [35] M. Allamanis, E. T. Barr, P. Devanbu, and C. Sutton, “A survey of in Proc. ACL, 2017. [Online]. Available: https://doi.org/10.18653/ machine learning for big code and naturalness,” ACM Computing v1/P17-4012 Surveys (CSUR), vol. 51, no. 4, pp. 1–37, 2018. [59] G. Klein, Y. Kim, Y. Deng, J. Senellart, and A. Rush, [36] R. Gupta, S. Pal, A. Kanade, and S. Shevade, “Deepfix: Fixing “OpenNMT: Open-source toolkit for neural machine translation,” common c language errors by deep learning,” in Thirty-First AAAI in Proceedings of ACL 2017, System Demonstrations. Vancouver, Conference on Artificial Intelligence, 2017. Canada: Association for Computational Linguistics, Jul. 2017, pp. [37] M. Tufano, C. Watson, G. Bavota, M. Di Penta, M. White, and 67–72. [Online]. Available: https://www.aclweb.org/anthology/ D. Poshyvanyk, “An empirical investigation into learning bug- P17-4012 fixing patches in the wild via neural machine translation,” in Pro- [60] T. Mikolov, M. Karafiat,´ L. Burget, J. Cernockˇ y,` and S. Khudanpur, ceedings of the 33rd ACM/IEEE International Conference on Automated “Recurrent neural network based language model,” in Eleventh Software Engineering, 2018, pp. 832–837. annual conference of the international speech communication association, [38] Z. Chen, S. J. Kommrusch, M. Tufano, L.-N. Pouchet, D. Poshy- 2010. vanyk, and M. Monperrus, “Sequencer: Sequence-to-sequence [61] F. A. Gers, J. Schmidhuber, and F. Cummins, “Learning to forget: learning for end-to-end program repair,” IEEE Transactions on Continual prediction with lstm,” 1999. Software Engineering, 2019. [62] A. Graves, “Generating sequences with recurrent neural net- [39] M. Tufano, C. Watson, G. Bavota, M. D. Penta, M. White, and works,” arXiv preprint arXiv:1308.0850, 2013. D. Poshyvanyk, “An empirical study on learning bug-fixing [63] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for patches in the wild via neural machine translation,” ACM Trans- image recognition,” in Proceedings of the IEEE conference on computer actions on Software Engineering and Methodology (TOSEM), vol. 28, vision and pattern recognition, 2016, pp. 770–778. no. 4, pp. 1–29, 2019. [64] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv [40] S. E. Ponta, H. Plate, A. Sabetta, M. Bezzi, and C. Dangremont, preprint arXiv:1607.06450, 2016. “A manually-curated dataset of fixes to vulnerabilities of open- [65] K. Liu, A. Koyuncu, T. F. Bissyande,´ D. Kim, J. Klein, and source software,” in 2019 IEEE/ACM 16th International Conference Y. Le Traon, “You cannot fix what you cannot find! an investigation on Mining Software Repositories (MSR). IEEE, 2019, pp. 383–387. of fault localization bias in benchmarking automated program [41] “Aosp vulnerability dataset,” https://source.android.com/ repair systems,” in 2019 12th IEEE conference on software testing, security/bulletin, April 20, 2021. validation and verification (ICST). IEEE, 2019, pp. 102–113. [42] “Cve program,” https://cve.mitre.org/, April 20, 2021. [66] V. Raychev, M. Vechev, and E. Yahav, “Code completion with sta- [43] N. Tajbakhsh, J. Y. Shin, S. R. Gurudu, R. T. Hurst, C. B. Kendall, tistical language models,” in Proceedings of the 35th ACM SIGPLAN M. B. Gotway, and J. 
Liang, “Convolutional neural networks Conference on Programming Language Design and Implementation, for medical image analysis: Full training or fine tuning?” IEEE 2014, pp. 419–428. transactions on medical imaging, vol. 35, no. 5, pp. 1299–1312, 2016. [67] M. Freitag and Y. Al-Onaizan, “Beam search strategies for neural [44] C. Casalnuovo, K. Sagae, and P. Devanbu, “Studying the difference machine translation,” arXiv preprint arXiv:1702.01806, 2017. between natural and programming language corpora,” Empirical [68] “Github compare api,” https://docs.github.com/en/rest/ Software Engineering, vol. 24, no. 4, pp. 1823–1868, 2019. reference/repos#commits, April 20, 2021. [45] Y. Shi, S. Park, Z. Yin, S. Lu, Y. Zhou, W. Chen, and W. Zheng, “Do i use the wrong definition? defuse: Definition-use invariants for [69] M. Tufano, J. Pantiuchina, C. Watson, G. Bavota, and D. Poshy- detecting concurrency and sequential bugs,” in Proceedings of the vanyk, “On learning meaningful code changes via neural machine ACM international conference on Object oriented programming systems translation,” in 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE) languages and applications, 2010, pp. 160–174. . IEEE, 2019, pp. 25–36. [46] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. [70] V. Frick, T. Grassauer, F. Beck, and M. Pinzger, “Generating ac- Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” curate and compact edit scripts using tree differencing,” in 2018 in Advances in neural information processing systems, 2017, pp. 5998– IEEE International Conference on Software Maintenance and Evolution 6008. (ICSME). IEEE, 2018, pp. 264–274. [47] S. Wiseman and A. M. Rush, “Sequence-to-sequence learning as [71] J. Matsumoto, Y. Higo, and S. Kusumoto, “Beyond gumtree: A beam-search optimization,” arXiv preprint arXiv:1606.02960, 2016. hybrid approach to generate edit scripts,” in 2019 IEEE/ACM [48] S. E. Ponta, H. Plate, and A. Sabetta, “Beyond metadata: Code- 16th International Conference on Mining Software Repositories (MSR). centric and usage-based analysis of known vulnerabilities in open- IEEE, 2019, pp. 550–554. source software,” in 2018 IEEE International Conference on Software [72] C. L. Goues, M. Pradel, and A. Roychoudhury, “Automated pro- Maintenance and Evolution (ICSME). IEEE, 2018, pp. 449–460. gram repair,” Communications of the ACM, vol. 62, no. 12, pp. 56–65, [49] D. Mahajan, R. Girshick, V. Ramanathan, K. He, M. Paluri, Y. Li, 2019. A. Bharambe, and L. Van Der Maaten, “Exploring the limits of [73] Y. Yuan and W. Banzhaf, “Toward better evolutionary program weakly supervised pretraining,” in Proceedings of the European repair: An integrated approach,” ACM Transactions on Software Conference on Computer Vision (ECCV), 2018, pp. 181–196. Engineering and Methodology (TOSEM), vol. 29, no. 1, pp. 1–53, [50] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, 2020. M. Lewis, L. Zettlemoyer, and V. Stoyanov, “Roberta: A ro- [74] J. Jiang, Y. Xiong, H. Zhang, Q. Gao, and X. Chen, “Shaping bustly optimized bert pretraining approach,” arXiv preprint program repair space with existing patches and similar code,” in arXiv:1907.11692, 2019. Proceedings of the 27th ACM SIGSOFT international symposium on [51] “Findbugs - find bugs in java programs,” http://findbugs. software testing and analysis, 2018, pp. 298–309. .net/, March 06, 2015. [75] J. Bader, A. Scott, M. Pradel, and S. 
Chandra, “Getafix: Learning [52] “Eclipse jgit,” https://www.eclipse.org/jgit/, Accessed April 4, to fix bugs automatically,” Proceedings of the ACM on Programming 2017. Languages, vol. 3, no. OOPSLA, pp. 1–27, 2019. [53] J.-R. Falleri, F. Morandat, X. Blanc, M. Martinez, and M. Monper- [76] R. K. Saha, Y. Lyu, H. Yoshida, and M. R. Prasad, “Elixir: Effective rus, “Fine-grained and accurate source code differencing,” in Pro- object-oriented program repair,” in 2017 32nd IEEE/ACM Interna- ceedings of the 29th ACM/IEEE international conference on Automated tional Conference on Automated Software Engineering (ASE). IEEE, software engineering, 2014, pp. 313–324. 2017, pp. 648–659. [54] “srcml,” https://www.srcml.org/, April 20, 2021. [77] Y. Xiong, J. Wang, R. Yan, J. Zhang, S. Han, G. Huang, and [55] “Scitools understand,” https://scitools.com/features/, Sep 20, L. Zhang, “Precise condition synthesis for program repair,” in 2017 2019. IEEE/ACM 39th International Conference on Software Engineering [56] R. Sennrich, B. Haddow, and A. Birch, “Neural machine (ICSE). IEEE, 2017, pp. 416–426. translation of rare words with subword units,” arXiv preprint [78] Q. Xin and S. P. Reiss, “Leveraging syntax-related code for au- arXiv:1508.07909, 2015. tomated program repair,” in 2017 32nd IEEE/ACM International [57] S. Gururangan, A. Marasovic,´ S. Swayamdipta, K. Lo, I. Belt- Conference on Automated Software Engineering (ASE). IEEE, 2017, agy, D. Downey, and N. A. Smith, “Don’t stop pretraining: pp. 660–670. Adapt language models to domains and tasks,” arXiv preprint [79] X. B. D. Le, D. Lo, and C. Le Goues, “History driven program arXiv:2004.10964, 2020. repair,” in 2016 IEEE 23rd International Conference on Software JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, MAY 2021 20

Analysis, Evolution, and Reengineering (SANER), vol. 1. IEEE, 2016, Jianlei Chi received the B.S. degree in com- pp. 213–224. puter science and technology from Harbin Engi- [80] X. Liu and H. Zhong, “Mining stackoverflow for program repair,” neering University, China, in 2010 and 2014. He in 2018 IEEE 25th International Conference on Software Analysis, is currently working toward the Ph.D. degree in Evolution and Reengineering (SANER). IEEE, 2018, pp. 118–129. the Department of Computer Science and Tech- [81] S. Saha et al., “Harnessing evolution for multi-hunk program nology at Xian Jiaotong University, China. His repair,” in 2019 IEEE/ACM 41st International Conference on Software research interests include trustworthy software, Engineering (ICSE). IEEE, 2019, pp. 13–24. software testing, software security and software [82] U. Z. Ahmed, P. Kumar, A. Karkare, P. Kar, and S. Gulwani, “Com- behavior analysis. pilation error repair: for the student programs, from the student programs,” in Proceedings of the 40th International Conference on Software Engineering: Software Engineering Education and Training, 2018, pp. 78–87. [83] S. Mechtaev, J. Yi, and A. Roychoudhury, “Angelix: Scalable multi- line program patch synthesis via symbolic analysis,” in Proceedings of the 38th international conference on software engineering, 2016, pp. Yu Qu received the B.S. and Ph.D. degrees from 691–701. Xian Jiaotong University, Xian, China in 2006 [84] J. H. Perkins, S. Kim, S. Larsen, S. Amarasinghe, J. Bachrach, and 2015 respectively. He is a post-doctoral re- M. Carbin, C. Pacheco, F. Sherwood, S. Sidiroglou, G. Sullivan searcher at the Department of Computer Sci- et al., “Automatically patching errors in deployed software,” in ence and Engineering, UC Riverside. His re- Proceedings of the ACM SIGOPS 22nd symposium on Operating search interests include trustworthy software systems principles, 2009, pp. 87–102. and applying complex network and data mining [85] Z. Huang, D. Lie, G. Tan, and T. Jaeger, “Using safety properties theories to analyzing software systems. to generate vulnerability patches,” in 2019 IEEE Symposium on Security and Privacy (SP). IEEE, 2019, pp. 539–554. [86] J. Harer, O. Ozdemir, T. Lazovich, C. Reale, R. Russell, L. Kim et al., “Learning to repair software vulnerabilities with generative adversarial networks,” in Advances in Neural Information Processing Systems, 2018, pp. 7933–7943. [87] Z. Chen, S. Kommrusch, and M. Monperrus, “Using sequence-to- sequence learning for repairing c vulnerabilities,” arXiv preprint Ting Liu received his B.S. degree in information arXiv:1912.02015, 2019. engineering and Ph.D. degree in system engi- [88] H. Hata, E. Shihab, and G. Neubig, “Learning to generate cor- neering from School of Electronic and Informa- rective patches using neural machine translation,” arXiv preprint tion, Xian Jiaotong University, Xian, China, in arXiv:1812.07170, 2018. 2003 and 2010, respectively. Currently, he is a [89] W. Wang, Y. Zhang, Z. Zeng, and G. Xu, “Transˆ 3: A transformer- professor of the Systems Engineering Institute, based framework for unifying code summarization and code Xian Jiaotong University. His research interests search,” arXiv preprint arXiv:2003.03238, 2020. include smart grid, network security and trust- [90] M. Ahmed, M. R. Samee, and R. E. Mercer, “Improving tree-lstm worthy software. with tree attention,” in 2019 IEEE 13th International Conference on Semantic Computing (ICSC). IEEE, 2019, pp. 247–254. [91] K. S. Tai, R. Socher, and C. D. 
Manning, “Improved semantic representations from tree-structured long short-term memory net- works,” arXiv preprint arXiv:1503.00075, 2015. [92] J. Zhang, X. Wang, H. Zhang, H. Sun, K. Wang, and X. Liu, “A novel neural source code representation based on abstract syntax Qinghua Zheng Qinghua Zheng received the tree,” in 2019 IEEE/ACM 41st International Conference on Software B.S. degree incomputer software in 1990, the Engineering (ICSE). IEEE, 2019, pp. 783–794. M.S. degree in computer organization and archi- [93] Z. Cheng, C. Yuan, J. Li, and H. Yang, “Treenet: Learning sentence tecture in 1993, and the Ph.D. degree in system representations with unconstrained tree structure.” in IJCAI, 2018, engineering in 1997 from Xian Jiaotong Univer- pp. 4005–4011. sity, China. He was a postdoctoral researcher [94] V. Shiv and C. Quirk, “Novel positional encodings to enable tree- at Harvard University in 2002. He is currently based transformers,” in Advances in Neural Information Processing a professor in Xian Jiaotong University, and the Systems, 2019, pp. 12 058–12 068. dean of the Department of Computer Science. [95] J. Ashkenas et al., “Coffeescript,” 2009. His research areas include computer network [96] V. Jayasundara, N. D. Q. Bui, L. Jiang, and D. Lo, “Treecaps: Tree- security, intelligent e-learning theory and algo- structured capsule networks for program source code processing,” rithm, multimedia e-learning, and trustworthy software. arXiv preprint arXiv:1910.12306, 2019. [97] S. Chakraborty, Y. Ding, M. Allamanis, and B. Ray, “Codit: Code editing with tree-based neural models,” IEEE Transactions on Soft- ware Engineering, 2020. [98] Y. Li, S. Wang, and T. N. Nguyen, “Dlfix: Context-based code transformation learning for automated program repair,” in Pro- Heng Yin is a professor in the department ceedings of the ACM/IEEE 42nd International Conference on Software of Computer Science and Engineering at UC Engineering, 2020, pp. 602–614. Riverside. Before joining UC Riverside, he was with Syracuse University from September 2009 to June 2016, as assistant professor and then associate professor. He obtained his Ph.D in Computer Science from the College of William and Mary in 2009, while He spent 4 years at Carnegie Mellon University and later at UC Berkeley. His research interests lie in computer security and developing all kinds of techniques (such as program analysis, virtualization, and machine learning/deep learning) to solve computer and software security problems, including but not limited to detection and analysis, vulnerability discovery, program hardening, digital forensics.