Published as a conference paper at ICLR 2019

LEARNING TO REPRESENT EDITS

Pengcheng Yin,∗ Graham Neubig
Language Technologies Institute
Carnegie Mellon University
Pittsburgh, PA 15213, USA
{pcyin,gneubig}@cs.cmu.edu

Miltiadis Allamanis, Marc Brockschmidt, Alexander L. Gaunt
Microsoft Research
Cambridge, CB1 2FB, United Kingdom
{miallama,mabrocks,algaunt}@microsoft.com

ABSTRACT

We introduce the problem of learning distributed representations of edits. By combining a “neural editor” with an “edit encoder”, our models learn to represent the salient information of an edit and can be used to apply edits to new inputs. We experiment on natural language and source code edit data. Our evaluation yields promising results that suggest that our neural network models learn to capture the structure and semantics of edits. We hope that this interesting task and data source will inspire other researchers to work further on this problem.

1 INTRODUCTION

One great advantage of electronic storage of documents is the ease with which we can edit them, and edits are performed in a wide variety of contexts. For example, right before a conference deadline, papers worldwide are finalized and polished, often involving common fixes for grammar, clarity and style. Would it be possible to automatically extract rules from these common edits? Similarly, program source code is constantly changed to implement new features, follow best practices and fix bugs. With the widespread deployment of (implicit) version control systems, these edits are quickly archived, creating a major data stream that we can learn from.

In this work, we study the problem of learning distributed representations of edits. We only look at small edits with simple semantics that are more likely to appear often and do not consider larger edits; i.e., we consider “add definite articles” rather than “rewrite act 2, scene 3.” Concretely, we focus on two questions: i) Can we group semantically equivalent edits together, so that we can automatically recognize common edit patterns? ii) Can we automatically transfer edits from one context to another? A solution to the first question would yield a practical tool for copy editors and programmers alike, automatically identifying the most common changes. By leveraging tools from program synthesis, such groups of edits could be turned into interpretable rules and scripts (Rolim et al., 2017). When there is no simple hard rule explaining how to apply an edit, an answer to the second question would be of great use, e.g., to automatically rewrite natural language following some stylistic rule.

We propose to handle edit data in an autoencoder-style framework, in which an “edit encoder” f∆ is trained to compute a representation of an edit x− → x+, and a “neural editor” α is trained to construct x+ from the edit representation and x−. This framework ensures that the edit representation is semantically meaningful, and a sufficiently strong neural editor allows this representation to not be specific to the changed element. We experiment with various neural architectures that can learn to represent and apply edits and hope to direct the attention of the research community to this new and interesting data source, leading to better datasets and stronger models. Briefly, the contributions of our paper are: (a) in Sect. 2, we present a new and important machine learning task on learning representations of edits; (b) we present a family of models that capture the structure of edits and compute efficient representations in Sect. 3; (c) we create a new source code edit dataset, and release the data extraction code at https://github.com/Microsoft/msrc-dpu-learning-to-represent-edits and the data at http://www.cs.cmu.edu/~pengchey/githubedits.zip; (d) we perform a set of experiments on the learned edit representations in Sect. 4 for natural language text and source code and present promising empirical evidence that our models succeed in capturing the semantics of edits.

∗Work done as an intern in Microsoft Research, Cambridge, UK.

Edit 1:  x−: var greenList = trivia==null ? null : trivia.Select(t=> t.UnderlyingNode);
         x+: var greenList = trivia?.Select(t=> t.UnderlyingNode);
Edit 2:  x′−: me = ((ue!=null) ? ue.Operand : null) as MemberExpression;
         x′+: me = ue?.Operand as MemberExpression;

Figure 1: Given an edit (Edit 1) of x− to x+, f∆ computes an edit representation vector f∆(x−, x+) ∈ ℝⁿ. Using that representation vector, the neural editor α applies the same edit to a new x′−, computing α(x′−, f∆(x−, x+)). The code snippets shown here are real code change examples from an open-source project.

2 TASK

In this work, we are interested in learning to represent and apply edits on discrete sequential or structured data, such as text or source code parse trees.¹ Figure 1 gives a graphical overview of the task, described precisely below.

Edit Representation Given a dataset of edits $\{x_-^{(i)} \to x_+^{(i)}\}_{i=1}^N$, where $x_-^{(i)}$ is the original version of some object and $x_+^{(i)}$ its edited form (see upper half of Figure 1 for an example), our goal is to learn a representation function f∆ that maps an edit operation x− → x+ to a real-valued edit representation $f_\Delta(x_-, x_+) \in \mathbb{R}^n$. A desired quality of f∆ is for the computed edit representations to have the property that semantically similar edits have nearby representations in $\mathbb{R}^n$. Having distributed representations also allows other interesting downstream tasks, e.g., unsupervised clustering and visualization of similar edits from large-scale data (e.g. the GitHub commit stream), which would be useful for developing human-assistance toolkits for discovering and extracting emerging edit patterns (e.g. new bug fixes or emerging “best practices” of coding).

Neural Editor Given an edit representation function f∆, we want to learn to apply edits in a new context. This can be achieved by learning a neural editor α that accepts an edit representation f∆(x−, x+) and a new input x′− and generates x′+.² This is illustrated in the lower half of Figure 1.
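Schematically, the two learned components have the following interface. This is a minimal sketch for orientation only; the function names are ours, not taken from the released code:

```python
from typing import List
import numpy as np

def f_delta(x_minus: List[str], x_plus: List[str]) -> np.ndarray:
    """Edit encoder: map an edit x- -> x+ to a vector in R^n."""
    raise NotImplementedError  # instantiated by the edit encoders of Sect. 3.2

def alpha(x_minus_new: List[str], edit_repr: np.ndarray) -> List[str]:
    """Neural editor: apply the encoded edit to a new input x'-."""
    raise NotImplementedError  # instantiated by the editors of Sect. 3.1
```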

3 MODEL

We cast the edit representation problem as an autoencoding task, where we aim to minimize the reconstruction error of α for the edited version x+ given the edit representation f∆(x−, x+) and the original version x−. By limiting the capacity of f∆’s output and allowing the model to freely use information about x−, we are introducing a “bottleneck” that forces the overall framework to not simply treat f∆(x−, x+) as an encoder of x+. The main difference from traditional autoencoders is that in our setup, an optimal solution requires re-using as much information as possible from x− to make the most of the capacity of f∆. Formally, given a probabilistic editor function Pα such as a neural network and a dataset $\{x_-^{(i)} \to x_+^{(i)}\}_{i=1}^N$, we seek to minimize the negative log-likelihood loss

$$\mathcal{L} = -\frac{1}{N} \sum_i \log P_\alpha\big(x_+^{(i)} \mid x_-^{(i)}, f_\Delta(x_-^{(i)}, x_+^{(i)})\big).$$

Note that this loss function can be interpreted in two ways: (1) as a conditional autoencoder that encodes the salient information of an edit, given x−, and (2) as an encoder-decoder model that encodes x− and decodes x+ conditioned on the edit representation f∆(x−, x+). In the rest of this section, we discuss our methods to model Pα and f∆ as neural networks.

¹Existing editing systems, e.g. the grammar checker in text editors and refactoring modules in IDEs, are powered by domain-specific, manually crafted rules, while we aim for a data-driven, domain-agnostic approach.
²We leave the problem of identifying which edit representation f∆(x−, x+) to apply to a new input x′− as interesting future work.

Figure 2: (a) Graph representation of statement u = x + x. Rectangular (resp. rounded) nodes denote tokens (resp. non-terminals). (b) Sequence of tree decoding steps yielding x + x - 23, where x + x is copied (using the TREECP action) from the context graph in (a).
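To make the objective concrete, here is a minimal PyTorch-style sketch of the training step. The `edit_encoder` and `editor` modules are placeholders standing in for the components described in Sect. 3.1 and 3.2; this is our illustration, not the released implementation:

```python
import torch
import torch.nn as nn

class EditAutoencoder(nn.Module):
    """Conditional autoencoder over edits: alpha reconstructs x+ from (x-, f_delta)."""
    def __init__(self, edit_encoder: nn.Module, editor: nn.Module):
        super().__init__()
        self.edit_encoder = edit_encoder  # f_delta: (x-, x+) -> R^n (the bottleneck)
        self.editor = editor              # P_alpha: log-probs of x+ given x- and f_delta

    def loss(self, x_minus, x_plus):
        edit_repr = self.edit_encoder(x_minus, x_plus)
        # Per-token log-probabilities of the ground-truth x+, shape (batch, len).
        log_probs = self.editor(x_minus, edit_repr, target=x_plus)
        return -log_probs.sum(dim=-1).mean()  # negative log-likelihood
```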

3.1 NEURAL EDITOR

As discussed above, α should use as much information as possible from x−, and hence, an encoder-decoder architecture with the ability to copy from the input is most appropriate. As we are primarily interested in edits on text and source code in this work, we explored two architectures: a sequence-to-sequence model for text, and a graph-to-tree model for source code, whose known semantics we can leverage both on the encoder as well as on the decoder side. Other classes of edits, for example, image manipulation, would most likely be better served by convolutional neural models.

Sequence-to-Sequence Neural Editor First, we consider a standard sequence-to-sequence model with attention (over the tokens of x−). The architecture of our sequence-to-sequence model is similar to that of Luong et al. (2015), with the difference that we use a bidirectional LSTM in the encoder and a token-level copying mechanism (Vinyals et al., 2015) that directly copies tokens into the decoded sequence. Whereas in standard sequence-to-sequence models the decoder is initialized with the representation computed by the encoder, we initialize it with the concatenation of encoder output and the edit representation. We also feed the edit representation as input to the decoder LSTM at each decoding time step. This allows the LSTM decoder to take the edit representation into consideration while generating the output sequence.
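A schematic sketch of this decoder modification follows. Dimensions follow the configuration in Sect. 4.1; attention and the copying mechanism are omitted for brevity, and the module is our illustration rather than the authors' code:

```python
import torch
import torch.nn as nn

class EditConditionedDecoder(nn.Module):
    """LSTM decoder that conditions on the edit representation at every step."""
    def __init__(self, vocab_size, emb_dim=128, enc_dim=256, edit_dim=512, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # Token embedding and edit representation are concatenated at each step.
        self.lstm = nn.LSTM(emb_dim + edit_dim, hidden_dim, batch_first=True)
        # Initial state from [biLSTM encoder summary; edit representation].
        self.init_proj = nn.Linear(enc_dim + edit_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_tokens, encoder_summary, edit_repr):
        batch, steps = prev_tokens.shape
        h0 = torch.tanh(self.init_proj(torch.cat([encoder_summary, edit_repr], dim=-1)))
        state = (h0.unsqueeze(0), torch.zeros_like(h0).unsqueeze(0))
        inputs = torch.cat(
            [self.embed(prev_tokens),
             edit_repr.unsqueeze(1).expand(batch, steps, -1)], dim=-1)
        output, _ = self.lstm(inputs, state)
        return self.out(output)  # per-step vocabulary logits (before copying)
```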

Figure 3: Sequence (a) and graph (b) representation of the edit of v.F = x + x to u = x + x.

Graph-to-Tree Neural Editor Our second model aims to take advantage of the additional structure of x− and x+. To achieve this, we combine a graph-based encoder with a tree-based decoder. We use T (x) to denote a tree representation of an element, e.g., the abstract syntax tree (AST) of a fragment of source code. We extend T (x) into a graph form G(x) by encoding additional relationships (e.g., the “next token” relationship between terminal nodes, etc.) (see Figure 2(a)). To encode the elements of G(x−) into vector representations, we use a gated graph neural network (GGNN) (Li et al., 2015). Similarly to recurrent neural networks for sequences (such as biRNNs), GGNNs compute a representation for each node in the graph, which can be used in the attention mechanisms of a decoder. Additionally, we use them to obtain a representation of the full input x−, by computing their weighted average following the strategy of Gilmer et al. (2017) (i.e., computing a score for each node, normalizing scores with a softmax, and using the resulting values as weights). Our tree decoder follows the semantic parsing model of Yin & Neubig (2018), which sequentially generates a tree T (x+) as a series of expansion actions a1 . . . aN. The probability of taking an action is modeled as p(at | a<t, s), where s is the input (a natural language utterance in the original semantic parsing setting) and a<t is the sequence of actions taken so far. Actions either expand the current non-terminal using a production rule of the grammar (EXPANDR³) or generate a terminal token (GENTERM).

We extend this model to our setting by replacing the input sequence s by x−; concretely, we condition the decoder on the graph-level representation computed for G(x−). Additionally, we use the change representation f∆(·) as an additional input to the LSTM initial state and at every decoding step. Based on the observation that edits to source code often manipulate the syntax tree by moving expressions around (e.g. by nesting statements in a conditional, or renaming a function while keeping its arguments), we extend the decoding model of Yin & Neubig (2018) by adding a facility to copy entire subtrees from the input. For this, we add a decoder action TREECP. This action is similar to the standard copying mechanism known from pointer networks (Vinyals et al., 2015), but instead of copying only a single token, it copies the whole subtree pointed to.

However, adding the TREECP action means that there are many correct generation sequences for a target tree. This problem appears in token-copying as well, but can be easily circumvented by marginalizing over all correct choices at each generation step (by normalizing the probability distribution over allowed actions to sum up those that have the same effect). In the subtree-copying setting, the lengths of action sequences representing different choices may differ. In our implementation, we handle this problem during training by simply picking the generation sequence that greedily selects TREECP actions.
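For the token-copying case, the marginalization step simply sums the probabilities of all actions with the same effect before taking the log. A small sketch of that step, assuming the action-to-effect mapping is given:

```python
import torch

def effect_log_probs(action_logits: torch.Tensor, effect_ids: torch.Tensor) -> torch.Tensor:
    """Sum probabilities of actions that produce the same effect.

    action_logits: (num_actions,) unnormalized scores over allowed actions.
    effect_ids:    (num_actions,) id of the effect each action produces,
                   e.g. generating token t and copying token t share one id.
    Returns log P(effect), indexed by effect id.
    """
    probs = torch.softmax(action_logits, dim=-1)
    num_effects = int(effect_ids.max()) + 1
    effect_probs = torch.zeros(num_effects).index_add_(0, effect_ids, probs)
    return torch.log(effect_probs + 1e-12)
```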

3.2 EDIT REPRESENTATION

To compute a useful edit representation, a model needs to focus on the differences between x− and x+. A risk in our framework is that f∆ degenerates into an encoder for x+, turning α into a decoder. To avoid this, we follow the standard autoencoder trick: it is important to limit the capacity of the result of f∆ by generating the edit representation in a low-dimensional space $\mathbb{R}^N$. This acts as a bottleneck and encodes only the information that is needed to reconstruct x+ from x−. We again experimented with both sequence-based and graph-based representations of edits.

Sequence Encoding of Edits Given x− (resp. x+) as a sequence of tokens $t_-^{(0)}, \ldots, t_-^{(T_-)}$ (resp. $t_+^{(0)}, \ldots, t_+^{(T_+)}$), we can use a standard (deterministic) diffing algorithm to compute an alignment of tokens in the two sequences. We then use extra symbols ∅ for padding, + for additions, − for deletions, ↔ for replacements, and = for unchanged tokens to generate a single sequence representing both x− and x+. This is illustrated in Figure 3(a). By embedding the three entries in each element of the sequence separately and concatenating their representations, they can be fed into a standard sequence encoder whose final state is our desired edit representation. In this work, we use a biLSTM.
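A minimal sketch of this alignment step, using Python's difflib as the (deterministic) diffing algorithm; the paper does not specify which algorithm is used, and a full implementation would further split unbalanced replacements into separate − and + entries, as in Figure 3(a):

```python
from difflib import SequenceMatcher

PAD = "<empty>"  # stands in for the padding symbol ∅

def align_edit(x_minus, x_plus):
    """Produce (tag, old_token, new_token) triples with tags '=', '-', '+', '<->'."""
    aligned = []
    for op, i1, i2, j1, j2 in SequenceMatcher(a=x_minus, b=x_plus).get_opcodes():
        if op == "equal":
            aligned += [("=", t, t) for t in x_minus[i1:i2]]
        elif op == "delete":
            aligned += [("-", t, PAD) for t in x_minus[i1:i2]]
        elif op == "insert":
            aligned += [("+", PAD, t) for t in x_plus[j1:j2]]
        else:  # "replace": pad the shorter side so the spans line up
            old, new = x_minus[i1:i2], x_plus[j1:j2]
            for k in range(max(len(old), len(new))):
                aligned.append(("<->",
                                old[k] if k < len(old) else PAD,
                                new[k] if k < len(new) else PAD))
    return aligned
```

On the edit of Figure 3, `align_edit("v . F = x + x".split(), "u = x + x".split())` produces a replacement entry for v → u, padded entries for the removed `. F`, and "=" entries for the shared suffix `= x + x`.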

³EXPANDR corresponds to the APPLYCONSTR action in the original model of Yin & Neubig (2018). There is also a REDUCE action which marks the end of expanding a non-terminal with a non-deterministic number of child nodes. See Yin & Neubig (2018) for details.


Graph Encoding of Edits As in the graph-to-tree neural editor, we represent x− and x+ as trees T (x−) and T (x+). We combine these trees into a graph representation G(x− → x+) by merging both trees into one graph, using “Removed”, “Added” and “Replaced” edges. To connect the two trees, we compute the same alignment as in the sequence case, connecting leaves that are the same and each replaced leaf to its replacement. We also propagate this information up in the trees, i.e., two inner nodes are connected by “=” edges if all their descendants are connected by “=” edges. This is illustrated in Figure 3(b). Finally, we also use the same “+” / “-” / “↔” / “=” tags for the initial node representation, computing it as the concatenation of the embeddings of the string label (i.e. token or non-terminal name) and the tag. To obtain an edit representation, we use a GGNN unrolled for a fixed number of timesteps and again use the weighted averaging strategy of Gilmer et al. (2017).
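The upward propagation of “=” edges can be implemented as a single post-order traversal. A sketch, assuming aligned leaves have already been tagged by the leaf-level alignment:

```python
class Node:
    def __init__(self, label, children=None, tag=None):
        self.label = label
        self.children = children or []   # empty for leaves (tokens)
        self.tag = tag                   # '=', '+', '-', '<->' for aligned leaves

def propagate_equality(node: Node) -> bool:
    """Mark an inner node '=' iff all of its descendants are marked '='."""
    if not node.children:                # leaf: tag comes from the alignment
        return node.tag == "="
    # Use a list (not a generator) so every subtree is visited, even after
    # a non-'=' child has already been found.
    all_equal = all([propagate_equality(c) for c in node.children])
    if all_equal:
        node.tag = "="
    return all_equal
```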

4 EVALUATION

Evaluating an unsupervised representation learning method is challenging, especially for a newly defined task. Here, we aim to evaluate the quality of the learned edit representations with a series of qualitative and quantitative metrics on natural language and source code.

4.1 DATASETS AND CONFIGURATION

Natural Language Edits We use the WikiAtomicEdits (Faruqui et al., 2018) dataset of pairs of short edits on Wikipedia articles. We sampled 1040K edits from the English insertion portion of the dataset and split the samples into 1000K/20K/20K train-valid-test sets.

Source Code Edits To obtain a dataset for source code, we cloned a set of 54 C# projects on GitHub and collected a GitHubEdits dataset (see Appendix A for more information). We selected all changes in the projects that are no more than 3 lines long and whose surrounding 3 lines of code before and after the edited lines have not been changed, ensuring that the edits are separate and short. We then parsed the two versions of the source code and took as x− and x+ the code that belongs to the top-most AST node that contains the edited lines. Finally, we removed trivial changes such as variable renaming, changes within comments or formatting changes. Overall, this yields 111,724 edit samples. For each edit we run a simple C# analysis to detect all variables and normalize variable names such that each unique variable within x− and x+ has a unique normalized name V0, V1, etc. This step is necessary to avoid the sparsity of data induced by the variety of different identifier naming schemes. We split the dataset into 91,372 / 10,176 / 10,176 samples as train/valid/test sets. Additionally, we introduce a labeled dataset of source code edits by using C# “fixers”. Fixers are small tools built on top of the C# compiler, used to perform common refactoring and modernization tasks (e.g., using new syntactic sugar). We selected 16 of these fixers and ran them on 6 C# projects to generate a small C#Fixers dataset of 2,878 edit pairs with known semantics. We present descriptions and examples of each fixer in Appendix A.
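A minimal sketch of the variable-normalization step; detecting which tokens are variables requires the C# analysis mentioned above, so the variable set is taken as given here, and the function is applied to the concatenation of x− and x+ so that both versions share one mapping:

```python
def canonicalize_variables(tokens, variable_names):
    """Rename each distinct variable to V0, V1, ... in order of first occurrence."""
    mapping, normalized = {}, []
    for tok in tokens:
        if tok in variable_names:
            if tok not in mapping:
                mapping[tok] = f"V{len(mapping)}"
            tok = mapping[tok]
        normalized.append(tok)
    return normalized, mapping

# Apply to x- and x+ jointly so both versions share one mapping, e.g.:
# joint, _ = canonicalize_variables(x_minus + ["<sep>"] + x_plus, {"me", "ue"})
```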

Configuration Throughout the evaluation we use a fixed size of 512 for edit representations. The size of word embeddings and hidden states of encoding LSTMs is 128. The dimensionality of the decoding LSTM is set to 256. Details of the model configuration can be found in Appendix A.

When generating the target x+, our neural editor model can optionally take as input the context of the original input x− (e.g., the preceding and succeeding code segments surrounding x−), whose information could be useful for predicting x+. For example, in source code edits the updated code snippet x+ may reuse variables defined in the preceding snippet. In our code experiments, we use a standard bidirectional LSTM network to encode the tokenized 3 lines of code before and after x− as context. The encoded context is used to initialize the decoder, and as an additional source for the pointer network to copy tokens from.

4.2 QUALITY OF EDIT REPRESENTATIONS

First, we study the ability of our models to encode edits in a semantically meaningful way.


Visualizing Edits on Fixers Data In a first experiment, we train our sequential neural editor model on our GitHubEdits data and then compute representations for the edits generated by the C# fixers. A t-SNE visualization (Maaten & Hinton, 2008) of the encodings is shown in Figure 4. For this visualization, we randomly selected 100 examples from the edits of each fixer (if that fixer has more than 100 samples) and discarded fixer categories with fewer than 40 examples. Readers are referred to Appendix A for detailed descriptions of each fixer category. We find that our model produces dense clusters for simple or distinctive code edits, e.g. fixer RCS1089 (using the ++ or -- unary operators instead of a binary operator, e.g., i = i + 1 → i++) and fixer CA2007 (adding .ConfigureAwait(false) to await statements). We also analyzed cases where (1) the edit examples from the same fixer are scattered, or (2) the clusters of different fixers overlap with each other. For example, the fixer RCS1077 covers 12 different aspects of optimizing LINQ method calls (e.g., type casting, counting, etc.), and hence its edits are scattered. On the other hand, fixers RCS1146 and RCS1206 yield overlapping clusters, as both fixers change code to use the ?. operator. Fixers RCS1207 (change a lambda to a method group, e.g. foo(x=>bar(x)) → foo(bar)) and RCS1021 (simplify lambda expressions, e.g. foo(x=>{return 4;}) → foo(x=>4)) are similar, as both inline lambda expressions in two different ways. Analysis yields that the representation is highly dependent on surface tokens. For instance, IDE0004 (removing redundant type casts, e.g. (int)2 → 2) and RCS1207 (removing explicit argument lists) yield overlapping clusters, as both involve deleting identifiers wrapped by parentheses.

Figure 4: t-SNE visualization of edits from 13 C# fixers, where point color indicates the fixer. Labels indicate the id of the fixer; see main text.

Human Evaluation on Encoding Natural Language WikiAtomicEdits In a second experiment, we test how well neighborhoods in edit representation space correspond to semantic similarity. We computed the five nearest neighbors of 200 randomly sampled seed edits from our training set, using both our trained sequence-to-sequence editing model with sequential edit encoder and a simple bag-of-words baseline based on TF-IDF scores. We then rated the quality of the retrieved neighbors on a scale of 0 (“unrelated edit”), 1 (“similar edit”) and 2 (“semantically or syntactically same edit”). Details of the annotation schema are included in Appendix E. We show the (normalized) discounted cumulative gain (DCG, Manning et al. (2008)) for the two models at the top of Tab. 1 (higher is better). The relevance scores indicate that our neural model clearly outperforms the simplistic baseline. Tab. 1 also presents two example edits with their nearest neighbors. Example 1 shows that the neural edit model succeeds in representing syntactically and semantically similar edits, while the bag-of-words baseline relies purely on surface token overlap. Interestingly, we also observed that the edit representations learned by the neural editing model on WikiAtomicEdits are somewhat sensitive to position, i.e. the position of the inserted tokens in both the seed edit and the nearest neighbors is similar. This is illustrated in Example 2, where the second (“senegalese striker”) and the third (“republican incumbent”) nearest neighbors returned by the neural model have similar editing positions as the seed edit, while they are semantically diverse.
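For reference, a sketch of the retrieval and scoring used in this style of evaluation; cosine similarity for nearest-neighbor search is our assumption, and `dcg` follows one standard formulation of the discounted cumulative gain over the human ratings:

```python
import numpy as np

def nearest_neighbors(edit_vecs: np.ndarray, seed: int, k: int = 5):
    """Return the indices of the k most similar edits to edit_vecs[seed]."""
    vecs = edit_vecs / np.linalg.norm(edit_vecs, axis=1, keepdims=True)
    sims = vecs @ vecs[seed]
    sims[seed] = -np.inf                 # exclude the seed edit itself
    return np.argsort(-sims)[:k]

def dcg(relevances):
    """Discounted cumulative gain of human ratings (0, 1 or 2), rank 0-indexed."""
    return sum(rel / np.log2(rank + 2) for rank, rel in enumerate(relevances))
```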

4.3 EDIT ENCODER PERFORMANCE

To evaluate the performance of our two edit encoders discussed in Sect. 3.2 and disentangle it from the choice of neural editor, we train various combinations of our neural editors and edit encoders and manually evaluate the quality of the resulting edit representations. More specifically, we trained our neural editor models on GitHubEdits, randomly sampled 200 seed edits, and computed their 3 nearest neighbors using each end-to-end model. We then rated the resulting groups using the same 0-2 scale as above. The resulting relevance scores are shown in Tab. 2.


Table 1: Natural language human evaluation results and 3 nearest neighbors. [Inserted text] is marked with brackets. Example 1: the neural editing model returns syntactically and semantically similar edits. Example 2: neural edit representations are sensitive to position.

DCG/NDCG@5 — Bag of Words Model: 9.3 / 67.3%; Seq2Seq – Seq Edit Encoder: 13.5 / 90.3%
DCG@5 (by edit size) — Bag of Words Model: 1: 14.7, 2-3: 10.8, >3: 5.4; Seq2Seq – Seq Edit Encoder: 1: 16.2, 2-3: 12.9, >3: 12.4

Example 1 (seed edit): [daniel james nava ( born february 22 , 1983 ) is an american professional baseball outfielder] nava is only the fourth player in mlb history to hit a grand slam in his first major league at bat and the second to do it on the first pitch .

Bag of Words Model
NN-1: he batted .302 with 73 steals , and received a september call - up to the major leagues [as an outfielder] .
NN-2: he played [as an outfielder] for the hanshin tigers .
NN-3: in 2012 , his senior at oak mountain , dahl had a .412 batting average , 34 runs batted in ( rbis ) , and 18 stolen bases [as an outfielder .]

Seq2Seq – Seq Edit Encoder
NN-1: [arthur ray briles ( born december 3 , 1955 ) is a former american football coach and] his most recent head coaching position was at baylor university , a position he held from the 2008 season through the 2015 season .
NN-2: [jonathan david disalvatore ( born march 30 , 1981 ) is a professional ice hockey] he was selected by the san jose sharks in the 4th round ( 104th overall ) of the 2000 nhl entry draft .
NN-3: [professor paul talalay ( born march 31 , 1923 ) is the john jacob abel] distinguished service professor of pharmacology and director of the laboratory for molecular sciences at johns hopkins school of medicine in baltimore .

Example 2 (seed edit): she , along with her follow artist carolyn mase studied with [impressionist landscape painter] john henry twachtman at the art students league of new york .

Bag of Words Model
NN-1: his brother was draughtsman william daniell and his uncle was [landscape painter] thomas daniell .
NN-2: william james linton ( december 7 , 1812 - december 29 , 1897 ) was an english - born american wood engraver , [landscape painter ,] political reformer and author of memoirs , novels , poetry and non-fiction .
NN-3: early on , hopper modeled his style after chase and french [impressionist] masters édouard manet and edgar degas .

Seq2Seq – Seq Edit Encoder
NN-1: the first painting was a portrait of a young girl , emerantia van beresteyn , the sister of [the landscape painter] nicolaes van beresteyn , the later founder of half of this hofje .
NN-2: he was the club ’s top scorer with 22 goals in all competitions , one more than [senegalese striker] lamine diarra , who left the club at the end of the season .
NN-3: caforio ” aggressively attacked ” his opponent , [republican incumbent] steve knight , for his delayed response to the leak .

Table 2: Relevance scores of human evaluation on GitHubEdits data. Acc.@1 denotes the ratio that the 1-nearest neighbor has a score of 2.

Model                              DCG@3   NDCG@3 (%)   Acc.@1 (%)
BoW                                 7.77     75.99        58.46
Seq2Seq – Seq Edit Encoder         10.09     90.05        75.90
Graph2Tree – Seq Edit Encoder      10.56     91.40        79.49
Graph2Tree – Graph Edit Encoder     9.44     86.20        72.31

Comparing the sequential edit encoders trained with the Seq2Seq and Graph2Tree editors, we found that the edit encoder trained with the Graph2Tree objective performs better. We hypothesize that this is because the Graph2Tree editor better captures structural-level information about an edit. For instance, Example 1 in Tab. 3 removes explicit type casting. The Seq2Seq editor has difficulty distinguishing this type of edit, confusing it with changes of lambda expressions to method groups (1st and 2nd nearest neighbors), since both types of edits involve removing paired parentheses. Surprisingly, we found that the graph-based edit encoder does not outperform the sequence-based encoder. However, we observe that the graph edit encoder in many cases tends to better capture high-level and abstract structural edit patterns. Example 2 in Tab. 3 showcases a seed edit that swaps two consecutive declarations, which corresponds to swapping the intermediate Expression nodes representing each statement on the underlying AST. In this case, the graph edit encoder is capable of grouping semantically similar edits, while it seems to be more difficult for the sequential encoder to capture the edit pattern. On the other hand, we found that the graph edit encoder often fails to capture simpler, lexical-level edits (e.g., Example 1). This might suggest that terminal node information is not effectively propagated, an interesting issue worth future investigation.


Table 3: Two example source code edits and their nearest neighbors based on the edit representations computed by each model.

Example 1 (seed edit):
x−: V0.SendSelectSoundRequest((int)V1);
x+: V0.SendSelectSoundRequest(V1);

Seq2Seq – Seq Edit Encoder
NN-1: x−: V0.Debug(() => LITERAL);  x+: V0.Debug(LITERAL);
NN-2: x−: V0.Debug(() => LITERAL);  x+: V0.Debug(LITERAL);
NN-3: x−: V0.WriteCompressedInteger((uint)V1);  x+: V0.WriteCompressedInteger(V1);

Graph2Tree – Seq Edit Encoder
NN-1: x−: V0.WriteCompressedInteger((uint)V1);  x+: V0.WriteCompressedInteger(V1);
NN-2: x−: V0.WriteCompressedInteger((uint)V1);  x+: V0.WriteCompressedInteger(V1);
NN-3: x−: V0.WriteCompressedInteger((uint)V1);  x+: V0.WriteCompressedInteger(V1);

Graph2Tree – Graph Edit Encoder
NN-1: x−: V0.UpdateLastRead(this.V1);  x+: V0.UpdateLastRead(V1);
NN-2: x−: V0.UpdateLastWrite(this.V1);  x+: V0.UpdateLastWrite(V1);
NN-3: x−: V0.Append(this.V1);  x+: V0.Append(V1);

Example 2 (seed edit):
x−: string V0; string V1;
x+: string V1; string V0;

Seq2Seq – Seq Edit Encoder
NN-1: x−: RetryConfig V0; string V1;  x+: string V1; RetryConfig V0;
NN-2: x−: string[] V0; string[] V1; int V2;  x+: int V2; string[] V0; string[] V1;
NN-3: x−: Type V0 = null; BindingFlags V1 = 0;  x+: BindingFlags V1 = 0; Type V0 = null;

Graph2Tree – Seq Edit Encoder
NN-1: x−: RetryConfig V0; string V1;  x+: string V1; RetryConfig V0;
NN-2: x−: string[] V0; string[] V1; int V2;  x+: int V2; string[] V0; string[] V1;
NN-3: x−: int V0 = V1; int V2 = V3;  x+: int V2 = V3; int V0 = V1;

Graph2Tree – Graph Edit Encoder
NN-1: x−: RetryConfig V0; string V1;  x+: string V1; RetryConfig V0;
NN-2: x−: int V0 = V1; int V2 = V3;  x+: int V2 = V3; int V0 = V1;
NN-3: x−: double V0 = -1; double V1 = -1;  x+: double V1 = -1; double V0 = -1;

Table 4: Test performance of different neural editors.

GitHubEdits
Model                               Acc.@1 (%)   Recall@5 (%)   PPL per token
Seq2Seq – Bag-of-Edits Encoder        44.05        54.97          1.4808
Seq2Seq – Seq Edit Encoder            59.63        65.46          1.2792
Graph2Tree – Bag-of-Edits Encoder     40.66        49.42          1.5058
Graph2Tree – Seq Edit Encoder         57.49        62.94          1.3043
Graph2Tree – Graph Edit Encoder       48.05        56.51          1.3712

WikiAtomicEdits
Seq2Seq – Bag-of-Edits Encoder        23.73        43.47          1.3730
Seq2Seq – Seq Edit Encoder            72.94        76.53          1.0527

4.4 PRECISION OF NEURAL EDITORS

Finally, we evaluate the performance of our end-to-end system by predicting the edited input x+ given x− and the edit representation. We are interested in answering two research questions: First, how well can our neural editors generate x+ given the gold-standard edit representation f∆(x−, x+)? Second, and perhaps more interestingly, can we use the representation of a similar edit f∆(x′−, x′+) to generate x+ by applying that edit to x− (i.e. x̂+ = α(x−, f∆(x′−, x′+)))? To answer the first question, we trained our neural editor models on the WikiAtomicEdits and the GitHubEdits dataset, and evaluate the performance of encoding and applying edits on the test sets. For completeness, we also evaluated the performance of our neural editor models with a simple “Bag-of-Edits” edit encoding scheme, where f∆(x−, x+) is modeled as the concatenation of two vectors, representing the sums of the embeddings of added and deleted tokens in the edit, respectively. This edit encoding method is reminiscent of the model used in Guu et al. (2017) for solving a different task of language modeling by marginalizing over latent edits, which we will elaborate on in Sect. 5. Tab. 4 lists the evaluation results.
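The bag-of-edits baseline is easy to state precisely; a sketch (here `embedding` is a token embedding table, and the added/deleted token ids come from a diff such as the one in Sect. 3.2):

```python
import torch
import torch.nn as nn

def bag_of_edits(embedding: nn.Embedding, added_ids, deleted_ids):
    """f_delta(x-, x+) = [sum of added-token embeddings; sum of deleted-token embeddings]."""
    dim = embedding.embedding_dim
    added = embedding(added_ids).sum(0) if added_ids.numel() else torch.zeros(dim)
    deleted = embedding(deleted_ids).sum(0) if deleted_ids.numel() else torch.zeros(dim)
    return torch.cat([added, deleted], dim=-1)  # order/position information is lost
```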


With our proposed sequence- and graph-based edit encoders, our neural editor models achieve reasonable end-to-end performance, surpassing systems using bag-of-edits representations. This is because many edits are context-sensitive and position-sensitive, requiring edit representation models that go beyond the bag-of-edits scheme to capture those effects (more analysis is included in Appendix B). Interestingly, on the GitHubEdits dataset, we find that the Seq2Seq editor with sequential edit encoder registers the best performance. However, it should be noted that in this set of experiments, we encode the gold-standard edit f∆(x−, x+) to predict x+. As we will show later, better performance with the gold-standard edit does not necessarily imply better (more generalizable) edit representations. Nevertheless, we hypothesize that the higher accuracy of the Seq2Seq editor is due to the fact that a significant proportion of edits in this dataset are small and syntactically simple; indeed, we find that 69% of test examples have a token-level edit distance of less than 5.

To answer the second question, we use the trained neural editors from the previous experiment and test their performance in a “one-shot” transfer learning scenario. Specifically, we use our high-quality C#Fixers dataset, and for each fixer category F of semantically similar edits, we randomly select a seed edit {x′− → x′+} ∈ F and use its edit representation f∆(x′−, x′+) to predict the updated code for all examples in F, i.e., we have x̂+ = α(x−, f∆(x′−, x′+)) for all {x− → x+} ∈ F. This task is highly non-trivial, since a fixer category can contain hundreds of edit examples collected from different C# projects. It therefore requires the edit representations to generalize and transfer well, while being invariant to local lexical information like specific method names. To make the experimental evaluation more robust to noise, for each fixer category F we randomly sample 10 seed edit pairs {x′− → x′+}, compute their edit representations, use them to predict the edited version of the examples in F, and evaluate the accuracy of predicting the exact final version. We then report the best score among the 10 seed representations as the performance metric on F.

Table 5: Transfer learning results on C# fixers data, averaged across all fixer categories. ∗: upper-bound performance of predicting x+ using the gold-standard edit representations.

Model                            Acc. (%)   Acc.∗ (%)   Recall@5 (%)   Recall@5∗ (%)
Seq2Seq – Seq Edit Encoder        38.35      77.67        41.50          83.84
Graph2Tree – Seq Edit Encoder     49.21      77.30        51.93          81.77
Baselines (no edit encoding)
Seq2Seq w/o Edit Encoder           7.07       —           14.29           —
Graph2Tree w/o Edit Encoder        8.81       —           11.90           —

Tab. 5 summarizes the results and also reports the upper-bound performance when using the gold-standard edit representation f∆(x−, x+) to predict x+, and an approximation of the “lower bound” accuracies using pre-trained Seq2Seq and Graph2Tree models without edit encoders. We found that our neural Graph2Tree editor with the sequential edit encoder significantly outperforms the Seq2Seq editor, even though Seq2Seq performs better when using gold-standard edit representations. This suggests that the edit representations learned with the Graph2Tree model generalize better, especially for edits discussed in Sect. 4.2 that involve syntactic variations like RCS1021 (lambda expression simplification, 7.8% vs. 30.7% for Seq2Seq and Graph2Tree) and RCS1207 (change lambdas to method groups, 7.1% vs. 26.2%). Interestingly, we also observe that Seq2Seq outperforms the Graph2Tree model for edits with trivial surface edit sequences, where the Graph2Tree model does not have a clear advantage. For example, on RCS1015 (use nameof operator, e.g. Exception("x") → Exception(nameof(x))), the accuracies for Seq2Seq and Graph2Tree are 40.0% (14/35) and 28.6% (10/35), respectively. We include more analysis of the results in Appendix C.
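In pseudocode, the per-category evaluation protocol reads as follows (a sketch: `editor` and `f_delta` stand for a trained editor/encoder pair, and predictions are compared by exact match):

```python
import random

def fixer_category_accuracy(category, editor, f_delta, num_seeds=10):
    """Best exact-match accuracy over num_seeds randomly chosen seed edits.

    category: list of (x_minus, x_plus) pairs sharing the same edit semantics;
    assumed to contain at least num_seeds examples.
    """
    best = 0.0
    for seed_minus, seed_plus in random.sample(category, num_seeds):
        edit_repr = f_delta(seed_minus, seed_plus)  # representation of the seed edit
        correct = sum(editor(x_minus, edit_repr) == x_plus
                      for x_minus, x_plus in category)
        best = max(best, correct / len(category))
    return best
```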

5 RELATED WORK

Edits have recently been considered in NLP, as they represent interesting linguistic phenomena in language modeling and discourse (Faruqui et al., 2018; Yang et al., 2017a). Specifically, Guu et al. (2017) present a generative model of natural language sentences via editing prototypes. Our work is similar to Guu et al. (2017) in that (1) the posterior edit encoding model in Guu et al. (2017) is similar to our baseline “bag-of-edits” encoder in Sect. 4.4, and (2) the sequence-to-sequence sentence generation model given the prototype and edit representation is reminiscent of our Seq2Seq editor. In contrast, our work directly focuses on discriminative learning of representing edits and applying the learned edits to both sequential (NL) and structured (code) data. Another similar line of research


is “retrieve-and-edit” models for text generation (Hashimoto et al., 2018), where given an input data x, the target prediction y is generated by editing a similar target y′ that is retrieved based on the similarity between its source x′ and the input x. While these models typically require an “editor” component to generate the output by exploiting the difference between similar inputs, they usually use the simpler bag-of-edits representations (Wu et al., 2019), or implicitly capture it via end-to-end neural networks (Contractor et al., 2018). To the best of our knowledge, there is no related work that classifies or otherwise explicitly represents the differences over similar inputs, with the exception of differential recurrent neural networks used for action recognition in videos (Veeriah et al., 2015; Zhuang et al., 2018). This is a substantially different task, as the data includes a temporal component as well.

Source code edits are a widely studied artifact. Specialized software, such as git, is widely used to store source code revision histories. Nguyen et al. (2013) studied the repetitiveness of source code changes by identifying identical types of changes using a deterministic differencing tool. In contrast, we employ a neural network to cluster similar changes together. Rolim et al. (2017) use such clusters to synthesize small programs that perform the edit. Similarly, Rolim et al. (2018) extract manually designed syntactic features from code and cluster over multiple changes to find repeatable edit rules, and Paletov et al. (2018) extract syntactic features specifically targeting edits in cryptography API protocols. In this work, we try to avoid hand-designed features and allow a neural network to learn the relevant aspects of a change by directly giving as input the original and final version of a changed code snippet.

Modeling tree generation with machine learning is an old problem that has been widely studied in NLP. Starting with Maddison & Tarlow (2014), code generation has also been considered as a tree generation problem. Close to our work is the decoder of Yin & Neubig (2017), which we use as the basis of our decoder. The work of Chen et al. (2018) is also related, since it provides a tree-to-tree model, but it focuses on learning a single translation task and cannot be used directly to represent multiple types of edits. Both Yin & Neubig (2017) and Chen et al. (2018) have a copying mechanism for single tokens, but our subtree copying mechanism is novel.

Autoencoders (see Goodfellow et al. (2016) for an overview) have a long history in machine learning. Variational autoencoders (Kingma & Welling, 2013) are similar to autoencoders, but instead of focusing on the learned representation, they aim to create accurate generative probabilistic models. Most (variational) autoencoders focus on encoding images, but there has been work that autoencodes sequences, such as text (Dai & Le, 2015; Bowman et al., 2015; Yang et al., 2017b) and graphs (Simonovsky & Komodakis, 2018; Liu et al., 2018). Conditional variational autoencoders (Sohn et al., 2015) have a form related to our model (with the exception of the KL term), but are studied as generative models, whereas we are primarily interested in the edit representation.

6 DISCUSSION & CONCLUSIONS

In this work, we presented the problem of learning distributed representations of edits. We believe that edit data is highly relevant and should be studied in more detail. While we have presented a set of initial models and metrics on the problem and obtained some first promising results, further development in both of these areas is needed. We hope that our work inspires others to work on this interesting problem in the future.

ACKNOWLEDGMENTS

We would like to thank Rachel Free for her insightful comments and suggestions.

REFERENCES

Ron Artstein and Massimo Poesio. Inter-coder agreement for computational linguistics. Computational Linguistics, 34:555–596, 2008.

Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015.

Xinyun Chen, Chang Liu, and Dawn Song. Tree-to-tree neural networks for program translation. arXiv preprint arXiv:1802.03691, 2018.


Danish Contractor, Vineet Kumar, Sachindra Joshi, and Gaurav Pandey. Exemplar encoder-decoder for neural conversation generation. In ACL, 2018.

Andrew M Dai and Quoc V Le. Semi-supervised sequence learning. In Advances in Neural Information Processing Systems, pp. 3079–3087, 2015.

Manaal Faruqui and Dipanjan Das. Identifying well-formed natural language questions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2018.

Manaal Faruqui, Ellie Pavlick, Ian Tenney, and Dipanjan Das. WikiAtomicEdits: A multilingual corpus of Wikipedia edits for modeling language and discourse. In EMNLP, 2018.

Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. In International Conference on Machine Learning (ICML), 2017.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.

Kelvin Guu, Tatsunori B Hashimoto, Yonatan Oren, and Percy Liang. Generating sentences by editing prototypes. arXiv preprint arXiv:1709.08878, 2017.

Tatsunori B. Hashimoto, Kelvin Guu, Yonatan Oren, and Percy S. Liang. A retrieve-and-edit framework for predicting structured outputs. In NeurIPS, 2018.

Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493, 2015.

Qi Liu, Miltiadis Allamanis, Marc Brockschmidt, and Alexander L. Gaunt. Constrained graph variational autoencoders for molecule design. In Neural Information Processing Systems (NIPS), 2018.

Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. In Proceedings of EMNLP, 2015.

Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.

Chris Maddison and Daniel Tarlow. Structured generative models of natural source code. In International Conference on Machine Learning (ICML), 2014.

Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA, 2008. ISBN 0521865719, 9780521865715.

Hoan Anh Nguyen, Anh Tuan Nguyen, Tung Thanh Nguyen, Tien N Nguyen, and Hridesh Rajan. A study of repetitiveness of code changes in software evolution. In Proceedings of the 28th IEEE/ACM International Conference on Automated Software Engineering, pp. 180–190. IEEE Press, 2013.

Rumen Paletov, Petar Tsankov, Veselin Raychev, and Martin Vechev. Inferring crypto API rules from code changes. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 450–464. ACM, 2018.

Reudismam Rolim, Gustavo Soares, Loris D’Antoni, Oleksandr Polozov, Sumit Gulwani, Rohit Gheyi, Ryo Suzuki, and Björn Hartmann. Learning syntactic program transformations from examples. In Proceedings of the 39th International Conference on Software Engineering, pp. 404–415. IEEE Press, 2017.

Reudismam Rolim, Gustavo Soares, Rohit Gheyi, and Loris D’Antoni. Learning quick fixes from code repositories. arXiv preprint arXiv:1803.03806, 2018.


Martin Simonovsky and Nikos Komodakis. GraphVAE: Towards generation of small graphs using variational autoencoders. arXiv preprint arXiv:1802.03480, 2018.

Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pp. 3483–3491, 2015.

Vivek Veeriah, Naifan Zhuang, and Guo-Jun Qi. Differential recurrent neural networks for action recognition. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4041–4049, 2015.

Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks. In Advances in Neural Information Processing Systems, pp. 2692–2700, 2015.

Yu Ping Wu, Furu Wei, Shaohan Huang, Zhoujun Li, and Ming Zhou. Response generation by context-aware prototype editing. 2019.

Diyi Yang, Aaron Halfaker, Robert Kraut, and Eduard Hovy. Identifying semantic edit intentions from revisions in Wikipedia. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 2000–2010, 2017a.

Zichao Yang, Zhiting Hu, Ruslan Salakhutdinov, and Taylor Berg-Kirkpatrick. Improved variational autoencoders for text modeling using dilated convolutions. arXiv preprint arXiv:1702.08139, 2017b.

Pengcheng Yin and Graham Neubig. A syntactic neural model for general-purpose code generation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 440–450, 2017.

Pengcheng Yin and Graham Neubig. TRANX: A transition-based neural abstract syntax parser for semantic parsing and code generation. In Conference on Empirical Methods in Natural Language Processing (EMNLP) Demo Track, 2018.

Naifan Zhuang, The Duc Kieu, Guo-Jun Qi, and Kien A Hua. Deep differential recurrent neural networks. arXiv preprint arXiv:1804.04192, 2018.


A DATASETS AND CONFIGURATION

WikiAtomicEdits We randomly sampled 1040K insertion examples from the English portion of the WikiAtomicEdits (Faruqui et al., 2018) dataset, with train, development and test splits of 1000K, 20K and 20K.

GitHubEdits We cloned the top 54 C# GitHub repositories based on their popularity (Tab. 8). For each commit in the master branch, we collected the previous and updated versions of the source code, and extracted all consecutive edits that are smaller than three lines and have at least three preceding and succeeding lines that have not been changed. We then filtered trivial changes, such as variable and identifier renaming and changes within comments. We also limited the number of tokens for each edit to be smaller than 100, and down-sampled edits whose frequency is larger than 30. Finally, we split the dataset by commit ids, ensuring that no edits in the training and testing (development) sets come from the same commit. Tab. 6 lists some statistics of the dataset.

Table 6: Statistics of the GitHubEdits dataset
Average num. tokens in x−: 16.4
Average num. tokens in x+: 17.0
Average edit distance: 5.0
Average size of AST for T (x−): 28.5
Average size of AST for T (x+): 29.4

C#Fixers We selected 16 C# fixers from Roslyn⁴ and Roslynator⁵, and ran them on 6 C# projects to generate a small, high-quality C# fixers dataset of 2,878 edit pairs with known semantics. Table 7 lists detailed descriptions of each fixer category; more information can be found at https://github.com/JosefPihrt/Roslynator/blob/master/src/Analyzers/README.md.

Network Configuration Throughout the experiments, we use a fixed edit representation size of 512. The dimensionality of the word embeddings, the hidden states of the encoder LSTMs, and the gated graph neural network is 128, while the decoder LSTM uses a larger hidden size of 256. For the graph-based edit encoder, we used a two-layer graph neural network, with 5 information propagation steps at each layer. During training, we performed early stopping, choosing the best model based on perplexity scores on the development set. During testing, we decode a target element x+ using a beam size of 5.

⁴http://roslyn.io
⁵https://github.com/JosefPihrt/Roslynator


Table 7: Descriptions of fixer categories in the C#Fixers dataset

CA2007: apply .ConfigureAwait(false) to await statements (1051 edits)
  x−: await Console.WriteAsync()
  x+: await Console.WriteAsync().ConfigureAwait(false)

IDE0004: cast is redundant (53 edits)
  x−: var x = 1; var b = (int)x;
  x+: var x = 1; var b = x;

RCS1015: use nameof operator (35 edits)
  x−: Exception("parameter");
  x+: Exception(nameof(parameter));

RCS1021: simplify lambda expression (411 edits)
  x−: var x = items.Select(f => { return f.ToString(); });
  x+: var x = items.Select(f => f.ToString());

RCS1032: remove redundant parentheses (24 edits)
  x−: if ((x)) {}
  x+: if (x) {}

RCS1058: use compound assignment (43 edits)
  x−: i = i + 2;
  x+: i += 2;

RCS1077: optimize LINQ method call (200 edits)
  x−: items.Where(f => Foo(f)).Any();
  x+: items.Any(f => Foo(f));

RCS1089: use --/++ operator instead of assignment (75 edits)
  x−: i = i + 1;
  x+: i++;

RCS1097: remove redundant ToString call (20 edits)
  x−: var x = s.ToString();
  x+: var x = s;

RCS1118: mark local variable as const (477 edits)
  x−: string s = "a"; string s2 = s + "b";
  x+: const string s = "a"; string s2 = s + "b";

RCS1123: add parentheses according to operator precedence (109 edits)
  x−: if (x || y && z) {}
  x+: if (x || (y && z)) {}

RCS1146: use conditional access (71 edits)
  x−: x != null && x.StartsWith("a");
  x+: x?.StartsWith("a");

RCS1197: optimize call of StringBuilder's Append/AppendLine (95 edits)
  x−: sb.Append(s + "x");
  x+: sb.Append(s).Append("x");

RCS1202: avoid NullReferenceException (56 edits)
  x−: items.First().ToString();
  x+: items?.First().ToString();

RCS1206: use conditional access instead of conditional expression (116 edits)
  x−: int i = (x != null) ? x.Value.GetHashCode() : 0;
  x+: int i = x?.GetHashCode() ?? 0;

RCS1207: use method group instead of anonymous function (42 edits)
  x−: items.Select(f => Foo(f));
  x+: items.Select(Foo);


Table 8: Our C# GitHub dataset projects

Name | GitHub Id | Description
acat | intel/acat | Assistive Context-Aware Toolkit
akka.net | akka/akka.net | Distributed Actors
aspnetboilerplate | aspnetboilerplate/aspnetboilerplate | ASP.NET boilerplate
AutoMapper | AutoMapper/AutoMapper | Object-Object Mapper
BotBuilder | Microsoft/BotBuilder | Bot Framework
CefSharp | cefsharp/CefSharp | Chromium Embedded Framework Bindings
choco | chocolatey/choco | Package manager
cli | dotnet/cli | .NET CLI Tools
CodeHub | CodeHubApp/CodeHub | iOS application
coreclr | dotnet/coreclr | .NET Framework
corefx | dotnet/corefx | .NET Foundational Libraries
dapper | StackExchange/Dapper | Object Mapper
dnSpy | 0xd4d/dnSpy | .NET debugger and assembly editor
duplicati | duplicati/duplicati | Encrypted Cloud Backups
EntityFramework | aspnet/EntityFramework | Object-Relational Mapper
EntityFrameworkCore | aspnet/EntityFrameworkCore | Object-Relational Mapper – Core
FluentValidation | JeremySkinner/FluentValidation | Validation Rules
framework | accord-net/framework | ML, CV Framework
GVFS | Microsoft/VFSForGit | Git Virtual File System
Hangfire | HangfireIO/Hangfire | Background job library
ILSpy | icsharpcode/ILSpy | .NET decompiler
JavaScriptServices | aspnet/JavaScriptServices | ASP.NET JS Services
MahApps.Metro | MahApps/MahApps.Metro | WPF Framework
MaterialDesignInXamlToolkit | MaterialDesignInXamlToolkit/MaterialDesignInXamlToolkit | Material Design XAML & WPF
mono | mono/mono | .NET implementation
monodevelop | mono/monodevelop | IDE
MonoGame | MonoGame/MonoGame | Game Framework
msbuild | Microsoft/msbuild | Build Tool
Mvc | aspnet/Mvc | MVC Framework
Nancy | NancyFx/Nancy | HTTP based services
Newtonsoft.Json | JamesNK/Newtonsoft.Json | JSON framework
NLog | NLog/NLog | Logging for .NET
OpenLiveWriter | OpenLiveWriter/OpenLiveWriter | Text editor
OpenRA | OpenRA/OpenRA | Strategy Game Engine
Opserver | opserver/Opserver | Monitoring System
orleans | dotnet/orleans | Distributed Virtual Actors
PowerShell | PowerShell/PowerShell | Command Line
Psychson | brandonlw/Psychson | Firmware
PushSharp | Redth/PushSharp | Push Notifications
ravendb | ravendb/ravendb | Database
ReactiveUI | reactiveui/ReactiveUI | Reactive MVC Framework
RestSharp | restsharp/RestSharp | HTTP/REST Client
roslyn | dotnet/roslyn | .NET Compiler
Rx.NET | dotnet/reactive | Reactive extensions
ServiceStack | ServiceStack/ServiceStack | Web Service Framework
shadowsocks-windows | shadowsocks/shadowsocks-windows | Cryptography
ShareX | ShareX/ShareX | Screen Recorder
SignalR | SignalR/SignalR | Real-time web framework
Sonarr | Sonarr/Sonarr | PVR
SpaceEngineers | KeenSoftwareHouse/SpaceEngineers | Game
SparkleShare | hbons/SparkleShare | File Sharing
StackExchange.Redis | StackExchange/StackExchange.Redis | Redis Client
WaveFunctionCollapse | mxgmn/WaveFunctionCollapse | Bitmap/tilemap Generator
Wox | Wox-launcher/Wox | Launcher


B CLUSTERING EXPERIMENTS

To qualitatively evaluate the quality of the learned edit representations, we use the models trained on the WikiAtomicEdits and GitHubEdits datasets to cluster natural language and code edits. We run the K-Means clustering algorithm on 0.5 million sampled edits from WikiAtomicEdits and on all 90K code edits from GitHubEdits, producing 50,000 and 20,000 clusters, respectively. Tab. 9 and Tab. 10 list some example clusters on the WikiAtomicEdits and GitHub datasets, respectively. Due to the size of the clusters, we omit outliers and present distinctive examples from each cluster. On the WikiAtomicEdits dataset, we found clusters whose examples are semantically and syntactically similar. More interestingly, on the source code data, we find representative clusters that relate to idiomatic patterns and best practices of programming. The clustering results produced by our model would be useful for program synthesis toolkits to generate interpretable code refactoring rules, which we leave as interesting future work. Finally, we remark that the clustering results indicate that the encoding of edits is context-sensitive and position-sensitive for both natural language and source code data. For instance, the WikiAtomicEdits examples we present in Tab. 9 clearly indicate that semantically similar insertions also share similar editing positions. This is even more visible in code edits (Tab. 10). For instance, in the first example in Tab. 10, Equal() can be changed to Empty() only in the Assert namespace (i.e., the context). These examples demonstrate that it is important for an edit encoder to capture the contextual and positional information in edits, a property that cannot be captured by simple “bag-of-edits” edit representation methods.
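The clustering step itself is straightforward once the edits are encoded; a sketch with scikit-learn, using the cluster counts from the text:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_edits(edit_vecs: np.ndarray, num_clusters: int):
    """Group encoded edits into clusters of (hopefully) similar edits."""
    kmeans = KMeans(n_clusters=num_clusters, n_init=10, random_state=0)
    labels = kmeans.fit_predict(edit_vecs)
    # Map each cluster id to the indices of its member edits for inspection.
    return {c: np.flatnonzero(labels == c) for c in range(num_clusters)}

# e.g. cluster_edits(wiki_edit_vecs, 50_000) for the WikiAtomicEdits sample
```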


Table 9: Example clusters on WikiAtomicEdits data using representations learned by a neural Seq2Seq editor with sequential edit encoder. [Inserted text] is marked with brackets.

Description: Add a person's middle name
1. isaiah [marcus] rankin ( born 22 may 1978 in london ) is an english professional footballer currently playing for stevenage borough .
2. audrey [kathleen] brown ( born 24 may , 1913 ) is a british athlete who competed mainly in the 100 metres .
3. alice [edith] rumph was a painter , etcher , and teacher .
4. mark [larry] taufua is an australian professional rugby league player .
5. monique [edith] lamoureux ( born july 3 , 1989 ) is an american ice hockey player .

Description: Add a parenthetical expression "also ... as" to modify the subject
1. mid-state regional airport [, also known as mid-state airport ,] is a small airport on in rush township , centre county in pennsylvania in the united states .
2. islamic culture [, also known as saracenic culture ,] is a term primarily used in secular academia to describe the cultural practices common to historically islamic peoples .
3. birds of prey [, also known as raptors ,] are birds that hunt for food primarily via flight , using their keen senses , especially vision .
4. tetyana styazhkina [, also written as tetyana stiajkina ,] ( ; born april 10 , 1977 ) is a ukrainian cycle racer who rides for the chirio forno d'asolo team .
5. acid jazz [, also known as club jazz ,] is a musical genre that combines elements of jazz , soul , funk , disco and hip hop .

Description: Specify location using a prepositional phrase
1. the douro fully enters portuguese territory just after the confluence with the gueda river ; once the douro enters portugal , major population centres are less frequent [along the river] .
2. mochou lake and mochou lake park are located at 35 hanzhongmen da jie in the jianye district of nanjing , china [, west to qinhuai river] .
3. reiner gamma is an albedo feature that is located on the oceanus procellarum , to the west of the reiner crater [on the moon] .
4. she made a brief return to the screen in " parrish " ( 1961 ) , playing the supporting role of mother which received little attention [by the press] .
5. he was involved in a few storylines , including one where he broke his toe and had a heart attack after he was pushed by a mugger [in the market] .

Description: Add positional or temporal clause
1. [at the time] ajax and hercules were trapped behind a landslide at the gaillard cut , both were working to clear the landslide .
2. [at the docks ,] hikaru attempts to befriend the tiger , but finds that it dislikes humans .
3. [about the second ,] i do know they exist , but the question is whether they are considered a genre outside of japan .
4. [in the battle ,] shirou uses his reality marble , unlimited blade works and defeats gilgamesh .
5. [in the game ,] red is a curious 11 - year - old boy from pallet town .


Table 10: Example clusters on GitHubEdits data using representations learned by a Graph2Tree editor with sequential edit encoder. Locally defined variable names are canonicalized.

Description: Switch from Assert.Equal to Assert.Empty

x− Assert.Equal(0, V0.ProjectIds.Count);

x+ Assert.Empty(V0.ProjectIds);

x− Assert.Equal(0, V0.ProjectReferences.Count());

x+ Assert.Empty(V0.ProjectReferences);

x− Assert.Equal(0, V0.TrustedSelectionPaths.Count);

x+ Assert.Empty(V0.TrustedSelectionPaths);

x− Assert.Equal(0, V0.Count);

x+ Assert.Empty(V0);

x− Assert.Equal(0, V0.Messages.Count);

x+ Assert.Empty(V0.Messages);

Description: Use conditional access

x− Type V0 = V1 == null ? null : V1.GetType();

x+ Type V0 = V1?.GetType();

x− V0 = ((V1 != null) ? V1.Operand : null) as MemberExpression;

x+ V0 = V1?.Operand as MemberExpression;

x− string V0 = V1 == null ? null : V1.GetType().Name;

x+ string V0 = V1?.GetType().Name;

x− var V0 = V1 == null ? null : V1(V2).ToArray();

x+ var V0 = V1?.Invoke(V2).ToArray();

Description: Optimize LINQ queries

x− var V0 = V1.Customers.Where(V2 => V2.CustomerID == LITERAL).FirstOrDefault();

x+ var V0 = V1.Customers.FirstOrDefault(V2 => V2.CustomerID == LITERAL);

x− var V0 = V1.TypeConverters.Where(V2 => V2.CanConvertTo(V3, V1)).FirstOrDefault();

x+ var V0 = V1.TypeConverters.FirstOrDefault(V2 => V2.CanConvertTo(V3, V1));

x− var V0 = this.V1.Where(V2 => V2.CanDeserialize(V3)).FirstOrDefault();

x+ var V0 = this.V1.FirstOrDefault(V2 => V2.CanDeserialize(V3));

x− var V0 = V1.Where(V2 => V2.Item1 == V3 && V2.Item2 == V4).FirstOrDefault();

x+ var V0 = V1.FirstOrDefault(V2 => V2.Item1 == V3 && V2.Item2 == V4);

Description: Change from the Add function to an indexer

x− V0.Add(V1.key, V1.V2);

x+ V0[V1.key] = V1.V2;

x− V0.Add(V1.Id, V2);

x+ V0[V1.Id] = V2;

x− V0.Add(V1.Etag, V1);

x+ V0[V1.Etag] = V1;

x− V0.Add(V1.V2, V3.Merge(V1.V4));

x+ V0[V1.V2] = V3.Merge(V1.V4);


Table 11: Break-down performance results on the transfer learning task. See Tab. 7 for descriptions of each fixer category.

                  Graph2Tree — Seq Edit Encoder                      Seq2Seq — Seq Edit Encoder
Fixer ID    Acc.(%)  Acc.∗(%)  Recall@5(%)  Recall@5∗(%)    Acc.(%)  Acc.∗(%)  Recall@5(%)  Recall@5∗(%)
CA2007       88.0     89.2      88.2          94.3            52.7     91.9      61.0          93.8
IDE0004      69.8     92.5      73.6          94.3            45.3     98.1      45.3          98.1
RCS1015      28.6     82.9      40.0          82.9            40.0     71.4      42.9          71.4
RCS1021      30.7     60.8      33.3          67.6             7.8     56.2      17.8          72.3
RCS1032       8.3     37.5       8.3          45.8            20.8     45.8      20.8          45.8
RCS1058      93.0     88.4      95.3          90.7            37.2     69.8      39.5          76.7
RCS1077       6.5     69.5       6.5          74.0             7.5     84.0       7.5          84.5
RCS1089      96.0     98.7      98.7          98.7            76.0     98.7      76.0          98.7
RCS1097      15.0     90.0      15.0          90.0            25.0     90.0      25.0          95.0
RCS1118      95.4     98.1      99.6          99.6            93.7     99.6      98.7         100.0
RCS1123      66.1     81.7      68.8          86.2            64.2     87.2      65.1          94.5
RCS1146      54.9     81.7      56.3          85.9            45.1     76.1      57.7          91.5
RCS1197       5.3     25.3       5.3          33.7            12.6     40.0      12.6          50.0
RCS1202      28.6     67.9      37.5          75.0            28.6     69.6      32.1          80.4
RCS1206      75.0     99.1      75.9          99.1            50.0    100.0      50.0         100.0
RCS1207      26.2     73.8      28.6          90.5             7.1     64.3      11.9          88.1

∗ : upper-bound performance of predicting x+ using the gold-standard edit representations.

C BREAK-DOWN ANALYSIS OF TRANSFER LEARNING RESULTS

Tab. 11 lists the detailed evaluation results for the transfer learning experiments discussed in Sect. 4.4. We refer readers to Tab. 7 for detailed descriptions of each fixer category. The neural Graph2Tree editor outperforms the Seq2Seq editor (both with sequential edit encoders) on 10 out of 16 fixer categories. However, we found that there are categories where both end-to-end systems under-perform, even though the upper-bound accuracy is high (e.g., RCS1077, RCS1197, RCS1207, RCS1032). While improving the generalization ability of the neural editor models to achieve better transfer learning performance is important future work, we remark that this task is indeed non-trivial. First, some fixer categories cover a broad range of similar edits, which cannot be captured by a single seed edit. Second, some categories contain syntactically or semantically complex refactoring rules. For instance, RCS1207 converts method groups into anonymous functions, involving changes at multiple positions of the source code, which might not be trivially captured by the sequential edit encoder from a single example edit. Additionally, RCS1197 requires reasoning about a chain of expressions: it turns sb.Append(s1 + s2 + ... + sN) into sb.Append(s1).Append(s2).[...]Append(sN), which our current models are unable to do. More interestingly, we found that there are cases where the edits are syntactically simple, but semantically more difficult to learn. For instance, RCS1032 removes redundant parentheses from expressions. Although the edit pattern might seem syntactically simple at the AST level (replacing a ParenthesizedExpressionSyntax node with its child node), determining which pair of parentheses is actually redundant in an expression (e.g., (a + b) * (c / d)) is semantically non-trivial to learn from a single edit example. We believe that further advances in (general) learning from source code are required to correctly handle these cases.
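To make the evaluation protocol behind Tab. 11 concrete, the following is a minimal sketch of the one-shot transfer loop and its gold-representation upper bound. It is written in Python against assumed model interfaces: edit_encoder.encode and neural_editor.apply are illustrative names, not part of our released code.

```python
# Sketch of the one-shot transfer evaluation (assumed interfaces; the method
# names encode() and apply() are illustrative only).
def one_shot_transfer(edit_encoder, neural_editor, seed_edit, test_pairs, k=5):
    """Apply the representation of a single seed edit to unseen inputs.

    seed_edit:  one (x_minus, x_plus) example of the fixer category.
    test_pairs: held-out (x_minus, x_plus) pairs from the same category.
    Returns accuracy@1 and recall@k over the test pairs.
    """
    f_delta = edit_encoder.encode(*seed_edit)  # edit representation of the seed
    hits_at_1, hits_at_k = 0, 0
    for x_minus, x_plus in test_pairs:
        # Beam-decode k candidate outputs conditioned on x_minus and f_delta.
        candidates = neural_editor.apply(x_minus, f_delta, beam_size=k)
        hits_at_1 += int(candidates[0] == x_plus)
        hits_at_k += int(x_plus in candidates)
    n = len(test_pairs)
    return hits_at_1 / n, hits_at_k / n

# The starred upper bound in Tab. 11 instead encodes each test pair's own gold
# edit, i.e. f_delta = edit_encoder.encode(x_minus, x_plus) inside the loop.
```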

D IMPACT OF TRAINING SET SIZE

To evaluate the data efficiency of our proposed approach, we tested the end-to-end performance of our neural editor models (Sect. 4.4, Tab. 4) with varying amounts of training data. Tab. 12 lists the results. We find that both the Graph2Tree and Seq2Seq editors are relatively data-efficient: with only 60% of the training data, they achieve around 90% of the accuracy obtained with the full training set.
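As a concrete sketch of how such an ablation can be run, the snippet below retrains the editor on growing subsets of the training data. The helper names train_editor and evaluate are assumptions for illustration, and the use of nested subsets is one plausible design choice, not necessarily the one used in our experiments.

```python
# Minimal sketch of a training-size ablation (assumed helpers train_editor and
# evaluate are illustrative, not part of our code release).
import random

def ablate_training_size(train_data, fractions=(0.2, 0.4, 0.6, 0.8, 1.0), seed=0):
    """Retrain the neural editor on growing subsets of the training data."""
    data = list(train_data)
    random.Random(seed).shuffle(data)  # shuffle once so subsets are nested
    results = {}
    for frac in fractions:
        subset = data[: int(frac * len(data))]  # 20% subset of 40% subset, etc.
        model = train_editor(subset)            # assumed training entry point
        results[frac] = evaluate(model)         # assumed: Acc@1, Recall@5, PPL
    return results
```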

E DETAILS OF HUMAN EVALUATION

As discussed in Sect. 4.2, we performed a human evaluation to rate the quality of neighboring edits given a seed edit. The annotation instructions for the GitHubEdits and WikiAtomicEdits datasets are listed below (Tab. 13 and Tab. 14). The annotation was carried out by three authors of this paper, and we anonymized the source of the systems that generated the output. The three-way Fleiss’ kappa inter-rater agreement is κ = 0.55, which indicates moderate agreement (Artstein & Poesio, 2008), an agreement level also observed in other annotation tasks in NLP (Faruqui & Das, 2018).
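For reference, the reported agreement statistic can be computed with the standard Fleiss’ kappa formula. Below is a minimal, self-contained Python sketch; the example rating matrix is illustrative, not our annotation data.

```python
# Minimal sketch of three-way Fleiss' kappa over ratings in {0, 1, 2}.
# ratings[i][c] counts how many of the raters gave item i the rating c.
import numpy as np

def fleiss_kappa(ratings: np.ndarray) -> float:
    """ratings: (num_items, num_categories) counts; each row sums to num_raters."""
    n_raters = ratings.sum(axis=1)[0]
    # Per-item observed agreement.
    p_i = ((ratings ** 2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()
    # Chance agreement from the marginal category proportions.
    p_j = ratings.sum(axis=0) / ratings.sum()
    p_e = (p_j ** 2).sum()
    return (p_bar - p_e) / (1 - p_e)

# Illustrative usage: 4 items, 3 raters, rating scale {0, 1, 2}.
# counts = np.array([[0, 1, 2], [3, 0, 0], [0, 3, 0], [1, 1, 1]])
# print(fleiss_kappa(counts))
```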


Table 12: Test performance of end-to-end experiments with varying amounts of training data.

GitHubEdits — Graph2Tree with Seq Edit Encoder
Training Set Size   Acc.@1 (%)   Recall@5 (%)   PPL per token
20%                 43.88        50.53          1.5703
40%                 50.44        56.63          1.4152
60%                 53.78        60.00          1.3720
80%                 55.51        60.85          1.3392
100%                57.49        62.94          1.3043

WikiAtomicEdits — Seq2Seq with Seq Edit Encoder
Training Set Size   Acc.@1 (%)   Recall@5 (%)   PPL per token
20%                 42.87        48.24          1.4123
40%                 57.72        62.31          1.1812
60%                 65.22        69.62          1.1070
80%                 68.44        73.34          1.0751
100%                72.94        76.53          1.0527

Table 13: Annotation Instructions for GitHubEdits Data

Rating 2: Semantically and Syntactically Equivalent. The changed constituents in the seed edit and the neighboring edit are applied to similar positions of the original input, serving the same syntactic and semantic role. Examples:

• Seed Edit
  x− var V0 = V1.Where(V2 => V2.Name == LITERAL).Single();
  x+ var V0 = V1.Single(V2 => V2.Name == LITERAL);
• Neighbor
  x− var V0 = V1.GetMembers().Where(V2 => V2.Kind == SymbolKind.Property).Single();
  x+ var V0 = V1.GetMembers().Single(V2 => V2.Kind == SymbolKind.Property);

• Seed Edit
  x− Type V0 = V1 == null ? typeof(object) : V1.GetType();
  x+ Type V0 = V1?.GetType() ?? typeof(object);
• Neighbor
  x− string V0 = V1 == null ? string.Empty : V1.ToString();
  x+ string V0 = V1?.ToString() ?? string.Empty;

• Seed Edit
  x− Assert.True(Directory.Exists(V0) == V1);
  x+ Assert.Equal(Directory.Exists(V0), V1);
• Neighbor
  x− Assert.True(V0.GetString(V0.GetBytes(LITERAL)) == V1.ContainingAssembly.Identity.CultureName);
  x+ Assert.Equal(V0.GetString(V0.GetBytes(LITERAL)), V1.ContainingAssembly.Identity.CultureName);

Rating 1: Syntactically or Semantically Related. The seed and neighboring edits share functionally or syntactically similar patterns. Examples:


The following edit is a related edit of the first example above, as it applies the same simplification (.Where(COND).Func() to .Func(COND)), but for FirstOrDefault instead of Single:

• Seed Edit
  x− var V0 = V1.Where(V2 => V2.Name == LITERAL).Single();
  x+ var V0 = V1.Single(V2 => V2.Name == LITERAL);
• Neighbor
  x− var V0 = V1.Where(V2 => V3.ReportsTo == V2.EmployeeID).FirstOrDefault();
  x+ var V0 = V1.FirstOrDefault(V2 => V3.ReportsTo == V2.EmployeeID);

The following edit is a related edit of the second example above, as it also replaces a ternary expression for null checking with the ?. and ?? operators:

• Seed Edit
  x− Type V0 = V1 == null ? typeof(object) : V1.GetType();
  x+ Type V0 = V1?.GetType() ?? typeof(object);
• Neighbor
  x− var V0 = V1 != null ? V1.ToList() : new List();
  x+ var V0 = V1?.ToList() ?? new List();

We also considered pairs such as the following related, since they share a similar syntactic structure:

• Seed Edit
  x− V0.State = V1;
  x+ V0.SetState(V1);
• Neighbor
  x− V0.Quantity = V1;
  x+ V0.SetQuantity(V1);

Rating 0: Not Related. The seed and neighboring edits are not related based on the above criteria.


Table 14: Annotation Instructions for WikiAtomicEdits Data. Edited spans are shown in brackets [ ... ].

Rating 2: Semantically and Syntactically Equivalent. The changed constituents in the seed edit and the neighboring edit are applied to similar positions of the original sentence, serving the same syntactic and semantic role. For example:

• Seed Edit: chaz guest ( born [1961] ) was born in niagra falls , ... , a decorated hero in wwii in europe , including the purple heart .
  Neighbor: randal l. schwartz ( born november 22 , [1961] ) , also known as merlyn , is an american author , system administrator and programming consultant .

• Seed Edit: he was elected to donegal county council for sinn féin in 1979 , and held his seat until his death [at age 56] .
  Neighbor: davis graduated from high school in january 1947 , immediately enrolling at wittenberg college in rural ohio [at age 17] .

• Seed Edit: [dror] feiler served as a paratrooper in the israel defense forces .
  Neighbor: [nagaur] fort - sandy fort ; centrally located ; 2nd century old ; witnessed many battles ; lofty walls & spacious campus ; having many palaces & temples inside .

• Seed Edit: the original old bay house , home of the chief factor , still exists [and is now part of the fort vermilion national historic site] .
  Neighbor: the population was 6,400 at the 2010 census [and is part of the st. louis metropolitan area] .

Rating 1: Syntactically Related. The changed constituents in the seed and the neighboring edit are applied to similar positions of the original sentence, and they play similar syntactic roles. This includes examples such as adding a disjunction, a complement, a prepositional clause, or other syntactic constructs with similar phrases or language structures. For example:

• Seed Edit: the douro fully enters portuguese territory just after the confluence with the gueda river ; once the douro enters portugal , major population centres are less frequent [along the river] .
  Neighbor: she made a brief return to the screen in ” parrish ” ( 1961 ) , playing the supporting role of mother which received little attention [by the press] .

• Seed Edit: when they found it , they discovered a group of pagumon living there instead who immediately proceeded to treat the digidestined as honored guests [, saying that pagumon are the fresh form of koromon] .
  Neighbor: in 2012 slote and his baseball book ” jake ” were the subject of an espn ( 30 for 30 ) short documentary in which slote describes his writing process and reads from the book [, saying it is his best writing] .

• Seed Edit: the aircraft was intended to be [certified and] supplied as a complete ready - to - fly - aircraft for the flight training and aerial work markets .
  Neighbor: in june reinforcements finally did arrive when [provincial and] militia units from new york , new jersey , and new hampshire were sent up from fort edward by general daniel webb .

Rating 0: Not Related. The seed and neighboring edits are not related based on the above criteria.
