DeepMerge: Learning to Merge Programs
Elizabeth Dinella, Member, IEEE, Todd Mytkowicz, Member, IEEE, Alexey Svyatkovskiy, Member, IEEE, Christian Bird, Distinguished Scientist, IEEE, Mayur Naik, Member, IEEE, and Shuvendu Lahiri, Member, IEEE
Abstract—In collaborative software development, program merging is the mechanism to integrate changes from multiple programmers. Merge algorithms in modern version control systems report a conflict when changes interfere textually. Merge conflicts require manual intervention and frequently stall modern continuous integration pipelines. Prior work found that, although costly, a large majority of resolutions involve re-arranging text without writing any new code. Inspired by this observation, we propose the first data-driven approach to resolve merge conflicts with a machine learning model. We realize our approach in a tool, DEEPMERGE, that uses a novel combination of (i) an edit-aware embedding of merge inputs and (ii) a variation of pointer networks to construct resolutions from input segments. We also propose an algorithm to localize manual resolutions in a resolved file and employ it to curate a ground-truth dataset comprising 8,719 non-trivial resolutions in JavaScript programs. Our evaluation shows that, on a held-out test set, DEEPMERGE can predict correct resolutions for 37% of non-trivial merges, compared to only 4% by a state-of-the-art semistructured merge technique. Furthermore, on the subset of merges with up to 3 lines (comprising 24% of the total dataset), DEEPMERGE can predict correct resolutions with 78% accuracy.

• Elizabeth Dinella is with the University of Pennsylvania, Philadelphia, Pennsylvania, USA. E-mail: [email protected]. Elizabeth Dinella performed this work while employed at Microsoft Research.
• Todd Mytkowicz is with Microsoft Research, Redmond, Washington, USA. E-mail: [email protected]
• Alexey Svyatkovskiy is with Microsoft, Redmond, Washington, USA. E-mail: [email protected]
• Christian Bird is with Microsoft Research, Redmond, Washington, USA. E-mail: [email protected]
• Mayur Naik is with the University of Pennsylvania, Philadelphia, Pennsylvania, USA. E-mail: [email protected]
• Shuvendu Lahiri is with Microsoft Research, Redmond, Washington, USA. E-mail: [email protected]

(1)  Base O (base.js):    y = 42;
     Variant A (a.js):    x = 1;  y = 42;
     Variant B (b.js):    y = 42;  z = 43;
     Resolution (m.js):   x = 1;  y = 42;  z = 43;

(2)  Base O (base.js):    y = 42;
     Variant A (a.js):    x = 1;  y = 42;
     Variant B (b.js):    z = 43;  y = 42;
     Resolution (m.js):   CONFLICT

Figure 1: Two examples of unstructured merges.

1 INTRODUCTION

In collaborative software development settings, version control systems such as "git" are commonplace. Such version control systems allow developers to simultaneously edit code through features called branches. Branches are a growing trend in version control as they allow developers to work in their own isolated workspace, making changes independently, and only integrating their work into the main line of development when it is complete. Integrating these changes frequently involves merging multiple copies of the source code. In fact, according to a large-scale empirical study of Java projects on GitHub [11], nearly 12% of all commits are related to a merge.

To integrate changes by multiple developers across branches, version control systems utilize merge algorithms. Textual three-way file merge (e.g., as present in "git merge") is the prevailing merge algorithm. As the name suggests, three-way merge takes three files as input: the common base file O, and its corresponding modified files, A and B. The algorithm either: 1) declares a "conflict" if the two changes interfere with each other, or 2) provides a merged file M that incorporates changes made in A and B.

Under the hood, three-way merge typically employs the diff3 algorithm, which performs an unstructured (line-based) merge [27]. Intuitively, the algorithm aligns the two-way diffs of A (resp. B) over the common base O into a sequence of diff slots. At each slot, a change from either A or B is incorporated. If both programs change a common slot, a merge conflict is produced, and requires manual resolution of the conflicting modifications.

Figure 1 shows two simple code snippets to illustrate examples of three-way merge inputs and outputs. The figure shows the base program file O along with the two variants A and B. Example (1) shows a case where diff3 successfully provides a merged file M incorporating changes made in both A and B. On the other hand, Example (2) shows a case where diff3 declares a conflict because two independent changes (updates to x and z) occur in the same diff slot.
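To make the notion of a "diff slot" concrete, the following minimal sketch (our own illustration, not the actual diff3 implementation or any part of DEEPMERGE) reports a conflict when the line ranges of the base O that A and B modify overlap, or when both variants insert at the same position; it uses Python's difflib for the two-way diffs.

from difflib import SequenceMatcher

def changed_base_ranges(base, variant):
    # Two-way diff: which half-open line ranges of the base does this variant touch?
    sm = SequenceMatcher(a=base, b=variant, autojunk=False)
    return [(i1, i2) for tag, i1, i2, _, _ in sm.get_opcodes() if tag != "equal"]

def has_textual_conflict(base, a, b):
    # Conflict if A and B modify overlapping base ranges, or both insert
    # at the same base position (i.e., the changes land in the same slot).
    for al, ah in changed_base_ranges(base, a):
        for bl, bh in changed_base_ranges(base, b):
            overlap = al < bh and bl < ah
            same_insertion_point = al == ah == bl == bh
            if overlap or same_insertion_point:
                return True
    return False

# Example (2) from Figure 1: both variants prepend a line before "y = 42;".
print(has_textual_conflict(["y = 42;"], ["x = 1;", "y = 42;"], ["z = 43;", "y = 42;"]))  # True
# Example (1): B appends after "y = 42;", so the edits fall into different slots.
print(has_textual_conflict(["y = 42;"], ["x = 1;", "y = 42;"], ["y = 42;", "z = 43;"]))  # False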
When diff3 declares a conflict, a developer must manually intervene. Consequently, merge conflicts are consistently ranked as one of the most taxing issues in collaborative, open-source software development, "especially for seemingly less experienced developers" [13]. Merge conflicts impact developer productivity, resulting in costly broken builds that stall the continuous integration (CI) pipelines for several hours to days. The fraction of merge conflicts as a percentage of merges ranges from 10% to 20% for most collaborative projects. In several large projects, merge conflicts account for up to 50% of merges (see [11] for details of prior studies).

Merge conflicts often arise due to the unstructured diff3 algorithm that simply checks if two changes occur in the same diff slot. For instance, the changes in Example (2), although textually conflicting, do not interfere semantically. This insight has inspired research to incorporate program structure and semantics while performing a merge. Structured merge approaches [3], [21], [32] and their variants treat merge inputs as abstract syntax trees (ASTs), and use tree-structured merge algorithms. However, such approaches still yield a conflict on merges such as Example (2) above, as they do not model program semantics and cannot safely reorder statements that have side effects.1 To make matters worse, the gains from structured approaches hardly transfer to dynamic languages, namely JavaScript [32], due to the absence of static types. Semantics-based approaches [37], [28] can, in theory, employ program analysis and verifiers to detect and synthesize the resolutions. However, there are no semantics-based tools for synthesizing merges for any real-world programming language, reflecting the intractable nature of the problem.

1. We ran jdime [21] in structured mode on this example after translating the code snippet to Java.

(a) Format of a conflict:
    { unchanged lines (prefix) }
    <<<<<<< a.js
    { lines edited by A }
    ||||||| base.js
    { affected lines of base O }
    =======
    { lines edited by B }
    >>>>>>> b.js
    { unchanged lines (suffix) }

(b) Instance of a conflict:
    <<<<<<<
    x = 1;
    |||||||
    =======
    z = 43;
    >>>>>>>
    y = 42;

Figure 2: Conflict format and an instance reported by diff3 on Example (2) from Figure 1.
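DEEPMERGE's input is extracted by parsing this conflict format (see the discussion of the input embedding below). As a minimal sketch of how the region between the markers can be split into A's lines, the affected base lines, and B's lines (our own illustration assuming diff3-style markers as in Figure 2(a), not DEEPMERGE's actual parser):

def split_conflict(conflict_lines):
    """Split a diff3-style conflicted region into (A lines, O lines, B lines).
    Illustrative sketch only."""
    a_lines, o_lines, b_lines = [], [], []
    bucket = None
    for line in conflict_lines:
        if line.startswith("<<<<<<<"):
            bucket = a_lines
        elif line.startswith("|||||||"):
            bucket = o_lines
        elif line.startswith("======="):
            bucket = b_lines
        elif line.startswith(">>>>>>>"):
            bucket = None
        elif bucket is not None:
            bucket.append(line)
    return a_lines, o_lines, b_lines

# The instance in Figure 2(b):
region = ["<<<<<<< a.js", "x = 1;", "||||||| base.js", "=======", "z = 43;", ">>>>>>> b.js"]
print(split_conflict(region))   # (['x = 1;'], [], ['z = 43;'])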
Current automatic approaches fall short, suggesting that merge conflict resolution is a non-trivial problem. This paper takes a fresh data-driven approach to the problem of resolving unstructured merge conflicts. Inspired by the abundance of data in open-source projects, the paper demonstrates how to collect a dataset of merge conflicts and resolutions.

This dataset drives the paper's key insight: a vast majority (80%) of resolutions do not introduce new lines. Instead, they consist of (potentially rearranged) lines from the conflicting region. This observation is confirmed by a prior independent large-scale study of Java projects from GitHub [13], in which 87% of resolutions are comprised exclusively of lines in the input. In other words, a typical resolution consists of re-arranging conflicting lines without writing any new code. Our observation naturally begs the question: Are there latent patterns of rearrangement? Can these patterns be learned?

This paper investigates the potential for learning latent patterns of rearrangement. Effectively, this boils down to the question:

Can we learn to synthesize merge conflict resolutions?

Specifically, the paper frames merging as a sequence-to-sequence task akin to machine translation.

To formulate program merging as a sequence-to-sequence problem, the paper considers the text of programs A, B, and O as the input sequence, and the text of the resolved program M as the output sequence. However, this seemingly simple formulation does not come without challenges. Section 5 demonstrates that an out-of-the-box sequence-to-sequence model trained on merge conflicts yields very low accuracy. In order to effectively learn a merge algorithm, one must:
1) represent merge inputs in a concise yet sufficiently expressive sequence;
2) create a mechanism to output tokens at the line granularity; and
3) localize the merge conflicts and the resolutions in a given file.

To represent the input in a concise yet expressive embedding, the paper shows how to construct an edit-aware sequence to be consumed by DEEPMERGE. These edits are provided in the format of diff3, which is depicted in Figure 2(a) in the portion between the markers "<<<<<<<" and ">>>>>>>". The input embedding is extracted from parsing the conflict markers and represents A's and B's edits over the common base O.

To represent the output at the line granularity, DEEPMERGE's design is a form of a pointer network [34]. As such, DEEPMERGE constructs resolutions by copying input lines, rather than learning to generate them token by token. Guided by our key insight that a large majority of resolutions are entirely comprised of lines from the input, such an output vocabulary is sufficiently expressive.

Lastly, the paper shows how to localize merge conflicts and the corresponding user resolutions in a given file. This is necessary as our approach exclusively aims to resolve locations in which diff3 has declared a conflict. As such, our algorithm only needs to generate the conflict resolution and not the entire merged file. Thus, to extract ground truth, we must localize the resolution for a given conflict in a resolved file. Localizing such a resolution region unambiguously is a non-trivial task. The presence of extraneous changes unrelated to conflict resolution makes resolution localization challenging. The paper presents the first algorithm to localize the resolution region for a conflict. This ground truth is essential for training such a deep learning model.

The paper demonstrates an instance of DEEPMERGE trained to resolve unstructured merge conflicts in JavaScript programs. Besides its popularity, JavaScript is notorious for its rich dynamic features, and lacks tooling support. Existing structured approaches struggle with JavaScript [32], providing a strong motivation for a technique suitable for dynamic languages. The paper contributes a real-world dataset of 8,719 merge tuples that require non-trivial resolutions, drawn from nearly twenty thousand repositories on GitHub.
Our evaluation shows that, on a held-out test set, DEEPMERGE can predict correct resolutions for 37% of non-trivial merges. DEEPMERGE's accuracy is a 9x improvement over a recent semistructured approach [32], evaluated on the same dataset. Furthermore, on the subset of merges with up to 3 lines (comprising 24% of the total dataset), DEEPMERGE can predict correct resolutions with 78% accuracy.

Contributions. In summary, this paper:
1) is the first to define merge conflict resolution as a machine learning problem and identify a set of challenges for encoding it as a sequence-to-sequence supervised learning problem (§ 2).
2) presents a data-driven merge tool, DEEPMERGE, that uses an edit-aware embedding to represent merge inputs and a variation of pointer networks to construct the resolved program (§ 3).
3) derives a real-world merge dataset for supervised learning by proposing an algorithm for localizing resolution regions (§ 4).
4) performs an extensive evaluation of DEEPMERGE on merge conflicts in real-world JavaScript programs, and demonstrates that it can correctly resolve a significant fraction of unstructured merge conflicts with high precision and 9x higher accuracy than a structured approach.

(a) A merge instance:
    <<<<<<< a.js
    let b = x + 5.7
    var y = floor(b)
    console.log(y)
    ||||||| base.js
    var b = 5.7
    var y = floor(b)
    =======
    var y = floor(x + 5.7)
    >>>>>>> b.js

(b) Corresponding merge tuple:
    A: let b = x + 5.7
       var y = floor(b)
       console.log(y)
    O: var b = 5.7
       var y = floor(b)
    B: var y = floor(x + 5.7)
    R: var y = floor(x + 5.7)
       console.log(y)

Figure 3: Formulation of a merge instance in our setting.
2 DATA-DRIVEN MERGE

We formulate program merging as a sequence-to-sequence supervised learning problem and discuss the challenges we must address in solving the resulting formulation.

2.1 Problem Formulation

A merge consists of a 4-tuple of programs (A, B, O, M) where A and B are both derived from a common O, and M is the developer resolved program.

A merge may consist of one or more regions. We define a merge tuple ((A, B, O), R) such that A, B, O are (sub)programs that correspond to regions in A, B, and O, respectively, and R denotes the result of merging those regions.

Example 1. Figure 3 shows a conflict reported by diff3 and the corresponding merge tuple. The resolution R consists of the first line of B followed by the third line of A. For this example, R also incorporates the intents from both A and B intuitively, assuming b does not appear in the rest of the programs.

One possible way to learn a merge algorithm is by modeling the conditional probability

    p(R | A, B, O)                                                (1)

In other words, a model that generates the output program R given the three input programs. Because programs are sequences, we further decompose Eq. 1 by applying the chain rule [29]:

    p(R | A, B, O) = ∏_{j=1}^{N} p(R_j | R_{<j}, A, B, O)

2.2 Challenges

2.2.1 Representing Merge Inputs

CH1: Encode programs A, B, and O as the input to a Seq2Seq model.

2.2.2 Constructing the Output Resolution

Our key insight that a majority of resolutions do not introduce new lines leads us to construct the output resolution directly from lines in the conflicting region. This naturally suggests the use of pointer networks [34], an encoder-decoder architecture capable of producing outputs explicitly pointing to tokens in the input sequence. However, a pointer network formulation assumes the same granularity for the input and the output. In Section 3.2, we show that the input is best represented at a granularity far smaller than lines. Thus, the challenge is:

CH2: Output R at the line granularity given a non-line granularity input.
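To make the line-granularity output concrete: for the merge tuple in Figure 3, the resolution R can be expressed as the pointer sequence ⟨1, B⟩, ⟨3, A⟩ (copy line 1 of B, then line 3 of A), which matches the decoder output shown later in Figure 4. A minimal sketch of materializing a resolution from such line pointers (our own illustration, not DEEPMERGE's decoder):

def materialize(pointers, regions):
    """Turn line pointers like (1, 'B') into resolution text.
    `regions` maps 'A', 'B', 'O' to their lists of lines; indices are 1-based."""
    return [regions[which][i - 1] for i, which in pointers]

regions = {
    "A": ["let b = x + 5.7", "var y = floor(b)", "console.log(y)"],
    "O": ["var b = 5.7", "var y = floor(b)"],
    "B": ["var y = floor(x + 5.7)"],
}
print(materialize([(1, "B"), (3, "A")], regions))
# ['var y = floor(x + 5.7)', 'console.log(y)']  -- the resolution R in Figure 3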
2.2.3 Extracting Ground Truth from Raw Merge Data

Finally, to learn a data-driven merge algorithm, we need real-world data that serves as ground truth. Creating this dataset poses non-trivial challenges. First, we need to localize the resolution region and the corresponding conflicting region. In some cases, developers performing a manual merge resolution made changes unrelated to the merge. Localizing resolution regions unambiguously from input programs is challenging due to the presence of these unrelated changes. Second, we need to be able to recognize and subsequently filter merge resolutions that do not incorporate both the changes. In summary, we have:

CH3: Identify merge tuples {(A^i, B^i, O^i, R^i)}_{i=1}^{M} given (A, B, O, M).

3 THE DEEPMERGE ARCHITECTURE

Section 2 suggested one way to learn a three-way merge is through a maximum likelihood estimate of a sequence-to-sequence model. In this section we describe DEEPMERGE, the first data-driven merge framework, and discuss how it addresses challenges CH1 and CH2. We motivate the design of DEEPMERGE by comparing it to a standard sequence-to-sequence model, the encoder-decoder architecture.

3.1 Encoder Decoder Architectures

Sequence-to-sequence models aim to map a fixed-length input X_N (N ∈ ℕ) to a fixed-length output Y_M (M ∈ ℕ).2 The standard sequence-to-sequence model consists of three components: an input embedding, an encoder, and a decoder.

2. Note that M is not necessarily equal to N.

Input embedding: An embedding maps a discrete input from an input vocabulary V (x_n ∈ N^{|V|}) to a continuous D-dimensional vector space representation (x_n ∈ R^D). Such a mapping is obtained by multiplication over an embedding matrix E ∈ R^{D×|V|}. Applying this for each element of X_N gives the embedded sequence X_N.

Encoder: An encoder, encode, processes each x_n and produces a hidden state z_n which summarizes the sequence up to the n-th element. At each iteration, the encoder takes as input the current sequence element x_n and the previous hidden state z_{n−1}. After processing the entire input sequence, the final hidden state, z_N, is passed to the decoder.

Decoder: A decoder, decode, produces the output sequence Y_M from the encoder hidden state z_N. Similar to encoders, decoders work in an iterative fashion. At each iteration, the decoder produces a single output token y_m along with a hidden summarization state h_m. The current hidden state and the previous predicted token y_m are then used in the following iteration to produce y_{m+1} and h_{m+1}. Each y_m the model predicts is selected through a softmax over the hidden state:

    p(y_m | y_1, ..., y_{m−1}, X) = softmax(h_m)
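A minimal PyTorch sketch of this standard encoder-decoder architecture, for concreteness only; the recurrent cell choice (GRU), dimensions, and teacher-forced decoding below are our own placeholder assumptions, not the paper's prescribed implementation:

import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Embedding -> GRU encoder -> GRU decoder -> softmax over the output vocabulary."""
    def __init__(self, in_vocab, out_vocab, d=64):
        super().__init__()
        self.embed_in = nn.Embedding(in_vocab, d)
        self.embed_out = nn.Embedding(out_vocab, d)
        self.encode = nn.GRU(d, d, batch_first=True)
        self.decode = nn.GRU(d, d, batch_first=True)
        self.project = nn.Linear(d, out_vocab)

    def forward(self, x, y_prev):
        # x: (batch, N) input token ids; y_prev: (batch, M) previously emitted tokens
        _, z_N = self.encode(self.embed_in(x))            # final encoder hidden state z_N
        h, _ = self.decode(self.embed_out(y_prev), z_N)   # decoder hidden states h_1..h_M
        return torch.softmax(self.project(h), dim=-1)     # p(y_m | y_<m, X)

# Usage: one teacher-forced forward pass with toy shapes.
model = Seq2Seq(in_vocab=100, out_vocab=50)
probs = model(torch.randint(100, (2, 7)), torch.randint(50, (2, 4)))
print(probs.shape)  # torch.Size([2, 4, 50])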
DEEPMERGE is based on this encoder-decoder architecture with two significant differences.

First, rather than a standard embedding followed by an encoder, we introduce a novel embedding method called Merge2Matrix. Merge2Matrix addresses CH1 by summarizing the input programs (A, B, O) into a single embedding fed to the encoder. We discuss our Merge2Matrix solution as well as less effective alternatives in Section 3.2.

Second, rather than using a standard decoder to generate output tokens in some output token vocabulary, we augment the decoder to function as a variant of pointer networks. The decoder outputs line tuples (i, W) where W ∈ {A, B} and i is the i-th line in W. We discuss this in detail in Section 3.4.

Example 2. Figure 4 illustrates the flow of DEEPMERGE as it processes the inputs of a merge tuple. First, the raw text of A, B, and O is fed to Merge2Matrix. As the name suggests, Merge2Matrix summarizes the tokenized inputs as a matrix. That matrix is then fed to an encoder which computes the encoder hidden state z_N. Along with the start token for the decoder hidden state, the decoder takes z_N and iteratively (denoted by the ···) generates as output the lines to copy from A and B. The final resolution is shown in the green box.

[Figure 4 shows the conflicting region of Figure 3 with each input line labeled, e.g., ⟨1,A⟩ let b = x + 5.7, ⟨2,A⟩ var y = floor(b), ⟨3,A⟩ console.log(y), ⟨1,O⟩ var b = 5.7, ⟨2,O⟩ var y = floor(b), ⟨1,B⟩ var y = floor(x + 5.7), flowing through Merge2Matrix, the encoder, and the decoder, which emits ⟨1,B⟩, ⟨3,A⟩, and a ⟨STOP⟩ token.]

Figure 4: Overall DEEPMERGE framework. The dotted box represents repetition of decode until m = M, i.e., the ⟨STOP⟩ token is predicted. In this example, we have omitted m = 2, in which the call to decode outputs y_2 = ⟨3, A⟩.

3.2 Merge2Matrix

An encoder takes a single sequence as input. As discussed in Section 2.2, a merge tuple consists of three sequences. This section introduces Merge2Matrix, an input representation that expresses the tuple as a single sequence. It consists of embedding, transformations to summarize embeddings, and finally, edit-aware alignment.

3.2.1 Tokenization and Embedding

This section discusses our relatively straightforward application of both tokenization and embedding.

Tokenization. Working with textual data requires tokenization, whereby we split a sequence of text into smaller units referred to as tokens. Tokens can be defined at varying granularities such as characters, words, or sub-words. These units form a vocabulary which maps input tokens to integer indices. Thus, a vocabulary is a mapping from a sequence of text to a sequence of integers. This paper uses byte-pair encoding (BPE) as it has been shown to work well with source code, where tokens can be formed by combining different words via casing conventions (e.g., snake_case or camelCase), causing a blowup in vocabulary size [18]. Byte-pair encoding is an unsupervised sub-word tokenization that draws inspiration from information theory and data compression, wherein frequently occurring sub-word pairs are recursively merged and stored in the vocabulary. We found that the performance of BPE was empirically superior to other tokenization schemes.
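As a toy illustration of the recursive pair merging behind BPE (a simplified sketch over character-level symbols; real BPE tokenizers are trained on a corpus and handle many more details):

from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words (each word is a tuple of symbols)."""
    pairs = Counter()
    for word, freq in words.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of the chosen pair with a single merged symbol."""
    merged = {}
    for word, freq in words.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1]); i += 2
            else:
                out.append(word[i]); i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Identifiers split into characters; "floor", "flood", and "log" share frequent pairs.
words = {tuple("floor"): 3, tuple("log"): 2, tuple("flood"): 1}
for _ in range(4):                      # learn 4 merges
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print(pair, "->", list(words))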
Embedding. Given an input sequence X_N and a hyperparameter (embedding dimension) D, an embedding transformation creates X_N. As described in Section 3.1, the output of this embedding is then fed to an encoder. Because a merge tuple consists of three inputs (A, B, and O), the following sections introduce novel transformations that summarize these three inputs into a format suitable for the encoder.

3.2.2 Merge Tuple Summarization

In this section, we describe summarization techniques that are employed after embedding. Before we delve into details, we first introduce two functions used in summarization.

Suppose a function that concatenates embedded representations:

    concat_s : (R^{D×N} × · · · × R^{D×N}) → R^{D×sN}

that takes s similarly shaped tensors as arguments and concatenates them along their last dimension. Concatenating these s embeddings increases the size of the encoder's input by a factor of s.

Suppose a function linearize that linearly combines s embedded representations. We parameterize this function with learnable parameters θ ∈ R^{s+1}. As input, linearize takes an embedding x_i ∈ R^D for i ∈ 1..s. Thus, we define

    linearize_θ(x_1, ..., x_s) = θ_1 · x_1 + ··· + θ_s · x_s + θ_{s+1}

where all operations on the inputs x_1, ..., x_s are pointwise. linearize reduces the size of the embeddings fed to the encoder by a factor of s.

Now that we have defined two helper functions, we describe two summarization methods.

Naïve. Given a merge tuple's inputs (A, B, O), a naïve implementation of Merge2Matrix is to simply concatenate the embedded representations (i.e., concat_3(A, B, O)). Traditional sequence-to-sequence models often suffer from information forgetting; as the input grows longer, it becomes harder for encode to capture long-range correlations in that input. A solution that addresses CH1 must be concise while retaining the information in the input programs.

Linearized. As an attempt at a more concise representation, we introduce a summarization we call linearized. This method linearly combines each of the embeddings through our helper function: linearize_θ(A, B, O). In Section 5 we empirically demonstrate better model accuracy when we summarize with linearize_θ rather than concat_s.

3.2.3 Edit-Aware Alignment

In addition to input length, CH1 also implies that an effective input representation needs to be "edit aware". The aforementioned representations do not provide any indication that A and B are edits of O.

Prior work, Learning to Represent Edits (LTRE) [38], introduces a representation to succinctly encode two-way diffs. The method uses a standard deterministic diffing algorithm and represents the resulting pair-wise alignment as an auto-encoded fixed dimension vector.

A two-way alignment produces an "edit sequence". This series of edits, if applied to the second sequence, would produce the first. An edit sequence, ∆AO, is comprised of the following editing actions: = representing equivalent tokens, + representing insertions, − representing deletions, and ↔ representing a replacement. Two special tokens ∅ and | are used as a padding token and a newline marker, respectively. Note that these ∆s only capture information about the kinds of edits and ignore the tokens that make up the edit itself (with the exception of the newline token). Prior to the creation of ∆, a preprocessing step adds padding tokens such that equivalent tokens in A (resp. B) and O are in the same position. These sequences, shown in Figure 5, are denoted as A′ and AO′ (resp. B′ and BO′).

Example 3. Consider B's edit to O in Figure 5 via its preprocessed sequences B′, BO′, and its edit sequence ∆BO. One intuitive view of ∆BO is that it is a set of instructions that describe how to turn B′ into BO′ with the aforementioned semantics. Note the padding token ∅ introduced into ∆BO represents padding out to the length of the longer edit sequence ∆AO.

We now describe two edit-aware summarization methods based on this edit-aware representation. However, our setting
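A rough sketch of how an edit sequence such as ∆BO and the position-aligned, padded sequences B′ and BO′ could be computed for a variant and the base. This is our own illustration using Python's difflib; it does not reproduce LTRE's or DEEPMERGE's exact alignment procedure, and the newline marker | is omitted.

from difflib import SequenceMatcher

PAD = "∅"  # padding token

def align(variant, base):
    """Pad `variant` and `base` so equivalent tokens share positions, and emit
    the edit sequence: '=' equivalent, '+' inserted in the variant,
    '-' deleted from the base, '<->' replaced. Illustrative sketch only."""
    v_out, b_out, delta = [], [], []
    for tag, i1, i2, j1, j2 in SequenceMatcher(a=variant, b=base, autojunk=False).get_opcodes():
        if tag == "equal":
            v_out += variant[i1:i2]; b_out += base[j1:j2]; delta += ["="] * (i2 - i1)
        elif tag == "delete":      # present only in the variant: an insertion w.r.t. the base
            v_out += variant[i1:i2]; b_out += [PAD] * (i2 - i1); delta += ["+"] * (i2 - i1)
        elif tag == "insert":      # present only in the base: a deletion in the variant
            v_out += [PAD] * (j2 - j1); b_out += base[j1:j2]; delta += ["-"] * (j2 - j1)
        else:                      # replace: pad the shorter side
            n = max(i2 - i1, j2 - j1)
            v_out += variant[i1:i2] + [PAD] * (n - (i2 - i1))
            b_out += base[j1:j2] + [PAD] * (n - (j2 - j1))
            delta += ["<->"] * n
    return v_out, b_out, delta

# B's edit to O from Figure 3, tokenized per word.
B = "var y = floor ( x + 5.7 )".split()
O = "var b = 5.7 var y = floor ( b )".split()
B_prime, BO_prime, delta_BO = align(B, O)
print(delta_BO)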