DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS STOCKHOLM, SWEDEN 2021

Design and Implementation of Semantic Patch Support for the Spoon Java Transformation Engine

MIKAEL FORSBERG

KTH ROYAL INSTITUTE OF TECHNOLOGY SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Master in Computer Science
Date: January 26, 2021
Supervisor: Nicolas Yves Maurice Harrand
Examiner: Martin Monperrus
School of Electrical Engineering and Computer Science
Swedish title: Design och implementering av stöd för semantiska patchar för Javatransformeringsmotorn Spoon


Abstract

Software development is more often than not a collaborative process, creating a need for tools and file formats that enable developers to create and share succinct representations of changes to source code in order to facilitate efficient communication. Standard POSIX diffs and patches have long been important parts of the toolkit, but their lack of support for the syntax and semantics of specific programming languages results in limited expressiveness. The Semantic Patch Language (SmPL), introduced in 2006 together with the tool Coccinelle, increases the expressiveness of POSIX-style patches for the C programming language by leveraging support for the syntax and semantics of C. For example, an SmPL patch can specify changes to source code using metavariables that bind arbitrary program variable names, allowing for the specification of transformations involving variable references regardless of what specific variable names appear in programs targeted by the patch. A recent development is Coccinelle4J, a prototype modification of Coccinelle targeting the Java programming language. Coccinelle4J remains based on a toolkit designed for the parsing and modeling of C, adapted to operate on Java source code. The language mismatch of the base toolkit gives rise to limitations. Despite this, Coccinelle4J remains the state of the art for an SmPL targeting Java. In this thesis we lay the foundations for an SmPL for Java based on Spoon, a robust Java metaprogramming toolkit. We qualitatively investigate to what extent the features of SmPL and Coccinelle are generalizable to a Java context, and we implement and evaluate SPOON-SMPL, a prototype SmPL tool for Java based on Spoon. We base the core design of SPOON-SMPL on temporal logic and model checking, heavily inspired by the design of Coccinelle. We find that the majority of the identified SmPL features generalize to Java.

We quantitatively evaluate SPOON-SMPL by comparing the running time performance to that of Coccinelle4J over a set of six semantic patches with associated real-world project code bases used in an API migration case study originally performed by the authors of Coccinelle4J. Additionally, we compare the running times of SPOON-SMPL to the average build time of each associated project. We find that SPOON-SMPL performs worse than Coccinelle4J, but that the performance remains in a range acceptable for a single developer using inexpensive hardware. Finally, we provide two proposed designs for extensions to SPOON-SMPL along with a set of suggestions for future work. The proposals show that our prototype offers a strong potential to leverage the capabilities of the Spoon library, particularly in providing improved and robust support for certain aspects of Java for which Coccinelle4J provides only limited support.

Sammanfattning (Summary)

Software development is often a collaborative process that requires efficient communication. A central part of this communication is the ability of developers to create and share with one another concise summaries of source code changes. The POSIX-standardized tools diff and patch have long been an important part of the toolkit, but their lack of support for the syntax and semantics of specific programming languages limits their expressiveness. The Semantic Patch Language (SmPL), introduced in 2006 together with the tool Coccinelle, offers increased expressiveness in POSIX-like patches for the C programming language. Among other things, an SmPL patch can use metavariables, logical variable names that bind arbitrary program variables, to specify transformations involving variable references regardless of which variable names appear in the target program. Coccinelle4J, a modification of Coccinelle, is a recently developed prototype of an SmPL tool for the Java programming language. Coccinelle4J is built on a technical foundation designed for parsing and processing C that has been adapted to process Java. Language differences make a complete adaptation difficult, which leads to limited support for some of the features of Java. Despite this, Coccinelle4J is currently the leading solution for SmPL for Java. In this thesis we take the first steps towards an SmPL for Java based on Spoon, a robust metaprogramming library for Java. We qualitatively investigate which features of SmPL and Coccinelle can be generalized to Java, and we implement and evaluate SPOON-SMPL, a prototype SmPL tool for Java based on Spoon. The design of SPOON-SMPL is heavily inspired by Coccinelle and is based on temporal logic and model checking. We find that a clear majority of the features we identified in SmPL and Coccinelle can be generalized to Java.

We quantitatively evaluate SPOON-SMPL by comparing its running time performance to that of Coccinelle4J over six semantic patches with associated project code bases that were originally used in a case study on API migration performed by the team behind Coccinelle4J. We also compare the running time performance to the build time of each project. We find that the running time performance of SPOON-SMPL is worse than that of Coccinelle4J, but that it nevertheless lies within a range that is acceptable for an individual software developer with a simple personal computer. Finally, we present two detailed proposals for extensions to SPOON-SMPL together with a set of suggestions for future work. Through this we show that our prototype has strong potential for extensions that make use of the functionality available in Spoon, in particular regarding improved and robust support for certain features of Java for which Coccinelle4J offers only limited support.

Acknowledgements

I would like to thank:

• Prof. Martin Monperrus, my examiner. Martin suggested the project and gave me the opportunity to pursue it. Martin also helped establish the research methodology and formulate the formal research questions, gave regular feedback on the structure of the thesis and my approaches to various aspects of the work, suggested many papers on related works, and provided tips on the use of Spoon.

• Nicolas Yves Maurice Harrand, my supervisor. Like Martin, Nicolas provided feedback on the methodology and the structure of the thesis, and also helped me with a couple of difficult choices in the implementation. Nicolas also provided detailed feedback on the full text, introduced me to a set of useful tools and ideas for improving the text, provided papers on the subtleties involved in benchmarking the performance of Java programs, and helped eliminate a formal research question for which the results were overly speculative.

• Ann Bengtsson, degree project coordinator at KTH EECS. Ann greatly helped me solve the complications surrounding my formal admittance to the degree project course.

Finally, I would like to jointly thank Martin and Nicolas for their sympathy and patience throughout the project in general, and in particular surrounding the passing of my father.

To my father.

I’m sorry I took too long. Thank you for everything.

Contents

1 Introduction
    1.1 Problem statement
    1.2 Research questions
    1.3 Contributions
    1.4 Intended audience
    1.5 Outline of the thesis

2 Background
    2.1 Text file differencing
        2.1.1 diff
        2.1.2 patch
    2.2 Formal logics for the modeling of computer programs
        2.2.1 Computation Tree Logic
        2.2.2 CTL with free variables
        2.2.3 CTL with quantified variables
        2.2.4 CTL with variables and witnesses
    2.3 Program analysis and transformation
        2.3.1 Spoon
        2.3.2 Semantic Patch Language

3 Related work
    3.1 Semantic patching
        3.1.1 Coccinelle
        3.1.2 Coccinelle4J
    3.2 Program transformation using temporal logic
    3.3 Other approaches to Java source code transformation
    3.4 API migration

4 Design of spoon-smpl
    4.1 Design goals
    4.2 Core engine
    4.3 Parsing SmPL
    4.4 Formula language
    4.5 Formula compilation
    4.6 Batch processing
    4.7 Use of Spoon

5 Evaluation methodology
    5.1 Analytical methodology
        5.1.1 RQ1: Generalizable features
        5.1.2 RQ2: Non-generalizable features
    5.2 Experimental methodology
        5.2.1 RQ3: Patch application performance
        5.2.2 RQ4: Project build times

6 Evaluation results
    6.1 Analytical results
        6.1.1 Coccinelle feature catalog
        6.1.2 RQ1: Generalizable features
        6.1.3 RQ2: Non-generalizable features
    6.2 Experimental results
        6.2.1 Hardware and software
        6.2.2 RQ3: Patch application performance
        6.2.3 RQ4: Project build times

7 Discussion
    7.1 Limitations
    7.2 Threats to validity
    7.3 Extension proposals
        7.3.1 Improving name resolution
        7.3.2 Improving sub-typing
    7.4 Future work
        7.4.1 Support for more simple Java constructs
        7.4.2 Support for looping constructs
        7.4.3 Support for isomorphisms
        7.4.4 Model checker optimizations
        7.4.5 Using Spoon sniper mode
        7.4.6 Using spoon.pattern
        7.4.7 Target-embedded parsing of the semantic patch
    7.5 Ethical considerations

8 Conclusions

Bibliography

A Full semantic patches
    A.1 Semantic patch 4: should_vibrate
        A.1.1 Original version
        A.1.2 Modified version
    A.2 Semantic patch 5: get_height
        A.2.1 Original version
        A.2.2 Modified version

Chapter 1

Introduction

1.1 Problem statement

Automated code transformation over a full code base is an important tool in software engineering. Useful transformations range from simple renaming of variables to making adaptations for complex changes in library APIs which require corresponding modifications in client code.

The POSIX (originally Unix) programs diff and patch and their variations have long offered a popular platform for declaratively specifying code transformations, but the platform is limited due to a lack of support for the syntax and semantics of the programming languages in which the transformations are to take place. As an example of this limitation, it is generally not possible to write a POSIX patch describing a transformation that rearranges arbitrary expressions or includes constraints over arbitrary control flow paths.

The Semantic Patch Language (SmPL) was introduced [1] in the context of Linux kernel development as an extension to the familiar POSIX patch format, featuring support for the syntax and semantics of the C programming language. SmPL resolves several of the limitations of POSIX patches, albeit limited to the C programming language.

The Java programming language is widely used and the subject of ongoing research. An example of such research is the Spoon project [2], a Java metaprogramming library developed under the helm of the French research institute Inria and designed to facilitate the implementation of bespoke Java source code analyses and transformations. However, as opposed to the declarative style of POSIX patches and SmPL, implementing a code transformation using Spoon generally requires writing an imperative program. Viewed as a specification for a code transformation, an imperative (as opposed to declarative) program is often harder to understand and maintain.

The state of the art of SmPL for Java is the work on Coccinelle4J [3], a prototype based on Coccinelle [1], the original SmPL tool for C. Coccinelle4J suffers from a number of limitations regarding the syntax and semantics of Java, such as having limited support for sub-typing and requiring special rules for matching fully-qualified names to simply-qualified names. Limitations such as these could potentially be alleviated by using a more robust, Java-centric base such as Spoon.

The main objective of this thesis is to design and implement SPOON-SMPL, an SmPL transformation engine for Java based on the Spoon library and transformation engine, and to evaluate its performance and potential. As part of this objective we also investigate which features of SmPL and Coccinelle can be generalized to a Java context.

1.2 Research questions

• RQ1: Which features of Coccinelle/SmPL for C can be generalized to Java?

• RQ2: Which features of Coccinelle/SmPL for C cannot be generalized to Java? Why not?

• RQ3: Is the running time performance of an SmPL implementation for Java based on Spoon better or worse than that of Coccinelle4J?

• RQ4: Is the running time performance of an SmPL implementation for Java based on Spoon acceptable for an individual developer?

1.3 Contributions

The primary contribution of this work takes the form of Java source code contributed (https://github.com/mkforsb/spoon/tree/smpl) to the Spoon open-source project (https://github.com/INRIA/spoon). Additionally, we make contributions to the general body of work on SmPL.

• An implementation of a subset of SmPL for a subset of Java, written in Java, based on the Spoon Java transformation engine.

• A minor contribution to the spoon-control-flow sub-module of the Spoon project, consisting of the implementation of limited support for the control flow of exceptions and the associated statements throw, try, catch and finally.

• An alternative³ treatment of the formal logic CTL-VW and the precursors CTL-FV and CTL-V.

• An implementation of a CTL-VW model checking algorithm in Java.

To our knowledge, all of the implementation-related contributions are novel.

1.4 Intended audience

This text assumes the reader is familiar with elementary set theory, elementary formal logic (propositional and first-order logic), context-free grammars presented in Backus-Naur form (BNF), the notions of Abstract Syntax Trees (ASTs) and Control Flow Graphs (CFGs), as well as the basic syntax and semantics of C-like programming languages.

1.5 Outline of the thesis

The general structure of the thesis is as follows. Chapter 2 introduces the fundamental ideas and tools used in the implementation of SPOON-SMPL. This includes topics such as text file differencing and Computation Tree Logic (CTL), as well as introductions to the Semantic Patch Language (SmPL) and the basics of the Spoon library. Chapter 3 covers related works in general, and the works of Coccinelle and Coccinelle4J in particular. Chapter 4 presents the design of SPOON-SMPL, giving an overview of the main design goals and requirements as well as providing further details on the source code transformation engine, the parsing of SmPL and the compilation of CTL formulas. Chapter 5 presents the methods used to evaluate our formal research questions, and Chapter 6 presents the corresponding results. Chapter 7 discusses potential problems and limitations with our work, and provides suggestions for future work. Finally, Chapter 8 presents our conclusions.

³ Alternative to Julien Brunel et al., “A Foundation for Flow-Based Program Matching: Using Temporal Logic and Model Checking” [4].

Chapter 2

Background

This chapter covers the fundamental ideas and tools relevant to the implementation of SPOON-SMPL. Section 2.1 introduces the notions of diff and patch, both of which have greatly influenced the original design of SmPL. Section 2.2 provides a semi-formal overview of Computation Tree Logic (CTL) and the series of extensions CTL-FV, CTL-V and CTL-VW, which are of central importance to SPOON-SMPL. Section 2.3 gives a brief overview on the topic of automated program analysis and transformation, followed by an introduction to API migration as an application of automated program transformation, before ending with introductions to the Spoon metaprogramming library and the SmPL source code transformation language.

2.1 Text file differencing

This section introduces the diff program that generates summaries of differences between files, and the patch program that can use such a summary to apply changes to files.

2.1.1 diff

The original diff program emerged in the 1970s during the early developments of the UNIX operating system at Bell Labs [5]. In typical operation, diff takes as input the contents of two files and produces an output (“the diff”) containing some representation of the minimal set of changes needed to turn one of the files into the other (or vice versa). The algorithm used to find the minimal set of changes is based around solving the longest common subsequence (LCS) problem, which results in finding the longest list of anchor points common to (and matching in sequence) both input files around which the instructions for changes should be placed. An example of the fundamental ideas of diff is shown in Table 2.1.

Input A:        a b c d e f g
Input B:        w a b x y z e
LCS:            (a,b,e)
Offsets in A:   (0,1,4)
Offsets in B:   (1,2,6)

Diff (A → B):
Offset a:       prepend(w)
Offset b:       delete_successors(c,d); append(x,y,z)
Offset e:       delete_successors(f,g)

Table 2.1: Diff fundamentals.
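The ideas of Table 2.1 can be illustrated with a small Python sketch. It is only a sketch under stated assumptions: it uses the textbook dynamic-programming solution to the LCS problem, which is not necessarily the algorithm used by any actual diff implementation, and the function names lcs_offsets and edit_script as well as the tuple-based instruction format are hypothetical, introduced here for illustration.

```python
def lcs_offsets(a, b):
    """Return (lcs, offsets_in_a, offsets_in_b) for the sequences a and b."""
    n, m = len(a), len(b)
    # table[i][j] holds the length of the LCS of a[i:] and b[j:].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = 1 + table[i + 1][j + 1]
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table from the top-left corner to recover the anchor points.
    lcs, offs_a, offs_b = [], [], []
    i = j = 0
    while i < n and j < m:
        if a[i] == b[j]:
            lcs.append(a[i])
            offs_a.append(i)
            offs_b.append(j)
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return lcs, offs_a, offs_b


def edit_script(a, b):
    """List deletions and insertions (A -> B) placed around the LCS anchors.

    Each entry is (operation, index_in_a, elements); this format is a
    hypothetical simplification of the instructions shown in Table 2.1.
    """
    _, offs_a, offs_b = lcs_offsets(a, b)
    script = []
    prev_a = prev_b = -1
    # A sentinel anchor past the end of both inputs flushes trailing changes.
    for ia, ib in list(zip(offs_a, offs_b)) + [(len(a), len(b))]:
        deleted = a[prev_a + 1:ia]
        inserted = b[prev_b + 1:ib]
        if deleted:
            script.append(("delete", prev_a + 1, deleted))
        if inserted:
            script.append(("insert", prev_a + 1, inserted))
        prev_a, prev_b = ia, ib
    return script


A = list("abcdefg")
B = list("wabxyze")
lcs, offsets_a, offsets_b = lcs_offsets(A, B)
script = edit_script(A, B)
```

For the inputs of Table 2.1 the sketch recovers the anchors (a,b,e) with the offsets listed in the table, and the resulting script inserts w before a, replaces c,d with x,y,z after b, and deletes f,g after e.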

To reverse the direction of the changes one simply swaps each prepend instruction for an instruction to delete_predecessors and each append instruction for an instruction to delete_successors (and vice versa, respectively). The resulting set of instructions is then applied to the LCS offsets corresponding to the second file. Applying the set of instructions can be done manually or automatically and the act of doing so is often referred to as patching, in which case the diff itself is often also referred to as the patch.

There exist several syntax formats for representing diffs. Earlier formats tend to omit any elements common to both input files, storing only the anchor offsets and the changes to be made at said offsets. A later development was the context diff format, which included the anchor elements and added the optional inclusion of a limited number of elements common to both input files and adjacent to (immediately preceding or following) the body of elements targeted by the diff. These additions were introduced as a way of representing the context in which the changes were to take place. Having context available not only made the diffs easier for humans to read, but also allowed for the implementation of patching algorithms that made use of the context in order to apply patches to files for which the diff anchor points did not match exactly.

In modern use the most common format is the unified context diff, in which additions and removals are simply lines annotated with + and - respectively, intermixed with context lines. Anchor offsets are included in the form of hunk headers @@[...]@@, which also allow a single diff file to include multiple change sets. The header @@ -142,5 +142,7 @@ denotes a change in the block of lines 142-146 of the original file, producing the block of lines 142-148 in the updated file. Listing 1 contains an example of the unified context diff format.

Listing 1. Example of unified context diff.

@@ -142,5 +142,7 @@
 end
 local args, callback, ret
-table.insert(history, text)
+if active_command == false then
+    table.insert(history, text)
+end
 props['neode.command.argv'] = text:gsub('^%S+%s+', '')
 if active_command then
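The relationship between a hunk header and its body can be checked mechanically. The following Python sketch parses the header of Listing 1 and counts the lines of the hunk body. The helper names parse_hunk_header, line_span and count_hunk_lines are hypothetical, and the indentation inside the listing's lines is an assumption, since only the leading marker character matters here.

```python
import re

# Matches headers such as "@@ -142,5 +142,7 @@"; the lengths are optional.
HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def parse_hunk_header(line):
    """Return ((old_start, old_len), (new_start, new_len)) for a hunk header."""
    match = HUNK_HEADER.match(line)
    if match is None:
        raise ValueError("not a unified diff hunk header: " + line)
    old_start, old_len, new_start, new_len = match.groups()
    # An omitted length means the range covers exactly one line.
    return ((int(old_start), int(old_len or 1)),
            (int(new_start), int(new_len or 1)))


def line_span(start, length):
    """Return the inclusive (first, last) line numbers of a range."""
    return (start, start + length - 1)


def count_hunk_lines(body):
    """Count the lines a hunk body contributes to the old and the new file."""
    old = sum(1 for line in body if line[:1] in (" ", "-"))
    new = sum(1 for line in body if line[:1] in (" ", "+"))
    return old, new


old_range, new_range = parse_hunk_header("@@ -142,5 +142,7 @@")
body = [
    " end",
    " local args, callback, ret",
    "-table.insert(history, text)",
    "+if active_command == false then",
    "+    table.insert(history, text)",
    "+end",
    " props['neode.command.argv'] = text:gsub('^%S+%s+', '')",
    " if active_command then",
]
old_count, new_count = count_hunk_lines(body)
```

Context lines count towards both files, removed lines only towards the original file and added lines only towards the updated file, which is why the body of Listing 1 accounts for 5 original lines and 7 updated lines.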

2.1.2 patch

The original patch program was developed by Larry Wall [6] as a companion to diff that, given a diff in the context diff format, applied the changes to a file or a set of files. Modern versions of patch are able to use the unified context format. By pattern matching against the context elements, changes can be applied even to files that differ slightly from the file used to produce the diff. For example, a new line of text introduced preceding a block targeted for patching can cause the line numbers specified in a patch to no longer match the line numbers of the block in the file. By scanning for the context elements, patch is able to recover from such mismatches to a limited degree. Both diff and patch were eventually made part of the POSIX standard [7].

The combination of diff and patch has been useful for many software projects. In particular, the ability to share succinct representations of changes to source code has been important for projects with a geographically dispersed set of loosely coupled participants such as the Linux kernel and other free software projects. However, the lack of support for the semantics of the programming languages in which the changes are to be applied gives rise to a number of limitations. For example, it is common for a change in library code to induce corresponding required updates in a large number of files containing client code using the library API. It is likely for client code to pass variables as arguments to library functions, but unlikely for all client code to use the same variable names. As such, if an API change is in the parameters to a library function it may not be possible to construct a single succinct diff that can update all of the client code.
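The recovery from shifted line numbers described above can be sketched in Python as follows. This is a deliberately simplified illustration and not the algorithm of any actual patch implementation: it only searches for an exact match of the hunk's original block at increasing distances from the expected position, and the names find_hunk, apply_hunk and the max_offset parameter are hypothetical.

```python
def find_hunk(lines, old_block, expected_index, max_offset=100):
    """Locate old_block in lines, searching outwards from expected_index."""
    for distance in range(max_offset + 1):
        for candidate in (expected_index - distance, expected_index + distance):
            if candidate < 0 or candidate + len(old_block) > len(lines):
                continue
            if lines[candidate:candidate + len(old_block)] == old_block:
                return candidate
    return None


def apply_hunk(lines, old_start, body):
    """Apply one unified-diff hunk body to a list of lines.

    old_start is the 1-based line number from the hunk header.
    """
    # The original block consists of context and removed lines, the
    # updated block of context and added lines (markers stripped).
    old_block = [line[1:] for line in body if line[:1] in (" ", "-")]
    new_block = [line[1:] for line in body if line[:1] in (" ", "+")]
    index = find_hunk(lines, old_block, old_start - 1)
    if index is None:
        raise ValueError("hunk context not found")
    return lines[:index] + new_block + lines[index + len(old_block):]


# The hunk was produced against a file where "y = 2" was on line 2 ...
body = [" y = 2", "-print(x)", "+print(x + y)", " z = 3"]
# ... but the target file has since gained a line at the top.
shifted = ["# comment", "x = 1", "y = 2", "print(x)", "z = 3"]
patched = apply_hunk(shifted, 2, body)
```

The example also hints at the limitation discussed above: the match is purely textual, so the same hunk would fail on client code that is identical except for using a different variable name.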

2.2 Formal logics for the modeling of computer programs

A formal logic is a formal language consisting of a strict, mathematically- defined vocabulary and grammar, coupled with a formal semantics that define how to judge the truth of a given statement in the language. This section in- troduces the formal logics Computation Tree Logic (CTL) and the series of extensions CTL-FV (CTL with Free Variables), CTL-V (CTL with quantified Variables) and CTL-VW (CTL with Variables and Witnesses).

2.2.1 Computation Tree Logic

Computation Tree Logic (CTL) [8] is categorized as a branching-time temporal logic, making it well suited for reasoning about systems that change over time in a branching structure. Such systems are commonly modeled using a directed graph, where each individual node represents a possible state of the system and edges represent possible state transitions. CTL consists of the elements of propositional logic extended with a set of connectives that allows one to express properties of paths in a directed graph. A typical variant of CTL is shown in Table 2.2, where ϕ denotes a formula (a syntactically legal statement in the language), ⊤ and ⊥ denote true and false respectively and p denotes an atomic proposition.

ϕ ::= ⊤ | ⊥ | p | ¬ϕ | ϕ ∨ ϕ | ϕ ∧ ϕ | ϕ → ϕ
    | AX ϕ | EX ϕ | AF ϕ | EF ϕ | AG ϕ | EG ϕ
    | A[ϕ U ϕ] | E[ϕ U ϕ]

                                        For all paths    There exists a path
ϕ holds in the neXt state:              AX ϕ             EX ϕ
ϕ holds now or in the Future:           AF ϕ             EF ϕ
ϕ holds now and forever (Globally):     AG ϕ             EG ϕ
ϕ holds Until ψ holds:                  A[ϕ U ψ]         E[ϕ U ψ]

Table 2.2: Typical variant of CTL.

A simple branching system is shown in Figure 2.1, in which we have that, for example, AG(p ∨ q), EX ¬q and AF(AG q) all hold in state s1.

Figure 2.1: A simple branching system example, with states s1 (labeled p), s2 (labeled p, q), s3 (labeled p) and s4 (labeled q).

The formal semantics of CTL are defined in terms of a model which captures the properties of the system under inspection.

Definition 1. A CTL model M is a triple (S, T, L) where S is the set of states, T is a binary relation over S such that (s1, s2) ∈ T if the system contains a directed transition from state s1 to state s2, and L is a function (the “labeling function”) from S to the powerset of atomic propositions relevant to the system. For technical reasons we also require that every state has at least one successor: ∀s ∈ S. ∃s′ ∈ S. (s, s′) ∈ T. □

The model M corresponding to the picture of Figure 2.1 is:

M = (S, T, L) where

S = {s1, s2, s3, s4},

T = {(s1, s2), (s1, s3), (s2, s4), (s3, s4), (s4, s4)},

L(s1) = {p}, L(s2) = {p, q},

L(s3) = {p}, L(s4) = {q}
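As an illustration, the model above can be written down directly using Python sets and dictionaries. The representation and the helper names successors and is_valid_model are hypothetical choices made for this sketch; the check mirrors the requirement of Definition 1 that every state has at least one successor.

```python
# States, transition relation and labeling function of Figure 2.1.
S = {"s1", "s2", "s3", "s4"}
T = {("s1", "s2"), ("s1", "s3"), ("s2", "s4"), ("s3", "s4"), ("s4", "s4")}
L = {"s1": {"p"}, "s2": {"p", "q"}, "s3": {"p"}, "s4": {"q"}}


def successors(T, s):
    """Return the set of states reachable from s in one transition."""
    return {t for (u, t) in T if u == s}


def is_valid_model(S, T, L):
    """Check the side conditions of Definition 1 for the triple (S, T, L)."""
    # T must be a relation over S.
    edges_ok = all(u in S and t in S for (u, t) in T)
    # L must assign a (possibly empty) label set to every state.
    labels_ok = all(s in L for s in S)
    # Every state must have at least one successor.
    total = all(successors(T, s) for s in S)
    return edges_ok and labels_ok and total


valid = is_valid_model(S, T, L)
```

Note that the self-loop (s4, s4) is what makes the model satisfy the successor requirement; without it, s4 would be a dead end.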

The semantics of CTL are recursively defined in the form M, s ⊨ ϕ. The notation is to be read as “ϕ holds (is true) in state s of model M”. The semantics for the variant of CTL of Table 2.2 are shown in Table 2.3.

It is common to treat the problem of performing judgments on CTL formulae as an instance of the more general notion of model checking [9], which is a generic term for algorithmic methods of verifying that a given specification holds on a given finite state graph model of a system.

For a model M = (S, T, L) and a state s ∈ S:

M, s ⊨ ⊤
    true holds in any state of any model.
M, s ⊭ ⊥
    false does not hold in any state of any model.
M, s ⊨ p ⇔ p ∈ L(s)
    p holds in states labeled with p.
M, s ⊨ ¬ϕ ⇔ M, s ⊭ ϕ
    ¬ϕ holds in states where ϕ does not hold.
M, s ⊨ ϕ ∨ ψ ⇔ M, s ⊨ ϕ OR M, s ⊨ ψ
M, s ⊨ ϕ ∧ ψ ⇔ M, s ⊨ ϕ AND M, s ⊨ ψ
M, s ⊨ ϕ → ψ ⇔ M, s ⊭ ϕ OR M, s ⊨ ψ
M, s ⊨ AX ϕ ⇔ ∀(s, s′) ∈ T. M, s′ ⊨ ϕ
    AX ϕ holds in states where ϕ holds in all immediate successors.
M, s ⊨ EX ϕ ⇔ ∃(s, s′) ∈ T. M, s′ ⊨ ϕ
    EX ϕ holds in states where ϕ holds in some immediate successor.
M, s ⊨ AF ϕ ⇔ ∀P ∈ Path(s). (∃s′ ∈ P. M, s′ ⊨ ϕ)
    AF ϕ holds in states where all paths contain a state where ϕ holds.
M, s ⊨ EF ϕ ⇔ ∃P ∈ Path(s). (∃s′ ∈ P. M, s′ ⊨ ϕ)
    EF ϕ holds in states where some path contains a state where ϕ holds.
M, s ⊨ AG ϕ ⇔ ∀P ∈ Path(s). (∀s′ ∈ P. M, s′ ⊨ ϕ)
    AG ϕ holds in states where ϕ holds in every state of all paths.
M, s ⊨ EG ϕ ⇔ ∃P ∈ Path(s). (∀s′ ∈ P. M, s′ ⊨ ϕ)
    EG ϕ holds in states where ϕ holds in every state of some path.
M, s ⊨ A[ϕ U ψ] ⇔ ∀P ∈ Path(s). (∃s_i ∈ P. (M, s_i ⊨ ψ ∧ ∀j ∈ [1, i − 1]. M, s_j ⊨ ϕ))
    A[ϕ U ψ] holds in states where all paths contain a state s′ where ψ holds with ϕ holding in every state preceding s′ on the path.
M, s ⊨ E[ϕ U ψ] ⇔ ∃P ∈ Path(s). (∃s_i ∈ P. (M, s_i ⊨ ψ ∧ ∀j ∈ [1, i − 1]. M, s_j ⊨ ϕ))
    E[ϕ U ψ] holds in states where some path contains a state s′ where ψ holds with ϕ holding in every state preceding s′ on the path.

Where Path(s) is the set of all successor chains originating in s, i.e. all possible paths starting in s. Note that for all paths P ∈ Path(s) we have s ∈ P, i.e. all paths include the current state.

Path(s) = {(s_1, s_2, ...) | s_1 = s ∧ (s_i, s_{i+1}) ∈ T ∧ i ∈ [1, ∞]}

Table 2.3: Semantics of typical CTL.

Listing 2. Pseudocode for the CTL model checking algorithm.

function FIX(f, X):
    // Given a function f and an initial value X, returns the
    // fixpoint f(f(...f(X)))
    repeat:
        Xold := X
        X := f(X)
        if X = Xold:
            return X

function preAny(M, X):
    // Given a model and a set of states X, returns the set of
    // states which CAN transition into a state in X
    return {s ∈ M_S | ∃(s, s′) ∈ M_T . s′ ∈ X}

function preAll(M, X):
    // Given a model and a set of states X, returns the set of
    // states which can ONLY transition into a state in X
    return {s ∈ M_S | ∀(s, s′) ∈ M_T . s′ ∈ X}

function SAT(M, ϕ):
    ϕ match:
        case ⊤: return M_S
        case atomic: return {s ∈ M_S | ϕ ∈ M_L(s)}
        case ¬ϕ1: return M_S − SAT(M, ϕ1)
        case ϕ1 ∧ ϕ2: return SAT(M, ϕ1) ∩ SAT(M, ϕ2)
        case EX ϕ1: return preAny(M, SAT(M, ϕ1))
        case AX ϕ1: return preAll(M, SAT(M, ϕ1))
        case E[ϕ1 U ϕ2]:
            X := SAT(M, ϕ2);
            return FIX(λX ↦ X ∪ (SAT(M, ϕ1) ∩ preAny(M, X)), X)
        case A[ϕ1 U ϕ2]:
            X := SAT(M, ϕ2);
            return FIX(λX ↦ X ∪ (SAT(M, ϕ1) ∩ preAll(M, X)), X)
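The pseudocode of Listing 2 translates fairly directly into runnable code. The Python sketch below is one possible rendering and makes several assumptions of its own: formulas are encoded as nested tuples, a model is the triple (S, T, L) with L a dictionary, and the helper constructors disj, EF, AF and AG rewrite derived connectives into the restricted subset using the standard equivalences EF ϕ ≡ E[⊤ U ϕ], AF ϕ ≡ A[⊤ U ϕ] and AG ϕ ≡ ¬EF(¬ϕ). None of these names are taken from SPOON-SMPL.

```python
def fix(f, x):
    """Iterate f starting from x until the value stops changing."""
    while True:
        x_old, x = x, f(x)
        if x == x_old:
            return x


def pre_any(model, X):
    """States that CAN transition into a state in X."""
    S, T, _ = model
    return {s for s in S if any(u == s and t in X for (u, t) in T)}


def pre_all(model, X):
    """States that can ONLY transition into a state in X."""
    S, T, _ = model
    return {s for s in S if all(t in X for (u, t) in T if u == s)}


def sat(model, phi):
    """Return the set of states of the model in which phi holds."""
    S, _, L = model
    kind = phi[0]
    if kind == "true":
        return set(S)
    if kind == "atom":
        return {s for s in S if phi[1] in L[s]}
    if kind == "not":
        return set(S) - sat(model, phi[1])
    if kind == "and":
        return sat(model, phi[1]) & sat(model, phi[2])
    if kind == "EX":
        return pre_any(model, sat(model, phi[1]))
    if kind == "AX":
        return pre_all(model, sat(model, phi[1]))
    if kind in ("EU", "AU"):
        pre = pre_any if kind == "EU" else pre_all
        hold = sat(model, phi[1])
        return fix(lambda X: X | (hold & pre(model, X)), sat(model, phi[2]))
    raise ValueError("unsupported connective: " + str(kind))


# Hypothetical helper constructors for atoms and derived connectives.
TRUE = ("true",)
def atom(p): return ("atom", p)
def neg(f): return ("not", f)
def disj(f, g): return neg(("and", neg(f), neg(g)))
def EF(f): return ("EU", TRUE, f)
def AF(f): return ("AU", TRUE, f)
def AG(f): return neg(EF(neg(f)))


# The model of Figure 2.1.
M = (
    {"s1", "s2", "s3", "s4"},
    {("s1", "s2"), ("s1", "s3"), ("s2", "s4"), ("s3", "s4"), ("s4", "s4")},
    {"s1": {"p"}, "s2": {"p", "q"}, "s3": {"p"}, "s4": {"q"}},
)
p, q = atom("p"), atom("q")
ag_p_or_q = sat(M, AG(disj(p, q)))
ex_not_q = sat(M, ("EX", neg(q)))
af_ag_q = sat(M, AF(AG(q)))
```

Running the sketch on the model of Figure 2.1 confirms that AG(p ∨ q), EX ¬q and AF(AG q) all hold in state s1.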

Model checking the formula EX(p ∧ AF(q)):

Given a model M = (S, T, L), we begin by checking the formula leaf nodes p and q. We mark with p all states s ∈ S for which p ∈ L(s) and with q all states s ∈ S for which q ∈ L(s). Next, moving upwards in the formula parse tree, we mark with AF(q) all states marked with q, and then repeatedly apply the AF(q) mark to states for which all successors are marked with AF(q), until there is no change. Next, we mark with p ∧ AF(q) all states that are marked with both p and AF(q), before finally marking with EX(p ∧ AF(q)) all states that have at least one successor marked with p ∧ AF(q). The result is the set of states marked with EX(p ∧ AF(q)).

Formula parse tree: EX at the root, with the single child ∧, whose children are p and AF, the latter with the single child q.

Figure 2.2: CTL model checking example.
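The marking procedure of Figure 2.2 can be followed step by step in code. The Python sketch below applies it to the model of Figure 2.1; the choice of model, the dictionary of marks and the use of formula strings as mark names are assumptions made purely for illustration.

```python
# The model of Figure 2.1.
S = {"s1", "s2", "s3", "s4"}
T = {("s1", "s2"), ("s1", "s3"), ("s2", "s4"), ("s3", "s4"), ("s4", "s4")}
L = {"s1": {"p"}, "s2": {"p", "q"}, "s3": {"p"}, "s4": {"q"}}


def successors(s):
    return {t for (u, t) in T if u == s}


marks = {s: set() for s in S}

# Leaf nodes: mark each state with the atomic propositions it is labeled with.
for s in S:
    marks[s] |= L[s] & {"p", "q"}

# AF(q): seed with the states marked q, then repeatedly mark states for
# which all successors are already marked, until there is no change.
for s in S:
    if "q" in marks[s]:
        marks[s].add("AF(q)")
changed = True
while changed:
    changed = False
    for s in S:
        if "AF(q)" not in marks[s] and all(
                "AF(q)" in marks[t] for t in successors(s)):
            marks[s].add("AF(q)")
            changed = True

# p ∧ AF(q): states carrying both marks.
for s in S:
    if {"p", "AF(q)"} <= marks[s]:
        marks[s].add("p ∧ AF(q)")

# EX(p ∧ AF(q)): states with at least one successor marked p ∧ AF(q).
for s in S:
    if any("p ∧ AF(q)" in marks[t] for t in successors(s)):
        marks[s].add("EX(p ∧ AF(q))")

result = {s for s in S if "EX(p ∧ AF(q))" in marks[s]}
```

On this model every state ends up marked with AF(q), the states s1, s2 and s3 are marked with p ∧ AF(q), and only s1 has a successor carrying that mark, so the result is the single state s1.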

The basic CTL model checking algorithm [8] SAT takes as input a model M = (S, T, L) and a CTL formula ϕ and outputs the set of states in which the formula holds:

SAT(M, ϕ) ≝ {s ∈ S | M, s ⊨ ϕ}.

The algorithm takes the form of bottom-up propagation of results corresponding to the subformulas of the parse tree of a given formula, with filtering taking place when results of lower-level subformulas are collected by higher-level subformulas. An example of model checking the formula EX(p ∧ AF(q)) is shown in Figure 2.2.

Several of the CTL connectives can be expressed in terms of the other connectives [8]. For example, AG ϕ is semantically equivalent to ¬EF(¬ϕ). By exploiting these equivalences it is possible to reduce the size of the semantics or, perhaps more importantly, to reduce the size and complexity of a computer algorithm that performs model checking. It can be shown [8] that any set of connectives that includes EU, at least one of {AX, EX} and at least one of {EG, AF, AU} is able to derive the missing connectives.

Listing 2 represents a complete algorithm of SAT for the typical variant of CTL. For the sake of brevity the formulas accepted by the algorithm have been restricted to the subset

ϕ ::= ⊤ | p | ¬ϕ | ϕ ∧ ϕ | EX ϕ | AX ϕ | E[ϕ U ϕ] | A[ϕ U ϕ]

Arbitrary input formulas would thus first need to be converted to this subset. For the base connectives we have ⊥ ≡ ¬⊤, ϕ ∨ ψ ≡ ¬(¬ϕ ∧ ¬ψ) and ϕ → ψ ≡ ¬ϕ ∨ ψ, while for the path connectives AG, EG etc. there are several alternative derivations [8].

The AU connective is useful for specifying global properties over arbitrary computation paths. For example, one could use the AU connective to specify that a certain variable is deleted exactly once:

A[¬(x-deleted) U x-deleted]

However, the AU connective can become problematic in models that contain loops. Computer programs commonly contain loops that may in practice always terminate, but for which the possibility of an endless loop cannot be excluded by static analysis. From a CTL perspective, such a model may allow for infinite paths that endlessly loop over a sequence of states. The semantics of the AU connective requires the ψ sub-formula of the expression ϕ U ψ to eventually be satisfied on all paths, which means the existence of an unrealistic, infinite successor path on which ψ is never satisfied could cause a complete formula A[ϕ U ψ] to fail to be satisfied for certain states.

One approach to this issue is to replace, when appropriate, the use of the AU connective with what is known as the AW connective [4]. The AW connective can be viewed as a Weaker variant of AU, with the typical semantics of AW [8] being identical to that of AU with the single modification of lifting the requirement of the eventual satisfaction of ψ. Figure 2.3 presents the typical semantics for AW.

M, s ⊨ A[ϕ W ψ] ⇔ ∀P ∈ Path(s). ((∃s_i ∈ P. (M, s_i ⊨ ψ ∧ ∀j ∈ [1, i − 1]. M, s_j ⊨ ϕ)) ∨ (∀s′ ∈ P. M, s′ ⊨ ϕ))

Figure 2.3: Typical semantics of AW.

A derivation of the AW connective using the CTL subset accepted by the pseudocode of Listing 2 is A[ϕ W ψ] ≡ ¬E[¬ψ U ¬(ϕ ∨ ψ)] [8].
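The difference between AU and AW on a model containing a loop can be demonstrated with a small Python sketch. The model below is a hypothetical example introduced for illustration: a state loop (labeled p) that may either repeat or move on to exit (labeled q), plus an unrelated state stuck that carries no labels. The tuple encoding of formulas and the helper weak_until are likewise assumptions of this sketch; weak_until builds the derivation above, writing ¬(ϕ ∨ ψ) as ¬ϕ ∧ ¬ψ to stay within the restricted subset.

```python
def sat(model, phi):
    """Minimal CTL checker for atoms, negation, conjunction, EU and AU."""
    S, T, L = model
    kind = phi[0]
    if kind == "atom":
        return {s for s in S if phi[1] in L[s]}
    if kind == "not":
        return set(S) - sat(model, phi[1])
    if kind == "and":
        return sat(model, phi[1]) & sat(model, phi[2])
    if kind in ("EU", "AU"):
        hold = sat(model, phi[1])
        X = sat(model, phi[2])
        while True:
            if kind == "EU":
                # States with SOME successor in X.
                pre = {s for s in S
                       if any(u == s and t in X for (u, t) in T)}
            else:
                # States with ALL successors in X.
                pre = {s for s in S
                       if all(t in X for (u, t) in T if u == s)}
            new = X | (hold & pre)
            if new == X:
                return X
            X = new
    raise ValueError("unsupported connective: " + str(kind))


def weak_until(phi, psi):
    """Build A[phi W psi] as the negation of E[not psi U (not phi and not psi)]."""
    return ("not", ("EU", ("not", psi),
                    ("and", ("not", phi), ("not", psi))))


# A loop that may or may not terminate, and an unrelated unlabeled state.
M = (
    {"loop", "exit", "stuck"},
    {("loop", "loop"), ("loop", "exit"), ("exit", "exit"), ("stuck", "stuck")},
    {"loop": {"p"}, "exit": {"q"}, "stuck": set()},
)
p, q = ("atom", "p"), ("atom", "q")
au = sat(M, ("AU", p, q))
aw = sat(M, weak_until(p, q))
```

In this model A[p U q] fails in the state loop because of the infinite path that never leaves the loop, whereas A[p W q] holds there. The state stuck shows that AW is not trivially satisfied: it holds neither p nor q, so neither formula holds in it.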

2.2.2 CTL with free variables

CTL with Free Variables (CTL-FV) [4] is an extension of CTL that:

1. Introduces an environment θ to all judgments: M, s ⊨θ ϕ where θ is a function from (logical) variable names to a set of concrete values.

2. Replaces atomic propositions p with n-ary predicates p(x1, ..., xn). The semantics of predicates involves the environment:

M, s ⊨θ p(x1, ..., xn) ⇔ p(θ(x1), ..., θ(xn)) ∈ ML(s)

In the definition of a model M, the codomain of the labeling function L changes to the powerset of n-ary predicates.

3. Replaces the SAT function with a function SATFV that returns the set of state-environment pairs that satisfy a given formula:

SATFV(M, ϕ) ≝ {(s, θ) | M, s ⊨θ ϕ}

For model checking, when considering the expression M, s ⊨θ p(x1, ..., xn) the environment θ can be treated as an unknown to be determined by resolving the constraint p(θ(x1), ..., θ(xn)) ∈ ML(s). As an example, let some model M′ contain a state s′ labeled with the predicate f(1, 2). The expression M′, s′ ⊨θ f(x1, x2) with θ unknown then implies θ(x1) = 1 and θ(x2) = 2, so we would have

(s′, [x1 ↦ 1, x2 ↦ 2]) ∈ SATFV(M′, f(x1, x2)).
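The resolution step described above can be sketched as follows. The class PredicateMatch and the method resolve are hypothetical names, concrete values are restricted to integers, and the predicate names are assumed to have been compared already; this is an illustration of the idea rather than the procedure of [4].

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

final class PredicateMatch {
    /**
     * Tries to find an environment theta such that applying theta to the
     * variables yields exactly the arguments found in a state label.
     * Returns empty when no such environment exists.
     */
    static Optional<Map<String, Integer>> resolve(List<String> variables,
                                                  List<Integer> labelArguments) {
        if (variables.size() != labelArguments.size()) {
            return Optional.empty(); // arity mismatch
        }
        Map<String, Integer> theta = new HashMap<>();
        for (int i = 0; i < variables.size(); i++) {
            Integer value = labelArguments.get(i);
            Integer previous = theta.putIfAbsent(variables.get(i), value);
            if (previous != null && !previous.equals(value)) {
                // The same logical variable would need two different values.
                return Optional.empty();
            }
        }
        return Optional.of(theta);
    }
}
```

Resolving the variables x1, x2 against the label arguments 1, 2 yields the environment [x1 ↦ 1, x2 ↦ 2] of the example, while a pattern such as f(x, x) finds no environment for the same label.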

Resolving the environment bindings in this way requires adaptations to the model checking algorithm:

• SATFV(M, ϕ ∧ ψ) must test for compatible merges between the environments of results yielded from checking the two subformulas:

SATFV(M, ϕ ∧ ψ) = {(s, θ) | s ∈ MS
    ∧ (s, θ1) ∈ SATFV(M, ϕ)
    ∧ (s, θ2) ∈ SATFV(M, ψ)
    ∧ θ1 ⊓ θ2 = θ ≠ ⊥}

where
    θa ⊓ θb = the merge of θa and θb, if θa and θb are compatible
    θa ⊓ θb = ⊥, if θa and θb are incompatible

The formal details of environment merges and compatibility will be left undefined at this point¹. Informally, two environments are compatible if they do not contain contradictory information, and a merge of two environments is an environment that contains bindings for all logical variables jointly present in the two environments being merged while being compatible with each of them.

• SATFV(M, AX ϕ) must generate one result for every compatible merge of complete sets of successor environments found in SATFV(M, ϕ):

Definition 2. In the context of a model M, a state s and a formula ϕ, a complete set of successor environments is a set E = {θ1, ..., θn} such that for every s′ a successor to s there is at least one environment θi ∈ E such that M, s′ ⊨θi ϕ. □

Given a state s, a formula ϕ and some complete set of successor environments {θ1, ..., θn} covering all successors to s, if it is possible to merge all the environments in the set into a single environment θ, then it must be the case that M, s ⊨θ AX ϕ. Similar reasoning on sets of successor environments must be applied in the handling of AF ϕ, AG ϕ and A[ϕ U ψ]. The SAT function for AX ϕ takes the form

SATFV(M, AX ϕ) = {(s, θ) | s ∈ MS
    ∧ {s1, ..., sn} = {s′ | (s, s′) ∈ MT}
    ∧ {(s1, θ1), ..., (sn, θn)} ⊆ SATFV(M, ϕ)
    ∧ θ1 ⊓ ... ⊓ θn = θ ≠ ⊥}

Note that there may exist many complete sets of successor environments formed according to Equation 2.1.

{(s1, θ1), ..., (sn, θn)} ⊆ SATFV(M, ϕ). (2.1)

• SATFV(M, ¬ϕ) must include

1. All states for which ϕ does not hold under any environment.

2. All states for which ϕ holds under some environment(s), each paired with the negation of the environment under which ϕ holds.

¹ Complete definitions are available in [4].

The SAT function for ¬ϕ takes the form

SATFV(M, ¬ϕ) = {(s, ∅) | s ∈ MS ∧ ∀(s′, θ) ∈ SATFV(M, ϕ). s ≠ s′}
    ∪ {(s, ¬θ) | s ∈ MS ∧ (s, θ) ∈ SATFV(M, ϕ)}

where ∅ is the empty environment.

The formal details of environment negation will be left undefined at this point². Informally, the idea is that a negated environment is the environment that is maximally contradictory to the original environment, with every binding mapped to the "opposite" value of the original binding.
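The informal notions of compatibility and merging, and their use in the handling of AX, can be sketched in a few lines. The sketch below is an assumption-laden simplification: environments are plain maps from variable names to integers, negated environments are not modeled at all, and the names EnvironmentMerge, merge and allNext are hypothetical. The complete definitions remain those of [4].

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

final class EnvironmentMerge {
    // Merge of two environments, or empty (bottom) when they bind the
    // same variable to different values.
    static Optional<Map<String, Integer>> merge(Map<String, Integer> a,
                                                Map<String, Integer> b) {
        Map<String, Integer> merged = new HashMap<>(a);
        for (Map.Entry<String, Integer> binding : b.entrySet()) {
            Integer existing = merged.putIfAbsent(binding.getKey(), binding.getValue());
            if (existing != null && !existing.equals(binding.getValue())) {
                return Optional.empty();
            }
        }
        return Optional.of(merged);
    }

    // Environments under which AX phi holds in a state, given the successors
    // of that state and the results for phi grouped by state. One environment
    // is picked per successor and all picks must merge.
    static List<Map<String, Integer>> allNext(
            List<Integer> successors,
            Map<Integer, List<Map<String, Integer>>> satPhi) {
        List<Map<String, Integer>> partial = new ArrayList<>();
        partial.add(new HashMap<>()); // the empty environment merges with anything
        for (Integer successor : successors) {
            List<Map<String, Integer>> extended = new ArrayList<>();
            for (Map<String, Integer> soFar : partial) {
                for (Map<String, Integer> theta : satPhi.getOrDefault(successor, List.of())) {
                    merge(soFar, theta).ifPresent(extended::add);
                }
            }
            partial = extended;
        }
        return partial;
    }
}
```

If two successors satisfy ϕ only under [y ↦ 2] and [y ↦ 3] respectively, allNext returns no environment, since the merge is ⊥. If both successors satisfy ϕ under [y ↦ 2], the single environment [y ↦ 2] is returned.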

CTL-FV is an improvement over CTL in the context of reasoning over models of computer programs. The introduction of logical variables, environments and predicates allows for the construction of generalized formulas able to match many programs, using logical variables to bind to arbitrary program elements such as program variable names. For example, one could express a formula checking for the existence of an unnecessary local variable as

ϕ = VarInit(x) ∧ AX(A[¬VarUsed(x) U ReturnVar(x)])

where VarInit, VarUsed and ReturnVar are predicates of a single logical variable. This formula would hold in model MA for the function shown in Listing 3 with SATFV(MA, ϕ) = {(s1, [x ↦ result])}, but would not hold in model MB for the function shown in Listing 4.

Listing 3. Example function A.

    int fnA()
    {
        int result = 10;
        bar();
        return result;
    }

MA = (S, T, L) where
    S = {s1, s2, s3}
    T = {(s1, s2), (s2, s3), (s3, s3)}
    L(s1) = {VarInit(result)}
    L(s2) = {}
    L(s3) = {ReturnVar(result)}

² Complete definitions are available in [4].

Listing 4. Example function B.

    int fnB()
    {
        int result = 10;
        bar(result);
        return result;
    }

MB = (S, T, L) where
    S = {s1, s2, s3}
    T = {(s1, s2), (s2, s3), (s3, s3)}
    L(s1) = {VarInit(result)}
    L(s2) = {VarUsed(result)}
    L(s3) = {ReturnVar(result)}
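As a sanity check of the claim about the models MA and MB, the formula VarInit(x) ∧ AX(A[¬VarUsed(x) U ReturnVar(x)]) can be evaluated at state s1 with a small hand-written sketch. The sketch assumes that the binding x ↦ result is supplied up front instead of being discovered by the model checker, encodes labels as strings, and uses the hypothetical names UnusedVariableCheck and holdsAtS1.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class UnusedVariableCheck {
    // Transition relation shared by both models: s1 -> s2, s2 -> s3, s3 -> s3.
    static final Map<Integer, List<Integer>> SUCC =
            Map.of(1, List.of(2), 2, List.of(3), 3, List.of(3));

    static final Map<Integer, Set<String>> MODEL_A = Map.of(
            1, Set.of("VarInit(result)"),
            2, Set.<String>of(),
            3, Set.of("ReturnVar(result)"));

    static final Map<Integer, Set<String>> MODEL_B = Map.of(
            1, Set.of("VarInit(result)"),
            2, Set.of("VarUsed(result)"),
            3, Set.of("ReturnVar(result)"));

    // Evaluates VarInit(x) and AX(A[not VarUsed(x) U ReturnVar(x)]) at state s1.
    static boolean holdsAtS1(Map<Integer, Set<String>> labels, String x) {
        // Least fixpoint for A[not VarUsed(x) U ReturnVar(x)].
        Set<Integer> until = new HashSet<>();
        for (Integer s : SUCC.keySet()) {
            if (labels.get(s).contains("ReturnVar(" + x + ")")) {
                until.add(s);
            }
        }
        boolean changed = true;
        while (changed) {
            changed = false;
            for (Integer s : SUCC.keySet()) {
                boolean unused = !labels.get(s).contains("VarUsed(" + x + ")");
                if (!until.contains(s) && unused && until.containsAll(SUCC.get(s))) {
                    until.add(s);
                    changed = true;
                }
            }
        }
        // AX requires every successor of s1 to satisfy the until formula.
        return labels.get(1).contains("VarInit(" + x + ")")
                && until.containsAll(SUCC.get(1));
    }
}
```

Under these assumptions the check succeeds for MODEL_A and fails for MODEL_B, because state s2 of MB is labeled VarUsed(result) before the return is reached.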

2.2.3 CTL with quantified variables

CTL with quantified Variables (CTL-V) [4] is an extension of CTL-FV that introduces existentially quantified logical variables in formulas:

ϕ ::= ... | ∃x.ϕ

CTL-V extends the function SATFV in the form of the function SATV that additionally handles this new type of formula element. The semantics of a formula ∃x.ϕ is defined over the codomain of the environment function (the set of concrete values that the logical variables can be mapped to), herein referred to as Val.

M, s ⊨θ ∃x.ϕ ⇔ ∃v ∈ Val. (M, s ⊨θ[x↦v] ϕ)

Note that this definition does not include the mapping [x ↦ v] in the environment associated with the left-hand side expression, meaning the mapping should not be included in results produced by SATV(M, ∃x.ϕ). Taking some model M′ to contain a state s′ labeled with the predicate f(1, 2) and taking Val to be {1, 2}, the expression M′, s′ ⊨θ ∃x. f(x, y) holds by M′, s′ ⊨θ[x↦1] f(x, y) with the single requirement of θ(y) = 2.

As an example of why it would be incorrect for the SATV function to include the mapping in the environment for the left-hand side expression, consider model checking the formula ∃x.f(x, 2) ∧ ∃x.f(1, x) in the same example model M′ using a BADSATV function that did include the mapping:

(s′, [x ↦ 1]) ∈ BADSATV(M′, ∃x.f(x, 2))
(s′, [x ↦ 2]) ∈ BADSATV(M′, ∃x.f(1, x))
(s′, θ) ∉ BADSATV(M′, ∃x.f(x, 2) ∧ ∃x.f(1, x)) since [x ↦ 1] ⊓ [x ↦ 2] = ⊥
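The effect of dropping the quantified binding can be made concrete with a short sketch. Environments are again simplified to maps from variable names to integers, and the names ExistsProjection, dropBinding and compatible are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class ExistsProjection {
    // Environment reported for "exists x. phi": theta without its binding of x.
    static Map<String, Integer> dropBinding(Map<String, Integer> theta, String x) {
        Map<String, Integer> projected = new HashMap<>(theta);
        projected.remove(x);
        return projected;
    }

    // True when no variable is bound to different values in the two environments.
    static boolean compatible(Map<String, Integer> a, Map<String, Integer> b) {
        for (Map.Entry<String, Integer> binding : a.entrySet()) {
            Integer other = b.get(binding.getKey());
            if (other != null && !other.equals(binding.getValue())) {
                return false;
            }
        }
        return true;
    }
}
```

With the bindings kept, [x ↦ 1] and [x ↦ 2] are incompatible and the conjunction fails, as with BADSATV. After dropBinding has removed x from both environments they are trivially compatible, so the conjunction can succeed.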

CTL-FV requires that each logical variable can be bound to a single consistent concrete value. By the addition of existentially quantified logical variables, CTL-V allows for expressing formulas where logical variables can bind to different values along different computation paths. Looking at state 1 of the model in Figure 2.4 and taking Val = {1, 2, 3} we have f(x) ∧ AX(∃y. g(y)), but we do not have f(x) ∧ AX(g(y)) which is the closest expression possible using CTL-FV.

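[Figure: model with five states. State 1 is labeled f(1), states 2 and 4 are labeled g(2) and g(3), and states 3 and 5 are labeled h(1, 2) and h(1, 3).]

Figure 2.4: Example model.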

2.2.4 CTL with variables and witnesses

CTL with Variables and Witnesses (CTL-VW) is an extension of CTL-V [4]. While CTL-V allows for reasoning, specification and model checking using parameterized propositions (predicates) and scoped logical variables (able to take on different values along different computation paths), it does not record the bindings of quantified logical variables involved in a satisfied formula. In the context of code transformation, having access to these bindings is useful. CTL-VW was developed in part to address this shortcoming of CTL-V.

As an example of the need for variable bindings to be recorded, a transformation may want to swap the order of two program variables that appear as parameters to a function call. Such a transformation rule consists of two parts: a match rule and a production (replacement) rule. Specifying the match rule using logical variables avoids the need for writing separate rules for different cases of program variable names, but also creates the requirement for the production rule to be able to read the values bound to the logical variables in order to produce the correct replacement.

CTL-VW extends CTL-V by:

1. Introducing the notion of witnesses.

A witness is a 4-tuple (s, x, v ∈ Val, {ω1, ..., ωn}), comprising a record of an environment mapping between a logical variable x and a value v that was involved in satisfying some formula in a specific state s, along with the set of other witnesses {ω1, ..., ωn} that were established prior to constructing the new record.

2. Changing judgments to be of the form M, s ⊨θ,Ω ϕ, where Ω is a set of witnesses. The SATVW function therefore computes a set of state-environment-witnesses triples:

SATVW(M, ϕ) ≝ {(s, θ, Ω) | M, s ⊨θ,Ω ϕ}

Importantly, the semantics of existentially quantified variables changes to:

M, s ⊨θ,{(s,x,v,Ω)} ∃x.ϕ ⇔ ∃v ∈ Val. M, s ⊨θ[x↦v],Ω ϕ

Witnesses are thus used to capture the bindings that are dropped from the environment.

In the model checking of CTL-VW, when results of subformulas are combined by their enclosing connectives (such as ∧ or AX), sets of witnesses of results being combined are joined under regular set union. As an example, looking at state 1 of the model in Figure 2.4 and taking Val = {1, 2, 3} we have

(s1, ∅, {(s1, x, 1, {(s2, y, 2, ∅), (s4, y, 3, ∅)})}) ∈ SATVW(M, ∃x.(f(x) ∧ AX(∃y. g(y))))

Witnesses never influence the matching process in the model checking algorithm. Therefore, they can be used to have the SATVW function produce an output that associates states with arbitrary information. Again using the example model of Figure 2.4 and taking Val = {1, 2, 3}, if we introduce the formula x := y to mean the unconditional assignment of environment variable x to the value y, we can construct the formula

∃x.(f(x) ∧ AX(∃y.(g(y) ∧ ∃op. op := delete)))

Model checking this formula against the model in Figure 2.4 would produce the result

(s1, ∅, {(s1, x, 1, {(s2, y, 2, {(s2, op, delete, ∅)}),

(s4, y, 3, {(s4, op, delete, ∅)})})})

Such a result can be seen as a tree encoding transformation operations, where operation instructions are found in the leaf nodes of the tree and any variable bindings available for use by a particular operation can be collected on the path to the leaf.
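Reading such a witness tree can be sketched as a simple recursive traversal. The class Witness below is a hypothetical rendering of the 4-tuple with string-valued fields, and treating the variable name op as the marker of an operation leaf is an assumption of this sketch, not something prescribed by CTL-VW.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class Witness {
    final String state;
    final String variable;
    final String value;
    final List<Witness> nested;

    Witness(String state, String variable, String value, Witness... nested) {
        this.state = state;
        this.variable = variable;
        this.value = value;
        this.nested = List.of(nested);
    }

    // Returns one line per operation leaf, listing the bindings found on the
    // path from the root down to that leaf.
    static List<String> operations(Witness root) {
        List<String> out = new ArrayList<>();
        collect(root, new LinkedHashMap<>(), out);
        return out;
    }

    private static void collect(Witness w, Map<String, String> bindings, List<String> out) {
        if (w.variable.equals("op")) {
            out.add(w.value + " at " + w.state + " with " + bindings);
            return;
        }
        // Copy before extending so that sibling branches do not see each other's bindings.
        Map<String, String> extended = new LinkedHashMap<>(bindings);
        extended.put(w.variable, w.value);
        for (Witness child : w.nested) {
            collect(child, extended, out);
        }
    }
}
```

Building the tree of the result above and passing it to operations yields two delete operations, one at s2 with x bound to 1 and y bound to 2, and one at s4 with x bound to 1 and y bound to 3.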

2.3 Program analysis and transformation

The term program analysis refers to the notion of algorithmically analysing the source code of a computer program in order to reason about its behavior without actually executing the program [10]. An alternative, often-used term is static analysis, which emphasizes that such analyses target a static, state-independent view of the program. Common approaches include Data Flow Analysis and Abstract Interpretation. In Data Flow Analysis, programs are usually modeled as directed graphs over which analyses are specified. One such analysis is Reaching Definition Analysis, in which each definition of the form x := expr produces a list of subsequent graph nodes for which the particular definition still holds for the variable x. In Abstract Interpretation, some of the concrete types of values involved in the computations performed by the program are replaced with abstract values. For example, one could replace the notion of general integer values with the more abstract notion of signs equipped with arithmetical rules such as Pos+Pos = Pos and Pos+Neg = Unknown. Interpreting a block of code using the abstract values and computations instead of the concrete ones may then produce an output that describes certain behavior, such as whether a certain function given a positive input will always produce a positive output.

Similarly, the term program transformation refers to the notion of algorithmically modifying the source code or object code of a computer program, typically based on the result of some analysis. For example, a Reaching Definition Analysis may be used to find variable definitions that are never used and may therefore be removed. An algorithm taking the result of such an analysis and removing the unused variable definitions from the source code or object code of the analysed program would then fall under the notion of program transformation.
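The sign abstraction mentioned above can be written down as a small abstract domain. The enum Sign and its methods are hypothetical names, and the rules model mathematical integers, so overflow of Java's fixed-width int is deliberately ignored in this sketch.

```java
enum Sign {
    NEG, ZERO, POS, UNKNOWN;

    // Abstract addition, e.g. Pos + Pos = Pos and Pos + Neg = Unknown.
    Sign plus(Sign other) {
        if (this == UNKNOWN || other == UNKNOWN) {
            return UNKNOWN;
        }
        if (this == ZERO) {
            return other;
        }
        if (other == ZERO) {
            return this;
        }
        return this == other ? this : UNKNOWN;
    }

    // Abstract multiplication: zero absorbs, equal signs give Pos, mixed signs give Neg.
    Sign times(Sign other) {
        if (this == ZERO || other == ZERO) {
            return ZERO;
        }
        if (this == UNKNOWN || other == UNKNOWN) {
            return UNKNOWN;
        }
        return this == other ? POS : NEG;
    }
}
```

Interpreting the body of a function such as f(x) = x * x + x over this domain with x = POS gives POS.times(POS).plus(POS), which evaluates to POS. This is the kind of conclusion ("a positive input always produces a positive output") that the paragraph above describes.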

One application of program transformation is the automation of API migration. The majority of modern software programs delegate part of their functionality to one or more libraries [11, 12, 13], with the most common form of delegation consisting of using library-provided functions to perform certain tasks. The functions provided by a library together with rules and policies for how the functions are to be used (e.g. the ordering of a series of related function calls) form the API (Application Programming Interface) of the library. The term API migration refers to the general notion of making changes to application code that uses a library API in order to adapt it to a different version of the API or to a different API that provides similar functionality, without fundamentally changing the functionality of the application [14]. Adapting code to a different version of an API commonly occurs when software developers wish to upgrade to a newer version of a library. Switching to a different API can occur in adapting software to a different platform where a certain library may not be available.

In a more formal sense, an instance of an API migration between two specific APIs can be seen as a mathematical mapping between elements (such as function calls) of the first API to elements of the second API. This model permits the useful characterization of API migrations by their relationship cardinality³, such as being one-to-one or many-to-many. A one-to-one API migration may consist of a single function call in the first API being replaced by a single corresponding call in the second API. A many-to-many API migration may prescribe that a certain order-sensitive series of function calls in the first API is to be replaced by a certain series of calls in the second API, which when taken together perform a corresponding task. The task of automating the application of many-to-many API migrations to programs remains challenging [12].

³ An alternative term is mapping multiplicity.

2.3.1 Spoon

Spoon [2] is a free and open-source metaprogramming library designed for the analysis and transformation of Java source code. Spoon is created and maintained under the helm of Inria, the French National Institute for Research in Digital Science and Technology. Spoon was first presented in 2005 [15] and remains under active development.

The central feature of Spoon is its ability to convert between Java source code and an extensive metamodel – a rich representation of the Abstract Syntax Tree (AST) corresponding to the source code. The metamodel supports efficient searching, introspection (self-examination) and intercession (self-modification). The current version of Spoon parses Java source code using the JDT⁴ library from the Eclipse Foundation.

One of the main ways of building code analyses and transformations using Spoon is by implementing a Spoon Processor. Listing 5 shows a simple Spoon processor that logs occurrences of empty catch clauses. The process method will be called on every CtCatch metamodel element constructed by Spoon when parsing some given target source code, which will typically be a full project directory scanned recursively.

Listing 5. Example Spoon Processor for analysis.

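    import spoon.processing.AbstractProcessor;
    import spoon.reflect.code.CtCatch;
    // The import of Level is omitted: its package depends on the
    // logging library used by the Spoon version at hand.

    class CatchProcessor extends AbstractProcessor<CtCatch> {
        @Override
        public void process(CtCatch element) {
            // Report every catch clause whose body contains no statements.
            if (element.getBody().getStatements().size() == 0) {
                getFactory().getEnvironment()
                    .report(this, Level.WARN, element,
                            "empty catch clause");
            }
        }
    }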

Spoon processors are also suitable for implementing code transformations. Listing 6 shows the process method of the processor from Listing 5 altered to instead modify the code by inserting into every empty catch clause a call to some logging facility (assumed to exist in the target code base) providing the caught exception as the singular argument. The method loggerCall, of which no definition is provided in the example, is assumed to construct a valid Spoon metamodel object corresponding to the call.

Once processing is complete, Spoon will convert (pretty-print, unparse) the metamodel structure back into one or more Java source files. As the intercession API modifies the metamodel in-place, the produced source code will be affected by any modifications.

⁴ https://wiki.eclipse.org/JDT_Core

Listing 6. Example Spoon Processor for transformation.

    public void process(CtCatch element) {
        if (element.getBody().getStatements().size() == 0) {
            // The caught exception variable becomes the sole argument of the inserted call.
            CtCatchVariable<? extends Throwable> exception = element.getParameter();
            element.getBody().addStatement(loggerCall(exception));
        }
    }

2.3.2 Semantic Patch Language

The Semantic Patch Language (SmPL) was originally proposed [1] as part of a solution to the problem of managing collateral evolutions [16] in the Linux kernel source code. Collateral evolutions are modifications to code that uses some library API, made in direct response to changes made to the library API. Such code is often referred to as client code, motivated by the idea that the library API provides a service to which the code forms a client.

SmPL mimics the familiar unified context diff format. Additions are taken as lines beginning with a plus character and deletions are taken as lines beginning with a minus character. Lines starting with a character that has no special meaning (unlike plus and minus) are taken to be context lines that together with any deletions define the pattern to be matched in source code targeted for patch application. SmPL adds support for the syntax and semantics of the C programming language by operating on abstract syntax trees and control flow graphs rather than the plain text. By exploiting the support for the syntax and semantics of C, a number of powerful new features are introduced.

Although SmPL was developed specifically for C, the core concepts are applicable to other languages. For many C-like languages such as Java and PHP, the core concepts are likely to apply in a near-identical fashion. Moreover, other programming languages such as Python and Lisp can be fitted to a model of abstract syntax trees and control flow graphs that is similar in structure to a corresponding model for C. Mapping the core features of SmPL onto such languages would therefore be straightforward. In this thesis, the term SmPL will tend to refer to a generic notion of SmPL that is not coupled to any specific target programming language, and we will tend to write SmPL for X when referring to an SmPL that targets the specific language X.

A first core feature of SmPL is metavariables. Metavariables solve the problem of creating a generic patch for collateral evolutions where different instances of client code use different variable names in calls to an API. For example, the semantic patch shown in Listing 7 will match and transform any call to the function foo using three variables as arguments regardless of what their names are. The patch further shows how metavariables are available in transformations, as the call to foo will be replaced with a call to a function bar with the matched arguments arranged in a different order.

Listing 7. Metavariables in SmPL.

    @@
    identifier x,y,z;
    @@
    - foo(x,y,z)
    + bar(z,x,y)
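The bind-then-reuse behavior of the metavariables in Listing 7 can be imitated at the level of plain text with a regular expression whose capture groups play the role of x, y and z. This is only a rough approximation for illustration: SmPL operates on syntax trees and control flow graphs, not on text, and the class MetavariableSketch is a hypothetical name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MetavariableSketch {
    // foo(x, y, z) where each argument is an identifier; the three capture
    // groups stand in for the metavariables x, y and z.
    private static final Pattern FOO_CALL = Pattern.compile(
            "\\bfoo\\(\\s*([A-Za-z_]\\w*)\\s*,\\s*([A-Za-z_]\\w*)\\s*,\\s*([A-Za-z_]\\w*)\\s*\\)");

    // Replaces every matching call with bar(z, x, y), reusing the bound names.
    static String apply(String source) {
        Matcher matcher = FOO_CALL.matcher(source);
        return matcher.replaceAll("bar($3, $1, $2)");
    }
}
```

Applied to the text foo(a, b, c); the sketch produces bar(c, a, b); whereas foo(1, b, c); is left untouched because 1 is not an identifier, mirroring the identifier constraint on the metavariables.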

Metavariables are themselves typed, with the three metavariables shown in Listing 7 having type identifier, meaning they can only be bound to identifiers such as variable names or function names. Many metavariable types are available; examples include constant, expression and statement, as well as explicit data types such as int.

A second core feature of SmPL is the wildcard operator "...", often referred to as dots. The wildcard operator matches arbitrary sequences of statements over arbitrary computation paths in the control flow graph. The operator also includes the ability to impose customized constraints on each instance. In the semantic patch shown in Listing 8, the ... on line 7 will match all sequences of statements over all computation paths starting from a match of the preceding line T ret = C; until there is a match for return ret;. Note that this means that all paths are required to eventually reach a match for return ret; in order for the operator to match. Additionally, the match will only be considered successful under the constraint specified by when != ret, which says that none of the statements found along any traversed path may involve the use of the variable name bound to the metavariable ret. Furthermore, by default the wildcard operator will only match the shortest path between the elements that enclose it, meaning the preceding and succeeding statements are not allowed to appear on any intermediate step of any path. These and other behaviors can be altered with constraint modifiers specified on an operator instance or by redefining the default constraints for a single rule. Examples of constraint modifiers include the relaxations when exists that lifts the requirement that all paths need to satisfy all of the constraints, and when any that lifts the shortest-path requirement.

Listing 8. Matching over arbitrary computation paths in SmPL.

    1 @@
    2 type T;
    3 identifier ret;
    4 constant C;
    5 @@
    6 - T ret = C;
    7 ... when != ret
    8 - return ret;
    9 + return C;

Note that the wildcard operator can generate multiple matches as it traverses different computation paths. Applying the semantic patch of Listing 8 to the function of Listing 9 produces the output shown in Listing 10 by one application of the removal on line 6 of the patch and two applications of the replacement on lines 8-9 of the patch.

Listing 9. Input

    int foo(int x) {
        int val = 42;
        if (x > 0) {
            return val;
        } else {
            return val;
        }
    }

Listing 10. Output

    int foo(int x) {
        if (x > 0) {
            return 42;
        } else {
            return 42;
        }
    }

The wildcard operator can also appear in the formal parameters of a function definition and in the argument list of a function call. The semantic patch shown in Listing 11 will remove the call print_integer(x) from any function that takes an int as one of its formal parameters, regardless of the actual variable name of x.

Listing 11. Wildcard operator in parameter and argument lists.

    @@
    identifier f, x;
    @@
    f(..., int x, ...) {
    ...
    - print_integer(x);
    ...
    }

Finally, the wildcard operator has two additional variants. The first variant is <... P ...>, which acts like a normal instance of the operator with the addition of an attempt at matching P at every step on every path, but not failing if P does not match. This is useful for transformations such as "remove any calls to a() preceding (by any number of steps) a call to b()" shown in Listing 12.

Listing 12. Remove any calls to a() preceding a call to b().

    @@ @@
    <...
    - a()
    ...>
    b()

The second variant is <+... P ...+>, which acts like <... P ...> with the added requirement that P must match at least once on some path. This is useful for specifying transformations such as "remove any calls to b() if there exists at least one earlier call to a()" shown in Listing 13.

Listing 13. Remove any calls to b() if there exists an earlier call to a().

    @@ @@
    <+...
    a()
    ...+>
    - b()

A third core feature of SmPL is the support for isomorphisms, which allows semantically equivalent statements to match even if the literal code does not match. Basic examples include the matching of "x > 0" to "0 < x" and "x = x + 1" to "x += 1". Regular use of an SmPL system will generally involve a standard library of isomorphism rules, and patches can specify that certain rules should be disabled or that additional rules should be included. The isomorphism rule shown in Listing 14 allows for matching between "x > 0" and "0 < x", along with any other constructions using the less-than or greater-than operators with two expression operands. The rule shown in Listing 15 allows the expression x = x + 1 to match x += 1 but not the other way around (even though that may be sensible), indicated by the single-direction arrow.

Listing 14. Isomorphism rule 1.

    Expression
    @ gtr_lss @
    expression X,Y;
    @@
    X < Y <=> Y > X

Listing 15. Isomorphism rule 2.

    Expression
    @ plus_one @
    identifier x;
    @@
    x = x + 1 => x += 1
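One simple way to picture a bidirectional rule such as the one in Listing 14 is as a normalization applied to both the pattern and the code before they are compared. The sketch below follows that idea for binary comparisons given as (left operand, operator, right operand); the class IsomorphismSketch is hypothetical, and the sketch makes no claim about how Coccinelle applies its isomorphism rules internally.

```java
final class IsomorphismSketch {
    // Canonical form of a binary comparison: "a > b" is rewritten as "b < a",
    // every other operator is left as written.
    static String canonical(String left, String operator, String right) {
        if (operator.equals(">")) {
            return right + " < " + left;
        }
        return left + " " + operator + " " + right;
    }

    // Two comparisons match if their canonical forms are textually equal.
    static boolean matches(String[] pattern, String[] code) {
        return canonical(pattern[0], pattern[1], pattern[2])
                .equals(canonical(code[0], code[1], code[2]));
    }
}
```

Under this normalization x > 0 and 0 < x share the canonical form 0 < x and therefore match, while x < 0 and 0 < x remain distinct.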

A fourth and final core feature is pattern disjunction, allowing patch developers to cover multiple cases with a single succinct patch. Similarly to other SmPL syntax, the syntax for pattern disjunction is based on the first character on a line. Disjunctions are started and ended with an opening and closing parenthesis respectively. The clauses of a disjunction are separated with the vertical bar (or "pipe") character, also required to appear as the first character on a line. Pattern disjunctions can be nested, but are otherwise somewhat restricted in what elements can appear in clauses. Notably, the wildcard operator is not allowed to appear inside a pattern disjunction. The semantic patch shown in Listing 16 shows pattern disjunction being used to match a call to a() followed by a call to either b() or c().

Listing 16. Pattern disjunction in SmPL.

    @@ @@
    a();
    (
    b();
    |
    c();
    )

As SmPL operates on control flow graphs of individual functions, a single patch rule cannot perform matching that involves more than one function. However, this limitation can be worked around using features that allow multiple rules to be specified in a dependency hierarchy, including the ability for a dependent rule to access information that was matched by a prerequisite rule.

Chapter 3

Related work

This chapter provides an overview of selected related works. Section 3.1 introduces the Coccinelle tool that is the standard implementation of SmPL for C, as well as the modification Coccinelle4J that implements a prototype SmPL for Java. Sections 3.2 and 3.3 cover related works in the domain of program transformation, with Section 3.2 focusing on works using temporal logic and model checking and Section 3.3 looking at other approaches. Finally, Section 3.4 covers related works in the domain of automated API migration.

3.1 Semantic patching

3.1.1 Coccinelle

Coccinelle [1] was introduced together with SmPL and remains the de facto standard implementation of SmPL for the C programming language. The central function of Coccinelle is to apply a given semantic patch to a given file containing C source code, analogous to how the patch program can apply a given diff to a text file.

One of the main motivations behind Coccinelle (and SmPL in general) was to alleviate problems in managing collateral evolutions [16] in the context of the Linux kernel. A collateral evolution is a response to a change in a library API that induces the requirement of updating client code that makes use of the library. A typical instance of a collateral evolution in the context of the Linux kernel is the adaptation of device driver code to additions or removals of formal parameters in kernel functions.

Traditionally, a collateral evolution in the Linux kernel would consist of the developers of client code manually reviewing the changes made to the kernel before modifying their code to accommodate the kernel changes. Once complete, the modifications would be summarized in the form of a diff/patch, which would be submitted back to the kernel developers. With the introduction of Coccinelle and SmPL, a kernel developer can accompany an API change with a corresponding semantic patch, allowing the developers of client code to more easily apply the changes using Coccinelle. Such semantic patches may also serve a part in documenting [17] API changes that induce collateral evolutions by providing a succinct, unambiguous representation of the changes required in client code.

The familiarity of diff and patch in the Linux kernel development community greatly influenced the design of Coccinelle and SmPL, leading to the choice of basing the SmPL syntax on the unified context diff format.

The technical foundation of Coccinelle is based on translating a semantic patch given in plain-text SmPL to a formula in a language that closely resembles CTL-VW [4]. The formula is then model-checked against the control flow graph of every function of an input C source program.

Coccinelle is implemented mainly in OCaml. The implementation has been designed with performance in mind, for example by including optimizations for picking out relevant files when operating on a large source tree. Coccinelle has been found to be fast enough for use on a regular, consumer-grade, single-developer PC. Additionally, Coccinelle includes features designed to increase the appeal of the tool beyond the technical level, such as featuring an interactive mode, attempting to preserve the coding style (including whitespace) of files targeted by transformations, and even a feature that attempts to place inserted #include statements at aesthetically pleasant locations.

Coccinelle furthermore supports programmable extensions to the engine by means of instrumenting patches with calls to script code written in OCaml or Python. An example use of this feature is to iteratively run a semantic match (a semantic patch that does not make any modifications) on some target source code, with the script code able to inspect results and change engine settings between iterations.

Coccinelle and SmPL have seen significant use in the Linux kernel development community. The Coccinelle project maintains a web page¹ showing a list of patches created using the tool that have been accepted and integrated into the official Linux kernel.

¹ http://coccinelle.lip6.fr/impact_linux.php

3.1.2 Coccinelle4J

Coccinelle4J [3] is a modification of Coccinelle that adapts the tool to operate on Java source code. Coccinelle4J supports semantic patches that operate on method bodies, optionally including the method signature. The language support is restricted to the subset of Java known as Middleweight Java (MJ) [18] extended with the syntax and semantics of type parameters and exceptions (intraprocedural try-catch and throw).

The similarities between method definitions in MJ and function definitions in C allowed the authors to bootstrap the language adaptation by mapping the AST elements of MJ to AST elements of C that Coccinelle already supported. Support was then added for exceptions, which entailed modeling the control flow behavior of try-catch blocks as there is no corresponding construct in C. The authors chose to over-approximate the set of possible exception control flow paths by connecting each statement in the body of a given try block to the body of the corresponding catch.

Coccinelle4J also adds limited support for Java sub-typing in metavariable bindings. In Java, it is a common pattern to declare a variable under the interface of a super type while simultaneously initializing the variable to a sub-type:

List x = new ArrayList();

Coccinelle4J considers both the declared type and the constructor when binding a type-constrained metavariable to a statement fitting this pattern. As such, Coccinelle4J would allow metavariables constrained to either List or ArrayList to bind to the example statement, whereas the original Coccinelle would only allow the former. The motivation behind supporting this specific pattern of declaration and initialization was the prevalence of the pattern in real-world Java code.

In addition to restricting semantic patches to a subset of Java, Coccinelle4J has some further limitations with regard to the Java language. Matching fully qualified names to non-fully qualified names requires the use of the isomorphism system, although it is partially automated by a preprocessor stage. Furthermore, the support for sub-typing is limited to the types explicitly given in a statement that both declares and initializes a variable. For example, it is not possible to have a metavariable declared as Iterable bind the program variable List x despite the semantics of Java suggesting that x is indeed an instance of Iterable.

Despite these limitations, a case study performed by the authors showed Coccinelle4J to be capable and useful in assisting with automated API migration on several medium to large-scale Android projects.
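The binding rule described above can be illustrated with a small sketch. The class and method names are hypothetical, types are compared as plain strings, and none of this is Coccinelle4J code; it only restates the rule as given in the text. The final line uses standard Java reflection to show the subtype relation that the rule does not capture.

```java
import java.util.ArrayList;

final class SubtypeBindingSketch {
    // Behavior attributed above to the original Coccinelle: only the declared type counts.
    static boolean bindsDeclaredOnly(String constraint, String declaredType) {
        return constraint.equals(declaredType);
    }

    // Behavior attributed above to Coccinelle4J: declared type or constructor type.
    static boolean bindsDeclaredOrConstructed(String constraint, String declaredType,
                                              String constructedType) {
        return constraint.equals(declaredType) || constraint.equals(constructedType);
    }

    public static void main(String[] args) {
        // Models the statement: List x = new ArrayList();
        for (String constraint : new String[] {"List", "ArrayList", "Iterable"}) {
            System.out.println(constraint + " binds: "
                    + bindsDeclaredOrConstructed(constraint, "List", "ArrayList"));
        }
        // Java itself treats an ArrayList as an Iterable, which the rule above misses.
        System.out.println(Iterable.class.isAssignableFrom(ArrayList.class));
    }
}
```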

3.2 Program transformation using temporal logic

Seminal works on program analysis using temporal logic and model checking include Pnueli’s The Temporal Logic of Programs [19], Clarke and Emerson’s introduction [20] of CTL, and Clarke, Emerson and Sistla’s introduction of an efficient model checking algorithm [21] for CTL in the context of verification of concurrent programs. Steffen later applied CTL and its model checking algorithm in his influential work [22] on model checking as a framework for data flow analysis, subsequently refined together with Schmidt [23, 24].

Building on the work of Steffen and Schmidt, Lacey and de Moor proposed a language [25] for specifying transformations over low-level imperative programs using a combination of pattern matching, CTL-like side conditions and model checking. Notably, the language allows for logical variables (sometimes referred to as metavariables) to bind to program terms that can subsequently be referred to in the production part of a transformation rule. This extension to CTL later became known as CTL-FV (CTL with Free Variables), and the language was subsequently used [26] in formally proving correct a number of classical compiler optimizations such as constant folding and dead code elimination. Lacey, Kalvala and Warburton further developed the language [27], giving it the name TRANS and introducing syntax that more closely resembles a traditional programming language.

Kalvala and Warburton further built upon TRANS to create TRANSfix [28], a transformation language that is similar to SmPL. Compared to TRANS, TRANSfix is designed as a software development tool rather than a language for formal proofs of correctness. TRANSfix operates on high-level source code, has extensive support for metavariables and the semantics of a specifically targeted programming language (Java), and includes a wildcard pattern match operator ”....” that matches arbitrary sequences of statements. Notable differences from SmPL include a more verbose syntax and the need to explicitly state side conditions in CTL.

3.3 Other approaches to Java source code transformation

JaTS [29] (Java Transformation System) is a language and tool similar to SmPL and TRANSfix. Notable features include extensive support for Java semantics, metavariables and their constraints, side conditions given in Java-like syntax, pattern constructs allowing for the matching of arbitrary sequences similar to a wildcard operator, and support for so-called "executable declarations", which include the notion of calling reflection methods on bound metavariables in production rules. Notable differences from SmPL include more verbose syntax and a more elaborate language, especially with regard to the executable declarations.

The Java IDE (Integrated Development Environment) IntelliJ IDEA from JetBrains includes a feature known as Structural Search and Replace (SSR) [30], cited as being influential to the development of SmPL. SSR is designed as an extension of the traditional search-and-replace feature commonly present in text editors, taken to operate on ASTs rather than plain text. Notable features include metavariables and isomorphisms (the automatic matching of semantically equivalent statements such as x + 1 to 1 + x).

There are a number of generic program transformation languages capable of operating on Java. Stratego is a general-purpose program transformation system originally proposed by Visser [31]. The origins of Stratego are firmly tied to the domain of functional programming and languages such as ML and Haskell, and it features syntax that may be unfamiliar to Java developers. Visser also provides an extensive survey [32] of program transformation techniques, albeit focused on the area of term rewriting. Rascal [33] is an even more generic system taking the form of a full programming language specifically designed for metaprogramming tasks such as program analysis and transformation. Rascal includes features such as native support for syntax grammars and a switch-like visit statement that performs a complete traversal of a given tree structure, processing a case branch for each node.

3.4 API migration

Chow and Notkin propose [11] a system where library authors would annotate the source code of an updated library version with a transformation specification given in a domain-specific language that could later be used by clients to semi-automatically perform the upgrade migration. The language is centered around AST matching and production and includes special constructs for semantics-aware guard conditions. Limitations include the inability to describe atomic transformations that involve more than one source code fragment.

Having identified the powerful capabilities of semi-automatic refactoring tools found in modern IDEs, Henkel and Diwan propose CatchUp! [34] as a system that captures the use of these tools such that the refactorings can later be "replayed". The idea behind CatchUp! is for library authors to use IDE refactoring tools to carry out changes that break the public API, and to then distribute the file containing the captured refactorings along with the new version of the library. Users of the library would then in theory be able to replay the refactorings on their application code in order to migrate to the new version. A prototype of CatchUp! was implemented as an Eclipse plugin, supporting the recording and playback of a subset of the refactoring tools available in Eclipse at the time. Limitations include the specificity of recorded refactorings to a certain range of versions of a certain IDE.

PATL [12] (PAtch-like Transformation Language) was introduced by Wang et al. together with the technique of guided normalization to specifically target automated application of many-to-many API migrations. The language is similar to SmPL, being based on the traditional unified diff format while having rich support for the semantics of Java and employing metavariables. The central contribution towards the automation of many-to-many API migrations is the guided normalization technique, in which a target program is transformed in a semantics-preserving way until a form is found that directly matches a PATL transformation rule. By using this technique, a single transformation rule can potentially cover a large number of real-world variants with regard to the ordering of statements targeted by the transformation as well as the presence of intermixed statements irrelevant to the transformation. Limitations include the inability to specify transformation rules that change the number of typed objects, such as when an object of some type belonging to the source API needs to be transformed into multiple objects of multiple types belonging to the target API.

Chapter 4

Design of spoon-smpl

This chapter provides an overview of the design of SPOON-SMPL, a prototype implementation of a subset of SmPL for Java, based on the Spoon metaprogramming library. Section 4.1 describes the main design goals and the rationale behind some of our core design choices. Section 4.2 attempts to provide a succinct description of the core transformation engine. Section 4.3 describes our method of parsing plain-text SmPL into an internal representation. Sections 4.4 and 4.5 cover our formula language and compilation process respectively, completing the picture of bringing a plain-text SmPL patch into the core transformation engine in the form of a formula. Section 4.6 describes our batch processing mode, which makes use of the Processor system provided by Spoon. Finally, Section 4.7 provides a concise summary of the ways Spoon is used in our implementation.

4.1 Design goals

The main design goal of our prototype implementation is the replication of a set of transformations produced by Coccinelle4J in an API migration case study [3] performed by the Coccinelle4J authors. The case study involves six semantic patches, each of which is associated with a target code base. More specifically, the goal is for our implementation to be able to take as input the same (or as similar as possible) semantic patches and apply them to the same code bases as used in the original case study. This should output at least the same set of transformations as produced by Coccinelle4J, in a suitable format. Additional transformations are acceptable only if they are correct under the semantics of SmPL.


The overarching goal stated above naturally gives rise to three core requirements:

1. We must be able to parse the SmPL syntax used in the set of patches.

2. We must have an engine for matching and transformation of Java source code that supports the SmPL features used by the patches over the set of Java language constructs that appear in the code targeted by the patches.

3. We must support an output format comparable to the unified diff output available from Coccinelle4J, suitable both for manual inspection and for the automated application of transformations.

For the parsing of SmPL syntax there are many options available. Many popular parser generators such as ANTLR and JavaCC are applicable. However, SmPL syntax is essentially only a small set of extensions to the target language syntax (Java, in our case) rather than a separate language. Furthermore, a significant portion of a semantic patch will in general consist of code written in the target language. As such, we choose to implement our parser based on the idea of rewriting SmPL syntax to an equivalent representation in a domain-specific language written in Java. This representation is then parsed by the Spoon Java parser, producing a Spoon metamodel from which we extract information such as metavariable declarations and the patch body. This also has the benefit of producing Spoon metamodel elements corresponding to all of the Java code elements that appear in the patch body, which we use in our transformation engine. Our method of parsing SmPL is described in further detail in Section 4.3.

We base our core transformation engine on the logic CTL-VW as presented in [4] and in Section 2.2.4. This choice is motivated by the goals of our research, in which we focus on the feasibility of Spoon as a technical foundation for an SmPL for Java rather than on the choice of core algorithms. As such, picking a proven algorithm makes the most sense. We describe our core engine in further detail in Sections 4.2 and 4.4.

For the third and final requirement we use the existing functionality in Spoon for pretty-printing a Spoon metamodel to produce Java source code, coupled with the ability to provide a diff command on the command line, allowing for user-customizable output formatting.

[Figure: workflow diagram with two input branches that merge into a common pipeline.]

Target code branch: input target code → Parse Java into Spoon metamodel → Find executable block(s) → Compute control flow graph(s) → To CTL-VW model(s)

Semantic patch branch: input semantic patch → Parse SmPL file → Translate to Java DSL → Parse Java into Spoon metamodel → To CTL-VW formula

Common pipeline: (1) Match using CTL-VW model checking → (2) Potentially modify matched code → Pretty-print → output

Figure 4.1: Workflow of SPOON-SMPL

4.2 Core engine

The core of SPOON-SMPL consists of an engine for the matching and transformation of Java source code based on model checking. Figure 4.1 contains an overview of the workflow of SPOON-SMPL, in which the core engine performs the tasks of the nodes labeled (1) and (2). The engine is heavily inspired by the theoretical foundation of Coccinelle [4]. Analogous to the basic CTL model checking algorithm, our model checker takes as inputs a model and a formula. In the typical use case the input model will be based on the control flow graph (CFG) of a method body extracted from a Spoon metamodel of an input Java class. The formula will typically be an encoding of a semantic patch, produced as the final step in the parsing of a plain-text SmPL file. The formula language is CTL-VW with a number of simple extensions. Model checking the formula against the model yields a verdict on whether the patch can be applied to the method body. For positive results, the transformations to apply are found in the set of CTL-VW witness trees produced by the model checker¹.

We pass the results produced by the model checker to a separate facility that carries out each transformation operation. Transformations act on Spoon metamodel elements contained in the labels of the CTL model. Since we extract these elements from the full Spoon metamodel of the input code and store them by reference in the CTL model, the transformations applied to them will also take effect in the original Spoon metamodel. Thus, the transformations are applied to the output written to disk by Spoon in its source tree batch processing mode.
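The by-reference sharing described above can be sketched with plain Java objects. The `Element` class and `labelStates` method below are hypothetical stand-ins for Spoon metamodel elements and the model construction; the sketch assumes nothing about Spoon beyond the fact that its metamodel elements are mutable objects.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SharedReferenceSketch {
    // Hypothetical stand-in for a mutable metamodel element.
    static final class Element {
        String code;
        Element(String code) { this.code = code; }
    }

    // Builds state labels that hold references to the very same objects as the metamodel.
    static Map<Integer, Element> labelStates(List<Element> metamodel) {
        Map<Integer, Element> labels = new HashMap<>();
        for (int state = 0; state < metamodel.size(); state++) {
            labels.put(state, metamodel.get(state)); // no copy is made
        }
        return labels;
    }

    public static void main(String[] args) {
        List<Element> metamodel = new ArrayList<>();
        metamodel.add(new Element("return ret;"));
        Map<Integer, Element> labels = labelStates(metamodel);
        labels.get(0).code = "return C;";           // transformation applied via the model
        System.out.println(metamodel.get(0).code);  // the original metamodel sees the change
    }
}
```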

4.3 Parsing SmPL

We begin the process of turning SmPL syntax into a formula by traditional lexical analysis, breaking up the input semantic patch into a list of tokens. This list is fed to a combined parser-compiler stage that parses the tokens according to the grammar of SmPL and produces the source code of a Java class that represents the semantic patch in a form of domain-specific language (DSL). The Java DSL representation initially retains the plus and minus characters at the start of lines denoting additions and deletions (see Listing 18). We later split this initial form into two versions: dels (Listing 19), where all lines marked for addition are turned into empty lines, and adds (Listing 20), where all lines marked for deletion are turned into empty lines. The fact that lines are made into empty lines rather than simply being removed is important for the classification of statements into deletions, additions or context. The classification relies on the source positions of individual statements, and is subsequently used in the anchoring of transformation operations to appropriate elements. Finally, we remove any plus or minus characters at the start of lines from both dels and adds, thereby producing two instances of syntactically valid Java code, albeit with certain elements referencing undefined symbols.

If we picture Listing 19 or Listing 20 replacing lines 8-15 of Listing 18, we get syntactically valid Java code for a class Rule, but we can identify several references to undefined method names such as type, identifier and __SmPLDots__. Furthermore, the identifiers T and C are undefined. However, due to the leniency available in the Spoon Java parser, these unresolvable references do not cause a problem in the applications of the DSL code.

¹ In a special case, transformation operations representing the addition of one or more new methods are not encoded in the witness trees.

Listing 17. Example input semantic patch in SmPL syntax.

1 @ rule1 @ type T; identifier ret; constant C; @@
2 - T ret = C;
3 ... when != ret
4 - return ret;
5 + return C;

Listing 18. Java DSL representation of Listing 17 before split.

 1 class Rule {
 2     String __SmPLRuleName__ = "rule1";
 3     void __SmPLMetavars__() {
 4         type(T);
 5         identifier(ret);
 6         constant(C);
 7     }
 8     void __SmPLUnspecified__() {
 9         if (__SmPLDotsOptionalMatch__(whenExists())) {
10 -           T ret = C;
11             __SmPLDots__(whenNotEqual(__SmPLExpressionMatch__(ret)));
12 -           return ret;
13 +           return C;
14         }
15     }
16 }
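As a rough illustration of the rewriting idea, the following sketch turns an SmPL metavariable declaration string into DSL-style call statements like those on lines 4-6 of Listing 18. The class and method names are hypothetical and this is not the SPOON-SMPL parser; in particular, emitting one call per name for comma-separated declarations is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

final class MetavarRewriteSketch {
    // Rewrites e.g. "type T; identifier ret;" into the calls "type(T);" and "identifier(ret);".
    static List<String> rewrite(String declarations) {
        List<String> calls = new ArrayList<>();
        for (String declaration : declarations.split(";")) {
            String trimmed = declaration.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            String[] kindAndNames = trimmed.split("\\s+", 2);
            if (kindAndNames.length < 2) {
                throw new IllegalArgumentException("Malformed declaration: " + trimmed);
            }
            // Assumption: a comma-separated name list yields one call per name.
            for (String name : kindAndNames[1].split(",")) {
                calls.add(kindAndNames[0] + "(" + name.trim() + ");");
            }
        }
        return calls;
    }

    public static void main(String[] args) {
        for (String call : rewrite("type T; identifier ret; constant C;")) {
            System.out.println(call);
        }
    }
}
```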

Listing 19. Rule body method excerpt of Listing 18 after split, dels version.

 8     void __SmPLUnspecified__() {
 9         if (__SmPLDotsOptionalMatch__(whenExists())) {
10             T ret = C;
11             __SmPLDots__(whenNotEqual(__SmPLExpressionMatch__(ret)));
12             return ret;
13
14         }
15     }

Listing 20. Rule body method excerpt of Listing 18 after split, adds version.

 8     void __SmPLUnspecified__() {
 9         if (__SmPLDotsOptionalMatch__(whenExists())) {
10
11             __SmPLDots__(whenNotEqual(__SmPLExpressionMatch__(ret)));
12
13             return C;
14         }
15     }
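The split into dels and adds can be sketched as a line-wise projection. The sketch below is a simplification under stated assumptions: markers sit in the first column, and blanking and marker removal are merged into one pass, whereas the text describes them as separate steps. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PatchSplitSketch {
    // Blanks lines starting with `drop` and strips a leading `keep` marker.
    // Lines are blanked rather than removed so that line numbers stay aligned.
    static List<String> project(List<String> lines, char keep, char drop) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            if (!line.isEmpty() && line.charAt(0) == drop) {
                out.add("");
            } else if (!line.isEmpty() && line.charAt(0) == keep) {
                out.add(" " + line.substring(1)); // replace marker with a space
            } else {
                out.add(line);
            }
        }
        return out;
    }

    static List<String> dels(List<String> lines) { return project(lines, '-', '+'); }

    static List<String> adds(List<String> lines) { return project(lines, '+', '-'); }

    public static void main(String[] args) {
        List<String> body = Arrays.asList(
                "- T ret = C;", "__SmPLDots__();", "- return ret;", "+ return C;");
        System.out.println(dels(body));
        System.out.println(adds(body));
    }
}
```

Both projections have the same number of lines as the input, which is what allows statements to be compared by line number in the later classification step.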

In Listing 17, the patch does not specify any constraints on the method header, resulting in the DSL rule body being defined with the special-case method signature void __SmPLUnspecified__(). We consider a semantic patch that does not specify a method header to be a syntactic special case, semantically equivalent to a more verbose patch that specifies a match-all method header and encloses the statements in the method body in the <... P ...> optional-match wildcard operator variant as shown in Listing 21. The when exists constraint relaxation seen on line 9 in Listings 18, 19 and 20 and line 3 in Listing 21 causes the formula compiler to produce an EU connective rather than an AU connective, which in this case produces the same behavior but results in improved performance in the model checking algorithm. The special void __SmPLUnspecified__() method signature is part of a performance enhancement that avoids performing any matching on the method header, as opposed to the verbose patch of Listing 21, which would perform matching on target method headers and produce bindings for the metavariables ftype and fname.

Listing 21. Verbose semantic patch equivalent to Listing 17.

1 @ rule1 @ type ftype, T; identifier fname, ret; constant C; @@
2 ftype fname(...) {
3 <... when exists
4 - T ret = C;
5 ... when != ret
6 - return ret;
7 + return C;
8 ...>
9 }

As seen on line 9 in Listing 18, we represent the <... P ...> optional-match wildcard operator by an else-less if-statement in the Java DSL, with the body of the true-branch containing the statements enclosed by the operator. Similarly, we represent SmPL pattern disjunction (Listing 22) in the Java DSL (Listing 23) using if-elseif chains. We use specifically-named identifier references in branch conditions to indicate that the construct is a pattern disjunction, and we place the statements of each clause of the disjunction in a corresponding branch body.

Listing 22. Semantic patch using pattern disjunction in SmPL syntax.

1 @@ @@
2 (
3 a();
4 |
5 b();
6 )

Listing 23. Excerpt from rule body method of Java DSL representation of Listing 22.

 6 if (__SmPLBeginDisjunction__) {
 7     a();
 8 } else if (__SmPLContinueDisjunction__) {
 9     b();
10 }
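The encoding of a disjunction as an if-elseif chain can be sketched as simple string generation. This is a hypothetical helper, not the SPOON-SMPL implementation, and it assumes that every clause after the first reuses the __SmPLContinueDisjunction__ condition, which the source only shows for a two-clause disjunction.

```java
import java.util.Arrays;
import java.util.List;

final class DisjunctionEncodingSketch {
    // Emits one branch per clause: the first under __SmPLBeginDisjunction__,
    // the rest under __SmPLContinueDisjunction__ (an assumption for 3+ clauses).
    static String encode(List<String> clauseBodies) {
        StringBuilder dsl = new StringBuilder();
        for (int i = 0; i < clauseBodies.size(); i++) {
            dsl.append(i == 0
                    ? "if (__SmPLBeginDisjunction__) {\n"
                    : "} else if (__SmPLContinueDisjunction__) {\n");
            dsl.append("    ").append(clauseBodies.get(i)).append("\n");
        }
        if (!clauseBodies.isEmpty()) {
            dsl.append("}");
        }
        return dsl.toString();
    }

    public static void main(String[] args) {
        // Clauses of Listing 22.
        System.out.println(encode(Arrays.asList("a();", "b();")));
    }
}
```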

Continuing the process towards producing a formula, we parse the dels and adds DSL classes into two separate Spoon metamodels using the Spoon Java parser. Statement elements (and most other elements) in the Spoon metamodel include information on the position of the source code fragment that yielded the element. By extracting the line numbers of statement elements in the two metamodels, we deduce which statements are deletions (appearing only in the dels metamodel), which are additions (appearing only in adds) and which are context (appearing in both). This information is used to anchor most² addition operations to an appropriate statement that is either a deletion or part of the context. An appropriate anchor statement is a statement that is adjacent to the added statement seen from the perspective of the semantic patch, while not being a special statement such as a wildcard operator. Deletion operations are anchored to the statements that are to be deleted, and we collect all anchor information in a map from anchor elements to transformation operations.

² Special cases exist in which the anchoring is more complicated.
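The classification by line number can be sketched as a comparison of two sets. The sketch assumes that the line numbers of statement elements have already been extracted from the two metamodels; the class, enum and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class LineClassificationSketch {
    enum Kind { DELETION, ADDITION, CONTEXT }

    // A line is context if a statement appears there in both versions,
    // a deletion if only in dels, and an addition if only in adds.
    static Map<Integer, Kind> classify(Set<Integer> delsLines, Set<Integer> addsLines) {
        Map<Integer, Kind> kinds = new TreeMap<>();
        for (int line : delsLines) {
            kinds.put(line, addsLines.contains(line) ? Kind.CONTEXT : Kind.DELETION);
        }
        for (int line : addsLines) {
            if (!delsLines.contains(line)) {
                kinds.put(line, Kind.ADDITION);
            }
        }
        return kinds;
    }

    public static void main(String[] args) {
        // Statement lines of Listing 19 (dels) and Listing 20 (adds).
        Set<Integer> dels = new HashSet<>(Arrays.asList(9, 10, 11, 12));
        Set<Integer> adds = new HashSet<>(Arrays.asList(9, 11, 13));
        System.out.println(classify(dels, adds));
    }
}
```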

Finally, we compile the formula from the method signature and control flow graph of the rule method extracted from the dels metamodel, the set of metavariable declarations and the map of anchored transformation operations.

4.4 Formula language

The formula language in SPOON-SMPL consists of a sufficient³ subset of the base CTL connectives together with the quantified variable connective of CTL-V(W), along with a number of additions. Many of the additions are predicates used to match model labels containing code in the form of Spoon metamodel elements. The full language grammar is shown in Table 4.1.

ϕ ::= ⊤ | p | ¬ϕ | ϕ ∨ ϕ | ϕ ∧ ϕ | AX ϕ | EX ϕ | A[ϕ U ϕ] | E[ϕ U ϕ] | ∃x.ϕ | VarUse s | Expression(e, x̄) | Statement(e, x̄) | Branch(e, x̄) | SequentialOr ϕ̄ | SetEnv(x, v) | Metadata(x, k) | Optional ϕ | InnerAnd ϕ

where p ranges over atomic propositions (nullary predicates), s ranges over legal Java identifier strings, e ranges over Spoon metamodel elements, x ranges over metavariables, v ranges over arbitrary Java values, k ranges over arbitrary Java strings and the overbar denotes a list.

In this context, a metavariable is a pair (s, f) where s again ranges over legal Java identifier strings and f is an indicator function f : e ↦ [0, 1] ∈ ℤ with e again ranging over Spoon metamodel elements and f(e′) indicating whether or not a given metamodel element e′ is an acceptable binding under the metavariable constraint defined by f.

Table 4.1: Formula language of SPOON-SMPL.

³ Able to derive all missing connectives.

The semantics of the base CTL connectives and the quantified variable connective is identical to the corresponding semantics in CTL-VW. The semantics of the additional connectives is shown in Table 4.2.

In Table 4.2, the "taking account" of metavariables is to be read as instructing the machinery performing the matching of code to treat any elements corresponding to identifiers {s | (s, f) ∈ x̄} present in e as metavariables, meaning they present the task of binding a structurally corresponding sub-element found in a match candidate metamodel element, under the condition that this sub-element is accepted by the associated constraint f.
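The notion of a metavariable as a pair (s, f) can be sketched as a name together with a predicate. In the sketch below, candidate elements are represented by their source text rather than by Spoon metamodel elements, and the two example constraints are hypothetical simplifications: the identifier check does not exclude Java keywords, and the constant check only recognizes integer literals.

```java
import java.util.function.Predicate;

final class MetavariableSketch {
    // A metavariable is a name s paired with an indicator function f over candidate elements.
    static final class Metavariable {
        final String name;
        final Predicate<String> constraint;

        Metavariable(String name, Predicate<String> constraint) {
            this.name = name;
            this.constraint = constraint;
        }

        boolean accepts(String candidate) {
            return constraint.test(candidate);
        }
    }

    static boolean looksLikeIdentifier(String text) {
        if (text.isEmpty() || !Character.isJavaIdentifierStart(text.charAt(0))) {
            return false;
        }
        for (int i = 1; i < text.length(); i++) {
            if (!Character.isJavaIdentifierPart(text.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    static Metavariable identifier(String name) {
        return new Metavariable(name, MetavariableSketch::looksLikeIdentifier);
    }

    static Metavariable constant(String name) {
        return new Metavariable(name, text -> text.matches("-?\\d+"));
    }

    public static void main(String[] args) {
        System.out.println(identifier("ret").accepts("y"));
        System.out.println(constant("C").accepts("y"));
    }
}
```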

VarUse s
    Selects states labeled with a code statement that involves the use of a variable or field named s.

Expression(e, x̄)
    Selects states labeled with a code statement containing an expression matching e, taking into account the list of metavariables x̄.

Statement(e, x̄)
    Selects states labeled with a code statement matching e, taking into account the list of metavariables x̄.

Branch(e, x̄)
    Selects states labeled with an if statement with condition expression matching e, taking into account the list of metavariables x̄.

SetEnv(x, v)
    Selects all states and sets the environment of each result to [x ↦ v].

Metadata(x, k)
    Selects states labeled with a MetadataLabel that exports the metadata key k and for each result sets the environment to [x ↦ y] where y is the arbitrary value exported by the label that yielded the result.

SequentialOr ϕ̄
    SequentialOr(ϕ₁, ϕ₂) is semantically equivalent to ϕ₁ ∨ (¬ϕ₁ ∧ ϕ₂).

Optional ϕ
    Selects all states, preferentially matching ϕ: SAT(ϕ) ∪ {(s, ..) ∈ SAT(⊤) | (s, ..) ∉ SAT(ϕ)} where (s, ..) is an arbitrary result for state s.

InnerAnd ϕ
    Merges the results produced by ϕ under environments containing only positive bindings.

Table 4.2: Semantics of SPOON-SMPL formula additions.

We use Metadata predicates together with MetadataLabels to prevent wildcard operators from traversing out of their parent blocks. VarUse s offers improved performance over an equivalent Expression predicate, and the predicates Optional ϕ and SequentialOr ϕ̄ both offer improved performance over their equivalent CTL-VW formulas.

InnerAnd ϕ solves a somewhat complicated problem involving duplicated transformation operations resulting from a combination of negative environment bindings and the way witness forests are merged. A more detailed description is available in the SPOON-SMPL documentation⁴.
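The intent of SequentialOr and Optional can be sketched over a heavily simplified result representation. The sketch assumes at most one result per state and represents each result by an opaque string, ignoring environments and witnesses entirely; it is an illustration of the semantics in Table 4.2 under those assumptions, not the SPOON-SMPL implementation.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class ConnectiveSketch {
    // phi1 OR (NOT phi1 AND phi2): results of phi2 are kept only for
    // states that phi1 did not match.
    static Map<Integer, String> sequentialOr(Map<Integer, String> sat1,
                                             Map<Integer, String> sat2) {
        Map<Integer, String> result = new TreeMap<>(sat2);
        result.putAll(sat1); // results of phi1 take precedence
        return result;
    }

    // Selects all states, using the result of phi where one exists
    // and an empty placeholder result otherwise.
    static Map<Integer, String> optional(Map<Integer, String> sat, Set<Integer> allStates) {
        Map<Integer, String> result = new TreeMap<>();
        for (int state : allStates) {
            result.put(state, sat.getOrDefault(state, ""));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Integer, String> sat1 = new TreeMap<>();
        sat1.put(1, "from phi1");
        Map<Integer, String> sat2 = new TreeMap<>();
        sat2.put(1, "from phi2");
        sat2.put(2, "from phi2");
        System.out.println(sequentialOr(sat1, sat2));
        System.out.println(optional(sat1, new TreeSet<>(Arrays.asList(1, 2, 3))));
    }
}
```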

4.5 Formula compilation

Whilst a complete description of the formula compilation in SPOON-SMPL is outside the scope of this document, this section illustrates some of the core aspects of our compilation process through a short series of simple examples.

Our first example begins by taking the method shown in Listing 19 and removing all statements that correspond to meta-elements in the SPOON-SMPL Java DSL. The elements removed are the Java DSL encodings of the implicit <... P ...> operator (lines 9 and 14 of Listing 19) and the regular wildcard operator (line 11 of Listing 19). The simplified method is shown in Listing 24 and its corresponding control flow graph (CFG) is shown in Figure 4.2.

Listing 24. Simplified version of method of Listing 19.

1 void __SmPLUnspecified__() {
2     T ret = C;
3     return ret;
4 }

start → T ret = C; → return ret; → exit

Figure 4.2: Control flow graph for method of Listing 24.

⁴ https://github.com/mkforsb/spoon/blob/smpl/spoon-smpl/src/main/java/spoon/...... smpl/formula/InnerAnd.java

We initiate formula compilation by providing the formula compiler with a set of metavariable declarations, a CFG, a map from CFG nodes to transformation operations and a starting node. To compile a formula for a given node, the compiler first generates a local formula for matching the node itself. In this case the first local formula generated is the atomic proposition "start" as shown at point 1 of Figure 4.3. If there are any transformation operations mapped to the given node the compiler will also encode these in the local formula. Next, the compiler will recursively call itself to generate formulas for any successors to the given node. Finally, the local formula is combined with any formulas generated for successors using appropriate connectives.

∧
├─ (1) "start"
└─ AX
   └─ (2.1) ∃ T
      └─ (2.2) ∃ ret
         └─ (2.3) ∃ C
            └─ ∧
               ├─ (2) Statement(T ret = C;, x̄)
               └─ (5) AX
                  └─ (6) ∧
                     ├─ (3) Statement(return ret;, x̄)
                     └─ AX
                        └─ (4) "exit"

Figure 4.3: Formula compiled from the CFG of Figure 4.2.

In this case, the first recursive call will generate a local formula for matching the statement T ret = C; while taking into account the list of metavariables x̄ = (T, ret, C). The resulting local formula is shown at point 2 of Figure 4.3. The combination of knowing that none of the metavariables have been quantified in the current context⁵ and finding that all three metavariables are used in the statement will cause the compiler to generate the three variable quantifications shown in points 2.1-2.3 of the figure. The second and third recursive calls will generate the local formulas of points 3 and 4 respectively. The final formula is produced by each step in the unwinding of the recursion combining the generated formulas using the conjunction (∧) and AX connectives as appropriate. Listing 25 shows an example of source code that is matched by the formula, while Listings 26 and 27 show source code that will fail to be matched by the formula. The formula requires all successors to the declaration statement (point 2 in Figure 4.3) to match the return statement (point 3 in the figure), which is false in Listings 26 and 27 due to the respective statements on line 3.

⁵ Current branch of the formula, essentially.

Listing 25.

1 int fn() {
2     int y = 1;
3     return y;
4 }

Listing 26.

1 int fn() {
2     int y = 1;
3     print(y);
4     return y;
5 }

Listing 27.

1 int fn() {
2     int y = 1;
3     print("Ok");
4     return y;
5 }
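The recursive compilation scheme of the first example can be sketched for the special case of a straight-line CFG. The sketch below emits the formula as an ASCII string, writing `&` for conjunction, `exists` for the quantifier and `xs` for the metavariable list. The `Node` class and all method names are hypothetical, transformation operations are not encoded, and branching CFGs are out of scope.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class LinearFormulaSketch {
    // Hypothetical CFG node: a label, whether it is a code statement,
    // and the metavariables that occur in it.
    static final class Node {
        final String label;
        final boolean statement;
        final List<String> metavars;

        Node(String label, boolean statement, String... metavars) {
            this.label = label;
            this.statement = statement;
            this.metavars = Arrays.asList(metavars);
        }
    }

    static String compile(List<Node> path) {
        return compile(path, 0, new HashSet<String>());
    }

    private static String compile(List<Node> path, int index, Set<String> quantified) {
        Node node = path.get(index);
        // Local formula for the node itself.
        String local = node.statement
                ? "Statement(" + node.label + ", xs)"
                : "\"" + node.label + "\"";
        // Metavariables used here that are not yet quantified in this context.
        List<String> fresh = new ArrayList<>();
        for (String metavar : node.metavars) {
            if (quantified.add(metavar)) {
                fresh.add(metavar);
            }
        }
        // Recurse into the single successor and combine with conjunction and AX.
        String body = local;
        if (index + 1 < path.size()) {
            body = local + " & AX(" + compile(path, index + 1, quantified) + ")";
        }
        if (fresh.isEmpty()) {
            return body;
        }
        StringBuilder prefix = new StringBuilder();
        for (String metavar : fresh) {
            prefix.append("exists ").append(metavar).append(". ");
        }
        return prefix + "(" + body + ")";
    }

    public static void main(String[] args) {
        // CFG of Figure 4.2.
        System.out.println(compile(Arrays.asList(
                new Node("start", false),
                new Node("T ret = C;", true, "T", "ret", "C"),
                new Node("return ret;", true, "ret"),
                new Node("exit", false))));
    }
}
```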

As an example of the encoding of a transformation operation we consider the statement in point 2 of Figure 4.3. Had this statement been specified as a deletion in the semantic patch, the local formula would instead take the form

Statement(T ret = C;, x̄) ∧ ∃ _v . SetEnv(_v, delete)

where _v is an arbitrary name chosen to mark witness records of transformation operations and delete is an arbitrary Java value chosen to indicate a deletion operation.

Our second example expands on the first by reinstating the regular wildcard operator that appeared between the two non-meta-element statements in the original method, producing Listing 28 with the corresponding CFG shown in Figure 4.4.

Listing 28. Reinstating the wildcard operator to method of Listing 24.

1 void __SmPLUnspecified__() {
2     T ret = C;
3     __SmPLDots__(whenNotEqual(__SmPLExpressionMatch__(ret)));
4     return ret;
5 }

start → T ret = C; → __SmPLDots.. → return ret; → exit

Figure 4.4: Control flow graph for method of Listing 28.

The formula compiled from the CFG of Figure 4.4 can be taken as the formula of Figure 4.3 with an additional AU connective inserted between points 5 and 6. The modified part of the formula tree is shown in Figure 4.5, wherein the notation (~n) marks a point that corresponds to point n in the original tree. Note that we treat the special __SmPLDots__(...) statement differently from the other two statements in order to yield a formula that encodes the semantics of the wildcard operator. We use this strategy of giving special treatment to specifically-named elements in the formula compilation for several other SmPL features, such as pattern disjunction.

(~5) AX
└─ AU
   ├─ guard
   └─ (~6) ∧
      ├─ (~3) Statement(return ret;, x̄)
      └─ AX
         └─ (~4) "exit"

Figure 4.5: Part of formula compiled from the CFG of Figure 4.4.

The guard subformula seen in Figure 4.5 is shown in full in Figure 4.6.

¬
└─ ∨
   ├─ ∨
   │  ├─ ∨
   │  │  ├─ Statement(T ret = C;, x̄)
   │  │  └─ Statement(return ret;, x̄)
   │  └─ VarUse ret
   └─ "unsupported"

Figure 4.6: Full contents of the guard subformula of Figure 4.5.

In Figure 4.6 we can see that the guard is a negation of a disjunction over four clauses. This results in the enclosing AU connective rejecting matches where some path contains an intermediate state that satisfies any of the four clauses. The leftmost disjunction contains two predicates matching the two statements adjacent to the wildcard operator. This encodes the shortest-path constraint for the wildcard operator, disallowing the adjacent statements to appear anywhere along the path traversed by the AU connective. The VarUse predicate encodes the additional constraint when != ret given in the semantic patch, disallowing the presence of any usages of the variable name ret along any path. Finally, the atomic proposition "unsupported" disallows the wildcard operator from traversing across unsupported elements, which are marked by a corresponding label. One of our reasons for preventing the traversal over unsupported elements is the honoring of the when != ret constraint. We swap unsupported Java elements for labeled placeholders before initiating the model checking process. Thus, during model checking there is no way of knowing if a certain variable name is used within the original element.

Listings 25 and 27 show examples of source code that will be matched by the formula. In contrast to our previous formula of Figure 4.3, the new formula will successfully match the code in Listing 27 due to the additional AU connective seen in Figure 4.5. The added AU connective will match sequences of near-arbitrary intermediate steps between the declaration statement and the return statement, which in the case of Listing 27 will consume the statement on line 3. Listing 26 shows an example that will fail to be matched. The guard subformula of the added AU connective disallows the presence of any expression that makes use of the element bound to the metavariable ret. When attempting to match the formula against Listing 26, ret will be bound to y at the declaration statement on line 2. Since y is used in the expression on line 3, the guard subformula will not be satisfied. Thus the AU connective will not be satisfied, so the formula will fail to match.
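The effect of the guard on a single straight-line path can be sketched as follows. This is a simplified illustration and not the SPOON-SMPL model checker: statements are plain strings, the two adjacent-statement clauses are approximated by string equality against the concrete statements already matched, VarUse is approximated by a whole-word textual search (which would also flag occurrences inside string literals), and the "<unsupported>" placeholder label is hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

final class WildcardGuardSketch {
    static final String UNSUPPORTED = "<unsupported>"; // hypothetical placeholder label

    // Crude stand-in for VarUse: whole-word textual occurrence of the name.
    static boolean usesVariable(String statement, String name) {
        return Pattern.compile("\\b" + Pattern.quote(name) + "\\b").matcher(statement).find();
    }

    // The guard is the negation of a disjunction over four clauses.
    static boolean guardHolds(String statement, String boundVariable,
                              String before, String after) {
        return !(statement.equals(before)
                || statement.equals(after)
                || usesVariable(statement, boundVariable)
                || statement.equals(UNSUPPORTED));
    }

    // On a single path, the guard must hold at every intermediate state
    // between the statement before and the statement after the wildcard.
    static boolean wildcardMatches(List<String> intermediate, String boundVariable,
                                   String before, String after) {
        for (String statement : intermediate) {
            if (!guardHolds(statement, boundVariable, before, after)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String before = "int y = 1;";
        String after = "return y;";
        // Intermediate statements of Listings 25, 26 and 27, with ret bound to y.
        System.out.println(wildcardMatches(Collections.<String>emptyList(), "y", before, after));
        System.out.println(wildcardMatches(Arrays.asList("print(y);"), "y", before, after));
        System.out.println(wildcardMatches(Arrays.asList("print(\"Ok\");"), "y", before, after));
    }
}
```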
Reinstating the if statement that encodes the implicit <... P ...> wildcard operator brings us back to the method shown in Listing 19. If we take the formula compiled for the second example to be

"start" ∧ AX ϕ

then the formula compiled for the complete method of Listing 19 would take the form

"start" ∧ AX E[Optional ϕ U "exit"]

where the EU and Optional connectives serve to encode the semantics of the <... P ...> operator and ϕ denotes an identical subformula present in both formulas.

Our third example illustrates how we handle branches and how the wildcard operator is constrained to the scope of its parent block. We consider the method shown in Listing 29 and the corresponding CFG shown in Figure 4.7.

Listing 29. Branch example.

void __SmPLUnspecified__() {
    if (cond) {
        a();
    }
}

[Control flow graph with nodes: start, cond, a();, exit.]

Figure 4.7: Control flow graph for method of Listing 29.

The formula compiled for the CFG of Figure 4.7 is shown in Figure 4.8. In the figure we can see that the if statement found in the code yielded a Branch formula (point 1) along with a conjunction (point 2) over two EX formulas corresponding to the two cases of control flow respectively, each of which was compiled by pointing the compiler to the successor CFG node of the corresponding case. Furthermore, we see that the EX formula that corresponds to cond being true will match a state labeled with the proposition ”true” that also exports the metadata key ”parent”, before delegating the matching of the actual body of the branch to an AX subformula. This is designed to match the models we use in SPOON-SMPL, in which every representation of a branch body begins with a special block-begin state that is labeled with the branch case proposition and the ”parent” metadata key.

[Formula tree: ”start” ∧ AX over (1) Branch(cond, x̄) and (2) a conjunction of two EX subformulas; under ∃p1, (3) a conjunction of ”true”, (4) MetaData(p1, parent) and an AX subformula over Statement(a();, x̄) and ”after”.]

Figure 4.8: Formula compiled from control flow graph of Figure 4.7.

We use the ”parent” metadata key for preventing wildcard operators from traversing out of their parent blocks. When compiling a formula for a semantic patch, each branch in the patch code will generate a patch branch identifier. The current example patch code contains only one branch, and its identifier can be seen as p1 in Figure 4.8. Similarly, when generating models for target code, we associate every branch with a target branch identifier. This identifier is exported by MetaData labels under the key ”parent” by every block-begin state and by a post-branch convergence state, both associated with the branch. The MetaData predicate (point 4 of Figure 4.8) results in binding a target branch identifier to the environment variable with name equal to the patch branch identifier. A wildcard operator placed inside the block corresponding to the case where cond is true would yield a formula that included the subformula ”after” ∧ MetaData(p1, parent) as part of its guard formula, the larger structure of which would be similar to that of the guard formula shown in Figures 4.5 and 4.6. Thus, the operator is prevented from matching paths that cross the post-branch convergence state associated with the branch that matched the part of the formula corresponding to branch entrance (point 3 of Figure 4.8).
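The effect of the ”after” ∧ MetaData(p1, parent) guard can be sketched in a few lines of Java. The sketch is an assumption-laden simplification and not the SPOON-SMPL implementation: the class ParentGuardSketch, its State type and the use of integers as branch identifiers are hypothetical, and a single linear path stands in for the model. It shows that a wildcard stops at the convergence state of its own branch, but not at the convergence state of some other branch.

```java
import java.util.List;

/** Illustrative sketch only: confining a wildcard to its parent block. */
final class ParentGuardSketch {
    /** A model state: one proposition label plus an optional "parent" branch identifier. */
    static final class State {
        final String label;
        final Integer parent; // target branch identifier, or null if none is exported

        State(String label, Integer parent) {
            this.label = label;
            this.parent = parent;
        }
    }

    /** True if the state is the post-branch convergence state of the bound branch. */
    static boolean leavesBlock(State s, int boundBranchId) {
        return "after".equals(s.label) && s.parent != null && s.parent == boundBranchId;
    }

    /** Number of states a wildcard may traverse before it would leave its parent block. */
    static int reachablePrefix(List<State> path, int boundBranchId) {
        int n = 0;
        for (State s : path) {
            if (leavesBlock(s, boundBranchId)) {
                break;
            }
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        List<State> path = List.of(
                new State("a();", null),
                new State("after", 7), // convergence state of target branch 7
                new State("b();", null));
        System.out.println(reachablePrefix(path, 7)); // bound to branch 7
        System.out.println(reachablePrefix(path, 8)); // bound to a different branch
    }
}
```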

4.6 Batch processing

We implement batch processing of the files in a source tree by providing a Spoon Processor for CtExecutable elements. This processor is named SmPLProcessor. As the current version of Spoon does not allow one to pass arbitrary arguments to processors, we designed a workaround in which our SmPLProcessor implements a main method that reads command line arguments specifying a semantic patch and other SmPL-specific options, before passing control to the main method of spoon.Launcher along with any remaining arguments, leaving the batch processing logic (including processor instantiation) to spoon.Launcher. We store the SmPL-specific options in class variables of SmPLProcessor which are later used by the SmPLProcessor instance created by spoon.Launcher.

The process method of SmPLProcessor receives CtExecutable metamodel elements and queries them for the filename of their associated source file. When a filename is seen for the first time we perform a check on the full source code of the file to establish whether there is a chance that the file contains an executable block that could match the semantic patch being applied. The check is based on simple string matching over a set of strings collected from the semantic patch that must be present in a source file in order for a match to be possible. Many files are typically ruled out as potential matches, and any future CtExecutable elements associated with the same source file are simply skipped in the process method.

There are two types of output available when using SmPLProcessor. The first is the full spooned source code output, that is, the target source tree provided to spoon.Launcher after parsing, processing and unparsing. This output is part of the default behavior of spoon.Launcher and can be turned off with a command line flag.
The second type of output is accessed by providing SmPLProcessor with a --diff-command command line argument containing a shell command string that will be executed once for every source file that was modified by the semantic patch, providing the paths to two files containing the spooned source output before and after patch application respectively.
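The argument-splitting workaround and the per-file prefilter described in this section can be sketched without any dependency on Spoon. The code below is a simplified illustration under stated assumptions, not the SmPLProcessor source: the class BatchSketch and the option name --smpl-file are hypothetical (only --diff-command appears in the text), the forwarding call to the launcher is left out, and the prefilter assumes that every collected string must occur in the file.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Supplier;

/** Illustrative sketch only; names other than --diff-command are hypothetical. */
final class BatchSketch {
    /** Options consumed by the wrapper itself ("--smpl-file" is a made-up name). */
    static final Set<String> OWN_OPTIONS = Set.of("--smpl-file", "--diff-command");

    /**
     * Moves wrapper-specific "option value" pairs into {@code own} and returns
     * the remaining arguments, which would be forwarded to the launcher.
     */
    static String[] splitArgs(String[] args, Map<String, String> own) {
        List<String> forwarded = new ArrayList<>();
        for (int i = 0; i < args.length; i++) {
            if (OWN_OPTIONS.contains(args[i]) && i + 1 < args.length) {
                own.put(args[i], args[i + 1]);
                i++; // skip the option value
            } else {
                forwarded.add(args[i]);
            }
        }
        return forwarded.toArray(new String[0]);
    }

    /** Per-file prefilter: the source text is scanned once and the verdict is cached. */
    static final class Prefilter {
        private final List<String> required;
        private final Map<String, Boolean> verdicts = new HashMap<>();

        Prefilter(List<String> required) {
            this.required = required;
        }

        /** Returns false if the file cannot possibly contain a match. */
        boolean mayMatch(String filename, Supplier<String> sourceLoader) {
            Boolean cached = verdicts.get(filename);
            if (cached == null) {
                String source = sourceLoader.get();
                cached = required.stream().allMatch(source::contains);
                verdicts.put(filename, cached);
            }
            return cached;
        }
    }
}
```

Caching the verdict per filename mirrors the behavior described above, in which later executables from a file that was already ruled out are skipped without inspecting the file again.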

4.7 Use of Spoon

This section presents a summary of how we use the Spoon library and transformation engine in SPOON-SMPL.

• We use Spoon to parse the Java DSL representation of a semantic patch into a Spoon metamodel, and we use the information contained in the patch metamodel to compile the formula for the patch.

• We use Spoon to parse the code targeted for the application of a semantic patch. The target code is parsed into a Spoon metamodel from which executable blocks are extracted. These executable blocks are then used for building the control flow graphs on which the CTL models of SPOON-SMPL are based. CTL model labels use Spoon metamodel elements to represent code.

• Spoon metamodel elements form the basis of our AST pattern matching, performed for example when testing Statement predicates against model states labeled with code.

• Our transformation logic is implemented as manipulations of the Spoon metamodel AST.

• We use Spoon for the pretty-printing of the full source tree in batch processing mode.

Chapter 5

Evaluation methodology

This chapter describes the methodology used to answer research questions RQ1-RQ4. The chapter is divided into two main sections covering the analytical (non-experimental) and experimental methodology respectively. Section 5.1 describes the analytical methodology used to answer research questions RQ1 and RQ2. Section 5.2 describes the experiments used to answer RQ3 and RQ4.

5.1 Analytical methodology

This section covers the methodology employed in answering research questions RQ1 and RQ2, neither of which involved an experiment. Section 5.1.1 describes the identification of the subset of features present in Coccinelle that permit a generalization to a Java context, while Section 5.1.2 describes the identification of features that do not permit such a generalization.

5.1.1 RQ1: Generalizable features

Research question RQ1 involves identifying the subset of features present in Coccinelle that can be generalized for a Java context. There is no official, fine-grained feature list available for Coccinelle, so a feature catalog had to be assembled for our study. To create this catalog we carefully studied the documents listed in Table 5.1 and collected a table of features identified from the descriptions of Coccinelle and SmPL. In an effort to present the material in an organized and coherent way, we group the features into two main sets: macro- and micro-level features. The set of macro-level features is intended to describe the overarching general function of Coccinelle such as the ability to apply semantic patches to files. The set of micro-level features is intended to contain smaller, more specific features such as the types of metavariables supported by Coccinelle. Furthermore, as the micro-level feature set is much larger than the macro-level set, we further subdivide it into subsets loosely based on topics such as patch application and metavariables.

Our evaluation consists of considering the generalizability of each feature in the catalog. We present the list of generalizable features in Section 6.1.1, wherein each feature is presented together with a rationale for finding the feature to be generalizable. In Section 6.1.2 we summarize our findings and present our conclusion.

Document                  Description
Padioleau et al. [1]      Paper ”SmPL: A domain-specific language [...]”
SmPL grammar [35]         Presents the SmPL grammar used by Coccinelle.
Coccinelle options [36]   Describes the options provided by Coccinelle.

Table 5.1: Documents used as sources for Coccinelle feature catalog.

5.1.2 RQ2: Non-generalizable features

Research question RQ2 involves identifying the subset of features present in Coccinelle that cannot be generalized for a Java context. For this task we use the same approach as described for RQ1 in Section 5.1.1, and we use the same feature catalog as presented in Section 6.1.1. In Section 6.1.3 we summarize our findings and present our conclusion.

5.2 Experimental methodology

This section describes the methodology of the central experiments of the thesis, designed to answer research questions RQ3 and RQ4. Section 5.2.1 describes our experiment measuring the performance and correctness of SPOON-SMPL compared to Coccinelle4J in applying semantic patches over the set of six patches and associated projects used in an API migration case study originally performed by the authors of Coccinelle4J. Section 5.2.2 describes a directly related experiment comparing the running time of SPOON-SMPL over the same set of API migration semantic patches and associated projects to the build times of the projects.

5.2.1 RQ3: Patch application performance

Research question RQ3 asks whether the running time performance of a Java SmPL implementation based on Spoon is better or worse than that of Coccinelle4J. In order to answer this question, we choose to compare the performance of the two implementations in the application of six semantic patches to six associated project source trees used in an API migration case study performed by the Coccinelle4J authors [3]. Our experiment consists of running Coccinelle4J and SPOON-SMPL to apply each patch to a fresh copy of the project source tree associated with the patch. Multiple runs for each combination of patch and implementation are performed, and we measure and record the running time of each run. Our evaluation consists of verifying the correctness of any applied transformations and comparing the average measured running times.

With the exception of SPOON-SMPL itself, we source all of the material used in the experiment from the Coccinelle4J repository¹ on GitHub. This repository contains² all the material relevant to the case study, including the semantic patches and scripts to download specific revisions of each targeted project.

¹ https://github.com/kanghj/coccinelle.git @ 37b1025ea7378ce47cde44807e85fd77663f5f81
² At the time of writing.

Running times

We measure the running time both internally and externally for both Coccinelle4J and SPOON-SMPL. Both implementations contain an internal profiling timer, enabled by default in the SmPLProcessor of SPOON-SMPL and enabled in Coccinelle4J by passing the --profile flag on the command line. For Coccinelle4J, we record the reported profiler value of Main total. For SPOON-SMPL, we record the values of totalTimer and procTimer, with totalTimer representing the total running time of the main method and procTimer representing the total time directly spent on SmPL-related processing. We take the time spent on SmPL-related processing to be the sum of the time spent in the process method of the SmPLProcessor and the time taken to print the unified diffs over transformed files, which for technical reasons takes place outside the process method. We use two timers for SPOON-SMPL because the internal non-SmPL-related processing of Spoon (e.g., parsing and pretty-printing) is a significant consumer of the total running time, which is shown by the difference between the two timers. In the results section, we list the internal measurements for Coccinelle4J under the heading INT while the two measurements for SPOON-SMPL are listed under the headings INT and PROC respectively.

Our external running time measurements use the built-in time command of GNU Bash 5.0. This command produces a printout of the elapsed wall clock time and the consumed CPU time in user mode and in kernel mode. In the results section, we list these measurements under the headings WCLK and CPU respectively, with the numbers listed under CPU being the sum of CPU time spent in user mode and kernel mode. Including the externally measured wall clock time provides a sanity check for the internal profiling timers of Coccinelle4J and SPOON-SMPL and in the case of SPOON-SMPL provides some insight into the overhead of the JVM. The consumed CPU time provides insight into the effects of multiprocessing on the performance, as a CPU time larger than the wall clock time indicates that work has been performed by multiple parallel threads. Due to the end-to-end nature of the measurements where SPOON-SMPL applies a patch exactly once per invocation of the JVM, we gave no further attention to the effects of JVM startup time or JIT warmup times [37].
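The relationship between the reported quantities can be made concrete with a small Java sketch. It is an illustration under assumptions rather than the profiling code of either tool: the class TimingSketch and its Stopwatch type are hypothetical, with the two stopwatches in main merely playing the roles of totalTimer and procTimer.

```java
/** Illustrative sketch only: how the reported timing columns relate to each other. */
final class TimingSketch {
    /** Accumulating stopwatch, in the spirit of the totalTimer/procTimer pair. */
    static final class Stopwatch {
        private long accumulatedNanos;
        private long startedAt;

        void start() {
            startedAt = System.nanoTime();
        }

        void stop() {
            accumulatedNanos += System.nanoTime() - startedAt;
        }

        double seconds() {
            return accumulatedNanos / 1e9;
        }
    }

    /** CPU column: user-mode plus kernel-mode CPU time. */
    static double cpuSeconds(double userSeconds, double kernelSeconds) {
        return userSeconds + kernelSeconds;
    }

    /** CPU time above wall clock time means some work ran on parallel threads. */
    static boolean indicatesParallelWork(double wallClockSeconds, double cpuSeconds) {
        return cpuSeconds > wallClockSeconds;
    }

    /** Arithmetic mean over the recorded runs. */
    static double mean(double[] runs) {
        double sum = 0;
        for (double r : runs) {
            sum += r;
        }
        return sum / runs.length;
    }

    public static void main(String[] args) {
        Stopwatch total = new Stopwatch();
        Stopwatch proc = new Stopwatch();
        total.start();
        // ... parsing would happen here ...
        proc.start();
        // ... patch-related processing would happen here ...
        proc.stop();
        // ... pretty-printing would happen here ...
        total.stop();
        System.out.println(proc.seconds() <= total.seconds());
    }
}
```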

Output modes for spoon-smpl

Due to the impact of the pretty-printing stage of Spoon on the running times, we choose to define two output modes for SPOON-SMPL, full and minimal, and to include measurements for both modes. In full output mode, we execute the Spoon launcher with most options set to their default values, causing Spoon to pretty-print the full source tree after processing is complete. In the minimal output mode we supply the command line argument --output-mode nooutput to the Spoon launcher, resulting in the pretty-printing stage being skipped. However, SPOON-SMPL still produces its unified diff output when running in minimal output mode, providing the user with a full description of the modifications made by applying the semantic patch. Unfortunately, as we were unable to use the recently added sniper output mode for Spoon, the unified diff output of SPOON-SMPL will generally not be directly applicable as a patch to the original source tree. This is due to the default output mode of Spoon reformatting much of the code, meaning the deletions and context parts of the patch will often fail to match the original file. As such, the measurements for the minimal output mode should not be seen as a pure optimization over the full output mode, but rather as an alternative with drawbacks.

Correctness

We define a metric for correctness as it would not be meaningful to compare running times for incorrectly applied patches. Our metric consists of the number of true positives (correct transformations) and the number of false positives (incorrect transformations). We use manual inspection and the informal semantics of SmPL to decide whether a given transformation is correct or not. Neither true negatives (correctly untransformed code) nor false negatives (missing transformations) are counted due to the size of the targeted projects making such a count an impracticable task.

We measure correctness using the diff output produced by Coccinelle4J and SPOON-SMPL respectively. For both implementations, the diff output is a convenient equivalent to taking a diff over a full source tree³ to which a semantic patch is applied, such that applying the diff output as a patch to the unmodified source tree carries out all of the transformations generated by the semantic patch. As a preparatory step, we manually verified the diff output of each implementation–patch combination by visual inspection of the modified source tree in a diff visualiser program⁴, allowing for easy identification of true positives and false positives. We use the manually verified diff outputs to automatically verify the outputs of each individual run of the associated implementation–patch combination during the experiment. In the results section, correctness is presented as the number of correctly transformed executables and the number of incorrectly transformed executables. In this context, a correctly transformed executable is an executable block⁵ of Java code that has been modified by the application of the semantic patch and where all transformations applied to the block are instances of true positives.
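The executable-level metric can be sketched as follows. The code is illustrative only and rests on assumptions: the class CorrectnessSketch and its input format (one list of true/false-positive verdicts per modified executable) are hypothetical, an executable with at least one false positive is assumed to count as incorrectly transformed, and the automatic per-run verification is reduced to an exact string comparison against the manually verified diff.

```java
import java.util.List;
import java.util.Map;

/** Illustrative sketch only: counting correctly and incorrectly transformed executables. */
final class CorrectnessSketch {
    /**
     * Each map value lists the verdicts for the transformations applied to one
     * executable (true = true positive, false = false positive).
     * Returns {correctlyTransformed, incorrectlyTransformed}.
     */
    static int[] classify(Map<String, List<Boolean>> verdictsByExecutable) {
        int correct = 0;
        int incorrect = 0;
        for (List<Boolean> verdicts : verdictsByExecutable.values()) {
            if (verdicts.isEmpty()) {
                continue; // unmodified executables are not counted
            }
            if (verdicts.stream().allMatch(Boolean::booleanValue)) {
                correct++;
            } else {
                incorrect++;
            }
        }
        return new int[] {correct, incorrect};
    }

    /** A run passes automatic verification if its diff equals the verified diff. */
    static boolean matchesVerified(String runDiff, String verifiedDiff) {
        return runDiff.equals(verifiedDiff);
    }
}
```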
Additionally, we use the program cloc⁶ version 1.86 to count the NCLOC (Non-Comment Lines Of Code) of each project, included in the results section to provide an indication of the sizes of the projects and to provide insight into the differences in running times for the same implementation over different projects. Finally, we run our experiments on a virtual machine guest with a minimal Linux system to minimize the potential effects of arbitrary system packages.

The rest of this section describes the six semantic patches, their target code bases and any modifications made to the semantic patches to adapt them for use with SPOON-SMPL.

³ Requiring two copies of the source tree, one kept unmodified while the other one is modified by the semantic patch.
⁴ Meld 3.20.2, https://meldmerge.org.
⁵ A constructor, method or lambda expression.
⁶ https://github.com/AlDanial/cloc

Semantic patch 1: sticky_broadcasts

The original sticky_broadcasts semantic patch targets the two Android API methods sendStickyBroadcast and removeStickyBroadcast, both deprecated in the release of Android API level 21. The patch as shown in Listing 30 consists of a disjunction of two clauses, where the first clause indicates the replacement of calls to sendStickyBroadcast with calls to sendBroadcast using the same argument list, and the second clause indicates the removal of any calls to removeStickyBroadcast.

The target code base for the patch was a specific revision of the Nextcloud Android app⁷ released in October of 2018. We made no modifications to the sticky_broadcasts patch for use with SPOON-SMPL.

Listing 30. The sticky_broadcasts semantic patch.

@@
Intent intent;
@@
(
- sendStickyBroadcast(intent);
+ sendBroadcast(intent);
|
- removeStickyBroadcast(intent);
)

Semantic patch 2: set_text_size

The original set_text_size semantic patch targets the Android WebKit API method WebSettings::setTextSize that was deprecated in the release of Android API level 15. The patch as shown in Listing 31 consists of a disjunction of five clauses with each clause indicating a replacement of a call to setTextSize with a call to setTextZoom while also changing the argument from a named constant belonging to the deprecated API enum WebSettings.TextSize to a basic integer. In order to apply the patch as-is to the code base targeted by the original case study, Coccinelle4J requires the use of additional isomorphism rules. An example of such a rule is shown in Listing 32. The example rule allows the expression LARGEST as found in the patch to match the expression WebSettings.TextSize.LARGEST as found in the target code. The target code base for the patch was a specific revision of the Lucid Browser Android app⁸ released in March of 2017. A file containing all the required isomorphism rules was available in the Coccinelle4J repository on GitHub.

We modified the patch for use with SPOON-SMPL as our implementation lacks support for the isomorphism rules employed by Coccinelle4J. Our modified patch is shown in Listing 33. The modification consists of declaring each of the five constant expressions LARGEST, LARGER etc. as expression metavariables with an additional constraint on the literal source code of the expression given as a regular expression in the syntax and semantics of the standard Java package java.util.regex. We added the feature of regular expression constraints on metavariables specifically for this semantic patch. The additional constraint allows the matching between the expression LARGEST as found in the patch and the expression WebSettings.TextSize.LARGEST as found in the target code, with similar statements holding for the other four constants. As such the additional constraint solves the same problem as was solved by the use of isomorphism rules in Coccinelle4J. It is trivial to see that the two patches are equivalent.

⁷ https://github.com/nextcloud/android

Listing 31. Original set_text_size patch used with Coccinelle4J.

@@
expression E;
@@
(
- E.setTextSize(LARGEST);
+ E.setTextZoom(200);
|
- E.setTextSize(LARGER);
+ E.setTextZoom(150);
|
- E.setTextSize(NORMAL);
+ E.setTextZoom(100);
|
- E.setTextSize(SMALLER);
+ E.setTextZoom(75);
|
- E.setTextSize(SMALLEST);
+ E.setTextZoom(50);
)

⁸ https://github.com/powerpoint45/Lucid-Browser

Listing 32. Example isomorphism rule for original set_text_size patch.

Expression
@ WebSettings_TextSize_LARGEST @
@@
LARGEST => WebSettings.TextSize.LARGEST

Listing 33. Modified set_text_size patch used with SPOON-SMPL.

@@
expression E;
expression LARGEST when matches "(WebSettings\\.TextSize\\.)?LARGEST";
expression LARGER when matches "(WebSettings\\.TextSize\\.)?LARGER";
expression NORMAL when matches "(WebSettings\\.TextSize\\.)?NORMAL";
expression SMALLER when matches "(WebSettings\\.TextSize\\.)?SMALLER";
expression SMALLEST when matches "(WebSettings\\.TextSize\\.)?SMALLEST";
@@
(
- E.setTextSize(LARGEST);
+ E.setTextZoom(200);
|
- E.setTextSize(LARGER);
+ E.setTextZoom(150);
|
- E.setTextSize(NORMAL);
+ E.setTextZoom(100);
|
- E.setTextSize(SMALLER);
+ E.setTextZoom(75);
|
- E.setTextSize(SMALLEST);
+ E.setTextZoom(50);
)
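The effect of the regular expression constraint can be checked directly with java.util.regex. The sketch below is not SPOON-SMPL code: the class RegexConstraintSketch is hypothetical, and it assumes that the constraint must match the whole source text of the expression (Matcher.matches) rather than a substring, which the text above does not state.

```java
import java.util.regex.Pattern;

/** Illustrative sketch only: the "when matches" constraint as a plain regex test. */
final class RegexConstraintSketch {
    /** The constraint string from the patch, written as a Java string literal. */
    static final Pattern LARGEST = Pattern.compile("(WebSettings\\.TextSize\\.)?LARGEST");

    /** Assumes whole-string matching against the literal source of the expression. */
    static boolean satisfies(Pattern constraint, String expressionSource) {
        return constraint.matcher(expressionSource).matches();
    }

    public static void main(String[] args) {
        System.out.println(satisfies(LARGEST, "LARGEST"));
        System.out.println(satisfies(LARGEST, "WebSettings.TextSize.LARGEST"));
        System.out.println(satisfies(LARGEST, "LARGER"));
    }
}
```

Under this assumption both the short and the fully qualified form of the constant satisfy the constraint, while the other constants do not, which is the matching behavior that the isomorphism rule of Listing 32 provides in Coccinelle4J.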

Semantic patch 3: get_color

The original get_color semantic patch targets the Android API method Resources::getColor that was deprecated in the release of Android API level 23. The patch as shown in Listing 34 consists of the replacement of calls to getColor on Resources objects returned by Context::getResources with calls to ContextCompat::getColor in which the context object is supplied as an additional argument.

The target code base for the patch was a specific revision of the Kickstarter Android app⁹ released in January of 2016. We made no modifications to the get_color patch for use with SPOON-SMPL.

⁹ https://github.com/kickstarter/android-oss

Listing 34. The get_color semantic patch.

@@
Context ctx;
expression E;
@@
- ctx.getResources().getColor(E)
+ ContextCompat.getColor(ctx, E)

Semantic patch 4: should_vibrate

The original should_vibrate semantic patch targets the Android API method AudioManager::shouldVibrate that was deprecated in Android API level 16. An excerpt of the patch is shown in Listing 36. The full patch is provided in Appendix A.1. The excerpt shows how a new method shouldVibrate is introduced, and how calls to am.shouldVibrate for arbitrary identifiers am should be replaced with calls to the newly introduced method in bodies of methods that include a Context object parameter. The target code base for the patch was a specific revision of the Signal¹⁰ messaging app for Android released in August of 2018.

We made slight modifications to the patch for use with SPOON-SMPL as our implementation requires the presence of a return type specifier for matching on a method header. In the excerpt of the modified patch shown in Listing 37 the metavariable T has been introduced and used as the return type specifier in the matched method signature. It is trivial to see that the patches are equivalent since the T metavariable of the modified patch matches any type. Additionally, the original patch included a second rule that was not relevant for the application of the patch to the target code base. As such, we omitted this second rule in the patch used with SPOON-SMPL.

Finally, due to an issue with the default Spoon pretty-printer, we made a small change to a file that was irrelevant¹¹ to the semantic patch. The change is shown as a unified diff in Listing 35. As the file contained nothing relevant to the application of the semantic patch, the only effect on the experiment is the successful completion of the pretty-printing stage of Spoon. We made the choice to alter the file rather than remove it in an effort to make as few and as small changes as possible.

¹⁰ https://github.com/signalapp/Signal-Android
¹¹ Patch application produced an identical result with the file removed.

Listing 35. Patch for Spoon pretty-printing issue.

--- a/MediaOverviewActivity.java
+++ b/MediaOverviewActivity.java
@@ -334,1 +334,1 @@
- protected Void doInBackground(MediaDatabase.MediaRecord... records) {
+ protected Void doInBackground(MediaDatabase.MediaRecord records) {

Listing 36. Excerpt of original should_vibrate patch used with Coccinelle4J.

@@
identifier am, f, ctx;
expression vibrate_type;
@@
+ boolean shouldVibrate(AudioManager am,Context ctx,int vibrateType) {
  [ ... method body omitted ... ]
+ }

  f(..., Context ctx, ...) {
  ...
- am.shouldVibrate(vibrate_type)
+ shouldVibrate(am, ctx, vibrate_type)
  ...
  }

  [ ... second rule omitted ... ]

Listing 37. Excerpt of modified should_vibrate patch used with SPOON-SMPL.

@@
type T;
identifier am, f, ctx;
expression vibrate_type;
@@
+ boolean shouldVibrate(AudioManager am,Context ctx,int vibrateType) {
  [ ... method body omitted ... ]
+ }

  T f(..., Context ctx, ...) {
  ...
- am.shouldVibrate(vibrate_type)
+ shouldVibrate(am, ctx, vibrate_type)
  ...
  }

Semantic patch 5: get_height

The original get_height semantic patch targets the Android API methods Display::getWidth and Display::getHeight that were deprecated in the release of Android API level 15. An excerpt of the patch is shown in Listing 38. The full patch is provided in Appendix A.2. The excerpt shows how constructions of Point objects using calls to the deprecated methods as arguments are to be replaced with a default construction of a Point object followed by a call to display.getSize to populate the fields of the Point. Furthermore, any subsequent use of the expressions display.getHeight() and display.getWidth() are to be replaced by field accesses on the Point object. The target code base for the patch was a specific revision of the Glide¹² library for Android released in November of 2014.

We made slight modifications to the patch for use with SPOON-SMPL as our implementation lacks support for single statements spanning multiple lines in semantic patches. An excerpt of the modified patch is shown in Listing 39, in which the constructor call of lines 1-2 of Listing 38 has been reformatted to occupy a single line.

Listing 38. Excerpt of original get_height patch used with Coccinelle4J.

1 - p = new Point(display.getWidth(),
2 -               display.getHeight());
3 + p = new Point();
4 + display.getSize(p);

Listing 39. Excerpt of modified get_height patch used with SPOON-SMPL.

- p = new Point(display.getWidth(), display.getHeight());
+ p = new Point();
+ display.getSize(p);
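The intent of the transformation can be illustrated with plain Java stand-ins for the Android classes. The types below are not the Android API: Point and Display are minimal hypothetical substitutes reduced to the members mentioned in the patch, so that the sketch compiles without the Android SDK. The two helper methods show that the code before and after the transformation populates the Point with the same values.

```java
/** Illustrative sketch only: before/after shape of the get_height transformation. */
final class GetHeightSketch {
    /** Stand-in for android.graphics.Point, reduced to two int fields. */
    static final class Point {
        int x;
        int y;

        Point() {
        }

        Point(int x, int y) {
            this.x = x;
            this.y = y;
        }
    }

    /** Stand-in for android.view.Display, reduced to the methods named in the patch. */
    static final class Display {
        private final int width;
        private final int height;

        Display(int width, int height) {
            this.width = width;
            this.height = height;
        }

        int getWidth() {
            return width;
        }

        int getHeight() {
            return height;
        }

        void getSize(Point outSize) {
            outSize.x = width;
            outSize.y = height;
        }
    }

    /** Code shape removed by the patch. */
    static Point before(Display display) {
        return new Point(display.getWidth(), display.getHeight());
    }

    /** Code shape introduced by the patch. */
    static Point after(Display display) {
        Point p = new Point();
        display.getSize(p);
        return p;
    }
}
```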

Semantic patch 6: on_console_message

The original on_console_message semantic patch targets a variant of the Android WebKit API method WebChromeClient::onConsoleMessage that was deprecated in the release of Android API level 15. The deprecated method variant takes three arguments of primitive types, while the suggested replacement is a variant taking a single argument of a compound type that encapsulates the primitive arguments. The methods in question are intended to be overridden by client code extending WebChromeClient and the patch targets such client implementations. The patch as shown in Listing 40 specifies the replacement of a declaration of the deprecated method signature with a declaration of the suggested non-deprecated signature, followed by the replacements of any usages of the formal parameters of the deprecated signature with the corresponding usage of the single formal parameter of the non-deprecated method variant.

As with the should_vibrate patch (semantic patch 4), we needed to modify the semantic patch as SPOON-SMPL requires a return type to be present on the method header. Again a catch-all type metavariable T is used, making it straightforward to see that the modified patch is equivalent to the original patch. The modified part of the patch is shown in Listing 41, in which on line 7 we omit lines corresponding exactly to lines 6-18 of the original patch shown in Listing 40.

¹² https://github.com/bumptech/glide

Listing 40. Original on_console_message patch used with Coccinelle4J.

 1  @@
 2  identifier p1, p2, p3;
 3  @@
 4  - onConsoleMessage(String p1, int p2, String p3) {
 5  + onConsoleMessage(ConsoleMessage cs) {
 6    <...
 7    (
 8  - p1
 9  + cs.message()
10    |
11  - p2
12  + cs.lineNumber()
13    |
14  - p3
15  + cs.sourceId()
16    )
17    ...>
18    }

Listing 41. Excerpt of modified on_console_message patch used with SPOON-SMPL.

1  @@
2  type T;
3  identifier p1, p2, p3;
4  @@
5  - T onConsoleMessage(String p1, int p2, String p3) {
6  + T onConsoleMessage(ConsoleMessage cs) {
7  [ ... lines omitted ... ]

5.2.2 RQ4: Project build times

Research question RQ4 asks whether a Java SmPL implementation based on Spoon will have acceptable running time performance for an individual developer. To answer this question, we need to establish a baseline against which the performance of SPOON-SMPL can be compared. We choose to use the build times of each of the six projects used in the Coccinelle4J API migration case study as our baseline. For each project we compare the average build time against the wall clock running time of SPOON-SMPL in applying the associated patch to the project source tree, reusing the measured running times from the experiment used to answer research question RQ3.

Our choice of using the project build time as a baseline is motivated by the obvious need for a single developer to be able to build and test the project in order to work on it, meaning the time needed to do so must necessarily be acceptable. We therefore define an acceptable running time for semantic patch application to be less than or equal to the average build time of the associated project.

All six case study projects use the Gradle build system and the Gradle wrapper gradlew, with the specific Gradle version differing between projects. We build each project by executing the Gradle wrapper from the command line in an environment with clean Gradle and Maven repositories. We measure the build time by recording the time reported by Gradle. We build each project several times, and exclude the first two builds (one to download dependencies, one to normalize) from the computation of the average build time.

The simple notion of “project build time” becomes somewhat problematic when considering projects with a large number of build targets. As an example, the Kickstarter app for Android listed 37 different build targets when running the build system command gradlew tasks. For this reason, we define “project build” to mean any build system task that results in the code being compiled and all unit tests being executed. Under this definition, we find the command gradlew test to be sufficient for all six case study projects. We verified this by manually breaking some of the code and/or some of the tests to see that gradlew test would fail on a clean run (preceded by gradlew clean) whenever such a breaking edit was present. However, we performed this verification only partially, meaning only some source files and only some tests were targeted.

Five projects required minor modifications in order to build successfully. The modifications are described in Table 5.2. Additionally, we use a slightly newer Git revision of the Nextcloud app for its build, as building the revision used in the original case study proved to be problematic. The revision used in the original case study was committed on October 23rd 2018, while the revision we use to measure the build time was committed on October 29th 2018.

Project           File                            Patch description
Nextcloud app     build.gradle                    Add missing dependencies.
Lucid Browser     app/build.gradle                Disable aborting the build on lint error.
Kickstarter app   app/build.gradle                Add Google repository URL. Remove syntax error.
Glide library     build.gradle                    Add Google repository URL.
                  gradle.properties               Robolectric 2.4 → 2.3.
                  library/build.gradle            findbugs 2.0.3 → 3.0.0.
                  GlideTest.java                  Remove @Resetter annotation.
Signal app        ConversationAdapterTest.java    Disable unit test testGetItemIdEquals.

Table 5.2: Projects requiring minor modifications for building.

All of our changes are made with the intent of getting the projects to build and test with as few changes as possible without being deeply familiar with any of them. Many of the specific changes were discovered by searching for common solutions to toolchain error messages on websites such as StackOverflow¹³.
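The acceptability criterion can be illustrated with a small Java sketch. The class name BuildBaseline, its method names and all numbers below are hypothetical and chosen for illustration only; they are not measurements from this thesis and not part of SPOON-SMPL.

```java
import java.util.Arrays;

// Hypothetical helper illustrating the acceptability criterion of RQ4.
class BuildBaseline {
    // Mean of the recorded build times, skipping the first `warmup` builds
    // (in the text: one build to download dependencies, one to normalize).
    static double averageExcludingWarmup(double[] buildSeconds, int warmup) {
        if (buildSeconds.length <= warmup) {
            throw new IllegalArgumentException("need more builds than warm-up runs");
        }
        return Arrays.stream(buildSeconds).skip(warmup).average().getAsDouble();
    }

    // A patch application time is acceptable if it is less than or equal to
    // the average build time of the associated project.
    static boolean isAcceptable(double patchSeconds, double averageBuildSeconds) {
        return patchSeconds <= averageBuildSeconds;
    }

    public static void main(String[] args) {
        double[] builds = {95.0, 40.0, 30.0, 32.0, 34.0}; // made-up build times in seconds
        double baseline = averageExcludingWarmup(builds, 2);
        System.out.println("baseline (s): " + baseline);
        System.out.println("acceptable: " + isAcceptable(12.5, baseline)); // 12.5 is made up
    }
}
```

Keeping the warm-up count as a parameter makes it explicit that excluding exactly two builds is a methodological choice rather than a property of Gradle.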

¹³ https://stackoverflow.com

Chapter 6

Evaluation results

This chapter provides the results of the efforts described in Chapter 5. The material is again divided into two main sections, with Section 6.1 providing the results of the analytical efforts and Section 6.2 providing the results of the experiments.

6.1 Analytical results

This section provides the results of the analytical efforts described in Sections 5.1.1 and 5.1.2. Section 6.1.1 provides the catalog of the features of Coccinelle, along with their corresponding generalizability findings. Sections 6.1.2 and 6.1.3 provide a summary and conclusion for RQ1 and RQ2, respectively.

6.1.1 Coccinelle feature catalog

This section provides the Coccinelle feature catalog along with the corresponding findings regarding the generalizability of individual features for a Java context. The section begins by listing the macro-level features, followed by a number of grouped subsets of the micro-level feature set. Additionally, for each group of features we provide at least one example of a semantic patch together with an input program that uses one or more features of the group.

In the feature tables of this section, the Src column indicates the source document(s) from which the feature was identified, corresponding to subsets of the documents listed in Table 5.1. The Gen column indicates whether the feature is generalizable to Java.


Due to the similarities between C and Java, we often find features to be trivially generalizable. For example, the syntax and semantics of the signature part of a function definition in C is nearly identical to that of method signatures in Java. Therefore, the Coccinelle feature for matching and transforming function signatures is trivially generalizable. In the feature tables, features that are trivially generalizable are listed with a T in the Gen column. Features that we do not find to be trivially generalizable are presented with a verdict in the form of Yes or No in the Gen column, along with a reference to a footnote providing the reasoning behind the verdict. These footnotes are listed at the bottom of the page containing the table that references them.

Macro-level features

Feature                                                   Src       Gen
Match code based on target language semantics.            [1]       T
    As opposed to matching on literal text or AST structure. Using isomorphisms and semantics-based constraints.
Match and specify transformations on the CFG.             [1]       T
    Using computation path operators and CTL-like patch semantics.
Specify general, concise transformations.                 [1]       T
    Using metavariables and disjunctions, and by ignoring irrelevant details like spacing and comments.
Matching on context elements.                             [1]       T
    Context elements are required to be present, but are not themselves transformed. Analogous to diff/patch.
Add, remove, replace source code elements.                [1]       T
    Statements, expressions or parts thereof, complete functions, function signatures and more.
Multiple rules, dependencies, sharing of bindings.        [1]       T
    Named rules, dependency declarations, referring to metavariables of a different rule.
Scripting support.                                        [35]      T
    Coccinelle has support for scripts written in Python or OCaml.
Running interactively.                                    [1, 36]   T
    Requiring high performance and suitable interface.
Patch application to single file or source tree.          [36]      T
    Analogous to diff/patch.

Table 6.1: Generalizability of macro-level features. T denotes an affirmation that is trivial due to the similarities between C and Java.

Examples of macro-level features

Matching code based on language semantics

Listings 42, 43 and 44 show how isomorphisms allow matching based on semantics rather than on the literal text or on the AST structure. The function foo shown in Listing 43 is valid code in both C and Java and has essentially identical semantics in both languages, so the generalization is trivial.

Listing 42. Patch

    @@
    identifier v;
    @@
    - v = v + 1;
    + v = v + 2;

Listing 43. Input

    int foo(int x) {
        int y = x;
        y = y + 1;
        y = 1 + y;
        y++;
        return y;
    }

Listing 44. Output

    int foo(int x) {
        int y = x;
        y = y + 2;
        y = y + 2;
        y = y + 2;
        return y;
    }
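To make the idea of matching up to isomorphism concrete, the following Java sketch models the three spellings of an increment that appear in Listing 43 and tests them with a single predicate. The class IsoSketch and its tiny statement model are assumptions made for illustration; they do not reflect how Coccinelle or SPOON-SMPL represent isomorphisms internally.

```java
// Toy model of three statement shapes; a hypothetical stand-in for a real AST.
class IsoSketch {
    enum Shape { VAR_PLUS_CONST, CONST_PLUS_VAR, POST_INCREMENT }

    final Shape shape;
    final String target;   // variable being assigned or incremented
    final String operand;  // variable read on the right-hand side (null for v++)
    final int constant;    // literal on the right-hand side (fixed to 1 for v++)

    private IsoSketch(Shape shape, String target, String operand, int constant) {
        this.shape = shape;
        this.target = target;
        this.operand = operand;
        this.constant = constant;
    }

    // Models: target = operand + constant
    static IsoSketch varPlusConst(String target, String operand, int constant) {
        return new IsoSketch(Shape.VAR_PLUS_CONST, target, operand, constant);
    }

    // Models: target = constant + operand
    static IsoSketch constPlusVar(String target, int constant, String operand) {
        return new IsoSketch(Shape.CONST_PLUS_VAR, target, operand, constant);
    }

    // Models: target++
    static IsoSketch postIncrement(String target) {
        return new IsoSketch(Shape.POST_INCREMENT, target, null, 1);
    }

    // True if the statement means "v = v + 1" for some variable v,
    // regardless of which of the three spellings was used.
    boolean isIncrementByOne() {
        if (shape == Shape.POST_INCREMENT) {
            return true;
        }
        return constant == 1 && target.equals(operand);
    }

    public static void main(String[] args) {
        System.out.println(varPlusConst("y", "y", 1).isIncrementByOne());
        System.out.println(constPlusVar("y", 1, "y").isIncrementByOne());
        System.out.println(postIncrement("y").isIncrementByOne());
        System.out.println(varPlusConst("y", "x", 1).isIncrementByOne());
    }
}
```

A purely textual matcher would need one pattern per spelling, whereas the predicate above treats all three as the same operation, which is the effect the patch in Listing 42 relies on.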

Matching on the CFG

Listings 45, 46 and 47 show how the computation path wildcard operator allows specifying match constraints on the control flow graph rather than on the literal text or the AST structure. In this example, the patch shown in Listing 45 features the wildcard operator on line 5. The patch will match the code in Listing 46 as all computation paths originating from the statement a() eventually reach a statement b(E) for an arbitrary expression E. In Listing 47, the computation path that ends at the return statement does not reach a statement b(E), so the patch will not match. Note that Listings 46 and 47 depict two distinct instances of patch target source code, and that neither has had the patch of Listing 45 applied. The functions fn1 and fn2 shown in Listings 46 and 47 are valid in both C and Java with essentially identical semantics, so the generalization is trivial.

Listing 45. Patch

    1  @@
    2  expression E;
    3  @@
    4  - a();
    5  ...
    6  b(E);

Listing 46. Match

    void fn1(int c) {
        a();
        if (c > 0) {
            b(1);
            return;
        }
        b(2);
    }

Listing 47. No match

    void fn2(int c) {
        a();
        if (c > 0) {
            return;
        }
        b(2);
    }
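The "all computation paths" requirement can be sketched in Java as a depth-first check over a toy control flow graph. Everything in this sketch is an assumption made for illustration: the class PathCheck, the string node labels and the hand-built graphs for fn1 and fn2. Treating a cycle that never reaches a match as a failure is likewise a simplification, and the code is not the model checking algorithm used by Coccinelle or SPOON-SMPL.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy control flow graph with string labels as nodes (illustrative only).
class PathCheck {
    private final Map<String, List<String>> successors = new HashMap<>();

    PathCheck edge(String from, String to) {
        successors.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
        return this;
    }

    // True if every path leaving `start` reaches a node whose label starts with `prefix`.
    boolean allPathsReach(String start, String prefix) {
        List<String> next = successors.getOrDefault(start, new ArrayList<>());
        if (next.isEmpty()) {
            return false;
        }
        for (String n : next) {
            if (!reaches(n, prefix, new HashSet<>())) {
                return false;
            }
        }
        return true;
    }

    private boolean reaches(String node, String prefix, Set<String> onPath) {
        if (node.startsWith(prefix)) {
            return true;               // this path found a match
        }
        if (!onPath.add(node)) {
            return false;              // looped back without finding a match
        }
        List<String> next = successors.getOrDefault(node, new ArrayList<>());
        boolean ok = !next.isEmpty();  // a path that ends without a match fails
        for (String n : next) {
            if (!reaches(n, prefix, onPath)) {
                ok = false;
                break;
            }
        }
        onPath.remove(node);
        return ok;
    }

    // Hand-built graph for fn1 of Listing 46.
    static PathCheck fn1() {
        return new PathCheck().edge("a()", "if").edge("if", "b(1)").edge("b(1)", "return")
                .edge("return", "exit").edge("if", "b(2)").edge("b(2)", "exit");
    }

    // Hand-built graph for fn2 of Listing 47.
    static PathCheck fn2() {
        return new PathCheck().edge("a()", "if").edge("if", "return")
                .edge("return", "exit").edge("if", "b(2)").edge("b(2)", "exit");
    }

    public static void main(String[] args) {
        System.out.println("fn1 matches: " + fn1().allPathsReach("a()", "b("));
        System.out.println("fn2 matches: " + fn2().allPathsReach("a()", "b("));
    }
}
```

In fn2 the branch through return reaches the exit node without passing a call to b, so the check fails for the same reason the patch does not match Listing 47.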

Micro-level features

Rules

Feature                                                   Src       Gen
Multiple rules/regions in single semantic patch.          [1]       T
    A single SmPL file may contain multiple rules/regions with each region starting with a header of the form @@ ... @@.
Sharing of metavariable bindings between rules.           [35]      T
    A dependent rule can access metavariables bound by a prerequisite rule.
Print rule dependency information.                        [36]      T
Print textual or graphical representation of rule.        [36]      T
Rule dependency modifiers.                                [35]      T
    Such as ever, which specifies that the prerequisite rule r need only to have matched in some context, and never, which specifies that r must never have matched in any context.
Virtual rules.                                            [35]      T
    Named rules for which the match success is defined by the command line, allowing control of dependent rules.
Per-rule constraint on file or pathname.                  [35]      T
    Specifying which files or which paths the rule should or should not be applied to.
Per-rule control over isomorphisms.                       [35]      T
    Ability to specify which isomorphisms (or sets thereof) should be enabled.

Table 6.2: Generalizability of micro-level features on the topic of rules. T denotes an affirmation that is trivial due to the similarities between C and Java.

Examples of features related to rules

Multiple rules and sharing of bindings

Listings 48, 49 and 50 show how a single patch may contain multiple rules, and how dependent rules are able to access metavariable bindings produced by prerequisite rules. Assuming that the function or method a() is defined externally, both Listings 49 and 50 are valid fragments of C or Java with essentially identical semantics, so the generalization is trivial.

Listing 48. Patch

    @ rule1 @
    identifier fn;
    @@
    void fn() {
        a();
    }

    @ rule2 depends on rule1 @
    identifier rule1.fn;
    @@
    - fn();
    + a();

Listing 49. Input

    void foo() {
        a();
    }

    void bar() {
        a();
    }

    void run() {
        foo();
        bar();
    }

Listing 50. Output

    void foo() {
        a();
    }

    void bar() {
        a();
    }

    void run() {
        a();
        a();
    }

Virtual rules

Listing 51 shows a semantic patch that defines a virtual rule foo, and a rule rule1 that depends on foo. Listings 52-53 show the result of applying the patch without specifying any -D command line arguments, resulting in no transformations. Listings 54-55 show the result of applying the patch when specifying -D foo on the command line. This command line results in the rule foo being considered to have matched, allowing rule1 to match and transform the code. The functionality of virtual rules is not related to the syntax or semantics of the target language, so the generalization is trivial.

Listing 51. Patch

    virtual foo
    @ rule1 depends on foo @
    @@
    - a();

Listing 52. Input

    void fn() {
        a();
    }

Listing 53. Output

    void fn() {
        a();
    }

Listings 52-53: Patch applied without any -D arguments.

Listing 54. Input

    void fn() {
        a();
    }

Listing 55. Output

    void fn() {
    }

Listings 54-55: Patch applied with -D foo.
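The gating effect of a virtual rule can be sketched in a few lines of Java. The class VirtualRules, its methods and the simplified handling of -D arguments are hypothetical and serve only to illustrate the dependency check described above; they do not mirror the command line parsing of Coccinelle.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration of how a virtual rule gates a dependent rule.
class VirtualRules {
    // Collects the names following each "-D" argument (simplified parsing).
    static Set<String> definedNames(String... args) {
        Set<String> names = new HashSet<>();
        for (int i = 0; i + 1 < args.length; i++) {
            if (args[i].equals("-D")) {
                names.add(args[i + 1]);
            }
        }
        return names;
    }

    // A rule depending on a virtual rule may run only if that virtual rule
    // was named on the command line; a rule without a dependency may always run.
    static boolean mayRun(String dependsOn, Set<String> defined) {
        return dependsOn == null || defined.contains(dependsOn);
    }

    public static void main(String[] args) {
        System.out.println(mayRun("foo", definedNames()));            // no -D arguments
        System.out.println(mayRun("foo", definedNames("-D", "foo"))); // with -D foo
    }
}
```

Because the check involves only rule names and command line arguments, nothing in it depends on the target language, which is consistent with the feature being trivially generalizable.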

Matching and transformation

Feature                                                   Src       Gen
Adding comments.                                          [35]      T
    Comments on addition lines are added to transformed output.
Adding preprocessor directives.                           [35]      No¹
    Coccinelle can add, but not match against, preprocessor directives.
Match and transform #include directives.                  [35]      Yes²
Match and transform function signatures.                  [1, 35]   T
Match and transform forward declarations.                 [35]      No³
Match and transform struct definitions.                   [35]      Yes⁴
Path and sequence operator ...                            [35]      T
Path operator <... P ...>                                 [35]      T
Path operator <+... P ...+>                               [35]      T
Path operator constraints.                                [35]      T
    Such as exists, forall, any, strict, program element (in)equality and boolean expression (in)equality.
Semantic match annotation.                                [35]      T
    An asterisk in the first column of a line indicates a semantic match, for which Coccinelle produces an output highlighting the matched element without transforming it.
Optional match annotation.                                [35]      T
    A question mark in the first column of a line indicates that the pattern on the line may be matched once or not at all, making it an optional match.

Table 6.3: Generalizability of micro-level features on the topic of matching and transformation. T denotes an affirmation that is trivial due to the similarities between C and Java.

¹ There is no standard preprocessor for Java.
² The corresponding Java element is imports.
³ There are no forward declarations in Java.
⁴ The corresponding Java element is class definitions.

Examples related to matching and transformation

Adding preprocessor directives

Listings 56, 57 and 58 show the ability of Coccinelle to add C preprocessor directives. As there is neither a standard preprocessor nor any corresponding or closely related feature in Java, we find this feature not to generalize for Java.

Listing 56. Patch

    @@
    @@
    void foo() {
    + #ifdef __unix__
      a();
    + #endif
    }

Listing 57. Input

    void foo() {
        a();
    }

Listing 58. Output

    void foo() {
    #ifdef __unix__
        a();
    #endif
    }

Matching and transforming #include directives

Listings 59, 60 and 61 show the existing ability of Coccinelle to match and transform C #include directives. In Java, import statements serve a purpose similar to C’s #include. Listings 62, 63 and 64 show a suggestion of what the corresponding feature could look like in a Java context. Due to the similarity between #include directives and import statements and the fact that manipulating import statements is a useful task for Java, we find the feature to generalize for Java.

Listing 59. Patch

    @@
    @@
    - #include "SDL.h"
    + #include "SDL2.h"

Listing 60. Input

    #include "SDL.h"

Listing 61. Output

    #include "SDL2.h"

Listing 62. Patch

    @@
    @@
    - import com.foo;
    + import com.bar;

Listing 63. Input

    import com.foo;

Listing 64. Output

    import com.bar;

Parsing

Feature                                                   Src       Gen
Comments in SmPL code.                                    [35]      T
    Comments appearing on non-addition lines are ignored.
Have parser validate a given semantic patch.              [36]      T
Have parser validate target code.                         [36]      T
Print parse tree with extra type annotations.             [36]      T
Caching of parse trees.                                   [36]      T
Print lexer token stream.                                 [36]      T
Function to parse and unparse target code.                [36]      T
Optionally adapt to C++ and/or IBM C.                     [36]      No⁵
Control parser verbosity.                                 [36]      T
Option to set bit width of integer types.                 [36]      No⁶
Print textual or graphical view of control flow.          [36]      T
Use external tool to identify patch candidates.           [36]      T
    Coccinelle supports the use of Glimpse and idutils in addition to its internal coccigrep function for quickly identifying candidate files.
Options for macros and preprocessor directives.           [36]      No⁷

Table 6.4: Generalizability of micro-level features on the topic of parsing. T denotes an affirmation that is trivial due to the similarities between C and Java.

⁵ Not applicable to Java.
⁶ Integer types have standardized widths in Java.
⁷ There are no macros and no standard preprocessor in Java.

Examples of features related to parsing

Single-line comments

Listings 65, 66 and 67 show how Coccinelle handles single-line comments. Coccinelle does not support matching against comments [35], so comments on context lines and deletion lines are ignored. Such comments therefore act as comments in the context of the semantic patch itself. In contrast, Coccinelle does support adding comments to matched code. In the patch of Listing 65, the top comment is taken to be a comment about the patch. The bottom comment is treated as an actual C comment that should be added to code matching the rule. The distinction is made based on whether the comment appears on an addition line or not [35]. The syntax and semantics of comments are nearly identical between C and Java, so the generalization is trivial.

Listing 65. Patch

    @@
    @@
    // SmPL comment
    + // C comment
    foo();

Listing 66. Input

    void fn() {
        foo();
    }

Listing 67. Output

    void fn() {
        // C comment
        foo();
    }

Support for IBM C

Listings 68, 69 and 70 show the ability of Coccinelle to enable support for the decimal(n,p) syntax extension to ANSI C created by IBM⁸. Coccinelle will only process functions using the extended syntax if the --ibm command line flag is specified. Without the flag, functions using the syntax will be skipped regardless of whether they contain code matching the patch being applied.

Listing 68. Patch

    @@
    expression E;
    @@
    - foo(E);

Listing 69. Input

    void fn() {
        decimal(10,2) x;
        foo(x);
    }

Listing 70. Output

    void fn() {
        decimal(10,2) x;
    }

Listings 69-70: Patch application requires --ibm command line flag.

⁸ https://www.ibm.com/support/knowledgecenter/SSLTBW_2.2.0/com.ibm.zos.v2r2.cbcpx01/decchap.htm

Patch semantics and patch application

Feature                                                            Src    Gen
Optionally restrict transformation to matched successors.          [36]   T
    When enabled, restrict transformation to elements that are only reachable from CFG predecessors matched by some subpattern of the patch.
Optional support for loops.                                        [36]   T
    When enabled and target code contains a loop, Coccinelle will swap the AU logical connective for the AW logical connective to avoid infinite recursion.
Control semantics of nested expressions.                           [36]   T
    By default, an expression metavariable E in the pattern f(E) is able to match both x and f(x) in the target code f(f(x)). Coccinelle offers the option to restrict this to the outermost match, meaning only f(x) would be matched in the example.
Optional strict semantics of disjunctions.                         [36]   T
    When enabled, Coccinelle will have later clauses of a disjunction verify that no earlier clauses have matched.
Optional removal of comments.                                      [36]   T
    When enabled, Coccinelle will remove comments that are “attached” to removed statements.
Relaxed constraints for error-handling paths.                      [36]   T
    Coccinelle relaxes the constraints of path operators for computation paths deemed to be error-handling code.
Command line semantic match.                                       [36]   T
    Coccinelle offers shorthand syntax for quickly testing patterns against target code.
Options for whitespace style of added code.                        [36]   T
Make backup copies of modified files.                              [36]   T
Optionally modify target code in-place.                            [36]   T
Optional timeout for processing a single file.                     [36]   T
Options for multiprocessing and profiling.                         [36]   T
Print names of patched program elements.                           [36]   T
Print info on each transformation.                                 [36]   T
Print info on partial matches.                                     [36]   T
Print textual or graphical trace of CTL process.                   [36]   T
Print environment bindings.                                        [36]   T
Limit number of steps in CTL process.                              [36]   T
Control diff output.                                               [36]   T

Table 6.5: Generalizability of micro-level features on the topic of patch semantics and patch application. T denotes an affirmation that is trivial due to the similarities between C and Java.

Examples of features related to patches

Relaxed constraints for error paths

Listings 71, 72 and 73 show how Coccinelle relaxes the constraints on the computation path wildcard operator for computation paths deemed to be error handling code. By default, error handling code is defined to be a conditional with only a then branch that ends in goto, break, continue or return [36]. The semantics of the wildcard operator in its default forall mode require all paths to reach a match for the patch pattern following the operator. In this example, the pattern following the operator is the statement b(). Coccinelle relaxes the semantics to exclude error paths. This allows the transformation shown in the example to be applied, despite one of the paths between a() and the end of the function never reaching a statement b().

There are several ways to alter the behavior regarding error paths. The wildcard operator can be set to run in strict mode by either changing the default mode for an entire rule or by providing the when strict constraint modifier on an individual operator instance. In strict mode, the error path relaxation is completely disabled. Additionally, there exists a command line flag⁹ that changes the definition of error handling code to only include else-less branches that end in return.

While this scheme could be applied as-is to a Java context, it would perhaps be more appropriate to change the definition of error paths to be else-less branches that end in a throw statement. In either case, we find the similarities between C and Java to make the generalization trivial.

Listing 71. Patch

    @@
    @@
    a();
    ...
    - b();

Listing 72. Input

    int foo(int x) {
        a();
        if (x < 0) {
            return 0;
        }

        b();
        return x + 1;
    }

Listing 73. Output

    int foo(int x) {
        a();
        if (x < 0) {
            return 0;
        }

        return x + 1;
    }

⁹ --only-return-is-error-exit
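The three definitions of an error-handling branch discussed above differ only in which final statements count as error exits, which the following Java sketch makes explicit. The class ErrorPathRule, the Exit enumeration and the three named sets are hypothetical; in particular, the throw-based set is only the possible Java adaptation suggested in the text, not an existing feature.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical classifier for "error-handling" branches (illustrative only).
class ErrorPathRule {
    // Kind of the last statement of the then branch.
    enum Exit { GOTO, BREAK, CONTINUE, RETURN, THROW, OTHER }

    // Default definition: the then branch ends in goto, break, continue or return.
    static final Set<Exit> DEFAULT_EXITS =
            EnumSet.of(Exit.GOTO, Exit.BREAK, Exit.CONTINUE, Exit.RETURN);

    // Definition selected by the --only-return-is-error-exit flag.
    static final Set<Exit> RETURN_ONLY = EnumSet.of(Exit.RETURN);

    // Possible Java-oriented definition: the then branch ends in a throw statement.
    static final Set<Exit> THROW_ONLY = EnumSet.of(Exit.THROW);

    // A conditional counts as error handling if it has no else branch and its
    // then branch ends in one of the given error exits.
    static boolean isErrorBranch(boolean hasElse, Exit lastStatement, Set<Exit> errorExits) {
        return !hasElse && errorExits.contains(lastStatement);
    }

    public static void main(String[] args) {
        // The conditional of Listing 72: no else branch, ends in return.
        System.out.println(isErrorBranch(false, Exit.RETURN, DEFAULT_EXITS));
        System.out.println(isErrorBranch(false, Exit.RETURN, THROW_ONLY));
    }
}
```

Passing the set of exits as a parameter keeps the else-less requirement in one place while allowing the definition to be swapped, which is roughly what the command line flag does for Coccinelle.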

Generalizable metavariable features (part 1 of 2)

Feature                                                   Src       Gen
Metavariable type metavariable.                           [35]      T
    Actual type inferred by the parser.
Metavariable type fresh identifier.                       [35]      T
    Generates an unused variable name.
Metavariable type identifier.                             [35]      T
    Binds to C identifiers such as variable names and function names.
Metavariable type parameter.                              [35]      T
    Binds to names of formal parameters of C functions.
Metavariable type type.                                   [35]      T
    Binds to C type names such as int.
Metavariable type typedef.                                [35]      Yes¹⁰
    Indicates that the given metavariable name should be treated as a literal type name (C type alias).
Metavariable of specific program type.                    [35]      T
    Binds to C expressions of the given type, such as int.
Metavariable type statement.                              [35]      T
    Binds to arbitrary full C statements.
Metavariable type declaration.                            [35]      T
    Binds to C declarations of variables, structs, struct fields and functions.
Metavariable type field.                                  [35]      Yes¹¹
    Binds to C declarations of struct fields.
Metavariable type initializer.                            [35]      Yes¹²
    Binds to C struct initializer expressions.
Metavariable type attribute name.                         [35]      Yes¹³
    Allows matching against Linux kernel attribute macros.
Metavariable type local idexpression.                     [35]      Yes¹⁴
    Binds to local C identifiers such as local variable names.
Metavariable type global idexpression.                    [35]      Yes¹⁵
    Binds to global C identifiers such as global variable names.

Table 6.6: Generalizable metavariable features (part 1 of 2). T denotes an affirmation that is trivial due to the similarities between C and Java.

¹⁰ A name other than typedef is appropriate for Java.
¹¹ The corresponding Java element is class field declarations.
¹² The corresponding Java element is constructor calls.
¹³ The corresponding Java element is annotations.
¹⁴ Restricting identifiers to local variables (as opposed to e.g. class fields) is useful for Java.
¹⁵ Restricting identifiers to non-local variables (e.g. class fields) is useful for Java.

Generalizable metavariable features (part 2 of 2)

Feature                                                   Src       Gen
Metavariable type expression.                             [35]      T
    Binds to arbitrary C expressions.
Metavariable type constant.                               [35]      T
    Binds to arbitrary C constants.
Metavariable type position.                               [35]      T
    Binds to the source position of matched or attached element.
Metavariable type symbol.                                 [35]      T
    Indicates that the given metavariable name should be treated as a literal identifier.
Metavariable type format.                                 [35]      Yes¹⁶
    Binds to the in-string format specifier part of a printf format string, such as .2f in the string "%.2f".
Metavariable type assignment operator.                    [35]      T
    Binds to any C assignment operator.
Metavariable type binary operator.                        [35]      T
    Binds to any C binary operator.
Metavariable list[expr] specifier.                        [35]      T
    Specifying that the metavariable should bind to a sequence of the appropriate C element type rather than a single element, optionally given an exact sequence length to match. Applicable to identifier, parameter, statement, field, initializer, expression and format.
Metavariable constraint on bound names.                   [35]      T
    Specifying constraints on the literal name of a bound named C element such as an identifier, using a regular expression or an (in)equality expression.
Virtual metavariables.                                    [35]      T
    Metavariables given bound values on the command line.
Metavariable attachment syntax.                           [35]      T
    Syntax extension of the form expression@x, where expression is an arbitrary C expression and x is a metavariable name, binding the expression to the given metavariable when possible. For example, int delta@d1 with d1 declared as a declaration metavariable.

Table 6.7: Generalizable metavariable features (part 2 of 2). T denotes an affirmation that is trivial due to the similarities between C and Java.

¹⁶ Standard Java provides a printf API in java.io.PrintStream.

Examples related to generalizable metavariable types

Metavariable type attribute name

Listings 74, 75 and 76 show the ability of Coccinelle to match against additional strings present in variable declarations. The syntax matches the use of macros representing __attribute__ specifications¹⁷ in the Linux kernel, such as¹⁸:

    #define __ro_after_init __attribute__((__section__(".data..ro_after_init")))

An example of kernel code using the above macro is¹⁹:

    static int pcpu_nr_groups __ro_after_init;

In Listing 74, the attribute name metavariable declaration is required to be present for Coccinelle to successfully parse the patch. Omitting the declaration results in a parse error for the patch body. With the declaration present, Coccinelle will match the metavariable as it appears in the patch body (in syntactically legal locations) against the literal string matching its name. This behavior is different from the typical behavior of metavariables, where most types are used to bind arbitrary elements of a certain kind. Listings 75 and 76 show how a variable declaration annotated with the macro is transformed, while an unannotated variable declaration is not.

Listing 74. Patch

    @ rule1 @
    identifier v;
    attribute name __ro_after_init;
    @@
    - int v __ro_after_init = 42;
    + int v __ro_after_init = 0;

Listing 75. Input

    #define __ro_after_init __attribute__((...
    int x __ro_after_init = 42;
    int y = 42;

Listing 76. Output

    #define __ro_after_init __attribute__((...
    int x __ro_after_init = 0;
    int y = 42;

In Java, annotations such as @Deprecated and @SuppressWarnings can serve a purpose similar to the __attribute__ feature of GNU C. An SmPL for Java could implement a feature corresponding to the attribute name metavariable type by using a metavariable declaration syntax such as annotation @Deprecated, thereby allowing the metavariable @Deprecated to match literal occurrences of the string @Deprecated in program code. As such, we find the feature to generalize for Java, although it does not strike us as a particularly good approach for the general support of Java annotations.

¹⁷ A feature of GNU C / GCC, not part of ISO C.
¹⁸ linux-5.9.3, include/linux/cache.h
¹⁹ linux-5.9.3, mm/percpu.c

Non-generalizable metavariable features

Feature                                                   Src       Gen
Metavariable pointer specifier.                           [35]      No²⁰
    Specifying that the metavariable should bind to a pointer (or chain of pointers) to the appropriate base C type. Applicable to expression, idexpression, constant and metavariables binding a specific program type.
Metavariable type declarer.                               [35]      No
    Specific to macros used in the Linux kernel.
Metavariable type iterator.                               [35]      No
    Specific to macros used in the Linux kernel.

Table 6.8: Non-generalizable metavariable features. T denotes an affirmation that is trivial due to the similarities between C and Java.

Examples of non-generalizable metavariable features

Metavariable pointer specifier

Listings 77, 78 and 79 show the straightforward ability of Coccinelle to specify that a type-constrained identifier metavariable should bind to a pointer of the type rather than directly to a value of the type. As there is no feature in Java corresponding to pointer variables, we find this feature not to generalize for Java.

Listing 77. Patch

    @@
    point_t* pt;
    @@
    - printf("%d, %d\n", pt->x, pt->y);
    + print_point(pt);

Listing 78. Input

    point_t* pt = malloc(sizeof(point_t));
    printf("%d, %d\n", pt->x, pt->y);

Listing 79. Output

    point_t* pt = malloc(sizeof(point_t));
    print_point(pt);

²⁰ Java has no corresponding pointer syntax or semantics.

6.1.2 RQ1: Generalizable features

Conclusion

We find 81 out of 89 (91%) features cataloged in Section 6.1.1 to be generalizable to a Java context. Notably, we find all macro-level features to be generalizable. Furthermore, we find the majority of our identified generalizations to be trivial (72 out of 81, 89%) due to the strong similarities between the C and Java programming languages. For example, both languages contain the syntactic elements “identifier” and “constant” under essentially equivalent semantics. As such, the features of Coccinelle that target these specific syntactic elements permit a mapping from C elements to equivalent Java elements. Another example is the Coccinelle option for imposing a time limit on the processing of a single source file. This feature is not related to the syntax and semantics of C, but rather to the idiomatic practice of splitting programs into multiple source files. As this practice is also idiomatic in the use of Java, we find the feature to be trivially generalizable.

The remaining 9 features that allow non-trivial generalizations are found to permit a mapping from a C element to a comparable but not strictly equivalent Java element. Examples include mappings from C #include statements to Java import statements, and from C __attribute__ specifications to Java @annotations. Table 6.9 summarizes the counts.

                 Features    Macro       Micro
Total            89          9           80
Generalizable    81 (91%)    9 (100%)    72 (90%)

Table 6.9: Summary of feature generalizability.
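The percentages reported for RQ1 and RQ2 follow directly from the feature counts, as the following short Java sketch shows. The class FeatureCounts and its helper method are hypothetical; the counts are taken from the text and from Table 6.9.

```java
// Recomputes the percentages of Sections 6.1.2 and 6.1.3 from the feature counts.
class FeatureCounts {
    // Percentage of `part` in `whole`, rounded to the nearest integer.
    static long percent(int part, int whole) {
        return Math.round(100.0 * part / whole);
    }

    public static void main(String[] args) {
        int total = 89;
        int macro = 9;
        int micro = total - macro;                       // micro-level features
        int generalizable = 81;
        int generalizableMicro = generalizable - macro;  // all macro-level features generalize
        int trivial = 72;                                // trivially generalizable features

        System.out.println("generalizable, all:   " + percent(generalizable, total) + "%");
        System.out.println("generalizable, macro: " + percent(macro, macro) + "%");
        System.out.println("generalizable, micro: " + percent(generalizableMicro, micro) + "%");
        System.out.println("trivial among generalizable: " + percent(trivial, generalizable) + "%");
        System.out.println("not generalizable:    " + percent(total - generalizable, total) + "%");
    }
}
```

Note that the two occurrences of 72 have different meanings: 72 of the 81 generalizable features are trivially generalizable, and 72 of the 80 micro-level features are generalizable.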

6.1.3 RQ2: Non-generalizable features

Conclusion

We find 8 out of 89 (9%) features cataloged in Section 6.1.1 not to be generalizable to a Java context. Furthermore, we find that the features that are not generalizable either target some aspect of the C programming language that has no corresponding notion in Java, such as macros or pointers, or target some programming idiom²¹ specific to the C source code of the Linux kernel. Table 6.9 summarizes the counts.

²¹ A common method or pattern of accomplishing a certain programming task.

6.2 Experimental results

This section provides the results of the central experiments of the thesis, used to answer research questions RQ3 and RQ4. Section 6.2.2 provides the results of the performance comparison between SPOON-SMPL and Coccinelle4J over the set of semantic patches and associated projects used in the Coccinelle4J API migration case study, and Section 6.2.3 provides the results of the directly related experiment comparing the running time of SPOON-SMPL to the build time of each of the projects targeted in the aforementioned case study. Table 6.10 lists notation and abbreviations used throughout this chapter.

Notation      Meaning
C4J           Coccinelle4J
SP            SPOON-SMPL
SPfull        SPOON-SMPL with full output
SPmin         SPOON-SMPL with minimal output
NCLOC         Non-Comment Lines Of Code
WCLK          Wall clock (seconds)
CPU           CPU time (seconds)
INT           Internal profiling timer (seconds)
PROC          Internal “SmPL processing” profiling timer (seconds), only applicable to SPOON-SMPL.
CORR          Correctness. The notation A/B means A correctly transformed executables, B incorrectly transformed executables.
(s = 1.23)    Standard deviation (seconds)

Table 6.10: Notation and abbreviations used in result tables.

Coccinelle4J
    Source repository    https://github.com/kanghj/coccinelle.git
    Source revision      37b1025ea7378ce47cde44807e85fd77663f5f81
    OCaml compiler       4.05.0 (Debian -4.05.0-11-amd64)

SPOON-SMPL
    Source repository    https://github.com/mkforsb/spoon.git
    Source revision      0a583a843459829971b4d816bace18ca5eb04a32
    JDK/JRE version      OpenJDK 11.0.8+10 x64 HotSpot JVM (AdoptOpenJDK)

Project builds
    JDK version          OpenJDK 8 (8u265b01) x64 HotSpot JVM (AdoptOpenJDK)

Table 6.11: Software versions.

6.2.1 Hardware and software

Hardware

We ran all experiments on a Dell E7470 laptop with an Intel Core i5 6300U CPU (2 cores, 4 threads, 2.4GHz-3GHz), 8GB DDR4 SDRAM and a Micron 1100 256GB SSD.

Software

We ran all experiments in a Linux KVM virtual machine guest running Debian Bullseye with kernel x86_64 5.7.0, with unlimited access (with regard to cores and CPU usage) to the host CPU and access to 4GB of memory. The host system ran Debian Bullseye with kernel x86_64 5.8.0. Table 6.11 provides further details on the versions of certain software packages.

6.2.2 RQ3: Patch application performance

This section presents the results of our comparison between SPOON-SMPL and Coccinelle4J in patch application over the set of six pairs of patches and associated target project source trees used in the Coccinelle4J API migration case study. Table 6.12 presents the six patch-project pairs and lists the results of our correctness measurements. For each project we provide the specific source code revision we used, as well as the size of the source tree in terms of the number of .java source files and the number of lines of code (NCLOC) contained within.

Project                 GitHub Repo.                   Files    NCLOC    C4J CORR    SP CORR
  Semantic patch          Revision
Nextcloud app           nextcloud/android              323      54155    10/0        12/0
  sticky_broadcasts       8e835271d5d2264d
Lucid Browser           powerpoint45/Lucid-Browser     36       6780     1/0         1/0
  set_text_size           04f3f584fb03096d
Kickstarter app         kickstarter/android-oss        282      15450    8/0         8/0
  get_color               e0720885d860d56f
Signal messenger        signalapp/Signal-Android       724      83827    1/0         1/0
  should_vibrate          a498176043dd2114
Glide library           bumptech/glide/                397      31674    1/0         1/0
  get_height              827fc08222eb6159
MGit                    maks/MGit                      104      9203     2/0         2/0
  on_console_message      b1c6531bab6c10c9

Table 6.12: Summary of target source trees and correctness. “Files” is number of .java source files.

Semantic patch       C4J (N = 10)      SPfull (N = 10)     SPmin (N = 10)
sticky_broadcasts    2.26 (s = 0.02)   25.17 (s = 1.05)    16.84 (s = 0.90)
set_text_size        0.89 (s = 0.01)   19.82 (s = 0.44)    18.65 (s = 0.33)
get_color            0.12 (s = 0.00)    9.81 (s = 0.67)     7.38 (s = 0.25)
should_vibrate       0.08 (s = 0.00)   75.17 (s = 0.80)    18.52 (s = 0.79)
get_height           0.09 (s = 0.00)   16.49 (s = 0.79)    10.66 (s = 0.47)
on_console_message   0.13 (s = 0.00)   13.07 (s = 0.48)    10.87 (s = 0.20)

Table 6.13: Patch application WCLK running times (seconds).

Table 6.13 lists the wall clock running times of Coccinelle4J and the two output modes of SPOON-SMPL in applying each of the six semantic patches to its associated project. Each listed running time is the mean number of seconds over N runs, with the value of N shown in the column header. The standard deviation for each running time is shown in parentheses. The results indicate that SPOON-SMPL runs slower than Coccinelle4J, and that the minimal output mode of SPOON-SMPL runs faster than the full output mode, particularly on larger projects.

Semantic patch       C4J (N = 10)      SPfull (N = 10)     SPmin (N = 10)
sticky_broadcasts    2.24 (s = 0.02)   25.01 (s = 1.06)    16.68 (s = 0.89)
set_text_size        0.88 (s = 0.01)   19.65 (s = 0.43)    18.49 (s = 0.33)
get_color            0.11 (s = 0.00)    9.59 (s = 0.65)     7.21 (s = 0.24)
should_vibrate       0.07 (s = 0.00)   75.00 (s = 0.81)    18.36 (s = 0.79)
get_height           0.08 (s = 0.00)   16.32 (s = 0.79)    10.49 (s = 0.47)
on_console_message   0.12 (s = 0.00)   12.91 (s = 0.48)    10.72 (s = 0.21)

Table 6.14: Patch application INT running times (seconds).

Table 6.14 lists the running times of Coccinelle4J and the two output modes of SPOON-SMPL in applying each of the six semantic patches to its associated project as measured by an internal timing mechanism. The data is presented in the same way as described for Table 6.13. The results indicate that the internal timing mechanisms of both Coccinelle4J and SPOON-SMPL are sound, as the measurements are similar to the external wall clock measurements listed in Table 6.13. Furthermore, the Java virtual machine appears to add about 0.17 seconds of startup overhead for SPOON-SMPL, as indicated by the difference between Tables 6.13 and 6.14 for the running times of SPOON-SMPL.
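The overhead estimate can be reproduced from the tabulated means. The following is a minimal Java sketch, not part of SPOON-SMPL; the class and method names are chosen here for illustration. It subtracts the INT means of Table 6.14 from the WCLK means of Table 6.13 and averages the differences for each output mode.

```java
// Illustrative only: recomputes the WCLK - INT gap from Tables 6.13 and 6.14.
final class StartupOverhead {
    // Mean running times in seconds, in the order: sticky_broadcasts,
    // set_text_size, get_color, should_vibrate, get_height, on_console_message.
    static final double[] WCLK_FULL = {25.17, 19.82, 9.81, 75.17, 16.49, 13.07};
    static final double[] INT_FULL  = {25.01, 19.65, 9.59, 75.00, 16.32, 12.91};
    static final double[] WCLK_MIN  = {16.84, 18.65, 7.38, 18.52, 10.66, 10.87};
    static final double[] INT_MIN   = {16.68, 18.49, 7.21, 18.36, 10.49, 10.72};

    /** Mean of the element-wise differences a[i] - b[i]. */
    static double meanDifference(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] - b[i];
        }
        return sum / a.length;
    }

    public static void main(String[] args) {
        double full = meanDifference(WCLK_FULL, INT_FULL);
        double min = meanDifference(WCLK_MIN, INT_MIN);
        // Average over both output modes, each covering six patches.
        System.out.printf("full: %.3f s, minimal: %.3f s, overall: %.3f s%n",
                full, min, (full + min) / 2);
    }
}
```

Since the inputs are means rounded to two decimals, the result is only accurate to roughly a hundredth of a second.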

Semantic patch       SPfull (N = 10)     SPmin (N = 10)
sticky_broadcasts     3.74 (s = 0.26)     4.36 (s = 0.40)
set_text_size        13.95 (s = 0.34)    14.01 (s = 0.26)
get_color             0.49 (s = 0.04)     0.51 (s = 0.05)
should_vibrate        0.54 (s = 0.07)     0.50 (s = 0.03)
get_height            0.55 (s = 0.03)     0.57 (s = 0.06)
on_console_message    6.06 (s = 0.38)     5.83 (s = 0.20)

Table 6.15: Patch application PROC running times (seconds).

Table 6.15 lists the time spent on tasks directly related to SmPL processing for the two output modes of SPOON-SMPL in applying each of the six semantic patches to its associated project. The data is presented in the same way as described for Table 6.13. The results indicate that much of the running time of SPOON-SMPL is spent on tasks that are not directly related to SmPL processing, as seen in the difference between these measurements and the measurements for SPOON-SMPL listed in Table 6.14.

Semantic patch       C4J (N = 10)      SPfull (N = 10)      SPmin (N = 10)
sticky_broadcasts    2.26 (s = 0.02)    70.25 (s = 2.31)    53.77 (s = 2.77)
set_text_size        0.89 (s = 0.01)    38.42 (s = 0.69)    35.08 (s = 0.92)
get_color            0.12 (s = 0.00)    31.58 (s = 1.96)    24.03 (s = 0.75)
should_vibrate       0.08 (s = 0.00)   130.54 (s = 3.50)    61.18 (s = 3.12)
get_height           0.09 (s = 0.00)    51.49 (s = 2.87)    35.86 (s = 1.25)
on_console_message   0.13 (s = 0.00)    35.30 (s = 2.22)    28.55 (s = 0.90)

Table 6.16: Patch application CPU times (seconds).

Table 6.16 lists the amount of CPU time consumed by Coccinelle4J and the two output modes of SPOON-SMPL in applying each of the six semantic patches to its associated project. The data is presented in the same way as described for Table 6.13. The results indicate that SPOON-SMPL involves simultaneous multiprocessing while Coccinelle4J does not, as indicated by the difference between these measurements and the wall clock running times listed in Table 6.13. Multiprocessing is not an explicit part of SPOON-SMPL and must therefore be part of the Spoon library or one of its dependencies, or part of generic work done by the JVM. By extension, the results indicate that the energy usage of SPOON-SMPL exceeds that of Coccinelle4J by more than is suggested by the measurements listed in Tables 6.13 and 6.14.
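A rough indication of the degree of parallelism is the ratio of CPU time to wall clock time for each patch. The Java sketch below is illustrative only (the class and method names are our own, and the ratio is a coarse estimate rather than a measurement of thread counts); it computes this ratio from the means in Tables 6.16 and 6.13 for Coccinelle4J and for the full output mode of SPOON-SMPL.

```java
// Illustrative only: CPU time divided by wall clock time (Tables 6.16 and 6.13).
final class CpuToWallClock {
    // Order: sticky_broadcasts, set_text_size, get_color, should_vibrate,
    // get_height, on_console_message. All values are mean seconds.
    static final double[] CPU_FULL  = {70.25, 38.42, 31.58, 130.54, 51.49, 35.30};
    static final double[] WCLK_FULL = {25.17, 19.82, 9.81, 75.17, 16.49, 13.07};
    static final double[] CPU_C4J   = {2.26, 0.89, 0.12, 0.08, 0.09, 0.13};
    static final double[] WCLK_C4J  = {2.26, 0.89, 0.12, 0.08, 0.09, 0.13};

    /** Element-wise ratio cpu[i] / wclk[i]; values above 1 suggest concurrent CPU use. */
    static double[] ratios(double[] cpu, double[] wclk) {
        double[] result = new double[cpu.length];
        for (int i = 0; i < cpu.length; i++) {
            result[i] = cpu[i] / wclk[i];
        }
        return result;
    }

    public static void main(String[] args) {
        double[] sp = ratios(CPU_FULL, WCLK_FULL);
        double[] c4j = ratios(CPU_C4J, WCLK_C4J);
        for (int i = 0; i < sp.length; i++) {
            System.out.printf("patch %d: SPfull %.2f, C4J %.2f%n", i, sp[i], c4j[i]);
        }
    }
}
```

For SPOON-SMPL every ratio lies between 1 and the four hardware threads available to the guest, while for Coccinelle4J the CPU and wall clock means coincide.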

Observations

For 4 out of 6 semantic patches, the running time of SPOON-SMPL is dominated by Spoon internals²² such as parsing and pretty-printing, rather than by tasks directly related to SmPL processing. A graphical depiction of the time spent on SmPL processing, expressed as a percentage of the full running time, is shown in Figure 6.1. In the case of set_text_size the running time is instead dominated by SmPL processing, presumably due to the patched executable block being very large, generating a correspondingly large control flow model over which to model-check the patch formula. In the case of on_console_message the running time is a near equal split between Spoon internals and SmPL processing. Looking at the total accumulated running times of SPOON-SMPL, approximately 84% of the time is spent on Spoon internals in the full output mode, with the corresponding number for the minimal output mode being approximately 69%.

The minimal output mode consistently results in reduced running times. The reduction seems to scale with the size of the target source tree, as would be expected since Spoon is no longer tasked with pretty-printing the full metamodel. The running time reduction is directly visible in Figures 6.2 and 6.3, as well as in Figure 6.1 in the form of a consistently larger percentage of the running time being spent on SmPL processing, caused by a reduction in the time spent on Spoon internals.

In the case of sticky_broadcasts, SPOON-SMPL manages to apply the semantic patch to two additional executables that are missed by Coccinelle4J. In all other cases, the number of correctly transformed executable blocks is identical, and neither implementation produced any incorrect transformations. The reason behind Coccinelle4J missing the aforementioned two instances lies in a failure to parse parameterless lambda expressions, such as () -> expression. Both of the missed instances appear in files containing such lambda expressions, and the parse error causes Coccinelle4J to abort processing for these files.
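The aggregate shares of time spent on Spoon internals quoted above follow from the ratio (INT - PROC) / INT applied to the accumulated running times. The Java sketch below recomputes them from the means in Tables 6.14 and 6.15; the class and method names are illustrative and not part of SPOON-SMPL.

```java
// Illustrative only: share of accumulated INT time spent outside SmPL processing.
final class SpoonInternalsShare {
    // Order: sticky_broadcasts, set_text_size, get_color, should_vibrate,
    // get_height, on_console_message. All values are mean seconds.
    static final double[] INT_FULL  = {25.01, 19.65, 9.59, 75.00, 16.32, 12.91};
    static final double[] PROC_FULL = {3.74, 13.95, 0.49, 0.54, 0.55, 6.06};
    static final double[] INT_MIN   = {16.68, 18.49, 7.21, 18.36, 10.49, 10.72};
    static final double[] PROC_MIN  = {4.36, 14.01, 0.51, 0.50, 0.57, 5.83};

    static double sum(double[] values) {
        double total = 0.0;
        for (double v : values) {
            total += v;
        }
        return total;
    }

    /** (INT - PROC) / INT over the accumulated running times of all patches. */
    static double internalsShare(double[] internal, double[] proc) {
        double totalInt = sum(internal);
        return (totalInt - sum(proc)) / totalInt;
    }

    public static void main(String[] args) {
        System.out.printf("full output:    %.1f%%%n",
                100 * internalsShare(INT_FULL, PROC_FULL));
        System.out.printf("minimal output: %.1f%%%n",
                100 * internalsShare(INT_MIN, PROC_MIN));
    }
}
```

Applying the same ratio to a single row gives the per-patch shares, for example (19.65 - 13.95) / 19.65 for set_text_size in the full output mode.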

Summary and conclusion

Coccinelle4J ran faster than SPOON-SMPL in every instance of comparison, as shown in Tables 6.13, 6.14, 6.15 and 6.16, as well as in Figures 6.2, 6.3 and 6.4. SPOON-SMPL is a new, immature implementation that is likely to be rather unoptimized. In contrast, Coccinelle4J benefits from optimizations developed during several years of work on Coccinelle²³. However, much of the running time of SPOON-SMPL is spent on Spoon internals as shown in Figure 6.1. As such, optimizations to the core SmPL engine of SPOON-SMPL would not be enough to reach the general level of performance offered by Coccinelle4J. Figure 6.4 shows the ratio of the PROC running time of SPOON-SMPL to the wall clock running time of Coccinelle4J, giving an idea of the remaining difference in performance when disregarding the running time spent on Spoon internals.

It is unknown to us how much potential for optimization is present in the Spoon library for the parts involved in the long running times of SPOON-SMPL. However, it seems unrealistic to expect SPOON-SMPL to ever reach or surpass the low running times of Coccinelle4J, in particular when considering the startup and JIT warmup overhead of the JVM.

In conclusion, we find that the running time performance of an SmPL implementation for Java based on Spoon is worse than that of Coccinelle4J.

²² Taken as the ratio (INT - PROC) / INT for SPOON-SMPL.

Figure 6.1: SPOON-SMPL running time (%) spent on SmPL processing. (Bar chart; y-axis: running time spent on SmPL processing (%), 0 to 100; x-axis: semantic patch; series: full output, minimal output.)

²³ https://coccinelle.gitlabpages.inria.fr/website/distrib/changes.html

Figure 6.2: Ratio of SPOON-SMPL WCLK to Coccinelle4J WCLK. (Bar chart; y-axis: ratio of SPOON-SMPL WCLK to Coccinelle4J WCLK, 0 to 1000; x-axis: semantic patch; series: full output, minimal output.)

Figure 6.3: Ratio of SPOON-SMPL WCLK to Coccinelle4J WCLK. Same data as Figure 6.2, zoomed in to provide more details in the range 0x – 250x. (Bar chart; x-axis: semantic patch; series: full output, minimal output.)

Figure 6.4: Ratio of SPOON-SMPL PROC to Coccinelle4J WCLK. (Bar chart; y-axis: ratio of SPOON-SMPL PROC to Coccinelle4J WCLK, 0 to 60; x-axis: semantic patch; series: full output, minimal output.)

6.2.3 RQ4: Project build times

This section compares the patch application running time of SPOON-SMPL with the average build time of each patch application target. We use the same set of six semantic patches with six associated projects as used in the Coccinelle4J API migration case study, and we re-use the wall clock running time measurements for SPOON-SMPL from Section 6.2.2.

Measurements

Table 6.17 lists the average build times of the six projects used as patch application targets in Section 6.2.2 next to the average patch application wall clock running times of SPOON-SMPL in each of its two output modes as listed in Table 6.13.

Project            Build time (N = 3)   SPfull (N = 10)     SPmin (N = 10)
Nextcloud app       66.33 (s = 1.53)    25.17 (s = 1.05)    16.84 (s = 0.90)
Lucid Browser        5.49 (s = 0.61)    19.82 (s = 0.44)    18.65 (s = 0.33)
Kickstarter app    130.91 (s = 1.21)     9.81 (s = 0.67)     7.38 (s = 0.25)
Signal messenger    56.33 (s = 2.31)    75.17 (s = 0.80)    18.52 (s = 0.79)
Glide library       66.75 (s = 1.52)    16.49 (s = 0.79)    10.66 (s = 0.47)
MGit                14.00 (s = 2.00)    13.07 (s = 0.48)    10.87 (s = 0.20)

Table 6.17: Project build times (seconds) compared to SPOON-SMPL WCLK running time (seconds).

In the full output mode, we see that SPOON-SMPL is able to apply the semantic patch in less than the average build time of the associated project in 4 out of 6 cases. The two cases where the running time of SPOON-SMPL exceeds the associated build time are for the Lucid Browser project and the Signal messenger project. The Lucid Browser project is the smallest among the six projects, and also has the lowest average build time. Furthermore, the single patched executable block in the Lucid Browser project is unusually large, demonstrating some poor performance scaling in SPOON-SMPL in model-checking over a large model. The Signal messenger project is the largest of the six projects, and is also the project for which the pretty-printing stage of Spoon consumes the largest slice of the running time, with the time directly spent on SmPL-related processing being as little as 0.7%.

In the minimal output mode, we see that SPOON-SMPL is able to apply the semantic patch in less than the average build time in 5 out of 6 cases, improving on the full output mode with one additional successful case. Applying the semantic patch to the Signal messenger project now runs faster than the average build time, due to Spoon no longer being tasked with pretty-printing the full source tree of the project. The Lucid Browser project is now the single case for which the running time of SPOON-SMPL exceeds the average build time, with the same underlying reasons as described for the full output mode.
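The case counts above can be checked mechanically against Table 6.17. The following Java sketch is illustrative only (the names are our own); it counts the projects for which the mean patch application time is below the mean build time.

```java
// Illustrative only: counts projects patched faster than they build (Table 6.17).
final class BuildTimeComparison {
    // Order: Nextcloud app, Lucid Browser, Kickstarter app, Signal messenger,
    // Glide library, MGit. All values are mean seconds.
    static final double[] BUILD   = {66.33, 5.49, 130.91, 56.33, 66.75, 14.00};
    static final double[] SP_FULL = {25.17, 19.82, 9.81, 75.17, 16.49, 13.07};
    static final double[] SP_MIN  = {16.84, 18.65, 7.38, 18.52, 10.66, 10.87};

    /** Number of indices i with patch[i] strictly below build[i]. */
    static int fasterThanBuild(double[] patch, double[] build) {
        int count = 0;
        for (int i = 0; i < patch.length; i++) {
            if (patch[i] < build[i]) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("full output:    " + fasterThanBuild(SP_FULL, BUILD) + " of " + BUILD.length);
        System.out.println("minimal output: " + fasterThanBuild(SP_MIN, BUILD) + " of " + BUILD.length);
    }
}
```

The comparison uses means only; with a build time standard deviation of 2.00 seconds for MGit, that project is a close call in the full output mode.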

Conclusion

We find that SPOON-SMPL is able to apply the semantic patch in less than the average associated project build time in 4 out of 6 (67%) cases in full output mode, and 5 out of 6 (83%) cases in minimal output mode. Considering these success rates together with the fact that SPOON-SMPL is a new and likely unoptimized implementation, we can conclude that given an arbitrary Java project of non-trivial size, an SmPL implementation for Java based on Spoon will likely be able to offer a running time performance that is acceptable for an individual developer.

Chapter 7

Discussion

This chapter provides additional discussion of the material covered in previous chapters. Section 7.1 discusses some important limitations that should be kept in mind when considering the results of our work. Section 7.2 brings up some potential sources of errors that could impact the validity of our findings. Section 7.3 presents two technical proposals for extensions to the current implementation of SPOON-SMPL, while Section 7.4 provides a number of more general suggestions for future work. Finally, Section 7.5 discusses some potential ethical issues regarding misuse and sustainability.

7.1 Limitations

Scope of implementation and experiments

SPOON-SMPL implements a subset of SmPL for a subset of Java using the Spoon Java transformation engine. Based on the performance of this implementation over a limited set of workloads we make general statements on the performance and viability of an SmPL implementation for Java based on Spoon, implicitly referring to a complete¹ SmPL implementation for the full language of Java. As such, some skepticism is warranted in interpreting our generalized findings. However, we do not expect that an extended implementation of SPOON-SMPL would exhibit significantly different performance in general simply due to having support for more of the features of SmPL and/or more of the elements of Java.

¹ It is furthermore not entirely clear precisely what would constitute a complete SmPL implementation for Java.


Lack of optimization

With the exception of a utility that quickly excludes entire source files from full SmPL processing, there has been very little work dedicated to the optimization of SPOON-SMPL. As such, the performance values measured for SPOON-SMPL are likely to be closer to an upper bound for a reasonably competent implementation rather than a lower bound.

Minimal output mode for SPOON-SMPL

The minimal output mode for SPOON-SMPL results in improved performance, but produces a result that is not fully equivalent to the full output mode. As such, it should not be taken as a pure optimization, but rather as an alternative mode that has drawbacks. While the full output mode produces a full source tree with all transformations applied, the minimal output mode only prints the unified diffs over transformed files. Due to the reformatting applied by the default Spoon pretty-printer, it will generally not be possible to apply these unified diffs as a patch to the original source tree. The minimal output mode thus offers a lower degree of automation, as the transformations described by the unified diffs would essentially need to be applied by hand. We expect this limitation to be lifted once the sniper output mode for Spoon reaches maturity. However, using the sniper mode may in itself render the minimal output mode mostly obsolete from the perspective of optimizing the running time.

7.2 Threats to validity

This section lists a number of points that could potentially threaten the validity of our findings.

Subjectivity of SmPL feature catalog

There was no rigorous definition of feature employed in establishing the feature catalog used to investigate the generalizability of the Coccinelle/SmPL feature set for a Java context. As such, the features included in the catalog as well as their grouping suffer from subjectivity. However, due to the strong similarities between C and Java we do not expect a more rigorously constructed feature catalog to yield significantly different results in terms of the overall generalizability of an SmPL for Java.

Single vendor and version of JVM

It has been shown [38] that different implementations of the Java Virtual Machine (JVM) can exhibit significantly different performance over the same workload. In our experiments, a single vendor and version of JVM was used for all performance measurements of SPOON-SMPL.

JVM heap size

It has been shown [38] that the heap size of a JVM can have a significant impact on the performance over certain workloads. In our experiments, no special attention was given to the heap size. In taking our performance measurements, the JVM was not provided any specific options for the heap size, causing the default setting to be the only heap size tested.

Single machine and operating system

Intuitively, using a single host machine and operating system is not optimal in the production of sound machine-independent statistics. A single machine can be compromised by misconfiguration, or by hardware issues such as poor thermal management or dysfunctional add-in boards producing an excess amount of processor interrupts. Similarly, a single specific instance of an operating system can be misconfigured or otherwise compromised by any number of software problems. In our experiments, a single machine and operating system were used for all measurements.

Virtual machine environment

In an effort to minimize the potential effects of arbitrary system packages on performance measurements, our experiments were executed in a minimal Linux environment running as a guest in a Linux KVM virtual machine. While modern virtualization software paired with properly configured virtualization-aware hardware tends to provide near-native performance, the introduction of a virtualization layer inescapably introduces additional opportunities for software problems that affect performance.

7.3 Extension proposals

This section presents two technical proposals for extensions to the current implementation of SPOON-SMPL that target improved functionality specifically in areas where Coccinelle4J offers only limited support. Section 7.3.1 presents a proposal to improve name resolution in the matching between simply qualified names and fully qualified names. Section 7.3.2 presents a proposal to add support for sub-typing. The two proposals were originally part of work targeting an additional formal research question which was ultimately dropped due to time constraints.

7.3.1 Improving name resolution

From the plain-text perspective of Java source code, the problem with name resolution lies in the matching between fully qualified names (FQNs) and simply (non-fully) qualified names (SQNs). As an example, a Java compiler will in most cases see the names String (an SQN) and java.lang.String (an FQN) as referring to the same class.

The support for matching between FQNs and SQNs in Coccinelle4J relies on the Coccinelle isomorphism system. An example of an isomorphism rule used to assist with name resolution is shown in Listing 32 in Section 5.2.1.

A simple way to improve name resolution in SPOON-SMPL would consist of the requirement that all names are given as FQNs in semantic patches, together with modifying the pattern matching logic for comparing two instances of CtReference Spoon metamodel objects. The current logic compares instances based on the SQNs of the instances. The SQNs are retrieved from the method CtReference::getSimpleName regardless of whether either name can be resolved to a fully qualified form or not. The reason for this is that the parsing of the Java DSL representation of a semantic patch takes place outside the metamodel context of the code targeted for patch application, meaning no user-defined classes in the target code are available to resolve SQNs when building the Java DSL patch metamodel. Changing the logic to instead always compare the FQNs would be a trivial change to make in SPOON-SMPL, but due to the lack of parse-time context would require the use of FQNs in semantic patches. Note that there would be no corresponding requirements on the use of FQNs vs. SQNs in target code. The metamodel for target code is already built using the full target source tree, which means any SQNs will be resolvable to FQNs under the assumption that the target code is a valid program.

A further improvement would be to implement the target-embedded parsing strategy for the Java DSL as presented in Section 7.4.7. The added parse-time context would then remove the requirement of having to use FQNs in semantic patches, while allowing the same name resolution strategy of always comparing the FQNs of references.
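The difference between the two comparison strategies can be illustrated without Spoon. The Java sketch below is a deliberately simplified stand-in that operates on plain strings rather than on CtReference objects; the class and method names are hypothetical and do not exist in SPOON-SMPL. It shows that comparing simple names conflates distinct types that share a simple name, while comparing FQNs does not.

```java
// Simplified stand-in: names are plain strings, not Spoon CtReference objects.
final class NameMatchingSketch {
    /** Simple name of a dotted name, e.g. "List" for "java.util.List". */
    static String simpleName(String name) {
        int dot = name.lastIndexOf('.');
        return dot < 0 ? name : name.substring(dot + 1);
    }

    /** Lenient comparison on simple names only, as in the behavior described above. */
    static boolean matchesBySimpleName(String patchName, String targetName) {
        return simpleName(patchName).equals(simpleName(targetName));
    }

    /** Strict comparison on fully qualified names, as in the proposed behavior. */
    static boolean matchesByQualifiedName(String patchFqn, String targetFqn) {
        return patchFqn.equals(targetFqn);
    }

    public static void main(String[] args) {
        String inPatch = "java.util.List";
        String inTarget = "java.awt.List"; // a different class with the same simple name
        System.out.println("simple-name match:    " + matchesBySimpleName(inPatch, inTarget));
        System.out.println("qualified-name match: " + matchesByQualifiedName(inPatch, inTarget));
    }
}
```

In an actual implementation the qualified names would be obtained from the metamodel rather than from strings, which is where the parse-time context discussed above becomes necessary.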

7.3.2 Improving sub-typing

From the plain-text perspective of Java source code, the general problem with sub-typing lies in the matching between the name of a super type such as Collection and the name of a sub-type such as ArrayList.

Coccinelle4J offers a limited form of support for sub-typing based on a pattern idiomatic in Java, in which a variable or field is declared using an interface super type while being simultaneously initialized to a concrete sub-type that implements the interface:

SomeInterface x = new SomeInterfaceImpl();

For a type-constrained identifier metavariable to be able to bind to the identifier x in the above statement, the original Coccinelle would require it to be declared with type SomeInterface. The added sub-typing support in Coccinelle4J would additionally allow the binding for a metavariable declared as SomeInterfaceImpl, based on the explicit types given in the literal statement as it appears in the source code.

As Coccinelle4J does not compute the inheritance trees of classes, it is not possible to have a type-constrained metavariable declared as a super type bind to a program variable of a sub-type. For example, there is no way for a metavariable declared as Collection to bind to a statement of a local variable declaration of type ArrayList.

To reduce ambiguity, the remainder of this section employs the terms program type to refer to types in the context of the source code of SPOON-SMPL, and represented type to refer to type information contained in elements of a Spoon metamodel as part of its representation of some arbitrary external source code.

The special case of sub-typing supported by Coccinelle4J could be added to SPOON-SMPL by expanding on the apply method of the TypedIdentifierConstraint metavariable constraint class. This method receives objects of type CtElement (Spoon metamodel elements) that are candidate bindings for a type-constrained metavariable, and returns either a CtElement object (typically the input object) that is an accepted binding, or null if the binding is rejected. Instances of TypedIdentifierConstraint will accept bindings that are of program type CtVariableReference containing a represented type (a CtTypeReference) with a simply-qualified name that matches the required type name provided in the construction of the constraint instance.

It would be trivial to add a case for candidates of correct program type that fail the check on the represented type. In such cases, we can inspect the parent statement of the CtVariableReference, which is accessible by means of CtElement::getParent. If the parent element is a local variable declaration with a default expression (an initializer) whose represented type matches the required type name, the binding can be accepted. In addition to supporting the specific statement pattern supported by Coccinelle4J, this solution would support any statement of the form InterfaceType x = E, where E is an arbitrary expression of the required represented type.

If the target-embedded parsing strategy (Section 7.4.7) was implemented it would be simple to add further support for the matching between sub-types that is not available in Coccinelle4J, in which a type-constrained metavariable declared as a super type is able to bind to program variables of any sub-type. Again we can modify the TypedIdentifierConstraint class. Changing the constructor to take a CtTypeReference argument for the required type instead of a plain string would allow us to call the method

CtTypeInformation::isSubtypeOf

when we check the represented type of a candidate binding, allowing bindings to any sub-type of the required type to be accepted. In order to avoid this feature yielding unwanted matches, we can add syntax to the SmPL grammar of metavariable declarations such that the feature can be turned on or off on a per-declaration basis. An example of such an extended grammar is shown in Listing 80.

Listing 80. Example patch using extended grammar for sub-typing.

@@
Collection x with subtypes;
@@
- printCollection(x);
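The two acceptance rules discussed in this section can be sketched in plain Java. The example below is a stand-in under stated assumptions: it uses java.lang.Class in place of Spoon's CtTypeReference and Class::isAssignableFrom in place of CtTypeInformation::isSubtypeOf (note the reversed direction of the two calls), and the class TypedConstraint with its withSubtypes flag is hypothetical, modeling the "with subtypes" syntax of Listing 80 rather than the actual TypedIdentifierConstraint.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Stand-in sketch: reflection types replace Spoon metamodel types.
final class SubtypeConstraintSketch {
    /** Hypothetical model of a type-constrained metavariable declaration. */
    static final class TypedConstraint {
        private final Class<?> requiredType;
        private final boolean withSubtypes;

        TypedConstraint(Class<?> requiredType, boolean withSubtypes) {
            this.requiredType = requiredType;
            this.withSubtypes = withSubtypes;
        }

        /**
         * @param declaredType    type the program variable is declared with
         * @param initializerType type of the initializer expression, or null if absent
         * @return true if the variable is an acceptable binding
         */
        boolean accepts(Class<?> declaredType, Class<?> initializerType) {
            if (declaredType.equals(requiredType)) {
                return true; // exact match on the declared type
            }
            if (initializerType != null && initializerType.equals(requiredType)) {
                return true; // special case: InterfaceType x = E with E of the required type
            }
            // General sub-typing, enabled per declaration.
            return withSubtypes && requiredType.isAssignableFrom(declaredType);
        }
    }

    public static void main(String[] args) {
        TypedConstraint plain = new TypedConstraint(Collection.class, false);
        TypedConstraint subtyped = new TypedConstraint(Collection.class, true);
        System.out.println(plain.accepts(ArrayList.class, null));
        System.out.println(subtyped.accepts(ArrayList.class, null));
        // List<String> x = new ArrayList<>(); with a metavariable declared as ArrayList
        System.out.println(new TypedConstraint(ArrayList.class, false)
                .accepts(List.class, ArrayList.class));
    }
}
```

Keeping the general rule behind a flag mirrors the per-declaration opt-in proposed above, so that existing patches keep their stricter matching behavior.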

7.4 Future work

This section presents a number of suggestions for future work that could be applied to SPOON-SMPL. Section 7.4.1 lists a number of simple, non-looping Java constructs that are currently unsupported by SPOON-SMPL, for which we expect the task of adding support to be straightforward. Section 7.4.2 suggests adding support for loops, which involves avoiding issues with infinite recursion in the model checking algorithm. Section 7.4.3 suggests implementing support for isomorphism rules, a fundamental SmPL feature. Section 7.4.4 suggests optimizing the model checker implementation in general and by the implementation of two given optimizations in particular. Section 7.4.5 suggests using the sniper output mode available in Spoon once it reaches a more mature stage. Section 7.4.6 suggests using the existing spoon.pattern package to perform AST pattern matching, allowing for the removal of the pattern matcher included with SPOON-SMPL. Finally, Section 7.4.7 suggests embedding the Java DSL representation of semantic patches into the class of targeted methods before parsing in order to provide more context for the Spoon Java parser.

7.4.1 Support for more simple Java constructs

There are a number of simple Java constructs that are currently unsupported by SPOON-SMPL despite being fully supported by Spoon. Some of these constructs are fully unsupported in the sense that they are completely ignored in the SmPL process, being replaced by placeholders during the preparation of the control flow graphs that constitute the inputs for both formula compilation and CTL model assembly. Examples include switch statements and array initializers. Other constructs are partially unsupported in the sense that they are not implemented in the pattern matcher or formula compiler, meaning they cannot appear in input semantic patches. An example of such a construct is the try–catch construct, which is supported in the context of target code but not supported in the context of a semantic patch, meaning it is possible for a semantic patch to match and transform sub-elements of a try–catch construct but not match or transform the try–catch itself.

In many cases, adding full support for simple constructs can be a straightforward task. A sample of simple unsupported constructs is shown in Table 7.1. More unsupported constructs can be identified by looking at the files SmPLMethodCFG.java and PatternBuilder.java.

Category               Spoon metamodel elements
Annotations            CtAnnotation
Accesses to super      CtSuperAccess
Assertions             CtAssert
throw statements       CtThrow
Arrays                 CtNewArray, CtArrayRead, CtArrayWrite
switch statements      CtSwitch, CtCase, CtBreak
try–catch statements   CtTry, CtCatch

Table 7.1: Simple Java constructs currently unsupported by SPOON-SMPL.

7.4.2 Support for looping constructs

None of the loop constructs of Java are currently supported by SPOON-SMPL. In addition to the straightforward work required to add support for any new construct, adding support for loops is likely to require modifications to formula compilation in order to avoid infinite recursion both in model checking and in the compilation process itself. In the paper [4] on the theoretical foundation of Coccinelle, the authors mention switching from using the AU connective to using the AW connective whenever the targeted source code is found to contain a loop. As the core engine of SPOON-SMPL is very similar to Coccinelle, a similar solution to supporting loops may be admissible.

7.4.3 Support for isomorphisms

Isomorphisms play an important part in the notion of semantic patches matching code based on the semantics of the target language, and as such are perhaps the most fundamental SmPL feature missing in SPOON-SMPL, omitted mainly due to the time constraints of the project. Extending SPOON-SMPL with a system equivalent to the isomorphisms of Coccinelle would involve the implementation of the following:

• The parsing of isomorphism rules, re-using much of the existing SmPL parsing code.

• The internal representation of an isomorphism rule.

• The pattern matching between Spoon metamodel elements and the internal representation of isomorphisms.

• The injection of disjunctions covering isomorphism alternatives in the formula compiler.

7.4.4 Model checker optimizations

Two optimizations mentioned in the paper [4] covering the theoretical foundation of Coccinelle are highly relevant to the current implementation of SPOON-SMPL:

1. When computing SAT(ϕ1 ∧ AX ϕ2), use the information found in computing SAT(ϕ1) to allow for early elimination of results in the computation of SAT(AX ϕ2), based on the knowledge that the two sets of results are to be intersected.

2. Introduce a second metavariable quantifier ∃nw that does not produce witness recordings in the model checking algorithm. In a given formula, use this alternative quantifier for metavariables that only appear under negation.

The first optimization targets the common formula structure

ϕ1 ∧ AX ϕ2

by reducing the number of results produced by the right hand side of the expression, which in the case of Coccinelle resulted in a large reduction (≈ 50%)² to the number of results that needed to be considered for the subsequent intersection.

The second optimization targets metavariables that only occur under negation. The semantics of negation in CTL-VW causes any witness records produced in the processing of sub-formulas under negation to be dropped when processing the enclosing negation itself, thereby making the recording of such witnesses a pointless task that can readily be skipped.

In addition to the two optimizations listed above, the current implementation of SPOON-SMPL is likely to offer a rather large potential for more optimizations to the model checker and related procedures, such as the intersection of results and environments.

² It is not entirely clear how much of the quoted reduction could be attributed to the specific optimization being discussed. Other optimizations were present.
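The idea behind the first optimization can be sketched on a simplified model. The Java example below is not the algorithm of Coccinelle or SPOON-SMPL: it assumes plain sets of integer states instead of CTL-VW results carrying environments and witnesses, and all names are hypothetical. It contrasts computing SAT(AX ϕ2) over every state followed by an intersection with examining only the states already known to satisfy ϕ1.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Simplified sketch: states are integers and results are plain state sets.
final class ConjunctionAxSketch {
    /** Candidates whose successors all lie in satPhi (AX phi; vacuously true without successors). */
    static Set<Integer> preAll(Map<Integer, List<Integer>> succ,
                               Set<Integer> candidates, Set<Integer> satPhi) {
        Set<Integer> result = new TreeSet<>();
        for (int state : candidates) {
            List<Integer> next = succ.getOrDefault(state, Collections.<Integer>emptyList());
            if (satPhi.containsAll(next)) {
                result.add(state);
            }
        }
        return result;
    }

    /** Baseline: compute SAT(AX phi2) over all states, then intersect with SAT(phi1). */
    static Set<Integer> naive(Map<Integer, List<Integer>> succ, Set<Integer> allStates,
                              Set<Integer> sat1, Set<Integer> sat2) {
        Set<Integer> result = preAll(succ, allStates, sat2);
        result.retainAll(sat1);
        return result;
    }

    /** Early elimination: only states already in SAT(phi1) are examined for AX phi2. */
    static Set<Integer> pruned(Map<Integer, List<Integer>> succ,
                               Set<Integer> sat1, Set<Integer> sat2) {
        return preAll(succ, sat1, sat2);
    }

    /** Small example graph: 0 -> 1, 1 -> {2, 3}, 2 -> 4, 3 -> 4, 4 -> 4. */
    static Map<Integer, List<Integer>> exampleGraph() {
        Map<Integer, List<Integer>> succ = new HashMap<>();
        succ.put(0, Arrays.asList(1));
        succ.put(1, Arrays.asList(2, 3));
        succ.put(2, Arrays.asList(4));
        succ.put(3, Arrays.asList(4));
        succ.put(4, Arrays.asList(4));
        return succ;
    }

    public static void main(String[] args) {
        Set<Integer> all = new TreeSet<>(Arrays.asList(0, 1, 2, 3, 4));
        Set<Integer> sat1 = new TreeSet<>(Arrays.asList(0, 1, 2));
        Set<Integer> sat2 = new TreeSet<>(Arrays.asList(2, 3, 4));
        System.out.println(naive(exampleGraph(), all, sat1, sat2));
        System.out.println(pruned(exampleGraph(), sat1, sat2));
    }
}
```

Both variants yield the same set, but the pruned variant never produces intermediate results for states outside SAT(ϕ1), which is the source of the saving when results are more expensive than plain state identifiers.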

7.4.5 Using Spoon sniper mode

The sniper output mode is a new addition to Spoon that is still under heavy development. In sniper mode, any Spoon metamodel elements that have not been modified are pretty-printed using the original literal source code that yielded the element in the first place. As such, the sniper mode is capable of retaining much more of the original code style, as opposed to the default output mode which produces source code formatted in a specific style.

An attempt was made to use sniper mode with SPOON-SMPL, but the attempt was abandoned as crashes were encountered that were unrelated to the task being performed by SPOON-SMPL. Once the sniper mode reaches a more mature stage of development, making SPOON-SMPL default to using it would be a useful improvement.

7.4.6 Using spoon.pattern

In the initial stage of designing SPOON-SMPL, the idea was to use the existing spoon.pattern package for all AST pattern matching tasks. This proved challenging as spoon.pattern performs matching over a large number of properties of Spoon metamodel elements, thereby providing a large number of opportunities for mismatches to occur. The problem was made worse by the fact that SPOON-SMPL parsed the Java DSL for a semantic patch outside the context of the target code, providing many opportunities for properties of elements to remain unresolved or be problematically inferred. Such properties often caused false mismatches with elements extracted from a metamodel constructed from parsing a full source tree, which generally have few unresolved properties. Ultimately, the choice was made to implement a smaller and simpler pattern matching engine that performed more lenient matching over a limited set of properties.

However, for the Spoon project, having to document and maintain two separate AST pattern matching engines is not ideal from a software engineering perspective. As such, it would be beneficial if SPOON-SMPL could be adapted to use spoon.pattern for its AST pattern matching needs. One approach to solving the aforementioned false mismatching issues is to modify spoon.pattern to include matching modes that allow for the specification of which properties should be considered and which should be ignored, or to include a matching mode in which unresolved properties do not influence the matching. Another approach is to implement the target-embedded parsing of the semantic patch Java DSL as described in Section 7.4.7, which would yield patch metamodels with fewer unresolved element properties.
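The kind of matching mode suggested above can be sketched as follows. This is a minimal illustration under stated assumptions, not the matcher of SPOON-SMPL or of spoon.pattern: elements are modeled as plain maps from property names to values, an absent property stands for an unresolved one, and the property names (`kind`, `name`, `declaringType`) are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

/** Toy illustration only; elements are modeled as property maps. */
final class LenientMatchSketch {

    /** Strict matching: every property on either side must be present and equal. */
    static boolean matchesStrict(Map<String, String> pattern, Map<String, String> target) {
        Set<String> keys = new HashSet<>(pattern.keySet());
        keys.addAll(target.keySet());
        for (String key : keys) {
            if (!Objects.equals(pattern.get(key), target.get(key))) {
                return false;
            }
        }
        return true;
    }

    /**
     * Lenient matching: only the considered properties are compared, and a
     * property that is unresolved (absent) on either side is skipped.
     */
    static boolean matchesLenient(Map<String, String> pattern, Map<String, String> target,
                                  Set<String> considered) {
        for (String key : considered) {
            String p = pattern.get(key);
            String t = target.get(key);
            if (p == null || t == null) {
                continue; // unresolved: does not influence the match
            }
            if (!p.equals(t)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Pattern parsed in isolation: the declaring type could not be resolved.
        Map<String, String> pattern = new HashMap<>();
        pattern.put("kind", "invocation");
        pattern.put("name", "getHeight");

        // Target parsed from a full source tree: all properties resolved.
        Map<String, String> target = new HashMap<>(pattern);
        target.put("declaringType", "android.view.Display");

        Set<String> considered = new HashSet<>();
        considered.add("kind");
        considered.add("name");
        considered.add("declaringType");

        System.out.println(matchesStrict(pattern, target));
        System.out.println(matchesLenient(pattern, target, considered));
    }
}
```

The strict comparison reports a false mismatch for this pair because of the unresolved declaring type, whereas the lenient comparison accepts it while still rejecting elements whose resolved properties differ.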

7.4.7 Target-embedded parsing of the semantic patch

The current implementation of SPOON-SMPL rewrites input semantic patches from plain-text SmPL to an internal representation in a form of domain-specific language (DSL) based on standard Java source code, employing a number of specifically designed constructs to encode the features of SmPL. The Java DSL representation of a patch is later parsed using the Spoon Java parser. However, the parsing of the DSL representation takes place in an isolated context without access to information on any of the symbols (packages, types, fields, methods) present in the code targeted for patch application. This lack of context in the parsing stage gives rise to a number of limitations with regard to the potential for SPOON-SMPL to fully exploit the capabilities of Spoon. For example, Spoon provides the capability to check whether a given type is a sub-type of another type found in the same metamodel. However, the lack of context when parsing the patch DSL results in the type names found in the patch ending up in a separate metamodel, generally having no clear relation to identical type names found in target code. This means a sub-type check between types found in the patch and types found in the target code will produce many false negatives.

A promising approach to solving the issues resulting from a lack of context is to adapt the Java DSL representation for embedding into each Java class that contains executable blocks targeted for patch application. The Spoon metamodel supports a notion of snippets: fragments of plain text source code inserted into an existing metamodel as a statement or an expression, later to be parsed into proper metamodel elements by the Spoon Java parser. By using the snippet system, the patch DSL can be injected into arbitrary classes as the initializer expression for a new field added to the class. Compilation of the snippet will then take place in the context of the metamodel for the target class, which in turn is part of the full metamodel of the target code base. Symbols used in the patch will therefore be resolved as if they had appeared in the target class itself, with access to a complete metamodel context including private members of the target class. Listing 81 shows an example of how this technique can be constructed.

The main drawback to using target-embedded parsing of the semantic patch is thought to be the impact on performance from having to parse the patch DSL once for every class containing candidate executable blocks (as opposed to exactly once), along with the need to remove the injected field prior to unparsing. A limited experimental implementation of target-embedded parsing was made using the current implementation of SPOON-SMPL, with promising results. A full implementation was not considered due to the time constraints of the project, along with the fact that there would not be much benefit for our evaluation over the patch set used in the Coccinelle4J API migration case study.

Listing 81. Example patch DSL injection technique.

CtClass<?> targetcls = ... // an arbitrary class;

// Wrap the patch DSL in an anonymous class so that it forms one expression.
String code = "new Object() {\n" +
    // [ ... semantic patch DSL ... ]
    "}\n";

CtCodeSnippetExpression<Object> snippet;
snippet = factory.createCodeSnippetExpression(code);

CtTypeReference<Object> typeRef;
typeRef = factory.createCtTypeReference(Object.class);

Set<ModifierKind> mods = Set.of();

// Add the snippet as the initializer of a new field in the target class.
factory.createField(targetcls, mods, typeRef,
    "__SmPLRule__", snippet);

// Parse the snippet in the context of the target class.
targetcls.compileAndReplaceSnippets();
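The sub-type false negatives that motivate target-embedded parsing can be illustrated with a toy model. This sketch does not use the Spoon API; `ToyModel` is a hypothetical stand-in for a metamodel that only records direct supertypes, and the type names are made up for the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy illustration only; a "metamodel" here is just a table of direct supertypes. */
final class MetamodelContextSketch {

    static final class ToyModel {
        private final Map<String, String> superOf = new HashMap<>();

        /** Records that type has the given direct supertype. */
        void declare(String type, String superType) {
            superOf.put(type, superType);
        }

        /** Walks the supertype chain known to THIS model only. */
        boolean isSubtypeOf(String sub, String sup) {
            for (String cur = sub; cur != null; cur = superOf.get(cur)) {
                if (cur.equals(sup)) {
                    return true;
                }
            }
            return false;
        }
    }

    public static void main(String[] args) {
        // Model built from the full target source tree: the hierarchy is known.
        ToyModel target = new ToyModel();
        target.declare("MyView", "View");
        target.declare("View", "Object");

        // Model built from the patch alone: type names occur, but no hierarchy.
        ToyModel isolatedPatch = new ToyModel();

        System.out.println(target.isSubtypeOf("MyView", "View"));
        System.out.println(isolatedPatch.isSubtypeOf("MyView", "View"));
    }
}
```

The same query succeeds in the model that knows the hierarchy and fails in the isolated one, which mirrors why resolving patch symbols inside the target metamodel is attractive.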

7.5 Ethical considerations

A piece of software can often be considered under the general notion of a tool, and like most tools it will tend to have the potential of being misused. One potential misuse of SPOON-SMPL is to act as the applicator of semantic patches that make malicious modifications to Java source code, perhaps helping to enable the efficient distribution of malicious software in the form of small patches. However, it should be safe to claim that only a small number of users of Java software compile their programs from source, resulting in a small attack surface for such a scheme. Also, the binary distribution of SPOON-SMPL is fairly large, making it an unattractive option for someone wanting to go unnoticed. Furthermore, SPOON-SMPL also has the potential to act as the applicator of patches that fix security issues rather than cause them, offering some balance to the potential for misuse. We expect most potential misuses to have similar dual "good" uses.

The performance of SPOON-SMPL is worse than that of Coccinelle4J, meaning that the application of a semantic patch using SPOON-SMPL is going to consume more electrical energy compared to applying the same patch using Coccinelle4J. However, we expect the application of semantic patches to be a rare occurrence in general, and therefore that most users of SPOON-SMPL would be able to offset the additional power draw by taking simple actions, such as occasionally lowering the brightness of their screen for a short amount of time.

Chapter 8

Conclusions

The Java programming language is a good candidate target for a semantic patch language similar to SmPL as implemented by Coccinelle for the C programming language. The similarities between C and Java allow for trivial generalization of most of Coccinelle’s features to a Java context, and the extensive use of libraries in the Java developer community means there is a need for tools capable of assisting in the task of API migration. The small subset of Coccinelle’s features that do not generalize well to Java target either specific aspects of the source code of the Linux kernel that are not applicable to Java, or specific features of C for which there are no equivalents in Java, such as pointer types.

As part of this thesis we have produced and presented SPOON-SMPL, a prototype implementation of a subset of SmPL targeting a subset of Java, built using Spoon. We have shown this prototype to be capable of replicating a set of semantic patch application results produced by Coccinelle4J, the state of the art SmPL tool for Java. While the running time of SPOON-SMPL is worse than that of Coccinelle4J, in most cases it is in the acceptable range for a single developer using inexpensive hardware. Furthermore, we have shown that some of the limitations regarding the syntax and semantics of Java present in both SPOON-SMPL and Coccinelle4J would be simple to address through extensions to SPOON-SMPL due to the capabilities provided by the Spoon library. In contrast, we expect the corresponding features to be more troublesome to implement in Coccinelle4J.

We thus conclude that the Spoon metaprogramming library is a viable base for implementing a semantic patch language for Java, both in terms of capabilities and performance. While an SmPL tool for Java based on Spoon is likely to be significantly slower than Coccinelle4J, it is likely to be fast enough to be usable for a single developer while offering better support for the syntax and semantics of Java.

Appendix A

Full semantic patches

This appendix provides the full versions of two semantic patches used in the evaluation of SPOON-SMPL for which, due to their length, only excerpts were shown in the main text.


A.1 Semantic patch 4: should_vibrate

A.1.1 Original version

Listing 82 shows the original version of the should_vibrate semantic patch as provided by the Coccinelle4J authors and used in this work for the performance measurements of Coccinelle4J.

Listing 82. Full original should_vibrate semantic patch

virtual replace_no_context

@@
identifier am, f, ctx;
expression vibrate_type;
@@
+ boolean shouldVibrate(AudioManager am, Context ctx, int vibrateType) {
+   if (Build.VERSION.SDK_INT >= Build.VERSION_CODES.JELLY_BEAN) {
+     Vibrator vibrator = (Vibrator) ctx.getSystemService(
+         Context.VIBRATOR_SERVICE);
+     if (vibrator == null || !vibrator.hasVibrator()) {
+       return false;
+     }
+     return am.getRingerMode() !=
+         AudioManager.RINGER_MODE_SILENT;
+   } else {
+     return audioManager.shouldVibrate(vibrateType);
+   }
+ }

f(..., Context ctx, ...) {
  ...
- am.shouldVibrate(vibrate_type)
+ shouldVibrate(am, ctx, vibrate_type)
  ...
}

@depends on replace_no_context@
identifier am;
expression vibrate_type;
identifier f !~ "^shouldVibrate$";
@@
f(...) {
  <... when exists
- am.shouldVibrate(vibrate_type)
+ am.getRingerMode() != AudioManager.RINGER_MODE_SILENT
  ...>
}

A.1.2 Modified version

Listing 83 shows the version of the should_vibrate semantic patch that has been modified for use with SPOON-SMPL.

Listing 83. Full modified should_vibrate semantic patch

@@
type T;
identifier am, f, ctx;
expression vibrate_type;
@@
+ boolean shouldVibrate(AudioManager am, Context ctx, int vibrateType) {
+   if (Build.VERSION.SDK_INT >= Build.VERSION_CODES.JELLY_BEAN) {
+     Vibrator vibrator = (Vibrator) ctx.getSystemService(Context.VIBRATOR_SERVICE);
+     if (vibrator == null || !vibrator.hasVibrator()) {
+       return false;
+     }
+     return am.getRingerMode() != AudioManager.RINGER_MODE_SILENT;
+   } else {
+     return audioManager.shouldVibrate(vibrateType);
+   }
+ }

T f(..., Context ctx, ...) {
  ...
- am.shouldVibrate(vibrate_type)
+ shouldVibrate(am, ctx, vibrate_type)
  ...
}

A.2 Semantic patch 5: get_height

A.2.1 Original version

Listing 84 shows the original version of the get_height semantic patch as provided by the Coccinelle4J authors and used in this work for the performance measurements of Coccinelle4J.

Listing 84. Full original get_height semantic patch

@rule1@
Display display;
identifier p;
type T;
@@
(
- p = new Point(display.getWidth(),
-     display.getHeight());
+ p = new Point();
+ d.getSize(p);
|
- T p = new Point(display.getWidth(),
-     display.getHeight());
+ T p = new Point();
+ d.getSize(p);
)
<...
(
- display.getHeight()
+ p.y
|
- display.getWidth()
+ p.x
)
...>

@rule2@
identifier display, f;
expression E;
@@
f(...) {
  ...
  Display display = E;
+ Point p = new Point();
+ display.getSize(p);
  <... when != Point(...)
(
- display.getHeight()
+ p.y
|
- display.getWidth()
+ p.x
)
  ...>
}

A.2.2 Modified version

Listing 85 shows the version of the get_height semantic patch that has been modified for use with SPOON-SMPL.

Listing 85. Full modified get_height semantic patch

@rule1@
Display display;
identifier p;
type T;
@@
(
- p = new Point(display.getWidth(), display.getHeight());
+ p = new Point();
+ display.getSize(p);
|
- T p = new Point(display.getWidth(), display.getHeight());
+ T p = new Point();
+ display.getSize(p);
)
<...
(
- display.getHeight()
+ p.y
|
- display.getWidth()
+ p.x
)
...>

TRITA-EECS-EX-2021:36

www.kth.se