Equality Saturation: a New Approach to Optimization ∗

Ross Tate    Michael Stepp    Zachary Tatlock    Sorin Lerner
Department of Computer Science and Engineering
University of California, San Diego
{rtate, mstepp, ztatlock, lerner}@cs.ucsd.edu

Abstract

Optimizations in a traditional compiler are applied sequentially, with each optimization destructively modifying the program to produce a transformed program that is then passed to the next optimization. We present a new approach for structuring the optimization phase of a compiler. In our approach, optimizations take the form of equality analyses that add equality information to a common intermediate representation. The optimizer works by repeatedly applying these analyses to infer equivalences between program fragments, thus saturating the intermediate representation with equalities. Once saturated, the intermediate representation encodes multiple optimized versions of the input program. At this point, a profitability heuristic picks the final optimized program from the various programs represented in the saturated representation. Our proposed way of structuring optimizers has a variety of benefits over previous approaches: our approach obviates the need to worry about optimization ordering, enables the use of a global optimization heuristic that selects among fully optimized programs, and can be used to perform translation validation, even on compilers other than our own. We present our approach, formalize it, and describe our choice of intermediate representation. We also present experimental results showing that our approach is practical in terms of time and space overhead, is effective at discovering intricate optimization opportunities, and is effective at performing translation validation for a realistic optimizer.

Categories and Subject Descriptors D.3.4 [Programming Languages]: Processors – Compilers; Optimization

General Terms Languages, Performance

Keywords Compiler Optimization, Equality Reasoning, Intermediate Representation

∗ Supported in part by NSF CAREER grant CCF-0644306.

1. Introduction

In a traditional compilation system, optimizations are applied sequentially, with each optimization taking as input the program produced by the previous one. This traditional approach to compilation has several well-known drawbacks. One of these drawbacks is that the order in which optimizations are run affects the quality of the generated code, a problem commonly known as the phase ordering problem. Another drawback is that profitability heuristics, which decide whether or not to apply a given optimization, tend to make their decisions one optimization at a time, and so it is difficult for these heuristics to account for the effect of future transformations.

In this paper, we present a new approach for structuring optimizers that addresses the above limitations of the traditional approach, and also has a variety of other benefits. Our approach consists of computing a set of optimized versions of the input program and then selecting the best candidate from this set. The set of candidate optimized programs is computed by repeatedly inferring equivalences between program fragments, thus allowing us to represent the effect of many possible optimizations at once. This, in turn, enables the compiler to delay the decision of whether or not an optimization is profitable until it observes the full ramifications of that decision. Although related ideas have been explored in the context of super-optimizers, as Section 8 on related work will point out, super-optimizers typically operate on straight-line code, whereas our approach is meant as a general-purpose compilation paradigm that can optimize complicated control flow structures.

At its core, our approach is based on a simple change to the traditional compilation model: whereas traditional optimizations operate by destructively performing transformations, in our approach optimizations take the form of equality analyses that simply add equality information to a common intermediate representation (IR), without losing the original program. Thus, after each equality analysis runs, both the old program and the new program are represented.

The simplest form of equality analysis looks for ways to instantiate equality axioms like a ∗ 0 = 0, or a ∗ 4 = a << 2. However, our approach also supports arbitrarily complicated forms of equality analyses, such as inlining, tail recursion elimination, and various forms of user defined axioms. The flexibility with which equality analyses are defined makes it easy for compiler writers to port their traditional optimizations to our equality-based model: optimizations can work as before, except that when the optimization would have performed a transformation, it now simply records the transformation as an equality.

The main technical challenge that we face in our approach is that the compiler's IR must now use equality information to represent not just one optimized version of the input program, but multiple versions at once. We address this challenge through a new IR that compactly represents equality information, and as a result can simultaneously store multiple optimized versions of the input program. After a program is converted into our IR, we repeatedly apply equality analyses to infer new equalities until no more equalities can be inferred, a process known as saturation. Once saturated with equalities, our IR compactly represents the various possible ways of computing the values from the original program modulo the given set of equality analyses (and modulo some bound in the case where applying equality analyses leads to unbounded expansion).

Our approach of having optimizations add equality information to a common IR until it is saturated with equalities has a variety of benefits over previous optimization models.

Optimization order does not matter. The first benefit of our approach is that it removes the need to think about optimization ordering. When applying optimizations sequentially, ordering is a problem because one optimization, say A, may perform some transformation that will irrevocably prevent another optimization, say B, from triggering, when in fact running B first would have produced the better outcome. This so-called phase ordering problem is ubiquitous in compiler design. In our approach, however, the compiler writer does not need to worry about ordering, because optimizations do not destructively update the program – they simply add equality information. Therefore, after an optimization A is applied, the original program is still represented (along with the transformed program), and so any optimization B that could have been applied before A is still applicable after A. Thus, there is no way that applying an optimization A can irrevocably prevent another optimization B from applying, and so there is no way that applying optimizations will lead the search astray. As a result, compiler writers who use our approach do not need to worry about the order in which optimizations run. Better yet, because optimizations are allowed to freely interact during equality saturation, without any consideration for ordering, our approach can discover intricate optimization opportunities that compiler writers may not have anticipated, and hence would not have implemented in a general purpose compiler.

Global profitability heuristics. The second benefit of our approach is that it enables global profitability heuristics. Even if there existed a perfect order to run optimizations in, compiler writers would still have to design profitability heuristics for determining whether or not to perform certain optimizations such as inlining. Unfortunately, in a traditional compilation system where optimizations are applied sequentially, each heuristic decides in isolation whether or not to apply an optimization at a particular point in the compilation process. The local nature of these heuristics makes it difficult to take into account the effect of future optimizations.

Our approach, on the other hand, allows the compiler writer to design profitability heuristics that are global in nature. In particular, rather than choosing whether or not to apply an optimization locally, these heuristics choose between fully optimized versions of the input program. Our approach makes this possible by separating the decision of whether or not a transformation is applicable from the decision of whether or not it is profitable. Indeed, using an optimization to add an equality in our approach does not indicate a decision to perform the transformation – the added equality just represents the option of picking that transformation later. The actual decision of which transformations to apply is performed by a global heuristic after our IR has been saturated with equalities. This global heuristic simply chooses among the various optimized versions of the input program that are represented in the saturated IR, and so it has a global view of all the transformations that were tried and what programs they generated.

There are many ways to implement this global profitability heuristic, and in our prototype compiler we have chosen to implement it using a Pseudo-Boolean solver (a form of Integer Linear Programming solver). In particular, after our IR has been saturated with equalities, we use a Pseudo-Boolean solver and a static cost model for every node to pick the lowest-cost program that computes the same result as the original program.

Translation validation. The third benefit of our approach is that it can be used not only to optimize programs, but also to prove programs equivalent: intuitively, if two programs are found equal after equality saturation, then they are equivalent. Our approach can therefore be used to perform translation validation, a technique that consists of automatically checking whether or not the optimized version of an input program is semantically equivalent to the original program. For example, we can prove the correctness of optimizations performed by existing compilers, even if our profitability heuristic would not have selected those optimizations.
Contributions. The contributions of this paper can therefore be summarized as follows:

• We present a new approach for structuring optimizers. In our approach optimizations add equality information to a common IR that simultaneously represents multiple optimized versions of the input program. Our approach obviates the need to worry about optimization ordering, it enables the use of a global optimization heuristic (such as a Pseudo-Boolean solver), and it can be used to perform translation validation for any compiler. Sections 2 and 3 present an overview of our approach and its capabilities, Section 4 makes our approach formal, and Section 5 describes the new IR that allows our approach to be effective.

• We have instantiated our approach in a new Java bytecode optimizer called Peggy (Section 6). Peggy uses our approach not only to optimize Java methods, but also to perform translation validation for existing compilers. Our experimental results (Section 7) show that our approach (1) is practical both in terms of time and space overhead, (2) is effective at discovering both simple and intricate optimization opportunities, and (3) is effective at performing translation validation for a realistic optimizer – Peggy is able to validate 98% of the runs of the Soot optimizer [36], and within the remaining 2% it uncovered a bug in Soot.

2. Overview

Our approach for structuring optimizers is based on the idea of having optimizations propagate equality information to a common IR that simultaneously represents multiple optimized versions of the input program.
The main challenge in designing this IR is that it must make equality reasoning effective and efficient.

To make equality reasoning effective, our IR needs to support the same kind of basic reasoning that one would expect from simple equality axioms like a ∗ (b + c) = a ∗ b + a ∗ c, but with more complicated computations such as branches and loops. We have designed a representation for computations called Program Expression Graphs (PEGs) that meets these requirements. Similar to the gated SSA representation [35, 19], PEGs are referentially transparent, which intuitively means that the value of an expression depends only on the value of its constituent expressions, without any side-effects. As has been observed previously in many contexts, referential transparency makes equality reasoning simple and effective. However, unlike previous SSA-based representations, PEGs are also complete, which means that there is no need to maintain any additional representation such as a control flow graph (CFG). Completeness makes it easy to use equality for performing transformations: if two PEG nodes are equal, then we can pick either one to create a well-formed program, without worrying about the implications on any underlying representation.

In addition to being effective, equality reasoning in our IR must be efficient. The main challenge is that each added equality can potentially double the number of represented programs, thus making the number of represented programs exponential in the worst case. To address this challenge, we record equality information of PEG nodes by simply merging PEG nodes into equivalence classes. We call the resulting equivalence graph an E-PEG, and it is this E-PEG representation that we use in our approach.
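To make the idea of merging nodes into equivalence classes concrete, here is a minimal sketch of an E-PEG's equality store as a union-find structure. The class and method names are our own illustration, not Peggy's actual implementation.

    import java.util.*;

    // A minimal sketch (not Peggy's actual code) of how an E-PEG can record
    // equalities: PEG nodes are identified by integer ids, and equal nodes
    // are merged into equivalence classes with a union-find forest.
    final class EPeg {
        private final Map<Integer, Integer> parent = new HashMap<>();

        // find: returns the representative of a node's equivalence class,
        // compressing paths along the way.
        int find(int node) {
            parent.putIfAbsent(node, node);
            int p = parent.get(node);
            if (p != node) {
                p = find(p);
                parent.put(node, p);
            }
            return p;
        }

        // addEquality: an equality analysis calls this to record that two
        // nodes compute the same value; their classes are united.
        void addEquality(int a, int b) {
            parent.put(find(a), find(b));
        }

        boolean equivalent(int a, int b) { return find(a) == find(b); }
    }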
Using equivalence classes allows E-PEGs to efficiently represent exponentially many ways of expressing the input program, and it also allows the equality saturation engine to efficiently take into account previously discovered equalities. Among existing IRs, E-PEGs are unique in their ability to represent multiple optimized versions of the input program. A more detailed discussion of how PEGs and E-PEGs relate to previous IRs can be found in Section 8.

We illustrate the main features of our approach by showing how it can be used to implement loop-induction-variable strength reduction. The idea behind this optimization is that if all assignments to a variable i in a loop are increments, then an expression i * c in the loop (with c being loop invariant) can be replaced with i, provided all the increments of i in the loop are appropriately scaled by c.

As an example, consider the code snippet from Figure 1(a). The use of i*5 inside the loop can be replaced with i as long as the two increments in the loop are scaled by 5. The resulting code is shown in Figure 1(b).

    i := 0;                    i := 0;
    while (...) {              while (...) {
      use(i * 5);                use(i);
      i := i + 1;                i := i + 5;
      if (...) {                 if (...) {
        i := i + 3;                i := i + 15;
      }                          }
    }                          }
          (a)                        (b)

Figure 1. Loop-induction-variable strength reduction: (a) shows the original code, and (b) shows the optimized code.

2.1 Program Expression Graphs

A Program Expression Graph (PEG) is a graph containing: (1) operator nodes, for example "plus", "minus", or any of our built-in nodes for representing conditionals and loops, and (2) "dataflow" edges that specify where operator nodes get their arguments from. As an example, the PEG for the use of i*5 in Figure 1(a) is shown in Figure 2(a). At the very top of the PEG we see node 1, which represents the i*5 multiply operation from inside the loop. Each PEG node represents an operation, with the children nodes being the arguments to the operation. The links from parents to children are shown using solid (non-dashed) lines. For example, node 1 represents the multiplication of node 2 by the constant 5. PEGs follow the notational convention used in E-graphs [26, 27, 13] and Abstract Syntax Trees (ASTs) of displaying operators above the arguments that flow into them, which is the opposite convention typically used in Dataflow Graphs [11, 2]. We use the E-graph/AST orientation because we think of PEGs as recursive expressions.

Node 2 in our PEG represents the value of variable i inside the loop, right before the first instruction in the loop is executed. We use θ nodes to represent values that vary inside of a loop. Intuitively, the left child of a θ node computes the initial value, whereas the right child computes the value at the current iteration in terms of the value at the previous iteration. In our example, the left child of the θ node is the constant 0, representing the initial value of i. The right child of the θ node uses nodes 3, 4, and 5 to compute the value of i at the current iteration in terms of the value of i from the previous iteration. The two plus nodes (nodes 4 and 5) represent the two increments of i in the loop, whereas the φ node (node 3) represents the merging of the two values of i produced by the two plus nodes.
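As an illustration of this description, the following sketch builds the PEG of Figure 2(a) as a recursive expression. The PegNode class is our own strawman, not Peggy's API; its ids are usable with the EPeg sketch above. Note how the θ node is created first so that the cycle through nodes 3–5 can point back to it (the φ node is explained next).

    import java.util.*;

    // Sketch: PEG nodes as labeled expression nodes with child edges.
    final class PegNode {
        private static int nextId = 0;
        final int id = nextId++;                // node identity, usable with EPeg above
        final String op;                        // "*", "+", "theta", "phi", a constant, ...
        final List<PegNode> children = new ArrayList<>();
        PegNode(String op, PegNode... kids) {
            this.op = op;
            children.addAll(Arrays.asList(kids));
        }
    }

    final class Figure2a {
        // delta stands for the PEG sub-graph computing the branch condition.
        static PegNode build(PegNode delta) {
            PegNode theta = new PegNode("theta", new PegNode("0"));    // node 2: i in the loop
            PegNode inc1  = new PegNode("+", theta, new PegNode("1")); // node 4: i + 1
            PegNode inc3  = new PegNode("+", inc1, new PegNode("3"));  // node 5: (i + 1) + 3
            PegNode phi   = new PegNode("phi", delta, inc3, inc1);     // node 3: merge
            theta.children.add(phi);       // the cyclic edge: value for the next iteration
            return new PegNode("*", theta, new PegNode("5"));          // node 1: i * 5
        }
    }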
As in gated SSA [35, 19], our φ nodes are executable: the first (left-most) argument to φ is a selector that is used to select between the second and the third argument. Our example doesn't use the branch condition in an interesting way, and so we just let δ represent the PEG sub-graph that computes the branch condition.

From a more formal point of view, each θ node produces a sequence of values, one value for each iteration of the loop. The first argument of a θ node is the value for the first iteration, whereas the second argument is a sequence that represents the values for the remaining iterations. For example, in Figure 2, the nodes labeled 3 through 5 compute this sequence of remaining values in terms of the sequence produced by the θ node. In particular, nodes 3, 4 and 5 have been implicitly lifted to operate on this sequence.

PEGs are well-suited for equality reasoning because all PEG operators, even those for branches and loops, are mathematical functions with no side effects. As a result, PEGs are referentially transparent, which allows us to perform the same kind of equality reasoning that one is familiar with from mathematics. Even though all PEG operators are pure, PEGs can still represent programs with state by using heap summary nodes. Section 6 explains how our Peggy compiler uses such heap summary nodes to represent the state of Java objects.
2.2 Encoding equalities using E-PEGs

A PEG by itself can only represent a single way of expressing the input program. To represent multiple optimized versions of the input program, we need to encode equalities in our representation. To this end, an E-PEG is a graph that groups together PEG nodes that are equal into equivalence classes. As an example, Figure 2(b) shows the E-PEG that our engine produces from the PEG of Figure 2(a). We display equalities graphically by adding a dashed edge between two nodes that have become equal. These dashed edges are only a visualization mechanism. In reality, PEG nodes that are equal are grouped together into an equivalence class.

Reasoning in an E-PEG is done through the application of optimizations, which in our approach take the form of equality analyses that add equality information to the E-PEG. An equality analysis consists of two components: a trigger, which is an expression pattern stating the kinds of expressions that the analysis is interested in, and a callback function, which should be invoked when the trigger pattern is found in the E-PEG. The saturation engine continuously monitors all the triggers simultaneously, and invokes the necessary callbacks when triggers match. When invoked, a callback function adds the appropriate equalities to the E-PEG.

The simplest form of equality analysis consists of instantiating axioms such as a ∗ 0 = 0. In this case, the trigger would be a ∗ 0, and the callback function would add the equality a ∗ 0 = 0. Even though the vast majority of our reasoning is done through such declarative axiom application, our trigger and callback mechanism is much more general, and has allowed us to implement equality analyses such as inlining, tail-recursion elimination, and constant folding.

The following three axioms are the equality analyses required to perform loop-induction-variable strength reduction. They state that multiplication distributes over addition, θ, and φ:

    (a + b) ∗ m = a ∗ m + b ∗ m           (1)
    θ(a, b) ∗ m = θ(a ∗ m, b ∗ m)         (2)
    φ(a, b, c) ∗ m = φ(a, b ∗ m, c ∗ m)   (3)

After a program is converted to a PEG, a saturation engine repeatedly applies equality analyses until either no more equalities can be added, or a bound is reached on the number of expressions that have been processed by the engine. As Section 7 will describe in more detail, our experiments show that 84% of methods can be completely saturated, without any bounds being imposed.
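Using the illustrative PegNode and EPeg classes from the sketches above, axiom (2) can be phrased as such a trigger/callback pair. The EqualityAnalysis interface is again our own strawman, not Peggy's actual interface.

    // Sketch: axiom (2), theta(a, b) * m = theta(a*m, b*m), as an equality
    // analysis. The trigger matches the axiom's left-hand side; the callback
    // builds the right-hand side and records the equality (it never deletes
    // the original expression).
    interface EqualityAnalysis {
        boolean matches(PegNode n);             // the trigger
        void apply(PegNode n, EPeg epeg);       // the callback
    }

    final class DistributeTimesOverTheta implements EqualityAnalysis {
        public boolean matches(PegNode n) {
            return n.op.equals("*") && n.children.get(0).op.equals("theta");
        }
        public void apply(PegNode n, EPeg epeg) {
            PegNode theta = n.children.get(0), m = n.children.get(1);
            PegNode a = theta.children.get(0), b = theta.children.get(1);
            PegNode lifted = new PegNode("theta",
                    new PegNode("*", a, m),
                    new PegNode("*", b, m));
            epeg.addEquality(n.id, lifted.id);  // both forms now coexist
        }
    }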

[PEG/E-PEG diagrams (a)–(c); node-and-edge drawings omitted.]

Figure 2. Loop-induction-variable strength reduction using PEGs: (a) shows the original PEG, (b) shows the E-PEG that our engine produces from the original PEG, and (c) shows the optimized PEG, which results from choosing nodes 6, 8, 10, and 12 from (b).

Figure 2(b) shows the saturated E-PEG that results from applying the above distributivity axioms, along with a simple constant folding equality analysis. In particular, distributivity is applied four times: axiom (2) adds equality edge A, axiom (3) edge B, axiom (1) edge C, and axiom (1) edge D. Our engine also applies the constant folding equality analysis to show that 0 ∗ 5 = 0, 3 ∗ 5 = 15 and 1 ∗ 5 = 5. Note that when axiom (2) adds edge A, it also adds node 7, which then enables axiom (3). Thus, equality analyses essentially communicate with each other by propagating equalities through the E-PEG. Furthermore, note that the instantiation of axiom (1) adds node 12 to the E-PEG, but it does not add the right child of node 12, namely θ(. . .) ∗ 5, because it is already represented in the E-PEG.

Once saturated with equalities, an E-PEG compactly represents multiple optimized versions of the input program – in fact, it compactly represents all the programs that could result from applying the optimizations in any order to the input program. For example, the E-PEG in Figure 2(b) encodes 128 ways of expressing the original program (because it encodes 7 independent equalities, namely the 7 dashed edges). In general, a single E-PEG can efficiently represent exponentially many ways of expressing the input program.

After saturation, a global profitability heuristic can pick which optimized version of the input program is best. Because this profitability heuristic can inspect the entire E-PEG at once, it has a global view of the programs produced by various optimizations, after all other optimizations were also run. In our example, starting at node 1, by choosing nodes 6, 8, 10, and 12, we can construct the graph in Figure 2(c), which corresponds exactly to performing loop-induction-variable strength reduction in Figure 1(b).

More generally, when optimizing an entire function, one has to pick a node for the equivalence class of the return value and nodes for all equivalence classes that the return value depends on. There are many plausible heuristics for choosing nodes in an E-PEG. In our Peggy implementation, we have chosen to select nodes using a Pseudo-Boolean solver, which is an Integer Linear Programming solver where variables are constrained to 0 or 1. In particular, we use a Pseudo-Boolean solver and a static cost model for every node to compute the lowest-cost program that is encoded in the E-PEG. In the example from Figure 2, the Pseudo-Boolean solver picks the nodes described above. Section 6.3 describes our technique for selecting nodes in more detail.
2.3 Benefits of our approach

Optimization order does not matter. To understand how our approach addresses the phase ordering problem, consider a simple peephole optimization that transforms i * 5 into i << 2 + i. On the surface, one may think that this transformation should always be performed if it is applicable – after all, it replaces a multiplication with the much cheaper shift and add. In reality, however, this peephole optimization may disable other more profitable transformations. The code from Figure 1(a) is such an example: transforming i * 5 to i << 2 + i disables loop-induction-variable strength reduction, and therefore generates code that is worse than the one from Figure 1(b).

The above example illustrates the ubiquitous phase ordering problem. In systems that apply optimizations sequentially, the quality of the generated code depends on the order in which optimizations are applied. Whitfield and Soffa [41] have shown experimentally that enabling and disabling interactions between optimizations occur frequently in practice, and furthermore that the patterns of interaction vary not only from program to program, but also within a single program. Thus, no one order is best across all compilation.

A common partial solution consists of carefully considering all the possible interactions between optimizations, possibly with the help of automated tools, and then coming up with a carefully tuned sequence for running optimizations that strives to enable most of the beneficial interactions. This technique, however, puts a heavy burden on the compiler writer, and it also does not account for the fact that the best order may vary between programs.

At high levels of optimization, some compilers may even run optimizations in a loop until no more changes can be made. Even so, if the compiler picks the wrong optimization to start with, then no matter what optimizations are applied later, in any order, any number of times, the compiler will not be able to reverse the disabling consequences of the first optimization.

In our approach, the compiler writer does not need to worry about the order in which optimizations are applied. The previous peephole optimization would be expressed as the axiom i*5 = i<<2 + i. However, unlike in a traditional compilation system, applying this axiom in our approach does not remove the original program from the representation — it only adds information — and so it cannot disable other optimizations. Therefore, the code from Figure 1(b) would still be discovered, even if the peephole optimization was run first. In essence, our approach is able to simultaneously explore all possible sequences of optimizations, while sharing work that is common across the various sequences.

In addition to reducing the burden on compiler writers, removing the need to think about optimization ordering has two additional benefits. First, because optimizations interact freely with no regard to order, our approach often ends up combining optimizations in unanticipated ways, leading to surprisingly complicated optimizations given how simple our equality analyses are — Section 3 gives such an example. Second, it makes it easier for end-user programmers to add domain-specific axioms to the compiler, because they don't have to think about where exactly in the compiler the axiom should be run, and in what order relative to other optimizations.

    (a)  for (i := 0; i < 29; i += 2) {
           use(i)
         }
         use(i)

    (b)  sum := 0;
         for (i := 0; i < 10; i++) {
           for (j := 0; j < 10; j++) {
             use(sum++);
           }
         }

    (c)  for (i := 0; i < 10; i++) {
           for (j := 0; j < 10; j++) {
             use(i*10 + j);
           }
         }

    (d)–(f)  [the PEG representations of (a)–(c); graph drawings omitted]

Figure 3. Various loops and their PEG representations.

Global profitability heuristics. Profitability heuristics in traditional compilers tend to be local in nature, making it difficult to take into account the effect of future optimizations. For example, consider inlining. Although it is straightforward to estimate the direct cost of inlining (the code-size increase) and the direct benefit of inlining (the savings from removing the call overhead), it is far more difficult to estimate the potentially larger indirect benefit, namely the additional optimization opportunities that inlining exposes.

To see how inlining would affect our running example, consider again the code from Figure 1(a), but assume that instead of use(i * 5), there was a call to a function f, and the use of i*5 occurred inside f. If f is sufficiently large, a traditional inliner would not inline f, because the code bloat would outweigh the call-overhead savings. However, a traditional inliner would miss the fact that it may still be worth inlining f, despite its size, because inlining would expose the opportunity for loop-induction-variable strength reduction. One solution to this problem consists of performing an inlining trial [12], where the compiler simulates the inlining transformation, along with the effect of subsequent optimizations, in order to decide whether or not to actually inline. However, in the face of multiple inlining decisions (or more generally multiple optimization decisions), there can be exponentially many possible outcomes, each one of which has to be compiled separately.

In our approach, on the other hand, inlining simply adds an equality to the E-PEG stating that the call to a given function is equal to its body instantiated with the actual arguments. The resulting E-PEG simultaneously represents the program where inlining is performed and where it is not. Subsequent optimizations then operate on both of these programs at the same time. More generally, our approach can simultaneously explore exponentially many possibilities in parallel, while sharing the work that is redundant across these various possibilities. In the above example with inlining, once the E-PEG is saturated, a global profitability heuristic can make a more informed decision as to whether or not to pick the inlined version, since it will be able to take into account the fact that inlining enabled loop-induction-variable strength reduction.

Translation Validation. Unlike traditional compilation frameworks, our approach can be used not only to optimize programs, but also to establish equivalences between programs. In particular, if we convert two programs into an E-PEG, and then saturate it with equalities, then we can conclude that the two programs are equivalent if they belong to the same equivalence class in the saturated E-PEG. In this way, our approach can be used to perform translation validation for any compiler (not necessarily our own), by checking that each function in the input program is equivalent to the corresponding optimized function in the output program. For example, our approach would be able to show that the two program fragments from Figure 1 are equivalent. Furthermore, it would also be able to validate a compilation run in which i*5 = i<<2 + i was applied first to Figure 1(a). This shows that we are able to perform translation validation regardless of what optimized program our own profitability heuristic would choose.

Although our translation validation technique is intraprocedural, we can use interprocedural equality analyses such as inlining to enable a certain amount of interprocedural reasoning. This allows us to reason about transformations like reordering function calls.
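Continuing the strawman interfaces from Section 2, inlining-as-an-equality might be sketched as follows. This is illustrative only: a real version must preserve sharing, handle recursion and multiple parameters, and thread Peggy's heap state.

    import java.util.*;

    // Sketch: inlining as an equality analysis. Instead of rewriting the call
    // site, we equate call(f, arg) with f's body instantiated with the actual
    // argument, so the inlined and non-inlined programs coexist in the E-PEG.
    final class InliningAnalysis implements EqualityAnalysis {
        final Map<String, PegNode> bodies = new HashMap<>(); // f -> body over "param"

        public boolean matches(PegNode n) {
            return n.op.equals("call") && bodies.containsKey(n.children.get(0).op);
        }
        public void apply(PegNode call, EPeg epeg) {
            PegNode body = bodies.get(call.children.get(0).op);
            PegNode inlined = substitute(body, call.children.get(1));
            epeg.addEquality(call.id, inlined.id);
        }
        // Copy the body, replacing the formal parameter by the actual argument.
        // (Assumes a loop-free, tree-shaped body for brevity.)
        PegNode substitute(PegNode body, PegNode actual) {
            if (body.op.equals("param")) return actual;
            PegNode copy = new PegNode(body.op);
            for (PegNode child : body.children) copy.children.add(substitute(child, actual));
            return copy;
        }
    }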
3. Reasoning about loops

This section shows how our approach can be used to reason across nested loops. The example highlights the fact that a simple axiom set can produce unanticipated optimizations which traditional compilers would have to explicitly search for.

We start in Sections 3.1 and 3.2 by describing all PEG constructs used to represent loops. We then show in Section 3.3 how our approach can perform an inter-loop strength reduction optimization.

3.1 Single loop

Consider the simple loop from Figure 3(a). This loop iterates 15 times, incrementing the value of i each time by 2. Assume that the variable i is used inside the loop, and it is also used after the loop (as indicated by the two use annotations). The PEG for this code is shown in Figure 3(d). The value of i inside the loop is represented by a θ node. Intuitively, this θ node produces the sequence of values that i takes throughout the loop, in this case [0, 2, 4, ...]. The value of i after the loop is represented by the eval node at the top of the PEG. Given a sequence s and an index n, eval(s, n) produces the nth element of sequence s. To determine which element to select from a sequence, our PEG representation uses pass nodes. Given a sequence s of booleans, pass(s) returns the index of the first element in the sequence that is true. In our example, the ≥ node uses the result of the θ node to produce the sequence of values taken on by the boolean expression i ≥ 29 throughout the loop. This sequence is then sent to pass, which in this case produces the value 15, since the 15th value (counting from 0) of i in the loop (which is 30) is the first one to make i ≥ 29 true. The eval node then selects the 15th element of the sequence produced by the θ node, which is 30. In our previous example from Section 2, we omitted eval/pass from the PEG for clarity – because we were not interested in any of the values after the loop, the eval/pass nodes would not have been used in any reasoning.
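The θ/eval/pass reading of Figure 3(d) can be checked with a few lines of code. The sketch below is our own on-demand evaluator for this one example, not part of Peggy:

    import java.util.function.IntUnaryOperator;

    // On-demand evaluation of Figure 3(d): theta(0, prev -> prev + 2) is the
    // sequence [0, 2, 4, ...]; pass finds the first index where i >= 29; eval
    // selects the element of the theta sequence at that index.
    final class Figure3d {
        static int theta(int base, IntUnaryOperator step, int iteration) {
            int v = base;
            for (int k = 0; k < iteration; k++) v = step.applyAsInt(v);
            return v;
        }

        public static void main(String[] args) {
            IntUnaryOperator step = prev -> prev + 2;
            int pass = 0;                                  // pass(theta >= 29)
            while (theta(0, step, pass) < 29) pass++;
            int eval = theta(0, step, pass);               // eval(theta, pass)
            System.out.println(pass + " " + eval);         // prints "15 30"
        }
    }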

[E-PEG diagram omitted.]

Figure 4. E-PEG that results from running the saturation engine on the PEG from Figure 3(f). By picking the nodes that are check-marked, we get the PEG from Figure 3(e).

3.2 Nested loops

We now illustrate, through an example, how nested loops can be encoded in our PEG representation. Consider the code snippet from Figure 3(b), which has two nested loops. We are interested in the value of sum inside the loop, as indicated by the use annotation. The PEG for this code snippet is shown in Figure 3(e). Each θ, eval and pass node is labeled with a subscript indicating what loop depth it operates on (we previously omitted these subscripts for clarity). The node labeled sum_inner represents the value of sum at the beginning of the inner loop body. Similarly, sum_outer is the value of sum at the beginning of the outer loop body. Looking at sum_inner, we can see that: (1) on the first iteration (the left child of sum_inner), sum_inner gets the value of sum from the outer loop; (2) on other iterations, it gets one plus the value of sum from the previous iteration of the inner loop. Looking at sum_outer, we can see that: (1) on the first iteration, sum_outer gets 0; (2) on other iterations, it gets the value of sum right after the inner loop terminates. The value of sum after the inner loop terminates is computed using a similar eval/pass pattern as in Figure 3(d).
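These recurrences can be transliterated directly; the following sketch (ours, for checking the reading of Figure 3(e)) computes sum_inner and sum_outer exactly as the θ nodes define them:

    // Direct transliteration of the theta recurrences from Figure 3(e), with
    // 10 iterations per loop as in Figure 3(b). Our own illustration.
    final class Figure3e {
        // value of sum at the beginning of outer iteration i
        static int sumOuter(int i) {
            return i == 0 ? 0 : sumInner(i - 1, 10);   // 0 first; else sum after the inner loop
        }
        // value of sum at the beginning of inner iteration (i, j)
        static int sumInner(int i, int j) {
            return j == 0 ? sumOuter(i) : 1 + sumInner(i, j - 1);
        }
        public static void main(String[] args) {
            System.out.println(sumOuter(10));          // 100, i.e. 10 * 10 increments
        }
    }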

3.3 Inter-loop strength reduction

Our approach allows an optimizer to perform intricate optimizations of looping structures. We present such an example here, with a kind of inter-loop strength reduction. Consider the code snippet from Figure 3(c), which is equivalent to the one we've already seen in Figure 3(b). However, the code in 3(b) is faster because sum++ is cheaper than i ∗ 10 + j. We show how our approach can transform the code in Figure 3(c) to the code in Figure 3(b).

Figure 3(f) shows the PEG for i*10 + j, which will be the focus of our optimization. We omit eval and pass in this PEG because they are not used in this example, except in one step that we will make explicit.

Figure 4 shows the saturated E-PEG that results from running the saturation engine on the PEG from Figure 3(f). The checkmarks indicate which nodes will eventually be selected – they can be ignored for now. To make the graph more readable, we sometimes label nodes, and then connect an edge directly to a label name, rather than connecting it to the node with that label. For example, consider node j in the E-PEG, which reads as θ2(0, 1 + j). Rather than explicitly drawing an edge from + to j, we connect + to a new copy of label j.

In drawing Figure 4, we have already performed loop-induction-variable strength reduction on the left child of the topmost + from Figure 3(f). In particular, this left child has been replaced with a new node i, where i = θ1(0, 10 + i). We skip the steps in doing this because they are similar to the ones described in Section 2.2. Figure 4 shows the relevant equalities that our saturation engine would add. We describe each in turn.

• Edge A is added by distributing + over θ2:
  i + θ2(0, 1 + j) = θ2(i + 0, i + (1 + j))

• Edge B is added because 0 is the identity of +, i.e.: i + 0 = i.

• Edge C is added because addition is associative and commutative: i + (1 + j) = 1 + (i + j)

• Edge D is added because 0, incremented n times, produces n:
  evalℓ(idℓ, passℓ(idℓ ≥ n)) = n    where idℓ = θℓ(0, 1 + idℓ)
  This is an example of a loop computation expressible as a simple PEG axiom.

• Edge E is added by distributing + over the first child of eval2:
  eval2(j, k) + i = eval2(j + i, k)

• Edge F is added because addition is commutative: j + i = i + j

We use checkmarks in Figure 4 to highlight the nodes that Peggy would select using its Pseudo-Boolean profitability heuristic. These nodes constitute exactly the PEG from Figure 3(e), meaning that Peggy optimizes the code in Figure 3(c) to the one in Figure 3(b).

Summary. This example illustrates several points. First, it shows how a transformation that locally seems undesirable, namely transforming the constant 10 into an expensive loop (edge D), in the end leads to much better code. Our global profitability heuristic is perfectly suited for taking advantage of these situations. Second, it shows an example of an unanticipated optimization, namely an optimization that we did not realize would fall out from the simple equality analyses we already had in place. In a traditional compilation system, a specialized analysis would be required to perform this optimization, whereas in our approach the optimization simply happens without any special casing. In this way, our approach essentially allows a few general equality analyses to do the work of many specialized transformations. Finally, it shows how our approach is able to reason about complex loop interactions, something that is beyond the reach of current super-optimizer-based techniques.
4. Formalization of our Approach

Having given an intuition of how our approach works through examples, we now move to a formal description. Figure 5 shows the Optimize function, which embodies our approach. Optimize takes four steps: first, it converts the input CFG into an internal representation of the program; second, it saturates this internal representation with equalities; third, it uses a global profitability heuristic to select the best program from the saturated representation; finally, it converts the selected program back to a CFG.

    Function Optimize(cfg : CFG) : CFG
    1: let ir = ConvertToIR(cfg)
    2: let saturated_ir = Saturate(ir, A)
    3: let best = SelectBest(saturated_ir)
    4: return ConvertToCFG(best)

Figure 5. Optimization phase in our approach. We assume a global set A of equality analyses to be run.
An instantiation of our approach therefore consists of three components: (1) an IR where equality reasoning is effective, along with the translation functions ConvertToIR and ConvertToCFG, (2) a saturation engine Saturate, and (3) a global profitability heuristic SelectBest. Future sections will show how we instantiate these three components in our Peggy compiler.

Saturation Engine. The saturation engine Saturate infers equalities by repeatedly running a set A of equality analyses. Given an equality analysis a ∈ A, we define ir1 →a ir2 to mean that ir1 produces ir2 when the equality analysis a runs and adds some equalities to ir1. If a chooses not to add any equalities, then ir2 is simply the same as ir1.

We define a partial order ⊑ on IRs, based on the equalities they encode: ir1 ⊑ ir2 iff the equalities in ir1 are a subset of the equalities in ir2. Immediately from this definition, we get:

    (ir1 →a ir2) ⇒ ir1 ⊑ ir2    (4)

We define an equality analysis a to be monotonic iff:

    (ir1 ⊑ ir2) ∧ (ir1 →a ir1′) ∧ (ir2 →a ir2′) ⇒ (ir1′ ⊑ ir2′)

Intuitively, our approach addresses the phase ordering problem because applying an equality analysis a before b cannot make b less effective, as stated in the following non-interference theorem.

THEOREM 1. If a and b are monotonic then:

    (ir1 →a ir2) ∧ (ir2 →b ir3) ∧ (ir1 →b ir4) ⇒ (ir4 ⊑ ir3)

The above follows immediately from monotonicity and Property 4: Property 4 applied to ir1 →a ir2 gives ir1 ⊑ ir2, and monotonicity of b applied to ir1 ⊑ ir2, ir1 →b ir4, and ir2 →b ir3 yields ir4 ⊑ ir3. We now define ir1 → ir2 as:

    ir1 → ir2 ⟺ ∃a ∈ A . (ir1 →a ir2 ∧ ir1 ≠ ir2)

The → relation formalizes one step taken by the saturation engine. We also define →* to be the reflexive transitive closure of →. The →* relation formalizes an entire run of the saturation engine.

Given a set A of monotonic equality analyses, if the saturation engine terminates, then it is canonizing, where canonizing means that for any ir1, there is a unique ir2 with the following properties: (1) ir1 →* ir2 and (2) there is no ir3 such that ir2 → ir3. In this case the saturation engine computes this canonical saturated IR, which means that Optimize returns the same result no matter what order optimizations run in.

Because in general saturation may not terminate, we bound the number of times that analyses can run. In this case we cannot provide the same canonizing property, but the non-interference theorem (Theorem 1) still implies that no area of the search space can be made unreachable by applying an equality analysis (a property that traditional compilation systems lack).
5. PEGs and E-PEGs

The first step in instantiating our approach from the previous section is to pick an appropriate IR. To this end, we have designed a new IR called the E-PEG which can simultaneously represent multiple optimized versions of the input program. We first give a formal description of our IR (Section 5.1), then we present its benefits (Section 5.2), and finally we give a high-level description of how to translate from CFGs to our IR and back (Section 5.3).

5.1 Formalization of PEGs and E-PEGs

A PEG is a triple (N, L, C), where N is a set of nodes, L : N → F is a labeling that maps each node to a semantic function from a set of semantic functions F, and C : N → list[N] is a function that maps each node to its children (i.e. arguments).

Types. Before giving the definition of semantic functions, we first define the types of values that these functions operate over. Values that flow through a PEG are lifted in two ways. First, they are ⊥-lifted, meaning that we add the special value ⊥ to each type domain. The ⊥ value indicates that the computation fails or does not terminate. Formally, for each type τ, we define τ̂ = τ ∪ {⊥}. Second, values are loop-lifted, which means that instead of representing the value at a particular iteration, PEG nodes represent values for all iterations at the same time. Formally, we let L be a set of loop identifiers, with each ℓ ∈ L representing a loop from the original code (in our previous examples we used integers). We assume a partial order ≤ that represents the loop nesting structure: ℓ′ < ℓ means that ℓ′ is nested within ℓ. An iteration index i captures the iteration state of all loops in the PEG. In particular, i is a function that maps each loop identifier ℓ ∈ L to the iteration that loop ℓ is currently on. Suppose for example that there are two nested loops in the program, identified as ℓ1 and ℓ2. Then the iteration index i = [ℓ1 ↦ 5, ℓ2 ↦ 3] represents the state where loop ℓ1 is on the 5th iteration and loop ℓ2 is on the 3rd iteration. We let I = L → N be the set of all loop iteration indices (where N denotes the set of non-negative integers). For i ∈ I, we use the notation i[ℓ ↦ v] to denote a function that returns the same value as i on all inputs, except that it returns v on input ℓ. The output of a PEG node is a map from loop iteration indices in I to values. In particular, for each type τ, we define a loop-lifted version τ̃ = I → τ̂. PEG nodes operate on these loop-lifted types.

Semantic Functions. The set of semantic functions F is divided into two: F = Prims ∪ Domain, where Prims contains the primitive functions like φ and θ, which are built into the PEG representation, whereas Domain contains semantic functions for particular domains like arithmetic.

Figure 6 gives the definition of the primitive functions Prims = {φ, θℓ, evalℓ, passℓ}. These functions are polymorphic in τ, in that they can be instantiated for various τ's, ranging from basic types like integers and strings to complicated types like the heap summary nodes that Peggy uses to represent Java objects. The definitions of evalℓ and passℓ make use of the function monotonizeℓ, whose definition is given in Figure 6. The monotonizeℓ function transforms a sequence so that, once an indexed value is undefined, all following indexed values are undefined. It formalizes the fact that once a value is undefined at a given loop iteration, the value remains undefined at subsequent iterations.

The domain semantic functions are defined as Domain = {õp | op ∈ DomainOp}, where DomainOp is a set of domain operators (like +, ∗ and − in the case of arithmetic), and õp is a ⊥-lifted, and then loop-lifted, version of op. Intuitively, the ⊥-lifted version of an operator works like the original operator except that it returns ⊥ if any of its inputs are ⊥, and the loop-lifted version of an operator applies the original operator for each loop index. As an example, the semantic function of + in a PEG is +̃, and the semantic function of 1 is 1̃ (since constants like 1 are simply nullary operators). However, to make the notation less crowded, we omit the tildes on all domain operators.

Node Semantics. For a PEG node n ∈ N, we denote its semantic value by ⟦n⟧. We assume that ⟦·⟧ is lifted to sequences list[N] in the standard way. The semantic value of n is defined as:

    ⟦n⟧ = L(n)(⟦C(n)⟧)    (5)

Equation 5 is essentially the evaluation semantics for expressions. The only complication here is that our expression graphs are recursive. In this setting, one can think of Equation 5 as a set of recursive equations to be solved. To guarantee that a unique solution exists, we impose some well-formedness constraints on PEGs.

    φ : B̃ × τ̃ × τ̃ → τ̃
    φ(cond, t, f)(i) =  ⊥     if cond(i) = ⊥
                        t(i)  if cond(i) = true
                        f(i)  if cond(i) = false

    θℓ : τ̃ × τ̃ → τ̃
    θℓ(base, loop)(i) =  base(i)                 if i(ℓ) = 0
                         loop(i[ℓ ↦ i(ℓ) − 1])   if i(ℓ) > 0

    evalℓ : τ̃ × Ñ → τ̃
    evalℓ(loop, idx)(i) =  ⊥                                  if idx(i) = ⊥
                           monotonizeℓ(loop)(i[ℓ ↦ idx(i)])   otherwise

    passℓ : B̃ → Ñ
    passℓ(cond)(i) =  ⊥      if I = ∅
                      min I  if I ≠ ∅
        where I = {k ∈ N | monotonizeℓ(cond)(i[ℓ ↦ k]) = true}

    where monotonizeℓ : τ̃ → τ̃ is defined as:
    monotonizeℓ(value)(i) =  ⊥         if ∃ 0 ≤ k < i(ℓ). value(i[ℓ ↦ k]) = ⊥
                             value(i)  if ∀ 0 ≤ k < i(ℓ). value(i[ℓ ↦ k]) ≠ ⊥

Figure 6. Definition of primitive PEG functions. The important notation: L is the set of loop identifiers, N is the set of non-negative integers, B is the set of booleans, I = L → N, τ̂ = τ ∪ {⊥}, and τ̃ = I → τ̂.

DEFINITION 1. A PEG is well-formed iff:

1. All cycles pass through the second child edge of a θℓ
2. A path from a θℓ, evalℓ, or passℓ to a θℓ′ implies ℓ ≤ ℓ′ or the path passes through the first child edge of an evalℓ′ or passℓ′
3. All cycles containing evalℓ or passℓ contain some θℓ′ with ℓ′ < ℓ

Condition 1 states that all cyclic paths in the PEG are due to looping constructs. Condition 2 states that a computation in an outer-loop cannot reference a value from inside an inner-loop. Condition 3 states that the final value produced by an inner-loop cannot be expressed in terms of itself, except if it's referencing the value of the inner-loop from a previous outer-loop iteration.

THEOREM 2. If a PEG is well-formed, then for each node n in the PEG there is a unique semantic value ⟦n⟧ satisfying Equation 5.

The proof is by induction over the strongly-connected-component DAG of the PEG and the loop nesting structure ≤.

Evaluation Semantics. The meaning function ⟦·⟧ can be evaluated on demand, which provides an executable semantics for PEGs. For example, suppose we want to know the result of evalℓ(x, passℓ(y)) at some iteration state i. To determine which case of evalℓ's definition we are in, we must evaluate passℓ(y) on i. From the definition of passℓ, we must compute the minimum i that makes y true. To do this, we iterate through values of i until we find an appropriate one. The value of i we've found is the number of times the loop iterates, and we can use this i back in the evalℓ function to extract the appropriate value out of x. This example shows how an on-demand evaluation of an eval/pass sequence essentially leads to the traditional operational semantics for loops.

E-PEG Semantics. An E-PEG is a PEG with a set of equalities E between nodes. An equality between n and n′ denotes value equality: ⟦n⟧ = ⟦n′⟧. The set E forms an equivalence relation, which in turn partitions the PEG nodes into equivalence classes.

Built-in Axioms. We have developed a set of PEG built-in axioms that state properties of the primitive semantic functions. These axioms are used in our approach as a set of equality analyses that enable reasoning about primitive PEG operators. Some important built-in axioms are given below, where • denotes "don't care":

    θℓ(A, B) = θℓ(evalℓ(A, 0), B)
    evalℓ(θℓ(A, •), 0) = evalℓ(A, 0)
    evalℓ(evalℓ(A, B), C) = evalℓ(A, evalℓ(B, C))
    passℓ(true) = 0
    passℓ(θℓ(true, •)) = 0
    passℓ(θℓ(false, A)) = passℓ(A) + 1

One of the benefits of having a well-defined semantics for primitive PEG functions is that we can reason formally about these functions. In particular, using our semantics, we have proved all the axioms presented in this paper.

5.2 How PEGs enable our approach

The key feature of PEGs that makes our equality-saturation approach effective is that they are referentially transparent, which intuitively means that the value of an expression depends only on the values of its constituent expressions [5]. In our PEG representation, referential transparency can be formalized as follows:

    ∀(n, n′) ∈ N². (L(n) = L(n′) ∧ ⟦C(n)⟧ = ⟦C(n′)⟧) ⇒ ⟦n⟧ = ⟦n′⟧

This property follows from the definition in Equation (5), and the fact that for any n, L(n) is a pure mathematical function.

Referential transparency makes equality reasoning effective because it allows us to show that two expressions are equal by only considering their constituent expressions, without having to worry about side-effects. Furthermore, referential transparency has the benefit that a single node in the PEG entirely captures the value of a complex program fragment, enabling us to record equivalences between program fragments by using equivalence classes of nodes. Contrast this to CFGs, where to record equality between complex program fragments, one would have to record subgraph equality. Finally, PEGs allow us to record equalities at the granularity of individual values, for example the iteration count in a loop, rather than at the level of the entire program state. Again, contrast this to CFGs, where the simplest form of equality between program fragments would record program-state equality.
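One immediate use of this formalization is the congruence step that equality saturation engines perform: since ⟦n⟧ is determined by L(n) and ⟦C(n)⟧, two nodes with the same label and pairwise-equivalent children can be merged. A sketch, using the illustrative classes from Section 2:

    import java.util.*;

    // Sketch of a congruence pass: nodes with the same label whose children
    // lie in pairwise-equal equivalence classes are themselves equal.
    // Repeating this to a fixpoint yields congruence closure. Illustrative only.
    final class Congruence {
        static void congruenceStep(List<PegNode> nodes, EPeg epeg) {
            Map<String, PegNode> seen = new HashMap<>();
            for (PegNode n : nodes) {
                StringBuilder sig = new StringBuilder(n.op);   // label + child classes
                for (PegNode c : n.children) sig.append(',').append(epeg.find(c.id));
                PegNode prev = seen.putIfAbsent(sig.toString(), n);
                if (prev != null) epeg.addEquality(prev.id, n.id);
            }
        }
    }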
5.3 Translating between CFGs and PEGs

To incorporate PEGs and E-PEGs into our approach, we have developed the ConvertToIR and ConvertToCFG functions from Figure 5. We give only a brief overview of these algorithms, with more details in a technical report [34].

Transforming a CFG into a PEG. The key challenge in constructing a PEG from a CFG is determining the branching structure of the CFG. We perform this task with a function that, given paths from one CFG basic block to another, produces a PEG expression with φ, eval, and pass operations, specifying which path is taken under which conditions. We use this to determine the break conditions of loops and the general branching structure of the CFG. We also identify nesting depth, entry points, and back-edges of loops to construct θ nodes. We piece these components together with the instructions of each basic block to produce the PEG. Lastly, we apply some basic simplifications to remove conversion artifacts. Our conversion from CFG to PEG handles arbitrary control flow, including irreducible CFGs (by first converting them so they become reducible).

Transforming a PEG into a CFG. Intuitively, our algorithm first groups parts of the PEG into sub-PEGs; then it recursively converts these sub-PEGs into CFGs; and finally it combines these sub-CFGs into a CFG for the original PEG. The grouping is done as follows: φ nodes are grouped together whose conditions are equal, thus performing branch fusion; θ nodes are grouped together that have equal pass conditions, thus performing loop fusion. The pure mathematical nature of PEGs makes it easy to identify when two conditions are equal, which makes branch/loop fusion simple to implement.

PEGs follow the insight from Click of separating code placement issues from the IR [9]. In particular, PEGs do not represent code placement explicitly. Instead, placement is performed during the translation back to a CFG. As a result, the translation from a CFG to a PEG and back (without any saturation) ends up performing a variety of optimizations: Constant Propagation, Copy Propagation, Common Subexpression Elimination, Partial Redundancy Elimination, Unused Assignment Elimination, Unreachable Code Elimination, Branch Fusion, Loop Fusion, Loop Invariant Branch Hoisting/Sinking, Loop Invariant Code Hoisting/Sinking, and Loop Unswitching.

6. The Peggy Instantiation

We have instantiated our approach in a Java bytecode optimizer called Peggy. Recall from Figure 5 that an instantiation of our approach consists of three components: (1) an IR where equality reasoning is effective, along with the translation functions ConvertToIR and ConvertToCFG, (2) a saturation engine Saturate, and (3) a global profitability heuristic SelectBest. We now describe how each of these three components works in Peggy.

6.1 Intermediate Representation

Peggy uses the PEG and E-PEG representations which, as explained in Section 5, are well suited for our approach. Because Peggy is a Java bytecode optimizer, an additional challenge is to encode Java-specific concepts like the heap and exceptions in PEGs.

Heap. We model the heap using heap summaries which we call σ nodes. Any operation that can read and/or write some object state may have to take and/or return additional σ values. Because Java stack variables cannot be modified except by direct assignments, operations on stack variables are precise in our PEGs and do not involve σ nodes. None of these decisions of how to represent the heap are built into the PEG representation. As with any heap summarization strategy, one can have different levels of abstraction, and we have simply chosen one where all objects are put into a single summarization object σ.

Exceptions. In order to maintain the program state at points where exceptions are thrown, we bundle the exception state into our abstraction of the heap, namely the σ summary nodes. As a result, operations like division which may throw an exception, but do not otherwise modify the heap, now take and return a σ node (in addition to their regular parameters and return values). This forces the observable state at the point where an exception is thrown to be preserved by our optimization process. Furthermore, to preserve Java semantics, Peggy does not perform any optimizations across try/catch boundaries or synchronization boundaries.
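To illustrate the σ-threading, here is our own rendering of the idea; the projection operators are hypothetical names, not Peggy's exact node kinds:

    // Sketch: a possibly-throwing heap operation takes the current sigma node
    // and yields a new one, so effect order is captured by data dependences.
    final class SigmaExample {
        // division: (sigma, a, b) -> (sigma', quotient)
        static PegNode[] divide(PegNode sigma, PegNode a, PegNode b) {
            PegNode op = new PegNode("div", sigma, a, b);
            return new PegNode[] {
                new PegNode("projSigma", op),   // heap/exception state after the division
                new PegNode("projValue", op)    // the quotient itself
            };
        }
    }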
6.2 Saturation Engine

The saturation engine's purpose is to repeatedly dispatch equality analyses. In our implementation an equality analysis is a pair (p, f) where p is a trigger, which is an E-PEG pattern with free variables, and f is a callback function that should be run when the pattern p is found in the E-PEG. While running, the engine continuously monitors the E-PEG for the presence of the pattern p, and when it is discovered, the engine constructs a matching substitution, which is a map from each node in the pattern to the corresponding E-PEG node. At this point, the engine invokes f with this matching substitution as a parameter, and f returns a set of equalities that the engine adds to the E-PEG. In this way, an equality analysis will be invoked only when events of interest to it are discovered. Furthermore, the analysis does not need to search the entire E-PEG because it is provided with the matching substitution.

    Function Saturate(peg : PEG, A : set of analyses) : EPEG
    1: let epeg = CreateInitialEPEG(peg)
    2: while ∃(p, f) ∈ A, subst ∈ S . subst = Match(p, epeg) do
    3:   epeg := AddEqualities(epeg, f(subst, epeg))
    4: return epeg

Figure 7. Peggy's Saturation Engine. We use S to denote the set of all substitutions from pattern nodes to E-PEG nodes.

Figure 7 shows the pseudo-code for Peggy's saturation engine. The call to CreateInitialEPEG on the first line takes the input PEG and generates an E-PEG that initially contains no equality information. The Match function invoked in the loop condition performs pattern matching: if an analysis trigger occurs inside an E-PEG, then Match returns the matching substitution. Once a match occurs, the saturation engine uses AddEqualities to add the equalities computed by the analysis into the E-PEG.

A remaining concern in Figure 7 is how to efficiently implement the existential check on line 2. To address this challenge, we adapt techniques from the AI community. In particular, the task of finding the matches on line 2 is similar to the task of determining when rules fire in rule-based expert or planning systems. These systems make use of an efficient pattern matching algorithm called the Rete algorithm [17]. Intuitively, the Rete algorithm stores the state of partially completed matches in a set of FSMs, and when new information is added to the system, it transitions the appropriate FSM. Our saturation engine uses an adaptation of the Rete algorithm for the E-PEG domain to efficiently implement the check on line 2.

In general, equality saturation may not terminate. Termination is also a concern in traditional compilers where, for example, inlining recursive functions can lead to unbounded expansion. By using triggers to control when equality edges are added (a technique also used in automated theorem provers), we can often avoid infinite expansion. The trigger for an equality axiom typically looks for the left-hand-side of the equality, and then makes it equal to the right-hand-side.
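A naive, non-Rete rendering of Figure 7's loop, anticipating the bounded breadth-first processing described next, might look like this sketch (ours; Peggy's engine instead matches incrementally):

    import java.util.*;

    // Sketch (ours, not Peggy's Rete-based engine): a naive rendering of
    // Figure 7 that rescans a breadth-first worklist and stops after a bound.
    final class NaiveSaturate {
        static EPeg saturate(List<PegNode> initialNodes,
                             List<EqualityAnalysis> analyses, int limit) {
            EPeg epeg = new EPeg();
            Deque<PegNode> queue = new ArrayDeque<>(initialNodes); // breadth-first order
            int processed = 0;
            while (!queue.isEmpty() && processed++ < limit) {
                PegNode n = queue.removeFirst();
                for (EqualityAnalysis a : analyses) {
                    if (a.matches(n)) a.apply(n, epeg);   // may add equalities and nodes
                }
                // Nodes created by callbacks would be appended to the queue here.
            }
            return epeg;
        }
    }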
A remaining concern in Figure 7 is how to efficiently implement the existential check on line 2. To address this challenge, we adapt techniques from the AI community. In particular, the task of finding the matches on line 2 is similar to the task of determining when rules fire in rule-based expert or planning systems. These systems make use of an efficient pattern-matching algorithm called the Rete algorithm [17]. Intuitively, the Rete algorithm stores the state of partially completed matches in a set of FSMs, and when new information is added to the system, it transitions the appropriate FSM. Our saturation engine uses an adaptation of the Rete algorithm for the E-PEG domain to efficiently implement the check on line 2.

In general, equality saturation may not terminate. Termination is also a concern in traditional compilers where, for example, inlining recursive functions can lead to unbounded expansion. By using triggers to control when equality edges are added (a technique also used in automated theorem provers), we can often avoid infinite expansion. The trigger for an equality axiom typically looks for the left-hand side of the equality, and then makes it equal to the right-hand side. On occasion, though, we use more restrictive triggers to avoid expansions that are likely to be useless. For example, the trigger for the axiom equating a constant with a loop expression, used to add edge D in Figure 4, also checks that there is an appropriate "pass" expression. In this way, it does not add a loop to the E-PEG if there wasn't some kind of loop to begin with. Using our current axioms and triggers, our engine completely saturates 84% of the methods in our benchmarks.

In the remaining cases, we impose a limit on the number of expressions that the engine fully processes (where fully processing an expression includes adding all the equalities that the expression triggers). To prevent the search from running astray and exploring a single infinitely deep branch of the search space, we currently use a breadth-first order for processing new nodes in the E-PEG. This traversal strategy, however, can be customized in the implementation of the Rete algorithm to better control the search strategy in those cases where an exhaustive search would not terminate.

6.3 Global Profitability Heuristic

Peggy's SelectBest function uses a Pseudo-Boolean solver called Pueblo [32] to select which nodes from an E-PEG to include in the optimized program. A Pseudo-Boolean problem is an integer linear programming problem where all the variables have been restricted to 0 or 1. By using these 0-1 variables to represent whether or not nodes have been selected, we can encode the constraints that must hold for the selected nodes to be a well-formed computation. In particular, for each node or equivalence class x, we define a pseudo-boolean variable that takes on the value 1 (true) if we choose to evaluate x, and 0 (false) otherwise. The constraints then state that: (1) we must select the equivalence class of the return value; (2) if an equivalence class is selected, we must select one of its nodes; (3) if a node is selected, we must select its children's equivalence classes; (4) the chosen PEG is well-formed.

The cost model that we use assigns a constant cost C_n to each node n. In particular, C_n = basic_cost(n) · k^depth(n), where basic_cost(n) accounts for how expensive n is by itself, and k^depth(n) accounts for how often n is executed. We use depth(n) to denote the loop nesting depth of n, and k is simply a constant, which we have chosen as 20. Using C_n, the objective function we want to minimize is Σ_{n∈N} B_n · C_n, where N is the set of nodes in the E-PEG, and B_n is the pseudo-boolean variable for node n. Peggy asks Pueblo to minimize this objective function subject to the well-formedness constraints described above. The nodes assigned 1 in the solution that Pueblo returns are selected to form the PEG that SelectBest returns. This PEG is the lowest-cost PEG represented in the E-PEG, according to our cost model. A sketch of the shape of this encoding is given below.
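As a rough illustration, the following toy program emits 0-1 constraints (1)–(3) and the cost objective for a tiny two-class E-PEG. The example graph, the generic constraint syntax, and the omission of well-formedness constraint (4) are our own simplifications; this is not Peggy's encoder or Pueblo's actual input format.

    import java.util.List;
    import java.util.Map;

    // Illustrative sketch of the 0-1 ILP encoding described above.
    public class SelectBestSketch {
        public static void main(String[] args) {
            final int k = 20; // the constant from the cost model

            // Toy E-PEG: return class c0 = {n0, n1}; node n0 has child class c1 = {n2}.
            Map<String, List<String>> classNodes =
                Map.of("c0", List.of("n0", "n1"), "c1", List.of("n2"));
            Map<String, List<String>> childClasses =
                Map.of("n0", List.of("c1"), "n1", List.of(), "n2", List.of());
            Map<String, Integer> basicCost = Map.of("n0", 5, "n1", 9, "n2", 1);
            Map<String, Integer> depth = Map.of("n0", 1, "n1", 0, "n2", 0);

            // (1) the equivalence class of the return value must be selected
            System.out.println("c0 = 1");
            // (2) a selected class must select at least one of its nodes
            classNodes.forEach((c, nodes) ->
                System.out.println(String.join(" + ", nodes) + " >= " + c));
            // (3) a selected node must select each of its children's classes
            childClasses.forEach((n, classes) ->
                classes.forEach(c -> System.out.println(c + " >= " + n)));
            // constraint (4), PEG well-formedness, is omitted from this sketch

            // objective: minimize the sum of B_n * basic_cost(n) * k^depth(n)
            StringBuilder objective = new StringBuilder("minimize:");
            for (String n : basicCost.keySet()) {
                long cost = basicCost.get(n) * (long) Math.pow(k, depth.get(n));
                objective.append(" ").append(cost).append("*").append(n);
            }
            System.out.println(objective);
        }
    }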
7. Evaluation

In this section we use our Peggy implementation to validate three hypotheses about our approach for structuring optimizers: our approach is practical both in terms of space and time (Section 7.1), it is effective at discovering both simple and intricate optimization opportunities (Section 7.2), and it is effective at performing translation validation (Section 7.3).

7.1 Time and space overhead

To evaluate the running time of the various Peggy components, we compiled SpecJVM, which comprises 2,461 methods. For 1% of these methods, Pueblo exceeded a one-minute timeout we imposed on it, in which case we just ran the conversion to PEG and back. We imposed this timeout because in some rare cases, Pueblo runs too long to be practical. The following table shows the average time in milliseconds taken per method for the 4 main Peggy phases (for Pueblo, a timeout counts as 60 seconds).

            CFG to PEG   Saturation   Pueblo     PEG to CFG
    Time    13.9 ms      87.4 ms      1,499 ms   52.8 ms

All phases combined take slightly over 1.5 seconds. An end-to-end run of Peggy is on average 6 times slower than Soot with all of its intraprocedural optimizations turned on. Nearly all of our time is spent in the Pseudo-Boolean solver. We have not focused our efforts on compile-time, and we conjecture there is significant room for improvement, such as better pseudo-boolean encodings, or other kinds of profitability heuristics that run faster.

Since Peggy is implemented in Java, to evaluate memory footprint, we limited the JVM to a heap size of 200 MB, and observed that Peggy was able to compile all the benchmarks without running out of memory.

In 84% of compiled methods, the engine ran to complete saturation, without imposing bounds. For the remaining cases, the engine limit of 500 was reached, meaning that the engine ran until fully processing 500 expressions in the E-PEG, along with all the equalities they triggered. In these cases, we cannot provide a completeness guarantee, but we can give an estimate of the size of the explored state space. In particular, using just 200 MB of heap, our E-PEGs represented more than 2^103 versions of the input program (using the geometric average).

7.2 Implementing optimizations

The main goal of our evaluation is to demonstrate that common, as well as unanticipated, optimizations result in a natural way from our approach. To achieve this, we implemented a set of basic equality analyses, listed in Figure 8(a). We then manually browsed through the code that Peggy generates on a variety of benchmarks (including SpecJVM) and made a list of the optimizations that we observed. Figure 8(b) shows the optimizations that we observed fall out from our approach using equality analyses 1 through 6, and Figure 8(c) shows optimizations that we observed fall out from our approach using equality analyses 1 through 7.

With effort similar to what would be required for a compiler writer to implement the optimizations from part (a), our approach enables the more advanced optimizations from parts (b) and (c). Peggy performs some optimizations (for example, 15 through 19) that are quite complex given the simplicity of its equality analyses. To implement such optimizations in a traditional compiler, the compiler writer would have to explicitly design a pattern that is specific to those optimizations. In contrast, with our approach these optimizations fall out from the interaction of basic equality analyses without any additional developer effort, and without specifying an order in which to run them. Essentially, Peggy finds the right sequence of equality analyses to apply for producing the effect of these complex optimizations.

In terms of generated-code quality, Peggy generates code whose performance is comparable to the code generated by Soot's intraprocedural optimizations, which include: common subexpression elimination, lazy code motion, copy propagation, constant propagation, constant folding, conditional branch folding, dead assignment elimination, and unreachable code elimination. However, intraprocedural optimizations on Java bytecode generally produce only small gains (on the order of a few percent). In the rare cases where significant gains are to be had from intraprocedural optimizations, Peggy excelled.
Soot can also perform interprocedural optimizations, such as class-hierarchy analysis, pointer analysis, and method specialization. We did not enable these optimizations when performing our comparison against Soot, because we have not yet attempted to express any interprocedural optimizations. We conjecture that interprocedural optimizations can be expressed as equality analyses, and leave this exploration to future work.

(a) Equality Analyses
1. Built-in E-PEG axioms: Axioms from Section 5.1 stating properties of primitive PEG nodes (φ, θ, eval, pass)
2. Basic Arithmetic Axioms: Axioms stating properties of arithmetic operators like +, −, ∗, /, <<, >>
3. Constant Folding: Equates a constant expression with its constant value
4. Java-specific Axioms: Axioms stating properties of Java-specific operators like field and array accesses
5. Tail-recursion Elimination: Replaces the body of a tail-recursive procedure with a loop
6. Method Inlining: Inlining based on intraprocedural class analysis to disambiguate dynamic dispatch
7. Domain-specific Axioms: User-provided axioms about certain application domains (optional)

(b) Optimizations
8. Constant Propagation and Folding: Traditional constant propagation and folding
9. Algebraic Simplification: Various forms of traditional algebraic simplifications
10. Peephole Strength Reduction: Various forms of traditional peephole optimizations
11. Array Copy Propagation: Replaces a read of an array element by the expression it was previously assigned
12. CSE for Array Elements: Removes redundant array accesses
13. Loop Peeling: Pulls the first iteration of a loop outside of the loop
14. LIVSR: Optimization described in Section 2, namely Loop-Induction-Variable Strength Reduction
15. Interloop Strength Reduction: Optimization described in Section 3
16. Entire-loop Strength Reduction: Transforms an entire loop into one operation, e.g. a loop doing i increments becomes "plus i"
17. Loop-operation Factoring: Factors expensive operations, for example multiplication, out of a loop
18. Loop-operation Distributing: Distributes an expensive operation into a loop, where it cancels out with another operation
19. Partial Inlining: Inlines part of a method in the caller but keeps the call for the rest of the computation

(c) Domain-specific Optimizations
20. Domain-specific LIVSR: LIVSR, but with domain operations like matrix/vector addition and multiply
21. Domain-specific Code Hoisting: Code hoisting based on domain-specific invariance axioms
22. Domain-specific Redundancy Removal: Removes redundant computations based on domain axioms
23. Temporary Object Removal: Removes temporary objects created by calls to matrix/vector libraries
24. Specialization of Math Libraries: Specializes vector/matrix algorithms based on, for example, the size of the vector/matrix
25. Design-pattern Optimizations: Removes the overhead of certain design patterns, like the indirection or interpreter pattern
26. Method Outlining: Replaces code by a call to a method performing the same computation, but more efficiently
27. Specialized Redirection: Replaces a method call with a call to a more efficient version based on the calling context

Figure 8. Optimizations performed by Peggy

With the addition of domain-specific axioms, our approach enables even more optimizations, as shown in part (c). To give a flavor for these domain-specific optimizations, we describe two examples.

The first is a ray tracer application (5 KLOC) that one of the authors had previously developed. To make the implementation clean and easy to understand, the author used immutable vector objects in a functional style. This approach, however, introduces many intermediate objects. With a few simple vector axioms, Peggy is able to remove the overhead of these temporary objects, thus performing a kind of deforestation optimization. This makes the application 7% faster, and reduces the number of allocated objects by 40%. Soot is not able to recover any of the overhead, even with interprocedural optimizations turned on. This is an instance of a more general technique where user-defined axioms allow Peggy to remove temporary objects (optimization 23 in Figure 8).

Our second example targets a common programming idiom involving Lists, which consists of checking that a List contains an element e, and if it does, fetching and using the index of the element. If written cleanly, this pattern would be implemented with a branch whose guard is contains(e) and a call to indexOf(e) on the true side of the branch. Unfortunately, contains and indexOf would perform the same linear search, which makes this clean way of writing the code inefficient. Using the equality axiom l.contains(e) = (l.indexOf(e) ≠ −1), Peggy can convert the clean code into the hand-optimized code that programmers typically write, which stores indexOf(e) into a temporary, and then branches if the temporary is not −1. An extensible rewrite system would not be able to provide the same easy solution: although a rewrite of l.contains(e) to (l.indexOf(e) ≠ −1) would remove the redundancy mentioned above, it could also degrade performance in the case where the list implements an efficient hash-based contains. In our approach, the equality simply adds information to the E-PEG, and the profitability heuristic can decide after saturation which option is best, taking the entire context into account. In this way our approach transforms contains to indexOf, but only if indexOf would have been called anyway. Both versions of the idiom are shown below.

These two examples illustrate the benefits of user-defined axioms. In particular, the clean, readable, and maintainable way of writing code can sometimes incur performance overheads. User-defined axioms allow the programmer to reduce these overheads while keeping the code base clean of performance-related hacks. Our approach makes domain-specific axioms easier to add for the end-user programmer, because the programmer does not need to worry about what order the user-defined axioms should be run in, or how they will interact with the compiler's internal optimizations.
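For concreteness, here is the list idiom written both ways in plain Java; this is our own illustration of the two programs related by the axiom, not code taken from Peggy's benchmarks.

    import java.util.List;

    public class ContainsIndexOf {
        // Clean but inefficient: contains and indexOf both scan the list.
        static void useClean(List<String> l, String e) {
            if (l.contains(e)) {
                System.out.println("index = " + l.indexOf(e));
            }
        }

        // Hand-optimized: a single scan, storing the index in a temporary.
        static void useOptimized(List<String> l, String e) {
            int i = l.indexOf(e);
            if (i != -1) {
                System.out.println("index = " + i);
            }
        }

        public static void main(String[] args) {
            useClean(List.of("a", "b"), "b");
            useOptimized(List.of("a", "b"), "b");
        }
    }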
7.3 Translation Validation

We used Peggy to perform translation validation for the Soot optimizer [36]. In particular, we used Soot to optimize a set of benchmarks with all of its intraprocedural optimizations turned on. The benchmarks included SpecJVM, along with other programs, comprising a total of 3,416 methods. After Soot finished compiling, for each method we asked Peggy's saturation engine to show that the original method was equivalent to the corresponding method that Soot produced. The engine was able to show that 98% of the methods were compiled correctly.

Among the cases that Peggy was unable to validate, we found three methods that Soot optimized incorrectly. In particular, Soot incorrectly pulled statements outside of an intricate loop, transforming a terminating loop into an infinite loop. It is a testament to the power of our approach that it is able not only to perform optimizations, but also to validate a large fraction of Soot runs, and that in doing so it exposed a bug in Soot. Furthermore, because most false positives are a consequence of our coarse heap model (a single σ node), a finer-grained model can increase the effectiveness of translation validation, and it would also enable more optimizations.

Our equality saturation engine can easily be extended so that, after each translation validation, it generates a machine-checkable proof of equivalence. With this in place, the trusted computing base for our translation validator would only be: (1) the proof checker, (2) the built-in axioms used in translation validation, most of which we have proved by hand, and (3) the conversion from Java bytecode to PEG.

8. Related Work

Superoptimizers. Our approach of computing a set of programs and then choosing from this set is related to the approach taken by super-optimizers [24, 18, 4, 16]. Super-optimizers strive to produce optimal code, rather than simply improve programs. Although super-optimizers can generate (near) optimal code, they have so far scaled only to small code sizes, mostly straight-line code. Our approach, on the other hand, is meant as a general-purpose paradigm that can optimize branches and loops, as shown by the inter-loop optimization from Section 3.
Our approach was inspired by Denali [21], a super-optimizer for finding near-optimal ways of computing a given basic block. Denali represents the computations performed in the basic block as an expression graph, and applies axioms to create an E-graph data structure representing the various ways of computing the values in the basic block. It then uses repeated calls to a SAT solver to find the best way of computing the basic block given the equalities stored in the E-graph. The biggest difference between our work and Denali is that our approach can perform intricate optimizations involving branches and loops. On the other hand, the Denali cost model is more precise than ours because it assigns costs to entire sequences of operations, and so it can take into account the effects of scheduling and register allocation.

Rewrite-based optimizers. Axioms or rewrite rules have been used in many compilation systems, for example TAMPR [6], ASF+SDF [37], the ML compilation system of Visser et al. [38], and Stratego [7]. These systems, however, perform transformations in sequence, with each axiom or rewrite rule destructively updating the IR. Typically, such compilers also provide a mechanism for controlling the application of rewrites through built-in or user-defined strategies. Our approach, in contrast, does not use strategies – we instead simultaneously explore all possible optimization orderings, while avoiding redundant work. Furthermore, even with no strategies, we can perform a variety of intricate optimizations.

Optimization Ordering. Many research projects have been aimed at mitigating the phase ordering problem, including automated assistance for exploring the enabling and disabling properties of optimizations [40, 41], automated techniques for generating good sequences [10, 1, 22], manual techniques for combining analyses and optimizations [8], and automated techniques for the same purpose [23]. However, we tackle the problem from a different perspective than previous approaches, in particular, by simultaneously exploring all possible sequences of optimizations, up to some bound. Aside from the theoretical guarantees from Section 4, our approach can do well even if every part of the input program requires a different ordering.

Translation Validation. Although previous approaches to translation validation have been explored [31, 30, 25, 43], our approach has the advantage that it can perform translation validation by using the same technique as for optimization.

Intermediate Representations. Our main contribution is an approach for structuring optimizers based on equality saturation. However, to make our approach effective, we have also designed the E-PEG representation. There has been a long line of work on developing IRs that make analysis and optimizations easier to perform [11, 2, 35, 19, 15, 39, 9, 33, 29]. The key distinguishing feature of E-PEGs is that a single E-PEG can represent many optimized versions of the input program, which allows us to use global profitability heuristics and to perform translation validation.

We now compare the PEG component of our IR with previous IRs. PEGs are related to SSA [11], gated SSA [35], and thinned-gated SSA [19]. The µ function from gated SSA is similar to our θ function, and the η function is similar to our eval/pass pair. However, unlike PEGs, all these variants of SSA are tied to an underlying CFG representation.

Program Dependence Graphs [15] and the Program Dependence Web [28] represent control information by grouping together operations that execute in the same control region. However, these IRs are still statement based, and maintain explicit control edges.

Like PEGs, the Value Dependence Graph [39] (VDG) is a complete functional representation. VDGs use λ nodes (i.e. regular function abstraction) to represent loops, whereas we use specialized θ, eval, and pass nodes. These specialized nodes, combined with simple axioms about them, allow us to perform intricate optimizations across loops, such as the optimization from Section 3.

Dependence Flow Graphs [29] (DFGs) are a complete and executable representation of programs based on dependencies. However, DFGs employ a side-effecting storage model with an imperative store operation, whereas our representation is entirely functional, making equational reasoning more natural.

Dataflow Languages. Our PEG intermediate representation is related to the broad area of dataflow languages [20]. The most closely related is the language Lucid [3], in which variables are maps from iteration counts to possibly undefined values, as in our PEGs. Lucid's first/next operators are similar to our θ nodes, and Lucid's "as soon as" operator is similar to our eval/pass pair. However, Lucid and PEGs differ in their intended use and application. Lucid is a programming language designed to make formal proofs of correctness easier to do, whereas Peggy uses equivalences of PEG nodes to optimize code expressed in existing imperative languages. Furthermore, we incorporate a monotonize function into our semantics and axioms, which guarantees the correctness of our conversion to and from CFGs with loops.

Theorem Proving. Because most of our reasoning is performed using simple axioms, our work is related to the broad area of automated theorem proving. The theorem prover that most inspired our work is Simplify [13], with its E-graph data structure for representing equalities [27]. Our E-PEGs are in essence specialized E-graphs for reasoning about PEGs. Furthermore, the way our analyses communicate through equality is conceptually similar to the equality-propagation approach used in Nelson-Oppen theorem provers [26].

Execution Indices. Execution indices identify the state of progress of an execution [14, 42]. The call stack typically acts as the interprocedural portion, and the loop iteration counts in our semantics can act as the intraprocedural portion. As a result, one of the benefits of PEGs is that they make intraprocedural execution indices explicit.
9. Conclusion and future work

We have presented a new approach to structuring optimizers that is based on equality saturation. Our approach has a variety of benefits over previous compilation models: it addresses the phase ordering problem, it enables global profitability heuristics, and it performs translation validation.

There are a variety of directions for future work. One direction is to extend Peggy so that it generates a proof of correctness for the optimizations it performs. Each step in this proof would be the application of an equality analysis. Since the majority of our analyses are axiom applications, these proofs would be similar to standard mathematical proofs. We would then like to use these proofs as a way of automatically generating optimizations. In particular, by determining which nodes of the original PEG the proof depends on, and what properties of these nodes are important, we will explore how one can generalize not only the proof but also the transformation. Using such an approach, we hope to develop a compiler that can learn optimizations as it compiles.

References

[1] L. Almagor, K. D. Cooper, A. Grosul, T. J. Harvey, S. W. Reeves, D. Subramanian, L. Torczon, and T. Waterman. Finding effective compilation sequences. In LCTES, 2004.
[2] B. Alpern, M. Wegman, and F. Zadeck. Detecting equality of variables in programs. In POPL, January 1988.
[3] E. A. Ashcroft and W. W. Wadge. Lucid, a nonprocedural language with iteration. Communications of the ACM, 20(7):519–526, 1977.
[4] S. Bansal and A. Aiken. Automatic generation of peephole superoptimizers. In ASPLOS, 2006.
[5] R. Bird and P. Wadler. Introduction to Functional Programming. Prentice Hall, 1988.
[6] James M. Boyle, Terence J. Harmer, and Victor L. Winter. The TAMPR program transformation system: simplifying the development of numerical software. Modern Software Tools for Scientific Computing, pages 353–372, 1997.
[7] M. Bravenboer, K. T. Kalleberg, R. Vermaas, and E. Visser. Stratego/XT 0.17. A language and toolset for program transformation. Science of Computer Programming, 72(1-2):52–70, 2008.
[8] C. Click and K. D. Cooper. Combining analyses, combining optimizations. Transactions on Programming Languages and Systems, 17(2):181–196, 1995.
[9] C. Click. Global code motion/global value numbering. In PLDI, June 1995.
[10] K. D. Cooper, P. J. Schielke, and D. Subramanian. Optimizing for reduced code space using genetic algorithms. In LCTES, 1999.
[11] R. Cytron, J. Ferrante, B. Rosen, M. Wegman, and K. Zadeck. An efficient method for computing static single assignment form. In POPL, January 1989.
[12] Jeffrey Dean and Craig Chambers. Towards better inlining decisions using inlining trials. In Conference on LISP and Functional Programming, 1994.
[13] D. Detlefs, G. Nelson, and J. Saxe. Simplify: A theorem prover for program checking. Journal of the Association for Computing Machinery, 52(3):365–473, May 2005.
[14] E. Dijkstra. Go to statement considered harmful. pages 27–33, 1979.
[15] J. Ferrante, K. Ottenstein, and J. Warren. The program dependence graph and its use in optimization. Transactions on Programming Languages and Systems, 9(3):319–349, July 1987.
[16] Christopher W. Fraser, Robert R. Henry, and Todd A. Proebsting. BURG – fast optimal instruction selection and tree parsing. SIGPLAN Notices, 27(4):68–76, April 1992.
[17] J. Giarratano and G. Riley. Expert Systems – Principles and Programming. PWS Publishing Company, 1993.
[18] Torbjorn Granlund and Richard Kenner. Eliminating branches using a superoptimizer and the GNU C compiler. In PLDI, 1992.
[19] P. Havlak. Construction of thinned gated single-assignment form. In Workshop on Languages and Compilers for Parallel Computing, 1993.
[20] W. M. Johnston, J. R. P. Hanna, and R. J. Millar. Advances in dataflow programming languages. ACM Computing Surveys, 36(1):1–34, 2004.
[21] R. Joshi, G. Nelson, and K. Randall. Denali: a goal-directed superoptimizer. In PLDI, June 2002.
[22] K. D. Cooper, D. Subramanian, and L. Torczon. Adaptive optimizing compilers for the 21st century. The Journal of Supercomputing, pages 7–22, 2002.
[23] S. Lerner, D. Grove, and C. Chambers. Composing dataflow analyses and transformations. In POPL, January 2002.
[24] Henry Massalin. Superoptimizer: a look at the smallest program. In ASPLOS, 1987.
[25] G. Necula. Translation validation for an optimizing compiler. In PLDI, June 2000.
[26] G. Nelson and D. Oppen. Simplification by cooperating decision procedures. Transactions on Programming Languages and Systems, 1(2):245–257, October 1979.
[27] G. Nelson and D. Oppen. Fast decision procedures based on congruence closure. Journal of the Association for Computing Machinery, 27(2):356–364, April 1980.
[28] K. Ottenstein, R. Ballance, and A. MacCabe. The program dependence web: a representation supporting control-, data-, and demand-driven interpretation of imperative languages. In PLDI, June 1990.
[29] K. Pingali, M. Beck, and R. Johnson. Dependence flow graphs: an algebraic approach to program dependencies. In POPL, January 1991.
[30] A. Pnueli, M. Siegel, and E. Singerman. Translation validation. In TACAS, 1998.
[31] H. Samet. Proving the correctness of heuristically optimized code. Communications of the ACM, 21(7):570–582, July 1978.
[32] H. Sheini and K. Sakallah. Pueblo: A hybrid pseudo-boolean SAT solver. Journal on Satisfiability, Boolean Modeling and Computation, 2:61–96, 2006.
[33] B. Steffen, J. Knoop, and O. Ruthing. The value flow graph: A program representation for optimal program transformations. In European Symposium on Programming, 1990.
[34] Ross Tate, Michael Stepp, Zachary Tatlock, and Sorin Lerner. Translating between PEGs and CFGs. Technical Report CS2008-0931, University of California, San Diego, November 2008.
[35] P. Tu and D. Padua. Efficient building and placing of gating functions. In PLDI, June 1995.
[36] R. Vallée-Rai, L. Hendren, V. Sundaresan, P. Lam, E. Gagnon, and P. Co. Soot - a Java optimization framework. In CASCON, 1999.
[37] M. G. J. van den Brand, J. Heering, P. Klint, and P. A. Olivier. Compiling language definitions: the ASF+SDF compiler. Transactions on Programming Languages and Systems, 24(4), 2002.
[38] E. Visser, Z. Benaissa, and A. Tolmach. Building program optimizers with rewriting strategies. In ICFP, 1998.
[39] D. Weise, R. Crew, M. Ernst, and B. Steensgaard. Value dependence graphs: Representation without taxation. In POPL, 1994.
[40] Debbie Whitfield and Mary Lou Soffa. An approach to ordering optimizing transformations. In PPOPP, 1990.
[41] Deborah L. Whitfield and Mary Lou Soffa. An approach for exploring code improving transformations. Transactions on Programming Languages and Systems, 19(6):1053–1084, November 1997.
[42] B. Xin, W. N. Sumner, and X. Zhang. Efficient program execution indexing. In PLDI, June 2008.
[43] Lenore Zuck, Amir Pnueli, Yi Fang, and Benjamin Goldberg. VOC: A methodology for the translation validation of optimizing compilers. Journal of Universal Computer Science, 9(3):223–247, March 2003.