Practical and Effective Higher-Order Optimizations

PREPRINT: To be presented at ICFP ’14, September 1–6, 2014, Gothenburg, Sweden.

Lars Bergstrom (Mozilla Research ∗), [email protected]
Matthew Fluet, Matthew Le (Rochester Institute of Technology), fmtf,[email protected]
John Reppy, Nora Sandler (University of Chicago), fjhr,[email protected]

∗ Portions of this work were performed while the author was at the University of Chicago.

Abstract

Inlining is an optimization that replaces a call to a function with that function’s body. This optimization not only reduces the overhead of a function call, but can expose additional optimization opportunities to the compiler, such as removing redundant operations or unused conditional branches. Another optimization, copy propagation, replaces a redundant copy of a still-live variable with the original. Copy propagation can reduce the total number of live variables, reducing register pressure and memory usage, and possibly eliminating redundant memory-to-memory copies. In practice, both of these optimizations are implemented in nearly every modern compiler.

These two optimizations are practical to implement and effective in first-order languages, but in languages with lexically-scoped first-class functions (aka, closures), these optimizations are not available to code programmed in a higher-order style. With higher-order functions, the analysis challenge has been that the environment at the call site must be the same as at the closure capture location, up to the free variables, or the meaning of the program may change. Olin Shivers’ 1991 dissertation called this family of optimizations Super-β and he proposed one analysis technique, called reflow, to support these optimizations. Unfortunately, reflow has proven too expensive to implement in practice. Because these higher-order optimizations are not available in functional-language compilers, programmers studiously avoid uses of higher-order values that cannot be optimized (particularly in compiler benchmarks).

This paper provides the first practical and effective technique for Super-β (higher-order) inlining and copy propagation, which we call unchanged variable analysis. We show that this technique is practical by implementing it in the context of a real compiler for an ML-family language and showing that the required analyses have costs below 3% of the total compilation time. This technique’s effectiveness is shown through a set of benchmarks and example programs, where this analysis exposes additional potential optimization sites.

Categories and Subject Descriptors: D.3.0 [Programming Languages]: General; D.3.2 [Programming Languages]: Language Classifications - Applicative (functional) languages; D.3.4 [Programming Languages]: Processors - Optimization

Keywords: control-flow analysis, inlining, optimization

1. Introduction

All high-level programming languages rely on compiler optimizations to transform a language that is convenient for software developers into one that runs efficiently on target hardware. Two such common compiler optimizations are copy propagation and function inlining. Copy propagation in a language like ML is a simple substitution. Given a program of the form:

    let val x = y
    in
      x*2+y
    end

We want to propagate the definition of x to its uses, resulting in

    let val x = y
    in
      y*2+y
    end

At this point, we can eliminate the now unused x, resulting in

    y*2+y

This optimization can reduce the resource requirements (i.e., register pressure) of a program, and it may open the possibility for further simplifications in later optimization phases.

Inlining replaces a lexically inferior application of a function with the body of that function by performing straightforward β-substitution. For example, given the program

    let fun f x = 2*x
    in
      f 3
    end

inlining f and removing the unused definition results in

    2*3

This optimization removes the cost of the function call and opens the possibility of further optimizations, such as constant folding. Inlining does require some care, however, since it can increase the program size, which can negatively affect the instruction cache performance, negating the benefits of eliminating call overhead. The importance of inlining for functional languages and techniques for providing predictable performance are well-covered in the context of GHC by Peyton Jones and Marlow [PM02].

Both copy propagation and function inlining have been well-studied and are widely implemented in modern compilers for both first-order and higher-order programming languages. In this paper, we are interested in the higher-order version of these optimizations, which are not used in practice because of the cost of the supporting analysis.

For example, consider the following iterative program:

    let
      fun emit x = print (Int.toString x)
      fun fact i m k =
        if i=0 then k m
        else fact (i-1) (m*i) k
    in
      fact 6 1 emit
    end

Higher-order copy propagation would allow the compiler to propagate the function emit into the body of fact, resulting in the following program:

    let
      fun emit x = print (Int.toString x)
      fun fact i m k =
        if i=0 then emit m
        else fact (i-1) (m*i) emit
    in
      fact 6 1 emit
    end

This transformation has replaced an indirect call to emit (via the variable k) with a direct call. Direct calls are typically faster than indirect calls,¹ and are amenable to function inlining. Furthermore, the parameter k can be eliminated using useless variable elimination [Shi91], which results in a smaller program that uses fewer resources:

    let
      fun emit x = print (Int.toString x)
      fun fact i m =
        if i=0 then emit m
        else fact (i-1) (m*i)
    in
      fact 6 1
    end

Function inlining also has a similar higher-order counterpart (what Shivers calls Super-β). For example, in the following program, we can inline the body of pr at the call site inside fact, despite the fact that pr is not in scope in fact:

    let
      val two = 2
      fun fact i m k =
        if i=0
          then k m
          else fact (i-1) (m*i) k
      fun pr x = print (Int.toString (x*two))
    in
      fact 6 1 pr
    end

Inlining pr produces

    let
      val two = 2
      fun fact i m k =
        if i=0 then print (Int.toString (m*two))
        else fact (i-1) (m*i) k
      fun pr x = print (Int.toString (x*two))
    in
      fact 6 1 pr
    end

This resulting program is now eligible for constant propagation and useless variable elimination.

While some compilers can reproduce similar results on trivial examples such as these, implementing either of these optimizations in their full generality requires an environment-aware analysis to prove their safety. In the case of copy propagation, we need to ensure that the variable being substituted has the same value at the point that it is being substituted as it did when it was passed in. Similarly, if we want to substitute the body of a function at its call site, we need to ensure that all of the free variables have the same values at the call site as they did at the point where the function was defined. Today, developers using higher-order languages often avoid writing programs that have non-trivial environment usage within code that is run in a loop unless they have special knowledge of either the compiler or additional annotations on library functions (e.g., map) that will enable extra optimization.

This paper presents a new approach to control-flow analysis (CFA) that supports more optimization opportunities for higher-order programs than are possible in either type-directed optimizers, heuristics-based approaches, or by using library-method annotations. We use the example of transducers [SM06], a higher-order and compositional programming style, to further motivate these optimizations and to explain our novel analysis.

Our contributions are:

• A novel, practical environment analysis (Section 5) that provides a conservative approximation of when two occurrences of a variable will have the same binding.
• Timing results (Section 9) for the implementation of this analysis and related optimizations, showing that it requires less than 3% of overall compilation time.
• Performance results for several benchmarks, showing that even highly tuned programs still contain higher-order optimization opportunities.

Source code for our complete implementation and all the benchmarks described in this paper is available at: http://smlnj-gforge.cs.uchicago.edu/projects/manticore/.

2. Manticore

The techniques described in this paper have been developed as part of the Manticore project and are implemented in the compiler for Parallel ML, which is a parallel dialect of Standard ML [FRRS11]. In this section, we give an overview of the host compiler and intermediate representation upon which we perform our analysis and optimizations. The compiler is a whole-program compiler, reading in the files in the source code alongside the sources from the runtime library. As covered in more detail in an earlier paper [FFR+07], there are six distinct intermediate representations (IRs) in the Manticore compiler:

1. Parse tree: the product of the parser.
2. AST: an explicitly-typed abstract-syntax tree representation.
3. BOM: a direct style normalized λ-calculus.
4. CPS: a continuation-passing style λ-calculus.
5. CFG: a first-order control-flow graph representation.
6. MLTree: the expression tree representation used by the MLRISC code generation framework [GGR94].

The work in this paper is performed on the CPS representation.
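The environment condition from the introduction (free variables must have the same bindings at the call site as at the point of capture) can be illustrated with a small sketch. This example is not from the paper; the names loop and loopNaive are hypothetical and chosen only for illustration. The closure fn () => n captures the binding of n that is live when the closure is created, so textually copying its body to the call site k (), where a later binding of n is in scope, changes the result.

```sml
(* Hypothetical example, not from the paper: why the environment matters. *)

(* Each recursive call creates a fresh binding of n. The closure
   (fn () => n) captures the binding that is live when it is created. *)
fun loop (n, k) =
  if n = 0 then k ()
  else loop (n - 1, fn () => n)

(* Naive "inlining" of the closure body n at the call site k ().
   The copied body now reads the binding of n at the call site,
   which is not the binding the closure captured. *)
fun loopNaive (n, k : unit -> int) =
  if n = 0 then n
  else loopNaive (n - 1, fn () => n)

(* The closure that reaches k () was created when n = 1. *)
val captured = loop (3, fn () => ~1)       (* 1 *)

(* The naively inlined body reads n at the call site, where n = 0. *)
val naive = loopNaive (3, fn () => ~1)     (* 0 *)

val () = print ("captured = " ^ Int.toString captured ^ "\n")
val () = print ("naive    = " ^ Int.toString naive ^ "\n")
```

The two results differ, which is why a compiler needs evidence that the relevant variable is unchanged between capture and call before performing this kind of substitution. In the paper's pr example, by contrast, the free variable two is bound once and has the same binding at both points.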
2.1 CPS

Continuation-passing style (CPS) is the final high-level representation used in the compiler before closure conversion generates a

¹ Direct calls to known functions can use specialized calling conventions that are more efficient and provide more predictability to hardware instruction-prefetch mechanisms.
