Making Collection Operations Optimal with Aggressive JIT Compilation
Aleksandar Prokopec (Oracle Labs, Switzerland) [email protected]
David Leopoldseder (Johannes Kepler University, Linz, Austria) [email protected]
Gilles Duboscq (Oracle Labs, Switzerland) [email protected]
Thomas Würthinger (Oracle Labs, Switzerland) [email protected]

Abstract
Functional collection combinators are a neat and widely accepted data processing abstraction. However, their generic nature results in high abstraction overheads – Scala collections are known to be notoriously slow for typical tasks. We show that proper optimizations in a JIT compiler can widely eliminate the overheads imposed by these abstractions. Using the open-source Graal JIT compiler, we achieve speedups of up to 20× on collection workloads compared to the standard HotSpot C2 compiler. Consequently, a sufficiently aggressive JIT compiler allows the language compiler, such as Scalac, to focus on other concerns.

In this paper, we show how optimizations such as inlining, polymorphic inlining, and partial escape analysis are combined in Graal to produce collections code that is optimal with respect to manually written code, or close to optimal. We argue why some of these optimizations are more effectively done by a JIT compiler. We then identify specific use-cases that most current JIT compilers do not optimize well, warranting special treatment from the language compiler.

CCS Concepts • Software and its engineering → Just-in-time compilers;

Keywords collections, data-parallelism, program optimization, JIT compilation, functional combinators, partial escape analysis, iterators, inlining, scalar replacement

1 Introduction
Compilers for most high-level languages, such as Scala, transform the source program into an intermediate representation. This intermediate program representation is executed by the runtime system. Typically, the runtime contains a just-in-time (JIT) compiler, which compiles the intermediate representation into machine code. Although JIT compilers focus on optimizations, there are generally no guarantees about the program patterns that result in fast machine code – most optimizations emerge organically, as a response to a particular programming paradigm in the high-level language. Scala's functional-style collection combinators are a paradigm that the JVM runtime was unprepared for.

Historically, Scala collections [Odersky and Moors 2009; Prokopec et al. 2011] have been somewhat of a mixed blessing. On one hand, they helped Scala gain popularity immensely – code written using collection combinators is more concise and easier to write compared to Java-style foreach loops. High-level declarative combinators are exactly the kind of paradigm that Scala advocates. And yet, Scala was often criticized for having suboptimal performance because of its collection framework, with reported performance overheads of up to 33× [Biboudis et al. 2014]. Despite several attempts at making them more efficient [Dragos and Odersky 2009; Prokopec and Petrashko 2013; Prokopec et al. 2015], Scala collections currently have suboptimal performance compared to their Java counterparts.

In this paper, we show that, despite the conventional wisdom about the HotSpot C2 compiler, most overheads introduced by the Scala standard library collections can be removed and optimal code can be generated with a sufficiently aggressive JIT compiler. We validate this claim using the open-source Graal compiler [Duboscq et al. 2013], and we explain how. Concretely:

• We identify and explain the abstraction overheads present in the Scala collections library (Section 2).
• We summarize the relevant optimizations used by Graal to eliminate those abstractions (Section 3).
• We present three case studies, showing how Graal's optimizations interact to produce optimal collections code (Section 4).
• We show a detailed performance comparison between the Graal compiler and the HotSpot C2 compiler on Scala collections and on several Spark data-processing workloads. We focus both on microbenchmarks and application benchmarks (Section 5).
• We postulate that profiling information enables a JIT compiler to better optimize collection operations compared to a static compiler. We then identify the optimizations that current JIT compilers are unlikely to do, and conclude that language compilers should focus on optimizing data structures more than on code optimization (Section 7).

Note that the purpose of this paper is not to explain each of the optimizations used in Graal in detail – this was either done elsewhere, or will be done in the future. Rather, the goal is to explain how we made Graal's optimizations interact in order to produce optimal collections code. To reiterate, we do not claim novelty for any specific optimization that we describe. Instead, we show how the correct set of optimizations is combined to produce optimal collections code. As such, we hope that our findings will guide language implementors, and drive future design decisions in Scala (as well as other languages that run on the JVM).

2 Overheads in Scala Collection Operations
On a high level, we categorize the abstraction overheads present in Scala collections into several groups [Prokopec et al. 2014]. First, most Scala collections are generic in their parameter type. At the same time, Scala relies on type erasure to eliminate generic type parameters in the intermediate code (similar to Java [Bracha et al. 2003]). At the bytecode level, collection methods such as foldLeft, indexOf or += take arguments of the type Object (in Scala terms, AnyRef). When a collection is instantiated with a value type, such as Int, most method invocations implicitly allocate heap objects to wrap the values, which is called boxing.
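To make the boxing overhead concrete, consider the following sketch (ours, not a listing from the original text), which sums the elements of a List[Int] with foldLeft:

    // After type erasure, foldLeft is compiled roughly as
    //   def foldLeft(z: Object)(op: Function2[Object, Object, Object]): Object
    // so even though this code only ever sums primitive Ints, each element
    // and each intermediate accumulator value crosses the call to `op` as a
    // heap-allocated java.lang.Integer, and every intermediate sum that `op`
    // returns is boxed anew.
    val xs: List[Int] = List(1, 2, 3, 4)
    val sum = xs.foldLeft(0)((acc, x) => acc + x) // sum == 10

Unless the JIT compiler inlines foldLeft and the lambda, and then removes the box allocations (e.g. via escape analysis), this cost is paid for every element.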
Second, collections form a class hierarchy, with their operations defined in base traits. Collection method calls are often polymorphic. A compiler that does not know the concrete callee type cannot easily optimize such a call. For example, two apply calls on an ArrayBuffer share the same array bounds check, but if the static receiver type is the base trait Seq, the compiler cannot factor out the common bounds check. Similarly, unless the compiler knows the concrete receiver type at the callsite (e.g. by inlining), the read of the buffer field cannot be factored out of the loop, but instead occurs for every element.

Finally, most collection operations (in particular, those of subclasses of the Iterable trait) create Iterator objects, which store the iteration state. Avoiding the allocation of the Iterator and inlining the logic of its hasNext and next methods generally results in more efficient code, but without knowing the concrete implementation of the iterator, replacing iterators with something more efficient is not possible.
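The following sketch (ours, for illustration) contrasts what a generic, iterator-based traversal implies with the hand-written loop that we would like the JIT compiler to arrive at once the receiver and iterator implementations are known:

    import scala.collection.mutable.ArrayBuffer

    // What a generic traversal pays for: an Iterator allocation that holds
    // the iteration state, plus two virtual calls (hasNext, next) per element.
    def sumGeneric(xs: Iterable[Int]): Int = {
      var sum = 0
      val it = xs.iterator      // heap-allocates the iteration state
      while (it.hasNext)        // virtual call per element
        sum += it.next()        // virtual call per element, plus unboxing
      sum
    }

    // The shape we would like the compiled code to have when the receiver is
    // known to be an ArrayBuffer: a counted loop with no iterator object, in
    // which the length read and the underlying buffer read can be hoisted.
    def sumManual(xs: ArrayBuffer[Int]): Int = {
      var sum = 0
      var i = 0
      val len = xs.length       // read once, outside the loop
      while (i < len) {
        sum += xs(i)            // with the receiver type known, apply can be
        i += 1                  // inlined and its bounds check folded away
      }
      sum
    }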
3 Overview of the Optimizations in Graal
Graal [Duboscq et al. 2013] is a dynamic optimizing compiler for the HotSpot VM. Graal is not the default top-tier compiler in HotSpot, but it can optionally replace the default C2 compiler. Just like the standard C2 compiler, Graal uses feedback-driven optimizations and generates code that speculates on the method receiver types and the taken control-flow paths. The feedback is based on receiver type profiles and branch probabilities, and is gathered while the program executes in interpreter mode. When a speculation proves incorrect, Graal deoptimizes the affected part of the code, and compiles it again later [Duboscq et al. 2014].

An important design aspect in Graal, shared among many current JIT compilers, is the focus on intraprocedural analysis. When HotSpot detects that a particular method is called many times, it schedules the method for compilation. The JIT compiler (C2 or Graal) analyzes the particular method, and attempts to trigger optimizations within its scope. Instead of doing interprocedural analysis, additional optimizations are enabled by inlining the callsites inside the current method.

Graal's intermediate representation (IR) is partly based on the sea-of-nodes approach [Click and Paleczny 1995]. A program is represented as a directed graph in static single assignment form. Each node produces at most one value. Control flow is represented with fixed nodes whose successor edges point downwards in the graph. Data flow is represented with floating nodes whose input edges point upwards. The details of the Graal IR are presented in related work [Duboscq et al. 2013]. In Section 4, we summarize the basic IR nodes that are required for understanding this paper.

In the rest of this section, we describe the most important high-level optimizations in Graal that are relevant for the rest of the discussion.
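To give intuition for the profile-based speculation described at the beginning of this section, the following sketch (ours, purely illustrative – Graal performs this transformation on its IR, not on source code) shows, in source form, the effect of speculatively devirtualizing a Seq.apply call whose receiver type profile is dominated by ArrayBuffer:

    import scala.collection.mutable.ArrayBuffer

    def first(xs: Seq[Int]): Int = xs.apply(0)

    // Conceptual result of the speculation. The type test stands for a guard
    // emitted in the compiled code; the default branch stands for a
    // deoptimization point, which transfers execution back to the interpreter
    // and later triggers recompilation.
    def firstSpeculated(xs: Seq[Int]): Int = xs match {
      case ab: ArrayBuffer[Int @unchecked] =>
        ab.apply(0)     // fast path: concrete receiver, inlineable
      case other =>
        other.apply(0)  // in the compiled code: deoptimize
    }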