Optimization Coaching for Fork/Join Applications on the Java Virtual Machine

Extended abstract

Eduardo Rosales
Università della Svizzera italiana
Lugano, Switzerland
[email protected]

Walter Binder∗
Università della Svizzera italiana
Lugano, Switzerland
[email protected]
ABSTRACT
In the multicore era, developers are increasingly writing parallel applications to leverage the available computing resources and to achieve speedups. Still, developing and tuning parallel applications to exploit the hardware resources remains a major challenge. We tackle this issue for fork/join applications running in a single Java Virtual Machine (JVM) on a shared-memory multicore.

Developing fork/join applications is often challenging, since failing to balance the tradeoff between maximizing parallelism and minimizing overheads, along with maximizing locality and minimizing contention, may lead to applications suffering from several performance problems (e.g., excessive object creation/reclaiming, load imbalance, inappropriate synchronization).

In contrast to the manual experimentation commonly required to tune fork/join parallelism on the JVM, we propose coaching developers towards optimizing their applications by providing recommendations that can be interpreted and implemented by the developer.

1 INTRODUCTION
Fork/join applications recursively split (fork) computations into parallel subtasks, wait for them to complete, and then typically merge (join) their results, running in a single Java Virtual Machine (JVM) on a shared-memory multicore. The Java fork/join framework [27] is the implementation enabling fork/join applications on the JVM. It uses the work-stealing [57] scheduling strategy for parallel programs, using an implementation where each worker thread has a deque (i.e., a double-ended queue) of tasks and processes them one by one until the deque is empty, in which case the thread "steals" tasks from deques belonging to other active worker threads. Since the release of Java 8, the Java fork/join framework has been supporting the Java class library: in the java.util.Arrays [37] class, on methods performing parallel sorting; in the java.util.stream package [40], which provides sequences of elements supporting sequential and parallel aggregate operations; and in the java.util.concurrent.CompletableFuture
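The divide-and-conquer structure described above can be sketched with a minimal RecursiveTask summing an array; the class name, array size, and sequential threshold below are illustrative assumptions, not taken from the paper:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Minimal fork/join sketch: sum an array by recursively splitting the
// index range until a (hypothetical) sequential threshold is reached.
class SumTask extends RecursiveTask<Long> {
    private static final int THRESHOLD = 1 << 10; // illustrative cutoff
    private final long[] data;
    private final int lo, hi;

    SumTask(long[] data, int lo, int hi) { this.data = data; this.lo = lo; this.hi = hi; }

    @Override
    protected Long compute() {
        if (hi - lo <= THRESHOLD) {            // small enough: compute sequentially
            long sum = 0;
            for (int i = lo; i < hi; i++) sum += data[i];
            return sum;
        }
        int mid = (lo + hi) >>> 1;
        SumTask left = new SumTask(data, lo, mid);
        SumTask right = new SumTask(data, mid, hi);
        left.fork();                            // push left half onto this worker's deque (fork)
        long rightSum = right.compute();        // compute right half in the current worker
        return left.join() + rightSum;          // wait for the queued/stolen half (join)
    }

    public static void main(String[] args) {
        long[] data = new long[100_000];
        java.util.Arrays.fill(data, 1L);
        long sum = ForkJoinPool.commonPool().invoke(new SumTask(data, 0, data.length));
        System.out.println(sum); // prints 100000
    }
}
```

The same common pool executes the library-level parallelism mentioned above, e.g., Arrays.parallelSort(array) and stream.parallel() pipelines.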
∗PhD advisor

EuroDW'18, April 2018, Porto, Portugal

Although several works and tools address the assisted optimization of sequential applications [15, 51, 52], parallelism discovery [14, 17, 19, 25, 45, 49, 50], parallelism profiling [1, 2, 11, 23, 36, 54, 56, 59], or have analyzed the use of concurrency on the JVM [29–31, 44], with some even targeting fork/join parallelism [9, 43], we are not aware of previous work on coaching a developer towards optimizing a fork/join application running in the JVM.

The remainder of this proposal is organized as follows. In Section 2, we discuss our research directions. In Section 3, we analyze the feasibility of our proposal by reviewing related work.

2 OPTIMIZATION COACHING FOR FORK/JOIN APPLICATIONS ON THE JVM
We propose coaching developers towards the optimization of fork/join applications on the JVM. To achieve our goal, it is imperative to address the following research directions.

Diagnosing performance issues. We plan to build a methodology aimed at diagnosing issues and understanding their impact on the performance of fork/join applications on the JVM. To this end, first we plan to define a new model to characterize fork/join tasks. Second, we plan to determine the entities and metrics worth considering to automatically detect performance drawbacks in fork/join applications. We plan to use static analysis to inspect the source code in search of fork/join anti-patterns [9, 43]. Complementarily, we plan to use dynamic analysis to deal with polymorphism and reflection, along with collecting performance metrics enabling the diagnosis of performance issues noticeable at runtime (e.g., excessive deque accesses and object creation/reclaiming, load imbalance, inappropriate synchronization). To this end, we devise a vertical profiler [18] collecting information from the full system stack, including metrics at the application level (e.g., tasks and threads), the Java fork/join framework level (e.g., task submissions, the number of already queued tasks), the JVM level (e.g., garbage collections, allocations in the Java heap), the operating-system level (e.g., CPU usage, memory load), and the hardware level (e.g., reference cycles, machine instructions, context switches, cache misses).

Third, we plan to profile and characterize all tasks spawned by a fork/join application to reveal performance drawbacks related to suboptimal task granularity, a key factor when implementing fork/join parallelism on the JVM, as pointed out by Lea [26]. This is challenging, since recursion, fine-grained parallelism, and the scheduling affecting the life-cycle of a fork/join task (e.g., the steal operation) complicate task-granularity profiling and may lead to incorrect measurements when not handled properly. Our profiler will rely on DiSL [33] to ensure full coverage of the tasks spawned by a fork/join application, including those in the Java class library. Finally, the collected metrics can be susceptible to perturbations caused by the inserted instrumentation code, which may bias the analysis. Therefore, we plan to lower the perturbation of dynamic metrics by using efficient profiling data structures avoiding heap allocations in the target application [32].

In previous studies [47, 48] we analyzed the tradeoff between maximizing parallelism and minimizing overheads, mainly focusing on the size of the tasks spawned by task-parallel applications on the JVM. We introduced an early model defining a task as every instance of the Java interfaces java.lang.Runnable and java.util.concurrent.Callable, focusing on the study of Java thread pools [41]. Furthermore, we presented tgp [48], a task-granularity profiler for the JVM. Relying on bytecode instrumentation [33] and resorting to Hardware Performance Counters (HPC) [22], tgp allowed us to find suboptimal task granularities causing performance issues in the DaCapo [6] benchmark suite. We plan to extend this work to study suboptimal task granularity in fork/join applications on the JVM.

Suggesting optimizations. We plan to develop a methodology for automatically generating (and reporting to the developer) specific recommendations aimed at fixing detected performance issues in fork/join applications. Inspired by the works of De Wael et al. [9] and Pinto et al. [43], we plan to develop a tool able to recognize fork/join anti-patterns; it is also of paramount importance in our aim to automatically match such anti-patterns to concrete recommendations that a developer can use to avoid them. We plan to use calling-context profiling [3] to point the developer to the application code where tasks classified as problematic were created, submitted, or executed, guiding optimizations by suggesting the use of fork/join design patterns [26] and concrete code modifications to enable potentially missed parallelization opportunities. This is challenging, since automatically matching performance issues with concrete code suggestions to fix them will require the use of advanced data-analysis techniques. In a previous study [47] we used calling contexts to locate tasks previously identified as being of suboptimal granularity in DaCapo, guiding optimizations through code refactoring that led to noticeable speedups. Based on this work, we plan to automate and specialize such techniques to specifically coach developers in optimizing fork/join applications on the JVM.

Validating optimizations. As previously pointed out by De Wael et al. [9], there is a lack of standard benchmarks suitable for performing empirical studies focused on fork/join parallelism on the JVM. Indeed, our early work mentioned above relied on DaCapo to study suboptimal task granularities mainly in Java thread pools; however, this suite does not contain any fork/join application. In this direction, we plan to discover workloads exhibiting high diversity in their task-parallel behavior by plugging tgp into AutoBench [60], a toolchain combining massive code-repository crawling, pluggable hybrid analyses, and workload-characterization techniques to discover candidate workloads satisfying the needs of domain-specific benchmarking. By fully integrating tgp and AutoBench, we plan to discover workloads exhibiting diverse fork/join parallelism, suitable for validating both aforementioned methodologies.

3 RELATED WORK
This section first introduces related studies on the assisted optimization of applications. Next, we focus on parallelism discovery. Then, we review previous work analyzing the use of concurrency on the JVM. Finally, we review profilers for parallel applications.

Assisted optimization of applications. Several studies have focused on providing the developer with feedback to assist the optimization of several aspects of an application. Here, we relate three of them pioneering an approach that is very close to our goal of providing recommendations which require only source-level knowledge to be implemented by a developer.

St-Amour et al. [51, 52] first coined the term optimization coaching, introducing a technique modifying the compiler's optimizer to inform the developer which optimizations it performs and which optimizations it misses, and to suggest specific changes to the source code that could trigger additional optimizations. The goal is reporting to the developer precise recommendations whose interpretation does not require expertise about the compiler's internals and whose implementation enables a missed or failed optimization attempted by the compiler. The technique was first prototyped in a work [52] modifying Racket's [12] simple ahead-of-time byte-compiler, which performs basic optimizations on Racket applications. In a second work [51], the authors extended this approach to consider dynamic, object-oriented, just-in-time compiled programming languages, prototyping assisted optimizations in the SpiderMonkey JavaScript [34] engine.
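To illustrate the kind of source-level recommendation such coaching targets in our fork/join setting, consider a hypothetical before/after refactoring of a recursive task; the class name, cutoff, and the recommendation itself are illustrative examples, not the output of an existing coach. Forking one subtask and computing the other directly in the current worker, rather than forking both, is a documented fork/join practice:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical coaching recommendation (illustrative only): instead of
// forking both subtasks and joining both, fork one subtask and compute
// the other in the current worker, reducing task-queue traffic.
class Fib extends RecursiveTask<Long> {
    final int n;
    Fib(int n) { this.n = n; }

    @Override protected Long compute() {
        if (n <= 13) return fibSeq(n);        // sequential cutoff (illustrative value)
        Fib f1 = new Fib(n - 1);
        f1.fork();                             // run one subtask asynchronously
        long r2 = new Fib(n - 2).compute();    // compute the other directly
        return f1.join() + r2;                 // then join the forked one
        // Variant a coach could flag: f1.fork(); f2.fork(); f1.join() + f2.join();
    }

    static long fibSeq(int n) { return n <= 1 ? n : fibSeq(n - 1) + fibSeq(n - 2); }

    public static void main(String[] args) {
        System.out.println(ForkJoinPool.commonPool().invoke(new Fib(20))); // prints 6765
    }
}
```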
Gong et al. [15] present JITProf, a prototype profiling framework that detects code patterns prohibiting optimizations by JavaScript Just-In-Time (JIT) compilers and reports their occurrences to the developer with the goal of achieving optimizations. JITProf profiles the target application at runtime, implementing different strategies to detect seven JIT-unfriendly patterns. The goal is reporting to the developer JIT-unfriendly locations, ranked by their relevance in preventing optimizations, to facilitate the developer fixing them.

The above works shed light on the goal of mentoring developers in locating performance drawbacks and providing concrete recommendations to optimize an application. Nonetheless, they notably differ from our goal because the related tools were not designed for diagnosing, and further recommending modifications specifically aimed at optimizing, parallel applications. Indeed, the proposed techniques focus on feedback generated by specific compilers' optimizers, falling short of considering other metrics from the full system stack which may enable revealing issues specific to parallel applications, including fork/join applications on the JVM.

Parallelism discovery. A number of tools have been developed to discover parallelization opportunities. Most notably, tools focusing on parallelism discovery include Kremlin [14], which tracks loops and dependencies to pinpoint sequential parts of a program that can be parallelized. Kismet [25] builds on Kremlin and enhances the prediction of potential speedups after parallelization, given a target machine and runtime system. Similarly, Hammacher et al. [17], Rountev et al. [49], and Dig et al. [10] focus on refactoring sequential legacy Java programs, identifying independent computation paths that could be parallelized and recommending to the programmer the locations with the highest potential for parallelization. Although the above works aim at guiding the developer towards modifications enabling parallelization, they only target the optimization of sequential applications.

In addition, several parallelism profilers based on the work-span model [7, 24] target parallel applications. Cilkview [19] builds upon this model to predict achievable speedup bounds for a Cilk [13] application when the number of used cores increases. CilkProf [50] extends Cilkview by measuring work and span at each call site to determine which of them constitutes a bottleneck towards parallelization. TaskProf [58] computes work, span, and the asymptotic parallelism of C++ applications using the Intel Threading Building Blocks (TBB) [45] task-parallel library, relying on causal-profiling [8] techniques to estimate parallelism improvements. Although these tools allow detecting bottlenecks in parallel applications and can estimate potential speedups, they do not focus on coaching a developer towards optimizing a fork/join application. Moreover, none of these tools supports the JVM.
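The speedup bounds such work-span profilers report follow from simple arithmetic: with total work T1 and critical-path length (span) T∞, the speedup achievable on p cores is bounded by min(p, T1/T∞). A small sketch with hypothetical measurements (the class name and values are illustrative):

```java
// Work-span model, illustrative calculation (not tied to any specific tool).
// work = total computation across all tasks; span = longest dependency chain.
class WorkSpanBounds {
    // Upper bound on speedup with p processors: min(p, work / span).
    static double speedupBound(double work, double span, int p) {
        return Math.min((double) p, work / span);
    }

    public static void main(String[] args) {
        double work = 1000.0, span = 100.0;  // hypothetical measurements
        System.out.println("parallelism      = " + (work / span));              // 10.0
        System.out.println("bound on 4 cores = " + speedupBound(work, span, 4)); // 4.0
        System.out.println("bound on 64 cores= " + speedupBound(work, span, 64)); // 10.0
    }
}
```

On few cores the bound is the core count itself; beyond work/span cores, adding hardware cannot help, which is what tools such as Cilkview exploit to predict scalability.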
Analyses of the use of concurrency on the JVM. Numerous studies have focused on the use of concurrency on the JVM. Here, we highlight two works closely related to our aim.

De Wael et al. [9] analyzed 120 open-source Java projects to study the use of the Java fork/join framework, manually confirming the frequent use of four best coding practices and three recurring anti-patterns that potentially limit parallel performance. Their aim is providing practical conclusions to guide language or framework designers in proposing new or improved abstractions that steer programmers toward using the right patterns and away from the anti-patterns when developing fork/join applications for the JVM.

Pinto et al. [43] study parallelism bottlenecks in fork/join applications on the JVM, with a unique focus on how they interact with underlying system-level features, such as work-stealing scheduling and memory management. They identify six bottlenecks impacting performance and energy efficiency in the Akka [28] actor framework and conduct an in-depth refactoring of it to improve its core messaging engine. They also present FJDetector, a plugin for the Eclipse IDE [55] prototyping the detection, by means of code inspection, of some of the bottlenecks identified by the authors.

Overall, these works bring meaningful insight into methodologies focused on detecting bad coding practices and bottlenecks in real fork/join applications on the JVM. Complementary to such studies, we plan to focus both on performance metrics collected from the full system stack via dynamic analysis, to automatically diagnose performance issues noticeable only at runtime, and on characterizing all the tasks spawned by the fork/join application, to study performance drawbacks related to suboptimal task granularity. Moreover, we specifically devise coaching developers towards optimizing fork/join applications by providing them with code-refactoring suggestions whose implementation may enable missed parallelization opportunities.

Parallelism profilers. Researchers from industry and academia have developed several tools to analyze various parallel characteristics of an application. HPCToolkit [1] is a suite of tools using statistical sampling of timers and HPC collection to analyze the performance of an application, its resource consumption, and its inefficiency, attributing them to the full calling context in which they occur. THOR [54] is a tool for the performance analysis of Java applications on multicore systems. THOR relies on vertical profiling to collect fine-grained events, such as OS context switches and Java lock contention, from multiple layers of the execution stack. Each event is subsequently associated with a thread, providing a detailed graphical report on the behavior of the threads spawned in the execution of a parallel Java application.

Finally, a number of tools exist to support the analysis of parallel Java applications through the collection of metrics at the application layer, most notably including VisualVM [56], IBM Health Center [21], Oracle Developer Studio Performance Analyzer [36], Java Mission Control [35], JProfiler [11], and YourKit [59], or at the hardware layer resorting to HPCs, including Intel VTune [23] and AMD CodeAnalyst [2], among others.

Overall, although the aforementioned tools provide insight into characterizing processes or threads over time, none of them specializes in fork/join applications, and thus they fall short in diagnosing performance problems specific to this type of parallel application.
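To make the granularity issues recurring throughout this proposal concrete, the following sketch contrasts a sequential cutoff with the degenerate cutoff of one element; the class name, threshold, and workload are illustrative and not drawn verbatim from the cited studies:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

// Illustrative sketch of suboptimal task granularity: with THRESHOLD = 1,
// one task is spawned per element and scheduling overhead dominates useful
// work; a coarser (workload-specific) cutoff amortizes that overhead.
class FillAction extends RecursiveAction {
    static final int THRESHOLD = 1_000;   // hypothetical cutoff; 1 would be far too fine
    final int[] a;
    final int lo, hi;

    FillAction(int[] a, int lo, int hi) { this.a = a; this.lo = lo; this.hi = hi; }

    @Override protected void compute() {
        if (hi - lo <= THRESHOLD) {
            for (int i = lo; i < hi; i++) a[i] = i * 2;  // sequential leaf work
        } else {
            int mid = (lo + hi) >>> 1;
            invokeAll(new FillAction(a, lo, mid), new FillAction(a, mid, hi));
        }
    }

    public static void main(String[] args) {
        int[] a = new int[50_000];
        ForkJoinPool.commonPool().invoke(new FillAction(a, 0, a.length));
        System.out.println(a[49_999]); // prints 99998
    }
}
```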
ACKNOWLEDGMENTS
The research presented in this paper was supported by Oracle (ERO project 1332) and by the Swiss National Science Foundation (project 200021_153560).

REFERENCES
[1] L. Adhianto, S. Banerjee, M. Fagan, M. Krentel, G. Marin, J. Mellor-Crummey, and N. R. Tallent. 2010. HPCTOOLKIT: Tools for Performance Analysis of Optimized Parallel Programs. Concurr. Comput.: Pract. Exper. 22, 6 (2010), 685–701.
[2] AMD. 2017. CodeAnalyst for Linux. http://developer.amd.com/tools-and-sdks/archive/amd-codeanalyst-performance-analyzer-for-linux/. (2017).
[3] Glenn Ammons, Thomas Ball, and James R. Larus. 1997. Exploiting Hardware Performance Counters with Flow and Context Sensitive Profiling. In PLDI '97.
[32] Lukáš Marek, Stephen Kell, Yudi Zheng, Lubomír Bulej, Walter Binder, Petr Tůma, Danilo Ansaloni, Aibek Sarimbekov, and Andreas Sewe. 2013. ShadowVM: Robust and Comprehensive Dynamic Program Analysis for the Java Platform. In GPCE. 105–114.
[33] Lukáš Marek, Alex Villazón, Yudi Zheng, Danilo Ansaloni, Walter Binder, and Zhengwei Qi. 2012. DiSL: A Domain-specific Language for Bytecode Instrumentation. In AOSD. 239–250.
[34] Mozilla. 2018. SpiderMonkey. https://developer.mozilla.org/en-US/docs/Mozilla/Projects/SpiderMonkey. (2018).
[35] Oracle. [n. d.]. Java Mission Control. http://www.oracle.com/technetwork/java/javaseproducts/mission-control/java-mission-control-1998576.html. ([n. d.]).
[36] Oracle. [n. d.]. Solaris Studio Performance Analyzer. http://www.oracle.com/technetwork/server-storage/solarisstudio/features/performance-analyzer-2292312.html. ([n. d.]).
[37] Oracle. 2017. Class Arrays. https://docs.oracle.com/javase/8/docs/api/java/util/Arrays.html. (2017).
[38] Oracle. 2017. Class CompletableFuture