Optimization Coaching for Fork/Join Applications on the Java Virtual Machine

Extended abstract

Eduardo Rosales
Università della Svizzera italiana
Lugano, Switzerland
[email protected]

Walter Binder∗
Università della Svizzera italiana
Lugano, Switzerland
[email protected]
ABSTRACT
In the multicore era, developers are increasingly writing parallel applications to leverage the available computing resources and to achieve speedups. Still, developing and tuning parallel applications to exploit the hardware resources remains a major challenge. We tackle this issue for fork/join applications running in a single Java Virtual Machine (JVM) on a shared-memory multicore.

Developing fork/join applications is often challenging, since failing to balance the tradeoff between maximizing parallelism and minimizing overheads, along with maximizing locality and minimizing contention, may lead to applications suffering from several performance problems (e.g., excessive object creation/reclaiming, load imbalance, inappropriate synchronization).

In contrast to the manual experimentation commonly required to tune fork/join parallelism on the JVM, we propose coaching developers towards optimizing their applications by providing recommendations that can be interpreted and implemented by the developer.

1 INTRODUCTION
Fork/join applications recursively split (fork) computations into parallel subtasks, wait for them to complete, and then typically merge (join) their results, running in a single Java Virtual Machine (JVM) on a shared-memory multicore. The Java fork/join framework [27] is the implementation enabling fork/join applications on the JVM. It uses the work-stealing [57] scheduling strategy for parallel programs, using an implementation where each worker thread has a deque (i.e., a double-ended queue) of tasks and processes them one by one until the deque is empty, in which case the thread "steals" tasks from deques belonging to other active worker threads. Since the release of Java 8, the Java fork/join framework has been supporting the Java class library: in the java.util.Arrays [37] class, on methods performing parallel sorting; in the java.util.stream package [40], which provides sequences of elements supporting sequential and parallel aggregate operations; and in the java.util.concurrent.CompletableFuture
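The divide-and-conquer structure described above can be sketched with a minimal RecursiveTask summing an array; the class name, array size, and sequential threshold below are illustrative assumptions, not taken from the paper:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Minimal fork/join sketch: sum an array by recursively splitting the
// index range until a (hypothetical) sequential threshold is reached.
class SumTask extends RecursiveTask<Long> {
    private static final int THRESHOLD = 1 << 10; // illustrative cutoff
    private final long[] data;
    private final int lo, hi;

    SumTask(long[] data, int lo, int hi) { this.data = data; this.lo = lo; this.hi = hi; }

    @Override
    protected Long compute() {
        if (hi - lo <= THRESHOLD) {            // small enough: compute sequentially
            long sum = 0;
            for (int i = lo; i < hi; i++) sum += data[i];
            return sum;
        }
        int mid = (lo + hi) >>> 1;
        SumTask left = new SumTask(data, lo, mid);
        SumTask right = new SumTask(data, mid, hi);
        left.fork();                            // push left half onto this worker's deque (fork)
        long rightSum = right.compute();        // compute right half in the current worker
        return left.join() + rightSum;          // wait for the queued/stolen half (join)
    }

    public static void main(String[] args) {
        long[] data = new long[100_000];
        java.util.Arrays.fill(data, 1L);
        long sum = ForkJoinPool.commonPool().invoke(new SumTask(data, 0, data.length));
        System.out.println(sum); // prints 100000
    }
}
```

The same common pool executes the library-level parallelism mentioned above, e.g., Arrays.parallelSort(array) and stream.parallel() pipelines.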
∗PhD advisor

EuroDW'18, April 2018, Porto, Portugal

Although several works and tools address the assisted optimization of sequential applications [15, 51, 52], parallelism discovery [14, 17, 19, 25, 45, 49, 50], parallelism profiling [1, 2, 11, 23, 36, 54, 56, 59], or have analyzed the use of concurrency on the JVM [29–31, 44], with some even targeting fork/join parallelism [9, 43], we are not aware of previous work on coaching a developer towards optimizing a fork/join application running in the JVM.

The remainder of this proposal is organized as follows. In Section 2, we discuss our research directions. In Section 3, we analyze the feasibility of our proposal by reviewing related work.

2 OPTIMIZATION COACHING FOR FORK/JOIN APPLICATIONS ON THE JVM
We propose coaching developers towards the optimization of fork/join applications on the JVM. To achieve our goal, it is imperative to address the following research directions.

Diagnosing performance issues. We plan to build a methodology aimed at diagnosing issues and understanding their impact on the performance of fork/join applications on the JVM. To this end, first we plan to define a new model to characterize fork/join tasks. Second, we plan to determine the entities and metrics worth considering to automatically detect performance drawbacks in fork/join applications. We plan to use static analysis to inspect the source code in search of fork/join anti-patterns [9, 43]. Complementarily, we plan to use dynamic analysis to deal with polymorphism and reflection, along with collecting performance metrics enabling the diagnosis of performance issues noticeable at runtime (e.g., excessive deque accesses and object creation/reclaiming, load imbalance, inappropriate synchronization). To this end, we devise a vertical profiler [18] collecting information from the full system stack, including metrics at the application level (e.g., tasks and threads), the Java fork/join framework level (e.g., task submissions, the number of already queued tasks), the JVM level (e.g., garbage collections, allocations in the Java heap), the operating-system level (e.g., CPU usage, memory load), and the hardware level (e.g., reference cycles, machine instructions, context switches, cache misses).

Third, we plan to profile and characterize all tasks spawned by a fork/join application to reveal performance drawbacks related to suboptimal task granularity, a key factor when implementing fork/join parallelism on the JVM, as pointed out by Lea [26]. This is challenging, since recursion, fine-grained parallelism, and the scheduling affecting the life-cycle of a fork/join task (e.g., the steal operation) complicate task-granularity profiling and may lead to incorrect measurements when not handled properly. Our profiler will rely on DiSL [33] to ensure full coverage of the tasks spawned by a fork/join application, including those in the Java class library. Finally, the collected metrics can be susceptible to perturbations caused by the inserted instrumentation code, which may bias the analysis. Therefore, we plan to lower the perturbation of dynamic metrics by using efficient profiling data structures avoiding heap allocations in the target application [32].

In previous studies [47, 48] we analyzed the tradeoff between maximizing parallelism and minimizing overheads, mainly focusing on the size of the tasks spawned by task-parallel applications on the JVM. We introduced an early model defining a task as every instance of the Java interfaces java.lang.Runnable and java.util.concurrent.Callable, focusing on the study of Java thread pools [41]. Furthermore, we presented tgp [48], a task-granularity profiler for the JVM. Relying on bytecode instrumentation [33] and resorting to Hardware Performance Counters (HPC) [22], tgp allowed us to find suboptimal task granularities causing performance issues in the DaCapo [6] benchmark suite. We plan to extend this work to study suboptimal task granularity in fork/join applications on the JVM.

Suggesting optimizations. We plan to develop a methodology for automatically generating (and reporting to the developer) specific recommendations aimed at fixing detected performance issues in fork/join applications. Inspired by the works of De Wael et al. [9] and Pinto et al. [43], we plan to develop a tool able to recognize fork/join anti-patterns; it is also of paramount importance in our aim to automatically match such anti-patterns to concrete recommendations that a developer can use to avoid them. We plan to use calling-context profiling [3] to point the developer to the application code where tasks classified as problematic were created, submitted, or executed, guiding optimizations by suggesting the use of fork/join design patterns [26] and concrete code modifications to enable potentially missed parallelization opportunities. This is challenging, since automatically matching performance issues with concrete code suggestions to fix them will require the use of advanced data-analysis techniques. In a previous study [47] we used calling contexts to locate tasks previously identified as being of suboptimal granularity in DaCapo, guiding optimizations through code refactoring that led to noticeable speedups. Based on this work, we plan to automate and specialize such techniques to specifically coach developers in optimizing fork/join applications on the JVM.

Validating optimizations. As previously pointed out by De Wael et al. [9], there is a lack of standard benchmarks suitable for performing empirical studies focused on fork/join parallelism on the JVM. Indeed, our early work mentioned above relied on DaCapo to study suboptimal task granularities mainly in Java thread pools; however, this suite does not contain any fork/join application. In this direction, we plan to discover workloads exhibiting high diversity in their task-parallel behavior by plugging tgp into AutoBench [60], a toolchain combining massive code-repository crawling, pluggable hybrid analyses, and workload-characterization techniques to discover candidate workloads satisfying the needs of domain-specific benchmarking. By fully integrating tgp and AutoBench, we plan to discover workloads exhibiting diverse fork/join parallelism, suitable for validating both aforementioned methodologies.

3 RELATED WORK
This section first introduces related studies on the assisted optimization of applications. Next, we focus on parallelism discovery. Then, we review previous work analyzing the use of concurrency on the JVM. Finally, we review profilers for parallel applications.

Assisted optimization of applications. Several studies have focused on providing the developer with feedback to assist the optimization of several aspects of an application. Here, we relate three of them pioneering an approach that is very close to our goal of providing recommendations which require only source-level knowledge to be implemented by a developer.

St-Amour et al. [51, 52] first coined the term optimization coaching, introducing a technique modifying the compiler's optimizer to inform the developer which optimizations it performs and which optimizations it misses, and to suggest specific changes to the source code that could trigger additional optimizations. The goal is reporting to the developer precise recommendations whose interpretation does not require expertise about the compiler's internals and whose implementation enables a missed or failed optimization attempted by the compiler. The technique was first prototyped in a work [52] modifying Racket's [12] simple ahead-of-time byte-compiler, which performs basic optimizations on Racket applications. In a second work [51], the authors extended this approach to consider dynamic, object-oriented, just-in-time compiled programming languages, prototyping assisted optimizations in the SpiderMonkey JavaScript [34] engine.
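To illustrate the kind of source-level recommendation such coaching targets in our fork/join setting, consider a hypothetical before/after refactoring of a recursive task; the class name, cutoff, and the recommendation itself are illustrative examples, not the output of an existing coach. Forking one subtask and computing the other directly in the current worker, rather than forking both, is a documented fork/join practice:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical coaching recommendation (illustrative only): instead of
// forking both subtasks and joining both, fork one subtask and compute
// the other in the current worker, reducing task-queue traffic.
class Fib extends RecursiveTask<Long> {
    final int n;
    Fib(int n) { this.n = n; }

    @Override protected Long compute() {
        if (n <= 13) return fibSeq(n);        // sequential cutoff (illustrative value)
        Fib f1 = new Fib(n - 1);
        f1.fork();                             // run one subtask asynchronously
        long r2 = new Fib(n - 2).compute();    // compute the other directly
        return f1.join() + r2;                 // then join the forked one
        // Variant a coach could flag: f1.fork(); f2.fork(); f1.join() + f2.join();
    }

    static long fibSeq(int n) { return n <= 1 ? n : fibSeq(n - 1) + fibSeq(n - 2); }

    public static void main(String[] args) {
        System.out.println(ForkJoinPool.commonPool().invoke(new Fib(20))); // prints 6765
    }
}
```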
Gong et al. [15] present JITProf, a prototype profiling framework that detects code patterns prohibiting optimizations by JavaScript Just-In-Time (JIT) compilers and reports their occurrences to the developer with the goal of achieving optimizations. JITProf profiles the target application at runtime, implementing different strategies to detect seven JIT-unfriendly patterns. The goal is reporting to the developer JIT-unfriendly locations, ranked by their relevance in preventing optimizations, to facilitate the developer fixing them.

The above works shed light on the goal of mentoring developers in locating performance drawbacks and providing concrete recommendations to optimize an application. Nonetheless, they notably differ from our goal because the related tools were not designed for diagnosing, and further recommending modifications specifically aimed at optimizing, parallel applications. Indeed, the proposed techniques focus on feedback generated by specific compilers' optimizers, falling short of considering other metrics from the full system stack which may enable revealing issues specific to parallel applications, including fork/join applications on the JVM.

Parallelism discovery. A number of tools have been developed to discover parallelization opportunities. Most notably, tools focusing on parallelism discovery include Kremlin [14], which tracks loops and dependencies to pinpoint sequential parts of a program that can be parallelized. Kismet [25] builds on Kremlin and enhances the prediction of potential speedups after parallelization, given a target machine and runtime system. Similarly, Hammacher et al. [17], Rountev et al. [49], and Dig et al. [10] focus on refactoring sequential legacy Java programs, identifying independent computation paths that could be parallelized and recommending to the programmer the locations with the highest potential for parallelization. Although the above works aim at guiding the developer towards modifications enabling parallelization, they only target the optimization of sequential applications.

In addition, several parallelism profilers based on the work-span model [7, 24] target parallel applications. Cilkview [19] builds upon this model to predict achievable speedup bounds for a Cilk [13] application when the number of used cores increases. CilkProf [50] extends Cilkview by measuring work and span at each call site to determine which of them constitutes a bottleneck towards parallelization. TaskProf [58] computes work, span, and the asymptotic parallelism of C++ applications using the Intel Threading Building Blocks (TBB) [45] task-parallel library, relying on causal-profiling [8] techniques to estimate parallelism improvements. Although these tools allow detecting bottlenecks in parallel applications and can estimate potential speedups, they do not focus on coaching a developer towards optimizing a fork/join application. Moreover, none of these tools supports the JVM.
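The speedup bounds such work-span profilers report follow from simple arithmetic: with total work T1 and critical-path length (span) T∞, the speedup achievable on p cores is bounded by min(p, T1/T∞). A small sketch with hypothetical measurements (the class name and values are illustrative):

```java
// Work-span model, illustrative calculation (not tied to any specific tool).
// work = total computation across all tasks; span = longest dependency chain.
class WorkSpanBounds {
    // Upper bound on speedup with p processors: min(p, work / span).
    static double speedupBound(double work, double span, int p) {
        return Math.min((double) p, work / span);
    }

    public static void main(String[] args) {
        double work = 1000.0, span = 100.0;  // hypothetical measurements
        System.out.println("parallelism      = " + (work / span));              // 10.0
        System.out.println("bound on 4 cores = " + speedupBound(work, span, 4)); // 4.0
        System.out.println("bound on 64 cores= " + speedupBound(work, span, 64)); // 10.0
    }
}
```

On few cores the bound is the core count itself; beyond work/span cores, adding hardware cannot help, which is what tools such as Cilkview exploit to predict scalability.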
Analyses of the use of concurrency on the JVM. Numerous studies have focused on the use of concurrency on the JVM. Here, we highlight two works closely related to our aim.

De Wael et al. [9] analyzed 120 open-source Java projects to study the use of the Java fork/join framework, manually confirming the frequent use of four best coding practices and three recurring anti-patterns that potentially limit parallel performance. Their aim is providing practical conclusions to guide language or framework designers in proposing new or improved abstractions that steer programmers toward using the right patterns and away from the anti-patterns when developing fork/join applications for the JVM.

Pinto et al. [43] study parallelism bottlenecks in fork/join applications on the JVM, with a unique focus on how they interact with underlying system-level features, such as work-stealing scheduling and memory management. They identify six bottlenecks impacting performance and energy efficiency in the Akka [28] actor framework and conduct an in-depth refactoring of it to improve its core messaging engine. They also present FJDetector, a plugin for the Eclipse IDE [55] prototyping the detection, by means of code inspection, of some of the bottlenecks identified by the authors.

Overall, these works bring meaningful insight into methodologies focused on detecting bad coding practices and bottlenecks in real fork/join applications on the JVM. Complementary to such studies, we plan to focus both on performance metrics collected from the full system stack via dynamic analysis, to automatically diagnose performance issues noticeable only at runtime, and on characterizing all the tasks spawned by the fork/join application, to study performance drawbacks related to suboptimal task granularity. Moreover, we specifically devise coaching developers towards optimizing fork/join applications by providing them with code-refactoring suggestions whose implementation may enable missed parallelization opportunities.

Parallelism profilers. Researchers from industry and academia have developed several tools to analyze various parallel characteristics of an application. HPCToolkit [1] is a suite of tools using statistical sampling of timers and HPC collection to analyze the performance of an application, its resource consumption, and its inefficiency, attributing them to the full calling context in which they occur. THOR [54] is a tool for the performance analysis of Java applications on multicore systems. THOR relies on vertical profiling to collect fine-grained events, such as OS context switches and Java lock contention, from multiple layers of the execution stack. Each event is subsequently associated with a thread, providing a detailed graphical report on the behavior of the threads spawned in the execution of a parallel Java application.

Finally, a number of tools exist to support the analysis of parallel Java applications through the collection of metrics at the application layer, most notably including VisualVM [56], IBM Health Center [21], Oracle Developer Studio Performance Analyzer [36], Java Mission Control [35], JProfiler [11], and YourKit [59], or at the hardware layer resorting to HPCs, including Intel VTune [23] and AMD CodeAnalyst [2], among others.

Overall, although the aforementioned tools provide insight into characterizing processes or threads over time, none of them specializes in fork/join applications, and thus they fall short in diagnosing performance problems specific to this type of parallel application.
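To make the granularity issues recurring throughout this proposal concrete, the following sketch contrasts a sequential cutoff with the degenerate cutoff of one element; the class name, threshold, and workload are illustrative and not drawn verbatim from the cited studies:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

// Illustrative sketch of suboptimal task granularity: with THRESHOLD = 1,
// one task is spawned per element and scheduling overhead dominates useful
// work; a coarser (workload-specific) cutoff amortizes that overhead.
class FillAction extends RecursiveAction {
    static final int THRESHOLD = 1_000;   // hypothetical cutoff; 1 would be far too fine
    final int[] a;
    final int lo, hi;

    FillAction(int[] a, int lo, int hi) { this.a = a; this.lo = lo; this.hi = hi; }

    @Override protected void compute() {
        if (hi - lo <= THRESHOLD) {
            for (int i = lo; i < hi; i++) a[i] = i * 2;  // sequential leaf work
        } else {
            int mid = (lo + hi) >>> 1;
            invokeAll(new FillAction(a, lo, mid), new FillAction(a, mid, hi));
        }
    }

    public static void main(String[] args) {
        int[] a = new int[50_000];
        ForkJoinPool.commonPool().invoke(new FillAction(a, 0, a.length));
        System.out.println(a[49_999]); // prints 99998
    }
}
```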
ACKNOWLEDGMENTS
The research presented in this paper was supported by Oracle (ERO project 1332) and by the Swiss National Science Foundation (project 200021_153560).

REFERENCES
[1] L. Adhianto, S. Banerjee, M. Fagan, M. Krentel, G. Marin, J. Mellor-Crummey, and N. R. Tallent. 2010. HPCTOOLKIT: Tools for Performance Analysis of Optimized Parallel Programs. Concurr. Comput.: Pract. Exper. 22, 6 (2010), 685–701.
[2] AMD. 2017. CodeAnalyst for Linux. http://developer.amd.com/tools-and-sdks/archive/amd-codeanalyst-performance-analyzer-for-linux/. (2017).
[3] Glenn Ammons, Thomas Ball, and James R. Larus. 1997. Exploiting Hardware Performance Counters with Flow and Context Sensitive Profiling. In PLDI '97.
[32] Lukáš Marek, Stephen Kell, Yudi Zheng, Lubomír Bulej, Walter Binder, Petr Tůma, Danilo Ansaloni, Aibek Sarimbekov, and Andreas Sewe. 2013. ShadowVM: Robust and Comprehensive Dynamic Program Analysis for the Java Platform. In GPCE. 105–114.
[33] Lukáš Marek, Alex Villazón, Yudi Zheng, Danilo Ansaloni, Walter Binder, and Zhengwei Qi. 2012. DiSL: A Domain-specific Language for Bytecode Instrumentation. In AOSD. 239–250.
[34] Mozilla. 2018. SpiderMonkey. https://developer.mozilla.org/en-US/docs/Mozilla/Projects/SpiderMonkey. (2018).
[35] Oracle. [n. d.]. Java Mission Control. http://www.oracle.com/technetwork/java/javaseproducts/mission-control/java-mission-control-1998576.html. ([n. d.]).
[36] Oracle. [n. d.]. Solaris Studio Performance Analyzer. http://www.oracle.com/technetwork/server-storage/solarisstudio/features/performance-analyzer-2292312.html. ([n. d.]).
[37] Oracle. 2017. Class Arrays. https://docs.oracle.com/javase/8/docs/api/java/util/Arrays.html. (2017).
[38] Oracle. 2017. Class CompletableFuture