FJProf: Profiling Fork/Join Applications on the Java Virtual Machine

Eduardo Rosales, Andrea Rosà, Walter Binder
Faculty of Informatics, Università della Svizzera italiana, Lugano, Switzerland
{rosale,andrea.rosa,walter.binder}@usi.ch

ABSTRACT
An efficient fork/join application should maximize parallelism while minimizing overheads, and maximize locality while minimizing contention. However, there is no unique optimal implementation that best resolves such tradeoffs, and failing in balancing them may lead to fork/join applications suffering from several issues (e.g., suboptimal forking, load imbalance, excessive synchronization), possibly compromising the performance gained by a task-parallel execution. Moreover, there is a lack of profilers enabling performance analysis of a fork/join application. As a result, developers are often required to implement their own tools for monitoring and collecting information and metrics on fork/join applications, which can be time-consuming and error-prone, and is often beyond the expertise of the developer. In this paper, we present FJProf, a novel profiler which accurately collects dynamic information and key metrics to facilitate characterizing several performance attributes specific to a fork/join application running on a single Java Virtual Machine (JVM) in a shared-memory multicore. FJProf reports information and graphics to developers that help them understand the details of the fork/join processing exposed by a parallel application running on the JVM. We show how FJProf supports performance analysis by characterizing a fork/join application from the Renaissance benchmark suite.

CCS CONCEPTS
• General and reference → Metrics; • Software and its engineering → Software performance.

KEYWORDS
Performance analysis, fork/join parallelism, task granularity, Java Virtual Machine

ACM Reference Format:
Eduardo Rosales, Andrea Rosà, Walter Binder. 2019. FJProf: Profiling Fork/Join Applications on the Java Virtual Machine. In Proceedings of VALUETOOLS 2020 - 13th EAI International Conference on Performance Evaluation Methodologies and Tools (VALUETOOLS 2020), May 18–20, 2020, Tsukuba, Japan. ACM, New York, NY, USA, 8 pages.

1 INTRODUCTION
Hardware advances have brought shared-memory multicores to the mainstream, increasingly encouraging developers to write parallel applications. While modern machines enable speedups, developing parallel applications that fully exploit the underlying hardware remains a major challenge. We tackle this issue for fork/join applications running in a single Java Virtual Machine (JVM) on a shared-memory multicore. Fork/join applications are task-based, parallel versions of divide-and-conquer algorithms recursively splitting (fork) work into tasks that are executed in parallel, waiting for them to complete, and then typically merging (join) the results computed by the forked tasks [21].

Since Java 7, the Java fork/join framework [21, 34] is integrated into the Java library to enable fork/join parallelism on the JVM. Developers are increasingly using this framework as it exploits the symbiosis between recursion and parallel decomposition, encouraging the use of a parallel divide-and-conquer programming style [20] in applications written in Java or in one of the multiple languages supported by the JVM (such as Scala [12], Apache Groovy [2] or Clojure [40]). The Java fork/join framework is a key component of the Java library, as it directly supports parallel sorting in the java.util.Arrays [26] class, parallel functional-style operations in the java.util.stream package [33], and the development of asynchronous programs in the java.util.concurrent.CompletableFuture [27] class. Furthermore, it has been extensively used to support applications based on Actors [22] and MapReduce [3, 14, 45], to mention some.

Despite the popularity and the extensive set of features offered by the Java fork/join framework, developing fork/join applications able to fully leverage the available computing resources remains challenging. An efficient fork/join application should maximize parallelism while minimizing overheads, along with maximizing locality while minimizing contention [20]. While design principles that guide developers in properly implementing fork/join parallelism have been proposed [20, 34], unfortunately, there is no unique optimal implementation that best resolves such tradeoffs. Failing in balancing such competing forces may lead to fork/join applications suffering from several performance issues (such as suboptimal forking, load imbalance, excessive synchronization) that may outweigh the benefits of a parallel execution [9, 37]. Furthermore, conducting a thorough analysis of a fork/join application requires developers to monitor, collect, and correlate a variety of metrics and information, which can be time-consuming, error-prone, and is often beyond the expertise of a developer [9]. Although related work focuses on parallelism discovery [11, 13, 43] and parallelism profiling [1, 16, 25, 41, 46], with some authors studying the use of the Java fork/join framework [9, 37], we are not aware of the existence of any profiler for the JVM specifically aiding performance analysis of a fork/join application.

Our work aims at filling this gap, introducing FJProf, a novel profiler specifically helping developers conduct performance analysis of fork/join applications running in a single JVM on a shared-memory multicore. FJProf accurately collects key metrics and
dynamic information that enable characterizing multiple performance attributes specific to a fork/join application. FJProf reports to developers information and easy-to-interpret graphics that facilitate the understanding of the details of the fork/join processing performed by a parallel application running on the JVM. The challenges in developing FJProf stem from the difficulty of designing efficient profiling techniques suitable for the JVM, able to detect every task spawned by a fork/join application while also mitigating the performance overheads caused by the inserted instrumentation code. We apply our tool to a workload of Renaissance [38], a recently released benchmark suite for the JVM, showing how FJProf helps developers characterize a realistic, complex fork/join application.

Our work makes the following contributions. We present FJProf, a fork/join profiler for the JVM (Sec. 3). To the best of our knowledge, the proposed tool is the first profiler for fork/join applications running on the JVM. We characterize a workload from the Renaissance suite, showing how our tool provides detailed information that facilitates performance analysis of a modern, realistic and complex fork/join application (Sec. 4). Sec. 2 provides background information. Sec. 5 discusses work related to our approach. Finally, Sec. 6 concludes.

2 BACKGROUND
In this section, we introduce background information and the terminology we use to describe our approach.

In the Java fork/join framework, tasks are modeled as subtypes of the abstract class java.util.concurrent.ForkJoinTask. Therefore, we use the term task to define an instance of class ForkJoinTask. Class ForkJoinTask is usually not subclassed directly. Instead, developers often extend one of the following subtypes of ForkJoinTask, depending on the needed style of fork/join processing: java.util.concurrent.RecursiveAction for computations that do not return results, java.util.concurrent.RecursiveTask for those that do, and java.util.concurrent.CountedCompleter for those in which completed actions can trigger other actions [29]. As indicated by its name, the primary methods offered by class ForkJoinTask to express fork/join parallelism are fork, which arranges the asynchronous execution of a task (as explained below, this action is known as task submission), and join, which waits for a task to complete and then returns the result of its computation.

A ForkJoinTask is a much lighter-weight entity than a normal thread. Computations modeled via ForkJoinTask should be of a size and structure that maintain as much independence as possible. They should minimize (if possible, eliminate) the use of shared resources, global (static) variables, locks, and dependencies [20]. Ideally, a task should contain code that runs to completion, avoiding expensive coordinated actions such as frequent communication with other tasks or the use of blocking synchronization apart from joining [21].

To execute parallel tasks, the Java fork/join framework employs a fork/join pool (implemented via class java.util.concurrent.ForkJoinPool). A fork/join pool is composed of a number of threads (called workers) whose purpose is to execute ForkJoinTasks. Workers are implemented via class java.util.concurrent.ForkJoinWorkerThread (which in turn extends class java.lang.Thread). Each worker has a deque (i.e., a double-ended queue) of tasks and processes them one by one. The fork/join pool implements a work-stealing [11, 48] scheduling strategy, i.e., if the deque of a worker is empty, the worker "steals" tasks by taking them from the deques belonging to other active workers. When a worker encounters a join operation, it processes other tasks, if available, until the target task has completed. Otherwise, all tasks run to completion without blocking [21].

To be executed by a worker, a task must be submitted to the fork/join pool. This is done by calling methods execute, invoke, or submit, which are defined in class ForkJoinPool. All of these methods end up invoking fork as the primary mechanism provided by ForkJoinTask to arrange the asynchronous execution of a task. Upon task submission, the fork/join pool adds the task to one of the workers' deques.

To execute a task, a worker invokes method ForkJoinTask.exec, which, according to the Java API, should contain the computations to be carried out by a task [29]. This method is not usually overridden directly. Instead, developers often override one of the following methods in the abstract subtypes of ForkJoinTask: RecursiveAction.compute, RecursiveTask.compute, or CountedCompleter.compute, all of which end up invoking method exec. A worker always executes method exec to perform the main computations defined in a task. We use the term execution method to denote the overridden exec method in any subtype of class ForkJoinTask. With task execution, we denote the execution of an execution method by a worker. We consider a task to be executed if its execution method has been executed at least once by a worker until normal or abnormal completion. Finally, we define a task as active if a worker is executing the task.

3 PROFILING METHODOLOGY
This section describes our profiling methodology, detailing the information and metrics targeted by FJProf, the components used to collect them, and how the final traces are produced.

3.1 Metrics
Here, we detail all metrics which our profiler focuses on. As shown in Table 1, FJProf collects a variety of metrics to provide extensive information enabling the analysis of a fork/join application.

Table 1: Dynamic information and metrics collected by FJProf.

Metric/Information        | Description                                                                                        | Component(s) used to collect the metric/information
Information about a task  | Thread responsible for creation and execution; starting and ending execution timestamps; type      | DiSL instrumentation
Fork and join operations  | Number of invocations of methods fork and join; starting and ending execution timestamps of fork and join operations | DiSL instrumentation
Task granularity          | Amount of work performed by each task                                                              | DiSL instrumentation, JNI, PAPI
Number of workers         | Number of workers available to the fork/join pool                                                  | DiSL instrumentation
CPU utilization           | Percentage of total available processor cycles consumed by running processes                       | top
Garbage collections       | Activations of the garbage collector for stop-the-world collections (starting and ending timestamps) | JVMTI

Task Information. FJProf specifically targets tasks spawned by a fork/join application running on the JVM. To this end, our tool detects the creation, submission, and execution of each task, profiling the thread responsible for its creation, the worker executing it, and the type of the task. FJProf also collects the starting and ending execution timestamps for each task, to allow correlating the execution of a task with other collected metrics. Lastly, to enable the analysis of the fork/join processing exposed by a parallel application, FJProf profiles invocations of methods fork and join of class ForkJoinTask.

Overall, the information collected enables fine-grained characterization of a fork/join application at task-level, easing the analysis of a task to the point of recognizing the class where it originated, the time interval in which it was active, or the specific worker in charge of executing it. Furthermore, the information collected for each task allows visually analyzing the level of parallelism exposed by a fork/join application in terms of the number of active tasks over time, along with the occurrences of fork and join invocations during the whole execution of the application (see Sec. 3.2).

Task Granularity. We define task granularity as the amount of work performed by each task. Task granularity is a key performance attribute of fork/join parallelism, because it is related to the tradeoff between the overhead caused by a task-parallel execution and the potential performance gain. If a fork/join application uses small tasks (thus, each one carrying out only few computations), it can keep more CPU cores busy, improve locality and scalability, and decrease the amount of time that CPU cores must idly wait for another task to be executed [20]. However, spawning an excessive number of small tasks may overwhelm parallel processing due to the overheads introduced by the creation, scheduling and management of too many small tasks [29]. On the other hand, attempting to reduce such overheads by spawning only a few too large tasks (thus, each one performing substantial computations) may result in underutilizing some of the available computing resources, given the lack of tasks to be scheduled on idle cores. In consequence, a fork/join application may suffer from load imbalance and low CPU utilization, thus limiting task parallelism.

There is no general optimal solution to the aforementioned tradeoffs. However, it is possible to analyze task granularity for a specific parallel application. To account task granularity, FJProf measures the computations performed in the dynamic extent of the execution method of a task, considering all of them as part of the granularity of the task. Since measuring the granularity of tasks that are created but never executed is unfeasible, FJProf discards such tasks.

To represent the task granularity of a fork/join application, FJProf enables selecting one of the following metrics, each one having different strengths and limitations: execution time, bytecode count, and reference-cycles count. This plurality increases the portability of our tool, making it suitable for diverse environments.

Using execution time (i.e., the amount of time a task was in execution, measured as wall time [36]) as the metric to characterize task granularity has some advantages. This metric is easy to profile on the JVM, requiring very little instrumentation, thus mitigating the overheads incurred due to the inserted instrumentation code. Furthermore, wall time can be interpreted very easily. On the other hand, this metric may account for time intervals where the worker executing the task was not scheduled to execute on a computing core by the OS, resulting in an overestimation of task granularity. Moreover, this metric is platform-dependent, limiting the reproducibility of analyses focused on the task granularity exhibited by a fork/join application.

Complementarily, FJProf can measure task granularity in terms of bytecode count, i.e., the number of bytecodes executed by a task. This metric is little perturbed by the inserted instrumentation code (since the metric disregards the inserted bytecodes), is platform-independent (as long as the same Java library is used), and does not require any special hardware support [4, 6]. However, bytecode count may underestimate the work performed by a task since it cannot track code without a bytecode representation (e.g., native code), represents computations of different complexity with the same unit, and may account bytecode that is not executed due to optimizations performed by the JVM's dynamic compiler.

Finally, FJProf can measure task granularity by profiling the number of reference cycles¹ elapsed during the execution of a task (excluding intervals where the worker was not scheduled to execute on a core). Reference cycles have the advantage of taking into account the complexity of operations, accounting latencies due to misalignments and cache misses, and can be used to obtain a temporal reference provided the nominal CPU frequency is known. On the other hand, reference cycles are platform-dependent and require hardware support, thus compromising the reproducibility of analyses concerning task granularity for fork/join applications.

¹Reference cycles are collected at the nominal CPU frequency, regardless of the presence of technologies enabling frequency scaling.

Number of Workers. FJProf profiles the number of workers available to the fork/join pool used by the analyzed application. This information is often useful during the performance analysis of a fork/join application, because it may have an impact on the management and scheduling of tasks by the fork/join pool.

CPU Utilization. FJProf measures CPU utilization. This metric allows determining whether the CPU is well utilized by the fork/join application. For instance, low values of CPU utilization may indicate low parallelism.

Garbage Collections. Lastly, FJProf detects all activations of the Garbage Collector (GC) for stop-the-world collections, i.e., periods when all threads cease modifying the state of the JVM. By tracking GC activities, FJProf can filter out fluctuations in the CPU utilization due to garbage collection. This is important since no task can be in execution during stop-the-world collections; hence, any CPU-utilization measurement would not refer to parallelism exposed by a fork/join application.

3.2 Post-processing and Output
Here, we describe the traces provided by FJProf, the offline post-processing required to refine them, and the generation of graphics produced by the tool.

Generated Traces. FJProf dumps the collected information and metrics into three different traces. A first trace contains all information about tasks, task granularity, fork and join occurrences, and the number of workers available to the fork/join pool. A second trace reports GC activations. Lastly, FJProf generates a trace reporting CPU-utilization measurements. Overall, all traces provide to developers detailed information that they can use to understand the behavior of a fork/join application running on the JVM.

The traces are post-processed offline to avoid data-processing overheads during application execution. FJProf performs trace refinement by removing from the final traces both non-executed tasks and metrics collected during GC. Finally, the offline post-processing automatically produces the graphics described below. The offline post-processing is fully supported by Python scripting, which can be easily extended by users who require conducting additional customized analyses.

Generated Graphics. FJProf produces a variety of graphics assisting developers in visually analyzing several performance attributes specific to a fork/join application running on the JVM.²

²Examples of all produced graphics can be found in Sec. 4.

To support the analysis of task granularity, FJProf generates a box plot and a Cumulative Distribution Function (CDF). Both graphics provide complementary views facilitating developers understanding the distribution of task granularities exhibited by a fork/join application. Thanks to both graphics, developers can easily identify the size of the tasks dominating the execution of a fork/join application, recognizing also the presence of outliers in the form of too small or too large tasks, which can pinpoint suboptimal task granularities.

Regarding the analysis of fork/join processing, FJProf generates a graphic depicting the occurrences of the fork and join operations performed by a fork/join application over time. This graphic assists developers in visually understanding the effectiveness of a particular fork/join design by detailing how a parallel application uses fork/join processing during its whole execution.

To support analyses focused on task-parallelism, FJProf produces a plot describing the number of active tasks over time. This figure is key in helping developers understand the level of parallelism achieved by the application during its execution, enabling the identification of periods of time where the application uses a significant number of tasks as well as intervals where task-parallel processing decreases.

In addition, FJProf generates a graphic reporting the CPU utilization of a fork/join application over time. This figure discriminates CPU-utilization measurements by the user and system (kernel) components, complementarily showing the total and average usage. This graphic allows developers to quickly realize whether a fork/join application makes good use of the available CPU resources.

3.3 Implementation
In this section we detail the components of FJProf involved in metric collection and the interactions among them.

Table 1 shows the components of FJProf responsible for the collection of information and metrics. FJProf instruments every task executed by a fork/join application. To this end, FJProf builds on DiSL [24], a dynamic program analysis³ framework resorting to bytecode instrumentation. The use of DiSL makes it possible to accurately profile all information about tasks, invocations of methods fork and join, and task granularity (measured using wall time, bytecode count, or reference-cycles count), along with querying the Java fork/join framework to obtain the number of workers available to the fork/join pool.

³Dynamic program analysis employs runtime techniques (such as instrumentation and profiling) to explore the runtime behavior of a target application.

DiSL is spawned on a separate JVM, i.e., the DiSL server. A native agent attached to the JVM executing the target fork/join application intercepts classloading, sending loaded classes to the DiSL server. There, the instrumentation logic determines the methods to be profiled and inserts instrumentation code in them. Next, the modified classes are returned to the JVM executing the target application. The DiSL weaver guarantees full bytecode coverage, i.e., DiSL instruments every Java method with a bytecode representation, enabling the complete instrumentation of the Java library, which is notoriously hard to instrument [5]. To further isolate the analysis from the application, FJProf uses Shadow VM [23], a deployment setting of DiSL which runs analysis code in a separate JVM process, reducing overheads incurred by the instrumentation while preventing known issues inherent to non-isolated approaches [19]. Upon collection, metrics are sent to the Shadow VM, which contains most of the profiling logic and data structures. FJProf interacts with the Shadow VM thanks to a native agent attached to the JVM executing the target fork/join application.

FJProf resorts to JVMTI [32] to detect GC activity. A dedicated agent attached to the fork/join application records timestamps upon the start (event GarbageCollectionStart) and end (event GarbageCollectionFinish) of each stop-the-world garbage collection.

To profile CPU utilization, our profiler integrates top [18], a tool capable of querying performance counters on Unix-based distributions. This metric is sampled periodically at the minimum period allowed by the tool (i.e., approximately every 150 ms), where each sample profiled by top represents the instantaneous CPU utilization.

Finally, to measure task granularity using reference-cycles count, FJProf resorts to Hardware Performance Counters (HPCs) that can be queried efficiently via PAPI [15], a third-party C library providing a common interface to query low-level per-thread virtualized counters. A JNI [31] agent attached to the target application enables the instrumentation logic to efficiently activate/deactivate the accounting of reference cycles per task execution.
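To ground the terminology of Sec. 2 and the wall-time variant of task granularity (Sec. 3.1), the following is a minimal, hand-written sketch. It is not FJProf's actual mechanism (FJProf injects equivalent timestamping via DiSL bytecode instrumentation, with no source changes to the target application): a RecursiveTask whose execution method records the wall time of its own dynamic extent.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;
import java.util.concurrent.atomic.LongAdder;

// Sketch only (not FJProf's DiSL-based instrumentation): a RecursiveTask that
// splits work until a sequential threshold and timestamps each execution of
// its execution method (compute).
public class SumTask extends RecursiveTask<Long> {
    static final LongAdder wallTimeNanos = new LongAdder(); // accumulated per-task wall time
    private static final int THRESHOLD = 1_000;
    private final long[] values;
    private final int from, to; // half-open range [from, to)

    SumTask(long[] values, int from, int to) {
        this.values = values;
        this.from = from;
        this.to = to;
    }

    @Override
    protected Long compute() {
        long start = System.nanoTime(); // starting execution timestamp
        try {
            if (to - from <= THRESHOLD) { // small enough: compute sequentially
                long sum = 0;
                for (int i = from; i < to; i++) sum += values[i];
                return sum;
            }
            int mid = (from + to) >>> 1;
            SumTask left = new SumTask(values, from, mid);
            left.fork();                                     // task submission
            long right = new SumTask(values, mid, to).compute();
            return left.join() + right;                      // wait for the forked task
        } finally { // ending timestamp, recorded even on abnormal completion
            wallTimeNanos.add(System.nanoTime() - start);
        }
    }

    public static void main(String[] args) {
        long[] values = new long[100_000];
        for (int i = 0; i < values.length; i++) values[i] = i + 1;
        long total = new ForkJoinPool().invoke(new SumTask(values, 0, values.length));
        System.out.println(total); // 5000050000
        System.out.println("accumulated wall time (ns): " + wallTimeNanos.sum());
    }
}
```

As in FJProf, the measured interval covers the dynamic extent of the execution method, so a parent task's wall time includes the time spent joining its children; the per-task values are therefore meaningful individually rather than as a sum.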

4 EVALUATION
In this section we evaluate FJProf, showing how it helps developers characterize a fork/join application running on the JVM.

4.1 Target Benchmark
Our evaluation targets a modern, complex and realistic fork/join application running on a single JVM in a shared-memory multicore. Specifically, we apply FJProf to a workload from Renaissance [38], a recently released benchmark suite for the JVM. This suite contains several modern, real-world, concurrent and object-oriented workloads that exercise various concurrency primitives. In contrast to previous well-known benchmark suites for the JVM, such as DaCapo [7], ScalaBench [44], and SPECjvm2008 [8], Renaissance includes a workload specifically exercising fork/join parallelism: fj-kmeans. This workload implements a K-means⁴ clustering algorithm that uses the Java fork/join framework. We use the latest release of Renaissance at the time of writing, i.e., Renaissance v0.10, released on October 30, 2019.

⁴K-means is a clustering algorithm that groups a set of items not previously classified into a predefined number of k clusters.

4.2 Experimental Setup
The evaluation presented here considers only the steady-state, i.e., the state achieved after running several warm-up iterations until GC ergonomics and dynamic compilation stabilize. We run 30 warm-up iterations before profiling the benchmark with FJProf, which corresponds to the default repetition number for fj-kmeans as set by the developers of Renaissance [39].

We conduct our evaluation on a server-class machine equipped with two NUMA nodes, each one with an Intel Xeon E5-2680 (2.7 GHz) processor with 8 physical cores and 64 GB of RAM, running under Ubuntu 18.04.3 LTS. When profiling the workload, no other resource-intensive application was in execution. Moreover, we pin fj-kmeans to an exclusive NUMA node (i.e., other processes, including the DiSL server and Shadow VM, run on a different NUMA node). This deployment setting increases the isolation of the workload as it exclusively utilizes the CPU cores and the memory of its dedicated underlying NUMA node. Turbo Boost and Hyper-Threading are disabled, and the CPU governor is set to "performance". Lastly, we use Java OpenJDK 1.8.0_222, build 25.222-b10.

4.3 Characterizing a Fork/Join Application
In this section we show how FJProf assists developers in characterizing several performance attributes that are specific to a fork/join application.

Task Granularity. When profiling fj-kmeans, FJProf detects 21 412 tasks spawned during the execution of the workload. In Figure 1, we present the CDF of task granularity for fj-kmeans. The x-axis reports task granularity in terms of reference-cycles count using a logarithmic scale, whereas the y-axis shows the cumulative probability. In the figure, we can observe that the size of most of the tasks (88.18%) spawned by fj-kmeans is within the range between 10^5 and 10^8 reference cycles.

Figure 1: Cumulative Distribution Function (CDF) of task granularity for fj-kmeans.

Complementarily, Figure 2 reports the box plot of the distribution of task granularity for fj-kmeans. The y-axis represents task granularity on a logarithmic scale. The green line represents the median (945 122 reference cycles). The lower and upper blue lines represent the first quartile (203 101 reference cycles) and the third quartile (4 256 321 reference cycles), respectively. The outliers are plotted individually as circles (they can be seen above the box). The traces show a group of 2 807 tasks, mainly of class org.renaissance.jdk.concurrent.JavaKMeans$AssignmentTask, executing more than 10^7 reference cycles. The largest granularity is equal to 111 476 466 reference cycles. The results presented in Figures 1 and 2 suggest that fj-kmeans may feature several tasks of a large size, which could be split into smaller tasks. This may lead to improved load balancing and better CPU utilization, potentially enabling speedups.

Figure 2: Box plot of the distribution of task granularity for fj-kmeans.

Fork and Join Operations. Figure 3 shows the occurrences of fork and join operations executed by fj-kmeans. The x-axis shows the execution time of the application as a percentage, while the y-axis reports the number of occurrences over time.⁵ The traces show that fj-kmeans executes 664 802 fork and join operations in total.

⁵Figures 3 and 4 are obtained by dividing the total execution time of the application into 100 equally-sized time intervals, plotting one point per time interval.
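For illustration, the box-plot statistics reported above (median and quartiles of task granularity) can be derived from a trace of per-task granularities with a few lines of code. The sketch below uses Java and the nearest-rank method; note that FJProf's actual post-processing is Python-based (Sec. 3.2), and the sample values here are made up.

```java
import java.util.Arrays;

// Illustrative only: deriving box-plot statistics (as in Figure 2) from a
// trace of per-task granularities. FJProf's real post-processing is Python
// scripting; the granularity values below are invented for the example.
public class GranularityStats {
    // Nearest-rank percentile over a sorted array, with p in (0, 1].
    static long percentile(long[] sorted, double p) {
        int idx = (int) Math.ceil(p * sorted.length) - 1;
        return sorted[Math.max(idx, 0)];
    }

    public static void main(String[] args) {
        long[] granularities = {40_000, 200_000, 900_000, 4_000_000, 90_000_000};
        Arrays.sort(granularities); // trace entries arrive unordered
        System.out.println("Q1     = " + percentile(granularities, 0.25));
        System.out.println("median = " + percentile(granularities, 0.50));
        System.out.println("Q3     = " + percentile(granularities, 0.75));
    }
}
```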

forks joins Total User System AVG 100%

8000

7000 80%

6000 60% 5000

4000 40% Ocurrences

3000 CPUUtilization

2000 20% 1000

0 0% 0% 20% 40% 60% 80% 100% 0% 20% 40% 60% 80% 100% Execution Time Execution Time

Figure 3: Rate of fork and join operations over time in fj- Figure 5: CPU utilization over time in fj-kmeans when using kmeans. 8 workers.

which is the minimum sampling time for CPU utilization enabled

350 by top. Measurements sampled during stop-the-world garbage col- lections have been removed. Figure5 discriminates CPU-utilization 300 measurements considering the user and system components, also 250 showing the total and average usage. As shown in the figure, most of the time the CPU is not fully utilized, exhibiting an average uti- 200 lization value of 53.90%. This suggests that the workload may better 150 utilize the available computing resources during its execution.

[Figure 4: Active tasks over time in fj-kmeans.]

From the figure, it is possible to determine that the workload exercises fork/join processing during its whole execution. Despite the presence of some large tasks, most of the tasks execute for a short time, causing the two lines in the figure to visually overlap.

Active Tasks over Time. To better understand the level of parallelism achieved by fj-kmeans in terms of the executed tasks, Figure 4 shows the number of active tasks over time. The x-axis reports the execution time of the application as a percentage, while the y-axis shows the number of active tasks over time. The green line representing active tasks allows determining that fj-kmeans exposes task parallelism during its whole execution, reaching a peak of 354 tasks being executed concurrently around 17% of the execution time. FJProf also reports that the number of workers available to the fork/join pool used by fj-kmeans is a fixed value corresponding to the total number of available cores (8 in the machine used for the evaluation).

CPU Utilization. Finally, Figure 5 reports the CPU utilization of fj-kmeans over time. The x-axis reports the execution time of the application as a percentage, whereas the y-axis shows CPU utilization. The time unit used to produce the figure corresponds to 150 ms.

4.4 Optimizing Task Granularity of fj-kmeans

After analyzing the results reported by FJProf, we can determine that fj-kmeans can be optimized. Concretely, the workload exhibits several large tasks that may limit CPU utilization. Based on these findings, we reduce the task granularity of the benchmark. In fj-kmeans, task granularity is inversely proportional to the number of available workers by algorithm design. Hence, to reduce task granularity, we increase the number of workers available to the fork/join pool supporting the execution of fj-kmeans. This number can be easily changed by passing the target number of workers⁶ via the system property java.util.concurrent.ForkJoinPool.common.parallelism. As previously mentioned, the number of workers set in the original configuration of fj-kmeans corresponds to the total number of CPU cores available to the JVM, i.e., the value returned by the method Runtime.availableProcessors [28], which is equal to 8 in the machine used for the evaluation.

From the initial value of 8, we iteratively double the number of workers, measuring the speedup obtained by this modification. We continue this approach until no speedup is observed. Our evaluation results show that increasing the number of workers of the fork/join pool used by fj-kmeans leads to a maximum speedup of 6% when 128 workers are used. The total number of tasks spawned in this setting is equal to 29 604, the median of the granularities exhibited by the workload is 823 730 reference cycles, and the largest granularity observed is 95 112 333 reference cycles.

⁶ According to the Java API [28], this number must be a non-negative integer with a maximum value of 32 767.

FJProf: Profiling Fork/Join Applications on the Java Virtual Machine · VALUETOOLS 2020, May 18–20, 2020, Tsukuba, Japan
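The configuration change described above can be sketched in plain Java. The snippet below is an illustrative sketch, not code from FJProf or from the Renaissance benchmark harness; it assumes the property is set before the common fork/join pool is first initialized (in practice, the value is typically passed on the command line with -D).

```java
import java.util.concurrent.ForkJoinPool;

public class WorkerCountDemo {
    public static void main(String[] args) {
        // Workers used by default: one per CPU core visible to the JVM
        // (8 on the machine used in the evaluation).
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println("Available cores: " + cores);

        // Request 128 workers for the common fork/join pool. The property is
        // only read when the common pool is initialized, so it must be set
        // before the pool is first used (or passed on the command line via
        // -Djava.util.concurrent.ForkJoinPool.common.parallelism=128).
        System.setProperty(
                "java.util.concurrent.ForkJoinPool.common.parallelism", "128");
        System.out.println("Common pool parallelism: "
                + ForkJoinPool.commonPool().getParallelism());
    }
}
```

Note that this system property affects only the common pool; a ForkJoinPool constructed with an explicit parallelism argument is unaffected. As noted in the footnote, the value must be a non-negative integer of at most 32 767.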

Figure 6 reports the CPU utilization of fj-kmeans when using 128 workers, which shows an average CPU utilization of 59.36%. The speedup obtained may be attributed to the lower task granularity, which in turn enables better load balancing and allows more tasks to be effectively distributed among the available computing resources, as demonstrated by the increased CPU utilization. Further potential optimizations involving code refactoring (aimed at better splitting the work among tasks) are part of our future work. Overall, our evaluation results show that FJProf enables characterizing fork/join parallelism on the JVM, providing developers with useful information that they can use to understand the behavior of a modern, complex, and realistic fork/join application running on the JVM.

[Figure 6: CPU utilization over time in fj-kmeans when using 128 workers. The plot shows total, user, and system CPU utilization, together with the average (AVG).]

5 RELATED WORK

This section discusses studies related to our approach. We first consider work focused on the analysis of fork/join parallelism on the JVM. Next, we compare FJProf to existing tools for profiling and analyzing multiple characteristics of parallel applications.

Fork/Join Parallelism on the JVM. To the best of our knowledge, only a few studies focus on fork/join parallelism on the JVM. In a preliminary position paper by the authors [42], we sketch some of the ideas behind the tool introduced in this paper.

De Wael et al. [9] inspect the code of 120 projects to study the use of the Java fork/join framework, manually confirming the frequent use of four best coding practices and three anti-patterns that potentially limit parallel performance. The authors provide practical conclusions that guide language or framework designers in proposing abstractions that steer developers toward using the right patterns while developing a fork/join application.

Pinto et al. [37] focus on studying parallelism problems in fork/join applications running on the JVM, identifying six bottlenecks impacting both performance and energy efficiency. They also present FJDetector, a proof-of-concept Eclipse plugin for detecting some of the bottlenecks identified by the authors, relying on code inspection to prototype the analysis of applications adhering to three basic fork/join designs.

Overall, the above work provides meaningful insights on identifying good and bad coding practices, mainly by means of inspecting the source code of a fork/join application. Complementarily, our work specifically addresses the profiling of fork/join applications running on the JVM. Unlike the aforementioned work, our tool benefits from the use of dynamic program analysis to accurately collect several key metrics and information, enabling the study of parallel characteristics of a fork/join application that are beyond the scope of techniques resorting to simple code inspection. Furthermore, our evaluation results show that FJProf is not limited to the analysis of applications adhering to basic, fixed fork/join designs, but assists developers in conducting performance analysis of complex fork/join applications targeting the JVM nowadays.

Parallelism Profilers. Many tools originating in industry and academia analyze various parallel aspects of an application. Several parallelism profilers based on the work-span model [17] target parallel applications. Cilkview [13] builds upon this model to predict achievable speedup bounds for a Cilk [11] application when the number of used cores increases. CilkProf [43] extends Cilkview by assessing the computations at each call site to determine which of them constitutes a bottleneck to parallelization. Although these tools allow detecting bottlenecks and discovering parallelization opportunities, they do not target fork/join parallelism on the JVM.

tgp [41] is a profiler for multi-threaded applications running on the JVM, which detects all tasks executed by a parallel application. Unfortunately, the profiling model of tgp is suitable only for Java thread pools [35]. The analyses obtained with tgp are highly specific to thread pools and cannot be generalized to applications using fork/join pools, as the task model implemented by thread pools to perform task parallelism notably differs from the one used by fork/join pools. tgp lacks knowledge of the internals of the Java fork/join framework, resulting in profiles at a low abstraction level that fall short in recognizing the key characteristics of the fork/join processing executed by a parallel application targeting the JVM. Complementarily, FJProf incorporates profiling techniques specifically addressing performance analysis of the fork/join parallelism exposed by an application running on the JVM, providing detailed information and easy-to-interpret graphics that aid developers in understanding several key characteristics of fork/join applications.

HPCToolkit [1] is a suite of tools aimed at locating and quantifying scalability bottlenecks in parallel programs. The suite instruments the binaries of the target application to achieve language independence, and relies massively on HPCs. THOR [46] helps developers understand the state of a Java thread, i.e., whether the thread is running on a core or is idling. Finally, many tools support the analysis of parallel Java applications through the collection of metrics at the application level. Notable examples include VisualVM [47], Performance Analyzer [25], Java Mission Control [30], JProfiler [10], and YourKit [49]. Some tools, such as Intel VTune [16], also support hardware-level profiling resorting to HPCs. Overall, although the tools mentioned above focus on several features of parallel applications running on the JVM, none of them specializes in fork/join applications, falling short in analyzing performance attributes that are crucial to understanding the behavior of this type of parallel application.
Rather than focusing on the high-level characterization of processes or threads over time, our tool enables a fine-grained characterization of a fork/join application at the task level, facilitating developers in conducting performance analysis on key attributes specifically describing fork/join processing, including the level of parallelism in terms of the number of tasks spawned over time, the distribution of task granularities, and the occurrences of fork and join operations during the execution of the application.

6 CONCLUSIONS

In this paper, we presented FJProf, a novel profiler for fork/join applications running on the JVM. We are not aware of any other profiler relying on dynamic program analysis to accurately collect information and metrics specifically focused on fork/join applications targeting the JVM. We used FJProf to characterize a realistic, modern, and complex fork/join application, analyzing several performance attributes specific to fork/join parallelism, including task granularity, the level of parallelism in terms of active tasks over time, and the number of fork and join operations.

As part of our future work, we plan to extend FJProf to support the analysis of other parallel aspects of fork/join applications, such as the analysis of contention among the spawned tasks. We also plan to conduct further performance analyses focused on the scheduling of tasks by the work-stealing algorithm. Finally, we plan to release FJProf as open-source software.

REFERENCES
[1] L. Adhianto, S. Banerjee, M. Fagan, M. Krentel, G. Marin, J. Mellor-Crummey, and N. R. Tallent. 2010. HPCTOOLKIT: Tools for Performance Analysis of Optimized Parallel Programs. Concurr. Comput.: Pract. Exper. 22, 6 (2010), 685–701.
[2] Apache Groovy project. 2019. Groovy. http://www.groovy-lang.org/.
[3] Colin Barrett, Christos Kotselidis, and Mikel Luján. 2016. Towards Co-designed Optimizations in Parallel Frameworks: A MapReduce Case Study. In CF. 172–179.
[4] Walter Binder. 2005. A Portable and Customizable Profiling Framework for Java Based on Bytecode Instruction Counting. In APLAS. 178–194.
[5] Walter Binder, Jarle Hulaas, and Philippe Moret. 2007. Advanced Java Bytecode Instrumentation. In PPPJ. 135–144.
[6] Walter Binder, Jane G. Hulaas, and Alex Villazón. 2001. Portable Resource Control in Java. In OOPSLA. 139–155.
[7] Stephen M. Blackburn, Robin Garner, Chris Hoffmann, Asjad M. Khang, Kathryn S. McKinley, Rotem Bentzur, Amer Diwan, Daniel Feinberg, Daniel Frampton, Samuel Z. Guyer, Martin Hirzel, Antony Hosking, Maria Jump, Han Lee, J. Eliot B. Moss, Aashish Phansalkar, Darko Stefanović, Thomas VanDrunen, Daniel von Dincklage, and Ben Wiedermann. 2006. The DaCapo Benchmarks: Java Benchmarking Development and Analysis. In OOPSLA. 169–190.
[8] Standard Performance Evaluation Corporation. 2019. SPECjvm2008. https://www.spec.org/jvm2008/.
[9] Mattias De Wael, Stefan Marr, and Tom Van Cutsem. 2014. Fork/Join Parallelism in the Wild: Documenting Patterns and Anti-patterns in Java Programs Using the Fork/Join Framework. In PPPJ. 39–50.
[10] ej-technologies. 2019. JProfiler. https://www.ej-technologies.com/products/jprofiler/overview.html.
[11] Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. 1998. The Implementation of the Cilk-5 Multithreaded Language. In PLDI. 212–223.
[12] Philipp Haller and Martin Odersky. 2009. Scala Actors: Unifying Thread-based and Event-based Programming. Theor. Comput. Sci. 410, 2-3 (2009), 202–220.
[13] Yuxiong He, Charles E. Leiserson, and William M. Leiserson. 2010. The Cilkview Scalability Analyzer. In SPAA. 145–156.
[14] Zhang Hong, Wang Xiao-ming, Cao Jie, Ma Yan-hong, Guo Yi-rong, and Wang Min. 2016. An Optimized Model for MapReduce Based on Hadoop. TELKOMNIKA 14 (2016), 1552.
[15] ICL. 2019. PAPI. http://icl.utk.edu/papi/.
[16] Intel. 2019. Intel VTune Amplifier. https://software.intel.com/en-us/intel-vtune-amplifier-xe.
[17] Joseph JaJa. 1992. Introduction to Parallel Algorithms. Addison-Wesley Professional.
[18] James C. Warner. 2013. top(1) - Linux man page. https://linux.die.net/man/1/top.
[19] Stephen Kell, Danilo Ansaloni, Walter Binder, and Lukáš Marek. 2012. The JVM is Not Observable Enough (and What to Do About It). In VMIL. 33–38.
[20] Doug Lea. 1999. Concurrent Programming in Java. Second Edition: Design Principles and Patterns (2nd ed.). Addison-Wesley Professional.
[21] Doug Lea. 2000. A Java Fork/Join Framework. In JAVA. 36–43.
[22] Lightbend Inc. 2019. Akka. http://akka.io.
[23] Lukáš Marek, Stephen Kell, Yudi Zheng, Lubomír Bulej, Walter Binder, Petr Tůma, Danilo Ansaloni, Aibek Sarimbekov, and Andreas Sewe. 2013. ShadowVM: Robust and Comprehensive Dynamic Program Analysis for the Java Platform. In GPCE. 105–114.
[24] Lukáš Marek, Alex Villazón, Yudi Zheng, Danilo Ansaloni, Walter Binder, and Zhengwei Qi. 2012. DiSL: A Domain-specific Language for Bytecode Instrumentation. In AOSD. 239–250.
[25] Oracle. 2017. Performance Analyzer. https://docs.oracle.com/cd/E77782_01/html/E77798/index.html.
[26] Oracle. 2019. Class Arrays. https://docs.oracle.com/javase/8/docs/api/java/util/Arrays.html.
[27] Oracle. 2019. Class CompletableFuture. https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/CompletableFuture.html.
[28] Oracle. 2019. Class ForkJoinPool. https://docs.oracle.com/javase/9/docs/api/java/util/concurrent/ForkJoinPool.html.
[29] Oracle. 2019. Class ForkJoinTask. https://docs.oracle.com/javase/9/docs/api/java/util/concurrent/ForkJoinTask.html.
[30] Oracle. 2019. Java Mission Control. http://www.oracle.com/technetwork/java/javaseproducts/mission-control/java-mission-control-1998576.html.
[31] Oracle. 2019. Java Native Interface. http://docs.oracle.com/javase/8/docs/technotes/guides/jni/.
[32] Oracle. 2019. Java Virtual Machine Tool Interface (JVM TI). https://docs.oracle.com/javase/8/docs/technotes/guides/jvmti/.
[33] Oracle. 2019. Package java.util.stream. https://docs.oracle.com/javase/9/docs/api/java/util/stream/package-summary.html.
[34] Oracle. 2019. The Java Tutorials - Fork/Join. https://docs.oracle.com/javase/tutorial/essential/concurrency/forkjoin.html.
[35] Oracle. 2019. Thread Pools. https://docs.oracle.com/javase/tutorial/essential/concurrency/pools.html.
[36] Achille Peternier, Walter Binder, Akira Yokokawa, and Lydia Chen. 2013. Parallelism Profiling and Wall-time Prediction for Multi-threaded Applications. In ICPE. 211–216.
[37] Gustavo Pinto, Anthony Canino, Fernando Castor, Guoqing Xu, and Yu David Liu. 2017. Understanding and Overcoming Parallelism Bottlenecks in Fork/Join Applications. In ASE. 765–775.
[38] Aleksandar Prokopec, Andrea Rosà, David Leopoldseder, Gilles Duboscq, Petr Tůma, Martin Studener, Lubomír Bulej, Yudi Zheng, Alex Villazón, Doug Simon, Thomas Würthinger, and Walter Binder. 2019. Renaissance: Benchmarking Suite for Parallel Applications on the JVM. In PLDI. 31–47.
[39] Renaissance Suite. 2019. Documentation Overview. https://renaissance.dev/docs.
[40] Rich Hickey. 2018. The Clojure Programming Language. https://clojure.org.
[41] Andrea Rosà, Eduardo Rosales, and Walter Binder. 2018. Analyzing and Optimizing Task Granularity on the JVM. In CGO. 27–37.
[42] Eduardo Rosales, Andrea Rosà, and Walter Binder. 2019. Optimization Coaching for Fork/Join Applications on the Java Virtual Machine. In Programming. 7:1–7:3.
[43] Tao B. Schardl, Bradley C. Kuszmaul, I-Ting Angelina Lee, William M. Leiserson, and Charles E. Leiserson. 2015. The Cilkprof Scalability Profiler. In SPAA. 89–100.
[44] Andreas Sewe, Mira Mezini, Aibek Sarimbekov, and Walter Binder. 2011. Da Capo Con Scala: Design and Analysis of a Scala Benchmark Suite for the Java Virtual Machine. In OOPSLA. 657–676.
[45] Robert Stewart and Jeremy Singer. 2018. Comparing Fork/Join and MapReduce. (2018).
[46] Q. M. Teng, H. C. Wang, Z. Xiao, P. F. Sweeney, and E. Duesterwald. 2010. THOR: A Performance Analysis Tool for Java Applications Running on Multicore Systems. IBM Journal of Research and Development 54, 5 (2010), 4:1–4:17.
[47] VisualVM. 2015. VisualVM. https://visualvm.github.io.
[48] F. Warren Burton and Michael Sleep. 1981. Executing Functional Programs on a Virtual Tree of Processors. In FPCA. 187–194.
[49] YourKit. 2019. YourKit. https://www.yourkit.com.