Optimization Coaching for Fork/Join Applications on the Java Virtual Machine

Extended abstract

Eduardo Rosales
Università della Svizzera italiana
Lugano, Switzerland
[email protected]

Walter Binder∗
Università della Svizzera italiana
Lugano, Switzerland
[email protected]

ABSTRACT
In the multicore era, developers are increasingly writing parallel applications to leverage the available computing resources and to achieve speedups. Still, developing and tuning parallel applications to exploit the hardware resources remains a major challenge. We tackle this issue for fork/join applications running in a single Java Virtual Machine (JVM) on a shared-memory multicore.

Developing fork/join applications is often challenging, since failing to balance the tradeoff between maximizing parallelism and minimizing overheads, along with maximizing locality and minimizing contention, may lead to applications suffering from several performance problems (e.g., excessive object creation/reclaiming, load imbalance, inappropriate synchronization).

In contrast to the manual experimentation commonly required to tune fork/join parallelism on the JVM, we propose coaching developers towards optimizing fork/join applications by automatically diagnosing performance issues in such applications and further suggesting concrete code modifications to fix them.

CCS CONCEPTS
• General and reference → Metrics; • Software and its engineering → Software performance;

KEYWORDS
Performance analysis, fork/join applications, optimization coaching, software optimization, Java Virtual Machine

ACM Reference Format:
Eduardo Rosales and Walter Binder. 2018. Optimization Coaching for Fork/Join Applications on the Java Virtual Machine: Extended abstract. In . ACM, New York, NY, USA, 4 pages.

∗PhD advisor

EuroDW'18, April 2018, Porto, Portugal

1 INTRODUCTION
Recent hardware advances have brought shared-memory multicores to the mainstream, increasingly encouraging developers to write parallel applications able to utilize the available computing resources. Still, developing and tuning parallel applications to exploit the hardware resources remains challenging.

We tackle this issue for fork/join applications, i.e., parallel versions of divide-and-conquer algorithms that recursively split (fork) work into tasks executed in parallel, wait for them to complete, and then typically merge (join) the results, running in a single Java Virtual Machine (JVM) on a shared-memory multicore.

The Java fork/join framework [27] is the implementation enabling fork/join applications on the JVM. It uses the work-stealing [57] scheduling strategy for parallel programs, in an implementation where each worker thread has a deque (i.e., a double-ended queue) of tasks and processes them one by one until the deque is empty, in which case the thread "steals" tasks from deques belonging to other active worker threads. Since the release of Java 8, the Java fork/join framework has been supporting the Java class library in the java.util.Arrays [37] class in methods performing parallel sorting, in the java.util.stream package [40], which provides sequences of elements supporting sequential and parallel aggregate operations, and in the java.util.concurrent.CompletableFuture [38] class, which extends the functionality provided by Java's Future API [39] to develop asynchronous programs in Java. Moreover, the Java fork/join framework supports thread management for other JVM languages [4, 16, 46] and has been increasingly used to support fork/join applications based on Actors [28, 42] and MapReduce [5, 20, 53], to mention some.

Despite the features offered by the Java fork/join framework to developers targeting the JVM, writing efficient fork/join applications remains challenging. While design patterns have been proposed [26] to guide developers in declaring fork/join parallelism, there is no unique optimal implementation that best resolves the tradeoff between maximizing parallelism and minimizing overheads, along with maximizing locality and minimizing contention. Failing to balance such conflicting goals in the development process may lead to fork/join applications suffering from several performance issues (e.g., excessive deque accesses and object creation/reclaiming, load imbalance, inappropriate synchronization) that may overshadow the performance gained by parallel execution.

Inspired by Optimization Coaching [51, 52], a technique focused on automatically providing the developer with information to facilitate both detecting issues and achieving potential optimizations in a program, we propose coaching developers towards optimizing fork/join applications by automatically diagnosing performance issues in such applications and further suggesting concrete code refactorings to solve them. To this end, we devise a tool generating recommendations that should only require source-level knowledge (i.e., knowledge of the high-level programming language in which the target application is implemented) to be straightforwardly interpreted and implemented by the developer.

Although several works and tools address the assisted optimization of sequential applications [15, 51, 52], parallelism discovery [14, 17, 19, 25, 45, 49, 50], parallelism profiling [1, 2, 11, 23, 36, 54, 56, 59], or have analyzed the use of concurrency on the JVM [29–31, 44], with some even targeting fork/join parallelism [9, 43], we are not aware of previous work on coaching a developer towards optimizing a fork/join application running in the JVM.

The remainder of this proposal is organized as follows. In Section 2, we discuss our research directions. In Section 3, we analyze the feasibility of our proposal by reviewing related work.

2 OPTIMIZATION COACHING FOR FORK/JOIN APPLICATIONS ON THE JVM
We propose coaching developers towards the optimization of fork/join applications on the JVM. To achieve our goal, it is imperative to address the following research directions.

Diagnosing performance issues. We plan to build a methodology aimed at diagnosing issues and understanding their impact on the performance of fork/join applications on the JVM. To this end, first we plan to define a new model to characterize fork/join tasks.

Second, we plan to determine the entities and metrics worth considering to automatically detect performance drawbacks in fork/join applications. We plan to use static analysis to inspect the source code in search of fork/join anti-patterns [9, 43]. Complementarily, we plan to use dynamic analysis to deal with polymorphism and reflection, along with collecting performance metrics enabling the diagnosis of performance issues noticeable at runtime (e.g., excessive deque accesses and object creation/reclaiming, load imbalance, inappropriate synchronization). To this end, we devise a vertical profiler [18] collecting information from the full system stack, including metrics at the application level (e.g., tasks and threads), Java fork/join framework level (e.g., task submissions, the number of already queued tasks), JVM level (e.g., garbage collections, allocations in the Java heap), operating-system level (e.g., CPU usage, memory load), and hardware level (e.g., reference cycles, machine instructions, context switches, cache misses).

Third, we plan to profile and characterize all tasks spawned by a fork/join application to reveal performance drawbacks related to suboptimal task granularity, a key factor when implementing fork/join parallelism on the JVM, as pointed out by Lea [26]. This is challenging, since recursion, fine-grained parallelism, and scheduling affecting the life-cycle of a fork/join task (e.g., the steal operation) complicate task-granularity profiling and may lead to incorrect measurements when not handled properly. Our profiler will rely on DiSL [33] to ensure full coverage of the tasks spawned by a fork/join application, including those in the Java class library.

Finally, the metrics collected can be susceptible to perturbations caused by the inserted instrumentation code, which may bias the analysis. Therefore, we plan to lower the perturbation of dynamic metrics by using efficient profiling data structures avoiding heap allocations in the target application [32].

In previous studies [47, 48] we analyzed the tradeoff between maximizing parallelism and minimizing overheads, mainly focusing on the size of the tasks spawned by task-parallel applications on the JVM. We introduced an early model defining a task as every instance of the Java interfaces java.lang.Runnable and java.util.concurrent.Callable, focusing on the study of Java thread pools [41]. Furthermore, we presented tgp [48], a task-granularity profiler for the JVM. Relying on bytecode instrumentation [33] and resorting to Hardware Performance Counters (HPC) [22], tgp allowed us to find suboptimal task granularities causing performance issues in the DaCapo [6] benchmark suite. We plan to extend this work to study suboptimal task granularity in fork/join applications on the JVM.

Suggesting optimizations. We plan to develop a methodology for automatically generating (and reporting to the developer) specific recommendations aimed at fixing detected performance issues in fork/join applications. Inspired by the works of De Wael et al. [9] and Pinto et al. [43], we plan to develop a tool able to recognize fork/join anti-patterns; it is also of paramount importance in our aim to automatically match such anti-patterns to concrete recommendations that a developer can use to avoid them. We plan to use calling-context profiling [3] to pinpoint the application code where tasks classified as problematic were created, submitted, or executed, guiding optimizations by suggesting the use of fork/join design patterns [26] and concrete code modifications to enable potentially missed parallelization opportunities. This is challenging, since automatically matching performance issues with concrete code suggestions to fix them will require the use of advanced data-analysis techniques. In a previous study [47] we used calling contexts to locate tasks previously identified as of suboptimal granularity in DaCapo, guiding optimizations through code refactoring that led to noticeable speedups. Based on this work, we plan to automate and specialize such techniques to specifically coach developers in optimizing fork/join applications on the JVM.

Validating optimizations. As previously pointed out by De Wael et al. [9], there is a lack of standard benchmarks suitable for performing empirical studies focused on fork/join parallelism on the JVM. Indeed, our early work mentioned above relied on DaCapo to study suboptimal task granularities mainly in Java thread pools; however, this suite does not contain any fork/join application. In this direction, we plan to discover workloads exhibiting high diversity in their task-parallel behavior by plugging tgp into AutoBench [60], a toolchain combining massive code-repository crawling, pluggable hybrid analyses, and workload-characterization techniques to discover candidate workloads satisfying the needs of domain-specific benchmarking. By fully integrating tgp and AutoBench, we plan to discover workloads exhibiting diverse fork/join parallelism, suitable for validating both aforementioned methodologies.
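Several of the framework-level metrics mentioned above are already exposed by the public java.util.concurrent.ForkJoinPool API. The sketch below (class and method names are ours, a minimal illustration rather than code from our profiler) samples a few of these counters after running a toy recursive workload; tgp instead collects such data via bytecode instrumentation and correlates it with metrics from the other layers of the system stack:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

// Samples framework-level counters of a ForkJoinPool after a workload.
public class PoolMetrics {

    // A toy divide-and-conquer task that forks so the pool counters move.
    static class Busy extends RecursiveAction {
        final int n;
        Busy(int n) { this.n = n; }
        @Override protected void compute() {
            if (n <= 1) return;               // base case: nothing left to split
            Busy left = new Busy(n / 2);
            left.fork();                      // push one half onto this worker's deque
            new Busy(n - n / 2).compute();    // process the other half directly
            left.join();                      // wait for the (possibly stolen) half
        }
    }

    static String sample(ForkJoinPool pool) {
        return String.format(
            "parallelism=%d poolSize=%d steals=%d queuedTasks=%d queuedSubmissions=%d",
            pool.getParallelism(),             // target number of worker threads
            pool.getPoolSize(),                // workers started so far
            pool.getStealCount(),              // estimate of inter-deque steals
            pool.getQueuedTaskCount(),         // tasks waiting in worker deques
            pool.getQueuedSubmissionCount());  // external submissions not yet taken
    }

    public static void main(String[] args) {
        ForkJoinPool pool = new ForkJoinPool(2);
        pool.invoke(new Busy(1 << 10));
        System.out.println(sample(pool));
        pool.shutdown();
    }
}
```

These counters alone are coarse pool-wide aggregates; per-task granularity and calling-context information require the instrumentation-based profiling discussed above.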
3 RELATED WORK
This section first introduces related studies on assisted optimization of applications. Next, we review previous work analyzing the use of concurrency on the JVM. Then, we focus on parallelism discovery. Finally, we review profilers for parallel applications.

Assisted optimization of applications. Several studies have focused on providing the developer with feedback to assist the optimization of several aspects of an application. Here, we relate three of them pioneering an approach that is very close to our goal of providing recommendations which require only source-level knowledge to be implemented by a developer. St-Amour et al. [51, 52] first coined the term Optimization Coaching, introducing a technique modifying the compiler's optimizer to inform the developer which optimizations it performs, which optimizations it misses, and suggesting specific changes to the source code that could trigger additional optimizations. The goal is reporting to the developer precise recommendations whose interpretation does not require expertise about the compiler's internals and whose implementation enables a missed or failed optimization attempted by the compiler. The technique was first prototyped in a work [52] modifying Racket's [12] simple ahead-of-time byte-compiler, which performs basic optimizations on Racket applications. In a second work [51], the authors extended this approach to consider dynamic, object-oriented, just-in-time compiled programming languages, prototyping assisted optimizations in the SpiderMonkey [34] JavaScript engine.

Gong et al. [15] present JITProf, a prototype profiling framework that detects code patterns prohibiting optimizations in JavaScript Just-In-Time (JIT) compilers and reports their occurrences to the developer with the goal of achieving optimizations. JITProf profiles the target application at runtime, implementing different strategies to detect seven JIT-unfriendly patterns. The goal is reporting to the developer JIT-unfriendly locations ranked by their relevance in preventing optimizations, to facilitate the developer fixing them.

The above works shed light on the goal of mentoring developers in locating performance drawbacks and in providing concrete recommendations to optimize an application. Nonetheless, they notably differ from our goal because the related tools were not designed for diagnosing and further recommending modifications specifically aimed at optimizing parallel applications. Indeed, the proposed techniques focus on feedback generated by specific compiler optimizers, falling short in considering other metrics from the full system stack which may enable revealing issues specific to parallel applications, including fork/join applications on the JVM.

Analyses of the use of concurrency on the JVM. Numerous studies have focused on the use of concurrency on the JVM. Here, we highlight two works closely related to our aim.

De Wael et al. [9] analyzed 120 open-source Java projects to study the use of the Java fork/join framework, manually confirming the frequent use of four best coding practices and three recurring anti-patterns that potentially limit parallel performance. Their aim is providing practical conclusions to guide language or framework designers in proposing new or improved abstractions that steer programmers toward using the right patterns and away from the anti-patterns in developing fork/join applications for the JVM.

Pinto et al. [43] study parallelism bottlenecks in fork/join applications on the JVM, with a unique focus on how they interact with underlying system-level features, such as work-stealing scheduling and memory management. They identify six bottlenecks impacting performance and energy efficiency in the Akka [28] actor framework and conduct an in-depth refactoring on it to improve its core messaging engine. They also present FJDetector, a plugin for the Eclipse IDE [55] prototyping the detection of some of the bottlenecks identified by the authors by means of code inspection.

Overall, these works bring meaningful insight into methodologies focused on detecting bad coding practices and bottlenecks in real fork/join applications on the JVM. Complementary to such studies, we plan to focus both on performance metrics collected from the full system stack via dynamic analysis to automatically diagnose performance issues noticeable only at runtime, and on characterizing all the tasks spawned by a fork/join application to study performance drawbacks related to suboptimal task granularity. Moreover, we specifically devise coaching developers towards optimizing fork/join applications by providing them with code-refactoring suggestions whose implementation may enable missed parallelization opportunities.

Parallelism discovery. A number of tools have been developed to discover parallelization opportunities. Most notably, tools focusing on parallelism discovery include Kremlin [14], which tracks loops and dependencies to pinpoint sequential parts of a program that can be parallelized. Kismet [25] builds on Kremlin and enhances the prediction of potential speedups after parallelization, given a target machine and runtime system. Similarly, Hammacher et al. [17], Rountev et al. [49], and Dig et al. [10] focus on refactoring sequential legacy Java programs, identifying independent computation paths that could have been parallelized and recommending to the programmer the locations with the highest potential for parallelization. Although the above works aim at guiding the developer towards modifications enabling parallelization, they only target the optimization of sequential applications.

In addition, several parallelism profilers based on the work-span model [7, 24] target parallel applications. Cilkview [19] builds upon this model to predict achievable speedup bounds for a Cilk [13] application when the number of used cores increases. CilkProf [50] extends Cilkview by measuring work and span at each call site to determine which of them constitutes a bottleneck towards parallelization. TaskProf [58] computes work, span, and the asymptotic parallelism of C++ applications using the Intel Threading Building Blocks (TBB) [45] task-parallel library, relying on causal profiling [8] techniques to estimate parallelism improvements. Although these tools allow detecting bottlenecks in parallel applications and can estimate potential speedups, they do not focus on coaching a developer towards optimizing a fork/join application. Moreover, none of these tools supports the JVM.

Parallelism profilers. Researchers from industry and academia have developed several tools to analyze various parallel characteristics of an application. HPCToolkit [1] is a suite of tools using statistical sampling of timers and HPC collection to analyze the performance of an application, resource consumption, and inefficiency, attributing them to the full calling context in which they occur. THOR [54] is a tool for performance analysis of Java applications on multicore systems. THOR relies on vertical profiling to collect fine-grained events, such as OS context switches and Java lock contention, from multiple layers of the execution stack. Each event is subsequently associated with a thread, providing a detailed graphical report on the behavior of the threads spawned in the execution of a parallel Java application.

Finally, a number of tools exist to support the analysis of parallel Java applications through the collection of metrics at the application layer, most notably including VisualVM [56], IBM Health Center [21], Solaris Studio Performance Analyzer [36], Java Mission Control [35], JProfiler [11], and YourKit [59], or at the hardware layer resorting to HPCs, including Intel VTune [23] and AMD CodeAnalyst [2], among others.

Overall, although the aforementioned tools provide insight in characterizing processes or threads over time, none of them specializes in fork/join applications, and thus they fall short in diagnosing performance problems specific to this type of parallel applications.
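To ground the discussion, the sketch below (our illustration, not code from the cited studies) shows one instance of the fine-grained-task issue documented in the anti-pattern literature [9, 43], alongside the sequential-cutoff refactoring a coach could suggest; the cutoff value is an arbitrary placeholder, as the best choice is workload- and platform-dependent:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Anti-pattern: a task is forked for every recursive call, so task creation,
// deque traffic, and joins dwarf the useful work (suboptimal task granularity).
class FibFine extends RecursiveTask<Long> {
    final int n;
    FibFine(int n) { this.n = n; }
    @Override protected Long compute() {
        if (n < 2) return (long) n;
        FibFine left = new FibFine(n - 1);
        left.fork();                                    // fork even trivial subproblems
        return new FibFine(n - 2).compute() + left.join();
    }
}

// Suggested refactoring: below a sequential cutoff, fall back to plain
// recursion so each task carries enough work to amortize scheduling overhead.
class FibCoarse extends RecursiveTask<Long> {
    static final int CUTOFF = 12;  // illustrative placeholder value
    final int n;
    FibCoarse(int n) { this.n = n; }
    static long seq(int n) { return n < 2 ? n : seq(n - 1) + seq(n - 2); }
    @Override protected Long compute() {
        if (n <= CUTOFF) return seq(n);                 // coarse base case: no forking
        FibCoarse left = new FibCoarse(n - 1);
        left.fork();
        return new FibCoarse(n - 2).compute() + left.join();
    }
}
```

Both versions compute the same result; the point of the coaching approach outlined in Section 2 is to detect the first form automatically, pinpoint the calling context, and report the second form as a concrete source-level recommendation.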

ACKNOWLEDGMENTS
The research presented in this paper was supported by Oracle (ERO project 1332) and by the Swiss National Science Foundation (project 200021_153560).

REFERENCES
[1] L. Adhianto, S. Banerjee, M. Fagan, M. Krentel, G. Marin, J. Mellor-Crummey, and N. R. Tallent. 2010. HPCTOOLKIT: Tools for Performance Analysis of Optimized Parallel Programs. Concurr. Comput.: Pract. Exper. 22, 6 (2010), 685–701.
[2] AMD. 2017. CodeAnalyst for Linux. http://developer.amd.com/tools-and-sdks/archive/amd-codeanalyst-performance-analyzer-for-linux/.
[3] Glenn Ammons, Thomas Ball, and James R. Larus. 1997. Exploiting Hardware Performance Counters with Flow and Context Sensitive Profiling. In PLDI. 85–96.
[4] Apache Groovy project. 2018. Groovy. http://www.groovy-lang.org/.
[5] Colin Barrett, Christos Kotselidis, and Mikel Luján. 2016. Towards Co-designed Optimizations in Parallel Frameworks: A MapReduce Case Study. In CF. 172–179.
[6] Stephen M. Blackburn, Robin Garner, Chris Hoffmann, Asjad M. Khang, Kathryn S. McKinley, Rotem Bentzur, Amer Diwan, Daniel Feinberg, Daniel Frampton, Samuel Z. Guyer, Martin Hirzel, Antony Hosking, Maria Jump, Han Lee, J. Eliot B. Moss, Aashish Phansalkar, Darko Stefanović, Thomas VanDrunen, Daniel von Dincklage, and Ben Wiedermann. 2006. The DaCapo Benchmarks: Java Benchmarking Development and Analysis. In OOPSLA. 169–190.
[7] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. 2009. Introduction to Algorithms (3rd ed.). The MIT Press.
[8] Charlie Curtsinger and Emery D. Berger. 2015. Coz: Finding Code That Counts with Causal Profiling. In SOSP. 184–197.
[9] Mattias De Wael, Stefan Marr, and Tom Van Cutsem. 2014. Fork/Join Parallelism in the Wild: Documenting Patterns and Anti-patterns in Java Programs Using the Fork/Join Framework. In PPPJ. 39–50.
[10] Danny Dig, John Marrero, and Michael D. Ernst. 2009. Refactoring Sequential Java Code for Concurrency via Concurrent Libraries. In ICSE. 397–407.
[11] ej-technologies. 2017. JProfiler. https://www.ej-technologies.com/products/jprofiler/overview.html.
[12] Matthew Flatt and PLT. 2010. Reference: Racket. Technical Report PLT-TR-2010-1. PLT Design Inc. https://racket-lang.org/tr1/.
[13] Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. 1998. The Implementation of the Cilk-5 Multithreaded Language. In PLDI. 212–223.
[14] Saturnino Garcia, Donghwan Jeon, Christopher M. Louie, and Michael Bedford Taylor. 2011. Kremlin: Rethinking and Rebooting Gprof for the Multicore Age. In PLDI. 458–469.
[15] Liang Gong, Michael Pradel, and Koushik Sen. 2015. JITProf: Pinpointing JIT-unfriendly JavaScript Code. In ESEC/FSE. 357–368.
[16] Philipp Haller and Martin Odersky. 2009. Scala Actors: Unifying Thread-based and Event-based Programming. Theor. Comput. Sci. 410, 2-3 (2009), 202–220.
[17] C. Hammacher, K. Streit, S. Hack, and A. Zeller. 2009. Profiling Java Programs for Parallelism. In ICSE '09. 49–55.
[18] Matthias Hauswirth, Peter F. Sweeney, Amer Diwan, and Michael Hind. 2004. Vertical Profiling: Understanding the Behavior of Object-oriented Applications. In OOPSLA. 251–269.
[19] Yuxiong He, Charles E. Leiserson, and William M. Leiserson. 2010. The Cilkview Scalability Analyzer. In SPAA. 145–156.
[20] Zhang Hong, Wang Xiao-ming, Cao Jie, Ma Yan-hong, Guo Yi-rong, and Wang Min. 2016. An Optimized Model for MapReduce Based on Hadoop. TELKOMNIKA 14 (2016), 1552.
[21] IBM. [n. d.]. IBM Health Center. https://www.ibm.com/developerworks/java/jdk/tools/healthcenter/.
[22] ICL. 2017. PAPI. http://icl.utk.edu/papi/.
[23] Intel. 2017. Intel VTune Amplifier. https://software.intel.com/en-us/intel-vtune-amplifier-xe.
[24] Joseph JaJa. 1992. Introduction to Parallel Algorithms. Addison-Wesley Professional.
[25] Donghwan Jeon, Saturnino Garcia, Chris Louie, and Michael Bedford Taylor. 2011. Kismet: Parallel Speedup Estimates for Serial Programs. In OOPSLA. 519–536.
[26] Doug Lea. 1999. Concurrent Programming in Java, Second Edition: Design Principles and Patterns. Addison-Wesley Professional.
[27] Doug Lea. 2000. A Java Fork/Join Framework. In JAVA. 36–43.
[28] Lightbend Inc. [n. d.]. Akka. http://akka.io.
[29] Y. Lin and D. Dig. 2013. CHECK-THEN-ACT Misuse of Java Concurrent Collections. In ICST. 164–173.
[30] Y. Lin, S. Okur, and D. Dig. 2015. Study and Refactoring of Android Asynchronous Programming (T). In ASE. 224–235.
[31] Yu Lin, Cosmin Radoi, and Danny Dig. 2014. Retrofitting Concurrency for Android Applications Through Refactoring. In FSE. 341–352.
[32] Lukáš Marek, Stephen Kell, Yudi Zheng, Lubomír Bulej, Walter Binder, Petr Tůma, Danilo Ansaloni, Aibek Sarimbekov, and Andreas Sewe. 2013. ShadowVM: Robust and Comprehensive Dynamic Program Analysis for the Java Platform. In GPCE. 105–114.
[33] Lukáš Marek, Alex Villazón, Yudi Zheng, Danilo Ansaloni, Walter Binder, and Zhengwei Qi. 2012. DiSL: A Domain-specific Language for Bytecode Instrumentation. In AOSD. 239–250.
[34] Mozilla. 2018. SpiderMonkey. https://developer.mozilla.org/en-US/docs/Mozilla/Projects/SpiderMonkey.
[35] Oracle. [n. d.]. Java Mission Control. http://www.oracle.com/technetwork/java/javaseproducts/mission-control/java-mission-control-1998576.html.
[36] Oracle. [n. d.]. Solaris Studio Performance Analyzer. http://www.oracle.com/technetwork/server-storage/solarisstudio/features/performance-analyzer-2292312.html.
[37] Oracle. 2017. Class Arrays. https://docs.oracle.com/javase/8/docs/api/java/util/Arrays.html.
[38] Oracle. 2017. Class CompletableFuture. https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/CompletableFuture.html.
[39] Oracle. 2017. Interface Future. https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/Future.html.
[40] Oracle. 2017. Package java.util.stream. https://docs.oracle.com/javase/8/docs/api/java/util/stream/package-summary.html.
[41] Oracle. 2017. Thread Pools. https://docs.oracle.com/javase/tutorial/essential/concurrency/pools.html.
[42] Patrick Peschlow. 2012. The Fork/Join Framework in Java 7. http://www.h-online.com/developer/features/The-fork-join-framework-in-Java-7-1762357.html.
[43] Gustavo Pinto, Anthony Canino, Fernando Castor, Guoqing Xu, and Yu David Liu. 2017. Understanding and Overcoming Parallelism Bottlenecks in ForkJoin Applications. In ASE. 765–775.
[44] Gustavo Pinto, Weslley Torres, Benito Fernandes, Fernando Castor, and Roberto S. M. Barros. 2015. A Large-scale Study on the Usage of Java's Concurrent Programming Constructs. Journal of Systems and Software 106 (2015), 59–81.
[45] James Reinders. 2007. Intel Threading Building Blocks (1st ed.). O'Reilly & Associates, Inc.
[46] Rich Hickey. 2018. The Clojure Programming Language. https://clojure.org/.
[47] Andrea Rosà, Eduardo Rosales, and Walter Binder. 2018. Analyzing and Optimizing Task Granularity on the JVM. In CGO. 1–11.
[48] Eduardo Rosales, Andrea Rosà, and Walter Binder. 2017. tgp: a Task-Granularity Profiler for the Java Virtual Machine. In APSEC. 1–6.
[49] A. Rountev, K. Van Valkenburgh, Dacong Yan, and P. Sadayappan. 2010. Understanding Parallelism-Inhibiting Dependences in Sequential Java Programs. In ICSM. 1–9.
[50] Tao B. Schardl, Bradley C. Kuszmaul, I-Ting Angelina Lee, William M. Leiserson, and Charles E. Leiserson. 2015. The Cilkprof Scalability Profiler. In SPAA. 89–100.
[51] V. St-Amour and S. Guo. 2015. Optimization Coaching for JavaScript. In ECOOP.
[52] Vincent St-Amour, Sam Tobin-Hochstadt, and Matthias Felleisen. 2012. Optimization Coaching: Optimizers Learn to Communicate with Programmers. In OOPSLA. 163–178.
[53] Robert Stewart and Jeremy Singer. 2018. Comparing Fork/Join and MapReduce. (2018).
[54] Q. M. Teng, H. C. Wang, Z. Xiao, P. F. Sweeney, and E. Duesterwald. 2010. THOR: A Performance Analysis Tool for Java Applications Running on Multicore Systems. IBM Journal of Research and Development 54, 5 (2010), 4:1–4:17.
[55] The Eclipse Foundation. 2016. Eclipse IDE. http://www.eclipse.org/ide.
[56] VisualVM. [n. d.]. VisualVM. https://visualvm.java.net.
[57] F. Warren Burton and Michael Sleep. 1981. Executing Functional Programs on a Virtual Tree of Processors. In FPCA. 187–194.
[58] Adarsh Yoga and Santosh Nagarakatte. 2017. A Fast Causal Profiler for Task Parallel Programs. In ESEC/FSE. 15–26.
[59] YourKit. 2017. YourKit. https://www.yourkit.com.
[60] Yudi Zheng, Andrea Rosà, Luca Salucci, Yao Li, Haiyang Sun, Omar Javed, Lubomir Bulej, Lydia Y. Chen, Zhengwei Qi, and Walter Binder. 2016. AutoBench: Finding Workloads That You Need Using Pluggable Hybrid Analyses. In SANER. 639–643.