Modular Class Data Sharing in OpenJDK
Evaluation of CDS for Cluster Computing Workloads and Improvement in Modularity of CDS

by
Apourva Parthasarathy

to obtain the degree of Master of Science
at Delft University of Technology

Distributed Systems Group
Faculty of Electrical Engineering, Mathematics, and Computer Science

Student number: 4597583
Thesis committee: Dr. J. S. Rellermeyer, TU Delft, supervisor
                  Prof. dr. ir. D. J. H. Epema, TU Delft
                  Dr. A. Katsifodimos, TU Delft
MSc Presentation: 13th May, 2019

An electronic version of this thesis is available at http://repository.tudelft.nl/.

Abstract

Today, the vast majority of big data processing platforms are implemented in JVM-based languages such as Java and Scala. The main reasons for the popularity of the JVM are the write once, run everywhere paradigm through the use of platform-independent bytecode, automatic garbage collection, bytecode interpretation for fast start-up, and JIT compilation of hot code paths for fast JVM warm-up and runtime code optimization. While these features are well suited to traditional Java applications in embedded systems, application servers and enterprise middleware, the characteristics of big data processing applications force us to examine the efficiency of existing JVM features. Modern big data platforms dwarf the size of the user programs that implement the actual application, requiring thousands of framework classes to be loaded per JVM to run a task. Furthermore, these applications instantiate mostly the same classpath over and over, multiplying both the time required to load the classes and the memory required to store the class metadata in the JVM. The overhead due to classloading is particularly large when the task length is small, which is common in big data processing pipelines.

The issue of classloading overhead is partly addressed by the introduction of OpenJDK Class Data Sharing (CDS), a technique that persists the metadata of classes loaded during the execution of an application in a memory-mapped file, so that any subsequent run of the application can simply map this file into memory and reuse the loaded classes. We propose two approaches for utilizing CDS in Hadoop and Spark, BestFit CDS and Classpath CDS, representing the two extreme possibilities for sharing. BestFit CDS provides minimum classloading time at the cost of potential duplication, while Classpath CDS provides zero duplication at the cost of higher classloading time. We show that per-JVM classloading overhead in Hadoop can be as high as 49% for small tasks and further show that the use of BestFit CDS can provide a 30% to 50% improvement in JVM execution time.

Further, we identify the key limitations in the current implementation of CDS and propose Modular CDS to improve the scalability and modularity of sharing. With the use of synthetic workloads, we show that the speedup due to Modular CDS lies between that of Classpath and BestFit CDS while at the same time achieving minimum archive size and no duplication across archives.

Preface

I would like to thank my supervisor Dr. Jan S. Rellermeyer for his guidance and encouragement throughout this project. It has been a steep learning curve, and the weekly meetings and continuous feedback have been invaluable in making constant progress. I would also like to thank Sobhan Omranian for his helpful suggestions and encouraging chats, especially when progress was slow.
I would like to extend my thanks to the fellow students in the Distributed Systems Group for the interesting discussions and feedback on my work. Although exciting and enjoyable, living in a new country comes with its challenges. I would like to express my gratitude to all my friends at Delft for their constant help and encouragement. Most importantly, I would like to thank my parents and family for their unconditional love and support.

Apourva Parthasarathy
Delft, May 10th 2019

Contents

1 Introduction
  1.1 JVM overheads in Cluster computing workloads
  1.2 OpenJDK HotSpot Class Data Sharing
  1.3 Modular Class Data Sharing
  1.4 Thesis outline and Contributions
2 Background and Related Work
  2.1 Cluster Computing Platforms
    2.1.1 Big Data Processing Platforms
    2.1.2 Serverless FaaS Platforms
  2.2 Java Virtual Machine
    2.2.1 Java Classloading
    2.2.2 Class Data Sharing
    2.2.3 Related Work on Java Virtual Machines
3 Performance of Class Data Sharing for Low-latency Workloads
  3.1 Experimental Setup
  3.2 Analysis of Classloading Overhead
  3.3 Analysis of Class Data Sharing (CDS) Approaches
    3.3.1 Analysis of Execution Time
    3.3.2 Analysis of Startup Overhead
    3.3.3 Analysis of Memory Usage
  3.4 Modular Class Data Sharing Approach
  3.5 Summary
4 Implementation and Analysis of Modular Class Data Sharing
  4.1 Implementation
  4.2 Limitations of Modular Class Data Sharing Implementation
  4.3 Performance Analysis of Modular Class Data Sharing
  4.4 Summary
5 Conclusion and Future Work
  5.1 Conclusion
  5.2 Future Work
Bibliography

1 Introduction

1.1 JVM overheads in Cluster computing workloads

Since the introduction of the MapReduce programming model [30] in 2004 and its reference implementation in Hadoop, cluster computing frameworks based on this model have been growing to support a wide range of data processing and analytics operations in enterprises. Today, several big data processing platforms are in use, each specializing in a different area of processing. These platforms specialize in handling workloads ranging from high-throughput batch processing [9, 21, 42] and interactive SQL-style queries [24, 33, 38, 47] to stream processing [27, 43, 48, 51], machine learning [23, 41], and efficient processing of graph-based data structures [31]. Several popular big data processing platforms such as Hadoop MapReduce [9], Spark [21], Hive [47], Flink [27], Storm [48] and Kafka [39] are implemented to run on the Java Virtual Machine (JVM) [16].

In general, big data platforms run a job on the cluster by splitting the computation into multiple tasks and executing each task independently on its own slice of the input data. The output from these tasks may then be aggregated or used as input by the next stage of tasks in the job. Some JVM-based platforms launch a new JVM for each task, while others may employ techniques to reuse JVMs. For example, Hadoop MapReduce runs every task in a separate JVM. Spark runs multiple tasks of a given application in a single JVM (per node) and launches a separate JVM for each new application. Hive, a data warehousing platform built on Hadoop, translates SQL-style queries into a pipeline of Hadoop MapReduce jobs. A job is decomposed into a series of Map and Reduce tasks, each running in a separate JVM.
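Each of these task or application JVMs loads the JDK and framework classes on its own. As a rough illustration (a hedged sketch using only the standard java.lang.management API, not code taken from this thesis), the classloading footprint of a single JVM can be reported from inside the process as follows:

    import java.lang.management.ClassLoadingMXBean;
    import java.lang.management.ManagementFactory;

    // Prints how many classes the current JVM has loaded so far.
    // Invoked at the end of a task, this exposes the per-JVM classloading
    // footprint discussed in the remainder of this section.
    public class LoadedClassReport {
        public static void main(String[] args) {
            ClassLoadingMXBean classLoading = ManagementFactory.getClassLoadingMXBean();
            System.out.println("Classes currently loaded: " + classLoading.getLoadedClassCount());
            System.out.println("Classes loaded in total:  " + classLoading.getTotalLoadedClassCount());
            System.out.println("Classes unloaded:         " + classLoading.getUnloadedClassCount());
        }
    }

Attaching such a report to the end of a Hadoop or Spark task makes the scale of this repeated loading, analyzed next, directly visible.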
Based on the above examples, we see that big data processing platforms continuously create and terminate a large number of JVMs over the lifetime of a job, thus potentially incurring significant JVM startup and runtime overhead. The overhead due to the JVM can broadly be viewed in three categories: startup overhead, warm-up overhead and Garbage Collection (GC) overhead. As the name suggests, startup overhead refers to the time spent bootstrapping the JVM by initializing crucial JDK classes and other JVM-internal data structures. Warm-up overhead refers to the time spent executing bytecode in interpreted mode while the JIT compiler compiles frequently used code paths and generates optimized code based on runtime profiling of the application. GC overhead refers to the time lost to pauses caused by the Garbage Collector.

For long-running JVMs, these overheads form a very small proportion of the total execution time of the JVM and hence do not affect the overall job execution time. However, we find that the JVM overhead dominates the total execution time for short-lived, latency-sensitive applications. Furthermore, we see that big data processing platforms load thousands of classes per JVM, and there exists a large overlap in the classes loaded by each JVM. This includes commonly used JDK classes, logging and networking libraries, and common Hadoop framework classes such as those of the Hadoop Distributed File System (HDFS) and Yet Another Resource Negotiator (YARN). Hence, thousands of classes are repeatedly loaded by several JVMs, leading to wasted time and redundant storage of class metadata across several JVMs.

One solution to overcome this duplication is to have a single long-lived process on each node that runs all jobs. However, there are several advantages to running each task in its own JVM: a) greater fault tolerance, so that a failing task does not impact other tasks of the job; b) easier scale-up of a job by simply starting new processes on other cluster nodes; and c) performance isolation between tasks of different jobs.

In addition to big data processing platforms, there exist other cluster computing platforms that rely on on-demand provisioning of JVMs. For example, Function-as-a-Service (FaaS) platforms [1, 7, 20] run each new instance of a function in a JVM. In general, these platforms launch a new container when the function is invoked. The container initializes the environment required for the execution of the function, including the language runtime. For JVM-based functions, either a new container is launched or an existing container is reused from a pool of warmed-up containers, if available. Container reuse is beneficial since the JVM is already in an initialized state and ready to run the function.
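The share of a short-lived JVM's execution time consumed by these overheads can also be estimated from inside the application itself. The following minimal sketch (our own illustration, not code from this thesis; runTask() is a hypothetical placeholder for a task body) uses the standard RuntimeMXBean to report how much time elapses between JVM start and the first user instruction, which captures startup and initial classloading but not JIT warm-up or GC pauses:

    import java.lang.management.ManagementFactory;
    import java.lang.management.RuntimeMXBean;

    // Estimates the fraction of a short task's wall-clock time spent before
    // user code starts, i.e. JVM bootstrap plus the classloading needed to
    // reach main(). JIT warm-up and GC pauses are not included.
    public class StartupShare {
        public static void main(String[] args) {
            RuntimeMXBean runtime = ManagementFactory.getRuntimeMXBean();
            long startupMillis = runtime.getUptime(); // time from JVM start until this point

            long taskStart = System.currentTimeMillis();
            runTask();
            long taskMillis = System.currentTimeMillis() - taskStart;

            double share = 100.0 * startupMillis / (startupMillis + taskMillis);
            System.out.printf("Pre-main overhead: %d ms (%.1f%% of total)%n", startupMillis, share);
        }

        private static void runTask() {
            // Placeholder for the actual work of a Map/Reduce task or function invocation.
        }
    }

For tasks that complete within a few seconds, this pre-main fraction can already dominate, in line with the observation above that JVM overhead dominates the total execution time of short-lived, latency-sensitive applications.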