Modular Class Data Sharing in OpenJDK

Apourva Parthasarathy

Evaluation of CDS for Cluster Computing Workloads and Improvement in Modularity of CDS

Modular Class Data Sharing in OpenJDK

by Apourva Parthasarathy

to obtain the degree of Master of Science at the Distributed Systems Group, Faculty of Electrical Engineering, Mathematics, and Computer Science, Delft University of Technology.

Student number: 4597583
Thesis committee: Dr. J. S. Rellermeyer, TU Delft, supervisor
                  Prof. dr. ir. D. J. H. Epema, TU Delft
                  Dr. A. Katsifodimos, TU Delft
MSc Presentation: 13th May, 2019

An electronic version of this thesis is available at http://repository.tudelft.nl/.

Abstract

Today, a vast majority of big data processing platforms are implemented in JVM-based languages such as Java and Scala. The main reasons for the popularity of the JVM are: the write once, run everywhere paradigm through the use of platform-independent bytecode, automatic garbage collection, interpretation for fast start-up, JIT-compilation of hot code paths for fast JVM warm-up, and runtime code optimization. While these features are well suited for traditional Java applications in embedded systems, application servers and enterprise middleware, the characteristics of big data processing applications force us to examine the efficiency of existing JVM features. Modern big data platforms dwarf the size of the user programs that implement the actual application, requiring the loading of thousands of framework classes per JVM to run a task. Furthermore, these applications instantiate mostly the same classes over and over, thus multiplying the time required to load the classes and the memory required to store the class metadata in the JVM. The overhead due to classloading is particularly large when the task length is small, which is common in big data processing pipelines. The issue of classloading overhead is partly addressed by the introduction of OpenJDK Class Data Sharing (CDS), a technique to persist the metadata of classes loaded during the execution of an application in a memory-mapped file, so that any subsequent run of the application can simply map this file into memory and reuse the loaded classes. We propose two approaches for utilizing CDS in Hadoop and Spark, BestFit CDS and Classpath CDS, representing the two extreme possibilities for sharing. BestFit CDS provides minimum classloading time but potentially duplicates class metadata across archives, while Classpath CDS provides zero duplication at the cost of higher classloading time. We show that per-JVM classloading overhead in Hadoop can be as high as 49% for small tasks and further show that the use of BestFit CDS can provide a 30% to 50% improvement in JVM execution time. Further, we identify the key limitations in the current implementation of CDS and propose Modular CDS to improve the scalability and modularity of sharing. With the use of synthetic workloads, we show that the speedup due to Modular CDS lies between that of Classpath and BestFit CDS while at the same time achieving minimum archive size and no duplication of class metadata across archives.


Preface

I would like to thank my supervisor Dr. Jan S. Rellermeyer for his guidance and encouragement throughout this project. It has been a steep learning curve, and the weekly meetings and continuous feedback have been invaluable in making constant progress. I would also like to thank Sobhan Omranian for his helpful suggestions and encouraging chats, especially when progress was slow. I would like to extend my thanks to the fellow students in the Distributed Systems Group for the interesting discussions and feedback on my work. Although exciting and enjoyable, living in a new country comes with its challenges. I would like to express my gratitude to all my friends at Delft for their constant help and encouragement. Most importantly, I would like to thank my parents and family for their unconditional love and support.

Apourva Parthasarathy
Delft, May 10th, 2019


Contents

1 Introduction
  1.1 JVM overheads in Cluster computing workloads
  1.2 OpenJDK HotSpot Class Data Sharing
  1.3 Modular Class Data Sharing
  1.4 Thesis outline and Contributions
2 Background and Related Work
  2.1 Cluster Computing Platforms
    2.1.1 Big Data Processing Platforms
    2.1.2 Serverless FaaS Platforms
  2.2 Java Virtual Machine
    2.2.1 Java Classloading
    2.2.2 Class Data Sharing
    2.2.3 Related Work on Java Virtual Machines
3 Performance of Class Data Sharing for Low latency Workloads
  3.1 Experimental Setup
  3.2 Analysis of Classloading Overhead
  3.3 Analysis of Class Data Sharing (CDS) Approaches
    3.3.1 Analysis of execution time
    3.3.2 Analysis of Startup Overhead
    3.3.3 Analysis of Memory Usage
  3.4 Modular Class Data Sharing Approach
  3.5 Summary
4 Implementation and Analysis of Modular Class Data Sharing
  4.1 Implementation
  4.2 Limitations of Modular Class Data Sharing Implementation
  4.3 Performance analysis of Modular Class Data Sharing
  4.4 Summary
5 Conclusion and Future Work
  5.1 Conclusion
  5.2 Future Work
Bibliography


1 Introduction

1.1. JVM overheads in Cluster computing workloads

Since the introduction of the MapReduce programming model [30] in 2004 and its reference implementation in Hadoop, cluster computing frameworks based on this model have been growing to support a wide range of data processing and analytics operations in enterprises. Today, several big data processing platforms are in use, each specializing in a different area of processing. These platforms handle workloads ranging from high-throughput batch processing [9, 21, 42], interactive SQL-style queries [24, 33, 38, 47], stream processing [27, 43, 48, 51], and machine learning [23, 41], to efficient processing of graph-based data structures [31].

Several popular big data processing platforms such as Hadoop MapReduce [9], Spark [21], Hive [47], Flink [27], Storm [48] and Kafka [39] are implemented to run on the Java Virtual Machine (JVM) [16]. In general, big data platforms run a job on the cluster by splitting the computation into multiple tasks and executing each task independently on its own slice of the input data. The output from these tasks may then be aggregated or used as input by the next stage of tasks in the job. Some JVM-based platforms launch a new JVM for each task while others may employ techniques to reuse JVMs. For example, Hadoop MapReduce runs every task in a separate JVM. Spark runs multiple tasks of a given application in a single JVM (per node) and launches a separate JVM for each new application. Hive, a data warehousing platform built on Hadoop, translates SQL-style queries into a pipeline of Hadoop MapReduce jobs. A job is decomposed into a series of Map and Reduce tasks, each running in a separate JVM.


Based on the above examples, we see that big data processing platforms continuously create and terminate a large number of JVMs over the lifetime of a job, thus potentially incurring significant JVM startup and runtime overhead. The overhead due to the JVM can broadly be viewed in three categories: startup overhead, warm-up overhead and Garbage Collection (GC) overhead. As the name suggests, startup overhead refers to the time spent in bootstrapping the JVM by initializing crucial JDK classes and other JVM internal data structures. Warm-up overhead refers to the time spent in executing Java in interpreted mode while the JIT-compiler compiles frequently used code paths and generates optimized code based on runtime profiling of the application. GC overhead refers to the overhead due to pauses caused by the Garbage Collector. For long-running JVMs, these overheads form a very small proportion of the total execution time of the JVM and hence do not affect the overall job execution time. However, we find that the JVM overhead dominates the total execution time for short-lived, latency sensitive applications. Furthermore, we see that big data processing platforms load thousands of classes per JVM and there exists a large overlap in the classes loaded by each JVM. This includes commonly used JDK classes, Logging and Networking libraries, and common Hadoop framework classes such as the Hadoop Distributed File System (HDFS) and Yet Another Resource Negotiator (YARN). Hence, thousands of classes are repeatedly loaded by several JVMs, leading to wasted time and redundant storage of class metadata across several JVMs.

One solution to overcome this duplication is to have a single long-lived process in each node that runs all jobs. However, there are several advantages to running each task in its own JVM: a) greater fault tolerance, so that a failing task does not impact other tasks of the job; b) easier scale-up of the job by simply starting new processes on other cluster nodes; and c) performance isolation between tasks of different jobs.

In addition to big data processing platforms, there exist other cluster computing platforms that rely on on-demand provisioning of JVMs. For example, Function-as-a-Service (FaaS) platforms [1, 7, 20] operate by running each new instance of the function in a JVM. In general, these platforms operate by launching a new container when the function is invoked. The container initializes the environment required for the execution of the function, including the language runtime. For JVM-based functions, either a new container is launched or an existing container is reused from a pool of warmed-up containers if available. Container reuse is beneficial since the JVM is already in an initialized state and ready to run the function. While very effective, the JVM pool still does not eliminate JVM startup and warm-up overhead completely. The warm JVMs are killed after a specified time-out period and need to be launched from scratch for the subsequent invocation. Furthermore, depending upon the load on the system, several new JVMs may have to be created instantly to handle sudden spikes in the number of incoming invocations. Moreover, this approach does not address the duplication of class metadata across instances of the JVM.

In summary, we see that several JVM-based cluster computing platforms operate by constantly provisioning short-lived JVMs to execute tasks. Although MapReduce-based big data processing platforms like Hadoop and Spark were originally designed to provide high-throughput, fault-tolerant execution of batch processing jobs, we see that these platforms are increasingly being used to support small, low latency jobs. Studies on the characteristics of MapReduce-based cluster workloads show that the majority of jobs are short-lived and follow a heavy-tailed distribution of job sizes [36, 44]. Analysis of production traces from a Hadoop cluster has shown that about 80% of jobs in the cluster are small, running for less than 4 minutes, and 90% of jobs read less than 64 MB of data from HDFS [44]. While it is common knowledge that Java suffers from slow startup and warm-up overhead for short-lived JVMs, there have been few studies that analyze the impact of this overhead on the execution speed of small, latency sensitive cluster computing workloads [40].

In this study, we identify that big data processing tasks load a large number of classes (around 5000 to 10000 classes per JVM) and often load a common subset of classes in every JVM. Based on this observation, we limit our focus to analyzing and improving classloading overhead in the JVM.

RQ1: What is the impact of Classloading overhead on small, low latency big data processing workloads?

1.2. OpenJDK HotSpot Class Data Sharing

Section 3.2 provides an overview of classloading overhead in the JVM and existing approaches to minimize overheads. Based on the outcome of RQ1, this study focuses on analyzing an approach that simultaneously minimizes classloading overhead and lowers JVM memory footprint: Application Class Data Sharing (CDS). Application CDS, introduced in 2017 as part of OpenJDK 9 [19], enables JVMs running on a node to share loaded classes, thus leading to lower classloading time and memory footprint per JVM. CDS enables metadata of loaded classes to be persisted in a memory mapped file that can subsequently be used to directly map the loaded classes to the address space of a newly launched JVM. In this report, we refer to the memory mapped file as the CDS archive. While preliminary testing of CDS provides promising results [12], there are currently no studies on the benefits of OpenJDK Class Data Sharing for big data processing workloads. In our study, we aim to analyze the performance benefits of CDS for big data processing workloads and discuss the pros and cons of this approach.

The CDS archive is created statically at dump-time and does not allow class metadata to be added to the archive dynamically during the execution of the application. Furthermore, OpenJDK only allows loading of classes from a single CDS archive. These restrictions have important implications on how CDS archives are created and utilized.

• Full Sharing with duplication: If all instances of the application load the same set of classes, then a single CDS archive can be created containing these classes. At run-time, all instances on a given node share this CDS archive. However, for platforms like Hadoop and Spark which run a diverse range of workloads such as SQL queries and Machine Learning, each application loads a unique set of application classes along with common JDK and framework classes. As OpenJDK only allows the use of a single archive per JVM, it is not possible to separate the common classes into a shared archive while loading application-specific classes from another archive. Hence, each application creates its own BestFit CDS archive containing the application classes and the common classes. While the BestFit approach provides full sharing, it is evident that each archive contains a significant amount of duplicate class metadata.

• Full Sharing with no duplication: While the BestFit approach successfully addresses classloading overhead, it still retains duplication of class data among JVMs. An alternative is to create an exhaustive CDS archive with all the classes present in the application classpath. Hence, all Hadoop or Spark applications can make use of this platform-specific archive. We call this the Classpath CDS approach. While this approach provides full sharing without any duplication, it greatly increases the size of the CDS archive while at the same time limiting the scalability and flexibility of archive creation and deployment.

• Partial Sharing: In order to avoid duplication of class metadata, it is possible to create a CDS archive containing only the common classes used by all applications. However, this leads to a significant drop in performance as a) application classes not present in the CDS archive have to be loaded from the classfile, and b) classes loaded from the classfile cannot be shared, thus increasing duplication.

In this study, we investigate the pros and cons of these approaches and analyze their impact on job latency and memory utilization of the workloads.

RQ2: What is the performance impact of utilizing OpenJDK Application Class Data Sharing (CDS) on low latency big data processing workloads?

1.3. Modular Class Data Sharing

The limitations of the approaches discussed above can be addressed by providing a modular approach to Class Data Sharing. In this study, we aim to enhance CDS to support loading classes from multiple CDS archives so as to minimize both the classloading time and memory footprint of the JVM. There are several benefits to this modular approach: a) platforms can be decomposed into logical layers as shown in Figure 1.1 and each layer can be dumped to a separate archive, so that all jobs can share framework classes while each application has its own application-specific archive; b) this approach can potentially be extended to support a CDS archive per JAR so that every JVM that uses this JAR will simply load it from a single shared archive.

RQ3: How can Class Data Sharing be enhanced to provide greater scalability and modularity in sharing? What are the performance benefits of modular Class Data Sharing in comparison with out-of-the-box Class Data Sharing approaches?

It must be noted that the current implementation of Modular CDS suffers from the serious limitation that it does not support the loading of archived array classes. Hence, we were unable to test Modular CDS with Hadoop and Spark. Instead, we test the performance of Modular CDS with synthetic workloads and discuss how the implementation of Modular CDS can be further improved.

Figure 1.1: Vision for Modular Class Data Sharing in Big Data Processing Platforms. Applications (e.g. Spark SQL, Spark ML and Hadoop MapReduce applications) sit on top of common framework-level archives for Hadoop MapReduce, Spark and Hive, which in turn build on platform-level HDFS and YARN archives and a shared JDK archive.

1.4. Thesis outline and Contributions

In Chapter 2, we provide a short background on big data processing platforms and their JVM overhead. Further, we discuss classloading in the OpenJDK HotSpot JVM, the implementation of Class Data Sharing and the related literature on JVM speedup. In Chapter 3, we investigate classloading overheads in Hadoop and Spark and analyze the benefits of CDS for low latency workloads. In Chapter 4, we discuss the implementation of Modular Class Data Sharing and compare it with traditional CDS approaches. Finally, we summarize the results of our study and discuss future work in Chapter 5.

Listed below are some of the main contributions of our work:

1. Since OpenJDK Class Data Sharing was introduced as recently as 2017, there exist few studies on the performance benefits of this approach. To the best of our knowledge, this is the first study to investigate the application of CDS for cluster computing workloads and analyze the impact of CDS on the execution time of small jobs.

2. We show that CDS improves per-JVM execution time by 20% to 49% for tiny and small jobs in Hadoop and by about 5% to 20% for large jobs. In Spark, we see a 14% to 19% improvement in JVM execution times for tiny and small jobs and 5% to 8% for large jobs.

3. We propose Modular CDS to overcome the limitations of BestFit and Classpath CDS. Modular CDS enables the separation of class metadata into several logical archives so as to achieve full sharing with no duplication of class metadata.

4. With the help of synthetic workloads, we show that Modular CDS provides 70% better job execution time compared to Classpath CDS for tiny and small jobs. However, Modular CDS performs about 40% worse than BestFit CDS due to the overhead of mapping multiple archives. Further, Modular CDS lowers memory usage compared to BestFit and Classpath CDS as it only maps classes required by the application and does not involve any duplication of class metadata.

5. The work in Chapter 3 has been included as part of the article submitted to ISPDC 2019:

[45] Jan S. Rellermeyer, Sobhan Omranian Khorasani, Dan Graur, and Apourva Parthasarathy. The coming age of pervasive data processing. In 18th International Symposium on Parallel and Distributed Computing, ISPDC, 2019.

2 Background and Related Work

This chapter provides the necessary background on cluster computing and Java Virtual Machine features relevant to our work. In Section 2.1, we introduce some of the widely used cluster computing platforms, discuss how the JVM is utilized by these platforms, and give a brief overview of performance improvements in cluster computing. This is followed by an overview of the working of the Java Virtual Machine in Section 2.2, with specific focus on the OpenJDK HotSpot JVM used in our work. In Section 2.2.3, we discuss existing approaches for improving JVM reusability and minimizing JVM overheads.

2.1. Cluster Computing Platforms

In this study, we use the term cluster computing platform to refer to platforms that execute applications by distributing data and computation on a cluster of commodity servers. We limit our focus to JVM-based cluster computing platforms that utilize short-lived JVMs for low latency operations. This is due to the fact that JVM startup and warmup overhead is not of great consequence for long-running JVMs, as the overhead cost is amortized over the lifetime of the JVM. Furthermore, the performance gains due to profile-guided code optimization in long-running JVMs outweigh the initial overheads. In Sections 2.1.1 and 2.1.2, we discuss two types of platforms which operate by on-demand provisioning of JVMs on the cluster: low latency big data processing platforms and Serverless FaaS based platforms.


2.1.1. Big Data Processing Platforms

Since the introduction of the MapReduce [30] programming model, a number of frameworks have been developed for large scale data processing on commodity clusters. Frameworks like HDFS manage distributed data storage on the cluster. Frameworks such as Apache Hadoop MapReduce [9] and Apache Spark [21] provide implementations of the MapReduce programming model, primarily built to provide large scale fault-tolerant execution of batch processing jobs. The ability of the MapReduce programming model to express a wide range of data processing operations has allowed such frameworks to be further extended to support complex SQL-style queries, machine learning algorithms, and graph and real-time stream processing algorithms.

As the name suggests, the MapReduce programming model is based on two operations: Map and Reduce. The input data, consisting of key-value pairs, is split into several blocks and stored on the cluster nodes. Each block of data is transformed in parallel by the user-defined Map operation to produce intermediate output. Once all the Map operations complete, the intermediate output is hash-partitioned and provided to the Reducer nodes. The user-defined Reduce function operates on each distinct key in the intermediate output to produce the desired output. There exist several implementations of MapReduce that launch tasks as Java processes. Hadoop and Hive launch a new JVM to execute each task of the job while Spark launches each new application in a JVM. Depending on the size of the job, each job could launch tens to hundreds of JVMs on the cluster. The following sections provide an overview of JVM-based big data processing platforms.
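To make the Map and Reduce operations concrete, the listing below sketches the canonical Hadoop MapReduce WordCount in Java, the same computation as the HiBench Wordcount benchmark used in Chapter 3. It is the standard textbook example written against the org.apache.hadoop.mapreduce API, not code taken from any of the platforms discussed here.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map: emit (word, 1) for every token in the input split.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer tokens = new StringTokenizer(value.toString());
                while (tokens.hasMoreTokens()) {
                    word.set(tokens.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce: sum the emitted counts for each distinct word.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable value : values) {
                    sum += value.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        // Driver: on Hadoop, each Map and Reduce task of this job runs in its own YarnChild JVM.
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }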

The Hadoop Distributed File System (HDFS) [10] is implemented in Java and provides access to data stored on the cluster nodes. HDFS runs two long running daemons: the cluster-wide NameNode which stores metadata about the files and a per-node DataNode which provides access to data stored on that node. Data stored in HDFS can be accessed by the HDFS client which launches a JVM to execute the request. The HDFS client connects to the NameNode which then selects the appropriate DataNode to service the request. The DataNode services the request and returns to the client.

Hadoop MapReduce uses Yet Another Resource Negotiator (YARN) [22] for cluster-wide resource management. When the user runs a MapReduce job, the YARN client JVM is launched to submit the job to YARN. When YARN is ready to run the job, it launches the MRAppMaster JVM which is responsible for scheduling and coordinating the individual tasks of the job. The Application Master determines the number of Map and Reduce tasks based on the size of the input and resource limits. It then determines the locations of the tasks so as to collocate the computation with the data. Each task is launched as a YARN container JVM on the DataNode and is killed once the operation is complete. Depending on the level of parallelism desired, several tasks of the job may be running on the same node. Furthermore, several independent jobs may run tasks on the same node if they operate on the same chunk of data. Hence, we observe that JVMs are constantly being created and terminated on each node in the cluster.

Figure 2.1: JVM usage in Hadoop MapReduce

Hive [47] is a distributed SQL query engine based on Hadoop MapReduce. It provides an alternative to traditional relational database management systems by providing a SQL-like interface to run data processing jobs on Hadoop. A SQL-like query is transformed to a Directed Acyclic Graph (DAG) of Map and Reduce operations which are executed in a pipelined fashion. Similar to Hive, other frameworks like Pig [32] and Jaql [26] also execute queries as a series of MapReduce jobs.

Hive, when run with the Hadoop MapReduce scheduler on a YARN cluster, launches a JVM for each task of the job. Alternatively, Apache Tez [46] provides a scheduler that allows reuse of YARN containers for tasks of a job having the same resource and data locality requirements.

Like Tez, the Spark [50] scheduler is designed to reuse the ExecutorBackend JVM by running several tasks as threads in the JVM. The Spark client JVM submits the job to the cluster resource manager such as YARN. YARN creates the ExecutorLauncher JVM that coordinates the tasks of the application on the worker nodes. This is analogous to the MRAppMaster in Hadoop MapReduce. Spark splits the job into tasks which are then executed by the ExecutorBackend JVMs on the worker nodes. Unlike Hadoop MapReduce, which launches a new JVM for each task, the executor JVM is capable of running all tasks of the given job within the same JVM. However, tasks of different applications are run in different executors so as to provide greater flexibility in resource allocation, performance isolation and fault tolerance. While this architecture minimizes JVM overhead within an application, we see that there exists further potential for reusing loaded classes among the ExecutorBackend JVMs on each node.

Figure 2.2: JVM usage in Spark

Figure 2.3: JVM usage in OpenWhisk

Tenzing [28] maintains a pool of processes which can be reused when a new job arrives. The Master watcher receives jobs and assigns them to one of the Master pool processes, which then coordinates parallel execution of the tasks on Worker processes.

Impala [38] is a C++ based framework designed for low latency data processing and querying. In order to address the overhead of launching a new process for each individual task, Impala implements a long-running daemon to execute tasks. The impalad daemon present on each DataNode handles concurrent client requests. The node responsible for the job coordinates with other Impala daemons to perform computations and aggregate the results.

2.1.2. Serverless FaaS Platforms

Similar to big data processing platforms, Serverless FaaS platforms are characterized by constant creation and termination of JVMs on the cluster. FaaS platforms enable users to invoke operations as functions without having to manage the operational details of executing the function on the cluster. The FaaS backend is responsible for transparently acquiring cluster resources, initializing the runtime environment for the execution of the function and auto-scaling of resources based on the demand. There exist several proprietary FaaS platforms such as Amazon AWS Lambda [1], Google Cloud Functions [8], Microsoft Azure Functions [17] and open source platforms such as IBM OpenWhisk [20]. Figure 2.3 provides an overview of JVM usage in OpenWhisk.

Studies have identified that slow startup and warmup of containers on FaaS platforms has a significant impact on the latency of function execution [34, 49]. The primary approach to overcome cold-start is to maintain a pool of warmed-up containers so that the function can be invoked instantly. However, there are still scenarios where containers have to be started from scratch: a) containers time out in the absence of high load, thus causing a new container to be started up when a function is invoked; b) sudden peaks in demand require that new containers be launched quickly. Furthermore, as applications pay only for the resources used and the duration of use of the container, it is beneficial to make containers as light and fast as possible. Preliminary experiences of users [4–6] show that a Java container running HelloWorld takes a few hundred milliseconds to start while a Node.js container starts up in less than 5 ms. These experiments however measure only startup time and not the total function execution time, thus ignoring the potential benefits of Java's JIT-compilation and runtime code optimization in minimizing function execution time.

2.2. Java Virtual Machine

In this section, we provide relevant background on classloading in Java followed by a survey of existing techniques to address startup overhead and memory footprint of the JVM.

2.2.1. Java Classloading

In this section, we briefly discuss the class loading mechanism in OpenJDK. The Java Virtual Machine loads classes dynamically on demand. Classloading in OpenJDK can be viewed as a series of three steps: loading, linking and initialization.

Classloading is triggered during Symbol Resolution when HotSpot finds that the class being referenced has yet to be loaded. Symbol Resolution is the process by which symbolic names in the Java program are replaced by the actual runtime address of the symbol (class, field or method). The JVM specification provides a list of 16 bytecode instructions that trigger the resolution of a symbol reference, thus triggering classloading. This includes bytecode instructions such as creating a new instance of a class and accessing a field or invoking a method in the class.

During loading, the classloader searches the application classpath for the .class file matching the name of the class to be loaded. Next, the classloader reads and parses the bytecode from the classfile.

During linking, HotSpot performs semantic checks on the bytecode and creates runtime data structures to store class metadata. This includes creating SymbolTable entries for the names in the classfile, tables for field and method information and tables for inheritance information. Once the runtime metadata is created, HotSpot links the class being called to the class that initiated classloading. Hence, the calling class can now replace the symbolic reference to the resolved class with the runtime address of the class.

Finally, in the initialization phase, the classloader executes any static initializers defined for the class and initializes static fields of the class.

Figure 2.4: Design of Class Data Sharing in OpenJDK HotSpot JVM

Figure 2.4 shows some of the important data structures created by the OpenJDK HotSpot JVM during the execution of the Java application. The data structures explained here will be referenced in subsequent chapters to discuss the implementation of Modular Class Data Sharing.

The memory regions in HotSpot can be classified as Heap and Metaspace regions. Metaspace consists of HotSpot internal data structures while the garbage collected heap is used to allocate instances of Java classes. Below, we provide an overview of the class metadata stored in the Metaspace region. These data structures are created by HotSpot at runtime as instances of C++ classes.

InstanceKlass: The InstanceKlass is the primary data structure for storing class metadata. HotSpot creates an InstanceKlass for every java.lang.Class instance in the Java application. The InstanceKlass has pointers to the ClassLoaderData that loaded the class, the Method table and the ConstantPool table, and a back pointer to the Java mirror of the class metadata. Further, it points to the klassVtable and klassItable which store the inheritance information of the class.

ConstantPool: This table contains pointers to the symbols (classes, fields and methods) referenced by the class. During linking, the referenced class is resolved and a pointer to the newly created InstanceKlass is added to the _resolved_klasses array in the ConstantPool table of the calling class. All subsequent references to the class are simply directed to the linked InstanceKlass.

SymbolTable: This table contains a hashmap of the class, field and method names and other symbols such as method signatures, together with their corresponding Symbol* pointers, defined in the Java application.

Method: The Method table contains metadata of the methods implemented by the class. This table includes the ConstMethod table which points to the ConstantPool, and the MethodData and MethodCounter tables used by HotSpot during runtime profiling of the methods. The Method table further contains the interpreter and compiler entry points which point to the runtime location of the method invocation.

klassVtable: This table contains a pointer to the super class (or java.lang.Object if the class does not explicitly extend any class) and an entry for each method inherited from a parent class. The klassVtable entry is set to point to the Method* of the child class for overridden methods and the Method* of the parent for inherited non-overridden methods.

klassItable: This table contains pointers to the interfaces implemented by the class and itableMethodEntry pointers to the default methods implemented by the interfaces.

At a high level, the classloading mechanism works as follows. Consider that the Java class Hello (already loaded) initiates the loading of class World by calling a static method of the class. This is triggered by the execution of the invokestatic bytecode whose arguments are: a) the ConstantPool entry for World and b) the ConstantPool entry for the method WorldFun() to be invoked. The JVM then reads World.class from disk, creates the InstanceKlass and all other associated tables for this class in the Metaspace region. The JVM creates an instance of java.lang.Class for World on the heap which includes a back-pointer to its corresponding InstanceKlass. The JVM further recursively loads super classes and interfaces inherited by World. Once these classes are loaded, the JVM updates the _resolved_references array in the ConstantPool table for class Hello to point to the class World. Next, the JVM proceeds to resolve the WorldFun() method. Once the correct implementation of the method is found, the JVM resumes with the execution of the invokestatic instruction.
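The scenario above can be reproduced with the minimal Java sketch below (the method is written worldFun() here, following Java naming conventions). Running it with -verbose:class or -Xlog:class+load shows that World is loaded, linked and initialized only when Hello first executes the invokestatic instruction for the method, not at JVM startup.

    public class Hello {
        public static void main(String[] args) {
            System.out.println("before first use of World");
            // The invokestatic bytecode for World.worldFun() forces resolution of the
            // ConstantPool entry for World, which triggers loading, linking and
            // initialization of the World class at this point rather than at startup.
            World.worldFun();
        }
    }

    class World {
        static {
            // Executed during the initialization phase, after loading and linking.
            System.out.println("World initialized");
        }

        static void worldFun() {
            System.out.println("World.worldFun() invoked");
        }
    }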

2.2.2. Class Data Sharing

This section describes how class metadata can be archived and shared between multiple JVMs using Application Class Data Sharing (CDS) introduced in OpenJDK 9. This mechanism works by persisting the metadata of classes loaded during the execution of an application in a memory mapped file so that any subsequent run of the application can simply map this file into memory and reuse the loaded classes. In this work, we refer to this memory mapped file as the CDS archive. In addition to making classloading quicker, multiple JVMs that make use of the same CDS archive can share the loaded classes, thus minimizing the duplication of class metadata among several JVMs.

A limited form of Class Data Sharing has been available since OpenJDK 1.5. This version only supported the archiving of classes loaded by the Bootstrap classloader, thus only allowing classes in the Bootstrap class path (JRE classes such as java.lang classes) to be archived. However, as the size of Java applications is growing ever larger, we see that even small applications tend to load thousands of classes. Hence, CDS was further enhanced to allow archiving of classes loaded by the application class loader and custom class loaders. A CDS archive can be created and used with the following three steps:

Step 1: The application is first executed with -XX:DumpLoadedClassList to create a list of all classes loaded by the application during its run. It must be noted that the classes loaded by the application in different runs may be different based on the input. However, this has not been considered during the dumping process. Currently, the user must manually add classes that are not dumped during the default class-list dumping process.

Step 2: Next, the application is run with the -Xshare:dump switch. This creates a CDS archive containing all the classes present in the class list created in the previous step.

Step 3: Finally, the application is run with the -Xshare:on and -XX:SharedArchiveFile options to utilize the CDS archive during classloading.
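As a concrete illustration, the listing below walks through the three steps for a hypothetical application packaged as app.jar with main class Main, using the standard AppCDS flag spellings (-XX:DumpLoadedClassList, -XX:SharedClassListFile, -XX:SharedArchiveFile). On the OpenJDK 10 release used in this study, application class sharing additionally has to be enabled with -XX:+UseAppCDS; the archive and list file names are placeholders.

    # Step 1: trial run that records the names of all loaded classes
    java -XX:+UseAppCDS -XX:DumpLoadedClassList=app.lst -cp app.jar Main

    # Step 2: dump the listed classes into a CDS archive (the application is not run)
    java -XX:+UseAppCDS -Xshare:dump -XX:SharedClassListFile=app.lst \
         -XX:SharedArchiveFile=app.jsa -cp app.jar

    # Step 3: run the application with the archive memory mapped into the JVM
    java -XX:+UseAppCDS -Xshare:on -XX:SharedArchiveFile=app.jsa -cp app.jar Main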

CDS aims to improve both the speed of classloading and the memory footprint of the JVMs. The speedup is achieved as several steps in the conventional classloading process are eliminated when loading classes from a CDS archive, as listed below.

1. Find: Search for the appropriate .class file in the classpath. This step can be quite expensive as the classloader looks up each JAR file on the classpath in order to find the first occurrence of the class.

2. Read: Read the .class file from disk.

3. Define: Perform verification of the bytecode, including checks on its structure and semantics. This is followed by the creation of the C++ representation of the Java class and the tables that store class metadata.

We now briefly discuss the structure of the CDS archive and how it is used at runtime. The CDS archive consists of 5 consecutive regions:

Miscellaneous code (mc) region: This region contains information for setting up method entry points (the _i2i_entry, _from_interpreted_entry and _from_compiled_entry fields in the Method table) for shared methods. The method entry points for interpreted mode point to a fixed location in the mc region. At runtime, a jump statement is generated at this location which jumps unconditionally to the correct entry point for the method, whose address is determined at runtime. This mechanism prevents the need for the Method table to be modified unless the method is compiled, thus allowing the table to be shared by all JVMs.

Miscellaneous data (md) region: The C++ classes such as InstanceKlass and ConstantPool themselves contain virtual tables and must be relocatable at runtime. During dump-time, the vtables are dumped with addresses relative to the _vptr. During runtime, these vtables are cloned into the JVM and are patched to point to the correct runtime addresses.

Read-only (ro) region: This region contains read-only information such as the SymbolTable, SystemDictionary and StringTable. These tables can be shared as-is by all JVMs that map the CDS archive.

Read-write (rw) region: This region contains tables whose entries may be rewritten to point to the correct addresses at runtime. When an entry in this region is to be changed, only the page (or pages) that are to be updated are copied into the virtual address space of the JVM. This region includes tables such as the ConstantPool, which contains addresses of resolved references, and the InstanceKlass, which has to be patched to point to the instance of java.lang.Class on the Java heap.

Original data (od) region: This region contains the bytecode of the Java application.

Closed archive heap regions: These regions contain interned strings and other java.lang.String objects stored in the archive.

The CDS archive is memory mapped into the JVM's virtual address space during JVM startup. Classloading with CDS occurs as follows: During Universe::init(), well known JVM classes such as java.lang.Object, java.lang.Class and java.lang.ClassLoader are loaded from the archive. When the application classes are to be loaded, the loadClass() method in java.lang.ClassLoader makes a call to the JVM passing the name of the class to be loaded and the defining class loader for the class. The JVM_FindOrLoadSharedClasses() method is called which determines if the class is present in the CDS archive. If the class is found, it is loaded from the archive and initialized. If the class is not in the archive, the default class loading procedure is followed and the class is loaded from the JAR file.

The JVM_FindOrLoadSharedClass() method looks up the SystemDictionaryShared table in the CDS archive. This hashtable contains an entry for each class loaded in the CDS archive. If an entry for the class name is found, the JVM triggers the loading of the super class and interfaces implemented by the class. Once all parent classes are loaded, the metadata for the current class is read from the archive. The JVM then calls restore_unshareable_info() on the InstanceKlass, ConstantPool and Method tables so as to patch the tables to point to the correct runtime addresses. Next, the JVM updates the linking information in the calling class to point to the newly loaded class. The class is then initialized and control returns to the bytecode that triggered the loading of the class.
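Whether a given class is actually served from the archive can be checked with the unified logging introduced in JDK 9: with -Xlog:class+load=info every loaded class is reported together with its source, which for archived classes is the shared objects file (the exact wording may differ slightly between JDK versions). Continuing the hypothetical app.jar/app.jsa example from the previous section:

    java -XX:+UseAppCDS -Xshare:on -XX:SharedArchiveFile=app.jsa \
         -Xlog:class+load=info -cp app.jar Main
    # Classes mapped from the CDS archive are logged as
    #     ... java.lang.String source: shared objects file
    # while conventionally loaded classes show a jar: or file: URL as their source.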

2.2.3. Related Work on Java Virtual Machines

In the survey of related literature, we focus mainly on approaches for minimizing classloading overheads and improving sharing and reusability of the JVM. There have been several approaches that have attempted to make the JVM more reusable so as to improve startup speed, total application execution time and memory footprint. In this section, we provide an overview of such approaches.

The Multitasking Virtual Machine (MVM) [29] is designed to run as a single JVM capable of executing multiple Java applications as parallel tasks in the process. MVM is designed to provide maximum sharing by allowing JVM internal data structures, objects on the heap and compiled native code to be shared by all running tasks. The task-specific metadata that cannot be shared is maintained in a per-task table. MVM performs classloading the first time a class is needed by a task and performs JIT compilation when the number of method invocations exceeds the threshold. All other tasks running in MVM then share the loaded class metadata and compiled code. There are two scenarios for a Java application that is just starting up: a) in the best case, all the classes it needs are already loaded and its hot code paths already compiled; b) in the worst case, the application has to perform the complete start-up and warm-up from scratch.

There exist several JVMs that theoretically provide a guaranteed lowering of overhead for each new application that starts up. Such techniques make use of some mechanism to persist and restore the JVM state so as to incur only a minimum constant overhead [12, 37, 40].

The Cloneable JVM [37] is primarily targeted at lowering startup overhead. With the MVM approach, the overheads are amortized over multiple applications that share the JVM. However, the startup times of individual processes may still be quite high if the classes used by the application are not already loaded. In order to speed up startup, the Cloneable JVM clones a new process from an existing JVM containing loaded classes and JIT-compiled code. Upon resetting some application-specific information, the new JVM is fully initialized and able to instantly begin executing the new Java application. The level of reuse however depends on the similarity between the original JVM and the cloned JVM in terms of loaded classes, methods invoked and hot code paths. This approach further provides greater isolation as each JVM runs in its own process.

OpenJ9 [18] is similar to OpenJDK Class Data Sharing in that it uses a memory-mapped file to cache class metadata beyond the lifetime of the JVM. The Java application can read from an existing cache and can modify the cache in subsequent runs. OpenJ9, like OpenJDK CDS, does not allow a JVM to access more than one cache. This necessitates that each application creates its own cache, thus potentially increasing duplication.

HotTub [40] is based on the approach of keeping warmed-up JVMs in memory so that they can be reused by incoming Java applications. HotTub maintains a pool of hot JVMs and matches the incoming request for a new JVM with the best available JVM in the pool. Once the JVM has completed its run, values such as static fields and file descriptors are reset and the JVM, with all its loaded classes and JIT-compiled methods intact, is added back to the pool to be reused.

In addition to the open source JVMs discussed above, there exist proprietary JVMs that deal with start-up and warm-up overheads. Azul's Zing ReadyNow VM [2] saves and restores compiler optimizations to overcome bytecode interpretation overhead.

While the above solutions are based on enhancing the JVM, there are other solutions used in practice to overcome startup and warmup overheads [13–15]. This mainly involves starting up and warming up JVMs in advance and scheduling jobs on these hot JVMs. Such approaches generally run a warm-up sequence to load classes and JIT-compile critical code paths so that the JVM is warmed up before it is ready to start servicing requests.

3 Performance of Class Data Sharing for Low latency Workloads

In this chapter, we quantify and analyze the overheads due to OpenJDK classloading for small jobs. Based on the analysis of JVM overheads in Section 1.1, we focus specifically on minimizing classloading overhead in cluster computing workloads. Section 3.1 describes the workloads used in this study and the experimental setup, and Section 3.2 analyzes the classloading overhead in Hadoop and Spark workloads. In Section 3.3, we propose two approaches based on the recently introduced Application Class Data Sharing feature in OpenJDK and evaluate their impact on the performance of Hadoop and Spark workloads. Finally, in Section 3.4, we propose Modular Class Data Sharing and highlight the potential benefits of this approach.

3.1. Experimental Setup

In this study, we analyze classloading overhead in Hadoop and Spark workloads running on OpenJDK 10.0.2. We use Apache Hadoop 3.1.0 and Apache Spark 2.4.0 for our experiments. Since these versions of Hadoop and Spark do not support OpenJDK 9+, we required several patches to make these platforms compatible with OpenJDK 9. The patches used include patches 12760, 14984, 14986, 15133, 15304, 15775, 15610 for Hadoop and patch 21495 for Spark. We use HDFS for data storage and all workloads are managed by the YARN resource manager. We require the ability to add JVM arguments for Class Data Sharing when launching YARN containers. However, Hadoop does not allow adding arbitrary JVM arguments to its config files. Hence, we supply these arguments through a separate config file and modify org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLauncher to enable parsing of these arguments.

Workload   Input data size
Tiny       32 KB
Small      320 MB
Large      3.20 GB

Table 3.1: Data sizes in the HiBench benchmark used for Hadoop and Spark benchmarking.

CDS compares the application classpath used at run-time against the classpath at dump-time to ensure that it is an exact match. The CDS archive is used only if they match. Hence, CDS does not accept classpaths with wildcards such as /hadoop/common*. We modify the YARN launch scripts to ensure that the Hadoop and Spark classpaths contain the full path to the JARs and other referenced resources.
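For illustration, the fragment below contrasts a wildcard classpath, which CDS rejects, with the expanded form that must be used identically at dump-time and at run-time; the Hadoop paths and JAR names are placeholders.

    # Rejected by CDS: a wildcard classpath entry cannot be matched against the dump-time classpath
    #   -cp "/opt/hadoop/share/hadoop/common/*"
    # Accepted: the same explicit, ordered JAR list at dump time and at run time
    java -Xshare:dump -XX:SharedClassListFile=hadoop.lst -XX:SharedArchiveFile=hadoop.jsa \
         -cp /opt/hadoop/share/hadoop/common/hadoop-common-3.1.0.jar:/opt/hadoop/share/hadoop/hdfs/hadoop-hdfs-3.1.0.jar
    java -Xshare:on -XX:SharedArchiveFile=hadoop.jsa \
         -cp /opt/hadoop/share/hadoop/common/hadoop-common-3.1.0.jar:/opt/hadoop/share/hadoop/hdfs/hadoop-hdfs-3.1.0.jar MainClass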

The experiments are conducted on a single compute node of the DAS-5 cluster [3, 25]. The node runs CentOS Linux and has 64 GB of memory and a dual 8-core processor. For benchmarking, we make use of the Wordcount benchmark in the HiBench Benchmark Suite [11, 35]. We run the benchmark on three job sizes: tiny, small and large, as shown in Table 3.1. Each job is run 10 times and the average and standard deviation values are reported. The buffer cache is flushed before each run in order to discount the performance speedup due to caching of application code and data.

3.2. Analysis of Classloading Overhead

The life cycle of the Java Virtual Machine consists of 5 main tasks: a) classloading, b) execution of bytecode by interpretation, c) JIT-compilation of frequently invoked methods and optimization of hot code paths, d) execution of compiled code and e) Garbage Collection. While several studies have focused on JIT-compilation and GC overheads, there are few studies on the classloading overheads, especially in the context of cluster computing applications. In this section, we analyze the classloading overhead of the JVM using a similar approach to HotTub [40] and show that it accounts for a large proportion of the total execution time for short-lived JVMs.

Classloading in OpenJDK consists of 3 main operations: find, read and define. Find and read phases are executed by the JDK classloaders, which subsequently call the HotSpot VM to create the internal representation of the Java class. In the find phase, the classloader looks for the classfile of the class to be loaded in each JAR in the classpath and returns the first occurrence of the class. The classloader then reads the bytecode from the classfile and passes it to HotSpot which verifies the bytecode and creates tables to store class metadata in the define phase.

We enhance the PerfCounters in OpenJDK to measure the per-thread classloading overhead of the JVM. For the Hadoop and Spark benchmarks, we instrument two classloaders used during classloading: BuiltinClassLoader and URLClassLoader. The BuiltinClassLoader can be one of three types: a) the BootLoader which loads classes from the boot classpath, b) the PlatformClassLoader which loads Java platform classes and c) the SystemClassLoader which loads classes in the application classpath. Since the introduction of JDK 9, HotSpot loads JDK classes in the boot classpath from a jimage file. This file contains the runtime image for classes in the jrt:java.base module, most of which are loaded during JVM startup. The URLClassLoader loads classes from the URLs provided by the URLClassPath. Each of these classloaders has been instrumented to measure the time taken to find, read and define a class.
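The find/read/define breakdown reported below relies on these modified HotSpot PerfCounters, which are not exposed to ordinary Java code. For readers without a patched JVM, a much coarser view of classloading activity is available through the standard ClassLoadingMXBean; the sketch below only counts classes loaded around a single Class.forName call (the class name is a placeholder) and cannot separate the find, read and define phases.

    import java.lang.management.ClassLoadingMXBean;
    import java.lang.management.ManagementFactory;

    public class ClassLoadStats {
        public static void main(String[] args) throws Exception {
            ClassLoadingMXBean loading = ManagementFactory.getClassLoadingMXBean();
            long classesBefore = loading.getTotalLoadedClassCount();
            long start = System.nanoTime();

            // Placeholder class name: forces loading of one application class and
            // of any super classes and interfaces it depends on.
            Class.forName("org.example.SomeWorkloadClass");

            long elapsed = System.nanoTime() - start;
            long loaded = loading.getTotalLoadedClassCount() - classesBefore;
            System.out.printf("%d classes loaded in %.2f ms%n", loaded, elapsed / 1e6);
        }
    }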

In order to understand the classloading overheads in Hadoop and Spark, we run the Wordcount HiBench workload for tiny, small and large datasets. Tables 3.2, 3.3 and 3.4 show the total classloading time for all threads of the JVM. The Class Loading Time (CLT) is the sum of the per-thread classloading time of the threads in the JVM. Since these threads may be executing in parallel, we measure the lower bound on classloading time as the time spent by the main thread in classloading. As the main thread is responsible for loading about 90% of the classes in Hadoop, this provides a close approximation of the total classloading time of the JVM. The classloading time of the main thread is used in the measurement of the % improvement in Class Loading Time (CLT) with respect to the total JVM Execution Time (JET). In Spark however, the main thread loads only about 30% of the classes. Hence, in the future, it is useful to analyze classloading across multiple threads in order to determine all classloading that falls in the critical path of execution. The large Hadoop job launches 26 JVMs based on the number of splits of input data. In Table 3.4, we only show the YarnChild JVMs with the highest and lowest change in job execution time.

YARN launches the MRAppMaster JVM which requests resources for the tasks and starts each Map and Reduce task in its own YarnChild JVM. During its lifetime, the MRAppMaster loads over 9000 classes while the YARN containers running Map and Reduce tasks load over 5000 classes each. We observe that for the tiny workload, classloading time dominates the total execution time of the JVM, accounting for about 49% of the lifetime of the YarnChild JVM and 21.51% of the lifetime of the MRAppMaster JVM. For larger workloads, the overhead falls to about 5%-22% as the classloading overhead is amortized over the lifetime of the JVM.

As several JVMs may be running in parallel, it must be noted that not all classloading overhead falls on the critical path of execution of the job. Hence, the overall impact of classloading overhead will be lower than the overheads of the individual JVMs. However, the decrease in overall job execution time due to class data sharing lies between 4% and 18% suggesting that a significant portion of class loading occurs on the critical path of execution.

JVM            Class Loading Time (CLT) sec   JVM Execution Time (JET) sec   % of CLT/JET (main thread)
MRAppMaster    12.11 ± 0.46                   44.64 ± 0.19                   21.51 %
YarnChild_1     4.67 ± 1.52                    9.19 ± 0.27                   49.41 %

Table 3.2: Hadoop Wordcount Tiny workload

JVM            Class Loading Time (CLT) sec   JVM Execution Time (JET) sec   % of CLT/JET (main thread)
MRAppMaster    12.41 ± 0.43                   53.97 ± 0.78                   17.96 %
YarnChild_1     4.90 ± 0.21                   28.25 ± 0.90                   21.60 %
YarnChild_2     4.78 ± 0.14                   27.31 ± 0.93                   15.60 %
YarnChild_3     4.99 ± 0.18                   17.53 ± 0.49                   25.21 %
YarnChild_4     4.84 ± 0.21                    9.52 ± 0.40                   44.48 %

Table 3.3: Hadoop Wordcount Small workload

JVM            Class Loading Time (CLT) sec   JVM Execution Time (JET) sec   % of CLT/JET (main thread)
MRAppMaster    12.04 ± 0.85                  189.19 ± 2.76                    4.99 %
YarnChild_2     6.01 ± 0.53                   30.92 ± 1.04                    7.44 %
YarnChild_3     6.48 ± 0.24                   17.70 ± 0.49                   21.88 %

Table 3.4: Hadoop Wordcount Large workload

Tables 3.5, 3.6 and 3.7 show the classloading overhead in Spark. YARN launches the ExecutorLauncher JVM which subsequently creates CoarseGrainedExecutorBackend JVMs. These JVMs run Map and Reduce tasks as threads in the JVMs. While Class Loading Time (CLT) is measured as the total per-thread classloading time, we consider only the classloading time of the main thread when measuring the % of Class Loading Time (CLT) with respect to JVM Execution Time (JET) in order to obtain a lower bound on the classloading overhead. We see that classloading overhead per JVM is about 18%-19% for job sizes ranging from 32 KB to 320 MB. For large workloads, the overhead falls to about 6%-8%.

JVM                            Class Loading Time (CLT) sec   JVM Execution Time (JET) sec   % of CLT/JET (main thread)
ExecutorLauncher                7.08 ± 0.16                   34.64 ± 3.26                   18.98 %
CoarseGrainedExecutorBackend   10.04 ± 0.28                   21.69 ± 3.28                   18.61 %
CoarseGrainedExecutorBackend    7.87 ± 2.50                   20.37 ± 3.35                   19.18 %

Table 3.5: Spark Wordcount Tiny workload

JVM                            Class Loading Time (CLT) sec   JVM Execution Time (JET) sec   % of CLT/JET (main thread)
ExecutorLauncher                6.55 ± 0.28                   37.68 ± 0.36                   16.13 %
CoarseGrainedExecutorBackend   10.22 ± 0.65                   25.75 ± 0.61                   14.94 %
CoarseGrainedExecutorBackend   10.74 ± 0.40                   24.77 ± 0.89                   14.21 %

Table 3.6: Spark Wordcount Small workload

JVM                            Class Loading Time (CLT) sec   JVM Execution Time (JET) sec   % of CLT/JET (main thread)
ExecutorLauncher                7.15 ± 0.30                   76.65 ± 1.27                    8.14 %
CoarseGrainedExecutorBackend   10.38 ± 0.36                   64.73 ± 1.32                    6.08 %
CoarseGrainedExecutorBackend   10.38 ± 0.53                   63.51 ± 1.32                    5.93 %

Table 3.7: Spark Wordcount Large workload

Figure 3.1 shows the breakdown of classloading overhead for each JVM. We see that a large proportion of time is spent in the define phase, where HotSpot verifies the bytecode and creates the internal representation of the class metadata.

As Class Data Sharing caches defined classes in the CDS archive and shares them among all running JVMs, one would expect to see a significant drop in classloading time due to the complete elimination of the define phase for shared classes. In the following section, we provide a detailed analysis of the BestFit and Classpath CDS approaches and investigate the reduction in classloading overhead due to these approaches.

3.3. Analysis of Class Data Sharing (CDS) Approaches

In this section, we propose two methods to apply OpenJDK Class Data Sharing (CDS) for Hadoop and Spark workloads: BestFit CDS and Classpath CDS. We evaluate the improvement in classloading time and overall job execution time due to CDS followed by an analysis of the impact of CDS on startup time and memory usage.

The BestFit CDS archive contains only those classes loaded by the application whereas the Classpath CDS archive is created by dumping all classes from the Hadoop or Spark classpath in the archive.

The BestFit CDS archive is created in 2 steps a) executing a trial run of the application and obtaining a list of all classes loaded dur- 28 3. Performance of Class Data Sharing for Low latency Workloads

[Figure 3.1 panels: Hadoop Wordcount and Spark Wordcount at tiny, small and large scale; per-JVM classloading time (sec) broken down into find_time, read_time, define_time and other.]

Figure 3.1: Breakdown of per-JVM classloading time for tiny, small and large workloads in Hadoop and Spark

This archive can be shared by all tasks of the application and by all instances of the application. The BestFit CDS archive suffers from the limitation that several application-specific archives may contain a large subset of common framework classes, thus increasing redundancy. The issue of redundancy can be mitigated to a large extent by adopting Classpath CDS.

The advantage of Classpath CDS is that it allows all Hadoop (or Spark) jobs to share the same CDS archive, thus achieving minimum duplication on a given node. However, this greatly increases the size of the archive, requiring small applications to request larger containers than they would otherwise need for their tasks. Furthermore, OpenJDK does not allow the CDS archive to be dynamically modified at runtime. This limitation makes it impossible to maintain a small, fixed-size archive, either by setting hard limits on the archive size or through intelligent caching policies that keep only the most relevant classes in the archive.
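As a concrete illustration, both archive flavours can be produced with the standard OpenJDK AppCDS tooling. The sketch below is a minimal example assuming a JDK 10/11-era JVM (on JDK 10 the additional -XX:+UseAppCDS flag is also required); the class, JAR and file names are placeholders and do not reflect the actual HiBench setup.

```java
// Minimal sketch of BestFit archive creation with standard AppCDS flags (names are placeholders):
//
//   (1) Trial run: record every class the application loads.
//       java -XX:DumpLoadedClassList=wordcount.lst -cp wordcount.jar WordCountDriver input/
//
//   (2) Dump the recorded classes into a static archive.
//       java -Xshare:dump -XX:SharedClassListFile=wordcount.lst \
//            -XX:SharedArchiveFile=bestfit.jsa -cp wordcount.jar
//
//   (3) Production runs map the archive instead of re-defining the classes.
//       java -Xshare:on -XX:SharedArchiveFile=bestfit.jsa -cp wordcount.jar WordCountDriver input/
//
// A Classpath CDS archive is produced with the same step (2), but with a class list that
// enumerates all classes on the Hadoop or Spark classpath instead of only the loaded ones.
public class WordCountDriver {
    public static void main(String[] args) {
        // Stand-in for the real job submission logic.
        System.out.println("submitting job with " + args.length + " input paths");
    }
}
```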

In this section, we provide an analysis of BestFit and Classpath CDS for tiny, small and large applications on Hadoop and Spark and show that CDS provides a significant improvement in overall job execution time. Following this, we discuss in detail the limitations of the current implementation of OpenJDK CDS in Section 3.4 and propose a modular and scalable approach for Class Data Sharing.

3.3.1. Analysis of execution time

Table 3.3.1 shows the improvement in classloading time due to BestFit and Classpath CDS. The table only shows the results from the Tiny Hadoop and Spark workloads. The results are similar for the Small and Large workloads, as the same set of classes is loaded from the CDS archive in all cases. We observe that Class Data Sharing improves the per-JVM classloading time by 30%-50%.

Figures 3.2 and 3.3 show the breakdown of classloading time for Hadoop and Spark workloads. Comparing this with Figure 3.1, which shows the No-CDS case, we see that the time spent in the define phase is drastically reduced. As the class metadata for shared classes is directly read from the memory-mapped CDS archive, the JVM executes the define phase only for those classes that are not present in the CDS archive. Currently, OpenJDK does not allow archiving of classes with class-file versions below JDK 1.5. Both Hadoop and Spark make use of a small number of pre-JDK 1.5 libraries, which have to be loaded traditionally by reading bytecode from the classfiles.

[Figure 3.2 panels: Hadoop Wordcount with BestFit CDS and with Classpath CDS at tiny, small and large scale; per-JVM classloading time (sec) broken down into find_time, read_time, define_time and other.]

Figure 3.2: Breakdown of per-JVM classloading time with Class Data Sharing for Hadoop

We see that the JVM now spends the majority of classloading time in the find phase. For classes that are not shared, this involves finding the classfile on the classpath. For shared classes, the find phase involves a call into HotSpot to find the required class in the SystemDictionaryShared table in the CDS archive. This dictionary maintains a hashmap of all class names present in the CDS archive and directly provides a reference to the metadata of the class. Further, we observe that classloading time remains fairly constant across job sizes and CDS archive sizes, indicating that classloading with CDS incurs a low, roughly constant per-class overhead that depends only on the number of classes loaded.
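The lookup order just described can be summarized with a small self-contained model (plain Java, not HotSpot code; the ClassMeta type and the class names are invented for illustration): a hit in the shared dictionary returns ready-made metadata, while a miss falls through to the usual classpath search and define step.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of the find phase with CDS enabled: archived classes resolve through a
// single hash lookup in the shared dictionary and skip the define phase entirely; everything
// else falls back to the classpath search plus the expensive define step.
public class SharedDictionaryLookup {
    static final class ClassMeta {
        final String name;
        final boolean fromArchive;
        ClassMeta(String name, boolean fromArchive) { this.name = name; this.fromArchive = fromArchive; }
    }

    // Stand-in for the SystemDictionaryShared table persisted in the CDS archive.
    private final Map<String, ClassMeta> sharedDictionary = new HashMap<>();

    void addArchivedClass(String name) { sharedDictionary.put(name, new ClassMeta(name, true)); }

    ClassMeta find(String name) {
        ClassMeta shared = sharedDictionary.get(name);   // find: constant-time lookup
        if (shared != null) {
            return shared;                               // define phase skipped for shared classes
        }
        // Not archived: a real JVM would scan the classpath, read the classfile, verify the
        // bytecode and build the metadata (the define phase); here we just model the outcome.
        return new ClassMeta(name, false);
    }

    public static void main(String[] args) {
        SharedDictionaryLookup dictionary = new SharedDictionaryLookup();
        dictionary.addArchivedClass("org.apache.hadoop.io.Text");                      // pretend it was dumped
        System.out.println(dictionary.find("org.apache.hadoop.io.Text").fromArchive);  // true
        System.out.println(dictionary.find("com.example.UserMapper").fromArchive);     // false
    }
}
```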

Table 3.8 shows the improvement in overall Job Execution Time by comparing the No-CDS case with the BestFit and Classpath CDS approaches. With BestFit CDS, we see a significant improvement of 15%-27% for small and tiny Hadoop jobs.

[Figure 3.3 panels: Spark Wordcount with BestFit CDS and with Classpath CDS at tiny, small and large scale; per-JVM classloading time (sec) broken down into find_time, read_time, define_time and other.]

Figure 3.3: Breakdown of per-JVM classloading time with Class Data Sharing for Spark

The improvement in the large job is limited to 4%-8%, as the classloading overhead is amortized over the lifetime of the job. It is interesting to see that the Job Execution Time of the Hadoop Large job with Classpath CDS is about 27 sec longer than with BestFit CDS. This is due to the greater startup and initialization time required to initialize a large CDS archive.

It must be noted that the improvement in Job Execution Time is significantly lower than the improvement in per-JVM execution time. This is due to the fact that several JVMs run in parallel and are hence not on the critical path of execution. However, the reduction in per-JVM execution time is still significant, as the resources relinquished by the JVM can be made available to other jobs in the cluster, thus improving cluster utilization.

Application     Job Execution Time (sec)              % improvement in Job Execution Time
                No CDS     BestFit    Classpath        BestFit      Classpath
Hadoop Tiny      49.73      36.04      40.77           27.527 %     18.01 %
Hadoop Small    118.49     100.29     105.38           15.36 %      11.06 %
Hadoop Large    779.18     713.92     740.80            8.37 %       4.92 %
Spark Tiny       53.68      51.70      51.77            3.68 %       3.56 %
Spark Small      64.44      61.78      62.03            4.17 %       3.75 %
Spark Large     152.21     144.46     146.53            5.09 %       3.73 %

Table 3.8: Improvement in Overall Job Execution Time due to BestFit CDS and Classpath CDS

3.3.2. Analysis of Startup Overhead

In this section, we analyze the impact of BestFit and Classpath CDS on the startup time of the JVM. During startup, the JVM performs several operations such as initializing the TemplateTable for bytecode interpretation, initializing well-known JDK classes to bootstrap the execution of Java code, and creating the VM thread. With the use of CDS, the startup phase further includes the initialization of the MetaspaceShared region, which maps the CDS archive into the JVM address space.

Application   No CDS (sec)   BestFit CDS (sec)   Classpath CDS (sec)
Hadoop        0.79           0.94                1.86
Spark         0.85           0.97                4.09

Table 3.9: Startup time (sec) for No CDS, BestFit CDS and Classpath CDS

Table 3.9 shows the startup overhead for the MRAppMaster JVM in Hadoop and the ExecutorLauncher JVM in Spark when running the Tiny Wordcount workload. We see that the use of CDS leads to an increase in startup overhead. In Hadoop, the startup time for Classpath CDS is almost 2x that of BestFit CDS due to the increased time for mapping the MetaspaceShared region and initializing the shared tables.

BestFit CDS
Application        Num application classes   Num archived classes   Archive size   RO       OD       RW
Hadoop Wordcount   9019                      8905                   110 MB         37.5 %   36.5 %   23.3 %
Spark Wordcount    8422                      8135                   107 MB         37.9 %   35.7 %   24.2 %

Classpath CDS
Application        Num classpath classes     Num archived classes   Archive size   RO       OD       RW
Hadoop Wordcount   57506                     40290                  438 MB         35.4 %   37.9 %   24.1 %
Spark Wordcount    123610                    119910                 1213 MB        34.5 %   37.3 %   26.2 %

Table 3.10: Memory usage of BestFit and Classpath CDS archives

In Spark, the Classpath CDS archive is 1.2 GB in size compared to the 111 MB Wordcount application archive, leading to a 4x increase in startup time.

3.3.3. Analysis of Memory Usage

Table 3.10 shows the proportion of shared class metadata in the BestFit and Classpath CDS archives. The data in the RO and OD regions can be shared by all processes that map the archive. The RO region contains read-only class metadata, while the OD region contains the original bytecode of the classes. The RW region contains class metadata that is patched at runtime. Hence, the pages that are written to are copied-on-write into the JVM's private memory, while the clean pages are shared by all JVMs.

CDS leads to immense savings in memory usage on the cluster by minimizing duplication of class metadata. Consider the Hadoop Wordcount application: at least 59% of the metadata (RO and OD data) is shared by all parallel tasks of the job and by all parallel applications that use this archive. For a Hadoop Wordcount archive which is about 115 MB in size, each parallel Map/Reduce task can lower its memory usage by at least 67 MB due to Class Data Sharing.
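The per-task figure follows directly from the shared fraction and the archive size quoted above:

\[ 0.59 \times 115\ \text{MB} \approx 67.9\ \text{MB} \]

i.e. at least 67 MB of class metadata that each additional Map/Reduce task maps from the shared archive instead of duplicating in its own metaspace.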

3.4. Modular Class Data Sharing Approach

We have shown that the use of Class Data Sharing leads to faster job execution and lower per-JVM memory utilization. However, this approach does not completely eliminate duplication of class metadata on a node. As OpenJDK only allows a single archive per JVM, each application creates its own custom archive including common framework classes and application-specific classes, thus increasing duplication across the archives.

For instance, consider three varied Spark workloads: a Spark Pi job, a Spark SQL job and a Spark ML job. These applications have common classes such as HDFS, YARN, Spark tools and Spark configuration classes, along with application-specific classes such as Spark SQL and Spark ML classes. In order to avoid duplication, it is possible to dump only the common classes to the archive, but this degrades job execution time as the application classes then have to be loaded from the JAR files. Furthermore, we observe that the Spark applications are varied enough that the use of a sub-optimal CDS archive leads to a significantly higher job execution time compared to a BestFit CDS archive.

We observe that the Spark Pi archive contains all the common classes used by both the Spark SQL and Spark ML applications, while Spark SQL and Spark ML make use of several application-specific libraries. We conduct a cross-validation experiment to test the fit of one application's archive for another application; the results are shown in Table 3.11. It is seen that the Spark SQL application runs about 10 seconds faster with its own BestFit archive than with the Spark Pi CDS archive, and 6 seconds faster than with the Spark ML archive. Similarly, the Spark ML application runs about 2-5 seconds faster with the use of its own BestFit CDS archive.

Application   Spark Pi archive   Spark SQL archive   Spark ML archive
Spark Pi      24.39              23.79               24.29
Spark SQL     80.01              70.26               76.16
Spark ML      88.96              91.34               86.16

Table 3.11: Job execution time (sec) of each application (rows) when run with each application's CDS archive (columns). Using a different application's CDS archive leads to a longer job execution time than using the application's own BestFit archive.

From Table 3.10, we see that the use of Classpath CDS leads to very large archives that are about 5 - 10 times larger than the Bestfit archive. This large archive increases the job execution time due to an increase in startup time.

Hence, it is useful to make CDS more modular so as to achieve a balance between the BestFit and Classpath approaches. Modular CDS enables common classes to be separated into their own archives in order to achieve full sharing without the need for a Classpath archive containing a large number of unused classes. Some of the main advantages of Modular CDS are:

• It gives the user full control over the granularity of sharing depending on the requirements of the application.

• It makes maintenance and deployment of CDS archives easier, as a change in a JAR file only requires rebuilding the archive containing that JAR.

• With Modular CDS we have the potential to dump each JAR file into a separate CDS archive, so that there is a single global cache of class metadata shared by all JVMs on the node. However, it must be noted that Modular CDS currently does not support runtime relocation of CDS archives. Hence, the user must ensure that the CDS archives are dumped to distinct address spaces.

3.5. Summary

Based on the experiments in Chapter 3, we are able to answer RQ1 and RQ2 as discussed below.

RQ1: What is the impact of Classloading overhead on small low latency big data processing workloads?

In this study, we measure the classloading overhead of small data processing workloads with input sizes ranging from 32 KB to 3.2 GB per node using the HiBench Wordcount benchmark. We find that in a Hadoop tiny job, each JVM spends about 21% to 49% of its time in classloading. As the lifetime of the JVM becomes longer, the impact of classloading overhead falls to 5% to 21% for the large workload. For Spark tiny and small jobs, each JVM spends about 14% to 19% of its time in classloading, while the proportion of classloading time falls to 6% to 8% for large jobs. It is seen that a significant proportion of the classloading time is spent in the define phase. This phase involves verification of bytecode and creation of HotSpot internal data structures for representing class metadata.

RQ2: What is the performance impact of utilizing OpenJDK Ap- plication Class Data Sharing (CDS) on low latency big data pro- cessing workloads?

OpenJDK Class Data Sharing allows class metadata to be saved to a shared memory-mapped file that can subsequently be used by JVMs on the node to access loaded classes. In this study, we propose two techniques for utilizing CDS for Hadoop and Spark workloads: BestFit CDS and Classpath CDS. These methods represent the trade-offs involved in CDS: BestFit CDS gives the lowest classloading time but limited sharing among applications, while Classpath CDS gives maximum sharing at the cost of large archive size and startup overhead. We see a reduction in job execution time of 13 sec to 18 sec for Hadoop tiny and small jobs and almost 60 sec for a large job when using BestFit CDS. For Spark tiny and small jobs, the benefits are lower, with about a 2-3 sec reduction in job execution time, and about 8 sec for the large job. As Spark executes multiple tasks in a single JVM, it has much lower classloading overhead compared to Hadoop, which executes each task in a new JVM. However, we see significant improvement in the per-JVM execution time, ranging from 5% to 49% for Hadoop and 5% to 19% for Spark.

4. Implementation and Analysis of Modular Class Data Sharing

4.1. Implementation

This section discusses the implementation of Modular Class Data Sharing in OpenJDK's HotSpot Java Virtual Machine. The enhancement made to OpenJDK to support Modular CDS involves the following three aspects of class loading:

1. Symbol Resolution: locating the CDS metadata for the symbol being resolved (which may be a class, a field or a method) in the correct CDS archive.

2. Linking: ensuring that linking information is updated to link to the correct class when the calling class and the loaded class are in different CDS archives.

3. Inheritance: ensuring that super class and interface information is linked correctly when the parent class or interfaces are loaded from different CDS archives.

Figure 4.1 shows an overview of the out-of-the-box classloading mechanism in OpenJDK HotSpot. Bytecodes such as _new, _invoke_static, _invoke_special, _get_field and _put_field involve the resolution of symbolic names and hence trigger classloading. If the class to be loaded is already present in the _resolved_references array, HotSpot proceeds with the resolution and linking of the field or method being invoked. If the class is not already loaded, HotSpot returns control to the Java application by calling the java.lang.ClassLoader.loadClass() method in ClassLoader.java.

When CDS is enabled, the loadClass() method calls HotSpot's JVM_FindOrLoadClass(), which loads the class from the CDS archive if present. If the class is not in the CDS archive, loadClass() falls back to the default class loading mechanism and searches for the class in the JAR files on the application classpath.

In the case of classloading from a CDS archive, JVM_FindOrLoadClass() invokes SystemDictionaryShared::load_shared_class(). SystemDictionaryShared is a hashmap from the class name to the InstanceKlass that holds the class metadata. This table, persisted to the CDS archive during the dump phase, contains all the classes dumped to the CDS archive. The load_shared_class() method first triggers the classloading of the super classes and interfaces inherited by the current class. Once the loading of all parent classes is complete, HotSpot invokes restore_unshareable_info() to patch fields in the CDS tables that point to runtime objects on the heap. This is followed by the initialization phase, where the itable and vtable of the class are initialized.
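To make the control flow concrete, the following self-contained model (plain Java, not HotSpot code; the ArchivedClass type and class names are invented for illustration) mimics the ordering just described: super types are loaded first, then the runtime-dependent fields of the archived metadata are patched, and finally the dispatch tables are initialized.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified model of the shared-class load sequence: parents first, then patch, then init.
public class SharedClassLoadModel {
    static final class ArchivedClass {
        final String name;
        final String superName;          // null for java.lang.Object
        final List<String> interfaces;
        boolean runtimeInfoRestored;
        boolean tablesInitialized;
        ArchivedClass(String name, String superName, List<String> interfaces) {
            this.name = name; this.superName = superName; this.interfaces = interfaces;
        }
    }

    static void loadSharedClass(ArchivedClass k, Map<String, ArchivedClass> archive, List<String> loadOrder) {
        if (loadOrder.contains(k.name)) return;                              // already loaded
        if (k.superName != null) loadSharedClass(archive.get(k.superName), archive, loadOrder);
        for (String i : k.interfaces) loadSharedClass(archive.get(i), archive, loadOrder);
        k.runtimeInfoRestored = true;                                        // models restore_unshareable_info()
        k.tablesInitialized = true;                                          // models itable / vtable setup
        loadOrder.add(k.name);
    }

    public static void main(String[] args) {
        Map<String, ArchivedClass> archive = new HashMap<>();
        archive.put("java.lang.Object", new ArchivedClass("java.lang.Object", null, List.of()));
        archive.put("Base", new ArchivedClass("Base", "java.lang.Object", List.of()));
        archive.put("Task", new ArchivedClass("Task", "Base", List.of()));
        List<String> order = new ArrayList<>();
        loadSharedClass(archive.get("Task"), archive, order);
        System.out.println(order);   // [java.lang.Object, Base, Task]
    }
}
```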

In the following sections, we discuss the design and implementation of Modular CDS, highlighting the trade-offs made in the design.

Three new VM arguments have been added to support Modular CDS. The -XX:+UseApCDS flag turns on Modular CDS. It must be noted that our implementation incurs only the negligible overhead of checking whether this flag is enabled in order to determine which code paths are executed for class loading. The -XX:NumSharedArchives VM argument has been added to specify the number of CDS archives, and -XX:SharedArchiveFile has been modified to accept a comma-separated list of CDS archives instead of a single CDS archive.
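A hypothetical launch command using these flags might look as follows; the archive names, the number of archives and the overall command shape are illustrative assumptions rather than a verbatim invocation from our experiments.

```java
// Hypothetical launch of a worker JVM with the Modular CDS flags introduced above
// (flag names are those described in this chapter; everything else is an assumption):
//
//   java -Xshare:on -XX:+UseApCDS \
//        -XX:NumSharedArchives=3 \
//        -XX:SharedArchiveFile=jdk-base.jsa,framework-common.jsa,app.jsa \
//        -cp app.jar ModularCdsLaunchExample
//
// The first archive in the list is also the one from which the well-known bootstrap
// classes (java.lang.Object, java.lang.ClassLoader, ...) are taken, as discussed below.
public class ModularCdsLaunchExample {
    public static void main(String[] args) {
        System.out.println("worker started");
    }
}
```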

During the initialization of the JVM, the MetaspaceShared class initializes the shared CDS archive by restoring the shared system dictionaries and loading and linking the well-known Java classes required for bootstrapping the execution of the application. The MetaspaceShared class has been refactored to separate dump-time CDS functionality from run-time functionality, so that we could modify the run-time data structures to support multiple CDS archives. This mainly involves converting static variables to static arrays in order to store book-keeping information for each CDS archive. Additionally, the FileMapInfo class that maps the CDS archive into the JVM's address space has been refactored to separate the run-time information. The class, originally implemented as a singleton, has been modified to allow the MetaspaceShared class to create an array of FileMapInfo instances, one for each CDS archive to be mapped. It must be noted that we currently do not support run-time relocation of CDS data. The user must ensure that the CDS archives are mapped to distinct address spaces that do not overlap.

Figure 4.1: Class Data Sharing Mechanism in OpenJDK

The SymbolTable and StringTable have been modified to maintain an array of _shared_table dictionaries, one for each CDS archive. In the case of default CDS, each JVM contains two SymbolTables: a) a _shared_table containing the symbols dumped into the archive at dump time, and b) _the_table, which stores any symbols not present in the archive that may be created at runtime. These dictionaries contain one entry for each unique symbol used in the application. However, with multiple CDS archives we have a _shared_table for each CDS archive, which has several implications for the implementation:

• It increases duplication, as several tables may have overlapping entries.

• The look-up functionality is modified to support two types of look-up: look-up by [name, cds_index], which looks up the name in _shared_table[cds_index], and look-up by [name], which looks up each _shared_table in turn and returns the first occurrence of the name (see the sketch below).

• Some equality checks based on symbol pointers may fail, since the dump-time Symbol* may be different from the run-time Symbol* even for symbols with the same name. Hence, such comparisons have to be modified to perform string comparison instead of pointer comparison, which makes the comparison slower.
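The following self-contained Java sketch (a model, not the HotSpot implementation; the table layout and the address values are invented) illustrates the two look-up modes and the first-occurrence rule for duplicate symbols.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified model of the two symbol look-ups: a targeted look-up in one archive's
// _shared_table, and an untargeted look-up that scans the per-archive tables in order.
public class MultiArchiveSymbolTable {
    private final List<Map<String, Long>> sharedTables = new ArrayList<>();  // one per CDS archive
    private final Map<String, Long> runtimeTable = new HashMap<>();          // symbols created at run time

    MultiArchiveSymbolTable(int numArchives) {
        for (int i = 0; i < numArchives; i++) {
            sharedTables.add(new HashMap<>());
        }
    }

    void addSharedSymbol(int cdsIndex, String name, long address) {
        sharedTables.get(cdsIndex).put(name, address);
    }

    // Look-up by [name, cds_index]: consult exactly one archive's shared table.
    Long lookup(String name, int cdsIndex) {
        return sharedTables.get(cdsIndex).get(name);
    }

    // Look-up by [name]: scan each shared table in turn, then the runtime table.
    Long lookup(String name) {
        for (Map<String, Long> table : sharedTables) {
            Long symbol = table.get(name);
            if (symbol != null) {
                return symbol;             // first occurrence wins; duplicates are tolerated
            }
        }
        return runtimeTable.get(name);
    }

    public static void main(String[] args) {
        MultiArchiveSymbolTable symbols = new MultiArchiveSymbolTable(2);
        symbols.addSharedSymbol(0, "sayHello", 0x80005432L);
        symbols.addSharedSymbol(1, "sayHello", 0x90005432L);   // duplicate entry in a second archive
        System.out.println(Long.toHexString(symbols.lookup("sayHello")));     // 80005432 (first archive)
        System.out.println(Long.toHexString(symbols.lookup("sayHello", 1)));  // 90005432
    }
}
```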

HotSpot initializes certain well-known Java classes during the startup phase, even before the application's main() method is executed. This includes classes such as java.lang.Object and java.lang.ClassLoader. Currently, these well-known classes are by default dumped to all CDS archives, even if they are not specified in the list of classes to be dumped. Hence, we load these classes from the first archive specified in the -XX:SharedArchiveFile VM argument.

In the current implementation of Modular CDS, we do not make any modifications to the dumping process and to the structure of the CDS archive. However, it may be useful to dump the well known classes to a separate CDS archive and modify the dump phase to avoid storing these classes in the application CDS archives. This can minimize duplication of class metadata across CDS archives.

Classloading is initiated lazily the first time a class is referenced. This could be triggered by the creation of a new instance of the class, or by accessing a static field or invoking a static method of the class. The calling class looks up the _resolved_references array to check if the referenced class is already loaded. If loaded, it simply returns a reference to the class and proceeds to resolving the field or method name. If not already loaded, it calls SystemDictionary::resolve_or_fail(), which then calls HotSpot's JVM_FindLoadedClass() method to load the class from the CDS archive if present.

We have enhanced JVM_FindLoadedClass() to support the loading of classes in the presence of multiple CDS archives. In the case of a single CDS archive, the JVM contains a SystemDictionaryShared table containing an entry for every class present in the archive. In order to support multiple archives, we have implemented an array of SystemDictionaryShared tables, each containing the classes found in its archive. There are two main considerations when resolving classes in the presence of multiple CDS archives:

1. It is possible that a class is dumped in multiple CDS archives. If the user has full control to curate the classlists at dump time, the user can ensure that there are no overlapping entries in the classlists. For instance, in the case of Hadoop, the user can create separate archives for JDK classes and common libraries such as networking and logging, while ensuring that a class is always dumped to a single archive. While this approach is useful in the context of frameworks like Hadoop and Spark, it is not generic, as it creates dependencies between archives and complicates adding and removing JARs/archives. Hence, Modular CDS has been designed to handle duplicate classes in archives.

2. The ConstantPool* of a class contains an array of Symbol* pointing to the classes, fields and methods referenced by this class. In Modular CDS, it is possible that the referenced class is not present in the current archive but is instead dumped in another archive. Hence, this Symbol* pointer is no longer valid and has to be modified at runtime to point to the Symbol* in the CDS archive that actually defines the class.

In the following section, we discuss the design decisions made to deal with the above scenarios, analyze the pros and cons of this design and investigate alternative design options.

During the symbolic resolution of a class name, we have to identify the CDS archive containing the class to be loaded. We implement a method SystemDictionary::find_shared_class_for_resolve(), which loops through each SystemDictionaryShared[cds_index] and returns the class from the first dictionary in which it is found. This class is then loaded by the mechanism described in Figure 4.1. Once the current class, all its parent classes and its interfaces are loaded, the LinkInfo of the calling class has to be updated to point to the newly loaded class. The Symbol* in the ConstantPool entry of the calling class is updated if the class is loaded from a different CDS archive.
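A minimal model of this first-match resolution over several archives is sketched below (plain Java, not HotSpot code; the SharedClass type and the class names are invented for illustration). The subsequent ConstantPool and LinkInfo patching is not modelled here.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified model of find_shared_class_for_resolve(): the per-archive shared dictionaries
// are probed in order and the first archive that defines the class wins.
public class MultiArchiveResolver {
    static final class SharedClass {
        final String name;
        final int cdsIndex;              // which archive actually defines this class
        SharedClass(String name, int cdsIndex) { this.name = name; this.cdsIndex = cdsIndex; }
    }

    private final List<Map<String, SharedClass>> sharedDictionaries;  // one dictionary per CDS archive

    MultiArchiveResolver(List<Map<String, SharedClass>> sharedDictionaries) {
        this.sharedDictionaries = sharedDictionaries;
    }

    SharedClass findSharedClassForResolve(String className) {
        for (Map<String, SharedClass> dictionary : sharedDictionaries) {
            SharedClass k = dictionary.get(className);
            if (k != null) return k;     // first archive containing the class wins
        }
        return null;                     // not archived: fall back to the normal classpath search
    }

    public static void main(String[] args) {
        Map<String, SharedClass> cds1 = new HashMap<>();
        cds1.put("Hello", new SharedClass("Hello", 0));
        Map<String, SharedClass> cds2 = new HashMap<>();
        cds2.put("World", new SharedClass("World", 1));
        MultiArchiveResolver resolver = new MultiArchiveResolver(List.of(cds1, cds2));
        System.out.println(resolver.findSharedClassForResolve("Hello").cdsIndex); // 0
        System.out.println(resolver.findSharedClassForResolve("World").cdsIndex); // 1
    }
}
```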

For instance, consider a class Hello.java dumped in archive CDS1.jsa (mapped to the address range 0x800000000 - 0x810000000) and a class World.java dumped in archive CDS2.jsa (mapped to the address range 0x900000000 - 0x910000000). Suppose the loading of class Hello is triggered for the first time when class World invokes the Hello.sayHello() method. Figure 4.2 shows how linking occurs between classes in different archives.

[Figure 4.2: the InstanceKlass of World in CDS2.jsa and of Hello in CDS1.jsa; World's ConstantPool, ResolvedReferences and LinkInfo entries for Hello and sayHello() are patched to reference the metadata in CDS1.jsa.]

Figure 4.2: Symbol Resolution in Modular CDS

As shown in Figure 4.2, the constant pool, resolved references and LinkInfo of World are updated to point into CDS1.jsa.

Once the symbol to be linked is resolved, HotSpot restores fields in the CDS archive that depend on run-time information. This includes a) creating a back-pointer from the InstanceKlass to the java.lang.Class object created for the class on the Java heap, and b) setting up the interpreter and compiler entry points for shared methods.

HotSpot does not directly update the method entry points (the _i2i_entry, _from_interpreted_entry and _from_compiled_entry pointers) in the Method table. Instead, an extra level of indirection is introduced such that these pointers point to a fixed location in the miscellaneous code (MC) section of the CDS archive. At run time, an unconditional jump instruction is injected at this location to jump to the correct entry point for the method. This avoids updating pointers in the Method table, so that these pages need not be copied-on-write into the JVMs sharing the archive while the method is still being interpreted. For methods that are later compiled, the Method table is updated to point to the compiled entry points. In order to support multiple CDS archives, we ensure that the dynamically generated entry points are injected into the correct archive, that is, the one the method belongs to.
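The following toy model (plain Java, not HotSpot code or generated assembly) captures the indirection: the shared method record only stores a slot index into its own archive's MC region, so run-time patching writes to the MC slot rather than to the shared Method page.

```java
// Simplified model of the entry-point indirection: Method metadata stays read-only and shared,
// while the per-archive "MC" slot it points to is patched with the current entry point.
public class EntryPointIndirection {
    static final class McRegion {                 // one MC region per mapped CDS archive
        final Runnable[] slots;
        McRegion(int size) { slots = new Runnable[size]; }
    }

    static final class SharedMethod {             // read-only shared metadata
        final int archiveIndex;                   // which archive's MC region holds the stub
        final int mcSlot;
        SharedMethod(int archiveIndex, int mcSlot) { this.archiveIndex = archiveIndex; this.mcSlot = mcSlot; }
    }

    public static void main(String[] args) {
        McRegion[] mcRegions = { new McRegion(4), new McRegion(4) };   // two mapped archives
        SharedMethod sayHello = new SharedMethod(0, 2);                // dumped into archive 0

        // Run-time patching writes only into the MC region of the method's own archive.
        mcRegions[sayHello.archiveIndex].slots[sayHello.mcSlot] =
                () -> System.out.println("interpreter entry for sayHello()");

        // Invocation goes Method -> fixed MC slot -> current entry point.
        mcRegions[sayHello.archiveIndex].slots[sayHello.mcSlot].run();
    }
}
```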

The inheritance hierarchy of Java classes is maintained in the klassVtable for the parent class and the klassItable for interfaces. These tables are dumped along with the other class metadata in the CDS archive. By default, CDS dumps metadata for all classes in the class hierarchy to the CDS archive. Hence, classes in different CDS archives that extend the same class or implement the same interface have to be updated at run time to point to a single version of the parent class or interface.

The parent class or interface is loaded from the first CDS archive in which it is present. Classes in other CDS archives that extend this class or implement this interface are updated to link to this version of the parent class. This not only involves updating the ConstantPool, _resolved_references and LinkInfo tables as shown in Figure 4.2, but also updating the klassVtable and klassItable to point to the correct methods for inherited parent class methods and interface methods.

The klassVtable contains pointers to its parent class and to the vtableEntry. The vtableEntry contains an array of Method* pointers pointing to the method tables of overridden methods and of the protected and public methods of the parent classes. Furthermore, the Method* stores the _vtable_index indicating the index of the method in the vtableEntry array. When using Modular CDS, the parent class may be loaded from a different archive than that of the child class.

[Figure 4.3: the child's dump-time vtableEntries (methods A, B, C at positions 1, 2, 3) versus the parent class loaded from a different archive in which A, B and C carry _vtable_index 3, 2 and 1.]

Figure 4.3: Update klassVtable at runtime to point to Method tables from the correct CDS archive.

[Figure 4.4: the klassItable of a class in CDS2.jsa, with its _itableOffsetEntry and _itableMethodEntry pointers re-linked to an interface loaded from CDS1.jsa.]

Figure 4.4: Update klassItable at runtime to point to an Interface loaded from a different CDS archive.

In such a scenario, we find that the _vtable_index in the Method* of the parent class may not match the actual position of the Method* in the vtableEntry of the child class. For example, consider the scenario in Figure 4.3. At dump time, the vtableEntry array of the child class contains Method* A, B and C at array indices 1, 2 and 3 respectively. Correspondingly, the Method tables of methods A, B and C have _vtable_index 1, 2 and 3 respectively. However, at runtime, the parent class may be loaded from a different archive in which the methods A, B and C have _vtable_index 3, 2 and 1, due to the order in which the methods were dumped during the creation of that archive.

Hence, the order in which the methods of the same parent class are dumped may vary between executions of the dump process. At runtime, the vtable entries of the child class must therefore be swapped so that the position of each entry in the vtable array matches the _vtable_index value stored in its Method*. However, in our experiments we find that this scenario occurs rarely, and hence we expect the one-time swapping overhead to be low.
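The required fix-up amounts to a one-time permutation of the child's vtable entries; a minimal sketch is shown below (plain Java, not HotSpot code), assuming the recorded _vtable_index values form a valid 1-based permutation.

```java
import java.util.Arrays;

// Simplified model of the vtable fix-up: reorder the child's dumped entries so that each
// method ends up at the slot given by the _vtable_index recorded in the parent's archive.
public class VtableFixup {
    static final class MethodRef {
        final String name;
        final int vtableIndex;           // 1-based index recorded in the method's own archive
        MethodRef(String name, int vtableIndex) { this.name = name; this.vtableIndex = vtableIndex; }
        @Override public String toString() { return name + "(idx=" + vtableIndex + ")"; }
    }

    // Reorder entries in place so that entries[i] has vtableIndex == i + 1.
    static void fixUp(MethodRef[] entries) {
        for (int i = 0; i < entries.length; i++) {
            while (entries[i].vtableIndex != i + 1) {
                int target = entries[i].vtableIndex - 1;
                MethodRef tmp = entries[target];
                entries[target] = entries[i];
                entries[i] = tmp;
            }
        }
    }

    public static void main(String[] args) {
        // Dump-time order in the child archive was A, B, C, but the parent archive mapped at
        // run time recorded _vtable_index 3, 2, 1 for A, B and C.
        MethodRef[] entries = { new MethodRef("A", 3), new MethodRef("B", 2), new MethodRef("C", 1) };
        fixUp(entries);
        System.out.println(Arrays.toString(entries));   // [C(idx=1), B(idx=2), A(idx=3)]
    }
}
```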

Similar to the runtime patching of the klassVtable, the klassItable is patched to point to the correct interface when the interface is loaded from a different archive. The klassItable of the child class has a pointer to the _itableOffsetEntry, which contains an array of pointers to the interfaces implemented by the class. In addition, the klassItable has a pointer to the _itableMethodEntry, which contains an array of pointers to the interface methods. Since the interface methods are implemented by the class itself, these Method* pointers do not have to be changed even if the interface is loaded from a different archive.

However, since Java 8, interfaces may provide default methods, whereby the interface itself supplies the implementation of a method. Hence, if the interface is loaded from a different CDS archive at runtime, the itableMethodEntry entries for the default methods are updated to point to the Method* from the correct CDS archive. Figure 4.4 illustrates how the original dump-time pointers are modified to point to the interface loaded from a different CDS archive.

4.2. Limitations of the Modular Class Data Sharing Implementation

It must be noted that the current implementation of Modular Class Data Sharing is incomplete and works only for a subset of Java programs. One of the main limitations of the implementation is the inability to support array classes. Array classes make use of the fast subtype check implemented in assembly code. The fast subtype check obtains a pointer to the super class of the current class simply by adding an offset to the address of the class. However, in Modular CDS the super class may be loaded from a different archive and is hence present in an address region different from that of the child class. The fast subtype check therefore fails, leading to the failure of array class instantiation. Thus, Modular CDS currently only supports the loading of InstanceKlasses but not ArrayKlasses. Due to this limitation, we are unable to test our implementation of Modular CDS with the Hadoop and Spark frameworks. Instead, we provide an analysis of the performance of Modular CDS with synthetic workloads that only load classes without arrays.

Furthermore, the current implementation of Modular CDS is only able to support up to 8 CDS archives. Beyond this, we encounter issues when initializing the archives, where some shared tables of the older archives get overwritten by the new archives being initialized.

4.3. Performance analysis of Modular Class Data Sharing

In this section, we analyze the performance of our implementation of Modular Class Data Sharing and discuss the pros and cons of the approach.

To test the baseline performance of Modular CDS, we create 8 JAR files, each containing a single class with a single method that prints "HelloWorld". A URLClassLoader is implemented that loads the class of each JAR file from its CDS archive and calls its HelloWorld() method. The performance of Modular CDS is tested with two to eight archives and the results are shown in Table 4.1. We see that the classloading time remains fairly constant even as the number of CDS archives increases. However, the startup time increases by about 20-40 ms for each additional archive. From Figure 4.5, we see that the increase in job execution time corresponds to the increase in startup time. The startup time is expected to increase for each additional archive due to the overhead of mapping each archive and initializing the Modular CDS data structures that store archive metadata. However, these overheads are constant for a given archive size.
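A driver along these lines is sketched below; the JAR paths, the generated class names (HelloN) and the timing output are assumptions that loosely follow the description above rather than the actual benchmark code.

```java
import java.io.File;
import java.lang.reflect.Method;
import java.net.URL;
import java.net.URLClassLoader;

// Sketch of the synthetic workload driver: load one generated class from each of the eight
// JARs through a URLClassLoader and invoke its HelloWorld() method (names are assumptions).
public class ModularCdsBenchmarkDriver {
    public static void main(String[] args) throws Exception {
        for (int i = 1; i <= 8; i++) {
            URL jarUrl = new File("jars/hello" + i + ".jar").toURI().toURL();
            try (URLClassLoader loader = new URLClassLoader(new URL[] { jarUrl },
                    ModularCdsBenchmarkDriver.class.getClassLoader())) {
                long start = System.nanoTime();
                Class<?> clazz = loader.loadClass("Hello" + i);   // metadata comes from the i-th archive if shared
                Method helloWorld = clazz.getMethod("HelloWorld");
                helloWorld.invoke(clazz.getDeclaredConstructor().newInstance());
                System.out.printf("Hello%d loaded and invoked in %d us%n",
                        i, (System.nanoTime() - start) / 1_000);
            }
        }
    }
}
```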

Next, the performance of Modular CDS is tested for archives containing 10 to 1000 classes per archive and the results are shown in Table 4.2.

Number of       Class Loading Time   Job Execution Time   Startup Time
CDS Archives    (sec)                (sec)                (sec)
2               0.03 ± 0.003         0.54 ± 0.048         0.42 ± 0.040
3               0.03 ± 0.003         0.58 ± 0.036         0.46 ± 0.042
4               0.03 ± 0.004         0.60 ± 0.046         0.48 ± 0.044
5               0.04 ± 0.003         0.65 ± 0.045         0.55 ± 0.055
6               0.04 ± 0.005         0.69 ± 0.040         0.57 ± 0.037
7               0.04 ± 0.002         0.75 ± 0.050         0.64 ± 0.033
8               0.04 ± 0.004         0.77 ± 0.058         0.68 ± 0.080

Table 4.1: Classloading time is not affected by the increase in number of archives although startup time increases for each additional archive.

[Figure 4.5: startup time and job execution time (sec) versus the number of CDS archives (2 to 8).]

Figure 4.5: The job execution time increases with an increase in the number of archives. This increase is largely due to the increase in startup time.

We compare the performance of Modular CDS with the BestFit and Classpath CDS approaches.

Modular CDS
Number of CDS Archives   Classes per CDS archive   Classloading Time (sec)   Job Execution Time (sec)   Startup Time (sec)
8                        10                        0.05 ± 0.01               0.81 ± 0.05                0.67 ± 0.04
8                        100                       0.15 ± 0.01               1.03 ± 0.06                0.69 ± 0.03
8                        1000                      1.09 ± 0.08               3.18 ± 0.18                0.74 ± 0.04

BestFit CDS
1                        80                        0.06 ± 0.01               0.45 ± 0.01                0.35 ± 0.02
1                        800                       0.15 ± 0.01               0.62 ± 0.02                0.35 ± 0.02
1                        8000                      1.41 ± 0.11               2.82 ± 0.16                0.04 ± 0.02

Classpath CDS
1                        123610 + 80               0.06 ± 0.01               2.29 ± 0.09                3.96 ± 0.01
1                        123610 + 800              0.18 ± 0.01               2.41 ± 0.01                3.78 ± 0.01
1                        123610 + 8000             1.49 ± 0.11               4.62 ± 0.29                3.79 ± 0.11

Table 4.2: Comparison of Modular CDS with BestFit and Classpath CDS approaches.

In BestFit CDS, we dump only those classes that are loaded by the application into a single CDS archive. In Classpath CDS, we dump all classes present on the Spark classpath along with the application classes into a single archive. These approaches reflect the trade-off involved in CDS: the BestFit archive gives the lowest classloading time but limited sharing among applications, while Classpath CDS gives maximum sharing at the cost of large archive size and startup overhead. Modular CDS combines the advantages of both approaches by providing maximum sharing with a classloading time lying between the BestFit and Classpath approaches.

Table 4.3 shows that the use of Modular CDS increases the job execution time of the application relative to BestFit CDS. We see that this increase directly corresponds to the increase in startup time. As the size of the application increases from 800 classes to 8000 classes, the startup overhead is amortized over the lifetime of the application and hence has a much lower impact on its total execution time.

Modular CDS archive         BestFit CDS archive        Number of        % Increase in   % Increase in
dimensions                  dimensions                 classes loaded   Startup Time    Job Execution Time
8 archives * 10 classes     1 archive * 80 classes     80               47.76 %         44.44 %
8 archives * 100 classes    1 archive * 800 classes    800              49.27 %         39.80 %
8 archives * 1000 classes   1 archive * 8000 classes   8000             45.94 %         11.32 %

Table 4.3: Increase in Startup time and Job Execution time from BestFit CDS to Modular CDS

Table 4.4 shows the improvement in the performance of Modular CDS in comparison with Classpath CDS. The Classpath CDS archive contains over 120,000 classes, while the application loads only about 80 to 8000 classes. In Classpath CDS, we pay the overhead of loading the large Classpath CDS archive in order to achieve full sharing of classes between all applications on the node. The main advantage of Modular CDS is that it provides the maximum sharing achieved by Classpath CDS at a much lower overhead. From Table 4.4, we see that when 80 to 800 classes are loaded, the improvement in startup time and job execution time is over 70%. As the size of the application increases to 8000 classes, the 80% improvement in startup time is amortized over the lifetime of the job and hence leads to about an 18% improvement in job execution time.

Analysis of the Hadoop and Spark applications in the HiBench benchmark shows that each task JVM loads about 5000 to 10000 classes and that the number of duplicate classes per JVM is of the order of a few thousand. Hence, the results from the above experiments show that Modular CDS could potentially benefit Hadoop and Spark workloads by achieving faster job execution times compared to Classpath CDS while at the same time providing maximum sharing between jobs.

Modular CDS archive         Classpath CDS archive         Number of        % Improvement in   % Improvement in
dimensions                  dimensions                    classes loaded   Startup Time       Job Execution Time
8 archives * 10 classes     1 archive * 123687 classes    80               70.75 %            79.56 %
8 archives * 100 classes    1 archive * 124317 classes    800              71.39 %            72.75 %
8 archives * 1000 classes   1 archive * 130617 classes    8000             83.94 %            18.58 %

Table 4.4: Decrease in Startup time and Job Execution time from Classpath CDS to Modular CDS

4.4. Summary

Based on the analysis in this chapter, we are able to answer RQ3 on the benefits of Modular CDS.

RQ3: How can Class Data sharing be enhanced to provide greater scalability and modularity in sharing? What are the perfor- mance benefits of Modular Class Data Sharing in comparison with out-of-the-box Class Data Sharing approaches?

In this study, we show that the inability of the current OpenJDK implementation of CDS to use multiple CDS archives severely limits the scalability and modularity of the approach. Hence, we propose Modular CDS, which combines the advantages of the BestFit and Classpath CDS approaches by providing maximum sharing with a classloading time lying between the BestFit and Classpath approaches. It must be noted, however, that the current implementation of Modular CDS is limited, as it does not support the fast subtype check at the assembly level and hence cannot support the loading of array classes. Hence, we are unable to test Modular CDS with Hadoop and Spark but instead provide a preliminary performance analysis with synthetic workloads.

Comparing Modular CDS with 8 archives and 1000 classes per archive against a single BestFit archive containing the same 8000 classes, we see that Modular CDS performs 11% worse than the BestFit approach due to the increased cost of initialization and of searching for classes across multiple archives. On the other hand, Modular CDS performs 18% better than the Classpath CDS archive (containing all classes on the Spark classpath along with the application classes that are to be loaded).

5. Conclusion and Future Work

In this chapter, we summarize our major findings and contributions as answers to the research questions posed in Chapter 1. Finally, we discuss how our research can be further extended to overcome existing limitations and enhance Modular Class Data Sharing.

5.1. Conclusion

RQ1: What is the impact of Classloading overhead on small low latency big data processing workloads?

In this study, we measure the classloading overhead of small data processing workloads with input sizes ranging from 32 KB to 3.2 GB per node using the HiBench Wordcount benchmark. We find that in a Hadoop tiny job, each JVM spends about 21% to 49% of its time in classloading. As the lifetime of the JVM becomes longer, the impact of classloading overhead falls to 5% to 21% for the large workload. For Spark tiny and small jobs, each JVM spends about 14% to 19% of its time in classloading, while the proportion of classloading time falls to 6% to 8% for large jobs. It is seen that a significant proportion of the classloading time is spent in the define phase. This phase involves verification of bytecode and creation of HotSpot internal data structures for representing class metadata.

RQ2: What is the performance impact of utilizing OpenJDK Ap- plication Class Data Sharing (CDS) on low latency big data pro- cessing workloads?


OpenJDK Class Data Sharing allows class metadata to be saved to a shared memory-mapped file that can subsequently be used by JVMs on the node to access loaded classes. In this study, we propose two techniques for utilizing CDS for Hadoop and Spark workloads: BestFit CDS and Classpath CDS. These methods represent the trade-offs involved in CDS: BestFit CDS gives the lowest classloading time but limited sharing among applications, while Classpath CDS gives maximum sharing at the cost of large archive size and startup overhead. We see a reduction in job execution time of 13 sec to 18 sec for Hadoop tiny and small jobs and almost 60 sec for a large job when using BestFit CDS. For Spark tiny and small jobs, the benefits are lower, with about a 2-3 sec reduction in job execution time, and about 8 sec for the large job. As Spark executes multiple tasks in a single JVM, it has much lower classloading overhead compared to Hadoop, which executes each task in a new JVM. However, we see significant improvement in the per-JVM execution time, ranging from 30% to 50% for Hadoop and 5% to 40% for Spark.

Further, we see that while Classpath CDS performs much better than No-CDS, it has higher job execution time in comparison with BestFit CDS due to large archive size.

RQ3: How can Class Data sharing be enhanced to provide greater scalability and modularity in sharing? What are the perfor- mance benefits of Modular Class Data Sharing in comparison with out-of-the-box Class Data Sharing approaches?

In this study, we show that the inability of the current OpenJDK implementation of CDS to allow the use of multiple CDS archives severely limits the scalability and modularity of the approach. Hence, we propose Modular CDS which combines the advantages of BestFit and Classpath CDS approaches by providing maximum sharing with classloading time lying between BestFit and Classpath approaches. It must be noted however, that the current implementation of Mod- ular CDS is limited as it does not support fast subtype check at the assembly level and hence cannot support loading of array classes. Hence, we are unable to test Modular CDS with Hadoop and Spark but instead provide a preliminary performance analysis with syn- thetic workloads.

Comparing Modular CDS with 8 archives and 1000 classes per archive with a single BestFit archive containing the 8000 classes, we see that Modular CDS performs 11% worse that BestFit approach due to the increased cost of initialization and searching for classes across multiple archives. On the other hand, Modualar CDS per- forms 18% better than Classpath CDS archive (containing all classes 5.2. Future Work 55 in the Spark classpath along with the application classes that are to be loaded).

5.2. Future Work

Although the preliminary results for Modular CDS show some benefits, the approach can be improved and extended in several ways.

1. Currently, Modular CDS does not support the loading of array classes, which severely limits the scope of the approach. Support for array classes involves modifying the fast subtype check in HotSpot's MacroAssembler to enable subtype checks for super classes loaded from a different CDS archive.

2. The current implementation of Modular CDS supports classloading from up to 8 CDS archives. When more than 8 archives are used, we encounter initialization errors where existing tables get overwritten by new ones. Support for a greater number of archives will allow us to evaluate the ability of Modular CDS to support one CDS archive per JAR file. With this feature, every application could load a JAR file directly from its CDS archive, thus facilitating full sharing of classes among all applications on a node.

3. Along with Class Data Sharing, OpenJDK 9 introduced Ahead-of-Time (AOT) compilation of Java classes. The pre-compiled .so files can be used directly at runtime, thus avoiding the overhead of interpreting Java bytecode. It would be useful to combine Modular CDS with per-JAR AOT to further improve the execution speed of Java applications.
