DON'T GET CAUGHT IN THE COLD, WARM-UP YOUR JVM:
UNDERSTAND AND ELIMINATE JVM WARM-UP OVERHEAD IN DATA-PARALLEL SYSTEMS

by

David Lion

A thesis submitted in conformity with the requirements
for the degree of Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto

© Copyright 2017 by David Lion

Abstract

Don't Get Caught in the Cold, Warm-up Your JVM:
Understand and Eliminate JVM Warm-up Overhead in Data-Parallel Systems

David Lion
Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto
2017

Many widely used, latency-sensitive, data-parallel distributed systems, such as HDFS, Hive, and Spark, choose to use the Java Virtual Machine (JVM), despite ongoing debate about the overhead of doing so. This thesis analyzes the extent and causes of the JVM's performance overhead in these systems. Surprisingly, we find that warm-up overhead, i.e., time spent in class loading and interpreting bytecode, is frequently the bottleneck, taking 33% of execution time for a 1GB HDFS read and an average of 21 seconds for Spark queries. The findings on JVM warm-up overhead reveal a contradiction between the principle of parallelization, i.e., speeding up long-running jobs by parallelizing them into short tasks, and the amortization of JVM warm-up overhead, which requires long tasks. We solve this problem by designing HotTub, a modified JVM that amortizes the warm-up overhead over the lifetime of a cluster node instead of over a single job.

Acknowledgements

First, I would like to thank my supervisor, Professor Ding Yuan, for his constant support and guidance. His unrivalled dedication to his students is largely responsible for any of my success or personal growth over the last few years. I would also like to thank him for accepting me as a graduate student and taking a chance on me when I did not deserve it.

Second, I would like to thank the many people involved in the OSDI paper this thesis is based on. I thank the paper's co-authors, Adrian Chiu, Hailong Sun, Xin Zhuang, and Nikola Grcevski, for their hard work, and especially Adrian Chiu, whose incredible dedication was paramount to the success of this work. I thank Yu Luo, Serhei Makarov, Michael Stumm, Jenny Ren, Kirk Rodrigues, Guangji Xu, Yongle Zhang, and Xu Zhao for many useful and thought-stimulating discussions, and especially Yu Luo for setting up and maintaining the server cluster environment used in our experiments. I also thank the anonymous reviewers of the paper and its shepherd, Professor Andrea Arpaci-Dusseau, for their insightful feedback. Working with all of these people has been a pleasure, and I owe much to them.

I would also like to thank my thesis committee members, Professor Michael Stumm, Professor Tarek Abdelrahman, and Professor Ben Liang, for their feedback and insights.

Lastly, I would like to thank my friends and family for always supporting and encouraging me in my pursuits.

Contents

1 Introduction
2 Background
  2.1 Java Virtual Machine
    2.1.1 Class Loading
    2.1.2 Interpretation
  2.2 Systems
    2.2.1 Hadoop Distributed File System (HDFS)
    2.2.2 MapReduce and YARN
    2.2.3 Hive and Tez
    2.2.4 Spark
  2.3 Blocked Time Analysis
3 Measure Warm-up Overhead
4 Analysis of Warm-up Overhead
  4.1 Methodology
  4.2 HDFS
  4.3 Spark versus Hive
  4.4 Summary of Findings
  4.5 Industry Practices
5 Design of HotTub
  5.1 Maintain Consistency
  5.2 Limitations
6 Implementation of HotTub
7 Performance of HotTub
  7.1 Speed-up
  7.2 Sensitivity to Difference in Workload
  7.3 Management Overhead
8 Related Work
9 Concluding Remarks
  9.1 Future Work
Bibliography

List of Tables

4.1 Breakdown of class loading time.
7.1 Performance improvements, comparing the average job completion time of an unmodified JVM and HotTub. For Spark and Hive we report the average times of the queries with the best, median, and worst speed-up for each data size.
7.2 Comparison of cache, TLB, and branch misses between HotTub (H) and an unmodified JVM (U) when running query 11 of BigBench on Spark with a 100GB data size. The numbers are averages of five runs. All page faults are minor faults. "Rate diff." is calculated as (U rate − H rate) / (U rate), which shows HotTub's improvement in the miss rate. Perf cannot report the number of L1-icache loads or memory references, so the corresponding rates cannot be computed.
7.3 Speed-up sensitivity to workload differences, using the five fastest BigBench queries on Spark with 100GB data.

List of Figures

3.1 Intercepting mode-changing returns.
4.1 JVM warm-up time in various HDFS workloads. "cl" and "int" represent class loading and interpretation time, respectively. The x-axis shows the input file size.
4.2 The JVM warm-up overhead in HDFS workloads, measured as the percentage of overall job completion time.
4.3 Breakdown of a sequential HDFS read of a 1GB file.
4.4 Breakdown of the processing of data packets by client and datanode.
4.5 JVM overhead on BigBench: overhead breakdown of BigBench queries across different scale factors. The queries are first grouped by scale factor and then ordered by runtime. Note that Hive has longer query times than Spark.
4.6 Breakdown of Spark's execution of query 13. Only one executor is shown (there are ten executors in total, one per host). Each horizontal row represents a thread. The executor uses multiple threads to process this query; each thread processes three tasks from three different stages.
5.1 Architecture of HotTub.
5.2 HotTub's client algorithm.
7.1 HotTub successfully eliminates warm-up overhead. Unmodified query runtime is shown against the breakdown of a query with reuse, for 10 queries run at 2 scale factors of BigBench on Spark. Interpretation and class loading overhead are so low that they are unnoticeable, making up the difference.
7.2 HotTub's iterative runtime improvement: a sequential 1MB HDFS read performed repeatedly by HotTub. Iteration 0 is the application runtime of a new JVM, while iteration N is the Nth reuse of this JVM.

Chapter 1

Introduction

In recent years it has become increasingly common to use distributed systems to process large amounts of data. A distributed system is a software application that runs on a cluster of computers whose nodes communicate over a network. A major advantage of these systems is their ability to process large amounts of data in an economically feasible manner. They can use easily accessible and affordable commodity hardware, scaling their computational power by continuously adding nodes to the cluster. This is possible because the computation is parallelized across all nodes in the cluster, avoiding the need for powerful and expensive individual machines.

A large number of these data-parallel distributed systems are built on the Java Virtual Machine (JVM) [27].
These systems include distributed file systems such as HDFS [30], data analytics platforms such as Hadoop [29], Spark [67], Tez [65, 79], Hive [34, 80], and Impala [13, 39], and key-value stores such as HBase [31] and Cassandra [15]. A recent trend is to process latency-sensitive, interactive queries [40, 68, 78] with these systems. For example, interactive query processing is one of the focuses of Spark SQL [10, 67, 68], Hive on Tez [40], and Impala [39].

Numerous improvements have been made to the performance of these systems. These works have mostly focused on scheduling [2, 4, 33, 41, 59, 87], shuffling overhead [17, 19, 43, 48, 84], and removing redundant computations [64]. Performance characteristics studies [47, 49, 58, 60] and benchmarks [18, 24, 37, 83] have been used to guide the optimization efforts. Most recently, some studies have analyzed the performance implications of the JVM's garbage collection (GC) on big data systems [26, 51, 52, 62].

However, there is still no clear understanding of the JVM's overall performance implications, beyond GC, in latency-sensitive data analytics workloads. Consequently, almost every discussion of the JVM's performance implications results in heated debate [38, 44, 45, 46, 61, 72, 86]. For example, the developers of Hypertable, an in-memory key-value store, use C++ because they believe that the JVM is inherently slow. They also think that Java is acceptable for Hadoop because "the bulk of the work performed is I/O" [38]. In addition, many believe that as long as the system "scales", i.e., parallelizes long jobs into short ones, the overhead of the JVM is not a concern [72]. What is clear is that, given its dynamic nature, the JVM's overhead depends heavily on the characteristics of the application. For example, whether an interpreted method is compiled to machine instructions by the just-in-time (JIT) compiler depends on how frequently it has been invoked. With all these different perspectives, a clear understanding of the JVM's performance when running these systems is needed.

This research asks a simple question: what is the performance overhead introduced by the JVM in latency-sensitive data-parallel systems? We answer this by presenting a thorough analysis of the JVM's performance behavior when running systems including HDFS, Hive on Tez, and Spark. We drove our study using representative workloads from recent benchmarks.
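To make the JIT compilation behavior described above concrete, the following is a minimal sketch, not taken from the thesis, that uses HotSpot's standard -XX:+PrintCompilation flag to watch a method move from interpretation to compiled code once it has been invoked enough times. The class name HotLoop and the method square() are hypothetical names chosen for illustration.

// A minimal sketch illustrating invocation-count-driven JIT compilation.
// Not from the thesis; HotLoop and square() are hypothetical names.
// Compile and run with a HotSpot JVM:
//   javac HotLoop.java
//   java -XX:+PrintCompilation HotLoop
// PrintCompilation logs a line each time the JIT compiles a method, so
// square() appears in the output only after enough interpreted invocations.
public class HotLoop {

    // Starts out interpreted; becomes a JIT candidate once its
    // invocation count crosses HotSpot's compilation threshold.
    static long square(long x) {
        return x * x;
    }

    public static void main(String[] args) {
        long sum = 0;
        // Early iterations run in the interpreter (warm-up); later
        // iterations run the compiled version of square().
        for (long i = 0; i < 1_000_000; i++) {
            sum += square(i);
        }
        // Print the result so the loop is not optimized away entirely.
        System.out.println(sum);
    }
}

A short-lived task may exit before any of its hot methods cross that threshold, which is why parallelizing a long job into many short JVM processes can forfeit the benefit of JIT compilation, the tension at the heart of this thesis.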