Benchmarking an Amdahl-balanced Cluster for Data Intensive Computing

Author: Omkar Kulkarni

Supervisor: Dr. Adam Carter

August 19, 2011

MSc in High Performance Computing
The University of Edinburgh
Year of Presentation: 2011

Abstract

Technology has been advancing fast, but datasets have been growing even faster because it has become easier to generate and capture data. These large datasets, referred to as Big Data, are a storehouse of information waiting to be uncovered. The primary challenge in the analysis of Big Data is to overcome the I/O bottleneck present on most modern systems. Sluggish I/O systems defeat the very purpose of having high-end processors; they simply cannot supply data fast enough to utilize all of the available processing power. This results in wastage of power and increases the cost of operating large clusters.

This project evaluates the performance of an experimental cluster called EDIM1 which has been commissioned by EPCC specifically for data-intensive research projects. The cluster architecture is based on some of the recommendations from recent research in the area of data-intensive computing, particularly Amdahl-balanced blades. The idea is to use low-powered processors with high-throughput disks so that the performance gap between them is narrowed down to tolerable limits. The cluster relies on the notion of accessing multiple disks simultaneously to attain better aggregate throughput and also makes use of the latest entrant in the storage technology market: the low-powered, high-performance solid-state drive (SSD).

The project used numerous benchmarks to characterise individual components of a node and also the cluster as a whole. Special attention was paid to the performance of the secondary storage. A custom benchmark was written to measure the balance of the cluster, that is, the extent of the difference in performance between the processor and the I/O subsystem. The tests also included distributed computing benchmarks based on the popular MapReduce programming model using the Hadoop framework. The results of these tests have been very encouraging and show the practicality of this configuration. We demonstrate that not only is this configuration balanced, it is also well suited to meet the scalability requirements of data-intensive computing.

Contents

1 Introduction ...... 1

2 Background...... 4

2.1 The Big Data Challenge ...... 4

2.2 Design Principles for Data-intensive Computing ...... 6

2.2.1 Balanced Systems ...... 7

2.2.2 Scale-up vs. Scale-out ...... 8

2.3 Programming Abstractions ...... 8

2.3.1 MapReduce ...... 9

2.3.2 Hadoop ...... 9

2.4 Energy-efficient Computing...... 10

2.4.1 Solid State Drives ...... 10

2.4.2 Low Powered CPUs ...... 11

2.5 EDIM1: A New Machine for Data Intensive Research ...... 11

3 EDIM1 Benchmarks ...... 16

3.1 Single Node Benchmarks ...... 16

3.1.1 LINPACK ...... 16

3.1.2 FLOPS ...... 16

3.1.3 STREAM ...... 17

3.1.4 IOzone ...... 18

3.2 The Amdahl Synthetic Benchmark ...... 18

3.3 Distributed (MapReduce) Benchmarks ...... 22

3.3.1 TestDFSIO ...... 23

3.3.2 TeraSort ...... 23

4 Performance Analysis...... 25

4.1 Results of CPU and Memory Tests ...... 25


4.2 Results of I/O tests ...... 28

4.3 Results of the Amdahl Synthetic Benchmark ...... 32

4.4 Results of the Distributed (MapReduce) Benchmarks ...... 37

5 Conclusion ...... 40

6 Further Work ...... 42

Appendix A Results of Tests ...... 44

Appendix B Compiling and Running ...... 56

Appendix C Hadoop Configuration ...... 59

References ...... 61


List of Figures

Figure 2-1 Extracting meaningful information from large datasets ...... 5

Figure 2-2 EDIM1: High Level Cluster Architecture ...... 12

Figure 2-3 EDIM1: High level Node Architecture (Amdahl Blade) ...... 13

Figure 3-1 State Transition Diagram for the Amdahl Benchmark ...... 19

Figure 3-2 Pseudo-code for the I/O thread routine ...... 19

Figure 3-3 Pseudo-code for the compute thread routine ...... 20

Figure 3-4 Application and I/O views of the data buffer ...... 21

Figure 4-1 Results of the LINPACK and FLOPS benchmarks ...... 25

Figure 4-2 FLOPs expressed as a percentage of the CPU Clock Frequency ...... 27

Figure 4-3 IOZone: Sequential Read Throughput for Hard Disk Drive (HDD) ...... 29

Figure 4-4 IOzone: Sequential Read Throughput for Solid State Drive (SSD) ...... 29

Figure 4-5 Random and Sequential Read Throughput for a file of 8 GB ...... 30

Figure 4-6 Random and Sequential Write Throughput for a file of 8 GB ...... 30

Figure 4-7 SSD Write Patterns ...... 31

Figure 4-8 Aggregate read throughput for a combination of disks ...... 32

Figure 4-9 Amdahl balance achieved by combining multiple disks ...... 33

Figure 4-10 Variation in CPU utilization with computational intensity (1 HDD) ...... 34

Figure 4-11 Variation in CPU utilization with computational intensity (2 HDDs) ...... 35

Figure 4-12 Variation in CPU utilization with computational intensity (3 HDDs) ...... 35

Figure 4-13 Variation in CPU utilization using different data types ...... 36

Figure 4-14 Variation in aggregate throughput with the number of nodes ...... 37


Figure 4-15 Speed-up graph for the TeraSort benchmark ...... 38


List of Tables

Table 2-1 Values for individual disks ...... 15

Table 2-2 Calculated values for the system ...... 15

Table 3-1 Combination of floating point operations for the FLOPS kernels ...... 17

Table 4-1 Latency and Throughput for Floating Point Instructions [27] ...... 26

Table 4-2 STREAM Results with O4 optimization and double precision arithmetic ...... 28


Acknowledgements

I sincerely thank my supervisor Dr. Adam Carter for his guidance and support throughout the course of this project. His feedback during our weekly meetings was highly informative and invaluable to the success of this project.

I want to thank Adrian Jackson for guiding me while Adam was away for a brief period and Gareth Francis for his prompt responses to my e-mails in spite of his busy schedule.

I want to thank all my professors at EPCC who encouraged me to do my best. Last but not least, I want to thank my family back home, particularly my parents, for inspiring me with their love and unquestioning support during my stay in Edinburgh.

Chapter 1

Introduction

Big Data poses a serious challenge to existing cyber-infrastructures as datasets continue to grow exponentially, almost doubling every year [1]. The cost of storage technologies continues to fall while their sizes have increased several orders of magnitude in the last couple of decades. Organizations have a tendency to make utmost use of their available storage capacity causing scientific and enterprise datasets to run into several hundred terabytes and in extreme cases, several petabytes [2]. On the other hand, the performance of storage devices has neither kept pace with their growing sizes nor with the performance of CPUs, making them a significant constraint in the design of cluster architectures. In such a scenario, conventional solutions prove to be severely inadequate to manage and analyze such large volumes of data. New methodologies have evolved to tackle the challenge of performing computations over massive datasets, spinning off a whole new branch of computing, called data-intensive computing (DIC).

Novel data-intensive architectures have emerged which are optimized for data analysis rather than computational performance. The primary challenge for data-intensive systems is to stream data into the CPUs fast enough to maintain optimum CPU utilization, because idle CPUs result in sub-optimal performance and wastage of power. Since I/O throughput is the bottleneck, balanced systems require the performance of I/O subsystems to be matched evenly with that of processors. The rules for attaining balance in data-intensive architectures were laid down by Gene Amdahl [3], and reviewed again recently [4]. As far back as 1995, Microsoft researcher Jim Gray, one of the pioneers in the field of data-intensive computing, advocated the "cyber bricks" architecture [5] in which each node (brick) is a balanced system in itself, comprised of dedicated storage, processing and networking units.

Studies conducted on the GrayWulf system [1], built along these lines, show that Amdahl's laws for balanced systems are indeed relevant in quantifying the performance of data-intensive computing clusters. The GrayWulf system is extremely power hungry, with each node consuming 1150W of power. A subsequent study [6] leverages the latest in storage technology, solid state drives, in conjunction with low-powered Intel Atom CPUs to build an energy-efficient alternative to the GrayWulf system. EPCC has commissioned an experimental cluster, EDIM1, based on this model to evaluate its usefulness in data-intensive research projects. The purpose of this project is to quantify and understand the performance characteristics of the EDIM1 cluster using a set of benchmarks.

Chapter 2 outlines the changing trends in the area of high-performance computing and the need for scalable data-intensive computing solutions. It reviews the existing literature on the subject and highlights the issues and challenges faced in deploying large-scale applications. It also introduces the solutions to these challenges and lays out principles that can be used to develop such solutions. The last section of the chapter describes the high-level architecture for the EDIM1 cluster, complete with the hardware configuration details.

Chapter 3 provides a brief description of the benchmarks to be used for quantifying and understanding the performance of this novel architecture. It also describes in detail the principle behind a custom benchmark written during the course of this project for measuring the balance of a machine according to Amdahl's principles. Two distributed benchmarks based on the MapReduce [7] programming model also feature in the list of tests; these are used to verify that the system is indeed scalable enough to be used on much larger datasets.

Chapter 4 contains the results of all the benchmarks described in the preceding chapter along with their detailed analysis and interpretations. The first part of this chapter discusses some of the conventional performance criteria for computer systems, namely the CPU performance, memory bandwidth and disk throughput.


A significant portion of the chapter is dedicated to studying the synergism of all these components using the custom benchmark. It demonstrates how a change in the nature of the application (type of data being processed, computational intensity) tilts the balance either in favour of the processor or I/O. It also shows that multiple disks can indeed be used in parallel to increase the overall throughput. The last part of the chapter is dedicated to evaluating the performance of the system as a multi-node cluster using distributed computing benchmarks.

Chapter 5 concludes the report by summarizing the outcome of the tests qualitatively while Chapter 6 discusses the scope for further work on this subject and issues that could not be completely addressed in this report.


Chapter 2

Background

2.1 The Big Data Challenge

Advancement in computer technology has fundamentally changed the nature of science over the last few decades. In addition to the two classical paradigms of experimental and theoretical science, computational science has emerged as the third paradigm of scientific research, driven by the availability of high-end computing power. Real-world phenomena that were hitherto considered too complex for theory or experiment can now be simulated on supercomputers using numerical models. Large-scale simulations on modern supercomputers have resulted in a proliferation of scientific data. In addition, scientists today have highly sophisticated scientific instruments and sensor technologies at their disposal that produce vast amounts of observational data. For example, the ATLAS experiment at the Large Hadron Collider (LHC) undertaken at CERN generated raw data at the rate of 2 petabytes per second in 2008 translating to around 10 petabytes of processed data to be stored per annum [2]. Data being beamed down by satellites had broken the petabyte barrier more than a decade ago [8]. Unravelling the knowledge that is buried deep within this tsunami of data has the potential to become a driving factor behind many scientific breakthroughs. This has resulted in the emergence of data-intensive science as the fourth paradigm [9] in scientific exploration. It involves complex processing and analysis of highly voluminous scientific data to yield results that are concise enough to facilitate visualization and interpretation.

Huge datasets are not an occurrence unique to the domain of scientific applications alone.

Business transactions routinely generate large amounts of heterogeneous data which must be transformed and presented meaningfully in order to aid business decisions and gain competitive advantage. The opportunities presented by effective mining of data have been the driving factor behind the emergence of decision support systems (DSS). Similarly, websites capture vast amounts of information generated by millions of users in log files that can yield insights into the behavioural patterns of users, which can in turn be used to enrich user experience. As of 2009, the social networking site Facebook had 2.5 petabytes of stored data growing at the rate of 15 terabytes per day, while Google had been processing around 20 petabytes of data per day in 2008 [2].

The process for extracting meaningful information from ultra-large datasets is similar, whether it is in the scientific or the business sector; it encompasses a broad range of activities that progressively shrink the total volume of data at each stage to make it more meaningful and easier to interpret. This process is depicted in Figure 2-1.

Figure 2-1 Extracting meaningful information from large datasets

1. Data Capture: The first stage is that of capturing raw inputs from data sources. It is crucial that the data capture devices be able to record information in real time, at the same rate as it is generated.


2. Data Staging: The captured data must be organized in ways that prepare it to be processed. This may include some pre-processing such as cleaning, padding, schematization, addition of metadata, digital curation and warehousing.

3. Data Mining: Once the data has been staged, it can be mined for information using various algorithmic and statistical methods. For timely delivery of results, the processing must be carried out in parallel on multiple compute nodes of a cluster and requires the application of distributed computing methodologies.

4. Data Presentation: When data from all the previous stages has been reduced to manageable proportions, it can be presented to end users with the help of reporting and visualization software. Information can now be understood, interpreted and acted upon by humans.

Such large datasets generated by scientific and business applications are typically referred to as "Big Data". The exponential growth in datasets eventually stretched the limits of existing technology to a point where a data gap was created: our ability to generate raw data far outstripped our ability to analyze and comprehend what we generated, solely due to technical limitations. Tried and tested methods fell severely short of meeting the challenges encountered in dealing with this explosion of data. Data-intensive computing sprouted from the urgent need for a scalable, integrated solution consisting of both the hardware and software required to bridge the gap.

2.2 Design Principles for Data-intensive Computing

Data-intensive computing represents a significant deviation from conventional high performance computing, which typically focuses on maximising the CPU performance measured in terms of the number of floating-point operations per second (FLOPs). HPC has traditionally dealt with problems that fit in the main memory, where the latency of memory chips can be masked by designing applications that exploit data locality and the use of sophisticated multi-level caching.


Big Data applications are characteristically I/O bound because the datasets being processed are too large to fit in the main memory of computing clusters and eventually spill over to secondary storage devices. While disk capacities have increased exponentially in the last decade, the improvement in their data rates has been just about linear. On the other hand, CPU performance has steadily followed Moore's law, hugely overtaking the performance of secondary storage. This acute latency gap between processors and I/O subsystems essentially defeats data locality and caching mechanisms [10].

2.2.1 Balanced Systems

In order to extract optimum performance, the I/O subsystem must be able to transfer data at sustainable rates, fast enough to prevent the CPU from stalling. Such a system is known as a balanced system. Gene Amdahl laid down certain rules of thumb for building such systems more than four decades ago [4], which are still relevant for modern systems. Accordingly, for a system to be balanced, it requires the following (summarised as formulas below the list):

1. One bit of sequential I/O per second for every instruction per second. This is known as the Amdahl Number and can be expressed as the ratio of the I/O bandwidth to the CPU clock rate. The assumption here is that the CPU executes one instruction per clock cycle. However, this may not be true, and certain instructions may take more than one clock cycle to finish executing. FLOPs is a better measure of a CPU's performance. But then again, the value of FLOPs varies according to the nature of the code: the type of operations involved and the numerous optimizations that can possibly be applied to the code being executed.

2. One megabyte of main memory (RAM) per million instructions per second (MIPS), known as the Amdahl memory ratio (α). Interestingly, this law takes into account only the size of the memory and not its bandwidth. The assumption here is that the memory bandwidth is always greater than the I/O throughput, which is true for current technologies.

3. One I/O operation for every 50,000 instructions, called the Amdahl IOPS ratio.
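Written out explicitly, in the form used for the EDIM1 calculations in Section 2.5, the three rules correspond to the ratios

\[
\text{Amdahl Number} = \frac{\text{I/O bandwidth (bits/s)}}{\text{CPU rate (instructions/s)}}, \quad
\alpha = \frac{\text{memory size (MB)}}{\text{CPU rate (MIPS)}}, \quad
\text{Amdahl IOPS Ratio} = \frac{\text{IOPS}}{\text{CPU rate (instructions/s)} / 50000},
\]

and a system is considered balanced when all three ratios are close to unity.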


2.2.2 Scale-up vs. Scale-out

In order to obtain a balanced system, the throughput of the I/O subsystem can be stepped up to match the CPU performance of high-end servers. One way is to attach a high performance storage area network (SAN) to dedicated compute nodes. However, not only is this expensive, it is also not scalable because interconnect performance is not improving fast enough to keep up with the exponentially increasing storage volume [10]. Alternatively, we can try scaling up the I/O throughput to match the server performance using an array of disks and multiple high-end I/O controllers. This is the approach adopted by the architects of the GrayWulf cluster [1]. However, it has some inherent drawbacks: firstly, high-end controllers are expensive and will increase the overall cost of the system. Secondly, the solution is not scalable because there is a high probability of the system exhausting its PCI bandwidth. And lastly, the system will have very high power consumption, further increasing the cost of ownership.

A more sensible approach is to power down the processors to match the disk throughput and scale out to a very large number of nodes. This, in essence, is the "data bricks" model that was originally put forward by Jim Gray [5]. In fact, companies like Google and Facebook have been using large clusters built out of commodity hardware sold in the desktop computing market for processing petabytes of data. At scale, this proves to be extremely cost-effective, almost four times cheaper than high-end server platforms, while incurring a negligible performance penalty [11].

2.3 Programming Abstractions

Since data-intensive computing relies on a fundamentally different set of principles than conventional computing, the programming models themselves have evolved. New models for scalable computing seek to minimize data movement in order to avoid network bottlenecks by bringing computation to where the data are located. In order to ensure delivery of results within realistic time frames, data-centric algorithms are designed to reduce analysis cycles by exploiting opportunities for massive parallelism in data. Scalability comes with the risk of an increased rate of hardware failure.


Hence, software frameworks for data-intensive computing must be fault-tolerant in the event of node breakdowns and must ensure high availability of data in addition to speed of access.

2.3.1 MapReduce

MapReduce [7], the brainchild of Google Inc., has emerged as the dominant programming model for processing large datasets on clusters, clouds and grids using a data parallel approach. It adopts a functional style of programming by expressing computations on data in terms of two user-specified primitives, "map" and "reduce", both of which operate on key/value pairs in successive passes. The map phase processes the input key/value pairs to generate an intermediate set of key/value pairs, which serves as the input to the reduce phase. The reduce phase essentially merges values from the intermediate set that are associated with the same intermediate key and generates an output set of key/value pairs. MapReduce also includes an execution platform that shields programmers from low-level implementation details such as data partitioning, task scheduling and synchronization, load balancing, and recovery from node failures.
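As an illustration of the model only (this is not Hadoop code), the sketch below mimics the two primitives in plain C for the classic word-count task: map emits a (word, 1) pair for every word in an input record, and reduce sums the values that share a key. All names and the in-memory tables are invented for this example; a real framework would partition the input, shuffle the intermediate pairs by key and run many map and reduce tasks in parallel on different nodes.

```c
#include <stdio.h>
#include <string.h>

#define MAX_PAIRS 4096
#define MAX_KEY   64

struct kv { char key[MAX_KEY]; long value; };

static struct kv intermediate[MAX_PAIRS];   /* output of the map phase    */
static int n_intermediate = 0;
static struct kv output[MAX_PAIRS];         /* output of the reduce phase */
static int n_output = 0;

static void emit_intermediate(const char *key, long value)
{
    if (n_intermediate < MAX_PAIRS) {
        snprintf(intermediate[n_intermediate].key, MAX_KEY, "%s", key);
        intermediate[n_intermediate].value = value;
        n_intermediate++;
    }
}

/* map: one input record (a line of text) -> a stream of (word, 1) pairs */
static void map(const char *record)
{
    char buf[256];
    snprintf(buf, sizeof buf, "%s", record);
    for (char *w = strtok(buf, " \t\n"); w != NULL; w = strtok(NULL, " \t\n"))
        emit_intermediate(w, 1);
}

/* reduce: merge all intermediate values that share the same key */
static void reduce(const char *key)
{
    long sum = 0;
    for (int i = 0; i < n_intermediate; i++)
        if (strcmp(intermediate[i].key, key) == 0)
            sum += intermediate[i].value;
    snprintf(output[n_output].key, MAX_KEY, "%s", key);
    output[n_output].value = sum;
    n_output++;
}

static int already_reduced(const char *key)
{
    for (int i = 0; i < n_output; i++)
        if (strcmp(output[i].key, key) == 0)
            return 1;
    return 0;
}

int main(void)
{
    const char *input[] = { "the quick brown fox", "the lazy dog", "the fox" };

    for (size_t i = 0; i < sizeof input / sizeof input[0]; i++)
        map(input[i]);                        /* map phase                  */

    for (int i = 0; i < n_intermediate; i++)  /* group by key, then reduce  */
        if (!already_reduced(intermediate[i].key))
            reduce(intermediate[i].key);

    for (int i = 0; i < n_output; i++)
        printf("%s\t%ld\n", output[i].key, output[i].value);
    return 0;
}
```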

The advantage of using the MapReduce programming model is that it is not required of application developers to be experts in parallel programming. They can simply write sequential code which is automatically parallelised by the framework. Additionally, it minimizes data movement by moving the computation to nodes which host the data. As a result, it does not require the use of expensive high performance interconnects and can run well on low-end clusters that use commodity networking components.

2.3.2 Hadoop

Hadoop [12] is a popular open source implementation of MapReduce written in Java. It was originally developed by Yahoo and is currently being managed by the Apache Software Foundation. The Hadoop MapReduce framework uses a master/slave layout of cluster nodes for executing MapReduce tasks. A single master node is designated as the jobtracker, which serves as the front-end to which users can submit jobs.

The jobtracker splits up a job into a number of map and reduce tasks and assigns each task to a tasktracker for execution. The Hadoop framework comes bundled with its own user-level distributed file system called HDFS (Hadoop Distributed File System). HDFS provides reliable storage of large files by organizing each file as a sequence of equal-sized blocks and replicating them across multiple nodes for fault-tolerance. HDFS follows a master/slave layout similar to that of the Hadoop MapReduce framework, with a single namenode managing the file system namespace through an RPC interface. The remaining nodes on the cluster are designated as datanodes that manage the storage of file blocks and serve requests from clients. The Hadoop MapReduce framework can also be made to work with other types of file systems, such as cloud-based storage and parallel file systems.

2.4 Energy-efficient Computing

Power does not come cheap, and the maintenance of large clusters entails a high cost of ownership. Since extreme scalability is one of the key requirements for data-intensive computing, the power consumption goes up proportionally as facilities grow larger. Alternative solutions have looked at leveraging the latest in hardware technology to bring down power consumption considerably without adversely affecting system throughput. Low-powered Amdahl-balanced blades [6] combine two of the latest innovations from the hardware industry: solid-state drives (SSDs) and energy-efficient processors that are typically found in netbooks.

2.4.1 Solid State Drives

Solid-state drives (SSDs) [13] are the latest in storage technology and are quickly gaining in popularity due to their high performance and low power consumption. SSDs are a class of non-volatile data storage devices built from silicon chips, the same technology underlying main memory chips, instead of the magnetic storage used in hard disks. However, unlike main memory, which loses its state once powered off, SSDs provide persistent storage.


Earlier prototypes required additional battery backup to ensure persistence, but modern SSDs are based on NAND flash technology that does not require a sustained power supply in order to retain its state. Owing to the lack of mechanical parts, namely the rotating platter and read-write head found in HDDs, seek time is completely eliminated, resulting in extremely low latency. It also keeps the power consumption low. Moreover, since data can be accessed directly from any location within the flash memory, the performance is almost independent of access patterns. As a result, SSDs can achieve data rates far greater than conventional HDDs. The only factor that may prevent them from completely replacing HDDs is their high cost. However, they may be used alongside HDDs to complement one another: HDDs where large capacity is the requirement, and SSDs in places where performance is critical.

2.4.2 Low Powered CPUs

With the advent of netbooks and the rising popularity of handhelds and mobile devices, processor manufacturers introduced a whole generation of energy-efficient CPUs. Netbooks are a class of inexpensive, light-weight laptop computers focusing on longer battery life than on outright performance. They serve most generic purposes such as surfing, word processing and media playback, but may not support CPU/GPU intensive applications like computer games. CPUs mounted on such devices are very low on power consumption and the performance is just a fraction of that offered by higher end CPUs operating at similar clock frequencies. For example, the performance of the Intel ATOM N330 is almost half of that exhibited by the Intel Pentium E2140 in almost all the benchmarks[14] at 8 times less power consumption (based on Thermal Design Power rating as given in the spec sheets [15][16]). Both are dual core processors running at 1.6 GHz. In the scale-out model described previously, this means we can achieve four times better performance at the same level of power consumption.

2.5 EDIM1: A New Machine for Data Intensive Research

EDIM1 is an experimental cluster, purpose-built to fulfil the requirements of data-intensive research.


It is the result of a collaboration between EPCC and the School of Informatics, both within the University of Edinburgh. Its purpose is not to compete with high-end machines like HECToR, but to serve as a low-cost, power-efficient alternative for research projects. Currently, it has not been deployed as a full-fledged service and is under evaluation for its effectiveness in handling I/O bound workloads.

Figure 2-2 EDIM1: High Level Cluster Architecture

As shown in Figure 2-2, the architecture of the EDIM1 cluster is a hybrid based on the low-powered Amdahl-balanced bricks and traditional Beowulf clusters. It consists of 120 data processing nodes within three racks of 40 nodes each. Nodes within a rack are interconnected via Gigabit Ethernet. The racks themselves are connected to one another by 10 Gigabit Ethernet using a high-speed switch. Users connect to the system through SSH via a login node that acts as the front-end for the cluster. Figure 2-3 shows the configuration of a compute node, each consisting of a dual-core Intel Atom N330 CPU mounted on a low-powered Zotac mini-ITX motherboard with the NVIDIA ION chipset. The 4GB DDR3 main memory is shared with the NVIDIA ION GPU, which has 16 CUDA-enabled cores.


The secondary storage is comprised of three 2TB (Hitachi Deskstar 7K3000) hard-disk drives and a single 256GB (Crucial RealSSD C300) solid-state drive, all connected to a 3 Gbps SATA controller. There is also a Gigabit Ethernet adapter that is connected to the rack switch. Each node runs CentOS, and the software stack is configurable using the Rocks [17] cluster management suite.

Figure 2-3 EDIM1: High level Node Architecture (Amdahl Blade)

The theoretical values of the Amdahl ratios can be calculated based on the manufacturer specifications [18][19]. The sustainable transfer (read as well as write) rate of the HDD is quoted as 162 MB/s, while the read speed for the SSD is 265 MB/s (on a 3 Gbps SATA interface). Although the SSD is much faster than the HDDs, the aggregate bandwidth cannot be calculated merely as a sum of bandwidths. For that, all the disks would have to finish reading at the same time, meaning that more data must be stored on the SSD than on the HDDs. This is infeasible, because SSDs are expensive and are favoured for their performance rather than storage capacity.

The SSD is primarily meant to be used as a high-speed cache for applications to store temporary files generated during their runtime. The total time required to completely read all the disks simultaneously is the maximum of the time taken to read the individual disks:

The HDDs would take the longest to be read completely,

\[
\text{Time}_{\text{max}} = \frac{2 \times 1024 \times 1024\ \text{MB}}{162\ \text{MB/s}} = 12945\ \text{seconds} \approx 3\ \text{hours}\ 36\ \text{minutes}
\]

Aggregate throughput for 4 disks is therefore,

\[
\text{Throughput}(4) = \frac{\sum_{i=1}^{4} \text{capacity}_i}{\text{Time}_{\text{max}}} = \frac{3 \times 2 \times 1024 \times 1024 + 256 \times 1024}{12945} = 506.25\ \text{MB/s}
\]

Based on this estimate, the Amdahl number is calculated as,

\[
\text{Amdahl Number} = \frac{\text{Throughput (in bits/s)}}{\text{CPU Rate (in cycles/s)}} = \frac{506.25 \times 1024 \times 1024 \times 8}{3.2 \times 10^{9}} = 1.33
\]

\[
\text{Amdahl Memory Ratio} = \frac{\text{Memory (in MB)}}{\text{CPU Rate (in MIPS)}} = \frac{4 \times 1024}{3200} = 1.28
\]

According to the datasheet, the SSD is able to perform reads at a rate of 60 KIOPS. The IOPS for each HDD can be calculated using its average seek time (0.5ms) and latency (4.16ms) as given in the datasheets.

\[
\text{IOPS} = \frac{1000}{\text{Avg. Seek Time (in ms)} + \text{Avg. Latency (in ms)}} = \frac{1000}{0.5 + 4.16} = 214
\]

The aggregate IOPS is therefore 3 × 214 + 60000 = 60642.

\[
\text{Amdahl IOPS Ratio} = \frac{\text{IOPS}}{\text{CPU Rate (in instructions/s)} / 50000} = \frac{60642}{3.2 \times 10^{9} / 50000} = 0.95
\]


Table 2-1 and Table 2-2 summarise all the calculations.

                            HDD      SSD
Capacity (in GB)            2048     256
Read Throughput (in MB/s)   162      265
IOPS                        214      60000

Table 2-1 Values for individual disks

Aggregate Throughput (MB/s)   506.25
Aggregate IOPS                60642
Amdahl Number                 1.33
Amdahl Memory Ratio           1.28
Amdahl IOPS Ratio             0.95

Table 2-2 Calculated values for the system
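The theoretical values in Tables 2-1 and 2-2 can be reproduced with a few lines of arithmetic. The sketch below simply encodes the manufacturer figures quoted above; it is illustrative only and not part of the benchmark suite.

```c
#include <stdio.h>

/* Reproduces the theoretical per-node figures from Tables 2-1 and 2-2 using
 * the manufacturer specifications quoted in Section 2.5 (illustrative only). */
int main(void)
{
    const double hdd_capacity_mb = 2.0 * 1024 * 1024;  /* one 2 TB HDD           */
    const double ssd_capacity_mb = 256.0 * 1024;        /* one 256 GB SSD         */
    const double hdd_rate_mbps   = 162.0;               /* sustained HDD MB/s     */
    const double memory_mb       = 4.0 * 1024;          /* 4 GB RAM               */
    const double cpu_rate        = 3.2e9;               /* 2 cores x 1.6 GHz,
                                                           one instruction/cycle  */
    const double hdd_iops        = 214.0;               /* 1000 / (0.5 + 4.16) ms */
    const double ssd_iops        = 60000.0;

    /* The three HDDs finish last, so they set the total read time. */
    double t_max          = hdd_capacity_mb / hdd_rate_mbps;            /* ~12945 s */
    double aggregate_mbps = (3.0 * hdd_capacity_mb + ssd_capacity_mb) / t_max;
    double aggregate_iops = 3.0 * hdd_iops + ssd_iops;

    double amdahl_number  = aggregate_mbps * 1024.0 * 1024.0 * 8.0 / cpu_rate;
    double memory_ratio   = memory_mb / (cpu_rate / 1.0e6);  /* MB per MIPS        */
    double iops_ratio     = aggregate_iops / (cpu_rate / 50000.0);

    printf("Aggregate throughput : %.2f MB/s\n", aggregate_mbps);
    printf("Aggregate IOPS       : %.0f\n", aggregate_iops);
    printf("Amdahl number        : %.2f\n", amdahl_number);
    printf("Amdahl memory ratio  : %.2f\n", memory_ratio);
    printf("Amdahl IOPS ratio    : %.2f\n", iops_ratio);
    return 0;
}
```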


Chapter 3

EDIM1 Benchmarks

3.1 Single Node Benchmarks

3.1.1 LINPACK

The LINPACK [20] benchmark is a popular benchmark used for ranking the world's fastest supercomputers in terms of their floating point performance. The LINPACK library consists of a set of routines written in FORTRAN for performing numerical linear algebra. The benchmark itself is a specification: to solve a dense system of linear equations using LU factorization with partial pivoting. It is popular because it can report FLOPs values that are very close to the theoretical peak values of machines due to the regular nature of the problem. It is, therefore, a good estimate of the CPU's raw computational power.

3.1.2 FLOPS

The FLOPS [21] benchmark is a simple C program consisting of 8 computational kernels, each executing a different numerical algorithm. Most of the kernels carry out numerical integration of algebraic and trigonometric identities, with the exception of Kernel 2, which estimates the value of π based on the Maclaurin series for tan⁻¹(x). Each kernel contains different proportions of the various floating point operations: FADD, FSUB, FMUL and FDIV. The breakdown of floating point operations per loop iteration for the kernels is given in Table 3-1.


Kernel   FADD   FSUB   FMUL   FDIV   TOTAL
   1       7      0      6      1      14
   2       3      2      1      1       7
   3       6      2      9      0      17
   4       7      0      8      0      15
   5      13      0     15      1      29
   6      13      0     16      0      29
   7       3      3      3      3      12
   8      13      0     17      0      30

Table 3-1 Combination of floating point operations for the FLOPS kernels

FDIV is known to be the most compute-intensive operation of the four listed and too much usage can significantly slow down the code. The kernels are small enough to be held in the CPU cache and mostly rely on register variables to perform calculations. The results therefore best reflect the floating-point performance of the processor.

3.1.3 STREAM

Unlike the previous two benchmarks, the STREAM[22] benchmark is designed specifically to stress the memory system. It measures the sustainable rate at which the memory system can deliver data to the CPU. It uses a set of four simple operations on vectors of length N:

\[
\begin{aligned}
\text{COPY:}  \quad & c[i] = a[i] \\
\text{SCALE:} \quad & c[i] = \mathit{scalar} \times b[i] \\
\text{ADD:}   \quad & c[i] = a[i] + b[i] \\
\text{TRIAD:} \quad & c[i] = a[i] + \mathit{scalar} \times b[i]
\end{aligned}
\]


The lengths of the vectors are selected such that they do not fit into any of the caches so that it may reflect the true performance of the memory system. The performance is measured in MB/sec.
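For reference, the shape of such a kernel is sketched below. This is a simplified triad loop rather than the official STREAM source, and the array length chosen here is arbitrary; it only needs to be large enough that the three arrays overflow every level of cache.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Simplified STREAM-style TRIAD: streams three large arrays through the CPU
 * and reports a sustainable memory rate. Not the official STREAM code. */
#define N (20 * 1000 * 1000)   /* ~160 MB per array of doubles */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    if (!a || !b || !c) return 1;

    const double scalar = 3.0;
    for (long i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++)
        c[i] = a[i] + scalar * b[i];          /* TRIAD */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double seconds = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    /* Three 8-byte arrays move through memory: two reads and one write. */
    double mbytes = 3.0 * N * sizeof(double) / (1024.0 * 1024.0);
    printf("TRIAD: %.2f MB/s\n", mbytes / seconds);

    free(a); free(b); free(c);
    return 0;
}
```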

3.1.4 IOzone

IOzone is a benchmark written in C that measures the performance of the file system for a variety of I/O operations. It can be run using different combinations of file sizes and record sizes (smallest unit of data to be fetched by an application). The benchmark can test the I/O system for operations like sequential and random reads/writes. It also supports other operations like reading/writing in reverse and strided data transfers. The benchmark also supports advanced features meant to thoroughly test the I/O and file system performance. However, the purpose of this report is not to test the disk performance alone, but to quantify the performance of the system as a whole and determine its suitability for data-intensive applications. Therefore, only sequential and random I/O have been considered in this report.

3.2 The Amdahl Synthetic Benchmark

All the single node benchmarks listed above give useful insights into the performance of individual components of the system. However, a more holistic benchmark is needed to understand the performance of the cluster in light of Amdahl's laws for balanced systems. In this project, a synthetic benchmark has been written just for this purpose: to measure the Amdahl number for a given node (this value can be extended to the whole cluster since all the nodes are identical). The benchmark is used to study the effect of changes in I/O throughput and the compute intensity of applications on the overall CPU utilization of the node. It is also used to verify that, by accessing multiple disks concurrently, the throughput can be aggregated and an Amdahl number close to unity can be achieved. The code has been written in C and relies on the Pthreads library to implement streaming I/O so that it may be overlapped with computation.


The streaming access pattern has the potential to improve performance in many data-intensive applications over the serialized Read-Wait-Compute pattern, in which the I/O occurs in periodic bursts and threads have to wait for I/O requests to be completed before they can continue processing.

Figure 3-1 State Transition Diagram for the Amdahl Benchmark

Figure 3-1 depicts the state transition diagram for the benchmark code. The program is implemented using a producer-consumer model: given t compute threads and f files, a total of (t + f) buffers are allocated in memory. Initially, all the buffers are marked free. The f producer threads read data from the files and fill up the free buffers continuously until all the files have been read.

Loop until End Of File
    DATA = Find_Buffer(FREE)
    If DATA == NULL
        Begin_Idle_Timer()
        Wait_For_Condition(BUFFER_FREE)
        End_Idle_Timer()
        Go to the start of the Loop
    End If
    Set_Buffer_State(DATA, LOADING)
    Read_From_File(DATA)
    Set_Buffer_State(DATA, READY)
    Broadcast(BUFFER_READY)
End Loop

Figure 3-2 Pseudo-code for the I/O thread routine


REDUCTION = 0
Loop while the I/O threads are active
    DATA = Find_Buffer(READY)
    If DATA == NULL
        Begin_Idle_Timer()
        Wait_For_Condition(BUFFER_READY)
        End_Idle_Timer()
        Go to the start of the Loop
    End If
    Set_Buffer_State(DATA, BUSY)
    Loop from k = 0 to Length_Of(DATA)
        Loop from i = 0 to NUM_OPERATIONS
            REDUCTION = REDUCTION + DATA[k]
        End Loop
    End Loop
    Set_Buffer_State(DATA, FREE)
    Broadcast(BUFFER_FREE)
End Loop

Figure 3-3 Pseudo-code for the compute thread routine

When data is being read into a buffer, its state is set to loading. As soon as a buffer is completely filled, it is marked ready by the producer thread. At this point, the producer broadcasts a signal informing any compute threads which may be waiting for data that a buffer is now available to be processed. The producer then moves on to the next available free buffer and starts reading data into it. Figure 3-2 shows the simplified pseudo-code for the I/O thread routine. If a free buffer is not found at the start of a read cycle, the I/O thread waits until one becomes available. This happens when all the buffers have been read into, but have not yet been processed by the compute threads, i.e. all the buffers are either in the ready or busy states. It is an indication that the application is compute bound and the I/O is capable of streaming data faster than the application logic can process it. The t compute threads follow a similar routine, as shown in the pseudo-code listed in Figure 3-3. When a buffer becomes ready, a consumer (compute) thread picks it up for processing and marks the buffer busy.


Once all the data in that buffer has been processed, the buffer is marked free again so that it can be reused by a producer thread, and a signal conveying this message is broadcast. Likewise, if no buffer is found in a state ready to be processed, the compute thread waits for a signal from the I/O threads.
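The handshake just described can be sketched with POSIX threads roughly as follows. This is a simplified reconstruction rather than the actual benchmark source: there is a single producer and a single consumer, the disk read and the reduction are stubbed out with sleeps, and the idle-time accounting is omitted. It compiles with -pthread.

```c
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* Simplified reconstruction of the buffer handshake of Figures 3-2 and 3-3:
 * buffer states are protected by a mutex and signalled via condition variables. */
enum state { FREE, LOADING, READY, BUSY };

#define NBUF    4
#define NBLOCKS 16                               /* pretend file of 16 blocks */

static enum state buf_state[NBUF];               /* all start as FREE (0)     */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t buffer_free  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t buffer_ready = PTHREAD_COND_INITIALIZER;

static int find_buffer(enum state s)
{
    for (int i = 0; i < NBUF; i++)
        if (buf_state[i] == s) return i;
    return -1;
}

static void *io_thread(void *arg)
{
    (void)arg;
    for (int block = 0; block < NBLOCKS; block++) {
        pthread_mutex_lock(&lock);
        int i;
        while ((i = find_buffer(FREE)) < 0)       /* compute side is behind   */
            pthread_cond_wait(&buffer_free, &lock);
        buf_state[i] = LOADING;
        pthread_mutex_unlock(&lock);

        usleep(1000);                             /* stands in for a disk read */

        pthread_mutex_lock(&lock);
        buf_state[i] = READY;
        pthread_cond_broadcast(&buffer_ready);
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

static void *compute_thread(void *arg)
{
    (void)arg;
    for (int block = 0; block < NBLOCKS; block++) {
        pthread_mutex_lock(&lock);
        int i;
        while ((i = find_buffer(READY)) < 0)      /* I/O side is behind        */
            pthread_cond_wait(&buffer_ready, &lock);
        buf_state[i] = BUSY;
        pthread_mutex_unlock(&lock);

        usleep(1000);                             /* stands in for the reduction */

        pthread_mutex_lock(&lock);
        buf_state[i] = FREE;
        pthread_cond_broadcast(&buffer_free);
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t io, comp;
    pthread_create(&io, NULL, io_thread, NULL);
    pthread_create(&comp, NULL, compute_thread, NULL);
    pthread_join(io, NULL);
    pthread_join(comp, NULL);
    puts("done");
    return 0;
}
```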

Figure 3-4 Application and I/O views of the data buffer

From the I/O perspective, every buffer is treated as a stream of bytes. From the point of view of the computations, however, each buffer is an array of elements belonging to a particular data type. The different views are shown in Figure 3-4. The data type can be configured in the code, but recompilation is then necessary. The processing of the buffer occurs inside a loop which carries out a reduction by applying a binary operation to each element in the array a specified number of times. The operation can be any binary operation (arithmetic, bitwise, logical) supported by the C programming language and can also be configured in the source code; again, recompilation is necessary. The size of the buffers, the number of threads, the number of files and the number of times to repeat the operation can all be specified via the command line when running the program.
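The dual view of a buffer can be expressed in C by simply casting the byte stream to the configured element type. The fragment below is a sketch of this idea; ELEMENT_TYPE and OPERATION are illustrative stand-ins for the benchmark's compile-time settings and are not the actual macro names used in the code.

```c
#include <stddef.h>
#include <stdio.h>

/* The I/O side fills the buffer as raw bytes; the compute side reinterprets
 * it as an array of a configured type and reduces it. Illustrative sketch. */
#define ELEMENT_TYPE double
#define OPERATION(acc, x) ((acc) + (x))   /* any binary operation on elements */

static ELEMENT_TYPE process_buffer(const char *bytes, size_t nbytes,
                                   int num_operations)
{
    const ELEMENT_TYPE *data = (const ELEMENT_TYPE *)bytes;  /* compute view */
    size_t nelems = nbytes / sizeof(ELEMENT_TYPE);
    ELEMENT_TYPE reduction = 0;

    for (size_t k = 0; k < nelems; k++)              /* every element...        */
        for (int i = 0; i < num_operations; i++)     /* ...operated on i times  */
            reduction = OPERATION(reduction, data[k]);
    return reduction;
}

int main(void)
{
    double sample[4] = { 1.0, 2.0, 3.0, 4.0 };       /* stands in for file data */
    ELEMENT_TYPE r = process_buffer((const char *)sample, sizeof sample, 2);
    printf("reduction = %f\n", (double)r);
    return 0;
}
```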

The Amdahl number is calculated based on the total amount of data transferred (in bits) and the total number of CPU cycles consumed (based on the time stamp counter) throughout the runtime:

\[
\text{Amdahl Number} = \frac{8 \times \sum_{i=1}^{\text{number of files}} \text{filesize}_i}{\sum_{i=1}^{\text{number of threads}} \text{clockticks}_i}
\]
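On x86 the cycle count can be read directly from the time stamp counter; a minimal sketch of the calculation is shown below, assuming GCC or Clang on x86. The byte count and the placeholder work loop are invented values, not measurements, and in the real benchmark the tick counts would be summed over all threads.

```c
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>   /* __rdtsc(), GCC/Clang on x86 */

/* Minimal sketch of the Amdahl-number calculation (not the benchmark source). */
int main(void)
{
    uint64_t start = __rdtsc();

    /* ... benchmark run: threads stream files and reduce them ... */
    volatile double sink = 0.0;
    for (long i = 0; i < 100000000L; i++) sink += 1.0;   /* placeholder work */

    uint64_t ticks = __rdtsc() - start;                  /* CPU clock ticks   */
    uint64_t bytes_read = 4ULL * 1024 * 1024 * 1024;     /* e.g. one 4 GB file */

    double amdahl_number = (8.0 * (double)bytes_read) / (double)ticks;
    printf("Amdahl number = %.2f\n", amdahl_number);
    return 0;
}
```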

The benchmark records the time spent by all the threads in their wait states. For I/O bound applications, the compute threads would be idle for extended periods (and vice-versa for compute-intensive applications). The time when I/O and computation are overlapped can be expressed as

\[
t_{\mathrm{overlap}} = t_{\mathrm{runtime}} - (\delta_{\mathrm{compute}} + \delta_{\mathrm{IO}})
\]

where

$t_{\mathrm{runtime}}$ = overall application runtime

$\delta_{\mathrm{compute}}$ = average wait time for the compute threads

$\delta_{\mathrm{IO}}$ = average wait time for the I/O threads

Theoretically, in the case of perfect overlap between I/O and computation, $t_{\mathrm{runtime}}$ would be equal to $t_{\mathrm{overlap}}$ and the quantity $(\delta_{\mathrm{compute}} + \delta_{\mathrm{IO}})$ would be 0 (no threads would have to wait). In practice, however, certain minimum wait times are unavoidable due to the overhead caused by thread synchronization. For example, the compute threads inevitably have to wait until the first block of data arrives. The threads also use locks to update buffer states atomically. A good indication of balance is when the wait times for the I/O and compute threads are almost equal ($\delta_{\mathrm{compute}} \approx \delta_{\mathrm{IO}}$) and their sum is just a small fraction (approximately 15%) of $t_{\mathrm{runtime}}$. The benchmark can be used to verify the efficacy of an Amdahl-balanced system in ensuring optimum CPU utilization even for applications with low-intensity computations.
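As a purely hypothetical illustration of this criterion (the numbers below are invented, not measured): for a run with $t_{\mathrm{runtime}} = 30$ s, $\delta_{\mathrm{compute}} = 2.5$ s and $\delta_{\mathrm{IO}} = 2.0$ s,

\[
t_{\mathrm{overlap}} = 30 - (2.5 + 2.0) = 25.5\ \mathrm{s}, \qquad
\frac{\delta_{\mathrm{compute}} + \delta_{\mathrm{IO}}}{t_{\mathrm{runtime}}} = \frac{4.5}{30} = 15\%,
\]

so the two wait times are of comparable size and together account for only a small fraction of the runtime, which is the signature the benchmark reports for a balanced configuration.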

3.3 Distributed (MapReduce) Benchmarks

Apache Hadoop is used as the default framework for running large parallel batch jobs on the EDIM1 cluster by many of the allocated applications. The Hadoop installation comes bundled with a number of benchmarks in the form of the hadoop-test-*.jar and the hadoop-examples-*.jar files. Two of these are used here to study the scaling behaviour of applications when running on multiple nodes.


3.3.1 TestDFSIO

The TestDFSIO benchmark can be used to perform read/write tests on the Hadoop Distributed File System (HDFS). It is a useful tool for stress testing HDFS and identifying potential performance problems. It consists of two tests, Read and Write, which accept the size per file and the number of files as command line arguments. The Read test expects the files to already exist in the file system and therefore must be run after the files have been generated by the Write test. There is also a clean command to clean up the files after the tests have been performed. The tests run as MapReduce jobs, accessing files from a hard-coded directory in the distributed file system. The tests are configured to launch one Map task per file. The statistics reported for N files are the overall throughput per node and the average I/O rate, both of which are calculated as follows:

\[
\text{Throughput}(N) = \frac{\sum_{i=1}^{N} \text{filesize}_i}{\sum_{i=1}^{N} \text{time}_i}
\]

\[
\text{Average I/O Rate}(N) = \frac{\sum_{i=1}^{N} \dfrac{\text{filesize}_i}{\text{time}_i}}{N}
\]

However, for analyzing the performance scalability in a distributed environment, what we are really interested in is the aggregate I/O throughput achieved by the distributed file system. For analysing the results of the tests, the aggregate throughput is calculated as,

\[
\text{Aggregate Throughput}(N) = \frac{\sum_{i=1}^{N} \text{filesize}_i}{\text{time}(N)}
\]
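The difference between the three metrics can be made concrete with a small calculation. The per-file sizes and times below are invented purely for illustration; the point is that the aggregate throughput uses the wall-clock time of the whole job, so it exceeds the per-task figures whenever the map tasks run in parallel.

```c
#include <stdio.h>

/* Illustrates the three TestDFSIO metrics with made-up numbers:
 * 4 files of 1024 MB each, written in slightly different per-task times,
 * with the whole job taking 400 s of wall-clock time. */
int main(void)
{
    const int N = 4;
    double filesize_mb[] = { 1024, 1024, 1024, 1024 };
    double time_s[]      = { 310, 330, 350, 340 };   /* per-file task times */
    double job_time_s    = 400;                      /* wall-clock time(N)  */

    double total_mb = 0, total_s = 0, rate_sum = 0;
    for (int i = 0; i < N; i++) {
        total_mb += filesize_mb[i];
        total_s  += time_s[i];
        rate_sum += filesize_mb[i] / time_s[i];
    }

    printf("Throughput(N)           = %.2f MB/s\n", total_mb / total_s);
    printf("Average I/O rate(N)     = %.2f MB/s\n", rate_sum / N);
    printf("Aggregate throughput(N) = %.2f MB/s\n", total_mb / job_time_s);
    return 0;
}
```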

3.3.2 TeraSort

TeraSort is an extremely popular benchmark in data-intensive computing and has repeatedly been used over the years to show off the performance of large clusters by many competing organizations.

Back in 2008, Yahoo managed to sort a terabyte of data in 209 seconds using 910 nodes [23]. A year later, they sorted a terabyte of data in just 62 seconds using 1460 nodes [24]. The benchmark was originally part of the sort benchmark challenge [25] laid down by Jim Gray. He proposed [26] disk-based sorting as a simple but highly effective performance metric for a system's I/O.

The terasort benchmark suite is part of the hadoop-examples-*.jar package and is comprised of three tools:

1. teragen: used to generate a large number of records in a random order to be sorted. Each record is 100 bytes long; the first 10 bytes of a record form the key based on which the data is to be sorted. The next 10 bytes represent the row number as a decimal integer, followed by filler data containing an arbitrary alphabetic sequence (a sketch of this record layout is given after the list).

2. terasort: the actual sort program, which reads records from the distributed file system, sorts them in a non-decreasing order and writes them out to the file system. It is essentially a distributed sort that uses a sorted list of sampled keys to define the bounds for each reducer.

3. teravalidate: a tool to verify that the output records are in a globally sorted order. It runs through the records ensuring that each key it reads is less than or equal to the next one. If the sorting was successful, no output is generated. Otherwise, it outputs the keys which were found to be out of order.
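To make the record format concrete, the fragment below writes records in the 100-byte layout described above (10-byte key, 10-byte decimal row number, alphabetic filler). It is an illustrative sketch rather than the actual teragen implementation; in particular, the key generation here is simplistic.

```c
#include <stdio.h>
#include <stdlib.h>

/* Illustrative sketch of the 100-byte TeraSort record layout described above.
 * Not the real teragen: the key is just pseudo-random printable bytes. */
#define RECORD_LEN 100
#define KEY_LEN     10
#define ROWID_LEN   10

static void make_record(char *rec, long row)
{
    for (int i = 0; i < KEY_LEN; i++)                       /* 10-byte key     */
        rec[i] = ' ' + rand() % 95;
    snprintf(rec + KEY_LEN, ROWID_LEN + 1, "%010ld", row);  /* 10-digit row id */
    for (int i = KEY_LEN + ROWID_LEN; i < RECORD_LEN; i++)  /* filler          */
        rec[i] = 'A' + (i % 26);
}

int main(void)
{
    char rec[RECORD_LEN + 1] = { 0 };
    for (long row = 0; row < 3; row++) {
        make_record(rec, row);
        printf("%.*s\n", RECORD_LEN, rec);
    }
    return 0;
}
```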

Since disk-based sorting is inherently I/O-bound, the Terasort benchmark thoroughly tests the sequential I/O performance of a system and gives a good idea about its capacity for scaling the overall I/O throughput.


Chapter 4

Performance Analysis

4.1 Results of CPU and Memory Tests

The FLOPS benchmark runs in a single thread and thus exercises just a single core of the CPU. Considering typical data-intensive workloads to be (almost) embarrassingly parallel, the overall floating point performance for the N330 processor can be obtained by doubling the values obtained from the FLOPS benchmark. Also, the overhead of running multiple threads within a node is negligible compared to those caused by the I/O subsystem and inter-node communication. On the other hand, the LINPACK benchmark makes use of all the available cores to solve a system of linear equations.


Figure 4-1 Results of the LINPACK and FLOPS benchmarks


The LINPACK benchmark was run with a matrix of size 10000 × 10000, and used around 800 MB of memory while the FLOPS benchmark was run with default parameters. As seen in Figure 4-1, the recorded FLOPs values are significantly lower than the processor clock rate. This is because the ATOM processor is designed for low power consumption at the expense of floating point performance. It is optimized to execute simple instructions (such as integer addition, comparison and shift) quickly and complex instructions such as division are heavily penalised. As shown in Table 4-1, the instruction latencies1 are almost twice those of high-end processors like the Core i3 while the throughput2 is just half.

Instruction        Latency (ATOM)   Latency (Core i3)   Throughput (ATOM)   Throughput (Core i3)
x87 Instructions
  FADD, FSUB              5                 3                   1                    1
  FMUL                    5                 5                  1/2                   1
  FDIV                   71                21                 1/71                 1/20
XMM Instructions
  ADDSD, SUBSD            5                 3                   1                    1
  MULSD                   5                 5                  1/2                   1
  DIVSD                  60                22                 1/60                 1/22

Table 4-1 Latency and Throughput for Floating Point Instructions [27]

In addition, ATOM supports only in-order execution of instructions, i.e. instructions will be executed by the CPU in the same order that they occur in the instruction stream. This results in lower performance because the execution unit cannot exploit the potential ILP (Instruction Level Parallelism) in the code.

1 Latency of an instruction is the delay in clock cycles for it to execute completely.

2 Throughput is the maximum number of instructions that can be executed in a single clock cycle. It denotes the sustainable rate at which a stream of independent instructions can be executed.


Performance is further inhibited by the fact that the execution unit (FPU) is shared between integer and floating point arithmetic operations. The same FLOPS benchmark was run on a single core of two higher-end processors: the Core i3 (2.27 GHz) on a laptop and the AMD Opteron 1218 (2.6 GHz), which is the NESS front-end. Figure 4-2 shows the results of the tests, with the FLOPs recorded on each of the processors expressed as a percentage of the CPU clock rate. Even in the best case, the ATOM gives a FLOPs value which is only 30% of the processor clock. The other two processors achieve peak FLOPs values very close to their respective CPU clock rates.


Figure 4-2 FLOPs expressed as a percentage of the CPU Clock Frequency

In the context of data-intensive computing, the FLOPs value indicates the data-processing ability of the cores. For double precision operations, a processor can potentially process "FLOPs times 8" bytes of data per second (given that the size of an IEEE 754 double precision floating point number is 8 bytes). Consider the result from Kernel 4 of the FLOPS benchmark, which represents the average-case performance for the ATOM: 600 MFLOPs (combining the two cores). Based on this estimate, the ATOM CPU can process data at an average rate of around 4.5 GB/sec. This value is bound to be even higher for integer and character based operations using 64-bit operations.
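For the figure quoted above, the arithmetic is simply

\[
600 \times 10^{6}\ \text{FLOPs} \times 8\ \text{bytes} = 4.8 \times 10^{9}\ \text{bytes/s} \approx 4.5\ \text{GB/s}
\]

(taking 1 GB = $1024^{3}$ bytes).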


This is more than sufficient to process data-streams for applications which are memory-bound, as seen from the results of the STREAM benchmark (Table 4-2).

Function   Data Rate (MB/sec)
Copy            2215.45
Scale           2209.43
Add             2575.96
Triad           2575.36

Table 4-2 STREAM Results with O4 optimization and double precision arithmetic

For the Triad kernel, the maximum sustainable bandwidth is around 2.5 GB/sec. Even if we replace the ATOM with a faster processor, it does not translate into an increase in the overall throughput of the system because there is an upper bound on the rate at which data gets delivered to the processor. For data-intensive applications, this limitation becomes even more acute because the bottleneck is disk I/O.

4.2 Results of I/O tests

The performance of the two types of disk attached to the node, the HDD and the SSD, was measured using the IOzone benchmark. The performance plots using different file sizes and record sizes are shown in Figure 4-3 and Figure 4-4 respectively. In both cases, the throughput drops sharply for file sizes greater than 4GB, which is the size of the main memory per node. This is because, for smaller files, the overall throughput is affected by the file getting cached in the main memory. Our focus is, therefore, on files larger than the main memory, which is typical for data-intensive applications.

Figure 4-5 shows that the sequential read throughput for both types of drives is independent of the record size. This is due to the fact that the file I/O occurs in the form of block transfers, i.e. the smallest unit of data that can be fetched into memory is a block. When a single record is requested, the entire block containing that record is fetched into memory.

If a single block spans multiple records, they too get transferred during the same request and are already present in memory when the next request is made. This is not true for random access, where successive requests are not guaranteed to be for records in the same block and pre-fetched records may get discarded. As the record size increases, the number of records per block decreases, causing the random access throughput to approach that of sequential access.


Figure 4-3 IOZone: Sequential Read Throughput for Hard Disk Drive (HDD)


Figure 4-4 IOzone: Sequential Read Throughput for Solid State Drive (SSD)



Figure 4-5 Random and Sequential Read Throughput for a file of 8 GB


Figure 4-6 Random and Sequential Write Throughput for a file of 8 GB

The IOPS can be calculated based on the random access rates for 4 KB records:

\[
\text{IOPS} = \frac{\text{Throughput (in KB/s)}}{\text{Record Size (in KB)}}
\]


In the case of the HDD, this works out to 194, which is close to the value arrived at using the latency calculations. Surprisingly, for the SSD it comes to 6380, which is just around 10% of the value quoted in the datasheets. This puts the Amdahl IOPS ratio of the machine at 0.11.

Another unexpected result is that the random write throughput for the SSD is consistently faster than the sequential write throughput, as seen in Figure 4-6. One possible explanation for this has to do with the manner in which SSDs physically organize data storage, as explained in [28]. An SSD is made up of NAND-flash cells which can store 1 or 2 bits of data, depending on the type of cell used. However, the smallest addressable unit of data that can be accessed by a read or write request is a group of multiple cells called a page, which is typically 4 KB in size. Although pages can be read from and written to, they cannot be overwritten; pages need to be erased before being written to. Pages can only be erased in groups of 128 called blocks, which makes each block 512 KB in size.

Let us consider a scenario where a 1 MB record must be written to the SSD. If the record is perfectly aligned with the blocks, i.e. it spans exactly 2 blocks, they would simply be erased and the data written to them. However, if records are misaligned, then a single record spans 3 blocks instead of 2. The record must be written into the blocks in such a way that the data already present in the pages which are not being overwritten remains intact. To ensure this, the data from those adjoining pages is first read and appended to the record. The blocks are then erased and the concatenated data is written into them. Both cases are shown in Figure 4-7.

Figure 4-7 SSD Write Patterns

In case of sequential writes, there would be a dependency between successive records since they would share a block at their adjoining boundary. Therefore, write requests would have to be serialized because the next record cannot be written until the current record has finished being written. The SSD technology relies on parallel operations for its speed, and serializing the requests in this manner is bound to slow it down. This dependency does not exist when the writes are performed randomly because the probability of records from consecutive requests sharing a block is minimal. This limitation may be a result of the file system partition being misaligned, rather than a performance issue with the SSD. Further tests may be needed to verify the effect of partition alignment on SSD write performance.

4.3 Results of the Amdahl Synthetic Benchmark


Figure 4-8 Aggregate read throughput for a combination of disks

The tests in the previous section show that outright CPU performance or even memory bandwidth is not a priority for data-intensive applications because the I/O throughput lags both by a considerable margin, even when using a low-end processor.


It is established that the processor is capable of handling any amount of data it receives from the I/O; the question remains whether the I/O throughput can be maximized so that a greater percentage of CPU time can be utilized effectively.


Figure 4-9 Amdahl balance achieved by combining multiple disks

The most logical way to step up the aggregate I/O throughput is by combining multiple disks and accessing them in parallel. There are many ways to perform parallel I/O; one of them is to use disk striping, in which a single file is striped across multiple disks. This is also known as RAID level 0 and can be configured in software or hardware. Parallel I/O can also be performed at the application level using multithreading, by creating reader/writer threads. These threads hardly take up any CPU time because they mostly remain in a blocked state waiting for I/O requests to be completed. This is the technique implemented in the Amdahl synthetic benchmark. The results in Figure 4-8 show the aggregate throughput achieved using different combinations of disks by reading a 4 GB file from each disk concurrently.

While all the benchmarks in the previous section only quantified the performance of the individual components of a node, the purpose of the Amdahl benchmark is to measure the Amdahl balance for the given configuration.

The ideal value of the Amdahl number is unity and can be achieved by combining the 3 HDDs. The overall Amdahl number of the node is almost equal to the theoretical value calculated in Section 2.5. Figure 4-9 shows the variation in the Amdahl number of the node with respect to the number of disks. It is directly proportional to the aggregate bandwidth. For these tests, both cores of the ATOM processor were utilized to perform computations.


Figure 4-10 Variation in CPU utilization with computational intensity (1 HDD)

The benchmark also provides an option of varying the compute intensity by changing the number of operations performed on each element in the data stream. The data element can be of any type supported by the C language (refer to Figure 3-4 for more details). Figure 4-10, Figure 4-11 and Figure 4-12 show the variation in the CPU utilization as a result of changing the number of operations per element. The length of the bar graph represents the overall application runtime. The "Idle CPU" part of the graph represents the average time for which the compute threads remained blocked waiting for data.

34

„‡‡ ˆ‹ŽŽ‡†Ǥ Š‡ Dz˜‡”Žƒ’dz ’‘”–‹‘ ‘ˆ –Š‡ ‰”ƒ’Š •Š‘™• –Š‡ ’ƒ”– ‘ˆ –Š‡ ƒ’’Ž‹ ƒ–‹‘ runtime when computation and I/O are overlapped.
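For example, in the sample run listed in Appendix A-4 the overall runtime was 32.54 seconds, with 4.10 seconds of compute-thread idle time and 1.08 seconds of I/O-thread idle time; assuming the three segments stack to the total runtime, the overlapped portion works out to roughly 32.54 - 4.10 - 1.08 ≈ 27.4 seconds.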

[Stacked bar chart; series: Idle CPU, Idle I/O, Overlap; y-axis: Application Runtime (Seconds), 0 to 45; x-axis: Number of Operations per Element, 1 to 14]

Figure 4-11 Variation in CPU utilization with computational intensity (2 HDDs)

[Stacked bar chart; series: Idle CPU, Idle I/O, Overlap; y-axis: Application Runtime (Seconds), 0 to 70; x-axis: Number of Operations per Element, 1 to 14]

Figure 4-12 Variation in CPU utilization with computational intensity (3 HDDs)

For the above tests, the compute threads processed the data buffer as an array of double-precision floating point numbers, applying simple addition operations. Initially, the CPU is underutilized since the compute threads spend the greater part of the runtime waiting for data, as shown by the significant Idle CPU time. As the number of operations increases, the CPU utilization improves, shown by the growth in the Overlap portion. Beyond a certain threshold the balance tilts and the application becomes compute bound, which is indicated by an increase in the Idle I/O time.

[Stacked bar chart; series: Idle CPU, Idle I/O, Processing; y-axis: Application Runtime (Seconds), 0 to 160; x-axis: Number of Operations per Element, 1 to 4, grouped by data type (32-bit Integer, 32-bit float, 8-bit character)]

Figure 4-13 Variation in CPU utilization using different data types

Using a single disk, the CPU remained severely underutilized even when the number of operations per element was very high. For example, even with 12 operations the CPU remains idle for approximately 50% of the runtime. Simply by introducing another disk, the CPU utilization improved drastically: the idle period dropped to less than 10% for the same number of operations. If yet another disk is added, the application becomes compute bound. Using 3 disks, the optimal value is around 6 operations per data element. Each element is a double-precision floating point number, which is 8 bytes long. Therefore, in order to keep the CPU sufficiently busy, the optimal number of operations required for every byte of data being read is less than 1.
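For example, at the 3-disk optimum of about 6 operations per 8-byte element, this works out to 6/8 = 0.75 operations per byte of data read.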

This is evident in Figure 4-13, where only the data type was changed, keeping the number of disks constant at 3. When the buffer was processed as an array of characters (1 byte), even a single operation per element seemed to keep the CPU completely utilized; on the other hand, the I/O threads remained in wait states for almost 50% of the time. This is well within the range of data-intensive applications, which seldom perform heavy-duty numerical computation and instead involve activities like sorting, searching, indexing and information retrieval: usually string or integer based operations.

4.4 Results of the Distributed (MapReduce) Benchmarks

The TestDFSIO and Terasort benchmarks were run on the EDIM1 cluster with datasets of 20GB each. The node count was increased progressively to study its effect on the overall throughput of the system. HDFS data nodes were configured to use the 3 SATA HDDs to store the blocks, while the namenode used the SSD for storing the namespace and other metadata. The jobtracker and namenode were configured on the same host while the remaining nodes served as slaves.

[Chart; series: Write, Read; y-axis: Aggregate Throughput (MB/sec), 0 to 180; x-axis: Number of Nodes, 1 to 5]

Figure 4-14 Variation in aggregate throughput with the number of nodes


For the TestDFSIO benchmark the dataset was divided into 20 files of 1GB each. The throughput scaled reasonably as shown in Figure 4-14, but it was not linear. One of the reasons for this is the nature of the test itself. The benchmark does not perform any processing of data; it simply reads from and writes to the distributed file system. The slave nodes are required to query the namenode for block location information before they can perform the actual I/O, even if the blocks are located on the same node. This adds a level of indirection to the I/O. There is also contention because the namenode has to serve multiple requests coming in from various slave nodes. The namenode is only able to serve one request at a time and therefore has to queue the others. Both these factors cause the performance to drop.

[Chart; series: 1 × HDD, 2 × HDD, 3 × HDD; y-axis: Speed-up, 0 to 6; x-axis: Number of Nodes, 1 to 5]

Figure 4-15 Speed-up graph for the TeraSort benchmark

The TeraSort benchmark scaled much better than TestDFSIO, as shown by the speed-up graph in Figure 4-15. The sort was performed on 200,000,000 records (approximately 20GB of data) and the number of reduce tasks was set to 2 per node. For TeraSort there is a considerable amount of compute workload, in spite of it being an I/O-bound problem. It showed almost perfectly linear scaling with respect to the number of nodes.


It was expected that increasing the number of disks would improve performance, but that did not happen: the runtimes remained almost unchanged. A previous study [29] reveals that HDFS has many performance-related issues that need to be addressed in order to utilize its full potential. When a data node has multiple disks attached to it, HDFS simply writes blocks to them on a round-robin basis. Furthermore, the HDFS client code does not exploit the potential for overlap between I/O and computation, particularly during read operations, and instead uses a periodic access pattern. The study recommends decoupling I/O from computation and implementing streaming, pipelining and pre-fetching in order to improve performance. All of these require significant changes to the framework code. In the meantime, it would be interesting to see the impact of striping data across the 3 disks using RAID level 0.


Chapter 5

Conclusion

This project has been an attempt to quantify and analyze the performance characteristics of the EDIM1 cluster. It is a novel architecture that leverages principles laid down decades ago, but which still hold true. It does not try to mask technical limitations; instead, it embraces them with the intention of getting the best performance within those limits. The cluster is targeted primarily at data-intensive research projects within the University of Edinburgh, and therefore the focus of this report has been mainly on I/O throughput, which has for a while now been the root cause of most performance scalability issues. The ATOM processor turns out to be well suited to low-intensity computations and brings CPU performance within the reach of the current generation of I/O controllers. The results from the tests showed that the performance gap can indeed be bridged by interconnecting a number of Amdahl-balanced blades: nodes with low-powered processors reading from high-throughput disks. It was also shown that the disk throughput can be scaled up to attain the required balance by using multiple disks. The actual Amdahl number for this configuration was calculated using the synthetic benchmark, and it almost coincided with the theoretical value calculated from the hardware datasheets. The Amdahl IOPS ratio, though, was not up to the mark. However, this is not a stringent requirement for modern machines, where main memory is large enough to compensate for a low IOPS value.

The project also evaluated Hadoop, one of the most popular frameworks for scalable data-intensive computing, to study the impact of an Amdahl-balanced configuration on distributed performance. It turned out that Hadoop, particularly HDFS, was not able to exploit the increased aggregate I/O throughput. It accessed the drives serially, which defeated the purpose of attaching multiple disks to a node. It seems HDFS prioritizes availability and reliability over performance. The configuration did, however, scale well with the number of nodes, and that would eventually translate into improved performance. The purpose of using the MapReduce programming model is not to extract higher performance from a fixed configuration for a given problem size, but to be able to scale up both of these parameters and still deliver results within reasonable time frames. To this end, it serves its purpose well.

Nonetheless, the Amdahl-balanced configuration is ideal for data-intensive applications; not only is this model scalable, it also has the potential to bring down power consumption costs considerably. The advent of SSDs has been a big leap for I/O technology, which was severely lagging behind processor performance. Some issues related to SSD performance were also discussed, along with ways to address them. As SSD technology matures, the price per GB is sure to drop and it may become possible to use SSDs on a larger scale to boost I/O performance. That would bring down power consumption further, and the improvement in per-disk performance would also benefit MapReduce-style applications.


Chapter 6

Further Work

There is still plenty of scope for further exploration in this area, since the project brings together several relatively new technologies and attempts to make them work in concert. Firstly, we came across performance issues when writing large files to the SSD: the random-access write performance was better than the sequential write performance, which is somewhat counter-intuitive. The prescribed solution to this problem is to create partitions aligned to 4 KB boundaries. This requires further testing and verification.
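As a sketch of such a test (the device name is only an example), a partition starting at 1 MiB is guaranteed to be 4 KB-aligned; it could be created with parted and the IOzone write tests repeated on it:

$ parted -a optimal /dev/sdd mklabel gpt
$ parted -a optimal /dev/sdd mkpart primary 1MiB 100%

Note that mklabel destroys the existing partition table, so this should only be done on a drive whose contents can be discarded.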

There is also scope for improvement in the sequential read performance by implementing RAID level 0, that is, by striping files across all available disks. The synthetic benchmark already showed that a high aggregate throughput is attainable, and it would be interesting to see whether RAID has any impact on the performance of HDFS. The TestDFSIO benchmark could possibly yield better numbers. On that note, the TeraSort benchmark scaled linearly up to 5 nodes for a 20 GB dataset. An article [30] reports that a 100GB dataset was sorted in 236 seconds (roughly 4 minutes) using 16 nodes of Sun Fire X2270 M2 servers, each attached to a single SATA disk. What remains to be seen is the number of Amdahl blades from the EDIM1 cluster required to sort a 100GB dataset in a similar time frame. A comparison of the two results based on cost of equipment, power consumption and simply the aggregate number of clock ticks used could provide many insights into the advantages and limitations of this configuration.
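As a sketch of how such a configuration could be set up (device names and mount point are only examples), the three HDDs could be combined into a software RAID-0 array with mdadm, formatted and mounted before repeating the I/O benchmarks:

$ mdadm --create /dev/md0 --level=0 --raid-devices=3 /dev/sdb /dev/sdc /dev/sdd
$ mkfs.ext4 /dev/md0
$ mount /dev/md0 /state/raid0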

Hadoop is also able to work with file systems other than HDFS. Given the performance limitations in HDFS, Hadoop MapReduce applications could be run on top of high-performance file systems like Lustre or Sector. Lustre in particular implements RAID level 0 internally and stripes data not just across disks, but across multiple nodes, similar to the way HDFS distributes blocks. By plugging in a layer which allows Hadoop to communicate with the Lustre file system APIs, information regarding the location of stripes can be conveyed to the MapReduce framework to help with task scheduling on slave nodes.


Appendix A Results of Tests

A-1 Output of the LINPACK Benchmark

Number of equations to solve (problem size): 10000
Leading dimension of array: 10000
Number of trials to run: 1
Data alignment value (in Kbytes): 4
Current date/time: Thu Aug 11 07:56:22 2011

CPU frequency: 1.448 GHz
Number of CPUs: 1
Number of cores: 2
Number of threads: 4

Parameters are set to:

Number of tests : 1
Number of equations to solve (problem size) : 10000
Leading dimension of array : 10000
Number of trials to run : 1
Data alignment value (in Kbytes) : 4

Maximum memory requested that can be used = 800204096, at the size = 10000

======Timing linear equation system solver ======

Size   LDA    Align.  Time(s)   GFlops  Residual      Residual(norm)
10000  10000  4       908.375   0.7341  9.915883e-11  3.496441e-02

Performance Summary (GFlops)

Size   LDA    Align.  Average  Maximal
10000  10000  4       0.7341   0.7341

End of tests


A-2 Output of the STREAM Benchmark

-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 10000000, Offset = 0
Total memory required = 228.9 MB.
Each test is run 10 times, but only the *best* time for each is used.
-------------------------------------------------------------
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 2 microseconds.
Each test below will take on the order of 83662 microseconds.
   (= 41831 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time   Min time   Max time
Copy:         2215.4501     0.0725     0.0722     0.0735
Scale:        2209.4253     0.0726     0.0724     0.0729
Add:          2575.9648     0.0939     0.0932     0.0973
Triad:        2575.3585     0.0936     0.0932     0.0947
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------


A-3 Output of the FLOPS Benchmark

FLOPS Benchmark on EDIM1 (ATOM N330)

FLOPS C Program (Double Precision), V2.0 18 Dec 1992

Module   Error         RunTime   MFLOPS
  1      -7.6739e-13   0.0673    207.8720
  2      -5.7021e-13   0.0460    152.3135
  3      -2.4314e-14   0.0522    325.3840
  4       6.8501e-14   0.0491    305.5114
  5      -1.6320e-14   0.1265    229.2202
  6       1.3961e-13   0.0585    495.3388
  7      -3.6152e-11   0.1322     90.7832
  8       9.0483e-15   0.0950    315.6428

Iterations      = 256000000
NullTime (usec) = 0.0000
MFLOPS(1)       = 184.3741
MFLOPS(2)       = 172.2405
MFLOPS(3)       = 251.2987
MFLOPS(4)       = 356.9553

FLOPS Benchmark on NESS Front-end (AMD Opteron)

FLOPS C Program (Double Precision), V2.0 18 Dec 1992

Module   Error         RunTime   MFLOPS
  1       4.0146e-13   0.0094   1492.0061
  2      -1.4166e-13   0.0090    781.2871
  3       4.7184e-14   0.0080   2112.4296
  4      -1.2557e-13   0.0074   2029.3709
  5      -1.3800e-13   0.0161   1797.1992
  6       3.2380e-13   0.0139   2085.7102
  7      -8.4583e-11   0.0219    548.2632
  8       3.4867e-13   0.0145   2071.5122

Iterations      = 512000000
NullTime (usec) = 0.0000
MFLOPS(1)       = 984.0010
MFLOPS(2)       = 1067.3262
MFLOPS(3)       = 1600.3117
MFLOPS(4)       = 2076.4229


FLOPS Benchmark on a laptop (Intel Core i3)

FLOPS C Program (Double Precision), V2.0 18 Dec 1992

Module   Error         RunTime   MFLOPS
  1      -8.1208e-11   0.0099   1418.0020
  2      -1.5485e-13   0.0134    520.9302
  3       1.5740e-13   0.0090   1884.7986
  4       9.3701e-14   0.0081   1857.7649
  5      -4.6208e-13   0.0147   1970.5375
  6       3.9450e-13   0.0142   2038.1606
  7      -6.6161e-13   0.0318    377.6043
  8       4.8178e-13   0.0158   1893.4911

Iterations      = 512000000
NullTime (usec) = 0.0009
MFLOPS(1)       = 682.3517
MFLOPS(2)       = 830.4681
MFLOPS(3)       = 1410.1490
MFLOPS(4)       = 1929.3553

A-4 Sample Output of Amdahl Benchmark

$ time ./amdahl --threads=2 --operations=1 \
>     /state/mscs1052544/sdb/data /state/mscs1052544/sdc/data

Clock Frequency: 1.60 GHz
Overall Run Time: 32.54 seconds
Idle Time For Compute Threads: 4.10 Seconds
Idle Time For I/O Threads: 1.08 Seconds
I/O Throughput: 261.87 MB/sec

Amdahl Number: 0.686474
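This value appears consistent with interpreting the Amdahl number as bits of I/O per second divided by instructions per second, assuming one instruction per clock cycle on each of the two cores and 1 MB = 2^20 bytes:

Amdahl number ≈ (261.87 × 2^20 × 8) / (2 × 1.6 × 10^9) ≈ 0.686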


A-5 IOzone results for SSD – Sequential Read (throughput in KB/sec)

File size:   16M  32M  64M  128M  256M
4k      919720  1043268  918130  1043940  1015885
8k      1258464  1231324  1074468  1232885  1228745
16K     1120582  1303013  1128547  1307464  1301920
32K     1412649  1406833  1404725  1402607  1406646
64K     1453371  1217547  1461162  1460167  1466049
128K    1461158  1225385  1459633  1460039  1467837
256K    1278506  1260493  1104247  1277156  1276236
512K    1030898  1018558  912085  1029403  995103
1M      900367  874189  809226  895864  895991
2M      783022  869339  869707  872011  872899
4M      880479  784263  874967  874349  873746
8M      782835  785914  874255  873149  874988
16M     883422  784917  869465  875400  870750

File size:   512M  1G  2G  4G  8G
4k      1065277  1063990  1065073  267053  251836
8k      1263758  739581  1312207  265833  266282
16K     1312906  1355968  1386937  267020  265744
32K     1429371  1455623  1473557  274589  265664
64K     1494119  1479624  1566336  266267  266118
128K    1500697  1532535  1595284  265349  266239
256K    1304269  1330897  1367007  276513  265787
512K    1040691  1064484  1083698  266534  266143
1M      908073  921089  931688  266187  266582
2M      886394  744814  909264  280673  265866
4M      884556  898795  910547  281895  266486
8M      888384  898874  906750  266511  265310
16M     887606  899677  909671  281446  264701


A-6 IOzone results for SSD – Sequential Write (throughput in KB/sec)

File size:   16M  32M  64M  128M  256M
4k      78277  78759  78674  78722  78959
8k      86252  85032  85463  85462  85241
16K     89623  88532  89594  63430  89234
32K     92113  91622  91446  90465  91057
64K     90482  90373  92132  91982  91536
128K    92082  91714  90998  91840  91202
256K    91899  91486  91437  90516  90876
512K    92332  91628  91451  91724  91679
1M      89698  91622  91991  90880  91645
2M      92235  92186  91705  91532  91580
4M      92389  91534  92116  92058  91738
8M      92220  92194  91772  92037  91501
16M     92009  91333  92045  91011  91490

File size:   512M  1G  2G  4G  8G
4k      93120  109030  116838  119678  118808
8k      101926  116926  127311  132744  133361
16K     105254  127109  136418  145118  144074
32K     108708  133452  140634  146577  148737
64K     115371  134601  142272  148483  151182
128K    113869  134103  141286  146743  149704
256K    108746  133113  140410  148884  147497
512K    113770  132335  150526  150320  150514
1M      113854  134695  142044  150866  149740
2M      115853  134680  148923  151626  152249
4M      115288  132975  148919  147294  148913
8M      109766  136061  152466  149145  151687
16M     114649  135482  149150  150150  151160


A-7 IOzone results for SSD – Random Read (throughput in KB/sec)

File size:   16M  32M  64M  128M  256M
4k      845020  830603  827193  813161  808896
8k      1083667  1053973  1081630  1063491  1045965
16K     1208981  1198541  1203094  1202783  1117089
32K     1342838  1343281  1349479  1343365  1338097
64K     1425928  1342297  1428955  1428030  1432206
128K    1447219  1451066  1448984  1448819  1449694
256K    1268500  1267433  1267817  1270313  1273836
512K    1026570  1028720  1026645  1027627  1027822
1M      890485  900496  899243  898627  891398
2M      882525  875938  878204  873626  872934
4M      880434  879323  879500  871738  875436
8M      883183  881217  874897  876195  876463
16M     882525  881031  879607  878799  877446

File size:   512M  1G  2G  4G  8G
4k      814240  805325  811094  37889  25523
8k      1065312  1074459  1075070  61225  41590
16K     1199853  1221909  1227661  90464  62872
32K     1341645  1368582  1409912  131684  90516
64K     1451513  1473998  1507753  168987  123784
128K    1480232  1517210  1565972  197839  152524
256K    1295959  1326524  1356718  219290  166407
512K    1027223  1057453  1080785  251128  203517
1M      907749  922529  932256  281234  234286
2M      886547  899189  910053  304630  253690
4M      887775  896596  898130  316930  265693
8M      888900  900141  912775  315350  271066
16M     889175  901903  913577  309887  270681


A-8 IOzone results for SSD – Random Write (throughput in KB/sec)

File size:   16M  32M  64M  128M  256M
4k      121211  123290  117731  118470  119022
8k      139074  137908  133473  132171  132826
16K     138088  143696  141155  140885  140704
32K     151528  152835  145331  145765  145047
64K     149094  145480  150310  144981  111001
128K    149747  149622  148462  145774  145336
256K    145239  145235  145830  145458  144143
512K    152857  147942  143193  145260  143456
1M      151000  147969  146356  146632  142646
2M      148757  149198  146460  145283  147427
4M      147242  145482  146791  145581  147236
8M      150385  145081  149508  146672  146927
16M     152619  143420  149905  144541  147080

File size:   512M  1G  2G  4G  8G
4k      138030  145702  131559  127116  122984
8k      148231  169160  170143  159622  153085
16K     155037  177389  182351  177025  183356
32K     159492  179904  187901  190423  190717
64K     159608  178698  187947  185319  187095
128K    160747  184223  192374  192603  193251
256K    161882  182861  192765  191057  196148
512K    161541  185285  192367  196895  197724
1M      159063  183941  194595  191850  197169
2M      160617  185710  194902  197120  196939
4M      159758  184124  194027  192500  197533
8M      160034  185413  193787  198170  197996
16M     160591  182324  193656  191893  198311


A-9 IOzone results for HDD – Sequential Read (throughput in KB/sec)

File size:   16M  32M  64M  128M  256M
4k      662811  791535  877123  985303  1049869
8k      662811  791517  987388  1213115  1260120
16K     663291  792759  1129071  1213225  1312768
32K     663534  991982  1129447  1314941  1432480
64K     665178  993480  1130850  1315561  1501243
128K    667971  996629  1132252  1316632  1502102
256K    673708  799468  992970  1217429  1315243
512K    508977  808165  887241  991503  1053430
1M      523817  683650  806477  886255  905330
2M      557377  710907  747036  849125  885597
4M      481485  646616  780152  824273  871947
8M      468327  634951  771329  819568  893914
16M     575362  612888  755221  854172  888575

File size:   512M  1G  2G  4G  8G
4k      1073273  1050436  1089106  129932  127282
8k      1332577  1337041  1353985  129957  127229
16K     1390226  1395929  1416621  129919  127249
32K     1499675  1405629  1524139  129920  127231
64K     1551359  1582591  1562241  129961  127177
128K    1569602  1579109  1661905  129994  127231
256K    1382877  1346176  1429987  129945  127246
512K    1084377  1110032  1090745  129882  127373
1M      841228  947635  961290  130028  130017
2M      885968  910422  925754  129924  127292
4M      900338  916437  926763  130035  126242
8M      868800  878282  928377  130157  127326
16M     856101  889744  923311  130291  127449


A-10 IOzone results for HDD – Sequential Write (throughput in KB/sec)

File size:   16M  32M  64M  128M  256M
4k      47830  57819  54613  48547  48250
8k      56158  52423  56987  51235  51070
16K     57809  64452  58253  51571  53960
32K     59561  54610  59129  51912  53410
64K     59567  65530  54995  51912  54053
128K    53131  66644  59579  56378  55095
256K    57828  66650  59134  57830  53321
512K    57850  65552  57834  57202  55290
1M      57889  65579  55013  56793  52523
2M      57977  66748  59620  52447  53419
4M      58140  63615  59665  57447  53428
8M      53689  60852  60210  52482  55521
16M     59735  63090  59930  56332  52941

File size:   512M  1G  2G  4G  8G
4k      63424  68889  68556  68960  68826
8k      65539  68426  70279  69036  69239
16K     65744  71857  71133  69452  69224
32K     67147  71420  69522  69594  68860
64K     66298  68896  69967  69493  68816
128K    67005  70894  70833  69369  69536
256K    69909  69852  70162  69474  69686
512K    66230  70809  70832  69102  69689
1M      64533  70730  69832  69551  69524
2M      65751  70493  70557  69600  69208
4M      66172  71234  70421  69019  69031
8M      67973  72720  69684  69143  69244
16M     71723  69170  70669  69855  69397


A-11 IOzone results for HDD – Random Read (throughput in KB/sec)

File size:   16M  32M  64M  128M  256M
4k      857484  832051  842992  829249  819330
8k      1141428  1117409  1085140  1112344  1102307
16K     1234567  1246128  1245884  1252574  1266396
32K     1383080  1380632  1412688  1443873  1444375
64K     1483523  1480372  1525617  1531895  1557850
128K    1490279  1499822  1544858  1572304  1601720
256K    1294546  1315610  1347752  1362992  1391431
512K    1070500  1071232  1094730  1100639  1115382
1M      928694  927877  938102  946997  954490
2M      906842  905145  914964  920838  928452
4M      905994  872579  915997  922313  928334
8M      907801  907296  913088  922768  919793
16M     908197  906997  917796  922990  930209

File size:   512M  1G  2G  4G  8G
4k      796320  795919  806929  1439  777
8k      1112826  1040385  1099298  2739  1515
16K     1264804  1246251  1246025  5158  2909
32K     1459680  1492413  1463903  9194  5433
64K     1585614  1599875  1605738  15465  9788
128K    1638589  1656300  1617663  25399  15793
256K    1419262  1314718  1427422  38669  28202
512K    1125610  1129463  1098654  63821  46903
1M      959766  930326  959161  94144  69469
2M      901372  937511  937953  124246  92649
4M      902465  938525  935592  148792  111391
8M      918153  937806  934209  162456  124248
16M     895926  910441  941299  168204  129074


A-12 IOzone results for HDD – Random Write (throughput in KB/sec)

File size:   16M  32M  64M  128M  256M
4k      56548  57517  56492  56359  54828
8k      60136  60198  59310  59178  57979
16K     61657  61571  61494  60699  58371
32K     61133  62038  62361  61530  60988
64K     52701  58970  62053  61516  62504
128K    60900  62723  62027  61432  59353
256K    54848  57907  61328  61468  60512
512K    61211  62961  61221  61611  60669
1M      52589  61297  61497  61822  61230
2M      61507  62806  60911  61645  61619
4M      62738  62591  62169  60832  60978
8M      56027  62667  61812  61663  61155
16M     62265  62182  59231  61246  59437

File size:   512M  1G  2G  4G  8G
4k      62933  52875  31889  24698  18699
8k      68406  61143  46806  31198  19629
16K     71214  68161  53649  32399  19647
32K     72013  70453  51988  33825  21977
64K     67193  70843  53308  36319  27509
128K    68782  72452  57184  45738  40843
256K    67634  73143  63938  55736  52307
512K    67791  78103  72409  69233  67063
1M      72427  80391  81757  78413  77502
2M      71181  77223  85947  88259  88373
4M      70884  75578  80096  86439  89602
8M      71325  72246  75016  75320  75853
16M     70384  72060  72844  74092  74809


Appendix B Compiling and Running

B-1 LINPACK Benchmark

The precompiled binaries for the Intel Optimized LINPACK Benchmark can be downloaded from: http://software.intel.com/en-us/articles/intel-math-kernel-library-- download/

$ tar xvf l_lpk_p_10.3.4.008.tgz
$ linpack_10.3.4/benchmarks/linpack/xlinpack_xeon64

Input data or print help ? Type [data]/help :

Number of equations to solve (problem size): 10000
Leading dimension of array: 10000
Number of trials to run: 1
Data alignment value (in Kbytes): 4
Current date/time: Thu Aug 11 07:56:22 2011

B-2 FLOPS Benchmark

The C source code for the FLOPS benchmark is available at: http://home.iae.nl/users/mhx/flops.c

$ gcc -o flops flops.c -g -O4 -DUNIX -DROPT
$ ./flops

B-3 STREAM Benchmark

The C source code for the STREAM benchmark is available at: http://www.cs.virginia.edu/stream/FTP/Code/stream.c

$ gcc -o stream stream.c -g -O4
$ ./stream


B-4 IOzone Benchmark

The C source code for the IOzone benchmark is available at: http://www.iozone.org/src/current/iozone3_397.tar

$ tar xvf iozone3_397.tar
$ cd iozone3_397/src/current
$ make linux-AMD64
$ ./iozone -Raz -i 0 -i 1 -n 16M -g 8G -y 4K -q 16M
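For reference, the options select an automatic sweep over the file and record sizes listed in Appendix A: -R produces an Excel-style report, -a enables automatic mode, -z includes the small record sizes even for large files, -i 0 and -i 1 select the write/rewrite and read/re-read tests, -n and -g set the minimum and maximum file size, and -y and -q set the minimum and maximum record size.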

B-5 Amdahl Benchmark

The C source code and make file are provided as a g-zipped tar-ball:

$ make OS=UNIX DATATYPE=double OPERATOR='*'
$ ./amdahl --threads=2 --operations=1 file1 [file2 ...]

The compilation requires the platform (operating system), the data type (each element in the array will be processed as an element of the specified type) and the operator (the operation to be performed on the data) to be supplied as command-line arguments to "make". When running the program, the following options are available:

--threads = number of threads to use. Ideally should be equal to the number of cores

--operations = number of operations to be performed on each element in the array

--bufsize = size of the read buffer

--filesize = the amount of data (in bytes) to be read from each file

The filesize and bufsize values can also be specified using the qualifiers K (kilobytes), M (megabytes) or G (gigabytes).

At least one file must be specified for reading. The file must already exist on the file system. Multiple files can be specified, one for each disk to be used.
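As an illustration (the buffer size and operation count below are arbitrary), a run that uses both cores and reads 4 GB from a file on each of two disks could be invoked as:

$ ./amdahl --threads=2 --operations=4 --bufsize=64M --filesize=4G \
>     /state/mscs1052544/sdb/data /state/mscs1052544/sdc/data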


B-6 Hadoop TestDFSIO benchmark

$ bin/hadoop jar hadoop-test-0.20.203.0.jar \
>     TestDFSIO -Dmapred.reduce.tasks=10 \
>     -write -nrFiles 20 -fileSize 1000

$ bin/hadoop jar hadoop-test-0.20.203.0.jar \
>     TestDFSIO -Dmapred.reduce.tasks=10 \
>     -read -nrFiles 20 -fileSize 1000

B-7 Hadoop Terasort benchmark

The terasort dataset must be generated before the benchmark can be run.

$ bin/hadoop jar hadoop-examples-0.20.203.0.jar \
>     teragen -Dmapred.reduce.tasks=10 \
>     200000000 /user/terasort-input

Since we would not want to repeat this step every time we run the benchmark, we can generate the dataset once and save it to the local file system.

$ bin/hadoop fs -get \
>     /user/terasort-input /state/partition1/terainput

In case we reformat the distributed file system, we can load the data saved locally.

$ bin/hadoop fs -put \
>     /state/partition1/terainput /user/terasort-input

The Hadoop terasort benchmark can be run as follows:

$ bin/hadoop jar hadoop-examples-0.20.203.0.jar \
>     terasort -Dmapred.reduce.tasks=10 \
>     /user/terasort-input /user/terasort-output

For validating the output, we run teravalidate, which should output empty files:

$ bin/hadoop jar hadoop-examples-0.20.203.0.jar \
>     teravalidate -Dmapred.reduce.tasks=10 \
>     /user/terasort-output /user/terasort-validate


Appendix C Hadoop Configuration

C-1 core-site.xml

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://hadoop-jobtracker-1-1:54324</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/state/partition1/test/tmp</value>
  </property>
</configuration>

C-2 mapred-site.xml

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>hadoop-jobtracker-1-1:54325</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/state/mscs1052544/sdb/test/local</value>
    <final>true</final>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>/mapred/system</value>
    <final>true</final>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1024m</value>
    <final>true</final>
  </property>
</configuration>


C-3 hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.block.size</name>
    <value>67108864</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/state/partition1/test/hdfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/state/mscs1052544/sdb/test/hdfs/data</value>
  </property>
</configuration>
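(The dfs.block.size value of 67108864 bytes corresponds to an HDFS block size of 64 MB.)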

C-4 slaves

hadoop-slave-1-2
hadoop-slave-1-3
hadoop-slave-1-4
hadoop-slave-1-5
hadoop-namenode-1-0

C-5 hadoop-env.sh

export JAVA_HOME=/usr/java/jdk1.6.0_16
export HADOOP_HEAPSIZE=1500


References

[1] Szalay, A., Bell, G., Vandenberg, J., Wonders, A., Burns, R., Fay, D., et al. (2009). GrayWulf: Scalable Clustered Architecture for Data Intensive Computing. 42nd Hawaii International Conference on System Sciences (pp. 1-10). IEEE Computer Society Washington, DC, USA.

[2] Kouzes, R., Anderson, G., Elbert, S., Gorton, I., & Gracio, D. K. (2009). The Changing Paradigm of Data-Intensive Computing. Computer , 42, 26-34.

[3] Amdahl, G. (2007). Computer Architecture and Amdahl's Law. IEEE Solid State Circuits Society News , 4-9.

[4] Bell, G., Gray, J., & Szalay, A. (2006). Petascale Computational Systems: Balanced CyberInfrastructure in a Data-Centric World. Computer , 39, 110-112.

[5] Barclay, T., Chong, W., & Gray, J. (2004). TerraServer Bricks Ȃ A High Availability Cluster Alternative. Microsoft Research.

[6] Szalay, A. S., Bell, G. C., Huang, H. H., Terzis, A., & White, A. (2010). Low-power amdahl-balanced blades for data intensive computing. ACM SIGOPS Operating Systems Review , 71-75.

[7] Dean, J., & Ghemawat, S. (2004). MapReduce: Simplified Data Processing on Large Clusters. OSDI '04: Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation (pp. 10-10). Berkeley, CA, USA: Usenix Association.

[8] Simmhan, Y., Barga, R., van Ingen, C., Nieto-Santisteban, M., Dobos, L., Li, N., et al. (2009). GrayWulf: Scalable Software Architecture for Data Intensive Computing. Hawaii International Conference on System Sciences (HICSS) (pp. 1-10). IEEE Computer Society Washington, DC, USA.


[9] Szalay, A., Bell, G., & Hey, T. (2009, March 6). Beyond the Data Deluge. Science Magazine , 323, pp. 1297-1298.

[10] Szalay, A., & Blakeley, J. (2009). Gray's Laws: Database-centric Computing in Science. In The Fourth Paradigm: Data-Intensive Scientific Discovery (pp. 5-11). Microsoft Research.

[11] Lin, J., & Dyer, C. (2010). Data-Intensive Text Processing with MapReduce. Maryland: Morgan & Claypool.

[12] Apache Software Foundation. (2011, July 13). Retrieved from Hadoop: http://hadoop.apache.org/

[13] Ekker, N., Coughlin, T., & Handy, J. (2009). Solid State Storage 101 - An introduction to Solid State Storage. San Francisco, CA: Storage Networking Industry Association (SNIA).

[14] CPU Product Benchmarks. (n.d.). Retrieved August 02, 2011, from AnandTech: http://www.anandtech.com/bench/Product/91?vs=70

[15] Intel Corporation. (n.d.). Intel® Atom™ Processor 330. Retrieved August 02, 2011, from Intel: http://ark.intel.com/products/35641

[16] Intel Corporation. (n.d.). Intel® Pentium® Processor E2140 . Retrieved August 02, 2011, from Intel: http://ark.intel.com/products/29738

[17] Rocks Homepage. (n.d.). Retrieved from Rocks Clusters: http://www.rocksclusters.org/wordpress/

[18] Hitachi. (2010, October 28). Hitachi Deskstar 7K3000 - Hard Disk Drive Specification. Retrieved from Hitachi Global Storage Technologies: http://www.hitachigst.com/tech/techlib.nsf/techdocs/6487C1BF0107A2E58825781D0033EE6E/$file/DS7K3000-2TB_andUnder_OEM_manual_1.0.pdf


[19] Micron Technology, Inc. (2010, May 02). RealSSD™ C300 Technical Specification. Retrieved from Crucial: http://www.crucial.com/pdf/Datasheets-letter_C300_RealSSD_v2-5-10_online.pdf

[20] Top500. (n.d.). The Linpack Benchmark. Retrieved from Top500: www.top500.org/project/linpack

[21] Aburto, A. (1992, December 18). The FLOPS Benchmark. Retrieved from http://home.iae.nl/users/mhx/flops.c

[22] McCalpin, J. (n.d.). Stream Benchmark. Retrieved from http://www.streambench.org/

[23] O'Malley, O. (2008, May). TeraByte Sort on Apache Hadoop. Retrieved from http://www.hpl.hp.com/hosted/sortbenchmark/YahooHadoop.pdf

[24] Anand, A. (2009, May 11). Hadoop Sorts a Petabyte in 16.25 Hours and a Terabyte in 62 Seconds. Retrieved from http://developer.yahoo.com/blogs/hadoop/posts/2009/05/hadoop_sorts_a_petabyte _in_162/

[25] Gray, J. (n.d.). Sort Benchmark Home Page. Retrieved from http://sortbenchmark.org/

[26] A Measure of Transaction Processing Power. (1985). Datamation, 31 (7), 112-118.

[27] Fog, A. (2011, June 08). Instruction tables: Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD and VIA CPUs. Retrieved from Agner.org.

[28] Shimpi, A. L. (2009, March 03). The Anatomy of an SSD. Retrieved from AnandTech: http://www.anandtech.com/show/2738/5


[29] Shafer, J., Rixner, S., & Cox, A. (2010, April 19). The Hadoop Distributed Filesystem: Balancing Portability and Performance. International Symposium on Performance Analysis of Systems & Software (ISPASS) , 122-133.

[30] Oracle Corp. (2010, September 27). Sun Fire X2270 M2 Super-Linear Scaling of Hadoop Terasort and CloudBurst Benchmarks. Retrieved from Oracle Blogs: http://blogs.oracle.com/BestPerf/entry/20090920_x2270m2_hadoop

[31] Aggarwal, A., & Vitter, J. (1988, September). The Input/Output Complexity of Sorting and Related Problems. Communications of the ACM , 1116-1127.
