A Dwarf-Based Scalable Big Data Benchmarking Methodology
Wanling Gao¹,², Lei Wang¹,², Jianfeng Zhan¹,²∗, Chunjie Luo¹,², Daoyi Zheng¹,², Zhen Jia⁴, Biwei Xie¹,², Chen Zheng¹,², Qiang Yang⁵,⁶, and Haibin Wang³

¹ State Key Laboratory of Computer Architecture (Institute of Computing Technology, Chinese Academy of Sciences), {gaowanling, wanglei_2011, zhanjianfeng, luochunjie, zhengdaoyi, xiebiwei, zhengchen}@ict.ac.cn
² University of Chinese Academy of Sciences
³ Huawei, [email protected]
⁴ Princeton University, [email protected]
⁵ Beijing Academy of Frontier Science & Technology, [email protected]
⁶ Chinese Industry, Intelligence and Information Data Technology Corporation

arXiv:1711.03229v1 [cs.AR] 9 Nov 2017

ABSTRACT

Different from the traditional benchmarking methodology that creates a new benchmark or proxy for every possible workload, this paper presents a scalable big data benchmarking methodology. Among a wide variety of big data analytics workloads, we identify eight big data dwarfs, each of which captures the common requirements of each class of unit of computation while being reasonably divorced from individual implementations. We implement the eight dwarfs on different software stacks, e.g., OpenMP, MPI, and Hadoop, as the dwarf components. For the purpose of architecture simulation, we construct and tune big data proxy benchmarks using directed acyclic graph (DAG)-like combinations of the dwarf components with different weights to mimic the benchmarks in BigDataBench. Our proxy benchmarks preserve the micro-architecture, memory, and I/O characteristics, and they shorten the simulation time by 100s of times while maintaining the average micro-architectural data accuracy above 90 percent on both X86_64 and ARMv8 processors. We will open-source the big data dwarf components and proxy benchmarks soon.

[Figure 1: Dwarf-based Scalable Big Data Benchmarking Methodology. Big Data Dwarfs (abstractions of frequently-appearing units of computation) are implemented on specific runtime systems as Big Data Dwarf Components (computation logic, memory access, I/O behaviors); a DAG-like combination with different weights (tuning) yields Big Data Proxy Benchmarks whose micro-architectural behaviors (pipeline efficiency, instruction mix, branch prediction, speedup, cache behaviors) and system behaviors (memory access, disk I/O bandwidth) mimic Big Data Workloads.]

1. INTRODUCTION

Benchmarking is the foundation of system and architecture evaluation. A benchmark is run on several different systems, and the performance and price of each system is measured and recorded [1]. The simplest benchmarking methodology is to create a new benchmark for every possible workload. PARSEC [2], BigDataBench [3] and CloudSuite [4] are put together in this way. This methodology has two drawbacks. First, modern real-world workloads change very frequently, and we need to keep expanding the benchmark set with emerging workloads, so it is very difficult to build a comprehensive benchmark suite. Second, whether early in the design process or later in the system evaluation, it is time-consuming to run a comprehensive benchmark suite. The complex software stacks of modern workloads aggravate this issue. For example, running big data benchmarks on simulators needs prohibitively long execution. We use benchmarking scalability¹, which can be measured in terms of the cost of building and running a comprehensive benchmark suite, to denote this challenge.

∗ The corresponding author is Jianfeng Zhan.
¹ Here, the meaning of scalable differs from scaleable. As one of four properties of domain-specific benchmarks defined by Jim Gray [1], the latter refers to scaling the benchmark up to larger systems.
For scalability, proxy benchmarks abstracted from real workloads are widely adopted. Kernel benchmarks, small programs extracted from large application programs [5, 6], are widely used in high performance computing. However, they are insufficient to completely reflect workload behaviors and have limited usefulness in making overall comparisons [7, 5], especially for complex big data workloads. Another attempt is to generate synthetic traces or benchmarks based on the workload trace to mimic workload behaviors, using methods like statistical simulation [8, 9, 10, 11] or sampling [12, 13, 14] to accelerate the trace. This approach omits the computation logic of the original workloads and has several disadvantages. First, trace-based synthetic benchmarks target specific architectures, and different architecture configurations will generate different synthetic benchmarks. Second, synthetic benchmarks cannot be deployed across different architectures, which makes cross-architecture comparisons hard. Third, a workload trace has a strong dependency on the data input, which makes it hard to mimic the real benchmark with diverse data inputs.

Overly complex workloads are not helpful for either reproducibility or interpretability of performance data. So identifying abstractions of frequently-appearing units of computation, which we call big data dwarfs, is an important step toward fully understanding big data analytics workloads. The concept of a dwarf, first proposed by Phil Colella [15], is thought to be not only a high-level abstraction of workload patterns, but also a minimum set of necessary functionality [16]. A dwarf captures the common requirements of each class of unit of computation while being reasonably divorced from individual implementations [17]². For example, transformation from one domain to another domain constitutes a class of unit of computation, such as the fast Fourier transform (FFT) and the discrete cosine transform (DCT).

² Our definition has a subtle difference from the definition in the Berkeley report [17].
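To make the abstraction concrete, the following sketch shows what a transform dwarf might look like as a standalone unit of computation: a naive O(n^2) DCT-II over a single vector, parallelized with OpenMP. This is only an illustration under our own assumptions; it is not one of the released dwarf components, and the file and function names are hypothetical.

/*
 * A minimal sketch (not a released dwarf component): the "transform"
 * dwarf illustrated as a standalone kernel, here a naive O(n^2) DCT-II
 * over one vector, parallelized with OpenMP.
 * Build e.g.: gcc -O2 -fopenmp dct_dwarf.c -lm
 */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* DCT-II: out[k] = sum_i x[i] * cos(pi/n * (i + 0.5) * k) */
static void dct2(const double *x, double *out, int n)
{
    #pragma omp parallel for schedule(static)
    for (int k = 0; k < n; k++) {
        double sum = 0.0;
        for (int i = 0; i < n; i++)
            sum += x[i] * cos(M_PI / n * (i + 0.5) * k);
        out[k] = sum;
    }
}

int main(void)
{
    int n = 1 << 14;                       /* small synthetic input size */
    double *x = malloc(n * sizeof *x);
    double *y = malloc(n * sizeof *y);
    for (int i = 0; i < n; i++)
        x[i] = sin(0.01 * i);              /* synthetic input data */
    dct2(x, y, n);
    printf("y[1] = %.3f, threads = %d\n", y[1], omp_get_max_threads());
    free(x);
    free(y);
    return 0;
}

In the methodology, such a kernel is one dwarf component among several; it would be wired into a proxy benchmark together with other components rather than run in isolation.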
Intuitively, we can build big data proxy benchmarks on the basis of big data dwarfs. Considering the complexity, diversity and fast-changing characteristics of big data analytics workloads, we propose a dwarf-based scalable big data benchmarking methodology, as shown in Fig. 1. After thoroughly analyzing a majority of workloads in five typical big data application domains (search engine, social network, e-commerce, multimedia and bioinformatics), we identify eight big data dwarfs, including matrix, sampling, logic, transform, set, graph, sort and basic statistic computations that frequently appear³, the combinations of which describe most of the big data workloads we investigated. We implement the eight dwarfs on different software stacks like MPI and Hadoop, with diverse data generation tools for text, matrix and graph data, which we call the dwarf components. As the software stack has a great influence on workload behaviors [3, 18], our dwarf component implementation considers the execution model of the software stack and the programming styles of workloads. For feasible simulation, we choose the OpenMP version of the dwarf components to construct proxy benchmarks, as OpenMP is much more lightweight than the other programming frameworks in terms of binary size and is also supported by a majority of simulators [19], such as GEM5 [20]. A node represents an original data set or an intermediate data set being processed, and an edge represents a dwarf component; we use the DAG-like combination of one or more dwarf components with different weights to form our proxy benchmarks. We provide an auto-tuning tool to generate qualified proxy benchmarks satisfying the simulation time and micro-architectural data accuracy requirements, through an iterative execution including adjusting and feedback stages.

³ We acknowledge our eight dwarfs may not be enough for other applications we failed to investigate. We will keep expanding this dwarf set.

On typical X86_64 and ARMv8 processors, the proxy benchmarks achieve about 100s of times runtime speedup with the average micro-architectural data accuracy above 90% with respect to the benchmarks from BigDataBench. Our proxy benchmarks have been applied to ARM processor design and implementation in our industry partnership. Our contributions are two-fold as follows:

• We propose a dwarf-based scalable big data benchmarking methodology, using a DAG-like combination of the dwarf components with different weights to mimic big data analytics workloads.

• We identify eight big data dwarfs, and implement the dwarf components for each dwarf on different software stacks with diverse data inputs. Finally, we construct the big data proxy benchmarks, which reduce simulation time and guarantee performance data accuracy for micro-architectural simulation.

The rest of the paper is organized as follows. Section 2 presents the methodology. Section 3 performs evaluations on a five-node X86_64 cluster. In Section 4, we report using the proxy benchmarks on the ARMv8 processor. Section 5 introduces the related work. Finally, we draw a conclusion in Section 6.

2. BENCHMARKING METHODOLOGY

2.1 Methodology Overview

In this section, we illustrate our dwarf-based scalable big data benchmarking methodology, including dwarf identification, dwarf component implementation, and proxy benchmark construction. Fig. 2 presents our benchmarking methodology.

First, we analyze a majority of big data analytics workloads through workload decomposition and conclude the frequently-appearing classes of units of computation, as big data dwarfs. Second, we provide the dwarf implementations using different software stacks, as the dwarf components.

[Figure 2 (labels, partial): Big Data Dwarf and Dwarf Components; a majority of big data analytics workloads; Workload Analysis; frequently-appearing units of computation; Dwarf.]
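As a rough illustration of the construction and tuning steps summarized above, the sketch below models a proxy benchmark as a DAG whose nodes stand for data sets and whose edges carry weighted dwarf components, wrapped in a simplified adjusting and feedback loop. It is a minimal sketch under our own assumptions: the stub kernels, the 90% threshold check and the measure_accuracy() helper are hypothetical placeholders, not the authors' auto-tuning tool.

/*
 * Minimal sketch, not the authors' released tool: a proxy benchmark
 * expressed as a DAG whose nodes are data sets and whose edges are
 * dwarf components with tunable weights, plus a simplified auto-tuning
 * loop with adjusting and feedback stages.
 */
#include <stdio.h>

typedef void (*dwarf_fn)(double weight);

/* Stand-ins for real dwarf component kernels. */
static void sort_dwarf(double w)      { printf("  sort      weight=%.2f\n", w); }
static void matrix_dwarf(double w)    { printf("  matrix    weight=%.2f\n", w); }
static void transform_dwarf(double w) { printf("  transform weight=%.2f\n", w); }

/* Edge of the DAG: connects two data-set nodes via a weighted dwarf. */
typedef struct {
    int      src, dst;   /* data-set nodes this edge connects      */
    dwarf_fn run;        /* dwarf component attached to the edge    */
    double   weight;     /* tunable weight of this component        */
} edge_t;

/* Execute edges in the listed order, assumed topologically sorted. */
static void run_proxy(edge_t *e, int n)
{
    for (int i = 0; i < n; i++)
        e[i].run(e[i].weight);
}

/* Placeholder: compare proxy hardware counters to the target workload. */
static double measure_accuracy(void) { return 0.92; }

int main(void)
{
    edge_t dag[] = {
        { 0, 1, sort_dwarf,      1.0 },
        { 0, 2, matrix_dwarf,    0.5 },
        { 1, 3, transform_dwarf, 0.8 },
    };
    int n = (int)(sizeof dag / sizeof dag[0]);

    /* Iterative execution: run, get feedback, adjust weights, repeat. */
    for (int iter = 0; iter < 10; iter++) {
        printf("iteration %d\n", iter);
        run_proxy(dag, n);
        if (measure_accuracy() >= 0.90)   /* accuracy requirement met */
            break;
        for (int i = 0; i < n; i++)       /* naive adjusting stage */
            dag[i].weight *= 1.05;
    }
    return 0;
}

A real tool would derive the execution order from the DAG, run the generated proxy on the target machine or a simulator, and feed the measured micro-architectural metrics back into the weight adjustment until both the accuracy and the simulation time requirements are met.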