arXiv:1711.03229v1 [cs.AR] 9 Nov 2017

A Dwarf-based Scalable Big Data Benchmarking Methodology

Wanling Gao 1,2, Lei Wang 1,2, Jianfeng Zhan ∗1,2, Chunjie Luo 1,2, Daoyi Zheng 1,2, Zhen Jia 3, Biwei Xie 1,2, Chen Zheng 1,2, Qiang Yang 5,6, and Haibin Wang 4

1 State Key Laboratory of Computer Architecture (Institute of Computing Technology, Chinese Academy of Sciences), {gaowanling, wanglei_2011, zhanjianfeng, luochunjie, zhengdaoyi, xiebiwei, zhengchen}@ict.ac.cn
2 University of Chinese Academy of Sciences
3 Princeton University, [email protected]
4 Huawei, benjamin.wanghaibin@.com
5 Beijing Academy of Frontier Science & Technology, [email protected]
6 Chinese Industry, Intelligence and Information Data Technology Corporation

ABSTRACT
Different from the traditional benchmarking methodology that creates a new benchmark or proxy for every possible workload, this paper presents a scalable big data benchmarking methodology. Among a wide variety of big data analytics workloads, we identify eight big data dwarfs, each of which captures the common requirements of a class of units of computation while being reasonably divorced from individual implementations. We implement the eight dwarfs on different software stacks, e.g., OpenMP, MPI, and Hadoop, as the dwarf components. For the purpose of architecture simulation, we construct and tune big data proxy benchmarks using directed acyclic graph (DAG)-like combinations of the dwarf components with different weights to mimic the benchmarks in BigDataBench. Our proxy benchmarks preserve the micro-architecture, memory, and I/O characteristics; they shorten the simulation time by 100s of times while maintaining average micro-architectural data accuracy above 90% on both X86_64 and ARMv8 processors. We will open-source the big data dwarf components and proxy benchmarks soon.

1. INTRODUCTION
Benchmarking is the foundation of system and architecture evaluation. The benchmark is run on several different systems, and the performance and price of each system is measured and recorded [1]. The simplest benchmarking methodology is to create a new benchmark for every possible workload. Most of the workloads in BigDataBench [3], CloudSuite [4], and PARSEC [2] are put together in this way. This methodology has two drawbacks. First, modern real-world workloads change very frequently, and we need to keep expanding the benchmark set with emerging workloads, so it is very difficult to build a comprehensive benchmark suite. Second, whether early in the design process or later in the evaluation, it is time-consuming to run a comprehensive benchmark suite. The complex software stacks of modern big data workloads aggravate this issue; for example, running big data benchmarks on simulators needs a prohibitively long execution time.

Figure 1: Dwarf-based Scalable Big Data Benchmarking Methodology.

∗ The corresponding author is Jianfeng Zhan.
1 Here, the meaning of scalable differs from scaleable. As one of four properties of domain-specific benchmarks defined by Jim Gray [1], the latter refers to scaling the benchmark up to larger systems.

We use the term benchmarking scalability 1 —which can be measured in terms of the cost
from differs scalable of meaning the Here, 1,2 Computation Logic ay Zheng Daoyi , Micro-architectural Behaviors Abstraction offrequently-appearing units ofcomputation 1 muigTcnlg,CieeAaeyof Academy Chinese Technology, omputing wihcnb esrdi em ftecost the of terms in measured be can —which Pipeline Efficiency Branch Prediction Cache Behaviors Instruction Mix Big Data Dwarf Components Big Data Dwarf Big Big Data Proxy Benchmarks 3 Big Data Big Data Workloads 1,2 Big Data Big Data Dwarfs hnJia Zhen , Memory Access orporation Implementation on specific DAG-like combinationDAG-like with Mimic different weights (Tuning) fst.com 4 Disk I/OBandwidthDisk ie Xie Biwei , runtime system Speedup Behavior System Behaviors Memory Access } @ict.ac.cn ecmrigscal- benchmarking I/O Behaviors 1,2 Chen , of building and running a comprehensive benchmark data, which we call the dwarf components. As software suite—to denote this challenge. stack has great influences on workload behaviors [3, For scalability, proxy benchmarks, abstracted from 18], our dwarf component implementation considers the real workloads, are widely adopted. Kernel benchmarks— execution model of software stacks and the program- small programs extracted from large application pro- ming styles of workloads. For feasible simulation, we grams [5, 6]—are widely used in high performance com- choose the OpenMP version of dwarf components to puting. However, it is insufficient to completely re- construct proxy benchmarks as OpenMP is much more flect workload behaviors and has limited usefulness in light-weight than the other programming frameworks making overall comparisons [7, 5], especially for com- in terms of binary size and also supported by a major- plex big data workloads. Another attempt is to gener- ity of simulators [19], such as GEM5 [20]. As a node ate synthetic traces or benchmarks based on the work- represents original data set or an intermediate data set load trace to mimic workload behaviors, and they pro- being processed and a edge represents the dwarf com- pose the methods like statistic simulation [8, 9, 10, 11] ponents, we use the DAG-like combination of one or or sampling [12, 13, 14] to accelerate this trace. It more dwarf components with different weights to form omits the computation logic of the original workloads, our proxy benchmarks. We provide an auto-tuning tool and has several disadvantages. First, trace-based syn- to generate qualified proxy benchmarks satisfying the thetic benchmarks target specific architectures, and dif- simulation time and micro-architectural data accuracy ferent architecture configuration will generate different requirements, through an iteratively execution includ- synthetics benchmarks. Second, synthetic benchmark ing the adjusting and feedback stages. cannot be deployed across over different architectures, On the typical X86 64 and ARMv8 processors, the which makes it hard for cross-architecture comparisons. proxy benchmarks have about 100s times runtime speedup Third, workload trace has strong dependency on data with the average micro-architectural data accuracy above input, which make it hard to mimic real benchmark 90% with respect to the benchmarks from BigDataBench. with diverse data inputs. Our proxy benchmarks have been applied to the ARM Too complex workloads are not helpful for both repro- processor design and implementation in our industry ducibility and interpretability of performance data. So partnership. 
Our contributions are two-fold as follows: identifying abstractions of frequently-appearing units of computation, which we call big data dwarfs, is an im- • We propose a dwarf-based scalable big data bench- portant step toward fully understanding big data ana- marking methodology, using a DAG-like combina- lytics workloads. The concept of dwarf, which is firstly tion of the dwarf components with different weights proposed by Phil Colella [15], is thought to be not only to mimic big data analytics workloads. a highly abstraction of workload patterns, but also a minimum set of necessary functionality [16]. A dwarf • We identify eight big data dwarfs, and implement captures the common requirements of each class of unit the dwarf components for each dwarf on differ- of computation while being reasonably divorced from ent software stacks with diverse data inputs. Fi- individual implementations [17] 2. For example, trans- nally we construct the big data proxy benchmarks, formation from one domain to another domain contains which reduces simulation time and guarantees per- a class of unit of computation, such as fast fourier trans- formance data accuracy for micro-architectural sim- form (FFT) and discrete cosine transform (DCT). In- ulation. tuitively, we can build big data proxy benchmarks on the basis of big data dwarfs. Considering complexity, The rest of the paper is organized as follows. Section diversity and fast-changing characteristics of big data 2 presents the methodology. Section 3 performs evalu- analytics workloads, we propose a dwarf-based scalable ations on a five-node X86 64 cluster. In Section 4, we big data benchmarking methodology, as shown in Fig. 1. report using the proxy benchmarks on the ARMv8 pro- After thoroughly analyzing a majority of workloads in cessor. Section 5 introduces the related work. Finally, five typical big data application domains (search engine, we draw a conclusion in Section 6. social network, e-commerce, multimedia and bioinfor- matics), we identify eight big data dwarfs, including 2. BENCHMARKING METHODOLOGY matrix, sampling, logic, transform, set, graph, sort and basic statistic computations that frequently appear 3, 2.1 Methodology Overview the combinations of which describe most of big data workloads we investigated. We implement eight dwarfs In this section, we illustrate our dwarf-based scalable on different software stacks like MPI, Hadoop with di- big data benchmarking methodology, including dwarf verse data generation tools for text, matrix and graph identification, dwarf component implementation, and proxy benchmark construction. Fig. 2 presents our bench- Jim Gray [1], the latter refers to scaling the benchmark up marking methodology. to larger systems First, we analyze a majority of big data analytics 2Our definition has a subtle difference from the definition in the Berkeley report [17] workloads through workload decomposition and con- 3We acknowledge our eight dwarfs may be not enough for clude frequently-appearing classes of units of computa- other applications we fail to investigate. We will keep on tion, as big data dwarfs. Second, we provide the dwarf expanding this dwarf set. implementations using different software stacks, as the 1 Big Data Dwarf and Dwarf Components A majority of big data Workload Analysis analytics workloads Frequently ! Dwarf Workload Decomposition appearing Components Extracting From Five classes of units Application Domains Units of Computation of computation (Dwarf Implementation)

[Figure 2, lower panel — Proxy Benchmark Construction: given a big data analytics workload, tracing and profiling identify the relevant dwarf components and their initial weights; the components are combined into a proxy benchmark, which is auto-tuned against the chosen metrics until a qualified proxy benchmark is obtained.]

Figure 2: Methodology Overview. dwarf components. Finally, we construct proxy bench- 2.2.1 Eight Big Data Dwarfs marks using the DAG-like combinations of the dwarf In this subsection, we summarize eight big data dwarfs components with different weights to mimic the real- frequently appearing in big data analytics workloads. world workloads. A DAG-like structure uses a node to Matrix Computations In big data analytics, many represent original data set or intermediate data set be- problems involve matrix computations, such as matrix ing processed, and uses a edge to represent the dwarf multiplication and matrix transposition. components. Sampling Computations Sampling plays an essen- Given a big data analytics workload, we get the run- tial role in big data processing, which can obtain an ap- ning trace and execution time through the tracing and proximate solution when one problem cannot be solved profiling tools. According to the execution ratios, we by using analytical method. identify the functions and correlate them to Logic Computations We name computations per- the code fragments of the workload through bottom- forming bit manipulation as logic computations, such up analysis. Then we analyze these code fragments to as hash, data compression and encryption. choose the specific dwarf components and set their ini- Transform Computations The transform compu- tial weights according to execution ratios. Based on tations here mean the conversion from the original do- the dwarf components and initial weights, we construct main (such as time) to another domain (such as fre- our proxy benchmarks using a DAG-like combination quency). Common transform computations include dis- of dwarf components. For the targeted performance crete fourier transform (DFT), discrete cosine transform data accuracy, such as cache behaviors or I/O behav- (DCT) and wavelet transform. iors, we provide an auto-tuning tool to tune the param- Set Computations In mathematics, set means a eters of the proxy benchmark, and generate qualified collection of distinct objects. Likewise, the concept of proxy benchmark to satisfy the requirements. set is also widely used in computer science. For ex- ample, similarity analysis of two data sets involves set computations, such as Jaccard similarity. Furthermore, 2.2 Big Data Dwarf and Dwarf Components fuzzy set and rough set play very important roles in In this section, we illustrate the big data dwarfs and computer science. corresponding dwarf component implementations. Graph Computations A lot of applications involve After singling out a broad spectrum of big data ana- graphs, with nodes representing entities and edges rep- lytics workloads (machine learning, data mining, com- resenting dependencies. Graph computations are noto- puter vision and natural language processing) through rious for having irregular memory access patterns. investigating five application domains (search engine, Sort Computations Sort is widely used in many social network, e-commerce, multimedia, and bioinfor- areas. Jim Gray thought sort is the core of modern matics), we analyze these workloads and decompose databases [17], which shows its fundamentality. them to multiple classes of units of computation. 
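To make these units of computation concrete, the sketch below shows how one of them — dense matrix multiplication, which we provide for the matrix computations dwarf — might look as a light-weight OpenMP dwarf component. It is a minimal illustration with our own function and parameter names, not the released implementation.

```cpp
// Illustrative sketch of a matrix-computations dwarf component:
// dense matrix multiplication (C = C + A * B) parallelized with OpenMP.
// matmul_dwarf and chunk_size are our own names, used only for illustration.
#include <omp.h>
#include <vector>

void matmul_dwarf(const std::vector<double>& A, const std::vector<double>& B,
                  std::vector<double>& C, int n, int chunk_size) {
  // chunk_size plays the role of the per-thread "Chunk Size" tunable
  // parameter described in Section 2.3: the block of rows each thread
  // picks up at a time.
  #pragma omp parallel for schedule(dynamic, chunk_size)
  for (int i = 0; i < n; ++i) {
    for (int k = 0; k < n; ++k) {
      const double a = A[i * n + k];
      for (int j = 0; j < n; ++j) {
        C[i * n + j] += a * B[k * n + j];   // accumulate into a zero-initialized C
      }
    }
  }
}
```

A similar skeleton — an input buffer produced by the data generation tools, a parallel region, and per-thread chunks of work — can be imagined for the other components such as sort or sampling.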
Ac- Basic Statistic Computations Basic statistic com- cording to their frequency and importance, we final- putations are used to obtain the summary information ize eight big data dwarfs, which are abstractions of fre- through statistical computations, such as counting and quently-appearing classes of units of computation. Ta- probability statistics. ble 1 shows the importance of eight classes of units of computation (dwarfs) in a majority of big data analyt- 2.2.2 Dwarf Components ics workloads. We can find that these eight dwarfs are Fig. 3 presents the overview of our dwarf components, major classes of units of computations in a variety of which consist of two parts– data generation tools and big data analytics workloads. dwarf implementations. The data generation tools pro- Table 1: Eight Classes of Units of Computation.

Catergory Application Domain Workload Unit of Computation Search Engine PageRank Matrix, Graph, Sort Graph Mining Community Detection BFS, Connected component(CC) Graph Image Processing Principal components analysis(PCA) Matrix Deminsion Reduction Text Processing Latent dirichlet allocation(LDA) Basic Statistic, Sampling Image Recognition Convolutional neural network(CNN) Matrix, Sampling, Transform Deep Learning Speech Recognition Deep belief network(DBN) Matrix, Sampling Aporiori Basic Statistic, Set Association Rules Mining Recommendation FP-Growth Graph, Set, Basic Statistic Electronic Commerce Collaborative filtering(CF) Graph, Matrix Support vector machine(SVM) Matrix Image Recognition K-nearest neighbors(KNN) Matrix, Sort, Basic Staticstic Classification Speech Recognition Naive bayes Basic Statistic Text Recognition Random forest Graph, Basic Statistic Decision tree(C4.5/CART/ID3) Graph, Basic Statistic Clustering Data Mining K-means Matrix, Sort Image segmentation(GrabCut) Matrix, Graph Image Processing Scale-invariant feature trans- Matrix, Transform, Sampling, Feature Preprocess Signal Processing form(SIFT) Sort, Basic Statistic Text Processing Image Transform Matrix, Transform Term Frequency-inverse document Basic Statistic frequency (TF-IDF) Bioinformatics Hidden Markov Model(HMM) Matrix Sequence Tagging Language Processing Conditional random fields(CRF) Matrix, Sampling Indexing Search Engine Inverted index, Forward index Basic Statistic, Logic, Set, Sort Multimedia Processing MPEG-2 Matrix, Transform Security Encryption Matrix, Logic Encoding/Decoding Cryptography SimHash, MinHash Set, Logic Digital Signature Locality-sensitive hashing(LSH) Set, Logic Data Warehouse Business intelligence Project, Filter, OrderBy, Union Set, Sort vide various data inputs with different data types and Dwarf Components distributions to the dwarf components, covering text, Matrix Sampling Logic Transform graph and matrix data. Since software stack has great Computations Computations Computations Computations Distance Random MD5 hash FFT / IFFT influences on workload behaviors [3, 18], our dwarf com- calculation sampling ponent implementation considers the execution model Input Matrix Interval Encryption DCT multiplication sampling of software stacks and the programming styles of work- Set Graph Sort Basic Statistic loads using specific software stacks. Fig. 3 lists all dwarf Computations Computations Computations Computations components. For example, we provide distance calcu- Generation Data Union Graph Quick sort Count/average

(Types & Distribution) & (Types construct statistics Intersection Merge sort lation (i.e. euclidian, cosine) and matrix multiplication Graph Probability for matrix computations. For simulation-based archi- Difference traversal Max/Min statistics tecture research, we provide light-weight implementa- tion of dwarf components using the OpenMP [21] frame- work as OpenMP has several advantages. First, it is Figure 3: The Overview of the Dwarf Components. widely used. Second, it is much more light-weight than the other programming frameworks in terms of binary size. Third, it is supported by a majority of simula- 2.3 Proxy Benchmarks Construction tors [19], such as GEM5 [20]. For the purpose of system Fig. 4 presents the process of proxy benchmark con- evaluation, we also implemented the dwarf components struction, including decomposing process and auto-tuning on several other software stacks including MPI [22], process. Given a big data analytics workload, we ob- Hadoop [23] and Spark [24]. tain its hotspot functions and execution time through a multi-dimensional tracing and profiling method, includ- Decomposing Auto-Tuning Yes Qualified Tuned Parameters

[Figure 4 panels: a Decomposing stage (big data analytics workloads, dwarf components, initial weights) followed by Auto-Tuning, comprising Parameter Initialization (input data size, weight, parallelism degree, chunk size), an Adjusting Stage driven by impact analysis, and a Feedback Stage that evaluates system and micro-architectural metrics and feeds deviations back until a qualified proxy benchmark with tuned parameters is produced.]

Figure 4: Proxy Benchmarks Construction. ing runtime tracing (e.g. JVM tracing and logging), Table 2: Tunable Parameters for Each Dwarf Compo- system profiling (e.g. CPU time breakdown) and hard- nent. ware profiling (e.g. CPU cycle breakdown). Based on the hotspot analysis, we correlate the hotspot functions to the code fragments of the workload and choose the Parameter Description Input Data Size The input data size for each corresponding dwarf components through analyzing the dwarf component computation logic of the code fragments. Our proxy Chunk Size The data block size processed by benchmark is a DAG-like combination of the selected each thread for each dwarf com- dwarf components with initial weights setting by their ponent execution ratios. Parallelism Degree The process and thread numbers We measure the proxy benchmark’s accuracy by com- for each dwarf component paring the performance data of the proxy benchmark Weight The contribution for each dwarf with those of the original workloads in both system and component micro-architecture level. In our methodology, we can choose different performance metrics to measure the ac- curacy according to the concerns about different behav- We introduce a decision tree based mechanism to as- iors of the workload. For example, if our proxy bench- sist the auto-tuning process. The tool learns the im- marks are to focus on cache behaviors of the workload, pact of each parameter on all metrics and build a deci- we can choose the metrics that reflect cache behaviors sion tree through Impact analysis. The learning process like cache hit rate to tune a qualified proxy benchmark. changes one parameter each time and execute multiple To tune the accuracy—making it more similar to the times to characterize the parameter’s impact on each original workload, we further provide an auto-tuning metric. Based on the impact analysis, the tool builds tool with four parameters to be tuned (listed in Table a decision tree to determine which parameter to tune 2). We found those four parameters play essential roles if one metric has large deviation. After that, the Feed- in the system and architectural behaviours of a certain back Stage is activated to evaluate the proxy benchmark workload. The tuning process is an iteratively execu- with tuned parameters. If it does not satisfy the require- tion process which can be separated into three stages: ments, the tool will adjust the parameters to improve parameter initialization, adjusting stage and feedback the accuracy using the decision tree in the adjusting stage. We elaborate them in detail as below: stage. Parameter Initialization We initialize the four pa- Feedback Stage In the feedback stage, the tool eval- rameters(Input Data Size, Chunk Size, Parallelism De- uates the accuracy of the current proxy benchmark with gree and Weight) according to the configuration of the specific parameters. If the deviations of all metrics are original workload. We scale down the input data set within the setting range (e.g. 15%), the auto-tuning and chunk size of the original workloads to initialize In- process is finished. Otherwise, the metrics with large put Data Size and Chunk Size. The Parallelism Degree deviations will be fed back to the adjusting stage. The is initialized as the parallelism degree of the original adjusting and feedback process will iterate until reach- workload. 
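Putting the pieces above together, the following sketch shows — under our own illustrative types and names, not the auto-tuning tool's actual interface — how a proxy benchmark can be viewed as a weighted chain of dwarf components plus an adjusting/feedback loop that iterates until every metric deviation falls inside the target range (e.g., 15%).

```cpp
// Minimal sketch of proxy-benchmark construction and auto-tuning.
// DwarfNode, Metrics, and the function names here are illustrative only.
#include <cmath>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

struct DwarfNode {
  std::string name;                   // e.g. "sort", "sampling", "graph"
  std::function<void(double)> run;    // dwarf component, driven by its weight
  double weight;                      // initialized from the execution ratio
};

using Metrics = std::vector<double>;  // e.g. IPC, cache hit ratios, I/O bandwidth

// Execute the DAG-like combination: each node consumes the data produced by
// its predecessor and contributes work in proportion to its weight.
void run_proxy(const std::vector<DwarfNode>& dag) {
  for (const auto& node : dag) node.run(node.weight);
}

// Feedback check: every relative deviation must be within the target range.
bool within_range(const Metrics& proxy, const Metrics& original, double range) {
  for (std::size_t i = 0; i < proxy.size(); ++i)
    if (std::fabs(proxy[i] - original[i]) / original[i] > range) return false;
  return true;
}

// Adjusting/feedback loop: profile_proxy runs the proxy benchmark and collects
// its metrics; adjust changes one parameter (chosen by the decision tree) per pass.
void auto_tune(const std::function<Metrics()>& profile_proxy,
               const std::function<void(const Metrics&)>& adjust,
               const Metrics& original, double range) {
  Metrics m = profile_proxy();
  while (!within_range(m, original, range)) {
    adjust(m);             // adjusting stage
    m = profile_proxy();   // feedback stage
  }
}
```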
To guarantee the rationality of the weights ing the specified accuracy, and the finalized proxy bench- of each dwarf component, we initialize the weights pro- mark with the final parameter settings is our qualified portional to their corresponding execution ratios. For proxy benchmark. example, in Hadoop TeraSort, the weight is 70% of sort computation, 10% of sampling computation, and 20% of 2.4 Proxy Benchmarks Implementation graph computation, respectively. Note that the weight Considering the mandatory requirements of simula- can be adjusted within a reasonable range (e.g. plus or tion time and performance data accuracy for architec- minus 10%) during the tuning process. ture community, we implement four proxy benchmarks Adjusting Stage with respect to four representative Hadoop benchmarks Table 3: Four Hadoop Benchmarks from BigDataBench and Their Corresponding Proxy Benchmarks.

Big Data Workload Patterns Data Set Involved Dwarfs Involved Dwarf Components Benchmark Sort computations Quick sort; Merge sort Hadoop I/O Intensive Text Sampling computations Random sampling; Interval sampling TeraSort Graph computations Graph construction; Graph traversal Matrix computations Vector euclidean distance; Cosine distance Hadoop CPU Intensive Vectors Sort computations Quick sort; Merge sort Kmeans Basic Statistic Cluster count; Average computation Matrix computations Matrix construction; Matrix multiplication Hadoop Hybrid Graph Sort computations Quick sort; Min/max calculation PageRank Basic Statistic Out degree and in degree count of nodes Matrix computations Matrix construction; Matrix multiplication Sort computations Quick sort; Min/max calculation Hadoop CPU Intensive Image Sampling computations Interval sampling SIFT Memory Intensive Transform computations FFT/IFFT Transformation Basic Statistic Count Statistics from BigDataBench [3] – TeraSort, Kmeans, PageRank, intermediate data output to disk, and data combina- and SIFT, according to our benchmarking methodol- tion. JVM garbage collection (GC) is an important ogy. At the requests of our industry partners, we imple- step for automatic memory management. Currently, mented the four proxy benchmarks in advance because we don’t consider GC’s impacts in the proxy bench- of the following reasons. Other proxy benchmarks are mark construction, as GC only occupies a small frac- being built using the same methodology. tion in the Hadoop workloads (about 1% time spent Representative Application Domains They are on GC in our experiments) when the node has enough all widely used in many important application domains. memory. Finally, we use the auto-tuning tool to gen- For example, TeraSort is a widely-used workload in erate the proxy benchmarks. The process of construct- many application domains; PageRank is a famous work- ing these four proxy benchmarks shows that our auto- load for search engine; Kmeans is a simple but useful tuning method can reach a steady state after dozens workload used in services; SIFT is a fundamen- of iterations. Table 3 lists the benchmark details from tal workload for image feature extraction. the perspectives of data set, involved dwarfs, involved Various Workload Patterns They have different dwarf components. workload patterns. Hadoop TeraSort is an I/O-intensive workload; Hadoop Kmeans is a CPU-intensive work- 3. EVALUATION load; Hadoop PageRank is a hybrid workload which In this section, we evaluate our proxy benchmarks falls between CPU-intensive and I/O-intensive; Hadoop from the perspectives of runtime speedup and accuracy. SIFT is a CPU-intensive and memory-intensive work- load. 3.1 Experiment Setups Diverse Data Inputs They take different data in- We deploy a five-node cluster, with one master node puts. Hadoop TeraSort uses text data generated by and four slave nodes. They are connected using 1Gb gensort [25]; Hadoop Kmeans uses vector data while ethernet network. Each node is equipped with two In- Hadoop PageRank uses graph data; Hadoop SIFT uses tel Xeon E5645 (Westmere) processors, and each pro- image data from ImageNet [26]. These benchmarks are cessor has six physical out-of-order cores. The memory of great significance for measuring big data systems and of each node is 32GB. Each node runs Linux CentOS architectures [18]. 6.4 with the Linux kernel version 3.11.10. The JDK In the rest of this paper, we use Proxy TeraSort, and Hadoop versions are 1.7.0 and 2.7.1, respectively. 
Proxy Kmeans, Proxy PageRank and Proxy SIFT to The GCC version is 4.8.0, which supports the OpenMP represent the proxy benchmark for Hadoop TeraSort, 3.1 specification. The proxy benchmarks are compiled Hadoop Kmeans, Hadoop PageRank and Hadoop SIFT using ”-O2” option for optimization. The hardware and from BigDataBench, respectively. The input data to software details are listed on Table 4. each proxy benchmark has the same data type and dis- To evaluate the performance data accuracy, we run tribution with respect to those of the Hadoop bench- the proxy benchmarks against the benchmarks from marks so as to preserve the data impact on workload be- BigDataBench. We run the four Hadoop benchmarks haviors. To resemble the process of the Hadoop frame- from BigDataBench on the above five-node cluster us- work, we choose the OpenMP implementations of the ing the optimized Hadoop configurations, through tun- dwarf components, which are implemented with simi- ing the data block size of the Hadoop distributed file lar processes to the Hadoop framework, including in- system, memory allocation for each map/reduce job put data partition, chunk data allocation per thread, and reduce job numbers according to the cluster scale Table 4: Node Configuration Details of Xeon E5645 reflect the I/O behaviors of workloads. We collect micro-architectural metrics from hardware performance monitoring counters (PMCs), and look up Hardware Configurations the hardware events’ value on Intel Developer’s Man- CPU Type Intel CPU Core ual [28]. Perf [29] is used to collect these hardware Intel R Xeon E5645 6 [email protected] events. To guarantee the accuracy and validity, we run L1 DCache L1 ICache L2 Cache L3 Cache each workload three times, and collect performance data 6 × 32 KB 6 × 32 KB 6 × 256 KB 12MB of workloads on all slave nodes during the whole run- Memory 32GB,DDR3 time. We report and analyze their average value. Disk SATA@7200RPM Ethernet 1Gb Table 5: System and Micro-architectural Metrics. Hyper-Threading Disabled Software Configurations Category Metric Name Description Operating Linux JDK Hadoop Micro-architectural Metrics System Kernel Version Version IPC Instructions per cycle Processor CentOS 6.4 3.11.10 1.7.0 2.7.1 Million instructions Performance MIPS per second Ratios of floating- and memory size. For Hadoop TeraSort, we choose 100 Instruction Instruction point, load, store, GB text data produced by gensort [25]. For Hadoop Mix ratios branch and integer Kmeans and PageRank, we choose 100 GB sparse vector instructions data with 90% sparsity 4 and 226-vertex graph both gen- Branch Branch Miss Branch miss predic- erated by BDGS [27], respectively. For Hadoop SIFT, Prediction tion ratio L1I Hit Ratio L1 instruction cache we use one hundred thousand images from ImageNet [26]. hit ratio For comparison, we run the four proxy benchmarks on Cache L1D Hit Ratio L1 data cache hit ratio one of the slave nodes, respectively. Behavior L2 Hit Ratio L2 cache hit ratio L3 Hit Ratio L3 cache hit ratio 3.2 Metrics Selection and Collection System Metrics To evaluate accuracy, we choose micro-architectural Read Memory load band- and system metrics covering instruction mix, cache be- Bandwidth width Write Memory store band- havior, branch prediction, processor performance, mem- Memory Bandwidth width ory bandwidth and disk I/O behavior. Table 5 presents Bandwidth Total memory load and store the metrics we choose. Bandwidth Processor Performance. We choose two metrics bandwidth to measure the processor overall performance. 
Instruc- Disk I/O Disk I/O Disk read and write bandwidth tions per cycle (IPC) indicates the average number of Behavior Bandwidth instructions executed per clock cycle. Million instruc- tions per second (MIPS) indicates the instruction exe- 3.3 Runtime Speedup cution speed. Table 6 presents the execution time of the Hadoop Instruction Mix. We consider the instruction mix benchmarks and the proxy benchmarks on Xeon E5645. breakdown including the percentage of integer instruc- Hadoop TeraSort with 100 GB text data runs 1500 sec- tions, floating-point instructions, load instructions, store onds on the five-node cluster. Hadoop Kmeans with instructions and branch instructions. 100 GB vectors runs 5971 seconds for each iteration. Branch Prediction. Branch predication is an im- Hadoop PageRank with 226-vertex graph runs 1443 sec- portant strategy used in modern processors. We track onds for each iteration. Hadoop SIFT with one hundred the miss prediction ratio of branch instructions (br miss thousands images runs 721 seconds. The four corre- for short). sponding proxy benchmarks run about ten seconds on Cache Behavior. We evaluate cache efficiency using the physical machine. For TeraSort, Kmeans, PageR- cache hit ratios, including L1 instruction cache, L1 data ank, SIFT, the speedup is 136X (1500/11.02), 743X cache, L2 cache and L3 cache. (5971/8.03), 160X (1444/9.03) and 90X (721/8.02), re- Memory Bandwidth. We measure the data load spectively. rate from memory and the data store rate into memory, with the unit of bytes per second. We choose metrics of 3.4 Accuracy memory read bandwidth (read bw for short), memory We evaluate the accuracy of all metrics listed in Ta- write bandwidth (write bw for short) and total memory ble 5. For each metric, the accuracy of the proxy bench- bandwidth including both read and write (mem bw for mark comparing to the Hadoop benchmark is computed short). by Equation 1. Among which, V al represents the av- Disk I/O Behavior. We employ I/O bandwidth to H erage value of the Hadoop benchmark on all slave nodes; 4 The sparsity of the vector indicates the proportion of zero- V alP represents the average value of the proxy bench- valued elements. mark on a slave node. The absolute value ranges from 0 Table 6: Execution Time on Xeon E5645.

Workloads     Execution Time (Second)
              Hadoop version      Proxy version
TeraSort      1500                11.02
Kmeans        5971                8.03
PageRank      1444                9.03
SIFT          721                 8.02

Figure 6: Instruction Mix Breakdown on Xeon E5645. to 1. The number closer to 1 indicates higher accuracy.
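For readability, the two formulas referred to in this section — the per-metric accuracy (Equation 1, written here with the absolute value of the relative deviation) and the disk I/O bandwidth used in Section 3.4.1 (Equation 2) — are:

Accuracy(Val_H, Val_P) = 1 - |Val_P - Val_H| / Val_H        (1)

BW_{Disk I/O} = (Sector_{Read+Write} × Size_{Sector}) / RunTime        (2)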

V alP − V alH Accuracy(V alH ,ValP )=1 − (1) (SectorRead+W rite) ∗ SizeSector V alH BW = (2) DiskI/O RunTime Fig. 5 presents the system and micro-architectural Fig. 7 presents the I/O bandwidth of proxy bench- data accuracy of the proxy benchmarks on Xeon E5645. marks and Hadoop benchmarks on Xeon E5645. We We can find that the average accuracy of all metrics are can find that they have similar average disk I/O pres- greater than 90%. For TeraSort, Kmeans, PageRank, sure. The disk I/O bandwidth of Proxy TeraSort and SIFT, the average accuracy is 94%, 91%, 93%, 94%, re- Hadoop TeraSort is 32.04 MB and 33.99 MB per second, spectively. Fig. 6 shows the instruction mix breakdown respectively. of the proxy benchmarks and Hadoop benchmarks. From Fig. 6, we can find that the four proxy benchmarks pre- serve the instruction mix characteristics of these four Hadoop benchmarks. For example, the integer instruc- tion occupies 44% for Hadoop TeraSort and 46% for Proxy TeraSort, while the floating-point instruction oc- cupies less than 1% for both Hadoop and Proxy Tera- Sort. For instructions involving data movement, Hadoop TeraSort contains 39% of load and store instructions, and Proxy TeraSort contains 37%. The SIFT workload, widely used in computer vision for image processing, has many floating-point instructions. Also, Proxy SIFT Figure 7: Disk I/O BandWidth on Xeon E5645. preserves the instruction mix characteristics of Hadoop SIFT. 3.4.2 The Impact of Input Data In this section, we demonstrate when we change the input data sparsity, our proxy benchmarks can still mimic the big data workloads with a high accuracy. We run Hadoop Kmeans with the same Hadoop configurations, and the data input is 100 GB. We use different data input: sparse vector (the original configuration, 90% elements are zero) and dense vectors (all elements are non-zero, and 0% elements are zero). Fig. 8 presents the performance difference using different data input. We can find that the memory bandwidth with sparse vectors is nearly half of the memory bandwidth with Figure 5: System and Micro-architectural Data Accu- dense vectors, which confirms the data input’s impacts racy on Xeon E5645. on micro-architectural performance. On the other hand, Fig. 9 presents the accuracy of proxy benchmark using diverse input data. We can find that the average micro-architectural data accuracy of 3.4.1 Disk I/O Behaviors Proxy Kmeans is above 91% with respect to the fully- Big data applications have significant disk I/O pres- distributed Hadoop Kmeans using dense input data with sures. To evaluate the DISK I/O behaviors of the proxy no zero-valued element. When we change the input data benchmarks, we compute the disk I/O bandwidth us- sparsity from 0% to 90%, the data accuracy of Proxy ing Equation 2, where SectorRead+W rite means the to- Kmeans is also above 91% with respect to the original tal number of sector reads and sector writes; SizeSector workload. So we see that the Proxy Kmeans can mimic means the sector size (512 bytes for our nodes). the Hadoop Kmeans under different input data. Table 7: Platform Configurations

Hardware Configurations
  Model                  ARMv8                       Xeon E5-2690 V3
  Number of Processors   1                           1
  Number of Cores        32                          12
  Frequency              2.1GHz                      2.6GHz
  L1 Cache (I/D)         48KB/32KB                   32KB/32KB
  L2 Cache               8 x 1024KB                  12 x 256KB
  L3 Cache               32MB                        30MB
  Architecture           ARM                         X86_64
  Memory                 64GB, DDR4                  64GB, DDR4
  Ethernet               1Gb                         1Gb
  Hyper-Threading        None                        Enabled

Software Configurations
  Operating System       EulerOS V2.0                Red-hat Enterprise Linux Server release 7.0
  Linux Kernel           4.1.23-vhulk3.6.3.aarch64   3.10.0-123.el7.x86-64
  GCC Version            4.9.3                       4.8.2
  JDK Version            jdk1.8.0_101                jdk1.7.0_79
  Hadoop Version         2.5.2                       2.5.2

Figure 8: Data Impact on Memory Bandwidth Using Sparse and Dense Data for Hadoop Kmeans on Xeon E5645.

Sort, 50 GB dense vectors for Hadoop Kmeans, and 224-vertex graph data for Hadoop PageRank. We run Hadoop benchmarks with optimized configurations, thr- ough tuning the data block size of the Hadoop dis- tributed file system, memory allocation for each job and Figure 9: System and Micro-architectural Data Accu- reduce task numbers according to the cluster scale and racy Using Different Data Input on Xeon E5645. memory size. For comparison, we run proxy bench- marks on the slave node. Our industry partnership pays great attentions on cache and memory access pat- 4. CASESTUDIESONARMPROCESSORS terns, which are important micro-architectural and sys- tem metrics for chip design. So we mainly collect cache- This section demonstrates our proxy benchmark also related and memory-related performance data. can mimic the original benchmarks on the ARMv8 pro- cessors. We report our joint evaluation work with our industry partnership on ARMv8 processors using Hadoop 4.2 Runtime Speedup on ARMv8 TeraSort, Kmeans, PageRank, and the corresponding Table. 8 presents the execution time of Hadoop bench- proxy benchmarks. Our evaluation includes widely ac- marks and the proxy benchmarks on ARMv8. Our ceptable metrics: runtime speedup, performance accu- proxy benchmarks run within 10 seconds on ARMv8 racy and several other concerns like multi-core scalabil- processor. On two-node cluster equipped with ARMv8 ity and cross-platform speedup. processor, Hadoop TeraSort with 50 GB text data runs 4.1 Experiment Setup 1378 seconds. Hadoop Kmeans with 50GB vectors runs 3374 seconds for one iteration. Hadoop PageRank with Due to the resource limitation of ARMv8 processors, 224-vertex runs 4291 seconds for five iterations. In con- we use a two-node (one master and one slave) cluster trast, their proxy benchmarks run 4102, 8677 and 6219 with each node equipped with one ARMv8 processor. milliseconds, respectively. For TeraSort, Kmeans, PageR- In addition, we deploy a two-node (one master and one ank, the speedup is 336X (1378/4.10), 386X (3347/8.68) slave) cluster with each node equipped with one Xeon and 690X (4291/6.22), respectively. E5-2690 v3 (Haswell) processor for speedup compari- son. Each ARMv8 processor has 32 physical cores, with Table 8: Execution Time on ARMv8. each core having independent L1 instruction cache and L1 data cache. Every four cores share L2 cache and all cores share the last-level cache. The memory of each Execution Time (Second) Workloads node is 64GB. Each Haswell processor has 12 physical Hadoop version Proxy version cores, with each core having independent L1 and L2 TeraSort 1378 4.10 cache. All cores share the last-level cache. The mem- Kmeans 3347 8.68 ory of each node is 64GB. In order to narrow the gap of PageRank 4291 6.22 logical core numbers between two architectures, we en- able hyperthreading for Haswell processor. Table 7 lists the hardware and software details of two platforms. 4.3 Accuracy on ARMv8 Considering the memory size of the cluster, we use 50 We report the system and micro-architectural data GB text data generated by gensort for Hadoop Tera- accuracy of the Hadoop benchmarks and the proxy bench- marks. Likewise, we evaluate the accuracy by Equa- tion 1. Fig. 10 presents the accuracy of the proxy TimeARM benchmarks on ARM processor. We can find that on Speedup(Time ,Time )= (3) X86 64 ARM Time the ARMv8 processor, the average data accuracy is all X86 64 above 90%. For TeraSort, Kmeans and PageRank, the Fig. 
12 shows the runtime speedups of the Hadoop average accuracy is 93%, 95% and 92%, respectively. benchmarks and the proxy benchmarks across ARM and X86 64 architectures. We can find that they have consistent speedup trends. For example, Hadoop Tera- Sort runs 1378 seconds and 856 seconds on ARMv8 and Haswell, respectively. Proxy TeraSort runs 4.1 seconds and 2.56 seconds on ARMv8 and Haswell, respectively. The runtime speedups between ARMv8 and Haswell are 1.61 (1378/856) running Hadoop TeraSort, and 1.60 (4.1/2.56) running Proxy TeraSort.
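Spelled out, Equation 3 and the TeraSort numbers above give:

Speedup(Time_{X86_64}, Time_{ARM}) = Time_{ARM} / Time_{X86_64}        (3)

Hadoop TeraSort: 1378 s / 856 s ≈ 1.61;    Proxy TeraSort: 4.1 s / 2.56 s ≈ 1.60.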

5. RELATED WORK Multiple benchmarking methodologies have been pro- Figure 10: System and Micro-architectural Data Accu- posed over the past few decades. The most simplest one racy on ARMv8. is to create a new benchmark for every possible work- load. PARSEC [2] provides a series of shared-memory programs for chip-multiprocessors. BigDataBench [3] is 4.4 Multi-core Scalability on ARMv8 a benchmark suite providing dozens of big data work- loads. CloudSuite [4] consists of eight applications, which ARMv8 has 32 physical cores, and we evaluate its are selected based on popularity. These benchmarking multi-core scalability using the Hadoop benchmarks and methods need to provide individual implementations for the proxy benchmarks on 4, 8, 16, 32 cores, respectively. every possible workload, and keep expanding bench- For each experiment, we disable the specified number mark set to cover emerging workloads. Moreover, it of cpu cores through cpu-hotplug mechanism. For the is frustrating to run (component or application) bench- Hadoop benchmarks, we adjust the Hadoop configura- marks like BigDataBench or CloudSuite on simulators tions so as to get the peak performance. For the proxy because of complex software stacks and long running benchmarks, we run them directly without any modifi- time. Using reduced data input is one way to reduce cation. execution time. Previous work [30, 31] adopts reduced Fig. 11 reports multi-core scalability in terms of run- data set for the SPEC benchmark and maintains simi- time and MIPS. The horizontal axis represents the core lar architecture behaviors using the full reference data number and the vertical axis represents runtime or MIPS. sets. Due to the large runtime gap between the Hadoop bench- Kernel benchmarks are widely used in high perfor- marks and proxy benchmarks, we list their runtime on mance computing. Livermore kernels [32] use Fortran different side of vertical axis: the left side indicates run- applications to measure floating-point performance range. time of the Hadoop benchmarks, while the right side in- The NAS parallel benchmarks [7] consist of several sep- dicates runtime of the proxy benchmarks. We can find arate tests, including five kernels and three pseudo- that they have similar multi-core scalability trends in applications derived from computational fluid dynamics terms of both runtime and instruction execution speed. (CFD) applications. Linpack [33] provides a collection of Fortran subroutines. Kernel benchmarks are insuf- 4.5 Runtime Speedup across Different Proces- ficient to completely reflect workload behaviors consid- sors ering the complexity and diversity of big data work- Runtime speedup across different processors is an- loads[7, 5]. other metric with much concern from our industry part- In terms of micro-architectural simulation, many pre- nership. In order to reflect the impacts of different vious studies generated synthetic benchmarks as prox- design decisions for running big data analytics work- ies [34, 35]. Statistical simulation [12, 13, 14, 36, 37, loads on X86 and ARM architectures, the proxy bench- 38] generates synthetic trace or synthetic benchmarks to marks should be able to maintain consistent perfor- mimic micro-architectural performance of long-running mance trends. 
That is to say, if the proxy benchmarks real workloads, which targets one workload on a specific gain performance promotion through an improved de- architecture with the certain configurations, and thus sign, the real workloads can also benefit from the de- each benchmark needs to be generated on the other ar- sign. We evaluate the runtime speedup across two dif- chitectures with different configurations [39]. Sampled ferent architectures of ARMv8 and Xeon E5-2690 V3 simulation selects a series of sample units for simula- (Haswell). The runtime speedup is computed using tion instead of entire instruction stream, which were Equation 3. The Hadoop configurations are also opti- sampled randomly [8], periodically [9, 10] or based on mized according to hardware environments. The proxy phase behavior [11]. Seongbeom et al. [40] accelerated benchmarks use the same version on two architectures. the full-system simulation through characterizing and (a) TeraSort (b) Kmeans (c) PageRank

Figure 11: Multi-core Scalability of the Hadoop benchmarks and Proxy Benchmarks on ARMv8.
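As a side note on how the proxy benchmarks pick up the varying core counts in this experiment: because they are plain OpenMP programs run without modification, the OpenMP runtime typically sizes their thread teams to the cores left online after cpu-hotplug disables the rest. The snippet below is a hypothetical check, not part of the released code.

```cpp
// Hypothetical check, not part of the proxy benchmarks: an unmodified
// OpenMP program typically sizes its thread team from the cores that remain
// online (4, 8, 16 or 32 in the scalability experiment above).
#include <omp.h>
#include <cstdio>

int main() {
  #pragma omp parallel
  {
    #pragma omp single
    std::printf("parallel region running with %d threads\n", omp_get_num_threads());
  }
  return 0;
}
```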

putation in the above tasks and problems.

6. CONCLUSIONS In this paper, we propose a dwarf-based scalable big data benchmarking methodology. After thoroughly an- alyzing a majority of algorithms in five typical applica- tion domains: search engine, social networks, e-commerce, multimedia, and bioinformatics, we capture eight big Figure 12: Runtime Speedup across Different Proces- data dwarfs among a wide variety of big data analty- sors. ics workloads, including matrix, sampling, logic, trans- form, set, graph, sort and basic statistic computation. For each dwarf, we implement the dwarf components us- predicting the performance behavior of OS services. For ing diverse software stacks. We construct and tune the emerging big data workloads, PerfProx [41] proposed a big data proxy benchmarks using the DAG-like combi- proxy benchmark generation framework for real-world nations of dwarf components using different wights to database applications through characterizing low-level mimic the benchmarks in BigDataBench. dynamic execution characteristics. Our proxy benchmarks can reach 100s runtime speedup Our big data dwarfs are inspired by previous suc- with respect to the benchmarks from BigDataBench, cessful abstractions in other application scenarios. The while the average micro-architectural data accuracy is set concept in relational algebra [42] abstracted five above 90% on both X86 64 and ARM architectures. primitive and fundamental operators (Select, Project, The proxy benchmarks have been applied to ARM pro- Product, Union, Difference), setting off a wave of re- cessor design in our industry partnership. lational database research. The set abstraction is the basis of relational algebra and theoretical foundation 7. REFERENCES of database. Phil Colella [15] identified seven dwarfs [1] J. Gray, “Database and transaction processing performance of numerical methods which he thought would be im- handbook.,” 1993. portant for the next decade. Based on that, a multi- [2] C. Bienia, Benchmarking Modern Multiprocessors. PhD disciplinary group of Berkeley researchers proposed 13 thesis, Princeton University, January 2011. dwarfs which were highly abstractions of parallel com- [3] L. Wang, J. Zhan, C. Luo, Y. Zhu, Q. Yang, Y. He, puting, capturing the computation and communication W. Gao, Z. Jia, Y. Shi, S. Zhang, C. Zheng, G. Lu, K. Zhan, X. Li, and B. Qiu, “Bigdatabench: A big data patterns of a great mass of applications [17]. National benchmark suite from internet services,” in IEEE Research Council proposed seven major tasks in mas- International Symposium On High Performance Computer sive data analysis [43], which they called giants. These Architecture (HPCA), 2014. seven giants are macroscopical definition of problems [4] M. Ferdman, A. Adileh, O. Kocberber, S. Volos, M. Alisafaee, D. Jevdjic, C. Kaynak, A. D. Popescu, from the perspective of mathematics, while our eight A. Ailamaki, and B. Falsafi, “Clearing the clouds: A study classes of dwarfs are frequently-appearing units of com- of emerging workloads on modern hardware,” in ACM International Conference on Architectural Support for message passing interface standard,” Parallel computing, Programming Languages and Operating Systems vol. 22, no. 6, pp. 789–828, 1996. (ASPLOS), 2012. [23] “Hadoop.” http://hadoop.apache.org/. [5] D. J. Lilja, Measuring computer performance: a practitioner’s guide. Cambridge university press, 2005. [24] “Spark.” https://spark.apache.org/. [6] J. L. Hennessy and D. A. Patterson, Computer [25] “Gensort.” http://www.ordinal.com/gensort.html. 
architecture: a quantitative approach. Elsevier, 2011. [26] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and [7] D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, L. Fei-Fei, “ImageNet: A Large-Scale Hierarchical Image R. L. Carter, L. Dagum, R. A. Fatoohi, P. O. Frederickson, Database,” in CVPR09, 2009. T. A. Lasinski, R. S. Schreiber, H. D. Simon, [27] Z. Ming, C. Luo, W. Gao, R. Han, Q. Yang, L. Wang, and V. Venkatakrishnan, and S. K. Weeratunga, “The nas J. Zhan, “Bdgs: A scalable big data generator suite in big parallel benchmarks,” The International Journal of data benchmarking,” arXiv preprint arXiv:1401.5465, 2014. Supercomputing Applications, vol. 5, no. 3, pp. 63–73, 1991. [28] R. Intel, “Intel r 64 and ia-32 architectures. software [8] T. M. Conte, M. A. Hirsch, and K. N. Menezes, “Reducing developer´smanual. volume 3a,” System Programming state loss for effective trace sampling of superscalar Guide, Part, vol. 1, 2010. processors,” in IEEE International Conference on Computer Design (ICCD), 1996. [29] “Perf tool.” https://perf.wiki.kernel.org/index.php/Main Page. [9] R. E. Wunderlich, T. F. Wenisch, B. Falsafi, and J. C. Hoe, “Smarts: Accelerating microarchitecture simulation via [30] A. J. KleinOsowski, J. Flynn, N. Meares, and D. J. Lilja, rigorous statistical sampling,” in IEEE International Adapting the SPEC 2000 Benchmark Suite for Symposium on Computer Architecture (ISCA), 2003. Simulation-Based Computer Architecture Research, pp. 83–100. Boston, MA: Springer US, 2001. [10] Z. Yu, H. Jin, J. Chen, and L. K. John, “Tss: Applying two-stage sampling in micro-architecture simulations,” in [31] A. KleinOsowski and D. J. Lilja, “Minnespec: A new spec IEEE International Symposium on Modeling, Analysis, benchmark workload for simulation-based computer Simulation of Computer and Telecommunication Systems architecture research,” IEEE Computer Architecture (MASCOTS), 2009. Letters, vol. 1, no. 1, pp. 7–7, 2002. [11] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder, [32] F. H. McMahon, “The livermore fortran kernels: A “Automatically characterizing large scale program computer test of the numerical performance range,” tech. behavior,” in ACM SIGARCH Computer Architecture rep., Lawrence Livermore National Lab., CA (USA), 1986. News, vol. 30, pp. 45–57, 2002. [33] J. J. Dongarra, P. Luszczek, and A. Petitet, “The linpack [12] K. Skadron, M. Martonosi, D. August, M. Hill, D. Lilja, and benchmark: past, present and future,” Concurrency and V. S. Pai, “Challenges in computer architecture evaluation,” Computation: practice and experience, vol. 15, no. 9, IEEE Computer, vol. 36, no. 8, pp. 30–36, 2003. pp. 803–820, 2003. [13] L. Eeckhout, R. H. Bell Jr, B. Stougie, K. De Bosschere, [34] R. H. Bell Jr and L. K. John, “Improved automatic testcase and L. K. John, “Control flow modeling in statistical synthesis for performance model validation,” in ACM simulation for accurate and efficient processor design International Conference on Supercomputing (ICS), 2005. studies,” ACM SIGARCH Computer Architecture News, vol. 32, no. 2, p. 350, 2004. [35] K. Ganesan, J. Jo, and L. K. John, “Synthesizing memory-level parallelism aware miniature clones for spec [14] L. Eeckhout, K. De Bosschere, and H. Neefs, “Performance cpu2006 and implantbench workloads,” in IEEE analysis through synthetic trace generation,” in IEEE International Symposium on Performance Analysis of International Symposium on Performance Analysis of Systems Software (ISPASS), 2010. Systems and Software (ISPASS), 2000. [36] S. 
Nussbaum and J. E. Smith, “Modeling superscalar [15] P. Colella, “Defining software requirements for scientific processors via statistical simulation,” in International computing,” 2004. Conference on Parallel Architectures and Compilation Techniques (PACT), 2001. [16] http://www.krellinst.org/doecsgf/conf/2014/pres/ jhill.pdf. [37] M. Oskin, F. T. Chong, and M. Farrens, HLS: Combining statistical and symbolic simulation to guide microprocessor [17] K. Asanovic, R. Bodik, B. C. Catanzaro, J. J. Gebis, designs, vol. 28. ACM, 2000. P. Husbands, K. Keutzer, D. A. Patterson, W. L. Plishker, J. Shalf, S. W. Williams, and Y. Katherine, “The landscape [38] L. Eeckhout and K. De Bosschere, “Early design phase of parallel computing research: A view from berkeley,” tech. power/performance modeling through statistical rep., Technical Report UCB/EECS-2006-183, EECS simulation.,” in IEEE International Symposium on Department, University of California, Berkeley, 2006. Performance Analysis of Systems and Software (ISPASS), 2001. [18] Z. Jia, J. Zhan, L. Wang, R. Han, S. A. McKee, Q. Yang, C. Luo, and J. Li, “Characterizing and subsetting big data [39] A. M. Joshi, Constructing adaptable and scalable synthetic workloads,” in IEEE International Symposium on benchmarks for microprocessor performance evaluation. Workload Characterization (IISWC), 2014. ProQuest, 2007. [19] J. L. Abell´an, C. Chen, and A. Joshi, “Electro-photonic noc [40] S. Kim, F. Liu, Y. Solihin, R. Iyer, L. Zhao, and W. Cohen, designs for kilocore systems,” ACM Journal on Emerging “Accelerating full-system simulation through characterizing Technologies in Computing Systems (JETC), vol. 13, no. 2, and predicting operating system performance,” in IEEE p. 24, 2016. International Symposium on Performance Analysis of Systems and Software (ISPASS), 2007. [20] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, [41] R. Panda and L. K. John, “Proxy benchmarks for emerging S. Sardashti, R. Sen, K. Sewell, M. Shoaib, N. Vaish, M. D. big-data workloads,” in IEEE International Symposium on Hill, and D. A. Wood, “The gem5 simulator,” ACM Performance Analysis of Systems and Software (ISPASS), SIGARCH Computer Architecture News, vol. 39, no. 2, 2017. pp. 1–7, 2011. [42] E. F. Codd, “A relational model of data for large shared [21] L. Dagum and R. Menon, “Openmp: an industry standard data banks,” Communications of the ACM, vol. 13, no. 6, api for shared-memory programming,” IEEE computational pp. 377–387, 1970. science and engineering, vol. 5, no. 1, pp. 46–55, 1998. [43] N. Council, “Frontiers in massive data analysis,” The [22] W. Gropp, E. Lusk, N. Doss, and A. Skjellum, “A National Academies Press Washington, DC, 2013. high-performance, portable implementation of the mpi