
Memory Centric Characterization and Analysis of SPEC CPU2017 Suite

Sarabjeet Singh ([email protected]), Ashoka University
Manu Awasthi ([email protected]), Ashoka University

ABSTRACT
In this paper, we provide a comprehensive, memory-centric characterization of the SPEC CPU2017 benchmark suite, using a number of mechanisms including dynamic binary instrumentation, measurements on native hardware using hardware performance counters, and operating system level tools. We present a number of results including working set sizes, memory capacity consumption and memory bandwidth utilization of various workloads. Our experiments reveal that, on the x86_64 ISA, SPEC CPU2017 workloads execute a significant number of memory related instructions, with approximately 50% of all dynamic instructions requiring memory accesses. We also show that there is a large variation in the memory footprint and bandwidth utilization profiles of the entire suite, with some benchmarks using as much as 16 GB of main memory and up to 2.3 GB/s of memory bandwidth. We perform instruction distribution analysis of the benchmark suite and find that the average instruction count for SPEC CPU2017 workloads is an order of magnitude higher than for SPEC CPU2006 ones. In addition, we also find that FP benchmarks of the suite have higher compute requirements: on average, FP workloads execute three times the number of compute operations as compared to INT workloads.

KEYWORDS
SPEC CPU2017; Memory Characterization; Performance Analysis; Benchmarks

ACM Reference Format:
Sarabjeet Singh and Manu Awasthi. 2019. Memory Centric Characterization and Analysis of SPEC CPU2017 Suite. In Tenth ACM/SPEC International Conference on Performance Engineering (ICPE '19), April 7-11, 2019, Mumbai, India. ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3297663.3310311

1 INTRODUCTION
The study of computer architecture and system design depends on the availability of workloads that are able to faithfully represent contemporary and future applications of a given vertical. In the CPU domain, the Standard Performance Evaluation Corporation (SPEC) has been releasing the SPEC CPU suite for close to three decades now. These benchmarks have become the standard for any researcher or commercial entity wishing to benchmark their architecture or explore new designs.

The latest offering of the SPEC CPU suite, SPEC CPU2017, was released in June 2017 [8]. SPEC CPU2017 retains a number of benchmarks from previous iterations but has also added many new ones to reflect the changing nature of applications. Some recent studies [21, 24] have already started characterizing the behavior of SPEC CPU2017 applications, looking for potential optimizations to system architectures.

In recent years, the memory hierarchy, from the caches all the way to main memory, has become a first class citizen of system design. The last decade has seen a renewed interest in the architectural design space exploration of main memory, including novel additions to the existing interfaces and architecture (JEDEC, DDR3, DDR4, DDR5) [4, 13, 22, 25]. Not only this, exploration of emerging memory technologies like Phase Change Memory, MRAM etc., to find their place in the memory hierarchy has also been carried out [17, 20]. Researchers have started exploring emerging memory technologies to be used in both cache and main memory architectures [6, 19, 27, 28].

SPEC CPU2006 [14] played an important role in these explorations. Innovations to cache and memory hierarchies have been explored using these workloads by either (i) selecting individual workloads from the suite, or (ii) creating multi-programmed workload mixes with varying memory behavior. This selection was made possible by already available studies characterizing the memory behavior patterns of the SPEC CPU2006 suite [16].

However, there is no existing work that characterizes the memory hierarchy behavior of the SPEC CPU2017 suite. In this paper, we bridge this gap in the literature by making the following important contributions:

(1) Across the SPEC CPU2017 suite, we provide a holistic characterization of the dynamic instruction execution profiles of different workloads, for both Rate and Speed categories, and observe that most workloads have a large number of memory related operations: close to 50% on average across the suite.

(2) We provide a detailed analysis of the memory behavior of various benchmarks, using a combination of dynamic instrumentation tools (Pin/Pintools), hardware performance counters and operating system level tools. We report the overall working set size, memory bandwidth consumption, and memory resident working set sizes of various workloads.

The rest of the paper is organized as follows. Section 2 gives a background of the CPU2017 benchmarks. Section 3 describes the methodology used to characterize the benchmarks. Sections 4 and 5 analyze the benchmarks at an instruction and memory level, respectively. Section 6 discusses the benchmarks which are newly added to the SPEC CPU suite. Finally, we discuss related work in Section 7 and conclude in Section 8.


Table 1: SPEC CPU2017 Benchmarks

Benchmark  | Domain                         | Inputs
perlbench  | Perl Interpreter               | Interprets SpamAssassin, MHonArc, and an internal script
gcc        | GNU C Compiler                 | C source code file compiled with different optimizations
mcf        | Route Planning                 | Solves large-scale minimum cost flow problem
omnetpp    | Discrete Event Simulation      | Simulates a 10 Gb Ethernet network
xalancbmk  | XML to HTML Conversion         | XML documents to HTML, text, and other XML types
x264       | Video Compression              | Compresses portions of Blender Open Movie Project's "Big Buck Bunny"
deepsjeng  | Artificial Intelligence        | Plays Chess variants employing alpha-beta tree search
leela      | Artificial Intelligence        | Computer Go featuring Monte Carlo tree search
exchange2  | Artificial Intelligence        | Recursively solves 27 9x9 Sudoku puzzles
xz         | General Data Compression       | Tar archive, database of ClamAV signatures and a file with text and image data
bwaves     | Explosion Modeling             | Simulates blast waves in 3D
cactuBSSN  | Physics: Relativity            | Uses BSSN formulation of Einstein equations and employs finite differencing in space
namd       | Molecular Dynamics             | 92224-atom simulation of Apolipoprotein A-I
parest     | Biomedical Imaging             | 3D reconstruction of interior of a body using multiple 2D observations
povray     | Ray Tracing                    | Renders a 2560x2048 pixel image of a chess board with pieces
lbm        | Fluid Dynamics                 | Simulates flow of an incompressible fluid in 3D
wrf        | Weather Forecasting            | Simulates the January 2000 North American Blizzard
blender    | 3D Rendering and Animation     | Renders a reduced version of the Weybec Crazy Glue shot 3 data set to image
cam4       | Atmosphere Modeling            | Atmospheric component of the NCAR Earth System
pop2       | Wide-scale Ocean Modeling      | Ocean component of the NCAR Earth System
imagick    | Image Manipulation             | Performs various transformations on input images
nab        | Molecular Dynamics             | Models molecules with varying numbers of atoms
fotonik3d  | Computational Electromagnetics | Employs finite-difference time-domain method for the Maxwell equations
roms       | Regional Ocean Modeling        | A free-surface, hydrostatic, primitive equation model

2 SPEC CPU2017
SPEC CPU is a widely acknowledged suite of CPU benchmarks, which is used for testing the performance of processor and memory systems. A number of versions of SPEC CPU have been released over the years, with the latest version released in 2017. CPU2017 [3] considers state-of-the-art applications, organizing 43 benchmarks into four different sub-suites: 10 rate integer (INTRate), 10 speed integer (INTSpeed), 13 rate floating point (FPRate) and 10 speed floating point (FPSpeed). The speed and rate suites vary in workload sizes, compile flags and run rules. SPECspeed measures performance by executing a single copy of each benchmark, with an option of using multiple OpenMP threads, providing a measure of single thread performance. This performance is typically measured by metrics like IPC (Instructions Per Cycle). On the other hand, SPECrate measures the throughput of an overall chip, possibly with multiple cores, by running multiple, concurrent copies of the same benchmark with OpenMP disabled. Most applications in CPU2017 have both rate and speed versions (denoted as 5nn.benchmark_r and 6nn.benchmark_s, respectively), except for namd, parest, povray and blender, which only have rate versions, and pop2, which only has a speed version. Similar to CPU2006, CPU2017 is provided with three input sets: test (to test if executables are functional), train (input set built using feedback-directed optimization and used for training binaries), and ref (timed data set of the real applications, which is intended for a reportable run). CPU2017 benchmarks and their input sets are described in Table 1.

Table 2: System Configuration
Model: 40-core Xeon E5-2698 v4
CPU Frequency: 2.2 GHz
L1i cache: 8-way, 32 KB
L1d cache: 8-way, 32 KB
L2 cache: 8-way, 256 KB
L3 cache: Shared, 20-way, 50 MB
Cache line size: 64 Bytes
Main Memory: 505 GB, DDR4
Dynamic Frequency Scaling: On

3 METHODOLOGY
To study the characteristics of CPU2017 workloads, we used a number of tools to analyse their behavior. The analysis in this paper is based on the x86_64 instruction set. The binaries for the workloads were created using the default, SPEC recommended compiler flags [2] (gcc -O3), using compilation scripts which ship with CPU2017. Speed workloads are compiled to use 4 OpenMP threads, while Rate workloads were executed with a single instance of the benchmark. We use Pin [23], a dynamic binary instrumentation framework available for both 32 and 64 bit versions of the instruction set. Pin provides a rich set of APIs that can be used to study various characteristics of program behavior at the Instruction Set Architecture (ISA) level. These APIs are used to create a number of tools, called Pintools, each capable of carrying out a certain type of analysis. In this study, we use the following Pintools: ldst (dynamic register/memory operand pattern profiler), opcodemix (dynamic opcode mix profiler), and dcache (a functional simulator of a data cache).

For gathering information about workload behavior on real hardware, we use perf, a performance analysis tool [9], and ps [1], an OS utility to collect process level memory related data for various workloads. Table 2 presents the configuration of the machine used to run experiments for Pin-based, hardware-counter and system-level experimentation and data collection. All the benchmarks were executed till completion.
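To make the Pin workflow concrete, the sketch below drives a benchmark binary under one of these Pintools. This is a minimal sketch, not the harness used in the paper: the Pin kit location, the compiled Pintool path, the -o output knob, and the mcf command line are all assumptions that depend on the local installation and SPEC run directories.

```python
import subprocess
from pathlib import Path

# Assumed locations of the Pin kit and the compiled example Pintools.
PIN = Path("/opt/pin/pin")
TOOLS = Path("/opt/pin/source/tools/SimpleExamples/obj-intel64")

def run_under_pintool(tool: str, out_file: str, cmd: list[str]) -> None:
    """Run `cmd` under the given Pintool; the tool writes its report to out_file."""
    argv = [str(PIN), "-t", str(TOOLS / tool), "-o", out_file, "--", *cmd]
    subprocess.run(argv, check=True)

# Hypothetical usage: collect the dynamic opcode mix of an mcf run.
# run_under_pintool("opcodemix.so", "mcf.opcodemix.out", ["./mcf_r", "inp.in"])
```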
4 INSTRUCTION PROFILE
Analysis of the instruction profiles is a good mechanism for understanding a benchmark's behavior and locating potential sources of bottlenecks in hardware and software. To that end, we first study the dynamic instruction count, instruction distribution and the runtime performance of CPU2017 workloads.

Dynamic Instruction Count: Figure 1 depicts the dynamic instruction count, a count of the total number of instructions executed by the workload.


Figure 1: Dynamic Instruction Count. (Clusters from L to R: INTRate, INTSpeed, FPRate, FPSpeed)

(a) Instruction Distribution

(b) Opcode Distribution

Figure 2: Instruction Profile. (Clusters from L to R: INTRate, INTSpeed, FPRate, FPSpeed)

Each benchmark is divided into its constituent workloads, depending on the input set. These results were collected using perf's hardware event instructions. We note that the average instruction count for SPEC CPU2017 workloads is 22.19 trillion (1.4 quadrillion in total), which is an order of magnitude higher than for SPEC CPU2006 [14, 16]. We also observe that the FPSpeed sub-suite has a much larger dynamic instruction count than the other sub-suites, with bwaves_s executing as many as 382 trillion instructions. In general, Speed workloads have 1.5-10× more instructions than the corresponding Rate ones, and floating point (FP) workloads have 3-17 times more instructions than the integer (INT) workloads. These observations point to the general increase in the complexity of SPEC CPU workloads over the years.
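These counts come from perf's `instructions` hardware event. A minimal sketch of such a collection step is shown below; the helper name and the CSV (`-x,`) parsing are ours, and the sketch assumes a Linux perf build that reports counter lines on stderr.

```python
import subprocess

def count_instructions(cmd: list[str]) -> int:
    """Run cmd under `perf stat` and return the retired-instruction count."""
    # -x, selects CSV output; -e instructions requests the hardware event.
    res = subprocess.run(["perf", "stat", "-x,", "-e", "instructions", *cmd],
                         capture_output=True, text=True, check=True)
    # perf writes counter lines to stderr as: "<value>,<unit>,<event>,..."
    for line in res.stderr.splitlines():
        fields = line.split(",")
        if len(fields) > 2 and fields[2].startswith("instructions"):
            return int(fields[0])
    raise RuntimeError("instructions event not found in perf output")

# Hypothetical usage:
# total = count_instructions(["./mcf_r", "inp.in"])
```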


Figure 3: IPC. (Clusters from L to R: INTRate, INTSpeed, FPRate, FPSpeed)

Instruction Distribution: To better understand the distribution of instructions that access memory, we present the instruction distribution for workloads in Figure 2a. These experiments were conducted using the ldst Pintool. Some benchmarks like perlbench, x264, bwaves and a few others have multiple input files, which are executed in succession to complete the run. We report the results of each of these runs individually, leading to multiple bars for a benchmark. To keep the discussion simple, we divide the instructions into four broad categories: instructions that do not refer to memory (called ALU Only in the figure), instructions that have one or more source operands in memory (called MEM_R), instructions whose destination operand is in memory (MEM_W), and instructions whose source and destination operands are in memory (MEM_RW)*.

* Memory-to-memory instructions like movs in x86 are billed under the MEM_RW bucket.

This broad classification allows us to compare the types of instructions that are executed by each benchmark, and provides a first order insight into the memory behavior of these benchmarks. We make a few interesting observations. First, irrespective of the input sets provided, the instruction distribution of a benchmark across these four buckets does not change drastically. This is evidenced by the instruction distributions of all benchmarks that have multiple input files (perlbench, gcc, x264, xz, bwaves). Also, the instruction distribution across the four buckets does not change significantly, irrespective of whether the speed or the rate version of the benchmark is being examined.

Most benchmarks have a fairly balanced percentage of instructions that fall under one of the MEM_R/MEM_W/MEM_RW or the ALU_Only buckets. However, a few exceptions like exchange2 (AI) and pop2 (Ocean Modeling) exist, where the contribution of ALU_Only operations is fairly significant at 79.6% and 73.5%, respectively. Floating point workloads also exhibit a lot of compute activity, with ∼60% ALU_Only instructions. However, on average across the benchmark suite, the SPECINT sub-suite executes more memory related instructions than the SPECFP one. Our observations are consistent with the earlier versions of SPEC: CPU2006 and CPU2000 [16].

In order to get insights regarding the type of operations performed by these instructions, we profile the benchmarks to report an instruction level classification. Results, collected with the help of the opcodemix Pintool, are presented in Figure 2b. The results for each benchmark were averaged across all its input files. We observe that FP workloads have approximately three times the number of arithmetic operations as the INT workloads. In addition, we observe that a majority of the memory operations in both integer and floating point sub-suites are dominated by their respective move operations. We also observe that memory instructions for both INT and FP benchmarks are predominantly read-only, which is consistent with the high-level results obtained in Figure 2a.

Performance: We report the performance of the workloads in terms of instructions per cycle (IPC) in Figure 3, on the system outlined in Table 2. IPC is calculated as the ratio of the hardware events instructions and cpu-cycles, obtained using perf. To account for variations in execution time due to variables that cannot be controlled, each experiment is run three times and the average values are reported. The order of benchmark execution is shuffled between repetitions to mitigate measurement bias. The rest of the experiments in the paper are not repeated. We observe that FP workloads have better IPC than INT ones. However, we do note that applications that execute a significant number of memory related operations (e.g. cactuBSSN_s, lbm_s, xz_s.cld.tar-1400-8 and mcf) and have larger working sets, requiring more accesses to the memory hierarchy, have lower IPCs.
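A sketch of how this repetition-and-shuffle IPC methodology can be scripted around perf is shown below. The benchmark command lines are placeholders, and the CSV field layout is assumed from recent Linux perf versions; neither is taken from the paper's own harness.

```python
import random
import statistics
import subprocess

def measure_ipc(cmd: list[str]) -> float:
    """One perf run measuring retired instructions per CPU cycle."""
    res = subprocess.run(
        ["perf", "stat", "-x,", "-e", "instructions,cpu-cycles", *cmd],
        capture_output=True, text=True, check=True)
    counts = {}
    for line in res.stderr.splitlines():
        f = line.split(",")
        if len(f) > 2 and f[2].split(":")[0] in ("instructions", "cpu-cycles"):
            counts[f[2].split(":")[0]] = int(f[0])
    return counts["instructions"] / counts["cpu-cycles"]

benchmarks = {  # placeholder command lines
    "mcf_r": ["./mcf_r", "inp.in"],
    "deepsjeng_r": ["./deepsjeng_r", "ref.txt"],
}

ipc = {name: [] for name in benchmarks}
order = list(benchmarks)
for _ in range(3):             # three repetitions, as in Section 4
    random.shuffle(order)      # shuffle order to mitigate measurement bias
    for name in order:
        ipc[name].append(measure_ipc(benchmarks[name]))

for name, runs in ipc.items():
    print(f"{name}: IPC = {statistics.mean(runs):.2f}")
```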
5 MEMORY BEHAVIOR
5.1 Spatial Locality Behavior
Next, we observe the spatial locality characteristics of the workloads using the opcodemix Pintool. Opcodemix helps analyse the amount and size of data touched by any memory operation that has to traverse the cache and memory hierarchy. We classify the instructions based on the amount of data that they access during these operations.


Figure 4: Memory Reference Size

In the interest of space, we present results averaged across the suites in Figure 4. There is a broad range of data size granularities accessed by instructions, from 1 Byte to 64 Bytes, the latter also being the cache line size. However, two important figures stand out. First, the majority of the accesses (64%) are for an exact 8 Byte granularity. Second, 99.5% of accesses (reads and writes) are for 8 Bytes or smaller access granularities. The number of accesses to larger data granularities is extremely small, and this holds true across the suite. This indicates limited spatial data locality at the individual instruction level.

5.2 Working Set Sizes
The working set size of an application and its sensitivity to cache capacity can be inferred by examining changes in the cache performance of a system as its cache size varies. For each benchmark, we conduct a cache sensitivity analysis to obtain its working set size. Following the methodology from [16], we modeled a single, shared cache with a 64 Byte line size and LRU replacement policy, whose size varied as follows: direct mapped 32KB, 2-way 64KB, 4-way 128KB, 8-way 256KB, and so on till a 1024-way 32MB cache. The experiments are conducted using dcache, a functional cache simulator Pintool.
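The dcache Pintool itself ships with Pin, but its essential behavior is easy to model. Below is a minimal functional sketch of the cache sweep described above: a set-associative, 64 B line, LRU cache whose size and associativity double together from 32 KB direct-mapped to 32 MB 1024-way (so the number of sets stays fixed at 512). The toy address trace is a placeholder for the instrumented access stream.

```python
from collections import OrderedDict

LINE = 64  # bytes per cache line, as in Section 5.2

class LRUCache:
    """Set-associative cache with LRU replacement (tags only, no data)."""
    def __init__(self, size_bytes: int, ways: int):
        self.ways = ways
        self.num_sets = size_bytes // (LINE * ways)
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.misses = 0

    def access(self, addr: int) -> None:
        line = addr // LINE
        s = self.sets[line % self.num_sets]
        tag = line // self.num_sets
        if tag in s:
            s.move_to_end(tag)        # hit: mark as most recently used
        else:
            self.misses += 1
            if len(s) >= self.ways:   # set full: evict least recently used
                s.popitem(last=False)
            s[tag] = None

def mpki(trace: list[int], n_instr: int, size: int, ways: int) -> float:
    cache = LRUCache(size, ways)
    for addr in trace:
        cache.access(addr)
    return 1000.0 * cache.misses / n_instr

# Toy trace of sequential 8-byte accesses; a real trace would come from
# an instrumentation tool such as Pin.
trace = [i * 8 for i in range(1 << 16)]
size, ways = 32 * 1024, 1
while size <= 32 * 1024 * 1024:
    m = mpki(trace, len(trace), size, ways)
    print(f"{size >> 10:6d} KB, {ways:4d}-way: MPKI = {m:.2f}")
    size, ways = size * 2, ways * 2
```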
Due to dynamic instruction counts on the order of 100 trillion and the effective slowdown incurred by simulation on dcache, benchmarks belonging to the FPSpeed suite could not be completed and hence are excluded from the working set size analysis. We consider only one input set for each benchmark.

Our results for the cache sensitivity analysis are presented in Figure 5. We plot cache size in megabytes (MB) on the x-axis and misses per kilo instructions (MPKI) on the y-axis. We observe that not all workloads perform well within the range of cache sizes. Based on the working set sizes, we divide the workloads into two groups. The first group consists of applications like povray, imagick, nab, and perlbench, which have a limited need for cache capacity, and can be executed well without the need to regularly refer to main memory. On the other hand, applications like cactuBSSN, lbm, and mcf fail to accommodate their working set within the range of cache sizes. The large working sets are often a consequence of the program's algorithm operating on a large amount of data. For example, cactuBSSN executes a computational model that employs finite differencing in space using the Einstein equations, while lbm simulates an incompressible fluid in 3D. With working set sizes larger than the cache capacity, these applications refer to off-chip memory and hence affect the bandwidth.

Figure 5 reveals that most workloads exhibit a smooth exponential decrease in MPKI as the cache size increases. However, the suite comprises some workloads where incrementally increasing the cache size gives no significant improvement in cache performance, until a point of saturation is reached. At this step, a sudden drop in the MPKI is observed. Such behavior is evident in applications like bwaves and lbm, and signifies the working set of the workload: at this point, the cache size has become large enough to hold the highly accessed data. Benchmarks like xalancbmk, nab, fotonik3d, and lbm illustrate multiple such points, implying that they have multiple phases with varying working set sizes. Most workloads suffer from cache misses even with a reasonable 32MB cache size, implying that memory hierarchy research, for both on-chip and off-chip components, will remain important for these workloads.

5.3 Memory Footprint
SPEC CPU2006 had a target memory footprint of 900MB for the benchmarks [11, 15]. Since then, memory sizes have tremendously increased. We observe the Resident Set Size (RSS), the amount of memory allocated to a process in main memory, sampled every second using the ps utility. RSS does not include swapped out memory. However, it does include memory from shared libraries as long as the pages from those libraries are actually in memory. A large RSS on an active system means that the process touches a lot of memory locations.

Figure 6 plots the average and peak main memory consumption across the execution in MBs, and indicates that all of the Rate benchmarks, both integer and floating point, still have main memory consumption well below 900MB. However, Speed workloads have large RSS, with peak consumption as high as 16 GB. On average, Speed benchmarks have a ∼10× larger memory footprint than their corresponding Rate ones. The floating point benchmark suite has a memory consumption ∼3× more than the integer suite. Based on the average footprint throughout the execution, we order the benchmarks from low to high memory consumption. Benchmarks exchange2, povray, leela, namd, wrf, nab, and xalancbmk have low RSS values, which indicates negligible access to main memory. Therefore, these benchmarks are expected to have low working set sizes, which is also reflected in the cache sensitivity analysis reported in Section 5.2. On the contrary, bwaves_s, roms_s, fotonik3d_s, cactuBSSN_s and xz_s exhibit extremely large memory footprints. Furthermore, we observe that ∼90% of the workloads have main memory consumption below 5 GB, resulting in an average memory footprint of 1.82 GB.
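The RSS sampling loop can be reproduced with a few lines around the ps invocation. This is a sketch under our own assumptions: the helper names and the one-second default are ours, and `ps -o rss=` reporting kilobytes follows Linux procps behavior.

```python
import subprocess
import time

def sample_rss_kb(pid: int) -> int:
    """Return the current resident set size of `pid` in KB, via ps."""
    out = subprocess.run(["ps", "-o", "rss=", "-p", str(pid)],
                         capture_output=True, text=True, check=True)
    return int(out.stdout.strip())

def profile_rss(cmd: list[str], interval: float = 1.0) -> tuple[float, int]:
    """Launch cmd and sample its RSS every `interval` seconds.

    Returns (average_kb, peak_kb) across the run."""
    proc = subprocess.Popen(cmd)
    samples = []
    while proc.poll() is None:      # sample until the workload exits
        try:
            samples.append(sample_rss_kb(proc.pid))
        except (subprocess.CalledProcessError, ValueError):
            break                   # process exited between poll() and ps
        time.sleep(interval)
    if not samples:
        return 0.0, 0
    return sum(samples) / len(samples), max(samples)

# Hypothetical usage:
# avg_kb, peak_kb = profile_rss(["./bwaves_s"], interval=1.0)
```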


Figure 5: Working Set Size (MPKI vs. cache size; panels a-w: perlbench, gcc, mcf, omnetpp, xalancbmk, x264, deepsjeng, leela, exchange2, xz, bwaves, cactuBSSN, namd, parest, povray, lbm, wrf, blender, cam4, imagick, nab, fotonik3d, roms)

5.4 Memory Bandwidth
Next, we measure the off-chip bandwidth across the SPEC CPU2017 workloads. We collect the hardware events LLC-load-misses and LLC-store-misses using perf at regular intervals of 1 second, on the test system described in Table 2. Memory bandwidth is calculated as the product of the total LLC misses per second and the cache line size. Figure 7 plots the average and peak memory bandwidth results in Megabytes per second, for each workload.

Our experimental results indicate a large variety in memory bandwidth usage patterns across the benchmarks. CPU2017 consists of workloads with average bandwidth as low as 0.2 MB/s, up to workloads with peak bandwidth of 2.3 GB/s. leela, exchange2, namd, povray, and nab_r have modest bandwidth usage, with consumption within 10 MB/s during the entire execution period. Workloads parest, wrf_r, nab_s and perlbench.diffmail exhibit low bandwidth usage with short, sudden, irregular bursts of high data transfer rates. Applications like xalancbmk and imagick have input sets which fit within on-chip memory, and hence these applications do not refer to off-chip memory after initiation. All the above discussed benchmarks have very little off-chip bandwidth usage. This is in line with the conclusions drawn from Sections 5.2 and 5.3, as these workloads have low working set sizes and hence low memory footprint.
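A sketch of this bandwidth estimation is shown below, using perf's interval mode (-I 1000). The row layout assumed for interval-mode CSV output (timestamp, value, unit, event) and the helper name are ours; on machines where the LLC events are not exposed under these names, the event list would need adjusting.

```python
import subprocess

LINE_SIZE = 64  # bytes per cache line (Table 2)

def bandwidth_mbps(cmd: list[str]) -> list[float]:
    """Estimate off-chip bandwidth (MB/s), sampled once per second, for cmd.

    Following Section 5.4, bandwidth is approximated as the LLC load and
    store misses per second multiplied by the cache line size."""
    res = subprocess.run(
        ["perf", "stat", "-I", "1000", "-x,",
         "-e", "LLC-load-misses,LLC-store-misses", *cmd],
        capture_output=True, text=True, check=True)
    misses = {}  # interval timestamp -> total LLC misses in that interval
    for line in res.stderr.splitlines():
        f = line.split(",")
        # With -I and -x, each row is: timestamp,value,unit,event,...
        if len(f) > 3 and f[3].startswith("LLC-") and f[1].strip().isdigit():
            misses[f[0]] = misses.get(f[0], 0) + int(f[1])
    return [m * LINE_SIZE / 1e6 for m in misses.values()]

# Hypothetical usage:
# samples = bandwidth_mbps(["./mcf_r", "inp.in"])
# print(f"avg {sum(samples)/len(samples):.1f} MB/s, peak {max(samples):.1f} MB/s")
```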


Figure 6: Main Memory Consumption, average and peak (in MB). (Clusters from L to R: INTRate, INTSpeed, FPRate, FPSpeed)

Figure 7: Memory Bandwidth, average and peak (in MB/s). (Clusters from L to R: INTRate, INTSpeed, FPRate, FPSpeed)

CPU2017 also comprises many benchmarks with large memory bandwidth utilization. For example, cactuBSSN_s and lbm_s have peak bandwidth utilization of 2.3 GB/s (0.9 GB/s on average). Similarly, mcf, xz_s, cactuBSSN_r, and fotonik3d_s also have large off-chip traffic, and can be used to test bandwidth optimization techniques.

6 NEW ADDITIONS TO SPEC CPU
In the current iteration of SPEC CPU, many new benchmarks have been added to cover emerging application domains. In the INT category, artificial intelligence (AI) is extensively represented by a total of three benchmarks, with exchange2 being the new addition to the group. CPU2006 [14] integer benchmarks h264ref, sjeng and gobmk have been renamed to x264, deepsjeng and leela respectively, due to changes in their functionality or inputs, while still maintaining the application domain. Additionally, bzip2 has been replaced by xz to represent the general compression domain. exchange2 (a recursive solution generator), the new addition to the INT suite, has the lowest percentage of memory instructions and hence, justifiably, the lowest memory footprint and lowest bandwidth consumption in the CPU2017 suite. Interestingly, all three AI benchmarks in the suite have extremely small working set sizes and, consequently, low off-chip accesses.

In the FP category, eight new benchmarks have been added: parest, blender, cam4, pop2, imagick, nab, fotonik3d, and roms. The climatology domain is extensively represented here, with three new benchmarks simulating different components of the NCAR Earth System. cactusADM has been changed to cactuBSSN. parest's implementation relies on the dealII libraries from CPU2006, which also underlie the dealII benchmark. In general, Speed versions of these benchmarks are scaled up in order to heavily exercise both memory and computation. For example, xz achieves this by differing in its data compression levels, roms varies its grid size and simulation time steps, while fotonik3d alters its problem size, frequencies, time steps, and boundary conditions. At the same time, benchmarks x264, leela and exchange2 use almost similar workloads for both Rate and Speed and hence we discern very similar instruction and memory behavior from them, as depicted throughout Sections 4 and 5.


7 RELATED WORK
A number of studies have been carried out recently regarding the characterization of SPEC CPU2017 workloads; however, to the best of our knowledge, this paper presents the first systematic study of the memory behavior of the SPEC CPU2017 suite.

SPEC CPU2017 Characterization: Bucek et al. [8] present an overview of the CPU2017 suite and discuss its reportable execution. Limaye and Adegbija [21] use hardware performance counter statistics to characterize SPEC CPU2017 applications with respect to several metrics such as instruction distribution, execution performance, and branch and cache behaviors. They also utilize Principal Components Analysis [10] and hierarchical clustering to identify subsets of the suite. Similarly, Panda et al. [24] characterize the CPU2017 benchmarks using perf, and leverage statistical techniques to identify cross application redundancies and propose subsets of the entire suite, by classifying multiple benchmarks with similar behaviors into a single subset. Further, they also provide a detailed evaluation of the representativeness of the subsets. Amaral et al. [5] propose the Alberta Workloads for the SPEC CPU2017 benchmark suite, hoping to improve the performance evaluation of techniques that rely on any type of learning, for example formal Feedback-Directed Optimization (FDO). Additionally, in order to ameliorate large simulation times, Wu et al. [26] analyze the program behavior and consequently propose simulation points [12] for the suite.

Memory Characterization of Workloads: Jaleel [16] determined the memory system requirements of workloads from SPEC CPU2000 and CPU2006 using binary instrumentation. Henning [15] discussed the memory footprints of CPU2006 workloads, while Gove [11] analysed their working set sizes. Bienia et al. [7] present the memory behavior of the PARSEC benchmark suite. John et al. [18] discuss a taxonomy of workload characterization techniques.

8 CONCLUSION
In this paper, we provide the first comprehensive characterization of the memory behavior of the SPEC CPU2017 benchmark suite. Our working set analysis shows that many workloads have a working set much higher than 32 MB (the maximum cache size assumed in our experiments), implying the continued importance of cache hierarchies for benchmark performance. We also show that Rate benchmarks, both INT and FP, still have main memory consumption well below 900 MB, which was the target memory footprint for CPU2006. Almost 90% of the workloads have main memory consumption below 5 GB, with the average across the suite being 1.82 GB. However, workloads have extremely varying peak memory bandwidth usage, with some benchmarks requiring as little as 0.2 MB/s and others utilizing up to 2.3 GB/s.

In addition, our experiments have revealed some interesting results with respect to dynamic instruction counts and distributions. The average instruction count for SPEC CPU2017 workloads is 22.19 trillion, which is an order of magnitude higher than for SPEC CPU2006. In addition, we find that FP benchmarks typically have much higher compute requirements: on average, FP workloads carry out three times the number of arithmetic operations as compared to INT workloads.

ACKNOWLEDGMENTS
The authors thank the anonymous reviewers and shepherd for their useful comments and feedback. This work was supported in part by SERB grant (ECR/2017/000887), as well as Indian Institute of Technology Gandhinagar and Ashoka University, Sonipat.

REFERENCES
[1] [n. d.]. PS man-page. http://man7.org/linux/man-pages/man1/ps.1.html.
[2] [n. d.]. SPEC CPU2017 Base Compilation Flags. https://www.spec.org/cpu2017/Docs/runrules.html#rule_2.3.
[3] [n. d.]. SPEC CPU2017 Documentation. https://www.spec.org/cpu2017/Docs.
[4] Junwhan Ahn, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi. 2015. PIM-enabled instructions: a low-overhead, locality-aware processing-in-memory architecture. In Computer Architecture (ISCA), 2015 ACM/IEEE 42nd Annual International Symposium on. IEEE, 336-348.
[5] J. N. Amaral et al. 2018. The Alberta Workloads for the SPEC CPU 2017 Benchmark Suite. In Proceedings of ISPASS.
[6] Manu Awasthi et al. 2012. Managing data placement in memory systems with multiple memory controllers. International Journal of Parallel Programming 40, 1 (2012).
[7] Christian Bienia et al. 2008. The PARSEC benchmark suite: Characterization and architectural implications. In Proceedings of PACT.
[8] James Bucek, Klaus-Dieter Lange, et al. 2018. SPEC CPU2017: Next-Generation Compute Benchmark. In Companion of the 2018 ACM/SPEC International Conference on Performance Engineering. ACM, 41-42.
[9] Arnaldo Carvalho De Melo. 2010. The new linux perf tools. In Slides from Linux Kongress, Vol. 18.
[10] George H Dunteman. 1989. Principal components analysis. Number 69. Sage.
[11] Darryl Gove. 2007. CPU2006 working set size. ACM SIGARCH Computer Architecture News 35, 1 (2007), 90-96.
[12] Greg Hamerly, Erez Perelman, and Brad Calder. 2004. How to use simpoint to pick simulation points. ACM SIGMETRICS Performance Evaluation Review 31, 4 (2004), 25-30.
[13] Hasan Hassan et al. 2016. ChargeCache: Reducing DRAM latency by exploiting row access locality. In Proceedings of HPCA.
[14] John L. Henning. 2006. SPEC CPU2006 Benchmark Descriptions. SIGARCH Comput. Archit. News 34, 4 (2006).
[15] John L. Henning. 2007. SPEC CPU2006 Memory Footprint. SIGARCH Comput. Archit. News 35, 1 (March 2007), 84-89. https://doi.org/10.1145/1241601.1241618
[16] Aamer Jaleel. 2010. Memory characterization of workloads using instrumentation-driven simulation. Web Copy: http://www.glue.umd.edu/ajaleel/workload (2010).
[17] Youngbin Jin, Mustafa Shihab, and Myoungsoo Jung. 2014. Area, power, and latency considerations of STT-MRAM to substitute for main memory. In Proc. ISCA.
[18] Lizy Kurian John, Purnima Vasudevan, and Jyotsna Sabarinathan. 1999. Workload characterization: Motivation, goals and methodology. In Workload Characterization: Methodology and Case Studies. Based on the First Workshop on Workload Characterization. IEEE, 3-14.
[19] Emre Kültürsay et al. 2013. Evaluating STT-RAM as an energy-efficient main memory alternative. In Proceedings of ISPASS.
[20] Benjamin C Lee, Engin Ipek, Onur Mutlu, and Doug Burger. 2009. Architecting phase change memory as a scalable dram alternative. In ACM SIGARCH Computer Architecture News, Vol. 37. ACM, 2-13.
[21] A. Limaye and T. Adegbija. 2018. A Workload Characterization of the SPEC CPU2017 Benchmark Suite. In Proceedings of ISPASS.
[22] Jamie Liu et al. 2012. RAIDR: Retention-aware intelligent DRAM refresh. In ACM SIGARCH Computer Architecture News, Vol. 40.
[23] Chi-Keung Luk et al. 2005. Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation. SIGPLAN Not. 40, 6 (June 2005).
[24] R. Panda et al. 2018. Wait of a Decade: Did SPEC CPU 2017 Broaden the Performance Horizon? In Proceedings of HPCA.
[25] Aniruddha N. Udipi et al. 2010. Rethinking DRAM Design and Organization for Energy-constrained Multi-cores. In Proceedings of ISCA.
[26] Qinzhe Wu et al. 2018. Hot Regions in SPEC CPU2017. (2018).
[27] Xiaoxia Wu et al. 2009. Hybrid cache architecture with disparate memory technologies. In ACM SIGARCH computer architecture news, Vol. 37.
[28] Wangyuan Zhang and Tao Li. 2009. Exploring phase change memory and 3D die-stacking for power/thermal friendly, fast and durable memory architectures. In Proceedings of PACT.
