Raexplore: Enabling Rapid, Automated Architecture Exploration

ANL/ALCF/TM-14/2 Raexplore: Enabling Rapid, Automated Architecture Exploration for Full Applications Argonne Leadership Computing Facility Division About Argonne National Laboratory Argonne is a U.S. Department of Energy laboratory managed by UChicago Argonne, LLC under contract DE-AC02-06CH11357. The Laboratory’s main facility is outside Chicago, at 9700 South Cass Avenue, Argonne, Illinois 60439. For information about Argonne and its pioneering science and technology programs, see www.anl.gov. DOCUMENT AVAILABILITY Online Access: U.S. Department of Energy (DOE) reports produced after 1991 and a growing number of pre-1991 documents are available free via DOE's SciTech Connect (http://www.osti.gov/scitech/) Reports not in digital format may be purchased by the public from the National Technical Information Service (NTIS): U.S. Department of Commerce National Technical Information Service 5301 Shawnee Rd Alexandra, VA 22312 www.ntis.gov Phone: (800) 553-NTIS (6847) or (703) 605-6000 Fax: (703) 605-6900 Email: [email protected] Reports not in digital format are available to DOE and DOE contractors from the Office of Scientific and Technical Information (OSTI): U.S. Department of Energy Office of Scientific and Technical Information P.O. Box 62 Oak Ridge, TN 37831-0062 www.osti.gov Phone: (865) 576-8401 Fax: (865) 576-5728 Email: [email protected] Disclaimer This report was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor any agency thereof, nor UChicago Argonne, LLC, nor any of their employees or officers, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise, does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof. The views and opinions of document authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof, Argonne National Laboratory, or UChicago Argonne, LLC. ANL/ALCF/TM-14/2 Raexplore Enabling Rapid, Automated Architecture Exploration for Full Applications prepared by Yao Zhang, Prasanna Balaprakash, Jiayuan Meng, Vitali Morozov, Scott Parker, and Kalyan Kumaran Argonne Leadership Computing Facility Division, Argonne National Laboratory December 2014 Raexplore: Enabling Rapid, Automated Architecture Exploration for Full Applications Yao Zhang, Prasanna Balaprakash, Jiayuan Meng, Vitali Morozov, Scott Parker, and Kalyan Kumaran Argonne National Laboratory {yaozhang, pbalapra, jmeng, morozov, sparker, kumaran}@anl.gov Abstract SingleHthread%performance%(GigaInstrUc_ons/s)%=%Freq%×%IssUe%width% 20% POWER7%% Where%POWER%is%going:%server,%data%analy_cs% We present Raexplore, a performance modeling framework 18% for architecture exploration. Raexplore enables rapid, auto- Where%highHend%x86%is%going:%desktop,%server,%data%analy_cs% 16% Ivy%Bridge% mated, and systematic search of architecture design space by Scale%Up%→% Haswell% 14% Where%highHend%ARM%is%going:%desktop,%server% combining hardware counter-based performance characteri- Sandy%Bridge% XHGene%2% zation and analytical performance modeling. We demonstrate 12% 10% XHGene%1% Raexplore for two recent manycore processors IBM Blue- POWER6% Nehalem% Gene/Q compute chip and Intel Xeon Phi, targeting a set of 8% Where%PowerPC%is%going:%network,%gaming,%HPC% scientific applications. Our framework is able to capture com- 6% Cortex%A15% PowerPC%A2(PowerEN)% POWER5% plex interactions between architectural components including 4% Cortex%A9% Atom%S1260% PowerPC%A2%(BG/Q)% Xeon%Phi%(Knights%Corner)% Cortex%A8% PowerPC%75%(Wii%U)% Xeon%Phi(Knights%Ferry,SP)% instruction pipeline, cache, and memory, and to achieve a 2% PowerPC%440%(BG/L)% Atom%N450% GT200% GF110% GK110% 3–22% error for same-architecture and cross-architecture per- PowerPC%450%(BG/P)% Where%manyHcore/GPU%is%going:%gaming,%HPC% 0% Where%lowHend%x86%is%going:%smartphone,%tablet% formance predictions. Furthermore, we apply our framework 0% 200% 400% 600% 800% 1000% 1200% 1400% to assess the two processors, and discover and evaluate a list ThroUghpUt%performance%(GFLOPS)%=%Freq%×%Vector%width%(DP)%×%Cores%% Scale%oUt%→% of architectural scaling options for future processor designs. Figure 1: Processor landscape and trends. Processors in the same family/catergory are color coded. 1. Introduction Fundamentally, architecture design is driven by applications. Over 20 years ago, supercomputer pioneer Seymour Cray fa- Architectural evaluation and comparison for a diverse set of mously made an analogy on computer design “If you were current processors are challenging because they often require plowing a field, which would you rather use: two strong oxen significant human efforts to port applications to different archi- or 1024 chickens?" Today, we see both types of computers in tectures [8,31]. Studying the performance of future processors the marketplace, and computer architects face an even wider is even more challenging, as the hardware is not yet available. spectrum of design choices on core complexity, memory hier- While commonly used simulation-based techniques could pro- archies, parallelism, and special-purpose accelerators. Further- vide highly accurate results, they are prohibitively slow to more, the processor design landscape is becoming increasingly handle the combinatorial explosion of design choices in a more dynamic, as we see mainstream processors both trickling multi-dimensional space and thus often limited to studying up (e.g., ARM, DSP, GPU) and trickling down (e.g., Atom, kernels and benchmarks rather than larger programs, mini- Xeon Phi) in their design space to meet the demands for a apps, and even full applications [3, 39, 41, 42]. range of emerging applications in scientific computing, data In this work, we aim to address this architecture design analytics, gaming, wearable devices, computer vision, etc. challenge by developing Raexplore (Rapid architecture ex- For a big-picture view of this background, Figure1 sketches plore, pronounced as ray-xplore), a performance modeling today’s processor landscape and macro trends in terms of framework to reduce the needs for application porting in ar- single-thread performance and throughput performance. In chitectural comparison, and to serve as a fast, first-order ar- Figure2, we select eight representative processors and posi- chitecture explorer to complement slower but more accurate tion them in a multi-dimensional design space in terms of their simulation-based techniques. In particular, we make the fol- architectural features. The main observation is that the design lowing contributions in this paper. First, we develop a method- space is vast in terms of both high dimensionality and large ology that combines experimental performance characteriza- dynamic range for each dimension. We list eight major archi- tion and analytical performance modeling to enable rapid and tectural features (dimensions) ranging from core complexity systematic architecture exploration. Second, we develop ana- to memory hierarchies, not to mention other relatively minor lytical models for two recent manycore processors IBM Blue features such as branch prediction, prefetching, and memory Gene/Q compute chip and Intel Xeon Phi. We show that our management. We use the ratio between the highest and lowest models could capture complex interactions between architec- value to measure the span of the dynamic range of each feature tural components including instruction pipeline, cache, and (dimension). The observed span ranges from 4× up to 78×. memory, and achieve a 3–22% error for same-architecture IBM#BLue#Gene/Q# NVIDIA#K40# InteL#Xeon#Phi#7120P# InteL#Xeon#E7T8895V2# ARM#Cortex#A15# InteL#Atom#C2758# IBM#POWER8# Fujitsu#SPARC64#VIIIfx# 1# 61# 5# 8# 32# 64# 64# 8# 19.2# 0.8# 3.6# 0.6# Figure 3: Performance modeling framework. 2.4# 4# 32# 0.4# 2# 8# 1.9# 3# 24# exploration to a matter of merely evaluating a set of mathemat- 1.6# 2.5# 5.8# 16# 1.3# 15# 2# 8# 16# 2# ical formulas, enabling fast search of the vast design space. 12# 0.2# 3.2# 0.7# 8# 4# 8# 2.7# Figure3 shows our performance modeling framework. It 0.8# 4# 2# 4# 2# 0.5# first takes in a set of targeted applications and characterizes 0# 1# 0.1# Cores# Freq#(GHz)# Issue#width# vector#width#Threads/core# L1/core###### LLC/core# BW/core# their performance on a baseline architecture. The perfor- (DP#)# (KB)# (MB)# (GB/s)# ##(7.6×)###############(6.7×)##################(4×)##############(16×)################(64×)################(4×)################(78×)###############(7.2×)## mance characteristics are a set of performance events mea- Figure 2: Mutli-dimensional processor design space. For each sured by hardware counter-based tools such as PAPI [38], architectural feature, the numbers are normalized be- Intel VTune [25], and IBM HPM [21]. The performance tween 0 and 1 based on the highest value. The num- characteristics are then calibrated for a target architecture con- bers to the right of each marker are absolute values. figuration using analytical performance models. The target The numbers in parentheses below the horizontal axis are the ratios between the highest and lowest value,

Raexplore: Enabling Rapid, Automated Architecture Exploration

Ray Tracing on the Cell Processor

IBM System P5 Quad-Core Module Based on POWER5+ Technology: Technical Overview and Introduction

Implementing Powerpc Linux on System I Platform

From Blue Gene to Cell Power.Org Moscow, JSCC Technical Day November 30, 2005

IBM Powerpc 970 (A.K.A. G5)

IBM Power Systems Performance Report Apr 13, 2021

IBM's POWER5 Micro Processor Design and Methodology

Microprocessor

Introduction to the Cell Multiprocessor

IBM Power Roadmap

POWER8: the First Openpower Processor

Power8 Quser Mspl Nov 2015 Handout