
Memory Bandwidth and System Balance in HPC Systems

John D. McCalpin, PhD
[email protected]

Outline
1. Changes in TOP500 Systems
   – System Architectures & System Sizes
2. Technology Trends & System Balances
   – The STREAM Benchmark
   – Computation Rates vs Data Motion Latency and Bandwidth
   – Required Concurrency to Exploit Available Bandwidths
3. Comments & Speculations…
   – Impact on HPC Systems and on HPC Applications
   – Potential disruptive technologies

Part 1: CHANGES IN TOP500 SYSTEMS

TOP500 Rmax Contributions by System Architecture

(Chart: stacked shares of TOP500 Rmax by system architecture, 1993–2016. Categories: VECTOR, RISC, MPP, x86, and Accelerated. Annotated systems include ASCI Red, IBM BG/L, IBM BG/Q, RoadRunner, K Machine, Tianhe-2, and TaihuLight. Note: a fraction of the "Accelerated" Rmax is from x86 as well.)

HPC "Typical" System Acquisition Price per ""
(Chart, 1980–2020, log price scale from ~$100 to ~$10,000,000: Vector, RISC, x86 single-core, x86 multi-core ($/socket), and x86 multi-core ($/core).)

TOP500 Rmax Contributions by System Architecture
(Chart: the same Rmax-share plot, annotated with representative system sizes: 4p, 16p, 32p, 100p, 250p, 1000p, 10,000p, and 48,000 processors.)

Heterogeneous Systems
• More sites are building "clusters of clusters", e.g.:
  – Sub-cluster 1: 2-socket nodes with small memory
  – Sub-cluster 2: 2-socket nodes with large memory
  – Sub-cluster 3: 4-socket nodes with very large memory
  – Sub-cluster 4: nodes with accelerators, etc.
• Consistent with the observed partial shift to accelerators
  – Peaking at ~30% of the aggregate Rmax of the list in 2013-2015
    • Under 20% for the November 2016 list (after excluding x86 contributions)
  – Peaking at ~20% of the systems in 2015, now under 17%
    • No more than 16% of systems have had >2/3 of their Rpeak in accelerators
  – Accelerators have been split between many-core and GPU

Part 2: TECHNOLOGY TRENDS & SYSTEM BALANCES

The STREAM Benchmark
• Created in 1991 while on the faculty of the University of Delaware College of Marine Studies
• Intended to be an extremely simplified representation of the low-compute-intensity, long-vector operations characteristic of ocean circulation models
• Widely used for research, testing, and marketing
• Almost 1100 results in the database at the main site
• Hosted at www.cs.virginia.edu/stream/

The STREAM Benchmark (2)
• Four kernels, separately timed (a C sketch appears after the "Performance Component Trends" slide below):
  Copy:  C[i] = A[i];                i=1..N
  Scale: B[i] = scalar * C[i];       i=1..N
  Add:   C[i] = A[i] + B[i];         i=1..N
  Triad: A[i] = B[i] + scalar*C[i];  i=1..N
• "N" is chosen to make each array much larger than the cache
• Repeated (typically) 10 times, with the first iteration ignored
• Min/Avg/Max timings reported; the best time is used for bandwidth

The STREAM Benchmark (3)
• Assumed memory traffic per index:
  Copy:  C[i] = A[i];                16 Bytes
  Scale: B[i] = scalar * C[i];       16 Bytes
  Add:   C[i] = A[i] + B[i];         24 Bytes
  Triad: A[i] = B[i] + scalar*C[i];  24 Bytes
• Many systems will read data before updating it ("write allocate" policy)
  – STREAM ignores this extra traffic if present

What are "System Balances"?
• "Performance" can be viewed as an N-dimensional vector of "mostly-orthogonal" components, e.g.:
  – Core performance (FLOPS) – LINPACK
  – Memory Bandwidth – STREAM
  – Memory Latency – lmbench/lat_mem_rd
  – Interconnect Bandwidth – osu_bw, osu_bibw
  – Interconnect Latency – osu_latency
• System Balances are the ratios of these components

Performance Component Trends
1. Peak FLOPS per socket increasing at 50%-60% per year
2. Memory Bandwidth increasing at ~23% per year
3. Memory Latency increasing at ~4% per year
4. Interconnect Bandwidth increasing at ~20% per year
5. Interconnect Latency decreasing at ~20% per year
• These ratios suggest that processors should be increasingly imbalanced with respect to data motion…
  – Today's talk focuses on (1), (2), and a bit of (3)
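A minimal sketch of the four STREAM kernels described above (the official stream.c adds OpenMP directives, 10 repetitions with the first discarded, result validation, and more careful timing; the array size and the simple POSIX timer here are just for illustration):

    #include <stdio.h>
    #include <time.h>

    #define N 10000000   /* each array is ~76 MiB of doubles, far larger than cache */

    static double A[N], B[N], C[N];

    static double seconds(void) {          /* wall-clock time in seconds */
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (double)ts.tv_sec + 1.0e-9 * (double)ts.tv_nsec;
    }

    int main(void) {
        const double scalar = 3.0;
        double t;
        for (long i = 0; i < N; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

        /* Bytes counted per index: 16 for Copy/Scale, 24 for Add/Triad,
         * deliberately ignoring any write-allocate traffic (as STREAM does). */
        t = seconds();
        for (long i = 0; i < N; i++) C[i] = A[i];                    /* Copy  */
        printf("Copy:  %10.1f MB/s\n", 16.0 * N / (seconds() - t) * 1e-6);

        t = seconds();
        for (long i = 0; i < N; i++) B[i] = scalar * C[i];           /* Scale */
        printf("Scale: %10.1f MB/s\n", 16.0 * N / (seconds() - t) * 1e-6);

        t = seconds();
        for (long i = 0; i < N; i++) C[i] = A[i] + B[i];             /* Add   */
        printf("Add:   %10.1f MB/s\n", 24.0 * N / (seconds() - t) * 1e-6);

        t = seconds();
        for (long i = 0; i < N; i++) A[i] = B[i] + scalar * C[i];    /* Triad */
        printf("Triad: %10.1f MB/s\n", 24.0 * N / (seconds() - t) * 1e-6);

        printf("(A[1] = %f)\n", A[1]);     /* keep the compiler from eliding the loops */
        return 0;
    }

The reported MB/s values use the same per-index byte counts as the table above (16 or 24 Bytes) and, like STREAM, ignore any additional write-allocate traffic.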

Memory Bandwidth is Falling Behind: (GFLOP/s) / (GWord/s)
(Chart: Balance Ratio (FLOPS per memory access) vs. Date of Introduction, 1990–2020, log scale, for RISC systems (IBM, MIPS, Alpha) and x86-64 systems (AMD & Intel). Peak FLOPS per Word of Sustained Memory BW grows at +14.2%/year, doubling every 5.2 years.)

Memory Latency is much worse: (GFLOP/s) / (Memory Latency)
(Chart: adds Peak FLOPS per Idle Memory Latency, growing at +24.5%/year, doubling every 3.2 years.)

Interconnect Bandwidth is Falling Behind at a comparable rate
(Chart: adds Peak FLOPS / Word of Sustained Network BW, growing at +22.3%/year, doubling every 3.4 years.)

Interconnect latency follows a similar trend…
(Chart: adds Peak FLOPS (core) * Network Latency and Peak FLOPS * Network Latency, with the vertical scale extended to 10,000,000.)
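To make the balance-ratio units concrete, a small sketch of how the FLOPS/Word balance in these charts is computed; the per-socket peak-GFLOPS and STREAM-bandwidth values below are illustrative assumptions for 2016-era parts, not official or measured numbers:

    #include <stdio.h>

    /* Machine balance = peak FLOP/s per word (8 Bytes) of sustained memory BW. */
    static double balance_flops_per_word(double peak_gflops, double stream_GBps) {
        double gwords_per_s = stream_GBps / 8.0;   /* 8 Bytes per 64-bit word */
        return peak_gflops / gwords_per_s;
    }

    int main(void) {
        /* Assumed values, for illustration only */
        printf("Mainstream-like socket: ~%.0f FLOPS/Word\n",
               balance_flops_per_word(580.0, 65.0));    /* ~580 GFLOPS, ~65 GB/s  */
        printf("ManyCore-like socket:   ~%.0f FLOPS/Word\n",
               balance_flops_per_word(3000.0, 450.0));  /* ~3 TFLOPS, ~450 GB/s   */
        return 0;
    }

Both results fall within the 0–100 FLOPS/Word axis range of the chart that follows: even with a much higher absolute bandwidth, the many-core part's balance is in the same neighborhood because its peak FLOPS are also much higher.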

STREAM Balance (FLOPS/Word): Mainstream vs ManyCore vs GPGPU
(Chart: STREAM balance, 0–100 FLOPS/Word, for 2012 product releases – E5-2690 (Sandy Bridge EP), SE10P (KNC, GDDR5), K20x (Kepler, GDDR5) – and 2016 product releases – Xeon E5-2690 v4 (Broadwell), Xeon Phi 7250 (KNL, MCDRAM), NVIDIA P100 (Pascal, HBM2).)

Why are FLOPS increasing so fast?
• Peak FLOPS per package is the product of several terms:
  – Frequency
  – FP operations per cycle per core
    • Product of the number of FP units, the SIMD width of each unit, and the complexity of the FP instructions (e.g., separate ADD & MUL vs FMA)
  – Number of cores per package
• Low-level semiconductor technology tends to drive these terms at different rates…
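A small sketch that multiplies the three terms for two 2016-era parts; the frequencies and core counts are nominal published base values and should be treated as approximations (sustained AVX frequencies are typically lower):

    #include <stdio.h>

    /* Peak DP GFLOPS per package = GHz * (FLOP/cycle/core) * cores,
     * where FLOP/cycle/core = #FMA units * SIMD lanes (doubles) * 2 (FMA = mul+add). */
    static double peak_gflops(double ghz, int fma_units, int simd_lanes, int cores) {
        return ghz * fma_units * simd_lanes * 2.0 * cores;
    }

    int main(void) {
        /* Xeon E5-2690 v4 (Broadwell): 14 cores, ~2.6 GHz, 2 AVX2 FMA units x 4 doubles */
        printf("Broadwell E5-2690 v4: ~%.0f GFLOPS\n", peak_gflops(2.6, 2, 4, 14));
        /* Xeon Phi 7250 (KNL): 68 cores, ~1.4 GHz, 2 AVX-512 FMA units x 8 doubles */
        printf("Xeon Phi 7250 (KNL):  ~%.0f GFLOPS\n", peak_gflops(1.4, 2, 8, 68));
        return 0;
    }

Nearly all of the roughly 5x gap between the two products comes from core count and SIMD width; the frequency term actually shrinks, which is exactly the decomposition shown in the stacked-contribution charts that follow.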

Intel Processor GFLOPS/Package Contributions over time
(Chart, 2003–2019: stacked base-10 log contributions to GFLOPS/package from log10(GHz), log10(FP/Hz), and log10(Cores/Socket) for Core 2, Nehalem/Westmere, Sandy Bridge / Ivy Bridge, Haswell / Broadwell, and Skylake Xeon. FP operations per cycle grow 4 → 8 → 16 → 32 and cores per socket grow from 2 to ~24, while the frequency term stays near 2–3.3 GHz and drifts down to ~1.8–2.1 GHz in the most recent parts.)

Intel Processor GFLOPS/Package Contributions: Mainstream vs ManyCore
(Chart: the same decomposition comparing Xeon E5 v1 (Sandy Bridge) with Xeon Phi (Knights Corner) for 2012 product releases, and Xeon E5 v4 (Broadwell) with Xeon Phi (Knights Landing) for 2016, with ratios of 5.2x and 6.8x annotated between the mainstream and many-core parts.)

Why is Memory Bandwidth increasing slowly?
• Slow rate of pin speed improvements
  – The emphasis has been on increasing capacity, not increasing bandwidth
  – A shared-bus architecture (multiple DIMMs per channel) is very hard at high frequencies
• DRAM cycle time is almost unchanged in 20 years
  – Speed increases require increasing transfer sizes
  – DDR3/DDR4 have a minimum transfer size of 64 Bytes with standard DIMMs
• Slow rate of increase in interface width
  – Pins cost money!

Why is Memory Latency stagnant or growing?
• More levels of cache
  – Many lookups are serialized to save power
• More asynchronous clock-domain crossings
  – Many different clock domains to save power
  – Snoop (6): Core -> Ring -> QPI -> Ring -> QPI -> Ring -> Core
  – Local Memory (4): Core -> Ring -> DDR -> Ring -> Core
  – Remote Memory (8): Core -> Ring -> QPI -> Ring -> DDR -> Ring -> QPI -> Ring -> Core
• More cores to keep coherent
  – Challenging even on a single mainstream server chip
  – Two-socket system latency is typically dominated by coherence, not data
  – Manycore chips have much higher latency
• Decreasing frequencies!

Why is Interconnect Bandwidth growing slowly?
• Slow rate of pin speed improvements
  – About 20%/year
• Reluctance to increase interface width
  – Chips are typically pin-limited – wider interfaces get fewer ports
  – Parallel links require more hardware – too expensive, and they do not always provide improved real-world bandwidth

Why is Interconnect Latency improving slowly?
• Legacy IO architecture designed around disks, not communications
  – Control operations use un-cached loads/stores – hundreds of ns per operation and no concurrency
  – Interrupt-driven processing requires many thousands of cycles per transaction
• Mismatch between SW requirements and HW capabilities

A different implication of these technology trends: LATENCY, BANDWIDTH, AND CONCURRENCY

Latency, Bandwidth, and Concurrency
• "Little's Law" from queuing theory describes the relationship between latency (or occupancy), bandwidth, and concurrency:
  Latency * Bandwidth = Concurrency
• Flat Latency * Increasing Bandwidth → Increasing Concurrency

• Because these are exponential trends, these are not small changes…

Little's Law: illustration for a 2005-era processor
60 ns latency, 6.4 GB/s (= 10 ns per 64-Byte cache line)
(Diagram: a timeline from -60 ns to 110 ns in which six buffers (Buffer0–Buffer5) cycle through Request 0–11 and Data 0–11, so a new cache line arrives every 10 ns.)
• 60 ns * 6.4 GB/s = 384 Bytes = 6 cache lines
• To keep the bus full, there must always be 6 cache lines "in flight"
• Each request must be launched at least 60 ns before the data is needed
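A small sketch of the same arithmetic, applied to the 2005-era example above and to a hypothetical modern package; the 90 ns / 100 GB/s values are illustrative assumptions, not measurements from the talk:

    #include <stdio.h>

    /* Little's Law: concurrency = latency * bandwidth,
     * expressed here in 64-Byte cache lines that must be "in flight". */
    static double lines_in_flight(double latency_ns, double bandwidth_GBps) {
        double bytes = latency_ns * 1e-9 * bandwidth_GBps * 1e9;  /* latency * BW */
        return bytes / 64.0;                                      /* 64B cache lines */
    }

    int main(void) {
        /* 2005-era example from the slide: 60 ns, 6.4 GB/s -> 384 Bytes = 6 lines */
        printf("2005-era:   %.0f lines in flight\n", lines_in_flight(60.0, 6.4));
        /* Hypothetical modern socket (illustrative values only): 90 ns, 100 GB/s */
        printf("Modern-ish: %.0f lines in flight\n", lines_in_flight(90.0, 100.0));
        return 0;
    }

The second number (~140 lines per package) is why the next chart shows latency-bandwidth products growing at roughly +33%/year, and far exceeds the ~10 outstanding L1 misses a single core can sustain (see the slides that follow).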

Latency-Bandwidth Products per Package (64B units)
(Chart, 1995–2020, log scale: 64-Byte-line latency-bandwidth products for Mainstream Processors, Intel ManyCore, and NVIDIA GPU packages, growing at roughly +33%/year.)

Why is Increasing Concurrency a Problem?
• Architectures are built assuming a "flat" memory model
  – The location of data is invisible and uncontrollable
  – Caches and prefetchers are assumed to be "good enough" to cover latency and bandwidth differences
• Implementations support a limited number of L1 Data Cache misses per core:
  – Xeon E5: 10 L1 misses (maximum)
  – L2 hardware prefetchers help, but are also "invisible" and not directly controllable

Increasing Concurrency (2)
• Many cores are needed just to generate concurrency, even if they are not needed to do computing
  – This costs a lot of energy in the cores!
• Large buffers and complex memory controllers are needed to handle the concurrent operations
  – DRAM management requires the memory schedule to be updated frequently as new transactions appear
  – DRAM open-page hit rates still go down, so DRAM power increases too
  – Design cost up, power cost up, BW utilization down

Increasing Concurrency (3)
• More cores create more concurrent memory access streams, which requires more DRAM banks
• Examples:
  – 8-core Xeon E5 v1 with 2 streams per core needs >= 16 banks
    • Requires 2 ranks of DDR3 DRAM (one dual-rank DIMM)
  – 12-core Xeon E5 v3 with 2 streams per core needs >= 24 banks
    • Requires 2 ranks of DDR4 DRAM (one dual-rank DIMM)
• Problems:
  – Some codes generate many address streams per core – LBM >32
  – HyperThreading can double the address streams per core
  – Adding more DIMMs can *decrease* performance due to rank-to-rank bus stalls

Another angle… POWER AND ENERGY

What about Power/Energy?
• Power density is important in processor implementations
  – Frequencies can be limited by small-scale (core-sized) hot spots
  – Multi-core frequencies are now limited by package cooling
  – E.g., Xeon E5 v3 (Haswell) can only run DGEMM or LINPACK on ½ of the cores before running out of power and needing to throttle the frequency
• Power is not a first-order concern in operating cost!!!
  – Purchase price is $2500-$4000/socket
  – A socket draws 100-150 Watts and needs 40-50 Watts for cooling
  – At $0.10/kWh, this is 5%-7% of the purchase price per year
  – This ratio is very hard to change!!!

What about Power/Energy later?
• If much cheaper processors become available, power would become a first-order cost
• Example 1: "client" multicore processors
  – Use the same core architecture, but at a much lower price
  – A typical configuration needs ~25% of its purchase cost per year for power
  – (A high-performance interconnect solution is not available at a reasonable price)
• Example 2: "embedded" processors
  – A hypothetical $5 processor using 5 Watts requires $7/year for power
  – Not a problem for mobile – not credible for HPC
  – The response will be sociological and bureaucratic, as well as technical

Part 3: COMMENTS & SPECULATIONS

Implications for HPC Systems?
• Applications have very different requirements for the various performance components…
  – It is relatively easy to find 100:1 ratios in memory bandwidth requirements across applications
  – Ratios on the other axes are mostly smaller due to self-selection, but 100:1 examples certainly exist
    • E.g., bisection bandwidth for 3D DNS turbulence
• Example: the CPU + Memory BW + Memory Latency model from 2007 – most of these applications are still in broad use (a sketch of the model and the measured breakdown follow)
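The 2007 model mentioned above splits run time into CPU, memory-latency, and memory-bandwidth terms (the T_CPU, T_Lat, and T_BW of the chart below). A minimal sketch of that style of model, using placeholder system and workload numbers rather than the fitted values from the original study:

    #include <stdio.h>

    /* Three-term execution-time model: T = T_CPU + T_Lat + T_BW.
     * All numbers below are illustrative placeholders. */
    int main(void) {
        /* Hypothetical system parameters */
        double peak_flops   = 500e9;   /* 500 GFLOP/s            */
        double mem_latency  = 80e-9;   /* 80 ns                  */
        double sustained_bw = 60e9;    /* 60 GB/s                */

        /* Hypothetical per-run workload characterization */
        double flop_count      = 200e9;  /* total FP operations              */
        double dependent_loads = 2.5e6;  /* serialized (latency-bound) loads */
        double bytes_moved     = 18e9;   /* streaming memory traffic         */

        double t_cpu = flop_count / peak_flops;
        double t_lat = dependent_loads * mem_latency;
        double t_bw  = bytes_moved / sustained_bw;
        double t     = t_cpu + t_lat + t_bw;

        printf("T = %.2f s  (T_CPU %.0f%%, T_Lat %.0f%%, T_BW %.0f%%)\n",
               t, 100 * t_cpu / t, 100 * t_lat / t, 100 * t_bw / t);
        return 0;
    }

The chart below shows this kind of three-way decomposition for the SPECfp_rate2006 codes on a late-2006-era reference system.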

(Chart: SPECfp_rate2006 run-time contributions (T_CPU, T_Lat, T_BW) on a late-2006-era reference system, stacked to 100% for each code. Annotation: recent testing shows that WRF has increased from this 2007 value of 32% memory time to 70% memory time on a 12-core Haswell EP.)

A note on "optimization"
• Under fairly general assumptions, it can easily be proven that:
  A homogeneous system cannot be "optimal" for a heterogeneous workload!
  – "Optimal" here can refer to performance, power, or overall cost.
• "Optimized for General-Purpose Workloads" is a contradiction
  – As systems get larger and the ratios of performance requirements increase, the advantage of specialization increases

Implications for HPC Systems?
• System Balance trends are due to a combination of technology factors and market factors
  – Technology trends suggest that balances will either get worse, or performance will stagnate
• Some applications will be able to exploit the increasing capabilities, while others will not
  – Changing algorithms may be required to obtain continuing performance increases, e.g.:
    • increasing computational intensity, increasing locality

Implications for HPC Applications?
• As systems scale, the number of performance axes that need to be considered will continue to increase
• Effective use of new & future hardware will require effective exploitation of:
  – Node-level parallelism (e.g., MPI)
  – Multi-core parallelism (e.g., OpenMP)
  – SIMD vector parallelism
    • Multiple independent SIMD vector operations to tolerate pipeline latency
  – Unit-stride memory access with increasing data re-use

John D. McCalpin, PhD
[email protected]
512-232-3754
For more information: www.tacc.utexas.edu

Disruptive Technology Ideas (1)
• Quit fighting the physics of data motion
• Move simple cores to the data
  – Distances are shorter – less energy for data transfer
  – The design is simpler – less development money to recover
• Use more efficient point-to-point DRAM interfaces
  – GDDR5/6, LPDDR4/5
• Core performance requirements are lower
  – Lower the frequency to exploit P ~ f³ (see the sketch after the Summary)
• Save high-power processors for complex processing

Disruptive Technology Ideas (2)
• Move out of the 1980's – memory is not flat!
• The architecture must include functionality to enable control over the most "expensive" operations
  – This must include data motion through the memory hierarchy as a first-order architectural concept
  – Provide semantic information to the outer levels of the hierarchy to enable efficient and performant scheduling of operations on multiple data streams of differing priorities and consumption rates

Disruptive Technology Ideas (3)
• Quit fighting physics:
  – Limit coherence to small areas in physical space and small areas in address space that can be tightly constrained
  – Don't use cache coherence for communication!
• Data motion through the memory hierarchy is fundamentally different from communication or synchronization
  – The optimal behaviors are different, so these must be exposed as different operation types

Disruptive Technology Ideas (4)
• Admit that computers are parallel!
• Current architectures don't include communication or synchronization as concepts
  – Communication and synchronization can be implemented as side effects of sequences of ordered memory references
  – Indirect, inefficient, ugly
• Hardware is capable of extremely efficient communication and synchronization

Disruptive Technology Ideas (5)
• Design for Performance Predictability
• Programming languages that describe algorithms at higher levels of abstraction require significant transformations to map to hardware
  – Currently very limited, because performance cannot be modeled!
• New HW architectures will require human cost for SW re-development, but we are paying a cost now for incomprehensibly complex systems

Summary
• Technology trends suggest that data access is increasingly expensive compared to computation
  – Benchmark data from 25 years confirms these trends
• Multi-level caching and aggressive prefetching have mitigated the impact of the imbalance, but
  – These are expensive to design
  – These are not energy-efficient
• Architectural changes are needed to address design cost and energy efficiency
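As referenced under Disruptive Technology Ideas (1), a small sketch of the cube-law argument; P ~ f³ is the usual first-order assumption that supply voltage scales roughly with frequency, which real processors only approximate:

    #include <stdio.h>

    /* First-order dynamic power model: P ~ C * V^2 * f, with V roughly
     * proportional to f, giving P ~ f^3.  Purely illustrative. */
    int main(void) {
        double ratio  = 0.5;                      /* run cores at half frequency      */
        double power  = ratio * ratio * ratio;    /* relative power     ~ f^3 -> 1/8  */
        double energy = ratio * ratio;            /* relative energy/op ~ f^2 -> 1/4  */
        printf("f x %.2f: power x %.3f, energy/op x %.2f\n", ratio, power, energy);
        printf("Two half-speed cores: same throughput at ~%.0f%% of the power\n",
               100.0 * 2.0 * power);
        return 0;
    }

This is the sense in which many slow cores placed near the data can be more energy-efficient than one fast core far from it.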

For the insomniacs among us… BACKUP SLIDES

TOP500 Rmax Contributions by Family
(Chart, 1993–2016: Rmax share by processor family, including i860, HP, MIPS, Alpha, SPARC, POWER, IA-64, Intel x86, and AMD x86. Note: a fraction of the Rmax from Accelerated systems is also from x86 processors.)

Contributions of Various Accelerator Families to TOP500 RPeak

(Chart, 2006–2016: percent of the aggregate Rpeak of the entire list, 0–35%. Series: %Total Rpeak in IBM Cell, in NVIDIA GPUs, in Xeon Phi, in AMD GPUs, and in other & mixed accelerators; IBM Cell peaks first, followed by NVIDIA GPUs and then Intel Xeon Phi.)

Decreasing Price/Core Leads to Increasing Available Core Counts
(Chart, 1980–2020, log scales: acquisition price (~$100 to ~$10,000,000) for Vector, RISC, x86 single-core, x86 multi-core ($/socket), and x86 multi-core ($/core), with a second axis showing cores available for a $50,000 budget (0.01 to 1000).)

Why?
• FLOPS
  – More cores
  – More FLOPS/cycle
• Memory Bandwidth
  – Slow rate of pin speed improvements
    • DRAM cell cycle time almost unchanged in 20 years
    • Speed increases require increasing transfer sizes
  – Slow rate of increase in interface width
    • Pins cost money!
• Memory Latency
  – More levels of cache
  – More asynchronous clock-domain crossings
  – More cores to keep coherent
• Interconnect Bandwidth
  – Slow rate of pin speed improvements
  – Slow rate of increase in interface width
• Interconnect Latency
  – Legacy IO architecture designed around disks, not communications
  – Mismatch between SW requirements and HW capabilities

SPEC CPU 2006 codes
In decreasing order of computational intensity on the reference system:
• NAMD – molecular dynamics
• GAMESS – quantum chemistry
• CalculiX – nonlinear structural finite element
• POV-Ray – ray-tracing
• GROMACS – molecular dynamics
• Tonto – quantum crystallography
• ZEUS-MP – CFD/MHD
• CactusADM – general relativity
• Deal.II – adaptive finite element elliptic equation solver
• WRF – numerical weather prediction (limited area)
• Sphinx-3 – speech recognition
• LESlie3d – CFD (finite volume, explicit)
• Bwaves – CFD (transonic, implicit Bi-CGstab)
• SoPlex – simplex linear programming solver
• GemsFDTD – computational electromagnetics (finite difference, explicit)
• MILC – quantum chromodynamics
• LBM – CFD (Lattice-Boltzmann)

Required Memory BW for DGEMM vs Maximum Sustainable Memory BW

(Chart, 1990–2020: required memory bandwidth for DGEMM as a percentage (0–100%) of the maximum sustainable memory bandwidth. Annotated points include SNB, IVB, HSW, BDW, KNC, KNL, and TaihuLight.)

TOP500 x86 System Rmax Contributions by Cores/Package
(Chart, 2005–2016: Rmax share by cores per package, from 1 Core/Pkg up to 60+ Core/Pkg, with bands for 2, 4, 6, 8, 10, 12, 14, 16, and 18-20 Core/Pkg. Note: x86 processors in Accelerated Systems are not included in these Rmax contributions.)

Strawman: Processor At Memory (PAM)

(Diagram: repeated units of "1 GiB GDDR5" paired with a "1 W, 1-2 GFLOPS CPU", shown alongside a "Big CPU".)