
Memory Bandwidth and System Balance in HPC Systems

John D. McCalpin, PhD
[email protected]

Outline
1. Changes in TOP500 Systems
   – System Architectures & System Sizes
2. Technology Trends & System Balances
   – The STREAM Benchmark
   – Computation Rates vs Data Motion Latency and Bandwidth
   – Required Concurrency to Exploit Available Bandwidths
3. Comments & Speculations…
   – Impact on HPC Systems and on HPC Applications
   – Potential disruptive technologies

Part 1: CHANGES IN TOP500 SYSTEMS

TOP500 Rmax Contributions by System Architecture

(Chart: stacked shares of TOP500 Rmax by system architecture, 1993–2016. Categories: VECTOR, RISC, MPP, x86, and Accelerated. Annotated systems include ASCI Red, IBM BG/L, IBM BG/Q, RoadRunner, K Machine, Tianhe-2, and TaihuLight. Note: a fraction of the "Accelerated" Rmax is from x86 as well.)

HPC "Typical" System Acquisition Price per ""
(Chart, 1980–2020, log price scale from ~$100 to ~$10,000,000: Vector, RISC, x86 single-core, x86 multi-core ($/socket), and x86 multi-core ($/core).)

TOP500 Rmax Contributions by System Architecture
(Chart: the same Rmax-share plot, annotated with representative system sizes: 4p, 16p, 32p, 100p, 250p, 1000p, 10,000p, and 48,000 processors.)

Heterogeneous Systems
• More sites are building "clusters of clusters", e.g.:
  – Sub-cluster 1: 2-socket nodes with small memory
  – Sub-cluster 2: 2-socket nodes with large memory
  – Sub-cluster 3: 4-socket nodes with very large memory
  – Sub-cluster 4: nodes with accelerators, etc.
• Consistent with the observed partial shift to accelerators
  – Peaking at ~30% of the aggregate Rmax of the list in 2013-2015
    • Under 20% for the November 2016 list (after excluding x86 contributions)
  – Peaking at ~20% of the systems in 2015, now under 17%
    • No more than 16% of systems have had >2/3 of their Rpeak in accelerators
  – Accelerators have been split between many-core and GPU

Part 2: TECHNOLOGY TRENDS & SYSTEM BALANCES

The STREAM Benchmark
• Created in 1991 while on the faculty of the University of Delaware College of Marine Studies
• Intended to be an extremely simplified representation of the low-compute-intensity, long-vector operations characteristic of ocean circulation models
• Widely used for research, testing, and marketing
• Almost 1100 results in the database at the main site
• Hosted at www.cs.virginia.edu/stream/

The STREAM Benchmark (2)
• Four kernels, separately timed (a C sketch appears after the "Performance Component Trends" slide below):
  Copy:  C[i] = A[i];                i=1..N
  Scale: B[i] = scalar * C[i];       i=1..N
  Add:   C[i] = A[i] + B[i];         i=1..N
  Triad: A[i] = B[i] + scalar*C[i];  i=1..N
• "N" is chosen to make each array much larger than the cache
• Repeated (typically) 10 times, with the first iteration ignored
• Min/Avg/Max timings reported; the best time is used for bandwidth

The STREAM Benchmark (3)
• Assumed memory traffic per index:
  Copy:  C[i] = A[i];                16 Bytes
  Scale: B[i] = scalar * C[i];       16 Bytes
  Add:   C[i] = A[i] + B[i];         24 Bytes
  Triad: A[i] = B[i] + scalar*C[i];  24 Bytes
• Many systems will read data before updating it ("write allocate" policy)
  – STREAM ignores this extra traffic if present

What are "System Balances"?
• "Performance" can be viewed as an N-dimensional vector of "mostly-orthogonal" components, e.g.:
  – Core performance (FLOPS) – LINPACK
  – Memory Bandwidth – STREAM
  – Memory Latency – lmbench/lat_mem_rd
  – Interconnect Bandwidth – osu_bw, osu_bibw
  – Interconnect Latency – osu_latency
• System Balances are the ratios of these components

Performance Component Trends
1. Peak FLOPS per socket increasing at 50%-60% per year
2. Memory Bandwidth increasing at ~23% per year
3. Memory Latency increasing at ~4% per year
4. Interconnect Bandwidth increasing at ~20% per year
5. Interconnect Latency decreasing at ~20% per year
• These ratios suggest that processors should be increasingly imbalanced with respect to data motion…
  – Today's talk focuses on (1), (2), and a bit of (3)
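A minimal sketch of the four STREAM kernels described above (the official stream.c adds OpenMP directives, 10 repetitions with the first discarded, result validation, and more careful timing; the array size and the simple POSIX timer here are just for illustration):

    #include <stdio.h>
    #include <time.h>

    #define N 10000000   /* each array is ~76 MiB of doubles, far larger than cache */

    static double A[N], B[N], C[N];

    static double seconds(void) {          /* wall-clock time in seconds */
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (double)ts.tv_sec + 1.0e-9 * (double)ts.tv_nsec;
    }

    int main(void) {
        const double scalar = 3.0;
        double t;
        for (long i = 0; i < N; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

        /* Bytes counted per index: 16 for Copy/Scale, 24 for Add/Triad,
         * deliberately ignoring any write-allocate traffic (as STREAM does). */
        t = seconds();
        for (long i = 0; i < N; i++) C[i] = A[i];                    /* Copy  */
        printf("Copy:  %10.1f MB/s\n", 16.0 * N / (seconds() - t) * 1e-6);

        t = seconds();
        for (long i = 0; i < N; i++) B[i] = scalar * C[i];           /* Scale */
        printf("Scale: %10.1f MB/s\n", 16.0 * N / (seconds() - t) * 1e-6);

        t = seconds();
        for (long i = 0; i < N; i++) C[i] = A[i] + B[i];             /* Add   */
        printf("Add:   %10.1f MB/s\n", 24.0 * N / (seconds() - t) * 1e-6);

        t = seconds();
        for (long i = 0; i < N; i++) A[i] = B[i] + scalar * C[i];    /* Triad */
        printf("Triad: %10.1f MB/s\n", 24.0 * N / (seconds() - t) * 1e-6);

        printf("(A[1] = %f)\n", A[1]);     /* keep the compiler from eliding the loops */
        return 0;
    }

The reported MB/s values use the same per-index byte counts as the table above (16 or 24 Bytes) and, like STREAM, ignore any additional write-allocate traffic.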

Memory Bandwidth is Falling Behind: (GFLOP/s) / (GWord/s)
(Chart: Balance Ratio (FLOPS per memory access) vs. Date of Introduction, 1990–2020, log scale, for RISC systems (IBM, MIPS, Alpha) and x86-64 systems (AMD & Intel). Peak FLOPS per Word of Sustained Memory BW grows at +14.2%/year, doubling every 5.2 years.)

Memory Latency is much worse: (GFLOP/s) / (Memory Latency)
(Chart: adds Peak FLOPS per Idle Memory Latency, growing at +24.5%/year, doubling every 3.2 years.)

Interconnect Bandwidth is Falling Behind at a comparable rate
(Chart: adds Peak FLOPS / Word of Sustained Network BW, growing at +22.3%/year, doubling every 3.4 years.)

Interconnect latency follows a similar trend…
(Chart: adds Peak FLOPS (core) * Network Latency and Peak FLOPS * Network Latency, with the vertical scale extended to 10,000,000.)
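To make the balance-ratio units concrete, a small sketch of how the FLOPS/Word balance in these charts is computed; the per-socket peak-GFLOPS and STREAM-bandwidth values below are illustrative assumptions for 2016-era parts, not official or measured numbers:

    #include <stdio.h>

    /* Machine balance = peak FLOP/s per word (8 Bytes) of sustained memory BW. */
    static double balance_flops_per_word(double peak_gflops, double stream_GBps) {
        double gwords_per_s = stream_GBps / 8.0;   /* 8 Bytes per 64-bit word */
        return peak_gflops / gwords_per_s;
    }

    int main(void) {
        /* Assumed values, for illustration only */
        printf("Mainstream-like socket: ~%.0f FLOPS/Word\n",
               balance_flops_per_word(580.0, 65.0));    /* ~580 GFLOPS, ~65 GB/s  */
        printf("ManyCore-like socket:   ~%.0f FLOPS/Word\n",
               balance_flops_per_word(3000.0, 450.0));  /* ~3 TFLOPS, ~450 GB/s   */
        return 0;
    }

Both results fall within the 0–100 FLOPS/Word axis range of the chart that follows: even with a much higher absolute bandwidth, the many-core part's balance is in the same neighborhood because its peak FLOPS are also much higher.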

STREAM Balance (FLOPS/Word): Mainstream vs ManyCore vs GPGPU
(Chart: STREAM balance, 0–100 FLOPS/Word, for 2012 product releases – E5-2690 (Sandy Bridge EP), SE10P (KNC, GDDR5), K20x (Kepler, GDDR5) – and 2016 product releases – Xeon E5-2690 v4 (Broadwell), Xeon Phi 7250 (KNL, MCDRAM), NVIDIA P100 (Pascal, HBM2).)

Why are FLOPS increasing so fast?
• Peak FLOPS per package is the product of several terms:
  – Frequency
  – FP operations per cycle per core
    • Product of the number of FP units, the SIMD width of each unit, and the complexity of the FP instructions (e.g., separate ADD & MUL vs FMA)
  – Number of cores per package
• Low-level semiconductor technology tends to drive these terms at different rates…
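A small sketch that multiplies the three terms for two 2016-era parts; the frequencies and core counts are nominal published base values and should be treated as approximations (sustained AVX frequencies are typically lower):

    #include <stdio.h>

    /* Peak DP GFLOPS per package = GHz * (FLOP/cycle/core) * cores,
     * where FLOP/cycle/core = #FMA units * SIMD lanes (doubles) * 2 (FMA = mul+add). */
    static double peak_gflops(double ghz, int fma_units, int simd_lanes, int cores) {
        return ghz * fma_units * simd_lanes * 2.0 * cores;
    }

    int main(void) {
        /* Xeon E5-2690 v4 (Broadwell): 14 cores, ~2.6 GHz, 2 AVX2 FMA units x 4 doubles */
        printf("Broadwell E5-2690 v4: ~%.0f GFLOPS\n", peak_gflops(2.6, 2, 4, 14));
        /* Xeon Phi 7250 (KNL): 68 cores, ~1.4 GHz, 2 AVX-512 FMA units x 8 doubles */
        printf("Xeon Phi 7250 (KNL):  ~%.0f GFLOPS\n", peak_gflops(1.4, 2, 8, 68));
        return 0;
    }

Nearly all of the roughly 5x gap between the two products comes from core count and SIMD width; the frequency term actually shrinks, which is exactly the decomposition shown in the stacked-contribution charts that follow.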

Intel Processor GFLOPS/Package Contributions over time
(Chart, 2003–2019: stacked base-10 log contributions to GFLOPS/package from log10(GHz), log10(FP/Hz), and log10(Cores/Socket) for Core 2, Nehalem/Westmere, Sandy Bridge / Ivy Bridge, Haswell / Broadwell, and Skylake Xeon. FP operations per cycle grow 4 → 8 → 16 → 32 and cores per socket grow from 2 to ~24, while the frequency term stays near 2–3.3 GHz and drifts down to ~1.8–2.1 GHz in the most recent parts.)

Intel Processor GFLOPS/Package Contributions: Mainstream vs ManyCore
(Chart: the same decomposition comparing Xeon E5 v1 (Sandy Bridge) with Xeon Phi (Knights Corner) for 2012 product releases, and Xeon E5 v4 (Broadwell) with Xeon Phi (Knights Landing) for 2016, with ratios of 5.2x and 6.8x annotated between the mainstream and many-core parts.)

Why is Memory Bandwidth increasing slowly?
• Slow rate of pin speed improvements
  – The emphasis has been on increasing capacity, not increasing bandwidth
  – A shared-bus architecture (multiple DIMMs per channel) is very hard at high frequencies
• DRAM cycle time is almost unchanged in 20 years
  – Speed increases require increasing transfer sizes
  – DDR3/DDR4 have a minimum transfer size of 64 Bytes with standard DIMMs
• Slow rate of increase in interface width
  – Pins cost money!

Why is Memory Latency stagnant or growing?
• More levels of cache
  – Many lookups are serialized to save power
• More asynchronous clock-domain crossings
  – Many different clock domains to save power
  – Snoop (6): Core -> Ring -> QPI -> Ring -> QPI -> Ring -> Core
  – Local Memory (4): Core -> Ring -> DDR -> Ring -> Core
  – Remote Memory (8): Core -> Ring -> QPI -> Ring -> DDR -> Ring -> QPI -> Ring -> Core
• More cores to keep coherent
  – Challenging even on a single mainstream server chip
  – Two-socket system latency is typically dominated by coherence, not data
  – Manycore chips have much higher latency
• Decreasing frequencies!

Why is Interconnect Bandwidth growing slowly?
• Slow rate of pin speed improvements
  – About 20%/year
• Reluctance to increase interface width
  – Chips are typically pin-limited – wider interfaces get fewer ports
  – Parallel links require more hardware – too expensive, and they do not always provide improved real-world bandwidth

Why is Interconnect Latency improving slowly?
• Legacy IO architecture designed around disks, not communications
  – Control operations use un-cached loads/stores – hundreds of ns per operation and no concurrency
  – Interrupt-driven processing requires many thousands of cycles per transaction
• Mismatch between SW requirements and HW capabilities

A different implication of these technology trends: LATENCY, BANDWIDTH, AND CONCURRENCY

Latency, Bandwidth, and Concurrency
• "Little's Law" from queuing theory describes the relationship between latency (or occupancy), bandwidth, and concurrency:
  Latency * Bandwidth = Concurrency
• Flat Latency * Increasing Bandwidth → Increasing Concurrency

• Because these are exponential trends, these are not small changes…

Little's Law: illustration for a 2005-era processor
60 ns latency, 6.4 GB/s (= 10 ns per 64-Byte cache line)
(Diagram: a timeline from -60 ns to 110 ns in which six buffers (Buffer0–Buffer5) cycle through Request 0–11 and Data 0–11, so a new cache line arrives every 10 ns.)
• 60 ns * 6.4 GB/s = 384 Bytes = 6 cache lines
• To keep the bus full, there must always be 6 cache lines "in flight"
• Each request must be launched at least 60 ns before the data is needed
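A small sketch of the same arithmetic, applied to the 2005-era example above and to a hypothetical modern package; the 90 ns / 100 GB/s values are illustrative assumptions, not measurements from the talk:

    #include <stdio.h>

    /* Little's Law: concurrency = latency * bandwidth,
     * expressed here in 64-Byte cache lines that must be "in flight". */
    static double lines_in_flight(double latency_ns, double bandwidth_GBps) {
        double bytes = latency_ns * 1e-9 * bandwidth_GBps * 1e9;  /* latency * BW */
        return bytes / 64.0;                                      /* 64B cache lines */
    }

    int main(void) {
        /* 2005-era example from the slide: 60 ns, 6.4 GB/s -> 384 Bytes = 6 lines */
        printf("2005-era:   %.0f lines in flight\n", lines_in_flight(60.0, 6.4));
        /* Hypothetical modern socket (illustrative values only): 90 ns, 100 GB/s */
        printf("Modern-ish: %.0f lines in flight\n", lines_in_flight(90.0, 100.0));
        return 0;
    }

The second number (~140 lines per package) is why the next chart shows latency-bandwidth products growing at roughly +33%/year, and far exceeds the ~10 outstanding L1 misses a single core can sustain (see the slides that follow).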

Latency-Bandwidth Products per Package (64B units)
(Chart, 1995–2020, log scale: 64-Byte-line latency-bandwidth products for Mainstream Processors, Intel ManyCore, and NVIDIA GPU packages, growing at roughly +33%/year.)

Why is Increasing Concurrency a Problem?
• Architectures are built assuming a "flat" memory model
  – The location of data is invisible and uncontrollable
  – Caches and prefetchers are assumed to be "good enough" to cover latency and bandwidth differences
• Implementations support a limited number of L1 Data Cache misses per core:
  – Xeon E5: 10 L1 misses (maximum)
  – L2 hardware prefetchers help, but are also "invisible" and not directly controllable

Increasing Concurrency (2)
• Many cores are needed just to generate concurrency, even if they are not needed to do computing
  – This costs a lot of energy in the cores!
• Large buffers and complex memory controllers are needed to handle the concurrent operations
  – DRAM management requires the memory schedule to be updated frequently as new transactions appear
  – DRAM open-page hit rates still go down, so DRAM power increases too
  – Design cost up, power cost up, BW utilization down

Increasing Concurrency (3)
• More cores create more concurrent memory access streams, which requires more DRAM banks
• Examples:
  – 8-core Xeon E5 v1 with 2 streams per core needs >= 16 banks
    • Requires 2 ranks of DDR3 DRAM (one dual-rank DIMM)
  – 12-core Xeon E5 v3 with 2 streams per core needs >= 24 banks
    • Requires 2 ranks of DDR4 DRAM (one dual-rank DIMM)
• Problems:
  – Some codes generate many address streams per core – LBM >32
  – HyperThreading can double the address streams per core
  – Adding more DIMMs can *decrease* performance due to rank-to-rank bus stalls

Another angle… POWER AND ENERGY

What about Power/Energy?
• Power density is important in processor implementations
  – Frequencies can be limited by small-scale (core-sized) hot spots
  – Multi-core frequencies are now limited by package cooling
  – E.g., Xeon E5 v3 (Haswell) can only run DGEMM or LINPACK on ½ of the cores before running out of power and needing to throttle the frequency
• Power is not a first-order concern in operating cost!!!
  – Purchase price is $2500-$4000/socket
  – A socket draws 100-150 Watts and needs 40-50 Watts for cooling
  – At $0.10/kWh, this is 5%-7% of the purchase price per year
  – This ratio is very hard to change!!!

What about Power/Energy later?
• If much cheaper processors become available, power would become a first-order cost
• Example 1: "client" multicore processors
  – Use the same core architecture, but at a much lower price
  – A typical configuration needs ~25% of its purchase cost per year for power
  – (A high-performance interconnect solution is not available at a reasonable price)
• Example 2: "embedded" processors
  – A hypothetical $5 processor using 5 Watts requires $7/year for power
  – Not a problem for mobile – not credible for HPC
  – The response will be sociological and bureaucratic, as well as technical

Part 3: COMMENTS & SPECULATIONS

Implications for HPC Systems?
• Applications have very different requirements for the various performance components…
  – It is relatively easy to find 100:1 ratios in memory bandwidth requirements across applications
  – Ratios on the other axes are mostly smaller due to self-selection, but 100:1 examples certainly exist
    • E.g., bisection bandwidth for 3D DNS turbulence
• Example: the CPU + Memory BW + Memory Latency model from 2007 – most of these applications are still in broad use (a sketch of the model and the measured breakdown follow)
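The 2007 model mentioned above splits run time into CPU, memory-latency, and memory-bandwidth terms (the T_CPU, T_Lat, and T_BW of the chart below). A minimal sketch of that style of model, using placeholder system and workload numbers rather than the fitted values from the original study:

    #include <stdio.h>

    /* Three-term execution-time model: T = T_CPU + T_Lat + T_BW.
     * All numbers below are illustrative placeholders. */
    int main(void) {
        /* Hypothetical system parameters */
        double peak_flops   = 500e9;   /* 500 GFLOP/s            */
        double mem_latency  = 80e-9;   /* 80 ns                  */
        double sustained_bw = 60e9;    /* 60 GB/s                */

        /* Hypothetical per-run workload characterization */
        double flop_count      = 200e9;  /* total FP operations              */
        double dependent_loads = 2.5e6;  /* serialized (latency-bound) loads */
        double bytes_moved     = 18e9;   /* streaming memory traffic         */

        double t_cpu = flop_count / peak_flops;
        double t_lat = dependent_loads * mem_latency;
        double t_bw  = bytes_moved / sustained_bw;
        double t     = t_cpu + t_lat + t_bw;

        printf("T = %.2f s  (T_CPU %.0f%%, T_Lat %.0f%%, T_BW %.0f%%)\n",
               t, 100 * t_cpu / t, 100 * t_lat / t, 100 * t_bw / t);
        return 0;
    }

The chart below shows this kind of three-way decomposition for the SPECfp_rate2006 codes on a late-2006-era reference system.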

(Chart: SPECfp_rate2006 run-time contributions (T_CPU, T_Lat, T_BW) on a late-2006-era reference system, stacked to 100% for each code. Annotation: recent testing shows that WRF has increased from this 2007 value of 32% memory time to 70% memory time on a 12-core Haswell EP.)

A note on "optimization"
• Under fairly general assumptions, it can easily be proven that:
  A homogeneous system cannot be "optimal" for a heterogeneous workload!
  – "Optimal" here can refer to performance, power, or overall cost.
• "Optimized for General-Purpose Workloads" is a contradiction
  – As systems get larger and the ratios of performance requirements increase, the advantage of specialization increases

Implications for HPC Systems?
• System Balance trends are due to a combination of technology factors and market factors
  – Technology trends suggest that balances will either get worse, or performance will stagnate
• Some applications will be able to exploit the increasing capabilities, while others will not
  – Changing algorithms may be required to obtain continuing performance increases, e.g.:
    • increasing computational intensity, increasing locality

Implications for HPC Applications?
• As systems scale, the number of performance axes that need to be considered will continue to increase
• Effective use of new & future hardware will require effective exploitation of:
  – Node-level parallelism (e.g., MPI)
  – Multi-core parallelism (e.g., OpenMP)
  – SIMD vector parallelism
    • Multiple independent SIMD vector operations to tolerate pipeline latency
  – Unit-stride memory access with increasing data re-use

John D. McCalpin, PhD
[email protected]
512-232-3754
For more information: www.tacc.utexas.edu

Disruptive Technology Ideas (1)
• Quit fighting the physics of data motion
• Move simple cores to the data
  – Distances are shorter – less energy for data transfer
  – The design is simpler – less development money to recover
• Use more efficient point-to-point DRAM interfaces
  – GDDR5/6, LPDDR4/5
• Core performance requirements are lower
  – Lower the frequency to exploit P ~ f³ (see the sketch after the Summary)
• Save high-power processors for complex processing

Disruptive Technology Ideas (2)
• Move out of the 1980's – memory is not flat!
• The architecture must include functionality to enable control over the most "expensive" operations
  – This must include data motion through the memory hierarchy as a first-order architectural concept
  – Provide semantic information to the outer levels of the hierarchy to enable efficient and performant scheduling of operations on multiple data streams of differing priorities and consumption rates

Disruptive Technology Ideas (3)
• Quit fighting physics:
  – Limit coherence to small areas in physical space and small areas in address space that can be tightly constrained
  – Don't use cache coherence for communication!
• Data motion through the memory hierarchy is fundamentally different from communication or synchronization
  – The optimal behaviors are different, so these must be exposed as different operation types

Disruptive Technology Ideas (4)
• Admit that computers are parallel!
• Current architectures don't include communication or synchronization as concepts
  – Communication and synchronization can be implemented as side effects of sequences of ordered memory references
  – Indirect, inefficient, ugly
• Hardware is capable of extremely efficient communication and synchronization

Disruptive Technology Ideas (5)
• Design for Performance Predictability
• Programming languages that describe algorithms at higher levels of abstraction require significant transformations to map to hardware
  – Currently very limited, because performance cannot be modeled!
• New HW architectures will require human cost for SW re-development, but we are paying a cost now for incomprehensibly complex systems

Summary
• Technology trends suggest that data access is increasingly expensive compared to computation
  – Benchmark data from 25 years confirms these trends
• Multi-level caching and aggressive prefetching have mitigated the impact of the imbalance, but
  – These are expensive to design
  – These are not energy-efficient
• Architectural changes are needed to address design cost and energy efficiency
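As referenced under Disruptive Technology Ideas (1), a small sketch of the cube-law argument; P ~ f³ is the usual first-order assumption that supply voltage scales roughly with frequency, which real processors only approximate:

    #include <stdio.h>

    /* First-order dynamic power model: P ~ C * V^2 * f, with V roughly
     * proportional to f, giving P ~ f^3.  Purely illustrative. */
    int main(void) {
        double ratio  = 0.5;                      /* run cores at half frequency      */
        double power  = ratio * ratio * ratio;    /* relative power     ~ f^3 -> 1/8  */
        double energy = ratio * ratio;            /* relative energy/op ~ f^2 -> 1/4  */
        printf("f x %.2f: power x %.3f, energy/op x %.2f\n", ratio, power, energy);
        printf("Two half-speed cores: same throughput at ~%.0f%% of the power\n",
               100.0 * 2.0 * power);
        return 0;
    }

This is the sense in which many slow cores placed near the data can be more energy-efficient than one fast core far from it.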

For the insomniacs among us… BACKUP SLIDES

TOP500 Rmax Contributions by Family
(Chart, 1993–2016: Rmax share by processor family, including i860, HP, MIPS, Alpha, SPARC, POWER, IA-64, Intel x86, and AMD x86. Note: a fraction of the Rmax from Accelerated systems is also from x86 processors.)

Contributions of Various Accelerator Families to TOP500 RPeak

(Chart, 2006–2016: percent of the aggregate Rpeak of the entire list, 0–35%. Series: %Total Rpeak in IBM Cell, in NVIDIA GPUs, in Xeon Phi, in AMD GPUs, and in other & mixed accelerators; IBM Cell peaks first, followed by NVIDIA GPUs and then Intel Xeon Phi.)

Decreasing Price/Core Leads to Increasing Available Core Counts
(Chart, 1980–2020, log scales: acquisition price (~$100 to ~$10,000,000) for Vector, RISC, x86 single-core, x86 multi-core ($/socket), and x86 multi-core ($/core), with a second axis showing cores available for a $50,000 budget (0.01 to 1000).)

Why?
• FLOPS
  – More cores
  – More FLOPS/cycle
• Memory Bandwidth
  – Slow rate of pin speed improvements
    • DRAM cell cycle time almost unchanged in 20 years
    • Speed increases require increasing transfer sizes
  – Slow rate of increase in interface width
    • Pins cost money!
• Memory Latency
  – More levels of cache
  – More asynchronous clock-domain crossings
  – More cores to keep coherent
• Interconnect Bandwidth
  – Slow rate of pin speed improvements
  – Slow rate of increase in interface width
• Interconnect Latency
  – Legacy IO architecture designed around disks, not communications
  – Mismatch between SW requirements and HW capabilities

SPEC CPU 2006 codes
In decreasing order of computational intensity on the reference system:
• NAMD – molecular dynamics
• GAMESS – quantum chemistry
• CalculiX – nonlinear structural finite element
• POV-Ray – ray-tracing
• GROMACS – molecular dynamics
• Tonto – quantum crystallography
• ZEUS-MP – CFD/MHD
• CactusADM – general relativity
• Deal.II – adaptive finite element elliptic equation solver
• WRF – numerical weather prediction (limited area)
• Sphinx-3 – speech recognition
• LESlie3d – CFD (finite volume, explicit)
• Bwaves – CFD (transonic, implicit Bi-CGstab)
• SoPlex – simplex linear programming solver
• GemsFDTD – computational electromagnetics (finite difference, explicit)
• MILC – quantum chromodynamics
• LBM – CFD (Lattice-Boltzmann)

Required Memory BW for DGEMM vs Maximum Sustainable Memory BW

(Chart, 1990–2020: required memory bandwidth for DGEMM as a percentage (0–100%) of the maximum sustainable memory bandwidth. Annotated points include SNB, IVB, HSW, BDW, KNC, KNL, and TaihuLight.)

TOP500 x86 System Rmax Contributions by Cores/Package
(Chart, 2005–2016: Rmax share by cores per package, from 1 Core/Pkg up to 60+ Core/Pkg, with bands for 2, 4, 6, 8, 10, 12, 14, 16, and 18-20 Core/Pkg. Note: x86 processors in Accelerated Systems are not included in these Rmax contributions.)

Strawman: Processor At Memory (PAM)

(Diagram: repeated units of "1 GiB GDDR5" paired with a "1 W, 1-2 GFLOPS CPU", shown alongside a "Big CPU".)