Computing Performance: On the Horizon
Brendan Gregg
USENIX LISA, June 2021

About this talk
This is
● a performance engineer's views about industry-wide server performance
This isn't
● necessarily about my employer, employer's views, or USENIX's views
● an endorsement of any company/product or sponsored by anyone
● professional market predictions (various companies sell such reports)
● based on confidential materials
● necessarily correct or fit for any purpose
My predictions may be wrong! They will be thought-provoking.

Agenda
1. Processors
2. Memory
3. Disks
4. Networking
5. Runtimes
6. Kernels
7. Hypervisors
8. Observability
Not covering: databases, file systems, front-end, mobile, desktop.
Slides are online and include extra details as fine print.
Slides: http://www.brendangregg.com/Slides/LISA2021_ComputingPerformance.pdf
Video: https://www.usenix.org/conference/lisa21/presentation/gregg-computing

1. Processors

Clock rate: Early Intel Processors
Year  Processor      GHz
1978  Intel 8086     0.008
1985  Intel 386 DX   0.02
1993  Intel Pentium  0.06
1995  Pentium Pro    0.20
1999  Pentium III    0.50
2001  Intel Xeon     1.70
[Chart: max GHz vs. year, 1978-1998]

Clock rate: Server Processor Examples (AWS EC2)
Year  Processor        Cores/T.  Max GHz
2009  Xeon X5550       4/8       3.06
2012  Xeon E5-2665     8/16      3.10
2013  Xeon E5-2680 v2  10/20     3.60
2017  Platinum 8175M   24/48     3.50
2019  Platinum 8259CL  24/48     3.50
[Chart: hardware threads and max GHz vs. year, 2009-2019]
Increase has leveled off due to power/efficiency
● Workstation processors higher; e.g., 2020 Xeon W-1270P @ 5.1 GHz
Horizontal scaling instead
● More CPU cores, hardware threads, and server instances

Interconnect rates
Year  CPU Interconnect  Bandwidth (Gbytes/s)
2007  Intel FSB         12.8
2008  Intel QPI         25.6
2017  Intel UPI         41.6
10 years:
● 3.25x bus rate
● 6x core count
Memory bus (covered later) also lagging
Source: Systems Performance 2nd Edition, Figure 6.10 [Gregg 20]

CPU Utilization is Wrong
It includes stall cycles
● Memory stalls (local and remote over an interconnect), resource stalls.
● It's a fundamental metric. We shouldn't need to "well, actually"-explain it.
Workloads often memory bound
● I see instructions-per-cycle (IPC) of 0.2 - 0.8 (practical max 1.5, current microbenchmark max 4.0).
Faster CPUs just mean more stalls
[Diagram: what "90% CPU" utilization may really mean]
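To make the stall point concrete, IPC can be read from the CPU's performance counters alongside utilization. The following is a minimal sketch (not from the slides), following the perf_event_open(2) man page pattern on Linux; the file name and the busy loop are placeholders for whatever workload is of interest, and the counters measure user mode only so it can usually run unprivileged.

/* ipc.c: measure cycles, instructions, and IPC for a code section using
 * Linux perf events. Build: gcc -O2 ipc.c -o ipc
 * (May require a relaxed kernel.perf_event_paranoid setting on some distros.)
 */
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

static int open_counter(uint64_t config, int group_fd)
{
    struct perf_event_attr pe;

    memset(&pe, 0, sizeof(pe));
    pe.type = PERF_TYPE_HARDWARE;
    pe.size = sizeof(pe);
    pe.config = config;
    pe.disabled = (group_fd == -1);   /* only the group leader starts disabled */
    pe.exclude_kernel = 1;            /* user mode only: works unprivileged */
    pe.exclude_hv = 1;

    int fd = syscall(__NR_perf_event_open, &pe, 0, -1, group_fd, 0);
    if (fd == -1) {
        perror("perf_event_open");
        exit(1);
    }
    return fd;
}

int main(void)
{
    /* Cycles is the group leader; instructions joins its group so both
     * counters are scheduled together. */
    int cyc_fd = open_counter(PERF_COUNT_HW_CPU_CYCLES, -1);
    int ins_fd = open_counter(PERF_COUNT_HW_INSTRUCTIONS, cyc_fd);

    ioctl(cyc_fd, PERF_EVENT_IOC_RESET, PERF_IOC_FLAG_GROUP);
    ioctl(cyc_fd, PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP);

    /* Placeholder workload: replace with the code of interest. */
    volatile double x = 0;
    for (long i = 0; i < 100000000; i++)
        x += (double)i * 0.5;

    ioctl(cyc_fd, PERF_EVENT_IOC_DISABLE, PERF_IOC_FLAG_GROUP);

    uint64_t cycles = 0, instructions = 0;
    if (read(cyc_fd, &cycles, sizeof(cycles)) != (ssize_t)sizeof(cycles) ||
        read(ins_fd, &instructions, sizeof(instructions)) != (ssize_t)sizeof(instructions)) {
        perror("read");
        return 1;
    }

    printf("cycles: %llu  instructions: %llu  IPC: %.2f\n",
           (unsigned long long)cycles, (unsigned long long)instructions,
           cycles ? (double)instructions / (double)cycles : 0.0);
    return 0;
}

These are the same counters perf(1) reports as "insn per cycle": an IPC well below 1 on a CPU that looks 90% busy suggests those cycles are going to stalls, not instructions.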
Lithography
[Chart: semiconductor "nanometer process" vs. year, 32nm (2009) down to 3nm (2022) and 2nm (2023). Source: Semiconductor device fabrication [Wikipedia 21a]]
● "Nanometer process" since 2010 should be considered a marketing term
  – New terms proposed include: GMT (gate pitch, metal pitch, tiers) and LMC (logic, memory, interconnects) [Moore 20]
● TSMC expects volume production of 3nm in 2022 [Quach 21a]
● IBM has already built a 2nm chip [Quach 21b] (it has 12nm gate length)
● Lithography limits expected to be reached by 2029, switching to stacked CPUs [Moore 20]
● BTW: silicon atom radius ~0.1 nm [Wikipedia 21b]

Other processor scaling
Special instructions
● E.g., AVX-512 Vector Neural Network Instructions (VNNI); a runtime-detection sketch appears after the GPU examples below
Connected chiplets
● Using embedded multi-die interconnect bridge (EMIB) [Alcorn 17]
3D stacking

Latest server processor examples
Vendor     Processor                      Process  Clock (GHz)  Cores/T.  LLC (Mbytes)  Date
Intel      Xeon Platinum 8380 (Ice Lake)  "10nm"   2.3 - 3.4    40/80     60            Apr 2021
AMD        EPYC 7713P                     "7nm"    2.0 - 3.675  64/128    256           Mar 2021
ARM-based  Ampere Altra Q80-33            "7nm"    3.3          80/80     32            Dec 2020
Coming soon to a datacenter near you (although there is a TSMC chip shortage that may last through to 2022/2023 [Quach 21][Ridley 21]).

Cloud chip race
Amazon ARM/Graviton2
● ARM Neoverse N1, 64 core, 2.5 GHz
Microsoft ARM
● ARM-based something coming soon [Warren 20]
Google SoC?
● Systems-on-Chip (SoC) coming soon [Vahdat 21]
[Diagram: generic processors (x86, AMD, ARM) vs. cloud processors (Graviton2, GOOG, MSFT)]

Latest benchmark results
Graviton2 is promising
● "We tested the new M6g instances using industry standard LMbench and certain Java benchmarks and saw up to 50% improvement over M5 instances." – Ed Hunter, Director of performance and operating systems at Netflix
Full results still under construction
● It takes weeks of work to understand the latest differences, fix various things so that performance can be measured as it will be in production, and verify the results with real production workloads.

Accelerators
The "other" CPUs you may not be monitoring
GPUs
● Parallel workloads, thousands of GPU cores. Widespread adoption in machine learning.
FPGAs
● Reprogrammable semiconductors
● Great potential, but needs specialists to program
● Good for algorithms: compression, cryptocurrency, video encoding, genomics, search, etc.
● Microsoft FPGA-based configurable cloud [Russinovich 17]
Also TPUs
● Tensor processing units [Google 21]
[Chart: ease of use vs. performance potential for CPU, GPU, FPGA]

Latest GPU examples
NVIDIA GeForce RTX 3090: 10,496 CUDA cores, 2020 [Burnes 20]
Cerebras Gen2 WSE: 850,000 AI-optimized cores, 2021
● Uses most of the silicon wafer for one chip. 2.6 trillion transistors, 23 kW. [Trader 21]
● Previous version was already the "largest chip ever built," and US$2M. [insideHPC 20]
[Diagram: GPU block layout; streaming multiprocessors (SMs), each with control, streaming processors (SPs), and L1 cache, sharing an L2 cache]
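As a footnote to the special-instructions point above: software generally has to detect these extensions at runtime before dispatching an optimized code path. The sketch below (not from the slides) checks for AVX-512F and AVX-512 VNNI via CPUID leaf 7 using GCC/Clang's cpuid.h; the file name is illustrative and the bit positions (EBX bit 16 for AVX-512F, ECX bit 11 for AVX-512 VNNI) are an assumption to verify against Intel's documentation.

/* cpu_features.c: check for AVX-512F and AVX-512 VNNI before choosing an
 * optimized code path. Build: gcc -O2 cpu_features.c -o cpu_features
 */
#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    /* CPUID leaf 7, subleaf 0 reports extended feature flags. */
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
        printf("CPUID leaf 7 not supported\n");
        return 1;
    }

    int avx512f    = (ebx >> 16) & 1;   /* EBX bit 16: AVX-512 Foundation */
    int avx512vnni = (ecx >> 11) & 1;   /* ECX bit 11: AVX-512 VNNI */

    printf("AVX-512F:     %s\n", avx512f    ? "yes" : "no");
    printf("AVX-512 VNNI: %s\n", avx512vnni ? "yes" : "no");

    /* A real library would dispatch here: use a VNNI inner loop when the
     * bit is set, otherwise fall back to a portable implementation. */
    return 0;
}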
Latest FPGA examples
Xilinx Virtex UltraScale+ VU19P: 8,938,000 logic cells, 2019
● Using 35B transistors. Also has 4.5 Tbit/s transceiver bandwidth (bidirectional) and 1.5 Tbit/s DDR4 bandwidth [Cutress 19]
Xilinx Virtex UltraScale+ VU9P: 2,586,000 logic cells, 2016
● Deploy right now: AWS EC2 F1 instance type (up to 8 of these FPGAs per instance)
BPF (covered later) already in FPGAs
● E.g., 400 Gbit/s packet filter FFShark [Vega 20]

My Predictions

My Prediction: Multi-socket is doomed
● Single socket is getting big enough (cores)
● Already scaling horizontally (cloud)
  – And in datacenters, via "blades" or "microservers"
● Why pay NUMA costs?
  – Two single-socket instances should out-perform one two-socket instance
[Diagram: one socket (<100 cores) vs. cloud (>100 cores); in a two-socket system, local memory is 1 hop away and remote memory is 2 hops]
Multi-socket future is mixed: one socket for cores, one GPU socket, one FPGA socket, etc., EMIB connected.

My Prediction: SMT future unclear
Simultaneous multithreading (SMT) == hardware threads
● Performance variation
● ARM cores competitive
● Post Meltdown/Spectre
  – Some people turn them off
Possibilities:
● SMT becomes "free"
  – Processor feature, not a cost basis
  – Turn "oh no! hardware threads" into "great! bonus hardware threads!"
● No more hardware threads
  – Future investment elsewhere

My Prediction: Core count limits
Imagine an 850,000-core server processor in today's systems...
Worsening problems:
● Memory-bound workloads
● Kernel/app lock contention
● False sharing (a minimal demonstration sketch appears at the end of this section)
● Power consumption
● etc.
Source: Figure 2.16 [Gregg 20]
General-purpose computing will hit a practical core limit
● For a given memory subsystem & kernel, and running multiple applications
● E.g., 1024 cores (except GPUs/ML/AI)
● Apps themselves will hit an even smaller practical limit (some already have by design, e.g., Node.js and 2 CPUs)

My Prediction: 3 Eras of processor scaling
Delivered processor characteristics:
● Era 1: Clock frequency
● Era 2: Core/thread count
● Era 3: Cache size & policy
Practical server limits:
● Era 1: Clock frequency → already reached by ~2005 (3.5 GHz)
● Era 2: Core/thread count → limited by mid 2030s (e.g., 1024)
● Era 3: Cache size & policy → limited by end of 2030s
Mid-century will need an entirely new computer hardware architecture, kernel memory architecture, or logic gate technology to progress further.
● E.g., use of graphene, carbon nanotubes [Hruska 12]
● This is after moving more to stacked processors

My Prediction: More processor vendors
ARM licensed: era of CPU choice
Beware "optimizing for the benchmark"
● Don't believe microbenchmarks without doing "active benchmarking": root-cause performance analysis while the benchmark is still running.
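The false-sharing problem listed under core count limits is easy to reproduce. The sketch below (not from the slides; the file name is illustrative and it assumes 64-byte cache lines, POSIX threads, and gcc on Linux) has two threads hammer counters that share a cache line, then counters padded onto separate lines, and times both runs.

/* false_sharing.c: two threads increment adjacent counters (same cache
 * line) and then padded counters (separate lines), timing each run.
 * Build: gcc -O2 -pthread false_sharing.c -o false_sharing
 */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS      100000000UL
#define CACHE_LINE 64                 /* assumed cache line size (bytes) */

struct adjacent {
    unsigned long a;                  /* a and b share a cache line */
    unsigned long b;
};

struct padded {
    unsigned long a;
    char pad[CACHE_LINE - sizeof(unsigned long)];  /* push b to another line */
    unsigned long b;
};

static struct adjacent adj;
static struct padded   sep;

static void *bump(void *arg)
{
    /* volatile forces a real store per iteration (and the cache-coherency
     * traffic that goes with it), even at -O2 */
    volatile unsigned long *c = arg;
    for (unsigned long i = 0; i < ITERS; i++)
        (*c)++;
    return NULL;
}

static double run_pair(unsigned long *x, unsigned long *y)
{
    struct timespec t0, t1;
    pthread_t p1, p2;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    pthread_create(&p1, NULL, bump, x);
    pthread_create(&p2, NULL, bump, y);
    pthread_join(p1, NULL);
    pthread_join(p2, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    printf("adjacent counters (false sharing): %.2f s\n",
           run_pair(&adj.a, &adj.b));
    printf("padded counters (separate lines):  %.2f s\n",
           run_pair(&sep.a, &sep.b));
    return 0;
}

On a multi-core system the padded run is typically several times faster; this kind of coherency traffic is one of the scaling problems that worsens as core counts grow.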