Using Cellphone Technology to Build AI Servers
Peter Hsu, PhD ([email protected])
Presented at the Barcelona Supercomputing Center, Spain, 25 June 2018
C2P3 @ BSC. Copyright ©2018 Peter Hsu, All Rights Reserved.
Slide 2: Energy Is the Real Limit of AI
• 30 GFLOPS/Watt (32-bit float)
• 15.1 GFLOPS/Watt (64-bit float), i.e. 200 PFLOPS @ 13.3 MW

Slide 3: IBM AC922
[Photo of the IBM AC922 server]

Slide 4: IBM AC922
• Where does the energy go?
• How efficient is it really?
• Can we do better?

Slide 5: Summary
[Figure: the proposed Cache Coherent Parallel Phone Processor: a 280 mm (11 inch) board carrying a grid of 15.0 mm x 15.0 mm SoC packages with power VRMs and flash SSDs in the airflow path, plus a package cross-section showing heat sink, SoC die, C4 bumps, fine-pitch solder balls, chip-scale package, PCB solder balls, and a TSV DRAM die stack under the DRAM+CPU]
• 2x: near-data processing
• 1.6x: SoC/DRAM 3D layout/package co-design
• 1.6x: vector accumulator
• 1.4x: best consumer electronics process
• 1.4x: 10nm → 7nm
Compounded, that is 10x more energy efficiency: roughly 150 GFLOPS/W at 64-bit float and 600 GFLOPS/W at 16-bit float (checked in a sketch after slide 7).

Slide 6: GPU Subsystem and Full Server Energy
[Pie charts: V100 energy by component (arithmetic, SRAM, DRAM, wires & buffers, power, other) and full-server power by component (GPU, CPU, misc); annotated per-bit energies: cell array 112 fJ, row activation 0.54 pJ, transfer within a chip 1.48 pJ, transfer within a stack 2.31 pJ; V100 die 21.8 mm x 37.3 mm]

GPU subsystem and full-server energy (the IBM CPU entries are "just guesses"):

    Component           Operation                                   pJ/op  pJ/MADD  W/chip  W/system
    DRAM (HBM2,         read: row activation (per bit)               0.54     1.0     3.9
    0.025 dword/MADD)   data transfer within a chip                  1.48     2.7    10.6
                        data transfer within a stack                 2.3      4.2    16.6
                        data transfer across interposer              0.5      1.0     3.9
                        subtotal                                              8.9    35        209
    Cache (SRAM,        global wire (35 mm x 1 bit)                  2.7      5.0    19.6
    0.026 dload/MADD)   write L2 cache (72-bit SRAM)                 2.7      0.1     0.3
                        read L2 cache (72-bit SRAM)                  2.7      2.7    10.6
                        local wire (2.5 mm x 1 bit)                  0.2     12.5    17.6
    MADD                write vector register file (64-bit)          2.4      2.4     9.4
                        read 3 vector operands (SRAM)                7.2      7.2    28.2
                        floating-point MADD (64-bit FP)              8.7      8.7    34.1
                        write 1 vector result (SRAM)                 2.4      2.4     9.4
    Other               memory interface, control, etc.             14.0     14.0    55
    NVIDIA V100         chip total                                                  215      1,293
    VRM                 conversion efficiency 85%                             9.6    38        225
    PCIe card           2 fans (12 V, 0.5 A) or water pump                    3.1    12         72
    6 GPU cards         subtotal (300 W per card)                            76.6             1,800
    DDR4 DRAM           row activation (per bit)                     0.54     0.7
                        data transfer within chip, I/O               1.71     2.3
                        subtotal                                              3.0               48
    IBM Power9 (x2)     SRAM                                                  3.4    40
                        wires & buffers                                       4.3    50
                        other                                                 8.5   100
                        subtotal                                                    190        380
    CPU power VRM       conversion efficiency 85%                             2.5    29         58
    Misc                PCIe, fans/pumps, etc.                               18                414
    Chassis             primary power supply, 85% efficient                  17.2              405
    Grand Total                                                                              3,105

At 47.0 TFLOPS of 64-bit compute and 3,105 W, the AC922 delivers 15.1 GFLOPS/W.

Slide 7: Where Does the Energy Go?
[Pie charts repeated from slide 6]
• Where does most of the energy go? Into moving data and staging it in SRAM; DRAM access and the arithmetic datapath itself take a smaller share.
• How efficient is it really?
• Can we do better?
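The baseline and the compounded claim on the Summary slide can be checked with a few lines of arithmetic. A minimal sketch in Python; the numbers are the slides' own, the factor names are just labels:

    # Slide 2 baseline: 200 PFLOPS at 13.3 MW (64-bit float).
    peak_flops = 200e15
    power_watts = 13.3e6
    baseline = peak_flops / power_watts / 1e9
    print(f"baseline: {baseline:.1f} GFLOPS/W")        # ~15.0, slide 2's 15.1

    # Improvement factors claimed on the Summary slide (slide 5).
    factors = {
        "near-data processing": 2.0,
        "SoC/DRAM 3D co-design": 1.6,
        "vector accumulator": 1.6,
        "best consumer process": 1.4,
        "10nm -> 7nm": 1.4,
    }
    total = 1.0
    for name, factor in factors.items():
        total *= factor
    print(f"compounded: {total:.1f}x")                 # ~10.0x
    print(f"target: {baseline * total:.0f} GFLOPS/W")  # ~151, the claimed 150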
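The watts columns of the slide-6 table follow from the pJ/MADD column once a MADD rate is fixed. A sketch assuming the V100's ~7.8 TFLOPS 64-bit peak, i.e. 3.9e12 MADDs/s (one MADD = 2 FLOPs); that rate is my assumption, not stated on the slide:

    MADD_RATE = 3.9e12  # assumed 64-bit MADDs per second per V100

    def watts(pj_per_madd, n_chips=1):
        """Power for a given energy per MADD, across n_chips GPUs."""
        return pj_per_madd * 1e-12 * MADD_RATE * n_chips

    # GPU-card subtotal from the slide-6 table: 76.6 pJ/MADD.
    print(f"{watts(76.6):.0f} W per card")        # ~299 W vs the 300 W column
    print(f"{watts(76.6, 6):,.0f} W, 6 cards")    # ~1,792 W vs 1,800 W

    # Whole-server efficiency: 6 x 7.8 TFLOPS over the 3,105 W total.
    print(f"{6 * 7.8e12 / 3105 / 1e9:.1f} GFLOPS/W")  # 15.1, as on the slide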
Slide 8: Data-Streaming Architecture
[Block diagram of the AC922: NVMe SSD and PCIe feed the Power9 out-of-order core (register file, last-level cache) and the main-memory DRAM cell arrays; the V100 (arithmetic, register file, level-2 cache) streams out of its own HBM2 cell arrays]
NOTE: coherence ≠ always communicating through shared memory, i.e. asynchronous DMA memcpy.

Slide 9: Data-Streaming Architecture Example
The IBM 7094 of 1962 ($2M USD, 0.5 MHz clock cycle, 0.15 MB main memory) streamed data from tape storage to disk drive to core memory to the logic gates. The AC922 has the same shape:
• SSD to main memory: 4 PCIe Gen3 cards at 8 Gt/s, 12 GB/s
• 1 TB main memory at 340 GB/s
• Main memory to working memory: NVLinks at 75 GB/s
• 96 GB of working memory (HBM2) at 5,376 GB/s

Slide 10: Near-Data Processing Architectures
[Diagram: 256 data-streaming nodes, each a UFS 2.1 SSD (1.2 GB/s) feeding LPDDR4X-4266 DRAM cell arrays and an SoC with level-2 cache, register file, arithmetic, and a network port; aggregates: 307 GB/s SSD bandwidth, 8,737 GB/s DRAM bandwidth, and 64 GB/s bisection bandwidth (peak) over Gen4 PCIe at 16 Gt/s x30; shown beside the AC922 stack from slide 9]
The reference design is the Oracle Labs RAPID parallel computer.

Slide 11: Cache Coherent Parallel Phone Processor
[Board layout: a 16 x 16 grid of 15.0 mm x 15.0 mm DRAM+CPU packages, power VRMs, and flash SSDs on a 280 mm (11 inch) board; photos of an Intel PC and an SGI Origin 2000]
• Gen4 PCIe: 16 Gt/s at 10-12 inches, power efficient (≈5 pJ/bit), strong industry support, volume manufacturing
• Power & cooling: like an 800 W processor, but already spread out
• CC-NUMA with 256 nodes: the industry has much experience (SGI Origin 2000); this is just a bigger number than usual
• Per node: 4 GB of DRAM, 30 GB/s of bandwidth, 0.567 TFLOPS of 16-bit float

The totals (aggregated in the sketch after slide 13):

    Metric           AC922  This Unit  Units
    DRAM capacity      1.1       1.0   TB
    DRAM bandwidth     5.7       7.6   TB/s
    16-bit compute       -       145   TFLOPS
    64-bit compute    47.0      36.3   TFLOPS

Slide 12: Near-Data Processing Advantages
[Diagrams repeated from slides 10 and 11]
Use cache coherency to make physically partitioned memory easy to program.

Slide 13: Best Consumer Technology
[Photos: consumer SoC announcements of 27 March 2017 and 7 December 2017; a 72.3 mm² die]
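The "This Unit" column of the slide-11 table is just the per-node numbers scaled across the 16 x 16 board. A small sketch; the only assumption of mine is that the 64-bit rate is a quarter of the 16-bit rate:

    NODES = 16 * 16               # 256 SoC packages on the 280 mm board

    dram_gb_per_node = 4          # 4 GB DRAM per node
    dram_gbps_per_node = 30       # 30 GB/s per node
    fp16_tflops_per_node = 0.567  # 0.567 TFLOPS 16-bit per node

    print(f"{NODES * dram_gb_per_node / 1024:.1f} TB DRAM")    # ~1.0
    print(f"{NODES * dram_gbps_per_node / 1000:.1f} TB/s")     # ~7.7 (slide: 7.6)
    fp16 = NODES * fp16_tflops_per_node
    print(f"{fp16:.0f} TFLOPS 16-bit")                         # ~145
    print(f"{fp16 / 4:.1f} TFLOPS 64-bit")                     # ~36.3, at 1/4 rate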
Slide 14: Processor over Memory
[Package cross-section: heat sink over the SoC package, SoC die on C4 bumps, LPDDR4 below, PCB solder balls, 15 mm package; each 2 mm x 2 mm processor tile sits on an interface of less than 1 mm to a 1 GB DRAM through fine-pitch solder balls, a chip-scale package, C4 bumps, and a TSV DRAM die stack]
Best of both worlds: logic tiles with their DRAM directly beneath them.

Slide 15: Vector Accumulator Architecture
Three datapaths for computing G[1:N] = A*B + C + D*E + E*F elementwise, with a vector-register-file access at ≈0.9 pJ, a multiply-add at ≈3.25 pJ, and DRAM main-memory traffic amortized at 0.9 pJ x 2.5%. The CRAY-1 column (SRAM main memory feeding a vector register file) sets the pattern; the GPU and C2P3 columns (DRAM main memory) compare as follows (a register-traffic sketch appears after slide 18).

GPU, ∑ = 4.52 pJ of data movement per MADD:

    r1 = A[1:N]
    r2 = B[1:N]
    r3 = C[1:N]
    r7 = r1 * r2 + r3
    r4 = D[1:N]
    r5 = E[1:N]
    r8 = r4 * r5 + r7
    r6 = F[1:N]
    r9 = r5 * r6 + r8
    G[1:N] = r9

C2P3 with a vector accumulator vac, ∑ = 1.82 pJ of data movement per MADD:

    v1 = A[1:N]
    v2 = B[1:N]
    vac = v1 * v2
    v3 = C[1:N]
    vac += v3
    v4 = D[1:N]
    v5 = E[1:N]
    vac += v4 * v5
    v6 = F[1:N]
    vac += v5 * v6
    G[1:N] = vac

Including the ≈3.25 pJ multiply-add itself, that is 7.77 pJ versus 5.07 pJ per MADD, consistent with the 1.6x credited to the vector accumulator on the Summary slide.

Slide 16: Block Floating-Point DNN Training
(Guest slides: Mario Drumond, Tao Lin, Martin Jaggi, Babak Falsafi; MSR contact: Eric Chung; February 20th, 2018.)
Custom arithmetic for DNNs: prior work shows mixed results.
• Half-precision floating-point (FP16): 10x worse area/power than fixed-point.
• Fixed-point: limited range, complex techniques to select quantization points, and the quantization points are static.
Block floating-point (BFP) for DNNs is a compromise between fixed- and floating-point:
• Limits the range of values within a single tensor while keeping a wide range of values across tensors.
• Quantization points are picked dynamically.
• Dot products run in fixed-point ✓; other operations degenerate to floating-point (FP) ✗.
Key observation: a large fraction of DNN computations appear in dot products, so custom arithmetic for dot products is enough. That makes BFP a great candidate for a custom DNN representation (a toy dot product is sketched after slide 18).

Slide 17: Dragonfly Network
Key idea: leverage the on-chip and off-chip coherence networks. 256 x 256 = 65,536 processor SoC chips.

Slide 18: Together
[Diagram: the 256-node near-data processing system from slide 10; pie chart of C2P3 energy by component (cell array, DRAM, wires & buffers, SRAM, power, arithmetic, other)]

C2P3 SoC and system energy (per 64-bit MADD; the watts columns are rebuilt in a sketch below):

    Component         Operation                                   pJ/op  pJ/MADD  W/chip  W/system
    DRAM (LPDDR4X,    read: row activation (per bit)               0.54    0.91    0.07
    0.026 DW/MADD)    data transfer within a chip                  1.93    3.3     0.25
                      off-chip I/O SERDES                          0.7     1.2     0.09
                      subtotal                                             5.3     0.4       106
    load DW           local wire (0.5 mm x 1 bit)                  0.03    0.9     0.07
                      write register file (64-bit SRAM)            1.8     0.05    0.00
    MADD              read 2 operands (64-bit instruction SRAM)    3.6     3.6     0.28
                      floating-point MADD (64-bit)                 6.5     6.5     0.51
                      memory interface, control, etc.
    C2P3 processor    totals                                                       1.8       468
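The vector accumulator saves energy because a running sum held in vac never passes through the vector register file. A rough Python model of the slide-15 register traffic, assuming ≈0.9 pJ per register-file access; the counting rules are my reading of the diagram and land near, not exactly on, the slide's ∑ values:

    RF_PJ = 0.9     # vector register file access (slide 15)
    FMA_PJ = 3.25   # multiply-add energy (slide 15)

    # GPU-style FMA r = a*b + c: read 3 operands, write back 1 result.
    gpu_rf = (3 + 1) * RF_PJ   # 3.6 pJ/MADD (the slide sums it to 4.52)
    # C2P3-style vac += a*b: read 2 operands; the sum stays in vac.
    c2p3_rf = 2 * RF_PJ        # 1.8 pJ/MADD (the slide: 1.82)

    print(f"GPU : {gpu_rf + FMA_PJ:.2f} pJ/MADD including arithmetic")
    print(f"C2P3: {c2p3_rf + FMA_PJ:.2f} pJ/MADD including arithmetic")
    # With the slide's own sums: (4.52+3.25)/(1.82+3.25) = 1.53x,
    # close to the Summary slide's 1.6x for the vector accumulator.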
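Slide 16 describes BFP as a shared exponent per block with fixed-point mantissas, so that dot products run in integer arithmetic. A toy sketch of that idea; the 8-bit mantissa width and one-block-per-tensor granularity are illustrative choices of mine, not the authors':

    import numpy as np

    MANT_BITS = 8  # assumed mantissa width

    def to_bfp(x):
        """Quantize a block to (integer mantissas, shared exponent)."""
        max_abs = float(np.max(np.abs(x)))
        exp = int(np.floor(np.log2(max_abs))) + 1 if max_abs > 0 else 0
        scale = 2.0 ** (exp - (MANT_BITS - 1))  # weight of one mantissa LSB
        lim = 2 ** (MANT_BITS - 1)
        mant = np.clip(np.round(x / scale), -lim, lim - 1).astype(np.int64)
        return mant, exp

    def bfp_dot(a, b):
        """Dot product done as integer MACs, rescaled once at the end."""
        ma, ea = to_bfp(a)
        mb, eb = to_bfp(b)
        acc = int(np.dot(ma, mb))  # fixed-point multiply-accumulate
        return acc * 2.0 ** (ea + eb - 2 * (MANT_BITS - 1))

    rng = np.random.default_rng(0)
    a, b = rng.standard_normal(1024), rng.standard_normal(1024)
    print("BFP :", bfp_dot(a, b))        # close to the FP64 result below
    print("FP64:", float(np.dot(a, b)))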
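The watts columns of the slide-18 table can be rebuilt from its pJ/MADD entries, the way slide 6 was. A sketch assuming the per-chip 64-bit rate implied by slide 11, 36.3 TFLOPS over 256 chips, i.e. ~71e9 MADDs/s per chip (one MADD = 2 FLOPs); the reconstructed watts land close to the table's columns:

    CHIPS = 256
    madd_rate = 36.3e12 / 2 / CHIPS   # ~7.1e10 64-bit MADDs/s per chip

    rows = {                                # pJ/MADD from the slide-18 table
        "row activation": 0.91,             # table: 0.07 W/chip
        "data transfer within chip": 3.30,  # table: 0.25 W/chip
        "off-chip I/O SERDES": 1.20,        # table: 0.09 W/chip
        "floating-point MADD": 6.50,        # table: 0.51 W/chip
    }
    for name, pj in rows.items():
        w = pj * 1e-12 * madd_rate
        print(f"{name:26s} {w:.2f} W/chip  {w * CHIPS:6.1f} W system")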