Big Data, AI, and the Future of Memory
Big Data, AI, and the Future of Memory
Steven Woo, Fellow and Distinguished Inventor, Rambus Inc.
May 15, 2019

Memory Advancements Through Time
- 1990s: Synchronous memory for PCs
- 2000s: Graphics memory for gaming
- 2010s: Low-power memory for mobile
- 2020s: Ultra-high-bandwidth memory for AI

Faster Compute + Big Data Enabling Explosive Growth in AI
- Since the 1980s and 1990s, neural networks have overtaken other approaches in accuracy as compute and scale (data size, model size) have grown.
- [Chart: Annual size of the global datasphere, 2010-2025, growing to 175 zettabytes by 2025.]
- Key challenges: Moore's Law is ending, and energy efficiency is growing in importance.
Sources: Adapted from Jeff Dean, "Recent Advances in Artificial Intelligence and the Implications for Computer System Design," Hot Chips 29 keynote, August 2017; and Data Age 2025, sponsored by Seagate, with data from IDC Global DataSphere, November 2018.

AI Accelerators Need Memory Bandwidth
- Inference on newer silicon built for AI processing (Google TPU v1) is largely limited by memory bandwidth.
- Inference on older, general-purpose hardware (Intel Haswell, NVIDIA K80) is limited by both compute and memory bandwidth.
- Memory bandwidth is a critical resource for AI applications.
- [Chart: Roofline plot of TeraOps/sec vs. ops per weight byte (both log scale) showing the TPU, K80, and Haswell rooflines, with LSTM0, LSTM1, MLP0, MLP1, CNN0, and CNN1 workloads plotted for each platform.]
Source: N. Jouppi et al., "In-Datacenter Performance Analysis of a Tensor Processing Unit," https://arxiv.org/ftp/arxiv/papers/1704/1704.04760.pdf
©2019 Rambus Inc.
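The roofline plot above can be sketched numerically: attainable throughput is the minimum of peak compute and memory bandwidth multiplied by operational intensity (ops per byte fetched). A minimal sketch in Python, using the TPU v1 figures reported in the Jouppi et al. paper (roughly 92 TeraOps/s peak and 34 GB/s DRAM bandwidth); the workload intensities are illustrative assumptions:

```python
def roofline(peak_ops_per_s, mem_bw_bytes_per_s, intensity_ops_per_byte):
    """Attainable throughput (ops/s) under the roofline model."""
    return min(peak_ops_per_s, mem_bw_bytes_per_s * intensity_ops_per_byte)

# Google TPU v1 figures from Jouppi et al.: ~92 TeraOps/s peak, ~34 GB/s DRAM bandwidth.
PEAK = 92e12
BW = 34e9

# A workload at 100 ops per weight byte (illustrative of the low-intensity points
# on the plot) is memory-bandwidth bound: 34 GB/s * 100 = 3.4 TeraOps/s,
# far below the 92 TeraOps/s compute peak.
low_intensity = roofline(PEAK, BW, 100)

# A workload above the ridge point (PEAK / BW, about 2,700 ops/byte) is compute bound.
high_intensity = roofline(PEAK, BW, 5000)

print(low_intensity, high_intensity)
```

The ridge point explains the slide's claim: the plotted TPU workloads sit well to the left of it, so the AI accelerator spends most of its time waiting on memory rather than computing.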
Common Memory Systems for AI Applications
- On-chip memory: highest bandwidth and power efficiency (e.g., Apple A11 Bionic processor, Graphcore IPU)
- HBM: very high bandwidth and density (e.g., NVIDIA Tesla V100, AMD Radeon RX Vega 56)
- GDDR: good tradeoff between bandwidth, power efficiency, cost, and reliability (e.g., NVIDIA GeForce RTX 2080 Ti, AMD Radeon RX 580)
- Multiple options exist for different needs, but more bandwidth is needed.

Power Dominated by Data Access and Data Movement
- 2013: Computation will take less energy than on-chip data movement; FLOPs will cost less than on-chip data movement (NUMA) by 2018. Source: Horst Simon, "Why we need Exascale and why we won't get there by 2020," 2013.
- 2014: Computation energy is extremely small compared to the energy of data access and movement. Source: Mark Horowitz, "Computing's energy problem (and what we can do about it)," ISSCC 2014.
- Data movement is a critical challenge to improving power efficiency.

Thank you.
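The bandwidth gap between the memory systems compared above largely comes down to interface width multiplied by per-pin data rate. A rough sketch, using representative figures for one HBM2 stack and one GDDR6 device (assumptions; actual parts vary by generation and speed grade):

```python
def peak_bandwidth_gb_s(pins, gbps_per_pin):
    """Peak interface bandwidth in GB/s: pin count * per-pin rate (Gb/s) / 8 bits per byte."""
    return pins * gbps_per_pin / 8

# Representative figures (assumptions, not from the slides):
# one HBM2 stack: 1024-bit interface at 2 Gb/s per pin
hbm2 = peak_bandwidth_gb_s(1024, 2.0)
# one GDDR6 device: 32-bit interface at 16 Gb/s per pin
gddr6 = peak_bandwidth_gb_s(32, 16.0)

# HBM reaches high bandwidth with a very wide, relatively slow interface (enabled by
# 2.5D stacking); GDDR uses a narrow, very fast one on a conventional PCB.
print(hbm2, gddr6)
```

This is the tradeoff the slide names: the wide HBM interface delivers more bandwidth per watt but at higher cost, while GDDR trades some of both for cheaper, more conventional packaging.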
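The data-movement energy gap on the Horowitz slide can be made concrete with the approximate 45 nm per-operation energies widely cited from that ISSCC 2014 talk (orders of magnitude only; exact values depend on process and implementation):

```python
# Approximate per-operation energies at 45 nm, as cited from Horowitz, ISSCC 2014
# (approximate, order-of-magnitude figures):
ENERGY_PJ = {
    "fp32_add": 0.9,       # 32-bit floating-point add
    "sram_read_32b": 5.0,  # 32-bit read from a small on-chip SRAM
    "dram_read_32b": 640.0 # 32-bit read from off-chip DRAM
}

# Fetching a single operand from DRAM costs roughly 700x the arithmetic itself,
# which is why data movement, not compute, dominates the power budget.
ratio = ENERGY_PJ["dram_read_32b"] / ENERGY_PJ["fp32_add"]
print(round(ratio))
```

At these ratios, keeping data close to the compute (on-chip SRAM, stacked memory, shorter wires) saves far more energy than making the arithmetic units themselves more efficient.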