Breaking the Memory-Wall for AI: In-Memory Compute, HBMs or Both?

Presented By: Kailash Prasad
PhD Electrical Engineering
nanoDC Lab
9th Dec 2019

Outline
● Motivation
● HBM - High Bandwidth Memory
● IMC - In Memory Computing
● In-memory compute, HBMs or both?
● Conclusion

Von Neumann Architecture
Source: https://en.wikipedia.org/wiki/Von_Neumann_architecture#/media/File:Von_Neumann_Architecture.svg

Moore’s Law
Source: https://www.researchgate.net/figure/Evolution-of-electronic-CMOS-characteristics-over-time-41-Transistor-counts-orange_fig2_325420752

Power Wall
Source: https://medium.com/artis-ventures/omicia-moores-law-the-coming-genomics-revolution-2cd31c8e5cd0

Hwang’s Law
Hwang’s Law: Chip capacity will double every year
Source: K. C. Smith, A. Wang and L. C. Fujino, "Through the Looking Glass: Trend Tracking for ISSCC 2012," in IEEE Solid-State Circuits Magazine

Memory Wall
Source: http://on-demand.gputechconf.com/gtc/2018/presentation/s8949

Why Is This Important?
Source: https://hackernoon.com/the-most-promising-internet-of-things-trends-for-2018-10a852ccd189

Deep Neural Network
Source: http://on-demand.gputechconf.com/gtc/2018/presentation/s8949

Data Movement vs. Computation Energy
● Data movement is a major system energy bottleneck
  ○ Comprises 41% of mobile system energy during web browsing [2]
  ○ Costs ~115 times as much energy as an ADD operation [1, 2]
[1]: Reducing Data Movement Energy via Online Data Clustering and Encoding (MICRO’16)
[2]: Quantifying the Energy Cost of Data Movement for Emerging Smart Phone Workloads on Mobile Platforms (IISWC’14)
Source: Mutlu, O. (2019). Processing Data Where It Makes Sense in Modern Computing Systems: Enabling In-Memory Computation.

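
To make the ~115x figure on the slide above concrete, here is a minimal back-of-the-envelope sketch. It assumes a hypothetical workload in which every ADD needs one operand moved from memory, and it measures energy in units of one ADD; the function name `movement_share` and the workload size are illustrative assumptions, not taken from the cited papers.

```python
# Energy is expressed in units of "one ADD operation".
ADD_ENERGY = 1.0
# Slide figure: moving data costs ~115 times as much as an ADD.
MOVE_ENERGY = 115.0


def movement_share(num_adds: int, num_moves: int) -> float:
    """Fraction of total energy spent on data movement (hypothetical model)."""
    compute = num_adds * ADD_ENERGY
    movement = num_moves * MOVE_ENERGY
    return movement / (compute + movement)


# Assumed workload: one operand fetched from memory per ADD.
share = movement_share(num_adds=1_000_000, num_moves=1_000_000)
print(f"Share of energy spent moving data: {share:.1%}")
```

Under this assumed one-move-per-ADD ratio, the share depends only on the 115:1 cost ratio, which is why reducing the number of moves matters far more than speeding up the arithmetic.
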
Problem - Von Neumann Bottleneck
Source: https://en.wikipedia.org/wiki/Von_Neumann_architecture#/media/File:Von_Neumann_Architecture.svg

Solutions
● Increase Memory Bandwidth: High Bandwidth Memory
● Reduce Data Movement: In Memory Computing
Source: https://www.electronicdesign.com/memory/qa-taking-closer-look-amd-s-high-bandwidth-memory

High Bandwidth Memory
Source: https://www.amd.com/system/files/49010-high-bandwidth-memory-hbm-1260x709.jpg

Device and Working
● HBM is a new type of memory chip
  ○ Low power consumption
  ○ Ultra-wide communication lanes
● Multiple DRAM dies stacked in a single package, connected by TSVs and microbumps
● Connected to the processor unit directly via an interposer layer
● Developed by AMD and SK Hynix
● Adopted as a JEDEC industry standard in October 2013
Image Source: https://www.amd.com/system/files/49010-high-bandwidth-memory-hbm-1260x709.jpg

HBM Architecture
● Example consisting of four DRAM core dies and one base logic die
● A photo of the base die
Source: H. Jun, S. Nam, H. Jin, J. Lee, Y. J. Park and J. J. Lee, "High-Bandwidth Memory (HBM) Test Challenges and Solutions," in IEEE Design & Test

Benefits over State-of-the-Art Memories
● Very High Bandwidth
● Lower Effective Clock Speed
● Smaller Package
● Shorter Interconnect Wires
● Lower Power Consumption

Area Savings
Source: https://www.amd.com/system/files/49010-high-bandwidth-memory-hbm-1260x709.jpg

Bandwidth Improvement in HBM
Source: https://www.electronicdesign.com/industrial-automation/high-bandwidth-memory-great-awakening-ai

Comparison - GDDR5 vs. HBM
Source: https://www.amd.com/system/files/49010-high-bandwidth-memory-hbm-1260x709.jpg

HBMs for AI
● High Bandwidth
● High Capacity
● Large Data Transfer Rate
Source: A. Jain, A. Phanishayee, J. Mars, L. Tang and G. Pekhimenko, "Gist: Efficient Data Encoding for Deep Neural Network Training," 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), Los Angeles, CA, 2018, pp. 776-789.

Challenges
[Figure labels: Retention Time, Sense Margin, Peripheral Transistor, Performance, Chip Cost Issue, Evolutionary Approaches, Limited Capacity]
Source: S. Lee, "Technology scaling challenges and opportunities of memory devices," 2016 IEEE International Electron Devices Meeting (IEDM), San Francisco, CA, 2016, pp. 1.1.1-1.1.8.

In Memory Computing

We Need A Paradigm Shift To …
● Enable computation with minimal data movement
● Compute where it makes sense (where data resides)
● Make computing architectures more data-centric

Two Examples
● Data Copy
● Bitwise Computation

RowClone: Fast and energy-efficient in-DRAM bulk data copy and initialization
Seshadri et al., “RowClone: Fast and Efficient In-DRAM Copy and Initialization of Bulk Data,” MICRO 2013.

Today’s Systems: Bulk Data Copy
[Figure labels: Memory, MC, L3, L2, L1, CPU]
1) High latency
2) High bandwidth utilization
3) Cache pollution
4) Unwanted data movement
1046ns, 3.6uJ (for 4KB page copy via DMA)
Seshadri et al., “RowClone: Fast and Efficient In-DRAM Copy and Initialization of Bulk Data,” MICRO 2013.
Source: Mutlu, O. (2019). Processing Data Where It Makes Sense in Modern Computing Systems: Enabling In-Memory Computation.

Future Systems: In-Memory Copy
[Figure labels: Memory, MC, L3, L2, L1, CPU]
1) Low latency
2) Low bandwidth utilization
3) No cache pollution
4) No unwanted data movement
1046ns, 3.6uJ → 90ns, 0.04uJ
Source: Mutlu, O. (2019). Processing Data Where It Makes Sense in Modern Computing Systems: Enabling In-Memory Computation.

RowClone: In-DRAM Row Copy
Idea: Two consecutive Activates
Negligible HW cost
Step 1: Activate row A (transfer row)
Step 2: Activate row B (transfer row)
[Figure labels: DRAM subarray, 4 Kbytes, Row Buffer (4 Kbytes), 8 bits, Data Bus]
Source: Mutlu, O. (2019). Processing Data Where It Makes Sense in Modern Computing Systems: Enabling In-Memory Computation.

RowClone: Latency and Energy Savings
Source: Seshadri et al., “RowClone: Fast and Efficient In-DRAM Copy and Initialization of Bulk Data,” MICRO 2013.

Ambit: In-Memory Accelerator for Bulk Bitwise Operations using Commodity DRAM Technology
Seshadri+, “Ambit: In-Memory Accelerator for Bulk Bitwise Operations using Commodity DRAM Technology,” MICRO 2017.

In-DRAM AND/OR: Triple Row Activation
[Figure labels: rows A, B, C; bitline voltages ½VDD, ½VDD + δ, VDD, 0]
Final State: AB + BC + AC = C(A + B) + ~C(AB)
Source: Seshadri+, “Fast Bulk Bitwise AND and OR in DRAM”, IEEE CAL 2015.

Ambit vs. DDR3: Performance and Energy
32X (performance), 35X (energy)
Source: Seshadri+, “Ambit: In-Memory Accelerator for Bulk Bitwise Operations using Commodity DRAM Technology,” MICRO 2017.

Source: https://dfan.engineering.asu.edu/in-memory-computing/

In Memory Computing for AI
Source: A. Sebastian et al.: “Temporal correlation detection using computational phase-change memory”, Nature Communications 8, 1115, 2017.

In Memory Computing for AI
Slight Accuracy Drop
Source: S.R. Nandakumar et al.: “Mixed-precision architecture based on computational memory for training deep neural networks”, in Proc. of the IEEE International Symposium on Circuits and Systems (ISCAS), 1-5, 2018.

Challenges - Software and Architecture
● Functionality of, and applications & software for, In Memory Computing
● Ease of programming (interfaces and compiler/HW support)
● System support: coherence & virtual memory
● Runtime and compilation systems for adaptive scheduling, data mapping, access/sharing control
● Infrastructures to assess benefits and feasibility
Source: Onur Mutlu Slides

Challenges - Circuits and Devices
● Process Variations
  ○ Not all memory cells are created equal
  ○ Some cells have higher/lower capacitance
    ■ Affects the reliability of operation
● Processing Using Memory
  ○ Exploits analog operation of underlying technology
  ○ Process variation can introduce failures
● Challenges
  ○ How to design architecture to reduce impact of variations
  ○ How to test for failures
● Intelligent Memory Controllers
  ○ ECC Circuits
Source: Seshadri’s Slides

HBM or In Memory Computing?

Both.
Better Way: In Memory Compute in HBM

Why?
Source: Grzela, Tomasz. (2015). Comparative STM-based study of thermal evolution of Co and Ni germanide nanostructures on Ge(001).

Why?
● Interconnects
  ○ Large Power consumption
    ■ Capacitance
    ■ Metal lines
● Large Interconnect Delay compared to Gate delay
● Logic Layer beneath DRAM dies
  ○ Support In Memory Computing

Conclusion
In Memory Compute in HBM.

Some Recent Works on IMC in HBM

Conclusion
● Von Neumann Bottleneck
● Solutions
  ○ High Bandwidth Memory
  ○ In Memory Computing
● In Memory Computing in HBMs
  ○ Increase Bandwidth
  ○ Reduce Data Movement

References
1. Moore, Gordon E. "Cramming more components onto integrated circuits." (1965): 114-117.
2. Chu, Yaohan, ed. High-level language computer architecture. Academic Press, 2014.
3. Seshadri, Vivek, et al. "Ambit: In-memory accelerator for bulk bitwise operations using commodity DRAM technology." Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture. ACM, 2017.
4. Seshadri, Vivek, et al. "RowClone: fast and energy-efficient in-DRAM bulk data copy and initialization." Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture. ACM, 2013.
5. Seshadri, Vivek, et al. "Fast bulk bitwise AND and OR in DRAM." IEEE Computer Architecture Letters 14.2 (2015): 127-131.

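The triple row activation from the In-DRAM AND/OR slide can be modeled in software as a bitwise majority of three rows, AB + BC + AC. The sketch below is only a functional model under stated assumptions: rows are represented as Python integers, `width` is an illustrative row width, and the function names are hypothetical. It does not model the analog charge sharing or any timing. Setting the control row C to all zeros yields AND, and setting it to all ones yields OR, which follows from the identity C(A + B) + ~C(AB) on the slide.

```python
def triple_row_activation(a: int, b: int, c: int, width: int = 8) -> int:
    """Functional model: each bitline settles to the majority of rows A, B, C."""
    mask = (1 << width) - 1
    return ((a & b) | (b & c) | (a & c)) & mask


def in_dram_and(a: int, b: int, width: int = 8) -> int:
    """AND: control row C is initialized to all zeros."""
    return triple_row_activation(a, b, 0, width)


def in_dram_or(a: int, b: int, width: int = 8) -> int:
    """OR: control row C is initialized to all ones."""
    all_ones = (1 << width) - 1
    return triple_row_activation(a, b, all_ones, width)


def identity_holds(width: int = 1) -> bool:
    """Check AB + BC + AC == C(A + B) + ~C(AB) for every input combination."""
    mask = (1 << width) - 1
    for a in range(mask + 1):
        for b in range(mask + 1):
            for c in range(mask + 1):
                rhs = ((c & (a | b)) | (~c & (a & b))) & mask
                if triple_row_activation(a, b, c, width) != rhs:
                    return False
    return True


row_a, row_b = 0b1100, 0b1010
print(bin(in_dram_and(row_a, row_b)))  # AND of the two example rows
print(bin(in_dram_or(row_a, row_b)))   # OR of the two example rows
print(identity_holds(width=3))
```

In this model a single "activation" processes every bit of the row at once, which is the source of the bulk parallelism the slides attribute to Ambit; the row width of a real DRAM subarray is far larger than the 8 bits assumed here.
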