
Processing Near or In Memory For Deep Learning

Lide Duan (段立德)
Computing Technology Lab, Alibaba DAMO Academy

Alibaba Businesses

[Figure: logos of Alibaba business units, including payment & financial services]

Nowadays, Alibaba is not just an e-commerce company, but also a high-tech / cloud computing / Internet services / local services company.

Alibaba DAMO Academy

• "4+X" research areas (14 labs) in 8 cities: Beijing, Hangzhou, Shenzhen, New York, Bellevue, Tel Aviv-Yafo, Sunnyvale, Singapore.
• 600+ researchers onboard, and still quickly expanding.

Computing Technology Lab

Lab Research Focuses:

• Future research exploration
• Domain-specific accelerator design (e.g., AI chips)

Lab Organization:

Computer Architecture Development

• Computer architecture development is driven by both applications and semiconductor technology.

Source: White Paper on AI Chip Technologies, 2018.

Memory Wall

• Memory is much slower than the processor, and consumes much more energy.
• Today's "memory wall" is getting worse:
  – Emerging applications are highly data-intensive.
  – Computer architectures are changing from computing-centric to data-centric.

[Chart: processor vs. DRAM performance over time. Processor performance improved at roughly +25%/year, then +52%/year, then +20%/year, while DRAM improved at only about +7%/year; the gap is even larger due to multicore.]

Source: EIE: Efficient Inference Engine on Compressed Deep Neural Network, ISCA 2016.
Source: Computer Architecture: A Quantitative Approach, 6th Edition.

Bridging The Gap

• Processing near, in, or with memory:

Source: Bringing Computation to Memory, Dimin Niu, Feb. 2019.

• DRISA (MICRO 2017)
• UC Berkeley IRAM
• On-chip eDRAM
• UPMEM
• HBM / HMC
• In-situ NVM crossbars

PIM Design Factors

• 2D vs. 3D
• DRAM vs. NVM

• 2D + DRAM: DaDianNao (eDRAM), UC Berkeley IRAM, DRISA, UPMEM
• 2D + NVM: ISAAC, PRIME, PipeLayer, CELIA
• 3D + DRAM: Neurocube, TETRIS
• 3D + NVM: 3D PI-NVM

PIM Design Factors (2D + DRAM)


DaDianNao

• DaDianNao: A Machine-Learning Supercomputer (MICRO 2014)
  – A 64-chip system connected via a 2D mesh.
  – Each chip: 16 tiles connected via an H-tree.
  – Each tile: 4 eDRAM banks (storing weights) + an NFU.
  – Each NFU: pipelined processing of NNs.

[Figure: a DaDianNao chip, a tile, and an NFU]

UC Berkeley IRAM

• Intelligent RAM (IRAM): Chips that remember and compute (ISSCC 1997).

Challenges were:
• Increased cost-per-bit
• Need for a new programming model
• Lack of killer applications

UPMEM

• UPMEM Processing In-Memory (PIM) Technology Paper (March 2019)
  – One DPU per 64MB of DRAM: a programmable 32-bit RISC core.
  – Implemented in a DRAM process; same DRAM architecture.
  – Provides a comprehensive SDK; programming model similar to GPGPU.
  – 20x speedup and 10x energy efficiency for big-data workloads.
  – 30%-50% area overhead; low manufacturing cost.

PIM Design Factors (3D + DRAM)


HMC vs. HBM

• Both are 3D-stacked DRAM; similar in principle, but incompatible.
• Hybrid Memory Cube (HMC):
  – Developed by Micron, backed by the HMC Consortium.
  – Widely adopted in academia.
  – "Far memory"; packet interface with SerDes lanes.
  – Scalable by networking HMC devices; used in HPC.
• High Bandwidth Memory (HBM):
  – Widely adopted in industry: AMD / SK Hynix / Samsung / Nvidia / Micron.
  – "Near memory"; packaged with the processor (GPU).
  – Difficult to scale.

Neurocube

• Neurocube: A Programmable Digital Neuromorphic Architecture with High-Density 3D Memory (ISCA 2016).
  – 16 PE-router pairs connected in a 2D mesh.
  – Each PE: 16 MAC units, memory for weights, and a cache.
  – Fully data-driven NN execution, with layer-to-vault mapping.

PIM Design Factors (2D + NVM)


Emerging NVM Technologies

• Common to all NVMs: non-volatile; low idle power; no refresh; high write overheads; etc.
• Phase-Change Memory (PCM):
  – Intel / Micron 3D XPoint.
  – Intel Optane DC Persistent Memory vs. DC SSD.
• STT-MRAM:
  – EverSpin ships standalone 256Mb parts in 40nm.
  – Difficult to scale beyond 28nm.
• ReRAM / RRAM:
  – Arbitrarily programmable cell resistance (the "memristor").
  – First demonstrated by HP Labs, now produced by many companies (still at an early stage).

Processing NN in NVM Crossbar Arrays

• Using Kirchhoff's law to perform matrix multiplication in NVM crossbar arrays:
  – Input data are applied as voltages on the word lines.
  – Synaptic weights are programmed into cell conductances (1 / resistance).
  – Results are read as currents on the bit lines.
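To make this concrete, below is a minimal numerical sketch of the crossbar computation; the array dimensions, conductance range, and input voltage range are illustrative assumptions, and analog non-idealities (wire resistance, ADC quantization) are ignored.

```python
# Illustrative sketch of analog matrix-vector multiplication in a crossbar.
# All sizes and value ranges below are assumptions, not figures from any paper.
import numpy as np

rows, cols = 128, 128                       # word lines x bit lines (assumed)
g_min, g_max = 1e-6, 1e-4                   # assumed min/max cell conductance (siemens)

# Map non-negative weights onto cell conductances (1 / resistance); real designs
# handle signed weights with differential cell pairs or reference columns.
weights = np.random.rand(rows, cols)
conductance = g_min + weights * (g_max - g_min)

# Input activations are applied as word-line voltages.
voltages = np.random.rand(rows) * 0.2       # assumed 0-0.2 V input range

# Kirchhoff's current law: each bit line sums I = V * G over its cells, so the
# vector of bit-line currents equals the matrix-vector product G^T @ V.
bitline_currents = conductance.T @ voltages

# Digital reference for comparison: the same product computed conventionally.
reference = voltages @ conductance
assert np.allclose(bitline_currents, reference)
print(bitline_currents[:4])                 # one analog partial sum per bit line
```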

ISAAC

• ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars (ISCA 2016)
  – A full-fledged accelerator based on RRAM.
  – Hierarchical organization of crossbar arrays.
  – DACs / ADCs, and various logic.
  – Pipelined architecture:
    • Inter-layer pipeline: all layers of a NN are mapped to different tiles at the same time.
    • Intra-layer pipeline: 27 cycles.

PIM Design Factors (3D + NVM)


On-Going Work: 3D PI-NVM

• A 3D-stacked processing-in-NVM (3D PI-NVM) framework
  – NVM + PIM + 3D.
  – Utilizes HMC/HBM's heterogeneous integration of different dies.
  – Logic die: sits between the computation and storage dies; no PEs; a NoC of routers.
  – Computation dies: vaults of CUs.
  – Storage dies: partitions of NVM banks.
  – CU: tiles of crossbar arrays; DAC/ADC, buffers, logic, etc.
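A purely illustrative data-structure sketch of this hierarchy follows; every count below (routers, vaults, CUs per vault, tiles per CU, crossbar size) is a placeholder assumption rather than a parameter of the actual design.

```python
# Hypothetical parameters only: a toy model of the 3D PI-NVM stack hierarchy
# (logic die with NoC routers, computation dies with vaults of CUs, storage
# dies with NVM bank partitions, and CUs made of crossbar tiles).
from dataclasses import dataclass, field
from typing import List

@dataclass
class CrossbarTile:
    rows: int = 128                 # assumed crossbar dimensions
    cols: int = 128                 # DACs/ADCs, buffers, and logic omitted

@dataclass
class ComputeUnit:
    tiles: List[CrossbarTile] = field(
        default_factory=lambda: [CrossbarTile() for _ in range(8)])   # assumed 8 tiles per CU

@dataclass
class Vault:
    cus: List[ComputeUnit] = field(
        default_factory=lambda: [ComputeUnit() for _ in range(4)])    # assumed 4 CUs per vault

@dataclass
class Stack:
    router_count: int = 16          # logic die: routers only, no PEs
    vaults: List[Vault] = field(
        default_factory=lambda: [Vault() for _ in range(16)])         # computation dies
    storage_partitions: int = 16    # storage dies: NVM bank partitions

stack = Stack()
print(len(stack.vaults), len(stack.vaults[0].cus), len(stack.vaults[0].cus[0].tiles))
```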

3D-Aware Model Mapping

3D-Aware Data Flow Management

• Directs data flow both vertically (within vaults) and horizontally (in the logic die).

Benefits

• Generic benefits of 3D stacking.
• Significantly improved NN processing throughput:
  – Simultaneously processing multiple layers / models.
  – The architecture is vertically scalable.
• Using NVM for both computation and data storage:
  – Storing-in-NVM + processing-in-NVM.
• Thermal-friendly:
  – Using crossbar arrays for computation avoids PE hot spots.
  – Computation dies are on top.
  – Computation is decoupled from routing in the logic die.

Thanks!

Backup Slides

DRISA

• DRISA: A DRAM-based Reconfigurable In-Situ Accelerator (MICRO 2017).
  – Bit-wise NOR between bit lines (functionally complete).
  – Modified sense amplifiers + added shifters.
  – High parallelism: multiple banks/subarrays active at once.

TETRIS

• TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory (ASPLOS 2017)
  – Logic die: large PE arrays (14x14 in each vault) + small SRAM buffers.
  – In-DRAM accumulation of output feature maps.
  – Software techniques: dataflow mapping and NN computation partitioning.

PRIME

• PRIME: A Novel Processing-in-memory Architecture for Neural Network Computation in ReRAM-based Main Memory (ISCA 2016)
  – Allows ReRAM arrays to dynamically reconfigure between memory (memory subarrays) and accelerator (full-function subarrays).
  – Modifies the memory's peripheral circuits to provide DAC, ADC, activation, pooling, etc.

CELIA

• CELIA: A Device and Architecture Co-Design Framework for STT-MRAM-Based Deep Learning Acceleration (ICS 2018)
  – Challenge of using STT-MRAM for processing NNs: low ON/OFF ratio => fewer distinguishable cell resistances => low weight precision => low model accuracy.
  – Device-level creation of multiple cell resistances.
  – Non-uniform weight quantization.
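The non-uniform quantization idea can be sketched as follows; the level-selection rule (quantiles of the weight distribution) and the level count are assumed stand-ins for illustration, not CELIA's published method.

```python
# Illustrative sketch of non-uniform weight quantization onto a small set of
# representable levels, as motivated by the low ON/OFF ratio above. The
# quantile-based level placement is an assumption for this example.
import numpy as np

def nonuniform_quantize(weights, n_levels=4):
    """Snap weights to n_levels values placed at quantiles of their distribution."""
    # Put levels where the weights concentrate (non-uniform spacing),
    # rather than on an evenly spaced grid.
    qs = (np.arange(n_levels) + 0.5) / n_levels
    levels = np.quantile(weights, qs)
    # Map each weight to its nearest level.
    idx = np.abs(weights[..., None] - levels).argmin(axis=-1)
    return levels[idx], levels

w = np.random.randn(256, 256).astype(np.float32)
w_q, levels = nonuniform_quantize(w, n_levels=4)    # 4 levels ~ a few distinct cell resistances
print(levels)
print("quantization MSE:", float(np.mean((w - w_q) ** 2)))
```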

Defining Weight Matrix of a Layer

• Windows of data, one from each input feature map, are flattened and concatenated to form an input feature vector.
• The corresponding convolutional kernels (one per input feature map) are likewise flattened and concatenated to form a weight vector.
• The dot product of these two vectors gives one pixel value in an output feature map.
• Putting together the weight vectors for the same pixel position across all output feature maps forms the weight matrix (WM) of the layer (sketched in code after the next slide).

Intra-Layer Crossbar Array Allocation

• Within the crossbar arrays allocated to a model layer, replicate the layer's weight matrix as many times as possible.
  – Uses crossbar array allocation with input reuse (ICS 2018, IPCCC 2018).
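A minimal sketch of the weight-matrix construction from the "Defining Weight Matrix of a Layer" slide above; the channel counts and kernel size are arbitrary illustrative values.

```python
# Minimal sketch of building a layer's weight matrix (WM) from its kernels.
import numpy as np

c_in, k, c_out = 3, 3, 8                 # input channels, kernel size, output channels (assumed)

# One kernel stack per output feature map; flattening each stack gives one
# weight vector, and stacking the vectors column-wise gives the layer's WM.
kernels = np.random.randn(c_out, c_in, k, k)
WM = kernels.reshape(c_out, -1).T        # shape (c_in * k * k, c_out)

# One window of input data (a k x k patch from each input feature map),
# flattened the same way, forms the input feature vector.
window = np.random.randn(c_in, k, k)
x = window.reshape(-1)

# A single vector-matrix product then yields the pixel value at this window
# position in every output feature map at once.
out_pixels = x @ WM                      # shape (c_out,)
print(WM.shape, out_pixels.shape)
```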

Inter-Layer Crossbar Array Allocation

• Achieves fully balanced pipelining.
• The input window slides in a blocking manner.

Simultaneous Processing of Multiple Models

• How should compute units (CUs) be partitioned between two models running at the same time?
• Naïve approach: partition the CUs into equal halves.
  – May result in different running times for the two models.
  – Hence, system performance is always bounded by the slower model.
• Goal: partition the CUs such that the two models have the same running time.
• This is analogous to the number partitioning problem:
  – Given a multiset of positive integers, can we partition them into two subsets with equal sums?

Simultaneous Processing of Multiple Models

• The number partitioning problem itself is NP-complete, but there exist simple heuristics that approximately solve it:
  – Iterate through the numbers in descending order, and assign each number to whichever subset has the smaller current sum.
• In our design, the relative performance improvement from having one more CU plays the role of the "integer" (a sketch of the heuristic follows).
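A small sketch of that greedy heuristic; the per-CU "relative performance improvement" values are made-up illustrative numbers.

```python
# Greedy number-partitioning heuristic: walk the values in descending order
# and give each one to the subset with the smaller running sum.
def greedy_partition(values):
    subset_a, subset_b = [], []
    sum_a = sum_b = 0.0
    for v in sorted(values, reverse=True):
        if sum_a <= sum_b:
            subset_a.append(v)
            sum_a += v
        else:
            subset_b.append(v)
            sum_b += v
    return (subset_a, sum_a), (subset_b, sum_b)

# Relative performance improvement contributed by each additional CU,
# to be split between the two co-running models (illustrative values).
gains = [0.19, 0.17, 0.12, 0.11, 0.08, 0.07, 0.05, 0.04, 0.03]
(a, sum_a), (b, sum_b) = greedy_partition(gains)
print(sum_a, sum_b)   # the two subset sums come out nearly balanced
```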

Evaluation Results

Results of Running Two Models
