
Processing Near or In Memory For Deep Learning

Lide Duan (段立德)
Computing Technology Lab, Alibaba DAMO Academy

Alibaba Businesses

[Figure: logos of Alibaba business units, including payment & financial services]

Nowadays, Alibaba is not just an e-commerce company, but also a high-tech / cloud computing / Internet services / local services company.

Alibaba DAMO Academy

• "4+X" research areas (14 labs) in 8 cities: Beijing, Hangzhou, Shenzhen, New York, Bellevue, Tel Aviv-Yafo, Sunnyvale, Singapore.
• 600+ researchers onboard, and still quickly expanding.

Computing Technology Lab

Lab Research Focuses:

• Future research exploration
• Domain-specific accelerator design (e.g., AI chips)

Lab Organization:

Computer Architecture Development

• Computer architecture development is driven by both applications and semiconductor technology.

Source: White Paper on AI Chip Technologies, 2018.

Memory Wall

• Memory is much slower than the processor, and consumes much more energy.
• Today's "memory wall" is getting worse:
  – Emerging applications are highly data-intensive.
  – Computer architectures are changing from computing-centric to data-centric.

[Chart: processor vs. DRAM performance over time. Processor performance improved at roughly +25%/year, then +52%/year, then +20%/year, while DRAM improved at only about +7%/year; the gap is even larger due to multicore.]

Source: EIE: Efficient Inference Engine on Compressed Deep Neural Network, ISCA 2016.
Source: Computer Architecture: A Quantitative Approach, 6th Edition.

Bridging The Gap

• Processing near, in, or with memory:

Source: Bringing Computation to Memory, Dimin Niu, Feb. 2019.

• DRISA (MICRO 2017)
• UC Berkeley IRAM
• On-chip eDRAM
• UPMEM
• HBM / HMC
• In-situ NVM crossbars

PIM Design Factors

• 2D vs. 3D
• DRAM vs. NVM

• 2D + DRAM: DaDianNao (eDRAM), UC Berkeley IRAM, DRISA, UPMEM
• 2D + NVM: ISAAC, PRIME, PipeLayer, CELIA
• 3D + DRAM: Neurocube, TETRIS
• 3D + NVM: 3D PI-NVM

PIM Design Factors (2D + DRAM)


DaDianNao

• DaDianNao: A Machine-Learning Supercomputer (MICRO 2014)
  – A 64-chip system connected via a 2D mesh.
  – Each chip: 16 tiles connected via an H-tree.
  – Each tile: 4 eDRAM banks (storing weights) + an NFU.
  – Each NFU: pipelined processing of NNs.

[Figure: a DaDianNao chip, a tile, and an NFU]

UC Berkeley IRAM

• Intelligent RAM (IRAM): Chips that remember and compute (ISSCC 1997).

Challenges were:
• Increased cost-per-bit
• Need for a new programming model
• Lack of killer applications

UPMEM

• UPMEM Processing In-Memory (PIM) Technology Paper (March 2019)
  – One DPU per 64MB of DRAM: a programmable 32-bit RISC core.
  – Implemented in a DRAM process; same DRAM architecture.
  – Provides a comprehensive SDK; programming model similar to GPGPU.
  – 20x speedup and 10x energy efficiency for big-data workloads.
  – 30%-50% area overhead; low manufacturing cost.

PIM Design Factors (3D + DRAM)


HMC vs. HBM

• Both are 3D-stacked DRAM; similar in principle, but incompatible.
• Hybrid Memory Cube (HMC):
  – Developed by Micron, backed by the HMC Consortium.
  – Widely adopted in academia.
  – "Far memory"; packet interface with SerDes lanes.
  – Scalable by networking HMC devices; used in HPC.
• High Bandwidth Memory (HBM):
  – Widely adopted in industry: AMD / SK Hynix / Samsung / Nvidia / Micron.
  – "Near memory"; packaged with the processor (GPU).
  – Difficult to scale.

Neurocube

• Neurocube: A Programmable Digital Neuromorphic Architecture with High-Density 3D Memory (ISCA 2016).
  – 16 PE-router pairs connected in a 2D mesh.
  – Each PE: 16 MAC units, memory for weights, and a cache.
  – Fully data-driven NN execution, with layer-to-vault mapping.

PIM Design Factors (2D + NVM)


Emerging NVM Technologies

• Common to all NVMs: non-volatile; low idle power; no refresh; high write overheads; etc.
• Phase-Change Memory (PCM):
  – Intel / Micron 3D XPoint.
  – Intel Optane DC Persistent Memory vs. DC SSD.
• STT-MRAM:
  – EverSpin ships standalone 256Mb parts in 40nm.
  – Difficult to scale beyond 28nm.
• ReRAM / RRAM:
  – Arbitrarily programmable cell resistance (the "memristor").
  – First demonstrated by HP Labs, now produced by many companies (still at an early stage).

Processing NN in NVM Crossbar Arrays

• Using Kirchhoff's law to perform matrix multiplication in NVM crossbar arrays:
  – Input data are applied as voltages on the word lines.
  – Synaptic weights are programmed into cell conductances (1 / resistance).
  – Results are read as currents on the bit lines.
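To make this concrete, below is a minimal numerical sketch of the crossbar computation; the array dimensions, conductance range, and input voltage range are illustrative assumptions, and analog non-idealities (wire resistance, ADC quantization) are ignored.

```python
# Illustrative sketch of analog matrix-vector multiplication in a crossbar.
# All sizes and value ranges below are assumptions, not figures from any paper.
import numpy as np

rows, cols = 128, 128                       # word lines x bit lines (assumed)
g_min, g_max = 1e-6, 1e-4                   # assumed min/max cell conductance (siemens)

# Map non-negative weights onto cell conductances (1 / resistance); real designs
# handle signed weights with differential cell pairs or reference columns.
weights = np.random.rand(rows, cols)
conductance = g_min + weights * (g_max - g_min)

# Input activations are applied as word-line voltages.
voltages = np.random.rand(rows) * 0.2       # assumed 0-0.2 V input range

# Kirchhoff's current law: each bit line sums I = V * G over its cells, so the
# vector of bit-line currents equals the matrix-vector product G^T @ V.
bitline_currents = conductance.T @ voltages

# Digital reference for comparison: the same product computed conventionally.
reference = voltages @ conductance
assert np.allclose(bitline_currents, reference)
print(bitline_currents[:4])                 # one analog partial sum per bit line
```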

ISAAC

• ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars (ISCA 2016)
  – A full-fledged accelerator based on RRAM.
  – Hierarchical organization of crossbar arrays.
  – DACs / ADCs, and various logic.
  – Pipelined architecture:
    • Inter-layer pipeline: all layers of a NN are mapped to different tiles at the same time.
    • Intra-layer pipeline: 27 cycles.

PIM Design Factors (3D + NVM)


On-Going Work: 3D PI-NVM

• A 3D-stacked processing-in-NVM (3D PI-NVM) framework
  – NVM + PIM + 3D.
  – Utilizes HMC/HBM's heterogeneous integration of different dies.
  – Logic die: sits between the computation and storage dies; no PEs; a NoC of routers.
  – Computation dies: vaults of CUs.
  – Storage dies: partitions of NVM banks.
  – CU: tiles of crossbar arrays; DAC/ADC, buffers, logic, etc.
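A purely illustrative data-structure sketch of this hierarchy follows; every count below (routers, vaults, CUs per vault, tiles per CU, crossbar size) is a placeholder assumption rather than a parameter of the actual design.

```python
# Hypothetical parameters only: a toy model of the 3D PI-NVM stack hierarchy
# (logic die with NoC routers, computation dies with vaults of CUs, storage
# dies with NVM bank partitions, and CUs made of crossbar tiles).
from dataclasses import dataclass, field
from typing import List

@dataclass
class CrossbarTile:
    rows: int = 128                 # assumed crossbar dimensions
    cols: int = 128                 # DACs/ADCs, buffers, and logic omitted

@dataclass
class ComputeUnit:
    tiles: List[CrossbarTile] = field(
        default_factory=lambda: [CrossbarTile() for _ in range(8)])   # assumed 8 tiles per CU

@dataclass
class Vault:
    cus: List[ComputeUnit] = field(
        default_factory=lambda: [ComputeUnit() for _ in range(4)])    # assumed 4 CUs per vault

@dataclass
class Stack:
    router_count: int = 16          # logic die: routers only, no PEs
    vaults: List[Vault] = field(
        default_factory=lambda: [Vault() for _ in range(16)])         # computation dies
    storage_partitions: int = 16    # storage dies: NVM bank partitions

stack = Stack()
print(len(stack.vaults), len(stack.vaults[0].cus), len(stack.vaults[0].cus[0].tiles))
```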

3D-Aware Model Mapping

3D-Aware Data Flow Management

• Directs data flow both vertically (within vaults) and horizontally (in the logic die).

Benefits

• Generic benefits of 3D stacking.
• Significantly improved NN processing throughput:
  – Simultaneously processing multiple layers / models.
  – The architecture is vertically scalable.
• Using NVM for both computation and data storage:
  – Storing-in-NVM + processing-in-NVM.
• Thermal-friendly:
  – Using crossbar arrays for computation avoids PE hot spots.
  – Computation dies are on top.
  – Computation is decoupled from routing in the logic die.

Thanks!

Backup Slides

DRISA

• DRISA: A DRAM-based Reconfigurable In-Situ Accelerator (MICRO 2017).
  – Bit-wise NOR between bit lines (functionally complete).
  – Modified sense amplifiers + added shifters.
  – High parallelism: multiple banks/subarrays active at once.

TETRIS

• TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory (ASPLOS 2017)
  – Logic die: large PE arrays (14x14 in each vault) + small SRAM buffers.
  – In-DRAM accumulation of output feature maps.
  – Software techniques: dataflow mapping and NN computation partitioning.

PRIME

• PRIME: A Novel Processing-in-memory Architecture for Neural Network Computation in ReRAM-based Main Memory (ISCA 2016)
  – Allows ReRAM arrays to dynamically reconfigure between memory (memory subarrays) and accelerator (full-function subarrays).
  – Modifies the memory's peripheral circuits to provide DAC, ADC, activation, pooling, etc.

CELIA

• CELIA: A Device and Architecture Co-Design Framework for STT-MRAM-Based Deep Learning Acceleration (ICS 2018)
  – Challenge of using STT-MRAM for processing NNs: low ON/OFF ratio => fewer distinguishable cell resistances => low weight precision => low model accuracy.
  – Device-level creation of multiple cell resistances.
  – Non-uniform weight quantization.
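The non-uniform quantization idea can be sketched as follows; the level-selection rule (quantiles of the weight distribution) and the level count are assumed stand-ins for illustration, not CELIA's published method.

```python
# Illustrative sketch of non-uniform weight quantization onto a small set of
# representable levels, as motivated by the low ON/OFF ratio above. The
# quantile-based level placement is an assumption for this example.
import numpy as np

def nonuniform_quantize(weights, n_levels=4):
    """Snap weights to n_levels values placed at quantiles of their distribution."""
    # Put levels where the weights concentrate (non-uniform spacing),
    # rather than on an evenly spaced grid.
    qs = (np.arange(n_levels) + 0.5) / n_levels
    levels = np.quantile(weights, qs)
    # Map each weight to its nearest level.
    idx = np.abs(weights[..., None] - levels).argmin(axis=-1)
    return levels[idx], levels

w = np.random.randn(256, 256).astype(np.float32)
w_q, levels = nonuniform_quantize(w, n_levels=4)    # 4 levels ~ a few distinct cell resistances
print(levels)
print("quantization MSE:", float(np.mean((w - w_q) ** 2)))
```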

Defining Weight Matrix of a Layer

• Windows of data, one from each input feature map, are flattened and concatenated to form an input feature vector.
• The corresponding convolutional kernels (one per input feature map) are likewise flattened and concatenated to form a weight vector.
• The dot product of these two vectors gives one pixel value in an output feature map.
• Putting together the weight vectors for the same pixel position across all output feature maps forms the weight matrix (WM) of the layer (sketched in code after the next slide).

Intra-Layer Crossbar Array Allocation

• Within the crossbar arrays allocated to a model layer, replicate the layer's weight matrix as many times as possible.
  – Uses crossbar array allocation with input reuse (ICS 2018, IPCCC 2018).
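A minimal sketch of the weight-matrix construction from the "Defining Weight Matrix of a Layer" slide above; the channel counts and kernel size are arbitrary illustrative values.

```python
# Minimal sketch of building a layer's weight matrix (WM) from its kernels.
import numpy as np

c_in, k, c_out = 3, 3, 8                 # input channels, kernel size, output channels (assumed)

# One kernel stack per output feature map; flattening each stack gives one
# weight vector, and stacking the vectors column-wise gives the layer's WM.
kernels = np.random.randn(c_out, c_in, k, k)
WM = kernels.reshape(c_out, -1).T        # shape (c_in * k * k, c_out)

# One window of input data (a k x k patch from each input feature map),
# flattened the same way, forms the input feature vector.
window = np.random.randn(c_in, k, k)
x = window.reshape(-1)

# A single vector-matrix product then yields the pixel value at this window
# position in every output feature map at once.
out_pixels = x @ WM                      # shape (c_out,)
print(WM.shape, out_pixels.shape)
```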

Inter-Layer Crossbar Array Allocation

• Achieves fully balanced pipelining.
• The input window slides in a blocking manner.

Simultaneous Processing of Multiple Models

• How should compute units (CUs) be partitioned between two models running at the same time?
• Naïve approach: partition the CUs into equal halves.
  – May result in different running times for the two models.
  – Hence, system performance is always bounded by the slower model.
• Goal: partition the CUs such that the two models have the same running time.
• This is analogous to the number partitioning problem:
  – Given a multiset of positive integers, can we partition them into two subsets with equal sums?

Simultaneous Processing of Multiple Models

• The number partitioning problem itself is NP-complete, but there exist simple heuristics that approximately solve it:
  – Iterate through the numbers in descending order, and assign each number to whichever subset has the smaller current sum.
• In our design, the relative performance improvement from having one more CU plays the role of the "integer" (a sketch of the heuristic follows).
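A small sketch of that greedy heuristic; the per-CU "relative performance improvement" values are made-up illustrative numbers.

```python
# Greedy number-partitioning heuristic: walk the values in descending order
# and give each one to the subset with the smaller running sum.
def greedy_partition(values):
    subset_a, subset_b = [], []
    sum_a = sum_b = 0.0
    for v in sorted(values, reverse=True):
        if sum_a <= sum_b:
            subset_a.append(v)
            sum_a += v
        else:
            subset_b.append(v)
            sum_b += v
    return (subset_a, sum_a), (subset_b, sum_b)

# Relative performance improvement contributed by each additional CU,
# to be split between the two co-running models (illustrative values).
gains = [0.19, 0.17, 0.12, 0.11, 0.08, 0.07, 0.05, 0.04, 0.03]
(a, sum_a), (b, sum_b) = greedy_partition(gains)
print(sum_a, sum_b)   # the two subset sums come out nearly balanced
```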

Evaluation Results

Results of Running Two Models
