NEAR DATA PROCESSING: ARE WE THERE YET?

MAYA GOKHALE LAWRENCE LIVERMORE NATIONAL LABORATORY

This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract No. DE-AC52-07NA27344. LLNL-PRES-665279

OUTLINE

§ Why we need near memory
§ Niche application
§ Data reorganization engines
§ Computing near storage
§ FPGAs for computing near memory

WHY AREN'T WE THERE YET?

§ Processing near memory is attractive for high bandwidth, low latency, massive parallelism
§ 1990s-era designs closely integrated logic with the DRAM transistor process
• Expensive
• Slow
• Niche
§ Is 3D packaging the answer?
• Expensive
• Slow
• Niche
§ What are the technology and commercial incentives?

EXASCALE POWER PROBLEM: DATA MOVEMENT IS A PRIME CULPRIT

• Cost of a double-precision flop is negligible compared to the cost of reading and writing memory
• Cache-unfriendly applications waste memory bandwidth and energy (see the sketch below)
• Number of nodes needed to solve a problem is inflated due to poor memory bandwidth utilization
• A similar phenomenon occurs with disk
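As a rough illustration (a minimal sketch, not from the slides; the 64-byte line size is an assumption about a typical cache), a large-stride traversal fetches a full cache line per element while using only one word of it:

```c
/* Sketch: why cache-unfriendly access wastes bandwidth. Each miss moves a
 * full 64-byte line; a unit-stride traversal uses all 64 bytes per line,
 * while a large-stride traversal uses only 8 of them. */
#include <stddef.h>

double sum_sequential(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)          /* 8 useful bytes per 8 fetched */
        s += a[i];
    return s;
}

double sum_strided(const double *a, size_t n, size_t stride) {
    double s = 0.0;
    for (size_t i = 0; i < n; i += stride)  /* 8 useful bytes per 64 fetched */
        s += a[i];                          /* when stride >= 8 doubles      */
    return s;
}
```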

Sources: Horst Simon, LBNL; Greg Astfalk, HP

MICRON HYBRID MEMORY CUBE OFFERS OPPORTUNITY FOR PROCESSING NEAR MEMORY

Fast logic layer
• Customized processors
• Through-silicon vias (TSV) for high-bandwidth access to memory

Vault organization yields enormous bandwidth in the package

Best-case latency is higher than traditional DRAM due to the packetized abstract memory interface.

HMC LATENCY: LINK i TO VAULT i, 128B, 50/50
HMC LATENCY: LINK i TO VAULT j, 128B, 50/50

One link active, showing the effect of a request traversing the communication network.

HMC: WIDE VARIABILITY IN LATENCY WITH CONTENTION

Requests on links 1-4 for data on vaults 0-4 (link 0 local)

HMC SHOWS HIGH BANDWIDTH FOR REGULAR, STREAMING ACCESS PATTERNS, BUT …

[Charts: measured bandwidth for 128B, 64B, 32B, and 16B payloads]

Small payloads and irregular accesses cut bandwidth by a factor of 3.
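One driver of that drop is packet overhead on the abstracted link; a toy model (the 32-byte per-packet overhead is an assumption for illustration, not an HMC specification) shows the trend:

```c
/* Illustrative sketch: on a packetized link, each request/response carries
 * a fixed header+tail, so link efficiency falls as the payload shrinks. */
#include <stdio.h>

double link_efficiency(double payload_bytes, double overhead_bytes) {
    return payload_bytes / (payload_bytes + overhead_bytes);
}

int main(void) {
    const double overhead = 32.0;  /* assumed per-packet overhead */
    const double sizes[] = {128, 64, 32, 16};
    for (int i = 0; i < 4; i++)
        printf("%3.0fB payload -> %.0f%% of raw link bandwidth\n",
               sizes[i], 100.0 * link_efficiency(sizes[i], overhead));
    /* 128B -> 80%, 16B -> 33%: a ~2.4x drop from packet overhead alone;
     * bank conflicts from irregular access widen the gap further. */
    return 0;
}
```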

SPECIALIZED PROCESSING FOR HIGH VALUE APPLICATIONS

Processing in memory/storage might make sense for niche, high-value applications
§ Image pixel processing, character processing (bioinformatics applications)
§ Massively parallel regular expression applications (cybersecurity, search problems cast as REs)
§ Content addressable memories

PIXEL/CHARACTER PROCESSING IN MEMORY ARCHITECTURE: TERASYS WORKSTATION

[Excerpt from an IEEE Computer article (April 1995) on the Terasys PIM workstation: a Sun Sparc-2 host with an SBus interface card drives a Terasys interface board and up to eight PIM array units of 4K single-bit processors each, for a maximum of 32,768 processors. Each PIM chip (about one million transistors in 1-micron technology) holds 2K x 64 bits of SRAM plus 64 single-bit processors; SIMD instructions are conveyed as memory writes through the interface board, and processors communicate over global-OR, partitioned-OR, and parallel-prefix networks. The array is programmed in dbC, a data-parallel C with bit-length specifiers, bit insertion/extraction, and reduction operators.]

NFA COMPUTING IN DRAM: MASSIVE PARALLELISM

Micron's automata processor puts processing in the DRAM
§ Massively parallel 2D fabric
§ Thousands to millions of automaton processing elements ("state transition elements")
§ Implements non-deterministic finite automata
§ Augmented with counters
§ Functions can be reprogrammed at runtime
§ Routing matrix consists of programmable switches, buffers, routing wires, cross-point connections
• Routes outputs of one automaton to inputs of others
• Possible multi-hop
• DDR3 interface
• Estimated 4W/chip; Snort rules in 16W vs. 35W for FPGA

AUTOMATA PROCESSOR APPLICATIONS RELATE TO REGULAR EXPRESSIONS

Each chip takes an 8-bit symbol as input; an 8→256 decode drives a 256-bit "row buffer," and each column is 48K.
Can match regular expressions, with additional counting
• a+b+c+, total # symbols = 17
• Cybersecurity, bioinformatics applications
Challenge is to express other problems as NFAs
• Graph search
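In software terms, each state transition element holds one symbol class and the fabric advances all active states on every input symbol; a minimal sketch of the slide's a+b+c+ example with a 17-symbol count (illustrative only, not Micron's programming interface):

```c
/* Software sketch of what the automata processor does in hardware:
 * simulate an NFA for a+b+c+ plus a counter requiring exactly 17 symbols
 * in total (the slide's example). */
#include <stdbool.h>
#include <string.h>

/* States: 0 = in a+, 1 = in b+, 2 = in c+ (accepting). */
bool match_abc17(const char *s) {
    size_t n = strlen(s);
    if (n != 17) return false;       /* counter element: total symbol count */
    if (s[0] != 'a') return false;   /* start-of-input anchored on 'a'      */
    bool active[3] = {true, false, false};
    for (size_t i = 1; i < n; i++) {
        bool next[3] = {false, false, false};
        char c = s[i];
        if (c == 'a') next[0] = active[0];               /* stay in a+ */
        if (c == 'b') next[1] = active[0] || active[1];  /* a+ -> b+   */
        if (c == 'c') next[2] = active[1] || active[2];  /* b+ -> c+   */
        memcpy(active, next, sizeof next);
        if (!active[0] && !active[1] && !active[2]) return false;
    }
    return active[2];                /* accept only while inside c+ */
}
```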

TCAM IN MEMRISTOR FOR CONTENT BASED SEARCH

Memristor-based ternary content addressable memory (mTCAM) for data-intensive computing (Semiconductor Science and Technology, v. 29, no. 10, 2014, http://stacks.iop.org/0268-1242/29/i=10/a=104010)

A memristor-based ternary content addressable memory (mTCAM) is presented. Each mTCAM cell, consisting of five transistors and two memristors to store and search for ternary data, is capable of remarkable nonvolatility and higher storage density than conventional CMOS-based TCAMs. Each memristor in the cell can be programmed individually such that high impedance is always present between searchlines to reduce static energy consumption. A unique two-step write scheme offers reliable and energy-efficient write operations. The search voltage is designed to ensure optimum sensing margins in the presence of variations in memristor devices. Simulations of the proposed mTCAM demonstrate functionality in write and search modes, as well as a search delay of 2 ns and a search energy of 0.99 fJ/bit/search for a word width of 128 bits.

TCAM EMULATION

The Zynq system-on-a-chip (SoC) from Xilinx is used for the emulation platform. Both fixed and programmable logic are combined within a single device. Two ARM A9 cores and a DDR3 memory controller are integrated with FPGA logic. Figure 1 shows the system architecture.

Figure 1: Commands from the host are processed by the programmable search engine. Results from multiple CAM accesses are merged or reduced before sending the final result to the host. Counters and timers are used to track power usage and time.

The TCAM can be emulated in two ways. One method uses a core provided by Xilinx, implemented entirely from FPGA components. The TCAM core has single-cycle search performance but limited capacity; a practical TCAM size for the Zynq is about 64 Kbits. The other method uses an embedded MicroBlaze 32-bit processor instantiated in FPGA logic and a hash table implemented in 1 GB of DRAM attached to the programmable logic. The processor-and-hash-table implementation has greater capacity at the cost of reduced search performance.
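To make the behavior being emulated concrete, the lookup semantics of a generic TCAM can be sketched in software (an illustrative sketch, not the interface of the Xilinx core or the MicroBlaze implementation):

```c
/* Each TCAM entry stores a value and a care-mask; bits whose mask bit is
 * clear are "don't care". A search returns the first matching entry,
 * mimicking in software the priority match hardware does in one cycle. */
#include <stdint.h>

typedef struct {
    uint64_t value;  /* stored bits                        */
    uint64_t care;   /* 1 = bit must match, 0 = don't care */
} tcam_entry;

int tcam_search(const tcam_entry *t, int n, uint64_t key) {
    for (int i = 0; i < n; i++)                     /* hardware checks all */
        if (((key ^ t[i].value) & t[i].care) == 0)  /* rows in parallel    */
            return i;                               /* lowest index wins   */
    return -1;                                      /* miss */
}
```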

A search application can execute on the ARM cores at hardware speed while accesses to the TCAM are monitored without additional overhead by dedicated counters in programmable logic. Library calls can be inserted in the application code at key locations to retrieve TCAM usage counts and search time.


DATA ANALYTICS: HIGH VALUE COMMERCIAL APPLICATIONS

Application space includes …
§ Image/video processing
§ Graph analytics
§ Specialized in-memory databases
§ Tree- and list-based data structure manipulation

Minimal benefit from cache
• Streaming, strided access patterns
• Irregular access patterns

Processing in (near) memory is an opportunity to "reorganize" data structures from their native format into a cache-friendly one.
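As a sketch of the idea (illustrative code, not the emulator's interface), a reorganization engine gathers scattered elements into a dense view and scatters results back:

```c
/* Gather: view[k] = base[index[k]]. Run near memory, only the packed view
 * crosses the link to the host, instead of one cache line per element. */
#include <stddef.h>
#include <stdint.h>

void dre_gather(const double *base, const uint32_t *index,
                double *view, size_t count) {
    for (size_t k = 0; k < count; k++)
        view[k] = base[index[k]];
}

/* Scatter: write results back through the same index mapping. */
void dre_scatter(double *base, const uint32_t *index,
                 const double *view, size_t count) {
    for (size_t k = 0; k < count; k++)
        base[index[k]] = view[k];
}
```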

PROPOSED DATA REORGANIZATION ENGINE ARCHITECTURE

[Figure: active memory subsystem. Vaults with slave ports connect through an interconnect to the host CPU link and to master ports on the reorganization engines, each with its own view buffer]

Data Reorganization Engines
§ Multiple block requests given at start
§ Incoming packets for Data Reorg Engine processed one at a time
§ All reorganization done within Engine
§ Views are created in buffers
§ View buffers are memory mapped from host CPU perspective (see the sketch below)

FPGA emulator for Data Reorg Engine in active memory
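From the host side, a memory-mapped view buffer can be touched like ordinary memory; the following is a hypothetical sketch (the device path, size, and mapping offset are invented for illustration):

```c
/* Hypothetical host-side access to a memory-mapped view buffer. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    int fd = open("/dev/dre0", O_RDWR);   /* hypothetical DRE device */
    if (fd < 0) { perror("open"); return 1; }
    size_t len = 1 << 20;                 /* assumed 1 MB view buffer */
    double *view = mmap(NULL, len, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);
    if (view == MAP_FAILED) { perror("mmap"); return 1; }
    /* The DRE fills the view; the host streams it with ordinary loads. */
    double sum = 0.0;
    for (size_t i = 0; i < len / sizeof(double); i++)
        sum += view[i];
    printf("sum = %f\n", sum);
    munmap(view, len);
    close(fd);
    return 0;
}
```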

SCOTT LLOYD, LLNL

EMULATION ARCHITECTURE

[Figure: emulation architecture. A 1 GB DDR3 SODIMM behind the Zynq FPGA's 64-bit DDR3 controller provides vault emulation; the Zynq processing system's ARM cores (running main) connect over 64-bit slave ports to an interconnect whose master ports feed two DREs, each an MCU and LSU with a view buffer and block RAM holding the kernel]

The Zynq, with both hard processors and FPGA logic, is used for active memory emulation. The ARM cores (CPU) run the main application, and the kernel executes on a data reorganization engine (DRE), which consists of a micro-control unit (MCU) and a load-store unit (LSU). The vaults are emulated with DDR3 memory.

EMULATOR STRUCTURE ON ZYNQ

[Figure: emulator structure on the Zynq SoC. The host subsystem (two ARM cores with L1/L2 caches, SRAM, and 1 GB DRAM in the processing system, PS) and the memory subsystem (accelerators, BRAM, peripheral, monitor, and a second 1 GB DRAM in the programmable logic, PL) are joined by an AXI interconnect; an AXI Performance Monitor (APM) and trace-capture device feed a trace subsystem]

DATA CENTRIC BENCHMARKS DEMONSTRATE PROGRAMMABLE DRES

Image difference: compute the pixel-wise difference of two reduced-resolution images
• DRE in memory subsystem assembles a view buffer of each reduced-resolution image
• Host CPU computes the image difference

Pagerank: traverse a web graph to find highly referenced pages
• DRE in memory subsystem assembles a view buffer of node neighbors (page rank vector)
• Host CPU updates the reference count of each neighbor and creates the next page rank vector

RandomAccess: use each memory location as the index of the next location
• DRE in memory subsystem assembles a view buffer of the memory words accessed, through gather/scatter hardware
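For flavor, the RandomAccess pattern is essentially dependent index chasing, which defeats host caches; a minimal sketch (illustrative, not the benchmark's code):

```c
/* Follow 'steps' links starting at 'start'; table[i] holds the next index.
 * On the host every step is a likely cache miss; a DRE can resolve the
 * chain inside the memory package and emit view[] as a dense stream. */
#include <stddef.h>
#include <stdint.h>

void chase(const uint64_t *table, size_t mask, uint64_t start,
           uint64_t *view, size_t steps) {
    uint64_t idx = start;
    for (size_t s = 0; s < steps; s++) {
        idx = table[idx & mask];  /* dependent load: latency-bound on host */
        view[s] = idx;
    }
}
```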

EMULATOR CONFIGURATION

Emulated System Configuration

CPU: 32-bit ARM A9 core, 2.57 GHz, 5.14 GB/s bandwidth
DRAM: 1 GB, 122 ns read and write access
SRAM: 256 KB, 42 ns read and write access

Data Reorganization Engine (DRE) in memory subsystem:
LSU (Load-Store Unit): 64-bit data path, 1.25 GHz, 10.00 GB/s bandwidth
DRAM: 1 GB, 90 ns read and write access
SRAM: 256 KB, 10 ns read and write access
MCU (MicroBlaze): 32-bit data path, 1.25 GHz, 5.00 GB/s bandwidth
DRAM: 1 GB, 90 ns read and write access
SRAM: 256 KB, 10 ns read and write access
BRAM: local 16 KB for instructions and stack

DRE USE: FASTER, LESS ENERGY (IMAGE DECIMATION BY 16)

Host Only:
  Accesses: ARM read 1997558, ARM write 15764
  Bytes:    ARM read 63921856, ARM write 504448
  Joules:   ARM 0.0154623

Host+DRE:
  Accesses: ARM read 156504, ARM write 15666; DRE read 1000000, DRE write 1000000
  Bytes:    ARM read 5008128, ARM write 501312; DRE read 16000000, DRE write 8000000
  Joules:   ARM 0.000998247, DRE 0.001344, ARM+DRE 0.00234225

The offloaded version moves roughly 12x fewer bytes through the ARM and uses about 6.6x less total energy (0.0155 J vs. 0.0023 J).

PAGERANK SCALE 18, 19

Scale 18 profile
[Charts: per-phase profile of one PageRank iteration]

Scale 19:
Host+DRE: page rank time 0.753485 sec
Host Only: page rank time 0.828449 sec

Profile of one iteration of the PageRank algorithm. The DRE assembles a compact page rank view based on the adjacency list for vertex i, which is an index array into the original page rank vector. The host accesses the compact view created by the DRE.

RANDOMACCESS

Host+DRE: real time used = 0.286718 seconds; 0.014628678 billion (10^9) updates per second [GUP/s]

Host Only: real time used = 0.313953 seconds; 0.013359660 billion (10^9) updates per second [GUP/s]
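As a sanity check on the figures above, GUP/s is updates divided by wall time divided by 10^9; inverting the reported numbers suggests about 2^22 updates per run (an inference from the rounded values, not stated in the slides):

```c
/* GUP/s = updates / seconds / 1e9. Inverting the reported figures:
 * 0.014628678 GUP/s * 0.286718 s * 1e9 ~= 4194304 = 2^22 updates,
 * consistent with both runs. */
#include <stdio.h>

int main(void) {
    double updates = 1 << 22;  /* inferred update count */
    printf("Host+DRE:  %.9f GUP/s\n", updates / 0.286718 / 1e9);
    printf("Host Only: %.9f GUP/s\n", updates / 0.313953 / 1e9);
    return 0;
}
```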

RESEARCH QUESTIONS REMAIN

§ Enough benefit to justify the cost of fabrication?
§ How to interact with virtual memory?
§ How many DREs on chip?
§ Programming models

ACTIVE STORAGE WITH MINERVA (UCSD AND LLNL)

Designed for future generations of NVRAM that are byte addressable with very low read latency
Based on computation close to storage
Move data- or I/O-intensive computations to the storage to minimize data transfer between the host and the storage
Processing scales as storage expands
Power efficiency and performance gain

§ Prototyped on the BEE-3 system
• 4 Xilinx Virtex 5 FPGAs, each with 16 GB memory, ring interconnect
§ Emulates a PCM controller by inserting appropriate latency on reads and writes; emulates wear leveling for stored blocks
§ Gets the advantage of fast PCM and FPGA-based processing

Arup De, UCSD/LLNL

MINERVA ARCHITECTURE PROVIDES CONCURRENT ACCESS, IN-STORAGE PROCESSING UNITS

[Figure: Minerva architecture block diagram]

INTERACTION WITH STORAGE PROCESSOR

[Figure: storage processor datapath. A ring interface, request manager, scheduler, DMA engine, and local memory surround compute kernels K0 … Kn; the example kernel is a string-matching grep engine with splitter, string comparator, extractor, and word counter]

Command flow:
1. Receives command
2. Loads it to memory
3. Sends read DMA request
4. Reads data from memory via the memory interface
5. Initiates computations via the request manager
6. Receives result
7. Sends back completion notification and updates status

MINERVA PROTOTYPE

Built on BEE3 board
PCIe 1.1 x8 host connection
Clock frequency 250 MHz
Virtex 5 FPGA implements
• Request scheduler
• Network
• Storage processor
• DDR2 DRAM emulates PCM

MINERVA FPGA EMULATOR ALLOWS US TO ESTIMATE PERFORMANCE, POWER

Compare standard I/O with the computational offload approach. Design alternative storage processor architectures and evaluate trade-offs. Run benchmarks and mini-apps in real time. Estimate power draw and energy for different technology nodes.

§ Streaming string search
• 8+× performance
• 10+× energy efficiency
§ Saxpy
• 4× performance
• 4.5× energy efficiency
§ Key-value store (key lookup)
• 1.6× performance
• 1.7× energy efficiency
§ B+ tree search
• 4+× performance
• 5× energy efficiency
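The host-side offload pattern for, say, the streaming string search can be sketched as a single command that returns only results (all structures and names below are invented for illustration; Minerva's actual command format is not shown here):

```c
/* Hypothetical host-side view of computational offload: instead of reading
 * blocks and searching on the host, the host sends one command and
 * receives only the match count. */
#include <stdint.h>
#include <string.h>

enum { OP_GREP = 1 };

typedef struct {
    uint64_t lba;         /* starting block of the region to search    */
    uint64_t nblocks;     /* region length in blocks                   */
    uint8_t  opcode;      /* which compute kernel to run, e.g. OP_GREP */
    char     pattern[64]; /* search string for the string-match kernel */
} offload_cmd;

/* Stub for the driver call that would enqueue the command and DMA back
 * the result; a real implementation would talk to the PCIe device. */
static int submit_cmd(const offload_cmd *cmd, void *result, uint64_t len) {
    (void)cmd;
    memset(result, 0, len);
    return 0;
}

/* Only the count crosses the bus, not nblocks worth of raw data. */
int grep_offload(uint64_t lba, uint64_t nblocks, const char *pat,
                 uint64_t *match_count) {
    offload_cmd c = { .lba = lba, .nblocks = nblocks, .opcode = OP_GREP };
    strncpy(c.pattern, pat, sizeof c.pattern - 1);
    return submit_cmd(&c, match_count, sizeof *match_count);
}
```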

[Charts: speedup and energy efficiency of computational offload relative to I/O and RPC baselines]

HOW ABOUT A RECONFIGURABLE LOGIC LAYER?

3D package
• DRAM stack
• TSVs directly into FPGA
• Has the time come for "Programmable Heterogeneous Interconnect"?

[Figure: 3D stack with an FPGA (PHI die) between memory layers]

FPGA-BASED MEMORY INTENSIVE ARCHITECTURE

FPGA-resident compute engines operate directly on data in the 3D memory stack.

Application driver: processing hyperspectral imagery

[Figure: hyperspectral processing pipeline. A 2k x 2k, 12 bpp, 200-band HSI sensor (1.3 GB), cued by other sensors, feeds five stages, each with its own compute and memory budget: preprocessing (8-12 GFLOPs, 3 GB: gain control, calibration, data formatting, atmospheric correction with Modtran etc.); spectral unmixing (10-15 GFLOPs, 3 GB: spectral-spatial registration, analysis of variance, endmember selection); image fusion (5-10 GFLOPs, 2 GB: material map sharpening, regression, high-res panchromatic plus low-res spectral registration, image data fusion, geocoding); anomaly detection (5-7 GFLOPs, 2 GB: spectral matching, hypothesis testing, adaptive CFAR filtering, cluster analysis, outlier detection, terrain analysis, object attributes/preclassification); and compression (10-15 GFLOPs, 2 GB: spectral KLT, bit allocation, band coding), producing a 130 MB result and target handoff (geolocation, attributes/classification)]

XILINX IP

Methods and apparatus for implementing a stacked memory programmable integrated circuit system in package
US 7701251 B1
Inventors: Arifur Rahman, Chidamber R. Kulkarni; original assignee: Xilinx, Inc.

ABSTRACT
Methods and apparatus for implementing a stacked memory-programmable integrated circuit system-in-package are described. An aspect of the invention relates to a semiconductor device. A first integrated circuit (IC) die is provided having an array of tiles that form a programmable fabric of a programmable integrated circuit. A second IC die is stacked on the first IC die and connected therewith via inter-die connections. The second IC die includes banks of memory coupled to input/output (IO) data pins. The inter-die connections couple the IO data pins to the programmable fabric such that all of the banks of memory are accessible in parallel.

CONCLUSIONS

Near data processing is even more technically feasible than in the '90s.
Commercial drivers may emerge to make it a reality in commodity DRAM.
Reconfigurable logic offers an opportunity to create custom near-data applications at relatively low cost.