NEAR DATA PROCESSING: ARE WE THERE YET?

MAYA GOKHALE LAWRENCE LIVERMORE NATIONAL LABORATORY

This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract No. DE-AC52-07NA27344. LLNL-PRES-665279

OUTLINE

§ Why we need near memory
§ Niche application
§ Data reorganization engines
§ Computing near storage
§ FPGAs for computing near memory

WHY AREN'T WE THERE YET?

§ Processing near memory is attractive for high bandwidth, low latency, massive parallelism
§ 1990s-era designs closely integrated logic with the DRAM transistor process
• Expensive
• Slow
• Niche
§ Is 3D packaging the answer?
• Expensive
• Slow
• Niche
§ What are the technology and commercial incentives?

EXASCALE POWER PROBLEM: DATA MOVEMENT IS A PRIME CULPRIT

• Cost of a double-precision flop is negligible compared to the cost of reading and writing memory
• Cache-unfriendly applications waste memory bandwidth and energy (see the sketch below)
• Number of nodes needed to solve a problem is inflated due to poor memory bandwidth utilization
• A similar phenomenon occurs with disk
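As a rough illustration (a minimal sketch, not from the slides; the 64-byte line size is an assumption about a typical cache), a large-stride traversal fetches a full cache line per element while using only one word of it:

```c
/* Sketch: why cache-unfriendly access wastes bandwidth. Each miss moves a
 * full 64-byte line; a unit-stride traversal uses all 64 bytes per line,
 * while a large-stride traversal uses only 8 of them. */
#include <stddef.h>

double sum_sequential(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)          /* 8 useful bytes per 8 fetched */
        s += a[i];
    return s;
}

double sum_strided(const double *a, size_t n, size_t stride) {
    double s = 0.0;
    for (size_t i = 0; i < n; i += stride)  /* 8 useful bytes per 64 fetched */
        s += a[i];                          /* when stride >= 8 doubles      */
    return s;
}
```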

Sources: Horst Simon, LBNL; Greg Astfalk, HP

MICRON HYBRID MEMORY CUBE OFFERS OPPORTUNITY FOR PROCESSING NEAR MEMORY

Fast logic layer
• Customized processors
• Through-silicon vias (TSV) for high-bandwidth access to memory

Vault organization yields enormous bandwidth in the package

Best-case latency is higher than traditional DRAM due to the packetized abstract memory interface.

HMC LATENCY: LINK i TO VAULT i, 128B, 50/50
HMC LATENCY: LINK i TO VAULT j, 128B, 50/50

One link active, showing the effect of a request traversing the communication network.

HMC: WIDE VARIABILITY IN LATENCY WITH CONTENTION

Requests on links 1-4 for data on vaults 0-4 (link 0 local)

HMC SHOWS HIGH BANDWIDTH FOR REGULAR, STREAMING ACCESS PATTERNS, BUT …

[Charts: measured bandwidth for 128B, 64B, 32B, and 16B payloads]

Small payloads and irregular accesses cut bandwidth by a factor of 3.
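One driver of that drop is packet overhead on the abstracted link; a toy model (the 32-byte per-packet overhead is an assumption for illustration, not an HMC specification) shows the trend:

```c
/* Illustrative sketch: on a packetized link, each request/response carries
 * a fixed header+tail, so link efficiency falls as the payload shrinks. */
#include <stdio.h>

double link_efficiency(double payload_bytes, double overhead_bytes) {
    return payload_bytes / (payload_bytes + overhead_bytes);
}

int main(void) {
    const double overhead = 32.0;  /* assumed per-packet overhead */
    const double sizes[] = {128, 64, 32, 16};
    for (int i = 0; i < 4; i++)
        printf("%3.0fB payload -> %.0f%% of raw link bandwidth\n",
               sizes[i], 100.0 * link_efficiency(sizes[i], overhead));
    /* 128B -> 80%, 16B -> 33%: a ~2.4x drop from packet overhead alone;
     * bank conflicts from irregular access widen the gap further. */
    return 0;
}
```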

SPECIALIZED PROCESSING FOR HIGH VALUE APPLICATIONS

Processing in memory/storage might make sense for niche, high-value applications
§ Image pixel processing, character processing (bioinformatics applications)
§ Massively parallel regular expression applications (cybersecurity, search problems cast as REs)
§ Content addressable memories

PIXEL/CHARACTER PROCESSING IN MEMORY ARCHITECTURE: TERASYS WORKSTATION

[Excerpt from an IEEE Computer article (April 1995) on the Terasys PIM workstation: a Sun Sparc-2 host with an SBus interface card drives a Terasys interface board and up to eight PIM array units of 4K single-bit processors each, for a maximum of 32,768 processors. Each PIM chip (about one million transistors in 1-micron technology) holds 2K x 64 bits of SRAM plus 64 single-bit processors; SIMD instructions are conveyed as memory writes through the interface board, and processors communicate over global-OR, partitioned-OR, and parallel-prefix networks. The array is programmed in dbC, a data-parallel C with bit-length specifiers, bit insertion/extraction, and reduction operators.]

NFA COMPUTING IN DRAM: MASSIVE PARALLELISM

Micron's automata processor puts processing in the DRAM
§ Massively parallel 2D fabric
§ Thousands to millions of automaton processing elements ("state transition elements")
§ Implements non-deterministic finite automata
§ Augmented with counters
§ Functions can be reprogrammed at runtime
§ Routing matrix consists of programmable switches, buffers, routing wires, cross-point connections
• Routes outputs of one automaton to inputs of others
• Possible multi-hop
• DDR3 interface
• Estimated 4W/chip; Snort rules in 16W vs. 35W for FPGA

AUTOMATA PROCESSOR APPLICATIONS RELATE TO REGULAR EXPRESSIONS

Each chip takes an 8-bit symbol as input; an 8→256 decode drives a 256-bit "row buffer," and each column is 48K.
Can match regular expressions, with additional counting
• a+b+c+, total # symbols = 17
• Cybersecurity, bioinformatics applications
Challenge is to express other problems as NFAs
• Graph search
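In software terms, each state transition element holds one symbol class and the fabric advances all active states on every input symbol; a minimal sketch of the slide's a+b+c+ example with a 17-symbol count (illustrative only, not Micron's programming interface):

```c
/* Software sketch of what the automata processor does in hardware:
 * simulate an NFA for a+b+c+ plus a counter requiring exactly 17 symbols
 * in total (the slide's example). */
#include <stdbool.h>
#include <string.h>

/* States: 0 = in a+, 1 = in b+, 2 = in c+ (accepting). */
bool match_abc17(const char *s) {
    size_t n = strlen(s);
    if (n != 17) return false;       /* counter element: total symbol count */
    if (s[0] != 'a') return false;   /* start-of-input anchored on 'a'      */
    bool active[3] = {true, false, false};
    for (size_t i = 1; i < n; i++) {
        bool next[3] = {false, false, false};
        char c = s[i];
        if (c == 'a') next[0] = active[0];               /* stay in a+ */
        if (c == 'b') next[1] = active[0] || active[1];  /* a+ -> b+   */
        if (c == 'c') next[2] = active[1] || active[2];  /* b+ -> c+   */
        memcpy(active, next, sizeof next);
        if (!active[0] && !active[1] && !active[2]) return false;
    }
    return active[2];                /* accept only while inside c+ */
}
```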

TCAM IN MEMRISTOR FOR CONTENT BASED SEARCH

Memristor-based ternary content addressable memory (mTCAM) for data-intensive computing (Semiconductor Science and Technology, v. 29, no. 10, 2014, http://stacks.iop.org/0268-1242/29/i=10/a=104010)

A memristor-based ternary content addressable memory (mTCAM) is presented. Each mTCAM cell, consisting of five transistors and two memristors to store and search for ternary data, is capable of remarkable nonvolatility and higher storage density than conventional CMOS-based TCAMs. Each memristor in the cell can be programmed individually such that high impedance is always present between searchlines to reduce static energy consumption. A unique two-step write scheme offers reliable and energy-efficient write operations. The search voltage is designed to ensure optimum sensing margins in the presence of variations in memristor devices. Simulations of the proposed mTCAM demonstrate functionality in write and search modes, as well as a search delay of 2 ns and a search energy of 0.99 fJ/bit/search for a word width of 128 bits.

TCAM EMULATION

The Zynq system-on-a-chip (SoC) from Xilinx is used for the emulation platform. Both fixed and programmable logic are combined within a single device. Two ARM A9 cores and a DDR3 memory controller are integrated with FPGA logic. Figure 1 shows the system architecture.

Figure 1: Commands from the host are processed by the programmable search engine. Results from multiple CAM accesses are merged or reduced before sending the final result to the host. Counters and timers are used to track power usage and time.

The TCAM can be emulated in two ways. One method uses a core provided by Xilinx, implemented entirely from FPGA components. The TCAM core has single-cycle search performance but limited capacity; a practical TCAM size for the Zynq is about 64 Kbits. The other method uses an embedded MicroBlaze 32-bit processor instantiated in FPGA logic and a hash table implemented in 1 GB of DRAM attached to the programmable logic. The processor-and-hash-table implementation has greater capacity at the cost of reduced search performance.
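To make the behavior being emulated concrete, the lookup semantics of a generic TCAM can be sketched in software (an illustrative sketch, not the interface of the Xilinx core or the MicroBlaze implementation):

```c
/* Each TCAM entry stores a value and a care-mask; bits whose mask bit is
 * clear are "don't care". A search returns the first matching entry,
 * mimicking in software the priority match hardware does in one cycle. */
#include <stdint.h>

typedef struct {
    uint64_t value;  /* stored bits                        */
    uint64_t care;   /* 1 = bit must match, 0 = don't care */
} tcam_entry;

int tcam_search(const tcam_entry *t, int n, uint64_t key) {
    for (int i = 0; i < n; i++)                     /* hardware checks all */
        if (((key ^ t[i].value) & t[i].care) == 0)  /* rows in parallel    */
            return i;                               /* lowest index wins   */
    return -1;                                      /* miss */
}
```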

A search application can execute on the ARM cores at hardware speed while accesses to the TCAM are monitored without additional overhead by dedicated counters in programmable logic. Library calls can be inserted in the application code at key locations to retrieve TCAM usage counts and search time.


DATA ANALYTICS: HIGH VALUE COMMERCIAL APPLICATIONS

Application space includes …
§ Image/video processing
§ Graph analytics
§ Specialized in-memory databases
§ Tree- and list-based data structure manipulation

Minimal benefit from cache
• Streaming, strided access patterns
• Irregular access patterns

Processing in (near) memory is an opportunity to "reorganize" data structures from their native format into a cache-friendly one.
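As a sketch of the idea (illustrative code, not the emulator's interface), a reorganization engine gathers scattered elements into a dense view and scatters results back:

```c
/* Gather: view[k] = base[index[k]]. Run near memory, only the packed view
 * crosses the link to the host, instead of one cache line per element. */
#include <stddef.h>
#include <stdint.h>

void dre_gather(const double *base, const uint32_t *index,
                double *view, size_t count) {
    for (size_t k = 0; k < count; k++)
        view[k] = base[index[k]];
}

/* Scatter: write results back through the same index mapping. */
void dre_scatter(double *base, const uint32_t *index,
                 const double *view, size_t count) {
    for (size_t k = 0; k < count; k++)
        base[index[k]] = view[k];
}
```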

PROPOSED DATA REORGANIZATION ENGINE ARCHITECTURE

[Figure: active memory subsystem. Vaults with slave ports connect through an interconnect to the host CPU link and to master ports on the reorganization engines, each with its own view buffer]

Data Reorganization Engines
§ Multiple block requests given at start
§ Incoming packets for Data Reorg Engine processed one at a time
§ All reorganization done within Engine
§ Views are created in buffers
§ View buffers are memory mapped from host CPU perspective (see the sketch below)

FPGA emulator for Data Reorg Engine in active memory
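From the host side, a memory-mapped view buffer can be touched like ordinary memory; the following is a hypothetical sketch (the device path, size, and mapping offset are invented for illustration):

```c
/* Hypothetical host-side access to a memory-mapped view buffer. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    int fd = open("/dev/dre0", O_RDWR);   /* hypothetical DRE device */
    if (fd < 0) { perror("open"); return 1; }
    size_t len = 1 << 20;                 /* assumed 1 MB view buffer */
    double *view = mmap(NULL, len, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);
    if (view == MAP_FAILED) { perror("mmap"); return 1; }
    /* The DRE fills the view; the host streams it with ordinary loads. */
    double sum = 0.0;
    for (size_t i = 0; i < len / sizeof(double); i++)
        sum += view[i];
    printf("sum = %f\n", sum);
    munmap(view, len);
    close(fd);
    return 0;
}
```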

SCOTT LLOYD, LLNL

EMULATION ARCHITECTURE

[Figure: emulation architecture. A 1 GB DDR3 SODIMM behind the Zynq FPGA's 64-bit DDR3 controller provides vault emulation; the Zynq processing system's ARM cores (running main) connect over 64-bit slave ports to an interconnect whose master ports feed two DREs, each an MCU and LSU with a view buffer and block RAM holding the kernel]

The Zynq, with both hard processors and FPGA logic, is used for active memory emulation. The ARM cores (CPU) run the main application, and the kernel executes on a data reorganization engine (DRE), which consists of a micro-control unit (MCU) and a load-store unit (LSU). The vaults are emulated with DDR3 memory.

EMULATOR STRUCTURE ON ZYNQ

[Figure: emulator structure on the Zynq SoC. The host subsystem (two ARM cores with L1/L2 caches, SRAM, and 1 GB DRAM in the processing system, PS) and the memory subsystem (accelerators, BRAM, peripheral, monitor, and a second 1 GB DRAM in the programmable logic, PL) are joined by an AXI interconnect; an AXI Performance Monitor (APM) and trace-capture device feed a trace subsystem]

DATA CENTRIC BENCHMARKS DEMONSTRATE PROGRAMMABLE DRES

Image difference: compute the pixel-wise difference of two reduced-resolution images
• DRE in memory subsystem assembles a view buffer of each reduced-resolution image
• Host CPU computes the image difference

Pagerank: traverse a web graph to find highly referenced pages
• DRE in memory subsystem assembles a view buffer of node neighbors (page rank vector)
• Host CPU updates the reference count of each neighbor and creates the next page rank vector

RandomAccess: use each memory location as the index of the next location
• DRE in memory subsystem assembles a view buffer of the memory words accessed, through gather/scatter hardware
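For flavor, the RandomAccess pattern is essentially dependent index chasing, which defeats host caches; a minimal sketch (illustrative, not the benchmark's code):

```c
/* Follow 'steps' links starting at 'start'; table[i] holds the next index.
 * On the host every step is a likely cache miss; a DRE can resolve the
 * chain inside the memory package and emit view[] as a dense stream. */
#include <stddef.h>
#include <stdint.h>

void chase(const uint64_t *table, size_t mask, uint64_t start,
           uint64_t *view, size_t steps) {
    uint64_t idx = start;
    for (size_t s = 0; s < steps; s++) {
        idx = table[idx & mask];  /* dependent load: latency-bound on host */
        view[s] = idx;
    }
}
```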

EMULATOR CONFIGURATION

Emulated System Configuration

CPU: 32-bit ARM A9 core, 2.57 GHz, 5.14 GB/s bandwidth
DRAM: 1 GB, 122 ns read and write access
SRAM: 256 KB, 42 ns read and write access

Data Reorganization Engine (DRE) in memory subsystem:
LSU (Load-Store Unit): 64-bit data path, 1.25 GHz, 10.00 GB/s bandwidth
DRAM: 1 GB, 90 ns read and write access
SRAM: 256 KB, 10 ns read and write access
MCU (MicroBlaze): 32-bit data path, 1.25 GHz, 5.00 GB/s bandwidth
DRAM: 1 GB, 90 ns read and write access
SRAM: 256 KB, 10 ns read and write access
BRAM: local 16 KB for instructions and stack

DRE USE: FASTER, LESS ENERGY (IMAGE DECIMATION BY 16)

Host Only:
  Accesses: ARM read 1997558, ARM write 15764
  Bytes:    ARM read 63921856, ARM write 504448
  Joules:   ARM 0.0154623

Host+DRE:
  Accesses: ARM read 156504, ARM write 15666; DRE read 1000000, DRE write 1000000
  Bytes:    ARM read 5008128, ARM write 501312; DRE read 16000000, DRE write 8000000
  Joules:   ARM 0.000998247, DRE 0.001344, ARM+DRE 0.00234225

The offloaded version moves roughly 12x fewer bytes through the ARM and uses about 6.6x less total energy (0.0155 J vs. 0.0023 J).

PAGERANK SCALE 18, 19

Scale 18 profile
[Charts: per-phase profile of one PageRank iteration]

Scale 19:
Host+DRE: page rank time 0.753485 sec
Host Only: page rank time 0.828449 sec

Profile of one iteration of the PageRank algorithm. The DRE assembles a compact page rank view based on the adjacency list for vertex i, which is an index array into the original page rank vector. The host accesses the compact view created by the DRE.

RANDOMACCESS

Host+DRE: real time used = 0.286718 seconds; 0.014628678 billion (10^9) updates per second [GUP/s]

Host Only: real time used = 0.313953 seconds; 0.013359660 billion (10^9) updates per second [GUP/s]
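As a sanity check on the figures above, GUP/s is updates divided by wall time divided by 10^9; inverting the reported numbers suggests about 2^22 updates per run (an inference from the rounded values, not stated in the slides):

```c
/* GUP/s = updates / seconds / 1e9. Inverting the reported figures:
 * 0.014628678 GUP/s * 0.286718 s * 1e9 ~= 4194304 = 2^22 updates,
 * consistent with both runs. */
#include <stdio.h>

int main(void) {
    double updates = 1 << 22;  /* inferred update count */
    printf("Host+DRE:  %.9f GUP/s\n", updates / 0.286718 / 1e9);
    printf("Host Only: %.9f GUP/s\n", updates / 0.313953 / 1e9);
    return 0;
}
```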

RESEARCH QUESTIONS REMAIN

§ Enough benefit to justify the cost of fabrication?
§ How to interact with virtual memory?
§ How many DREs on chip?
§ Programming models

ACTIVE STORAGE WITH MINERVA (UCSD AND LLNL)

Designed for future generations of NVRAM that are byte addressable with very low read latency
Based on computation close to storage
Move data- or I/O-intensive computations to the storage to minimize data transfer between the host and the storage
Processing scales as storage expands
Power efficiency and performance gain

§ Prototyped on the BEE-3 system
• 4 Xilinx Virtex 5 FPGAs, each with 16 GB memory, ring interconnect
§ Emulates a PCM controller by inserting appropriate latency on reads and writes; emulates wear leveling for stored blocks
§ Gets the advantage of fast PCM and FPGA-based processing

Arup De, UCSD/LLNL

MINERVA ARCHITECTURE PROVIDES CONCURRENT ACCESS, IN-STORAGE PROCESSING UNITS

[Figure: Minerva architecture block diagram]

INTERACTION WITH STORAGE PROCESSOR

[Figure: storage processor datapath. A ring interface, request manager, scheduler, DMA engine, and local memory surround compute kernels K0 … Kn; the example kernel is a string-matching grep engine with splitter, string comparator, extractor, and word counter]

Command flow:
1. Receives command
2. Loads it to memory
3. Sends read DMA request
4. Reads data from memory via the memory interface
5. Initiates computations via the request manager
6. Receives result
7. Sends back completion notification and updates status

MINERVA PROTOTYPE

Built on BEE3 board
PCIe 1.1 x8 host connection
Clock frequency 250 MHz
Virtex 5 FPGA implements
• Request scheduler
• Network
• Storage processor
• DDR2 DRAM emulates PCM

MINERVA FPGA EMULATOR ALLOWS US TO ESTIMATE PERFORMANCE, POWER

Compare standard I/O with the computational offload approach. Design alternative storage processor architectures and evaluate trade-offs. Run benchmarks and mini-apps in real time. Estimate power draw and energy for different technology nodes.

§ Streaming string search
• 8+× performance
• 10+× energy efficiency
§ Saxpy
• 4× performance
• 4.5× energy efficiency
§ Key-value store (key lookup)
• 1.6× performance
• 1.7× energy efficiency
§ B+ tree search
• 4+× performance
• 5× energy efficiency
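The host-side offload pattern for, say, the streaming string search can be sketched as a single command that returns only results (all structures and names below are invented for illustration; Minerva's actual command format is not shown here):

```c
/* Hypothetical host-side view of computational offload: instead of reading
 * blocks and searching on the host, the host sends one command and
 * receives only the match count. */
#include <stdint.h>
#include <string.h>

enum { OP_GREP = 1 };

typedef struct {
    uint64_t lba;         /* starting block of the region to search    */
    uint64_t nblocks;     /* region length in blocks                   */
    uint8_t  opcode;      /* which compute kernel to run, e.g. OP_GREP */
    char     pattern[64]; /* search string for the string-match kernel */
} offload_cmd;

/* Stub for the driver call that would enqueue the command and DMA back
 * the result; a real implementation would talk to the PCIe device. */
static int submit_cmd(const offload_cmd *cmd, void *result, uint64_t len) {
    (void)cmd;
    memset(result, 0, len);
    return 0;
}

/* Only the count crosses the bus, not nblocks worth of raw data. */
int grep_offload(uint64_t lba, uint64_t nblocks, const char *pat,
                 uint64_t *match_count) {
    offload_cmd c = { .lba = lba, .nblocks = nblocks, .opcode = OP_GREP };
    strncpy(c.pattern, pat, sizeof c.pattern - 1);
    return submit_cmd(&c, match_count, sizeof *match_count);
}
```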

[Charts: speedup and energy efficiency of computational offload relative to I/O and RPC baselines]

HOW ABOUT A RECONFIGURABLE LOGIC LAYER?

3D package
• DRAM stack
• TSVs directly into FPGA
• Has the time come for "Programmable Heterogeneous Interconnect"?

[Figure: 3D stack with an FPGA (PHI die) between memory layers]

FPGA-BASED MEMORY INTENSIVE ARCHITECTURE

FPGA-resident compute engines operate directly on data in the 3D memory stack.

Application driver: processing hyperspectral imagery

[Figure: hyperspectral processing pipeline. A 2k x 2k, 12 bpp, 200-band HSI sensor (1.3 GB), cued by other sensors, feeds five stages, each with its own compute and memory budget: preprocessing (8-12 GFLOPs, 3 GB: gain control, calibration, data formatting, atmospheric correction with Modtran etc.); spectral unmixing (10-15 GFLOPs, 3 GB: spectral-spatial registration, analysis of variance, endmember selection); image fusion (5-10 GFLOPs, 2 GB: material map sharpening, regression, high-res panchromatic plus low-res spectral registration, image data fusion, geocoding); anomaly detection (5-7 GFLOPs, 2 GB: spectral matching, hypothesis testing, adaptive CFAR filtering, cluster analysis, outlier detection, terrain analysis, object attributes/preclassification); and compression (10-15 GFLOPs, 2 GB: spectral KLT, bit allocation, band coding), producing a 130 MB result and target handoff (geolocation, attributes/classification)]

XILINX IP

Methods and apparatus for implementing a stacked memory programmable integrated circuit system in package
US 7701251 B1
Inventors: Arifur Rahman, Chidamber R. Kulkarni; original assignee: Xilinx, Inc.

ABSTRACT
Methods and apparatus for implementing a stacked memory-programmable integrated circuit system-in-package are described. An aspect of the invention relates to a semiconductor device. A first integrated circuit (IC) die is provided having an array of tiles that form a programmable fabric of a programmable integrated circuit. A second IC die is stacked on the first IC die and connected therewith via inter-die connections. The second IC die includes banks of memory coupled to input/output (IO) data pins. The inter-die connections couple the IO data pins to the programmable fabric such that all of the banks of memory are accessible in parallel.

CONCLUSIONS

Near data processing is even more technically feasible than in the '90s.
Commercial drivers may emerge to make it a reality in commodity DRAM.
Reconfigurable logic offers an opportunity to create custom near-data applications at relatively low cost.