PIM Lite: On the Road Towards Relentless Multi-threading in Massively Parallel Systems

Jay Brockman, Peter Kogge, Shyamkumar Thoziyoor, and Edward Kang
Department of Computer Science and Engineering, University of Notre Dame
February 17, 2003

Abstract

Processing In Memory (PIM) technology, which mixes significant processing logic with dense memory on the same chip, has become a popular emerging trend in recent years. In many cases, however, it has been used simply as a step towards a “system on a chip.” This paper assumes that PIM systems will be inherently massively parallel, with many chips collaborating in a computation, perhaps in concert with more conventional microprocessors. While such systems could be designed to support “classical” parallel models such as DSM or message passing, this paper discusses several different models born from the HTMT project. All of these models involve significant multi-threading, with large numbers of relatively lightweight threads executing within the PIM nodes. To take advantage of these characteristics, we have designed a new ISA and matching microarchitecture that support such multithreading in ways that leverage very efficiently the enhanced local bandwidth and access time available from an on-chip memory macro. A simplified version of this design, termed PIM Lite, is about to go to fab as a memory part with multiple internal nodes, all of which support very lightweight threads in a simple SMT microarchitecture. This paper discusses PIM Lite, and then our outlook on what more advanced designs might look like.

1 Introduction

In the conventional von Neumann model, processors and memory play complementary but entirely separate roles in computation: the memory stores data but has no provisions for operating on it, while the processor modifies data but has little provision for storing it. The “von Neumann bottleneck” or “memory wall” associated with transferring data between the processor and memory is a well-known limitation on computing performance. Memory hierarchies do not eliminate the basic problem that the processor must manage nearly every detail of data transfer, from computing the address of the desired data to coordinating the transfer protocols.

Processing in Memory (PIM) breaks through the memory wall by pushing computation into the memory system. Rather than denoting a specific implementation, PIM is a collection of methods and technologies that cover all aspects of pushing computation into the memory system, including programming and execution models, microarchitectural organization, and physical design and layout. One of the key goals of this paper is to demonstrate that by considering these technologies simultaneously and rationalizing the interfaces between them, extremely efficient PIM implementations are possible that can improve performance over conventional systems by orders of magnitude. By physically placing processing logic on a memory chip, the processor can access data at the highest bandwidths and lowest latencies possible. Further, and distinguishing this project from others, the emphasis is not on leveraging the technology for single-processor performance, but on supporting massively parallel, scalable systems in which very significant amounts of many forms of concurrency are present for exploitation.

Section 2 of this paper briefly overviews the assumed technology, the spectrum of PIM chip architectures that have been investigated, and the multi-PIM computational idioms we have focused on. Section 3 then describes the architecture and ISA of PIM Lite, an early attempt to support the key ideas behind such idioms. Section 4 discusses a preliminary microarchitecture and implementation of this ISA on a multi-node chip. A synthesized VHDL version of PIM Lite is currently running on an FPGA board, with silicon planned for the first quarter of 2003.

2 Background and Related Work

2.1 Leveraging Memory Bandwidth

Figure 1 illustrates the organization of a typical DRAM macro. Internal to the macro, data is typically stored in rows of 2K memory cells, where the entire row is read and buffered whenever any cell on that row is first accessed, regardless of how many bits of data on the row are actually needed by the application. Furthermore, once accessed, the contents of a row are latched into a primary sense amp within the macro. These values stay in the latch until another access overwrites them. Thus, once accessed, such a row is called the currently open row.

Figure 1: DRAM memory macro organization (2048 bits per internal row, read out 256 bits at a time).

Multiplexing within the circuitry common to the whole macro typically allows buffered row data to be paged out 256 bits at a time. Each such 256-bit word is termed a wide word here. Given these numbers of 2048 bits per open row and 256 bits per wide word, there are 8 separate wide words per open row. A key property of such memory macros is that if the next access to the macro is to an address within the currently open row, then the access to the memory array can be suppressed, and the correct wide word encompassing the desired data can be brought out simply by changing the multiplexor control values. This takes considerably less time than a full access. In many memory chips, when reads of consecutive data words are desired, this fast multiplexing is used repeatedly, and is termed page mode access. Given a very conservative row access time of 20 ns and a page access time of 2 ns, the bandwidth available from a single memory macro is over 50 Gbit/s. A central issue in PIM research is to discover ways to take advantage of this extraordinary bandwidth.

A “SIMD” branch of PIM architecture places multiple dataflows, often as narrow as 1 bit, next to the memory macro's row output, running more or less synchronously. Variations of this approach date all the way back to the DAP and the CM-2 [9] (when memory densities were small), through the early and mid 1990s when chips such as TERASYS [10] and the Linden DAAM [15] had several megabits of memory on board, to the more recent announcement of the Micron YUKON chip [12], which uses state-of-the-art DRAM. We have coined the term ASAP (At the Sense Amps Processing [4]) to refer to a wide set of ALUs positioned close to a memory macro so that all the bits made available on an access can be processed simultaneously. The V-IRAM chip [14] integrates several high-speed vector pipelines along with an integrated controller on a chip with significant embedded DRAM. The DIVA chip [8] combined a dense SRAM with a pipelined CPU and a set of ALUs.

Multiple independent processors on a chip are a more recent phenomenon. Perhaps the first was EXECUBE [13], which supported eight separate CPUs on a single die where the memory was state-of-the-art DRAM. These CPUs were interconnected with DMA channels to form a binary hypercube on a chip. (In addition to running in a MIMD mode, EXECUBE also allowed constraining any number of the on-chip processors to run in SIMD mode.) The POWER4 [23] is built on an all-logic technology, but nevertheless supports a very dense cache hierarchy on top of two separate microprocessor cores. The BLUE GENE chip [7] supports a large number of separate cores, all sharing the on-chip memory in an SMP-like configuration.
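As a rough sanity check on the bandwidth figure quoted at the start of this subsection (a back-of-the-envelope sketch under the stated assumptions of a 20 ns row access, 2 ns page-mode accesses, and eight 256-bit wide words per 2048-bit row), streaming an entire open row costs one row access plus seven page-mode accesses:

```latex
\[
  \mathrm{BW}_{\mathrm{row}} \approx
  \frac{2048~\mathrm{bits}}{20~\mathrm{ns} + 7 \times 2~\mathrm{ns}}
  \approx 60~\mathrm{Gbit/s},
  \qquad
  \mathrm{BW}_{\mathrm{page}} =
  \frac{256~\mathrm{bits}}{2~\mathrm{ns}} = 128~\mathrm{Gbit/s},
\]
```

both comfortably above the 50 Gbit/s cited in the text.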

Another important dimension of the PIM architectural design space is the external interface. Some chips, such as the Mitsubishi M32R/D [16], were conceived as single-chip systems and have interfaces more reminiscent of a conventional microprocessor, such as external memory and I/O interfaces. More pertinent to the present research are those chips designed to be part of large, scalable systems. A number of chips, including TERASYS, DIVA, and Yukon, have primary interfaces that allow them to “look like memory chips” to external systems, albeit almost always with some extra protocols that allow the “intelligence” on board to be touched and activated. Such interfaces allow potentially large numbers of them to be assembled into a single memory subsystem and connected to a classical computer. Other chips, such as EXECUBE and BLUE GENE, have as their primary interface multiple chip-to-chip communication ports, allowing large arrays of such chips to make up a stand-alone parallel processor built from a single part type. DIVA and Yukon have both memory-like interfaces and ports for independent PIM-to-PIM communication.

2.2 Multi-PIM Systems and Communication

Clearly, various combinations of the suite of architectural techniques discussed above support the full spectrum of conventional programming idioms. Vector and short-vector SIMD processing and SMP processing are all possible on chip. Among arrays of PIM chips, programming models from SPMD to message passing are readily supported. This, however, is not the end. First suggested by work from the PIM FAST project [4], and amplified in other projects such as HTMT [22][4], the concept of parcels (PARallel Communication ELements) expands the semantics of a classical memory read or write operation to accommodate a PIM-enhanced memory. In [4], we introduced the notion of a microserver as a PIM component to service parcels. In their simplest form, parcels and microservers could be used to service split-phase memory transactions, similar to the message-driven memory modules proposed in dataflow architectures such as [2] [20] [18]. More generally, the parcel/microserver model can be used to deliver any command, with arguments, to a location in memory. This mechanism could be used to remotely trigger actions ranging from simple reads and writes, to atomic updates, to invoking methods on objects. This use of the parcel/microserver model has much in common with, and is based upon, message-driven computation in systems such as active messages [24], the threaded abstract machine (TAM) [5], and the J-Machine [6] [19]. Moving still further up the programming hierarchy, invoking methods in memory becomes an ideal match to actor-based programming languages such as Smalltalk [11], where all processing is based on sending messages to objects, with messages containing method names and arguments.

The parcel/microserver model converts an operation that may have involved multiple remote memory accesses (to fetch components of the object), each involving two-way transactions (read and data return), into a single one-way transaction that runs all of the method code as close as possible to the object. This results not only in lower latency, but also in reduced network load. Methods invoked on objects in memory tend to have short code lengths. Further, if the remote method idiom is used extensively in a parallel system, it becomes likely that multiple method calls will impinge concurrently on the same PIM node. Together, these effects lead to frequent switching of threads with relatively short run lengths, which makes the overhead of starting and swapping threads an important consideration. For a conventional RISC processor, thread swapping is a heavyweight activity, since the contents of the register file must be copied out to memory. Also, increasing the size of the register file has deleterious effects on power and cycle time, which has limited the number of hardware-supported threads in RISC processors to a very small number, such as two [3]. Rather than placing a RISC processor in memory, the PIM Lite ISA was designed from the outset to support multiple threads, and to minimize the cost of moving thread state in and out of the processor by keeping most of the thread state in memory at all times. Furthermore, the matching microarchitecture supports SMT processing, allowing multiple such threads to be in execution concurrently.

Once we have multi-threading implemented efficiently on individual PIM nodes, we can extend the concept to cover an entire system. Very lightweight threads such as those defined in PIM Lite have very little state, meaning that there is minimal cost in packaging the thread state when a reference to an off-chip object is encountered. If done correctly, such a package is minimally different from a parcel as defined above, meaning that we can now treat parcels not just as remote method invocations, but as “threads in transit.”

This, in turn, means that we can now consider execution models where a program “follows” the data references it makes, rather than making the data come to some fixed execution site. To say that this is radically different from our classical models is an understatement. Some early work [17] in determining the size of state that needs to be moved versus the “lifetime” of a thread on a node indicates that this may be entirely reasonable: the lifetimes of such threads on individual nodes might very well be long enough to overcome any overhead of thread packing and unpacking.

Finally, the capability to support multi-threading everywhere permits yet another interesting class of programming idiom - introspection. Various kinds of monitor threads can roam a system performing such activities as consistency checking, load balancing, checkpoint collection, and garbage collection. All of this can be done independently of application-directed activities, particularly if conventional-like CPUs are performing much of an application's code. PIM Lite and its follow-ons are being designed to allow experimentation with such concepts.

3 The PIM Lite Architecture

3.1 System Overview

PIM Lite is a reference PIM implementation developed to explore memory-centric architecture and implementation, and to experiment with the range of programming idioms discussed in the previous section. A PIM Lite node is defined as a physical block of memory with an associated CPU. A PIM Lite chip contains 4 nodes connected on a bus. PIM Lite is a “lightweight” PIM implementation in several respects:

1. The architecture is designed to minimize the amount of program state in the CPU and hence minimize the number of bits that must be transferred in creating or swapping threads. Further, all program state is tightly encapsulated in wide-words that correspond to the size of a row in memory, to maximize the use of on-chip memory bandwidth.

2. The VLSI implementation minimizes the area of the processing logic. The chip floorplan follows a pitch-matched, bit-sliced design style that aligns the wide-word processing logic with columns of memory. Further, the use of multi-threading eliminates the need for complex hazard detection and forwarding logic within the pipeline.

3. As an exploratory implementation, low cost and ease of design and test took precedence over memory density and clock rate. A single PIM Lite node contains 4K bytes of SRAM program memory and 4K bytes of data memory, organized in rows as wide-words of 128 bits. Instructions, data, and addresses are all 16 bits wide, packaged in 128-bit wide-words.

PIM Lite chips may be interconnected on a bus (or a more complex network) to form a PIM fabric. At the highest level, the entire fabric may be viewed as a single physical address space. Logically, this address space may be partitioned into several virtual address spaces or domains, each supporting a separate PIM process. Physically, each domain may be striped across the entire set of PIM nodes, as shown in Figure 2, with large distributed data structures extending across many nodes. PIM Lite supports only a single domain.

A PIM process consists of a collection of threads distributed across the PIM fabric. A given thread executing in the PIM system “sees” memory in two segments: global memory and private, local memory. Global memory is the main virtual memory of a process, containing both data and instructions. In general, the physical location of a global virtual address may be on any PIM node in the fabric. A small region of the global address space, however, always maps to physical locations on the current node, and is reserved for system resources such as domain-specific configuration parameters and node-specific, memory-mapped I/O locations. A global memory virtual address consists of a page number and an offset; a local page table on each node keeps track of which pages are currently resident on that node for each domain. PIM Lite assumes a single page per node.

For simplicity, PIM Lite is designed to operate on a standard 16-bit memory bus; more efficient but more complex networks could have been used. Figure 3 illustrates the PIM Lite node pinout and bus configuration. An independent PIM Lite node supports three modes of operation, summarized below:

Figure 2: A PIM Lite system with data structures distributed across multiple nodes.

Figure 3: PIM Lite pinout and bus configuration (per-node signals A, D, R/W, CS, and Ready; daisy-chained bus arbiter with Request, Release, Grant_in, and Grant_out).

• Passive memory mode: In this mode a PIM Lite node behaves as a passive memory module. A PIM node enters passive memory mode when its chip select (CS) pin is asserted, and responds by setting its Ready pin.

• PIM mode: When a PIM Lite node is not externally selected as a memory, it may be processing threads internally.

• Bus master mode: In order to send a parcel, a PIM Lite node must be granted control of the bus as a master. PIM Lite uses a simple daisy-chained grant to request, receive, and release control of the bus.

3.2 Execution Model

Conventional CPU architectures employ a set of named registers that are part of the CPU itself to hold much of the state of a program being executed. When the CPU switches threads, all of this state information must be copied out of the CPU registers to memory and the state of the new thread must be loaded in, which is a costly operation in both time and energy. In PIM Lite, nearly all the state information of a thread is taken out of the CPU and kept in memory at all times. This not only eliminates the need for explicitly copying thread state in and out of the CPU, but also greatly reduces the number of devices and hence the area of the CPU. In place of named registers in the CPU, thread state is packaged in data frames of memory. Physically, a frame is simply a region of storage locations within the regular memory system. Information in a particular frame is uniquely accessed by two indices: a frame pointer (FP) that indicates the memory address of the first location in the frame, and an offset to subsequent storage locations in the frame.

With the use of frames, virtually all “classical” instructions become operations on “memory” locations, expressed as operations on values within a single frame, between global memory and a frame, or between two frames. To maximize the bandwidth of data access, the frame size should correspond to the size of a row in memory. In PIM Lite, data frames have a fixed size of 4 wide-words, or 32 16-bit words, comparable to the size of a typical RISC register file. The execution state of a thread in the PIM Lite CPU is completely described by two memory addresses: a frame pointer, FP, which points to the starting location of the data frame, and an instruction pointer, IP, which points to the current instruction. In a manner similar to that used earlier in dataflow architectures [21] [18] [20] [1], PIM Lite bundles these addresses as a pair <FP, IP>, called a continuation, that forms the basic unit of execution. Because data frames are integer multiples of memory wide-words, threads can be swapped at the fastest rate physically allowed by the on-chip memory system, as quickly as a single random-data access into the memory array. A thread pool (or continuation pool) manages the scheduling of threads in the system, the details of which are discussed in a later section.
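The continuation abstraction can be summarized in a few lines of C. This is an illustrative sketch only; the type and function names, the pool size, and the insertion policy are assumptions made for exposition and are not taken from the PIM Lite implementation.

```c
#include <stdint.h>

/* A continuation is just an <FP, IP> pair: the address of the thread's data
 * frame plus the address of its next instruction. */
typedef struct {
    uint16_t fp;   /* frame pointer                                  */
    uint16_t ip;   /* instruction pointer                            */
} continuation_t;

#define POOL_SLOTS 16   /* assumed capacity, for illustration only */

typedef struct {
    continuation_t slot[POOL_SLOTS];
    uint8_t        valid[POOL_SLOTS];   /* 1 if the slot holds a live thread */
} thread_pool_t;

/* Each executed instruction produces zero, one, or two continuations that
 * are written back into the pool (see Section 3.3). */
static void pool_insert(thread_pool_t *pool, continuation_t c)
{
    for (int i = 0; i < POOL_SLOTS; i++) {
        if (!pool->valid[i]) {
            pool->slot[i] = c;
            pool->valid[i] = 1;
            return;
        }
    }
    /* A real node would back-pressure or spill to memory; omitted here. */
}
```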

3.3 Instruction Set

The PIM Lite instruction set differs from a typical RISC ISA in two important ways:

1. the instructions are designed to take full advantage of the high bandwidth of on-chip memory, with the ability to operate on 128 bits of data (eight 16-bit words) concurrently. Instructions include thread operations such as fork and join, as well as vector/SIMD ALU operations,

2. rather than modifying a single program counter, all instructions generate either zero, one, or two continuations that are added to the thread pool.

The following sections outline the PIM Lite instruction classes.

3.3.1 Unified Vector/Scalar Frame Organization

Other PIM chips supporting vector/SIMD operations, such as IRAM [14] and DIVA [8], maintain separate vector and scalar register files. This adds to the weight of the CPU state that must be transferred to memory during thread swaps, and may lead to inefficiencies in sections of code that mix vector and scalar operations. In contrast, PIM Lite employs a unified frame organization that allows frame entries to be accessed either as scalar values or as vectors, leading to easy mixing of vector and scalar code and keeping all program state tightly packaged as wide-words in memory. A PIM Lite frame may be viewed as a two-dimensional array of 16-bit words with 4 rows and 8 columns, where each row is a wide-word and each column represents a displacement into a wide-word. Individual scalar values (words) in a frame are denoted with the double subscript [w, d], where w is the wide-word number 0...3 and d is the displacement 0...7. An entire wide-word is denoted by the single subscript [w].
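A minimal C sketch of this unified frame view follows; the helper names are invented for illustration, and only the 4 x 8 geometry of 16-bit words comes from the text.

```c
#include <stdint.h>

#define FRAME_ROWS 4    /* wide-words per frame        */
#define FRAME_COLS 8    /* 16-bit words per wide-word  */

/* A frame is a 4 x 8 array of 16-bit words; each row is one 128-bit
 * wide-word. */
typedef uint16_t frame_t[FRAME_ROWS][FRAME_COLS];

/* Scalar access [w, d]: wide-word w, displacement d. */
static inline uint16_t frame_scalar(const frame_t f, int w, int d)
{
    return f[w][d];
}

/* Vector access [w]: a whole 128-bit wide-word, returned as a row pointer. */
static inline const uint16_t *frame_vector(const frame_t f, int w)
{
    return f[w];
}
```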

3.3.2 Arithmetic Operations

The PIM Lite arithmetic instructions, summarized in Table 1, support scalar, vector, or vector/scalar formats. The vector format allows entire wide-words of data to be operated on concurrently. For example, Vadd (vector add) adds two vectors of 8 16-bit words of data and stores the resulting 128-bit vector. Vector/scalar instructions operate on one scalar and one vector operand. The VSadd (vector/scalar add) instruction concurrently adds a 16-bit scalar to each of the eight 16-bit words in the vector. This is supported by broadcasting the scalar to each of the 16-bit datapath components, which execute concurrently. The scalar-scalar instruction add has semantics similar to a typical RISC instruction, operating on two 16-bit scalar quantities. In PIM Lite, however, scalar operations are implemented as a special case of the vector/scalar operation, where the source operand is broadcast to each 16-bit datapath but write-back is restricted to the destination datapath (a behavioral sketch of these semantics follows Table 1).

The permute instruction allows wide-words of data to be rearranged at the word level. The permute instruction takes two vector operands: a data vector whose elements are to be rearranged, and a “permutation vector” (also called a WACI, or Wide ALU Control Instruction) that specifies which field from the original data vector should appear in the corresponding field after rearrangement. The permute instruction is especially useful in operations such as data alignment and sorting.

Format           Instruction               Memory Operation                                        Continuations
vector-vector    Vop [w]dst, [w]src        [w]dst ← [w]dst op [w]src                               <FP, IP+1>
                 permute [w]dst, [w]pos    permute elements of [w]dst by position vector [w]pos    <FP, IP+1>
vector-scalar    VSop [w]dst, [w, d]src    [w]dst ← [w]dst op [w, d]src                            <FP, IP+1>
scalar-scalar    op [w, d]dst, [w, d]src   [w, d]dst ← [w, d]dst op [w, d]src                      <FP, IP+1>

Table 1: ALU instructions, op = (add, sub, and, or, not, move).
3.3.3 Location-Aware Memory Transfer Operations

For applications that use large dynamic data structures, it is generally not possible to determine at compile time on which PIM node a given object resides. When a PIM thread needs access to data that is not on the current node, there are two choices: either bring the data to the thread, or move the thread to the data. The latter option is especially attractive when the thread state is small and the likelihood of future references to data on the new node is high. In either case, the thread must be able to determine whether a given virtual address is on the current node or not. In PIM Lite, this information is contained in the local page table. If an address is not local, the decision on how to proceed needs to be under user, rather than system, control. To handle this scenario, PIM Lite uses conditional memory transfer instructions that check the local page table. Load and store operations in PIM Lite transfer entire wide-words between global memory and data frames if the global memory address is on the current PIM node, and then skip the next sequential instruction. Otherwise, they fall through to a user instruction that specifies how to respond. These conditional load and store instructions are summarized in Table 2, and sketched in code after the table.

Instruction                Memory Operation                                          Continuations
load [w]dst, [w, d]src     if [w, d]src is local, then [w]dst ← MEM[w, d]src;        if local, <FP, IP+2>; else <FP, IP+1>
                           else none
store [w, d]dst, [w]src    if [w, d]dst is local, then MEM[w, d]dst ← [w]src;        if local, <FP, IP+2>; else <FP, IP+1>
                           else none

Table 2: Conditional memory/frame transfer instructions.
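A behavioral sketch of the locality test and the conditional load, in C, is shown below. The page-table layout, the helper names, and the single-page-per-node simplification are illustrative assumptions; the real instructions operate on frames and wide-words as defined in Table 2.

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical page-table check: a 16-bit global virtual address splits into
 * a page number and an offset, and each node records the single page that is
 * resident locally (PIM Lite assumes one page per node). */
#define OFFSET_BITS 12                      /* assumed split, for illustration */
#define PAGE(a)     ((uint16_t)((a) >> OFFSET_BITS))

static uint16_t resident_page;              /* set up by system software */

static bool is_local(uint16_t vaddr)
{
    return PAGE(vaddr) == resident_page;
}

extern const uint16_t *node_memory(uint16_t vaddr);   /* assumed accessor */

/* Conditional load of Table 2: copy a wide-word into the frame and "skip the
 * next instruction" (return true) only when the address is local; otherwise
 * fall through (return false) so a user-supplied instruction, e.g. one that
 * ships the thread off-node as a parcel, decides what to do. */
static bool cond_load(uint16_t dst[8], uint16_t src_addr)
{
    if (!is_local(src_addr))
        return false;
    memcpy(dst, node_memory(src_addr), 8 * sizeof(uint16_t));
    return true;
}
```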

3.3.4 Control Flow Instructions

The standard control instructions include jumps, forks, and joins. The jumps include jmp (jump always), jz (jump on zero), and jn (jump if negative), each of which has a target specified by a target instruction pointer in the frame location [w, d]tar. These instructions are listed in Table 3.

Instruction                 Memory Operation    Continuations
jmp [w, d]tar               none                <FP, [w, d]tar>
jz [w, d]tar, [w, d]cmp     none                if [w, d]cmp = 0, then <FP, [w, d]tar>; else <FP, IP+1>
jn [w, d]tar, [w, d]cmp     none                if [w, d]cmp < 0, then <FP, [w, d]tar>; else <FP, IP+1>

Table 3: Jumps and branches.

The fork and join instructions manage the creation and termination of threads that operate on the same data frame as their parent; Table 4 describes these instructions. The fork instruction creates a new thread that begins execution at the instruction pointer address specified in the frame location [w, d]tar, while the original thread continues execution at IP+1; it does so by inserting two continuations into the thread pool. The fork instruction also increments a semaphore in the frame at [w, d]sem. The join instruction uses this semaphore to join two threads that share a common frame into one. If the value of the semaphore is positive, the join instruction decrements the semaphore and terminates the thread by producing no continuation. If the semaphore is equal to zero, it produces a continuation pointing to the next sequential instruction.

The start instruction creates a new thread with a different data frame from the calling thread. It takes a single wide-word as an argument containing a list of parameters for the instruction. The first two locations in the wide-word, [w, 0]src and [w, 1]src, contain the frame pointer and instruction pointer, respectively, for the new thread. The remaining six locations are general-purpose arguments. When the start instruction is executed, the entire wide word is copied into the new frame, and the continuation <[w, 0]src, [w, 1]src> is inserted into the pool. The start instruction effectively terminates the calling thread; in order for the calling thread to continue execution, a fork should be executed before the start. PIM Lite has no explicit call or jump-and-link mechanism. Instead, linear procedure call sequences may be constructed from the fork, start, and join instructions in much the same manner as was used in the hybrid dataflow architecture discussed in [18]. (A behavioral sketch of fork and join follows Table 4.)

Instruction                  Memory Operation                                    Continuations
fork [w, d]tar, [w, d]sem    [w, d]sem ← [w, d]sem + 1                           <FP, [w, d]tar>, <FP, IP+1>
join [w, d]sem               if [w, d]sem > 0, [w, d]sem ← [w, d]sem − 1;        if [w, d]sem > 0, none; else <FP, IP+1>
                             else none
stop                         none                                                none
start [w]src                 copy [w]src to corresponding wide word              <[w, 0]src, [w, 1]src>
                             in frame at [w, 0]src

Table 4: Thread lifetime instructions.
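The semantics of fork and join can be sketched in C as below, using a continuation/pool abstraction like the one in Section 3.2. The frame accessors and the pool interface are assumptions for illustration only.

```c
#include <stdint.h>

typedef struct { uint16_t fp, ip; } continuation_t;

/* Assumed interfaces to the thread pool and the frame memory. */
extern void     pool_insert(continuation_t c);
extern uint16_t frame_read(uint16_t fp, int w, int d);
extern void     frame_write(uint16_t fp, int w, int d, uint16_t v);

/* fork: spawn a thread at [w, d]tar sharing the caller's frame, increment
 * the semaphore at [w, d]sem, and continue the caller at IP+1. */
static void do_fork(uint16_t fp, uint16_t ip,
                    int wt, int dt, int ws, int ds)
{
    frame_write(fp, ws, ds, (uint16_t)(frame_read(fp, ws, ds) + 1));
    pool_insert((continuation_t){ fp, frame_read(fp, wt, dt) });
    pool_insert((continuation_t){ fp, (uint16_t)(ip + 1) });
}

/* join: if the semaphore is positive, decrement it and retire this thread
 * (no continuation); if it is zero, continue at IP+1. */
static void do_join(uint16_t fp, uint16_t ip, int ws, int ds)
{
    uint16_t sem = frame_read(fp, ws, ds);
    if (sem > 0)
        frame_write(fp, ws, ds, (uint16_t)(sem - 1));   /* thread terminates */
    else
        pool_insert((continuation_t){ fp, (uint16_t)(ip + 1) });
}
```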

3.3.5 Synchronization, Atomicity, and Parcel Processing

The fork and join instructions may be used to synchronize two threads that share a common data frame. Threads can synchronize on global memory locations through atomic memory operations implemented in a critical section of a program. The PIM Lite instructions enter and exit define the beginning and end of a critical section. Once the enter instruction has executed, the currently executing thread cannot be preempted by another thread until an exit instruction is issued. The critical section instructions effectively lock access to memory at the granularity of an entire PIM node. Future PIM implementations will place full/empty bits in the memory to support reservations at a granularity of a wide-word or less.

One thread can start a new thread on another PIM node (or on itself) by sending a parcel to it. In PIM Lite, a parcel buffer consists of a memory-mapped wide-word on each PIM node, together with another memory-mapped single word that serves as a “buffer full” signal. The format of a PIM Lite parcel is exactly the same as the format of an argument to a start instruction, that is, an <FP, IP> pair followed by 6 scalar arguments. When the PIM Lite thread scheduler detects a parcel in the buffer (via the “full” signal), it processes the contents of the parcel buffer as a start instruction ahead of the threads currently in the pool. If the parcel must be processed atomically, then the first instruction at the parcel's IP should enter a critical section.
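The parcel format described above maps naturally onto a single wide-word. The sketch below assumes invented memory-mapped addresses for the parcel buffer and its "full" flag; only the <FP, IP> plus six-argument layout comes from the text.

```c
#include <stdint.h>

/* One parcel = one 128-bit wide-word: an <FP, IP> pair plus six scalars,
 * exactly the layout of a start argument. */
typedef struct {
    uint16_t fp;        /* frame pointer for the new thread        */
    uint16_t ip;        /* instruction pointer for the new thread  */
    uint16_t arg[6];    /* general-purpose arguments               */
} parcel_t;             /* 8 x 16 bits = one wide-word             */

/* Memory-mapped buffer and "buffer full" word; addresses are invented. */
#define PARCEL_BUF  ((volatile parcel_t *)0xFFE0)
#define PARCEL_FULL ((volatile uint16_t *)0xFFF0)

/* Sending side: fill the buffer on the target node and raise the full flag;
 * the receiving scheduler then treats the contents as a start instruction
 * ahead of the threads already in its pool. */
static void send_parcel(const parcel_t *p)
{
    *PARCEL_BUF  = *p;
    *PARCEL_FULL = 1;
}
```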

4 PIM Lite Microarchitecture

The basic microarchitecture of PIM Lite is a simple four-stage pipeline composed of Pair-fetch, Frame-read, Execute, and Write-back stages, as illustrated in Figure 4. The Pair-fetch stage reads and decodes an <FP, IP> pair from the thread pool and also reads an instruction from memory. Next, the Frame-read stage reads the offsets encoded in the opcode and uses them to index into the frame memory. During the Execute stage, which contains the ALU and permutation hardware, the operation specified by the instruction is executed. Finally, in the Write-back stage, the frame data is modified and an updated <FP, IP> pair, multiple pairs (in the case of thread spawning), or a null pair (in the case of thread retirement) is written into the thread pool.

Figure 4: PIM Lite pipeline.

Note that there is no hazard detection, forwarding, or branch prediction hardware at all to control the flow of instructions through the pipeline. The thread pool's management of multi-threading eliminates data and control dependencies between instructions in the pipeline. This allows such supporting hardware to be excluded from the pipeline, a large savings in control complexity for a minor speed penalty in the design. Thread pool operation and implementation are discussed in greater detail in a later section.

4.1 Datapath Implementation

The PIM Lite datapath is implemented in a bit-sliced organization, with each of the slices pitch-matched to the width of the sense amplifiers and output drivers of the memory macro. This ensures that the datapath accesses the full width of the memory with the most efficient interconnect. Figure 5 shows a section of the layout detailing the pitch-matching between the memory array and the logic. The following sections discuss the implementation of the datapath components.

Figure 5: Pitch-matched datapath floorplan, showing bit-sliced logic aligned with the memory array.

4.1.1 Frame Memory

To avoid stalling the pipeline, the frame memory must simultaneously support two read operations and one write operation. For simplicity, in PIM Lite the entire frame memory is implemented as a dedicated 3-port memory block. In a more general, high-density PIM, however, this approach would not be acceptable for several reasons: first, it “hardwires” the size of the frame memory, and second, adding ports to a cell greatly reduces the memory density (and moreover, is not currently feasible for DRAM technology). An alternative to adding multiple ports to the memory is to retain a single-ported memory and add a multi-ported frame cache, indexed by frame pointer, to the pipeline. Note that the primary role of the frame cache is to expand bandwidth, rather than to reduce latency to memory. A small, multi-ported cache for frames provides the additional ports to memory needed by the CPU without sacrificing density in main memory, and occupies little more space than a conventional register file.

Thus, when a new frame is created, if a part of the frame cache is allocated to the new frame, then individual values can be accessed from the cached frame at speeds rivaling those of classical register files, while still preserving the view that the frame is “in memory.” This provides tremendous simplifications in processing, especially in multi-threaded environments. If a new thread is to be started and there is no room in the frame cache, a simple cache-line replacement can be performed, without having to explicitly store the registers associated with the thread's state. Likewise, when restarting a thread not currently in the frame cache, the first reference causes a cache miss, which brings the frame in from memory, again without a complex and extended series of loads and stores of individual registers. Finally, since all frames are “in memory,” the fact that the frame cache functions as a classical cache makes inter-frame accesses look no different from any other memory access. Together, the combination of a minimal-state ISA using frames for operands, coupled with a frame cache, yields high performance, simplicity of programming, and the ability to efficiently support very high levels of multi-threading.
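To make the frame-cache idea concrete, here is a toy C sketch of a fully associative, whole-frame cache indexed by frame pointer. The size, replacement policy, and memory interface are all assumptions; PIM Lite itself uses a dedicated 3-port frame memory rather than a cache.

```c
#include <stdint.h>
#include <stdbool.h>

#define FC_LINES    8               /* assumed number of cached frames      */
#define FRAME_WORDS 32              /* 4 wide-words x 8 sixteen-bit words   */

typedef struct {
    uint16_t fp;                    /* tag: frame pointer                   */
    bool     valid;
    uint16_t data[FRAME_WORDS];     /* cached copy of the whole frame       */
} fc_line_t;

static fc_line_t fc[FC_LINES];

/* Assumed whole-frame (wide-word granularity) memory transfers. */
extern void mem_read_frame(uint16_t fp, uint16_t *dst);
extern void mem_write_frame(uint16_t fp, const uint16_t *src);

/* On a miss, evict one line back to memory and fill from memory; there is
 * no per-register save/restore, only whole-frame transfers. */
static fc_line_t *fc_lookup(uint16_t fp)
{
    for (int i = 0; i < FC_LINES; i++)
        if (fc[i].valid && fc[i].fp == fp)
            return &fc[i];

    fc_line_t *victim = &fc[0];     /* trivial replacement choice */
    if (victim->valid)
        mem_write_frame(victim->fp, victim->data);
    mem_read_frame(fp, victim->data);
    victim->fp = fp;
    victim->valid = true;
    return victim;
}
```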

4.1.2 Broadcast Bus

The broadcast bus is a 16-bit bus that travels lengthwise across the chip, just before the ALUs, to support vector/scalar and scalar/scalar operations where the operands are not aligned in the same word column. Unaligned scalar/scalar arithmetic operations are implemented as a special case of a vector/scalar operation, where only the word column aligned with the destination operand performs the write-back to the frame cache.

Figure 6: Broadcast bus.

4.1.3 Permutation Network

The permutation network, which supports the permute instruction for rearrangement of vector elements, is implemented in PIM Lite as a full 8-way crossbar switch. The crossbar switch itself is implemented as eight 8:1 multiplexors with both the data and select lines running vertically through the module in a pitch-matched fashion, which allows both the data and permutation vectors to come from wide words in memory. Because vectors in PIM Lite contain only 8 elements, the area and delay penalties for a full crossbar are not prohibitive. For larger implementations, a multistage network may be necessary.

4.2 Thread Scheduling

The thread pool is responsible for scheduling fine-grain threads in a PIM Lite node by storing <FP, IP> pairs and issuing them into the pipeline. In any pipelined processor that exploits fine-grained parallelism, there is a possibility of hazards resulting from inter-thread dependencies. The PIM Lite thread pool ensures that there will be no dependencies in the pipeline by issuing threads with different FPs at each cycle, with at least as many cycles as there are pipeline stages between issuing threads with the same FP. If there are not enough threads in the system to fulfill this condition, then the pool issues a ghost thread, which is the equivalent of a NOP. Performance of this scheme is clearly not optimal when ghost threads must be executed; it is therefore up to the programmer/compiler to ensure that the thread pool has enough distinct threads to prevent the issuing of ghost threads. However, even with simple producer-consumer applications on streams, we have found it simple to extract at least four-way parallelism from the types of operations that one would perform in the memory system.

At a high level, the thread pool consists of four bins, with each bin composed of four slots, where each slot contains one <FP, IP> pair. In addition, each slot has an associated valid bit, which indicates whether or not a pair is located there. Ideally, the thread pool can store 16 independent threads. Whenever any threads share the same frame pointer, however, they are placed in the same bin. We can therefore have up to four simultaneous threads that share the same frame at any given time. At each cycle, a modified round-robin scheduling approach removes an <FP, IP> pair (or a ghost thread) from the pool and inserts it into the processor pipeline for execution. A special “critical bin” is used for queuing <FP, IP> pairs when a program is in a critical section, to ensure that the critical thread is not preempted.

Physically, each bin in the thread pool is composed of four pairs of 16-bit words stored in a three-port register array. This results in 128 bits of data, which is pitch-matched to the width of the macro memory. Between the bins is combinational circuitry that includes 16-bit comparators to compare incoming frame pointers, as well as a number of counters to handle different thread issue situations in the thread pool. The thread pool is accessed heavily during each cycle (in fact, it is accessed in two pipe stages). This, in conjunction with the fairly complex write-back circuitry of the thread pool, means that the thread pool itself is the critical path of the pipeline during operation. The high logical complexity of the thread pool could theoretically result in drastically reduced performance; however, virtually all comparisons and other logical operations are done in parallel.
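The issue rule can be captured in a short C sketch: continuations are binned by frame pointer, a bin that has issued within the last pipeline-depth cycles is skipped, and a ghost thread is reported when nothing is eligible. The bin and slot counts follow the text; everything else (the round-robin pointer, cycle counting, and initialization) is an illustrative assumption.

```c
#include <stdint.h>
#include <stdbool.h>

#define BINS       4
#define SLOTS      4
#define PIPE_DEPTH 4

typedef struct { uint16_t fp, ip; bool valid; } slot_t;

typedef struct {
    slot_t   slot[BINS][SLOTS];  /* all slots in a bin share one FP          */
    uint16_t last_issue[BINS];   /* cycle at which each bin last issued      */
    int      rr;                 /* round-robin bin pointer                  */
} pool_t;
/* Assumes last_issue[] is initialized so every bin is eligible at cycle 0. */

/* Returns true and fills *out when a real thread can issue this cycle;
 * returns false to indicate a ghost thread (NOP). */
static bool pool_issue(pool_t *p, uint16_t cycle, slot_t *out)
{
    for (int n = 0; n < BINS; n++) {
        int b = (p->rr + n) % BINS;
        if ((uint16_t)(cycle - p->last_issue[b]) < PIPE_DEPTH)
            continue;                       /* this FP is still in the pipe */
        for (int s = 0; s < SLOTS; s++) {
            if (p->slot[b][s].valid) {
                *out = p->slot[b][s];
                p->slot[b][s].valid = false;
                p->last_issue[b] = cycle;
                p->rr = (b + 1) % BINS;
                return true;
            }
        }
    }
    return false;                           /* issue a ghost thread */
}
```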

4.3 VLSI Floorplan of a PIM Lite Node

A complete implementation of PIM Lite has been synthesized from VHDL and is now running on an FPGA board. Silicon will be available in the first quarter of 2003. Figure 7 illustrates the layout and component sizes for a single PIM Lite node, based on 0.18 micron design rules. A single node contains approximately 500 K transistors in the memory and 60 K transistors in the CPU logic, for a total of 2.25 M transistors for a 4-node chip.

Figure 7: VLSI floorplan of a PIM Lite node and 4-node chip (major blocks include the thread pool, 4 Kbyte instruction memory, frame memory, permute network, and 4 Kbyte data memory).

5 Conclusions

PIM Lite is a prototype PIM chip developed to support lightweight multi-threading in the memory of massively parallel computer systems. The key contributions of PIM Lite are:

1. it provides a demonstration of a multi-node, multi-threaded PIM chip that supports a broad variety of programming idioms, including vector/SIMD, parcel-based active messages, and thread migration;

2. it minimizes the overhead of starting and swapping threads through the combination of minimizing the program state contained in the CPU and packing that state in wide-words that take advantage of the maximum available on-chip memory bandwidth;

3. it introduces a unified scalar/SIMD/multithreaded instruction set that operates on frames of wide-words;

4. it demonstrates the use of CPU logic circuitry pitch-matched to the width of the memory sense amplifiers to minimize CPU area and to keep interconnect short;

5. it does all of the above at a very low design and fabrication cost.

First silicon of PIM Lite will be available in the first quarter of 2003. A printed circuit card supporting multiple chips has been fabricated and will be used for prototyping application and system software. Follow-on activities include migrating to a DRAM technology, growing the ISA to handle 64-bit, multiple-domain address spaces, enhancing parcel and thread interpretation mechanisms, and expanding the set of wide-word operations.

References

[1] Arvind and Rishiyur S. Nikhil. Executing a program on the MIT tagged-token dataflow architecture. IEEE Transactions on Computers, 39(3):300–318, March 1990. Also appears in Proceedings of PARLE '87, Parallel Architectures and Languages Europe, vol. 2, pp. 1–29.

[2] Arvind and R. S. Nikhil. Executing a program on the MIT tagged-token dataflow architecture. In J. W. de Bakker, A. J. Nijman, and P. C. Treleaven, editors, PARLE ’87, Parallel Architectures and Languages Europe, Volume 2: Parallel Languages. Springer-Verlag, Berlin, DE, 1987. Lecture Notes in Computer Science 259.

[3] J. M. Borkenhagen, R. J. Eickemeyer, R. N. Kalla, and S. R. Kunkel. A multithreaded PowerPC processor for commercial servers. IBM Journal of Research and Development, 44(6):885–898, November 2000.

[4] Jay B. Brockman, Peter M. Kogge, Vincent W. Freeh, Shannon K. Kuntz, and Thomas L. Sterling. Microservers: A new memory semantics for massively parallel computing. In Conference Proceedings of the 1999 International Conference on Supercomputing, pages 454–463, Rhodes, Greece, June 20–25, 1999. ACM SIGARCH.

[5] David E. Culler, Anurag Sah, Klaus E. Schauser, Thorsten von Eicken, and John Wawrzynek. Fine-grain parallelism with minimal hardware support: A compiler-controlled threaded abstract machine. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 164–175, April 1991. Published in Vol. 26, No. 4, April 1991. Also as Tech Report UCB-CSD-90-594, University of California Berkeley, Department of Computer Science.

[6] W. J. Dally, A. Chien, J. A. S. Fiske, G. Fyler, W. Horwat, J. S. Keen, R. A. Lethin, M. Noakes, P. R. Nuth, and D. S. Wills. The message driven processor: An integrated multicomputer processing element. In International Conference on Computer Design, VLSI in Computers and Processors, pages 416–419, Los Alamitos, Ca., USA, October 1992. IEEE Computer Society Press.

[7] Monty Denneau. Blue gene. In SC2000: High Performance Networking and Computing, pages 35–35, Dallas, TX, November 2000. ACM.

[8] Mary Hall, Peter Kogge, Jeff Koller, Pedro Diniz, Jacqueline Chame, Jeff Draper, Jeff LaCoss, John Granacki, Apoorv Srivastava, William Athas, Jay Brockman, Vincent Freeh, Joonseok Park, and Jaewook Shin. Mapping irregular applications to DIVA, a PIM-based data-intensive architecture. In Supercomputing (SC'99), pages ??–??, Portland, Oregon, November 1999. ACM Press and IEEE Computer Society Press.

[9] W. Daniel Hillis and Lewis W. Tucker. The CM-5 Connection Machine: A scalable supercomputer. Communications of the ACM (CACM), 36(11):30–40, November 1993.

[10] Ken Iobst, Maya Gokhale, and Bill Holmes. Processing in memory: The Terasys massively parallel PIM array. IEEE Computer, 28(4):23–??, April 1995.

[11] A. Kay. SMALLTALK. In R. A. Guedj, P. J. W. ten Hagen, F. R. A. Hopgood, H. A. Tucker, and D. A. Duce, editors, Methodology of Interaction, Proc. of the IFIP Workshop on Methodology of Interaction, Seillac II, pages 7–11, 15–17, 1980.

[12] Graham Kirsch. Active memory device delivers massive parallelism. In Microprocessor Forum, San Jose, CA, October 2002.

[13] P. M. Kogge. EXECUBE - A new architecture for scalable MPPs. In Dharma P. Agrawal, editor, Proceedings of the 23rd International Conference on Parallel Processing. Volume 1: Architecture, pages 77–84, Boca Raton, FL, USA, August 1994. CRC Press.

[14] Christoforos E. Kozyrakis, Stylianos Perissakis, David Patterson, Thomas Anderson, Krste Asanović, Neal Cardwell, Richard Fromm, Jason Golbus, Benjamin Gribstad, Kimberly Keeton, Randi Thomas, Noah Treuhaft, and Katherine Yelick. Scalable processors in the billion-transistor era: IRAM. IEEE Computer, 30(9):75–78, September 1997.

[15] G. Lipovski and C. Yu. The dynamic associative access memory chip and its application to SIMD processing and full-text database retrieval. In IEEE International Workshop on Memory Technology, Design and Testing, pages 24–33, San Jose, CA, August 1999. IEEE Computer Society.

[16] Mitsubishi Corporation. M32R/D Datasheet, 1997.

[17] Richard C. Murphy, Peter M. Kogge, and Arun Rodrigues. The characterization of data intensive memory workloads on distributed PIM systems. Lecture Notes in Computer Science, 2107:85–85, 2001.

[18] Rishiyur S. Nikhil and Arvind. Can dataflow subsume von neumann computing? In Proceedings of the 16th Annual International Symposium on Computer Architecture, pages 262–272, June 1989.

[19] M. D. Noakes, D. A. Wallach, and W. J. Dally. The J-machine multicomputer: An architectural eval- uation. In Lubomir Bic, editor, Proceedings of the 20th Annual International Symposium on Computer Architecture, pages 224–236, San Diego, CA, May 1993. IEEE Computer Society Press.

[20] Gregory M. Papadopoulos and David E. Culler. Monsoon: An Explicit Token-Store Architecture. In 17th International Symposium on Computer Architecture, number 18(2) in ACM SIGARCH Computer Architecture News, pages 82–91, Seattle, Washington, May 28–31, June 1990.

[21] Burton J. Smith. A pipelined, shared resource MIMD computer. In Proceedings of the 1978 International Conference on Parallel Processing, pages 6–8, 1978.

[22] Thomas Sterling and Larry Bergman. A design analysis of a hybrid technology multithreaded architec- ture for petaflops scale computation. In Conference Proceedings of the 1999 International Conference on Supercomputing, pages 286–293, Rhodes, Greece, June 20–25, 1999. ACM SIGARCH.

[23] J. M. Tendler, J. S. Dodson, J. S. Fields, Jr., H. Le, and B. Sinharoy. POWER4 system microarchitecture. IBM Journal of Research and Development, 46(1):5–26, January 2002.

[24] Thorsten von Eicken, David E. Culler, Seth Copen Goldstein, and Klaus Erik Schauser. Active messages: A mechanism for integrated communication and computation. In Proceedings of the 19th Annual International Symposium on Computer Architecture, pages 256–266, Gold Coast, Australia, May 1992.
