PIM Lite: On the Road Towards Relentless Multi-threading in Massively Parallel Systems
Jay Brockman, Peter Kogge, Shyamkumar Thoziyoor, and Edward Kang
Department of Computer Science and Engineering, University of Notre Dame
February 17, 2003
Abstract
Processing In Memory (PIM) technology (mixing significant processing logic with dense memory on the same chip) has become an important emerging trend in recent years. In many cases, however, it has been used simply as a step towards a "system on a chip." This paper assumes that PIM systems will be inherently massively parallel, with many chips collaborating in a computation, perhaps in concert with more conventional microprocessors. While such systems could be designed to support "classical" parallel models such as DSM or message passing, this paper discusses several different models born from the HTMT project. All of these models involve significant multi-threading, with large numbers of relatively lightweight threads executing within the PIM nodes. To take advantage of these characteristics, we have designed a new ISA and matching microarchitecture that supports such multi-threading in ways that very efficiently leverage the enhanced local bandwidth and access time available from an on-chip memory macro. A simplified version of this design, termed PIM Lite, is about to go to fab as a memory part with multiple internal nodes, all of which support very lightweight threads in a simple SMT microarchitecture. This paper discusses PIM Lite, and then presents our outlook on what more advanced designs might look like.
1 Introduction
In the conventional von Neumann model, processors and memory play complementary but entirely separate roles in computation: the memory stores data but has no provisions for operating on it, while the processor modifies data but has little provision for storing it. The "von Neumann bottleneck" or "memory wall" associated with transferring data between the processor and memory is a well-known limitation on computing performance. Memory hierarchies do not eliminate the basic problem that the processor must manage nearly every detail of data transfer, from computing the address of the desired data to coordinating the transfer protocols.

Processing in Memory (PIM) breaks through the memory wall by pushing computation into the memory system. Rather than denoting a specific implementation, PIM is a collection of methods and technologies that cover all aspects of pushing computation into the memory system, including programming and execution models, microarchitectural organization, and physical design and layout. One of the key goals of this paper, however, is to demonstrate that by considering these technologies simultaneously and rationalizing the interfaces between them, extremely efficient PIM implementations are possible that can improve performance over conventional systems by orders of magnitude. By physically placing processing logic on a memory chip, the processor can access data at the highest bandwidths and lowest latencies possible. Further, and distinguishing this project from others, the emphasis is not on leveraging the technology for single-thread performance, but on supporting massively parallel scalable systems where significant amounts of many forms of concurrency are present for exploitation.

Section 2 of this paper briefly overviews the assumed technology. Section 3 discusses the spectrum of PIM chip architectures that have been investigated. Section 4 describes some computational idioms we have focused on. Section 5 then describes the ISA of PIM Lite, an early attempt to support the key ideas behind such idioms. Section 6 discusses a preliminary implementation of this ISA on a multi-node chip. A synthesized VHDL version of PIM Lite is currently running on an FPGA board, with silicon planned for the first quarter of 2003.
2 Background and Related Work
2.1 Leveraging Memory Bandwidth
Figure 1 illustrates the organization of a typical DRAM macro. Internal to the macro, data is typically stored in rows of 2K memory cells, where the entire row is read and buffered whenever any cell on that row is first accessed, regardless of how many bits of data on the row are actually needed by the application. Furthermore, once accessed, the contents of a row are latched into a primary sense amp within the macro. These values stay in the latch until another access overwrites them. Once accessed, such a row is called the currently open row.
Figure 1: DRAM memory macro organization (2048-bit internal rows of 256 bytes, multiplexed out 256 bits at a time)
Multiplexing within the circuitry common to the whole macro allows buffered row data to be paged out, typically 256 bits at a time. Each such 256-bit word is termed a wide word here. Given these numbers of 2048 bits per open row and 256 bits per wide word, there are 8 separate wide words per open row. A key property of such memory macros is that if the next access to the macro is to an address within the currently open row, then the access to the memory array can be suppressed, and the correct wide word encompassing the desired data can be brought out by simply changing the multiplexor control values. This takes considerably less time than a full access. In many memory chips, when reads of consecutive data words are desired, this fast multiplexing is used repeatedly, and is termed page mode access. Given a very conservative row access time of 20 ns and a page access time of 2 ns, the bandwidth available from a single memory macro is over 50 Gbit/s. A central issue in PIM research is to discover ways to take advantage of this extraordinary bandwidth.

A "SIMD" branch of PIM architecture places multiple datapaths, often as small as 1 bit wide, next to the memory macro's row output, running more or less synchronously. Variations of this approach date all the way back to the DAP and the CM-2 [9] (when memory densities were small), through the early and mid 1990s when chips such as the TERASYS [10] and the Linden DAAM [15] had several megabits of memory on board, to the more recent announcement of the Micron YUKON chip [12] that uses state-of-the-art DRAM. We have coined the term ASAP (At the Sense Amps Processing [4]) to refer to a wide set of ALUs positioned close to a memory macro so that all the bits made available on an access can be processed simultaneously. The V-IRAM chip [14] integrates several high-speed vector pipelines along with an integrated controller on a chip with significant embedded DRAM. The DIVA chip [8] combined a dense SRAM with a pipelined CPU and a set of ALUs.

Multiple independent processors on a chip are a more recent phenomenon. Perhaps the first was EXECUBE [13], which supported eight separate CPUs on a single die where the memory was state-of-the-art DRAM. These CPUs were interconnected with DMA channels to form a binary hypercube on a chip. (In addition to running in a MIMD mode, EXECUBE also allowed constraining any number of the on-chip processors to run in SIMD mode.) The POWER4 [23] is built on an all-logic technology but nevertheless supports a very dense cache hierarchy on top of two separate microprocessor cores. The BLUE GENE chip [7] supports a large number of separate cores, all sharing the on-chip memory in an SMP-like configuration.
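To make the bandwidth arithmetic above concrete, the following minimal C sketch computes the time to stream one open row out of a macro in page mode. The 20 ns and 2 ns figures are the representative timings assumed above, not measurements of any particular part:

    #include <stdio.h>

    /* Representative DRAM macro parameters from the discussion above;
       the timings are assumed round numbers, not measured values. */
    #define ROW_BITS       2048    /* bits per internal row      */
    #define WIDE_WORD_BITS  256    /* bits per wide word         */
    #define T_ROW_NS       20.0    /* full row access time, ns   */
    #define T_PAGE_NS       2.0    /* page-mode access time, ns  */

    int main(void)
    {
        int words_per_row = ROW_BITS / WIDE_WORD_BITS;  /* 8 wide words */

        /* One full access opens the row; the remaining seven wide words
           come out via fast page-mode multiplexing. */
        double stream_ns   = T_ROW_NS + (words_per_row - 1) * T_PAGE_NS;
        double stream_gbps = ROW_BITS / stream_ns;  /* bits/ns == Gbit/s */

        printf("row streamed in %.0f ns: %.1f Gbit/s\n",
               stream_ns, stream_gbps);  /* 34 ns, about 60 Gbit/s */
        return 0;
    }

The result, roughly 60 Gbit/s from a single macro, is consistent with the "over 50 Gbit/s" figure quoted above; sustained back-to-back page-mode reads alone would deliver 256 bits every 2 ns, or 128 Gbit/s.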
Another important dimension of the PIM architectural design space is the external interface. Some chips, such as the Mitsubishi M32R/D [16], were conceived as single-chip systems and have interfaces more reminiscent of a conventional microprocessor, such as an external memory and I/O bus. More pertinent to the present work are those chips designed to be part of large, scalable systems. A number of chips, including TERASYS, DIVA, and Yukon, have primary interfaces that allow them to "look like memory chips" to external systems, albeit almost always with some extra protocols that allow the "intelligence" on board to be touched and activated. Such interfaces allow potentially large numbers of them to be assembled into a single memory subsystem and connected to a classical computer. Other chips, such as EXECUBE and BLUE GENE, have as their primary interface multiple chip-to-chip communication ports, allowing large arrays of such chips to make up a stand-alone, single-part-type parallel processor. DIVA and Yukon have both memory-like interfaces and ports for independent PIM-to-PIM communication.
2.2 Multi-PIM Systems and Communication
Clearly, various combinations of the suite of architectural techniques discussed above support the full spectrum of conventional programming idioms. Vector and short-vector SIMD processing and SMP processing are all possible on chip. Among arrays of PIM chips, programming models from SPMD to message passing are readily supported. This, however, is not the end.

First suggested by work from the PIM FAST project [4], and amplified in other projects such as HTMT [22][4], the concept of parcels (PARallel Communication ELements) expands the semantics of a classical memory read or write operation to accommodate a PIM-enhanced memory. In [4], we introduced the notion of a microserver as a PIM component to service parcels. In their simplest form, parcels and microservers could be used to service split-phase memory transactions, similar to the message-driven memory modules proposed in dataflow architectures such as [2] [20] [18]. More generally, the parcel/microserver model can be used to deliver any command, with arguments, to a location in memory. This mechanism could be used to remotely trigger actions ranging from simple reads and writes, to atomic updates, to invoking methods on objects. This use of the parcel/microserver model has much in common with, and is based upon, message-driven computation in systems such as active messages [24], the threaded abstract machine (TAM) [5], and the J-Machine [6] [19]. Moving still further up the programming hierarchy, invoking methods in memory becomes an ideal match to actor-based programming languages such as Smalltalk [11], where all processing is based on sending messages to objects, with messages containing method names and arguments. The parcel/microserver model converts an operation that may have involved multiple remote memory accesses (to fetch components of the object), each involving two-way transactions (read and data return), into a single one-way transaction that runs all of the method code as close as possible to the object. This results not only in lower latency, but also in reduced network load.

Methods invoked on objects in memory are liable to have short code lengths. Further, if the remote-method idiom is used extensively in a parallel system, it becomes likely that multiple method calls will impinge concurrently on the same PIM node. Together, these effects lead to frequent switching of threads with relatively short run lengths, which makes the overhead of starting and swapping threads an important consideration. For a conventional RISC processor, thread swapping is a heavyweight activity, since the contents of the register file must be copied out to memory. Also, increasing the size of the register file has deleterious effects on power and cycle time, which has limited the number of hardware-supported threads in RISC processors to a very small number, such as 2 [3]. Rather than placing a RISC processor in memory, the PIM Lite ISA was designed from the outset to support multiple threads, and to minimize the cost of moving thread state in and out of the processor by keeping most of the thread state in memory at all times. Furthermore, the matching microarchitecture supports SMT processing, allowing multiple such threads to be in execution concurrently. Once we have multi-threading implemented efficiently on individual PIM nodes, we can extend the concept to cover an entire system.
Very lightweight threads such as those defined in PIM Lite have very little state, so the cost of packaging the thread state when a reference to an off-chip object is encountered is minimal. If done correctly, such a package is minimally different from a parcel as defined above, meaning that we can now treat parcels not just as remote method invocations, but as "threads in transit."
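To illustrate the model, the following C sketch shows one plausible shape for a parcel that can carry either a remote method invocation or a thread in transit, together with a microserver dispatch loop. The field names and widths, and the receive_parcel and handler_table helpers, are hypothetical illustrations, not the actual PIM Lite formats:

    #include <stdint.h>

    /* Hypothetical parcel: a one-way message carrying everything needed
       to resume work at the node that owns the target object. */
    typedef struct {
        uint16_t target;   /* global address of the object acted upon   */
        uint16_t handler;  /* index of the method/microserver to invoke */
        uint16_t nargs;    /* number of argument words that follow      */
        uint16_t args[5];  /* arguments, or packaged thread state when  */
                           /* the parcel is a "thread in transit"       */
    } parcel_t;

    typedef void (*handler_fn)(uint16_t target, uint16_t *args, int nargs);

    extern handler_fn handler_table[];      /* per-node method table (assumed) */
    extern parcel_t *receive_parcel(void);  /* blocking receive (assumed)      */

    /* Microserver loop: each incoming parcel runs as a short thread next
       to the memory that holds its target, with no read/reply round trip. */
    void microserver(void)
    {
        for (;;) {
            parcel_t *p = receive_parcel();
            handler_table[p->handler](p->target, p->args, p->nargs);
        }
    }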
This, in turn, means that we can consider execution models where a program "follows" the data references it makes, rather than making the data come to some fixed execution site. To say that this is radically different from our classical models is an understatement. Some early work [17] on determining the size of state that needs to be moved versus the "lifetime" of a thread on a node indicates that this may be entirely reasonable: the lifetimes of such threads on individual nodes might very well be long enough to overcome any overhead of thread packing and unpacking.

Finally, the capability to support multi-threading everywhere permits yet another interesting class of programming idiom: introspection. Various kinds of monitor threads can roam a system performing such activities as consistency checking, load balancing, checkpoint collection, and garbage collection. All of this can be done independently of application-directed activities, particularly if conventional-like CPUs are performing much of an application's code. PIM Lite and its follow-ons are being designed to allow experimentation with such concepts.
3 The PIM Lite Architecture
3.1 System Overview
PIM Lite is a reference PIM implementation developed to explore memory-centric architecture and implementation, and to experiment with the range of programming idioms discussed in the previous section. A PIM Lite node is defined as a physical block of memory with an associated CPU. A PIM Lite chip contains 4 nodes connected on a bus. PIM Lite is a "lightweight" PIM implementation in several respects:
1. The architecture is designed to minimize the amount of program state in the CPU and hence minimize the number of bits that must be transferred in creating or swapping multiple threads. Further, all program state is tightly encapsulated in wide-words that correspond to the size of a row in memory, to maximize the use of on-chip memory bandwidth (a hypothetical layout is sketched after this list).
2. The VLSI implementation minimizes the area of the processing logic. The chip floorplan follows a pitch-matched, bit-sliced design style that aligns the wide-word processing logic with columns of memory. Further, the use of multi-threading eliminates the need for complex hazard detection and forwarding logic within the pipeline.
3. As an exploratory implementation, low cost and ease of design and test took precedence over memory density and clock rate. A single PIM Lite node contains 4K bytes of SRAM program memory and 4K bytes of data memory, organized in rows as wide-words of 128 bits. Instructions, data, and addresses are all 16 bits wide, packaged in 128-bit wide-words.
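As a concrete illustration of the wide-word encapsulation in item 1, the following C sketch packs a thread context into exactly one 128-bit wide-word of eight 16-bit fields. The particular fields are assumptions for exposition, not the actual PIM Lite encoding:

    #include <stdint.h>

    /* Hypothetical thread context occupying exactly one 128-bit wide-word
       (eight 16-bit fields), matching the 128-bit rows described above.
       The field choice is illustrative, not the PIM Lite encoding. */
    typedef struct {
        uint16_t ip;      /* instruction pointer                  */
        uint16_t fp;      /* frame pointer into local memory      */
        uint16_t acc;     /* accumulator / working value          */
        uint16_t status;  /* thread status and condition flags    */
        uint16_t r[4];    /* a few general-purpose registers      */
    } context_t;          /* 8 x 16 bits = 128 bits = 1 wide-word */

    /* Creating or swapping a thread then costs a single wide-word store
       plus a single wide-word load, rather than spilling a register file. */
    void swap_thread(context_t *saved, context_t *cpu, const context_t *next)
    {
        *saved = *cpu;   /* one wide-word write to memory  */
        *cpu   = *next;  /* one wide-word read from memory */
    }

Under such a layout, swapping a thread is a pair of single wide-word transfers, which is what makes the frequent switching of short threads described in Section 2.2 affordable.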
PIM Lite chips may be interconnected on a bus (or a more complex network) to form a PIM fabric. At the highest level, the entire fabric may be viewed as a single physical address space. Logically, this address space may be partitioned into several virtual address spaces or domains, each supporting a separate PIM process. Physically, each domain may be striped across the entire set of PIM nodes, as shown in Figure 2, with large distributed data structures extending across many nodes. PIM Lite supports only a single domain.

A PIM process consists of a collection of threads distributed across the PIM fabric. A given thread executing in the PIM system "sees" memory in two segments: global memory and private, local memory. Global memory is the main virtual memory of a process, containing both data and instructions. In general, the physical location of a global virtual address may be on any PIM node in the fabric. A small region of the global address space, however, always maps to physical locations on the current node, and is reserved for system resources such as domain-specific configuration parameters and node-specific, memory-mapped I/O locations. A global memory virtual address consists of a page number and an offset; a local page table on each node keeps track of which pages are currently resident on that node for each domain; PIM Lite assumes a single page per node.

For simplicity, PIM Lite is designed to operate on a standard 16-bit memory bus; more efficient but more complex networks could have been used. Figure 3 illustrates the PIM Lite node pinout and bus configuration. An independent PIM Lite node supports three modes of operation, summarized below: