PIM Lite: On the Road Towards Relentless Multi-threading in Massively Parallel Systems

Jay Brockman, Peter Kogge, Shyamkumar Thoziyoor, and Edward Kang
Department of Computer Science and Engineering, University of Notre Dame
February 17, 2003

Abstract

Processing In Memory (PIM) technology, which mixes significant processing logic with dense memory on the same chip, has become a popular emerging trend in recent years. In many cases, however, it has been used simply as a step towards a “system on a chip.” This paper assumes that PIM systems will be inherently massively parallel, with many chips collaborating in a computation, perhaps in concert with more conventional microprocessors. While such systems could be designed to support “classical” parallel models such as DSM or message passing, this paper discusses several different models born from the HTMT project. All of these models involve significant multi-threading, with large numbers of relatively lightweight threads executing within the PIM nodes. To take advantage of these characteristics, we have designed a new ISA and matching microarchitecture that support such multithreading in ways that leverage very efficiently the enhanced local bandwidth and access time available from an on-chip memory macro. A simplified version of this design, termed PIM Lite, is about to go to fab as a memory part with multiple internal nodes, all of which support very lightweight threads in a simple SMT microarchitecture. This paper discusses PIM Lite, and then our outlook on what more advanced designs might look like.

1 Introduction

In the conventional von Neumann model, processors and memory play complementary but entirely separate roles in computation: the memory stores data but has no provisions for operating on it, while the processor modifies data but has little provision for storing it. The “von Neumann bottleneck” or “memory wall” associated with transferring data between the processor and memory is a well-known limitation on computing performance. Memory hierarchies do not eliminate the basic problem that the processor must manage nearly every detail of data transfer, from computing the address of the desired data to coordinating the transfer protocols.

Processing in Memory (PIM) breaks through the memory wall by pushing computation into the memory system. Rather than denoting a specific implementation, PIM is a collection of methods and technologies that cover all aspects of pushing computation into the memory system, including programming and execution models, microarchitectural organization, and physical design and layout. One of the key goals of this paper is to demonstrate that by considering these technologies simultaneously and rationalizing the interfaces between them, extremely efficient PIM implementations are possible that can improve performance over conventional systems by orders of magnitude. By physically placing processing logic on a memory chip, the processor can access data at the highest bandwidths and lowest latencies possible. Further, and distinguishing this project from others, the emphasis is not on leveraging the technology for single-processor performance, but on supporting massively parallel, scalable systems in which very significant amounts of many forms of concurrency are present for exploitation.

Section 2 of this paper briefly overviews the assumed technology, the spectrum of PIM chip architectures that have been investigated, and the multi-PIM computational idioms we have focused on. Section 3 then describes the architecture and ISA of PIM Lite, an early attempt to support the key ideas behind such idioms. Section 4 discusses a preliminary microarchitecture and implementation of this ISA on a multi-node chip. A synthesized VHDL version of PIM Lite is currently running on an FPGA board, with silicon planned for the first quarter of 2003.

2 Background and Related Work

2.1 Leveraging Memory Bandwidth

Figure 1 illustrates the organization of a typical DRAM macro. Internal to the macro, data is typically stored in rows of 2K memory cells, where the entire row is read and buffered whenever any cell on that row is first accessed, regardless of how many bits of data on the row are actually needed by the application. Furthermore, once accessed, the contents of a row are latched into a primary sense amp within the macro. These values stay in the latch until another access overwrites them. Thus, once accessed, such a row is called the currently open row.

Figure 1: DRAM memory macro organization (2048 bits per internal row, read out 256 bits at a time).

Multiplexing within the circuitry common to the whole macro typically allows buffered row data to be paged out 256 bits at a time. Each such 256-bit word is termed a wide word here. Given these numbers of 2048 bits per open row and 256 bits per wide word, there are 8 separate wide words per open row. A key property of such memory macros is that if the next access to the macro is to an address within the currently open row, then the access to the memory array can be suppressed, and the correct wide word encompassing the desired data can be brought out simply by changing the multiplexor control values. This takes considerably less time than a full access. In many memory chips, when reads of consecutive data words are desired, this fast multiplexing is used repeatedly, and is termed page mode access. Given a very conservative row access time of 20 ns and a page access time of 2 ns, the bandwidth available from a single memory macro is over 50 Gbit/s. A central issue in PIM research is to discover ways to take advantage of this extraordinary bandwidth.

A “SIMD” branch of PIM architecture places multiple dataflows, often as narrow as 1 bit, next to the memory macro's row output, running more or less synchronously. Variations of this approach date all the way back to the DAP and the CM-2 [9] (when memory densities were small), through the early and mid 1990s when chips such as TERASYS [10] and the Linden DAAM [15] had several megabits of memory on board, to the more recent announcement of the Micron YUKON chip [12], which uses state-of-the-art DRAM. We have coined the term ASAP (At the Sense Amps Processing [4]) to refer to a wide set of ALUs positioned close to a memory macro so that all the bits made available on an access can be processed simultaneously. The V-IRAM chip [14] integrates several high-speed vector pipelines along with an integrated controller on a chip with significant embedded DRAM. The DIVA chip [8] combined a dense SRAM with a pipelined CPU and a set of ALUs.

Multiple independent processors on a chip are a more recent phenomenon. Perhaps the first was EXECUBE [13], which supported eight separate CPUs on a single die where the memory was state-of-the-art DRAM. These CPUs were interconnected with DMA channels to form a binary hypercube on a chip. (In addition to running in a MIMD mode, EXECUBE also allowed constraining any number of the on-chip processors to run in SIMD mode.) The POWER4 [23] is built on an all-logic technology, but nevertheless supports a very dense cache hierarchy on top of two separate microprocessor cores. The BLUE GENE chip [7] supports a large number of separate cores, all sharing the on-chip memory in an SMP-like configuration.
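As a rough sanity check on the bandwidth figure quoted at the start of this subsection (a back-of-the-envelope sketch under the stated assumptions of a 20 ns row access, 2 ns page-mode accesses, and eight 256-bit wide words per 2048-bit row), streaming an entire open row costs one row access plus seven page-mode accesses:

```latex
\[
  \mathrm{BW}_{\mathrm{row}} \approx
  \frac{2048~\mathrm{bits}}{20~\mathrm{ns} + 7 \times 2~\mathrm{ns}}
  \approx 60~\mathrm{Gbit/s},
  \qquad
  \mathrm{BW}_{\mathrm{page}} =
  \frac{256~\mathrm{bits}}{2~\mathrm{ns}} = 128~\mathrm{Gbit/s},
\]
```

both comfortably above the 50 Gbit/s cited in the text.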

Another important dimension of the PIM architectural design space is the external interface. Some chips, such as the Mitsubishi M32R/D [16], were conceived as single-chip systems and have interfaces more reminiscent of a conventional microprocessor, such as external memory and I/O interfaces. More pertinent to the present research are those chips designed to be part of large, scalable systems. A number of chips, including TERASYS, DIVA, and Yukon, have primary interfaces that allow them to “look like memory chips” to external systems, albeit almost always with some extra protocols that allow the “intelligence” on board to be touched and activated. Such interfaces allow potentially large numbers of them to be assembled into a single memory subsystem and connected to a classical computer. Other chips, such as EXECUBE and BLUE GENE, have as their primary interface multiple chip-to-chip communication ports, allowing large arrays of such chips to make up a stand-alone parallel processor built from a single part type. DIVA and Yukon have both memory-like interfaces and ports for independent PIM-to-PIM communication.

2.2 Multi-PIM Systems and Communication

Clearly, various combinations of the suite of architectural techniques discussed above support the full spectrum of conventional programming idioms. Vector and short-vector SIMD processing and SMP processing are all possible on chip. Among arrays of PIM chips, programming models from SPMD to message passing are readily supported. This, however, is not the end. First suggested by work from the PIM FAST project [4], and amplified in other projects such as HTMT [22][4], the concept of parcels (PARallel Communication ELements) expands the semantics of a classical memory read or write operation to accommodate a PIM-enhanced memory. In [4], we introduced the notion of a microserver as a PIM component to service parcels. In their simplest form, parcels and microservers could be used to service split-phase memory transactions, similar to the message-driven memory modules proposed in dataflow architectures such as [2] [20] [18]. More generally, the parcel/microserver model can be used to deliver any command, with arguments, to a location in memory. This mechanism could be used to remotely trigger actions ranging from simple reads and writes, to atomic updates, to invoking methods on objects. This use of the parcel/microserver model has much in common with, and is based upon, message-driven computation in systems such as active messages [24], the threaded abstract machine (TAM) [5], and the J-Machine [6] [19]. Moving still further up the programming hierarchy, invoking methods in memory becomes an ideal match to actor-based programming languages such as Smalltalk [11], where all processing is based on sending messages to objects, with messages containing method names and arguments.

The parcel/microserver model converts an operation that may have involved multiple remote memory accesses (to fetch components of the object), each involving two-way transactions (read and data return), into a single one-way transaction that runs all of the method code as close as possible to the object. This results not only in lower latency, but also in reduced network load. Methods invoked on objects in memory tend to have short code lengths. Further, if the remote method idiom is used extensively in a parallel system, it becomes likely that multiple method calls will impinge concurrently on the same PIM node. Together, these effects lead to frequent switching of threads with relatively short run lengths, which makes the overhead of starting and swapping threads an important consideration. For a conventional RISC processor, thread swapping is a heavyweight activity, since the contents of the register file must be copied out to memory. Also, increasing the size of the register file has deleterious effects on power and cycle time, which has limited the number of hardware-supported threads in RISC processors to a very small number, such as two [3]. Rather than placing a RISC processor in memory, the PIM Lite ISA was designed from the outset to support multiple threads, and to minimize the cost of moving thread state in and out of the processor by keeping most of the thread state in memory at all times. Furthermore, the matching microarchitecture supports SMT processing, allowing multiple such threads to be in execution concurrently.

Once we have multi-threading implemented efficiently on individual PIM nodes, we can extend the concept to cover an entire system. Very lightweight threads such as those defined in PIM Lite have very little state, meaning that there is minimal cost in packaging the thread state when a reference to an off-chip object is encountered. If done correctly, such a package is minimally different from a parcel as defined above, meaning that we can now treat parcels not just as remote method invocations, but as “threads in transit.”

This, in turn, means that we can now consider execution models where a program “follows” the data references it makes, rather than making the data come to some fixed execution site. To say that this is radically different from our classical models is an understatement. Some early work [17] in determining the size of state that needs to be moved versus the “lifetime” of a thread on a node indicates that this may be entirely reasonable: the lifetimes of such threads on individual nodes might very well be long enough to overcome any overhead of thread packing and unpacking.

Finally, the capability to support multi-threading everywhere permits yet another interesting class of programming idiom - introspection. Various kinds of monitor threads can roam a system performing such activities as consistency checking, load balancing, checkpoint collection, and garbage collection. All of this can be done independently of application-directed activities, particularly if conventional-like CPUs are performing much of an application's code. PIM Lite and its follow-ons are being designed to allow experimentation with such concepts.

3 The PIM Lite Architecture

3.1 System Overview

PIM Lite is a reference PIM implementation developed to explore memory-centric architecture and implementation, and to experiment with the range of programming idioms discussed in the previous section. A PIM Lite node is defined as a physical block of memory with an associated CPU. A PIM Lite chip contains 4 nodes connected on a bus. PIM Lite is a “lightweight” PIM implementation in several respects:

1. The architecture is designed to minimize the amount of program state in the CPU and hence minimize the number of bits that must be transferred in creating or swapping threads. Further, all program state is tightly encapsulated in wide-words that correspond to the size of a row in memory, to maximize the use of on-chip memory bandwidth.

2. The VLSI implementation minimizes the area of the processing logic. The chip floorplan follows a pitch-matched, bit-sliced design style that aligns the wide-word processing logic with columns of memory. Further, the use of multi-threading eliminates the need for complex hazard detection and forwarding logic within the pipeline.

3. As an exploratory implementation, low cost and ease of design and test took precedence over memory density and clock rate. A single PIM Lite node contains 4K bytes of SRAM program memory and 4K bytes of data memory, organized in rows as wide-words of 128 bits. Instructions, data, and addresses are all 16 bits wide, packaged in 128-bit wide-words.

PIM Lite chips may be interconnected on a bus (or a more complex network) to form a PIM fabric. At the highest level, the entire fabric may be viewed as a single physical address space. Logically, this address space may be partitioned into several virtual address spaces or domains, each supporting a separate PIM process. Physically, each domain may be striped across the entire set of PIM nodes, as shown in Figure 2, with large distributed data structures extending across many nodes. PIM Lite supports only a single domain.

A PIM process consists of a collection of threads distributed across the PIM fabric. A given thread executing in the PIM system “sees” memory in two segments: global memory and private, local memory. Global memory is the main virtual memory of a process, containing both data and instructions. In general, the physical location of a global virtual address may be on any PIM node in the fabric. A small region of the global address space, however, always maps to physical locations on the current node, and is reserved for system resources such as domain-specific configuration parameters and node-specific, memory-mapped I/O locations. A global memory virtual address consists of a page number and an offset; a local page table on each node keeps track of which pages are currently resident on that node for each domain. PIM Lite assumes a single page per node.

For simplicity, PIM Lite is designed to operate on a standard 16-bit memory bus; more efficient but more complex networks could have been used. Figure 3 illustrates the PIM Lite node pinout and bus configuration. An independent PIM Lite node supports three modes of operation, summarized below:

Figure 2: A PIM Lite system with data structures distributed across multiple nodes.

Figure 3: PIM Lite pinout and bus configuration (per-node signals A, D, R/W, CS, and Ready; daisy-chained bus arbiter with Request, Release, Grant_in, and Grant_out).

• Passive memory mode: In this mode a PIM Lite node behaves as a passive memory module. A PIM node enters passive memory mode when its chip select (CS) pin is asserted, and responds by setting its Ready pin.

• PIM mode: When a PIM Lite node is not externally selected as a memory, it may be processing threads internally.

• Bus master mode: In order to send a parcel, a PIM Lite node must be granted control of the bus as a master. PIM Lite uses a simple daisy-chained grant to request, receive, and release control of the bus.

3.2 Execution Model

Conventional CPU architectures employ a set of named registers that are part of the CPU itself to hold much of the state of a program being executed. When the CPU switches threads, all of this state information must be copied out of the CPU registers to memory and the state of the new thread must be loaded in, which is a costly operation in both time and energy. In PIM Lite, nearly all the state information of a thread is taken out of the CPU and kept in memory at all times. This not only eliminates the need for explicitly copying thread state in and out of the CPU, but also greatly reduces the number of devices and hence the area of the CPU. In place of named registers in the CPU, thread state is packaged in data frames of memory. Physically, a frame is simply a region of storage locations within the regular memory system. Information in a particular frame is uniquely accessed by two indices: a frame pointer (FP) that indicates the memory address of the first location in the frame, and an offset to subsequent storage locations in the frame.

With the use of frames, virtually all “classical” instructions become operations on “memory” locations, expressed as operations on values within a single frame, between global memory and a frame, or between two frames. To maximize the bandwidth of data access, the frame size should correspond to the size of a row in memory. In PIM Lite, data frames have a fixed size of 4 wide-words, or 32 16-bit words, comparable to the size of a typical RISC register file. The execution state of a thread in the PIM Lite CPU is completely described by two memory addresses: a frame pointer, FP, which points to the starting location of the data frame, and an instruction pointer, IP, which points to the current instruction. In a manner similar to that used earlier in dataflow architectures [21] [18] [20] [1], PIM Lite bundles these addresses as a pair <FP, IP>, called a continuation, that forms the basic unit of execution. Because data frames are integer multiples of memory wide-words, threads can be swapped at the fastest rate physically allowed by the on-chip memory system, as quickly as a single random-data access into the memory array. A thread pool (or continuation pool) manages the scheduling of threads in the system, the details of which are discussed in a later section.
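The continuation abstraction can be summarized in a few lines of C. This is an illustrative sketch only; the type and function names, the pool size, and the insertion policy are assumptions made for exposition and are not taken from the PIM Lite implementation.

```c
#include <stdint.h>

/* A continuation is just an <FP, IP> pair: the address of the thread's data
 * frame plus the address of its next instruction. */
typedef struct {
    uint16_t fp;   /* frame pointer                                  */
    uint16_t ip;   /* instruction pointer                            */
} continuation_t;

#define POOL_SLOTS 16   /* assumed capacity, for illustration only */

typedef struct {
    continuation_t slot[POOL_SLOTS];
    uint8_t        valid[POOL_SLOTS];   /* 1 if the slot holds a live thread */
} thread_pool_t;

/* Each executed instruction produces zero, one, or two continuations that
 * are written back into the pool (see Section 3.3). */
static void pool_insert(thread_pool_t *pool, continuation_t c)
{
    for (int i = 0; i < POOL_SLOTS; i++) {
        if (!pool->valid[i]) {
            pool->slot[i] = c;
            pool->valid[i] = 1;
            return;
        }
    }
    /* A real node would back-pressure or spill to memory; omitted here. */
}
```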

3.3 Instruction Set

The PIM Lite instruction set differs from a typical RISC ISA in two important ways:

1. the instructions are designed to take full advantage of the high bandwidth of on-chip memory, with the ability to operate on 128 bits of data (eight 16-bit words) concurrently. Instructions include thread operations such as fork and join, as well as vector/SIMD ALU operations,

2. rather than modifying a single program counter, all instructions generate either zero, one, or two continuations that are added to the thread pool.

The following sections outline the PIM Lite instruction classes.

3.3.1 Unified Vector/Scalar Frame Organization

Other PIM chips supporting vector/SIMD operations, such as IRAM [14] and DIVA [8], maintain separate vector and scalar register files. This adds to the weight of the CPU state that must be transferred to memory during thread swaps, and may lead to inefficiencies in sections of code that mix vector and scalar operations. In contrast, PIM Lite employs a unified frame organization that allows frame entries to be accessed either as scalar values or as vectors, leading to easy mixing of vector and scalar code and keeping all program state tightly packaged as wide-words in memory. A PIM Lite frame may be viewed as a two-dimensional array of 16-bit words with 4 rows and 8 columns, where each row is a wide-word and each column represents a displacement into a wide-word. Individual scalar values (words) in a frame are denoted with the double subscript [w, d], where w is the wide-word number 0...3 and d is the displacement 0...7. An entire wide-word is denoted by the single subscript [w].
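A minimal C sketch of this unified frame view follows; the helper names are invented for illustration, and only the 4 x 8 geometry of 16-bit words comes from the text.

```c
#include <stdint.h>

#define FRAME_ROWS 4    /* wide-words per frame        */
#define FRAME_COLS 8    /* 16-bit words per wide-word  */

/* A frame is a 4 x 8 array of 16-bit words; each row is one 128-bit
 * wide-word. */
typedef uint16_t frame_t[FRAME_ROWS][FRAME_COLS];

/* Scalar access [w, d]: wide-word w, displacement d. */
static inline uint16_t frame_scalar(const frame_t f, int w, int d)
{
    return f[w][d];
}

/* Vector access [w]: a whole 128-bit wide-word, returned as a row pointer. */
static inline const uint16_t *frame_vector(const frame_t f, int w)
{
    return f[w];
}
```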

3.3.2 Arithmetic Operations

The PIM Lite arithmetic instructions, summarized in Table 1, support scalar, vector, or vector/scalar formats. The vector format allows entire wide-words of data to be operated on concurrently. For example, Vadd (vector add) adds two vectors of 8 16-bit words of data and stores the resulting 128-bit vector. Vector/scalar instructions operate on one scalar and one vector operand. The VSadd (vector/scalar add) instruction concurrently adds a 16-bit scalar to each of the eight 16-bit words in the vector. This is supported by broadcasting the scalar to each of the 16-bit datapath components, which execute concurrently. The scalar-scalar instruction add has semantics similar to a typical RISC instruction, operating on two 16-bit scalar quantities. In PIM Lite, however, scalar operations are implemented as a special case of the vector/scalar operation, where the source operand is broadcast to each 16-bit datapath but write-back is restricted to the destination datapath (a behavioral sketch of these semantics follows Table 1).

The permute instruction allows wide-words of data to be rearranged at the word level. The permute instruction takes two vector operands: a data vector whose elements are to be rearranged, and a “permutation vector” (also called a WACI, or Wide ALU Control Instruction) that specifies which field from the original data vector should appear in the corresponding field after rearrangement. The permute instruction is especially useful in operations such as data alignment and sorting.

Format           Instruction               Memory Operation                                        Continuations
vector-vector    Vop [w]dst, [w]src        [w]dst ← [w]dst op [w]src                               <FP, IP+1>
                 permute [w]dst, [w]pos    permute elements of [w]dst by position vector [w]pos    <FP, IP+1>
vector-scalar    VSop [w]dst, [w, d]src    [w]dst ← [w]dst op [w, d]src                            <FP, IP+1>
scalar-scalar    op [w, d]dst, [w, d]src   [w, d]dst ← [w, d]dst op [w, d]src                      <FP, IP+1>

Table 1: ALU instructions, op = (add, sub, and, or, not, move).
3.3.3 Location-Aware Memory Transfer Operations

For applications that use large dynamic data structures, it is generally not possible to determine at compile time on which PIM node a given object resides. When a PIM thread needs access to data that is not on the current node, there are two choices: either bring the data to the thread, or move the thread to the data. The latter option is especially attractive when the thread state is small and the likelihood of future references to data on the new node is high. In either case, the thread must be able to determine whether a given virtual address is on the current node or not. In PIM Lite, this information is contained in the local page table. If an address is not local, the decision on how to proceed needs to be under user, rather than system, control. To handle this scenario, PIM Lite uses conditional memory transfer instructions that check the local page table. Load and store operations in PIM Lite transfer entire wide-words between global memory and data frames if the global memory address is on the current PIM node, and then skip the next sequential instruction. Otherwise, they fall through to a user instruction that specifies how to respond. These conditional load and store instructions are summarized in Table 2, and sketched in code after the table.

Instruction                Memory Operation                                          Continuations
load [w]dst, [w, d]src     if [w, d]src is local, then [w]dst ← MEM[w, d]src;        if local, <FP, IP+2>; else <FP, IP+1>
                           else none
store [w, d]dst, [w]src    if [w, d]dst is local, then MEM[w, d]dst ← [w]src;        if local, <FP, IP+2>; else <FP, IP+1>
                           else none

Table 2: Conditional memory/frame transfer instructions.
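A behavioral sketch of the locality test and the conditional load, in C, is shown below. The page-table layout, the helper names, and the single-page-per-node simplification are illustrative assumptions; the real instructions operate on frames and wide-words as defined in Table 2.

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical page-table check: a 16-bit global virtual address splits into
 * a page number and an offset, and each node records the single page that is
 * resident locally (PIM Lite assumes one page per node). */
#define OFFSET_BITS 12                      /* assumed split, for illustration */
#define PAGE(a)     ((uint16_t)((a) >> OFFSET_BITS))

static uint16_t resident_page;              /* set up by system software */

static bool is_local(uint16_t vaddr)
{
    return PAGE(vaddr) == resident_page;
}

extern const uint16_t *node_memory(uint16_t vaddr);   /* assumed accessor */

/* Conditional load of Table 2: copy a wide-word into the frame and "skip the
 * next instruction" (return true) only when the address is local; otherwise
 * fall through (return false) so a user-supplied instruction, e.g. one that
 * ships the thread off-node as a parcel, decides what to do. */
static bool cond_load(uint16_t dst[8], uint16_t src_addr)
{
    if (!is_local(src_addr))
        return false;
    memcpy(dst, node_memory(src_addr), 8 * sizeof(uint16_t));
    return true;
}
```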

3.3.4 Control Flow Instructions

The standard control instructions include jumps, forks, and joins. The jumps include jmp (jump always), jz (jump on zero), and jn (jump if negative), each of which has a target specified by a target instruction pointer in the frame location [w, d]tar. These instructions are listed in Table 3.

Instruction                 Memory Operation    Continuations
jmp [w, d]tar               none                <FP, [w, d]tar>
jz [w, d]tar, [w, d]cmp     none                if [w, d]cmp = 0, then <FP, [w, d]tar>; else <FP, IP+1>
jn [w, d]tar, [w, d]cmp     none                if [w, d]cmp < 0, then <FP, [w, d]tar>; else <FP, IP+1>

Table 3: Jumps and branches.

The fork and join instructions manage the creation and termination of threads that operate on the same data frame as their parent; Table 4 describes these instructions. The fork instruction creates a new thread that begins execution at the instruction pointer address specified in the frame location [w, d]tar, while the original thread continues execution at IP+1; it does so by inserting two continuations into the thread pool. The fork instruction also increments a semaphore in the frame at [w, d]sem. The join instruction uses this semaphore to join two threads that share a common frame into one. If the value of the semaphore is positive, the join instruction decrements the semaphore and terminates the thread by producing no continuation. If the semaphore is equal to zero, it produces a continuation pointing to the next sequential instruction.

The start instruction creates a new thread with a different data frame from the calling thread. It takes a single wide-word as an argument containing a list of parameters for the instruction. The first two locations in the wide-word, [w, 0]src and [w, 1]src, contain the frame pointer and instruction pointer, respectively, for the new thread. The remaining six locations are general-purpose arguments. When the start instruction is executed, the entire wide word is copied into the new frame, and the continuation <[w, 0]src, [w, 1]src> is inserted into the pool. The start instruction effectively terminates the calling thread; in order for the calling thread to continue execution, a fork should be executed before the start. PIM Lite has no explicit call or jump-and-link mechanism. Instead, linear procedure call sequences may be constructed from the fork, start, and join instructions in much the same manner as was used in the hybrid dataflow architecture discussed in [18]. (A behavioral sketch of fork and join follows Table 4.)

Instruction                  Memory Operation                                    Continuations
fork [w, d]tar, [w, d]sem    [w, d]sem ← [w, d]sem + 1                           <FP, [w, d]tar>, <FP, IP+1>
join [w, d]sem               if [w, d]sem > 0, [w, d]sem ← [w, d]sem − 1;        if [w, d]sem > 0, none; else <FP, IP+1>
                             else none
stop                         none                                                none
start [w]src                 copy [w]src to corresponding wide word              <[w, 0]src, [w, 1]src>
                             in frame at [w, 0]src

Table 4: Thread lifetime instructions.
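The semantics of fork and join can be sketched in C as below, using a continuation/pool abstraction like the one in Section 3.2. The frame accessors and the pool interface are assumptions for illustration only.

```c
#include <stdint.h>

typedef struct { uint16_t fp, ip; } continuation_t;

/* Assumed interfaces to the thread pool and the frame memory. */
extern void     pool_insert(continuation_t c);
extern uint16_t frame_read(uint16_t fp, int w, int d);
extern void     frame_write(uint16_t fp, int w, int d, uint16_t v);

/* fork: spawn a thread at [w, d]tar sharing the caller's frame, increment
 * the semaphore at [w, d]sem, and continue the caller at IP+1. */
static void do_fork(uint16_t fp, uint16_t ip,
                    int wt, int dt, int ws, int ds)
{
    frame_write(fp, ws, ds, (uint16_t)(frame_read(fp, ws, ds) + 1));
    pool_insert((continuation_t){ fp, frame_read(fp, wt, dt) });
    pool_insert((continuation_t){ fp, (uint16_t)(ip + 1) });
}

/* join: if the semaphore is positive, decrement it and retire this thread
 * (no continuation); if it is zero, continue at IP+1. */
static void do_join(uint16_t fp, uint16_t ip, int ws, int ds)
{
    uint16_t sem = frame_read(fp, ws, ds);
    if (sem > 0)
        frame_write(fp, ws, ds, (uint16_t)(sem - 1));   /* thread terminates */
    else
        pool_insert((continuation_t){ fp, (uint16_t)(ip + 1) });
}
```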

3.3.5 Synchronization, Atomicity, and Parcel Processing

The fork and join instructions may be used to synchronize two threads that share a common data frame. Threads can synchronize on global memory locations through atomic memory operations implemented in a critical section of a program. The PIM Lite instructions enter and exit define the beginning and end of a critical section. Once the enter instruction has executed, the currently executing thread cannot be preempted by another thread until an exit instruction is issued. The critical section instructions effectively lock access to memory at the granularity of an entire PIM node. Future PIM implementations will place full/empty bits in the memory to support reservations at a granularity of a wide-word or less.

One thread can start a new thread on another PIM node (or on itself) by sending a parcel to it. In PIM Lite, a parcel buffer consists of a memory-mapped wide-word on each PIM node, together with another memory-mapped single word that serves as a “buffer full” signal. The format of a PIM Lite parcel is exactly the same as the format of an argument to a start instruction, that is, an <FP, IP> pair followed by 6 scalar arguments. When the PIM Lite thread scheduler detects a parcel in the buffer (via the “full” signal), it processes the contents of the parcel buffer as a start instruction ahead of the threads currently in the pool. If the parcel must be processed atomically, then the first instruction at the parcel's IP should enter a critical section.
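The parcel format described above maps naturally onto a single wide-word. The sketch below assumes invented memory-mapped addresses for the parcel buffer and its "full" flag; only the <FP, IP> plus six-argument layout comes from the text.

```c
#include <stdint.h>

/* One parcel = one 128-bit wide-word: an <FP, IP> pair plus six scalars,
 * exactly the layout of a start argument. */
typedef struct {
    uint16_t fp;        /* frame pointer for the new thread        */
    uint16_t ip;        /* instruction pointer for the new thread  */
    uint16_t arg[6];    /* general-purpose arguments               */
} parcel_t;             /* 8 x 16 bits = one wide-word             */

/* Memory-mapped buffer and "buffer full" word; addresses are invented. */
#define PARCEL_BUF  ((volatile parcel_t *)0xFFE0)
#define PARCEL_FULL ((volatile uint16_t *)0xFFF0)

/* Sending side: fill the buffer on the target node and raise the full flag;
 * the receiving scheduler then treats the contents as a start instruction
 * ahead of the threads already in its pool. */
static void send_parcel(const parcel_t *p)
{
    *PARCEL_BUF  = *p;
    *PARCEL_FULL = 1;
}
```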

4 PIM Lite Microarchitecture

The basic microarchitecture of PIM Lite is a simple four-stage pipeline composed of Pair-fetch, Frame-read, Execute, and Write-back stages, as illustrated in Figure 4. The Pair-fetch stage reads and decodes an <FP, IP> pair from the thread pool and also reads an instruction from memory. Next, the Frame-read stage reads the offsets encoded in the opcode and uses them to index into the frame memory. During the Execute stage, which contains the ALU and permutation hardware, the operation specified by the instruction is executed. Finally, in the Write-back stage, the frame data is modified and an updated <FP, IP> pair, multiple pairs (in the case of thread spawning), or a null pair (in the case of thread retirement) is written into the thread pool.

Figure 4: PIM Lite pipeline.

Note that there is no hazard detection, forwarding, or branch prediction hardware at all to control the flow of instructions through the pipeline. The thread pool's management of multi-threading eliminates data and control dependencies between instructions in the pipeline. This allows such supporting hardware to be excluded from the pipeline, a large savings in control complexity for a minor speed penalty in the design. Thread pool operation and implementation are discussed in greater detail in a later section.

4.1 Datapath Implementation

The PIM Lite datapath is implemented in a bit-sliced organization, with each of the slices pitch-matched to the width of the sense amplifiers and output drivers of the memory macro. This ensures that the datapath accesses the full width of the memory with the most efficient interconnect. Figure 5 shows a section of the layout detailing the pitch-matching between the memory array and the logic. The following sections discuss the implementation of the datapath components.

Figure 5: Pitch-matched datapath floorplan, showing bit-sliced logic aligned with the memory array.

4.1.1 Frame Memory

To avoid stalling the pipeline, the frame memory must simultaneously support two read operations and one write operation. For simplicity, in PIM Lite the entire frame memory is implemented as a dedicated 3-port memory block. In a more general, high-density PIM, however, this approach would not be acceptable for several reasons: first, it “hardwires” the size of the frame memory, and second, adding ports to a cell greatly reduces the memory density (and moreover, is not currently feasible for DRAM technology). An alternative to adding multiple ports to the memory is to retain a single-ported memory and add a multi-ported frame cache, indexed by frame pointer, to the pipeline. Note that the primary role of the frame cache is to expand bandwidth, rather than to reduce latency to memory. A small, multi-ported cache for frames provides the additional ports to memory needed by the CPU without sacrificing density in main memory, and occupies little more space than a conventional register file.

Thus, when a new frame is created, if a part of the frame cache is allocated to the new frame, then individual values can be accessed from the cached frame at speeds rivaling those of classical register files, while still preserving the view that the frame is “in memory.” This provides tremendous simplifications in processing, especially in multi-threaded environments. If a new thread is to be started and there is no room in the frame cache, a simple cache-line replacement can be performed, without having to explicitly store the registers associated with the thread's state. Likewise, when restarting a thread not currently in the frame cache, the first reference causes a cache miss, which brings the frame in from memory, again without a complex and extended series of loads and stores of individual registers. Finally, since all frames are “in memory,” the fact that the frame cache functions as a classical cache makes inter-frame accesses look no different from any other memory access. Together, the combination of a minimal-state ISA using frames for operands, coupled with a frame cache, yields high performance, simplicity of programming, and the ability to efficiently support very high levels of multi-threading.
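To make the frame-cache idea concrete, here is a toy C sketch of a fully associative, whole-frame cache indexed by frame pointer. The size, replacement policy, and memory interface are all assumptions; PIM Lite itself uses a dedicated 3-port frame memory rather than a cache.

```c
#include <stdint.h>
#include <stdbool.h>

#define FC_LINES    8               /* assumed number of cached frames      */
#define FRAME_WORDS 32              /* 4 wide-words x 8 sixteen-bit words   */

typedef struct {
    uint16_t fp;                    /* tag: frame pointer                   */
    bool     valid;
    uint16_t data[FRAME_WORDS];     /* cached copy of the whole frame       */
} fc_line_t;

static fc_line_t fc[FC_LINES];

/* Assumed whole-frame (wide-word granularity) memory transfers. */
extern void mem_read_frame(uint16_t fp, uint16_t *dst);
extern void mem_write_frame(uint16_t fp, const uint16_t *src);

/* On a miss, evict one line back to memory and fill from memory; there is
 * no per-register save/restore, only whole-frame transfers. */
static fc_line_t *fc_lookup(uint16_t fp)
{
    for (int i = 0; i < FC_LINES; i++)
        if (fc[i].valid && fc[i].fp == fp)
            return &fc[i];

    fc_line_t *victim = &fc[0];     /* trivial replacement choice */
    if (victim->valid)
        mem_write_frame(victim->fp, victim->data);
    mem_read_frame(fp, victim->data);
    victim->fp = fp;
    victim->valid = true;
    return victim;
}
```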

4.1.2 Broadcast Bus

The broadcast bus is a 16-bit bus that travels lengthwise across the chip, just before the ALUs, to support vector/scalar and scalar/scalar operations where the operands are not aligned in the same word column. Unaligned scalar/scalar arithmetic operations are implemented as a special case of a vector/scalar operation, where only the word column aligned with the destination operand performs the write-back to the frame cache.

Figure 6: Broadcast bus.

4.1.3 Permutation Network

The permutation network, which supports the permute instruction for rearrangement of vector elements, is implemented in PIM Lite as a full 8-way crossbar switch. The crossbar switch itself is implemented as eight 8:1 multiplexors with both the data and select lines running vertically through the module in a pitch-matched fashion, which allows both the data and permutation vectors to come from wide words in memory. Because vectors in PIM Lite contain only 8 elements, the area and delay penalties for a full crossbar are not prohibitive. For larger implementations, a multistage network may be necessary.

4.2 Thread Scheduling

The thread pool is responsible for scheduling fine-grain threads in a PIM Lite node by storing <FP, IP> pairs and issuing them into the pipeline. In any pipelined processor that exploits fine-grained parallelism, there is a possibility of hazards resulting from inter-thread dependencies. The PIM Lite thread pool ensures that there will be no dependencies in the pipeline by issuing threads with different FPs at each cycle, with at least as many cycles as there are pipeline stages between issuing threads with the same FP. If there are not enough threads in the system to fulfill this condition, then the pool issues a ghost thread, which is the equivalent of a NOP. Performance of this scheme is clearly not optimal when ghost threads must be executed; it is therefore up to the programmer/compiler to ensure that the thread pool has enough distinct threads to prevent the issuing of ghost threads. However, even with simple producer-consumer applications on streams, we have found it simple to extract at least four-way parallelism from the types of operations that one would perform in the memory system.

At a high level, the thread pool consists of four bins, with each bin composed of four slots, where each slot contains one <FP, IP> pair. In addition, each slot has an associated valid bit, which indicates whether or not a pair is located there. Ideally, the thread pool can store 16 independent threads. Whenever any threads share the same frame pointer, however, they are placed in the same bin. We can therefore have up to four simultaneous threads that share the same frame at any given time. At each cycle, a modified round-robin scheduling approach removes an <FP, IP> pair (or a ghost thread) from the pool and inserts it into the processor pipeline for execution. A special “critical bin” is used for queuing <FP, IP> pairs when a program is in a critical section, to ensure that the critical thread is not preempted.

Physically, each bin in the thread pool is composed of four pairs of 16-bit words stored in a three-port register array. This results in 128 bits of data, which is pitch-matched to the width of the macro memory. Between the bins is combinational circuitry that includes 16-bit comparators to compare incoming frame pointers, as well as a number of counters to handle different thread issue situations in the thread pool. The thread pool is accessed heavily during each cycle (in fact, it is accessed in two pipe stages). This, in conjunction with the fairly complex write-back circuitry of the thread pool, means that the thread pool itself is the critical path of the pipeline during operation. The high logical complexity of the thread pool could theoretically result in drastically reduced performance; however, virtually all comparisons and other logical operations are done in parallel.
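The issue rule can be captured in a short C sketch: continuations are binned by frame pointer, a bin that has issued within the last pipeline-depth cycles is skipped, and a ghost thread is reported when nothing is eligible. The bin and slot counts follow the text; everything else (the round-robin pointer, cycle counting, and initialization) is an illustrative assumption.

```c
#include <stdint.h>
#include <stdbool.h>

#define BINS       4
#define SLOTS      4
#define PIPE_DEPTH 4

typedef struct { uint16_t fp, ip; bool valid; } slot_t;

typedef struct {
    slot_t   slot[BINS][SLOTS];  /* all slots in a bin share one FP          */
    uint16_t last_issue[BINS];   /* cycle at which each bin last issued      */
    int      rr;                 /* round-robin bin pointer                  */
} pool_t;
/* Assumes last_issue[] is initialized so every bin is eligible at cycle 0. */

/* Returns true and fills *out when a real thread can issue this cycle;
 * returns false to indicate a ghost thread (NOP). */
static bool pool_issue(pool_t *p, uint16_t cycle, slot_t *out)
{
    for (int n = 0; n < BINS; n++) {
        int b = (p->rr + n) % BINS;
        if ((uint16_t)(cycle - p->last_issue[b]) < PIPE_DEPTH)
            continue;                       /* this FP is still in the pipe */
        for (int s = 0; s < SLOTS; s++) {
            if (p->slot[b][s].valid) {
                *out = p->slot[b][s];
                p->slot[b][s].valid = false;
                p->last_issue[b] = cycle;
                p->rr = (b + 1) % BINS;
                return true;
            }
        }
    }
    return false;                           /* issue a ghost thread */
}
```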

4.3 VLSI Floorplan of a PIM Lite Node

A complete implementation of PIM Lite has been synthesized from VHDL and is now running on an FPGA board. Silicon will be available in the first quarter of 2003. Figure 7 illustrates the layout and component sizes for a single PIM Lite node, based on 0.18 micron design rules. A single node contains approximately 500 K transistors in the memory and 60 K transistors in the CPU logic, for a total of 2.25 M transistors for a 4-node chip.

Figure 7: VLSI floorplan of a PIM Lite node and 4-node chip (major blocks include the thread pool, 4 Kbyte instruction memory, frame memory, permute network, and 4 Kbyte data memory).

5 Conclusions

PIM Lite is a prototype PIM chip developed to support lightweight multi-threading in the memory of massively parallel computer systems. The key contributions of PIM Lite are:

1. it provides a demonstration of a multi-node, multi-threaded PIM chip that supports a broad variety of programming idioms, including vector/SIMD, parcel-based active messages, and thread migration;

2. it minimizes the overhead of starting and swapping threads through the combination of minimizing the program state contained in the CPU and packing that state in wide-words that take advantage of the maximum available on-chip memory bandwidth;

3. it introduces a unified scalar/SIMD/multithreaded instruction set that operates on frames of wide-words;

4. it demonstrates the use of CPU logic circuitry pitch-matched to the width of the memory sense amplifiers to minimize CPU area and to keep interconnect short;

5. it does all of the above at a very low design and fabrication cost.

First silicon of PIM Lite will be available in the first quarter of 2003. A printed circuit card supporting multiple chips has been fabricated and will be used for prototyping application and system software. Follow-on activities include migrating to a DRAM technology, growing the ISA to handle 64-bit, multiple-domain address spaces, enhancing parcel and thread interpretation mechanisms, and expanding the set of wide-word operations.

References

[1] Arvind and Rishiyur S. Nikhil. Executing a program on the MIT tagged-token dataflow architecture. IEEE Transactions on Computers, 39(3):300–318, March 1990. Also appears in Proceedings of PARLE '87, Parallel Architectures and Languages Europe, vol. 2, pp. 1–29.

[2] Arvind and R. S. Nikhil. Executing a program on the MIT tagged-token dataflow architecture. In J. W. de Bakker, A. J. Nijman, and P. C. Treleaven, editors, PARLE ’87, Parallel Architectures and Languages Europe, Volume 2: Parallel Languages. Springer-Verlag, Berlin, DE, 1987. Lecture Notes in Computer Science 259.

[3] J. M. Borkenhagen, R. J. Eickemeyer, R. N. Kalla, and S. R. Kunkel. A multithreaded PowerPC processor for commercial servers. IBM Journal of Research and Development, 44(6):885–898, November 2000.

[4] Jay B. Brockman, Peter M. Kogge, Vincent W. Freeh, Shannon K. Kuntz, and Thomas L. Sterling. Microservers: A new memory semantics for massively parallel computing. In Conference Proceedings of the 1999 International Conference on Supercomputing, pages 454–463, Rhodes, Greece, June 20–25, 1999. ACM SIGARCH.

[5] David E. Culler, Anurag Sah, Klaus E. Schauser, Thorsten von Eicken, and John Wawrzynek. Fine-grain parallelism with minimal hardware support: A compiler-controlled threaded abstract machine. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 164–175, April 1991. Published in Vol. 26, No. 4, April 1991. Also as Tech Report UCB-CSD-90-594, University of California Berkeley, Department of Computer Science.

[6] W. J. Dally, A. Chien, J. A. S. Fiske, G. Fyler, W. Horwat, J. S. Keen, R. A. Lethin, M. Noakes, P. R. Nuth, and D. S. Wills. The message driven processor: An integrated multicomputer processing element. In International Conference on Computer Design, VLSI in Computers and Processors, pages 416–419, Los Alamitos, Ca., USA, October 1992. IEEE Computer Society Press.

[7] Monty Denneau. Blue gene. In SC2000: High Performance Networking and Computing, pages 35–35, Dallas, TX, November 2000. ACM.

[8] Mary Hall, Peter Kogge, Jeff Koller, Pedro Diniz, Jacqueline Chame, Jeff Draper, Jeff LaCoss, John Granacki, Apoorv Srivastava, William Athas, Jay Brockman, Vincent Freeh, Joonseok Park, and Jaewook Shin. Mapping irregular applications to DIVA, a PIM-based data-intensive architecture. In Supercomputing (SC'99), pages ??–??, Portland, Oregon, November 1999. ACM Press and IEEE Computer Society Press.

[9] W. Daniel Hillis and Lewis W. Tucker. The CM-5 Connection Machine: A scalable supercomputer. Communications of the ACM (CACM), 36(11):30–40, November 1993.

[10] Ken Iobst, Maya Gokhale, and Bill Holmes. Processing in memory: The Terasys massively parallel PIM array. IEEE Computer, 28(4):23–??, April 1995.

[11] A. Kay. SMALLTALK. In R. A. Guedj, P. J. W. ten Hagen, F. R. A. Hopgood, H. A. Tucker, and D. A. Duce, editors, Methodology of Interaction, Proc. of the IFIP Workshop on Methodology of Interaction, Seillac II, pages 7–11, 15–17, 1980.

[12] Graham Kirsch. Active memory device delivers massive parallelism. In Microprocessor Forum, San Jose, CA, October 2002.

[13] P. M. Kogge. EXECUBE - A new architecture for scalable MPPs. In Dharma P. Agrawal, editor, Proceedings of the 23rd International Conference on Parallel Processing. Volume 1: Architecture, pages 77–84, Boca Raton, FL, USA, August 1994. CRC Press.

[14] Christoforos E. Kozyrakis, Stylianos Perissakis, David Patterson, Thomas Anderson, Krste Asanović, Neal Cardwell, Richard Fromm, Jason Golbus, Benjamin Gribstad, Kimberly Keeton, Randi Thomas, Noah Treuhaft, and Katherine Yelick. Scalable processors in the billion-transistor era: IRAM. IEEE Computer, 30(9):75–78, September 1997.

[15] G. Lipovski and C. Yu. The dynamic associative access memory chip and its application to SIMD processing and full-text database retrieval. In IEEE International Workshop on Memory Technology, Design and Testing, pages 24–33, San Jose, CA, August 1999. IEEE Computer Society.

[16] Mitsubishi Corporation. M32R/D Datasheet, 1997.

[17] Richard C. Murphy, Peter M. Kogge, and Arun Rodrigues. The characterization of data intensive memory workloads on distributed PIM systems. Lecture Notes in Computer Science, 2107:85–85, 2001.

[18] Rishiyur S. Nikhil and Arvind. Can dataflow subsume von neumann computing? In Proceedings of the 16th Annual International Symposium on Computer Architecture, pages 262–272, June 1989.

[19] M. D. Noakes, D. A. Wallach, and W. J. Dally. The J-machine multicomputer: An architectural eval- uation. In Lubomir Bic, editor, Proceedings of the 20th Annual International Symposium on Computer Architecture, pages 224–236, San Diego, CA, May 1993. IEEE Computer Society Press.

[20] Gregory M. Papadopoulos and David E. Culler. Monsoon: An Explicit Token-Store Architecture. In 17th International Symposium on Computer Architecture, number 18(2) in ACM SIGARCH Computer Architecture News, pages 82–91, Seattle, Washington, May 28–31, June 1990.

[21] Burton J. Smith. A pipelined, shared resource MIMD computer. In Proceedings of the 1978 International Conference on Parallel Processing, pages 6–8, 1978.

[22] Thomas Sterling and Larry Bergman. A design analysis of a hybrid technology multithreaded architec- ture for petaflops scale computation. In Conference Proceedings of the 1999 International Conference on Supercomputing, pages 286–293, Rhodes, Greece, June 20–25, 1999. ACM SIGARCH.

[23] J. M. Tendler, J. S. Dodson, J. S. Fields, Jr., H. Le, and B. Sinharoy. POWER4 system microarchitecture. IBM Journal of Research and Development, 46(1):5–26, January 2002.

[24] Thorsten von Eicken, David E. Culler, Seth Copen Goldstein, and Klaus Erik Schauser. Active messages: A mechanism for integrated communication and computation. In Proceedings of the 19th Annual International Symposium on Computer Architecture, pages 256–266, Gold Coast, Australia, May 1992.
