Cell Multiprocessor Communication Network: Built for Speed

Multicore designs promise various power-performance and area-performance benefits. But inadequate design of the on-chip communication network can deprive applications of these benefits. To illuminate this important point in multicore processor design, the authors analyze the Cell processor's communication network, using a series of benchmarks involving various DMA traffic patterns and synchronization protocols.

Michael Kistler, IBM Austin Research Laboratory
Michael Perrone, IBM TJ Watson Research Center
Fabrizio Petrini, Pacific Northwest National Laboratory

Over the past decade, high-performance computing has ridden the wave of commodity computing, building cluster-based parallel computers that leverage the tremendous growth in processor performance fueled by the commercial world. As this pace slows, processor designers face complex problems in their efforts to increase gate density, reduce power consumption, and design efficient memory hierarchies. Processor developers are looking for solutions that can keep up with the scientific and industrial communities' insatiable demand for computing capability and that also have a sustainable market outside science and industry.

A major trend in computer architecture is integrating system components onto the processor chip. This trend is driving the development of processors that can perform functions typically associated with entire systems. Building modular processors with multiple cores is far more cost-effective than building monolithic processors, which are prohibitively expensive to develop, have high power consumption, and give limited return on investment. Multicore system-on-chip (SoC) processors integrate several identical, independent processing units on the same die, together with network interfaces, acceleration units, and other specialized units.

Researchers have explored several design avenues in both academia and industry. Examples include MIT's Raw multiprocessor, the University of Texas's Trips multiprocessor, AMD's Opteron, IBM's Power5, Sun's Niagara, and Intel's Montecito, among many others. (For details on many of these processors, see the March/April 2005 issue of IEEE Micro.)

In all multicore processors, a major technological challenge is designing the internal, on-chip communication network. To realize the unprecedented computational power of the many available processing units, the network must provide very high performance in latency and in bandwidth. It must also resolve contention under heavy loads, provide fairness, and hide the processing units' physical distribution as completely as possible.

Another important dimension is the nature and semantics of the communication primitives available for interactions between the various processing units. Pinkston and Shin have recently compiled a comprehensive survey of multicore processor design challenges, with particular emphasis on internal communication mechanisms.1

The Cell Broadband Engine processor (known simply as the Cell processor), jointly developed by IBM, Sony, and Toshiba, uses an elegant and natural approach to on-chip communication. Relying on four slotted rings coordinated by a central arbiter, it borrows a mainstream communication model from high-performance networks in which processing units cooperate through remote direct memory accesses (DMAs).2 From functional and performance viewpoints, the on-chip network is strikingly similar to high-performance networks commonly used for remote communication in commodity computing clusters and custom supercomputers.

In this article, we explore the design of the Cell processor's on-chip network and provide insight into its communication and synchronization protocols. We describe the various steps of these protocols, the mechanisms involved, and their basic costs. Our performance evaluation uses a collection of benchmarks of increasing complexity, ranging from basic communication patterns to more demanding collective patterns that expose network behavior under congestion.

Design rationale

The Cell processor's design addresses at least three issues that limit processor performance: memory latency, bandwidth, and power.

Historically, processor performance improvements came mainly from higher processor clock frequencies, deeper pipelines, and wider issue designs. However, memory access speed has not kept pace with these improvements, leading to increased effective memory latencies and complex logic to hide them. Also, because complex cores don't allow a large number of concurrent memory accesses, they underutilize execution pipelines and memory bandwidth, resulting in poor chip area use and increased power dissipation without commensurate performance gains.3 For example, larger memory latencies increase the amount of speculative execution required to maintain high processor utilization. Thus, they reduce the likelihood that useful work is being accomplished and increase administrative overhead and bandwidth requirements. All of these problems lead to reduced power efficiency.

Power use in CMOS processors is approaching the limits of air cooling and might soon begin to require sophisticated cooling techniques.4 These cooling requirements can significantly increase overall system cost and complexity. Decreasing transistor size and correspondingly increasing subthreshold leakage currents further increase power consumption.5 Performance improvements from further increasing processor frequencies and pipeline depths are also reaching their limits.6 Deeper pipelines increase the number of stalls from data dependencies and increase branch misprediction penalties.

The Cell processor addresses these issues by attempting to minimize pipeline depth, increase memory bandwidth, allow more simultaneous, in-flight memory transactions, and improve power efficiency and performance.7 These design goals led to the use of flexible yet simple cores that use area and power efficiently.

Processor overview

The Cell processor is the first implementation of the Cell Broadband Engine Architecture (CBEA), which is a fully compatible extension of the 64-bit PowerPC Architecture. Its initial target is the PlayStation 3 game console, but its capabilities also make it well suited for other applications such as visualization, image and signal processing, and various scientific and technical workloads.

Figure 1 shows the Cell processor's main functional units. The processor is a heterogeneous, multicore chip capable of massive floating-point processing optimized for computation-intensive workloads and rich broadband media applications. It consists of one 64-bit power processor element (PPE), eight specialized coprocessors called synergistic processor elements (SPEs), a high-speed memory controller, and a high-bandwidth bus interface, all integrated on-chip. The PPE and SPEs communicate through an internal high-speed element interconnect bus (EIB).


With a clock speed of 3.2 GHz, the Cell processor has a theoretical peak performance of 204.8 Gflop/s (single precision) and 14.6 Gflop/s (double precision). The EIB supports a peak bandwidth of 204.8 Gbytes/s for intrachip data transfers among the PPE, the SPEs, and the memory and I/O interface controllers. The memory interface controller (MIC) provides a peak bandwidth of 25.6 Gbytes/s to main memory. The I/O controller provides peak bandwidths of 25 Gbytes/s inbound and 35 Gbytes/s outbound.

[Figure 1. Main functional units of the Cell processor. Legend: BEI, broadband engine interface; EIB, element interconnect bus; FlexIO, high-speed I/O interface; L2, level 2 cache; MBL, MIC bus logic; MIC, memory interface controller; PPE, power processor element; SPE, synergistic processor element; XIO, extreme data rate I/O cell; plus a test control unit/pervasive logic block.]

The PPE, the Cell's main processor, runs the operating system and coordinates the SPEs. It is a traditional 64-bit PowerPC processor core with a vector multimedia extension (VMX) unit, 32-Kbyte level 1 instruction and data caches, and a 512-Kbyte level 2 cache. The PPE is a dual-issue, in-order-execution design with two-way simultaneous multithreading.

Each SPE consists of a synergistic processor unit (SPU) and a memory flow controller (MFC). The MFC includes a DMA controller, a memory management unit (MMU), a bus interface unit, and an atomic unit for synchronization with other SPUs and the PPE. The SPU is a RISC-style processor with an instruction set and a microarchitecture designed for high-performance data streaming and data-intensive computation. The SPU includes a 256-Kbyte local-store memory to hold an SPU program's instructions and data. The SPU cannot access main memory directly, but it can issue DMA commands to the MFC to bring data into local store or write computation results back to main memory. The SPU can continue program execution while the MFC independently performs these DMA transactions. No hardware data-load prediction structures exist for local-store management, and each local store must be managed by software.

The MFC performs DMA operations to transfer data between local store and system memory. DMA operations specify system memory locations using fully compliant PowerPC virtual addresses. DMA operations can transfer data between local store and any resources connected via the on-chip interconnect (main memory, another SPE's local store, or an I/O device). Parallel SPE-to-SPE transfers are sustainable at a rate of 16 bytes per SPE clock, whereas aggregate main-memory bandwidth is 25.6 Gbytes/s for the entire Cell processor.

Each SPU has 128 128-bit single-instruction, multiple-data (SIMD) registers. The large number of architected registers facilitates highly efficient instruction scheduling and enables important optimization techniques such as loop unrolling. All SPU instructions are inherently SIMD operations that the pipeline can run at four granularities: 16-way 8-bit integers, eight-way 16-bit integers, four-way 32-bit integers or single-precision floating-point numbers, or two 64-bit double-precision floating-point numbers.

The SPU is an in-order processor with two instruction pipelines, referred to as the even and odd pipelines. The floating- and fixed-point units are on the even pipeline, and the rest of the functional units are on the odd pipeline. Each SPU can issue and complete up to two instructions per cycle, one per pipeline. The SPU can approach this theoretical limit for a wide variety of applications. All single-precision operations (8-bit, 16-bit, or 32-bit integers or 32-bit floats) are fully pipelined and can be issued at the full SPU clock rate (for example, four 32-bit floating-point operations per SPU clock cycle). The two-way double-precision floating-point operation is partially pipelined, so its instructions issue at a lower rate (two double-precision flops every seven SPU clock cycles). When using single-precision floating-point fused multiply-add instructions (which count as two operations), the eight SPUs can perform a total of 64 operations per cycle.
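To make the SIMD granularities concrete, here is a minimal sketch of a four-way single-precision multiply-add written with the SPU intrinsics from spu_intrinsics.h; the function and array names are our illustrations, not code from the article.

    #include <spu_intrinsics.h>

    /* y[i] = a * x[i] + y[i], four single-precision elements per spu_madd.
       Assumes n is a multiple of 4 and the arrays live in local store. */
    void saxpy_spu(int n, float a, const vector float *x, vector float *y)
    {
        int i;
        vector float va = spu_splats(a);        /* replicate a into all four slots */
        for (i = 0; i < n / 4; i++)
            y[i] = spu_madd(va, x[i], y[i]);    /* fused multiply-add: two flops per slot */
    }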

[Figure 2. Element interconnect bus (EIB). The PPE, the eight SPEs (SPE0 through SPE7), the MIC, and the two I/O interfaces (IOIF0 with the broadband interface, BIF, and IOIF1) attach to the EIB data network, which is managed by the data bus arbiter.]

Communication architecture

To take advantage of all the computation power available on the Cell processor, work must be distributed and coordinated across the PPE and the SPEs. The processor's specialized communication mechanisms allow efficient data collection and distribution as well as coordination of concurrent activities across the computation elements. Because the SPU can act directly only on programs and data in its own local store, each SPE has a DMA controller that performs high-bandwidth data transfer between local store and main memory. These DMA engines also allow direct transfers between the local stores of two SPUs for pipeline or producer-consumer-style parallel applications.

At the other end of the spectrum, the SPU can use either signals or mailboxes to perform simple low-latency signaling to the PPE or other SPEs. Supporting more-complex synchronization mechanisms is a set of atomic operations available to the SPU, which operate in a similar manner as the PowerPC architecture's lwarx/stwcx atomic instructions. In fact, the SPU's atomic operations interoperate with PPE atomic instructions to build locks and other synchronization mechanisms that work across the SPEs and the PPE. Finally, the Cell allows memory-mapped access to nearly all SPE resources, including the entire local store. This provides a convenient and consistent mechanism for special communications needs not met by the other techniques.

The rich set of communications mechanisms in the Cell architecture enables programmers to efficiently implement widely used programming models for parallel and distributed applications. These models include the function-offload, device-extension, computational-acceleration, streaming, shared-memory-multiprocessor, and asymmetric-thread-runtime models.8

Element interconnect bus

Figure 2 shows the EIB, the heart of the Cell processor's communication architecture, which enables communication among the PPE, the SPEs, main system memory, and external I/O. The EIB has separate communication paths for commands (requests to transfer data to or from another element on the bus) and data.


Each bus element is connected through a point-to-point link to the address concentrator, which receives and orders commands from bus elements, broadcasts the commands in order to all bus elements (for snooping), and then aggregates and broadcasts the command response. The command response is the signal to the appropriate bus elements to start the data transfer.

The EIB data network consists of four 16-byte-wide data rings: two running clockwise, and the other two counterclockwise. Each ring potentially allows up to three concurrent data transfers, as long as their paths don't overlap. To initiate a data transfer, bus elements must request data bus access. The EIB data bus arbiter processes these requests and decides which ring should handle each request. The arbiter always selects one of the two rings that travel in the direction of the shortest transfer, thus ensuring that the data won't need to travel more than halfway around the ring to its destination. The arbiter also schedules the transfer to ensure that it won't interfere with other in-flight transactions. To minimize stalling on reads, the arbiter gives priority to requests coming from the memory controller. It treats all others equally in round-robin fashion. Thus, certain communication patterns will be more efficient than others.

The EIB operates at half the processor-clock speed. Each EIB unit can simultaneously send and receive 16 bytes of data every bus cycle. The EIB's maximum data bandwidth is limited by the rate at which addresses are snooped across all units in the system, which is one address per bus cycle. Each snooped address request can potentially transfer up to 128 bytes, so in a 3.2-GHz Cell processor, the theoretical peak data bandwidth on the EIB is 128 bytes × 1.6 GHz = 204.8 Gbytes/s.

The on-chip I/O interfaces allow two Cell processors to be connected using a coherent protocol called the broadband interface (BIF), which effectively extends the multiprocessor network to connect both PPEs and all 16 SPEs in a single coherent network. The BIF protocol operates over IOIF0, one of the two available on-chip I/O interfaces; the other interface, IOIF1, operates only in noncoherent mode. The IOIF0 bandwidth is configurable, with a peak of 30 Gbytes/s outbound and 25 Gbytes/s inbound.

The actual data bandwidth achieved on the EIB depends on several factors: the destination and source's relative locations, the chance of a new transfer's interfering with transfers in progress, the number of Cell chips in the system, whether data transfers are to/from memory or between local stores in the SPEs, and the data arbiter's efficiency. Reduced bus bandwidths can result in the following cases:

• All requestors access the same destination, such as the same local store, at the same time.
• All transfers are in the same direction and cause idling on two of the four data rings.
• A large number of partial cache line transfers lowers bus efficiency.
• All transfers must travel halfway around the ring to reach their destinations, inhibiting units on the way from using the same ring.

Memory flow controller

Each SPE contains an MFC that connects the SPE to the EIB and manages the various communication paths between the SPE and the other Cell elements. The MFC runs at the EIB's frequency, that is, at half the processor's speed. The SPU interacts with the MFC through the SPU channel interface. Channels are unidirectional communication paths that act much like first-in, first-out fixed-capacity queues. This means that each channel is defined as either read-only or write-only from the SPU's perspective. In addition, some channels are defined with blocking semantics, meaning that a read of an empty read-only channel or a write to a full write-only channel causes the SPU to block until the operation completes. Each channel has an associated count that indicates the number of available elements in the channel. The SPU uses the read channel (rdch), write channel (wrch), and read channel count (rchcnt) assembly instructions to access the SPU channels.
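In C, these channel instructions are reachable through the SDK's channel intrinsics. The sketch below assumes the channel mnemonics defined in the SDK headers (spu_mfcio.h); it uses the channel count of the MFC command channel to check for free DMA-queue slots before enqueueing more work, since writing to a full, blocking channel would simply stall the SPU until a slot frees up.

    #include <spu_mfcio.h>

    /* Returns nonzero if the MFC SPU command queue has at least one free entry.
       spu_readchcnt() maps onto the rchcnt instruction; reading the count of the
       write-only, blocking MFC_Cmd channel reports how many DMA commands can
       still be enqueued without stalling (equivalently: mfc_stat_cmd_queue()). */
    static inline int dma_slot_available(void)
    {
        return spu_readchcnt(MFC_Cmd) > 0;
    }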

DMA. The MFC accepts and processes DMA commands that the SPU or the PPE issued using the SPU channel interface or memory-mapped I/O (MMIO) registers. DMA commands queue in the MFC, and the SPU or PPE (whichever issued the command) can continue execution in parallel with the data transfer, using either polling or blocking interfaces to determine when the transfer is complete. This autonomous execution of MFC DMA commands allows convenient scheduling of DMA transfers to hide memory latency.

The MFC supports naturally aligned transfers of 1, 2, 4, or 8 bytes, or a multiple of 16 bytes to a maximum of 16 Kbytes. DMA list commands can request a list of up to 2,048 DMA transfers using a single MFC DMA command. However, only the MFC's associated SPU can issue DMA list commands. A DMA list is an array of DMA source/destination addresses and lengths in the SPU's local storage. When an SPU issues a DMA list command, the SPU specifies the address and length of the DMA list in the local store.9 Peak performance is achievable for transfers when both the effective address and the local storage address are 128-byte aligned and the transfer size is an even multiple of 128 bytes.
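As an illustration of the blocking and polling completion interfaces, here is a sketch using the composite DMA intrinsics from the SDK's spu_mfcio.h; the buffer name, tag value, and 16-Kbyte transfer size are illustrative choices, not values prescribed by the article.

    #include <spu_mfcio.h>

    #define TAG 5                               /* any tag in 0..31 */

    /* 128-byte alignment gives the best DMA performance. */
    static volatile unsigned char buf[16384] __attribute__((aligned(128)));

    void fetch_and_wait(unsigned long long src_ea)
    {
        /* Enqueue a get: main storage -> local store. */
        mfc_get((void *)buf, src_ea, sizeof(buf), TAG, 0, 0);

        /* Blocking wait: stall until every DMA tagged TAG has completed. */
        mfc_write_tag_mask(1 << TAG);
        mfc_read_tag_status_all();
    }

    int fetch_is_done(void)
    {
        /* Polling alternative: returns nonzero once the tagged DMAs are done,
           without stalling the SPU. */
        mfc_write_tag_mask(1 << TAG);
        return mfc_read_tag_status_immediate() != 0;
    }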
Atomic operations. To support more complex synchronization mechanisms, the SPU can use special DMA operations to atomically update a lock line in main memory. These operations, called get-lock-line-and-reserve (getllar) and put-lock-line-conditional (putllc), are conceptually equivalent to the PowerPC load-and-reserve (lwarx) and store-conditional (stwcx) instructions.

The getllar operation reads the value of a synchronization variable in main memory and sets a reservation on this location. If the PPE or another SPE subsequently modifies the synchronization variable, the SPE loses its reservation. The putllc operation updates the synchronization variable only if the SPE still holds a reservation on its location. If putllc fails, the SPE must reissue getllar to obtain the synchronization variable's new value and then retry the attempt to update it with another putllc. The MFC's atomic unit performs the atomic DMA operations and manages reservations held by the SPE.

Using atomic updates, the SPU can participate with the PPE and other SPUs in locking protocols, barriers, or other synchronization mechanisms. The atomic operations available to the SPU also have some special features, such as notification through an interrupt when a reservation is lost, that enable more efficient and powerful synchronization than traditional approaches.

Signal notification and mailboxes. The signal notification facility supports two signaling channels: Sig_Notify_1 and Sig_Notify_2. The SPU can read its own signal channels using the read-blocking SPU channels SPU_RdSigNotify1 and SPU_RdSigNotify2. The PPE or an SPU can write to these channels using memory-mapped addresses. A special feature of the signaling channels is that they can be configured to treat writes as logical OR operations, allowing simple but powerful collective communication across processors.

Each SPU also has a set of mailboxes that can function as a narrow (32-bit) communication channel to the PPE or another SPE. The SPU has a four-entry, read-blocking inbound mailbox and two single-entry, write-blocking outbound mailboxes, one of which will also generate an interrupt to the PPE when the SPE writes to it. The PPE uses memory-mapped addresses to write to the SPU's inbound mailbox and read from either of the SPU's outbound mailboxes. In contrast to the signal notification channels, mailboxes are much better suited for one-to-one communication patterns such as master-slave or producer-consumer models. A typical round-trip communication using mailboxes between two SPUs takes approximately 300 nanoseconds (ns).
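A minimal SPU-side mailbox exchange might look like the following sketch, which assumes the mailbox intrinsics declared in the SDK's spu_mfcio.h; the message encoding and worker routine are invented for illustration. The reads and writes block when the mailbox is empty or full, respectively, which is what makes the mechanism convenient for master-slave handshakes.

    #include <spu_mfcio.h>

    extern unsigned int do_work(unsigned int cmd);   /* hypothetical worker */

    /* Wait for a command word from the PPE, do the work, report a status word. */
    void serve_one_request(void)
    {
        unsigned int cmd = spu_read_in_mbox();       /* blocks until the PPE writes */
        unsigned int status = do_work(cmd);

        spu_write_out_mbox(status);                  /* blocks if the outbound slot is full */
        /* spu_write_out_intr_mbox(status) would instead use the interrupting
           outbound mailbox, raising an interrupt on the PPE. */
    }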


Memory-mapped I/O (MMIO) resources. Memory-mapped resources play a role in many of the communication mechanisms already discussed, but these are really just special cases of the Cell architecture's general practice of making all SPE resources available through MMIO. These resources fall into four broad classes:

• Local storage. All of an SPU's local storage can be mapped into the effective-address space. This allows the PPE to access the SPU's local storage with simple loads and stores, though doing so is far less efficient than using DMA. MMIO access to local storage is not synchronized with SPU execution, so programmers must ensure that the SPU program is designed to allow unsynchronized access to its data (for example, by using "volatile" variables) when exploiting this feature.
• Problem state memory map. Resources in this class, intended for use directly by application programs, include access to the SPE's DMA engine, mailbox channels, and signal notification channels.
• Privilege 1 memory map. These resources are available to privileged programs such as the operating system or authorized subsystems to monitor and control the execution of SPU applications.
• Privilege 2 memory map. The operating system uses these resources to control the resources available to the SPE.

DMA flow

The SPE's DMA engine handles most communications between the SPU and other Cell elements and executes DMA commands issued by either the SPU or the PPE. A DMA command's data transfer direction is always referenced from the SPE's perspective. Therefore, commands that transfer data into an SPE (from main storage to local store) are considered get commands (gets), and transfers of data out of an SPE (from local store to main storage) are considered put commands (puts).

DMA transfers are coherent with respect to main storage. Programmers should be aware that the MFC might process the commands in the queue in a different order from that in which they entered the queue. When order is important, programmers must use special forms of the get and put commands to enforce either barrier or fence semantics against other commands in the queue.

The MFC's MMU handles address translation and protection checking of DMA accesses to main storage, using information from page and segment tables defined in the PowerPC architecture. The MMU has a built-in translation look-aside buffer (TLB) for caching the results of recently performed translations.

The MFC's DMA controller (DMAC) processes DMA commands queued in the MFC. The MFC contains two separate DMA command queues:

• MFC SPU command queue, for commands issued by the associated SPU using the channel interface; and
• MFC proxy command queue, for commands issued by the PPE or other devices using MMIO registers.

The Cell architecture doesn't dictate the size of these queues and recommends that software not assume a particular size. This is important to ensure functional correctness across Cell architecture implementations. However, programs should use DMA queue entries efficiently because attempts to issue DMA commands when the queue is full will lead to performance degradation. In the Cell processor, the MFC SPU command queue contains 16 entries, and the MFC proxy command queue contains eight entries.

Figure 3 illustrates the basic flow of a DMA transfer to main storage initiated by an SPU.

[Figure 3. Basic flow of a DMA transfer. The numbered steps trace a command from the SPU channel interface into the MFC's DMA queues (SPU and proxy), through the DMAC, the MMU and its TLB, and the BIU, onto the EIB, and out to the MIC and off-chip memory. Legend: BIU, bus interface unit; DMAC, direct memory access controller; EIB, element interconnect bus; LS, local store; MFC, memory flow controller; MIC, memory interface controller; MMIO, memory-mapped I/O; MMU, memory management unit; SPU, synergistic processor unit; TLB, translation lookaside buffer.]

The process consists of the following steps:

1. The SPU uses the channel interface to place the DMA command in the MFC SPU command queue.
2. The DMAC selects a command for processing. The set of rules for selecting the command for processing is complex, but, in general, a) commands in the SPU command queue take priority over commands in the proxy command queue, b) the DMAC alternates between get and put commands, and c) the command must be ready (not waiting for address resolution or list element fetch or dependent on another command).
3. If the command is a DMA list command and requires a list element fetch, the DMAC queues a request for the list element to the local-store interface. When the list element is returned, the DMAC updates the DMA entry and must reselect it to continue processing.
4. If the command requires address translation, the DMAC queues it to the MMU for processing. When the translation is available in the TLB, processing proceeds to the next step (unrolling). On a TLB miss, the MMU performs the translation, using the page tables stored in main memory, and updates the TLB. The DMA entry is updated and must be reselected for processing to continue.
5. Next, the DMAC unrolls the command, that is, creates a bus request to transfer the next block of data for the command. This bus request can transfer up to 128 bytes of data but can transfer less, depending on alignment issues or the amount of data the DMA command requests. The DMAC then queues this bus request to the bus interface unit (BIU).


6. The BIU selects the request from its queue and issues the command to the EIB. The EIB orders the command with other outstanding requests and then broadcasts the command to all bus elements. For transfers involving main memory, the MIC acknowledges the command to the EIB, which then informs the BIU that the command was accepted and data transfer can begin.
7. The BIU in the MFC performs the reads to local store required for the data transfer. The EIB transfers the data for this request between the BIU and the MIC. The MIC transfers the data to or from the off-chip memory.
8. The unrolling process produces a sequence of bus requests for the DMA command that pipeline through the communication network. The DMA command remains in the MFC SPU command queue until all its bus requests have completed. However, the DMAC can continue to process other DMA commands. When all bus requests for a command have completed, the DMAC signals command completion to the SPU and removes the command from the queue.

In the absence of congestion, a thread running on the SPU can issue a DMA request in as little as 10 clock cycles, the time needed to write to the five SPU channels that describe the source and destination addresses, the DMA size, the DMA tag, and the DMA command. At that point, the DMAC can process the DMA request without SPU intervention.
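The channel writes mentioned above map directly onto the MFC parameter channels. A rough sketch follows, using the channel mnemonics defined in the SDK's spu_mfcio.h; the wrapper function is ours, and six writes appear because the upper half of the 64-bit effective address is set explicitly.

    #include <spu_mfcio.h>

    /* Enqueue a get by writing the MFC parameter channels one at a time.
       The composite intrinsic mfc_get() expands to essentially this sequence. */
    static void enqueue_get(void *ls, unsigned long long ea,
                            unsigned int size, unsigned int tag)
    {
        spu_writech(MFC_LSA,   (unsigned int)(unsigned long)ls); /* local-store address   */
        spu_writech(MFC_EAH,   (unsigned int)(ea >> 32));        /* effective addr, high  */
        spu_writech(MFC_EAL,   (unsigned int)ea);                /* effective addr, low   */
        spu_writech(MFC_Size,  size);                            /* transfer size (bytes) */
        spu_writech(MFC_TagID, tag);                             /* completion tag        */
        spu_writech(MFC_Cmd,   MFC_GET_CMD);                     /* command opcode: go    */
    }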


The overall latency of generating the DMA command, initially selecting the command, and unrolling the first bus request to the BIU (or, in simpler terms, the flow-through latency from SPU issue to injection of the bus request into the EIB) is roughly 30 SPU cycles when all resources are available. If list element fetch is required, it can add roughly 20 SPU cycles. MMU translation exceptions by the SPE are very expensive and should be avoided if possible. If the queue in the BIU becomes full, the DMAC is blocked from issuing further requests until resources become available again.

A transfer's command phase involves snooping operations for all bus elements to ensure coherence and typically requires some 50 bus cycles (100 SPU cycles) to complete. For gets, the remaining latency is attributable to the data transfer from off-chip memory to the memory controller and then across the bus to the SPE, which writes it to local store. For puts, DMA latency doesn't include transferring data all the way to off-chip memory because the SPE considers the put complete once all data have been transferred to the memory controller.

Experimental results

We conducted a series of experiments to explore the major performance aspects of the Cell's on-chip communication network, its protocols, and the pipelined communication's impact. We developed a suite of microbenchmarks to analyze the internal interconnection network's architectural features. Following the research path of previous work on traditional high-performance networks,2 we adopted an incremental approach to gain insight into several aspects of the network. We started our analysis with simple pairwise, congestion-free DMAs, and then we increased the benchmarks' complexity by including several patterns that expose the contention resolution properties of the network under heavy load.

Methodology

Because of the limited availability of Cell boards at the time of our experiments, we performed most of the software development on IBM's Full-System Simulator for the Cell Broadband Engine Processor (available at http://www-128.ibm.com/developerworks/power/cell/).10,11 We collected the results presented here using an experimental evaluation board at the IBM Research Center in Austin. The Cell processor on this board was running at 3.2 GHz. We also obtained results from an internal version of the simulator that includes performance models for the MFC, EIB, and memory subsystems. Performance simulation for the Cell processor is still under development, but we found good correlation between the simulator and hardware in our experiments. The simulator let us observe aspects of system behavior that would be difficult or practically impossible to observe on actual hardware.

We developed the benchmarks in C, using several Cell-specific libraries to orchestrate activities between the PPE and various SPEs. In all tests, the DMA operations are issued by the SPUs. We wrote DMA operations in C language intrinsics,12 which in most cases produce inline assembly instructions to specify commands to the MFC through the SPU channel interface. We measured elapsed time in the SPE using a special register, the SPU decrementer, which ticks every 40 ns (or 128 processor clock cycles). PPE and SPE interactions were performed through mailboxes, input and output SPE registers that can send and receive messages in as little as 150 ns. We implemented a simple synchronization mechanism to start the SPUs in a coordinated way. SPUs notify the completion of benchmark phases through an atomic fetch-and-add operation on a main memory location that is polled infrequently and unintrusively by the PPE.
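For concreteness, here is a hedged sketch of the two measurement ingredients just described: timing with the SPU decrementer and a getllar/putllc fetch-and-add for phase notification. It uses the decrementer and atomic intrinsics from the SDK's spu_mfcio.h; the variable names, the offset of the counter within its lock line, and the status-bit check are assumptions rather than the authors' exact harness.

    #include <spu_mfcio.h>

    /* A 128-byte, 128-byte-aligned local-store buffer is required for the
       atomic lock-line operations; the counter is assumed at offset 0. */
    static volatile unsigned int lock_line[32] __attribute__((aligned(128)));

    /* Atomically add 'inc' to a 32-bit counter in main memory (fetch-and-add).
       counter_ea must point to a 128-byte-aligned lock line. */
    static unsigned int fetch_and_add(unsigned long long counter_ea, unsigned int inc)
    {
        unsigned int old;
        do {
            mfc_getllar((void *)lock_line, counter_ea, 0, 0);
            (void)mfc_read_atomic_status();          /* wait for getllar to finish */
            old = lock_line[0];
            lock_line[0] = old + inc;
            mfc_putllc((void *)lock_line, counter_ea, 0, 0);
        } while (mfc_read_atomic_status() & MFC_PUTLLC_STATUS);  /* retry if reservation lost */
        return old;
    }

    /* Time a region of code with the SPU decrementer (which counts down). */
    static unsigned int elapsed_ticks_example(void)
    {
        unsigned int start, end;
        spu_write_decrementer(0xFFFFFFFF);
        start = spu_read_decrementer();
        /* ... issue and wait for the DMAs being measured ... */
        end = spu_read_decrementer();
        return start - end;
    }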

Basic DMA performance

The first step of our analysis measures the latency and bandwidth of simple blocking puts and gets, when the target is in main memory or in another SPE's local store. Table 1 breaks down the latency of a DMA operation into its components.

Table 1. DMA latency components for a clock frequency of 3.2 GHz.

  Latency component                 Cycles   Nanoseconds
  DMA issue                             10         3.125
  DMA to EIB                            30         9.375
  List element fetch                    10         3.125
  Coherence protocol                   100        31.25
  Data transfer for inter-SPE put      140        43.75
  Total                                290        90.625

Figure 4a shows DMA operation latency for a range of sizes. In these results, transfers between two local stores were always performed between SPEs 0 and 1, but in all our experiments, we found no performance difference attributable to SPE location. The results show that puts to main memory and gets and puts to local store had a latency of only 91 ns for transfers of up to 512 bytes (four cache lines). There was little difference between puts and gets to local store because local-store access latency was remarkably low, only 8 ns (about 24 processor clock cycles). Main memory gets required less than 100 ns to fetch information, which is remarkably fast.

Figure 4b presents the same results in terms of bandwidth achieved by each DMA operation. As we expected, the largest transfers achieved the highest bandwidth, which we measured as 22.5 Gbytes/s for gets and puts to local store and puts to main memory, and 15 Gbytes/s for gets from main memory.

[Figure 4. Latency (a) and bandwidth (b) as a function of DMA message size for blocking gets and puts in the absence of contention. Curves: blocking get to memory, blocking get to SPE local store, blocking put to memory, blocking put to SPE local store; DMA message sizes range from 4 bytes to 16,384 bytes.]

Next, we considered the impact of nonblocking DMA operations. In the Cell processor, each SPE can have up to 16 outstanding DMAs, for a total of 128 across the chip, allowing unprecedented levels of parallelism in on-chip communication. Applications that rely heavily on random scatter or gather accesses to main memory can take advantage of these communication features seamlessly. Our benchmarks use a batched communication model, in which the SPU issues a fixed number (the batch size) of DMAs before blocking for notification of request completion. By using a very large batch size (16,384 in our experiments), we effectively converted the benchmark to use a nonblocking communication model.

Figure 5 shows the results of these experiments, including aggregate latency and bandwidth for the set of DMAs in a batch, by batch size and data transfer size. The results show a form of performance continuity between blocking (the most constrained case) and nonblocking operations, with different degrees of freedom expressed by the increasing batch size. In accessing main memory and local storage, nonblocking puts achieved the asymptotic bandwidth of 25.6 Gbytes/s, determined by the EIB capacity at the endpoints, with 2-Kbyte DMAs (Figures 5b and 5f). Accessing local store, nonblocking puts achieved the optimal value with even smaller packets (Figure 5f).

Gets are also very efficient when accessing local memories, and the main memory latency penalty slightly affects them, as Figure 5c shows. Overall, even a limited amount of batching is very effective for intermediate DMA sizes, between 256 bytes and 4 Kbytes, with a factor of two or even three of bandwidth increase compared with the blocking case (for example, 256-byte DMAs in Figure 5h).
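The batched communication model reduces, in the nonblocking limit, to issuing the whole batch under one tag and waiting once at the end. A hedged sketch using the spu_mfcio.h intrinsics; the batch size, tag, and buffer layout are illustrative and the per-transfer size is assumed to be at most 2,048 bytes.

    #include <spu_mfcio.h>

    #define BATCH 16            /* batch size: DMAs issued before waiting */
    #define TAG   7

    static volatile unsigned char dst[BATCH][2048] __attribute__((aligned(128)));

    /* Issue BATCH gets back to back, then block once for the whole batch. */
    void batched_gets(unsigned long long src_ea, unsigned int size)
    {
        int i;
        for (i = 0; i < BATCH; i++)
            mfc_get((void *)dst[i], src_ea + (unsigned long long)i * size,
                    size, TAG, 0, 0);

        mfc_write_tag_mask(1 << TAG);
        mfc_read_tag_status_all();    /* single completion notification for the batch */
    }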


[Figure 5. DMA performance dimensions: how latency and bandwidth are affected by the choice of DMA target (main memory or local store), DMA direction (put or get), and DMA synchronization (blocking after each DMA, blocking after a constant number of DMAs ranging from two to 32, and nonblocking with a final fence). Panels: (a) put latency, main memory; (b) put bandwidth, main memory; (c) get latency, main memory; (d) get bandwidth, main memory; (e) put latency, local store; (f) put bandwidth, local store; (g) get latency, local store; (h) get bandwidth, local store. DMA message sizes range from 4 bytes to 16,384 bytes.]

12 IEEE MICRO communication patterns than the single pair- 26 wise DMAs discussed so far. We also analyzed 25 how the system performs for several common patterns of collective communications under 24 heavy load, in terms of both local perfor- 23 mance—what each SPE achieves—and aggre- 22 gate behavior—the performance level of all SPEs involved in the communication. 21 All results reported in this section are for 20 nonblocking communication. Each SPE Put, memory 19 Get, memory issues a sequence of 16,384-byte DMA com- Put, SPE 18 mands. The aggregate bandwidth reported is Get, SPE Aggregate bandwidth (Gbytes/second) the sum of the communication bandwidths 17 reached by each SPE. 12345678 Synergistic processor element The first important case is the hot spot, in (a) which many or, in the worst case, all SPEs are accessing main memory or a specific local 200 13 Uniform traffic store. This is a very demanding pattern that 180 Complement traffic exposes how the on-chip network and the 160 Pairwise traffic, put communication protocols behave under stress. Pairwise traffic, get It is representative of the most straightforward 140 code parallelizations, which distribute com- 120 putation to a collection of threads that fetch 100 data from main memory, perform the desired computation, and store the results without 80 SPE interaction. Figure 6a shows that the Cell 60 processor resolves hot spots in accesses to local 40

storage optimally, reaching the asymptotic per- Aggregate bandwidth (Gbytes/second) 20 formance with two or more SPEs. (For the 12345678 SPE hot spot tests, the number of SPEs (b) Synergistic processor element includes the hot node; x SPEs include 1 hot SPE plus x − 1 communication partners.) Figure 6. Aggregate communication performance: hot spots (a) and collec- Counterintuitively, get commands outper- tive communication patterns (b). form puts under load. In fact, with two or more SPEs, two or more get sources saturate the bandwidth either in main or local store. The partner. Note that SPEs with numerically con- put protocol, on the other hand, suffers from secutive numbers might not be physically a minor performance degradation, approxi- adjacent on the Cell hardware layout. mately 1.5 Gbytes/s less than the optimal value. The first static pattern, complement, is The second case is collective communica- resolved optimally by the network, and can tion patterns, in which all the SPES are both be mapped to the four rings with an aggregate source and target of the communication. Fig- performance slightly below 200 Gbytes/s (98 ure 6b summarizes the performance aspects percent of aggregate peak bandwidth). of the most common patterns that arise from The direction of data transfer affects the typical parallel applications. In the two static pairwise pattern’s performance. As the hot spot patterns, complement and pairwise, each SPE experiments show, gets have better contention executes a sequence of DMAs to a fixed tar- resolution properties under heavy load, and get SPE. In the complement pattern, each Figure 6b further confirms this, showing a gap SPE selects the target SPE by complementing of 40 Gbytes/s in aggregate bandwidth the bit string that identifies the source. In the between put- and get-based patterns. pairwise pattern, the SPEs are logically orga- The most difficult communication pattern, nized in pairs < i, i + 1 >, where i is an even arguably the worst case for this type of on-chip number, and each SPE communicates with its network, is uniform traffic, in which each SPE
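The two static patterns are easy to generate in code. A small sketch of the target-selection rules described above, assuming SPEs are numbered 0 through 7; the function names are ours, not from the benchmark suite.

    /* Complement pattern: flip every bit of the SPE identifier, so with eight
       SPEs, SPE 0 pairs with SPE 7, SPE 1 with SPE 6, and so on. */
    int complement_target(int spe_id, int num_spes /* power of two */)
    {
        return spe_id ^ (num_spes - 1);
    }

    /* Pairwise pattern: SPEs form pairs <i, i+1> with i even, and each SPE
       exchanges data with its partner in the pair. */
    int pairwise_target(int spe_id)
    {
        return (spe_id % 2 == 0) ? spe_id + 1 : spe_id - 1;
    }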


The most difficult communication pattern, arguably the worst case for this type of on-chip network, is uniform traffic, in which each SPE randomly chooses a DMA target across the SPEs' local memories. In this case, aggregate bandwidth is only 80 Gbytes/s. We also explored other static communication patterns and found results in the same aggregate-performance range.

Finally, Figure 7 shows the distribution of DMA latencies for all SPEs during the main-memory hot-spot pattern execution. One important result is that the distributions show no measurable differences across the SPEs, evidence of a fair and efficient policy for network resource allocation. The peak of both distributions shows a sevenfold latency increase, at about 5.6 µs. In both cases, the worst-case latency is only 13 µs, twice the average. This is another remarkable result, demonstrating that applications can rely on a responsive and fair network even with the most demanding traffic patterns.

[Figure 7. Latency distribution with a main-memory hot spot: puts (a) and gets (b). Each panel plots the number of DMAs against latency (0 to 14 µs) for blocking and nonblocking operations.]

Major obstacles in the traditional path to processor performance improvement have led chip manufacturers to consider multicore designs. These architectural solutions promise various power-performance and area-performance benefits. But designers must take care to ensure that these benefits are not lost because of the on-chip communication network's inadequate design. Overall, our experimental results demonstrate that the Cell processor's communications subsystem is well matched to the processor's computational capacity. The communications network provides the speed and bandwidth that applications need to exploit the processor's computational power. MICRO

References

1. T.M. Pinkston and J. Shin, "Trends toward On-Chip Networked Microsystems," Int'l J. High Performance Computing and Networking, vol. 3, no. 1, 2005, pp. 3-18.
2. J. Beecroft et al., "QsNetII: Defining High-Performance Network Design," IEEE Micro, vol. 25, no. 4, July/Aug. 2005, pp. 34-47.
3. W.A. Wulf and S.A. McKee, "Hitting the Memory Wall: Implications of the Obvious," ACM Sigarch Computer Architecture News, vol. 23, no. 1, Mar. 1995, pp. 20-24.
4. U. Ghoshal and R. Schmidt, "Refrigeration Technologies for Sub-Ambient Temperature Operation of Computing Systems," Proc. Int'l Solid-State Circuits Conf. (ISSCC 2000), IEEE Press, 2000, pp. 216-217, 458.
5. R.D. Isaac, "The Future of CMOS Technology," IBM J. Research and Development, vol. 44, no. 3, May 2000, pp. 369-378.
6. V. Srinivasan et al., "Optimizing Pipelines for Power and Performance," Proc. 35th Ann. Int'l Symp. Microarchitecture (Micro 35), IEEE Press, 2002, pp. 333-344.
7. H.P. Hofstee, "Power Efficient Processor Architecture and the Cell Processor," Proc. 11th Int'l Symp. High-Performance Computer Architecture (HPCA-11), IEEE Press, 2005, pp. 258-262.
8. J.A. Kahle et al., "Introduction to the Cell Multiprocessor," IBM J. Research and Development, vol. 49, no. 4/5, 2005, pp. 589-604.
9. IBM Cell Broadband Engine Architecture 1.0, http://www-128.ibm.com/developerworks/power/cell/downloads_doc.html.
10. IBM Full-System Simulator for the Cell Broadband Engine Processor, IBM AlphaWorks Project, http://www.alphaworks.ibm.com/tech/cellsystemsim.
11. J.L. Peterson et al., "Application of Full-System Simulation in Exploratory System Design and Development," IBM J. Research and Development, vol. 50, no. 2/3, 2006, pp. 321-332.
12. IBM SPU C/C++ Language Extensions 2.1, http://www-128.ibm.com/developerworks/power/cell/downloads_doc.html.
13. G.F. Pfister and V.A. Norton, "Hot Spot Contention and Combining in Multistage Interconnection Networks," IEEE Trans. Computers, vol. 34, no. 10, Oct. 1985, pp. 943-948.

Michael Kistler is a senior software engineer in the IBM Austin Research Laboratory. His research interests include parallel and cluster computing, fault tolerance, and full-system simulation of high-performance computing systems. Kistler has an MS in computer sci- ence from Syracuse University.

Michael Perrone is the manager of the IBM TJ Watson Research Center's Cell Solutions Department. His research interests include algorithmic optimization for the Cell processor, parallel computing, and statistical machine learning. Perrone has a PhD in physics from Brown University.

Fabrizio Petrini is a laboratory fellow in the Applied Computer Science Group of the Com- putational Sciences and Mathematics Division at Pacific Northwest National Laboratory. His research interests include various aspects of supercomputers, such as high-performance interconnection networks and network inter- faces, multicore processors, job-scheduling algorithms, parallel architectures, operating sys- tems, and parallel-programming languages. Petrini has a Laurea and a PhD in computer science from the University of Pisa, Italy.

Direct questions and comments about this article to Fabrizio Petrini, Applied Computer Science Group, MS K7-90, Pacific Northwest National Laboratory, Richland, WA 99352; [email protected].
