Cell Multiprocessor Communication Network: Built for Speed

Multicore designs promise various power-performance and area-performance benefits. But inadequate design of the on-chip communication network can deprive applications of these benefits. To illuminate this important point in multicore processor design, the authors analyze the Cell processor's communication network, using a series of benchmarks involving various DMA traffic patterns and synchronization protocols.

Michael Kistler, IBM Austin Research Laboratory
Michael Perrone, IBM TJ Watson Research Center
Fabrizio Petrini, Pacific Northwest National Laboratory

Over the past decade, high-performance computing has ridden the wave of commodity computing, building cluster-based parallel computers that leverage the tremendous growth in processor performance fueled by the commercial world. As this pace slows, processor designers face complex problems in their efforts to increase gate density, reduce power consumption, and design efficient memory hierarchies. Processor developers are looking for solutions that can keep up with the scientific and industrial communities' insatiable demand for computing capability and that also have a sustainable market outside science and industry.

A major trend in computer architecture is integrating system components onto the processor chip. This trend is driving the development of processors that can perform functions typically associated with entire systems. Building modular processors with multiple cores is far more cost-effective than building monolithic processors, which are prohibitively expensive to develop, have high power consumption, and give limited return on investment. Multicore system-on-chip (SoC) processors integrate several identical, independent processing units on the same die, together with network interfaces, acceleration units, and other specialized units.

Researchers have explored several design avenues in both academia and industry. Examples include MIT's Raw multiprocessor, the University of Texas's Trips multiprocessor, AMD's Opteron, IBM's Power5, Sun's Niagara, and Intel's Montecito, among many others. (For details on many of these processors, see the March/April 2005 issue of IEEE Micro.)

In all multicore processors, a major technological challenge is designing the internal, on-chip communication network. To realize the unprecedented computational power of the many available processing units, the network must provide very high performance in latency and in bandwidth. It must also resolve contention under heavy loads, provide fairness, and hide the processing units' physical distribution as completely as possible.

Another important dimension is the nature and semantics of the communication primitives available for interactions between the various processing units. Pinkston and Shin have recently compiled a comprehensive survey of multicore processor design challenges, with particular emphasis on internal communication mechanisms.1

The Cell Broadband Engine processor (known simply as the Cell processor), jointly developed by IBM, Sony, and Toshiba, uses an elegant and natural approach to on-chip communication. Relying on four slotted rings coordinated by a central arbiter, it borrows a mainstream communication model from high-performance networks in which processing units cooperate through remote direct memory accesses (DMAs).2 From functional and performance viewpoints, the on-chip network is strikingly similar to high-performance networks commonly used for remote communication in commodity computing clusters and custom supercomputers.

In this article, we explore the design of the Cell processor's on-chip network and provide insight into its communication and synchronization protocols. We describe the various steps of these protocols, the mechanisms involved, and their basic costs. Our performance evaluation uses a collection of benchmarks of increasing complexity, ranging from basic communication patterns to more demanding collective patterns that expose network behavior under congestion.

Design rationale

The Cell processor's design addresses at least three issues that limit processor performance: memory latency, bandwidth, and power.

Historically, processor performance improvements came mainly from higher processor clock frequencies, deeper pipelines, and wider issue designs. However, memory access speed has not kept pace with these improvements, leading to increased effective memory latencies and complex logic to hide them. Also, because complex cores don't allow a large number of concurrent memory accesses, they underutilize execution pipelines and memory bandwidth, resulting in poor chip area use and increased power dissipation without commensurate performance gains.3 For example, larger memory latencies increase the amount of speculative execution required to maintain high processor utilization. Thus, they reduce the likelihood that useful work is being accomplished and increase administrative overhead and bandwidth requirements. All of these problems lead to reduced power efficiency.

Power use in CMOS processors is approaching the limits of air cooling and might soon begin to require sophisticated cooling techniques.4 These cooling requirements can significantly increase overall system cost and complexity. Decreasing transistor size and correspondingly increasing subthreshold leakage currents further increase power consumption.5 Performance improvements from further increasing processor frequencies and pipeline depths are also reaching their limits.6 Deeper pipelines increase the number of stalls from data dependencies and increase branch misprediction penalties.

The Cell processor addresses these issues by attempting to minimize pipeline depth, increase memory bandwidth, allow more simultaneous, in-flight memory transactions, and improve power efficiency and performance.7 These design goals led to the use of flexible yet simple cores that use area and power efficiently.

Processor overview

The Cell processor is the first implementation of the Cell Broadband Engine Architecture (CBEA), which is a fully compatible extension of the 64-bit PowerPC Architecture. Its initial target is the PlayStation 3 game console, but its capabilities also make it well suited for other applications such as visualization, image and signal processing, and various scientific and technical workloads.

Figure 1 shows the Cell processor's main functional units. The processor is a heterogeneous, multicore chip capable of massive floating-point processing optimized for computation-intensive workloads and rich broadband media applications. It consists of one 64-bit power processor element (PPE), eight specialized coprocessors called synergistic processor elements (SPEs), a high-speed memory controller, and a high-bandwidth bus interface, all integrated on-chip. The PPE and SPEs communicate through an internal high-speed element interconnect bus (EIB).


With a clock speed of 3.2 GHz, the Cell processor has a theoretical peak performance of 204.8 Gflop/s (single precision) and 14.6 Gflop/s (double precision). The EIB supports a peak bandwidth of 204.8 Gbytes/s for intrachip data transfers among the PPE, the SPEs, and the memory and I/O interface controllers. The memory interface controller (MIC) provides a peak bandwidth of 25.6 Gbytes/s to main memory. The I/O controller provides peak bandwidths of 25 Gbytes/s inbound and 35 Gbytes/s outbound.

[Figure 1. Main functional units of the Cell processor. Legend: BEI, broadband engine interface; EIB, element interconnect bus; FlexIO, high-speed I/O interface; L2, level 2 cache; MBL, MIC bus logic; MIC, memory interface controller; PPE, power processor element; SPE, synergistic processor element; XIO, extreme data rate I/O cell; plus a test control unit/pervasive logic block.]

The PPE, the Cell's main processor, runs the operating system and coordinates the SPEs. It is a traditional 64-bit PowerPC processor core with a vector multimedia extension (VMX) unit, 32-Kbyte level 1 instruction and data caches, and a 512-Kbyte level 2 cache. The PPE is a dual-issue, in-order-execution design with two-way simultaneous multithreading.

Each SPE consists of a synergistic processor unit (SPU) and a memory flow controller (MFC). The MFC includes a DMA controller, a memory management unit (MMU), a bus interface unit, and an atomic unit for synchronization with other SPUs and the PPE. The SPU is a RISC-style processor with an instruction set and a microarchitecture designed for high-performance data streaming and data-intensive computation. The SPU includes a 256-Kbyte local-store memory to hold an SPU program's instructions and data. The SPU cannot access main memory directly, but it can issue DMA commands to the MFC to bring data into local store or write computation results back to main memory. The SPU can continue program execution while the MFC independently performs these DMA transactions. No hardware data-load prediction structures exist for local-store management, and each local store must be managed by software.

The MFC performs DMA operations to transfer data between local store and system memory. DMA operations specify system memory locations using fully compliant PowerPC virtual addresses. DMA operations can transfer data between local store and any resources connected via the on-chip interconnect (main memory, another SPE's local store, or an I/O device). Parallel SPE-to-SPE transfers are sustainable at a rate of 16 bytes per SPE clock, whereas aggregate main-memory bandwidth is 25.6 Gbytes/s for the entire Cell processor.

Each SPU has 128 128-bit single-instruction, multiple-data (SIMD) registers. The large number of architected registers facilitates highly efficient instruction scheduling and enables important optimization techniques such as loop unrolling. All SPU instructions are inherently SIMD operations that the pipeline can run at four granularities: 16-way 8-bit integers, eight-way 16-bit integers, four-way 32-bit integers or single-precision floating-point numbers, or two 64-bit double-precision floating-point numbers.

The SPU is an in-order processor with two instruction pipelines, referred to as the even and odd pipelines. The floating- and fixed-point units are on the even pipeline, and the rest of the functional units are on the odd pipeline. Each SPU can issue and complete up to two instructions per cycle, one per pipeline. The SPU can approach this theoretical limit for a wide variety of applications. All single-precision operations (8-bit, 16-bit, or 32-bit integers or 32-bit floats) are fully pipelined and can be issued at the full SPU clock rate (for example, four 32-bit floating-point operations per SPU clock cycle). The two-way double-precision floating-point operation is partially pipelined, so its instructions issue at a lower rate (two double-precision flops every seven SPU clock cycles). When using single-precision floating-point fused multiply-add instructions (which count as two operations), the eight SPUs can perform a total of 64 operations per cycle.
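To make the SIMD granularities concrete, here is a minimal sketch of a four-way single-precision multiply-add written with the SPU intrinsics from spu_intrinsics.h; the function and array names are our illustrations, not code from the article.

    #include <spu_intrinsics.h>

    /* y[i] = a * x[i] + y[i], four single-precision elements per spu_madd.
       Assumes n is a multiple of 4 and the arrays live in local store. */
    void saxpy_spu(int n, float a, const vector float *x, vector float *y)
    {
        int i;
        vector float va = spu_splats(a);        /* replicate a into all four slots */
        for (i = 0; i < n / 4; i++)
            y[i] = spu_madd(va, x[i], y[i]);    /* fused multiply-add: two flops per slot */
    }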

[Figure 2. Element interconnect bus (EIB). The PPE, the eight SPEs (SPE0 through SPE7), the MIC, and the two I/O interfaces (IOIF0 with the broadband interface, BIF, and IOIF1) attach to the EIB data network, which is managed by the data bus arbiter.]

Communication architecture

To take advantage of all the computation power available on the Cell processor, work must be distributed and coordinated across the PPE and the SPEs. The processor's specialized communication mechanisms allow efficient data collection and distribution as well as coordination of concurrent activities across the computation elements. Because the SPU can act directly only on programs and data in its own local store, each SPE has a DMA controller that performs high-bandwidth data transfer between local store and main memory. These DMA engines also allow direct transfers between the local stores of two SPUs for pipeline or producer-consumer-style parallel applications.

At the other end of the spectrum, the SPU can use either signals or mailboxes to perform simple low-latency signaling to the PPE or other SPEs. Supporting more-complex synchronization mechanisms is a set of atomic operations available to the SPU, which operate in a similar manner as the PowerPC architecture's lwarx/stwcx atomic instructions. In fact, the SPU's atomic operations interoperate with PPE atomic instructions to build locks and other synchronization mechanisms that work across the SPEs and the PPE. Finally, the Cell allows memory-mapped access to nearly all SPE resources, including the entire local store. This provides a convenient and consistent mechanism for special communications needs not met by the other techniques.

The rich set of communications mechanisms in the Cell architecture enables programmers to efficiently implement widely used programming models for parallel and distributed applications. These models include the function-offload, device-extension, computational-acceleration, streaming, shared-memory-multiprocessor, and asymmetric-thread-runtime models.8

Element interconnect bus

Figure 2 shows the EIB, the heart of the Cell processor's communication architecture, which enables communication among the PPE, the SPEs, main system memory, and external I/O. The EIB has separate communication paths for commands (requests to transfer data to or from another element on the bus) and data.


Each bus element is connected through a point-to-point link to the address concentrator, which receives and orders commands from bus elements, broadcasts the commands in order to all bus elements (for snooping), and then aggregates and broadcasts the command response. The command response is the signal to the appropriate bus elements to start the data transfer.

The EIB data network consists of four 16-byte-wide data rings: two running clockwise, and the other two counterclockwise. Each ring potentially allows up to three concurrent data transfers, as long as their paths don't overlap. To initiate a data transfer, bus elements must request data bus access. The EIB data bus arbiter processes these requests and decides which ring should handle each request. The arbiter always selects one of the two rings that travel in the direction of the shortest transfer, thus ensuring that the data won't need to travel more than halfway around the ring to its destination. The arbiter also schedules the transfer to ensure that it won't interfere with other in-flight transactions. To minimize stalling on reads, the arbiter gives priority to requests coming from the memory controller. It treats all others equally in round-robin fashion. Thus, certain communication patterns will be more efficient than others.

The EIB operates at half the processor-clock speed. Each EIB unit can simultaneously send and receive 16 bytes of data every bus cycle. The EIB's maximum data bandwidth is limited by the rate at which addresses are snooped across all units in the system, which is one address per bus cycle. Each snooped address request can potentially transfer up to 128 bytes, so in a 3.2-GHz Cell processor, the theoretical peak data bandwidth on the EIB is 128 bytes × 1.6 GHz = 204.8 Gbytes/s.

The on-chip I/O interfaces allow two Cell processors to be connected using a coherent protocol called the broadband interface (BIF), which effectively extends the multiprocessor network to connect both PPEs and all 16 SPEs in a single coherent network. The BIF protocol operates over IOIF0, one of the two available on-chip I/O interfaces; the other interface, IOIF1, operates only in noncoherent mode. The IOIF0 bandwidth is configurable, with a peak of 30 Gbytes/s outbound and 25 Gbytes/s inbound.

The actual data bandwidth achieved on the EIB depends on several factors: the destination and source's relative locations, the chance of a new transfer's interfering with transfers in progress, the number of Cell chips in the system, whether data transfers are to/from memory or between local stores in the SPEs, and the data arbiter's efficiency. Reduced bus bandwidths can result in the following cases:

• All requestors access the same destination, such as the same local store, at the same time.
• All transfers are in the same direction and cause idling on two of the four data rings.
• A large number of partial cache line transfers lowers bus efficiency.
• All transfers must travel halfway around the ring to reach their destinations, inhibiting units on the way from using the same ring.

Memory flow controller

Each SPE contains an MFC that connects the SPE to the EIB and manages the various communication paths between the SPE and the other Cell elements. The MFC runs at the EIB's frequency, that is, at half the processor's speed. The SPU interacts with the MFC through the SPU channel interface. Channels are unidirectional communication paths that act much like first-in, first-out fixed-capacity queues. This means that each channel is defined as either read-only or write-only from the SPU's perspective. In addition, some channels are defined with blocking semantics, meaning that a read of an empty read-only channel or a write to a full write-only channel causes the SPU to block until the operation completes. Each channel has an associated count that indicates the number of available elements in the channel. The SPU uses the read channel (rdch), write channel (wrch), and read channel count (rchcnt) assembly instructions to access the SPU channels.
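In C, these channel instructions are reachable through the SDK's channel intrinsics. The sketch below assumes the channel mnemonics defined in the SDK headers (spu_mfcio.h); it uses the channel count of the MFC command channel to check for free DMA-queue slots before enqueueing more work, since writing to a full, blocking channel would simply stall the SPU until a slot frees up.

    #include <spu_mfcio.h>

    /* Returns nonzero if the MFC SPU command queue has at least one free entry.
       spu_readchcnt() maps onto the rchcnt instruction; reading the count of the
       write-only, blocking MFC_Cmd channel reports how many DMA commands can
       still be enqueued without stalling (equivalently: mfc_stat_cmd_queue()). */
    static inline int dma_slot_available(void)
    {
        return spu_readchcnt(MFC_Cmd) > 0;
    }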

DMA. The MFC accepts and processes DMA commands that the SPU or the PPE issued using the SPU channel interface or memory-mapped I/O (MMIO) registers. DMA commands queue in the MFC, and the SPU or PPE (whichever issued the command) can continue execution in parallel with the data transfer, using either polling or blocking interfaces to determine when the transfer is complete. This autonomous execution of MFC DMA commands allows convenient scheduling of DMA transfers to hide memory latency.

The MFC supports naturally aligned transfers of 1, 2, 4, or 8 bytes, or a multiple of 16 bytes to a maximum of 16 Kbytes. DMA list commands can request a list of up to 2,048 DMA transfers using a single MFC DMA command. However, only the MFC's associated SPU can issue DMA list commands. A DMA list is an array of DMA source/destination addresses and lengths in the SPU's local storage. When an SPU issues a DMA list command, the SPU specifies the address and length of the DMA list in the local store.9 Peak performance is achievable for transfers when both the effective address and the local storage address are 128-byte aligned and the transfer size is an even multiple of 128 bytes.
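As an illustration of the blocking and polling completion interfaces, here is a sketch using the composite DMA intrinsics from the SDK's spu_mfcio.h; the buffer name, tag value, and 16-Kbyte transfer size are illustrative choices, not values prescribed by the article.

    #include <spu_mfcio.h>

    #define TAG 5                               /* any tag in 0..31 */

    /* 128-byte alignment gives the best DMA performance. */
    static volatile unsigned char buf[16384] __attribute__((aligned(128)));

    void fetch_and_wait(unsigned long long src_ea)
    {
        /* Enqueue a get: main storage -> local store. */
        mfc_get((void *)buf, src_ea, sizeof(buf), TAG, 0, 0);

        /* Blocking wait: stall until every DMA tagged TAG has completed. */
        mfc_write_tag_mask(1 << TAG);
        mfc_read_tag_status_all();
    }

    int fetch_is_done(void)
    {
        /* Polling alternative: returns nonzero once the tagged DMAs are done,
           without stalling the SPU. */
        mfc_write_tag_mask(1 << TAG);
        return mfc_read_tag_status_immediate() != 0;
    }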
Atomic operations. To support more complex synchronization mechanisms, the SPU can use special DMA operations to atomically update a lock line in main memory. These operations, called get-lock-line-and-reserve (getllar) and put-lock-line-conditional (putllc), are conceptually equivalent to the PowerPC load-and-reserve (lwarx) and store-conditional (stwcx) instructions.

The getllar operation reads the value of a synchronization variable in main memory and sets a reservation on this location. If the PPE or another SPE subsequently modifies the synchronization variable, the SPE loses its reservation. The putllc operation updates the synchronization variable only if the SPE still holds a reservation on its location. If putllc fails, the SPE must reissue getllar to obtain the synchronization variable's new value and then retry the attempt to update it with another putllc. The MFC's atomic unit performs the atomic DMA operations and manages reservations held by the SPE.

Using atomic updates, the SPU can participate with the PPE and other SPUs in locking protocols, barriers, or other synchronization mechanisms. The atomic operations available to the SPU also have some special features, such as notification through an interrupt when a reservation is lost, that enable more efficient and powerful synchronization than traditional approaches.

Signal notification and mailboxes. The signal notification facility supports two signaling channels: Sig_Notify_1 and Sig_Notify_2. The SPU can read its own signal channels using the read-blocking SPU channels SPU_RdSigNotify1 and SPU_RdSigNotify2. The PPE or an SPU can write to these channels using memory-mapped addresses. A special feature of the signaling channels is that they can be configured to treat writes as logical OR operations, allowing simple but powerful collective communication across processors.

Each SPU also has a set of mailboxes that can function as a narrow (32-bit) communication channel to the PPE or another SPE. The SPU has a four-entry, read-blocking inbound mailbox and two single-entry, write-blocking outbound mailboxes, one of which will also generate an interrupt to the PPE when the SPE writes to it. The PPE uses memory-mapped addresses to write to the SPU's inbound mailbox and read from either of the SPU's outbound mailboxes. In contrast to the signal notification channels, mailboxes are much better suited for one-to-one communication patterns such as master-slave or producer-consumer models. A typical round-trip communication using mailboxes between two SPUs takes approximately 300 nanoseconds (ns).
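A minimal SPU-side mailbox exchange might look like the following sketch, which assumes the mailbox intrinsics declared in the SDK's spu_mfcio.h; the message encoding and worker routine are invented for illustration. The reads and writes block when the mailbox is empty or full, respectively, which is what makes the mechanism convenient for master-slave handshakes.

    #include <spu_mfcio.h>

    extern unsigned int do_work(unsigned int cmd);   /* hypothetical worker */

    /* Wait for a command word from the PPE, do the work, report a status word. */
    void serve_one_request(void)
    {
        unsigned int cmd = spu_read_in_mbox();       /* blocks until the PPE writes */
        unsigned int status = do_work(cmd);

        spu_write_out_mbox(status);                  /* blocks if the outbound slot is full */
        /* spu_write_out_intr_mbox(status) would instead use the interrupting
           outbound mailbox, raising an interrupt on the PPE. */
    }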


Memory-mapped I/O (MMIO) resources. Memory-mapped resources play a role in many of the communication mechanisms already discussed, but these are really just special cases of the Cell architecture's general practice of making all SPE resources available through MMIO. These resources fall into four broad classes:

• Local storage. All of an SPU's local storage can be mapped into the effective-address space. This allows the PPE to access the SPU's local storage with simple loads and stores, though doing so is far less efficient than using DMA. MMIO access to local storage is not synchronized with SPU execution, so programmers must ensure that the SPU program is designed to allow unsynchronized access to its data (for example, by using "volatile" variables) when exploiting this feature.
• Problem state memory map. Resources in this class, intended for use directly by application programs, include access to the SPE's DMA engine, mailbox channels, and signal notification channels.
• Privilege 1 memory map. These resources are available to privileged programs such as the operating system or authorized subsystems to monitor and control the execution of SPU applications.
• Privilege 2 memory map. The operating system uses these resources to control the resources available to the SPE.

DMA flow

The SPE's DMA engine handles most communications between the SPU and other Cell elements and executes DMA commands issued by either the SPU or the PPE. A DMA command's data transfer direction is always referenced from the SPE's perspective. Therefore, commands that transfer data into an SPE (from main storage to local store) are considered get commands (gets), and transfers of data out of an SPE (from local store to main storage) are considered put commands (puts).

DMA transfers are coherent with respect to main storage. Programmers should be aware that the MFC might process the commands in the queue in a different order from that in which they entered the queue. When order is important, programmers must use special forms of the get and put commands to enforce either barrier or fence semantics against other commands in the queue.

The MFC's MMU handles address translation and protection checking of DMA accesses to main storage, using information from page and segment tables defined in the PowerPC architecture. The MMU has a built-in translation look-aside buffer (TLB) for caching the results of recently performed translations.

The MFC's DMA controller (DMAC) processes DMA commands queued in the MFC. The MFC contains two separate DMA command queues:

• MFC SPU command queue, for commands issued by the associated SPU using the channel interface; and
• MFC proxy command queue, for commands issued by the PPE or other devices using MMIO registers.

The Cell architecture doesn't dictate the size of these queues and recommends that software not assume a particular size. This is important to ensure functional correctness across Cell architecture implementations. However, programs should use DMA queue entries efficiently because attempts to issue DMA commands when the queue is full will lead to performance degradation. In the Cell processor, the MFC SPU command queue contains 16 entries, and the MFC proxy command queue contains eight entries.

Figure 3 illustrates the basic flow of a DMA transfer to main storage initiated by an SPU.

[Figure 3. Basic flow of a DMA transfer. The numbered steps trace a command from the SPU channel interface into the MFC's DMA queues (SPU and proxy), through the DMAC, the MMU and its TLB, and the BIU, onto the EIB, and out to the MIC and off-chip memory. Legend: BIU, bus interface unit; DMAC, direct memory access controller; EIB, element interconnect bus; LS, local store; MFC, memory flow controller; MIC, memory interface controller; MMIO, memory-mapped I/O; MMU, memory management unit; SPU, synergistic processor unit; TLB, translation lookaside buffer.]

The process consists of the following steps:

1. The SPU uses the channel interface to place the DMA command in the MFC SPU command queue.
2. The DMAC selects a command for processing. The set of rules for selecting the command for processing is complex, but, in general, a) commands in the SPU command queue take priority over commands in the proxy command queue, b) the DMAC alternates between get and put commands, and c) the command must be ready (not waiting for address resolution or list element fetch or dependent on another command).
3. If the command is a DMA list command and requires a list element fetch, the DMAC queues a request for the list element to the local-store interface. When the list element is returned, the DMAC updates the DMA entry and must reselect it to continue processing.
4. If the command requires address translation, the DMAC queues it to the MMU for processing. When the translation is available in the TLB, processing proceeds to the next step (unrolling). On a TLB miss, the MMU performs the translation, using the page tables stored in main memory, and updates the TLB. The DMA entry is updated and must be reselected for processing to continue.
5. Next, the DMAC unrolls the command, that is, creates a bus request to transfer the next block of data for the command. This bus request can transfer up to 128 bytes of data but can transfer less, depending on alignment issues or the amount of data the DMA command requests. The DMAC then queues this bus request to the bus interface unit (BIU).


6. The BIU selects the request from its queue and issues the command to the EIB. The EIB orders the command with other outstanding requests and then broadcasts the command to all bus elements. For transfers involving main memory, the MIC acknowledges the command to the EIB, which then informs the BIU that the command was accepted and data transfer can begin.
7. The BIU in the MFC performs the reads to local store required for the data transfer. The EIB transfers the data for this request between the BIU and the MIC. The MIC transfers the data to or from the off-chip memory.
8. The unrolling process produces a sequence of bus requests for the DMA command that pipeline through the communication network. The DMA command remains in the MFC SPU command queue until all its bus requests have completed. However, the DMAC can continue to process other DMA commands. When all bus requests for a command have completed, the DMAC signals command completion to the SPU and removes the command from the queue.

In the absence of congestion, a thread running on the SPU can issue a DMA request in as little as 10 clock cycles, the time needed to write to the five SPU channels that describe the source and destination addresses, the DMA size, the DMA tag, and the DMA command. At that point, the DMAC can process the DMA request without SPU intervention.
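The channel writes mentioned above map directly onto the MFC parameter channels. A rough sketch follows, using the channel mnemonics defined in the SDK's spu_mfcio.h; the wrapper function is ours, and six writes appear because the upper half of the 64-bit effective address is set explicitly.

    #include <spu_mfcio.h>

    /* Enqueue a get by writing the MFC parameter channels one at a time.
       The composite intrinsic mfc_get() expands to essentially this sequence. */
    static void enqueue_get(void *ls, unsigned long long ea,
                            unsigned int size, unsigned int tag)
    {
        spu_writech(MFC_LSA,   (unsigned int)(unsigned long)ls); /* local-store address   */
        spu_writech(MFC_EAH,   (unsigned int)(ea >> 32));        /* effective addr, high  */
        spu_writech(MFC_EAL,   (unsigned int)ea);                /* effective addr, low   */
        spu_writech(MFC_Size,  size);                            /* transfer size (bytes) */
        spu_writech(MFC_TagID, tag);                             /* completion tag        */
        spu_writech(MFC_Cmd,   MFC_GET_CMD);                     /* command opcode: go    */
    }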


The overall latency of generating the DMA command, initially selecting the command, and unrolling the first bus request to the BIU (or, in simpler terms, the flow-through latency from SPU issue to injection of the bus request into the EIB) is roughly 30 SPU cycles when all resources are available. If list element fetch is required, it can add roughly 20 SPU cycles. MMU translation exceptions by the SPE are very expensive and should be avoided if possible. If the queue in the BIU becomes full, the DMAC is blocked from issuing further requests until resources become available again.

A transfer's command phase involves snooping operations for all bus elements to ensure coherence and typically requires some 50 bus cycles (100 SPU cycles) to complete. For gets, the remaining latency is attributable to the data transfer from off-chip memory to the memory controller and then across the bus to the SPE, which writes it to local store. For puts, DMA latency doesn't include transferring data all the way to off-chip memory because the SPE considers the put complete once all data have been transferred to the memory controller.

Experimental results

We conducted a series of experiments to explore the major performance aspects of the Cell's on-chip communication network, its protocols, and the pipelined communication's impact. We developed a suite of microbenchmarks to analyze the internal interconnection network's architectural features. Following the research path of previous work on traditional high-performance networks,2 we adopted an incremental approach to gain insight into several aspects of the network. We started our analysis with simple pairwise, congestion-free DMAs, and then we increased the benchmarks' complexity by including several patterns that expose the contention resolution properties of the network under heavy load.

Methodology

Because of the limited availability of Cell boards at the time of our experiments, we performed most of the software development on IBM's Full-System Simulator for the Cell Broadband Engine Processor (available at http://www-128.ibm.com/developerworks/power/cell/).10,11 We collected the results presented here using an experimental evaluation board at the IBM Research Center in Austin. The Cell processor on this board was running at 3.2 GHz. We also obtained results from an internal version of the simulator that includes performance models for the MFC, EIB, and memory subsystems. Performance simulation for the Cell processor is still under development, but we found good correlation between the simulator and hardware in our experiments. The simulator let us observe aspects of system behavior that would be difficult or practically impossible to observe on actual hardware.

We developed the benchmarks in C, using several Cell-specific libraries to orchestrate activities between the PPE and various SPEs. In all tests, the DMA operations are issued by the SPUs. We wrote DMA operations in C language intrinsics,12 which in most cases produce inline assembly instructions to specify commands to the MFC through the SPU channel interface. We measured elapsed time in the SPE using a special register, the SPU decrementer, which ticks every 40 ns (or 128 processor clock cycles). PPE and SPE interactions were performed through mailboxes, input and output SPE registers that can send and receive messages in as little as 150 ns. We implemented a simple synchronization mechanism to start the SPUs in a coordinated way. SPUs notify the completion of benchmark phases through an atomic fetch-and-add operation on a main memory location that is polled infrequently and unintrusively by the PPE.
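For concreteness, here is a hedged sketch of the two measurement ingredients just described: timing with the SPU decrementer and a getllar/putllc fetch-and-add for phase notification. It uses the decrementer and atomic intrinsics from the SDK's spu_mfcio.h; the variable names, the offset of the counter within its lock line, and the status-bit check are assumptions rather than the authors' exact harness.

    #include <spu_mfcio.h>

    /* A 128-byte, 128-byte-aligned local-store buffer is required for the
       atomic lock-line operations; the counter is assumed at offset 0. */
    static volatile unsigned int lock_line[32] __attribute__((aligned(128)));

    /* Atomically add 'inc' to a 32-bit counter in main memory (fetch-and-add).
       counter_ea must point to a 128-byte-aligned lock line. */
    static unsigned int fetch_and_add(unsigned long long counter_ea, unsigned int inc)
    {
        unsigned int old;
        do {
            mfc_getllar((void *)lock_line, counter_ea, 0, 0);
            (void)mfc_read_atomic_status();          /* wait for getllar to finish */
            old = lock_line[0];
            lock_line[0] = old + inc;
            mfc_putllc((void *)lock_line, counter_ea, 0, 0);
        } while (mfc_read_atomic_status() & MFC_PUTLLC_STATUS);  /* retry if reservation lost */
        return old;
    }

    /* Time a region of code with the SPU decrementer (which counts down). */
    static unsigned int elapsed_ticks_example(void)
    {
        unsigned int start, end;
        spu_write_decrementer(0xFFFFFFFF);
        start = spu_read_decrementer();
        /* ... issue and wait for the DMAs being measured ... */
        end = spu_read_decrementer();
        return start - end;
    }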

Basic DMA performance

The first step of our analysis measures the latency and bandwidth of simple blocking puts and gets, when the target is in main memory or in another SPE's local store. Table 1 breaks down the latency of a DMA operation into its components.

Table 1. DMA latency components for a clock frequency of 3.2 GHz.

  Latency component                 Cycles   Nanoseconds
  DMA issue                             10         3.125
  DMA to EIB                            30         9.375
  List element fetch                    10         3.125
  Coherence protocol                   100        31.25
  Data transfer for inter-SPE put      140        43.75
  Total                                290        90.625

Figure 4a shows DMA operation latency for a range of sizes. In these results, transfers between two local stores were always performed between SPEs 0 and 1, but in all our experiments, we found no performance difference attributable to SPE location. The results show that puts to main memory and gets and puts to local store had a latency of only 91 ns for transfers of up to 512 bytes (four cache lines). There was little difference between puts and gets to local store because local-store access latency was remarkably low, only 8 ns (about 24 processor clock cycles). Main memory gets required less than 100 ns to fetch information, which is remarkably fast.

Figure 4b presents the same results in terms of bandwidth achieved by each DMA operation. As we expected, the largest transfers achieved the highest bandwidth, which we measured as 22.5 Gbytes/s for gets and puts to local store and puts to main memory, and 15 Gbytes/s for gets from main memory.

[Figure 4. Latency (a) and bandwidth (b) as a function of DMA message size for blocking gets and puts in the absence of contention. Curves: blocking get to memory, blocking get to SPE local store, blocking put to memory, blocking put to SPE local store; DMA message sizes range from 4 bytes to 16,384 bytes.]

Next, we considered the impact of nonblocking DMA operations. In the Cell processor, each SPE can have up to 16 outstanding DMAs, for a total of 128 across the chip, allowing unprecedented levels of parallelism in on-chip communication. Applications that rely heavily on random scatter or gather accesses to main memory can take advantage of these communication features seamlessly. Our benchmarks use a batched communication model, in which the SPU issues a fixed number (the batch size) of DMAs before blocking for notification of request completion. By using a very large batch size (16,384 in our experiments), we effectively converted the benchmark to use a nonblocking communication model.

Figure 5 shows the results of these experiments, including aggregate latency and bandwidth for the set of DMAs in a batch, by batch size and data transfer size. The results show a form of performance continuity between blocking (the most constrained case) and nonblocking operations, with different degrees of freedom expressed by the increasing batch size. In accessing main memory and local storage, nonblocking puts achieved the asymptotic bandwidth of 25.6 Gbytes/s, determined by the EIB capacity at the endpoints, with 2-Kbyte DMAs (Figures 5b and 5f). Accessing local store, nonblocking puts achieved the optimal value with even smaller packets (Figure 5f).

Gets are also very efficient when accessing local memories, and the main memory latency penalty slightly affects them, as Figure 5c shows. Overall, even a limited amount of batching is very effective for intermediate DMA sizes, between 256 bytes and 4 Kbytes, with a factor of two or even three of bandwidth increase compared with the blocking case (for example, 256-byte DMAs in Figure 5h).
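The batched communication model reduces, in the nonblocking limit, to issuing the whole batch under one tag and waiting once at the end. A hedged sketch using the spu_mfcio.h intrinsics; the batch size, tag, and buffer layout are illustrative and the per-transfer size is assumed to be at most 2,048 bytes.

    #include <spu_mfcio.h>

    #define BATCH 16            /* batch size: DMAs issued before waiting */
    #define TAG   7

    static volatile unsigned char dst[BATCH][2048] __attribute__((aligned(128)));

    /* Issue BATCH gets back to back, then block once for the whole batch. */
    void batched_gets(unsigned long long src_ea, unsigned int size)
    {
        int i;
        for (i = 0; i < BATCH; i++)
            mfc_get((void *)dst[i], src_ea + (unsigned long long)i * size,
                    size, TAG, 0, 0);

        mfc_write_tag_mask(1 << TAG);
        mfc_read_tag_status_all();    /* single completion notification for the batch */
    }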


[Figure 5. DMA performance dimensions: how latency and bandwidth are affected by the choice of DMA target (main memory or local store), DMA direction (put or get), and DMA synchronization (blocking after each DMA, blocking after a constant number of DMAs ranging from two to 32, and nonblocking with a final fence). Panels: (a) put latency, main memory; (b) put bandwidth, main memory; (c) get latency, main memory; (d) get bandwidth, main memory; (e) put latency, local store; (f) put bandwidth, local store; (g) get latency, local store; (h) get bandwidth, local store. DMA message sizes range from 4 bytes to 16,384 bytes.]

12 IEEE MICRO communication patterns than the single pair- 26 wise DMAs discussed so far. We also analyzed 25 how the system performs for several common patterns of collective communications under 24 heavy load, in terms of both local perfor- 23 mance—what each SPE achieves—and aggre- 22 gate behavior—the performance level of all SPEs involved in the communication. 21 All results reported in this section are for 20 nonblocking communication. Each SPE Put, memory 19 Get, memory issues a sequence of 16,384-byte DMA com- Put, SPE 18 mands. The aggregate bandwidth reported is Get, SPE Aggregate bandwidth (Gbytes/second) the sum of the communication bandwidths 17 reached by each SPE. 12345678 Synergistic processor element The first important case is the hot spot, in (a) which many or, in the worst case, all SPEs are accessing main memory or a specific local 200 13 Uniform traffic store. This is a very demanding pattern that 180 Complement traffic exposes how the on-chip network and the 160 Pairwise traffic, put communication protocols behave under stress. Pairwise traffic, get It is representative of the most straightforward 140 code parallelizations, which distribute com- 120 putation to a collection of threads that fetch 100 data from main memory, perform the desired computation, and store the results without 80 SPE interaction. Figure 6a shows that the Cell 60 processor resolves hot spots in accesses to local 40

storage optimally, reaching the asymptotic per- Aggregate bandwidth (Gbytes/second) 20 formance with two or more SPEs. (For the 12345678 SPE hot spot tests, the number of SPEs (b) Synergistic processor element includes the hot node; x SPEs include 1 hot SPE plus x − 1 communication partners.) Figure 6. Aggregate communication performance: hot spots (a) and collec- Counterintuitively, get commands outper- tive communication patterns (b). form puts under load. In fact, with two or more SPEs, two or more get sources saturate the bandwidth either in main or local store. The partner. Note that SPEs with numerically con- put protocol, on the other hand, suffers from secutive numbers might not be physically a minor performance degradation, approxi- adjacent on the Cell hardware layout. mately 1.5 Gbytes/s less than the optimal value. The first static pattern, complement, is The second case is collective communica- resolved optimally by the network, and can tion patterns, in which all the SPES are both be mapped to the four rings with an aggregate source and target of the communication. Fig- performance slightly below 200 Gbytes/s (98 ure 6b summarizes the performance aspects percent of aggregate peak bandwidth). of the most common patterns that arise from The direction of data transfer affects the typical parallel applications. In the two static pairwise pattern’s performance. As the hot spot patterns, complement and pairwise, each SPE experiments show, gets have better contention executes a sequence of DMAs to a fixed tar- resolution properties under heavy load, and get SPE. In the complement pattern, each Figure 6b further confirms this, showing a gap SPE selects the target SPE by complementing of 40 Gbytes/s in aggregate bandwidth the bit string that identifies the source. In the between put- and get-based patterns. pairwise pattern, the SPEs are logically orga- The most difficult communication pattern, nized in pairs < i, i + 1 >, where i is an even arguably the worst case for this type of on-chip number, and each SPE communicates with its network, is uniform traffic, in which each SPE
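The two static patterns are easy to generate in code. A small sketch of the target-selection rules described above, assuming SPEs are numbered 0 through 7; the function names are ours, not from the benchmark suite.

    /* Complement pattern: flip every bit of the SPE identifier, so with eight
       SPEs, SPE 0 pairs with SPE 7, SPE 1 with SPE 6, and so on. */
    int complement_target(int spe_id, int num_spes /* power of two */)
    {
        return spe_id ^ (num_spes - 1);
    }

    /* Pairwise pattern: SPEs form pairs <i, i+1> with i even, and each SPE
       exchanges data with its partner in the pair. */
    int pairwise_target(int spe_id)
    {
        return (spe_id % 2 == 0) ? spe_id + 1 : spe_id - 1;
    }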


The most difficult communication pattern, arguably the worst case for this type of on-chip network, is uniform traffic, in which each SPE randomly chooses a DMA target across the SPEs' local memories. In this case, aggregate bandwidth is only 80 Gbytes/s. We also explored other static communication patterns and found results in the same aggregate-performance range.

Finally, Figure 7 shows the distribution of DMA latencies for all SPEs during the main-memory hot-spot pattern execution. One important result is that the distributions show no measurable differences across the SPEs, evidence of a fair and efficient policy for network resource allocation. The peak of both distributions shows a sevenfold latency increase, at about 5.6 µs. In both cases, the worst-case latency is only 13 µs, twice the average. This is another remarkable result, demonstrating that applications can rely on a responsive and fair network even with the most demanding traffic patterns.

[Figure 7. Latency distribution with a main-memory hot spot: puts (a) and gets (b). Each panel plots the number of DMAs against latency (0 to 14 µs) for blocking and nonblocking operations.]

Major obstacles in the traditional path to processor performance improvement have led chip manufacturers to consider multicore designs. These architectural solutions promise various power-performance and area-performance benefits. But designers must take care to ensure that these benefits are not lost because of the on-chip communication network's inadequate design. Overall, our experimental results demonstrate that the Cell processor's communications subsystem is well matched to the processor's computational capacity. The communications network provides the speed and bandwidth that applications need to exploit the processor's computational power. MICRO

References

1. T.M. Pinkston and J. Shin, "Trends toward On-Chip Networked Microsystems," Int'l J. High Performance Computing and Networking, vol. 3, no. 1, 2005, pp. 3-18.
2. J. Beecroft et al., "QsNetII: Defining High-Performance Network Design," IEEE Micro, vol. 25, no. 4, July/Aug. 2005, pp. 34-47.
3. W.A. Wulf and S.A. McKee, "Hitting the Memory Wall: Implications of the Obvious," ACM Sigarch Computer Architecture News, vol. 23, no. 1, Mar. 1995, pp. 20-24.
4. U. Ghoshal and R. Schmidt, "Refrigeration Technologies for Sub-Ambient Temperature Operation of Computing Systems," Proc. Int'l Solid-State Circuits Conf. (ISSCC 2000), IEEE Press, 2000, pp. 216-217, 458.
5. R.D. Isaac, "The Future of CMOS Technology," IBM J. Research and Development, vol. 44, no. 3, May 2000, pp. 369-378.
6. V. Srinivasan et al., "Optimizing Pipelines for Power and Performance," Proc. 35th Ann. Int'l Symp. Microarchitecture (Micro 35), IEEE Press, 2002, pp. 333-344.
7. H.P. Hofstee, "Power Efficient Processor Architecture and the Cell Processor," Proc. 11th Int'l Symp. High-Performance Computer Architecture (HPCA-11), IEEE Press, 2005, pp. 258-262.
8. J.A. Kahle et al., "Introduction to the Cell Multiprocessor," IBM J. Research and Development, vol. 49, no. 4/5, 2005, pp. 589-604.
9. IBM Cell Broadband Engine Architecture 1.0, http://www-128.ibm.com/developerworks/power/cell/downloads_doc.html.
10. IBM Full-System Simulator for the Cell Broadband Engine Processor, IBM AlphaWorks Project, http://www.alphaworks.ibm.com/tech/cellsystemsim.
11. J.L. Peterson et al., "Application of Full-System Simulation in Exploratory System Design and Development," IBM J. Research and Development, vol. 50, no. 2/3, 2006, pp. 321-332.
12. IBM SPU C/C++ Language Extensions 2.1, http://www-128.ibm.com/developerworks/power/cell/downloads_doc.html.
13. G.F. Pfister and V.A. Norton, "Hot Spot Contention and Combining in Multistage Interconnection Networks," IEEE Trans. Computers, vol. 34, no. 10, Oct. 1985, pp. 943-948.

Michael Kistler is a senior software engineer in the IBM Austin Research Laboratory. His research interests include parallel and cluster computing, fault tolerance, and full-system simulation of high-performance computing systems. Kistler has an MS in computer sci- ence from Syracuse University.

Michael Perrone is the manager of the IBM TJ Watson Research Center's Cell Solutions Department. His research interests include algorithmic optimization for the Cell processor, parallel computing, and statistical machine learning. Perrone has a PhD in physics from Brown University.

Fabrizio Petrini is a laboratory fellow in the Applied Computer Science Group of the Com- putational Sciences and Mathematics Division at Pacific Northwest National Laboratory. His research interests include various aspects of supercomputers, such as high-performance interconnection networks and network inter- faces, multicore processors, job-scheduling algorithms, parallel architectures, operating sys- tems, and parallel-programming languages. Petrini has a Laurea and a PhD in computer science from the University of Pisa, Italy.

Direct questions and comments about this article to Fabrizio Petrini, Applied Computer Science Group, MS K7-90, Pacific Northwest National Laboratory, Richland, WA 99352; [email protected].
