The Microarchitecture of the Synergistic Processor for a Cell Processor

Brian Flachs, Shigehiro Asano, Member, IEEE, Sang H. Dhong, Fellow, IEEE, H. Peter Hofstee, Member, IEEE, Gilles Gervais, Roy Kim, Tien Le, Peichun Liu, Jens Leenstra, John Liberty, Brad Michael, Hwa-Joon Oh, Silvia Melitta Mueller, Osamu Takahashi, Member, IEEE, A. Hatakeyama, Yukio Watanabe, Naoka Yano, Daniel A. Brokenshire, Mohammad Peyravian, Vandung To, and Eiji Iwata

IEEE Journal of Solid-State Circuits, vol. 41, no. 1, January 2006. Manuscript received April 15, 2005; revised August 31, 2005. Digital Object Identifier 10.1109/JSSC.2005.859332. B. Flachs, S. H. Dhong, H. P. Hofstee, G. Gervais, R. Kim, T. Le, P. Liu, J. Liberty, B. Michael, H.-J. Oh, O. Takahashi, D. A. Brokenshire, and V. To are with the IBM Systems and Technology Group, Austin, TX 78758 USA. S. Asano and Y. Watanabe are with Toshiba America Electronic Components, Austin, TX 78717 USA. N. Yano is with the Broadband System LSI Development Center, Semiconductor Company, Toshiba Corporation, Kawasaki, Japan. J. Leenstra and S. M. Mueller are with IBM Entwicklung GmbH, Boeblingen 71032, Germany. A. Hatakeyama and E. Iwata are with Sony Computer Entertainment, Austin, TX 78717 USA. M. Peyravian is with IBM Microelectronics, Research Triangle Park, NC 27709 USA.

Abstract—This paper describes an 11 FO4 streaming data processor in the IBM 90-nm SOI-low-k process. The dual-issue, four-way SIMD processor emphasizes achievable performance per area and power. Software controls most aspects of data movement and instruction flow to improve memory system performance and core performance density. The design minimizes instruction latency while providing for fine grain clock control to reduce power.

Index Terms—Cell, DSP, RISC, SIMD, SPE, SPU.

I. INTRODUCTION

INCREASING thread level parallelism, data bandwidth, memory latency, and leakage current are important drivers for new processor designs, such as Cell. Today's media-rich application software is often characterized by multiple lightweight threads and software pipelines. This trend in software design favors processors that use these threads to drive improved data bandwidth over processors designed to accelerate a single thread of execution by exploiting instruction level parallelism. Memory latency is a key limiter to processor performance: modern processors can lose up to 4000 instruction slots while they wait for data from main memory. Previous designs emphasize large caches and reorder buffers, first to reduce the average latency and second to maintain instruction throughput while waiting for data from cache misses. However, these hardware structures have difficulty scaling to the sizes required by the large data structures of media-rich software.
Transistor gate oxides are now a few atomic layers thick and channels are extremely narrow. These features are very good for improving transistor performance and increasing transistor density, but they tend to increase leakage current. As processor performance becomes power limited, leakage current becomes an important performance issue. Since leakage is proportional to area, processor designs need to extract more performance per transistor.

II. ARCHITECTURE

The Cell processor is a heterogeneous shared memory multiprocessor [2]. It features a multi-threaded 64 bit POWER processing element (PPE) and eight synergistic processing elements (SPE). Performance per transistor is the motivation for heterogeneity. Software can be divided into general purpose computing threads, operating system tasks, and streaming media threads, each targeted to a processing core customized for those tasks. For example, the PPE is responsible for running the operating system and coordinating the flow of the data processing threads through the SPEs. This differentiation allows the architectures and implementations of the PPE and SPE to be optimized for their respective workloads and enables significant improvements in performance per transistor.

The synergistic processor element (SPE) is the first implementation of a new processor architecture designed to accelerate media and streaming workloads. The architecture aims to improve the effective memory bandwidth achievable by applications by improving the degree to which software can tolerate memory latency. The SPE provides the processing power needed by streaming and media workloads through four-way SIMD operations, dual issue, and high frequency.

Area and power efficiency are important enablers for multicore designs that take advantage of parallelism in applications, where performance is power limited. Every design choice must trade off the performance a prospective feature would bring against the prospect of omitting the feature and devoting the area and power to higher clock frequency or more SPE cores per Cell processor chip. Power efficiency drives a desire to replace the event and status polling performed by software during synchronization with synchronization mechanisms that allow for low power waiting.

Fig. 1. SPE architecture diagram.

Fig. 1 is a diagram of the SPE architecture's major entities and their relationships. Local store is a private memory for SPE instructions and data. The synergistic processing unit (SPU) core is a processor that runs instructions from the local store and can read or write the local store with its load and store instructions. The direct memory access (DMA) unit transfers data between local store and system memory. The DMA unit is programmable by SPU software via the channel unit. The channel unit is a message passing interface between the SPU core, the DMA unit, and the rest of the Cell processing system. The channel unit is accessed by SPE software through channel access instructions.

The SPU core is a SIMD RISC-style processor. All instructions are encoded in 32 bit fixed length instruction formats and there are no hard-to-pipeline instructions. The SPU features 128 general purpose registers, used by both floating point and integer instructions. The shared register file allows the highest level of performance for various workloads with the smallest number of registers, and 128 registers allow for the loop unrolling necessary to fill the functional unit pipelines with independent instructions. Most instructions operate on 128 bit wide data. For example, the floating point multiply-add instruction operates on vectors of four 32 bit single precision floating point values. Some instructions, such as floating point multiply-add, consume three register operands and produce a register result. The SPE includes instructions that perform single precision floating point, integer arithmetic, logicals, loads, stores, compares, and branches, in addition to some instructions intended to help media applications.
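As a concrete illustration of this SIMD model, the sketch below (not code from the paper) uses the SPU C-language intrinsics shipped with the Cell SDK: spu_madd maps to the four-way floating point multiply-add just described, and each vec_float4 value occupies one 128 bit general purpose register. The function name saxpy4 and the unroll factor are illustrative assumptions.

```c
/* Sketch: four-way SIMD multiply-add using SPU intrinsics.
 * Assumes the Cell SDK header <spu_intrinsics.h>; vec_float4 is a
 * vector of four 32 bit floats held in one 128 bit register, and
 * spu_madd(a, b, c) computes a*b + c in a single instruction with
 * three register operands and one register result. */
#include <spu_intrinsics.h>

/* y[i] = a * x[i] + y[i] over n vectors (4*n floats).
 * Unrolled by two so the independent multiply-adds of adjacent
 * iterations can overlap in the pipeline; n is assumed even. */
void saxpy4(vec_float4 a, const vec_float4 *x, vec_float4 *y, int n)
{
    int i;
    for (i = 0; i < n; i += 2) {
        y[i]     = spu_madd(a, x[i],     y[i]);
        y[i + 1] = spu_madd(a, x[i + 1], y[i + 1]);
    }
}
```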
The instruction set is designed to simplify compiler register allocation and code scheduling. Most SPE software is written in C or C++ with intrinsic functions, as in the sketch above.

The SPE architecture reduces area and power while facilitating improved performance by requiring software to solve "hard" scheduling problems such as data fetch and branch prediction. Because the SPE will not be running the operating system, it concentrates on user mode execution. SPE load and store instructions are performed within a local address space, not in the system address space. The local address space is untranslated, unguarded, and noncoherent with respect to the system address space, and is serviced by the local store (LS). The LS is a private memory, not a cache, and does not require tag arrays or a backing store. Loads, stores, and instruction fetch complete with fixed delay and without exception, greatly simplifying the core design and providing predictable real-time behavior. This design reduces the area and power of the core while allowing for higher frequency operation.

Fig. 2. Example time line of concurrent computation and memory access.

Data is transferred to and from the LS in 1024 bit lines by the SPE DMA engine. The SPE DMA engine allows SPE software to schedule data transfers in parallel with core execution. Fig. 2 is a time line that illustrates how software can be divided into coarse grained threads to overlap data transfer and core computation. As thread 1 finishes its computation, it initiates a DMA fetch of its next data set and branches to thread 2. Thread 2 begins by waiting for its previously requested data transfers to finish, then computes while the DMA engine gets the data needed by thread 1. When thread 2 completes its computation, it programs the DMA engine to store the results to system memory and fetch the next data set from system memory, then branches back to thread 1. Techniques like double buffering and coarse grained multithreading allow software to overcome memory latency, achieve high memory bandwidth, and improve performance.

The DMA engine executes transfer commands queued by SPU software through channel instructions. These commands can perform scatter-gather operations from system memory or set up a complex set of status reporting and notification mechanisms. Not only can software achieve much higher bandwidth through the DMA engine than it could with a hardware prefetch engine, but a much higher fraction of that bandwidth is useful data than would occur with the prefetch engine design.

The channel unit is a message passing interface between the SPU core and the rest of the system. Each device is allocated one or more channels through which messages can be sent to or from the SPU core. SPU software sends and receives messages with the write channel and read channel instructions. Channels have capacity, which allows multiple messages to be queued. Capacity allows the SPU to send multiple commands to a device in pipelined fashion without incurring delay, until the channel capacity is exhausted. When a channel is exhausted, the write or read instruction stalls the SPU in a low power wait mode until the device becomes ready. Channel wait mode can often replace the event and status polling that would otherwise consume power during synchronization.
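The double buffering pattern in the Fig. 2 time line can be sketched in C. The fragment below is an illustration under stated assumptions, not code from the paper: it assumes the spu_mfcio.h DMA interface from the Cell SDK, where mfc_get queues a transfer from an effective address into local store under a tag, and mfc_read_tag_status_all blocks until the selected tag group completes. CHUNK, compute, and stream are hypothetical names.

```c
/* Sketch of double buffering: while the SPU computes on one local
 * store buffer, the DMA engine fills the other from system memory.
 * Assumes the Cell SDK header <spu_mfcio.h>. */
#include <spu_mfcio.h>

#define CHUNK 16384                      /* bytes per transfer (assumed) */
static char buf[2][CHUNK] __attribute__((aligned(128)));

extern void compute(char *data, int bytes);   /* hypothetical worker */

void stream(unsigned long long ea, int chunks)
{
    int cur = 0;
    /* Prime the pipeline: start the first fetch under tag 0. */
    mfc_get(buf[0], ea, CHUNK, 0, 0, 0);
    for (int i = 0; i < chunks; i++) {
        int next = cur ^ 1;
        if (i + 1 < chunks)              /* fetch the next chunk, tag = next */
            mfc_get(buf[next], ea + (unsigned long long)(i + 1) * CHUNK,
                    CHUNK, next, 0, 0);
        /* Wait for the current chunk's tag group to complete. */
        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();
        compute(buf[cur], CHUNK);        /* overlaps with the next DMA */
        cur = next;
    }
}
```

Note that mfc_read_tag_status_all compiles down to a blocking channel read, so the SPU waits in the low power channel wait mode described above rather than spinning in a polling loop. The same holds for the SDK's mailbox wrappers; for example, the sketch below (again assuming the spu_mfcio.h interface) stalls the SPU until a message arrives on its inbound mailbox channel.

```c
#include <spu_mfcio.h>

/* Block in low power channel wait mode until a 32 bit mailbox
 * message arrives from outside the SPU, instead of polling. */
unsigned int wait_for_message(void)
{
    return spu_read_in_mbox();   /* stalls while the channel is empty */
}
```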