T ec h n o l o g ies

Editor: Michael A. Gray, [email protected]

The PlayStation 3 for High- Performance Scientific Computing

By Jakub Kurzak, Alfredo Buttari, Piotr Luszczek, and Jack Dongarra

Is real-world gaming technology the next big thing in the more academically based high-performance computing arena? The authors put PlayStation 3 to the test.

he heart of the Play­ SMT feature, which comes at a small chip. Figure 1 shows a schematic of Station 3—the proces­ 5 percent increase in the hardware’s the Cell processor’s design. T sor—wasn’t originally intended cost, can deliver up to a 30 percent in­ All the Cell processor’s compo­ for scientific number crunching, just crease in performance. The PPU also nents, including the PPE, the SPEs, as the PlayStation 3 itself wasn’t meant includes a short-vector single instruc­ the main memory, and the I/O sys­ primarily to serve such purposes. Yet, tion, multiple data (SIMD) engine tem, are connected via the element both these items could impact the called VMX, which is an incarnation interconnection bus, which has four high-performance computing world of the PowerPC’s AltiVec. unidirectional rings (two in each in significant ways. This introductory However, the Cell processor’s real direction) and a token-based arbi­ article takes a closer look at their po­ power lies in the eight Synergistic tration mechanism that plays the tential to do so; an extended version of Processing Elements (SPEs) that ac­ role of traffic light. Each partici­ it is published as a University of Ten­ company the PPE. Each SPE con­ pant is hooked up to the bus with a nessee technical report (www.cs.utk. sists of a Synergistic Processing Unit bandwidth of 25.6 Gbytes/s; the bus edu/~library/2008). (SPU), 256 Kbytes of private memory has an internal bandwidth of 204.8 (referred to as the local store), and a Gbytes/s, which means that for all The Cell in a Nutshell memory-flow controller that delivers practical purposes, you shouldn’t be The Cell processor’s main control powerful direct memory-access ca­ able to saturate it. unit is the Power Processing Element pabilities to the SPU. The SPEs are The Cell chip draws its power from (PPE), which is a 64-bit, two-way si­ the Cell’s short-vector SIMD work­ the fact that it’s a parallel machine multaneous multithreading (SMT) horses and possess a large 128-entry, with eight small, fast, specialized processor that’s binary-compliant 128-bit vector register file as well as a number-crunching and processing with the PowerPC 970 architecture. range of SIMD instructions that can elements. The SPEs, in turn, rely on The PPE consists of the Power Pro­ operate simultaneously on two dou­ a simple design with short pipelines, cessing Unit (PPU), 32 Kbytes of L1 ble-precision values, four single-pre­ a huge register file, and a powerful cache, and 512 Kbytes of L2 cache. cision values, eight 16-bit integers, or SIMD instruction set. Although the PPU uses the Power­ 16 8-bit characters. Most instructions The Cell is essentially a distrib­ PC 970 instruction set, it has a rela­ are pipelined and can complete one uted-memory system on a chip, on tively simple architecture with in-order vector operation in each clock cycle, which each SPE possesses its private execution, which results in consider­ including fused multiplication–addi­ memory stripped of any indirection ably less circuitry than its out-of-or­ tion in single precision, which means mechanisms to make it faster. This der execution counterparts as well as that the SPU can accomplish two puts explicit control over data mo­ lower energy consumption. The high floating-point operations on four val­ tion in the hands of the programmer, , high memory bandwidth, ues in each clock cycle. This trans­ who must use techniques closely re­ and dual threading capabilities make lates to a peak of 2 × 4 × 3.2 GHz = sembling message passing, a model up for the potential performance de­ 25.6 Gflop/s for each SPE and adds that some might think is challenging ficiencies stemming from the PPU’s up to a staggering peak of 8 × 25.6 but is the only one known to be scal­ in-order execution architecture. The Gflop/s = 204.8 Gflop/s for the entire able today.

84 Copublished by the IEEE CS and the AIP 1521-9615/08/$25.00 ©2008 IEEE Computing in Science & Engineering SPE SPE SPE SPE SPE SPE SPE SPE

SPU SPU SPU SPU SPU SPU SPU SPU

LS LS LS LS LS LS LS LS

MFC MFC MFC MFC MFC MFC MFC MFC

Element interconnection bus

L2 MIC BIC

SPE: Synergistic Processing Element SPU: Synergistic Processing Unit L1 Power Processing Unit MFC: Memory ow controller LS: Local store Dual XDR RRAC I/O

Figure 1. Schematic of the Cell processor’s design. The main components are the Power Processing Unit, eight Synergistic Processing Elements (SPEs), and the element interconnection bus.

The PlayStation 3 application by the operating system’s Fortran (with support for Fortran 95 The PlayStation 3 is probably the hypervisor. and partial support for Fortran 2003). cheapest Cell-based system on the The GigE card is accessible to the The kit is available for installation on market: it contains a Cell processor Linux kernel through the hypervisor, Cell- or x86-based systems, with code (with the number of SPEs reduced which both makes it possible to turn compiled and built in cross-compila­ to six), 256 Mbytes of main memory, the PlayStation 3 into a networked tion mode, a method often preferred an NVIDIA graphics card with 256 workstation and facilitates building by experts. These tools practically Mbytes of its own memory, and a giga­ PlayStation 3 clusters via network guarantee compilation of any exist­ bit Ethernet (GigE) network card. switches. You can program such in­ ing C, C++, or Fortran code on the Sony made several convenient pro­ stallations by using the message-pass­ Cell processor, which makes the ini­ visions for installing Linux on the ing interface (MPI). The network tial port of any existing software ba­ PlayStation 3 in a dual-boot setup. card has a direct memory-access sically effortless. Installation instructions are plentiful unit, which you can set up via dedi­ As Table 1 shows, several program­ on the Web, but the basic gist is that cated hypervisor calls that enable data ming models and environments have a virtualization layer—also called transfers without requiring the main emerged for the Cell processor; it the hypervisor—separates the Linux processor’s intervention. seems to have ignited similar enthu­ kernel from the hardware. Devices siasm in the scientific high-perfor­ and other system resources are virtu­ Programming mance computing, embedded systems, alized, but Linux device drivers can All Linux distributions for the Play­ and graphics communities as well. work with them. The Cell processor Station 3 come with the standard Naturally, the programming tech­ in the PlayStation 3 is identical to GNU compiler suite, including C niques proposed for the Cell are as the one you would find in high-end (GCC), C++ (G++), and Fortran 95 diverse as the communities involved: IBM or Mercury blade severs, with (GFORTRAN), which now also pro­ they include shared-memory, distrib­ the exception that two SPEs aren’t vides support for OpenMP through uted-memory, and stream-processing available (one is disabled for chip the GNU GOMP library. The pro­ models and represent both data- and yield reasons). Nevertheless, a Cell grammer can use OpenMP to exploit task-parallel approaches. with one defective SPE still passes as the PPE’s SMT capabilities. IBM’s A separate problem is related to a good chip in the PlayStation 3. If software development kit for Cell programming for a cluster of Play­ all the SPEs are nondefective, a good delivers a similar set of GNU tools, Station 3s—such a cluster is essen­ one is disabled during manufactur­ along with an IBM compiler suite that tially a distributed-memory machine, ing. Another SPE is hidden from the includes C/C++ and, more recently, and there’s almost no programming

May/June 2008 85 T ec h n o l o g ies

Table 1. Programming environments for the Cell processor.* Origin Available Free Cell SuperScalar Barcelona Center X X Sequoia Stanford University X X Accelerated Library Framework IBM X X CorePy Indiana University X X Multicore Framework Mercury Computer Systems X Gedae Gedae X RapidMind RapidMind X Octopiler† IBM X X MPI Microtask‡ IBM * Available means you can get if for free or buy it as a product † Official name is the Single Source Compiler ‡ MPI Microtask is a research project inside IBM; there’s no outside access to this software

alternative to using MPI. Several heavy-iron processors, but it sets Rather, only the extremely compute- freely available implementations ex­ the upper limit of memory-intensive intensive, embarrassingly parallel ist, with the most popular being single-precision calculations to 12.8 problems have a fair chance of success MPICH2 from Argonne National Gflop/s and double-precision calcu­ in scaling to PlayStation 3 clusters. Laboratory and OpenMPI, an open lations to 6.4 Gflop/s, assuming two Such distributed computing prob­ source project in active develop­ operations are performed on one lems, often referred to as screen-saver ment by a team of 19 organizations, data element. computing, have gained popularity in including universities, national labo­ However, the largest disproportion recent years: the trend initiated by the ratories, companies, and private indi­ in the PlayStation 3’s performance is SETI@Home project had many fol­ viduals. Due to the PPE’s compliance between the Cell processor’s speed lowers, including the very successful with the PowerPC architecture, you and that of the GigE interconnection. Folding@Home project. could compile any of these libraries GigE isn’t crafted for performance for execution on the Cell almost out and, in practice, only about 65 percent of the box. The good news is that us­ of its peak bandwidth can be achieved he idea of many-core processors ing MPI on a PlayStation 3 cluster in MPI communication. Also, because T reaching hundreds, if not thou­ isn’t any different than on any other of the extra layer of indirection be­ sands, of processing elements per chip distributed-memory system. tween the operating system and the is emerging, with some researchers hardware (specifically, the hypervisor), aiming for distributed-memory sys­ Scientific Computing the incurred latency can be as big as tems on a chip, an inherently more In spite of its power, the PlayStation 200 μs, which is at least an order of scalable solution than shared-memory 3 has severe limitations for scientific magnitude below today’s standards for setups. Owing to this, the technol­ computing. First, it can only achieve high-performance interconnections. ogy delivered by the PlayStation 3 its astounding peak of 153.6 Gflop/s Even if you could lower the latency through its Cell processor provides for compute-intensive tasks in single- and gain a larger fraction of the peak a unique opportunity to gain experi­ precision arithmetic, which, besides bandwidth, the communication capac­ ence, which is likely to be priceless in delivering less precision, isn’t com­ ity of 1 Gbyte/s is way too small to the near future. pliant with the IEEE floating-point keep the Cell processor busy. A com­ But a major shortcoming of the standard (its double-precision peak is mon remedy for slow interconnections current Cell processor for numeri­ less than 11 Gflop/s). Second, it only is to run larger problems, but in this cal applications is the relatively slow implements truncation rounding, so case, the main memory’s small size speed of double-precision arithme­ denormalized numbers are flushed to (256 Mbytes) turns out to be the limit­ tic. The next incarnation promises zero, and NaNs (“not a number”) are ing factor. to include a fully pipelined double- treated as normal numbers. Finally, Taken all together, even simple precision unit that will deliver the memory-bound problems are limited examples of compute-intensive work­ speed of 12.8 Gflop/s from a single by the main memory’s bandwidth of loads such as matrix multiplication SPE clocked at 3.2 GHz and 102.4 25.6 Gbytes/s. This is a very respect­ can’t benefit from running in paral­ Gflop/s from an eight-SPE system, able value compared to cutting-edge lel on more than two PlayStation 3s. which is going to make the chip a

86 Computing in Science & Engineering Department Editors Books: Mario Belloni, Davidson College, [email protected] Computing Prescriptions: Isabel Beichl, Nat’l Inst. of Standards and Tech., [email protected] Computer Simulations: Muhammad Sahimi, University of Southern California, [email protected], and Dietrich Stauffer, Univ. of Köhn, [email protected] Education: Michael Dennin, Univ. of Calif., Irvine, [email protected], and Steven F. Barrett, Univ. of Wyoming, very serious competitor in the world of scientific and en­ [email protected] gineering computing. News: Rubin Landau, Oregon State Univ., [email protected] Although in agony, Moore’s law is still alive, and we’re Scientific Programming: Konstantin Läufer, Loyola Univ., entering the era of billion-transistor processors. Given ­Chicago, [email protected], and George K. Thiruvathukal, this, the current Cell processor uses a rather modest num­ Loyola Univ., Chicago, [email protected] Technologies: Michael A. Gray, American Univ., gray@american. ber of transistors (234 million). It isn’t hard to envision a edu, and James D. Myers, Collaborative Systems, NCSA, Cell processor with more than one PPE and many more [email protected] SPEs, perhaps exceeding the performance of one teraflop/s Visualization Corner: Cláudio T. Silva, Univ. of Utah, [email protected], and Joel E. Tohline, Louisiana State Univ., for a single chip. [email protected]

Jakub Kurzak is a senior research associate in the Department of Staff Computer Science at the University of Tennessee. His research in- Senior Editor: Jenny Stout, [email protected] Senior Editorial Services Manager: Crystal R. Shif terests include high-performance computing, parallel program- Magazine Editorial Manager: Steve Woods ming, and numerical algorithms. Kurzak has a PhD in computer Staff Editors: Kathy Clark-Fisher, Rebecca L. Deuel, science from the University of Houston. Contact him at kurzak@ Shani Murray, and Brandi Ortega Production Editor: Monette Velasco eecs.utk.edu. Publications Coordinator: Hazel Kosky, [email protected] Technical Illustrator: Alex Torres Alfredo Buttari is a scientist at the French National Institute for Re- Senior Advertising Coordinator: Marian Anderson search in Computer Science and Control (INRIA). His research inter- Marketing Manager: Georgann Carter ests include high-performance computing, parallel programming, Senior Business Development Manager: Sandra Brown and numerical algorithms. Buttari has a PhD in computer science AIP Staff: from the University of Rome. Contact him at [email protected] Circulation Director: Jeff Bebee, [email protected] Editorial Liaison: Charles Day, [email protected]

Piotr Luszczek is a scientist at MathWorks. His research interests IEEE Antennas and Propagation Society liaison: include high-performance computing, parallel programming, and Don Wilton, Univ. of Houston, [email protected] IEEE Signal processing Society liaison: numerical algorithms. Luszczek has a PhD in computer science from Elias S. Manolakos, Northeastevrn Univ., [email protected] the University of Tennessee. Contact him at [email protected] CS Publications Board Jack Dongarra is University Distinguished Professor in the Depart- Sorel Reisman (chair), Angela Burgess, Chita R. Das, Richard H. ment of Electrical Engineering and Computer Science at the Uni- ­Eckhouse, Van Eden, Frank E. Ferrante, David A. Grier, Pamela Jones, Phillip A. Laplante, Simon Liu, Paolo Montuschi, Jon versity of Tennessee, Distinguished Research Staff in the Computer Rokne, Linda I. Shafer, Steven L. Tanimoto Science and Mathematics Division at the Oak Ridge National Labo- ratory, and a professor in the School of Mathematics and School of CS Magazine Operations Committee Computer Science at the University of Manchester, UK. His research Robert E. Filman (chair), David Albonesi, Arnold (Jay) Bragg, Carl Chang, Kwang-Ting (Tim) Cheng, Norman Chonacky, Fred interests include high-performance computing, parallel program- ­Douglis, Hakan Erdogmus, James Hendler, Carl E. Landwehr, ming, and numerical algorithms. Dongarra has a PhD in applied Dejan Milojicic, Sethuraman (Panch) Panchanathan, Maureen mathematics from the University of New Mexico. Contact him don- Stone, Roy Want, Jeff Yost [email protected]. EDITORIAL OFFICE COMPUTING in SCIENCE & ENGINEERING FREE Visionary Web Videos 10662 Los Vaqueros Circle, Los Alamitos, CA 90720 USA about the Future of Multimedia. Phone +1 714 821 8380; www.computer.org/cise/

Listen to premiere multimedia experts! IEEE Antennas & Post your own views and demos! Propagation Society

Visit www.computer.org/multimedia

May/June 2008 87