Technologies

Editor: Michael A. Gray, [email protected]

The PlayStation 3 for High-Performance Scientific Computing

By Jakub Kurzak, Alfredo Buttari, Piotr Luszczek, and Jack Dongarra

Is real-world gaming technology the next big thing in the more academically based high-performance computing arena? The authors put PlayStation 3 to the test.

The heart of the PlayStation 3, the Cell processor, wasn't originally intended for scientific number crunching, just as the PlayStation 3 itself wasn't meant primarily to serve such purposes. Yet both could affect the high-performance computing world in significant ways. This introductory article takes a closer look at their potential to do so; an extended version is published as a University of Tennessee technical report (www.cs.utk.edu/~library/2008).

The Cell in a Nutshell

The Cell processor's main control unit is the Power Processing Element (PPE), a 64-bit, two-way simultaneous multithreading (SMT) processor that's binary-compliant with the PowerPC 970 architecture. The PPE consists of the Power Processing Unit (PPU), 32 Kbytes of L1 cache, and 512 Kbytes of L2 cache. Although the PPU uses the PowerPC 970 instruction set, it has a relatively simple architecture with in-order execution, which results in considerably less circuitry than its out-of-order execution counterparts as well as lower energy consumption. The high clock rate, high memory bandwidth, and dual threading capabilities make up for the potential performance deficiencies stemming from the PPU's in-order execution architecture. The SMT feature, which comes at a small 5 percent increase in the hardware's cost, can deliver up to a 30 percent increase in performance. The PPU also includes a short-vector single instruction, multiple data (SIMD) engine called VMX, which is an incarnation of the PowerPC's AltiVec.

However, the Cell processor's real power lies in the eight Synergistic Processing Elements (SPEs) that accompany the PPE. Each SPE consists of a Synergistic Processing Unit (SPU), 256 Kbytes of private memory (referred to as the local store), and a memory-flow controller, which delivers powerful direct memory-access capabilities to the SPU. The SPEs are the Cell's short-vector SIMD workhorses and possess a large 128-entry, 128-bit vector register file as well as a range of SIMD instructions that can operate simultaneously on two double-precision values, four single-precision values, eight 16-bit integers, or 16 8-bit characters. Most instructions are pipelined and can complete one vector operation in each clock cycle, including fused multiplication-addition in single precision, which means that the SPU can accomplish two floating-point operations on four values in each clock cycle. This translates to a peak of 2 × 4 × 3.2 GHz = 25.6 Gflop/s for each SPE and adds up to a staggering peak of 8 × 25.6 Gflop/s = 204.8 Gflop/s for the entire chip. Figure 1 shows a schematic of the Cell processor's design.

All the Cell processor's components, including the PPE, the SPEs, the main memory, and the I/O system, are connected via the element interconnection bus, which has four unidirectional rings (two in each direction) and a token-based arbitration mechanism that plays the role of a traffic light. Each participant is hooked up to the bus with a bandwidth of 25.6 Gbytes/s; the bus has an internal bandwidth of 204.8 Gbytes/s, which means that for all practical purposes, you shouldn't be able to saturate it.

The Cell chip draws its power from the fact that it's a parallel machine with eight small, fast, specialized number-crunching and processing elements. The SPEs, in turn, rely on a simple design with short pipelines, a huge register file, and a powerful SIMD instruction set.

The Cell is essentially a distributed-memory system, in which each SPE possesses its private memory stripped of any indirection mechanisms to make it faster. This puts explicit control over data motion in the hands of the programmer, who must use techniques closely resembling message passing, a model that some might think is challenging but is the only one known to be scalable today.
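As a quick check of the peak figures quoted in this section, the arithmetic can be written out in a few lines of Python. This is only an illustrative sketch: the function and constant names are hypothetical and not part of any Cell toolchain, and the four-lane multiply-add is emulated with plain lists purely to show how the floating-point operations are counted.

```python
FLOPS_PER_FMA = 2   # one multiply plus one add per lane
SP_LANES = 4        # four single-precision values fit in a 128-bit register
CLOCK_GHZ = 3.2     # SPE clock rate quoted in the text


def vector_fma(a, b, c):
    """Emulate one 4-wide multiply-add, lane by lane.

    Python rounds after the multiply and again after the add, so this
    only mirrors the operation count, not true fused rounding.
    """
    assert len(a) == len(b) == len(c) == SP_LANES
    return [x * y + z for x, y, z in zip(a, b, c)]


def spe_peak_gflops():
    """Peak of one SPE, assuming one vector FMA completes every cycle."""
    return FLOPS_PER_FMA * SP_LANES * CLOCK_GHZ


def chip_peak_gflops(n_spes):
    """Aggregate peak for n_spes SPEs (hypothetical helper)."""
    return n_spes * spe_peak_gflops()


if __name__ == "__main__":
    print(vector_fma([1.0, 2.0, 3.0, 4.0], [2.0] * 4, [0.5] * 4))
    print(f"one SPE:    {spe_peak_gflops():.1f} Gflop/s")
    print(f"eight SPEs: {chip_peak_gflops(8):.1f} Gflop/s")
```

Calling `chip_peak_gflops(6)` gives roughly 153.6 Gflop/s, the figure that applies when only six SPEs are usable.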

Copublished by the IEEE CS and the AIP. 1521-9615/08/$25.00 ©2008 IEEE. Computing in Science & Engineering.

[Figure 1 diagram. Eight SPEs, each containing an SPU, a local store (LS), and a memory flow controller (MFC), sit on the element interconnection bus (EIB). Other labeled blocks: Power Processing Unit (PPU), L1, L2, MIC, BIC, dual XDR, and RRAC I/O.]

Figure 1. Schematic of the Cell processor’s design. The main components are the Power Processing Unit, eight Synergistic Processing Elements, and the element interconnection bus.

The PlayStation 3

The PlayStation 3 is probably the cheapest Cell-based system on the market: it contains a Cell processor (with the number of SPEs reduced to six), 256 Mbytes of main memory, an Nvidia graphics card with 256 Mbytes of its own memory, and a gigabit Ethernet (GigE) network card.

Sony made several convenient provisions for installing Linux on the PlayStation 3 in a dual-boot setup. Installation instructions are plentiful on the Web, but the basic gist is that a virtualization layer, called the hypervisor, separates the Linux kernel from the hardware. Devices and other system resources are virtualized, but Linux device drivers can work with them.

The Cell processor in the PlayStation 3 is identical to the one you would find in high-end IBM or Mercury blade servers, with the exception that two SPEs aren't available. One is disabled for chip yield reasons: a Cell with one defective SPE still passes as a good chip in the PlayStation 3, and if all the SPEs are nondefective, a good one is disabled during manufacturing. Another SPE is hidden from the application by the operating system's virtualization layer (the hypervisor).

The GigE card is accessible to the Linux kernel through the hypervisor, which both makes it possible to turn the PlayStation 3 into a networked workstation and facilitates building PlayStation 3 clusters via network switches. You can program such installations by using the message-passing interface (MPI). The network card has a direct memory-access unit, which you can set up via dedicated hypervisor calls that enable data transfers without requiring the main processor's intervention.

Programming

All Linux distributions for the PlayStation 3 come with the standard GNU compiler suite, including C (GCC), C++ (G++), and Fortran 95 (GFORTRAN), which now also provides support for OpenMP through the GNU GOMP library. The programmer can use OpenMP to exploit the PPE's SMT capabilities. IBM's software development kit for Cell delivers a similar set of GNU tools, along with an IBM compiler suite that includes C/C++ and, more recently, Fortran (with support for Fortran 95 and partial support for Fortran 2003). The kit is available for installation on Cell- or x86-based systems; in the latter case, code is compiled and built in cross-compilation mode, a method often preferred by experts. These tools practically guarantee compilation of any existing C, C++, or Fortran code on the Cell processor, which makes the initial port of any existing software basically effortless.

As Table 1 shows, several programming models and environments have emerged for the Cell processor; it seems to have ignited similar enthusiasm in the scientific high-performance computing, embedded systems, and graphics communities. Naturally, the programming techniques proposed for the Cell are as diverse as the communities involved: they include shared-memory, distributed-memory, and stream-processing models and represent both data- and task-parallel approaches.

Table 1. Programming environments for the Cell processor.*

Environment                     Origin                            Available   Free
Cell SuperScalar                Barcelona Supercomputing Center   X           X
Sequoia                         Stanford University               X           X
Accelerated Library Framework   IBM                               X           X
CorePy                          Indiana University                X           X
Multicore Framework             Mercury Computer Systems          X
Gedae                           Gedae                             X
RapidMind                       RapidMind                         X
Octopiler†                      IBM                               X           X
MPI Microtask‡                  IBM

* Available means you can get it for free or buy it as a product.
† Official name is the Single Source Compiler.
‡ MPI Microtask is a research project inside IBM; there's no outside access to this software.

A separate problem is programming for a cluster of PlayStation 3s. Such a cluster is essentially a distributed-memory machine, and there's almost no programming alternative to using MPI. Several freely available implementations exist, with the most popular being MPICH2 from Argonne National Laboratory and Open MPI, an open source project in active development by a team of 19 organizations, including universities, national laboratories, companies, and private individuals. Because the PPE complies with the PowerPC architecture, you can compile any of these libraries for execution on the Cell almost out of the box. The good news is that using MPI on a PlayStation 3 cluster isn't any different than on any other distributed-memory system.

Scientific Computing

In spite of its power, the PlayStation 3 has severe limitations for scientific computing. First, it can achieve its astounding peak of 153.6 Gflop/s for compute-intensive tasks only in single-precision arithmetic, which, besides delivering less precision, isn't compliant with the IEEE floating-point standard (its double-precision peak is less than 11 Gflop/s). Second, it implements only truncation rounding, flushes denormalized numbers to zero, and treats NaNs ("not a number") as normal numbers. Finally, memory-bound problems are limited by the main memory's bandwidth of 25.6 Gbytes/s. This is a very respectable value compared to cutting-edge heavy-iron processors, but it sets the upper limit of memory-intensive single-precision calculations to 12.8 Gflop/s and double-precision calculations to 6.4 Gflop/s, assuming two operations are performed on one data element.

However, the largest disproportion in the PlayStation 3's performance is between the Cell processor's speed and that of the GigE interconnection. GigE isn't crafted for performance and, in practice, only about 65 percent of its peak bandwidth can be achieved in MPI communication. Also, because of the extra layer of indirection between the operating system and the hardware (specifically, the hypervisor), the incurred latency can be as big as 200 μs, which is at least an order of magnitude worse than today's standards for high-performance interconnections. Even if you could lower the latency and gain a larger fraction of the peak bandwidth, the communication capacity of 1 Gbit/s is far too small to keep the Cell processor busy. A common remedy for slow interconnections is to run larger problems, but in this case, the main memory's small size (256 Mbytes) turns out to be the limiting factor.

Taken together, even simple examples of compute-intensive workloads such as matrix multiplication can't benefit from running in parallel on more than two PlayStation 3s. Rather, only extremely compute-intensive, embarrassingly parallel problems have a fair chance of success in scaling to PlayStation 3 clusters. Such problems, often referred to as screen-saver computing, have gained popularity in recent years: the trend initiated by the SETI@Home project had many followers, including the very successful Folding@Home project.

The idea of many-core processors reaching hundreds, if not thousands, of processing elements per chip is emerging, with some researchers aiming for distributed-memory systems on a chip, an inherently more scalable solution than shared-memory setups. Owing to this, the technology delivered by the PlayStation 3 through its Cell processor provides a unique opportunity to gain experience, which is likely to be priceless in the near future.

But a major shortcoming of the current Cell processor for numerical applications is the relatively slow speed of double-precision arithmetic. The next incarnation promises to include a fully pipelined double-precision unit that will deliver 12.8 Gflop/s from a single SPE clocked at 3.2 GHz and 102.4 Gflop/s from an eight-SPE system, which is going to make the chip a very serious competitor in the world of scientific and engineering computing.

Although in agony, Moore's law is still alive, and we're entering the era of billion-transistor processors. Given this, the current Cell processor uses a rather modest number of transistors (234 million). It isn't hard to envision a Cell processor with more than one PPE and many more SPEs, perhaps exceeding the performance of one teraflop/s for a single chip.
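For readers who want to check the memory and network limits quoted under Scientific Computing, the sketch below reproduces them from simple ratios. It is a rough model under stated assumptions: gigabit Ethernet is taken at its nominal line rate of 1 Gbit/s, a message is assumed to cost latency plus size divided by bandwidth, and all function and constant names are illustrative rather than taken from any library.

```python
MEM_BANDWIDTH_GB_S = 25.6    # main-memory bandwidth (from the text)
PEAK_SP_GFLOPS = 153.6       # six-SPE single-precision peak (from the text)
GIGE_PEAK_GBIT_S = 1.0       # assumption: nominal gigabit Ethernet line rate
GIGE_MPI_EFFICIENCY = 0.65   # fraction of peak reachable with MPI (from the text)
GIGE_LATENCY_S = 200e-6      # latency through the hypervisor (from the text)


def memory_bound_gflops(bytes_per_element, flops_per_element=2):
    """Ceiling when each element is streamed from main memory once."""
    g_elements_per_second = MEM_BANDWIDTH_GB_S / bytes_per_element
    return g_elements_per_second * flops_per_element


def gige_effective_gb_s():
    """Usable MPI bandwidth in Gbytes/s (8 bits per byte)."""
    return GIGE_PEAK_GBIT_S * GIGE_MPI_EFFICIENCY / 8


def message_time_s(n_bytes):
    """Latency-plus-bandwidth estimate for sending one message."""
    return GIGE_LATENCY_S + n_bytes / (gige_effective_gb_s() * 1e9)


def flops_needed_per_byte():
    """Operations per received byte needed to keep the SPEs at peak."""
    return PEAK_SP_GFLOPS / gige_effective_gb_s()


if __name__ == "__main__":
    print(f"single precision ceiling: {memory_bound_gflops(4):.1f} Gflop/s")
    print(f"double precision ceiling: {memory_bound_gflops(8):.1f} Gflop/s")
    print(f"usable GigE bandwidth:    {gige_effective_gb_s() * 1000:.2f} Mbytes/s")
    print(f"1-Mbyte message time:     {message_time_s(1e6) * 1e3:.2f} ms")
    print(f"flops needed per byte:    {flops_needed_per_byte():.0f}")
```

Under these assumptions the usable network bandwidth is about 81 Mbytes/s, so the Cell would need on the order of 1,900 single-precision operations per byte received to stay near its peak. That is consistent with the article's point that only extremely compute-intensive workloads scale across a PlayStation 3 cluster.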

Jakub Kurzak is a senior research associate in the Department of Computer Science at the University of Tennessee. His research interests include high-performance computing, parallel programming, and numerical algorithms. Kurzak has a PhD in computer science from the University of Houston. Contact him at [email protected].

Alfredo Buttari is a scientist at the French National Institute for Research in Computer Science and Control (INRIA). His research interests include high-performance computing, parallel programming, and numerical algorithms. Buttari has a PhD in computer science from the University of Rome. Contact him at [email protected].

Piotr Luszczek is a scientist at MathWorks. His research interests include high-performance computing, parallel programming, and numerical algorithms. Luszczek has a PhD in computer science from the University of Tennessee. Contact him at luszczek@eecs.utk.edu.

Jack Dongarra is University Distinguished Professor in the Department of Electrical Engineering and Computer Science at the University of Tennessee, Distinguished Research Staff in the Computer Science and Mathematics Division at the Oak Ridge National Laboratory, and a professor in the School of Mathematics and School of Computer Science at the University of Manchester, UK. His research interests include high-performance computing, parallel programming, and numerical algorithms. Dongarra has a PhD in applied mathematics from the University of New Mexico. Contact him at dongarra@eecs.utk.edu.

May/June 2008