The PlayStation 3 for High-Performance Scientific Computing

Technologies
Editor: Michael A. Gray, [email protected]

By Jakub Kurzak, Alfredo Buttari, Piotr Luszczek, and Jack Dongarra

Is real-world gaming technology the next big thing in the more academically based high-performance computing arena? The authors put PlayStation 3 to the test.

Copublished by the IEEE CS and the AIP. 1521-9615/08/$25.00 ©2008 IEEE. Computing in Science & Engineering, May/June 2008.

The heart of the Sony PlayStation 3, the Cell processor, wasn't originally intended for scientific number crunching, just as the PlayStation 3 itself wasn't meant primarily to serve such purposes. Yet, both these items could impact the high-performance computing world in significant ways. This introductory article takes a closer look at their potential to do so; an extended version of it is published as a University of Tennessee technical report (www.cs.utk.edu/~library/2008).

The Cell in a Nutshell

The Cell processor's main control unit is the Power Processing Element (PPE), which is a 64-bit, two-way simultaneous multithreading (SMT) processor that's binary-compliant with the PowerPC 970 architecture. The PPE consists of the Power Processing Unit (PPU), 32 Kbytes of L1 cache, and 512 Kbytes of L2 cache. Although the PPU uses the PowerPC 970 instruction set, it has a relatively simple architecture with in-order execution, which results in considerably less circuitry than its out-of-order execution counterparts as well as lower energy consumption. The high clock rate, high memory bandwidth, and dual threading capabilities make up for the potential performance deficiencies stemming from the PPU's in-order execution architecture. The SMT feature, which comes at a small 5 percent increase in the hardware's cost, can deliver up to a 30 percent increase in performance. The PPU also includes a short-vector single instruction, multiple data (SIMD) engine called VMX, which is an incarnation of the PowerPC's AltiVec.

However, the Cell processor's real power lies in the eight Synergistic Processing Elements (SPEs) that accompany the PPE. Each SPE consists of a Synergistic Processing Unit (SPU), 256 Kbytes of private memory (referred to as the local store), and a memory-flow controller, which delivers powerful direct memory-access capabilities to the SPU. The SPEs are the Cell's short-vector SIMD workhorses and possess a large 128-entry, 128-bit vector register file as well as a range of SIMD instructions that can operate simultaneously on two double-precision values, four single-precision values, eight 16-bit integers, or 16 8-bit characters. Most instructions are pipelined and can complete one vector operation in each clock cycle, including fused multiplication-addition in single precision, which means that the SPU can accomplish two floating-point operations on four values in each clock cycle. This translates to a peak of 2 × 4 × 3.2 GHz = 25.6 Gflop/s for each SPE and adds up to a staggering peak of 8 × 25.6 Gflop/s = 204.8 Gflop/s for the entire chip. Figure 1 shows a schematic of the Cell processor's design.

All the Cell processor's components, including the PPE, the SPEs, the main memory, and the I/O system, are connected via the element interconnection bus, which has four unidirectional rings (two in each direction) and a token-based arbitration mechanism that plays the role of a traffic light. Each participant is hooked up to the bus with a bandwidth of 25.6 Gbytes/s; the bus has an internal bandwidth of 204.8 Gbytes/s, which means that for all practical purposes, you shouldn't be able to saturate it.

The Cell chip draws its power from the fact that it's a parallel machine with eight small, fast, specialized number-crunching and processing elements. The SPEs, in turn, rely on a simple design with short pipelines, a huge register file, and a powerful SIMD instruction set.

The Cell is essentially a distributed-memory system on a chip, on which each SPE possesses its private memory stripped of any indirection mechanisms to make it faster. This puts explicit control over data motion in the hands of the programmer, who must use techniques closely resembling message passing, a model that some might think is challenging but is the only one known to be scalable today.

[Figure 1 diagram: eight SPEs, each containing an SPU, an LS, and an MFC, attached to the element interconnection bus (EIB) together with the PPU and its L1 and L2 caches, the MIC (dual XDR), and the BIC (RRAC I/O). SPE: Synergistic Processing Element; SPU: Synergistic Processing Unit; MFC: memory flow controller; LS: local store.]

Figure 1. Schematic of the Cell processor's design. The main components are the Power Processing Unit, eight Synergistic Processing Elements, and the element interconnection bus.

The PlayStation 3

The PlayStation 3 is probably the cheapest Cell-based system on the market: it contains a Cell processor (with the number of SPEs reduced to six), 256 Mbytes of main memory, an NVIDIA graphics card with 256 Mbytes of its own memory, and a gigabit Ethernet (GigE) network card.

Sony made several convenient provisions for installing Linux on the PlayStation 3 in a dual-boot setup. Installation instructions are plentiful on the Web, but the basic gist is that a virtualization layer, called the hypervisor, separates the Linux kernel from the hardware. Devices and other system resources are virtualized, but Linux device drivers can work with them. The Cell processor in the PlayStation 3 is identical to the one you would find in high-end IBM or Mercury blade servers, with the exception that two SPEs aren't available (one is disabled for chip yield reasons). Nevertheless, a Cell with one defective SPE still passes as a good chip in the PlayStation 3. If all the SPEs are nondefective, a good one is disabled during manufacturing. Another SPE is hidden from the application by the operating system's virtualization layer (called the hypervisor).

The GigE card is accessible to the Linux kernel through the hypervisor, which both makes it possible to turn the PlayStation 3 into a networked workstation and facilitates building PlayStation 3 clusters via network switches. You can program such installations by using the message-passing interface (MPI). The network card has a direct memory-access unit, which you can set up via dedicated hypervisor calls that enable data transfers without requiring the main processor's intervention.

Programming

All Linux distributions for the PlayStation 3 come with the standard GNU compiler suite, including C (GCC), C++ (G++), and Fortran 95 (GFORTRAN), which now also provides support for OpenMP through the GNU GOMP library. The programmer can use OpenMP to exploit the PPE's SMT capabilities. IBM's software development kit for Cell delivers a similar set of GNU tools, along with an IBM compiler suite that includes C/C++ and, more recently, Fortran (with support for Fortran 95 and partial support for Fortran 2003). The kit is available for installation on Cell- or x86-based systems, with code compiled and built in cross-compilation mode, a method often preferred by experts. These tools practically guarantee compilation of any existing C, C++, or Fortran code on the Cell processor, which makes the initial port of any existing software basically effortless.

As Table 1 shows, several programming models and environments have emerged for the Cell processor; it seems to have ignited similar enthusiasm in the scientific high-performance computing, embedded systems, and graphics communities as well. Naturally, the programming techniques proposed for the Cell are as diverse as the communities involved: they include shared-memory, distributed-memory, and stream-processing models and represent both data- and task-parallel approaches.

Table 1. Programming environments for the Cell processor.*

Environment                    Origin                           Available  Free
Cell SuperScalar               Barcelona Supercomputer Center   X          X
Sequoia                        Stanford University              X          X
Accelerated Library Framework  IBM                              X          X
CorePy                         Indiana University               X          X
Multicore Framework            Mercury Computer Systems         X
Gedae                          Gedae                            X
RapidMind                      RapidMind                        X
Octopiler†                     IBM                              X          X
MPI Microtask‡                 IBM

* Available means you can get it for free or buy it as a product.
† Official name is the Single Source Compiler.
‡ MPI Microtask is a research project inside IBM; there's no outside access to this software.

A separate problem is related to programming for a cluster of PlayStation 3s: such a cluster is essentially a distributed-memory machine, and there's almost no programming alternative to using MPI. Several freely available implementations exist, with the most popular being MPICH2 from Argonne National Laboratory and OpenMPI, an open source project in active development by a team of 19 organizations, including universities, national laboratories, companies, and private individuals. Due to the PPE's compliance [...]

[...] to 12.8 Gflop/s and double-precision calculations to 6.4 Gflop/s, assuming two operations are performed on one data element.

However, the largest disproportion in the PlayStation 3's performance is between the Cell processor's speed and that of the GigE interconnection.

[...] in scaling to PlayStation 3 clusters. Such distributed computing problems, often referred to as screen-saver computing, have gained popularity in recent years: the trend initiated by the SETI@Home project had many followers, including the very successful [...]
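As a rough illustration of the peak-rate arithmetic quoted in "The Cell in a Nutshell", the following Python sketch recomputes the per-SPE and whole-chip figures from the numbers given in the text. Two parts of it are assumptions rather than statements from the article: the six-SPE total for the PlayStation 3 is simply derived with the same formula, and the bandwidth-bound ceiling assumes that the truncated sentence about 12.8 and 6.4 Gflop/s refers to the 25.6 Gbytes/s link, with each 4-byte or 8-byte operand streamed once and used for two operations. All function and constant names are hypothetical.

```python
"""Back-of-the-envelope peak rates for the Cell, from figures quoted in the article."""

CLOCK_GHZ = 3.2            # SPE clock rate used in the article's formula
SP_LANES = 4               # single-precision values per 128-bit vector
FLOPS_PER_FMA = 2          # a fused multiply-add counts as two operations
LINK_BANDWIDTH_GBS = 25.6  # Gbytes/s per bus participant (article figure)


def spe_peak_gflops(clock_ghz=CLOCK_GHZ, lanes=SP_LANES,
                    flops_per_op=FLOPS_PER_FMA):
    """Peak Gflop/s of one SPE: operations x vector lanes x clock."""
    return flops_per_op * lanes * clock_ghz


def chip_peak_gflops(num_spes):
    """Aggregate SPE peak for a chip that exposes num_spes SPEs."""
    return num_spes * spe_peak_gflops()


def bandwidth_bound_gflops(bytes_per_element, ops_per_element=2,
                           bandwidth_gbs=LINK_BANDWIDTH_GBS):
    """Gflop/s ceiling when every operand is streamed in exactly once.

    Assumption: this is how the 12.8 / 6.4 Gflop/s figures arise; the
    sentence that introduces them is truncated in the text.
    """
    return bandwidth_gbs / bytes_per_element * ops_per_element


if __name__ == "__main__":
    print(f"one SPE, single precision:   {spe_peak_gflops():.1f} Gflop/s")
    print(f"full Cell (8 SPEs):          {chip_peak_gflops(8):.1f} Gflop/s")
    # Derived here, not quoted in the text: six SPEs usable on a PlayStation 3.
    print(f"PlayStation 3 (6 SPEs):      {chip_peak_gflops(6):.1f} Gflop/s")
    print(f"bandwidth-bound, 4-byte ops: {bandwidth_bound_gflops(4):.1f} Gflop/s")
    print(f"bandwidth-bound, 8-byte ops: {bandwidth_bound_gflops(8):.1f} Gflop/s")
```

Comparing `chip_peak_gflops(6)` with `bandwidth_bound_gflops(4)` gives a feel for the gap between compute-bound and data-streaming workloads on this hardware, under the stated assumptions.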
