Performance Scientific Computing

T ec HNOLOG ies Editor: Michael A. Gray, [email protected] THE PLAYSTATION 3 FOR HIGH- PERFORMANCE SCIENTIFIC COMPUTING By Jakub Kurzak, Alfredo Buttari, Piotr Luszczek, and Jack Dongarra Is real-world gaming technology the next big thing in the more academically based high-performance computing arena? The authors put PlayStation 3 to the test. he heart of the Sony Play SMT feature, which comes at a small chip. Figure 1 shows a schematic of Station 3—the Cell proces 5 percent increase in the hardware’s the Cell processor’s design. T sor—wasn’t originally intended cost, can deliver up to a 30 percent in All the Cell processor’s compo for scientific number crunching, just crease in performance. The PPU also nents, including the PPE, the SPEs, as the PlayStation 3 itself wasn’t meant includes a shortvector single instruc the main memory, and the I/O sys primarily to serve such purposes. Yet, tion, multiple data (SIMD) engine tem, are connected via the element both these items could impact the called VMX, which is an incarnation interconnection bus, which has four highperformance computing world of the PowerPC’s AltiVec. unidirectional rings (two in each in significant ways. This introductory However, the Cell processor’s real direction) and a tokenbased arbi article takes a closer look at their po power lies in the eight Synergistic tration mechanism that plays the tential to do so; an extended version of Processing Elements (SPEs) that ac role of traffic light. Each partici it is published as a University of Ten company the PPE. Each SPE con pant is hooked up to the bus with a nessee technical report (www.cs.utk. sists of a Synergistic Processing Unit bandwidth of 25.6 Gbytes/s; the bus edu/~library/2008). (SPU), 256 Kbytes of private memory has an internal bandwidth of 204.8 (referred to as the local store), and a Gbytes/s, which means that for all The Cell in a Nutshell memoryflow controller that delivers practical purposes, you shouldn’t be The Cell processor’s main control powerful direct memoryaccess ca able to saturate it. unit is the Power Processing Element pabilities to the SPU. The SPEs are The Cell chip draws its power from (PPE), which is a 64bit, twoway si the Cell’s shortvector SIMD work the fact that it’s a parallel machine multaneous multithreading (SMT) horses and possess a large 128entry, with eight small, fast, specialized processor that’s binarycompliant 128bit vector register file as well as a numbercrunching and processing with the PowerPC 970 architecture. range of SIMD instructions that can elements. The SPEs, in turn, rely on The PPE consists of the Power Pro operate simultaneously on two dou a simple design with short pipelines, cessing Unit (PPU), 32 Kbytes of L1 bleprecision values, four singlepre a huge register file, and a powerful cache, and 512 Kbytes of L2 cache. cision values, eight 16bit integers, or SIMD instruction set. Although the PPU uses the Power 16 8bit characters. Most instructions The Cell is essentially a distrib PC 970 instruction set, it has a rela are pipelined and can complete one utedmemory system on a chip, on tively simple architecture with inorder vector operation in each clock cycle, which each SPE possesses its private execution, which results in consider including fused multiplication–addi memory stripped of any indirection ably less circuitry than its outofor tion in single precision, which means mechanisms to make it faster. This der execution counterparts as well as that the SPU can accomplish two puts explicit control over data mo lower energy consumption. The high floatingpoint operations on four val tion in the hands of the programmer, clock rate, high memory bandwidth, ues in each clock cycle. This trans who must use techniques closely re and dual threading capabilities make lates to a peak of 2 × 4 × 3.2 GHz = sembling message passing, a model up for the potential performance de 25.6 Gflop/s for each SPE and adds that some might think is challenging ficiencies stemming from the PPU’s up to a staggering peak of 8 × 25.6 but is the only one known to be scal inorder execution architecture. The Gflop/s = 204.8 Gflop/s for the entire able today. 84 Copublished by the IEEE CS and the AIP 1521-9615/08/$25.00 ©2008 IEEE COMPUTING IN SCIENCE & ENGINEERING SPE SPE SPE SPE SPE SPE SPE SPE SPU SPU SPU SPU SPU SPU SPU SPU LS LS LS LS LS LS LS LS MFC MFC MFC MFC MFC MFC MFC MFC Element interconnection bus L2 MIC BIC SPE: Synergistic Processing Element SPU: Synergistic Processing Unit L1 Power Processing Unit MFC: Memory ow controller LS: Local store Dual XDR RRAC I/O Figure 1. Schematic of the Cell processor’s design. The main components are the Power Processing Unit, eight Synergistic Processing Elements (SPEs), and the element interconnection bus. The PlayStation 3 application by the operating system’s Fortran (with support for Fortran 95 The PlayStation 3 is probably the hypervisor. and partial support for Fortran 2003). cheapest Cellbased system on the The GigE card is accessible to the The kit is available for installation on market: it contains a Cell processor Linux kernel through the hypervisor, Cell or x86based systems, with code (with the number of SPEs reduced which both makes it possible to turn compiled and built in crosscompila to six), 256 Mbytes of main memory, the PlayStation 3 into a networked tion mode, a method often preferred an NVIDIA graphics card with 256 workstation and facilitates building by experts. These tools practically Mbytes of its own memory, and a giga PlayStation 3 clusters via network guarantee compilation of any exist bit Ethernet (GigE) network card. switches. You can program such in ing C, C++, or Fortran code on the Sony made several convenient pro stallations by using the messagepass Cell processor, which makes the ini visions for installing Linux on the ing interface (MPI). The network tial port of any existing software ba PlayStation 3 in a dualboot setup. card has a direct memoryaccess sically effortless. Installation instructions are plentiful unit, which you can set up via dedi As Table 1 shows, several program on the Web, but the basic gist is that cated hypervisor calls that enable data ming models and environments have a virtualization layer—also called transfers without requiring the main emerged for the Cell processor; it the hypervisor—separates the Linux processor’s intervention. seems to have ignited similar enthu kernel from the hardware. Devices siasm in the scientific highperfor and other system resources are virtu Programming mance computing, embedded systems, alized, but Linux device drivers can All Linux distributions for the Play and graphics communities as well. work with them. The Cell processor Station 3 come with the standard Naturally, the programming tech in the PlayStation 3 is identical to GNU compiler suite, including C niques proposed for the Cell are as the one you would find in highend (GCC), C++ (G++), and Fortran 95 diverse as the communities involved: IBM or Mercury blade severs, with (GFORTRAN), which now also pro they include sharedmemory, distrib the exception that two SPEs aren’t vides support for OpenMP through utedmemory, and streamprocessing available (one is disabled for chip the GNU GOMP library. The pro models and represent both data and yield reasons). Nevertheless, a Cell grammer can use OpenMP to exploit taskparallel approaches. with one defective SPE still passes as the PPE’s SMT capabilities. IBM’s A separate problem is related to a good chip in the PlayStation 3. If software development kit for Cell programming for a cluster of Play all the SPEs are nondefective, a good delivers a similar set of GNU tools, Station 3s—such a cluster is essen one is disabled during manufactur along with an IBM compiler suite that tially a distributedmemory machine, ing. Another SPE is hidden from the includes C/C++ and, more recently, and there’s almost no programming MAY/JUNE 2008 85 T ec HNOLOG ies Table 1. Programming environments for the Cell processor.* Origin Available Free Cell SuperScalar Barcelona Supercomputer Center X X Sequoia Stanford University X X Accelerated Library Framework IBM X X CorePy Indiana University X X Multicore Framework Mercury Computer Systems X Gedae Gedae X RapidMind RapidMind X Octopiler† IBM X X MPI Microtask‡ IBM * Available means you can get if for free or buy it as a product † Official name is the Single Source Compiler ‡ MPI Microtask is a research project inside IBM; there’s no outside access to this software alternative to using MPI. Several heavyiron processors, but it sets Rather, only the extremely compute freely available implementations ex the upper limit of memoryintensive intensive, embarrassingly parallel ist, with the most popular being singleprecision calculations to 12.8 problems have a fair chance of success MPICH2 from Argonne National Gflop/s and doubleprecision calcu in scaling to PlayStation 3 clusters. Laboratory and OpenMPI, an open lations to 6.4 Gflop/s, assuming two Such distributed computing prob source project in active develop operations are performed on one lems, often referred to as screensaver ment by a team of 19 organizations, data element. computing, have gained popularity in including universities, national labo However, the largest disproportion recent years: the trend initiated by the ratories, companies, and private indi in the PlayStation 3’s performance is SETI@Home project had many fol viduals.

Load more