
Technologies
Editor: Michael A. Gray, [email protected]

THE PLAYSTATION 3 FOR HIGH-PERFORMANCE SCIENTIFIC COMPUTING

By Jakub Kurzak, Alfredo Buttari, Piotr Luszczek, and Jack Dongarra

Computing in Science & Engineering, May/June 2008. Copublished by the IEEE CS and the AIP. 1521-9615/08/$25.00 ©2008 IEEE

Is real-world gaming technology the next big thing in the more academically based high-performance computing arena? The authors put PlayStation 3 to the test.

The heart of the Sony PlayStation 3, the Cell processor, wasn't originally intended for scientific number crunching, just as the PlayStation 3 itself wasn't meant primarily to serve such purposes. Yet both could impact the high-performance computing world in significant ways. This introductory article takes a closer look at their potential to do so; an extended version of it is published as a University of Tennessee technical report (www.cs.utk.edu/~library/2008).

The Cell in a Nutshell

The Cell processor's main control unit is the Power Processing Element (PPE), which is a 64-bit, two-way simultaneous multithreading (SMT) processor that's binary-compliant with the PowerPC 970 architecture. The PPE consists of the Power Processing Unit (PPU), 32 Kbytes of L1 cache, and 512 Kbytes of L2 cache. Although the PPU uses the PowerPC 970 instruction set, it has a relatively simple architecture with in-order execution, which results in considerably less circuitry than its out-of-order execution counterparts as well as lower energy consumption. The high clock rate, high memory bandwidth, and dual threading capabilities make up for the potential performance deficiencies stemming from the PPU's in-order execution architecture. The SMT feature, which comes at a small 5 percent increase in the hardware's cost, can deliver up to a 30 percent increase in performance. The PPU also includes a short-vector single instruction, multiple data (SIMD) engine called VMX, which is an incarnation of the PowerPC's AltiVec.

However, the Cell processor's real power lies in the eight Synergistic Processing Elements (SPEs) that accompany the PPE. Each SPE consists of a Synergistic Processing Unit (SPU), 256 Kbytes of private memory (referred to as the local store), and a memory-flow controller that delivers powerful direct memory-access capabilities to the SPU. The SPEs are the Cell's short-vector SIMD workhorses and possess a large 128-entry, 128-bit vector register file as well as a range of SIMD instructions that can operate simultaneously on two double-precision values, four single-precision values, eight 16-bit integers, or sixteen 8-bit characters. Most instructions are pipelined and can complete one vector operation in each clock cycle, including fused multiplication-addition in single precision, which means that the SPU can accomplish two floating-point operations on four values in each clock cycle. This translates to a peak of 2 × 4 × 3.2 GHz = 25.6 Gflop/s for each SPE and adds up to a staggering peak of 8 × 25.6 Gflop/s = 204.8 Gflop/s for the entire chip. Figure 1 shows a schematic of the Cell processor's design.

Figure 1. Schematic of the Cell processor's design. The main components are the Power Processing Unit, eight Synergistic Processing Elements (SPEs), and the element interconnection bus. (Legend: SPE, Synergistic Processing Element; SPU, Synergistic Processing Unit; MFC, memory flow controller; LS, local store.)

All the Cell processor's components, including the PPE, the SPEs, the main memory, and the I/O system, are connected via the element interconnection bus, which has four unidirectional rings (two in each direction) and a token-based arbitration mechanism that plays the role of a traffic light. Each participant is hooked up to the bus with a bandwidth of 25.6 Gbytes/s; the bus has an internal bandwidth of 204.8 Gbytes/s, which means that for all practical purposes, you shouldn't be able to saturate it.

The Cell chip draws its power from the fact that it's a parallel machine with eight small, fast, specialized number-crunching and processing elements. The SPEs, in turn, rely on a simple design with short pipelines, a huge register file, and a powerful SIMD instruction set.

The Cell is essentially a distributed-memory system on a chip, on which each SPE possesses its private memory stripped of any indirection mechanisms to make it faster. This puts explicit control over data motion in the hands of the programmer, who must use techniques closely resembling message passing, a model that some might think is challenging but is the only one known to be scalable today.

The PlayStation 3

The PlayStation 3 is probably the cheapest Cell-based system on the market: it contains a Cell processor (with the number of SPEs reduced to six), 256 Mbytes of main memory, an NVIDIA graphics card with 256 Mbytes of its own memory, and a gigabit Ethernet (GigE) network card.

Sony made several convenient provisions for installing Linux on the PlayStation 3 in a dual-boot setup. Installation instructions are plentiful on the Web, but the basic gist is that a virtualization layer, also called the hypervisor, separates the Linux kernel from the hardware. Devices and other system resources are virtualized, but Linux device drivers can work with them. The Cell processor in the PlayStation 3 is identical to the one you would find in high-end IBM or Mercury blade servers, with the exception that two SPEs aren't available. One is disabled for chip yield reasons: a Cell with one defective SPE still passes as a good chip in the PlayStation 3, and if all the SPEs are nondefective, a good one is disabled during manufacturing. Another SPE is hidden from the application by the operating system's hypervisor.

The GigE card is accessible to the Linux kernel through the hypervisor, which both makes it possible to turn the PlayStation 3 into a networked workstation and facilitates building PlayStation 3 clusters via network switches. You can program such installations by using the message-passing interface (MPI). The network card has a direct memory-access unit, which you can set up via dedicated hypervisor calls that enable data transfers without requiring the main processor's intervention.

Programming

All Linux distributions for the PlayStation 3 come with the standard GNU compiler suite, including C (GCC), C++ (G++), and Fortran 95 (GFORTRAN), which now also provides support for OpenMP through the GNU GOMP library. The programmer can use OpenMP to exploit the PPE's SMT capabilities. IBM's software development kit for Cell delivers a similar set of GNU tools, along with an IBM compiler suite that includes C/C++ and, more recently, Fortran (with support for Fortran 95 and partial support for Fortran 2003). The kit is available for installation on Cell- or x86-based systems, with code compiled and built in cross-compilation mode, a method often preferred by experts. These tools practically guarantee compilation of any existing C, C++, or Fortran code on the Cell processor, which makes the initial port of any existing software basically effortless.

As Table 1 shows, several programming models and environments have emerged for the Cell processor; it seems to have ignited similar enthusiasm in the scientific high-performance computing, embedded systems, and graphics communities. Naturally, the programming techniques proposed for the Cell are as diverse as the communities involved: they include shared-memory, distributed-memory, and stream-processing models and represent both data- and task-parallel approaches.

Table 1. Programming environments for the Cell processor.*

Environment                     Origin                             Available   Free
Cell SuperScalar                Barcelona Supercomputing Center    X           X
Sequoia                         Stanford University                X           X
Accelerated Library Framework   IBM                                X           X
CorePy                          Indiana University                 X           X
Multicore Framework             Mercury Computer Systems           X
Gedae                           Gedae                              X
RapidMind                       RapidMind                          X
Octopiler†                      IBM                                X           X
MPI Microtask‡                  IBM

* Available means you can get it for free or buy it as a product.
† Official name is the Single Source Compiler.
‡ MPI Microtask is a research project inside IBM; there's no outside access to this software.

A separate problem is related to programming for a cluster of PlayStation 3s. Such a cluster is essentially a distributed-memory machine, and there's almost no programming alternative to using MPI. Several freely available implementations exist, with the most popular being MPICH2 from Argonne National Laboratory and Open MPI, an open source project in active development by a team of 19 organizations, including universities, national laboratories, companies, and private individuals.

[…] heavy-iron processors, but it sets the upper limit of memory-intensive single-precision calculations to 12.8 Gflop/s and double-precision calculations to 6.4 Gflop/s, assuming two operations are performed on one data element.

However, the largest disproportion in the PlayStation 3's performance is […]

[…] Rather, only the extremely compute-intensive, embarrassingly parallel problems have a fair chance of success in scaling to PlayStation 3 clusters. Such distributed computing problems, often referred to as screen-saver computing, have gained popularity in recent years: the trend initiated by the SETI@Home project had many fol- […]
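
The peak figures quoted in this article follow from simple arithmetic, and a short Python sketch makes the reasoning easy to check. It is only a back-of-envelope model, not a benchmark. The function and constant names are made up for illustration and are not part of any Cell SDK. The six-SPE figure is derived here rather than quoted in the text, and the 25.6 Gbytes/s main-memory bandwidth is an assumption: it matches the per-participant bus bandwidth and reproduces the 12.8 and 6.4 Gflop/s limits, but the surviving text doesn't state it directly.

```python
# Back-of-envelope check of the peak figures quoted in the article.
# All names are illustrative; nothing here comes from a Cell SDK.

CLOCK_GHZ = 3.2      # SPE clock rate, as quoted in the text
SP_LANES = 4         # single-precision values in one 128-bit vector
FLOPS_PER_FMA = 2    # a fused multiply-add counts as two operations

# ASSUMPTION: main-memory bandwidth in Gbytes/s. This value is consistent
# with the 12.8 / 6.4 Gflop/s limits in the text but is not stated there.
ASSUMED_MEMORY_BW_GBYTES = 25.6


def spe_peak_gflops(clock_ghz=CLOCK_GHZ, lanes=SP_LANES):
    """Single-precision peak of one SPE: one vector FMA per clock cycle."""
    return FLOPS_PER_FMA * lanes * clock_ghz


def chip_peak_gflops(num_spes):
    """Aggregate single-precision peak over num_spes SPEs."""
    return num_spes * spe_peak_gflops()


def memory_bound_gflops(bytes_per_element, ops_per_element=2,
                        bandwidth_gbytes=ASSUMED_MEMORY_BW_GBYTES):
    """Upper limit when every data element must stream from main memory."""
    elements_per_ns = bandwidth_gbytes / bytes_per_element
    return elements_per_ns * ops_per_element


if __name__ == "__main__":
    print(f"one SPE:            {spe_peak_gflops():.1f} Gflop/s")
    print(f"full Cell (8 SPEs): {chip_peak_gflops(8):.1f} Gflop/s")
    print(f"PS3 (6 SPEs):       {chip_peak_gflops(6):.1f} Gflop/s (derived)")
    print(f"memory-bound SP:    {memory_bound_gflops(4):.1f} Gflop/s")
    print(f"memory-bound DP:    {memory_bound_gflops(8):.1f} Gflop/s")
```

Comparing the compute peak with the memory-bound limit shows why only compute-intensive kernels, those that reuse each data element many times from the local store, can get close to the SPEs' peak rate.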