QPACE: Quantum Chromodynamics Parallel Computing on the Cell Broadband Engine
Application-driven computers for Lattice Gauge Theory simulations have often been based on system-on-chip designs, but the development costs can be prohibitive for academic project budgets. An alternative approach uses compute nodes based on a commercial processor tightly coupled to a custom-designed network processor. Preliminary analysis shows that this solution offers good performance, but it also entails several challenges, including those arising from the processor's multicore structure and from implementing the network processor on a field-programmable gate array.

Gottfried Goldrian, Thomas Huth, Benjamin Krill, Jack Lauritsen, and Heiko Schick, IBM Research and Development Lab, Böblingen, Germany
Ibrahim Ouda, IBM Systems and Technology Group, Rochester, Minnesota
Simon Heybrock, Dieter Hierl, Thilo Maurer, Nils Meyer, Andreas Schäfer, Stefan Solbrig, Thomas Streuer, and Tilo Wettig, University of Regensburg, Germany
Dirk Pleiter, Karl-Heinz Sulanke, and Frank Winter, Deutsches Elektronen Synchrotron, Zeuthen, Germany
Hubert Simma, University of Milano-Bicocca, Italy
Sebastiano Fabio Schifano and Raffaele Tripiccione, University of Ferrara, Italy
Andrea Nobile, European Center for Theoretical Studies, Trento, Italy
Matthias Drochner and Thomas Lippert, Research Center Jülich, Germany
Zoltan Fodor, University of Wuppertal, Germany

Quantum chromodynamics (QCD) is a well-established theoretical framework to describe the properties and interactions of quarks and gluons, which are the building blocks of protons, neutrons, and other particles. In some physically interesting dynamical regions, researchers can study QCD perturbatively—that is, they can work out physical observables as a systematic expansion in the strong coupling constant. In other (often more interesting) regions, such an expansion isn't possible, so researchers must find other approaches. The most systematic and widely used nonperturbative approach is lattice QCD (LQCD), a discretized, computer-friendly version of the theory Kenneth Wilson proposed more than 30 years ago.1 In this framework, QCD is reformulated as a statistical mechanics problem that we can study using Monte Carlo techniques. Many physicists have performed LQCD Monte Carlo simulations over the years and have developed efficient simulation algorithms and sophisticated analysis techniques. Since the early 1980s, LQCD researchers have pioneered the use of massively parallel computers in large scientific applications, using virtually all available computing systems including traditional mainframes, large PC clusters, and high-performance systems.

We're interested in parallel architectures on which LQCD codes scale to thousands of nodes and on which we can achieve good price-performance and power-performance ratios. In this respect, highly competitive systems are QCD On a Chip (QCDOC)2 and apeNEXT,3 two custom-designed LQCD architectures that have been operating since 2004 and 2005, respectively, and IBM's BlueGene series.4 These architectures are based on system-on-chip designs. In modern technologies, however, VLSI chip development costs are so high that an SoC approach is no longer possible for academic projects.

To address this problem, our QCD Parallel Computing on the Cell (QPACE) project uses a compute node based on IBM's PowerXCell 8i multicore processor and tightly couples it to a custom-designed network processor in which we connect each node to its nearest neighbors in a 3D toroidal mesh. To facilitate the network processor's implementation, we use a Xilinx Virtex-5 field-programmable gate array (FPGA). In both price and power performance, our approach is competitive with BlueGene and has a much lower development cost. Here, we describe our performance analysis and hardware benchmarks for the PowerXCell 8i processor, which show that it's a powerful option for LQCD. We then describe our architecture in detail and how we're addressing the various challenges it poses.

Lattice Gauge Theory Computing

LQCD simulations are among scientific computing's grand challenges. Today, a group active in high-end LQCD simulations must have access to computing resources on the order of tens of sustained Tflops-years.

Accurately describing LQCD theory and algorithms is beyond this article's scope. LQCD simulations have an algorithmic structure exhibiting high parallelism and several features that make it easy to exploit much of this parallelism. LQCD replaces continuous 4D space–time with a discrete and finite lattice (containing N = L^4 sites for a linear size L). All compute-intensive tasks involve repeated execution of just one basic step: the product of the lattice Dirac operator and a quark field ψ. In LQCD, a quark field ψ_x^{ia} is defined at lattice sites x = 1, …, N and carries color indices a = 1, …, 3 and spinor indices i = 1, …, 4. Thus, ψ is a vector with 12N complex entries. As Equation 1 shows, the hopping term of the Wilson Dirac operator D_h acts on ψ as follows (other discretization schemes exist, but the implementation details are all similar):

\psi'_x = D_h \psi_x = \sum_{\mu=1}^{4} \left[ U_{x,\mu}\,(1+\gamma_\mu)\,\psi_{x+\mu} + U^\dagger_{x-\mu,\mu}\,(1-\gamma_\mu)\,\psi_{x-\mu} \right].   (1)

Here, μ labels the four space–time directions, and the U_{x,μ} are the SU(3) gauge matrices associated with the links between nearest-neighbor lattice sites. The gauge matrices are themselves dynamical degrees of freedom and carry color indices (each of the 4N different U_{x,μ} has 3 × 3 complex entries). The γ_μ are the (constant) Dirac matrices, carrying spinor indices. Today, state-of-the-art simulations use lattices with a linear size L of at least 32 sites.
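To make the data layout behind Equation 1 concrete, here is a minimal C sketch that applies the hopping term at a single site of a periodic lattice. It is illustrative only, not the QPACE code: the lexicographic site ordering, the DeGrand-Rossi choice for the Euclidean γ matrices, and all type and function names (su3, spinor, dslash_site, and so on) are assumptions made for this example.

```c
/* dslash_site.c -- illustrative sketch only, not the QPACE production code.
 * Applies the Wilson hopping term D_h of Equation 1 at one site x of a
 * periodic L^4 lattice. Per site, the quark field stores 4 spinor x 3 color
 * complex numbers; each of the four links U_{x,mu} is a 3x3 complex matrix.
 */
#include <complex.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

#define L 8                            /* local lattice extent (example)  */

typedef double complex su3[3][3];      /* gauge link U_{x,mu}             */
typedef double complex spinor[4][3];   /* quark field psi_x^{ia}          */

/* Euclidean Dirac matrices, here in the DeGrand-Rossi basis (any basis
 * with {gamma_mu, gamma_nu} = 2 delta_{mu nu} would do equally well).    */
static double complex gam[4][4][4];

static void init_gamma(void)
{
    gam[0][0][3] =  I; gam[0][1][2] =  I; gam[0][2][1] = -I; gam[0][3][0] = -I;
    gam[1][0][3] = -1; gam[1][1][2] =  1; gam[1][2][1] =  1; gam[1][3][0] = -1;
    gam[2][0][2] =  I; gam[2][1][3] = -I; gam[2][2][0] = -I; gam[2][3][1] =  I;
    gam[3][0][2] =  1; gam[3][1][3] =  1; gam[3][2][0] =  1; gam[3][3][1] =  1;
}

/* Lexicographic site index with periodic boundary conditions. */
static size_t idx(const int x[4])
{
    return ((size_t)((x[3] + L) % L) * L * L * L + ((x[2] + L) % L) * L * L
            + ((x[1] + L) % L) * L + (x[0] + L) % L);
}

/* out = (1 + sign * gamma_mu) in; acts on the spinor index only. */
static void spin_proj(spinor out, const spinor in, int mu, int sign)
{
    for (int i = 0; i < 4; i++)
        for (int a = 0; a < 3; a++) {
            double complex v = in[i][a];               /* the "1" part */
            for (int j = 0; j < 4; j++)
                v += sign * gam[mu][i][j] * in[j][a];
            out[i][a] = v;
        }
}

/* out += U in (or U^dagger in if dag != 0); acts on the color index only. */
static void su3_mul_add(spinor out, const su3 U, const spinor in, int dag)
{
    for (int i = 0; i < 4; i++)
        for (int a = 0; a < 3; a++) {
            double complex v = 0;
            for (int b = 0; b < 3; b++)
                v += dag ? conj(U[b][a]) * in[i][b] : U[a][b] * in[i][b];
            out[i][a] += v;
        }
}

/* Equation 1: psi'_x = sum_mu [ U_{x,mu} (1+gamma_mu) psi_{x+mu}
 *                             + U^dag_{x-mu,mu} (1-gamma_mu) psi_{x-mu} ]. */
void dslash_site(spinor out, const spinor *psi, const su3 (*U)[4], const int x[4])
{
    spinor tmp;
    for (int i = 0; i < 4; i++)
        for (int a = 0; a < 3; a++)
            out[i][a] = 0;

    for (int mu = 0; mu < 4; mu++) {
        int xp[4] = { x[0], x[1], x[2], x[3] };  xp[mu]++;   /* x + mu */
        int xm[4] = { x[0], x[1], x[2], x[3] };  xm[mu]--;   /* x - mu */

        spin_proj(tmp, psi[idx(xp)], mu, +1);                /* (1 + gamma_mu)      */
        su3_mul_add(out, U[idx(x)][mu], tmp, 0);             /* U_{x,mu} ...        */

        spin_proj(tmp, psi[idx(xm)], mu, -1);                /* (1 - gamma_mu)      */
        su3_mul_add(out, U[idx(xm)][mu], tmp, 1);            /* U^dag_{x-mu,mu} ... */
    }
}

int main(void)
{
    init_gamma();
    spinor *psi   = calloc((size_t)L * L * L * L, sizeof *psi);  /* zero fields, */
    su3   (*U)[4] = calloc((size_t)L * L * L * L, sizeof *U);    /* just to run  */
    spinor out;
    int x[4] = { 0, 0, 0, 0 };

    dslash_site(out, psi, (const su3 (*)[4])U, x);
    printf("psi'[0][0] at the origin: %g + %gi\n", creal(out[0][0]), cimag(out[0][0]));

    free(psi);
    free(U);
    return 0;
}
```

Production codes exploit the fact that the projectors (1 ± γ_μ) have rank 2 and thereby halve the color-multiplication work; the sketch keeps the straightforward form of Equation 1 for readability.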
Roughly speaking, the ideal LQCD simulation engine is a system that can keep the previously defined degrees of freedom in local storage and repeatedly apply the Dirac operator with very high efficiency. Explicit parallelism is straightforward. As Equation 1 shows, D_h couples only nearest-neighbor lattice sites, which suggests an obvious parallel structure for an LQCD computer: a d-dimensional grid of processing nodes. Each node contains local data storage and a compute engine. We need only to map the physical lattice onto the processor grid as regular tiles (if d < 4, we fully map 4 − d dimensions of the physical lattice onto each processing node).

In the conceptually simple case of a 4D hypercubic grid of p^4 processors, each processing node would store and process a 4D subset of (L/p)^4 lattice sites. Each processing node handles its sublattice using local data. The nodes also access data corresponding to lattice sites that are just outside the sublattice's surface and are stored on nearest-neighbor processors. Processing proceeds in parallel for two reasons: there are no direct data dependencies for lattice sites that aren't nearest neighbor, and the same operations sequence runs on all processing nodes in lockstep mode.

Parallel performance will be linear in the number of nodes, as long as

• the programmer can evenly partition the lattice on the processor grid, and
• the interconnection bandwidth is large enough to sustain each node's performance.

We meet the first constraint easily for up to thousands of processors while the latter is less trivial because the required inter-node bandwidth increases linearly with p (we provide figures on bandwidth requirements later).
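The second constraint can be quantified with a short back-of-the-envelope calculation. The following C sketch is not taken from the article: the sustained node performance, the double-precision storage sizes, and the conventional figure of roughly 1,320 floating-point operations per site for the Wilson hopping term are assumptions chosen only to illustrate why the per-node communication requirement grows linearly with p at fixed L (the surface-to-volume ratio of each tile scales as p/L).

```c
/* halo_estimate.c -- back-of-the-envelope sketch (not from the article).
 * An L^4 lattice is tiled onto a 4D grid of p^4 nodes, (L/p)^4 sites each.
 * Assumed per-site figures, double precision:
 *   - a spinor holds 4 x 3 complex numbers          -> 24 * 8 = 192 bytes
 *   - one hopping-term application costs ~1,320 flops (conventional count)
 */
#include <stdio.h>

int main(void)
{
    const int    L = 32;                    /* global lattice extent (example) */
    const int    p = 4;                     /* nodes per dimension   (example) */
    const double spinor_bytes   = 24 * 8;   /* bytes per site                  */
    const double flops_per_site = 1320.0;   /* assumed Wilson flop count       */
    const double node_gflops    = 100.0;    /* assumed sustained node speed    */

    int    l    = L / p;                    /* local lattice extent            */
    double vol  = (double)l * l * l * l;    /* local sites                     */
    double halo = 8.0 * l * l * l;          /* 2 faces x 4 dimensions          */

    double bytes  = halo * spinor_bytes;            /* spinors received per D_h */
    double flops  = vol * flops_per_site;           /* work done per D_h        */
    double t_comp = flops / (node_gflops * 1e9);    /* seconds per D_h          */
    double bw     = bytes / t_comp / 1e9;           /* required Gbytes/s        */

    printf("local sublattice : %d^4 = %.0f sites\n", l, vol);
    printf("halo spinors     : %.0f sites, %.2f Mbytes per application\n",
           halo, bytes / 1e6);
    printf("required inbound bandwidth ~ %.1f Gbytes/s "
           "(grows linearly with p at fixed L)\n", bw);
    return 0;
}
```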
Conceptually, this case favors an implementation in which the most powerful processor available provides the total desired processing power and thus reduces processor count and node-to-node bandwidth requirements. IBM's Cell/B.E. (www.ibm.com/developerworks/power/cell) is therefore an obvious choice.

The PowerXCell 8i Processor's Performance

The Cell/B.E. processor contains one PowerPC processor element and eight synergistic processor elements.5 Each SPE runs a single thread and has its own 256 Kbytes of on-chip local store (LS) memory, which is accessible by direct memory access or by local load/store operations to and from 128 general-purpose 128-bit registers. The Cell/B.E. is obviously attractive for scientific applications as well.6–8 This is even truer for the PowerXCell 8i. As we reported at the Lattice 2007 Conference,8 we investigated this processor's performance by extending a recently proposed theoretical model.9 We now summarize our findings.

[Figure 1. Dataflow paths for a single synergistic processor element and corresponding execution times (T_i). The model assumes a communication network with a 3D torus topology in which 12 links (six inbound and six outbound) simultaneously transfer data.]

Performance Model

We consider two classes of hardware devices:

• storage devices, such as registers or LSs, that store data and instructions and are characterized by their storage size, and
• processing devices that act on data or transfer data/instructions from one storage device to another. Such devices can be floating-point (FP) units, DMA controllers, or buses and are characterized by their bandwidths β_i and startup latencies λ_i.

We can divide an algorithm implemented on a specific architecture into microtasks performed by our model's processing devices. The execution time T_i of each task i is estimated by a linear ansatz,

T_i \approx I_i / \beta_i + O(\lambda_i),   (2)

where I_i is the amount of processed data. In the following, we assume that all tasks can be run concurrently at maximal throughput and that suitable scheduling can hide all dependencies and latencies. Given this, the total execution time is determined by the slowest microtask, max_i T_i.
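The following C sketch evaluates the linear ansatz of Equation 2 for a few microtasks and applies the maximum-over-tasks rule just described. The device names, bandwidths, latencies, and data volumes are placeholders rather than measured QPACE figures, and the workload reuses the illustrative numbers from the previous sketch.

```c
/* perf_model.c -- toy evaluation of the linear ansatz of Equation 2,
 * T_i ~ I_i / beta_i + O(lambda_i). Device names and all numbers below are
 * placeholders, not measured QPACE figures.
 */
#include <stdio.h>

struct device {
    const char *name;
    double      beta;      /* sustained bandwidth in bytes/s (flops/s for FP) */
    double      lambda;    /* startup latency in seconds                      */
};

/* Equation 2: estimated execution time of one microtask on one device. */
static double task_time(const struct device *d, double items)
{
    return items / d->beta + d->lambda;
}

int main(void)
{
    /* Placeholder devices of the model: FP unit, local store, memory, network. */
    const struct device fp  = { "FP unit",      100e9, 0.0  };
    const struct device ls  = { "local store",   50e9, 0.0  };
    const struct device mem = { "main memory",   25e9, 1e-6 };
    const struct device nif = { "network",        6e9, 1e-6 };

    /* Work per hopping-term application on one node (illustrative numbers:
     * an 8^4 local sublattice, 192-byte spinors, 144-byte links, 1,320 flops). */
    const double flops     = 4096 * 1320.0;
    const double ls_bytes  = 4096 * (192.0 + 4 * 144.0);
    const double mem_bytes = ls_bytes;          /* fields streamed through once */
    const double net_bytes = 4096 * 192.0;      /* 8 x 8^3 halo spinors         */

    struct { const char *name; double t; } tasks[] = {
        { "compute",        task_time(&fp,  flops)     },
        { "LS traffic",     task_time(&ls,  ls_bytes)  },
        { "memory traffic", task_time(&mem, mem_bytes) },
        { "network",        task_time(&nif, net_bytes) },
    };

    /* If all tasks overlap perfectly, the slowest one limits performance. */
    double t_max = 0.0;
    const char *limiter = "";
    for (size_t i = 0; i < sizeof tasks / sizeof tasks[0]; i++) {
        printf("%-15s T_i = %.2e s\n", tasks[i].name, tasks[i].t);
        if (tasks[i].t > t_max) { t_max = tasks[i].t; limiter = tasks[i].name; }
    }
    printf("limiting device: %s -> floating-point efficiency ~ %.0f%%\n",
           limiter, 100.0 * task_time(&fp, flops) / t_max);
    return 0;
}
```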