Novel Architectures

QPACE: Quantum Chromodynamics Parallel Computing on the Cell Broadband Engine

Application-driven computers for Lattice Gauge Theory simulations have often been based on system-on-chip designs, but the development costs can be prohibitive for academic project budgets. An alternative approach uses compute nodes based on a commercial processor tightly coupled to a custom-designed network processor. Preliminary analysis shows that this solution offers good performance, but it also entails several challenges, including those arising from the processor's multicore structure and from implementing the network processor on a field-programmable gate array.

Gottfried Goldrian, Thomas Huth, Benjamin Krill, Jack Lauritsen, and Heiko Schick, IBM Research and Development Lab, Böblingen, Germany
Ibrahim Ouda, IBM Systems and Technology Group, Rochester, Minnesota
Simon Heybrock, Dieter Hierl, Thilo Maurer, Nils Meyer, Andreas Schäfer, Stefan Solbrig, Thomas Streuer, and Tilo Wettig, University of Regensburg, Germany
Dirk Pleiter, Karl-Heinz Sulanke, and Frank Winter, Deutsches Elektronen Synchrotron, Zeuthen, Germany
Hubert Simma, University of Milano-Bicocca, Italy
Sebastiano Fabio Schifano and Raffaele Tripiccione, University of Ferrara, Italy
Andrea Nobile, European Center for Theoretical Studies, Trento, Italy
Matthias Drochner and Thomas Lippert, Research Center Jülich, Germany
Zoltan Fodor, University of Wuppertal, Germany

Quantum chromodynamics (QCD) is a well-established theoretical framework to describe the properties and interactions of quarks and gluons, which are the building blocks of protons, neutrons, and other particles. In some physically interesting dynamical regions, researchers can study QCD perturbatively; that is, they can work out physical observables as a systematic expansion in the strong coupling constant. In other (often more interesting) regions, such an expansion isn't possible, so researchers must find other approaches. The most systematic and widely used nonperturbative approach is lattice QCD (LQCD), a discretized, computer-friendly version of the theory Kenneth Wilson proposed more than 30 years ago.1 In this framework, QCD is reformulated as a statistical mechanics problem that we can study using Monte Carlo techniques. Many physicists have performed LQCD Monte Carlo simulations over the years and have developed efficient simulation algorithms and sophisticated analysis techniques. Since the early 1980s, LQCD researchers have pioneered the use of massively parallel computers in large scientific applications, using virtually all available computing systems including traditional mainframes, large PC clusters, and high-performance systems.

We're interested in parallel architectures on which LQCD codes scale to thousands of nodes and on which we can achieve good price-performance and power-performance ratios. In this respect, highly competitive systems are QCD On a Chip (QCDOC)2 and apeNEXT,3 two custom-designed LQCD architectures that have been operating since 2004 and 2005, respectively, and IBM's BlueGene series.4 These architectures are based on system-on-chip designs. In modern technologies, however, VLSI chip development costs are so high that an SoC approach is no longer possible for academic projects.

To address this problem, our QCD Parallel Computing on the Cell (QPACE) project uses a compute node based on IBM's PowerXCell 8i multicore processor and tightly couples it to a custom-designed network processor in which we connect each node to its nearest neighbors in a 3D toroidal mesh. To facilitate the network processor's implementation, we use a Virtex-5 field-programmable gate array (FPGA). In both price and power performance, our approach is competitive with BlueGene and has a much lower development cost. Here, we describe our performance analysis and hardware benchmarks for the PowerXCell 8i processor, which show that it's a powerful option for LQCD. We then describe our architecture in detail and how we're addressing the various challenges it poses.

Lattice Gauge Theory Computing
LQCD simulations are among scientific computing's grand challenges. Today, a group active in high-end LQCD simulations must have access to computing resources on the order of tens of sustained Tflops-years.

Accurately describing LQCD theory and algorithms is beyond this article's scope. LQCD simulations have an algorithmic structure exhibiting high parallelism and several features that make it easy to exploit much of this parallelism. LQCD replaces continuous 4D space–time with a discrete and finite lattice (containing N = L^4 sites for a linear size L). All compute-intensive tasks involve repeated execution of just one basic step: the product of the lattice Dirac operator and a quark field ψ. In LQCD, a quark field ψx^ia is defined at lattice sites x = 1, . . . , N and carries color indices a = 1, . . . , 3 and spinor indices i = 1, . . . , 4. Thus, ψ is a vector with 12N complex entries. As Equation 1 shows, the hopping term of the Wilson Dirac operator Dh acts on ψ as follows (other discretization schemes exist, but the implementation details are all similar):

\psi'_x = D_h \psi_x = \sum_{\mu=1}^{4} \left[ U_{x,\mu} (1+\gamma_\mu)\, \psi_{x+\mu} + U^\dagger_{x-\mu,\mu} (1-\gamma_\mu)\, \psi_{x-\mu} \right].   (1)

Here, μ labels the four space–time directions, and the Ux,μ are the SU(3) gauge matrices associated with the links between nearest-neighbor lattice sites. The gauge matrices are themselves dynamical degrees of freedom and carry color indices (each of the 4N different Ux,μ has 3 × 3 complex entries). The γμ are the (constant) Dirac matrices, carrying spinor indices. Today, state-of-the-art simulations use lattices with a linear size L of at least 32 sites.

Roughly speaking, the ideal LQCD simulation engine is a system that can keep the previously defined degrees of freedom in local storage and repeatedly apply the Dirac operator with very high efficiency. Explicit parallelism is straightforward. As Equation 1 shows, Dh couples only nearest-neighbor lattice sites, which suggests an obvious parallel structure for an LQCD computer: a d-dimensional grid of processing nodes. Each node contains local data storage and a compute engine. We need only to map the physical lattice onto the processor grid as regular tiles (if d < 4, we fully map 4 − d dimensions of the physical lattice onto each processing node).

In the conceptually simple case of a 4D hypercubic grid of p^4 processors, each processing node would store and process a 4D subset of (L/p)^4 lattice sites. Each processing node handles its sublattice using local data. The nodes also access data corresponding to lattice sites that are just outside the sublattice's surface and are stored on nearest-neighbor processors. Processing proceeds in parallel for two reasons: there are no direct data dependencies for lattice sites that aren't nearest neighbors, and the same operations sequence runs on all processing nodes in lockstep mode.
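To make the dataflow of Equation 1 concrete, the sketch below applies a stripped-down nearest-neighbor hopping sweep on a periodic 4D lattice. It keeps one real value per site and unit link factors, so the color, spin, and γ-matrix structure is omitted; the names and lattice size are our own choices for illustration, not the authors' implementation.

```c
/* Simplified nearest-neighbor "hopping" sweep on a periodic 4D lattice.
 * One real value per site and trivial link factors stand in for the
 * 12-component spinors and SU(3) matrices of Equation 1; only the
 * data-dependency pattern (couplings to the 8 neighbors) is kept. */
#include <stdio.h>
#include <stdlib.h>

#define L 8                      /* linear lattice size (illustrative) */
#define VOL (L * L * L * L)

static int idx(const int x[4]) {               /* site coordinates -> linear index */
    return ((x[0] * L + x[1]) * L + x[2]) * L + x[3];
}

static void hop(const double *in, double *out) {
    int x[4];
    for (x[0] = 0; x[0] < L; x[0]++)
    for (x[1] = 0; x[1] < L; x[1]++)
    for (x[2] = 0; x[2] < L; x[2]++)
    for (x[3] = 0; x[3] < L; x[3]++) {
        double sum = 0.0;
        for (int mu = 0; mu < 4; mu++) {
            int xp[4] = { x[0], x[1], x[2], x[3] };
            int xm[4] = { x[0], x[1], x[2], x[3] };
            xp[mu] = (x[mu] + 1) % L;          /* forward neighbor, periodic  */
            xm[mu] = (x[mu] + L - 1) % L;      /* backward neighbor, periodic */
            sum += in[idx(xp)] + in[idx(xm)];  /* unit "links", no gammas     */
        }
        out[idx(x)] = sum;
    }
}

int main(void) {
    double *psi = malloc(VOL * sizeof *psi), *res = malloc(VOL * sizeof *res);
    if (!psi || !res) return 1;
    for (int i = 0; i < VOL; i++) psi[i] = (double)(i % 7);   /* arbitrary test data */
    hop(psi, res);
    printf("res[0] = %g\n", res[0]);
    free(psi); free(res);
    return 0;
}
```

Because each output site touches only its eight nearest neighbors, distributing the outer loops over a processor grid requires communicating only the sites on each tile's surface, which is exactly the property the machine design exploits.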

Parallel performance will be linear in the number of nodes, as long as

• the programmer can evenly partition the lattice on the processor grid, and
• the interconnection bandwidth is large enough to sustain each node's performance.

We meet the first constraint easily for up to thousands of processors, while the latter is less trivial because the required inter-node bandwidth increases linearly with p (we provide figures on bandwidth requirements later).

Conceptually, this case favors an implementation in which the most powerful processor available provides the total desired processing power and thus reduces processor count and node-to-node bandwidth requirements. IBM's Cell/B.E. (www.ibm.com/developerworks/power/cell) is therefore an obvious choice.

The PowerXCell 8i Processor's Performance
The Cell/B.E. processor contains one PowerPC processor element and eight synergistic processor elements.5 Each SPE runs a single thread and has its own 256 Kbytes of on-chip local store (LS) memory, which is accessible by direct memory access or local load/store operations to and from 128 general-purpose 128-bit registers. An SPE can execute two instructions per cycle, performing up to eight single-precision floating-point operations. Thus, the total single-precision peak performance of all eight SPEs on a Cell/B.E. is 204.8 Gflops at 3.2 GHz.

The PowerXCell 8i is an enhanced version of the Cell/B.E., with the same single-precision performance and a peak double-precision performance of 102.4 Gflops with IEEE-compliant rounding. It has an on-chip memory controller supporting a memory bandwidth of 25.6 Gbytes/s and a configurable Rambus FlexIO I/O interface that supports coherent and noncoherent protocols with a total bidirectional bandwidth of 25.6 Gbytes/s. Internally, all processor units are connected to the coherent element interconnect bus (EIB) by direct memory access (DMA) controllers.

Although it was developed for the PlayStation 3, the Cell/B.E. is obviously attractive for scientific applications as well.6–8 This is even truer for the PowerXCell 8i. As we reported at the Lattice 2007 Conference,8 we investigated this processor's performance, extending a recently proposed theoretical model.9 We now summarize our findings.

Performance Model
We consider two classes of hardware devices:

• storage devices, such as registers or LSs, that store data and instructions and are characterized by their storage size, and
• processing devices that act on data or transfer data/instructions from one storage device to another. Such devices can be floating-point (FP) units, DMA controllers, or buses and are characterized by their bandwidths βi and start-up latencies λi.

We can divide an algorithm implemented on a specific architecture into microtasks performed by our model's processing devices. The execution time Ti of each task i is estimated by a linear ansatz,

T_i \simeq I_i / \beta_i + O(\lambda_i),   (2)

where Ii is the amount of processed data. In the following, we assume that all tasks can be run concurrently at maximal throughput and that suitable scheduling can hide all dependencies and latencies. Given this, the total execution time is

T_{\mathrm{exe}} \simeq \max_i T_i.   (3)

If Tpeak is the minimal achievable compute time for an application's FP operations in an "ideal implementation," we define the FP efficiency εFP as εFP = Tpeak/Texe.

Figure 1. Dataflow paths for a single synergistic processor element and corresponding execution times (Ti). The model assumes a communication network with a 3D torus topology in which 12 links (six inbound and six outbound) simultaneously transfer data.

Figure 1 shows the dataflow paths and associated execution times Ti that enter our analysis:

• floating-point operations, TFP;
• load/store operations between the register file (RF) and the LS, TRF;
• off-chip memory access, Tmem;
• internal communications between SPEs on the same processor, Tint;
• external communications between different processors, Text = max(TNIF, Tlink); and
• transfers via the EIB (memory access, internal, and external communications), TEIB.
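The bookkeeping behind Equations 2 and 3 is compact enough to show directly. The sketch below is our own illustration of the model: each microtask time is estimated as Ii/βi + λi, the total time is set by the slowest task, and the FP efficiency is Tpeak/Texe. Apart from the FP row (840 multiply-adds at two per cycle, taken from the kernel analysis later in the article), the device parameters are placeholders rather than the measured Cell/B.E. figures.

```c
/* Evaluate the linear performance model of Equations 2 and 3:
 *   T_i = I_i / beta_i + lambda_i,  T_exe = max_i T_i,  eps_FP = T_peak / T_exe.
 * Only the T_FP row below is taken from the text (840 multiply-adds at 2 per
 * cycle); the other rows are illustrative placeholders, not measured values. */
#include <stdio.h>

struct task {
    const char *name;
    double work;      /* I_i: operations executed or bytes moved by the task */
    double rate;      /* beta_i: units of work per cycle for this device     */
    double latency;   /* lambda_i: start-up latency in cycles                */
};

int main(void) {
    const double t_peak = 330.0;  /* minimal FP time per site (from Equation 1) */
    const struct task tasks[] = {
        { "T_FP",   840.0,  2.0,   0.0 },  /* from the text: T_FP = 420 cycles */
        { "T_RF",  2448.0, 16.0,   0.0 },  /* placeholder register-file load   */
        { "T_mem", 1536.0,  3.2, 100.0 },  /* placeholder off-chip traffic     */
        { "T_ext",  384.0,  1.0, 200.0 },  /* placeholder torus-link traffic   */
    };
    double t_exe = 0.0;
    for (unsigned i = 0; i < sizeof tasks / sizeof tasks[0]; i++) {
        double t = tasks[i].work / tasks[i].rate + tasks[i].latency;
        printf("%-6s = %7.1f cycles\n", tasks[i].name, t);
        if (t > t_exe) t_exe = t;          /* Equation 3: the slowest device wins */
    }
    printf("T_exe  = %7.1f cycles, eps_FP = %.0f%%\n",
           t_exe, 100.0 * t_peak / t_exe);
    return 0;
}
```

For the actual kernel, the Ii come from the data volumes of Equation 1 and the βi from the Cell/B.E. manuals; Table 1 lists the resulting Ti.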

We consider a communication network with a 3D torus topology in which the six inbound and six outbound links can simultaneously transfer data, each using a link bandwidth of 1 Gbyte/s. We also assume a balanced network interface providing a bandwidth of 6 Gbytes/s simultaneously in each direction between the Cell/B.E. and the network. (Current FPGA capabilities strongly constrain both the torus network's dimensionality and the bandwidth; specifically, the number of high-speed serial transceivers and the total pin count.) We take all other hardware parameters βi from the Cell/B.E. manuals (www.ibm.com/developerworks/power/cell). We now describe the results of our performance analysis for different optimizations of the main LQCD kernel.

Lattice QCD Kernel
Computing Equation 1 on a single lattice site amounts to 1,320 (double-precision) FP instructions (not counting sign flips and complex conjugation) and thus yields a Tpeak of 330 cycles per site. However, executing Equation 1 requires at least 840 multiply-add operations and TFP = 420 cycles per lattice site. Thus, any implementation of Equation 1 on an SPE can't perform better than 78 percent of peak.

The lattice data layout greatly influences the cost of load/store operations for the operands of Equation 1 (9 × 12 + 8 × 9 complex numbers), as well as the time spent on remote communications. We assign to each Cell/B.E. a local lattice with VCell = L1 × L2 × L3 × L4 sites and arrange the eight SPEs logically as s1 × s2 × s3 × s4 = 8. A single SPE thus holds a subvolume of VSPE = (L1/s1) × (L2/s2) × (L3/s3) × (L4/s4) = VCell/8 sites. On average, each SPE has Aint neighboring sites on other SPEs within a Cell/B.E. and Aext neighboring sites outside a Cell/B.E. We investigated two data layout strategies: In the first, all data are stored in the SPEs' on-chip LS; in the second, the data are stored in off-chip main memory (MM).

Data in on-chip memory (LS). In this case, we require that all data for a repeated compute task are held in the SPEs' LS. The LS must also hold a minimal program kernel, the runtime environment, and intermediate results. Therefore, the storage requirements strongly constrain the local lattice volumes VSPE and VCell.

A spinor field ψx needs 24 real words (192 bytes in DP) per site, while a gauge field Ux,μ needs 18 words (144 bytes) per link. If we assume that a solver requires storage of roughly 400 words/site (for eight spinors and 12 gauge links, for example), a single SPE's subvolume is restricted to about VSPE = 79 sites. (We assume here a minimal code size of 4 Kbytes for the Dirac kernel; a more realistic assumption of 32 Kbytes for the solver code and runtime environment decreases our estimate to VSPE = 70.) In a 3D network, the fourth lattice dimension must be distributed locally within the same Cell/B.E. across the SPEs (logically arranged as a 1^3 × 8 grid). L4 is then a global lattice extension and could be as large as L4 = 64. This yields a very asymmetric local lattice, with VCell = 2^3 × 64 and VSPE = 2^3 × 8. (When distributed over 4,096 nodes, this gives a global lattice size of 32^3 × 64.)

Data in off-chip memory (MM). When we store all data in the MM, there are no a-priori restrictions on VCell. However, we want to avoid redundant loads of Equation 1's operands from MM into LS when sweeping through the lattice. To accomplish this and also permit concurrent computation and data transfers (to/from MM or remote SPEs), we consider a multiple buffering scheme.

Multiple buffering schemes alternate several buffers to either process or load/store data. This permits concurrent computation and data transfer at the price of additional storage (here, in the LS). One way to implement such a scheme is to compute Equation 1's hopping term on a 3D slice of the local lattice and then move the slice along the fourth direction. Each SPE processes all sites along the fourth direction, and the SPEs are logically arranged as a 2^3 × 1 grid both to minimize internal SPE communications and to balance external ones. To have all of Equation 1's operands available in the LS, we must be able to keep in the LS the U- and ψ-fields associated with all sites of three 3D slices at the same time. This optimization requirement again constrains the local lattice size, now to VCell ≈ 800 × L4 sites.
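The multiple-buffering scheme just described can be sketched without any Cell-specific DMA calls. In the toy loop below (our own illustration; memcpy stands in for an asynchronous DMA get, and the slice sizes are arbitrary), one buffer holds the 3D slice being processed while the other is refilled with the next slice along the fourth direction.

```c
/* Double-buffered sweep along the fourth lattice direction.
 * Buffer parity alternates between "being processed" and "being refilled";
 * memcpy is a stand-in for an asynchronous DMA get, so this sketch shows
 * only the control flow, not the actual overlap achieved on an SPE. */
#include <string.h>
#include <stdio.h>

#define SLICE_SITES 64               /* sites in one 3D slice (illustrative)   */
#define N_SLICES    16               /* extent of the local fourth direction   */

static double lattice[N_SLICES][SLICE_SITES];   /* stand-in for main memory    */
static double buffer[2][SLICE_SITES];           /* stand-in for two LS buffers */

static double process_slice(const double *slice) {
    double acc = 0.0;
    for (int i = 0; i < SLICE_SITES; i++) acc += slice[i];   /* dummy "kernel" */
    return acc;
}

int main(void) {
    for (int t = 0; t < N_SLICES; t++)
        for (int i = 0; i < SLICE_SITES; i++) lattice[t][i] = t + 0.001 * i;

    double total = 0.0;
    memcpy(buffer[0], lattice[0], sizeof buffer[0]);          /* preload slice 0 */
    for (int t = 0; t < N_SLICES; t++) {
        int cur = t & 1, nxt = cur ^ 1;
        if (t + 1 < N_SLICES)                                 /* start next load */
            memcpy(buffer[nxt], lattice[t + 1], sizeof buffer[nxt]);
        total += process_slice(buffer[cur]);                  /* compute current */
    }
    printf("checksum = %g\n", total);
    return 0;
}
```

On an SPE, the two memcpy calls would become asynchronous DMA transfers issued ahead of time, which is what allows the data movement to overlap the computation.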

Table 1 shows the predicted microtask execution times for the two data layouts and reasonable local lattice size choices. In the LS case, the theoretical efficiency of roughly 27 percent is limited by the communication bandwidth (Texe ≈ Text). This is also the limiting factor for the MM case's smallest local lattice; for larger local lattices, the memory bandwidth is the limiting factor (Texe ≈ Tmem).

Table 1. Theoretical time estimates Ti (in units of 1,000 clock cycles) for some microtasks needed to compute Equation 1.*

                Data in on-chip local store        Data in off-chip main memory
VCell           2^3 × 64            L1 × L2 × L3   8^3      4^3      2^3
Aint            16                  Aint / L4      48       12       3
Aext            192                 Aext / L4      48       12       3
Tpeak           21                  Tpeak / L4     21       2.6      0.33
TFP             27                  TFP / L4       27       3.4      0.42
TRF             12                  TRF / L4       12       1.5      0.19
Tmem            –                   Tmem / L4      61       7.7      0.96
Tint            2                   Tint / L4      5        1.2      0.29
Text            79                  Text / L4      20       4.9      1.23
TEIB            20                  TEIB / L4      40       6.1      1.06
εFP             27%                 εFP            34%      34%      27%

*In the print version, boldface marks the performance bottlenecks: Text in the local-store case, and Text (smallest lattice) or Tmem (larger lattices) in the main-memory case.

We have applied our performance model to various linear algebra kernels and verified it by hardware benchmarks. It's not strictly necessary to benchmark the full implementation of a representative QCD kernel such as Equation 1 because, in all relevant cases, TFP is far from being the limiting factor. Instead, we've performed hardware benchmarks with the same memory access pattern as in Equation 1, using the above-mentioned multiple buffering scheme for the MM case. We found that execution times were at most 20 percent higher than the theoretical predictions for Tmem. (The freely available full-system simulator is not useful in this respect because it doesn't model memory transfers accurately.) Other researchers have reported on benchmarks that use a lattice layout similar to the LS case but that keep only the gauge fields in LS, while spinor fields are accessed in MM.10 They found efficiencies above εFP = 20 percent.

Local Store DMA Transfers
Because DMA transfer speeds determine Tmem, Tint, and Text, it's crucial that we optimize them to exploit the Cell/B.E. performance. Our analysis of detailed micro-benchmarks for LS-to-LS transfers shows that Equation 2's linear model doesn't accurately describe execution time for DMA operations with arbitrary size I and arbitrary address alignment. We therefore refined our model to account for data transfer fragmentation and the buffers' source and destination addresses (As and Ad, respectively):

T_{\mathrm{DMA}}(I, A_s, A_d) = \lambda^0 + \lambda^a \cdot N_a(I, A_s, A_d) + N_b(I, A_s) \cdot \frac{128\ \mathrm{bytes}}{\beta}.   (4)

Our hardware benchmarks, fitted to Equation 4, indicate that each LS-to-LS DMA transfer has a (zero-size transfer) latency of λ0 ≈ 200 cycles. The DMA controllers fragment all transfers into Nb 128-byte blocks aligned at LS lines (and corresponding to single EIB transactions). When δA = As − Ad is a multiple of 128 bytes, the source LS lines can be directly mapped onto the destination lines. Then, we have Na = 0, and the effective bandwidth βeff = I/(TDMA − λ0) is the approximate peak value. Otherwise, if the alignments don't match (δA isn't a multiple of 128), we encounter an additional latency of λa ≈ 16 cycles for each transferred 128-byte block, reducing βeff by roughly a factor of two.

Figure 2 shows how clearly these effects are observed in our benchmarks and how accurately Equation 4 describes them.

Figure 2. LS-to-LS DMA transfers on an IBM QS20 system. Execution time T (in cycles) is measured as a function of the transfer size I (in bytes) with (a) aligned (As = Ad = 0 mod 128) and (b) misaligned (As = 32, Ad = 16 mod 128) source and destination addresses. The dashed and solid lines correspond to the theoretical predictions of Equations 2 and 4, respectively; the points are QS20 benchmarks.
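Equation 4 is straightforward to evaluate for a given transfer. The helper below is our own rendering of the refined model: Nb counts the 128-byte blocks the DMA controller touches on the source side, and the per-block penalty Na is applied only when source and destination alignments differ modulo 128, as described above. The latencies are the fitted values quoted in the text; the peak bandwidth β is a placeholder, not a measured figure.

```c
/* Refined DMA cost model of Equation 4:
 *   T_DMA(I, As, Ad) = lambda0 + lambda_a * Na + Nb * 128 / beta
 * Nb counts the 128-byte LS lines touched by the source buffer; the
 * per-block penalty Na applies only when source and destination
 * alignments differ modulo 128.  lambda0 and lambda_a are the fitted
 * values quoted in the text; beta is a placeholder peak bandwidth. */
#include <stdio.h>

static double t_dma(double bytes, unsigned as, unsigned ad) {
    const double lambda0 = 200.0;   /* zero-size transfer latency (cycles)      */
    const double lambda_a = 16.0;   /* extra latency per misaligned block       */
    const double beta = 8.0;        /* bytes per cycle, illustrative value only */

    unsigned long nb = ((as % 128) + (unsigned long)bytes + 127) / 128;
    unsigned long na = ((as - ad) % 128 == 0) ? 0 : nb;
    return lambda0 + lambda_a * (double)na + (double)nb * 128.0 / beta;
}

int main(void) {
    /* Aligned vs. misaligned 2-Kbyte transfer, as in Figure 2 */
    printf("aligned:    %6.0f cycles\n", t_dma(2048.0, 0, 0));
    printf("misaligned: %6.0f cycles\n", t_dma(2048.0, 32, 16));
    return 0;
}
```

With these inputs the misaligned transfer costs roughly twice as many cycles beyond the fixed latency as the aligned one, which is the factor-of-two bandwidth loss described above.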

Discussion
Our performance model and hardware benchmarks identified the PowerXCell 8i processor as a promising option for LQCD. We expect that a sustained performance above 20 percent is possible on large machines. Parallel systems with O(2000) PowerXCell 8i processors add up to approximately 200 Tflops (DP peak), which corresponds to roughly 50 Tflops sustained for typical LQCD applications. As we discussed earlier, a simple nearest-neighbor d-dimensional interconnection among these processors is all we need to support our algorithms' data exchange patterns. This simple structure allows for a fast and cost-effective design and construction of our forthcoming QCD-oriented number cruncher.

The QPACE Project
QPACE is a collaborative development effort among several academic institutions and the IBM development lab in Böblingen, Germany, to design, implement, and deploy a next generation of massively parallel and scalable computer architectures optimized for LQCD. Our project's primary goal is to make a vast amount of computing power available for LQCD research. On the technical side, our goals are to

• use commodity processors, tightly interconnected by a custom network;
• leverage the potential of FPGAs for network implementation; and
• aim for an unprecedentedly small ratio of power consumption versus floating-point performance.

In the following, we describe the key elements of the QPACE architecture.

Node Card
The main building blocks of QPACE are the node cards. These processing nodes, which run independently of each other, include two main components:

• a PowerXCell 8i processor, which provides the computing power; and
• a network processor (NWP), which implements a dedicated interface to connect the processor to a 3D high-speed torus network used for communications between the nodes and to an Ethernet network for I/O.

We keep additional logic (to boot and control the machine) to the bare minimum. Furthermore, the node card contains 4 Gbytes of private memory, which is sufficient for all the data structures (and auxiliary variables) of present-day local lattice sizes.

We implemented the NWP using an FPGA (Xilinx Virtex-5 LX110T), which lets us develop and test logic reasonably fast and keep development costs low. However, the devices themselves tend to be expensive (in our case, Xilinx's support of QPACE makes this issue less critical). The NWP's main task is to route data between the Cell/B.E. processor, the torus network links, and the Ethernet I/O interface. A torus network link's bandwidth is approximately 1 Gbyte/s each for transmit and receive. In balance with the overall bandwidth of the six torus links attached to each NWP, the interface between the NWP and the Cell/B.E. processor has a bandwidth of 6 Gbytes/s. (Because existing southbridges don't provide this bandwidth to the Cell/B.E., a commodity network solution is ruled out.)
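The NWP connects each node to its six neighbors on the 3D torus, and routing on such a torus reduces to modular arithmetic on node coordinates. The sketch below is illustrative only: the 8 × 16 × 8 grid and the lexicographic rank numbering are our own choices, not the QPACE addressing scheme. It shows how a node would determine the six nearest neighbors it exchanges data with.

```c
/* Nearest neighbors of a node in a 3D torus with periodic wrap-around.
 * The 8 x 16 x 8 grid and the lexicographic node numbering are
 * illustrative choices, not the numbering used by the QPACE hardware. */
#include <stdio.h>

#define NX 8
#define NY 16
#define NZ 8

static int rank_of(int x, int y, int z) {       /* coordinates -> node rank */
    return (x * NY + y) * NZ + z;
}

static void neighbors(int rank, int out[6]) {
    int z = rank % NZ, y = (rank / NZ) % NY, x = rank / (NZ * NY);
    out[0] = rank_of((x + 1) % NX, y, z);       /* +x */
    out[1] = rank_of((x + NX - 1) % NX, y, z);  /* -x */
    out[2] = rank_of(x, (y + 1) % NY, z);       /* +y */
    out[3] = rank_of(x, (y + NY - 1) % NY, z);  /* -y */
    out[4] = rank_of(x, y, (z + 1) % NZ);       /* +z */
    out[5] = rank_of(x, y, (z + NZ - 1) % NZ);  /* -z */
}

int main(void) {
    int n[6];
    neighbors(0, n);
    printf("neighbors of node 0: %d %d %d %d %d %d\n",
           n[0], n[1], n[2], n[3], n[4], n[5]);
    return 0;
}
```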

Communication Network
Unlike other Cell/B.E.-based parallel machines, in QPACE, node-to-node communications proceed directly from the LS of a processor's SPE to the LS of a nearest-neighbor processor's SPE. For communication, we don't move the data through the MM, and thus we reduce the performance-critical data traffic through the MM interface. Instead, we route them, via the EIB, from an LS directly to the Cell/B.E. processor's I/O interface. The PowerPC processor element isn't needed to control such communications. We expect the LS-to-LS copy operations latency to be on the order of 1 μs.

To start a communication, the sending SPE must initiate a DMA data transfer from its LS to a buffer attached to any link module. Once the data arrives in the buffer, the NWP will move the data across the network without the processor's intervention. On the other end, the receiving device must post a receive request to trigger the DMA data transfer from the NWP's receive buffer to the final destination.

For the torus network links, we implement a lightweight protocol, which handles automatic data resend in case of CRC errors. The physical layer uses well-tested and cost-efficient commercial hardware components, which let us move the most timing-critical logic out of the FPGA. Specifically, we use the 10 Gbits/s XAUI transceiver PMC Sierra PM8358, which provides redundant link interfaces that we can use to select among several torus network topologies (as we describe in more detail later).

Implementing the link to the Cell/B.E. processor's I/O interface (a Rambus FlexIO bus) is much more challenging. At this point, we've connected an NWP prototype and a Cell/B.E. processor at a speed of 3 Gbytes/s per link and direction by using special features available in the Xilinx Virtex-5 FPGAs' RocketIO transceivers. (So far, we've tested only a single 8-bit link; we'll use two links in the final design.)

Overall Machine Structure
We attach the node cards to a backplane through which we route all network signals. One backplane hosts 32 node cards and two root cards, each of which controls 16 node cards via an Ethernet-accessible microprocessor. One QPACE cabinet can accommodate eight backplanes, or 256 node cards. Each cabinet therefore has a peak double-precision performance of roughly 25 Tflops.

On the backplane, subsets of nodes are interconnected in a one-dimensional ring topology. By selecting the primary or redundant XAUI links, we can choose a ring size of 2, 4, or 8 nodes. Likewise, we can configure the number of nodes in the second dimension, in which the nodes are connected by a combination of backplane connections and cables. In the third dimension, we use cables. It's also possible to operate a large QPACE system of N cabinets as a single partition with 2·N × 16 × 8 nodes.

I/O Network
We implement input and output operations using a Gigabit Ethernet tree network. Each node card is a tree endpoint connected to one of six cabinet-level switches, each of which has a variable number of Gigabit Ethernet uplinks, depending on bandwidth requirements. When we deploy QPACE, we expect typical lattice sizes of 48^3 × 96, which require a gauge field configuration of roughly 6 Gbytes. The available I/O bandwidth should thus allow us to read or write the database in O(10) seconds.

Power and Cooling
We expect a single node card's power consumption to be less than 150 watts. A single QPACE cabinet will therefore consume roughly 35 kilowatts, which translates into a power efficiency of approximately 1.5 watts/Gflops. We're developing a liquid cooling system to reach the planned packaging density.
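As a rough cross-check of the figures quoted in the I/O Network and Power and Cooling sections above, the short program below redoes the arithmetic: the gauge-field configuration size for a 48^3 × 96 lattice (four links per site, 18 double-precision words per link) and the per-cabinet peak performance and power efficiency, taking the quoted 150-watt bound per node card. The article's rounded numbers come out slightly differently because the actual per-node power is somewhat below that bound.

```c
/* Back-of-the-envelope check of the QPACE sizing figures quoted above. */
#include <stdio.h>

int main(void) {
    /* Gauge-field configuration for a 48^3 x 96 lattice:
     * 4 links per site, 18 double-precision words (8 bytes) per link. */
    const double sites = 48.0 * 48.0 * 48.0 * 96.0;
    const double gauge_bytes = sites * 4.0 * 18.0 * 8.0;
    printf("gauge configuration: %.1f Gbytes\n", gauge_bytes / 1e9);

    /* One cabinet: 256 node cards, each with one PowerXCell 8i
     * (102.4 Gflops double-precision peak) drawing at most ~150 W. */
    const double nodes = 256.0;
    const double peak_gflops = nodes * 102.4;
    const double watts = nodes * 150.0;            /* upper bound per the text */
    printf("cabinet peak: %.1f Tflops, <= %.1f kW, <= %.2f W/Gflops\n",
           peak_gflops / 1000.0, watts / 1000.0, watts / peak_gflops);
    return 0;
}
```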
QPACE Software
To operate the QPACE nodes, we'll use Linux running on the PowerPC processor element. As on most other processor platforms, we can't start the operating system directly after system start; instead, we first use a host firmware to initialize the hardware. The QPACE firmware will be based on Slimline Open Firmware (www-128.ibm.com/developerworks/power/pa-slof); system startup is controlled by the root card microprocessor.

Efficiently implementing applications on the Cell/B.E. processor is more difficult compared to standard processors. To optimize large application codes on a Cell/B.E. processor, programmers face several challenges; they must, for example, carefully choose the data layout to maximize memory-interface utilization. Optimizing on-chip memory use to minimize external memory accesses is mandatory. The program's overall performance also depends on how much code can be parallelized on-chip.

We apply two strategies to relieve the programmers' porting burdens. First, in typical LQCD applications, almost all cycles are spent in a few kernel routines (such as Equation 1). We'll therefore provide highly optimized assembly implementations for such kernels, possibly also making use of an assembly generator. Second, we'll leverage the work of the USQCD (US Quantum Chromodynamics) collaboration (www.usqcd.org/software.html), which has pioneered efforts to define and implement software layers to hide hardware details. Such a framework will let programmers build LQCD applications in a portable way on top of these software layers.

With QPACE, our goal is to implement the QCD message-passing (QMP) API, as well as (at least parts of) the QCD data-parallel (QDP) API, which includes operations on distributed data objects. QMP comprises all communication operations required for LQCD applications. It relies on the fact that these applications typically have regular and repetitive communication patterns, with data being sent between adjacent nodes in a torus grid (and therefore we don't need an API as general as MPI). We'll implement QMP on top of a few low-level communication primitives that might, for example, trigger data transmission via one particular link, initiate the receive operation, and allow the program to wait for completion of communications.
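The low-level primitives mentioned above might look like the following sketch. The names, signatures, and the trivial loopback implementation are hypothetical, invented here only to illustrate the send/receive/wait triad that QMP would be layered on; they are not the actual QPACE or QMP interface.

```c
/* Hypothetical sketch of torus-link communication primitives of the kind
 * QMP could be layered on.  Names, signatures, and the trivial loopback
 * "implementation" are illustrative only, not the QPACE interface. */
#include <stdio.h>
#include <string.h>

enum { LINK_XP, LINK_XM, LINK_YP, LINK_YM, LINK_ZP, LINK_ZM, NUM_LINKS };

typedef struct { int link; size_t len; int done; } comm_handle;

static char loopback[NUM_LINKS][256];   /* stand-in for the NWP link buffers */

/* Start sending len bytes over one torus link. */
static comm_handle link_send(int link, const void *buf, size_t len) {
    memcpy(loopback[link], buf, len);          /* real code would start a DMA */
    comm_handle h = { link, len, 1 };
    return h;
}

/* Post a receive for len bytes arriving on one torus link. */
static comm_handle link_recv(int link, void *buf, size_t len) {
    memcpy(buf, loopback[link], len);          /* real code would arm the NWP */
    comm_handle h = { link, len, 1 };
    return h;
}

/* Block until the transfer described by h has completed. */
static void link_wait(comm_handle *h) {
    while (!h->done) { /* real code would poll a completion flag */ }
}

int main(void) {
    const char msg[] = "boundary slice";
    char in[sizeof msg];
    comm_handle s = link_send(LINK_XP, msg, sizeof msg);
    comm_handle r = link_recv(LINK_XP, in, sizeof msg);
    link_wait(&s);
    link_wait(&r);
    printf("received: %s\n", in);
    return 0;
}
```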

QPACE is an innovative approach to use a commodity multicore processor together with a custom network for building a scalable massively parallel machine for LQCD simulations. Directly connecting a custom network to a high-end commercial processor's I/O interface is a significant technological challenge. For QPACE, we were able to achieve this using an FPGA, but it might be highly nontrivial for other (multicore) processors. It also remains to be seen whether FPGAs will be able to cope with increasing bandwidth requirements in future developments.

The QPACE project's ambitious goal is to complete hardware development by the end of 2008 and to begin manufacturing and deploying larger systems beginning in 2009. We expect the machines to be fully available for lattice QCD research by mid-2009.

Acknowledgments
QPACE is funded by the Deutsche Forschungsgemeinschaft (DFG) through the SFB/TR-55 framework and by IBM. We gratefully acknowledge important contributions to QPACE by (Italy) and Knürr (Germany).

References
1. K.G. Wilson, "Confinement of Quarks," Physical Rev. D, vol. 10, no. 8, 1974, pp. 2445–2459.
2. P.A. Boyle et al., "Overview of the QCDSP and QCDOC Computers," IBM J. Research & Development, vol. 49, nos. 2-3, 2005, pp. 351–366.
3. F. Belletti et al., "Computing for LQCD: apeNEXT," Computing in Science & Eng., vol. 8, no. 1, 2006, pp. 18–29.
4. A. Gara et al., "Overview of the Blue Gene/L System Architecture," IBM J. Research & Development, vol. 49, nos. 2-3, 2005, pp. 195–212.
5. H.P. Hofstee et al., "Cell Broadband Engine Technology and Systems," IBM J. Research & Development, vol. 51, no. 5, 2007, pp. 501–502.
6. S. Williams et al., "The Potential of the Cell Processor for Scientific Computing," Proc. 3rd Conf. Computing Frontiers, ACM Press, 2006, pp. 9–20.
7. A. Nakamura, "Development of QCD-Code on a Cell Machine," Proc. Int'l Symp. Lattice Field Theory (LAT 07), Proc. of Science, 2007; http://pos.sissa.it//archive/conferences/042/040/LATTICE%202007_040.pdf.
8. F. Belletti et al., "QCD on the Cell Broadband Engine," Proc. Int'l Symp. Lattice Field Theory (LAT 07), Proc. of Science, 2007; http://pos.sissa.it//archive/conferences/042/039/LATTICE%202007_039.pdf.
9. G. Bilardi et al., "The Potential of On-Chip Multiprocessing for QCD Machines," Proc. Int'l Conf. High-Performance Computing, LNCS 3769, Springer Verlag, 2005, pp. 386–397.
10. J. Spray, J. Hill, and A. Trew, "Performance of a Lattice Quantum Chromodynamics Kernel on the Cell Processor," 2008; http://arxiv.org/abs/0804.3654.

Gottfried Goldrian is a Distinguished Engineer in the IBM Lab in Böblingen, Germany. Contact him at [email protected].

Thomas Huth is a developer at the IBM Lab in Böblingen. Contact him at [email protected].

Benjamin Krill is a research engineer at the IBM Lab in Böblingen. Contact him at [email protected].

Jack Lauritsen is a developer at the IBM Lab in Böblingen. Contact him at [email protected].

Heiko Schick is a developer at the IBM Lab in Böblingen. Contact him at [email protected].

Ibrahim Ouda is a senior engineer at the IBM Lab in Rochester, Minnesota. Contact him at [email protected].

Simon Heybrock is an undergraduate student of physics at the University of Regensburg, Germany. Contact him at [email protected].

Dieter Hierl is a research associate at the University of Regensburg, Germany. Contact him at [email protected].

Thilo Maurer is a PhD student in physics at the University of Regensburg, Germany. Contact him at [email protected].

Nils Meyer is a PhD student in physics at the University of Regensburg, Germany. Contact him at [email protected].

Andreas Schäfer is a professor of physics at the University of Regensburg, Germany. Contact him at [email protected].

Stefan Solbrig is a research associate at the University of Regensburg, Germany. Contact him at [email protected].

Thomas Streuer is a postdoctoral researcher at the University of Regensburg, Germany. Contact him at [email protected].

Tilo Wettig is a professor of physics at the University of Regensburg, Germany. Contact him at [email protected].

Dirk Pleiter is a research scientist at Deutsches Elektronen Synchrotron (DESY), Zeuthen, Germany. Contact him at [email protected].

Karl-Heinz Sulanke is an electronics engineer at Deutsches Elektronen Synchrotron (DESY), Zeuthen, Germany. Contact him at [email protected].

Frank Winter is a PhD student in physics at Deutsches Elektronen Synchrotron (DESY), Zeuthen, Germany. Contact him at [email protected].

Hubert Simma is a research scientist at Deutsches Elektronen Synchrotron (DESY), Zeuthen, Germany, and a guest lecturer at the University of Milan. Contact him at [email protected].

Sebastiano Fabio Schifano is a research associate in computer science at the University of Ferrara, Italy. Contact him at [email protected].

Raffaele Tripiccione is a professor of physics at the University of Ferrara, Italy. Contact him at [email protected].

Andrea Nobile is a PhD student in physics at the European Center for Theoretical Studies, Trento, and the University of Milano-Bicocca, Italy. Contact him at [email protected].

Matthias Drochner is a research scientist at the Research Center Jülich, Germany. Contact him at [email protected].

Thomas Lippert is the director of the Institute for Advanced Simulation at the Research Centre Jülich, head of the Jülich Supercomputing Centre, and a professor of physics at the University of Wuppertal, Germany. Contact him at [email protected].

Zoltan Fodor is a professor of physics at the University of Wuppertal, Germany. Contact him at [email protected].
