QPACE: Energy-Efficient High Performance Computing
QPACE: Energy-Efficient High Performance Computing

Sebastian Rinke, Research Centre Jülich, Germany
Hans Böttiger, IBM Germany Research and Development
Benjamin Krill, IBM Germany Research and Development

Abstract

QPACE is a special purpose supercomputer for Lattice Quantum Chromodynamics (LQCD) calculations designed with a focus on low power consumption. To open QPACE to a broader range of High Performance Computing (HPC) applications, the High Performance Linpack benchmark (HPL), whose calculations are common to some class of HPC applications, is chosen for porting to the QPACE system. In this respect, this paper discusses features and limitations of the QPACE architecture. It starts by analysing the requirements of HPL and, based on that, presents different approaches to meet these requirements. The paper describes the different stages of porting HPL to QPACE, ranging from the low-level communication interface to MPI collective operations. Finally, HPL benchmark results show to what extent HPC applications other than LQCD could benefit from QPACE.

1 Introduction

During the last decade, energy efficiency has gained increasing interest in High Performance Computing (HPC). Thus, besides computational performance, power efficiency is an additional factor by which parallel systems are evaluated. QPACE (QCD Parallel Computing on Cell Broadband Engine) is a parallel computer which was designed for Lattice Quantum Chromodynamics (LQCD) calculations with special attention to low power consumption. Despite the fact that QPACE is a special purpose supercomputer, its system architecture and performance suggest using it for a broader range of parallel high performance applications. In this respect, this paper discusses features and limitations of QPACE as an energy-efficient general purpose supercomputer. After an overview of the architecture, the high performance torus network and its corresponding API (Application Programming Interface) are discussed in Section 2. Then, in Section 3, the requirements of the High Performance Linpack (HPL) benchmark are identified, as they are common to some class of HPC applications. In Section 4, these requirements are related to the features of QPACE, and approaches for dealing with certain limitations are investigated. In this respect, the implementation of HPL on QPACE is described. Finally, benchmark results of HPL on QPACE are presented. Section 5 concludes.

2 QPACE Architecture

QPACE [4, 5, 6] is a distributed-memory parallel computer based on the IBM PowerXCell 8i [12] processor. This hybrid processor contains one PowerPC Processor Element (PPE) and 8 Synergistic Processor Elements (SPE). Each SPE has 256 KB of on-chip memory, called local store (LS), and runs a single thread. Memory transfers between main memory and local store are possible via DMA operations. The memory controller and an I/O interface (Rambus FlexIO) are integrated into the processor. Regarding double precision (DP) floating point performance, one SPE can execute up to 4 operations per cycle. Thus, clocked at 3.2 GHz, the 8 SPEs on a PowerXCell 8i processor yield a DP peak performance of 102.4 GFlops. Each compute node, called a node card, comprises one PowerXCell 8i CPU, 4 GiB of DDR2 main memory, and a network processor (NWP) which is implemented on a Xilinx V5-LX110T FPGA (Field Programmable Gate Array). The PowerXCell 8i is directly connected to the NWP via its I/O interface. The network processor provides access to 3 different interconnects: Gigabit Ethernet, a signal tree network, and a 3D torus network. The power consumption of a single node card is about 130 W. One QPACE rack houses 256 node cards (35 kW) and yields 26 TFlops DP peak performance. This leads to a power efficiency of about 1.5 W/GFlops (peak).
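For reference, the peak-performance figures quoted above follow directly from the stated clock rate and per-SPE throughput; the short calculation below only restates the numbers given in the text:

\[
8\ \text{SPEs} \times 4\ \tfrac{\text{flops}}{\text{cycle}} \times 3.2\ \text{GHz} = 102.4\ \text{GFlops per node card},
\qquad
256 \times 102.4\ \text{GFlops} \approx 26\ \text{TFlops per rack}.
\]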
2.1 Torus Network

The 3D torus network is a high-speed point-to-point interconnect with 1 GB/s bandwidth (peak) per torus direction. The aggregate bandwidth of the 6 links of a node card is 6 × 1 GB/s (peak). The torus provides direct SPE-to-SPE communication between neighbouring node cards. That means, data from one SPE's local store can be transferred to the local store of another SPE on one of the 6 neighbours in the torus. However, a routing mechanism for communication between non-neighbours is not implemented. Since each node card contains 8 SPEs and there is only one torus link between 2 neighbouring node cards, each link is logically divided into 8 virtual channels. This provides 8 separate reliable communication channels per link.

2.1.1 Torus Low-Level Communication

For a better understanding of the remaining part, this section briefly discusses the torus communication API and presents a ping-pong message transfer with the resulting benchmark results. As mentioned above, the torus network allows message transfers between SPEs of neighbouring node cards. That means, both the sender and the receiver are threads running on different SPEs. Basically, a thread on an SPE can access the torus network by means of the following operations:

tnw_put(link, channel, sbuf, offset, size, tag)
    Initiates a non-blocking message transfer on the sender. That means, tnw_put() posts the send request to the network and returns immediately. The tag parameter identifies the request and is later used for testing it for completion. link and channel denote the torus link (0, ..., 5) and the virtual channel (0, ..., 7), respectively. The address of the send buffer of size bytes is provided in sbuf. The offset is also transmitted to the receiver and defines an offset in the receive buffer.

tnw_test_put(tag)
    Is called by the sender in order to test a send request, identified by tag, for completion. It returns true if the send buffer can be modified.

tnw_notify_base(link, channel, nbuf)
    Since receive operations are non-blocking as well, i.e. they return immediately, the receiver needs a means by which it can test for received messages. In this respect, nbuf provides a pointer to a so-called notification buffer. This buffer holds entries where each entry identifies a previously posted receive operation. Before a receive can be posted for link and channel, a notification buffer nbuf must be provided by a call to tnw_notify_base(link, channel, nbuf). See below for details about the use of the notification buffer.

tnw_credit_base(link, channel)
    This call declares the whole local store of the SPE as receive buffer for messages over link and channel. However, the actual receive operation is called tnw_credit() and is described in the following.

tnw_credit(link, channel, rbuf, size, nindex)
    As stated above, the receive operation is non-blocking. In particular, tnw_credit() posts a receive of size bytes for link and channel, where rbuf is the destination address in the local store. The input parameter nindex is the index of an entry in the notification buffer (see tnw_notify_base() above) which refers to this receive operation.

tnw_wait_recv(nbuf, nindex)
    This call blocks until the receive operation identified by the notification buffer nbuf and the index nindex (in the notification buffer) is complete.

Note that a receive operation (tnw_credit()) must be posted before the corresponding send operation (tnw_put()). That means, a message transfer is a credit-based mechanism: (i) The receiver issues a credit for the message which it is to receive. (ii) The sender sends the message. (iii) The incoming message fills the credit. Listings 1 and 2 show the steps for a ping-pong message transfer between two nodes A and B.

    /* Initialization */
    tnw_credit_base(link_to_B, channel);
    tnw_notify_base(link_to_B, channel, nbuf);

    /* Give credit for response from B */
    tnw_credit(link_to_B, channel, buf, size, nindex);

    /* Send to B */
    tnw_put(link_to_B, channel, buf, 0, size, tag);

    /* Wait for response from B */
    tnw_wait_recv(nbuf, nindex);

Listing 1: Ping-pong message transfer: Node A

    /* Initialization */
    tnw_credit_base(link_to_A, channel);
    tnw_notify_base(link_to_A, channel, nbuf);

    /* Give credit for incoming message from A */
    tnw_credit(link_to_A, channel, buf, size, nindex);

    /* Wait for incoming message from A */
    tnw_wait_recv(nbuf, nindex);

    /* Send back to A */
    tnw_put(link_to_A, channel, buf, 0, size, tag);

Listing 2: Ping-pong message transfer: Node B

Figure 1(a) depicts the corresponding bandwidth benchmark results between two node cards. The timings between sending and receiving a message are shown in Fig. 1(b). Here, a 128 B message has a latency of 3 µsec. Note that the size of a message transmitted via tnw_put() must be a multiple of 128 B. Furthermore, the send and the receive buffer must be aligned to 128 B boundaries. The effect of these restrictions is further discussed in Section 4.

Figure 1: Message transfer from local store to local store over torus network: (a) Bandwidth [MiB/s] and (b) Latency [usec] versus message size [Byte]
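As a usage illustration, the ping-pong pattern of Listings 1 and 2 and the 128 B restrictions can be combined into a small blocking-exchange helper. The sketch below is not part of the QPACE software stack: only the tnw_*() operations are those described above, while the helper names, parameter types, buffer sizes, and the busy-wait on tnw_test_put() are assumptions.

    /* Sketch of a blocking neighbour exchange built on the credit-based
     * primitives above. Assumes tnw_credit_base()/tnw_notify_base() have
     * already been called for this link/channel during initialization and
     * that the remote side has posted its credit before our tnw_put()
     * arrives, as required by the credit mechanism. */

    #define TNW_ALIGN 128u   /* sizes and buffers must be multiples of 128 B */

    /* Round a payload size up to the next multiple of 128 B. */
    static unsigned tnw_padded_size(unsigned payload)
    {
        return (payload + TNW_ALIGN - 1u) & ~(TNW_ALIGN - 1u);
    }

    /* Local-store buffers aligned to 128 B boundaries (sizes are arbitrary). */
    static char sbuf[4096] __attribute__((aligned(128)));
    static char rbuf[4096] __attribute__((aligned(128)));

    static void neighbour_exchange(int link, int channel, unsigned payload,
                                   void *nbuf, unsigned nindex, unsigned tag)
    {
        unsigned size = tnw_padded_size(payload);

        tnw_credit(link, channel, rbuf, size, nindex); /* post receive (credit) */
        tnw_put(link, channel, sbuf, 0, size, tag);    /* non-blocking send     */

        tnw_wait_recv(nbuf, nindex);                   /* incoming message done */
        while (!tnw_test_put(tag))                     /* send buffer reusable  */
            ;
    }

Padding the payload up to a multiple of 128 B is one simple way to satisfy the size restriction; the receiver then has to learn the original payload length by other means, for example from a fixed-size header.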
3 HPL Requirements

The Linpack benchmarks [1] represent a collection of codes for numerical algebra which have become an industry standard for measuring the performance of computer systems. HPL [1, 2] is a portable and scalable version of Linpack for distributed-memory computers that solves a random dense linear system of equations by Gaussian elimination with partial pivoting.

The corresponding matrix (N × N) is distributed onto a two-dimensional grid of (P × Q) processes in a block-cyclic fashion to ensure good load balancing. With the right-looking algorithm [1], the LU factorization proceeds with the decomposition of one panel at a time, as shown in Fig.
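To make the block-cyclic layout concrete, the following sketch (not taken from the HPL sources; the identifiers and zero-based indexing are assumptions) computes which process of the P × Q grid owns global matrix entry (i, j) for a block size NB:

    /* Sketch: owner of global entry (i, j) of the N x N matrix under a
     * block-cyclic distribution over a P x Q process grid with blocks of
     * NB x NB elements. Zero-based indices; identifiers are illustrative. */
    typedef struct { int p; int q; } grid_coord;

    static grid_coord block_cyclic_owner(int i, int j, int NB, int P, int Q)
    {
        grid_coord owner;
        owner.p = (i / NB) % P;   /* process row owning global row i       */
        owner.q = (j / NB) % Q;   /* process column owning global column j */
        return owner;
    }

Because consecutive NB-wide blocks cycle over the process rows and columns, the work on the shrinking trailing submatrix stays spread over all processes as the factorization proceeds, which is the load-balancing property mentioned above.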