2. The IBM Blue Gene/P Supercomputer
Introduction to HPC Programming
2. The IBM Blue Gene/P Supercomputer
Valentin Pavlov <[email protected]>

About these lectures
• This is the second in a series of six introductory lectures discussing the field of High-Performance Computing;
• The intended audience of the lectures are high-school students with some programming experience (preferably in the C programming language) and an interest in scientific fields such as physics, chemistry or biology;
• This lecture provides an overview of the IBM Blue Gene/P supercomputer's architecture, along with some practical advice about its usage.

What does "super-" mean?
• When talking about computers, the prefix "super-" does not have the same meaning as when talking about people (e.g. Superman);
• The analogy is closer to that of a supermarket – a market that sells a lot of different articles;
• Thus, a supercomputer should not be thought of a priori as a single very powerful computer, but simply as a collection of many ordinary computers.

What does "super-" mean?
• Anyone with a few thousand euros to spare can build an in-house supercomputer out of a lot of cheap components (think Raspberry Pi) which would in principle be not much different from high-end supercomputers, only slower.
• Most of the information in these lectures is applicable to such ad-hoc supercomputers, clusters, etc.
• In this lecture we'll have a look at the architecture of a real supercomputer, the IBM Blue Gene/P, and also discuss the differences from the newer version of this architecture, the IBM Blue Gene/Q.

IBM Blue Gene/P
• IBM Blue Gene/P is a modular hybrid parallel system.
• Its basic module is called a "rack", and a configuration can have from 1 to 72 racks.
• In the full 72-rack configuration, the theoretical peak performance of the system is around 1 PFLOPS;
• Detailed information about system administration and application programming of this machine is available online from the IBM RedBooks publication series, e.g. http://www.redbooks.ibm.com/abstracts/sg247287.html

The IBM Blue Gene/P @ NCSA, Sofia
• The Bulgarian Supercomputing Center in Sofia operates and provides access to an IBM Blue Gene/P configuration that consists of 2,048 Compute Nodes, with a total of 8,192 PowerPC cores @ 850 MHz and 4 TB of RAM;
• The Compute Nodes are connected to the rest of the system through 16 channels of 10 Gb/s each;
• Its theoretical peak performance is 27.85 TFLOPS;
• Its energy efficiency is 371.67 MFLOPS/W;
• When it was put into operation in 2008, it was ranked 126th in the world in the http://top500.org list.

Why a supercomputer?
• This supercomputer is not much different from a network of 2,000 ordinary computers (a cluster), or, say, 40 separate clusters of 50 machines each;
• So why bother with a supercomputer? Because it offers several distinctive advantages:
• Energy efficient – the maximum power draw of the system at full utilization is 75 kW; this might seem like a lot, but it is several times less than that of 2,000 ordinary computers.
• Small footprint – it fits in a small room, while 2,000 PCs would probably occupy a football stadium, and 40 clusters of 50 machines would occupy 40 different rooms.

Why a supercomputer?
• Transparent high-speed and highly available network – the mass of cables and devices needed to interconnect 2,000 PCs would be a nightmarish mess;
• Standard programming interfaces (MPI and OpenMP) – the same interfaces are used on clusters, so software written for a cluster will, at least in principle, also work on the supercomputer (see the short example after this slide);
• High scalability to thousands of cores – in the 40-cluster scenario each cluster is small and cannot run extra-large jobs.
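To illustrate the point about standard programming interfaces, here is a minimal sketch of a C program that uses both MPI (between processes on different nodes) and OpenMP (between threads within a node). The file name is made up for the example; the MPI and OpenMP calls themselves are the standard ones, and the same source would compile and run on an ordinary cluster as well.

    /* hello_hybrid.c - minimal hybrid MPI + OpenMP sketch (illustrative) */
    #include <stdio.h>
    #include <mpi.h>
    #include <omp.h>

    int main(int argc, char *argv[])
    {
        int rank, size;

        MPI_Init(&argc, &argv);                 /* start the distributed-memory part */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I?               */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes in total?      */

        /* Within one node the cores share memory, so OpenMP threads can be used. */
        #pragma omp parallel
        {
            printf("Hello from thread %d of process %d (out of %d processes)\n",
                   omp_get_thread_num(), rank, size);
        }

        MPI_Finalize();
        return 0;
    }

On a typical cluster such a program would be built with an MPI compiler wrapper (e.g. mpicc) with OpenMP enabled; on the Blue Gene/P the corresponding cross-compiling wrappers are used instead, as discussed later in this lecture.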
Why a supercomputer?
• High availability at a lower price – built as an integrated unit from the start, it breaks down a lot less often than 2,000 ordinary computers would. Moreover, it can be operated by a small team of staff, as opposed to 40 different teams in the many-clusters scenario.
• Better utilization, compared to the 40-cluster scenario. The centralized management allows different teams of researchers to use the processing power in a shared-resource manner, which would be very hard to do if the clusters were owned by different groups.

IBM Blue Gene/P Hardware Organization
Figure: IBM Blue Gene/P – from the CPU to the full system (Source: IBM)

Compute Nodes (CN)
• The processing power of the supercomputer stems from the multitude of Compute Nodes (CNs). There are 1,024 CNs in a rack, which totals 73,728 CNs in a full configuration.
• Each CN contains a quad-core PowerPC @ 850 MHz with a dual FPU (called "double hummer") and 2 GB RAM.
• Ideally, each core can perform 4 floating-point operations per cycle, thus performing at 850 × 4 = 3,400 MFLOPS. Multiplied by the number of cores, this brings the performance of a single CN to 4 × 3.4 = 13.6 GFLOPS.

Compute Nodes (CN)
• The theoretical peak performance of the whole system is thus 73,728 × 13.6 = 1,002,700.8 GFLOPS ≈ 1.0027 PFLOPS.
• Each CN has 4 cores and behaves as a shared memory machine with regard to the 2 GB of RAM on the node;
• The cores of one CN do not have access to the memory of another CN, so the collection of CNs behaves as a distributed memory machine;
• Thus, the machine has a hybrid organization – distributed memory between nodes and shared memory within a node.

Connectivity
• Each CN is directly connected to its immediate neighbours in all 3 directions;
• Communication between non-neighbouring nodes involves at least one node that, apart from computing, is also busy forwarding network traffic, which brings down its performance.
• The whole system thus looks like a 3D MESH, but in order to reduce the amount of forwarding it can also be configured as a 3D TORUS – a figure in which the ends of the mesh in each of the 3 directions are connected to each other.

Connectivity
• The advantage of the torus is that it halves the amount of forwarding necessary, since the longest distance is now half the number of nodes in each direction (see the short example after this slide);
• The connectivity with the rest of the system is achieved through special Input/Output Nodes (IONs);
• Each Node Card (32 CNs) has 1 ION through which the CNs access the shared disk storage and the rest of the components of the system via a 10 Gb/s network;
• There are other specialized networks as well, e.g. for collective communications, etc.
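To make the effect of the torus concrete, here is a small illustrative C sketch (not part of any Blue Gene library) that counts the hops a message needs along one dimension of size n. On a mesh the message must cross every node in between; on a torus it can also wrap around the ends, so the worst-case distance drops from n − 1 to n/2.

    /* hops.c - illustrative comparison of mesh vs. torus distances */
    #include <stdio.h>
    #include <stdlib.h>

    /* Hops between coordinates a and b along one dimension, no wrap-around. */
    int mesh_hops(int a, int b, int n)
    {
        (void)n;                     /* the size is irrelevant without wrap-around */
        return abs(a - b);
    }

    /* Same dimension, but the ends are connected (torus): the message can go
       either way around the ring, so take the shorter of the two paths. */
    int torus_hops(int a, int b, int n)
    {
        int d = abs(a - b);
        return d < n - d ? d : n - d;
    }

    int main(void)
    {
        int n = 8;                   /* nodes along one dimension (illustrative) */
        printf("0 -> %d: mesh = %d hops, torus = %d hops\n",
               n - 1, mesh_hops(0, n - 1, n), torus_hops(0, n - 1, n));
        printf("0 -> %d: mesh = %d hops, torus = %d hops\n",
               n / 2, mesh_hops(0, n / 2, n), torus_hops(0, n / 2, n));
        return 0;
    }

With n = 8 the first line prints 7 hops on the mesh but only 1 on the torus, and the torus worst case (node 0 to node 4) is 4 hops – half the number of nodes in that direction, as stated above.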
Supporting Hardware
• Apart from the racks containing the CNs, the supercomputer configuration includes several other components, the most important of them being:
• Front-End Nodes (FENs) – a collection of servers to which the users connect remotely using the secure shell protocol. In the BG/P configuration they are 64-bit PowerPC machines running SuSE Linux Enterprise Server 10 (SLES 10);
• Service Node (SN) – a back-end service node that manages and orchestrates the work of the whole machine. It is off limits for the end users; only administrators have access to it;

Supporting Hardware
• File Servers (FS) – several servers that run a distributed file system which is exported to and seen by both the CNs and the FENs. The home directories of the users are stored on this distributed file system, and this is where all input and output goes.
• Shared Storage library – disk enclosures containing the physical HDDs over which the distributed file system spans.

Software features – cross-compilation
• In contrast to some other supercomputers and clusters, Blue Gene has two distinct sets of computing devices: CNs – the actual workhorses; and FENs – the machines to which the users have direct access.
• CNs and FENs are not binary compatible – a program that is compiled to run on the FEN cannot run on the CNs and vice versa.
• This puts the users in a situation in which they have to compile their programs on the FEN (since that is the only machine they have access to), while the programs must be able to run on the CNs. This is called cross-compilation.

Software features – batch execution
• Since cross-compiled programs cannot run on the FEN, users cannot execute them directly – they need some way to post a program for execution.
• This is called batch job execution. The users prepare a 'job control file' (JCF) in which the specifics of the job are stated, and submit the job to a resource scheduler queue. When the resource scheduler finds free resources that can execute the job, it is sent to the corresponding CNs;
• Blue Gene/P uses TWS LoadLeveler (LL) as its resource scheduler;

Software features – batch execution
• An important consequence of batch execution is that programs should not be interactive.
• While it is possible to come up with some sophisticated mechanism that waits on the queue and performs redirection in order to allow interactivity, it is not desirable, since one cannot predict exactly when the program will run.
• And when it does run and waits for user input while the user is not there, the CNs will idly waste time and power.
• Thus, all parameters of the programs must be passed via configuration files, command line options or some other way, but not via user interaction (see the sketch at the end of this lecture).

Partitions
• The multitude of CNs is divided into "partitions" (or "blocks"). The smallest partition depends on the exact machine configuration, but is usually 32 nodes (on the machine in Sofia the smallest partition is 128 nodes);
• A partition that encompasses half a rack (512 CNs) is called a 'midplane' and is the smallest partition for which the TORUS network topology can be chosen;
• When LL starts a job, it dynamically creates a correspondingly sized partition for it. After the job terminates, the partition is destroyed.
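As a closing illustration of the "no interactivity" rule from the batch-execution slides, here is a minimal C sketch (the file name and parameter names are made up for the example) of a program that takes its parameters from the command line and writes its results to a file, instead of prompting the user at a terminal.

    /* params.c - illustrative sketch: parameters from argv, not from user input */
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char *argv[])
    {
        /* In a batch job nobody is sitting at a terminal, so instead of
           scanf("%d", &steps) the values arrive as command line arguments,
           which can be listed in the job control file. */
        if (argc < 3) {
            fprintf(stderr, "usage: %s <steps> <output-file>\n", argv[0]);
            return 1;
        }

        int steps = atoi(argv[1]);          /* e.g. number of simulation steps   */
        const char *out_name = argv[2];     /* results go to a file, not a screen */

        FILE *out = fopen(out_name, "w");
        if (!out) {
            perror("fopen");
            return 1;
        }
        fprintf(out, "would run %d steps here\n", steps);
        fclose(out);
        return 0;
    }

In the job description submitted to LoadLeveler these arguments would be listed along with the executable, so the job can run unattended whenever the scheduler finds a free partition for it.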