Introduction to HPC Programming
2. The IBM Blue Gene/P
Valentin Pavlov

About these lectures
• This is the second of a series of six introductory lectures discussing the field of High-Performance Computing;
• The lectures are intended for high-school students with some programming experience who have an interest in scientific studies, e.g. physics, chemistry, biology, etc.
• This lecture provides an overview of the IBM Blue Gene/P supercomputer's architecture, along with some practical advice about its usage.

What does "super-" mean?
• When talking about computers, the prefix "super-" does not have the same meaning as when talking about people (e.g. Superman);
• The analogy is closer to that of a supermarket – a market that sells a lot of different articles;
• Thus, a supercomputer is not to be thought of a priori as a very powerful computer, but simply as a collection of a lot of ordinary computers.

What does "super-" mean?
• Anyone with a few thousand euro to spare can build an in-house supercomputer out of a lot of cheap components (think Raspberry Pi), which would in principle be not much different from a high-end supercomputer, only slower.
• Most of the information in these lectures is applicable to such ad-hoc supercomputers, clusters, etc.
• In this lecture we'll have a look at the architecture of a real supercomputer, the IBM Blue Gene/P, and also discuss the differences with the newer version of this architecture, the IBM Blue Gene/Q.

IBM Blue Gene/P
• IBM Blue Gene/P is a modular hybrid parallel system.
• Its basic module is called a "rack", and a configuration can have from 1 to 72 racks.
• In the full 72-rack configuration, the theoretical peak performance of the system is around 1 PFLOPS;
• Detailed information about system administration and application programming of this machine is available online from the IBM RedBooks publication series, e.g. http://www.redbooks.ibm.com/abstracts/sg247287.html

The IBM Blue Gene/P @ NCSA, Sofia
• The Bulgarian Supercomputing Center in Sofia operates and provides access to an IBM Blue Gene/P configuration that consists of 2,048 Compute Nodes, with a total of 8,192 PowerPC cores @ 850 MHz and 4 TB of RAM;
• The Compute Nodes are connected to the rest of the system through 16 channels of 10 Gb/s each;
• Its theoretical peak performance is 27.85 TFLOPS (a short derivation of this figure is given after this group of slides);
• Its energy efficiency is 371.67 MFLOPS/W;
• When it was put into operation in 2008, it was ranked 126th in the world in the http://top500.org list.

Why a supercomputer?
• This supercomputer is not much different from a network of 2,000 ordinary computers (a cluster), or, say, 40 separate clusters of 50 machines each;
• So why bother with a supercomputer? Because it offers several distinctive advantages:
• Energy efficiency – the maximum power consumption of the system at full utilization is 75 kW. This might seem a lot, but it is several times less than that of 2,000 ordinary computers.
• Small footprint – it fits in a small room, while 2,000 PCs would probably occupy a football stadium. 40 clusters of 50 machines would occupy 40 different rooms.

Why a supercomputer?
• Transparent, high-speed and highly available network – the mass of cables and devices needed to interconnect 2,000 PCs would be a nightmarish mess;
• Standard programming interfaces (MPI and OpenMP) – the same interfaces are used on clusters, so software written for a cluster would work on the supercomputer, too (at least in principle);
• High scalability to thousands of cores – in the 40-cluster scenario each cluster is small and cannot run extra large jobs;
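A short derivation of the quoted peak figure, using the per-core rate that is worked out in the Compute Nodes slides further below: each core can complete 4 floating-point operations per cycle at 850 MHz, i.e. 3.4 GFLOPS, so the Sofia configuration peaks at

8,192 cores × 3.4 GFLOPS = 27,852.8 GFLOPS ≈ 27.85 TFLOPS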

Why a supercomputer?
• High availability at a lower price – built as an integrated unit from the start, it breaks down a lot less often than 2,000 ordinary computers would. Moreover, it can be operated by a small team of staff, as opposed to 40 different teams in the many-clusters scenario.
• Better utilization, compared to the 40-cluster scenario. The centralized management allows different teams of researchers to use the processing power in a shared-resource manner, which would be very hard to do if the clusters were owned by different groups.

IBM Blue Gene/P Hardware Organization
Figure: IBM Blue Gene/P – from the CPU to the full system (Source: IBM)

Compute Nodes (CN)
• The processing power of the supercomputer stems from the multitude of Compute Nodes (CNs). There are 1,024 CNs in a rack, which totals 73,728 CNs in a full configuration.
• Each CN contains a quad-core PowerPC @ 850 MHz with a dual FPU (called "double hummer") and 2 GB RAM.
• Ideally, each core can complete 4 floating-point operations per cycle, thus performing at 850 × 4 = 3,400 MFLOPS. Multiplied by the number of cores, this brings the performance of a single CN to 4 × 3.4 = 13.6 GFLOPS.

Compute Nodes (CN)
• The theoretical performance of the whole system is thus 73,728 × 13.6 = 1,002,700.8 GFLOPS ≈ 1.0027 PFLOPS.
• Each CN has 4 cores and behaves as a shared memory machine with regard to the 2 GB of RAM on the node;
• The cores on one CN do not have access to the memory of another CN, so the collection of CNs behaves as a distributed memory machine;
• Thus, the machine has a hybrid organization – distributed memory between nodes and shared memory within the same node (a minimal code sketch of this hybrid model is given after the Connectivity slides below).

Connectivity
• Each CN is directly connected to its immediate neighbours in all 3 directions;
• Communication between non-neighbouring nodes involves at least one node that, apart from computation, is also involved in forwarding network traffic, which brings down its performance.
• The whole system looks like a 3D MESH, but in order to reduce the amount of forwarding it can also be configured as a 3D TORUS – a figure (only properly drawable in 4D) in which the ends of the mesh in each of the 3 directions are connected to each other.

Connectivity
• The advantage of the torus is that it halves the amount of forwarding necessary, since the longest distance is now half the number of nodes in each direction.
• The connectivity with the rest of the system is achieved through special Input/Output Nodes (IONs);
• Each Node Card (32 CNs) has 1 ION through which the CNs access the shared disk storage and the rest of the components of the system via a 10 Gb/s network;
• There are other specialized networks, e.g. for collective communications, etc.
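The hybrid organization maps directly onto the standard programming interfaces mentioned earlier: MPI between nodes (distributed memory) and OpenMP within a node (shared memory). Below is a minimal illustrative sketch of a hybrid "hello world" in C; it is not taken from the machine's documentation, and how many processes and threads it reports depends on the execution mode chosen for the job (see the Execution modes slides later on).

/*
 * Minimal hybrid MPI + OpenMP sketch: MPI ranks live in separate
 * address spaces (possibly on different nodes), while the OpenMP
 * threads inside one rank share that rank's memory.
 */
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char *argv[])
{
    int rank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Each thread of each rank reports where it is running. */
    #pragma omp parallel
    {
        printf("MPI rank %d of %d, OpenMP thread %d of %d\n",
               rank, nprocs, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}

On Blue Gene/P such a program would be cross-compiled on the FEN with the installation's MPI compiler wrappers and then submitted as a batch job, as described in the following slides.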

Supporting Hardware
• Apart from the racks containing the CNs, the supercomputer configuration includes several other components, the most important of them being:
• Front-End Nodes (FENs) – a collection of servers to which the users connect remotely using the secure shell (SSH) protocol. In the BG/P configuration they are 64-bit PowerPC machines running SuSE Linux Enterprise Server 10 (SLES 10);
• Service Node (SN) – a back-end service node that manages and orchestrates the work of the whole machine. It is off limits to end users, and only administrators have access to it;

Supporting Hardware
• File Servers (FS) – several servers that run a distributed file system which is exported to, and seen by, both the CNs and the FENs. The home directories of the users are stored on this distributed file system, and this is where all input and output goes.
• Shared Storage library – disk enclosures containing the physical HDDs over which the distributed file system spans.

Software features—cross-compilation
• In contrast to some other supercomputers and clusters, Blue Gene has two distinct sets of computing devices: CNs—the actual workhorses; and FENs—the machines to which the users have direct access.
• CNs and FENs are not binary compatible—a program that is compiled to run on the FEN cannot run on the CNs and vice versa.
• This puts the users in a situation in which they have to compile their programs on the FEN (since that is the only machine they have access to), but the programs must be able to run on the CNs. This is called cross-compilation.

Software features—batch execution
• Since cross-compiled programs cannot run on the FEN, users cannot execute them directly—they need some way to post a program for execution.
• This is called batch job execution. The users prepare a 'job control file' (JCF) in which the specifics of the job are stated and submit the job to a resource scheduler queue. When the resource scheduler finds free resources that can execute the job, it is sent to the corresponding CNs;
• Blue Gene/P uses TWS LoadLeveler (LL) as its resource scheduler;

Software features—batch execution
• An important consequence of batch execution is that programs had better not be interactive.
• While it is possible to come up with some sophisticated mechanism to wait on the queue and perform redirection in order to allow interactivity, it is not desirable, since one cannot predict exactly when the program will be run.
• And if the program does run and waits for user input while the user is not there, the CNs will idly waste time and power.
• Thus, all parameters of the programs must be passed via configuration files, command line options or some other way, but not via user interaction (a minimal sketch of this is given after the Partitions slide below).

Partitions
• The multitude of CNs is divided into "partitions" (or "blocks"). The smallest partition depends on the exact machine configuration, but is usually 32 nodes (on the machine in Sofia the smallest partition is 128 nodes [1]);
• A partition that encompasses half a rack (512 CNs) is called a 'midplane' and is the smallest partition for which the TORUS network topology can be chosen;
• When LL starts a job, it dynamically creates a correspondingly sized partition for it. After the job terminates, the partition is destroyed.

[1] Which means that there are at most 16 simultaneously running jobs on this machine!
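Since jobs are not interactive, a program should take everything it needs from its command line (or from a configuration file) rather than from standard input. A minimal sketch in C follows; the parameter names (number of steps, output file) are made up purely for illustration:

/*
 * Non-interactive parameter handling: read parameters from the command
 * line instead of prompting on stdin.  With batch execution there is
 * nobody at the keyboard, so a missing parameter should make the
 * program fail early (the message ends up in the job's error file).
 */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    if (argc < 3) {
        fprintf(stderr, "usage: %s <number-of-steps> <output-file>\n", argv[0]);
        return 1;
    }

    long steps = strtol(argv[1], NULL, 10);   /* e.g. 1000000 */
    const char *outname = argv[2];            /* e.g. result.dat */

    FILE *out = fopen(outname, "w");
    if (out == NULL) {
        fprintf(stderr, "cannot open %s\n", outname);
        return 1;
    }
    fprintf(out, "would run %ld steps here\n", steps);
    fclose(out);
    return 0;
}

The arguments themselves would be supplied through the mpirun line in the JCF (mpirun has an option for passing arguments to the executable; consult the installation's documentation for the exact spelling).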

Resource Allocation
• LL does resource allocation. Its task is to execute as many jobs as possible in as little time as possible, given the limited hardware resources.
• This is an optimization problem and is solved by heuristic means;
• In order to solve this problem, LL needs to know the extent of each job both in space (number of nodes) and in time (maximum execution time, called the "wall clock limit");
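To make the space-and-time picture concrete (the numbers are only an illustration): a job that requests 128 nodes with a wall clock limit of 01:00:00 occupies a "rectangle" of 128 nodes × 1 hour. With 2,048 nodes in total, LL essentially has to pack such rectangles as tightly as possible, which is why both figures have to be declared up front.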

Constraints and prioritization
• In order to ensure fair usage of the resources, the administrators can put constraints on the queue—e.g. a user can have no more than N running jobs and M jobs in total in the queue;
• Apart from this, LL can dynamically assign a priority to each job, based on things like job submission time, the number of jobs in the queue for the same user, the last time a job was run by the same user, etc.
• The policy for these things is configured by the system administrators.

Execution modes
• Each job must specify the execution mode in which to run. The execution mode specifies the shared/distributed memory configuration of the cores inside each of the CNs in the job's partition.
• There are 3 available modes: VN, DUAL and SMP;
• In VN mode each CN is viewed as a set of 4 different CPUs, working in distributed memory fashion. Each processor executes a separate copy of the parallel program, and the program cannot use threads. The RAM is divided into 4 blocks of 512 MB each, and each core "sees" only its own block of RAM.

Execution modes
• In DUAL mode each CN is viewed as 2 sets of 2 cores. Each set of 2 cores runs one copy of the program, and this copy can spawn one worker thread in addition to the master thread that is initially running. The RAM is divided into 2 blocks of 1 GB each, and each set of 2 cores sees its own block.
• This is a hybrid setting—the two threads running inside a set of cores work in shared memory fashion, while the different sets of cores work in distributed memory fashion.

Execution modes
• In SMP mode each CN is viewed as 1 set of 4 cores. The node runs one copy of the program, and this copy can spawn three worker threads in addition to the master thread that is initially running. The RAM is not divided, and the 4 cores see the whole 2 GB of RAM in a purely shared memory fashion.
• Again, this is a hybrid setting—the four threads running inside a node work in shared memory fashion, while the different nodes work in distributed memory fashion.

Execution modes—which one to use?
• In VN mode the partition looks like a purely distributed memory machine, while in DUAL and SMP mode the partition looks like a hybrid machine;
• It is much easier to program a distributed memory machine than a hybrid one.
• Thus, VN mode is the easiest, but it has a giant drawback—there are only 512 MB of RAM available to each copy of the program.

Execution modes—which one to use?
• If you need more memory, you have to switch to DUAL or SMP mode.
• But then you also have to take into consideration the hybrid nature of the machine and properly utilize the 2 or 4 threads available to each copy of the program (in DUAL and SMP mode, respectively).
• Running a single-threaded application in DUAL or SMP mode is an enormous waste of resources! A sketch of how the extra threads can be put to work is shown below.
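A minimal sketch of putting the worker threads to use inside each copy of the program, assuming OpenMP is used for the threading; the array size and the computed quantity are arbitrary examples:

/*
 * Inside one copy of the program (one MPI process), OpenMP splits the
 * local work among the available threads: up to 2 per copy in DUAL
 * mode, up to 4 in SMP mode.  All threads share the copy's memory.
 */
#include <stdio.h>
#include <omp.h>

#define N 1000000

static double a[N];

int main(void)
{
    double sum = 0.0;

    /* Loop iterations are distributed among the threads; the partial
     * sums are combined by the reduction clause. */
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < N; i++) {
        a[i] = 0.5 * i;
        sum += a[i];
    }

    printf("max threads: %d, sum = %g\n", omp_get_max_threads(), sum);
    return 0;
}

In a real job the same OpenMP construct would sit inside each MPI process, combining both levels of parallelism as in the hybrid sketch shown earlier.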

Job submission
• Prepared jobs are run using the command llsubmit, which accepts as an argument the "job control file" (JCF), which describes the required resources, the executable file, its arguments and its environment.
• Calling llsubmit puts the job in the queue of waiting jobs. This queue can be listed using the command llq.
• A job can be cancelled using llcancel and supplying it the job ID as seen in the llq list.

JCF Contents
# @ job_name = hello
# @ comment = "This is a Hello World program"
# @ error = $(jobid).err
# @ output = $(jobid).out
# @ environment = COPY_ALL;
# @ wall_clock_limit = 01:00:00
# @ notification = never
# @ job_type = bluegene
# @ bg_size = 128
# @ class = n0128
# @ queue
/bgsys/drivers/ppcfloor/bin/mpirun -exe hello -verbose 1 -mode VN -np 512

Important directives in JCF
• error = $(jobid).err
• output = $(jobid).out—These two directives specify the files to which the output of the job is redirected.
• Remember that jobs are not interactive and the user cannot see what would normally appear on the screen if the program were run by itself.
• So the output that usually goes to the screen is stored in the specified files: errors go in the first file and the regular output—in the second.
• LL replaces the text $(jobid) with the real ID assigned to the job, so as not to overwrite some previous output.
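A note on how the numbers in the sample JCF above fit together: bg_size = 128 requests 128 Compute Nodes, and -mode VN runs 4 copies of the program per node (one per core), hence -np 512 = 128 × 4 processes in total.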

Important directives in JCF
• wall_clock_limit = 01:00:00
• bg_size = 128
• The first directive provides the maximum extent of the job in time (HH:MM:SS). If the job is not finished by the end of the specified period, it is killed by LL;
• The second directive provides the extent of the job in space, i.e. the number of CNs required by the job.
• Remember that in order for LL to be able to solve the optimization problem, it needs these two pieces of data.

Important directives in JCF
• class = n0128
• The class of the job determines several important parameters, among which:
• The maximum number of nodes that can be requested;
• The maximum wall clock limit that can be specified;
• The job priority—larger jobs have precedence over smaller jobs, and faster jobs have priority over slower ones;
• Administrators can put in place constraints regarding the number of simultaneously executing jobs of each class.
• The classes are different for each installation, and their characteristics must be made available to the users in some documentation file.

Other supercomputers – Top 10 as of November 2012
1 Titan (USA) – 17 PFLOPS, Cray XK7
2 Sequoia (USA) – 16 PFLOPS, IBM Blue Gene/Q
3 K computer (Japan) – 10 PFLOPS, SPARC64-based
4 Mira (USA) – 8 PFLOPS, Blue Gene/Q
5 JUQUEEN (Germany) – 8 PFLOPS, Blue Gene/Q
6 SuperMUC (Germany) – 2.8 PFLOPS, Intel iDataPlex
7 Stampede (USA) – 2.6 PFLOPS, uses Intel Xeon Phi
8 Tianhe-1A (China) – 2.5 PFLOPS, uses NVIDIA Tesla
9 Fermi (Italy) – 1.7 PFLOPS, Blue Gene/Q
10 DARPA Trial Subset (USA) – 1.5 PFLOPS, POWER7-based

Blue Gene/Q
• In the Top 10 list as of November 2012, 4 of the machines are IBM Blue Gene/Q systems.
• Conceptually it is very similar to IBM Blue Gene/P, but its Compute Nodes are a lot more powerful;
• Each compute node has 18 64-bit PowerPC cores @ 1.6 GHz (only 16 of which are used for computation), 16 GB RAM and a peak performance of 204.8 GFLOPS per node (16 cores × 1.6 GHz × 8 floating-point operations per cycle);
• Important aspects such as cross-compilation, batch job submission, the JCF file format, etc. are basically the same as those in Blue Gene/P. The obvious difference is the mode specification, since VN, SMP and DUAL are now obsolete and the specification on the BG/Q is more flexible.