Modeling Computation and Communication Performance of Parallel Scientific Applications: A Case Study of the IBM SP2

Eric L. Boyd, Gheith A. Abandah, Hsien-Hsin Lee, and Edward S. Davidson
Advanced Computer Architecture Laboratory
Department of Electrical Engineering and Computer Science
University of Michigan
1301 Beal Avenue, Ann Arbor, MI 48109-2122
PHONE: 313-936-2917; FAX: 313-763-4617
{boyd, gabandah, linear, davidson}@eecs.umich.edu

Abstract

A methodology for performance analysis of Massively Parallel Processors (MPPs) is presented. The IBM SP2 and some key routines of a finite element method application (FEMC) are used as a case study. A hierarchy of lower bounds on run time is developed for the POWER2 processor, using the MACS methodology developed in earlier work for uniprocessors and vector processors. Significantly, this hierarchy is extended to incorporate the effects of the memory hierarchy of each SP2 node and communication across the High Performance Switch (HPS) linking the nodes of the SP2. The performance models developed via this methodology facilitate explaining performance, identifying performance bottlenecks, and guiding application code improvements.

1. Introduction

Scientific applications are typically dominated by loop code, floating-point operations, and array references. The performance of such applications on scalar and vector uniprocessors has been found to be well characterized by the MACS model, a hierarchical series of performance bounds. As depicted in Figure 1, the M bound models the peak floating-point performance of the Machine architecture independent of the application. The MA bound models the machine in conjunction with the operations deemed to be essential in the high-level Application workload. Hence Gap A represents the performance degradation due to the application algorithm's need for operations that are not entirely masked by the most efficient class of floating-point operations. The MAC bound is derived from the actual Compiler-generated workload. Hence Gap C represents the performance degradation due to the additional operations introduced by the compiler. The MACS bound factors in the compiler-generated Schedule for the workload. Gap S thus represents the performance degradation due to scheduling constraints. The performance degradation seen in the remaining gap, Gap P, is due to as yet unmodeled effects, such as unmodeled cache miss penalties, load imbalances, OS interrupts, and I/O, that affect actual delivered performance.

The bounds are generally expressed as lower bounds on run time and, along with measured run time, are given in units of CPF (clocks per essential floating-point operation). The reciprocal of CPF times the clock rate in MHz yields an upper bound on performance in MFLOPS. The MACS model has been effectively demonstrated and applied in varying amounts of detail to a wide variety of processors [1][2][3][4][5][6]. The MACS model developed for each SP2 node incorporates a significant additional refinement beyond these earlier studies by including the effects of the memory hierarchy.

Figure 1: MACS Performance Bound Hierarchy (the M, MA, MAC, and MACS bounds and the measured CPF, separated by Gap A, Gap C, Gap S, and Gap P)

Scientific applications often exhibit a large degree of potential parallelism in their algorithms, and hence are prime candidates for execution on MPPs. Gap P can become very large and poses the fundamental limit to scalability for parallel applications.
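To make the bound arithmetic concrete, the following sketch converts a set of CPF bounds into MFLOPS upper bounds and reports the gaps between successive bounds. The CPF values are placeholders chosen for illustration, not measurements from this study; only the 66.7 MHz Thin Node clock rate is taken from the SP2 description in Section 2. Because each level of the hierarchy only adds constraints, the CPF values are nondecreasing and every gap is nonnegative.

/* Illustrative only: converts hypothetical CPF bounds (clocks per essential
 * floating-point operation) into MFLOPS upper bounds and reports the gaps
 * between successive bounds in the MACS hierarchy. The CPF numbers below
 * are made up; only the 66.7 MHz Thin Node clock rate comes from the paper. */
#include <stdio.h>

int main(void) {
    const double clock_mhz = 66.7;                        /* SP2 Thin Node clock rate */
    const char  *name[] = { "M", "MA", "MAC", "MACS", "Measured" };
    const double cpf[]  = { 0.25, 0.60, 0.90, 1.40, 2.10 };  /* hypothetical values */
    const int    n = sizeof cpf / sizeof cpf[0];

    for (int i = 0; i < n; i++) {
        double mflops = clock_mhz / cpf[i];               /* (1 / CPF) * clock rate in MHz */
        printf("%-8s  CPF = %4.2f  ->  at most %6.1f MFLOPS\n", name[i], cpf[i], mflops);
        if (i > 0)                                        /* Gaps A, C, S, P, in CPF */
            printf("%10sGap %c = %4.2f CPF\n", "", "ACSP"[i - 1], cpf[i] - cpf[i - 1]);
    }
    return 0;
}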
Characterizing the communication performance of an MPP and the communication requirements of an application achieves a refinement of Gap P that is crucial to modeling and improving the performance of parallel scientific applications that exhibit moderate to high amounts of communication relative to computation. We demonstrate that coupling the MACS bounds hierarchy, which models the computation of an individual node, with models of communication, as well as load balancing and cache effects, to refine Gap P, enables the effective modeling of the performance of scientific applications on a parallel computer such as the IBM SP2. This refinement is begun in this paper with the introduction of a communication model for the SP2. The MACS model is applied to the SP2 and integrated with this communication model in a case study of a commercial scientific application. This methodology could easily be extended to other message passing MPPs, such as the Intel Paragon and the Thinking Machines CM5 [7][8]. We believe that it can also be extended to shared memory MPPs, such as the Kendall Square Research KSR2, the Convex Exemplar, and the Cray T3D [11][12][13][14], by using techniques to expose and characterize the implicit communication, as demonstrated in [6][9][10].

All experiments in this paper were run on an IBM SP2 with 32 Thin Node 66 POWER2 processors running the AIX 3.2.5 operating system. An overview of the SP2 architecture is given in Section 2. The single-node SP2 (POWER2) MACS model is detailed in Section 3, with extensions to include the effects of the memory hierarchy. Section 4 presents the communication model for the IBM SP2. An industrial structural finite element modeling code, FEMC, is used to demonstrate the methodology. This application is being ported from a vector supercomputer, parallelized, and tuned for the IBM SP2 at the University of Michigan, and hence represents one scenario of performance modeling. Three performance-limiting FEMC routines are examined. These routines exhibit low, moderate, and high communication to computation ratios and various patterns of communication. Section 5 gives a brief overview of the application routines and discusses the performance modeling results. The performance models developed via this methodology facilitate explaining performance, identifying performance bottlenecks, and guiding application code improvements.

2. IBM SP2 Architectural Analysis

A typical IBM SP2 contains between 4 and 128 nodes connected by a High-Performance Switch communication interconnect; larger configurations are possible. Each node consists of a POWER or POWER2 processor and an SP2 communication adapter. There are three types of POWER2 nodes currently available: Thin Nodes, Thin 2 Nodes, and Wide Nodes. Thin Nodes have a 64 Kbyte data cache, Thin 2 Nodes have a 128 Kbyte data cache, and Wide Nodes have a 256 Kbyte data cache. Thin 2 Nodes and Wide Nodes can do quadword data accesses, copying two adjacent double words from the primary cache into two adjacent floating-point registers in one cycle.

2.1. POWER2 Architecture [15]

Each Thin Node POWER2 operates at a clock speed of 66.7 MHz, corresponding to a 15 ns processor clock cycle time. The POWER2 processor is subdivided into an Instruction Cache Unit (ICU), a Data Cache Unit (DCU), a Fixed-Point Unit (FXU), and a Floating-Point Unit (FPU), as shown in Figure 2.
The POWER2 processor also includes a Storage Control Unit (SCU) and two I/O Units (XIO), neither of which is shown or described here. Thin Nodes and Thin 2 Nodes also include an optional secondary cache (1 or 2 Mbytes, respectively), memory (64 Mbytes to 512 Mbytes), and a communication adapter. All experiments in this paper are run on Thin Nodes with no secondary cache and a 256 Mbyte memory.

The ICU can fetch up to eight instructions per clock cycle from the instruction cache (I-Cache). It can dispatch up to six instructions per cycle through the dispatch unit, two of which are reserved for branch and compare instructions. Two independent branch processors within the ICU can each resolve one branch per cycle. Most of the branch operation penalties can be masked by resolving them concurrently with FXU and FPU instruction execution. The instruction cache is 32 Kbytes with 128 byte cache lines and a one cycle access time.

The FXU decodes and executes all memory references, integer operations, and logical operations. It includes address translation, data protection, and data cache directories for load/store instructions. There are two fixed-point execution units, each fed by its own instruction decode unit and maintaining a copy of the general purpose register file. Two fixed-point instructions can be executed per cycle by the two execution units. Each unit contains an adder and a logic functional unit providing addition, subtraction, and Boolean operations. One unit can also execute special operations such as cache operations and privileged operations, while the other unit can perform fixed-point multiply and divide operations.

The FPU consists of three units: a double precision arithmetic unit, a load unit, and a store unit. Each unit has dual pipelines and hence can execute up to two instructions per cycle. The peak issue rate from the FPU instruction buffers to these units is thus two double precision floating-point multiply-adds, two floating-point loads, and two floating-point stores per clock.

The DCU includes a four-way set associative, multiport, write-back cache. The experiments in this paper were performed on a configuration with a 64 Kbyte data cache with 64 byte cache lines. The DCU also provides several buffers for cache and direct memory access (DMA) operations, as well as error detection/correction and bit steering for all data sent to and received from memory. The DCU has a one cycle access time. The miss penalty to memory is determined experimentally to be between 16 and 21 processor clock cycles.

Figure 2: POWER2 Processor Architecture (ICU with instruction cache, dual branch processors, and dispatch unit; FXU with two fixed-point execution units; FPU with arithmetic, load, and store execution units; DCU on four separate chips; optional secondary cache; memory unit)

2.2. Interconnect Architecture [16][17][18]

The SP2 interconnect, termed the High-Performance Switch (HPS), is designed to minimize the average latency of message transmissions while allowing the aggregate bandwidth to scale linearly with the number of nodes. The HPS is a bidirectional multistage interconnect (MIN), logically composed of switching elements.
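The average message latency and aggregate bandwidth that the HPS is designed around are exactly the quantities a communication model must capture. A minimal ping-pong sketch of the kind commonly used for such measurements is given below; it assumes a generic MPI message passing library and is illustrative only, not the instrumentation used in this study.

/* Minimal ping-pong sketch for estimating point-to-point latency and
 * bandwidth between two nodes. Illustrative only; it assumes an MPI
 * library and is not the measurement code used in the paper. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int rank, reps = 1000, bytes = 1 << 16;              /* 64 Kbyte messages */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *buf = malloc(bytes);
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    for (int i = 0; i < reps; i++) {
        if (rank == 0) {                                 /* node 0: send, then wait for the echo */
            MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {                          /* node 1: echo the message back */
            MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    double t = (MPI_Wtime() - t0) / (2.0 * reps);        /* one-way time per message */
    if (rank == 0)
        printf("%d bytes: %.1f us one-way, %.1f MB/s\n",
               bytes, t * 1e6, bytes / t / 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}

Sweeping the message size from a few bytes up to hundreds of kilobytes separates the fixed per-message startup cost from the asymptotic transfer bandwidth, the two parameters around which simple latency/bandwidth communication models are typically built.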
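On the node side, the 16 to 21 cycle memory miss penalty quoted in Section 2.1 is the kind of figure a dependent-load microbenchmark can estimate. The sketch below chases a pointer chain much larger than the 64 Kbyte data cache so that essentially every load misses; the array size, stride, and timing method are assumptions for illustration, not the experiment actually used in the paper.

/* Rough sketch of a dependent-load (pointer-chasing) microbenchmark for
 * estimating the cache miss penalty. Generic illustration only; the array
 * size, stride, and timer are assumptions, not the paper's experiment. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 20)          /* 1M pointers (8 Mbytes): far larger than a 64 Kbyte cache */

int main(void) {
    size_t *next = malloc(N * sizeof *next);
    if (!next) return 1;

    /* Build one long cycle that visits every slot in a cache-unfriendly order. */
    size_t stride = 9973;    /* odd, hence coprime with N, so the walk covers all N slots */
    for (size_t i = 0, j = 0; i < N; i++) {
        size_t k = (j + stride) % N;
        next[j] = k;
        j = k;
    }

    clock_t t0 = clock();
    size_t p = 0;
    for (long i = 0; i < 10 * (long)N; i++)   /* each load depends on the previous one */
        p = next[p];
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

    printf("avg %.1f ns per dependent load (result %zu)\n",
           secs / (10.0 * N) * 1e9, p);       /* print p so the chase is not optimized away */
    free(next);
    return 0;
}

Dividing the average time per dependent load by the 15 ns cycle time gives a rough miss penalty in processor clocks; on a Thin Node with no secondary cache, essentially every such load goes to memory.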