IBM Research Report RZ 3721 (# 99731), 12/02/2008, Computer Science, 8 pages

Trace-driven Co-simulation of High-Performance Computing Systems using OMNeT++

Cyriel Minkenberg, Germán Rodriguez Herrera
IBM Zurich Research Laboratory
Säumerstrasse 4, 8803 Rüschlikon, Switzerland
Email: {sil,rod}@zurich.ibm.com

ABSTRACT
In the context of developing next-generation high-performance computing systems, there is often a need for an "end-to-end" simulation tool that can simulate the behaviour of a full application on a reasonably faithful model of the actual system. Considering the ever-increasing levels of parallelism, we take a communication-centric view of the system based on collecting application traces at the message-passing interface level. We present an integrated toolchain that enables the evaluation of the impact of all interconnection network aspects on the performance of parallel applications. The network simulator, based on OMNeT++, provides a socket-based co-simulation interface to the MPI task simulator, which replays traces obtained using an instrumentation package. Both simulators generate output that can be evaluated with a visualization tool. A set of additional tools is provided to translate generic topology files to OMNeT's ned format, import route files at run time, perform routing optimizations, and generate particular topologies. We also present several examples of results obtained that provide insights that would not have been possible without this integrated environment.

1. INTRODUCTION
The design of high-performance computing (HPC) systems relies to a large extent on simulations to optimize the various components of such a complex system. To evaluate processor performance, tools such as MAMBO [1] or SIMICS [2] are used, which can perform cycle-accurate simulations of entire applications at the instruction level. However, such a level of detail prevents scaling of this type of simulation to systems with more than a handful of processors.

The interconnection network of an HPC system is usually modelled at a higher level of abstraction, resorting to discrete-event simulation to enable scaling to systems with hundreds or thousands of network ports. The purpose of interconnection network simulation is to optimize the network topology, switch and adapter architectures and parameters, scheduling and routing policies, link-level flow control mechanism, and end-to-end congestion control mechanism. This type of simulation is commonly of the "Monte Carlo" variety, i.e., applying a synthetically generated workload with random destination and interarrival-time distributions, rather than the load from a real application.

Although such simulations are useful in determining load–throughput and load–delay characteristics, they are not necessarily a reliable performance indicator for the communication phases of specific applications. Therefore, we have the situation that instruction-level processor simulation, although accurate, does not scale to the desired system sizes, whereas interconnect simulation does scale, but suffers from unrealistic stimuli. This gap needs to be bridged to enable true end-to-end full-system simulation.

One class of approaches to bridge this gap employs trace-driven simulation. Instead of by an exact model of their behavior, computing nodes are represented by a trace that contains two basic kinds of records, namely computation and communication. Computations are not actually performed, but simply represented by the amount of CPU time they would consume in reality. Communications are transformed into data messages that are fed to a model of the interconnection network. To ensure accurate results, the simulation should preserve causal dependencies between records, e.g., when a particular computation depends on data to be delivered by a preceding communication, the start of the computation must wait for the communication to complete. As many scientific HPC applications are based on the Message Passing Interface (MPI), tracing MPI calls is a suitable method to characterize the communication patterns of an important class of HPC workloads. An example of this approach is the MARS simulator presented in [3].
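To make the causality constraint concrete, the following minimal sketch replays one task's trace: computation records simply advance the task's local clock, while a receive cannot complete before the matching message has arrived. The record types and the fixed-latency network stub are our own illustrative assumptions, not the actual MARS or Dimemas implementation; in the co-simulation described later, the message delay would come from the OMNeT++ network model instead.

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    // Hypothetical trace records: a computation burns CPU time; a
    // communication is resolved against a network model.
    enum class Kind { Compute, Send, Recv };

    struct Record {
        Kind   kind;
        double cpuTime;   // Compute: CPU time consumed in the original run
        int    peer;      // Send/Recv: the other task
        int    bytes;     // Send/Recv: message size
    };

    // Stand-in for the interconnect model: fixed latency plus a
    // bandwidth term. The real co-simulation would query the
    // OMNeT++ network simulator for this value instead.
    double networkDelay(int bytes) {
        const double latency = 2e-6, bytesPerSec = 1e9;
        return latency + bytes / bytesPerSec;
    }

    int main() {
        std::vector<Record> trace = {
            {Kind::Compute, 5e-4, 0, 0},
            {Kind::Recv,    0.0,  1, 65536}, // data needed by next step
            {Kind::Compute, 1e-3, 0, 0},     // may start only after the Recv
            {Kind::Send,    0.0,  1, 1024},
        };

        double clock = 0.0;          // local time of the replayed task
        double sendTimeOfPeer = 0.0; // simplification: peer sent at t = 0

        for (const Record& r : trace) {
            switch (r.kind) {
            case Kind::Compute:
                clock += r.cpuTime;  // replay the cost, don't recompute
                break;
            case Kind::Recv: {
                double arrival = sendTimeOfPeer + networkDelay(r.bytes);
                clock = std::max(clock, arrival); // causal dependency: wait
                break;
            }
            case Kind::Send:
                break; // posting a send is modelled as instantaneous here
            }
            std::printf("t = %.6f s after record (kind %d)\n", clock,
                        static_cast<int>(r.kind));
        }
    }

Note how the second computation record is charged only after the receive completes; dropping the std::max would silently violate exactly the dependency described above.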
The simulation framework described here is the result of a joint project between the Barcelona Supercomputing Center (BSC) and IBM, in which a follow-on machine to the currently installed MareNostrum system is being designed under the working title of MareIncognito. MareNostrum is a 2,560-node cluster of IBM JS21 blades, each having two dual-core IBM 64-bit PowerPC® 970MP processors running at 2.3 GHz for a total of 10,240 CPUs. The computing nodes are interconnected by means of a high-bandwidth, low-latency Myrinet® network [5], with each blade having one integrated Myrinet adapter. The switch fabric comprises ten 512-port and two 1280-port Myrinet switches arranged in a two-level fat-tree topology. However, as these switches are constructed internally using 32-port switch elements (again arranged in a fat-tree topology), the network really has five levels. The nodes are also connected to a Gigabit Ethernet local area network. MareNostrum has 20 TB of RAM and 280 TB of external disk storage. Peak performance with respect to the LINPACK benchmark is 94.21 teraflops, which ranked it as the world's 26th fastest supercomputer on the Top500 list of June 2008.

Several teams at IBM and BSC are cooperating on key aspects of the design of MareIncognito, including applications, programming models, tools, load balancing, interconnection network, processor, and system modeling. This paper describes part of the effort responsible for the design of the interconnection network.

The remainder of this paper is organized as follows. In Sec. 3 we describe the existing BSC tools Dimemas and Paraver and the IBM interconnection network simulator MARS. Section 4 explains how we integrated these tools to form a co-simulation platform for MPI applications using realistic interconnection network models. Section 5 provides examples of the type of new insights that this new environment can help generate. Finally, we conclude in Sec. 6.

2. INTERCONNECTION NETWORK
The role of the interconnection network in scientific as well as commercial computer systems is of increasing importance. The underlying trend is that the growing demand for computing capacity will be met through parallelism at the instruction, thread, core, and machine levels.

The scale, speed, and efficiency of interconnects must grow significantly, because of (i) increasing parallelism, (ii) the replacement of computer busses by switched networks, (iii) consolidation of heterogeneous communication infrastructure (storage area, local area, and clustering networks; IO and memory busses) onto a single physical network, and (iv) virtualization of computing resources that leads to increased variability and unpredictability in the network load.

3. EXISTING TOOLS
To analyze the performance of parallel applications, BSC developed the trace-driven simulator Dimemas and a visualization tool, Paraver. To study the performance of interconnection networks, IBM developed a tool referred to as the MPI Application Replay network Simulator (MARS) [3]. As these tools form the basis of our work, we briefly describe them below.

3.1 Dimemas
Dimemas is a tool developed at BSC to analyze the performance of message-passing programs. It is an event-driven simulator that reconstructs the time behavior of message-passing applications on a machine modelled by a set of performance parameters.

The input to Dimemas is a trace containing a sequence of operations for each thread of each task. Each operation can be classified as either computation or communication. Such traces are usually generated by instrumenting an MPI application, although they can also be generated synthetically. During instrumentation, each computation is translated into a trace record indicating a "busy time" for a specific CPU, whereas the actual computation performed is not recorded. Communication operations are recorded as send, receive, or collective operation records, including the sender, receiver, message size, and type of operation.

Dimemas replays such a trace using an architectural machine model consisting of a network of SMP nodes. The model is highly parametrizable, allowing the specification of parameters such as number of nodes, number of processors per node, relative CPU speed, memory bandwidth, memory latency, number of communication buses, communication bus latency, etc. Dimemas outputs various statistics as well as a Paraver trace file.
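As a concrete illustration of such a parameter set, the sketch below defines a machine-model struct whose fields mirror the parameters listed above. The field names, the example values, and the speed-scaling rule are our own assumptions for illustration; they are not Dimemas's actual configuration-file format.

    #include <cstdio>

    // Hypothetical, simplified analogue of the machine model that
    // Dimemas is parametrized with (names are ours, not Dimemas's).
    struct MachineModel {
        int    nodes;              // number of SMP nodes
        int    cpusPerNode;        // processors per node
        double relativeCpuSpeed;   // scales recorded "busy times"
        double memoryBandwidth;    // bytes/s
        double memoryLatency;      // s
        int    communicationBuses; // concurrent transfers per node
        double busLatency;         // s
    };

    int main() {
        // Illustrative values only; node/CPU counts loosely follow
        // the MareNostrum description above.
        MachineModel m{2560, 4, 2.0, 4e9, 1e-7, 1, 1e-6};

        // With a relative CPU speed of 2, a recorded 2 ms computation
        // burst is replayed as 1 ms of simulated time.
        double recorded = 2e-3;
        double replayed = recorded / m.relativeCpuSpeed;
        std::printf("%d CPUs total, replayed burst: %.4f s\n",
                    m.nodes * m.cpusPerNode, replayed);
    }

This separation is what makes trace replay cheap to re-run: exploring a different target machine only changes the parameter values, while the trace itself stays fixed.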
3.2 Paraver
Paraver is a tool, also developed at BSC, to create visual representations of the behavior of parallel programs. One of the outputs of Dimemas is a Paraver trace that represents the state of each thread at every time during the simulation.
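As a rough sketch of what such a trace encodes, the following shows a per-thread state timeline that a visualizer can render as colored bars over time. This is a deliberately simplified stand-in of our own; the actual Paraver trace format is richer and differs in syntax.

    #include <cstdio>
    #include <vector>

    // Hypothetical, much-simplified analogue of a Paraver-style state
    // trace: for each thread, (begin, end, state) intervals.
    enum class State { Idle, Running, Waiting };

    struct Interval {
        int    thread;
        double begin, end; // seconds of simulated time
        State  state;
    };

    int main() {
        std::vector<Interval> timeline = {
            {0, 0.0,   0.005, State::Running},
            {0, 0.005, 0.007, State::Waiting}, // blocked in a receive
            {0, 0.007, 0.012, State::Running},
            {1, 0.0,   0.012, State::Running},
        };

        const char* names[] = {"Idle", "Running", "Waiting"};
        for (const Interval& iv : timeline)
            std::printf("thread %d: [%.3f, %.3f) %s\n", iv.thread,
                        iv.begin, iv.end,
                        names[static_cast<int>(iv.state)]);
    }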