Catamount Vs. Cray Linux Environment
Total Page:16
File Type:pdf, Size:1020Kb
To Upgrade or not to Upgrade? Catamount vs. Cray Linux Environment S.D. Hammond, G.R. Mudalige, J.A. Smith, J.A. Davis, S.A. Jarvis High Performance Systems Group, University of Warwick, Coventry, CV4 7AL, UK J. Holt Tessella PLC, Abingdon Science Park, Berkshire, OX14 3YS, UK I. Miller, J.A. Herdman, A. Vadgama Atomic Weapons Establishment, Aldermaston, Reading, RG7 4PR, UK Abstract 1 Introduction Assessing the performance of individual hardware and Modern supercomputers are growing in diversity and software-stack configurations for supercomputers is a complexity – the arrival of technologies such as multi- difficult and complex task, but with potentially high core processors, general purpose-GPUs and specialised levels of reward. While the ultimate benefit of such sys- compute accelerators has increased the potential sci- tem (re-)configuration and tuning studies is to improve entific delivery possible from such machines. This is application performance, potential improvements may not however without some cost, including significant also include more effective scheduling, kernel parame- increases in the sophistication and complexity of sup- ter refinement, pointers to application redesign and an porting operating systems and software libraries. This assessment of system component upgrades. With the paper documents the development and application of growing complexity of modern systems, the effort re- methods to assess the potential performance of select- quired to discover the elusive combination of hardware ing one hardware, operating system (OS) and software and software-stack settings that improve performance stack combination against another. This is of par- across a range of applications is, in itself, becoming an ticular interest to supercomputing centres, which rou- HPC grand challenge . The complexity of supercom- tinely examine prospective software/architecture com- puters is also set to grow, the introduction of general binations and possible machine upgrades. A case study purpose GPUs (GPGPUs), dense many-core processors is presented that assesses the potential performance of and advanced networking technologies only highlights a particle transport code on AWE’s Cray XT3 8,000- the need to develop effective and scalable techniques core supercomputer running images of the Catamount that allow collections of components to be easily and and the Cray Linux Environment (CLE) operating sys- cheaply compared. tems. This work demonstrates that by running a num- Application Performance Modelling (APM) is such ber of small benchmarks on a test machine and net- a technique. Different from traditional machine bench- work, and observing factors such as operating system marking and statistical analysis, APM has been shown noise, it is possible to speculate as to the performance to provide some of the deepest levels of insight into impact of upgrading from one operating system to an- the performance of commodity clusters [6], super- other on the system as a whole. This use of perfor- computers [9, 13] and their associated applications; mance modelling represents an inexpensive method of these studies include efficient application configura- examining the likely behaviour of a large supercom- tion [15], machine selection [8] and more recently puter before and after an operating system upgrade; this ahead-of-implementation algorithm optimisation [14]. method is also attractive if it is desirable to minimise This study concerns the use of these techniques in system downtime while exploring software-system up- assessing the choice of operating system. Although grades. The results show that benchmark tests run on Linux-based OSs are dominant in the supercomputing less than 256 cores would suggest that the impact (over- industry, proprietary lighter-weight kernels and more head) of upgrading the operating system to CLE was traditional environments such as Solaris, AIX and Win- less than 10%; model projections suggest that this is dows HPC provide viable alternatives. Each operating not the case at scale. system provides a unique blend of programming fea- 1 tures, reliability and performance. Selecting the ap- user support, it will likely provide better performance propriate operating system for a machine is therefore than a full OS image. Tsafrir et al. present a good a complex choice, that must balance each of these re- discussion of the potential dangers of operating system quirements so that the machine provides an efficient noise during computation [19]; this builds on work by runtime environment that is acceptable to system ad- Petrini et al. at the Los Alamos National Laboratory ministrators and applications programmers alike. In on diagnosing the poor performance of the SAGE hy- this paper we use a recently developed simulation- drodynamics code when using a full allocation of pro- based application performance model [7] of an AWE cessors per node [17]. The outcome of Tsafrir’s work particle transport benchmark code to assess the impact is the development of several methods for analytically of upgrading the operating system on AWE’s 8,000- assessing the likely impact of noise on application exe- core Cray XT3 supercomputer. This paper provides a cution. number of novel contributions: In [3] the authors use machine simulation to inves- tigate the impact of two versions of UNIX on the per- • This is the first documented account of the use of formance of memory subsystems for a variety of bench- a simulation-based application performance model marks and full applications. The results obtained are to assess the impact of operating system selection compared with several commonly held views of mem- or upgrades; ory performance, with several cases differing from the • This is the first independent comparison of the expected behaviour. [18] also includes an investigation Cray Catamount and CLE operating systems for of the impact of architectural trends on operating sys- a 3,500+ node Cray XT3 machine; tem performance for the IRIX 5.3 operating system used on large SGI supercomputers. These studies dif- • These are the first published results from a sim- fer from our own in that they present a very low level ulation study that demonstrate the scaling of the analysis of performance when applications are executed AWE particle transport benchmark code on a sys- serially or in very low processor count configurations. tem of 32,768 cores - four times the current system Conducting such an analysis at scale presents several configuration challenges including long simulation times and consid- The remainder of this paper is organised as follows: erable increases in the complexity of modelling contem- Section 2 describes related work; the main case study porary machine components. included in this paper, namely the modelling of AWE The authors of [2] use trace-driven simulation as a Chimaera on Catamount- and CLE-based machines, method for evaluating the impact of operating system is described in Section 3; the paper concludes in Sec- noise (jitter) in the execution of several benchmarks. tion 4. While the results of this work are promising, the meth- ods developed required modification of the Windows 2 Related Work NT kernel to permit necessary runtime instrumenta- tion. This presents obvious difficulties in production The subject of operating system selection is not new. environments. The work presented in this paper re- From their inception in the early 1960s, through UNIX quires no kernel-level modification, instead a user-space systems on the PDP-7 (and later PDP-11) in the early benchmarking process is used to obtain a representa- 1970s, to the MS-DOS and PC-DOS environments of tive profile of the jitter present during execution which more recent years, modern operating systems repre- is then incorporated during simulation. The scale of sent a collage of techniques which have been gradually our simulations are also considerably higher, reflecting refined through each successive hardware generation. the growth in machine sizes since [2] was published. Many of the early operating systems were tailored for In more recent work, [1] benchmarks several peta- particular machines (e.g. IBM’s OS/360 for the Sys- scale capable architectures including the Cray XT se- tem/360 mainframe), and as such there is a wealth of ries and IBM’s BlueGene/L platform. As well as pure academic and commercial literature describing many operating system benchmarking, a series of experi- of the design choices which are present in each system. ments are presented in which injected noise is used to Whilst there is no shortage of operating systems to se- examine the likely effect of unsynchronized background lect from, no one operating system has gained complete operations on the performance of collective communi- dominance. High performance computing has its own cations. The conclusion of the authors is that unless requirements: it may, for example, prove advantageous significant de-synchronisation of nodes occurs, operat- to select a light-weight, low-noise kernel for a compute ing system noise provided a small but limited degrada- node, as although it may be restricted in terms of its tion in runtime. The work in this paper builds upon 2 this benchmarking approach, integrating the output of vironment is not disrupted, a small 128-processor pi- a commonly used noise benchmark in simulations of a lot machine has been used to obtain performance pro- large parallel scientific code. files for the networking and compute behaviour of Chi- maera under both CLE and Catamount. The hard- 3 Case Study: Operating Systems and ware within the Pilot test machine is identical in spec- ification to that of the production system. The exten- the Chimaera Benchmark sions to the existing Chimaera model therefore require: (1) a new compute timing model for the XT3’s AMD The Chimaera benchmark is a three-dimensional par- Opteron processor under Catamount and CLE, (2) net- ticle transport code developed and maintained by work models in the context of both the Catamount and the United Kingdom Atomic Weapons Establishment Cray Linux Environment (CLE) operating systems and (AWE). It employs the wavefront design pattern, a de- (3) the benchmarking of noise profiles for each operat- scription of which can be found in [7, 15].