Performance Impact of the Red Storm Upgrade
Ron Brightwell, Keith D. Underwood, and Courtenay Vaughan
Sandia National Laboratories
P.O. Box 5800, MS-1110
Albuquerque, NM 87185-1110
{rbbrigh, kdunder, ctvaugh}@sandia.gov

Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.

Abstract— The Cray Red Storm system at Sandia National Laboratories recently completed an upgrade of the processor and network hardware. Single-core 2.0 GHz AMD Opteron processors were replaced with dual-core 2.4 GHz AMD Opterons, while the network interface hardware was upgraded from a sustained rate of 1.1 GB/s to 2.0 GB/s. This paper provides an analysis of the impact of this upgrade on the performance of several applications and micro-benchmarks. We show scaling results for applications out to thousands of processors and include an analysis of the impact of using dual-core processors on this system.

I. INTRODUCTION

The emergence of commodity multi-core processors has created several significant challenges for the high-performance computing community. One of the most significant challenges is maintaining the balance of the system as compute performance increases with more and more cores per socket. Increasing the number of cores per socket may not lead to a significant gain in overall application performance unless other characteristics of the system – such as memory bandwidth and network performance – are able to keep pace. As such, understanding the impact of increasing the number of cores per socket on application performance is extremely important.

The Cray XT3 system was designed by Cray and Sandia National Laboratories specifically to meet the system balance criteria that are necessary for effectively scaling several important DOE applications to tens of thousands of processors. Several large XT3 systems have been deployed and have demonstrated excellent performance and scalability on a wide variety of applications and workloads [1], [2]. Since the processor building block for the XT3 system is the AMD Opteron, existing single-core systems can be upgraded to dual-core simply by changing the processor.

The Red Storm system at Sandia, which is the predecessor of the Cray XT3 product line, recently underwent the first stage of such an upgrade on more than three thousand of its nearly thirteen thousand nodes. In order to help maintain system balance, the Red Storm network on these nodes was also upgraded from SeaStar version 1.2 to 2.1. The latest version of the SeaStar provides a significant increase in network bandwidth. Before conducting initial studies of single-core versus dual-core performance, the system software environment had to be enhanced to support multi-core compute nodes. Performance results from a small system were encouraging enough to justify upgrading the full Red Storm system to dual-core nodes. With the first stage of the upgrade completed, we are now able to analyze the impact of the processor and network upgrade on application performance and scaling out to several thousand cores.

The main contribution of this paper is an in-depth analysis of application performance and scaling on a large-scale dual-core system. We provide results for multiple representative DOE applications and compare single-core versus dual-core performance on up to four thousand cores. Few such studies have been conducted on commodity dual-core systems at this scale, and we believe this is the first such study of the popular Cray XT3 platform. This paper also includes an analysis of the performance of the SeaStar 2.1 network using traditional micro-benchmarks. The Red Storm system is the first (and perhaps only) XT3 product to have the SeaStar 2.1 network, which will be the network deployed in the next-generation XT product. The combination of dual-core compute nodes with the SeaStar 2.1 network should therefore provide a glimpse of the level of performance and scaling that will be available in Cray's follow-on system. Finally, this paper provides a description of the enhancements that were made to Sandia's Catamount lightweight kernel environment and Portals network stack to support dual-core XT3 systems.

The rest of this paper is organized as follows. In the next section, we describe how this study complements previously published research. Section II contains the specific details of the processor, network, and system software upgrade, including a detailed description of the necessary changes to Catamount. Micro-benchmark performance for the SeaStar 2.1 network is presented in Section III, while higher-level network benchmark results are shown in Section IV. Section V provides a detailed analysis of several real-world application codes, both in terms of application scaling and absolute system performance. We summarize the conclusions of this study in Section VI.

II. OVERVIEW OF SYSTEM UPGRADE

The Red Storm system was designed to accommodate both a processor and a network upgrade. The original system had 2.0 GHz single-core Opteron processors for a peak performance of 4 GFLOPS per node. It was anticipated that dual-core processors would become available in the lifetime of the machine, and the current upgrade is installing socket-compatible 2.4 GHz dual-core Opteron processors that have a peak performance of 9.6 GFLOPS per node, an improvement of almost 2.5×.

At Sandia's request, the board-level design placed the network chips on a mezzanine that could easily be replaced. Thus, along with the processor upgrade, the SeaStar 1.2 network chips are being replaced with SeaStar 2.1 network chips. The new network chips increase the sustained unidirectional bandwidth from 1.1 GB/s to 2.1 GB/s and the sustained bidirectional bandwidth from 2.2 GB/s to 3.6 GB/s. This increase helps to maintain the balance of the system, but the change is primarily a change in injection bandwidth: SeaStar 1.2 chips could receive up to 2 GB/s, but they could only send 1.1 GB/s. Notably, neither the message latency nor the message rate properties of the SeaStar itself were changed. Similarly, the router link bandwidth has not increased. The router links, which could only be half subscribed by a SeaStar 1.2 node, can now be almost fully subscribed by a single SeaStar 2.1 node; thus, network contention will be significantly increased.
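For concreteness, the arithmetic behind these figures can be sketched as follows. One assumption here is not stated above: that each Opteron core retires at most two floating-point operations per clock cycle, which is what reproduces the quoted peaks. Bytes per FLOP of injection bandwidth, computed from the sustained rates quoted above, serves as a rough balance metric:

\[
\begin{aligned}
P_{\text{single}} &= 1\ \text{core} \times 2\ \tfrac{\text{FLOP}}{\text{cycle}} \times 2.0\ \text{GHz} = 4.0\ \text{GFLOPS},\\
P_{\text{dual}} &= 2\ \text{cores} \times 2\ \tfrac{\text{FLOP}}{\text{cycle}} \times 2.4\ \text{GHz} = 9.6\ \text{GFLOPS},
\qquad \frac{P_{\text{dual}}}{P_{\text{single}}} = 2.4,\\
\left.\frac{B}{F}\right|_{\text{before}} &= \frac{1.1\ \text{GB/s}}{4.0\ \text{GFLOPS}} \approx 0.28\ \tfrac{\text{bytes}}{\text{FLOP}},
\qquad
\left.\frac{B}{F}\right|_{\text{after}} = \frac{2.1\ \text{GB/s}}{9.6\ \text{GFLOPS}} \approx 0.22\ \tfrac{\text{bytes}}{\text{FLOP}}.
\end{aligned}
\]

By this measure the upgrade holds the injection balance to within roughly 20% of the original design point, whereas leaving the network at 1.1 GB/s would have cut the ratio by more than half.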
III. MICRO-BENCHMARK RESULTS

We used two micro-benchmarks to compare the SeaStar 1.2 and SeaStar 2.1 network interfaces. We began with a simple ping-pong benchmark to measure the latency between nodes. We also looked at the impacts of the upgrade on message rate and peak bandwidth. To measure peak bandwidth, we used a benchmark developed by the Ohio State University, which posts several messages (64) at the receiver and then sends a stream of messages from the sender.
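Neither benchmark's source is reproduced in this paper, but the following minimal MPI sketch captures both measurement patterns as just described: a ping-pong loop for latency, and a streaming loop in which the receiver pre-posts a window of 64 receives (the "(64)" in the text) while the sender streams a burst of messages at it. The repetition count, message-size sweep, and per-window acknowledgment are illustrative assumptions patterned after the OSU bandwidth test, not details taken from this paper.

/*
 * Minimal sketches of the two micro-benchmarks described above,
 * written against plain MPI (the level at which results are reported).
 * Illustrative stand-ins only, not the actual benchmark sources.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define WINDOW 64        /* receives pre-posted per burst (from the text) */
#define REPS   100       /* repetitions; a real benchmark scales this by size */
#define MAXMSG (1 << 22) /* largest message swept, 4 MiB (assumed) */

/* Half round-trip time between ranks 0 and 1: classic ping-pong latency. */
static double ping_pong(int rank, char *buf, int nbytes)
{
    MPI_Status st;
    double t0 = MPI_Wtime();
    for (int i = 0; i < REPS; i++) {
        if (rank == 0) {
            MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &st);
        } else {
            MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
            MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    return (MPI_Wtime() - t0) / (2.0 * REPS);   /* seconds, one way */
}

/* Streaming bandwidth: the receiver posts a window of receives and the
 * sender streams a burst of sends at it; an ack closes each window. */
static double stream_bw(int rank, char *buf, int nbytes)
{
    MPI_Request req[WINDOW];
    char ack = 0;
    double t0 = MPI_Wtime();
    for (int i = 0; i < REPS; i++) {
        if (rank == 1) {                        /* receiver */
            for (int w = 0; w < WINDOW; w++)
                MPI_Irecv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &req[w]);
            MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
            MPI_Send(&ack, 1, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
        } else {                                /* sender */
            for (int w = 0; w < WINDOW; w++)
                MPI_Isend(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &req[w]);
            MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
            MPI_Recv(&ack, 1, MPI_CHAR, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    }
    return (double)nbytes * WINDOW * REPS / (MPI_Wtime() - t0); /* bytes/s */
}

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    char *buf = malloc(MAXMSG);
    if (rank < 2) {                             /* run with exactly two ranks */
        for (int nbytes = 1; nbytes <= MAXMSG; nbytes *= 2) {
            double lat = ping_pong(rank, buf, nbytes);
            double bw  = stream_bw(rank, buf, nbytes);
            if (rank == 0)
                printf("%8d B  %8.2f us  %10.2f MB/s\n",
                       nbytes, lat * 1e6, bw / 1e6);
        }
    }
    free(buf);
    MPI_Finalize();
    return 0;
}

Run with exactly two ranks (e.g., mpiexec -n 2 a.out). Mapping the two ranks to different nodes measures the network path, while mapping both onto one node exercises the proxied on-node path discussed next.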
A. Ping-Pong Latency and Bandwidth

Figure 1(a) shows the impact of the upgrade on ping-pong latency. While the SeaStar 2.1 did not incorporate any features that should reduce latency, the move from a 2.0 GHz processor to a 2.4 GHz processor reduced ping-pong latency by a full microsecond. This is because the majority of Portals processing happens on the host processor in an interrupt-driven mode, so a faster host processor completes that processing sooner. Similarly, running two processes on one node currently means that both the send and receive work happen on one processor (system calls are proxied to one processor in a node), and this increases latency by over 2 microseconds. Future versions of the Cray software are supposed to short-circuit the trip through the SeaStar to drastically reduce latencies between two cores on one node.

[Fig. 1. Ping-pong latency (a) and streaming bandwidth (b) versus message size, comparing one 2.0 GHz core with SeaStar 1.2, one 2.4 GHz core with SeaStar 2.1, and two 2.4 GHz cores with SeaStar 2.1; latency in microseconds, bandwidth in million bytes per second.]

The step in latency between 16 and 32 bytes is attributable to the need to take an extra interrupt for messages that do not fit in the header packet (messages greater than 16 bytes). This step has been a constraint in the Red Storm system, but it should go away when Cray implements a fully offloaded Portals [3].

B. OSU Streaming Bandwidth

In Figure 1(b), the improvements in the SeaStar 2.1 are evident as a two-fold increase in sustained, streaming bandwidth. We can also see an almost 20% improvement in message rate that is provided by the 20% boost in the processor clock rate. Running a second process on the node, however, consumes some of the bandwidth that would otherwise be used to receive data.

IV. HIGHER-LEVEL BENCHMARKS

While micro-benchmarks provide some interesting insights into the network properties of individual nodes, higher-level benchmarks can provide much more insight into overall system-level behaviors. Thus, we have included results from the Pallas MPI benchmark suite [4] and from the HPC Challenge benchmark suite [5].

A. Pallas

The Pallas benchmarks provide a slightly higher-level view of system-level network performance by measuring the performance of numerous MPI collectives. Figure 2 presents data for three collectives that are particularly