Performance Impact of the Red Storm Upgrade

Ron Brightwell, Keith D. Underwood, and Courtenay Vaughan
Sandia National Laboratories
P.O. Box 5800, MS-1110
Albuquerque, NM 87185-1110
{rbbrigh, kdunder, ctvaugh}@sandia.gov

Abstract— The Red Storm system at Sandia National Laboratories recently completed an upgrade of the processor and network hardware. Single-core 2.0 GHz AMD Opteron processors were replaced with dual-core 2.4 GHz AMD Opterons, while the network interface hardware was upgraded from a sustained rate of 1.1 GB/s to 2.0 GB/s. This paper provides an analysis of the impact of this upgrade on the performance of several applications and micro-benchmarks. We show scaling results for applications out to thousands of processors and include an analysis of the impact of using dual-core processors on this system.

Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.

I. INTRODUCTION

The emergence of commodity multi-core processors has created several significant challenges for the high-performance computing community. One of the most significant challenges is maintaining the balance of the system as the compute performance increases with more and more cores per socket. Increasing the number of cores per socket may not lead to a significant gain in overall application performance unless other characteristics of the system, such as memory bandwidth and network performance, are able to keep pace. As such, understanding the impact of increasing the number of cores per socket on application performance is extremely important.

The Cray XT3 system was designed by Cray and Sandia National Laboratories specifically to meet the system balance criteria that are necessary for effectively scaling several important DOE applications to tens of thousands of processors. Several large XT3 systems have been deployed and have demonstrated excellent performance and scalability on a wide variety of applications and workloads [1], [2]. Since the processor building block for the XT3 system is the AMD Opteron, existing single-core systems can be upgraded to dual-core simply by changing the processor.

The Red Storm system at Sandia, which is the predecessor of the Cray XT3 product line, recently underwent the first stage of such an upgrade on more than three thousand of its nearly thirteen thousand nodes. In order to help maintain system balance, the Red Storm network on these nodes was also upgraded from SeaStar version 1.2 to 2.1. The latest version of the SeaStar provides a significant increase in network bandwidth. Before conducting initial studies of single-core versus dual-core performance, the system software environment had to be enhanced to support multi-core compute nodes. Performance results from a small system were encouraging enough to justify upgrading the full Red Storm system to dual-core nodes. With the first stage of the upgrade completed, we are now able to analyze the impact of the processor and network upgrade on application performance and scaling out to several thousand cores.

The main contribution of this paper is an in-depth analysis of application performance and scaling on a large-scale dual-core system. We provide results for multiple representative DOE applications and compare single-core versus dual-core performance on up to four thousand cores. Few such studies have been conducted on commodity dual-core systems at this scale, and we believe this is the first such study of the popular Cray XT3 platform.

This paper also includes an analysis of the performance of the SeaStar 2.1 network using traditional micro-benchmarks. The Red Storm system is the first (and perhaps only) XT3 product to have the SeaStar 2.1 network, which will be the network deployed in the next-generation XT product. The combination of dual-core compute nodes with the SeaStar 2.1 network should provide a glimpse of the level of performance and scaling that will be available in Cray's follow-on system. Also, this paper provides a description of the enhancements that were made to Sandia's Catamount lightweight kernel environment and Portals network stack to support dual-core XT3 systems.

The rest of this paper is organized as follows. In the next section, we describe how this study complements previously published research. Section II contains the specific details of the processor, network, and system software upgrade, including a detailed description of the necessary changes to Catamount. Micro-benchmark performance for the SeaStar 2.1 network is presented in Section III, while higher-level network benchmark results are shown in Section IV. Section V provides a detailed analysis of several real-world application codes, both in terms of application scaling and absolute system performance. We summarize the conclusions of this study in Section VI.

II. OVERVIEW OF SYSTEM UPGRADE

The Red Storm system was designed to accommodate both a processor and a network upgrade. The original system had 2.0 GHz single-core Opteron processors for a peak performance of 4 GFLOPs per node. It was anticipated that dual-core processors would become available in the lifetime of the machine, and the current upgrade is installing socket-compatible 2.4 GHz dual-core Opteron processors that have a peak performance of 9.6 GFLOPs per node, an improvement of almost 2.5×.

At Sandia's request, the board-level design placed the network chips on a mezzanine that could easily be replaced. Thus, along with the processor upgrade, the SeaStar 1.2 network chips are being replaced with SeaStar 2.1 network chips. The new network chips increase the sustained unidirectional bandwidth from 1.1 GB/s to 2.1 GB/s and the sustained bidirectional bandwidth from 2.2 GB/s to 3.6 GB/s. This increase helps to maintain the balance of the system, but the change is primarily a change in injection bandwidth: SeaStar 1.2 chips could receive up to 2 GB/s, but they could only send 1.1 GB/s. Notably, neither the message latency nor the message rate properties of the SeaStar itself were changed. Similarly, the router link bandwidth has not increased. The router links, which could only be half subscribed by a SeaStar 1.2 node, can now be almost fully subscribed by a single SeaStar 2.1 node; thus, network contention will be significantly increased.

III. MICRO-BENCHMARK RESULTS

We used two micro-benchmarks to compare the SeaStar 1.2 and SeaStar 2.1 network interfaces. We began with a simple ping-pong benchmark to measure the latency between nodes. We also looked at the impacts of the upgrade on message rate and peak bandwidth. To measure peak bandwidth, we used a benchmark developed by the Ohio State University, which posts several messages (64) at the receiver and then sends a stream of messages from the sender.

A. Ping-Pong Latency and Bandwidth

Figure 1(a) shows the impact of the upgrade on ping-pong latency. While the SeaStar 2.1 did not incorporate any features that should reduce latency, the move from a 2.0 GHz processor to a 2.4 GHz processor reduced ping-pong latency by a full microsecond. This is because the majority of Portals processing happens on the host processor in an interrupt-driven mode. Similarly, placing two processes on one node currently means that both the send and receive work happens on one processor (system calls are proxied to one processor in a node), and this increases latency by over 2 microseconds. Future versions of the Cray software are supposed to short-circuit the trip through the SeaStar to drastically reduce latencies between two cores on one node.

The step in latency between 16 and 32 bytes is attributable to the need to take an extra interrupt for messages that do not fit in the header packet (messages greater than 16 bytes). This step has been a constraint in the Red Storm system, but it should go away when Cray implements a fully offloaded Portals [3].
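To make the measurement concrete, the following is a minimal sketch of a ping-pong latency test of the kind described above. It is an illustration only, not the exact benchmark used in this study; the message size, iteration count, and reporting are arbitrary choices.

    /* Minimal MPI ping-pong latency sketch (illustrative only).
     * Rank 0 sends to rank 1, which echoes the message back; half of
     * the average round-trip time approximates the one-way latency. */
    #include <mpi.h>
    #include <stdio.h>

    #define NITERS 1000

    static char sbuf[1 << 20];
    static char rbuf[1 << 20];

    int main(int argc, char **argv)
    {
        int rank, i, size = 8;      /* 8-byte messages; vary to sweep sizes */
        double t0, t1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (i = 0; i < NITERS; i++) {
            if (rank == 0) {
                MPI_Send(sbuf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(rbuf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(rbuf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(sbuf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        t1 = MPI_Wtime();

        if (rank == 0)
            printf("%d bytes: %.2f us one-way\n", size,
                   (t1 - t0) * 1.0e6 / (2.0 * NITERS));

        MPI_Finalize();
        return 0;
    }

The same pattern can be run either between two nodes or between the two cores of one socket to expose the on-node penalty discussed above.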

Fig. 1. Ping-pong latency (a) and streaming bandwidth (b)

B. OSU Streaming Bandwidth

In Figure 1(b), the improvements in the SeaStar 2.1 are evident as a two-fold increase in sustained, streaming bandwidth. We can also see an almost 20% improvement in message rate that is provided by the 20% boost in the processor clock rate. The significant degradation in message rate for having two processes on one node (in the dual-core case) is attributable to the same source as the latency increase in Figure 1(a): competition for processing resources on the node. Again, this reduction in messaging rate would largely evaporate with an offloaded Portals implementation.

The reduction in peak bandwidth when two processes are on one node stems from competition for HyperTransport bandwidth. The transmit DMA engine must perform a HyperTransport read to obtain data. This consumes some of the bandwidth that would otherwise be used to receive data.
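The streaming measurement works by keeping many messages in flight at once. The sketch below shows that pattern; it is an illustration, not the OSU code itself. The window of 64 matches the description in Section III, but the buffer handling and the end-of-window acknowledgment are our own simplifications.

    /* Sketch of an OSU-style streaming bandwidth test (illustrative only).
     * The receiver pre-posts a window of receives; the sender then streams
     * a window of non-blocking sends, and both sides wait for completion.
     * The barrier approximately ensures the receives are posted first. */
    #include <mpi.h>
    #include <stdio.h>

    #define WINDOW  64
    #define MSGSIZE (1 << 20)     /* 1 MB messages; vary to sweep sizes */

    static char buf[WINDOW][MSGSIZE];

    int main(int argc, char **argv)
    {
        int rank, i;
        char ack = 0;
        double t0;
        MPI_Request req[WINDOW];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        if (rank == 1) {                  /* receiver: pre-post all receives */
            for (i = 0; i < WINDOW; i++)
                MPI_Irecv(buf[i], MSGSIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &req[i]);
            MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
            MPI_Send(&ack, 0, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
        } else if (rank == 0) {           /* sender: stream the whole window */
            for (i = 0; i < WINDOW; i++)
                MPI_Isend(buf[i], MSGSIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &req[i]);
            MPI_Waitall(WINDOW, req, MPI_STATUSES_IGNORE);
            MPI_Recv(&ack, 0, MPI_CHAR, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("%.1f MB/s\n",
                   (double)WINDOW * MSGSIZE / (MPI_Wtime() - t0) / 1.0e6);
        }
        MPI_Finalize();
        return 0;
    }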

IV. HIGHER-LEVEL BENCHMARKS

While micro-benchmarks provide some interesting insights into the network properties of individual nodes, higher-level benchmarks can provide much more insight into overall system-level behaviors. Thus, we have included results from the Pallas MPI benchmark suite [4] and from the HPC Challenge benchmark suite [5].

A. Pallas

The Pallas benchmarks provide a slightly higher-level view of system-level network performance by measuring the performance of numerous MPI collectives. Figure 2 presents data for three collectives that are particularly common in scientific applications: Alltoall, Allreduce, and Reduce. In each of the three, the performance of the collectives at small message sizes is dominated by latency. As such, the upgrade provides an advantage that is commensurate with the improvements in latency that are seen with the faster processors. Furthermore, moving to two cores per socket does not suffer from the sacrifice in latency experienced by a ping-pong test between cores in a single socket. This is because most of the collectives involve (at most) one or two communications between the cores in one socket, and not contention at every phase of the algorithm. At larger sizes, point-to-point bandwidths dominate for both the Allreduce and Reduce. Thus, the upgraded platform continues to see remarkable improvements from the improvement in bandwidth. In contrast, Alltoall performance at large message sizes can become constrained by bisection bandwidth, so the curves start to converge. A remarkably bad region can actually be seen in the Alltoall curve where using both cores on a single socket causes significant degradation in performance. This is likely caused by a particularly poor allocation that causes regular, synchronized link contention between the cores in a socket. The collective algorithms should generally be tuned to better match the topology of the machine, with a particular focus on considering the impact of multi-core.
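Collective benchmarks of this kind simply time a collective over many repetitions and report a per-operation average, typically taking the maximum across ranks. The sketch below shows the pattern for MPI_Allreduce; the repetition count and message size are illustrative choices of ours, not those used by the Pallas suite.

    /* Illustrative timing loop for a collective, in the spirit of the
     * Pallas measurements (not the actual benchmark code). Each rank
     * times many repetitions of MPI_Allreduce, and the slowest rank's
     * average time per operation is reported. */
    #include <mpi.h>
    #include <stdio.h>

    #define REPS  1000
    #define COUNT 1024            /* number of doubles per rank */

    int main(int argc, char **argv)
    {
        double in[COUNT], out[COUNT], t, tmax;
        int rank, i, r;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        for (i = 0; i < COUNT; i++)
            in[i] = (double)i;

        MPI_Barrier(MPI_COMM_WORLD);
        t = MPI_Wtime();
        for (r = 0; r < REPS; r++)
            MPI_Allreduce(in, out, COUNT, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        t = (MPI_Wtime() - t) / REPS;

        MPI_Reduce(&t, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("MPI_Allreduce, %d doubles: %.2f us\n", COUNT, tmax * 1.0e6);
        MPI_Finalize();
        return 0;
    }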
B. HPC Challenge

The HPC Challenge benchmarks provide a much broader view of system performance than simply network measurements. Performance measurements cover processor performance (HPL), memory performance (STREAMS), small-message network performance (RandomAccess), network bisection bandwidth performance (PTRANS), and a coupled processor/network test (FFT). While far short of a real application, this benchmark suite in its baseline form captures many more aspects of system performance than other benchmarks.

Figure 3 presents the performance of the more processor-centric HPCC metrics. As expected, moving from 2.0 GHz processors to 2.4 GHz processors provided approximately a 20% boost per core for HPL. FFT received a smaller (10%) gain, as it is a more memory-bound code. In contrast, STREAMS actually lost a bit of performance. This is not surprising, since the benchmarks used the configurations tuned for the slower processors, which may not be optimal for the faster processor.

Figure 3 also presents dual-core performance data along with the percent improvement per socket achieved by using two cores instead of one (while keeping the problem size constant). HPL is clearly the largest winner with an 80 to 90% gain. FFT achieves a somewhat surprising 20 to 40% win, indicating that the second cache is providing somewhat of a benefit. Finally, STREAMS achieves a remarkable 10 to 30% improvement. A growing trend in microprocessors is an inability to sustain higher bandwidth due to the excessive relative memory latency. Here we see evidence that two independent issue streams in one socket can extract significantly more memory bandwidth from the memory controller.

The PTRANS and RandomAccess components of the HPCC benchmark are shown in Figure 4. Both of these benchmarks are much more network-centric and measure aspects of the network that were not improved by the upgrade. PTRANS focuses on bisection bandwidth, which did not change since the router links did not change. Likewise, the MPI message rate, which is measured by RandomAccess, did not change with the network upgrade. RandomAccess did, however, see a significant gain from the almost 20% gain in MPI message rate that was provided by the processor upgrade.

Both the PTRANS and RandomAccess benchmarks also show mixed results from using the second core in the socket. In the case of PTRANS, a good placement of MPI ranks on the nodes would yield much better performance because more communications would be "on node"; however, the allocation algorithm currently in use in the Cray software stack is particularly poor. This is particularly evident in the case where using dual cores yields a large performance loss. RandomAccess gets no gains in any cases from the second core. At large scales, the additional traffic to the same socket is negligible. Furthermore, two cores now contend for one network interface, which gives each one less than half of the message throughput.
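The memory-bandwidth behavior attributed to STREAMS above can be reproduced qualitatively with a simple triad kernel. The sketch below is not the HPCC STREAM code; running one copy on a socket, and then one copy pinned to each of the two cores (for example, via the job launcher's CPU binding), gives a rough measure of how much additional bandwidth a second issue stream can extract.

    /* Minimal STREAM-style triad kernel (illustrative only; not the HPCC
     * STREAM benchmark). Arrays are sized well beyond the caches so the
     * loop is limited by memory bandwidth. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N    (1L << 23)       /* 8M doubles per array (~64 MB each) */
    #define REPS 10

    int main(void)
    {
        double *a = malloc(N * sizeof(double));
        double *b = malloc(N * sizeof(double));
        double *c = malloc(N * sizeof(double));
        const double scalar = 3.0;
        long i;
        int r;
        struct timespec t0, t1;
        double secs, bytes;

        if (!a || !b || !c)
            return 1;
        for (i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (r = 0; r < REPS; r++)
            for (i = 0; i < N; i++)
                a[i] = b[i] + scalar * c[i];     /* triad: three streams */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        secs  = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        bytes = 3.0 * sizeof(double) * N * REPS; /* read b, c; write a */
        printf("triad: %.2f GB/s (check %.1f)\n", bytes / secs / 1e9, a[N / 2]);
        free(a); free(b); free(c);
        return 0;
    }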

Fig. 2. 64-node Pallas benchmarks for (a) Alltoall, (b) Allreduce, and (c) Reduce

V. APPLICATIONS

We considered the impact of the Red Storm upgrade on three applications from three general perspectives. The applications include SAGE, PARTISN, and CTH. Each of these is a commonly used application-level benchmark within the DOE complex.

A. SAGE

SAGE, SAIC's Adaptive Grid Eulerian hydrocode, is a multi-dimensional, multi-material, Eulerian hydrodynamics code with adaptive mesh refinement that uses second-order accurate numerical techniques [6]. It represents a large class of production applications at Los Alamos National Laboratory. It is a large-scale parallel code written in Fortran 90 that uses MPI for inter-processor communications. It routinely runs on thousands of processors for months at a time.

B. PARTISN

The PArallel, TIme-dependent SN (PARTISN) code package is designed to solve the time-independent or time-dependent multigroup discrete ordinates form of the Boltzmann transport equation in several different geometries. It provides neutron transport solutions on orthogonal meshes with adaptive mesh refinement in one, two, and three dimensions. A multi-group energy treatment is used in conjunction with the Sn angular approximation. A significant effort has been devoted to making the code run efficiently on massively parallel systems. It can be coupled to nonlinear multi-physics codes that run for weeks on thousands of processors to finish one simulation.

C. CTH

CTH is a multi-material, large deformation, strong shock wave, solid mechanics code developed at Sandia. CTH has models for multi-phase, elastic viscoplastic, porous, and explosive materials. Three-dimensional rectangular meshes; two-dimensional rectangular and cylindrical meshes; and one-dimensional rectilinear, cylindrical, and spherical meshes are available. It uses second-order accurate numerical methods to reduce dispersion and dissipation and to produce accurate, efficient results. CTH is used for studying armor/anti-armor interactions, warhead design, high explosive initiation physics, and weapons safety issues.

Fig. 3. (a) HPL, (b) FFT, and (c) STREAMS results

Fig. 4. (a) PTRANS and (b) RandomAccess results

D. Scalability Impact

When analyzing the impact of the upgrade, there are three ways to think of the data. The first viewpoint considers the impact of moving from SeaStar 1.2 to SeaStar 2.1 on application scalability. Another viewpoint asks the question: how is overall scalability impacted by having two cores share a connection to the network? Finally, we can consider the performance improvements in terms of the performance gain per socket.

Figures 5(a) and (b) highlight the scalability of PARTISN in the diffusion and transport sections, respectively. Scalability is clearly not a problem with the move to faster processors. Furthermore, on larger problem sizes, the dual-core processors show very little scalability impact as the number of processors increases; however, the problem is very sensitive to the contention for resources (both the network interface and memory bandwidth) that arises when two cores are placed in one socket.

Another interesting note is in the jaggedness of the lines in Figure 5(b). Parallel efficiency should be approximately monotonic; however, we see numerous points where it increases as we move to larger numbers of processors. These increases typically occur because of particularly good mappings between the problem specification and the way the allocator places the MPI ranks on the physical nodes. These data points could be greatly smoothed by moving to a better allocation algorithm [7].

Figure 5(c) shows scalability results for CTH on two scaled speedup problem sizes (the problem size shown is per node). Notably, the overall upgrade was neutral in terms of scalability for a single core per node at the larger problem size. Similarly, moving to two cores per node takes an overall parallel efficiency drop as two MPI tasks contend for memory bandwidth; however, that drop is constant and overall scalability is relatively unimpacted. At the smaller problem size, however, we see one of the weaknesses of the upgrade in that the upgraded processors do not scale quite as well as the slower processors. There are two sources of this issue. Foremost, the processor performance improved by more (20%) for the smaller problem than for the larger problem (10%). In addition, MPI latency did not improve as much as MPI bandwidth, which impacts applications with smaller messages (in this case, from a smaller problem size).

Like CTH, SAGE scales extremely well on the upgraded machine, as indicated in Figure 5(d). In fact, the parallel efficiency of the single-core and dual-core systems begins to converge at large scales as scaling issues begin to dominate single-processor performance issues. Thus, it is clear for both SAGE and CTH that contention for the network interface is not a particular issue. The only anomaly is at 512 nodes, where the pre-upgrade machine has a "magic data point" that is reproducible. As with the PARTISN transport problem, these points periodically arise as the application mapping to the nodes hits a particularly good configuration (or a bad configuration in the case of PARTISN).

E. Dual-Core Improvement

Another way to consider the upgrade is based on the improvement offered by using dual-core processors instead of single-core processors. This view holds the problem size per socket constant (since the memory size is constant) and graphs absolute performance in terms of time on the left axis and percent improvement on the right axis versus the number of sockets on the X-axis. The PARTISN diffusion problem (Figure 6(a, c)) sees improvements of 20 to 40% on small numbers of sockets; however, at large scale, the use of dual-core processors offers very little advantage. In fact, there can even be a slight performance loss associated with the drastic increase in MPI tasks needed to support two cores in each socket! In stark contrast, the transport problem (shown in Figure 6(b, d)) achieves a consistent 25 to 50% performance improvement from using the second core.

Fig. 5. Parallel efficiency impacts on PARTISN (a) diffusion and (b) transport portions and parallel efficiency impacts on (c) CTH and (d) SAGE

Much like the diffusion portion of PARTISN, CTH receives an impressive 20 to 30% improvement per socket by using dual-core processors. This is impressive because the number of MPI tasks has doubled, but the total work has not. Thus, the work per time step per core is reduced by a factor of two when using dual-core processors in these cases.

VI. CONCLUSIONS

This paper has described the dual-core and network upgrade to the Red Storm system at Sandia National Laboratories. This is the first Cray XT3 system to be upgraded to dual-core Opteron processors and SeaStar 2.1 network chips.

Network micro-benchmarks show that half round-trip latency has decreased by nearly 15% due to improvements in the performance of the Opteron. Peak unidirectional bandwidth has almost doubled, and peak bidirectional bandwidth has increased by almost 80%. Improved performance of a single core of the Opteron has also improved small-message throughput by over 20%.

We presented three perspectives on the upgrade. First, we considered single-core, 2.0 GHz Opterons with SeaStar 1.2 parts and compared them to using a single 2.4 GHz core on each node with SeaStar 2.1 parts. This provided insight into the improvements provided by the increase in network bandwidth. Second, we compared the scaling efficiency of two cores per node in the upgraded system to the scaling efficiency of one core per node in the upgraded system. This provided a different look at the scaling impacts of network bandwidth in the system.

Fig. 6. Percent improvement for PARTISN diffusion (a, c) and transport (b, d) portions for 24³ (a, b) and 48³ (c, d) problems

Fig. 7. Percent improvement for CTH sizes 50x120x50 (a) and 80x192x80 (b)

Finally, we compared absolute performance per socket for using a single 2.4 GHz core with use of both cores to provide a look at the advantage obtained from a dual-core upgrade.

Our results showed that adding a second core provides a 20% to 50% performance boost to real applications on a fixed problem-size-per-socket basis. Furthermore, the results indicate that scalability is impacted relatively little by the upgrade. Most degradation in parallel efficiency is directly attributable to contention for resources caused by having two cores in one socket; however, with the doubling in MPI tasks that is typical of using dual-core processors, it is possible to see scaling effects that are detrimental to overall performance when running at the largest scale.

REFERENCES

[1] J. S. Vetter, S. R. Alam, J. T. H. Dunigan, M. R. Fahey, P. C. Roth, and P. H. Worley, "Early evaluation of the Cray XT3," in 20th International Parallel and Distributed Processing Symposium, April 2006, pp. 25-29.
[2] A. Hoisie, G. Johnson, D. J. Kerbyson, M. Lang, and S. Pakin, "A performance comparison through benchmarking and modeling of three leading supercomputers: Blue Gene/L, Red Storm, and Purple," in Proceedings of the IEEE/ACM International Conference on High-Performance Computing, Networking, Storage, and Analysis (SC'06), November 2006.
[3] R. Brightwell, T. Hudson, K. T. Pedretti, and K. D. Underwood, "SeaStar interconnect: Balanced bandwidth for scalable performance," IEEE Micro, vol. 26, no. 3, May/June 2006.
[4] Pallas MPI Benchmarks, http://www.pallas.com/e/products/pmb/index.htm.
[5] P. Luszczek, J. Dongarra, D. Koester, R. Rabenseifner, R. Lucas, J. Kepner, J. McCalpin, D. Bailey, and D. Takahashi, "Introduction to the HPC Challenge benchmark suite," March 2005, http://icl.cs.utk.edu/hpcc/pubs/index.html.
[6] D. J. Kerbyson, H. J. Alme, A. Hoisie, F. Petrini, H. J. Wasserman, and M. Gittings, "Predictive performance and scalability modeling of a large-scale application," in Proceedings of the ACM/IEEE International Conference on High-Performance Computing and Networking (SC'01), November 2001.
[7] M. A. Bender, D. P. Bunde, E. D. Demaine, S. P. Fekete, V. J. Leung, H. Meijer, and C. A. Phillips, "Communication-aware processor allocation for supercomputers," in International Workshop on Algorithms and Data Structures, August 2005.