Performance Analysis of the Alpha 21364-based HP GS1280 Multiprocessor

Zarka Cvetanovic, Hewlett-Packard Corporation
[email protected]

Abstract

This paper evaluates performance characteristics of the HP GS1280 shared memory multiprocessor system. The GS1280 system contains up to 64 Alpha 21364 CPUs connected together via a torus-based interconnect. We describe architectural features of the GS1280 system. We compare and contrast the GS1280 to the previous-generation Alpha systems: AlphaServer GS320 and ES45/SC45. We further quantitatively show the performance effects of these features using application results and profiling data based on the built-in performance counters. We find that the HP GS1280 often provides 2 to 3 times the performance of the AlphaServer GS320 at similar clock frequencies. We find the key reasons for such performance gains are advances in the memory, inter-processor, and I/O subsystem designs.

1. Introduction

The HP AlphaServer GS1280 is a shared memory multiprocessor containing up to 64 fourth-generation Alpha 21364 processors [1]. The GS1280 system contains many architectural advances, both in the processor and in the surrounding memory system, that contribute to its performance. The 21364 processor [1][16] uses the same core as the previous-generation 21264 processor [4]; however, the 21364 includes three additional components: (1) an on-chip L2 cache, (2) two on-chip Direct Rambus (RDRAM) memory controllers, and (3) a router. The combination of these components helped achieve improved access times to the L2 cache and to local/remote memory. These improvements enhanced single-CPU performance and contributed to excellent multiprocessor scaling. We describe and analyze these architectural advances and present key results and profiling data to clarify the benefits of these design features. We contrast the GS1280 to two previous-generation Alpha systems, both based on the 21264 processor: GS320, a 32-CPU SMP NUMA system with a switch-based interconnect [2], and SC45, 4-CPU ES45 systems connected in a cluster configuration via a fast Quadrics switch [4][5][6][7].

Figure 1 compares the performance of the GS1280 to the other systems using SPECfp_rate2000, a standard multiprocessor throughput benchmark [8]. We show the published SPECfp_rate2000 results as of March 2003, with the exception of the 32P GS1280, for which the data was measured on an engineering prototype but has not yet been published. We use floating-point rather than integer SPEC benchmarks for this comparison, since several of the floating-point benchmarks stress memory bandwidth, while all integer benchmarks fit well in MB-size caches and thus are not a good indicator of memory system performance. The results in Figure 1 indicate that the GS1280 scales well in memory-bandwidth intensive workloads and has a substantial performance advantage over the previous-generation Alpha platforms despite a disadvantage in processor clock frequency. We analyze the key performance characteristics of the GS1280 in this paper to expose the design features that allowed the GS1280 to reach such performance levels.

Figure 1. SPECfp_rate2000 (peak) comparison for 1-32 CPUs: HP GS1280/1.15GHz (1-16P published, 32P estimated), HP SC45/1.25GHz (1-4P published, >4P estimated), HP GS320/1.2GHz (published), IBM p690/650 Turbo/1.3/1.45GHz (published), SGI Altix 3K/1GHz (published), SUN Fire V480/V880/0.9GHz (published).
We include results from kernels that exercise the memory subsystem [9][10]. We include profiling results for standard benchmarks (SPEC CPU2000 [8]). In addition, we analyze characteristics of representatives from three application classes that impose various levels of stress on the memory subsystem and processor interconnect. We use profiles based on the built-in non-intrusive CPU hardware monitors [3]. These monitors are useful tools for analyzing system behavior with various workloads. In addition, we use tools based on the EV7-specific performance counters: Xmesh [11]. Xmesh is a graphical tool that displays run-time information on the utilization of CPUs, memory controllers, inter-processor (IP) links, and I/O ports.

The remainder of this paper is organized as follows: Section 2 describes the architecture of the GS1280 system. Section 3 describes the memory system improvements in the GS1280. Section 4 describes the inter-processor performance characteristics. Section 5 discusses application performance. Section 6 shows tradeoffs associated with memory striping. Section 7 summarizes comparisons. Section 8 concludes.

2. GS1280 System Overview

The Alpha 21364 (EV7) microprocessor [1], shown in Figure 2, integrates the following components on a single chip: (1) a second-level (L2) cache, (2) a router, (3) two memory controllers (Zboxes), and (4) a 21264 (EV68) microprocessor core. The processor frequency is 1.15 GHz. The memory controllers and inter-processor links operate at 767 MHz (data rate). The L2 cache is 1.75 MB in size and 7-way set-associative. The load-to-use L2 cache latency is 12 cycles (10.4 ns). The data path to the cache is 16 bytes wide, resulting in a peak bandwidth of 18.4 GB/s. There are 16 victim buffers from L1 to L2 and from L2 to memory. The two integrated memory controllers connect the processor directly to the RDRAM memory. The peak memory bandwidth is 12.3 GB/s (8 channels, 2 bytes each). Up to 2048 pages can be open simultaneously. An optional 5th channel is provided as a redundant channel. The four interprocessor links are capable of 6.2 GB/s each (2 unidirectional links of 3.1 GB/s each). The IO chip is connected to the EV7 via a full-duplex link capable of 3.1 GB/s.

Figure 2. 21364 block diagram (EV68 core and L1 cache, L2 cache controller with tag and data arrays, two RDRAM memory controllers, router with four IP links and an I/O port).

The router [16] connects multiple 21364s in a two-dimensional, adaptive, torus network (Figure 3). The router connects to 4 links that connect to the 4 neighbors in the torus: North, South, East, and West. Each router routes packets arriving from several input ports (L2 cache, Zboxes, I/O, and other routers) to several output ports (L2 cache, Zboxes, I/O, and other routers). To avoid deadlocks in the coherence protocol and the network, the router multiplexes a physical link among several virtual channels. Each input port has two first-level arbiters, called the local arbiters, each of which selects a candidate packet among those waiting at the input port. Each output port has a second-level arbiter, called the global arbiter, which selects a packet from those nominated for it by the local arbiters.

Figure 3. A 12-processor 21364-based multiprocessor (each 21364 node has its own local memory and I/O).

The global directory protocol is a forwarding protocol [16]. There are 3 types of messages: Requests, Forwards, and Responses. A requesting processor sends a Request message to the directory. If the block is local, the directory is updated and a Response is sent back. If the block is in the Exclusive state, a Forward message is sent to the owner of the block, which sends the Response to the requestor and the directory. If the block is in the Shared state (and the request is to modify the block), Forward/invalidates are sent to each of the shared copies, and a Response is sent to the requestor.
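The flow of Requests, Forwards, and Responses can be illustrated with a short sketch. The code below is only an expository model of the forwarding protocol described above; the state names, sharer bitmap, and update rules are simplifications we assume for illustration, not the 21364 directory implementation.

/* Minimal, illustrative model of a forwarding directory protocol in the
 * spirit of the Request/Forward/Response flow described above.  This is a
 * simplification for exposition, not the 21364 hardware implementation. */
#include <stdio.h>
#include <stdint.h>

typedef enum { DIR_INVALID, DIR_SHARED, DIR_EXCLUSIVE } dir_state_t;

typedef struct {
    dir_state_t state;
    int         owner;      /* valid when state == DIR_EXCLUSIVE     */
    uint64_t    sharers;    /* bitmap of CPUs, valid when DIR_SHARED */
} dir_entry_t;

/* A Request arrives at the home node's directory. */
static void handle_request(dir_entry_t *e, int requestor, int is_write)
{
    switch (e->state) {
    case DIR_INVALID:
        /* Block is only in memory: update the directory, Response from home. */
        printf("Response: memory data -> CPU%d\n", requestor);
        break;
    case DIR_EXCLUSIVE:
        /* Forward to the owner, which responds to requestor and directory. */
        printf("Forward: CPU%d supplies the block to CPU%d\n",
               e->owner, requestor);
        break;
    case DIR_SHARED:
        if (is_write) {
            /* Invalidate every shared copy, then respond to the requestor. */
            for (int c = 0; c < 64; c++)
                if (e->sharers & (1ULL << c))
                    printf("Forward/invalidate -> CPU%d\n", c);
        }
        printf("Response: data -> CPU%d\n", requestor);
        break;
    }
    /* Update the directory to reflect the new owner/sharer set. */
    if (is_write) {
        e->state = DIR_EXCLUSIVE; e->owner = requestor; e->sharers = 0;
    } else if (e->state == DIR_EXCLUSIVE) {
        e->state = DIR_SHARED;
        e->sharers = (1ULL << e->owner) | (1ULL << requestor);
    } else {
        e->state = DIR_SHARED; e->sharers |= (1ULL << requestor);
    }
}

int main(void)
{
    dir_entry_t e = { DIR_INVALID, 0, 0 };
    handle_request(&e, 3, 0);   /* CPU3 reads                     */
    handle_request(&e, 7, 1);   /* CPU7 writes (invalidates CPU3) */
    handle_request(&e, 1, 0);   /* CPU1 reads the now-dirty block */
    return 0;
}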
To optimize network buffer and link utilization, the 21364 routing protocol uses a minimal adaptive routing algorithm: only a path with the minimum number of hops from source to destination is used. However, a message can choose the less congested minimal path (adaptive protocol). Both the coherence and adaptive routing protocols can introduce deadlocks in the 21364 network. The coherence protocol can introduce deadlocks due to cyclic dependences between different packet classes. For example, Request packets can fill up the network and prevent Response packets from ever reaching their destinations. The 21364 breaks this cyclic dependence by creating virtual channels for each class of coherence packets and prioritizing the dependences among these classes. By creating separate virtual channels for each class of packets, the router guarantees that each class of packets can be drained independently of the other classes. Thus, a Response packet can never block behind a Request packet. A Request can generate a Block Response, but a Block Response cannot generate a Request.

Adaptive routing can generate two types of deadlocks: intra-dimensional (because the network is a torus, not a mesh) and inter-dimensional (which arises in any square portion of the mesh). The intra-dimensional deadlock is solved with two virtual channels: VC0 and VC1. The inter-dimensional deadlocks are solved by allowing a message to route in one dimension (e.g. East-West) before routing in the next dimension (e.g. North-South) [12]. Additionally, to facilitate adaptive routing, the 21364 provides a separate virtual channel, called the Adaptive channel, for each class. Any message (other than I/O packets) can route through the Adaptive channel. However, if the Adaptive channels fill up, packets can enter the deadlock-free channels.
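The routing decision described above, minimal adaptive routing with a fall-back to the deadlock-free channels, can be sketched in software terms. The channel names, the congestion test, and the port-selection order below are illustrative assumptions, not the 21364's actual arbitration logic.

/* Illustrative sketch of minimal adaptive routing with a fall-back to
 * deadlock-free virtual channels, in the spirit of the scheme described
 * above.  Simplified: a DIM x DIM torus and a single packet class. */
#define DIM 4
enum vc  { VC_ADAPTIVE, VC0, VC1 };          /* virtual channels per class */
enum dir { EAST, WEST, NORTH, SOUTH, LOCAL };

/* Signed hop count along one dimension of a torus, taking the shorter way. */
static int torus_delta(int from, int to)
{
    int d = (to - from + DIM) % DIM;
    return (d <= DIM / 2) ? d : d - DIM;     /* negative: go the other way  */
}

/* Pick an output port and virtual channel for a packet at (x,y) headed to
 * (dx,dy).  'congested' reports whether a port's Adaptive channel is full. */
static void route(int x, int y, int dx, int dy,
                  int (*congested)(enum dir), enum dir *port, enum vc *chan)
{
    int ew = torus_delta(x, dx);             /* minimal moves still needed  */
    int ns = torus_delta(y, dy);

    if (ew == 0 && ns == 0) { *port = LOCAL; *chan = VC0; return; }

    enum dir ew_port = (ew > 0) ? EAST  : WEST;
    enum dir ns_port = (ns > 0) ? SOUTH : NORTH;

    /* Prefer the Adaptive channel on whichever minimal direction is freer. */
    if (ew != 0 && !congested(ew_port)) { *port = ew_port; *chan = VC_ADAPTIVE; return; }
    if (ns != 0 && !congested(ns_port)) { *port = ns_port; *chan = VC_ADAPTIVE; return; }

    /* Fall back to the deadlock-free channels: finish East-West before
     * North-South (dimension order); VC0/VC1 break the torus wrap-around
     * (intra-dimensional) cycle. */
    *port = (ew != 0) ? ew_port : ns_port;
    *chan = VC0;                             /* VC1 after crossing the wrap link */
}

static int never_congested(enum dir d) { (void)d; return 0; }

int main(void)
{
    enum dir p; enum vc c;
    route(0, 0, 2, 3, never_congested, &p, &c);   /* one routing decision */
    return p == LOCAL;
}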

The previous-generation GS320 system uses a switch to connect four processors and four memory modules into a single Quad Building Block (QBB), and then a hierarchical switch to connect QBBs into the larger-scale multiprocessor (up to 32 CPUs) [2].

3. Memory Subsystem

In this section we characterize the memory subsystem of the GS1280 and compare it to the previous-generation Alpha platforms. The section includes the analysis of local memory latency, memory bandwidth, single-CPU performance, and remote memory latency.

3.1 Local memory latency for dependent loads

The 21364 processor provides two RDRAM memory controllers with 12.3 GB/s peak memory bandwidth. Each processor can be configured with 0, 1, or 2 memory controllers. The 1.75MB on-chip L2 cache is 7-way set associative. The L2 cache on ES45 and GS320 is 16MB, off-chip, and direct-mapped.

Figure 4 compares the dependent-load latency [9]. The dependent-load latency measures the load-to-use latency where each load depends on the result of the previous load. The lower axis varies the referenced data size to fit in different levels of the memory system hierarchy. Data is accessed with a stride of 64 bytes (one cache block). The results in Figure 4 show that the GS1280 has 3.8 times lower dependent-load memory latency (at the 32MB size) than the previous-generation GS320. This indicates that large-size applications that are not blocked to take advantage of the 16MB cache will run substantially faster on the GS1280 than on the 21264-based platforms. For the data range between 1.75MB and 16MB, the latency is higher on the GS1280 than on GS320 and ES45, since the block is fetched from memory on the GS1280 vs. from the 16MB L2 cache on GS320/ES45. This indicates that applications whose working sets fall in this range are likely to run slower on the GS1280 than on the previous-generation platforms. For data sizes between 64KB and 1.75MB, the latency is again much lower on the GS1280 than on GS320/ES45. That is because the L2 cache in the GS1280 is on-chip, thus providing much lower access time than the off-chip caches in GS320/ES45.

Figure 4. Dependent load latency comparison (GS1280/1.15GHz, ES45/1.25GHz, GS320/1.22GHz; latency in ns vs. dataset size from 4KB to 128MB).

Figure 5 shows the dependent load latency on the GS1280 as both the dataset size and the stride increase. This data indicates that the latency increases from ~80ns for open-page access to ~130ns for closed-page access (larger-stride access).

Figure 5. GS1280 dependent load latency for various strides (latency in ns vs. dataset size and stride).
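The dependent-load latencies in Figures 4 and 5 are obtained with lmbench-style pointer chasing [9], where each load address is produced by the previous load so that latencies cannot overlap. A minimal sketch of such a measurement follows; the buffer size, stride, iteration count, and timing source are illustrative choices rather than the exact lmbench code.

/* Minimal pointer-chasing sketch of a dependent-load latency measurement,
 * in the style of lmbench [9]: each load's address depends on the value
 * returned by the previous load. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    const size_t size   = 32UL << 20;          /* 32 MB working set  */
    const size_t stride = 64;                  /* one cache block    */
    const size_t n      = size / stride;
    const long   iters  = 20 * 1000 * 1000;

    char *buf = malloc(size);
    /* Link every 64th byte into a circular chain of pointers. */
    for (size_t i = 0; i < n; i++)
        *(void **)(buf + i * stride) = buf + ((i + 1) % n) * stride;

    void **p = (void **)buf;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++)
        p = (void **)*p;                       /* dependent load      */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    /* Print p so the chase is not optimized away. */
    printf("%.1f ns per load (%p)\n", ns / iters, (void *)p);
    free(buf);
    return 0;
}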

3.2 Memory Bandwidth

The STREAM benchmark [10] measures sustainable memory bandwidth in megabytes per second (MB/s) across four vector kernels: Copy, Scale, Sum, and Triad (SAXPY). We show only the results for the Triad kernel in Figure 6 (the other kernels have similar characteristics). This data indicates that the memory bandwidth on the GS1280 is substantially higher than on the previous-generation GS320 and all other systems shown.

Figure 6. McCalpin STREAM (Triad) bandwidth comparison for 1-64 CPUs: HP GS1280/1.15GHz (published, >16P estimated), HP SC45/1.25GHz (estimated >4P), HP GS320/1.2GHz (published), IBM p690 Power4/1.3GHz (published).
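For reference, the Triad kernel reported in Figure 6 is a simple vector operation, a(i) = b(i) + q*c(i). The OpenMP version below is a sketch; the array length, scalar, and first-touch initialization are our illustrative choices, and the official benchmark is McCalpin's STREAM code [10].

/* Sketch of the STREAM Triad kernel [10]: a(i) = b(i) + q*c(i). */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N (20 * 1000 * 1000)        /* large enough to exceed any cache */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    const double q = 3.0;

    #pragma omp parallel for
    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }  /* first-touch placement */

    double t = omp_get_wtime();
    #pragma omp parallel for
    for (long i = 0; i < N; i++)
        a[i] = b[i] + q * c[i];                               /* Triad */
    t = omp_get_wtime() - t;

    /* Three 8-byte elements move per iteration: two reads and one write. */
    printf("Triad: %.2f GB/s (a[0]=%f)\n", 3.0 * 8 * N / t / 1e9, a[0]);
    free(a); free(b); free(c);
    return 0;
}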

Figure 7. McCalpin STREAM (Triad) bandwidth for 1 and 4 CPUs (GS1280/1.15GHz, ES45/1.25GHz, GS320/1.2GHz).

Figure 7 indicates that the GS1280 exhibits not only a 1-CPU advantage in memory bandwidth (due to the high-bandwidth memory-controller design provided by the 21364 processor); it also provides linear scaling in bandwidth as the number of CPUs increases. This is due to the GS1280 memory design, where each CPU has its own local memory, thus avoiding contention for memory between jobs that run simultaneously on several CPUs. This is not the case on ES45 and GS320, where four CPUs contend for the same memory. Therefore, the bandwidth improvement from one to four CPUs on ES45/GS320 is less than linear (as indicated in Figure 7). The data in Figures 6 and 7 indicate that memory-bandwidth intensive applications will run exceptionally well on the GS1280. The advantage is likely to be even more pronounced as the number of CPUs increases. One such example is the SPEC throughput benchmarks shown in Figure 1.

3.3. Single-CPU performance: CPU2000

Figures 8 and 9 compare Instructions-per-Cycle (IPC) for the floating-point (fp) and integer SPEC CPU2000 benchmarks on GS1280 vs. GS320 and ES45 [8].

Figure 8. IPC for SPECfp2000 (GS1280/1.15GHz, ES45/1.25GHz, GS320/1.22GHz).

Figure 9. IPC for SPECint2000 (GS1280/1.15GHz, ES45/1.25GHz, GS320/1.22GHz).

On average, the GS1280 shows an advantage over both GS320 and ES45 in SPECfp2000, and comparable performance in SPECint2000. Note that some benchmarks demonstrate a substantial advantage on GS1280 over ES45/GS320. For example, swim shows a 2.3 times advantage on GS1280 vs. ES45 and a 4 times advantage vs. GS320. However, many other benchmarks show comparable performance (e.g. most integer benchmarks). Yet, there are cases where GS320 and ES45 outperform GS1280 (e.g. facerec and ammp). In order to better understand the causes of such differences, we generated profiles that show memory controller utilization for all benchmarks (Figures 10 and 11).

Figures 10 and 11 illustrate memory controller utilization profiling histograms for the SPEC CPU2000 benchmarks on GS1280. The profiles are collected using the 21364 built-in performance counters and are shown as a function of the elapsed time for the entire benchmark run. This data indicates that the benchmarks with high memory utilization are the same benchmarks that show a significant advantage on GS1280. Swim is the leader with 53% utilization, followed by applu, lucas, equake, and mgrid (20-30%), and fma3d, art, wupwise, and galgel (10-20%). Interestingly, facerec has only 8% utilization, yet GS1280 still has lower IPC than the other systems. That is due to the smaller cache size on GS1280 (1.75MB vs. 16MB on GS320/ES45). The simulation results show that the facerec dataset fits in an 8MB cache, but not in the 1.75MB cache. Therefore, facerec accesses memory on GS1280, while it fetches data mostly from the 16MB cache on GS320 and ES45. Figure 4 illustrates that the cache access on GS320 is faster than the memory access on GS1280.

Figure 10. GS1280 memory controller utilization (percentage vs. elapsed time) in SPECfp2000.

Figure 11. GS1280 memory controller utilization (percentage vs. elapsed time) in SPECint2000.

3.4. Remote memory latency

In Sections 3.1 and 3.2 we contrasted local memory latency and bandwidth on the GS1280 and the previous-generation platforms. Local memory characteristics are important for single-CPU workloads and for multiprocessor workloads that fit well in a processor's local memory. However, in order to characterize applications that do not fit well in local memory, we need to understand how local latency compares to remote latency. Figure 12 compares local and remote memory latency on GS320 and GS1280. Latency is measured from CPU0 to all other CPUs in a 16-CPU system. Note that GS320 has two levels of latency: local (within a set of 4 CPUs called a QBB) and remote (outside that QBB). The GS1280 system has many levels of remote latency, depending on how many hops need to be traversed from source to destination.

Figure 12 indicates that the GS1280 shows a 4 times advantage in average memory latency on 16 CPUs. The advantage is even higher (6.6 times) when Read-Dirty instead of Read-Clean latencies are compared. Note that in the Read-Dirty case, a cache block is read from another processor's cache rather than from memory.

Figure 12. Local/remote latency (ns) from CPU0 to CPUs 0-15 on 16 CPUs (GS320/1.2GHz vs. GS1280/1.15GHz).

Figure 13 illustrates the measured latency from node 0 to all other nodes in the 16-CPU GS1280 system (each square is a CPU within a 4x4 torus). The local memory latency of 83ns increases to 139-154 ns for the 1-hop neighbors. Note that the 1-hop latency is lowest for the neighbors on the same module (139 ns), and highest for the neighbors that are connected via a cable (154 ns). The 2-hop latency is 175-195 ns (6 nodes are 2 hops away). The 4-hop latency (the worst case for 16 CPUs) is 259 ns (1 node is 4 hops away).

Figure 13. Remote memory latencies (ns) on GS1280 (each square represents a CPU in a 16-CPU torus; 83 ns is the local node):

   83  145  186  154
  139  175  221  182
  181  221  259  222
  154  191  235  195
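The neighbor counts behind Figure 13 follow directly from the 4x4 torus geometry: with wrap-around links, the hop distance from node 0 to node (x, y) is min(x, 4-x) + min(y, 4-y). The short sketch below (illustrative only) reproduces the breakdown of 4 one-hop, 6 two-hop, 4 three-hop, and 1 four-hop neighbors.

/* Hop distance from CPU0 to every node of a 4x4 torus (wrap-around links),
 * matching the neighbor counts discussed for Figure 13.  Illustrative only. */
#include <stdio.h>

#define DIM 4
static int wrap(int d) { return d < DIM - d ? d : DIM - d; }

int main(void)
{
    int count[2 * (DIM / 2) + 1] = { 0 };
    for (int x = 0; x < DIM; x++) {
        for (int y = 0; y < DIM; y++) {
            int hops = wrap(x) + wrap(y);
            count[hops]++;
            printf("%d ", hops);
        }
        printf("\n");
    }
    for (int h = 0; h <= 2 * (DIM / 2); h++)
        printf("%d node(s) at %d hop(s)\n", count[h], h);
    /* Prints: 1 node at 0, 4 at 1, 6 at 2, 4 at 3, 1 at 4 hops. */
    return 0;
}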

Figure 14 shows the average load-to-use latency as the number of CPUs increases. Figures 12 and 14 show that the GS1280 has a significant advantage over GS320 not only in local, but also in remote memory latency. This data indicates that applications that are not structured to fit well within a processor's local memory will run much more efficiently on GS1280 than on GS320. In addition, this advantage will be even more pronounced in applications that require a high amount of data sharing (parallel workloads), due to the efficient Read-Dirty implementation in the GS1280.

Figure 14. Remote memory latency (average load-to-use, ns) for 4-64 CPUs (GS1280/1.15GHz vs. GS320/1.2GHz).

4. Interprocessor Bandwidth

The memory latency in Figure 14 is measured on an idle system with only 2 CPUs exchanging messages. In this section, we evaluate the interprocessor network response as the load increases. This is needed in order to characterize applications that require all CPUs to communicate simultaneously (more closely related to the real application environment).

Figure 15 compares bandwidth under increasing load on GS1280 and GS320. Each CPU randomly selects another CPU to send a Read request to. The test is started with a single outstanding load (the leftmost point). For each additional point, one outstanding load is added (up to 30 outstanding requests). In the ideal case, the bandwidth will increase (moving to the right) and the latency will not change (the line stays low and flat).

Figure 15. Load test comparison: latency vs. bandwidth with up to 30 outstanding memory references (GS1280 16P/32P/64P, GS320 16P/32P).

Figure 15 indicates that the GS1280 shows an increase in latency, but it is not nearly as high as on GS320. The GS1280 is much more resilient to the load: bandwidth increases with a much smaller latency increase. This is an important system feature for applications that require substantial inter-processor (IP) bandwidth. Figure 15 also indicates another interesting phenomenon: as the load is increased beyond the saturation point, the delivered bandwidth starts to decrease. Although interesting from the theoretical point of view, this phenomenon has no implications for the performance of real applications, as we have not observed any applications that operate even close to this point.
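The load test of Figure 15 can be approximated in software on any shared-memory system: every thread chases several independent pointer streams through one large, randomly permuted table, so that roughly CHAINS reads per thread are outstanding at once. The sketch below is such an approximation; the table size, chain count, and page placement are assumptions, and it does not reproduce the internal test used for Figure 15.

/* Software approximation of a load test: each thread keeps CHAINS
 * independent dependence chains in flight over one large shared table. */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define LINE   64
#define CHAINS 8                      /* ~concurrent outstanding reads/thread */

int main(void)
{
    const size_t size = 1UL << 30;    /* 1 GB table, ideally spread over nodes */
    const size_t n    = size / LINE;
    const long   hops = 2 * 1000 * 1000;
    char *tab = malloc(size);

    /* Random circular permutation so successive loads hit random lines. */
    size_t *perm = malloc(n * sizeof *perm);
    for (size_t i = 0; i < n; i++) perm[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)(rand() / (RAND_MAX + 1.0) * (i + 1));
        size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    for (size_t i = 0; i < n; i++)
        *(void **)(tab + perm[i] * LINE) = tab + perm[(i + 1) % n] * LINE;

    double t0 = omp_get_wtime();
    #pragma omp parallel
    {
        void *p[CHAINS];              /* CHAINS independent chase streams */
        for (int c = 0; c < CHAINS; c++)
            p[c] = tab + perm[(c * n) / CHAINS] * LINE;
        for (long i = 0; i < hops; i++)
            for (int c = 0; c < CHAINS; c++)
                p[c] = *(void **)p[c];
        if (p[0] == NULL) printf("never\n");   /* keep the loads live */
    }
    double sec = omp_get_wtime() - t0;
    /* Assumes omp_get_max_threads() threads actually ran. */
    long total = (long)omp_get_max_threads() * hops * CHAINS;
    printf("%.0f MB/s of 64-byte reads, ~%.0f ns per round of %d overlapped loads\n",
           total * (double)LINE / sec / 1e6, sec * 1e9 / hops, CHAINS);
    return 0;
}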

4.1 Shuffle Interconnect

We discovered that the performance of an 8-CPU GS1280 configuration could be improved by a simple swap of the cables (which we call "shuffle") [12]. Figures 16 and 17 show how the connections are changed from the standard "torus" interconnect (Figure 16) to "shuffle" (Figure 17) in order to improve performance. The redundant North-South connections in an 8-CPU torus are used to connect the furthest nodes to create the shuffle interconnect.

Figure 16. Torus.    Figure 17. Shuffle.

Table 1 shows the performance improvement from shuffle vs. torus using a simple analytical model. Note that shuffle is more beneficial in rectangular rather than in square shaped interconnects (bisection width and worst-case latency). The benefits in average latency increase as the system size grows.

Table 1: Performance gains from shuffle.

  Configuration   avg. latency   worst latency   bisection width
  4x2             1.200          1.500           2.000
  4x4             1.067          1.333           1.000
  8x4             1.171          1.500           2.000
  8x8             1.185          1.333           1.000
  16x8            1.371          1.500           2.000
  16x16           1.454          1.778           1.000
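The ratios in Table 1 come from a simple analytical model of the two wirings. Metrics of this kind can be reproduced by enumerating each topology's links and computing hop counts; the sketch below does this for a plain W x H torus via breadth-first search. The shuffle wiring of Figure 17 would be described by a different neighbor list, which is not reproduced here.

/* Average and worst-case hop count for a W x H torus, computed by BFS over
 * an explicit adjacency.  A "shuffle" wiring would plug in a different
 * neighbors() function; its exact link list is not reproduced here. */
#include <stdio.h>
#include <string.h>

#define W 4
#define H 2
#define N (W * H)

static int id(int x, int y) { return ((y + H) % H) * W + ((x + W) % W); }

static void neighbors(int node, int nbr[4])      /* plain torus links */
{
    int x = node % W, y = node / W;
    nbr[0] = id(x + 1, y); nbr[1] = id(x - 1, y);
    nbr[2] = id(x, y + 1); nbr[3] = id(x, y - 1);
}

int main(void)
{
    double sum = 0; int worst = 0, pairs = 0;
    for (int src = 0; src < N; src++) {
        int dist[N], queue[N], head = 0, tail = 0;
        memset(dist, -1, sizeof dist);
        dist[src] = 0; queue[tail++] = src;
        while (head < tail) {                     /* breadth-first search */
            int u = queue[head++], nbr[4];
            neighbors(u, nbr);
            for (int k = 0; k < 4; k++)
                if (dist[nbr[k]] < 0) { dist[nbr[k]] = dist[u] + 1; queue[tail++] = nbr[k]; }
        }
        for (int d = 0; d < N; d++)
            if (d != src) { sum += dist[d]; pairs++; if (dist[d] > worst) worst = dist[d]; }
    }
    printf("%dx%d torus: average %.3f hops, worst %d hops\n", W, H, sum / pairs, worst);
    return 0;
}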

Figure 18 shows the performance gains from shuffle measured on an 8-CPU GS1280 prototype. We experimented with two shuffle routing approaches: (1) shuffle with 1 hop, where the shuffle links are used as the initial (and only) hop, and (2) shuffle with 2 hops, where the shuffle links are used for 1 and 2 hops (e.g. we use the shuffle links to alleviate load on the horizontal links). The performance data indicates that the 1-hop shuffle provides between 5% and 25% performance gain (depending on network load) vs. torus. The 2-hop shuffle provides a lower additional gain (2-5%).

Figure 18. Performance improvement from shuffle: latency vs. bandwidth for the current (torus), shuffle, and 2-hop shuffle routings.

5. Application performance

In this section, we compare the GS1280 to the other systems using 3 types of applications: (1) CPU-intensive applications that do not stress either the memory controllers or the inter-processor (IP) links, (2) memory-bandwidth intensive applications that stress memory bandwidth, but not the IP links (many MPI applications belong to this category), and (3) applications that stress both the IP links and memory bandwidth. An example application that represents each class is included. The utilization of the memory controllers and IP links is measured using the 21364 built-in performance counters [11].

5.1. CPU-intensive application: Fluent (CFD)

Fluent is a standard Computational Fluid Dynamics application [13]. For comparison, we selected a large case (l1) that models flow around a fighter aircraft (Figure 19). The results in Figure 19 are published as of March 2003. This data indicates that the GS1280 shows comparable performance to ES45. Examining Figure 20 with the measured utilization shows that the reason is that this application does not put significant stress on either memory controller or IP-link bandwidth. The large 16MB cache in ES45 often provides an advantage in applications that can be blocked for cache re-use (such as Fluent).

Figure 19. Fluent performance (FLUENT 6, case fl5l1): HP GS1280/1.15GHz, HP SC45/1.25GHz, HP GS320/1.22GHz, SUN Fire 6800 USIII/900MHz, IBM pSeries 690 Turbo Power4/1.3GHz, Dell PowerEdge 2650 Xeon/2.4GHz.

Figure 20. Memory and IP-link utilization in Fluent.

5.2. Memory-bandwidth intensive application: NAS Parallel

The NAS Parallel benchmarks represent a collection of kernels that are important in many technical applications [14]. The kernels are decomposed using MPI and can run on either shared-memory or cluster systems. With the exception of EP (embarrassingly parallel), the majority of these kernels (solvers, FFT, grid, integer sort) put significant stress on memory bandwidth (when size C is used). Note that although these are small kernels, they provide the same level of stress on the memory subsystem as many large real applications.

Figure 21 compares GS1280 performance to the previous-generation Alpha platforms on the SP solver. This data shows a substantial advantage on GS1280 compared to the other systems.

Figure 21. NAS Parallel SP performance comparison (MOPS vs. number of CPUs): HP GS1280/1.15GHz, HP SC45/1.25GHz, HP GS320/1.2GHz.

In order to explain this advantage, we show the memory and IP-link utilization in Figure 22. Figure 22 shows that memory bandwidth utilization is high in SP (26%). The GS1280 has substantially higher memory bandwidth than ES45 and GS320 (Figures 6 and 7), thus the advantage in SP. Since ES45 has higher memory bandwidth than GS320 (Figure 7), the GS1280 shows an even higher advantage vs. GS320 than vs. ES45. The IP-link utilization in these MPI kernels is low (Figure 22). We also observed that IP-link utilization is low in many other MPI applications. The GS1280 provides very high IP-link bandwidth that in many cases exceeds the needs of MPI applications (many of which are designed for cluster interconnects with much lower bandwidth requirements).

Figure 22. Memory controller and IP-link utilization in SP (GS1280).

5.3. IP-bandwidth intensive application: GUPS

GUPS is a multithreaded (OpenMP) application where each thread updates an item randomly picked from a large table [15]. Since the table is so large that it spans the entire memory in the system, this application puts substantial stress on the IP-link bandwidth (Figure 24). In this application, the GS1280 shows the most substantial advantage over the other systems, as shown in Figure 23. This is because this application exploits the substantial IP-link bandwidth advantage of the GS1280, as discussed in Section 4 (Figure 15). It is also interesting that the links show uneven utilization in Figure 24: the East/West links show higher utilization than the North/South links. This is because the link utilization is higher on horizontal than on vertical links in a 4x8 torus. This is also the reason for the bend in performance at 32 CPUs: the cross-sectional bandwidth is comparable in both the 16P and 32P torus configurations.

Figure 23. GUPS performance comparison (Mupdates/s vs. number of CPUs): GS1280/1.15GHz, GS320/1.2GHz, ES45/1.25GHz.

Figure 24. Memory and IP-link utilization in GUPS (32P GS1280).
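The access pattern that makes GUPS stress the IP links is visible from its inner loop: every update touches a random word of a table that spans the memory of the whole machine. The OpenMP sketch below illustrates that pattern; the table size, update count, and random-number recurrence are illustrative choices (the reference code is available at [15]), and the unsynchronized updates are a simplification.

/* Minimal sketch of the GUPS access pattern [15]: each OpenMP thread applies
 * updates to randomly chosen elements of one huge shared table, so most
 * updates cross the interconnect to another CPU's memory.
 * Assumes LP64 (64-bit unsigned long); updates are left unsynchronized. */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define LOG2_SIZE  27                       /* 2^27 words = 1 GB table */
#define TABLE_SIZE (1UL << LOG2_SIZE)
#define UPDATES    (4 * TABLE_SIZE)

int main(void)
{
    unsigned long *table = malloc(TABLE_SIZE * sizeof *table);

    #pragma omp parallel for
    for (unsigned long i = 0; i < TABLE_SIZE; i++)
        table[i] = i;                       /* first-touch spreads the table */

    double t = omp_get_wtime();
    #pragma omp parallel
    {
        /* Cheap per-thread pseudo-random sequence of table indices. */
        unsigned long r = 0x123456789abcdefUL * (omp_get_thread_num() + 1);
        #pragma omp for
        for (unsigned long i = 0; i < UPDATES; i++) {
            r = r * 6364136223846793005UL + 1442695040888963407UL;
            table[r >> (64 - LOG2_SIZE)] ^= r;     /* random remote update */
        }
    }
    t = omp_get_wtime() - t;
    printf("%.1f Mupdates/s (table[0]=%lu)\n", UPDATES / t / 1e6, table[0]);
    free(table);
    return 0;
}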

6. Memory Striping

Memory striping allows interleaving of 4 cache lines across two CPUs, starting with CPU0/controller0, then CPU0/controller1, then CPU1/controller0, and finally CPU1/controller1. The CPUs chosen to participate in striping are the closest neighbors (CPUs on the same module). Striping provides a performance benefit in alleviating hot spots, where the hot-spot traffic is spread across 2 CPUs (instead of one). The disadvantage of memory striping is that it puts an additional burden on the IP links between pairs of CPUs.
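The four-line rotation described above corresponds to a simple mapping from a cache-line index to a (CPU, controller) pair. The sketch below illustrates that rotation only; it is not the actual GS1280 address decode.

/* Illustration of the 4-cache-line striping rotation described above:
 * consecutive 64-byte lines rotate through CPU0/ctrl0, CPU0/ctrl1,
 * CPU1/ctrl0, CPU1/ctrl1.  A sketch of the concept, not the real decode. */
#include <stdio.h>
#include <stdint.h>

#define LINE 64

static void striped_target(uint64_t addr, int *cpu, int *ctrl)
{
    uint64_t line = addr / LINE;    /* cache-line index within the stripe set */
    *cpu  = (line >> 1) & 1;        /* lines 0,1 -> CPU0;   lines 2,3 -> CPU1 */
    *ctrl = line & 1;               /* even line -> ctrl0;  odd line -> ctrl1 */
}

int main(void)
{
    for (uint64_t addr = 0; addr < 8 * LINE; addr += LINE) {
        int cpu, ctrl;
        striped_target(addr, &cpu, &ctrl);
        printf("line %2llu -> CPU%d / controller%d\n",
               (unsigned long long)(addr / LINE), cpu, ctrl);
    }
    return 0;
}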

The results of our evaluation of memory striping are presented in Figures 25 and 26. Figure 25 shows that striping degrades performance by 10-30% in throughput applications (SPECfp_rate2000) due to the increased inter-processor traffic. We observed degradation as high as 70% in some applications. A more extensive study over a variety of applications indicated that only a small portion of applications benefit from striping (while most others degrade in performance).


Figure 25. Degradation from striping.

Figure 26 shows that striping improves the performance of a hot-spot traffic pattern (all CPUs read data from CPU0) by up to 80%. We use the Xmesh tool, based on the built-in performance counters, to recognize the hot-spot traffic (Figure 27). The tool indicates that the IP-link and memory traffic on the links to/from CPU0 (left corner) is higher than on any other CPU, and that the Zbox utilization on that CPU is 53% (much higher than on any other CPU). We observed a 30% improvement in real applications that generate hot-spot traffic.

Figure 26. Improvement from striping: latency vs. bandwidth for the hot-spot pattern, non-striped and striped.

Figure 27. Xmesh with a hot-spot.

7. Summary Comparisons

In Section 5 we analyzed the performance of representatives of three application classes. In this section, we compare the GS1280 to GS320 across a wider range of applications (Figure 28). The data in Figure 28 is shown as the ratio of GS1280 improvement vs. GS320. The data is grouped into the following categories: system components (CPU, memory, inter-processor, I/O), integer and commercial benchmarks (SPECint_rate2000, SAP Transaction Processing, Decision Support [8][17]), HPTC standard benchmarks (SPECfp_rate2000, NAS Parallel, and SPEComp2001 [8][14]), HPTC applications (CFD, chemistry, weather prediction, structural modeling) [13][18][19][20][21], and two cases where the GS1280 shows the highest improvement (GUPS and swim) [8][15].

The data in Figure 28 indicates that the GS1280 shows the most significant improvement in IP bandwidth (over 10 times), and in I/O and memory bandwidth (8 times). The application comparisons show that although GS1280 and GS320 have comparable processor clock speeds, the majority of applications run faster on GS1280 than on GS320. The exceptions are the small integer benchmarks (SPECint2000) that fit well in the on-chip caches. The commercial workloads show a 1.3-1.6 times advantage on GS1280 vs. GS320. The standard HPTC benchmarks (SPEC and NAS Parallel) show a 1.7-2.6 times gain, mainly due to the memory-bandwidth advantage on GS1280. The GS1280 advantage in ISV applications ranges from 1.2-2.1 times. The swim and GUPS applications benefit from the memory and IP-link bandwidth advantages of the GS1280.

This data indicates that the designs of the memory interface, I/O subsystem, and interprocessor interconnect have a profound effect on application performance. Often, the emphasis is placed on processor design, and other system components take lower priority. Our study indicates that the key factor for achieving high application performance is a balanced design that includes not only a high-performance processor, but also matching high-performance designs of the memory, interconnect, and I/O subsystems.

Figure 28. GS1280 vs. GS320 summary comparisons (performance ratios, GS1280/1.15GHz vs. GS320/1.2GHz). Categories include CPU speed, memory copy bandwidth (1P, 32P), memory latency (local, dirty remote), inter-processor and I/O bandwidth (32P), SPECint_rate2000 (16P), SAP SD Transaction Processing (32P), Decision Support (32P), NAS Parallel (16P), SPECfp_rate2000 (16P), SPEComp2001 (16P), Nastran (4P), Fluent and StarCD (32P), LS-Dyna/Neon (16P), MM5 (32P), NWchem (32P), Gaussian98 (32P), GUPS (32P), and swim (32P).

8. Conclusions

We evaluated the architecture and performance characteristics of the HP AlphaServer GS1280, based on the Alpha 21364 processor. The Alpha 21364 is a substantial departure in processor design compared to the previous-generation Alpha processors. It incorporates (1) an on-chip cache (smaller than the off-chip cache used with the previous-generation 21264), (2) two memory controllers that provide exceptional memory bandwidth, and (3) a router that allows an efficient glue-less large-scale multiprocessor design. The 21364 processor places on a single chip all components that previously required an entire CPU module.

The results from our analysis show that this is a superior design for building large-scale multiprocessors. The exceptional memory bandwidth that the GS1280 provides is important for a number of applications that cannot be structured to allow for cache reuse. We observed a 2-4 times advantage of the GS1280 vs. the previous-generation AlphaServer GS320 in this type of applications (e.g. NAS Parallel). The low latency and exceptional bandwidth of the IP links allow for very good scaling in applications that cannot be blocked to fit in the local memory of each processor. We observed an even higher advantage of the GS1280 vs. the previous-generation AlphaServer GS320 in this type of applications (e.g. over 10 times in GUPS). Since the Alpha 21364 preserved the same core as the previous-generation 21264 (and the CPU clock speeds are comparable), applications that are blocked to fit well in the on-chip caches perform comparably on GS1280 and GS320 (e.g. SPECint2000). Some applications take advantage of the large 16MB cache, and therefore run faster on GS320 than on GS1280 (e.g. facerec from SPECfp2000). However, most applications benefit from the GS1280 design, indicating that the architecture of the memory interface, interprocessor interconnect, and I/O subsystem is as important as the processor design.

We proposed a simple change in routing (called shuffle) that provides substantial performance improvements on an 8-CPU torus interconnect. We also determined that striping memory across two processors is beneficial only in applications that generate hot-spot traffic, while it was detrimental for the majority of applications due to the increased nearest-neighbor link traffic.

We have heavily relied on profiling analysis based on the built-in performance counters (Xmesh) throughout this study. Such tools are crucial for understanding system behavior. We have used profiles to explain why some workloads perform exceptionally well on GS1280, while others show comparable (or even worse) performance than on GS320 and ES45. In addition, these tools are crucial for identifying areas for improving performance on GS1280: e.g. Xmesh can detect hot spots, heavy traffic on the IP links (indicating poor memory locality), etc. Once such bottlenecks are recognized, various techniques can be used to improve performance.

The GS1280 system is the last-generation Alpha server. In our future work, we plan to extend our analysis to non-Alpha-based large-scale multiprocessor platforms. We will also place more emphasis on characterizing real I/O-intensive applications.

Acknowledgments

The author would like to thank Jason Campoli, Peter Gilbert, and Andrew Feld for profiling data collection. Special thanks to Darrel Donaldson for his guidance and help with the GS1280 system, and to Steve Jenkins, Sas Durvasula, and Jack Zemcik for supporting this work.

References

[1] Peter Bannon, "EV7", Microprocessor Forum, October 2001.

[2] K. Gharachorloo, M. Sharma, S. Steely, S. Van Doren, "Architecture and Design of AlphaServer GS320", ASPLOS 2000.

[3] DCPI and ProfileMe external release page: http://www.research.digital.com/SRC/dcpi/release.html

[4] Z. Cvetanovic and R. Kessler, "Performance Analysis of the Alpha 21264-based ES40 System", The 27th Annual International Symposium on Computer Architecture, June 10-14, 2000, pp. 60-70.

[5] Z. Cvetanovic and D. Bhandarkar, "Characterization of Alpha AXP Performance Using TP and SPEC Workloads", The 21st Annual International Symposium on Computer Architecture, April 1994, pp. 60-70.

[6] Z. Cvetanovic and D. Bhandarkar, "Performance Characterization of the Alpha 21164 Microprocessor Using TP and SPEC Workloads", The Second International Symposium on High-Performance Computer Architecture, February 1996, pp. 270-280.

[7] Z. Cvetanovic and D. D. Donaldson, "AlphaServer 4100 Performance Characterization", Digital Technical Journal, Vol. 8, No. 4, 1996, pp. 3-20.

[8] SPEC CPU2000 benchmarks, available at http://www.spec.org/osg/cpu2000/results. SPEC(R) and SPEC CPU2000(R) are registered trademarks of the Standard Performance Evaluation Corporation.

[9] Information about lmbench available at http://www.bitmover.com/lmbench/

[10] STREAM benchmark information available at http://www.cs.virginia.edu/stream

[11] Z. Cvetanovic, P. Gilbert, J. Campoli, "Xmesh: a graphical performance monitoring tool for large-scale multiprocessors", HP internal report, 2002.

[12] J. Duato, S. Yalamanchili, L. Ni, "Interconnection Networks: an Engineering Approach", IEEE Computer Society, Los Alamitos, California.

[13] Fluent data available at http://www.fluent.com/software/fluent/fl5bench/fullres.htm

[14] NAS Parallel data available at http://www.nas.nasa.gov/Software/NPB/

[15] GUPS information available at http://iram.cs.berkeley.edu/~brg/dis/gups/

[16] "EV7 System Design Specification", internal HP document, August 2002.

[17] SAP benchmark results available at http://www.sap.com/benchmark/index.asp?content=http://www.sap.com/benchmark/sd2tier.asp

[18] StarCD data available at http://www.cd-adapco.com/support/bench/315/aclass.htm

[19] LS-Dyna information available at http://www.arup.com/dyna/applications/crash/crash.htm

[20] NWchem data available at http://www.emsl.pnl.gov:2080/docs/nwchem/nwchem.html

[21] MM5 data available at http://www.mmm.ucar.edu/mm5/mm5-home.html