Performance Analysis of the Alpha 21364-Based HP GS1280 Multiprocessor Zarka Cvetanovic Hewlett-Packard Corporation [email protected]
Total Page:16
File Type:pdf, Size:1020Kb
Performance Analysis of the Alpha 21364-based HP GS1280 Multiprocessor Zarka Cvetanovic Hewlett-Packard Corporation [email protected] Abstract This paper evaluates performance characteristics of the HP These improvements enhanced single-CPU performance GS1280 shared memory multiprocessor system. The and contributed to excellent multiprocessor scaling. We GS1280 system contains up to 64 Alpha 21364 CPUs describe and analyze these architectural advances and connected together via a torus-based interconnect. We present key results and profiling data to clarify the describe architectural features of the GS1280 system. We benefits of these design features. We contrast GS1280 to compare and contrast the GS1280 to the previous- two previous-generation Alpha systems, both based on a generation Alpha systems: AlphaServer GS320 and 21264 processor: GS320 – a 32-CPU SMP NUMA system ES45/SC45. We further quantitatively show the with switch-based interconnect [2], and SC45 – a 4-CPU performance effects of these features using application ES45 systems connected in a cluster configuration via a results and profiling data based on the built-in fast Quadrics switch. [4][5][6][7]. performance counters. We find that the HP GS1280 often SPECfp_rate2000 (Peak) provides 2 to 3 times the performance of the AlphaServer HP GS1280/1.15GHz (1-16P published, 32P estimated) HP SC45/1.25GHz (1-4P published, >4P estimated) GS320 at similar clock frequencies. We find the key 600 HP GS320/1.2GHz (published) IBM/p690/650 Turbo/1.3/1.45GHz (published) reasons for such performance gains are advances in 550 SGI Altix 3K/1GHz (published) memory, inter-processor, and I/O subsystem designs. 500 SUN Fire V480/V880/0.9GHz (published) 450 1. Introduction 400 350 The HP AlphaServer 1280 is a shared memory 300 multiprocessor containing up to 64 fourth-generation Alpha 250 21364 microprocessors [1]. Figure 1 compares the 200 performance of GS1280 to the other systems using the 150 SPECfp_rate2000, a multiprocessor throughput standard 100 benchmark [8]. We show the published SPECfp_rate2000 50 results as of March 2003, with the exception of the 32P 0 0 5 10 15 20 25 30 35 GS1280 for which the data was measured on an # CPUs engineering prototype, but not published yet. We use Figure 1. SPECfp_rate2000 comparison. floating-point rather than integer SPEC benchmarks for this comparison since several of the floating-point benchmarks We include results from kernels that exercise memory stress memory bandwidth, while all integer benchmarks fit subsystem [9][10]. We include profiling results for standard well in the MB-size caches and thus are not a good benchmarks (SPEC CPU2000 [8]). In addition, we analyze indicator of memory system performance. The results in characteristics of representatives from 3 application classes Figure 1 indicate that GS1280 scales well in memory- that impose various levels of stress on memory subsystem bandwidth intensive workloads and has substantial and processor interconnect. We use profiles based on the performance advantage over the previous-generation Alpha built-in non-intrusive CPU hardware monitors [3]. These platforms despite disadvantage in the processor clock monitors are useful tools for analyzing system behavior frequency. We analyze key performance characteristics of with various workloads. In addition, we use tools based on the GS1280 in this paper to expose the key design features the EV7-specific performance counters: Xmesh [11]. that allowed GS1280 to reach such performance levels. Xmesh is a graphical tool that displays run-time information on utilization of CPUs, memory controllers, The GS1280 system contains many architectural advances inter-processor (IP) links, and I/O ports. – both in the microprocessor and in the surrounding memory system - that contribute to its performance. The The remainder of this paper is organized as follows: 21364 processor [1][16] uses the same core as the previous- Section 2 describes the architecture of the GS1280 system. generation 21264 processor [4]. However, 21364 includes Section 3 describes the memory system improvements in three additional components: (1) an on-chip L2 cache, (2) GS1280. Section 4 describes the inter-processor two on-chip Direct Rambus (RDRAM) memory controllers performance characteristics. Section 5 discusses application and (3) a router. The combination of these components performance. Section 6 shows tradeoffs associated with helped achieve improved access time to the L2 cache and memory striping. Section 7 summarizes comparisons. local/remote memory. Section 8 concludes. 2. GS1280 System Overview network, the router multiplexes a physical link among The Alpha 21364 (EV7) microprocessor [1] shown in several virtual channels. Each input port has two first- Figure 2 integrates the following components on a single level arbiters, called the local arbiters, each of which chip: (1) second-level (L2) cache, (2) a router, (3) two selects a candidate packet among those waiting at the memory controllers (Zboxes), and (4) a 21264 (EV68) input port. Each output port has a second-level arbiter, microprocessor core. The processor frequency is 1.15 GHz. called the global arbiter, which selects a packet from The memory controllers and inter-processor links operate at those nominated for it by the local arbiters. 767 MHz (data rate). The L2 cache is 1.75 MB in size, 7- way set-associative. The load-to-use L2 cache latency is 12 The global directory protocol is a forwarding protocol [16]. cycles (10.4 ns). The data path to the cache is 16-bytes There are 3 types of messages: Requests, Forwards, and wide, resulting in peak bandwidth of 18.4 GB/s. There are Responses. A requesting processor sends a Request 16 Victim buffers from L1 to L2 and from L2 to memory. message to the directory. If the block is local, the directory The two integrated memory controllers connect processor is updated and a Response is sent back. If the block is in directly to the RDRAM memory. The peak memory Exclusive state, the Forward message is sent to the owner bandwidth is 12.3 GB/s (8 channels, 2 bytes each). There of the block, who sends the Response to the requestor and can be up to 2048 pages open simultaneously. The optional directory. If the block is in Shared state (and the request is 5th channel is provided as a redundant channel. The four to modify the block), Forward/invalidates are sent to each interprocessor links are capable of 6.2 GB/s each (2 of the shared copies, and a Response is sent to the unidirectional links with 3.1 GB/s each). The IO chip is requestor. connected to the EV7 via a full-duplex link capable of 3.1 GB/s. To optimize network buffer and link utilization, the 21364 routing protocol uses minimal adaptive routing algorithm. Chip Block Diagram Only a path with minimum number of hops from source to destination is used. However, a message can choose the Data IPx4 less congested minimal path (adaptive protocol). Both the Buffers Router I/O coherence and adaptive routing protocols can introduce L2 Tag deadlocks in the 21364 network. The coherence protocol Array Memory L2 RDRAM can introduce deadlocks due to cyclic dependence between Controller 0 Data different packet classes. For example, the Request packets Array L2 Cache can fill up the network and prevent the Response packets Controller Memory RDRAM from ever reaching their destinations. The 21364 breaks Core Controller 1 this cyclic dependence by creating virtual channels for L1 Cache Data each class of coherence packets and prioritizing the Address & Control dependence among these classes. By creating separate virtual channels for each class of packets, the router Figure 2. 21364 block diagram. guarantees that each class of packets can be drained independent of other classes. Thus, a Response packet can M M M M never block behind a Request packet. A Request can 364 364 364 364 generate a Block Response, but a Block Response cannot IO IO IO IO generate a Request. M M M M 364 364 364 364 Adaptive routing can generate two types of deadlocks: IO IO IO IO intra-dimensional (because the network is a torus, not a M M M M mesh) and inter-dimensional (arises in any square portion 364 364 364 364 of the mesh). The intra-dimensional deadlock is solved IO IO IO IO with virtual channels: VC0 and VC1. The inter- dimensional deadlocks are solved by allowing message to Figure 3. A 12-processor 21364-based multiprocessor. route in one dimension (e.g. East-West) before routing in the next dimension (e.g. North-South) [12]. Additionally, The router [16] connects multiple 21364s in a two- to facilitate adaptive routing, the 21364 provides a separate dimensional, adaptive, torus network (Figure 3). The virtual channel called the Adaptive channel for each class. router connects to 4 links that connect to 4 neighbors in Any message (other than I/O packets) can route through the torus: North, South, East, and West. Each router the Adaptive channel. However, if the Adaptive channels routes packets arriving from several input ports (L2 fill up, packets can enter the deadlock-free channels. cache, ZBoxes, I/O, and other routers) to several output ports. (i.e., L2 cache, ZBoxes, I/O, and other routers). To The previous-generation GS320 system uses a switch to avoid deadlocks in the coherence protocol and the connect four processors to the four memory modules in a single Quad Building Block (QBB) and then a hierarchical switch to connect QBBs into the larger-scale much lower access time than the off-chip caches in multiprocessor (up to 32 CPUs) [2]. GS320/ES45. 3. Memory Subsystem Figure 5 shows dependent load latency on GS1280 as both In this section we characterize memory subsystem of dataset size and stride increase.