Considerations in the Design of Hydra: a Multiprocessor-On-A-Chip Microarchitecture

CONSIDERATIONS IN THE DESIGN OF HYDRA: A MULTIPROCESSOR-ON-A-CHIP MICROARCHITECTURE Lance Hammond Kunle Olukotun Technical Report No.: CSL-TR-98-749 February 1998 This work was supported by DARPA contract DABT63-95-C-0089. Considerations in the Design of Hydra: A Multiprocessor-on-a-Chip Microarchitecture Lance Hammond and Kunle Olukotun CSL-TR-98-749 February 1998 Computer Systems Laboratory Departments of Electrical Engineering and Computer Science Stanford University Gates Computer Science Building, #408 Stanford, CA 94305-9040 [email protected] Abstract As more transistors are integrated onto larger dies, single-chip multiprocessors integrated with large amounts of cache memory will soon become a feasible alternative to the large, monolithic uniprocessors that dominate today’s microprocessor marketplace. Hydra offers a promising way to build a small-scale MP-on-a-chip using a fairly simple design that still maintains excellent performance on a wide variety of applications. This report examines key parts of the Hydra design — the memory hierarchy, the on-chip buses, and the control and arbitration mechanisms — and explains the rationale for some of the decisions made in the course of finalizing the design of this memory system, with particular emphasis given to applications that stress the memory system with numerous memory accesses. With the balance between complexity and performance that we obtain, we feel Hydra offers a promising model for future MP-on-a-chip designs. Keywords & Phrases: microprocessors, multiprocessors, MP-on-a-chip (CMP), cache memories, cache coherence, SimOS, SUIF i Copyright © 1997, 1998. Lance Hammond and Kunle Olukotun Considerations in the Design of Hydra: A Multiprocessor-on-a-Chip Microarchitecture Lance Hammond and Kunle Olukotun Computer Systems Laboratory Stanford University Stanford, CA 94305-4070 email: [email protected], [email protected], http://www-hydra.stanford.edu 1. Introduction ferent parts of memory are described in section 4. Control mechanisms, including the resource and address arbiters, are The Hydra microarchitecture is a research vehicle currently described briefly in section 5. Finally, section 6 concludes. being designed at Stanford in an effort to evaluate the concept of a multiprocessor on a chip as an alternative for future microprocessor development, when large numbers of transistors 2. Simulation Methodology and Results and RAM may be integrated on a single chip. We have previ- Hydra is currently being evaluated using a sophisticated, cycle- ously demonstrated the value of this approach, in a general- accurate memory simulator written in C++ that is interfaced ized processor [1]. This technical report provides a more de- with the SimOS machine simulator [7]. SimOS allows us to tailed view of the system we are currently designing. simulate four fully functional MIPS-II ISA CPUs and a suite of I/O devices with enough realism to boot and execute the Hydra is composed of four 2-way superscalar MIPS CPUs on IRIX 5.3 operating system under our tested applications. As a a single chip, each similar to a small R10000 processor [2] result, system calls and I/O in our benchmarks were simu- with individual L1 instruction and data caches attached. A lated with exceptional realism. Hydra simulates the memory single, unified L2 cache supports the individual L1 caches and system using a group of interconnected state machine con- provides a path for rapid and simple communication between trollers to evaluate the memory’s response to all memory ref- the processors. These two levels of the memory hierarchy erences, both instruction and data, supplied by SimOS. These and the bus interconnections between them are the focus of controllers communicate through shared models of the cen- the Hydra design effort described here. However, the design tral arbitration mechanisms and the caches in order to accu- would be incomplete without a state-of-the-art off-chip rately model the time required to complete all accesses. memory system. On one side of the chip, a 128-bit L3 cache bus attaches to an array of high-speed cache SRAMs, while This paper focuses on describing the Hydra hardware qualita- on the other side Rambus channels directly connect main tively, but some key numbers from two applications that we memory to Hydra and a more conventional bus connects Hy- evaluated are used throughout the text to illustrate the ratio- dra to I/O devices. Fig. 1 shows a diagram of the architecture, nale for key design features of Hydra. The numbers of inter- while fig. 2 depicts a possible layout for the completed de- est are plotted in figs. 3–6. Representative samples from the sign. core loops of the swim and tomcatv SPEC95FP benchmarks, parallelized automatically using the SUIF compiler system [8], This paper gives a brief overview of the microarchitecture and were executed on the simulator to get these results. While we attempts to describe some of the trade-offs that have been have examined several other applications from the SPEC suite, evaluated in the course of revising the design. Section 2 gives these two have exhibited the most interesting memory system a brief overview of the simulation environment we are using behavior because they stress the memory system with large to evaluate Hydra, and presents a few of our most interesting numbers of accesses. In contrast, the Hydra memory system’s results obtained through simulation. Section 3 presents a de- cache hierarchy easily handles the small data sets of the other scriptive overview of Hydra’s memory hierarchy, along with SPEC benchmarks. In the future, we may examine other ap- a qualitative view of how the different levels interact. The plications with large data sets, such as databases. communication buses used to transmit data between the dif- 1 Centralized Bus Arbitration Mechanisms CPU 0 CPU 1 CPU 2 CPU 3 L1 Inst. L1 Data L1 Inst. L1 Data L1 Inst. L1 Data L1 Inst. L1 Data Cache Cache Cache Cache Cache Cache Cache Cache CPU 0 Memory Controller CPU 1 Memory Controller CPU 2 Memory Controller CPU 3 Memory Controller Cache Writes out, Cache Writes out, Cache Writes out, Cache Writes out, refills invalidates in refills invalidates in refills invalidates in refills invalidates in Write-through Bus (64b) Read/Replace Bus (256b) Direct I/O Rambus Memory DMA On-chip L2 Cache Off-chip L3 Interface Interface I/O Bus Interface The Edge of the Chip Cache SRAM DRAM Main I/O Devices Array Memory Figure 1: A schematic overview of Hydra L3 Data Port “A” Clock & Test Ports FPU #0 FPU #1 Test L2 CPU CPU Tags #0 #1 Clocking & L2 Cache Data PMMU #0 PMMU #1 #0 #0 #1 #1 I-Cache D-Cache I-Cache D-Cache CPU0 Memory Address CPU1 Memory Control Jammer Control Memory Bus Wiring Channel Controller CPU2 Memory Central CPU3 Memory Main Memory RamBus Interface L2 / L3 Cache Control Arbiter Control RamBus Interface Controller #2 #2 #3 #3 L3 Tag and Address Port I-Cache D-Cache I-Cache D-Cache L2 Cache PMMU #2 PMMU #3 L2 Data CPU CPU Tags #2 #3 Control Unit System Bus FPU #2 FPU #3 I/O System Bus L3 Data Port “B” Figure 2: A possible layout of Hydra 2 100 16 90 swim 14 swim 80 tomcatv 12 tomcatv 70 60 10 50 8 40 6 30 % of time Occupied 4 20 10 % slowdown from “perfect” case 2 0 0 1-Port Dual Bus 1-Port L2 2-Port L2 2-Bus (R) 2-Port (R) 2-Bus (W) 2-Port (W) Single Bus Single Bus 1-Way SA L3 2-Way SA L3 1-Way SA L3 2-Way SA L3 L2 Ports L3 Associativity # of buses L2 Ports L3 Associativity # of buses Figure 3: Resource occupancies in Hydra for several simu- Figure 4: The % increase in execution time seen as a result of lated situations. (see section 3 for the L2/L3 cache discus- several architectural decisions, over “perfect” scenarios. sion and section 4 for discussion of the buses) (see section 3 for the L2/L3 cache discussion and section 4 for discussion of the buses) 60 swim 50 tomcatv 40 30 Pipelined L2 Only Pipelined L2 & L3 All Pipelined Roughly Pipelined 0.5 20 0 10 % slowdown from “perfect” case -0.5 0 -1 swim -1.5 tomcatv -2 % slowdown from “All Pipelined” case Pipelined / old arbiter -2.5 Pipelined / new arbiter Address Jamming Only Unpipelined / old arbiter Figure 5: The % increase in execution time, over “perfect” Figure 6: The % increase in execution time seen as a result of scenarios with no arbitration at all between fully pipelined variations in the degree of pipelining used in the read and state machines, seen as a result of variations in the arbitra- write control state machines. (see section 5 for discussion) tion mechanisms and pipelining of the read and write control state machines. (see section 5 for discussion) 3 Le1 Cache Le2CachLy3CachMainMemor SeparateI&D Shared,on-chipShared,off-chip ConfigurationSRAMcachepairs Off-chipDRAM SRAMcacheSRAMcache foreachCPU Capacity1B6KBeach5B12K8BM128M 128-bitsynchronous32-bitRambus 64-bitconnectionto256-bitreadbus+64- BusWidth SRAM(runningat(runningatthefull CPUbitwritebus* halftheCPUspeed)CPUspeed) AccessTime1sCPUcycle5dCPUcycle1s0cyclestofirstworatleast50cycle Associativity4y-way4*-wa2A-wayN/ LineSize3s2bytes3s2byte6s4byte4KBpage Writethrough,noWriteback,allocateon“writeback”(virtual WritePolicy Writeback allocateonwritewrites memorypaging) InclusionnotenforcedIncludesalldatainL1Includesallcached InclusionN/A byL2onL1cachesand/orL2cachesdata Table 1: A summary of the cache hierarchy. The two entries marked with asterisks, L2 cache bus width and L3 cache associativity, were varied during the experiments documented in figures 2 and 3. A third experiment, comparing a single L2 port against a pair of separate read and write ports, is also documented in the graphs but is not reflected in this table, since all other levels of memory are strictly single-ported. 3. The Memory Hierarchy fairly sophisticated arbitration logic between the L1 cache and the processors.

Considerations in the Design of Hydra: a Multiprocessor-On-A-Chip Microarchitecture

Spreading Excellence Report

A Compiler-Compiler for DSL Embedding

(URMD) Grad Cohort Workshop

Entrepreneurship Opportunities & Skills

Implementing and Evaluating Nested Parallel Transactions in Software Transactional Memory

Ecpe Connections

Kunle Olukotun Cadence Design Systems Professor and Professor of Electrical Engineering

Energy-Efficient Abundant-Data Computing: the N3XT 1,000X

News from SCS Networks

Multicore Cpus: Processor Proliferation - IEEE Spectrum 2/15/11 1:51 PM

University of Copenhagen

10003.Demeyerromain.2604.Pdf (0.1