Rock: A High-Performance SPARC CMT Processor
Shailender Chaudhry, Robert Cypher, Magnus Ekman, Martin Karlsson, Anders Landin, Sherman Yip, Håkan Zeffer, and Marc Tremblay
Sun Microsystems

Rock, Sun's third-generation chip-multithreading processor, contains 16 high-performance cores, each of which can support two software threads. Rock uses a novel checkpoint-based architecture to support automatic hardware scouting under a load miss, speculative out-of-order retirement of instructions, and aggressive dynamic hardware parallelization of a sequential instruction stream. It is also the first processor to support transactional memory in hardware.

Designing an aggressive chip-multithreaded (CMT) processor1 involves many tradeoffs. To maximize throughput performance, each processor core must be highly area and power efficient, so that many cores can coexist on a single die. Similarly, if the processor is to perform well on a wide spectrum of applications, per-thread performance must be high so that serial code also executes efficiently, minimizing performance problems related to Amdahl's law.2,3 Rock, Sun's third-generation chip-multithreading processor, uses a novel resource-sharing hierarchical structure tuned to maximize performance per area and power.4

Rock uses an innovative checkpoint-based architecture to implement highly aggressive speculation techniques. Each core has support for two threads, each with one checkpoint. The hardware support for threads can be used in two different ways: a single core can run two application threads, giving each thread hardware support for execute ahead (EA) under cache misses; or all of a core's resources can adaptively be combined to run one application thread with even more aggressive simultaneous speculative threading (SST), which uses two checkpoints. EA is an area-efficient way of creating a large virtual issue window without large associative structures. SST dynamically extracts parallelism, letting execution proceed in parallel at two different points in the program.

These schemes let Rock maximize throughput when parallelism is available, and focus the hardware resources on boosting per-thread performance when it is at a premium. Because these speculation techniques make Rock less sensitive to cache misses, we have been able to reduce on-chip cache sizes and instead add more cores to further increase performance. The high core-to-cache ratio requires high off-chip bandwidth. Rock-based systems use large caches in the off-chip memory controllers to maintain a relatively low access rate to DRAM; this maximizes performance and optimizes total system power.

To help the programmability of multiprocessor systems with hundreds of threads and to improve synchronization-related performance, Rock supports transactional memory (TM)5 in hardware; it is the first commercial processor to do so. TM lets a Rock thread efficiently execute all instructions in a transaction as an atomic unit, or not at all. Rock leverages the checkpointing mechanism with which it supports EA to implement the TM semantics with low additional hardware overhead.
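The all-or-nothing semantics can be illustrated with a short sketch. The `tx_begin`/`tx_commit` interface and the lock-based fallback below are hypothetical illustrations of checkpoint-based hardware TM in general; they are not Rock's actual instruction interface, which this article describes separately.

```c
#include <pthread.h>
#include <stdio.h>

/* Hypothetical hardware-TM interface. On checkpoint-based HTM such as
 * Rock's, tx_begin() would take a register checkpoint and return 1; an
 * abort would roll back all speculative state and resume here with a 0
 * return. This stub always reports "no transaction" so the sketch runs
 * (via the lock fallback) on any machine. */
static int  tx_begin(void)  { return 0; }
static void tx_commit(void) { }

static pthread_mutex_t fallback_lock = PTHREAD_MUTEX_INITIALIZER;

/* Move money between accounts: with real HTM, both stores become
 * visible atomically or not at all, and no lock is taken on the
 * fast path. */
void transfer(long *from, long *to, long amount)
{
    if (tx_begin()) {
        *from -= amount;    /* speculative stores, buffered by hardware */
        *to   += amount;
        tx_commit();        /* all-or-nothing publication               */
    } else {
        /* Abort path (or no HTM): conventional lock for forward progress. */
        pthread_mutex_lock(&fallback_lock);
        *from -= amount;
        *to   += amount;
        pthread_mutex_unlock(&fallback_lock);
    }
}

int main(void)
{
    long a = 100, b = 0;
    transfer(&a, &b, 40);
    printf("a=%ld b=%ld\n", a, b);   /* a=60 b=40 */
    return 0;
}
```

A production-quality fallback would also have to make lock holders visible to concurrently running transactions (for example, by reading the lock word inside the transaction); the sketch elides that detail.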
Figure 1. Die photo of the Rock chip (a) and logical overview of a two-processor-chip, 64-thread Rock-based system (b).

System overview

Figure 1 shows a die photo of the Rock processor and a logical overview of a Rock-based system using two processor chips. Each Rock processor has 16 cores, each configurable to run one or two threads; thus, each chip can run up to 32 threads. The 16 cores are divided into core clusters, with four cores per cluster. This design enables resource sharing among cores within a cluster, thus reducing the area requirements. All cores in a cluster share an instruction fetch unit (IFU) that includes the level-one (L1) instruction cache. We decided that all four cores in a cluster should share the IFU because it is relatively simple to fetch a large number of instructions per cycle. Thus, four cores can share one IFU in a round-robin fashion while maintaining full fetch bandwidth. Furthermore, the shared instruction cache enables constructive sharing of common code, as is encountered in shared libraries and operating system routines.

Each core cluster contains two L1 data caches (D cache) and two floating-point units (FGU), each of which is shared by a pair of cores. As Figure 1a shows, these structures are relatively large, so sharing them provides significant area savings. A crossbar connects the four core clusters with the four level-two cache banks (L2 cache), the second-level memory-management unit (MMU), and the system interface unit (SIU). Rock's L2 cache is 2 Mbytes, consisting of four 512-Kbyte, 8-way set-associative banks. A relatively large fraction of the die area is devoted to core clusters as opposed to L2 cache. We selected this design point after simulating designs with different ratios of core clusters to L2 cache; it provides the highest throughput of the different options and yields good single-thread performance.

Each Rock chip is directly connected to an I/O bridge ASIC and to multiple memory controller ASICs via high-bandwidth serializer/deserializer (SerDes) links. This system topology provides truly uniform access to all memory locations.
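To make the cache arithmetic above concrete, the sketch below decomposes a physical address under the stated L2 organization (four 512-Kbyte, 8-way set-associative banks). The 64-byte line size and low-order bank interleaving are assumptions for illustration; the article does not specify them here.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed parameters: the 64-byte line and low-order bank-select bits
 * are illustrative guesses, not documented Rock behavior. */
#define LINE_BYTES    64
#define WAYS          8
#define BANKS         4
#define BANK_BYTES    (512 * 1024)
#define SETS_PER_BANK (BANK_BYTES / (WAYS * LINE_BYTES))  /* = 1024 sets */

/* Split a physical address into line offset, bank, set index, and tag. */
static void l2_decompose(uint64_t pa)
{
    uint64_t offset = pa % LINE_BYTES;
    uint64_t line   = pa / LINE_BYTES;
    uint64_t bank   = line % BANKS;                    /* 2 bank bits   */
    uint64_t set    = (line / BANKS) % SETS_PER_BANK;  /* 10 index bits */
    uint64_t tag    = (line / BANKS) / SETS_PER_BANK;
    printf("pa=%#llx -> bank %llu, set %llu, offset %llu, tag %#llx\n",
           (unsigned long long)pa, (unsigned long long)bank,
           (unsigned long long)set, (unsigned long long)offset,
           (unsigned long long)tag);
}

int main(void)
{
    l2_decompose(0x12345678);   /* arbitrary example address */
    return 0;
}
```

Under these assumptions, the 2-Mbyte total capacity works out to 6 offset bits, 2 bank-select bits, and 10 set-index bits per bank (512 Kbytes / (8 ways x 64 bytes) = 1,024 sets).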
The memory controllers implement a directory-based MESI coherence protocol (which uses the states modified, exclusive, shared, and invalid) supporting multiprocessor configurations. We decided not to support the owned cache state (O) because there are no direct links between processors, so accesses to a cache line in the O state in another processor would require a four-hop transaction, which would increase the latency and the link bandwidth requirements.

To further reduce link bandwidth requirements, Rock systems use hardware to compress packets sent over the links between the processors and the memory controllers. Each memory controller has an 8-Mbyte level-three cache, with all the L3 caches in the system forming a single large, globally shared cache. This L3 cache reduces both the latency of memory requests and the DRAM bandwidth requirements. Each memory controller has 4 + 1 fully buffered dual in-line memory module (FB-DIMM) memory channels, with the fifth of these used for redundancy.

Pipeline overview

A strand consists of the hardware resources needed to execute a software thread. Thus, in a Rock system, eight strands share a fetch unit, which uses a round-robin scheme to determine which strand gets to fetch. Exploiting the great spatial locality typical in instruction data, the fetch unit fetches a full cache line of 16 instructions and forwards it to a strand-specific fetch buffer.

Rock's branch predictor is a 60-Kbit (30K two-bit saturating counters) gshare branch predictor, enhanced to take advantage of the software prediction bit provided in Sparc V9 branches. In each cycle, two program-counter-relative (PC-relative) branches are predicted. Indirect branches are predicted by either a 128-entry branch-target buffer or a 128-entry return-address predictor.

As Figure 2 shows, the fetch stages are connected to the decode pipeline through an instruction buffer. The decode unit, which serves one strand per cycle, has a helper ROM that it uses to decompose complex instructions into sequences of simpler instructions. Once decoded, instructions proceed to the appropriate strand-specific instruction queue. A round-robin arbiter decides which strand will have access to the issue pipeline the next cycle. On every cycle, up to four instructions can be steered from the instruction queue to the issue first-in, first-out buffers (FIFOs). There are five issue FIFOs (labeled A0, A1, BR, FP, and MEM in Figure 2); each is three entries deep and directly connected to its respective execution pipe. An issue FIFO will stall an instruction if there is a read-after-write (RAW) hazard. However, such a stall does not necessarily affect instructions in the other issue FIFOs: independent younger instructions in the other issue FIFOs can proceed to the execution unit out of order. Instructions that can't complete their execution go into the deferred instruction queue (DQ) for
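The gshare scheme named above is a standard design: the counter table is indexed by XORing program-counter bits with a global branch-history register. A minimal sketch follows, assuming a power-of-two table (32K entries for simple masking, where Rock's actual predictor holds 30K counters) and a single history register; the article does not give Rock's hashing details or how the Sparc V9 static-prediction bit is folded in.

```c
#include <stdint.h>

/* gshare: index the two-bit counter table with (PC bits XOR global
 * history). Table rounded up to 32K entries for power-of-two masking;
 * Rock's real predictor holds 30K counters (60 Kbits). */
#define TABLE_ENTRIES (1u << 15)
#define INDEX_MASK    (TABLE_ENTRIES - 1)

static uint8_t  counters[TABLE_ENTRIES];  /* 2-bit saturating, 0..3 */
static uint32_t ghist;                    /* global branch history  */

static uint32_t gshare_index(uint64_t pc)
{
    /* Drop the 2 always-zero bits of a 4-byte-aligned Sparc PC,
     * then XOR with the global history register. */
    return ((uint32_t)(pc >> 2) ^ ghist) & INDEX_MASK;
}

int predict_taken(uint64_t pc)
{
    return counters[gshare_index(pc)] >= 2;   /* counter's MSB */
}

void update(uint64_t pc, int taken)
{
    uint8_t *c = &counters[gshare_index(pc)];
    if (taken  && *c < 3) (*c)++;             /* saturate at 3 */
    if (!taken && *c > 0) (*c)--;             /* saturate at 0 */
    ghist = ((ghist << 1) | (uint32_t)taken) & INDEX_MASK;
}
```

The XOR with global history lets the same static branch map to different counters depending on the path taken to reach it, which is what allows a single 30K-counter table to capture correlated branch behavior.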