Sparc64 VIIIfx: A New-Generation Octocore Processor for Petascale Computing
The Sparc64 VIIIfx eight-core processor, developed for use in petascale computing systems, runs at speeds of up to 2 GHz and achieves a peak performance of 128 gigaflops while consuming as little as 58 watts of power. Sparc64 VIIIfx realizes a sixfold improvement in performance per watt over previous-generation Sparc64 processors.

Takumi Maruyama, Toshio Yoshida, Ryuji Kan, Iwao Yamazaki, Shuji Yamamura, Noriyuki Takahashi, Mikio Hondou
Fujitsu

Hiroshi Okano
Fujitsu Laboratories

Published by the IEEE Computer Society. 0272-1732/10/$26.00 © 2010 IEEE

High performance, low power consumption, and reliability are key requirements for processors in supercomputing systems. Processors used to build petascale computing systems, which are large-scale supercomputer installations that can aggregate more than 10,000 processor chips in a single system, must also provide an especially high level of reliability. By combining an enhanced instruction set architecture (ISA) implementing the high-performance computing-arithmetic computational extensions (HPC-ACE), Fujitsu's Sparc64 processor technology, and water cooling, the Sparc64 VIIIfx eight-core processor meets these stringent performance, power consumption, and reliability requirements.

Multicore chip architecture

The Sparc64 VIIIfx chip comprises eight identical cores with a shared level-2 (L2) cache [1] (see Figure 1). Each core consists of three units.

The instruction control unit handles instruction fetch, issue, and completion.

The execution unit performs integer and floating-point instructions. It includes two arithmetic logic units (ALUs) for integer instructions, two address generation units (AGUs) for load/store instructions, and four floating-point multiply-and-add (FMA) execution units for floating-point instructions. Each core executes up to four double-precision FMA operations per cycle. In addition, the execution unit includes 192 integer and 256 floating-point architectural registers as well as 32 integer and two sets of 48 floating-point renaming registers.

The storage unit executes load/store instructions. It includes two 32-Kbyte, two-way set-associative caches: an L1 instruction cache (I-cache) and an L1 data cache (D-cache). The L1 D-cache is dual-ported and can execute two load instructions per cycle, even if one or both of the target data are located across D-cache line boundaries. The storage unit also has hardware-prefetch engines, which software can control.

The L2 cache is a unified 5-Mbyte, 10-way set-associative cache shared between the eight cores. That is, each core can access any portion of the L2 cache, even though the cache is physically split. The L2 cache is connected to embedded memory controllers that communicate directly with double-data-rate-3 dual inline memory module (DDR3-DIMM) memory at a peak throughput of 64 Gbytes per second.
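The headline figures quoted so far can be cross-checked with a little arithmetic. The sketch below is only an illustration: it assumes that one FMA counts as two floating-point operations (a multiply and an add) and that "Gbytes" means decimal gigabytes, neither of which the article states explicitly. The constant and function names are made up for this example.

```python
# Figures quoted in the article.
CORES = 8                             # identical cores per chip
FMA_PER_CORE_PER_CYCLE = 4            # double-precision FMA operations per core per cycle
CLOCK_HZ = 2_000_000_000              # up to 2 GHz
MEMORY_BYTES_PER_S = 64_000_000_000   # 64 Gbytes/s peak (decimal gigabytes assumed)

# Assumption (not stated in the article): one FMA counts as two
# floating-point operations, a multiply and an add.
FLOPS_PER_FMA = 2


def peak_flops(cores: int, fma_per_cycle: int, clock_hz: int,
               flops_per_fma: int = FLOPS_PER_FMA) -> int:
    """Peak floating-point operations per second for the whole chip."""
    return cores * fma_per_cycle * flops_per_fma * clock_hz


def bytes_per_flop(memory_bytes_per_s: int, flops: int) -> float:
    """Peak memory bytes available per peak floating-point operation."""
    return memory_bytes_per_s / flops


chip_flops = peak_flops(CORES, FMA_PER_CORE_PER_CYCLE, CLOCK_HZ)
core_flops = chip_flops // CORES

if __name__ == "__main__":
    print(f"peak per chip: {chip_flops / 1e9:.0f} gigaflops")
    print(f"peak per core: {core_flops / 1e9:.0f} gigaflops")
    balance = bytes_per_flop(MEMORY_BYTES_PER_S, chip_flops)
    print(f"memory balance: {balance:.2f} bytes/flop")
```

Under these assumptions the product comes to 128 gigaflops per chip, which matches the quoted peak, or 16 gigaflops per core. The quoted memory throughput then works out to 0.5 bytes per peak floating-point operation.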
The Sparc64 VIIIfx chip is fabricated using Fujitsu's 45-nm complementary metal-oxide semiconductor (CMOS) process. It occupies a die area of approximately 510 mm² and contains around 760 million transistors. The chip runs at speeds of up to 2 GHz with a peak performance of 128 gigaflops.

Figure 1. The Sparc64 VIIIfx die contains eight identical cores, a shared 5-Mbyte level-2 cache, and four memory controllers. (Die labels: Core 0 to Core 7, L2 cache data, L2 cache control, MAC, DDR3 interface, HSIO.)

Processor pipeline

The Sparc64 VIIIfx pipeline is similar to that of the previous generation Sparc64 VII [2,3]. The integer-load pipeline of the Sparc64 VIIIfx has 16 stages, as Figure 2 shows. We categorize the stages as instruction fetch, instruction issue, execution, and commit.

Instruction fetch

Instruction fetch stages include address generation, translation look-aside buffer (TLB) tag access, cache tag match, cache read to buffer, and read result. The last stage, read result, overlaps the first stage of instruction issue (that is, entry). Instruction fetch stages work with the cache access unit to supply instructions to subsequent stages. Instructions fetched from the L1 I-cache are stored in the instruction buffer.

Sparc64 VIIIfx implements a branch prediction mechanism, which is supported by branch prediction resources called the branch history (BRHIS) and return address stack. Instruction fetch stages use these resources to determine fetch addresses.

Instruction fetch stages are designed to work independently of subsequent stages whenever possible. They fetch instructions until the instruction buffer is full, at which point prefetch requests can be sent to the L1 I-cache. A stall in the execution stages does not affect instruction fetch.

The instruction fetch unit has an additional feature for improving performance on local short loops. When it finds a short (maximum 24 instructions) loop in a program, it generates a chain of instruction buffer entries that correspond to the loop. It supplies instructions in the short loop from these entries rather than the L1 I-cache.

Instruction issue

Instruction issue stages are entry, predecode, and decode. As described earlier, the eight execution units in each core of the Sparc64 VIIIfx chip consist of two ALUs for integer instructions, two AGUs for load/store instructions, and four FMAs for floating-point instructions. Reservation stations in the instruction control unit correspond to the integer (RSE), floating-point (RSF), and load/store (RSA) instructions. The instruction issue stages decode and issue instructions to the appropriate reservation stations.

The predecode stage decodes the set extended arithmetic register (SXAR) instruction, which specifies additional information for other instructions. The instruction buffer sends up to six instructions (up to four non-SXAR instructions and up to two SXAR instructions) to the predecode stage, which packs them into four instructions and sends them to the decode stage. Packed instructions have extended register fields and single-instruction, multiple-data (SIMD) instruction attributes.

All resources needed to execute an instruction must be assigned in the issue stages. These resources include commit stack entries (CSE) and renaming registers. An assigned resource is specific to an instruction and cannot be assigned to another. During normal execution, assigned resources are released at the last stage of the pipeline, the write stage. Instructions between the entry and write stages are considered to be in flight.

When an exception is signaled, all in-flight instructions and assigned resources are released immediately. This behavior lets the decoder restart instruction issue as quickly as possible.

HOT CHIPS, MARCH/APRIL 2010

Figure 2. Sparc64 VIIIfx integer-load pipeline. There are 16 stages, which are categorized as instruction fetch, issue, execution, and commit.
Stage groups shown in the figure: Fetch (four stages); Issue (three stages); Dispatch, register read, execute (4 (int)/5 (fp) stages); Memory, L1 cache (3 (int)/4 (fp) stages); Commit (two stages).
Abbreviations used in the figure:
CSE: Commit stack entries
EAG: Effective address generation unit (A and B)
ECC: Error checking and correcting code
EX: Integer execution unit (A and B)
FL: Floating-point execution unit (A-D)
FPR: Floating-point register
FUB: Floating-point update buffer
GPR: General-purpose register
GUB: General-purpose update buffer
PC: Program counter register
RSA: Reservation station for address generation
RSBR: Reservation station for branch execution
RSE: Reservation station for integer execution
RSF: Reservation station for floating-point execution

Execution

There are four execution stages: [...] issued to reservation stations will be executed once certain conditions are met, such as having all source operands ready and an appropriate execution unit available.

We split the buffer read stage into two stages for reads to floating-point registers. Briefly, the floating-point register file supports SIMD execution, and registers are either SIMD basic or SIMD extended. Access to the floating-point register file occurs in the first buffer read stage. During the second buffer read stage, the hardware can exchange read data from the basic and extended sides