
Sparc64 VIIIfx: A New-Generation Octocore Processor for Petascale Computing

The Sparc64 VIIIfx eight-core processor, developed for use in petascale computing systems, runs at speeds of up to 2 GHz and achieves a peak performance of 128 gigaflops while consuming as little as 58 watts of power. Sparc64 VIIIfx realizes a six-fold improvement in performance per watt over previous-generation Sparc64 processors.

Takumi Maruyama, Toshio Yoshida, Ryuji Kan, Iwao Yamazaki, Shuji Yamamura, Noriyuki Takahashi, and Mikio Hondou, Fujitsu
Hiroshi Okano, Fujitsu Laboratories

High performance, low power consumption, and reliability are key requirements for processors in supercomputing systems. Processors used to build petascale computing systems, which are large-scale installations that can aggregate more than 10,000 processor chips in a single system, must also provide an especially high level of reliability. By combining an enhanced instruction set architecture (ISA) implementing the high-performance computing-arithmetic computational extensions (HPC-ACE), Fujitsu's Sparc64 processor technology, and water cooling, the Sparc64 VIIIfx eight-core processor meets these stringent performance, power consumption, and reliability requirements.

Multicore chip architecture
The Sparc64 VIIIfx chip comprises eight identical cores with a shared level-2 (L2) cache1 (see Figure 1). Each core consists of three units.

The instruction control unit handles instruction fetch, issue, and completion.

The execution unit performs integer and floating-point instructions. It includes two arithmetic logic units (ALUs) for integer instructions, two address generation units (AGUs) for load/store instructions, and four floating-point multiply-and-add (FMA) execution units for floating-point instructions. Each core executes up to four double-precision FMA operations per cycle. In addition, the execution unit includes 192 integer and 256 floating-point architectural registers as well as 32 integer and two sets of 48 floating-point renaming registers.

The storage unit executes load/store instructions. It includes two 32-Kbyte, two-way set-associative caches: an L1 instruction cache (I-cache) and an L1 data cache (D-cache). The L1 D-cache is dual-ported and can execute two load instructions per cycle, even if one or both of the target data are located across D-cache line boundaries. The storage unit also has hardware prefetch engines, which software can control.
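These per-core execution resources account for the chip's 128-gigaflop peak rating. A quick arithmetic check in C, using only numbers stated in this article:

    /* Peak double-precision performance: eight cores, each retiring up to four fused
     * multiply-adds per cycle, each FMA counting as two floating-point operations. */
    double peak_gflops(void) {
        const double clock_ghz     = 2.0;
        const int    cores         = 8;
        const int    fma_per_cycle = 4;
        const int    flops_per_fma = 2;                              /* multiply + add */
        return clock_ghz * cores * fma_per_cycle * flops_per_fma;    /* = 128.0 */
    }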


The L2 cache is a unified 5-Mbyte, 10-way set-associative cache shared between the eight cores. That is, each core can access any portion of the L2 cache, even though the cache is physically split. The L2 cache is connected to embedded memory controllers that communicate directly with double-data-rate-3 dual inline memory module (DDR3-DIMM) memory at a peak throughput of 64 Gbytes per second.

The Sparc64 VIIIfx chip is fabricated using Fujitsu's 45-nm complementary metal-oxide semiconductor (CMOS) process. It occupies a die area of approximately 510 mm2 and contains around 760 million transistors. The chip runs at speeds of up to 2 GHz with a peak performance of 128 gigaflops.

Figure 1. The Sparc64 VIIIfx die contains eight identical cores, a shared 5-Mbyte level-2 cache, and four memory controllers.

Processor pipeline
The Sparc64 VIIIfx pipeline is similar to that of the previous generation Sparc64 VII.2,3 The integer-load pipeline of the Sparc64 VIIIfx has 16 stages, as Figure 2 shows. We categorize the stages as instruction fetch, instruction issue, execution, and commit.

Instruction fetch
Instruction fetch stages include address generation, translation look-aside buffer (TLB) tag access, cache tag match, cache read to buffer, and read result. The last stage, read result, overlaps the first stage of instruction issue (that is, entry). Instruction fetch stages work with the cache access unit to supply instructions to subsequent stages. Instructions fetched from the L1 I-cache are stored in the instruction buffer.

Sparc64 VIIIfx implements a branch prediction mechanism, which is supported by branch prediction resources called the branch history (BRHIS) and return address stack. Instruction fetch stages use these resources to determine fetch addresses.

Instruction fetch stages are designed to work independently of subsequent stages whenever possible. They fetch instructions until the instruction buffer is full, at which point prefetch requests can be sent to the L1 I-cache. A stall in the execution stages does not affect instruction fetch.

The instruction fetch unit has an additional feature for improving performance on local short loops. When it finds a short (maximum 24 instructions) loop in a program, it generates a chain of instruction buffer entries that correspond to the loop. It supplies instructions in the short loop from these entries rather than the L1 I-cache.

Instruction issue
Instruction issue stages are entry, predecode, and decode. As described earlier, the eight execution units in each core of the Sparc64 VIIIfx chip consist of two ALUs for integer instructions, two AGUs for load/store instructions, and four FMAs for floating-point instructions. Reservation stations in the instruction control unit correspond to the integer (RSE), floating-point (RSF), and load/store (RSA) instructions. The instruction issue stages decode and issue instructions to the appropriate reservation stations.

The predecode stage decodes the set extended arithmetic register (SXAR) instruction, which specifies additional information for other instructions. The instruction buffer sends up to six instructions (up to four non-SXAR instructions and up to two SXAR instructions) to the predecode stage, which packs them into four instructions and sends them to the decode stage. Packed instructions have extended register fields and single-instruction, multiple-data (SIMD) instruction attributes.

All resources needed to execute an instruction must be assigned in the issue stages. These resources include commit stack entries (CSE) and renaming registers.
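A rough software model of this resource bookkeeping (a sketch only, not the actual hardware; the 48-entry CSE size is taken from Figure 2): issue stalls when no commit stack entry is free, commit releases entries in program order, and an exception flushes everything in flight at once.

    #include <stdbool.h>

    #define CSE_ENTRIES 48              /* 48 commit stack entries per core (Figure 2) */

    typedef struct { bool valid; bool done; /* plus renaming-register IDs, etc. */ } cse_entry_t;

    typedef struct {
        cse_entry_t entry[CSE_ENTRIES];
        int head, tail, count;          /* circular buffer kept in program order */
    } commit_stack_t;

    /* Issue stalls unless a CSE entry (and renaming registers) can be assigned. */
    bool try_issue(commit_stack_t *cs) {
        if (cs->count == CSE_ENTRIES) return false;   /* no free resources: stall issue */
        cs->entry[cs->tail] = (cse_entry_t){ .valid = true, .done = false };
        cs->tail = (cs->tail + 1) % CSE_ENTRIES;
        cs->count++;
        return true;
    }

    /* Commit releases resources in program order once the oldest instruction is done. */
    void commit_oldest(commit_stack_t *cs) {
        while (cs->count > 0 && cs->entry[cs->head].done) {
            cs->entry[cs->head].valid = false;        /* release CSE + renaming registers */
            cs->head = (cs->head + 1) % CSE_ENTRIES;
            cs->count--;
        }
    }

    /* On an exception, every in-flight instruction and its resources are dropped at
     * once, letting the decoder restart issue immediately. */
    void flush_all(commit_stack_t *cs) { cs->head = cs->tail = cs->count = 0; }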



Figure 2. Sparc64 VIIIfx integer-load pipeline. There are 16 stages, which are categorized as instruction fetch (four stages), issue (three stages), dispatch/register-read/execute (four integer or five floating-point stages), L1 cache access (three integer or four floating-point stages), and commit (two stages). (CSE: commit stack entries; EAG: effective address generation unit; EX: integer execution unit; FL: floating-point execution unit; FPR: floating-point register; FUB: floating-point update buffer; GPR: general-purpose register; GUB: general-purpose update buffer; PC: program counter register; RSA: reservation station for address generation; RSBR: reservation station for branch execution; RSE: reservation station for integer execution; RSF: reservation station for floating-point execution.)

An assigned resource is specific to an instruction and cannot be assigned to another. During normal execution, assigned resources are released at the last stage of the pipeline, the write stage. Instructions between the entry and write stages are considered to be in flight. When an exception is signaled, all in-flight instructions and assigned resources are released immediately. This behavior lets the decoder restart instruction issue as quickly as possible.

Execution
There are four execution stages: priority, buffer read, execute, and update. Instructions issued to reservation stations will be executed once certain conditions are met, such as having all source operands ready and an appropriate execution unit available.

We split the buffer read stage into two stages for reads to floating-point registers. Briefly, the floating-point register file supports SIMD execution, and registers are either SIMD basic or SIMD extended. Access to the floating-point register file occurs in the first buffer read stage. During the second buffer read stage, the hardware can exchange read data from the basic and extended sides to enable access to a flat floating-point register file and/or complex operations.

The execution latency varies depending on the instruction. The renaming registers store the operation's result until the instruction commits.

Commit
The two commit stages are complete and write. Following out-of-order execution, instructions commit in program order. Instructions complete, and results are written from the renaming registers to the architectural registers. Up to four packed instructions can be committed per cycle.

Exceptions are handled in the commit stages. That is, exceptions that occur during an execution stage are not handled immediately but are signaled when the instruction commits.

High-performance computing-arithmetic computational extensions
During the initial design phase for Sparc64 VIIIfx, we performed numerous performance evaluation studies to determine the best method for meeting the performance and power consumption goals. We concluded that we needed to significantly enhance the Sparc-V9 ISA,4,5 used by the previous generation Sparc64 VII. The alternative was to use a higher processor frequency, which would have resulted in significantly higher power consumption. We therefore developed an instruction set extension called the high-performance computing-arithmetic computational extensions.6 HPC-ACE defines various performance-improving features, such as large register sets and SIMD instructions.

Sparc64 VIIIfx complies with the Sparc-V9 standard and is compatible with user software written for previous Sparc64 processors.

Large register sets
We performed cycles-per-instruction (CPI) analysis for scientific applications running on the previous generation Sparc64 VII processor. The floating-point execution units exhibited relatively low occupancy rates for certain applications despite a large amount of inherent parallelism. We discovered that the small number of architectural registers in Sparc64 VII limited the actual parallelism.

The Sparc-V9 standard defines 32 double-precision floating-point registers. Although this number is comparable to other general-purpose processors, it is insufficient when running a large number of scientific applications. Having many architectural registers helps the compiler extract more parallelism from programs. Combined with loop unrolling and software pipelining, having numerous architectural registers enables highly parallel execution. In addition, a large register set reduces overhead due to register spills and fills.

The Sparc-V9 standard defines 160 integer registers (when the number of windows is 8). As Figure 3a shows, HPC-ACE increases the number of integer registers to 192. Similarly, HPC-ACE increases the number of floating-point registers from 32 to 256 (Figure 3b). The first 160 integer and 32 floating-point registers in HPC-ACE are the same as those in the Sparc-V9 standard. Integer registers are organized as Sparc-V9 register windows, except for the 32 additional registers. So, we can access a total of 64 integer registers without changing the register window.

Figure 3. Registers defined by the high-performance computing-arithmetic computational extensions (HPC-ACE). The Sparc-V9 standard defines 160 integer registers and 32 floating-point registers. HPC-ACE increases the number of integer registers to 192 (a) and the number of floating-point registers to 256 (b).
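A minimal sketch of why the larger register file matters (illustrative C, not code from the article): an unrolled, software-pipelined loop keeps many temporaries live at once, which pays off only if the compiler has enough architectural registers to hold them without spilling.

    /* Eight-way unrolled multiply-add loop: eight independent FMA chains are in flight
     * at once, and all of their temporaries must live in registers to avoid spills. */
    void daxpy_unrolled(int n, double a, const double *x, double *y) {
        int i;
        for (i = 0; i + 8 <= n; i += 8) {
            y[i+0] += a * x[i+0];
            y[i+1] += a * x[i+1];
            y[i+2] += a * x[i+2];
            y[i+3] += a * x[i+3];
            y[i+4] += a * x[i+4];
            y[i+5] += a * x[i+5];
            y[i+6] += a * x[i+6];
            y[i+7] += a * x[i+7];
        }
        for (; i < n; i++)                  /* remainder loop */
            y[i] += a * x[i];
    }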



Figure 4. Set extended arithmetic register (SXAR) instruction usage. The instruction specifies the upper 3 bits of the register fields for the subsequent one or two instructions; those instructions carry the lower 5 bits of each register field in their ordinary register fields.
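A small model of that encoding split (illustrative C, not the actual instruction formats): an 8-bit HPC-ACE register number is divided into 3 upper bits carried by the SXAR prefix and 5 lower bits carried in the ordinary Sparc-V9 register field of the following instruction.

    #include <stdint.h>

    /* Upper 3 bits go into the SXAR prefix; lower 5 bits stay in the instruction. */
    static inline uint8_t sxar_upper_bits(uint8_t regno)  { return regno >> 5; }
    static inline uint8_t insn_lower_bits(uint8_t regno)  { return regno & 0x1f; }

    /* Reconstructs a full register number in the range 0..255. */
    static inline uint8_t full_regno(uint8_t upper3, uint8_t lower5) {
        return (uint8_t)((upper3 << 5) | lower5);
    }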

One of the difficulties in implementing large architectural register sets is the number of bits required to encode register numbers in an instruction. Supporting 256 floating-point registers for FMA instructions requires four 8-bit register number fields; the length of a Sparc-V9 instruction, however, is limited to 32 bits.

To solve this issue, we defined a new prefix instruction that specifies the upper three bits of the register number fields for the one or two instructions that immediately follow the prefix instruction, as Figure 4 shows. We call this prefix instruction the SXAR instruction because the instruction writes an immediate value containing the upper three bits of the register number fields to the extended arithmetic register (XAR).

The XAR allows interrupts between an SXAR instruction and subsequent instructions. Once the information specified in the SXAR instruction is stored in the XAR, the subsequent one or two instructions reference the XAR during execution. The XAR is cleared as these instructions commit. If an interrupt occurs between the SXAR and subsequent instructions, the hardware saves the contents of the XAR into the trap XAR (TXAR). The TXAR's contents are written back to the XAR on a return from an interrupt or trap handler. This behavior is similar to the trap behavior for the program counter register (PC), the next program counter register (NPC), and the processor state register (PSTATE) as defined in the Sparc-V9 standard.

Single-instruction, multiple-data instructions
Another limitation of the Sparc-V9 standard is the lack of single-instruction, multiple-data instructions. SIMD instructions help increase execution throughput without the need to increase instruction fetch and decode bandwidth. HPC-ACE defines two-way floating-point SIMD instructions. That is, one SIMD floating-point instruction executes two single-precision or double-precision floating-point operations. One SIMD FMA instruction executes two multiply operations and two add operations.

SIMD load/store instructions access two contiguous single-precision or double-precision data in memory. The data alignment required by a double-precision SIMD load instruction is 8 bytes, rather than 16 bytes, to give the compiler more opportunities to use SIMD instructions.
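In scalar terms, one two-way double-precision SIMD FMA behaves like the following pair of independent fused multiply-adds (an illustration of the semantics only, not an actual intrinsic):

    /* One two-way SIMD FMA: two multiplies and two adds per instruction. */
    void simd_fma2(const double a[2], const double b[2], const double c[2], double r[2]) {
        r[0] = a[0] * b[0] + c[0];
        r[1] = a[1] * b[1] + c[1];
    }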

Instead of defining a new opcode, the SXAR instruction described above is used to specify SIMD operations. One SXAR instruction can specify SIMD operations for the one or two floating-point instructions immediately following the SXAR instruction. Most Sparc-V9 floating-point instructions can be executed as two-way SIMD operations, with a few exceptions, such as the divide and square root instructions. These instructions have long execution latencies (that is, they have low execution throughput), making it difficult to keep the pipeline full. Instead, HPC-ACE defines instructions for floating-point reciprocal approximation of divide and square root. Using these new instructions, the execution throughput for divide and square root in Sparc64 VIIIfx is more than four times that of the previous generation Sparc64 VII.

To support SIMD execution, we divide the floating-point registers into register pairs: a SIMD basic register and a SIMD extended register. A SIMD instruction references register pairs, and one operation is executed using the SIMD basic registers and the other using the SIMD extended registers.

A unique feature of HPC-ACE is that the floating-point registers are flat. That is, both non-SIMD instructions and SIMD instructions can access all the floating-point registers. For example, we could load four noncontiguous data in memory into two pairs of floating-point registers with four non-SIMD instructions and then calculate two results from the loaded data using a single SIMD instruction.

Software-controlled cache
As mentioned earlier, the Sparc64 VIIIfx's peak performance is 128 gigaflops. The processor provides a large amount of off-chip memory bandwidth (peak 64 Gbytes per second), but additional mechanisms are needed to provide the sustained on-chip bandwidth required to effectively utilize the chip's full floating-point execution capability.

Cache and local memory are often used to hide the difference in speed between processor and memory. Both are buffers placed between the processor and memory to hide memory access latencies and allow data reuse. Cache is controlled by hardware and is invisible to software. That is, a program need not be aware of the cache. Local memory, on the other hand, is controlled by software, and a program must be explicitly written to use it. In general, local memory can achieve higher performance than a hardware-controlled cache if the application is carefully written by a programmer familiar with the program's characteristics. Writing such a program is not easy, however.

HPC-ACE defines a software-controlled sector cache that combines the advantages of a conventional cache and local memory. The basic idea is to allow software to optimize cache performance by treating the cache like local memory while maintaining cache coherency. Software divides the cache into two sectors: sector 0 and sector 1. Sector 0 is used for instruction fetch and normal operand accesses. Sector 1 is reserved for operand accesses explicitly specified by software through the SXAR instruction. Software specifies the relative sizes of sector 0 and sector 1 via the configuration register.

If a given application accesses data in a streaming fashion, assigning streaming data accesses to sector 1 will avoid cache pollution without affecting data in sector 0. Similarly, if an application requires certain data to reside in the cache, the software will assign that data to sector 1. The hardware will perform instruction fetch and other data accesses using sector 0, and will not replace data in sector 1.

We implement the sector cache by storing sector information with each cache line, as Figure 5 shows. When a cache miss occurs, the hardware chooses the cache way to be replaced such that the ratio of sectors specified by the configuration register is maintained whenever possible. The hardware uses this in addition to the normal least recently used replacement policy.

Figure 5. Sector cache. At each cache index, sector 0 contains 3 cache ways and sector 1 contains 7 cache ways.
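A conceptual sketch of that sector-aware replacement (a model under simple assumptions, not the actual control logic): each line carries a sector bit, and on a miss the victim way is chosen so that the number of sector-1 ways in the set tracks the configured ratio, with least-recently-used selection inside the chosen sector.

    #define WAYS 10                                   /* 10-way set-associative L2 */

    typedef struct { int sector; unsigned lru_age; } line_t;

    int choose_victim(const line_t set[WAYS], int miss_sector, int target_sector1_ways) {
        int sector1_ways = 0;
        for (int w = 0; w < WAYS; w++)
            if (set[w].sector == 1) sector1_ways++;

        /* Evict from whichever sector is at or above its configured share. */
        int evict_sector;
        if (miss_sector == 1)
            evict_sector = (sector1_ways < target_sector1_ways) ? 0 : 1;
        else
            evict_sector = (sector1_ways > target_sector1_ways) ? 1 : 0;

        /* Least recently used line within the chosen sector. */
        int victim = -1;
        for (int w = 0; w < WAYS; w++)
            if (set[w].sector == evict_sector &&
                (victim < 0 || set[w].lru_age > set[victim].lru_age))
                victim = w;

        if (victim < 0)                               /* chosen sector empty: plain LRU */
            for (int w = 0; w < WAYS; w++)
                if (victim < 0 || set[w].lru_age > set[victim].lru_age)
                    victim = w;
        return victim;
    }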



Virtual single processor by integrated multicore parallel architecture
Sparc64 VIIIfx's shared L2 cache helps avoid false sharing between cores. The chip also has a hardware barrier for intercore synchronization. These features are similar to those in Sparc64 VII. By combining these hardware features with Fujitsu's compiler technology for automatic parallelization, users can treat the eight-core Sparc64 VIIIfx as a single fast CPU; that is, users do not have to be aware of the eight cores to write programs. We call this combination the virtual single processor by integrated multicore parallel architecture (Visimpact).

Software optimization for nested loops illustrates the advantages of Visimpact over vector processors and conventional scalar processors. In Figure 6, for example, the outer loop J of code A cannot be executed in parallel, because the value of A(J) depends on A(J+1) in the inner loop I. Similarly, the inner loop K of code B cannot be executed in parallel.

On a vector processor, code A works well because the inner loop I is executed as a vector. However, code B cannot be executed in parallel. On a conventional scalar processor, the inner loop I of code A is executed poorly on multiple cores, because synchronization overhead on every J iteration and false-sharing overhead between cores are high. On the other hand, code B works well on a scalar processor by executing the outer loop L in parallel on multiple cores.

In contrast, Sparc64 VIIIfx can execute the inner loop I of code A in parallel on multiple cores effectively, because the hardware synchronizes multiple cores with little overhead. In addition, code B works well on both Sparc64 VIIIfx and conventional scalar processors. So, Sparc64 VIIIfx is the only one of the three on which codes A and B both work well. This lets the compiler perform additional optimizations.

Other enhancements
HPC-ACE includes many other enhancements of the Sparc-V9 standard. It supports conditional instructions that can be used in place of conditional branches, which are bottlenecks for loop unrolling and software pipelining. This allows efficient execution of if-loops. It also includes floating-point minimum and maximum instructions.

HPC-ACE also supports floating-point trigonometric functions. A newly defined instruction called ftrimadd calculates the minimax approximation of the sine and cosine functions piece by piece. The instruction multiplies the partial result of the previous instruction by the square of the input operand and then adds the appropriate coefficient, which hardware provides.

Performance
The HPC-ACE and enhanced hardware features help Sparc64 VIIIfx achieve much higher performance than the previous generation Sparc64 VII processor.

Processor core performance
Figure 7 shows the performance of the Sparc64 VIIIfx core. The vertical bars show execution times relative to a Sparc64 VII core running at 2.5 GHz, and the results are divided on the horizontal axis by benchmark type: divide, sine, and a ninth-degree polynomial. For each benchmark, the left bar indicates the result on Sparc64 VII, and the middle and right bars indicate results on Sparc64 VIIIfx. The bar on the right shows the results with SIMD instructions, and the bar in the middle shows the results without.

On the divide benchmark, most of the Sparc64 VII execution time is spent waiting for the execution units, as the 0-commits section of the bar shows. The new instruction for floating-point reciprocal approximation of divide reduces the time spent waiting for the execution units to nearly zero on Sparc64 VIIIfx, which runs 4.3 times faster than Sparc64 VII. The sine benchmark shows an even larger improvement: the execution time on Sparc64 VIIIfx is 6.8 times faster despite the lower processor frequency. The improvement on the ninth-degree polynomial benchmark is not as dramatic because performance on this benchmark can be highly optimized using only the 32 floating-point registers defined in the Sparc-V9 standard.
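The divide result reflects replacing a long-latency divide with a short, pipelined sequence built on the reciprocal-approximation instruction. The following is a generic illustration of that technique, not Fujitsu's exact code sequence; frcpa here is a hypothetical stand-in for the hardware reciprocal-approximation operation, refined with Newton-Raphson steps that map onto fully pipelined FMAs:

    /* x / y computed as x * (1/y), with the seed refined by Newton-Raphson iterations:
     * r' = r * (2 - y * r). The error roughly squares with each step. */
    double div_via_reciprocal(double x, double y, double (*frcpa)(double)) {
        double r = frcpa(y);                 /* low-precision hardware seed: r ~ 1/y */
        r = r * (2.0 - y * r);
        r = r * (2.0 - y * r);
        r = r * (2.0 - y * r);
        return x * r;
    }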

Figure 6. Comparison of software optimization for nested loops on vector processors (a), conventional scalar processors (b), and Visimpact (c). The loop nests compared are code A and code B:

    Code A:
    DO J=1,N
      DO I=1,M
        A(J)=A(J)+A(J+1)*B(I,J)
      END
    END

    Code B:
    DO L=1,N
      DO K=1,M
        A(K,L)=A(K,L)+A(K+1,L)*B(K,L)
      END
    END
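One way to picture the Visimpact treatment of code A is the following OpenMP rendering in C (an illustration only; the Fujitsu compiler performs the transformation automatically and synchronizes the cores with the on-chip hardware barrier rather than a software construct). The J loop stays serial because each iteration must read the original A(J+1), but the reduction over I within each J iteration is spread across the eight cores:

    /* Code A from Figure 6, restated in C with the inner loop parallelized. */
    void code_a(int n, int m, double *a, const double *b, int ldb)
    {
        for (int j = 0; j < n; j++) {               /* serial: A(J) depends on A(J+1)     */
            double sum = 0.0;
            #pragma omp parallel for reduction(+ : sum)
            for (int i = 0; i < m; i++)             /* reduction split across the cores   */
                sum += a[j + 1] * b[i * ldb + j];   /* B(I,J), row stride ldb             */
            a[j] += sum;                            /* cores resynchronize every j        */
        }
    }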

Figure 7. Performance comparison of the Sparc64 VIIIfx and Sparc64 VII cores on divide (a), sine (b), and ninth-degree polynomial (c) benchmarks. With SIMD instructions, the Sparc64 VIIIfx core is 4.3 times faster than Sparc64 VII on divide, 6.8 times faster on sine, and 1.4 times faster on the ninth-degree polynomial. (Bars are broken down by commits per cycle: 0 commits due to cache/system, 0 commits due to execution unit, 1 commit, 2-3 commits, and 4 commits.)

Processor chip performance
Figure 8 shows the performance of the Sparc64 VIIIfx chip relative to Sparc64 VII for two benchmarks from scientific applications for molecular dynamics and fluid dynamics. Sparc64 VIIIfx reduces the time spent waiting for execution units to almost zero on these two applications. These results demonstrate the effectiveness of HPC-ACE for scientific applications. Sparc64 VIIIfx runs 2.4 and 2.6 times faster than Sparc64 VII on these applications with the current compiler. With future compiler optimizations, performance should improve to about three times that of Sparc64 VII.

Power consumption
The Sparc64 VIIIfx chip uses water cooling to reduce leakage power and fine-grained clock gating to reduce dynamic power. We apply various other circuit techniques to reduce power consumption. As a result, the chip consumes as little as 58 watts on an average process with a junction temperature (Tj) of 30°C.

Figure 9 compares the peak performance and power consumption of successive Sparc64 processors.7-9



We normalized all values for performance and power consumption to the values for Sparc64 V. As the graph illustrates, Sparc64 VI and VII provided improved performance with small increases in power consumption. Sparc64 VIIIfx is the first Sparc64 processor to increase performance threefold over the previous generation while simultaneously decreasing power consumption by half.

Figure 8. Performance comparison of the Sparc64 VIIIfx and Sparc64 VII chips for two scientific applications. Sparc64 VIIIfx is 2.6 times faster than Sparc64 VII on the molecular dynamics application (a) and 2.4 times faster on the fluid dynamics application (b).

Figure 9. Comparison of peak performance and power for successive generations of Sparc64 (Sparc64 V: 1.35 GHz, 130 nm; V+: 2.16 GHz, 90 nm; VI: 2.4 GHz, 90 nm; VII: 2.5 GHz, 65 nm; VIIIfx: 2.0 GHz, 45 nm). Sparc64 VIIIfx is the first Sparc64 processor to increase performance threefold over the previous generation while simultaneously decreasing power consumption by half.

Reliability, availability, and serviceability
Sparc64 VIIIfx combines Fujitsu's mainframe and Unix reliability, availability, and serviceability (RAS) features10 with water cooling to achieve the high level of reliability required in petascale computing systems.

Both tag and data for all embedded caches are either error-correcting code (ECC) protected, or duplicated and parity protected. Both integer and floating-point architectural registers are ECC protected as well. Hardware detects and corrects single-bit errors in these resources. Other internal registers are parity protected, and ALUs are parity or residue protected.

Hardware instruction retry
Sparc64 VIIIfx implements an instruction retry mechanism for correcting any single-bit errors that occur in the registers or ALUs, as Figure 10 shows. When an error is detected, all instructions that are currently executing are cancelled. Since all internal states before the commit stages can be discarded, the programmable resources will see the results of only those instructions that have completed execution without causing errors. In addition to preventing programmable resources from being corrupted, the instruction retry mechanism allows hardware to retry an instruction after error detection. Instruction retry is automatically started on error detection. The instruction that caused the error is executed alone to maximize the possibility of successful execution. If the instruction commits successfully, the hardware automatically resumes normal execution.

During the instruction retry process, software intervention is not needed. If the retry succeeds, the error is completely invisible to software. If a retry does not succeed, instruction retry repeats until a threshold is reached (or retry succeeds). The hardware generates an interrupt that notifies software of the errors when the threshold is reached.
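The retry flow can be summarized as a small state machine (a conceptual model in C, not the actual control logic; the callbacks are placeholders for hardware actions):

    enum retry_result { RETRY_CORRECTED, RETRY_EXHAUSTED };

    enum retry_result instruction_retry(int retry_threshold,
                                        int (*reexecute_alone)(void),   /* 1 if the commit succeeds */
                                        void (*flush_in_flight)(void),
                                        void (*raise_interrupt)(void))
    {
        for (int attempt = 0; attempt < retry_threshold; attempt++) {
            flush_in_flight();              /* discard all uncommitted state              */
            if (reexecute_alone())          /* single-step the failing instruction        */
                return RETRY_CORRECTED;     /* transient error, invisible to software     */
        }
        raise_interrupt();                  /* threshold reached: notify software         */
        return RETRY_EXHAUSTED;
    }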

Figure 10. Sparc64 VIIIfx reliability, availability, and serviceability (RAS) features. The instruction retry mechanism corrects any single-bit errors in the registers or arithmetic logic units (ALUs); normal execution resumes after the re-executed instruction commits without error.

Coverage
Because no standard benchmark for quantifying reliability exists, we use Figure 11 to graphically show RAS coverage in the Sparc64 VIIIfx chip. Most of the chip area is dark gray, which means 1-bit error correctable. Light gray means 1-bit error detectable, and white means 1-bit error harmless. Sparc64 VIIIfx guarantees data integrity through extensive use of RAS features.
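For concreteness, the parity and residue protection mentioned earlier can be pictured with two generic textbook checks (not Fujitsu's checker circuits): a stored parity bit flags a single flipped bit in a register value, and a mod-3 residue computed independently of the ALU cross-checks an addition.

    #include <stdbool.h>
    #include <stdint.h>

    /* Parity of a 64-bit value (1 if an odd number of bits are set). */
    static unsigned parity64(uint64_t v) {
        v ^= v >> 32; v ^= v >> 16; v ^= v >> 8; v ^= v >> 4; v ^= v >> 2; v ^= v >> 1;
        return (unsigned)(v & 1);
    }

    /* A stored parity bit detects any single-bit flip in a protected register value. */
    bool parity_check(uint64_t value, unsigned stored_parity) {
        return parity64(value) == stored_parity;
    }

    /* Residue check for an adder: the mod-3 residue of the result, computed separately
     * from the ALU, must match the combined residues of the operands. (The carry-out of
     * the 64-bit addition is ignored here for simplicity; a real checker accounts for it.) */
    bool residue_check_add(uint64_t a, uint64_t b, uint64_t alu_sum) {
        return alu_sum % 3 == (a % 3 + b % 3) % 3;
    }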

Figure 11. Diagram of Sparc64 VIIIfx RAS coverage.

Power consumption is one of the biggest issues for recent high-end processors used in HPC and Unix servers. Sparc64 VIIIfx realizes a six-fold improvement in performance per watt over previous-generation Sparc64 processors. As Fujitsu continues to develop the Sparc64 series, we will make good use of our experience designing the Sparc64 VIIIfx to ensure that future Sparc64 processors will continue to meet the needs of a new era.

Acknowledgments
It is an honor to present the Sparc64 VIIIfx processor on behalf of the Sparc64 VIIIfx processor design team.



References
1. T. Maruyama, "Sparc64 VIIIfx: Fujitsu's New Generation Octo-core Processor for Peta Scale Computing," Hot Chips 21, 2009; http://jp.fujitsu.com/solutions/hpc/brochures.
2. Fujitsu Sparc64 VII Processor, white paper, Fujitsu Ltd., 2008; http://jp.fujitsu.com/solutions/hpc/brochures.
3. T. Maruyama, "Sparc64 VII: Fujitsu's Next Generation Quad-core Processor," Hot Chips 20, 2008; http://jp.fujitsu.com/solutions/hpc/brochures.
4. Sparc Int'l, The Sparc Architecture Manual (Version 9), Prentice-Hall, 1994.
5. Sparc Joint Programming Specification (JPS1): Commonality, architecture manual, Sun Microsystems and Fujitsu Ltd., 2002; http://jp.fujitsu.com/solutions/hpc/brochures.
6. Sparc64 VIIIfx Extensions, architecture manual, Fujitsu Ltd., 2008; http://jp.fujitsu.com/solutions/hpc/brochures/.
7. A. Inoue, "Fujitsu's New Sparc64 V for Mission Critical Servers," Microprocessor Forum, 2002.
8. H. Ando et al., "A 1.3 GHz Fifth Generation Sparc64 Microprocessor," Proc. Int'l Solid-State Circuits Conf., IEEE Press, 2003, pp. 1896-1905.
9. R. Kan et al., "Low Power Design of a High Performance Quad-core Microprocessor for Mission Critical Servers," COOL Chips XII, 2009.
10. Sparc64 V Processor for Unix Server, white paper, Fujitsu Ltd., 2004; http://www.fujitsu.com/downloads/PRMPWR/SPARC64_v_e.pdf.

Takumi Maruyama is a deputy general manager in the Next Generation Technical Computing Unit at Fujitsu. His technical interests include microprocessor architecture and VLSI design. Maruyama has a BE in mathematical engineering and instrumentation physics from the University of Tokyo.

Toshio Yoshida is an engineer in the Next Generation Technical Computing unit at Fujitsu. His technical interests include microprocessor architecture. Yoshida has an MS in physics from the Faculty of Science and Graduate School of Science at the University of Tokyo.

Ryuji Kan is an engineer in the Next Generation Technical Computing unit at Fujitsu. He has been involved in the development of the execution units of the Sparc64 V, VI, VII, and VIIIfx processors. Kan has an ME in informatics from Kyoto University.

Iwao Yamazaki is an engineer in the Next Generation Technical Computing unit at Fujitsu. His technical interests include cache designs for server processors. Yamazaki has a BS in physics from Kyoto University.

Shuji Yamamura is an engineer in the Next Generation Technical Computing unit at Fujitsu. His technical interests are in high-performance processor architectures. Yamamura has a PhD in electronics and information science from the Kyoto Institute of Technology. He is a member of the IEEE Computer Society.

Noriyuki Takahashi is an engineer in the Next Generation Technical Computing unit at Fujitsu. His technical interests include memory controller design. Takahashi has an ME in electrical engineering from Ehime University.

Mikio Hondou is an engineer in the Next Generation Technical Computing unit at Fujitsu. His technical interests include microprocessor architecture and performance analysis. Hondou has an MS in physics from Keio University.

Hiroshi Okano is a senior researcher in the Platform Technologies Laboratories at Fujitsu Laboratories. His technical interests are in low-power techniques for large-scale integration, from system-level to circuit-level design. Okano has an ME in electrical engineering from Hiroshima University.

Direct questions and comments about this article to Takumi Maruyama, Fujitsu Limited, 1-1, Kamikodanaka 4-chome, Nakahara-ku, Kawasaki 211-8588, Japan; [email protected].
