SPARC64 Viiifx: CPU for the K Computer
Total Page:16
File Type:pdf, Size:1020Kb
SPARC64 VIIIfx: CPU for the K computer Toshio Yoshida Mikio Hondo Ryuji Kan Go Sugizaki SPARC64 VIIIfx, which was developed as a processor for the K computer, uses Fujitsu Semiconductor Ltd.’s 45-nm CMOS process for semiconductors and is composed of eight cores, a 6 MB shared level 2 cache, and memory controllers. Peak performance of 128 GFLOPS at an operating frequency of 2 GHz is achieved with power consumption as low as 58 W. The performance per unit of power is more than six times that of the SPARC processor, our previous model. To achieve this performance per unit of power, we extended the SPARC-V9 architecture to develop high performance computing-arithmetic computational extensions (HPC-ACE), the optimum instruction set for scientific computations. In addition, we successfully reduced the leakage power by water cooling and dynamic power by clock gating to achieve a lower power consumption. Furthermore, high-reliability technology for mainframes and UNIX servers is used to ensure stable operation of a system connecting more than 80 000 processors. This paper outlines the technologies used to achieve the high performance, low power consumption and high reliability of SPARC64 VIIIfx. 1. Introduction 1) High performance Fujitsu developed SPARC64 VIIIfx1) (Fig SPARC64 VIIIfx is a multicore processor ure 1) as a processor for the supercomputer note)i (“K computer”). The K computer has more high-speed serial I/O (HSIO) than 80 000 processors installed to give it a computational performance in excess of 10 Core 5 L2 cache data Core 7 PFLOPS. These processors need to have high performance, low power consumption, and high reliability. This paper outlines the technologies Core 4 Core 6 used to achieve these goals. MAC L2 cache control MAC MAC MAC 2. Goals in development of DDR3 Interface Core 1 Core 3 DDR3 Interface SPARC64 VIIIfx DDR3 Interface The goals in the development of SPARC64 L2 cache data VIIIfx include: Core 0 Core 2 note)i “K computer” is the English name that RIKEN has been using for the supercomputer of this project since July 2010. “K” comes from the Japanese word “Kei,” which means ten peta or 10 to the Figure 1 SPARC64 VIIIfx chip. 16th power. 274 FUJITSU Sci. Tech. J., Vol. 48, No. 3, pp. 274–279 (July 2012) T. Yoshida et al.: SPARC64 VIIIfx: CPU for the K computer integrating eight cores, a shared level 2 cache, consumption of the processor needed to be memory access controllers (MACs), and a high- reduced to 58 W or less. To that end, Fujitsu speed serial I/O (HSIO). used low-leakage transistors and water cooling For each core to exhibit high-execution to lower the junction temperature to 30°C so as performance in a real application, we extended to reduce the leakage power. In addition, the the SPARC-V9 architecture2)–4) and developed processor must make full use of clock gating to High Performance Computing-Arithmetic reduce its dynamic power. Computational Extensions (HPC-ACE),5) an 3) High reliability instruction set capable of efficiently executing The processor requires high-reliability scientific computations. technology used for mainframes and UNIX To achieve a higher speed of parallel servers6),7) to ensure it operates stably. processing by having eight cores on the chip, the architecture must also have a function to share 3. Microarchitecture of SPARC64 the level 2 cache across all cores and synchronize VIIIfx cores by means of hardware. Combining this with The pipeline of SPARC64 VIIIfx is shown Fujitsu’s automatic parallel compiler allows the in Figure 2 and the specifications are shown in user to handle multiple cores when programming Table 1. as if it were one high-speed CPU without having A core is composed of an instruction to be aware of the multiple cores. At Fujitsu, this control unit, execution unit and level 1 cache. is called Virtual Single Processor by Integrated The instruction control unit is responsible for Multicore Parallel Architecture (VISIMPACT). instruction fetch, instruction decode, out-of-order 2) Low power consumption instruction control, and instruction commit Due to the limited amount of power control. available to the entire system, the power The execution unit is equipped with Out-of-order execution Register Fetch Decode Issue Memory access Commit read Execution Commit control Fetch EAGA port Instruction Instruction Address Fixed-point Data Program level 1 decode reservation register counter station EAGB Store level 1 cache port cache Control Fixed-point Fixed-point EXA register reservation renaming station register EXB Branch history Floating-point FLA reservation Floating-point station register FLB FLC Branch Floating-point reservation renaming station register FLD Level 2 MAC cache 8 cores Figure 2 SPARC64 VIIIfx pipeline. FUJITSU Sci. Tech. J., Vol. 48, No. 3 (July 2012) 275 T. Yoshida et al.: SPARC64 VIIIfx: CPU for the K computer Table 1 In addition, a K computer exclusive SPARC64 VIIIfx specifications. Item Specification InterConnect Controller chip and HSIO are No. of cores 8 used for connections to ensure a high inter-chip Level 2 cache 6 MB communication throughput. Operating frequency 2 GHz Process technology FSL 45-nm CMOS 4. Instruction extensions HPC- Die size 22.7 mm × 22.6 mm ACE No. of transistors Approx. 760 million HPC-ACE is an extended instruction set Peak performance 128 GFLOPS intended for scientific computation for the Memory bandwidth 64 GB/s (theoretical peak value) SPARC-V9 architecture. It was developed based Power consumption 58 W (process condition: TYP) on the analysis of many HPC applications jointly FSL: Fujitsu Semiconductor Ltd. with Fujitsu software development department. 1) Expansion of number of registers two fixed-point functional units (EXA/B), The number of floating-point registers of two functional units for load/store address SPARC-V9 is 32, which is not sufficient for HPC computation (EAGA/B) and four floating-point applications. Increasing the number of registers, multiply-and-add (FMA) units (FLA/B/C/D). The however, is not possible with the 32-bit SPARC FMA units have a single instruction multiple architecture because there is an insufficient data (SIMD) architecture and execute two instruction length. As a solution to this, a parallel operations with one instruction. One new prefix instruction called the set extended FMA unit is capable of conducting floating-point arithmetic register (SXAR) has been defined for multiplication and addition for each cycle and HPC-ACE. An SXAR instruction extends register each core can execute eight double-precision addressing for up to two following instructions. floating-point operations per cycle. Hence the The register address length is extended by 3 bits, chip is capable of making 64 double-precision which allows for 256 floating-point registers, floating-point operations per cycle. The operating eight times that of SPARC-V9 (Figure 3). frequency is 2 GHz and the peak performance is The compiler uses this high-capacity set 128 GFLOPS. There are 192 fixed-point registers of registers for optimization including software and 256 floating-point registers. pipelining and maximizes the instruction-level The level 1 cache processes load/store parallelism of an application. In terms of the instructions. Each core has a 32 Kbyte two-way Himeno Benchmark, a representative HPC instruction cache and data cache. The data benchmark, the performance has improved by cache has a dual-port structure capable of two 1.65 times. simultaneous load accesses and executes two 2) SIMD operations and load/store instructions 16-byte SIMD loads or one 16-byte SIMD store. SIMD is a technology to allow parallel The level 2 cache is shared by the eight cores execution of more than one data process with one and cache coherence is ensured for each core. An instruction. HPC-ACE uses SIMD technology inter-core hardware barrier is provided to allow to execute two FMA operations with one high-speed synchronization between the cores, as instruction. SIMD operations for more quickly will be described later. multiplying complex numbers are also supported. SPARC64 VIIIfx incorporates memory con- In addition, SIMD execution is possible with load trollers to reduce its latency and improve the and store instructions. SIMD processing of load throughput of its memory access. The memory instructions is executed without penalty in an bandwidth is 64 GB/s as a theoretical peak value. 8-byte alignment for double precision and 4-byte 276 FUJITSU Sci. Tech. J., Vol. 48, No. 3 (July 2012) T. Yoshida et al.: SPARC64 VIIIfx: CPU for the K computer Floating-point register SPARC-V9 32 SXAR Extended address (3 bits) Address Register address Extended 224 (5 bits) 8 bits in total register Subsequent instruction Figure 3 Register address extension by SXAR instruction. alignment for single precision. To efficiently process loops containing 3) Sector cache mechanism if statements, it is necessary to eliminate For HPC-ACE, a cache mechanism (sector conditional branch instructions. For that cache) that can be software-controlled has been purpose, conditional execution instructions developed. The conventional caches cannot be have been added for HPC-ACE. Specifically, controlled by software. Even if the user is aware a new compare instruction is used to write the of the high frequency with which data is reused, result of comparison in a floating-point register the hardware evicts the data from the cache and a conditional execution instruction is used when registering other data in the cache, which based on the result of such comparison. As the might hinder improvements in performance. To conditional execution instructions, data transfer address this issue, the sector cache mechanism between floating-point registers and store splits the cache into two sectors and allows from a floating-point register to memory have software to be used to register frequently reused been defined.