Supercomputer Fugaku CPU A64FX Realizing High Performance, High-Density Packaging, and Low Power Consumption
Total Page:16
File Type:pdf, Size:1020Kb
Supercomputer Fugaku CPU A64FX Realizing High Performance, High-Density Packaging, and Low Power Consumption Ryohei Okazaki Takekazu Tabata Sota Sakashita Kenichi Kitamura Noriko Takagi Hideki Sakata Takeshi Ishibashi Takeo Nakamura Yuichiro Ajima The A64FX was developed as a processor for the supercomputer Fugaku. For the semiconduc- tors, the 7- nm CMOS process technology of TSMC [1] has been used. For higher density, the Tofu interconnect D and the PCI Express controllers have been implemented in the CPU chip, and the high-bandwidth 3D stacked memory is integrated in the package. The A64FX employs the Arm architecture to improve the software development environment while inheriting Fujitsu’s proven high-performance microarchitecture. In addition, we worked on the specifica- tion of Scalable Vector Extension (SVE) as a lead partner of Arm Limited, the result of which has been adopted. This article outlines the A64FX and describes its high-performance microarchi- tecture, architecture to achieve high-density packaging, and low power consumption design. 1. Introduction architecture to the device level with the aim of providing The A64FX (Figure 1) was developed as a proces- a system that uses the A64FX, a general-purpose CPU, sor for the supercomputer Fugaku (hereafter, Fugaku). with the capability of achieving performance per power Fugaku features 158,976 processors [2]. These proces- equivalent to that of a system equipped with a GPU. sors require high performance, high-density packaging, Figure 2 shows a block diagram of the A64FX and low power consumption, all at high levels. The CPU [3]. There are four core memory groups (CMGs), software development environment is also important. each of which is composed of 13 cores (12 used as This article outlines the A64FX and describes its computing cores and one as an assistant core), level 2 high-performance microarchitecture, architecture to cache and memory controller, and a ring bus network achieve high-density packaging, and low power con- on a chip (NoC) is used to connect them with the Tofu sumption design. Interconnect D (hereafter, TofuD) [4] interface and PCI Express (PCIe) interface. 2. A64FX overview and adoption of Arm In developing the A64FX, the Arm architecture architecture was adopted with the aim of having the supercomputer For the A64FX, we decided to work on devel- Fugaku accepted by a wider range of software develop- opment with the aim of achieving the capability of ers than the K computer, which saw a full-scale launch higher-speed application execution based on the mi- in 2012, and putting in place an environment allowing croarchitectures of processors developed by Fujitsu up use of the latest software. The Arm architecture is an in- to now, including the processor of the K computer. struction set developed by Arm Limited (hereafter, Arm) In order to accelerate application performance, we widely used in software development for smartphones analyzed various applications and optimized the entire and embedded devices. Recently, Arm processors for processor, revising the configuration of various blocks, servers have appeared by extension to a 64-bit archi- optimizing resources, adding new circuits, selecting tecture, which is the standard in the server sector, and and optimizing memory components, and optimizing new addition of the hypervisor extension feature for OS operations. In addition, we implemented measures servers. These developments have raised expectations to save power across a wide range of levels from the for expansion of Arm into the server sector. Fujitsu Technical Review 1 Supercomputer Fugaku CPU A64FX Realizing High Performance, High-Density Packaging, and Low Power Consumption Tu nteface e nteface Core nteface Level ache Rin us Figure 1 Die photo of A64FX CPU. Tu e processing (DSP) for embedding and other applica- CMG ntlle ntlle tions. The SIMD length is 128 bits, the same as the K Core computer. This is shorter than 256 bits and 512 bits, the current trends for HPC CPUs, and was unsuitable HBM2 Core HBM2 for improving the computing performance per core. In addition, based on Fujitsu’s past experience in HPC de- Rin us velopment, instructions useful for the HPC applications were also insufficient. HBM2 HBM2 Therefore, by collaborating with Arm, Fujitsu contributed as a lead partner to the specification of Scalable Vector Extension (SVE) capable of high-speed ih anwith e execution of HPC applications including scientific com- Figure 2 puting and AI, the result of which has been adopted for Block diagram of A64FX CPU. the A64FX. 3. High-performance microarchitecture In offering the Arm architecture as a proces- With the A64FX, microarchitecture exclusively for sor for high-performance computing (HPC), single Fugaku was developed with the aim of speeding up the instruction, multiple data (SIMD) extension, a unique execution of user applications. This section outlines the feature of the Arm architecture, became an issue. This microarchitecture and describes the elemental technol- SIMD extension, known as the Advanced SIMD, is used ogy utilized to make it a reality. for accelerating media processing and digital signal 2 Fujitsu Technical Review Supercomputer Fugaku CPU A64FX Realizing High Performance, High-Density Packaging, and Low Power Consumption 3.1 Microarchitecture overview although the operating frequency varies depending on Figure 3 shows the pipeline of the A64FX [3]. A the system in which it is installed. core is composed of the instruction control unit, execu- The level 1 cache unit processes load/store in- tion unit, and level 1 cache unit. The instruction control structions. Each core has a 64-KiB instruction cache unit performs instruction fetch, instruction decode, and a 64-KiB data cache. The data cache is configured instruction out-of-order processing control, and instruc- to be capable of two simultaneous load accesses and tion completion control. executes two 64-byte SIMD loads or one 64-byte SIMD The execution unit is equipped with two fixed- store. point functional units (EXA/EXB), two functional units The level 2 cache unit has 8 MiB of unified cache for address computation and simple fixed-point arith- per CMG and is shared by 13 cores including the assis- metic (known as EAGA/EAGB for address calculation tant core. and EXC/EXD for fixed-point arithmetic), two floating- point units for executing SVE instructions (FLA/FLB), 3.2 Proven microarchitecture and resource and one predicate unit for executing predicate arith- optimization metic (PRX). Both floating-point units have a 512-bit For the A64FX, we have implemented various SIMD configuration and can perform a floating-point types of hardware resource optimization based on the multiply-accumulate operation every cycle. Therefore, high-performance, high-reliability microarchitecture each computing core is capable of 32 double-precision adopted in mainframe, and UNIX servers, and the K floating-point operations per cycle and use of all com- computer, which Fujitsu previously developed. puting cores in the chip allows 1,536 double-precision In particular, for the numbers of reorder buffer floating-point operations to be performed per cycle. (ROB), reservation station, and other queues, which With single-precision floating-point operations and are important as performance indicators, we have ad- half-precision floating-point operations, the number of opted control to accelerate the release by judging the operations that can be performed is twice and four times release timing of the queue at the time of instruction the number of double-precision floating-point opera- execution. This ensures the instruction execution per- tions, respectively. It operates at 1.8 GHz/2 GHz/2.2 GHz, formance without unnecessarily increasing the number Fetch ece ssue Reiste ea Executin e access Comletin Level nstuctin Executin unit Level cache unit nstuctin cache unit cntl unit cntl unit ces it cntl unit EAGA Resevatin EXC Fetch t Level statin EAGB Progra nstuctin aess Fie int instuctin EXD cunte ece eneatin reiste Store t Level cache EXA ata cache ntl Resevatin EXB reiste statin Write bufe Branch executin preictin Resevatin Preicate R statin reiste executin Resevatin FLA statin Flatin branch int eiste F Tu cntlle Level cache e cntlle cntlle Tu HBM2 PCI-GEN3 Figure 3 Diagram of A64FX pipeline. Fujitsu Technical Review 3 Supercomputer Fugaku CPU A64FX Realizing High Performance, High-Density Packaging, and Low Power Consumption of queue entries, which reduces the chip area by curb- 3.5 Level 1 data cache to support various ing the increase of logic circuits. access patterns In order to maximize the efficiency of 512-bit 3.3 Branch prediction circuits SIMD, it is important to maintain the access throughput We have employed several branch prediction of the L1 data cache in load instructions that transfer circuits so that optimum branch prediction can be per- data to registers. When load instructions with addresses formed with various applications. not on 512-bit boundaries are sequentially executed in For example, we have employed a circuit that uses the order of address, access across cache lines occurs a piecewise linear algorithm to perform branch predic- once in several instructions. To avoid performance deg- tion. This allows for high-accuracy branch prediction radation in this case, the L1 data cache of the A64FX capability even with a program that has a compli- is configured to allow each of the two lead ports to cated instruction structure. As a result, it is possible