ADAC Japan 2017

Jan 25th, 2017

Fujitsu processor history and future

Takumi Maruyama Senior Director Connect. Challenge. AI Platform Division Advanced System ResearchInspire. & Development Unit

All Rights Reserved, Copyright© LIMITED 2015 Agenda  K computer

 Fujitsu processor development history  HPC  UNIX/Mainframe

 Future development plan  Post K  AI processor: DLU

 Summary

1 All Rights Reserved, Copyright 2017 FUJITSU LIMITED K computer

WR#1 10.51 PFlops (2011/11) K computer “Still the best”

2011 2012 2013 2014 2015 2016 1.TOP500 List 2 4 4 4 7

2. Gordon Bell Prize Finalist

3. HPC Challenge Awards (HPC、Random Access、STREAM、FFT) 4. 4 2

3 All Rights Reserved, Copyright 2017 FUJITSU LIMITED K computer “High reliable” CPU failure rates of the K computer are about quarter compared to that of Blue Waters.

(Source: ISC 2015 Long term failure analysis of 10 petascale , RIKEN)

4 All Rights Reserved, Copyright 2017 FUJITSU LIMITED Fujitsu technologies to realize K computer

High Performance Torus network Processor 864 racks 82,944 Compute nodes 5,184 IO nodes 8core 6D.

Liquid Cooling High density rack

4processor 24board

「京」は、理化学研究所が2010年7月から使用している「次世代スーパーコンピュータ」の愛称です。 ©Riken

5 All Rights Reserved, Copyright 2017 FUJITSU LIMITED Fujitsu processor development History - HPC - UNIX/GS

6 All Rights Reserved, Copyright 2017 FUJITSU LIMITED Fujitsu Processor development AI DLU Perpetual evolution, Post-K 20nm ARM Always targeting No.1 SPARC64 XIfx HPC Virtual Machine Architecture UNIX Software on Chip Tr=1B High-speed Interconnect 40CMOSnm Cu SPARC64 SPARC64 Performance 40nm XII HPC-ACE IXfx SPARC64 X+ System on Chip 45nm SPARC64 SPARC64 VIIIfx Hardware Barrier X Next GS Multi-core Multi-thread 65nm SPARC64 VII GS21 L2$ on Die 2600 90nm SPARC64 Non-Blocking $ VI SPARC64 GS21

130nm O-O-O Execution V+ 1600 28nm SPARC64 Super-Scalar V 180nm GS21 Single-chip CPU 250nm / 900 GS21 Mainframe Store Ahead 220nm 600 Branch History GS8900 350nm GS8800B GS8800 Prefetch SPARC64

Reliability GP :Technology generation

SPARC64 $ ECC GS8600 GP Register/ALU Parity SPARC64 Mainframe/UNIX/HPC + AI II Instruction Retry SPARC64 $ Dynamic Degradation incremental development

Error Checkers/History 2000~2003 2004~2007 2008~2011 2012~2015 2016~

7 All Rights Reserved, Copyright 2017 FUJITSU LIMITED SPARC64™ VII

• Architecture Features • 4core x 2threads (SMT) L2L2 CCacheache • Embedded 6MB L2$ DataData • 2.5GHz • Jupiter Bus

L2L2 CCacheache CoreCore 11 TagTag CoreCore 33 • Fujitsu 65nm CMOS L2L2 CCacheache • 21.31mm x 20.86mm Control Control • 600M transistors ExecutionL1D Cache Unit SystemSystem • 456 signal pins L1 Cache InterfaceInterface Instruction CoreCore 22 Instruction Control Control L1I Cache • 135 W (max) SystemSystem I/OI/O • 44% power reduction per core from SPARC64TM VI

8 All Rights Reserved, Copyright 2017 FUJITSU LIMITED SPARC64™ VII in HPC & UNIX server The same processor was used both in HPC and UNIX severs - SC(system controller LSI) was different

Supercomputer FX1 SPARC Enterprise

L2 Cache Data

L2L2 CCacheache Core 1 TagTag Core 3 L2L2 CCacheache ControlControl ExecutionL1D Cache Unit SystemSystem L1 Cache InterfaceInterface Instruction Core 2 Instruction Control Control L1I Cache System I/O

System board diagram System board diagram

CPU SC MC DDR2 TM 32B SPARC64 JSC DDR2 VII CPU SC MC DDR2

32B JSC CPU SC MC DDR2 20B(in)+12B(out) DDR2

CPU SC MC DDR2

9 All Rights Reserved, Copyright 2017 FUJITSU LIMITED SPARC64™ VIIIfx Chip [HPC] • Architecture Features HSIO • 8 cores • HPC-ACE (128-bit SIMD) Core5 L2$ Data Core7 • Shared 6 MB L2$

• Embedded Memory Controller

Core4 Core6 • 2 GHz

MAC MAC • Fujitsu 45nm CMOS MAC L2$ Control MAC • 22.7mm x 22.6mm

Core1 Core3 • 760M transistors

DDR3 interface DDR3 interface DDR3 • 1271 signal pins Core0 L2$ Data Core2 • Performance (peak) • 128GFlops • 64GB/s memory throughput • Power • 58W (TYP, 30℃) • Water Cooling – Low leakage power and High reliability 10 All Rights Reserved, Copyright 2017 FUJITSU LIMITED SPARC64™ X+ Chip [UNIX]  Architecture Features

DDR3 Interface • 16 cores x 2 SMT threads

MAC • Shared 24 MB L2$ Core Core Core Core • Memory and I/O Controllers

Core Core Core Core SERDES / PCIe SERDES PCIe Gen3 /

SERDES SERDES Inter / • HPC-ACE (128bit SIMD) • SWoC (Software on Chip)

L2 Cache L2 Cache L2 Cache

Data Control Data  28nm CMOS

- CPU

• 24.0mm x 25.0mm

Core Core Core Core • 2,990M transistors • 1,500 signal pins Core Core Core Core MAC • 3.7GHz DDR3 Interface  Performance (peak) • 473GFlops • 102GB/s memory throughput

11 All Rights Reserved, Copyright 2017 FUJITSU LIMITED SPARC64™ XIfx Chip [HPC]  Architecture Features Tofu2 interface • 32 computing cores Tofu2 controller core core core core + 2 assistant cores • HPC-ACE2 (256bit SIMD)

core core core core MAC MAC • 24 MB L2 cache

L2 cache

MAC

HMC interface HMC core core core core • HMC, Tofu2 , PCI Gen3

core core core core Assistant Assistant core core  20nm CMOS

core core core core • 3,750M transistors

HMC HMC interface

core core core core

MAC MAC • 1,001 signal pins

L2 cache MAC • 2.2GHz core core core core

core core PCI controller core core  Performance (peak) PCI interface • 1.1TFlops • HMC 240GB/s x 2(in/out) • Tofu2 125GB/s x 2(in/out)

12 All Rights Reserved, Copyright 2017 FUJITSU LIMITED HPC-ACE @SPARC64™ VIIIfx (High Performance Computing - Arithmetic Computational Extensions)

 Fujitsu’s unique ISA extension to SPARC-V9 for HPC • Large register sets • 128bit SIMD • Software controlled Cache • FP Trigonometric Functions • Conditional operation Register extension to SPARC-V9 • FP Reciprocal Approximation INT Reg FP Reg of Divide/Square-root Register V9 32 V9 32 160 Window Ext. 32  SPARC architecture is tolerant SIMD about ISA enhancements. V9 (basic)

224 Ext. 32 Ext. SIMD (extended)

13 All Rights Reserved, Copyright 2017 FUJITSU LIMITED Software on Chip @SPARC64™ X+  HW for SW ISA extensions to accelerates specific software function with HW

 The targets  Decimal operation (IEEE754 decimal and NUMBER)  Cypher operation (AES/DES)  Database acceleration

 HW implementation  The HW engines for SWoC are implemented in FPU • To fully utilize 128 FP registers & software pipelining  Implemented as instructions rather than dedicated co-processor to maximize flexibility of SW.  Avoid complication due to “CISC” type instructions • Various “RISC” type instructions are newly defined, instead. • 18 insts. for Decimal, and 10 insts. for Cypher operation

14 All Rights Reserved, Copyright 2017 FUJITSU LIMITED SPARC64TM VII Pipeline [HPC/UNIX] Fetch Issue Dispatch Reg.-Read Execute Memory Commit

CSE 64Entry Fetch Port EAGA 16Entry Decode RSA GPR PC L1 I$ 156Registers 10Entry Store x2 64KB & Issue x2 EAGB L1 D$ 2Way Port 64KB 16Entry Control EXA 2Way Registers RSE GUB Store x2 8x2Entry 32Registers EXB Buffer Branch 16Entry Target Address FLA 8Kentry RSF FPR 64Registers 8x2Entry x2 FLB

RSBR FUB 10Entry 48Registers

L2$ 4-core 6MB/12MB 12Way

System Bus Interface

15 All Rights Reserved, Copyright 2017 FUJITSU LIMITED SPARC64TM VIIIfx Pipeline [HPC]

Fetch Issue Dispatch Reg.-ReadExecute Memory Commit

CSE 48Entry Fetch EAGA Port PC Decode 20Entry L1 I$ RSA GPR Store 32KB & Issue 10Entry 188Registers L1 D$ 2Way EAGB Control Port 32KB Registers 8Entry 2Way EXA Write Branch RSE GUB 10Entry 32Registers EXB Buffer Target 5Entry Address FLA 1Kentry RSF FPR 2Way 8x2Entry 256Registers FLB L2$ FLC 6MB RSBR FUB 12Way 6Entry 48x2Registers FLD

Memory 8-core … Controller DIMM

16 All Rights Reserved, Copyright 2017 FUJITSU LIMITED SPARC64TM X Pipeline [UNIX] Fetch Issue Dispatch Reg.-Read Execute Memory Commit

CSE 96Entry Fetch EAGA Port EXC 32Entry L1 I$ Decode RSA GPRGPR PCPC 64KB & Issue 24Entry EAGB Store 156Registers156Registers L1 D$ 4Way EXD Port 64KB 24Entry Control Control EXA 4Way Registers RSE GUB Write Registers 24Entry 64Registers Branch EXB Buffer 10Entry Target FLA Address Decimal 4Kentry RSF FPRFPR Cypher 20Entry 128Registers Pattern 128Registers History FLB Table FLC 16Kentry RSBR FUB Cypher 16Entry 64Registers FLD

L2$ 16-core … 24MB 24Way

Memory IO Router Controller Controller CPU-CPU I/F DIMM PCI-GEN3 17 All Rights Reserved, Copyright 2017 FUJITSU LIMITED SPARC64TM XIfx Pipeline [HPC] Fetch Issue Dispatch Reg-Read Execute Cache and Memory Commit

CSE Fetch EAGA Port L1 I$ Decode EXC RSA GPR PC 64KB & Issue 188Registers EAGB Store L1 D $ 4ways Port 64KB EXD Control 4Way Registers RSE GUB EXA Write Buffer Branch EXB Target Address RSF FPR FLA 128x4 Reg. FLBFLB Pattern Local History Pattern Table Table FLB FLBFLB RSBR FUB

L2$ 34 cores … Tofu 2 controller MAC (HMC Controller) PCI Controller

CPU-CPU I/F HMC PCI-GEN3

18 All Rights Reserved, Copyright 2017 FUJITSU LIMITED Fujitsu’s processor design approach  Fully utilize the latest semiconductor technology  Enhance/change ISA (Instruction Set Architecture) to meet requirements: HPC-ACE, SWoC  Shared micro-Architecture across HPC/UNIX/GS Perpetual evolution over generations

19 All Rights Reserved, Copyright 2017 FUJITSU LIMITED Future Fujitsu processor development

- Post K - AI Processor (DLUTM)

20 All Rights Reserved, Copyright 2017 FUJITSU LIMITED Post-K Goals and Approaches

 Post-K Goals  High application performance and good power efficiency  Keeping application compatibility while advancing from predecessors  Good usability and better accessibility for users  Our Approaches  Developing high performance and scalable, custom CPU cores 【Performance】 Wider SIMD & high memory BW, mathematical acc. Primitives 【Scalability】 Scalable many core, zero OS jitter (assistant core) 【Power efficiency】 The best device tech, power control functions, optimal resources  Maintaining performance balance and supporting advanced features • High memory BW, “Tofu” interconnect, and RIKEN advanced system software  Adopting ARM standard architecture • Co-operation with ARM/ community and utilization of open source software • Getting involved in the ARM HPC ecosystem

21 All Rights Reserved, Copyright 2017 FUJITSU LIMITED Post-K CPU  ISA: ARM SVE  FUJITSU, as a lead partner in ARM SVE development, contributes to specification of ARM SVE (Scalable Vector Extension), for application performance  FUJITSU ARM core incorporates FUJITSU’s proven supercomputer microarchitecture  ARM SVE, plus optional functions and Tofu, maintain programing models and performance balance  Post-K complies ARM’s standard frameworks (SBSA, etc.), for compatibility among platforms

Functions for Perf. Post-K FX100 FX10 K computer SIMD 512bit 256bit 128bit 128bit SVE FMA4 ✔ ✔ ✔ ✔ incorporated Math. acc. prim.* ✔Enhanced ✔ ✔ ✔ Inter-core barrier ✔ ✔ ✔ ✔ Optional Sector cache ✔Enhanced ✔ ✔ ✔ functions Prefetch modes ✔Enhanced ✔ ✔ ✔ Interconnect Tofu ✔Enhanced ✔ ✔ ✔ *Mathematical acceleration primitives include trigonometric functions, sine & cosines, and exponential... 22 All Rights Reserved, Copyright 2017 FUJITSU LIMITED DLUTM Goals

FY2018~

DLU TM (Deep Learning Unit)

Supercomputer K technologies

DLUTM features  Architecture designed for Deep Learning  High performance HBM2 memory  Low power design ➔ Goal: 10x Performance/Watt compared to others 写真はイメージであり、実物とは異な ります  Massively parallel:Apply the supercomputer interconnect technology ➔Ability to handle large scale neural networks

23 All Rights Reserved, Copyright 2017 FUJITSU LIMITED DLUTM architecture  ISA: Newly developed for Deep learning  Micro-Architecture  Simple pipeline to remove HW complexity  On chip network to share data between DPUs  Utilize Fujitsu’s HPC experience such as high density FMA and high speed interconnect ➔ Maximize performance(throughput) / watt TM Large scale DLU interconnect DLU (Deep Learning Unit) through off-chip network DPU-0 DPU Host I/F DPE DPE DPE DPE DPE DPE

DPU-1 DPU

DPE DPE DPE DPE DPE DPE

HBM2 DPU DPU-n

DPE DPE DPE DPE DPE DPE

New ISA for Deep learning Fujitsu’s interconnect technology High density FMA on-chip network DPU: Deep learning Processing Unit, DPE: Deep learning Processing Element

24 All Rights Reserved, Copyright 2017 FUJITSU LIMITED DLUTM Future Roadmap

 Multiple generations of DLUs over time, as we currently do for HPC/UNIX/Mainframe processors.

 Accelerator  Host CPU embedded  Neuro computing st  Needs separate Host CPU nd  Inter-DLU direct connection  Combinational optimization The 1 The 2 Large scale neural Future architecture. Generation Generation network

• K computer processor development technology Performance • Architecture optimized for DL per watt • Large scale network

Fujitsu ×10

FY2018 FY2021 * Subject to change without notice FUJITSU CONFIDENTIAL 25 All Rights Reserved, Copyright 2017 FUJITSU LIMITED New Arch. Rivals Quantum Computers  New Architecture  Architecture designed for combinatorial optimization problems  Uses a basic optimization circuit, based on digital circuitry and conventional semiconductor technology  Hierarchical structure for optimal data movement and parallel calculation  Acceleration Technology  Calculates multiple candidates at once, parallel execution  Detects and escapes from the local minimum states by adding score

New architecture Acceleration using basic optimization circuit http://www.fujitsu.com/global/about/resources/news/press-releases/2016/1020-02.html 26 All Rights Reserved, Copyright 2017 FUJITSU LIMITED Summary

27 All Rights Reserved, Copyright 2017 FUJITSU LIMITED FJ Processors 2010 2012 2015 AI K Computer FX10 FX100 8core 16core 34core DLU 2008 TM TM FX1 SPARC64TM SPARC64 SPARC64 4core VIIIfx Ixfx XIfx HPC 45nm 40nm 20nm 2013 2011 M10 Post “京” SPARC64TM SPARC Enterprise VII 16core 65nm 4core

SPARC SPARC64TM UNIX X Enterprise SPARC64TM 28nm TM VII+ SPARC64 65nm XII 2014 GS21 model 2600  Utilization of the latest 8core Mainframe Semiconductor technologies Next 28nm  Evolution to meet different requirements GS between HPC/UNIX/Mainframe/AI

28 All Rights Reserved, Copyright 2017 FUJITSU LIMITED An example of different requirements: Single thread performance vs Throughput AI Throughput processor 10,000 DLU HPC processor SPARC64TM UNIX XIfx

1,000 processor

SPARC64TM SPARC64TM X+

IXfx GFlops 100 SPARC64TM VIIIfx

SPARC64TM VII Single thread performance [Amdahl’s law] 10 0 1 2 3 4 GHz

29 All Rights Reserved, Copyright 2017 FUJITSU LIMITED Summary Fujitsu has successfully designed various processors for decades.

Fujitsu’s processor design win has come from  Instruction set architecture (ISA) enhancements  Shared micro-architecture with perpetual evolution over generations  Semiconductor technology improvements

Fujitsu will take similar design approach for Post-K and DLU  ISA and micro-architecture are getting more important due to the limitation of Moore’s law.

Fujitsu will continue to develop processors to meet the needs of a new era.

30 All Rights Reserved, Copyright 2017 FUJITSU LIMITED