ADAC Japan 2017
Jan 25th, 2017
Fujitsu processor history and future
Takumi Maruyama
Senior Director, AI Platform Division, Advanced System Research & Development Unit
All Rights Reserved, Copyright© FUJITSU LIMITED 2015

Agenda
• K computer
• Fujitsu processor development history: HPC, UNIX/Mainframe
• Future development plan: Post K, AI processor (DLU)
• Summary

RIKEN K computer
WR#1 10.51 PFlops (2011/11)
K computer “Still the best”
Rankings by year (2011 2012 2013 2014 2015 2016):
1. TOP500 List: 2 4 4 4 7
2. Gordon Bell Prize Finalist
3. HPC Challenge Awards (HPL, Random Access, STREAM, FFT)
4. Graph500: 4 2

K computer “Highly reliable”
CPU failure rates of the K computer are about a quarter of those of Blue Waters.
(Source: ISC 2015 Long term failure analysis of 10 petascale supercomputer, RIKEN)

Fujitsu technologies to realize K computer
• High Performance Processor: 8core
• Torus network: 6D
• 864 racks: 82,944 Compute nodes, 5,184 IO nodes
• Liquid Cooling and High density rack (figure labels: 4processor, 24board)
“K” (「京」) is the nickname that RIKEN has used for the “Next-Generation Supercomputer” since July 2010. ©Riken
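
The rack and node counts on this slide can be cross-checked with simple arithmetic. The sketch below assumes that the figure labels "24board" and "4processor" mean 24 system boards per rack and 4 processors per board, with one processor per compute node; that reading is an assumption, although it reproduces the quoted node count.

```python
# Figures quoted on the slide.
RACKS = 864
COMPUTE_NODES = 82_944
IO_NODES = 5_184

# Assumption (not stated on the slide): "24board" and "4processor" mean
# 24 system boards per rack and 4 processors per board, one processor per node.
BOARDS_PER_RACK = 24
PROCESSORS_PER_BOARD = 4


def compute_nodes(racks: int, boards_per_rack: int, processors_per_board: int) -> int:
    """Return the compute node count implied by the assumed rack layout."""
    return racks * boards_per_rack * processors_per_board


derived = compute_nodes(RACKS, BOARDS_PER_RACK, PROCESSORS_PER_BOARD)
io_nodes_per_rack = IO_NODES / RACKS

print("derived compute nodes:", derived, "matches slide:", derived == COMPUTE_NODES)
print("IO nodes per rack:", io_nodes_per_rack)
```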

Fujitsu processor development History
- HPC
- UNIX/GS

Fujitsu Processor development
Perpetual evolution, always targeting No.1
Mainframe/UNIX/HPC + AI incremental development
[Figure: processor roadmap by period (2000~2003, 2004~2007, 2008~2011, 2012~2015, 2016~) and technology generation (350nm, 250nm/220nm, 180nm, 130nm, 90nm, 65nm, 45nm, 40nm, 28nm, 20nm).
Product lines shown: Mainframe (GS8600, GS8800, GS8800B, GS8900, GS21 600, GS21 900, GS21 1600, GS21 2600, Next GS); UNIX (SPARC64 GP, SPARC64 V, V+, VI, VII, X, X+, XII); HPC (SPARC64 VIIIfx, IXfx, XIfx, Post-K ARM); AI (DLU).
Performance features: Prefetch, Branch History, Store Ahead, Single-chip CPU, Super-Scalar, O-O-O Execution, Non-Blocking $, L2$ on Die, Multi-core Multi-thread, Hardware Barrier, System on Chip, HPC-ACE, High-speed Interconnect, Software on Chip, Virtual Machine Architecture.
Reliability features: $ ECC, Register/ALU Parity, Instruction Retry, $ Dynamic Degradation, Error Checkers/History.]

SPARC64™ VII
Architecture Features
• 4core x 2threads (SMT)
• Embedded 6MB L2$
• 2.5GHz
• Jupiter Bus
Fujitsu 65nm CMOS
• 21.31mm x 20.86mm
• 600M transistors
• 456 signal pins
• 135 W (max)
• 44% power reduction per core from SPARC64™ VI
[Die photo labels: Core 1, Core 2, Core 3, L2 Cache Data, L2 Cache Tag, L2 Cache Control, System Interface, System I/O, Execution Unit, Instruction Control, L1D Cache, L1I Cache, L1 Cache Control]

SPARC64™ VII in HPC & UNIX server
The same processor was used in both HPC and UNIX servers; only the SC (system controller LSI) was different.
• Supercomputer: FX1
• UNIX server: SPARC Enterprise
[Figure: SPARC64™ VII die photo and two system board diagrams. Diagram labels: CPU, SC, MC, JSC, DDR2, 32B, 20B(in)+12B(out)]

SPARC64™ VIIIfx Chip [HPC]
Architecture Features
• 8 cores
• HPC-ACE (128-bit SIMD)
• Shared 6 MB L2$
• Embedded Memory Controller
• 2 GHz
Fujitsu 45nm CMOS
• 22.7mm x 22.6mm
• 760M transistors
• 1271 signal pins
Performance (peak)
• 128GFlops
• 64GB/s memory throughput
Power
• 58W (TYP, 30℃)
• Water Cooling: low leakage power and high reliability
[Die photo labels: Core0 to Core7, L2$ Data, L2$ Control, MAC, HSIO, DDR3 interface]

SPARC64™ X+ Chip [UNIX]
Architecture Features
• 16 cores x 2 SMT threads
• Shared 24 MB L2$
• Memory and I/O Controllers
• HPC-ACE (128bit SIMD)
• SWoC (Software on Chip)
28nm CMOS
• 24.0mm x 25.0mm
• 2,990M transistors
• 1,500 signal pins
• 3.7GHz
Performance (peak)
• 473GFlops
• 102GB/s memory throughput
[Die photo labels: Core (x16), L2 Cache Data, L2 Cache Control, MAC, DDR3 Interface, SERDES / PCIe Gen3, SERDES / Inter-CPU]

SPARC64™ XIfx Chip [HPC]
Architecture Features
• 32 computing cores + 2 assistant cores
• HPC-ACE2 (256bit SIMD)
• 24 MB L2 cache
• HMC, Tofu2, PCI Gen3
20nm CMOS
• 3,750M transistors
• 1,001 signal pins
• 2.2GHz
Performance (peak)
• 1.1TFlops
• HMC 240GB/s x 2(in/out)
• Tofu2 125GB/s x 2(in/out)
[Die photo labels: core (x32), Assistant core (x2), L2 cache, MAC, HMC interface, Tofu2 interface, Tofu2 controller, PCI interface, PCI controller]
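
The peak figures quoted for the three chips above can be approximately reproduced from core count, clock, and SIMD width. The sketch below is a rough cross-check, not Fujitsu's own derivation. It assumes 64-bit (double precision) elements, two SIMD FMA pipes per core, and 2 flops per fused multiply-add; none of those three values is stated on the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chip:
    name: str
    cores: int                  # computing cores (from the slides)
    clock_ghz: float            # clock frequency (from the slides)
    simd_bits: int              # SIMD width (from the slides)
    quoted_peak_gflops: float   # peak figure quoted on the slides


# Assumptions, not stated on the slides.
ELEMENT_BITS = 64          # double precision elements
FMA_PIPES_PER_CORE = 2     # SIMD FMA pipes per core
FLOPS_PER_FMA = 2          # one multiply plus one add


def peak_gflops(chip: Chip) -> float:
    """Peak GFlops = cores x GHz x flops per cycle per core."""
    lanes = chip.simd_bits // ELEMENT_BITS
    flops_per_cycle = lanes * FMA_PIPES_PER_CORE * FLOPS_PER_FMA
    return chip.cores * chip.clock_ghz * flops_per_cycle


CHIPS = [
    Chip("SPARC64 VIIIfx", 8, 2.0, 128, 128.0),
    Chip("SPARC64 X+", 16, 3.7, 128, 473.0),
    Chip("SPARC64 XIfx", 32, 2.2, 256, 1100.0),
]

derived_peaks = {chip.name: peak_gflops(chip) for chip in CHIPS}

for chip in CHIPS:
    print(f"{chip.name}: derived {derived_peaks[chip.name]:.1f} GFlops, "
          f"quoted {chip.quoted_peak_gflops:.0f} GFlops")
```

Under these assumptions the derived values land within a few percent of the quoted ones; the small gaps are consistent with rounding on the slides.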

HPC-ACE @SPARC64™ VIIIfx
(High Performance Computing - Arithmetic Computational Extensions)
Fujitsu’s unique ISA extension to SPARC-V9 for HPC
• Large register sets
• 128bit SIMD
• Software controlled Cache
• FP Trigonometric Functions
• Conditional operation
• FP Reciprocal Approximation of Divide/Square-root
SPARC architecture is tolerant of ISA enhancements.
Register extension to SPARC-V9
[Figure: INT Reg and FP Reg extension. Labels: V9 32, Register Window 160, Ext. 32, Ext. 224, SIMD, V9 (basic), Ext. (extended)]
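
The "FP Reciprocal Approximation of Divide/Square-root" item listed above refers to a common technique: a cheap low-precision estimate of 1/a is refined in software with Newton-Raphson steps, each of which roughly squares the relative error. The sketch below illustrates that general idea only. The linear seed and the function names are hypothetical stand-ins, not the HPC-ACE instruction semantics, and the code assumes positive, normal, finite inputs.

```python
import math


def reciprocal_estimate(a: float) -> float:
    """Low-precision estimate of 1/a for a > 0.

    Hypothetical stand-in for a hardware reciprocal-approximation
    instruction; not the actual HPC-ACE algorithm.
    """
    mantissa, exponent = math.frexp(a)   # a = mantissa * 2**exponent, 0.5 <= mantissa < 1
    # Linear fit of 1/mantissa on [0.5, 1); relative error is at most 1/17.
    estimate = 48.0 / 17.0 - (32.0 / 17.0) * mantissa
    return math.ldexp(estimate, -exponent)


def refine_reciprocal(a: float, steps: int = 4) -> float:
    """Refine the estimate with Newton-Raphson: x <- x * (2 - a * x)."""
    x = reciprocal_estimate(a)
    for _ in range(steps):
        x = x * (2.0 - a * x)   # relative error is roughly squared each step
    return x


def divide(numerator: float, denominator: float) -> float:
    """Division expressed as multiplication by a refined reciprocal."""
    return numerator * refine_reciprocal(denominator)


for value in (3.0, 7.0, 1234.5):
    print(value, refine_reciprocal(value), 1.0 / value)
```

Because the refinement uses only multiplies and subtracts, it pipelines and vectorizes well, which is one reason to prefer it over a long-latency divide in HPC loops.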

Software on Chip @SPARC64™ X+
HW for SW: ISA extensions that accelerate specific software functions with HW
The targets
• Decimal operation (IEEE754 decimal and NUMBER)
• Cypher operation (AES/DES)
• Database acceleration
HW implementation
• The HW engines for SWoC are implemented in the FPU, to fully utilize the 128 FP registers & software pipelining
• Implemented as instructions rather than a dedicated co-processor, to maximize flexibility of SW
• Avoid complication due to “CISC” type instructions: various “RISC” type instructions are newly defined instead
• 18 insts. for Decimal, and 10 insts. for Cypher operation

SPARC64™ VII Pipeline [HPC/UNIX]
[Figure: pipeline diagram. Stages: Fetch, Issue, Dispatch, Reg.-Read, Execute, Memory, Commit. Labeled blocks: PC, Control Registers, L1 I$ 64KB 2Way, L1 D$ 64KB 2Way, Branch Target Address 8Kentry, Decode & Issue, CSE 64Entry, reservation stations RSA, RSE, RSF, RSBR, register files GPR 156Registers, GUB 32Registers, FPR 64Registers, FUB 48Registers, execution units EAGA, EAGB, EXA, EXB, FLA, FLB, Fetch Port, Store Port, Store Buffer, L2$ 6MB/12MB 12Way shared by 4 cores, System Bus Interface]

SPARC64™ VIIIfx Pipeline [HPC]
[Figure: pipeline diagram. Stages: Fetch, Issue, Dispatch, Reg.-Read, Execute, Memory, Commit. Labeled blocks: PC, Control Registers, L1 I$ 32KB 2Way, L1 D$ 32KB 2Way, Branch Target Address, Decode & Issue, CSE 48Entry, reservation stations RSA, RSE, RSF, RSBR, register files GPR 188Registers, GUB 32Registers, FPR 256Registers, FUB 48x2Registers, execution units EAGA, EAGB, EXA, EXB, FLA, FLB, FLC, FLD, Fetch Port, Store Port, Write Buffer, L2$ 6MB 12Way shared by 8 cores, Memory Controller, DIMM]

SPARC64™ X Pipeline [UNIX]
[Figure: pipeline diagram. Stages: Fetch, Issue, Dispatch, Reg.-Read, Execute, Memory, Commit. Labeled blocks: PC, Control Registers, L1 I$ 64KB 4Way, L1 D$ 64KB 4Way, Branch Target Address 4Kentry, Pattern History Table 16Kentry, Decode & Issue, CSE 96Entry, reservation stations RSA, RSE, RSF, RSBR, register files GPR 156Registers, GUB 64Registers, FPR 128Registers, FUB 64Registers, execution units EAGA, EAGB, EXA, EXB, EXC, EXD, FLA, FLB, FLC, FLD with Decimal and Cypher engines, Fetch Port, Store Port, Write Buffer, L2$ 24MB 24Way shared by 16 cores, Memory Controller (DIMM), IO Controller (PCI-GEN3), Router (CPU-CPU I/F)]

SPARC64™ XIfx Pipeline [HPC]
[Figure: pipeline diagram. Stages: Fetch, Issue, Dispatch, Reg-Read, Execute, Cache and Memory, Commit. Labeled blocks: PC, Control Registers, L1 I$ 64KB 4Way, L1 D$ 64KB 4Way, Branch Target Address, Pattern History Table, Local Pattern Table, Decode & Issue, CSE, reservation stations RSA, RSE, RSF, RSBR, register files GPR 188Registers, GUB, FPR 128x4 Reg., FUB, execution units EAGA, EAGB, EXA, EXB, EXC, EXD, FLA, FLB, Fetch Port, Store Port, Write Buffer, L2$ shared by 34 cores, Tofu 2 controller (CPU-CPU I/F), MAC (HMC Controller, HMC), PCI Controller (PCI-GEN3)]

Fujitsu’s processor design approach
• Fully utilize the latest semiconductor technology
• Enhance/change ISA (Instruction Set Architecture) to meet requirements: HPC-ACE, SWoC
• Shared micro-architecture across HPC/UNIX/GS
• Perpetual evolution over generations

Future Fujitsu processor development
- Post K
- AI Processor (DLU™)

Post-K Goals and Approaches
Post-K Goals
• High application performance and good power efficiency
• Keeping application compatibility while advancing from predecessors
• Good usability and better accessibility for users
Our Approaches
• Developing high performance and scalable, custom CPU cores
  【Performance】 Wider SIMD & high memory BW, mathematical acc. primitives
  【Scalability】 Scalable many core, zero OS jitter (assistant core)
  【Power efficiency】 The best device tech, power control functions, optimal resources
• Maintaining performance balance and supporting advanced features
  - High memory BW, “Tofu” interconnect, and RIKEN advanced system software
• Adopting ARM standard architecture
  - Co-operation with ARM/Linux community and utilization of open source software
  - Getting involved in the ARM HPC ecosystem

Post-K CPU ISA: ARM SVE
• FUJITSU, as a lead partner in ARM SVE development, contributes to the specification of ARM SVE (Scalable Vector Extension), for application performance
• FUJITSU ARM core incorporates FUJITSU’s proven supercomputer microarchitecture
• ARM SVE, plus optional functions and Tofu, maintain programming models and performance balance
• Post-K complies with ARM’s standard frameworks (SBSA, etc.), for compatibility among platforms
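
A key idea behind SVE is that code does not hard-code the vector length: a loop asks for a per-iteration predicate and lets inactive lanes handle the tail, so the same program runs on implementations with different SIMD widths. The sketch below models that idea in plain Python. The helper names are hypothetical, the `whilelt` function is only loosely modeled on the SVE predicate-generating instruction of that name, and the widths 128, 256, and 512 bits are the SIMD widths quoted in this deck.

```python
from typing import List


def whilelt(start: int, limit: int, lanes: int) -> List[bool]:
    """Predicate with lane i active while start + i < limit (simplified model)."""
    return [start + i < limit for i in range(lanes)]


def vla_daxpy(a: float, x: List[float], y: List[float], vector_bits: int) -> List[float]:
    """Compute a*x + y with a loop that never hard-codes the vector length."""
    lanes = vector_bits // 64            # number of 64-bit elements per vector
    out = list(y)
    i = 0
    while i < len(x):
        predicate = whilelt(i, len(x), lanes)
        for lane, active in enumerate(predicate):
            if active:                   # inactive lanes are left untouched
                out[i + lane] = a * x[i + lane] + y[i + lane]
        i += lanes
    return out


xs = [float(k) for k in range(11)]       # length is not a multiple of any lane count
ys = [0.5 * k for k in range(11)]
results = {bits: vla_daxpy(2.0, xs, ys, bits) for bits in (128, 256, 512)}
print(all(results[bits] == results[128] for bits in results))
```

The loop body is identical for every width; only the number of iterations changes, which is the property that lets one binary keep working as SIMD width grows across generations.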
Functions for Perf.    Post-K       FX100    FX10     K computer   Category
SIMD                   512bit       256bit   128bit   128bit       SVE incorporated
FMA4                   ✔            ✔        ✔        ✔            SVE incorporated
Math. acc. prim.*      ✔ Enhanced   ✔        ✔        ✔            SVE incorporated
Inter-core barrier     ✔            ✔        ✔        ✔            Optional functions
Sector cache           ✔ Enhanced   ✔        ✔        ✔            Optional functions
Prefetch modes         ✔ Enhanced   ✔        ✔        ✔            Optional functions
Tofu                   ✔ Enhanced   ✔        ✔        ✔            Interconnect
*Mathematical acceleration primitives include trigonometric functions, sine & cosines, and exponential...

DLU™ Goals
FY2018~
DLU™ (Deep Learning Unit)
Supercomputer K technologies
DLU™ features
• Architecture designed for Deep Learning
• High performance HBM2 memory
• Low power design ➔ Goal: 10x Performance/Watt compared to others
• Massively parallel: Apply the supercomputer interconnect technology ➔ Ability to handle large scale neural networks
(The photo is an illustrative image and differs from the actual product.)

DLU™ architecture
• ISA: Newly developed for Deep learning
• Micro-Architecture
  - Simple pipeline to remove HW complexity
  - On chip network to share data between DPUs
• Utilize Fujitsu’s HPC experience such as high density FMA and high speed interconnect
➔ Maximize performance (throughput) / watt
[Figure: DLU™ (Deep Learning Unit) block diagram. DPU-0, DPU-1, ... DPU-n, each containing multiple DPEs, with Host I/F and HBM2; large scale DLU interconnect through off-chip network. Callouts: New ISA for Deep learning; Fujitsu’s interconnect technology; High density FMA; on-chip network]
DPU: Deep learning Processing Unit, DPE: Deep learning Processing Element

DLU™ Future Roadmap
Multiple generations of DLUs over time, as we currently do for HPC/UNIX/Mainframe processors.
[Figure: DLU roadmap, performance per watt over time (FY2018 to FY2021).
The 1st Generation: Accelerator; needs separate Host CPU.
The 2nd Generation: Host CPU embedded; Inter-DLU direct connection; large scale neural network.
Future: Neuro computing; combinatorial optimization architecture.
Foundations: K computer processor development technology; architecture optimized for DL; large scale network. Callout: Fujitsu ×10]
* Subject to change without notice

New Arch. Rivals Quantum Computers
New Architecture
• Architecture designed for combinatorial optimization problems
• Uses a basic optimization circuit, based on digital circuitry and conventional semiconductor technology
• Hierarchical structure for optimal data movement and parallel calculation
Acceleration Technology
• Calculates multiple candidates at once, parallel execution
• Detects and escapes from the local minimum states by adding score
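
The two acceleration ideas above (evaluating multiple candidates at once, and escaping local minima by adding a score) can be illustrated with a small annealing loop on a QUBO problem. This is a loose software sketch under stated assumptions, not Fujitsu's circuit: the QUBO formulation, the Metropolis acceptance rule, the geometric temperature schedule, and every name and parameter below are assumptions chosen for illustration. Hardware would evaluate the candidate flips in parallel; the loop here evaluates them one by one.

```python
import math
import random
from typing import List, Tuple


def energy(q: List[List[float]], x: List[int]) -> float:
    """QUBO energy: sum over i, j of q[i][j] * x[i] * x[j]."""
    n = len(x)
    return sum(q[i][j] * x[i] * x[j] for i in range(n) for j in range(n))


def flip_delta(q: List[List[float]], x: List[int], i: int) -> float:
    """Energy change if bit i were flipped (recomputed naively for clarity)."""
    flipped = x[:i] + [1 - x[i]] + x[i + 1:]
    return energy(q, flipped) - energy(q, x)


def anneal(q: List[List[float]], steps: int = 2000, t_start: float = 2.0,
           t_end: float = 0.05, offset_step: float = 0.1,
           seed: int = 0) -> Tuple[List[int], float]:
    rng = random.Random(seed)
    n = len(q)
    x = [0] * n
    best_x, best_e = x[:], energy(q, x)
    offset = 0.0                       # the "added score" used to escape local minima
    for step in range(steps):
        t = t_start * (t_end / t_start) ** (step / max(1, steps - 1))
        # Evaluate every single-bit flip as a candidate.
        accepted = []
        for i in range(n):
            d = flip_delta(q, x, i) - offset
            if d <= 0 or rng.random() < math.exp(-d / t):
                accepted.append(i)
        if accepted:
            i = rng.choice(accepted)   # apply one accepted candidate
            x[i] = 1 - x[i]
            offset = 0.0
            e = energy(q, x)
            if e < best_e:
                best_x, best_e = x[:], e
        else:
            offset += offset_step      # stuck: make the next moves easier to accept
    return best_x, best_e


# Hypothetical example problem: reward setting bits, penalize adjacent pairs.
Q_EXAMPLE = [
    [-1.0, 2.0, 0.0, 0.0],
    [0.0, -1.0, 2.0, 0.0],
    [0.0, 0.0, -1.0, 2.0],
    [0.0, 0.0, 0.0, -1.0],
]
state, value = anneal(Q_EXAMPLE)
print(state, value)
```

When no candidate is accepted, the growing offset lowers the effective energy barrier for every candidate at once, so the search leaves a local minimum without waiting for a rare random acceptance.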
[Figure: New architecture; Acceleration using basic optimization circuit]
http://www.fujitsu.com/global/about/resources/news/press-releases/2016/1020-02.html

Summary

FJ Processors
HPC
• 2008: FX1, SPARC64™ VII, 65nm, 4core
• 2010: K Computer, SPARC64™ VIIIfx, 45nm, 8core
• 2012: FX10, SPARC64™ IXfx, 40nm, 16core
• 2015: FX100, SPARC64™ XIfx, 20nm, 34core
• Post “京” (Post-K)
UNIX
• SPARC Enterprise, SPARC64™ VII, 65nm, 4core
• 2011: SPARC Enterprise, SPARC64™ VII+, 65nm
• 2013: M10, SPARC64™ X, 28nm, 16core
• SPARC64™ XII
Mainframe
• 2014: GS21 model 2600, 28nm, 8core
• Next GS
AI
• DLU
Utilization of the latest Semiconductor technologies
Evolution to meet different requirements between HPC/UNIX/Mainframe/AI

An example of different requirements: Single thread performance vs Throughput
[Figure: GFlops (log scale, 10 to 10,000) plotted against clock frequency (0 to 4 GHz). Plotted points: SPARC64™ VII, SPARC64™ VIIIfx, SPARC64™ IXfx, SPARC64™ XIfx, SPARC64™ X+, DLU. Region labels: AI processor, HPC processor, UNIX processor. Axis annotations: Throughput; Single thread performance [Amdahl’s law]]
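
The reference to Amdahl’s law explains why single thread performance still matters: the serial part of a workload limits the speedup that additional cores can deliver. The sketch below evaluates the standard formula, speedup = 1 / ((1 - p) + p / n), for a parallel fraction p and n workers. The parallel fractions and worker counts used are illustrative values, not measurements from the slides.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Speedup predicted by Amdahl's law for a given parallel fraction."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Illustrative values only: how the serial share caps the benefit of more cores.
for fraction in (0.5, 0.9, 0.99):
    row = ", ".join(f"{n} cores: {amdahl_speedup(fraction, n):.2f}x" for n in (4, 16, 32))
    print(f"parallel fraction {fraction}: {row}")
```

A workload with a large serial share favors a high-clock design, while a highly parallel workload favors many cores and wide SIMD, which is the trade-off the chart contrasts.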

Summary
Fujitsu has successfully designed various processors for decades.

Fujitsu’s processor design wins have come from
• Instruction set architecture (ISA) enhancements
• Shared micro-architecture with perpetual evolution over generations
• Semiconductor technology improvements

Fujitsu will take a similar design approach for Post-K and DLU
• ISA and micro-architecture are getting more important due to the limitation of Moore’s law.
Fujitsu will continue to develop processors to meet the needs of a new era.