Introduction of ’s next-generation

MATSUMOTO Takayuki

July 16, 2014

HPC Platform Solutions  Fujitsu has a long history of supercomputing over 30 years  Technologies and experience of providing Post-FX10 several types of such as vector, MPP and cluster, are inherited PRIMEHPC FX10 and implemented in latest supercomputer product line No.1 in Top500 (June and Nov., 2011) FX1 World’s Fastest Vector Processor (1999) Highest Performance VPP5000 SPARC efficiency in Top500 NWT* Enterprise (Nov. 2008) Developed with NAL No.1 in Top500 PRIMEQUEST VPP300/700 (Nov. 1993) PRIMEPOWER CX400 Gordon Bell Prize HPC2500 Skinless server (1994, 95, 96) ⒸJAXA VPP500 World’s Most Scalable PRIMERGY VP Series Supercomputer BX900 Cluster node AP3000 (2003) HX600 Cluster node F230-75APU AP1000 PRIMERGY RX200 Cluster node Japan’s Largest Cluster in Top500 Japan’s First (July 2004) Vector (Array) *NWT: Supercomputer Numerical Wind Tunnel (1977)

1 Copyright 2014 FUJITSU LIMITED K computer and Fujitsu PRIMEHPC series

 Single CPU/node architecture for multicore  Good Bytes/flop and scalability  Key technologies for massively parallel supercomputers  Original CPU and interconnect  Support for tens of millions of cores (VISIMPACT*, Collective comm. HW)

VISIMPACT SIMD extension HPC-ACE HPC-ACE HPC-ACE2 Collective comm. HW Direct network Tofu Direct network Tofu Tofu interconnect 2 HMC & Optical connections

CY2008~ CY2010~ CY2012~ Coming soon 40GF, 4-core/CPU 128GF, 8-core/CPU 236.5GF, 16-core/CPU 1TF~, 32-core/CPU

* VISIMPACT: Virtual Single Processor by Integrated Multi-core Parallel Architecture 2 Copyright 2014 FUJITSU LIMITED Architecture continuity for compatibility

 Upper compatible CPU:  Binary-compatible with the K computer & PRIMEHPC FX10  Good byte/flop balance  New features:  New instructions (stride load/store, indirect load/store, permutation, concatenation)  Improved micro architecture (out-of-order, branch-prediction, etc.)  For distributed parallel executions:  Compatible interconnect architecture  Improved interconnect bandwidth

3 Copyright 2014 FUJITSU LIMITED The K computer and the evolution of PRIMEHPC

K computer PRIMEHPC Post-FX10 FX10 CPU SPARC64 VIIIfx SPARC64 IXfx SPARC64 XIfx Peak perf. 128 GFLOPS 236.5 GFLOPS 1TFLOPS ~

# of cores 8 16 32 + 2

Memory DDR3 SDRAM ← HMC

Interconnect Tofu Interconnect ← Tofu Interconnect 2 System size 11PFLOPS Max. 23PFLOPS Max. 100PFLOPS Link BW 5GB/s x bidirectional ← 12.5GB/s x bidirectional

4 Copyright 2014 FUJITSU LIMITED Feature and Configuration of Post-FX10

Fujitsu designed Chassis TM SPARC64 XIfx  1 CPU/1 node  1TF~(DP)/2TF~(SP)  12 nodes/2U Chassis  32 + 2 core CPU  Water cooled  HPC-ACE2 support  Tofu2 integrated

Cabinet  200~ nodes/cabinet High-density  100% water cooled CPU Memory Board with EXCU (option) Tofu Interconnect 2  Three CPUs  12.5 GB/s×2(in/out)/link  3 x 8 Micron’s HMCs  10 links/node  8 Finisar's opt modules, BOA, for  Optical technology inter-chassis connections

5 Copyright 2014 FUJITSU LIMITED Flexible SIMD operations

 New 256bit wide SIMD functions enable versatile operations  Four double-precision calculations  Stride load/store, Indirect (list) load/store, Permutation, Concatenation

SIMD Stride load Indirect load Permutation

Double precision reg S reg S Memory reg S1 Specified stride Memory reg S2 128 Arbitrary shuffle regs simultaneous calculation

reg D reg D reg D reg D

6 Copyright 2014 FUJITSU LIMITED Tofu Interconnect 2 Successor to Tofu Interconnect Highly scalable, 6-dimensional mesh/torus topology Increased link bandwidth by 2.5 times to 12.5 GB/s Interconnect integrated into CPU System-on-chip (SoC) removes off-chip I/O Improved packaging density and energy efficiency Optical cable connection between chassis Scalable three-dimensional torus

Three-dimensional torus unit Well-balanced shape available 2×3×2

7 Copyright 2014 FUJITSU LIMITED Entire software stack is enhanced for Post-FX10

Applications

HPC Portal / System Management Portal Technical Computing Suite System Management High Performance Automatic parallelization compiler File System   System management Fortran  System control FEFS  C  System monitoring  C++  System operation support Tools and math libraries  based high performance distributed  Programming support tools file system  Mathematical libraries Job Management  High scalability, high reliability and  Job manager availability Parallel languages and libraries  Job scheduler  Resource management   Parallel OpenMP  MPI  XPFortran

Linux based OS (enhanced for FX series)

PRIMEHPC FX series

8 Copyright 2014 FUJITSU LIMITED HPC Platform Solutions  Fujitsu has a long history of supercomputing since early 1980  Technologies and experience of providing Post-FX10 several types of supercomputers such as SPARC64 XIfx Exascale vector, MPP and cluster, are inherited PRIMEHPC FX10 and implemented in latestSPARC64 IXfx supercomputer product line No.1 in Top500 (June and Nov., 2011) Post-FX10 K computer FX1 SPARC64 VIIIfx World’s Fastest Vector ProcessorFUJITSU (1999) Highest Performance SupercomputerVPP5000 SPARC efficiency in Top500 NWT* PRIMEHPC FX10Enterprise (Nov. 2008) Developed with NAL 100Pflops class No.1 in Top500 PRIMEQUEST VPP300/700 (Nov. 1993) PRIMEPOWER PRIMERGY CX400 Gordon Bell Prize HPC2500 Skinless server K (1994,computer 95, 96) ⒸJAXA VPP500 World’s Most Scalable PRIMERGY VP Series Supercomputer BX900 20Pflops class Cluster node AP3000 (2003) HX600 Cluster node F230-75APU 10Pflops class AP1000 PRIMERGY RX200 Cluster node Japan’s Largest Cluster in Top500 Japan’s First (July 2004) Vector (Array) *NWT: Supercomputer Numerical Wind Tunnel (1977)

9 Copyright 2014 FUJITSU LIMITED Open a bright future with Technical Computing

10 Copyright 2014 FUJITSU LIMITED