Fujitsu Petascale PRIMEHPC FX10

Toshiyuki Shimizu
Director of System Development Div., Next Generation Technical Computing Unit
Fujitsu Limited

[Photo: 4x2 racks (768 compute nodes) configuration]

Copyright 2011 FUJITSU LIMITED

Outline

- Design target
- Technologies for highly scalable supercomputers
- PRIMEHPC FX10
- CPU
- Interconnect
- Summary

Targets for supercomputer development

- High performance and low power consumption
- High effective performance and productivity of highly parallel applications
- High operability and availability for large-scale systems

Technologies for highly scalable supercomputers

- Developed key technologies and implemented them across the series of systems
- PRIMEHPC FX10 will be available from Jan., 2012

Key technologies: Hybrid parallel (VISIMPACT), ISA extension (HPC-ACE), Tofu interconnect, Collective SW

Period          CPU                      System scale
CY2008~         40 GF, 4-core CPU        Linpack 111 TF, 3,008 nodes
CY2011 June~    128 GF, 8-core CPU       Linpack 10.51 PF, 88,128 nodes
CY2012~         236.5 GF, 16-core CPU    ~23.2 PF, 98,304 nodes

*The K computer, which is being jointly developed by RIKEN and Fujitsu, is part of the High-Performance Computing Infrastructure (HPCI) initiative led by Japan's Ministry of Education, Culture, Sports, Science and Technology (MEXT).

PRIMEHPC FX10

- High-speed and ultra-large-scale computing environment
  - Up to 23.2 PFLOPS (98,304 nodes, 1,024 racks, 6 petabytes of memory)
- Node
  - New processor SPARC64™ IXfx
  - Uncompromised memory system
- Tofu interconnect
  - Interconnect controller ICC with enhanced collective switch
- Water cooling options
  - Rear door heat exchanger (EXCU) removes 100% of the heat from exhaust air
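The maximum-configuration figures above can be cross-checked with simple arithmetic. The following is a minimal sketch, not a Fujitsu tool: it assumes one 236.5 GFLOPS CPU per node (as stated elsewhere in this deck) and reads "6 petabytes" as a binary quantity (PiB), which is an assumption made here so that the per-node memory comes out as a round number.

```python
# Figures quoted on the slides.
NODES = 98_304
RACKS = 1_024
NODE_PEAK_GFLOPS = 236.5   # one SPARC64 IXfx CPU per node
MEMORY_PIB = 6             # assumption: "6 petabytes" interpreted as 6 PiB


def system_summary(nodes=NODES, racks=RACKS,
                   node_peak_gflops=NODE_PEAK_GFLOPS,
                   memory_pib=MEMORY_PIB):
    """Derive per-rack and per-node figures from the system totals."""
    return {
        "nodes_per_rack": nodes // racks,
        # 1 PFLOPS = 1e6 GFLOPS
        "peak_pflops": nodes * node_peak_gflops / 1e6,
        # 1 PiB = 1024 * 1024 GiB
        "memory_per_node_gib": memory_pib * 1024 * 1024 / nodes,
    }


if __name__ == "__main__":
    for key, value in system_summary().items():
        print(f"{key}: {value}")
```

Under these assumptions the sketch gives 96 nodes per rack, which agrees with the 4x2-rack, 768-node configuration shown on the title slide, and a system peak slightly above the quoted 23.2 PFLOPS, which suggests the slide value is rounded down.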

SPARC64™ IXfx

- Binary compatible with K computer
- Double the number of cores of K's SPARC64 VIIIfx
- SPARC V9 + HPC-ACE
  - # of registers: 256 double precision registers
  - SIMD instructions: 2-way, 2-wide SIMD w/ FMA
  - Software-controllable cache (Sector cache)
- VISIMPACT
  - Automatic parallelizing compiler
  - Inter-core hardware barrier
  - Shared secondary cache
- Integrated memory controller
  - 1.333GHz DDR3

Frequency             1.848 GHz
Secondary cache       12 MB (Shared cache)
Theoretical peak      236.5 GFLOPS
Memory throughput     85 GB/s
Power consumption     110 W
Process technology    40 nm

[Figure: SPARC64™ IXfx die layout with 16 cores, L2$ data and control blocks, MACs, DDR3 interfaces and HSIO]
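The theoretical peak in the table can be reproduced from the core count, clock frequency and SIMD description. This is a sketch under one stated assumption: "2-way, 2-wide SIMD w/ FMA" is read here as 2 SIMD pipelines x 2 double-precision elements x 2 floating-point operations per fused multiply-add, i.e. 8 operations per core per cycle. The slide does not spell this out; the reading is chosen because it matches the quoted 236.5 GFLOPS.

```python
CORES = 16
FREQ_GHZ = 1.848
MEMORY_GB_PER_S = 85.0

# Assumed interpretation of "2-way, 2-wide SIMD w/ FMA" (not stated on the slide).
SIMD_WAYS = 2        # SIMD pipelines per core
SIMD_WIDTH = 2       # double-precision elements per SIMD operation
FLOPS_PER_FMA = 2    # a fused multiply-add counts as two operations


def peak_gflops(cores, freq_ghz, flops_per_cycle):
    """Peak = cores x clock (GHz) x floating-point operations per cycle per core."""
    return cores * freq_ghz * flops_per_cycle


flops_per_cycle = SIMD_WAYS * SIMD_WIDTH * FLOPS_PER_FMA
cpu_peak = peak_gflops(CORES, FREQ_GHZ, flops_per_cycle)
bytes_per_flop = MEMORY_GB_PER_S / cpu_peak
gflops_per_watt = cpu_peak / 110.0   # 110 W power consumption from the table

if __name__ == "__main__":
    print(f"flops/cycle/core: {flops_per_cycle}")
    print(f"peak: {cpu_peak:.1f} GFLOPS")
    print(f"memory balance: {bytes_per_flop:.2f} bytes/flop")
    print(f"efficiency: {gflops_per_watt:.2f} GFLOPS/W")
```

The bytes-per-flop ratio derived here is one way to quantify the memory bandwidth available per unit of compute; the deck itself gives no such ratio.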

Performance improvement by HPC-ACE

- With FP register expansion only: maximum 3.2 times, average 1.43 times faster than the normal configuration
- With both FP register expansion and SIMD: maximum 4.5 times, average 1.84 times faster than the normal configuration

[Chart: performance relative to the normal configuration (y-axis, 0.00 to 5.00) for 89 applications used for compiler optimization evaluation (x-axis, JOB01 to JOB89). Series: with register expansion only (maximum 3.2) and with register expansion and SIMD (maximum 4.5)]

6D mesh/torus Tofu interconnect

- Highly scalable direct network: 6D mesh/torus
- 10 redundant links for XYZ and ABC connections
- 4 RDMA engines (4x2 simultaneous transfer)
- Tofu original algorithms for collective communications
- Tofu barrier for barrier & reduction in H/W
- Implemented in direct attached ICC

Interconnect Controller (ICC)
Frequency             312.5 MHz
Switching capacity    100 GB/s
Power consumption     28 W
Process technology    65 nm

[Figure: SPARC64™ IXfx CPU attached to the ICC at 20 GB/s x 2; the ICC drives 10 links at 5 GB/s x 2 each along the X, Y, Z, A, B and C axes (XYZ × ABC)]

Interconnect performance

- 256-node All-to-all performance
  - Tofu delivers better per-node bandwidth than InfiniBand QDR
- Effects
  - Porting of existing applications would be easy because of the good scalability of the Tofu interconnect
  - HPCC Global FFT, which uses All-to-all communications frequently, records the No. 1 result

[Chart: All-to-all node bandwidth in GB/s (y-axis, 0 to 4) versus message size in bytes (x-axis, 1.E+00 to 1.E+06). Series: Tofu (8x4x8=256) and InfiniBand QDR (256)]
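The link count and switching capacity quoted for the Tofu interconnect can be illustrated with a small model of a 6D mesh/torus. This is a sketch only. The axis sizes and wrap-around settings below are assumptions that do not appear in this deck: X, Y and Z are modelled as torus axes of a hypothetical size 4, A and C as mesh axes of size 2, and B as a torus axis of size 3. The `neighbors` helper is likewise hypothetical and is not part of any Fujitsu software.

```python
from itertools import product

# Assumed axis order: X, Y, Z, A, B, C (sizes and wrap flags are illustrative).
SIZES = (4, 4, 4, 2, 3, 2)
WRAPS = (True, True, True, False, True, False)   # True = torus, False = mesh

LINKS_PER_NODE = 10          # from the slide
LINK_GB_PER_S = 5            # per direction, from the slide ("5GB/s x 2")


def neighbors(coord, sizes=SIZES, wraps=WRAPS):
    """Return the distinct nodes one hop away from `coord` in the mesh/torus."""
    result = set()
    for axis, (size, wrap) in enumerate(zip(sizes, wraps)):
        for step in (-1, 1):
            pos = coord[axis] + step
            if wrap:
                pos %= size                 # torus axis wraps around
            elif not 0 <= pos < size:
                continue                    # mesh axis stops at the edge
            if pos == coord[axis]:
                continue                    # axis of size 1 has no neighbor
            hop = list(coord)
            hop[axis] = pos
            result.add(tuple(hop))
    return result


# Every node in this model has the same number of neighbors.
degrees = {len(neighbors(c)) for c in product(*(range(s) for s in SIZES))}

# 10 links x 5 GB/s x 2 directions, compared with the quoted 100 GB/s.
switching_capacity = LINKS_PER_NODE * LINK_GB_PER_S * 2

if __name__ == "__main__":
    print("neighbor counts found:", degrees)
    print("aggregate link bandwidth:", switching_capacity, "GB/s")
```

With these assumed axis sizes each node has six neighbors along X, Y and Z and four along A, B and C, which is one way to arrive at the 10 links per node stated on the slide; the aggregate of 10 links at 5 GB/s in each direction equals the quoted switching capacity.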

PRIMEHPC FX10 Summary

- Improves Fujitsu's supercomputer technology employed in the "K computer," the world's fastest supercomputer
  - Newly developed SPARC64 IXfx processor (236.5 GFLOPS)
  - Tofu interconnect (scales up to 98,304 nodes)
- Fujitsu provides all hardware and software
  - Processor and interconnect
  - Software
    - FEFS: high-performance distributed file system
    - Compilers, performance analyzer, and so on


WEB site http://www.fujitsu.com/global/services/solutions/tc/hpc/products/primehpc/

Targets and solutions in PRIMEHPC FX10

High performance, low power consumption
- High-performance, low-power consumption, multi-core CPU "SPARC64 IXfx"
- Efficient cooling system for a greener world (direct water cooling system, optional exhaust cooling unit)

Highly effective performance, high productivity of parallel applications
- 6D mesh/torus interconnect "Tofu" scales beyond 100,000 nodes
- High memory bandwidth with one CPU/node configuration
- Hybrid parallel execution support "VISIMPACT"

High reliability and operability of large-scale systems
- High reliability components & functions based on mainframe development experience
- High operability and fault tolerance achieved by Tofu interconnect
- High reliability and operability in large systems using Fujitsu developed HPC middleware
