Fujitsu's Challenge for Sustained Petascale Computing

Motoi Okuda
Technical Computing Solutions Unit, Fujitsu Limited
IDC HPC User Forum, October 16th, 2008

Agenda

- Fujitsu's Approach for Petascale Computing and HPC Solution Offerings
- Japanese Next Generation Supercomputer Project and Fujitsu's Contributions
- Fujitsu's Challenges for Petascale Computing
- Conclusion

Fujitsu's Approach for Scaling up to 10 PFlops

System performance = Processor performance x Number of processors

Three paths lead to 10 PFlops: the many-core CPU or accelerator approach (e.g. LANL Roadrunner), the high-end general-purpose CPU approach (e.g. NMCAC SGI Altix ICE 8200), and the low-power embedded-processor approach (e.g. LLNL BG/L and JUGENE BG/P). Fujitsu takes the high-end general-purpose CPU approach, giving priority to application migration.

[Chart: peak performance per processor (GFlops) vs. number of processors (1,000 to 1,000,000), with system peak contours at 10 TFlops, 100 TFlops, 1 PFlops and 10 PFlops; plotted systems include the Earth Simulator (ES), ASC Purple (P5 575), LANL Roadrunner, NMCAC SGI Altix ICE 8200, LLNL BG/L and JUGENE BG/P.]
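As a quick sanity check of the formula, here is a hypothetical configuration using two targets quoted later in this deck (the Venus goal of over 100 GFlops per CPU and an interconnect scalable to over 100,000 nodes):

```latex
\underbrace{100\ \mathrm{GFlops/CPU}}_{\text{Venus target}}
\times
\underbrace{100{,}000\ \mathrm{CPUs}}_{\text{interconnect scale}}
= 10^{11} \times 10^{5}\ \mathrm{Flops}
= 10\ \mathrm{PFlops}
```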

Key Issues for Approaching Petascale Computing

- How to utilize multi-core CPUs?
- How to handle a hundred thousand processes?
- How to realize high reliability, availability and data integrity in a hundred-thousand-node system?
- How to decrease electric power and footprint?

Fujitsu's stepwise approach to product release ensures that customers can be prepared for Petascale computing.

Step 1 (2008~): The new high-end technical computing server FX1, with the new Integrated Multi-core Parallel ArChiTecture, an intelligent interconnect and an extremely reliable CPU design, provides a highly efficient hybrid parallel programming environment. The Petascale system is designed to inherit the FX1 architecture.

Step 2 (2011~): A Petascale system with a new high-performance, highly reliable and low-power-consumption CPU, an innovative interconnect and high-density packaging.

Current Technical Computing Platforms

High-end TC Solutions
- FX1 (NEW), with the high-end RISC CPU SPARC64 VII (SPARC/Solaris)
- Scalability up to 100 TFlops
- High I/O bandwidth
- High reliability based on mainframe technology

Large-scale SMP Solutions
- SPARC Enterprise (SPARC64 VII, up to 64 CPUs, SPARC/Solaris) and PRIMEQUEST 580 (Itanium 2, up to 32 CPUs, IA/Linux)
- Up to 2 TB of memory space for TC applications
- Highly effective I/O server performance

Cluster Solutions
- BX series, RX series and HX600 (NEW), all IA/Linux
- Optimal price/performance for MPI-based applications
- InfiniBand interconnect

Solidware Solutions
- FPGA board RG1000
- Ultra-high performance for specific applications

Customers of Large Scale TC Systems

Fujitsu has installed over 1,200 TC systems for over 400 customers.

Customer                                  | System type                                         | No. of CPUs | Performance
Japan Aerospace Exploration Agency (JAXA)* | Cluster (FX1) + Scalar SMP (SPARC Enterprise)      | >3,500      | 135 TFlops
Manufacturer A                            | Scalar SMP                                          | >3,500      | >80 TFlops
KYOTO Univ. Computing Center              | Cluster (HX600) + Scalar SMP (SPARC Enterprise)     | >2,000      | >61.2 TFlops
KYUSHU Univ. Computing Center             | Cluster (PRIMERGY) + Scalar SMP (PRIMEQUEST)        | 1,824       | 32 TFlops
RIKEN                                     | Cluster (PRIMERGY)                                  | 3,088       | 26.18 TFlops
Manufacturer B                            | Cluster                                             | >1,200      | >15 TFlops
NAGOYA Univ. Computing Center             | Scalar SMP (HPC2500)                                | 1,600       | 13 TFlops
TOKYO Univ. KAMIOKA Observatory           | Cluster (PRIMERGY)                                  | 540         | 12.9 TFlops
National Institute of Genetics            | Cluster (PRIMERGY) + Scalar SMP (SPARC Enterprise)  | 324         | 6.9 TFlops
Institute for Molecular Science           | Scalar SMP (PRIMEQUEST)                             | 320         | 4 TFlops

* This system will be installed at the end of 2008.

FX1 Launch Customer

The first system will be installed at JAXA by the end of 2008.

[System diagram: 3,392 FX1 THIN nodes (135 TFlops) and a SPARC Enterprise FAT node (SMP, 1 TFlops), connected by a high-speed intelligent interconnect network with hardware barriers between nodes; SPARC Enterprise I/O and front-end servers attach via LAN and FC to an ETERNUS RAID subsystem, with a system control server for power and facility control.]

FX1: New High-End TC Server - Outline -

- High-performance CPU designed by Fujitsu
  - SPARC64 VII: 4 cores in 65 nm technology
  - Performance: 40 GFlops (2.5 GHz)
- New architecture for a high-end TC server
  - Integrated Multi-core Parallel ArChiTecture, built on leading-edge CPU and compiler technologies
  - Blade-type node configuration for high memory bandwidth
- High-speed intelligent interconnect
  - Combination of an InfiniBand DDR interconnect and a highly functional switch
  - The switch realizes barrier synchronization and high-speed reduction between nodes in hardware
- The Petascale system inherits the Integrated Multi-core Parallel ArChiTecture
  - FX1 is a suitable platform on which to develop and evaluate Petascale applications

Integrated Multi-core Parallel ArChiTecture: Introduction

- Concept: highly efficient thread-level parallel processing technology for multi-core chips

[Diagram: a single process spanning the four cores of a CPU chip, with thread-level parallelization between the cores and a shared L2 cache in front of memory.]

- Advantage: handles the multi-core CPU as one equivalent, faster CPU (the hybrid pattern this enables is sketched below)
  - Reduces the number of MPI processes to 1/ncore and increases parallel efficiency
  - Reduces the memory-wall problem
- Challenge: how to decrease the thread-level parallelization overhead?
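The hybrid model that IMPACT targets can be written down in generic MPI + OpenMP. The sketch below is an illustration of the one-rank-per-chip pattern, not Fujitsu's compiler output; the array size N and the explicit 4-thread count are assumptions for the example.

```c
/* Hybrid MPI + OpenMP sketch: one MPI process per 4-core chip, with
 * OpenMP threads across the cores -- the pattern IMPACT automates.
 * Compile with, e.g.: mpicc -fopenmp hybrid.c -o hybrid */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N 1000000   /* local problem size: an assumption for the example */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    static double x[N];
    double local = 0.0, global = 0.0;

    /* Thread-level parallelism inside the chip: 4 cores share one MPI
     * rank, so the process count drops to 1/ncore of a pure-MPI run. */
    #pragma omp parallel for reduction(+:local) num_threads(4)
    for (int i = 0; i < N; i++) {
        x[i] = (double)(rank + i);
        local += x[i];
    }

    /* Process-level parallelism between chips. */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %e over %d ranks\n", global, nprocs);
    MPI_Finalize();
    return 0;
}
```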

Integrated Multi-core Parallel ArChiTecture: Key Technologies

- CPU technologies
  - Hardware barrier synchronization between cores (a generic measurement of such overhead is sketched after this list)
    - Reduces the overhead of parallel execution: 10 times faster than software emulation
    - Start-up time comparable to that of a vector unit
    - Barrier overhead remains constant regardless of the number of cores
  - Shared L2 cache memory (6 MB)
    - Reduces the number of cache-to-cache data transfers
    - Efficient cache memory usage
- Compiler technologies
  - Automatic parallelization or OpenMP on thread-based algorithms, using vectorization technology

[Chart: barrier overhead in ns for 2 and 4 cores, hardware vs. software barrier, on the SPARC64 VII, a real quad-core CPU for technical computing (2.5 GHz, 40 GFlops/chip).]
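A generic way to see why barrier cost matters is to time repeated barriers. On the FX1 the inter-core barrier is done in hardware below this level and is transparent to the code; the plain-OpenMP sketch below simply measures whatever barrier the runtime provides, and the repetition count is an assumption.

```c
/* Micro-benchmark sketch: average cost of a thread barrier.
 * Compile with, e.g.: cc -fopenmp barrier_bench.c -o barrier_bench
 * (The measured time also includes one-time parallel-region startup,
 * which is negligible for a large ITERS.) */
#include <omp.h>
#include <stdio.h>

#define ITERS 100000   /* repetition count: an assumption */

int main(void)
{
    double t0 = omp_get_wtime();
    #pragma omp parallel num_threads(4)
    {
        /* Every thread executes the same number of barriers. */
        for (int i = 0; i < ITERS; i++) {
            #pragma omp barrier
        }
    }
    double t1 = omp_get_wtime();
    printf("avg barrier overhead: %.1f ns\n", (t1 - t0) / ITERS * 1e9);
    return 0;
}
```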

Integrated Multi-core Parallel ArChiTecture: Performance Measurement of Automatic Parallelization (preliminary measured data)

- LINPACK performance on 1 CPU (4 cores)
  - n = 100: 3.26 GFlops
  - n = 40,000: 37.8 GFlops (93.8%)
- Performance comparison of DAXPY (EuroBen Kernel 8) on 1 CPU
  - 4 cores + IMPACT outperform a single core even at small loop iteration counts, and outperform the other servers measured (a generic version of the kernel is sketched below)

[Chart: DAXPY performance in MFlops vs. number of loop iterations (10 to 10,000) for FX1 SPARC64 VII (4 cores @ 2.5 GHz), FX1 SPARC64 VII (1 core @ 2.5 GHz), VPP5000 (9.6 GFlops), Intel Clovertown (4 cores @ 2.66 GHz) and Opteron Barcelona (4 cores @ 2.3 GHz).]
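For reference, the operation EuroBen Kernel 8 measures is the DAXPY loop y = a*x + y. In the measurement above the parallelization is automatic, done by the Fujitsu compiler; the explicit OpenMP directive in this minimal sketch merely stands in for that.

```c
/* DAXPY (the operation measured by EuroBen Kernel 8): y = a*x + y.
 * IMPACT parallelizes such loops across the cores automatically; the
 * OpenMP directive here is a generic stand-in for that. */
void daxpy(int n, double a, const double *restrict x, double *restrict y)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

At small n the fork/join and synchronization overhead dominates the loop body, which is exactly the regime where the slide credits the low-overhead hardware barrier for keeping the 4-core version ahead of the 1-core one.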

Integrated Multi-core Parallel ArChiTecture: Performance Measurement of NPB on 1 CPU (preliminary measured data)

- Performance comparison of NPB class C between pure MPI and the Integrated Multi-core Parallel ArChiTecture on 1 CPU (4 cores)
  - IMPACT (OpenMP) is better than pure MPI for 6 of the 7 programs

[Chart: relative performance of MPI, IMPACT (OMP) and IMPACT (automatic) for the NPB kernels BT, CG, EP, FT, LU, MG and SP.]

Integrated Multi-core Parallel ArChiTecture: Performance Measurement of NPB on 256 CPUs (preliminary measured data)

- Performance comparison of NPB class C between pure MPI and MPI + Integrated Multi-core Parallel ArChiTecture on FX1, 256 nodes (1,024 cores)
  - MPI + IMPACT (automatic parallelization) is better than pure MPI for 5 of the 8 programs

[Charts: MOPS vs. number of cores (1 to 10,000) for EP, BT, CG, FT, IS, LU, MG and SP, comparing pure MPI with MPI + IMPACT (automatic parallelization).]

FX1 Intelligent Interconnect: Introduction

- Combination of a fat-tree-topology InfiniBand DDR interconnect and a highly functional switch (the intelligent switch)
- Intelligent switch
  - A result of the PSI (Petascale System Interconnect) national project
  - Functions:
    - Hardware barrier among nodes
    - Hardware assistance for MPI functions (synchronization and reduction)
    - Global ping for OS scheduling

  - Advantages:
    - The faster hardware barrier speeds up OpenMP and data-parallel Fortran (XPF)
    - Fast collective operations accelerate highly parallel applications (from the application's side these remain ordinary MPI collectives, as sketched below)
    - Reduces the OS jitter effect

[Diagram: intelligent switches attached to the nodes alongside the leaf and spine InfiniBand switches of the fat tree.]
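Per the slide, the switch assists MPI synchronization and reduction in hardware; the application itself just issues standard MPI calls, so an unmodified program like this generic sketch would benefit transparently.

```c
/* The collectives the intelligent switch accelerates are ordinary MPI
 * calls; per the deck, applications need no changes to benefit. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Synchronization: the slide says this is assisted by the
     * hardware barrier among nodes. */
    MPI_Barrier(MPI_COMM_WORLD);

    /* Reduction: likewise assisted by the switch's reduction hardware. */
    double local = (double)rank, sum = 0.0;
    MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of ranks = %.0f\n", sum);
    MPI_Finalize();
    return 0;
}
```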

IDC HPC User Forum, Oct. 16th, 200813 All Rights Reserved, Copyright FUJITSU LIMITED 2008 FX1 Intelligent Interconnect High Performance Barrier & Reduction Hardware z Hardware barrier and reduction shows low latency and constant overhead in comparison with software barrier and reduction*

μsec μsec 120 120

100 100 80 Software 80 Software

60 60

40 40

20 Hardware 20 Hardware

0 0 2 4 8 16 32 64 128 256 2 4 8 16 32 64 128 256 Number of processes Number of processes Barrier Reduction

* Executed by the host processor using a butterfly network built from point-to-point communication; a sketch of that baseline follows below.
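This is a sketch of the kind of software baseline the footnote describes: a barrier assembled from point-to-point messages in a butterfly pattern. The recursive-doubling pairing scheme is the standard one and is an assumption here; it requires a power-of-two process count, which matches the measured sizes (2 to 256).

```c
/* Software-barrier sketch: point-to-point messages in a butterfly
 * (recursive-doubling) pattern, as the footnote's baseline describes.
 * Assumes the number of processes is a power of two. */
#include <mpi.h>

void butterfly_barrier(MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* log2(size) stages; at each stage exchange a token with the
     * partner whose rank differs in one bit. */
    for (int mask = 1; mask < size; mask <<= 1) {
        int partner = rank ^ mask;
        char out = 0, in = 0;
        MPI_Sendrecv(&out, 1, MPI_CHAR, partner, 0,
                     &in,  1, MPI_CHAR, partner, 0,
                     comm, MPI_STATUS_IGNORE);
    }
    /* After all stages every rank has transitively heard from every
     * other rank, so all processes have arrived at the barrier. */
}
```

The latency of this scheme grows with log2 of the process count, which is consistent with the rising software curves in the charts above.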

FX1 Intelligent Interconnect: Stability of the Reduction Function

- The intelligent interconnect realizes stable reduction performance through the global ping function (a generic version of this measurement is sketched below)

[Chart: reduction (allreduce) performance over 100 consecutive reductions on a 128-node system; the software implementation fluctuates around an average of 82.99 usec, while the intelligent switch stays stable at an average of 11.86 usec.]
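The measurement behind this chart can be reproduced generically by timing each of a series of allreduces and looking at the spread; a jittery software implementation shows a large max/min gap, a stable one does not. REPS = 100 matches the chart; everything else below is plain MPI and an assumption about how the deck's measurement was set up.

```c
/* Stability-measurement sketch: time REPS consecutive allreduce
 * operations and report rank 0's min/avg/max in microseconds. */
#include <mpi.h>
#include <stdio.h>

#define REPS 100   /* matches the 100 reductions in the chart */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double in = 1.0, out, min = 1e9, max = 0.0, total = 0.0;
    for (int i = 0; i < REPS; i++) {
        MPI_Barrier(MPI_COMM_WORLD);            /* align the start */
        double t0 = MPI_Wtime();
        MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        double dt = (MPI_Wtime() - t0) * 1e6;   /* usec */
        if (dt < min) min = dt;
        if (dt > max) max = dt;
        total += dt;
    }
    if (rank == 0)
        printf("allreduce usec: min %.2f avg %.2f max %.2f\n",
               min, total / REPS, max);
    return MPI_Finalize();
}
```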

Technical Computing Server Roadmap

Fujitsu develops both commodity-based servers and a proprietary high-end server for Technical Computing.

- High-end TC server: FX1 (NEW, >100 TFlops)
  - Blade-type system with enhanced memory bandwidth
  - Leadoff system for the Petascale computer
- Petascale supercomputer (NEW, >10 PFlops)
  - Inherits and enhances the architecture of the FX1, with an enhanced processor
  - New HPC architecture for the Petascale era: new scalable interconnect, high-density packaging and low power consumption
- SMP systems: Solaris / SPARC64 VII, and Linux / Itanium 2 (NEW)
- PC cluster: blade server with InfiniBand (Xeon); PC cluster of 4-socket, 16-way nodes with enhanced interconnect (NEW, Opteron)

[Timeline: 2008, 2009, 2010 and beyond.]

Agenda

- Fujitsu's Approach for Petascale Computing and HPC Solution Offerings
- Japanese Next Generation Supercomputer Project and Fujitsu's Contributions
- Fujitsu's Challenges for Petascale Computing
- Conclusion

Japanese Next Generation Supercomputer Project*: Project Target

[Figure: project target, from the RIKEN official report. Total budget approximately $1.2 B.]

* Sponsored by MEXT (Ministry of Education, Culture, Sports, Science and Technology)

Japanese Next Generation Supercomputer Project: Project Schedule and Fujitsu's Contributions

[Timeline, FY2005 to FY2011:]
- System and middleware: Fujitsu is a major industry contributor
  - NAREGI grid project led by NII, followed by grid operation
  - R&D for the Petascale system interconnect, one of the primary R&D projects for the Next Generation Supercomputer
  - Scalar system development: collaborative joint research on architecture for the Next Generation Supercomputer Project, led by RIKEN and targeting 10 PFlops on LINPACK, proceeding from grand design to detailed design to production
- Application software R&D and code optimization
  - Life Science application project led by RIKEN
  - Nano Science application project led by IMS
  - CAE application project led by IIS

Japanese Next Generation Supercomputer Project: Project Outline

- System configuration: the hardware system consists of scalar and vector processor units (scalar system by Fujitsu; vector system by NEC and Hitachi)
- Target performance: 10 PFlops on the LINPACK benchmark
- Contributors: Fujitsu, Hitachi and NEC join the project as the system developers
- Schedule: a prototype system will be available for operation from the end of FY2010, and the full system from the end of FY2011

Source: CSTP evaluation working group report

Agenda

- Fujitsu's Approach for Petascale Computing and HPC Solution Offerings
- Japanese Next Generation Supercomputer Project and Fujitsu's Contributions
- Fujitsu's Challenges for Petascale Computing
- Conclusion

Fujitsu's Approach for Scaling up to 10 PFlops

System performance = Processor performance x Number of processors

[Chart repeated from earlier: peak performance per processor (GFlops) vs. number of processors (1,000 to 1,000,000), contrasting the many-core CPU or accelerator approach (LANL Roadrunner), the high-end general-purpose CPU approach (NMCAC SGI Altix ICE 8200) and the low-power embedded-processor approach (LLNL BG/L, JUGENE BG/P) with Fujitsu's approach, which gives priority to application migration; other systems plotted include the Earth Simulator (ES) and ASC Purple (P5 575).]

Fujitsu's Challenges for Petascale Supercomputer

- Fujitsu high-end CPU "Venus"
  - FP-enhanced multi-core scalar CPU (over 100 GFlops/CPU) with mainframe-level reliability
  - Inherits the Integrated Multi-core Parallel ArChiTecture of the FX1
  - Low power consumption, targeting ~1/10 the power consumption per flop
- Leading-edge interconnect
  - 3D torus interconnect with scalability up to over 10 PFlops, high bandwidth, high reliability and low latency
- Latest packaging & cooling technology
  - Targeting ~10x the packing density per flop through liquid cooling technology

[Node diagram: FP-enhanced multi-core scalar CPU with memory and an interconnect controller (ICC).]

Fujitsu's Challenges for Petascale Supercomputer

- Middleware for highly parallel systems
  - A sophisticated compiler for programs with 100,000 processes on multi-core CPUs
  - System management software for a system with 100,000 nodes
- Highly parallel application software
  - Optimization of highly parallel applications
  - Collaboration with users and ISVs to optimize their software for the Petascale system

[Diagram: users and application developers bring performance and environmental requirements and their applications; Fujitsu contributes program analysis, parallelization and optimization on the FX1, plus compiler and middleware improvement; the results are applications adapted for the Petascale environment and users ready for Petascale computing.]

History of Fujitsu High-End Processors

- High reliability and data integrity
  - Cache ECC
  - Register and ALU parity
  - Instruction retry
  - Cache dynamic degradation

[Lineage diagram: Fujitsu mainframe processors (GS8600, GS8800/GS8800B, GS8900, GS21) alongside SPARC64 processors (SPARC64 / SPARC64 II at 350 nm, ~1995-1997; SPARC64 GP at 250/220 nm; SPARC64 V at 180/130/90 nm; SPARC64 VI at 90 nm CMOS Cu+Low-k, 540M transistors; SPARC64 VII at 65 nm CMOS Cu+Low-k, 2004~), converging on Venus, the CPU for the Petascale supercomputer. RAS coverage of the SPARC64 VII guarantees data integrity: depending on the structure, 1-bit errors are correctable, detectable or harmless.]

Interconnect for Parallel Computer Systems

Interconnect type and characteristics:

Characteristic                             | Crossbar                 | Fat-Tree                           | Mesh / Torus
Performance                                | Best                     | Good                               | Average
Operability and usability                  | Best                     | Good                               | Weak
Cost, packaging density, power consumption | Weak                     | Average                            | Good
Scalability                                | Weak (hundreds of nodes) | Average-Good (thousands of nodes)  | Best (>10,000 nodes)
Representative systems                     | Vector                   | Scalar parallel, PC cluster        | Massively parallel

- Targeting a parallel system of over 100,000 nodes
  - Cost, packaging density and power consumption are essential issues
  - A mesh interconnect needs too many hops, so a torus interconnect is a strong candidate
  - The greatest challenge of a torus interconnect is operability and usability
- Fujitsu's challenge is to develop an innovative torus interconnect

Fujitsu's Interconnect for Petascale Computer Systems

- Architecture
  - Improved 3D torus
  - Switchless
- Advantages
  - Low latency and low power consumption
  - Scalability to over 100,000 nodes
  - High reliability and availability
  - High-density packaging
  - Reduced wiring cost
  - A simple 3D torus logical (application) view (sketched below in standard MPI terms)

[Diagram: the improved 3D torus architecture.]
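The "simple 3D torus logical view" is what applications already express through standard MPI Cartesian topologies; the improved torus sits underneath such a mapping. The 4x4x4 shape in this generic sketch is an assumption for the example and says nothing about the real machine's dimensions.

```c
/* 3D torus logical view via the standard MPI Cartesian topology API.
 * The 4x4x4 shape (64 ranks) is an assumption for the example;
 * periods = 1 makes every dimension wrap around (torus, not mesh).
 * Run with at least 64 ranks, e.g.: mpirun -np 64 ./torus */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int dims[3] = {4, 4, 4}, periods[3] = {1, 1, 1};
    MPI_Comm torus;
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods,
                    1 /* allow rank reordering for a good mapping */,
                    &torus);

    if (torus != MPI_COMM_NULL) {   /* surplus ranks get MPI_COMM_NULL */
        int rank, coords[3], left, right;
        MPI_Comm_rank(torus, &rank);
        MPI_Cart_coords(torus, rank, 3, coords);
        /* Neighbors one hop away in dimension 0; wraps at the edges. */
        MPI_Cart_shift(torus, 0, 1, &left, &right);
        printf("rank %d at (%d,%d,%d): x-neighbors %d and %d\n",
               rank, coords[0], coords[1], coords[2], left, right);
        MPI_Comm_free(&torus);
    }
    MPI_Finalize();
    return 0;
}
```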

Agenda

- Fujitsu's Approach for Petascale Computing and HPC Solution Offerings
- Japanese Next Generation Supercomputer Project and Fujitsu's Contributions
- Fujitsu's Challenges for Petascale Computing
- Conclusion

Conclusion

- Fujitsu continues to invest in HPC technology to provide solutions that meet the broadest user requirements at the highest levels of performance.
- Targeting sustained PFlops performance, Fujitsu has embarked on the Petascale Computing challenge.
