Fujitsu's Challenge for Sustained Petascale Computing

Motoi Okuda
Technical Computing Solutions Unit, Fujitsu Limited
IDC HPC User Forum, October 16th, 2008

Agenda

- Fujitsu's Approach for Petascale Computing and HPC Solution Offerings
- Japanese Next Generation Supercomputer Project and Fujitsu's Contributions
- Fujitsu's Challenges for Petascale Computing
- Conclusion

Fujitsu's Approach for Scaling up to 10 PFlops

System performance = Processor performance x Number of processors

Three paths lead to 10 PFlops: the many-core CPU or accelerator approach (e.g. LANL Roadrunner), the high-end general-purpose CPU approach (e.g. NMCAC SGI Altix ICE 8200), and the low-power embedded-processor approach (e.g. LLNL BG/L and JUGENE BG/P). Fujitsu takes the high-end general-purpose CPU approach, giving priority to application migration.

[Chart: peak performance per processor (GFlops) vs. number of processors (1,000 to 1,000,000), with system peak contours at 10 TFlops, 100 TFlops, 1 PFlops and 10 PFlops; plotted systems include the Earth Simulator (ES), ASC Purple (P5 575), LANL Roadrunner, NMCAC SGI Altix ICE 8200, LLNL BG/L and JUGENE BG/P.]
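As a quick sanity check of the formula, here is a hypothetical configuration using two targets quoted later in this deck (the Venus goal of over 100 GFlops per CPU and an interconnect scalable to over 100,000 nodes):

```latex
\underbrace{100\ \mathrm{GFlops/CPU}}_{\text{Venus target}}
\times
\underbrace{100{,}000\ \mathrm{CPUs}}_{\text{interconnect scale}}
= 10^{11} \times 10^{5}\ \mathrm{Flops}
= 10\ \mathrm{PFlops}
```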

Key Issues for Approaching Petascale Computing

- How to utilize multi-core CPUs?
- How to handle a hundred thousand processes?
- How to realize high reliability, availability and data integrity in a hundred-thousand-node system?
- How to decrease electric power and footprint?

Fujitsu's stepwise approach to product release ensures that customers can be prepared for Petascale computing.

Step 1 (2008~): The new high-end technical computing server FX1, with the new Integrated Multi-core Parallel ArChiTecture, an intelligent interconnect and an extremely reliable CPU design, provides a highly efficient hybrid parallel programming environment. The Petascale system is designed to inherit the FX1 architecture.

Step 2 (2011~): A Petascale system with a new high-performance, highly reliable and low-power-consumption CPU, an innovative interconnect and high-density packaging.

Current Technical Computing Platforms

High-end TC Solutions
- FX1 (NEW), with the high-end RISC CPU SPARC64 VII (SPARC/Solaris)
- Scalability up to 100 TFlops
- High I/O bandwidth
- High reliability based on mainframe technology

Large-scale SMP Solutions
- SPARC Enterprise (SPARC64 VII, up to 64 CPUs, SPARC/Solaris) and PRIMEQUEST 580 (Itanium 2, up to 32 CPUs, IA/Linux)
- Up to 2 TB of memory space for TC applications
- Highly effective I/O server performance

Cluster Solutions
- BX series, RX series and HX600 (NEW), all IA/Linux
- Optimal price/performance for MPI-based applications
- InfiniBand interconnect

Solidware Solutions
- FPGA board RG1000
- Ultra-high performance for specific applications

Customers of Large Scale TC Systems

Fujitsu has installed over 1,200 TC systems for over 400 customers.

Customer                                  | System type                                         | No. of CPUs | Performance
Japan Aerospace Exploration Agency (JAXA)* | Cluster (FX1) + Scalar SMP (SPARC Enterprise)      | >3,500      | 135 TFlops
Manufacturer A                            | Scalar SMP                                          | >3,500      | >80 TFlops
KYOTO Univ. Computing Center              | Cluster (HX600) + Scalar SMP (SPARC Enterprise)     | >2,000      | >61.2 TFlops
KYUSHU Univ. Computing Center             | Cluster (PRIMERGY) + Scalar SMP (PRIMEQUEST)        | 1,824       | 32 TFlops
RIKEN                                     | Cluster (PRIMERGY)                                  | 3,088       | 26.18 TFlops
Manufacturer B                            | Cluster                                             | >1,200      | >15 TFlops
NAGOYA Univ. Computing Center             | Scalar SMP (HPC2500)                                | 1,600       | 13 TFlops
TOKYO Univ. KAMIOKA Observatory           | Cluster (PRIMERGY)                                  | 540         | 12.9 TFlops
National Institute of Genetics            | Cluster (PRIMERGY) + Scalar SMP (SPARC Enterprise)  | 324         | 6.9 TFlops
Institute for Molecular Science           | Scalar SMP (PRIMEQUEST)                             | 320         | 4 TFlops

* This system will be installed at the end of 2008.

FX1 Launch Customer

The first system will be installed at JAXA by the end of 2008.

[System diagram: 3,392 FX1 THIN nodes (135 TFlops) and a SPARC Enterprise FAT node (SMP, 1 TFlops), connected by a high-speed intelligent interconnect network with hardware barriers between nodes; SPARC Enterprise I/O and front-end servers attach via LAN and FC to an ETERNUS RAID subsystem, with a system control server for power and facility control.]

FX1: New High-End TC Server - Outline -

- High-performance CPU designed by Fujitsu
  - SPARC64 VII: 4 cores in 65 nm technology
  - Performance: 40 GFlops (2.5 GHz)
- New architecture for a high-end TC server
  - Integrated Multi-core Parallel ArChiTecture, built on leading-edge CPU and compiler technologies
  - Blade-type node configuration for high memory bandwidth
- High-speed intelligent interconnect
  - Combination of an InfiniBand DDR interconnect and a highly functional switch
  - The switch realizes barrier synchronization and high-speed reduction between nodes in hardware
- The Petascale system inherits the Integrated Multi-core Parallel ArChiTecture
  - FX1 is a suitable platform on which to develop and evaluate Petascale applications

Integrated Multi-core Parallel ArChiTecture: Introduction

- Concept: highly efficient thread-level parallel processing technology for multi-core chips

[Diagram: a single process spanning the four cores of a CPU chip, with thread-level parallelization between the cores and a shared L2 cache in front of memory.]

- Advantage: handles the multi-core CPU as one equivalent, faster CPU (the hybrid pattern this enables is sketched below)
  - Reduces the number of MPI processes to 1/ncore and increases parallel efficiency
  - Reduces the memory-wall problem
- Challenge: how to decrease the thread-level parallelization overhead?
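The hybrid model that IMPACT targets can be written down in generic MPI + OpenMP. The sketch below is an illustration of the one-rank-per-chip pattern, not Fujitsu's compiler output; the array size N and the explicit 4-thread count are assumptions for the example.

```c
/* Hybrid MPI + OpenMP sketch: one MPI process per 4-core chip, with
 * OpenMP threads across the cores -- the pattern IMPACT automates.
 * Compile with, e.g.: mpicc -fopenmp hybrid.c -o hybrid */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N 1000000   /* local problem size: an assumption for the example */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    static double x[N];
    double local = 0.0, global = 0.0;

    /* Thread-level parallelism inside the chip: 4 cores share one MPI
     * rank, so the process count drops to 1/ncore of a pure-MPI run. */
    #pragma omp parallel for reduction(+:local) num_threads(4)
    for (int i = 0; i < N; i++) {
        x[i] = (double)(rank + i);
        local += x[i];
    }

    /* Process-level parallelism between chips. */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %e over %d ranks\n", global, nprocs);
    MPI_Finalize();
    return 0;
}
```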

Integrated Multi-core Parallel ArChiTecture: Key Technologies

- CPU technologies
  - Hardware barrier synchronization between cores (a generic measurement of such overhead is sketched after this list)
    - Reduces the overhead of parallel execution: 10 times faster than software emulation
    - Start-up time comparable to that of a vector unit
    - Barrier overhead remains constant regardless of the number of cores
  - Shared L2 cache memory (6 MB)
    - Reduces the number of cache-to-cache data transfers
    - Efficient cache memory usage
- Compiler technologies
  - Automatic parallelization or OpenMP on thread-based algorithms, using vectorization technology

[Chart: barrier overhead in ns for 2 and 4 cores, hardware vs. software barrier, on the SPARC64 VII, a real quad-core CPU for technical computing (2.5 GHz, 40 GFlops/chip).]
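A generic way to see why barrier cost matters is to time repeated barriers. On the FX1 the inter-core barrier is done in hardware below this level and is transparent to the code; the plain-OpenMP sketch below simply measures whatever barrier the runtime provides, and the repetition count is an assumption.

```c
/* Micro-benchmark sketch: average cost of a thread barrier.
 * Compile with, e.g.: cc -fopenmp barrier_bench.c -o barrier_bench
 * (The measured time also includes one-time parallel-region startup,
 * which is negligible for a large ITERS.) */
#include <omp.h>
#include <stdio.h>

#define ITERS 100000   /* repetition count: an assumption */

int main(void)
{
    double t0 = omp_get_wtime();
    #pragma omp parallel num_threads(4)
    {
        /* Every thread executes the same number of barriers. */
        for (int i = 0; i < ITERS; i++) {
            #pragma omp barrier
        }
    }
    double t1 = omp_get_wtime();
    printf("avg barrier overhead: %.1f ns\n", (t1 - t0) / ITERS * 1e9);
    return 0;
}
```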

Integrated Multi-core Parallel ArChiTecture: Performance Measurement of Automatic Parallelization (preliminary measured data)

- LINPACK performance on 1 CPU (4 cores)
  - n = 100: 3.26 GFlops
  - n = 40,000: 37.8 GFlops (93.8%)
- Performance comparison of DAXPY (EuroBen Kernel 8) on 1 CPU
  - 4 cores + IMPACT outperform a single core even at small loop iteration counts, and outperform the other servers measured (a generic version of the kernel is sketched below)

[Chart: DAXPY performance in MFlops vs. number of loop iterations (10 to 10,000) for FX1 SPARC64 VII (4 cores @ 2.5 GHz), FX1 SPARC64 VII (1 core @ 2.5 GHz), VPP5000 (9.6 GFlops), Intel Clovertown (4 cores @ 2.66 GHz) and Opteron Barcelona (4 cores @ 2.3 GHz).]
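For reference, the operation EuroBen Kernel 8 measures is the DAXPY loop y = a*x + y. In the measurement above the parallelization is automatic, done by the Fujitsu compiler; the explicit OpenMP directive in this minimal sketch merely stands in for that.

```c
/* DAXPY (the operation measured by EuroBen Kernel 8): y = a*x + y.
 * IMPACT parallelizes such loops across the cores automatically; the
 * OpenMP directive here is a generic stand-in for that. */
void daxpy(int n, double a, const double *restrict x, double *restrict y)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

At small n the fork/join and synchronization overhead dominates the loop body, which is exactly the regime where the slide credits the low-overhead hardware barrier for keeping the 4-core version ahead of the 1-core one.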

Integrated Multi-core Parallel ArChiTecture: Performance Measurement of NPB on 1 CPU (preliminary measured data)

- Performance comparison of NPB class C between pure MPI and the Integrated Multi-core Parallel ArChiTecture on 1 CPU (4 cores)
  - IMPACT (OpenMP) is better than pure MPI for 6 of the 7 programs

[Chart: relative performance of MPI, IMPACT (OMP) and IMPACT (automatic) for the NPB kernels BT, CG, EP, FT, LU, MG and SP.]

Integrated Multi-core Parallel ArChiTecture: Performance Measurement of NPB on 256 CPUs (preliminary measured data)

- Performance comparison of NPB class C between pure MPI and MPI + Integrated Multi-core Parallel ArChiTecture on FX1, 256 nodes (1,024 cores)
  - MPI + IMPACT (automatic parallelization) is better than pure MPI for 5 of the 8 programs

[Charts: MOPS vs. number of cores (1 to 10,000) for EP, BT, CG, FT, IS, LU, MG and SP, comparing pure MPI with MPI + IMPACT (automatic parallelization).]

FX1 Intelligent Interconnect: Introduction

- Combination of a fat-tree-topology InfiniBand DDR interconnect and a highly functional switch (the intelligent switch)
- Intelligent switch
  - A result of the PSI (Petascale System Interconnect) national project
  - Functions:
    - Hardware barrier among nodes
    - Hardware assistance for MPI functions (synchronization and reduction)
    - Global ping for OS scheduling

  - Advantages:
    - The faster hardware barrier speeds up OpenMP and data-parallel Fortran (XPF)
    - Fast collective operations accelerate highly parallel applications (from the application's side these remain ordinary MPI collectives, as sketched below)
    - Reduces the OS jitter effect

[Diagram: intelligent switches attached to the nodes alongside the leaf and spine InfiniBand switches of the fat tree.]
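Per the slide, the switch assists MPI synchronization and reduction in hardware; the application itself just issues standard MPI calls, so an unmodified program like this generic sketch would benefit transparently.

```c
/* The collectives the intelligent switch accelerates are ordinary MPI
 * calls; per the deck, applications need no changes to benefit. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Synchronization: the slide says this is assisted by the
     * hardware barrier among nodes. */
    MPI_Barrier(MPI_COMM_WORLD);

    /* Reduction: likewise assisted by the switch's reduction hardware. */
    double local = (double)rank, sum = 0.0;
    MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of ranks = %.0f\n", sum);
    MPI_Finalize();
    return 0;
}
```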

IDC HPC User Forum, Oct. 16th, 200813 All Rights Reserved, Copyright FUJITSU LIMITED 2008 FX1 Intelligent Interconnect High Performance Barrier & Reduction Hardware z Hardware barrier and reduction shows low latency and constant overhead in comparison with software barrier and reduction*

μsec μsec 120 120

100 100 80 Software 80 Software

60 60

40 40

20 Hardware 20 Hardware

0 0 2 4 8 16 32 64 128 256 2 4 8 16 32 64 128 256 Number of processes Number of processes Barrier Reduction

* Executed by the host processor using a butterfly network built from point-to-point communication; a sketch of that baseline follows below.
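This is a sketch of the kind of software baseline the footnote describes: a barrier assembled from point-to-point messages in a butterfly pattern. The recursive-doubling pairing scheme is the standard one and is an assumption here; it requires a power-of-two process count, which matches the measured sizes (2 to 256).

```c
/* Software-barrier sketch: point-to-point messages in a butterfly
 * (recursive-doubling) pattern, as the footnote's baseline describes.
 * Assumes the number of processes is a power of two. */
#include <mpi.h>

void butterfly_barrier(MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* log2(size) stages; at each stage exchange a token with the
     * partner whose rank differs in one bit. */
    for (int mask = 1; mask < size; mask <<= 1) {
        int partner = rank ^ mask;
        char out = 0, in = 0;
        MPI_Sendrecv(&out, 1, MPI_CHAR, partner, 0,
                     &in,  1, MPI_CHAR, partner, 0,
                     comm, MPI_STATUS_IGNORE);
    }
    /* After all stages every rank has transitively heard from every
     * other rank, so all processes have arrived at the barrier. */
}
```

The latency of this scheme grows with log2 of the process count, which is consistent with the rising software curves in the charts above.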

FX1 Intelligent Interconnect: Stability of the Reduction Function

- The intelligent interconnect realizes stable reduction performance through the global ping function (a generic version of this measurement is sketched below)

[Chart: reduction (allreduce) performance over 100 consecutive reductions on a 128-node system; the software implementation fluctuates around an average of 82.99 usec, while the intelligent switch stays stable at an average of 11.86 usec.]
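The measurement behind this chart can be reproduced generically by timing each of a series of allreduces and looking at the spread; a jittery software implementation shows a large max/min gap, a stable one does not. REPS = 100 matches the chart; everything else below is plain MPI and an assumption about how the deck's measurement was set up.

```c
/* Stability-measurement sketch: time REPS consecutive allreduce
 * operations and report rank 0's min/avg/max in microseconds. */
#include <mpi.h>
#include <stdio.h>

#define REPS 100   /* matches the 100 reductions in the chart */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double in = 1.0, out, min = 1e9, max = 0.0, total = 0.0;
    for (int i = 0; i < REPS; i++) {
        MPI_Barrier(MPI_COMM_WORLD);            /* align the start */
        double t0 = MPI_Wtime();
        MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        double dt = (MPI_Wtime() - t0) * 1e6;   /* usec */
        if (dt < min) min = dt;
        if (dt > max) max = dt;
        total += dt;
    }
    if (rank == 0)
        printf("allreduce usec: min %.2f avg %.2f max %.2f\n",
               min, total / REPS, max);
    return MPI_Finalize();
}
```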

Technical Computing Server Roadmap

Fujitsu develops both commodity-based servers and a proprietary high-end server for Technical Computing.

- High-end TC server: FX1 (NEW, >100 TFlops)
  - Blade-type system with enhanced memory bandwidth
  - Leadoff system for the Petascale computer
- Petascale supercomputer (NEW, >10 PFlops)
  - Inherits and enhances the architecture of the FX1, with an enhanced processor
  - New HPC architecture for the Petascale era: new scalable interconnect, high-density packaging and low power consumption
- SMP systems: Solaris / SPARC64 VII, and Linux / Itanium 2 (NEW)
- PC cluster: blade server with InfiniBand (Xeon); PC cluster of 4-socket, 16-way nodes with enhanced interconnect (NEW, Opteron)

[Timeline: 2008, 2009, 2010 and beyond.]

Agenda

- Fujitsu's Approach for Petascale Computing and HPC Solution Offerings
- Japanese Next Generation Supercomputer Project and Fujitsu's Contributions
- Fujitsu's Challenges for Petascale Computing
- Conclusion

Japanese Next Generation Supercomputer Project*: Project Target

[Figure: project target, from the RIKEN official report. Total budget approximately $1.2 B.]

* Sponsored by MEXT (Ministry of Education, Culture, Sports, Science and Technology)

Japanese Next Generation Supercomputer Project: Project Schedule and Fujitsu's Contributions

[Timeline, FY2005 to FY2011:]
- System and middleware: Fujitsu is a major industry contributor
  - NAREGI grid project led by NII, followed by grid operation
  - R&D for the Petascale system interconnect, one of the primary R&D projects for the Next Generation Supercomputer
  - Scalar system development: collaborative joint research on architecture for the Next Generation Supercomputer Project, led by RIKEN and targeting 10 PFlops on LINPACK, proceeding from grand design to detailed design to production
- Application software R&D and code optimization
  - Life Science application project led by RIKEN
  - Nano Science application project led by IMS
  - CAE application project led by IIS

Japanese Next Generation Supercomputer Project: Project Outline

- System configuration: the hardware system consists of scalar and vector processor units (scalar system by Fujitsu; vector system by NEC and Hitachi)
- Target performance: 10 PFlops on the LINPACK benchmark
- Contributors: Fujitsu, Hitachi and NEC join the project as the system developers
- Schedule: a prototype system will be available for operation from the end of FY2010, and the full system from the end of FY2011

Source: CSTP evaluation working group report

Agenda

- Fujitsu's Approach for Petascale Computing and HPC Solution Offerings
- Japanese Next Generation Supercomputer Project and Fujitsu's Contributions
- Fujitsu's Challenges for Petascale Computing
- Conclusion

Fujitsu's Approach for Scaling up to 10 PFlops

System performance = Processor performance x Number of processors

[Chart repeated from earlier: peak performance per processor (GFlops) vs. number of processors (1,000 to 1,000,000), contrasting the many-core CPU or accelerator approach (LANL Roadrunner), the high-end general-purpose CPU approach (NMCAC SGI Altix ICE 8200) and the low-power embedded-processor approach (LLNL BG/L, JUGENE BG/P) with Fujitsu's approach, which gives priority to application migration; other systems plotted include the Earth Simulator (ES) and ASC Purple (P5 575).]

Fujitsu's Challenges for Petascale Supercomputer

- Fujitsu high-end CPU "Venus"
  - FP-enhanced multi-core scalar CPU (over 100 GFlops/CPU) with mainframe-level reliability
  - Inherits the Integrated Multi-core Parallel ArChiTecture of the FX1
  - Low power consumption, targeting ~1/10 the power consumption per flop
- Leading-edge interconnect
  - 3D torus interconnect with scalability up to over 10 PFlops, high bandwidth, high reliability and low latency
- Latest packaging & cooling technology
  - Targeting ~10x the packing density per flop through liquid cooling technology

[Node diagram: FP-enhanced multi-core scalar CPU with memory and an interconnect controller (ICC).]

Fujitsu's Challenges for Petascale Supercomputer

- Middleware for highly parallel systems
  - A sophisticated compiler for programs with 100,000 processes on multi-core CPUs
  - System management software for a system with 100,000 nodes
- Highly parallel application software
  - Optimization of highly parallel applications
  - Collaboration with users and ISVs to optimize their software for the Petascale system

[Diagram: users and application developers bring performance and environmental requirements and their applications; Fujitsu contributes program analysis, parallelization and optimization on the FX1, plus compiler and middleware improvement; the results are applications adapted for the Petascale environment and users ready for Petascale computing.]

History of Fujitsu High-End Processors

- High reliability and data integrity
  - Cache ECC
  - Register and ALU parity
  - Instruction retry
  - Cache dynamic degradation

[Lineage diagram: Fujitsu mainframe processors (GS8600, GS8800/GS8800B, GS8900, GS21) alongside SPARC64 processors (SPARC64 / SPARC64 II at 350 nm, ~1995-1997; SPARC64 GP at 250/220 nm; SPARC64 V at 180/130/90 nm; SPARC64 VI at 90 nm CMOS Cu+Low-k, 540M transistors; SPARC64 VII at 65 nm CMOS Cu+Low-k, 2004~), converging on Venus, the CPU for the Petascale supercomputer. RAS coverage of the SPARC64 VII guarantees data integrity: depending on the structure, 1-bit errors are correctable, detectable or harmless.]

Interconnect for Parallel Computer Systems

Interconnect type and characteristics:

Characteristic                             | Crossbar                 | Fat-Tree                           | Mesh / Torus
Performance                                | Best                     | Good                               | Average
Operability and usability                  | Best                     | Good                               | Weak
Cost, packaging density, power consumption | Weak                     | Average                            | Good
Scalability                                | Weak (hundreds of nodes) | Average-Good (thousands of nodes)  | Best (>10,000 nodes)
Representative systems                     | Vector                   | Scalar parallel, PC cluster        | Massively parallel

- Targeting a parallel system of over 100,000 nodes
  - Cost, packaging density and power consumption are essential issues
  - A mesh interconnect needs too many hops, so a torus interconnect is a strong candidate
  - The greatest challenge of a torus interconnect is operability and usability
- Fujitsu's challenge is to develop an innovative torus interconnect

Fujitsu's Interconnect for Petascale Computer Systems

- Architecture
  - Improved 3D torus
  - Switchless
- Advantages
  - Low latency and low power consumption
  - Scalability to over 100,000 nodes
  - High reliability and availability
  - High-density packaging
  - Reduced wiring cost
  - A simple 3D torus logical (application) view (sketched below in standard MPI terms)

[Diagram: the improved 3D torus architecture.]
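The "simple 3D torus logical view" is what applications already express through standard MPI Cartesian topologies; the improved torus sits underneath such a mapping. The 4x4x4 shape in this generic sketch is an assumption for the example and says nothing about the real machine's dimensions.

```c
/* 3D torus logical view via the standard MPI Cartesian topology API.
 * The 4x4x4 shape (64 ranks) is an assumption for the example;
 * periods = 1 makes every dimension wrap around (torus, not mesh).
 * Run with at least 64 ranks, e.g.: mpirun -np 64 ./torus */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int dims[3] = {4, 4, 4}, periods[3] = {1, 1, 1};
    MPI_Comm torus;
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods,
                    1 /* allow rank reordering for a good mapping */,
                    &torus);

    if (torus != MPI_COMM_NULL) {   /* surplus ranks get MPI_COMM_NULL */
        int rank, coords[3], left, right;
        MPI_Comm_rank(torus, &rank);
        MPI_Cart_coords(torus, rank, 3, coords);
        /* Neighbors one hop away in dimension 0; wraps at the edges. */
        MPI_Cart_shift(torus, 0, 1, &left, &right);
        printf("rank %d at (%d,%d,%d): x-neighbors %d and %d\n",
               rank, coords[0], coords[1], coords[2], left, right);
        MPI_Comm_free(&torus);
    }
    MPI_Finalize();
    return 0;
}
```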

Agenda

- Fujitsu's Approach for Petascale Computing and HPC Solution Offerings
- Japanese Next Generation Supercomputer Project and Fujitsu's Contributions
- Fujitsu's Challenges for Petascale Computing
- Conclusion

Conclusion

- Fujitsu continues to invest in HPC technology to provide solutions that meet the broadest user requirements at the highest levels of performance.
- Targeting sustained PFlops performance, Fujitsu has embarked on the Petascale Computing challenge.
