Fujitsu's Architectures and Collaborations for Weather Prediction and Climate Research
Ross Nobes, Fujitsu Laboratories of Europe
ECMWF Workshop on HPC in Meteorology, October 2014

Fujitsu HPC - Past, Present and Future
Fujitsu has been developing advanced supercomputers since 1977. Milestones on the timeline include:
- F230-75APU (1977): Japan's first vector (array) supercomputer
- AP1000 and the VP Series
- NWT*/VPP500, developed with NAL: No.1 in the Top500 (Nov. 1993), Gordon Bell Prizes (1994, 95, 96)
- VPP300/700 and AP3000
- VPP5000: world's fastest vector processor (1999)
- PRIMEPOWER HPC2500: world's most scalable supercomputer (2003)
- PRIMERGY RX200 cluster nodes: Japan's largest cluster in the Top500 (July 2004)
- HX600 cluster nodes
- FX1: most efficient performance in the Top500 (Nov. 2008)
- PRIMERGY BX900 cluster nodes
- K computer (national project): No.1 in the Top500 (June and Nov. 2011)
- PRIMEHPC FX10; x86 PRIMEQUEST and PRIMERGY CX400
- Next generation: Post-FX10 and the exascale project
*NWT: Numerical Wind Tunnel

Fujitsu's Approach to the HPC Market
Fujitsu addresses the HPC market at every scale: SPARC64-based supercomputers at the top end, x86 PRIMERGY supercomputers and HPC solutions for divisional, departmental and workgroup computing, and powerful workstations.

The K computer
- June 2011: No.1 in the TOP500 list (ISC11)
- November 2011: consecutive No.1 in the TOP500 list (SC11)
- November 2011: ACM Gordon Bell Prize, Peak Performance (SC11)
- November 2012: No.1 in three HPC Challenge Award benchmarks (SC12)
- November 2012: ACM Gordon Bell Prize (SC12)
- November 2013: No.1 in Class 1 and 2 of the HPC Challenge Awards (SC13)
- June 2014: No.1 in the Graph 500 "Big Data" supercomputer ranking (ISC14)
Fujitsu's next systems build on the success of the K computer.

Evolution of PRIMEHPC
                      K computer        PRIMEHPC FX10     Post-FX10
  CPU                 SPARC64 VIIIfx    SPARC64 IXfx      SPARC64 XIfx
  Peak performance    128 GFLOPS        236.5 GFLOPS      1 TFLOPS class
  Cores               8                 16                32 + 2
  Memory              DDR3 SDRAM        DDR3 SDRAM        HMC
  Interconnect        Tofu              Tofu              Tofu2
  System size         11 PFLOPS         Max. 23 PFLOPS    Max. 100 PFLOPS
  Link bandwidth      5 GB/s x 2        5 GB/s x 2        12.5 GB/s x 2
                      (bidirectional)   (bidirectional)   (bidirectional)

Continuity in Architecture for Compatibility
- Upwards-compatible CPU: binary-compatible with the K computer and PRIMEHPC FX10; good byte/flop balance
- New features: new instructions and an improved microarchitecture
- For distributed parallel execution: compatible interconnect architecture with improved interconnect bandwidth

32 + 2 Core SPARC64 XIfx
- A rich microarchitecture improves single-thread performance
- 2 additional assistant cores handle OS-jitter avoidance and non-blocking MPI functions
                                K           FX10        Post-FX10
  Peak FP performance           128 GF      236.5 GF    1-TF class
  Execution units               2 FMA       2 FMA       2 FMA
  SIMD width                    128 bit     128 bit     256 bit
  Dual SP mode                  N/A         N/A         2x DP performance
  Integer SIMD                  N/A         N/A         Supported
  Single-thread enhancements    -           -           Improved out-of-order execution, better branch prediction, larger cache

Flexible SIMD Operations
New 256-bit wide SIMD functions enable versatile operations:
- Four double-precision calculations performed simultaneously
- Strided load/store, indirect (list) load/store, permutation and concatenation
(Diagram: a four-wide double-precision SIMD calculation, a strided load with a specified stride, an indirect load from memory, and an arbitrary-shuffle permutation from source to destination registers.)
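As a rough, non-Fujitsu-specific illustration of the access patterns these new SIMD load/store forms target, the C loops below use strided and indirect (gather) indexing; a vectorizing compiler can map such loops onto a 256-bit SIMD unit, four doubles at a time. The function and array names are illustrative only.

```c
#include <stddef.h>

/* Strided access: every 'stride'-th element of x is scaled into y.
 * A 256-bit SIMD unit can process four doubles per iteration,
 * using strided loads to fetch the non-contiguous operands. */
void scale_strided(double *y, const double *x, size_t n, size_t stride, double a)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i * stride];
}

/* Indirect (list) access: the index vector idx selects arbitrary
 * elements of x, the pattern served by indirect (gather) loads. */
void gather_add(double *y, const double *x, const int *idx, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        y[i] += x[idx[i]];
}
```

Without strided and indirect SIMD loads, loops of this kind typically fall back to scalar code, so supporting them directly widens the range of vectorizable kernels.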
Hybrid Memory Cube (HMC)
- The increased arithmetic performance of the processor demands higher memory bandwidth
- The required memory bandwidth can almost be met by HMC (480 GB/s)
- The interconnect is boosted to 12.5 GB/s x 2 (bidirectional) with optical links
Peak performance per node:
                                        K       FX10    Post-FX10
  DP performance (GFLOPS)               128     236.5   Over 1 TF
  Memory bandwidth (GB/s)               64      85      480
  Interconnect link bandwidth (GB/s)    5       5       12.5

HMC (2)
Amount per processor:
                   Capacity      Memory bandwidth
  HMC x8           32 GB         480 GB/s
  DDR4-DIMM x8     32-128 GB     154 GB/s
  GDDR5 x16        8 GB          320 GB/s
HMC was adopted for main memory because its bandwidth is roughly three times that of DDR4, and it can deliver the bandwidth required by high-performance multi-core processors.

Tofu2 Interconnect
- Successor to the Tofu interconnect
- Highly scalable, 6-dimensional mesh/torus topology
- Logical 3D, 2D or 1D torus network from the user's point of view
- Link bandwidth increased 2.5 times to 100 Gbps
- Scalable, well-balanced three-dimensional torus built from 2 x 3 x 2 node units

A Next Generation Interconnect
Tofu2 is optical-dominant: two thirds of its network links are optical.
(Chart: gross raw bitrate per node, split into optical and electrical links, for IBM Blue Gene/Q, InfiniBand FDR fat-tree, Cray Cascade, Tofu1 and Tofu2; per-lane signalling rates range from 5 Gbps to 25 Gbps.)

Reduction in the Chip Area Size
- Process technology shrinks from 65 nm to 20 nm
- System-on-chip integration eliminates the host bus interface
- The interconnect chip area shrinks to about one third
- Tofu1: a separate InterConnect Controller chip (65 nm, under 300 mm²) containing the Tofu network router (TNR), Tofu network interfaces (TNI), Tofu barrier interface (TBI), PCI Express root complex and host bus interface
- Tofu2: integrated into the 34-core SPARC64 XIfx (20 nm) alongside the processor cores, 8 HMC links and PCI Express root complex; the Tofu2 logic (TNR, TNI and TBI) occupies under 100 mm²

Throughput of Single Put Transfer
Tofu2 achieves 11.46 GB/s of put throughput, 92% of the 12.5 GB/s link bandwidth.
(Chart: put throughput versus message size from 4 bytes to 4096 bytes and beyond; Tofu2 reaches 11.46 GB/s compared with 4.66 GB/s on Tofu1.)

Throughput of Concurrent Put Transfers
Throughput increases linearly with the number of concurrent transfers because there is no host bus bottleneck.
(Chart: aggregate put throughput for 1-way to 4-way concurrent transfers; Tofu2 scales to 45.82 GB/s at 4-way, while Tofu1 saturates at 17.63 GB/s because of its host bus bottleneck.)

Communication Latencies
  Pattern      Method                             Latency
  One-way      Put 8-byte to memory               0.87 μsec
  One-way      Put 8-byte to cache                0.71 μsec
  Round-trip   Put 8-byte ping-pong by CPU        1.42 μsec
  Round-trip   Put 8-byte ping-pong by session    1.41 μsec
  Round-trip   Atomic RMW 8-byte                  1.53 μsec
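For context, put latencies such as those in the table above are typically measured by timing many back-to-back small transfers. The sketch below uses portable MPI-3 one-sided calls rather than Fujitsu's low-level Tofu interface, so it only illustrates the measurement pattern: it times an 8-byte put plus completion flush from rank 0 to rank 1, and the iteration count is arbitrary.

```c
#include <mpi.h>
#include <stdio.h>

/* Times an 8-byte one-sided put (plus flush) from rank 0 to rank 1.
 * Generic MPI-3 RMA sketch; run with at least two ranks. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *buf;
    MPI_Win win;
    MPI_Win_allocate(sizeof(double), sizeof(double), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &buf, &win);
    MPI_Win_lock_all(0, win);            /* passive-target epoch on all ranks */

    const int iters = 10000;
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; ++i) {
        if (rank == 0) {
            MPI_Put(buf, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win);
            MPI_Win_flush(1, win);       /* wait for completion at the target */
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("avg 8-byte put + flush: %.3f usec\n",
               (t1 - t0) / iters * 1e6);

    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```

A full ping-pong benchmark would also have rank 1 detect the arrival and reply with its own put; the figures in the table above come from Fujitsu's own measurements on Tofu2, not from this portable sketch.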
Communication Intensive Applications
- Fujitsu's math library provides 3D-FFT functionality with improved scalability for massively parallel execution (a generic distributed 3D-FFT sketch is given at the end of this document)
- Significantly improved interconnect bandwidth
- Optimised process configuration, enabled by the Tofu interconnect and the job manager
- The optimised MPI library provides high-performance collective communications

Enhanced Software Stack for Post-FX10

Enduring Programming Model

Smaller, Faster, More Efficient
- Highly integrated components with high-density packaging; the performance of one chassis corresponds to approximately one cabinet of the K computer
- Efficient in space, time and power
- Fujitsu-designed CPU: SPARC64 XIfx, 32 + 2 cores, with Tofu2 integrated; 1 CPU per node
- Main board: three CPUs, 3 x 8 HMCs (Hybrid Memory Cubes) and 8 optical modules (25 Gbps x 12 channels)
- Chassis: 12 nodes per 2U chassis, water cooled
- Cabinet: over 200 nodes per cabinet

Scale-Out Smart for HPC and Cloud Computing
Fujitsu Server PRIMERGY CX400 M1 (CX2550 M1 / CX2570 M1)

PRIMERGY x86 Cluster Series
(This slide repeats the supercomputer timeline shown earlier, from the F230-75APU in 1977 to the planned exascale systems.)

Fujitsu PRIMERGY Portfolio
The portfolio spans the TX, RX, BX and CX families. Within the CX family, the CX400 chassis hosts CX420, CX2550 and CX2570 server nodes.

Fujitsu PRIMERGY CX400 M1 Feature Overview
- 4 dual-socket nodes in 2U density
- Up to 24 storage drives
- Choice of server nodes
- Dual-socket servers with the Intel Xeon processor E5-2600 v3 product family (Haswell)
- Optionally up to two GPGPU or co-processor cards
- DDR4 memory technology
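Returning to the 3D-FFT mentioned under "Communication Intensive Applications": the sketch below shows what a distributed 3D FFT looks like using the open FFTW 3.3 MPI interface, as a stand-in for Fujitsu's own math library, whose API is not described in these slides. The grid size and input values are arbitrary.

```c
#include <stddef.h>
#include <mpi.h>
#include <fftw3-mpi.h>

/* Distributed 3D complex-to-complex FFT on a slab decomposition.
 * Generic FFTW-MPI sketch; Fujitsu's 3D-FFT library has its own
 * interface and is tuned for the Tofu interconnect. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    fftw_mpi_init();

    const ptrdiff_t N0 = 256, N1 = 256, N2 = 256;   /* global grid (arbitrary) */
    ptrdiff_t local_n0, local_0_start;

    /* How much of the grid this rank owns, and the buffer size it needs. */
    ptrdiff_t alloc = fftw_mpi_local_size_3d(N0, N1, N2, MPI_COMM_WORLD,
                                             &local_n0, &local_0_start);
    fftw_complex *data = fftw_alloc_complex(alloc);

    fftw_plan plan = fftw_mpi_plan_dft_3d(N0, N1, N2, data, data,
                                          MPI_COMM_WORLD,
                                          FFTW_FORWARD, FFTW_ESTIMATE);

    /* Fill the local slab with placeholder data. */
    for (ptrdiff_t i = 0; i < local_n0 * N1 * N2; ++i) {
        data[i][0] = 1.0;   /* real part */
        data[i][1] = 0.0;   /* imaginary part */
    }

    fftw_execute(plan);     /* global transposes happen inside */

    fftw_destroy_plan(plan);
    fftw_free(data);
    fftw_mpi_cleanup();
    MPI_Finalize();
    return 0;
}
```

Compile and link against the MPI-enabled FFTW (e.g. mpicc fft3d.c -lfftw3_mpi -lfftw3 -lm). The communication-heavy step is the global transpose inside fftw_execute, which is exactly where interconnect bandwidth and optimised collectives matter for scalability.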