A Fad or the Yellow Brick Road onto Exascale?


GPU Acceleration: A Fad or the Yellow Brick Road onto Exascale?
Satoshi Matsuoka, Professor/Dr.Sci.
Global Scientific Information and Computing Center (GSIC), Tokyo Institute of Technology / National Institute of Informatics, Japan
CNRS ORAP Forum, 31 March 2010, Paris, France

GPU vs. Standard Many-Cores?
[Figure: block diagrams contrasting a many-core CPU (a grid of PEs with shared caches) with a GPU (arrays of SPs with per-SM shared memory, memory controllers, and GDDR3 channels).]
• Are these just legacy programming-compatibility differences?

GPUs as Commodity Vector Engines
[Figure: computational density vs. memory access frequency, placing dedicated accelerators (GRAPE, ClearSpeed; N-body, MatMul), GPUs, cache-based CPUs (FFT), and vector supercomputers (Cray, SX; CFD).]
• Unlike conventional accelerators, GPUs have high memory bandwidth and high computational density.
• Since the latest high-end GPUs support double precision, GPUs also work as commodity vector processors.
• The target application area for GPUs is therefore very wide.
• Restrictions: limited non-stream memory access, PCI-Express overhead, etc.
⇒ How do we utilize them easily?

GPUs as Commodity Massively Parallel Vector Processors
• E.g., NVIDIA Tesla, AMD FireStream
– High peak performance (> 1 TFlops): good for tightly coupled code, e.g. N-body
– High memory bandwidth (> 100 GB/s): good for sparse codes, e.g. CFD
– Low latency over shared memory: thousands of threads hide latency with zero overhead
– Slow, parallel, and efficient vector engines for HPC
– Restrictions: limited non-stream memory access, PCI-Express overhead, programming model, etc.
How do we exploit them, given our vector-computing experience?

From CPU-Centric to GPU-Centric Nodes for Scaling
[Figure: node diagrams. The CPU-"centric" node has DDR3 with 4-8 GB and 15-20 GB/s STREAM per socket (2 GB/core); the CPU's roles are the OS, services, and irregular/sparse work; the IOH provides PCIe x16 (4 GB/s) and PCIe x8 (2 GB/s) links and 40 Gbps IB via QDR HCAs. The GPU-centric node adds GDDR5 with > 150 GB/s STREAM per socket, 1 GB on the GPU, and 200 GB of Flash at 400 MB/s; the datasets on both sides are isomorphic.]

High-Performance 3-D FFT on NVIDIA CUDA GPUs [Nukada et al., SC08]
• The Fast Fourier Transform is an O(N log N) algorithm with O(N) memory access.
• Accelerating FFT is much more difficult than N-body, MatMul, etc., but possible: at the least, GPUs' theoretical memory bandwidth is much higher than CPUs'.

Multi-Row FFT Algorithm for GPUs [SC08]
• The algorithm accesses multiple streams (e.g., 4 input and 4 output streams for a 4-point FFT), but each stream is accessed contiguously.
• Since each thread computes an independent set of small FFTs, many registers are required.
• For a 256-point FFT, two passes of 16-point FFT kernels are used.
(A hedged CUDA sketch of this access pattern appears after the performance notes below.)

Performance of 1-D FFT
• Note: an earlier 1.3 GHz sample is used for the Tesla S1070.

3-D FFT Performance: Comparison with CPUs, Cell BE, and GPUs
[Figure: 3-D FFT performance on various CPUs, Cell BE, and GPUs; the chart peaks at about 140 GFLOPS.]
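The following is a minimal CUDA sketch of the multi-row access pattern described above, not the actual SC08 kernels: each thread computes one independent 4-point FFT, and element s of every transform is assumed to live in its own stream, so consecutive threads read consecutive addresses in each stream. The stream layout and all names are illustrative assumptions.

```cuda
/* Minimal sketch (illustrative, not the SC08 implementation): each thread
 * computes one independent 4-point FFT; element s of every transform is
 * stored in stream in_s, so thread t reads address t of each stream,
 * giving fully coalesced loads and stores. */
#include <cuda_runtime.h>
#include <cuComplex.h>

__global__ void fft4_batched(const cuFloatComplex *in0, const cuFloatComplex *in1,
                             const cuFloatComplex *in2, const cuFloatComplex *in3,
                             cuFloatComplex *out0, cuFloatComplex *out1,
                             cuFloatComplex *out2, cuFloatComplex *out3,
                             int batch)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= batch) return;

    cuFloatComplex a = in0[t], b = in1[t], c = in2[t], d = in3[t];

    cuFloatComplex s0 = cuCaddf(a, c);            /* a + c */
    cuFloatComplex s1 = cuCsubf(a, c);            /* a - c */
    cuFloatComplex s2 = cuCaddf(b, d);            /* b + d */
    cuFloatComplex s3 = cuCsubf(b, d);            /* b - d */
    /* -i * (b - d), the forward-transform twiddle for the odd outputs */
    cuFloatComplex s3mi = make_cuFloatComplex(cuCimagf(s3), -cuCrealf(s3));

    out0[t] = cuCaddf(s0, s2);                    /* X0 = a + b + c + d      */
    out1[t] = cuCaddf(s1, s3mi);                  /* X1 = (a - c) - i(b - d) */
    out2[t] = cuCsubf(s0, s2);                    /* X2 = a - b + c - d      */
    out3[t] = cuCsubf(s1, s3mi);                  /* X3 = (a - c) + i(b - d) */
}
```

Each small FFT is held entirely in registers, which is why the slide notes the heavy register use; larger transforms (e.g., 256 points) are built from passes of such kernels.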
All-to-All 3-D Protein Docking Challenge (collaboration with Yutaka Akiyama, Tokyo Tech)
• A 1,000 x 1,000 all-to-all docking fitness evaluation over proteins P1 ... P1000 would take only 1-2 months (at 15 deg. pitch) on a 32-node HPC-GPGPU cluster (128 GPGPUs): fast PPI candidate screening with FFT.
• For comparison: ~500 years on a single CPU (sustained 1+ GFlops), or ~2 years on one rack of Blue Gene/L (the Blue Protein system at CBRC, AIST has 4 racks, 8,192 nodes).

Algorithm for 3-D All-to-All Protein Docking
• Per protein pair: (1) 3-D voxel data for protein 1 (candidate docking sites); (2) 3-D voxel data for protein 2; (3) 3-D complex convolution (3-D FFT and its inverse) on NVIDIA CUDA; (4) rotation on the host CPUs (6 deg. or 15 deg. pitch); (5) clustering analysis of candidate complexes; (6) docking confidence level.
• Calculation for a single protein-protein pair: ~200 tera-ops. The 3-D complex convolution is O(N^3 log N) with typically N = 256, and the possible rotations are R = 54,000 at 6 deg. pitch.
• That amounts to ~200 exa-ops for the full 1,000 x 1,000 screening.
(A hedged sketch of scoring step (3) appears at the end of this section.)

Heavily GPU-Accelerated Windows HPC Prototype Cluster
• 32 compute nodes with 128 NVIDIA 8800 GTS GPUs, plus one head node
• Gigabit Ethernet network; three 40U rack cabinets
• Windows Compute Cluster Server 2008, Visual Studio 2005 SP1, NVIDIA CUDA 2.x

Performance Estimation of 3-D PPD

Single node (CPUs only for TSUBAME; DP = double precision):

                Power (W)    Peak (GFLOPS)  3D-FFT (GFLOPS)  Docking (GFLOPS)  Nodes per 40U rack
  Blue Gene/L   20           5.6            -                1.8               1024
  TSUBAME       1000 (est.)  76.8 (DP)      18.8 (DP)        26.7 (DP)         10
  8800 GTS x4   570          1664           256              207               8~13

System total:

                              # of nodes       Power (kW)  Peak (TFLOPS)  Docking (TFLOPS)  MFLOPS/W
  Blue Gene/L (Blue Protein   4096 (4 racks)   80          22.9           7.0               87.5
   at AIST, Japan)
  TSUBAME (Opteron only)      655 (~70 racks)  ~700        50.3 (DP)      17.5 (DP)         25
  GPU-accel. WinHPC           32 (4 racks)     18          53.2           6.5               361

• The 1,000 x 1,000 screening can be computed in 1 month (15 deg.) or 1 year (6 deg.).
• On the full TSUBAME 1.2 (~100 TFlops at 1 MW), this corresponds to roughly 52 racks of BG/L.

Tokyo Tech TSUBAME 1.0 Production SC Cluster (Spring 2006)
• Compute: Sun Galaxy 4 (dual-core Opteron, 8 sockets per node), 10,480 cores / 655 nodes, 21.4 TeraBytes of memory, 50.4 TeraFlops
• Unified IB network: Voltaire ISR9288 InfiniBand, 10 Gbps x2 (DDR in the next version), ~1310+50 ports, ~13.5 Terabits/s (3 Tbits/s bisection), 10 Gbps + external network
• 7th on the 27th TOP500 list; #1 supercomputer in Asia on the 27th-29th lists; ~2,000 users
• OS: Linux (SuSE 9, 10); NAREGI Grid middleware; NEC SX-8i (for porting)
• Storage: 1.0 Petabyte (Sun "Thumper" units, 48 x 500 GB disks each) + 0.1 Petabyte (NEC iStore); Lustre FS, NFS, WebDAV (over IP); 50 GB/s aggregate I/O bandwidth
• ClearSpeed CSX600 SIMD accelerators: 360 boards, 35 TeraFlops (current)

TSUBAME 1.2 Experimental Evolution (Oct. 2008): the first "Petascale" SC in Japan
• Compute: Sun x4600 (16 Opteron cores per node), 10,480 cores / 655 nodes, 32-128 GBytes/node, 21.4 TeraBytes, 50.4 TeraFlops; ~10,000 CPU cores plus ~300,000 SIMD cores, > 20 million threads; ~900 TFlops SFP / ~170 TFlops DFP in total
• NEW deployment: 170 NVIDIA Tesla S1070 units (~680 Tesla cards); 80 TB/s memory bandwidth (half of the Earth Simulator); high performance in many bandwidth-intensive applications; only a 10% power increase over TSUBAME 1.0
• PCI-e ClearSpeed CSX600 SIMD accelerators: 648 boards, 52.2 TeraFlops SFP/DFP
• GCOE TSUBASA: Harpertown Xeon, 90 nodes / 720 CPUs, 8.2 TeraFlops
• Unified InfiniBand network: Voltaire ISR9288 x8, 10 Gbps x2, ~1310+50 ports, ~13.5 Terabits/s (3 Tbits/s bisection), 10 Gbps + external network
• Storage: 1.5 Petabyte (Sun x4500 x 60) + 0.1 Petabyte (NEC iStore); Lustre FS, NFS, CIFS, WebDAV (over IP); 60 GB/s aggregate I/O bandwidth; NEC SX-8i
• OS: Linux (SuSE 9, 10); NAREGI Grid middleware
• The ~680-card Tesla installation was carried out while TSUBAME was in production service (!)
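Returning to step (3) of the docking pipeline summarized above, here is a minimal cuFFT-based sketch of scoring one rotation by frequency-domain correlation. This is an illustrative outline under stated assumptions, not the Akiyama/Tokyo Tech code: score_rotation, multiply_conj, and the grid layout are hypothetical; fftA is assumed to hold the precomputed forward FFT of protein 1's voxel grid, gridB the voxelized rotated protein 2, and plan an N x N x N complex-to-complex plan created with cufftPlan3d.

```cuda
/* Minimal sketch of docking step (3): score one rotation of protein 2
 * against protein 1 via 3-D FFT correlation.  Illustrative only. */
#include <cufft.h>

__global__ void multiply_conj(const cufftComplex *a, cufftComplex *b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        cufftComplex av = a[i], bv = b[i], r;
        r.x = av.x * bv.x + av.y * bv.y;   /* Re(a * conj(b)) */
        r.y = av.y * bv.x - av.x * bv.y;   /* Im(a * conj(b)) */
        b[i] = r;                          /* correlation in frequency space */
    }
}

void score_rotation(const cufftComplex *fftA, /* device: FFT of protein 1 grid */
                    cufftComplex *gridB,      /* device: rotated protein 2 grid */
                    int N, cufftHandle plan)  /* N x N x N C2C plan            */
{
    int n = N * N * N;
    cufftExecC2C(plan, gridB, gridB, CUFFT_FORWARD);          /* B -> FFT(B)    */
    multiply_conj<<<(n + 255) / 256, 256>>>(fftA, gridB, n);  /* FFT(A)*conj(B) */
    cufftExecC2C(plan, gridB, gridB, CUFFT_INVERSE);          /* back to space  */
    /* gridB now holds unnormalized correlation scores over all translations;
     * a max-reduction over the real parts picks the best placement. */
}
```

In the full pipeline this per-rotation scoring is repeated over the R = 54,000 rotations and followed by clustering of the top-scoring placements, which is where the ~200 tera-ops per protein pair come from.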
TSUBAME 1.2 Node Configuration
• x86 host ("powerful scalar"): 16 Opteron cores at 2.4 GHz, 80 GFlops, 32 GB of memory, 20 GBytes/s memory bandwidth
• Tesla accelerators (two per node), each attached over PCI-e x8 gen1 (2 GB/s): NVIDIA Tesla T10 GPU, 55 nm, ~470 mm^2, 1.4 billion transistors; 240 SP cores in 30 SMs (organized into Thread Processor Clusters/Arrays) at 1.4 GHz; 4 GB GDDR3 at 102 GBytes/s; > 1 TFlops SFP (1.08 TFlops) and 90 GFlops DFP per accelerator
• Interconnect: dual-rail x4 IB SDR, 2 x 1 GB/s

Linpack on TSUBAME: Heterogeneity at Work [Endo et al., IPDPS08, IPDPS10]
• A Linpack implementation that cooperatively uses:
– 10,368 Opteron cores
– 612 Tesla GPUs
– 648 ClearSpeed accelerators
– 640 Xeon cores (another cluster, named "Tsubasa")
• A different strategy from Roadrunner's is required.
• 87.01 TFlops: the world's highest Linpack performance on a GPU-accelerated cluster.

TSUBAME in the TOP500 Ranking [Endo et al., IEEE IPDPS08, IPDPS10, etc.]

                 Jun 06  Nov 06  Jun 07  Nov 07  Jun 08  Nov 08  Jun 09
  Rmax (TFlops)  38.18   47.38   48.88   56.43   67.70   77.48   87.01
  Rank           7       9       14      16      24      29      41

• Hardware added over this period: Opteron (initial), then 360 ClearSpeed boards, then 648 ClearSpeed boards, then Xeon and Tesla; the 56.43 and 87.01 results are those reported in [IPDPS 2008] and [IPDPS 2010], respectively.
• Continuous improvement over six consecutive lists.
• The 2nd-fastest heterogeneous supercomputer in the world (No. 1 is Roadrunner).

Electric Power Consumption
• About 1,090 kW during Linpack
– An estimated value from sampling
– Cooling and network are excluded
• Even for dense problems, GPUs are effective low-power vector-like engines: higher performance with a much lower memory footprint.

OK, so we just put GPUs into commodity servers, slap them together with cheap networks, and we are "supercomputing"?
No, because (strong) scaling becomes THE problem (!)

Multi-GPU in CFD: RIKEN Himeno Benchmark (joint work with NEC)
The RIKEN Himeno CFD benchmark (ported to CUDA) solves a Poisson equation in generalized coordinates,

\nabla \cdot (\nabla p) = \rho,
\qquad
\frac{\partial^2 p}{\partial x^2} + \frac{\partial^2 p}{\partial y^2} + \frac{\partial^2 p}{\partial z^2}
+ \alpha \frac{\partial^2 p}{\partial x \partial y}
+ \beta \frac{\partial^2 p}{\partial x \partial z}
+ \gamma \frac{\partial^2 p}{\partial y \partial z} = \rho,

whose discretized form is the 18-neighbor (19-point) stencil

\frac{p_{i+1,j,k} - 2p_{i,j,k} + p_{i-1,j,k}}{\Delta x^2}
+ \frac{p_{i,j+1,k} - 2p_{i,j,k} + p_{i,j-1,k}}{\Delta y^2}
+ \frac{p_{i,j,k+1} - 2p_{i,j,k} + p_{i,j,k-1}}{\Delta z^2}
+ \alpha \frac{p_{i+1,j+1,k} - p_{i-1,j+1,k} - p_{i+1,j-1,k} + p_{i-1,j-1,k}}{4\,\Delta x \Delta y}
+ \beta \frac{p_{i+1,j,k+1} - p_{i-1,j,k+1} - p_{i+1,j,k-1} + p_{i-1,j,k-1}}{4\,\Delta x \Delta z}
+ \gamma \frac{p_{i,j+1,k+1} - p_{i,j+1,k-1} - p_{i,j-1,k+1} + p_{i,j-1,k-1}}{4\,\Delta y \Delta z}
= \rho_{i,j,k}.

CUDA mapping (Himeno for CUDA):
• One thread block covers a 16 x 16 x 8 compute region, loaded together with its +2 halo into about 16 kB of shared memory.
• Each block has 256 threads; 256 blocks give 65,536 threads in total.
• Each update accesses 18 neighbor points; the boundary region is what must be transferred between GPUs.

Parallelization of the Himeno Benchmark
• Communication of border data is required; the domain (e.g., 1025 x 513 x 513 grid points) is divided along the X axis across processes.
• Although the original Himeno benchmark supports 3-D division, our GPU version currently supports only 1-D division.
• Himeno size L/L' (257 x 257 x 513), CPU vs.
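As a minimal sketch of the 1-D decomposition just described (not the NEC/Tokyo Tech implementation), the following host-side fragment exchanges the boundary YZ planes of the pressure array between X-neighbours after each sweep. The array layout, the host staging buffer, and names such as exchange_halo are illustrative assumptions.

```cuda
/* Minimal sketch: 1-D decomposition of the Himeno pressure array along X.
 * Each rank owns nx_local interior YZ planes plus one halo plane on each
 * side; after every sweep the boundary planes are exchanged with the
 * X-neighbours via a host buffer. */
#include <mpi.h>
#include <cuda_runtime.h>

void exchange_halo(float *p_dev,      /* (nx_local + 2) * ny * nz floats on the GPU */
                   float *buf_host,   /* host buffer, 2 * ny * nz floats            */
                   int nx_local, int ny, int nz,
                   int left, int right,   /* neighbour ranks (or MPI_PROC_NULL)     */
                   MPI_Comm comm)
{
    int plane = ny * nz;                           /* one YZ plane */
    size_t bytes = (size_t)plane * sizeof(float);
    float *send = buf_host, *recv = buf_host + plane;

    /* First interior plane goes to the left neighbour; the matching plane
     * from the right neighbour fills our right halo (plane nx_local + 1). */
    cudaMemcpy(send, p_dev + (size_t)1 * plane, bytes, cudaMemcpyDeviceToHost);
    MPI_Sendrecv(send, plane, MPI_FLOAT, left,  0,
                 recv, plane, MPI_FLOAT, right, 0, comm, MPI_STATUS_IGNORE);
    cudaMemcpy(p_dev + (size_t)(nx_local + 1) * plane, recv, bytes,
               cudaMemcpyHostToDevice);

    /* Last interior plane goes to the right neighbour; the matching plane
     * from the left neighbour fills our left halo (plane 0). */
    cudaMemcpy(send, p_dev + (size_t)nx_local * plane, bytes,
               cudaMemcpyDeviceToHost);
    MPI_Sendrecv(send, plane, MPI_FLOAT, right, 1,
                 recv, plane, MPI_FLOAT, left,  1, comm, MPI_STATUS_IGNORE);
    cudaMemcpy(p_dev, recv, bytes, cudaMemcpyHostToDevice);
}
```

Staging through a host buffer reflects the PCI-Express-attached GPUs of that era; overlapping these boundary transfers with computation on the interior is the usual next step, and is exactly where strong scaling across many GPUs becomes hard.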