(WP8) Prototypes Future Technologies


“Future Technologies” (WP8) Prototypes
Iris Christadler, Dr. Herbert Huber
Leibniz Supercomputing Centre, Germany

Prototype Overview (1/2)

• CEA, “GPU/CAPS”
  – Prototype: 1U Tesla Server T1070 (CUDA, CAPS, DDT), Intel Harpertown nodes
  – Goals: Take advantage of accelerators more easily. Compare HMPP with other approaches to programming accelerators.
• CINECA
  – Prototype: I/O subsystem (SSD, Lustre, pNFS)
  – Goals: Assess the applicability of new file system and storage technologies.
• CINES-LRZ, “LRB/CS”
  – Prototype: Hybrid SGI ICE/UV/Nehalem-EP & Nehalem-EX/ClearSpeed/Larrabee
  – Goals: Evaluate a hybrid system architecture containing thin nodes, fat nodes and compute accelerators with a shared file system.
• CSCS, “UPC/CAF”
  – Prototype: PGAS language compilers (CAF + UPC for Cray XT systems)
  – Goals: Understand the usability and programmability of PGAS languages.
• EPCC, “FPGA”
  – Prototype: Maxwell, an FPGA prototype (VHDL support & consultancy + software licenses (e.g., Mitrion-C))
  – Goals: Assess the potential of high-level languages for using FPGAs in HPC. Compare energy efficiency with other solutions.

SC09, “Future Technologies” (WP8) Prototypes, Outlook Deliverable D8.3.2

Prototype Overview (2/2)

• FZJ, “Cell & FPGA interconnect”
  – Prototype: eQPACE (PowerXCell 8i cluster with special network processor)
  – Goals: Gain deep expertise in communication issues. Extend the application domain of the QPACE system.
• LRZ, “RapidMind”
  – Prototype: RapidMind Multi-Core Development Platform (automatic code generation for x86, GPUs and Cell)
  – Goals: Assess the potential of data stream languages. Compare RapidMind with other approaches for programming accelerators or multi-core systems.
• NCF, “ClearSpeed”
  – Prototype: ClearSpeed CATS 700 units
  – Goals: Evaluate ClearSpeed accelerator hardware for large-scale applications.
• SNIC-KTH
  – Prototype: Air cooled blade system from Supermicro with AMD Istanbul processors & QDR IB
  – Goals: Evaluate and optimize energy efficiency and packing density of commodity hardware.

Experiences with the prototypes will be reported (subject to EC approval) in Deliverable D8.3.2 [http://www.prace-project.eu/documents/public-deliverables-1/]

The teaser
A SELECTION OF RESULTS

Rinf

Euroben results - accelerator languages
[Figure, two panels: “Accelerator Languages (absolute performance)”, in Mflops, and “Accelerator Languages (% peak perf)”, in % of peak. Kernels: mod2am, mod2as, mod2f, plus peak performance. Series: MKL (8 Nehalem cores), CUDA (1 C1060), CellSs (1 PowerXCell8i), Cn (1 CSX700). Data labels include 94%, 81%, 79% and 78% of peak. Note on the plot: mod2f/MKL is single-threaded only.]

Euroben results - GPGPU languages
[Figure: “Performance Comparison (dense matrix-matrix mul.) on Nvidia C1060”, Gflops versus matrix size (m). Series: CUDA, CAPS, CUDA+MPI 4x4, RapidMind, OpenCL, MKL (8 cores Nehalem).]

Euroben results - productivity
Development Time versus Performance (dense matrix-matrix mul.)
[Figure axes: performance in Mflops and development time in days. Series: performance, total time, first version.]
* OpenCL and CUDA+MPI port based on existing CUDA port
** RapidMind developer included time for benchmarking

First I/O Results

A glimpse of what you will find in Deliverable D8.3.2
PROTOTYPES

eQPACE
Extend communication capabilities of eQPACE to make it suitable for a wider range of applications. Reach a top position in the Green500 list (FZJ).
• Hardware: PowerXCell 8i processor nodes with custom 3D-torus interconnect.
• Benchmarks: HPL, Euroben kernels, torus network benchmark, applications & iterative solvers.
• Programming environments: Cell SDK & CellSs

RapidMind
Evaluation of the RapidMind programming model (LRZ).
[Figure: “RapidMind mod2am”, Gflops versus matrix size (m). Series: x86-dp (8 cores Nehalem), cuda-dp (C1060), glsl-sp (FX 5800).]
• Hardware:
  – CPUs (Nehalem EP, AMD Opteron)
  – GPUs (Nvidia Tesla and Quadro FX)
  – Cell (QS22-blade cluster)
• Software: RapidMind lets developers write code that runs on x86 cores as well as on accelerators such as GPUs and Cell.
  – Evaluate ease-of-use & portability
  – Assess RapidMind performance on different architectures
  – Compare RapidMind with other accelerator languages

LRZ-CINES
Evaluation of a hybrid system architecture containing thin nodes, fat nodes and compute accelerators with a shared file system (CINES, LRZ).
• Hardware:
  – SGI ICE (Nehalem EP)
  – SGI UV (Nehalem EX)
  – ClearSpeed CSX700
• Benchmarks:
  – Euroben kernels
  – Synthetic BMs: HPL, Rinf, Intel MPI Benchmark, Apex-MAP
  – Application BMs: Gadget, RAxML, Specfem3D_Globe

Hybrid technology demonstrator
Evaluating GPGPU with CAPS HMPP (CEA).
[Figures: “CAPS hmpp mod2am” and “CUDA mod2am”, each showing Gflops versus matrix size (m).]
• Hardware: Tesla servers connected to Bull servers via PCI-E.
• Software: CAPS HMPP makes it possible to exploit the potential of GPGPUs by simply adding preprocessor directives to legacy Fortran and C codes.

Maxwell FPGA
Evaluate the performance and usability of the HARWEST Compiling Environment (EPCC).
• Hardware: FPGA prototype “Maxwell” (32 FPGAs) from both Alpha Data Ltd and Nallatech Ltd using Virtex-4 FPGAs supplied by Xilinx Corp.
• Benchmarks: 4 Euroben kernels
• Languages:
  – VHDL
  – HCE

PGAS languages
Evaluate ease of use of PGAS programming model (CSCS).
• Hardware: Cray XT5
• Compiler: Cray Compiler Environment (CCE)
• Evaluation of the compiler:
  – Functional correctness
  – Conformance with language standards
  – Usability for existing CAF and UPC benchmarks/applications
• Benchmarks from Rice University, George Washington University and the Lawrence Berkeley National Laboratory

ClearSpeed/PetaPath
Evaluate ClearSpeed-Petapath system (NCF).
• Hardware: 114 ClearSpeed CSX700 cards
• Language: Cn
• Benchmarks:
  – 4 Euroben kernels
  – 4 Applications
    • Astronomy
    • Geophysics
    • Numerical mathematics
    • Medical tomography

XC4-IO
• Compare storage infrastructure access performance across different hardware configurations and file system architectures (CINECA).

SNIC-KTH
Evaluate energy efficiency of high density commodity parts (SNIC-KTH).
[Figure: Preliminary Results (Gromacs)]
• Hardware: AMD Istanbul
• Benchmarks: Euroben, STREAM, IMB, Gromacs, CFD
• Measure power consumption per component
• Adjust fan speed and fan power
• Assess energy management features of AMD Istanbul (control of voltage and frequency of components)

Results will be reported in Deliverable D8.3.2.
RESEARCH ACTIVITIES

Parallel GPU
Evaluation of GPGPU programming languages (CSC).
• Languages
  – CUDA+MPI
  – OpenCL
• Benchmarks:
  – GPU-HMMER
  – Euroben Kernels
• Hardware
  – Tesla
  – AMD Firestream
  – CEA WP8 Prototype

Advanced PGAS Programming
Evaluate usability of PGAS programming model (CSC).

UPC code fragment shown on the slide:

    upc_barrier;
    upc_forall (sc=0; sc<totblks; sc++; sc) {
      // Square matrix multiply C = A * B with aid of DGEMM
      double beta = 0;
      double *clocal = (double *)c[sc].x; // Local C-block for this UPC-thread
      int ib = sc / numblks, jb = sc % numblks, kb, i, j, k;
      for (kb=0; kb<numblks; kb++) {
        int sa = ib * numblks + kb; // The owner of A-block is sa % THREADS
        int sb = kb * numblks + jb; // The owner of B-block is sb % THREADS
        double *al = (sa%THREADS == MYTHREAD) ? // Get the A-block
          (double *)a[sa].x : (upc_memget(alocal, a[sa].x, ns), alocal);
        double *bl = (sb%THREADS == MYTHREAD) ? // Get the B-block
          (double *)b[sb].x : (upc_memget(blocal, b[sb].x, ns), blocal);
        double *cl = clocal; // The local C-block owned by this UPC-thread
        // Call BLAS3-library DGEMM
        dgemm_("N","N", &blksize, &blksize, &blksize, &alpha,
               al, &blksize, bl, &blksize, &beta, cl, &blksize, 1, 1);
        beta = 1;
      } /* for (kb=0; kb<numblks; kb++) */
    } /* upc_forall (sc=0; sc<totblks; sc++; sc) */

• Languages
  – Coarray Fortran (CAF)
  – Unified Parallel C (UPC)
• Benchmarks
  – Euroben mod2am/as/f
• Environments
  – Cray XT5 (cce)
  – SGI Altix (g95,
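
The block structure of the UPC fragment on the “Advanced PGAS Programming” slide can be illustrated with a serial sketch in plain C. This is not the code used in the prototype evaluation. It assumes that each matrix is stored as numblks x numblks contiguous blocks (block index s = ib * numblks + jb) and that each block is column-major. The function names `block_gemm`, `block_owner` and `blocked_matmul` are hypothetical. `block_gemm` stands in for the BLAS DGEMM call, and the remote fetch done by `upc_memget` is not modelled because everything lives in one address space here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for BLAS DGEMM on square n x n column-major blocks:
   c = alpha * a * b + beta * c. */
void block_gemm(int n, double alpha, const double *a, const double *b,
                double beta, double *c)
{
    for (int j = 0; j < n; j++) {
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += a[i + k * n] * b[k + j * n];
            /* With beta == 0 the old contents of c are overwritten, not read. */
            c[i + j * n] = (beta == 0.0) ? alpha * sum
                                         : alpha * sum + beta * c[i + j * n];
        }
    }
}

/* Cyclic ownership rule from the slide's comments: block s belongs to
   thread s % nthreads. */
int block_owner(int s, int nthreads)
{
    assert(s >= 0 && nthreads > 0);
    return s % nthreads;
}

/* Serial version of the slide's loop nest: for every C-block sc, accumulate
   the products of the matching A- and B-blocks. In UPC, the outer loop is a
   upc_forall, so iteration sc would run on thread block_owner(sc, THREADS). */
void blocked_matmul(int numblks, int blksize,
                    const double *a, const double *b, double *c)
{
    assert(numblks > 0 && blksize > 0);
    const size_t ns = (size_t)blksize * (size_t)blksize; /* doubles per block */
    const int totblks = numblks * numblks;

    for (int sc = 0; sc < totblks; sc++) {
        int ib = sc / numblks, jb = sc % numblks;
        double beta = 0.0;                 /* first product initialises C */
        for (int kb = 0; kb < numblks; kb++) {
            int sa = ib * numblks + kb;    /* A-block in block row ib */
            int sb = kb * numblks + jb;    /* B-block in block column jb */
            block_gemm(blksize, 1.0, a + (size_t)sa * ns,
                       b + (size_t)sb * ns, beta, c + (size_t)sc * ns);
            beta = 1.0;                    /* later products accumulate */
        }
    }
}
```

Switching beta from 0 to 1 after the first block product is the same device as in the slide: the first DGEMM call overwrites the C-block, so it needs no separate zeroing pass, and the remaining calls add to it.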