ClearSpeed Technical Training


ENVISION. ACCELERATE. ARRIVE.
ClearSpeed Technical Training, December 2007
Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com

Presenters
• Ronald Langhi, Technical Marketing Manager ([email protected])
• Brian Sumner, Senior Engineer ([email protected])

ClearSpeed Technology: Company Background
• Founded in 2001
  – Focused on alleviating the power, heat, and density challenges of HPC systems
  – 103 patents granted and pending (as of September 2007)
  – Offices in San Jose, California and Bristol, UK

Agenda
• Accelerators
• ClearSpeed and HPC
• Hardware overview
• Installing hardware and software
• Thinking about performance
• Software Development Kit
• Application examples
• Help and support

What is an accelerator?
• A device to improve performance
  – Relieving the main CPU of workload
  – Or augmenting the CPU's capability
• An accelerator card can increase performance
  – On specific tasks
  – Without aggravating facility limits on clusters (power, size, cooling)
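A toy offload model makes the trade-off above concrete: shipping data to the card costs time, so acceleration pays off when compute dominates transfer. A minimal sketch; every number below (host GFLOPS, bus bandwidth, latency) is an illustrative assumption, not a measurement:

```python
def offload_time(flops, accel_gflops, bytes_moved, bus_gb_s, latency_s=1e-5):
    """Time to move the data across the bus and run the kernel on the card."""
    return latency_s + bytes_moved / (bus_gb_s * 1e9) + flops / (accel_gflops * 1e9)

n = 4096
flops = 2.0 * n**3              # operation count of an n x n matrix multiply
data = 3 * n * n * 8            # three double-precision matrices
t_host = flops / 8e9            # host assumed to sustain 8 GFLOPS
t_card = offload_time(flops, 66.0, data, 1.0)   # accelerator assumed at 66 GFLOPS
print(f"host {t_host:.1f}s, offloaded {t_card:.1f}s, speedup {t_host/t_card:.1f}x")
# → host 17.2s, offloaded 2.5s, speedup 6.9x
```

Because compute grows as n³ while data moved grows only as n², the transfer term shrinks relative to the kernel as problems get larger, which is why modest host-to-card bandwidth need not be a barrier for the right workloads.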
All accelerators are good… for their intended purpose
• Cell and GPUs
  – Good for video gaming tasks
  – 32-bit FLOPS, not IEEE
  – Unconventional programming model
  – Small local memory
  – High power consumption (> 200 W)
• FPGAs
  – Good for integer, bit-level ops
  – Programming looks like circuit design
  – Low power per chip, but 20x more power than custom VLSI
  – Not for 64-bit FLOPS
• ClearSpeed
  – Good for HPC applications
  – IEEE 64-bit and 32-bit FLOPS
  – Custom VLSI, true coprocessor
  – At least 1 GB local memory
  – Very low power consumption (25 W)
  – Familiar programming model

The case for accelerators
• Accelerators designed for HPC applications can improve performance, as well as performance per watt, per cabinet, and per dollar
• Accelerators enable:
  – Larger problems for a given compute time, or
  – Higher accuracy for a given compute time, or
  – The same problem in a shorter time
• Host-to-card latency and bandwidth are not major barriers to the successful use of properly designed accelerators

What can be accelerated?

Good application targets for acceleration
• The application needs to be both computationally intensive and contain a high degree of data parallelism.
• Computationally intensive:
  – The software depends on executing large numbers of arithmetic calculations
  – Usually 64-bit floating-point operations, measured in FLOPS (floating-point operations per second)
  – Should also have a high ratio of FLOPS to data movement (bandwidth)
  – Computationally intensive applications may run for many hours or more, even on large clusters.
• Data parallelism:
  – The software performs the same sequence of operations again and again, but on a different item of data each time
• Example computationally intensive, data-parallel problems include:
  – Large matrix arithmetic (linear algebra)
  – Molecular simulations
  – Monte Carlo options pricing in financial applications
  – And many, many more…

Example data-parallel problems that can be accelerated
• Ab initio computational chemistry
• Structural analysis
• Electromagnetic modeling
• Global illumination graphics
• Radar cross-section

HPC Requirements
• Accelerator boards increase compute performance on highly specific tasks, without aggravating facility limits on clusters (power, size)
• Need to consider:
  – Type of application
  – Software
  – Data type and precision
  – Compatibility with host (logical and physical)
  – Memory size (local to accelerator)
  – Latency and bandwidth to host

An HPC-specific accelerator
• CSX600 coprocessor for math acceleration
  – Assists a serial CPU running compute-intensive math libraries
  – Available on add-in boards, e.g. PCI-X, PCIe
  – Potentially integrated on the motherboard
  – Can also be used for embedded applications
• Significantly accelerates certain libraries and applications
  – Target libraries: Level 3 BLAS, LAPACK, ACML, Intel® MKL
  – Mathematical modeling tools: Mathematica®, MATLAB®, etc.
  – In-house code: using the SDK to port compute-intensive kernels
• ClearSpeed Advance™ board
  – Dual CSX600 coprocessors
  – Sustains 67 GFLOPS for 64-bit matrix multiply (DGEMM) calls
  – PCI-X, PCI Express x8
  – Low power; typically 25–35 watts
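The Monte Carlo options pricing listed earlier is a textbook data-parallel workload: every path executes the identical instruction sequence on different random data. A minimal pure-Python sketch of a European call pricer; all parameters are illustrative:

```python
import math, random

def mc_european_call(s0, k, r, sigma, t, n_paths, seed=1):
    """Price a European call by simulating terminal prices under
    geometric Brownian motion. Every path runs the same instruction
    sequence on different random data: the data-parallel pattern."""
    rng = random.Random(seed)
    drift = (r - 0.5 * sigma**2) * t
    vol = sigma * math.sqrt(t)
    payoff_sum = 0.0
    for _ in range(n_paths):
        z = rng.gauss(0.0, 1.0)
        s_t = s0 * math.exp(drift + vol * z)
        payoff_sum += max(s_t - k, 0.0)
    return math.exp(-r * t) * payoff_sum / n_paths

price = mc_european_call(100.0, 100.0, 0.05, 0.2, 1.0, 200_000)
print(f"{price:.2f}")   # close to the Black-Scholes value of ~10.45
```

On a SIMD machine such as the CSX600, each PE would evaluate a batch of paths in lockstep; the serial loop here is exactly the part that parallelizes.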
Plug-and-play acceleration
• ClearSpeed host-side library, CSXL
  – Provides some of the most commonly used and important Level 3 BLAS and LAPACK functions
  – Exploits standard shared/dynamic library mechanisms to intercept calls to Level 3 BLAS and LAPACK
  – Executes calls heterogeneously across both the multi-core host and the ClearSpeed accelerators simultaneously for maximum performance
  – Compatible with ACML from AMD and MKL from Intel
• The user and application do not need to be aware of ClearSpeed
  – Except that the application suddenly runs faster

Programming considerations
• Is my main data type integer or floating-point?
• Is the data parallel in nature?
• What precision do I need?
• How much data needs to be local to the accelerated task?
• Does existing accelerator software meet my needs, or do I have to write my own?
• If I have to write my own code, will the existing tools meet my needs? For example: compiler, debugger, and simulator.

Hardware Overview

CSX600: a chip designed for HPC
• Array of 96 Processor Elements (PEs); 64-bit and 32-bit floating point
• Single Instruction, Multiple Data (SIMD)
• 210 MHz clock, key to low power
• 47% logic, 53% memory
  – About 50% of the logic is FPU
  – Hence around one quarter of the chip is floating-point hardware
• Embedded SRAM
• Interface to DDR2 DRAM
• Inter-processor I/O ports
• ~1 TB/s internal bandwidth
• 128 million transistors
• Approximately 10 watts
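The CSXL interception described above relies on the dynamic linker resolving BLAS symbols against whichever library is loaded first (e.g. via LD_PRELOAD or the link order), so the application binary never changes. A rough pure-Python analogy of that substitution; both implementation names here are hypothetical stand-ins, not CSXL entry points:

```python
def dgemm_reference(A, B):
    """Plain triple-loop matrix multiply, standing in for the host BLAS."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            aik = A[i][k]
            for j in range(n):
                C[i][j] += aik * B[k][j]
    return C

def dgemm_accelerated(A, B):
    """Stand-in for the intercepted version: same interface, same answer,
    different implementation behind it (here, transposed for locality)."""
    Bt = list(map(list, zip(*B)))
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

# The "application" calls one symbol; which library satisfies it is a
# load-time decision, not a source-code change.
dgemm = dgemm_accelerated     # analogous to preloading the accelerated library

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
assert dgemm(A, B) == dgemm_reference(A, B)  # → [[19.0, 22.0], [43.0, 50.0]]
```

The caller binds to a single name (`dgemm`); swapping which implementation satisfies that name is the whole trick, and it is why applications "suddenly run faster" without recompilation.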
CSX600 processor core
(Block diagram: system and peripheral networks; a mono controller with instruction and data caches plus control and debug; a poly controller driving PE 0 … PE 95; programmable I/O to DRAM)
• Multi-Threaded Array Processing
  – Programmed in familiar languages
  – Hardware multi-threading
  – Asynchronous, overlapped I/O
  – Run-time extensible instruction set
• Array of 96 Processor Elements (PEs)
  – Each has multiple execution units
  – Including double-precision floating-point and integer units

CSX600 processing element (PE)
(Block diagram: each PE n connects to its neighbours PE n–1 and PE n+1; 64-bit datapaths link the execution units to the register file, PE SRAM, and PIO collection and distribution)
• Multiple execution units
  – 4-stage floating-point adder
  – 4-stage floating-point multiplier (32/64-bit, IEEE 754)
  – Divide/square-root unit
  – Fixed-point MAC, 16×16 → 32+64
  – Integer ALU with shifter
  – Load/store
• 5-port register file (3 reads, 2 writes), 128 bytes
• Closely coupled 6 KB SRAM for data; high bandwidth per PE
• Per-PE address generators (serve as hardware gather-scatter)
• Programmed I/O (PIO) for DMA, collection, and distribution
• Fast inter-PE communication path

Advance accelerator memory hierarchy (per board: two CSX600s, 192 PEs)
• Tier 3: host DRAM, 1–32 GB typical; ~1 GB/s aggregate to the board
• Tier 2: two banks of CSX DRAM, 0.5 GB each (1.0 GB total); 5.4 GB/s per chip, ~0.03 GB/s per PE
• Tier 1: poly memory (PE-local SRAM), 6 KB per PE; 192 PEs × 6 KB = 1.1 MB; 161 GB/s, plus the inter-PE swazzle path
• Tier 0: register memory, 128 bytes per PE; 192 PEs × 128 B = 24 KB; 322–725 GB/s
• Totals: 80 GFLOPS (0.42 GFLOPS of arithmetic per PE) and 1.1 TB/s… but only 25 watts
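The 80 GFLOPS and 0.42 GFLOPS-per-PE totals above follow directly from the clock and PE counts, assuming each PE retires one double-precision add and one multiply per cycle (an assumption consistent with the quoted figures, not a datasheet statement):

```python
chips_per_board = 2
pes_per_chip = 96
clock_hz = 210e6
flops_per_pe_per_cycle = 2   # assumption: one FP add + one FP multiply

peak_gflops = chips_per_board * pes_per_chip * clock_hz * flops_per_pe_per_cycle / 1e9
per_pe_gflops = peak_gflops / (chips_per_board * pes_per_chip)
print(f"board peak: {peak_gflops:.1f} GFLOPS, per PE: {per_pe_gflops:.2f} GFLOPS")
# → board peak: 80.6 GFLOPS, per PE: 0.42 GFLOPS
```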
Acceleration by plug-in card
• Two boards: Advance X620 (PCI-X) and Advance e620 (PCIe x8)
• Dual ClearSpeed CSX600 coprocessors
• R∞ > 66 GFLOPS for 64-bit matrix multiply (DGEMM) calls
  – Hardware also supports 32-bit floating-point and integer calculations
• X620: 133 MHz PCI-X, two-thirds length (8 in / 203 mm), full-height form factor
• e620: PCIe x8, half-length, full-height form factor
• 1 GB of memory on the board
• Drivers today for Linux (Red Hat and SLES) and Windows (XP, Server 2003)
• Low power: 25 watts typical
• Multiple boards can be used together for greater performance
• Both boards can sustain over 66 GFLOPS on 64-bit HPC kernels

Host-to-board DMA performance
• The board includes a host DMA controller which can act as a bus master.
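Sustained figures like the 66+ GFLOPS above come from the standard DGEMM operation count of 2·m·n·k flops. A small helper showing the bookkeeping; the 2.05 s timing is an illustrative number chosen to land near the quoted rate, not a measurement:

```python
def dgemm_gflops(m, n, k, seconds):
    """An m x k by k x n DGEMM performs 2*m*n*k flops (multiply + add)."""
    return 2.0 * m * n * k / seconds / 1e9

sustained = dgemm_gflops(4096, 4096, 4096, 2.05)   # illustrative timing
peak = 80.64                                       # board peak: 2 chips x 96 PEs x 210 MHz x 2
print(f"sustained: {sustained:.1f} GFLOPS, efficiency: {sustained / peak:.0%}")
# → sustained: 67.0 GFLOPS, efficiency: 83%
```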
Recommended publications
  • Linpack Evaluation on a Supercomputer with Heterogeneous Accelerators
  • Introduction, Hardware Acceleration Philosophy, Popular Accelerators in General
  • The Return of Acceleration Technology
  • Highlights of the 53rd TOP500 List
  • GPU Acceleration: A Fad or the Yellow Brick Road onto Exascale?
  • The First “Exascale” Supercomputer Fugaku & Beyond
  • TSUBAME2.0: A Tiny and Greenest Petaflops Supercomputer
  • “Future Technologies” (WP8) Prototypes
  • The TOP500 Project Looking Back Over 16 Years of Supercomputing Experience with Special Emphasis on the Industrial Segment
  • ClearSpeed Paper: A Study of Accelerator Options for Maximizing Cluster Performance
  • GPU Scalability
  • A SIMD Approach to Large-Scale Real-Time System Air Traffic Control Using Associative Processor and Consequences for Parallel Computing