ENVISION. ACCELERATE. ARRIVE.

ClearSpeed Technical Training

December 2007

Overview

1 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Presenters

Ronald Langhi Technical Marketing Manager [email protected]

Brian Sumner Senior Engineer [email protected]

2 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com ClearSpeed Technology: Company Background

• Founded in 2001

– Focused on alleviating the power, heat, and density challenges of HPC systems

– 103 patents granted and pending (as of September 2007)

– Offices in San Jose, California and Bristol, UK

3 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Agenda

Accelerators
ClearSpeed and HPC
Hardware overview
Installing hardware and software
Thinking about performance
Software Development Kit
Application examples
Help and support

4 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com ENVISION. ACCELERATE. ARRIVE.

What is an accelerator?

5 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com What is an accelerator?

• A device to improve performance – Relieve main CPU of workload – Or to augment CPU’s capability

• An accelerator card can increase performance – On specific tasks – Without aggravating facility limits on clusters (power, size, cooling)

6 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com All accelerators are good… for their intended purpose

Cell and GPUs
•Good for video gaming tasks
•32-bit FLOPS, not IEEE
•Unconventional programming model
•Small local memory
•High power consumption (> 200 W)

FPGAs
•Good for integer, bit-level ops
•Programming looks like circuit design
•Low power per chip, but 20x more power than custom VLSI
•Not for 64-bit FLOPS

ClearSpeed
•Good for HPC applications
•IEEE 64-bit and 32-bit FLOPS
•Custom VLSI, true 64-bit FLOPS
•At least 1 GB local memory
•Very low power consumption (25 W)
•Familiar programming model

7 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com The case for accelerators

• Accelerators designed for HPC applications can improve performance as well as performance per (watt, cabinet, dollar)
• Accelerators enable:
– Larger problems for given compute time, or
– Higher accuracy for given compute time, or
– Same problem in shorter time
• Host to card latency and bandwidth are not major barriers to successful use of properly-designed accelerators.

8 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com ENVISION. ACCELERATE. ARRIVE.

What can be accelerated?

9 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Good application targets for acceleration

• Application needs to be both computationally intensive and contain a high degree of data parallelism.
• Computationally intensive:
– Software depends on executing large numbers of arithmetic calculations
– Usually 64-bit FLoating point Operations per Second (FLOPS)
– Should also have a high ratio of FLOPS to data movement (bandwidth)
– Computationally intensive applications may run for many hours or more, even on large clusters.
• Data parallelism:
– Software performs the same sequence of operations again and again, but on a different item of data each time (a minimal example follows this list)
• Example computationally intensive, data-parallel problems include:
– Large matrix arithmetic (linear algebra)
– Molecular simulations
– Monte Carlo options pricing in financial applications
– And many, many more…
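A minimal C illustration of the data-parallel shape described above (the function name, array names, and size are illustrative only): every iteration applies the same arithmetic to a different element, with no dependences between iterations, so the work can be spread across the 96 PEs.

/* Same operation applied independently to each element:
   a natural shape for SIMD acceleration. */
void scale_and_add(double *y, const double *x, double alpha, int n)
{
    int i;
    for (i = 0; i < n; i++) {
        y[i] = alpha * x[i] + y[i];   /* identical work per element */
    }
}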

10 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Example data parallel problems that can be accelerated

Ab initio Computational Chemistry

Structural Analysis

Electromagnetic Modeling

Global Illumination Graphics

Radar Cross-Section

11 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com HPC Requirements

• Accelerator boards increase compute performance on highly specific tasks, without aggravating facility limits on clusters (power, size)

• Need to consider – Type of application – Software – Data type and precision – Compatibility with host (logical and physical) – Memory size (local to accelerator) – Latency and bandwidth to host

12 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com An HPC-specific accelerator

• CSX600 coprocessor for math acceleration – Assists serial CPU running compute-intensive math libraries – Available on add-in boards, e.g. PCI-X, PCIe – Potentially integrated on the motherboard – Can also be used for embedded applications • Significantly accelerates certain libraries and applications – Target libraries: Level 3 BLAS, LAPACK, ACML, Intel® MKL – Mathematical modeling tools: Mathematica®, MATLAB®, etc. – In-house code: Using the SDK to port compute-intensive kernels • ClearSpeed Advance™ board

– Dual CSX600 – Sustains 67 GFLOPS for 64-bit matrix multiply (DGEMM) calls – PCI-X, PCI Express x8 – Low power; typically 25-35 Watts

13 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Plug-and-play Acceleration

• ClearSpeed host-side library CSXL – Provides some of the most commonly used and important Level 3 BLAS and LAPACK functions – Exploits standard shared/dynamic library mechanisms to intercept calls to L3 BLAS and LAPACK – Executes calls heterogeneously across both the multi- core host and the ClearSpeed accelerators simultaneously for maximum performance – Compatible with ACML from AMD and MKL from Intel • User & application do not need to be aware of ClearSpeed – Except that the application suddenly runs faster

14 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Programming considerations

• Is my main data type integer or floating-point? • Is the data parallel in nature? • What precision do I need? • How much data needs to be local to the accelerated task? • Does existing accelerator software meet my needs, or do I have to write my own? • If I have to write my own code will the existing tools meet my needs—for example: compiler, debugger, and simulator?

15 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com ENVISION. ACCELERATE. ARRIVE.

Hardware Overview

16 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com CSX600: A chip designed for HPC

• Array of 96 Processor Elements; 64-bit and 32-bit floating point
• Single-Instruction, Multiple-Data (SIMD)
• 210 MHz -- key to low power
• 47% logic, 53% memory
– About 50% of the logic is FPU
– Hence around one quarter of the chip is floating-point hardware
• Embedded SRAM
• Interface to DDR2 DRAM
• Inter-processor I/O ports
• ~1 TB/sec internal bandwidth
• 128 million transistors
• Approximately 10 Watts

17 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com CSX600 processor core

(Block diagram of the CSX600 core: mono controller with instruction and data caches, control and debug, the system and peripheral networks, the poly controller, PEs 0 to 95, and programmable I/O to DRAM.)

• Multi-Threaded Array Processing
– Programmed in familiar languages
– Hardware multi-threading
– Asynchronous, overlapped I/O
– Run-time extensible instruction set
• Array of 96 Processor Elements (PEs)
– Each has multiple execution units
– Including double-precision floating point and integer units

18 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com CSX600 processing element (PE)

(Block diagram: PE n with its neighbours PE n-1 and PE n+1, showing the ALU, MAC, FP multiplier, FP adder, divide/square root unit, 128-byte register file, 6 KB PE SRAM, and the 32-bit programmed I/O paths into the 128-bit PIO collection and distribution network.)

• Multiple execution units
• 4-stage floating-point adder, 32/64-bit IEEE 754
• 4-stage floating-point multiplier
• Divide/square root unit
• Fixed-point MAC 16x16 → 32+64
• Integer ALU with shifter
• Load/store
• 5-port register file (3 reads, 2 writes), 128 Bytes
• Closely coupled 6 KB SRAM for data
• High bandwidth per PE: programmed I/O (PIO) for collection and distribution
• Per-PE address generators (serve as hardware gather-scatter)
• Fast inter-PE communication path

19 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Advance accelerator memory hierarchy

• Tier 3 – Host DRAM: 1-32 GBytes typical; ~1 GB/s aggregate between host and board
• Tier 2 – CSX DRAM: two banks of 0.5 GBytes (1.0 GBytes total); 5.4 GB/s (~0.03 GB/s per PE)
• Tier 1 – Poly memory: 6 KBytes per PE; 192 PEs * 6 KB = 1.1 MB; 161 GB/s aggregate
• Tier 0 – Register memory: 128 Bytes per PE; 192 PEs * 128 Bytes = 24 KB; 322 GB/s / 725 GB/s aggregate register bandwidth, plus the inter-PE swazzle path
• Arithmetic: 0.42 GFLOPS per PE
• Total: 80 GFLOPS, 1.1 TB/s …but only 25 Watts

20 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Acceleration by plug-in card

Advance X620 (PCI-X) • Dual ClearSpeed CSX600 coprocessors

• R∞ > 66 GFLOPS for 64-bit matrix multiply (DGEMM) calls
– Hardware also supports 32-bit floating point and integer calculations
• Advance X620: 133 MHz PCI-X; two-thirds length (8″, 203 mm), full-height form factor
• Advance e620: PCIe x8; half-length, full-height form factor
• 1 GB of memory on the board
• Drivers today for Linux (Red Hat and SLES) and Windows (XP, Server 2003)
• Low power: 25 watts typical
• Multiple boards can be used together for greater performance

Both boards can sustain over 66 GFLOPS on 64-bit HPC kernels

21 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Host to board DMA performance

• The board includes a host DMA controller which can act as a bus master. • All DMA transfers are at least 8-byte aligned. • The host DMA engine will attempt to use the full bandwidth of the bus.

Type of slot        Peak bandwidth    Expected DMA speed
PCI Express x8      2,000 MB/s        Up to 1,300 MB/s
PCI-X 133 MHz       1,066 MB/s        Up to 750 MB/s

• Note: measured bandwidth is highly system-dependent – Variations of up to 50% have been observed – Depends on system chipset, operating system, bus contention… 22 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com ENVISION. ACCELERATE. ARRIVE.

Installing Hardware and Software

23 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Configuration support

• Advance supports the following host operating systems (IA32 / x86-64):
– SuSE Linux Enterprise Server 9
– Red Hat Enterprise Linux 4 (IA32 and x86-64)
– Windows XP SP2
– Windows Server 2003 (preview)

• Supported host BLAS libraries:
– AMD ACML
– Intel MKL
– Goto
– ATLAS

• Supported compilers:
– For Linux: gcc, icc, fort, pgf
– For Windows XP, 2003: Visual C++ 2005

For the latest support information go to http://support.clearspeed.com

24 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Base software

• All ClearSpeed software on Linux is installed using the rpm command.

• The software consists of three parts: – Runtime and driver software – Diagnostics – ClearSpeed standard libraries, CSXL & CSFFT

• You can download the latest versions from the ClearSpeed support website: • https://support.clearspeed.com/downloads

25 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Installing base software on Linux

1. Log in to the Linux machine as root and change to the directory containing the drivers package.
2. Install the runtime software, using the command:
rpm -i csx600_m512_le-runtime-<version>.<release>.rpm
3. Install the kernel module. For Linux 2.6, simply install the open source CSX driver using:
/opt/clearspeed/csx600_m512_le/drivers/csx/install-csx
4. Install the board diagnostics:
rpm -i csx600_m512_le-board_diagnostics-<version>.<release>.rpm
5. Install the CSXL library package:
rpm -i csx600_m512_le-csxl_<version>.<release>.rpm

Note: For Windows a Jungo driver will need to be installed and configured – see installation manual for more details.

26 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Confirming successful installation

ClearSpeed distributes diagnostic tests to check that the board and drivers are successfully installed:
1. Open a shell window and go to an appropriate directory:
cd /tmp
2. Set up the ClearSpeed environment variables by typing:
source /opt/clearspeed/csx600_m512_le/bin/bashrc
3. Run the diagnostic program by typing the command:
/opt/clearspeed/csx600_m512_le/bin/run_tests.pl

– Some tests take several minutes to complete. – Each test will write Pass or Fail to standard output. – A log file test.log will be written in the current directory.

27 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com csreset

– The csreset command reinitializes an Advance board and its processors. – It must be run after start-up or reboot of the system or simulator. – It is also a good idea to run csreset at the start of a batch job that calls the Advance board. – The csreset command can take argument flags to provide a finer level of control. These include:

-A   Specifies that all boards should be reset.
-v   Verbose output. This shows the details about each board.
-h   Help. This shows the full list of options.

28 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com If you have problems with software installation

– Make sure you are logged in as super-user. • As root for Linux. • As administrator for Windows. – If the configure or make install steps fail, check that you have the appropriate header files. • Check the preconfigured header files and, if necessary, obtain the appropriate configured header file. – If the system cannot access the board but the driver is installed, make sure the board is seated well. • Try removing the board and reinstalling.

29 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com ENVISION. ACCELERATE. ARRIVE.

Targeting ClearSpeed Advance: Exploiting Data Parallelism

30 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Alternative approaches

Three main approaches to acceleration:

1. Use an application which is already ported 2. Plug and play 3. Custom port using the SDK

31 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Using an application which is already ported

• Acceleration: simply insert ClearSpeed • Latest list of ported applications: – http://www.clearspeed.com/products/applicationsupport/ • Includes: – Amber – Mathematica – MATLAB – Star-P

32 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Plug and play libraries: CSXL

• Underlying shared libraries are augmented with ClearSpeed CSXL accelerated functions • Includes key functions from: – LAPACK – Level 3 BLAS • As an example, BLAS is used by: – AMD ACML – Intel MKL – Full list on: http://www.clearspeed.com/products/compatibility/ • Application is transparently accelerated – No modifications to application

33 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Acceleration using CSXL and standard libraries

(Diagram: the application's calls pass through the CSXL intercept layer, which automatically selects the optimum path between the host library (LAPACK, BLAS, etc.) and the CSXL library (LAPACK, BLAS, etc.).)

34 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Considerations for custom port of application

• Is the task large enough to consider acceleration? – Takes time to ship data to the accelerator • Accelerator can work in parallel with host – Overlap computation • Performance considerations – Look for areas of data parallelism – Overlap compute with data I/O – Make full use of ClearSpeed I/O paths • Analysis starts with model based on memory tiers and can be verified using performance profiling tools

35 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Is this trip necessary? Considering I/O

(Diagram: node memory and accelerator memory connected with bandwidth B; plot of task speed versus problem size showing the accelerator curve, the host curve, and the break-even point.)

• Time to move N data to/from another node or an accelerator is ~latency + N/B seconds.
• Because local memory bandwidth is usually higher than B, acceleration might be lost in the communication time.
• Estimate the break-even point for the task (note: offloading is different from accelerating, where the host continues working). A small worked estimate follows.
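The sketch below is a hypothetical model, not ClearSpeed-supplied code: it assumes a task of a given number of operations on a given amount of transferred data, a host sustaining host_gflops, and an accelerator sustaining acc_gflops behind a link with the stated latency and bandwidth. Offload pays only when transfer time plus accelerator time beats the host time.

/* Hypothetical break-even estimate for offloading a task. */
#include <stdio.h>

int main(void)
{
    double flops       = 2.0e9;    /* arithmetic in the task (illustrative)   */
    double bytes       = 8.0e6;    /* data moved to and from the card         */
    double latency     = 20e-6;    /* seconds per transfer (illustrative)     */
    double bandwidth   = 1.3e9;    /* bytes/s over the host link              */
    double host_gflops = 8.0;      /* sustained host rate                     */
    double acc_gflops  = 50.0;     /* sustained accelerator rate              */

    double t_host = flops / (host_gflops * 1e9);
    double t_acc  = latency + bytes / bandwidth + flops / (acc_gflops * 1e9);

    printf("host %.4f s, offload %.4f s -> %s\n", t_host, t_acc,
           t_acc < t_host ? "offload wins" : "stay on host");
    return 0;
}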

36 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Memory bandwidth dictates performance

(Diagram: node memory connects to the multicore x86 host at 17 GB/s; the host connects to the accelerator over PCI-X or PCIe at 1 to 2 GB/s; accelerator DRAM provides 5.4 GB/s; accelerator local RAM provides 192 GB/s.)

• Applications that can stage into local RAM can go 10x faster than current high-end Intel and AMD hosts
– Applications residing in accelerator DRAM do not make use of the massive local memory bandwidth
• GPUs face a very similar issue

37 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Latency and bandwidth: Simple offload model

(Timeline: the host sends data across the link (latency plus bandwidth-limited transfer), waits while the accelerator computes, then receives the results back (bandwidth plus latency).)

• Accelerator must be quite fast for this approach to have benefit • This “mental picture” may stem from early days of Intel 80x87, Motorola 6888x math coprocessors

38 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Latency and bandwidth: Acceleration model

(Timeline: as in the previous slide, but the host keeps computing while the accelerator works on the transferred data, so only the net gain matters.)

• Host continues working – Accelerator needs only be fast enough to make up for time lost to bandwidth + latency • Easiest use model – Host and accelerator share the same task, like DGEMM • More flexible – Host, accelerator each specialize what they do

39 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Accelerator need not wait for all data before starting

(Timeline: transfers are pipelined; the accelerator starts computing as soon as the first data arrives, and the host keeps computing throughout.)

• Host can work while data is moved – PCI transfers might burden a single x86 core by up to 40% – Other cores on host continue productive work at full speed • Accelerator can work while data is moved – Can be slower than the host, and still add performance! • In practice, latency is microseconds; accelerator task takes seconds – Latency gaps above would be microscopic if drawn to scale 40 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Performance considerations

• Look for data parallelism – Fine-grained – vector operations – Medium-grained – unrolled independent loops – Coarse-grained – multiple simultaneous data channels/sets • Performance analysis for accelerator cards – Like analysis for message-passing parallelism but with more levels of memory and communication • Application porting success depends heavily on attention to memory bandwidths – (Surprisingly) not so much on the bandwidth between host and accelerator card

41 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com PCI Bus

• ClearSpeed boards utilize either PCI-X or PCIe busses – PCI-X 133 MHz: 1 GB/s Peak – PCIe x8: 1.6 GB/s Peak

• Available memory on board – 1 GB of 200 MHz DDR2 SDRAM shared by 2 CSX600 processors

• Must consider both the transfer rate AND the available memory – If application requires more memory, then more communication to the board is necessary

• Even with an infinitely fast board:
– Time ≥ Total data size transferred / Bus bandwidth

42 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com PCI Bus

• Driver performance is very machine-specific and depends on transfer size, direction, etc. – Transfer Size vs. Transfer Rate

– See Runtime User’s Guide for current driver performance

43 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com On-board Memory

• 2 level memory hierarchy – 1 GB “mono” shared memory – 6 kB “poly” memory per processing element (PE) • 6 kB/PE * 96 PE = 576 kB per CSX600

• Peak bandwidth between levels – 2.7 GB/s x 2 chips = 5.4 GB/s

• Must consider both the transfer rate AND the available memory
– If the application requires more memory, then more communication to the board is necessary
• Even with infinitely fast PEs:
– Time ≥ Total data size transferred / Transfer bandwidth

• Secondary considerations – Burst size: 64 Bytes/PE (i.e., 8 doubles) – Transfers can be smaller, but at reduced efficiency

44 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com SIMD Computing

• What is SIMD? – Single Instruction, Multiple Data • Each PE sees the same instruction stream • Each PE issues “load”, “multiply”, etc., simultaneously • But acts on different data per PE – PARALLEL COMPUTATION

• ClearSpeed SIMD is enhanced by: – Local memory for each PE • data management is easier within “poly” memory • does not require adjacent access for all 96 elements involved in the computation from shared memory pool – PEs can be enabled/disabled • not required to use all PEs always • useful for handling “boundaries”

45 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com SIMD Array

• 96 PEs per CSX600 – 210 MHz – double precision multiply-accumulate per cycle – 4 cycle pipeline depth for multiply and accumulation • For top performance, use operations on 4 element vectors on each PE • Nearest neighbor communication – “swazzle” path topology is a line or ring – Bandwidth: 8 Bytes per cycle between register files • 8*96*210 = 161 GB/s • Useful for fine grained communication

46 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Good Example Kernels

• Dense Linear Algebra – Matrix-Matrix products (DGEMM) • Low memory bandwidth required = high data re-use • Inner kernel: Matrix-multivector product – 96x96 matrix, x4 vectors » 96x96 matrix due to 96 PEs » 4 vectors due to multiply/accumulate pipeline depth • Monte Carlo (computational finance) – “Embarrassingly parallel” task distribution – Very little data requirement • Molecular Dynamics (Amber, BUDE) – Large numbers of identical tasks can be found – Requires small working data sets
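A rough arithmetic-intensity check for the DGEMM inner kernel described above, assuming the 96x96 block stays resident in poly memory while groups of 4 vectors stream through it (figures are illustrative, not measured):

/* Flops-per-byte estimate for a resident 96x96 block
   multiplied by groups of 4 streamed vectors. */
#include <stdio.h>

int main(void)
{
    double flops = 2.0 * 96 * 96 * 4;        /* multiply-adds per 4-vector group */
    double bytes = (4 * 96 + 4 * 96) * 8.0;  /* 4 vectors in, 4 vectors out      */
    printf("%.0f flops / %.0f bytes = %.1f flops per byte\n",
           flops, bytes, flops / bytes);     /* high re-use, low bandwidth need  */
    return 0;
}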

47 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Possible Kernel

• Partial Differential Equations – Some are memory bandwidth limited, so not a good candidate for ClearSpeed acceleration • small stencil implies little computation per grid point • wide, sparse stencil implies large active data set

• But, some PDE simulations are good candidates – require a small grid, so can run entirely in PE memory (computational finance) – have large, dense stencils • large amounts of computation per grid point • sufficiently small active data set – implicit time stepping • large systems of equations solved via direct methods • direct solvers utilize dense linear algebra kernels – (i.e., DGEMM)

48 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Keys to Success

• Parallelism is essential

• Proper management of the “poly” memory is also critical – Application must accept memory bandwidth limits • PCIe or PCI-X • On-board memory hierarchy – SDK enables asynchronous data transfers • permits efficient “double buffering” to manage data streams, accommodating the size limit – Application must employ a small working data set • less than 576 kB, distributed across 96 PEs • also aware of 1 GB shared memory limit

• While developing ClearSpeed applications, use the ClearSpeed Visual Profiler to discover what is actually happening on the board!

49 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Remember the host processor

• Today’s multi-core hosts are very useful for managing “other tasks” that are not accelerated by ClearSpeed

• Many applications can overlap these tasks with ClearSpeed accelerated tasks

• Profile the host portion of your application as well, using any of a variety of tools
– Use the ClearSpeed Visual Profiler for CSAPI utilization

50 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com General optimization techniques

• Latency hiding – Overlap compute with I/O • Data reuse – On-chip swazzle path • Maximize PE usage – Ensure all PEs are processing, not idle

51 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Overlap data with compute

• Double-buffer • Many levels of data I/O – compute parallelism – PE load/store overlaps PE compute – PE to board memory can also overlap – Board memory to host memory can also overlap • Hence, if task is compute bound: – Data takes “no time” to transfer • If task is I/O bound: – Compute takes “no time” to calculate

52 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Data reuse

• Swazzle path – Left or right 64 bit transfer (8 bytes) – 8 bytes per cycle, so ~161GB/s per CSX processor – Can be complete loop or linear chain • Parallel with other data I/O – Register-register move – On-off chip in parallel • Doesn’t impinge on DRAM access – PE local memory – register in parallel • Doesn’t impinge on local memory access

53 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Maximize PE usage

• Aim for 100% efficiency • PEs use predicated execution – PEs are “disabled” rather than code skipped – Minimize effects – extract common code from conditionals • Mono processor can branch – Skip blocks of code

54 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Detail of I/O widths for performance analysis

(Diagram: PE n with its neighbours, register file, PE SRAM, and the PIO collection and distribution network down to the CSX DRAM, annotated with the per-path bandwidths below.)

Each accelerator board has (aggregate bandwidth for two CSX600 chips):
– 161 GB/s PE register to PE memory bandwidth (4 bytes per cycle per PE)
– 322 GB/s swazzle path bandwidth (8 bytes per cycle per PE)
– 968 GB/s PE register to PE ALU bandwidth (24 bytes per cycle per PE)
– 5.4 GB/s DRAM bandwidth to the 1 GByte CSX DRAM (32 bytes per cycle, via the PIO collection and distribution network)

55 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com ENVISION. ACCELERATE. ARRIVE.

Software Development Kit

56 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com ClearSpeed SDK overview

• Cn compiler – C with extension for SIMD control • Assembler • Linker • Simulator • Debugger • Graphical profiler • Libraries • Documentation • Available for Windows XP / 2003 and Linux (Red Hat Enterprise Linux 4 and SLES 9)

57 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Agenda

1. Introduction to Cn
2. Cn Libraries
3. Debugging Cn
4. CSAPI: Host / Board Communication

58 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com ENVISION. ACCELERATE. ARRIVE.

Introduction to Cn

59 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Software Development

The CSX architecture is simpler to program:

• Single program for serial and parallel operations • Architecture and compiler co-designed • Instruction and data caches • Simple, regular 32-bit instruction set • Large, flexible register file • Fast thread context switching • Built-in debug support • Same development process as traditional architectures: compile, assemble, link • Cn is a simple parallel extension of C

60 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Cn — C with vector extensions for CSX

• New Keywords – mono and poly storage qualifiers • mono is a serial (single) variable • poly is a parallel (vector) variable • Mono variables in 1 GB DRAM • Poly variables in 6 KB SRAM of each PE

61 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Cn differences from C

• New data type multiplicity modifiers:
– mono: denotes a serial variable
• resident in "mono" memory
• mono is the default multiplicity
– poly: denotes a parallel/vector variable
• resident in "poly" memory local to each PE
– applies to pointers, doubly so:
• mono int * poly foo;
– foo is a pointer in poly memory to an int in mono memory
• poly int * mono bar;
– bar is a pointer in mono memory to an int in poly memory
• int * poly * mono good_grief;
– as you would expect…
• Pointer sizes:
– mono int *: 4 bytes (32-bit addressable space, 512 MB)
– poly int *: 2 bytes (16-bit addressable space, 6 kB)

62 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Cn differences from C

• Execution context:
– Alters branch/jump behavior
– In mono context, jumps occur as in a traditional architecture
– In poly context, PEs are enabled/disabled
• if (penum > 32) {…} else {…}
– disables the false PEs on the true branch, then re-enables the false PEs and disables the other PEs for the false branch
– both branches executed
• break, continue
– selected PEs get disabled until the end of scope on all PEs
• return
– selected PEs get disabled until all PEs return, or end of scope
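For example, a poly break does not branch; it simply disables the PEs that take it for the rest of the loop. A small Cn sketch with an illustrative loop bound:

poly int me = get_penum();
poly int work = 0;
mono int i;

for( i = 0; i < 96; i++ ) {
    if( me < i ) break;   // PEs whose condition is true are disabled
                          // for the remainder of the loop
    work += 1;            // executes only on the PEs still enabled
}
// After the loop all PEs are re-enabled; PE n ends with work = n+1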

63 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Porting C to Cn (Example 1)

C code

int i, j;
for( i = 0; i < 96; i++ ) {
    j = 2*i;
}

Similar Cn code

poly int i, j;
i = get_penum();   // i=0 on PE0, i=1 on PE1 etc.
j = 2*i;           // j=0 on PE0, j=2 on PE1 etc.

64 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Porting C to Cn (Example 2)

C code

int i;
for( i = 0; i < n; i++ ) {
    // ... loop body operating on element i (n is the problem size)
}

Similar Cn code

poly int me, i;
mono int npes;
me = get_penum();        // me=0 on PE0, me=1 on PE1 etc.
npes = get_num_pes();    // npes = 96
// i=0,96,192, … on PE0; 1,97,193, … on PE1; etc.
for( i = me; i < n; i += npes ) {
    // ... same loop body, now spread across the PEs
}

while (n) {
    memcpym2p (a, A + 24*get_penum(), 24*sizeof(double));
    A += 24*96;
    for (i = 0; i < 24; i++) {
        b[0] += a[i]*mat[0] + a[i+1]*mat[1];
        b[1] += a[i+1]*mat[0] + a[i]*mat[1];
        b[2] += a[i]*mat[2] - a[i+1]*mat[3];
        b[3] += a[i+1]*mat[2] - a[i]*mat[3];
    }
    n -= 24*96;
}

memcpyp2m (B + 4*get_penum(), b, 4*sizeof(double));
return;
}

66 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com ENVISION. ACCELERATE. ARRIVE.

Cn Libraries

67 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Runtime libraries

• Cn supports standard C runtime, including: – malloc – printf – sqrt – memcpy

Cn extensions include: – sqrtp – memcpym2p / memcpyp2m – get_penum – swazzle – any / all

68 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Asynchronous I/O

• For most efficient use of limited PE memory, overlap data transfers between mono memory and poly: – async_memcpym2p/p2m – sem_sig / sem_wait

For greatest efficiency, async_memcpy routines bypass the data cache, so coherency must be maintained: • dcache_flush / dcache_flush_address

69 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Asynchronous I/O example

void foo(double *A, double *B, int n)
{
    // Assume n is divisible by 24*96
    poly unsigned short penum = get_penum();
    poly double mat[4] = {1., 2., 3., 4.};
    poly double a_front[12], a_back[12];
    poly double b[4] = {0., 0., 0., 0.};
    int i;

    async_memcpym2p(19, a_front, A + 12*penum, 12*sizeof(double)); A += 12*96;
    n -= 24*96;
    while (n) {
        async_memcpym2p(17, a_back, A + 12*penum, 12*sizeof(double)); A += 12*96;
        sem_wait(19);
        for (i = 0; i < 12; i++) {
            b[0] += a_front[i]*mat[0] + a_front[i+1]*mat[1];
            b[1] += a_front[i+1]*mat[0] + a_front[i]*mat[1];
            b[2] += a_front[i]*mat[2] - a_front[i+1]*mat[3];
            b[3] += a_front[i+1]*mat[2] - a_front[i]*mat[3];
        }
        n -= 12*96;
        async_memcpym2p(19, a_front, A + 12*penum, 12*sizeof(double)); A += 12*96;
        sem_wait(17);
        for (i = 0; i < 12; i++) {
            … // compute on a_back, then finish outside while loop

70 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com ENVISION. ACCELERATE. ARRIVE.

Cn Pointers

71 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Cn — mono and poly pointers

• Using mono and poly with pointers:

mono int * mono mPmi    mono pointer to mono int
poly int * mono mPpi    mono pointer to poly int
mono int * poly pPmi    poly pointer to mono int
poly int * poly pPpi    poly pointer to poly int

• Most commonly used is mono pointer to poly poly * mono

72 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Cn — mono and poly pointers

• mono pointer to mono int: mono int * mono mPmi

(Diagram: both the pointer and the int it points to reside in mono memory.)

73 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Cn — mono and poly pointers

• mono pointer to poly int: poly int * mono mPpi

(Diagram: the pointer resides in mono memory and points to the same location in the poly memory of each PE.)

74 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Cn — mono and poly pointers

• poly pointer to poly int: poly int * poly pPpi

(Diagram: each PE stores its own copy of the pointer at the same location in its poly memory; each points to an int in that PE's poly memory.)

75 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Cn — mono and poly pointers

• poly pointer to mono int: mono int * poly pPmi

(Diagram: each PE stores its own pointer at the same location in its poly memory; each pointer can point to a different int in mono memory.)

76 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com ENVISION. ACCELERATE. ARRIVE.

Conditional Expressions

77 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Conditional Expressions: mono-if

• Conditions based on mono expressions
– Expression has the same value on all PEs
– Code block selected according to the expression, and a branch instruction is executed

mono int i, j;
i = j = 1;
if( i == j ) {
    // this block executed on all PEs
} else {
    // this block branched over on all PEs
}

78 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Conditional Expressions: poly-if

• Conditions based on poly expressions – Expression may have different values on different PEs – But SIMD model implies all PEs execute same instruction simultaneously – All branches executed on all PEs, with PE enabled if conditional expression is true (like predicated instructions)

poly int i;
i = get_penum();
if( i < 48 ) {
    // PEs 0, 1, 2, … execute instructions
    // PEs 48, 49, …: instructions issued but ignored
} else {
    // PEs 0, 1, 2, …: instructions issued but ignored
    // PEs 48, 49, … execute instructions
}

79 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Conditional Expressions: poly-while

• While loops based on poly expressions
– Loop continues execution until the condition is false on all PEs
– PEs are disabled one by one until the while condition is false on all PEs
– count keeps track of the total number of passes through the loop (95 with this code, since PE0 starts with me = 0 and PE95 starts with me = 95)

mono int count = 0;
poly int me;
me = get_penum();
while( me > 0 ) {
    --me;
    ++count;
}

80 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Other variations between C and Cn

• Labeled break and continue statements • No switch statement using poly variables (use multiple if statements) • No goto statement in poly context

81 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com ENVISION. ACCELERATE. ARRIVE.

Moving Data

82 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Data flow

• Board and host communicate via Linux kernel module or Windows driver • Create a handle and establish the connection

83 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Data flow

• Register intent of using the first processor on the card • Load the code onto the enabled processor

84 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Data flow

• Transfer data from host to board • Semaphores synchronize transfers between host and board

85 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Data flow

• Run the code on the enabled processor • Host can continue with other work

86 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Data flow

• Send results back to host • Halt board program and clean up

87 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Implicit broadcast from mono and poly

• Implicit broadcast from mono to poly by assignment

mono int m = 7;
poly int p;
p = m;   // Implicit broadcast to all PEs

• Assigning poly to mono is not permitted; gather through memory instead (see the sketch below)

mono int m;
poly int p = get_penum();
m = p;   // NO! m would receive a different value from each PE
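A minimal sketch of that gather (variable names are illustrative): each PE copies its value to a different address in mono memory with memcpyp2m, and the reduction is then done serially in mono code.

poly int p = get_penum();      // per-PE value to be reduced
mono int gathered[96];         // one slot per PE in mono memory
mono int total = 0;
mono int i;

// Each PE writes its value to its own slot (a different mono address per PE)
memcpyp2m( gathered + p, &p, sizeof(int) );

for( i = 0; i < 96; i++ )      // serial sum on the mono side
    total += gathered[i];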

88 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Explicit data movement – mono to poly memcpym2p(); async_memcpym2p() • Memory copy of n bytes from mono to poly – Source is a poly pointer to mono memory, which can have a different value for each PE – Destination is a mono pointer to poly memory, that is destination address is the same for all PEs

Source data in mono memory

Same destination on each PE

PE0 PE1 PE2 PE95 89 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Explicit data movement – poly to mono memcpyp2m(); async_memcpyp2m() • Memory copy of n bytes from poly to mono – Source is a mono pointer to poly memory; therefore source address is the same for every PE – Destination is a poly pointer to mono memory, which can have a different value for each PE

Destination data in mono memory

Same source address on each PE

PE0 PE1 PE2 PE95 90 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Explicit data movement – asynchronous async_memcpym2p(); async_memcpyp2m() • Asynchronous memory copy of n bytes from mono to poly or from poly to mono – Computation continues during data copy – Mono memory data cache NOT flushed – Restrictions on alignment of data – Use semaphores to wait for completion of copy – Much higher bandwidth than synchronous versions

dcache_flush();
async_memcpym2p( semaphore, … );
// computation continues
sem_wait( semaphore );
// use data that has been transferred from mono memory

91 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Explicit data movement – swazzle

• Register-to-register transfer between neighboring PE’s

(Diagram: PE n, showing the ALU, status flags, enable state, register file, and memory stack; the register file connects to PE n-1 and PE n+1 over the swazzle path.)

92 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Swazzle operations

• Assembly language versions operate directly on the register file
• Cn versions operate on data and include implicit data movement from memory to registers
• Variants:
– swazzle_up( poly int src );    // copy to higher numbered PE
– swazzle_down( poly int src );  // copy to lower numbered PE
– swazzle_up_generic( poly void *dst, poly void *src, unsigned int size );
– swazzle_down_generic( … );
– Similar swazzles operating on other data types
– Functions to set the data copied into the ends of the swazzle chain
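A small illustration using the generic variant (values are illustrative; the behaviour assumed here is that swazzle_up_generic copies each PE's source into the destination on its higher-numbered neighbour, so PE n receives the value produced on PE n-1, with the ends of the chain set by the boundary functions mentioned above):

poly int mine = get_penum() * 10;   // some per-PE value
poly int from_below = 0;

// One step along the swazzle path: PE n receives PE n-1's value
swazzle_up_generic( &from_below, &mine, sizeof(int) );

poly int smoothed = ( mine + from_below ) / 2;   // e.g. nearest-neighbour average
                                                 // (chain ends need special care)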

93 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Data movement bandwidths per CSX600

• Mono memory to poly memory — 2.7 GB/s aggregate over 96 PEs • Poly memory to registers — 840 MB/s per PE, 81 GB/s aggregate • Swazzle path bandwidth — 1680 MB/s per PE, 161 GB/s aggregate • Total bandwidth for Advance board (2 CSX600 processors) ~0.5 TB/s

94 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com DMA performance

• Advance board has a host DMA controller which can act as a PCI bus master • All DMA transfers are at least 8-byte aligned • Host DMA engine will attempt to use the entire bus bandwidth

ClearSpeed Advance DMA Performance

(Chart: DMA transfer rate in MB/s versus transfer size from 2.0 to 7.8 MB, for e620 read, e620 write, X620 read, and X620 write averages.)

95 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com ENVISION. ACCELERATE. ARRIVE.

CSAPI Host - Board communication

96 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Host-Board interaction basics

• The basic model for interaction between the host and the card is very simple:

• The ClearSpeed board can signal and wait for semaphores; it cannot initiate data transactions with the host.

• The host pushes data to and pulls data from the board.

• The host can also signal and receive semaphores.

97 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Connecting to the board

• A host application needs to perform the following sequence to launch a process on the board:
– Create a CSAPI handle: CSAPI_new
– Establish a connection with the board: CSAPI_connect
– Register the host application with the driver: CSAPI_register_application
– Load the CSX application on the desired chip: CSAPI_load
– Run the CSX application on the desired chip: CSAPI_run

98 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Interacting with the board

– Get the board memory address of a known symbol: CSAPI_get_symbol_value
• This must be done after the application is loaded, if the dynamic load capability is to be used.
– Write/read data to a retrieved memory address: CSAPI_write_mono_memory, CSAPI_read_mono_memory
• Asynchronous variants of these routines also exist
• A process does not need to be running for these operations to succeed, but the process needs to be loaded.
• These should not be performed DURING process termination.
– Managing semaphores:
• CSAPI_allocate_shared_semaphore: declares a semaphore for use on both host and card
• CSAPI_semaphore_wait
• CSAPI_semaphore_signal

99 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Cleaning up

– Process termination • CSAPI_wait_on_terminate • CSAPI_get_return_value – Clean-up • CSAPI_delete

– See CSX600 Runtime Software User Guide for more details, including: • managing multiple processes on the board/chip at once • managing board control registers • board reset • managing multi-threaded CSX applications • board memory allocation • managing multiple boards/chips • error handling
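Pulling the three preceding slides together, a minimal host-side flow might look like the sketch below. Only the CSAPI entry-point names come from the slides; the argument lists, the handle type, and the buffer/symbol names are simplified assumptions, so consult the CSX600 Runtime Software User Guide for the exact signatures and error handling.

/* Illustrative host-side sequence only; argument lists are assumptions. */
CSAPI *cs = CSAPI_new();                            /* create a handle          */
CSAPI_connect(cs);                                  /* connect to the board     */
CSAPI_register_application(cs);                     /* register with the driver */
CSAPI_load(cs, 0, "kernel.csx");                    /* load CSX app on chip 0   */

double host_data[1024];                             /* example host buffer      */
void  *sym;
CSAPI_get_symbol_value(cs, "input_buffer", &sym);   /* must be done after load  */
CSAPI_write_mono_memory(cs, sym, host_data, sizeof(host_data)); /* push input   */

CSAPI_run(cs, 0);                                   /* start the board process  */
/* ... host keeps working, synchronising via shared semaphores ...              */

CSAPI_read_mono_memory(cs, sym, host_data, sizeof(host_data));  /* pull results */
CSAPI_wait_on_terminate(cs);                        /* wait for CSX termination */
CSAPI_delete(cs);                                   /* clean up                 */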

100 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com ENVISION. ACCELERATE. ARRIVE.

Debugging Cn

101 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com csgdb

– csgdb is a port of the open source gdb debugger – full symbolic debugging of mono/poly variables – full gdb breakpoint support – step through Cn or assembly – views mono and poly registers – views PE enabled state – also accessible via DDD • DDD allows graphical data visualization

102 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Debug control

– To enable debugging:
• export CS_CSAPI_DEBUGGER=1
– initializes the debug interface within the host application
• export CS_CSAPI_DEBUGGER_ATTACH=1
– the host application will then write a port number to stdout and wait for Return to be pressed, so that csgdb can be manually attached to the connected board process
– Launch the host application
• This can be done with or without a debugger.
– Launch csgdb in a new shell, giving it the port number reported by the host application
• No need to "connect", as the host application did this already
• set desired breakpoints
• run
– Note that the host is currently blocked waiting for Return, so the card process may also be blocked waiting for the host.
– Press Return in the host shell for the host and card applications to proceed.

103 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com csgdb Debugger (Shown with ddd Front-end)

On-chip poly array contents displayed

Real time plot of contents of PE memory

Cn source-level break point, watch points, single step, etc.

Register contents

Disassembly, break point, watch points, single step, etc.

104 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com csgdb Command-line example

% cscn foo.cn –g –o foo.csx % csgdb ./foo.csx • (gdb) connect • 0x80000000 in __FRAME_BEGIN_MONO__ () • (gdb) break 109 • Breakpoint 1 at 0x800154c0: file foo.cn, line 109. • (gdb) run • Starting program: /home/kris/my_app/foo.csx • Breakpoint 1, main () at foo.cn:109 • (gdb) next • 110 y = MINY + (get_penum() * STEPY); • (gdb) print y • $1 = {-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1}

105 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com ENVISION. ACCELERATE. ARRIVE.

ClearSpeed Visual Profiler Explaining Performance

106 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com ClearSpeed Visual Profiler (csvprof)

– Host tracing • Trace CSAPI function • User can infer overlapping host/board utilization • Locate hot-spots – Board tracing • Trace board side functions without instrumentation • Locate hot-spots – Board hardware utilization • Display activity of csx functional units including: – ld/st – Instruction cache – Pi/o – Data cache – SIMD microcode – Thread • Cycle accurate • View corresponding source – Unified GUI

107 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Detailed profiling is essential for accelerator tuning

HOST CODE PROFILING: Visually inspect multiple host threads. Time specific code sections. Check overlap of host threads.

HOST/BOARD INTERACTION: Infer cause and effect. Measure transfer bandwidth. Check overlap of host and board compute.

ACCELERATOR PIPE: View instruction issue. Visualize overlap of executing instructions. Get cycle-accurate timing. Remove instruction-level performance bottlenecks.

CSX600 SYSTEM: Trace at system level. Inspect overlap of compute and I/O. View cache utilization. Graph performance.

(Diagram: host CPU(s) connected to the Advance accelerator board with its two CSX600 pipelines, annotated at the four levels above.)

108 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com csvprof: Host tracing

• Dynamic loading of CSAPI Trace implementation • Triggered with an environment variable: – export CS_CSAPI_TRACE=1 » Recall similar enabling of debug support: » export CS_CSAPI_DEBUGGER=1

• Specify tracing format: – export CS_CSAPI_TRACE_CSVPROF=1 – currently this is the only implementation, but in the future…

• Specify output file for trace: – export CS_CSAPI_TRACE_CSVPROF_FILE=mytrace.cst – default filename = csvprof_data.cst

• Output file written during CSAPI_delete

109 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com csvprof: Host-Board interaction

110 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com csvprof: Host code profile – Linpack benchmark

111 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com csvprof: CSX600 system profile

112 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com csvprof: Accelerator pipeline profile

113 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com csvprof: Instruction pipeline stalls

114 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com csvprof: Advance board tracing

– Enabled using the debugger, csgdb • Can use interactively or through gdb script

– Can select events to profile, or all events

– Requires buffer allocation on the card • Today, this is done statically • One could use CSAPI to allocate buffer, but developer must get location and size of the buffer to user to be entered for csgdb • Easy if running only on one chip, place buffer in the other chip’s memory

– Explicit dump to generate trace file • Can control the type of data to be dumped

115 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com csvprof: Sample gdb script

• % cat ./csgdb_trace.gdb • connect • load ./foo.csx

• cstrace buffer 0x60000000 0x1000000

• cstrace event all on

• tbreak test_me

• continue

• cstrace enable

• continue

• cstrace dump foo.cst

• cstrace dump branch dgemm_test4_branch.cst

• quit

• % csgdb –command=./csgdb_trace.gdb

116 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com ENVISION. ACCELERATE. ARRIVE.

Tuning Tips

117 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Pipelined arithmetic

• Four-stage floating-point pipeline • Use vector types, vector intrinsic functions, and vector math library for high efficiency

__DVECTOR a, b, c;
poly double x[N];

a = *((__DVECTOR *)&x[0]);
b = *((__DVECTOR *)&x[4]);
c = cs_sqrt( __cs_vadd( a, b ) );

118 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Poly conditionals

• When possible, remove common sub-expressions from poly if-blocks to reduce the amount of replicated work (see the small example below).
• It may even pay to compute and throw away results if that leads to fewer poly conditional blocks.
• A poly if-block uses predicated instructions, not a branch, so it is cheap as long as few additional instructions are executed.
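For example (an illustrative fragment), hoist the shared arithmetic so it runs once with all PEs enabled, leaving only the genuinely divergent statements inside the predicated blocks:

poly int me = get_penum();
poly double x, base;

base = (double)me * 0.5;      // common sub-expression hoisted out:
                              // executed once with all PEs enabled
if( me < 48 ) {
    x = base + 1.0;           // only the differing statements remain
} else {                      // inside the predicated poly if/else
    x = base - 1.0;
}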

119 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Poly loop counters

• Loops with poly counters are more expensive than those with mono counters • Use mono loop counters if possible

120 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Arrays

• Pointer incrementing is more efficient than using array index notation • Poly addresses require 16 bits • Use short for poly pointer increments – This avoids conversion of int to short
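Both tips in one small fragment (array name and length are illustrative): walk the poly array through a pointer rather than indexing it, and keep any variable added to a poly pointer in a short so no int-to-short conversion is needed.

poly double buf[256];              // per-PE data
poly double * mono pp = buf;       // pointer to poly data (16-bit address)
poly double sum = 0.0;
mono short stride = 1;             // short increment: no int-to-short conversion
mono int i;                        // mono loop counter (cheaper than poly)

for( i = 0; i < 256; i++ ) {
    sum += *pp;                    // pointer walk avoids recomputing the
    pp += stride;                  // address of buf[i] on every iteration
}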

121 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Data transfer

• Synchronous functions are completely general – flush the data cache each transfer • memcpyp2m() • memcpym2p() • Asynchronous functions maximize performance – do not flush cache – have data size and alignment restrictions – require use of wait semaphore • async_memcpyp2m(); sem_wait() • async_memcpym2p(); sem_wait() • Large data blocks are more efficient than small blocks – Host to board – Board to host – Mono to poly – Poly to mono

122 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com ENVISION. ACCELERATE. ARRIVE.

Application Examples

123 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Math function speed comparison

(Chart: 64-bit function operations per second, in billions, for Sqrt, InvSqrt, Exp, Ln, Cos, Sin, SinCos, and Inv, comparing a 2.6 GHz dual-core Opteron, a 3 GHz dual-core Woodcrest, and the ClearSpeed Advance card.)

Typical speedup of ~8X over the fastest x86 processors, because math functions stay in local memory on the card

124 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Nucleic Acid Builder (NAB)

• Newton-Raphson refinement now possible; large DGEMM calls from computed second derivatives will be in AMBER 10 • 2.5x speedup obtained for this operation in three hours of programmer effort • Enables accurate computation of entropy and Gibbs Free Energy for first time • AMBER itself has cases that ClearSpeed accelerates by 3.2x to 9x, with 5x to 17x possible once we exploit symmetry of atom- atom interactions

125 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com AMBER molecular modeling with ClearSpeed

AMBER Generalized Born Models 1, 2, and 6

(Chart: run time in minutes for AMBER Generalized Born models 1, 2, and 6 on the host versus the Advance X620.)

AMBER model          Host        Advance X620    Speedup
Generalized Born 1   83.5 min    24.6 min        3.4×
Generalized Born 2   84.6 min    23.5 min        3.6×
Generalized Born 6   37.9 min    4.0 min         9.4×

126 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Monte Carlo methods exploit high local bandwidth • Monte Carlo methods are ideal for ClearSpeed acceleration: – High regularity and locality of the algorithm – Very high compute to I/O ratio – Very good scalability to high degrees of parallelism – Needs 64-bit • Excellent results for parallelization – Achieving 10x performance per Advance card vs. highly optimized code on the fastest x86 CPUs available today – Maintains high precision required by the computations • True 64-bit IEEE 754 floating point throughout – 25 W per card typical when card is computing • ClearSpeed has a Monte Carlo example code, available in source form for evaluation

127 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Monte Carlo applications scale very well

• No acceleration: 200M samples, 79 seconds
• 1 Advance board: 200M samples, 3.6 seconds
• 5 Advance boards: 200M samples, 0.7 seconds

(Chart: European Option Pricing Model speedup versus number of ClearSpeed Advance boards, 0 to 5.)

128 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Why do Monte Carlo applications need 64-bit?

• Accuracy increases as the square root of the number of trials, so five-decimal accuracy takes 10 billion trials.
• But when you sum many similar values, you start to lose all the significant digits.
• 64-bit summation is needed to get even a single-precision result!

Single precision: 1.0000x10^8 + 1 = 1.0000x10^8
Double precision: 1.0000x10^8 + 1 = 1.00000001x10^8
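The box above is easy to reproduce with a few lines of plain C (nothing ClearSpeed-specific): the single-precision sum silently absorbs the increment, while the 64-bit sum keeps it.

#include <stdio.h>

int main(void)
{
    float  s32 = 1.0e8f;
    double s64 = 1.0e8;

    s32 += 1.0f;    /* 1 is below single-precision resolution at 1e8 */
    s64 += 1.0;     /* double precision still resolves the increment */

    printf("float : %.1f\n", s32);   /* 100000000.0 */
    printf("double: %.1f\n", s64);   /* 100000001.0 */
    return 0;
}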

129 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com ENVISION. ACCELERATE. ARRIVE.

Help and Support

130 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Installed documentation

• docs directory – CSXL user guide – runtime user guide – csvprof Visual Profiler overview and examples – SDK • getting started • gdb manual • instruction set manual • Cn library manual

• reference manual – release notes • examples directory

131 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com ClearSpeed online

• General information, news, etc. – Company website www.clearspeed.com

• Report a problem, find answers, etc. – Support website support.clearspeed.com

• Support website has: – Documentation, user guides, reference manuals – Solutions knowledge base – Software downloads – Log a case

132 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Join the ClearSpeed Developer Program!

• Designed to support the leading-edge community of developers using accelerators • Membership is free and has the following benefits: – Access to the ClearSpeed Developer website – ClearSpeed Developer Community on-line forum – Invitation to participate in ClearSpeed Developer & User Community meetings and events – Repository to share and access demonstrations and sample codes within the ClearSpeed Developer Community – Technical updates, tips and tricks from the gurus at ClearSpeed and the Developer Community – And more, including opportunities to preview new software releases and developer discount programs. • Leverage the expertise of developers worldwide. • Ask a question, or share your knowledge. • Register now at developer.clearspeed.com !

133 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com 134 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com