ClearSpeed Technical Training

ENVISION. ACCELERATE. ARRIVE.

ClearSpeed Technical Training
December 2007
Overview

Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com

Presenters
• Ronald Langhi, Technical Marketing Manager, [email protected]
• Brian Sumner, Senior Engineer, [email protected]

ClearSpeed Technology: Company Background
• Founded in 2001
  – Focused on alleviating the power, heat, and density challenges of HPC systems
  – 103 patents granted and pending (as of September 2007)
  – Offices in San Jose, California and Bristol, UK

Agenda
• Accelerators
• ClearSpeed and HPC
• Hardware overview
• Installing hardware and software
• Thinking about performance
• Software Development Kit
• Application examples
• Help and support

What is an accelerator?
• A device to improve performance
  – Relieve main CPU of workload
  – Or to augment CPU’s capability
• An accelerator card can increase performance
  – On specific tasks
  – Without aggravating facility limits on clusters (power, size, cooling)

All accelerators are good… for their intended purpose
• Cell and GPUs
  – Good for video gaming tasks
  – 32-bit FLOPS, not IEEE
  – Unconventional programming model
  – Small local memory
  – High power consumption (> 200 W)
• FPGAs
  – Good for integer, bit-level ops
  – Programming looks like circuit design
  – Low power per chip, but 20x more power than custom VLSI
  – Not for 64-bit FLOPS
• ClearSpeed
  – Good for HPC applications
  – IEEE 64-bit and 32-bit FLOPS
  – Custom VLSI, true coprocessor
  – At least 1 GB local memory
  – Very low power consumption (25 W)
  – Familiar programming model

The case for accelerators
• Accelerators designed for HPC applications can improve performance as well as performance per (watt, cabinet, dollar)
• Accelerators enable:
  – Larger problems for given compute time, or
  – Higher accuracy for given compute time, or
  – Same problem in shorter time
• Host to card latency and bandwidth are not major barriers to successful use of properly designed accelerators.

What can be accelerated?

Good application targets for acceleration
• Application needs to be both computationally intensive and contain a high degree of data parallelism.
• Computationally intensive:
  – Software depends on executing large numbers of arithmetic calculations
  – Usually 64-bit FLoating point Operations per Second (FLOPS)
  – Should also have a high ratio of FLOPS to data movement (bandwidth)
  – Computationally intensive applications may run for many hours or more even on large clusters.
• Data parallelism:
  – Software performs the same sequence of operations again and again but on a different item of data each time
• Example computationally intensive, data parallel problems include:
  – Large matrix arithmetic (linear algebra)
  – Molecular simulations
  – Monte Carlo options pricing in financial applications
  – And many, many more…

Example data parallel problems that can be accelerated
• Ab initio Computational Chemistry
• Structural Analysis
• Electromagnetic Modeling
• Global Illumination Graphics
• Radar Cross-Section

HPC Requirements
• Accelerator boards increase compute performance on highly specific tasks, without aggravating facility limits on clusters (power, size)
• Need to consider
  – Type of application
  – Software
  – Data type and precision
  – Compatibility with host (logical and physical)
  – Memory size (local to accelerator)
  – Latency and bandwidth to host

An HPC-specific accelerator
• CSX600 coprocessor for math acceleration
  – Assists serial CPU running compute-intensive math libraries
  – Available on add-in boards, e.g. PCI-X, PCIe
  – Potentially integrated on the motherboard
  – Can also be used for embedded applications
• Significantly accelerates certain libraries and applications
  – Target libraries: Level 3 BLAS, LAPACK, ACML, Intel® MKL
  – Mathematical modeling tools: Mathematica®, MATLAB®, etc.
  – In-house code: Using the SDK to port compute-intensive kernels
• ClearSpeed Advance™ board
  – Dual CSX600 coprocessors
  – Sustains 67 GFLOPS for 64-bit matrix multiply (DGEMM) calls
  – PCI-X, PCI Express x8
  – Low power; typically 25-35 Watts
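
The call for a high ratio of FLOPS to data movement can be made concrete for the DGEMM case quoted above. The following Python sketch counts operations and minimum data traffic for a square matrix multiply C = A·B. The function names are hypothetical, and the 2n³ operation count and the "each matrix crosses the link once" traffic model are common textbook assumptions, not figures from these slides.

```python
# Rough arithmetic-intensity estimate for a square DGEMM (C = A * B).
# ASSUMPTIONS (not from the slides): 2*n**3 floating-point operations,
# and each of the three n-by-n matrices is moved exactly once.

BYTES_PER_DOUBLE = 8  # IEEE 754 64-bit value


def dgemm_flops(n: int) -> int:
    """Operation count: n**3 multiply-adds, each counted as 2 operations."""
    return 2 * n ** 3


def dgemm_bytes_moved(n: int) -> int:
    """Minimum traffic in bytes: read A and B, write C."""
    return 3 * n * n * BYTES_PER_DOUBLE


def flops_per_byte(n: int) -> float:
    """Arithmetic intensity; simplifies to n / 12 under the assumptions."""
    return dgemm_flops(n) / dgemm_bytes_moved(n)


def compute_seconds(n: int, sustained_gflops: float = 67.0) -> float:
    """Compute time at a sustained rate (default: the quoted DGEMM rate)."""
    return dgemm_flops(n) / (sustained_gflops * 1e9)


if __name__ == "__main__":
    for n in (100, 1000, 10000):
        print(n, flops_per_byte(n), compute_seconds(n))
```

Under these assumptions the intensity grows linearly with the matrix dimension, which is one way to see why large matrix arithmetic appears among the good targets while small problems may not amortize the transfer cost.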

Plug-and-play Acceleration
• ClearSpeed host-side library CSXL
  – Provides some of the most commonly used and important Level 3 BLAS and LAPACK functions
  – Exploits standard shared/dynamic library mechanisms to intercept calls to L3 BLAS and LAPACK
  – Executes calls heterogeneously across both the multi-core host and the ClearSpeed accelerators simultaneously for maximum performance
  – Compatible with ACML from AMD and MKL from Intel
• User & application do not need to be aware of ClearSpeed
  – Except that the application suddenly runs faster

Programming considerations
• Is my main data type integer or floating-point?
• Is the data parallel in nature?
• What precision do I need?
• How much data needs to be local to the accelerated task?
• Does existing accelerator software meet my needs, or do I have to write my own?
• If I have to write my own code, will the existing tools (for example: compiler, debugger, and simulator) meet my needs?

Hardware Overview

CSX600: A chip designed for HPC
• Array of 96 Processor Elements; 64-bit and 32-bit floating point
• Single-Instruction, Multiple-Data (SIMD)
• 210 MHz: key to low power
• 47% logic, 53% memory
  – About 50% of the logic is FPU
  – Hence around one quarter of the chip is floating point hardware
• Embedded SRAM
• Interface to DDR2 DRAM
• Inter-processor I/O ports
• ~ 1 TB/sec internal bandwidth
• 128 million transistors
• Approximately 10 Watts
[Figure: ClearSpeed CSX600]
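
The chip figures above can be combined into a rough peak-throughput estimate. The Python sketch below assumes that each PE completes one floating-point add and one multiply per clock cycle; that issue rate is an assumption made for illustration and is not stated on this slide. All constant and function names are hypothetical.

```python
# Back-of-envelope figures for one CSX600, using numbers from the slide.
# ASSUMPTION: each PE completes 2 floating-point operations per clock
# (one add and one multiply); the slide gives only PE count and clock rate.

PES_PER_CHIP = 96
CLOCK_HZ = 210e6
ASSUMED_FLOPS_PER_PE_PER_CYCLE = 2

LOGIC_FRACTION = 0.47         # "47% logic, 53% memory"
FPU_FRACTION_OF_LOGIC = 0.50  # "About 50% of the logic is FPU"


def peak_gflops(chips: int = 1) -> float:
    """Peak rate in GFLOPS under the assumed issue rate."""
    flops = chips * PES_PER_CHIP * CLOCK_HZ * ASSUMED_FLOPS_PER_PE_PER_CYCLE
    return flops / 1e9


def fpu_fraction_of_chip() -> float:
    """Share of the whole chip that is floating-point hardware."""
    return LOGIC_FRACTION * FPU_FRACTION_OF_LOGIC


def peak_gflops_per_watt(watts_per_chip: float = 10.0) -> float:
    """Peak GFLOPS per watt for one chip at the slide's ~10 W figure."""
    return peak_gflops(1) / watts_per_chip


if __name__ == "__main__":
    print("peak GFLOPS per chip:", peak_gflops(1))
    print("peak GFLOPS, two chips:", peak_gflops(2))
    print("FPU share of chip:", fpu_fraction_of_chip())
    print("peak GFLOPS per watt:", peak_gflops_per_watt())
```

With these inputs the estimate comes to about 40 GFLOPS per chip, and the FPU share works out to 0.235, which matches the slide's "around one quarter".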

CSX600 processor core
• Multi-Threaded Array Processing
  – Programmed in familiar languages
  – Hardware multi-threading
  – Asynchronous, overlapped I/O
  – Run-time extensible instruction set
• Array of 96 Processor Elements (PEs)
  – Each has multiple execution units
  – Including double precision floating point and integer units
[Figure: CSX600 block diagram. Labels: System Network, Peripheral Network, Mono Controller, Data Cache, Instruction Cache, Control and Debug, Poly Controller, PE 0, PE 1, … PE 95, Programmable I/O to DRAM]

CSX600 processing element (PE)
• Multiple execution units
• 4-stage floating point adder
• 4-stage floating point multiplier
• Divide/square root unit
• Fixed-point MAC 16x16 → 32+64
• Integer ALU with shifter
• Load/store
• 5-port register file (3 reads, 2 writes)
• Closely coupled 6 KB SRAM for data
• High bandwidth per PE DMA (PIO)
• Per PE address generators (serves as hardware gather-scatter)
• Fast inter-PE communication path
[Figure: PE block diagram. Labels: ALU, MAC, FP Mul, FP Add, Div/Sqrt (32/64-bit, IEEE 754), Register File 128 Bytes, PE SRAM 6 KBytes, Programmed I/O (PIO) Collection & Distribution, neighbors PE n–1 and PE n+1; bus widths 32, 64, 128]

Advance accelerator memory hierarchy
• Tier 3: Host DRAM, 1-32 GBytes typical
  – Aggregate: ~1 GB/s
• Tier 2: CSX DRAM, 1.0 GBytes (Bank 0: 0.5 GBytes, Bank 1: 0.5 GBytes)
  – 5.4 GB/s, ~0.03 GB/s per PE
• Tier 1: Poly memory, 6 KBytes per PE (PE 0 … PE 95)
  – 192 PEs * 6 KB = 1.1 MB
  – 161 GB/s
• Tier 0: Register memory, 128 Bytes per PE
  – 192 PEs * 128 Byte = 24 KB
  – 322 GB/s, 725 GB/s
• Arithmetic: 0.42 GFLOPS per PE
• Total: 80 GFLOPS, 1.1 TB/s
• …but only 25 Watts
[Figure: memory hierarchy diagram. Additional labels: Swazzle; bus widths 16 and 32]
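
The aggregate figures on the memory hierarchy slide follow from the per-PE numbers. The Python sketch below recomputes them and adds a bytes-per-FLOP ratio for three of the tiers. The per-PE and bandwidth values are transcribed from the slide; the bytes-per-FLOP ratio is an illustrative metric added here, and all names are hypothetical.

```python
# Recompute aggregate figures for the dual-CSX600 Advance board.
# Per-PE sizes and tier bandwidths are transcribed from the slide.
# The bytes-per-FLOP ratio is an illustrative metric, not from the slide.

PES = 2 * 96                  # two CSX600 chips, 96 PEs each
GFLOPS_PER_PE = 0.42
POLY_BYTES_PER_PE = 6 * 1024  # 6 KB of poly memory per PE
REG_BYTES_PER_PE = 128

# Tier name -> bandwidth in GB/s as quoted on the slide.
TIER_BANDWIDTH_GBS = {
    "host link": 1.0,
    "CSX DRAM": 5.4,
    "poly memory": 161.0,
}


def total_gflops() -> float:
    return PES * GFLOPS_PER_PE


def total_poly_mb() -> float:
    return PES * POLY_BYTES_PER_PE / (1024 * 1024)


def total_register_kb() -> float:
    return PES * REG_BYTES_PER_PE / 1024


def dram_gbs_per_pe() -> float:
    return TIER_BANDWIDTH_GBS["CSX DRAM"] / PES


def bytes_per_flop(tier: str) -> float:
    """Bandwidth available per floating-point operation at a given tier."""
    return TIER_BANDWIDTH_GBS[tier] / total_gflops()


if __name__ == "__main__":
    print("GFLOPS:", total_gflops())
    print("poly memory (MB):", total_poly_mb())
    print("register memory (KB):", total_register_kb())
    print("DRAM GB/s per PE:", dram_gbs_per_pe())
    for tier in TIER_BANDWIDTH_GBS:
        print(tier, "bytes per FLOP:", bytes_per_flop(tier))
```

The ratio drops by roughly two orders of magnitude between poly memory and the host link, which is consistent with the earlier advice that good targets have a high ratio of FLOPS to data movement.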

Acceleration by plug-in card
• Dual ClearSpeed CSX600 coprocessors
• R∞ > 66 GFLOPS for 64-bit matrix multiply (DGEMM) calls
  – Hardware also supports 32-bit floating point and integer calculations
• 133 MHz PCI-X two-thirds length (8″) form factor
• PCIe x8 half-length form factor
• 1 GB of memory on the board
• Drivers today for Linux (Red Hat and SLES) and Windows (XP, Server 2003)
• Low power: 25 watts typical
• Multiple boards can be used together for greater performance
[Figure: Advance X620 (PCI-X), 203 mm length, full-height]
[Figure: Advance e620 PCIe (x8), half length, full-height]
Both boards can sustain over 66 GFLOPS on 64-bit HPC kernels

Host to board DMA performance
• The board includes a host DMA controller which can act as a bus master.
