ENVISION. ACCELERATE. ARRIVE.

ClearSpeed Technical Training

December 2007

Overview

1 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Presenters

Ronald Langhi Technical Marketing Manager [email protected]

Brian Sumner Senior Engineer [email protected]

2 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com ClearSpeed Technology: Company Background

• Founded in 2001

– Focused on alleviating the power, heat, and density challenges of HPC systems

– 103 patents granted and pending (as of September 2007)

– Offices in San Jose, California and Bristol, UK

3 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Agenda

Accelerators
ClearSpeed and HPC
Hardware overview
Installing hardware and software
Thinking about performance
Software Development Kit
Application examples
Help and support

4 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com ENVISION. ACCELERATE. ARRIVE.

What is an accelerator?

5 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com What is an accelerator?

• A device to improve performance – Relieve main CPU of workload – Or to augment CPU’s capability

• An accelerator card can increase performance – On specific tasks – Without aggravating facility limits on clusters (power, size, cooling)

6 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com All accelerators are good… for their intended purpose

Cell and GPUs
•Good for video gaming tasks
•32-bit FLOPS, not IEEE
•Unconventional programming model
•Small local memory
•High power consumption (> 200 W)

FPGAs
•Good for integer, bit-level ops
•Programming looks like circuit design
•Low power per chip, but 20x more power than custom VLSI
•Not for 64-bit FLOPS

ClearSpeed
•Good for HPC applications
•IEEE 64-bit and 32-bit FLOPS
•Custom VLSI, true 64-bit FLOPS
•At least 1 GB local memory
•Very low power consumption (25 W)
•Familiar programming model

7 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com The case for accelerators

• Accelerators designed for HPC applications can improve performance as well as performance per (watt, cabinet, dollar)
• Accelerators enable:
– Larger problems for given compute time, or
– Higher accuracy for given compute time, or
– Same problem in shorter time
• Host to card latency and bandwidth are not major barriers to successful use of properly-designed accelerators.

8 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com ENVISION. ACCELERATE. ARRIVE.

What can be accelerated?

9 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Good application targets for acceleration

• Application needs to be both computationally intensive and contain a high degree of data parallelism.
• Computationally intensive:
– Software depends on executing large numbers of arithmetic calculations
– Usually 64-bit FLoating point Operations per Second (FLOPS)
– Should also have a high ratio of FLOPS to data movement (bandwidth)
– Computationally intensive applications may run for many hours or more, even on large clusters.
• Data parallelism:
– Software performs the same sequence of operations again and again, but on a different item of data each time (a minimal example follows this list)
• Example computationally intensive, data-parallel problems include:
– Large matrix arithmetic (linear algebra)
– Molecular simulations
– Monte Carlo options pricing in financial applications
– And many, many more…
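A minimal C illustration of the data-parallel shape described above (the function name, array names, and size are illustrative only): every iteration applies the same arithmetic to a different element, with no dependences between iterations, so the work can be spread across the 96 PEs.

/* Same operation applied independently to each element:
   a natural shape for SIMD acceleration. */
void scale_and_add(double *y, const double *x, double alpha, int n)
{
    int i;
    for (i = 0; i < n; i++) {
        y[i] = alpha * x[i] + y[i];   /* identical work per element */
    }
}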

10 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Example data parallel problems that can be accelerated

Ab initio Computational Chemistry

Structural Analysis

Electromagnetic Modeling

Global Illumination Graphics

Radar Cross-Section

11 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com HPC Requirements

• Accelerator boards increase compute performance on highly specific tasks, without aggravating facility limits on clusters (power, size)

• Need to consider – Type of application – Software – Data type and precision – Compatibility with host (logical and physical) – Memory size (local to accelerator) – Latency and bandwidth to host

12 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com An HPC-specific accelerator

• CSX600 coprocessor for math acceleration – Assists serial CPU running compute-intensive math libraries – Available on add-in boards, e.g. PCI-X, PCIe – Potentially integrated on the motherboard – Can also be used for embedded applications • Significantly accelerates certain libraries and applications – Target libraries: Level 3 BLAS, LAPACK, ACML, Intel® MKL – Mathematical modeling tools: Mathematica®, MATLAB®, etc. – In-house code: Using the SDK to port compute-intensive kernels • ClearSpeed Advance™ board

– Dual CSX600 – Sustains 67 GFLOPS for 64-bit matrix multiply (DGEMM) calls – PCI-X, PCI Express x8 – Low power; typically 25-35 Watts

13 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Plug-and-play Acceleration

• ClearSpeed host-side library CSXL – Provides some of the most commonly used and important Level 3 BLAS and LAPACK functions – Exploits standard shared/dynamic library mechanisms to intercept calls to L3 BLAS and LAPACK – Executes calls heterogeneously across both the multi- core host and the ClearSpeed accelerators simultaneously for maximum performance – Compatible with ACML from AMD and MKL from Intel • User & application do not need to be aware of ClearSpeed – Except that the application suddenly runs faster

14 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Programming considerations

• Is my main data type integer or floating-point? • Is the data parallel in nature? • What precision do I need? • How much data needs to be local to the accelerated task? • Does existing accelerator software meet my needs, or do I have to write my own? • If I have to write my own code will the existing tools meet my needs—for example: compiler, debugger, and simulator?

15 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com ENVISION. ACCELERATE. ARRIVE.

Hardware Overview

16 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com CSX600: A chip designed for HPC

• Array of 96 Processor Elements; 64-bit and 32-bit floating point
• Single-Instruction, Multiple-Data (SIMD)
• 210 MHz -- key to low power
• 47% logic, 53% memory
– About 50% of the logic is FPU
– Hence around one quarter of the chip is floating-point hardware
• Embedded SRAM
• Interface to DDR2 DRAM
• Inter-processor I/O ports
• ~1 TB/sec internal bandwidth
• 128 million transistors
• Approximately 10 Watts

17 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com CSX600 processor core

(Block diagram of the CSX600 core: mono controller with instruction and data caches, control and debug, the system and peripheral networks, the poly controller, PEs 0 to 95, and programmable I/O to DRAM.)

• Multi-Threaded Array Processing
– Programmed in familiar languages
– Hardware multi-threading
– Asynchronous, overlapped I/O
– Run-time extensible instruction set
• Array of 96 Processor Elements (PEs)
– Each has multiple execution units
– Including double-precision floating point and integer units

18 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com CSX600 processing element (PE)

(Block diagram: PE n with its neighbours PE n-1 and PE n+1, showing the ALU, MAC, FP multiplier, FP adder, divide/square root unit, 128-byte register file, 6 KB PE SRAM, and the 32-bit programmed I/O paths into the 128-bit PIO collection and distribution network.)

• Multiple execution units
• 4-stage floating-point adder, 32/64-bit IEEE 754
• 4-stage floating-point multiplier
• Divide/square root unit
• Fixed-point MAC 16x16 → 32+64
• Integer ALU with shifter
• Load/store
• 5-port register file (3 reads, 2 writes), 128 Bytes
• Closely coupled 6 KB SRAM for data
• High bandwidth per PE: programmed I/O (PIO) for collection and distribution
• Per-PE address generators (serve as hardware gather-scatter)
• Fast inter-PE communication path

19 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Advance accelerator memory hierarchy

• Tier 3 – Host DRAM: 1-32 GBytes typical; ~1 GB/s aggregate between host and board
• Tier 2 – CSX DRAM: two banks of 0.5 GBytes (1.0 GBytes total); 5.4 GB/s (~0.03 GB/s per PE)
• Tier 1 – Poly memory: 6 KBytes per PE; 192 PEs * 6 KB = 1.1 MB; 161 GB/s aggregate
• Tier 0 – Register memory: 128 Bytes per PE; 192 PEs * 128 Bytes = 24 KB; 322 GB/s / 725 GB/s aggregate register bandwidth, plus the inter-PE swazzle path
• Arithmetic: 0.42 GFLOPS per PE
• Total: 80 GFLOPS, 1.1 TB/s …but only 25 Watts

20 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Acceleration by plug-in card

Advance X620 (PCI-X) • Dual ClearSpeed CSX600 coprocessors

• R∞ > 66 GFLOPS for 64-bit matrix multiply (DGEMM) calls
– Hardware also supports 32-bit floating point and integer calculations
• Advance X620: 133 MHz PCI-X; two-thirds length (8″, 203 mm), full-height form factor
• Advance e620: PCIe x8; half-length, full-height form factor
• 1 GB of memory on the board
• Drivers today for Linux (Red Hat and SLES) and Windows (XP, Server 2003)
• Low power: 25 watts typical
• Multiple boards can be used together for greater performance

Both boards can sustain over 66 GFLOPS on 64-bit HPC kernels

21 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Host to board DMA performance

• The board includes a host DMA controller which can act as a bus master. • All DMA transfers are at least 8-byte aligned. • The host DMA engine will attempt to use the full bandwidth of the bus.

Type of slot        Peak bandwidth    Expected DMA speed
PCI Express x8      2,000 MB/s        Up to 1,300 MB/s
PCI-X 133 MHz       1,066 MB/s        Up to 750 MB/s

• Note: measured bandwidth is highly system-dependent – Variations of up to 50% have been observed – Depends on system chipset, operating system, bus contention… 22 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com ENVISION. ACCELERATE. ARRIVE.

Installing Hardware and Software

23 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Configuration support

• Advance supports the following host operating systems (IA32 / x86-64):
– SuSE Linux Enterprise Server 9
– Red Hat Enterprise Linux 4 (IA32 and x86-64)
– Windows XP SP2
– Windows Server 2003 (preview)

• Supported host BLAS libraries:
– AMD ACML
– Intel MKL
– Goto
– ATLAS

• Supported compilers:
– For Linux: gcc, icc, fort, pgf
– For Windows XP, 2003: Visual C++ 2005

For the latest support information go to http://support.clearspeed.com

24 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Base software

• All ClearSpeed software on Linux is installed using the rpm command.

• The software consists of three parts: – Runtime and driver software – Diagnostics – ClearSpeed standard libraries, CSXL & CSFFT

• You can download the latest versions from the ClearSpeed support website: • https://support.clearspeed.com/downloads

25 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Installing base software on Linux

1. Log in to the Linux machine as root and change to the directory containing the drivers package.
2. Install the runtime software, using the command:
rpm -i csx600_m512_le-runtime-<version>.<release>.rpm
3. Install the kernel module. For Linux 2.6, simply install the open source CSX driver using:
/opt/clearspeed/csx600_m512_le/drivers/csx/install-csx
4. Install the board diagnostics:
rpm -i csx600_m512_le-board_diagnostics-<version>.<release>.rpm
5. Install the CSXL library package:
rpm -i csx600_m512_le-csxl_<version>.<release>.rpm

Note: For Windows a Jungo driver will need to be installed and configured – see installation manual for more details.

26 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Confirming successful installation

ClearSpeed distributes diagnostic tests to check that the board and drivers are successfully installed:
1. Open a shell window and go to an appropriate directory:
cd /tmp
2. Set up the ClearSpeed environment variables by typing:
source /opt/clearspeed/csx600_m512_le/bin/bashrc
3. Run the diagnostic program by typing the command:
/opt/clearspeed/csx600_m512_le/bin/run_tests.pl

– Some tests take several minutes to complete. – Each test will write Pass or Fail to standard output. – A log file test.log will be written in the current directory.

27 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com csreset

– The csreset command reinitializes an Advance board and its processors. – It must be run after start-up or reboot of the system or simulator. – It is also a good idea to run csreset at the start of a batch job that calls the Advance board. – The csreset command can take argument flags to provide a finer level of control. These include:

-A   Specifies that all boards should be reset.
-v   Verbose output. This shows the details about each board.
-h   Help. This shows the full list of options.

28 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com If you have problems with software installation

– Make sure you are logged in as super-user. • As root for Linux. • As administrator for Windows. – If the configure or make install steps fail, check that you have the appropriate header files. • Check the preconfigured header files and, if necessary, obtain the appropriate configured header file. – If the system cannot access the board but the driver is installed, make sure the board is seated well. • Try removing the board and reinstalling.

29 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com ENVISION. ACCELERATE. ARRIVE.

Targeting ClearSpeed Advance: Exploiting Data Parallelism

30 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Alternative approaches

Three main approaches to acceleration:

1. Use an application which is already ported 2. Plug and play 3. Custom port using the SDK

31 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Using an application which is already ported

• Acceleration: simply insert ClearSpeed • Latest list of ported applications: – http://www.clearspeed.com/products/applicationsupport/ • Includes: – Amber – Mathematica – MATLAB – Star-P

32 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Plug and play libraries: CSXL

• Underlying shared libraries are augmented with ClearSpeed CSXL accelerated functions • Includes key functions from: – LAPACK – Level 3 BLAS • As an example, BLAS is used by: – AMD ACML – Intel MKL – Full list on: http://www.clearspeed.com/products/compatibility/ • Application is transparently accelerated – No modifications to application

33 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Acceleration using CSXL and standard libraries

(Diagram: the application's calls pass through the CSXL intercept layer, which automatically selects the optimum path between the host library (LAPACK, BLAS, etc.) and the CSXL library (LAPACK, BLAS, etc.).)

34 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Considerations for custom port of application

• Is the task large enough to consider acceleration? – Takes time to ship data to the accelerator • Accelerator can work in parallel with host – Overlap computation • Performance considerations – Look for areas of data parallelism – Overlap compute with data I/O – Make full use of ClearSpeed I/O paths • Analysis starts with model based on memory tiers and can be verified using performance profiling tools

35 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Is this trip necessary? Considering I/O

(Diagram: node memory and accelerator memory connected with bandwidth B; plot of task speed versus problem size showing the accelerator curve, the host curve, and the break-even point.)

• Time to move N data to/from another node or an accelerator is ~latency + N/B seconds.
• Because local memory bandwidth is usually higher than B, acceleration might be lost in the communication time.
• Estimate the break-even point for the task (note: offloading is different from accelerating, where the host continues working). A small worked estimate follows.
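The sketch below is a hypothetical model, not ClearSpeed-supplied code: it assumes a task of a given number of operations on a given amount of transferred data, a host sustaining host_gflops, and an accelerator sustaining acc_gflops behind a link with the stated latency and bandwidth. Offload pays only when transfer time plus accelerator time beats the host time.

/* Hypothetical break-even estimate for offloading a task. */
#include <stdio.h>

int main(void)
{
    double flops       = 2.0e9;    /* arithmetic in the task (illustrative)   */
    double bytes       = 8.0e6;    /* data moved to and from the card         */
    double latency     = 20e-6;    /* seconds per transfer (illustrative)     */
    double bandwidth   = 1.3e9;    /* bytes/s over the host link              */
    double host_gflops = 8.0;      /* sustained host rate                     */
    double acc_gflops  = 50.0;     /* sustained accelerator rate              */

    double t_host = flops / (host_gflops * 1e9);
    double t_acc  = latency + bytes / bandwidth + flops / (acc_gflops * 1e9);

    printf("host %.4f s, offload %.4f s -> %s\n", t_host, t_acc,
           t_acc < t_host ? "offload wins" : "stay on host");
    return 0;
}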

36 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Memory bandwidth dictates performance

(Diagram: node memory connects to the multicore x86 host at 17 GB/s; the host connects to the accelerator over PCI-X or PCIe at 1 to 2 GB/s; accelerator DRAM provides 5.4 GB/s; accelerator local RAM provides 192 GB/s.)

• Applications that can stage into local RAM can go 10x faster than current high-end Intel and AMD hosts
– Applications residing in accelerator DRAM do not make use of the massive local memory bandwidth
• GPUs face a very similar issue

37 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Latency and bandwidth: Simple offload model

(Timeline: the host sends data across the link (latency plus bandwidth-limited transfer), waits while the accelerator computes, then receives the results back (bandwidth plus latency).)

• Accelerator must be quite fast for this approach to have benefit • This “mental picture” may stem from early days of Intel 80x87, Motorola 6888x math coprocessors

38 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Latency and bandwidth: Acceleration model

(Timeline: as in the previous slide, but the host keeps computing while the accelerator works on the transferred data, so only the net gain matters.)

• Host continues working – Accelerator needs only be fast enough to make up for time lost to bandwidth + latency • Easiest use model – Host and accelerator share the same task, like DGEMM • More flexible – Host, accelerator each specialize what they do

39 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Accelerator need not wait for all data before starting

(Timeline: transfers are pipelined; the accelerator starts computing as soon as the first data arrives, and the host keeps computing throughout.)

• Host can work while data is moved – PCI transfers might burden a single x86 core by up to 40% – Other cores on host continue productive work at full speed • Accelerator can work while data is moved – Can be slower than the host, and still add performance! • In practice, latency is microseconds; accelerator task takes seconds – Latency gaps above would be microscopic if drawn to scale 40 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Performance considerations

• Look for data parallelism – Fine-grained – vector operations – Medium-grained – unrolled independent loops – Coarse-grained – multiple simultaneous data channels/sets • Performance analysis for accelerator cards – Like analysis for message-passing parallelism but with more levels of memory and communication • Application porting success depends heavily on attention to memory bandwidths – (Surprisingly) not so much on the bandwidth between host and accelerator card

41 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com PCI Bus

• ClearSpeed boards utilize either PCI-X or PCIe busses – PCI-X 133 MHz: 1 GB/s Peak – PCIe x8: 1.6 GB/s Peak

• Available memory on board – 1 GB of 200 MHz DDR2 SDRAM shared by 2 CSX600 processors

• Must consider both the transfer rate AND the available memory – If application requires more memory, then more communication to the board is necessary

• Even with an infinitely fast board:
– Time ≥ Total data size transferred / Bus bandwidth

42 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com PCI Bus

• Driver performance is very machine-specific and depends on transfer size, direction, etc. – Transfer Size vs. Transfer Rate

– See Runtime User’s Guide for current driver performance

43 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com On-board Memory

• 2 level memory hierarchy – 1 GB “mono” shared memory – 6 kB “poly” memory per processing element (PE) • 6 kB/PE * 96 PE = 576 kB per CSX600

• Peak bandwidth between levels – 2.7 GB/s x 2 chips = 5.4 GB/s

• Must consider both the transfer rate AND the available memory
– If the application requires more memory, then more communication to the board is necessary
• Even with infinitely fast PEs:
– Time ≥ Total data size transferred / Transfer bandwidth

• Secondary considerations – Burst size: 64 Bytes/PE (i.e., 8 doubles) – Transfers can be smaller, but at reduced efficiency

44 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com SIMD Computing

• What is SIMD? – Single Instruction, Multiple Data • Each PE sees the same instruction stream • Each PE issues “load”, “multiply”, etc., simultaneously • But acts on different data per PE – PARALLEL COMPUTATION

• ClearSpeed SIMD is enhanced by: – Local memory for each PE • data management is easier within “poly” memory • does not require adjacent access for all 96 elements involved in the computation from shared memory pool – PEs can be enabled/disabled • not required to use all PEs always • useful for handling “boundaries”

45 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com SIMD Array

• 96 PEs per CSX600 – 210 MHz – double precision multiply-accumulate per cycle – 4 cycle pipeline depth for multiply and accumulation • For top performance, use operations on 4 element vectors on each PE • Nearest neighbor communication – “swazzle” path topology is a line or ring – Bandwidth: 8 Bytes per cycle between register files • 8*96*210 = 161 GB/s • Useful for fine grained communication

46 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Good Example Kernels

• Dense Linear Algebra – Matrix-Matrix products (DGEMM) • Low memory bandwidth required = high data re-use • Inner kernel: Matrix-multivector product – 96x96 matrix, x4 vectors » 96x96 matrix due to 96 PEs » 4 vectors due to multiply/accumulate pipeline depth • Monte Carlo (computational finance) – “Embarrassingly parallel” task distribution – Very little data requirement • Molecular Dynamics (Amber, BUDE) – Large numbers of identical tasks can be found – Requires small working data sets
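A rough arithmetic-intensity check for the DGEMM inner kernel described above, assuming the 96x96 block stays resident in poly memory while groups of 4 vectors stream through it (figures are illustrative, not measured):

/* Flops-per-byte estimate for a resident 96x96 block
   multiplied by groups of 4 streamed vectors. */
#include <stdio.h>

int main(void)
{
    double flops = 2.0 * 96 * 96 * 4;        /* multiply-adds per 4-vector group */
    double bytes = (4 * 96 + 4 * 96) * 8.0;  /* 4 vectors in, 4 vectors out      */
    printf("%.0f flops / %.0f bytes = %.1f flops per byte\n",
           flops, bytes, flops / bytes);     /* high re-use, low bandwidth need  */
    return 0;
}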

47 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Possible Kernel

• Partial Differential Equations – Some are memory bandwidth limited, so not a good candidate for ClearSpeed acceleration • small stencil implies little computation per grid point • wide, sparse stencil implies large active data set

• But, some PDE simulations are good candidates – require a small grid, so can run entirely in PE memory (computational finance) – have large, dense stencils • large amounts of computation per grid point • sufficiently small active data set – implicit time stepping • large systems of equations solved via direct methods • direct solvers utilize dense linear algebra kernels – (i.e., DGEMM)

48 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Keys to Success

• Parallelism is essential

• Proper management of the “poly” memory is also critical – Application must accept memory bandwidth limits • PCIe or PCI-X • On-board memory hierarchy – SDK enables asynchronous data transfers • permits efficient “double buffering” to manage data streams, accommodating the size limit – Application must employ a small working data set • less than 576 kB, distributed across 96 PEs • also aware of 1 GB shared memory limit

• While developing ClearSpeed applications, use the ClearSpeed Visual Profiler to discover what is actually happening on the board!

49 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Remember the host processor

• Today’s multi-core hosts are very useful for managing “other tasks” that are not accelerated by ClearSpeed

• Many applications can overlap these tasks with ClearSpeed accelerated tasks

• Profile the host portion of your application as well, using any of a variety of tools
– Use the ClearSpeed Visual Profiler for CSAPI utilization

50 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com General optimization techniques

• Latency hiding – Overlap compute with I/O • Data reuse – On-chip swazzle path • Maximize PE usage – Ensure all PEs are processing, not idle

51 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Overlap data with compute

• Double-buffer • Many levels of data I/O – compute parallelism – PE load/store overlaps PE compute – PE to board memory can also overlap – Board memory to host memory can also overlap • Hence, if task is compute bound: – Data takes “no time” to transfer • If task is I/O bound: – Compute takes “no time” to calculate

52 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Data reuse

• Swazzle path – Left or right 64 bit transfer (8 bytes) – 8 bytes per cycle, so ~161GB/s per CSX processor – Can be complete loop or linear chain • Parallel with other data I/O – Register-register move – On-off chip in parallel • Doesn’t impinge on DRAM access – PE local memory – register in parallel • Doesn’t impinge on local memory access

53 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Maximize PE usage

• Aim for 100% efficiency • PEs use predicated execution – PEs are “disabled” rather than code skipped – Minimize effects – extract common code from conditionals • Mono processor can branch – Skip blocks of code

54 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Detail of I/O widths for performance analysis

(Diagram: PE n with its neighbours, register file, PE SRAM, and the PIO collection and distribution network down to the CSX DRAM, annotated with the per-path bandwidths below.)

Each accelerator board has (aggregate bandwidth for two CSX600 chips):
– 161 GB/s PE register to PE memory bandwidth (4 bytes per cycle per PE)
– 322 GB/s swazzle path bandwidth (8 bytes per cycle per PE)
– 968 GB/s PE register to PE ALU bandwidth (24 bytes per cycle per PE)
– 5.4 GB/s DRAM bandwidth to the 1 GByte CSX DRAM (32 bytes per cycle, via the PIO collection and distribution network)

55 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com ENVISION. ACCELERATE. ARRIVE.

Software Development Kit

56 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com ClearSpeed SDK overview

• Cn compiler – C with extension for SIMD control • Assembler • Linker • Simulator • Debugger • Graphical profiler • Libraries • Documentation • Available for Windows XP / 2003 and Linux (Red Hat Enterprise Linux 4 and SLES 9)

57 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Agenda

1. Introduction to Cn
2. Cn Libraries
3. Debugging Cn
4. CSAPI: Host / Board Communication

58 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com ENVISION. ACCELERATE. ARRIVE.

Introduction to Cn

59 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Software Development

The CSX architecture is simpler to program:

• Single program for serial and parallel operations • Architecture and compiler co-designed • Instruction and data caches • Simple, regular 32-bit instruction set • Large, flexible register file • Fast thread context switching • Built-in debug support • Same development process as traditional architectures: compile, assemble, link • Cn is a simple parallel extension of C

60 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Cn — C with vector extensions for CSX

• New Keywords – mono and poly storage qualifiers • mono is a serial (single) variable • poly is a parallel (vector) variable • Mono variables in 1 GB DRAM • Poly variables in 6 KB SRAM of each PE

61 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Cn differences from C

• New data type multiplicity modifiers:
– mono: denotes a serial variable
• resident in "mono" memory
• mono is the default multiplicity
– poly: denotes a parallel/vector variable
• resident in "poly" memory local to each PE
– applies to pointers, doubly so:
• mono int * poly foo;
– foo is a pointer in poly memory to an int in mono memory
• poly int * mono bar;
– bar is a pointer in mono memory to an int in poly memory
• int * poly * mono good_grief;
– as you would expect…
• Pointer sizes:
– mono int *: 4 bytes (32-bit addressable space, 512 MB)
– poly int *: 2 bytes (16-bit addressable space, 6 kB)

62 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Cn differences from C

• Execution context:
– Alters branch/jump behavior
– In mono context, jumps occur as in a traditional architecture
– In poly context, PEs are enabled/disabled
• if (penum > 32) {…} else {…}
– disables the false PEs on the true branch, then re-enables the false PEs and disables the other PEs for the false branch
– both branches executed
• break, continue
– selected PEs get disabled until the end of scope on all PEs
• return
– selected PEs get disabled until all PEs return, or end of scope
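For example, a poly break does not branch; it simply disables the PEs that take it for the rest of the loop. A small Cn sketch with an illustrative loop bound:

poly int me = get_penum();
poly int work = 0;
mono int i;

for( i = 0; i < 96; i++ ) {
    if( me < i ) break;   // PEs whose condition is true are disabled
                          // for the remainder of the loop
    work += 1;            // executes only on the PEs still enabled
}
// After the loop all PEs are re-enabled; PE n ends with work = n+1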

63 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Porting C to Cn (Example 1)

C code

int i, j;
for( i = 0; i < 96; i++ ) {
    j = 2*i;
}

Similar Cn code

poly int i, j;
i = get_penum();   // i=0 on PE0, i=1 on PE1 etc.
j = 2*i;           // j=0 on PE0, j=2 on PE1 etc.

64 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Porting C to Cn (Example 2)

C code

int i;
for( i = 0; i < n; i++ ) {
    // ... loop body operating on element i (n is the problem size)
}

Similar Cn code

poly int me, i;
mono int npes;
me = get_penum();        // me=0 on PE0, me=1 on PE1 etc.
npes = get_num_pes();    // npes = 96
// i=0,96,192, … on PE0; 1,97,193, … on PE1; etc.
for( i = me; i < n; i += npes ) {
    // ... same loop body, now spread across the PEs
}

while (n) {
    memcpym2p (a, A + 24*get_penum(), 24*sizeof(double));
    A += 24*96;
    for (i = 0; i < 24; i++) {
        b[0] += a[i]*mat[0] + a[i+1]*mat[1];
        b[1] += a[i+1]*mat[0] + a[i]*mat[1];
        b[2] += a[i]*mat[2] - a[i+1]*mat[3];
        b[3] += a[i+1]*mat[2] - a[i]*mat[3];
    }
    n -= 24*96;
}

memcpyp2m (B + 4*get_penum(), b, 4*sizeof(double));
return;
}

66 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com ENVISION. ACCELERATE. ARRIVE.

Cn Libraries

67 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Runtime libraries

• Cn supports standard C runtime, including: – malloc – printf – sqrt – memcpy

Cn extensions include: – sqrtp – memcpym2p / memcpyp2m – get_penum – swazzle – any / all

68 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Asynchronous I/O

• For most efficient use of limited PE memory, overlap data transfers between mono memory and poly: – async_memcpym2p/p2m – sem_sig / sem_wait

For greatest efficiency, async_memcpy routines bypass the data cache, so coherency must be maintained: • dcache_flush / dcache_flush_address

69 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Asynchronous I/O example

void foo(double *A, double *B, int n)
{
    // Assume n is divisible by 24*96
    poly unsigned short penum = get_penum();
    poly double mat[4] = {1., 2., 3., 4.};
    poly double a_front[12], a_back[12];
    poly double b[4] = {0., 0., 0., 0.};
    int i;

    async_memcpym2p(19, a_front, A + 12*penum, 12*sizeof(double)); A += 12*96;
    n -= 24*96;
    while (n) {
        async_memcpym2p(17, a_back, A + 12*penum, 12*sizeof(double)); A += 12*96;
        sem_wait(19);
        for (i = 0; i < 12; i++) {
            b[0] += a_front[i]*mat[0] + a_front[i+1]*mat[1];
            b[1] += a_front[i+1]*mat[0] + a_front[i]*mat[1];
            b[2] += a_front[i]*mat[2] - a_front[i+1]*mat[3];
            b[3] += a_front[i+1]*mat[2] - a_front[i]*mat[3];
        }
        n -= 12*96;
        async_memcpym2p(19, a_front, A + 12*penum, 12*sizeof(double)); A += 12*96;
        sem_wait(17);
        for (i = 0; i < 12; i++) {
            … // compute on a_back, then finish outside while loop

70 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com ENVISION. ACCELERATE. ARRIVE.

Cn Pointers

71 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Cn — mono and poly pointers

• Using mono and poly with pointers:

mono int * mono mPmi    mono pointer to mono int
poly int * mono mPpi    mono pointer to poly int
mono int * poly pPmi    poly pointer to mono int
poly int * poly pPpi    poly pointer to poly int

• Most commonly used is mono pointer to poly poly * mono

72 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Cn — mono and poly pointers

• mono pointer to mono int: mono int * mono mPmi

(Diagram: both the pointer and the int it points to reside in mono memory.)

73 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Cn — mono and poly pointers

• mono pointer to poly int: poly int * mono mPpi

(Diagram: the pointer resides in mono memory and points to the same location in the poly memory of each PE.)

74 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Cn — mono and poly pointers

• poly pointer to poly int: poly int * poly pPpi

(Diagram: each PE stores its own copy of the pointer at the same location in its poly memory; each points to an int in that PE's poly memory.)

75 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Cn — mono and poly pointers

• poly pointer to mono int: mono int * poly pPmi

(Diagram: each PE stores its own pointer at the same location in its poly memory; each pointer can point to a different int in mono memory.)

76 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com ENVISION. ACCELERATE. ARRIVE.

Conditional Expressions

77 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Conditional Expressions: mono-if

• Conditions based on mono expressions
– Expression has the same value on all PEs
– Code block selected according to the expression, and a branch instruction is executed

mono int i, j;
i = j = 1;
if( i == j ) {
    // this block executed on all PEs
} else {
    // this block branched over on all PEs
}

78 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Conditional Expressions: poly-if

• Conditions based on poly expressions – Expression may have different values on different PEs – But SIMD model implies all PEs execute same instruction simultaneously – All branches executed on all PEs, with PE enabled if conditional expression is true (like predicated instructions)

poly int i;
i = get_penum();
if( i < 48 ) {
    // PEs 0, 1, 2, … execute instructions
    // PEs 48, 49, …: instructions issued but ignored
} else {
    // PEs 0, 1, 2, …: instructions issued but ignored
    // PEs 48, 49, … execute instructions
}

79 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Conditional Expressions: poly-while

• While loops based on poly expressions
– Loop continues execution until the condition is false on all PEs
– PEs are disabled one by one until the while condition is false on all PEs
– count keeps track of the total number of passes through the loop (95 with this code, since PE0 starts with me = 0 and PE95 starts with me = 95)

mono int count = 0;
poly int me;
me = get_penum();
while( me > 0 ) {
    --me;
    ++count;
}

80 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Other variations between C and Cn

• Labeled break and continue statements • No switch statement using poly variables (use multiple if statements) • No goto statement in poly context

81 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com ENVISION. ACCELERATE. ARRIVE.

Moving Data

82 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Data flow

• Board and host communicate via Linux kernel module or Windows driver • Create a handle and establish the connection

83 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Data flow

• Register intent of using the first processor on the card • Load the code onto the enabled processor

84 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Data flow

• Transfer data from host to board • Semaphores synchronize transfers between host and board

85 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Data flow

• Run the code on the enabled processor • Host can continue with other work

86 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Data flow

• Send results back to host • Halt board program and clean up

87 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Implicit broadcast from mono and poly

• Implicit broadcast from mono to poly by assignment

mono int m = 7;
poly int p;
p = m;   // Implicit broadcast to all PEs

• Assigning poly to mono is not permitted; gather through memory instead (see the sketch below)

mono int m;
poly int p = get_penum();
m = p;   // NO! m would receive a different value from each PE
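A minimal sketch of that gather (variable names are illustrative): each PE copies its value to a different address in mono memory with memcpyp2m, and the reduction is then done serially in mono code.

poly int p = get_penum();      // per-PE value to be reduced
mono int gathered[96];         // one slot per PE in mono memory
mono int total = 0;
mono int i;

// Each PE writes its value to its own slot (a different mono address per PE)
memcpyp2m( gathered + p, &p, sizeof(int) );

for( i = 0; i < 96; i++ )      // serial sum on the mono side
    total += gathered[i];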

88 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Explicit data movement – mono to poly memcpym2p(); async_memcpym2p() • Memory copy of n bytes from mono to poly – Source is a poly pointer to mono memory, which can have a different value for each PE – Destination is a mono pointer to poly memory, that is destination address is the same for all PEs

Source data in mono memory

Same destination on each PE

PE0 PE1 PE2 PE95 89 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Explicit data movement – poly to mono memcpyp2m(); async_memcpyp2m() • Memory copy of n bytes from poly to mono – Source is a mono pointer to poly memory; therefore source address is the same for every PE – Destination is a poly pointer to mono memory, which can have a different value for each PE

Destination data in mono memory

Same source address on each PE

PE0 PE1 PE2 PE95 90 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Explicit data movement – asynchronous async_memcpym2p(); async_memcpyp2m() • Asynchronous memory copy of n bytes from mono to poly or from poly to mono – Computation continues during data copy – Mono memory data cache NOT flushed – Restrictions on alignment of data – Use semaphores to wait for completion of copy – Much higher bandwidth than synchronous versions

dcache_flush();
async_memcpym2p( semaphore, … );
// computation continues
sem_wait( semaphore );
// use data that has been transferred from mono memory

91 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Explicit data movement – swazzle

• Register-to-register transfer between neighboring PE’s

(Diagram: PE n, showing the ALU, status flags, enable state, register file, and memory stack; the register file connects to PE n-1 and PE n+1 over the swazzle path.)

92 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Swazzle operations

• Assembly language versions operate directly on the register file
• Cn versions operate on data and include implicit data movement from memory to registers
• Variants:
– swazzle_up( poly int src );    // copy to higher numbered PE
– swazzle_down( poly int src );  // copy to lower numbered PE
– swazzle_up_generic( poly void *dst, poly void *src, unsigned int size );
– swazzle_down_generic( … );
– Similar swazzles operating on other data types
– Functions to set the data copied into the ends of the swazzle chain
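A small illustration using the generic variant (values are illustrative; the behaviour assumed here is that swazzle_up_generic copies each PE's source into the destination on its higher-numbered neighbour, so PE n receives the value produced on PE n-1, with the ends of the chain set by the boundary functions mentioned above):

poly int mine = get_penum() * 10;   // some per-PE value
poly int from_below = 0;

// One step along the swazzle path: PE n receives PE n-1's value
swazzle_up_generic( &from_below, &mine, sizeof(int) );

poly int smoothed = ( mine + from_below ) / 2;   // e.g. nearest-neighbour average
                                                 // (chain ends need special care)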

93 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Data movement bandwidths per CSX600

• Mono memory to poly memory — 2.7 GB/s aggregate over 96 PEs • Poly memory to registers — 840 MB/s per PE, 81 GB/s aggregate • Swazzle path bandwidth — 1680 MB/s per PE, 161 GB/s aggregate • Total bandwidth for Advance board (2 CSX600 processors) ~0.5 TB/s

94 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com DMA performance

• Advance board has a host DMA controller which can act as a PCI bus master • All DMA transfers are at least 8-byte aligned • Host DMA engine will attempt to use the entire bus bandwidth

ClearSpeed Advance DMA Performance

(Chart: DMA transfer rate in MB/s versus transfer size from 2.0 to 7.8 MB, for e620 read, e620 write, X620 read, and X620 write averages.)

95 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com ENVISION. ACCELERATE. ARRIVE.

CSAPI Host - Board communication

96 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Host-Board interaction basics

• The basic model for interaction between the host and the card is very simple:

• The ClearSpeed board can signal and wait for semaphores; it cannot initiate data transactions with the host.

• The host pushes data to and pulls data from the board.

• The host can also signal and receive semaphores.

97 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Connecting to the board

• A host application needs to perform the following sequence to launch a process on the board:
– Create a CSAPI handle: CSAPI_new
– Establish a connection with the board: CSAPI_connect
– Register the host application with the driver: CSAPI_register_application
– Load the CSX application on the desired chip: CSAPI_load
– Run the CSX application on the desired chip: CSAPI_run

98 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Interacting with the board

– Get the board memory address of a known symbol: CSAPI_get_symbol_value
• This must be done after the application is loaded, if the dynamic load capability is to be used.
– Write/read data to a retrieved memory address: CSAPI_write_mono_memory, CSAPI_read_mono_memory
• Asynchronous variants of these routines also exist
• A process does not need to be running for these operations to succeed, but the process needs to be loaded.
• These should not be performed DURING process termination.
– Managing semaphores:
• CSAPI_allocate_shared_semaphore: declares a semaphore for use on both host and card
• CSAPI_semaphore_wait
• CSAPI_semaphore_signal

99 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Cleaning up

– Process termination • CSAPI_wait_on_terminate • CSAPI_get_return_value – Clean-up • CSAPI_delete

– See CSX600 Runtime Software User Guide for more details, including: • managing multiple processes on the board/chip at once • managing board control registers • board reset • managing multi-threaded CSX applications • board memory allocation • managing multiple boards/chips • error handling
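Pulling the three preceding slides together, a minimal host-side flow might look like the sketch below. Only the CSAPI entry-point names come from the slides; the argument lists, the handle type, and the buffer/symbol names are simplified assumptions, so consult the CSX600 Runtime Software User Guide for the exact signatures and error handling.

/* Illustrative host-side sequence only; argument lists are assumptions. */
CSAPI *cs = CSAPI_new();                            /* create a handle          */
CSAPI_connect(cs);                                  /* connect to the board     */
CSAPI_register_application(cs);                     /* register with the driver */
CSAPI_load(cs, 0, "kernel.csx");                    /* load CSX app on chip 0   */

double host_data[1024];                             /* example host buffer      */
void  *sym;
CSAPI_get_symbol_value(cs, "input_buffer", &sym);   /* must be done after load  */
CSAPI_write_mono_memory(cs, sym, host_data, sizeof(host_data)); /* push input   */

CSAPI_run(cs, 0);                                   /* start the board process  */
/* ... host keeps working, synchronising via shared semaphores ...              */

CSAPI_read_mono_memory(cs, sym, host_data, sizeof(host_data));  /* pull results */
CSAPI_wait_on_terminate(cs);                        /* wait for CSX termination */
CSAPI_delete(cs);                                   /* clean up                 */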

100 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com ENVISION. ACCELERATE. ARRIVE.

Debugging Cn

101 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com csgdb

– csgdb is a port of the open source gdb debugger – full symbolic debugging of mono/poly variables – full gdb breakpoint support – step through Cn or assembly – views mono and poly registers – views PE enabled state – also accessible via DDD • DDD allows graphical data visualization

102 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Debug control

– To enable debugging:
• export CS_CSAPI_DEBUGGER=1
– initializes the debug interface within the host application
• export CS_CSAPI_DEBUGGER_ATTACH=1
– the host application will then write a port number to stdout and wait for Return to be pressed, so that csgdb can be manually attached to the connected board process
– Launch the host application
• This can be done with or without a debugger.
– Launch csgdb in a new shell, giving it the port number reported by the host application
• No need to "connect", as the host application did this already
• set desired breakpoints
• run
– Note that the host is currently blocked waiting for Return, so the card process may also be blocked waiting for the host.
– Press Return in the host shell for the host and card applications to proceed.

103 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com csgdb Debugger (Shown with ddd Front-end)

On-chip poly array contents displayed

Real time plot of contents of PE memory

Cn source-level break point, watch points, single step, etc.

Register contents

Disassembly, break point, watch points, single step, etc.

104 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com csgdb Command-line example

% cscn foo.cn –g –o foo.csx % csgdb ./foo.csx • (gdb) connect • 0x80000000 in __FRAME_BEGIN_MONO__ () • (gdb) break 109 • Breakpoint 1 at 0x800154c0: file foo.cn, line 109. • (gdb) run • Starting program: /home/kris/my_app/foo.csx • Breakpoint 1, main () at foo.cn:109 • (gdb) next • 110 y = MINY + (get_penum() * STEPY); • (gdb) print y • $1 = {-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1}

105 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com ENVISION. ACCELERATE. ARRIVE.

ClearSpeed Visual Profiler Explaining Performance

106 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com ClearSpeed Visual Profiler (csvprof)

– Host tracing • Trace CSAPI function • User can infer overlapping host/board utilization • Locate hot-spots – Board tracing • Trace board side functions without instrumentation • Locate hot-spots – Board hardware utilization • Display activity of csx functional units including: – ld/st – Instruction cache – Pi/o – Data cache – SIMD microcode – Thread • Cycle accurate • View corresponding source – Unified GUI

107 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Detailed profiling is essential for accelerator tuning

HOST CODE PROFILING: Visually inspect multiple host threads. Time specific code sections. Check overlap of host threads.

HOST/BOARD INTERACTION: Infer cause and effect. Measure transfer bandwidth. Check overlap of host and board compute.

ACCELERATOR PIPE: View instruction issue. Visualize overlap of executing instructions. Get cycle-accurate timing. Remove instruction-level performance bottlenecks.

CSX600 SYSTEM: Trace at system level. Inspect overlap of compute and I/O. View cache utilization. Graph performance.

(Diagram: host CPU(s) connected to the Advance accelerator board with its two CSX600 pipelines, annotated at the four levels above.)

108 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com csvprof: Host tracing

• Dynamic loading of CSAPI Trace implementation • Triggered with an environment variable: – export CS_CSAPI_TRACE=1 » Recall similar enabling of debug support: » export CS_CSAPI_DEBUGGER=1

• Specify tracing format: – export CS_CSAPI_TRACE_CSVPROF=1 – currently this is the only implementation, but in the future…

• Specify output file for trace: – export CS_CSAPI_TRACE_CSVPROF_FILE=mytrace.cst – default filename = csvprof_data.cst

• Output file written during CSAPI_delete

109 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com csvprof: Host-Board interaction

110 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com csvprof: Host code profile – Linpack benchmark

111 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com csvprof: CSX600 system profile

112 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com csvprof: Accelerator pipeline profile

113 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com csvprof: Instruction pipeline stalls

114 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com csvprof: Advance board tracing

– Enabled using the debugger, csgdb • Can use interactively or through gdb script

– Can select events to profile, or all events

– Requires buffer allocation on the card • Today, this is done statically • One could use CSAPI to allocate buffer, but developer must get location and size of the buffer to user to be entered for csgdb • Easy if running only on one chip, place buffer in the other chip’s memory

– Explicit dump to generate trace file • Can control the type of data to be dumped

115 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com csvprof: Sample gdb script

• % cat ./csgdb_trace.gdb • connect • load ./foo.csx

• cstrace buffer 0x60000000 0x1000000

• cstrace event all on

• tbreak test_me

• continue

• cstrace enable

• continue

• cstrace dump foo.cst

• cstrace dump branch dgemm_test4_branch.cst

• quit

• % csgdb –command=./csgdb_trace.gdb

116 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com ENVISION. ACCELERATE. ARRIVE.

Tuning Tips

117 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Pipelined arithmetic

• Four-stage floating-point pipeline • Use vector types, vector intrinsic functions, and vector math library for high efficiency

__DVECTOR a, b, c;
poly double x[N];

a = *((__DVECTOR *)&x[0]);
b = *((__DVECTOR *)&x[4]);
c = cs_sqrt( __cs_vadd( a, b ) );

118 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Poly conditionals

• When possible, remove common sub-expressions from poly if-blocks to reduce the amount of replicated work (see the small example below).
• It may even pay to compute and throw away results if that leads to fewer poly conditional blocks.
• A poly if-block uses predicated instructions, not a branch, so it is cheap as long as few additional instructions are executed.
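For example (an illustrative fragment), hoist the shared arithmetic so it runs once with all PEs enabled, leaving only the genuinely divergent statements inside the predicated blocks:

poly int me = get_penum();
poly double x, base;

base = (double)me * 0.5;      // common sub-expression hoisted out:
                              // executed once with all PEs enabled
if( me < 48 ) {
    x = base + 1.0;           // only the differing statements remain
} else {                      // inside the predicated poly if/else
    x = base - 1.0;
}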

119 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Poly loop counters

• Loops with poly counters are more expensive than those with mono counters • Use mono loop counters if possible

120 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Arrays

• Pointer incrementing is more efficient than using array index notation • Poly addresses require 16 bits • Use short for poly pointer increments – This avoids conversion of int to short
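Both tips in one small fragment (array name and length are illustrative): walk the poly array through a pointer rather than indexing it, and keep any variable added to a poly pointer in a short so no int-to-short conversion is needed.

poly double buf[256];              // per-PE data
poly double * mono pp = buf;       // pointer to poly data (16-bit address)
poly double sum = 0.0;
mono short stride = 1;             // short increment: no int-to-short conversion
mono int i;                        // mono loop counter (cheaper than poly)

for( i = 0; i < 256; i++ ) {
    sum += *pp;                    // pointer walk avoids recomputing the
    pp += stride;                  // address of buf[i] on every iteration
}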

121 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Data transfer

• Synchronous functions are completely general – flush the data cache each transfer • memcpyp2m() • memcpym2p() • Asynchronous functions maximize performance – do not flush cache – have data size and alignment restrictions – require use of wait semaphore • async_memcpyp2m(); sem_wait() • async_memcpym2p(); sem_wait() • Large data blocks are more efficient than small blocks – Host to board – Board to host – Mono to poly – Poly to mono

122 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com ENVISION. ACCELERATE. ARRIVE.

Application Examples

123 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Math function speed comparison

(Chart: 64-bit function operations per second, in billions, for Sqrt, InvSqrt, Exp, Ln, Cos, Sin, SinCos, and Inv, comparing a 2.6 GHz dual-core Opteron, a 3 GHz dual-core Woodcrest, and the ClearSpeed Advance card.)

Typical speedup of ~8X over the fastest x86 processors, because math functions stay in local memory on the card

124 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Nucleic Acid Builder (NAB)

• Newton-Raphson refinement now possible; large DGEMM calls from computed second derivatives will be in AMBER 10 • 2.5x speedup obtained for this operation in three hours of programmer effort • Enables accurate computation of entropy and Gibbs Free Energy for first time • AMBER itself has cases that ClearSpeed accelerates by 3.2x to 9x, with 5x to 17x possible once we exploit symmetry of atom- atom interactions

125 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com AMBER molecular modeling with ClearSpeed

AMBER Generalized Born Models 1, 2, and 6

(Chart: run time in minutes for AMBER Generalized Born models 1, 2, and 6 on the host versus the Advance X620.)

AMBER model          Host        Advance X620    Speedup
Generalized Born 1   83.5 min    24.6 min        3.4×
Generalized Born 2   84.6 min    23.5 min        3.6×
Generalized Born 6   37.9 min    4.0 min         9.4×

126 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Monte Carlo methods exploit high local bandwidth • Monte Carlo methods are ideal for ClearSpeed acceleration: – High regularity and locality of the algorithm – Very high compute to I/O ratio – Very good scalability to high degrees of parallelism – Needs 64-bit • Excellent results for parallelization – Achieving 10x performance per Advance card vs. highly optimized code on the fastest x86 CPUs available today – Maintains high precision required by the computations • True 64-bit IEEE 754 floating point throughout – 25 W per card typical when card is computing • ClearSpeed has a Monte Carlo example code, available in source form for evaluation

127 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Monte Carlo applications scale very well

• No acceleration: 200M samples, 79 seconds
• 1 Advance board: 200M samples, 3.6 seconds
• 5 Advance boards: 200M samples, 0.7 seconds

(Chart: European Option Pricing Model speedup versus number of ClearSpeed Advance boards, 0 to 5.)

128 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Why do Monte Carlo applications need 64-bit?

• Accuracy increases as the square root of the number of trials, so five-decimal accuracy takes 10 billion trials.
• But when you sum many similar values, you start to lose all the significant digits.
• 64-bit summation is needed to get even a single-precision result!

Single precision: 1.0000x10^8 + 1 = 1.0000x10^8
Double precision: 1.0000x10^8 + 1 = 1.00000001x10^8
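The box above is easy to reproduce with a few lines of plain C (nothing ClearSpeed-specific): the single-precision sum silently absorbs the increment, while the 64-bit sum keeps it.

#include <stdio.h>

int main(void)
{
    float  s32 = 1.0e8f;
    double s64 = 1.0e8;

    s32 += 1.0f;    /* 1 is below single-precision resolution at 1e8 */
    s64 += 1.0;     /* double precision still resolves the increment */

    printf("float : %.1f\n", s32);   /* 100000000.0 */
    printf("double: %.1f\n", s64);   /* 100000001.0 */
    return 0;
}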

129 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com ENVISION. ACCELERATE. ARRIVE.

Help and Support

130 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Installed documentation

• docs directory – CSXL user guide – runtime user guide – csvprof Visual Profiler overview and examples – SDK • getting started • gdb manual • instruction set manual • Cn library manual

• reference manual – release notes • examples directory

131 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com ClearSpeed online

• General information, news, etc. – Company website www.clearspeed.com

• Report a problem, find answers, etc. – Support website support.clearspeed.com

• Support website has: – Documentation, user guides, reference manuals – Solutions knowledge base – Software downloads – Log a case

132 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com Join the ClearSpeed Developer Program!

• Designed to support the leading-edge community of developers using accelerators • Membership is free and has the following benefits: – Access to the ClearSpeed Developer website – ClearSpeed Developer Community on-line forum – Invitation to participate in ClearSpeed Developer & User Community meetings and events – Repository to share and access demonstrations and sample codes within the ClearSpeed Developer Community – Technical updates, tips and tricks from the gurus at ClearSpeed and the Developer Community – And more, including opportunities to preview new software releases and developer discount programs. • Leverage the expertise of developers worldwide. • Ask a question, or share your knowledge. • Register now at developer.clearspeed.com !

133 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com 134 Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com