COMP 635: Seminar on Heterogeneous Processors

www.cs.rice.edu/~vsarkar/comp635

Vivek Sarkar

Department of Computer Science Rice University

[email protected]

August 27, 2007

Course Goals

• Gain familiarity with heterogeneous processor systems by studying a few sample design points in the spectrum
• Study and critique current software environments for these designs (programming models, compilers, tools, runtimes)
• Discuss research challenges in advancing the state of the art of software for heterogeneous processors
• Target audience: software, hardware, and application researchers interested in building or using heterogeneous processor systems, or understanding strengths and weaknesses of heterogeneous processors w.r.t. their research areas

COMP 635, Fall 2007 (V.Sarkar)

Course Organization

• Class dates (12 lectures)
 — 8/27, 9/10, 9/20 (Thurs), 9/24, 10/1, 10/8, 10/22, 10/29, 11/5, 11/19, 11/26, 12/3
 — No classes on 9/3 (Labor Day), 10/15 (Midterm Recess), 11/12 (Supercomputing 2007 conference week)
 — No class on 9/17 (Mon); we will meet on 9/20 (Thurs) instead that week
• Time & Place
 — Default: Mondays, 3:30pm - 4:30pm, DH 2014
 — Exception: time & place for 9/20 (Thurs) lecture TBD
 — 30 minutes reserved after lecture for discussion (optional)
• Office Hours (DH 3131)
 — 11am - 12noon, Fridays from 8/31/07 to 12/7/07
• OWL-Space repository: COMP 635 F07
• Grading
 — Satisfactory/unsatisfactory grade for students taking the seminar for credit; others should register officially as auditors, if possible
 — For a satisfactory grade, you need to
  1. Attend at least 50% of lectures
  2. Submit a 4-page project/study report by 12/7/07 (the report can be prepared in a group; just plan on 4 pages/person in that case)
 — Optional in-class presentation of project/study report on 12/3/07

Course Content

• Introduction to Heterogeneous Processors and their Programming Models (1 lecture)
• Cell Processor and Cell SDK (2 lectures)
• Nvidia GPU and CUDA programming environment (2 lectures)
• DRC FPGA Module and Celoxica Programming Environment (1 lecture)
• ClearSpeed Accelerator and SDK (1 lecture)
• Imagine Stream Processor (1 lecture)
• Microsoft Accelerator Library (1 lecture)
• Vector and SIMD processors -- a historical perspective (1 lecture)
• Programming Model and Runtime Desiderata for future Heterogeneous Processors (1 lecture)
• Student presentations (1 lecture)

COMP 635 Lecture 1: Introduction to Heterogeneous Processors and their Programming Models


Acknowledgments

• Georgia Tech ECE 6100, Module 14
 — Vince Mooney, Krishna Palem, Sudhakar Yalamanchili
 — http://www.ece.gatech.edu/academic/courses/fall2006/ece6100/Class/index.html

• MIT 6.189 IAP 2007, Lecture 2
 — “Introduction to the Cell Processor”, Michael Perrone
 — http://cag.csail.mit.edu/ps3/lectures/6.189-lecture2-cell.pdf

• UIUC ECE 497, Lecture 16
 — courses.ece.uiuc.edu/ece412/lectures/lecture16.ppt

• UIUC ECE 498 AL1, Programming Massively Parallel Processors
 — David Kirk, Wen-mei Hwu
 — http://courses.ece.uiuc.edu/ece498/al1/Syllabus.html

Heterogeneous Processors

[Figure: a general-purpose processor (GPP) and several accelerators (ACC), each with a local memory, connected to main memory through a memory transfer module]

• General-purpose processor orchestrates activity; the memory transfer module schedules system-wide bulk data movement
• Accelerators can use scheduled, streaming communication, or can operate on locally-buffered data pushed to them in advance
• Accelerated activities and associated private data are localized for bandwidth, power, and efficiency

Motivation:
1) Different parts of programs have different requirements
 — Control-intensive portions need good branch predictors, speculation, and big caches to achieve good performance
 — Data-processing portions need lots of ALUs and have simpler control flows
2) Power consumption
 — Features like branch prediction and out-of-order execution tend to have very high power/performance ratios
 — Applications often have time-varying performance requirements

Sample Application Domains for Heterogeneous Processors

• Cell Processor
 — Medical imaging, Drug discovery, Reservoir modeling, Seismic analysis, …
• GPU (e.g., Nvidia)
 — Computer-aided design (CAD), Digital content creation (DCC), emerging HPC applications, …
• FPGA (e.g., DRC)
 — HPC, Petroleum, Financial, …
• HPC accelerators (e.g., ClearSpeed)
 — HPC, Network processing, Graphics, …
• Stream Processors (e.g., Imagine)
 — Image processing, Signal processing, Video, Graphics, …
• Others
 — TCP/IP offload, Crypto, …

Programming Models for Heterogeneous Processors

• Data Parallelism
• Single Program Multiple Data (SPMD)
• Pipelining
• Work Queue
• Fork-Join
• Message Passing
• Storage Models: Shared vs. Local vs. Partitioned Memories
• Hybrid combinations of the above

Only a limited subset of these models is in production use today ==> programming model implementations for heterogeneous processors will have to grow to accommodate new application domains and new classes of programmers

Heterogeneous Processor Spectrum

Dimension 1: Distance of accelerator from main processor

Heterogeneous Multicore

Dimension 2: Hardware customization in accelerator

Heterogeneous Processor Spectrum

Dimension 1: Distance of accelerator from main processor (focus of this course)

Heterogeneous Multicore

Dimension 2: Hardware customization in accelerator (focus of this course)

Spectrum of Programmers for Heterogeneous Processors

• Application-level Users
 — Plug & play experience by using ISV frameworks such as MATLAB, Mathematica, etc.
• Library-level Programmers
 — Portable library interface that works across homogeneous and heterogeneous processors
• Language-level Programmers
 — Portable programming language that works across homogeneous and heterogeneous processors
 — Conspicuous lack of new languages for heterogeneous processors, especially languages with managed runtimes!
• SDK-level Programmers
 — C-based compilers and tools that are specific to a given heterogeneous processor


Cell Broadband Engine (BE)

Cell Performance


Cell Temperature Distribution

Power and heat are key constraints

Code Partitioning for Cell

[Figure: program flow graph and call graph; nodes are compiled for the PPE or the SPE, with parallel loops outlined and cloned]

• Outlining: extract parallel loop into a separate procedure
• Cloning: make separate copies for PPE and SPE, including clones of all procedures called from the loop
• Coordination: insert operations on signal registers and mailbox queues in PPE and SPE codes
• Reference: “Using advanced compiler technology to exploit the performance of the Cell Broadband Engine architecture”, A. Eichenberger et al., IBM Systems Journal, Vol. 45, No. 1, 2006

Why GPUs?

• A quiet revolution and potential build-up
 — Calculation: 367 GFLOPS vs. 32 GFLOPS
 — Memory Bandwidth: 86.4 GB/s vs. 8.4 GB/s
 — Until last year, programmed through graphics API

 — GPU in every PC and workstation – massive volume and potential impact

Sample GPU Applications

Application | Description | Source | Kernel | % time
H.264 | SPEC ’06 version, change in guess vector | 34,811 | 194 | 35%
LBM | SPEC ’06 version, change to single precision and print fewer reports | 1,481 | 285 | >99%
RC5-72 | Distributed.net RC5-72 challenge client code | 1,979 | 218 | >99%
FEM | Finite element modeling, simulation of 3D graded materials | 1,874 | 146 | 99%
RPES | Rye Polynomial Equation Solver, quantum chem, 2-electron repulsion | 1,104 | 281 | 99%
PNS | Petri Net simulation of a distributed system | 322 | 160 | >99%
SAXPY | Single-precision implementation of saxpy, used in Linpack’s Gaussian elim. routine | 952 | 31 | >99%
TRACF | Two Point Angular Correlation Function | 536 | 98 | 96%
FDTD | Finite-Difference Time Domain analysis of 2D electromagnetic wave propagation | 1,365 | 93 | 16%
MRI-Q | Computing a matrix Q, a scanner’s configuration in MRI reconstruction | 490 | 33 | >99%

Performance of Sample Kernels and Applications

• GeForce 8800 GTX vs. 2.2GHz Opteron 248
• 10× speedup in a kernel is typical, as long as the kernel can occupy enough parallel threads
• 25× to 400× speedup if the function’s data requirements and control flow suit the GPU and the application is optimized
• Keep in mind that the speedup also reflects how suitable the CPU is for executing the kernel

Source: Slide 21, Lecture 1, UIUC ECE 498, David Kirk & Wen-mei Hwu, http://courses.ece.uiuc.edu/ece498/al1/lectures/lecture1%20intro%20fall%202007.ppt

FPGAs: Basics of FPGA Offload

Source: “Compiling Software Code to FPGA-based Accelerator Processors for HPC Applications” by Doug Johnson, [email protected], gladiator.ncsa.uiuc.edu/PDFs/rssi06/presentations/14_Doug_Johnson.pdf


FPGA Acceleration Examples

ClearSpeed Multi-Threaded Array Processor (MTAP)

• Hardware multi-threading for latency tolerance
• Asynchronous, overlapped I/O
• Poly execution unit contains 96 Processor Elements (PEs), or cores
• Array of PEs operates in a synchronous manner, i.e. each PE executes the same instruction on its own data

Source: “Accelerating HPC Applications with ClearSpeed” by Daniel Kliger, [email protected], www.cse.scitech.ac.uk/disco/mew17/talks/ClearSpeed%20 Daresbury%20MEW%202006.pdf


ClearSpeed Linpack Results

• Standard System
 — Two 3.0 GHz Xeon 5160 (Woodcrest) dual-core processors, 16GB memory per node
  – Single server: 34 GFLOPS
  – Four-node cluster: 136 GFLOPS
  – Power consumption: 1,940 Watts
  – Benchmark runtime: 48.4 minutes
• ClearSpeed Accelerated System
 — Add two Advance accelerator boards per node (25W per board!)
  – Single server: 90.1 GFLOPS
  – Four-node cluster: 364.2 GFLOPS
  – Power consumption: 2,140 Watts
  – Benchmark runtime: 18.4 minutes

ClearSpeed’s CSXL acceleration library

The CSXL acceleration library intercepts and accelerates calls to functions in the Basic Linear Algebra Subprograms (BLAS) library. These include Level 3 BLAS DGEMM calls and LAPACK DGETRF calls.

Imagine Stream Processor

Transforming Memory Accesses to Communication for Scalability

Software challenge: deliver productivity of shared memory model, combined with scalability of communication model


Example of how Compilers can Help

Opportunity for new languages to reduce compiler effort and broaden applicability

Source: UIUC ECE 497, courses.ece.uiuc.edu/ece412/lectures/lecture16.ppt

Code Partitioning for Heterogeneous Processors

• Factors to consider when extracting a region of code for execution on an accelerator
 — Matching operations in the code region with primitives in the accelerator (includes instruction selection and FPGA synthesis)
 — Establishing coherence between main and local memories
 — Obeying local memory size constraints
 — Volume of data to be communicated
 — Granularity of the region relative to the overhead of thread creation
 — Structural constraints of the task/thread being extracted
 — Cloning of code that needs to be executed on multiple elements
 — Coordination with the rest of the program (coroutine vs. macro-dataflow models)
 — . . .


Reading List for Next Lecture (Sep 10th)

1. “Using advanced compiler technology to exploit the performance of the Cell Broadband Engine architecture”, A. Eichenberger et al., IBM Systems Journal, Vol. 45, No. 1, 2006, http://researchweb.watson.ibm.com/journal/sj/451/eichenberger.pdf
2. “Dynamic Multigrain Parallelization on the Cell Broadband Engine”, F. Blagojevic et al., PPoPP 2007 Best Paper, March 2007, http://portal.acm.org/ft_gateway.cfm?id=1229445&type=pdf&coll=portal&dl=ACM&CFID=14018324&CFTOKEN=91433508

Announcement: Kickoff Meeting for Habanero Multicore Software Research Project

Habanero is a new research project focused on Multicore Software. Its scope will span programming languages, compilers, virtual machines, and low-level runtime systems, and is synergistic with the expertise we have in various CS groups at Rice including the Parallel Compilers, Scalar Compilers, Programming Language Technologies, and Systems groups. A kickoff meeting for the Habanero project is scheduled for 1pm - 2:30pm on Wednesday, August 29th in DH 3076. Cookies will be served!


BACKUP SLIDES START HERE

Freescale MPC8572 PowerQUICC III Processor

• Dual embedded e500 cores, 36-bit physical addressing
• Double-precision floating-point
• Integrated L1/L2 cache
 — L1 cache: 32 KB data and 32 KB instruction
 — Shared L2 cache: 1 MB with ECC
 — L2 configurable as SRAM and cache; I/O transactions can be stashed into L2 cache regions
• Integrated DDR memory controller with full ECC support
• Integrated security engine, Pattern Matching Engine, Packet Deflate Engine
• Four on-chip triple-speed Ethernet controllers


Freescale MPC8572 PowerQUICC III Processor

Source: Freescale

AMD’s use of HyperTransport (Torrenza)

• “Torrenza” technology
 — Allows licensing of coherent HyperTransport™ to 3rd-party manufacturers to make socket-compatible accelerators/co-processors
 — Allows 3rd-party PPUs (Physics Processing Units), GPUs, and co-processors to access main system memory directly and coherently
 — Could make the accelerator programming model easier to use than, say, the Cell processor’s, where each SPE cannot directly access main memory
