COMP 635: Seminar on Heterogeneous Processors

www.cs.rice.edu/~vsarkar/comp635

Vivek Sarkar

Department of Computer Science Rice University

[email protected]

August 27, 2007

Course Goals

• Gain familiarity with heterogeneous processor systems by studying a few sample design points in the spectrum
• Study and critique current software environments for these designs (programming models, compilers, tools, runtimes)
• Discuss research challenges in advancing the state of the art of software for heterogeneous processors
• Target audience: software, hardware, and application researchers interested in building or using heterogeneous processor systems, or understanding strengths and weaknesses of heterogeneous processors w.r.t. their research areas

COMP 635, Fall 2007 (V.Sarkar)

Course Organization

• Class dates (12 lectures)
 — 8/27, 9/10, 9/20 (Thurs), 9/24, 10/1, 10/8, 10/22, 10/29, 11/5, 11/19, 11/26, 12/3
 — No classes on 9/3 (Labor Day), 10/15 (Midterm Recess), 11/12 (Supercomputing 2007 conference week)
 — No class on 9/17 (Mon); we will meet on 9/20 (Thurs) instead that week
• Time & Place
 — Default: Mondays, 3:30pm - 4:30pm, DH 2014
 — Exception: time & place for 9/20 (Thurs) lecture TBD
 — 30 minutes reserved after lecture for discussion (optional)
• Office Hours (DH 3131)
 — 11am - 12noon, Fridays from 8/31/07 to 12/7/07
• OWL-Space repository: COMP 635 F07
• Grading
 — Satisfactory/unsatisfactory grade for students taking the seminar for credit; others should register officially as auditors, if possible
 — For a satisfactory grade, you need to
  1. Attend at least 50% of lectures
  2. Submit a 4-page project/study report by 12/7/07 (the report can be prepared in a group; just plan on 4 pages/person in that case)
 — Optional in-class presentation of project/study report on 12/3/07

Course Content

• Introduction to Heterogeneous Processors and their Programming Models (1 lecture)
• Cell Processor and Cell SDK (2 lectures)
• Nvidia GPU and CUDA programming environment (2 lectures)
• DRC FPGA Module and Celoxica Programming Environment (1 lecture)
• ClearSpeed Accelerator and SDK (1 lecture)
• Imagine Stream Processor (1 lecture)
• Microsoft Accelerator Library (1 lecture)
• Vector and SIMD processors -- a historical perspective (1 lecture)
• Programming Model and Runtime Desiderata for future Heterogeneous Processors (1 lecture)
• Student presentations (1 lecture)

COMP 635 Lecture 1: Introduction to Heterogeneous Processors and their Programming Models


Acknowledgments

• Georgia Tech ECE 6100, Module 14
 — Vince Mooney, Krishna Palem, Sudhakar Yalamanchili
 — http://www.ece.gatech.edu/academic/courses/fall2006/ece6100/Class/index.html

• MIT 6.189 IAP 2007, Lecture 2
 — “Introduction to the Cell Processor”, Michael Perrone
 — http://cag.csail.mit.edu/ps3/lectures/6.189-lecture2-cell.pdf

• UIUC ECE 497, Lecture 16
 — courses.ece.uiuc.edu/ece412/lectures/lecture16.ppt

• UIUC ECE 498 AL1, Programming Massively Parallel Processors
 — David Kirk, Wen-mei Hwu
 — http://courses.ece.uiuc.edu/ece498/al1/Syllabus.html

Heterogeneous Processors

[Figure: a general-purpose processor (GPP) and several accelerators (ACC), each with a local memory, connected to main memory through a memory transfer module]

• General-purpose processor orchestrates activity; the memory transfer module schedules system-wide bulk data movement
• Accelerators can use scheduled, streaming communication, or can operate on locally-buffered data pushed to them in advance
• Accelerated activities and associated private data are localized for bandwidth, power, and efficiency

Motivation:
1) Different parts of programs have different requirements
 — Control-intensive portions need good branch predictors, speculation, and big caches to achieve good performance
 — Data-processing portions need lots of ALUs and have simpler control flows
2) Power consumption
 — Features like branch prediction and out-of-order execution tend to have very high power/performance ratios
 — Applications often have time-varying performance requirements

Sample Application Domains for Heterogeneous Processors

• Cell Processor
 — Medical imaging, Drug discovery, Reservoir modeling, Seismic analysis, …
• GPU (e.g., Nvidia)
 — Computer-aided design (CAD), Digital content creation (DCC), emerging HPC applications, …
• FPGA (e.g., DRC)
 — HPC, Petroleum, Financial, …
• HPC accelerators (e.g., ClearSpeed)
 — HPC, Network processing, Graphics, …
• Stream Processors (e.g., Imagine)
 — Image processing, Signal processing, Video, Graphics, …
• Others
 — TCP/IP offload, Crypto, …

Programming Models for Heterogeneous Processors

• Data Parallelism
• Single Program Multiple Data (SPMD)
• Pipelining
• Work Queue
• Fork-Join
• Message Passing
• Storage Models: Shared vs. Local vs. Partitioned Memories
• Hybrid combinations of the above

Only a limited subset of these models is in production use today ==> programming model implementations for heterogeneous processors will have to grow to accommodate new application domains and new classes of programmers

Heterogeneous Processor Spectrum

Dimension 1: Distance of accelerator from main processor

Heterogeneous Multicore

Dimension 2: Hardware customization in accelerator

Heterogeneous Processor Spectrum

Dimension 1: Distance of accelerator from main processor (focus of this course)

Heterogeneous Multicore

Dimension 2: Hardware customization in accelerator (focus of this course)

Spectrum of Programmers for Heterogeneous Processors

• Application-level Users
 — Plug & play experience by using ISV frameworks such as MATLAB, Mathematica, etc.
• Library-level Programmers
 — Portable library interface that works across homogeneous and heterogeneous processors
• Language-level Programmers
 — Portable programming language that works across homogeneous and heterogeneous processors
 — Conspicuous lack of new languages for heterogeneous processors, especially languages with managed runtimes!
• SDK-level Programmers
 — C-based compilers and tools that are specific to a given heterogeneous processor


Cell Broadband Engine (BE)

Cell Performance


Cell Temperature Distribution

Power and heat are key constraints

Code Partitioning for Cell

[Figure: program flow graph and call graph; nodes are compiled for the PPE or the SPE, with parallel loops outlined and cloned]

• Outlining: extract parallel loop into a separate procedure
• Cloning: make separate copies for PPE and SPE, including clones of all procedures called from the loop
• Coordination: insert operations on signal registers and mailbox queues in PPE and SPE codes
• Reference: “Using advanced compiler technology to exploit the performance of the Cell Broadband Engine architecture”, A. Eichenberger et al., IBM Systems Journal, Vol. 45, No. 1, 2006

Why GPUs?

• A quiet revolution and potential build-up
 — Calculation: 367 GFLOPS vs. 32 GFLOPS
 — Memory Bandwidth: 86.4 GB/s vs. 8.4 GB/s
 — Until last year, programmed through graphics API

 — GPU in every PC and workstation – massive volume and potential impact

Sample GPU Applications

Application | Description | Source | Kernel | % time
H.264 | SPEC ’06 version, change in guess vector | 34,811 | 194 | 35%
LBM | SPEC ’06 version, change to single precision and print fewer reports | 1,481 | 285 | >99%
RC5-72 | Distributed.net RC5-72 challenge client code | 1,979 | 218 | >99%
FEM | Finite element modeling, simulation of 3D graded materials | 1,874 | 146 | 99%
RPES | Rye Polynomial Equation Solver, quantum chem, 2-electron repulsion | 1,104 | 281 | 99%
PNS | Petri Net simulation of a distributed system | 322 | 160 | >99%
SAXPY | Single-precision implementation of saxpy, used in Linpack’s Gaussian elim. routine | 952 | 31 | >99%
TRACF | Two Point Angular Correlation Function | 536 | 98 | 96%
FDTD | Finite-Difference Time Domain analysis of 2D electromagnetic wave propagation | 1,365 | 93 | 16%
MRI-Q | Computing a matrix Q, a scanner’s configuration in MRI reconstruction | 490 | 33 | >99%

Performance of Sample Kernels and Applications

• GeForce 8800 GTX vs. 2.2GHz Opteron 248
• 10× speedup in a kernel is typical, as long as the kernel can occupy enough parallel threads
• 25× to 400× speedup if the function’s data requirements and control flow suit the GPU and the application is optimized
• Keep in mind that the speedup also reflects how suitable the CPU is for executing the kernel

Source: Slide 21, Lecture 1, UIUC ECE 498, David Kirk & Wen-mei Hwu, http://courses.ece.uiuc.edu/ece498/al1/lectures/lecture1%20intro%20fall%202007.ppt

FPGAs: Basics of FPGA Offload

Source: “Compiling Software Code to FPGA-based Accelerator Processors for HPC Applications” by Doug Johnson, [email protected], gladiator.ncsa.uiuc.edu/PDFs/rssi06/presentations/14_Doug_Johnson.pdf


FPGA Acceleration Examples

ClearSpeed Multi-Threaded Array Processor (MTAP)

• Hardware multi-threading for latency tolerance
• Asynchronous, overlapped I/O
• Poly execution unit contains 96 Processor Elements (PEs), or cores
• Array of PEs operates in a synchronous manner, i.e. each PE executes the same instruction on its own data

Source: “Accelerating HPC Applications with ClearSpeed” by Daniel Kliger, [email protected], www.cse.scitech.ac.uk/disco/mew17/talks/ClearSpeed%20 Daresbury%20MEW%202006.pdf


ClearSpeed Linpack Results

• Standard System
 — Two 3.0 GHz Xeon 5160 (Woodcrest) dual-core processors, 16GB memory per node
  – Single server: 34 GFLOPS
  – Four-node cluster: 136 GFLOPS
  – Power consumption: 1,940 Watts
  – Benchmark runtime: 48.4 minutes
• ClearSpeed Accelerated System
 — Add two Advance accelerator boards per node (25W per board!)
  – Single server: 90.1 GFLOPS
  – Four-node cluster: 364.2 GFLOPS
  – Power consumption: 2,140 Watts
  – Benchmark runtime: 18.4 minutes

ClearSpeed’s CSXL acceleration library

The CSXL acceleration library intercepts and accelerates calls to functions in the Basic Linear Algebra Subprograms (BLAS) library. These include Level 3 BLAS DGEMM calls and LAPACK DGETRF calls.

Imagine Stream Processor

Transforming Memory Accesses to Communication for Scalability

Software challenge: deliver productivity of shared memory model, combined with scalability of communication model


Example of how Compilers can Help

Opportunity for new languages to reduce compiler effort and broaden applicability

Source: UIUC ECE 497, courses.ece.uiuc.edu/ece412/lectures/lecture16.ppt

Code Partitioning for Heterogeneous Processors

• Factors to consider when extracting a region of code for execution on an accelerator
 — Matching operations in the code region with primitives in the accelerator (includes instruction selection and FPGA synthesis)
 — Establishing coherence between main and local memories
 — Obeying local memory size constraints
 — Volume of data to be communicated
 — Granularity of the region relative to the overhead of thread creation
 — Structural constraints of the task/thread being extracted
 — Cloning of code that needs to be executed on multiple elements
 — Coordination with the rest of the program (coroutine vs. macro-dataflow models)
 — . . .


Reading List for Next Lecture (Sep 10th)

1. “Using advanced compiler technology to exploit the performance of the Cell Broadband Engine architecture”, A. Eichenberger et al., IBM Systems Journal, Vol. 45, No. 1, 2006, http://researchweb.watson.ibm.com/journal/sj/451/eichenberger.pdf
2. “Dynamic Multigrain Parallelization on the Cell Broadband Engine”, F. Blagojevic et al., PPoPP 2007 Best Paper, March 2007, http://portal.acm.org/ft_gateway.cfm?id=1229445&type=pdf&coll=portal&dl=ACM&CFID=14018324&CFTOKEN=91433508

Announcement: Kickoff Meeting for Habanero Multicore Software Research Project

Habanero is a new research project focused on Multicore Software. Its scope will span programming languages, compilers, virtual machines, and low-level runtime systems, and is synergistic with the expertise we have in various CS groups at Rice including the Parallel Compilers, Scalar Compilers, Programming Language Technologies, and Systems groups. A kickoff meeting for the Habanero project is scheduled for 1pm - 2:30pm on Wednesday, August 29th in DH 3076. Cookies will be served!


BACKUP SLIDES START HERE

Freescale MPC8572 PowerQUICC III Processor

• Dual embedded e500 cores, 36-bit physical addressing
• Double-precision floating-point
• Integrated L1/L2 cache
 — L1 cache: 32 KB data and 32 KB instruction
 — Shared L2 cache: 1 MB with ECC
 — L2 configurable as SRAM and cache; I/O transactions can be stashed into L2 cache regions
• Integrated DDR memory controller with full ECC support
• Integrated security engine, Pattern Matching Engine, Packet Deflate Engine
• Four on-chip triple-speed Ethernet controllers


Freescale MPC8572 PowerQUICC III Processor

Source: Freescale

AMD’s use of HyperTransport (Torrenza)

• “Torrenza” technology
 — Allows licensing of coherent HyperTransport™ to 3rd-party manufacturers to make socket-compatible accelerators/co-processors
 — Allows 3rd-party PPUs (Physics Processing Units), GPUs, and co-processors to access main system memory directly and coherently
 — Could make the accelerator programming model easier to use than, say, the Cell processor’s, where each SPE cannot directly access main memory
