Auto-tuning and the Roofline Model

Sam Williams, David Patterson, Kathy Yelick, Jim Demmel, Andrew Waterman, Rich Vuduc, Lenny Oliker, John Shalf, Jonathan Carter, …
EECS (Electrical Engineering and Computer Sciences), Berkeley Par Lab (Parallel Computing Laboratory)
[email protected]


Where does this fit in the ParLab?

ParLab
• Multi- and manycore is the only foreseeable solution to improve performance or reduce power.
• Vertically and horizontally integrated lab focused on manycore.
• Within each layer are multiple research groups.
• Within each group are multiple projects.
• Two groups (analysis and verification) dive through all layers.

My Work
• Fits into two different, but related layers:
  • The Roofline model is a template for analyzing performance.
  • Auto-tuning computational motifs is buried in the Efficiency Layer.
• Uses existing multicore SMPs as proxies for next generation multicore computers, e.g. Victoria Falls with 128 threads is like a 128 core machine.

[Figure: the ParLab research stack. Labels include Applications (Personal Health, Image Retrieval, Hearing, Music, Speech, Parallel Browser); Design Patterns/Motifs; Productivity Layer (Composition & Coordination Language (C&CL), C&CL Compiler/Interpreter, Parallel Libraries, Parallel Frameworks); Efficiency Layer (Efficiency Languages, Sketching, Auto-tuners, Legacy Code, Schedulers, Communication & Synch. Primitives, Efficiency Language Compilers); OS (OS Libraries & Services, Legacy OS, Hypervisor); Arch. (Multicore/GPGPU, RAMP Manycore); Correctness (Static Verification, Type Systems, Directed Testing, Dynamic Checking, Debugging with Replay); Diagnosing Power/Performance.]

What is Auto-tuning?

Basic Idea
• Provides performance portability across the breadth and evolution of multicore architectures.
• There are too many complex architectures with too many possible code transformations to hand optimize every kernel for every architecture.
• An optimization on one machine may slow another machine down.
• Need a general, automated solution.

Code Generators
• Kernel-specific.
• Perl scripts generate 1000's of code variations for various optimizations:
  • NUMA-Aware: collocates data with the threads processing it
  • Array Padding: avoids conflicts in the L1/L2
  • Register Blocking: in the sparse motif, the data structure is hierarchically blocked for locality
  • Cache Blocking: minimizes cache misses and memory traffic
  • Vectorization: avoids thrashing the TLB
  • Unrolling/DLP: compensates for poor compilers
  • SW Prefetching: attempts to hide L2 and DRAM latency
  • SIMDization: compensates for poor compilers, and streaming stores minimize memory traffic

Auto-tuners
• Search over all possible code variants for best performance.
• Often, an exhaustive search is intractable.
• The trend is to use heuristics to guide the search.
• The future is to use performance models to guide the search.


The Roofline Model

Reference
Samuel Williams, Andrew Waterman, David Patterson, "Roofline: An Insightful Multicore Performance Model", (submitted to) Communications of the ACM, 2008.

Introduction
• Use bound and bottleneck analysis to distill the key components of architecture and performance (computation, communication, locality) into a visually-intuitive performance model.
• Allow programmers to model, predict, and analyze a kernel's performance.
• Here we restrict the model to memory-intensive SPMD floating-point kernels.

Naïve Roofline Model
• Well known formalism.
• Based on microbenchmarks and optimization manuals.
• Combines communication, computation, and locality into a single figure.

[Figure: naïve Roofline. x-axis: flop:DRAM byte ratio (1/8 to 16); y-axis: attainable Gflop/s (1 to 128); roofs: peak single precision flops, peak stream bandwidth.]

In-core Parallelism
• Current architectures achieve high performance through many forms of in-core parallelism.
• A lack of exploitation of any form of in-core parallelism will degrade performance.
• Delineate performance levels = in-core ceilings.

[Figure: Roofline with in-core ceilings. Same axes; ceilings: peak SP, mul/add imbalance, w/out SIMD, w/out ILP; roof: peak stream bandwidth.]

Instruction Mix
• Large numbers of integer instructions can limit FP performance.

Memory Bandwidth
• High memory bandwidth comes from hiding latency and exploiting parallelism.
• HW prefetchers hide latency for unit-stride access patterns.
• SW prefetchers supplement this.
• Multisocket SMPs require a careful placement of data (NUMA optimizations).
• A lack of any of these will degrade memory bandwidth = bandwidth ceilings.

[Figure: Roofline with bandwidth ceilings. Same axes; in-core ceilings as above; bandwidth ceilings: peak stream bandwidth, w/out SW prefetching, w/out NUMA optimizations, w/out unit-stride streams.]

Locality
• Think 3C's of caches.
• All kernels have compulsory cache misses.
• Caches are finite = capacity misses.
• Caches aren't fully associative = conflict misses.
• If software doesn't handle these, arithmetic intensity will degrade = arithmetic intensity walls.

[Figure: Roofline with arithmetic intensity walls. Same axes and ceilings; walls: compulsory miss traffic, + capacity miss traffic, + conflict miss traffic.]


Auto-tuning the Structured Grid Motif

Reference
Samuel Williams, Jonathan Carter, Leonid Oliker, John Shalf, Katherine Yelick, "Lattice Boltzmann Simulation Optimization on Leading Multicore Platforms", International Parallel & Distributed Processing Symposium (IPDPS) (to appear), 2008. Best Paper, Application Track.

Lattice-Boltzmann Magneto-hydrodynamics (LBMHD)
• Simulates plasma turbulence via LBM.
• Couples CFD with Maxwell's Equations.
• Thus it requires:
  • 27pt Momentum distribution
  • 15pt Magnetic distribution
  • 7 macroscopic quantities (density, momentum, magnetic field)
• Two phases to the code:
  • collision() advances the grid one time step
  • stream() handles the boundary conditions (periodic for benchmark)
• Each lattice update requires ~1300 flops and ~1200 bytes of data.
• flop:byte ~ 1.0 (ideal), ~0.7 (cache-based machines)

[Figure: three lattice diagrams on +X/+Y/+Z axes: macroscopic variables, momentum distribution, magnetic distribution.]

Auto-tuning LBMHD
• Auto-tuning dramatically improved performance on the Opteron (4x).
• Became important when the problem could no longer be mapped with Niagara2's 4MB pages.
• Although prefetching showed little benefit, SIMD and streaming stores helped significantly.
• Cell was not auto-tuned, and only collision() was implemented.

Using the Roofline Model
• Out-of-the-box code touches too many arrays per loop iteration.
• Cache bypass improves the arithmetic intensity.
• Dataset is too large for the snoop filter to ever work.
• Prefetching/NUMA often essential in delivering higher bandwidth.

[Figure: Roofline plots for Clovertown, Barcelona, Victoria Falls, and Cell Blade. x-axis: flop:DRAM byte ratio (1/16 to 8); y-axis: attainable Gflop/s (1 to 128); ceilings: peak DP, mul/add imbalance, w/out SIMD, w/out ILP, w/out SW prefetch, w/out NUMA; annotation: fits within snoop filter.]


Auto-tuning the Sparse Linear Algebra Motif

Reference
Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, James Demmel, "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Supercomputing (SC), 2007.

What's a Sparse Matrix?
• Like a dense matrix, but most of the entries are 0.0.
• Huge performance advantage in storing/operating on the nonzeros.
• CSR is the standard representation.
• Requires significant meta data.

Sparse Matrix Vector Multiplication (SpMV)
• Evaluate y=Ax.
• A is a sparse matrix.
• x & y are dense vectors.
• No ILP, DLP, and very low flop:byte ratio (<0.166).

Auto-tuning SpMV
• Register, Cache, and TLB blocking result in hierarchical data structures.
• Exhaustive search for optimal prefetch distance.
• Memory traffic minimization heuristic improves flop:byte ratio (<0.25).
• Dramatic increases in performance across all machines.
• SIMDization of little/no value.
• Benefits are matrix dependent.

Using the Roofline Model
• Plot only the Dense matrix stored in sparse format.
• Clearly heavily memory-bound on all computers.
• Register blocking amortizes meta data.
• Clearly, no room for improvement on Clovertown, Barcelona, or Cell.

[Figure: Roofline plots for Clovertown, Barcelona, Victoria Falls, and Cell Blade. x-axis: flop:DRAM byte ratio (1/16 to 8); y-axis: attainable Gflop/s (1 to 128); ceilings: peak DP, mul/add imbalance, w/out SIMD, w/out ILP, w/out SW prefetch, w/out NUMA; annotation: fits within snoop filter.]
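Worked sketch: SpMV under the naïve Roofline

The poster describes the naïve Roofline as the lower of two roofs (peak flops and peak stream bandwidth times the flop:DRAM byte ratio) and describes SpMV as y=Ax with A stored in CSR. The Python sketch below puts the two together. It is an illustration, not the authors' code: the function names, the 8-byte value and 4-byte column index sizes, and the machine numbers (64 Gflop/s, 16 GB/s) are assumptions chosen for the example, not measurements from the poster. The flop:byte estimate counts only matrix traffic (2 flops per nonzero over value plus index bytes), so it is an upper bound; with the assumed sizes it comes to 2/12, consistent with the "<0.166" quoted for SpMV.

```python
def roofline_gflops(peak_gflops, peak_bw_gbs, flops_per_byte):
    """Naive Roofline bound: the lower of the compute roof and the
    bandwidth roof (stream bandwidth times flop:DRAM byte ratio)."""
    return min(peak_gflops, peak_bw_gbs * flops_per_byte)


def spmv_csr(row_ptr, col_idx, values, x):
    """Evaluate y = A x for a sparse matrix A stored in CSR.

    row_ptr[i] .. row_ptr[i + 1] delimits the nonzeros of row i;
    col_idx and values hold each nonzero's column and value.
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]  # one multiply and one add
        y[row] = acc
    return y


def csr_flops_per_byte(value_bytes=8, index_bytes=4):
    """Upper bound on CSR SpMV flop:byte ratio.

    Assumption: only the nonzero value and its column index are counted
    as DRAM traffic; row pointers and the x and y vectors are ignored.
    """
    return 2.0 / (value_bytes + index_bytes)


if __name__ == "__main__":
    # 3x3 example matrix:
    #   [1 0 2]
    #   [0 3 0]
    #   [4 0 5]
    row_ptr = [0, 2, 3, 5]
    col_idx = [0, 2, 1, 0, 2]
    values = [1.0, 2.0, 3.0, 4.0, 5.0]
    x = [1.0, 1.0, 2.0]
    print("y =", spmv_csr(row_ptr, col_idx, values, x))

    # Hypothetical machine, not taken from the poster.
    peak_gflops, peak_bw_gbs = 64.0, 16.0
    kernels = [
        ("SpMV (CSR)", csr_flops_per_byte()),
        ("LBMHD (ideal)", 1300.0 / 1200.0),  # ~1300 flops, ~1200 bytes per update
    ]
    for name, intensity in kernels:
        bound = roofline_gflops(peak_gflops, peak_bw_gbs, intensity)
        print(f"{name}: flop:byte = {intensity:.3f}, bound = {bound:.2f} Gflop/s")
```

On this hypothetical machine both kernels land on the bandwidth roof rather than the compute roof, which matches the poster's observation that SpMV is heavily memory-bound. Ceilings (no SIMD, no NUMA placement, and so on) could be modeled in the same way by passing lower values for the two roofs.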