The Roofline Model: A pedagogical tool for program analysis and optimization
EECS Electrical Engineering and Computer Sciences / Berkeley Par Lab (Parallel Computing Laboratory)

Samuel Williams1,2, David Patterson1, Leonid Oliker1,2, John Shalf2, Katherine Yelick1,2

1University of California, Berkeley 2Lawrence Berkeley National Laboratory

[email protected]

Outline

 Motivation, Goals, Audience, etc…
 Survey of multicore architectures
 Description of the Roofline model
 Introduction to Auto-tuning
 Application of the Roofline to auto-tuned kernels
. Example #1 - SpMV
. Example #2 - LBMHD
 Conclusions

Motivation

 Multicore guarantees neither good scalability nor good (attained) performance
 Performance and scalability can be extremely non-intuitive even to computer scientists

 Success of the multicore paradigm seems to be premised upon its programmability
 To that end, one must understand the limits to both scalability and efficiency

- How can we empower programmers?

Primary Focus

 Throughput-oriented kernels (rather than time)
 Our performance metrics are Gflop/s and % of peak (efficiency)

For the purposes of this talk, I will focus on memory-intensive 64-bit floating-point SPMD kernels.

 Not focused on algorithmic innovation or computational complexity

Goals & Audience

 Goals for the Roofline:
. Provide everyone (especially undergrads) with a graphical aid that sets realistic expectations of performance and productivity
. Show the inherent hardware limitations for a given kernel
. Show the potential benefit and priority of optimizations

 Who's not the audience for the Roofline:
. Not for those interested in fine tuning (+10%)
. Not for those challenged by parallel kernel correctness


Multicore SMPs of Interest

(used throughout the rest of the talk)

Multicore SMPs Used

[Figure: block diagrams of the four SMPs. Intel Xeon E5345 (Clovertown): eight cores with 4MB shared L2 per core pair, dual FSBs (10.66 GB/s each) into a chipset with 4x64b controllers (21.33 GB/s read, 10.66 GB/s write) to 667MHz FBDIMMs. AMD Opteron 2356 (Barcelona): eight cores with 512KB victim caches and a 2MB shared quasi-victim (32 way) per socket, SRI/crossbar, HyperTransport (4GB/s each direction), 2x64b memory controllers (10.66 GB/s per socket) to 667MHz DDR2 DIMMs. Sun T2+ T5140 (Victoria Falls): sixteen multithreaded SPARC cores behind crossbars, a 4MB shared L2 (16 way) per socket, four coherency hubs, 2x128b controllers (21.33 GB/s read, 10.66 GB/s write per socket) to 667MHz FBDIMMs. IBM QS20 Cell Blade: two PPEs and sixteen SPEs with 256K local stores and MFCs on EIB ring networks, XDR memory controllers (25.6 GB/s per socket to 512MB XDR DRAM), BIF at <20GB/s each direction.]

Multicore SMPs Used

[The same figure, annotated to contrast memory hierarchies: the Xeon, Opteron, and Victoria Falls systems expose a conventional, coherent cache hierarchy, while the Cell Blade exposes disjoint, explicitly managed local-store memory.]

Multicore SMPs Used

[The same figure, annotated to highlight which designs are built from multithreaded cores (e.g. the 8-way multithreaded cores of Victoria Falls).]

Multicore SMPs Used (peak double precision)

[The same figure, annotated with peak double-precision rates: Xeon E5345 75 Gflop/s, Opteron 2356 74 Gflop/s, T2+ T5140 19 Gflop/s, QS20 Cell Blade 29 Gflop/s (*SPEs only).]

Multicore SMPs Used (total DRAM bandwidth)

[The same figure, annotated with total DRAM bandwidth: Xeon E5345 21 GB/s (read), 10 GB/s (write); Opteron 2356 21 GB/s; T2+ T5140 42 GB/s (read), 21 GB/s (write); QS20 Cell Blade 51 GB/s.]


Roofline models for multicore SMPs

(for memory-intensive double precision floating-point kernels)

Arithmetic Intensity in HPC

[Figure: the arithmetic intensity spectrum, from O(1) (SpMV, BLAS1/2, stencils (PDEs), lattice methods) through O(log N) (FFTs) to O(N) (dense linear algebra / BLAS3, particle methods).]

 True Arithmetic Intensity (AI) ~ Total Flops / Total DRAM Bytes
. constant with respect to problem size for many problems of interest
. ultimately limited by compulsory traffic
. diminished by conflict or capacity misses
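As a worked illustration of the O(1) case (my example, not from the talk): a double-precision dot product of length N performs 2N flops while streaming 16N bytes from DRAM, so

$$\mathrm{AI} = \frac{2N\ \text{flops}}{16N\ \text{bytes}} = 0.125\ \text{flops/byte},$$

independent of N, and only diminished further by conflict or capacity misses.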

Naïve Roofline Model

 Based on bound and bottleneck analysis [1]

 Performance is upper-bounded by both the peak flop rate and the product of streaming bandwidth and the flop:byte ratio (well understood in the performance-oriented communities):

Gflop/s(AI) = min( Peak Gflop/s, AI * StreamBW )

where AI is the actual arithmetic intensity

 Assumptions:
. Bandwidth is independent of arithmetic intensity
. Bandwidth is independent of optimization or access pattern
. Computation is independent of optimization
. Complete overlap of either communication or computation
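A minimal sketch of this bound in C (my illustration; the machine numbers are just an example):

    #include <stdio.h>

    /* Naive roofline: attainable Gflop/s for a kernel of arithmetic
     * intensity ai (flops/byte) on a machine with the given peak
     * compute rate (Gflop/s) and streaming DRAM bandwidth (GB/s).  */
    static double roofline(double ai, double peak_gflops, double stream_bw) {
        double bw_bound = ai * stream_bw;   /* memory-bound diagonal */
        return bw_bound < peak_gflops ? bw_bound : peak_gflops;
    }

    int main(void) {
        /* e.g. an Opteron-2356-like machine: ~74 Gflop/s peak DP, ~21 GB/s */
        for (double ai = 0.0625; ai <= 8.0; ai *= 2.0)
            printf("AI %6.4f -> %6.2f Gflop/s\n", ai, roofline(ai, 74.0, 21.0));
        return 0;
    }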

[1] E. Lazowska, J. Zahorjan, G. Graham, K. Sevcik, "Quantitative System Performance"

Roofline Model

[Figure: empty roofline axes: attainable Gflop/s (1 to 128, log scale) versus flop:DRAM byte ratio (1/8 to 16, log scale).]

Naïve Roofline Model

[Figure: naïve rooflines for the four SMPs; each panel plots attainable Gflop/s against flop:DRAM byte ratio, bounded above by peak DP and by the hand-optimized Stream bandwidth.]

 Unrealistically optimistic model
 Hand-optimized Stream BW benchmark
 This level of performance is only attainable with extensive optimization
 How sensitive is each architecture to removing those optimizations?

Better Roofline

 Collect StreamBW_j with progressively fewer optimizations
 Estimate InCoreGFlops_i with progressively fewer optimizations

GFlops_{i,j}(AI) = min( InCoreGFlops_i, AI * StreamBW_j )

is the attainable performance with in-core optimizations 1…i and memory optimizations 1…j

 These denote a series of ceilings below the roofline
 Assumptions:
. Bandwidth is independent of arithmetic intensity
. Complete overlap of either communication or computation

Roofline Models (DRAM bandwidth)

[Figure: the four rooflines with bandwidth ceilings added below the Stream roofline: "dataset fits in snoop filter" (Clovertown), "w/out prefetch", "w/out NUMA", and "w/out unit stride".]

 What happens as bandwidth optimizations are stripped out?
 They form a series of bandwidth ceilings below the roofline
 Small problems fit in the snoop filter in Clovertown's MCH
 Most architectures see NUMA and prefetch variations

In-core Performance

 Define a similar set of ceilings for in-core performance

 In-core performance can be limited by (among other things):
. Not satisfying all forms of in-core parallelism:
• Instruction-level parallelism (multi-issue, pipelining, …)
• Data-level parallelism (SIMD)
• Functional unit parallelism (adders + multipliers + …)
. Non-FP instructions can consume instruction issue bandwidth
• As the FP fraction decreases, how sensitive is attainable performance?

 One or the other is usually more difficult to satisfy on a given architecture/kernel = Architecture’s Achilles’ Heel
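To make the ILP point concrete, a sketch of my own (not from the talk): a reduction rewritten with independent accumulators so several FP adds can be in flight at once.

    /* The naive loop carries a dependence through s, so throughput is
     * limited by FP-add latency; the unrolled version keeps four
     * independent chains in flight, exposing instruction-level parallelism. */
    double sum_naive(const double *x, int n) {
        double s = 0.0;
        for (int i = 0; i < n; i++) s += x[i];
        return s;
    }

    double sum_ilp(const double *x, int n) {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        int i;
        for (i = 0; i + 3 < n; i += 4) {       /* four independent partial sums */
            s0 += x[i];     s1 += x[i + 1];
            s2 += x[i + 2]; s3 += x[i + 3];
        }
        for (; i < n; i++) s0 += x[i];         /* remainder loop */
        return (s0 + s1) + (s2 + s3);
    }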

Roofline Models (in-core performance = in-core parallelism?)

[Figure: rooflines with in-core ceilings ("mul/add imbalance", "w/out SIMD", "w/out ILP" on the x86 machines; "w/out FMA", "w/out SIMD", "w/out ILP" on Cell).]

 Covering the breadth of in-core parallelism is the preeminent challenge on most architectures
 Form a series of parallelism ceilings below the roofline
 On Niagara machines, instruction latencies are easily hidden with 8-way multithreading

Roofline Models (in-core performance = instruction mix?)

[Figure: rooflines with instruction-mix ceilings at 25% FP and 12% FP.]

 All machines have a limited instruction issue bandwidth
 Non-FP instructions sap the issue bandwidth needed by FP instructions
 As the FP fraction of the dynamic instruction mix decreases, so might performance
 On Cell, double precision instructions stall subsequent issues for 7 cycles

Roofline Models (Achilles' Heel)

 It's clear that in-core parallelism is more important on the superscalars
 Instruction mix is more important on Niagara2
 Each architecture has its own Achilles' Heel when it comes to in-core performance

Roofline Models (ceilings constrain performance)

 The ceilings act to constrain performance to a much smaller region

Roofline Models (thickness)

[The same figure; a rotated annotation ties the thickness of each roofline region to the SW or compiler complexity required to escape it.]

Three Categories of Software Optimization

Optimization Categorization

Maximizing In-core Performance:
. Exploit in-core parallelism (ILP, DLP, etc…)
. Good (enough) floating-point balance

Maximizing Memory Bandwidth:
. Exploit NUMA
. Hide memory latency
. Satisfy Little's Law

Minimizing Memory Traffic (eliminate):
. Capacity misses
. Conflict misses
. Compulsory misses
. Write-allocate behavior

Techniques spanning these categories: unit-stride streams, reorder, memory blocking, unroll & jam, affinity, SW prefetch, array padding, eliminate branches, streaming stores, explicit SIMD, DMA lists, compress data, TLB blocking.
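For reference (my addition, not on the slide): Little's Law here says that the concurrency needed to saturate the memory system is the bandwidth-delay product,

$$\text{concurrency} = \text{bandwidth} \times \text{latency};$$

for example, sustaining about 21 GB/s against roughly 100 ns of DRAM latency requires on the order of 2 KB of memory requests in flight.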

Maximizing Attained In-core Performance

[Figure: single-precision roofline for the Opteron 2356 (Barcelona), with peak SP at the top and "mul/add imbalance", "w/out SIMD", and "w/out ILP" ceilings; optimization punches upward through the ceilings.]

 Compilers may not have as much knowledge as the programmer
 Express more in-core parallelism and amortize non-FP instructions
 Software optimizations:
. Explicit SIMDization
. Loop unrolling
. Unroll and jam
. Reordering
. Predication
 Punch through the ceilings

Maximizing Attained Memory Bandwidth

[Figure: the same Barcelona roofline; optimization punches through the bandwidth ceilings along the memory-bound diagonal.]

 Compilers won't give great out-of-the-box bandwidth
 Optimizations:
. Long unit-stride accesses
. NUMA-aware allocation and parallelization
. SW prefetching
. Maximize MLP
 Punch through the bandwidth ceilings

Minimizing Total Memory Traffic

[Figure: the same Barcelona roofline; arithmetic intensity slides rightward toward its compulsory limit.]

 Use performance counters to measure the flop:byte ratio (AI)
 Out-of-the-box code may have an AI much less than the compulsory ratio
 Optimizations:
. Array padding: eliminates conflict misses
. Cache blocking: eliminates capacity misses
. Cache bypass: eliminates compulsory (write-allocate) traffic
 Push arithmetic intensity to the compulsory limit

Optimization Categorization

[The categorization figure repeats, annotated: each optimization has a large parameter space; what are the optimal parameters?]


Introduction to Auto-tuning

Out-of-the-box Code

 Out-of-the-box code has (unintentional) assumptions on:
. cache sizes (>10MB)
. functional unit latencies (~1 cycle)
. etc…

 These assumptions may result in poor performance when they exceed the actual machine characteristics

What is auto-tuning?

 Goal: provide performance portability across the existing breadth and evolution of microprocessors
 At the expense of a one-time, up-front productivity cost that is amortized over the number of machines it's used on

 Auto-tuning does not invent new optimizations
 Auto-tuning automates the exploration of the optimization and parameter space
 Two components:
1. a parameterized code generator (we wrote ours in Perl)
2. an auto-tuning exploration benchmark (a combination of heuristics and exhaustive search)
 Can be extended with ISA-specific optimizations (e.g. DMA, SIMD)
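A toy sketch of the exploration component (my illustration only; the actual generator emitted code variants via Perl): time each candidate parameterization and keep the best.

    #include <stdio.h>
    #include <time.h>

    /* Stand-in for a generated kernel variant with an rb x cb register block. */
    static void run_kernel(int rb, int cb) {
        volatile double s = 0.0;
        for (long i = 0; i < 1000000L / (rb * cb); i++) s += (double)i * 1e-9;
    }

    static double time_kernel(int rb, int cb) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        run_kernel(rb, cb);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
    }

    int main(void) {
        double best = 1e30;
        int best_rb = 1, best_cb = 1;
        for (int rb = 1; rb <= 8; rb *= 2)        /* exhaustive sweep of a */
            for (int cb = 1; cb <= 8; cb *= 2) {  /* small parameter space */
                double t = time_kernel(rb, cb);
                if (t < best) { best = t; best_rb = rb; best_cb = cb; }
            }
        printf("best block: %dx%d (%.6f s)\n", best_rb, best_cb, best);
        return 0;
    }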

Distinguishing the Roofline and Auto-tuning

 Roofline specifies what's deficient, but not how to fix it.

 Auto-tuning attempts to fix it by searching the parameter space of the existing body of optimization work


Application of the Roofline Model to Sample Kernels

Does the roofline model provide insight into the limitations of architecture, implementation, and algorithm?

Notes

 Things to watch for:
1. Do performance graphs alone provide insight into the limitations of kernel or architecture?
2. Does the roofline show the ultimate performance limitations of kernel and architecture?
3. Does the roofline show which optimizations will be necessary?


Example #1: Auto-tuning Sparse Matrix-Vector Multiplication (SpMV)

Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, James Demmel, "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Supercomputing (SC), 2007.

Sparse Matrix-Vector Multiplication

 What's a sparse matrix?
. Most entries are 0.0
. Performance advantage in only storing/operating on the nonzeros
. Requires significant metadata to reconstruct the matrix structure
 What's SpMV?
. Evaluate y = Ax
. A is a sparse matrix; x and y are dense vectors
 Challenges
. Very low arithmetic intensity (often <0.166 flops/byte)
. Difficult to exploit ILP (bad for superscalar)
. Difficult to exploit DLP (bad for SIMD)
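A quick sanity check on that 0.166 bound (my arithmetic, assuming CSR with 64-bit values and 32-bit column indices): each nonzero costs one multiply and one add against 8 bytes of value plus 4 bytes of index,

$$\mathrm{AI} \le \frac{2\ \text{flops}}{(8+4)\ \text{bytes}} \approx 0.166\ \text{flops/byte},$$

before counting row pointers or vector traffic, which only lower it.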

[Figure: (a) algebra conceptualization of y = Ax; (b) the CSR data structure: A.val[], A.col[], A.rowStart[]; (c) CSR reference code.]

A minimal reconstruction of the CSR reference code (assuming the usual rowStart/col/val arrays):

    for (r = 0; r < A.rows; r++) {
        double y0 = 0.0;
        for (i = A.rowStart[r]; i < A.rowStart[r+1]; i++)
            y0 += A.val[i] * x[A.col[i]];
        y[r] = y0;
    }

The Dataset (matrices)

 Unlike dense BLAS, performance is dictated by sparsity
 Suite of 14 matrices
 All bigger than the caches of our SMPs
 We'll also include a median performance number

[Figure: the 14-matrix suite. Dense: a 2K x 2K dense matrix stored in sparse format. Well structured (sorted by nonzeros/row): Protein, FEM/Spheres, FEM/Cantilever, Wind Tunnel, FEM/Harbor, QCD, FEM/Ship, Economics, Epidemiology. Poorly structured hodgepodge: FEM/Accelerator, Circuit, webbase. Extreme aspect ratio (linear programming): LP.]

SpMV Performance (simple parallelization)

 Out-of-the-box SpMV performance on a suite of 14 matrices
 Scalability isn't great
 Is this performance good?

[Bar charts per architecture; legend: Naïve, Naïve Pthreads.]

Auto-tuned SpMV Performance (architecture specific optimizations)

 Fully auto-tuned SpMV performance across the suite of matrices
 Included an SPE/local store optimized version
 Why do some optimizations work better on some architectures?
 Performance is better, but is performance good?

[Bar charts; legend: Naïve, Naïve Pthreads, +NUMA/Affinity, +SW Prefetching, +Matrix Compression, +Cache/LS/TLB Blocking.]

Auto-tuned SpMV Performance (architecture specific optimizations)

[The same charts, with a callout: auto-tuning resulted in better performance, but did it result in good performance?]

Roofline Model for SpMV

[Figure: double-precision rooflines for the four SMPs; because FMA is inherent in SpMV, the mul/add-imbalance and FMA ceilings are placed at the bottom of the in-core stack.]

 Double precision roofline models
 FMA is inherent in SpMV (place its ceiling at the bottom)

Roofline Model for SpMV (overlay arithmetic intensity)

 Two unit-stride streams
 Inherent FMA
 No ILP
 No DLP
 FP is 12-25% of the instruction mix
 Naïve compulsory flop:byte < 0.166
 (No naïve Cell implementation)

Roofline Model for SpMV (out-of-the-box parallel)

 For simplicity: dense matrix in sparse format

Roofline Model for SpMV (NUMA & SW prefetch)

 Compulsory flop:byte ~ 0.166
 Utilize all memory channels

Roofline Model for SpMV (matrix compression)

 Inherent FMA
 Register blocking improves ILP, DLP, the flop:byte ratio, and the FP% of instructions (see the sketch below)
 All machines are on the bandwidth roofline!

Example #2: Auto-tuning Lattice-Boltzmann Magneto- Hydrodynamics (LBMHD)

Samuel Williams, Jonathan Carter, Leonid Oliker, John Shalf, Katherine Yelick, "Lattice Boltzmann Simulation Optimization on Leading Multicore Platforms", International Parallel & Distributed Processing Symposium (IPDPS), 2008.

Best Paper, Application Track

LBMHD

 Plasma turbulence simulation via the Lattice Boltzmann Method
 Two distributions:
. momentum distribution (27 scalar components)
. magnetic distribution (15 vector components)
 Three macroscopic quantities:
. Density
. Momentum (vector)
. Magnetic field (vector)
 Arithmetic intensity:
. Must read 73 doubles and update 79 doubles per lattice update (1216 bytes)
. Requires about 1300 floating-point operations per lattice update
. Just over 1.0 flops/byte (ideal)
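Checking those numbers (my arithmetic, not on the slide): 1216 bytes and roughly 1300 flops per lattice update give

$$\mathrm{AI} \approx \frac{1300\ \text{flops}}{1216\ \text{bytes}} \approx 1.07\ \text{flops/byte (ideal)};$$

if write-allocate traffic re-reads the 79 written doubles (632 extra bytes), this drops to about 1300/1848, roughly 0.7, matching the out-of-the-box flop:byte ratio quoted below.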

 Out-of-the-box, no unit-stride memory access patterns

[Figure: lattice stencils for the macroscopic variables, the momentum distribution, and the magnetic distribution.]

Initial LBMHD Performance

 Generally, scalability looks good
 But is performance good?

[Bar charts per architecture; legend: Naïve+NUMA.]

*collision() only

Auto-tuned LBMHD Performance (architecture specific optimizations)

 Auto-tuning avoids cache conflict and TLB capacity misses
 Additionally, it exploits SIMD where the compiler doesn't
 Includes an SPE/local store optimized version

[Bar charts; legend: Naïve+NUMA, +Padding, +Vectorization, +Unrolling, +SW Prefetching, +Explicit SIMDization, +small pages.]

*collision() only

Roofline Model for LBMHD

[Figure: double-precision rooflines for the four SMPs with LBMHD's arithmetic intensity overlaid.]

 Far more adds than multiplies (imbalance)
 Huge data sets

Roofline Model for LBMHD (overlay arithmetic intensity)

 Essentially random access to memory
 Flop:byte ratio ~0.7
 NUMA allocation/access
 Little ILP
 No DLP
 High conflict misses
 (No naïve Cell implementation)

Roofline Model for LBMHD (out-of-the-box parallel performance)

 Peak Victoria Falls performance with 64 threads (out of 128): high conflict misses

Roofline Model for LBMHD (Padding, Vectorization, Unrolling, Reordering, …)

 Vectorize the code to eliminate TLB capacity misses
 Ensures unit-stride access (the bottom bandwidth ceiling)
 Tune for the optimal vector length
 Clovertown remains pinned to a lower BW ceiling

Roofline Model for LBMHD (SIMDization + cache bypass)

 Make SIMDization explicit
 Technically, this swaps the ILP and SIMD ceilings
 Use the cache bypass instruction movntpd (see the sketch below)
 Increases the flop:byte ratio to ~1.0 on x86/Cell
 3 out of 4 machines hit the Roofline!
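A sketch of the cache-bypass idea with SSE2 intrinsics (my own example, with an invented function name; _mm_stream_pd compiles to movntpd): streaming stores write results to DRAM without first reading the destination line, eliminating write-allocate traffic.

    #include <emmintrin.h>   /* SSE2: __m128d, _mm_stream_pd */

    /* Scale src into dst, bypassing the cache on stores.  Assumes n is
     * even and both pointers are 16-byte aligned (illustrative only);
     * the non-temporal hint avoids the read-for-ownership on dst.     */
    void scale_stream(double *dst, const double *src, double a, int n) {
        __m128d va = _mm_set1_pd(a);
        for (int i = 0; i < n; i += 2) {
            __m128d v = _mm_mul_pd(va, _mm_load_pd(&src[i]));
            _mm_stream_pd(&dst[i], v);   /* movntpd: no write allocate */
        }
        _mm_sfence();                    /* order the streaming stores */
    }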

Conclusions

Summary

 The Roofline model is a visually intuitive figure for kernel analysis and optimization
 We believe undergraduates will find it useful in assessing performance and scalability limitations

 It is easily extended to other architectural paradigms
 We believe it is easily extendable to other metrics:
. performance (sort, graphics, crypto, …)
. bandwidth (L2, PCIe, …)

 We believe that performance counters could be used to generate a runtime-specific roofline that would greatly aid optimization

Suggestion…

 As architectures are presented over the next two days, we invite you to create a roofline model for each
 Estimate the ceilings

 Then contemplate performance and productivity among them

Acknowledgements

 Research supported by:
. Microsoft and Intel funding (Award #20080469)
. DOE Office of Science under contract number DE-AC02-05CH11231
. NSF contract CNS-0325873
. Sun Microsystems: Niagara2 / Victoria Falls machines
. AMD: access to quad-core Opteron (Barcelona) machines
. Forschungszentrum Jülich: access to QS20 Cell blades
. IBM: virtual loaner program for QS20 Cell blades


Questions?

Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, James Demmel, "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Supercomputing (SC), 2007.

Samuel Williams, Jonathan Carter, Leonid Oliker, John Shalf, Katherine Yelick, "Lattice Boltzmann Simulation Optimization on Leading Multicore Platforms", International Parallel & Distributed Processing Symposium (IPDPS), 2008. Best Paper, Application Track

Kaushik Datta, Mark Murphy, Vasily Volkov, Samuel Williams, Jonathan Carter, Leonid Oliker, David Patterson, John Shalf, Katherine Yelick, "Stencil Computation Optimization and Auto-tuning on State-of-the-Art Multicore Architectures", Supercomputing (SC) (to appear), 2008.
