High-Performance Computing and Compiler Optimization

P. (Saday) Sadayappan

August 2019

What is a Compiler?
• Traditionally: Program that analyzes and translates from a high-level language (e.g., C++) to low-level assembly language that can be executed by hardware
• Compilers are translators

Translate from:                      Translate to:
• Fortran                             Machine code
• C                                   Virtual machine code
• C++                                 Transformed source code
• Java                                Augmented source code
• Text processing languages           Low-level commands
• HTML/XML                            Semantic components
• Command & Scripting Languages       Another language
• Natural language
• Domain specific languages

int a, b;                var a
a = 3;                   var b
if (a < 4) {             mov 3 a
  b = 2;                 mov 4 r1
} else {                 cmpi a r1
  b = 3;                 jge l_e
}                        mov 2 b
                         jmp l_d
                         l_e: mov 3 b
                         l_d: ;done


Source: Milind Kulkarni

Compilers are optimizers
• Can perform optimizations to make a program more efficient

int a, b, c;
b = a + 3;
c = a + 3;

Unoptimized:        Optimized:
var a               var a
var b               var b
var c               var c
mov a r1            mov a r1
addi 3 r1           addi 3 r1
mov r1 b            mov r1 b
mov a r2            mov r1 c
addi 3 r2
mov r2 c

Source: Milind Kulkarni
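As a C-level sketch of what the optimizer is doing in the example above (assuming it can prove that a is unchanged between the two statements), the redundant a + 3 is computed only once; the function names here are illustrative, not from the slides.

#include <stdio.h>

/* As written in the source: a + 3 is computed twice. */
void unoptimized(int a, int *b, int *c) {
    *b = a + 3;
    *c = a + 3;
}

/* What the optimizer effectively produces: the common subexpression
   a + 3 is computed once and kept in a register. */
void optimized(int a, int *b, int *c) {
    int t = a + 3;
    *b = t;
    *c = t;
}

int main(void) {
    int b, c;
    unoptimized(5, &b, &c);
    printf("%d %d\n", b, c);   /* prints 8 8 */
    optimized(5, &b, &c);
    printf("%d %d\n", b, c);   /* prints 8 8 */
    return 0;
}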

♦ Early days of computing: Minimizing the number of executed instructions minimized program execution time
  § Sequential processors had a single functional unit to execute instructions
  § Compiler technology is very advanced in minimizing the number of instructions
♦ Today:
  § All computers are highly parallel: must make use of all parallel resources
  § Cost of data movement dominates the cost of performing the operations on data
  § Many challenging problems for compilers

The Good Old Days for Software

Source: J. Birnbaum

• Single-processor performance experienced dramatic improvements from clock speed and architectural improvements (pipelining, instruction-level parallelism)
• Applications experienced automatic performance improvement

Power Density Trends

[Figure: power density (Watts/cm^2) vs. feature size (1.5µ down to 0.07µ) for Pentium, Pentium Pro, Pentium II, and Pentium III, rising past hot-plate levels toward nuclear reactor and rocket nozzle levels. P=VI: 75W @ 1.5V = 50 A!]

Power density is the expended power per unit area on chip: too high => transistors malfunction

Hitting the Power Wall


http://img.tomshardware.com/us/2005/11/21/the_mother_of_all_cpu_charts_2005/cpu_frequency.gif

Hitting the Power Wall

http://img.tomshardware.com/us/2005/11/21/the_mother_of_all_cpu_charts_2005/cpu_frequency.gif

2004 – Intel cancels Tejas and Jayhawk due to "heat problems due to the extreme power consumption of the core ..."

The Only Option: Many Cores

♦ Chip density is still increasing (but the end is in sight)
  § Clock speed is not
♦ There is little or no more hidden parallelism (ILP) to be found
♦ Parallelism must be exposed to and managed by software
♦ Computers are getting heterogeneous: CPUs, GPUs, TPUs, FPGAs

Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)

End of Moore's Law Scaling of VLSI [From Hennessey & Patterson]

♦ Moore's Law (doubling of #transistors on a chip every 18 months-2 years) over 4 decades has fueled over 5 orders of magnitude rise in computing power/chip
♦ Unfortunately, we are now at the end of that ride

Figure 6. Growth of computer performance using integer programs (SPECintCPU), relative to VAX-11/780, 1980-2015:
  CISC: 2X/2.5 years (22%/year)
  RISC: 2X/1.5 years (52%/year)
  End of Dennard Scaling ⇒ Multicore: 2X/3.5 years (23%/year)
  Amdahl's Law ⇒ 2X/6 years (12%/year)
  End of the Line ⇒ 2X/20 years (3%/year)

Figure 7. Potential speedup of matrix multiply in Python for four optimizations (speedup over native Python):
  Python: 1
  C: 47
  + parallel loops: 366
  + memory optimization: 6,727
  + SIMD instructions: 62,806

From the reproduced Turing Lecture article (Hennessy & Patterson, Communications of the ACM, February 2019, Vol. 62, No. 2):

... predicting the outcome of 15 branches. If a processor architect wants to limit wasted work to only 10% of the time, the processor must predict each branch correctly 99.3% of the time. Few general-purpose programs have branches that can be predicted so accurately. To appreciate how this wasted work adds up, consider the data in Figure 4, showing the fraction of instructions that are effectively executed but turn out to be wasted because the processor speculated incorrectly. On average, 19% of the instructions are wasted for these benchmarks on an i7. The amount of wasted energy is greater, however, since the processor must use additional energy to restore the state when it speculates incorrectly. Measurements like these led many to conclude architects needed a different approach to achieve performance improvements. The multicore era was thus born.

Multicore shifted responsibility for identifying parallelism and deciding how to exploit it to the programmer and to the language system. Multicore does not resolve the challenge of energy-efficient computation that was exacerbated by the end of Dennard scaling. Each active core burns power whether or not it contributes effectively to the computation. A primary hurdle is an old observation, called Amdahl's Law, stating that the speedup from a parallel computer is limited by the portion of a computation that is sequential. To appreciate the importance of this observation, consider Figure 5, showing how much faster an application runs with up to 64 cores compared to a single core, assuming different portions of serial execution, where only one processor is active. For example, when only 1% of the time is serial, the speedup for a 64-processor configuration is about 35. Unfortunately, the power needed is proportional to 64 processors, so approximately 45% of the energy is wasted. Real programs have more complex structures of course, with portions that allow varying numbers of processors to be used at any given moment in time. Nonetheless, the need to communicate and synchronize periodically means most applications have some portions that can effectively use only a fraction of the processors. Although Amdahl's Law is more than 50 years old, it remains a difficult hurdle.

With the end of Dennard scaling, increasing the number of cores on a chip meant power is also increasing at nearly the same rate. Unfortunately, the power that goes into a processor must also be removed as heat. Multicore processors are thus limited by the thermal dissipation power (TDP), or average amount of power the package and cooling system can remove. Although some high-end data centers may use more advanced packages and cooling technology, no computer users would want to put a small heat exchanger on their desks or wear a radiator on their backs to cool their cellphones. The limit of TDP led directly to the era of "dark silicon," whereby processors would slow on the clock rate and turn off idle cores to prevent overheating. Another way to view this approach is that some chips can reallocate their precious power from the idle cores to the active ones.

An era without Dennard scaling, along with reduced Moore's Law and Amdahl's Law in full effect means inefficiency limits improvement in performance to only a few percent per year (see Figure 6). Achieving higher rates of performance improvement—as was seen in the 1980s and 1990s—will require new architectural approaches that use the integrated-circuit capability much more efficiently. We will return to what approaches might work after discussing another major shortcoming of modern computers—their support, or lack thereof, for computer security.

Can You Predict Performance?

• E1 executes 2*10^8 FLOPs and takes about 0.4s to execute on this machine
• Performance in GFLOPs: billions (Giga) of FLoating-point Operations per Second; for E1: 2*10^8 / 0.4s = 0.5 GFLOPs
• About how long does E2 take?
  1. [0-0.4s]  (measured: 0.35s => 2.27 GFLOPs)
  2. [0.4s-0.6s]
  3. [0.6s-0.8s]
  4. More than 0.8s

double W,X,Y,Z;
for(j=0;j<100000000;j++){
  W = 0.999999*X;
  X = 0.999999*W;}
// Example loop E1

for(j=0;j<100000000;j++){
  W = 0.999999*W + 0.000001;
  X = 0.999999*X + 0.000001;
  Y = 0.999999*Y + 0.000001;
  Z = 0.999999*Z + 0.000001;
} // Example loop E2
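A minimal measurement sketch (not from the slides): E1 performs 2 FLOPs per iteration (2*10^8 total) and E2 performs 8 FLOPs per iteration (8*10^8 total), so GFLOPs = FLOPs / seconds / 10^9. This assumes a POSIX clock; compile with optimization (e.g., gcc -O2) and expect the timings to vary by machine.

#include <stdio.h>
#include <time.h>

static double now(void) {                    /* wall-clock seconds */
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    double W = 1.0, X = 1.0, Y = 1.0, Z = 1.0;
    long j;

    double t0 = now();
    for (j = 0; j < 100000000; j++) {        /* E1: 2 FLOPs/iter */
        W = 0.999999 * X;
        X = 0.999999 * W;
    }
    double t1 = now();

    for (j = 0; j < 100000000; j++) {        /* E2: 8 FLOPs/iter */
        W = 0.999999 * W + 0.000001;
        X = 0.999999 * X + 0.000001;
        Y = 0.999999 * Y + 0.000001;
        Z = 0.999999 * Z + 0.000001;
    }
    double t2 = now();

    printf("E1: %.2fs, %.2f GFLOPs\n", t1 - t0, 2e8 / (t1 - t0) / 1e9);
    printf("E2: %.2fs, %.2f GFLOPs\n", t2 - t1, 8e8 / (t2 - t1) / 1e9);
    printf("%g %g %g %g\n", W, X, Y, Z);     /* keep results live */
    return 0;
}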

ILP Affects Performance

• ILP (Instruction Level Parallelism): Many operations in a sequential code could be executed concurrently if they do not have dependences
• Pipelined stages in functional units can be exploited by ILP
• Multiple functional units in a CPU can be exploited by ILP
• ILP is automatically exploited by the system when possible
• E2's statements are independent and provide ILP, but E1's statements are not independent and do not provide ILP

double W,X,Y,Z;
for(j=0;j<100000000;j++){
  W = 0.999999*X;
  X = 0.999999*W;}
// Example loop E1

for(j=0;j<100000000;j++){
  W = 0.999999*W + 0.000001;
  X = 0.999999*X + 0.000001;
  Y = 0.999999*Y + 0.000001;
  Z = 0.999999*Z + 0.000001;
} // Example loop E2
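The same effect can be engineered by hand in a reduction: one accumulator forms a single long dependence chain (no ILP, like E1), while four accumulators create independent chains that pipelined floating-point units can overlap (like E2). This is a sketch under assumed conditions; the array size and the 4-way split are illustrative, and strict IEEE semantics keep the compiler from doing this reassociation itself unless fast-math options are used.

#include <stdio.h>
#include <stdlib.h>

#define N (1 << 20)   /* 1M doubles, divisible by 4 */

/* One accumulator: each add waits for the previous one,
   so the pipelined FP unit stays mostly idle. */
double sum1(const double *a) {
    double s = 0.0;
    for (long i = 0; i < N; i++) s += a[i];
    return s;
}

/* Four accumulators: four independent dependence chains expose ILP;
   the partial sums are combined once at the end. */
double sum4(const double *a) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (long i = 0; i < N; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}

int main(void) {
    double *a = malloc(N * sizeof(double));
    for (long i = 0; i < N; i++) a[i] = 1.0 / (i + 1);
    printf("%f %f\n", sum1(a), sum4(a));
    free(a);
    return 0;
}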

Performance Prediction

• About how long will the code run for the 4Kx4K matrix?
  1. [0-0.3s]

#define N 32
#define T 1024*1024
double A[N][N];
// #FLOPs = 2^30
for(it=0; it<T; it++)
  ...

#define N 4096
#define T 64
double A[N][N];
// #FLOPs = 2^30
for(it=0; it<T; it++)
  ...

          32x32    4Kx4K
FLOPs     2^30     2^30
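The inner statement of this benchmark is cut off in the extracted slides, so the sketch below substitutes a hypothetical neighbor-sum update purely to make the experiment runnable: both configurations execute roughly 2^30 updates, but the 32x32 array (8 KB) stays cache-resident while the 4Kx4K array (128 MB) does not, which is what the next slide explains.

#include <stdio.h>
#include <stdlib.h>

/* T repeated sweeps over an NxN array; total work is comparable in
   both configurations, but only the small array stays in cache. */
static double sweep(double *A, int n, long t) {
    for (long it = 0; it < t; it++)
        for (int i = 1; i < n - 1; i++)
            for (int j = 1; j < n - 1; j++)
                A[i*n + j] = A[i*n + j - 1] + A[(i-1)*n + j];   /* hypothetical stencil */
    return A[(n-2)*n + (n-2)];
}

int main(void) {
    int  n[2] = {32, 4096};             /* 8 KB vs. 128 MB of data   */
    long t[2] = {1024L * 1024L, 64L};   /* chosen so #updates matches */
    for (int c = 0; c < 2; c++) {
        double *A = calloc((size_t)n[c] * n[c], sizeof(double));
        printf("N=%d T=%ld checksum=%g\n", n[c], t[c], sweep(A, n[c], t[c]));
        free(A);
    }
    return 0;
}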

Performance Prediction

• 32x32 matrix fits in fast cache memory, but the 4Kx4K matrix cannot

#define N 32
#define T 1024*1024
double A[N][N];
// #FLOPs = 2^30
for(it=0; it<T; it++)
  ...

#define N 4096
#define T 64
double A[N][N];
// #FLOPs = 2^30
for(it=0; it<T; it++)
  ...

          32x32    4Kx4K
FLOPs     2^30     2^30

Data Movement Cost: Energy Trends

• FLOPs almost free; data movement cost is dominant
• Minimizing the amount of data movement is increasingly critical

[Figure: energy per operation in picojoules, 45 nm vs. 11 nm (2018), for DP FLOP, register, 1mm on-chip, 5mm on-chip, local interconnect, cross-system, and off-chip/DRAM accesses; chart annotations: 2.35x decrease, 10x decrease, no change, 20x, 85x.]

Source: Jim Demmel, John Shalf
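A back-of-envelope sketch of why these trends matter: total energy splits into an arithmetic term and a data-movement term. The picojoule values and traffic volumes below are illustrative placeholders, not numbers read off the chart.

#include <stdio.h>

/* Rough energy model: total energy = flops * e_flop + bytes_moved * e_byte.
   The per-FLOP and per-byte energies are placeholders in picojoules;
   plug in values for a real machine. */
int main(void) {
    double flops      = 2e12;    /* 2 TFLOP of useful arithmetic     */
    double dram_bytes = 1e12;    /* 1 TB moved to/from DRAM          */
    double e_flop_pj  = 10.0;    /* pJ per double-precision FLOP     */
    double e_dram_pj  = 1000.0;  /* pJ per byte of off-chip traffic  */

    double e_compute = flops * e_flop_pj * 1e-12;       /* joules */
    double e_data    = dram_bytes * e_dram_pj * 1e-12;  /* joules */

    printf("compute: %.1f J, data movement: %.1f J (%.0fx larger)\n",
           e_compute, e_data, e_data / e_compute);
    return 0;
}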

Roofline Performance Trends: GPUs

[Roofline plot: performance (GFLOPs) vs. operational intensity (flops/byte) for Nvidia GPUs C160, M2090, K40, P100, V100, showing the change in peak performance, peak memory bandwidth, and machine balance across generations.]

♦ Nvidia GPUs over 5 generations: Fermi, Maxwell, Kepler, Pascal, Volta
♦ Roofline plot: Peak GFLOPs has increased faster than Peak Mem-BW
♦ Machine balance (Peak_GFLOPs/Peak_BW) has steadily risen => Computations require higher operational intensity than the machine balance (lower data movement per op) to avoid being memory-bandwidth bound
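The roofline argument can be written down directly: attainable performance = min(peak compute, operational intensity x peak bandwidth), and the crossover point is the machine balance. The peak numbers below are placeholders, not the specs of any particular GPU.

#include <stdio.h>

/* Roofline model: attainable GFLOP/s = min(peak GFLOP/s, OI * peak GB/s),
   where OI (operational intensity) is flops per byte moved from memory. */
static double roofline(double oi, double peak_gflops, double peak_gbs) {
    double mem_bound = oi * peak_gbs;
    return mem_bound < peak_gflops ? mem_bound : peak_gflops;
}

int main(void) {
    double peak_gflops = 7000.0;                  /* placeholder peak compute   */
    double peak_gbs    = 900.0;                   /* placeholder peak memory BW */
    double balance     = peak_gflops / peak_gbs;  /* machine balance, flops/byte */

    printf("machine balance = %.1f flops/byte\n", balance);
    for (double oi = 0.25; oi <= 64.0; oi *= 2)
        printf("OI %5.2f -> attainable %7.1f GFLOP/s (%s-bound)\n",
               oi, roofline(oi, peak_gflops, peak_gbs),
               oi < balance ? "memory" : "compute");
    return 0;
}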

Computational vs. Data Movement Complexity

Untiled version (Comp. complexity: (N-2)^2 Ops):
for (i=1; i < N-1; i++)
  for (j=1; j < N-1; j++)
    A[i][j] = A[i][j-1] + A[i-1][j];

Tiled version (Comp. complexity: (N-2)^2 Ops):
for (it = 1; it < N-1; it += B)
  for (jt = 1; jt < N-1; jt += B)
    for (i = it; i < min(it+B, N-1); i++)
      for (j = jt; j < min(jt+B, N-1); j++)
        A[i][j] = A[i-1][j] + A[i][j-1];

Fig. 5: CDAG for the Gauss-Seidel code in Fig. 2. Input vertices are shown in black; all other vertices represent operations performed.
Fig. 6: Convex partitioning of the CDAG for the code in Fig. 2 for N = 10.

From the accompanying article (ACM Transactions on Architecture and Code Optimization):

The key idea behind the work presented in this article is to perform analysis on the CDAG of a computation, attempting to find a different order of execution of the operations that can improve the reuse-distance profile compared to that of the given program's sequential execution trace. If this analysis [finds such an improved order], suitable transformations have the potential to enhance data locality. On the other hand, if the analysis is unable to improve the reuse-distance profile of the code, it is likely that it is already as well optimized for data locality as possible.

Although a CDAG is derived from analysis of dependences between instances of statements executed by a sequential program, it abstracts away that sequential schedule of operations and only imposes an essential partial order captured by the data dependences between the operation instances. Control dependences in the computation need not be represented since the goal is to capture the inherent data locality characteristics based on the set of operations that actually transpired during an execution of the program.

The dynamic analysis involves the following steps:
(1) Generate a sequential execution trace of a program.
(2) Form a CDAG from the execution trace.
(3) Perform a multi-level convex partitioning of the CDAG, which is then used to change the schedule of operations of the CDAG from the original order in the given input code. A convex partitioning of a CDAG is analogous to tiling the iteration space of a regular nested loop computation. Multi-level convex partitioning is analogous to multi-level cache-oblivious blocking.
(4) Perform standard reuse-distance analysis of the reordered trace after multi-level convex partitioning.

In the case of loops, numerous efforts have attempted to optimize data locality by applying loop transformations, in particular involving loop tiling and loop fusion [Irigoin and Triolet 1988; Wolf and Lam 1991; Kennedy and McKinley 1993; Bondhugula et al. 2008]. Tiling for locality attempts to group points in an iteration space of a loop into smaller blocks (tiles) allowing reuse (thereby reducing reuse distance) in multiple directions when the block fits in [fast memory].

Finally, Fig. 6 shows the convex partitioning of the CDAG corresponding to the code in Fig. 2. After such a partitioning, the execution order of the vertices is reordered so that the convex partitions are executed in some valid order (corresponding to a topological sort of a coarse-grained inter-partition dependence graph), with the vertices within a partition being executed in the same relative order as the original sequential execution.
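Step (4) above relies on reuse-distance analysis. The toy sketch below (not the paper's implementation) shows the idea: for each access in a trace, count the distinct addresses touched since the previous access to the same address, and histogram those distances to characterize locality. The quadratic scan is for clarity only; production tools use stack or tree data structures.

#include <stdio.h>

#define TRACE_LEN 12
#define MAXDIST   64

int main(void) {
    int trace[TRACE_LEN] = {1, 2, 3, 1, 2, 3, 4, 5, 4, 5, 1, 2};  /* toy address trace */
    int hist[MAXDIST + 1] = {0};   /* hist[d] = #accesses with reuse distance d */
    int cold = 0;                  /* first-time (infinite-distance) accesses   */

    for (int i = 0; i < TRACE_LEN; i++) {
        int prev = -1;
        for (int k = i - 1; k >= 0; k--)           /* previous use of trace[i] */
            if (trace[k] == trace[i]) { prev = k; break; }
        if (prev < 0) { cold++; continue; }

        int distinct = 0;                          /* distinct addresses in between */
        for (int k = prev + 1; k < i; k++) {
            int seen = 0;
            for (int m = prev + 1; m < k; m++)
                if (trace[m] == trace[k]) { seen = 1; break; }
            if (!seen) distinct++;
        }
        hist[distinct]++;
    }

    printf("cold misses: %d\n", cold);
    for (int d = 0; d <= MAXDIST; d++)
        if (hist[d]) printf("reuse distance %d: %d accesses\n", d, hist[d]);
    return 0;
}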

Computational vs. Data Movement Complexity

for (i=1; i < ...

◆ Currently used primary metric of computational complexity is insufficient
  § For alternative algorithms with comparable computational complexity, the data movement cost is critical
  § Current performance tools cannot distinguish data movement bottlenecks that can be overcome (through loop/data transformations) from those that are inherent/fundamental

◆ The maximum performance achievable with these two codes is very different!

f = 1;                    f = 1;
for (k=0; k < ...         for (k=0; k < ...

Stmt_seq
  for "j"
    for "p"
      for (i=0; i < ...
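The code variants compared on this slide are truncated in the extracted text, so the pair below is a hypothetical stand-in that makes the same point: both functions perform exactly the same number of multiply-adds, but one reuses a 32 KB array from cache while the other streams about 512 MB from DRAM, so their achievable performance differs greatly. Sizes and names are illustrative; time the two functions to see the gap.

#include <stdio.h>
#include <stdlib.h>

#define SMALL   4096              /* 4K doubles = 32 KB: cache resident        */
#define REPEAT  16384             /* sweeps over the small array               */
#define BIG     (SMALL * REPEAT)  /* 64M doubles = 512 MB: streams from DRAM   */

/* Both variants perform exactly BIG multiply-adds. */

void cache_resident(double *a) {          /* the same 32 KB reused REPEAT times */
    for (long r = 0; r < REPEAT; r++)
        for (long i = 0; i < SMALL; i++)
            a[i] = a[i] * 0.999 + 0.001;
}

void streaming(double *a) {               /* every operand comes from DRAM */
    for (long i = 0; i < BIG; i++)
        a[i] = a[i] * 0.999 + 0.001;
}

int main(void) {
    double *small = calloc(SMALL, sizeof(double));
    double *big   = calloc(BIG, sizeof(double));   /* needs ~0.5 GB of RAM */
    cache_resident(small);
    streaming(big);
    printf("%g %g\n", small[1], big[1]);           /* keep results live */
    free(small); free(big);
    return 0;
}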

Algebraic Transformations: Operation Minimization

S(a,b,i,j) = Σ_{c,d,e,f,k,l} A(a,c,i,k) B(b,e,f,l) C(d,f,j,k) D(c,d,e,l)      [4N^10 Ops]

S(a,b,i,j) = Σ_{c,d,e,f,k,l} A(a,c,i,k) C(d,f,j,k) B(b,e,f,l) D(c,d,e,l)

S(a,b,i,j) = Σ_{c,d,f,k} A(a,c,i,k) C(d,f,j,k) ( Σ_{e,l} B(b,e,f,l) D(c,d,e,l) )

S(a,b,i,j) = Σ_{c,k} A(a,c,i,k) [ Σ_{d,f} C(d,f,j,k) ( Σ_{e,l} B(b,e,f,l) D(c,d,e,l) ) ]

T1(b,c,d,f) = Σ_{e,l} B(b,e,f,l) D(c,d,e,l)      [2N^6 Ops]
T2(b,c,j,k) = Σ_{d,f} T1(b,c,d,f) C(d,f,j,k)     [2N^6 Ops]
S(a,b,i,j)  = Σ_{c,k} T2(b,c,j,k) A(a,c,i,k)     [2N^6 Ops]

S(a,b,i,j) = Σ_{c,d,e,f,k,l} A(a,c,i,k) B(b,e,f,l) C(d,f,j,k) D(c,d,e,l)

◆ Requires 4*N^10 operations if all indices have range N
◆ Optimized form requires only 6*N^6 operations:

T1(b,c,d,f) = Σ_{e,l} B(b,e,f,l) D(c,d,e,l)
T2(b,c,j,k) = Σ_{d,f} T1(b,c,d,f) C(d,f,j,k)
S(a,b,i,j)  = Σ_{c,k} T2(b,c,j,k) A(a,c,i,k)

♦ Optimization problem: Given an input tensor-contraction expression, find the equivalent form that minimizes the number of operations
  § The problem is NP-hard; an efficient pruning search strategy was developed that is effective in practice
♦ However, storage requirements increase after operation minimization
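A minimal sketch of how the first contraction of the optimized form, T1(b,c,d,f) = Σ_{e,l} B(b,e,f,l) D(c,d,e,l), might be materialized, assuming every index has range N and row-major layout; the T2 and S contractions follow the same 6-nested-loop pattern, which is where the 6*N^6 total comes from. Names and layout are illustrative assumptions, not the slides' code.

#include <stdio.h>
#include <stdlib.h>

#define N 10   /* small illustrative index range */

/* Row-major linearization of a 4-index tensor with all extents N. */
#define IDX4(i, j, k, l) ((((i) * N + (j)) * N + (k)) * N + (l))

/* T1(b,c,d,f) = sum over e,l of B(b,e,f,l) * D(c,d,e,l): 2*N^6 operations. */
void contract_T1(const double *B, const double *D, double *T1) {
    for (int b = 0; b < N; b++)
      for (int c = 0; c < N; c++)
        for (int d = 0; d < N; d++)
          for (int f = 0; f < N; f++) {
            double acc = 0.0;
            for (int e = 0; e < N; e++)
              for (int l = 0; l < N; l++)
                acc += B[IDX4(b, e, f, l)] * D[IDX4(c, d, e, l)];
            T1[IDX4(b, c, d, f)] = acc;
          }
}

int main(void) {
    double *B  = calloc(N*N*N*N, sizeof(double));
    double *D  = calloc(N*N*N*N, sizeof(double));
    double *T1 = malloc(N*N*N*N * sizeof(double));
    contract_T1(B, D, T1);
    printf("%g\n", T1[0]);
    free(B); free(D); free(T1);
    return 0;
}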

Memory Minimization: Compute by Parts (Loop Fusion)

Expression sequence:
  T1bcdf = Σ_{e,l} Bbefl Dcdel
  T2bcjk = Σ_{d,f} T1bcdf Cdfjk
  Sabij  = Σ_{c,k} T2bcjk Aacik

Unfused code:
  T1 = 0; T2 = 0; S = 0
  for b, c, d, e, f, l
    T1bcdf += Bbefl Dcdel
  for b, c, d, f, j, k
    T2bcjk += T1bcdf Cdfjk
  for a, b, c, i, j, k
    Sabij += T2bcjk Aacik

(Partially) Fused code:
  S = 0
  for b, c
    T1f = 0; T2f = 0
    for d, e, f, l
      T1fdf += Bbefl Dcdel
    for d, f, j, k
      T2fjk += T1fdf Cdfjk
    for a, i, j, k
      Sabij += T2fjk Aacik

♦ Optimization problem: Given an operation-minimized sequence of tensor contractions, find the "best" set of loops to fuse to control memory and data movement costs
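A simplified C rendering of the fusion idea, covering just the T1 producer and its T2 consumer: an unfused schedule would materialize a full N^4 intermediate T1, whereas fusing the two loop nests over (b,c) keeps only an N^2 slice T1f(d,f) live at a time. Index ranges, names, and layout are illustrative assumptions.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define N 8
#define IDX4(i,j,k,l) ((((i)*N + (j))*N + (k))*N + (l))
#define IDX2(i,j)     ((i)*N + (j))

/* Fused computation of
     T2(b,c,j,k) = sum_{d,f} C(d,f,j,k) * [ sum_{e,l} B(b,e,f,l) * D(c,d,e,l) ]
   The T1 producer and its consumer are fused over (b,c), so only an
   N*N scratch slice T1f(d,f) is live instead of a full N^4 tensor. */
void fused(const double *B, const double *C, const double *D, double *T2) {
    double *T1f = malloc(N * N * sizeof(double));   /* per-(b,c) scratch slice */
    for (int b = 0; b < N; b++)
      for (int c = 0; c < N; c++) {
        memset(T1f, 0, N * N * sizeof(double));
        for (int d = 0; d < N; d++)                 /* produce T1f(d,f) */
          for (int e = 0; e < N; e++)
            for (int f = 0; f < N; f++)
              for (int l = 0; l < N; l++)
                T1f[IDX2(d,f)] += B[IDX4(b,e,f,l)] * D[IDX4(c,d,e,l)];
        for (int d = 0; d < N; d++)                 /* consume it into T2 */
          for (int f = 0; f < N; f++)
            for (int j = 0; j < N; j++)
              for (int k = 0; k < N; k++)
                T2[IDX4(b,c,j,k)] += T1f[IDX2(d,f)] * C[IDX4(d,f,j,k)];
      }
    free(T1f);
}

int main(void) {
    double *B  = calloc(N*N*N*N, sizeof(double));
    double *C  = calloc(N*N*N*N, sizeof(double));
    double *D  = calloc(N*N*N*N, sizeof(double));
    double *T2 = calloc(N*N*N*N, sizeof(double));
    fused(B, C, D, T2);
    printf("%g\n", T2[0]);
    free(B); free(C); free(D); free(T2);
    return 0;
}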


n Optimization Problem: Given an operation-minimized sequence of tensor- contractions, find “best” set of loops to fuse to control memory and data movement costs Research Directions ♦ Develop tools & techniques to characterize data movement complexity of algorithms Algorithms ♦ Develop performance optimization strategies for key matrix/tensor computations § Dense § Sparse Compilers/ ♦ Advance general-purpose as well as Frameworks domain/pattern-centric optimizing compilers § Productivity + Performance + Portability ♦ Develop effective performance modeling approaches for compilers § Hybrid analytical + machine learning Computer ♦ Develop effective design-space exploration Architecture framework for algorithm-architecture codesign

Summary

♦ Increasing parallelism and heterogeneity in computer systems
  § High-performance software development is getting more difficult => Compilers must play a bigger role in achieving performance, productivity and portability
♦ But compilers face a tough challenge: optimizing data movement
  § Multiple directions of research in addressing the challenge

♦ Effective Algorithm->Architecture co-design: massive design space § Compilers can play a key role

♦ Fall 2019 Seminar 7940-003 (Wed 1:25-2:45): study of recent research on algorithm/compiler/architecture issues for Machine Learning
  § Jointly with Profs. Rajeev Balasubramonian, Aditya Bhaskara, Vivek Srikumar and Suresh Venkatasubramanian

♦ Spring 2020: CS 6230: Parallel and High-Performance Computing