High-Performance Computing and Compiler Optimization

P. (Saday) Sadayappan

August 2019

What is a Compiler?
• Traditionally: Program that analyzes and translates from a high-level language (e.g., C++) to low-level assembly language that can be executed by hardware
• Compilers are translators

Translate from:                      Translate to:
• Fortran                             Machine code
• C                                   Virtual machine code
• C++                                 Transformed source code
• Java                                Augmented source code
• Text processing languages           Low-level commands
• HTML/XML                            Semantic components
• Command & Scripting Languages       Another language
• Natural language
• Domain specific languages

int a, b;                var a
a = 3;                   var b
if (a < 4) {             mov 3 a
  b = 2;                 mov 4 r1
} else {                 cmpi a r1
  b = 3;                 jge l_e
}                        mov 2 b
                         jmp l_d
                         l_e: mov 3 b
                         l_d: ;done


Source: Milind Kulkarni

Compilers are optimizers
• Can perform optimizations to make a program more efficient

int a, b, c;
b = a + 3;
c = a + 3;

Unoptimized:        Optimized:
var a               var a
var b               var b
var c               var c
mov a r1            mov a r1
addi 3 r1           addi 3 r1
mov r1 b            mov r1 b
mov a r2            mov r1 c
addi 3 r2
mov r2 c

Source: Milind Kulkarni
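As a C-level sketch of what the optimizer is doing in the example above (assuming it can prove that a is unchanged between the two statements), the redundant a + 3 is computed only once; the function names here are illustrative, not from the slides.

#include <stdio.h>

/* As written in the source: a + 3 is computed twice. */
void unoptimized(int a, int *b, int *c) {
    *b = a + 3;
    *c = a + 3;
}

/* What the optimizer effectively produces: the common subexpression
   a + 3 is computed once and kept in a register. */
void optimized(int a, int *b, int *c) {
    int t = a + 3;
    *b = t;
    *c = t;
}

int main(void) {
    int b, c;
    unoptimized(5, &b, &c);
    printf("%d %d\n", b, c);   /* prints 8 8 */
    optimized(5, &b, &c);
    printf("%d %d\n", b, c);   /* prints 8 8 */
    return 0;
}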

♦ Early days of computing: Minimizing the number of executed instructions minimized program execution time
  § Sequential processors had a single functional unit to execute instructions
  § Compiler technology is very advanced in minimizing the number of instructions
♦ Today:
  § All computers are highly parallel: must make use of all parallel resources
  § Cost of data movement dominates the cost of performing the operations on data
  § Many challenging problems for compilers

The Good Old Days for Software

Source: J. Birnbaum

• Single-processor performance experienced dramatic improvements from clock speed and architectural improvements (pipelining, instruction-level parallelism)
• Applications experienced automatic performance improvement

Power Density Trends

[Figure: power density (Watts/cm^2) vs. feature size (1.5µ down to 0.07µ) for Pentium, Pentium Pro, Pentium II, and Pentium III, rising past hot-plate levels toward nuclear reactor and rocket nozzle levels. P=VI: 75W @ 1.5V = 50 A!]

Power density is the expended power per unit area on chip: too high => transistors malfunction

Hitting the Power Wall


http://img.tomshardware.com/us/2005/11/21/the_mother_of_all_cpu_charts_2005/cpu_frequency.gif

Hitting the Power Wall

http://img.tomshardware.com/us/2005/11/21/the_mother_of_all_cpu_charts_2005/cpu_frequency.gif

2004 – Intel cancels Tejas and Jayhawk due to "heat problems due to the extreme power consumption of the core ..."

The Only Option: Many Cores

♦ Chip density is still increasing (but the end is in sight)
  § Clock speed is not
♦ There is little or no more hidden parallelism (ILP) to be found
♦ Parallelism must be exposed to and managed by software
♦ Computers are getting heterogeneous: CPUs, GPUs, TPUs, FPGAs

Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)

End of Moore's Law Scaling of VLSI [From Hennessey & Patterson]

♦ Moore's Law (doubling of #transistors on a chip every 18 months-2 years) over 4 decades has fueled over 5 orders of magnitude rise in computing power/chip
♦ Unfortunately, we are now at the end of that ride

Figure 6. Growth of computer performance using integer programs (SPECintCPU), relative to VAX-11/780, 1980-2015:
  CISC: 2X/2.5 years (22%/year)
  RISC: 2X/1.5 years (52%/year)
  End of Dennard Scaling ⇒ Multicore: 2X/3.5 years (23%/year)
  Amdahl's Law ⇒ 2X/6 years (12%/year)
  End of the Line ⇒ 2X/20 years (3%/year)

Figure 7. Potential speedup of matrix multiply in Python for four optimizations (speedup over native Python):
  Python: 1
  C: 47
  + parallel loops: 366
  + memory optimization: 6,727
  + SIMD instructions: 62,806

From the reproduced Turing Lecture article (Hennessy & Patterson, Communications of the ACM, February 2019, Vol. 62, No. 2):

... predicting the outcome of 15 branches. If a processor architect wants to limit wasted work to only 10% of the time, the processor must predict each branch correctly 99.3% of the time. Few general-purpose programs have branches that can be predicted so accurately. To appreciate how this wasted work adds up, consider the data in Figure 4, showing the fraction of instructions that are effectively executed but turn out to be wasted because the processor speculated incorrectly. On average, 19% of the instructions are wasted for these benchmarks on an i7. The amount of wasted energy is greater, however, since the processor must use additional energy to restore the state when it speculates incorrectly. Measurements like these led many to conclude architects needed a different approach to achieve performance improvements. The multicore era was thus born.

Multicore shifted responsibility for identifying parallelism and deciding how to exploit it to the programmer and to the language system. Multicore does not resolve the challenge of energy-efficient computation that was exacerbated by the end of Dennard scaling. Each active core burns power whether or not it contributes effectively to the computation. A primary hurdle is an old observation, called Amdahl's Law, stating that the speedup from a parallel computer is limited by the portion of a computation that is sequential. To appreciate the importance of this observation, consider Figure 5, showing how much faster an application runs with up to 64 cores compared to a single core, assuming different portions of serial execution, where only one processor is active. For example, when only 1% of the time is serial, the speedup for a 64-processor configuration is about 35. Unfortunately, the power needed is proportional to 64 processors, so approximately 45% of the energy is wasted. Real programs have more complex structures of course, with portions that allow varying numbers of processors to be used at any given moment in time. Nonetheless, the need to communicate and synchronize periodically means most applications have some portions that can effectively use only a fraction of the processors. Although Amdahl's Law is more than 50 years old, it remains a difficult hurdle.

With the end of Dennard scaling, increasing the number of cores on a chip meant power is also increasing at nearly the same rate. Unfortunately, the power that goes into a processor must also be removed as heat. Multicore processors are thus limited by the thermal dissipation power (TDP), or average amount of power the package and cooling system can remove. Although some high-end data centers may use more advanced packages and cooling technology, no computer users would want to put a small heat exchanger on their desks or wear a radiator on their backs to cool their cellphones. The limit of TDP led directly to the era of "dark silicon," whereby processors would slow on the clock rate and turn off idle cores to prevent overheating. Another way to view this approach is that some chips can reallocate their precious power from the idle cores to the active ones.

An era without Dennard scaling, along with reduced Moore's Law and Amdahl's Law in full effect means inefficiency limits improvement in performance to only a few percent per year (see Figure 6). Achieving higher rates of performance improvement—as was seen in the 1980s and 1990s—will require new architectural approaches that use the integrated-circuit capability much more efficiently. We will return to what approaches might work after discussing another major shortcoming of modern computers—their support, or lack thereof, for computer security.

Can You Predict Performance?

• E1 executes 2*10^8 FLOPs and takes about 0.4s to execute on this machine
• Performance in GFLOPs: billions (Giga) of FLoating-point Operations per Second; for E1: 2*10^8 / 0.4s = 0.5 GFLOPs
• About how long does E2 take?
  1. [0-0.4s]  (measured: 0.35s => 2.27 GFLOPs)
  2. [0.4s-0.6s]
  3. [0.6s-0.8s]
  4. More than 0.8s

double W,X,Y,Z;
for(j=0;j<100000000;j++){
  W = 0.999999*X;
  X = 0.999999*W;}
// Example loop E1

for(j=0;j<100000000;j++){
  W = 0.999999*W + 0.000001;
  X = 0.999999*X + 0.000001;
  Y = 0.999999*Y + 0.000001;
  Z = 0.999999*Z + 0.000001;
} // Example loop E2
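A minimal measurement sketch (not from the slides): E1 performs 2 FLOPs per iteration (2*10^8 total) and E2 performs 8 FLOPs per iteration (8*10^8 total), so GFLOPs = FLOPs / seconds / 10^9. This assumes a POSIX clock; compile with optimization (e.g., gcc -O2) and expect the timings to vary by machine.

#include <stdio.h>
#include <time.h>

static double now(void) {                    /* wall-clock seconds */
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    double W = 1.0, X = 1.0, Y = 1.0, Z = 1.0;
    long j;

    double t0 = now();
    for (j = 0; j < 100000000; j++) {        /* E1: 2 FLOPs/iter */
        W = 0.999999 * X;
        X = 0.999999 * W;
    }
    double t1 = now();

    for (j = 0; j < 100000000; j++) {        /* E2: 8 FLOPs/iter */
        W = 0.999999 * W + 0.000001;
        X = 0.999999 * X + 0.000001;
        Y = 0.999999 * Y + 0.000001;
        Z = 0.999999 * Z + 0.000001;
    }
    double t2 = now();

    printf("E1: %.2fs, %.2f GFLOPs\n", t1 - t0, 2e8 / (t1 - t0) / 1e9);
    printf("E2: %.2fs, %.2f GFLOPs\n", t2 - t1, 8e8 / (t2 - t1) / 1e9);
    printf("%g %g %g %g\n", W, X, Y, Z);     /* keep results live */
    return 0;
}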

ILP Affects Performance

• ILP (Instruction Level Parallelism): Many operations in a sequential code could be executed concurrently if they do not have dependences
• Pipelined stages in functional units can be exploited by ILP
• Multiple functional units in a CPU can be exploited by ILP
• ILP is automatically exploited by the system when possible
• E2's statements are independent and provide ILP, but E1's statements are not independent and do not provide ILP

double W,X,Y,Z;
for(j=0;j<100000000;j++){
  W = 0.999999*X;
  X = 0.999999*W;}
// Example loop E1

for(j=0;j<100000000;j++){
  W = 0.999999*W + 0.000001;
  X = 0.999999*X + 0.000001;
  Y = 0.999999*Y + 0.000001;
  Z = 0.999999*Z + 0.000001;
} // Example loop E2
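The same effect can be engineered by hand in a reduction: one accumulator forms a single long dependence chain (no ILP, like E1), while four accumulators create independent chains that pipelined floating-point units can overlap (like E2). This is a sketch under assumed conditions; the array size and the 4-way split are illustrative, and strict IEEE semantics keep the compiler from doing this reassociation itself unless fast-math options are used.

#include <stdio.h>
#include <stdlib.h>

#define N (1 << 20)   /* 1M doubles, divisible by 4 */

/* One accumulator: each add waits for the previous one,
   so the pipelined FP unit stays mostly idle. */
double sum1(const double *a) {
    double s = 0.0;
    for (long i = 0; i < N; i++) s += a[i];
    return s;
}

/* Four accumulators: four independent dependence chains expose ILP;
   the partial sums are combined once at the end. */
double sum4(const double *a) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (long i = 0; i < N; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}

int main(void) {
    double *a = malloc(N * sizeof(double));
    for (long i = 0; i < N; i++) a[i] = 1.0 / (i + 1);
    printf("%f %f\n", sum1(a), sum4(a));
    free(a);
    return 0;
}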

Performance Prediction

• About how long will the code run for the 4Kx4K matrix?
  1. [0-0.3s]

#define N 32
#define T 1024*1024
double A[N][N];
// #FLOPs = 2^30
for(it=0; it<T; it++)
  ...

#define N 4096
#define T 64
double A[N][N];
// #FLOPs = 2^30
for(it=0; it<T; it++)
  ...

          32x32    4Kx4K
FLOPs     2^30     2^30
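The inner statement of this benchmark is cut off in the extracted slides, so the sketch below substitutes a hypothetical neighbor-sum update purely to make the experiment runnable: both configurations execute roughly 2^30 updates, but the 32x32 array (8 KB) stays cache-resident while the 4Kx4K array (128 MB) does not, which is what the next slide explains.

#include <stdio.h>
#include <stdlib.h>

/* T repeated sweeps over an NxN array; total work is comparable in
   both configurations, but only the small array stays in cache. */
static double sweep(double *A, int n, long t) {
    for (long it = 0; it < t; it++)
        for (int i = 1; i < n - 1; i++)
            for (int j = 1; j < n - 1; j++)
                A[i*n + j] = A[i*n + j - 1] + A[(i-1)*n + j];   /* hypothetical stencil */
    return A[(n-2)*n + (n-2)];
}

int main(void) {
    int  n[2] = {32, 4096};             /* 8 KB vs. 128 MB of data   */
    long t[2] = {1024L * 1024L, 64L};   /* chosen so #updates matches */
    for (int c = 0; c < 2; c++) {
        double *A = calloc((size_t)n[c] * n[c], sizeof(double));
        printf("N=%d T=%ld checksum=%g\n", n[c], t[c], sweep(A, n[c], t[c]));
        free(A);
    }
    return 0;
}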

Performance Prediction

• 32x32 matrix fits in fast cache memory, but the 4Kx4K matrix cannot

#define N 32
#define T 1024*1024
double A[N][N];
// #FLOPs = 2^30
for(it=0; it<T; it++)
  ...

#define N 4096
#define T 64
double A[N][N];
// #FLOPs = 2^30
for(it=0; it<T; it++)
  ...

          32x32    4Kx4K
FLOPs     2^30     2^30

Data Movement Cost: Energy Trends

• FLOPs almost free; data movement cost is dominant
• Minimizing the amount of data movement is increasingly critical

[Figure: energy per operation in picojoules, 45 nm vs. 11 nm (2018), for DP FLOP, register, 1mm on-chip, 5mm on-chip, local interconnect, cross-system, and off-chip/DRAM accesses; chart annotations: 2.35x decrease, 10x decrease, no change, 20x, 85x.]

Source: Jim Demmel, John Shalf
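A back-of-envelope sketch of why these trends matter: total energy splits into an arithmetic term and a data-movement term. The picojoule values and traffic volumes below are illustrative placeholders, not numbers read off the chart.

#include <stdio.h>

/* Rough energy model: total energy = flops * e_flop + bytes_moved * e_byte.
   The per-FLOP and per-byte energies are placeholders in picojoules;
   plug in values for a real machine. */
int main(void) {
    double flops      = 2e12;    /* 2 TFLOP of useful arithmetic     */
    double dram_bytes = 1e12;    /* 1 TB moved to/from DRAM          */
    double e_flop_pj  = 10.0;    /* pJ per double-precision FLOP     */
    double e_dram_pj  = 1000.0;  /* pJ per byte of off-chip traffic  */

    double e_compute = flops * e_flop_pj * 1e-12;       /* joules */
    double e_data    = dram_bytes * e_dram_pj * 1e-12;  /* joules */

    printf("compute: %.1f J, data movement: %.1f J (%.0fx larger)\n",
           e_compute, e_data, e_data / e_compute);
    return 0;
}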

Roofline Performance Trends: GPUs

[Roofline plot: performance (GFLOPs) vs. operational intensity (flops/byte) for Nvidia GPUs C160, M2090, K40, P100, V100, showing the change in peak performance, peak memory bandwidth, and machine balance across generations.]

♦ Nvidia GPUs over 5 generations: Fermi, Maxwell, Kepler, Pascal, Volta
♦ Roofline plot: Peak GFLOPs has increased faster than Peak Mem-BW
♦ Machine balance (Peak_GFLOPs/Peak_BW) has steadily risen => Computations require higher operational intensity than the machine balance (lower data movement per op) to avoid being memory-bandwidth bound
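The roofline argument can be written down directly: attainable performance = min(peak compute, operational intensity x peak bandwidth), and the crossover point is the machine balance. The peak numbers below are placeholders, not the specs of any particular GPU.

#include <stdio.h>

/* Roofline model: attainable GFLOP/s = min(peak GFLOP/s, OI * peak GB/s),
   where OI (operational intensity) is flops per byte moved from memory. */
static double roofline(double oi, double peak_gflops, double peak_gbs) {
    double mem_bound = oi * peak_gbs;
    return mem_bound < peak_gflops ? mem_bound : peak_gflops;
}

int main(void) {
    double peak_gflops = 7000.0;                  /* placeholder peak compute   */
    double peak_gbs    = 900.0;                   /* placeholder peak memory BW */
    double balance     = peak_gflops / peak_gbs;  /* machine balance, flops/byte */

    printf("machine balance = %.1f flops/byte\n", balance);
    for (double oi = 0.25; oi <= 64.0; oi *= 2)
        printf("OI %5.2f -> attainable %7.1f GFLOP/s (%s-bound)\n",
               oi, roofline(oi, peak_gflops, peak_gbs),
               oi < balance ? "memory" : "compute");
    return 0;
}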

Computational vs. Data Movement Complexity

Untiled version (Comp. complexity: (N-2)^2 Ops):
for (i=1; i < N-1; i++)
  for (j=1; j < N-1; j++)
    A[i][j] = A[i][j-1] + A[i-1][j];

Tiled version (Comp. complexity: (N-2)^2 Ops):
for (it = 1; it < N-1; it += B)
  for (jt = 1; jt < N-1; jt += B)
    for (i = it; i < min(it+B, N-1); i++)
      for (j = jt; j < min(jt+B, N-1); j++)
        A[i][j] = A[i-1][j] + A[i][j-1];

Fig. 5: CDAG for the Gauss-Seidel code in Fig. 2. Input vertices are shown in black; all other vertices represent operations performed.
Fig. 6: Convex partitioning of the CDAG for the code in Fig. 2 for N = 10.

From the accompanying article (ACM Transactions on Architecture and Code Optimization):

The key idea behind the work presented in this article is to perform analysis on the CDAG of a computation, attempting to find a different order of execution of the operations that can improve the reuse-distance profile compared to that of the given program's sequential execution trace. If this analysis [finds such an improved order], suitable transformations have the potential to enhance data locality. On the other hand, if the analysis is unable to improve the reuse-distance profile of the code, it is likely that it is already as well optimized for data locality as possible.

Although a CDAG is derived from analysis of dependences between instances of statements executed by a sequential program, it abstracts away that sequential schedule of operations and only imposes an essential partial order captured by the data dependences between the operation instances. Control dependences in the computation need not be represented since the goal is to capture the inherent data locality characteristics based on the set of operations that actually transpired during an execution of the program.

The dynamic analysis involves the following steps:
(1) Generate a sequential execution trace of a program.
(2) Form a CDAG from the execution trace.
(3) Perform a multi-level convex partitioning of the CDAG, which is then used to change the schedule of operations of the CDAG from the original order in the given input code. A convex partitioning of a CDAG is analogous to tiling the iteration space of a regular nested loop computation. Multi-level convex partitioning is analogous to multi-level cache-oblivious blocking.
(4) Perform standard reuse-distance analysis of the reordered trace after multi-level convex partitioning.

In the case of loops, numerous efforts have attempted to optimize data locality by applying loop transformations, in particular involving loop tiling and loop fusion [Irigoin and Triolet 1988; Wolf and Lam 1991; Kennedy and McKinley 1993; Bondhugula et al. 2008]. Tiling for locality attempts to group points in an iteration space of a loop into smaller blocks (tiles) allowing reuse (thereby reducing reuse distance) in multiple directions when the block fits in [fast memory].

Finally, Fig. 6 shows the convex partitioning of the CDAG corresponding to the code in Fig. 2. After such a partitioning, the execution order of the vertices is reordered so that the convex partitions are executed in some valid order (corresponding to a topological sort of a coarse-grained inter-partition dependence graph), with the vertices within a partition being executed in the same relative order as the original sequential execution.
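Step (4) above relies on reuse-distance analysis. The toy sketch below (not the paper's implementation) shows the idea: for each access in a trace, count the distinct addresses touched since the previous access to the same address, and histogram those distances to characterize locality. The quadratic scan is for clarity only; production tools use stack or tree data structures.

#include <stdio.h>

#define TRACE_LEN 12
#define MAXDIST   64

int main(void) {
    int trace[TRACE_LEN] = {1, 2, 3, 1, 2, 3, 4, 5, 4, 5, 1, 2};  /* toy address trace */
    int hist[MAXDIST + 1] = {0};   /* hist[d] = #accesses with reuse distance d */
    int cold = 0;                  /* first-time (infinite-distance) accesses   */

    for (int i = 0; i < TRACE_LEN; i++) {
        int prev = -1;
        for (int k = i - 1; k >= 0; k--)           /* previous use of trace[i] */
            if (trace[k] == trace[i]) { prev = k; break; }
        if (prev < 0) { cold++; continue; }

        int distinct = 0;                          /* distinct addresses in between */
        for (int k = prev + 1; k < i; k++) {
            int seen = 0;
            for (int m = prev + 1; m < k; m++)
                if (trace[m] == trace[k]) { seen = 1; break; }
            if (!seen) distinct++;
        }
        hist[distinct]++;
    }

    printf("cold misses: %d\n", cold);
    for (int d = 0; d <= MAXDIST; d++)
        if (hist[d]) printf("reuse distance %d: %d accesses\n", d, hist[d]);
    return 0;
}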

Computational vs. Data Movement Complexity

for (i=1; i < ...

◆ Currently used primary metric of computational complexity is insufficient
  § For alternative algorithms with comparable computational complexity, the data movement cost is critical
  § Current performance tools cannot distinguish data movement bottlenecks that can be overcome (through loop/data transformations) from those that are inherent/fundamental

◆ The maximum performance achievable with these two codes is very different!

f = 1;                    f = 1;
for (k=0; k < ...         for (k=0; k < ...

Stmt_seq
  for "j"
    for "p"
      for (i=0; i < ...
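The code variants compared on this slide are truncated in the extracted text, so the pair below is a hypothetical stand-in that makes the same point: both functions perform exactly the same number of multiply-adds, but one reuses a 32 KB array from cache while the other streams about 512 MB from DRAM, so their achievable performance differs greatly. Sizes and names are illustrative; time the two functions to see the gap.

#include <stdio.h>
#include <stdlib.h>

#define SMALL   4096              /* 4K doubles = 32 KB: cache resident        */
#define REPEAT  16384             /* sweeps over the small array               */
#define BIG     (SMALL * REPEAT)  /* 64M doubles = 512 MB: streams from DRAM   */

/* Both variants perform exactly BIG multiply-adds. */

void cache_resident(double *a) {          /* the same 32 KB reused REPEAT times */
    for (long r = 0; r < REPEAT; r++)
        for (long i = 0; i < SMALL; i++)
            a[i] = a[i] * 0.999 + 0.001;
}

void streaming(double *a) {               /* every operand comes from DRAM */
    for (long i = 0; i < BIG; i++)
        a[i] = a[i] * 0.999 + 0.001;
}

int main(void) {
    double *small = calloc(SMALL, sizeof(double));
    double *big   = calloc(BIG, sizeof(double));   /* needs ~0.5 GB of RAM */
    cache_resident(small);
    streaming(big);
    printf("%g %g\n", small[1], big[1]);           /* keep results live */
    free(small); free(big);
    return 0;
}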

Algebraic Transformations: Operation Minimization

S(a,b,i,j) = Σ_{c,d,e,f,k,l} A(a,c,i,k) B(b,e,f,l) C(d,f,j,k) D(c,d,e,l)      [4N^10 Ops]

S(a,b,i,j) = Σ_{c,d,e,f,k,l} A(a,c,i,k) C(d,f,j,k) B(b,e,f,l) D(c,d,e,l)

S(a,b,i,j) = Σ_{c,d,f,k} A(a,c,i,k) C(d,f,j,k) ( Σ_{e,l} B(b,e,f,l) D(c,d,e,l) )

S(a,b,i,j) = Σ_{c,k} A(a,c,i,k) [ Σ_{d,f} C(d,f,j,k) ( Σ_{e,l} B(b,e,f,l) D(c,d,e,l) ) ]

T1(b,c,d,f) = Σ_{e,l} B(b,e,f,l) D(c,d,e,l)      [2N^6 Ops]
T2(b,c,j,k) = Σ_{d,f} T1(b,c,d,f) C(d,f,j,k)     [2N^6 Ops]
S(a,b,i,j)  = Σ_{c,k} T2(b,c,j,k) A(a,c,i,k)     [2N^6 Ops]

S(a,b,i,j) = Σ_{c,d,e,f,k,l} A(a,c,i,k) B(b,e,f,l) C(d,f,j,k) D(c,d,e,l)

◆ Requires 4*N^10 operations if all indices have range N
◆ Optimized form requires only 6*N^6 operations:

T1(b,c,d,f) = Σ_{e,l} B(b,e,f,l) D(c,d,e,l)
T2(b,c,j,k) = Σ_{d,f} T1(b,c,d,f) C(d,f,j,k)
S(a,b,i,j)  = Σ_{c,k} T2(b,c,j,k) A(a,c,i,k)

♦ Optimization problem: Given an input tensor-contraction expression, find the equivalent form that minimizes the number of operations
  § The problem is NP-hard; an efficient pruning search strategy was developed that is effective in practice
♦ However, storage requirements increase after operation minimization
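A minimal sketch of how the first contraction of the optimized form, T1(b,c,d,f) = Σ_{e,l} B(b,e,f,l) D(c,d,e,l), might be materialized, assuming every index has range N and row-major layout; the T2 and S contractions follow the same 6-nested-loop pattern, which is where the 6*N^6 total comes from. Names and layout are illustrative assumptions, not the slides' code.

#include <stdio.h>
#include <stdlib.h>

#define N 10   /* small illustrative index range */

/* Row-major linearization of a 4-index tensor with all extents N. */
#define IDX4(i, j, k, l) ((((i) * N + (j)) * N + (k)) * N + (l))

/* T1(b,c,d,f) = sum over e,l of B(b,e,f,l) * D(c,d,e,l): 2*N^6 operations. */
void contract_T1(const double *B, const double *D, double *T1) {
    for (int b = 0; b < N; b++)
      for (int c = 0; c < N; c++)
        for (int d = 0; d < N; d++)
          for (int f = 0; f < N; f++) {
            double acc = 0.0;
            for (int e = 0; e < N; e++)
              for (int l = 0; l < N; l++)
                acc += B[IDX4(b, e, f, l)] * D[IDX4(c, d, e, l)];
            T1[IDX4(b, c, d, f)] = acc;
          }
}

int main(void) {
    double *B  = calloc(N*N*N*N, sizeof(double));
    double *D  = calloc(N*N*N*N, sizeof(double));
    double *T1 = malloc(N*N*N*N * sizeof(double));
    contract_T1(B, D, T1);
    printf("%g\n", T1[0]);
    free(B); free(D); free(T1);
    return 0;
}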

Memory Minimization: Compute by Parts (Loop Fusion)

Expression sequence:
  T1bcdf = Σ_{e,l} Bbefl Dcdel
  T2bcjk = Σ_{d,f} T1bcdf Cdfjk
  Sabij  = Σ_{c,k} T2bcjk Aacik

Unfused code:
  T1 = 0; T2 = 0; S = 0
  for b, c, d, e, f, l
    T1bcdf += Bbefl Dcdel
  for b, c, d, f, j, k
    T2bcjk += T1bcdf Cdfjk
  for a, b, c, i, j, k
    Sabij += T2bcjk Aacik

(Partially) Fused code:
  S = 0
  for b, c
    T1f = 0; T2f = 0
    for d, e, f, l
      T1fdf += Bbefl Dcdel
    for d, f, j, k
      T2fjk += T1fdf Cdfjk
    for a, i, j, k
      Sabij += T2fjk Aacik

♦ Optimization problem: Given an operation-minimized sequence of tensor contractions, find the "best" set of loops to fuse to control memory and data movement costs
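A simplified C rendering of the fusion idea, covering just the T1 producer and its T2 consumer: an unfused schedule would materialize a full N^4 intermediate T1, whereas fusing the two loop nests over (b,c) keeps only an N^2 slice T1f(d,f) live at a time. Index ranges, names, and layout are illustrative assumptions.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define N 8
#define IDX4(i,j,k,l) ((((i)*N + (j))*N + (k))*N + (l))
#define IDX2(i,j)     ((i)*N + (j))

/* Fused computation of
     T2(b,c,j,k) = sum_{d,f} C(d,f,j,k) * [ sum_{e,l} B(b,e,f,l) * D(c,d,e,l) ]
   The T1 producer and its consumer are fused over (b,c), so only an
   N*N scratch slice T1f(d,f) is live instead of a full N^4 tensor. */
void fused(const double *B, const double *C, const double *D, double *T2) {
    double *T1f = malloc(N * N * sizeof(double));   /* per-(b,c) scratch slice */
    for (int b = 0; b < N; b++)
      for (int c = 0; c < N; c++) {
        memset(T1f, 0, N * N * sizeof(double));
        for (int d = 0; d < N; d++)                 /* produce T1f(d,f) */
          for (int e = 0; e < N; e++)
            for (int f = 0; f < N; f++)
              for (int l = 0; l < N; l++)
                T1f[IDX2(d,f)] += B[IDX4(b,e,f,l)] * D[IDX4(c,d,e,l)];
        for (int d = 0; d < N; d++)                 /* consume it into T2 */
          for (int f = 0; f < N; f++)
            for (int j = 0; j < N; j++)
              for (int k = 0; k < N; k++)
                T2[IDX4(b,c,j,k)] += T1f[IDX2(d,f)] * C[IDX4(d,f,j,k)];
      }
    free(T1f);
}

int main(void) {
    double *B  = calloc(N*N*N*N, sizeof(double));
    double *C  = calloc(N*N*N*N, sizeof(double));
    double *D  = calloc(N*N*N*N, sizeof(double));
    double *T2 = calloc(N*N*N*N, sizeof(double));
    fused(B, C, D, T2);
    printf("%g\n", T2[0]);
    free(B); free(C); free(D); free(T2);
    return 0;
}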


n Optimization Problem: Given an operation-minimized sequence of tensor- contractions, find “best” set of loops to fuse to control memory and data movement costs Research Directions ♦ Develop tools & techniques to characterize data movement complexity of algorithms Algorithms ♦ Develop performance optimization strategies for key matrix/tensor computations § Dense § Sparse Compilers/ ♦ Advance general-purpose as well as Frameworks domain/pattern-centric optimizing compilers § Productivity + Performance + Portability ♦ Develop effective performance modeling approaches for compilers § Hybrid analytical + machine learning Computer ♦ Develop effective design-space exploration Architecture framework for algorithm-architecture codesign

Summary

♦ Increasing parallelism and heterogeneity in computer systems
  § High-performance software development is getting more difficult => Compilers must play a bigger role in achieving performance, productivity and portability
♦ But compilers face a tough challenge: optimizing data movement
  § Multiple directions of research in addressing the challenge

♦ Effective Algorithm->Architecture co-design: massive design space § Compilers can play a key role

♦ Fall 2019 Seminar 7940-003 (Wed 1:25-2:45): study of recent research on algorithm/compiler/architecture issues for Machine Learning
  § Jointly with Profs. Rajeev Balasubramonian, Aditya Bhaskara, Vivek Srikumar and Suresh Venkatasubramanian

♦ Spring 2020: CS 6230: Parallel and High-Performance Computing