Program Optimization These Slides Performance Realities Optimizing

These Slides Overview Generally Useful Optimizations Program Optimization . Code motion/precomputation . Strength reduction CSci 2021: Machine Architecture and Organization . Sharing of common subexpressions Lecture #22-23, March 13th-16th, 2015 . Removing unnecessary procedure calls Optimization Blockers Your instructor: Stephen McCamant . Procedure calls . Memory aliasing Based on slides originally by: Randy Bryant, Dave O’Hallaron, Antonia Zhai Optimizing In Larger Programs: Profiling Exploiting Instruction-Level Parallelism Dealing with Conditionals 1 2 Performance Realities Optimizing Compilers Provide efficient mapping of program to machine There’s more to performance than asymptotic complexity . register allocation . code selection and ordering (scheduling) Constant factors matter too! . dead code elimination . Easily see 10:1 performance range depending on how code is written . eliminating minor inefficiencies . Must optimize at multiple levels: Don’t (usually) improve asymptotic efficiency . algorithm, data representations, procedures, and loops . up to programmer to select best overall algorithm Must understand system to optimize performance . big-O savings are (often) more important than constant factors . How programs are compiled and executed . but constant factors also matter . How to measure program performance and identify bottlenecks Have difficulty overcoming “optimization blockers” . How to improve performance without destroying code modularity and . potential memory aliasing generality . potential procedure side-effects 3 4 Limitations of Optimizing Compilers Generally Useful Optimizations Operate under fundamental constraint . Must not cause any change in program behavior Optimizations that you or the compiler should do regardless . Often prevents it from making optimizations when would only affect behavior of processor / compiler under pathological conditions. Behavior that may be obvious to the programmer can be obfuscated by languages and coding styles (Loop Invariant) Code Motion . e.g., Data ranges may be more limited than variable types suggest . Reduce frequency with which computation performed Most analysis is performed only within procedures . If it will always produce same result . Whole-program analysis is too expensive in most cases . Especially moving code out of loop Most analysis is based only on static information void set_row(double *a, double *b, . Compiler has difficulty anticipating run-time inputs long i, long n) { long j; long j; When in doubt, the compiler must be conservative for (j = 0; j < n; j++) int ni = n*i; a[n*i+j] = b[j]; for (j = 0; j < n; j++) } a[ni+j] = b[j]; 5 6 1 Reduction in Strength Compiler-Generated Code Motion void set_row(double *a, double *b, long i, long n) long j; . Replace costly operation with simpler one { long ni = n*i; long j; double *rowp = a+ni; . Shift, add instead of multiply or divide for (j = 0; j < n; j++) for (j = 0; j < n; j++) 16*x --> x << 4 a[n*i+j] = b[j]; *rowp++ = b[j]; } . Utility machine dependent . Depends on cost of multiply or divide instruction Where are the FP operations? – On Intel Nehalem, integer multiply requires 3 CPU cycles set_row: testq %rcx, %rcx # Test n . Recognize sequence of products jle .L4 # If 0, goto done movq %rcx, %rax # rax = n imulq %rdx, %rax # rax *= i leaq (%rdi,%rax,8), %rdx # rowp = A + n*i*8 int ni = 0; movl $0, %r8d # j = 0 for (i = 0; i < n; i++) for (i = 0; i < n; i++) { .L3: # loop: for (j = 0; j < n; j++) for (j = 0; j < n; j++) movq (%rsi,%r8,8), %rax # t = b[j] a[n*i + j] = b[j]; a[ni + j] = b[j]; movq %rax, (%rdx) # *rowp = t ni += n; addq $1, %r8 # j++ } addq $8, %rdx # rowp++ cmpq %r8, %rcx # Compare n:j jg .L3 # If >, goto loop .L4: # done: rep ; ret 7 8 Share Common Subexpressions Optimization Blocker #1: Procedure Calls . Reuse portions of expressions . Compilers often not very sophisticated in exploiting arithmetic Procedure to Convert String to Lower Case properties void lower(char *s) /* Sum neighbors of i,j */ long inj = i*n + j; { up = val[(i-1)*n + j ]; up = val[inj - n]; int i; down = val[(i+1)*n + j ]; down = val[inj + n]; left = val[i*n + j-1]; left = val[inj - 1]; for (i = 0; i < strlen(s); i++) right = val[i*n + j+1]; right = val[inj + 1]; if (s[i] >= 'A' && s[i] <= 'Z') sum = up + down + left + right; sum = up + down + left + right; s[i] -= ('A' - 'a'); } 3 multiplications: i*n, (i–1)*n, (i+1)*n 1 multiplication: i*n leaq 1(%rsi), %rax # i+1 imulq %rcx, %rsi # i*n leaq -1(%rsi), %r8 # i-1 addq %rdx, %rsi # i*n+j . Extracted from 213 lab submissions, Fall, 1998 imulq %rcx, %rsi # i*n movq %rsi, %rax # i*n+j imulq %rcx, %rax # (i+1)*n subq %rcx, %rax # i*n+j-n imulq %rcx, %r8 # (i-1)*n leaq (%rsi,%rcx), %rcx # i*n+j+n addq %rdx, %rsi # i*n+j addq %rdx, %rax # (i+1)*n+j addq %rdx, %r8 # (i-1)*n+j 9 10 Lower Case Conversion Performance Convert Loop To Goto Form void lower(char *s) { . Time quadruples when double string length int i = 0; . Quadratic performance if (i >= strlen(s)) goto done; 200 lower loop: 180 if (s[i] >= 'A' && s[i] <= 'Z') 160 s[i] -= ('A' - 'a'); 140 i++; 120 if (i < strlen(s)) 100 goto loop; 80 done: CPU seconds CPU 60 } 40 20 . strlen executed every iteration 0 0 100000 200000 300000 400000 500000 String length 11 12 2 Calling Strlen Improving Performance /* My version of strlen */ void lower(char *s) size_t strlen(const char *s) { { int i; size_t length = 0; int len = strlen(s); while (*s != '\0') { for (i = 0; i < len; i++) s++; if (s[i] >= 'A' && s[i] <= 'Z') length++; s[i] -= ('A' - 'a'); } } return length; } Strlen performance . Move call to strlen outside of loop . Only way to determine length of string is to scan its entire length, looking for . Since result does not change from one iteration to another null character. Form of code motion Overall performance, string of length N . N calls to strlen . Require times N, N-1, N-2, …, 1 . Overall O(N2) performance 13 14 Lower Case Conversion Performance Optimization Blocker: Procedure Calls Why couldn’t compiler move strlen out of inner loop? . Time doubles when double string length . Procedure may have side effects . Linear performance of lower2 . Alters global state each time called . Function may not return same value for given arguments 200 . Depends on other parts of global state 180 . Procedure lower could interact with strlen 160 140 Warning: 120 . Compiler treats procedure call as a black box lower 100 . Weak optimizations near them 80 int lencnt = 0; CPU seconds CPU 60 Remedies: size_t strlen(const char *s) 40 . Use of inline functions { 20 size_t length = 0; lower2 . GCC does this with –O2 0 while (*s != '\0') { . See web aside ASM:OPT 0 100000 200000 300000 400000 500000 s++; length++; String length . Do your own code motion } lencnt += length; return length; } 15 16 Exercise Break: Weird Pointers Memory Matters /* Sum rows is of n X n matrix a and store in vector b */ Can the following function ever return 12, and if so how? void sum_rows1(double *a, double *b, long n) { long i, j; int f(int *p1, int *p2, int *p3) { for (i = 0; i < n; i++) { *p1 = 100; b[i] = 0; for (j = 0; j < n; j++) *p2 = 10; b[i] += a[i*n + j]; *p3 = 1; } return *p1 + *p2 + *p3; } } # sum_rows1 inner loop .L53: Yes, for instance: addsd (%rcx), %xmm0 # FP add addq $8, %rcx decq %rax int a, b; movsd %xmm0, (%rsi,%r8,8) # FP store f(&a, &b, &a); jne .L53 . Code updates b[i] on every iteration . Why couldn’t compiler optimize this away? 17 18 3 Memory Aliasing Removing Aliasing /* Sum rows is of n X n matrix a /* Sum rows is of n X n matrix a and store in vector b */ and store in vector b */ void sum_rows1(double *a, double *b, long n) { void sum_rows2(double *a, double *b, long n) { long i, j; long i, j; for (i = 0; i < n; i++) { for (i = 0; i < n; i++) { b[i] = 0; double val = 0; for (j = 0; j < n; j++) for (j = 0; j < n; j++) b[i] += a[i*n + j]; val += a[i*n + j]; } b[i] = val; } } } Value of B: double A[9] = init: [4, 8, 16] # sum_rows2 inner loop { 0, 1, 2, .L66: 4, 8, 16}, i = 0: [3, 8, 16] addsd (%rcx), %xmm0 # FP Add 32, 64, 128}; addq $8, %rcx i = 1: [3, 22, 16] decq %rax double B[3] = A+3; jne .L66 sum_rows1(A, B, 3); i = 2: [3, 22, 224] . Code updates b[i] on every iteration . No need to store intermediate results . Must consider possibility that these updates will affect program behavior 19 20 Optimization Blocker: Memory Aliasing What About Larger Programs? Aliasing If your program has just one loop, it’s obvious where to . Two different memory references specify single location change to make it go faster . Easy to have happen in C In more complex programs, what to optimize is a key . Since allowed to do address arithmetic question . Direct access to storage structures When you first write a non-trivial program, it often has a . Get in habit of introducing local variables single major algorithm performance problem . Accumulating within loops . Textbook’s example: insertion sort . Your way of telling compiler not to check for aliasing . Last program I wrote: missed opportunity for dynamic programming . Fixing this problem is way more important than any other changes 21 22 Amdahl’s Law Knowing What’s Slow: Profiling If you speed up one part of a system, the total benefit is Profiling makes a version of a program that records how limited by how much time that part took to start with long it spends on different tasks Speedup S is: .

Program Optimization These Slides Performance Realities Optimizing

Program Analysis and Optimization for Embedded Systems

Stochastic Program Optimization by Eric Schkufza, Rahul Sharma, and Alex Aiken

The Effect of Code Expanding Optimizations on Instruction Cache Design

ARM Optimizing C/C++ Compiler V20.2.0.LTS User's Guide (Rev. V)

Compiler Optimization-Space Exploration

XL Fortran: Getting Started

Machine Learning in Compiler Optimisation Zheng Wang and Michael O’Boyle

An Optimization-Driven Incremental Inline Substitution Algorithm for Just-In-Time Compilers

A Unified Approach to Global Program Optimization

XL C/C++: Optimization and Programming Guide About This Document

Demand-Driven Inlining Heuristics in a Region-Based Optimizing Compiler for ILP Architectures

Analysis and Optimization of Scientific Applications Through Set and Relation Abstractions M