
Program Optimization
CSci 2021: Machine Architecture and Organization
Lecture #22-23, March 13th-16th, 2015
Your instructor: Stephen McCamant
Based on slides originally by: Randy Bryant, Dave O'Hallaron, Antonia Zhai

These Slides
Overview:
- Generally Useful Optimizations
  - Code motion/precomputation
  - Strength reduction
  - Sharing of common subexpressions
  - Removing unnecessary procedure calls
- Optimization Blockers
  - Procedure calls
  - Memory aliasing
- Optimizing in Larger Programs: Profiling
- Exploiting Instruction-Level Parallelism
- Dealing with Conditionals

Performance Realities
There's more to performance than asymptotic complexity.
- Constant factors matter too!
  - You can easily see a 10:1 performance range depending on how the code is written.
  - You must optimize at multiple levels: algorithm, data representations, procedures, and loops.
- You must understand the system to optimize performance:
  - how programs are compiled and executed
  - how to measure program performance and identify bottlenecks
  - how to improve performance without destroying code modularity and generality

Optimizing Compilers
Provide an efficient mapping of programs to machine code:
- register allocation
- code selection and ordering (scheduling)
- dead code elimination
- eliminating minor inefficiencies
Don't (usually) improve asymptotic efficiency:
- It is up to the programmer to select the best overall algorithm.
- Big-O savings are (often) more important than constant factors, but constant factors also matter.
Have difficulty overcoming "optimization blockers":
- potential memory aliasing
- potential procedure side effects

Limitations of Optimizing Compilers
Operate under a fundamental constraint:
- They must not cause any change in program behavior.
- This often prevents optimizations that would only affect behavior under pathological conditions.

Generally Useful Optimizations
Optimizations that you or the compiler should do regardless of processor / compiler.
Limitations of Optimizing Compilers (continued)
- Behavior that may be obvious to the programmer can be obfuscated by languages and coding styles.
  - e.g., data ranges may be more limited than the variable types suggest
- Most analysis is performed only within procedures.
  - Whole-program analysis is too expensive in most cases.
- Most analysis is based only on static information.
  - The compiler has difficulty anticipating run-time inputs.
- When in doubt, the compiler must be conservative.

(Loop-Invariant) Code Motion
- Reduce the frequency with which a computation is performed:
  - if it will always produce the same result
  - especially by moving code out of a loop

    void set_row(double *a, double *b, long i, long n)
    {
        long j;
        for (j = 0; j < n; j++)
            a[n*i+j] = b[j];
    }

The multiplication n*i can be hoisted out of the loop:

    long j;
    int ni = n*i;
    for (j = 0; j < n; j++)
        a[ni+j] = b[j];

Compiler-Generated Code Motion
The compiler transforms set_row into the equivalent of:

    long j;
    long ni = n*i;
    double *rowp = a+ni;
    for (j = 0; j < n; j++)
        *rowp++ = b[j];

Where are the FP operations?

    set_row:
        testq %rcx, %rcx            # Test n
        jle .L4                     # If 0, goto done
        movq %rcx, %rax             # rax = n
        imulq %rdx, %rax            # rax *= i
        leaq (%rdi,%rax,8), %rdx    # rowp = A + n*i*8
        movl $0, %r8d               # j = 0
    .L3:                            # loop:
        movq (%rsi,%r8,8), %rax     # t = b[j]
        movq %rax, (%rdx)           # *rowp = t
        addq $1, %r8                # j++
        addq $8, %rdx               # rowp++
        cmpq %r8, %rcx              # Compare n:j
        jg .L3                      # If >, goto loop
    .L4:                            # done:
        rep ; ret

Reduction in Strength
- Replace a costly operation with a simpler one.
  - Shift and add instead of multiplying or dividing: 16*x --> x << 4
  - Its utility is machine dependent: it depends on the cost of the multiply or divide instruction.
    (On Intel Nehalem, an integer multiply requires 3 CPU cycles.)
- Recognize a sequence of products:

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            a[n*i + j] = b[j];

becomes

    int ni = 0;
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++)
            a[ni + j] = b[j];
        ni += n;
    }

Share Common Subexpressions
- Reuse portions of expressions.
- Compilers are often not very sophisticated in exploiting arithmetic properties.

    /* Sum neighbors of i,j */
    up    = val[(i-1)*n + j  ];
    down  = val[(i+1)*n + j  ];
    left  = val[i*n     + j-1];
    right = val[i*n     + j+1];
    sum = up + down + left + right;

3 multiplications: i*n, (i-1)*n, (i+1)*n

    long inj = i*n + j;
    up    = val[inj - n];
    down  = val[inj + n];
    left  = val[inj - 1];
    right = val[inj + 1];
    sum = up + down + left + right;

1 multiplication: i*n

    leaq 1(%rsi), %rax      # i+1
    leaq -1(%rsi), %r8      # i-1
    imulq %rcx, %rsi        # i*n
    imulq %rcx, %rax        # (i+1)*n
    imulq %rcx, %r8         # (i-1)*n
    addq %rdx, %rsi         # i*n+j
    addq %rdx, %rax         # (i+1)*n+j
    addq %rdx, %r8          # (i-1)*n+j

versus

    imulq %rcx, %rsi        # i*n
    addq %rdx, %rsi         # i*n+j
    movq %rsi, %rax         # i*n+j
    subq %rcx, %rax         # i*n+j-n
    leaq (%rsi,%rcx), %rcx  # i*n+j+n

Optimization Blocker #1: Procedure Calls
Procedure to convert a string to lower case:

    void lower(char *s)
    {
        int i;
        for (i = 0; i < strlen(s); i++)
            if (s[i] >= 'A' && s[i] <= 'Z')
                s[i] -= ('A' - 'a');
    }

- Extracted from 213 lab submissions, Fall 1998.

Lower Case Conversion Performance
- Time quadruples when the string length doubles.
- Quadratic performance.
[Plot: CPU seconds (0-200) vs. string length (0-500000) for lower]

Convert Loop To Goto Form

    void lower(char *s)
    {
        int i = 0;
        if (i >= strlen(s))
            goto done;
      loop:
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
        i++;
        if (i < strlen(s))
            goto loop;
      done: ;
    }

- strlen is executed on every iteration.

Calling Strlen

    /* My version of strlen */
    size_t strlen(const char *s)
    {
        size_t length = 0;
        while (*s != '\0') {
            s++;
            length++;
        }
        return length;
    }

Strlen performance:
- The only way to determine the length of a string is to scan its entire length, looking for the null character.

Improving Performance

    void lower(char *s)
    {
        int i;
        int len = strlen(s);
        for (i = 0; i < len; i++)
            if (s[i] >= 'A' && s[i] <= 'Z')
                s[i] -= ('A' - 'a');
    }

- Move the call to strlen outside the loop,
- since its result does not change from one iteration to the next.
- This is a form of code motion.

Overall performance of the original lower on a string of length N:
- N calls to strlen
- The calls take time proportional to N, N-1, N-2, ..., 1.
- Overall O(N^2) performance.

Lower Case Conversion Performance
- Time doubles when the string length doubles.
- Linear performance for lower2.
[Plot: CPU seconds (0-200) vs. string length (0-500000); lower grows quadratically, lower2 stays near zero]

Optimization Blocker: Procedure Calls
Why couldn't the compiler move strlen out of the inner loop?
- The procedure may have side effects.
  - It may alter global state each time it is called.
- The function may not return the same value for given arguments.
  - It may depend on other parts of the global state.
  - Procedure lower could interact with strlen.

For instance, this variant of strlen alters global state:

    int lencnt = 0;
    size_t strlen(const char *s)
    {
        size_t length = 0;
        while (*s != '\0') {
            s++;
            length++;
        }
        lencnt += length;
        return length;
    }

Warning:
- The compiler treats a procedure call as a black box.
- Optimizations near procedure calls are weak.
Remedies:
- Use inline functions.
  - GCC does this with -O2.
  - See web aside ASM:OPT.
- Do your own code motion.

Exercise Break: Weird Pointers
Can the following function ever return 12, and if so how?

    int f(int *p1, int *p2, int *p3) {
        *p1 = 100;
        *p2 = 10;
        *p3 = 1;
        return *p1 + *p2 + *p3;
    }

Yes, for instance:

    int a, b;
    f(&a, &b, &a);

Memory Matters

    /* Sum rows of n X n matrix a and store in vector b */
    void sum_rows1(double *a, double *b, long n) {
        long i, j;
        for (i = 0; i < n; i++) {
            b[i] = 0;
            for (j = 0; j < n; j++)
                b[i] += a[i*n + j];
        }
    }

    # sum_rows1 inner loop
    .L53:
        addsd (%rcx), %xmm0         # FP add
        addq $8, %rcx
        decq %rax
        movsd %xmm0, (%rsi,%r8,8)   # FP store
        jne .L53

- The code updates b[i] on every iteration.
- Why couldn't the compiler optimize this away?
Memory Aliasing

    double A[9] =
        {  0,  1,   2,
           4,  8,  16,
          32, 64, 128 };
    double *B = A + 3;   /* B overlaps the second row of A */
    sum_rows1(A, B, 3);

Value of B:
    init:  [4, 8, 16]
    i = 0: [3, 8, 16]
    i = 1: [3, 22, 16]
    i = 2: [3, 22, 224]

- The code updates b[i] on every iteration.
- The compiler must consider the possibility that these updates will affect program behavior.

Removing Aliasing

    /* Sum rows of n X n matrix a and store in vector b */
    void sum_rows2(double *a, double *b, long n) {
        long i, j;
        for (i = 0; i < n; i++) {
            double val = 0;
            for (j = 0; j < n; j++)
                val += a[i*n + j];
            b[i] = val;
        }
    }

    # sum_rows2 inner loop
    .L66:
        addsd (%rcx), %xmm0   # FP add
        addq $8, %rcx
        decq %rax
        jne .L66

- There is no need to store intermediate results.

Optimization Blocker: Memory Aliasing
Aliasing:
- Two different memory references specify a single location.
- This is easy to have happen in C,
  - since C allows address arithmetic
  - and direct access to storage structures.
- Get in the habit of introducing local variables:
  - accumulate within loops in a local variable;
  - it is your way of telling the compiler not to check for aliasing.

What About Larger Programs?
- If your program has just one loop, it's obvious where to make changes to speed it up.
- In more complex programs, what to optimize is a key question.
- When you first write a non-trivial program, it often has a single major algorithmic performance problem.
  - Textbook's example: insertion sort.
  - The last program I wrote: a missed opportunity for dynamic programming.
  - Fixing this problem is far more important than any other change.

Amdahl's Law
If you speed up one part of a system, the total benefit is limited by how much time that part took to start with. If the part being sped up originally takes fraction alpha of the total time and is made k times faster, the speedup S is:

    S = 1 / ((1 - alpha) + alpha/k)

Knowing What's Slow: Profiling
Profiling makes a version of a program that records how long it spends on different tasks.