How to Write Fast Numerical Code
Spring 2012, Lecture 5

Instructor: Markus Püschel
TA: Georg Ofenbeck

Technicalities

 Research project: Time to finalize!

 Find partner: [email protected]

 Lost on finding a project? Talk to me (e.g., office hours tomorrow)

 Exam: Fri, Apr 27

Last Time: ILP

 Latency/throughput (Pentium 4 fp mult: 7/2)

[Figure: dependency trees for the product d0 * d1 * ... * d7. With one accumulator the multiplications form a single serial chain; with two accumulators they form two independent chains, which is twice as fast.]

Last Time: How Many Accumulators?

[Figure: during the latency of one multiplication (7 cycles), further multiplications must be issued to keep the pipeline busy; those have to be independent.]

Based on this insight: K = #accumulators = ceil(latency / cycles per issue)
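For the Pentium 4 fp mult from the previous slide (latency 7 cycles, one mult issued every 2 cycles) this gives K = ceil(7/2) = 4. A minimal sketch of the resulting loop (an illustration, not from the slides; written for a product of doubles d[0..n-1], analogous to unroll2_sa below):

/* product of d[0..n-1] with K = 4 independent accumulators */
double x0 = 1.0, x1 = 1.0, x2 = 1.0, x3 = 1.0;
int i;
for (i = 0; i + 3 < n; i += 4) {
    x0 *= d[i];      /* four independent dependency chains ... */
    x1 *= d[i+1];    /* ... can be in flight simultaneously */
    x2 *= d[i+2];
    x3 *= d[i+3];
}
for (; i < n; i++)   /* finish any remaining elements */
    x0 *= d[i];
double result = x0 * x1 * x2 * x3;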

Compiler Limitations

Original (combine4):

void combine4(vec_ptr v, data_t *dest)
{
    int i;
    int length = vec_length(v);
    data_t *d = get_vec_start(v);
    data_t t = IDENT;
    for (i = 0; i < length; i++)
        t = t OP d[i];
    *dest = t;
}

Transformed with two accumulators (unroll2_sa):

void unroll2_sa(vec_ptr v, data_t *dest)
{
    int length = vec_length(v);
    int limit = length-1;
    data_t *d = get_vec_start(v);
    data_t x0 = IDENT;
    data_t x1 = IDENT;
    int i;

    /* Combine 2 elements at a time */
    for (i = 0; i < limit; i+=2) {
        x0 = x0 OP d[i];
        x1 = x1 OP d[i+1];
    }
    /* Finish any remaining elements */
    for (; i < length; i++)
        x0 = x0 OP d[i];
    *dest = x0 OP x1;
}

 Associativity does not hold for floats: the transformation from combine4 to unroll2_sa is therefore illegal for the compiler

 No good way of handling choices (e.g., number of accumulators)

 More examples of limitations today

Today

 Measuring performance & benchmarking (Section 3.2 in the tutorial: http://spiral.ece.cmu.edu:8080/pub-spiral/abstract.jsp?id=100)

 Optimizing compilers and optimization blockers
  . Overview
  . Removing unnecessary procedure calls
  . Code motion
  . Strength reduction
  . Sharing of common subexpressions
  . Optimization blocker: Procedure calls
  . Optimization blocker: Memory aliasing
  . Summary
  (Chapter 5 in Computer Systems: A Programmer's Perspective, 2nd edition, Randal E. Bryant and David R. O'Hallaron, Addison Wesley 2010)

Benchmarking

 First: Verify your code!

 Measure runtime (in [s] or [cycles]) for a set of relevant input sizes
  . seconds: actual runtime
  . cycles: abstracts from CPU frequency

 Usually: Compute and show performance (in [flop/s] or [flop/cycle])

 Careful: Better performance ≠ better runtime (why?)
  . Op count could differ
  . Never show the performance of two algorithms with substantially different op count in one plot
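As an example (not on the slides, but using the standard flop count): MMM with C = C + AB on n x n matrices performs 2n^3 flops; if one call takes c cycles, its performance is 2n^3/c [flops/cycle], and multiplying by the CPU frequency gives [flop/s].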

How to measure runtime?

 clock()
  . process specific, low resolution, very portable

 gettimeofday
  . measures wall-clock time, higher resolution, somewhat portable

 Performance counter (e.g., TSC on Intel)
  . measures cycles (i.e., also wall-clock time), highest resolution, not portable

 Careful:
  . measure only what you want to measure
  . ensure proper machine state (e.g., cold or warm cache = input data is or is not in cache)
  . measure enough repetitions
  . check how reproducible; if not reproducible: fix it

 Getting proper measurements is not easy at all!

Example: Timing MMM

 Assume MMM(A,B,C,n) computes C = C + AB, where A, B, C are n x n matrices

double time_MMM(int n)
{
    // allocate
    double *A = (double*)malloc(n*n*sizeof(double));
    double *B = (double*)malloc(n*n*sizeof(double));
    double *C = (double*)malloc(n*n*sizeof(double));

    // initialize
    for (int i = 0; i < n*n; i++) {
        A[i] = B[i] = C[i] = 0.0;
    }
    init_MMM(A, B, C, n); // if needed

    // warm up cache (for warm cache timing)
    MMM(A, B, C, n);

    // time
    ReadTime(t0);
    for (int i = 0; i < TIMING_REPETITIONS; i++)
        MMM(A, B, C, n);
    ReadTime(t1);

    // compute runtime per repetition
    return (double)(t1 - t0)/TIMING_REPETITIONS;
}
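ReadTime above stands for whatever timer is used. A minimal sketch of two possibilities (an illustration, not from the slides): cycles via the x86 TSC using the __rdtsc intrinsic, and wall-clock seconds via gettimeofday.

#include <x86intrin.h>   /* __rdtsc (gcc/icc, x86 only) */
#include <sys/time.h>    /* gettimeofday (POSIX) */

/* cycles since reset: highest resolution, not portable */
static unsigned long long read_tsc(void) {
    return __rdtsc();
}

/* wall-clock time in seconds: somewhat portable */
static double read_seconds(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + 1e-6 * tv.tv_usec;
}

With these, ReadTime(t0) in the code above would correspond to, e.g., t0 = read_tsc() or t0 = read_seconds().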

Problems with Timing

 Too few iterations: inaccurate, non-reproducible timing

 Too many iterations: system events interfere

 Machine is under load: produces side effects

 Multiple timings performed on the same machine

 Bad data alignment of input/output vectors: align to multiples of cache line (on Core: address is divisible by 64)

 Time stamp counter (if used) can overflow (on 32-bit architectures)

 Machine was not rebooted for a long time: state of operating system causes problems

 Computation is input data dependent: choose representative input data

 Computation is in-place and the data grows with every repetition until exceptions are triggered (the computation then runs on NaNs)

 You work on a computer that has dynamic frequency scaling (e.g., turbo boost)

 Always check whether timings make sense and are reproducible (a simple check is sketched below)
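One simple way to do such a check (a sketch, not from the slides; time_MMM is the routine from the earlier example): repeat the measurement several times and compare minimum and median; a large spread means the timing is not reproducible.

#include <stdio.h>
#include <stdlib.h>

#define NUM_RUNS 10

static int cmp_double(const void *a, const void *b) {
    double x = *(const double*)a, y = *(const double*)b;
    return (x > y) - (x < y);
}

void check_reproducibility(int n) {
    double runs[NUM_RUNS];
    for (int r = 0; r < NUM_RUNS; r++)
        runs[r] = time_MMM(n);               /* repeat the measurement */
    qsort(runs, NUM_RUNS, sizeof(double), cmp_double);
    printf("min = %g, median = %g\n", runs[0], runs[NUM_RUNS/2]);
}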

Benchmarks in Writing

 Specify experimental setup
  . platform
  . compiler and version
  . compiler flags used

 Plot: Very readable
  . Title, x-label, y-label should be there
  . Fonts large enough
  . Enough contrast (no yellow on white please)
  . Proper number format
    . No: 13.254687; yes: 13.25
    . No: 2.0345e-05 s; yes: 20.3 μs
    . No: 100000 B; maybe: 100,000 B; yes: 100 KB

[Figure: example plot "DFT 2^n (single precision) on Pentium 4, 2.53 GHz", performance in Gflop/s on the y-axis, n = 4, ..., 13 on the x-axis, with lines for Spiral SSE, Intel MKL, Spiral vectorized, and Spiral scalar. Design choices annotated on the slide:]
  . Left alignment
  . Attractive font (sans serif, avoid Arial): Calibri, Helvetica, Gill Sans MT, ...
  . Horizontal y-label
  . Main line possibly emphasized (red, thicker)
  . No y-axis (superfluous)
  . Background/grid inverted for better layering
  . No legend; makes decoding easier

Today

 Measuring performance & benchmarking

 Optimizing compilers and optimization blockers
  . Overview
  . Removing unnecessary procedure calls
  . Code motion
  . Strength reduction
  . Sharing of common subexpressions
  . Optimization blocker: Procedure calls
  . Optimization blocker: Memory aliasing
  . Summary

Optimizing Compilers

 Always use optimization flags:
  . gcc: default is no optimization (-O0)!
  . icc: some optimization is turned on

 Good choices for gcc/icc: -O2, -O3, -march=xxx, -msse3, -m64
  . Read in the manual what they do
  . Try to understand the differences

 Try different flags and maybe different compilers

Example (On Core 2 Duo)

double a[4][4];
double b[4][4];
double c[4][4];

/* Multiply 4 x 4 matrices c = a*b + c */
void mmm(double *a, double *b, double *c)
{
    int i, j, k;
    for (i = 0; i < 4; i++)
        for (j = 0; j < 4; j++)
            for (k = 0; k < 4; k++)
                c[i*4 + j] += a[i*4 + k]*b[k*4 + j];
}

 Compiled without flags: ~1300 cycles

 Compiled with -O3 -m64 -march=… -fno-tree-vectorize (prevents use of SSE): ~150 cycles

Optimizing Compilers

 Compilers are good at: mapping program to machine
  . register allocation
  . code selection and ordering (instruction scheduling)
  . dead code elimination
  . eliminating minor inefficiencies

 Compilers are not good at: algorithmic restructuring
  . For example to increase ILP, locality, etc.
  . Cannot deal with choices

 Compilers are not good at: overcoming "optimization blockers"
  . potential memory aliasing
  . potential procedure side-effects

Limitations of Optimizing Compilers

 If in doubt, the compiler is conservative

 Operate under fundamental constraints
  . Must not change program behavior under any possible condition
  . This often prevents optimizations that would only affect behavior under pathological conditions

 Most analysis is performed only within procedures
  . Whole-program analysis is too expensive in most cases

 Most analysis is based only on static information
  . Compiler has difficulty anticipating run-time inputs
  . Not good at evaluating or dealing with choices

Organization

 Instruction level parallelism (ILP): an example

 Optimizing compilers and optimization blockers
  . Overview
  . Removing unnecessary procedure calls
  . Code motion
  . Strength reduction
  . Sharing of common subexpressions
  . Optimization blocker: Procedure calls
  . Optimization blocker: Memory aliasing
  . Summary

Example: Data Type for Vectors

/* data structure for vectors */
typedef struct {
    int len;
    double *data;   /* points to elements 0 ... len-1 */
} vec;

/* retrieve vector element and store at val */
int get_vec_element(vec *v, int idx, double *val)
{
    if (idx < 0 || idx >= v->len)
        return 0;
    *val = v->data[idx];
    return 1;
}

Example: Summing Vector Elements

With get_vec_element as defined above:

/* sum elements of vector */
double sum_elements(vec *v, double *res)
{
    int i;
    int n = vec_length(v);
    double t;
    *res = 0.0;
    for (i = 0; i < n; i++) {
        get_vec_element(v, i, &t);
        *res += t;
    }
    return *res;
}

Overhead for every fp +:
• One function call
• One <
• One >=
• One ||
• One memory variable access
Slowdown: probably 10x or more

Removing Procedure Call

The same sum_elements, with get_vec_element replaced by direct access to the data array:

/* sum elements of vector */
double sum_elements(vec *v, double *res)
{
    int i;
    int n = vec_length(v);
    double *data = get_vec_start(v);
    *res = 0.0;
    for (i = 0; i < n; i++)
        *res += data[i];
    return *res;
}

Removing Procedure Calls

 Procedure calls can be very expensive

 Bounds checking can be very expensive

 Abstract data types can easily lead to inefficiencies
  . Usually avoided in superfast numerical library functions

 Watch your innermost loop!

 Get a feel for overhead versus actual computation being performed

Organization

 Instruction level parallelism (ILP): an example

 Optimizing compilers and optimization blockers
  . Overview
  . Removing unnecessary procedure calls
  . Code motion (compiler is likely to do that)
  . Strength reduction (compiler is likely to do that)
  . Sharing of common subexpressions
  . Optimization blocker: Procedure calls
  . Optimization blocker: Memory aliasing
  . Summary

Code Motion

 Reduce frequency with which computation is performed
  . If it will always produce the same result
  . Especially moving code out of loops (loop-invariant code motion)

 Sometimes also called precomputation

void set_row(double *a, double *b, int i, int n)
{
    int j;
    for (j = 0; j < n; j++)
        a[n*i + j] = b[j];
}

After code motion (n*i computed once, outside the loop):

void set_row(double *a, double *b, int i, int n)
{
    int j;
    int ni = n*i;
    for (j = 0; j < n; j++)
        a[ni + j] = b[j];
}

 Instruction level parallelism (ILP): an example

 Optimizing compilers and optimization blockers
  . Overview
  . Removing unnecessary procedure calls
  . Code motion (compiler is likely to do that)
  . Strength reduction (compiler is likely to do that)
  . Sharing of common subexpressions
  . Optimization blocker: Procedure calls
  . Optimization blocker: Memory aliasing
  . Summary

Strength Reduction

 Replace costly operation with simpler one

 Example: Shift/add instead of multiply or divide
  . 16*x → x << 4
  . Utility is machine dependent
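As a further illustration (an assumption, not from the slides): a multiplication by an integer constant that is not a power of two can be decomposed into shifts and adds, e.g., for integer x:

y = 20*x;                   /* original */
y = (x << 4) + (x << 2);    /* 16*x + 4*x = 20*x */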

 Example: Recognize sequence of products

Before:

for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        a[n*i + j] = b[j];

After (the multiplication n*i is replaced by repeated addition):

int ni = 0;
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++)
        a[ni + j] = b[j];
    ni += n;
}

Organization

 Instruction level parallelism (ILP): an example

 Optimizing compilers and optimization blockers
  . Overview
  . Removing unnecessary procedure calls
  . Code motion (compiler is likely to do that)
  . Strength reduction (compiler is likely to do that)
  . Sharing of common subexpressions
  . Optimization blocker: Procedure calls
  . Optimization blocker: Memory aliasing
  . Summary

Share Common Subexpressions

 Reuse portions of expressions

 Compilers are often not very sophisticated in exploiting arithmetic properties

/* Sum neighbors of i,j: 3 mults (i*n, (i-1)*n, (i+1)*n) */
up    = val[(i-1)*n + j];
down  = val[(i+1)*n + j];
left  = val[i*n + j-1];
right = val[i*n + j+1];
sum   = up + down + left + right;

/* Same with shared subexpression: 1 mult (i*n) */
int inj = i*n + j;
up    = val[inj - n];
down  = val[inj + n];
left  = val[inj - 1];
right = val[inj + 1];
sum   = up + down + left + right;

Organization

 Instruction level parallelism (ILP): an example

 Optimizing compilers and optimization blockers
  . Overview
  . Removing unnecessary procedure calls
  . Code motion
  . Strength reduction
  . Sharing of common subexpressions
  . Optimization blocker: Procedure calls
  . Optimization blocker: Memory aliasing
  . Summary

Optimization Blocker #1: Procedure Calls

 Procedure to convert string to lower case

void lower(char *s)
{
    int i;
    for (i = 0; i < strlen(s); i++)   /* O(n^2) instead of O(n) */
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}

/* My version of strlen: O(n) */
size_t strlen(const char *s)
{
    size_t length = 0;
    while (*s != '\0') {
        s++;
        length++;
    }
    return length;
}

Improving Performance

void lower(char *s)
{
    int i;
    for (i = 0; i < strlen(s); i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}

void lower(char *s)
{
    int i;
    int len = strlen(s);
    for (i = 0; i < len; i++)
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] -= ('A' - 'a');
}

 Move call to strlen outside of loop

 Since result does not change from one iteration to another

 Form of code motion/precomputation

Optimization Blocker: Procedure Calls

 Why couldn't the compiler move strlen out of the loop?
  . Procedure may have side effects

 Compiler usually treats a procedure call as a black box that cannot be analyzed
  . Consequence: conservative in optimizations

 In this case the compiler may actually do it if strlen is recognized as a built-in function

Organization

 Instruction level parallelism (ILP): an example

 Optimizing compilers and optimization blockers
  . Overview
  . Removing unnecessary procedure calls
  . Code motion
  . Strength reduction
  . Sharing of common subexpressions
  . Optimization blocker: Procedure calls
  . Optimization blocker: Memory aliasing
  . Summary

Optimization Blocker: Memory Aliasing

/* Sums rows of n x n matrix a and stores in vector b */
void sum_rows1(double *a, double *b, int n)
{
    int i, j;
    for (i = 0; i < n; i++) {
        b[i] = 0;
        for (j = 0; j < n; j++)
            b[i] += a[i*n + j];
    }
}

 Code updates b[i] (= memory access) on every iteration

Optimization Blocker: Memory Aliasing

sum_rows1 from above, and a variant that accumulates into a local variable:

/* Sums rows of n x n matrix a and stores in vector b */
void sum_rows2(double *a, double *b, int n)
{
    int i, j;
    for (i = 0; i < n; i++) {
        double val = 0;
        for (j = 0; j < n; j++)
            val += a[i*n + j];
        b[i] = val;
    }
}

Does the compiler optimize sum_rows1 into this form? No! Why?

Reason: Possible Memory Aliasing

 If memory is accessed, the compiler assumes the possibility of side effects

 Example (sum_rows1 as above):

double A[9] =
    {  0,  1,   2,
       4,  8,  16,
      32, 64, 128 };
double *B = A + 3;     /* B aliases the second row of A */

sum_rows1(A, B, 3);

Value of B:
    init:  [4, 8, 16]
    i = 0: [3, 8, 16]
    i = 1: [3, 22, 16]
    i = 2: [3, 22, 224]
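A minimal way to reproduce this trace (a sketch, not part of the slides; it uses sum_rows1 as defined above):

#include <stdio.h>

int main(void)
{
    double A[9] = {  0,  1,   2,
                     4,  8,  16,
                    32, 64, 128 };
    double *B = A + 3;                        /* B aliases A[3..5] */

    sum_rows1(A, B, 3);
    printf("%g %g %g\n", B[0], B[1], B[2]);   /* prints: 3 22 224 */
    return 0;
}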

Removing Aliasing

/* Sums rows of n x n matrix a and stores in vector b */
void sum_rows2(double *a, double *b, int n)
{
    int i, j;
    for (i = 0; i < n; i++) {
        double val = 0;
        for (j = 0; j < n; j++)
            val += a[i*n + j];
        b[i] = val;
    }
}

 Scalar replacement:
  . Copy array elements that are reused into temporary variables
  . Perform computation on those variables
  . Enables register allocation and instruction scheduling
  . Assumes no memory aliasing (otherwise possibly incorrect)

Optimization Blocker: Memory Aliasing

 Memory aliasing: two different memory references refer to the same location

 Easy to have happen in C
  . Since address arithmetic is allowed
  . Direct access to storage structures

 Hard to analyze = compiler cannot figure it out
  . Hence it is conservative

 Solution: Scalar replacement in innermost loop
  . Copy memory variables that are reused into local variables
  . Basic scheme:
      Load:    t1 = a[i], t2 = b[i+1], ...
      Compute: t4 = t1 * t2; ...
      Store:   a[i] = t12, b[i+1] = t7, ...

More Difficult Example

c = (double *) calloc(n*n, sizeof(double));

/* Multiply n x n matrices c = a*b + c */
void mmm(double *a, double *b, double *c, int n)
{
    int i, j, k;
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            for (k = 0; k < n; k++)
                c[i*n + j] += a[i*n + k]*b[k*n + j];
}

[Figure: picture of c = a*b + c, showing element c(i,j) computed from row i of a and column j of b.]

 Which array elements are reused?

 All of them! But how to take advantage?

Step 1: Blocking (Here: 2 x 2)

Blocking, also called tiling = partial unrolling + loop exchange
Assumes associativity (= the compiler will not do it)

c = (double *) calloc(n*n, sizeof(double));

/* Multiply n x n matrices c = a*b + c */
void mmm(double *a, double *b, double *c, int n)
{
    int i, j, k, i1, j1, k1;
    for (i = 0; i < n; i += 2)
        for (j = 0; j < n; j += 2)
            for (k = 0; k < n; k += 2)
                for (i1 = i; i1 < i+2; i1++)
                    for (j1 = j; j1 < j+2; j1++)
                        for (k1 = k; k1 < k+2; k1++)
                            c[i1*n + j1] += a[i1*n + k1]*b[k1*n + j1];
}

[Figure: same picture as before, now with 2 x 2 blocks of c, a, and b.]

Step 2: Unrolling Inner Loops

c = (double *) calloc(n*n, sizeof(double));

/* Multiply n x n matrices c = a*b + c */
void mmm(double *a, double *b, double *c, int n)
{
    int i, j, k;
    for (i = 0; i < n; i += 2)
        for (j = 0; j < n; j += 2)
            for (k = 0; k < n; k += 2) {
                c[i*n + j]         = a[i*n + k]*b[k*n + j]
                                   + a[i*n + k+1]*b[(k+1)*n + j] + c[i*n + j];
                c[(i+1)*n + j]     = a[(i+1)*n + k]*b[k*n + j]
                                   + a[(i+1)*n + k+1]*b[(k+1)*n + j] + c[(i+1)*n + j];
                c[i*n + (j+1)]     = a[i*n + k]*b[k*n + (j+1)]
                                   + a[i*n + k+1]*b[(k+1)*n + (j+1)] + c[i*n + (j+1)];
                c[(i+1)*n + (j+1)] = a[(i+1)*n + k]*b[k*n + (j+1)]
                                   + a[(i+1)*n + k+1]*b[(k+1)*n + (j+1)] + c[(i+1)*n + (j+1)];
            }
}

 Every array element a[…], b[…], c[…] is used twice

 Now scalar replacement can be applied (so again: loop unrolling is done with a purpose)
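What scalar replacement of the 2 x 2 block could look like (a sketch, not from the slides; it follows the load/compute/store scheme from above and replaces the body of the k loop):

/* load: array elements that are reused go into local variables */
double c00 = c[i*n + j],       c01 = c[i*n + j+1];
double c10 = c[(i+1)*n + j],   c11 = c[(i+1)*n + j+1];
double a00 = a[i*n + k],       a01 = a[i*n + k+1];
double a10 = a[(i+1)*n + k],   a11 = a[(i+1)*n + k+1];
double b00 = b[k*n + j],       b01 = b[k*n + j+1];
double b10 = b[(k+1)*n + j],   b11 = b[(k+1)*n + j+1];

/* compute: only on local variables */
c00 += a00*b00 + a01*b10;
c01 += a00*b01 + a01*b11;
c10 += a10*b00 + a11*b10;
c11 += a10*b01 + a11*b11;

/* store: write the results back once */
c[i*n + j]       = c00;   c[i*n + j+1]     = c01;
c[(i+1)*n + j]   = c10;   c[(i+1)*n + j+1] = c11;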

Can Compiler Remove Aliasing?

for (i = 0; i < n; i++)
    a[i] = a[i] + b[i];

Potential aliasing: can the compiler do something about it?

The compiler can insert a runtime check:

if (a + n < b || b + n < a) {
    /* no overlap: further optimizations may be possible now */
    ...
} else {
    /* aliased case */
    ...
}

Removing Aliasing With Compiler

 Globally with compiler flag:
  . -fno-alias, /Oa
  . -fargument-noalias, /Qalias-args- (function arguments only)

 For one loop: pragma

void add(float *a, float *b, int n) {
    int i;
    #pragma ivdep
    for (i = 0; i < n; i++)
        a[i] = a[i] + b[i];
}

 For specific arrays: restrict (needs compiler flag -restrict, /Qrestrict)

void add(float *restrict a, float *restrict b, int n) {
    int i;
    for (i = 0; i < n; i++)
        a[i] = a[i] + b[i];
}

Organization

 Instruction level parallelism (ILP): an example

 Optimizing compilers and optimization blockers
  . Overview
  . Removing unnecessary procedure calls
  . Code motion
  . Strength reduction
  . Sharing of common subexpressions
  . Optimization blocker: Procedure calls
  . Optimization blocker: Memory aliasing
  . Summary

Summary

 One can easily lose 10x, 100x in runtime or even more

[Figure: roughly 20x can be lost through the issues discussed in this lecture, another 4x by not using SSE, and 4x by not using threading.]

 What matters besides operation count:
  . Coding style (unnecessary procedure calls, unrolling, reordering, …)
  . Algorithm structure (instruction level parallelism, locality, …)
  . Data representation (complicated structs or simple arrays)

Summary: Optimize at Multiple Levels

 Algorithm:
  . Evaluate different algorithm choices
  . Restructuring may be needed (ILP, locality)

 Data representations:
  . Careful with overhead of complicated data types
  . Best are arrays

 Procedures:
  . Careful with overhead
  . They are black boxes for the compiler

 Loops:
  . Often need to be restructured (ILP, locality)
  . Unrolling is often necessary to enable other optimizations
  . Watch the innermost loop bodies

Numerical Functions

 Use arrays if possible

 Unroll to some extent
  . To make ILP explicit
  . To enable scalar replacement and hence register allocation for variables that are reused