
How to Write Fast Numerical Code
Spring 2012, Lecture 5
Instructor: Markus Püschel
TA: Georg Ofenbeck
© Markus Püschel, Computer Science

Technicalities

Research project: time to finalize!
. Find a partner: [email protected]
. Lost on finding a project? Talk to me (e.g., office hours tomorrow)
Exam: Friday, April 27th

Last Time: ILP

Latency/throughput (Pentium 4 fp mult: 7/2)

[Figure: two dependency trees for multiplying d0, ..., d7. A single sequential
chain (((d0 * d1) * d2) * ... serializes on the 7-cycle latency; splitting the
product into two independent chains (two accumulators) is twice as fast.]

Last Time: How Many Accumulators?

Latency: 7 cycles, and those have to be independent cycles.
Based on this insight:

    K = #accumulators = ceil(latency / cycles per issue)

Compiler Limitations

void combine4(vec_ptr v, data_t *dest)
{
    int i;
    int length = vec_length(v);
    data_t *d = get_vec_start(v);
    data_t t = IDENT;

    for (i = 0; i < length; i++)
        t = t OP d[i];
    *dest = t;
}

void unroll2_sa(vec_ptr v, data_t *dest)
{
    int length = vec_length(v);
    int limit = length - 1;
    data_t *d = get_vec_start(v);
    data_t x0 = IDENT;
    data_t x1 = IDENT;
    int i;

    /* combine 2 elements at a time */
    for (i = 0; i < limit; i += 2) {
        x0 = x0 OP d[i];
        x1 = x1 OP d[i+1];
    }
    /* finish any remaining elements */
    for (; i < length; i++)
        x0 = x0 OP d[i];
    *dest = x0 OP x1;
}

The compiler will not perform this transformation itself:
. The associativity law does not hold for floats, so the reordering is an illegal transformation
. There is no good way of handling choices (e.g., the number of accumulators)
More examples of limitations today.

Today

Measuring performance & benchmarking
. Section 3.2 in the tutorial: http://spiral.ece.cmu.edu:8080/pub-spiral/abstract.jsp?id=100
Optimizing compilers and optimization blockers
. Overview
. Removing unnecessary procedure calls
. Code motion
. Strength reduction
. Sharing of common subexpressions
. Optimization blocker: procedure calls
. Optimization blocker: memory aliasing
. Summary
. Chapter 5 in Computer Systems: A Programmer's Perspective, 2nd edition, Randal E. Bryant and David R. O'Hallaron, Addison Wesley 2010

Benchmarking

First: verify your code!
Measure runtime (in [s] or [cycles]) for a set of relevant input sizes
. seconds: actual runtime
. cycles: abstracts from the CPU frequency
Usually: compute and show performance (in [flop/s] or [flop/cycle])
Careful: better performance ≠ better runtime (why?)
. The op counts could differ
. Never show in one plot the performance of two algorithms with substantially different op counts

How to Measure Runtime?

C clock()
. process specific, low resolution, very portable
gettimeofday
. measures wall-clock time, higher resolution, somewhat portable
Performance counter (e.g., TSC on Intel)
. measures cycles (i.e., also wall-clock time), highest resolution, not portable
Careful:
. measure only what you want to measure
. ensure proper machine state (e.g., cold or warm cache = input data is or is not in cache)
. measure enough repetitions
. check how reproducible the results are; if not reproducible: fix it
Getting proper measurements is not easy at all! A cycle-counter sketch follows below.
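To make the TSC option concrete, here is a minimal sketch of a cycle timer. It assumes an x86 machine and a compiler that provides the <x86intrin.h> intrinsics (gcc, clang, icc); the names read_tsc, time_kernel, and the repetition count are illustrative, not from the slides.

/* Minimal cycle-timer sketch using the x86 time stamp counter (TSC).
   Assumes x86 and <x86intrin.h>; not portable. */
#include <stdint.h>
#include <x86intrin.h>

static inline uint64_t read_tsc(void)
{
    return __rdtsc();               /* read the time stamp counter */
}

#define TIMING_REPETITIONS 1000     /* assumed value; tune for reproducibility */

/* average cycles per call of the code under test */
double time_kernel(void (*kernel)(void))
{
    kernel();                       /* warm up (for warm-cache timing) */
    uint64_t t0 = read_tsc();
    for (int i = 0; i < TIMING_REPETITIONS; i++)
        kernel();
    uint64_t t1 = read_tsc();
    return (double)(t1 - t0) / TIMING_REPETITIONS;
}

Note the integer-to-double conversion happens only after the subtraction, so no precision is lost in the difference; dividing afterwards avoids integer division.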
Example: Timing MMM

Assume MMM(A,B,C,n) computes C = C + A*B, where A, B, C are n x n matrices.

double time_MMM(int n)
{
    // allocate
    double *A = (double*) malloc(n*n*sizeof(double));
    double *B = (double*) malloc(n*n*sizeof(double));
    double *C = (double*) malloc(n*n*sizeof(double));

    // initialize
    for (int i = 0; i < n*n; i++)
        A[i] = B[i] = C[i] = 0.0;
    init_MMM(A, B, C, n);  // if needed

    // warm up cache (for warm-cache timing)
    MMM(A, B, C, n);

    // time
    ReadTime(t0);
    for (int i = 0; i < TIMING_REPETITIONS; i++)
        MMM(A, B, C, n);
    ReadTime(t1);

    // compute runtime in cycles, then release the buffers
    double runtime = (double)(t1 - t0) / TIMING_REPETITIONS;
    free(A); free(B); free(C);
    return runtime;
}

Problems with Timing

. Too few iterations: inaccurate, non-reproducible timing
. Too many iterations: system events interfere
. Machine is under load: produces side effects
. Multiple timings performed on the same machine interfere with each other
. Bad data alignment of input/output vectors: align to multiples of the cache line size (on Core: address divisible by 64)
. The time stamp counter (if used) can overflow (on 32-bit architectures)
. Machine was not rebooted for a long time: the state of the operating system causes problems
. Computation is input-data dependent: choose representative input data
. Computation is in-place and the data grows until an exception is triggered (computation is done with NaNs)
. You work on a computer that has dynamic frequency scaling (e.g., turbo boost)
Always check whether the timings make sense and are reproducible.

Benchmarks in Writing

Specify the experimental setup:
. platform
. compiler and version
. compiler flags used
Make plots very readable:
. title, x-label, and y-label should be there
. fonts large enough
. enough contrast (no yellow on white, please)
. proper number format:
  no: 13.254687; yes: 13.25
  no: 2.0345e-05 s; yes: 20.3 μs
  no: 100000 B; maybe: 100,000 B; yes: 100 KB

[Figure: example plot "DFT 2^n (single precision) on Pentium 4, 2.53 GHz",
performance in Gflop/s (0 to 7) versus n = 4, ..., 13, for Spiral SSE, Intel
MKL, Spiral vectorized, and Spiral scalar. Annotations: left alignment;
attractive sans-serif font (Calibri, Helvetica, Gill Sans MT, ...; avoid
Arial); horizontal y-label; main line possibly emphasized (red, thicker); no
y-axis (superfluous); background/grid inverted for better layering; no legend,
since labeling the lines directly makes decoding easier.]

Today, part 2: optimizing compilers and optimization blockers.

Optimizing Compilers

Always use optimization flags:
. gcc: the default is no optimization (-O0)!
. icc: some optimization is turned on by default
Good choices for gcc/icc: -O2, -O3, -march=xxx, -msse3, -m64
. Read in the manual what they do
. Try to understand the differences
Try different flags and maybe different compilers.

Example (On Core 2 Duo)

double a[4][4];
double b[4][4];
double c[4][4];

/* multiply 4 x 4 matrices: c = a*b + c */
void mmm(double *a, double *b, double *c)
{
    int i, j, k;
    for (i = 0; i < 4; i++)
        for (j = 0; j < 4; j++)
            for (k = 0; k < 4; k++)
                c[i*4 + j] += a[i*4 + k] * b[k*4 + j];
}

Compiled without flags: ~1300 cycles.
Compiled with -O3 -m64 -march=... -fno-tree-vectorize (which prevents the use of SSE): ~150 cycles.
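To observe this difference yourself, here is a hedged sketch of a driver that times the mmm kernel with the TSC approach sketched earlier; REPS is an arbitrary assumption, and the exact cycle counts will depend on your machine.

/* Sketch of a timing driver for the 4x4 mmm example, using __rdtsc()
   as above. Link against the file containing mmm(). */
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>

#define REPS 100000              /* assumed; increase until timings reproduce */

void mmm(double *a, double *b, double *c);   /* the kernel from the example */

int main(void)
{
    static double a[16], b[16], c[16];       /* zero-initialized by default */
    mmm(a, b, c);                            /* warm up the cache */

    uint64_t t0 = __rdtsc();
    for (int i = 0; i < REPS; i++)
        mmm(a, b, c);
    uint64_t t1 = __rdtsc();

    printf("%.1f cycles per mmm call\n", (double)(t1 - t0) / REPS);
    return 0;
}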
Compilers are good at mapping the program to the machine:
. register allocation
. code selection and ordering (instruction scheduling)
. dead code elimination
. eliminating minor inefficiencies

Compilers are not good at algorithmic restructuring:
. for example, to increase ILP, locality, etc.
. they cannot deal with choices

Compilers are not good at overcoming "optimization blockers":
. potential memory aliasing
. potential procedure side effects

Limitations of Optimizing Compilers

If in doubt, the compiler is conservative. Compilers operate under fundamental constraints:
. They must not change the program behavior under any possible condition
. This often prevents optimizations that would only affect behavior under pathological conditions
Most analysis is performed only within procedures:
. whole-program analysis is too expensive in most cases
Most analysis is based only on static information:
. the compiler has difficulty anticipating run-time inputs
. it is not good at evaluating or dealing with choices

Next: removing unnecessary procedure calls.

Example: Data Type for Vectors

/* data structure for vectors:
   data points to an array of len doubles, indices 0 .. len-1 */
typedef struct {
    int len;
    double *data;
} vec;

/* retrieve vector element and store it at val */
int get_vec_element(vec *v, int idx, double *val)
{
    if (idx < 0 || idx >= v->len)
        return 0;
    *val = v->data[idx];
    return 1;
}

Example: Summing Vector Elements

/* sum elements of vector */
double sum_elements(vec *v, double *res)
{
    int i;
    int n = vec_length(v);
    double t;

    *res = 0.0;
    for (i = 0; i < n; i++) {
        get_vec_element(v, i, &t);
        *res += t;
    }
    return *res;
}

Overhead for every fp +:
. one function call
. one <
. one >=
. one ||
. one memory variable access
Slowdown: probably 10x or more.

Removing Procedure Call

Starting from sum_elements above, replace the checked accessor with direct access to the data array:

/* sum elements of vector */
double sum_elements(vec *v, double *res)
{
    int i;
    int n = vec_length(v);
    double *data = get_vec_start(v);

    *res = 0.0;
    for (i = 0; i < n; i++)
        *res += data[i];
    return *res;
}

Removing Procedure Calls

. Procedure calls can be very expensive
. Bounds checking can be very expensive
. Abstract data types can easily lead to inefficiencies; they are usually avoided in superfast numerical library functions
Watch your innermost loop! Get a feel for the overhead.
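One step further, foreshadowing the memory-aliasing blocker listed in today's outline: even the improved version above reads and writes *res through memory on every iteration, because the compiler must assume that res could alias v->data. A sketch of the fix (not on these slides), mirroring the local accumulator pattern of combine4; the name sum_elements_local is illustrative:

/* Sketch: accumulate in a local variable so the sum can stay in a register;
   write to *res only once at the end. */
double sum_elements_local(vec *v, double *res)
{
    int i;
    int n = vec_length(v);
    double *data = get_vec_start(v);
    double t = 0.0;          /* local accumulator: cannot alias data[] */

    for (i = 0; i < n; i++)
        t += data[i];
    *res = t;                /* single memory write */
    return t;
}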