High Performance Computing

Course #: CSI 440/540 High Perf Sci Comp I Fall ‘09

Mark R. Gilder Email: [email protected]

This course investigates the latest trends in the evolution of high-performance computing (HPC) and examines key issues in developing algorithms capable of exploiting these architectures.

Grading: Your grade in the course will be based on completion of assignments (40%), a course project (35%), a class presentation (15%), and class participation (10%).

Course Goals
• Understanding of the latest trends in HPC architecture evolution
• Appreciation for the complexities of efficiently mapping algorithms onto HPC architectures
• Familiarity with various program transformations used to improve performance
• Hands-on experience in the design and implementation of algorithms for shared-memory and distributed-memory architectures using Pthreads, OpenMP, and MPI
• Experience in evaluating the performance of parallel programs

Lecture 4 Outline:
◦ Compiler Optimizations
◦ POSIX Threads (Pthreads)

Compiler Optimizations

• Types of Compiler Optimizations:
◦ Scalar Optimizations
◦ Loop Optimizations
◦ Inlining

Note: The following section is based on notes by Henry Neeman, Director of the OU Supercomputing Center for Education & Research, University of Oklahoma.

These techniques are also described in most compiler textbooks, e.g., Compilers: Principles, Techniques, and Tools by Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman (Addison-Wesley, 1986).

Compiler Design

• There's been a tremendous amount of research done on compilers for the common languages used in today's HPC environments:
◦ Fortran: ~40 years
◦ C: ~30 years
◦ C++: ~15 years, plus the C experience

• Lots of experience in determining how to make programs run more efficiently.

Scalar Optimizations

• Copy Propagation
• Constant Folding
• Dead Code Removal
• Strength Reduction
• Common Sub-expression Elimination
• Variable Renaming

Note: Not every compiler performs all of these optimizations. Therefore, it is important to understand how and when these optimizations are applied so they can be applied by hand if necessary.

Copy Propagation

Before (has a data dependency):
x = y
z = 1 + x

After compilation (no data dependency):
x = y
z = 1 + y

Constant Folding

Before:
add1 = 100
add2 = 200
sum = add1 + add2

After:
sum = 300

• Since sum is actually just the sum of two constants, the compiler can pre-compute it and assign the value to sum, eliminating the addition that would otherwise be performed at runtime.

Dead Code Removal

Before:
var = 5
PRINT *, var
STOP
PRINT *, var * 2

After:
var = 5
PRINT *, var
STOP

• Since the last statement never executes, the compiler can remove it.

Strength Reduction

Before:
x = y ** 2.0
a = c / 2.0

After:
x = y * y
a = c * 0.5

• Raising one value to the power of another, and division, are much more expensive than multiplication. If the compiler can determine that the power is a relatively small integer, or that the denominator is a constant, it converts these operations to multiplications.

Common Subexpressions

Before:
d = c * (a+b)
e = (a+b) * 2.0

After:
aplusb = a + b
d = c * aplusb
e = aplusb * 2.0

• The sub-expression (a+b) occurs in both assignment statements, so it is calculated once and the result is used where needed.

Variable Renaming

Before:
x = y * z
q = r + x * 2
x = a + b

After:
x0 = y * z
q = r + x0 * 2
x = a + b

• The original code has an output dependency; however, if we rename the result of the first assignment statement, we can remove the dependency while preserving the final value of x.

• Static Single Assignment (SSA) is based on the premise that each program variable is assigned in exactly one location in the program; multiple assignments to the same variable create new versions of that variable.
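To make the SSA idea concrete, here is a minimal C sketch (the function names and the versions x0/x1 are invented for illustration; real compilers perform this renaming internally on their intermediate representation):

/* Before: x is assigned twice, creating an output dependency. */
int before_ssa(int y, int z, int r, int a, int b, int *q) {
    int x;
    x  = y * z;        /* first definition of x */
    *q = r + x * 2;    /* uses the first x */
    x  = a + b;        /* second definition overwrites the first */
    return x;
}

/* After: every assignment defines a new version of the variable,
   so the two definitions no longer conflict and the compiler's
   scheduler is free to reorder or overlap the statements. */
int after_ssa(int y, int z, int r, int a, int b, int *q) {
    int x0 = y * z;
    *q = r + x0 * 2;
    int x1 = a + b;    /* the "final" x is x1 */
    return x1;
}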

Loop Optimizations

• Hoisting and Sinking
• Induction Variable Simplification
• Iteration Peeling
• Loop Interchange
• Loop Unrolling

Note: Not every compiler performs all of these optimizations. Therefore, it is important to understand how and when these optimizations are applied so they can be applied by hand if necessary.

Hoisting and Sinking

Before:
DO i = 1, n
  a(i) = b(i) + c * d   !! c * d can be hoisted
  e = g(n)              !! e = g(n) can be sunk
END DO !! i

After:
temp = c * d
DO i = 1, n
  a(i) = b(i) + temp
END DO !! i
e = g(n)

Code that doesn’t change inside the loop is called loop invariant.

Induction Variable Simplification

Before:
DO i = 1, n
  k = i*4 + m
  …
END DO

After:
k = m
DO i = 1, n
  k = k + 4
  …
END DO

• One operation can be cheaper than two. On the other hand, this strategy can create a new dependency.

Iteration Peeling

Before:
DO i = 1, n
  IF ((i == 1) .OR. (i == n)) THEN
    x(i) = y(i)
  ELSE
    x(i) = y(i + 1) + y(i - 1)
  END IF
END DO

After:
x(1) = y(1)
DO i = 2, n-1
  x(i) = y(i + 1) + y(i - 1)
END DO
x(n) = y(n)

Loop Interchange

Before:
DO i = 1, ni
  DO j = 1, nj
    a(i,j) = b(i,j)
  END DO !! j
END DO !! i

After:
DO j = 1, nj
  DO i = 1, ni
    a(i,j) = b(i,j)
  END DO !! i
END DO !! j

• In Fortran, array elements a(i,j) and a(i+1,j) are adjacent in memory, while a(i,j+1) is much farther away, so swapping the loops increases the likelihood that accesses will hit in cache. However, the reverse is true for C - see the sketch below.

• Fortran is "column major" and C is "row major", so make sure the inner loop matches the memory storage layout.
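Since C lays arrays out row-major, the cache-friendly ordering is the opposite of the Fortran example above. A minimal C sketch (the dimensions NI and NJ are invented):

#define NI 1000
#define NJ 1000
double a[NI][NJ], b[NI][NJ];

/* In C, a[i][j] and a[i][j+1] are adjacent in memory, so the last
   subscript should vary fastest: keep j in the inner loop. */
void copy_rowmajor(void) {
    for (int i = 0; i < NI; i++)       /* outer loop: first subscript */
        for (int j = 0; j < NJ; j++)   /* inner loop: stride-1 access */
            a[i][j] = b[i][j];
}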

Loop Unrolling

Before:
DO i = 1, n
  a(i) = a(i) + b(i)
END DO !! i

After:
DO i = 1, n, 4
  a(i)   = a(i)   + b(i)
  a(i+1) = a(i+1) + b(i+1)
  a(i+2) = a(i+2) + b(i+2)
  a(i+3) = a(i+3) + b(i+3)
END DO !! i

• Most modern compilers do this automatically, so it's generally not necessary to do it by hand. Note that the unrolled form above assumes n is a multiple of 4; a hand-unrolled version needs a cleanup loop, as in the sketch below.
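The same transformation in C, written as a sketch that also handles the leftover iterations (the function name is invented):

void add_unrolled(double *a, const double *b, int n) {
    int i;
    for (i = 0; i + 3 < n; i += 4) {   /* main body, unrolled by 4 */
        a[i]   += b[i];
        a[i+1] += b[i+1];
        a[i+2] += b[i+2];
        a[i+3] += b[i+3];
    }
    for (; i < n; i++)                 /* cleanup: 0-3 remaining iterations */
        a[i] += b[i];
}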

Why Do Compilers Unroll?

• A loop with a lot of operations gets better performance, up to some point, especially if there are lots of arithmetic operations but few main memory loads and stores.

• Unrolling creates multiple operations that typically load from the same, or adjacent, cache lines.

• So an unrolled loop performs more operations without increasing the number of memory accesses.

Inlining

Before:
DO i = 1, n
  a(i) = func(i)
END DO
…
REAL FUNCTION func (x)
  func = x * 3
END FUNCTION func

After:
DO i = 1, n
  a(i) = i * 3
END DO

• When a function or subroutine is inlined, the call site is replaced by the actual statements of the called routine, eliminating the overhead of making the call.
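The same idea in C (a sketch; triple and fill are invented names). With optimization enabled, the compiler replaces the call with the function body:

static inline int triple(int x) { return x * 3; }

void fill(int *a, int n) {
    for (int i = 0; i < n; i++)
        a[i] = triple(i);   /* after inlining: a[i] = i * 3; - no call overhead */
}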

Outline

◦ Compiler Optimizations
◦ POSIX Threads (Pthreads)

Topic Overview

• Thread Basics
• The POSIX Thread API
• Synchronization Primitives in Pthreads
• Controlling Thread and Synchronization Attributes
• Composite Synchronization Constructs
• OpenMP: a Standard for Directive-Based Parallel Programming

Overview of Programming Models

• Programming models provide support for expressing concurrency and synchronization.

• Process-based models assume that all data associated with a process is private, by default, unless otherwise specified.

• Lightweight processes and threads assume that all memory is global.

• Directive-based programming models extend the threaded model by facilitating the creation and synchronization of threads.

What is a Thread?

• A thread is defined as an independent stream of instructions that can be scheduled and executed by the operating system (similar to a process, but with much less overhead).

• Multiple threads can be executed in parallel on many computer systems. This multithreading generally occurs by time slicing, in which case the processing is not literally simultaneous, since a single processor can really only do one thing at a time.

• Unlike processes, threads typically share the state information of a single process, and share memory and other resources directly.

Overview of Programming Models

• A thread is a single stream of control in the flow of a program. A program like:

for (row = 0; row < n; row++)
    for (col = 0; col < n; col++)
        c[row][col] = dot_product(get_row(a, row), get_col(b, col));

can be transformed to:

for (row = 0; row < n; row++)
    for (col = 0; col < n; col++)
        c[row][col] = create_thread(dot_product(get_row(a, row), get_col(b, col)));

In this case, one may think of the thread as an instance of a function that returns before the function has finished executing.

Thread Basics

• Threads provide software portability.
• Inherent support for latency hiding.
• Scheduling and load balancing.
• Ease of programming and widespread use.

Thread Basics

• All memory in the logical machine model of a thread is globally accessible to every thread.

• The stack corresponding to the function call is generally treated as being local to the thread, for liveness reasons.

• This implies a logical machine model with both global memory (default) and local memory (stacks).

• It is important to note that such a flat model may result in very poor performance, since memory is physically distributed in typical machines.

Thread Basics

[Figure: the logical machine model of a thread-based programming paradigm.]

Shared Memory Model

• All threads have access to the same global, shared memory.
• Additionally, threads have their own private data.
• Programmers are responsible for synchronizing access to global memory.

http://www.llnl.gov/computing/tutorials/pthreads/

Thread Safeness

• Thread safeness refers to the ability to execute multiple threads simultaneously without "clobbering" shared data or creating "race" conditions.

• For example, suppose an application creates three threads, each of which calls subA(). Additionally:
◦ subA() accesses/modifies a global structure or location in memory.
◦ As each thread calls this routine, it is possible that they may try to modify this global structure/memory location at the same time.
◦ If the routine does not employ some sort of synchronization construct to prevent data corruption, then it is not thread-safe.

• BE CAREFUL! If your application uses libraries or other objects that don't explicitly guarantee thread-safeness, assume that they are NOT thread-safe until proven otherwise. In the meantime, protect yourself by "serializing" calls to the uncertain routine (see the sketch below). http://www.llnl.gov/computing/tutorials/pthreads/
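A minimal sketch of the subA() scenario above (the global counter and the iteration count are invented stand-ins for "a global structure or memory location"): without the mutex, increments from different threads can interleave and be lost.

#include <pthread.h>
#include <stdio.h>

long counter = 0;   /* the shared global state */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *subA(void *arg) {
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);     /* remove this lock/unlock pair and  */
        counter++;                     /* the final count will usually come */
        pthread_mutex_unlock(&lock);   /* up short - a classic race         */
    }
    return NULL;
}

int main(void) {
    pthread_t t[3];
    for (int i = 0; i < 3; i++) pthread_create(&t[i], NULL, subA, NULL);
    for (int i = 0; i < 3; i++) pthread_join(t[i], NULL);
    printf("counter = %ld (expect 3000000)\n", counter);
    return 0;
}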

POSIX Threads

[Figure: a standard Unix process vs. a Unix process with threads.]

http://www.llnl.gov/computing/tutorials/pthreads/

Threads duplicate only the bare essential resources that enable them to exist as executable code.

POSIX Threads

• An independent flow of control is achieved because each thread maintains its own:
◦ Stack pointer
◦ Registers
◦ Scheduling properties (priority, etc.)
◦ Signals
◦ Thread-specific data

http://www.llnl.gov/computing/tutorials/pthreads/

POSIX Threads

• Also, because threads share resources within the same process:

◦ Changes made by one thread to shared resources, e.g., closing a file, will be visible to all other threads.

◦ Pointers with the same addresses point to the same data.

◦ Explicit synchronization must be performed by the programmer when reading and writing the same memory locations (race conditions). http://www.llnl.gov/computing/tutorials/pthreads/

The POSIX Thread API (Pthreads)

• Historically, hardware vendors have implemented their own proprietary versions of threads. These implementations differed substantially from each other, making it difficult for programmers to develop portable threaded applications.

• In order to take full advantage of the capabilities provided by threads, a standardized programming interface was required. For UNIX systems, this interface has been specified by the IEEE POSIX 1003.1c standard (1995). Implementations which adhere to this standard are referred to as POSIX threads, or Pthreads. Most hardware vendors now offer Pthreads in addition to their proprietary APIs.

http://www.llnl.gov/computing/tutorials/pthreads/

POSIX Threads (Pthreads)

• Commonly referred to as Pthreads, POSIX has emerged as the standard threads API, supported by most vendors.

• The concepts discussed here are largely independent of the API and can be used for programming with other thread APIs (NT threads, Solaris threads, Java threads, etc.) as well.

• Pthreads are defined as a set of C language programming types and procedure calls, implemented with a pthread.h header/include file and a thread library - though this library may be part of another library, such as libc.

Why Pthreads?

• In general, though, for a program to take advantage of Pthreads, it must be organized into discrete, independent tasks which can execute concurrently. For example, if routine1 and routine2 can be interchanged, interleaved and/or overlapped in real time, they are candidates for threading.

http://www.llnl.gov/computing/tutorials/pthreads/

Why Pthreads?

• The primary motivation for using Pthreads is to realize potential program performance gains.

• When compared to the cost of creating and managing a process, a thread can be created with much less operating system overhead. Managing threads requires fewer system resources than managing processes.

• On modern multi-CPU machines, Pthreads are ideally suited for parallel programming, and whatever applies to parallel programming in general applies to parallel Pthreads programs.

[Figures: Intel Quad Core (threads); cluster (MPI).]

Pthreads API

• The Pthreads API can be informally grouped into three major classes:

1. Thread Management: functions that work directly on threads - creating, detaching, joining, etc. These include functions to set/query thread attributes (joinable, scheduling, etc.).

2. Mutexes: functions that deal with a synchronization primitive called a "mutex" (short for "mutual exclusion"), including functions for creating, destroying, locking and unlocking mutexes. These are supplemented by mutex attribute functions that set or modify attributes associated with mutexes.

3. Condition Variables: functions that address communications between threads that share a mutex, based upon programmer-specified conditions. This class includes functions to create, destroy, wait and signal based upon specified variable values (see the sketch below). Functions to set/query condition variable attributes are also included.
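As a hedged preview of class 3 (covered in detail later; the flag and function names here are invented), a condition variable is always used together with a mutex and a programmer-specified condition:

#include <pthread.h>

pthread_mutex_t m  = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;
int ready = 0;   /* the programmer-specified condition */

void wait_until_ready(void) {
    pthread_mutex_lock(&m);
    while (!ready)                   /* re-test: wakeups may be spurious */
        pthread_cond_wait(&cv, &m);  /* atomically releases m and sleeps */
    pthread_mutex_unlock(&m);
}

void announce_ready(void) {
    pthread_mutex_lock(&m);
    ready = 1;
    pthread_cond_signal(&cv);        /* wake one waiting thread */
    pthread_mutex_unlock(&m);
}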

Naming Conventions

• All functions in the thread library begin with pthread_

Routine Prefix      Functional Group
pthread_            Threads themselves and miscellaneous subroutines
pthread_attr_       Thread attributes objects
pthread_mutex_      Mutexes
pthread_mutexattr_  Mutex attributes objects
pthread_cond_       Condition variables
pthread_condattr_   Condition attributes objects
pthread_key_        Thread-specific data keys

• The Pthreads API contains over 60 routines.
• For portability, the pthread.h header file should be included in each source file using the Pthreads library.
http://www.llnl.gov/computing/tutorials/pthreads/

Compiling Threaded Programs

Compiler / Platform   Compiler Command    Description
IBM AIX               xlc_r / cc_r        C (ANSI / non-ANSI)
                      xlC_r               C++
                      xlf_r -qnosave      Fortran - using IBM's Pthreads API
                      xlf90_r -qnosave
INTEL Linux           icc -pthread        C
                      icpc -pthread       C++
PathScale Linux       pathcc -pthread     C
                      pathCC -pthread     C++
PGI Linux             pgcc -lpthread      C
                      pgCC -lpthread      C++
GNU Linux, AIX        gcc -pthread        GNU C
                      g++ -pthread        GNU C++
http://www.llnl.gov/computing/tutorials/pthreads/
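For example, building and running a threaded program with the GNU toolchain on Linux (the file name is invented):

gcc -pthread hello_threads.c -o hello_threads
./hello_threads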

Creating Threads

• Routines:
◦ pthread_create (thread, attr, start_routine, arg)
◦ pthread_attr_init (attr)
◦ pthread_attr_destroy (attr)

• pthread_create arguments:

◦ thread: an opaque, unique identifier for the new thread, returned by the subroutine.

◦ attr: an opaque attribute object that may be used to set thread attributes. You can specify a thread attributes object, or NULL for the default values.

◦ start_routine: the C routine that the thread will execute once it is created.

◦ arg: a single argument that may be passed to start_routine. It must be passed by reference, cast to (void *). NULL may be used if no argument is to be passed.

http://www.llnl.gov/computing/tutorials/pthreads/

Creating Threads

• Initially, your main() program comprises a single, default thread. All other threads must be explicitly created by the programmer.

• pthread_create creates a new thread and makes it executable. This routine can be called any number of times from anywhere within your code.

• The maximum number of threads that may be created by a process is implementation dependent.

• Once created, threads are peers, and may create other threads. There is no implied hierarchy or dependency between threads.

Terminating Threads

• There are several ways in which a Pthread may be terminated:
◦ The thread returns from its starting routine (the main routine for the initial thread).
◦ The thread makes a call to the pthread_exit() routine.
◦ The thread is cancelled by another thread via the pthread_cancel() routine.
◦ The entire process is terminated due to a call to either the exec or exit routines.

• pthread_exit() is used to explicitly exit a thread. Typically, the pthread_exit() routine is called after a thread has completed its work and is no longer required to exist.

Terminating Threads

• If main() finishes before the threads it has created, and exits with pthread_exit(), the other threads will continue to execute. Otherwise, they will be automatically terminated when main() finishes.

• The programmer may optionally specify a termination status, which is stored as a void pointer for any thread that may join the calling thread.

• Cleanup: the pthread_exit() routine does not close files; any files opened inside the thread will remain open after the thread is terminated.

Example: Thread Creation

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#define NUM_THREADS 5

void *PrintHello(void *threadid)
{
    long tid = (long) threadid;   /* the argument carries this thread's number */
    printf("Hello World! It's me, thread #%ld!\n", tid);
    pthread_exit(NULL);
}

int main(int argc, char *argv[])
{
    pthread_t threads[NUM_THREADS];
    int rc;
    long t;
    for (t = 0; t < NUM_THREADS; t++) {
        printf("In main: creating thread %ld\n", t);
        rc = pthread_create(&threads[t], NULL, PrintHello, (void *) t);
        if (rc) {   /* a nonzero return code means creation failed */
            printf("ERROR; return code from pthread_create() is %d\n", rc);
            exit(-1);
        }
    }
    pthread_exit(NULL);   /* lets the worker threads keep running */
}

• This simple example code creates 5 threads with the pthread_create() routine. Each thread prints a "Hello World!" message and then terminates with a call to pthread_exit(). http://www.llnl.gov/computing/tutorials/pthreads/

Example: Argument Passing

• The pthread_create() routine permits the programmer to pass one argument to the thread start routine. For cases where multiple arguments must be passed, this limitation is easily overcome by creating a structure which contains all of the arguments, and then passing a pointer to that structure in the pthread_create() routine (see the sketch below).

• All arguments must be passed by reference and cast to (void *).

int *taskids[NUM_THREADS];

for (t = 0; t < NUM_THREADS; t++) {
    taskids[t] = (int *) malloc(sizeof(int));
    *taskids[t] = t;   /* each thread gets its own private copy of t */
    rc = pthread_create(&threads[t], NULL, PrintHello,
                        (void *) taskids[t]);
}

http://www.llnl.gov/computing/tutorials/pthreads/
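A sketch of the structure approach described above (the struct and its fields are invented; error checking omitted):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct thread_args {       /* bundle as many arguments as needed */
    int    thread_id;
    double scale;
};

void *work(void *arg) {
    struct thread_args *a = (struct thread_args *) arg;
    printf("thread %d, scale %f\n", a->thread_id, a->scale);
    free(a);                /* this thread owns its argument block */
    pthread_exit(NULL);
}

int main(void) {
    pthread_t threads[5];
    for (long t = 0; t < 5; t++) {
        /* one heap-allocated struct per thread, so no two threads
           share (or race on) the same argument memory */
        struct thread_args *a = malloc(sizeof *a);
        a->thread_id = (int) t;
        a->scale     = 2.5;
        pthread_create(&threads[t], NULL, work, (void *) a);
    }
    for (int t = 0; t < 5; t++) pthread_join(threads[t], NULL);
    return 0;
}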

Thread Management

• Routines:
◦ pthread_join (threadid, status)
◦ pthread_detach (threadid)
◦ pthread_attr_setdetachstate (attr, detachstate)
◦ pthread_attr_getdetachstate (attr, detachstate)

• "Joining" is one way to accomplish synchronization between threads. For example, the master thread creates a worker with pthread_create() and then blocks in pthread_join() until the worker terminates.

• The pthread_join() subroutine blocks the calling thread until the specified threadid thread terminates.

• The programmer is able to obtain the target thread's termination return status if it was specified in the target thread's call to pthread_exit().

• A joining thread can match one pthread_join() call. It is a logical error to attempt multiple joins on the same thread. http://www.llnl.gov/computing/tutorials/pthreads/
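A minimal join sketch (the worker and its return value are invented), showing how the termination status set by pthread_exit() is retrieved:

#include <pthread.h>
#include <stdio.h>

void *worker(void *arg) {
    long result = 42;                /* stand-in for real work */
    pthread_exit((void *) result);   /* termination status for the joiner */
}

int main(void) {
    pthread_t tid;
    void *status;
    pthread_create(&tid, NULL, worker, NULL);
    pthread_join(tid, &status);      /* blocks until worker terminates */
    printf("worker returned %ld\n", (long) status);
    return 0;
}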

Joining and Detaching Threads

Joinable or Not?

• When a thread is created, one of its attributes defines whether it is joinable or detached. Only threads that are created as joinable can be joined. If a thread is created as detached, it can never be joined.

• The final draft of the POSIX standard specifies that threads should be created as joinable. However, not all implementations may follow this.

• To explicitly create a thread as joinable or detached, the attr argument in the pthread_create() routine is used. The typical 4-step process (shown as code below) is:
◦ Declare a pthread attribute variable of the pthread_attr_t data type
◦ Initialize the attribute variable with pthread_attr_init()
◦ Set the attribute detached status with pthread_attr_setdetachstate()
◦ When done, free library resources used by the attribute with pthread_attr_destroy()
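The four steps as code (a sketch reusing the invented worker routine from the join example above):

#include <pthread.h>

extern void *worker(void *);   /* any thread start routine */

void spawn_joinable(pthread_t *tid) {
    pthread_attr_t attr;                      /* 1. declare              */
    pthread_attr_init(&attr);                 /* 2. initialize           */
    pthread_attr_setdetachstate(&attr,
            PTHREAD_CREATE_JOINABLE);         /* 3. set the detach state */
    pthread_create(tid, &attr, worker, NULL);
    pthread_attr_destroy(&attr);              /* 4. free attribute resources */
}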

Detaching:

• The pthread_detach() routine can be used to explicitly detach a thread even though it was created as joinable.

• There is no converse routine. http://www.llnl.gov/computing/tutorials/pthreads/

Stack Management

• Routines:
◦ pthread_attr_getstacksize (attr, stacksize)
◦ pthread_attr_setstacksize (attr, stacksize)
◦ pthread_attr_getstackaddr (attr, stackaddr)
◦ pthread_attr_setstackaddr (attr, stackaddr)

Preventing Stack Problems:

• The POSIX standard does not dictate the size of a thread's stack. This is implementation dependent and varies.

• Exceeding the default stack limit is often very easy to do, with the usual results: program termination and/or corrupted data.

• Safe and portable programs do not depend upon the default stack limit, but instead explicitly allocate enough stack for each thread by using the pthread_attr_setstacksize routine.

• The pthread_attr_getstackaddr and pthread_attr_setstackaddr routines can be used by applications in an environment where the stack for a thread must be placed in some particular region of memory.

http://www.llnl.gov/computing/tutorials/pthreads/

Stack Management Example

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#define NTHREADS 4
#define N 1000
#define MEGEXTRA 1000000

pthread_attr_t attr;

void *dowork(void *threadid)
{
    double A[N][N];               /* large automatic array: needs a big stack */
    int i, j;
    size_t mystacksize;

    pthread_attr_getstacksize(&attr, &mystacksize);
    printf("Thread %ld: stack size = %li bytes\n",
           (long) threadid, (long) mystacksize);
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            A[i][j] = ((i * j) / 3.452) + (N - i);
    pthread_exit(NULL);
}

int main(int argc, char *argv[])
{
    pthread_t threads[NTHREADS];
    size_t stacksize;
    int rc;
    long t;

    pthread_attr_init(&attr);
    pthread_attr_getstacksize(&attr, &stacksize);
    printf("Default stack size = %li\n", (long) stacksize);
    stacksize = sizeof(double) * N * N + MEGEXTRA;
    printf("Amount of stack needed per thread = %li\n", (long) stacksize);
    pthread_attr_setstacksize(&attr, stacksize);
    printf("Creating threads with stack size = %li bytes\n", (long) stacksize);
    for (t = 0; t < NTHREADS; t++) {
        rc = pthread_create(&threads[t], &attr, dowork, (void *) t);
        if (rc) {
            printf("ERROR; return code from pthread_create() is %d\n", rc);
            exit(-1);
        }
    }
    pthread_exit(NULL);
}

http://www.llnl.gov/computing/tutorials/pthreads/
