High Performance Computing

Course #: CSI 440/540 High Perf Sci Comp I Fall ‘09

Mark R. Gilder Email: [email protected]

This course investigates the latest trends in the evolution of high-performance computing (HPC) and examines key issues in developing algorithms capable of exploiting these architectures.

Grading: Your grade in the course will be based on completion of assignments (40%), a course project (35%), a class presentation (15%), and class participation (10%).

Course Goals
• Understanding of the latest trends in HPC architecture evolution
• Appreciation for the complexities of efficiently mapping algorithms onto HPC architectures
• Familiarity with various program transformations used to improve performance
• Hands-on experience in the design and implementation of algorithms for shared-memory and distributed-memory architectures using Pthreads, OpenMP, and MPI
• Experience in evaluating the performance of parallel programs

Lecture 4 Outline:
◦ Compiler Optimizations
◦ POSIX Threads (Pthreads)

Compiler Optimizations

• Types of Compiler Optimizations:
◦ Scalar Optimizations
◦ Loop Optimizations
◦ Inlining

Note: The following section is based on notes by Henry Neeman, Director of the OU Supercomputing Center for Education & Research, University of Oklahoma.

These techniques are also described in most compiler textbooks, e.g., Compilers: Principles, Techniques, and Tools by Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman (Addison-Wesley, 1986).

Compiler Design

• There's been a tremendous amount of research done on compilers for the common languages used in today's HPC environments:
◦ Fortran: ~40 years
◦ C: ~30 years
◦ C++: ~15 years, plus the C experience

• Lots of experience in determining how to make programs run more efficiently.

Scalar Optimizations

• Copy Propagation
• Constant Folding
• Dead Code Removal
• Strength Reduction
• Common Sub-expression Elimination
• Variable Renaming

Note: Not every compiler performs all of these optimizations. Therefore, it is important to understand how and when these optimizations are applied so they can be applied by hand if necessary.

Copy Propagation

Before (has a data dependency):
x = y
z = 1 + x

After compilation (no data dependency):
x = y
z = 1 + y

Constant Folding

Before:
add1 = 100
add2 = 200
sum = add1 + add2

After:
sum = 300

• Since sum is actually just the sum of two constants, the compiler can pre-compute it and assign the value to sum, eliminating the addition that would otherwise be performed at runtime.

Dead Code Removal

Before:
var = 5
PRINT *, var
STOP
PRINT *, var * 2

After:
var = 5
PRINT *, var
STOP

• Since the last statement never executes, the compiler can remove it.

Strength Reduction

Before:
x = y ** 2.0
a = c / 2.0

After:
x = y * y
a = c * 0.5

• Raising one value to the power of another, and division, are much more expensive than multiplication. If the compiler can determine that the power is a relatively small integer, or that the denominator is a constant, it converts these operations to multiplications.

Common Subexpressions

Before:
d = c * (a+b)
e = (a+b) * 2.0

After:
aplusb = a + b
d = c * aplusb
e = aplusb * 2.0

• The sub-expression (a+b) occurs in both assignment statements, so it is calculated once and the result is used where needed.

Variable Renaming

Before:
x = y * z
q = r + x * 2
x = a + b

After:
x0 = y * z
q = r + x0 * 2
x = a + b

• The original code has an output dependency; however, if we rename the result of the first assignment statement, we can remove the dependency while preserving the final value of x.

• Static Single Assignment (SSA) is based on the premise that each program variable is assigned in exactly one location in the program; multiple assignments to the same variable create new versions of that variable.
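To make the SSA idea concrete, here is a minimal C sketch (the function names and the versions x0/x1 are invented for illustration; real compilers perform this renaming internally on their intermediate representation):

/* Before: x is assigned twice, creating an output dependency. */
int before_ssa(int y, int z, int r, int a, int b, int *q) {
    int x;
    x  = y * z;        /* first definition of x */
    *q = r + x * 2;    /* uses the first x */
    x  = a + b;        /* second definition overwrites the first */
    return x;
}

/* After: every assignment defines a new version of the variable,
   so the two definitions no longer conflict and the compiler's
   scheduler is free to reorder or overlap the statements. */
int after_ssa(int y, int z, int r, int a, int b, int *q) {
    int x0 = y * z;
    *q = r + x0 * 2;
    int x1 = a + b;    /* the "final" x is x1 */
    return x1;
}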

Loop Optimizations

• Hoisting and Sinking
• Induction Variable Simplification
• Iteration Peeling
• Loop Interchange
• Loop Unrolling

Note: Not every compiler performs all of these optimizations. Therefore, it is important to understand how and when these optimizations are applied so they can be applied by hand if necessary.

Hoisting and Sinking

Before:
DO i = 1, n
  a(i) = b(i) + c * d   !! c * d can be hoisted
  e = g(n)              !! e = g(n) can be sunk
END DO !! i

After:
temp = c * d
DO i = 1, n
  a(i) = b(i) + temp
END DO !! i
e = g(n)

Code that doesn’t change inside the loop is called loop invariant.

Induction Variable Simplification

Before:
DO i = 1, n
  k = i*4 + m
  …
END DO

After:
k = m
DO i = 1, n
  k = k + 4
  …
END DO

• One operation can be cheaper than two. On the other hand, this strategy can create a new dependency.

Iteration Peeling

Before:
DO i = 1, n
  IF ((i == 1) .OR. (i == n)) THEN
    x(i) = y(i)
  ELSE
    x(i) = y(i + 1) + y(i - 1)
  END IF
END DO

After:
x(1) = y(1)
DO i = 2, n-1
  x(i) = y(i + 1) + y(i - 1)
END DO
x(n) = y(n)

Loop Interchange

Before:
DO i = 1, ni
  DO j = 1, nj
    a(i,j) = b(i,j)
  END DO !! j
END DO !! i

After:
DO j = 1, nj
  DO i = 1, ni
    a(i,j) = b(i,j)
  END DO !! i
END DO !! j

• In Fortran, array elements a(i,j) and a(i+1,j) are adjacent in memory, while a(i,j+1) is much farther away, so swapping the loops increases the likelihood that accesses will hit in cache. However, the reverse is true for C - see the sketch below.

• Fortran is "column major" and C is "row major", so make sure the inner loop matches the memory storage layout.
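Since C lays arrays out row-major, the cache-friendly ordering is the opposite of the Fortran example above. A minimal C sketch (the dimensions NI and NJ are invented):

#define NI 1000
#define NJ 1000
double a[NI][NJ], b[NI][NJ];

/* In C, a[i][j] and a[i][j+1] are adjacent in memory, so the last
   subscript should vary fastest: keep j in the inner loop. */
void copy_rowmajor(void) {
    for (int i = 0; i < NI; i++)       /* outer loop: first subscript */
        for (int j = 0; j < NJ; j++)   /* inner loop: stride-1 access */
            a[i][j] = b[i][j];
}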

Loop Unrolling

Before:
DO i = 1, n
  a(i) = a(i) + b(i)
END DO !! i

After:
DO i = 1, n, 4
  a(i)   = a(i)   + b(i)
  a(i+1) = a(i+1) + b(i+1)
  a(i+2) = a(i+2) + b(i+2)
  a(i+3) = a(i+3) + b(i+3)
END DO !! i

• Most modern compilers do this automatically, so it's generally not necessary to do it by hand. Note that the unrolled form above assumes n is a multiple of 4; a hand-unrolled version needs a cleanup loop, as in the sketch below.
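The same transformation in C, written as a sketch that also handles the leftover iterations (the function name is invented):

void add_unrolled(double *a, const double *b, int n) {
    int i;
    for (i = 0; i + 3 < n; i += 4) {   /* main body, unrolled by 4 */
        a[i]   += b[i];
        a[i+1] += b[i+1];
        a[i+2] += b[i+2];
        a[i+3] += b[i+3];
    }
    for (; i < n; i++)                 /* cleanup: 0-3 remaining iterations */
        a[i] += b[i];
}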

Why Do Compilers Unroll?

• A loop with a lot of operations gets better performance, up to some point, especially if there are lots of arithmetic operations but few main memory loads and stores.

• Unrolling creates multiple operations that typically load from the same, or adjacent, cache lines.

• So an unrolled loop performs more operations without increasing the number of memory accesses.

Inlining

Before:
DO i = 1, n
  a(i) = func(i)
END DO
…
REAL FUNCTION func (x)
  func = x * 3
END FUNCTION func

After:
DO i = 1, n
  a(i) = i * 3
END DO

• When a function or subroutine is inlined, the call site is replaced by the actual statements of the called routine, eliminating the overhead of making the call.
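The same idea in C (a sketch; triple and fill are invented names). With optimization enabled, the compiler replaces the call with the function body:

static inline int triple(int x) { return x * 3; }

void fill(int *a, int n) {
    for (int i = 0; i < n; i++)
        a[i] = triple(i);   /* after inlining: a[i] = i * 3; - no call overhead */
}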

Outline

◦ Compiler Optimizations
◦ POSIX Threads (Pthreads)

Topic Overview

• Thread Basics
• The POSIX Thread API
• Synchronization Primitives in Pthreads
• Controlling Thread and Synchronization Attributes
• Composite Synchronization Constructs
• OpenMP: a Standard for Directive-Based Parallel Programming

Overview of Programming Models

• Programming models provide support for expressing concurrency and synchronization.

• Process-based models assume that all data associated with a process is private, by default, unless otherwise specified.

• Lightweight processes and threads assume that all memory is global.

• Directive-based programming models extend the threaded model by facilitating the creation and synchronization of threads.

What is a Thread?

• A thread is defined as an independent stream of instructions that can be scheduled and executed by the operating system (similar to a process, but with much less overhead).

• Multiple threads can be executed in parallel on many computer systems. This multithreading generally occurs by time slicing, in which case the processing is not literally simultaneous, since a single processor can really only do one thing at a time.

• Unlike processes, threads typically share the state information of a single process, and share memory and other resources directly.

Overview of Programming Models

• A thread is a single stream of control in the flow of a program. A program like:

for (row = 0; row < n; row++)
    for (col = 0; col < n; col++)
        c[row][col] = dot_product(get_row(a, row), get_col(b, col));

can be transformed to:

for (row = 0; row < n; row++)
    for (col = 0; col < n; col++)
        c[row][col] = create_thread(dot_product(get_row(a, row), get_col(b, col)));

In this case, one may think of the thread as an instance of a function that returns before the function has finished executing.

Thread Basics

• Threads provide software portability.
• Inherent support for latency hiding.
• Scheduling and load balancing.
• Ease of programming and widespread use.

Thread Basics

• All memory in the logical machine model of a thread is globally accessible to every thread.

• The stack corresponding to the function call is generally treated as being local to the thread, for liveness reasons.

• This implies a logical machine model with both global memory (default) and local memory (stacks).

• It is important to note that such a flat model may result in very poor performance, since memory is physically distributed in typical machines.

Thread Basics

[Figure: the logical machine model of a thread-based programming paradigm.]

Shared Memory Model

• All threads have access to the same global, shared memory.
• Additionally, threads have their own private data.
• Programmers are responsible for synchronizing access to global memory.

http://www.llnl.gov/computing/tutorials/pthreads/

Thread Safeness

• Thread safeness refers to the ability to execute multiple threads simultaneously without "clobbering" shared data or creating "race" conditions.

• For example, suppose an application creates three threads, each of which calls subA(). Additionally:
◦ subA() accesses/modifies a global structure or location in memory.
◦ As each thread calls this routine, it is possible that they may try to modify this global structure/memory location at the same time.
◦ If the routine does not employ some sort of synchronization construct to prevent data corruption, then it is not thread-safe.

• BE CAREFUL! If your application uses libraries or other objects that don't explicitly guarantee thread-safeness, assume that they are NOT thread-safe until proven otherwise. In the meantime, protect yourself by "serializing" calls to the uncertain routine (see the sketch below). http://www.llnl.gov/computing/tutorials/pthreads/
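A minimal sketch of the subA() scenario above (the global counter and the iteration count are invented stand-ins for "a global structure or memory location"): without the mutex, increments from different threads can interleave and be lost.

#include <pthread.h>
#include <stdio.h>

long counter = 0;   /* the shared global state */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *subA(void *arg) {
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);     /* remove this lock/unlock pair and  */
        counter++;                     /* the final count will usually come */
        pthread_mutex_unlock(&lock);   /* up short - a classic race         */
    }
    return NULL;
}

int main(void) {
    pthread_t t[3];
    for (int i = 0; i < 3; i++) pthread_create(&t[i], NULL, subA, NULL);
    for (int i = 0; i < 3; i++) pthread_join(t[i], NULL);
    printf("counter = %ld (expect 3000000)\n", counter);
    return 0;
}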

POSIX Threads

[Figure: a standard Unix process vs. a Unix process with threads.]

http://www.llnl.gov/computing/tutorials/pthreads/

Threads duplicate only the bare essential resources that enable them to exist as executable code.

POSIX Threads

• An independent flow of control is achieved because each thread maintains its own:
◦ Stack pointer
◦ Registers
◦ Scheduling properties (priority, etc.)
◦ Signals
◦ Thread-specific data

http://www.llnl.gov/computing/tutorials/pthreads/

POSIX Threads

• Also, because threads share resources within the same process:

◦ Changes made by one thread to shared resources, e.g., closing a file, will be visible to all other threads.

◦ Pointers with the same addresses point to the same data.

◦ Explicit synchronization must be performed by the programmer when reading and writing the same memory locations (race conditions). http://www.llnl.gov/computing/tutorials/pthreads/

The POSIX Thread API (Pthreads)

• Historically, hardware vendors have implemented their own proprietary versions of threads. These implementations differed substantially from each other, making it difficult for programmers to develop portable threaded applications.

• In order to take full advantage of the capabilities provided by threads, a standardized programming interface was required. For UNIX systems, this interface has been specified by the IEEE POSIX 1003.1c standard (1995). Implementations which adhere to this standard are referred to as POSIX threads, or Pthreads. Most hardware vendors now offer Pthreads in addition to their proprietary APIs.

http://www.llnl.gov/computing/tutorials/pthreads/

POSIX Threads (Pthreads)

• Commonly referred to as Pthreads, POSIX has emerged as the standard threads API, supported by most vendors.

• The concepts discussed here are largely independent of the API and can be used for programming with other thread APIs (NT threads, Solaris threads, Java threads, etc.) as well.

• Pthreads are defined as a set of C language programming types and procedure calls, implemented with a pthread.h header/include file and a thread library - though this library may be part of another library, such as libc.

Why Pthreads?

• In general, though, for a program to take advantage of Pthreads, it must be organized into discrete, independent tasks which can execute concurrently. For example, if routine1 and routine2 can be interchanged, interleaved and/or overlapped in real time, they are candidates for threading.

http://www.llnl.gov/computing/tutorials/pthreads/

Why Pthreads?

• The primary motivation for using Pthreads is to realize potential program performance gains.

• When compared to the cost of creating and managing a process, a thread can be created with much less operating system overhead. Managing threads requires fewer system resources than managing processes.

• On modern multi-CPU machines, Pthreads are ideally suited for parallel programming, and whatever applies to parallel programming in general applies to parallel Pthreads programs.

[Figures: Intel Quad Core (threads); cluster (MPI).]

Pthreads API

• The Pthreads API can be informally grouped into three major classes:

1. Thread Management: functions that work directly on threads - creating, detaching, joining, etc. These include functions to set/query thread attributes (joinable, scheduling, etc.).

2. Mutexes: functions that deal with a synchronization primitive called a "mutex" (short for "mutual exclusion"), including functions for creating, destroying, locking and unlocking mutexes. These are supplemented by mutex attribute functions that set or modify attributes associated with mutexes.

3. Condition Variables: functions that address communications between threads that share a mutex, based upon programmer-specified conditions. This class includes functions to create, destroy, wait and signal based upon specified variable values (see the sketch below). Functions to set/query condition variable attributes are also included.
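As a hedged preview of class 3 (covered in detail later; the flag and function names here are invented), a condition variable is always used together with a mutex and a programmer-specified condition:

#include <pthread.h>

pthread_mutex_t m  = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;
int ready = 0;   /* the programmer-specified condition */

void wait_until_ready(void) {
    pthread_mutex_lock(&m);
    while (!ready)                   /* re-test: wakeups may be spurious */
        pthread_cond_wait(&cv, &m);  /* atomically releases m and sleeps */
    pthread_mutex_unlock(&m);
}

void announce_ready(void) {
    pthread_mutex_lock(&m);
    ready = 1;
    pthread_cond_signal(&cv);        /* wake one waiting thread */
    pthread_mutex_unlock(&m);
}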

Naming Conventions

• All functions in the thread library begin with pthread_

Routine Prefix      Functional Group
pthread_            Threads themselves and miscellaneous subroutines
pthread_attr_       Thread attributes objects
pthread_mutex_      Mutexes
pthread_mutexattr_  Mutex attributes objects
pthread_cond_       Condition variables
pthread_condattr_   Condition attributes objects
pthread_key_        Thread-specific data keys

• The Pthreads API contains over 60 routines.
• For portability, the pthread.h header file should be included in each source file using the Pthreads library.
http://www.llnl.gov/computing/tutorials/pthreads/

Compiling Threaded Programs

Compiler / Platform   Compiler Command    Description
IBM AIX               xlc_r / cc_r        C (ANSI / non-ANSI)
                      xlC_r               C++
                      xlf_r -qnosave      Fortran - using IBM's Pthreads API
                      xlf90_r -qnosave
INTEL Linux           icc -pthread        C
                      icpc -pthread       C++
PathScale Linux       pathcc -pthread     C
                      pathCC -pthread     C++
PGI Linux             pgcc -lpthread      C
                      pgCC -lpthread      C++
GNU Linux, AIX        gcc -pthread        GNU C
                      g++ -pthread        GNU C++
http://www.llnl.gov/computing/tutorials/pthreads/
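For example, building and running a threaded program with the GNU toolchain on Linux (the file name is invented):

gcc -pthread hello_threads.c -o hello_threads
./hello_threads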

Creating Threads

• Routines:
◦ pthread_create (thread, attr, start_routine, arg)
◦ pthread_attr_init (attr)
◦ pthread_attr_destroy (attr)

• pthread_create arguments:

◦ thread: an opaque, unique identifier for the new thread, returned by the subroutine.

◦ attr: an opaque attribute object that may be used to set thread attributes. You can specify a thread attributes object, or NULL for the default values.

◦ start_routine: the C routine that the thread will execute once it is created.

◦ arg: a single argument that may be passed to start_routine. It must be passed by reference, cast to (void *). NULL may be used if no argument is to be passed.

http://www.llnl.gov/computing/tutorials/pthreads/

Creating Threads

• Initially, your main() program comprises a single, default thread. All other threads must be explicitly created by the programmer.

• pthread_create creates a new thread and makes it executable. This routine can be called any number of times from anywhere within your code.

• The maximum number of threads that may be created by a process is implementation dependent.

• Once created, threads are peers, and may create other threads. There is no implied hierarchy or dependency between threads.

Terminating Threads

• There are several ways in which a Pthread may be terminated:
◦ The thread returns from its starting routine (the main routine for the initial thread).
◦ The thread makes a call to the pthread_exit() routine.
◦ The thread is cancelled by another thread via the pthread_cancel() routine.
◦ The entire process is terminated due to a call to either the exec or exit routines.

• pthread_exit() is used to explicitly exit a thread. Typically, the pthread_exit() routine is called after a thread has completed its work and is no longer required to exist.

Terminating Threads

• If main() finishes before the threads it has created, and exits with pthread_exit(), the other threads will continue to execute. Otherwise, they will be automatically terminated when main() finishes.

• The programmer may optionally specify a termination status, which is stored as a void pointer for any thread that may join the calling thread.

• Cleanup: the pthread_exit() routine does not close files; any files opened inside the thread will remain open after the thread is terminated.

Example: Thread Creation

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#define NUM_THREADS 5

void *PrintHello(void *threadid)
{
    long tid = (long) threadid;   /* the argument carries this thread's number */
    printf("Hello World! It's me, thread #%ld!\n", tid);
    pthread_exit(NULL);
}

int main(int argc, char *argv[])
{
    pthread_t threads[NUM_THREADS];
    int rc;
    long t;
    for (t = 0; t < NUM_THREADS; t++) {
        printf("In main: creating thread %ld\n", t);
        rc = pthread_create(&threads[t], NULL, PrintHello, (void *) t);
        if (rc) {   /* a nonzero return code means creation failed */
            printf("ERROR; return code from pthread_create() is %d\n", rc);
            exit(-1);
        }
    }
    pthread_exit(NULL);   /* lets the worker threads keep running */
}

• This simple example code creates 5 threads with the pthread_create() routine. Each thread prints a "Hello World!" message and then terminates with a call to pthread_exit(). http://www.llnl.gov/computing/tutorials/pthreads/

Example: Argument Passing

• The pthread_create() routine permits the programmer to pass one argument to the thread start routine. For cases where multiple arguments must be passed, this limitation is easily overcome by creating a structure which contains all of the arguments, and then passing a pointer to that structure in the pthread_create() routine (see the sketch below).

• All arguments must be passed by reference and cast to (void *).

int *taskids[NUM_THREADS];

for (t = 0; t < NUM_THREADS; t++) {
    taskids[t] = (int *) malloc(sizeof(int));
    *taskids[t] = t;   /* each thread gets its own private copy of t */
    rc = pthread_create(&threads[t], NULL, PrintHello,
                        (void *) taskids[t]);
}

http://www.llnl.gov/computing/tutorials/pthreads/
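A sketch of the structure approach described above (the struct and its fields are invented; error checking omitted):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct thread_args {       /* bundle as many arguments as needed */
    int    thread_id;
    double scale;
};

void *work(void *arg) {
    struct thread_args *a = (struct thread_args *) arg;
    printf("thread %d, scale %f\n", a->thread_id, a->scale);
    free(a);                /* this thread owns its argument block */
    pthread_exit(NULL);
}

int main(void) {
    pthread_t threads[5];
    for (long t = 0; t < 5; t++) {
        /* one heap-allocated struct per thread, so no two threads
           share (or race on) the same argument memory */
        struct thread_args *a = malloc(sizeof *a);
        a->thread_id = (int) t;
        a->scale     = 2.5;
        pthread_create(&threads[t], NULL, work, (void *) a);
    }
    for (int t = 0; t < 5; t++) pthread_join(threads[t], NULL);
    return 0;
}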

Thread Management

• Routines:
◦ pthread_join (threadid, status)
◦ pthread_detach (threadid)
◦ pthread_attr_setdetachstate (attr, detachstate)
◦ pthread_attr_getdetachstate (attr, detachstate)

• "Joining" is one way to accomplish synchronization between threads. For example, the master thread creates a worker with pthread_create() and then blocks in pthread_join() until the worker terminates.

• The pthread_join() subroutine blocks the calling thread until the specified threadid thread terminates.

• The programmer is able to obtain the target thread's termination return status if it was specified in the target thread's call to pthread_exit().

• A joining thread can match one pthread_join() call. It is a logical error to attempt multiple joins on the same thread. http://www.llnl.gov/computing/tutorials/pthreads/
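A minimal join sketch (the worker and its return value are invented), showing how the termination status set by pthread_exit() is retrieved:

#include <pthread.h>
#include <stdio.h>

void *worker(void *arg) {
    long result = 42;                /* stand-in for real work */
    pthread_exit((void *) result);   /* termination status for the joiner */
}

int main(void) {
    pthread_t tid;
    void *status;
    pthread_create(&tid, NULL, worker, NULL);
    pthread_join(tid, &status);      /* blocks until worker terminates */
    printf("worker returned %ld\n", (long) status);
    return 0;
}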

Joining and Detaching Threads

Joinable or Not?

• When a thread is created, one of its attributes defines whether it is joinable or detached. Only threads that are created as joinable can be joined. If a thread is created as detached, it can never be joined.

• The final draft of the POSIX standard specifies that threads should be created as joinable. However, not all implementations may follow this.

• To explicitly create a thread as joinable or detached, the attr argument in the pthread_create() routine is used. The typical 4-step process (shown as code below) is:
◦ Declare a pthread attribute variable of the pthread_attr_t data type
◦ Initialize the attribute variable with pthread_attr_init()
◦ Set the attribute detached status with pthread_attr_setdetachstate()
◦ When done, free library resources used by the attribute with pthread_attr_destroy()
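The four steps as code (a sketch reusing the invented worker routine from the join example above):

#include <pthread.h>

extern void *worker(void *);   /* any thread start routine */

void spawn_joinable(pthread_t *tid) {
    pthread_attr_t attr;                      /* 1. declare              */
    pthread_attr_init(&attr);                 /* 2. initialize           */
    pthread_attr_setdetachstate(&attr,
            PTHREAD_CREATE_JOINABLE);         /* 3. set the detach state */
    pthread_create(tid, &attr, worker, NULL);
    pthread_attr_destroy(&attr);              /* 4. free attribute resources */
}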

Detaching:

• The pthread_detach() routine can be used to explicitly detach a thread even though it was created as joinable.

• There is no converse routine. http://www.llnl.gov/computing/tutorials/pthreads/

Stack Management

• Routines:
◦ pthread_attr_getstacksize (attr, stacksize)
◦ pthread_attr_setstacksize (attr, stacksize)
◦ pthread_attr_getstackaddr (attr, stackaddr)
◦ pthread_attr_setstackaddr (attr, stackaddr)

Preventing Stack Problems:

• The POSIX standard does not dictate the size of a thread's stack. This is implementation dependent and varies.

• Exceeding the default stack limit is often very easy to do, with the usual results: program termination and/or corrupted data.

• Safe and portable programs do not depend upon the default stack limit, but instead explicitly allocate enough stack for each thread by using the pthread_attr_setstacksize routine.

• The pthread_attr_getstackaddr and pthread_attr_setstackaddr routines can be used by applications in an environment where the stack for a thread must be placed in some particular region of memory.

http://www.llnl.gov/computing/tutorials/pthreads/

Stack Management Example

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#define NTHREADS 4
#define N 1000
#define MEGEXTRA 1000000

pthread_attr_t attr;

void *dowork(void *threadid)
{
    double A[N][N];               /* large automatic array: needs a big stack */
    int i, j;
    size_t mystacksize;

    pthread_attr_getstacksize(&attr, &mystacksize);
    printf("Thread %ld: stack size = %li bytes\n",
           (long) threadid, (long) mystacksize);
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            A[i][j] = ((i * j) / 3.452) + (N - i);
    pthread_exit(NULL);
}

int main(int argc, char *argv[])
{
    pthread_t threads[NTHREADS];
    size_t stacksize;
    int rc;
    long t;

    pthread_attr_init(&attr);
    pthread_attr_getstacksize(&attr, &stacksize);
    printf("Default stack size = %li\n", (long) stacksize);
    stacksize = sizeof(double) * N * N + MEGEXTRA;
    printf("Amount of stack needed per thread = %li\n", (long) stacksize);
    pthread_attr_setstacksize(&attr, stacksize);
    printf("Creating threads with stack size = %li bytes\n", (long) stacksize);
    for (t = 0; t < NTHREADS; t++) {
        rc = pthread_create(&threads[t], &attr, dowork, (void *) t);
        if (rc) {
            printf("ERROR; return code from pthread_create() is %d\n", rc);
            exit(-1);
        }
    }
    pthread_exit(NULL);
}

http://www.llnl.gov/computing/tutorials/pthreads/
