Programming Parallel Machines

Prof. R. Eigenmann

ECE563, Spring 2013 engineering.purdue.edu/~eigenman/ECE563

Parallelism - Why Bother?

Hardware perspective: parallelism is everywhere
– instruction level
– chip level (multicores)
– co-processors (accelerators, GPUs)
– multi-processor level
– multi-computer level
– distributed system level
Big question: Can all this parallelism be hidden?

Hiding Parallelism - For Whom and Behind What?

 For the end user:
– HPC applications: parallelism is not really apparent. Sometimes, the user needs to tell the system on how many “nodes” the application shall run.
– Collaborative applications and remote resource accesses: the user may want to see the parallelism.
 For the application programmer: we can try to hide parallelism
– using parallelizing compilers
– using parallel libraries or software components
– In reality: parallelism is partially hidden behind a good API

Different Forms of Parallelism

 An operating system has parallel processes to manage the many parallel activities that are going on concurrently.
 A high-performance computing application executes multiple parts of the program in parallel in order to get the job done faster.
 A bank that performs a transaction with another bank uses parallel systems to engage multiple distributed databases.
 A multi-group research team that uses a satellite downlink in Boston, a large computer in San Diego, and a “Cave” for visualization at the Univ. of Illinois needs collaborative parallel systems.

How important is Parallel Programming 2013 in Academia?

 It’s a hot topic again
– There was significant interest in the 1980s and the first half of the 1990s.
– Then the interest in parallelism declined until about 2005.
– Increased funding for HPC and emerging multicore architectures have made parallel programming a central topic.
 Hard problems remain:
– What is the right user model and application programmer interface (API) for high performance and high productivity?
– What are the right programming methodologies?
– How can we build scalable hardware and software systems?
– How can we get the user community to accept parallelism?

How important is Parallel Programming 2013 in Industry?
 Parallelism has become a mainstream technology
– The “multicore challenge” is one of the current big issues.
– Industry’s presence at “Supercomputing” is growing.
– Most large-scale computational applications exist in parallel form (car crash, computational chemistry, airplane/flow simulations, seismic processing, to name just a few).
– Parallel systems sell in many shapes and forms: HPC systems for “number crunching”, parallel servers, multi-processor PCs.
– Students who learn about parallelism find good jobs.
 However, the hard problems pose significant challenges:
– Lack of a good API leads to non-portable programs.
– Lack of a programming methodology leads to high software cost.
– Non-scalable systems limit the benefit from parallelism.
– Lack of acceptance hinders dissemination of parallel systems.

What is so Special About Parallel Programming?
Parallel programming is hard because:
 You have to “think parallel”
– We tend to think step-by-step, which is closer to the way a sequential program is written.
 All algorithms have ordering constraints
– They are a.k.a. data and control dependences, and they are difficult to analyze and express => you need to coordinate the parallel threads using synchronization constructs (e.g., locks).
 Deadlocks may occur
– Incorrectly coordinated threads may cause a program to “hang”.
 Race conditions may occur
– Race conditions occur when dependences are violated. They lead to non-reproducible answers of the program. Most often this is an error.
 Data partitioning, off-loading, and message generation are tedious
– But unavoidable when programming distributed and heterogeneous systems.

So, Why Can’t We Hide Parallelism?

 Could we reuse parallel modules?
– Software reuse is still largely an unsolved problem.
– Parallelism encoded in reusable modules may not be at the right level.
 Parallelizing an inner loop is almost always less efficient than parallelizing an outer loop.
 Could we use parallelizing compilers?
– Despite much progress in parallelizing compilers, there are still many programs on which such tools fail, e.g.
 programs not written in Fortran 77 or C
 irregular programs (e.g., using sparse data structures)

Basic Methods For Creating Parallel Programs
 Write a sequential program, then parallelize it.
– Application-level approach: study the physics behind the code, identify steps that can be executed concurrently, and re-code these steps in parallel form.
– Program-level approach (our focus): analyze the program, find code sections that access disjoint data, and mark these sections for parallel execution.
 Write a parallel program directly (from scratch)
– Advantage: parallelism is not inhibited by the sequential programming style.
– Disadvantages:

 Large, existing applications: huge effort

 If sequential programming is difficult already, this approach makes programming even more difficult.

Where in the Application Can We Find Parallelism?

Coarse-grain parallelism through domain decomposition

Subroutine-level parallelism:

  call A                 parallelcall(A)
  call B                 parallelcall(B)
  call C                 call C
                         waitfor A, B

Loop-level parallelism:

  DO i=1,n               !$OMP PARALLEL DO
    A(i)=B(i)            DO i=1,n
  ENDDO                    A(i)=B(i)
                         ENDDO

Instruction-level parallelism:

  C=A+4.5*B              load A,R0
                         load B,R1
                         mul R1,#4.5
                         add R0,R1
                         store R1,C

How Can We Express This Parallelism?

Parallel Threads:

  new_thread(id, subr, param)
  ...
  wait_thread(id)

Threads may live from the beginning to the end of the program or program section, or be created dynamically, as needed.

Parallel Loops:

A single parallel loop:
  PARALLEL DO i=1,n
    call work(i)
  ENDDO

Multiple inner parallel loops:
  DO i=1,n
    PARALLEL DO j=1,m
      ...
    ENDDO
    PARALLEL DO j=1,m
      ...
    ENDDO
  ENDDO

The outermost loop in the program is parallel:
  PARALLEL DO i=1,n
    ...
  ENDDO

How Can We Express This Parallelism?

Parallel Sections — tasks 1, 2, and 3 are executed concurrently:
  COBEGIN
    task1
  ||
    task2
  ||
    task3
  COEND

SPMD execution — a copy of “do-the-work” is executed by each processor/thread:

  BEGIN (all in parallel)
    ...do-the-work...
  END

The BEGIN/END may be implicit.

How do Parallel Tasks Communicate?

Shared-address-space, global-address-space, or shared-memory model: tasks see each other’s data (unless it is explicitly declared as private).

  task 1:  A=3   (store to the shared data A)
  task 2:  B=A   (load from the shared data A)

Distributed-memory or message-passing model: tasks exchange data through explicit messages.

  task 1:  A=3                  task 2:  receive(X,task1)
           send(A,task2)                 B=X

Generating Parallel Programs Automatically

A quick tour through a parallelizing compiler

The important techniques of parallelizing compilers are also the important techniques for manual parallelization

See ECE 663 Lecture Notes and Slides engineering.purdue.edu/~eigenman/ECE663/Handouts 14 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013

Automatic Parallelization

 The Scope of Auto-Parallelization  Loop Parallelization 101  Most Influential Dependence Removing Transformations  User’s Role in “Automatic” Parallelization  Performance of Parallelizing Compilers

Where Do Parallelization Tools Fit Into the Software Engineering Process?

Two paths from a source program to a parallel program:
• A source-to-source restructurer inserts parallel constructs or directives into the source (e.g., F90 → F90/OpenMP, C → C/OpenMP); the user then tunes the resulting program. Example: the Cetus compiler.
• A parallelizing compiler inserts parallel constructs or directives internally and generates the parallel program directly. Example: the SGI F77 compiler (-apo -mplist option).

The Basics About Parallelizing Compilers

 Loops are the primary source of parallelism in scientific and engineering applications.
 Compilers detect loops that have independent iterations, i.e., iterations that access disjoint data.

Parallelism of loops accessing arrays:

  FOR I = 1 TO n
    A[expression1] = …
    … = A[expression2]
  ENDFOR

The loop is independent if expression1 is different from expression2 for any two different iterations.

17 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Parallel Loop Execution

 Fully-independent loops are executed in parallel. The original loop

  FOR i=1 TO 1000
    …
  ENDFOR

executed on 4 processors may assign 250 iterations to each processor:

  FOR i=1 TO 250     FOR i=251 TO 500     FOR i=501 TO 750     FOR i=751 TO 1000
    …                  …                    …                    …
  ENDFOR             ENDFOR               ENDFOR               ENDFOR

 Usually, any dependence prevents a loop from executing in parallel. Hence, removing dependences is very important.

Loop Parallelization 101
Data Dependence: Definition
A dependence exists between two data references if (in a sequential execution of the program)
– both references access the same storage location,
– and at least one of them is a write reference.

Basic data dependence classification:
– Read after Write (RAW): flow dependence — a true dependence
– Write after Read (WAR): anti dependence — a false, or storage-related, dependence
– Write after Write (WAW): output dependence — a false, or storage-related, dependence
– Read after Read (RAR): input dependence — not a dependence

19 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Data Dependence Classification Examples

TRUE / FLOW / RAW:
  S1: X = …
  S2: … = X
The value read at S2 depends on the value written at S1.

ANTI / WAR:
  S1: … = X
  S2: X = …
The read at S1 must occur before the write at S2.

OUTPUT / WAW:
  S1: X = …
  S2: X = …
The write at S2 must occur after the write at S1 if X is read in a later statement.

INPUT / RAR:
  S1: … = X
  S2: … = X
(just for completeness; RAR is only relevant for certain optimizations)

Automatic Parallelization

 The Scope of Auto-Parallelization  Loop Parallelization 101  Most Influential Dependence Removing Transformations  User’s Role in “Automatic” Parallelization  Performance of Parallelizing Compilers

21 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Dependence-Removing Program Transformations I: Data Privatization

Dependence: elements of work read in iteration i’ were also written in iteration i’-1. Each processor is given a separate version of the private data, so there is no sharing conflict:

  C$OMP PARALLEL DO
  C$OMP+ PRIVATE(work)
  DO i = 1, n
    work[1:n] = …
    …
    … = work[1:n]
  ENDDO

Privatization

Scalar privatization — the loop-carried anti dependence on t is removed by making t private:

  DO i=1,n
    t = A(i)+B(i)
    C(i) = t + t**2
  ENDDO

  !$OMP PARALLEL DO
  !$OMP+PRIVATE(t)
  DO i=1,n
    t = A(i)+B(i)
    C(i) = t + t**2
  ENDDO

Array privatization — the same transformation applied to the array t(1:m):

  DO j=1,n
    t(1:m) = A(j,1:m)+B(j)
    C(j,1:m) = t(1:m) + t(1:m)**2
  ENDDO

  !$OMP PARALLEL DO
  !$OMP+PRIVATE(t)
  DO j=1,n
    t(1:m) = A(j,1:m)+B(j)
    C(j,1:m) = t(1:m) + t(1:m)**2
  ENDDO

Array Privatization — more complicated patterns for array privatization:

  k = 5
  DO j=1,n
    t(1:10) = A(j,1:10)+B(j)
    C(j,iv) = t(k)
    t(11:m) = A(j,11:m)+B(j)
    C(j,1:m) = t(1:m)
  ENDDO

  DO j=1,n
    IF (cond(j)) THEN
      t(1:m) = A(j,1:m)+B(j)
      C(j,1:m) = t(1:m) + t(1:m)**2
    ENDIF
    D(j,1) = t(1)
  ENDDO

Dependence-Removing Program Transformations II: Reduction Recognition

Dependence: the value of sum written in iteration i’-1 is read in iteration i’.

  C$OMP PARALLEL DO
  C$OMP+ REDUCTION (+:sum)
  DO i = 1, n
    …
    sum = sum + A[i]
    …
  ENDDO

Each processor will accumulate partial sums, followed by a combination of these parts at the end of the loop.

Rules for Reductions

• Reduction statements in a loop have the form X = X ⊗ expr,
• where X is either a scalar or an array expression (a[sub], where sub must be the same on the LHS and the RHS),
• ⊗ is a reduction operation, such as +, *, min, max,
• X must not be used in any non-reduction statement in this loop (however, there may be multiple reduction statements for X).
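A minimal C/OpenMP sketch (not from the original slides) illustrating these rules: sum appears only in statements of the form sum = sum + expr, so it can simply be named in a reduction clause.

  #include <stdio.h>

  int main(void) {
      double a[1000], sum = 0.0;
      for (int i = 0; i < 1000; i++) a[i] = i * 0.5;   /* initialize input data */

      #pragma omp parallel for reduction(+:sum)        /* each thread accumulates a private copy of sum */
      for (int i = 0; i < 1000; i++)
          sum = sum + a[i];                            /* reduction statement: X = X + expr */

      printf("sum = %f\n", sum);                       /* private copies have been combined here */
      return 0;
  }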

26 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Reduction Parallelization

The serial reduction loop

  DO j=1,n
    sum = sum + a(j)
    …
  ENDDO

can be parallelized with a per-thread preamble and postamble:

  DO PARALLEL j=1,n
    PREAMBLE:   PRIVATE s=0
    s = s + a(j)
    …
    POSTAMBLE:  ATOMIC: sum = sum + s
  ENDDO

or by making the reduction statement itself atomic:

  DO PARALLEL j=1,n
    ATOMIC: sum = sum + a(j)
    …
  ENDDO

Preamble and postamble are executed exactly once by each participating thread.

Reduction Parallelization II — Scalar Reductions

  DO i=1,n
    sum = sum + A(i)
  ENDDO

Privatized reduction implementation:

  !$OMP PARALLEL PRIVATE(s)
  s=0
  !$OMP DO
  DO i=1,n
    s=s+A(i)
  ENDDO
  !$OMP ATOMIC
  sum = sum+s
  !$OMP END PARALLEL

Expanded reduction implementation:

  DO i=1,num_proc
    s(i)=0
  ENDDO
  !$OMP PARALLEL DO
  DO i=1,n
    s(my_proc)=s(my_proc)+A(i)
  ENDDO
  DO i=1,num_proc
    sum=sum+s(i)
  ENDDO

Remember, OpenMP has a reduction clause; only reduction recognition is needed:

  !$OMP PARALLEL DO
  !$OMP+REDUCTION(+:sum)
  DO i=1,n
    sum = sum + A(i)
  ENDDO

Reduction Parallelization III — Array Reductions (a.k.a. irregular or histogram reductions)

  DIMENSION sum(m)
  DO i=1,n
    sum(expr) = sum(expr) + A(i)
  ENDDO

Privatized implementation:

  DIMENSION sum(m),s(m)
  !$OMP PARALLEL PRIVATE(s)
  s(1:m)=0
  !$OMP DO
  DO i=1,n
    s(expr)=s(expr)+A(i)
  ENDDO
  !$OMP ATOMIC
  sum(1:m) = sum(1:m)+s(1:m)
  !$OMP END PARALLEL

Expanded implementation:

  DIMENSION sum(m),s(m,#proc)
  !$OMP PARALLEL DO
  DO i=1,m
    DO j=1,#proc
      s(i,j)=0
    ENDDO
  ENDDO
  !$OMP PARALLEL DO
  DO i=1,n
    s(expr,my_proc)=s(expr,my_proc)+A(i)
  ENDDO
  !$OMP PARALLEL DO
  DO i=1,m
    DO j=1,#proc
      sum(i)=sum(i)+s(i,j)
    ENDDO
  ENDDO
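A hedged C/OpenMP sketch of the privatized array-reduction scheme above (the names hist, bin, M, and N are made up for illustration): each thread accumulates into a private copy of the histogram and the copies are combined at the end; a critical section is used for the sum-up because OpenMP atomic does not cover whole-array updates in C.

  #include <string.h>

  #define M 64        /* number of histogram bins */
  #define N 100000    /* number of input elements */

  void histogram(const int bin[N], const double a[N], double hist[M]) {
      #pragma omp parallel
      {
          double local[M];                     /* private copy of the reduction array */
          memset(local, 0, sizeof(local));
          #pragma omp for
          for (int i = 0; i < N; i++)
              local[bin[i]] += a[i];           /* the sum(expr) = sum(expr) + A(i) pattern */
          #pragma omp critical                 /* sum-up phase: combine the private copies */
          for (int j = 0; j < M; j++)
              hist[j] += local[j];
      }
  }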

Note, OpenMP 1.0 does not support such array reductions.

Performance Considerations for Reduction Parallelization

 Parallelized reductions execute substantially more code than their serial versions ⇒ overhead if the reduction (n) is small.  In many cases (for large reductions) initialization and sum-up are insignificant.  False sharing can occur, especially in expanded reductions, if multiple processors use adjacent array elements of the temporary reduction array (s).  Expanded reductions exhibit more parallelism in the sum-up operation compared to privatized reductions.  Potential overhead in initialization, sum-up, and memory used for large, sparse array reductions ⇒ compression schemes can become useful.

30 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Dependence-Removing Program Transformations III: Substitution

The loop-carried flow dependence on ind is removed by substituting the closed form of the induction variable:

  ind = k                      PARALLEL DO i=1,n
  DO i=1,n                       A(k+2*i) = B(i)
    ind = ind + 2              ENDDO
    A(ind) = B(i)
  ENDDO

This is the simple case of an induction variable

31 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Generalized Induction Variables

  ind=k                        PARALLEL DO j=1,n
  DO j=1,n                       A(k+(j**2+j)/2) = B(j)
    ind = ind + j              ENDDO
    A(ind) = B(j)
  ENDDO

More complex patterns — a triangular loop nest and coupled induction variables:

  DO i=1,n                     DO i=1,n
    DO j=1,i                     ind1 = ind1 + 1
      ind = ind + 1              ind2 = ind2 + ind1
      A(ind) = B(i)              A(ind2) = B(i)
    ENDDO                      ENDDO
  ENDDO

32 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Rules for Induction Variables

 induction statements in a loop nest have the form iv=iv+expr or iv=iv*expr, where iv is a scalar integer

 expr must be loop-invariant or another induction variable (there must not be cyclic relationships among IVs)

 iv must not be assigned in a non-induction statement
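A small C sketch (not from the slides) of the substitution these rules enable: the serial increment ind = ind + 2 satisfies the rules, so it is replaced by its closed form k + 2*i, which removes the loop-carried flow dependence and lets the loop run in parallel.

  void copy_strided(double *A, const double *B, int n, int k) {
      /* serial form:
       *   int ind = k;
       *   for (int i = 1; i <= n; i++) { ind = ind + 2; A[ind] = B[i]; }
       */
      #pragma omp parallel for
      for (int i = 1; i <= n; i++)
          A[k + 2*i] = B[i];      /* closed form of the induction variable */
  }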

User’s Role: Choice of Compiler Options

Examples of parallelizing compiler options (typically there is a large set of options):
– Optimization levels (will primarily increase compilation time)
 optimize: simple analysis, advanced analysis, data-dependence analysis, locality enhancement, array privatization/reduction
 aggressive: data padding, data layout adjustment
– Subroutine inlining (makes up for lack of interprocedural analysis)
 inline all, specific routines, how to deal with libraries
– Try specific optimizations (may enhance OR degrade performance)
 e.g., recurrence and reduction recognition, loop fusion, tiling

More About Compiler Options

– Limits on amount of optimization:

 e.g., size of optimization data structures, number of optimization variants tried – Make certain assumptions:

 e.g., array bounds are not violated, arrays are not aliased – Machine parameters:

 e.g., cache size, line size, mapping – Listing control

Compiler options can be a substitute for advanced compiler strategies. If the compiler has limited information, the user can help out.

Compiler options are very important for the user to know. Setting good compiler options can make a big performance difference. 35 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 User Tuning: Inspecting the Translated Program

 Source-to-source restructurers:
– Transformed source code is the actual output
– Example: Cetus
=> The output can be a starting point for code tuning
 Code-generating compilers:
– Some have an option for viewing the translated (parallel) code
– Example: SGI f77 -apo -mplist
=> You may modify the source code to make it easier for the compiler to detect parallelism

Compiler Listings

The listing gives many useful clues for improving the performance:
– Loop optimization tables
– Reports about data dependences
– Explanations about applied transformations
– The annotated, transformed code
– Calling tree
– Performance statistics

Example loop nest summary from a listing:
  CLENMO_do#1    loop is parallel
  CLENMO_do#1#1  loop is serial, because it contains I/O statements, and the following variables (may) have loop-carried dependences: VEC[]

The type of reports to be included in the listing can be set through compiler options. 37 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Automatic Parallelization

 The Scope of Auto-Parallelization  Loop Parallelization 101  Most Influential Dependence Removing Transformations  User’s Role in “Automatic” Parallelization  Performance of Parallelizing Compilers

38 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Performance of Parallelizing Compilers

[Chart: speedups (roughly 0–10) of the benchmarks BT, CG, EP, SP, FT, IS, LU, and MG as parallelized by Cetus, OpenUH, ICC, PGI, and Rose, compared with the hand-parallelized versions.]

39 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Tuning Automatically-Parallelized Programs

Why Would We Want to Tune Automatically-Parallelized Code?

Because
 compiler techniques are limited
– E.g., array reductions are parallelized by only few compilers
 compilers may have insufficient information
– E.g., the loop iteration range may be input data
– E.g., variables are defined in other subroutines (no interprocedural analysis)
 Tuning the compiler-parallelized program is generally easier than hand-parallelizing a sequential program.

Methods for Tuning Automatically-Parallelized Programs
 1. Tuning compiler options
– Parallelizers have many more options than standard compilers
 2. Changing the source program
– so that the parallelizer can recognize more opportunities for optimizations
 3. Manually improving the transformed code
– This task is similar to explicit parallel programming.
– The parallelizer must be a source-to-source restructurer.

Tuning Parallelizer Options

 Tuning compiler options may improve performance by 100% or more.  Examples: – compile for specific machine architecture – enable/disable recurrence recognition – padding data structures – enable/disable tiling – set parallelization threshold – set degree of inlining

– strict language standard interpretation 43 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Changing the Source Program

 Inserting directives, such as – explicitly parallel loops (e.g., OpenMP syntax) – properties of data structures (e.g., permutation array) – assert independence  Modifying the source code. Examples: – assigning explicit values to variables – removing pointers and obscure code – removing debug output statements

44 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Manually Improving the Transformed Code

 The basic method: – inspect the most time-consuming loops. If they are not parallelized, find out why; then transform them by hand.  Remember (very important): – The compiler gives hints in its listing, which may tell you where to focus attention. E.g., which variables have data dependences.

Exercise 1: A multi-threaded “Hello world” program
 Write a multithreaded program where each thread prints “hello world”.

  #include “omp.h”
  void main()
  {
    #pragma omp parallel
    {
      int ID = omp_get_thread_num();
      printf(“ hello(%d) ”, ID);
      printf(“ world(%d) \n”, ID);
    }
  }

Sample Output:
  hello(1) hello(0) world(1)
  world(0)
  hello(3) hello(2) world(3)
  world(2)

Exercise 2: A multi-threaded “pi” program
 On the following slide, you’ll see a sequential program that uses numerical integration to compute an estimate of PI.
 Parallelize this program using OpenMP. There are several options (do them all if you have time):

 Do it as an SPMD program using a parallel region only.

 Do it with a work sharing construct.  Remember, you’ll need to make sure multiple threads don’t overwrite each other’s variables.

Our running Example: The PI program — Numerical Integration

Mathematically, we know that:

  ∫ from 0 to 1 of 4.0/(1+x²) dx = π

We can approximate the integral as a sum of rectangles:

  Σ (i=0..N) F(xi)·Δx ≈ π,   where F(x) = 4.0/(1+x²)

and each rectangle has width Δx and height F(xi) in the middle of interval i.

PI Program: The sequential program

static long num_steps = 100000;
double step;
void main () {
  int i; double x, pi, sum = 0.0;

step = 1.0/(double) num_steps;

for (i=1; i<= num_steps; i++){
    x = (i-0.5)*step;
    sum = sum + 4.0/(1.0+x*x);
  }
  pi = step * sum;
}

49 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 PI Program The OpenMP parallel version

#include <omp.h>
static long num_steps = 100000;
double step;
void main ()
{
  int i;
  double x, pi, sum = 0.0;
  step = 1.0/(double) num_steps;
  #pragma omp parallel for reduction(+:sum) private(x)
  for (i=1; i<= num_steps; i++){
    x = (i-0.5)*step;
    sum = sum + 4.0/(1.0+x*x);
  }
  pi = step * sum;
}

OpenMP adds 2 lines of code.

R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 OpenMP PI Program: Parallelized without a reduction clause

#include <omp.h>
static long num_steps = 100000;
double step;
#define NUM_THREADS 2
void main ()
{
  int i;
  double x, pi, sum[NUM_THREADS] = {0.0};
  step = 1.0/(double) num_steps;
  omp_set_num_threads(NUM_THREADS);
  #pragma omp parallel
  {
    double x;
    int i, id;
    id = omp_get_thread_num();
    #pragma omp for
    for (i=0; i< num_steps; i++){
      x = (i+0.5)*step;
      sum[id] += 4.0/(1.0+x*x);
    }
  }
  for (i=0, pi=0.0; i<NUM_THREADS; i++)
    pi += sum[i] * step;
}

OpenMP PI Program: Without the use of a worksharing (omp for) construct => SPMD program

SPMD program: each thread runs the same code; the thread ID selects any thread-specific behavior.

#include <omp.h>
static long num_steps = 100000;
double step;
#define NUM_THREADS 2
void main ()
{
  int i;
  double x, pi, sum[NUM_THREADS] = {0};
  step = 1.0/(double) num_steps;
  omp_set_num_threads(NUM_THREADS);
  #pragma omp parallel
  {
    double x;
    int id, i;
    id = omp_get_thread_num();
    int nthreads = omp_get_num_threads();
    for (i=id; i< num_steps; i=i+nthreads){
      x = (i+0.5)*step;
      sum[id] += 4.0/(1.0+x*x);
    }
  }
  for (i=0, pi=0.0; i<NUM_THREADS; i++)
    pi += sum[i] * step;
}

52 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Performance Tuning Example 1: MDG

 MDG: A Fortran code of the “Perfect Benchmarks”.
 Advanced autoparallelizers may recognize the parallelism in this code.

[Chart: performance on a 4-core machine of the original code, after tuning step 1, and after tuning step 2.]

MDG: Tuning Steps

Step 1: Parallelize the most time-consuming loop. It consumes 95% of the serial execution time. The transformations it takes to parallelize this loop are:
– array privatization
– reduction parallelization

Step 2: Balance the iteration space of this loop.
– The loop nest is “triangular”. Default block partitioning would create an unbalanced assignment of iterations to processors.

Structure of the most time-consuming loop in MDG:

Original:
  c1 = x(1)>0
  c2 = x(1:10)>0
  DO i=1,n
    DO j=i,n
      IF (c1) THEN rl(1:100) = …
      …
      IF (c2) THEN … = rl(1:100)
      sum(j) = sum(j) + …
    ENDDO
  ENDDO

Parallel:
  c1 = x(1)>0
  c2 = x(1:10)>0
  Allocate(xsum(n,1:#proc))
  C$OMP PARALLEL DO
  C$OMP+ PRIVATE (i,j,rl,id)
  C$OMP+ SCHEDULE (STATIC,1)
  DO i=1,n
    id = omp_get_thread_num()
    DO j=i,n
      IF (c1) THEN rl(1:100) = …
      …
      IF (c2) THEN … = rl(1:100)
      xsum(j,id) = xsum(j,id) + …
    ENDDO
  ENDDO
  C$OMP PARALLEL DO
  DO i=1,n
    sum(i) = sum(i) + xsum(i,1:#proc)
  ENDDO

Performance Tuning Example 2: ARC2D
ARC2D: A Fortran code of the “Perfect Benchmarks”.

ARC2D is parallelized very well by available compilers. However, the mapping of the code to the machine could be improved.

[Chart: speedups of ARC2D for the original compiler-parallelized code, after the locality tuning step, and after the granularity tuning step.]

56 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 ARC2D: Tuning Steps

 Step 1: Loop interchanging increases cache locality through stride-1 references  Step 2: Move parallel loops to outer positions  Step 3: Move synchronization points outward  Step 4: Coalesce loops

ARC2D — loop interchanging increases (spatial) cache locality. Before:

  !$OMP PARALLEL DO
  !$OMP+PRIVATE(R1,R2,K,J)
  DO j = jlow, jup
    DO k = 2, kmax-1
      r1 = prss(jminu(j), k) + prss(jplus(j), k) + (-2.)*prss(j, k)
      r2 = prss(jminu(j), k) + prss(jplus(j), k) + 2.*prss(j, k)
      coef(j, k) = ABS(r1/r2)
    ENDDO
  ENDDO
  !$OMP END PARALLEL

After interchanging, the inner loop makes stride-1 references:

  !$OMP PARALLEL DO
  !$OMP+PRIVATE(R1,R2,K,J)
  DO k = 2, kmax-1
    DO j = jlow, jup
      r1 = prss(jminu(j), k) + prss(jplus(j), k) + (-2.)*prss(j, k)
      r2 = prss(jminu(j), k) + prss(jplus(j), k) + 2.*prss(j, k)
      coef(j, k) = ABS(r1/r2)
    ENDDO
  ENDDO
  !$OMP END PARALLEL

ARC2D

Increasing parallel loop granularity through the NOWAIT clause:

  !$OMP PARALLEL
  !$OMP+PRIVATE(LDI,LD2,LD1,J,LD,K)
  DO k = 2+2, ku-2, 1
    !$OMP DO
    DO j = jl, ju
      ld2 = a(j, k)
      ld1 = b(j, k)+(-x(j, k-2))*ld2
      ld = c(j, k)+(-x(j, k-1))*ld1+(-y(j, k-1))*ld2
      ldi = 1./ld
      f(j, k, 1) = ldi*(f(j, k, 1)+(-f(j, k-2, 1))*ld2+(-f(j, k-1, 1))*ld1)
      f(j, k, 2) = ldi*(f(j, k, 2)+(-f(j, k-2, 2))*ld2+(-f(j,k-2, 2))*ld1)
      x(j, k) = ldi*(d(j, k)+(-y(j, k-1))*ld1)
      y(j, k) = e(j, k)*ldi
    ENDDO
    !$OMP END DO NOWAIT
  ENDDO
  !$OMP END PARALLEL

Note that
• the k-loop is executed in sequential order and
• all iterations that have the same value of j are executed by the same thread in all iterations of the k-loop.

Related Technique: Loop Blocking

Loop blocking increases temporal locality:

  DO j=1,m                      DO i1=1,n,block
    DO i=1,n                      DO j=1,m
      B(i,j)=A(i,j)+A(i,j-1)        DO i=i1,min(i1+block-1,n)
    ENDDO                             B(i,j)=A(i,j)+A(i,j-1)
  ENDDO                             ENDDO
                                  ENDDO
                                ENDDO

Step 1: split the inner loop in two (a.k.a. loop “stripmining”).
Step 2: interchange the outer two loops.

The same effect is achieved like this:

  !$OMP PARALLEL
  DO j=1,m
    !$OMP DO SCHEDULE(STATIC,block)
    DO i=1,n
      B(i,j)=A(i,j)+A(i,j-1)
    ENDDO
    !$OMP END DO NOWAIT
  ENDDO
  !$OMP END PARALLEL

ARC2D

Increasing parallel loop granularity through loop coalescing (a.k.a. loop collapsing):

  !$OMP PARALLEL DO
  !$OMP+PRIVATE(n,k,j)
  DO n = 1, 4
    DO k = 2, kmax-1
      DO j = jlow, jup
        q(j, k, n) = q(j, k, n)+s(j, k, n)
        s(j, k, n) = s(j, k, n)*phic
      ENDDO
    ENDDO
  ENDDO
  !$OMP END PARALLEL

  !$OMP PARALLEL DO
  !$OMP+PRIVATE(nk,n,k,j)
  DO nk = 0,4*(kmax-2)-1
    n = nk/(kmax-2) + 1
    k = MOD(nk,kmax-2)+2
    DO j = jlow, jup
      q(j, k, n) = q(j, k, n)+s(j, k, n)
      s(j, k, n) = s(j, k, n)*phic
    ENDDO
  ENDDO
  !$OMP END PARALLEL

Underlying Technique: Loop Coalescing/Collapsing

Loop coalescing:

  PARALLEL DO i=1,n               PARALLEL DO ij=1,n*m
    DO j=1,m                        i = 1 + (ij-1) DIV m
      A(i,j) = B(i,j)      ==>      j = 1 + (ij-1) MOD m
    ENDDO                           A(i,j) = B(i,j)
  ENDDO                           ENDDO
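A hedged C/OpenMP sketch of the same technique (array sizes and names are illustrative): a single loop over ij replaces the nested loops, with i and j recovered by division and modulo; with an OpenMP 3.0 compiler the collapse clause achieves the same effect.

  void coalesced(double *A, const double *B, int n, int m) {
      #pragma omp parallel for
      for (int ij = 0; ij < n * m; ij++) {
          int i = ij / m;                    /* recover the row index */
          int j = ij % m;                    /* recover the column index */
          A[i*m + j] = B[i*m + j];
      }
  }

  void collapsed(double *A, const double *B, int n, int m) {
      #pragma omp parallel for collapse(2)   /* OpenMP 3.0: coalesce the two loops */
      for (int i = 0; i < n; i++)
          for (int j = 0; j < m; j++)
              A[i*m + j] = B[i*m + j];
  }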

The transformation can be beneficial if
• n is small, unknown, or variable
• the loop body is large
• the computation is irregular

Performance Tuning Example 3: EQUAKE
EQUAKE: A C code of the SPEC OMP 2001 benchmarks. EQUAKE is hand-parallelized with relatively few code modifications. It achieves excellent speedup.

[Chart: speedups of EQUAKE for the original sequential code, the initial OpenMP version, and the version with improved memory allocation.]

EQUAKE: Tuning Steps

 Step1: Parallelizing the four most time-consuming loops

 inserted OpenMP pragmas for parallel loops and private data

 array reduction transformation  Step2: A change in memory allocation

64 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 /* malloc w1[numthreads][…] */ subroutine #pragma omp parallel for smvp for (j = 0; j < numthreads; j++) for (i = 0; i < nodes; i++) { w1[j][i] = 0.0; ...; }

#pragma omp parallel private(my_cpu_id,exp,...) { my_cpu_idfor (i = 0; i =< omp_get_thread_num nodes; i++) (); EQUAKE while (...) { #pragma ... omp for for exp(i = 0;= iloop-local < nodes; i ++)computation; while (...) { ...w [...] += exp; exp... = loop-local computation; } w1[my_cpu_id][...] += exp; ... } } #pragma omp parallel for for (j = 0; j < numthreads; j++) { for (i = 0; i < nodes; i++) { w[i] += w1[j][i]; ...;}

65 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 !$OMP PARALLEL PRIVATE(rand, iother, itemp, temp) int my_cpu_id = 1 !$ my_cpu_id = omp_get_thread_num() + 1 Example 4: !$OMP DO DO j=1,npopsiz-1 CALL ran3(1,rand,my_cpu_id,0) GAFORT iother=j+1+DINT(DBLE(npopsiz-j)*rand) !$ IF (j < iother) THEN !$ CALL omp_set_lock(lck(j)) • Parallel shuffle !$ CALL omp_set_lock(lck(iother)) • 20,000 locks !$ ELSE • Deadlock !$ CALL omp_set_lock(lck(iother)) • Parallel random !$ CALL omp_set_lock(lck(j)) !$ END IF number generation itemp(1:nchrome)=iparent(1:nchrome,iother) iparent(1:nchrome,iother)=iparent(1:nchrome,j) subroutine iparent(1:nchrome,j)=itemp(1:nchrome) temp=fitness(iother) shuffle Different executions fitness(iother)=fitness(j) fitness(j)=temp produce different !$ IF (j < iother) THEN results => !$ CALL omp_unset_lock(lck(iother)) asynchronous !$ CALL omp_unset_lock(lck(j)) algorithm, !$ ELSE !$ CALL omp_unset_lock(lck(j)) non-deterministic !$ CALL omp_unset_lock(lck(iother)) parallelism !$ END IF END DO !$OMP END DO !$OMP END PARALLEL R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 !$OMP PARALLEL PRIVATE(rand, iother, itemp, temp) int my_cpu_id = 1 !$ my_cpu_id = omp_get_thread_num() + 1 Atomic !$OMP DO DO j=1,npopsiz-1 CALL ran3(1,rand,my_cpu_id,0) Region iother=j+1+DINT(DBLE(npopsiz-j)*rand) !$ IF (j < iother) THEN !$ CALL omp_set_lock(lck(j)) !$ CALL omp_set_lock(lck(iother)) !$ ELSE This is also referred !$ CALL omp_set_lock(lck(iother)) Atomic { to as Transactional !$ CALL omp_set_lock(lck(j)) !$ END IF Execution Memory itemp(1:nchrome)=iparent(1:nchrome,iother) iparent(1:nchrome,iother)=iparent(1:nchrome,j) by each iparent(1:nchrome,j)=itemp(1:nchrome) thread temp=fitness(iother) appears fitness(iother)=fitness(j) indivisible fitness(j)=temp !$ IF (j < iother) THEN } !$ CALL omp_unset_lock(lck(iother)) !$ CALL omp_unset_lock(lck(j)) !$ ELSE !$ CALL omp_unset_lock(lck(j)) !$ CALL omp_unset_lock(lck(iother)) !$ END IF END DO !$OMP END DO !$OMP END PARALLEL R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Atomic Region vs. Critical Section

                                   AtomicRegion { }                 CriticalSection { }
  Executes indivisibly             ✔                                ✔
  #threads executing in parallel   all, but conflicts force         1
                                   serialization or rollback
  Deterministic execution order    no                               no
  Implementation complexity        high                             low

Parallel Programming Tools and Methodologies

69 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 What Tools Did We Use for Performance Analysis and Tuning?

 Compilers – the starting point for performance tuning was the compiler-parallelized program. – It reports: parallelized loops, data dependences, call graph.  Subroutine and loop profilers – focusing attention on the most time-consuming loops is absolutely essential.  Performance “spreadsheets”: – typically comparing performance differences at the loop level.

70 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Guidelines for Fixing “Performance Bugs”

 The methodology that worked for us: – Use compiler-parallelized code as a starting point – Get loop profile and compiler listing – Inspect time-consuming loops (largest potential for improvement)

 Case 1. Check for parallelism where the compiler could not find it

 Case 2. Improve parallel loops where the speedup is limited

Remember: we are considering a program-level approach to performance tuning, as opposed to an application-level approach. 71 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Performance Tuning

Case 1: if the loop is not parallelized automatically, do this:  Check for parallelism: – read the compiler explanation, if available – a variable may be independent even if the compiler detects dependences (compilers are conservative) – check if conflicting array is privatizable (e.g., compilers don’t perform array privatization well) – check if the conflicting variable is a reduction or induction variable  If you find parallelism, add OpenMP parallel directives, or make the information explicit for the parallelizer

72 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Performance Tuning Case 2: if the loop is parallel but does not perform well, consider several optimization factors:

High overheads are caused by:

Parallelization overhead:
• parallel startup cost
• small loops
• additional parallel code
• over-optimized inner loops
• less optimization for parallel code

Spreading overhead:
• load imbalance
• synchronized sections
• non-stride-1 references
• many shared references (memory bandwidth)
• low cache affinity (increased cache misses)

Parallelization Overheads

 Parallel startup cost – even with efficient runtime libraries and microtasking schemes, fork/join overheads are in the order of 1-10 µs. They are machine-specific and may increase with the number of processors. – the overhead includes the time to

 wakeup helper tasks and communicate data to them (fork), and

 the barrier synchronization at the end of the loop (join)  Small loops – loops with small numbers of iterations or small bodies – generally, if the average execution time of a loop is less than 0.1 ms, you cannot expect good speedup.

 How many statements execute in that time?

74 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Parallelization Overheads continued

 Additional parallel code – parallel code may perform more work than the serial code

 examples: - initialization and final sum of reductions - parallel search may search a larger space than the serial equivalent  Over-optimized inner loops – the compiler may parallelize a subroutine, not knowing that it is called from within a parallel loop  Less optimization for parallel code – compilers are typically more conservative when optimizing parallel code. E.g., Sun’s OpenMP compiler uses -O3, although -O5 is available for serial code 75 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Spreading Overheads

 Load imbalance – uneven numbers of iterations

 triangular loops

 many if statements

 non-uniform memory or disk access times – interruptions

 other programs

 OS processes  Synchronized section – even short critical sections can decrease performance substantially – then there is the cost of calling synchronization primitives

76 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Spreading Overheads continued  Non-stride-1 references – this is already a problem in serial programs – issues in parallel programs:

 several processors read from the same cache line => more cache misses

 several processors write to the same cache line => false sharing even if the accesses go to different data  Many shared references – private data can be kept in the local memory (segment) – bandwidth of shared memory is limited  Low cache affinity – additional cache misses are incurred if processors access different data in consecutive loops 77 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Profiling Techniques

 Simple timing on/off calls before/after loops are useful.  Optimizations often affect other program sections as well. Be sure to monitor the overall program execution time as well.  Both inclusive and exclusive profiles are useful.  Be aware that inserted subroutine calls may inhibit certain compiler optimizations.
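A minimal sketch of the “simple timing calls” mentioned above, using omp_get_wtime(); the measured loop and its arguments are illustrative only.

  #include <omp.h>
  #include <stdio.h>

  void profile_loop(double *a, int n) {
      double t0 = omp_get_wtime();          /* timer "on" before the loop */
      #pragma omp parallel for
      for (int i = 0; i < n; i++)
          a[i] = a[i] * 2.0 + 1.0;          /* the loop being measured */
      double t1 = omp_get_wtime();          /* timer "off" after the loop */
      printf("loop time: %f s\n", t1 - t0); /* also keep monitoring total program time */
  }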

78 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 MDG performance optimization report

(see handout)

79 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Performance Analysis Tool Examples

Analyze each Parallel region

Find serial regions that are hurt by parallelism

Sort or filter regions to navigate to hotspots

80 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Performance Analysis Tool Examples

81 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Performance Analysis Tool Examples

82 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Performance Analysis Tool Examples Program Structure View

Performance Spreadsheet

83 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Introductions to

 OpenMP

 MPI

(see separate slides)

PI Program in MPI

static long num_steps = 100000;
double step;
void main () {
  int i, id, p;
  double x, pi, sum = 0;
  step = 1.0/(double) num_steps;

MPI_Init(…); MPI_Comm_rank (MPI_COMM_WORLD, &id);//rank of each process MPI_Comm_size (MPI_COMM_WORLD, &p);//total number of processes

for (i=id;i< num_steps; i=i+p) { x = (i+0.5)*step; sum += 4.0/(1.0+x*x); }

MPI_Allreduce(…sum…pi…);

MPI_Finalize(); }
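For reference, a complete, hedged C sketch of the same program; the arguments elided in the MPI_Allreduce call above are filled in one plausible way, and the result is printed by rank 0.

  #include <mpi.h>
  #include <stdio.h>

  static long num_steps = 100000;

  int main(int argc, char *argv[]) {
      int id, p;
      double step, x, sum = 0.0, pi = 0.0;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &id);   /* rank of this process */
      MPI_Comm_size(MPI_COMM_WORLD, &p);    /* total number of processes */

      step = 1.0 / (double) num_steps;
      for (long i = id; i < num_steps; i += p) {   /* cyclic distribution of iterations */
          x = (i + 0.5) * step;
          sum += 4.0 / (1.0 + x * x);
      }
      sum *= step;
      /* combine the partial sums of all processes; every rank receives pi */
      MPI_Allreduce(&sum, &pi, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

      if (id == 0) printf("pi = %f\n", pi);
      MPI_Finalize();
      return 0;
  }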

85 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Introduction to Pthreads POSIX Standard, IEEE Std 1003.1c-1995 Creating and starting a new thread: int pthread_create (pthread_t *thread, const pthread_attr_t *attr, void * (*routine)(void*), void* arg)  Returns 0 if successful, or non-zero error code.  thread points to the ID of the newly created thread.  attr specifies the thread attributes to be applied to the new thread. NULL for default attributes.  routine is the name of the function that the thread calls when started. It takes a single parameter (arg), a pointer to void.

R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Introduction to Pthreads

Blocking the calling thread until the specified thread ends: int pthread_join (pthread_t thread, void ** status);
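A minimal sketch (not from the slides) putting pthread_create and pthread_join together in a hello-world style program; the thread count and function names are arbitrary.

  #include <pthread.h>
  #include <stdio.h>

  #define NTHREADS 4

  void *hello(void *arg) {
      long id = (long) arg;                 /* the single argument passed at creation */
      printf("hello from thread %ld\n", id);
      return NULL;
  }

  int main(void) {
      pthread_t tid[NTHREADS];
      for (long i = 0; i < NTHREADS; i++)
          pthread_create(&tid[i], NULL, hello, (void *) i);  /* NULL = default attributes */
      for (int i = 0; i < NTHREADS; i++)
          pthread_join(tid[i], NULL);                        /* block until each thread ends */
      return 0;
  }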

87 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Introduction to Pthreads

More thread management functions  pthread_self : find out own thread ID  pthread_equal : test two thread ID for equality  pthread_detach : set thread to release resources  pthread_exit : exit thread without exiting the process  pthread_cancel : terminates another thread

88

R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Mutexes int pthread_mutex_lock (pthread_mutex_t *mutex); locks the mutex. If already locked, blocks caller. int pthread_mutex_trylock (pthread_mutex_t *mutex); like mutex_lock, but no blocking if locked already int pthread_mutex_unlock (pthread_mutex_t *mutex); wakes 1st thread waiting on mutex
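A small sketch showing the mutex calls above protecting a shared counter; without the lock, the two threads would race on the increment.

  #include <pthread.h>

  static long counter = 0;
  static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

  void *increment(void *arg) {
      for (int i = 0; i < 100000; i++) {
          pthread_mutex_lock(&lock);     /* blocks if another thread holds the lock */
          counter++;                     /* critical section */
          pthread_mutex_unlock(&lock);   /* wakes one waiting thread, if any */
      }
      return NULL;
  }

  int main(void) {
      pthread_t t1, t2;
      pthread_create(&t1, NULL, increment, NULL);
      pthread_create(&t2, NULL, increment, NULL);
      pthread_join(t1, NULL);
      pthread_join(t2, NULL);
      return 0;                          /* counter is now exactly 200000 */
  }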

R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Semaphores

 init semaphore to value: int sem_init (sem_t *sem, int pshared, unsigned int value)

 increments sem. Any waiting thread will wake up: int sem_post (sem_t *sem)

 decrements sem. Blocks if sem is 0: int sem_wait (sem_t *sem)
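A hedged sketch of a single-producer/single-consumer handshake with these calls; the buffer handling is a simplified placeholder.

  #include <semaphore.h>
  #include <pthread.h>

  static sem_t items;          /* counts items available to the consumer */
  static int buffer[1024];
  static int in = 0, out = 0;

  void *producer(void *arg) {
      for (int i = 0; i < 1024; i++) {
          buffer[in++] = i;    /* produce an item */
          sem_post(&items);    /* signal: one more item available */
      }
      return NULL;
  }

  void *consumer(void *arg) {
      for (int i = 0; i < 1024; i++) {
          sem_wait(&items);    /* block until an item is available */
          int v = buffer[out++];
          (void) v;            /* consume the item */
      }
      return NULL;
  }

  int main(void) {
      pthread_t p, c;
      sem_init(&items, 0, 0);  /* not shared across processes; initial value 0 */
      pthread_create(&p, NULL, producer, NULL);
      pthread_create(&c, NULL, consumer, NULL);
      pthread_join(p, NULL);
      pthread_join(c, NULL);
      return 0;
  }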

R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Data Sharing in Pthreads

 Parent and child thread share global variables  However, it is not guaranteed that a write to a variable is seen immediately by the other thread  Synchronization functions make the global memory state consistent  For volatile variables, the compiler will generate code that reads from/write to memory at every access. – BUT: the architecture may still not guarantee that the memory state is immediately seen by the other thread.

Understanding memory/consistency models is an advanced topic. Whenever possible, use programming constructs that update the memory state implicitly. • use OpenMP parallel and workshare constructs with implicit barriers • enclose shared variables with synchronization functions in Pthreads 91 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Translating OpenMP into (P)threads

 Most architectures do not have instructions that support the execution of parallel loops directly  A compiler (an OpenMP preprocessor) must translate the source program into a thread- based form.  The thread-based form of the program makes calls to a runtime library that supports the OpenMP execution scheme.
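A hedged, hand-written sketch of what such a translation might look like for a simple parallel loop over b[i] = 2: the loop body is outlined into a routine and a runtime-style function hands each thread a chunk of the iteration space. All names (loop_task, run_parallel_loop) are invented for illustration; real OpenMP runtimes differ, and in particular they keep helper threads alive between loops, as the microtasking scheme below explains.

  #include <pthread.h>

  typedef struct { int lb, ub; double *b; } chunk_t;

  static void *loop_task(void *arg) {          /* outlined loop body */
      chunk_t *c = (chunk_t *) arg;
      for (int i = c->lb; i < c->ub; i++) c->b[i] = 2.0;
      return NULL;
  }

  static void run_parallel_loop(double *b, int n, int nthreads) {
      pthread_t tid[nthreads];
      chunk_t chunk[nthreads];
      int block = (n + nthreads - 1) / nthreads;      /* static block partitioning */
      for (int t = 0; t < nthreads; t++) {
          chunk[t].lb = t * block;
          chunk[t].ub = (t + 1) * block < n ? (t + 1) * block : n;
          chunk[t].b  = b;
          pthread_create(&tid[t], NULL, loop_task, &chunk[t]);
      }
      for (int t = 0; t < nthreads; t++)
          pthread_join(tid[t], NULL);                 /* implicit barrier at the end of the loop */
  }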

Instructions that support direct execution of parallel loops

1. Example architecture: Alliant FX/8 (1980s)
– machine instruction for parallel loop
– HW concurrency bus supports loop scheduling

  Source:               Generated code:
  a=0                   store #0,<a>
  DO i=1,n              load <n>,D6
    b(i) = 2            sub 1,D6
  ENDDO                 load &b,A1
  b=3                   cdoall D6
                          store #2,A1(D7.r)
                        endcdoall
                        store #3,<b>

  (D7 is reserved for the loop variable; it starts at 0.)

Usual Underlying Execution Scheme for Parallel Loops (a.k.a. Microtasking)

2. Microtasking scheme (dates back to early IBM mainframes)

[Diagram: four processors p1–p4. The master executes the sequential sections; at program start it calls init_helper_tasks (creating Pthreads); at each parallel loop it wakes the helpers (wakeup_helpers), and after the loop the helpers sleep again (sleep_helpers).]

Problem: parallel loop startup must be very fast — microtask startup takes a few µs, while Pthreads startup takes 100 µs or more.

Compiler Transformation for the Microtasking Scheme

Original:
  a=0
  C$OMP PARALLEL DO
  DO i=1,n
    b(i) = 2
  ENDDO
  b=3

Transformed:
  call init_microtasking()   // once at program start
  ...
  a=0
  call loop_scheduler(loopsub,i,1,n,b)
  b=3

  subroutine loopsub(mytask,lb,ub,b)
  DO i=lb,ub
    b(i) = 2
  ENDDO
  END

Loop scheduler (master task side): partition the loop iterations, wake up the helpers by setting their flag and the shared data (loopsub, lb, ub, param), call loopsub(id,lb,ub,param) itself, then barrier until all flags are reset and return.
Helper task side: loop — wait for the flag, call loopsub(...), reset the flag.

OpenMP vs MPI — lessons learned from a seismic processing application

 MPI and OpenMP can look like opposite ends of some programming method spectrum  However, even here it is true that – it is at least as important how you use a model than what the model is.  Our lesson will show that you can program in MPI style using OpenMP. In particular, the following are myths: – MPI exploits outermost parallelism, OpenMP exploits inner parallelism – MPI programs have good locality, OpenMP programs do not – MPI threads are very different from OpenMP threads 96 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Case Study of a Large-Scale Application

 Overview of the Application  Basic use of MPI and OpenMP  Issues Encountered  Performance Results

97 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Overview of Seismic

 Representative of modern seismic processing programs used in the search for oil and gas.  20,000 lines of Fortran. C subroutines interface with the operating system.  Available in a serial and a parallel variant.  Parallel code is available in a message- passing and an OpenMP form.  Is part of the SPEC HPC benchmark suite. – SPEC HPC is now retired  Includes 4 data sets: small to x-large. 98 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Seismic: Basic Characteristics

 Program structure: – 240 Fortran and 119 C subroutines.  Main algorithms: – FFT, finite difference solvers  Running time of Seismic (@ 500MFlops): – small data set: 0.1 hours – x-large data set: 48 hours  IO requirement: – small data set: 110 MB – x-large data set: 93 GB 99 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Basic Parallelization Scheme

 Split into p parallel tasks (p = number of processors)

  MPI:                        OpenMP:
  Program Seismic             Program Seismic
  If (rank=0)                 initialization
    initialization            C$OMP PARALLEL
  call main_subroutine()      call main_subroutine()
                              C$OMP END PARALLEL

→ SPMD execution scheme. Initialization done by master thread only 100 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Synchronization and Communication

MPI:
  compute
  communicate (all-to-all send/receive)

OpenMP:
  compute
  communicate:
    copy to shared buffer;
    barrier_synchronization;
    copy from shared buffer
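A hedged C/OpenMP sketch of this shared-buffer “communication” pattern (names such as shared_buf, my_data, and NELEM are illustrative, and the routine is assumed to be called inside a parallel region): each thread copies its data into a shared buffer, all threads meet at a barrier, and then each thread reads its neighbor’s data.

  #include <omp.h>
  #include <string.h>

  #define NELEM 4096

  void exchange(double shared_buf[][NELEM], double my_data[NELEM], int nthreads) {
      int id = omp_get_thread_num();
      memcpy(shared_buf[id], my_data, NELEM * sizeof(double));        /* copy to shared buffer */
      #pragma omp barrier                                             /* all copies are complete */
      int neighbor = (id + 1) % nthreads;
      memcpy(my_data, shared_buf[neighbor], NELEM * sizeof(double));  /* copy from shared buffer */
  }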

101 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Issues: Mixing Fortran and C  Bulk of computation is done Data privatization in OpenMP/C in Fortran #pragma omp thread private (item)  Utility routines are in C: float item; – IO operations void x(){ – data partitioning routines ... = item; – communication/synchronization } operations Data expansion in  OpenMP-related issues: absence a of OpenMP/C compiler – IF C/OpenMP compiler is not float item[num_proc]; available, data privatization void x(){ must be done through int thread; “expansion”. thread = omp_get_thread_num_(); – Mix of Fortran and C is ... = item[thread]; implementation dependent } 102 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Broadcast Common Blocks common /cc/ cdata common /cc/ cdata common /dd/ ddata common /dd/ ddata C initialization: #pragma OMP threadprivate /cc/,/dd/ IF (rank=0) c initialization cdata = ... cdata = ... ddata = ... ddata = ... ENDIF

C copy common blocks: C$OMP PARALLEL call syspcom () C$OMP+COPYIN(/cc/, /dd/) call main_subroutine() call main_subroutine() C$END PARALLEL

OpenMP issue: At the start of the parallel region it is not yet known which common blocks need to be copied in.

Solution: copy-in all common blocks => overhead 103 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 OpenMP Issues: Multithreading IO and malloc IO routines and memory allocation are called within parallel threads, inside C utility routines.  OpenMP requires all standard libraries and intrinsics to be thread-save. However the implementations are not always compliant. → system-dependent solutions need to be found  The same issue arises if standard C routines are called inside a parallel Fortran region or in non- standard syntax. Standard C compilers do not know anything about OpenMP and the thread safety requirement.

104 Note, this is not an issue in MPI

R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 More OpenMP Issues: Processor Affinity

 OpenMP currently does not specify or provide constructs for controlling the binding of threads to

processors.p1 2 3 4 task1 task2 task3 parallel task4 region tasks may migrate as a result of an OS event

 Processors can migrate, causing overhead. This behavior is system-dependent. System-dependent solutions may be available.

105 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Performance Results

Speedups of Seismic on an 8-processor system

[Charts: speedups of the message-passing (PVM/MPI) and OpenMP versions of Seismic on 1, 2, 4, and 8 processors, for the small and medium data sets.]

106 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Programming Vector Architectures

EE663, Spring 2012 Slide 107 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Vector Instructions

A vector instruction operates on a number of data elements at once. Example: vadd va,vb,vc,32 vector operation of length 32 on vector registers va,vb, and vc – va,vb,vc can be  Special cpu registers or memory → classical supercomputers  Regular registers, subdivided into shorter partitions (e.g., 64bit register split 8-way) → multi-media extensions – The operations on the different vector elements can overlap → vector pipelining

EE663, Spring 2012 Slide 108 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Applications of Vector Operations

 Science/engineering applications are typically regular with large loop iteration counts. This was ideal for classical supercomputers, which had long vectors (up to 256; vector pipeline startup was costly).  Graphics applications can exploit “multi- media” register features and instruction sets.

EE663, Spring 2012 Slide 109 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Basic Vector Transformation

  DO i=1,n
    A(i) = B(i)+C(i)        →    A(1:n)=B(1:n)+C(1:n)
  ENDDO

  DO i=1,n
    A(i) = B(i)+C(i)        →    A(1:n)=B(1:n)+C(1:n)
    C(i-1) = D(i)**2             C(0:n-1)=D(1:n)**2
  ENDDO

Here, the triplet notation means vector operation. Notice that this is not necessarily the same meaning as the array notation supported by some languages. EE663, Spring 2012 Slide 110 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Distribution and Vectorization The transformation done on the previous slide involves loop distribution. Loop distribution reorders computation and is thus subject to data dependence constraints. loop distribution DO i=1,n DO i=1,n A(i) = B(i)+C(i) A(i) = B(i)+C(i) ENDDO D(i) = A(i)+A(i-1) DO i=1,n dependence ENDDO D(i) = A(i)+A(i-1) ENDDO

vectorization The transformation is not legal if there is A(1:n)=B(1:n)+C(1:n) a lexical-backward dependence: D(1:n)=A(1:n)+A(0:n-1)

DO i=1,n loop-carried dependence Statement reordering may help A(i) = B(i)+C(i) resolve the problem. However, C(i+1) = D(i)**2 this is not possible if there is a ENDDO dependence cycle. 111

R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Slide 111 Vectorization Needs Expansion

... as opposed to privatization

  DO i=1,n                      DO i=1,n
    t = A(i)+B(i)        →        T(i) = A(i)+B(i)         (expansion)
    C(i) = t + t**2               C(i) = T(i) + T(i)**2
  ENDDO                         ENDDO

  T(1:n) = A(1:n)+B(1:n)                                   (vectorization)
  C(1:n) = T(1:n)+T(1:n)**2

EE663, Spring 2012 Slide 112 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Conditional Vectorization

DO i=1,n IF (A(i) < 0) A(i)=-A(i) ENDDO

conditional vectorization

WHERE (A(1:n) < 0) A(1:n)=-A(1:n)

EE663, Spring 2012 Slide 113 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Stripmining for Vectorization

  DO i=1,n                      DO i1=1,n,32
    A(i) = B(i)          →        DO i=i1,min(i1+31,n)
  ENDDO                             A(i) = B(i)
                                  ENDDO
                                ENDDO          (stripmining)

Stripmining turns a single loop into a doubly-nested loop for two-level parallelism. It also needs to be done by the code-generating compiler to split an operation into chunks of the available vector length.
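A small C sketch of the same transformation with an assumed vector length of 32: the single loop is split into an outer loop over strips and an inner loop over the elements of one strip.

  void stripmined_copy(double *A, const double *B, int n) {
      const int VL = 32;                        /* assumed vector length */
      for (int i1 = 0; i1 < n; i1 += VL) {      /* loop over strips */
          int upper = (i1 + VL < n) ? i1 + VL : n;
          for (int i = i1; i < upper; i++)      /* one strip: candidate for a vector operation */
              A[i] = B[i];
      }
  }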

EE663, Spring 2012 Slide 114 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Programming Accelerators

EE663, Spring 2012 Slide 115 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Examples of Accelerators

Nvidia GPGPU

Intel MIC (Xeon Phi)

116 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Why Accelerators?

At a high level, special-purpose architectures can execute certain program patterns faster or with less energy than general-purpose CPUs. Examples: – vector machines – GRAPE (molecular-dynamics processor) – DSP

– FPGA 117 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Why Accelerators?

More specifically  Heterogeneity as an opportunity 1 fat processor for sequential code + many lean cores for highly parallel code  Tradeoff: fat core/large cache versus many of them  GPUs exist anyway, use them for computation as well  Energy argument  high bandwidth AND low (effective) latency for certain access patterns

118 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Accelerators Today

 GPGPUs have accelerated many applications  Performance varies widely  Emphasis is on science/engineering applications  Fastest supercomputers use GPGPUs  Nvidia GPGPUs dominate – key architecture features: SIMD and multithreading  Intel’s LRB did not become a product. However, MIC (Xeon Phi) has just entered the market.  Programming productivity is poor – programmer needs to understand complex architecture – new programming constructs and techniques 119 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 CUDA Architecture

CPU GPU

120 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Programming Models

 History: – Explicit offloading and optimization: CUDA, OpenCL, OpenACC  Pure Shared Memory model: – Compiler performs the translation (OpenMP research compiler)  Shared Memory plus directives: – OpenACC, OpenMP 4.0, OpenMP (research compiler)

121 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Important Accelerator Programming Techniques

 defining kernels for offloading
 creating coalesced accesses (e.g., loop restructuring for coalescing, data transpose) – this is the most important optimization (note, this was true for vector machines as well)
 managing copy-in/out
 data placement in the accelerator memory hierarchy
 defining thread block size

122 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Programming Complexity Comparison (Jacobi) OpenMP code CUDA code float a[SIZE_2][SIZE_2]; float a[SIZE_2][SIZE_2]; float b[SIZE_2][SIZE_2]; __global__ void main_kernel0(float a[SIZE_2][SIZE_2], float b[SIZE_2][SIZE_2]) { float b[SIZE_2][SIZE_2]; int i,j; int _bid = (blockIdx.x+(blockIdx.y*gridDim.x)); int _gtid = (threadIdx.x+(_bid*blockDim.x)); int main (int argc, char *argv[]) { int i, j, k; i=(_gtid+1); if (i<=SIZE { for (k = 0; k < ITER; k++) { for (j=1; j<=SIZE j ++ ) #pragma omp parallel for private(i, j) for (i = 1; i <= SIZE; i++) a[i][j]=(b[(i-1)][j]+b[(i+1)][j]+b[i][(j-1)]+b[i][(j+1)])/4.0F; } for (j = 1; j <= SIZE; j++) a[i][j] = (b[i - 1][j] + b[i + 1][j] + b[i][j - 1] + b[i][j + 1]) / 4.0f; } #pragma omp parallel for private(i, j) __global__ void main_kernel1(float a[SIZE_2][SIZE_2], float b[SIZE_2][SIZE_2]) { int i,j; int _bid = (blockIdx.x+(blockIdx.y*gridDim.x)); int _gtid = (threadIdx.x+(_bid*blockDim.x)); for (i = 1; i <= SIZE; i++) for (j = 1; j <= SIZE; j++) i=(_gtid+1); if (i<=SIZE { b[i][j] = a[i][j]; for (j=1; j<=SIZE j ++ ) return 0; } b[i][j]=a[i][j]; } } int main(int argc, char * argv[]) { int i,j,k; float * gpu__a; float * gpu__b; gpuBytes=(SIZE_2*SIZE_2)*sizeof (float); CUDA_SAFE_CALL(cudaMalloc(((void * * )( & gpu__a)), gpuBytes)); dim3 dimBlock0(BLOCK_SIZE, 1, 1); dim3 dimGrid0(NUM_BLOCKS, 1, 1); CUDA_SAFE_CALL(cudaMemcpy(gpu__a, a, gpuBytes, cudaMemcpyHostToDevice)); gpuBytes=((SIZE_2*SIZE_2)*sizeof (float)); CUDA_SAFE_CALL(cudaMalloc(((void * * )( & gpu__b)), gpuBytes)); CUDA_SAFE_CALL(cudaMemcpy(gpu__b, b, gpuBytes, cudaMemcpyHostToDevice)); dim3 dimBlock1(BLOCK_SIZE, 1, 1); dim3 dimGrid1(NUM_BLOCKS, 1, 1); for (k=0; k>>((float (*)[SIZE_2])gpu__a, (float (*)[SIZE_2])gpu__b); main_kernel1<<>>((float (*)[SIZE_2)])gpu__a, (float (*)[SIZE_2])gpu__b); } gpuBytes=(SIZE_2*SIZE_2)*sizeof (float); CUDA_SAFE_CALL(cudaMemcpy(b, gpu__b, gpuBytes, cudaMemcpyDeviceToHost)); CUDA_SAFE_CALL(cudaFree(gpu__b)); gpuBytes=(SIZE_2*SIZE_2)*sizeof (float); CUDA_SAFE_CALL(cudaMemcpy(a, gpu__a, gpuBytes, cudaMemcpyDeviceToHost)); CUDA_SAFE_CALL(cudaFree(gpu__a)); fflush(stdout); fflush(stderr); return 0; }

R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 OpenMP Version of Jacobi

float a[SIZE_2][SIZE_2];
float b[SIZE_2][SIZE_2];

int main (int argc, char *argv[]) {
  int i, j, k;
  for (k = 0; k < ITER; k++) {
#pragma omp parallel for private(i, j)
    for (i = 1; i <= SIZE; i++)
      for (j = 1; j <= SIZE; j++)
        a[i][j] = (b[i - 1][j] + b[i + 1][j] + b[i][j - 1] + b[i][j + 1]) / 4.0f;
#pragma omp parallel for private(i, j)
    for (i = 1; i <= SIZE; i++)
      for (j = 1; j <= SIZE; j++)
        b[i][j] = a[i][j];
  }
  return 0;
}

R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 CUDA Version of Jacobi

/* GPU memory allocation and data transfer to GPU */
int main(int argc, char *argv[]) {
  int i, j, k;
  float *gpu__a;
  float *gpu__b;
  gpuBytes = (SIZE_2*SIZE_2)*sizeof(float);
  CUDA_SAFE_CALL(cudaMalloc(((void **)(&gpu__a)), gpuBytes));
  CUDA_SAFE_CALL(cudaMemcpy(gpu__a, a, gpuBytes, cudaMemcpyHostToDevice));
  gpuBytes = ((SIZE_2*SIZE_2)*sizeof(float));
  CUDA_SAFE_CALL(cudaMalloc(((void **)(&gpu__b)), gpuBytes));
  CUDA_SAFE_CALL(cudaMemcpy(gpu__b, b, gpuBytes, cudaMemcpyHostToDevice));
  dim3 dimBlock0(BLOCK_SIZE, 1, 1);
  dim3 dimGrid0(NUM_BLOCKS, 1, 1);
  dim3 dimBlock1(BLOCK_SIZE, 1, 1);
  dim3 dimGrid1(NUM_BLOCKS, 1, 1);

  /* GPU kernel execution */
  for (k = 0; k < ITER; k++) {
    main_kernel0<<<dimGrid0, dimBlock0>>>((float (*)[SIZE_2])gpu__a, (float (*)[SIZE_2])gpu__b);
    main_kernel1<<<dimGrid1, dimBlock1>>>((float (*)[SIZE_2])gpu__a, (float (*)[SIZE_2])gpu__b);
  }

  /* Data transfer back to CPU and GPU memory deallocation */
  gpuBytes = (SIZE_2*SIZE_2)*sizeof(float);
  CUDA_SAFE_CALL(cudaMemcpy(b, gpu__b, gpuBytes, cudaMemcpyDeviceToHost));
  gpuBytes = (SIZE_2*SIZE_2)*sizeof(float);
  CUDA_SAFE_CALL(cudaMemcpy(a, gpu__a, gpuBytes, cudaMemcpyDeviceToHost));
  CUDA_SAFE_CALL(cudaFree(gpu__b));
  CUDA_SAFE_CALL(cudaFree(gpu__a));
  fflush(stdout);
  fflush(stderr);
  return 0;
}

R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Intra-Thread vs. Inter-Thread Locality

[Figure: Common CPU memory model vs. CUDA memory model. On the CPU, threads 0 and 1 access global memory through separate caches, which can lead to false sharing. On the GPU, threads 0–3 access global memory directly, and coalesced global memory accesses reduce overall latency.]

 Intra-thread locality is beneficial to both the OpenMP and CUDA models.  Inter-thread locality plays a critical role in the CUDA model.
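A minimal CUDA sketch of the contrast above (kernel and variable names are illustrative, not from the lecture): in the first kernel each thread's accesses are strided, so a warp's requests scatter across memory; in the second, consecutive threads touch consecutive addresses and the accesses coalesce.

// Strided access: thread t touches in[t*stride]; the 32 requests of a warp
// hit 32 separate memory segments (a similar pattern is harmless on a CPU,
// where each thread's accesses stay within its own cache lines).
__global__ void strided(const float *in, float *out, int stride, int n)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < n) out[t] = in[t * stride] * 2.0f;
}

// Coalesced access: thread t touches in[t]; the 32 requests of a warp fall
// on consecutive addresses and are served by a few wide transactions.
__global__ void coalesced(const float *in, float *out, int n)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < n) out[t] = in[t] * 2.0f;
}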

R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Effective GPGPU Programming Techniques: Parallel Loop-Swap

In the original OpenMP form (#pragma omp parallel for on the outer i loop), each thread walks a contiguous stretch of the array by itself, so at any time t the threads access widely separated memory locations – good intra-thread locality on a CPU, but uncoalesced on a GPU. Parallel loop-swap interchanges the loop nest and parallelizes the contiguous dimension with schedule(static, 1), so that at time t threads T0–T3 access consecutive elements of global memory and the accesses coalesce.
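A hedged before/after sketch of the transformation in C/OpenMP; the array names, bounds, and loop body are illustrative stand-ins – only the loop structure matters here.

#define N 1024
#define M 1024
float a[N][M], b[N][M];

void before(void)
{
    /* CPU-friendly form: the outer loop is parallel and each thread walks
       a contiguous row (intra-thread locality).  Translated naively to a
       GPU, the threads of a warp access addresses M elements apart. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        for (int k = 0; k < M; k++)
            a[i][k] = b[i][k] * 2.0f;
}

void after_loop_swap(void)
{
    /* Parallel loop-swap: the contiguous dimension k is the one that is
       parallelized, and schedule(static,1) gives consecutive iterations to
       consecutive threads, so at each step threads T0, T1, T2, T3, ...
       touch consecutive elements (coalesced when mapped to the GPU). */
    #pragma omp parallel for schedule(static, 1)
    for (int k = 0; k < M; k++)
        for (int i = 0; i < N; i++)
            a[i][k] = b[i][k] * 2.0f;
}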

R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Effective GPGPU Programming Techniques: Loop Collapsing

With #pragma omp parallel for on the outer loop alone, only as many threads as outer iterations are available, and each thread's inner-loop accesses are uncoalesced. Loop collapsing (#pragma omp for collapse(2) schedule(static, 1)) merges the nest into a single iteration space, so that threads T0–T7 each take one element of the combined space: more threads get work, and neighboring threads access neighboring elements.
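A hedged sketch of the same idea in C/OpenMP, with illustrative array names, bounds, and loop body.

#define N 8           /* few outer iterations -> too few threads by itself */
#define M 4096
float a[N][M];

void before(void)
{
    #pragma omp parallel for
    for (int i = 0; i < N; i++)          /* at most N parallel threads */
        for (int k = 0; k < M; k++)
            a[i][k] = (float)(i + k);
}

void after_collapse(void)
{
    /* The two loops form one N*M iteration space; schedule(static,1) hands
       consecutive (i,k) pairs to consecutive threads, which is what a
       one-thread-per-element GPU kernel does. */
    #pragma omp parallel
    #pragma omp for collapse(2) schedule(static, 1)
    for (int i = 0; i < N; i++)
        for (int k = 0; k < M; k++)
            a[i][k] = (float)(i + k);
}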

R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Effective GPGPU Programming Techniques: Loop Collapsing

The same idea applies to irregular loop nests, such as a sparse-matrix computation whose inner loop runs over the row pointers rptr. The translated GPU code consists of a collapsed loop, in which consecutive threads T0–T7 work on consecutive elements, followed by a reduction loop in which each thread with tid2 < n_rows combines the contributions of its row (for k = rptr[tid2] .. rptr[tid2+1]).
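One way to picture the collapsed one-thread-per-element structure is the CUDA sketch below. It is an illustrative simplification, not the code the research compiler generates: the arrays rowidx, col, val, x, and y are assumed names, rowidx[k] is assumed to hold the row of nonzero k (precomputed from rptr), and the per-row reduction is approximated with atomicAdd.

// CSR sparse matrix-vector product, one GPU thread per nonzero element.
__global__ void spmv_collapsed(int n_nonzeros,
                               const int *rowidx, const int *col,
                               const float *val, const float *x, float *y)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;   // one nonzero per thread
    if (k < n_nonzeros) {
        // consecutive threads read consecutive val[k]/col[k] -> coalesced
        atomicAdd(&y[rowidx[k]], val[k] * x[col[k]]);
    }
}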

R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Effective GPGPU Programming Techniques: Matrix-Transpose

With #pragma omp parallel for schedule(static, 1), threads T0–T3 each handle one value of i and walk the other dimension of A. If the threaded index is not the fastest-varying array dimension, the memory accesses at any time t are strided across global memory. Applying transpose(A)¹ before the loop makes the threaded index the contiguous one, so that at time t the threads access consecutive addresses in global memory.

¹ The OpenMP standard does not include a transpose directive.
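A hedged CUDA sketch of the effect of the data transpose (names and bounds are illustrative): in the first kernel each thread walks a row of A, so at each step the threads of a warp read addresses a full row apart; after the host (or the translator) stores the transpose, the same logical computation reads consecutive addresses at each step.

#define N 1024
#define M 1024

// Before: thread i walks row i of A.  At step k, a warp reads
// A[0][k], A[1][k], ... which are M floats apart (uncoalesced).
__global__ void kernel_rowwise(const float (*A)[M], float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        float s = 0.0f;
        for (int k = 0; k < M; k++)
            s += A[i][k];
        out[i] = s;
    }
}

// After: AT[k][i] = A[i][k] has been stored.  At step k the warp reads
// AT[k][0], AT[k][1], ... which are consecutive (coalesced).
__global__ void kernel_transposed(const float (*AT)[N], float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        float s = 0.0f;
        for (int k = 0; k < M; k++)
            s += AT[k][i];
        out[i] = s;
    }
}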

R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Effective GPGPU Programming Techniques: Managing Copy-in/out

1. Eliminate copy-out of data that are not live out 2. Leave data that is needed in future kernel invocations in the device memory – eliminate copy-out if not needed on the CPU side 3. Narrow the copy-in data range to the minimum needed 4. Copy-in early (overlap copy-in with execution of previous kernel) 5. Pipeline copy-in and execution within same kernel.
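A minimal host-side sketch of techniques 1 and 2, reusing the declarations (SIZE_2, ITER, BLOCK_SIZE, NUM_BLOCKS, a, b, main_kernel0/1) from the CUDA Jacobi slide above; the assumption that only a is live-out on the host is illustrative.

#include <cuda_runtime.h>

void jacobi_on_device(void)
{
    float *d_a, *d_b;
    size_t bytes = (size_t)SIZE_2 * SIZE_2 * sizeof(float);

    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);
    cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, bytes, cudaMemcpyHostToDevice);

    dim3 grid(NUM_BLOCKS, 1, 1), block(BLOCK_SIZE, 1, 1);
    for (int k = 0; k < ITER; k++) {
        /* technique 2: a and b stay resident in device memory between kernels;
           no per-iteration transfers */
        main_kernel0<<<grid, block>>>((float (*)[SIZE_2])d_a, (float (*)[SIZE_2])d_b);
        main_kernel1<<<grid, block>>>((float (*)[SIZE_2])d_a, (float (*)[SIZE_2])d_b);
    }

    /* technique 1: assuming only a is needed on the host afterwards,
       b is not copied back */
    cudaMemcpy(a, d_a, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
}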

Note that techniques 2,4,5 increase the demand on device memory 131 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Effective GPGPU Programming Techniques: Setting Thread Block Size

For a loop
  for i = 1, n
     … = a[i]
there can be at most n (logical) threads.

Choosing a thread block size: if each thread spends 3t on data access and t on computation, then at least 3 thread blocks are needed, in this example, to overcome the memory latency – while one block waits for its data, the other blocks can compute.

Note that different vendors use the terms thread and block differently 132 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Effective GPGPU Programming Techniques: Memory Management

 Accelerators may not provide virtual memory management  Blocking/stripmining may be needed to fit data within available memory space  Side benefit of blocking: overlapping copy-in/out and computation
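A hedged sketch of strip-mining a large array through limited device memory, double-buffered across two CUDA streams so that the transfer of one strip overlaps with the processing of another; process_strip, STRIP, and the launch configuration are illustrative.

#include <cuda_runtime.h>

#define STRIP (1 << 20)                 /* elements per strip (illustrative) */

__global__ void process_strip(float *d, int n)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < n) d[t] = d[t] * 2.0f + 1.0f;
}

void process_large_array(float *host, long total)
{
    /* note: for the copies to actually overlap with kernels, the host
       buffer should be pinned (cudaHostAlloc); omitted here */
    float *d_buf[2];
    cudaStream_t s[2];
    for (int i = 0; i < 2; i++) {
        cudaMalloc((void **)&d_buf[i], STRIP * sizeof(float));
        cudaStreamCreate(&s[i]);
    }

    for (long off = 0, strip = 0; off < total; off += STRIP, strip++) {
        int i = (int)(strip % 2);                       /* double buffering */
        int n = (int)((total - off < STRIP) ? (total - off) : STRIP);

        cudaMemcpyAsync(d_buf[i], host + off, n * sizeof(float),
                        cudaMemcpyHostToDevice, s[i]);
        process_strip<<<(n + 255) / 256, 256, 0, s[i]>>>(d_buf[i], n);
        cudaMemcpyAsync(host + off, d_buf[i], n * sizeof(float),
                        cudaMemcpyDeviceToHost, s[i]);
        /* while one stream copies its strip, the other can compute */
    }
    cudaDeviceSynchronize();
    for (int i = 0; i < 2; i++) { cudaStreamDestroy(s[i]); cudaFree(d_buf[i]); }
}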

133 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 SMP Programming Errors

 Shared memory parallel programming is a mixed bag: – It saves the programmer from having to map data onto multiple processors. In this sense, it’s much easier. – It opens up a range of new errors coming from unanticipated shared resource conflicts.

134 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 2 major SMP errors

 Race Conditions

 The outcome of a program depends on the detailed timing of the threads in the team.  Deadlock

 Threads lock up waiting on a locked resource that will never become free.  Livelock

 A termination condition is never reached

135 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Race Conditions

C$OMP PARALLEL SECTIONS
      A = B + C
C$OMP SECTION
      B = A + C
C$OMP SECTION
      C = B + A
C$OMP END PARALLEL SECTIONS

 The result varies unpredictably based on the detailed order of execution for each section.  Wrong answers produced without warning!

136 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Race Conditions: A complicated (silly?) solution

      ICOUNT = 0
C$OMP PARALLEL SECTIONS
      A = B + C
      ICOUNT = 1
C$OMP FLUSH ICOUNT
C$OMP SECTION
 1000 CONTINUE
C$OMP FLUSH ICOUNT
      IF (ICOUNT .LT. 1) GO TO 1000
      B = A + C
      ICOUNT = 2
C$OMP FLUSH ICOUNT
C$OMP SECTION
 2000 CONTINUE
C$OMP FLUSH ICOUNT
      IF (ICOUNT .LT. 2) GO TO 2000
      C = B + A
C$OMP END PARALLEL SECTIONS

 In this example, we choose the assignments to occur in the order A, B, C. – ICOUNT forces this order. – FLUSH so each thread sees updates to ICOUNT – NOTE: you need the flush on each read and each write. 137 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 A More Subtle Race Condition

C$OMP PARALLEL SHARED (X)
C$OMP&         PRIVATE(ID)

C$OMP PARALLEL DO REDUCTION(+:X)
      DO 100 I=1,100
         TMP = WORK(I)
         X = X + TMP
 100  CONTINUE
C$OMP END DO
C$OMP END PARALLEL

 The result varies unpredictably because access to the shared variable TMP is not protected.  Wrong answers produced without warning!  The user probably wanted to make TMP private.

138 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Avoiding Race Conditions

 Easiest solution: don’t access any variable that is written by another parallel thread. – this is always the case for fully parallel loops  More complicated: if data dependences are unavoidable, synchronize them properly. – be aware that this reduces performance  Avoid creating your own synchronization by “waiting for a flag set by the other thread”. – use the provided synchronization primitives instead  (Desirable) race conditions seen in real programs: – parallel shuffle – parallel search
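A small C/OpenMP sketch of the first two points (names are illustrative): the temporary is private because it is declared inside the loop, and the unavoidable update of the shared accumulator uses the provided reduction clause instead of a hand-rolled flag wait.

float avoid_races(const float *work, int n)
{
    float x = 0.0f;

    /* tmp is declared inside the loop body, so it is private to each
       thread; the cross-iteration update of the shared accumulator x is
       handled by the reduction clause (a critical or atomic construct
       would also be a provided, correct primitive). */
    #pragma omp parallel for reduction(+: x)
    for (int i = 0; i < n; i++) {
        float tmp = work[i] * work[i];
        x += tmp;
    }
    return x;
}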

139 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Deadlock

      CALL OMP_INIT_LOCK (LCKA)
C$OMP PARALLEL SECTIONS
C$OMP SECTION
      CALL OMP_SET_LOCK(LCKA)
      IVAL = DOWORK()
      IF (IVAL .GT. TOL) THEN
         CALL ERROR (IVAL)
      ELSE
         CALL OMP_UNSET_LOCK (LCKA)
      ENDIF
C$OMP SECTION
      CALL OMP_SET_LOCK(LCKA)
      CALL USE_B_and_A (RES)
      CALL OMP_UNSET_LOCK(LCKA)
C$OMP END PARALLEL SECTIONS

 If A is locked in the first section and the if statement branches around the unset lock, threads running the other sections deadlock waiting for the lock to be released.  Make sure you release your locks. Always, even in error situations! 140 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Livelock

C$OMP PARALLEL PRIVATE(ID)
      ID = OMP_GET_THREAD_NUM()
      N = OMP_GET_NUM_THREADS()
 1000 CONTINUE
      PHASES(ID) = UPDATE(U, ID)
C$OMP SINGLE
      RES = MATCH (PHASES, N)
C$OMP END SINGLE
      IF (RES*RES .LT. TOL) GO TO 2000
      GO TO 1000
 2000 CONTINUE
C$OMP END PARALLEL

 This shows a race condition and a livelock.  If the square of RES is never smaller than TOL, the program spins endlessly in livelock.

R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Livelock

      ICOUNT = 0
C$OMP PARALLEL PRIVATE (ID)
      ID = OMP_GET_THREAD_NUM()
      N = OMP_GET_NUM_THREADS()
 1000 CONTINUE
      PHASES(ID) = UPDATE(U, ID)
C$OMP BARRIER
C$OMP SINGLE
      RES = MATCH (PHASES, N)
      ICOUNT = ICOUNT + 1
C$OMP END SINGLE
      IF (RES*RES .LT. TOL) GO TO 2000
      IF (ICOUNT .GT. MAX) GO TO 2000
      GO TO 1000
 2000 CONTINUE
C$OMP END PARALLEL

 Solution: – Fix the race with a barrier before the single. This may fix the MATCH operation and fix the livelock error. – Decide on a maximum number of iterations, and use a loop with that number rather than an infinite loop.

R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 SMP (OpenMP) Error Advice

 Are you using threadsafe libraries?  I/O inside a parallel region can interleave unpredictably.  Make sure you understand what your constructors are doing with private objects.  Watch for private variables masking globals.  Understand when shared memory is coherent. When in doubt, use FLUSH.  NOWAIT removes implied barriers. 143 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Navigating through the Danger Zones

 Option 1: Analyze your code to make sure every semantically permitted interleaving of the threads yields the correct results. – This can be prohibitively difficult due to the explosion of possible interleavings. – Tools like Intel’s Thread Checker (Assure) can help.

144 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Navigating through the Danger Zones

 Option 2: Write SMP code that is portable and equivalent to the sequential form. – Use a safe subset of OpenMP. – Follow a set of “rules” for Sequential Equivalence.

145 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Portable Sequential Equivalence

 What is Portable Sequential Equivalence (PSE)?

 A program is sequentially equivalent if its results are the same with one thread and many threads.

 For a program to be portable (i.e. runs the same on different platforms/compilers) it must execute identically when the OpenMP constructs are used or ignored.

146 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Portable Sequential Equivalence

 Advantages of PSE

 A PSE program can run on a wide range of hardware and with different compilers - minimizes software development costs.

 A PSE program can be tested and debugged in serial mode with off-the-shelf tools - even if they don’t support OpenMP.

147 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 2 Forms of Sequential Equivalence

 Two forms of Sequential equivalence based on what you mean by the phrase “equivalent to the single threaded execution”:

 Strong SE: bitwise identical results.

 Weak SE: equivalent mathematically but due to quirks of floating point arithmetic, not bitwise identical.

148 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Strong Sequential Equivalence: Rules – Control data scope with the base language

 Avoid the data scope clauses, except…

 Only use private for scratch variables local to a block (e.g. temporaries or loop control variables) whose global initialization doesn’t matter. – Locate all cases where a shared variable written by one thread is accessed (read or written) by another thread.

 All accesses to the variable must be protected.

 If multiple threads combine results into a single value, enforce sequential order.

 Do not use the reduction clause. 149 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Strong Sequential Equivalence: example

C$OMP PARALLEL PRIVATE(I, TMP)
C$OMP DO ORDERED
      DO 100 I=1,NDIM
         TMP = ALG_KERNEL(I)
C$OMP ORDERED
         CALL COMBINE (TMP, RES)
C$OMP END ORDERED
 100  CONTINUE
C$OMP END PARALLEL

 Everything is shared except I and TMP. These can be private since they are not initialized and they are unused outside the loop.  The summation into RES occurs in the sequential order, so the result from the program is bitwise compatible with the sequential program.  Problem: can be inefficient if threads finish in an order that’s greatly different from the sequential order. 150

R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Weak Sequential equivalence

 For weak sequential equivalence only mathematically valid constraints are enforced.

 Computer floating-point arithmetic is not associative, so reordering the operations of a reduction can change the result.

 In most cases, no particular grouping of floating point operations is mathematically preferred so why take a performance hit by forcing the sequential order? – In most cases, if you need a particular grouping of floating point operations, you have a bad algorithm.  How do you write a program that is portable and satisfies weak sequential equivalence? – Follow the same rules as the strong case, but relax sequential ordering constraints. 151 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Weak equivalence: example

C$OMP PARALLEL PRIVATE(I, TMP)
C$OMP DO
      DO 100 I=1,NDIM
         TMP = ALG_KERNEL(I)
C$OMP CRITICAL
         CALL COMBINE (TMP, RES)
C$OMP END CRITICAL
 100  CONTINUE
C$OMP END PARALLEL

 The summation into RES occurs one thread at a time, but in any order, so the result is not bitwise compatible with the sequential program.  Much more efficient, but some users get upset when low-order bits vary between program runs.

152 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Sequential Equivalence isn’t a Silver Bullet

C$OMP PARALLEL
C$OMP& PRIVATE(I, ID, TMP, RVAL)
      ID = OMP_GET_THREAD_NUM()
      N = OMP_GET_NUM_THREADS()
      RVAL = RAND ( ID )
C$OMP DO
      DO 100 I=1,NDIM
         RVAL = RAND (RVAL)
         TMP = RAND_ALG_KERNEL(RVAL)
C$OMP CRITICAL
         CALL COMBINE (TMP, RES)
C$OMP END CRITICAL
 100  CONTINUE
C$OMP END PARALLEL

 This program follows the weak PSE rules, but it’s still wrong.  In this example, RAND() may not be thread safe. Even if it is, the pseudo-random sequences might overlap, thereby throwing off the basic statistics.

153 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Map Reduce

 Wikipedia: MapReduce is – a programming model for processing large data sets, and – the name of an implementation of the model by Google. MapReduce is typically used to do distributed computing on clusters of computers.

MapReduce libraries have been written in many programming languages. A popular, free implementation is Apache Hadoop.

154 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Why MapReduce?

“MapReduce provides regular programmers the ability to produce parallel distributed programs much more easily, by requiring them to write only the simpler Map() and Reduce() functions, which focus on the logic of the specific problem at hand”

However, you need a good understanding of how map, reduce, and the overall system interact.

Many problems can be expressed as such a two-step algorithm 155 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Map step “The master node takes the input, divides it into smaller sub- problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem, and passes the answer back to its master node.” – Map is performed fully parallel on each subproblem

156 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Reduce step

“ The master node then collects the answers to all the sub- problems and combines them in some way to form the output – the answer to the problem it was originally trying to solve.” – Reduce is not fully parallel, but may be expressed as a sequence of parallel combine steps.

157 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 A Simple Example: Parallel Reduction Expressed as MapReduce  Problem: sum all n elements of an array  Master: splits the array into #processor parts; calls map for each processor  Map: sums all elements of the assigned array part; returns the partial sum  Reduce: receives the partial sums; returns their sum
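A minimal shared-memory sketch of this split in C with OpenMP, standing in for a real MapReduce system: the "master" is the splitting loop, map_sum handles one part, and reduce_sum combines the partial sums. All names are illustrative.

#include <stdlib.h>

/* map: sum the assigned part of the array */
static double map_sum(const double *part, long len)
{
    double s = 0.0;
    for (long i = 0; i < len; i++) s += part[i];
    return s;
}

/* reduce: combine the partial sums */
static double reduce_sum(const double *partial, int parts)
{
    double s = 0.0;
    for (int p = 0; p < parts; p++) s += partial[p];
    return s;
}

double mapreduce_sum(const double *a, long n, int parts)
{
    double *partial = malloc(parts * sizeof(double));
    long chunk = (n + parts - 1) / parts;

    /* master: split the array and run map on each part in parallel */
    #pragma omp parallel for
    for (int p = 0; p < parts; p++) {
        long lo = p * chunk;
        long len = (lo + chunk > n) ? (n - lo) : chunk;
        partial[p] = (len > 0) ? map_sum(a + lo, len) : 0.0;
    }

    double result = reduce_sum(partial, parts);
    free(partial);
    return result;
}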

Processors can assume the role of Master (e.g. if the assigned part is larger than a threshold) and engage additional processors, in turn. => leads to a combining tree for the summation. 158 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 A 5-step Process

1. Prepare the Map() input – the "MapReduce system" designates Map processors, assigns the K1 input key value each processor would work on, and provides that processor with all the input data associated with that key value. 2. Run the user-provided Map() code – Map() is run exactly once for each K1 key value, generating output organized by key values K2. MAP 3. "Shuffle" the Map output to the Reduce processors – the MapReduce system designates Reduce processors, assigns the K2 key value each processor would work on, and provides that processor with all the Map- generated data associated with that key value. 4. Run the user-provided Reduce() code – Reduce() is run exactly once for each K2 key value produced by the Map step. REDUCE 5. Produce the final output – the MapReduce system collects all the Reduce output, and sorts it by K2 to produce the final outcome.

159 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 A More Advanced Example: Word Count

function Map(String name, String document):
  count each word in the document
  return the map < word, #occurrences >

function Reduce(String word, List #occurrences):
  sum = Σ #occurrences
  return < word, sum >

• Map returns multiple results, each annotated with a key. In our example, the key identifies a specific word. • The shuffle function combines the values (#occurrences) for each key returned by a map function into a list and calls Reduce(key, list) • shuffle can take a substantial amount of time. It is implemented by the system; the user only needs to write Map and Reduce.

160 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Other MapReduce Problems/Applications

 searching  sorting  web link-graph reversal  web access log stats  inverted index construction  document clustering  machine learning  statistical machine translation

MapReduce has also been demonstrated on some linear algebra problems, such as matrix multiply “Basically wherever you see linear algebra (matrix/vector operations) you can apply Map Reduce” 161 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 OTHER PROGRAMMING MODELS, LANGUAGES, CONCEPTS

162 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Data Flow Parallelism is derived from the availability of data in a data flow graph.  Joule (1996): – Concurrent dataflow programming language for building distributed applications. The order of statements within a block is irrelevant to the operation of the block. Statements are executed whenever possible, based on their inputs. Everything in Joule happens by sending messages. There is no control flow. Instead, the programmer describes the flow of data, making it a data flow programming language.  SISAL (1993): – Uses and implements single-assignment concepts. Produces a data flow graph. (Single assignment removes anti and output dependences, which exposes maximum parallelism)  StarSs (2009): – Hierarchical task-based programming with StarSs, Badia, Ayguade, Labarta (2009) 163 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Global Address Space Languages (GAS)

 UPC, co-array Fortran, HPF  Languages for distributed-memory machines  The user sees a shared address space  There are constructs for data distribution. They can be explicit, user-provided data- distribution directives or array dimensions that indicate the processor ID.  Compilers translate the GAS program into a message-passing form. 164 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Non-Blocking Operations

 Fundamental mechanism for overlapping communication and (multiple) operations.

a := b
…
synch()

The computation of the value b and its transfer to a is initiated at the assignment and completed at the synch() point; it overlaps with the execution of the “…” statements.
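The same idea appears in message passing as non-blocking sends and receives. A minimal MPI sketch (buffer names and the helper routines are illustrative): the transfer is initiated, independent work overlaps with it, and MPI_Waitall plays the role of synch().

#include <mpi.h>

void do_independent_work(void);          /* assumed application routines */
void use_data(double *buf, int n);

void exchange_and_compute(double *sendbuf, double *recvbuf, int n,
                          int partner, MPI_Comm comm)
{
    MPI_Request req[2];

    /* initiate the transfer (corresponds to "a := b" above) */
    MPI_Irecv(recvbuf, n, MPI_DOUBLE, partner, 0, comm, &req[0]);
    MPI_Isend(sendbuf, n, MPI_DOUBLE, partner, 0, comm, &req[1]);

    do_independent_work();               /* overlaps with the communication */

    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);   /* corresponds to synch() */
    use_data(recvbuf, n);                /* recvbuf is valid only after the wait */
}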

 If b is a simple value, this is essentially the same as prefetch  If b is an operation, this involves the spawning of a parallel task 165 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 CSP – Communicating Sequential Processes (Hoare, 1978)

 one of the earliest concepts for expressing parallel activities  center is the process  synchronization and communication constructs are very important – semaphores, monitors, locks, etc.  OCCAM/Transputer

166 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 BSP – Bulk Synchronous Parallelism

 Structuring a parallel computation into phases of full parallelism followed by communication phases. (Timeline: compute – communicate – compute – communicate – …)
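A hedged C/MPI sketch of the superstep structure: a fully parallel local compute phase followed by a communication phase (here MPI_Allgather), which also provides the synchronization separating one superstep from the next. compute_local and the buffer layout are illustrative.

#include <mpi.h>

void compute_local(double *local, int n, const double *global);  /* assumed */

void bsp_style(double *local, int local_n, double *global, int steps,
               MPI_Comm comm)
{
    for (int s = 0; s < steps; s++) {
        /* compute phase: purely local work, fully parallel across processes */
        compute_local(local, local_n, global);

        /* communicate phase: exchange results; the collective also acts as
           the barrier that separates this superstep from the next */
        MPI_Allgather(local, local_n, MPI_DOUBLE,
                      global, local_n, MPI_DOUBLE, comm);
    }
}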

167 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Parallel models differ in the primary concerns they require the user to deal with

 data flow: focus on the data needed and produced by activities  control flow: how control flows and transfers – focus on starting and ending parallel activities  control transfer: focus on how CPU resources are assigned to activities  messaging: focus on communication (and implied synchronization) between parallel activities

168 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 High-level Problem Classes

 Fixed problem to be solved in shortest possible time. – HPC applications  On-demand services to be made available on a continual basis. – Operating systems, internet services  Continuous data streams to be processed. – Image processing

169 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Wikipedia: Concurrent and parallel programming languages

 Actor Model  Coordination Languages  CSP-based  Dataflow  Distributed  Event-driven and hardware description  Functional  GPU languages  Logic programming  Multi-threaded  Object oriented  PGAS  Unsorted 170 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Wikipedia: Parallel programming models

 Process interaction  Shared memory  Message passing  Implicit parallelism  Problem decomposition  Task parallelism  Data parallelism

171 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013