Lecture 18: Concurrency

COMP 524 Programming Language Concepts
Stephen Olivier
April 14, 2009

Based on slides by A. Block, notes by N. Fisher, F. Hernandez-Campos, and D. Stotts

Why Allow for Concurrency?

•Handle multiple events (web server request handling)
•Allow for more effective utilization of physical devices
•Allow for multiple processes to run on multiple processors

[Diagram: three hardware platforms for concurrency: a multicore chip, a multiprocessor system, and a cluster]

[Diagram: the same three platforms; the multicore chip and the multiprocessor system are both called shared-memory multiprocessors]

Race conditions

•A race condition occurs whenever there is a detrimental way to interleave multiple segments of code.
•Race conditions are among the hardest things to get right in a concurrent system.
 • Languages and libraries offering guarantees of freedom from race conditions have been the subject of much research
 • Hard to do with good performance

Two threads each insert a new node (A and B) at the head of a shared linked list:

Thread 1: read head; A->next := head; head := A;
Thread 2: read head; B->next := head; head := B;

Possible interleaving of the routines above:

A safe interleaving (the two routines run one after the other):
  read head; A->next := head; head := A;
  read head; B->next := head; head := B;

A racy interleaving (both threads read the same head, so node A is lost when head := B overwrites head := A):
  read head;
  read head;
  A->next := head;
  B->next := head;
  head := A;
  head := B;
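
A minimal sketch of how this race can appear in practice, in C with POSIX threads; the node type, the push function, and the deliberate lack of locking are illustrative assumptions, not code from the lecture.

#include <pthread.h>
#include <stdlib.h>
#include <stdio.h>

/* Hypothetical singly linked list node, for illustration only. */
struct node {
    int value;
    struct node *next;
};

struct node *head = NULL;   /* shared list head */

/* Unsynchronized push: the read of head and the write of head can
   interleave with another thread's push, losing one of the nodes. */
void *push(void *arg) {
    struct node *n = malloc(sizeof *n);
    n->value = (int)(long)arg;
    n->next = head;          /* read head */
    head = n;                /* write head: races with other pushes */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, push, (void *)1L);
    pthread_create(&t2, NULL, push, (void *)2L);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* With the race, the list may contain only one of the two nodes. */
    for (struct node *p = head; p != NULL; p = p->next)
        printf("%d\n", p->value);
    return 0;
}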

Cache coherency

•Cache coherency is a problem when multiple processors each hold a local copy of the same data in their caches and one of the copies changes. (Usually solved in hardware)

[Diagram: processor caches connected to memory over a bus; one cache has updated its copy to X=4 while another cache and memory still hold the stale value X=3]

Parallel Execution

•Heavyweight processes have their own memory space.
•Lightweight processes share memory space.
•An execution context in a concurrent system is typically called a thread.

Creating multiple threads

•Before Java and C#, parallel code consisted of annotated Fortran or C/C++ with library calls.
 • e.g., OpenMP, MPI
 • Still widely used, especially in scientific & industrial apps
•On Unix, the threading library for C/C++ is POSIX threads (pthreads).
 • Other frameworks are built on top of pthreads
•Microsoft has a similar threading package for Windows

Communication

•Reads and writes into shared memory space
 • Available natively in shared memory systems
 • Supported in some cluster interconnect technologies
•Messages between threads/processes (see the message-passing sketch below)
 • Supported for both shared memory and clusters
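
A minimal sketch of message passing between two processes in C using MPI (mentioned earlier); the message value, tag, and launch command are illustrative assumptions.

#include <stdio.h>
#include <mpi.h>

/* Assumed launch command: mpirun -np 2 ./a.out */
int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* Process 0 sends one int to process 1 with tag 0. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Process 1 blocks until the message arrives. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("process 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}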

Synchronization

•Allows ordering of operations among threads
 • Often needed for program correctness
•May be explicit (by the programmer) or implicit (by the threading library, to support higher-level abstractions such as parallel loops)
•This is the primary subject of Section 12.3 (read)
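
A minimal sketch of explicit synchronization in C with a pthreads mutex, fixing the list-insertion race from the earlier example; the node type and function names are the same illustrative assumptions as before.

#include <pthread.h>
#include <stdlib.h>

struct node { int value; struct node *next; };

struct node *head = NULL;
pthread_mutex_t head_lock = PTHREAD_MUTEX_INITIALIZER;

/* Synchronized push: the lock orders the read-modify-write of head,
   so concurrent pushes can no longer lose a node. */
void *push(void *arg) {
    struct node *n = malloc(sizeof *n);
    n->value = (int)(long)arg;
    pthread_mutex_lock(&head_lock);
    n->next = head;
    head = n;
    pthread_mutex_unlock(&head_lock);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, push, (void *)1L);
    pthread_create(&t2, NULL, push, (void *)2L);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* Both nodes are now reliably on the list. */
    return 0;
}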

Remote Procedure Calls

•Remote Procedure Calls (RPC) are used to communicate between a client and a server
•The client calls a local stub, which packages the parameters, then sends them to the server and waits for a response
•Discussed in greater detail in Section 12.4.4 (read)

Six ways to create threads

•co-begin
•parallel loops
•launch-at-elaboration
•fork
•implicit receipt
•early reply

co-begin

•Multiple commands can be executed at the same time

par begin
  a := 3,
  begin c := 4; c := c + 1 end,
  b := 4
end

parallel loops

•A loop in which iterations execute in parallel

co (i := 5 to 10) -> p(a, b, i) oc
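
A comparable parallel loop in C with OpenMP (one of the libraries mentioned earlier); the array and the loop body standing in for p(a, b, i) are illustrative assumptions.

#include <stdio.h>
#include <omp.h>

int main(void) {
    double a[6];

    /* Each iteration may run on a different thread; the iterations must
       be independent for the loop to be safe to parallelize. */
    #pragma omp parallel for
    for (int i = 5; i <= 10; i++) {
        a[i - 5] = 2.0 * i;   /* stand-in for the call p(a, b, i) */
    }

    for (int i = 0; i < 6; i++)
        printf("%f\n", a[i]);
    return 0;
}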

Launch-at-elaboration

•New threads are created when the method is entered and destroyed by the end of the method.

procedure P is
  task T is
    ...
  end T;
begin  -- P
  ...
end P;

Fork/Join

•Threads are created by a call to fork and destroyed by a call to join
 • Allows more general parallelism than some other models
•In Java 5, "forking" is supported by submitting tasks that implement the Runnable or Callable interface to an executor

A possible execution using co-begin, parallel loops, or launch-at-elaboration

A possible execution using explicit fork-join

Fork-join example: Fibonacci in Cilk

•Cilk extends C to support fork-join parallelism with spawn & sync

cilk int fib(int n) {
    if (n < 2)
        return n;
    else {
        int x, y;
        x = spawn fib(n - 1);
        y = spawn fib(n - 2);
        sync;
        return x + y;
    }
}

Implicit receipt

•Implicit receipt is similar to a fork, except that it causes a new thread to be created in another memory space.
 • Typical model for RPC

Early reply

•Early reply allows a thread to return a value but continue executing.
 • e.g., do some work and return the result to the parent thread, then update some logs

Blocked and runnable

•At any given time a thread is either blocked or runnable.
•A thread is blocked if it is "waiting" for a resource.
•Threads that are runnable but not running are enqueued on the ready list.

Preemption

•It is possible for a thread to be "preempted" by another thread
 • e.g., preemptive scheduling driven by timer interrupts

Yielding

•It is possible for a thread to yield:
 • Suspend execution and allow another thread to execute
 • The thread's state changes from runnable to blocked
•This can cause race conditions
 • Particularly in combination with preemption
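
A minimal sketch of an explicit yield in C using the POSIX sched_yield call; the spin-wait loop and the flag variable are illustrative assumptions.

#include <sched.h>
#include <stdatomic.h>

atomic_int ready = 0;   /* assumed to be set by some other thread */

/* Spin until another thread sets the flag, yielding the processor on
   each failed check instead of burning the rest of the time slice. */
void wait_for_ready(void) {
    while (!atomic_load(&ready)) {
        sched_yield();   /* let another runnable thread execute */
    }
}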

Throughput-Oriented Systems

•Want to handle events as quickly as possible
 • e.g., requests to a web server
•Limited communication and synchronization required between threads
 • Concurrent data access issues handled by the database system
•Swap threads out while they wait for memory accesses and remote communication
 • Sun Niagara built to support many lightweight threads

Compute-Oriented Systems

•One large program runs on many processors (shared memory and/or a cluster)
 • Typically one thread per processor
•Scientific apps such as climate simulation
•Sometimes require significant communication and synchronization between threads
 • Minimizing communication is typically key to performance

Data Parallel Programming

•SIMD (Single Instruction Multiple Data)
 • Same instructions performed on multiple data items simultaneously
 • Developed in early Cray vector supercomputers
 • Now built into mainstream processors
 • e.g., short-vector operations in MMX, SSE, AltiVec, 3DNow!
•SPMD (Single Program Multiple Data)
 • Same program replicated onto multiple threads, each operating on different data, usually based on its thread ID number (see the sketch below)
 • e.g., parallel loops and regions in OpenMP
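
A minimal SPMD-style sketch in C with OpenMP, where every thread runs the same region but selects its own block of data from its thread ID; the array size and block partitioning are illustrative assumptions.

#include <stdio.h>
#include <omp.h>

#define N 1000

int main(void) {
    static double a[N];

    /* Every thread runs the same code (same program), but uses its
       thread ID to pick a different block of the data. */
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        int chunk = (N + nthreads - 1) / nthreads;
        int lo = id * chunk;
        int hi = lo + chunk < N ? lo + chunk : N;

        for (int i = lo; i < hi; i++)
            a[i] = 2.0 * i;
    }

    printf("a[N-1] = %f\n", a[N - 1]);
    return 0;
}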

Task Parallel Programming

•Divide work into a hierarchy of tasks
 • Newly spawned tasks may be moved to idle threads
 • Particularly useful for divide-and-conquer algorithms
•The Cilk example given earlier uses this model
•New programming frameworks designed to promote parallel programming for multicore
 • Intel Threading Building Blocks and Ct
 • Microsoft Task Parallel Library

Programming Language Issues

•Concurrency support in the language or in libraries?
•Do programmers use threads explicitly (e.g., pthreads) or implicitly (e.g., a simple forall loop, Cilk spawn-sync)?
•Can the compiler provide auto-parallelization (e.g., Intel compiler auto-vectorization for SSE)?
•Can the programmer specify which data is global vs. local, or where specific data resides?
•How is synchronization supported?

Performance Issues

•Amdahl’s Law
 • The speedup of a parallel program is limited by the time needed for the sequential fraction of the program (see the formula below)
•Load imbalance
 • Uneven distribution of work among processors
•Communication and Memory Operations
 • Latency: time delay to access a resource
 • Bandwidth: amount of data transferable per unit time
 • Contention: many threads want to access the same resource
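
A standard statement of Amdahl’s Law (a textbook formulation, not copied from the slides), with s the sequential fraction of the program and N the number of processors:

\[ \mathrm{Speedup}(N) \;\le\; \frac{1}{s + (1 - s)/N} \]

For example, with s = 0.1 (10% sequential), the speedup is bounded by 1/(0.1 + 0.9/N), which approaches 10 no matter how many processors are added.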
