TDDA69 Data and Program Structure
Concurrent Computing
Cyrille Berger

List of lectures
1. Introduction and Functional Programming
2. Imperative Programming and Data Structures
3. Parsing
4. Evaluation
5. Object Oriented Programming
6. Macros and decorators
7. Virtual Machines and Bytecode
8. Garbage Collection and Native Code
9. Concurrent Computing
10. Declarative Programming
11. Logic Programming
12. Summary
Lecture content
- Parallel Programming
- Multithreaded Programming
- The States Problems and Solutions
- Atomic actions
- Language and Interpreter Design Considerations
- Single Instruction, Multiple Threads Programming
- Distributed programming
- Message Passing
- MapReduce

Concurrent computing
- In concurrent computing, several computations are executed at the same time
- In parallel computing, all computation units have access to shared memory (for instance in a single process)
- In distributed computing, computation units communicate through message passing
Benefits of concurrent computing
- Faster computation
- Responsiveness: interactive applications can perform two tasks at the same time (rendering, spell checking...)
- Availability of services: load balancing between servers
- Controllability: tasks can be suspended, resumed and stopped

Disadvantages of concurrent computing
- Concurrency is hard to implement properly
- Safety: it is easy to corrupt memory
- Deadlock: tasks can wait indefinitely for each other
- Non-determinism
- Not always faster! Memory bandwidth and CPU cache are limited
Concurrent computing programming

Four basic approaches to computing:
- Sequential programming: no concurrency
- Declarative concurrency: streams in a functional language
- Message passing: with active objects, used in distributed computing
- Atomic actions: on a shared memory, used in parallel computing

Stream Programming in Functional Programming
- No global state
- Functions only act on their input; they are reentrant
- Functions can then be executed in parallel, as long as they do not depend on the output of another function
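This stream idea can be sketched in runnable Python: because a pure function's result depends only on its arguments, independent calls can be handed to a pool of workers in any order. The `square` function below is a made-up illustration, not from the lecture:

```python
from concurrent.futures import ThreadPoolExecutor

def square(x):
    # Pure function: no global state, the result depends only on the
    # input, so independent calls can safely run in parallel.
    return x * x

with ThreadPoolExecutor(max_workers=4) as pool:
    # map() farms the independent calls out to worker threads;
    # results come back in input order regardless of execution order.
    results = list(pool.map(square, range(8)))
```

Because no call depends on another's output, the result is deterministic no matter how the calls are scheduled.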
Parallel Programming

In parallel computing, several computations are executed at the same time and have access to shared memory.

[Diagram: several computation units connected to a single shared memory]
SIMD, SIMT, SMT (1/2)
- SIMD: Single Instruction, Multiple Data
  Elements of a short vector (4 to 8 elements) are processed in parallel
- SIMT: Single Instruction, Multiple Threads
  The same instruction is executed by multiple threads (from 128 to 3048 or more in the future)
- SMT: Simultaneous Multithreading
  General purpose; different instructions are executed by different threads

SIMD, SIMT, SMT (2/2)

SIMD:

    PUSH [1, 2, 3, 4]
    PUSH [4, 5, 6, 7]
    VEC_ADD_4

SIMT:

    execute([1, 2, 3, 4], [4, 5, 6, 7],
            lambda a, b, ti: a[ti] = a[ti] + max(b[ti], 5))

SMT:

    a = [1, 2, 3, 4]
    b = [4, 5, 6, 7]
    ...
    Thread.new(lambda: a = a + b)
    Thread.new(lambda: c = c * b)
Why the need for the different models?
- Flexibility: SMT > SIMT > SIMD
- Performance: SIMD > SIMT > SMT
- Less flexibility gives higher performance, unless the lack of flexibility prevents accomplishing the task

Multithreaded Programming
Singlethreaded vs Multithreaded

[Diagram: a single-threaded program runs only "main"; a multithreaded program forks "main" into sub 0 ... sub n and joins them back]

Multithreaded Programming Model
- Start with a single root thread (main)
- Fork: create concurrently executing sub-threads
- Join: synchronize threads
- Threads communicate through shared memory
- Threads execute asynchronously; they may or may not execute on different processors
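The fork/join model above can be sketched with Python's standard `threading` module; the `computation` function and the thread names are made up for illustration:

```python
import threading

finished = []

def computation(name):
    # ... do some computation ...
    finished.append(name)   # list.append is atomic in CPython

# Fork: create concurrently executing sub-threads
thread1 = threading.Thread(target=computation, args=("sub 0",))
thread2 = threading.Thread(target=computation, args=("sub 1",))
thread1.start()
thread2.start()

# Join: synchronize; the main thread waits for both sub-threads
thread1.join()
thread2.join()
# Both computations are guaranteed to have finished here.
```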
A multithreaded example

    thread1 = new Thread(function() { /* do some computation */ });
    thread2 = new Thread(function() { /* do some computation */ });
    thread1.start();
    thread2.start();
    thread1.join();
    thread2.join();

The States Problems and Solutions
Global States and multi-threading

Example:

    var a = 0;
    thread1 = new Thread(function() { a = a + 1; });
    thread2 = new Thread(function() { a = a + 1; });
    thread1.start();
    thread2.start();

What is the value of a? This is called a (data) race condition.

Atomic actions
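This race is easy to reproduce in real Python, because `a = a + 1` is a read-modify-write rather than one atomic step. In the sketch below the `sleep` is artificial, added only to widen the window in which the threads can interleave:

```python
import threading
import time

a = 0

def unsafe_increment():
    global a
    # Read-modify-write is NOT atomic: the other thread can run
    # between the read and the write, losing an update.
    current = a
    time.sleep(0.05)
    a = current + 1

t1 = threading.Thread(target=unsafe_increment)
t2 = threading.Thread(target=unsafe_increment)
t1.start()
t2.start()
t1.join()
t2.join()
# a is usually 1 here (one update was lost), but may be 2:
# the result depends on the thread interleaving.
```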
Atomic operations
- An operation is said to be atomic if it appears to happen instantaneously
- Examples: read/write, swap, fetch-and-add...
- test-and-set: set a value and return the old one
- To implement a lock: while (test-and-set(lock, 1) == 1) {}
- To unlock: lock = 0

Mutex
- Mutex is short for mutual exclusion
- A technique to prevent two threads from accessing a shared resource at the same time

Example:

    var a = 0;
    var m = new Mutex();
    thread1 = new Thread(function() {
      m.lock();
      a = a + 1;
      m.unlock();
    });
    thread2 = new Thread(function() {
      m.lock();
      a = a + 1;
      m.unlock();
    });
    thread1.start();
    thread2.start();

Now a = 2.
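The same counter can be protected with Python's `threading.Lock`, its mutex type. This is a sketch of the pattern above, keeping the artificial delay from the race demonstration to show that no update is lost once the critical section is locked:

```python
import threading
import time

a = 0
m = threading.Lock()   # Python's mutex

def safe_increment():
    global a
    with m:                # equivalent to m.lock() ... m.unlock()
        current = a
        time.sleep(0.05)   # even with a forced delay, no update is lost
        a = current + 1

t1 = threading.Thread(target=safe_increment)
t2 = threading.Thread(target=safe_increment)
t1.start()
t2.start()
t1.join()
t2.join()
# Now a == 2, deterministically.
```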
Dependency

Example:

    var a = 1;
    var m = new Mutex();
    thread1 = new Thread(function() {
      m.lock();
      a = a + 1;
      m.unlock();
    });
    thread2 = new Thread(function() {
      m.lock();
      a = a * 3;
      m.unlock();
    });
    thread1.start();
    thread2.start();

What is the value of a? 4 or 6?

Condition variable
- A condition variable is a set of threads waiting for a certain condition

Example:

    var a = 1;
    var m = new Mutex();
    var cv = new ConditionVariable();
    thread1 = new Thread(function() {
      m.lock();
      a = a + 1;
      cv.notify();
      m.unlock();
    });
    thread2 = new Thread(function() {
      cv.wait();
      m.lock();
      a = a * 3;
      m.unlock();
    });
    thread1.start();
    thread2.start();

Now a = 6.
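The same scheme in runnable Python with `threading.Condition`. One detail this sketch adds: a `notify()` delivered before anyone waits is lost, so the waiter re-checks a flag in a loop (the `incremented` flag is our own addition, not from the slide):

```python
import threading

a = 1
cv = threading.Condition()   # bundles a mutex with a set of waiting threads
incremented = False

def incrementer():
    global a, incremented
    with cv:
        a = a + 1
        incremented = True
        cv.notify()          # wake a waiting thread

def tripler():
    global a
    with cv:
        while not incremented:   # re-check: a notify sent before wait() is lost
            cv.wait()            # releases the lock while waiting
        a = a * 3

t2 = threading.Thread(target=tripler)
t1 = threading.Thread(target=incrementer)
t2.start()
t1.start()
t1.join()
t2.join()
# The wait/notify ordering forces a = (1 + 1) * 3 = 6.
```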
Deadlock

What might happen:

    var a = 0;
    var b = 2;
    var ma = new Mutex();
    var mb = new Mutex();
    thread1 = new Thread(function() {
      ma.lock();
      mb.lock();
      b = b - 1;
      a = a - 1;
      ma.unlock();
      mb.unlock();
    });
    thread2 = new Thread(function() {
      mb.lock();
      ma.lock();
      b = b - 1;
      a = a + b;
      mb.unlock();
      ma.unlock();
    });
    thread1.start();
    thread2.start();

thread1 waits for mb, thread2 waits for ma.

Advantages of atomic actions
- Very efficient
- Less overhead, faster than message passing
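The deadlock above (thread1 holds ma and waits for mb, thread2 the reverse) has a standard cure: a global lock order. If every thread acquires `ma` before `mb`, the circular wait can never form. A Python sketch of the corrected program:

```python
import threading

a = 0
b = 2
ma = threading.Lock()
mb = threading.Lock()

def worker1():
    global a, b
    with ma:        # both workers acquire ma first, then mb:
        with mb:    # one global lock order makes a wait cycle impossible
            b = b - 1
            a = a - 1

def worker2():
    global a, b
    with ma:        # same order as worker1 (not mb then ma)
        with mb:
            b = b - 1
            a = a + b

t1 = threading.Thread(target=worker1)
t2 = threading.Thread(target=worker2)
t1.start()
t2.start()
t1.join()
t2.join()
# The program always terminates; b is always 0, while a still
# depends on which worker ran first.
```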
Disadvantages of atomic actions
- Blocking: some threads have to wait
- Small overhead
- Deadlock
- A low-priority thread can block a high-priority thread
- A common source of programming errors

Language and Interpreter Design Considerations
Common mistakes
- Forgetting to unlock a mutex
- Race conditions
- Deadlocks
- Granularity issues: too much locking will kill the performance

Forget to unlock a mutex

Most programming languages have either:
- A guard object that unlocks a mutex upon destruction
- A synchronization statement:

    some_rlock = threading.RLock()
    with some_rlock:
        print("some_rlock is locked while this executes")
Race condition
- Can we detect potential race conditions during compilation?
- In the Rust programming language:
  - Objects are owned by a specific thread
  - Types can be marked with the Send trait, to indicate that the object can be moved between threads
  - Types can be marked with the Sync trait, to indicate that the object can be accessed by multiple threads safely

Safe Shared Mutable State in rust (1/3)

    let mut data = vec![1, 2, 3];
    for i in 0..3 {
        thread::spawn(move || {
            data[i] += 1;
        });
    }

Gives an error: "capture of moved value: `data`"
Safe Shared Mutable State in rust (2/3)

    let mut data = Arc::new(vec![1, 2, 3]);
    for i in 0..3 {
        let data = data.clone();
        thread::spawn(move || {
            data[i] += 1;
        });
    }

Arc adds reference counting and is movable and syncable.
Gives an error: cannot borrow immutable borrowed content as mutable

Safe Shared Mutable State in rust (3/3)

    let data = Arc::new(Mutex::new(vec![1, 2, 3]));
    for i in 0..3 {
        let data = data.clone();
        thread::spawn(move || {
            let mut data = data.lock().unwrap();
            data[i] += 1;
        });
    }

It now compiles and it works.
Single Instruction, Multiple Threads Programming

With SIMT, the same instruction is executed by multiple threads on different registers.
Single instruction, multiple flow paths (1/2)

Using a masking system, it is possible to support if/else blocks:
- Threads always execute the instructions of both parts of the if/else blocks

    data = [-2, 0, 1, -1, 2], data2 = [...]
    function f(thread_id, data, data2)
    {
      if(data[thread_id] < 0)
      {
        data[thread_id] = data[thread_id] - data2[thread_id];
      } else if(data[thread_id] > 0)
      {
        data[thread_id] = data[thread_id] + data2[thread_id];
      }
    }

Single instruction, multiple flow paths (2/2)

Benefits:
- Multiple flows are needed in many algorithms

Drawbacks:
- Only one flow path is executed at a time; non-running threads must wait
- Randomized memory access: elements of a vector are not accessed sequentially
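The masking mechanism can be simulated in plain Python: every lane evaluates both branches, and a per-lane mask decides which result is kept, mirroring how an SIMT machine keeps all lanes in lock-step. The contents of `data2` are made up here, since the slide elides them:

```python
data = [-2, 0, 1, -1, 2]
data2 = [10, 10, 10, 10, 10]   # made-up values for illustration

# Masks recording which "threads" (lanes) take which branch
is_neg = [x < 0 for x in data]
is_pos = [x > 0 for x in data]

# Every lane computes BOTH branch results, as on SIMT hardware...
minus = [x - y for x, y in zip(data, data2)]
plus = [x + y for x, y in zip(data, data2)]

# ...and the mask selects, per lane, which result is written back.
result = [m if n else (p if q else x)
          for x, n, q, m, p in zip(data, is_neg, is_pos, minus, plus)]
```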
Programming Language Design for SIMT
- OpenCL and CUDA are the most common: very low-level, C/C++ derivatives
- General-purpose programming languages are not suitable
- Some work has been done to write code in Python and run it on a GPU with CUDA:

    @jit(argtypes=[float32[:], float32[:], float32[:]], target='gpu')
    def add_matrix(A, B, C):
        A[cuda.threadIdx.x] = B[cuda.threadIdx.x] + C[cuda.threadIdx.x]

  with limitations on the standard functions that can be called

Distributed programming
Distributed Programming (1/4)

In distributed computing, several computations are executed at the same time and communicate through message passing.

[Diagram: several computation units, each with its own memory, connected by a network]

Distributed programming (2/4)

A distributed computing application consists of multiple programs running on multiple computers that together coordinate to perform some task.
- Computation is performed in parallel by many computers
- Information can be restricted to certain computers
- Redundancy and geographic diversity improve reliability
Distributed programming (3/4)

Characteristics of distributed computing:
- Computers are independent — they do not share memory
- Coordination is enabled by messages passed across a network
- Individual programs have differentiating roles

Distributed programming (4/4)

Distributed computing for large-scale data processing:
- Databases respond to queries over a network
- Data sets can be partitioned across multiple machines
Message Passing
- Messages are (usually) passed through sockets
- Messages are exchanged synchronously or asynchronously
- Communication can be centralized or peer-to-peer
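A minimal sketch of synchronous message passing through sockets, using `socket.socketpair()` so it runs in a single process; the two threads stand in for two communicating programs:

```python
import socket
import threading

# A connected pair of sockets stands in for a network connection.
server_sock, client_sock = socket.socketpair()

def server():
    msg = server_sock.recv(1024)          # blocking (synchronous) receive
    server_sock.sendall(b"got: " + msg)   # reply to the sender

t = threading.Thread(target=server)
t.start()
client_sock.sendall(b"hello")             # send a message
reply = client_sock.recv(1024)            # wait for the reply
t.join()
server_sock.close()
client_sock.close()
```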
Python's Global Interpreter Lock (1/2)

CPython can only interpret one single thread at a given time:
- Single-threaded execution is faster (no need to lock in memory management)
- Many C libraries used as extensions are not thread safe

To eliminate the GIL, Python developers have the following requirements:
- Simplicity
- Backward compatibility
- Prompt and ordered destruction
- It must actually improve performance

Python's Global Interpreter Lock (2/2)

The lock is released when:
- The current thread is blocking for I/O
- Every 100 interpreter ticks

True multithreading is not possible with CPython.
Python's Multiprocessing module

The multiprocessing package offers both local and remote concurrency, effectively side-stepping the Global Interpreter Lock by using subprocesses instead of threads. It implements transparent message passing, allowing Python objects to be exchanged between processes.

Python's Message Passing (1/2)

Example of message passing:

    from multiprocessing import Process

    def f(name):
        print('hello', name)

    if __name__ == '__main__':
        p = Process(target=f, args=('bob',))
        p.start()
        p.join()

Output: hello bob
Python's Message Passing (2/2)

Example of message passing with pipes:

    from multiprocessing import Process, Pipe

    def f(conn):
        conn.send([42, None, 'hello'])
        conn.close()

    if __name__ == '__main__':
        parent_conn, child_conn = Pipe()
        p = Process(target=f, args=(child_conn,))
        p.start()
        print(parent_conn.recv())
        p.join()

Output: [42, None, 'hello']

Serialization

A serialized object is an object represented as a sequence of bytes that includes the object's data, its type and the types of data stored in the object. Transparent message passing is possible thanks to serialization.
pickle

In Python, serialization is done with the pickle module:
- It can serialize user-defined classes
  - The class definition must be available before deserialization
- It works with different versions of Python
- By default, it uses an ASCII format
- It can serialize:
  - Basic types: booleans, numbers, strings
  - Containers: tuples, lists, sets and dictionaries (of picklable objects)
  - Top-level functions and classes (only the name)
  - Objects whose __dict__ or __getstate__() are picklable

Example:

    pickle.loads(pickle.dumps(10))

Shared memory

Memory can be shared between Python processes with a Value or an Array:

    from multiprocessing import Process, Value, Array

    def f(n, a):
        n.value = 3.1415927
        for i in range(len(a)):
            a[i] = -a[i]

    if __name__ == '__main__':
        num = Value('d', 0.0)
        arr = Array('i', range(10))
        p = Process(target=f, args=(num, arr))
        p.start()
        p.join()
        print(num.value)
        print(arr[:])

And of course, you would need to use a mutex to avoid race conditions.
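To make the pickle rules above concrete, here is a round-trip with a user-defined class (the `Point` class is a made-up example):

```python
import pickle

class Point:
    # The class definition must be available (importable) on the
    # receiving side before deserialization.
    def __init__(self, x, y):
        self.x = x
        self.y = y

p = Point(3, 4)
data = pickle.dumps(p)    # serialize: object -> bytes
q = pickle.loads(data)    # deserialize: bytes -> an equivalent object
```

Note that `loads` builds a new, equal object; it is not the same object that was dumped.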
MapReduce

Big Data Processing (1/2)

MapReduce is a framework for batch processing of big data.
- Framework: a system used by programmers to build applications
- Batch processing: all the data is available at the outset, and results are not used until processing completes
- Big data: used to describe data sets so large and comprehensive that they can reveal facts about a whole population, usually from statistical analysis
Big Data Processing (2/2)

The MapReduce idea:
- Data sets are too big to be analyzed by one machine
- Using multiple machines has the same complications, regardless of the application/analysis
- Pure functions enable an abstraction barrier between data processing logic and the coordination of a distributed application

MapReduce Evaluation Model (1/2)

Map phase: apply a mapper function to all inputs, emitting intermediate key-value pairs.
- The mapper takes an iterable value containing inputs, such as lines of text
- The mapper yields zero or more key-value pairs for each input
MapReduce Evaluation Model (2/2)

Reduce phase: for each intermediate key, apply a reducer function to accumulate all values associated with that key.
- The reducer takes an iterable value containing intermediate key-value pairs
- All pairs with the same key appear consecutively
- The reducer yields zero or more values, each associated with that intermediate key

MapReduce Execution Model (1/2)
MapReduce Execution Model (2/2)

The keys are shuffled and assigned to reducers.

MapReduce example

From a database of 1.1 billion people (Facebook?), we want to know the average number of friends per age.

In SQL:

    SELECT age, AVG(friends) FROM users GROUP BY age

In MapReduce, the total set of users is split into different users_set:

    function map(users_set) {
      for(user in users_set) {
        send(user.age, user.friends.size);
      }
    }
    function reduce(age, friends) {
      var r = 0;
      for(friend in friends) {
        r += friend;
      }
      send(age, r / friends.size);
    }
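The same job can be sketched as a single-process Python program (with made-up user data); the three stages correspond directly to the execution model above: map emits key-value pairs, the shuffle groups them by key, and reduce averages each group:

```python
from collections import defaultdict

# Made-up sample data: (age, number_of_friends) for each user
users = [(20, 150), (20, 250), (25, 300), (25, 100), (30, 50)]

def mapper(users_set):
    # Map phase: emit one (age, friend_count) pair per user
    for age, friends in users_set:
        yield age, friends

def reducer(age, counts):
    # Reduce phase: average all values that share the same key
    return age, sum(counts) / len(counts)

# Shuffle: group intermediate pairs by key, as the framework would
groups = defaultdict(list)
for key, value in mapper(users):
    groups[key].append(value)

result = dict(reducer(age, counts) for age, counts in groups.items())
```

In a real framework the groups would be partitioned across machines, but the mapper and reducer would be written the same way.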
MapReduce Assumptions

Constraints on the mapper and reducer:
- The mapper must be equivalent to applying a deterministic pure function to each input independently
- The reducer must be equivalent to applying a deterministic pure function to the sequence of values for each key

Benefits of functional programming:
- When a program contains only pure functions, call expressions can be evaluated in any order, lazily, and in parallel
- Referential transparency: a call expression can be replaced by its value (or vice versa) without changing the program

In MapReduce, these functional programming ideas allow:
- Consistent results, however the computation is partitioned
- Re-computation and caching of results, as needed

MapReduce Benefits
- Fault tolerance: a machine or hard drive might crash; the MapReduce framework automatically re-runs failed tasks
- Speed: some machine might be slow because it is overloaded; the framework can run multiple copies of a task and keep the result of the one that finishes first
- Network locality: data transfer is expensive; the framework tries to schedule map tasks on the machines that hold the data to be processed
- Monitoring: will my job finish before dinner?! The framework provides a web-based interface describing jobs
Summary
- Parallel programming
- Multithreading, and how to help reduce programmer error
- Distributed programming and MapReduce