
TDDA69 Data and Program Structure
Concurrent Computing
Cyrille Berger

List of lectures

1. Introduction
2. Imperative Programming and Data Structures
3. Parsing
4. Evaluation
5. Object Oriented Programming
6. Macros and decorators
7. Virtual Machines and Bytecode
8. Garbage Collection and Native Code
9. Concurrent Computing
10. Declarative Programming
11. Logic Programming
12. Summary


Lecture content

- Parallel Programming
- Multithreaded Programming
- The States Problems and Solutions
- Atomic actions
- Language and Design Considerations
- Single Instruction, Multiple Threads Programming
- Distributed programming
- MapReduce

In concurrent computing, several computations are executed at the same time:

- In parallel computing, all computation units have access to a shared memory (for instance, within a single process)
- In distributed computing, computation units communicate through message passing

Benefits of concurrent computing

- Faster
- Responsiveness: interactive applications can perform two tasks at the same time: rendering, spell checking...
- Availability of services: load balancing between servers
- Controllability: tasks can be suspended, resumed and stopped

Disadvantages of concurrent computing

- Concurrency is hard to implement properly
- Safety: easy to corrupt memory
- Tasks can wait indefinitely for each other
- Non-deterministic
- Not always faster! The memory bandwidth and CPU cache are limited


Concurrent computing programming

Four basic approaches to computing:

- Sequential programming: no concurrency
- Declarative concurrency: streams in a functional language
- Message passing: with active objects, used in distributed computing
- Atomic actions: on a shared memory, used in parallel computing

Stream Programming in Functional Programming

- No global state
- Functions only act on their input, they are reentrant
- Functions can then be executed in parallel, as long as they do not depend on the output of another function (see the sketch below)
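A minimal sketch of this idea in Python, using the standard concurrent.futures module (the pure function square is an illustrative example): because a pure function depends only on its arguments, a pool may apply it to many inputs in parallel without any shared state.

    from concurrent.futures import ProcessPoolExecutor

    def square(x):
        # Pure function: no global state, the result depends only on x
        return x * x

    if __name__ == '__main__':
        with ProcessPoolExecutor() as pool:
            # The runtime is free to evaluate these calls in parallel
            print(list(pool.map(square, range(10))))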

Parallel Programming

In parallel computing, several computations are executed at the same time and have access to a shared memory

[Diagram: several computation units connected to a single shared memory]


SIMD, SIMT, SMT (1/2)

- SIMD (Single Instruction, Multiple Data): elements of a short vector (4 to 8 elements) are processed in parallel
- SIMT (Single Instruction, Multiple Threads): the same instruction is executed by multiple threads (from 128 to 3048 or more in the future)
- SMT (Simultaneous Multithreading): general purpose, different instructions are executed by different threads

SIMD, SIMT, SMT (2/2)

SIMD:

    PUSH [1, 2, 3, 4]
    PUSH [4, 5, 6, 7]
    VEC_ADD_4

SIMT:

    execute([1, 2, 3, 4], [4, 5, 6, 7],
            lambda a,b,ti: a[ti] = a[ti] + max(b[ti], 5))

SMT:

    a = [1, 2, 3, 4]
    b = [4, 5, 6, 7]
    Thread.new(lambda : a = a + b)
    Thread.new(lambda : c = c * b)
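A rough single-machine analogue of the SIMD row above, assuming NumPy is available (the arrays are illustrative): one vectorized expression processes the whole short vector at once, which NumPy can map onto the CPU's vector instructions.

    import numpy as np

    a = np.array([1, 2, 3, 4])
    b = np.array([4, 5, 6, 7])
    # One element-wise operation over the whole vector, SIMD-style
    print(a + b)   # [ 5  7  9 11]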

Why the need for the different models?

- Flexibility: SMT > SIMT > SIMD
- Performance: SIMD > SIMT > SMT
- Less flexibility gives higher performance, unless the lack of flexibility prevents accomplishing the task

Multithreaded Programming

Singlethreaded vs Multithreaded

[Diagram: a single main thread, versus a main thread that forks sub-threads sub 0 ... sub n and joins them back into main]

Multithreaded Programming Model

- Start with a single root main thread
- Fork: to create concurrently executing threads
- Join: to synchronize threads
- Threads communicate through shared memory
- Threads execute asynchronously
- They may or may not execute on different processors

A multithreaded example

    thread1 = new Thread(
      function() { /* do some computation */ });
    thread2 = new Thread(
      function() { /* do some computation */ });
    thread1.start();
    thread2.start();
    thread1.join();
    thread2.join();
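The slide's code is language-neutral pseudocode; a minimal runnable equivalent with Python's threading module (the computation function is a placeholder) looks like this:

    import threading

    def computation():
        pass  # do some computation

    thread1 = threading.Thread(target=computation)
    thread2 = threading.Thread(target=computation)
    thread1.start()   # fork
    thread2.start()
    thread1.join()    # join: wait for the thread to finish
    thread2.join()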

The States Problems and Solutions

Global States and multi-threading

Example:

    var a = 0;
    thread1 = new Thread(
      function() { a = a + 1; });
    thread2 = new Thread(
      function() { a = a + 1; });
    thread1.start();
    thread2.start();

What is the value of a? This is called a (data) race.
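A sketch of the same race in Python (the iteration count is arbitrary): a += 1 compiles to a read, an add and a write, so two threads can interleave and lose updates.

    import threading

    a = 0

    def increment():
        global a
        for _ in range(100000):
            a += 1   # not atomic: read, add, write

    t1 = threading.Thread(target=increment)
    t2 = threading.Thread(target=increment)
    t1.start(); t2.start()
    t1.join(); t2.join()
    # May print less than 200000 on interpreters that
    # switch threads in the middle of the update
    print(a)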

Atomic actions

Atomic operations

An operation is said to be atomic if it appears to happen instantaneously

- read/write, swap, fetch-and-add...
- test-and-set: set a value and return the old one

To implement a lock:

    while(test-and-set(lock, 1) == 1) {}

To unlock:

    lock = 0

Mutex

Mutex is short for Mutual exclusion. It is a technique to prevent two threads from accessing a shared resource at the same time.

Example:

    var a = 0;
    var m = new Mutex();
    thread1 = new Thread(
      function()
      {
        m.lock();
        a = a + 1;
        m.unlock();
      });
    thread2 = new Thread(
      function()
      {
        m.lock();
        a = a + 1;
        m.unlock();
      });
    thread1.start();
    thread2.start();

Now a=2
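The same fix sketched in Python, using threading.Lock as the mutex to repair the race from the previous slide:

    import threading

    a = 0
    m = threading.Lock()   # the mutex

    def increment():
        global a
        m.acquire()        # m.lock() in the slide's pseudocode
        a = a + 1
        m.release()        # m.unlock()

    t1 = threading.Thread(target=increment)
    t2 = threading.Thread(target=increment)
    t1.start(); t2.start()
    t1.join(); t2.join()
    print(a)   # now reliably 2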

Dependency

Example:

    var a = 1;
    var m = new Mutex();
    thread1 = new Thread(
      function()
      {
        m.lock();
        a = a + 1;
        m.unlock();
      });
    thread2 = new Thread(
      function()
      {
        m.lock();
        a = a * 3;
        m.unlock();
      });
    thread1.start();
    thread2.start();

What is the value of a? 4 or 6?

Condition variable

A Condition variable is a set of threads waiting for a certain condition

Example:

    var a = 1;
    var m = new Mutex();
    var cv = new ConditionVariable();
    thread1 = new Thread(
      function()
      {
        m.lock();
        a = a + 1;
        cv.notify();
        m.unlock();
      });
    thread2 = new Thread(
      function()
      {
        cv.wait();
        m.lock();
        a = a * 3;
        m.unlock();
      });
    thread1.start();
    thread2.start();

a = 6
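A runnable Python version of the condition-variable pattern (a sketch; the incremented flag guards against a notify that arrives before the wait): threading.Condition bundles the mutex and the condition variable.

    import threading

    a = 1
    cv = threading.Condition()   # contains its own mutex
    incremented = False

    def t1():
        global a, incremented
        with cv:                 # m.lock()
            a = a + 1
            incremented = True
            cv.notify()          # wake up the waiting thread

    def t2():
        global a
        with cv:
            cv.wait_for(lambda: incremented)  # wait for the condition
            a = a * 3

    th2 = threading.Thread(target=t2); th2.start()
    th1 = threading.Thread(target=t1); th1.start()
    th1.join(); th2.join()
    print(a)   # always 6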

Deadlock

What might happen:

    var a = 0;
    var b = 2;
    var ma = new Mutex();
    var mb = new Mutex();
    thread1 = new Thread(
      function()
      {
        ma.lock();
        mb.lock();
        b = b - 1;
        a = a - 1;
        ma.unlock();
        mb.unlock();
      });
    thread2 = new Thread(
      function()
      {
        mb.lock();
        ma.lock();
        b = b - 1;
        a = a + b;
        mb.unlock();
        ma.unlock();
      });
    thread1.start();
    thread2.start();

thread1 waits for mb, thread2 waits for ma
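The same pattern sketched in Python: if each thread acquires its first lock before the other's second acquire, both block forever, so the threads are left unstarted here for illustration. A standard remedy is to always acquire locks in one global order.

    import threading

    ma = threading.Lock()
    mb = threading.Lock()

    def thread1():
        with ma:
            with mb:   # blocks if thread2 already holds mb
                pass

    def thread2():
        with mb:
            with ma:   # blocks if thread1 already holds ma
                pass

    # Starting both threads can hang the program:
    # threading.Thread(target=thread1).start()
    # threading.Thread(target=thread2).start()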

Advantages of atomic actions

- Very efficient
- Less overhead, faster than message passing

Disadvantages of atomic actions

- Blocking: meaning some threads have to wait
- Small overhead
- Deadlock
- A low-priority thread can block a high-priority thread
- A common source of programming errors

Language and Interpreter Design Considerations

Common mistakes

- Forget to unlock a mutex
- Race condition
- Granularity issues: too much locking will kill the performance

Forget to unlock a mutex

Most languages have either:

- A guard object that will unlock a mutex upon destruction
- A synchronization statement

    import threading

    some_rlock = threading.RLock()
    with some_rlock:
        print("some_rlock is locked while this executes")


Race condition

Can we detect potential race conditions during compilation? In the Rust programming language:

- Objects are owned by a specific thread
- Types can be marked with the Send trait to indicate that the object can be moved between threads
- Types can be marked with the Sync trait to indicate that the object can be accessed by multiple threads safely

Safe Shared Mutable State in rust (1/3)

    let mut data = vec![1, 2, 3];
    for i in 0..3 {
        thread::spawn(move || {
            data[i] += 1;
        });
    }

Gives an error: "capture of moved value: `data`"

Safe Shared Mutable State in rust (2/3)

    let mut data = Arc::new(vec![1, 2, 3]);
    for i in 0..3 {
        let data = data.clone();
        thread::spawn(move || {
            data[i] += 1;
        });
    }

Arc adds reference counting, and is movable and syncable. Gives error: cannot borrow immutable borrowed content as mutable

Safe Shared Mutable State in rust (3/3)

    let data = Arc::new(Mutex::new(vec![1, 2, 3]));
    for i in 0..3 {
        let data = data.clone();
        thread::spawn(move || {
            let mut data = data.lock().unwrap();
            data[i] += 1;
        });
    }

It now compiles and it works


Single Instruction, Multiple Threads Programming

With SIMT, the same instruction is executed by multiple threads on different registers

Single instruction, multiple flow paths (1/2)

Using a masking system, it is possible to support if/else blocks

- Threads are always executing the instructions of both parts of the if/else blocks

    data = [-2, 0, 1, -1, 2], data2 = [...]
    function f(thread_id, data, data2)
    {
      if(data[thread_id] < 0)
      {
        data[thread_id] = data[thread_id] - data2[thread_id];
      } else if(data[thread_id] > 0)
      {
        data[thread_id] = data[thread_id] + data2[thread_id];
      }
    }

Single instruction, multiple flow paths (2/2)

Benefits:

- Multiple flows are needed in many algorithms

Drawbacks:

- Only one flow path is executed at a time, non-running threads must wait
- Randomized memory access: elements of a vector are not accessed sequentially
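A rough illustration of the masking idea with NumPy (an assumption; np.where stands in for the hardware mask, and the data2 values are made up): both branch expressions are computed for every element, and the mask selects which result each lane keeps.

    import numpy as np

    data = np.array([-2, 0, 1, -1, 2])
    data2 = np.array([10, 10, 10, 10, 10])   # illustrative values

    # Evaluate both branches for all elements, then mask-select
    result = np.where(data < 0, data - data2,
             np.where(data > 0, data + data2, data))
    print(result)   # [-12   0  11 -11  12]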


Programming Language Design for SIMT

- OpenCL and CUDA are the most common: very low level, C/C++-derivatives
- General purpose programming languages are not suitable
- Some work has been done to be able to write in Python and run on a GPU with CUDA:

    @jit(argtypes=[float32[:], float32[:], float32[:]], target='gpu')
    def add_matrix(A, B, C):
        A[cuda.threadIdx.x] = B[cuda.threadIdx.x] + C[cuda.threadIdx.x]

  with limitations on the standard functions that can be called

Distributed programming

Distributed Programming (1/4)

In distributed computing, several computations are executed at the same time and communicate through message passing

[Diagram: several computation units, each with its own memory, connected by a network]

Distributed programming (2/4)

A distributed computing application consists of multiple programs running on multiple computers that together coordinate to perform some task.

- Computation is performed in parallel by many computers.
- Information can be restricted to certain computers.
- Redundancy and geographic diversity improve reliability.


Distributed programming (3/4)

Characteristics of distributed computing:

- Computers are independent: they do not share memory.
- Coordination is enabled by messages passed across a network.

Distributed programming (4/4)

Individual programs have differentiating roles.

Distributed computing for large-scale data processing:

- Machines respond to queries over a network.
- Data sets can be partitioned across multiple machines.

Message Passing

- Messages are (usually) passed through sockets
- Messages are exchanged synchronously or asynchronously
- Communication can be centralized or peer-to-peer
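A minimal sketch of socket-based message passing with the standard library (the port 50007 and the message are arbitrary choices; real systems add framing and error handling):

    import socket
    import threading

    ready = threading.Event()

    def server():
        with socket.socket() as s:
            s.bind(('localhost', 50007))
            s.listen(1)
            ready.set()                  # the client may now connect
            conn, _ = s.accept()
            with conn:
                print('server received:', conn.recv(1024).decode())

    t = threading.Thread(target=server)
    t.start()
    ready.wait()
    with socket.socket() as c:
        c.connect(('localhost', 50007))
        c.sendall(b'hello')              # the message
    t.join()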


Python's Global Interpreter Lock (1/2)

CPython can only interpret one single thread at a given time

- Single-threaded programs are faster (no need to lock in memory management)
- Many C libraries used as extensions are not thread safe

To eliminate the GIL, Python developers have the following requirements:

- Simplicity
- Do actually improve performance
- Backward compatible
- Prompt and ordered destruction

Python's Global Interpreter Lock (2/2)

CPython can only interpret one single thread at a given time. The lock is released when:

- The current thread is waiting for I/O
- Every 100 interpreter ticks

True multithreading is not possible with CPython (see the sketch below)
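A quick sketch that makes the effect visible (the loop size is arbitrary, and exact timings vary by machine): a CPU-bound function run in two threads takes about as long as, or longer than, running it twice sequentially, because only one thread interprets bytecode at a time.

    import threading
    import time

    def count(n=10_000_000):
        while n:
            n -= 1

    start = time.time()
    count(); count()
    print('sequential :', time.time() - start)

    start = time.time()
    threads = [threading.Thread(target=count) for _ in range(2)]
    for t in threads: t.start()
    for t in threads: t.join()
    print('two threads:', time.time() - start)   # similar or worse under the GIL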

Python's multiprocessing module

The multiprocessing package offers both local and remote concurrency, effectively side-stepping the Global Interpreter Lock by using subprocesses instead of threads. It implements transparent message passing, allowing Python objects to be exchanged between processes.

Python's Message Passing (1/2)

Example of message passing:

    from multiprocessing import Process

    def f(name):
        print('hello', name)

    if __name__ == '__main__':
        p = Process(target=f, args=('bob',))
        p.start()
        p.join()

Output: hello bob


Python's Message Passing (2/2)

Example of message passing with pipes:

    from multiprocessing import Process, Pipe

    def f(conn):
        conn.send([42, None, 'hello'])
        conn.close()

    if __name__ == '__main__':
        parent_conn, child_conn = Pipe()
        p = Process(target=f, args=(child_conn,))
        p.start()
        print(parent_conn.recv())
        p.join()

Output: [42, None, 'hello']

Serialization

A serialized object is an object represented as a sequence of bytes that includes the object's data, its type and the types of data stored in the object. Transparent message passing is possible thanks to serialization.

pickle

In Python, serialization is done with the pickle module

- It can serialize user-defined classes: the class definition must be available before deserialization
- Works with different versions of Python
- By default, uses an ASCII format

It can serialize:

- Basic types: booleans, numbers, strings
- Containers: tuples, lists, sets and dictionaries (of picklable objects)
- Top level functions and classes (only the name)
- Objects where __dict__ or __getstate__() are picklable

Example:

    pickle.loads(pickle.dumps(10))

Shared memory

Memory can be shared between Python processes with a Value or an Array:

    from multiprocessing import Process, Value, Array

    def f(n, a):
        n.value = 3.1415927
        for i in range(len(a)):
            a[i] = -a[i]

    if __name__ == '__main__':
        num = Value('d', 0.0)
        arr = Array('i', range(10))
        p = Process(target=f, args=(num, arr))
        p.start()
        p.join()
        print(num.value)
        print(arr[:])

And of course, you would need to use a Mutex to avoid race conditions
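A small sketch of the user-defined class case (the User class is illustrative): the bytes can only be deserialized where the class definition is available.

    import pickle

    class User:
        def __init__(self, name, age):
            self.name, self.age = name, age

    data = pickle.dumps(User('bob', 42))   # object -> bytes
    restored = pickle.loads(data)          # requires class User to be defined
    print(restored.name, restored.age)     # bob 42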


MapReduce

Big Data Processing (1/2)

MapReduce is a framework for batch processing of big data.

- Framework: a system used by programmers to build applications
- Batch processing: all the data is available at the outset, and results are not used until processing completes
- Big data: used to describe data sets so large and comprehensive that they can reveal facts about a whole population, usually from statistical analysis

Big Data Processing (2/2)

The MapReduce idea:

- Data sets are too big to be analyzed by one machine
- Using multiple machines has the same complications, regardless of the application/analysis
- Pure functions enable an abstraction barrier between data processing logic and coordinating a distributed application

MapReduce Evaluation Model (1/2)

Map phase: apply a mapper function to all inputs, emitting intermediate key-value pairs

- The mapper takes an iterable value containing inputs, such as lines of text
- The mapper yields zero or more key-value pairs for each input


MapReduce Evaluation Model (2/2)

Reduce phase: for each intermediate key, apply a reducer function to accumulate all values associated with that key

- The reducer takes an iterable value containing intermediate key-value pairs
- All pairs with the same key appear consecutively
- The reducer yields zero or more values, each associated with that intermediate key

MapReduce Execution Model (1/2)

[Diagram of the MapReduce execution pipeline]

MapReduce Execution Model (2/2)

The keys are shuffled and assigned to reducers

MapReduce example

From 1.1 billion people (facebook?), we want to know the average number of friends per age

In SQL:

    SELECT age, AVG(friends) FROM users GROUP BY age

In MapReduce: the total set of users is split into different users_set

    function map(users_set)
    {
      for(user in users_set)
      {
        send(user.age, user.friends.size);
      }
    }

    function reduce(age, friends)
    {
      var r = 0;
      for(friend in friends)
      {
        r += friend;
      }
      send(age, r / friends.size);
    }
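A single-machine Python sketch of this example with explicit map, shuffle and reduce phases (the data values are made up for illustration):

    from collections import defaultdict

    users = [{'age': 20, 'friends': 150},
             {'age': 20, 'friends': 250},
             {'age': 30, 'friends': 100}]

    def mapper(user):
        # Emit an (age, friend_count) pair for each user
        yield user['age'], user['friends']

    def reducer(age, counts):
        # Average all the values collected for one key
        return age, sum(counts) / len(counts)

    shuffled = defaultdict(list)          # shuffle: group values by key
    for user in users:
        for key, value in mapper(user):
            shuffled[key].append(value)

    print([reducer(age, counts) for age, counts in shuffled.items()])
    # [(20, 200.0), (30, 100.0)]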


MapReduce Assumptions

Constraints on the mapper and reducer:

- The mapper must be equivalent to applying a deterministic pure function to each input independently
- The reducer must be equivalent to applying a deterministic pure function to the sequence of values for each key

Benefits of functional programming:

- When a program contains only pure functions, call expressions can be evaluated in any order, lazily, and in parallel
- Referential transparency: a call expression can be replaced by its value (or vice versa) without changing the program

In MapReduce, these functional programming ideas allow:

- Consistent results, however computation is partitioned
- Re-computation and caching of results, as needed

MapReduce Benefits

- Fault tolerance: a machine or hard drive might crash. The MapReduce framework automatically re-runs failed tasks
- Speed: some machine might be slow because it's overloaded. The framework can run multiple copies of a task and keep the result of the one that finishes first
- Network locality: data transfer is expensive. The framework tries to schedule map tasks on the machines that hold the data to be processed
- Monitoring: will my job finish before dinner?!? The framework provides a web-based interface describing jobs

Summary

- Parallel programming
- Multi-threading and how to help reduce programmer error
- Distributed programming and MapReduce
