Python and High-Performance Computing

Efficiency

Python is an interpreted language: there are no pre-compiled binaries, all code is translated on-the-fly to machine instructions, with byte-code as a middle step which may be stored (.pyc).
All objects in Python are dynamic: nothing is fixed == optimisation nightmare, with lots of overhead from metadata.
Flexibility is good, but comes with a cost!

Improving Python performance

Array based computations with NumPy
Using the extended Cython programming language
Embedding compiled code in a Python program: C/C++, Fortran
Utilizing parallel processing

Parallelisation strategies for Python

Global Interpreter Lock (GIL)
  CPython's memory management is not thread-safe
  no threads possible, except for I/O etc.
  affects overall performance of threading
Process-based "threading" with fork
  independent processes that have a limited way to communicate
Message-passing is the way to go to achieve true parallelism in Python

Numpy basics

Numpy – fast array interface

Standard Python is not well suited for numerical computations
  lists are very flexible but also slow to process in numerical computations
Numpy adds a new array data type
  static, multidimensional
  fast processing of arrays
  tools for linear algebra, random numbers, etc.
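A minimal illustration of the difference between a list and a NumPy array (not from the original slides; the numbers are just an example):

>>> import numpy
>>> lst = [1, 2, 3, 4]
>>> [2 * x for x in lst]        # list: explicit Python-level loop
[2, 4, 6, 8]
>>> a = numpy.array(lst, float)
>>> 2 * a                       # array: one vectorized operation executed in C
array([2., 4., 6., 8.])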

Numpy arrays

All elements of an array have the same type
Array can have multiple dimensions
The number of elements in the array is fixed, shape can be changed

Python list vs. NumPy array

(Figure: memory layout of a Python list of objects vs. a contiguous NumPy array)

Creating numpy arrays

From a list:

>>> import numpy
>>> a = numpy.array((1, 2, 3, 4), float)
>>> a
array([ 1., 2., 3., 4.])

>>> list1 = [[1, 2, 3], [4, 5, 6]]
>>> mat = numpy.array(list1, complex)
>>> mat
array([[ 1.+0.j, 2.+0.j, 3.+0.j],
       [ 4.+0.j, 5.+0.j, 6.+0.j]])
>>> mat.shape
(2, 3)
>>> mat.size
6

Helper functions for creating arrays: 1

arange and linspace can generate ranges of numbers:

>>> a = numpy.arange(10)
>>> a
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

>>> b = numpy.arange(0.1, 0.2, 0.02)
>>> b
array([0.1 , 0.12, 0.14, 0.16, 0.18])

>>> c = numpy.linspace(-4.5, 4.5, 5)
>>> c
array([-4.5 , -2.25, 0. , 2.25, 4.5 ])

Helper functions for creating arrays: 2

Array with given shape initialized to zeros, ones or an arbitrary value (full):

>>> a = numpy.zeros((4, 6), float)
>>> a.shape
(4, 6)

>>> b = numpy.ones((2, 4))
>>> b
array([[ 1., 1., 1., 1.],
       [ 1., 1., 1., 1.]])

>>> c = numpy.full((2, 3), 4.2)
>>> c
array([[4.2, 4.2, 4.2],
       [4.2, 4.2, 4.2]])

Empty array (no values assigned) with empty

Helper functions for creating arrays: 3

Similar arrays as an existing one with zeros_like, ones_like, full_like and empty_like:

>>> a = numpy.zeros((4, 6), float)
>>> b = numpy.empty_like(a)
>>> c = numpy.ones_like(a)
>>> d = numpy.full_like(a, 9.1)

Non-numeric data

NumPy supports also storing non-numerical data, e.g. strings (the largest element determines the item size)

>>> a = numpy.array(['foo', 'foo-bar'])
>>> a
array(['foo', 'foo-bar'], dtype='|U7')

Character arrays can, however, sometimes be useful

>>> dna = 'AAAGTCTGAC'
>>> a = numpy.array(dna, dtype='c')
>>> a
array([b'A', b'A', b'A', b'G', b'T', b'C', b'T', b'G', b'A', b'C'], dtype='|S1')

Accessing arrays

Simple indexing:

>>> mat = numpy.array([[1, 2, 3], [4, 5, 6]])
>>> mat[0,2]
3
>>> mat[1,-2]
5

Slicing:

>>> a = numpy.arange(10)
>>> a[2:]
array([2, 3, 4, 5, 6, 7, 8, 9])
>>> a[:-1]
array([0, 1, 2, 3, 4, 5, 6, 7, 8])
>>> a[1:3] = -1
>>> a
array([0, -1, -1, 3, 4, 5, 6, 7, 8, 9])
>>> a[1:7:2]
array([1, 3, 5])

Slicing of arrays in multiple dimensions

Multidimensional arrays can be sliced along multiple dimensions
Values can be assigned to only part of the array

>>> a = numpy.zeros((4, 4))
>>> a[1:3, 1:3] = 2.0
>>> a
array([[ 0., 0., 0., 0.],
       [ 0., 2., 2., 0.],
       [ 0., 2., 2., 0.],
       [ 0., 0., 0., 0.]])

Views and copies of arrays

Simple assignment creates references to arrays
Slicing creates "views" to the arrays
Use copy() for real copying of arrays

a = numpy.arange(10)
b = a              # reference, changing values in b changes a
b = a.copy()       # true copy

c = a[1:4]         # view, changing c changes elements [1:4] of a
c = a[1:4].copy()  # true copy of subarray

Array manipulation

reshape: change the shape of array

>>> mat = numpy.array([[1, 2, 3], [4, 5, 6]])
>>> mat
array([[1, 2, 3],
       [4, 5, 6]])
>>> mat.reshape(3,2)
array([[1, 2],
       [3, 4],
       [5, 6]])

ravel: flatten array to 1-d

>>> mat.ravel()
array([1, 2, 3, 4, 5, 6])

concatenate: join arrays together

>>> mat1 = numpy.array([[1, 2, 3], [4, 5, 6]])
>>> mat2 = numpy.array([[7, 8, 9], [10, 11, 12]])
>>> numpy.concatenate((mat1, mat2))
array([[ 1,  2,  3],
       [ 4,  5,  6],
       [ 7,  8,  9],
       [10, 11, 12]])
>>> numpy.concatenate((mat1, mat2), axis=1)
array([[ 1,  2,  3,  7,  8,  9],
       [ 4,  5,  6, 10, 11, 12]])

split: split array to N pieces

>>> numpy.split(mat1, 3, axis=1)
[array([[1], [4]]), array([[2], [5]]), array([[3], [6]])]

Array operations

Most operations for numpy arrays are done element-wise (+, -, *, /, **)

>>> a = numpy.array([1.0, 2.0, 3.0])
>>> b = 2.0
>>> a * b
array([ 2., 4., 6.])
>>> a + b
array([ 3., 4., 5.])
>>> a * a
array([ 1., 4., 9.])

Numpy has special functions which can work with array arguments (sin, cos, exp, sqrt, log, ...)

>>> import numpy, math
>>> a = numpy.linspace(-math.pi, math.pi, 8)
>>> a
array([-3.14159265, -2.24399475, -1.34639685, -0.44879895,
        0.44879895,  1.34639685,  2.24399475,  3.14159265])
>>> numpy.sin(a)
array([ -1.22464680e-16, -7.81831482e-01, -9.74927912e-01, -4.33883739e-01,
         4.33883739e-01,  9.74927912e-01,  7.81831482e-01,  1.22464680e-16])
>>> math.sin(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: only length-1 arrays can be converted to Python scalars

I/O with Numpy

Numpy provides functions for reading data from files and for writing data into files
Simple text files:
  numpy.loadtxt
  numpy.savetxt
  Data in regular column layout
  Can deal with comments and different column delimiters
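A small sketch of the round trip (the file name and data here are made up for illustration):

import numpy

data = numpy.random.random((5, 3))
numpy.savetxt('data.txt', data)                    # write in a regular column layout
loaded = numpy.loadtxt('data.txt', comments='#')   # '#' comment lines are skipped
print(loaded.shape)                                # (5, 3)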

Random numbers

The module numpy.random provides several functions for constructing random arrays
  random: uniform random numbers
  normal: normal distribution
  choice: random sample from given array
  ...

>>> import numpy.random as rnd
>>> rnd.random((2,2))
array([[ 0.02909142, 0.90848   ],
       [ 0.9471314 , 0.31424393]])
>>> rnd.choice(numpy.arange(4), 10)
array([0, 1, 1, 2, 1, 1, 2, 0, 2, 3])

Polynomials

Polynomial is defined by an array of coefficients p:
$p(x, N) = p[0] x^{N-1} + p[1] x^{N-2} + \ldots + p[N-1]$
Least square fitting: numpy.polyfit
Evaluating polynomials: numpy.polyval
Roots of polynomial: numpy.roots

For example:

>>> x = numpy.linspace(-4, 4, 7)
>>> y = x**2 + rnd.random(x.shape)
>>>
>>> p = numpy.polyfit(x, y, 2)
>>> p
array([ 0.96869003, -0.01157275, 0.69352514])

Linear algebra

Numpy can calculate matrix and vector products efficiently: dot, vdot, ...
Eigenproblems: linalg.eig, linalg.eigvals, ...
Linear systems and matrix inversion: linalg.solve, linalg.inv

>>> A = numpy.array(((2, 1), (1, 3)))
>>> B = numpy.array(((-2, 4.2), (4.2, 6)))
>>> C = numpy.dot(A, B)
>>>
>>> b = numpy.array((1, 2))
>>> numpy.linalg.solve(C, b)   # solve C x = b
array([ 0.04453441, 0.06882591])

Normally, NumPy utilises high performance libraries in linear algebra operations
Example: matrix multiplication C = A * B, matrix dimension 200
  pure python: 5.30 s
  naive C:     0.09 s
  numpy.dot:   0.01 s
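The eigenproblem and inversion routines listed above work the same way; a short sketch using the matrix A defined above (output values are approximate):

>>> w, v = numpy.linalg.eig(A)   # eigenvalues and eigenvectors of A
>>> w                            # approximately [1.38, 3.62]
>>> numpy.linalg.inv(A)          # inverse of A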

Numpy advanced topics

Anatomy of NumPy array

ndarray type is made of
  one dimensional contiguous block of memory (raw data)
  indexing scheme: how to locate an element
  data type descriptor: how to interpret an element

NumPy indexing

There are many possible ways of arranging items of an N-dimensional array in a 1-dimensional block
NumPy uses striding where the N-dimensional index $(n_0, n_1, \ldots, n_{N-1})$ corresponds to an offset from the beginning of the 1-dimensional block:

$\mathrm{offset} = \sum_{k=0}^{N-1} s_k n_k$, where $s_k$ is the stride in dimension $k$

ndarray attributes

a = numpy.array(...)
  a.flags: various information about memory layout
  a.strides: bytes to step in each dimension when traversing
  a.itemsize: size of one array element in bytes
  a.data: Python buffer object pointing to start of the array's data
  a.__array_interface__: Python internal interface
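A quick look at these attributes for a small array (the strides shown assume the default 8-byte float elements):

>>> a = numpy.zeros((3, 4))
>>> a.itemsize                 # 8 bytes per float64 element
8
>>> a.strides                  # 32 bytes to the next row, 8 bytes to the next column
(32, 8)
>>> a.flags['C_CONTIGUOUS']    # row-major, contiguous memory
True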

Advanced indexing

Numpy arrays can be indexed also with other arrays (integer or boolean)

>>> x = numpy.arange(10,1,-1)
>>> x
array([10, 9, 8, 7, 6, 5, 4, 3, 2])

>>> x[numpy.array([3, 3, 1, 8])]
array([7, 7, 9, 2])

Boolean "mask" arrays:

>>> m = x > 7
>>> m
array([ True, True, True, False, False, ...
>>> x[m]
array([10, 9, 8])

Advanced indexing creates copies of arrays

Vectorized operations

for loops in Python are slow
Use "vectorized" operations when possible
Example: difference (the for loop is ~80 times slower!)

# brute force using a for loop
arr = numpy.arange(1000)
dif = numpy.zeros(999, int)
for i in range(1, len(arr)):
    dif[i-1] = arr[i] - arr[i-1]

# vectorized operation
arr = numpy.arange(1000)
dif = arr[1:] - arr[:-1]

Broadcasting

If array shapes are different, the smaller array may be broadcasted into a larger shape

>>> from numpy import array
>>> a = array([[1,2],[3,4],[5,6]], float)
>>> a
array([[ 1., 2.],
       [ 3., 4.],
       [ 5., 6.]])
>>> b = array([[7,11]], float)
>>> b
array([[ 7., 11.]])

>>> a * b
array([[ 7., 22.],
       [ 21., 44.],
       [ 35., 66.]])

Example: calculate distances from a given point

# array containing 3d coordinates for 100 points
points = numpy.random.random((100, 3))
origin = numpy.array((1.0, 2.2, -2.2))
dists = (points - origin)**2
dists = numpy.sqrt(numpy.sum(dists, axis=1))

# find the most distant point
i = numpy.argmax(dists)
print(points[i])

Temporary arrays

In complex expressions, NumPy stores intermediate values in temporary arrays
Memory consumption can be higher than expected

a = numpy.random.random((1024, 1024, 50))
b = numpy.random.random((1024, 1024, 50))

# two temporary arrays will be created
c = 2.0 * a - 4.5 * b

# three temporary arrays will be created due to unnecessary parenthesis
c = (2.0 * a - 4.5 * b) + 1.1 * (numpy.sin(a) + numpy.cos(b))

Broadcasting approaches can also lead to hidden temporary arrays
Example: pairwise distance of M points in 3 dimensions
  Input data is an M x 3 array
  Output is an M x M array containing the distance between points i and j
  There is a temporary 1000 x 1000 x 3 array

X = numpy.random.random((1000, 3))
D = numpy.sqrt(((X[:, numpy.newaxis, :] - X) ** 2).sum(axis=-1))

Numexpr

Evaluation of complex expressions with one operation at a time can also lead to suboptimal performance
  Effectively, one carries out multiple for loops in the NumPy C-code
Numexpr package provides fast evaluation of array expressions

import numexpr as ne
x = numpy.random.random((1000000, 1))
y = numpy.random.random((1000000, 1))
poly = ne.evaluate("((.25*x + .75)*x - 1.5)*x - 2")

By default, numexpr tries to use multiple threads
  Number of threads can be queried and set with ne.set_num_threads(nthreads)
Supported operators and functions: +, -, *, /, **, sin, cos, tan, exp, log, sqrt
Speedups in comparison to NumPy are typically between 0.95 and 4
Works best on arrays that do not fit in CPU cache
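A small usage sketch of the threading controls (set_num_threads is named on the slide; detect_number_of_cores is assumed to be available in the installed numexpr version):

import numexpr as ne
import numpy

x = numpy.random.random((1000000, 1))

ncores = ne.detect_number_of_cores()   # how many cores numexpr sees
old = ne.set_num_threads(2)            # use two threads; returns the previous setting
y = ne.evaluate("sin(x) + 0.5 * x")
ne.set_num_threads(old)                # restore the previous number of threads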

Summary

Numpy provides a static array data structure
Multidimensional arrays
Fast mathematical operations for arrays
Tools for linear algebra and random numbers
Arrays can be broadcasted into the same shape
Expression evaluation can lead to temporary arrays

Performance analysis

Performance measurement

Measuring application performance

Correctness is the most important factor in any application
  Premature optimization is the root of all evil!
Before starting to optimize an application, one should measure where time is spent
  Typically 90 % of time is spent in 10 % of the application

Tools for measuring application performance:
  application's own timers
  timeit module
  cProfile module
  full fledged profiling tools: TAU, Intel Vtune, Python Tools for Visual Studio, ...

Mind the algorithm! Recursive calculation of Fibonacci numbers (relative speed):
  Pure Python: 1
  Pure C: 126
  Pure Python (better algorithm): 24e6
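The slides do not show the benchmark code; a minimal sketch of the two Python variants being compared (the naive recursion versus a linear-time "better algorithm"), timed with timeit as introduced below:

import timeit

def fib_naive(n):
    # exponential-time recursion
    if n < 2:
        return n
    return fib_naive(n - 1) + fib_naive(n - 2)

def fib_iter(n):
    # linear-time iterative algorithm
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

print(timeit.timeit('fib_naive(20)', globals=globals(), number=100))
print(timeit.timeit('fib_iter(20)', globals=globals(), number=100))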

Measuring application performance

The Python time module can be used for measuring time spent in a specific part of the program
  time.perf_counter(): includes time spent in other processes
  time.process_time(): only time in current process

import time

t0 = time.process_time()
for n in range(niter):
    heavy_calculation()
t1 = time.process_time()
print('Time spent in heavy calculation', t1-t0)

timeit module

Easy timing of small bits of Python code
Tries to avoid common pitfalls in measuring execution times
Command line interface and Python interface
%timeit magic in IPython

In [1]: from mymodule import func
In [2]: %timeit func()
10 loops, best of 3: 433 msec per loop

$ python -m timeit -s "from mymodule import func" "func()"
10 loops, best of 3: 433 msec per loop

cProfile

Execution profile of Python program
  Time spent in different parts of the program
  Call graphs
Python API:

import cProfile

# profile statement and save results to a file func.prof
cProfile.run('func()', 'func.prof')

Profiling whole program from command line:

$ python -m cProfile -o myprof.prof myprogram.py

Investigating profile with pstats

Printing execution time of selected functions
Sorting by function name, time, cumulative time, ...
Python module interface and interactive browser

In [1]: from pstats import Stats
In [2]: p = Stats('myprof.prof')
In [3]: p.strip_dirs()
In [4]: p.sort_stats('time')
In [5]: p.print_stats(5)
Mon Oct 12 10:11:00 2016 my.prof
...

$ python -m pstats myprof.prof
Welcome to the profile statistics browser
% strip
% sort time
% stats 5
Mon Oct 12 10:11:00 2016 my.prof
...

Summary

Python has various built-in tools for measuring application performance
  time module
  timeit module
  cProfile and pstats modules

Cython and interfacing external libraries

Cython

Optimising static compiler for Python
Extended Cython programming language
Tune readable Python code into plain C performance by adding static type declarations
Easy interfacing to external C/C++ libraries

Python overheads

Interpreting
"Boxing" - everything is an object
Function call overhead
Global interpreter lock - no threading benefits (CPython)

Interpreting

Cython command generates a C/C++ source file from a Cython source file
C/C++ source is then compiled into an extension module
  No interpretation but lots of calls to Python C-API

from distutils.core import setup
from Cython.Build import cythonize

# Normally, one compiles cython extended code with .pyx ending
setup(ext_modules=cythonize("mandel_cyt.py"), )

$ python setup.py build_ext --inplace

In [1]: import mandel_cyt

Case study: Mandelbrot fractal

Pure Python: 5.55 s
Compilation with Cython: 4.87 s
  Interpreting overhead is normally not drastic

def kernel(zr, zi, cr, ci, lim, cutoff):
    count = 0
    while ((zr*zr + zi*zi) < (lim*lim)) and count < cutoff:
        zr = zr * zr - zi * zi + cr
        zi = zr * zr - zi * zi + cr
        count += 1
    return count

"Boxing"

In Python, everything is an object
(Figure: the integers 7 and 6 are full objects carrying type information and metadata; adding them checks the types, operates on the int data, and boxes the result 13 as a new object)

Static type declarations

Cython extended code should have .pyx ending
  Cannot be run with normal Python
Types are declared with cdef keyword
  In function signatures only the type is given

Pure Python:

def integrate(f, a, b, N):
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f(a + i * dx)
    return s * dx

With static type declarations:

def integrate(f, double a, double b, int N):
    cdef double s = 0
    cdef int i
    cdef double dx = (b - a) / N
    for i in range(N):
        s += f(a + i * dx)
    return s * dx

Static type declarations

Pure Python: 5.55 s
Static type declarations in kernel: 100 ms

def kernel(double zr, double zi, ...):
    cdef int count = 0
    while ((zr*zr + zi*zi) < (lim*lim)) and count < cutoff:
        zr = zr * zr - zi * zi + cr
        zi = zr * zr - zi * zi + cr
        count += 1
    return count

Function call overhead

Function calls in Python can involve lots of checking and "boxing"
Overhead can be reduced by declaring functions to be C-functions
  cdef keyword: functions can be called only from Cython
  cpdef keyword: generate also a Python wrapper

def integrate(f, a, b, N):
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f(a + i * dx)
    return s * dx

cdef double integrate(f, double a, ...):
    cdef double s = 0
    cdef int i
    cdef double dx = (b - a) / N
    for i in range(N):
        s += f(a + i * dx)
    return s * dx

Using C functions

Static type declarations in kernel: 100 ms
Kernel as C function: 69 ms

cdef int kernel(double zr, double zi, ...):
    cdef int count = 0
    while ((zr*zr + zi*zi) < (lim*lim)) and count < cutoff:
        zr = zr * zr - zi * zi + cr
        zi = zr * zr - zi * zi + cr
        count += 1
    return count

NumPy arrays with Cython

Cython supports fast indexing for NumPy arrays
Type and dimensions of array have to be declared

import numpy as np     # normal NumPy import
cimport numpy as cnp   # import for NumPy C-API

def func():            # declarations can be made only in function scope
    cdef cnp.ndarray[cnp.int_t, ndim=2] data
    data = np.empty((N, N), dtype=int)
    ...
    for i in range(N):
        for j in range(N):
            data[i,j] = ...   # double loop is done at nearly C speed

Compiler directives

Compiler directives can be used for turning off certain Python features for additional performance
  boundscheck (False): assume no IndexErrors
  wraparound (False): no negative indexing
  ...

import numpy as np     # normal NumPy import
cimport numpy as cnp   # import for NumPy C-API

import cython

@cython.boundscheck(False)
def func():            # declarations can be made only in function scope
    cdef cnp.ndarray[cnp.int_t, ndim=2] data
    data = np.empty((N, N), dtype=int)

Final performance

Pure Python: 5.5 s
Static type declarations: 100 ms
Kernel as C function: 69 ms
Fast indexing and directives: 15 ms

Where to add types?

Typing everything reduces readability and can even slow down the performance
Profiling should be the first step when optimising

HTML-report

Cython is able to provide an annotated HTML-report
Lines are colored according to the level of "typedness"
  white lines translate to pure C
  lines that require the Python C-API are yellow (darker as they translate to more C-API interaction)

$ cython -a cython_module.pyx
$ firefox cython_module.html

Profiling Cython code

By default, Cython code does not show up in a profile produced by cProfile
Profiling can be enabled for the entire source file or on a per-function basis

# cython: profile=True
import cython

@cython.profile(False)
cdef func():
    ...

# cython: profile=False
import cython

@cython.profile(True)
cdef func():
    ...

Summary

Cython is an optimising static compiler for Python
Possible to add type declarations with the Cython language
Fast indexing for NumPy arrays
In the best cases, huge speed ups can be obtained
Some compromise for Python flexibility

Further functionality in Cython

Using C structs and C++ classes in Cython
Exceptions handling
Parallelisation (threading) with Cython
...

Interfacing external libraries

Increasing performance with compiled code

There are Python interfaces for many high performance libraries
However, sometimes one might want to utilize a library without a Python interface
  Existing libraries
  Own code written in C or Fortran
Python C-API provides the most comprehensive way to extend Python
CFFI, Cython, and f2py can provide easier approaches

CFFI

C Foreign Function Interface for Python
Interact with almost any C code
C-like declarations within Python
  Can often be copy-pasted from headers / documentation
ABI and API modes
  ABI does not require compilation
  API can be faster and more robust
  Only API discussed here
Some understanding of C required

Creating Python interface to C library

In API mode, CFFI is used for building a Python extension module that provides an interface to the library
One needs to write a build script that specifies:
  the library functions to be interfaced
  the name of the Python extension
  instructions for compiling and linking
CFFI uses the C compiler and creates the shared library
The extension module can then be used from Python code.

Example: Python interface to C math library

from cffi import FFI
ffibuilder = FFI()

ffibuilder.cdef("""
    double sqrt(double x);   // list all the function prototypes from the
    double sin(double x);    // library that we want to use
""")

ffibuilder.set_source("_mymath",     # name of the Python extension
    """
    #include <math.h>   // Some C source, often just an include
    """,
    library_dirs=[],     # location of library, not needed for the C standard library
    libraries=['m']      # name of the library we want to interface
)

ffibuilder.compile(verbose=True)

Example: Python interface to C math library

Building the extension:

$ python3 build_mymath.py
generating ./_mymath.c
running build_ext
building '_mymath' extension
...
gcc -pthread -shared -Wl,-z,relro -g ./_mymath.o -L/usr/lib64 -lm -lpython3.6m -o ./_mymath.cpython-36m-x86_64-linux-gnu.so

Using the extension:

from _mymath import lib

a = lib.sqrt(4.5)
b = lib.sin(1.2)

Python floats are automatically converted to C doubles and back

Passing NumPy arrays to C code

Only simple scalar numbers can be automatically converted between Python objects and C types
In C, arrays are passed to functions as pointers
A "pointer" object to a NumPy array can be obtained with the cast and from_buffer functions

Passing NumPy arrays to C code

C function adding two arrays:

// c = a + b
void add(double *a, double *b, double *c, int n)
{
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

Obtaining "pointers" in Python:

from add_module import ffi, lib

a = np.random.random((1000000,1))
...
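The rest of the Python side is truncated in the source; a typical pattern using the cast and from_buffer functions mentioned above would look roughly like this (add_module and lib.add follow the snippet above, the rest is an assumed sketch):

from add_module import ffi, lib
import numpy as np

n = 1000000
a = np.random.random(n)
b = np.random.random(n)
c = np.zeros(n)

# cast the NumPy buffers to C double pointers (assumed usage)
a_ptr = ffi.cast("double *", ffi.from_buffer(a))
b_ptr = ffi.cast("double *", ffi.from_buffer(b))
c_ptr = ffi.cast("double *", ffi.from_buffer(c))

lib.add(a_ptr, b_ptr, c_ptr, n)   # c = a + b computed in C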

Summary

External libraries can be interfaced in various ways
CFFI provides easy interfacing to C libraries
  System libraries and user libraries

Multiprocessing

Processes and threads

Process
  Independent execution units
  Have their own state information and own memory address space
Thread
  A single process may contain multiple threads
  Have their own state information, but share the same memory address space

Processes and threads

Process
  Long-lived: created when the parallel program is started, killed when the program is finished
  Explicit communication between processes
Thread
  Short-lived: created when entering a parallel region, destroyed (joined) when the region ends
  Communication through shared memory

Process
  MPI
    good performance
    scales from a laptop to a supercomputer
Thread
  OpenMP
    C / Fortran, not Python
  threading module
    only for I/O bound tasks (maybe)
    Global Interpreter Lock (GIL) limits usability

Processes and threads

Process
  MPI
    good performance, scales from a laptop to a supercomputer
  multiprocessing module
    relies on OS for forking worker processes that mimic threads
    limited communication between the parallel processes

Multiprocessing

Underlying OS is used to spawn new independent subprocesses
  processes are independent and execute code in an asynchronous manner
  no guarantee on the order of execution
Communication possible only through dedicated, shared communication channels
  Queues, Pipes
  must be created before a new process is forked

Spawn a process

from multiprocessing import Process
import os

def hello(name):
    print('Hello', name)
    print('My PID is', os.getpid())
    print("My parent's PID is", os.getppid())

# Create a new process
p = Process(target=hello, args=('Alice', ))

# Start the process
p.start()

print('Spawned a new process from PID', os.getpid())

# End the process
p.join()

Communication

Sharing data
  data manager
Pipes
  direct communication between two processes
Queues
  work sharing among a group of processes
Pool of workers
  offloading tasks to a group of worker processes

Queues

FIFO (first-in-first-out) task queues that can be used to distribute work among processes
Shared among all processes
  all processes can add and retrieve data from the queue
Automatically takes care of locking, so can be used safely with minimal hassle

from multiprocessing import Process, Queue

def f(q):
    while True:
        x = q.get()
        if x is None:   # if sentinel, stop execution
            break
        print(x**2)

q = Queue()
for i in range(100):
    q.put(i)        # task queue: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, ..., 99]

for i in range(3):
    q.put(None)     # add sentinels to the queue to signal STOP

p = Process(target=f, args=(q, ))
p.start()

Pool of workers

Group of processes that carry out tasks assigned to them
  1. Master process submits tasks to the pool
  2. Pool of worker processes perform the tasks
  3. Master process retrieves the results from the pool
Blocking and non-blocking (= asynchronous) calls available

from multiprocessing import Pool
import time

def f(x):
    return x**2

pool = Pool(8)

# Blocking execution (with a single process)
result = pool.apply(f, (4,))
print(result)

# Non-blocking execution "in the background"
result = pool.apply_async(f, (12,))
while not result.ready():
    time.sleep(1)
print(result.get())

# calculate x**2 in parallel for x in 0..9
result = pool.map(f, range(10))
print(result)

# non-blocking alternative
result = pool.map_async(f, range(10))
while not result.ready():
    time.sleep(1)
print(result.get())

# an alternative to "sleeping" is to use e.g. result.get(timeout=1)

Summary

Parallelism achieved by launching new OS processes
Only limited communication possible
  work sharing: queues / pool of workers
Non-blocking execution available
  do something else while waiting for results
Further information: https://docs.python.org/2/library/multiprocessing.html

MPI for Python

Message Passing Interface

Message passing interface

MPI is an application programming interface (API) for communication between separate processes
MPI programs are portable and scalable
  the same program can run on different types of computers, from PCs to supercomputers
  the most widely used approach for distributed-memory parallel programming
MPI is flexible and comprehensive
  large (over 300 procedures)
  concise (often only 6 procedures are needed)
MPI standard defines C and Fortran interfaces
MPI for Python (mpi4py) provides an unofficial Python interface

Processes and threads

(Figure: serial regions executed by a single process, parallel regions executed by parallel processes)

Process
  Independent execution units
  Have their own state information and own memory address space
Thread
  A single process may contain multiple threads
  Have their own state information, but share the same memory address space

Execution model

MPI program is launched as a set of independent, identical processes
  execute the same program code and instructions
  can reside in different nodes (or even in different computers)
The way to launch an MPI program depends on the system
  mpiexec, mpirun, srun, aprun, ...
  mpiexec / mpirun in the training class
  srun on puhti.csc.fi

MPI rank

Rank: ID number given to a process
  it is possible to query for the rank
  processes can perform different tasks based on their rank

if (rank == 0):
    # do something
elif (rank == 1):
    # do something else
else:
    # all other processes do something different

Data model

Each MPI process has its own separate memory space, i.e. all variables and data are local to the process
Processes can exchange data by sending and receiving messages

(Figure: process 1 / rank 0 with a = 1.0, b = 2.0 and process 2 / rank 1 with a = 0.7, b = 3.1 exchanging MPI messages)

MPI communicator

Communicator: a group containing all the processes that will participate in communication
In mpi4py most MPI calls are implemented as methods of a communicator object
MPI_COMM_WORLD contains all processes (MPI.COMM_WORLD in mpi4py)
User can define custom communicators

Routines in MPI for Python

Communication between processes
  sending and receiving messages between two processes
  sending and receiving messages between several processes
Synchronization between processes
Communicator creation and manipulation
Advanced features (e.g. user defined datatypes, one-sided communication and parallel I/O)

Getting started

Basic methods of communicator object
  Get_size(): number of processes in communicator
  Get_rank(): rank of this process

from mpi4py import MPI

comm = MPI.COMM_WORLD   # communicator object containing all processes

size = comm.Get_size()
rank = comm.Get_rank()

print("I am rank %d in group of %d processes" % (rank, size))

Running an example program

hello.py:

from mpi4py import MPI

comm = MPI.COMM_WORLD   # communicator object containing all processes

size = comm.Get_size()
rank = comm.Get_rank()

print("I am rank %d in group of %d processes" % (rank, size))

$ mpiexec -n 4 python3 hello.py
I am rank 2 in group of 4 processes
I am rank 0 in group of 4 processes
I am rank 3 in group of 4 processes
I am rank 1 in group of 4 processes

Point-to-Point Communication

MPI communication

Data is local to the MPI processes
  they need to communicate to coordinate work
Point-to-point communication
  Messages are sent between two processes
Collective communication
  Involving a number of processes at the same time

MPI point-to-point operations

One process sends a message to another process that receives it
Sends and receives in a program should match - one receive per send
Each message contains
  the actual data that is to be sent
  the datatype of each element of data
  the number of elements the data consists of
  an identification number for the message (tag)
  the ranks of the source and destination process
With mpi4py it is often enough to specify only the data and the source and destination

Sending and receiving data

Arbitrary Python objects can be communicated with the send and receive methods of a communicator

.send(data, dest)
  data: Python object to send
  dest: destination rank
.recv(source)
  source: source rank
  note: data is provided as return value
Destination and source ranks have to match!

Sending and receiving a dictionary:

from mpi4py import MPI

comm = MPI.COMM_WORLD   # communicator object containing all processes
rank = comm.Get_rank()

if rank == 0:
    data = {'a': 7, 'b': 3.14}
    comm.send(data, dest=1)
elif rank == 1:
    data = comm.recv(source=0)

Blocking routines & deadlocks

send() and recv() are blocking routines
  the functions exit only once it is safe to use the data (memory) involved in the communication
Completion depends on other processes => risk for deadlocks
  for example, if all processes call recv() there is no-one left to call a corresponding send() and the program is stuck forever
Incorrect ordering of sends and receives may result in a deadlock

Typical point-to-point communication patterns

Pairwise exchange
  (Figure: processes 0-3 exchanging data in pairs)
Pipe, a ring of processes exchanging data
  (Figure: processes 0-3 passing data along a chain)
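A minimal sketch of a deadlock-free pairwise exchange (not from the slides; even and odd ranks simply order their send and recv differently, and an even number of processes is assumed):

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

data = {'rank': rank}

if rank % 2 == 0:
    pair = rank + 1
    comm.send(data, dest=pair)         # even ranks send first, then receive
    received = comm.recv(source=pair)
else:
    pair = rank - 1
    received = comm.recv(source=pair)  # odd ranks receive first, then send
    comm.send(data, dest=pair)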

Case study: parallel sum

Initial state: an array A containing floating point numbers, read from a file by the first MPI task (rank 0)
Goal: calculate the total sum of all elements in array A in parallel

1. Scatter the data
   1.1. receive operation for scatter
   1.2. send operation for scatter
2. Compute partial sums in parallel
3. Gather the partial sums
   3.1. receive operation for gather
   3.2. send operation for gather
4. Compute the total sum

(Figures: the following slides illustrate each step using the memory of the two processes P0 and P1)

Step 1.1: Receive operation for scatter
Step 1.2: Send operation for scatter
Step 2: Compute partial sums in parallel
Step 3.1: Receive operation for gather
Step 3.2: Send operation for gather
Step 4: Compute the total sum
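A sketch of how these steps could look with the point-to-point routines introduced above (an illustration, not the course's model solution; two processes are assumed):

from mpi4py import MPI
import numpy

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    data = numpy.arange(100, dtype=float)   # stands in for the array read from a file
    comm.send(data[50:], dest=1)            # 1. scatter: send the second half to rank 1
    local = data[:50]
else:
    local = comm.recv(source=0)             # 1. scatter: receive on rank 1

partial = local.sum()                        # 2. compute partial sums in parallel

if rank == 0:
    other = comm.recv(source=1)              # 3. gather the partial sums
    total = partial + other                  # 4. compute the total sum
    print("Total sum:", total)
else:
    comm.send(partial, dest=0)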

Communicating NumPy arrays

Arbitrary Python objects are converted to byte streams (pickled) when sending and back to Python objects (unpickled) when receiving
  these conversions may be a serious overhead to communication
Contiguous memory buffers (such as NumPy arrays) can be communicated with very little overhead using the upper case methods:
  Send(data, dest)
  Recv(data, source)
  note the difference in receiving: the data array has to exist at the time of call

Send/receive a NumPy array

Note the difference between upper/lower case!
  send/recv: general Python objects, slow
  Send/Recv: contiguous arrays, fast

from mpi4py import MPI
import numpy

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

data = numpy.empty(100, dtype=float)
if rank == 0:
    data[:] = numpy.arange(100, dtype=float)
    comm.Send(data, dest=1)
elif rank == 1:
    comm.Recv(data, source=0)

Combined send and receive

Send one message and receive another with a single command
  reduces risk for deadlocks
Destination and source ranks can be same or different
MPI.PROC_NULL can be used for no destination/source

data = numpy.arange(10, dtype=float) * (rank + 1)
buffer = numpy.empty(data.shape, dtype=data.dtype)

if rank == 0:
    dest, source = 1, 1
elif rank == 1:
    dest, source = 0, 0

comm.Sendrecv(data, dest=dest, recvbuf=buffer, source=source)

MPI datatypes

MPI has a number of predefined datatypes to represent data
  e.g. MPI.INT for integer and MPI.DOUBLE for float
No need to specify the datatype for Python objects or Numpy arrays
  objects are serialised as byte streams
  automatic detection for NumPy arrays
If needed, one can also define custom datatypes
  for example to use non-contiguous data buffers

Summary

Point-to-point communication = messages are sent between two MPI processes
Point-to-point operations enable any parallel communication pattern (in principle)
Arbitrary Python objects (that can be pickled!)
  send / recv
  sendrecv
Memory buffers such as Numpy arrays
  Send / Recv
  Sendrecv

Non-blocking Communication

Non-blocking communication

Non-blocking sends and receives
  isend & irecv
  return immediately and send/receive in the background
  return value is a Request object
Enables some computing concurrently with communication
Avoids many common dead-lock situations
You can mix non-blocking and blocking p2p routines
  e.g., receive a message sent with isend using recv

Have to finalize send/receive operations
  wait(): waits for the communication started with isend or irecv to finish (blocking)
  test(): tests if the communication has finished (non-blocking)

Example: non-blocking send/receive

rank = comm.Get_rank()
size = comm.Get_size()

if rank == 0:
    data = arange(size, dtype=float) * (rank + 1)
    req = comm.Isend(data, dest=1)     # start a send
    calculate_something(rank)          # .. do something else ..
    req.wait()                         # wait for send to finish
    # safe to read/write data again

elif rank == 1:
    data = empty(size, float)
    req = comm.Irecv(data, source=0)   # post a receive
    calculate_something(rank)          # .. do something else ..
    req.wait()                         # wait for receive to finish
    # data is now ready for use

Multiple non-blocking operations

Methods waitall() and waitany() may come in handy when dealing with multiple non-blocking operations (available in the MPI.Request class)
  Request.waitall(requests): wait for all initiated requests to complete
  Request.waitany(requests): wait for any initiated request to complete
For example, assuming requests is a list of request objects, one can wait for all of them to be finished with:

MPI.Request.waitall(requests)

Example: non-blocking message chain

from mpi4py import MPI
import numpy

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

data = numpy.arange(10, dtype=float) * (rank + 1)   # send buffer
buffer = numpy.zeros(10, dtype=float)               # receive buffer

tgt = rank + 1
src = rank - 1
if rank == 0:
    src = MPI.PROC_NULL
if rank == size - 1:
    tgt = MPI.PROC_NULL

req = []
req.append(comm.Isend(data, dest=tgt))
req.append(comm.Irecv(buffer, source=src))

MPI.Request.waitall(req)

Overlapping computation and communication

request_in = comm.Irecv(ghost_data)
request_out = comm.Isend(border_data)

compute(ghost_independent_data)

request_in.wait()
compute(border_data)

request_out.wait()

Summary

Non-blocking communication is usually the smart way to do point-to-point communication in MPI
Non-blocking communication realization
  isend / Isend
  irecv / Irecv
  request.wait()

Communicators

Communicators

The communicator determines the "communication universe"
  The source and destination of a message is identified by process rank within the communicator
So far: MPI.COMM_WORLD
Processes can be divided into subcommunicators
  Task level parallelism with process groups performing separate tasks
  Collective communication within a group of processes
  Parallel I/O

Communicators are dynamic
A task can belong simultaneously to several communicators
  Unique rank in each communicator

(Figure: MPI_COMM_WORLD with ranks 0-7 split into Sub comm 1 and Sub comm 2, each with its own ranks 0-3)

User-defined communicators

By default a single, universal communicator exists to which all processes belong (MPI.COMM_WORLD)
One can create new communicators, e.g. by splitting this into sub-groups

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

color = rank % 4

local_comm = comm.Split(color)
local_rank = local_comm.Get_rank()

print("Global rank: %d Local rank: %d" % (rank, local_rank))

Collective Communication

Collective communication

Collective communication transmits data among all processes in a process group (communicator)
  these routines must be called by all the processes in the group
  amount of sent and received data must match
Collective communication includes
  data movement
  collective computation
  synchronization
Example: comm.barrier() makes every task hold until all tasks in the communicator comm have called it

Collective communication typically outperforms point-to-point communication
Code becomes more compact (and efficient!) and easier to maintain
For example, communicating a Numpy array of 1M elements from task 0 to all other tasks:

if rank == 0:
    for i in range(1, size):
        comm.Send(data, i)
else:
    comm.Recv(data, 0)

versus a single collective call:

comm.Bcast(data, 0)

Broadcast

Send the same data from one process to all the others
(Figure: MPI_Bcast copies the same buffer A from rank 0 to ranks 1 ... N-1; the buffer may contain multiple elements)

Broadcast sends the same data to all processes:

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    py_data = {'key1' : 0.0, 'key2' : 11}   # Python object
    data = np.arange(8) / 10.               # NumPy array
else:
    py_data = None
    data = np.zeros(8)

new_data = comm.bcast(py_data, root=0)
comm.Bcast(data, root=0)

Scatter

Send an equal amount of data from one process to the others
  Segments A, B, ... may contain multiple elements
(Figure: MPI_Scatter splits the send buffer A B C D on one process into one segment per process)

Scatter distributes data to processes:

from mpi4py import MPI
from numpy import arange, empty

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

if rank == 0:
    py_data = range(size)
    data = arange(size**2, dtype=float)
else:
    py_data = None
    data = None

new_data = comm.scatter(py_data, root=0)   # returns the value

buffer = empty(size, float)                # prepare a receive buffer
comm.Scatter(data, buffer, root=0)         # in-place modification

Gather

Collect data from all the processes to one process
  Segments A, B, ... may contain multiple elements
(Figure: MPI_Gather collects one segment from each process into the receive buffer A B C D on one process)

Gather pulls data from all processes:

from mpi4py import MPI
from numpy import arange, zeros

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

data = arange(10, dtype=float) * (rank + 1)
buffer = zeros(size * 10, float)

n = comm.gather(rank, root=0)        # returns the value
comm.Gather(data, buffer, root=0)    # in-place modification

Reduce

Applies an operation over a set of processes and places the result in a single process
(Figure: MPI_Reduce with the sum operation combines the send buffers of all processes element-wise, producing $\sum_i A_i, \sum_i B_i, \sum_i C_i, \sum_i D_i$ in the receive buffer of one process)

Reduce gathers data and applies an operation on it:

from mpi4py import MPI
from numpy import arange, zeros

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

data = arange(10 * size, dtype=float) * (rank + 1)
buffer = zeros(size * 10, float)

n = comm.reduce(rank, op=MPI.SUM, root=0)       # returns the value
comm.Reduce(data, buffer, op=MPI.SUM, root=0)   # in-place modification

Other common collective operations

Scatterv
  each process receives a different amount of data
Gatherv
  each process sends a different amount of data
Allreduce
  all processes receive the results of the reduction
Alltoall
  each process sends and receives to/from each other
Alltoallv
  each process sends and receives a different amount of data

Non-blocking collectives

New in MPI 3: no support in mpi4py
Non-blocking collectives enable the overlapping of communication and computation together with the benefits of collective communication
Restrictions
  have to be called in the same order by all ranks in a communicator
  mixing of blocking and non-blocking collectives is not allowed
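A short sketch of Allreduce from the list above (an assumed example, following the same pattern as Reduce but with no root, so every process gets the result):

from mpi4py import MPI
import numpy

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

data = numpy.arange(10, dtype=float) * (rank + 1)
buffer = numpy.zeros(10, dtype=float)

total = comm.allreduce(rank, op=MPI.SUM)   # every rank gets the summed value
comm.Allreduce(data, buffer, op=MPI.SUM)   # every rank gets the element-wise summed array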

Common mistakes with collectives

1. Using a collective operation within one branch of an if-else test based on the rank of the process
   for example: if rank == 0: comm.bcast(...)
   all processes in a communicator must call a collective routine!
2. Assuming that all processes making a collective call would complete at the same time
3. Using the input buffer also as an output buffer
   for example: comm.Scatter(a, a, MPI.SUM)
   always use different memory locations (arrays) for input and output!

Summary

Collective communications involve all the processes within a communicator
  all processes must call them
Collective operations make code more transparent and compact
Collective routines allow optimizations by the MPI library
MPI-3 contains also non-blocking collectives, but these are currently not supported by MPI for Python

On-line resources

Documentation for mpi4py is quite limited
  a short on-line manual and API reference are available at http://pythonhosted.org/mpi4py/
Some good references:
  "A Python Introduction to Parallel Programming with MPI" by Jeremy Bejarano: http://materials.jeremybejarano.com/MPIwithPython/
  "mpi4py examples" by Jörg Bornschein: https://github.com/jbornschein/mpi4py-examples

Summary

mpi4py provides a Python interface to MPI
MPI calls via communicator object
Possible to communicate arbitrary Python objects
NumPy arrays can be communicated with nearly the same speed as in C/Fortran
