Introduction to Parallel Programming (For Physicists) FRANÇOIS GÉLIS & GRÉGOIRE MISGUICH, IPhT Courses, June 2019

1. Introduction & hardware aspects (FG)
2. A few words about Maple & Mathematica
3. Linear algebra libraries
4. Fast Fourier transform
5. Python Multiprocessing
6. OpenMP
7. MPI (FG)
8. MPI+OpenMP (FG)
(These slides: GM)

Numerical linear algebra

Here: a few simple examples showing how to call some parallel linear algebra libraries in numerical calculations.

(Numerical) linear algebra

• Basic Linear Algebra Subroutines: BLAS
  • vector op. (=level 1)
  • matrix-vector (=level 2)
  • matrix-matrix mult. & triangular inversion (=level 3)
  • Many implementations but standardized interface
  • Discussed here: Intel MKL & OpenBLAS (multi-threaded = parallelized for shared-memory architectures)
• More advanced operations: Linear Algebra Package, LAPACK (‘90, Fortran 77)
  • Calls the BLAS routines
  • Matrix diagonalization, linear systems and eigenvalue problems
  • Matrix decompositions: LU, QR, SVD, Cholesky
  • Many implementations

Used in most scientific software & libraries (Python/Numpy, Maple, Mathematica, Matlab, …)

A few other useful libs … for BIG matrices

• ARPACK = Implicitly Restarted Arnoldi Method (~Lanczos for Hermitian cases). Large scale eigenvalue problems, for (usually sparse) n×n matrices with n which can be as large as 10^8, or even more! Ex: iterative algo. to find the largest eigenvalue, without storing the matrix M (just provide v→Mv). Can be used from Python/SciPy.
• PARPACK = parallel version of ARPACK for distributed memory architectures (the matrices are stored over several nodes)
• ScaLAPACK = parallel version of LAPACK for distributed memory architectures

Two multi-threaded implementations of BLAS & Lapack

OpenBLAS
• Open source license (BSD)
• Based on GotoBLAS2 [created by Kazushige Goto, Texas Adv. Computing Center, Univ. of Texas]
• Can be used from Fortran, C, C++, Python/Numpy, …

Intel’s implementation (=part of the MKL lib.)
• Commercial license
• Included in Intel® Parallel Studio XE (compilers, libraries, …), which is free for students
• Included in Intel® Performance Libraries (libraries only, without compiler), free for everyone
• Installed in most/many computing centers / Intel-based clusters
• Included in intel-python, which is free for all users
• Can be used from Fortran, C, C++ or Python/Numpy

Lapack/Intel-MKL

Code examples in C or FORTRAN: https://software.intel.com/sites/products/documentation/doclib/mkl_sa/11/mkl_lapack_examples/

Intel® Math Kernel Library LAPACK Examples
This document provides code examples for LAPACK (Linear Algebra PACKage) routines that solve problems in the following fields:
• Linear Equations: examples for several LAPACK routines that solve systems of linear equations.
• Linear Least Squares Problems: examples for some of the LAPACK routines that find solutions to linear least squares problems.
• Symmetric Eigenproblems: examples for LAPACK routines that compute eigenvalues and eigenvectors of real symmetric and complex Hermitian matrices.
• Nonsymmetric Eigenproblems: examples for ?geev, one of several LAPACK routines that compute eigenvalues and eigenvectors of general matrices.
• Singular Value Decomposition: examples for LAPACK routines that compute the singular value decomposition of a general rectangular matrix.

Lapack-MKL example in Fortran: DGELSD, from Intel’s website (part 1/3)

    *  The routine computes the minimum-norm solution to a real linear least
    *  squares problem: minimize ||b - A*x|| using the singular value
    *  decomposition (SVD) of A. A is an m-by-n matrix which may be
    *  rank-deficient.
    *
    *  Several right hand side vectors b and solution vectors x can be
    *  handled in a single call; they are stored as the columns of the
    *  m-by-nrhs right hand side matrix B and the n-by-nrhs solution
    *  matrix X.
    *
    *  The effective rank of A is determined by treating as zero those
    *  singular values which are less than rcond times the largest
    *  singular value.
    *
    *  Example Program Results.
    *  ========================
    *
    *  DGELSD Example Program Results
    *
    *  Minimum norm solution
    *   -0.69  -0.24   0.06
    *   -0.80  -0.08   0.21
    *    0.38   0.12  -0.65
    *    0.29  -0.24   0.42
    *    0.29   0.35  -0.30
    (…)

Here: minimize $\lVert \vec{b} - A\vec{x} \rVert_2$ with $A$ a rectangular matrix and $\vec{b}$ a vector.

Naming convention of the LAPACK routines: XYYZZZ
• X: data type
  S  REAL
  D  DOUBLE PRECISION
  C  COMPLEX
  Z  COMPLEX*16 or DOUBLE COMPLEX
• YY: matrix type
  BD  bidiagonal
  DI  diagonal
  GB  general band
  GE  general
  GG  general matrices, generalized problem (i.e., a pair of general matrices)
  GT  general tridiagonal
  (…)
• ZZZ: type of computation. Here LSD stands for minimum norm solution to a linear least squares problem using the singular value decomposition of A and a divide and conquer method.

Lapack-MKL example in Fortran (part 2/3)

    *     .. Parameters ..
          INTEGER          M, N, NRHS
          PARAMETER        ( M = 4, N = 5, NRHS = 3 )
          INTEGER          LDA, LDB
          PARAMETER        ( LDA = M, LDB = N )
          INTEGER          LWMAX
          PARAMETER        ( LWMAX = 1000 )
    *
    *     .. Local Scalars ..
          INTEGER          INFO, LWORK, RANK
          DOUBLE PRECISION RCOND
    *
    *     .. Local Arrays ..
    *     IWORK dimension should be at least
    *     3*MIN(M,N)*NLVL + 11*MIN(M,N),
    *     where
    *     NLVL = MAX( 0,INT(LOG_2( MIN(M,N)/(SMLSIZ+1)))+1)
    *     and SMLSIZ = 25
          INTEGER          IWORK( 3*M*0+11*M )
          DOUBLE PRECISION A( LDA, N ), B( LDB, NRHS ), S( M ),
         $                 WORK( LWMAX )
          DATA             A/
         $  0.12,-6.91,-3.33, 3.97,
         $ -8.19, 2.22,-8.94, 3.33,
         $  7.69,-5.12,-6.72,-2.74,
         $ -2.26,-9.08,-4.40,-7.92,
         $ -4.71, 9.96,-9.98,-3.20
         $                  /
          DATA             B/
         $  7.30, 1.33, 2.68,-9.62, 0.00,
         $  0.47, 6.58,-1.71,-0.79, 0.00,
         $ -6.28,-3.42, 3.46, 0.41, 0.00
         $                  /
    *
    *     .. External Subroutines ..
          EXTERNAL         DGELSD
          EXTERNAL         PRINT_MATRIX
    *
    *     .. Intrinsic Functions ..
          INTRINSIC        INT, MIN

• M, N: size of A.
• NRHS: number of vectors $\vec{b}$ for which the problem must be solved.
• Lapack routines require you to provide some memory space to work (in the form of array(s)). Here: WORK, a double precision array of size LWORK (see next slide).
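The IWORK bound quoted in the comments of part 2/3 can be evaluated by hand. The sketch below is a plain C helper that does so; the name `dgelsd_iwork_min` is hypothetical (it is not an MKL or LAPACK routine). It assumes that the ratio MIN(M,N)/(SMLSIZ+1) is a real number, as the LOG_2 suggests, and that INT truncates toward zero as the Fortran intrinsic does. It uses integer arithmetic only, so no math library is needed.

```c
#include <assert.h>

/* Hypothetical helper: evaluates the IWORK bound quoted in the slide,
 *   3*MIN(M,N)*NLVL + 11*MIN(M,N),
 *   NLVL = MAX(0, INT(LOG_2(MIN(M,N)/(SMLSIZ+1))) + 1),  SMLSIZ = 25.
 * Assumption: the ratio is real-valued and INT truncates toward zero. */
long dgelsd_iwork_min(long m, long n)
{
    assert(m > 0 && n > 0);
    const long smlsiz = 25;              /* value quoted in the slide */
    long minmn = (m < n) ? m : n;
    long nlvl;

    /* With x = minmn/(smlsiz+1):
     *   x <= 1/2      -> INT(log2 x) <= -1 -> NLVL = 0
     *   1/2 < x < 1   -> INT(log2 x) =  0  -> NLVL = 1
     *   x >= 1        -> NLVL = floor(log2 x) + 1                     */
    if (2 * minmn <= smlsiz + 1) {
        nlvl = 0;
    } else {
        nlvl = 1;
        /* one more level for each e >= 1 with (smlsiz+1)*2^e <= minmn */
        for (long p = 2 * (smlsiz + 1); p <= minmn; p *= 2)
            nlvl++;
    }
    return 3 * minmn * nlvl + 11 * minmn;
}
```

For the example's M = 4, N = 5, `dgelsd_iwork_min(4, 5)` gives NLVL = 0 and a length of 44, which is what the declaration `IWORK( 3*M*0+11*M )` amounts to. For larger problems NLVL grows logarithmically, e.g. MIN(M,N) = 1000 gives NLVL = 6.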
Lapack-MKL example in Fortran (part 3/3)

First call to DGELSD with LWORK=-1 → LAPACK returns the optimal size LWORK of the workspace array WORK. The second call performs the actual calculation.

    *     .. Executable Statements ..
          WRITE(*,*)'DGELSD Example Program Results'
    *     Negative RCOND means using default (machine precision) value
          RCOND = -1.0
    *
    *     Query the optimal workspace.
    *
          LWORK = -1
          CALL DGELSD( M, N, NRHS, A, LDA, B, LDB, S, RCOND, RANK, WORK,
         $             LWORK, IWORK, INFO )
          LWORK = MIN( LWMAX, INT( WORK( 1 ) ) )
    *
    *     Solve the equations A*X = B.
    *
          CALL DGELSD( M, N, NRHS, A, LDA, B, LDB, S, RCOND, RANK, WORK,
         $             LWORK, IWORK, INFO )
    *
    *     Check for convergence.
    *
          IF( INFO.GT.0 ) THEN
             WRITE(*,*)'The algorithm computing SVD failed to converge;'
             WRITE(*,*)'the least squares solution could not be computed.'
             STOP
          END IF
    *
    *     Print minimum norm solution.
    *
          CALL PRINT_MATRIX( 'Minimum norm solution', N, NRHS, B, LDB )
    *
    *     Print effective rank.
    *
          WRITE(*,'(/A,I6)')' Effective rank = ', RANK
    *
    *     Print singular values.
    *
          CALL PRINT_MATRIX( 'Singular values', 1, M, S, 1 )
          STOP
          END
    *
    *     End of DGELSD Example.

Compilation and output:

    $ ifort -mkl DGELSD_example.f
    $ ./a.out
     DGELSD Example Program Results

     Minimum norm solution
      -0.69  -0.24   0.06
      -0.80  -0.08   0.21
       0.38   0.12  -0.65
       0.29  -0.24   0.42
       0.29   0.35  -0.30

     Effective rank =      4

     Singular values
      18.66  15.99  10.01   8.51

Lapack/Intel-MKL zheevd example in C

    #include <stdlib.h>
    #include <stdio.h>
    #include <mkl.h>
    // C-wrapper to the Fortran Lapack lib.:
    #include <mkl_lapacke.h>

    // LAPACKE_zheevd: computes all eigenvalues and eigenvectors of a
    // complex Hermitian matrix A using divide and conquer algorithm
    int main() {
      const int n=2000;
      // N and LDA are declared before their first use
      MKL_INT N = n, LDA = n, info,i,j;
      printf("Matrix size=%i\n",n);
      // prints the num. of threads
      printf("Number of threads=%i\n", mkl_get_max_threads());

      double w[N];   // w: array containing the eigenvalues
      MKL_Complex16* a;
      a=malloc(N*LDA*sizeof(MKL_Complex16));

      srand(999); /* dense random matrix */
      for (i=0; i < N; i++ )
        for(j = 0; j<N; j++ ) {
          a[i*LDA+j].real=rand();
          a[i*LDA+j].imag=rand();
        }

      // Call to Lapack ('L': only the lower triangle of a is used)
      info = LAPACKE_zheevd( LAPACK_ROW_MAJOR, 'V', 'L', N, a, LDA, w );
      /* Check for convergence */
      if( info > 0 ) {
        printf( "The algorithm failed to compute eigenvalues.\n" );
        exit( 1 );
      }
      /* Print the extreme eigenvalues */
      printf("Smallest eigen value=%6.2f\n",w[0]);
      printf("Largest value=%6.2f\n\n",w[N-1]);
      free(a);
      return 0;
    }

Call to Lapack: this version takes care of the workspace memory management (contrary to the Fortran version).

OpenBLAS/LAPACK zheevd example in C

    #include <stdlib.h>
    #include <stdio.h>
    #include <omp.h>
    // C-wrapper to the Fortran Lapack lib.
    #include <lapacke.h>

    // LAPACKE_zheevd: computes all eigenvalues and eigenvectors of a
    // complex Hermitian matrix A using divide and conquer algorithm
    int main() {
      const int N=2000;
      printf("Matrix size=%i\n",N);
      printf("Number of threads=%i\n", omp_get_max_threads());

      int LDA=N, info,i,j;
      double w[N];
      lapack_complex_double* a;
      a=malloc(N*LDA*sizeof(lapack_complex_double));

      srand(999); /* Dense random matrix */
      for (i=0; i < N; i++ )
        for(j = 0; j<N; j++ )
          a[i*LDA+j]=lapack_make_complex_double(rand(),rand());

      info = LAPACKE_zheevd( LAPACK_ROW_MAJOR, 'V', 'L', N, a, LDA, w );
      /* Check for convergence */
      if( info > 0 ) {
        printf( "The algorithm failed to compute eigenvalues.\n" );
        exit( 1 );
      }
      /* Print the extreme eigenvalues */
      printf("Smallest eigen value=%6.2f\n",w[0]);
      printf("Largest value=%6.2f\n\n",w[N-1]);
      free(a);
      return 0;
    }

OpenBLAS/CBLAS dgemm example in C

Matrix-matrix multiplication (BLAS) of double precision general matrices.

    #include <stdio.h>
    #include <stdlib.h>
    #include <cblas.h>
    int main() {
      int N=10000,N2,i,j;
      N2=N*N;
      //Memory allocation for the arrays:
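As a reminder of what dgemm computes, the operation is C ← alpha·op(A)·op(B) + beta·C. The sketch below is a naive, single-threaded triple loop in plain C for the no-transpose, row-major case. It is neither OpenBLAS code nor the continuation of the slide's example; the name `naive_dgemm` and its argument order are hypothetical. Such a reference version can serve to cross-check a library call on small matrices.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference routine (not part of BLAS/OpenBLAS):
 *   C <- alpha * A * B + beta * C
 * with A (m x k), B (k x n), C (m x n), all stored row-major and
 * contiguous, and no transposition. */
void naive_dgemm(size_t m, size_t n, size_t k, double alpha,
                 const double *a, const double *b,
                 double beta, double *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            /* dot product of row i of A with column j of B */
            for (size_t p = 0; p < k; p++)
                sum += a[i * k + p] * b[p * n + j];
            c[i * n + j] = alpha * sum + beta * c[i * n + j];
        }
    }
}
```

The loop performs about 2·m·n·k floating-point operations. An optimized multi-threaded BLAS performs the same arithmetic but blocks it for the caches and distributes it over threads, which is why it is the library call, not a loop like this one, that should be used for large N.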
