Parallel Computing
Benson Muite
[email protected]
http://math.ut.ee/~benson
https://courses.cs.ut.ee/2014/paralleel/fall/Main/HomePage
6 October 2014

Parallel Programming Models and Parallel Programming Implementations

Parallel Programming Models
• Shared memory
• Distributed memory
• Simple computation and communication cost models

Parallel Programming Implementations
• OpenMP
• MPI
• OpenCL – a C-like programming language that aims to be manufacturer independent and to support a wide range of multithreaded devices
• CUDA (Compute Unified Device Architecture) – used for Nvidia GPUs, based on C/C++, makes numerical computations easier
• Pthreads – an older threaded programming interface; some legacy codes are still around; lower level than OpenMP but not as easy to use
• Unified Parallel C (UPC) – a newer-generation parallel programming language with distributed array support; consider trying it
• Coarray Fortran (CAF) – a newer-generation parallel programming language with distributed array support; consider trying it
• Julia (http://julialang.org/) – a new serial and parallel programming language; consider trying it
• Parallel Virtual Machine (PVM) – older, has been replaced by MPI
• OpenACC – brief mention
• Not covered: Java threads, C++11 threads, Python threads, High Performance Fortran (HPF)

Parallel Programming Models: Flynn’s Taxonomy
• Single Instruction Single Data (SISD) – serial
• Single Program Multiple Data (SPMD)
• Multiple Program Multiple Data (MPMD)
• Multiple Instruction Single Data (MISD)
• Master worker
• Server client

Parallel Programming Models
• Loop parallelism; often occurs in numerical computations and is the easiest way to find parallelism in a serial program
• Task parallelism; often used on personal computers running several pieces of software (email client, browser, text editor, etc.)

Shared Memory Programming
• Easiest paradigm
• Either random access memory is physically shared between all cores, or software allows one to program a distributed memory machine as if it had shared memory
• To get good performance, need to be aware of non-uniform memory access
• To get correct results, need to be aware of false sharing and prevent it from occurring
• Example APIs are OpenMP, OpenCL, Pthreads, Julia, OpenACC, Java threads, C++11 threads, C++14 threads, Python threads

Shared Memory Programming: Non-uniform Memory Access
• A location in memory can be far from a core, hence many cycles are needed to get the information
• On Rocket, there are 2 sockets (chips) per node, and each chip has 10 cores
• A single socket can be programmed using OpenMP, so it looks like shared memory
• However, it is better to think of it as globally accessible memory, half of which is on one socket and half on the other
• It can take longer to access memory that is far away than memory that is close by
• How would you modify the stream test to measure this?
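One possible answer to the last question, not from the original slides: a minimal sketch in C with OpenMP of a STREAM-style triad, assuming a Linux-like first-touch page placement policy and threads pinned to cores (for example with OMP_PROC_BIND=true and OMP_PLACES=cores). The array length N and the file name numa_triad.c are illustrative choices. Comparing the two timings gives a rough estimate of the penalty for accessing memory on the far socket.

  /* numa_triad.c: compile with something like gcc -O2 -fopenmp numa_triad.c */
  #include <stdio.h>
  #include <stdlib.h>
  #include <omp.h>

  #define N 20000000L   /* array length, much larger than the caches */

  static double triad(double *a, double *b, double *c)
  {
      double start = omp_get_wtime();
  #pragma omp parallel for
      for (long i = 0; i < N; i++)
          a[i] = b[i] + 2.0 * c[i];
      return omp_get_wtime() - start;
  }

  int main(void)
  {
      double *a = malloc(N * sizeof(double));
      double *b = malloc(N * sizeof(double));
      double *c = malloc(N * sizeof(double));

      /* Case 1: serial initialization; first touch places all pages near the
       * master thread's socket, so threads on the other socket use far memory. */
      for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }
      printf("serial first touch:   %f s\n", triad(a, b, c));

      free(a); free(b); free(c);
      a = malloc(N * sizeof(double));
      b = malloc(N * sizeof(double));
      c = malloc(N * sizeof(double));

      /* Case 2: parallel initialization; each thread first touches the part of
       * the arrays it will later use, so most accesses stay on the local socket. */
  #pragma omp parallel for
      for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }
      printf("parallel first touch: %f s\n", triad(a, b, c));

      free(a); free(b); free(c);
      return 0;
  }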
Shared Memory Programming: False Sharing
• Cache lines may be updated and have different values than what is in main memory
• Two different cache lines, belonging to two different cores, may correspond to the same place in physical memory
• There are several different means of trying to ensure consistency
• For general purpose code, avoid false sharing
• For code for a specific machine with a specific compiler, find out the coherence protocol to try to optimize performance while preserving correctness
• It is possible to use OpenMP in a distributed memory manner

Shared Memory Programming: False Sharing

  #include <stdlib.h>
  #include <omp.h>
  unsigned int const SIZE=16;
  unsigned int const ITER=48000000;
  int main() {
    int i,j;
    //_declspec(align(16))
    double a[SIZE],b[SIZE],c[SIZE];
    //initialize
    for (i=0; i<SIZE; i++) {
      c[i]=b[i]=a[i]=(double)rand();
    }
    /* j must be private so that each thread has its own inner loop counter;
       all threads still update the same small array a, which is the source
       of the false sharing */
    #pragma omp parallel for private(j)
    for(i=0; i<ITER; i++) {
      #pragma vector aligned (a,b,c)
      for(j=0; j<SIZE; j++) {
        a[j]=b[j]*c[j]+a[j];
      }
    }
    return 0;
  }

Listing 1: A C++ program to find maximum floating point operations, adapted from Rahman (2013).

Shared Memory Programming: False Sharing

  PROGRAM falsesharing1
    USE omp_lib
    IMPLICIT NONE
    INTEGER(kind=4), PARAMETER :: SIZE = 16
    INTEGER(kind=4), PARAMETER :: ITER = 480000
    INTEGER(kind=4) :: i, j, maxthreads
    DOUBLE PRECISION :: a(1:SIZE), b(1:SIZE), c(1:SIZE)
    ! initialize the arrays
    DO i=1,SIZE
      a(i)=i*0.1**5
      b(i)=i*0.2**3
      c(i)=i*0.3**3
    ENDDO
    ! all threads repeatedly update the same small array a, so its cache
    ! lines bounce between cores (false sharing)
    !$OMP PARALLEL DO
    DO j=1,ITER
      DO i=1,SIZE
        a(i)=b(i)*c(i)+a(i)
      ENDDO
    ENDDO
    !$OMP END PARALLEL DO
    PRINT *, a
  END PROGRAM falsesharing1

Listing 2: A Fortran program to demonstrate false sharing, adapted from Rahman (2013).

Shared Memory Programming: False Sharing

  PROGRAM falsesharing2
    USE omp_lib
    IMPLICIT NONE
    INTEGER(kind=4), PARAMETER :: SIZE = 16
    INTEGER(kind=4), PARAMETER :: ITER = 480000
    INTEGER(kind=4) :: i, j, mytid
    DOUBLE PRECISION :: a(1:SIZE), b(1:SIZE), c(1:SIZE)
    ! initialize the arrays
    DO i=1,SIZE
      a(i)=i*0.1**5
      b(i)=i*0.2**3
      c(i)=i*0.3**3
    ENDDO
    ! the reduction gives each thread a private copy of a that is combined
    ! at the end, so threads no longer write to shared cache lines
    !$OMP PARALLEL DO PRIVATE(i,j) SHARED(b,c) REDUCTION(+:a)
    DO j=1,ITER
      DO i=1,SIZE
        a(i)=b(i)*c(i)+a(i)
      ENDDO
    ENDDO
    !$OMP END PARALLEL DO
    PRINT *, a
  END PROGRAM falsesharing2

Listing 3: A Fortran program to demonstrate a fix for false sharing, adapted from Rahman (2013).

Comments on Shared Memory Programming
• Find out the default behavior for your programming environment, or better yet explicitly state the behavior for each variable
• Check which standard is being adhered to; the OpenMP 3 and OpenMP 4 standards support many options
• When allocating memory, some systems have a first touch policy: the first thread to put something in the allocated space determines where that allocated space is placed
• It is possible to do shared memory programming in a distributed memory style: each thread gets an id and does its own operations, with synchronization steps carefully planned

Distributed Memory Programming
• It is difficult to construct large machines with shared memory
• Even when one does so, performance with simple shared memory programming can be poor
• Distributed memory programming can be quite effective on shared memory machines, e.g. the Intel Xeon Phi
• The most common interface is the Message Passing Interface (MPI)
• Here one recognizes that each process has its own separate memory space and that processes need to communicate with each other, as in the sketch below
• It is still an abstraction that may not fully map to the hardware, but it approximates what the hardware does better than a shared memory software layer, and it still allows for portability
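Not from the original slides: a minimal sketch of the message-passing model in C with MPI. Each process owns its own variable x, and values move between the separate memory spaces only through explicit communication calls; the ring exchange and the file name ring.c are illustrative choices.

  /* ring.c: compile and run with something like
     mpicc ring.c -o ring && mpirun -np 4 ./ring */
  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      int rank, nproc, x, received;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nproc);

      x = 100 + rank;                 /* private to this process           */
      int right = (rank + 1) % nproc; /* neighbour to send to              */
      int left  = (rank - 1 + nproc) % nproc;

      /* Combined send and receive avoids deadlock in the ring exchange. */
      MPI_Sendrecv(&x, 1, MPI_INT, right, 0,
                   &received, 1, MPI_INT, left, 0,
                   MPI_COMM_WORLD, MPI_STATUS_IGNORE);

      printf("process %d of %d sent %d and received %d\n",
             rank, nproc, x, received);

      MPI_Finalize();
      return 0;
  }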
MPI
• The current version is MPI 3.0
• MPI 1.0, 1994: the result of one year of work in meetings and email discussion through the MPI Forum
• MPI 1.0 had most of the functionality in use today, Fortran and C bindings, and communicators; the largest missing functionality was MPI-IO
• MPI 1.1, 1995: minor corrections and clarifications
• MPI 2, 1997: parallel I/O, remote memory access, dynamic process management, C++ and Fortran 90 bindings (the C++ bindings were later deprecated)
• MPI 3, 2012: nonblocking collectives, extended one sided operations, C++ bindings removed, Fortran 2008 bindings

An example program: Matrix Vector Multiplication
• http://www.mcs.anl.gov/research/projects/mpi/usingmpi/examples/simplempi/main.htm

An example program: Matrix Vector Multiplication
• The program gives one way to do the matrix vector multiplication Ab = c
• The master initializes the matrix, then sends rows to workers to do a vector dot product (a sketch of this master-worker scheme is given after the Summary slide below)
• Operation count: n multiplications and n − 1 additions per dot product, and n dot products, so n × (n + n − 1) = 2n² − n arithmetic operations
• Communication count, neglecting the cost of sending the initial vector: n elements sent and 1 element returned per dot product, and n dot products, so n × (n + 1) = n² + n
• Computational intensity: (2n² − n)/(n² + n) ≈ 2 for large n
• Would such a strategy be good for the Intel Xeon Phi?
• How can we improve this algorithm?

An example program: Matrix Matrix Multiplication
• http://www.mcs.anl.gov/research/projects/mpi/usingmpi/examples/simplempi/main.htm

An example program: Matrix Matrix Multiplication
• The program gives one way to do the matrix matrix multiplication AB = C
• The master initializes the matrices, then sends B and rows of A to workers to do a matrix vector product
• Operation count: n multiplications and n − 1 additions per dot product, and n² dot products, so n² × (n + n − 1) = 2n³ − n² arithmetic operations
• Communication count, neglecting the cost of sending the initial matrix: n elements sent and n elements returned for each row, so n × 2n = 2n²
• Computational intensity: (2n³ − n²)/(2n²) = n − 1/2 ≈ n
• Would such a strategy be good for the Intel Xeon Phi?
• How can we improve this algorithm?

Parallel Programming Languages
• Coarray Fortran is in the Intel and Cray compilers, https://software.intel.com/en-us/node/455262 and http://caf.rice.edu/; GNU support is still in development, https://gcc.gnu.org/wiki/Coarray; a possible evaluation project
• Unified Parallel C, http://upc.lbl.gov/; a possible evaluation project
• Python mpi4py, https://bitbucket.org/mpi4py; a possible evaluation project

OpenACC
• An OpenMP-like directive language to produce code for accelerators
• Currently supported by PGI and Cray
• Since OpenMP 4.0 supports accelerators, it is unclear whether OpenACC will be extended
• It used to be supported by CAPS enterprise, which is no longer in business, though they are selling their source code (http://people.irisa.fr/Francois.Bodin/?p=400). Maybe they will open source it?
• The CAPS compiler is a source-to-source translator producing OpenCL or CUDA code from OpenACC code, to allow write once, run everywhere somewhat efficiently

Summary
• For efficient shared memory programming, understand the hardware you are running on
• Examples of message passing and operation counts for matrix vector and matrix matrix multiplications
• The future of parallel programming languages is uncertain
• There is a large legacy codebase, and people want to keep using the same codes with small modifications
• When trying to make a code parallel, estimate its possible lifespan and the hardware you aim to target, as well as the ease of porting to other hardware
• Code usually has a half life of 10 years; high performance machines last 3-4 years
• Who will contribute to MPI 4.0 (http://meetings.mpi-forum.org/MPI_4.0_main_page.php)?
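Not part of the original slides: a hedged C sketch of the master-worker matrix vector multiplication c = Ab whose operation and communication counts were estimated above, loosely following the structure of the linked ANL example rather than reproducing it. The matrix size N, the initialization values, and the use of the message tag to carry the row index are illustrative assumptions; run with at least two processes.

  /* matvec.c: mpicc matvec.c -o matvec && mpirun -np 4 ./matvec */
  #include <stdio.h>
  #include <mpi.h>

  #define N 8   /* matrix dimension, kept small for illustration */

  int main(int argc, char **argv)
  {
      int rank, nproc;
      double b[N];
      MPI_Status status;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nproc);

      if (rank == 0) {                         /* master */
          double A[N][N], c[N], dummy = 0.0, ans;
          int sent = 0, received = 0, worker;

          for (int i = 0; i < N; i++) {        /* initialize A and b */
              b[i] = 1.0;
              for (int j = 0; j < N; j++) A[i][j] = i + 1;
          }
          MPI_Bcast(b, N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

          /* hand one row to each worker; the tag carries the row index + 1 */
          for (worker = 1; worker < nproc && sent < N; worker++, sent++)
              MPI_Send(A[sent], N, MPI_DOUBLE, worker, sent + 1, MPI_COMM_WORLD);
          for (; worker < nproc; worker++)     /* more workers than rows */
              MPI_Send(&dummy, 0, MPI_DOUBLE, worker, 0, MPI_COMM_WORLD);

          while (received < N) {
              MPI_Recv(&ans, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                       MPI_COMM_WORLD, &status);
              c[status.MPI_TAG - 1] = ans;     /* tag - 1 is the row index */
              received++;
              worker = status.MPI_SOURCE;
              if (sent < N) {                  /* give that worker another row */
                  MPI_Send(A[sent], N, MPI_DOUBLE, worker, sent + 1,
                           MPI_COMM_WORLD);
                  sent++;
              } else {                         /* tag 0 means no more work */
                  MPI_Send(&dummy, 0, MPI_DOUBLE, worker, 0, MPI_COMM_WORLD);
              }
          }
          for (int i = 0; i < N; i++) printf("c[%d] = %g\n", i, c[i]);
      } else {                                 /* worker */
          double row[N], ans;
          MPI_Bcast(b, N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
          while (1) {
              MPI_Recv(row, N, MPI_DOUBLE, 0, MPI_ANY_TAG, MPI_COMM_WORLD,
                       &status);
              if (status.MPI_TAG == 0) break;  /* no more rows to process */
              ans = 0.0;
              for (int j = 0; j < N; j++) ans += row[j] * b[j];
              MPI_Send(&ans, 1, MPI_DOUBLE, 0, status.MPI_TAG, MPI_COMM_WORLD);
          }
      }
      MPI_Finalize();
      return 0;
  }

Counting the messages in this sketch reproduces the estimate above: each of the n dot products needs a row of n elements sent out and one element returned.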
New Key Concepts and References
• Parallel Computer Architecture; RR 2.2, 2.7, 3.1-3.9
• Gropp, W., Lusk, E.