Parallel Computing

Benson Muite
[email protected]
http://math.ut.ee/~benson
https://courses.cs.ut.ee/2014/paralleel/fall/Main/HomePage
6 October 2014

Parallel Programming Models and Parallel Programming Implementations

Parallel Programming Models

• Shared memory
• Distributed memory
• Simple computation and communication cost models

Parallel Programming Implementations

• OpenMP
• MPI
• OpenCL – a C-like programming language that aims to be manufacturer independent and to support a wide range of multithreaded devices
• CUDA (Compute Unified Device Architecture) – used for Nvidia GPUs; based on C/C++; makes numerical computations easier
• Pthreads – an older threaded programming interface; some legacy codes are still around; lower level than OpenMP and not as easy to use
• Unified Parallel C (UPC) – a new-generation parallel programming language with distributed array support; consider trying it
• Coarray Fortran (CAF) – a new-generation parallel programming language with distributed array support; consider trying it
• Julia (http://julialang.org/) – a new serial and parallel programming language; consider trying it
• Parallel Virtual Machine (PVM) – older; has been replaced by MPI
• OpenACC – brief mention
• Not covered: Java threads, C++11 threads, Python threads, High Performance Fortran (HPF)

Parallel Programming Models: Flynn's Taxonomy

• Single Instruction Single Data (SISD) – serial
• Single Program Multiple Data
• Multiple Program Multiple Data
• Multiple Instruction Single Data
• Master–Worker
• Server–Client

Parallel Programming Models

• Loop parallelism: often occurs in numerical computations; the easiest way to find parallelism in a serial program
• Task parallelism: often used on personal computers running several pieces of software (email client, browser, text editor, etc.)

Shared Memory Programming

• The easiest paradigm
• Either random access memory is physically shared between all cores, or software allows one to program a distributed memory machine as if it had shared memory
• To get good performance, one needs to be aware of non-uniform memory access
• To get correct results, one needs to be aware of false sharing and prevent it from occurring
• Example APIs are OpenMP, OpenCL, Pthreads, Julia, OpenACC, Java threads, C++11 threads, C++14 threads, Python threads

Shared Memory Programming: Non-uniform Memory Access

• A location in memory can be far from a core, hence many cycles are needed to fetch its contents
• On Rocket there are 2 sockets (chips) per node, and each chip has 10 cores
• A single socket can be programmed using OpenMP, so it looks like shared memory
• However, it is better to think of the node as having globally accessible memory, half of which is on one socket and half on the other
• It can take longer to access memory that is far away than memory that is close by
• How would you modify the STREAM test to measure this? (A sketch follows below.)
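One possible answer, as a minimal sketch (not from the lecture): time a STREAM-style triad after controlling where the pages of the arrays land via first touch. The array size, kernel, and environment settings below are illustrative assumptions; threads should be pinned (e.g. OMP_PROC_BIND=true OMP_PLACES=cores) so they stay on fixed sockets while measuring.

/* numa_triad.c: a STREAM-style NUMA probe (sketch).
 * Compile: gcc -O2 -fopenmp numa_triad.c -o numa_triad */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N (1 << 25)   /* 32M doubles (256 MB) per array; illustrative */

int main(void) {
  double *a = malloc(N * sizeof(double));
  double *b = malloc(N * sizeof(double));
  double *c = malloc(N * sizeof(double));
  if (!a || !b || !c) return 1;

  /* First touch: a serial initialization places all pages near thread
   * 0's socket; making this loop an "omp parallel for" instead spreads
   * the pages across both sockets.  The bandwidth difference between
   * the two versions is a crude NUMA measurement. */
  for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

  double t = omp_get_wtime();
#pragma omp parallel for
  for (long i = 0; i < N; i++)
    a[i] = b[i] + 3.0 * c[i];                /* STREAM triad kernel */
  t = omp_get_wtime() - t;

  /* print a value from a so the compiler keeps the computation */
  printf("triad: %.4f s, %.2f GB/s (check %g)\n", t,
         3.0 * N * sizeof(double) / t / 1.0e9, a[N / 2]);
  free(a); free(b); free(c);
  return 0;
}

If all the pages sit on one socket, threads on the other socket see lower bandwidth, and the triad runs more slowly than with distributed first touch.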
Shared Memory Programming: False Sharing

• Cache lines may be updated and hold different values than what is in main memory
• Two different cache lines belonging to two different cores may correspond to the same place in physical memory
• There are several different mechanisms for trying to ensure consistency
• For general purpose code, avoid false sharing
• For code targeting a specific machine with a specific compiler, find out the coherence protocol and try to optimize performance while preserving correctness
• It is possible to use OpenMP in a distributed memory manner

Shared Memory Programming: False Sharing

#include <stdlib.h>
#include <omp.h>

unsigned int const SIZE = 16;
unsigned int const ITER = 48000000;

int main() {
    int i, j;
    /* __declspec(align(16)) with the Intel compiler */
    double a[SIZE], b[SIZE], c[SIZE];
    /* initialize */
    for (i = 0; i < SIZE; i++) {
        c[i] = b[i] = a[i] = (double) rand();
    }
    /* all threads update the same small shared array, so its cache
       lines ping-pong between cores */
    #pragma omp parallel for private(j)
    for (i = 0; i < ITER; i++) {
        #pragma vector aligned
        for (j = 0; j < SIZE; j++) {
            a[j] = b[j] * c[j] + a[j];
        }
    }
    return 0;
}

Listing 1: A C++ program to find maximum floating point operations, adapted from Rahman (2013).

Shared Memory Programming: False Sharing

PROGRAM falsesharing1
  USE omp_lib
  IMPLICIT NONE
  INTEGER(kind=4), PARAMETER :: SIZE = 16
  INTEGER(kind=4), PARAMETER :: ITER = 480000
  INTEGER(kind=4) :: i, j
  DOUBLE PRECISION :: a(1:SIZE), b(1:SIZE), c(1:SIZE)
  DO i = 1, SIZE
    a(i) = i*0.1**5
    b(i) = i*0.2**3
    c(i) = i*0.3**3
  END DO
  !$OMP PARALLEL DO
  DO j = 1, ITER
    DO i = 1, SIZE
      a(i) = b(i)*c(i) + a(i)
    END DO
  END DO
  !$OMP END PARALLEL DO
  PRINT *, a
END PROGRAM falsesharing1

Listing 2: A Fortran program to demonstrate false sharing, adapted from Rahman (2013).

Shared Memory Programming: False Sharing

PROGRAM falsesharing2
  USE omp_lib
  IMPLICIT NONE
  INTEGER(kind=4), PARAMETER :: SIZE = 16
  INTEGER(kind=4), PARAMETER :: ITER = 480000
  INTEGER(kind=4) :: i, j
  DOUBLE PRECISION :: a(1:SIZE), b(1:SIZE), c(1:SIZE)
  DO i = 1, SIZE
    a(i) = i*0.1**5
    b(i) = i*0.2**3
    c(i) = i*0.3**3
  END DO
  !$OMP PARALLEL DO PRIVATE(i,j) SHARED(b,c) REDUCTION(+:a)
  DO j = 1, ITER
    DO i = 1, SIZE
      a(i) = b(i)*c(i) + a(i)
    END DO
  END DO
  !$OMP END PARALLEL DO
  PRINT *, a
END PROGRAM falsesharing2

Listing 3: A Fortran program to demonstrate a fix for false sharing, adapted from Rahman (2013). The reduction gives each thread a private copy of a, so threads no longer update the same cache lines.

Comments on Shared Memory Programming

• Find out the default data-sharing behavior for your programming environment, or better yet, explicitly state the behavior for each variable
• Check which standard is being adhered to; the OpenMP 3 and OpenMP 4 standards support many options
• When allocating memory, some systems have a first-touch policy: the first thread to put something in the allocated space determines where that space lives on the chip
• It is possible to do shared memory programming in a distributed memory style: each thread gets an id and does its own operations, with synchronization steps carefully planned (see the sketch below)
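The last bullet can be made concrete with a minimal sketch (not from the lecture; the array, block partition, and accumulator below are illustrative): each thread queries its id, works only on its own contiguous block, and the single synchronization step is an explicit atomic update, much as one would plan communication in a distributed memory program.

/* spmd_omp.c: shared memory programming in a distributed memory style.
 * Compile: gcc -O2 -fopenmp spmd_omp.c -o spmd_omp */
#include <stdio.h>
#include <omp.h>

#define N 1000000

static double a[N];

int main(void) {
  double total = 0.0;

#pragma omp parallel
  {
    int tid = omp_get_thread_num();        /* this thread's id */
    int nth = omp_get_num_threads();
    long lo = (long)N * tid / nth;         /* block owned by this thread */
    long hi = (long)N * (tid + 1) / nth;
    double local = 0.0;

    /* each thread touches only its own block, so false sharing can
       occur at most on the cache lines at the block boundaries */
    for (long i = lo; i < hi; i++) a[i] = 1.0 / (1.0 + (double)i);
    for (long i = lo; i < hi; i++) local += a[i];

    /* the one planned synchronization step */
#pragma omp atomic
    total += local;
  }
  printf("total = %f\n", total);
  return 0;
}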
Distributed Memory Programming

• It is difficult to construct large machines with shared memory
• Even when one does so, performance with simple shared memory programming can be poor
• Distributed memory programming can be quite effective on shared memory machines, e.g. the Intel Xeon Phi
• The most common interface is the Message Passing Interface (MPI)
• Here one recognizes that each process has its own separate memory space and that processes need to communicate with each other
• It is still an abstraction that may not fully map to the hardware, but it approximates what the hardware does better than a shared memory software layer does, and it still allows for portability

MPI

• The current version is MPI 3.0
• MPI 1.0, 1994: the result of one year of work in meetings and email discussion through the MPI Forum
• MPI 1.0 had most of the functionality in use today, including Fortran and C bindings and communicators; the largest missing functionality was MPI-IO
• MPI 1.1, 1995: corrections and clarifications
• MPI 2, 1997: parallel I/O, remote memory access, dynamic process management, C++ and Fortran 90 bindings (the C++ bindings were later deprecated)
• MPI 3, 2012: nonblocking collectives, extended one-sided operations, C++ bindings removed, Fortran 2008 bindings

An Example Program: Matrix-Vector Multiplication

• http://www.mcs.anl.gov/research/projects/mpi/usingmpi/examples/simplempi/main.htm
• The program gives one way to do the matrix-vector multiplication Ab = c
• The master initializes the matrix, then sends rows to workers, each of which computes a vector dot product
• Operation count: n multiplications and n − 1 additions per dot product, with n dot products, so n × (n + n − 1) = 2n^2 − n arithmetic operations
• Communication count, neglecting the cost of sending the initial vector: n elements sent and 1 element returned for each of the n rows, so n × (n + 1) = n^2 + n
• Computational intensity: (2n^2 − n) / (n^2 + n) ≈ 2
• Would such a strategy be good for the Intel Xeon Phi?
• How can we improve this algorithm? (A sketch of the master/worker scheme follows below.)
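A minimal master/worker sketch in the spirit of the Using MPI example linked above (not the original program; the matrix contents, the dimension N, and the tag convention are illustrative assumptions). The row index travels in the message tag so the master knows where each returned dot product belongs, and tag 0 is the stop signal. Run with 2 to N+1 processes, e.g. mpiexec -n 4 ./matvec_mpi.

/* matvec_mpi.c: master/worker matrix-vector product c = A*b (sketch).
 * Compile: mpicc -O2 matvec_mpi.c -o matvec_mpi */
#include <stdio.h>
#include <mpi.h>

#define N 8   /* matrix dimension; illustrative */

int main(int argc, char **argv) {
  int rank, size, sent = 0;
  double b[N], c[N], row[N];
  MPI_Status st;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  if (rank == 0)
    for (int j = 0; j < N; j++) b[j] = 1.0;
  MPI_Bcast(b, N, MPI_DOUBLE, 0, MPI_COMM_WORLD); /* send initial vector */

  if (rank == 0) {                     /* master */
    double A[N][N];
    for (int i = 0; i < N; i++)
      for (int j = 0; j < N; j++) A[i][j] = (double)(i + j);

    /* seed each worker with one row; tag = row index + 1 */
    for (int w = 1; w < size; w++, sent++)
      MPI_Send(A[sent], N, MPI_DOUBLE, w, sent + 1, MPI_COMM_WORLD);

    for (int done = 0; done < N; done++) {
      double ans;
      MPI_Recv(&ans, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
               MPI_COMM_WORLD, &st);
      c[st.MPI_TAG - 1] = ans;         /* tag identifies the row */
      if (sent < N) {                  /* more rows: reuse this worker */
        MPI_Send(A[sent], N, MPI_DOUBLE, st.MPI_SOURCE, sent + 1,
                 MPI_COMM_WORLD);
        sent++;
      } else {                         /* nothing left: send stop signal */
        MPI_Send(&ans, 0, MPI_DOUBLE, st.MPI_SOURCE, 0, MPI_COMM_WORLD);
      }
    }
    for (int i = 0; i < N; i++) printf("c[%d] = %g\n", i, c[i]);
  } else {                             /* worker */
    while (1) {
      MPI_Recv(row, N, MPI_DOUBLE, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
      if (st.MPI_TAG == 0) break;      /* stop signal */
      double dot = 0.0;
      for (int j = 0; j < N; j++) dot += row[j] * b[j];
      MPI_Send(&dot, 1, MPI_DOUBLE, 0, st.MPI_TAG, MPI_COMM_WORLD);
    }
  }
  MPI_Finalize();
  return 0;
}

Each row costs n elements sent for only one element returned, which is where the computational intensity of roughly 2 comes from; one possible improvement is to send blocks of rows at a time.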
An Example Program: Matrix-Matrix Multiplication

• http://www.mcs.anl.gov/research/projects/mpi/usingmpi/examples/simplempi/main.htm
• The program gives one way to do the matrix-matrix multiplication AB = C
• The master initializes the matrices, then sends B and rows of A to workers, each of which computes a matrix-vector product
• Operation count: n multiplications and n − 1 additions per dot product, with n^2 dot products, so n^2 × (n + n − 1) = 2n^3 − n^2 arithmetic operations
• Communication count, neglecting the cost of sending the initial matrix B: n elements sent and n elements returned for each of the n rows of A, so n × 2n = 2n^2
• Computational intensity: (2n^3 − n^2) / (2n^2) ≈ n
• Would such a strategy be good for the Intel Xeon Phi?
• How can we improve this algorithm?

Parallel Programming Languages

• Coarray Fortran is in the Intel and Cray compilers (https://software.intel.com/en-us/node/455262, http://caf.rice.edu/); GNU support is still in development (https://gcc.gnu.org/wiki/Coarray); a possible evaluation project
• Unified Parallel C: http://upc.lbl.gov/; a possible evaluation project
• Python mpi4py: https://bitbucket.org/mpi4py; a possible evaluation project

OpenACC

• An OpenMP-like directive language for producing code for accelerators
• Currently supported by PGI and Cray
• Since OpenMP 4.0 supports accelerators, it is unclear whether OpenACC will be extended
• It used to be supported by CAPS entreprise, which is no longer in business, though they are selling their source code (http://people.irisa.fr/Francois.Bodin/?p=400). Maybe they will open source it?
• The CAPS compiler is a source-to-source translator that produces OpenCL or CUDA code from OpenACC code, to allow write once, run everywhere somewhat efficiently

Summary

• For efficient shared memory programming, understand the hardware you are running on
• Examples of message passing, with operation counts for matrix-vector and matrix-matrix multiplication
• The future of parallel programming languages is uncertain
• A large legacy codebase means people want to keep using the same codes with small modifications
• When trying to make a code parallel, estimate its possible lifespan and the hardware you aim to target, as well as the ease of porting to other hardware
• Code usually has a half-life of 10 years; high performance machines last 3-4 years
• Who will contribute to MPI 4.0 (http://meetings.mpi-forum.org/MPI_4.0_main_page.php)?

New Key Concepts and References

• Parallel computer architecture; RR 2.2, 2.7, 3.1-3.9
• Gropp, W., Lusk, E., and Skjellum, A., Using MPI: Portable Parallel Programming with the Message-Passing Interface
