Distributed Simulation in CUBINlab
John Papandriopoulos http://www.cubinlab.ee.mu.oz.au/
ARC Special Research Centre for Ultra-Broadband Information Networks

Outline
- Clustering overview
- The CUBINlab cluster
- Paradigms for distributed simulation
- Message Passing Interface
- Parallelising MATLAB
- Simulation example
- Wrap up

Clustering
- Speedup through parallelism
- Motivated by:
  - Cheap commodity hardware
  - Cheap/fast network interconnects
- Various architectures:
  - Dedicated nodes on a separate network
  - Reuse of existing network and workstations
Clustering Overview

Cluster Architectures
(Diagrams comparing a Beowulf cluster with a network of workstations)
How can CUBIN benefit?
- Many problems have inherent parallelism:
  - Optimisation problems, e.g. shortest path
  - Network simulation
  - Monte Carlo simulations
  - Numerical and iterative calculations, e.g. large systems: matrix inverse, FFT, PDEs
- Clustering can exploit this!
- The CUBINlab Distributed Simulation Cluster (DSC)
Composition of the DSC
- 9 machines, each with:
  - 550 MHz to 1.5 GHz CPU
  - 128 MB to 256 MB RAM
  - ~10 GB HDD scratch space
- 100 Mbps Ethernet
- Mandrake Linux
- Identical software
- Low maintenance
- http://www.cubinlab.ee.mu.oz.au/cluster/
The CUBINlab Cluster

Utilisation of the Cluster
- Which machines should I use?
- Resource allocation
- Load balancing
- Each machine runs an SNMP daemon
- Use this to get stats on each machine at 5-minute intervals:
  - CPU load
  - Memory use
  - Scratch space use
  - LAN network use (rate and volume)
- http://www.cubinlab.ee.mu.oz.au/netstat/
CPU Load Average History
(Graph of CPU load average history for the cluster nodes)
Paradigms for Distributed Simulation
- Distributed vs. parallel:
  - loose vs. tight coupling
- Single Instruction, Multiple Data (SIMD):
  - vector (super-)computers
- Multiple Instruction, Multiple Data (MIMD):
  - parallel/distributed computers
  - synchronous vs. asynchronous design
- Shared memory vs. message passing
Embarrassingly Parallel Applications
- Totally independent subtasks
- Assign each subtask to a processor
- Collect results at the end
- Master-slave model: the master holds a bag of subtasks, with no task dependencies, and farms them out to slaves #1 to #p
Load Balancing
- When do we allocate tasks, and to whom?
  - At the start? As we go along?
  - Equal division?
  - Communications overhead?
- Concurrency is maximised when all processors are in use
- Unequal distribution of work, or different processing speeds:
  ✗ some processors will be idle
Message Passing Interface (MPI)

Overview of MPI(CH)
- Standard for message passing
- Portable and free implementation (MPICH)
- High performance
- Object-oriented "look and feel"
- C, Fortran and C++ bindings available
MPICH Architecture
(Layer diagram: MPI on top of the Abstract Device Interface, on top of the Channel Interface, on top of TCP/IP)

MPI layer:
- Interface providing message-passing constructs
  - MPI_Send
  - MPI_Recv
  - etc.
- Uses the ADI to implement all behaviour with a smaller set of constructs
Abstract Device Interface layer:
- Sends/receives messages
- Moves data between MPI and the Channel Interface
- Manages pending messages
- Provides basic environment information
  - e.g. how many tasks are running
Channel Interface layer:
- Formats and packs bytes for the wire
- Very simple: 5 functions only
- Three mechanisms:
  - Eager: immediate delivery
  - Rendezvous: delivered once the receiver wants it
  - Get: shared memory/DMA (hardware)
Process Structure
- Groups and communicators
- How many processors are there?
- Who am I? (Who are my neighbours?)
- Virtual topologies:
  - Cartesian, for grid computation, e.g. a process group of ranks (0)..(8) mapped onto a 3x3 grid of coordinates (0,0)..(2,2)
Communicators
- MPI_COMM_WORLD holds all processes, e.g. ranks (0)..(5)
- The processes can be partitioned into groups, e.g. Group A = {(0), (2), (4)} and Group B = {(1), (3), (5)}
- Each group gets its own communicator, COMM_A and COMM_B, within which the processes are re-ranked from (0)
Point-to-Point: sending data

MPI_SEND(buf, count, datatype, dest, tag, comm)
  IN  buf       initial address of send buffer (choice)
  IN  count     number of elements in send buffer (non-negative integer)
  IN  datatype  datatype of each send buffer element (handle)
  IN  dest      rank of destination (integer)
  IN  tag       message tag (integer)
  IN  comm      communicator (handle)
Point-to-Point: receiving data

MPI_RECV(buf, count, datatype, source, tag, comm, status)
  OUT buf       initial address of receive buffer (choice)
  IN  count     number of elements in receive buffer (integer)
  IN  datatype  datatype of each receive buffer element (handle)
  IN  source    rank of source (integer)
  IN  tag       message tag (integer)
  IN  comm      communicator (handle)
  OUT status    status object (Status)
Simple Example: "hello world"

    #include <stdio.h>
    #include <string.h>
    #include "mpi.h"

    int main(int argc, char *argv[])
    {
        char message[20];
        int i, rank, size, tag;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        tag = 100;

        if (rank == 0) {
            strcpy(message, "Hello world!");
            for (i = 1; i < size; i++)
                MPI_Send(message, 13, MPI_CHAR, i, tag, MPI_COMM_WORLD);
        } else {
            MPI_Recv(message, 20, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &status);
        }
        printf("node %d: %s\n", rank, message);
        MPI_Finalize();
        return 0;
    }

Simple Example: hello world
- Compiling:
    $ mpicc -o simple simple.c
- Running on 12 machines (?!):
    $ mpirun -np 12 simple
    node 0: Hello world!
    node 3: Hello world!
    node 7: Hello world!
    node 6: Hello world!
    node 8: Hello world!
    node 2: Hello world!
    node 5: Hello world!
    node 4: Hello world!
    node 1: Hello world!
    node 11: Hello world!
    node 9: Hello world!
    node 10: Hello world!
    $

MPI Data Types
- Standard types:
    MPI data type        C data type
    MPI_CHAR             signed char
    MPI_SHORT            signed short int
    MPI_INT              signed int
    MPI_LONG             signed long int
    MPI_UNSIGNED_CHAR    unsigned char
    MPI_UNSIGNED_SHORT   unsigned short int
    MPI_UNSIGNED         unsigned int
    MPI_UNSIGNED_LONG    unsigned long int
    MPI_FLOAT            float
    MPI_DOUBLE           double
    MPI_LONG_DOUBLE      long double
- You can also define your own

Collective Communication
- Barrier synchronisation
- Broadcast from one to many
- Gather from many to one
- Scatter data from one to many
- Scatter/Gather (complete exchange, all-to-all)
- ... and many more
(Diagram slides: Broadcast; Scatter and Gather; All Gather; All to All)

Profiling Parallel Code
- MPI has profiling built in:
    $ mpicc -o simple -mpilog simple.c
  - Generates a log file of all calls during execution
- A log file visualiser is available:
  - Jumpshot-3 (Java)

Parallelising MATLAB

Speeding up MATLAB Code
- Use the vector/matrix form of evaluations:
    inner = A'*B
  rather than
    inner = 0;
    for i = 1:size(A,1)
        inner = inner + A(i).*B(i);
    end
- Pre-allocate matrices:
    S = zeros(N, K);
- Clear unused variables to avoid swapping:
    clear S;
- Re-write bottlenecks in C with MEX
- Parallelise!

Why not a Parallel MATLAB?
- Parallelism is a good idea!
- So why doesn't MathWorks provide support?
- MathWorks says [1]:
  - Communication bottlenecks and problem size
  - Memory model and architecture
  - (Only) useful for "outer-loop" parallelism
  - Most time is spent in the parser, interpreter and graphics
  - Business situation

Parallelising MATLAB: overview
- Parallelism through high-level toolboxes
- Medium-to-coarse granularity
  - Good for "outer loop" parallelisation
- Uses message-passing APIs:
  - MPI
  - PVM
  - Socket based
  - File IO based

Toolboxes Available
    Status  Toolbox                         Origins
    X $     MultiMATLAB                     Cornell
    S       DP-Toolbox                      U of Rostock, Germany
    B       MPITB/PVMTB                     U of Granada, Spain
    $       MATmarks                        U of Illinois
    S       MatlabMPI                       MIT
    S       MATLAB*P                        MIT
    S       Matlab Parallelization Toolkit  Einar Heiberg
    ?       Parallel Toolbox for MATLAB     Unknown
  (X: no source, $: commercial, B: binaries only, S: open source)

MatlabMPI
- Limited subset of MPI commands
  ✗ Very primitive handling of basic operations
  ✗ No barrier function for synchronisation
- Communication based on file IO
  ✓ Very portable: no MEX
  ✗ File system, disc and NFS overhead
  ✗ Spin locks for blocking reads
- Uses RSH/SSH to spawn remote MATLAB instances
  - Machine list supplied in m-files

MatlabMPI API
    MPI_Run        Runs a MATLAB script in parallel
    MPI_Init       Initialises the toolbox
    MPI_Finalize   Cleans up at the end
    MPI_Comm_size  Gets the number of processors in a communicator
    MPI_Comm_rank  Gets the rank of the current processor within a communicator
    MPI_Send       Sends a message to a processor (non-blocking)
    MPI_Recv       Receives a message from a processor (blocking)
    MPI_Abort      Kills all MATLAB jobs started by MatlabMPI
    MPI_Bcast      Broadcasts a message (blocking)
    MPI_Probe      Returns a list of all incoming messages

MatlabMPI Performance
- Mainly useful for embarrassingly parallel problems
- Matches native MPI at a 1 MByte message size [12]
- Spin locks and busy waiting caused a DDoS attack on our NFS server

Simulation Example

Iterative Power Control
- Iterative solution to a power control problem (the fixed-point iteration for the powers Pi was shown as a figure)
- Each iteration depends on all Pj, j ≠ i
- We want to simulate 1,000 runs with different user signature sequences S

Inner Loop Parallelism
- Split the iteration Pi(n+1) over p processors
- Synchronise at the end of each iteration
- Is there a faster way?

Outer Loop Parallelism
- Split the 1,000 runs over p processors
- No need for explicit synchronisation:
  - Each simulation run is independent
  - Embarrassingly parallel problem
- A master/slave model will work quite well

Simulation Results
Sorry, no results due to the MatlabMPI DDoS attack!

Wrap Up

Conclusions
- Clustering provides a cheap way to increase computational power through parallelism
- Parallelism is present in many problems, to a degree
- Message passing is one method of unifying computation amongst distributed processors
- MATLAB can be used for coarse-grain parallel applications

References-1
1. "Why there isn't a parallel MATLAB", Cleve Moler, http://www.mathworks.com/company/newsletter/pdf/spr95cleve.pdf
2. "Parallel Matlab survey", Ron Choy, http://supertech.lcs.mit.edu/~cly/survey.html
3. "MultiMatlab: Integrating MATLAB with High-Performance Parallel Computing", V. Menon and A. Trefethen, Cornell University
4. "Matpar: Parallel Extensions for MATLAB", P. Springer, Jet Propulsion Laboratory, CalTech
5. "Message Passing under MATLAB", J. Baldomero, U of Granada, Spain
6. "Performance of Message-Passing MATLAB Toolboxes", J. Fernandez, A. Canas, A. Diaz, J. Gonzalez, J. Ortega and A. Prieto, U of Granada, Spain

References-2
7. "Parallel and Distributed Computation", D. Bertsekas and J. Tsitsiklis, Prentice Hall, NJ, 1989
8. "Message Passing Interface Forum", http://www.mpi-forum.org/
9. "MPICH: A Portable Implementation of MPI", http://www-unix.mcs.anl.gov/mpi/mpich/
10. "A High-Performance, Portable Implementation of the MPI Message Passing Interface Standard", W. Gropp and E. Lusk, Argonne National Laboratory
11. "Tutorial on MPI: The Message Passing Interface", W. Gropp, http://www-unix.mcs.anl.gov/mpi/tutorial/gropp/talk.html
12. "Parallel Programming with MatlabMPI", J. Kepner, MIT Lincoln Laboratory, 2001