Distributed Simulation in CUBINlab

John Papandriopoulos http://www.cubinlab.ee.mu.oz.au/

ARC Special Research Centre for Ultra-Broadband Information Networks

Outline

- Clustering overview
- The CUBINlab cluster
- Paradigms for Distributed Simulation
- Message Passing Interface
- Parallelising MATLAB
- Simulation example
- Wrap up

Clustering

- Increases computational power through parallelism
- Motivated by:
  - Cheap commodity hardware
  - Cheap/fast network interconnects
- Various architectures:
  - Dedicated nodes, separate network
  - Reuse of existing network and workstations

Cluster Architectures

[Figure: two cluster architectures, a Beowulf cluster (dedicated nodes on a separate network) and a network of workstations (reusing the existing LAN and machines).]


How can CUBIN benefit?

- Many problems have inherent parallelism:
  - Optimization problems, e.g. shortest path
  - Network simulation
  - Monte Carlo simulations
  - Numerical and iterative calculations, e.g. large systems: matrix inverse, FFT, PDEs
- Clustering can exploit this!
- CUBINlab Distributed Simulation Cluster (DSC)

The CUBINlab Cluster

Composition of the DSC

- 9 machines, each:
  - 550MHz to 1.5GHz
  - 128MB to 256MB RAM
  - ~10GB HDD scratch space
- 100Mbps Ethernet
- Mandrake Linux
- Identical software
- Low maintenance
- http://www.cubinlab.ee.mu.oz.au/cluster/

Utilisation of the Cluster

- Which machines should I use?
- Resource allocation
- Load balancing
- Each machine is running an SNMP daemon
- Use this to get stats on each machine at 5 minute intervals:
  - CPU load
  - Memory use
  - Scratch space use
  - LAN network use (rate & volume)
- http://www.cubinlab.ee.mu.oz.au/netstat/

CPU Load Average History

Paradigms for Distributed Simulation

- Distributed vs. parallel
  - Loose vs. tight coupling
- Single Instruction Multiple Data (SIMD)
  - Vector (super-)computers
- Multiple Instruction Multiple Data (MIMD)
  - Parallel/distributed computers
  - Synchronous vs. asynchronous design
- Shared memory vs. message passing

Applications

- Totally independent subtasks
- Assign each subtask to a processor
- Collect results at the end
- Master-Slave model (see the sketch below)

[Figure: a Master process holds a bag of subtasks with no task dependencies and farms them out to Slave #1, Slave #2, ..., Slave #p.]
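As a rough illustration of the master-slave model in MPI (not code from the talk): rank 0 hands out a bag of integer subtasks and collects results as slaves finish. NTASKS, the message tags and do_work() are made up for the example.

/* Minimal master-slave task farm sketch (illustrative only). */
#include <stdio.h>
#include <mpi.h>

#define NTASKS   40
#define TAG_WORK 1
#define TAG_STOP 2

static double do_work(int task) { return task * task; }  /* placeholder subtask */

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                       /* master: hand out the bag of subtasks */
        int next = 0, active = 0;
        double result, total = 0.0;
        MPI_Status status;

        for (int s = 1; s < size; s++) {   /* give every slave an initial task */
            if (next < NTASKS) {
                MPI_Send(&next, 1, MPI_INT, s, TAG_WORK, MPI_COMM_WORLD);
                next++; active++;
            } else {
                int dummy = 0;
                MPI_Send(&dummy, 1, MPI_INT, s, TAG_STOP, MPI_COMM_WORLD);
            }
        }

        while (active > 0) {               /* collect a result, send the next task */
            MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);
            total += result;
            if (next < NTASKS) {
                MPI_Send(&next, 1, MPI_INT, status.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
                next++;
            } else {
                int dummy = 0;
                MPI_Send(&dummy, 1, MPI_INT, status.MPI_SOURCE, TAG_STOP, MPI_COMM_WORLD);
                active--;
            }
        }
        printf("total = %g\n", total);
    } else {                               /* slave: loop until told to stop */
        int task;
        MPI_Status status;
        while (1) {
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
            if (status.MPI_TAG == TAG_STOP) break;
            double result = do_work(task);
            MPI_Send(&result, 1, MPI_DOUBLE, 0, TAG_WORK, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}

Because slaves ask for work implicitly by returning results, faster machines simply end up doing more subtasks, which is the load-balancing behaviour discussed on the next slide.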

Load Balancing

- When do we allocate tasks, and to whom?
  - At the start? As we go along?
  - Equal division?
  - Communications overhead?
- Concurrency is maximised when all processors are in use
- Unequal distribution of work, or different processing speeds:
  ✗ Some processors will be idle

Message Passing Interface (MPI)

Overview of MPI(CH)

- Standard for message passing
- Portable and free implementation
- High performance
- Object-oriented "look and feel"
- C, Fortran and C++ bindings available

MPICH Architecture

[Layered diagram: MPI / Abstract Device Interface / Channel Interface / TCP/IP]

MPI layer:
- Interface providing the message passing constructs
  - MPI_Send
  - MPI_Recv
  - etc.
- Uses the ADI to implement all behaviour, with a smaller set of constructs

MPICH Architecture

Abstract Device Interface (ADI):
- Send/receive messages
- Moving data between MPI and the Channel Interface (CI)
- Managing pending messages
- Providing basic environment information
  - e.g. how many tasks are running

MPICH Architecture

Channel Interface:
- Formats and packs bytes for the wire
- Very simple: 5 functions only
- Three mechanisms:
  - Eager: immediate delivery
  - Rendezvous: delivery once the receiver wants it
  - Get: shared memory/DMA (hardware)

Structure

- Groups and communicators
- How many processors are there?
- Who am I? (Who are my neighbours?)
- Virtual topologies
  - Cartesian for grid computation (see the sketch below)

[Figure: a process group of ranks (0)-(8) mapped onto a 3x3 Cartesian grid with coordinates (0,0)-(2,2).]
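A minimal sketch (not from the talk) of the Cartesian topology in the figure: run with exactly 9 processes and each one reports its (row, column) coordinates in the 3x3 grid.

/* Cartesian virtual topology sketch: mpirun -np 9 ... */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, dims[2] = {3, 3}, periods[2] = {0, 0}, coords[2];
    MPI_Comm grid;

    MPI_Init(&argc, &argv);
    /* reorder = 1 lets MPI renumber ranks to suit the hardware */
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &grid);
    MPI_Comm_rank(grid, &rank);
    MPI_Cart_coords(grid, rank, 2, coords);
    printf("rank %d -> (%d,%d)\n", rank, coords[0], coords[1]);

    MPI_Finalize();
    return 0;
}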

Communicators

[Figure: MPI_COMM_WORLD contains processes (0)-(5). They are divided into Group A = {(0),(2),(4)} and Group B = {(1),(3),(5)}, from which two new communicators COMM_A and COMM_B are created; within each, the processes are re-ranked (0),(1),(2).]
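A hedged sketch of how a split like the one in the figure can be made with MPI_Comm_split (the figure itself gives no code): even world ranks form one sub-communicator, odd ranks the other. The COMM_A/COMM_B names come from the figure; the even/odd colouring rule is an assumption.

/* Splitting MPI_COMM_WORLD into two communicators (illustrative only). */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int world_rank, sub_rank, color;
    MPI_Comm sub;                          /* becomes "COMM_A" or "COMM_B" */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    color = world_rank % 2;                /* 0 -> Group A, 1 -> Group B */
    /* Processes with the same color land in the same new communicator;
       the key (world_rank here) determines their ordering, i.e. their new rank. */
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &sub);
    MPI_Comm_rank(sub, &sub_rank);

    printf("world rank %d is rank %d in COMM_%c\n",
           world_rank, sub_rank, color == 0 ? 'A' : 'B');

    MPI_Comm_free(&sub);
    MPI_Finalize();
    return 0;
}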

Point-to-Point: sending data

MPI_SEND(buf, count, datatype, dest, tag, comm)

IN  buf       initial address of send buffer (choice)
IN  count     number of elements in send buffer (non-negative integer)
IN  datatype  datatype of each send buffer element (handle)
IN  dest      rank of destination (integer)
IN  tag       message tag (integer)
IN  comm      communicator (handle)

Point-to-Point: receiving data

MPI_RECV(buf, count, datatype, source, tag, comm, status)

OUT buf       initial address of receive buffer (choice)
IN  count     number of elements in receive buffer (integer)
IN  datatype  datatype of each receive buffer element (handle)
IN  source    rank of source (integer)
IN  tag       message tag (integer)
IN  comm      communicator (handle)
OUT status    status object (Status)

Simple Example: "hello world"

#include <stdio.h>
#include <string.h>
#include "mpi.h"

void main(int argc, char* argv[])
{
    int rank, size, tag, i;
    char message[20];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    tag = 100;

    if (rank == 0) {
        /* rank 0 sends the message to every other node
           (send/receive calls reconstructed; the slide's listing was truncated) */
        strcpy(message, "Hello world!");
        for (i = 1; i < size; i++)
            MPI_Send(message, 13, MPI_CHAR, i, tag, MPI_COMM_WORLD);
    } else {
        /* every other node receives the message from rank 0 */
        MPI_Recv(message, 20, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &status);
    }

    printf("node %d: %s\n", rank, message);
    MPI_Finalize();
}

Simple Example: "hello world"

- Compiling:
    $ mpicc -o simple simple.c
- Running on 12 machines (?!):
    $ mpirun -np 12 simple
    node 0: Hello world!
    node 3: Hello world!
    node 7: Hello world!
    node 6: Hello world!
    node 8: Hello world!
    node 2: Hello world!
    node 5: Hello world!
    node 4: Hello world!
    node 1: Hello world!
    node 11: Hello world!
    node 9: Hello world!
    node 10: Hello world!
    $

MPI Data Types

- Standard types:

    MPI data type       C data type
    MPI_CHAR            signed char
    MPI_SHORT           signed short int
    MPI_INT             signed int
    MPI_LONG            signed long int
    MPI_UNSIGNED_CHAR   unsigned char
    MPI_UNSIGNED_SHORT  unsigned short int
    MPI_UNSIGNED        unsigned int
    MPI_UNSIGNED_LONG   unsigned long int
    MPI_FLOAT           float
    MPI_DOUBLE          double
    MPI_LONG_DOUBLE     long double

- Can define your own also (see the sketch below)
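A small illustrative example of defining your own datatype (not from the talk), using MPI_Type_contiguous to treat three doubles as a single element; run with at least two processes.

/* User-defined MPI datatype sketch: a 3-double "point". */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank;
    double point[3];                       /* e.g. an (x, y, z) coordinate */
    MPI_Datatype point_type;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Type_contiguous(3, MPI_DOUBLE, &point_type);
    MPI_Type_commit(&point_type);          /* must commit before use */

    if (rank == 0) {
        point[0] = 1.0; point[1] = 2.0; point[2] = 3.0;
        MPI_Send(point, 1, point_type, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(point, 1, point_type, 0, 0, MPI_COMM_WORLD, &status);
        printf("received (%g, %g, %g)\n", point[0], point[1], point[2]);
    }

    MPI_Type_free(&point_type);
    MPI_Finalize();
    return 0;
}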

Collective Communication

- Barrier synchronization
- Broadcast from one to many
- Gather from many to one
- Scatter data from one to many
- Scatter/Gather (complete exchange, all-to-all)
- … and many more (a Scatter/Gather sketch follows the figures below)

[Figures: Broadcast; Scatter and Gather; All Gather; All to All]
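To make the figures concrete, here is a hedged sketch combining two of the collectives listed above (Scatter and Gather); the buffer sizes and values are arbitrary and not from the talk.

/* Scatter/Gather sketch: root scatters one int to each rank, gathers results. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size, mine;
    int *sendbuf = NULL, *recvbuf = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                       /* root prepares one value per process */
        sendbuf = malloc(size * sizeof(int));
        recvbuf = malloc(size * sizeof(int));
        for (int i = 0; i < size; i++) sendbuf[i] = 10 * i;
    }

    /* Scatter: one element of sendbuf goes to each rank. */
    MPI_Scatter(sendbuf, 1, MPI_INT, &mine, 1, MPI_INT, 0, MPI_COMM_WORLD);
    mine += 1;                             /* each process does its bit of work */

    /* Gather: every rank's result returns to the root. */
    MPI_Gather(&mine, 1, MPI_INT, recvbuf, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        for (int i = 0; i < size; i++) printf("from rank %d: %d\n", i, recvbuf[i]);
        free(sendbuf); free(recvbuf);
    }

    MPI_Finalize();
    return 0;
}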

Profiling Parallel Code

- MPI has profiling built in:
    $ mpicc -o simple -mpilog simple.c
  - Generates a log file of all calls during execution
- A log file visualiser is available
  - Jumpshot-3

Parallelising MATLAB

Speeding up MATLAB Code

- Use vector/matrix form of evaluations:
    inner = A'*B;
  rather than
    inner = 0;
    for i = 1:size(A,1)
        inner = inner + A(i).*B(i);
    end
- Pre-allocate matrices:
    S = zeros(N, K);
- Clear unused variables to avoid swapping:
    clear S;
- Re-write bottlenecks in C with MEX
- Parallelise!

Why not a Parallel MATLAB?

- Parallelism is a good idea!
- Why doesn't MathWorks provide support?
- MathWorks says [1]:
  - Communication bottlenecks and problem size
  - Memory model and architecture
  - (Only) useful for "outer-loop" parallelism
  - Most time is spent in the parser, interpreter and graphics
  - Business situation

Parallelising MATLAB: overview

- Parallelism through high-level toolboxes
- Medium-to-coarse granularity
  - Good for "outer loop" parallelisation
- Uses message passing:
  - MPI
  - PVM
  - Socket based
  - File IO based

Toolboxes Available

    Status  Toolbox                          Origins
    X $     MultiMATLAB                      Cornell
    S       DP-Toolbox                       U of Rostock, Germany
    B       MPITB/PVMTB                      U of Granada, Spain
    $       MATmarks                         U of Illinois
    S       MatlabMPI                        MIT
    S       MATLAB*P                         MIT
    S       Matlab Parallelization Toolkit   Einar Heiberg
    ?       Parallel Toolbox for MATLAB      Unknown

Key: X = no source, $ = commercial, B = binaries only, S = open source

MatlabMPI

- Limited subset of MPI commands
  ✗ Very primitive handling of basic operations
  ✗ No barrier function for synchronization
- Communication based on file IO
  ✓ Very portable: no MEX
  ✗ File system, disc and NFS overhead
  ✗ Spin locks for blocking reads
- Uses RSH/SSH to spawn remote MATLAB instances
  - Machine list supplied in m-files

MatlabMPI API

    MPI_Run        Runs a MATLAB script in parallel
    MPI_Init       Initialises the toolbox
    MPI_Finalize   Cleans up at the end
    MPI_Comm_size  Gets the number of processors in a communicator
    MPI_Comm_rank  Gets the rank of the current processor within a communicator
    MPI_Send       Sends a message to a processor (non-blocking)
    MPI_Recv       Receives a message from a processor (blocking)
    MPI_Abort      Kills all MATLAB jobs started by MatlabMPI
    MPI_Bcast      Broadcasts a message (blocking)
    MPI_Probe      Returns a list of all incoming messages

MatlabMPI Performance

- Mainly used for embarrassingly parallel problems
- Matches native MPI at 1MByte message size [12]
- Spin locks and busy waiting caused a DDoS attack on our NFS

Simulation Example

Iterative Power Control

- Iterative solution to a Power Control problem (see the sketch below)
- Each iteration of P_i depends on all P_j, j ≠ i
- We want to simulate over 1,000 runs with different user signature sequences S
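The update equation on the original slide did not survive extraction. Purely as a hedged illustration (it may not be the exact formula used in the talk), a standard target-SIR iteration has the shape

\[
P_i(n+1) = \frac{\gamma_i}{G_{ii}} \left( \sum_{j \neq i} G_{ij}\, P_j(n) + \sigma_i^2 \right), \qquad i = 1, \dots, M,
\]

where \(\gamma_i\) is user i's target SIR, \(G_{ij}\) are link gains and \(\sigma_i^2\) is the receiver noise; the coupling through all \(P_j(n)\), \(j \neq i\), is what forces synchronization if the iteration itself is parallelised.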

Inner Loop Parallelism

- Split the iteration P_i(n+1) over p processors
- Synchronize at the end of each iteration
- Is there a faster way?

Outer Loop Parallelism

- Split the 1,000 runs over p processors
- No need for explicit synchronization
  - Each simulation run is independent
  - Embarrassingly parallel problem
- Master/slave model will work quite well (a static split is sketched below)
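A hedged C sketch of this outer-loop decomposition (the actual simulation was in MATLAB via MatlabMPI): each process takes every p-th run and the results are combined once at the end. NRUNS and run_simulation() are placeholders, not the talk's code.

/* Static outer-loop split of independent Monte Carlo runs. */
#include <stdio.h>
#include <mpi.h>

#define NRUNS 1000

static double run_simulation(int run) { return (double)run; }  /* placeholder run */

int main(int argc, char *argv[])
{
    int rank, p;
    double local_sum = 0.0, total = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    /* Round-robin split: this rank handles runs rank, rank+p, rank+2p, ... */
    for (int run = rank; run < NRUNS; run += p)
        local_sum += run_simulation(run);

    /* The only communication needed: combine results at the very end. */
    MPI_Reduce(&local_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("average over %d runs: %g\n", NRUNS, total / NRUNS);

    MPI_Finalize();
    return 0;
}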

Simulation Results

Sorry, no results due to the MatlabMPI DDoS attack!

Wrap Up

Conclusions

- Clustering provides a cheap way to increase computational power through parallelism
- Parallelism is present in many problems, to a degree
- Message passing is one method of unifying computation amongst distributed processors
- MATLAB can be used for coarse-grain parallel applications

References-1

1. "Why there isn't a parallel MATLAB", Cleve Moler, http://www.mathworks.com/company/newsletter/pdf/spr95cleve.pdf
2. "Parallel Matlab survey", Ron Choy, http://supertech.lcs.mit.edu/~cly/survey.html
3. "MultiMATLAB: Integrating MATLAB with High-Performance Parallel Computing", V. Menon and A. Trefethen, Cornell University
4. "Matpar: Parallel Extensions for MATLAB", P. Springer, Jet Propulsion Laboratory, CalTech
5. "Message Passing under MATLAB", J. Baldomero, U of Granada, Spain
6. "Performance of Message-Passing MATLAB Toolboxes", J. Fernandez, A. Canas, A. Diaz, J. Gonzalez, J. Ortega and A. Prieto, U of Granada, Spain

References-2

7. "Parallel and Distributed Computation", D. Bertsekas and J. Tsitsiklis, Prentice Hall, NJ, 1989
8. "Message Passing Interface Forum", http://www.mpi-forum.org/
9. "MPICH - A Portable Implementation of MPI", http://www-unix.mcs.anl.gov/mpi/mpich/
10. "A High-Performance, Portable Implementation of the MPI Message Passing Interface Standard", W. Gropp and E. Lusk, Argonne National Laboratory
11. "Tutorial on MPI: The Message Passing Interface", W. Gropp, http://www-unix.mcs.anl.gov/mpi/tutorial/gropp/talk.html
12. "Parallel Programming with MatlabMPI", J. Kepner, MIT Lincoln Laboratory, 2001