Distributed Simulation in CUBINlab

John Papandriopoulos http://www.cubinlab.ee.mu.oz.au/

ARC Special Research Centre for Ultra-Broadband Information Networks

Outline

- Clustering overview
- The CUBINlab cluster
- Paradigms for Distributed Simulation
- Message Passing Interface
- Parallelising MATLAB
- Simulation example
- Wrap up

Clustering

- Increases computational power through parallelism
- Motivated by:
  - Cheap commodity hardware
  - Cheap/fast network interconnects
- Various architectures:
  - Dedicated nodes, separate network
  - Reuse of existing network and workstations

Cluster Architectures

[Figure: two cluster architectures, a Beowulf cluster (dedicated nodes on a separate network) and a network of workstations (reusing the existing LAN and machines).]


How can CUBIN benefit?

- Many problems have inherent parallelism:
  - Optimization problems, e.g. shortest path
  - Network simulation
  - Monte Carlo simulations
  - Numerical and iterative calculations, e.g. large systems: matrix inverse, FFT, PDEs
- Clustering can exploit this!
- CUBINlab Distributed Simulation Cluster (DSC)

The CUBINlab Cluster

Composition of the DSC

- 9 machines, each:
  - 550MHz to 1.5GHz
  - 128MB to 256MB RAM
  - ~10GB HDD scratch space
- 100Mbps Ethernet
- Mandrake Linux
- Identical software
- Low maintenance
- http://www.cubinlab.ee.mu.oz.au/cluster/

Utilisation of the Cluster

- Which machines should I use?
- Resource allocation
- Load balancing
- Each machine is running an SNMP daemon
- Use this to get stats on each machine at 5 minute intervals:
  - CPU load
  - Memory use
  - Scratch space use
  - LAN network use (rate & volume)
- http://www.cubinlab.ee.mu.oz.au/netstat/

CPU Load Average History

Paradigms for Distributed Simulation

- Distributed vs. parallel
  - Loose vs. tight coupling
- Single Instruction Multiple Data (SIMD)
  - Vector (super-)computers
- Multiple Instruction Multiple Data (MIMD)
  - Parallel/distributed computers
  - Synchronous vs. asynchronous design
- Shared memory vs. message passing

Applications

- Totally independent subtasks
- Assign each subtask to a processor
- Collect results at the end
- Master-Slave model (see the sketch below)

[Figure: a Master process holds a bag of subtasks with no task dependencies and farms them out to Slave #1, Slave #2, ..., Slave #p.]
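As a rough illustration of the master-slave model in MPI (not code from the talk): rank 0 hands out a bag of integer subtasks and collects results as slaves finish. NTASKS, the message tags and do_work() are made up for the example.

/* Minimal master-slave task farm sketch (illustrative only). */
#include <stdio.h>
#include <mpi.h>

#define NTASKS   40
#define TAG_WORK 1
#define TAG_STOP 2

static double do_work(int task) { return task * task; }  /* placeholder subtask */

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                       /* master: hand out the bag of subtasks */
        int next = 0, active = 0;
        double result, total = 0.0;
        MPI_Status status;

        for (int s = 1; s < size; s++) {   /* give every slave an initial task */
            if (next < NTASKS) {
                MPI_Send(&next, 1, MPI_INT, s, TAG_WORK, MPI_COMM_WORLD);
                next++; active++;
            } else {
                int dummy = 0;
                MPI_Send(&dummy, 1, MPI_INT, s, TAG_STOP, MPI_COMM_WORLD);
            }
        }

        while (active > 0) {               /* collect a result, send the next task */
            MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);
            total += result;
            if (next < NTASKS) {
                MPI_Send(&next, 1, MPI_INT, status.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
                next++;
            } else {
                int dummy = 0;
                MPI_Send(&dummy, 1, MPI_INT, status.MPI_SOURCE, TAG_STOP, MPI_COMM_WORLD);
                active--;
            }
        }
        printf("total = %g\n", total);
    } else {                               /* slave: loop until told to stop */
        int task;
        MPI_Status status;
        while (1) {
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
            if (status.MPI_TAG == TAG_STOP) break;
            double result = do_work(task);
            MPI_Send(&result, 1, MPI_DOUBLE, 0, TAG_WORK, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}

Because slaves ask for work implicitly by returning results, faster machines simply end up doing more subtasks, which is the load-balancing behaviour discussed on the next slide.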

Load Balancing

- When do we allocate tasks, and to whom?
  - At the start? As we go along?
  - Equal division?
  - Communications overhead?
- Concurrency is maximised when all processors are in use
- Unequal distribution of work, or different processing speeds:
  ✗ Some processors will be idle

Message Passing Interface (MPI)

Overview of MPI(CH)

- Standard for message passing
- Portable and free implementation
- High performance
- Object-oriented "look and feel"
- C, Fortran and C++ bindings available

MPICH Architecture

[Layered diagram: MPI / Abstract Device Interface / Channel Interface / TCP/IP]

MPI layer:
- Interface providing the message passing constructs
  - MPI_Send
  - MPI_Recv
  - etc.
- Uses the ADI to implement all behaviour, with a smaller set of constructs

MPICH Architecture

Abstract Device Interface (ADI):
- Send/receive messages
- Moving data between MPI and the Channel Interface (CI)
- Managing pending messages
- Providing basic environment information
  - e.g. how many tasks are running

MPICH Architecture

Channel Interface:
- Formats and packs bytes for the wire
- Very simple: 5 functions only
- Three mechanisms:
  - Eager: immediate delivery
  - Rendezvous: delivery once the receiver wants it
  - Get: shared memory/DMA (hardware)

Structure

- Groups and communicators
- How many processors are there?
- Who am I? (Who are my neighbours?)
- Virtual topologies
  - Cartesian for grid computation (see the sketch below)

[Figure: a process group of ranks (0)-(8) mapped onto a 3x3 Cartesian grid with coordinates (0,0)-(2,2).]
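A minimal sketch (not from the talk) of the Cartesian topology in the figure: run with exactly 9 processes and each one reports its (row, column) coordinates in the 3x3 grid.

/* Cartesian virtual topology sketch: mpirun -np 9 ... */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, dims[2] = {3, 3}, periods[2] = {0, 0}, coords[2];
    MPI_Comm grid;

    MPI_Init(&argc, &argv);
    /* reorder = 1 lets MPI renumber ranks to suit the hardware */
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &grid);
    MPI_Comm_rank(grid, &rank);
    MPI_Cart_coords(grid, rank, 2, coords);
    printf("rank %d -> (%d,%d)\n", rank, coords[0], coords[1]);

    MPI_Finalize();
    return 0;
}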

Communicators

[Figure: MPI_COMM_WORLD contains processes (0)-(5). They are divided into Group A = {(0),(2),(4)} and Group B = {(1),(3),(5)}, from which two new communicators COMM_A and COMM_B are created; within each, the processes are re-ranked (0),(1),(2).]
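A hedged sketch of how a split like the one in the figure can be made with MPI_Comm_split (the figure itself gives no code): even world ranks form one sub-communicator, odd ranks the other. The COMM_A/COMM_B names come from the figure; the even/odd colouring rule is an assumption.

/* Splitting MPI_COMM_WORLD into two communicators (illustrative only). */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int world_rank, sub_rank, color;
    MPI_Comm sub;                          /* becomes "COMM_A" or "COMM_B" */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    color = world_rank % 2;                /* 0 -> Group A, 1 -> Group B */
    /* Processes with the same color land in the same new communicator;
       the key (world_rank here) determines their ordering, i.e. their new rank. */
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &sub);
    MPI_Comm_rank(sub, &sub_rank);

    printf("world rank %d is rank %d in COMM_%c\n",
           world_rank, sub_rank, color == 0 ? 'A' : 'B');

    MPI_Comm_free(&sub);
    MPI_Finalize();
    return 0;
}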

Point-to-Point: sending data

MPI_SEND(buf, count, datatype, dest, tag, comm)

IN  buf       initial address of send buffer (choice)
IN  count     number of elements in send buffer (non-negative integer)
IN  datatype  datatype of each send buffer element (handle)
IN  dest      rank of destination (integer)
IN  tag       message tag (integer)
IN  comm      communicator (handle)

Point-to-Point: receiving data

MPI_RECV(buf, count, datatype, source, tag, comm, status)

OUT buf       initial address of receive buffer (choice)
IN  count     number of elements in receive buffer (integer)
IN  datatype  datatype of each receive buffer element (handle)
IN  source    rank of source (integer)
IN  tag       message tag (integer)
IN  comm      communicator (handle)
OUT status    status object (Status)

Simple Example: "hello world"

#include <stdio.h>
#include <string.h>
#include "mpi.h"

void main(int argc, char* argv[])
{
    int rank, size, tag, i;
    char message[20];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    tag = 100;

    if (rank == 0) {
        /* rank 0 sends the message to every other node
           (send/receive calls reconstructed; the slide's listing was truncated) */
        strcpy(message, "Hello world!");
        for (i = 1; i < size; i++)
            MPI_Send(message, 13, MPI_CHAR, i, tag, MPI_COMM_WORLD);
    } else {
        /* every other node receives the message from rank 0 */
        MPI_Recv(message, 20, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &status);
    }

    printf("node %d: %s\n", rank, message);
    MPI_Finalize();
}

Simple Example: "hello world"

- Compiling:
    $ mpicc -o simple simple.c
- Running on 12 machines (?!):
    $ mpirun -np 12 simple
    node 0: Hello world!
    node 3: Hello world!
    node 7: Hello world!
    node 6: Hello world!
    node 8: Hello world!
    node 2: Hello world!
    node 5: Hello world!
    node 4: Hello world!
    node 1: Hello world!
    node 11: Hello world!
    node 9: Hello world!
    node 10: Hello world!
    $

MPI Data Types

- Standard types:

    MPI data type       C data type
    MPI_CHAR            signed char
    MPI_SHORT           signed short int
    MPI_INT             signed int
    MPI_LONG            signed long int
    MPI_UNSIGNED_CHAR   unsigned char
    MPI_UNSIGNED_SHORT  unsigned short int
    MPI_UNSIGNED        unsigned int
    MPI_UNSIGNED_LONG   unsigned long int
    MPI_FLOAT           float
    MPI_DOUBLE          double
    MPI_LONG_DOUBLE     long double

- Can define your own also (see the sketch below)
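A small illustrative example of defining your own datatype (not from the talk), using MPI_Type_contiguous to treat three doubles as a single element; run with at least two processes.

/* User-defined MPI datatype sketch: a 3-double "point". */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank;
    double point[3];                       /* e.g. an (x, y, z) coordinate */
    MPI_Datatype point_type;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Type_contiguous(3, MPI_DOUBLE, &point_type);
    MPI_Type_commit(&point_type);          /* must commit before use */

    if (rank == 0) {
        point[0] = 1.0; point[1] = 2.0; point[2] = 3.0;
        MPI_Send(point, 1, point_type, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(point, 1, point_type, 0, 0, MPI_COMM_WORLD, &status);
        printf("received (%g, %g, %g)\n", point[0], point[1], point[2]);
    }

    MPI_Type_free(&point_type);
    MPI_Finalize();
    return 0;
}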

Collective Communication

- Barrier synchronization
- Broadcast from one to many
- Gather from many to one
- Scatter data from one to many
- Scatter/Gather (complete exchange, all-to-all)
- … and many more (a Scatter/Gather sketch follows the figures below)

[Figures: Broadcast; Scatter and Gather; All Gather; All to All]
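To make the figures concrete, here is a hedged sketch combining two of the collectives listed above (Scatter and Gather); the buffer sizes and values are arbitrary and not from the talk.

/* Scatter/Gather sketch: root scatters one int to each rank, gathers results. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size, mine;
    int *sendbuf = NULL, *recvbuf = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                       /* root prepares one value per process */
        sendbuf = malloc(size * sizeof(int));
        recvbuf = malloc(size * sizeof(int));
        for (int i = 0; i < size; i++) sendbuf[i] = 10 * i;
    }

    /* Scatter: one element of sendbuf goes to each rank. */
    MPI_Scatter(sendbuf, 1, MPI_INT, &mine, 1, MPI_INT, 0, MPI_COMM_WORLD);
    mine += 1;                             /* each process does its bit of work */

    /* Gather: every rank's result returns to the root. */
    MPI_Gather(&mine, 1, MPI_INT, recvbuf, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        for (int i = 0; i < size; i++) printf("from rank %d: %d\n", i, recvbuf[i]);
        free(sendbuf); free(recvbuf);
    }

    MPI_Finalize();
    return 0;
}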

Profiling Parallel Code

- MPI has profiling built in:
    $ mpicc -o simple -mpilog simple.c
  - Generates a log file of all calls during execution
- A log file visualiser is available
  - Jumpshot-3

Parallelising MATLAB

Speeding up MATLAB Code

- Use vector/matrix form of evaluations:
    inner = A'*B;
  rather than
    inner = 0;
    for i = 1:size(A,1)
        inner = inner + A(i).*B(i);
    end
- Pre-allocate matrices:
    S = zeros(N, K);
- Clear unused variables to avoid swapping:
    clear S;
- Re-write bottlenecks in C with MEX
- Parallelise!

Why not a Parallel MATLAB?

- Parallelism is a good idea!
- Why doesn't MathWorks provide support?
- MathWorks says [1]:
  - Communication bottlenecks and problem size
  - Memory model and architecture
  - (Only) useful for "outer-loop" parallelism
  - Most time is spent in the parser, interpreter and graphics
  - Business situation

Parallelising MATLAB: overview

- Parallelism through high-level toolboxes
- Medium-to-coarse granularity
  - Good for "outer loop" parallelisation
- Uses message passing:
  - MPI
  - PVM
  - Socket based
  - File IO based

Toolboxes Available

    Status  Toolbox                          Origins
    X $     MultiMATLAB                      Cornell
    S       DP-Toolbox                       U of Rostock, Germany
    B       MPITB/PVMTB                      U of Granada, Spain
    $       MATmarks                         U of Illinois
    S       MatlabMPI                        MIT
    S       MATLAB*P                         MIT
    S       Matlab Parallelization Toolkit   Einar Heiberg
    ?       Parallel Toolbox for MATLAB      Unknown

Key: X = no source, $ = commercial, B = binaries only, S = open source

MatlabMPI

- Limited subset of MPI commands
  ✗ Very primitive handling of basic operations
  ✗ No barrier function for synchronization
- Communication based on file IO
  ✓ Very portable: no MEX
  ✗ File system, disc and NFS overhead
  ✗ Spin locks for blocking reads
- Uses RSH/SSH to spawn remote MATLAB instances
  - Machine list supplied in m-files

MatlabMPI API

    MPI_Run        Runs a MATLAB script in parallel
    MPI_Init       Initialises the toolbox
    MPI_Finalize   Cleans up at the end
    MPI_Comm_size  Gets the number of processors in a communicator
    MPI_Comm_rank  Gets the rank of the current processor within a communicator
    MPI_Send       Sends a message to a processor (non-blocking)
    MPI_Recv       Receives a message from a processor (blocking)
    MPI_Abort      Kills all MATLAB jobs started by MatlabMPI
    MPI_Bcast      Broadcasts a message (blocking)
    MPI_Probe      Returns a list of all incoming messages

MatlabMPI Performance

- Mainly used for embarrassingly parallel problems
- Matches native MPI at 1MByte message size [12]
- Spin locks and busy waiting caused a DDoS attack on our NFS

Simulation Example

Iterative Power Control

- Iterative solution to a Power Control problem (see the sketch below)
- Each iteration of P_i depends on all P_j, j ≠ i
- We want to simulate over 1,000 runs with different user signature sequences S
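The update equation on the original slide did not survive extraction. Purely as a hedged illustration (it may not be the exact formula used in the talk), a standard target-SIR iteration has the shape

\[
P_i(n+1) = \frac{\gamma_i}{G_{ii}} \left( \sum_{j \neq i} G_{ij}\, P_j(n) + \sigma_i^2 \right), \qquad i = 1, \dots, M,
\]

where \(\gamma_i\) is user i's target SIR, \(G_{ij}\) are link gains and \(\sigma_i^2\) is the receiver noise; the coupling through all \(P_j(n)\), \(j \neq i\), is what forces synchronization if the iteration itself is parallelised.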

Inner Loop Parallelism

- Split the iteration P_i(n+1) over p processors
- Synchronize at the end of each iteration
- Is there a faster way?

Outer Loop Parallelism

- Split the 1,000 runs over p processors
- No need for explicit synchronization
  - Each simulation run is independent
  - Embarrassingly parallel problem
- Master/slave model will work quite well (a static split is sketched below)
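A hedged C sketch of this outer-loop decomposition (the actual simulation was in MATLAB via MatlabMPI): each process takes every p-th run and the results are combined once at the end. NRUNS and run_simulation() are placeholders, not the talk's code.

/* Static outer-loop split of independent Monte Carlo runs. */
#include <stdio.h>
#include <mpi.h>

#define NRUNS 1000

static double run_simulation(int run) { return (double)run; }  /* placeholder run */

int main(int argc, char *argv[])
{
    int rank, p;
    double local_sum = 0.0, total = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    /* Round-robin split: this rank handles runs rank, rank+p, rank+2p, ... */
    for (int run = rank; run < NRUNS; run += p)
        local_sum += run_simulation(run);

    /* The only communication needed: combine results at the very end. */
    MPI_Reduce(&local_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("average over %d runs: %g\n", NRUNS, total / NRUNS);

    MPI_Finalize();
    return 0;
}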

Simulation Results

Sorry, no results due to the MatlabMPI DDoS attack!

Wrap Up

Conclusions

- Clustering provides a cheap way to increase computational power through parallelism
- Parallelism is present in many problems, to a degree
- Message passing is one method of unifying computation amongst distributed processors
- MATLAB can be used for coarse-grain parallel applications

References-1

1. "Why there isn't a parallel MATLAB", Cleve Moler, http://www.mathworks.com/company/newsletter/pdf/spr95cleve.pdf
2. "Parallel Matlab survey", Ron Choy, http://supertech.lcs.mit.edu/~cly/survey.html
3. "MultiMATLAB: Integrating MATLAB with High-Performance Parallel Computing", V. Menon and A. Trefethen, Cornell University
4. "Matpar: Parallel Extensions for MATLAB", P. Springer, Jet Propulsion Laboratory, CalTech
5. "Message Passing under MATLAB", J. Baldomero, U of Granada, Spain
6. "Performance of Message-Passing MATLAB Toolboxes", J. Fernandez, A. Canas, A. Diaz, J. Gonzalez, J. Ortega and A. Prieto, U of Granada, Spain

References-2

7. "Parallel and Distributed Computation", D. Bertsekas and J. Tsitsiklis, Prentice Hall, NJ, 1989
8. "Message Passing Interface Forum", http://www.mpi-forum.org/
9. "MPICH - A Portable Implementation of MPI", http://www-unix.mcs.anl.gov/mpi/mpich/
10. "A High-Performance, Portable Implementation of the MPI Message Passing Interface Standard", W. Gropp and E. Lusk, Argonne National Laboratory
11. "Tutorial on MPI: The Message Passing Interface", W. Gropp, http://www-unix.mcs.anl.gov/mpi/tutorial/gropp/talk.html
12. "Parallel Programming with MatlabMPI", J. Kepner, MIT Lincoln Laboratory, 2001