Out-Of-Order Execution Parallel Virtual Machine
OVM: Out-of-order Execution Parallel Virtual Machine
George Bosilca, Franck Cappello
Laboratoire de Recherche en Informatique, Université d'Orsay, France
[email protected]

Global view
• Parallel high-performance computing architectures are evolving toward a multi-level NUMA approach: clusters of SMPs, PC clusters.
• High performance is required by a large spectrum of users (companies, research labs): weather/climate, high-energy particle physics, pharmaceutics, bioinformatics, medical imaging, cinema, simulation, etc.

Programming languages and execution models
• Regular applications: without load balancing
• Non-uniform applications: with load balancing (GRID, P2P, branch and bound, dynamic generation of additional computations)

A central place for the execution model
• Applications
• Programming models: e.g. data parallel, concurrent model, master/slave
• Languages: Fortran, C, C++, Java, Matlab, Scilab
• Expression / API
• Execution models and execution environments:
  – Message passing: MPI, OVM (RPC)
  – Virtual shared memory: SCASH, TreadMarks
  – Global computing infrastructures: NetSolve, Ninf, Entropia, United Devices, SETI@home
  – Computational grids: Globus, Legion, XtremWeb
  – GRID-enabled message passing: MPICH-G
  – Virtual shared memory for the GRID: Omni OpenMP
• Virtual machine / operating system
• Architectures (hardware): GRID, P2P, distributed systems, local grids, clusters of SMPs, clusters
• Portability and virtualization

[TOP 500 chart, 1993-2000: number of systems per architecture class (SIMD, MPP, SMP, uniprocessor, constellations, i.e. clusters of SMPs, and clusters). Example machines: x86/Alpha/Itanium clusters, Compaq SC, Sun HPC clusters, NEC SX-4 / Earth Simulator, SGI O3000/2000, HP Hyperplex, Sun HPC 10000, IBM SP3, Hitachi SR8000, CRAY T3E, Fujitsu VPP 5000, HP Superdome, NEC SX-5.]

Target architectures: clusters, SMPs, NUMA machines, clusters of SMPs, NUMA clusters of SMPs, vector and pseudo-vector machines, NUMAs of SMPs, GRID, global computing, P2P, PC clusters, ...

Goal: offer programmers portable environments that preserve, if not their applications, at least their programming tools and environments.

Programming Models . (RPC)
• Initially not really developed for parallel applications
• Concurrent RPC using several threads
• Data always crosses the network with the RPC request
• "Global communication"-like operations have to be done locally
+ A whole range of applications can easily be expressed using RPC: master/slave, recursive, some kinds of branch-and-bound (sketched below)
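As a concrete illustration of that last point, here is a minimal, self-contained C sketch of the master/slave pattern expressed as RPC. The names `rpc_submit` and `square_task` are hypothetical stand-ins, not the OVM API: a real RPC layer would route each request through a broker to a remote server, while this sketch simply runs the service locally.

```c
/* A minimal sketch of the master/slave pattern expressed as RPC.
 * rpc_submit() is a local stand-in; in a real RPC layer the request
 * would be shipped to a remote server. */
#include <stdio.h>

typedef double (*service_fn)(double);        /* a registered service     */

static double square_task(double x) { return x * x; }  /* the slave work */

/* Hypothetical stand-in for an RPC call: in an OVM-like system the
 * broker would route the request to an available server. */
static double rpc_submit(service_fn fn, double arg) { return fn(arg); }

int main(void) {
    double results[8];
    for (int i = 0; i < 8; i++)              /* master: one request per task */
        results[i] = rpc_submit(square_task, (double)i);
    for (int i = 0; i < 8; i++)              /* master: collect the results  */
        printf("task %d -> %g\n", i, results[i]);
    return 0;
}
```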
Programming Models .. (OpenMP, MPI)
• OpenMP
  – Suitable only for shared-memory architectures
  – Complex to handle irregular applications
• MPI
  – Static environment (until MPI-2)
  [Diagram: alternating computation / communication / synchronization phases on each process]
  – Cache penalties, wait times, and load irregularity give unpredictable results on complex parallel architectures
  + Perfect for regular applications in a homogeneous environment
  + Has global communications

Programming Models … (Dataflow)
• Most of the time the dataflow graph is created statically at compilation time
• Internal load balancing
• Still no "global communications"
• SMART: horizontal and vertical parallelism, but only for shared-memory computers

Programming Model (PVM, Harness)
• PVM
  – The user must do the load balancing, depending on the application
  – Blocking global communications
  – Message-passing layer too complex and quite slow
  +++ Dynamic: pvm_spawn
• Harness
  – Supports multiple instances of PVM and MPI applications
  – More of an execution environment

Load-balancing environments
• Load balancing in multi-flow environments:
  – Athapascan (static dataflow graph + distributed resolution), ID lab, ENSIMAG
  – PM2 (thread migration), LIFL, ENS Lyon
  – Cilk (work stealing with a recursion heap), MIT
• Object-oriented environments with load-balancing capabilities ('95, '96):
  – Dome (C++ and PVM): SPMD, checkpointing and migration
  – Charm, Charm++ (thread based), Urbana-Champaign
• Load balancing for PVM/MPI ('93-'95; '01):
  – MPVM (process migration), Oregon Graduate Institute of Science & Technology
  – UPVM (thread migration), Oregon Graduate Institute of Science & Technology
  – Dynamic PVM (Condor checkpointing), University of Amsterdam
  – MPI TM (task migration via core dump), Mississippi State University
  – AMPI (thread migration using Charm++), Urbana-Champaign
  – Dynamite PVM (task migration)

Choosing a programming environment
• An equation with multiple parameters:
  – Computer architecture
  – Application type
  – Level of the programming paradigm (performance)

The ideal parallel environment?
• Dynamic: resources can join/leave the environment
• Automatic load balancing (with user hints)
• Suitable for HPC
• Suitable for different kinds of applications
• Same performance on all architectures

OVM general principles
Research goal: a virtual machine based on a dynamic macro-dataflow graph and RPC, giving a continuum between the SPMD and master/slave approaches.
• Why out-of-order?
[Architecture diagram: the client application issues OVM RPCs through the user API and shared libraries; the broker holds the dataflow graph, resource management, and a user scheduler; the servers hold shared libraries and data.]
• Based on decoupled RPC: data and requests are separate entities

Client
Simplified view of the system: a unique architecture.
Ø Create and initialize sets of data
Ø Define affinities between data
Ø Define the relationships between the data: (A,B) = operation(A,B,C,D) (sketched below)
Ø Collect the final results
• Provide services through plug-ins (shared libraries) on the server side
• Provide a long-term scheduling plug-in for the broker
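Here is a minimal C sketch of this client-side flow. The names `data_create` and `request_operation` are hypothetical, not the OVM API, and the local store stands in for server-side memory; the computation inside the operation is a placeholder. The point is the decoupling: once data sets have ids, a relationship such as (A,B) = operation(A,B,C,D) is expressed with ids only, without moving the data through the client.

```c
/* A minimal sketch of the client-side flow: data sets are registered
 * once and later referenced by id, so requests carry ids, not payloads.
 * All names are illustrative stand-ins, not the real OVM API. */
#include <stdio.h>

#define MAX_DATA 16
static double store[MAX_DATA];               /* stand-in for server memory  */

static int data_create(double v) {           /* register a data set, get id */
    static int next_id = 0;
    store[next_id] = v;
    return next_id++;
}

/* (A,B) = operation(A,B,C,D): reads and writes go through ids only.
 * The arithmetic is a placeholder for a real service. */
static void request_operation(int A, int B, int C, int D) {
    double a = store[A] + store[C];          /* new value of A */
    double b = store[B] * store[D];          /* new value of B */
    store[A] = a;
    store[B] = b;
}

int main(void) {
    int A = data_create(1.0), B = data_create(2.0);   /* create data  */
    int C = data_create(3.0), D = data_create(4.0);
    request_operation(A, B, C, D);                    /* declare link */
    printf("A=%g B=%g\n", store[A], store[B]);        /* collect      */
    return 0;
}
```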
Servers
• Multi-threaded approach
  – Only one server per node
• Support for dynamic multi-protocol communication devices
• Generic functions provided through the default plug-ins
• And/or additional services through user plug-ins
[Server diagram: data, services, collect]

Broker
• Guarantees correct execution
• Resolves dependencies (see the sketch below)
• Resource management
  – Available nodes
  – Available services
[Broker pipeline: dataflow graph construction, dependency resolution, ready requests, flow control and resource management, scheduler, completed requests, graph destruction]
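The broker's dependency resolution is the heart of the out-of-order execution. Below is a minimal, runnable C sketch of the idea, with illustrative structures rather than the OVM code: a request is dispatched as soon as all of its input data is available, so the dispatch order follows the data dependencies rather than the submission order.

```c
/* A minimal sketch of the broker's dependency resolution: requests
 * become "ready" when all of their inputs are available. A real
 * broker would block on completion events instead of busy-scanning. */
#include <stdio.h>

#define NDATA 4
#define NREQ  3

typedef struct {
    const char *name;
    int in[2];        /* ids of the two input data sets */
    int out;          /* id of the data set it produces */
    int done;
} request_t;

int main(void) {
    int available[NDATA] = { 1, 1, 0, 0 };   /* d0, d1 ready; d2, d3 not */
    request_t req[NREQ] = {
        { "F1: d3 = f(d2, d0)", { 2, 0 }, 3, 0 },  /* must wait for d2  */
        { "F2: d2 = f(d0, d1)", { 0, 1 }, 2, 0 },  /* ready immediately */
        { "F3: d0 = f(d3, d1)", { 3, 1 }, 0, 0 },  /* must wait for d3  */
    };
    int remaining = NREQ;
    while (remaining > 0) {                  /* broker scheduling loop   */
        for (int i = 0; i < NREQ; i++) {
            if (req[i].done) continue;
            if (available[req[i].in[0]] && available[req[i].in[1]]) {
                printf("dispatch %s\n", req[i].name); /* send to server  */
                available[req[i].out] = 1;   /* completion publishes data */
                req[i].done = 1;
                remaining--;
            }
        }
    }
    return 0;
}
```

Running this dispatches F2 before F1 even though F1 was submitted first, which is exactly the out-of-order behavior the broker exploits.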
Different RPC modes
1) Classic RPC
   [Diagram: the user program issues RPCs to the broker; the broker load-balances them across the servers; the data travels with each request (volatile data).]
2) Decoupled RPC
   [Diagram: the user program sends data and function requests as separate entities; the broker tracks the dependencies between RPC ids; the data stays on the servers (persistent data).]

Macro-dataflow
[Diagram: the user program issues data movements (move D1, move D2), data modifications (modify D1), and function requests (F1 on D1, F2 on D2) to the broker; the dataflow manager makes each RPC wait until its data dependencies are satisfied on the servers.]
• Requirement: starting a function on a server requires that all the data used by the function be available on that server
• Benefit: only the real data dependencies limit the parallelism

SPMD applications
• Purpose: an execution flow as close as possible to MPI
• Cyclic placement of the persistent data
• Problem: hiding the cost of the macro-dataflow management
[Diagram: each server holds persistent data and the shared-library functions (Funct 1, Funct 2, Funct 3); the client drives the data flow through the broker and its scheduling function.]
• Persistent data, static scheduling, global operations

Irregular applications
• Purpose: limit migration, increase the number of concurrent tasks
• Scheduling: first come, first served
[Diagram: servers load functions (X, Y, Z, *) from the shared libraries on demand and operate on volatile data sent through the broker.]
• Volatile data, dynamic scheduling, no global operations

Optimizations
• Classic multi-protocol RPC:
  – data is embedded in the request if its size is < 4 KB
  – otherwise (>= 4 KB) the data is retrieved asynchronously from the client
  – blocking or non-blocking mode for the client
• Non-blocking global communications: each server starts a new execution right after its contribution to a global communication
• Asynchronous global communications: the completion order of global communications may differ from their starting order (independent global communications can be executed in parallel)
• Task prefetch when the persistent data is located on the same server (hierarchical dataflow)
• Task prefetch in the volatile-data RPC mode, without checking resource dependencies, to hide the data-transfer cost of the next task
• Data cache on the server side (for volatile data)
• Both kinds of RPC can be used at the same time: (a) = operation(2, b, c)

Regular applications
• OVM vs. MPI on 3 kernels from the NAS NPB 2.3 benchmark: CG, FT, EP
  – CG: frequent reductions with short messages
  – FT: frequent all-to-all with long data
  – EP: pure parallel kernel with no communication
• Platform: cluster of Pentium Pro 200 + Myrinet, SCore environment (RWCP), optimized MPI global operations
[Chart: execution time in seconds (log scale) of the MPI and OVM versions of EP, FT, and CG on 2 to 32 nodes.]

Non-uniform load application
• Parallel version of AIRES (Auger project): simulation of very-high-energy particles entering the Earth's atmosphere
• Relies on the load-balancing mechanism; iso-energy blocks
• Speedup:
  Processors:             4     8     16    32    64
  Without initialization: 3.96  7.66  15.4  26.9  56.1
  With initialization:    3.72  7.2   13.5  20.8  36.8

Another irregular application
• Real-time ray tracing with OVM and POV-Ray
• Parallel algorithm:
  – global distribution of the scene
  – dynamic work distribution
  – compute each part of the final image and put the parts together to obtain the final image

Radiosity (any light)
[Chart: speedup versus number of processors (1 to 128), comparing OVM with published results (Bohn/Garmann 2000, Garmann 2000, Singh, Carter, Benavides; 1994-2001).]
• Master/slave approach with the master involved in the computation

Conclusion
• OVM fits a large class of applications well
• It can reach the same performance as MPI for some regular applications (even with different communication/computation ratios)
• OVM allows linear speedup on irregular applications (AIRES, ray tracing, radiosity) using a master/slave approach
• Multi-protocol (PM, TCP/IP, BIP) and multi-architecture: OVM can be used on many parallel computers

Fault tolerance
• How to become a fault-tolerant environment?
• The system keeps track of all the services requested by the clients
• Data replication (not necessarily the latest version)
• Multiple parallel executions of the same service
• Single point of failure: the broker
• Most importantly, fault tolerance can be achieved 100% automatically (a minimal sketch follows)
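To picture the re-execution idea behind this, here is a minimal C sketch with hypothetical structures, not the OVM implementation: since the broker logs every requested service, the incomplete requests of a failed server can be replayed on another one.

```c
/* A minimal sketch of request replay (illustrative, not OVM code):
 * the broker's request log is scanned when a server dies, and its
 * incomplete requests are re-dispatched to a surviving server. */
#include <stdio.h>

typedef struct { int id; int server; int completed; } logged_req_t;

static void fail_server(logged_req_t *log, int n, int dead, int spare) {
    for (int i = 0; i < n; i++)
        if (log[i].server == dead && !log[i].completed) {
            printf("replaying request %d on server %d\n", log[i].id, spare);
            log[i].server = spare;  /* re-dispatch; inputs come from replicas */
        }
}

int main(void) {
    /* request 0 finished on server 0; requests 1 and 2 still running */
    logged_req_t log[3] = { {0, 0, 1}, {1, 0, 0}, {2, 1, 0} };
    fail_server(log, 3, /*dead=*/0, /*spare=*/1);   /* server 0 crashes */
    return 0;
}
```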