
Legion: Programming Heterogeneous, Distributed Parallel Machines
    Alex Aiken
    Stanford
    Joint work involving LANL, NVIDIA & Stanford

Modern Supercomputers
    Heterogeneity
        Processor kinds
        Relative performance
    Distributed Memory
        Non-uniform
        Distinct from processors
    Growing disparities
        FLOPS >> bandwidth
        bandwidth >> latency

Programming System Goals
    High Performance
        We must be fast
    Performance Portability
        Across many kinds of machines and over many generations
    Programmability
        Sequential semantics, parallel execution

Can We Fulfill These Goals Today?
    Yes … at great cost:
        Do you want to schedule that graph? (High Performance)
        Do you want to re-schedule that graph for every new machine? (Performance Portability)
        Do you want to be responsible for generating that graph? (Programmability)
    Today: programmer’s responsibility
    Tomorrow: programming system’s responsibility
    [Figure: Task graph for one time step on one node … of a mini-app]

The Crux
    How do we describe the data?
        Most programming systems focus on control
        Minimal facilities for organization/structure of data
    Why?
    Answer: Solve the aliasing problem
        Can two references refer to the same data?
    Answer: Decouple naming data from layout/location

Legion Approach
    Capture the structure of program data
    Decouple specification from mapping
    Asynchronous tasking
    Automate
        data movement
        parallelism discovery
        synchronization
        hiding long latency

Example: Circuit Simulation
    task simulate_circuit(Region[Node] N, Region[Wires] W)
    {
    }
    Tasks are the unit of parallel execution.
    Logical regions are (typed) collections.
    Logical: no implied layout, no implied location.

Partitioning
    task simulate_circuit(Region[Node] N, Region[Wires] W)
    {
        Array[Region[Node]] private, shared, ghost;
        …
        [ P, S ] = partition(ps_map, N)
        private = partition(private_map, P)
        shared = partition(shared_map, S)
        ghost = partition(ghost_map, S)
    }
    [Figure: N is split into P and S; P into p1 … p3; S into s1 … s3 and g1 … g3]

Region Trees
    [Figure: region tree with root N, children P and S, and leaves p1 … p3, s1 … s3, g1 … g3]
    Partition labels on the tree: disjoint, possibly overlap
    Region trees convey:
        Locality
        Independence
        Aliasing

Legion Tasks
    task simulate_circuit(Region[Node] N, Region[Wires] W) : N, W
    {
        …
        calc_currents(piece[0], p0, s0, g0);
        calc_currents(piece[1], p1, s1, g1);
        distribute_charge(piece[0], p0, s0, g0);
        distribute_charge(piece[1], p1, s1, g1);
        …
    }
    Subtasks:
        task calc_currents(Piece p) : p.private, p.shared, p.ghost, p.wires
        task distribute_charge(Piece p) : p.private, p.shared, p.ghost, p.wires
    A task must declare the regions it will use.
    Subtask containment: A subtask can only use (sub)regions accessible to its parent task.
    Tasks appear to execute in program order.

Interference
    Informal definition: Two tasks T1 and T2 are non-interfering, written T1#T2, if there is no dependence between them on their region arguments.
    If T1#T2 then T1 and T2 can execute in parallel.
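
The informal definition above can be illustrated with a deliberately conservative sketch. It assumes a region can be modeled as a set of element ids and that any overlap between two tasks' region arguments counts as a dependence; it ignores privileges and fields. The names `non_interfering`, `regions_overlap`, and the sample regions are hypothetical and are not Legion API.

```python
from typing import FrozenSet, Sequence

# Assumption: a logical region is modeled as a frozen set of element ids.
Region = FrozenSet[int]


def regions_overlap(r1: Region, r2: Region) -> bool:
    """Two regions may alias if they share at least one element."""
    return not r1.isdisjoint(r2)


def non_interfering(t1_regions: Sequence[Region],
                    t2_regions: Sequence[Region]) -> bool:
    """Conservative T1#T2 test: no pair of region arguments overlaps.

    Any overlap is treated as a dependence, whatever the tasks do
    with the data.
    """
    return not any(regions_overlap(a, b)
                   for a in t1_regions
                   for b in t2_regions)


# Hypothetical two-piece circuit: private, shared, and ghost nodes.
p0, s0 = frozenset({0, 1}), frozenset({2})
p1, s1 = frozenset({4, 5}), frozenset({6})
g0 = frozenset({6})   # ghost nodes of piece 0 alias shared nodes of piece 1
g1 = frozenset({2})   # ghost nodes of piece 1 alias shared nodes of piece 0

private_only = non_interfering([p0], [p1])
whole_pieces = non_interfering([p0, s0, g0], [p1, s1, g1])
```

In this sketch two tasks that touch only the private regions are non-interfering, while two tasks that each take a whole piece are reported as interfering, because a ghost region overlaps the other piece's shared region.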
Execution Model
    task simulate_circuit(Region[Node] N, Region[Wires] W) :
    {
        …
        calc_currents(piece[0], p0, s0, g0);
            Interferes?
        calc_currents(piece[1], p1, s1, g1);
            Interferes?
        distribute_charge(piece[0], p0, s0, g0);
        distribute_charge(piece[1], p1, s1, g1);
        …
    }
    Tasks are issued in program order.

Privileges
    task simulate_circuit(Region[Node] N, Region[Wires] W) : ReadWrite(N, W)
    {
        …
        calc_currents(piece[0], p0, s0, g0);
        calc_currents(piece[1], p1, s1, g1);
        distribute_charge(piece[0], p0, s0, g0);
        distribute_charge(piece[1], p1, s1, g1);
        …
    }
    task calc_currents(Piece p) : ReadWrite(p.wires), Read(p.private, p.shared, p.ghost)
    task distribute_charge(Piece p) : ReadOnly(p.wires), Reduce(p.private, p.shared, p.ghost)

Non-Interference Dimensions
    Several dimensions of the # operator
        Entries (rows)
        Privileges
        Fields (columns)
    Logical regions are a relational data model
        Partitioning is selection (σ)
        Field-slicing is projection (π)
        Don’t support all relational operators
    [Figure: table of Node entries (rows) with fields Voltage, Capac., Induct., Charge (columns)]

Legion Summary
    Logical regions: a relational data model
        Support partitioning and slicing
        Convey locality, independence, aliasing
    Implicit task parallelism
        Task may have arbitrary sub-tasks
        Tasks declare region usage including privileges and fields
        Tasks appear to execute in program order
        Execute in parallel when non-interference established
    Machine independent specification of application

Legion Runtime System
    A Distributed Hierarchical Out-of-Order Task Processor
    Tasks : Regions :: Instructions : Registers
    [Figure: a parent task issuing t0: r5, r7; t1: r0, r2; t2: r1, r2; t3: r4, r6 to the parallel Legion runtime for distributed execution]
    Pipeline stages: Dep. Analysis, Resolve Spec., Map, Distribute, Execute, Complete, Commit

Dependence Analysis
    task simulate_circuit(Region[Node] N, Region[Wires] W) : ReadWrite(N, W)
    {
        …
        calc_currents(piece[0]);
        calc_currents(piece[1]);
        distribute_charge(piece[0]);
        distribute_charge(piece[1]);
        …
    }
    task calc_currents(Piece p) : ReadWrite(p.wires), Read(p.private, p.shared, p.ghost)
    task distribute_charge(Piece p) : ReadOnly(p.wires), Reduce(p.private, p.shared, p.ghost)
    [Figure: region tree with root N, children P and S, and leaves p0, p1, s0, s1, g0, g1]
    First step: annotations "disjoint" and "read only"; tasks shown: CC[0], CC[1]
    Second step: annotations "WAR w/CC[1]" and "read-after-write of wires w/CC[0]"; tasks shown: DC[0], CC[1], CC[0]

Mapping Interface
    Programmer selects:
        Where tasks run
        Where regions are placed
    Mapping computed dynamically
    Decouple correctness from performance
    [Figure: regions rc, rw, rn, rw1, rw2, rn1, rn2 and tasks t1 … t5 mapped onto a machine with x86 cores, caches ($), NUMA, DRAM, and CUDA with FB]
    Pipeline stages: Dep. Analysis, Resolve Spec., Map, Distribute, Execute, Complete, Commit
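
The idea that the programmer chooses where tasks run and where regions are placed, without affecting correctness, can be sketched with a toy mapper. This is only an illustration under stated assumptions: the `Task` class, the two mapper classes, and their `select_processor` / `select_memory` methods are hypothetical names and are not the Legion mapper API. The processor and memory labels (x86, CUDA, DRAM, FB) are taken from the slide's diagram.

```python
import copy
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Task:
    name: str
    region: str                          # the one region this toy task updates
    body: Callable[[List[float]], None]  # mutates the region's data in place


class CpuOnlyMapper:
    """Hypothetical policy: run everything on x86, keep regions in DRAM."""

    def select_processor(self, task: Task) -> str:
        return "x86"

    def select_memory(self, region: str, processor: str) -> str:
        return "DRAM"


class GpuPreferringMapper:
    """Hypothetical policy: send calc_* tasks to CUDA with regions in FB."""

    def select_processor(self, task: Task) -> str:
        return "CUDA" if task.name.startswith("calc") else "x86"

    def select_memory(self, region: str, processor: str) -> str:
        return "FB" if processor == "CUDA" else "DRAM"


def run(tasks: List[Task], data: Dict[str, List[float]],
        mapper) -> Dict[str, Tuple[str, str]]:
    """Execute tasks in program order; the mapper only records placement."""
    placement = {}
    for task in tasks:
        proc = mapper.select_processor(task)
        placement[task.name] = (proc, mapper.select_memory(task.region, proc))
        task.body(data[task.region])     # the result never depends on placement
    return placement


def scale(values: List[float]) -> None:
    for i in range(len(values)):
        values[i] *= 2.0


def shift(values: List[float]) -> None:
    for i in range(len(values)):
        values[i] += 1.0


TASKS = [Task("calc_currents", "wires", scale),
         Task("distribute_charge", "nodes", shift)]
INITIAL = {"wires": [1.0, 2.0], "nodes": [0.5]}


def run_with(mapper):
    data = copy.deepcopy(INITIAL)        # each run starts from the same state
    placement = run(TASKS, data, mapper)
    return data, placement
```

Calling `run_with(CpuOnlyMapper())` and `run_with(GpuPreferringMapper())` yields different placements but the same final data, which is the sense in which the mapping affects performance rather than correctness in this sketch.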

Correctness Independent of Mapping
    task simulate_circuit(Region[Node] N, Region[Wires] W) : ReadWrite(N, W)
    {
        …
        calc_currents(piece[0]);
        calc_currents(piece[1]);
        distribute_charge(piece[0]);
        distribute_charge(piece[1]);
        …
    }
    task calc_currents(Piece p) : ReadWrite(p.wires), Read(p.private, p.shared, p.ghost)
    task distribute_charge(Piece p) : ReadOnly(p.wires), Reduce(p.private, p.shared, p.ghost)
    [Figure: region tree with root N, children P and S, and leaves p0, p1, s0, s1, g0, g1; tasks DC[0], CC[1], CC[0] placed on Node 0 and Node 1, with a copy between them]

Distribution
    After tasks are mapped they are distributed to target node
    Task execution can generate sub-tasks
    Do we need inter-node dependence checks?
    Subtask containment: A subtask can only use (sub)regions accessible to its parent task.
    [Figure: tasks T1 and T2 with subtasks t1 and t2 on Node 0 and Node 1]
    Pipeline stages: Dep. Analysis, Resolve Spec., Map, Distribute, Execute, Complete, Commit

Independence Theorem
    Let t1 be a subtask of T1 and t2 be a subtask of T2.
    Then T1#T2 => t1#t2
    Proof: Use subtask containment.
    Observation: It is sufficient to test interference only of sibling tasks.
    Note: Similar property holds in functional languages, but it holds in Legion even though we may imperatively mutate regions.

Runtime Summary
    A distributed hierarchical out-of-order task processor
        Analogous to hardware processors
    Can exploit parallelism implicitly:
        Task-, data-, and nested-parallelism
    Runtime builds task graph ahead of execution to hide latency and costs of dynamic analysis
    Decouples mapping decisions from correctness
        Enables efficient porting and (auto) tuning

S3D
    Production combustion simulation
    Written in ~200K lines of Fortran
    Direct numerical simulation using explicit methods

[Embedded page: T.F. Lu, C.K. Law / Progress in Energy and Combustion Science 35 (2009) 192–215, p. 196]

[Figure: Methane/Air, φ=1.0, p=1 atm. Auto-Ignition Delay (sec) versus 1000/T (1/K) for GRI-Mech 1.2 and the 12-Step, 10-Step, and 4-Step mechanisms]
Fig. 9. Comparison of predicted ignition delay times of atmospheric, stoichiometric methane–air mixtures using various reduced mechanisms and the detailed mechanism, showing the inadequacy of the four-step class of mechanisms.

… is to fix the pressure to atmospheric because many fundamental and practical combustion phenomena and processes take place under atmospheric pressure. Other restrictions can also be imposed, such as lean combustion, high-temperature flames without considering the possible presence of ignition described by low-temperature chemistry, and homogeneous charge combustion. However, except for well-controlled laboratory-scale experiments, the combustion mode is frequently a mixed one in most complex and practical combustion situations, involving for example both premixed and nonpremixed reactants, or both ignition and flames. Consequently it is more conservative to apply unrestricted comprehensive mechanisms in simulations of complex flows.

4. Overview of mechanism reduction and facilitated computation

The availability of a comprehensive detailed reaction mechanism does not mean that it can be readily adopted for computational simulation. In fact, except for the smallest of fuels such as hydrogen and methane, and for such simple combustion systems as the 1-D laminar flame, detailed mechanisms of the larger fuels are simply too large for simulation without substantial reduction.

… component in the simulation of chemically reacting flows. It is important because the fidelity of all subsequent steps of mecha-
Fig.