HPC Job Scheduling and Co-Scheduling in HPC Clusters
HPC Job Scheduling and Co-scheduling in HPC Clusters
October 26, 2018
Nikolaos Triantafyllis, PhD student
School of Electrical and Computer Engineering - N.T.U.A.

Categories of Schedulers
The majority of schedulers can be categorized into:
• Operating Systems Process Schedulers
• Cluster Systems Jobs Schedulers
• Big Data Schedulers

Operating Systems Process Schedulers
• During scheduling events, an algorithm has to assign CPU time to tasks
• Focus on responsiveness and low overhead
• Most notable process schedulers:
  ◦ Cooperative Scheduling (CS)
  ◦ Multi-Level Feedback Queue (MLFQ)
  ◦ O(n) Scheduler
  ◦ O(1) Scheduler
  ◦ Completely Fair Scheduler (CFS)
  ◦ Brain F Scheduler (BFS)
• CFS: each process should receive an equal share of CPU time. Current Linux kernels use CFS
• BFS: improves interactivity, but lowers performance. Proposed in 2009

Cluster Systems Jobs Schedulers
• During scheduling events, an algorithm has to assign nodes to jobs
• Focus on scalability and high throughput
• Most notable job schedulers:
  ◦ Simple Linux Utility for Resource Management (SLURM)
  ◦ Maui Cluster Scheduler (Maui)
  ◦ Moab High-Performance Computing Suite (Moab)
  ◦ Univa Grid Engine (UGE)
  ◦ LoadLeveler (LL)
  ◦ Load Sharing Facility (LSF)
  ◦ Portable Batch System (PBS) [OpenPBS, TORQUE, PBS Pro]
  ◦ Globus toolkit
  ◦ GridWay
  ◦ HTCondor
  ◦ Mesos
  ◦ Open MPI
  ◦ TORQUE
  ◦ Borg and Omega

Big Data Schedulers
• During scheduling events, an algorithm has to assign nodes to jobs
• Jobs involve storage and processing of large and complex data sets
• Support specialized frameworks and target a very limited set of problems
• Most notable big data schedulers:
  ◦ Dryad
  ◦ MapReduce
  ◦ Hadoop
  ◦ HaLoop
  ◦ Spark
  ◦ CIEL
“Big Data: New York Stock Exchange produces about 1 TB of new trade data per day.”

Cluster Structure
Figure 1: Typical resource management system

Job Submission example in SLURM
#!/bin/bash
# Example with 48 MPI tasks and 24 tasks per node.
#
# Project/Account (use your own)
#SBATCH -A hpc2n-1234-56
#
# Number of MPI tasks
#SBATCH -n 48
#
# Number of tasks per node
#SBATCH --tasks-per-node=24
#
# The runtime of this job is less than 12 hours.
#SBATCH --time=12:00:00

module load openmpi/gcc

srun ./mpi_program

# End of submit file
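As a minimal usage sketch, assuming the submit file above is saved as submit.sh (the file name and job ID below are illustrative): the script is handed to the SLURM controller with sbatch, which queues the job until the requested 48 task slots (2 nodes at 24 tasks each) become available.

$ sbatch submit.sh          # hand the batch script to the SLURM controller
Submitted batch job 123456  # (illustrative job ID)
$ squeue -u $USER           # show this user's pending and running jobs
$ scancel 123456            # cancel the job if needed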
HPC Resource Management
Figure 2: Typical resource management system
• Substitute an external scheduler (e.g. Maui) for the internal scheduler to enhance functionality

Cluster Systems Jobs Schedulers: Brief Description
Simple Linux Utility for Resource Management (SLURM)
• Initially developed by Lawrence Livermore National Laboratory (LLNL)
• Open source
• Supported only on the Linux kernel
• Runs on 6 of the top 10 systems, including the number 1 system, Sunway TaihuLight with 10,649,600 computing cores
• Is used by 50% of the world's supercomputers
Maui Cluster Scheduler (Maui)
• Developed by Adaptive Computing, Inc.
• Open source

Cluster Systems Jobs Schedulers: Brief Description
Moab High-Performance Computing Suite (Moab)
• Developed by Adaptive Computing, Inc.
• Proprietary
• Successor of the Maui framework
• Additional features
• Is used by 40% of the top 10, top 25 and top 100 on the Top500 list
Univa Grid Engine (UGE)
• Developed by Oracle
• Proprietary since 2010
• Also known as Oracle Grid Engine
• Supports job checkpointing (snapshot of the current application state)

Cluster Systems Jobs Schedulers: Brief Description
LoadLeveler (LL)
• Developed by IBM
• Proprietary
• Supports job checkpointing and gang scheduling
Load Sharing Facility (LSF)
• Developed by IBM
• Proprietary
• Supports priority escalation (job priority increases at every time interval)

Cluster Systems Jobs Schedulers: Brief Description
Portable Batch System (PBS)
• Developed by NASA
• 3 versions:
  ◦ OpenPBS - open source, suitable for small clusters
  ◦ TORQUE - proprietary, fork of OpenPBS (Adaptive Computing, Inc.)
  ◦ PBS Professional - commercial version
Globus toolkit
• Developed by the Globus Alliance
• Open source
• Set of tools for constructing a computing grid
• Communicates with a local resource manager (e.g. PBS, UGE, LL)

Cluster Systems Jobs Schedulers: Brief Description
GridWay
• Developed by researchers at the University of Madrid
• Open source
• Built on top of the Globus Toolkit framework
• Contains a module that detects slowdown (application performance monitoring) and requests job migration
HTCondor
• Developed by the University of Wisconsin-Madison
• Open source
• Known as Condor before 2012
• A number of tools and frameworks have been built on top of HTCondor, e.g. DAGMan (Directed Acyclic Graph Manager), where applications are nodes and dependencies are edges

Cluster Systems Jobs Schedulers: Brief Description
Mesos
• Developed by the Apache Software Foundation
• Open source
• The Mesos master allocates resources
• Fault tolerant (uses the ZooKeeper framework to elect a new master)
Open MPI
• Developed by a consortium of partners
• Open source
• Schedules jobs on a slot or node basis in Round-Robin (RR) fashion

Cluster Systems Jobs Schedulers: Brief Description
TORQUE - Terascale Open-source Resource and QUEue Manager
• Developed by Adaptive Computing
• Fork of OpenPBS
• Proprietary since June 2018
• Supports external job schedulers
Borg and Omega
• Developed by Google
• Proprietary
• Borg is a resource manager that offers resources to scheduler instances
• The Omega project deploys schedulers that work in parallel and share the state of resources

Typical resource management system
Figure 3: SLURM resource management system

Typical resource management system
Figure 4: Multi-cluster Environment

Job Taxonomy regarding flexibility
Five types of jobs can be distinguished from the global perspective of the job scheduler:
1. Rigid Jobs - require a fixed set of resources (fixed static resource allocation)
2. Moldable Jobs - allow a variable set of resources, which must be allocated before the job starts (variable static resource allocation); see the sketch after this list
3. Malleable Jobs - allow a variable set of resources which the scheduler dynamically (de)allocates. The scheduler must inform the running job so it can adapt to the new resource allocation (scheduler-initiated and app-executed, variable dynamic resource allocation)
4. Evolving Jobs - the reverse of malleable (app-initiated and scheduler-executed, variable dynamic resource allocation)
5. Adaptive Jobs - combination of malleable and evolving characteristics (app- or scheduler-initiated and scheduler- or app-executed, respectively; variable dynamic resource allocation)
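As a minimal sketch of how a moldable request can be expressed on a SLURM system (the node range, time limit and program name below are illustrative assumptions), sbatch accepts a minimum-maximum node range; the scheduler picks the actual node count within that range before the job starts, and the allocation then stays fixed for the job's lifetime.

#!/bin/bash
# Moldable request (sketch): accept anywhere from 2 to 8 nodes.
# The scheduler fixes the actual count at start time; it does not change afterwards.
#SBATCH --nodes=2-8
#SBATCH --time=01:00:00

# SLURM exports the granted node count to the job environment.
echo "Granted ${SLURM_JOB_NUM_NODES} nodes"
srun ./mpi_program

A rigid job would instead pin the count to a single value (e.g. --nodes=4). Malleable and evolving behavior additionally requires runtime coordination between the application and the scheduler, which plain batch directives like these do not provide.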
Example of Rigid Job
Figure 5: A space/time diagram, where the Y-axis represents compute nodes and the X-axis represents time, illustrating a rigid job

Example of Moldable Job
Figure 6: A space/time diagram, where the Y-axis represents compute nodes and the X-axis represents time, illustrating a moldable job

Example of Malleable Job
Figure 7: A space/time diagram, where the Y-axis represents compute nodes and the X-axis represents time, illustrating a malleable job

Example of Evolving Job
Figure 8: A space/time diagram, where the Y-axis represents compute nodes and the X-axis represents time, illustrating an evolving job

Example of Adaptive Job
Figure 9: A space/time diagram, where the Y-axis represents compute nodes and the X-axis represents time, illustrating an adaptive job

HPC & Cloud Application types
Most HPC workloads alternate between phases during their life cycle
Figure 10: Types of HPC & Cloud Applications based on system resources (CPU, I/O, network, memory)

Scheduling Algorithms
• Scheduling algorithms can be broadly divided into two classes: time-sharing and space-sharing
• Time-sharing algorithms divide time on a processor into several discrete intervals or slots. These slots are then assigned to unique jobs
• Space-sharing algorithms give the requested resources to a single job until the job completes execution. Most cluster schedulers operate in space-sharing mode

Time-sharing Algorithms
• Local scheduling - threads are placed in a global run queue and dispatched to the available processors in a Round-Robin (RR) strategy
• Gang scheduling - related threads run simultaneously on different processors, so that if two or more of them communicate with each other, they are all ready to communicate at the same time. If they were not gang-scheduled, one thread could wait to send or receive a message to another while that one is sleeping, and vice versa

Space-sharing Algorithms
• First Come First Served (FCFS)
• Round Robin (RR)
• Shortest Job First (SJF) / Longest Job First (LJF) - relies on the job execution time provided by the user, which is often unreliable
• Smallest Job First (SJF) / Largest Job First (LJF) - job size is provided by the user
• Advanced Reservation - the availability of a set of resources is guaranteed at a particular time
• Backfilling - an optimization that allows shorter jobs to execute while the long job at the head of the queue waits for free processors
• Preemptive Backfilling - backfilling with QoS, where higher-priority jobs can preempt lower-priority ones
• Fair-Share - equal distribution, while a site administrator can set system utilization targets for users, groups and QoS levels (RR strategy at each level of abstraction)

HPC Cluster Hierarchy
Figure 11: Hierarchical structure of a typical HPC system

HPC Co-Allocation
Figure 12: Allocation on node level vs allocation on core level

Dedicated Resource Allocation
• Most parallel scientific apps have frequent communication phases. Any imbalance in computation times results in waiting times, heavily reducing scalability. The easiest way to ensure balanced performance is to assign dedicated resources at the node level

Co-Scheduling
• Definition: