HPC Job Scheduling: Co-scheduling in HPC Clusters

October 26, 2018

Nikolaos Triantafyllis, PhD student

School of Electrical and Computer Engineering - N.T.U.A.

Categories of Schedulers

The majority of schedulers can be categorized into:

• Operating Systems Process Schedulers
• Cluster Systems Jobs Schedulers
• Big Data Schedulers

1 Operating Systems Process Schedulers

• During scheduling events, an algorithm has to assign CPU time to tasks
• Focus on responsiveness and low overhead
• Most notable process schedulers:
  ◦ Cooperative Scheduling (CS)
  ◦ Multi-Level Feedback Queue (MLFQ)
  ◦ O(n) Scheduler
  ◦ O(1) Scheduler
  ◦ Completely Fair Scheduler (CFS)
  ◦ Brain F Scheduler (BFS)
• CFS: Each process should have an equal share of CPU time. Current Linux kernels use CFS (see the sketch after this list)
• BFS: Improves interactivity, but lowers performance. Proposed in 2009
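A minimal conceptual sketch of the CFS idea in Python (task names and weights are illustrative, and this is not the kernel implementation): the scheduler always runs the runnable task with the smallest weighted virtual runtime, which over time drives every task toward an equal share of CPU time.

# Conceptual CFS-style sketch: run the task with the smallest vruntime,
# charge it one weighted timeslice, and reinsert it. Illustrative only.
import heapq

class Task:
    def __init__(self, name, weight=1.0):
        self.name = name
        self.weight = weight      # higher weight = higher priority
        self.vruntime = 0.0       # weighted CPU time consumed so far

def schedule(tasks, timeslice=1.0, steps=8):
    # min-heap ordered by vruntime, mirroring CFS's leftmost-node selection
    heap = [(t.vruntime, i, t) for i, t in enumerate(tasks)]
    heapq.heapify(heap)
    for _ in range(steps):
        _, i, task = heapq.heappop(heap)          # smallest vruntime runs next
        task.vruntime += timeslice / task.weight  # heavier tasks accrue vruntime slower
        print(f"run {task.name:9s} vruntime={task.vruntime:.2f}")
        heapq.heappush(heap, (task.vruntime, i, task))

schedule([Task("editor", weight=2.0), Task("compiler"), Task("backup")])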

2 Cluster Systems Jobs Schedulers

• During scheduling events, an algorithm has to assign nodes to jobs
• Focus on scalability and high throughput
• Most notable jobs schedulers:
  ◦ Simple Linux Utility for Resource Management (SLURM)
  ◦ Maui Cluster Scheduler (Maui)
  ◦ Moab High-Performance Computing Suite (Moab)
  ◦ Grid Engine (UGE)
  ◦ LoadLeveler (LL)
  ◦ Load Sharing Facility (LSF)
  ◦ Portable Batch System (PBS) [OpenPBS, TORQUE, PBS Pro]
  ◦ Globus toolkit
  ◦ GridWay
  ◦ HTCondor
  ◦ Mesos
  ◦ Open MPI
  ◦ TORQUE
  ◦ Borg and Omega

3 Big Data Schedulers

• During scheduling events, an algorithm has to assign nodes to jobs
• Jobs involve storage and processing of large and complex data sets
• Support specialized frameworks, applicable to a rather limited set of problems
• Most notable Big Data schedulers:
  ◦ Dryad
  ◦ MapReduce
  ◦ Hadoop
  ◦ HaLoop
  ◦ Spark
  ◦ CIEL

“Big Data: New York Stock Exchange produces about 1 TB of new trade data per day.”

4 Cluster Structure

Figure 1: Typical resource management system

5 Job Submission example in SLURM


#!/bin/bash
# Example with 48 MPI tasks and 24 tasks per node.
#
# Project/Account (use your own)
#SBATCH -A hpc2n-1234-56
#
# Number of MPI tasks
#SBATCH -n 48
#
# Number of tasks per node
#SBATCH --tasks-per-node=24
#
# Runtime of this job is less than 12 hours.
#SBATCH --time=12:00:00

module load openmpi/gcc

srun ./mpi_program

# End of submit file
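Assuming the script above is saved as submit.sh (an illustrative filename), it is submitted with sbatch submit.sh; squeue then lists the job's state in the queue, and scancel <jobid> removes it if needed.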

6 HPC Resource Management

Figure 2: Typical resource management system

• Substitute an external scheduler (e.g. Maui) for the internal scheduler to enhance functionality

7 Cluster Systems Jobs Schedulers: Brief Description

Simple Linux Utility for Resource Management (SLURM)

• Initially developed by Lawrence Livermore National Laboratory (LLNL)
• Open source
• Runs only on the Linux kernel
• Runs on 6 of the top 10 systems, including the number 1 system, Sunway TaihuLight with 10,649,600 computing cores
• Used by roughly 50% of the world's TOP500 systems

Maui Cluster Scheduler (Maui)

• Developed by Adaptive Computing, Inc.
• Open source

8 Cluster Systems Jobs Schedulers: Brief Description

Moab High-Performance Computing Suite (Moab)

• Developed by Adaptive Computing, Inc.
• Proprietary
• Successor of the Maui framework
• Additional features
• Is used by 40% of the top 10, top 25 and top 100 on the Top500 list

Univa Grid Engine (UGE)

• Originally developed by Sun Microsystems, later Oracle and Univa
• Proprietary from 2010
• Also known as Sun Grid Engine (SGE) and Oracle Grid Engine
• Supports Job Checkpointing -snapshot of current app state-

9 Cluster Systems Jobs Schedulers: Brief Description

LoadLeveler (LL)

• Developed by IBM
• Proprietary
• Supports Job Checkpointing and gang scheduling

Load Sharing Facility (LSF)

• Developed by IBM
• Proprietary
• Supports priority escalation -job priority increases in every time interval-

10 Cluster Systems Jobs Schedulers: Brief Description

Portable Batch System (PBS)

• Developed by NASA
• 3 versions:
  ◦ OpenPBS - Open source, suitable for small clusters
  ◦ TORQUE - Proprietary, fork of OpenPBS (Adaptive Computing, Inc.)
  ◦ PBS Professional - Commercial version

Globus toolkit

• Developed by Globus Alliance
• Open source
• Set of tools for constructing a computing grid
• Communicates with local resource manager (e.g. PBS, UGE, LL)

11 Cluster Systems Jobs Schedulers: Brief Description

GridWay

• Developed by researchers at the University of Madrid
• Open source
• Was built on top of the Globus Toolkit framework
• Contains a module that detects slowdown (app performance monitoring) and requests job migration

HTCondor

• Developed by the University of Wisconsin-Madison
• Open source
• Known as Condor before 2012
• A number of tools and frameworks have been built on top of HTCondor, e.g. DAGMan (Directed Acyclic Graph Manager) -apps are nodes and edges are dependencies-

12 Cluster Systems Jobs Schedulers: Brief Description

Mesos

• Developed by the Apache Software Foundation
• Open source
• Mesos master - allocates resources
• Fault tolerant (ZooKeeper framework -to elect a new master-)

Open MPI

• Developed by a consortium of partners
• Open source
• Job scheduling on a slot or node basis in Round-Robin (RR) order

13 Cluster Systems Jobs Schedulers: Brief Description

TORQUE - Terascale Open-source Resource and QUEue Manager

• Developed by Adaptive Computing
• Fork of OpenPBS
• Proprietary since June 2018
• Supports external Job schedulers

Borg and Omega

• Developed by Google
• Proprietary
• Borg is a resource manager making resource offers to scheduler instances
• The Omega project deploys schedulers that work in parallel (sharing the state of resources)

14 Typical resource management system

Figure 3: SLURM resource management system

15 Typical resource management system

Figure 4: Multi-cluster Environment

16 Job Taxonomy regarding flexibility

Five types of jobs can be distinguished from the global perspective of the job scheduler (a sketch modeling these types follows the list):

1. Rigid Jobs - require a fixed set of resources (fixed static resource allocation)
2. Moldable Jobs - allow a variable set of resources, which must be allocated before the job starts (variable static resource allocation)
3. Malleable Jobs - allow a variable set of resources which the scheduler dynamically (de)allocates. The scheduler must inform the running job so it can adapt to the new resource allocation (scheduler-initiated and app-executed, variable dynamic resource allocation)
4. Evolving Jobs - the reverse of malleable (app-initiated and scheduler-executed, variable dynamic resource allocation)
5. Adaptive Jobs - combination of malleable and evolving characteristics (app- or scheduler-initiated and scheduler- or app-executed, respectively, variable dynamic resource allocation)
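A minimal sketch, with hypothetical function names, of how a scheduler might encode who may initiate and who executes a resize for each flexibility class:

# Sketch of the job-flexibility taxonomy above; names are illustrative only.
from enum import Enum

class Flexibility(Enum):
    RIGID = "rigid"          # fixed static allocation
    MOLDABLE = "moldable"    # variable, but fixed once the job starts
    MALLEABLE = "malleable"  # scheduler-initiated, app-executed resize
    EVOLVING = "evolving"    # app-initiated, scheduler-executed resize
    ADAPTIVE = "adaptive"    # either side may initiate a resize

def scheduler_may_resize(flex: Flexibility, job_started: bool) -> bool:
    """Can the scheduler change the job's allocation at this point in time?"""
    if flex == Flexibility.RIGID:
        return False
    if flex == Flexibility.MOLDABLE:
        return not job_started               # only before the job starts
    return flex in (Flexibility.MALLEABLE, Flexibility.ADAPTIVE)

def app_may_request_resize(flex: Flexibility) -> bool:
    """Can the running application itself ask for more or fewer resources?"""
    return flex in (Flexibility.EVOLVING, Flexibility.ADAPTIVE)

print(scheduler_may_resize(Flexibility.MALLEABLE, job_started=True))   # True
print(app_may_request_resize(Flexibility.MOLDABLE))                    # False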

17 Example of Rigid Job

Figure 5: A space/time diagram, where the Y-axis represents compute nodes and the X-axis represents time, illustrating a rigid job

18 Example of Moldable Job

Figure 6: A space/time diagram, where the Y-axis represents compute nodes and the X-axis represents time, illustrating a moldable job

19 Example of Malleable Job

Figure 7: A space/time diagram, where the Y-axis represents compute nodes and the X-axis represents time, illustrating a malleable job

20 Example of Evolving Job

Figure 8: A space/time diagram, where the Y-axis represents compute nodes and the X-axis represents time, illustrating an evolving job

21 Example of Adaptive Job

Figure 9: A space/time diagram, where the Y-axis represents compute nodes and the X-axis represents time, illustrating an adaptive job

22 HPC & Cloud Application types

The majority of HPC workloads alternate between different phases during their life cycle

Figure 10: Types of HPC & Cloud Applications based on system resources (CPU, I/O, network, memory)

23 Scheduling Algorithms

• Scheduling algorithms can be broadly divided into two classes: time-sharing and space-sharing
• Time-sharing algorithms divide time on a processor into several discrete intervals, or slots. These slots are then assigned to unique jobs
• Space-sharing algorithms give the requested resources to a single job until the job completes execution. Most cluster schedulers operate in space-sharing mode

24 Time-sharing Algorithms

• Local scheduling - threads are placed in a global run queue and dispatched to the available processors in an RR strategy
• Gang scheduling - threads of a job run simultaneously on different processors, so that if two or more of them communicate with each other, they are all ready to communicate at the same time. If they were not gang-scheduled, one could wait to send or receive a message to another while it is sleeping, and vice versa (see the sketch below)
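A minimal sketch of the gang-scheduling idea with made-up job sizes: all threads of a job are packed into the same time slot (Ousterhout-matrix style), so the whole gang always runs together.

# Sketch: pack each job's threads into a single time slot so the gang
# runs simultaneously. Job names and thread counts are illustrative only.
PROCESSORS = 8
jobs = {"A": 4, "B": 8, "C": 3, "D": 5}   # job -> number of threads

slots = []  # each slot maps a processor index to (job, thread id)
for job, nthreads in jobs.items():
    # first slot with enough free processors for the whole gang, else a new one
    slot = next((s for s in slots if PROCESSORS - len(s) >= nthreads), None)
    if slot is None:
        slot = {}
        slots.append(slot)
    free = [p for p in range(PROCESSORS) if p not in slot]
    for tid, p in enumerate(free[:nthreads]):
        slot[p] = (job, tid)

# Processors cycle through the slots; within one slot, whole gangs run together.
for i, slot in enumerate(slots):
    row = " ".join(f"{slot[p][0]}{slot[p][1]}" if p in slot else "--"
                   for p in range(PROCESSORS))
    print(f"slot {i}: {row}")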

25 Space-sharing Algorithms

• First Come First Served (FCFS)
• Round Robin (RR)
• Shortest Job First (SJF) / Longest Job First (LJF) (job execution time is provided by the user - doubtful)
• Smallest Job First (SJF) / Largest Job First (LJF) (job size is provided by the user)
• Advanced Reservation - the availability of a set of resources is guaranteed at a particular time
• Backfilling - an optimization allowing shorter jobs to execute while a long job at the head of the queue waits for free processors (see the sketch after this list)
• Preemptive Backfilling - backfilling with QoS, where higher-priority jobs can preempt lower-priority ones
• Fair-Share - equal distribution, while a site administrator can set system utilization targets for users, groups, and QoS levels (RR strategy at each level of abstraction)
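A minimal sketch of the backfilling idea under simplifying assumptions (a single resource type, user-provided runtime estimates, and made-up job data): a queued job may jump ahead only if it fits in the currently idle nodes and finishes before the reserved start of the job at the head of the queue.

# Sketch of EASY-style backfilling, simplified; job data are illustrative only.
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    nodes: int       # nodes requested
    runtime: float   # user-provided runtime estimate (hours)

def backfill(queue, free_nodes, head_start_time):
    """Return names of jobs behind the head job that can start right now
    without delaying the head job's reserved start time."""
    started = []
    for job in queue[1:]:                       # the head job keeps its reservation
        fits_now = job.nodes <= free_nodes
        finishes_in_time = job.runtime <= head_start_time
        if fits_now and finishes_in_time:
            started.append(job.name)
            free_nodes -= job.nodes
    return started

queue = [Job("big", 64, 10.0), Job("small1", 4, 1.5), Job("small2", 8, 5.0)]
# 8 nodes are idle and the head job "big" is reserved to start in 2 hours.
print(backfill(queue, free_nodes=8, head_start_time=2.0))   # ['small1']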

26 HPC Cluster Hierarchy

Figure 11: Hierarchical structure of a typical HPC system

27 HPC Co-Allocation

Figure 12: Allocation on node level vs allocation on core level

28 Dedicated Resource Allocation

• Most parallel scientific apps have frequent communication phases. Any imbalance in computation times results in waiting times, heavily reducing scalability. The easiest way to ensure balanced performance is to assign dedicated resources at the node level

29 Co-Scheduling

• Definition: simultaneously run more than one application on a node, allocating jobs at the granularity of the core level
• Performance degradation of a single application caused by running multiple applications simultaneously might not lead to overall runtime degradation
• Nodes are becoming more parallel, hosting dozens of cores and additional accelerator components. Thus, a huge challenge for apps is to use all components of these increasingly heterogeneous nodes efficiently
• Users are charged from a project budget in core-hours at the granularity of nodes

30 Co-Scheduling

• Evenly balanced workloads -which help scalability- are not guaranteed, due to hardware automation (e.g. automatic frequency changes caused by temperature changes)
• By definition, dedicated resource allocation is counter-productive for achieving the highest possible system throughput and energy efficiency
• Even highly optimized codes may achieve as little as 5% of a system's theoretical peak performance
• Good example: data analytics apps are typically bound by I/O and are thus excellent candidates for co-scheduling

31 Co-Scheduling Challenges

• Apps are bound by a dominating factor (i.e. CPU, memory bandwidth, or I/O). App candidates should be chosen carefully so that they do not slow each other down -joint throughput improvement- (e.g. pairing a memory-bound with a compute-bound application)
• Process migration: the initial co-scheduling of two (or more) applications might need to be readjusted, e.g. because one app has finished and the next one has different characteristics. Process migration and virtualization techniques (VMs, containers, etc.) are not yet applied in HPC (except for process-to-node remapping on node failure), as they impose a certain slowdown
• The degree to which given resources are utilized is an application characteristic

32 Co-Scheduling Challenges

• Learning or predicting application characteristics at an early runtime stage or from previous runs; this includes methods for the scheduler to better predict app resource usage, knowledge about computation and communication phases, as well as apps that are able to shrink/expand their resources on demand -malleable/evolving/adaptive-
• Applications need to be monitored via a number of tools/metrics (e.g. Miss-Rate Curves -page faults for the allocated memory-, H/W performance counters)
• Mapping a set of jobs to resources can be compared to the Bin Packing Problem (NP-complete); a greedy sketch follows this list
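A minimal sketch of the bin-packing analogy with made-up job sizes: the classic first-fit-decreasing heuristic packs jobs (by requested cores) into nodes of fixed capacity, which is how schedulers typically approximate the NP-complete optimal placement.

# First-fit-decreasing heuristic for the bin-packing view of job placement.
# Node capacity and per-job core counts are illustrative only.
def first_fit_decreasing(job_cores, node_capacity):
    nodes = []                                    # each node = list of placed job sizes
    for cores in sorted(job_cores, reverse=True):
        node = next((n for n in nodes if sum(n) + cores <= node_capacity), None)
        if node is None:                          # open a new node (bin)
            node = []
            nodes.append(node)
        node.append(cores)
    return nodes

jobs = [10, 7, 5, 4, 3, 2, 2, 1]                  # requested cores per job
print(first_fit_decreasing(jobs, node_capacity=16))
# [[10, 5, 1], [7, 4, 3, 2], [2]] -> 3 nodes used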

33 Link and Cache-Aware (LCA) co-scheduling approach

• Addresses contention situations concerning the memory link, the shared LLC, or both on a single node
• Four app classes:
  ◦ N - activity on the private part of memory
  ◦ C - heavy activity on the shared LLC
  ◦ LC - significant activity on both the memory link and the shared LLC
  ◦ L - heavy activity on the memory link, thrashing the shared LLC (streaming)
• Co-execution of L-C, L-L, and L-LC should be avoided
• Algorithm - O(n) complexity (a sketch follows this list)
• Steps:
  1. Co-schedule N with L (L imposes the greatest harm)
  2. Then N with C (C suffers the greatest harm)
  3. Then N with LC
  4. Remaining C with LC, then with C
  5. Remaining LC with L, then with LC
  6. Remaining L with L
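A minimal sketch of the pairing order described above, with hypothetical app names; apps are assumed to be classified beforehand and are then paired in a single pass following the step order.

# Sketch of the LCA pairing order: consume apps from the N/C/LC/L buckets
# following the step order above. App names and classes are illustrative only.
from collections import deque

def lca_pairs(classified):
    """classified: dict mapping class ('N', 'C', 'LC', 'L') to app names."""
    q = {c: deque(classified.get(c, [])) for c in ("N", "C", "LC", "L")}
    order = [("N", "L"), ("N", "C"), ("N", "LC"),   # steps 1-3
             ("C", "LC"), ("C", "C"),               # step 4
             ("LC", "L"), ("LC", "LC"),             # step 5
             ("L", "L")]                            # step 6
    pairs = []
    for a, b in order:
        while q[a] and q[b] and not (a == b and len(q[a]) < 2):
            pairs.append((q[a].popleft(), q[b].popleft()))
    return pairs

apps = {"N": ["n1", "n2", "n3"], "C": ["c1", "c2"], "LC": ["lc1"], "L": ["l1", "l2"]}
print(lca_pairs(apps))
# [('n1', 'l1'), ('n2', 'l2'), ('n3', 'c1'), ('c2', 'lc1')]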

34 LCA Evaluation

Figure 13: Average slowdown over standalone execution

35 Job Striping

• In real supercomputing systems, users are charged for the combined number of CPU hours that their application uses
• Price of execution (a worked example follows below):

  Cost = k * N * P * T

  where k = rate constant, N = nodes assigned to the job, P = number of cores per node, T = total time the job ran

• If a user spreads their job, they are charged for the cores that they leave unassigned
• While this would not be a problem if spreading offered a 2x increase in performance, in many cases it does not
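A small worked example of the cost formula above, with made-up numbers: spreading a 256-core job over twice as many nodes doubles the charged core-hours, because the idle cores on each node are still billed.

# Worked example of Cost = k * N * P * T with illustrative numbers.
k = 1.0           # rate constant (price per core-hour)
P = 16            # cores per node
T = 10.0          # hours the job ran (assumed equal in both layouts)

cores_needed = 256
compact_nodes = cores_needed // P          # 16 nodes, every core busy
spread_nodes = 2 * compact_nodes           # 32 nodes, half the cores idle

cost_compact = k * compact_nodes * P * T   # 2560 core-hours
cost_spread = k * spread_nodes * P * T     # 5120 core-hours: idle cores still billed
print(cost_compact, cost_spread)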

36 Job Striping

Figure 14: Scheduling two 16 process apps A and B on two 8-core processors

37 Job Striping

• In the compact configuration, the shared LLC and memory bandwidth are split among c processes. However, in the spread configuration they are divided between c/2 processes. This means that the mean available memory bandwidth and LLC space per core are 2x higher in the spread configuration
• Consequently, if a user spreads their job, they are charged for the cores that they leave unassigned. While this would not be a problem if spreading offered a 2x increase in performance, in many cases it does not
• Job striping: a schedule that takes the best features of both compact and spread; the full occupancy that compact provides, but also the reduction in resource contention that spread offers (see the sketch below)
• Intuition: the threads of a single application often execute nearly identical tasks and thus place very similar demands on the system
• In addition, even if apps A and B are both communication-heavy, their communication patterns are likely to be more out of phase than those of a single parallel application
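A minimal sketch of how a single 16-core node (two 8-core sockets) is occupied under the three policies from Figure 14; the core labels and exact layout are illustrative only.

# Occupancy of one 2-socket, 16-core node under each policy for two
# 16-process apps A and B ('-' marks an idle but still billed core).
def node_occupancy(policy):
    if policy == "compact":
        return ["A"] * 16                  # the node belongs entirely to app A
    if policy == "spread":
        return ["A"] * 8 + ["-"] * 8       # A uses half the node, the rest sits idle
    if policy == "striped":
        return ["A"] * 8 + ["B"] * 8       # A and B share the node, one socket each

for policy in ("compact", "spread", "striped"):
    cores = node_occupancy(policy)
    print(f"{policy:8s} busy={16 - cores.count('-'):2d}/16  {''.join(cores)}")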

38 Job Striping Evaluation

Figure 15: Min & Max runtime in sec for each app scheduled compactly, spread, or striped

39 Memory/Computation/Communication Intensive

Figure 16: Memory-intensive app (HPCG). Figure 17: Computation-intensive app (Monte Carlo PI). Figure 18: Communication-intensive app (Monte Carlo PI modified)