Machine Learning for Load Balancing in the Linux Kernel

Jingde Chen ([email protected])
Subho S. Banerjee ([email protected])
Zbigniew T. Kalbarczyk ([email protected])
Ravishankar K. Iyer ([email protected])
University of Illinois at Urbana-Champaign

Abstract
The OS load balancing algorithm governs the performance gains provided by a multiprocessor computer system. Linux's Completely Fair Scheduler (CFS) tracks process loads by average CPU utilization to balance workload between processor cores. That approach maximizes the utilization of processing time but overlooks the contention for lower-level hardware resources. In servers running compute-intensive workloads, an imbalanced need for limited computing resources hinders execution performance. This paper solves the above problem using a machine learning (ML)-based resource-aware load balancer. We describe (1) low-overhead methods for collecting training data; (2) an ML model, based on a multi-layer perceptron, that imitates the CFS load balancer based on the collected training data; and (3) an in-kernel implementation of inference on the model. Our experiments demonstrate that the proposed model has an accuracy of 99% in making migration decisions while increasing the latency by only 1.9 µs.

CCS Concepts
• Software and its engineering → Scheduling; • Theory of computation → Scheduling algorithms; • Computing methodologies → Neural networks.

Keywords
Linux kernel, Machine Learning, Neural Network, Completely Fair Scheduler, Operating System, Load Balancing.

ACM Reference Format:
Jingde Chen, Subho S. Banerjee, Zbigniew T. Kalbarczyk, and Ravishankar K. Iyer. 2020. Machine Learning for Load Balancing in the Linux Kernel. In 11th ACM SIGOPS Asia-Pacific Workshop on Systems (APSys '20), August 24–25, 2020, Tsukuba, Japan. ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3409963.3410492

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
APSys '20, August 24–25, 2020, Tsukuba, Japan
© 2020 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-8069-0/20/08...$15.00
https://doi.org/10.1145/3409963.3410492

1 Introduction
The Linux scheduler, the Completely Fair Scheduler (CFS), enforces a fair scheduling policy that aims to allow all tasks to execute with a fair quota of processing time. CFS allocates CPU time appropriately among the available hardware threads to allow processes to make forward progress. However, the scheduling algorithm of CFS does not take into account contention for low-level hardware resources (e.g., shared functional units or memory bandwidth) other than CPU time. There is an implicit assumption that fair sharing of CPU time implies that other such microarchitectural resources are also divided fairly [2]. That assumption is often false and leaves the scheduling algorithm unaware of hardware resource utilization, thereby leading to shared-resource contention and reduced performance.

While prior work has investigated the integration of hardware resource usage in OS load balancing, most attempts have focused only on resources shared at the chip multiprocessor (CMP) level and depend on predefined scheduling policies or on efforts by the user to manually annotate the resource usage of the programs [5, 7, 16]. There is no system-wide solution that works for all general processes, or one that can take advantage of the dynamic contextual information available from tracing and telemetry data from the system. Traditional methods are fundamentally limited in having to effectively quantify relationships between different resources and application performance, which is a hard problem. One solution is to use machine learning (ML) in the kernel to model and tackle the problems described above. Even small gains can translate into big savings at the level of large data centers. An important step in building an ML engine that can outperform the standard CFS Linux load balancer is to see whether it is possible to match the current load balancer's performance. The insights thus gained, as we will show, can be quite valuable in the next step, namely building an engine that can outperform current techniques.

ML in the Kernel. The ML approach will have to (i) perform load balancing in Linux that emulates original kernel decisions; (ii) include hardware resource utilization models; (iii) train the model to solve specific cases with high resource needs that cause frequent contention; and (iv) apply reinforcement learning to the model to optimize load balancing performance. To exemplify the types of resource contention, consider simultaneous multithreading (SMT) threads on a CMP: they appear as identical logical CPU cores to the OS but share most of the computing resources on the chip, including caches and registers. Processes scheduled to run on different SMT threads at the same time contend for the resources on the chip. Similarly, CPUs within the same NUMA node share memory and I/O bandwidth. In the end, we aim to implement an ML module inside the kernel that monitors real-time resource usage in the system, identifies potential hardware performance bottlenecks, and generates the best load balancing decisions.

Contributions. In this paper, we present the results of the first step, i.e., an ML model and an implementation to replace the CFS load balancer. The proposed approach consists of:
(1) Collection of runtime system data via dynamic kernel tracing, using eBPF (described in Section 3.2).
(2) Training of the ML model using the collected data to emulate task migration decisions (described in Section 3.3).
(3) Two implementations of the model in the kernel, one with floating-point arithmetic and one with fixed-point arithmetic (described in Section 3.4).
The ML model is trained offline in user space and is used for online inference in the kernel to generate migration decisions. The performance of the in-kernel ML model is as follows:
(1) 99% of the migration decisions generated by the ML model are correct.
(2) The average latency of the load balancing procedure is 13% higher than in the original kernel (16.4 µs vs. 14.5 µs).
(3) During testing, our approach produced a 5-minute average CPU load of 1.773, compared to Linux's 1.758.
The work described in this paper was based on the Linux 4.15 source code. The code used in this paper can be found at https://github.com/keitokuch/MLLB, and the modified ker-

2 Background
Early versions of the Linux kernel had used a straightforward O(n) scheduling algorithm, which iterated over every runnable task to pick the next one to run. In 2003, the O(n) scheduler was replaced with the O(1) scheduler, which runs in constant time instead of linear time. The O(1) scheduler managed running tasks with two linked lists for each task priority: one for ready-to-run tasks and the other for expired tasks that had used up their timeshares. Later, Linux abandoned the concept of predefined time slices. The Completely Fair Scheduler (CFS) was introduced to the kernel in Linux 2.6.23 in 2007, together with the design of the Modular Scheduler Core [13], and has since been the default Linux scheduler. We refer to that latest design as the Linux scheduler. The scheduler implements individual scheduling classes to which the Modular Scheduler Core can delegate detailed scheduling tasks. Existing scheduling classes include the rt class for real-time processes; the fair class, which implements the CFS; and the idle class, which handles the case when there is no runnable task.

The scheduler comprises a large portion of the kernel source and resides in the <kernel/sched> directory; the scheduler core is implemented in <kernel/sched/core.c>. The Linux scheduler uses per-CPU allocated runqueues for faster local access, and the load balancing algorithm works by migrating tasks across the per-core runqueues. The primary granularity of work scheduled under Linux is the thread (not the process). Threads are not scheduled via direct attachment to the per-core runqueues, but as schedulable entities (sched_entity) attached to subordinate class runqueues managed by the individual scheduling classes.

2.1 The Completely Fair Scheduler
The CFS is the most complicated part of the Linux scheduler and is implemented in <kernel/sched/fair.c> with 10,000 lines of code. The CFS schedules threads that have the SCHED_NORMAL and SCHED_BATCH policies. As mentioned above, the CFS manages its own runqueue structure, cfs_rq, in the per-CPU runqueue. cfs_rq sorts the sched_entity of all runnable threads on it in a red-black tree (a kind of self-balancing binary search tree) that uses virtual runtimes as keys. When asked to pick the next task to run, CFS returns the task corresponding to the sched_entity with the smallest virtual runtime, i.e., the least-executed task. In that way, all tasks in the queue get a fair chance to run. To reflect task priorities, the virtual runtime accumulates more slowly for high-priority tasks, and they get to run longer.
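The virtual-runtime mechanism described above can be sketched in a few lines. The weighting rule (virtual time advances inversely to the task's weight, with 1024 as the nice-0 weight, as in the kernel's priority-to-weight table) follows the kernel's scheme, but the Task class, the weight values, and the use of min() in place of the red-black-tree leftmost-node lookup are our own illustration, not kernel code:

```python
NICE_0_WEIGHT = 1024  # weight of a nice-0 task in the kernel's table

class Task:
    def __init__(self, name, weight):
        self.name = name
        self.weight = weight      # higher weight = higher priority
        self.vruntime = 0.0

def account(task, delta_exec_ns):
    # Virtual runtime advances more slowly for heavier (higher-priority)
    # tasks: delta_vruntime = delta_exec * NICE_0_WEIGHT / weight.
    task.vruntime += delta_exec_ns * NICE_0_WEIGHT / task.weight

def pick_next(tasks):
    # CFS picks the entity with the smallest vruntime (the leftmost node
    # of the red-black tree); min() stands in for that O(log n) lookup.
    return min(tasks, key=lambda t: t.vruntime)

hi = Task("high-prio", weight=2048)
lo = Task("low-prio", weight=1024)
for t in (hi, lo):
    account(t, 1_000_000)          # both run for 1 ms of wall-clock time
print(pick_next([hi, lo]).name)    # → high-prio
```

After equal real execution time, the heavier task has accrued only half the virtual runtime, so it is picked again first; this is exactly how equal vruntime progress translates into longer CPU time for high-priority tasks.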
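The O(1) scheduler's per-priority active/expired lists described in Section 2 can be sketched as follows. The 140 priority levels mirror the historical design (lower number = higher priority); the class and method names are illustrative, not the original kernel's:

```python
from collections import deque

NUM_PRIOS = 140  # the O(1) scheduler used 140 priority levels

class O1Runqueue:
    def __init__(self):
        self.active  = [deque() for _ in range(NUM_PRIOS)]
        self.expired = [deque() for _ in range(NUM_PRIOS)]

    def enqueue(self, task, prio):
        self.active[prio].append(task)

    def expire(self, task, prio):
        # A task that used up its timeslice waits on the expired list.
        self.expired[prio].append(task)

    def pick_next(self):
        # Constant time w.r.t. the number of tasks: scan the fixed-size
        # priority array for the first non-empty active list.
        for prio, queue in enumerate(self.active):
            if queue:
                return queue.popleft(), prio
        # All active tasks have expired: swap the two arrays and retry.
        self.active, self.expired = self.expired, self.active
        for prio, queue in enumerate(self.active):
            if queue:
                return queue.popleft(), prio
        return None, None

rq = O1Runqueue()
rq.enqueue("A", 10)
rq.enqueue("B", 5)
task, prio = rq.pick_next()
print(task, prio)  # → B 5
```

The array swap is what made expiration O(1) as well: no list is ever rebuilt, the roles of the two arrays are simply exchanged.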
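Contribution (3) in the introduction mentions a fixed-point variant of the in-kernel model, since floating-point arithmetic is generally avoided in kernel context. As a rough sketch of what fixed-point inference for a single perceptron involves; the Q16.16 format, helper names, and toy weights below are our own illustration, not the paper's implementation:

```python
SHIFT = 16            # Q16.16 fixed point: 16 fractional bits
SCALE = 1 << SHIFT

def to_fixed(x: float) -> int:
    # Quantize a float into a scaled integer.
    return int(round(x * SCALE))

def fixed_mul(a: int, b: int) -> int:
    # Multiply two Q16.16 numbers; rescale by shifting off the
    # extra fractional bits.
    return (a * b) >> SHIFT

def relu(x: int) -> int:
    return x if x > 0 else 0

def neuron_fixed(inputs, weights, bias):
    # One perceptron: dot product plus bias, all in integer arithmetic.
    acc = bias
    for x, w in zip(inputs, weights):
        acc += fixed_mul(x, w)
    return relu(acc)

# Toy parameters; a trained model would supply the real ones.
x = [to_fixed(0.5), to_fixed(-1.25)]
w = [to_fixed(0.8), to_fixed(0.4)]
b = to_fixed(0.1)
y = neuron_fixed(x, w, b)
print(y / SCALE)  # ≈ 0.5*0.8 + (-1.25)*0.4 + 0.1 = 0.0, up to quantization error
```

Because every operation is an integer multiply, add, and shift, a layer like this can run without touching the FPU state that the kernel would otherwise have to save and restore.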