Fork/Wait and Multicore Frequency Scaling: a Generational Clash
Damien Carver, Redha Gouicem, Jean-Pierre Lozi, Julien Sopena, Baptiste Lepers, Willy Zwaenepoel, Nicolas Palix, Julia Lawall, Gilles Muller

Published in the Proceedings of the 10th Workshop on Programming Languages and Operating Systems (PLOS '19), October 27, 2019, Huntsville, ON, Canada, pp. 53-59. https://doi.org/10.1145/3365137.3365400
Damien Carver, Redha Gouicem, Julien Sopena, Julia Lawall, and Gilles Muller: Sorbonne Université, LIP6, Inria. Jean-Pierre Lozi: Oracle Labs. Baptiste Lepers and Willy Zwaenepoel: University of Sydney. Nicolas Palix: Université Grenoble Alpes.

Abstract

The complexity of computer architectures has risen since the early years of the Linux kernel: Simultaneous Multi-Threading (SMT), multicore processing, and frequency scaling with complex algorithms such as Intel® Turbo Boost have all become omnipresent. In order to keep up with hardware innovations, the Linux scheduler has been rewritten several times, and many hardware-related heuristics have been added. Despite this, we show in this paper that a fundamental problem was never identified: the POSIX process creation model, i.e., fork/wait, can behave inefficiently on current multicore architectures due to frequency scaling.

We investigate this issue through a simple case study: the compilation of the Linux kernel source tree. To do this, we develop SchedLog, a low-overhead scheduler tracing tool, and SchedDisplay, a scriptable tool to graphically analyze SchedLog's traces efficiently.

We implement two solutions to the problem at the scheduler level which improve the speed of compiling part of the Linux kernel by up to 26%, and the whole kernel by up to 10%.

1 Introduction

Over the past decade, computer architectures have grown increasingly complex. It is now common, even for affordable machines, to sport multiple CPUs with dozens of hardware threads. These hardware threads are sometimes linked together through hyperthreading and different levels of shared caches, their memory latency and bandwidth is non-uniform due to NUMA and a complex network of interconnect links, and they run at different—yet non-independent—frequencies.

In order to tackle increased hardware complexity, the Linux scheduler has had to evolve. It was rewritten twice: in 2003, the O(1) scheduler was introduced, and in 2007, the Completely Fair Scheduler (CFS) replaced it. Since then, a myriad of heuristics has been added. A large body of research has focused on improving scheduling on modern multicore architectures, often focusing on locality and NUMA issues [13, 19].

Despite these efforts, recent works have hinted that there may still be significant room for improvement. Major performance bugs have gone unnoticed for years in CFS [20], despite the widespread use of Linux. Furthermore, while ULE, the scheduler of FreeBSD, does not outperform CFS on average, it does outperform it significantly on many workloads [7], for reasons that are not always well understood. With currently available tools, studying scheduler behavior at a fine grain is cumbersome, and there is a risk that the overhead of monitoring will interfere with the behavior that is intended to be observed. Indeed, such studies are rarely done, either by Linux kernel developers or by the research community. However, understanding scheduler behavior is necessary in order to fully exploit the performance of multicore architectures, as most classes of workloads trust the scheduler for task placement.

In this paper, we show that CFS suffers from a fundamental performance issue that directly stems from the POSIX model of creating processes, i.e., fork/wait, on multicore architectures with frequency scaling enabled. Consider a common case where a process forks a child and waits for the result, in a low-concurrency scenario, for example in the case of a typical shell script. If there are idle cores, CFS will choose one of them for the child. As the core has been idle, the child process will likely start to run at a low frequency. On the other hand, the core of the parent has seen recent activity. The hardware's frequency scaling algorithm will likely have increased its frequency, even if, due to a wait, the core is now idle.
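To make the scenario concrete, the following minimal C program (an illustrative sketch of our own, not taken from Kbuild) reproduces the fork/wait pattern described above: the parent blocks in waitpid() immediately after forking, so the child performs all the useful work while the parent's recently active, and therefore boosted, core goes idle.

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        /* Mimics a shell script running a sequence of short commands. */
        for (int i = 0; i < 100; i++) {
            pid_t pid = fork();
            if (pid < 0) {
                perror("fork");
                return EXIT_FAILURE;
            }
            if (pid == 0) {
                /* Child: a short burst of work, typically placed by CFS on an
                 * idle core that is still running at a low frequency. */
                execlp("true", "true", (char *)NULL);
                _exit(127);
            }
            /* Parent: waits immediately, leaving its recently busy (hence
             * high-frequency) core idle until the child terminates. */
            waitpid(pid, NULL, 0);
        }
        return EXIT_SUCCESS;
    }

A shell script that runs a sequence of small commands behaves in essentially the same way, which is why the mostly sequential phases of Kbuild are affected.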

Figure 1. Execution trace: building the Linux kernel using 320 jobs with CFS.

We expose the performance issue through a case study of Kbuild, i.e., building all or part of the Linux kernel. We conduct the study using our own dedicated tracing tool for the Linux scheduler, SchedLog, which focuses on recording scheduling events with very low overhead. We then visualize these events with SchedDisplay, a highly configurable visualization tool we developed, to understand the scheduler's behavior. Using this knowledge, we then implement two solutions for the performance issues that improve performance by up to 26% on Kbuild. Finally, we propose alternative solutions that aim to be more efficient, and to reduce the likelihood of worsening performance on other workloads.

The contributions of this paper are the following:
• The identification of a performance issue in CFS that directly stems from the fork/wait process creation model of POSIX.
• A case study of the performance of the Linux scheduler using a workload that is representative of running compilation tasks in a large project: Kbuild.
• SchedLog, a low-overhead tracing tool that focuses on scheduling events.
• SchedDisplay, a highly configurable graphical tracing tool for SchedLog traces that eases the detection of poor scheduler behavior.

The rest of this paper is organized as follows. Section 2 shows, through a case study of Kbuild, that the fork/wait process creation model of POSIX can behave inefficiently on current multicore architectures. Section 3 describes our graphical tracing tools for the Linux scheduler. Section 4 discusses solutions to the performance problem and presents those that we implemented. Section 5 presents related work. Finally, Section 6 concludes.

2 Kbuild and the Performance of Fork/Wait

We analyze the performance of Kbuild on a 4-socket Xeon E7-8870 machine (80 cores/160 hardware threads) with 512 GiB of RAM, running Linux 4.19 in Debian Buster. The CPUs have a frequency range of 1.2-2.1 GHz, and can reach up to 3.0 GHz with Intel® Turbo Boost.

Figure 1 shows a trace from our graphical tool with the frequency of the active cores at each scheduler tick (every 4 ms) when compiling the whole Linux kernel. An initial kernel compilation (i.e., subsequent to make clean) performs a number of preparatory activities, then compiles the kernel source files, and finally performs some cleanup. Around 2 seconds, there is a short burst of activity, performing an initial build across the source tree, where all of the cores are used and the cores reach a high frequency (2.1-2.6 GHz; they do not reach the highest frequencies provided by Turbo Boost due to safeguards in Turbo Boost to prevent overheating). Likewise, from 5 to 34 seconds, all of the cores are again used and reach a high frequency, to compile the major part of the kernel source code. There appears to be little room for improvement in these regions. On the other hand, the regions at the beginning and end of the graph show a moderate number of cores used, almost all running at the lowest frequency, around 1.2 GHz. Studying the trace at a higher degree of magnification shows that the code executed here is nearly sequential, typically alternating between one and two active cores at a time. This raises the question of why so many cores are used for so little work, and why they are not able to reach a higher frequency when they are occupied. We are in the presence of a performance bug.
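The per-core frequencies shown in Figure 1 are recorded inside the kernel at each tick (Section 3), but the same kind of observation can be reproduced from user space. The sketch below is our own illustration, independent of SchedLog; it samples the scaling_cur_freq files exposed by cpufreq under sysfs, and it assumes that a cpufreq driver is loaded on the reader's machine.

    #include <stdio.h>

    /* Print the current frequency (reported in kHz by cpufreq) of the
     * first `ncpus` logical CPUs. Assumes a cpufreq driver is loaded. */
    static void print_frequencies(int ncpus) {
        char path[128];
        for (int cpu = 0; cpu < ncpus; cpu++) {
            snprintf(path, sizeof(path),
                     "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_cur_freq", cpu);
            FILE *f = fopen(path, "r");
            if (!f)
                continue;           /* CPU offline or no cpufreq support. */
            long khz;
            if (fscanf(f, "%ld", &khz) == 1)
                printf("cpu%d: %.2f GHz\n", cpu, khz / 1e6);
            fclose(f);
        }
    }

    int main(void) {
        print_frequencies(160);     /* 160 hardware threads on our test machine. */
        return 0;
    }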

3 A Tracing Tool for CFS

In this section, we describe a very low overhead tool that we have developed for tracing scheduler behavior. In the next section, we describe how we have used this tool to understand the Kbuild performance bug and to evaluate our proposed solutions.

3.1 Collecting scheduling events

In order to study the Linux scheduler, we need to understand its behavior. To this end, we must collect the list of scheduling events that happen at runtime, such as thread creations, clock ticks and thread migrations. In order to avoid impacting the application's or the scheduler's behavior, the event collection mechanism should have nearly no overhead. Perf [24] is a widely used tool for kernel tracing, but we have found that it incurs a 3% overhead on average when tracing scheduling events. Indeed, Perf appears to collect more information than necessary. For example, Perf collects 104 bytes of information for each wakeup event, while we have found only 20 bytes to be sufficient for understanding scheduler behavior. We have thus developed SchedLog, a tracing tool that reduces both the amount of information collected and the memory footprint so as to obtain a nearly negligible overhead.

SchedLog records events in preallocated private per-CPU ring buffers. We have found that tracing the compilation of the whole kernel with 320 threads only uses 82 MB of memory in the ring buffers: SchedLog can thus record long and complex executions in a reasonably sized buffer. If a buffer overflows, SchedLog continues recording new events by overwriting the older ones and keeping track of the number of lost events. This optimistic behavior removes a lot of sanity checks, thus minimizing the cost of recording events.

In the ring buffers, each entry occupies a fixed size (24 bytes), allowing us to implement each ring buffer as an array. An entry contains:
• an 8-byte timestamp with nanosecond precision. This timestamp is recorded with the per-CPU local clock to minimize overhead.
• the PID of the process executing the scheduling event (4 bytes).
• a 4-byte event ID.
• an 8-byte field used to store additional event-specific information.

Examples of information stored in the last field are: the source and destination CPUs of a migrated thread during a migrate event; the current hardware frequency of the CPU for a tick event; the first 8 characters of the new name of a process, obtained from the comm field of the structure representing a task in the kernel, for an exec event.

The ring buffers are dumped at the end of the execution for offline analysis. SchedLog can record millions of scheduling events per second with very little overhead. On the Kbuild use case presented in this paper, the overhead was under 0.01%.
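The 24-byte entry layout described above can be pictured as a packed C structure together with a per-CPU array used as a circular buffer. The sketch below is our reading of that description, written as a user-space analogue; the field, type, and function names are ours, not SchedLog's actual source.

    #include <stdint.h>
    #include <stdio.h>

    /* One SchedLog entry: 8 + 4 + 4 + 8 = 24 bytes. */
    struct schedlog_entry {
        uint64_t timestamp_ns;   /* per-CPU local clock, nanosecond precision  */
        int32_t  pid;            /* PID of the process executing the event     */
        uint32_t event_id;       /* e.g., fork, wakeup, migrate, tick, exec    */
        uint64_t payload;        /* event-specific data (CPUs, frequency, ...) */
    } __attribute__((packed));

    /* Per-CPU ring buffer: on overflow, new entries overwrite the oldest
     * ones and a counter of lost events is maintained. */
    #define SCHEDLOG_ENTRIES (1 << 16)

    struct schedlog_buffer {
        uint64_t head;           /* monotonically increasing write index      */
        uint64_t lost;           /* number of overwritten (lost) events       */
        struct schedlog_entry entries[SCHEDLOG_ENTRIES];
    };

    static void schedlog_record(struct schedlog_buffer *buf,
                                struct schedlog_entry e) {
        uint64_t idx = buf->head++ % SCHEDLOG_ENTRIES;
        if (buf->head > SCHEDLOG_ENTRIES)
            buf->lost++;         /* the slot being reused held an older event */
        buf->entries[idx] = e;
    }

    int main(void) {
        static struct schedlog_buffer buf;
        struct schedlog_entry e = { .timestamp_ns = 0, .pid = 1,
                                    .event_id = 2, .payload = 3 };
        schedlog_record(&buf, e);
        printf("entry size: %zu bytes\n", sizeof(struct schedlog_entry));
        return 0;
    }

At 24 bytes per entry, the 82 MB of buffer space reported above corresponds to roughly 3.5 million recorded events.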
user wants a global view, e.g., to detect application phases, the • the PID of the process executing the scheduling event tool computes an image where the color of individual pixels (4 bytes). is an aggregate of the color of the segments represented by • a 4-byte event ID. the pixel. When the user needs a more precise representation • an 8-byte field used to store additional event-specific of events, he may zoom in and specify a starting time and information. a duration to study parts of the trace in more detail. In the latter case, for efficiency, SchedDisplay does not compute Examples of information stored in the last field are: the the segments that are located outside of the local view. Also source and destination CPUs of a migrated thread during a the user can obtain detailed information on SchedLog entries migrate event; the current hardware frequency of the CPU (PID, command, etc.) by hovering the mouse over segments. for a tick event; the first 8 characters of the new name of a process, obtained from the comm field of the structure representing a task in the kernel, for an exec event. 4 Kbuild Revisited The ring buffers are dumped at the end of the execution for We now use our tools to study in more detail the execution offline analysis. SchedLog can record millions of scheduling of Kbuild, and then propose some changes to the scheduler events per second with very little overhead. On the Kbuild that can improve the utilization of higher frequencies and use-case presented in this paper, the overhead was under reduce the Kbuild execution time. 0.01%. 4.1 Identifying the problem 3.2 Displaying scheduling events We use SchedLog to collect the trace of the kernel build. Manually exploring and spotting bottlenecks in the traces We then use SchedDisplay to study the execution of the is quite challenging. First, it requires dealing with a huge various processes forked by Kbuild on their various cores, mass of information. Second, the information of interest is focusing on the first couple of seconds where the core uti- not known in advance and becomes clear only when making lization seems to be suboptimal. Zooming in, as shown in progress in understanding the application/scheduler behav- Figure 2a, we observe that frequently one process runs for a ior and interaction. Therefore, we have also developed a short amount of time (horizontal segment annotated with trace visualization tool, SchedDisplay, that can be config- “RQSIZE ≥ 1”), and then another process (or two) starts on ured using user-defined scripts. SchedDisplay is based on another core(s). See for example, the horizontal segments PLOS ’19, October 27, 2019, Huntsville, ON, Canada Carver et al.

4 Kbuild Revisited

We now use our tools to study in more detail the execution of Kbuild, and then propose some changes to the scheduler that can improve the utilization of higher frequencies and reduce the Kbuild execution time.

4.1 Identifying the problem

We use SchedLog to collect the trace of the kernel build. We then use SchedDisplay to study the execution of the various processes forked by Kbuild on their various cores, focusing on the first couple of seconds where the core utilization seems to be suboptimal. Zooming in, as shown in Figure 2a, we observe that frequently one process runs for a short amount of time (horizontal segment annotated with "RQSIZE ≥ 1"), and then another process (or two) starts on another core (or cores). See, for example, the horizontal segments in the lower left of Figure 2a. It remains to understand why this is undesirable behavior. A potential target is the cache, but this is unlikely to be relevant to a job like Kbuild that composes different programs (gcc, as, shell, etc.). We then consider another way in which cores can differ, which is by their frequency. Extending the tick event to record the frequency of the associated core shows the behavior presented in the introduction and shown in Figure 2b. The core frequency only increases after a delay that amounts to several ticks. The computations in the (near) sequential phases of Kbuild are shorter than this delay. Thus, cores speed up after they have performed some computation, but at that point, computation has moved to other processes, running on recently idle and thus slower cores.

Figure 2. Zoom over a sparse region of Figure 1. (a) Scheduler events. (b) Frequencies.

4.2 Patching CFS

Based on the above analysis, a solution to improve the performance of Kbuild is for the scheduler to find high-frequency cores for any forked and waking processes. Our observation is that, due to the nature of a software build, processes in Kbuild that fork other processes mostly then wait for their results, rather than continuing execution. The core that runs the parent process is thus likely at or soon to be at a high frequency, and is also soon to be idle, making it a desirable core to run a child. A similar situation occurs when the parent wakes after a child has terminated; the core of the child has seen recent execution and is thus likely at or soon to be at a high frequency, while also currently being underutilized.

One solution would be to provide more information about the application behavior to the scheduler by extending the fork system call with a flag indicating whether the parent process will immediately wait. With this information, at the time of the fork, the scheduler can detect whether the core of the parent process will soon be idle (modulo other process migrations). The scheduler could place the child on the core of the parent only in this case. More generally, as child processes can have different execution times, the application could exploit such a fork variant to indicate to the scheduler which is the best child to collocate with the parent, to maximize the likelihood that the core will reach a high frequency within the child's computation. Extending fork, however, would amount to breaking POSIX, and would require substantial changes in user-level software to be effective in practice.

We consider two scheduling strategies that do not require modification to the POSIX API. These strategies tweak the thread placement strategy on a process fork or wakeup to avoid unnecessary frequency throttling. The first strategy, on a fork, places the child on the same core as the parent, in the case when there is no process other than the parent on that core. Likewise, this strategy places a waking process on the core where it was executing previously if there is at most one process on that core. When these criteria are not satisfied, this strategy behaves the same as CFS. The second strategy places all children on the same core as the parent, and likewise, always returns a waking process to the core where it ran previously. This strategy relies on load balancing to disperse the children if the core becomes overloaded. We apply these strategies throughout the execution of Kbuild, including both the mostly sequential and the highly concurrent phases.
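The following C sketch illustrates the decision logic of the two strategies as we understand them from the description above. It is a simplification for exposition, not the actual CFS patch: the helpers stand in for kernel state (run-queue sizes, CFS's default core selection) and are not real kernel APIs.

    #include <stdio.h>

    enum strategy { CONSERVATIVE, AGGRESSIVE };

    static int nr_running_on(int cpu);        /* tasks on cpu's runqueue      */
    static int cfs_default_placement(void);   /* whatever core CFS would pick */

    /* On fork: collocate the child with the parent. The conservative strategy
     * does so only when the parent is alone on its core (it will soon wait and
     * leave the core idle); the aggressive strategy always does, relying on
     * load balancing to disperse children if the core becomes overloaded. */
    static int place_child(enum strategy s, int parent_cpu) {
        if (s == AGGRESSIVE || nr_running_on(parent_cpu) == 1)
            return parent_cpu;
        return cfs_default_placement();
    }

    /* On wakeup: return the waking process to the core it last ran on, which
     * has seen recent execution and is likely running at a high frequency. */
    static int place_waking(enum strategy s, int prev_cpu) {
        if (s == AGGRESSIVE || nr_running_on(prev_cpu) <= 1)
            return prev_cpu;
        return cfs_default_placement();
    }

    /* Dummy definitions so the sketch compiles as a stand-alone program. */
    static int nr_running_on(int cpu) { (void)cpu; return 1; }
    static int cfs_default_placement(void) { return 0; }

    int main(void) {
        printf("child placed on core %d\n", place_child(CONSERVATIVE, 3));
        printf("waker placed on core %d\n", place_waking(AGGRESSIVE, 5));
        return 0;
    }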
4.3 Experiments

To illustrate the performance of our strategies, we consider building only the kernel scheduler (the kernel/sched/ directory) with 32 jobs, which entails less computation than a full kernel build, thus resulting in more readable traces.

Figure 3a shows the effect of CFS. The trace has a similar structure to that of the full kernel build in Figure 1. In particular, in the sequential phases, many cores are used and these cores rarely reach the higher frequencies, and do so only for short periods. Even the concurrent phases do not always use the higher frequencies available.

Figure 3b then shows the frequency trace of the build of the kernel scheduler with 32 jobs using our first strategy, where on a fork the first child is placed on the same core as the parent. During the initial phase, the build now never uses more than 5 CPUs at a time, and these CPUs run at a higher frequency. This phase's duration drops by a third, and the overall CPU usage is also better, leading to energy savings.

(a) With CFS.

(b) With our first strategy for patching CFS.

(c) With our second strategy for patching CFS.

Figure 3. Execution traces: building the Linux kernel scheduler using 32 jobs.

The build also consumes 20% less energy.¹ The second phase that exhibited frequency issues also benefits from our first strategy in the same proportions.

Figure 3c then shows the frequency trace of the build of the kernel scheduler with 32 jobs using our second strategy, where on a fork all children are placed on the same core as the parent. We observe that CPUs running at low frequencies have almost completely disappeared. Overall, there are also fewer CPUs in use concurrently, which allows the CPUs to run at maximum frequency for long periods of time via Intel® Turbo Boost (e.g., between 1 and 1.5 seconds). As compared to our first strategy (Figure 3b), this strategy slightly further reduces the duration of the mostly sequential phases that ran at low frequency with the default Linux scheduler. Compared to our first strategy, we use even fewer CPUs during the parallel phases, and those used run at higher frequencies.

¹We measure the energy consumption of the CPU packages and of the DRAM using hardware counters.
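The footnote above refers to the hardware energy counters of the CPU packages and the DRAM. One way to obtain such readings on recent Intel machines, shown here purely as an illustration of the measurement method and independent of our tooling, is to sample the RAPL counters exposed by the Linux powercap interface; the sysfs path below assumes the intel_rapl driver is loaded, and counter wraparound is ignored for brevity.

    #include <stdio.h>

    /* Read one RAPL energy counter (in microjoules) from the powercap sysfs
     * interface. Returns -1 if the file is unavailable. */
    static long long read_energy_uj(const char *path) {
        FILE *f = fopen(path, "r");
        if (!f)
            return -1;
        long long uj = -1;
        if (fscanf(f, "%lld", &uj) != 1)
            uj = -1;
        fclose(f);
        return uj;
    }

    int main(void) {
        /* Package 0; DRAM is typically exposed as a sub-zone (e.g. ...:0:0). */
        const char *pkg = "/sys/class/powercap/intel-rapl:0/energy_uj";
        long long before = read_energy_uj(pkg);
        /* ... run the workload to be measured here ... */
        long long after = read_energy_uj(pkg);
        if (before >= 0 && after >= 0)
            printf("package 0 energy: %.3f J\n", (after - before) / 1e6);
        return 0;
    }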

4.4 Interaction with Intel® Turbo Boost

As our experiments have been carried out on Intel® CPUs, we consider the possible interaction between our proposed scheduling strategies and Intel® Turbo Boost. Turbo Boost makes it possible for a limited set of cores to run at frequencies higher than their nominal one, while throttling other cores. Threads running on these throttled cores might slow down the entire application and hurt performance (e.g., barriers).

In practice, the frequency trace in Figure 1 using CFS shows that it is possible to have all cores running at a frequency that, while not being the maximum (the (2.6, 3.0] GHz range shown in red), is still greater than the nominal frequency of the machine (2.1 GHz). Furthermore, Figure 3c shows that our second strategy can reduce the number of cores used in the concurrent case. In practice, as shown by Figure 3c, multiple cores run at nearly the highest frequency even in concurrent phases, thanks to Turbo Boost.

5 Related Work

Profiler design. Various profilers have been implemented to help understand the performance of applications on Linux. The most used is the Perf tool [14, 15, 32]. Perf allows developers to collect hardware- and software-related events and offers tools to visualize scheduling decisions, via the perf sched command. While useful to analyze the behavior of basic scheduler workloads, Perf's performance overhead is high on real-world workloads such as Kbuild. Mollison et al. [22] do regression testing for schedulers. They target real-time schedulers and only study a subset of scheduling-related events that would not be sufficient to understand the fork/wait issue. Lozi et al. [21] analyze work-conservation bugs in Linux. They develop a basic tracing tool that only records runqueue sizes and thread loads, not scheduling events.

Altman et al. [4] profile idle time in applications and analyze dependencies that result in threads waiting for other threads. Various tools have been proposed to understand sources of latency in applications [10, 16, 18]. To the best of our knowledge, none of these tools analyzes the impact of the scheduler on performance.

More generally, testing the impact of kernels on performance is an ongoing research effort. The Linux Kernel Performance project [11] was started in 2005 to find performance regressions, and numerous tools have been proposed to find performance-related bugs in kernels [8, 17, 25, 27]. These works focus on timing issues in the kernel (e.g., functions taking too long to complete) and cannot be used to understand scheduler-related performance issues.

Impact of the scheduler. The influence of general-purpose OS schedulers on performance has been extensively studied. Most previous work focuses on implementing new generic scheduling policies that improve a specific hardware-related performance metric: locality for NUMA machines [9, 13], cache re-use [28, 29], or reducing contention on shared resources [31, 35]. We are not aware of recent research on general-purpose OS schedulers that focuses on hardware metrics that are related to frequency scaling. In 2010, Zhang et al. [34] proposed a simple scheduling policy for multicore architectures that reduced cache interference. They argued that their approach also facilitated frequency scaling, but their focus was only on per-chip frequency as, back then, per-core frequency scaling was not as efficient or commonplace. In the context of mobile devices, some more recent research has focused on energy-aware scheduling [30, 33].

Recently, the interaction between frequency scaling and scheduling has drawn attention in the Linux kernel developer community [12], specifically in relation to turbo frequencies. It has been observed that a short-lived jitter process that starts on an idle core can lead that core to eventually use turbo frequencies. If this core causes the number of cores using turbo frequencies to exceed the number allowed to avoid overheating, other cores will be forced to reduce their frequency, even if the jitter process has already completed and its core is idle. A patch set [26] has been proposed to ensure that explicitly marked jitter tasks are only placed on cores that are already active and are expected to remain active, to avoid increasing the number of active cores. In contrast, the fork/wait issue we have identified applies to all forms of frequency scaling, whether or not turbo frequencies are used.

Another approach to enhance scheduler performance consists in strongly coupling the scheduler and applications. Scheduler activations [5] map user-level threads to kernel-level threads in order to tweak the kernel scheduler's decisions from userspace. Rinnegan [23] provides kernel support to help applications make informed thread placement decisions on heterogeneous architectures. These approaches usually require changes in applications or in the kernel API.

6 Conclusion

In this paper, we have identified a performance issue caused by the fork/wait model on multicore architectures due to frequency scaling. Thanks to their flexibility and low overhead, our SchedLog and SchedDisplay tools were crucial to finding and understanding the performance issue. We propose and evaluate two scheduling strategies targeting the behaviors we observe in our case study, Kbuild. The two solutions we propose can be further improved. In future work, as a first step, we will try to use a strategy to locate high-frequency cores. As a second step, we will target more architectures and applications, and design a general-purpose, frequency-aware scheduling algorithm for multicore architectures. We are making SchedLog and SchedDisplay publicly available [2, 3].

Acknowledgements

This work is supported in part by Oracle donation CR 1930. We would also like to thank the anonymous reviewers for their feedback.

References

[1] Datashader. http://datashader.org/.
[2] SchedDisplay. https://gitlab.inria.fr/gmuller/scheddisplay.
[3] SchedLog. https://gitlab.inria.fr/gmuller/schedlog.

[4] Altman, E., Arnold, M., Fink, S., and Mitchell, N. Performance analysis of idle programs. In Proceedings of the ACM International Conference on Object Oriented Programming Systems Languages and Applications (New York, NY, USA, 2010), OOPSLA '10, ACM, pp. 739–753.
[5] Anderson, T. E., Bershad, B. N., Lazowska, E. D., and Levy, H. M. Scheduler activations: Effective kernel support for the user-level management of parallelism. ACM Transactions on Computer Systems (TOCS) 10, 1 (1992), 53–79.
[6] Bird, S., Canavan, L., Hulsey, C., Rudiger, M. P. P., and de Ven, B. V. Welcome to Bokeh, 2013. https://bokeh.pydata.org/en/latest/.
[7] Bouron, J., Chevalley, S., Lepers, B., Zwaenepoel, W., Gouicem, R., Lawall, J., Muller, G., and Sopena, J. The battle of the schedulers: FreeBSD ULE vs. Linux CFS. In 2018 USENIX Annual Technical Conference, USENIX ATC 2018, Boston, MA, USA, July 11-13, 2018 (2018), pp. 85–96.
[8] Boyd-Wickizer, S., Clements, A. T., Mao, Y., Pesterev, A., Kaashoek, M. F., Morris, R., and Zeldovich, N. An analysis of Linux scalability to many cores. In OSDI (2010).
[9] Brecht, T. On the importance of parallel application placement in NUMA multiprocessors. In USENIX SEDMS (1993).
[10] Chanda, A., Cox, A. L., and Zwaenepoel, W. Whodunit: Transactional profiling for multi-tier applications. In EuroSys (2007), pp. 17–30.
[11] Chen, T., Ananiev, L. I., and Tikhonov, A. V. Keeping kernel performance from regressions. In Linux Symposium (2007), vol. 1, pp. 93–102.
[12] Corbet, J. TurboSched: the return of small-task packing. Linux Weekly News (July 1, 2019). https://lwn.net/Articles/792471/.
[13] Dashti, M., Fedorova, A., Funston, J., Gaud, F., Lachaize, R., Lepers, B., Quema, V., and Roth, M. Traffic management: a holistic approach to memory placement on NUMA systems. In ASPLOS (2013), pp. 381–394.
[14] de Melo, A. C. Performance counters on Linux. In Linux Plumbers Conference (2009), vol. 118.
[15] de Melo, A. C. The new Linux 'perf' tools. In Slides from Linux Kongress (2010), vol. 18.
[16] Fürlinger, K., and Gerndt, M. ompP: A profiling tool for OpenMP. In International Workshop on OpenMP (2005), Springer, pp. 15–23.
[17] Harji, A. S., Buhr, P. A., and Brecht, T. Our troubles with Linux and why you should care. In Proceedings of the Second Asia-Pacific Workshop on Systems (2011), APSys '11, pp. 2:1–2:5.
[18] Joukov, N., Traeger, A., Iyer, R., Wright, C. P., and Zadok, E. Operating system profiling via latency analysis. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation (2006), USENIX Association, pp. 89–102.
[19] Lepers, B., Quéma, V., and Fedorova, A. Thread and memory placement on NUMA systems: asymmetry matters. In USENIX ATC (2015), pp. 277–289.
[20] Lozi, J.-P., David, F., Thomas, G., Lawall, J., and Muller, G. Remote core locking: Migrating critical-section execution to improve the performance of multithreaded applications. In 2012 USENIX Annual Technical Conference (USENIX ATC 12) (Boston, MA, 2012), USENIX, pp. 65–76.
[21] Lozi, J.-P., Lepers, B., Funston, J., Gaud, F., Quéma, V., and Fedorova, A. The Linux scheduler: a decade of wasted cores. In EuroSys (2016), ACM, pp. 1:1–1:16.
[22] Mollison, M. S., Brandenburg, B., and Anderson, J. H. Towards unit testing real-time schedulers in LITMUSRT. In Proceedings of the 5th Workshop on Operating Systems Platforms for Embedded Real-Time Applications (2009), OSPERT.
[23] Panneerselvam, S., and Swift, M. Rinnegan: Efficient resource use in heterogeneous architectures. In Proceedings of the 2016 International Conference on Parallel Architectures and Compilation (2016), ACM, pp. 373–386.
[24] perf: Linux profiling with performance counters. https://perf.wiki.kernel.org.
[25] Perl, S. E., and Weihl, W. E. Performance assertion checking. In SOSP (1993), pp. 134–145.
[26] Shah, P. TurboSched: A scheduler for sustaining turbo frequencies for longer durations, June 25, 2019. https://lwn.net/ml/linux-kernel/[email protected]/.
[27] Shen, K., Zhong, M., and Li, C. I/O system performance debugging using model-driven anomaly characterization. In FAST (2005), pp. 309–322.
[28] Tam, D., Azimi, R., and Stumm, M. Thread clustering: sharing-aware scheduling on SMP-CMP-SMT multiprocessors. In EuroSys (2007), pp. 47–58.
[29] Tang, L., Mars, J., Zhang, X., Hagmann, R., Hundt, R., and Tune, E. Optimizing Google's warehouse scale computers: The NUMA experience. In High Performance Computer Architecture (HPCA 2013) (Feb. 2013), pp. 188–197.
[30] Tseng, P.-H., Hsiu, P.-C., Pan, C.-C., and Kuo, T.-W. User-centric energy-efficient scheduling on multi-core mobile devices. In Proceedings of the 51st Annual Design Automation Conference (2014), ACM, pp. 1–6.
[31] Verma, A., Pedrosa, L., Korupolu, M., Oppenheimer, D., Tune, E., and Wilkes, J. Large-scale cluster management at Google with Borg. In Proceedings of the Tenth European Conference on Computer Systems (2015), ACM, p. 18.
[32] Weaver, V. M. Linux perf_event features and overhead. In The 2nd International Workshop on Performance Analysis of Workload Optimized Systems, FastPath (2013), vol. 13.
[33] Yu, K., Han, D., Youn, C., Hwang, S., and Lee, J. Power-aware task scheduling for big.LITTLE mobile processor. In 2013 International SoC Design Conference (ISOCC) (2013), IEEE, pp. 208–212.
[34] Zhang, X., Dwarkadas, S., and Zhong, R. An evaluation of per-chip nonuniform frequency scaling on multicores. In USENIX ATC (Berkeley, CA, USA, 2010).
[35] Zhuravlev, S., Saez, J. C., Blagodurov, S., Fedorova, A., and Prieto, M. Survey of scheduling techniques for addressing shared resources in multicore processors. ACM Computing Surveys (CSUR) 45, 1 (2012), 4.